Commit e2c31bc

zoupeicheng committed
basic PySparkALS done
1 parent a2fbc6f commit e2c31bc

1 file changed

+195
-0
lines changed

PySparkALS.md

Lines changed: 195 additions & 0 deletions
@@ -257,16 +257,211 @@ recommendations.filter(col("userId") == 63).show()

### Implicit Ratings Model

Now consider a dataset with no explicit ratings: the Million Songs dataset (`msd` for short), where the only signal is the number of times a user played a song.
Look at the dataset:
```python
# Look at the data
msd.show()

# Count the number of distinct userIds
user_count = msd.select("userId").distinct().count()
print("Number of users: ", user_count)

# Count the number of distinct songIds
song_count = msd.select("songId").distinct().count()
print("Number of songs: ", song_count)
```

```bash
+------+------+---------+
|userId|songId|num_plays|
+------+------+---------+
|   148|   148|        0|
|   243|   496|        0|
|    31|   471|        0|
|   137|   463|        0|
|   251|   623|        0|
|    85|   392|        0|
|    65|   540|        0|
|   255|   243|        0|
|    53|   516|        0|
+------+------+---------+
only showing top 10 rows

Number of users: 321
Number of songs: 729
```

```python
# col, min, and avg come from pyspark.sql.functions
# (this min shadows Python's built-in min)
from pyspark.sql.functions import avg, col, min

# Min num implicit ratings for a song
print("Minimum implicit ratings for a song: ")
msd.filter(col("num_plays") > 0).groupBy("songId").count().select(min("count")).show()

# Avg num implicit ratings per song
print("Average implicit ratings per song: ")
msd.filter(col("num_plays") > 0).groupBy("songId").count().select(avg("count")).show()

# Min num implicit ratings from a user
print("Minimum implicit ratings from a user: ")
msd.filter(col("num_plays") > 0).groupBy("userId").count().select(min("count")).show()

# Avg num implicit ratings per user
print("Average implicit ratings per user: ")
msd.filter(col("num_plays") > 0).groupBy("userId").count().select(avg("count")).show()
```

```bash
<script.py> output:
    Minimum implicit ratings for a song:
    +----------+
    |min(count)|
    +----------+
    |         3|
    +----------+

    Average implicit ratings per song:
    +------------------+
    |        avg(count)|
    +------------------+
    |35.251063829787235|
    +------------------+

    Minimum implicit ratings from a user:
    +----------+
    |min(count)|
    +----------+
    |        21|
    +----------+

    Average implicit ratings per user:
    +-----------------+
    |       avg(count)|
    +-----------------+
    |77.42056074766356|
    +-----------------+
```

Expand the data to every user/product pair and fill the missing values with 0:
```python
# Z is a long-format DataFrame keyed by userId and productId
# View the data
Z.show()

# Extract distinct userIds and productIds
users = Z.select("userId").distinct()
products = Z.select("productId").distinct()

# Cross join users and products to get every possible pair
cj = users.crossJoin(products)

# Left join cj with Z so unobserved pairs come through as nulls, then fill them with 0
Z_expanded = cj.join(Z, ["userId", "productId"], "left").fillna(0)

# View Z_expanded
Z_expanded.show()
```
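
To make the expansion concrete, here is a minimal plain-Python sketch of the same cross-join-then-fill idea, with no Spark required. The tiny `ratings` dictionary and its values are made up for illustration and are not taken from the dataset.

```python
from itertools import product

# Hypothetical long-format data: (userId, productId) -> rating
ratings = {(1, "a"): 3, (1, "b"): 5, (2, "a"): 4}

# Distinct users and products, like select(...).distinct()
users = sorted({u for u, _ in ratings})
products = sorted({p for _, p in ratings})

# Every user/product pair, with 0 where no rating was observed
expanded = {(u, p): ratings.get((u, p), 0) for u, p in product(users, products)}

for pair, value in expanded.items():
    print(pair, value)
```

The pair `(2, "b")` never appears in `ratings`, so it is the only entry that gets filled with 0.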

Tune hyperparameters:

```python
from pyspark.ml.recommendation import ALS

ranks = [10, 20, 30, 40]
maxIters = [10, 20, 30, 40]
regParams = [.05, .1, .15]
alphas = [20, 40, 60, 80]

# Empty list to hold one ALS estimator per hyperparameter combination
model_list = []

# Create and store an ALS model for every combination
for r in ranks:
    for mi in maxIters:
        for rp in regParams:
            for a in alphas:
                model_list.append(ALS(userCol="userId", itemCol="songId", ratingCol="num_plays",
                                      rank=r, maxIter=mi, regParam=rp, alpha=a,
                                      coldStartStrategy="drop", nonnegative=True, implicitPrefs=True))

# Print the model list, and the length of model_list
print(model_list, "Length of model_list: ", len(model_list))

# Validate that every combination was created
print(len(model_list) == (len(ranks) * len(maxIters) * len(regParams) * len(alphas)))
```
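
The four nested loops enumerate the Cartesian product of the hyperparameter lists. As a sketch in plain Python (no Spark), `itertools.product` yields the same combinations in the same order, which makes it easy to check the grid size and to map a list index back to its settings. The name `param_grid` is only illustrative.

```python
from itertools import product

ranks = [10, 20, 30, 40]
maxIters = [10, 20, 30, 40]
regParams = [.05, .1, .15]
alphas = [20, 40, 60, 80]

# One dict per combination, in the same order as the nested for loops
param_grid = [
    {"rank": r, "maxIter": mi, "regParam": rp, "alpha": a}
    for r, mi, rp, a in product(ranks, maxIters, regParams, alphas)
]

print(len(param_grid))  # 4 * 4 * 3 * 4 = 192
print(param_grid[0])    # settings of the first model in the list
```

Because the order matches the loops, `param_grid[k]` describes the settings of `model_list[k]`, so the grid size also shows how many models any later evaluation loop has to fit.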

Cross-validation:
```python
# Split the data into training and test sets
(training, test) = msd.randomSplit([0.8, 0.2])

# Build 5 folds within the training set
train1, train2, train3, train4, train5 = training.randomSplit([0.2, 0.2, 0.2, 0.2, 0.2], seed=1)
fold1 = train2.union(train3).union(train4).union(train5)
fold2 = train3.union(train4).union(train5).union(train1)
fold3 = train4.union(train5).union(train1).union(train2)
fold4 = train5.union(train1).union(train2).union(train3)
fold5 = train1.union(train2).union(train3).union(train4)

# Each pair is (data to fit on, held-out data to evaluate on)
foldlist = [(fold1, train1), (fold2, train2), (fold3, train3), (fold4, train4), (fold5, train5)]

# Empty list to fill with ROEMs from each model
ROEMS = []

# ROEM() is not part of PySpark; it must already be defined before this runs

# Loop through all models and all folds
for model in model_list:
    for ft_pair in foldlist:

        # Fit model to fold within training data
        fitted_model = model.fit(ft_pair[0])

        # Generate predictions using fitted_model on respective CV test data
        predictions = fitted_model.transform(ft_pair[1])

        # Generate and print a ROEM metric on CV test data
        r = ROEM(predictions)
        print("ROEM: ", r)

    # Fit model to all of training data and generate preds for test data
    v_fitted_model = model.fit(training)
    v_predictions = v_fitted_model.transform(test)
    v_ROEM = ROEM(v_predictions)

    # Add validation ROEM to ROEM list (one entry per model)
    ROEMS.append(v_ROEM)
    print("Validation ROEM: ", v_ROEM)

# Import numpy
import numpy

# Find the index of the smallest ROEM
i = numpy.argmin(ROEMS)
print("Index of smallest ROEM:", i)

# Find ith element of ROEMS
print("Smallest ROEM: ", ROEMS[i])

# Extract the best_model (index i, rather than a hard-coded index)
best_model = model_list[i]

# Extract the Rank
print("Rank: ", best_model.getRank())

# Extract the MaxIter value
print("MaxIter: ", best_model.getMaxIter())

# Extract the RegParam value
print("RegParam: ", best_model.getRegParam())

# Extract the Alpha value
print("Alpha: ", best_model.getAlpha())
```
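
The `ROEM` helper called above is not defined in this section. As a rough plain-Python sketch of the idea, assuming ROEM is the expected percentile ranking from Hu, Koren, and Volinsky (the play-count-weighted average of each song's percentile rank within a user's predictions, where lower is better), something like the following computes it on a small list of tuples. The function name `roem`, the tuple layout, and the sample rows are assumptions for illustration, not the author's implementation.

```python
from collections import defaultdict

def roem(rows):
    """rows: iterable of (userId, songId, num_plays, prediction) tuples."""
    by_user = defaultdict(list)
    for user, song, plays, pred in rows:
        by_user[user].append((pred, plays))

    numerator = 0.0
    denominator = 0.0
    for items in by_user.values():
        # Highest prediction first; percentile rank runs from 0.0 (top) to 1.0 (bottom)
        items.sort(key=lambda x: x[0], reverse=True)
        n = len(items)
        for position, (_, plays) in enumerate(items):
            pct_rank = position / (n - 1) if n > 1 else 0.0
            numerator += plays * pct_rank
            denominator += plays
    return numerator / denominator if denominator else float("nan")

# Made-up example: the played song is predicted highest (good) vs. lowest (bad)
good = [(1, "s1", 10, 0.9), (1, "s2", 0, 0.5), (1, "s3", 0, 0.1)]
bad = [(1, "s1", 0, 0.9), (1, "s2", 0, 0.5), (1, "s3", 10, 0.1)]
print(roem(good), roem(bad))
```

A value near 0 means played songs sit at the top of each user's predicted list; a value near 0.5 is roughly what a random ordering would give.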

Binary ratings can be modeled with the implicit-ratings approach as well. In addition, we could adjust the weights given to users or movies when computing ROEM.
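
One way to see why binary data fits the implicit model: in the Hu, Koren, and Volinsky formulation, each raw observation `r` is split into a binary preference `p` (1 if `r > 0`, else 0) and a confidence `c = 1 + alpha * r`. The sketch below only illustrates that formula in plain Python; the helper name `preference_and_confidence` and the sample values are assumptions, not code from the tutorial.

```python
def preference_and_confidence(r, alpha):
    """Return (p, c) for a raw implicit observation r, with p = [r > 0] and c = 1 + alpha * r."""
    p = 1 if r > 0 else 0
    c = 1 + alpha * r
    return p, c

alpha = 40  # one of the alpha values from the tuning grid
for r in [0, 1, 5]:
    print(r, preference_and_confidence(r, alpha))
```

With binary ratings, `r` is only ever 0 or 1, so every observed pair gets the same confidence, `1 + alpha`, and unobserved pairs keep the baseline confidence of 1.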

### Other Resources

[Collaborative Filtering for Implicit Feedback Datasets by Hu, Koren, Volinsky](http://yifanhu.net/PUB/cf.pdf)

[McKinsey&Company: "How Retailers Can Keep Up With Consumers"](https://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-with-consumers)

[ALS Data Preparation: Wide to Long Function](https://github.com/jamenlong/ALS_expected_percent_rank_cv/blob/master/wide_to_long_function.py)
