
Commit a2fbc6f ("updated")
Author: zoupeicheng
1 parent: a0d3692

3 files changed, +151 -1 lines changed

DeepLearningBasic.md

Lines changed: 2 additions & 0 deletions
@@ -1,6 +1,8 @@
# Deep Learning

Look up Coursera's courses on deep learning!

A short Keras example of a DNN:

```python
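# The body of this snippet is truncated in the diff; a minimal sketch of such
# a DNN (layer sizes, input width and the binary target are assumptions):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(64, activation="relu", input_shape=(20,)),  # 20 input features (assumed)
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid"),                   # binary classification head
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```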

PySparkALS.md

Lines changed: 147 additions & 1 deletion
@@ -1,9 +1,47 @@
# Pyspark ALS Model Applications

Many of the notes here are from a course on [DataCamp](https://campus.datacamp.com/courses/recommendation-engines-in-pyspark).

## Recommendation System with PySpark

The model training process is similar to that of other PySpark models, and the steps can be wired into a pipeline; a minimal sketch follows.
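As a rough illustration (not from the course; the `User`/`Movie`/`Rating` column names and the use of `StringIndexer` to build numeric ids are assumptions), an ALS estimator can sit at the end of a `pyspark.ml` pipeline:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS

# Index raw user and movie identifiers into the numeric ids ALS expects.
user_indexer = StringIndexer(inputCol="User", outputCol="userIdx")
movie_indexer = StringIndexer(inputCol="Movie", outputCol="movieIdx")

als = ALS(userCol="userIdx", itemCol="movieIdx", ratingCol="Rating",
          nonnegative=True, implicitPrefs=False)

# Fitting the pipeline runs both indexers, then trains the ALS model.
pipeline = Pipeline(stages=[user_indexer, movie_indexer, als])
model = pipeline.fit(ratings)  # `ratings` is a long-format dataframe (assumed)
```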
### Toy 1

Prepare the data:

```python
# Import monotonically_increasing_id and show the wide ratings matrix R
from pyspark.sql.functions import monotonically_increasing_id
R.show()

# Use the to_long() function to convert the dataframe to the "long" format.
ratings = to_long(R)
ratings.show()

# Get unique users and repartition to 1 partition
users = ratings.select("User").distinct().coalesce(1)

# Create a new column of unique integers called "userId" in the users dataframe.
users = users.withColumn("userId", monotonically_increasing_id()).persist()
users.show()

# Extract the distinct movie ids
movies = ratings.select("Movie").distinct()

# Repartition the data to have only one partition.
movies = movies.coalesce(1)

# Create a new column of movieId integers.
movies = movies.withColumn("movieId", monotonically_increasing_id()).persist()

# Join the ratings, users and movies dataframes
movie_ratings = ratings.join(users, "User", "left").join(movies, "Movie", "left")
movie_ratings.show()
```
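The `to_long()` helper is not part of PySpark; here is a minimal sketch under the column-name assumptions above (see also the "Wide to Long Function" link under Other Resources):

```python
from pyspark.sql.functions import array, col, explode, lit, struct

def to_long(df, by=["User"]):
    """Melt a wide ratings matrix (one column per movie) into long format."""
    movie_cols = [c for c in df.columns if c not in by]
    # Build an array of (Movie, Rating) structs, one per movie column,
    # then explode it so each rating becomes its own row.
    kvs = explode(array(
        [struct(lit(c).alias("Movie"), col(c).alias("Rating")) for c in movie_cols]
    )).alias("kvs")
    return (df.select(by + [kvs])
              .select(by + ["kvs.Movie", "kvs.Rating"])
              .filter("Rating IS NOT NULL"))
```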
Suppose we have a PySpark dataframe called `ratings`:

```bash
In [1]: ratings.show(5)
```
@@ -127,6 +165,114 @@ ratings.filter(col("userId") < 100).show()
```python
# Group data by userId, count ratings
ratings.groupBy("userId").count().show()

# Use .printSchema() to see the datatypes of the ratings dataset
ratings.printSchema()

# Tell Spark to convert the columns to the proper data types
ratings = ratings.select(ratings.userId.cast("integer"),
                         ratings.movieId.cast("integer"),
                         ratings.rating.cast("double"))

# Call .printSchema() again to confirm the columns are now in the correct format
ratings.printSchema()
```

Build the model:

```python
# Import the required functions
from pyspark.sql.functions import col
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Create test and train set
(train, test) = ratings.randomSplit([0.80, 0.20], seed=1234)

# Create ALS model
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          nonnegative=True, implicitPrefs=False)

# Confirm that a model called "als" was created
type(als)

# Add hyperparameters and their respective values to param_grid
param_grid = ParamGridBuilder() \
    .addGrid(als.rank, [10, 50, 100, 150]) \
    .addGrid(als.maxIter, [5, 50, 100, 200]) \
    .addGrid(als.regParam, [.01, .05, .1, .15]) \
    .build()

# Define evaluator as RMSE and print the number of models to be tested
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
print("Num models to be tested: ", len(param_grid))  # 4 * 4 * 4 = 64

# Build cross validation using CrossValidator
cv = CrossValidator(estimator=als, estimatorParamMaps=param_grid,
                    evaluator=evaluator, numFolds=5)

# Confirm cv was built
print(cv)

# Fit the cross-validator on the training set and keep the best model
model = cv.fit(train)
best_model = model.bestModel

# Print best_model
print(type(best_model))

# Extract the ALS model parameters
print("**Best Model**")

# Print "Rank" (ALSModel exposes rank directly)
print("  Rank:", best_model.rank)

# Print "MaxIter" and "RegParam" (read from the parent estimator)
print("  MaxIter:", best_model._java_obj.parent().getMaxIter())
print("  RegParam:", best_model._java_obj.parent().getRegParam())

# Generate and view the predictions on the test set
test_predictions = best_model.transform(test)
test_predictions.show()

# Calculate and print the RMSE of test_predictions
RMSE = evaluator.evaluate(test_predictions)
print(RMSE)

# Generate top-10 recommendations for every user
recommendations = best_model.recommendForAllUsers(10)

# Look at user 60's ratings (original_ratings holds the unsplit data)
print("User 60's Ratings:")
original_ratings.filter(col("userId") == 60).sort("rating", ascending=False).show()

# Look at the movies recommended to user 60
print("User 60's Recommendations:")
recommendations.filter(col("userId") == 60).show()

# Look at user 63's ratings
print("User 63's Ratings:")
original_ratings.filter(col("userId") == 63).sort("rating", ascending=False).show()

# Look at the movies recommended to user 63
print("User 63's Recommendations:")
recommendations.filter(col("userId") == 63).show()
```
### Implicit Ratings Model
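With implicit feedback (clicks, plays, purchases) there are no explicit ratings; following Hu, Koren & Volinsky (linked below), ALS is fit with `implicitPrefs=True` and the interaction counts act as confidence weights scaled by `alpha`. A minimal sketch, assuming a `plays` dataframe with `userId`, `songId` and `num_plays` columns:

```python
from pyspark.ml.recommendation import ALS

(train, test) = plays.randomSplit([0.8, 0.2], seed=1234)

# Implicit-feedback ALS: num_plays is a confidence signal, not a rating.
als = ALS(userCol="userId", itemCol="songId", ratingCol="num_plays",
          implicitPrefs=True, alpha=40.0, nonnegative=True,
          coldStartStrategy="drop")
model = als.fit(train)

# RMSE is not meaningful here; a rank-based metric such as ROEM
# (see ROEM_cv.py under Other Resources) is used to evaluate instead.
predictions = model.transform(test)
```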
### Other Resources

* [McKinsey & Company: "How Retailers Can Keep Up With Consumers"](https://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-with-consumers)
* [ALS Data Preparation: Wide to Long Function](https://github.com/jamenlong/ALS_expected_percent_rank_cv/blob/master/wide_to_long_function.py)
* [Hu, Koren, Volinsky: "Collaborative Filtering for Implicit Feedback Datasets"](http://yifanhu.net/PUB/cf.pdf)
* [GitHub Repo: Cross Validation With Implicit Ratings in Pyspark](https://github.com/jamenlong/ALS_expected_percent_rank_cv/blob/master/ROEM_cv.py)
* [Pan, Zhou, Cao, Liu, Lukose, Scholz, Yang: "One Class Collaborative Filtering"](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.306.4684&rep=rep1&type=pdf)

README.md

Lines changed: 2 additions & 0 deletions
@@ -68,6 +68,8 @@ There are many great sources to learn Data Science, and here are some advice to
10. [Network Analysis](Network-Analysis.md)
11. [PySpark](PySpark.md)
    * [Recommendation System using ALS](PySparkALS.md)
12. [DeploymentTools](DeploymentTools.md)
13. Others coming ...
    * Efficient Coding in Python
