# PySpark ALS Model Applications

Most of the notes here are from the DataCamp course [Recommendation Engines in PySpark](https://campus.datacamp.com/courses/recommendation-engines-in-pyspark).

## Recommendation System with PySpark

The model training process is similar to training other PySpark models, and the steps can be wrapped in a pipeline.
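
A minimal pipeline sketch (assuming a raw dataframe `ratings_raw` with string `User`/`Movie` columns and a numeric `rating` column; these names are illustrative):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS

# Map string user/movie identifiers to the numeric ids ALS expects
indexers = [
    StringIndexer(inputCol="User", outputCol="userId"),
    StringIndexer(inputCol="Movie", outputCol="movieId"),
]
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")

pipeline = Pipeline(stages=indexers + [als])
model = pipeline.fit(ratings_raw)  # ratings_raw is a hypothetical input dataframe
```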

### Toy 1

Prepare the data:

```python
# Import monotonically_increasing_id and show R
from pyspark.sql.functions import monotonically_increasing_id
R.show()

# Use the to_long() function to convert the dataframe to the "long" format
ratings = to_long(R)
ratings.show()

# Get unique users and repartition to 1 partition
users = ratings.select("User").distinct().coalesce(1)

# Create a new column of unique integers called "userId" in the users dataframe
users = users.withColumn("userId", monotonically_increasing_id()).persist()
users.show()

# Extract the distinct movie ids
movies = ratings.select("Movie").distinct()

# Repartition the data to have only one partition
movies = movies.coalesce(1)

# Create a new column of movieId integers
movies = movies.withColumn("movieId", monotonically_increasing_id()).persist()

# Join the ratings, users and movies dataframes
movie_ratings = ratings.join(users, "User", "left").join(movies, "Movie", "left")
movie_ratings.show()
```
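
The `to_long()` helper used above comes from the course materials (see the wide-to-long link under Other Resources). A minimal sketch of such a function, assuming `R` is a wide matrix with one `User` column and one rating column per movie:

```python
from pyspark.sql import functions as F

def to_long(df, by="User"):
    """Melt a wide ratings matrix (one column per movie) into long format."""
    movie_cols = [c for c in df.columns if c != by]
    # Build an array of (Movie, Rating) structs, then explode it into rows
    kvs = F.explode(F.array(*[
        F.struct(F.lit(c).alias("Movie"), F.col(c).alias("Rating"))
        for c in movie_cols
    ])).alias("kv")
    return (df.select(by, kvs)
              .select(by, "kv.Movie", "kv.Rating")
              .where(F.col("Rating").isNotNull()))
```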

Suppose we have a PySpark dataframe called ratings:

```python
In [1]: ratings.show(5)

...

from pyspark.sql.functions import col

# Filter to ratings from users with userId below 100
ratings.filter(col("userId") < 100).show()

# Group data by userId, count ratings
ratings.groupBy("userId").count().show()

# Use .printSchema() to see the datatypes of the ratings dataset
ratings.printSchema()

# Tell Spark to convert the columns to the proper data types
ratings = ratings.select(ratings.userId.cast("integer"),
                         ratings.movieId.cast("integer"),
                         ratings.rating.cast("double"))

# Call .printSchema() again to confirm the columns are now in the correct format
ratings.printSchema()
```
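
After the cast, the second `printSchema()` call should report the converted types, roughly:

```python
ratings.printSchema()
# root
#  |-- userId: integer (nullable = true)
#  |-- movieId: integer (nullable = true)
#  |-- rating: double (nullable = true)
```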

Build the model:

```python
# Import the required functions
from pyspark.sql.functions import col
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Create test and train set
(train, test) = ratings.randomSplit([0.80, 0.20], seed=1234)

# Create ALS model
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          nonnegative=True, implicitPrefs=False)

# Confirm that a model called "als" was created
type(als)

# Add hyperparameters and their respective values to param_grid
param_grid = ParamGridBuilder() \
    .addGrid(als.rank, [10, 50, 100, 150]) \
    .addGrid(als.maxIter, [5, 50, 100, 200]) \
    .addGrid(als.regParam, [.01, .05, .1, .15]) \
    .build()

# Define evaluator as RMSE and print number of models to be tested
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
print("Num models to be tested: ", len(param_grid))

# Build cross validation using CrossValidator
cv = CrossValidator(estimator=als, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)

# Confirm cv was built
print(cv)

# Fit the cross-validator to the training data and extract the best model
model = cv.fit(train)
best_model = model.bestModel

# Print best_model
print(type(best_model))

# Extract the ALS model parameters
print("**Best Model**")

# Print "Rank"
print("  Rank:", best_model.rank)

# Print "MaxIter" (tuned params live on the parent estimator of the best model)
print("  MaxIter:", best_model._java_obj.parent().getMaxIter())

# Print "RegParam"
print("  RegParam:", best_model._java_obj.parent().getRegParam())

# Generate predictions on the test set and view them
test_predictions = best_model.transform(test)
test_predictions.show()

# Calculate and print the RMSE of test_predictions
RMSE = evaluator.evaluate(test_predictions)
print(RMSE)

# `original_ratings` (ratings joined with movie titles) and `recommendations`
# (model predictions for unseen user/movie pairs) are assumed to be built earlier

# Look at user 60's ratings
print("User 60's Ratings:")
original_ratings.filter(col("userId") == 60).sort("rating", ascending=False).show()

# Look at the movies recommended to user 60
print("User 60's Recommendations:")
recommendations.filter(col("userId") == 60).show()

# Look at user 63's ratings
print("User 63's Ratings:")
original_ratings.filter(col("userId") == 63).sort("rating", ascending=False).show()

# Look at the movies recommended to user 63
print("User 63's Recommendations:")
recommendations.filter(col("userId") == 63).show()
```
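
The `recommendations` dataframe referenced above is assumed to hold per-user predictions. One way to build something similar with Spark's built-in API is `recommendForAllUsers`, which returns the top-N scored movies per user (an illustrative sketch, not necessarily how the course builds it):

```python
from pyspark.sql.functions import explode

# Top-10 recommendations per user: (userId, [(movieId, score), ...])
user_recs = best_model.recommendForAllUsers(10)

# Flatten the array of (movieId, rating) structs into one row per recommendation
recommendations = (user_recs
                   .select("userId", explode("recommendations").alias("rec"))
                   .select("userId", "rec.movieId", "rec.rating"))
recommendations.show(5)
```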

### Implicit Ratings Model
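
Ratings are often implicit (clicks, plays, purchases) rather than explicit star ratings. ALS supports the Hu, Koren, Volinsky implicit-feedback formulation via `implicitPrefs=True`. A minimal sketch, assuming a hypothetical `plays` dataframe with `userId`, `songId`, and `num_plays` columns:

```python
from pyspark.ml.recommendation import ALS

# For implicit data, ratingCol holds interaction strength (e.g. play counts)
# and alpha scales the confidence placed in observed interactions
als_implicit = ALS(userCol="userId", itemCol="songId", ratingCol="num_plays",
                   implicitPrefs=True, alpha=40.0, rank=10,
                   nonnegative=True, coldStartStrategy="drop")
implicit_model = als_implicit.fit(plays)  # `plays` is a hypothetical dataframe

# RMSE is not meaningful for implicit feedback; the course evaluates with
# ROEM (Rank Ordering Error Metric) instead -- see the links below
```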

### Other Resources

- [McKinsey & Company: "How Retailers Can Keep Up With Consumers"](https://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-with-consumers)
- [ALS Data Preparation: Wide to Long Function](https://github.com/jamenlong/ALS_expected_percent_rank_cv/blob/master/wide_to_long_function.py)
- [Hu, Koren, Volinsky: "Collaborative Filtering for Implicit Feedback Datasets"](http://yifanhu.net/PUB/cf.pdf)
- [GitHub Repo: Cross Validation With Implicit Ratings in PySpark](https://github.com/jamenlong/ALS_expected_percent_rank_cv/blob/master/ROEM_cv.py)
- [Pan, Zhou, Cao, Liu, Lukose, Scholz, Yang: "One Class Collaborative Filtering"](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.306.4684&rep=rep1&type=pdf)