# PySpark ALS Model Applications

Most of the notes here are from the DataCamp course [Recommendation Engines in PySpark](https://campus.datacamp.com/courses/recommendation-engines-in-pyspark).

## Recommendation System with PySpark

The model training process is similar to training other PySpark models, and the steps can be wrapped in a pipeline.
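
A minimal pipeline sketch (assuming a raw dataframe `ratings_raw` with string `User`/`Movie` columns and a numeric `rating` column; these names are illustrative):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS

# Map string user/movie identifiers to the numeric ids ALS expects
indexers = [
    StringIndexer(inputCol="User", outputCol="userId"),
    StringIndexer(inputCol="Movie", outputCol="movieId"),
]
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")

pipeline = Pipeline(stages=indexers + [als])
model = pipeline.fit(ratings_raw)  # ratings_raw is a hypothetical input dataframe
```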

### Toy 1

Prepare the data:

```python
# Import monotonically_increasing_id and show R
from pyspark.sql.functions import monotonically_increasing_id
R.show()

# Use the to_long() function to convert the dataframe to the "long" format
ratings = to_long(R)
ratings.show()

# Get unique users and repartition to 1 partition
users = ratings.select("User").distinct().coalesce(1)

# Create a new column of unique integers called "userId" in the users dataframe
users = users.withColumn("userId", monotonically_increasing_id()).persist()
users.show()

# Extract the distinct movie ids
movies = ratings.select("Movie").distinct()

# Repartition the data to have only one partition
movies = movies.coalesce(1)

# Create a new column of movieId integers
movies = movies.withColumn("movieId", monotonically_increasing_id()).persist()

# Join the ratings, users and movies dataframes
movie_ratings = ratings.join(users, "User", "left").join(movies, "Movie", "left")
movie_ratings.show()
```
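
The `to_long()` helper used above comes from the course materials (see the wide-to-long link under Other Resources). A minimal sketch of such a function, assuming `R` is a wide matrix with one `User` column and one rating column per movie:

```python
from pyspark.sql import functions as F

def to_long(df, by="User"):
    """Melt a wide ratings matrix (one column per movie) into long format."""
    movie_cols = [c for c in df.columns if c != by]
    # Build an array of (Movie, Rating) structs, then explode it into rows
    kvs = F.explode(F.array(*[
        F.struct(F.lit(c).alias("Movie"), F.col(c).alias("Rating"))
        for c in movie_cols
    ])).alias("kv")
    return (df.select(by, kvs)
              .select(by, "kv.Movie", "kv.Rating")
              .where(F.col("Rating").isNotNull()))
```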

Suppose we have a PySpark dataframe called ratings:

```python
In [1]: ratings.show(5)

...

from pyspark.sql.functions import col

# Filter to ratings from users with userId below 100
ratings.filter(col("userId") < 100).show()

# Group data by userId, count ratings
ratings.groupBy("userId").count().show()

# Use .printSchema() to see the datatypes of the ratings dataset
ratings.printSchema()

# Tell Spark to convert the columns to the proper data types
ratings = ratings.select(ratings.userId.cast("integer"),
                         ratings.movieId.cast("integer"),
                         ratings.rating.cast("double"))

# Call .printSchema() again to confirm the columns are now in the correct format
ratings.printSchema()
```
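
After the cast, the second `printSchema()` call should report the converted types, roughly:

```python
ratings.printSchema()
# root
#  |-- userId: integer (nullable = true)
#  |-- movieId: integer (nullable = true)
#  |-- rating: double (nullable = true)
```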

Build the model:

```python
# Import the required functions
from pyspark.sql.functions import col
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Create test and train set
(train, test) = ratings.randomSplit([0.80, 0.20], seed=1234)

# Create ALS model
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          nonnegative=True, implicitPrefs=False)

# Confirm that a model called "als" was created
type(als)

# Add hyperparameters and their respective values to param_grid
param_grid = ParamGridBuilder() \
    .addGrid(als.rank, [10, 50, 100, 150]) \
    .addGrid(als.maxIter, [5, 50, 100, 200]) \
    .addGrid(als.regParam, [.01, .05, .1, .15]) \
    .build()

# Define evaluator as RMSE and print number of models to be tested
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
print("Num models to be tested: ", len(param_grid))

# Build cross validation using CrossValidator
cv = CrossValidator(estimator=als, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)

# Confirm cv was built
print(cv)

# Fit the cross-validator to the training data and extract the best model
model = cv.fit(train)
best_model = model.bestModel

# Print best_model
print(type(best_model))

# Extract the ALS model parameters
print("**Best Model**")

# Print "Rank"
print("  Rank:", best_model.rank)

# Print "MaxIter" (tuned params live on the parent estimator of the best model)
print("  MaxIter:", best_model._java_obj.parent().getMaxIter())

# Print "RegParam"
print("  RegParam:", best_model._java_obj.parent().getRegParam())

# Generate predictions on the test set and view them
test_predictions = best_model.transform(test)
test_predictions.show()

# Calculate and print the RMSE of test_predictions
RMSE = evaluator.evaluate(test_predictions)
print(RMSE)

# `original_ratings` (ratings joined with movie titles) and `recommendations`
# (model predictions for unseen user/movie pairs) are assumed to be built earlier

# Look at user 60's ratings
print("User 60's Ratings:")
original_ratings.filter(col("userId") == 60).sort("rating", ascending=False).show()

# Look at the movies recommended to user 60
print("User 60's Recommendations:")
recommendations.filter(col("userId") == 60).show()

# Look at user 63's ratings
print("User 63's Ratings:")
original_ratings.filter(col("userId") == 63).sort("rating", ascending=False).show()

# Look at the movies recommended to user 63
print("User 63's Recommendations:")
recommendations.filter(col("userId") == 63).show()
```
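
The `recommendations` dataframe referenced above is assumed to hold per-user predictions. One way to build something similar with Spark's built-in API is `recommendForAllUsers`, which returns the top-N scored movies per user (an illustrative sketch, not necessarily how the course builds it):

```python
from pyspark.sql.functions import explode

# Top-10 recommendations per user: (userId, [(movieId, score), ...])
user_recs = best_model.recommendForAllUsers(10)

# Flatten the array of (movieId, rating) structs into one row per recommendation
recommendations = (user_recs
                   .select("userId", explode("recommendations").alias("rec"))
                   .select("userId", "rec.movieId", "rec.rating"))
recommendations.show(5)
```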

### Implicit Ratings Model
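
Ratings are often implicit (clicks, plays, purchases) rather than explicit star ratings. ALS supports the Hu, Koren, Volinsky implicit-feedback formulation via `implicitPrefs=True`. A minimal sketch, assuming a hypothetical `plays` dataframe with `userId`, `songId`, and `num_plays` columns:

```python
from pyspark.ml.recommendation import ALS

# For implicit data, ratingCol holds interaction strength (e.g. play counts)
# and alpha scales the confidence placed in observed interactions
als_implicit = ALS(userCol="userId", itemCol="songId", ratingCol="num_plays",
                   implicitPrefs=True, alpha=40.0, rank=10,
                   nonnegative=True, coldStartStrategy="drop")
implicit_model = als_implicit.fit(plays)  # `plays` is a hypothetical dataframe

# RMSE is not meaningful for implicit feedback; the course evaluates with
# ROEM (Rank Ordering Error Metric) instead -- see the links below
```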

### Other Resources

- [McKinsey & Company: "How Retailers Can Keep Up With Consumers"](https://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-with-consumers)
- [ALS Data Preparation: Wide to Long Function](https://github.com/jamenlong/ALS_expected_percent_rank_cv/blob/master/wide_to_long_function.py)
- [Hu, Koren, Volinsky: "Collaborative Filtering for Implicit Feedback Datasets"](http://yifanhu.net/PUB/cf.pdf)
- [GitHub Repo: Cross Validation With Implicit Ratings in PySpark](https://github.com/jamenlong/ALS_expected_percent_rank_cv/blob/master/ROEM_cv.py)
- [Pan, Zhou, Cao, Liu, Lukose, Scholz, Yang: "One Class Collaborative Filtering"](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.306.4684&rep=rep1&type=pdf)