
Commit 7e52e82

init commit
0 parents · commit 7e52e82

18 files changed · +752 -0 lines changed

Diff for: .gitignore

+3
```
.vscode/
*.zip
data/
```

Diff for: README.md

+63
# PySpark implementation of SVD++ for Top-N Recommendation

![pyspark-flow](img/pyspark-flow.png)

## Getting Started

### Prerequisites

You need to install *Apache Hadoop* and *Apache Spark* on every node of the cluster.

#### Install Hadoop

```bash
tar zxvf hadoop-3.y.z.tgz
ln -s /your/hadoop/path/hadoop-3.y.z /your/hadoop/path/hadoop
```

#### Install Spark

```bash
tar zxvf spark-2.y.z-bin-hadoop2.7.tgz
ln -s /your/spark/path/spark-2.y.z /your/spark/path/spark
```
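
The install is typically finished by pointing the shell at the symlinks created above. A minimal sketch, assuming a bash shell and the placeholder paths from the snippets (adjust on each node):

```bash
# Assumed environment setup; paths follow the symlinks created above.
export HADOOP_HOME=/your/hadoop/path/hadoop
export SPARK_HOME=/your/spark/path/spark
export PATH="$PATH:$HADOOP_HOME/bin:$SPARK_HOME/bin"
```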

### Installing

#### Clone the repository

```bash
git clone git@github.com:citomhuang/spark_svdpp.git
```

#### Create the Python environment

```bash
cd spark_svdpp
conda env create -f conda.yaml
conda activate spark-svdpp-env
```

#### Run the tests

```bash
pytest spark_svdpp/tests
```

## Run an example

```bash
./yarn-client.sh
```
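
`yarn-client.sh` is part of the repository but not shown in this diff excerpt. A minimal sketch of what such a wrapper around `spark-submit` could look like — all options, paths, and the partition count here are assumptions, not the actual script:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around spark-submit in YARN client mode;
# the real yarn-client.sh may differ.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --py-files spark_svdpp.zip \
  examples/svdpp_example.py 200  # 200 = example partition count
```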

## References

1. [Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model. Yehuda Koren, KDD’08](https://www.cs.rochester.edu/twiki/pub/Main/HarpSeminar/Factorization_Meets_the_Neighborhood-_a_Multifaceted_Collaborative_Filtering_Model.pdf)
2. [Spark: Cluster Computing with Working Sets](https://www.usenix.org/legacy/event/hotcloud10/tech/full_papers/Zaharia.pdf)
3. [Scaling Collaborative Filtering with PySpark](https://engineeringblog.yelp.com/2018/05/scaling-collaborative-filtering-with-pyspark.html)
4. [Running Spark on YARN](https://spark.apache.org/docs/latest/running-on-yarn.html)
5. [NicolasHug/Surprise](https://github.com/NicolasHug/Surprise)

Diff for: clean.sh

+8
```bash
#!/usr/bin/env bash

rm -f spark_svdpp.zip
rm -rf spark-warehouse/
rm -rf .pytest_cache/
rm -rf dist/
rm -rf build/
rm -rf spark_svdpp.egg-info/
```
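
`clean.sh` removes `spark_svdpp.zip`, which suggests the package is zipped up so `spark-submit` can ship it to executors with `--py-files`. That build step is not visible in this diff; a minimal sketch of how it might look:

```bash
# Hypothetical build step: bundle the package for --py-files,
# excluding bytecode caches.
zip -r spark_svdpp.zip spark_svdpp -x '*__pycache__*' -x '*.pyc'
```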

Diff for: conda.yaml

+18
```yaml
name: spark-svdpp-env
channels:
  - defaults
  - anaconda
dependencies:
  - python=3.7.5
  - pylint=2.4.4
  - flake8=3.7.9
  - h5py=2.9.0
  - ipython=7.9.0
  - numpy=1.17.3
  - pandas=0.25.3
  - pip=19.3.1
  - scipy=1.3.1
  - pyarrow=0.13.0
  - pytest=5.3.0
  - psutil=5.6.5
  - prompt_toolkit=2.0.10
```

Diff for: examples/svdpp_example.py

+13
```python
import sys
from pyspark.sql import SparkSession
from spark_svdpp.algos.svdpp import run


# The single CLI argument sets the number of partitions for the job.
n_pars = int(sys.argv[1])
spark = SparkSession.builder.getOrCreate()
output_data_path = 'hdfs:///svdpp/output.parquet'
input_data_path = 'hdfs:///svdpp/dataset_train.parquet'

# Run SVD++ over the training set and write the results to the output path.
run(spark=spark, n_pars=n_pars,
    input_data_path=input_data_path,
    output_data_path=output_data_path)
```
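
The example reads its training set from a fixed HDFS location. A hypothetical sketch of staging the data beforehand — the local parquet path under the gitignored `data/` directory is an assumption:

```bash
# Assumed data staging; the local filename is illustrative only.
hdfs dfs -mkdir -p /svdpp
hdfs dfs -put data/dataset_train.parquet /svdpp/dataset_train.parquet
```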

Diff for: img/pyspark-flow.png

53.7 KB

Diff for: setup.py

+16
```python
from setuptools import setup, find_packages
import os


here = os.path.abspath(os.path.dirname(__file__))

with open(os.path.join(here, 'README.md'), encoding='utf-8') as f:
    long_description = f.read()

setup(
    name='spark-svdpp',
    version='0.1.0',
    long_description=long_description,
    long_description_content_type='text/markdown',
    packages=find_packages()
)
```
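
With this `setup.py`, the package can also be installed into the active conda environment in editable mode for local development; this is standard setuptools/pip usage rather than a step documented by this repository:

```bash
# Editable install: changes under spark_svdpp/ are picked up immediately.
pip install -e .
```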

Diff for: spark_svdpp/__init__.py

Whitespace-only changes.

Diff for: spark_svdpp/algos/__init__.py

Whitespace-only changes.
