Docker machine contents:
- PySpark 2.1
- Conda
- Jupyter Notebook
- MongoDB Spark connector (mongo-spark 2.11:2.0.0)
1- Clone the repo
2- Build the docker machine from the repo directory:
$ sudo docker build -t pyspark_mongo_nb .
3- Create a shared directory on your host:
$ sudo mkdir /pyspark
4- Run the container, publishing the notebook and Spark UI ports and mounting the shared directory:
$ sudo docker run -d -p 8888:8888 -p 4040:4040 -p 4041:4041 -v /pyspark/:/pyspark --name pyspark_mongo_nb pyspark_mongo_nb
Notes:
- You can access the Jupyter notebook at http://localhost:8888
- You can access the Spark UI at http://localhost:4040
To open a shell inside the running container:
$ sudo docker exec -it pyspark_mongo_nb bash
To open a root shell instead:
$ sudo docker exec -i -u root -t pyspark_mongo_nb bash
Alternatively, instead of building the image yourself, pull the prebuilt one from Docker Hub:
$ sudo docker pull phawzy/pyspark_mongo_nb
To use Spark in a Python 3 notebook, add the following code at the start of the notebook:
import os
# make sure PySpark tells workers to use Python 3
os.environ['PYSPARK_PYTHON'] = '/opt/conda/bin/python3'
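As a sketch of where this setup leads, the cell below shows how a notebook could wire the bundled MongoDB connector into a local SparkSession. The URI `mongodb://localhost:27017/test.coll` (database `test`, collection `coll`) is a placeholder, and `build_session` is a hypothetical helper, not something shipped with the image:

```python
import os

# Make sure PySpark workers use the same Python 3 as the driver
# (interpreter path as installed in this image).
os.environ['PYSPARK_PYTHON'] = '/opt/conda/bin/python3'

# Connector settings -- the URI and the test.coll namespace are placeholders.
mongo_conf = {
    'spark.mongodb.input.uri': 'mongodb://localhost:27017/test.coll',
    'spark.mongodb.output.uri': 'mongodb://localhost:27017/test.coll',
}

def build_session():
    """Build a local SparkSession carrying the MongoDB connector settings."""
    # Imported lazily: pyspark is available inside the container.
    from pyspark.sql import SparkSession
    builder = SparkSession.builder.master('local[*]').appName('mongo_nb_test')
    for key, value in mongo_conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()

# Inside the container, a later cell would then run, for example:
# spark = build_session()
# df = spark.read.format('com.mongodb.spark.sql.DefaultSource').load()
# df.printSchema()
```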
Run the "Using Spark Local Mode" tutorial to verify that everything works.
Note that the docker machine is based on jupyter/pyspark-notebook.