Commit bee07ba

Merge pull request #33 from andre-marcos-perez/develop
Develop
2 parents 8e1b1c4 + 5d1a2a4 commit bee07ba

14 files changed

Lines changed: 714 additions & 56 deletions

.github/ISSUE_TEMPLATE/bug_report.md

Lines changed: 2 additions & 2 deletions
@@ -42,5 +42,5 @@ Please fill the template below.
 
 Please provide the following:
 
-- [] Docker Engine version: *Can be found using `docker version`, e.g.: 19.03.6*
-- [] Docker Compose version: *Can be found using `docker-compose version`, e.g.: 1.21.0*
+- [ ] Docker Engine version: *Can be found using `docker version`, e.g.: 19.03.6*
+- [ ] Docker Compose version: *Can be found using `docker-compose version`, e.g.: 1.21.0*

.github/pull_request_template.md

Lines changed: 3 additions & 3 deletions
@@ -7,7 +7,7 @@ parallel computing in distributed environments through our projects. :sparkles:
 
 ### Issue
 
-- *Issue number with link, e.g.: [#22](https://github.com/andre-marcos-perez/spark-standalone-cluster-on-docker/issues/22)*
+- *Issue number with link, e.g.: #22
 
 ### Changes
 
@@ -23,5 +23,5 @@ parallel computing in distributed environments through our projects. :sparkles:
 
 Please make sure to check the following:
 
-- [] I have followed the steps in the [CONTRIBUTING.md](../CONTRIBUTING.md) file.
-- [] I am aware that pull requests that do not follow the rules will be automatically rejected.
+- [ ] I have followed the steps in the [CONTRIBUTING.md](../CONTRIBUTING.md) file.
+- [ ] I am aware that pull requests that do not follow the rules will be automatically rejected.

CHANGELOG.md

Lines changed: 9 additions & 2 deletions
@@ -2,7 +2,14 @@
 
 All notable changes to this project will be documented in this file.
 
-## [v1.1.0](https://github.com/andre-marcos-perez/spark-standalone-cluster-on-docker/releases/tag/v1.1.0) (2020-08-09)
+## [1.2.0](https://github.com/andre-marcos-perez/spark-standalone-cluster-on-docker/releases/tag/v1.2.0) (2020-08-19)
+
+### Features
+
+- R kernel for JupyterLab;
+- Jupyter notebook with Spark R API (SparkR) example.
+
+## [1.1.0](https://github.com/andre-marcos-perez/spark-standalone-cluster-on-docker/releases/tag/v1.1.0) (2020-08-09)
 
 ### Features
 
@@ -14,7 +21,7 @@ All notable changes to this project will be documented in this file.
 - Docs general improvements;
 - Pull request template refactored.
 
-## [v1.0.0](https://github.com/andre-marcos-perez/spark-standalone-cluster-on-docker/releases/tag/v1.0.0) (2020-07-30)
+## [1.0.0](https://github.com/andre-marcos-perez/spark-standalone-cluster-on-docker/releases/tag/v1.0.0) (2020-07-30)
 
 ### Tech Stack
 

CONTRIBUTING.md

Lines changed: 6 additions & 5 deletions
@@ -15,11 +15,12 @@ parallel computing in distributed environments through our projects. :sparkles:
 
 ### Contributions ideas
 
-- [] Microsoft Windows build script;
+- [ ] Microsoft Windows build script;
 - [x] Docker Hub CI/CD integration;
-- [] Spark submit support;
+- [ ] Spark submit support;
 - [x] JupyterLab Scala kernel;
 - [x] Jupyter notebook with Apache Spark Scala API examples;
-- [] JupyterLab R kernel;
-- [] Jupyter notebook with Apache Spark R API examples;
-- [] Test coverage.
+- [x] JupyterLab R kernel;
+- [x] Jupyter notebook with Apache Spark R API examples;
+- [ ] Test coverage;
+- [ ] Ever growing examples.

README.md

Lines changed: 29 additions & 21 deletions
@@ -1,8 +1,9 @@
 # Apache Spark Standalone Cluster on Docker
+
 > The project just got its [own article](https://towardsdatascience.com/apache-spark-cluster-on-docker-ft-a-juyterlab-interface-418383c95445) at Towards Data Science Medium blog! :sparkles:
 
 This project gives you an **Apache Spark** cluster in standalone mode with a **JupyterLab** interface built on top of **Docker**.
-Learn Apache Spark through its Scala and Python API (PySpark) by running the Jupyter [notebooks](build/workspace/) with examples on how to read, process and write data.
+Learn Apache Spark through its **Scala**, **Python** (PySpark) and **R** (SparkR) API by running the Jupyter [notebooks](build/workspace/) with examples on how to read, process and write data.
 
 <p align="center"><img src="docs/image/cluster-architecture.png"></p>
 
@@ -13,6 +14,7 @@ Learn Apache Spark through its Scala and Python API (PySpark) by running the Jup
 ![docker-compose-file-version](https://img.shields.io/badge/docker--compose-v1.10.0%2B-blue)
 ![spark-scala-api](https://img.shields.io/badge/spark%20api-scala-red)
 ![spark-pyspark-api](https://img.shields.io/badge/spark%20api-pyspark-red)
+![spark-sparkr-api](https://img.shields.io/badge/spark%20api-sparkr-red)
 
 ## TL;DR
 
@@ -33,12 +35,12 @@ docker-compose up
 
 ### Cluster overview
 
-| Application | URL | Description |
-| ---------------------- | ---------------------------------------- | ----------------------------------------------------------- |
-| JupyterLab | [localhost:8888](http://localhost:8888/) | Cluster interface with Scala and PySpark built-in notebooks |
-| Apache Spark Master | [localhost:8080](http://localhost:8080/) | Spark Master node |
-| Apache Spark Worker I | [localhost:8081](http://localhost:8081/) | Spark Worker node with 1 core and 512m of memory (default) |
-| Apache Spark Worker II | [localhost:8082](http://localhost:8082/) | Spark Worker node with 1 core and 512m of memory (default) |
+| Application | URL | Description |
+| ---------------------- | ---------------------------------------- | ---------------------------------------------------------- |
+| JupyterLab | [localhost:8888](http://localhost:8888/) | Cluster interface with built-in Jupyter notebooks |
+| Apache Spark Master | [localhost:8080](http://localhost:8080/) | Spark Master node |
+| Apache Spark Worker I | [localhost:8081](http://localhost:8081/) | Spark Worker node with 1 core and 512m of memory (default) |
+| Apache Spark Worker II | [localhost:8082](http://localhost:8082/) | Spark Worker node with 1 core and 512m of memory (default) |
 
 ### Prerequisites
 
@@ -54,7 +56,7 @@ docker-compose up
 docker-compose up
 ```
 
-4. Run Apache Spark code using the provided Jupyter [notebooks](build/workspace/) with Scala and PySpark examples;
+4. Run Apache Spark code using the provided Jupyter [notebooks](build/workspace/) with Scala, PySpark and SparkR examples;
 5. Stop the cluster by typing `ctrl+c`.
 
 ### Build from your local machine
@@ -82,7 +84,7 @@ chmod +x build.sh ; ./build.sh
 docker-compose up
 ```
 
-7. Run Apache Spark code using the provided Jupyter [notebooks](build/workspace/) with Scala and PySpark examples;
+7. Run Apache Spark code using the provided Jupyter [notebooks](build/workspace/) with Scala, PySpark and SparkR examples;
 8. Stop the cluster by typing `ctrl+c`.
 
 ## <a name="tech-stack"></a>Tech Stack
@@ -93,15 +95,17 @@ docker-compose up
 | -------------- | ------- |
 | Docker Engine | 1.13.0+ |
 | Docker Compose | 1.10.0+ |
-| Python | 3.7 |
-| Scala | 2.12 |
+| Python | 3.7.3 |
+| Scala | 2.12.11 |
+| R | 3.5.2 |
 
 - Jupyter Kernels
 
-| Component | Version | Provider |
-| -------------- | ------- | ------------------------------- |
-| Python | 2.1.4 | [Jupyter](https://jupyter.org/) |
-| Scala | 0.10.0 | [Almond](https://almond.sh/) |
+| Component | Version | Provider |
+| -------------- | ------- | --------------------------------------- |
+| Python | 2.1.4 | [Jupyter](https://jupyter.org/) |
+| Scala | 0.10.0 | [Almond](https://almond.sh/) |
+| R | 1.1.1 | [IRkernel](https://irkernel.github.io/) |
 
 - Applications
 
@@ -110,18 +114,22 @@ docker-compose up
 | Apache Spark | 2.4.0 \| 2.4.4 \| 3.0.0 | **\<spark-version>**-hadoop-2.7 |
 | JupyterLab | 2.1.4 | **\<jupyterlab-version>**-spark-**\<spark-version>** |
 
+> Apache Spark R API (SparkR) is only supported on version **2.4.4**. Full list can be found [here](https://cran.r-project.org/src/contrib/Archive/SparkR/).
+
 ## <a name="docker-hub-metrics"></a>Docker Hub Metrics
 
-| Image | Latest Version Size | Downloads |
-| -------------------------------------------------------------- | --------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
-| [JupyterLab](https://hub.docker.com/r/andreper/jupyterlab) | ![docker-size](https://img.shields.io/docker/image-size/andreper/jupyterlab/latest) | ![docker-pull](https://img.shields.io/docker/pulls/andreper/jupyterlab) |
-| [Spark Master](https://hub.docker.com/r/andreper/spark-master) | ![docker-size](https://img.shields.io/docker/image-size/andreper/spark-master/latest) | ![docker-pull](https://img.shields.io/docker/pulls/andreper/spark-master) |
-| [Spark Worker](https://hub.docker.com/r/andreper/spark-worker) | ![docker-size](https://img.shields.io/docker/image-size/andreper/spark-worker/latest) | ![docker-pull](https://img.shields.io/docker/pulls/andreper/spark-worker) |
+| Image | Size | Downloads |
+| -------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
+| [JupyterLab](https://hub.docker.com/r/andreper/jupyterlab) | ![docker-size-jupyterlab](https://img.shields.io/docker/image-size/andreper/jupyterlab/latest) | ![docker-pull](https://img.shields.io/docker/pulls/andreper/jupyterlab) |
+| [Spark Master](https://hub.docker.com/r/andreper/spark-master) | ![docker-size-master](https://img.shields.io/docker/image-size/andreper/spark-master/latest) | ![docker-pull](https://img.shields.io/docker/pulls/andreper/spark-master) |
+| [Spark Worker](https://hub.docker.com/r/andreper/spark-worker) | ![docker-size-worker](https://img.shields.io/docker/image-size/andreper/spark-worker/latest) | ![docker-pull](https://img.shields.io/docker/pulls/andreper/spark-worker) |
 
 ## <a name="contributing"></a>Contributing
 
 We'd love some help. To contribute, please read [this file](CONTRIBUTING.md).
 
+> Staring us on GitHub is also an awesome way to show your support :star:
+
 ## <a name="contributors"></a>Contributors
 
-- **André Perez** - [dekoperez](https://twitter.com/dekoperez) - andre.marcos.perez@gmail.com
+- **André Perez** - [dekoperez](https://twitter.com/dekoperez) - andre.marcos.perez@gmail.com

build/build.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
applications:
22
scala: "2.12.11"
3-
spark: "3.0.0"
3+
spark: "2.4.4"
44
hadoop: "2.7"
55
jupyterlab: "2.1.4"
66
build:

build/docker-compose.yml

Lines changed: 4 additions & 4 deletions
@@ -9,22 +9,22 @@ volumes:
     driver: local
 services:
   jupyterlab:
-    image: jupyterlab:2.1.4-spark-3.0.0
+    image: jupyterlab:2.1.4-spark-2.4.4
     container_name: jupyterlab
     ports:
       - 8888:8888
     volumes:
       - shared-workspace:/opt/workspace
   spark-master:
-    image: spark-master:3.0.0-hadoop-2.7
+    image: spark-master:2.4.4-hadoop-2.7
     container_name: spark-master
     ports:
       - 8080:8080
       - 7077:7077
     volumes:
       - shared-workspace:/opt/workspace
   spark-worker-1:
-    image: spark-worker:3.0.0-hadoop-2.7
+    image: spark-worker:2.4.4-hadoop-2.7
     container_name: spark-worker-1
     environment:
       - SPARK_WORKER_CORES=1
@@ -36,7 +36,7 @@ services:
     depends_on:
       - spark-master
   spark-worker-2:
-    image: spark-worker:3.0.0-hadoop-2.7
+    image: spark-worker:2.4.4-hadoop-2.7
     container_name: spark-worker-2
     environment:
       - SPARK_WORKER_CORES=1
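
Review note: the four image tags edited by hand in this file follow the tag patterns given in the README tech stack table (`<spark-version>-hadoop-2.7` and `<jupyterlab-version>-spark-<spark-version>`). As a rough sketch only, the tags could be derived from the versions in `build/build.yml` instead. The function names below are hypothetical and not part of this commit, and the versions are passed as plain arguments rather than parsed from the YAML file:

```shell
#!/bin/sh
# Hypothetical helpers (not part of the commit) that build image tags from
# version strings, following the tag patterns in the README tech stack table.

# spark_image_tag <spark-version> <hadoop-version>
spark_image_tag() {
  printf '%s-hadoop-%s\n' "$1" "$2"
}

# jupyterlab_image_tag <jupyterlab-version> <spark-version>
jupyterlab_image_tag() {
  printf '%s-spark-%s\n' "$1" "$2"
}

# Versions as set in build/build.yml after this commit
SPARK_VERSION="2.4.4"
HADOOP_VERSION="2.7"
JUPYTERLAB_VERSION="2.1.4"

echo "jupyterlab:$(jupyterlab_image_tag "$JUPYTERLAB_VERSION" "$SPARK_VERSION")"
echo "spark-master:$(spark_image_tag "$SPARK_VERSION" "$HADOOP_VERSION")"
echo "spark-worker:$(spark_image_tag "$SPARK_VERSION" "$HADOOP_VERSION")"
```

Comparing these printed names against the `image:` values above is one way to notice when `build.yml` and `docker-compose.yml` drift apart after a version bump.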

build/docker/base/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ ARG shared_workspace=/opt/workspace
 RUN mkdir -p ${shared_workspace}/data && \
     mkdir -p /usr/share/man/man1 && \
     apt-get update -y && \
-    apt-get install -y curl python3 scala && \
+    apt-get install -y curl python3 r-base && \
     ln -s /usr/bin/python3 /usr/bin/python && \
     curl https://downloads.lightbend.com/scala/${scala_version}/scala-${scala_version}.deb -k -o scala.deb && \
     apt install -y ./scala.deb && \

build/docker/jupyterlab/Dockerfile

Lines changed: 13 additions & 2 deletions
@@ -11,7 +11,7 @@ LABEL org.label-schema.description="JupyterLab image"
 LABEL org.label-schema.url="https://github.com/andre-marcos-perez/spark-cluster-on-docker"
 LABEL org.label-schema.schema-version="1.0"
 
-# -- Layer: JupyterLab + Python kernel
+# -- Layer: JupyterLab + Python kernel for PySpark
 
 ARG spark_version
 ARG jupyterlab_version
@@ -21,7 +21,7 @@ RUN apt-get update -y && \
     pip3 install --upgrade pip && \
     pip3 install pyspark==${spark_version} jupyterlab==${jupyterlab_version}
 
-# -- Layer: Scala kernel
+# -- Layer: Scala kernel for Spark
 
 ARG scala_version
 
@@ -31,6 +31,17 @@ RUN apt-get install -y ca-certificates-java --no-install-recommends && \
     ./coursier launch --fork almond:0.10.0 --scala ${scala_version} -- --display-name "Scala ${scala_version}" --install && \
     rm -f coursier
 
+# -- Layer: R kernel for SparkR
+
+COPY ./script/sparkr.sh ./sparkr.sh
+
+RUN apt-get install -y r-base-dev && \
+    R -e "install.packages('IRkernel')" && \
+    R -e "IRkernel::installspec(displayname = 'R 3.5', user = FALSE)" && \
+    chmod +x ./sparkr.sh && \
+    ./sparkr.sh ${spark_version} && \
+    rm -f sparkr.sh
+
 # -- Runtime
 
 EXPOSE 8888

build/script/sparkr.sh

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+#!/bin/bash
+#
+# -- Download and install Apache Spark R API (SparkR)
+
+# ----------------------------------------------------------------------------------------------------------------------
+# -- Variables ---------------------------------------------------------------------------------------------------------
+# ----------------------------------------------------------------------------------------------------------------------
+
+SPARK_VERSION="${1}"
+
+# ----------------------------------------------------------------------------------------------------------------------
+# -- Main --------------------------------------------------------------------------------------------------------------
+# ----------------------------------------------------------------------------------------------------------------------
+
+if [[ "${SPARK_VERSION}" =~ ^(2.1.2|2.3.0|2.4.1|2.4.2|2.4.3|2.4.4|2.4.5|2.4.6)$ ]]
+then
+  curl https://cran.r-project.org/src/contrib/Archive/SparkR/SparkR_${SPARK_VERSION}.tar.gz -k -o sparkr.tar.gz
+  R CMD INSTALL sparkr.tar.gz
+  rm -f sparkr.tar.gz
+fi
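
Review note on the version check in `sparkr.sh`: inside `[[ ... =~ ... ]]` an unescaped `.` matches any single character, so an input such as `2x4y4` would also pass the test and trigger a download attempt. A minimal sketch of a stricter allow-list, assuming the same eight versions, uses a `case` statement, whose patterns treat `.` literally. The function name `is_sparkr_supported` is hypothetical and not part of this commit:

```shell
#!/bin/sh
# Hypothetical helper (not in the commit): succeeds only when the given
# Spark version is one of the SparkR releases listed in sparkr.sh.
# In case patterns a dot is a literal dot, unlike in a regular expression.
is_sparkr_supported() {
  case "$1" in
    2.1.2|2.3.0|2.4.1|2.4.2|2.4.3|2.4.4|2.4.5|2.4.6) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: report support before attempting the download
for version in 2.4.4 3.0.0; do
  if is_sparkr_supported "$version"; then
    echo "SparkR archive listed for ${version}"
  else
    echo "no SparkR archive listed for ${version}"
  fi
done
```

Escaping the dots in the existing regular expression (`2\.4\.4` and so on) would be an equivalent fix that keeps the `[[ ... =~ ... ]]` form.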
