
Commit 16747f3

feat: introduce Gitpod
0 parents  commit 16747f3

Note: large commits have some content hidden by default, so not all 57 changed files are shown below.

57 files changed: +22,445 −0 lines

.gitignore

Lines changed: 125 additions & 0 deletions

# Result files
output
spark-warehouse
artifacts
metastore_db/

exercises/resources/flights
exercises/target
exercises/resources/as_parquet
fhvhv_tripdata_2020-05
# Editor extensions
.idea

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

Pipfile.lock
videos
cloud9/terraform
cloud9/python
cloud9/*.yaml
cloud9/.gitignore

.gitpod.Dockerfile

Lines changed: 24 additions & 0 deletions

FROM gitpod/workspace-python:2022-02-04-06-25-23

ENV DEBIAN_FRONTEND=noninteractive
ENV SPARK_LOCAL_IP=0.0.0.0
# needed for master

USER root
# Install apt packages and clean up cached files
RUN apt-get update && \
    apt-get install -y openjdk-8-jdk python3-venv && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
# Install the AWS CLI and clean up tmp files
RUN wget https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip -O ./awscliv2.zip && \
    unzip awscliv2.zip && \
    ./aws/install && \
    rm -rf ./aws awscliv2.zip

USER gitpod

# For vscode
EXPOSE 3000
# for spark
EXPOSE 4040

.gitpod.yml

Lines changed: 33 additions & 0 deletions

github:
  prebuilds:
    master: true
    branches: true
    pullRequests: false
    pullRequestsFromForks: false
    addCheck: false
    addComment: false
    addBadge: false

image:
  file: .gitpod.Dockerfile

ports:
  - port: 4040 # pyspark UI
    onOpen: notify

tasks:
  - name: setup
    init: |
      python -m venv .venv
      source .venv/bin/activate
      python -m pip install -r requirements.txt
      echo "source $(pwd)/.venv/bin/activate" >> ~/.bashrc
      echo "export PYTHONPATH=$(pwd)" >> ~/.bashrc
      clear
    command: |
      source .venv/bin/activate
      export PYTHONPATH=$(pwd)

vscode:
  extensions:
    - ms-python.python

README.md

Lines changed: 149 additions & 0 deletions

# Better Data Engineering with PySpark

📚 A course brought to you by the [Data Minded Academy].

## Context

These are the exercises used in the course *Better Data Engineering with
PySpark*, developed by instructors at Data Minded. The exercises are meant
to be completed in the order determined by the lexicographical order of
their parent folders. That is, exercises inside the folder `b_foo` should be
completed before those in `c_bar`, but both should come after those of
`a_foo_bar`.

## Getting started

While you can clone the repo locally, we do not offer support for setting up
your coding environment. Instead, we recommend you [tackle the exercises
using Gitpod][this gitpod].

[![Open in Gitpod][gitpod logo]][this gitpod]

⚠ IMPORTANT: Create a new branch and periodically push your work to the remote.
After 30 minutes of inactivity this environment shuts down and you will lose
unsaved progress.

## Course objectives

- Introduce good data engineering practices.
- Illustrate modular and easily testable data transformation pipelines using
  PySpark.
- Illustrate PySpark concepts such as lazy evaluation, caching & partitioning
  (not limited to these three).

## Intended audience

- People working with (Py)Spark or soon to be working with it.
- Familiar with Python functions, variables and the container data types
  `list`, `tuple`, `dict`, and `set`.

## Approach

The lecturer first sets the foundations right for Python development and
gradually builds up to PySpark data pipelines.

A high degree of participation is expected from the students: they will need
to write code themselves and reason about the topics, so that they retain the
knowledge better.

Participants are recommended to work on a branch for any changes they make,
as the instructors may choose to release an update to the current branch;
working on a branch avoids conflicts (otherwise the onus is on the
participant).

Note: this course is not about writing the best pipelines possible. There are
many ways to skin a cat; in this course we show one (or sometimes a few),
which should be suitable for the level of the participants.

## Exercises

### Warm-up: thinking critically about tests

Glance at the file [./exercises/b_unit_test_demo/distance_metrics.py]. Then
complete [./tests/test_distance_metrics.py] by writing at least two useful
tests, one of which should prove that the code, as it is, is wrong.
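
For orientation, such tests could look roughly like the sketch below. The imported function name and its signature are hypothetical assumptions made for illustration (the real API is in the exercise file), so treat this as a shape, not as the expected solution.

```python
# Hedged sketch only: the function name, signature, and units are assumptions;
# check the actual exercise file for the real API.
import pytest

from exercises.b_unit_test_demo.distance_metrics import great_circle_distance  # hypothetical name


def test_distance_from_a_point_to_itself_is_zero():
    # Whatever formula is used, a point is at distance zero from itself.
    assert great_circle_distance(50.85, 4.35, 50.85, 4.35) == pytest.approx(0.0)


def test_quarter_meridian_is_roughly_ten_thousand_km():
    # Pole to equator is about 10 007 km on a spherical Earth; a result far off
    # from that would prove the implementation wrong.
    assert great_circle_distance(90.0, 0.0, 0.0, 0.0) == pytest.approx(10_007, rel=0.01)
```
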
### Adding derived columns

Check out [exercises/c_labellers/dates.py] and implement the pure Python
function `is_belgian_holiday`. Verify that your implementation is correct by
running the test `test_pure_python_function` from [tests/test_labellers.py].
You could do this from the command line with
`pytest tests/test_labellers.py::test_pure_python_function`.

With that implemented, it's time to take a step back and think about how one
would compare data that might be distributed over different machines.
Implement `assert_frames_functionally_equivalent` from [tests/comparers.py].
Validate that your implementation is correct by running the test suite at
[tests/test_comparers.py]. You will use this function in a few subsequent
exercises.

Return to [exercises/c_labellers/dates.py] and implement `label_weekend`.
Again, run the related test from [tests/test_labellers.py]. It might be more
useful to you if you read the test first.

Finally, implement `label_holidays` from [exercises/c_labellers/dates.py].
As before, run the relevant test to verify a few easy cases (keep in mind that
few tests are exhaustive: it's typically easier to prove something is wrong
than that something is right).

If you're making great speed, try to think of an alternative implementation
of `label_holidays` and discuss its pros and cons.
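
To make the shape of these labellers concrete, here is a minimal sketch of what a `label_weekend`-style function could look like. The column names and the exact signature are assumptions; the course's own solution may differ.

```python
# Minimal sketch, not the course solution; column names are assumptions.
from pyspark.sql import DataFrame
from pyspark.sql import functions as psf


def label_weekend(frame: DataFrame, colname: str = "date", new_colname: str = "is_weekend") -> DataFrame:
    # In Spark SQL, dayofweek() returns 1 for Sunday and 7 for Saturday.
    return frame.withColumn(new_colname, psf.dayofweek(psf.col(colname)).isin(1, 7))
```
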
### (Optional) Get in the habit of writing tests

Have a look at [exercises/d_laziness/date_helper.py]. Explain the intent of
the author. Which two key aspects of Spark's processing did the author forget?
If you can't answer this, run `test_date_helper_doesnt_work_as_intended` from
[exercises/d_laziness/test_laziness.py]. Now write an alternative to the
`convert_date` function that does do what the author intended.
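
The exercise file is not part of this diff, so the snippet below is only a generic illustration (an assumption, not the actual `date_helper.py`) of two aspects of Spark processing that are easy to forget:

```python
# Generic illustration, not the contents of date_helper.py.
from pyspark.sql import SparkSession
from pyspark.sql import functions as psf

spark = SparkSession.builder.getOrCreate()
frame = spark.createDataFrame([("2020-05-01",)], schema=("datestr",))

# 1. DataFrames are immutable: transformations return a new frame rather than
#    modifying the one they are called on, so the result must be kept.
frame.withColumn("date", psf.to_date("datestr"))          # result silently discarded
frame = frame.withColumn("date", psf.to_date("datestr"))  # result kept

# 2. Transformations are lazy: nothing is computed until an action runs.
frame.show()  # show() is an action and triggers execution
```
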
### Common business case 1: cleaning data

Using the information seen in the videos, prepare a sizeable dataset for
storage in "the clean zone" of a data lake by implementing the `clean`
function of [exercises/h_cleansers/clean_flights_starter.py].
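
As a rough idea of the kind of steps such a `clean` function performs (the column names below are invented for illustration and will not match the real flights schema):

```python
# Illustrative sketch; column names are assumptions, not the real flights schema.
from pyspark.sql import DataFrame
from pyspark.sql import functions as psf
from pyspark.sql.types import IntegerType


def clean(frame: DataFrame) -> DataFrame:
    return (
        frame
        .withColumnRenamed("DEP_DELAY", "departure_delay")  # consistent snake_case names
        .withColumn("departure_delay", psf.col("departure_delay").cast(IntegerType()))
        .withColumn("flight_date", psf.to_date("FL_DATE", "yyyy-MM-dd"))
        .dropDuplicates()
    )
```
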
### Cataloging your datasets

To prevent your code from having links to datasets hardcoded everywhere,
create a simple catalog and a convenience function to load data by
referencing this catalog. You have a template in
[exercises/i_catalog/catalog_starter.py].

Once done, revisit [exercises/h_cleansers/clean_flights_starter.py] and
replace the call that loads the dataset with your new catalog helpers.

Adapt the `import` statements in [exercises/h_cleansers/clean_airports.py]
and [exercises/h_cleansers/clean_carriers.py] and execute these files with the
Python interpreter. Pay attention to where the data is being stored.
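
One possible shape for such a catalog, sketched here as a plain dict keyed by dataset name; the real template in the repo may be structured differently, and the entries (names, paths, formats) are assumptions.

```python
# Hedged sketch of a dataset catalog and loader; entries are illustrative.
from pathlib import Path

from pyspark.sql import DataFrame, SparkSession

RESOURCES = Path(__file__).parent / "resources"

catalog = {
    "raw_flights": {"path": str(RESOURCES / "flights"), "format": "csv", "options": {"header": "true"}},
    "clean_flights": {"path": str(RESOURCES / "clean" / "flights"), "format": "parquet", "options": {}},
}


def load_frame_from_catalog(spark: SparkSession, dataset_name: str) -> DataFrame:
    # Look up where and how a dataset is stored, so callers never hardcode paths.
    entry = catalog[dataset_name]
    return spark.read.format(entry["format"]).options(**entry["options"]).load(entry["path"])
```
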
### Peer review

In groups, discuss the improvements one could make to
[exercises/l_code_review/bingewatching.py].

### Common business case 2: report generation

Create a complete view of the flights data in which you combine the airline
carriers (a dimension table), the airport names (another dimension table) and
the flights table (a fact table).

Your manager wants to know how many flights were operated by American Airlines
in 2011.

How many of those flights arrived with at most 10 minutes of delay?

A data scientist is looking for correlations between departure delays and
dates. In particular, they think that, relatively speaking, more flights
depart with a delay on Fridays than on any other day of the week. Verify
their claim.

Out of the 5 categories of delay sources, which one appeared most often in
2011? In other words, in which category should we invest more time to improve?
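
A hedged sketch of how this report could be wired up; all table and column names here are assumptions made for illustration, since the real schemas come from the datasets cleaned in the earlier exercises.

```python
# Sketch only; column names ("carrier_code", "flight_date", ...) are assumptions.
from pyspark.sql import DataFrame
from pyspark.sql import functions as psf


def build_master_flights(flights: DataFrame, carriers: DataFrame, airports: DataFrame) -> DataFrame:
    # Enrich the fact table with both dimension tables.
    return (
        flights
        .join(carriers, on="carrier_code", how="left")
        .join(airports, on="origin_airport_id", how="left")
    )


def american_airlines_2011(master: DataFrame) -> DataFrame:
    return master.filter(
        (psf.col("carrier_name") == "American Airlines Inc.")
        & (psf.year("flight_date") == 2011)
    )


# Example questions answered with simple filters and aggregations:
# american_airlines_2011(master).count()
# american_airlines_2011(master).filter(psf.col("arrival_delay") <= 10).count()
# master.groupBy(psf.dayofweek("flight_date")).agg(
#     psf.avg((psf.col("departure_delay") > 0).cast("int")).alias("fraction_delayed")
# )
```
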

[this gitpod]: https://gitpod.io/#https://github.com/oliverw1/summerschoolsept
[gitpod logo]: https://gitpod.io/button/open-in-gitpod.svg
[Data Minded Academy]: https://www.dataminded.academy/

bootstrap_repo.sh

Lines changed: 20 additions & 0 deletions

rm -rf .git
git init
git add .
git rm --cached \
    tests/test_distance_metrics_solution.py \
    tests/comparers_solution.py \
    exercises/b_unit_test_demo/distance_metrics_corrected.py \
    exercises/c_labellers/dates_solution.py \
    exercises/d_laziness/test_improved.py \
    exercises/d_laziness/improved_date_helper.py \
    exercises/d_laziness/test_laziness.py \
    exercises/h_cleansers/clean_airports.py \
    exercises/h_cleansers/clean_carriers.py \
    exercises/h_cleansers/clean_flights.py \
    exercises/h_cleansers/test_clean_flights.py \
    exercises/h_cleansers/cleaning_villo_stations_solution.py \
    exercises/i_catalog/catalog.py \
    exercises/m_business/master_flights.py \
    exercises/m_business/num_flights.py \
    exercises/resources/fhvhv_tripdata_2020-05.csv

exercises/__init__.py

Whitespace-only changes.

exercises/a_spark_demo/__init__.py

Whitespace-only changes.
Lines changed: 36 additions & 0 deletions

"""
Illustrate several ways to create small, toy-example dataframes.
This is incredibly useful in tests.
"""

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# The verbose way
fields = [
    StructField("name", StringType(), nullable=True),
    StructField("age", IntegerType(), nullable=True),
]
users = spark.createDataFrame(
    data=[
        ("Wim", 1),
        (None, 2),
    ],
    schema=StructType(fields),
)

# A shorter way, with implicit assumptions: Spark will attempt to infer the datatypes.
# They will typically be chosen overly large.
currencies = spark.createDataFrame(
    data=[
        ("Euro", 1.0, 1),
        ("USD", 1.2, 1),
    ],
    schema=("currency", "value", "random"),
)

for frame in (users, currencies):
    frame.show()  # An action.
    frame.printSchema()  # Not an action.
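
Beyond the two approaches in this file, a third, middle-ground option (not part of the commit; shown only as an additional illustration) is to pass a DDL-style schema string, which keeps the types explicit without the `StructType` boilerplate:

```python
# Not in the committed file; an extra illustration reusing the spark session above.
prices = spark.createDataFrame(
    data=[("Euro", 1.0), ("USD", 1.2)],
    schema="currency string, value double",  # DDL-style schema string
)
prices.printSchema()
```
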
