# Better Data Engineering with PySpark

📚 A course brought to you by the [Data Minded Academy].

## Context

These are the exercises used in the course *Better Data Engineering with
PySpark*, developed by instructors at Data Minded. The exercises are meant
to be completed in the lexicographical order of their parent folders: the
exercises inside the folder `b_foo` should be completed before those in
`c_bar`, and both come after those in `a_foo_bar`.

## Getting started

While you can clone the repo locally, we do not offer support for setting up
your coding environment. Instead, we recommend you [tackle the exercises
using Gitpod][this gitpod].

[![Open in Gitpod][gitpod logo]][this gitpod]

⚠ IMPORTANT: Create a new branch and periodically push your work to the
remote. After 30 minutes of inactivity, this environment shuts down and you
will lose unsaved progress.

## Course objectives

- Introduce good data engineering practices.
- Illustrate modular and easily testable data transformation pipelines using
  PySpark.
- Illustrate PySpark concepts, such as lazy evaluation, caching, and
  partitioning, though the course is not limited to these three.

## Intended audience

- People working with (Py)Spark, or soon to be working with it.
- People familiar with Python functions, variables, and the container data
  types `list`, `tuple`, `dict`, and `set`.

## Approach

The lecturer first sets the foundations for Python development and gradually
builds up to PySpark data pipelines.

A high degree of participation is expected from the students: they will need
to write code themselves and reason about the topics in order to retain the
knowledge better.

Participants are recommended to work on a branch for any changes they make,
to avoid conflicts, as the instructors may choose to release an update to
the current branch (otherwise, the onus is on the participant).

Note: this course is not about writing the best pipelines possible. There
are many ways to skin a cat; in this course we show one (or sometimes a
few) that should be suitable for the level of the participants.

## Exercises

### Warm-up: thinking critically about tests

Glance at the file [./exercises/b_unit_test_demo/distance_metrics.py]. Then
complete [./tests/test_distance_metrics.py] by writing at least two useful
tests, one of which should prove that the code, as it is, is wrong.
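
To illustrate what a useful test can look like, here is a minimal
pytest-style sketch against a hypothetical `euclidean_distance` function.
The actual functions in `distance_metrics.py` will differ, but the principle
holds: pick inputs whose expected output you can derive independently of the
code under test.

```python
import math


def euclidean_distance(p, q):
    # Hypothetical function under test; stands in for the code in
    # exercises/b_unit_test_demo/distance_metrics.py.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def test_zero_distance_to_self():
    # Any point lies at distance zero from itself.
    assert euclidean_distance((1.0, 2.0), (1.0, 2.0)) == 0.0


def test_known_pythagorean_triple():
    # The 3-4-5 triangle pins the expected value down exactly.
    assert euclidean_distance((0.0, 0.0), (3.0, 4.0)) == 5.0
```

A test that merely re-implements the function and compares the two proves
nothing; fixed, hand-derived values do.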

### Adding derived columns

Check out [exercises/c_labellers/dates.py] and implement the pure Python
function `is_belgian_holiday`. Verify that your implementation is correct by
running the test `test_pure_python_function` from [tests/test_labellers.py].
You can do this from the command line with
`pytest tests/test_labellers.py::test_pure_python_function`.
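
If you need a starting point, the sketch below handles only the fixed-date
public holidays. The holiday set and the signature are assumptions here, and
movable feasts (Easter Monday, Ascension Day, Whit Monday) would need extra
logic or a dedicated library.

```python
from datetime import date

# Fixed-date Belgian public holidays as (month, day) pairs. Movable
# feasts such as Easter Monday are deliberately left out of this sketch.
FIXED_BELGIAN_HOLIDAYS = {
    (1, 1),    # New Year's Day
    (5, 1),    # Labour Day
    (7, 21),   # Belgian National Day
    (8, 15),   # Assumption of Mary
    (11, 1),   # All Saints' Day
    (11, 11),  # Armistice Day
    (12, 25),  # Christmas Day
}


def is_belgian_holiday(d: date) -> bool:
    # Incomplete on purpose: only fixed-date holidays are covered.
    return (d.month, d.day) in FIXED_BELGIAN_HOLIDAYS
```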

With that implemented, it's time to take a step back and think about how one
would compare data that might be distributed over different machines.
Implement `assert_frames_functionally_equivalent` from [tests/comparers.py].
Validate that your implementation is correct by running the test suite at
[tests/test_comparers.py]. You will use this function in a few subsequent
exercises.
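
The crux is that rows can come back in any order from a distributed system.
As a hint of the idea (not the actual signature defined in
[tests/comparers.py]), here it is in plain Python over already-collected
rows; with real DataFrames you would compare `df.schema` first and then
apply the same multiset comparison to the output of `df.collect()`.

```python
from collections import Counter


def assert_rows_functionally_equivalent(left, right):
    # Row order is not deterministic on a distributed system, so compare
    # the rows as multisets (order-insensitive, duplicate-sensitive)
    # rather than as ordered sequences.
    assert Counter(map(tuple, left)) == Counter(map(tuple, right))
```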

Return to [exercises/c_labellers/dates.py] and implement `label_weekend`.
Again, run the related test from [tests/test_labellers.py]. You may find it
helpful to read the test first.

Finally, implement `label_holidays` from [exercises/c_labellers/dates.py].
As before, run the relevant test to verify a few easy cases. Keep in mind
that few tests are exhaustive: it's typically easier to prove that something
is wrong than that something is right.

If you're making good progress, try to think of an alternative
implementation of `label_holidays` and discuss the pros and cons.

### (Optional) Get in the habit of writing tests

Have a look at [exercises/d_laziness/date_helper.py]. Explain the intent of
the author. Which two key aspects of Spark's processing did the author
forget? If you can't answer this, run
`test_date_helper_doesnt_work_as_intended` from
[exercises/d_laziness/test_laziness.py]. Now write an alternative to the
`convert_date` function that does what the author intended.

### Common business case 1: cleaning data

Using the information from the videos, prepare a sizeable dataset for
storage in "the clean zone" of a data lake by implementing the `clean`
function of [exercises/h_cleansers/clean_flights_starter.py].

### Cataloging your datasets

To prevent your code from having hardcoded links to datasets everywhere,
create a simple catalog and a convenience function to load data by
referencing this catalog. You have a template in
[exercises/i_catalog/catalog_starter.py].
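
As a rough sketch of the idea (the names and layout here are made up; the
template defines the real structure), a catalog can be as simple as a
dictionary from dataset names to storage details, plus one function that
resolves a name:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    format: str
    path: str


# Hypothetical entries; real paths depend on your data lake layout.
catalog = {
    "raw_flights": DatasetEntry(format="csv", path="data/raw/flights"),
    "clean_flights": DatasetEntry(format="parquet", path="data/clean/flights"),
}


def load_frame(spark, name: str):
    # Resolving names in one place keeps hardcoded paths out of the
    # transformation code.
    entry = catalog[name]
    return spark.read.format(entry.format).load(entry.path)
```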

Once done, revisit [exercises/h_cleansers/clean_flights_starter.py] and
replace the call that loads the dataset with your new catalog helpers.

Adapt the `import` statements in [exercises/h_cleansers/clean_airports.py]
and [exercises/h_cleansers/clean_carriers.py] and execute these files with
the Python interpreter. Pay attention to where the data is being stored.

### Peer review

In groups, discuss the improvements one could make to
[exercises/l_code_review/bingewatching.py].

### Common business case 2: report generation

Create a complete view of the flights data in which you combine the airline
carriers (a dimension table), the airport names (another dimension table),
and the flights table (a fact table).

Your manager wants to know how many flights were operated by American
Airlines in 2011.

How many of those flights arrived with at most 10 minutes of delay?

A data scientist is looking for correlations between departure delays and
dates. In particular, they suspect that on Fridays relatively more flights
depart with a delay than on any other day of the week. Verify their claim.
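
Note that the claim concerns the *share* of delayed flights, not an absolute
count: Fridays may simply have more flights overall. The arithmetic, sketched
over toy rows in plain Python (in the exercise you would compute the same
ratio with a grouped aggregation):

```python
from collections import Counter

# Toy (day_of_week, departure_delay_in_minutes) rows; the real values
# come from the flights table.
flights = [
    ("Fri", 25), ("Fri", 0), ("Fri", 5),
    ("Mon", 0), ("Mon", 0), ("Mon", 12),
]

totals = Counter(day for day, _ in flights)
delayed = Counter(day for day, delay in flights if delay > 0)

# Share of delayed departures per day of the week.
delay_share = {day: delayed[day] / totals[day] for day in totals}
```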

Out of the 5 categories of delay sources, which one appeared most often in
2011? In other words, in which category should we invest more time to
improve?


[this gitpod]: https://gitpod.io/#https://github.com/oliverw1/summerschoolsept
[gitpod logo]: https://gitpod.io/button/open-in-gitpod.svg
[Data Minded Academy]: https://www.dataminded.academy/