1 | 1 | {
2 | 2 | "cells": [
| 3 | + {
| 4 | + "cell_type": "markdown",
| 5 | + "id": "f6515406-dc52-4a2b-9ae8-99fff7773146",
| 6 | + "metadata": {},
| 7 | + "source": [
| 8 | + "## Preliminaries\n",
| 9 | + "We can first output some versions that are running and read the minio credentials from the secret that been mounted." |
| 10 | + ]
| 11 | + },
3 | 12 | {
4 | 13 | "cell_type": "code",
5 | 14 | "execution_count": null,
6 | | - "id": "b8f37284",
| 15 | + "id": "f0705d7d-d93b-4e3b-bd49-2b6696ddc5be",
7 | 16 | "metadata": {},
8 | 17 | "outputs": [],
9 | 18 | "source": [
10 | | - "# Output notebook versions\n",
11 | 19 | "! python3 -V\n",
12 | 20 | "! java --version\n",
13 | 21 | "! pyspark --version"

30 | 38 | " minio_pwd = f.read().strip()"
31 | 39 | ]
32 | 40 | },
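
The start of this cell is elided in the diff; only the final line of the secret read survives. As a rough sketch, reading credentials from a mounted Kubernetes secret might look like the following (the mount paths and the `minio_user` variable are assumptions, not taken from the diff):

```python
# Sketch only: read MinIO credentials from files mounted from a Kubernetes
# secret. The mount paths below are assumed; only the last line is in the diff.
with open("/minio-credentials/user") as f:
    minio_user = f.read().strip()
with open("/minio-credentials/password") as f:
    minio_pwd = f.read().strip()
```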
| 41 | + {
| 42 | + "cell_type": "markdown",
| 43 | + "id": "c264da0f-6ac7-4dc6-b834-00533676bab6",
| 44 | + "metadata": {},
| 45 | + "source": [
| 46 | + "## Spark\n",
| 47 | + "Spark can be used in client mode (recommended for JupyterHub notebooks, as code is intended to be called in an interactive\n",
| 48 | + "fashion), which is the default, or in cluster mode. This notebook uses Spark in client mode, meaning that the notebook itself\n",
| 49 | + "acts as the driver. It is important that the versions of Spark and Python match across the driver (running in the JupyterHub image)\n",
| 50 | + "and the executor(s) (running in a separate image, specified below with the `spark.kubernetes.container.image` setting).\n",
| 51 | + "\n",
| 52 | + "The JupyterHub image `quay.io/jupyter/pyspark-notebook:spark-3.5.2` appears to be based on the official Spark image, as the\n",
| 53 | + "Java versions match exactly. Python versions can differ at patch level, and the image used below, `oci.stackable.tech/sandbox/spark:3.5.2-python311`,\n",
| 54 | + "is built from a `spark:3.5.2-scala2.12-java17-ubuntu` base image with Python 3.11 (the same major/minor version as the notebook) installed.\n",
| 55 | + "\n",
| 56 | + "## S3\n",
| 57 | + "As we will be reading data from an S3 bucket, we need to add the necessary `hadoop` and `aws` libraries, matching the Hadoop\n",
| 58 | + "version of the notebook image (see `spark.jars.packages` below)."
| 59 | + ]
| 60 | + },
33 | 61 | {
34 | 62 | "cell_type": "code",
35 | 63 | "execution_count": null,

70 | 98 | ")"
71 | 99 | ]
72 | 100 | },
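
The body of this session cell is elided above (only the closing parenthesis survives). As a rough sketch of a client-mode session with S3A access along the lines the markdown cell describes — the endpoint, credential variable names, and library versions here are assumptions, not taken from the diff:

```python
# Sketch only: a client-mode SparkSession with S3A access, as described in the
# markdown cell above. The endpoint, credential variables, and the hadoop-aws/
# aws-sdk versions are assumptions, not taken from the diff.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("notebook")
    .config("spark.kubernetes.container.image",
            "oci.stackable.tech/sandbox/spark:3.5.2-python311")
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:3.3.4,"
            "com.amazonaws:aws-java-sdk-bundle:1.12.262")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")  # assumed endpoint
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.access.key", minio_user)  # assumed variable
    .config("spark.hadoop.fs.s3a.secret.key", minio_pwd)   # read from the secret
    .getOrCreate()
)
```

The `hadoop-aws` artifact is pinned here to 3.3.4 on the assumption that it must match the Hadoop version bundled with Spark 3.5.2; verify against the actual notebook image.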
| 101 | + {
| 102 | + "cell_type": "markdown",
| 103 | + "id": "eb08096d-1f7a-4c95-8807-aca76290cdfa",
| 104 | + "metadata": {},
| 105 | + "source": [
| 106 | + "### Create an in-memory DataFrame\n",
| 107 | + "This checks that the libraries used by the driver and the executor(s) are compatible."
| 108 | + ]
| 109 | + },
73 | 110 | {
74 | 111 | "cell_type": "code",
75 | 112 | "execution_count": null,

81 | 118 | "df.show()"
82 | 119 | ]
83 | 120 | },
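
The construction of `df` in this cell is elided; a minimal sketch of the kind of smoke test described (the schema and rows are invented for illustration):

```python
# Sketch only: a tiny in-memory DataFrame as a driver/executor smoke test.
# The schema and rows are invented; only df.show() appears in the diff.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()
```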
| 121 | + {
| 122 | + "cell_type": "markdown",
| 123 | + "id": "469988e4-1057-49f6-8c8f-93743c4a6839",
| 124 | + "metadata": {},
| 125 | + "source": [
| 126 | + "### Check S3 with pyarrow\n",
| 127 | + "As well as Spark, we can inspect S3 buckets with the `pyarrow` library."
| 128 | + ]
| 129 | + },
84 | 130 | {
85 | 131 | "cell_type": "code",
86 | 132 | "execution_count": null,

98 | 144 | ]
99 | 145 | },
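
The pyarrow cell body is likewise elided. A sketch of inspecting the bucket with `pyarrow.fs.S3FileSystem` (the endpoint and bucket layout are assumptions, based on the S3 paths used elsewhere in the notebook):

```python
# Sketch only: list objects in the demo bucket with pyarrow's S3 filesystem.
# The endpoint and credential variables mirror the assumed Spark settings above.
from pyarrow import fs

s3 = fs.S3FileSystem(
    access_key=minio_user,   # assumed variable
    secret_key=minio_pwd,
    endpoint_override="minio:9000",  # assumed MinIO service endpoint
    scheme="http",
)
for info in s3.get_file_info(fs.FileSelector("demo/gas-sensor/raw/", recursive=True)):
    print(info.path, info.size)
```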
100 | 146 | {
101 | | - "cell_type": "code",
102 | | - "execution_count": null,
103 | | - "id": "bc35f4d3",
| 147 | + "cell_type": "markdown",
| 148 | + "id": "1b3e3331-5587-40c5-8a38-a1c3527bb25a",
104 | 149 | "metadata": {},
105 | | - "outputs": [],
106 | 150 | "source": [
107 | | - "df = spark.read.csv(\"s3a://demo/gas-sensor/raw/\", header = True)\n",
108 | | - "df.show()"
| 151 | + "### Read/Write operations"
109 | 152 | ]
110 | 153 | },
111 | 154 | {
112 | 155 | "cell_type": "code",
113 | 156 | "execution_count": null,
114 | | - "id": "943f77f6",
| 157 | + "id": "bc35f4d3",
115 | 158 | "metadata": {},
116 | 159 | "outputs": [],
117 | 160 | "source": [
118 | | - "df.count()"
| 161 | + "df = spark.read.csv(\"s3a://demo/gas-sensor/raw/\", header = True)\n",
| 162 | + "df.show()"
119 | 163 | ]
120 | 164 | },
121 | 165 | {

166 | 210 | "source": [
167 | 211 | "dfs.write.parquet(\"s3a://demo/gas-sensor/agg/\", mode=\"overwrite\")"
168 | 212 | ]
| 213 | + },
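
The cells that build `dfs` fall outside this diff; judging by the `raw/` to `agg/` paths, it is presumably an aggregation of `df`. A hypothetical sketch (the column names are invented):

```python
# Sketch only: 'dfs' is created in cells not shown in this diff; presumably an
# aggregation of the raw gas-sensor data. Column names here are invented.
from pyspark.sql import functions as F

dfs = df.groupBy("sensor_id").agg(F.avg("reading").alias("avg_reading"))
```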
| 214 | + {
| 215 | + "cell_type": "markdown",
| 216 | + "id": "94d38d8f-57f7-4629-a4d2-e28142cc6a68",
| 217 | + "metadata": {},
| 218 | + "source": [
| 219 | + "### Convert between Spark and Pandas DataFrames"
| 220 | + ]
| 221 | + },
| 222 | + {
| 223 | + "cell_type": "code",
| 224 | + "execution_count": null,
| 225 | + "id": "24d68a92-c104-4cc6-9a89-c052324ba1fd",
| 226 | + "metadata": {},
| 227 | + "outputs": [],
| 228 | + "source": [
| 229 | + "df_pandas = dfs.toPandas()\n",
| 230 | + "df_pandas.head(10)"
| 231 | + ]
| 232 | + },
| 233 | + {
| 234 | + "cell_type": "code",
| 235 | + "execution_count": null,
| 236 | + "id": "128628ff-f2c7-4a04-8c1a-020b239e1158",
| 237 | + "metadata": {},
| 238 | + "outputs": [],
| 239 | + "source": [
| 240 | + "spark_df = spark.createDataFrame(df_pandas)\n",
| 241 | + "spark_df.show()"
| 242 | + ]
169 | 243 | }
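
One note on the conversion cells above: `toPandas()` collects the entire DataFrame onto the driver, so it only suits data that fits in driver memory. Spark's Arrow-based conversion (a standard Spark 3.x setting, not part of this commit) typically speeds up both directions:

```python
# Standard Spark 3.x setting (not shown in the diff): use Arrow to accelerate
# Spark <-> pandas conversions; Spark falls back to the non-Arrow path on error.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df_pandas = dfs.toPandas()                   # Spark -> pandas
spark_df = spark.createDataFrame(df_pandas)  # pandas -> Spark
```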
170 | 244 | ],
171 | 245 | "metadata": {

184 | 258 | "name": "python",
185 | 259 | "nbconvert_exporter": "python",
186 | 260 | "pygments_lexer": "ipython3",
187 | | - "version": "3.11.8"
| 261 | + "version": "3.11.10"
188 | 262 | }
189 | 263 | },
190 | 264 | "nbformat": 4,