
Commit 79fdb3b

add some notebook comments
1 parent 9d431b5 commit 79fdb3b

File tree

2 files changed: +85 -19 lines changed


stacks/jupyterhub-keycloak/jupyterhub.yaml

Lines changed: 0 additions & 8 deletions
@@ -64,14 +64,6 @@ options:
         for container in pod.spec.containers:
             container.security_context = None
 
-        # JupyterHub adds NET_ADMIN settings, which we don't need
-        #retain_init_containers = []
-        #for init_container in pod.spec.init_containers:
-        #    # retain just the download init container defined below
-        #    if init_container.name == "download-notebook":
-        #        init_container.security_context = None
-        #        retain_init_containers.append(init_container)
-
         return pod
 
     c.KubeSpawner.modify_pod_hook = modify_pod_hook
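
After this commit the hook reduces to clearing each container's security context. As a sketch, the full function might look as follows, assuming KubeSpawner's standard `modify_pod_hook(spawner, pod)` contract; only the lines visible in the hunk are taken from the commit:

def modify_pod_hook(spawner, pod):
    # Clear the security context JupyterHub sets on the notebook containers
    for container in pod.spec.containers:
        container.security_context = None
    return pod

c.KubeSpawner.modify_pod_hook = modify_pod_hook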

stacks/jupyterhub-keycloak/process-s3.ipynb

Lines changed: 85 additions & 11 deletions
@@ -1,13 +1,21 @@
 {
  "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "f6515406-dc52-4a2b-9ae8-99fff7773146",
+   "metadata": {},
+   "source": [
+    "## Preliminaries\n",
+    "We can first output some versions that are running and read the minio credentials from the secret that has been mounted."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "b8f37284",
+   "id": "f0705d7d-d93b-4e3b-bd49-2b6696ddc5be",
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Output notebook versions\n",
     "! python3 -V\n",
     "! java --version\n",
     "! pyspark --version"
@@ -30,6 +38,26 @@
     " minio_pwd = f.read().strip()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "c264da0f-6ac7-4dc6-b834-00533676bab6",
+   "metadata": {},
+   "source": [
+    "## Spark\n",
+    "Spark can be used in client mode (recommended for JupyterHub notebooks, as code is intended to be called in an interactive\n",
+    "fashion), which is the default, or cluster mode. This notebook uses Spark in client mode, meaning that the notebook itself\n",
+    "acts as the driver. It is important that the versions of Spark and Python match across the driver (running in the JupyterHub image)\n",
+    "and the executor(s) (running in a separate image, specified below with the `spark.kubernetes.container.image` setting).\n",
+    "\n",
+    "The JupyterHub image quay.io/jupyter/pyspark-notebook:spark-3.5.2 appears to be based on the official Spark image, as the versions\n",
+    "of Java match exactly. Python versions can differ at patch level, and the image used below, `oci.stackable.tech/sandbox/spark:3.5.2-python311`,\n",
+    "is built from a `spark:3.5.2-scala2.12-java17-ubuntu` base image with Python 3.11 (the same major/minor version as the notebook) installed.\n",
+    "\n",
+    "## S3\n",
+    "As we will be reading data from an S3 bucket, we need to add the necessary `hadoop` and `aws` libraries in the same Hadoop version as the\n",
+    "notebook image (see `spark.jars.packages`)."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -70,6 +98,15 @@
     ")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "eb08096d-1f7a-4c95-8807-aca76290cdfa",
+   "metadata": {},
+   "source": [
+    "### Create an in-memory DataFrame\n",
+    "This will check that libraries across driver and executor are compatible."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -81,6 +118,15 @@
     "df.show()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "469988e4-1057-49f6-8c8f-93743c4a6839",
+   "metadata": {},
+   "source": [
+    "### Check S3 with pyarrow\n",
+    "As well as Spark, we can inspect S3 buckets with the `pyarrow` library."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -98,24 +144,22 @@
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "bc35f4d3",
+   "cell_type": "markdown",
+   "id": "1b3e3331-5587-40c5-8a38-a1c3527bb25a",
    "metadata": {},
-   "outputs": [],
    "source": [
-    "df = spark.read.csv(\"s3a://demo/gas-sensor/raw/\", header = True)\n",
-    "df.show()"
+    "### Read/Write operations"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "943f77f6",
+   "id": "bc35f4d3",
    "metadata": {},
    "outputs": [],
    "source": [
-    "df.count()"
+    "df = spark.read.csv(\"s3a://demo/gas-sensor/raw/\", header = True)\n",
+    "df.show()"
    ]
   },
   {
@@ -166,6 +210,36 @@
    "source": [
     "dfs.write.parquet(\"s3a://demo/gas-sensor/agg/\", mode=\"overwrite\")"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "94d38d8f-57f7-4629-a4d2-e28142cc6a68",
+   "metadata": {},
+   "source": [
+    "### Convert between Spark and Pandas DataFrames"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "24d68a92-c104-4cc6-9a89-c052324ba1fd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_pandas = dfs.toPandas()\n",
+    "df_pandas.head(10)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "128628ff-f2c7-4a04-8c1a-020b239e1158",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "spark_df = spark.createDataFrame(df_pandas)\n",
+    "spark_df.show()"
+   ]
   }
  ],
  "metadata": {
@@ -184,7 +258,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.11.8"
+   "version": "3.11.10"
   }
  },
  "nbformat": 4,
