add some notebook comments
adwk67 committed Feb 26, 2025
1 parent 9d431b5 commit 79fdb3b
Showing 2 changed files with 85 additions and 19 deletions.
8 changes: 0 additions & 8 deletions stacks/jupyterhub-keycloak/jupyterhub.yaml
@@ -64,14 +64,6 @@ options:
        for container in pod.spec.containers:
            container.security_context = None
        # JupyterHub adds NET_ADMIN settings, which we don't need
-       #retain_init_containers = []
-       #for init_container in pod.spec.init_containers:
-       #    # retain just the download init container defined below
-       #    if init_container.name == "download-notebook":
-       #        init_container.security_context = None
-       #        retain_init_containers.append(init_container)
        return pod
    c.KubeSpawner.modify_pod_hook = modify_pod_hook
96 changes: 85 additions & 11 deletions stacks/jupyterhub-keycloak/process-s3.ipynb
@@ -1,13 +1,21 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "f6515406-dc52-4a2b-9ae8-99fff7773146",
"metadata": {},
"source": [
"## Preliminaries\n",
"We can first output some versions that are running and read the minio credentials from the secret that been mounted."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b8f37284",
"id": "f0705d7d-d93b-4e3b-bd49-2b6696ddc5be",
"metadata": {},
"outputs": [],
"source": [
"# Output notebook versions\n",
"! python3 -V\n",
"! java --version\n",
"! pyspark --version"
@@ -30,6 +38,26 @@
" minio_pwd = f.read().strip()"
]
},
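The body of the credentials cell is folded in the diff above; a minimal sketch of what reading the mounted secret might look like, with the mount paths assumed for illustration (only the final `minio_pwd` line is visible in the diff):

```python
# Sketch only: read MinIO credentials from files mounted into the pod.
# The mount path is an assumption, not taken from the folded cell.
with open("/minio-s3-credentials/accessKey") as f:   # hypothetical path
    minio_user = f.read().strip()
with open("/minio-s3-credentials/secretKey") as f:   # hypothetical path
    minio_pwd = f.read().strip()
```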
{
"cell_type": "markdown",
"id": "c264da0f-6ac7-4dc6-b834-00533676bab6",
"metadata": {},
"source": [
"## Spark\n",
"Spark can be used in Client mode (recommended for JupyterHub notebooks, as code is intended to be called in an interactive\n",
"fashion), which is the default, or Cluster mode. This notebook uses spark in client mode, meaning that the notebook itself\n",
"acts as the driver. It is important that the versions of spark and python match across the driver (running in the juypyterhub image)\n",
"and the executor(s) (running in a separate image, specified below with the `spark.kubernetes.container.image` setting.\n",
"\n",
"The jupyterhub image quay.io/jupyter/pyspark-notebook:spark-3.5.2 appears to be based off the official spark image, as the versions \n",
"of java match exactly. Python versions can differ at patch level, and the image used below `oci.stackable.tech/sandbox/spark:3.5.2-python311`\n",
"is built from a `spark:3.5.2-scala2.12-java17-ubuntu` base image with python 3.11 (the same major/minor version as the notebook) installed.\n",
"\n",
"## S3\n",
"As we will be reading data from an S3 bucket, we need to add the necessary `hadoop` and `aws` libraries in the same hadoop version as the\n",
"notebook image (see `spark.jars.packages`)."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -70,6 +98,15 @@
")"
]
},
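The SparkSession construction itself is folded in the diff; a sketch of a client-mode session along the lines the markdown cell describes. The executor image is taken from the text above, while the package coordinates, the MinIO endpoint, and the credential variables are assumptions:

```python
from pyspark.sql import SparkSession

# Sketch of a client-mode SparkSession; the notebook acts as the driver.
spark = (
    SparkSession.builder
    .appName("process-s3")
    # executor image named in the markdown cell above
    .config("spark.kubernetes.container.image",
            "oci.stackable.tech/sandbox/spark:3.5.2-python311")
    # hadoop-aws/aws-sdk must match the notebook image's Hadoop; 3.3.4 is an assumption
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:3.3.4,"
            "com.amazonaws:aws-java-sdk-bundle:1.12.262")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")  # assumed endpoint
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.access.key", minio_user)   # from the credentials sketch
    .config("spark.hadoop.fs.s3a.secret.key", minio_pwd)
    .getOrCreate()
)
```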
{
"cell_type": "markdown",
"id": "eb08096d-1f7a-4c95-8807-aca76290cdfa",
"metadata": {},
"source": [
"### Create an in-memory DataFrame\n",
"This will check that libraries across driver and executor are compatible."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -81,6 +118,15 @@
"df.show()"
]
},
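Only the trailing `df.show()` of this cell survives the fold; creating a small in-memory DataFrame might look like this (the sample rows are illustrative, not from the folded cell):

```python
# Illustrative data: a tiny DataFrame that exercises driver/executor round-trips.
df = spark.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c")],
    schema=["id", "label"],
)
df.show()
```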
{
"cell_type": "markdown",
"id": "469988e4-1057-49f6-8c8f-93743c4a6839",
"metadata": {},
"source": [
"### Check s3 with pyarrow\n",
"As well as spark, we can inspect S3 buckets with the 'pyarrow' library."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -98,24 +144,22 @@
]
},
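The pyarrow cell is likewise folded; a sketch of listing the bucket with `pyarrow.fs.S3FileSystem`, with the endpoint and scheme assumed:

```python
from pyarrow import fs

# Sketch: list the demo bucket with pyarrow; endpoint/scheme are assumptions.
s3 = fs.S3FileSystem(
    access_key=minio_user,
    secret_key=minio_pwd,
    endpoint_override="minio:9000",  # assumed in-cluster MinIO endpoint
    scheme="http",
)
for info in s3.get_file_info(fs.FileSelector("demo/gas-sensor/raw/", recursive=True)):
    print(info.path, info.size)
```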
   {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "bc35f4d3",
+   "cell_type": "markdown",
+   "id": "1b3e3331-5587-40c5-8a38-a1c3527bb25a",
    "metadata": {},
-   "outputs": [],
    "source": [
-    "df = spark.read.csv(\"s3a://demo/gas-sensor/raw/\", header = True)\n",
-    "df.show()"
+    "### Read/Write operations"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "943f77f6",
+   "id": "bc35f4d3",
    "metadata": {},
    "outputs": [],
    "source": [
-    "df.count()"
+    "df = spark.read.csv(\"s3a://demo/gas-sensor/raw/\", header = True)\n",
+    "df.show()"
    ]
   },
{
@@ -166,6 +210,36 @@
"source": [
"dfs.write.parquet(\"s3a://demo/gas-sensor/agg/\", mode=\"overwrite\")"
]
},
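The aggregation that produces `dfs` sits in the folded lines above the write; a hypothetical example of such a step (the column names are invented, not from the folded cell):

```python
from pyspark.sql import functions as F

# Hypothetical aggregation over the gas-sensor data; real column names are folded.
dfs = (
    df.groupBy("sensor_id")                          # assumed column
      .agg(F.avg("reading").alias("avg_reading"))    # assumed column
)
dfs.write.parquet("s3a://demo/gas-sensor/agg/", mode="overwrite")
```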
{
"cell_type": "markdown",
"id": "94d38d8f-57f7-4629-a4d2-e28142cc6a68",
"metadata": {},
"source": [
"### Convert between Spark and Pandas DataFrames"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "24d68a92-c104-4cc6-9a89-c052324ba1fd",
"metadata": {},
"outputs": [],
"source": [
"df_pandas = dfs.toPandas()\n",
"df_pandas.head(10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "128628ff-f2c7-4a04-8c1a-020b239e1158",
"metadata": {},
"outputs": [],
"source": [
"spark_df = spark.createDataFrame(df_pandas)\n",
"spark_df.show()"
]
}
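For larger DataFrames, the conversions in the two cells above can be accelerated with Spark's Arrow-based transfer, a standard setting in Spark 3.x:

```python
# Optional: columnar Arrow transfer speeds up Spark <-> pandas conversion.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df_pandas = dfs.toPandas()                  # uses Arrow where possible
spark_df = spark.createDataFrame(df_pandas)
```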
],
"metadata": {
@@ -184,7 +258,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
"version": "3.11.10"
}
},
"nbformat": 4,
