1 | 1 | {
2 | 2 | "cells": [
| 3 | + {
| 4 | + "cell_type": "markdown",
| 5 | + "id": "f6515406-dc52-4a2b-9ae8-99fff7773146",
| 6 | + "metadata": {},
| 7 | + "source": [
| 8 | + "## Preliminaries\n",
| 9 | + "We can first output some versions that are running and read the minio credentials from the secret that been mounted." |
| 10 | + ]
| 11 | + },
3 | 12 | {
4 | 13 | "cell_type": "code",
5 | 14 | "execution_count": null,
6 | | - "id": "b8f37284",
| 15 | + "id": "f0705d7d-d93b-4e3b-bd49-2b6696ddc5be",
7 | 16 | "metadata": {},
8 | 17 | "outputs": [],
9 | 18 | "source": [
10 | | - "# Output notebook versions\n",
11 | 19 | "! python3 -V\n",
12 | 20 | "! java --version\n",
13 | 21 | "! pyspark --version"

30 | 38 | " minio_pwd = f.read().strip()"
31 | 39 | ]
32 | 40 | },
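
The start of this cell is elided in the diff; only the final line of the secret read survives. As a rough sketch, reading credentials from a mounted Kubernetes secret might look like the following (the mount paths and the `minio_user` variable are assumptions, not taken from the diff):

```python
# Sketch only: read MinIO credentials from files mounted from a Kubernetes
# secret. The mount paths below are assumed; only the last line is in the diff.
with open("/minio-credentials/user") as f:
    minio_user = f.read().strip()
with open("/minio-credentials/password") as f:
    minio_pwd = f.read().strip()
```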
| 41 | + {
| 42 | + "cell_type": "markdown",
| 43 | + "id": "c264da0f-6ac7-4dc6-b834-00533676bab6",
| 44 | + "metadata": {},
| 45 | + "source": [
| 46 | + "## Spark\n",
| 47 | + "Spark can be used in client mode (recommended for JupyterHub notebooks, as code is intended to be called in an interactive\n",
| 48 | + "fashion), which is the default, or in cluster mode. This notebook uses Spark in client mode, meaning that the notebook itself\n",
| 49 | + "acts as the driver. It is important that the versions of Spark and Python match across the driver (running in the JupyterHub image)\n",
| 50 | + "and the executor(s) (running in a separate image, specified below with the `spark.kubernetes.container.image` setting).\n",
| 51 | + "\n",
| 52 | + "The JupyterHub image `quay.io/jupyter/pyspark-notebook:spark-3.5.2` appears to be based on the official Spark image, as the\n",
| 53 | + "Java versions match exactly. Python versions can differ at patch level, and the image used below, `oci.stackable.tech/sandbox/spark:3.5.2-python311`,\n",
| 54 | + "is built from a `spark:3.5.2-scala2.12-java17-ubuntu` base image with Python 3.11 (the same major/minor version as the notebook) installed.\n",
| 55 | + "\n",
| 56 | + "## S3\n",
| 57 | + "As we will be reading data from an S3 bucket, we need to add the necessary `hadoop` and `aws` libraries, matching the Hadoop\n",
| 58 | + "version of the notebook image (see `spark.jars.packages` below)."
| 59 | + ]
| 60 | + },
33 | 61 | {
34 | 62 | "cell_type": "code",
35 | 63 | "execution_count": null,

70 | 98 | ")"
71 | 99 | ]
72 | 100 | },
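
The body of this session cell is elided above (only the closing parenthesis survives). As a rough sketch of a client-mode session with S3A access along the lines the markdown cell describes — the endpoint, credential variable names, and library versions here are assumptions, not taken from the diff:

```python
# Sketch only: a client-mode SparkSession with S3A access, as described in the
# markdown cell above. The endpoint, credential variables, and the hadoop-aws/
# aws-sdk versions are assumptions, not taken from the diff.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("notebook")
    .config("spark.kubernetes.container.image",
            "oci.stackable.tech/sandbox/spark:3.5.2-python311")
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:3.3.4,"
            "com.amazonaws:aws-java-sdk-bundle:1.12.262")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")  # assumed endpoint
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.access.key", minio_user)  # assumed variable
    .config("spark.hadoop.fs.s3a.secret.key", minio_pwd)   # read from the secret
    .getOrCreate()
)
```

The `hadoop-aws` artifact is pinned here to 3.3.4 on the assumption that it must match the Hadoop version bundled with Spark 3.5.2; verify against the actual notebook image.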
| 101 | + {
| 102 | + "cell_type": "markdown",
| 103 | + "id": "eb08096d-1f7a-4c95-8807-aca76290cdfa",
| 104 | + "metadata": {},
| 105 | + "source": [
| 106 | + "### Create an in-memory DataFrame\n",
| 107 | + "This checks that the libraries used by the driver and the executor(s) are compatible."
| 108 | + ]
| 109 | + },
73 | 110 | {
74 | 111 | "cell_type": "code",
75 | 112 | "execution_count": null,

81 | 118 | "df.show()"
82 | 119 | ]
83 | 120 | },
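
The construction of `df` in this cell is elided; a minimal sketch of the kind of smoke test described (the schema and rows are invented for illustration):

```python
# Sketch only: a tiny in-memory DataFrame as a driver/executor smoke test.
# The schema and rows are invented; only df.show() appears in the diff.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()
```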
| 121 | + {
| 122 | + "cell_type": "markdown",
| 123 | + "id": "469988e4-1057-49f6-8c8f-93743c4a6839",
| 124 | + "metadata": {},
| 125 | + "source": [
| 126 | + "### Check S3 with pyarrow\n",
| 127 | + "As well as Spark, we can inspect S3 buckets with the `pyarrow` library."
| 128 | + ]
| 129 | + },
84 | 130 | {
85 | 131 | "cell_type": "code",
86 | 132 | "execution_count": null,

98 | 144 | ]
99 | 145 | },
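
The pyarrow cell body is likewise elided. A sketch of inspecting the bucket with `pyarrow.fs.S3FileSystem` (the endpoint and bucket layout are assumptions, based on the S3 paths used elsewhere in the notebook):

```python
# Sketch only: list objects in the demo bucket with pyarrow's S3 filesystem.
# The endpoint and credential variables mirror the assumed Spark settings above.
from pyarrow import fs

s3 = fs.S3FileSystem(
    access_key=minio_user,   # assumed variable
    secret_key=minio_pwd,
    endpoint_override="minio:9000",  # assumed MinIO service endpoint
    scheme="http",
)
for info in s3.get_file_info(fs.FileSelector("demo/gas-sensor/raw/", recursive=True)):
    print(info.path, info.size)
```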
100 | 146 | {
101 | | - "cell_type": "code",
102 | | - "execution_count": null,
103 | | - "id": "bc35f4d3",
| 147 | + "cell_type": "markdown",
| 148 | + "id": "1b3e3331-5587-40c5-8a38-a1c3527bb25a",
104 | 149 | "metadata": {},
105 | | - "outputs": [],
106 | 150 | "source": [
107 | | - "df = spark.read.csv(\"s3a://demo/gas-sensor/raw/\", header = True)\n",
108 | | - "df.show()"
| 151 | + "### Read/Write operations"
109 | 152 | ]
110 | 153 | },
111 | 154 | {
112 | 155 | "cell_type": "code",
113 | 156 | "execution_count": null,
114 | | - "id": "943f77f6",
| 157 | + "id": "bc35f4d3",
115 | 158 | "metadata": {},
116 | 159 | "outputs": [],
117 | 160 | "source": [
118 | | - "df.count()"
| 161 | + "df = spark.read.csv(\"s3a://demo/gas-sensor/raw/\", header = True)\n",
| 162 | + "df.show()"
119 | 163 | ]
120 | 164 | },
121 | 165 | {

166 | 210 | "source": [
167 | 211 | "dfs.write.parquet(\"s3a://demo/gas-sensor/agg/\", mode=\"overwrite\")"
168 | 212 | ]
| 213 | + },
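
The cells that build `dfs` fall outside this diff; judging by the `raw/` to `agg/` paths, it is presumably an aggregation of `df`. A hypothetical sketch (the column names are invented):

```python
# Sketch only: 'dfs' is created in cells not shown in this diff; presumably an
# aggregation of the raw gas-sensor data. Column names here are invented.
from pyspark.sql import functions as F

dfs = df.groupBy("sensor_id").agg(F.avg("reading").alias("avg_reading"))
```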
| 214 | + {
| 215 | + "cell_type": "markdown",
| 216 | + "id": "94d38d8f-57f7-4629-a4d2-e28142cc6a68",
| 217 | + "metadata": {},
| 218 | + "source": [
| 219 | + "### Convert between Spark and Pandas DataFrames"
| 220 | + ]
| 221 | + },
| 222 | + {
| 223 | + "cell_type": "code",
| 224 | + "execution_count": null,
| 225 | + "id": "24d68a92-c104-4cc6-9a89-c052324ba1fd",
| 226 | + "metadata": {},
| 227 | + "outputs": [],
| 228 | + "source": [
| 229 | + "df_pandas = dfs.toPandas()\n",
| 230 | + "df_pandas.head(10)"
| 231 | + ]
| 232 | + },
| 233 | + {
| 234 | + "cell_type": "code",
| 235 | + "execution_count": null,
| 236 | + "id": "128628ff-f2c7-4a04-8c1a-020b239e1158",
| 237 | + "metadata": {},
| 238 | + "outputs": [],
| 239 | + "source": [
| 240 | + "spark_df = spark.createDataFrame(df_pandas)\n",
| 241 | + "spark_df.show()"
| 242 | + ]
169 | 243 | }
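
One note on the conversion cells above: `toPandas()` collects the entire DataFrame onto the driver, so it only suits data that fits in driver memory. Spark's Arrow-based conversion (a standard Spark 3.x setting, not part of this commit) typically speeds up both directions:

```python
# Standard Spark 3.x setting (not shown in the diff): use Arrow to accelerate
# Spark <-> pandas conversions; Spark falls back to the non-Arrow path on error.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df_pandas = dfs.toPandas()                   # Spark -> pandas
spark_df = spark.createDataFrame(df_pandas)  # pandas -> Spark
```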
170 | 244 | ],
171 | 245 | "metadata": {

184 | 258 | "name": "python",
185 | 259 | "nbconvert_exporter": "python",
186 | 260 | "pygments_lexer": "ipython3",
187 | | - "version": "3.11.8"
| 261 | + "version": "3.11.10"
188 | 262 | }
189 | 263 | },
190 | 264 | "nbformat": 4,