feat: Jupyterhub with keycloak, spark and s3 #155

Merged: 39 commits, Mar 4, 2025 (diff shown from 36 commits)

Commits:
8f41ef3
initial keycloak setup
adwk67 Nov 19, 2024
eaf555c
wip: jupyterhub + keycloak
adwk67 Nov 19, 2024
34fb370
wip
adwk67 Nov 19, 2024
54fc0dc
wip: certificates work but callback does not
adwk67 Nov 29, 2024
ae34b6a
wip: various tweaks
adwk67 Jan 9, 2025
34c138d
added some temp docs
adwk67 Jan 15, 2025
05ad6d6
add login info
adwk67 Jan 15, 2025
e16da8c
added some readme info
adwk67 Feb 11, 2025
0f5dce1
corrected ingress secret, set python cacert explicitly
adwk67 Feb 18, 2025
0e3a28c
Merge branch 'main' into feat/keycloak-jupyterhub
adwk67 Feb 18, 2025
ca0c492
wip: working version
adwk67 Feb 19, 2025
f046dd8
clean-up realm-config
adwk67 Feb 19, 2025
e8eb2f9
delegate user check to Keycloak
adwk67 Feb 19, 2025
c1274e6
use demo-specific keycloak
adwk67 Feb 19, 2025
c41f309
removed unnecessary settings
adwk67 Feb 19, 2025
f6d22a9
specify ports
adwk67 Feb 20, 2025
697a0a8
add jupyterhub.yaml to stack
adwk67 Feb 20, 2025
396705f
wip: working nb/spark combo
adwk67 Feb 25, 2025
bc94e33
read/write from s3
adwk67 Feb 25, 2025
53132a8
remove driver service resource in favour of the ones produced dynamic…
adwk67 Feb 25, 2025
803e520
use secret for minio credentials, add demo entry
adwk67 Feb 25, 2025
bcfa3ae
set endpoints via extra config
adwk67 Feb 26, 2025
0ff07da
mount notebook
adwk67 Feb 26, 2025
9d431b5
user-specific job name
adwk67 Feb 26, 2025
79fdb3b
add some notebook comments
adwk67 Feb 26, 2025
b021d35
typos and add password to stack
adwk67 Feb 26, 2025
d3added
first draft of demo docs
adwk67 Feb 27, 2025
9c7298e
typo, fixed title
adwk67 Feb 27, 2025
7c497ee
added hdfs write/read steps
adwk67 Feb 28, 2025
573f812
updated docs
adwk67 Feb 28, 2025
884f0bf
doc cleanup
adwk67 Feb 28, 2025
44fad51
Merge branch 'main' into feat/keycloak-jupyterhub
adwk67 Feb 28, 2025
3d4484c
Apply suggestions from code review
adwk67 Mar 3, 2025
80bd2c6
review suggestions: remove HDFS, improve docs and server options
adwk67 Mar 3, 2025
44b0ecf
Update docs/modules/demos/pages/jupyterhub-keycloak.adoc
adwk67 Mar 3, 2025
49d47e0
Update docs/modules/demos/pages/jupyterhub-keycloak.adoc
adwk67 Mar 3, 2025
5a2c6cf
Update docs/modules/demos/pages/jupyterhub-keycloak.adoc
adwk67 Mar 3, 2025
0eb1ac7
Update docs/modules/demos/pages/jupyterhub-keycloak.adoc
adwk67 Mar 3, 2025
7be5288
added a note about proxy reachability
adwk67 Mar 4, 2025
demos/demos-v2.yaml (17 additions, 0 deletions)
@@ -226,3 +226,20 @@ demos:
      cpu: "3"
      memory: 5098Mi
      pvc: 16Gi
  jupyterhub-keycloak:
    description: Demo showing jupyterhub notebooks secured with keycloak
    documentation: https://docs.stackable.tech/stackablectl/stable/demos/jupyterhub-keycloak.html
    stackableStack: jupyterhub-keycloak
    labels:
      - jupyterhub
      - keycloak
      - spark
      - S3
    manifests:
      # TODO: revert paths
      - plainYaml: demos/jupyterhub-keycloak/load-gas-data.yaml
    supportedNamespaces: []
    resourceRequests:
      cpu: 6400m
      memory: 12622Mi
      pvc: 20Gi
demos/jupyterhub-keycloak/load-gas-data.yaml (21 additions, 0 deletions)
@@ -0,0 +1,21 @@
---
apiVersion: batch/v1
kind: Job
metadata:
  name: load-gas-data
spec:
  template:
    spec:
      containers:
        - name: load-gas-data
          image: "bitnami/minio:2022-debian-10"
          command: ["bash", "-c", "cd /tmp; curl -O https://repo.stackable.tech/repository/misc/datasets/gas-sensor-data/20160930_203718.csv && mc --insecure alias set minio http://minio:9000/ $(cat /minio-s3-credentials/accessKey) $(cat /minio-s3-credentials/secretKey) && mc cp 20160930_203718.csv minio/demo/gas-sensor/raw/;"]
          volumeMounts:
            - name: minio-s3-credentials
              mountPath: /minio-s3-credentials
      volumes:
        - name: minio-s3-credentials
          secret:
            secretName: minio-s3-credentials
      restartPolicy: OnFailure
  backoffLimit: 50
(Several binary files added in this PR cannot be rendered here — the screenshots referenced from the documentation below.)
docs/modules/demos/pages/jupyterhub-keycloak.adoc (189 additions, 0 deletions)
@@ -0,0 +1,189 @@
= jupyterhub-keycloak

:k8s-cpu: https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu
:spark-pkg: https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
:pyspark: https://spark.apache.org/docs/latest/api/python/getting_started/index.html
:jupyterhub-k8s: https://github.com/jupyterhub/zero-to-jupyterhub-k8s
:jupyterlab: https://jupyterlab.readthedocs.io/en/stable/
:jupyter: https://jupyter.org
:keycloak: https://www.keycloak.org/
:gas-sensor: https://archive.ics.uci.edu/dataset/487/gas+sensor+array+temperature+modulation

This demo showcases the integration between {jupyter}[JupyterHub] and {keycloak}[Keycloak] deployed on the Stackable Data Platform (SDP) onto a Kubernetes cluster.
{jupyterlab}[JupyterLab] is deployed using the {jupyterhub-k8s}[pyspark-notebook stack] provided by the Jupyter community.
A simple notebook is provided that shows how to start a distributed Spark cluster and read and write data from an S3 instance.

For this demo a small sample of {gas-sensor}[gas sensor measurements*] is provided.
Install this demo on an existing Kubernetes cluster:

[source,console]
----
$ stackablectl demo install jupyterhub-keycloak
----

WARNING: When running a distributed Spark cluster from within a JupyterHub notebook, the notebook acts as the driver and requests executor Pods from Kubernetes.
These Pods can in turn mount *all* volumes and Secrets in that namespace.
To prevent this from breaking user separation, it is planned to use OPA Gatekeeper with rules that restrict what the created executor Pods may mount. This is not yet implemented in this demo.

[#system-requirements]
== System requirements

To run this demo, your system needs at least:

* 8 {k8s-cpu}[cpu units] (core/hyperthread)
* 32GiB memory

You may need more resources depending on how many concurrent users are logged in, and which notebook profiles they are using.

== Aim / Context

This demo shows how to authenticate JupyterHub users against a Keycloak backend using JupyterHub's OAuthenticator.
The same users as in the xref:end-to-end-security.adoc[End-to-end-security] demo are configured in Keycloak and are used as examples here.
The notebook offers a simple template for using Spark to interact with S3 as a storage backend.

== Overview

This demo will:

* Install the required Stackable Data Platform operators
* Spin up the following data products:
** *JupyterHub*: A multi-user server for Jupyter notebooks
** *Keycloak*: An identity and access management product
** *S3*: A Minio instance for data storage
* Download a sample of the gas sensor dataset into S3
* Install the Jupyter notebook
* Demonstrate some basic data operations against S3
* Illustrate multi-user usage

== JupyterHub

Have a look at the available Pods before logging in:

[source,console]
----
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
hub-84f49ccbd7-29h7j 1/1 Running 0 56m
keycloak-544d757f57-f55kr 2/2 Running 0 57m
load-gas-data-m6z5p 0/1 Completed 0 54m
minio-5486d7584f-x2jn8 1/1 Running 0 57m
proxy-648bf7f45b-62vqg 1/1 Running 0 56m

----

The `proxy` Pod has an associated `proxy-public` service with a statically-defined port (31095), exposed with type NodePort.
In order to reach the JupyterHub web interface, navigate to this service.
The node port IP can be found in the ConfigMap `keycloak-address` (written by the Keycloak Deployment as it starts up).
In the example below that would then be 172.19.0.5:31095:

[source,yaml]
----
apiVersion: v1
data:
  keycloakAddress: 172.19.0.5:31093 # Keycloak itself
  keycloakNodeIp: 172.19.0.5 # can be used to access the proxy-public service
kind: ConfigMap
metadata:
  name: keycloak-address
  namespace: default
----
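
Assuming the ConfigMap shown above, the UI address can be assembled in one step. This is a sketch: the ConfigMap name and key come from this demo, and the port is the statically defined NodePort mentioned earlier.

```shell
# Build the JupyterHub URL from the keycloak-address ConfigMap.
NODE_IP=$(kubectl get configmap keycloak-address -o jsonpath='{.data.keycloakNodeIp}')
echo "JupyterHub UI: http://${NODE_IP}:31095"
```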

NOTE: The `hub` Pod may show a `CreateContainerConfigError` for a few moments on start-up as it requires the ConfigMap written by the Keycloak deployment.

You should see the JupyterHub login page, which indicates a redirect to the OAuth service (Keycloak):

image::jupyterhub-keycloak/oauth-login.png[]

Click on the sign-in button.
You will be redirected to the Keycloak login, where you can enter one of the aforementioned users (e.g. `justin.martin` or `isla.williams`: the password is the same as the username):

image::jupyterhub-keycloak/keycloak-login.png[]

A successful login will redirect you back to JupyterHub where different profiles are listed (the drop-down options are visible when you click on the respective fields):

image::jupyterhub-keycloak/server-options.png[]

The explorer window on the left includes a notebook that is already mounted.

Double-click on the file `notebook/process-s3.ipynb`:

image::jupyterhub-keycloak/load-nb.png[]

Run the notebook by selecting "Run All Cells" from the menu:

image::jupyterhub-keycloak/run-nb.png[]

The notebook includes comments on image compatibility: it uses a custom image, built from the official Spark image, that matches the Spark version used in the notebook.
The Java versions also match exactly.
Python versions need to match at the `major.minor` level, which is why Python 3.11 is used in the custom image.

Once the Spark executor has started (we specified `spark.executor.instances=1`), it spins up as an extra Pod.
The Spark job is named so as to incorporate the current user (justin-martin).
JupyterHub has started one Pod for the user's notebook instance (`jupyter-justin-martin---bdd3b4a1`) and another for the Spark executor (`process-s3-jupyter-justin-martin-bdd3b4a1-9e9da995473f481f-exec-1`):

[source,console]
----
12:49 $ kubectl get pods
NAME READY STATUS RESTARTS AGE
...
jupyter-justin-martin---bdd3b4a1 1/1 Running 0 17m
process-s3-jupyter-justin-martin-... 1/1 Running 0 2m9s
...
----
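
The user-specific job name can be derived from the `JUPYTERHUB_USER` environment variable that JupyterHub injects into each notebook Pod. A minimal sketch — the helper name and the exact naming pattern are illustrative, not the demo's actual code:

```python
import os
import re

def spark_app_name(base: str = "process-s3") -> str:
    """Build a per-user Spark app name from the JupyterHub-injected username."""
    user = os.environ.get("JUPYTERHUB_USER", "unknown")
    # Kubernetes object names must be lowercase alphanumerics and dashes,
    # so e.g. "justin.martin" becomes "justin-martin".
    safe_user = re.sub(r"[^a-z0-9-]", "-", user.lower())
    return f"{base}-jupyter-{safe_user}"

print(spark_app_name())
```

With `JUPYTERHUB_USER=justin.martin` this yields `process-s3-jupyter-justin-martin`, consistent with the executor Pod name shown above.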

Stop the kernel in the notebook (which shuts down the Spark session and thus the executor) and log out as the current user.
Now log in as `daniel.king`, and then again as `isla.williams` (you may need to do this in a clean browser session so that existing login cookies are removed).
The latter user is defined as an admin user in the JupyterHub configuration:

[source,yaml]
----
...
hub:
  config:
    Authenticator:
      # don't filter here: delegate to Keycloak
      allow_all: True
      admin_users:
        - isla.williams
...
----

You should now see user-specific Pods for all three users:

[source,console]
----
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
...
jupyter-daniel-king---181a80ce 1/1 Running 0 6m17s
jupyter-isla-williams---14730816 1/1 Running 0 4m50s
jupyter-justin-martin---bdd3b4a1 1/1 Running 0 3h47m
...
----

The admin user (`isla.williams`) will also have an extra Admin tab in the JupyterHub console where current users can be managed.
You can find this in the JupyterHub UI at http://<ip>:31095/hub/admin, e.g. http://172.19.0.5:31095/hub/admin:

image::jupyterhub-keycloak/admin-tab.png[]

You can inspect the S3 buckets by using `stackablectl stacklet list` to find the MinIO endpoint and logging in there with `admin/adminadmin`:

[source,console]
----
$ stackablectl stacklet list

┌─────────┬───────────────┬───────────┬───────────────────────────────┬────────────┐
│ PRODUCT ┆ NAME ┆ NAMESPACE ┆ ENDPOINTS ┆ CONDITIONS │
╞═════════╪═══════════════╪═══════════╪═══════════════════════════════╪════════════╡
│ minio ┆ minio-console ┆ default ┆ http http://172.19.0.5:32470 ┆ │
└─────────┴───────────────┴───────────┴───────────────────────────────┴────────────┘
----

image::jupyterhub-keycloak/s3-buckets.png[]

NOTE: If you attempt to re-run the notebook, you first need to remove the `_temporary` folders from the S3 bucket.
These are created by Spark jobs and are not removed when a job completes.
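
One way to clear them is with the MinIO client. This is a sketch: the `minio` alias and bucket path follow the load-gas-data Job above, and the exact `_temporary` prefix may differ in your bucket.

```shell
# Delete Spark's leftover _temporary prefix so the notebook can be re-run.
# Assumes the "minio" alias is configured as in the load-gas-data Job.
mc rm --recursive --force minio/demo/gas-sensor/raw/_temporary/ 2>/dev/null || true
echo "cleared _temporary prefix"
```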

*See: Burgués, Javier, Juan Manuel Jiménez-Soto, and Santiago Marco. "Estimation of the limit of detection in semiconductor gas sensors through linearized calibration models." Analytica Chimica Acta 1013 (2018): 13-25; and Burgués, Javier, and Santiago Marco. "Multivariate estimation of the limit of detection by orthogonal partial least squares in temperature-modulated MOX sensors." Analytica Chimica Acta 1019 (2018): 49-64.
docs/modules/demos/partials/demos.adoc (1 addition, 0 deletions)
@@ -2,6 +2,7 @@
* xref:data-lakehouse-iceberg-trino-spark.adoc[]
* xref:end-to-end-security.adoc[]
* xref:hbase-hdfs-load-cycling-data.adoc[]
* xref:jupyterhub-keycloak.adoc[]
* xref:jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data.adoc[]
* xref:logging.adoc[]
* xref:nifi-kafka-druid-earthquake-data.adoc[]
stacks/jupyterhub-keycloak/Dockerfile (29 additions, 0 deletions)
@@ -0,0 +1,29 @@
# docker build -t oci.stackable.tech/sandbox/spark:3.5.2-python311 -f Dockerfile .
# kind load docker-image oci.stackable.tech/sandbox/spark:3.5.2-python311 -n stackable-data-platform
# or:
# docker push oci.stackable.tech/sandbox/spark:3.5.2-python311

FROM spark:3.5.2-scala2.12-java17-ubuntu

USER root

RUN set -ex; \
    apt-get update; \
    # Install dependencies for Python 3.11
    apt-get install -y \
      software-properties-common \
    && apt-get update && apt-get install -y \
      python3.11 \
      python3.11-venv \
      python3.11-dev \
    && rm -rf /var/lib/apt/lists/*; \
    # Install pip manually for Python 3.11
    curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
    python3.11 get-pip.py && \
    rm get-pip.py

# Make Python 3.11 the default Python version
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 \
&& update-alternatives --install /usr/bin/pip pip /usr/local/bin/pip3 1

USER spark
stacks/jupyterhub-keycloak/jupyterhub-native-auth.yaml (71 additions, 0 deletions)
@@ -0,0 +1,71 @@
---
releaseName: jupyterhub
name: jupyterhub
repo:
  name: jupyterhub
  url: https://jupyterhub.github.io/helm-chart/
version: 4.0.0
options:
  hub:
    config:
      Authenticator:
        allow_all: True
        admin_users:
          - admin
      JupyterHub:
        authenticator_class: nativeauthenticator.NativeAuthenticator
      NativeAuthenticator:
        open_signup: true
  proxy:
    service:
      type: ClusterIP
  rbac:
    create: true
  prePuller:
    hook:
      enabled: false
    continuous:
      enabled: false
  scheduling:
    userScheduler:
      enabled: false
  singleuser:
    cmd: null
    serviceAccountName: hub
    networkPolicy:
      enabled: false
    extraLabels:
      stackable.tech/vendor: Stackable
    profileList:
      - display_name: "Default"
        description: "Default profile"
        default: true
        profile_options:
          cpu:
            display_name: CPU
            choices:
              "2":
                display_name: "2 request, 2 limit"
                kubespawner_override:
                  cpu_guarantee: 2
                  cpu_limit: 2
              "1 request, 16 limit":
                display_name: "1 request, 16 limit"
                kubespawner_override:
                  cpu_guarantee: 1
                  cpu_limit: 16
          memory:
            display_name: Memory
            choices:
              "8 GB":
                display_name: "8 GB"
                kubespawner_override:
                  mem_guarantee: "8G"
                  mem_limit: "8G"
          image:
            display_name: Image
            choices:
              "quay.io/jupyter/pyspark-notebook:python-3.11.9":
                display_name: "quay.io/jupyter/pyspark-notebook:python-3.11.9"
                kubespawner_override:
                  image: "quay.io/jupyter/pyspark-notebook:python-3.11.9"
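
Conceptually, each `kubespawner_override` block from the choices a user selects is applied on top of the spawner's defaults before the user Pod is created. A rough, dependency-free sketch of that merge — the default values and helper name here are illustrative, not KubeSpawner's actual implementation:

```python
def apply_overrides(spawner_defaults: dict, *overrides: dict) -> dict:
    """Merge kubespawner_override blocks onto spawner defaults (later wins)."""
    merged = dict(spawner_defaults)
    for override in overrides:
        merged.update(override)
    return merged

# Illustrative defaults, then the choices picked in the server-options form:
# CPU "2", Memory "8 GB", and the pyspark-notebook image.
defaults = {"cpu_guarantee": 1, "cpu_limit": 2, "mem_limit": "4G", "image": "default"}
result = apply_overrides(
    defaults,
    {"cpu_guarantee": 2, "cpu_limit": 2},
    {"mem_guarantee": "8G", "mem_limit": "8G"},
    {"image": "quay.io/jupyter/pyspark-notebook:python-3.11.9"},
)
print(result["mem_limit"])  # 8G
```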