Overview
Build a Docker image for running getml on Databricks Container Services. The image will support on-demand cluster execution for feature training, retraining, and interactive notebook workflows.
User Story
As a data engineer deploying getml on Databricks,
I want a production-ready Docker image for Databricks Container Services,
So that I can run getml feature engineering jobs natively within my Databricks clusters.
Technical Approach
1. Dockerfile
Create docker/databricks/Dockerfile extending the Databricks runtime:
# Extend Databricks standard runtime for LTS compatibility
FROM databricksruntime/standard:16.4-LTS
# Install uv for fast, reliable dependency management
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
# Copy dependency files and install with uv
COPY pyproject.toml uv.lock ./
RUN uv pip install --system -r pyproject.toml
# Download getml engine
ARG GETML_VERSION
RUN GETML_VERSION=${GETML_VERSION:-$(pip show getml | grep Version | cut -d' ' -f2 || echo "1.5.0")} && \
mkdir -p /opt/getml/.getML && \
curl -L "https://go.getml.com/static/demo/download/${GETML_VERSION}/getml-${GETML_VERSION}-x64-linux.tar.gz" | \
tar -C /opt/getml/.getML -xzf -
# Copy application code (optional, can mount from workspace)
COPY . /opt/getml/app/
# IMPORTANT: CMD and ENTRYPOINT are IGNORED by Databricks
# Databricks controls execution; use init scripts for startup tasks
Key design decisions:
- No CMD/ENTRYPOINT: Databricks ignores Docker execution primitives entirely
- uv for dependencies: faster installs and reproducible, lockfile-based resolution compared to pip
- Extends official runtime: Required for Databricks compatibility (includes Spark, JDK, etc.)
- Engine in /opt/getml: Accessible system-wide for all users
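Before pinning a version into the image, it can help to confirm that the engine tarball for that version actually exists. A minimal sketch using only the Python standard library; the URL pattern is copied from the Dockerfile's curl step, and 1.5.0 is the fallback version used there:
import urllib.request

def engine_url(version: str) -> str:
    # Same download URL pattern as the Dockerfile's curl step above
    return (f"https://go.getml.com/static/demo/download/"
            f"{version}/getml-{version}-x64-linux.tar.gz")

# HEAD request: confirm the tarball exists without downloading it
req = urllib.request.Request(engine_url("1.5.0"), method="HEAD")
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.headers.get("Content-Length"))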
2. Init Script
Create docker/databricks/getml-init.sh for cluster startup.
Why init scripts? Databricks ignores Docker ENTRYPOINT/CMD. Init scripts run after container creation but before the cluster becomes operational.
#!/bin/bash
# getml-init.sh - Databricks init script for getml setup
# This runs on every cluster node at startup
set -e
# Configure getml engine location
export GETML_HOME="/opt/getml/.getML"
export PATH="${GETML_HOME}:${PATH}"
# Persist environment variables for notebooks
echo "export GETML_HOME=${GETML_HOME}" >> /etc/profile.d/getml.sh
echo "export PATH=${GETML_HOME}:\${PATH}" >> /etc/profile.d/getml.sh
# Create project directory on DBFS (shared storage)
mkdir -p /dbfs/getml/projects
# Log initialization
echo "========================================="
echo "getml initialized successfully"
echo "Engine: ${GETML_HOME}"
echo "Projects: /dbfs/getml/projects"
echo "========================================="3. GitHub Actions Workflow
Create .github/workflows/databricks-docker.yml:
name: Build and Push Databricks Docker Image
on:
  release:
    types: [published]
  workflow_dispatch:
    inputs:
      version:
        description: 'getml version'
        required: true
jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Build and Push
        run: |
          VERSION=${{ github.event.inputs.version || github.ref_name }}
          IMAGE="getml/getml-databricks"
          docker build \
            --build-arg GETML_VERSION=${VERSION} \
            -t ${IMAGE}:${VERSION} \
            -t ${IMAGE}:latest \
            -f docker/databricks/Dockerfile .
          docker push ${IMAGE}:${VERSION}
          docker push ${IMAGE}:latest
4. Alternative: GCR Workflow
For GCP-based Databricks workspaces using Google Container Registry:
- name: Authenticate to Google Cloud
  uses: google-github-actions/auth@v2
  with:
    credentials_json: ${{ secrets.GCP_SA_KEY }}
- name: Configure Docker for GCR
  run: gcloud auth configure-docker gcr.io
- name: Build and Push to GCR
  run: |
    VERSION=${{ github.event.inputs.version || github.ref_name }}
    IMAGE="gcr.io/${{ secrets.GCP_PROJECT_ID }}/getml-databricks"
    docker build \
      --build-arg GETML_VERSION=${VERSION} \
      -t ${IMAGE}:${VERSION} \
      -t ${IMAGE}:latest \
      -f docker/databricks/Dockerfile .
    docker push ${IMAGE}:${VERSION}
    docker push ${IMAGE}:latest
5. Deployment Documentation
Create docker/databricks/README.md with:
- Prerequisites (workspace admin, container services enabled)
- Cluster configuration steps
- Init script installation
- Usage examples
- Troubleshooting guide
Files to create:
docker/databricks/
├── Dockerfile
├── pyproject.toml # Dependencies (uv)
├── getml-init.sh # Cluster init script
└── README.md
.github/workflows/
└── databricks-docker.yml
Cluster Configuration
To use the custom container on Databricks:
1. Enable Container Services (workspace admin):
   - Admin Console > Advanced > Enable Databricks Container Services
2. Create cluster with custom image (Clusters API):
   {
     "cluster_name": "getml-cluster",
     "spark_version": "16.4.x-scala2.12",
     "docker_image": {
       "url": "getml/getml-databricks:latest"
     },
     "init_scripts": [
       {
         "workspace": {
           "destination": "/Shared/init-scripts/getml-init.sh"
         }
       }
     ]
   }
3. Or via UI:
   - Create Cluster > Docker > Use your own Docker container
   - Enter image URL: getml/getml-databricks:latest
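Once the cluster is up, a notebook cell can verify that the image and init script did their job. A minimal smoke test using only the Python standard library; paths are taken from the Dockerfile and init script above (note that notebook processes may not source /etc/profile.d, hence the fallback):
import os
import pathlib

# Fall back to the image's engine location if the profile.d export
# is not visible to the notebook process
getml_home = pathlib.Path(os.environ.get("GETML_HOME", "/opt/getml/.getML"))
print("GETML_HOME:", getml_home)
print("Engine files present:", getml_home.is_dir() and any(getml_home.iterdir()))
print("Shared project dir:", pathlib.Path("/dbfs/getml/projects").is_dir())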
Acceptance Criteria
- Dockerfile builds successfully extending databricksruntime
- Image pushes to Docker Hub (or GCR) via CI/CD
- Init script runs successfully on cluster startup
- getml engine starts and can process data
- Feature training workflow completes successfully
- Feature retraining workflow completes successfully
- Documentation covers setup and usage
Testing
- Local build test: docker build -t getml-databricks .
- Cluster test: Create cluster with custom image
- Training test: Run sample feature training notebook
- Retraining test: Verify retraining with new data
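For the training and retraining tests, a notebook cell can drive the engine roughly as follows. This is only a sketch: getml.engine.launch, set_project, and shutdown are part of the getml Python API, but the home_directory argument and the project name are assumptions to verify against the getml docs; the actual pipeline comes from the sample training notebook:
import getml

# Point the engine at the location baked into the image
# (home_directory is an assumption; check getml.engine.launch's signature)
getml.engine.launch(launch_browser=False, home_directory="/opt/getml")
getml.engine.set_project("databricks_training_test")  # hypothetical project name

# ... run the sample training pipeline here; for the retraining test,
# reload the pipeline and fit it again on new data ...

getml.engine.shutdown()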
Constraints & Limitations
- CMD/ENTRYPOINT IGNORED: Databricks completely ignores Docker execution primitives - use init scripts for startup tasks
- Must extend official runtime: Custom images must extend databricksruntime/* base images
- IP range conflict: Avoid using 172.17.0.0/16 in container networking
- Not supported on: Standard access mode, ML Runtime, AWS Graviton instances
- Rate limits: Docker Hub has pull rate limits; use GCR for high-volume usage
- Init script execution: Runs after container creation, before cluster is operational
Dependencies
- Databricks workspace with Container Services enabled (admin setting)
- Container registry (Docker Hub, GCR, ECR, or ACR)
- Docker CLI for local testing
Example Usage
# Build locally
docker build -t getml-databricks -f docker/databricks/Dockerfile .
# Push to Docker Hub
docker tag getml-databricks getml/getml-databricks:latest
docker push getml/getml-databricks:latest
# Configure cluster via Databricks UI
# Create Cluster > Docker > Use your own Docker container
# Image URL: getml/getml-databricks:latest
# Init script: /Shared/init-scripts/getml-init.sh
Implementation Breakdown
This feature will be decomposed into the following tasks (to be created as sub-issues):
- Create Dockerfile extending databricksruntime/standard
- Create pyproject.toml with getml dependencies
- Create getml-init.sh cluster init script
- Create GitHub Actions workflow for CI/CD (Docker Hub)
- Create alternative GCR workflow (optional)
- Write deployment documentation (README.md)
- Test end-to-end on Databricks cluster
Related Issues
- Parent: Initiative - Native Docker Images
- Parallel: Snowflake Docker image (Build Snowflake Docker image for Snowpark Container Services #47)
- Supports: Databricks notebook (Build Databricks Feature Store integration notebook #44)