Build Databricks Docker image for Container Services #48

@srnnkls

Description

Overview

Build a Docker image for running getml on Databricks Container Services. The image will support on-demand cluster execution for feature training, retraining, and interactive notebook workflows.

User Story

As a data engineer deploying getml on Databricks,
I want a production-ready Docker image for Databricks Container Services,
So that I can run getml feature engineering jobs natively within my Databricks clusters.

Technical Approach

1. Dockerfile

Create docker/databricks/Dockerfile extending the Databricks runtime:

# Extend Databricks standard runtime for LTS compatibility
FROM databricksruntime/standard:16.4-LTS

# Install uv for fast, reliable dependency management
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

# Copy dependency files and install pinned dependencies with uv
# (export the lock file so the pinned versions are actually used)
COPY pyproject.toml uv.lock ./
RUN uv export --frozen --no-dev -o requirements.txt && \
    uv pip install --system -r requirements.txt

# Download getml engine
ARG GETML_VERSION
RUN GETML_VERSION=${GETML_VERSION:-$(pip show getml | grep Version | cut -d' ' -f2 || echo "1.5.0")} && \
    mkdir -p /opt/getml/.getML && \
    curl -L "https://go.getml.com/static/demo/download/${GETML_VERSION}/getml-${GETML_VERSION}-x64-linux.tar.gz" | \
    tar -C /opt/getml/.getML -xzf -

# Copy application code (optional, can mount from workspace)
COPY . /opt/getml/app/

# IMPORTANT: CMD and ENTRYPOINT are IGNORED by Databricks
# Databricks controls execution; use init scripts for startup tasks

Key design decisions:

  • No CMD/ENTRYPOINT: Databricks ignores Docker execution primitives entirely
  • uv for dependencies: Faster, more reliable than pip
  • Extends official runtime: Required for Databricks compatibility (includes Spark, JDK, etc.)
  • Engine in /opt/getml: Accessible system-wide for all users

2. Init Script

Create docker/databricks/getml-init.sh for cluster startup.

Why init scripts? Databricks ignores Docker ENTRYPOINT/CMD. Init scripts run after container creation but before the cluster becomes operational.

#!/bin/bash
# getml-init.sh - Databricks init script for getml setup
# This runs on every cluster node at startup

set -e

# Configure getml engine location
export GETML_HOME="/opt/getml/.getML"
export PATH="${GETML_HOME}:${PATH}"

# Persist environment variables for notebooks
echo "export GETML_HOME=${GETML_HOME}" >> /etc/profile.d/getml.sh
echo "export PATH=${GETML_HOME}:\${PATH}" >> /etc/profile.d/getml.sh

# Create project directory on DBFS (shared storage)
mkdir -p /dbfs/getml/projects

# Log initialization
echo "========================================="
echo "getml initialized successfully"
echo "Engine: ${GETML_HOME}"
echo "Projects: /dbfs/getml/projects"
echo "========================================="
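The init script above has to be available in the workspace before a cluster can reference it. One way to upload it is via the Databricks CLI's `workspace import` command, sketched below; the target path matches the cluster spec in this issue, but flag names may differ across CLI versions, so treat this as a starting point.

```shell
# Upload getml-init.sh to the shared workspace path referenced by the
# cluster configuration. Requires a configured Databricks CLI.
INIT_SCRIPT_TARGET="/Shared/init-scripts/getml-init.sh"
INIT_SCRIPT_LOCAL="docker/databricks/getml-init.sh"

if command -v databricks >/dev/null 2>&1; then
  databricks workspace mkdirs /Shared/init-scripts
  databricks workspace import "$INIT_SCRIPT_TARGET" \
    --file "$INIT_SCRIPT_LOCAL" --format AUTO --overwrite
else
  echo "databricks CLI not found; upload $INIT_SCRIPT_LOCAL manually to $INIT_SCRIPT_TARGET"
fi
```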

3. GitHub Actions Workflow

Create .github/workflows/databricks-docker.yml:

name: Build and Push Databricks Docker Image

on:
  release:
    types: [published]
  workflow_dispatch:
    inputs:
      version:
        description: 'getml version'
        required: true

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Build and Push
        run: |
          VERSION=${{ github.event.inputs.version || github.ref_name }}
          IMAGE="getml/getml-databricks"

          docker build \
            --build-arg GETML_VERSION=${VERSION} \
            -t ${IMAGE}:${VERSION} \
            -t ${IMAGE}:latest \
            -f docker/databricks/Dockerfile .

          docker push ${IMAGE}:${VERSION}
          docker push ${IMAGE}:latest
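Besides the release trigger, the workflow above can be kicked off manually through its `workflow_dispatch` input. A minimal sketch using the GitHub CLI, assuming `gh` is installed and authenticated against the repository (the version value is illustrative):

```shell
# Manually trigger the Databricks image build for a specific getml version.
WORKFLOW_FILE="databricks-docker.yml"
GETML_VERSION="1.5.0"   # illustrative version; use the release you need

if command -v gh >/dev/null 2>&1; then
  gh workflow run "$WORKFLOW_FILE" -f version="$GETML_VERSION"
else
  echo "gh CLI not available; trigger $WORKFLOW_FILE from the Actions tab instead"
fi
```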

4. Alternative: GCR Workflow

For GCP-based Databricks workspaces using Google Container Registry:

      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - name: Configure Docker for GCR
        run: gcloud auth configure-docker gcr.io

      - name: Build and Push to GCR
        run: |
          VERSION=${{ github.event.inputs.version || github.ref_name }}
          IMAGE="gcr.io/${{ secrets.GCP_PROJECT_ID }}/getml-databricks"

          docker build \
            --build-arg GETML_VERSION=${VERSION} \
            -t ${IMAGE}:${VERSION} \
            -t ${IMAGE}:latest \
            -f docker/databricks/Dockerfile .

          docker push ${IMAGE}:${VERSION}
          docker push ${IMAGE}:latest

5. Deployment Documentation

Create docker/databricks/README.md with:

  • Prerequisites (workspace admin, container services enabled)
  • Cluster configuration steps
  • Init script installation
  • Usage examples
  • Troubleshooting guide

Files to create:

docker/databricks/
├── Dockerfile
├── pyproject.toml      # Dependencies (uv)
├── getml-init.sh       # Cluster init script
└── README.md
.github/workflows/
└── databricks-docker.yml

Cluster Configuration

To use the custom container on Databricks:

  1. Enable Container Services (workspace admin):

    • Admin Console > Advanced > Enable Databricks Container Services
  2. Create cluster with custom image:

    {
      "cluster_name": "getml-cluster",
      "spark_version": "16.4.x-scala2.12",
      "docker_image": {
        "url": "getml/getml-databricks:latest"
      },
      "init_scripts": [
        {
          "workspace": {
            "destination": "/Shared/init-scripts/getml-init.sh"
          }
        }
      ]
    }
  3. Or via UI:

    • Create Cluster > Docker > Use your own Docker container
    • Enter image URL: getml/getml-databricks:latest
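The JSON spec from step 2 can also be submitted programmatically to the Databricks Clusters API. A sketch is below, assuming `DATABRICKS_HOST` and `DATABRICKS_TOKEN` are set; `node_type_id` and `num_workers` are illustrative additions, since the API requires them but the spec above omits them.

```shell
# Create the getml cluster via the Clusters API (POST /api/2.1/clusters/create).
# node_type_id and num_workers below are illustrative; pick values for your cloud.
cat > create-cluster.json <<'EOF'
{
  "cluster_name": "getml-cluster",
  "spark_version": "16.4.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 2,
  "docker_image": {
    "url": "getml/getml-databricks:latest"
  },
  "init_scripts": [
    {
      "workspace": {
        "destination": "/Shared/init-scripts/getml-init.sh"
      }
    }
  ]
}
EOF

if [ -n "${DATABRICKS_HOST:-}" ] && [ -n "${DATABRICKS_TOKEN:-}" ]; then
  curl -sf -X POST "${DATABRICKS_HOST}/api/2.1/clusters/create" \
    -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
    -H "Content-Type: application/json" \
    --data @create-cluster.json
else
  echo "Set DATABRICKS_HOST and DATABRICKS_TOKEN to submit the request"
fi
```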

Acceptance Criteria

  • Dockerfile builds successfully extending databricksruntime
  • Image pushes to Docker Hub (or GCR) via CI/CD
  • Init script runs successfully on cluster startup
  • getml engine starts and can process data
  • Feature training workflow completes successfully
  • Feature retraining workflow completes successfully
  • Documentation covers setup and usage

Testing

  1. Local build test: docker build -t getml-databricks -f docker/databricks/Dockerfile .
  2. Cluster test: Create cluster with custom image
  3. Training test: Run sample feature training notebook
  4. Retraining test: Verify retraining with new data
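Steps 1 and part of step 2 can be smoke-tested locally before touching a workspace. The sketch below builds the image and checks that the engine files and Python package are in place; paths follow the Dockerfile above, and the final `python` check assumes the image exposes a Python interpreter (the Databricks runtime base does).

```shell
# Local smoke test: build the image, then confirm the engine tarball was
# unpacked where the init script expects it and that getml imports.
IMAGE_TAG="getml-databricks:smoke"

if command -v docker >/dev/null 2>&1; then
  docker build -t "$IMAGE_TAG" -f docker/databricks/Dockerfile .
  docker run --rm "$IMAGE_TAG" ls /opt/getml/.getML
  docker run --rm "$IMAGE_TAG" python -c "import getml; print(getml.__version__)"
else
  echo "Docker not available; skipping local smoke test"
fi
```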

Constraints & Limitations

  • CMD/ENTRYPOINT IGNORED: Databricks completely ignores Docker execution primitives; use init scripts for startup tasks instead
  • Must extend official runtime: Custom images must extend databricksruntime/* base images
  • IP range conflict: Avoid using 172.17.0.0/16 in container networking
  • Not supported on: Standard access mode, ML Runtime, AWS Graviton instances
  • Rate limits: Docker Hub has pull rate limits; use GCR for high-volume usage
  • Init script execution: Runs after container creation, before cluster is operational

Dependencies

  • Databricks workspace with Container Services enabled (admin setting)
  • Container registry (Docker Hub, GCR, ECR, or ACR)
  • Docker CLI for local testing

Example Usage

# Build locally
docker build -t getml-databricks -f docker/databricks/Dockerfile .

# Push to Docker Hub
docker tag getml-databricks getml/getml-databricks:latest
docker push getml/getml-databricks:latest

# Configure cluster via Databricks UI
# Create Cluster > Docker > Use your own Docker container
# Image URL: getml/getml-databricks:latest
# Init script: /Shared/init-scripts/getml-init.sh

Implementation Breakdown

This feature will be decomposed into the following tasks (to be created as sub-issues):

  1. Create Dockerfile extending databricksruntime/standard
  2. Create pyproject.toml with getml dependencies
  3. Create getml-init.sh cluster init script
  4. Create GitHub Actions workflow for CI/CD (Docker Hub)
  5. Create alternative GCR workflow (optional)
  6. Write deployment documentation (README.md)
  7. Test end-to-end on Databricks cluster
