Skip to content

Latest commit

 

History

History
154 lines (128 loc) · 5.2 KB

File metadata and controls

154 lines (128 loc) · 5.2 KB

Marvelous MLOps End-to-end MLOps with Databricks course

Practical information

  • Weekly lectures on Wednesdays 16:00-18:00 CET.

  • Code for the lecture is shared before the lecture.

  • Presentation and lecture materials are shared right after the lecture.

  • Video of the lecture is uploaded within 24 hours after the lecture.

  • Every week we set up a deliverable, and you implement it with your own dataset.

  • To submit the deliverable, create a feature branch in that repository, and a PR to main branch. The code can be merged after we review & approve & CI pipeline runs successfully.

  • The deliverables can be submitted with a delay (for example, lecture 1 & 2 together), but we expect you to finish all assignments for the course before the 25th of November.

Set up your environment

In this course, we use Databricks 15.4 LTS runtime, which uses Python 3.11. In our examples, we use UV. Check out the documentation on how to install it: https://docs.astral.sh/uv/getting-started/installation/

To create a new environment and create a lockfile, run:

uv venv -p 3.11.0 venv
source venv/bin/activate
uv pip install -r pyproject.toml --all-extras
uv lock

Updates, Issues, Workarounds, Notes


week 1

15/10/2024 - Workaround for environment setup

Note

  • loosened python-version in pyproject.toml to requires-python = ">=3.11, <3.12"

  •   uv venv -p 3.11
      source .venv/bin/activate
      uv pip install -r pyproject.toml --all-extras
      uv lock
  • installs and uses python-3.11.10

Issue

  • UV could not find managed or system version download for 3.11.0
    • error: No download found for request: cpython-3.11.0-macos-aarch64-none

Workaround

  • workaround was to run pyenv local 3.11.0 and then uv venv -p 3.11.0, but this lead to other issues:
    • Had issues with dependency resolution when using 3.11.0 (pywin32 was being referenced for some reason? And it was not compatible with mlflow >= 2.16.0)


21/10/2024 - CI pipeline edits

Note

  • added uv pip install pre-commit to the CI pipeline

  • added uv run pre-commit run --all-files to the CI pipeline

  • added uv pip install pytest pytest-mock pytest-cov to the CI pipeline

  • added uv run pytest --cov=power_consumption tests/ to the CI pipeline

  •   name: CI
    
      on:
      pull_request:
          branches:
          - main
    
      jobs:
      build_and_test:
          runs-on: ubuntu-latest
          steps:
          - uses: actions/checkout@v4
    
          - name: Install uv
              uses: astral-sh/setup-uv@v3
    
          - name: Set up Python
              run: uv python install 3.11
    
          - name: Install the dependencies
              run: uv sync
    
          - name: Run pre-commit checks
              run: |
              uv pip install pre-commit
              uv run pre-commit run --all-files
    
          - name: Run tests with coverage
              run: |
              uv pip install pytest pytest-mock pytest-cov
              uv run pytest --cov=power_consumption tests/

Dataset

Note

  • The dataset is not included in the repository to avoid large file size. It should first attempt to get the data from the UCI ML Repository.
  • If that fails, the dataset is expected to be in data/Tetuan City power consumption.csv. You can download it from here.


week 2

26/10/2024 - Feature Engineering

Note

  • Dataset is now available in UC as table
  • updated DataPreprocessor and separated data loading and preprocessing

Issue

  • Feature Engineering was not working when running from within the IDE
    Exception: {'error_code': 'PERMISSION_DENIED', 'message': "Request failed access control checks. Permission check failed for 'heiaepgah71pwedmld01001.power_consumption.power_consumption_features'."}
  • In example code, the features generated at runtime were not used in the fe model

Workaround

  • Ran the feature engineering notebook from Databricks workspace, this resolved permissions issues
  • Ran the feature engineering feature function on the training and testing set and included the new features in the fe model
  • testing_set = fe.create_training_set(
        df=test_set,
        label=target,
        feature_lookups=[
            FeatureFunction(
                udf_name=function_name,
                output_name="weather_interaction",
                input_bindings={
                    "temperature": "Temperature",
                    "humidity": "Humidity",
                    "wind_speed": "Wind_Speed"
                },
            ),
        ],
        exclude_columns=["update_timestamp_utc"]
    )
    training_df = training_set.load_df().toPandas()
    testing_df = testing_set.load_df().toPandas()
    
    X_train = training_df[num_features + cat_features + ["weather_interaction"]]
    y_train = training_df[target]
    
    X_test= testing_df[num_features + cat_features + ["weather_interaction"]]
    y_test = testing_df[target]