
Feature/week3 #3

Merged
merged 11 commits into main from feature/week3
Nov 14, 2024

Conversation

@Garett601 (Collaborator) commented Nov 11, 2024

Description

Creates three serving endpoints, one per scenario: feature serving, model serving, and model serving with feature lookup (01_feature_serving.py, 02_model_serving.py, 03_model_serving_feature_lookup.py).
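
As a rough illustration, the model serving variant comes down to creating an endpoint for the registered model via the Databricks SDK. The sketch below uses placeholder endpoint and model names; the notebooks read the real ones from the project config:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput

workspace = WorkspaceClient()

# Create a serving endpoint for a Unity Catalog registered model.
workspace.serving_endpoints.create(
    name="power-consumption-model-serving",  # placeholder endpoint name
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="catalog.schema.power_consumption_model",  # placeholder UC model
                entity_version="2",
                workload_size="Small",
                scale_to_zero_enabled=True,
            )
        ]
    ),
)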

Sets up A/B testing for two different model versions using Pyfunc: we train and register both versions, then implement a serving endpoint that uses a Pyfunc model as a wrapper around them (04_AB_test_model_serving.py).
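
For context, the Pyfunc wrapper pattern looks roughly like the sketch below; the class name and hash-based traffic split are illustrative assumptions, not the exact code in 04_AB_test_model_serving.py:

import hashlib

import mlflow.pyfunc


class PowerConsumptionModelWrapper(mlflow.pyfunc.PythonModel):
    """Hypothetical wrapper that routes each request to model A or B based on the record id."""

    def __init__(self, model_a, model_b):
        self.model_a = model_a
        self.model_b = model_b

    def predict(self, context, model_input):
        # Hash the id so the same record always hits the same variant (deterministic 50/50 split).
        record_id = str(model_input["id"].iloc[0])
        variant = "A" if int(hashlib.md5(record_id.encode()).hexdigest(), 16) % 2 == 0 else "B"
        model = self.model_a if variant == "A" else self.model_b
        prediction = model.predict(model_input.drop(columns=["id"]))
        return {"prediction": prediction, "model": variant}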

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced A/B testing hyperparameters for model performance analysis.
    • Added unique identifier (id column) to datasets for enhanced processing.
    • Implemented feature and model serving pipelines in Databricks for improved efficiency.
  • Documentation

    • Updated README for clearer environment setup and CI pipeline instructions.
    • Expanded sections on dataset management and feature engineering processes.
  • Dependency Updates

    • Updated and specified versions for key libraries in pyproject.toml to ensure compatibility and stability.

@Garett601 self-assigned this Nov 11, 2024
@Garett601 requested a review from a team as a code owner November 11, 2024 12:46
coderabbitai bot (Contributor) commented Nov 11, 2024

Walkthrough

This pull request introduces several changes across multiple files, primarily focusing on configuration updates, new feature implementations, and enhancements to documentation. Key modifications include the addition of default stages in the pre-commit configuration, updates to the README for environment setup and CI pipeline improvements, and the introduction of A/B testing hyperparameters in configuration files. New scripts for feature and model serving in a Databricks environment have also been added, along with adjustments to dependency specifications in the pyproject.toml.

Changes

| File | Change Summary |
| --- | --- |
| .pre-commit-config.yaml | Added default_stages: pre-commit configuration option; existing hooks and exclusions remain unchanged. |
| README.md | Updated environment setup instructions, loosened Python version requirements, expanded troubleshooting notes, and clarified dataset management. CI pipeline commands enhanced for testing and code quality checks. |
| configs/project_configs.yml | Introduced ab_test_hyperparameters section with parameters for A/B testing (learning rates and max depths). |
| main.py | Deleted file that orchestrated the machine learning workflow for the power consumption project. |
| notebooks/week2/01_prepare_dataset.py | Added functionality to create an id column in train_set and test_set from the DateTime column. |
| notebooks/week3/01_feature_serving.py | Implemented a feature serving pipeline in Databricks, including creating feature tables and serving endpoints. |
| notebooks/week3/02_model_serving.py | Introduced functionality for serving a machine learning model in Databricks, including load testing and request handling. |
| notebooks/week3/03_model_serving_feature_lookup.py | Added functionality for serving a model with an online table in Databricks, including request handling. |
| notebooks/week3/04_AB_test_model_serving.py | Implemented an A/B testing framework for model serving using LightGBM in Databricks. |
| power_consumption/config.py | Added ABTestHyperparameters class for A/B testing hyperparameters; updated Config class to include it. |
| pyproject.toml | Updated dependency versions to exact specifications; added new dependencies and introduced a [build-system] section. |
| tests/conftest.yml | Added ab_test_hyperparameters section for testing configurations related to A/B testing. |

Possibly related PRs

  • Feature/week1 #1: The changes in the main PR regarding the .pre-commit-config.yaml file and the introduction of the default_stages configuration option are related to the updates in the CI workflow configuration in the retrieved PR. Both PRs involve enhancements to the configuration files that affect the project's development workflow.
  • Feature/week2 #2: The modifications in the main PR to the .pre-commit-config.yaml file align with the updates in the CI workflow configuration in the retrieved PR. Both PRs focus on improving the CI pipeline and ensuring that pre-commit checks are integrated into the development process.

Suggested labels

enhancement

Suggested reviewers

  • basakeskili
  • mvechtomova

🐇 In the garden where changes bloom,
New features sprout, dispelling gloom.
With A/B tests and serving bright,
Our models now take flight!
Pre-commit hooks, a guiding light,
In CodeRabbit's realm, all feels right! 🌱


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 13

🧹 Outside diff range and nitpick comments (21)
power_consumption/config.py (2)

13-19: Enhance the ABTestHyperparameters class implementation.

The class structure could be improved in several ways:

  1. Consider splitting n_estimators into n_estimators_a and n_estimators_b for consistency with other A/B parameters
  2. Add parameter validation using Pydantic's field validators
  3. Add a docstring explaining the class purpose and parameter usage
  4. Group A and B parameters together for better readability

Here's a suggested implementation:

 class ABTestHyperparameters(BaseModel):
+    """Configuration for A/B test model hyperparameters.
+    
+    Contains two sets of hyperparameters (A and B) for comparing model versions.
+    All parameters should be positive numbers.
+    """
+    # Model A parameters
     learning_rate_a: float
-    learning_rate_b: float
-    n_estimators: int
     max_depth_a: int
+    n_estimators_a: int
+    
+    # Model B parameters
+    learning_rate_b: float
     max_depth_b: int
+    n_estimators_b: int
+    
+    @validator("learning_rate_a", "learning_rate_b", "max_depth_a", "max_depth_b", 
+              "n_estimators_a", "n_estimators_b")
+    def validate_positive(cls, v, field):
+        if v <= 0:
+            raise ValueError(f"{field.name} must be positive")
+        return v

40-40: Consider improving Config class organization.

The addition looks good, but consider:

  1. Moving ab_test_hyperparameters next to hyperparameters for better grouping of related fields
  2. Adding validation to ensure the configuration is used correctly in either standard or A/B testing mode

Here's a suggested implementation:

 class Config(BaseModel):
     catalog_name: str
     schema_name: str
     hyperparameters: Hyperparameters
-    ab_test_hyperparameters: ABTestHyperparameters
     processed_features: ProcessedFeatures
     target: Target
     dataset: Dataset
+    ab_test_hyperparameters: Optional[ABTestHyperparameters] = None
+
+    @root_validator
+    def validate_hyperparameters_usage(cls, values):
+        """Ensure either standard or A/B test hyperparameters are actively used."""
+        if values.get("hyperparameters") and values.get("ab_test_hyperparameters"):
+            raise ValueError("Cannot use both standard and A/B test hyperparameters simultaneously")
+        return values
notebooks/week2/01_prepare_dataset.py (2)

34-36: Add validation for unique IDs

While converting IDs to strings is fine for downstream processing, we should ensure uniqueness to prevent potential issues in the A/B testing framework.

Consider adding validation after the conversion:

train_set['id'] = train_set['id'].astype(str)
test_set['id'] = test_set['id'].astype(str)
+# Validate ID uniqueness
+assert train_set['id'].nunique() == len(train_set), "Duplicate IDs found in train_set"
+assert test_set['id'].nunique() == len(test_set), "Duplicate IDs found in test_set"

30-36: Document the ID generation strategy

Since this ID generation is crucial for the A/B testing framework mentioned in the PR objectives, it would be beneficial to add documentation explaining the strategy.

Add a docstring explaining the ID generation:

+# Generate unique IDs for A/B testing by converting timestamps
+# The DateTime column is converted to Unix timestamp (seconds since epoch)
+# This ensures deterministic and unique identification of records
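
As a concrete illustration of that strategy (a hypothetical standalone snippet, not the notebook cell itself):

import pandas as pd

df = pd.DataFrame({"DateTime": pd.date_range("2017-01-01", periods=3, freq="h")})

# datetime64[ns] -> nanoseconds since epoch -> seconds since epoch -> string id
df["id"] = (df["DateTime"].astype("int64") // 10**9).astype(str)
print(df["id"].tolist())  # ['1483228800', '1483232400', '1483236000']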
pyproject.toml (1)

29-29: Consider adding upper bound for databricks-connect.

The removal of the upper bound (<16) for databricks-connect seems inconsistent with the strict version pinning approach used for core dependencies. This could lead to compatibility issues with future major versions.

Consider maintaining the upper bound:

-    "databricks-connect>=15.4.1",
+    "databricks-connect>=15.4.1,<16",
notebooks/week3/03_model_serving_feature_lookup.py (1)

116-118: Fix typo and improve response handling.

There's a typo in "Reponse" and the response handling could be improved.

Apply these improvements:

-print("Response status:", response.status_code)
-print("Reponse text:", response.text)
-print("Execution time:", execution_time, "seconds")
+print(f"Response status: {response.status_code}")
+print(f"Response text: {response.text}")
+print(f"Execution time: {execution_time:.2f} seconds")
README.md (5)

Line range hint 13-24: Improve environment setup instructions for clarity and consistency.

The environment setup section contains inconsistencies and could be more comprehensive:

  1. Standardize the Python version reference (currently shows both 3.11.0 and 3.11)
  2. Use consistent virtual environment paths (varies between venv and .venv)
  3. Add separate instructions for Windows users

Consider updating the setup instructions like this:

- uv venv -p 3.11.0 venv
- source venv/bin/activate
+ # For Unix/MacOS:
+ uv venv -p 3.11.10 .venv
+ source .venv/bin/activate
+ 
+ # For Windows:
+ uv venv -p 3.11.10 .venv
+ .venv\Scripts\activate
🧰 Tools
🪛 Markdownlint

124-124: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


Line range hint 52-86: Consider enhancing the CI pipeline configuration.

The CI pipeline looks good but could be improved with:

  1. Test results artifact upload for better visibility
  2. Version constraints for test dependencies

Consider these additions to the CI workflow:

 - name: Run tests with coverage
   run: |
     uv pip install pytest pytest-mock pytest-cov
     uv run pytest --cov=power_consumption tests/
+    uv run pytest --junitxml=test-results.xml --cov=power_consumption --cov-report=xml tests/
+
+ - name: Upload test results
+   uses: actions/upload-artifact@v3
+   if: always()
+   with:
+     name: test-results
+     path: |
+       test-results.xml
+       coverage.xml
🧰 Tools
🪛 Markdownlint

124-124: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


Line range hint 88-93: Enhance dataset documentation with format specifications.

The dataset section would benefit from additional details about:

  1. Expected file format and encoding
  2. Data validation steps
  3. Sample data structure

Would you like me to help create a data validation script or expand the dataset documentation?

🧰 Tools
🪛 Markdownlint

124-124: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


124-124: Fix list indentation in Feature Engineering section.

The bullet point about Feature Engineering package limitations has incorrect indentation.

-    - Feature Engineering package limitations [documentation](https://docs.databricks.com/en/machine-learning/feature-store/python-api.html#limitations)
+  - Feature Engineering package limitations [documentation](https://docs.databricks.com/en/machine-learning/feature-store/python-api.html#limitations)
🧰 Tools
🪛 Markdownlint

124-124: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


Line range hint 95-157: Improve Feature Engineering section organization and clarity.

The Feature Engineering section would benefit from:

  1. A clearer explanation of the UC (Unity Catalog) permission model
  2. Structured code examples with comments
  3. A troubleshooting guide for common permission issues

Consider reorganizing the code examples into separate sections for:

  • Data preprocessing
  • Feature generation
  • Model training

Would you like help creating a structured template for this section?

🧰 Tools
🪛 Markdownlint

124-124: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)

notebooks/week3/02_model_serving.py (3)

16-31: Move imports to the top of the file

According to Python's style guidelines (PEP 8), all imports should be at the top of the file, before any executable code. This improves readability and ensures that dependencies are loaded before they are used.

Apply this diff to move the imports:

+# imports
+import time
+import requests
+import random
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from databricks.sdk import WorkspaceClient
+from databricks.sdk.service.serving import (
+    EndpointCoreConfigInput,
+    ServedEntityInput,
+    TrafficConfig,
+    Route,
+)
+from pyspark.sql import SparkSession
+from power_consumption.config import Config
+
 # DBTITLE 1,imports
-import time
-
-import requests
-import random
-from concurrent.futures import ThreadPoolExecutor, as_completed
-
-from databricks.sdk import WorkspaceClient
-from databricks.sdk.service.serving import (
-    EndpointCoreConfigInput,
-    ServedEntityInput,
-    TrafficConfig,
-    Route,
-)
-from pyspark.sql import SparkSession
-from power_consumption.config import Config
🧰 Tools
🪛 Ruff

16-16: Module level import not at top of file

(E402)


18-18: Module level import not at top of file

(E402)


19-19: Module level import not at top of file

(E402)


20-20: Module level import not at top of file

(E402)


22-22: Module level import not at top of file

(E402)


23-28: Module level import not at top of file

(E402)


29-29: Module level import not at top of file

(E402)


30-30: Module level import not at top of file

(E402)


143-143: Avoid printing full response text to prevent sensitive data exposure

Printing the full response text may inadvertently expose sensitive information. If the response could contain confidential data, consider logging only necessary details or using logging mechanisms with appropriate log levels.

Apply this diff to limit the output:

-print("Reponse text:", response.text)
+print("Response content:", response.content[:100], "...")  # Show only the first 100 characters

177-180: Handle exceptions when retrieving future results

Exceptions raised during the execution of send_request can cause future.result() to raise an exception, potentially stopping the entire load test. To ensure that all requests are accounted for and to handle failures gracefully, wrap the result retrieval in a try-except block.

Apply this diff:

 for future in as_completed(futures):
-    status_code, latency = future.result()
-    latencies.append(latency)
+    try:
+        status_code, latency = future.result()
+        latencies.append(latency)
+    except Exception as e:
+        print(f"An error occurred: {e}")

This will allow the load test to continue even if some requests fail.

notebooks/week3/01_feature_serving.py (4)

41-41: Remove unused import 'SparkSession'

The SparkSession class is imported but not used in the script. Removing unused imports helps keep the code clean and maintainable.

-from pyspark.sql import SparkSession
🧰 Tools
🪛 Ruff

41-41: Module level import not at top of file

(E402)


41-41: pyspark.sql.SparkSession imported but unused

Remove unused import: pyspark.sql.SparkSession

(F401)


82-86: Consider using Spark DataFrames instead of converting to pandas

Converting large Spark DataFrames to pandas DataFrames using .toPandas() can lead to memory issues, especially with large datasets. Processing data using Spark DataFrames keeps the computations distributed and scalable.

Suggestion:

Refactor the code to continue using Spark DataFrames:

-train_set = spark.table(f"{catalog_name}.{schema_name}.train_set").toPandas()
-test_set = spark.table(f"{catalog_name}.{schema_name}.test_set").toPandas()
-
-df = pd.concat([train_set, test_set])
+train_set = spark.table(f"{catalog_name}.{schema_name}.train_set")
+test_set = spark.table(f"{catalog_name}.{schema_name}.test_set")
+
+df = train_set.union(test_set)
🧰 Tools
🪛 Ruff

82-82: Undefined name spark

(F821)


83-83: Undefined name spark

(F821)


95-95: Parameterize the model version instead of hardcoding

Hardcoding the model version (/2) can make the code less flexible and harder to maintain. Consider parameterizing the model version or using the latest version programmatically.

Suggestion:

Option 1 - Use the latest model version:

-pipeline = mlflow.sklearn.load_model(f"models:/{catalog_name}.{schema_name}.power_consumption_model/2")
+pipeline = mlflow.sklearn.load_model(f"models:/{catalog_name}.{schema_name}.power_consumption_model@latest")

Option 2 - Make the model version configurable via the configuration file:

+# Assuming you add `model_version` to your configuration
+model_version = config.model_version
+
-pipeline = mlflow.sklearn.load_model(f"models:/{catalog_name}.{schema_name}.power_consumption_model/2")
+pipeline = mlflow.sklearn.load_model(f"models:/{catalog_name}.{schema_name}.power_consumption_model/{model_version}")

298-319: Align the number of threads with the number of requests

The ThreadPoolExecutor is initialized with max_workers=100, but you are only sending num_requests=10. This could lead to inefficient resource usage. Consider aligning the number of workers with the number of requests.

Suggestion:

-num_requests = 10
+num_requests = 100  # Or adjust max_workers to match num_requests

 # Send requests concurrently
-with ThreadPoolExecutor(max_workers=100) as executor:
+with ThreadPoolExecutor(max_workers=num_requests) as executor:
     futures = [executor.submit(send_request) for _ in range(num_requests)]
notebooks/week3/04_AB_test_model_serving.py (3)

16-34: Consider moving import statements to the top of the file

PEP 8 recommends placing all import statements at the top of the file to enhance readability and maintain consistency. While notebooks often have imports in various cells, consolidating them at the beginning can help in understanding dependencies and improve maintainability.

🧰 Tools
🪛 Ruff

16-16: Module level import not at top of file

(E402)


18-18: Module level import not at top of file

(E402)


19-19: Module level import not at top of file

(E402)


20-20: Module level import not at top of file

(E402)


21-21: Module level import not at top of file

(E402)


22-22: Module level import not at top of file

(E402)


23-23: Module level import not at top of file

(E402)


24-24: Module level import not at top of file

(E402)


25-25: Module level import not at top of file

(E402)


26-26: Module level import not at top of file

(E402)


27-27: Module level import not at top of file

(E402)


28-28: Module level import not at top of file

(E402)


29-29: Module level import not at top of file

(E402)


30-30: Module level import not at top of file

(E402)


31-31: Module level import not at top of file

(E402)


32-32: Module level import not at top of file

(E402)


34-34: Module level import not at top of file

(E402)


119-119: Avoid hardcoding the Git SHA value

The git_sha variable is hardcoded:

git_sha = "30d57afb2efca70cede3061d00f2a553c2b4779b"

Consider retrieving the current Git commit SHA programmatically to ensure that the logged information reflects the actual state of the code:

import subprocess

git_sha = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()

This change enhances maintainability and reduces the risk of discrepancies between the code and the logged metadata.


298-298: Explicitly display predictions in the notebook

The standalone expression predictions is intended to display the output in the notebook:

predictions

For clarity and to adhere to best practices, consider using a print statement or adding a comment to indicate that this line displays the predictions:

-predictions
+print(predictions)  # Display predictions

This makes the purpose of the line clear to readers unfamiliar with notebook behavior.

🧰 Tools
🪛 Ruff

298-298: Found useless expression. Either assign it to a variable or remove it.

(B018)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between f8f9255 and c529d80.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
  • .pre-commit-config.yaml (1 hunks)
  • README.md (1 hunks)
  • configs/project_configs.yml (1 hunks)
  • main.py (0 hunks)
  • notebooks/week2/01_prepare_dataset.py (1 hunks)
  • notebooks/week3/01_feature_serving.py (1 hunks)
  • notebooks/week3/02_model_serving.py (1 hunks)
  • notebooks/week3/03_model_serving_feature_lookup.py (1 hunks)
  • notebooks/week3/04_AB_test_model_serving.py (1 hunks)
  • power_consumption/config.py (2 hunks)
  • pyproject.toml (2 hunks)
  • tests/conftest.yml (1 hunks)
💤 Files with no reviewable changes (1)
  • main.py
✅ Files skipped from review due to trivial changes (1)
  • .pre-commit-config.yaml
🧰 Additional context used
🪛 Markdownlint
README.md

124-124: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)

🪛 Ruff
notebooks/week3/01_feature_serving.py

11-11: Undefined name dbutils

(F821)


26-26: Module level import not at top of file

(E402)


27-27: Module level import not at top of file

(E402)


28-28: Module level import not at top of file

(E402)


30-30: Module level import not at top of file

(E402)


31-31: Module level import not at top of file

(E402)


32-32: Module level import not at top of file

(E402)


33-33: Module level import not at top of file

(E402)


34-34: Module level import not at top of file

(E402)


35-35: Module level import not at top of file

(E402)


36-39: Module level import not at top of file

(E402)


40-40: Module level import not at top of file

(E402)


41-41: Module level import not at top of file

(E402)


41-41: pyspark.sql.SparkSession imported but unused

Remove unused import: pyspark.sql.SparkSession

(F401)


42-42: Module level import not at top of file

(E402)


82-82: Undefined name spark

(F821)


83-83: Undefined name spark

(F821)


114-114: Undefined name spark

(F821)


135-135: Undefined name spark

(F821)


221-221: Undefined name dbutils

(F821)


222-222: Undefined name spark

(F821)

notebooks/week3/02_model_serving.py

11-11: Undefined name dbutils

(F821)


16-16: Module level import not at top of file

(E402)


18-18: Module level import not at top of file

(E402)


19-19: Module level import not at top of file

(E402)


20-20: Module level import not at top of file

(E402)


22-22: Module level import not at top of file

(E402)


23-28: Module level import not at top of file

(E402)


29-29: Module level import not at top of file

(E402)


30-30: Module level import not at top of file

(E402)


85-85: Undefined name dbutils

(F821)

notebooks/week3/03_model_serving_feature_lookup.py

11-11: Undefined name dbutils

(F821)


16-16: Module level import not at top of file

(E402)


18-18: Module level import not at top of file

(E402)


19-19: Module level import not at top of file

(E402)


20-23: Module level import not at top of file

(E402)


24-24: Module level import not at top of file

(E402)


25-25: Module level import not at top of file

(E402)


27-27: Module level import not at top of file

(E402)


83-83: Undefined name dbutils

(F821)

notebooks/week3/04_AB_test_model_serving.py

11-11: Undefined name dbutils

(F821)


16-16: Module level import not at top of file

(E402)


18-18: Module level import not at top of file

(E402)


19-19: Module level import not at top of file

(E402)


20-20: Module level import not at top of file

(E402)


21-21: Module level import not at top of file

(E402)


22-22: Module level import not at top of file

(E402)


23-23: Module level import not at top of file

(E402)


24-24: Module level import not at top of file

(E402)


25-25: Module level import not at top of file

(E402)


26-26: Module level import not at top of file

(E402)


27-27: Module level import not at top of file

(E402)


28-28: Module level import not at top of file

(E402)


29-29: Module level import not at top of file

(E402)


30-30: Module level import not at top of file

(E402)


31-31: Module level import not at top of file

(E402)


32-32: Module level import not at top of file

(E402)


34-34: Module level import not at top of file

(E402)


298-298: Found useless expression. Either assign it to a variable or remove it.

(B018)


331-331: Undefined name dbutils

(F821)

🔇 Additional comments (9)
configs/project_configs.yml (2)

9-15: Consider differentiating learning rates for meaningful A/B testing.

The current configuration uses identical learning rates (0.02) for both versions. This might not provide meaningful insights into the impact of learning rate on model performance.

Consider:

  1. Using different learning rates to better understand their impact
  2. Splitting the shared n_estimators parameter into version-specific parameters (n_estimators_a and n_estimators_b)
  3. Adding parameter validation constraints (e.g., min/max values)

Example structure:

ab_test_hyperparameters:
  version_a:
    learning_rate: 0.02
    n_estimators: 1000
    max_depth: 6
  version_b:
    learning_rate: 0.03  # Different value for comparison
    n_estimators: 1000
    max_depth: 10
  validation:
    learning_rate_range: [0.01, 0.1]
    max_depth_range: [3, 15]
    n_estimators_range: [100, 2000]

13-14: Verify max_depth difference impact.

The significant difference in max_depth (6 vs 10) between versions could lead to:

  1. Overfitting in version B (max_depth: 10)
  2. Different model complexities affecting inference latency

Let's verify if there are any model evaluation metrics or latency requirements in the codebase:

tests/conftest.yml (3)

13-14: Review max_depth difference for potential overfitting.

The significant difference in max_depth between variants (6 vs 10) might lead to overfitting in variant B. Consider:

  1. Using closer max_depth values to isolate the impact of this parameter
  2. Adding cross-validation in the training script to validate the choice of these values
  3. Documenting the rationale for these specific values
#!/bin/bash
# Search for cross-validation implementation in the training code
rg "cross_val|cv_|validation" -A 5

9-15: Verify A/B testing configuration integration.

The A/B test configuration looks structurally sound, but please ensure:

  1. The A/B testing implementation in 04_AB_test_model_serving.py properly handles these parameters
  2. There's error handling for missing or invalid parameter values
  3. The testing framework can access these parameters correctly
#!/bin/bash
# Search for usage of these parameters in the A/B testing implementation
echo "Checking parameter usage in A/B testing code..."
rg "ab_test_hyperparameters|learning_rate_[ab]|max_depth_[ab]" -A 5

# Look for error handling related to these parameters
echo "Checking error handling..."
ast-grep --pattern 'try {
  $$$
  ab_test_hyperparameters
  $$$
} catch'

10-11: Verify if identical learning rates are intentional.

Both A/B variants use the same learning rate (0.02). Consider using different learning rates to better evaluate the impact of this hyperparameter on model performance.

notebooks/week2/01_prepare_dataset.py (1)

30-32: 🛠️ Refactor suggestion

Consider potential timestamp precision loss

The current implementation divides the timestamp by 10^9 to convert from nanoseconds to seconds, which might lead to collisions if there are multiple records within the same second.

Let's verify if there are potential ID collisions:

Consider using millisecond precision instead:

-train_set['id'] = train_set['DateTime'].astype('int64') // 10**9
-test_set['id'] = test_set['DateTime'].astype('int64') // 10**9
+train_set['id'] = train_set['DateTime'].astype('int64') // 10**6
+test_set['id'] = test_set['DateTime'].astype('int64') // 10**6
✅ Verification successful

Let me gather more information about the data to better understand the timestamp precision requirements.


Based on the gathered information, I can now provide a final response:

The timestamp precision concern is not valid in this context

The DateTime column is used as a primary key in the feature table and the data is collected hourly (as seen in test data with freq="h"). Converting to seconds (10**9) provides sufficient granularity since:

  • The data has hourly frequency (3600 seconds between records)
  • DateTime is used as a primary key in the feature table, confirming uniqueness
  • Test fixtures and data preprocessing show consistent hourly timestamps

The current implementation is appropriate for the use case and doesn't risk ID collisions.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any timestamp-related tests or validation logic
rg -g '!*.pyc' -g '!*.git*' "DateTime.*test|test.*DateTime" .

# Look for any existing timestamp handling patterns
ast-grep --pattern 'DateTime.astype("int64")|DateTime.astype('\''int64'\'')'

Length of output: 245


Script:

#!/bin/bash
# Look for DataFrame creation or data loading to understand timestamp frequency
rg -g '!*.pyc' -g '!*.git*' -B 5 -A 5 "DateTime" .

# Search for any validation or uniqueness checks on IDs
rg -g '!*.pyc' -g '!*.git*' "duplicated|unique|drop_duplicates" .

# Look for any configuration or schema definitions that might specify timestamp requirements
rg -g '!*.pyc' -g '!*.git*' "schema|timestamp|frequency" .

Length of output: 19728

pyproject.toml (2)

66-68: LGTM! Good addition of build system configuration.

The build system configuration follows PEP 517/518 best practices and aligns well with the project's structure.


8-24: Verify compatibility with pyarrow downgrade.

The change from pyarrow>=15.0.2 to pyarrow==14.0.1 is a significant downgrade. Please ensure this version is compatible with your Databricks runtime and other dependencies.

Consider using version ranges with upper bounds (e.g., >=14.0.1,<15.0.0) instead of exact versions to:

  1. Allow for patch updates that include security fixes
  2. Maintain compatibility with other packages that might require slightly different versions
✅ Verification successful

Pyarrow version change appears safe to proceed

The search results show that pyarrow is only listed as a dependency in pyproject.toml and referenced in uv.lock, but there are no direct imports or usage patterns found in the codebase. The downgrade from >=15.0.2 to 14.0.1 appears safe as:

  1. No direct pyarrow API usage was detected
  2. No parquet read/write operations were found that might be affected
  3. The version is still recent enough to be compatible with modern Databricks runtimes

While the architectural advice about using version ranges is valid, the exact version pinning is acceptable if it aligns with your team's dependency management strategy.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if any files import pyarrow and use features that might be affected
rg -l 'import.*pyarrow' | xargs rg 'pyarrow\.' -A 3

Length of output: 53


Script:

#!/bin/bash
# Let's try a broader search for pyarrow usage
rg -i "pyarrow" -A 3

# Also check for pa. prefix which is commonly used for pyarrow
rg "pa\." -A 3

# And check for any pandas operations that might use pyarrow under the hood
rg "pd\.read_parquet|to_parquet" -A 3

Length of output: 3525

notebooks/week3/03_model_serving_feature_lookup.py (1)

7-7: Consider using absolute path for wheel installation.

The relative path ../power_consumption-0.0.1-py3-none-any.whl could break if notebook location changes.

Consider using an absolute path or environment variable to specify the wheel location.

Resolved review threads:
  • notebooks/week3/02_model_serving.py
  • notebooks/week3/01_feature_serving.py (2 threads)
  • notebooks/week3/04_AB_test_model_serving.py (2 threads)

@mvechtomova left a comment

Looks awesome! you clearly paid close attention!

@Garett601 merged commit 06d5a6a into main on Nov 14, 2024
2 checks passed
@Garett601 deleted the feature/week3 branch November 14, 2024 06:26
@coderabbitai bot mentioned this pull request Nov 22, 2024