
Feature/week2 #2

Merged: 22 commits merged into main from feature/week2 on Oct 30, 2024

Conversation

@Garett601 (Collaborator) commented on Oct 26, 2024

Description

Saved tables to Unity Catalog (UC), created MLflow experiments, and logged models (standard, custom, and with feature engineering). The steps are listed in the notebooks.
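
For reviewers who want the gist without opening the notebooks, a minimal sketch of the overall pattern (the catalog, schema, table, and experiment names below are placeholders, not the values from configs/project_configs.yml):

import mlflow
import pandas as pd
from pyspark.sql import SparkSession
from sklearn.linear_model import LinearRegression

spark = SparkSession.builder.getOrCreate()

# Toy training split; the real one comes from the DataProcessor.
train_set = pd.DataFrame({"Temperature": [20.1, 21.3, 19.8], "target": [30.0, 32.5, 29.1]})

# 1. Persist the split as a Unity Catalog table (placeholder three-level name).
spark.createDataFrame(train_set).write.mode("overwrite").saveAsTable(
    "my_catalog.power_consumption.train_set"
)

# 2. Track a run in an MLflow experiment, then log and register the model.
# On Databricks, registering into UC also needs: mlflow.set_registry_uri("databricks-uc")
mlflow.set_experiment("/Shared/power-consumption")  # placeholder experiment path
with mlflow.start_run() as run:
    model = LinearRegression().fit(train_set[["Temperature"]], train_set["target"])
    mlflow.log_metric("train_r2", model.score(train_set[["Temperature"]], train_set["target"]))
    mlflow.sklearn.log_model(model, artifact_path="model")

mlflow.register_model(f"runs:/{run.info.run_id}/model", "my_catalog.power_consumption.power_model")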

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Updated tests for refactored code. Added code coverage checks in the CI pipeline.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced .gitignore to ignore JSON files and MLflow directories.
    • Updated README.md with environment setup instructions and course logistics.
    • Introduced new data processing and modeling workflows in week_1.py and 01_prepare_dataset.py.
    • Added MLflow experiment management functionality in 02_mlfow_experiment.py.
    • Implemented model logging and registration processes in 03_log_and_register_model.py and 04_log_and_register_custom_model.py.
    • Added feature engineering model logging in 05_log_and_register_fe_model.py.
    • Introduced a new schema for processed power consumption data.
  • Bug Fixes

    • Resolved environment setup issues documented in the README.
  • Documentation

    • Comprehensive updates to the README for clarity on course logistics and setup processes.
  • Tests

    • Refactored tests for ConsumptionModel and DataProcessor to align with new data handling structures.

…): refactor code based on new config and separate data loading
Structure repo for clear separation of weekly tasks
convert to Databricks notebook and update code to new refactored implementation
convert to Databricks notebook and update code to new refactored implementation

chore: Actually commit the notebook this time

refactor(week_1.py): update Databricks implementation

Squashed commits: refactor week 1 notebook to .py format, convert it to Databricks, and ensure changes are committed.
Save training and test sets to catalog. Note the "ALTER TABLE ... SET TBLPROPERTIES (delta.enableChangeDataFeed = true);"
load data from UC table, preprocess data, save train and test set to UC tables
create, log and register a custom model with the mlflow.pyfunc.model flavor
… to UC

Made an oops and saved the table when the DateTime was the index, so I lost that information down the line which made FeatureLookup impossible
create a feature table and generate a new feature at runtime
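
The "ALTER TABLE ... SET TBLPROPERTIES" note in the commits above refers to enabling the Delta change data feed on the saved tables. A minimal sketch of that step (catalog and schema names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable the change data feed on the train/test tables saved to Unity Catalog.
for table in ("train_set", "test_set"):
    spark.sql(
        f"ALTER TABLE my_catalog.power_consumption.{table} "
        "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
    )
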
@Garett601 Garett601 added the enhancement New feature or request label Oct 26, 2024
@Garett601 Garett601 self-assigned this Oct 26, 2024
@Garett601 Garett601 requested a review from a team as a code owner October 26, 2024 04:18
coderabbitai bot (Contributor) commented on Oct 26, 2024

Walkthrough

This pull request introduces changes across many files, including updates to .gitignore, README.md, and the configuration files, plus new scripts for data processing and machine learning workflows. Key modifications include the addition of JSON and MLflow entries to .gitignore, restructuring of feature definitions in the configuration files, and new data processing scripts that use PySpark and MLflow. The updates aim to improve data handling, model training, and evaluation, alongside enhancements to the CI pipeline and documentation.

Changes

  • .gitignore: Added entries to ignore *.json files and the mlruns/ directory.
  • README.md: Updated course logistics, environment setup, and CI pipeline commands; added sections for weeks 1 and 2.
  • configs/project_configs.yml: Renamed features to processed_features; updated categorical features and dataset definitions.
  • notebooks/week1/week_1.py: New file implementing the data processing and modeling workflow for power consumption analysis.
  • notebooks/week2/01_prepare_dataset.py: New file for the data preparation workflow using PySpark and Pandas.
  • notebooks/week2/02_mlfow_experiment.py: New file for managing MLflow experiments in Databricks.
  • notebooks/week2/03_log_and_register_model.py: New file implementing MLflow and LightGBM for power consumption prediction.
  • notebooks/week2/04_log_and_register_custom_model.py: New file for logging and registering a custom ML model with MLflow.
  • notebooks/week2/05_log_and_register_fe_model.py: New file for an ML pipeline using feature engineering and model registration.
  • notebooks/week_1.ipynb: Deleted Jupyter notebook that implemented a workflow for processing power consumption data.
  • power_consumption/config.py: Renamed Features to ProcessedFeatures; updated the Dataset class structure.
  • power_consumption/model/rf_model.py: Updated the ConsumptionModel class to simplify initialization and change method signatures to use DataFrames.
  • power_consumption/preprocessing/data_preprocessor.py: Modified the DataProcessor class to accept DataFrames directly and updated preprocessing methods.
  • power_consumption/schemas/processed_data.py: Added ProcessedPowerConsumptionSchema to represent the processed dataset with expanded fields.
  • power_consumption/utils.py: Refactored the plot_feature_importance function for clarity in data preparation.
  • tests/conftest.yml: Updated feature and dataset properties in the test configuration.
  • tests/model/test_rf_model.py: Refactored tests for ConsumptionModel to align with the new data handling and configuration structure.
  • tests/preprocessing/test_data_preprocessor.py: Enhanced tests for DataProcessor, added new tests, and removed outdated ones.

Possibly related PRs

  • Feature/week1 #1: both PRs update the .gitignore entries that specify which files and directories Git should ignore.

Suggested reviewers

  • basakeskili

Poem

🐰 In the garden where data grows,
New paths and features, the rabbit knows.
With JSON and MLFlow in sight,
We hop along, coding with delight!
Preprocessing with care, we’ll never tire,
For power consumption, we’ll reach higher! 🌱



@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 47

🧹 Outside diff range and nitpick comments (22)
configs/project_configs.yml (3)

36-49: Consider restructuring to avoid feature list duplication.

The current configuration duplicates feature lists between processed_features and dataset sections, which could lead to maintenance issues. While the distinction between raw and processed features is important, consider these improvements:

  1. Rename the sections to clearly indicate their purpose:

    • raw_features instead of dataset.{num,cat}_features
    • processed_features (current)
  2. Or, consider consolidating the feature definitions:

features:
  raw:
    num_features: [...]
    cat_features: [...]
  processed:
    num_features: [...]
    cat_features: [...]

This would make the relationship between raw and processed features more explicit and reduce the risk of inconsistencies during updates.


Line range hint 1-8: Add documentation for MLflow experiment tracking.

Given that this PR implements MLflow experiments, consider adding:

  1. Comments explaining the hyperparameters and their valid ranges
  2. MLflow experiment naming configuration
  3. Documentation about where these hyperparameters are used in the training pipeline

Example addition:

# MLflow experiment tracking configuration
mlflow:
  experiment_name: "power_consumption_training"
  experiment_description: "Power consumption prediction model training"

# Model hyperparameters
hyperparameters:
  learning_rate: 0.01  # Learning rate for gradient descent (range: 0.001 to 0.1)
  n_estimators: 1000   # Number of trees in the forest (range: 100 to 2000)
  max_depth: 6         # Maximum depth of each tree (range: 3 to 10)

Line range hint 1-2: Enhance Unity Catalog configuration.

Since this PR implements saving tables to Unity Catalog, consider adding:

  1. Table location configuration
  2. Access control settings
  3. Documentation about the catalog structure

Example addition:

unity_catalog:
  catalog_name: heiaepgah71pwedmld01001
  schema_name: power_consumption
  tables:
    raw_data:
      name: tetuan_city_power_consumption
      location: "s3://bucket/path"
    processed_data:
      name: processed_power_consumption
      location: "s3://bucket/processed/path"
  access_control:
    owner: "data_science_team"
    grants:
      - principal: "ML_engineers"
        permissions: ["SELECT", "MODIFY"]
power_consumption/config.py (2)

13-15: LGTM! Clear and descriptive class renaming.

The rename from Features to ProcessedFeatures better reflects the post-processing nature of these features, improving code clarity.

Consider adding docstring documentation to clarify:

  • The purpose of this class in the ML pipeline
  • The expected format/requirements for num_features and cat_features
  • Any validation rules or constraints
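
A possible shape for that docstring, sketched against the field names mentioned in this PR (the wording and the stated constraints are illustrative):

from typing import List

from pydantic import BaseModel


class ProcessedFeatures(BaseModel):
    """Feature names available after preprocessing.

    num_features: continuous columns passed to the model as-is.
    cat_features: one-hot encoded categorical columns (e.g. DayOfWeek_*, IsWeekend_*).
    Both lists are expected to be non-empty and free of duplicates.
    """

    num_features: List[str]
    cat_features: List[str]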

Line range hint 1-58: Consider enhancing configuration validation for ML pipeline.

The configuration structure is well-organized, but could benefit from additional MLflow-specific validations since this PR introduces MLflow experiments.

Consider adding:

  1. MLflow experiment tracking configurations
  2. Model registry configurations for Unity Catalog
  3. Version tracking for feature definitions

This would help maintain consistency between your ML experiments and the corresponding configurations.

notebooks/week2/01_prepare_dataset.py (3)

6-6: Add error handling for SparkSession creation.

Consider adding try-except block to handle potential SparkSession creation failures gracefully.

-spark = SparkSession.builder.getOrCreate()
+try:
+    spark = SparkSession.builder.getOrCreate()
+except Exception as e:
+    print(f"Failed to create SparkSession: {str(e)}")
+    raise

14-16: Validate configuration values.

Add validation for critical configuration values to fail fast if they're missing or invalid.

+# Validate critical configuration
+if not all([catalog_name, schema_name, raw_data_table]):
+    raise ValueError("Missing required configuration values")

1-32: Consider moving core logic to proper Python modules.

While notebooks are great for development and visualization, critical data processing logic should reside in proper Python modules. This notebook should focus on orchestration and visualization, while the core logic should be in the power_consumption package.

Benefits:

  • Better testability
  • Improved reusability
  • Easier version control
  • Better separation of concerns

Consider:

  1. Moving data validation logic to DataProcessor
  2. Creating a dedicated configuration validator
  3. Adding a logging configuration module
  4. Creating a dedicated catalog interaction module
power_consumption/schemas/processed_data.py (3)

27-28: Add more detailed documentation for the processed schema.

The docstring should explain the preprocessing steps applied and the encoding scheme used for categorical variables.

 class ProcessedPowerConsumptionSchema(DataFrameModel):
-    """Processed power consumption dataset schema."""
+    """Processed power consumption dataset schema.
+    
+    This schema represents the processed version of power consumption data where:
+    - Categorical variables (DayOfWeek, IsWeekend) are one-hot encoded
+    - All features are properly typed and validated
+    - Missing values are handled consistently
+    """

30-47: Consider adding field descriptions using Field metadata.

The schema would benefit from explicit documentation of what each field represents, especially for the encoded categorical variables.

Example for a few fields:

-    Temperature: Series[float] = Field(nullable=True)
+    Temperature: Series[float] = Field(nullable=True, description="Temperature in Celsius")
-    DayOfWeek_1: Series[Category] = Field(nullable=True)
+    DayOfWeek_1: Series[Category] = Field(nullable=True, description="Binary indicator for Monday (1 if Monday, 0 otherwise)")
-    IsWeekend_1: Series[Category] = Field(nullable=True)
+    IsWeekend_1: Series[Category] = Field(nullable=True, description="Binary indicator for weekend days (1 if weekend, 0 if weekday)")

27-50: Consider adding data validation between raw and processed schemas.

The relationship between raw and processed data should be validated to ensure consistent preprocessing.

Consider adding class methods to validate the transformation:

@classmethod
def validate_processing(cls, raw_df: PowerConsumptionSchema, processed_df: "ProcessedPowerConsumptionSchema") -> bool:
    """Validate that processed data correctly represents the raw data.
    
    Args:
        raw_df: DataFrame following PowerConsumptionSchema
        processed_df: DataFrame following ProcessedPowerConsumptionSchema
    
    Returns:
        bool: True if processing is valid
    """
    # Validate continuous features remain unchanged
    for field in ['Temperature', 'Humidity', 'Wind_Speed']:
        if not (raw_df[field] == processed_df[field]).all():
            return False
            
    # Validate one-hot encoding
    raw_days = raw_df['DayOfWeek']
    processed_days = processed_df[[f'DayOfWeek_{i}' for i in range(1, 7)]]
    # Add validation logic here
    
    return True
power_consumption/utils.py (1)

Line range hint 66-89: Consider enhancing function robustness and documentation.

While the implementation is correct, consider these improvements:

  1. Add input validation
  2. Add type hints
  3. Handle edge cases

Here's a suggested implementation:

- def plot_feature_importance(feature_importance, feature_names, top_n=10):
+ def plot_feature_importance(
+     feature_importance: np.ndarray,
+     feature_names: list[str],
+     top_n: int = 10
+ ) -> None:
     """
     Plot feature importance.

     Parameters
     ----------
     feature_importance : array-like
         Feature importance scores.
     feature_names : array-like
         Names of the features.
     top_n : int, optional
         Number of top features to display, by default 10.
+
+    Raises
+    ------
+    ValueError
+        If feature_importance and feature_names have different lengths,
+        or if top_n is greater than the number of features.
     """
+    if len(feature_importance) != len(feature_names):
+        raise ValueError("Length mismatch between feature_importance and feature_names")
+    
+    n_features = len(feature_importance)
+    if top_n > n_features:
+        top_n = n_features
+        print(f"Warning: top_n reduced to {n_features} (total number of features)")
+
     plt.figure(figsize=(10, 6))
     sorted_idx = np.argsort(feature_importance)
     pos = np.arange(sorted_idx[-top_n:].shape[0]) + 0.5
     top_features = feature_importance[sorted_idx[-top_n:]]
     top_names = [feature_names[i] for i in sorted_idx[-top_n:]]
     plt.barh(pos, top_features)
     plt.yticks(pos, top_names)
     plt.title(f"Top {top_n} Feature Importance")
     plt.tight_layout()
     plt.show()
README.md (2)

106-106: Enhance CI pipeline documentation.

Consider adding:

  1. List of enabled pre-commit hooks and their purposes
  2. Expected code coverage thresholds
  3. Instructions for running CI checks locally

111-115: Document Unity Catalog setup and permissions.

The note about UC table availability should include:

  1. Steps to set up required permissions
  2. Table schema and location
  3. Instructions for other developers to access the table
tests/preprocessing/test_data_preprocessor.py (1)

Line range hint 11-50: Consider extracting test data generation into a separate utility.

The sample data generation in the fixture could be moved to a separate test utility module. This would:

  • Make it reusable across different test files
  • Allow for different data generation strategies
  • Make it easier to maintain and update test data patterns

Consider creating a new file tests/utils/test_data_generator.py to handle test data generation.
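
A sketch of what that utility could look like (the target column names are placeholders and would need to match the real raw schema):

# tests/utils/test_data_generator.py (sketch)
import numpy as np
import pandas as pd


def generate_sample_power_data(n_rows: int = 100, seed: int = 42) -> pd.DataFrame:
    """Return a reproducible raw power consumption sample for tests."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {
            "DateTime": pd.date_range("2017-01-01", periods=n_rows, freq="10min"),
            "Temperature": rng.uniform(5, 40, n_rows),
            "Humidity": rng.uniform(10, 90, n_rows),
            "Wind_Speed": rng.uniform(0, 5, n_rows),
            # Placeholder target columns for the three consumption zones.
            "Zone_1": rng.uniform(10_000, 50_000, n_rows),
            "Zone_2": rng.uniform(10_000, 50_000, n_rows),
            "Zone_3": rng.uniform(10_000, 50_000, n_rows),
        }
    )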

notebooks/week2/03_log_and_register_model.py (3)

87-91: Include the Conda Environment When Logging the Model

To ensure reproducibility, it's beneficial to log the conda environment or the pip requirements when logging the model. This captures the dependencies needed to run the model.

Modify the model logging to include the environment:

 mlflow.sklearn.log_model(
     sk_model=pipeline,
     artifact_path="lightgbm-pipeline-model",
     signature=signature,
+    conda_env=mlflow.sklearn.get_default_conda_env()
 )

30-31: Reuse Loaded Spark DataFrame Instead of Reloading

You have already loaded the training dataset into train_set_spark. Instead of loading it again, reuse this variable to improve efficiency.

Update the code to use the existing DataFrame:

- train_set = spark.table(f"{catalog_name}.{schema_name}.train_set").toPandas()
+ train_set = train_set_spark.toPandas()

70-72: Consider Logging Metrics Using MLflow Instead of Printing

While printing metrics is useful for immediate visibility, logging them with MLflow ensures they are tracked and stored for future reference and comparison across runs.

Ensure all relevant metrics are logged to MLflow, as you are already doing in lines 77-79.

notebooks/week2/04_log_and_register_custom_model.py (1)

75-75: Consider retrieving the Git SHA dynamically

Hardcoding the Git SHA can lead to inconsistencies if the codebase updates. To ensure accuracy, retrieve the Git SHA dynamically.

Apply this diff to get the Git SHA automatically:

 import mlflow
+import subprocess
 import pandas as pd
 ...

-git_sha = "30d57afb2efca70cede3061d00f2a553c2b4779b"
+git_sha = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()
power_consumption/preprocessing/data_preprocessor.py (3)

88-89: Validate raw data before feature engineering

The data is being validated after adding time-based features. It's recommended to validate the raw data before performing any transformations to ensure data integrity from the start.

Consider moving the validation step:

 self._clean_column_names()

+self.data = PowerConsumptionSchema.validate(self.data)

 logger.info("Creating time-based features")

 # Convert DateTime to datetime and set as index

110-127: Remove unnecessary random_state parameter when shuffle=False

In the split_data method, since shuffle=False, the random_state parameter has no effect. Including it may cause confusion.

Update the method signature and the call to train_test_split:

-def split_data(self, test_size: float = 0.2, random_state: int = 42) -> Tuple[pd.DataFrame, pd.DataFrame]:
+def split_data(self, test_size: float = 0.2) -> Tuple[pd.DataFrame, pd.DataFrame]:
     """
     Split the DataFrame (self.data) into training and test sets, maintaining temporal order.
     """
-    train_set, test_set = train_test_split(self.data, test_size=test_size, shuffle=False, random_state=random_state)
+    train_set, test_set = train_test_split(self.data, test_size=test_size, shuffle=False)

Also, update the docstring to reflect this change.


129-132: Adjust logging for clarity on date ranges

The logs for the train and test set date ranges might be more informative if formatted with standard date formats.

Consider updating the logging statements:

 logger.info(f"Train set shape: {train_set.shape}")
 logger.info(f"Test set shape: {test_set.shape}")
-logger.info(f"Train set date range: {train_set.index.min()} to {train_set.index.max()}")
-logger.info(f"Test set date range: {test_set.index.min()} to {test_set.index.max()}")
+logger.info(f"Train set date range: {train_set.index.min().strftime('%Y-%m-%d')} to {train_set.index.max().strftime('%Y-%m-%d')}")
+logger.info(f"Test set date range: {test_set.index.min().strftime('%Y-%m-%d')} to {test_set.index.max().strftime('%Y-%m-%d')}")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between bd663cf and c612a55.

📒 Files selected for processing (18)
  • .gitignore (1 hunks)
  • README.md (2 hunks)
  • configs/project_configs.yml (3 hunks)
  • notebooks/week1/week_1.py (1 hunks)
  • notebooks/week2/01_prepare_dataset.py (1 hunks)
  • notebooks/week2/02_mlfow_experiment.py (1 hunks)
  • notebooks/week2/03_log_and_register_model.py (1 hunks)
  • notebooks/week2/04_log_and_register_custom_model.py (1 hunks)
  • notebooks/week2/05_log_and_register_fe_model.py (1 hunks)
  • notebooks/week_1.ipynb (0 hunks)
  • power_consumption/config.py (2 hunks)
  • power_consumption/model/rf_model.py (5 hunks)
  • power_consumption/preprocessing/data_preprocessor.py (3 hunks)
  • power_consumption/schemas/processed_data.py (1 hunks)
  • power_consumption/utils.py (1 hunks)
  • tests/conftest.yml (3 hunks)
  • tests/model/test_rf_model.py (1 hunks)
  • tests/preprocessing/test_data_preprocessor.py (2 hunks)
💤 Files with no reviewable changes (1)
  • notebooks/week_1.ipynb
🧰 Additional context used
🪛 Ruff
notebooks/week2/04_log_and_register_custom_model.py

3-3: numpy imported but unused

Remove unused import: numpy

(F401)


122-122: Found useless expression. Either assign it to a variable or remove it.

(B018)

notebooks/week2/05_log_and_register_fe_model.py

6-6: Undefined name dbutils

(F821)


10-10: Module level import not at top of file

(E402)


10-10: yaml imported but unused

Remove unused import: yaml

(F401)


11-11: Module level import not at top of file

(E402)


12-12: Module level import not at top of file

(E402)


14-14: Module level import not at top of file

(E402)


14-14: datetime.datetime imported but unused

Remove unused import: datetime.datetime

(F401)


16-16: Module level import not at top of file

(E402)


17-17: Module level import not at top of file

(E402)


17-17: pyspark.sql.functions imported but unused

Remove unused import: pyspark.sql.functions

(F401)


19-19: Module level import not at top of file

(E402)


20-20: Module level import not at top of file

(E402)


21-21: Module level import not at top of file

(E402)


23-23: Module level import not at top of file

(E402)


24-24: Module level import not at top of file

(E402)


25-25: Module level import not at top of file

(E402)


26-26: Module level import not at top of file

(E402)


27-27: Module level import not at top of file

(E402)


29-29: Module level import not at top of file

(E402)


31-31: Module level import not at top of file

(E402)

tests/model/test_rf_model.py

4-4: sklearn.compose.ColumnTransformer imported but unused

Remove unused import: sklearn.compose.ColumnTransformer

(F401)


5-5: sklearn.preprocessing.StandardScaler imported but unused

Remove unused import: sklearn.preprocessing.StandardScaler

(F401)


6-6: sklearn.pipeline.Pipeline imported but unused

Remove unused import: sklearn.pipeline.Pipeline

(F401)


11-11: loguru.logger imported but unused

Remove unused import: loguru.logger

(F401)

🔇 Additional comments (20)
configs/project_configs.yml (1)

9-9: LGTM! Clear and consistent section naming.

The renaming from 'features' to 'processed_features' improves clarity and maintains consistency with other project files.

tests/conftest.yml (1)

36-36: Verify the table name in Unity Catalog.

The table name has been changed to tetuan_city_power_consumption. Let's verify this exists in the Unity Catalog.

#!/bin/bash
# Description: Search for references to this table name in the codebase
# Expected: Find consistent table name references

# Search for the table name in configuration and SQL files
rg "tetuan_city_power_consumption" --type yaml --type sql
power_consumption/config.py (2)

32-32: LGTM! Verify YAML configurations are updated.

The rename from features to processed_features in the Config class maintains consistency with the ProcessedFeatures class rename.

Let's verify all YAML configs have been updated accordingly:

#!/bin/bash
# Check if any YAML files still use the old 'features' key
echo "Checking for old 'features' key in YAML files..."
rg -l "features:" --type yaml

# Check for new 'processed_features' key
echo "Verifying presence of new 'processed_features' key..."
rg -l "processed_features:" --type yaml

23-25: 🛠️ Refactor suggestion

Review data model design and add validations.

A few concerns about the current implementation:

  1. num_features appears to be duplicated in both ProcessedFeatures and Dataset classes
  2. raw_data_table might benefit from format validation

Consider these improvements:

  1. To avoid duplication, either:

    • Move all feature definitions to ProcessedFeatures
    • Or document why they exist in both places
  2. Add validation for raw_data_table:

from pydantic import Field, validator

class Dataset(BaseModel):
    raw_data_table: str = Field(..., description="Unity Catalog table name")
    num_features: List[str]
    cat_features: List[str]

    @validator('raw_data_table')
    def validate_table_name(cls, v):
        if not v.replace('_', '').isalnum():
            raise ValueError('Table name must be alphanumeric (underscores allowed)')
        return v

Let me verify the feature duplication:

.gitignore (1)

105-106: LGTM! Correctly ignoring MLflow artifacts.

Adding mlruns/ to .gitignore follows MLflow best practices as these directories contain experiment metadata and model artifacts that should not be version controlled.

notebooks/week2/02_mlfow_experiment.py (1)

1-53: Verify Unity Catalog integration.

The PR objectives mention saving tables to Unity Catalog, but this implementation only shows MLflow experiment tracking. Please ensure the Unity Catalog integration is implemented in other files or add it to this notebook.

power_consumption/schemas/processed_data.py (1)

38-43: ⚠️ Potential issue

Verify the completeness of DayOfWeek encoding.

The current implementation only includes 6 binary columns for DayOfWeek (1-6), which seems to be missing one day. This could lead to information loss or incorrect encoding of the seventh day.

Consider adding the missing day:

     DayOfWeek_4: Series[Category] = Field(nullable=True)
     DayOfWeek_5: Series[Category] = Field(nullable=True)
     DayOfWeek_6: Series[Category] = Field(nullable=True)
+    DayOfWeek_7: Series[Category] = Field(nullable=True)
✅ Verification successful

The DayOfWeek encoding is correct and complete

Looking at the preprocessing code, we can see that DayOfWeek is derived from index.dayofweek which uses Python's datetime convention where days are numbered from 0 to 6 (Monday=0 through Sunday=6). The OneHotEncoder is configured with drop="first", which means it drops the first category (0/Monday) to avoid multicollinearity. This explains why we only see DayOfWeek_1 through DayOfWeek_6, as they represent Tuesday through Sunday, with Monday being the reference category that was dropped.
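
For reference, a standalone illustration of that encoding behaviour (assumes scikit-learn >= 1.2 for the sparse_output argument):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

days = pd.DataFrame({"DayOfWeek": [0, 1, 2, 3, 4, 5, 6]})  # Monday=0 ... Sunday=6

encoder = OneHotEncoder(drop="first", sparse_output=False)
encoder.fit(days)

print(encoder.get_feature_names_out())
# ['DayOfWeek_1' 'DayOfWeek_2' ... 'DayOfWeek_6']
# Monday (0) is the dropped reference category, so only six indicator columns remain.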

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check the usage of DayOfWeek in the codebase to understand the encoding
rg -A 5 "DayOfWeek" --type python

Length of output: 64


Script:

#!/bin/bash
# Let's try searching without file type restriction
rg -A 5 "DayOfWeek"

Length of output: 6040


Script:

#!/bin/bash
# Search for any preprocessing or encoding related files
fd -e py | grep -i "preprocess\|encode\|transform"

Length of output: 227


Script:

#!/bin/bash
# Look for data processing or feature engineering files
fd -e py | grep -i "feature\|process\|data"

Length of output: 302


Script:

#!/bin/bash
# Check the content of the schema file
cat power_consumption/schemas/processed_data.py

Length of output: 2189


Script:

#!/bin/bash
# Let's check how DayOfWeek is processed and encoded
cat power_consumption/preprocessing/data_preprocessor.py

Length of output: 6737

power_consumption/utils.py (1)

83-86: LGTM! Clean refactoring of the plotting logic.

The separation of data preparation from plotting logic improves code readability and maintainability.

README.md (2)

116-124: Improve error handling documentation.

The permission denied error workaround should:

  1. Explain why the IDE vs. workspace behavior differs
  2. Document the minimum required permissions
  3. Provide steps to request necessary access
#!/bin/bash
# Check for other potential permission-related issues in the codebase
rg -i "permission" --type py

32-35: Consider documenting a more robust environment setup solution.

While the workaround addresses immediate issues, consider:

  1. Explicitly documenting the recommended Python version (3.11.10)
  2. Investigating and documenting the root cause of the pywin32/mlflow conflict
  3. Adding platform-specific instructions (Windows vs. Unix)
tests/preprocessing/test_data_preprocessor.py (2)

5-7: LGTM! Good practice using schema validation.

The addition of ProcessedPowerConsumptionSchema import and SparkSession demonstrates good practice by ensuring data consistency through schema validation before saving to Unity Catalog.


Line range hint 11-50: Well-structured test fixtures with comprehensive test data!

The fixtures are well-implemented with:

  • Comprehensive sample data covering all required columns
  • Good use of date ranges and random data for testing
  • Clear documentation with proper parameter and return type descriptions
power_consumption/model/rf_model.py (3)

5-5: ⚠️ Potential issue

Check compatibility of Self type hint with Python version

The use of Self from the typing module is only available in Python 3.11 and above. Please ensure that the project's Python version is 3.11 or later. If compatibility with earlier versions is needed, consider replacing Self with 'ConsumptionModel' in the type annotations.


46-46: ⚠️ Potential issue

Verify use of Self in return type annotation

The return type annotation -> Self in the train method requires Python 3.11 or later. For compatibility with earlier Python versions, you might replace it with 'ConsumptionModel'.
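
If support for Python versions below 3.11 is required, two common alternatives (sketch):

# Option 1: use the backport from typing_extensions (available for Python 3.8+).
from typing_extensions import Self


class ConsumptionModel:
    def train(self, X, y) -> Self:
        # ... fit the underlying estimator ...
        return self


# Option 2: a string (forward-reference) annotation, with no extra dependency.
class ConsumptionModelAlt:
    def train(self, X, y) -> "ConsumptionModelAlt":
        return self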


111-111: ⚠️ Potential issue

Reevaluate feature importance aggregation method

Averaging feature importances across all estimators may not accurately capture the significance of features for each target variable in a multi-output regression. Consider returning feature importances for each output separately or using a weighted average if appropriate.
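
One way to obtain per-target importances, assuming the model can be expressed as a MultiOutputRegressor wrapping one regressor per target (a sketch, not the current ConsumptionModel implementation):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

X, y = make_regression(n_samples=200, n_features=5, n_targets=3, random_state=0)

model = MultiOutputRegressor(RandomForestRegressor(n_estimators=50, random_state=0))
model.fit(X, y)

# One importance vector per target instead of a single averaged vector.
per_target = np.vstack([est.feature_importances_ for est in model.estimators_])
print(per_target.shape)  # (3, 5): targets x features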

tests/model/test_rf_model.py (1)

16-28: Sample data fixture is well-defined and ensures reproducibility

The sample_data fixture generates a reproducible DataFrame with relevant features, which is excellent for consistent testing.

notebooks/week2/03_log_and_register_model.py (2)

80-81: Verify Correctness of Model Signature Inference

Ensure that the inferred model signature accurately represents the model's input and output schema. Using y_train instead of y_pred when inferring the signature may provide a more accurate representation.

Please verify if y_pred has the same schema as y_train. If not, consider updating the inference as follows:

- signature = infer_signature(model_input=X_train, model_output=y_pred)
+ signature = infer_signature(model_input=X_train, model_output=y_train)

82-85: Check the Usage of mlflow.data.from_spark

The method mlflow.data.from_spark may not be a standard MLflow API or might differ based on the MLflow version. Confirm that this function is appropriate for logging your dataset and that it behaves as expected.

Please ensure that mlflow.data.from_spark is the correct method for creating a dataset input in your MLflow version. If not, consider alternative approaches for logging dataset information.

power_consumption/preprocessing/data_preprocessor.py (1)

153-157: Ensure consistent timestamps for train and test sets

The timestamp variable is calculated once and applied to both the train and test sets, which is good for consistency. However, ensure that the transformation from pandas DataFrame to Spark DataFrame preserves the data accurately, especially the index.

To verify that the index is preserved during the conversion, you can run the following script:

✅ Verification successful

Based on the test file content and previous search results, I can now provide a final response:

Index preservation is properly handled during DataFrame conversion

The code correctly preserves the DateTime index during the pandas to Spark DataFrame conversion, as evidenced by:

  • The DateTime column is explicitly set as the index using set_index("DateTime", inplace=True)
  • The test suite validates that:
    • The index is properly maintained (assert data_processor.data.index.name == "DateTime")
    • Both train and test sets maintain temporal ordering (assert train_set.index[0] < test_set.index[0])
    • Schema validation is performed on both datasets before and after the split
  • The data is properly sorted by index before processing (self.data = self.data.sort_index())
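
For completeness, a minimal sketch of the conversion pattern being verified here (the timestamp column name is a placeholder):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

train_set = pd.DataFrame(
    {"Temperature": [20.1, 21.3]},
    index=pd.date_range("2017-01-01", periods=2, freq="h", name="DateTime"),
)

# reset_index() turns the DateTime index back into a column so it survives the
# pandas -> Spark conversion; one shared timestamp keeps both splits consistent.
timestamp = pd.Timestamp.utcnow().isoformat()
train_spark = spark.createDataFrame(train_set.reset_index()).withColumn(
    "update_timestamp_utc", F.lit(timestamp)
)
train_spark.show()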
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the index is preserved when converting pandas DataFrame to Spark DataFrame.

# Check if index is included as a column
# Expect: The index should be present as a column in the Spark DataFrame.

# Example command using pyspark (adjusted for script context)
# Note: This is a placeholder; actual verification may require additional context.

# Save sample pandas DataFrame to CSV
python -c "
import pandas as pd
df = pd.DataFrame({'A': [1,2,3]}, index=pd.date_range('2021-01-01', periods=3, freq='D'))
df.to_csv('sample.csv')
"

# Read CSV with Spark and check columns
spark-shell -e "
val df = spark.read.option('header', 'true').csv('sample.csv')
df.show()
df.printSchema()
"

Length of output: 457


Script:

#!/bin/bash
# Let's examine the code to understand how the DataFrame conversion is handled
rg -A 5 "createDataFrame" power_consumption/preprocessing/data_preprocessor.py

Length of output: 462


Script:

#!/bin/bash
# Check if there's any custom index handling logic
rg -B 3 -A 3 "index" power_consumption/preprocessing/data_preprocessor.py

Length of output: 1586


Script:

#!/bin/bash
# Look for any DataFrame schema definitions or column specifications
ast-grep --pattern 'withColumn($$$)'

Length of output: 38


Script:

#!/bin/bash
# Check if there are any test files that might show the expected behavior
fd "test.*data.*process.*\.py$"

Length of output: 79


Script:

#!/bin/bash
# Let's check the test file to see how the DataFrame conversion is tested
cat tests/preprocessing/test_data_preprocessor.py

Length of output: 6169


Script:

#!/bin/bash
# Check for any schema definitions that might show expected columns
rg -B 2 -A 5 "class.*Schema" power_consumption/preprocessing/

Length of output: 63

notebooks/week2/05_log_and_register_fe_model.py (1)

6-6: Ensure 'dbutils' is available before use

The variable dbutils may not be defined in the current context, which could lead to a NameError when calling dbutils.library.restartPython(). In Databricks notebooks, dbutils should be available, but if you're running this script outside of a notebook environment, make sure to import or define dbutils appropriately.

🧰 Tools
🪛 Ruff

6-6: Undefined name dbutils

(F821)

Resolved review threads: configs/project_configs.yml, tests/conftest.yml (2), notebooks/week2/01_prepare_dataset.py (2)
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
.github/workflows/ci.yml (1)

31-31: LGTM! Consider documenting coverage requirements

The test configuration looks good:

  • Addition of databricks-sdk aligns with the Unity Catalog integration
  • Coverage threshold and reporting are well configured

Consider adding a comment in the workflow file explaining the 70% coverage requirement to help future contributors understand this threshold.

+          # Enforce minimum 70% test coverage across the codebase
           uv pip install pytest pytest-mock pytest-cov databricks-sdk
           uv run pytest --cov=power_consumption --cov-report=term-missing --cov-fail-under=70 tests/
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between c612a55 and beccea3.

📒 Files selected for processing (1)
  • .github/workflows/ci.yml (2 hunks)
🔇 Additional comments (1)
.github/workflows/ci.yml (1)

21-22: Verify requirements file location for uv sync

The uv sync command requires a requirements file, but none is explicitly specified. Please ensure that a requirements.txt or similar file exists in the repository root.

✅ Verification successful

Let me try another search strategy to verify the requirements files and package management setup.

This will help us understand:

  1. If there are any Python package management files that might be used by uv
  2. The complete context of how uv is configured in the workflow

Let me check the content of pyproject.toml to understand the package dependencies configuration.

This will help us verify if the dependencies are properly configured in pyproject.toml, which is the modern standard for Python package management and is supported by uv.


uv sync command is correctly configured

The dependency installation is properly set up. The uv sync command will install dependencies from pyproject.toml, which clearly defines both main dependencies and dev dependencies. The workflow also correctly installs additional test dependencies via separate uv pip install commands for pre-commit and pytest-related packages.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for requirements files in the repository
echo "Searching for requirements files..."
fd -e txt -e in "requirements"

Length of output: 111


Script:

#!/bin/bash
# Check for pyproject.toml and other Python package management files
echo "Searching for Python package management files..."
fd "pyproject.toml|requirements|setup.py"

# Look for uv configuration in workflow file
echo -e "\nChecking uv configuration in workflow:"
rg "uv" .github/workflows/ci.yml -B2 -A2

Length of output: 890


Script:

#!/bin/bash
# Check pyproject.toml content
echo "Content of pyproject.toml:"
cat pyproject.toml

Length of output: 1458

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 5

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between beccea3 and ea29214.

📒 Files selected for processing (2)
  • .github/workflows/ci.yml (2 hunks)
  • tests/preprocessing/test_data_preprocessor.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/ci.yml
🧰 Additional context used
🪛 Ruff
tests/preprocessing/test_data_preprocessor.py

7-7: pyspark.sql.functions imported but unused

Remove unused import: pyspark.sql.functions

(F401)

@basakeskili left a comment

Nice!

@mvechtomova left a comment

Great job! There are some minor things you could improve: the id must be a string, not an integer; please check it, as it may cause problems for serving.

Another point is about the Python testing requirements.

I will approve to unblock you.

Resolved review threads: .github/workflows/ci.yml, README.md, power_consumption/config.py, tests/preprocessing/test_data_preprocessor.py
@Garett601 Garett601 merged commit f8f9255 into main Oct 30, 2024
2 checks passed
@Garett601 Garett601 deleted the feature/week2 branch October 30, 2024 05:54
@coderabbitai coderabbitai bot mentioned this pull request on Nov 11, 2024
@coderabbitai coderabbitai bot mentioned this pull request on Nov 22, 2024
Labels: enhancement (New feature or request)
Projects: none
3 participants