
Feature/week2 #2

Merged: 22 commits merged into main from feature/week2 on Oct 30, 2024

Conversation

@Garett601 (Collaborator) commented on Oct 26, 2024

Description

Saved tables to Unity Catalog (UC), created MLflow experiments, and logged models (standard, custom, and with feature engineering). The steps are listed in the notebooks.
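
For reviewers who want the gist without opening the notebooks, a minimal sketch of the overall pattern (the catalog, schema, table, and experiment names below are placeholders, not the values from configs/project_configs.yml):

import mlflow
import pandas as pd
from pyspark.sql import SparkSession
from sklearn.linear_model import LinearRegression

spark = SparkSession.builder.getOrCreate()

# Toy training split; the real one comes from the DataProcessor.
train_set = pd.DataFrame({"Temperature": [20.1, 21.3, 19.8], "target": [30.0, 32.5, 29.1]})

# 1. Persist the split as a Unity Catalog table (placeholder three-level name).
spark.createDataFrame(train_set).write.mode("overwrite").saveAsTable(
    "my_catalog.power_consumption.train_set"
)

# 2. Track a run in an MLflow experiment, then log and register the model.
# On Databricks, registering into UC also needs: mlflow.set_registry_uri("databricks-uc")
mlflow.set_experiment("/Shared/power-consumption")  # placeholder experiment path
with mlflow.start_run() as run:
    model = LinearRegression().fit(train_set[["Temperature"]], train_set["target"])
    mlflow.log_metric("train_r2", model.score(train_set[["Temperature"]], train_set["target"]))
    mlflow.sklearn.log_model(model, artifact_path="model")

mlflow.register_model(f"runs:/{run.info.run_id}/model", "my_catalog.power_consumption.power_model")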

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Updated tests for refactored code. Added code coverage checks in the CI pipeline.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced .gitignore to ignore JSON files and MLflow directories.
    • Updated README.md with environment setup instructions and course logistics.
    • Introduced new data processing and modeling workflows in week_1.py and 01_prepare_dataset.py.
    • Added MLflow experiment management functionality in 02_mlfow_experiment.py.
    • Implemented model logging and registration processes in 03_log_and_register_model.py and 04_log_and_register_custom_model.py.
    • Added feature engineering model logging in 05_log_and_register_fe_model.py.
    • Introduced a new schema for processed power consumption data.
  • Bug Fixes

    • Resolved environment setup issues documented in the README.
  • Documentation

    • Comprehensive updates to the README for clarity on course logistics and setup processes.
  • Tests

    • Refactored tests for ConsumptionModel and DataProcessor to align with new data handling structures.

…): refactor code based on new config and separate data loading
Structure repo for clear separation of weekly tasks
convert to Databricks notebook and update code to new refactored implementation
convert to Databricks notebook and update code to new refactored implementation

chore: Actually commit the notebook this time

refactor(week_1.py): update Databricks implementation

Squashed commits: refactor week 1 notebook to .py format, convert it to Databricks, and ensure changes are committed.
Save training and test sets to catalog. Note the "ALTER TABLE ... SET TBLPROPERTIES (delta.enableChangeDataFeed = true);"
load data from UC table, preprocess data, save train and test set to UC tables
create, log and register a custom model with the mlflow.pyfunc.model flavor
… to UC

Made an oops and saved the table when the DateTime was the index, so I lost that information down the line which made FeatureLookup impossible
create a feature table and generate a new feature at runtime
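
The "ALTER TABLE ... SET TBLPROPERTIES" note in the commits above refers to enabling the Delta change data feed on the saved tables. A minimal sketch of that step (catalog and schema names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable the change data feed on the train/test tables saved to Unity Catalog.
for table in ("train_set", "test_set"):
    spark.sql(
        f"ALTER TABLE my_catalog.power_consumption.{table} "
        "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
    )
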
@Garett601 Garett601 added the enhancement New feature or request label Oct 26, 2024
@Garett601 Garett601 self-assigned this Oct 26, 2024
@Garett601 Garett601 requested a review from a team as a code owner October 26, 2024 04:18
coderabbitai bot (Contributor) commented on Oct 26, 2024

Walkthrough

This pull request introduces changes across many files, including updates to .gitignore, README.md, and the configuration files, plus new scripts for data processing and machine learning workflows. Key modifications include the addition of JSON and MLflow entries to .gitignore, restructuring of feature definitions in the configuration files, and new data processing scripts that use PySpark and MLflow. The updates aim to improve data handling, model training, and evaluation, alongside enhancements to the CI pipeline and documentation.

Changes

  • .gitignore: Added entries to ignore *.json files and the mlruns/ directory.
  • README.md: Updated course logistics, environment setup, and CI pipeline commands; added sections for weeks 1 and 2.
  • configs/project_configs.yml: Renamed features to processed_features; updated categorical features and dataset definitions.
  • notebooks/week1/week_1.py: New file implementing the data processing and modeling workflow for power consumption analysis.
  • notebooks/week2/01_prepare_dataset.py: New file for the data preparation workflow using PySpark and Pandas.
  • notebooks/week2/02_mlfow_experiment.py: New file for managing MLflow experiments in Databricks.
  • notebooks/week2/03_log_and_register_model.py: New file implementing MLflow and LightGBM for power consumption prediction.
  • notebooks/week2/04_log_and_register_custom_model.py: New file for logging and registering a custom ML model with MLflow.
  • notebooks/week2/05_log_and_register_fe_model.py: New file for an ML pipeline using feature engineering and model registration.
  • notebooks/week_1.ipynb: Deleted Jupyter notebook that implemented a workflow for processing power consumption data.
  • power_consumption/config.py: Renamed Features to ProcessedFeatures; updated the Dataset class structure.
  • power_consumption/model/rf_model.py: Updated the ConsumptionModel class to simplify initialization and change method signatures to use DataFrames.
  • power_consumption/preprocessing/data_preprocessor.py: Modified the DataProcessor class to accept DataFrames directly and updated preprocessing methods.
  • power_consumption/schemas/processed_data.py: Added ProcessedPowerConsumptionSchema to represent the processed dataset with expanded fields.
  • power_consumption/utils.py: Refactored the plot_feature_importance function for clarity in data preparation.
  • tests/conftest.yml: Updated feature and dataset properties in the test configuration.
  • tests/model/test_rf_model.py: Refactored tests for ConsumptionModel to align with the new data handling and configuration structure.
  • tests/preprocessing/test_data_preprocessor.py: Enhanced tests for DataProcessor, added new tests, and removed outdated ones.

Possibly related PRs

  • Feature/week1 #1: both PRs update the .gitignore entries that specify which files and directories Git should ignore.

Suggested reviewers

  • basakeskili

Poem

🐰 In the garden where data grows,
New paths and features, the rabbit knows.
With JSON and MLFlow in sight,
We hop along, coding with delight!
Preprocessing with care, we’ll never tire,
For power consumption, we’ll reach higher! 🌱



@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 47

🧹 Outside diff range and nitpick comments (22)
configs/project_configs.yml (3)

36-49: Consider restructuring to avoid feature list duplication.

The current configuration duplicates feature lists between processed_features and dataset sections, which could lead to maintenance issues. While the distinction between raw and processed features is important, consider these improvements:

  1. Rename the sections to clearly indicate their purpose:

    • raw_features instead of dataset.{num,cat}_features
    • processed_features (current)
  2. Or, consider consolidating the feature definitions:

features:
  raw:
    num_features: [...]
    cat_features: [...]
  processed:
    num_features: [...]
    cat_features: [...]

This would make the relationship between raw and processed features more explicit and reduce the risk of inconsistencies during updates.


Line range hint 1-8: Add documentation for MLflow experiment tracking.

Given that this PR implements MLflow experiments, consider adding:

  1. Comments explaining the hyperparameters and their valid ranges
  2. MLflow experiment naming configuration
  3. Documentation about where these hyperparameters are used in the training pipeline

Example addition:

# MLflow experiment tracking configuration
mlflow:
  experiment_name: "power_consumption_training"
  experiment_description: "Power consumption prediction model training"

# Model hyperparameters
hyperparameters:
  learning_rate: 0.01  # Learning rate for gradient descent (range: 0.001 to 0.1)
  n_estimators: 1000   # Number of trees in the forest (range: 100 to 2000)
  max_depth: 6         # Maximum depth of each tree (range: 3 to 10)

Line range hint 1-2: Enhance Unity Catalog configuration.

Since this PR implements saving tables to Unity Catalog, consider adding:

  1. Table location configuration
  2. Access control settings
  3. Documentation about the catalog structure

Example addition:

unity_catalog:
  catalog_name: heiaepgah71pwedmld01001
  schema_name: power_consumption
  tables:
    raw_data:
      name: tetuan_city_power_consumption
      location: "s3://bucket/path"
    processed_data:
      name: processed_power_consumption
      location: "s3://bucket/processed/path"
  access_control:
    owner: "data_science_team"
    grants:
      - principal: "ML_engineers"
        permissions: ["SELECT", "MODIFY"]
power_consumption/config.py (2)

13-15: LGTM! Clear and descriptive class renaming.

The rename from Features to ProcessedFeatures better reflects the post-processing nature of these features, improving code clarity.

Consider adding docstring documentation to clarify:

  • The purpose of this class in the ML pipeline
  • The expected format/requirements for num_features and cat_features
  • Any validation rules or constraints
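
A possible shape for that docstring, sketched against the field names mentioned in this PR (the wording and the stated constraints are illustrative):

from typing import List

from pydantic import BaseModel


class ProcessedFeatures(BaseModel):
    """Feature names available after preprocessing.

    num_features: continuous columns passed to the model as-is.
    cat_features: one-hot encoded categorical columns (e.g. DayOfWeek_*, IsWeekend_*).
    Both lists are expected to be non-empty and free of duplicates.
    """

    num_features: List[str]
    cat_features: List[str]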

Line range hint 1-58: Consider enhancing configuration validation for ML pipeline.

The configuration structure is well-organized, but could benefit from additional MLflow-specific validations since this PR introduces MLflow experiments.

Consider adding:

  1. MLflow experiment tracking configurations
  2. Model registry configurations for Unity Catalog
  3. Version tracking for feature definitions

This would help maintain consistency between your ML experiments and the corresponding configurations.

notebooks/week2/01_prepare_dataset.py (3)

6-6: Add error handling for SparkSession creation.

Consider adding try-except block to handle potential SparkSession creation failures gracefully.

-spark = SparkSession.builder.getOrCreate()
+try:
+    spark = SparkSession.builder.getOrCreate()
+except Exception as e:
+    print(f"Failed to create SparkSession: {str(e)}")
+    raise

14-16: Validate configuration values.

Add validation for critical configuration values to fail fast if they're missing or invalid.

+# Validate critical configuration
+if not all([catalog_name, schema_name, raw_data_table]):
+    raise ValueError("Missing required configuration values")

1-32: Consider moving core logic to proper Python modules.

While notebooks are great for development and visualization, critical data processing logic should reside in proper Python modules. This notebook should focus on orchestration and visualization, while the core logic should be in the power_consumption package.

Benefits:

  • Better testability
  • Improved reusability
  • Easier version control
  • Better separation of concerns

Consider:

  1. Moving data validation logic to DataProcessor
  2. Creating a dedicated configuration validator
  3. Adding a logging configuration module
  4. Creating a dedicated catalog interaction module
power_consumption/schemas/processed_data.py (3)

27-28: Add more detailed documentation for the processed schema.

The docstring should explain the preprocessing steps applied and the encoding scheme used for categorical variables.

 class ProcessedPowerConsumptionSchema(DataFrameModel):
-    """Processed power consumption dataset schema."""
+    """Processed power consumption dataset schema.
+    
+    This schema represents the processed version of power consumption data where:
+    - Categorical variables (DayOfWeek, IsWeekend) are one-hot encoded
+    - All features are properly typed and validated
+    - Missing values are handled consistently
+    """

30-47: Consider adding field descriptions using Field metadata.

The schema would benefit from explicit documentation of what each field represents, especially for the encoded categorical variables.

Example for a few fields:

-    Temperature: Series[float] = Field(nullable=True)
+    Temperature: Series[float] = Field(nullable=True, description="Temperature in Celsius")
-    DayOfWeek_1: Series[Category] = Field(nullable=True)
+    DayOfWeek_1: Series[Category] = Field(nullable=True, description="Binary indicator for Monday (1 if Monday, 0 otherwise)")
-    IsWeekend_1: Series[Category] = Field(nullable=True)
+    IsWeekend_1: Series[Category] = Field(nullable=True, description="Binary indicator for weekend days (1 if weekend, 0 if weekday)")

27-50: Consider adding data validation between raw and processed schemas.

The relationship between raw and processed data should be validated to ensure consistent preprocessing.

Consider adding class methods to validate the transformation:

@classmethod
def validate_processing(cls, raw_df: PowerConsumptionSchema, processed_df: "ProcessedPowerConsumptionSchema") -> bool:
    """Validate that processed data correctly represents the raw data.
    
    Args:
        raw_df: DataFrame following PowerConsumptionSchema
        processed_df: DataFrame following ProcessedPowerConsumptionSchema
    
    Returns:
        bool: True if processing is valid
    """
    # Validate continuous features remain unchanged
    for field in ['Temperature', 'Humidity', 'Wind_Speed']:
        if not (raw_df[field] == processed_df[field]).all():
            return False
            
    # Validate one-hot encoding
    raw_days = raw_df['DayOfWeek']
    processed_days = processed_df[[f'DayOfWeek_{i}' for i in range(1, 7)]]
    # Add validation logic here
    
    return True
power_consumption/utils.py (1)

Line range hint 66-89: Consider enhancing function robustness and documentation.

While the implementation is correct, consider these improvements:

  1. Add input validation
  2. Add type hints
  3. Handle edge cases

Here's a suggested implementation:

- def plot_feature_importance(feature_importance, feature_names, top_n=10):
+ def plot_feature_importance(
+     feature_importance: np.ndarray,
+     feature_names: list[str],
+     top_n: int = 10
+ ) -> None:
     """
     Plot feature importance.

     Parameters
     ----------
     feature_importance : array-like
         Feature importance scores.
     feature_names : array-like
         Names of the features.
     top_n : int, optional
         Number of top features to display, by default 10.
+
+    Raises
+    ------
+    ValueError
+        If feature_importance and feature_names have different lengths,
+        or if top_n is greater than the number of features.
     """
+    if len(feature_importance) != len(feature_names):
+        raise ValueError("Length mismatch between feature_importance and feature_names")
+    
+    n_features = len(feature_importance)
+    if top_n > n_features:
+        top_n = n_features
+        print(f"Warning: top_n reduced to {n_features} (total number of features)")
+
     plt.figure(figsize=(10, 6))
     sorted_idx = np.argsort(feature_importance)
     pos = np.arange(sorted_idx[-top_n:].shape[0]) + 0.5
     top_features = feature_importance[sorted_idx[-top_n:]]
     top_names = [feature_names[i] for i in sorted_idx[-top_n:]]
     plt.barh(pos, top_features)
     plt.yticks(pos, top_names)
     plt.title(f"Top {top_n} Feature Importance")
     plt.tight_layout()
     plt.show()
README.md (2)

106-106: Enhance CI pipeline documentation.

Consider adding:

  1. List of enabled pre-commit hooks and their purposes
  2. Expected code coverage thresholds
  3. Instructions for running CI checks locally

111-115: Document Unity Catalog setup and permissions.

The note about UC table availability should include:

  1. Steps to set up required permissions
  2. Table schema and location
  3. Instructions for other developers to access the table
tests/preprocessing/test_data_preprocessor.py (1)

Line range hint 11-50: Consider extracting test data generation into a separate utility.

The sample data generation in the fixture could be moved to a separate test utility module. This would:

  • Make it reusable across different test files
  • Allow for different data generation strategies
  • Make it easier to maintain and update test data patterns

Consider creating a new file tests/utils/test_data_generator.py to handle test data generation.
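
A sketch of what that utility could look like (the target column names are placeholders and would need to match the real raw schema):

# tests/utils/test_data_generator.py (sketch)
import numpy as np
import pandas as pd


def generate_sample_power_data(n_rows: int = 100, seed: int = 42) -> pd.DataFrame:
    """Return a reproducible raw power consumption sample for tests."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {
            "DateTime": pd.date_range("2017-01-01", periods=n_rows, freq="10min"),
            "Temperature": rng.uniform(5, 40, n_rows),
            "Humidity": rng.uniform(10, 90, n_rows),
            "Wind_Speed": rng.uniform(0, 5, n_rows),
            # Placeholder target columns for the three consumption zones.
            "Zone_1": rng.uniform(10_000, 50_000, n_rows),
            "Zone_2": rng.uniform(10_000, 50_000, n_rows),
            "Zone_3": rng.uniform(10_000, 50_000, n_rows),
        }
    )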

notebooks/week2/03_log_and_register_model.py (3)

87-91: Include the Conda Environment When Logging the Model

To ensure reproducibility, it's beneficial to log the conda environment or the pip requirements when logging the model. This captures the dependencies needed to run the model.

Modify the model logging to include the environment:

 mlflow.sklearn.log_model(
     sk_model=pipeline,
     artifact_path="lightgbm-pipeline-model",
     signature=signature,
+    conda_env=mlflow.sklearn.get_default_conda_env()
 )

30-31: Reuse Loaded Spark DataFrame Instead of Reloading

You have already loaded the training dataset into train_set_spark. Instead of loading it again, reuse this variable to improve efficiency.

Update the code to use the existing DataFrame:

- train_set = spark.table(f"{catalog_name}.{schema_name}.train_set").toPandas()
+ train_set = train_set_spark.toPandas()

70-72: Consider Logging Metrics Using MLflow Instead of Printing

While printing metrics is useful for immediate visibility, logging them with MLflow ensures they are tracked and stored for future reference and comparison across runs.

Ensure all relevant metrics are logged to MLflow, as you are already doing in lines 77-79.

notebooks/week2/04_log_and_register_custom_model.py (1)

75-75: Consider retrieving the Git SHA dynamically

Hardcoding the Git SHA can lead to inconsistencies if the codebase updates. To ensure accuracy, retrieve the Git SHA dynamically.

Apply this diff to get the Git SHA automatically:

 import mlflow
+import subprocess
 import pandas as pd
 ...

-git_sha = "30d57afb2efca70cede3061d00f2a553c2b4779b"
+git_sha = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()
power_consumption/preprocessing/data_preprocessor.py (3)

88-89: Validate raw data before feature engineering

The data is being validated after adding time-based features. It's recommended to validate the raw data before performing any transformations to ensure data integrity from the start.

Consider moving the validation step:

 self._clean_column_names()

+self.data = PowerConsumptionSchema.validate(self.data)

 logger.info("Creating time-based features")

 # Convert DateTime to datetime and set as index

110-127: Remove unnecessary random_state parameter when shuffle=False

In the split_data method, since shuffle=False, the random_state parameter has no effect. Including it may cause confusion.

Update the method signature and the call to train_test_split:

-def split_data(self, test_size: float = 0.2, random_state: int = 42) -> Tuple[pd.DataFrame, pd.DataFrame]:
+def split_data(self, test_size: float = 0.2) -> Tuple[pd.DataFrame, pd.DataFrame]:
     """
     Split the DataFrame (self.data) into training and test sets, maintaining temporal order.
     """
-    train_set, test_set = train_test_split(self.data, test_size=test_size, shuffle=False, random_state=random_state)
+    train_set, test_set = train_test_split(self.data, test_size=test_size, shuffle=False)

Also, update the docstring to reflect this change.


129-132: Adjust logging for clarity on date ranges

The logs for the train and test set date ranges might be more informative if formatted with standard date formats.

Consider updating the logging statements:

 logger.info(f"Train set shape: {train_set.shape}")
 logger.info(f"Test set shape: {test_set.shape}")
-logger.info(f"Train set date range: {train_set.index.min()} to {train_set.index.max()}")
-logger.info(f"Test set date range: {test_set.index.min()} to {test_set.index.max()}")
+logger.info(f"Train set date range: {train_set.index.min().strftime('%Y-%m-%d')} to {train_set.index.max().strftime('%Y-%m-%d')}")
+logger.info(f"Test set date range: {test_set.index.min().strftime('%Y-%m-%d')} to {test_set.index.max().strftime('%Y-%m-%d')}")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between bd663cf and c612a55.

📒 Files selected for processing (18)
  • .gitignore (1 hunks)
  • README.md (2 hunks)
  • configs/project_configs.yml (3 hunks)
  • notebooks/week1/week_1.py (1 hunks)
  • notebooks/week2/01_prepare_dataset.py (1 hunks)
  • notebooks/week2/02_mlfow_experiment.py (1 hunks)
  • notebooks/week2/03_log_and_register_model.py (1 hunks)
  • notebooks/week2/04_log_and_register_custom_model.py (1 hunks)
  • notebooks/week2/05_log_and_register_fe_model.py (1 hunks)
  • notebooks/week_1.ipynb (0 hunks)
  • power_consumption/config.py (2 hunks)
  • power_consumption/model/rf_model.py (5 hunks)
  • power_consumption/preprocessing/data_preprocessor.py (3 hunks)
  • power_consumption/schemas/processed_data.py (1 hunks)
  • power_consumption/utils.py (1 hunks)
  • tests/conftest.yml (3 hunks)
  • tests/model/test_rf_model.py (1 hunks)
  • tests/preprocessing/test_data_preprocessor.py (2 hunks)
💤 Files with no reviewable changes (1)
  • notebooks/week_1.ipynb
🧰 Additional context used
🪛 Ruff
notebooks/week2/04_log_and_register_custom_model.py

3-3: numpy imported but unused

Remove unused import: numpy

(F401)


122-122: Found useless expression. Either assign it to a variable or remove it.

(B018)

notebooks/week2/05_log_and_register_fe_model.py

6-6: Undefined name dbutils

(F821)


10-10: Module level import not at top of file

(E402)


10-10: yaml imported but unused

Remove unused import: yaml

(F401)


11-11: Module level import not at top of file

(E402)


12-12: Module level import not at top of file

(E402)


14-14: Module level import not at top of file

(E402)


14-14: datetime.datetime imported but unused

Remove unused import: datetime.datetime

(F401)


16-16: Module level import not at top of file

(E402)


17-17: Module level import not at top of file

(E402)


17-17: pyspark.sql.functions imported but unused

Remove unused import: pyspark.sql.functions

(F401)


19-19: Module level import not at top of file

(E402)


20-20: Module level import not at top of file

(E402)


21-21: Module level import not at top of file

(E402)


23-23: Module level import not at top of file

(E402)


24-24: Module level import not at top of file

(E402)


25-25: Module level import not at top of file

(E402)


26-26: Module level import not at top of file

(E402)


27-27: Module level import not at top of file

(E402)


29-29: Module level import not at top of file

(E402)


31-31: Module level import not at top of file

(E402)

tests/model/test_rf_model.py

4-4: sklearn.compose.ColumnTransformer imported but unused

Remove unused import: sklearn.compose.ColumnTransformer

(F401)


5-5: sklearn.preprocessing.StandardScaler imported but unused

Remove unused import: sklearn.preprocessing.StandardScaler

(F401)


6-6: sklearn.pipeline.Pipeline imported but unused

Remove unused import: sklearn.pipeline.Pipeline

(F401)


11-11: loguru.logger imported but unused

Remove unused import: loguru.logger

(F401)

🔇 Additional comments (20)
configs/project_configs.yml (1)

9-9: LGTM! Clear and consistent section naming.

The renaming from 'features' to 'processed_features' improves clarity and maintains consistency with other project files.

tests/conftest.yml (1)

36-36: Verify the table name in Unity Catalog.

The table name has been changed to tetuan_city_power_consumption. Let's verify this exists in the Unity Catalog.

#!/bin/bash
# Description: Search for references to this table name in the codebase
# Expected: Find consistent table name references

# Search for the table name in configuration and SQL files
rg "tetuan_city_power_consumption" --type yaml --type sql
power_consumption/config.py (2)

32-32: LGTM! Verify YAML configurations are updated.

The rename from features to processed_features in the Config class maintains consistency with the ProcessedFeatures class rename.

Let's verify all YAML configs have been updated accordingly:

#!/bin/bash
# Check if any YAML files still use the old 'features' key
echo "Checking for old 'features' key in YAML files..."
rg -l "features:" --type yaml

# Check for new 'processed_features' key
echo "Verifying presence of new 'processed_features' key..."
rg -l "processed_features:" --type yaml

23-25: 🛠️ Refactor suggestion

Review data model design and add validations.

A few concerns about the current implementation:

  1. num_features appears to be duplicated in both ProcessedFeatures and Dataset classes
  2. raw_data_table might benefit from format validation

Consider these improvements:

  1. To avoid duplication, either:

    • Move all feature definitions to ProcessedFeatures
    • Or document why they exist in both places
  2. Add validation for raw_data_table:

from pydantic import Field, validator

class Dataset(BaseModel):
    raw_data_table: str = Field(..., description="Unity Catalog table name")
    num_features: List[str]
    cat_features: List[str]

    @validator('raw_data_table')
    def validate_table_name(cls, v):
        if not v.replace('_', '').isalnum():
            raise ValueError('Table name must be alphanumeric (underscores allowed)')
        return v

Let me verify the feature duplication:

.gitignore (1)

105-106: LGTM! Correctly ignoring MLflow artifacts.

Adding mlruns/ to .gitignore follows MLflow best practices as these directories contain experiment metadata and model artifacts that should not be version controlled.

notebooks/week2/02_mlfow_experiment.py (1)

1-53: Verify Unity Catalog integration.

The PR objectives mention saving tables to Unity Catalog, but this implementation only shows MLflow experiment tracking. Please ensure the Unity Catalog integration is implemented in other files or add it to this notebook.

power_consumption/schemas/processed_data.py (1)

38-43: ⚠️ Potential issue

Verify the completeness of DayOfWeek encoding.

The current implementation only includes 6 binary columns for DayOfWeek (1-6), which seems to be missing one day. This could lead to information loss or incorrect encoding of the seventh day.

Consider adding the missing day:

     DayOfWeek_4: Series[Category] = Field(nullable=True)
     DayOfWeek_5: Series[Category] = Field(nullable=True)
     DayOfWeek_6: Series[Category] = Field(nullable=True)
+    DayOfWeek_7: Series[Category] = Field(nullable=True)
✅ Verification successful

The DayOfWeek encoding is correct and complete

Looking at the preprocessing code, we can see that DayOfWeek is derived from index.dayofweek which uses Python's datetime convention where days are numbered from 0 to 6 (Monday=0 through Sunday=6). The OneHotEncoder is configured with drop="first", which means it drops the first category (0/Monday) to avoid multicollinearity. This explains why we only see DayOfWeek_1 through DayOfWeek_6, as they represent Tuesday through Sunday, with Monday being the reference category that was dropped.
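
For reference, a standalone illustration of that encoding behaviour (assumes scikit-learn >= 1.2 for the sparse_output argument):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

days = pd.DataFrame({"DayOfWeek": [0, 1, 2, 3, 4, 5, 6]})  # Monday=0 ... Sunday=6

encoder = OneHotEncoder(drop="first", sparse_output=False)
encoder.fit(days)

print(encoder.get_feature_names_out())
# ['DayOfWeek_1' 'DayOfWeek_2' ... 'DayOfWeek_6']
# Monday (0) is the dropped reference category, so only six indicator columns remain.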

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check the usage of DayOfWeek in the codebase to understand the encoding
rg -A 5 "DayOfWeek" --type python

Length of output: 64


Script:

#!/bin/bash
# Let's try searching without file type restriction
rg -A 5 "DayOfWeek"

Length of output: 6040


Script:

#!/bin/bash
# Search for any preprocessing or encoding related files
fd -e py | grep -i "preprocess\|encode\|transform"

Length of output: 227


Script:

#!/bin/bash
# Look for data processing or feature engineering files
fd -e py | grep -i "feature\|process\|data"

Length of output: 302


Script:

#!/bin/bash
# Check the content of the schema file
cat power_consumption/schemas/processed_data.py

Length of output: 2189


Script:

#!/bin/bash
# Let's check how DayOfWeek is processed and encoded
cat power_consumption/preprocessing/data_preprocessor.py

Length of output: 6737

power_consumption/utils.py (1)

83-86: LGTM! Clean refactoring of the plotting logic.

The separation of data preparation from plotting logic improves code readability and maintainability.

README.md (2)

116-124: Improve error handling documentation.

The permission denied error workaround should:

  1. Explain why the IDE vs. workspace behavior differs
  2. Document the minimum required permissions
  3. Provide steps to request necessary access
#!/bin/bash
# Check for other potential permission-related issues in the codebase
rg -i "permission" --type py

32-35: Consider documenting a more robust environment setup solution.

While the workaround addresses immediate issues, consider:

  1. Explicitly documenting the recommended Python version (3.11.10)
  2. Investigating and documenting the root cause of the pywin32/mlflow conflict
  3. Adding platform-specific instructions (Windows vs. Unix)
tests/preprocessing/test_data_preprocessor.py (2)

5-7: LGTM! Good practice using schema validation.

The addition of ProcessedPowerConsumptionSchema import and SparkSession demonstrates good practice by ensuring data consistency through schema validation before saving to Unity Catalog.


Line range hint 11-50: Well-structured test fixtures with comprehensive test data!

The fixtures are well-implemented with:

  • Comprehensive sample data covering all required columns
  • Good use of date ranges and random data for testing
  • Clear documentation with proper parameter and return type descriptions
power_consumption/model/rf_model.py (3)

5-5: ⚠️ Potential issue

Check compatibility of Self type hint with Python version

The use of Self from the typing module is only available in Python 3.11 and above. Please ensure that the project's Python version is 3.11 or later. If compatibility with earlier versions is needed, consider replacing Self with 'ConsumptionModel' in the type annotations.


46-46: ⚠️ Potential issue

Verify use of Self in return type annotation

The return type annotation -> Self in the train method requires Python 3.11 or later. For compatibility with earlier Python versions, you might replace it with 'ConsumptionModel'.
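
If support for Python versions below 3.11 is required, two common alternatives (sketch):

# Option 1: use the backport from typing_extensions (available for Python 3.8+).
from typing_extensions import Self


class ConsumptionModel:
    def train(self, X, y) -> Self:
        # ... fit the underlying estimator ...
        return self


# Option 2: a string (forward-reference) annotation, with no extra dependency.
class ConsumptionModelAlt:
    def train(self, X, y) -> "ConsumptionModelAlt":
        return self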


111-111: ⚠️ Potential issue

Reevaluate feature importance aggregation method

Averaging feature importances across all estimators may not accurately capture the significance of features for each target variable in a multi-output regression. Consider returning feature importances for each output separately or using a weighted average if appropriate.
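
One way to obtain per-target importances, assuming the model can be expressed as a MultiOutputRegressor wrapping one regressor per target (a sketch, not the current ConsumptionModel implementation):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

X, y = make_regression(n_samples=200, n_features=5, n_targets=3, random_state=0)

model = MultiOutputRegressor(RandomForestRegressor(n_estimators=50, random_state=0))
model.fit(X, y)

# One importance vector per target instead of a single averaged vector.
per_target = np.vstack([est.feature_importances_ for est in model.estimators_])
print(per_target.shape)  # (3, 5): targets x features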

tests/model/test_rf_model.py (1)

16-28: Sample data fixture is well-defined and ensures reproducibility

The sample_data fixture generates a reproducible DataFrame with relevant features, which is excellent for consistent testing.

notebooks/week2/03_log_and_register_model.py (2)

80-81: Verify Correctness of Model Signature Inference

Ensure that the inferred model signature accurately represents the model's input and output schema. Using y_train instead of y_pred when inferring the signature may provide a more accurate representation.

Please verify if y_pred has the same schema as y_train. If not, consider updating the inference as follows:

- signature = infer_signature(model_input=X_train, model_output=y_pred)
+ signature = infer_signature(model_input=X_train, model_output=y_train)

82-85: Check the Usage of mlflow.data.from_spark

The method mlflow.data.from_spark may not be a standard MLflow API or might differ based on the MLflow version. Confirm that this function is appropriate for logging your dataset and that it behaves as expected.

Please ensure that mlflow.data.from_spark is the correct method for creating a dataset input in your MLflow version. If not, consider alternative approaches for logging dataset information.

power_consumption/preprocessing/data_preprocessor.py (1)

153-157: Ensure consistent timestamps for train and test sets

The timestamp variable is calculated once and applied to both the train and test sets, which is good for consistency. However, ensure that the transformation from pandas DataFrame to Spark DataFrame preserves the data accurately, especially the index.

To verify that the index is preserved during the conversion, you can run the following script:

✅ Verification successful

Based on the test file content and previous search results, I can now provide a final response:

Index preservation is properly handled during DataFrame conversion

The code correctly preserves the DateTime index during the pandas to Spark DataFrame conversion, as evidenced by:

  • The DateTime column is explicitly set as the index using set_index("DateTime", inplace=True)
  • The test suite validates that:
    • The index is properly maintained (assert data_processor.data.index.name == "DateTime")
    • Both train and test sets maintain temporal ordering (assert train_set.index[0] < test_set.index[0])
    • Schema validation is performed on both datasets before and after the split
  • The data is properly sorted by index before processing (self.data = self.data.sort_index())
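
For completeness, a minimal sketch of the conversion pattern being verified here (the timestamp column name is a placeholder):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

train_set = pd.DataFrame(
    {"Temperature": [20.1, 21.3]},
    index=pd.date_range("2017-01-01", periods=2, freq="h", name="DateTime"),
)

# reset_index() turns the DateTime index back into a column so it survives the
# pandas -> Spark conversion; one shared timestamp keeps both splits consistent.
timestamp = pd.Timestamp.utcnow().isoformat()
train_spark = spark.createDataFrame(train_set.reset_index()).withColumn(
    "update_timestamp_utc", F.lit(timestamp)
)
train_spark.show()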
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the index is preserved when converting pandas DataFrame to Spark DataFrame.

# Check if index is included as a column
# Expect: The index should be present as a column in the Spark DataFrame.

# Example command using pyspark (adjusted for script context)
# Note: This is a placeholder; actual verification may require additional context.

# Save sample pandas DataFrame to CSV
python -c "
import pandas as pd
df = pd.DataFrame({'A': [1,2,3]}, index=pd.date_range('2021-01-01', periods=3, freq='D'))
df.to_csv('sample.csv')
"

# Read CSV with Spark and check columns
spark-shell -e "
val df = spark.read.option('header', 'true').csv('sample.csv')
df.show()
df.printSchema()
"

Length of output: 457


Script:

#!/bin/bash
# Let's examine the code to understand how the DataFrame conversion is handled
rg -A 5 "createDataFrame" power_consumption/preprocessing/data_preprocessor.py

Length of output: 462


Script:

#!/bin/bash
# Check if there's any custom index handling logic
rg -B 3 -A 3 "index" power_consumption/preprocessing/data_preprocessor.py

Length of output: 1586


Script:

#!/bin/bash
# Look for any DataFrame schema definitions or column specifications
ast-grep --pattern 'withColumn($$$)'

Length of output: 38


Script:

#!/bin/bash
# Check if there are any test files that might show the expected behavior
fd "test.*data.*process.*\.py$"

Length of output: 79


Script:

#!/bin/bash
# Let's check the test file to see how the DataFrame conversion is tested
cat tests/preprocessing/test_data_preprocessor.py

Length of output: 6169


Script:

#!/bin/bash
# Check for any schema definitions that might show expected columns
rg -B 2 -A 5 "class.*Schema" power_consumption/preprocessing/

Length of output: 63

notebooks/week2/05_log_and_register_fe_model.py (1)

6-6: Ensure 'dbutils' is available before use

The variable dbutils may not be defined in the current context, which could lead to a NameError when calling dbutils.library.restartPython(). In Databricks notebooks, dbutils should be available, but if you're running this script outside of a notebook environment, make sure to import or define dbutils appropriately.

🧰 Tools
🪛 Ruff

6-6: Undefined name dbutils

(F821)

Resolved review threads: configs/project_configs.yml, tests/conftest.yml (2), notebooks/week2/01_prepare_dataset.py (2)
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
.github/workflows/ci.yml (1)

31-31: LGTM! Consider documenting coverage requirements

The test configuration looks good:

  • Addition of databricks-sdk aligns with the Unity Catalog integration
  • Coverage threshold and reporting are well configured

Consider adding a comment in the workflow file explaining the 70% coverage requirement to help future contributors understand this threshold.

+          # Enforce minimum 70% test coverage across the codebase
           uv pip install pytest pytest-mock pytest-cov databricks-sdk
           uv run pytest --cov=power_consumption --cov-report=term-missing --cov-fail-under=70 tests/
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between c612a55 and beccea3.

📒 Files selected for processing (1)
  • .github/workflows/ci.yml (2 hunks)
🔇 Additional comments (1)
.github/workflows/ci.yml (1)

21-22: Verify requirements file location for uv sync

The uv sync command requires a requirements file, but none is explicitly specified. Please ensure that a requirements.txt or similar file exists in the repository root.

✅ Verification successful

Let me try another search strategy to verify the requirements files and package management setup.

This will help us understand:

  1. If there are any Python package management files that might be used by uv
  2. The complete context of how uv is configured in the workflow

Let me check the content of pyproject.toml to understand the package dependencies configuration.

This will help us verify if the dependencies are properly configured in pyproject.toml, which is the modern standard for Python package management and is supported by uv.


uv sync command is correctly configured

The dependency installation is properly set up. The uv sync command will install dependencies from pyproject.toml, which clearly defines both main dependencies and dev dependencies. The workflow also correctly installs additional test dependencies via separate uv pip install commands for pre-commit and pytest-related packages.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for requirements files in the repository
echo "Searching for requirements files..."
fd -e txt -e in "requirements"

Length of output: 111


Script:

#!/bin/bash
# Check for pyproject.toml and other Python package management files
echo "Searching for Python package management files..."
fd "pyproject.toml|requirements|setup.py"

# Look for uv configuration in workflow file
echo -e "\nChecking uv configuration in workflow:"
rg "uv" .github/workflows/ci.yml -B2 -A2

Length of output: 890


Script:

#!/bin/bash
# Check pyproject.toml content
echo "Content of pyproject.toml:"
cat pyproject.toml

Length of output: 1458

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 5

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between beccea3 and ea29214.

📒 Files selected for processing (2)
  • .github/workflows/ci.yml (2 hunks)
  • tests/preprocessing/test_data_preprocessor.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/ci.yml
🧰 Additional context used
🪛 Ruff
tests/preprocessing/test_data_preprocessor.py

7-7: pyspark.sql.functions imported but unused

Remove unused import: pyspark.sql.functions

(F401)

@basakeskili left a comment

Nice!

@mvechtomova left a comment

Great job! There are some minor things you could improve: the id must be a string, not an integer; please check it, as it may cause problems for serving.

Another point is about the Python testing requirements.

I will approve to unblock you.

Resolved review threads: .github/workflows/ci.yml, README.md, power_consumption/config.py, tests/preprocessing/test_data_preprocessor.py
@Garett601 Garett601 merged commit f8f9255 into main Oct 30, 2024
2 checks passed
@Garett601 Garett601 deleted the feature/week2 branch October 30, 2024 05:54
@coderabbitai coderabbitai bot mentioned this pull request on Nov 11, 2024
@coderabbitai coderabbitai bot mentioned this pull request on Nov 22, 2024
Labels: enhancement (New feature or request)
Projects: none
3 participants