Feature/week2 #2
Conversation
…): refactor code based on new config and separate data loading
Structure repo for clear separation of weekly tasks
convert to Databricks notebook and update code to new refactored implementation
chore: Actually commit the notebook this time
refactor(week_1.py): update Databricks implementation (squashed: refactor week 1 notebook to .py format, convert it to Databricks, and ensure changes are committed)
Save training and test sets to catalog. Note the `ALTER TABLE ... SET TBLPROPERTIES (delta.enableChangeDataFeed = true);` (see the sketch after this commit list)
load data from UC table, preprocess data, save train and test set to UC tables
log and register a model
create, log and register a custom model with the mlflow.pyfunc.model flavor
… to UC. Made an oops and saved the table when the DateTime was the index, so I lost that information down the line, which made FeatureLookup impossible
create a feature table and generate a new feature at runtime
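For context, a minimal, hypothetical sketch of that save-and-enable-CDF step (table and column names below are illustrative placeholders, not the repo's actual config):

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-in for the real train split
train_set = pd.DataFrame(
    {"DateTime": pd.date_range("2017-01-01", periods=3, freq="h"),
     "Temperature": [6.5, 6.4, 6.3]}
)

# Write the split to a catalog table (names are placeholders)
spark.createDataFrame(train_set).write.mode("overwrite").saveAsTable(
    "my_catalog.my_schema.train_set"
)

# Enable Change Data Feed on the Delta table, as noted in the commit above
spark.sql(
    "ALTER TABLE my_catalog.my_schema.train_set "
    "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
)
```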
Walkthrough

This pull request introduces multiple changes across various files, including updates to the project configuration, notebooks, preprocessing code, schemas, tests, and CI workflow.
Actionable comments posted: 47
🧹 Outside diff range and nitpick comments (22)
configs/project_configs.yml (3)
36-49: Consider restructuring to avoid feature list duplication.

The current configuration duplicates feature lists between the `processed_features` and `dataset` sections, which could lead to maintenance issues. While the distinction between raw and processed features is important, consider these improvements:

Rename the sections to clearly indicate their purpose: `raw_features` instead of `dataset.{num,cat}_features`, and `processed_features` (current). Or, consider consolidating the feature definitions:

```yaml
features:
  raw:
    num_features: [...]
    cat_features: [...]
  processed:
    num_features: [...]
    cat_features: [...]
```

This would make the relationship between raw and processed features more explicit and reduce the risk of inconsistencies during updates.
Line range hint 1-8: Add documentation for MLflow experiment tracking.

Given that this PR implements MLflow experiments, consider adding:
- Comments explaining the hyperparameters and their valid ranges
- MLflow experiment naming configuration
- Documentation about where these hyperparameters are used in the training pipeline

Example addition:

```yaml
# MLflow experiment tracking configuration
mlflow:
  experiment_name: "power_consumption_training"
  experiment_description: "Power consumption prediction model training"

# Model hyperparameters
hyperparameters:
  learning_rate: 0.01  # Learning rate for gradient descent (range: 0.001 to 0.1)
  n_estimators: 1000   # Number of trees in the forest (range: 100 to 2000)
  max_depth: 6         # Maximum depth of each tree (range: 3 to 10)
```
Line range hint 1-2: Enhance Unity Catalog configuration.

Since this PR implements saving tables to Unity Catalog, consider adding:
- Table location configuration
- Access control settings
- Documentation about the catalog structure

Example addition:

```yaml
unity_catalog:
  catalog_name: heiaepgah71pwedmld01001
  schema_name: power_consumption
  tables:
    raw_data:
      name: tetuan_city_power_consumption
      location: "s3://bucket/path"
    processed_data:
      name: processed_power_consumption
      location: "s3://bucket/processed/path"
  access_control:
    owner: "data_science_team"
    grants:
      - principal: "ML_engineers"
        permissions: ["SELECT", "MODIFY"]
```

power_consumption/config.py (2)
13-15: LGTM! Clear and descriptive class renaming.

The rename from `Features` to `ProcessedFeatures` better reflects the post-processing nature of these features, improving code clarity.

Consider adding docstring documentation to clarify:
- The purpose of this class in the ML pipeline
- The expected format/requirements for num_features and cat_features
- Any validation rules or constraints
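As a sketch of what that documentation could look like (assuming the pydantic-style config classes discussed in this PR; the docstring text is a suggestion, not existing code):

```python
from typing import List
from pydantic import BaseModel

class ProcessedFeatures(BaseModel):
    """Feature lists as they exist after preprocessing.

    Attributes:
        num_features: Names of numeric columns fed to the model.
        cat_features: Names of encoded categorical columns
            (e.g. the one-hot DayOfWeek_1..DayOfWeek_6 indicators).
    """

    num_features: List[str]
    cat_features: List[str]
```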
Line range hint 1-58: Consider enhancing configuration validation for the ML pipeline.

The configuration structure is well-organized but could benefit from additional MLflow-specific validations, since this PR introduces MLflow experiments.
Consider adding:
- MLflow experiment tracking configurations
- Model registry configurations for Unity Catalog
- Version tracking for feature definitions
This would help maintain consistency between your ML experiments and the corresponding configurations.
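A hedged sketch of what such additions might look like; the class and field names below are suggestions, not existing code:

```python
from typing import Optional
from pydantic import BaseModel

class MLflowConfig(BaseModel):
    """Suggested MLflow-specific settings (not yet in the repo)."""

    experiment_name: str
    registry_model_name: str  # e.g. "<catalog>.<schema>.<model>" for Unity Catalog
    feature_definition_version: Optional[str] = None  # version tracking for features

# The existing Config class could then embed it, e.g.:
# class Config(BaseModel):
#     ...
#     mlflow: MLflowConfig
```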
notebooks/week2/01_prepare_dataset.py (3)
6-6: Add error handling for SparkSession creation.

Consider adding a try-except block to handle potential SparkSession creation failures gracefully.

```diff
-spark = SparkSession.builder.getOrCreate()
+try:
+    spark = SparkSession.builder.getOrCreate()
+except Exception as e:
+    print(f"Failed to create SparkSession: {str(e)}")
+    raise
```
14-16: Validate configuration values.

Add validation for critical configuration values to fail fast if they're missing or invalid.

```diff
+# Validate critical configuration
+if not all([catalog_name, schema_name, raw_data_table]):
+    raise ValueError("Missing required configuration values")
```
1-32: Consider moving core logic to proper Python modules.

While notebooks are great for development and visualization, critical data processing logic should reside in proper Python modules. This notebook should focus on orchestration and visualization, while the core logic should be in the `power_consumption` package.

Benefits:
- Better testability
- Improved reusability
- Easier version control
- Better separation of concerns
Consider:
- Moving data validation logic to `DataProcessor`
- Creating a dedicated configuration validator
- Adding a logging configuration module
- Creating a dedicated catalog interaction module
power_consumption/schemas/processed_data.py (3)
27-28: Add more detailed documentation for the processed schema.

The docstring should explain the preprocessing steps applied and the encoding scheme used for categorical variables.

```diff
 class ProcessedPowerConsumptionSchema(DataFrameModel):
-    """Processed power consumption dataset schema."""
+    """Processed power consumption dataset schema.
+
+    This schema represents the processed version of power consumption data where:
+    - Categorical variables (DayOfWeek, IsWeekend) are one-hot encoded
+    - All features are properly typed and validated
+    - Missing values are handled consistently
+    """
```
30-47: Consider adding field descriptions using Field metadata.

The schema would benefit from explicit documentation of what each field represents, especially for the encoded categorical variables.

Example for a few fields:

```diff
-    Temperature: Series[float] = Field(nullable=True)
+    Temperature: Series[float] = Field(nullable=True, description="Temperature in Celsius")
-    DayOfWeek_1: Series[Category] = Field(nullable=True)
+    DayOfWeek_1: Series[Category] = Field(nullable=True, description="Binary indicator for Monday (1 if Monday, 0 otherwise)")
-    IsWeekend_1: Series[Category] = Field(nullable=True)
+    IsWeekend_1: Series[Category] = Field(nullable=True, description="Binary indicator for weekend days (1 if weekend, 0 if weekday)")
```
27-50: Consider adding data validation between raw and processed schemas.

The relationship between raw and processed data should be validated to ensure consistent preprocessing.

Consider adding class methods to validate the transformation:

```python
@classmethod
def validate_processing(cls, raw_df: PowerConsumptionSchema, processed_df: "ProcessedPowerConsumptionSchema") -> bool:
    """Validate that processed data correctly represents the raw data.

    Args:
        raw_df: DataFrame following PowerConsumptionSchema
        processed_df: DataFrame following ProcessedPowerConsumptionSchema

    Returns:
        bool: True if processing is valid
    """
    # Validate continuous features remain unchanged
    for field in ['Temperature', 'Humidity', 'Wind_Speed']:
        if not (raw_df[field] == processed_df[field]).all():
            return False

    # Validate one-hot encoding
    raw_days = raw_df['DayOfWeek']
    processed_days = processed_df[[f'DayOfWeek_{i}' for i in range(1, 7)]]
    # Add validation logic here

    return True
```

power_consumption/utils.py (1)
Line range hint 66-89: Consider enhancing function robustness and documentation.

While the implementation is correct, consider these improvements:
- Add input validation
- Add type hints
- Handle edge cases

Here's a suggested implementation:

```diff
-def plot_feature_importance(feature_importance, feature_names, top_n=10):
+def plot_feature_importance(
+    feature_importance: np.ndarray,
+    feature_names: list[str],
+    top_n: int = 10
+) -> None:
     """
     Plot feature importance.

     Parameters
     ----------
     feature_importance : array-like
         Feature importance scores.
     feature_names : array-like
         Names of the features.
     top_n : int, optional
         Number of top features to display, by default 10.
+
+    Raises
+    ------
+    ValueError
+        If feature_importance and feature_names have different lengths,
+        or if top_n is greater than the number of features.
     """
+    if len(feature_importance) != len(feature_names):
+        raise ValueError("Length mismatch between feature_importance and feature_names")
+
+    n_features = len(feature_importance)
+    if top_n > n_features:
+        top_n = n_features
+        print(f"Warning: top_n reduced to {n_features} (total number of features)")
+
     plt.figure(figsize=(10, 6))
     sorted_idx = np.argsort(feature_importance)
     pos = np.arange(sorted_idx[-top_n:].shape[0]) + 0.5
     top_features = feature_importance[sorted_idx[-top_n:]]
     top_names = [feature_names[i] for i in sorted_idx[-top_n:]]
     plt.barh(pos, top_features)
     plt.yticks(pos, top_names)
     plt.title(f"Top {top_n} Feature Importance")
     plt.tight_layout()
     plt.show()
```

README.md (2)
106-106: Enhance CI pipeline documentation.

Consider adding:
- List of enabled pre-commit hooks and their purposes
- Expected code coverage thresholds
- Instructions for running CI checks locally
111-115: Document Unity Catalog setup and permissions.

The note about UC table availability should include:
- Steps to set up required permissions
- Table schema and location
- Instructions for other developers to access the table
tests/preprocessing/test_data_preprocessor.py (1)
Line range hint 11-50: Consider extracting test data generation into a separate utility.

The sample data generation in the fixture could be moved to a separate test utility module. This would:
- Make it reusable across different test files
- Allow for different data generation strategies
- Make it easier to maintain and update test data patterns
Consider creating a new file, `tests/utils/test_data_generator.py`, to handle test data generation; one possible shape is sketched below.
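A sketch of that utility; the helper name and column set are assumptions based on the fixtures discussed here, not existing code:

```python
# tests/utils/test_data_generator.py (suggested, does not exist yet)
import numpy as np
import pandas as pd


def make_power_consumption_frame(n_rows: int = 48, seed: int = 42) -> pd.DataFrame:
    """Generate a small, reproducible sample of raw power-consumption data."""
    rng = np.random.default_rng(seed)
    index = pd.date_range("2017-01-01", periods=n_rows, freq="h", name="DateTime")
    return pd.DataFrame(
        {
            "Temperature": rng.uniform(5, 35, n_rows),
            "Humidity": rng.uniform(20, 95, n_rows),
            "Wind_Speed": rng.uniform(0, 5, n_rows),
        },
        index=index,
    )
```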
notebooks/week2/03_log_and_register_model.py (3)
87-91: Include the Conda Environment When Logging the Model

To ensure reproducibility, it's beneficial to log the conda environment or the pip requirements when logging the model. This captures the dependencies needed to run the model.

Modify the model logging to include the environment:

```diff
 mlflow.sklearn.log_model(
     sk_model=pipeline,
     artifact_path="lightgbm-pipeline-model",
     signature=signature,
+    conda_env=mlflow.sklearn.get_default_conda_env()
 )
```
30-31: Reuse Loaded Spark DataFrame Instead of Reloading

You have already loaded the training dataset into `train_set_spark`. Instead of loading it again, reuse this variable to improve efficiency.

Update the code to use the existing DataFrame:

```diff
-train_set = spark.table(f"{catalog_name}.{schema_name}.train_set").toPandas()
+train_set = train_set_spark.toPandas()
```
70-72: Consider Logging Metrics Using MLflow Instead of Printing

While printing metrics is useful for immediate visibility, logging them with MLflow ensures they are tracked and stored for future reference and comparison across runs.
Ensure all relevant metrics are logged to MLflow, as you are already doing in lines 77-79.
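For reference, a minimal sketch of logging such metrics with MLflow; the metric names and values here are placeholders, not the notebook's actual variables:

```python
import mlflow

# Placeholder values standing in for metrics computed earlier in the notebook
mse, mae, r2 = 0.12, 0.25, 0.91

with mlflow.start_run():
    mlflow.log_metrics({"mse": mse, "mae": mae, "r2_score": r2})
```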
notebooks/week2/04_log_and_register_custom_model.py (1)
75-75: Consider retrieving the Git SHA dynamically

Hardcoding the Git SHA can lead to inconsistencies if the codebase updates. To ensure accuracy, retrieve the Git SHA dynamically.

Apply this diff to get the Git SHA automatically:

```diff
 import mlflow
+import subprocess
 import pandas as pd
 ...
-git_sha = "30d57afb2efca70cede3061d00f2a553c2b4779b"
+git_sha = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()
```

power_consumption/preprocessing/data_preprocessor.py (3)
88-89: Validate raw data before feature engineering

The data is being validated after adding time-based features. It's recommended to validate the raw data before performing any transformations, to ensure data integrity from the start.

Consider moving the validation step:

```diff
 self._clean_column_names()
+self.data = PowerConsumptionSchema.validate(self.data)
 logger.info("Creating time-based features")
 # Convert DateTime to datetime and set as index
```
110-127: Remove unnecessary `random_state` parameter when `shuffle=False`

In the `split_data` method, since `shuffle=False`, the `random_state` parameter has no effect. Including it may cause confusion.

Update the method signature and the call to `train_test_split`:

```diff
-def split_data(self, test_size: float = 0.2, random_state: int = 42) -> Tuple[pd.DataFrame, pd.DataFrame]:
+def split_data(self, test_size: float = 0.2) -> Tuple[pd.DataFrame, pd.DataFrame]:
     """
     Split the DataFrame (self.data) into training and test sets, maintaining temporal order.
     """
-    train_set, test_set = train_test_split(self.data, test_size=test_size, shuffle=False, random_state=random_state)
+    train_set, test_set = train_test_split(self.data, test_size=test_size, shuffle=False)
```

Also, update the docstring to reflect this change.
129-132: Adjust logging for clarity on date ranges

The logs for the train and test set date ranges might be more informative if formatted with standard date formats.

Consider updating the logging statements:

```diff
 logger.info(f"Train set shape: {train_set.shape}")
 logger.info(f"Test set shape: {test_set.shape}")
-logger.info(f"Train set date range: {train_set.index.min()} to {train_set.index.max()}")
-logger.info(f"Test set date range: {test_set.index.min()} to {test_set.index.max()}")
+logger.info(f"Train set date range: {train_set.index.min().strftime('%Y-%m-%d')} to {train_set.index.max().strftime('%Y-%m-%d')}")
+logger.info(f"Test set date range: {test_set.index.min().strftime('%Y-%m-%d')} to {test_set.index.max().strftime('%Y-%m-%d')}")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (18)
- .gitignore (1 hunks)
- README.md (2 hunks)
- configs/project_configs.yml (3 hunks)
- notebooks/week1/week_1.py (1 hunks)
- notebooks/week2/01_prepare_dataset.py (1 hunks)
- notebooks/week2/02_mlfow_experiment.py (1 hunks)
- notebooks/week2/03_log_and_register_model.py (1 hunks)
- notebooks/week2/04_log_and_register_custom_model.py (1 hunks)
- notebooks/week2/05_log_and_register_fe_model.py (1 hunks)
- notebooks/week_1.ipynb (0 hunks)
- power_consumption/config.py (2 hunks)
- power_consumption/model/rf_model.py (5 hunks)
- power_consumption/preprocessing/data_preprocessor.py (3 hunks)
- power_consumption/schemas/processed_data.py (1 hunks)
- power_consumption/utils.py (1 hunks)
- tests/conftest.yml (3 hunks)
- tests/model/test_rf_model.py (1 hunks)
- tests/preprocessing/test_data_preprocessor.py (2 hunks)
💤 Files with no reviewable changes (1)
- notebooks/week_1.ipynb
🧰 Additional context used
🪛 Ruff
notebooks/week2/04_log_and_register_custom_model.py
3-3: `numpy` imported but unused. Remove unused import: `numpy` (F401)
122-122: Found useless expression. Either assign it to a variable or remove it. (B018)
notebooks/week2/05_log_and_register_fe_model.py
6-6: Undefined name `dbutils` (F821)
10-10: Module level import not at top of file (E402)
10-10: `yaml` imported but unused. Remove unused import: `yaml` (F401)
11-11: Module level import not at top of file (E402)
12-12: Module level import not at top of file (E402)
14-14: Module level import not at top of file (E402)
14-14: `datetime.datetime` imported but unused. Remove unused import: `datetime.datetime` (F401)
16-16: Module level import not at top of file (E402)
17-17: Module level import not at top of file (E402)
17-17: `pyspark.sql.functions` imported but unused. Remove unused import: `pyspark.sql.functions` (F401)
19-19: Module level import not at top of file (E402)
20-20: Module level import not at top of file (E402)
21-21: Module level import not at top of file (E402)
23-23: Module level import not at top of file (E402)
24-24: Module level import not at top of file (E402)
25-25: Module level import not at top of file (E402)
26-26: Module level import not at top of file (E402)
27-27: Module level import not at top of file (E402)
29-29: Module level import not at top of file (E402)
31-31: Module level import not at top of file (E402)
tests/model/test_rf_model.py
4-4: `sklearn.compose.ColumnTransformer` imported but unused. Remove unused import: `sklearn.compose.ColumnTransformer` (F401)
5-5: `sklearn.preprocessing.StandardScaler` imported but unused. Remove unused import: `sklearn.preprocessing.StandardScaler` (F401)
6-6: `sklearn.pipeline.Pipeline` imported but unused. Remove unused import: `sklearn.pipeline.Pipeline` (F401)
11-11: `loguru.logger` imported but unused. Remove unused import: `loguru.logger` (F401)
🔇 Additional comments (20)
configs/project_configs.yml (1)
9-9: LGTM! Clear and consistent section naming.

The renaming from 'features' to 'processed_features' improves clarity and maintains consistency with other project files.
tests/conftest.yml (1)
36-36: Verify the table name in Unity Catalog.

The table name has been changed to `tetuan_city_power_consumption`. Let's verify this exists in the Unity Catalog.

```bash
#!/bin/bash
# Description: Search for references to this table name in the codebase
# Expected: Find consistent table name references

# Search for the table name in configuration and SQL files
rg "tetuan_city_power_consumption" --type yaml --type sql
```

power_consumption/config.py (2)
32-32: LGTM! Verify YAML configurations are updated.

The rename from `features` to `processed_features` in the Config class maintains consistency with the ProcessedFeatures class rename.

Let's verify all YAML configs have been updated accordingly:

```bash
#!/bin/bash
# Check if any YAML files still use the old 'features' key
echo "Checking for old 'features' key in YAML files..."
rg -l "features:" --type yaml

# Check for new 'processed_features' key
echo "Verifying presence of new 'processed_features' key..."
rg -l "processed_features:" --type yaml
```
23-25: 🛠️ Refactor suggestion: Review data model design and add validations.

A few concerns about the current implementation: `num_features` appears to be duplicated in both the `ProcessedFeatures` and `Dataset` classes, and `raw_data_table` might benefit from format validation.

Consider these improvements:

To avoid duplication, either:
- Move all feature definitions to `ProcessedFeatures`
- Or document why they exist in both places

Add validation for `raw_data_table`:

```python
from typing import List
from pydantic import BaseModel, Field, validator

class Dataset(BaseModel):
    raw_data_table: str = Field(..., description="Unity Catalog table name")
    num_features: List[str]
    cat_features: List[str]

    @validator('raw_data_table')
    def validate_table_name(cls, v):
        if not v.replace('_', '').isalnum():
            raise ValueError('Table name must be alphanumeric (underscores allowed)')
        return v
```

Let me verify the feature duplication:
105-106
: LGTM! Correctly ignoring MLflow artifacts.Adding
mlruns/
to.gitignore
follows MLflow best practices as these directories contain experiment metadata and model artifacts that should not be version controlled.notebooks/week2/02_mlfow_experiment.py (1)
1-53
: Verify Unity Catalog integration.The PR objectives mention saving tables to Unity Catalog, but this implementation only shows MLflow experiment tracking. Please ensure the Unity Catalog integration is implemented in other files or add it to this notebook.
power_consumption/schemas/processed_data.py (1)
38-43
:⚠️ Potential issueVerify the completeness of DayOfWeek encoding.
The current implementation only includes 6 binary columns for DayOfWeek (1-6), which seems to be missing one day. This could lead to information loss or incorrect encoding of the seventh day.
Consider adding the missing day:
DayOfWeek_4: Series[Category] = Field(nullable=True) DayOfWeek_5: Series[Category] = Field(nullable=True) DayOfWeek_6: Series[Category] = Field(nullable=True) + DayOfWeek_7: Series[Category] = Field(nullable=True)
✅ Verification successful
The DayOfWeek encoding is correct and complete
Looking at the preprocessing code, we can see that `DayOfWeek` is derived from `index.dayofweek`, which uses Python's datetime convention where days are numbered from 0 to 6 (Monday=0 through Sunday=6). The OneHotEncoder is configured with `drop="first"`, which means it drops the first category (0/Monday) to avoid multicollinearity. This explains why we only see DayOfWeek_1 through DayOfWeek_6: they represent Tuesday through Sunday, with Monday being the reference category that was dropped.
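A small self-contained demonstration of why seven weekday values yield six columns under drop="first" (a sketch using scikit-learn >= 1.2 for the sparse_output argument; this is not the repo's actual pipeline):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

days = np.arange(7).reshape(-1, 1)  # 0=Monday ... 6=Sunday (pandas dayofweek convention)
encoder = OneHotEncoder(drop="first", sparse_output=False)
encoded = encoder.fit_transform(days)

print(encoded.shape)  # (7, 6): Monday (0) is the dropped reference category
print(encoder.get_feature_names_out(["DayOfWeek"]))
# ['DayOfWeek_1' 'DayOfWeek_2' 'DayOfWeek_3' 'DayOfWeek_4' 'DayOfWeek_5' 'DayOfWeek_6']
```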
🏁 Scripts executed

The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Check the usage of DayOfWeek in the codebase to understand the encoding
rg -A 5 "DayOfWeek" --type python
```
Length of output: 64

Script:
```bash
#!/bin/bash
# Let's try searching without file type restriction
rg -A 5 "DayOfWeek"
```
Length of output: 6040

Script:
```bash
#!/bin/bash
# Search for any preprocessing or encoding related files
fd -e py | grep -i "preprocess\|encode\|transform"
```
Length of output: 227

Script:
```bash
#!/bin/bash
# Look for data processing or feature engineering files
fd -e py | grep -i "feature\|process\|data"
```
Length of output: 302

Script:
```bash
#!/bin/bash
# Check the content of the schema file
cat power_consumption/schemas/processed_data.py
```
Length of output: 2189

Script:
```bash
#!/bin/bash
# Let's check how DayOfWeek is processed and encoded
cat power_consumption/preprocessing/data_preprocessor.py
```
Length of output: 6737
power_consumption/utils.py (1)
83-86: LGTM! Clean refactoring of the plotting logic.

The separation of data preparation from plotting logic improves code readability and maintainability.
README.md (2)
116-124: Improve error handling documentation.

The permission denied error workaround should:
- Explain why the IDE vs. workspace behavior differs
- Document the minimum required permissions
- Provide steps to request necessary access
```bash
#!/bin/bash
# Check for other potential permission-related issues in the codebase
rg -i "permission" --type py
```
32-35: Consider documenting a more robust environment setup solution.

While the workaround addresses immediate issues, consider:
- Explicitly documenting the recommended Python version (3.11.10)
- Investigating and documenting the root cause of the pywin32/mlflow conflict
- Adding platform-specific instructions (Windows vs. Unix)
tests/preprocessing/test_data_preprocessor.py (2)
5-7: LGTM! Good practice using schema validation.

The addition of the ProcessedPowerConsumptionSchema import and SparkSession demonstrates good practice by ensuring data consistency through schema validation before saving to Unity Catalog.
Line range hint 11-50: Well-structured test fixtures with comprehensive test data!

The fixtures are well-implemented with:
- Comprehensive sample data covering all required columns
- Good use of date ranges and random data for testing
- Clear documentation with proper parameter and return type descriptions
power_consumption/model/rf_model.py (3)
5-5: ⚠️ Potential issue: Check compatibility of the `Self` type hint with the Python version

The use of `Self` from the `typing` module is only available in Python 3.11 and above. Please ensure that the project's Python version is 3.11 or later. If compatibility with earlier versions is needed, consider replacing `Self` with `'ConsumptionModel'` in the type annotations.
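One common compatibility pattern is sketched below; the class and method names are taken from this review, and the snippet is illustrative rather than the repo's actual code:

```python
import sys

if sys.version_info >= (3, 11):
    from typing import Self
else:
    # Backport for Python < 3.11; requires the typing_extensions package
    from typing_extensions import Self


class ConsumptionModel:
    def train(self) -> Self:
        # ... fit the underlying estimator here ...
        return self
```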
46-46: ⚠️ Potential issue: Verify use of `Self` in the return type annotation

The return type annotation `-> Self` in the `train` method requires Python 3.11 or later. For compatibility with earlier Python versions, you might replace it with `'ConsumptionModel'`.
111-111: ⚠️ Potential issue: Reevaluate feature importance aggregation method

Averaging feature importances across all estimators may not accurately capture the significance of features for each target variable in a multi-output regression. Consider returning feature importances for each output separately, or using a weighted average if appropriate.
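If the model wraps scikit-learn's MultiOutputRegressor (an assumption based on the multi-output wording above, not confirmed by this PR), per-target importances could be surfaced like this sketch:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

X, y = make_regression(n_samples=100, n_features=5, n_targets=3, random_state=0)
model = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=10, random_state=0)
).fit(X, y)

# One importance vector per target instead of a single averaged vector
per_target = np.stack([est.feature_importances_ for est in model.estimators_])
print(per_target.shape)  # (3, 5): 3 targets x 5 features
```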
tests/model/test_rf_model.py (1)
16-28: Sample data fixture is well-defined and ensures reproducibility

The `sample_data` fixture generates a reproducible DataFrame with relevant features, which is excellent for consistent testing.
80-81
: Verify Correctness of Model Signature InferenceEnsure that the inferred model signature accurately represents the model's input and output schema. Using
y_train
instead ofy_pred
when inferring the signature may provide a more accurate representation.Please verify if
y_pred
has the same schema asy_train
. If not, consider updating the inference as follows:- signature = infer_signature(model_input=X_train, model_output=y_pred) + signature = infer_signature(model_input=X_train, model_output=y_train)
82-85
: Check the Usage ofmlflow.data.from_spark
The method
mlflow.data.from_spark
may not be a standard MLflow API or might differ based on the MLflow version. Confirm that this function is appropriate for logging your dataset and that it behaves as expected.Please ensure that
mlflow.data.from_spark
is the correct method for creating a dataset input in your MLflow version. If not, consider alternative approaches for logging dataset information.power_consumption/preprocessing/data_preprocessor.py (1)
153-157
: Ensure consistent timestamps for train and test setsThe
timestamp
variable is calculated once and applied to both the train and test sets, which is good for consistency. However, ensure that the transformation from pandas DataFrame to Spark DataFrame preserves the data accurately, especially the index.To verify that the index is preserved during the conversion, you can run the following script:
✅ Verification successful
Based on the test file content and previous search results, I can now provide a final response:
Index preservation is properly handled during DataFrame conversion
The code correctly preserves the DateTime index during the pandas to Spark DataFrame conversion, as evidenced by:
- The DateTime column is explicitly set as the index using
set_index("DateTime", inplace=True")
- The test suite validates that:
- The index is properly maintained (
assert data_processor.data.index.name == "DateTime"
)- Both train and test sets maintain temporal ordering (
assert train_set.index[0] < test_set.index[0]
)- Schema validation is performed on both datasets before and after the split
- The data is properly sorted by index before processing (
self.data = self.data.sort_index()
)🏁 Scripts executed
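For context, the usual way to keep a pandas DatetimeIndex through this conversion is to materialise it as a column first; a minimal sketch (not the repo's code):

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pdf = pd.DataFrame(
    {"Temperature": [6.5, 6.4]},
    index=pd.date_range("2017-01-01", periods=2, freq="h", name="DateTime"),
)

# spark.createDataFrame drops the pandas index, so expose it as a column first
sdf = spark.createDataFrame(pdf.reset_index())
sdf.printSchema()  # schema now includes the DateTime column
```

🏁 Scripts executed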
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify that the index is preserved when converting pandas DataFrame to Spark DataFrame.
# Check if index is included as a column
# Expect: The index should be present as a column in the Spark DataFrame.
# Example command using pyspark (adjusted for script context)
# Note: This is a placeholder; actual verification may require additional context.

# Save sample pandas DataFrame to CSV
python -c "
import pandas as pd
df = pd.DataFrame({'A': [1,2,3]}, index=pd.date_range('2021-01-01', periods=3, freq='D'))
df.to_csv('sample.csv')
"

# Read CSV with Spark and check columns
spark-shell -e "
val df = spark.read.option('header', 'true').csv('sample.csv')
df.show()
df.printSchema()
"
```
Length of output: 457

Script:
```bash
#!/bin/bash
# Let's examine the code to understand how the DataFrame conversion is handled
rg -A 5 "createDataFrame" power_consumption/preprocessing/data_preprocessor.py
```
Length of output: 462

Script:
```bash
#!/bin/bash
# Check if there's any custom index handling logic
rg -B 3 -A 3 "index" power_consumption/preprocessing/data_preprocessor.py
```
Length of output: 1586

Script:
```bash
#!/bin/bash
# Look for any DataFrame schema definitions or column specifications
ast-grep --pattern 'withColumn($$$)'
```
Length of output: 38

Script:
```bash
#!/bin/bash
# Check if there are any test files that might show the expected behavior
fd "test.*data.*process.*\.py$"
```
Length of output: 79

Script:
```bash
#!/bin/bash
# Let's check the test file to see how the DataFrame conversion is tested
cat tests/preprocessing/test_data_preprocessor.py
```
Length of output: 6169

Script:
```bash
#!/bin/bash
# Check for any schema definitions that might show expected columns
rg -B 2 -A 5 "class.*Schema" power_consumption/preprocessing/
```
Length of output: 63
notebooks/week2/05_log_and_register_fe_model.py (1)
6-6: Ensure 'dbutils' is available before use

The variable `dbutils` may not be defined in the current context, which could lead to a `NameError` when calling `dbutils.library.restartPython()`. In Databricks notebooks, `dbutils` should be available, but if you're running this script outside of a notebook environment, make sure to import or define `dbutils` appropriately.

🧰 Tools
🪛 Ruff
6-6: Undefined name `dbutils` (F821)
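A defensive pattern for scripts that may also run outside Databricks (a sketch; inside a Databricks notebook, dbutils is injected automatically):

```python
try:
    dbutils.library.restartPython()  # available inside Databricks notebooks
except NameError:
    # Running outside Databricks: dbutils is not injected, so skip the restart
    pass
```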
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
.github/workflows/ci.yml (1)
31-31: LGTM! Consider documenting coverage requirements

The test configuration looks good:
- Addition of databricks-sdk aligns with the Unity Catalog integration
- Coverage threshold and reporting are well configured

Consider adding a comment in the workflow file explaining the 70% coverage requirement to help future contributors understand this threshold.

```diff
+# Enforce minimum 70% test coverage across the codebase
 uv pip install pytest pytest-mock pytest-cov databricks-sdk
 uv run pytest --cov=power_consumption --cov-report=term-missing --cov-fail-under=70 tests/
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- .github/workflows/ci.yml (2 hunks)
🔇 Additional comments (1)
.github/workflows/ci.yml (1)
21-22: Verify requirements file location for uv sync

The `uv sync` command requires a requirements file, but none is explicitly specified. Please ensure that a requirements.txt or similar file exists in the repository root.

✅ Verification successful

Let me try another search strategy to verify the requirements files and package management setup. This will help us understand:
- If there are any Python package management files that might be used by uv
- The complete context of how uv is configured in the workflow

Let me check the content of pyproject.toml to understand the package dependencies configuration. This will help us verify if the dependencies are properly configured in pyproject.toml, which is the modern standard for Python package management and is supported by uv.

The `uv sync` command is correctly configured

The dependency installation is properly set up. The `uv sync` command will install dependencies from pyproject.toml, which clearly defines both main dependencies and dev dependencies. The workflow also correctly installs additional test dependencies via separate `uv pip install` commands for pre-commit and pytest-related packages.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
```bash
#!/bin/bash
# Check for requirements files in the repository
echo "Searching for requirements files..."
fd -e txt -e in "requirements"
```
Length of output: 111

Script:
```bash
#!/bin/bash
# Check for pyproject.toml and other Python package management files
echo "Searching for Python package management files..."
fd "pyproject.toml|requirements|setup.py"

# Look for uv configuration in workflow file
echo -e "\nChecking uv configuration in workflow:"
rg "uv" .github/workflows/ci.yml -B2 -A2
```
Length of output: 890

Script:
```bash
#!/bin/bash
# Check pyproject.toml content
echo "Content of pyproject.toml:"
cat pyproject.toml
```
Length of output: 1458
Actionable comments posted: 5
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- .github/workflows/ci.yml (2 hunks)
- tests/preprocessing/test_data_preprocessor.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- .github/workflows/ci.yml
🧰 Additional context used
🪛 Ruff
tests/preprocessing/test_data_preprocessor.py
7-7: `pyspark.sql.functions` imported but unused. Remove unused import: `pyspark.sql.functions` (F401)
Nice!
Great job! There are some minor things you could improve: the id must be a string and not an integer; please check it, since it may cause problems for serving.
Another point concerns the testing Python requirements.
I will approve to unblock you.
Description
Saved tables to UC and created MLflow experiments and logged models (standard, custom, and after using feature engineering). The steps are listed in the notebooks.
Type of change
How Has This Been Tested?
Updated tests for refactored code. Added code coverage checks in the CI pipeline.
Checklist:
Summary by CodeRabbit

Release Notes

New Features
- Updated `.gitignore` to ignore JSON files and MLFlow directories.
- Updated `README.md` with environment setup instructions and course logistics.
- `week_1.py` and `01_prepare_dataset.py`.
- `02_mlfow_experiment.py`.
- `03_log_and_register_model.py` and `04_log_and_register_custom_model.py`.
- `05_log_and_register_fe_model.py`.

Bug Fixes

Documentation

Tests
- Updated tests for `ConsumptionModel` and `DataProcessor` to align with new data handling structures.