Week 2 #2

Merged
merged 5 commits into from Nov 18, 2024

Conversation


@Wartynas Wartynas commented Nov 14, 2024

Work completed in the week 2 submission:

  • Saved the data to the catalog as a Delta table
  • Implemented a custom DiabetesModel class inheriting from mlflow.pyfunc.PythonModel (see the sketch after this list)
  • Trained the model with MLflow experiment tracking and registered it in Unity Catalog
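
For reference, a minimal sketch of this wrapper pattern; the constructor mirrors the one reviewed below, the predict signature is the standard mlflow.pyfunc.PythonModel interface, and the method body shown is illustrative:

import mlflow.pyfunc
from xgboost import XGBClassifier

class DiabetesModel(mlflow.pyfunc.PythonModel):
    def __init__(self, config):
        # Hyperparameters come from the model_parameters section of config.yml
        self.config = config
        self.model = XGBClassifier(
            objective=config["model_parameters"]["obj_fn"],
            max_depth=config["model_parameters"]["max_depth"],
            learning_rate=config["model_parameters"]["learning_rate"],
            n_estimators=config["model_parameters"]["n_estimators"],
        )

    def predict(self, context, model_input):
        # pyfunc entry point: delegate to the trained XGBoost classifier
        return self.model.predict(model_input)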

I am aware of the following limitations in this pull request:

  • There are no tests for the DiabetesModel class.
  • The model's F1 score is poor (0.39). I plan to apply techniques in a future PR to remedy the class imbalance in my multiclass classification problem.
  • No hyperparameter tuning was performed to calibrate the model; this will be done in future PRs.
  • The project config has not yet been updated to use Pydantic; this will be done in the next PR.

Please let me know if you have any thoughts regarding these limitations.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a DiabetesModel class for training, predicting, and evaluating diabetes classification models.
    • Enhanced data processing capabilities with Spark integration for handling larger datasets.
    • Added model tracking and logging functionality using MLflow.
  • Improvements

    • Updated configuration file with a new section for model parameters to optimize training.
    • Enhanced data transformation and saving processes with new methods in the DataProcessor.
  • Dependencies

    • Added xgboost as a new dependency for model training.

@Wartynas Wartynas requested a review from a team as a code owner November 14, 2024 11:49

coderabbitai bot commented Nov 14, 2024

Walkthrough

The changes involve updates to several files, primarily enhancing the configuration and functionality of a machine learning workflow. The config.yml file has been modified to rename a categorical feature and introduce model parameters. The main.py file has been updated to integrate MLflow for model tracking and Databricks for data handling. New methods have been added to the DataProcessor and DiabetesModel classes to improve data processing and model training capabilities. Additionally, the pyproject.toml file has been reformatted for clarity and updated with a new dependency.

Changes

config.yml
  • Renamed categorical feature: HeavyAlcoholConsump → HvyAlcoholConsump
  • Added model_parameters section with obj_fn: "multi:softmax", n_estimators: 1000, max_depth: 6, learning_rate: 0.01

main.py
  • Integrated MLflow and Databricks for model tracking and data handling.
  • Added method call to save raw data.
  • Updated preprocessing to return the transformed DataFrame.
  • Logged model training metrics in MLflow.

pyproject.toml
  • Reformatted dependencies to multi-line.
  • Added dependency: "xgboost>=2.1.2, <3"

src/diabetes/data_processor.py
  • Added methods: pandas_df_to_delta, save_raw_data_to_catalog, _pandas_to_spark_to_delta_cdf.
  • Updated preprocess_data and split_data methods for enhanced data handling.

src/diabetes/diabetes_model.py
  • Introduced DiabetesModel class with methods for training, predicting, and evaluating the model.
  • Initialized the model with configuration parameters.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Main
    participant DataProcessor
    participant DiabetesModel
    participant MLflow

    User->>Main: Start ML workflow
    Main->>DataProcessor: Initialize and preprocess data
    DataProcessor->>DataProcessor: Save raw data
    DataProcessor->>Main: Return transformed DataFrame
    Main->>DiabetesModel: Train model with training data
    DiabetesModel->>MLflow: Log training metrics
    Main->>User: Complete workflow

Possibly related PRs

  • Week 1 #1: The changes in config.yml regarding categorical features and model parameters are related to the new configuration introduced in the retrieved PR, which also specifies parameters for a machine learning project related to diabetes.

Suggested reviewers

  • basakeskili

"In the meadow where data flows,
A rabbit hops where knowledge grows.
With models trained and metrics logged,
In the fields of code, we proudly trod.
New paths we carve, with joy we sing,
For every change, a bright new spring! 🐇✨"




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🧹 Outside diff range and nitpick comments (10)
src/diabetes/diabetes_model.py (2)

5-5: Remove unnecessary blank line

-

6-31: Address architectural concerns and known limitations

Several architectural improvements are needed:

  1. Testing:

    • Add unit tests for each method
    • Include integration tests with MLflow
    • Add test fixtures with sample data
  2. Model Performance:

    • Implement cross-validation for robust evaluation
    • Add hyperparameter tuning using techniques like grid search or Bayesian optimization
    • Consider ensemble methods or different algorithms for better performance
  3. MLflow Integration:

    • Log more metrics and parameters
    • Add model versioning
    • Include model signature and input example (sketch after this list)
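
For the signature point, a minimal sketch using the standard mlflow.models API (X_train and diabetes_model are assumed to come from the training script; DiabetesModel's one-argument predict is used as shown elsewhere in this review):

from mlflow.models import infer_signature
import mlflow

signature = infer_signature(X_train, diabetes_model.predict(X_train))
mlflow.pyfunc.log_model(
    artifact_path="xgboost-model",
    python_model=diabetes_model,    # the PythonModel wrapper from this PR
    signature=signature,
    input_example=X_train.head(5),  # small sample logged as the input example
)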

Would you like me to:

  1. Generate a test suite template?
  2. Provide an implementation for hyperparameter tuning?
  3. Create an enhanced MLflow integration example?
main.py (4)

12-13: Consider externalizing MLflow URIs to configuration

Hardcoding the MLflow tracking and registry URIs within the code can reduce flexibility and make it difficult to manage different environments. Consider moving these URIs to the configuration file or environment variables to enhance maintainability and adaptability.
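
A minimal sketch of one option, falling back to the current Databricks defaults when no override is set (the environment variable names are assumptions):

import os
import mlflow

# Prefer environment overrides; default to the values currently hardcoded
mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI", "databricks"))
mlflow.set_registry_uri(os.environ.get("MLFLOW_REGISTRY_URI", "databricks-uc"))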


31-31: Evaluate passing spark as a parameter

Passing the spark session to save_raw_data_to_catalog may not be necessary if the DataProcessor can manage its own session internally. Consider encapsulating the Spark session within the DataProcessor class to improve modularity and reduce the need to pass it as an argument.
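
A sketch of that encapsulation, reusing the existing _pandas_to_spark_to_delta_cdf helper (the table name here is illustrative):

from pyspark.sql import SparkSession

class DataProcessor:
    def __init__(self, config):
        self.config = config
        # Own the Spark session instead of receiving it from main.py
        self.spark = SparkSession.builder.getOrCreate()

    def save_raw_data_to_catalog(self, pandas_df):
        # Callers no longer pass spark; the processor uses its own session
        self._pandas_to_spark_to_delta_cdf(pandas_df, "raw_data", self.spark)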


56-56: Retrieve the Git SHA programmatically

Hardcoding the Git SHA can lead to inconsistencies and maintenance issues. Consider retrieving the current Git commit SHA programmatically to ensure accuracy and automation.

Apply this change to retrieve the Git SHA dynamically:

-import subprocess
-git_sha = "ffa63b430205ff7"
+import subprocess
+git_sha = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode("utf-8").strip()

59-59: Consider retrieving the branch name programmatically

Hardcoding the branch name may lead to inconsistencies. Retrieve the current Git branch name programmatically to keep the code adaptable.

Apply this change:

-tags={"git_sha": f"{git_sha}", "branch": "week2"},
+current_branch = subprocess.check_output(["git", "rev-parse", "--abbrev-ref", "HEAD"]).decode("utf-8").strip()
+tags={"git_sha": f"{git_sha}", "branch": f"{current_branch}"},
src/diabetes/data_processor.py (4)

1-4: Consistency in importing pandas modules

Currently, both pandas as pd and from pandas import read_csv are imported. For consistency and clarity, consider using a single import style. Using pd.read_csv throughout the code is a common practice.

Apply this diff to adjust the imports and usage:

-import pandas as pd
-from pandas import read_csv
+import pandas as pd

And update the load_data method:

 def load_data(self, path):
-    return read_csv(path)
+    return pd.read_csv(path)

71-80: Add error handling for Spark operations

The method _pandas_to_spark_to_delta_cdf could encounter exceptions during DataFrame creation or when writing to the Delta table. Implementing error handling will make the code more robust.

Consider wrapping the operations in a try-except block:

def _pandas_to_spark_to_delta_cdf(self, pandas_df: pd.DataFrame, tbl_name: str, spark: SparkSession):
    try:
        ...  # existing DataFrame creation and Delta write logic
    except Exception as e:
        # Log the exception or handle it appropriately
        print(f"Error while processing {tbl_name}: {e}")
        raise

76-76: Validate table path components to prevent potential issues

When constructing delta_table_path, ensure that catalog_name, schema_name, and tbl_name are valid and do not contain special characters or leading/trailing spaces. This prevents runtime errors and potential security risks.
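
A hedged sketch of such a check (the helper name is hypothetical):

import re

def _validate_identifier(name: str) -> str:
    # Accept only alphanumerics and underscores after trimming whitespace
    cleaned = name.strip()
    if not re.fullmatch(r"[A-Za-z0-9_]+", cleaned):
        raise ValueError(f"Invalid table path component: {name!r}")
    return cleaned

Each of catalog_name, schema_name, and tbl_name could be passed through this helper before delta_table_path is assembled.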


80-80: Simplify the SQL command string

The SQL command in line 80 has unnecessary string concatenation. Simplifying it improves readability.

Apply this diff to streamline the command:

-spark.sql(f"ALTER TABLE {delta_table_path} " "SET TBLPROPERTIES (delta.enableChangeDataFeed = true);")
+spark.sql(f"ALTER TABLE {delta_table_path} SET TBLPROPERTIES (delta.enableChangeDataFeed = true);")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b5f3d7c and c9e21fe.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • config.yml (2 hunks)
  • main.py (2 hunks)
  • pyproject.toml (1 hunks)
  • src/diabetes/data_processor.py (3 hunks)
  • src/diabetes/diabetes_model.py (1 hunks)
🔇 Additional comments (10)
config.yml (1)

22-22: Verify the feature rename across the codebase

The categorical feature has been renamed from HeavyAlcoholConsump to HvyAlcoholConsump. While this works, ensure this change is consistently applied across all data processing and model training code.

✅ Verification successful

Let me search for any Python or R code files that might be using these feature names to ensure complete verification.


Feature rename is consistent across the codebase

The feature rename from HeavyAlcoholConsump to HvyAlcoholConsump is consistently applied across the codebase. The new name HvyAlcoholConsump appears in:

  • Configuration file (config.yml)
  • All data files (data/diabetes_binary_balanced.csv, data/diabetes_multiclass.csv, data/diabetes_binary.csv)

The data processing code uses feature names dynamically from the config file, so no code changes were needed. No instances of the old name HeavyAlcoholConsump were found.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining occurrences of the old feature name
# and verify the new name is used consistently

echo "Checking for old feature name 'HeavyAlcoholConsump'..."
rg "HeavyAlcoholConsump"

echo -e "\nVerifying usage of new feature name 'HvyAlcoholConsump'..."
rg "HvyAlcoholConsump"

Length of output: 1068


Script:

#!/bin/bash
# Search for Python/R files that might reference the feature names
echo "Searching for Python/R files with feature references..."
fd -e py -e R -e ipynb --exec rg -l "(Heavy|Hvy)AlcoholConsump"

echo -e "\nChecking contents of any data processing or model training scripts..."
fd "process|train|model|feature" -e py -e R -e ipynb --exec cat {}

Length of output: 6270

pyproject.toml (2)

7-20: LGTM! Dependencies are well-structured with appropriate version constraints.

The dependencies section is well-organized with:

  • Clear version bounds for all packages
  • Essential ML packages for the project objectives (mlflow, xgboost)
  • Consistent formatting

8-19: Verify versions of critical ML dependencies

Let's ensure we're using recent stable versions of key ML packages to benefit from performance improvements and security fixes.

✅ Verification successful

ML dependencies are up-to-date and secure

The specified versions in pyproject.toml are well-aligned with the latest stable releases and security requirements:

  • lightgbm 4.5.0: Matches latest version, no known vulnerabilities
  • scikit-learn >=1.5.1: Current (1.5.2) with patch available, above vulnerable range (<1.5.0)
  • mlflow >=2.16.0: Using recent version, well above vulnerable range (<=2.2.0)
  • xgboost 2.1.2: Matches latest version, no known vulnerabilities
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check latest versions of critical ML dependencies

echo "Checking latest versions on PyPI..."
for package in "lightgbm" "scikit-learn" "mlflow" "xgboost"
do
    version=$(curl -s "https://pypi.org/pypi/$package/json" | jq -r '.info.version')
    echo "$package: Latest version = $version"
done

# Check for any known vulnerabilities
echo -e "\nChecking for security advisories..."
for package in "lightgbm" "scikit-learn" "mlflow" "xgboost"
do
    gh api graphql -f query="{
        securityVulnerabilities(first: 1, ecosystem: PIP, package: \"$package\") {
            nodes {
                advisory {
                    summary
                    severity
                }
                vulnerableVersionRange
                firstPatchedVersion {
                    identifier
                }
            }
        }
    }" | jq .
done

Length of output: 3845

main.py (3)

78-79: Verify the availability of mlflow.data.from_pandas and mlflow.log_input

Please ensure that the methods mlflow.data.from_pandas and mlflow.log_input exist in your MLflow installation. These functions may not be present in all versions of MLflow or might be part of newer releases.

Run the following script to check for the presence of these functions:

#!/bin/bash
# Description: Check if 'from_pandas' and 'log_input' exist in the MLflow package.

# Test: Search the installed MLflow package for the functions
pip show mlflow

echo "Searching for 'from_pandas' in mlflow/data/__init__.py:"
grep -r "def from_pandas" $(python -c "import mlflow.data; print(mlflow.data.__path__[0])")

echo "Searching for 'log_input' in mlflow/__init__.py:"
grep -r "def log_input" $(python -c "import mlflow; print(mlflow.__path__[0])")

15-15: Verify the initialization of DatabricksSession

Ensure that DatabricksSession.builder.getOrCreate() correctly initializes the session as intended. Confirm that DatabricksSession provides a builder interface similar to SparkSession.

Run the following script to check if DatabricksSession has a builder attribute with a getOrCreate method:

#!/bin/bash
# Description: Verify that DatabricksSession has a builder with getOrCreate method.

# Search for the builder attribute in DatabricksSession
rg 'class DatabricksSession.*\n.*builder' -A 5

# Search for getOrCreate method in the builder
rg 'def getOrCreate' -A 2

43-48: Ensure correct assignment of column names after concatenation

While concatenating X_train and y_train, manually assigning column names may lead to errors if the order of columns changes. Verify that the columns are correctly assigned and consider using DataFrame operations that preserve column names to avoid potential issues.

Run the following script to check the assigned column names:

✅ Verification successful

This script will help us:

  1. Locate all Python files that use pandas DataFrame operations
  2. Find specific instances of DataFrame concatenation with axis=1 and column assignments
    to better understand the data processing flow and column handling in the codebase.

This will help us understand:

  1. How the training data is split and preprocessed before concatenation
  2. If there are any column validation mechanisms in place
    to determine if the manual column assignment is safe or needs additional safeguards.

Based on the evidence from the codebase, I can now generate the final response:

Column assignment after concatenation is safe and well-structured

The current implementation is actually safe because:

  • The data processing flow is well-controlled through the data_processor class
  • Feature names are preserved through the preprocessing pipeline as shown in data_processor.py
  • The concatenation simply adds a single "target" column, which is consistent across both train and test sets
  • Unit tests verify the shape and column structure of the processed data

The manual column assignment is appropriate here since it's a simple case of appending a single known target column name to the existing feature columns.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the concatenated DataFrame columns are correctly assigned.

# Test: Print the columns of the train and test DataFrames
echo "Train columns:"
head -n 5 data/train.csv | awk -F',' '{print $0}'

echo "Test columns:"
head -n 5 data/test.csv | awk -F',' '{print $0}'

Length of output: 653


Script:

#!/bin/bash
# Let's check the codebase for the data processing logic and column handling
# First, find relevant Python files
echo "Files containing pandas DataFrame operations:"
rg -l "pd\.concat|DataFrame" --type py

# Then look for specific concatenation and column assignment patterns
echo -e "\nDataFrame concatenation and column operations:"
rg "pd\.concat.*axis.*1|columns.*=.*list" -A 2 -B 2 --type py

Length of output: 784


Script:

#!/bin/bash
# Let's examine the data processing flow to understand how X_train and y_train are created
echo "Data splitting and preprocessing logic:"
rg "X_train|y_train" -B 3 -A 3 --type py

# Also check if there's any data validation or column verification
echo -e "\nColumn validation or verification logic:"
rg "columns|feature.*names" --type py

Length of output: 3222

src/diabetes/data_processor.py (4)

26-26: Verify the data type conversion of the target variable

Casting the target variable to integer using .astype(int) might not be appropriate if the target contains non-integer values (e.g., floats or categorical strings). Ensure that this casting aligns with the nature of your target variable and does not lead to data loss or errors.
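
One illustrative guard before casting, assuming the target column is numeric:

import pandas as pd

def safe_int_cast(series: pd.Series) -> pd.Series:
    # Cast only when every value is integral, so nothing is silently truncated
    if not (series.dropna() % 1 == 0).all():
        raise ValueError("Target contains non-integer values; refusing to cast")
    return series.astype(int)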


51-55: Transformation and DataFrame creation look good

The steps to transform the features and create a DataFrame with the appropriate column names are correctly implemented.


56-64: The split_data method correctly handles default parameters

The method now accepts optional test_size and random_state parameters, defaulting to configuration values when not provided. This enhances flexibility.


78-78: Confirm the use of 'overwrite' mode when writing data

Using .mode("overwrite") will replace any existing data at the delta_table_path. Verify that this is the intended behavior to prevent accidental data loss.

Comment on lines +34 to +38
model_parameters:
obj_fn: "multi:softmax"
n_estimators: 1000
max_depth: 6
learning_rate: 0.01

🛠️ Refactor suggestion

Consider adding parameters to address class imbalance

Given the PR objectives mention poor F1 score (0.39) and class imbalance issues, the current XGBoost parameters might need adjustment. Consider adding these parameters:

 model_parameters:
   obj_fn: "multi:softmax"
   n_estimators: 1000
   max_depth: 6
   learning_rate: 0.01
+  # Handle class imbalance
+  scale_pos_weight: [1.0, 1.0, 1.0]  # Adjust weights per class
+  max_delta_step: 1  # Helps with class imbalance
+  min_child_weight: 5  # Helps prevent overfitting on imbalanced classes
+  # Additional parameters for robust training
+  subsample: 0.8  # Prevents overfitting
+  colsample_bytree: 0.8  # Prevents overfitting

Additional suggestions:

  1. Consider using multi:softprob instead of multi:softmax to get probability outputs
  2. You might want to experiment with different class weights in scale_pos_weight based on your class distribution
  3. The learning rate (0.01) is quite conservative - you might want to try a higher value (e.g., 0.1) with fewer estimators for faster training during experimentation

For a more systematic approach to handling class imbalance:

  1. Consider implementing data-level solutions (e.g., SMOTE, class weights) in your data processing pipeline (see the sketch after this list)
  2. Add evaluation metrics specific to imbalanced classification (e.g., precision, recall per class)
  3. Set up MLflow to track these metrics across experiments
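
As an example of the class-weight option from the first point, a minimal sketch of inverse-frequency weights (the same heuristic as scikit-learn's "balanced" mode):

from collections import Counter

def balanced_class_weights(y):
    # weight(c) = n_samples / (n_classes * count(c)); average weight is 1
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

print(balanced_class_weights([0, 0, 0, 0, 1, 1, 2]))  # rare class 2 gets the largest weight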

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +28 to +31
# def get_feature_importance(self):
# feature_importance = self.model.named_steps['regressor'].feature_importances_
# feature_names = self.model.named_steps['preprocessor'].get_feature_names_out()
# return feature_importance, feature_names

⚠️ Potential issue

Fix feature importance implementation

The commented feature importance method assumes a scikit-learn pipeline structure that doesn't match the current implementation.

-    # def get_feature_importance(self):
-    #     feature_importance = self.model.named_steps['regressor'].feature_importances_
-    #     feature_names = self.model.named_steps['preprocessor'].get_feature_names_out()
-    #     return feature_importance, feature_names
+    def get_feature_importance(self, feature_names: List[str] = None) -> pd.DataFrame:
+        """Get feature importance scores.
+        
+        Args:
+            feature_names: Optional list of feature names
+            
+        Returns:
+            pd.DataFrame: Feature importance scores with feature names
+        """
+        importance = self.model.feature_importances_
+        if feature_names is None:
+            feature_names = [f"feature_{i}" for i in range(len(importance))]
+            
+        return pd.DataFrame({
+            'feature': feature_names,
+            'importance': importance
+        }).sort_values('importance', ascending=False)

Comment on lines +22 to +26
def evaluate(self, X_test, y_test):
y_pred = self.predict(X_test)
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, labels=[0, 1, 2], average="macro")
return acc, f1

🛠️ Refactor suggestion

Improve evaluation method robustness

The evaluate method needs improvements in error handling and flexibility.

-    def evaluate(self, X_test, y_test):
-        y_pred = self.predict(X_test)
-        acc = accuracy_score(y_test, y_pred)
-        f1 = f1_score(y_test, y_pred, labels=[0, 1, 2], average="macro")
-        return acc, f1
+    def evaluate(self, X_test, y_test):
+        """Evaluate model performance.
+        
+        Args:
+            X_test: Test features
+            y_test: Test labels
+            
+        Returns:
+            tuple: (accuracy, f1_score)
+            
+        Raises:
+            ValueError: If input data is empty or invalid
+        """
+        if len(X_test) == 0 or len(y_test) == 0:
+            raise ValueError("Empty test data provided")
+            
+        y_pred = self.predict(X_test)
+        
+        # Get unique labels from actual data
+        unique_labels = np.unique(y_test)
+        
+        acc = accuracy_score(y_test, y_pred)
+        f1 = f1_score(y_test, y_pred, 
+                     labels=unique_labels, 
+                     average="weighted",  # Consider weighted average for imbalanced classes
+                     zero_division=0)
+        
+        # Log detailed metrics
+        mlflow.log_metrics({
+            "accuracy": acc,
+            "f1_score": f1,
+            "test_samples": len(X_test)
+        })
+        
+        return acc, f1

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +16 to +18
def train(self, X_train, y_train):
self.model.fit(X_train, y_train)


🛠️ Refactor suggestion

Add error handling and training progress logging

The training method should validate inputs and log training progress.

-    def train(self, X_train, y_train):
-        self.model.fit(X_train, y_train)
+    def train(self, X_train, y_train):
+        """Train the model on the provided data.
+        
+        Args:
+            X_train: Training features
+            y_train: Training labels
+            
+        Raises:
+            ValueError: If input data is empty or invalid
+        """
+        if len(X_train) == 0 or len(y_train) == 0:
+            raise ValueError("Empty training data provided")
+        
+        # Log training data statistics
+        mlflow.log_metric("train_samples", len(X_train))
+        
+        # Handle class imbalance: XGBClassifier takes per-sample weights at
+        # fit time, so expand per-class 'balanced' weights to one per row.
+        # (Assumes: import numpy as np; from sklearn.utils.class_weight import compute_class_weight)
+        classes = np.unique(y_train)
+        class_weights = compute_class_weight('balanced', classes=classes, y=y_train)
+        weight_map = dict(zip(classes, class_weights))
+        sample_weight = np.asarray([weight_map[label] for label in y_train])
+
+        self.model.fit(X_train, y_train,
+                      sample_weight=sample_weight,
+                      eval_set=[(X_train, y_train)],
+                      verbose=True)

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +7 to +14
def __init__(self, config):
self.config = config
self.model = XGBClassifier(
objective=config["model_parameters"]["obj_fn"],
max_depth=config["model_parameters"]["max_depth"],
learning_rate=config["model_parameters"]["learning_rate"],
n_estimators=config["model_parameters"]["n_estimators"],
)

🛠️ Refactor suggestion

Add input validation, type hints, and documentation

The constructor could benefit from several improvements:

  1. Input validation to handle missing config parameters
  2. Type hints for better IDE support
  3. Docstring explaining the expected config structure
-    def __init__(self, config):
-        self.config = config
-        self.model = XGBClassifier(
-            objective=config["model_parameters"]["obj_fn"],
-            max_depth=config["model_parameters"]["max_depth"],
-            learning_rate=config["model_parameters"]["learning_rate"],
-            n_estimators=config["model_parameters"]["n_estimators"],
-        )
+    def __init__(self, config: dict) -> None:
+        """Initialize DiabetesModel with configuration parameters.
+        
+        Args:
+            config: Dictionary containing model_parameters with keys:
+                - obj_fn: Objective function for XGBoost
+                - max_depth: Maximum tree depth
+                - learning_rate: Learning rate
+                - n_estimators: Number of trees
+                
+        Raises:
+            KeyError: If required configuration parameters are missing
+        """
+        self.config = config
+        try:
+            model_params = config["model_parameters"]
+            self.model = XGBClassifier(
+                objective=model_params["obj_fn"],
+                max_depth=model_params["max_depth"],
+                learning_rate=model_params["learning_rate"],
+                n_estimators=model_params["n_estimators"],
+            )
+        except KeyError as e:
+            raise KeyError(f"Missing required configuration parameter: {e}")

main.py Outdated

logger.info("Train and test sets saved as delta tables.")

mlflow.set_experiment(experiment_name="/Workspace/Users/[email protected]/mlops_dev/model-training")

🛠️ Refactor suggestion

Avoid hardcoding user-specific paths and personal information

Hardcoding paths that include personal information, such as email addresses, can lead to maintainability issues and potential privacy concerns. Consider externalizing the experiment path to a configuration file or environment variable.

Apply this change to use a configuration parameter:

-mlflow.set_experiment(experiment_name="/Workspace/Users/[email protected]/mlops_dev/model-training")
+mlflow.set_experiment(experiment_name=config["mlflow_experiment_name"])

And add mlflow_experiment_name to your config.yml:

mlflow_experiment_name: "/Workspace/Users/<your_username>/mlops_dev/model-training"

main.py Outdated
Comment on lines +83 to +87
model_version = mlflow.register_model(
model_uri=f"runs:/{run_id}/xgboost-model",
name=f"{config['catalog_name']}.{config['schema_name']}.xgboost_model_basic",
tags={"git_sha": f"{git_sha}"},
)

⚠️ Potential issue

Ensure correct model registration in MLflow

When registering the model, verify that the model_uri and name parameters are correctly specified and that the target registry supports the provided catalog and schema names. Misconfiguration can lead to registration failures or namespace issues.

Double-check the configuration and consider updating as needed:

-model_version = mlflow.register_model(
-    model_uri=f"runs:/{run_id}/xgboost-model",
-    name=f"{config['catalog_name']}.{config['schema_name']}.xgboost_model_basic",
-    tags={"git_sha": f"{git_sha}"},
-)
+model_version = mlflow.register_model(
+    model_uri=f"runs:/{run_id}/xgboost-model",
+    name=config['model_name'],
+    tags={"git_sha": f"{git_sha}"},
+)

And add model_name to your config.yml:

model_name: "xgboost_model_basic"

self.X = self.df[self.config["num_features"] + self.config["cat_features"]]
self.y = self.df[target_col]
self.X[self.config["cat_features"]] = self.X[self.config["cat_features"]].astype(str)
self.X[self.config["num_features"]] = self.X[self.config["num_features"]].astype(int)

⚠️ Potential issue

Potential data loss when casting numerical features to integer

Casting numerical features to integers with .astype(int) can lead to loss of precision if the features are floats. This might negatively impact model performance.

Apply this diff to retain the numerical precision:

-    self.X[self.config["num_features"]] = self.X[self.config["num_features"]].astype(int)
+    # Ensure numerical features are of float type
+    self.X[self.config["num_features"]] = self.X[self.config["num_features"]].astype(float)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
tests/test_data_processor.py (1)

23-26: Consider verifying feature dimensions.

While checking only the number of samples is more flexible, it might miss issues with feature engineering. Consider adding assertions to verify that the feature dimensions are within expected bounds.

# Add after existing assertions
assert X_train.shape[1] >= 4  # At least original number of features
assert X_test.shape[1] == X_train.shape[1]  # Consistent features across splits
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c9e21fe and 7ff1982.

📒 Files selected for processing (1)
  • tests/test_data_processor.py (1 hunks)

Comment on lines +23 to +26
assert X_train.shape[0] == 3
assert X_test.shape[0] == 2
assert y_train.shape[0] == 3
assert y_test.shape[0] == 2

🛠️ Refactor suggestion

Add test cases for edge cases in data splitting.

The current test only verifies a basic 60/40 split. Consider adding test cases for:

  • Different test_size values (e.g., 0.2, 0.8)
  • Small datasets where rounding might affect split sizes
  • Stratification with imbalanced classes (especially given the mentioned F1 score issues)

Would you like me to help generate these additional test cases?
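
For instance, a hedged sketch of the first case, shown against train_test_split directly (DataProcessor.split_data, which appears to wrap it, would be exercised the same way):

import pytest
from sklearn.model_selection import train_test_split

@pytest.mark.parametrize("test_size", [0.2, 0.4, 0.8])
def test_split_sizes(test_size):
    X = list(range(10))
    y = [0] * 5 + [1] * 5
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42, stratify=y
    )
    # Splits must partition the data and keep features/labels aligned
    assert len(X_train) + len(X_test) == len(X)
    assert len(y_train) == len(X_train) and len(y_test) == len(X_test)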


@mvechtomova mvechtomova left a comment


Check out the package versions in the next submission; those will be critical.

mlflow.set_tracking_uri("databricks")
mlflow.set_registry_uri("databricks-uc")

spark = DatabricksSession.builder.getOrCreate()


SparkSession is preferable to DatabricksSession, as DatabricksSession relies on a specific version of Databricks Connect.
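
A minimal sketch of the suggested swap:

from pyspark.sql import SparkSession

# Works on any Spark runtime without pinning databricks-connect
spark = SparkSession.builder.getOrCreate()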

"scipy>=1.14.1, <2",
"matplotlib>=3.9.2, <4",
"databricks-feature-engineering>=0.6, <1"]
dependencies = [


For serving purposes, it is a good idea to pin packages to specific versions; check out the versions we provide in the week 2 branch.
This setup would break if you deploy model serving with feature engineering lookup (which depends on pyarrow version 14).
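
An illustrative pyproject.toml fragment with pins (the versions below are placeholders; take the real ones from the week 2 branch):

dependencies = [
    "pyarrow==14.0.1",  # model serving with feature lookup expects pyarrow 14
    "xgboost==2.1.2",
    "mlflow==2.16.0",
]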


spark_df.write.mode("overwrite").saveAsTable(delta_table_path)

spark.sql(f"ALTER TABLE {delta_table_path} " "SET TBLPROPERTIES (delta.enableChangeDataFeed = true);")


You do not need the " " before SET TBLPROPERTIES; just being picky here, no big deal.

@Wartynas Wartynas merged commit 4ebbe40 into main Nov 18, 2024
2 checks passed