Week 2 #2
Conversation
Walkthrough
The changes involve updates to several files, primarily enhancing the configuration and functionality of a machine learning workflow.
Changes
Sequence Diagram(s)
sequenceDiagram
participant User
participant Main
participant DataProcessor
participant DiabetesModel
participant MLflow
User->>Main: Start ML workflow
Main->>DataProcessor: Initialize and preprocess data
DataProcessor->>DataProcessor: Save raw data
DataProcessor->>Main: Return transformed DataFrame
Main->>DiabetesModel: Train model with training data
DiabetesModel->>MLflow: Log training metrics
Main->>User: Complete workflow
Possibly related PRs
Suggested reviewers
Actionable comments posted: 8
🧹 Outside diff range and nitpick comments (10)
src/diabetes/diabetes_model.py (2)
5-5
: Remove unnecessary blank line
6-31
: Address architectural concerns and known limitations
Several architectural improvements are needed:
Testing:
- Add unit tests for each method
- Include integration tests with MLflow
- Add test fixtures with sample data
Model Performance:
- Implement cross-validation for robust evaluation
- Add hyperparameter tuning using techniques like grid search or Bayesian optimization
- Consider ensemble methods or different algorithms for better performance
MLflow Integration:
- Log more metrics and parameters
- Add model versioning
- Include model signature and input example
Would you like me to:
- Generate a test suite template?
- Provide an implementation for hyperparameter tuning?
- Create an enhanced MLflow integration example?
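As a rough illustration of the cross-validation and MLflow-logging points above, here is a minimal sketch. It assumes the DiabetesModel instance exposes its fitted XGBClassifier as model.model (as implied elsewhere in this review); the diabetes_model variable name and the training frames are illustrative:
import mlflow
import numpy as np
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the training split, with summary scores logged to MLflow.
scores = cross_val_score(diabetes_model.model, X_train, y_train, cv=5, scoring="f1_macro")
mlflow.log_metric("cv_f1_macro_mean", float(np.mean(scores)))
mlflow.log_metric("cv_f1_macro_std", float(np.std(scores)))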
main.py (4)
12-13
: Consider externalizing MLflow URIs to configuration
Hardcoding the MLflow tracking and registry URIs within the code can reduce flexibility and make it difficult to manage different environments. Consider moving these URIs to the configuration file or environment variables to enhance maintainability and adaptability.
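For illustration, a minimal sketch of the environment-variable option; the variable names are assumptions, and the fallbacks mirror the values currently hardcoded in main.py:
import os
import mlflow

# Fall back to the current values when the environment does not override them.
mlflow.set_tracking_uri(os.getenv("MLFLOW_TRACKING_URI", "databricks"))
mlflow.set_registry_uri(os.getenv("MLFLOW_REGISTRY_URI", "databricks-uc"))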
31-31
: Evaluate passing spark as a parameter
Passing the spark session to save_raw_data_to_catalog may not be necessary if the DataProcessor can manage its own session internally. Consider encapsulating the Spark session within the DataProcessor class to improve modularity and reduce the need to pass it as an argument.
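A minimal sketch of that encapsulation (method body elided; assumes a plain SparkSession is acceptable here):
from pyspark.sql import SparkSession

class DataProcessor:
    def __init__(self, config: dict):
        self.config = config
        # Reuse an existing session or create one, instead of receiving it per call.
        self.spark = SparkSession.builder.getOrCreate()

    def save_raw_data_to_catalog(self, df):
        # Uses self.spark internally; no spark argument needed.
        ...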
56-56
: Retrieve the Git SHA programmatically
Hardcoding the Git SHA can lead to inconsistencies and maintenance issues. Consider retrieving the current Git commit SHA programmatically to ensure accuracy and automation.
Apply this change to retrieve the Git SHA dynamically:
-import subprocess
-git_sha = "ffa63b430205ff7"
+import subprocess
+git_sha = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode("utf-8").strip()
59-59
: Consider retrieving the branch name programmatically
Hardcoding the branch name may lead to inconsistencies. Retrieve the current Git branch name programmatically to keep the code adaptable.
Apply this change:
-tags={"git_sha": f"{git_sha}", "branch": "week2"},
+current_branch = subprocess.check_output(["git", "rev-parse", "--abbrev-ref", "HEAD"]).decode("utf-8").strip()
+tags={"git_sha": f"{git_sha}", "branch": f"{current_branch}"},
src/diabetes/data_processor.py (4)
1-4
: Consistency in importing pandas modules
Currently, both pandas as pd and from pandas import read_csv are imported. For consistency and clarity, consider using a single import style. Using pd.read_csv throughout the code is a common practice.
Apply this diff to adjust the imports and usage:
-import pandas as pd
-from pandas import read_csv
+import pandas as pd
And update the load_data method:
def load_data(self, path):
-    return read_csv(path)
+    return pd.read_csv(path)
71-80
: Add error handling for Spark operations
The method _pandas_to_spark_to_delta_cdf could encounter exceptions during DataFrame creation or when writing to the Delta table. Implementing error handling will make the code more robust.
Consider wrapping the operations in a try-except block:
def _pandas_to_spark_to_delta_cdf(self, pandas_df: pd.DataFrame, tbl_name: str, spark: SparkSession):
    try:
        # Existing code
        ...
    except Exception as e:
        # Log the exception or handle it appropriately
        print(f"Error while processing {tbl_name}: {e}")
        raise
76-76
: Validate table path components to prevent potential issues
When constructing delta_table_path, ensure that catalog_name, schema_name, and tbl_name are valid and do not contain special characters or leading/trailing spaces. This prevents runtime errors and potential security risks.
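A small sketch of such validation (the helper name is hypothetical):
import re

def _validate_identifier(name: str) -> str:
    """Allow only letters, digits, and underscores in catalog/schema/table name parts."""
    cleaned = name.strip()
    if not re.fullmatch(r"[A-Za-z0-9_]+", cleaned):
        raise ValueError(f"Invalid table path component: {name!r}")
    return cleaned

# delta_table_path = ".".join(_validate_identifier(p) for p in (catalog_name, schema_name, tbl_name))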
80-80
: Simplify the SQL command string
The SQL command in line 80 has unnecessary string concatenation. Simplifying it improves readability.
Apply this diff to streamline the command:
-spark.sql(f"ALTER TABLE {delta_table_path} " "SET TBLPROPERTIES (delta.enableChangeDataFeed = true);") +spark.sql(f"ALTER TABLE {delta_table_path} SET TBLPROPERTIES (delta.enableChangeDataFeed = true);")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
config.yml (2 hunks)
main.py (2 hunks)
pyproject.toml (1 hunks)
src/diabetes/data_processor.py (3 hunks)
src/diabetes/diabetes_model.py (1 hunks)
🔇 Additional comments (10)
config.yml (1)
22-22
: Verify the feature rename across the codebase
The categorical feature has been renamed from HeavyAlcoholConsump to HvyAlcoholConsump. While this works, ensure this change is consistently applied across all data processing and model training code.
✅ Verification successful
Let me search for any Python or R code files that might be using these feature names to ensure complete verification.
Feature rename is consistent across the codebase
The feature rename from HeavyAlcoholConsump to HvyAlcoholConsump is consistently applied across the codebase. The new name HvyAlcoholConsump appears in:
- Configuration file (config.yml)
- All data files (data/diabetes_binary_balanced.csv, data/diabetes_multiclass.csv, data/diabetes_binary.csv)
The data processing code uses feature names dynamically from the config file, so no code changes were needed. No instances of the old name HeavyAlcoholConsump were found.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any remaining occurrences of the old feature name
# and verify the new name is used consistently
echo "Checking for old feature name 'HeavyAlcoholConsump'..."
rg "HeavyAlcoholConsump"
echo -e "\nVerifying usage of new feature name 'HvyAlcoholConsump'..."
rg "HvyAlcoholConsump"
Length of output: 1068
Script:
#!/bin/bash
# Search for Python/R files that might reference the feature names
echo "Searching for Python/R files with feature references..."
fd -e py -e R -e ipynb --exec rg -l "H[ev][av]yAlcoholConsump"
echo -e "\nChecking contents of any data processing or model training scripts..."
fd "process|train|model|feature" -e py -e R -e ipynb --exec cat {}
Length of output: 6270
pyproject.toml (2)
7-20
: LGTM! Dependencies are well-structured with appropriate version constraints.
The dependencies section is well-organized with:
- Clear version bounds for all packages
- Essential ML packages for the project objectives (mlflow, xgboost)
- Consistent formatting
8-19
: Verify versions of critical ML dependencies
Let's ensure we're using recent stable versions of key ML packages to benefit from performance improvements and security fixes.
✅ Verification successful
ML dependencies are up-to-date and secure
The specified versions in pyproject.toml are well-aligned with the latest stable releases and security requirements:
- lightgbm 4.5.0: Matches latest version, no known vulnerabilities
- scikit-learn >=1.5.1: Current (1.5.2) with patch available, above vulnerable range (<1.5.0)
- mlflow >=2.16.0: Using recent version, well above vulnerable range (<=2.2.0)
- xgboost 2.1.2: Matches latest version, no known vulnerabilities
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check latest versions of critical ML dependencies
echo "Checking latest versions on PyPI..."
for package in "lightgbm" "scikit-learn" "mlflow" "xgboost"
do
version=$(curl -s "https://pypi.org/pypi/$package/json" | jq -r '.info.version')
echo "$package: Latest version = $version"
done
# Check for any known vulnerabilities
echo -e "\nChecking for security advisories..."
for package in "lightgbm" "scikit-learn" "mlflow" "xgboost"
do
gh api graphql -f query="{
securityVulnerabilities(first: 1, ecosystem: PIP, package: \"$package\") {
nodes {
advisory {
summary
severity
}
vulnerableVersionRange
firstPatchedVersion {
identifier
}
}
}
}" | jq .
done
Length of output: 3845
main.py (3)
78-79
: Verify the availability of mlflow.data.from_pandas and mlflow.log_input
Please ensure that the methods mlflow.data.from_pandas and mlflow.log_input exist in your MLflow installation. These functions may not be present in all versions of MLflow or might be part of newer releases.
Run the following script to check for the presence of these functions:
#!/bin/bash
# Description: Check if 'from_pandas' and 'log_input' exist in the MLflow package.
# Test: Search the installed MLflow package for the functions
pip show mlflow
echo "Searching for 'from_pandas' in mlflow/data/__init__.py:"
grep -r "def from_pandas" $(python -c "import mlflow.data; print(mlflow.data.__path__[0])")
echo "Searching for 'log_input' in mlflow/__init__.py:"
grep -r "def log_input" $(python -c "import mlflow; print(mlflow.__path__[0])")
15-15
: Verify the initialization of DatabricksSession
Ensure that DatabricksSession.builder.getOrCreate() correctly initializes the session as intended. Confirm that DatabricksSession provides a builder interface similar to SparkSession.
Run the following script to check if DatabricksSession has a builder attribute with a getOrCreate method:
#!/bin/bash
# Description: Verify that DatabricksSession has a builder with getOrCreate method.
# Search for the builder attribute in DatabricksSession
rg -U 'class DatabricksSession.*\n.*builder' -A 5
# Search for getOrCreate method in the builder
rg 'def getOrCreate' -A 2
43-48
: Ensure correct assignment of column names after concatenation
While concatenating X_train and y_train, manually assigning column names may lead to errors if the order of columns changes. Verify that the columns are correctly assigned and consider using DataFrame operations that preserve column names to avoid potential issues.
Run the following script to check the assigned column names:
✅ Verification successful
This script will help us:
- Locate all Python files that use pandas DataFrame operations
- Find specific instances of DataFrame concatenation with axis=1 and column assignments
to better understand the data processing flow and column handling in the codebase.
This will help us understand:
- How the training data is split and preprocessed before concatenation
- If there are any column validation mechanisms in place
to determine if the manual column assignment is safe or needs additional safeguards.
Based on the evidence from the codebase, I can now generate the final response:
Column assignment after concatenation is safe and well-structured
The current implementation is actually safe because:
- The data processing flow is well-controlled through the data_processor class
- Feature names are preserved through the preprocessing pipeline as shown in data_processor.py
- The concatenation simply adds a single "target" column, which is consistent across both train and test sets
- Unit tests verify the shape and column structure of the processed data
The manual column assignment is appropriate here since it's a simple case of appending a single known target column name to the existing feature columns.
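For reference, a column-preserving sketch of the same step (assumes X_train/y_train are the pandas objects returned by split_data; "target" mirrors the column name discussed above):
# Copy the feature frame (keeping its column names) and append the label column.
train_set = X_train.copy()
train_set["target"] = y_train.to_numpy()

test_set = X_test.copy()
test_set["target"] = y_test.to_numpy()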
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that the concatenated DataFrame columns are correctly assigned.
# Test: Print the columns of the train and test DataFrames
echo "Train columns:"
head -n 5 data/train.csv | awk -F',' '{print $0}'
echo "Test columns:"
head -n 5 data/test.csv | awk -F',' '{print $0}'
Length of output: 653
Script:
#!/bin/bash
# Let's check the codebase for the data processing logic and column handling
# First, find relevant Python files
echo "Files containing pandas DataFrame operations:"
rg -l "pd\.concat|DataFrame" --type py
# Then look for specific concatenation and column assignment patterns
echo -e "\nDataFrame concatenation and column operations:"
rg "pd\.concat.*axis.*1|columns.*=.*list" -A 2 -B 2 --type py
Length of output: 784
Script:
#!/bin/bash
# Let's examine the data processing flow to understand how X_train and y_train are created
echo "Data splitting and preprocessing logic:"
rg "X_train|y_train" -B 3 -A 3 --type py
# Also check if there's any data validation or column verification
echo -e "\nColumn validation or verification logic:"
rg "columns|feature.*names" --type py
Length of output: 3222
src/diabetes/data_processor.py (4)
26-26
: Verify the data type conversion of the target variable
Casting the target variable to integer using .astype(int) might not be appropriate if the target contains non-integer values (e.g., floats or categorical strings). Ensure that this casting aligns with the nature of your target variable and does not lead to data loss or errors.
51-55
: Transformation and DataFrame creation look good
The steps to transform the features and create a DataFrame with the appropriate column names are correctly implemented.
56-64
: The split_data method correctly handles default parameters
The method now accepts optional test_size and random_state parameters, defaulting to configuration values when not provided. This enhances flexibility.
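A sketch of what that fallback logic typically looks like (the config key names are assumptions):
from sklearn.model_selection import train_test_split

def split_data(self, test_size=None, random_state=None):
    # Fall back to values from config.yml when the caller does not supply them.
    test_size = self.config["test_size"] if test_size is None else test_size
    random_state = self.config["random_state"] if random_state is None else random_state
    return train_test_split(self.X, self.y, test_size=test_size, random_state=random_state)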
78-78
: Confirm the use of 'overwrite' mode when writing data
Using .mode("overwrite") will replace any existing data at the delta_table_path. Verify that this is the intended behavior to prevent accidental data loss.
model_parameters:
  obj_fn: "multi:softmax"
  n_estimators: 1000
  max_depth: 6
  learning_rate: 0.01
🛠️ Refactor suggestion
Consider adding parameters to address class imbalance
Given the PR objectives mention poor F1 score (0.39) and class imbalance issues, the current XGBoost parameters might need adjustment. Consider adding these parameters:
model_parameters:
  obj_fn: "multi:softmax"
  n_estimators: 1000
  max_depth: 6
  learning_rate: 0.01
+  # Handle class imbalance
+  scale_pos_weight: [1.0, 1.0, 1.0]  # Adjust weights per class
+  max_delta_step: 1  # Helps with class imbalance
+  min_child_weight: 5  # Helps prevent overfitting on imbalanced classes
+  # Additional parameters for robust training
+  subsample: 0.8  # Prevents overfitting
+  colsample_bytree: 0.8  # Prevents overfitting
Additional suggestions:
- Consider using multi:softprob instead of multi:softmax to get probability outputs
- You might want to experiment with different class weights in scale_pos_weight based on your class distribution
- The learning rate (0.01) is quite conservative; you might want to try a higher value (e.g., 0.1) with fewer estimators for faster training during experimentation
For a more systematic approach to handling class imbalance:
- Consider implementing data-level solutions (e.g., SMOTE, class weights) in your data processing pipeline
- Add evaluation metrics specific to imbalanced classification (e.g., precision, recall per class)
- Set up MLflow to track these metrics across experiments
Committable suggestion skipped: line range outside the PR's diff.
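As a data-level illustration of the SMOTE option mentioned above (a sketch; it assumes the imbalanced-learn package is added as a dependency and reuses the train/predict API described in this review):
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report

# Oversample minority classes so each class is equally represented in training.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
model.train(X_train_bal, y_train_bal)

# Per-class precision/recall, as suggested for evaluating imbalanced classes.
print(classification_report(y_test, model.predict(X_test)))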
src/diabetes/diabetes_model.py
Outdated
# def get_feature_importance(self):
#     feature_importance = self.model.named_steps['regressor'].feature_importances_
#     feature_names = self.model.named_steps['preprocessor'].get_feature_names_out()
#     return feature_importance, feature_names
Fix feature importance implementation
The commented feature importance method assumes a scikit-learn pipeline structure that doesn't match the current implementation.
- # def get_feature_importance(self):
- # feature_importance = self.model.named_steps['regressor'].feature_importances_
- # feature_names = self.model.named_steps['preprocessor'].get_feature_names_out()
- # return feature_importance, feature_names
+ def get_feature_importance(self, feature_names: List[str] = None) -> pd.DataFrame:
+ """Get feature importance scores.
+
+ Args:
+ feature_names: Optional list of feature names
+
+ Returns:
+ pd.DataFrame: Feature importance scores with feature names
+ """
+ importance = self.model.feature_importances_
+ if feature_names is None:
+ feature_names = [f"feature_{i}" for i in range(len(importance))]
+
+ return pd.DataFrame({
+ 'feature': feature_names,
+ 'importance': importance
+ }).sort_values('importance', ascending=False)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def get_feature_importance(self, feature_names: List[str] = None) -> pd.DataFrame:
    """Get feature importance scores.

    Args:
        feature_names: Optional list of feature names

    Returns:
        pd.DataFrame: Feature importance scores with feature names
    """
    importance = self.model.feature_importances_
    if feature_names is None:
        feature_names = [f"feature_{i}" for i in range(len(importance))]

    return pd.DataFrame({
        'feature': feature_names,
        'importance': importance
    }).sort_values('importance', ascending=False)
src/diabetes/diabetes_model.py
Outdated
def evaluate(self, X_test, y_test):
    y_pred = self.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, labels=[0, 1, 2], average="macro")
    return acc, f1
🛠️ Refactor suggestion
Improve evaluation method robustness
The evaluate method needs improvements in error handling and flexibility.
- def evaluate(self, X_test, y_test):
- y_pred = self.predict(X_test)
- acc = accuracy_score(y_test, y_pred)
- f1 = f1_score(y_test, y_pred, labels=[0, 1, 2], average="macro")
- return acc, f1
+ def evaluate(self, X_test, y_test):
+ """Evaluate model performance.
+
+ Args:
+ X_test: Test features
+ y_test: Test labels
+
+ Returns:
+ tuple: (accuracy, f1_score)
+
+ Raises:
+ ValueError: If input data is empty or invalid
+ """
+ if len(X_test) == 0 or len(y_test) == 0:
+ raise ValueError("Empty test data provided")
+
+ y_pred = self.predict(X_test)
+
+ # Get unique labels from actual data
+ unique_labels = np.unique(y_test)
+
+ acc = accuracy_score(y_test, y_pred)
+ f1 = f1_score(y_test, y_pred,
+ labels=unique_labels,
+ average="weighted", # Consider weighted average for imbalanced classes
+ zero_division=0)
+
+ # Log detailed metrics
+ mlflow.log_metrics({
+ "accuracy": acc,
+ "f1_score": f1,
+ "test_samples": len(X_test)
+ })
+
+ return acc, f1
Committable suggestion skipped: line range outside the PR's diff.
src/diabetes/diabetes_model.py
Outdated
def train(self, X_train, y_train):
    self.model.fit(X_train, y_train)
🛠️ Refactor suggestion
Add error handling and training progress logging
The training method should validate inputs and log training progress.
- def train(self, X_train, y_train):
- self.model.fit(X_train, y_train)
+ def train(self, X_train, y_train):
+ """Train the model on the provided data.
+
+ Args:
+ X_train: Training features
+ y_train: Training labels
+
+ Raises:
+ ValueError: If input data is empty or invalid
+ """
+ if len(X_train) == 0 or len(y_train) == 0:
+ raise ValueError("Empty training data provided")
+
+ # Log training data statistics
+ mlflow.log_metric("train_samples", len(X_train))
+
+ # Handle class imbalance
+ class_weights = compute_class_weight('balanced',
+ classes=np.unique(y_train),
+ y=y_train)
+ self.model.set_params(sample_weight=class_weights)
+
+ self.model.fit(X_train, y_train,
+ eval_set=[(X_train, y_train)],
+ verbose=True)
Committable suggestion skipped: line range outside the PR's diff.
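One caveat on the sketch above: XGBClassifier accepts per-sample weights through fit(sample_weight=...) rather than set_params, and compute_class_weight returns one weight per class rather than per row. A possible adjustment (a sketch, not a committable change):
from sklearn.utils.class_weight import compute_sample_weight

# One weight per training row, inversely proportional to its class frequency.
sample_weights = compute_sample_weight(class_weight="balanced", y=y_train)
self.model.fit(
    X_train,
    y_train,
    sample_weight=sample_weights,
    eval_set=[(X_train, y_train)],
    verbose=True,
)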
src/diabetes/diabetes_model.py
Outdated
def __init__(self, config):
    self.config = config
    self.model = XGBClassifier(
        objective=config["model_parameters"]["obj_fn"],
        max_depth=config["model_parameters"]["max_depth"],
        learning_rate=config["model_parameters"]["learning_rate"],
        n_estimators=config["model_parameters"]["n_estimators"],
    )
🛠️ Refactor suggestion
Add input validation, type hints, and documentation
The constructor could benefit from several improvements:
- Input validation to handle missing config parameters
- Type hints for better IDE support
- Docstring explaining the expected config structure
- def __init__(self, config):
- self.config = config
- self.model = XGBClassifier(
- objective=config["model_parameters"]["obj_fn"],
- max_depth=config["model_parameters"]["max_depth"],
- learning_rate=config["model_parameters"]["learning_rate"],
- n_estimators=config["model_parameters"]["n_estimators"],
- )
+ def __init__(self, config: dict) -> None:
+ """Initialize DiabetesModel with configuration parameters.
+
+ Args:
+ config: Dictionary containing model_parameters with keys:
+ - obj_fn: Objective function for XGBoost
+ - max_depth: Maximum tree depth
+ - learning_rate: Learning rate
+ - n_estimators: Number of trees
+
+ Raises:
+ KeyError: If required configuration parameters are missing
+ """
+ self.config = config
+ try:
+ model_params = config["model_parameters"]
+ self.model = XGBClassifier(
+ objective=model_params["obj_fn"],
+ max_depth=model_params["max_depth"],
+ learning_rate=model_params["learning_rate"],
+ n_estimators=model_params["n_estimators"],
+ )
+ except KeyError as e:
+ raise KeyError(f"Missing required configuration parameter: {e}")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def __init__(self, config: dict) -> None:
    """Initialize DiabetesModel with configuration parameters.

    Args:
        config: Dictionary containing model_parameters with keys:
            - obj_fn: Objective function for XGBoost
            - max_depth: Maximum tree depth
            - learning_rate: Learning rate
            - n_estimators: Number of trees

    Raises:
        KeyError: If required configuration parameters are missing
    """
    self.config = config
    try:
        model_params = config["model_parameters"]
        self.model = XGBClassifier(
            objective=model_params["obj_fn"],
            max_depth=model_params["max_depth"],
            learning_rate=model_params["learning_rate"],
            n_estimators=model_params["n_estimators"],
        )
    except KeyError as e:
        raise KeyError(f"Missing required configuration parameter: {e}")
main.py
Outdated
logger.info("Train and test sets saved as delta tables.")

mlflow.set_experiment(experiment_name="/Workspace/Users/[email protected]/mlops_dev/model-training") |
🛠️ Refactor suggestion
Avoid hardcoding user-specific paths and personal information
Hardcoding paths that include personal information, such as email addresses, can lead to maintainability issues and potential privacy concerns. Consider externalizing the experiment path to a configuration file or environment variable.
Apply this change to use a configuration parameter:
-mlflow.set_experiment(experiment_name="/Workspace/Users/[email protected]/mlops_dev/model-training")
+mlflow.set_experiment(experiment_name=config["mlflow_experiment_name"])
And add mlflow_experiment_name to your config.yml:
mlflow_experiment_name: "/Workspace/Users/<your_username>/mlops_dev/model-training"
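For illustration, a sketch of wiring that up (assumes main.py loads config.yml with PyYAML, as it appears to do for the model parameters):
import yaml
import mlflow

with open("config.yml") as f:
    config = yaml.safe_load(f)

# The experiment path now comes from configuration instead of being hardcoded.
mlflow.set_experiment(experiment_name=config["mlflow_experiment_name"])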
main.py
Outdated
model_version = mlflow.register_model(
    model_uri=f"runs:/{run_id}/xgboost-model",
    name=f"{config['catalog_name']}.{config['schema_name']}.xgboost_model_basic",
    tags={"git_sha": f"{git_sha}"},
)
Ensure correct model registration in MLflow
When registering the model, verify that the model_uri and name parameters are correctly specified and that the target registry supports the provided catalog and schema names. Misconfiguration can lead to registration failures or namespace issues.
Double-check the configuration and consider updating as needed:
-model_version = mlflow.register_model(
- model_uri=f"runs:/{run_id}/xgboost-model",
- name=f"{config['catalog_name']}.{config['schema_name']}.xgboost_model_basic",
- tags={"git_sha": f"{git_sha}"},
-)
+model_version = mlflow.register_model(
+ model_uri=f"runs:/{run_id}/xgboost-model",
+ name=config['model_name'],
+ tags={"git_sha": f"{git_sha}"},
+)
And add model_name to your config.yml:
model_name: "xgboost_model_basic"
src/diabetes/data_processor.py
Outdated
self.X = self.df[self.config["num_features"] + self.config["cat_features"]]
self.y = self.df[target_col]
self.X[self.config["cat_features"]] = self.X[self.config["cat_features"]].astype(str)
self.X[self.config["num_features"]] = self.X[self.config["num_features"]].astype(int)
Potential data loss when casting numerical features to integer
Casting numerical features to integers with .astype(int)
can lead to loss of precision if the features are floats. This might negatively impact model performance.
Apply this diff to retain the numerical precision:
- self.X[self.config["num_features"]] = self.X[self.config["num_features"]].astype(int)
+ # Ensure numerical features are of float type
+ self.X[self.config["num_features"]] = self.X[self.config["num_features"]].astype(float)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# Ensure numerical features are of float type
self.X[self.config["num_features"]] = self.X[self.config["num_features"]].astype(float)
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
tests/test_data_processor.py (1)
23-26
: Consider verifying feature dimensions.
While checking only the number of samples is more flexible, it might miss issues with feature engineering. Consider adding assertions to verify that the feature dimensions are within expected bounds.
# Add after existing assertions
assert X_train.shape[1] >= 4  # At least original number of features
assert X_test.shape[1] == X_train.shape[1]  # Consistent features across splits
assert X_train.shape[0] == 3
assert X_test.shape[0] == 2
assert y_train.shape[0] == 3
assert y_test.shape[0] == 2
🛠️ Refactor suggestion
Add test cases for edge cases in data splitting.
The current test only verifies a basic 60/40 split. Consider adding test cases for:
- Different test_size values (e.g., 0.2, 0.8)
- Small datasets where rounding might affect split sizes
- Stratification with imbalanced classes (especially given the mentioned F1 score issues)
Would you like me to help generate these additional test cases?
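A rough sketch of such parametrized cases (the make_processor fixture is hypothetical, and the 4-tuple return order assumes the usual train_test_split convention):
import pytest

@pytest.mark.parametrize("test_size", [0.2, 0.4, 0.8])
def test_split_data_sizes(make_processor, test_size):
    processor = make_processor(n_rows=10)
    X_train, X_test, y_train, y_test = processor.split_data(test_size=test_size)
    # The split must account for every row and keep features and labels aligned.
    assert len(X_train) + len(X_test) == 10
    assert len(y_train) == len(X_train)
    assert len(y_test) == len(X_test)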
check out the package versions in the next submission - those will be critical.
mlflow.set_tracking_uri("databricks")
mlflow.set_registry_uri("databricks-uc")

spark = DatabricksSession.builder.getOrCreate()
SparkSession is preferable over DatabricksSession, as DatabricksSession relies on a specific version of databricks-connect.
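A minimal sketch of that suggestion:
from pyspark.sql import SparkSession

# Returns the cluster session on Databricks, or creates a local one elsewhere,
# without depending on a specific databricks-connect version.
spark = SparkSession.builder.getOrCreate()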
"scipy>=1.14.1, <2", | ||
"matplotlib>=3.9.2, <4", | ||
"databricks-feature-engineering>=0.6, <1"] | ||
dependencies = [ |
For serving purposes, it is a good idea to pin packages to specific versions - check out the versions we provide in the week 2 branch.
This would break if you are going to deploy model serving with feature engineering lookup (it depends on pyarrow version 14).
spark_df.write.mode("overwrite").saveAsTable(delta_table_path)

spark.sql(f"ALTER TABLE {delta_table_path} " "SET TBLPROPERTIES (delta.enableChangeDataFeed = true);")
You do not need the extra " " before SET TBLPROPERTIES - just being picky here, no big deal.
Things performed in the week 2 submission:
I am aware of the limitations present in this pull request:
Please let me know if you have any thoughts regarding these limitations.
Summary by CodeRabbit
Release Notes
New Features
- DiabetesModel class for training, predicting, and evaluating diabetes classification models.
Improvements
- Improvements to data processing in DataProcessor.
Dependencies
- Added xgboost as a new dependency for model training.