Feature/week2 #2
Conversation
Walkthrough: This pull request introduces several changes across multiple files, primarily focusing on the configuration and data processing for a loan prediction model.
Actionable comments posted: 25
🧹 Outside diff range and nitpick comments (46)
config/config.yml (2)
`1-2`: **Consider environment-specific configuration files.** While the schema name change from 'lj_test' to 'lj_loan_prediction' is more descriptive, consider implementing environment-specific configuration files (e.g., dev.yml, staging.yml, prod.yml) to manage different settings across environments.
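A small sketch of how such per-environment files could be selected at runtime, reusing the project's existing `ProjectConfig.from_yaml` loader (the `APP_ENV` variable and the `config/{env}.yml` layout are hypothetical, not part of the current repo):

```python
import os

from loan_prediction.config import ProjectConfig

# Hypothetical layout: config/dev.yml, config/staging.yml, config/prod.yml
env = os.getenv("APP_ENV", "dev")
config = ProjectConfig.from_yaml(config_path=f"config/{env}.yml")
```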
`4-7`: **Consider externalizing model hyperparameters.** The current hyperparameters might need tuning based on different environments or datasets. Consider:
- Moving these to a separate model-specific config file
- Using MLflow for hyperparameter tracking and optimization
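For the MLflow option, a minimal sketch assuming a configured tracking URI and the `config.parameters` dict already defined in `config.yml`:

```python
import mlflow

# Log the hyperparameters to an MLflow run so they are tracked
# alongside the resulting metrics and model artifacts.
with mlflow.start_run(run_name="rf-hyperparams"):
    mlflow.log_params(config.parameters)  # e.g. {"n_estimators": ..., "max_depth": ...}
```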
loan_prediction/config.py (3)
`1-6`: **Remove extra empty line.** There's an unnecessary double empty line between imports and the class definition.

```diff
 from pydantic import BaseModel
-

 class ProjectConfig(BaseModel):
```
`7-13`: **Enhance type safety and add field validations.** While using Pydantic is great for configuration management, consider these improvements:

- The `parameters` field using `Dict[str, Any]` is too permissive. Consider defining a more specific structure for model parameters to catch configuration errors early.
- Add field constraints to validate the configuration values (e.g., non-empty lists, valid schema names).

```diff
+from pydantic import BaseModel, Field, constr
+
 class ProjectConfig(BaseModel):
-    num_features: List[str]
-    cat_features: List[str]
-    target: str
-    catalog_name: str
-    schema_name: str
-    parameters: Dict[str, Any]  # Dictionary to hold model-related parameters
+    num_features: List[str] = Field(..., min_items=1)
+    cat_features: List[str] = Field(..., min_items=1)
+    target: str = Field(..., min_length=1)
+    catalog_name: constr(min_length=1)
+    schema_name: constr(min_length=1, regex=r'^[a-z][a-z0-9_]*$')
+    parameters: Dict[str, float | int | str]  # More specific types for parameters
```
`7-20`: **Consider MLOps best practices for configuration management.** Since this is part of an MLOps project with MLflow integration, consider these architectural improvements:
- Add configuration versioning to track changes and ensure reproducibility
- Support environment-specific configurations (dev/staging/prod)
- Add methods to serialize the configuration to MLflow for experiment tracking
- Consider adding validation for MLflow-specific parameters
This will help maintain consistency across different environments and experiments.
Would you like me to provide an example implementation of these improvements?
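As a rough sketch of points 1 and 3 (the `config_version` field and the `log_to_mlflow` helper are illustrative additions, not part of the current class):

```python
from typing import Any, Dict, List

import mlflow
from pydantic import BaseModel


class ProjectConfig(BaseModel):
    config_version: str = "1"  # hypothetical: bump when the config schema changes
    num_features: List[str]
    cat_features: List[str]
    target: str
    catalog_name: str
    schema_name: str
    parameters: Dict[str, Any]

    def log_to_mlflow(self) -> None:
        """Attach the full configuration to the active MLflow run for reproducibility."""
        mlflow.log_dict(self.model_dump(), "project_config.json")
        mlflow.log_param("config_version", self.config_version)
```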
pyproject.toml (1)
`20-20`: **Consider adding an upper version bound for pydantic.** While the addition of Pydantic for configuration management is a good practice, consider adding an upper version bound to maintain consistency with other dependencies and prevent potential breaking changes:

```diff
-    "pydantic>=2.9.2",
+    "pydantic>=2.9.2, <3",
```

notes.md (1)
`52-54`: **Add language specification to code block.** Add `bash` as the language specification to improve readability and enable syntax highlighting.

````diff
-```
+```bash
 uv run pre-commit run --all-files
````

🧰 Tools: 🪛 Markdownlint — 52-52: Fenced code blocks should have a language specified (MD040, fenced-code-language)

loan_prediction/loan_classifier_model.py (2)

`14-15`: **Consider enhancing configuration access patterns and type safety.** While the changes work correctly, there are a few improvements that could make the code more robust and maintainable:

1. Consider using a consistent access pattern (either all dot notation or all dict notation)
2. Add type hints and parameter validation

Here's a suggested improvement:

```diff
+from typing import Any
+from dataclasses import dataclass
+
+@dataclass
+class ModelParameters:
+    n_estimators: int
+    max_depth: int
+
 class LoanClassifierModel:
-    def __init__(self, preprocessor, config):
+    def __init__(self, preprocessor: Any, config: Any) -> None:
+        if not hasattr(config, 'parameters'):
+            raise ValueError("Config must have 'parameters' attribute")
+        if not all(k in config.parameters for k in ['n_estimators', 'max_depth']):
+            raise ValueError("Config parameters must contain 'n_estimators' and 'max_depth'")
         self.config = config
         self.model = Pipeline(
             steps=[
                 ("preprocessor", preprocessor),
                 (
                     "classifier",
                     RandomForestClassifier(
-                        n_estimators=config.parameters["n_estimators"],
-                        max_depth=config.parameters["max_depth"],
+                        n_estimators=config.parameters.n_estimators,
+                        max_depth=config.parameters.max_depth,
                         random_state=42,
                     ),
                 ),
```
Line range hint `1-38`: **Consider additional improvements to enhance code quality.** While not directly related to the changes, here are some suggestions to improve the overall code quality:
- Add class and method docstrings
- Add type hints for all methods
- Enhance the evaluate method to return more metrics
- Structure feature importance return values better
Here's a partial example of the suggested improvements:
```python
from typing import Tuple, List, Any

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


class LoanClassifierModel:
    """Loan classification model using Random Forest classifier with preprocessing pipeline.

    Attributes:
        config: Configuration object containing model parameters
        model: Scikit-learn pipeline with preprocessor and classifier
    """

    def evaluate(self, X_test: np.ndarray, y_test: np.ndarray) -> dict:
        """Evaluate model performance using multiple metrics.

        Returns:
            dict: Dictionary containing accuracy, precision, recall, and f1 scores
        """
        y_pred = self.predict(X_test)
        return {
            'accuracy': accuracy_score(y_test, y_pred),
            'precision': precision_score(y_test, y_pred),
            'recall': recall_score(y_test, y_pred),
            'f1': f1_score(y_test, y_pred),
        }

    def get_feature_importance(self) -> Tuple[np.ndarray, List[str]]:
        """Get feature importance scores and corresponding feature names.

        Returns:
            Tuple containing:
                - np.ndarray: Feature importance scores
                - List[str]: Corresponding feature names
        """
```

notebooks/week2/02.mlflow_experiment.py (2)
`29-36`: **Enhance run metadata and metric logging structure.** The current implementation could benefit from additional context and structure:
- Include more git metadata for better traceability
- Add structured metric logging with timestamps
Consider this enhanced approach:
```diff
+from datetime import datetime
+
 with mlflow.start_run(
     run_name="demo-run",
     tags={"git_sha": GIT_SHA,
-          "branch": "week2"},
+          "branch": "week2",
+          "commit_timestamp": get_commit_timestamp(),
+          "commit_author": get_commit_author(),
+          "commit_message": get_commit_message()},
     description="demo run",
 ) as run:
     mlflow.log_params({"type": "demo"})
-    mlflow.log_metrics({"metric1": 1.0, "metric2": 2.0})
+    timestamp = datetime.now().timestamp()
+    metrics = {
+        "metric1": {"value": 1.0, "timestamp": timestamp, "step": 0},
+        "metric2": {"value": 2.0, "timestamp": timestamp, "step": 0}
+    }
+    for name, meta in metrics.items():
+        mlflow.log_metric(name, meta["value"], meta["timestamp"], meta["step"])
```
`1-59`: **Add proper documentation and type hints.** The notebook would benefit from:
- Module-level docstring explaining the purpose
- Type hints for better code maintainability
- Removal of unnecessary command separator comments
Consider adding this documentation at the start:
""" MLflow Experiment Management Script This notebook handles MLflow experiment tracking in Databricks, including: - Experiment creation and configuration - Run tracking with git metadata - Metric and parameter logging - Run information retrieval and storage Usage: This notebook should be run in a Databricks environment with MLflow configured. """ from typing import Dict, Any, Optionalnotebooks/week2/01.prepare_dataset.py (3)
`13-14`: **Remove unused imports.** The following imports are not used in the code:
- `datetime`
- `pandas`

```diff
-from datetime import datetime
-import pandas as pd
```

🧰 Tools: 🪛 Ruff — 13-13, 14-14: Module level import not at top of file (E402); 13-13: `datetime.datetime` imported but unused (F401); 14-14: `pandas` imported but unused (F401)
`20-21`: **Remove redundant logging configuration.** There are two similar logging configurations, with one commented out. Remove the commented line to avoid confusion.

```diff
-# logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
 logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s')
```
`27-27`: **Consider using an absolute path or environment variable for config.** Using relative paths (`../../config/config.yml`) in notebooks can be fragile, as it depends on the notebook's location. Consider:

- Using an absolute path with the Databricks workspace path
- Setting the config path via environment variable

```diff
-config = ProjectConfig.from_yaml(config_path="../../config/config.yml")
+config = ProjectConfig.from_yaml(config_path="/Volumes/heiaepgah71pwedmld01001/lj_loan_prediction/config/config.yml")
```

notebooks/week1/week1_test_package.py (2)
Line range hint `29-30`: **Consider making the filepath configurable.** The filepath is hardcoded to a specific Databricks volume (`/Volumes/heiaepgah71pwedmld01001/lj_test/mlops_data/train.csv`). Consider moving this to the configuration file for better maintainability and flexibility across different environments. Suggested approach:

- Add the filepath to `config.yml`
- Load it through the `ConfigLoader`
- Pass it to `DataLoader` through configuration (see the sketch after the example below)

Example config structure:

```yaml
data:
  train_data_path: '/Volumes/heiaepgah71pwedmld01001/lj_test/mlops_data/train.csv'
```
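A minimal sketch of that wiring; the `ConfigLoader`/`DataLoader` interfaces shown here (a `load()` that returns the parsed YAML as a dict, a `filepath` keyword argument, and the import path) are assumptions, not the package's confirmed API:

```python
from loan_prediction.utils import ConfigLoader, DataLoader  # assumed import path

# Assumed interfaces; adjust to the actual ConfigLoader/DataLoader signatures.
config = ConfigLoader("./config/config.yml").load()  # assumed: returns a dict
train_path = config["data"]["train_data_path"]

data_loader = DataLoader(filepath=train_path)        # assumed: accepts a filepath
data_loader.load()
```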
Line range hint `8-8`: **Consider using MLflow for model tracking.** Given that this PR's objective is to integrate MLflow, consider adding MLflow tracking to log model metrics, parameters, and artifacts. This would align with the Week 2 goals mentioned in the PR description. Example integration:

```python
import mlflow

# Track parameters
mlflow.log_param("model_type", "loan_classifier")

# Track metrics
mlflow.log_metric("model_score", score)

# Log model
mlflow.sklearn.log_model(model, "loan_classifier_model")
```

notebooks/week2/README.md (5)
`13-18`: **Add language specification to code block.** For better syntax highlighting and documentation standards, specify the language in the code block.

````diff
-```
+```python
 02.prepare_dataset.py
````

🧰 Tools: 🪛 Markdownlint — 14-14: Fenced code blocks should have a language specified (MD040, fenced-code-language)

`19-24`: **Add language specification to code block.** For better syntax highlighting and documentation standards, specify the language in the code block.

````diff
-```
+```python
 02.mlflow_experiment.py
````

🧰 Tools: 🪛 Markdownlint — 20-20: Fenced code blocks should have a language specified (MD040, fenced-code-language)

`26-35`: **Fix grammar and code block formatting.**

1. Grammar: Replace "an sklearn" with "a sklearn", as "sklearn" is pronounced starting with a consonant sound.
2. Add language specification to the code block.

````diff
-```
+```python
 03.log_and_register_model.py
````

🧰 Tools: 🪛 LanguageTool — ~31: "an sklearn" → "a sklearn" (EN_A_VS_AN); 🪛 Markdownlint — 27-27: Fenced code blocks should have a language specified (MD040)

`36-45`: **Fix section title and grammar.**

1. The section title appears to be incorrect. Consider changing it to "Logging and Registering a Custom Model" to match the content.
2. Grammar: Replace "an sklearn" with "a sklearn", as "sklearn" is pronounced starting with a consonant sound.

```diff
-### 4. Logging and Registering a Model
+### 4. Logging and Registering a Custom Model
```

🧰 Tools: 🪛 LanguageTool — ~42: "an sklearn" → "a sklearn" (EN_A_VS_AN); 🪛 Markdownlint — 37-37: Fenced code blocks should have a language specified (MD040)
`47-48`: **Improve code block formatting.** For consistency and better documentation standards, use proper markdown code block formatting with language specification.

````diff
-```05.log_and_register_fe_model.py```
+```python
+05.log_and_register_fe_model.py
+```
````

notebooks/week2/03.log_and_register_model.py (7)
`35-36`: **Avoid redundant loading of the training set.** The training set is being loaded twice from the same table, which can be inefficient. Reuse `train_set_spark` to avoid redundant queries:

```diff
 train_set_spark = spark.table(f"{catalog_name}.{schema_name}.train_set")
-train_set = spark.table(f"{catalog_name}.{schema_name}.train_set").toPandas()
+train_set = train_set_spark.toPandas()
```

Apply the same optimization for the test set:

```diff
-test_set = spark.table(f"{catalog_name}.{schema_name}.test_set").toPandas()
+test_set_spark = spark.table(f"{catalog_name}.{schema_name}.test_set")
+test_set = test_set_spark.toPandas()
```
`60-60`: **Parameterize the MLflow experiment name.** Hardcoding the experiment name reduces flexibility. Consider parameterizing it to allow for easier configuration and to avoid hardcoding in multiple places. Update the code to use the experiment name from your configuration:

```diff
-mlflow.set_experiment(experiment_name='/Shared/lj_loan_prediction')
+mlflow.set_experiment(experiment_name=config.experiment_name)
```

Ensure that `experiment_name` is added to your `config.yml` file and included in the `ProjectConfig` class.
`61-61`: **Automate retrieval of Git commit SHA.** Hardcoding the Git commit SHA may lead to inconsistencies. It's better to retrieve it automatically to ensure accuracy. Use the `subprocess` module to dynamically retrieve the current commit SHA:

```diff
-git_sha = "ffa63b430205ff7"
+import subprocess
+git_sha = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).strip().decode()
```

Ensure that the script is executed in a Git repository context where this command can run successfully.
`95-99`: **Log an `input_example` when logging the model.** Currently, the model is logged without an `input_example`, which helps in validating the model's inputs during deployment and serving. Provide an `input_example` when logging the model:

```diff
 mlflow.sklearn.log_model(
     sk_model=pipeline,
     artifact_path="lightgbm-pipeline-model",
     signature=signature,
+    input_example=X_train.iloc[:1]
 )
```
This logs a sample input that can be used for model validation during deployment.
`48-48`: **Remove trailing whitespace.** Line 48 contains trailing whitespace, which should be removed to adhere to code style guidelines.

```diff
-    transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), cat_features)], 
+    transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), cat_features)],
```

🧰 Tools: 🪛 Ruff — 48-48: Trailing whitespace (W291)

`94-94`: **Remove whitespace on blank line.** Line 94 is a blank line containing whitespace. Please remove the whitespace to maintain code cleanliness.

🧰 Tools: 🪛 Ruff — 94-94: Blank line contains whitespace (W293)

`110-110`: **Remove trailing whitespace.** Line 110 contains trailing whitespace, which should be removed to adhere to code style guidelines.

```diff
-# from the endpoint suggested test 
+# from the endpoint suggested test
```

🧰 Tools: 🪛 Ruff — 110-110: Trailing whitespace (W291)
notebooks/week2/04.log_and_register_custom_model.py (3)
`74-79`: **Avoid using `toPandas()` on large datasets to prevent memory issues.** Converting large Spark DataFrames to Pandas using `toPandas()` can lead to out-of-memory errors. If the datasets are large, consider processing data using Spark operations or sampling the data before converting.
`94-94`: **Retrieve `git_sha` dynamically instead of hardcoding.** Hardcoding the Git SHA reduces flexibility and may lead to inconsistencies with the actual code version. Consider obtaining the current commit SHA programmatically to ensure accuracy. You can retrieve the Git SHA using the `subprocess` module:

```python
import subprocess

git_sha = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()
```

Also applies to: 97-97, 129-129
`57-57`: **Remove trailing whitespace to adhere to code style guidelines.** Static analysis tools have detected trailing whitespace on blank lines, which can affect code readability and may lead to unnecessary diffs in version control. Affected lines:

- Line 57
- Line 60
- Line 98
- Line 139
- Line 140

🧰 Tools: 🪛 Ruff — 57-57: Blank line contains whitespace (W293)
loan_prediction/data_processor.py (4)
`27-29`: **Assign `self.df` in `load_data()` for clarity.** Currently, `self.df` is assigned in the `preprocess()` method. Assigning `self.df` directly in the `load_data()` method enhances clarity by ensuring that data loading responsibilities are contained within `load_data()`. Apply this diff:

```diff
 def load_data(self):
     self.dataloader.load()
+    self.df = self.dataloader.data  # data sitting in self.dataloader.data
```
`73-87`: **Remove commented-out code to improve readability.** The code contains several blocks of commented-out code that are not in use. Removing this unused code will clean up the codebase and improve readability. Apply this diff:

```diff
-    # example for filing missing values
-    # median_age = self.df['person_age'].median()
-    # self.df['person_age'].fillna(median_age, inplace=True)
-
-    # current_year = datetime.now().year
-    # self.df['person_age'] = current_year - self.df['person_birth']
-    # self.df.drop(columns=['person_birth'], inplace=True)
-
-    # Fill missing values with mean or default values
-    # self.df.fillna({
-    #     'LotFrontage': self.df['LotFrontage'].mean(),
-    #     'MasVnrType': 'None',
-    #     'MasVnrArea': 0,
-    # }, inplace=True)
```
`103-106`: **Make `random_state` configurable via `config`.** Hardcoding `random_state` can hinder reproducibility in different environments. Consider retrieving `random_state` from `self.config` to allow flexibility. Apply this diff:

```diff
 def split_train_test(self, test_size=0.2, random_state=42):
     """Split the DataFrame (self.df) into training and test sets."""
-    train_set, test_set = train_test_split(self.df, test_size=test_size, random_state=random_state)
+    train_set, test_set = train_test_split(
+        self.df, test_size=test_size, random_state=self.config.random_state
+    )
     return train_set, test_set
```

Ensure that `random_state` is defined in your configuration.
`111-111`: **Use the logger consistently throughout the class.** An instance of `logger` is created, and it logs an info message when saving to the catalog. Consider adding more logging statements at key steps in the data processing pipeline to aid in debugging and monitoring. For example, add logging in methods like `load_data()`, `preprocess()`, and `split_train_test()`.
`6-6`: **Remove trailing whitespace.** There's trailing whitespace at the end of the comment on line 6. Removing it will improve code cleanliness.

```diff
-# dbutils.library.restartPython() 
+# dbutils.library.restartPython()
```

🧰 Tools: 🪛 Ruff — 6-6: Trailing whitespace (W291)
`10-10`: **Remove unused imports.** The following imports are not used in the code and can be removed to clean it up:

- Line 10: `import yaml`
- Line 15: `from pyspark.sql import functions as F`
- Line 18: `from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score`
- Line 22: `from datetime import datetime`

Apply this diff to remove the unused imports:

```diff
-import yaml
-from pyspark.sql import functions as F
-from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
-from datetime import datetime
```

Also applies to: 15-15, 18-18, 22-22

🧰 Tools: 🪛 Ruff — 10-10: `yaml` imported but unused (F401)
`94-104`: **Remove unused import inside the SQL function definition.** Within the SQL function definition, the import `from datetime import datetime` on line 100 is unnecessary since `datetime` is not used. Consider removing it to keep the function concise. Apply this diff:

```diff
 CREATE OR REPLACE FUNCTION {function_name}(person_age INT)
 RETURNS INT
 LANGUAGE PYTHON
 AS
 $$
-from datetime import datetime
 return 12 * person_age
 $$
```
`117-118`: **Remove trailing whitespace.** Lines 117 and 118 contain trailing whitespace. Removing it will improve code cleanliness.

```diff
 .drop(
-    # "cb_person_cred_hist_length", 
-    # "loan_amnt" , 
+    # "cb_person_cred_hist_length",
+    # "loan_amnt",
```

🧰 Tools: 🪛 Ruff — 117-117, 118-118: Trailing whitespace (W291)
`117-119`: **Clean up commented-out code.** Lines 117-119 contain commented-out columns in the `.drop()` method. If these columns are no longer needed, remove the comments to clean up the code. Apply this diff:

```diff
 .drop(
-    # "cb_person_cred_hist_length",
-    # "loan_amnt",
     "person_income")
```

🧰 Tools: 🪛 Ruff — 117-117, 118-118: Trailing whitespace (W291)
`151-151`: **Remove extraneous expression.** The standalone expression `parameters` on line 151 doesn't affect the program's execution. Remove it unless it's needed for debugging. Apply this diff:

```diff
-parameters
```

🧰 Tools: 🪛 Ruff — 151-151: Found useless expression. Either assign it to a variable or remove it. (B018)
`169-171`: **Rename the pipeline step from 'regressor' to 'classifier'.** The pipeline step is named `'regressor'` but uses a `RandomForestClassifier`. For clarity and consistency, rename the step to `'classifier'`. Apply this diff:

```diff
-pipeline = Pipeline(
-    steps=[("preprocessor", preprocessor), ("regressor", RandomForestClassifier(
-        n_estimators=parameters['n_estimators'], max_depth=parameters['max_depth'], random_state=42))]
-)
+pipeline = Pipeline(
+    steps=[("preprocessor", preprocessor), ("classifier", RandomForestClassifier(
+        n_estimators=parameters['n_estimators'], max_depth=parameters['max_depth'], random_state=42))]
+)
```
`176-176`: **Avoid hardcoding the Git SHA.** The variable `git_sha` is hardcoded with the value `"ffa63b430205ff7"`. For maintainability, obtain the Git SHA dynamically so it reflects the current commit:

```python
import subprocess

git_sha = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()
```
`185-190`: **Remove or uncomment unused metric calculations.** Lines 185-190 contain metric calculations that are currently commented out. If these metrics are no longer needed, remove them; if they are needed, uncomment them and use them in your analysis.
`107-107`: **Avoid reloading `train_set` unnecessarily.** The variable `train_set` is reloaded from the table on line 107, but it was already loaded earlier on line 61 and hasn't been modified since. Remove the redundant load to streamline the code:

```diff
-train_set = spark.table(f"{catalog_name}.{schema_name}.train_set")
```
`212-212`: **Remove whitespace on blank line.** Line 212 is a blank line that contains whitespace. Removing the whitespace will adhere to code style guidelines.

🧰 Tools: 🪛 Ruff — 212-212: Blank line contains whitespace (W293)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

⛔ Files ignored due to path filters (1): `uv.lock` is excluded by `!**/*.lock`

📒 Files selected for processing (16)

- `.gitignore` (1 hunks)
- `config/config.yml` (1 hunks)
- `loan_prediction/config.py` (1 hunks)
- `loan_prediction/data_processor.py` (3 hunks)
- `loan_prediction/loan_classifier_model.py` (1 hunks)
- `loan_prediction/utils.py` (1 hunks)
- `main.py` (1 hunks)
- `notebooks/week1/week1_test_package.py` (1 hunks)
- `notebooks/week2/01.prepare_dataset.py` (1 hunks)
- `notebooks/week2/02.mlflow_experiment.py` (1 hunks)
- `notebooks/week2/03.log_and_register_model.py` (1 hunks)
- `notebooks/week2/04.log_and_register_custom_model.py` (1 hunks)
- `notebooks/week2/05.log_and_register_fe_model.py` (1 hunks)
- `notebooks/week2/README.md` (1 hunks)
- `notes.md` (1 hunks)
- `pyproject.toml` (1 hunks)
🧰 Additional context used

🪛 Ruff — notebooks/week2/01.prepare_dataset.py

- 6-6: Undefined name `dbutils` (F821)
- 10-15, 19-19: Module level import not at top of file (E402)
- 13-13: `datetime.datetime` imported but unused (F401)
- 14-14: `pandas` imported but unused (F401)

🪛 Ruff — notebooks/week2/03.log_and_register_model.py

- 3-3: SyntaxError: Expected a statement; simple statements must be separated by newlines or semicolons
- 48-48, 110-110: Trailing whitespace (W291)
- 94-94: Blank line contains whitespace (W293)

🪛 Ruff — notebooks/week2/04.log_and_register_custom_model.py

- 3-3: SyntaxError: Expected a statement; simple statements must be separated by newlines or semicolons
- 57-57, 60-60, 98-98, 140-140: Blank line contains whitespace (W293)
- 139-139: Trailing whitespace (W291)

🪛 Ruff — notebooks/week2/05.log_and_register_fe_model.py

- 6-6, 117-117, 118-118: Trailing whitespace (W291)
- 10-10: `yaml` imported but unused (F401)
- 15-15: `pyspark.sql.functions` imported but unused (F401)
- 18-18: `sklearn.metrics.mean_squared_error`, `mean_absolute_error`, `r2_score` imported but unused (F401)
- 22-22: `datetime.datetime` imported but unused (F401)
- 151-151: Found useless expression (B018)
- 206-206: Undefined name `training_set` (F821)
- 212-212: Blank line contains whitespace (W293)

🪛 LanguageTool — notebooks/week2/README.md

- ~31, ~42: Use "a" instead of "an" if the following word doesn't start with a vowel sound, e.g. "a sklearn" (EN_A_VS_AN)

🪛 Markdownlint — notebooks/week2/README.md

- 14-14, 20-20, 27-27, 37-37: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🪛 Markdownlint — notes.md

- 52-52: Fenced code blocks should have a language specified (MD040, fenced-code-language)
🔇 Additional comments (18)
config/config.yml (2)
Line range hint `9-19`: LGTM! Feature selection looks appropriate. The separation between numerical and categorical features is clear and well-organized. The removal of the 'id' feature from numerical features is a good practice, as it shouldn't be used for modeling.
Line range hint `21-21`: LGTM! Target variable is clearly defined. The target variable 'loan_status' is appropriately specified for the loan prediction use case.
loan_prediction/utils.py (2)
`25-26`: Formatting changes look good.

`27-28`: 🛠️ Refactor suggestion
**Critical: Arbitrary prediction scaling requires justification and safeguards.** The `adjust_predictions` function arbitrarily scales predictions by a default factor of 1.3, which raises several concerns:
- In a loan prediction context, arbitrarily scaling predictions could significantly impact business decisions and risk assessments.
- The function lacks documentation explaining the purpose and justification for the scaling.
- Missing input validation for the scale factor could lead to invalid predictions if negative or zero values are provided.
Consider implementing these improvements:
```diff
-def adjust_predictions(predictions, scale_factor=1.3):
-    return predictions * scale_factor
+from typing import Union, Sequence
+import numpy as np
+
+def adjust_predictions(
+    predictions: Union[Sequence[float], np.ndarray],
+    scale_factor: float = 1.3
+) -> np.ndarray:
+    """Adjust model predictions by a scale factor.
+
+    Args:
+        predictions: Array-like of model predictions to adjust
+        scale_factor: Positive multiplier to scale predictions (default: 1.3)
+
+    Returns:
+        Scaled predictions as numpy array
+
+    Raises:
+        ValueError: If scale_factor is not positive
+    """
+    if scale_factor <= 0:
+        raise ValueError("scale_factor must be positive")
+
+    return np.asarray(predictions) * scale_factor
```
Additionally:
- Document the business justification for the 1.3 scale factor
- Consider making this configurable via MLflow parameters
- Add unit tests to verify scaling behavior
- Log the scale factor as part of model metadata for reproducibility
Let's check if this scaling is used in model training or evaluation:
notes.md (2)
`42-43`: LGTM! Improved file organization and copy command. The updated path better reflects the project structure, and the addition of the `--overwrite` flag prevents copy failures.
`56-56`: Please provide details about ci.yml modifications. The note mentions modifying `.github/workflows/ci.yml` but doesn't specify what changes are needed. Consider adding:
- The specific modifications required
- The purpose of these modifications
- How they relate to the pre-commit setup
Let me help verify the current ci.yml configuration:
.gitignore (1)

`100-100`: LGTM! Good practice to ignore wheel files. Adding `*.whl` to `.gitignore` prevents Python wheel files (binary package distributions) from being committed to the repository. This is especially relevant for a project using MLflow and additional dependencies like pydantic.
notebooks/week2/02.mlflow_experiment.py (1)

`1-59`: Verify security considerations for data handling. Please ensure:
- Proper access controls are in place for experiment and run data
- Sensitive information is not inadvertently logged
- File permissions are appropriately set when writing to DBFS
main.py (4)

`3-8`: LGTM! Clean imports and proper logging setup.

`66-79`: Clarify the status of commented evaluation code. Several critical ML workflow steps are commented out:
- Model evaluation
- Results visualization
- Feature importance analysis
These are essential for model quality assessment. Please clarify if this is temporary and add TODO comments with tracking issues if these will be implemented later.
```bash
#!/bin/bash
# Check if these features are implemented elsewhere
rg -l "def (evaluate|visualize_results|plot_feature_importance)"
```
`41-52`: Clean up commented code and verify data splitting implementation. The code contains a mix of old and new data splitting implementations, and Databricks-specific code is commented out. This needs cleanup for better maintainability.
```bash
#!/bin/bash
# Check for test coverage of split_train_test method
rg -A 5 "def test.*split_train_test" "tests/"
```
Also, consider adding validation for the split ratios and resulting dataset sizes:
```diff
+# Validate split results
+if len(train_set) == 0 or len(test_set) == 0:
+    raise ValueError("Data split resulted in empty train or test set")
+logger.info(f"Split sizes - Train: {len(train_set)}, Test: {len(test_set)}")
```
`13-16`: Consider environment-aware configuration path handling. The hardcoded config path `./config/config.yml` might cause issues in different environments. Consider making it configurable via environment variables or command-line arguments.
```diff
-config = ProjectConfig.from_yaml(config_path="./config/config.yml")
+config_path = os.getenv("CONFIG_PATH", "./config/config.yml")
+config = ProjectConfig.from_yaml(config_path=config_path)
```
Also, be cautious about logging the entire config object as it might expose sensitive information in logs.
notebooks/week1/week1_test_package.py (1)

`54-54`: Review data loading implementation for consistency. The `load_data()` call no longer takes a filepath parameter, but the notebook still creates a `DataLoader` with a specific filepath that isn't being used. This suggests an inconsistent pattern in data loading.
Let's verify the data loading implementation:
notebooks/week2/README.md (2)

`1-10`: LGTM! Clear and well-structured introduction. The introduction effectively sets the context for Week 2 materials and MLflow concepts.

`48-65`: LGTM! Excellent explanation of feature lookup concepts. The explanation effectively uses real-world examples (ice cream demand) to illustrate feature lookup concepts, and clearly distinguishes between static and dynamic features. The transition to the house price dataset example is well-explained.
loan_prediction/data_processor.py (3)

`24-24`: Ensure the preprocessor is updated if the configuration changes. The preprocessor is built once during initialization; if `self.config` is modified after initialization, the preprocessor may become outdated. Consider rebuilding the preprocessor when the configuration changes, or enforce immutability of the configuration after initialization. If dynamic configuration is expected, you might update `_build_preprocessor()` to be called whenever the configuration changes.
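One illustrative way to do that, assuming the class keeps a `_build_preprocessor()` method as described above (the property-based invalidation is a sketch, not the current implementation):

```python
class DataProcessor:
    def __init__(self, config):
        self._config = config
        self._preprocessor = None  # built lazily, invalidated when config changes

    @property
    def config(self):
        return self._config

    @config.setter
    def config(self, new_config):
        self._config = new_config
        self._preprocessor = None  # stale: force a rebuild on next access

    @property
    def preprocessor(self):
        if self._preprocessor is None:
            self._preprocessor = self._build_preprocessor()  # existing method
        return self._preprocessor
```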
`14-14`: Set up logging configuration if not already done. While you're obtaining a logger instance, ensure that the logging system is configured to handle and direct logs appropriately. If not configured elsewhere, you might add basic configuration. If logging is not configured, I can help set it up. Would you like assistance in configuring the logging system?
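A minimal sketch of such a setup; placing it once at the application entry point (e.g., `main.py`) rather than inside library modules is an assumption about the project layout:

```python
import logging

# Configure the root logger once at the entry point; library modules should
# only call logging.getLogger(__name__) and emit records.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)
logger.info("Logging configured")
```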
`121-127`: Reconsider using 'append' mode when writing to tables. Using `'append'` mode in `saveAsTable()` can lead to duplicate data entries if the method is called multiple times. Unless intentional, consider using `'overwrite'` mode or implementing logic to handle duplicates. Please verify whether appending data is intended. If you aim to replace the table content each time, modify the write mode as follows:

```python
train_set_with_timestamp.write.mode("append").saveAsTable(
    f"{self.config.catalog_name}.{self.config.schema_name}.train_set"
)
```

Change to:

```python
train_set_with_timestamp.write.mode("overwrite").saveAsTable(
    f"{self.config.catalog_name}.{self.config.schema_name}.train_set"
)
```
```python
with open("run_info.json", "w") as json_file:
    json.dump(run_info, json_file, indent=4)
```
🛠️ Refactor suggestion
Use appropriate filesystem operations and add error handling.
Similar to the earlier file operations, this should use the Databricks filesystem.
Consider this approach:
```diff
-with open("run_info.json", "w") as json_file:
-    json.dump(run_info, json_file, indent=4)
+try:
+    run_info_json = json.dumps(run_info, indent=4)
+    dbutils.fs.put(f"dbfs:/run_info.json", run_info_json, overwrite=True)
+except Exception as e:
+    print(f"Error saving run info: {str(e)}")
+    raise
```
```python
conda_env = _mlflow_conda_env(
    additional_conda_deps=None,
    additional_pip_deps=["/Volumes/heiaepgah71pwedmld01001/lj_loan_prediction/packages/mlops_with_databricks-0.0.1-py3-none-any.whl",
                         ],
    additional_conda_channels=None,
)
```
**Avoid hardcoding local paths in `additional_pip_deps`.** Specifying local file paths in `additional_pip_deps` reduces portability and may fail in different environments. It's better to install dependencies from package repositories or ensure they're included in the environment beforehand.
Apply this diff to use a package repository:

```diff
 conda_env = _mlflow_conda_env(
     additional_conda_deps=None,
-    additional_pip_deps=["/Volumes/heiaepgah71pwedmld01001/lj_loan_prediction/packages/mlops_with_databricks-0.0.1-py3-none-any.whl",
-                         ],
+    additional_pip_deps=["mlops_with_databricks==0.0.1"],
     additional_conda_channels=None,
 )
```
Committable suggestion skipped: line range outside the PR's diff.
```python
self.df[cat_col] = self.df[cat_col].astype("category")

# Extract target and relevant features
# TODO make sure there is an 'id' column in the data
```
**Address the TODO: Ensure the 'id' column exists in the data.** A TODO comment indicates that the 'id' column may be missing. Since 'id' is included in `relevant_columns`, it's crucial to verify its presence to prevent potential `KeyError` exceptions.
Would you like assistance in adding a check to ensure the 'id' column exists, or updating the data loading process to include it? I can help implement a solution or open a GitHub issue to track this task.
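A minimal sketch of such a guard; placing it in the preprocessing step, before `relevant_columns` is used, is an assumption about where the check belongs:

```python
# Hypothetical guard: fail fast with a clear message instead of a bare KeyError.
if "id" not in self.df.columns:
    raise KeyError(
        "Expected an 'id' column in the input data; add it during data loading "
        "or generate a surrogate key before selecting relevant_columns."
    )
```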
```python
model=pipeline,
flavor=mlflow.sklearn,
artifact_path="randomforest-pipeline-model-fe",
training_set=training_set,
```
**Fix undefined variable `training_set`.** The variable `training_set` used on line 206 is not defined. Did you mean to use `training_set_fe`, which was created earlier with `fe.create_training_set()`? Apply this diff to fix the issue:
```diff
 fe.log_model(
     model=pipeline,
     flavor=mlflow.sklearn,
     artifact_path="randomforest-pipeline-model-fe",
-    training_set=training_set,
+    training_set=training_set_fe,
     signature=signature,
 )
```
🧰 Tools: 🪛 Ruff — 206-206: Undefined name `training_set` (F821)
Looks nice, thanks!
Completed Week 2 notebooks for MLflow and model registration.
Summary by CodeRabbit

Release Notes

- New Features
- Bug Fixes
- Documentation
- Chores: updated `.gitignore` to exclude wheel files.