
Feature/week2 #2

Merged: 8 commits into main, Nov 5, 2024

Conversation

@lawrencejoffe lawrencejoffe commented Nov 4, 2024

Completed Week 2 notebooks for MLflow and model registration.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a structured workflow for dataset preparation and model management in Databricks.
    • Added functionality for logging and registering machine learning models using MLflow.
    • Implemented a new method for adjusting predictions based on a scale factor.
  • Bug Fixes

    • Updated configuration settings for loan prediction, ensuring correct schema and feature handling.
  • Documentation

    • Enhanced README and notes to provide clearer instructions on using Databricks and MLflow integration.
    • Added new sections detailing the updated file structure and pre-commit hook usage.
  • Chores

    • Updated .gitignore to exclude wheel files.
    • Added a new dependency for Pydantic in the project configuration.

@lawrencejoffe lawrencejoffe requested a review from a team as a code owner November 4, 2024 17:18

coderabbitai bot commented Nov 4, 2024

Walkthrough

This pull request introduces several changes across multiple files, primarily focusing on the configuration and data processing for a loan prediction model. Key modifications include updates to the .gitignore file to exclude wheel files, changes in the configuration schema, and the introduction of a new configuration class. The DataProcessor class has been enhanced with new methods for data preprocessing and logging. Additionally, new scripts for data preparation, MLflow experiment tracking, and model registration have been added. Overall, these changes refine the project's structure and functionality.

Changes

| File | Change Summary |
| --- | --- |
| .gitignore | Added entry *.whl to ignore wheel files. |
| config/config.yml | Updated schema_name from lj_test to lj_loan_prediction; removed id from num_features. |
| loan_prediction/config.py | Added ProjectConfig class with method from_yaml for loading YAML configurations. |
| loan_prediction/data_processor.py | Added logging, new methods (preprocess, split_train_test, save_to_catalog), and modified existing methods. |
| loan_prediction/loan_classifier_model.py | Changed access method for n_estimators and max_depth from dictionary-style to attribute-style. |
| loan_prediction/utils.py | Added adjust_predictions function to modify prediction values. |
| main.py | Replaced ConfigLoader with ProjectConfig; updated data loading and model initialization methods. |
| notebooks/week1/week1_test_package.py | Updated load_data method call to not require filepath argument. |
| notebooks/week2/01.prepare_dataset.py | Introduced a new workflow for dataset preparation in Databricks. |
| notebooks/week2/02.mlflow_experiment.py | Added functionality for managing MLflow experiments. |
| notebooks/week2/03.log_and_register_model.py | Implemented a workflow for training and logging a loan prediction model. |
| notebooks/week2/04.log_and_register_custom_model.py | Added functionality for logging and registering a custom model with MLflow. |
| notebooks/week2/05.log_and_register_fe_model.py | Implemented a pipeline for loan prediction using a Random Forest Classifier. |
| notebooks/week2/README.md | Updated to provide an overview of Week 2 materials and scripts. |
| notes.md | Modified file paths and added instructions for pre-commit hooks. |
| pyproject.toml | Added new dependency pydantic>=2.9.2. |

Poem

🐇 In the meadow, changes bloom,
Configs shift, dispelling gloom.
Data flows, models take flight,
Predictions shine, oh what a sight!
With each update, we hop and cheer,
For a brighter path, the future's near! 🌼



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 25

🧹 Outside diff range and nitpick comments (46)
config/config.yml (2)

1-2: Consider environment-specific configuration files.

While the schema name change from 'lj_test' to 'lj_loan_prediction' is more descriptive, consider implementing environment-specific configuration files (e.g., dev.yml, staging.yml, prod.yml) to manage different settings across environments.


4-7: Consider externalizing model hyperparameters.

The current hyperparameters might need tuning based on different environments or datasets. Consider:

  1. Moving these to a separate model-specific config file
  2. Using MLflow for hyperparameter tracking and optimization
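A minimal sketch of option 1, assuming hypothetical per-environment files such as config/model_params.dev.yml and a DEPLOY_ENV variable (neither exists in this PR):

import os
from typing import Optional

import yaml


def load_model_params(env: Optional[str] = None) -> dict:
    """Load model hyperparameters for one environment (dev/staging/prod)."""
    env = env or os.getenv("DEPLOY_ENV", "dev")
    # e.g. config/model_params.dev.yml containing n_estimators, max_depth, ...
    with open(f"config/model_params.{env}.yml") as f:
        return yaml.safe_load(f)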
loan_prediction/config.py (3)

1-6: Remove extra empty line.

There's an unnecessary double empty line between imports and the class definition.

 from pydantic import BaseModel
-

 class ProjectConfig(BaseModel):

7-13: Enhance type safety and add field validations.

While using Pydantic is great for configuration management, consider these improvements:

  1. The parameters field using Dict[str, Any] is too permissive. Consider defining a more specific structure for model parameters to catch configuration errors early.
  2. Add field constraints to validate the configuration values (e.g., non-empty lists, valid schema names).
+from pydantic import BaseModel, Field, constr
+
 class ProjectConfig(BaseModel):
-    num_features: List[str]
-    cat_features: List[str]
-    target: str
-    catalog_name: str
-    schema_name: str
-    parameters: Dict[str, Any]  # Dictionary to hold model-related parameters
+    num_features: List[str] = Field(..., min_length=1)
+    cat_features: List[str] = Field(..., min_length=1)
+    target: str = Field(..., min_length=1)
+    catalog_name: constr(min_length=1)
+    schema_name: constr(min_length=1, pattern=r'^[a-z][a-z0-9_]*$')
+    parameters: Dict[str, float | int | str]  # More specific types for parameters

7-20: Consider MLOps best practices for configuration management.

Since this is part of an MLOps project with MLflow integration, consider these architectural improvements:

  1. Add configuration versioning to track changes and ensure reproducibility
  2. Support environment-specific configurations (dev/staging/prod)
  3. Add methods to serialize the configuration to MLflow for experiment tracking
  4. Consider adding validation for MLflow-specific parameters

This will help maintain consistency across different environments and experiments.

Would you like me to provide an example implementation of these improvements?
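For reference, a minimal sketch of points 1 and 3, assuming Pydantic v2 (which the PR pins) and reusing the field names from config.py; the config_version field and log_to_mlflow helper are hypothetical additions, not existing code:

from typing import Any, Dict, List

import mlflow
from pydantic import BaseModel


class ProjectConfig(BaseModel):
    config_version: str = "1.0.0"  # hypothetical: bump on any schema or feature change
    num_features: List[str]
    cat_features: List[str]
    target: str
    catalog_name: str
    schema_name: str
    parameters: Dict[str, Any]

    def log_to_mlflow(self) -> None:
        """Attach the full configuration to the active MLflow run for reproducibility."""
        mlflow.log_param("config_version", self.config_version)
        mlflow.log_dict(self.model_dump(), "project_config.json")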

pyproject.toml (1)

20-20: Consider adding an upper version bound for pydantic.

While the addition of Pydantic for configuration management is a good practice, consider adding an upper version bound to maintain consistency with other dependencies and prevent potential breaking changes:

-    "pydantic>=2.9.2",
+    "pydantic>=2.9.2, <3",
notes.md (1)

52-54: Add language specification to code block.

Add bash as the language specification to improve readability and enable syntax highlighting.

-```
+```bash
 uv run pre-commit run --all-files

loan_prediction/loan_classifier_model.py (2)

14-15: Consider enhancing configuration access patterns and type safety.

While the changes work correctly, there are a few improvements that could make the code more robust and maintainable:

1. Consider using a consistent access pattern (either all dot notation or all dict notation)
2. Add type hints and parameter validation

Here's a suggested improvement:

+from typing import Any
+from dataclasses import dataclass
+
+@dataclass
+class ModelParameters:
+    n_estimators: int
+    max_depth: int
+
 class LoanClassifierModel:
-    def __init__(self, preprocessor, config):
+    def __init__(self, preprocessor: Any, config: Any) -> None:
+        if not hasattr(config, 'parameters'):
+            raise ValueError("Config must have 'parameters' attribute")
+        if not all(k in config.parameters for k in ['n_estimators', 'max_depth']):
+            raise ValueError("Config parameters must contain 'n_estimators' and 'max_depth'")
         self.config = config
         self.model = Pipeline(
             steps=[
                 ("preprocessor", preprocessor),
                 (
                     "classifier",
                     RandomForestClassifier(
-                        n_estimators=config.parameters["n_estimators"],
-                        max_depth=config.parameters["max_depth"],
+                        n_estimators=config.parameters.n_estimators,
+                        max_depth=config.parameters.max_depth,
                         random_state=42,
                     ),
                 ),

Line range hint 1-38: Consider additional improvements to enhance code quality.

While not directly related to the changes, here are some suggestions to improve the overall code quality:

  1. Add class and method docstrings
  2. Add type hints for all methods
  3. Enhance the evaluate method to return more metrics
  4. Structure feature importance return values better

Here's a partial example of the suggested improvements:

from typing import Tuple, List, Any
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

class LoanClassifierModel:
    """Loan classification model using Random Forest classifier with preprocessing pipeline.
    
    Attributes:
        config: Configuration object containing model parameters
        model: Scikit-learn pipeline with preprocessor and classifier
    """
    
    def evaluate(self, X_test: np.ndarray, y_test: np.ndarray) -> dict:
        """Evaluate model performance using multiple metrics.
        
        Returns:
            dict: Dictionary containing accuracy, precision, recall, and f1 scores
        """
        y_pred = self.predict(X_test)
        return {
            'accuracy': accuracy_score(y_test, y_pred),
            'precision': precision_score(y_test, y_pred),
            'recall': recall_score(y_test, y_pred),
            'f1': f1_score(y_test, y_pred)
        }

    def get_feature_importance(self) -> Tuple[np.ndarray, List[str]]:
        """Get feature importance scores and corresponding feature names.
        
        Returns:
            Tuple containing:
                - np.ndarray: Feature importance scores
                - List[str]: Corresponding feature names
        """
notebooks/week2/02.mlflow_experiment.py (2)

29-36: Enhance run metadata and metric logging structure.

The current implementation could benefit from additional context and structure:

  1. Include more git metadata for better traceability
  2. Add structured metric logging with timestamps

Consider this enhanced approach:

+from datetime import datetime
+
 with mlflow.start_run(
     run_name="demo-run",
     tags={"git_sha": GIT_SHA,
-          "branch": "week2"},
+          "branch": "week2",
+          "commit_timestamp": get_commit_timestamp(),
+          "commit_author": get_commit_author(),
+          "commit_message": get_commit_message()},
     description="demo run",
 ) as run:
     mlflow.log_params({"type": "demo"})
-    mlflow.log_metrics({"metric1": 1.0, "metric2": 2.0})
+    timestamp_ms = int(datetime.now().timestamp() * 1000)  # MLflow expects epoch milliseconds
+    metrics = {
+        "metric1": {"value": 1.0, "timestamp": timestamp_ms, "step": 0},
+        "metric2": {"value": 2.0, "timestamp": timestamp_ms, "step": 0}
+    }
+    for name, meta in metrics.items():
+        mlflow.log_metric(name, meta["value"], step=meta["step"], timestamp=meta["timestamp"])

1-59: Add proper documentation and type hints.

The notebook would benefit from:

  1. Module-level docstring explaining the purpose
  2. Type hints for better code maintainability
  3. Removal of unnecessary command separator comments

Consider adding this documentation at the start:

"""
MLflow Experiment Management Script

This notebook handles MLflow experiment tracking in Databricks, including:
- Experiment creation and configuration
- Run tracking with git metadata
- Metric and parameter logging
- Run information retrieval and storage

Usage:
    This notebook should be run in a Databricks environment with MLflow configured.
"""

from typing import Dict, Any, Optional
notebooks/week2/01.prepare_dataset.py (3)

13-14: Remove unused imports.

The following imports are not used in the code:

  • datetime
  • pandas
-from datetime import datetime
-import pandas as pd


20-21: Remove redundant logging configuration.

There are two similar logging configurations, with one commented out. Remove the commented line to avoid confusion.

-# logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
 logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s')

27-27: Consider using an absolute path or environment variable for config.

Using relative paths (../../config/config.yml) in notebooks can be fragile as it depends on the notebook's location. Consider:

  1. Using an absolute path with the Databricks workspace path
  2. Setting the config path via environment variable
-config = ProjectConfig.from_yaml(config_path="../../config/config.yml")
+config = ProjectConfig.from_yaml(config_path="/Volumes/heiaepgah71pwedmld01001/lj_loan_prediction/config/config.yml")
notebooks/week1/week1_test_package.py (2)

Line range hint 29-30: Consider making the filepath configurable.

The filepath is hardcoded to a specific Databricks volume /Volumes/heiaepgah71pwedmld01001/lj_test/mlops_data/train.csv. Consider moving this to the configuration file for better maintainability and flexibility across different environments.

Suggested approach:

  1. Add the filepath to config.yml
  2. Load it through the ConfigLoader
  3. Pass it to DataLoader through configuration

Example config structure:

data:
  train_data_path: '/Volumes/heiaepgah71pwedmld01001/lj_test/mlops_data/train.csv'
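Reading it back might then look like the following sketch; the train_data_path field is a hypothetical addition to ProjectConfig, and how from_yaml maps the nested data: key is an assumption:

from loan_prediction.config import ProjectConfig  # module added in this PR

config = ProjectConfig.from_yaml(config_path="./config/config.yml")
train_path = config.train_data_path  # hypothetical field backing data.train_data_path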

Line range hint 8-8: Consider using MLflow for model tracking.

Given that this PR's objective is to integrate MLflow, consider adding MLflow tracking to log model metrics, parameters, and artifacts. This would align with the Week 2 goals mentioned in the PR description.

Example integration:

import mlflow

# Track parameters
mlflow.log_param("model_type", "loan_classifier")

# Track metrics
mlflow.log_metric("model_score", score)

# Log model
mlflow.sklearn.log_model(model, "loan_classifier_model")
notebooks/week2/README.md (5)

13-18: Add language specification to code block.

For better syntax highlighting and documentation standards, specify the language in the code block.

-```
+```python
 02.prepare_dataset.py


19-24: Add language specification to code block.

For better syntax highlighting and documentation standards, specify the language in the code block.

-```
+```python
 02.mlflow_experiment.py


26-35: Fix grammar and code block formatting.

1. Grammar: Replace "an sklearn" with "a sklearn" as "sklearn" is pronounced starting with a consonant sound.
2. Add language specification to the code block.

-```
+```python
 03.log_and_register_model.py


36-45: Fix section title and grammar.

1. The section title appears to be incorrect. Consider changing it to "Logging and Registering a Custom Model" to match the content.
2. Grammar: Replace "an sklearn" with "a sklearn" as "sklearn" is pronounced starting with a consonant sound.

-### 4. Logging and Registering a Model
+### 4. Logging and Registering a Custom Model


47-48: Improve code block formatting.

For consistency and better documentation standards, use proper markdown code block formatting with language specification.

-```05.log_and_register_fe_model.py```
+```python
+05.log_and_register_fe_model.py
+```
notebooks/week2/03.log_and_register_model.py (7)

35-36: Avoid redundant loading of the training set

The training set is being loaded twice from the same table, which can be inefficient.

Reuse train_set_spark to avoid redundant queries:

 train_set_spark = spark.table(f"{catalog_name}.{schema_name}.train_set")
-train_set = spark.table(f"{catalog_name}.{schema_name}.train_set").toPandas()
+train_set = train_set_spark.toPandas()

Apply the same optimization for the test set:

-test_set = spark.table(f"{catalog_name}.{schema_name}.test_set").toPandas()
+test_set_spark = spark.table(f"{catalog_name}.{schema_name}.test_set")
+test_set = test_set_spark.toPandas()

60-60: Parameterize the MLflow experiment name

Hardcoding the experiment name reduces flexibility. Consider parameterizing it to allow for easier configuration and to avoid hardcoding in multiple places.

Update the code to use the experiment name from your configuration:

-mlflow.set_experiment(experiment_name='/Shared/lj_loan_prediction')
+mlflow.set_experiment(experiment_name=config.experiment_name)

Ensure that experiment_name is added to your config.yml file and included in the ProjectConfig class.
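Concretely, that is one new line in config.yml and one new field on the model, sketched here (names assumed, not in the PR):

# config/config.yml would gain:
#   experiment_name: /Shared/lj_loan_prediction
from pydantic import BaseModel


class ProjectConfig(BaseModel):
    # ...existing fields (num_features, cat_features, target, ...)
    experiment_name: str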


61-61: Automate retrieval of Git commit SHA

Hardcoding the Git commit SHA may lead to inconsistencies. It's better to retrieve it automatically to ensure accuracy.

Use the subprocess module to dynamically retrieve the current commit SHA:

-git_sha = "ffa63b430205ff7"
+import subprocess
+git_sha = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).strip().decode()

Ensure that the script is executed in a Git repository context where this command can run successfully.


95-99: Log an input_example when logging the model

Currently, the model is logged without an input_example, which helps in validating the model's inputs during deployment and serving.

Provide an input_example when logging the model:

 mlflow.sklearn.log_model(
     sk_model=pipeline,
     artifact_path="lightgbm-pipeline-model",
     signature=signature,
+    input_example=X_train.iloc[:1]
 )

This logs a sample input that can be used for model validation during deployment.


48-48: Remove trailing whitespace

Line 48 contains trailing whitespace, which should be removed to adhere to code style guidelines.

Apply this diff to remove the trailing whitespace:

-    transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), cat_features)], 
+    transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), cat_features)],


94-94: Remove whitespace on blank line

Line 94 is a blank line containing whitespace. Please remove the whitespace to maintain code cleanliness.

Apply this diff to clean up the blank line:

-    
+


110-110: Remove trailing whitespace

Line 110 contains trailing whitespace, which should be removed to adhere to code style guidelines.

Apply this diff to remove the trailing whitespace:

-# from the endpoint suggested test 
+# from the endpoint suggested test

notebooks/week2/04.log_and_register_custom_model.py (3)

74-79: Avoid using toPandas() on large datasets to prevent memory issues.

Converting large Spark DataFrames to Pandas using toPandas() can lead to out-of-memory errors. If the datasets are large, consider processing data using Spark operations or sampling the data before converting.
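One possible guard, sketched with illustrative thresholds; spark, catalog_name, and schema_name come from the surrounding notebook:

MAX_ROWS = 1_000_000  # illustrative driver-side limit

train_spark = spark.table(f"{catalog_name}.{schema_name}.train_set")
row_count = train_spark.count()
if row_count > MAX_ROWS:
    # sample down rather than collecting the full table onto the driver
    train_spark = train_spark.sample(fraction=MAX_ROWS / row_count, seed=42)
train_set = train_spark.toPandas()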


94-94: Retrieve git_sha dynamically instead of hardcoding.

Hardcoding the Git SHA reduces flexibility and may lead to inconsistencies with the actual code version. Consider obtaining the current commit SHA programmatically to ensure accuracy.

You can retrieve the Git SHA using the subprocess module:

import subprocess

git_sha = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()

Also applies to: 97-97, 129-129


57-57: Remove trailing whitespaces to adhere to code style guidelines.

Static analysis tools have detected trailing whitespaces on blank lines, which can affect code readability and may lead to unnecessary diffs in version control.

Affected lines:

  • Line 57
  • Line 60
  • Line 98
  • Line 139
  • Line 140

Also applies to: 60-60, 98-98, 139-139, 140-140


loan_prediction/data_processor.py (4)

27-29: Assign self.df in load_data() for clarity

Currently, self.df is assigned in the preprocess() method. Assigning self.df directly in the load_data() method enhances clarity by ensuring that data loading responsibilities are contained within the load_data() method.

Apply this diff to assign self.df in load_data():

 def load_data(self):
     self.dataloader.load()
+    self.df = self.dataloader.data
     # data sitting in self.dataloader.data

73-87: Remove commented-out code to improve readability

The code contains several blocks of commented-out code that are not in use. Removing this unused code will clean up the codebase and improve readability.

Apply this diff to remove the commented-out code:

-    # example for filing missing values
-    # median_age = self.df['person_age'].median()
-    # self.df['person_age'].fillna(median_age, inplace=True)
-
-    # current_year = datetime.now().year
-    # self.df['person_age'] = current_year - self.df['person_birth']
-    # self.df.drop(columns=['person_birth'], inplace=True)
-
-    # Fill missing values with mean or default values
-    # self.df.fillna({
-    #     'LotFrontage': self.df['LotFrontage'].mean(),
-    #     'MasVnrType': 'None',
-    #     'MasVnrArea': 0,
-    # }, inplace=True)

103-106: Make random_state configurable via config

Hardcoding random_state can hinder reproducibility in different environments. Consider retrieving random_state from self.config to allow flexibility.

Apply this diff to use random_state from configuration:

-def split_train_test(self, test_size=0.2, random_state=42):
+def split_train_test(self, test_size=0.2):
     """Split the DataFrame (self.df) into training and test sets."""
-    train_set, test_set = train_test_split(self.df, test_size=test_size, random_state=random_state)
+    train_set, test_set = train_test_split(
+        self.df, test_size=test_size, random_state=self.config.random_state
+    )
     return train_set, test_set

Ensure that random_state is defined in your configuration.


111-111: Utilize the logger consistently throughout the class

An instance of logger is created, and it logs an info message when saving to the catalog. Consider adding more logging statements at key steps in the data processing pipeline to aid in debugging and monitoring.

For example, add logging in methods like load_data(), preprocess(), and split_train_test().
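A sketch of what that could look like (methods abbreviated; logger is the module-level instance already created in data_processor.py):

def load_data(self):
    logger.info("Loading data with %s", type(self.dataloader).__name__)
    self.dataloader.load()
    logger.info("Loaded %d rows", len(self.dataloader.data))

def split_train_test(self, test_size=0.2, random_state=42):
    """Split the DataFrame (self.df) into training and test sets."""
    train_set, test_set = train_test_split(self.df, test_size=test_size, random_state=random_state)
    logger.info("Split complete: train=%d rows, test=%d rows", len(train_set), len(test_set))
    return train_set, test_set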

notebooks/week2/05.log_and_register_fe_model.py (11)

6-6: Remove trailing whitespace

There's trailing whitespace at the end of the comment on line 6. Removing it will improve code cleanliness.

Apply this diff to remove the trailing whitespace:

-# dbutils.library.restartPython() 
+# dbutils.library.restartPython()


10-10: Remove unused imports

The following imports are not used in the code and can be removed to clean up the code:

  • Line 10: import yaml
  • Line 15: from pyspark.sql import functions as F
  • Line 18: from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
  • Line 22: from datetime import datetime

Apply this diff to remove the unused imports:

- import yaml
- from pyspark.sql import functions as F
- from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
- from datetime import datetime

Also applies to: 15-15, 18-18, 22-22



94-104: Remove unused import inside SQL function definition

Within the SQL function definition, the import from datetime import datetime on line 100 is unnecessary since datetime is not used. Consider removing it to keep the function concise.

Apply this diff:

 CREATE OR REPLACE FUNCTION {function_name}(person_age INT)
 RETURNS INT
 LANGUAGE PYTHON AS
 $$
-from datetime import datetime
 return 12 * person_age
 $$

117-118: Remove trailing whitespace

Lines 117 and 118 contain trailing whitespace. Removing it will improve code cleanliness.

Apply this diff to remove the trailing whitespace:

             .drop(
-                # "cb_person_cred_hist_length", 
-                #   "loan_amnt" , 
+                # "cb_person_cred_hist_length",
+                #   "loan_amnt",


117-119: Clean up commented-out code

Lines 117-119 contain commented-out columns in the .drop() method. If these columns are no longer needed, consider removing the comments to clean up the code.

Apply this diff to remove the commented-out lines:

             .drop(
-                # "cb_person_cred_hist_length",
-                #   "loan_amnt",
                 "person_income")


151-151: Remove extraneous expression

The standalone expression parameters on line 151 doesn't affect the program's execution. Consider removing it unless it's needed for debugging.

Apply this diff:

-parameters


169-171: Rename pipeline step from 'regressor' to 'classifier'

The pipeline step is named 'regressor' but uses a RandomForestClassifier. For clarity and consistency, consider renaming the step to 'classifier'.

Apply this diff:

 pipeline = Pipeline(
-    steps=[("preprocessor", preprocessor), ("regressor", RandomForestClassifier(
+    steps=[("preprocessor", preprocessor), ("classifier", RandomForestClassifier(
         n_estimators=parameters['n_estimators'], max_depth=parameters['max_depth'], random_state=42))]
 )

176-176: Avoid hardcoding Git SHA

The variable git_sha is hardcoded with the value "ffa63b430205ff7". For maintainability, consider obtaining the Git SHA dynamically to ensure it reflects the current commit.

You can obtain the Git SHA dynamically using the subprocess module:

import subprocess
git_sha = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()

185-190: Remove or uncomment unused metric calculations

Lines 185-190 contain metric calculations that are currently commented out. If these metrics are no longer needed, consider removing them. If they are needed, consider uncommenting them to use them in your analysis.


107-107: Avoid reloading 'train_set' unnecessarily

The variable train_set is reloaded from the table on line 107, but it was already loaded earlier on line 61 and hasn't been modified since. Consider removing the redundant loading to streamline the code.

Apply this diff:

-train_set = spark.table(f"{catalog_name}.{schema_name}.train_set")

212-212: Remove whitespace on blank line

Line 212 is a blank line that contains whitespace. Removing the whitespace will adhere to code style guidelines.

Apply this diff:

-

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between cc85b19 and 364529f.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (16)
  • .gitignore (1 hunks)
  • config/config.yml (1 hunks)
  • loan_prediction/config.py (1 hunks)
  • loan_prediction/data_processor.py (3 hunks)
  • loan_prediction/loan_classifier_model.py (1 hunks)
  • loan_prediction/utils.py (1 hunks)
  • main.py (1 hunks)
  • notebooks/week1/week1_test_package.py (1 hunks)
  • notebooks/week2/01.prepare_dataset.py (1 hunks)
  • notebooks/week2/02.mlflow_experiment.py (1 hunks)
  • notebooks/week2/03.log_and_register_model.py (1 hunks)
  • notebooks/week2/04.log_and_register_custom_model.py (1 hunks)
  • notebooks/week2/05.log_and_register_fe_model.py (1 hunks)
  • notebooks/week2/README.md (1 hunks)
  • notes.md (1 hunks)
  • pyproject.toml (1 hunks)
🧰 Additional context used
🪛 Ruff
notebooks/week2/01.prepare_dataset.py

6-6: Undefined name dbutils

(F821)


10-10: Module level import not at top of file

(E402)


11-11: Module level import not at top of file

(E402)


12-12: Module level import not at top of file

(E402)


13-13: Module level import not at top of file

(E402)


13-13: datetime.datetime imported but unused

Remove unused import: datetime.datetime

(F401)


14-14: Module level import not at top of file

(E402)


14-14: pandas imported but unused

Remove unused import: pandas

(F401)


15-15: Module level import not at top of file

(E402)


19-19: Module level import not at top of file

(E402)

notebooks/week2/03.log_and_register_model.py

3-3: SyntaxError: Expected a statement


3-3: SyntaxError: Simple statements must be separated by newlines or semicolons


3-3: SyntaxError: Simple statements must be separated by newlines or semicolons


48-48: Trailing whitespace

(W291)


94-94: Blank line contains whitespace

(W293)


110-110: Trailing whitespace

(W291)

notebooks/week2/04.log_and_register_custom_model.py

3-3: SyntaxError: Expected a statement


3-3: SyntaxError: Simple statements must be separated by newlines or semicolons


3-3: SyntaxError: Simple statements must be separated by newlines or semicolons


57-57: Blank line contains whitespace

(W293)


60-60: Blank line contains whitespace

(W293)


98-98: Blank line contains whitespace

(W293)


139-139: Trailing whitespace

(W291)


140-140: Blank line contains whitespace

(W293)

notebooks/week2/05.log_and_register_fe_model.py

6-6: Trailing whitespace

Remove trailing whitespace

(W291)


10-10: yaml imported but unused

Remove unused import: yaml

(F401)


15-15: pyspark.sql.functions imported but unused

Remove unused import: pyspark.sql.functions

(F401)


18-18: sklearn.metrics.mean_squared_error imported but unused

Remove unused import

(F401)


18-18: sklearn.metrics.mean_absolute_error imported but unused

Remove unused import

(F401)


18-18: sklearn.metrics.r2_score imported but unused

Remove unused import

(F401)


22-22: datetime.datetime imported but unused

Remove unused import: datetime.datetime

(F401)


117-117: Trailing whitespace

Remove trailing whitespace

(W291)


118-118: Trailing whitespace

Remove trailing whitespace

(W291)


151-151: Found useless expression. Either assign it to a variable or remove it.

(B018)


206-206: Undefined name training_set

(F821)


212-212: Blank line contains whitespace

Remove whitespace from blank line

(W293)

🪛 LanguageTool
notebooks/week2/README.md

[misspelling] ~31-~31: Use “a” instead of ‘an’ if the following word doesn’t start with a vowel sound, e.g. ‘a sentence’, ‘a university’.
Context: ... train and test datasets, then we train an sklearn pipeline within an MLflow run. ...

(EN_A_VS_AN)


[misspelling] ~42-~42: Use “a” instead of ‘an’ if the following word doesn’t start with a vowel sound, e.g. ‘a sentence’, ‘a university’.
Context: ...his purpose. In our example, we trained an sklearn model (which isn't custom), but...

(EN_A_VS_AN)

🪛 Markdownlint
notebooks/week2/README.md

14-14: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


20-20: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


27-27: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


37-37: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

notes.md

52-52: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (18)
config/config.yml (2)

Line range hint 9-19: LGTM! Feature selection looks appropriate.

The separation between numerical and categorical features is clear and well-organized. The removal of the 'id' feature from numerical features is a good practice as it shouldn't be used for modeling.


Line range hint 21-21: LGTM! Target variable is clearly defined.

The target variable 'loan_status' is appropriately specified for the loan prediction use case.

loan_prediction/utils.py (2)

25-26: Formatting changes look good.


27-28: 🛠️ Refactor suggestion

⚠️ Potential issue

Critical: Arbitrary prediction scaling requires justification and safeguards.

The adjust_predictions function arbitrarily scales predictions by a default factor of 1.3, which raises several concerns:

  1. In a loan prediction context, arbitrarily scaling predictions could significantly impact business decisions and risk assessments.
  2. The function lacks documentation explaining the purpose and justification for the scaling.
  3. Missing input validation for the scale factor could lead to invalid predictions if negative or zero values are provided.

Consider implementing these improvements:

-def adjust_predictions(predictions, scale_factor=1.3):
-    return predictions * scale_factor
+from typing import Union, Sequence
+import numpy as np
+
+def adjust_predictions(
+    predictions: Union[Sequence[float], np.ndarray],
+    scale_factor: float = 1.3
+) -> np.ndarray:
+    """Adjust model predictions by a scale factor.
+    
+    Args:
+        predictions: Array-like of model predictions to adjust
+        scale_factor: Positive multiplier to scale predictions (default: 1.3)
+        
+    Returns:
+        Scaled predictions as numpy array
+        
+    Raises:
+        ValueError: If scale_factor is not positive
+    """
+    if scale_factor <= 0:
+        raise ValueError("scale_factor must be positive")
+    
+    return np.asarray(predictions) * scale_factor

Additionally:

  1. Document the business justification for the 1.3 scale factor
  2. Consider making this configurable via MLflow parameters
  3. Add unit tests to verify scaling behavior
  4. Log the scale factor as part of model metadata for reproducibility
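A pytest sketch for point 3, with the import path taken from this PR's loan_prediction/utils.py; the last test only applies if the validation suggested above is adopted:

import numpy as np
import pytest

from loan_prediction.utils import adjust_predictions


def test_default_scale_factor():
    np.testing.assert_allclose(adjust_predictions(np.array([1.0, 2.0])), [1.3, 2.6])


def test_custom_scale_factor():
    np.testing.assert_allclose(adjust_predictions(np.array([10.0]), scale_factor=0.5), [5.0])


def test_non_positive_scale_factor_rejected():
    # assumes the ValueError guard proposed in the diff above has been added
    with pytest.raises(ValueError):
        adjust_predictions(np.array([1.0]), scale_factor=0)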

Let's check if this scaling is used in model training or evaluation:

notes.md (2)

42-43: LGTM! Improved file organization and copy command.

The updated path better reflects the project structure, and the addition of --overwrite flag prevents copy failures.


56-56: Please provide details about ci.yml modifications.

The note mentions modifying .github/workflows/ci.yml but doesn't specify what changes are needed. Consider adding:

  1. The specific modifications required
  2. The purpose of these modifications
  3. How they relate to the pre-commit setup

Let me help verify the current ci.yml configuration:

.gitignore (1)

100-100: LGTM! Good practice to ignore wheel files.

Adding *.whl to .gitignore is a good practice as it prevents Python wheel files (binary package distributions) from being committed to the repository. This is especially relevant for a project using MLflow and additional dependencies like pydantic.

notebooks/week2/02.mlflow_experiment.py (1)

1-59: Verify security considerations for data handling.

Please ensure:

  1. Proper access controls are in place for experiment and run data
  2. Sensitive information is not inadvertently logged
  3. File permissions are appropriately set when writing to DBFS
main.py (4)

3-8: LGTM! Clean imports and proper logging setup.


66-79: Clarify the status of commented evaluation code.

Several critical ML workflow steps are commented out:

  • Model evaluation
  • Results visualization
  • Feature importance analysis

These are essential for model quality assessment. Please clarify if this is temporary and add TODO comments with tracking issues if these will be implemented later.

#!/bin/bash
# Check if these features are implemented elsewhere
rg -l "def (evaluate|visualize_results|plot_feature_importance)"

41-52: Clean up commented code and verify data splitting implementation.

The code contains a mix of old and new data splitting implementations, and Databricks-specific code is commented out. This needs cleanup for better maintainability.

#!/bin/bash
# Check for test coverage of split_train_test method
rg -A 5 "def test.*split_train_test" "tests/"

Also, consider adding validation for the split ratios and resulting dataset sizes:

+# Validate split results
+if len(train_set) == 0 or len(test_set) == 0:
+    raise ValueError("Data split resulted in empty train or test set")
+logger.info(f"Split sizes - Train: {len(train_set)}, Test: {len(test_set)}")

13-16: Consider environment-aware configuration path handling.

The hardcoded config path ./config/config.yml might cause issues in different environments. Consider making it configurable via environment variables or command-line arguments.

-config = ProjectConfig.from_yaml(config_path="./config/config.yml")
+config_path = os.getenv("CONFIG_PATH", "./config/config.yml")
+config = ProjectConfig.from_yaml(config_path=config_path)

Also, be cautious about logging the entire config object as it might expose sensitive information in logs.
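One way to do that, sketched with an illustrative key list (config here is the Pydantic ProjectConfig, so model_dump() is available):

SENSITIVE_KEYS = {"catalog_name", "schema_name"}  # illustrative; adjust to the project's fields

redacted = {k: "***" if k in SENSITIVE_KEYS else v for k, v in config.model_dump().items()}
logger.info("Loaded configuration: %s", redacted)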

notebooks/week1/week1_test_package.py (1)

54-54: Review data loading implementation for consistency.

The load_data() call no longer takes a filepath parameter, but the notebook still creates a DataLoader with a specific filepath that isn't being used. This suggests an inconsistent pattern in data loading.

Let's verify the data loading implementation:

notebooks/week2/README.md (2)

1-10: LGTM! Clear and well-structured introduction.

The introduction effectively sets the context for Week 2 materials and MLflow concepts.


48-65: LGTM! Excellent explanation of feature lookup concepts.

The explanation effectively uses real-world examples (ice cream demand) to illustrate feature lookup concepts, and clearly distinguishes between static and dynamic features. The transition to the house price dataset example is well-explained.

loan_prediction/data_processor.py (3)

24-24: Ensure preprocessor is updated if configuration changes

The preprocessor is built once during initialization. If self.config is modified after initialization, the preprocessor may become outdated. Consider rebuilding the preprocessor when configuration changes or enforce immutability of the configuration after initialization.

If dynamic configuration is expected, you might update _build_preprocessor() to be called whenever the configuration changes.


14-14: Set up logging configuration if not already done

While you're obtaining a logger instance, ensure that the logging system is configured to handle and direct logs appropriately. If not configured elsewhere, you might add basic configuration.

If logging is not configured, I can help set it up. Would you like assistance in configuring the logging system?
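If it is not configured elsewhere, a single call early in main.py would be enough; the format string mirrors the one used in 01.prepare_dataset.py:

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)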


121-127: ⚠️ Potential issue

Reconsider using 'append' mode when writing to tables

Using 'append' mode in saveAsTable() can lead to duplicate data entries if the method is called multiple times. Unless intentional, consider using 'overwrite' mode or implementing logic to handle duplicates.

Please verify whether appending data is intended. If you aim to replace the table content each time, modify the write mode as follows:

 train_set_with_timestamp.write.mode("append").saveAsTable(
     f"{self.config.catalog_name}.{self.config.schema_name}.train_set"
 )

Change to:

 train_set_with_timestamp.write.mode("overwrite").saveAsTable(
     f"{self.config.catalog_name}.{self.config.schema_name}.train_set"
 )

Comment on lines +49 to +50
with open("run_info.json", "w") as json_file:
json.dump(run_info, json_file, indent=4)

🛠️ Refactor suggestion

Use appropriate filesystem operations and add error handling.

Similar to the earlier file operations, this should use the Databricks filesystem.

Consider this approach:

-with open("run_info.json", "w") as json_file:
-    json.dump(run_info, json_file, indent=4)
+try:
+    run_info_json = json.dumps(run_info, indent=4)
+    dbutils.fs.put("dbfs:/run_info.json", run_info_json, overwrite=True)
+except Exception as e:
+    print(f"Error saving run info: {str(e)}")
+    raise

Comment on lines +104 to +109
conda_env = _mlflow_conda_env(
additional_conda_deps=None,
additional_pip_deps=["/Volumes/heiaepgah71pwedmld01001/lj_loan_prediction/packages/mlops_with_databricks-0.0.1-py3-none-any.whl",
],
additional_conda_channels=None,
)

⚠️ Potential issue

Avoid hardcoding local paths in additional_pip_deps.

Specifying local file paths in additional_pip_deps reduces portability and may fail in different environments. It's better to install dependencies from package repositories or ensure they're included in the environment beforehand.

Apply this diff to use a package repository:

 conda_env = _mlflow_conda_env(
     additional_conda_deps=None,
-    additional_pip_deps=["/Volumes/heiaepgah71pwedmld01001/lj_loan_prediction/packages/mlops_with_databricks-0.0.1-py3-none-any.whl",
-                         ],
+    additional_pip_deps=["mlops_with_databricks==0.0.1"],
     additional_conda_channels=None,
 )

Committable suggestion skipped: line range outside the PR's diff.

self.df[cat_col] = self.df[cat_col].astype("category")

# Extract target and relevant features
# TODO make sure there is an 'id' column in the data

⚠️ Potential issue

Address the TODO: Ensure the 'id' column exists in the data

A TODO comment indicates that the 'id' column may be missing. Since 'id' is included in relevant_columns, it's crucial to verify its presence to prevent potential KeyError exceptions.

Would you like assistance in adding a check to ensure the 'id' column exists, or updating the data loading process to include it? I can help implement a solution or open a GitHub issue to track this task.
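One possible check, sketched for the top of preprocess(); the surrogate-key fallback is an assumption about the desired behavior:

if "id" not in self.df.columns:
    logger.warning("'id' column missing from input data; generating a surrogate key")
    self.df["id"] = range(len(self.df))  # swap in a real identifier if one exists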

model=pipeline,
flavor=mlflow.sklearn,
artifact_path="randomforest-pipeline-model-fe",
training_set=training_set,

⚠️ Potential issue

Fix undefined variable 'training_set'

The variable training_set used on line 206 is not defined. Did you mean to use training_set_fe, which was created earlier with fe.create_training_set()?

Apply this diff to fix the issue:

 fe.log_model(
     model=pipeline,
     flavor=mlflow.sklearn,
     artifact_path="randomforest-pipeline-model-fe",
-    training_set=training_set,
+    training_set=training_set_fe,
     signature=signature,
 )


@basakeskili basakeskili left a comment


Looks nice, thanks!

@lawrencejoffe lawrencejoffe merged commit 24fec2d into main Nov 5, 2024
2 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Dec 16, 2024