
Week1+2 #1

Merged

merged 33 commits into main from week1+2 on Nov 25, 2024

Conversation

@VanessaSchl VanessaSchl (Collaborator) commented Nov 21, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a new configuration file for managing project settings and hyperparameters.
    • Added a comprehensive data preprocessing workflow for hotel reservation data.
    • Implemented a machine learning model for predicting hotel cancellations.
    • Added visualization functionality for model predictions with a confusion matrix.
    • Created a new Databricks asset bundle configuration.
  • Bug Fixes

    • Reformatted several configuration and workflow files for consistency and clarity.
  • Chores

    • Updated project metadata, including project name and version.
    • Reorganized entries in the .gitignore file to better reflect current development needs.

@VanessaSchl VanessaSchl requested a review from a team as a code owner November 21, 2024 07:49

coderabbitai bot commented Nov 21, 2024

Walkthrough

The pull request introduces several changes, including the addition of new files for data processing and model training specific to hotel reservations, updates to configuration files, and modifications to existing files for formatting and structure. Key additions include a databricks.yml for Databricks integration, new Python scripts for preprocessing, preparation, logging, registration, and feature engineering, as well as updates to the pyproject.toml and project_config.yml files to reflect new project details and dependencies.

Changes

  • .github/CODEOWNERS: Modified to include a trailing space after @VanessaSchl.
  • .github/workflows/ci.yml: Entire content removed and re-added; no changes to the actual workflow configuration.
  • .gitignore: Complete replacement of contents with new entries, including Databricks-related patterns.
  • .pre-commit-config.yaml: Cosmetic changes; whitespace normalization with no functional alterations.
  • README.md: Formatting change; removed leading hyphens from bullet points in specified sections.
  • databricks.yml: New file defining a Databricks asset bundle with development configuration.
  • notebooks/week_1/01_preprocessing.py: New file for the data preprocessing workflow, including model training and evaluation.
  • notebooks/week_2/01_preparation.py: New file for preparing hotel reservation data using Spark and saving datasets.
  • notebooks/week_2/02_logging_and_registering.py: New file for training a model with MLflow logging and registration.
  • notebooks/week_2/03_feature_engineering.py: New file for the feature engineering workflow, including SQL function definitions and model logging.
  • project_config.yml: New configuration file with project parameters, A/B test details, and feature lists.
  • pyproject.toml: Major updates, including project name, version, description, and dependencies.
  • pytest.ini: Formatting changes for consistency; no alterations to configuration settings.
  • src/hotel_reservations/config.py: New ProjectConfig class for managing configuration parameters.
  • src/hotel_reservations/data_processor.py: New DataProcessor class for data preprocessing and saving to the catalog.
  • src/hotel_reservations/reservations_model.py: New ReservationsModel class for predicting hotel cancellations.
  • src/hotel_reservations/utils.py: New utility function visualize_results for generating confusion matrix visualizations.

Poem

In the land of code where rabbits play,
New files and changes brighten the day.
With data to process and models to train,
We'll hop through the tasks, never in vain!
So gather 'round, friends, let's code and cheer,
For the magic of updates is finally here! 🐰✨



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 21

🧹 Outside diff range and nitpick comments (30)
src/hotel_reservations/utils.py (3)

1-1: Enhance module docstring with more details

Consider expanding the docstring to include:

  • Brief description of available functions
  • Example usage
  • Any dependencies or requirements
-"""Utility functions for the hotel reservations project."""
+"""Utility functions for the hotel reservations project.
+
+This module provides visualization utilities for model evaluation:
+    - visualize_results: Creates a normalized confusion matrix visualization
+
+Example:
+    >>> y_test = [0, 1, 0, 1]
+    >>> y_pred = [0, 1, 1, 1]
+    >>> visualize_results(y_test, y_pred)
+"""

12-25: Add option to save the plot

In non-interactive environments (e.g., production), plt.show() might not work as expected. Consider adding an option to save the plot to a file.

-def visualize_results(y_test, y_pred):
+from typing import Optional
+
+def visualize_results(y_test: np.ndarray, y_pred: np.ndarray, save_path: Optional[str] = None) -> None:
     # ... existing code ...
     plt.ylabel("Actual", fontsize=12)
     plt.xlabel("Prediction", fontsize=12)
+    if save_path:
+        plt.savefig(save_path, bbox_inches='tight', dpi=300)
     plt.show()
+    plt.close()  # Clean up resources

13-21: Consider adding customization options

The visualization parameters (figure size, colors, font sizes) could be made configurable for better reusability.

-def visualize_results(y_test, y_pred):
-    plt.figure(figsize=(10, 6))
+def visualize_results(
+    y_test: np.ndarray,
+    y_pred: np.ndarray,
+    figsize: tuple[int, int] = (10, 6),
+    cmap: str = "viridis",
+    font_size: int = 12,
+    save_path: Optional[str] = None,
+) -> None:
+    plt.figure(figsize=figsize)
     sns.heatmap(
         conf_matrix,
         annot=True,
         fmt=".2f",
         vmin=0.0,
         vmax=1.0,
+        cmap=cmap,
         xticklabels=["Not Cancelled", "Cancelled"],
         yticklabels=["Not Cancelled", "Cancelled"],
     )
src/hotel_reservations/config.py (3)

1-1: Enhance module docstring with more details

Consider expanding the module docstring to include:

  • Purpose and usage of the configuration module
  • Example usage
  • Description of the configuration structure
-"""Configuration module for the project."""
+"""Configuration module for the hotel reservations project.
+
+This module provides a Pydantic-based configuration management system that handles:
+- Feature lists (numerical and categorical)
+- Target variable specifications
+- Database configuration (catalog and schema)
+- Model parameters
+- A/B test parameters
+
+Example:
+    config = ProjectConfig.from_yaml("config.yml")
+    num_features = config.num_features
+"""

9-11: Enhance class docstring with validation details

The class docstring should describe the validation behavior provided by Pydantic and document the expected configuration structure.

-    """Class to store the configuration for the project."""
+    """Configuration class with Pydantic validation for the hotel reservations project.
+
+    This class provides runtime validation for all configuration parameters and
+    ensures type safety through Pydantic's validation system.
+
+    Attributes:
+        num_features (List[str]): List of numerical feature names
+        cat_features (List[str]): List of categorical feature names
+        original_target (str): Name of the original target variable
+        target (str): Name of the processed target variable
+        catalog_name (str): Name of the data catalog
+        schema_name (str): Name of the database schema
+        parameters (Dict[str, Any]): Model-related parameters
+        ab_test (Dict[str, Any]): A/B test configuration parameters
+    """

1-26: Consider architectural improvements for ML pipeline integration

The configuration structure suggests this is part of a machine learning pipeline. Consider these architectural improvements:

  1. Version the configuration schema to support backward compatibility
  2. Add validation for feature name conflicts between num_features and cat_features
  3. Consider adding configuration for:
    • Feature preprocessing steps
    • Model evaluation metrics
    • Model versioning strategy
    • Monitoring parameters

Would you like assistance in implementing any of these architectural improvements?
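As a minimal sketch of points 1 and 2, assuming Pydantic v2 and the field names described in this PR (illustrative only, not the project's actual config.py):

```python
from typing import Any, Dict, List

from pydantic import BaseModel, model_validator


class ProjectConfig(BaseModel):
    config_version: str = "1.0.0"  # schema version for backward compatibility
    num_features: List[str]
    cat_features: List[str]
    parameters: Dict[str, Any]

    @model_validator(mode="after")
    def check_feature_conflicts(self) -> "ProjectConfig":
        # A feature listed as both numerical and categorical is almost
        # certainly a configuration mistake, so fail fast at load time.
        overlap = set(self.num_features) & set(self.cat_features)
        if overlap:
            raise ValueError(
                f"Features listed as both numerical and categorical: {sorted(overlap)}"
            )
        return self
```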

notebooks/week_2/01_preparation.py (2)

2-2: Consider using a more robust package installation approach

The relative path to the wheel file (../hotel_reservations-2.2.0-py3-none-any.whl) could break if notebook locations change. Consider:

  1. Using an absolute path from a known location
  2. Installing from a package repository instead
  3. Using environment variables or project config for the version number

19-19: Fix incorrect dataset comment

The comment refers to "house prices" but this is a hotel reservations dataset.

-# Load the house prices dataset
+# Load the hotel reservations dataset
project_config.yml (5)

1-2: Consider environment-specific configuration files

The current configuration appears to be development-specific (dev). Consider creating separate configuration files for different environments (dev, staging, prod) to prevent accidental usage of development settings in production.



17-17: Remove trailing whitespace

Remove the trailing space after no_of_adults.



16-45: Improve feature organization and preprocessing

Several suggestions for the feature configuration:

  1. Separate one-hot encoded features from original numerical features
  2. Consider standardizing date handling:
    • arrival_year, arrival_month, arrival_date could be combined
    • Add derived features like day_of_week, is_weekend
  3. Consider adding feature groups for better organization
 num_features:
+  # Guest information
   - no_of_adults 
   - no_of_children
+  # Stay duration
   - no_of_weekend_nights
   - no_of_week_nights
+  # Booking history
   - repeated_guest
   - no_of_previous_cancellations
   - no_of_previous_bookings_not_canceled
+  # Additional services
   - required_car_parking_space
   - no_of_special_requests
+  # Financial
   - avg_price_per_room
+  # Temporal features
   - arrival_year
   - arrival_month
   - arrival_date
+  # One-hot encoded features
   - type_of_meal_plan_meal_plan_1
   # ... rest of one-hot features
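For the derived features suggested in point 2, a small pandas sketch; the assumption that arrival_year, arrival_month, and arrival_date are plain integer columns is not verified here:

```python
import pandas as pd

# Toy frame standing in for the real dataset; column names follow the config above.
df = pd.DataFrame({"arrival_year": [2018, 2018], "arrival_month": [10, 10], "arrival_date": [2, 6]})

arrival = pd.to_datetime(
    {"year": df["arrival_year"], "month": df["arrival_month"], "day": df["arrival_date"]},
    errors="coerce",  # impossible combinations (e.g., Feb 30) become NaT instead of raising
)
df["day_of_week"] = arrival.dt.dayofweek  # Monday = 0 ... Sunday = 6
df["is_weekend"] = arrival.dt.dayofweek.isin([5, 6]).astype(int)
```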


51-53: Document target variable transformation

Please add comments explaining:

  • The relationship between booking_status and booking_status_canceled
  • The transformation logic
  • Expected values for both variables
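For illustration, one plausible reading of that relationship; the literal "Canceled" label is an assumption about the raw data, not something confirmed in this PR:

```python
import pandas as pd

bookings = pd.DataFrame({"booking_status": ["Canceled", "Not_Canceled", "Canceled"]})
# booking_status_canceled as a binary indicator derived from booking_status
bookings["booking_status_canceled"] = (bookings["booking_status"] == "Canceled").astype(int)
```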

1-53: Consider adding version control for configuration

As this configuration file is central to the project, consider:

  1. Adding a version field to track configuration changes
  2. Implementing configuration validation
  3. Adding a changelog section or comments for major changes

Example:

+version: '1.0.0'  # Semantic versioning
 catalog_name: dev

src/hotel_reservations/reservations_model.py (5)

1-1: Enhance the module docstring with more details.

Consider expanding the docstring to include:

  • Purpose and use cases
  • Key features and capabilities
  • Example usage
-"""Model for predicting hotel cancellations."""
+"""Model for predicting hotel cancellations.
+
+This module provides a scikit-learn compatible classifier for predicting hotel reservation
+cancellations using XGBoost. It includes features for:
+- Data preprocessing with StandardScaler
+- Model training and prediction
+- Performance evaluation using accuracy and precision metrics
+
+Example:
+    >>> from hotel_reservations.config import ProjectConfig
+    >>> model = ReservationsModel(config=ProjectConfig())
+    >>> model.fit(X_train, y_train)
+    >>> predictions = model.predict(X_test)
+"""

12-14: Enhance the class docstring with implementation details.

The sklearn integration is well-implemented. Consider expanding the docstring to document the inheritance and parameters.

-    """Class to train and evaluate a model for predicting cancellation on hotel reservations."""
+    """A sklearn-compatible classifier for predicting hotel reservation cancellations.
+
+    This class implements a wrapper around XGBoost classifier with standardized feature scaling.
+    It inherits from sklearn's BaseEstimator and ClassifierMixin to ensure compatibility with
+    sklearn pipelines and cross-validation utilities.
+
+    Parameters
+    ----------
+    config : ProjectConfig
+        Configuration object containing model parameters including:
+        - eta: XGBoost learning rate
+        - n_estimators: Number of boosting rounds
+        - max_depth: Maximum tree depth
+    """

15-23: Add parameter validation in the constructor.

While the implementation is correct, consider adding validation for the config parameters to fail fast if invalid values are provided.

     def __init__(self, config: ProjectConfig):
+        if not isinstance(config, ProjectConfig):
+            raise TypeError("config must be an instance of ProjectConfig")
+        
+        required_params = ["eta", "n_estimators", "max_depth"]
+        missing_params = [p for p in required_params if p not in config.parameters]
+        if missing_params:
+            raise ValueError(f"Missing required parameters: {missing_params}")
+        
         self.config = config
         self.scaler = StandardScaler()
         self.model = XGBClassifier(

35-39: Enhance evaluation metrics and add validation.

Consider adding more evaluation metrics and proper input validation.

     def evaluate(self, y: pd.DataFrame, y_pred: pd.DataFrame):
         """Evaluate the model on the test data."""
+        if not isinstance(y, pd.DataFrame) or not isinstance(y_pred, pd.DataFrame):
+            raise TypeError("Inputs must be pandas DataFrames")
+        if y.empty or y_pred.empty:
+            raise ValueError("Empty evaluation data provided")
+        if y.shape != y_pred.shape:
+            raise ValueError("Mismatched shapes between true and predicted labels")
+
-        accuracy = accuracy_score(y_true=y, y_pred=y_pred)
-        precision = precision_score(y_true=y, y_pred=y_pred)
-        return accuracy, precision
+        try:
+            accuracy = accuracy_score(y_true=y, y_pred=y_pred)
+            precision = precision_score(y_true=y, y_pred=y_pred)
+            recall = recall_score(y_true=y, y_pred=y_pred)
+            f1 = f1_score(y_true=y, y_pred=y_pred)
+
+            return {
+                "accuracy": accuracy,
+                "precision": precision,
+                "recall": recall,
+                "f1": f1,
+            }
+        except Exception as e:
+            raise RuntimeError(f"Error computing metrics: {e}") from e

Don't forget to add the required imports:

 from sklearn.metrics import accuracy_score, precision_score
+from sklearn.metrics import recall_score, f1_score

12-39: Consider adding logging and model persistence capabilities.

To enhance the production readiness of this class, consider:

  1. Adding logging to track training progress, predictions, and evaluation metrics
  2. Implementing model persistence methods (save/load) for the entire pipeline (both scaler and model)
  3. Adding a method to extract feature importance from the XGBoost model

This will improve monitoring and maintainability in production.
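A rough sketch of what those three additions could look like; the joblib-based persistence and the method names are assumptions, not part of this PR:

```python
import logging

import joblib
import numpy as np

logger = logging.getLogger(__name__)


class ReservationsModel:
    # ... existing __init__ / fit / predict / evaluate ...

    def save(self, path: str) -> None:
        """Persist scaler and model together so they cannot drift apart."""
        joblib.dump({"scaler": self.scaler, "model": self.model}, path)
        logger.info("Saved pipeline to %s", path)

    def load(self, path: str) -> None:
        artifacts = joblib.load(path)
        self.scaler = artifacts["scaler"]
        self.model = artifacts["model"]
        logger.info("Loaded pipeline from %s", path)

    def feature_importances(self) -> np.ndarray:
        # XGBClassifier exposes per-feature importances after fitting.
        return self.model.feature_importances_
```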

notebooks/week_1/01_preprocessing.py (3)

2-2: Consider using an absolute path or environment variable for the wheel file.

The relative path ../hotel_reservations-2.2.0-py3-none-any.whl might break if the notebook location changes. Consider using an absolute path or environment variable to make it more robust.

-# MAGIC %pip install ../hotel_reservations-2.2.0-py3-none-any.whl
+# MAGIC %pip install ${WHEEL_PATH}/hotel_reservations-2.2.0-py3-none-any.whl

15-15: Use absolute path or environment variable for config file.

Similar to the wheel file, the relative path for the config file could break if notebook location changes.

-config = ProjectConfig.from_yaml(config_path="../../project_config.yml")
+config = ProjectConfig.from_yaml(config_path="${CONFIG_PATH}/project_config.yml")

22-26: Consider performance implications of Spark to Pandas conversion.

Loading data as a Spark DataFrame and immediately converting to pandas might be inefficient for large datasets. Consider either:

  1. Using pandas directly if the data size is manageable
  2. Keeping it as a Spark DataFrame if you need to handle large-scale data
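As a sketch of weighing the two options inside a Databricks notebook, where spark is injected by the runtime; the table name and the row-count threshold are placeholders:

```python
from hotel_reservations.config import ProjectConfig

config = ProjectConfig.from_yaml(config_path="../../project_config.yml")
df_spark = spark.table(f"{config.catalog_name}.{config.schema_name}.hotel_reservations")

# Convert to pandas only when the data comfortably fits on the driver.
if df_spark.count() < 1_000_000:  # arbitrary example threshold
    df = df_spark.toPandas()
else:
    # Otherwise stay in Spark, or explore on a sample.
    df = df_spark.sample(fraction=0.1, seed=42).toPandas()
```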

notebooks/week_2/02_logging_and_registering.py (6)

8-8: Remove unused import pandas as pd

The import statement import pandas as pd is not used anywhere in the code. Removing unused imports improves code readability and reduces clutter.

Apply this diff to remove the unused import:

-import pandas as pd


8-15: Consider reorganizing imports to comply with PEP 8 standards

PEP 8 recommends placing all import statements at the top of the file. Currently, the imports are placed after restarting the Python environment. While this might be intentional due to the notebook context, consider reorganizing the code to have imports at the top for better readability and to comply with standard conventions.



59-59: Avoid hardcoding the Git SHA

The GIT_SHA is hardcoded as "ffa63b430205ff7". Hardcoding can lead to inconsistencies if the code changes without updating the SHA. Consider retrieving the Git SHA dynamically to ensure it reflects the current commit.

Possible Implementation: Retrieve Git SHA dynamically

-GIT_SHA = "ffa63b430205ff7"
+import os
+GIT_SHA = os.popen('git rev-parse HEAD').read().strip()

Note: Ensure that the environment where this code runs has access to Git commands.


67-72: Clarify preprocessing of target variable y_train

In the preprocessing step for y_train, the method preprocess_data is called with X=y_train, which could be confusing since X typically represents features. Consider renaming the parameter or providing additional comments to improve clarity.

Suggestion: Rename parameter for clarity

If preprocess_data is intended to preprocess the target variable, adjust the parameter name accordingly.

-        X=y_train,
+        y=y_train,

And modify the preprocess_data method signature if necessary.


75-80: Clarify preprocessing of target variable y_test

Similar to the previous comment on y_train, ensure that the preprocessing of y_test is clearly documented and parameter names accurately reflect their purpose.


117-119: Ensure proper usage of MLflow data loading

The code retrieves dataset information and attempts to load it using dataset_source.load(). Confirm that this step is necessary and that it functions as intended. If the loaded data is not used subsequently, consider removing this step to streamline the code.

src/hotel_reservations/data_processor.py (3)

60-65: Simplify column renaming using Pandas built-in methods.

The current loop for renaming columns can be replaced with Pandas string methods for better readability and efficiency.

Apply this diff to simplify the code:

-    col_names = X.columns.to_list()
-    for i, col in enumerate(col_names):
-        col = col.replace(" ", "_").lower()
-        col_names[i] = col
-    X.columns = col_names
+    X.columns = X.columns.str.replace(" ", "_").str.lower()

81-86: Make test_size and random_state configurable through ProjectConfig.

Hardcoding test_size and random_state reduces flexibility. Fetching these values from the configuration enhances adaptability.

Apply this diff to retrieve values from self.config:

-    def split_data(
-        self, X: pd.DataFrame, test_size: float = 0.2, random_state: int = 42
-    ) -> tuple[pd.DataFrame, pd.DataFrame]:
+    def split_data(
+        self, X: pd.DataFrame, test_size: float = None, random_state: int = None
+    ) -> tuple[pd.DataFrame, pd.DataFrame]:
+        if test_size is None:
+            test_size = self.config.test_size
+        if random_state is None:
+            random_state = self.config.random_state
         train_set, test_set = train_test_split(
             X, test_size=test_size, random_state=random_state
         )

Ensure that test_size and random_state are defined in ProjectConfig.


28-31: Allow include_fe_features to be configurable in the transform method.

Hardcoding include_fe_features as True limits flexibility. Consider making it a parameter to control feature inclusion during transformation.

Modify the method signature and usage:

-    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
+    def transform(self, X: pd.DataFrame, include_fe_features: bool = True) -> pd.DataFrame:
         """Preprocess data with One-Hot Encoding and relevant feature extraction."""
         X = self.one_hot_encode(X=X, features="cat_features")
         return self.extract_features(
-            X=X, features="num_features", include_fe_features=True
+            X=X, features="num_features", include_fe_features=include_fe_features
         )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 3783d8c and 9c6afac.

⛔ Files ignored due to path filters (2)
  • data/hotel_reservations.csv is excluded by !**/*.csv
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (17)
  • .github/CODEOWNERS (1 hunks)
  • .github/workflows/ci.yml (1 hunks)
  • .gitignore (1 hunks)
  • .pre-commit-config.yaml (1 hunks)
  • README.md (1 hunks)
  • databricks.yml (1 hunks)
  • notebooks/week_1/01_preprocessing.py (1 hunks)
  • notebooks/week_2/01_preparation.py (1 hunks)
  • notebooks/week_2/02_logging_and_registering.py (1 hunks)
  • notebooks/week_2/03_feature_engineering.py (1 hunks)
  • project_config.yml (1 hunks)
  • pyproject.toml (1 hunks)
  • pytest.ini (1 hunks)
  • src/hotel_reservations/config.py (1 hunks)
  • src/hotel_reservations/data_processor.py (1 hunks)
  • src/hotel_reservations/reservations_model.py (1 hunks)
  • src/hotel_reservations/utils.py (1 hunks)
✅ Files skipped from review due to trivial changes (4)
  • .github/CODEOWNERS
  • .gitignore
  • databricks.yml
  • pytest.ini
🧰 Additional context used
🪛 yamllint
.github/workflows/ci.yml

[error] 1-1: wrong new line character: expected \n

(new-lines)


[error] 21-21: trailing spaces

(trailing-spaces)

.pre-commit-config.yaml

[error] 1-1: wrong new line character: expected \n

(new-lines)

project_config.yml

[error] 1-1: wrong new line character: expected \n

(new-lines)


[error] 17-17: trailing spaces

(trailing-spaces)

🪛 Markdownlint
README.md

17-17: null
Bare URL used

(MD034, no-bare-urls)


21-21: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff
notebooks/week_1/01_preprocessing.py

5-5: Undefined name dbutils

(F821)


8-8: Module level import not at top of file

(E402)


9-9: Module level import not at top of file

(E402)


10-10: Module level import not at top of file

(E402)


11-11: Module level import not at top of file

(E402)


12-12: Module level import not at top of file

(E402)


22-22: Undefined name spark

(F821)

notebooks/week_2/01_preparation.py

5-5: Undefined name dbutils

(F821)


8-8: Module level import not at top of file

(E402)


9-9: Module level import not at top of file

(E402)


10-10: Module level import not at top of file

(E402)

notebooks/week_2/02_logging_and_registering.py

5-5: Undefined name dbutils

(F821)


8-8: Module level import not at top of file

(E402)


8-8: pandas imported but unused

Remove unused import: pandas

(F401)


9-9: Module level import not at top of file

(E402)


10-10: Module level import not at top of file

(E402)


11-11: Module level import not at top of file

(E402)


12-12: Module level import not at top of file

(E402)


13-13: Module level import not at top of file

(E402)


14-14: Module level import not at top of file

(E402)


15-15: Module level import not at top of file

(E402)

notebooks/week_2/03_feature_engineering.py

5-5: Undefined name dbutils

(F821)


8-8: Module level import not at top of file

(E402)


9-9: Module level import not at top of file

(E402)


10-10: Module level import not at top of file

(E402)


11-11: Module level import not at top of file

(E402)


12-12: Module level import not at top of file

(E402)


13-13: Module level import not at top of file

(E402)


14-14: Module level import not at top of file

(E402)


15-15: Module level import not at top of file

(E402)


16-16: Module level import not at top of file

(E402)


17-17: Module level import not at top of file

(E402)

src/hotel_reservations/data_processor.py

16-16: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

🔇 Additional comments (15)
src/hotel_reservations/utils.py (1)

8-25: Verify function usage in preprocessing notebook

Let's verify the function is properly imported and used in the preprocessing notebook.

✅ Verification successful

Function is correctly imported and used

The visualize_results function is properly imported and used in notebooks/week_1/01_preprocessing.py. The function is called with the correct parameters (y_test and y_pred) as defined in the function signature, where y_pred is obtained from model predictions and passed along with the actual test labels (y_test).

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if the function is imported and used correctly
rg -A 2 "from hotel_reservations.utils import visualize_results" notebooks/
rg "visualize_results\(" notebooks/

Length of output: 388


Script:

#!/bin/bash
# Check the context of how visualize_results is used
rg -B 5 -A 5 "visualize_results\(" notebooks/week_1/01_preprocessing.py

Length of output: 250

notebooks/week_2/01_preparation.py (3)

8-11: LGTM! Imports are appropriate for the data processing task

The import order warning from static analysis can be ignored as this is a Databricks notebook where imports commonly follow command cell markers.



13-13: LGTM! Standard Spark session initialization


27-29: Verify DataProcessor implementation

Let's verify the implementation of the mentioned methods in the DataProcessor class.

✅ Verification successful

Let me search for the specific methods mentioned in the review.


Both methods are properly implemented in DataProcessor class

The verification confirms that both split_data and save_to_catalog methods are properly implemented in the DataProcessor class with appropriate type hints and functionality:

  • split_data: Correctly splits data using train_test_split with configurable test_size and random_state
  • save_to_catalog: Properly saves train and test sets as Databricks tables with timestamp columns
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for DataProcessor class implementation
ast-grep --pattern 'class DataProcessor {
  $$$
  split_data($$$) {
    $$$
  }
  $$$
  save_to_catalog($$$) {
    $$$
  }
  $$$
}'

# Backup: search using ripgrep if ast-grep doesn't find it
rg -A 10 "class DataProcessor"

Length of output: 1135


Script:

#!/bin/bash
# Search for split_data and save_to_catalog methods in data_processor.py
rg -A 15 "def split_data|def save_to_catalog" src/hotel_reservations/data_processor.py

Length of output: 1174

pyproject.toml (5)

6-6: LGTM: Improved Python version constraint

The change from "==3.11" to "==3.11.*" is a good improvement as it allows patch-level updates while maintaining compatibility.


20-24: LGTM: Well-structured dev dependencies

The development dependencies are well-structured with appropriate version constraints and include all necessary tools for Databricks integration.


26-39: LGTM: Comprehensive Ruff configuration

The Ruff configuration includes a good selection of linting rules covering code style, potential bugs, and import sorting.


2-3: Verify version numbering strategy

The version number jumped from 0.0.1 to 2.2.0, which is a significant change. Please ensure this aligns with your versioning strategy and consider documenting the versioning scheme in the README.


7-18: Verify impact of ML library changes

The switch from lightgbm to xgboost suggests a change in the ML approach. Please ensure all existing model code has been updated accordingly.

✅ Verification successful

The ML library switch is properly implemented

The codebase has been correctly updated to use XGBoost instead of LightGBM. The model implementation in src/hotel_reservations/reservations_model.py shows:

  • Clean XGBoost integration with no remaining LightGBM references
  • Proper model configuration using XGBClassifier
  • Complete implementation including training, prediction, and evaluation methods
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for any remaining lightgbm references
echo "Checking for lightgbm references in Python files:"
rg -l "lightgbm|lgb" --type py

# Check for existing model files that might need updates
echo "Checking for model-related files:"
fd -e py -e ipynb | rg -i "model|train|predict"

Length of output: 313


Script:

#!/bin/bash
# Check the content of the model file to see if it contains any lightgbm references
# and examine the current ML implementation
echo "Checking reservations_model.py content:"
cat src/hotel_reservations/reservations_model.py

# Also check for any imports in the file
echo -e "\nChecking imports specifically:"
rg "^(import|from.*import)" src/hotel_reservations/reservations_model.py

Length of output: 2051

notebooks/week_1/01_preprocessing.py (2)

17-18: Review sensitive information in config dump.

Printing the entire configuration might expose sensitive information. Consider filtering out sensitive fields before printing.

✅ Verification successful

Config dump appears safe, no sensitive information exposed

The configuration file only contains non-sensitive model parameters, feature names, and database schema information. There are no passwords, tokens, keys, or other sensitive credentials being dumped to the output.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if config file contains sensitive information
rg -i "password|secret|key|token" "../../project_config.yml"

Length of output: 125


Script:

#!/bin/bash
# Search for config files in the repository
fd -e yml -e yaml

# Search for config-related patterns in Python files
rg -i "config.*yaml" -g "*.py"

# Search for yaml load/dump patterns
ast-grep --pattern 'yaml.dump($$$)'

Length of output: 835


Script:

#!/bin/bash
# Check the content of project config file
cat project_config.yml

# Check the ProjectConfig class implementation
cat src/hotel_reservations/config.py

Length of output: 2064


31-34: Verify data splitting strategy.

Consider the following aspects:

  1. Is the split ratio appropriate for your use case?
  2. Should stratification be used if the target variable is imbalanced?
  3. Is the random state set for reproducibility?
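A sketch addressing points 2 and 3, with synthetic data standing in for the notebook's train set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the hotel data
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,     # point 1: make the ratio an explicit, reviewed choice
    random_state=42,   # point 3: fixed seed for reproducibility
    stratify=y,        # point 2: preserve class balance in both splits
)
```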
notebooks/week_2/02_logging_and_registering.py (2)

5-5: Confirm availability of dbutils in the execution environment

The dbutils module is used but not imported. In Databricks notebooks, dbutils is available by default; however, if this script is run outside of a Databricks notebook, it will raise a NameError. Ensure that this code is intended to run within a Databricks environment or modify it to handle external execution.



83-83: Validate the use of the evaluate method

Ensure that the evaluate method in ReservationsModel correctly calculates accuracy and precision. Verify that it handles input types appropriately and that the evaluation metrics are correctly implemented.

notebooks/week_2/03_feature_engineering.py (2)

161-165: Verify if preprocessing the target variable is appropriate

You're applying the preprocess_data method to y_train, which is the target variable. Typically, preprocessing is applied to features (X) rather than the target (y). Please verify if this is intentional and necessary.


169-174: Verify if preprocessing the target variable is appropriate

Similarly, you are preprocessing y_test with the data preprocessor. Ensure that this step is required for your target variable.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (17)
.github/workflows/ci.yml (4)

12-12: Consider pinning the checkout action to a commit SHA

While using actions/checkout@v4 is good for getting updates, for maximum security, consider pinning to a specific commit SHA.

-      - uses: actions/checkout@v4
+      - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1

14-15: Consider pinning the uv setup action to a commit SHA

Similar to the checkout action, consider pinning astral-sh/setup-uv@v3 to a specific commit SHA for enhanced security.

-        uses: astral-sh/setup-uv@v3
+        uses: astral-sh/setup-uv@e5f837a6d67ddb0b4d6eef68a50b12c5d39b3d73 # v3

24-26: Consider caching pre-commit for faster CI runs

The pre-commit installation could benefit from caching to speed up CI runs.

+      - name: Cache pre-commit
+        uses: actions/cache@v4
+        with:
+          path: ~/.cache/pre-commit
+          key: pre-commit-${{ hashFiles('.pre-commit-config.yaml') }}
       - name: Run pre-commit checks
         run: |
           uv pip install pre-commit
           uv run pre-commit run --all-files

1-26: Fix line endings in the YAML file

The static analysis tool detected incorrect line endings in the file.

Please ensure all line endings are Unix-style (\n) rather than Windows-style (\r\n). You can fix this by:

  1. Setting core.autocrlf=input in your git config
  2. Using a text editor that preserves Unix line endings
  3. Running dos2unix on the file

README.md (3)

11-11: Minor grammatical improvement needed

Add the missing article before "main branch" for better readability.

-To submit the deliverable, create a feature branch in that repository, and a PR to main branch.
+To submit the deliverable, create a feature branch in that repository, and a PR to the main branch.


17-17: Format documentation link using proper markdown syntax

Convert the bare URL to a proper markdown link for better readability and maintainability.

-In our examples, we use UV. Check out the documentation on how to install it: https://docs.astral.sh/uv/getting-started/installation/
+In our examples, we use UV. Check out the [installation documentation](https://docs.astral.sh/uv/getting-started/installation/).


21-26: Add language specification to code block and enhance setup instructions

The code block should specify the shell language, and it would be helpful to add some context about the expected outcome.

-```
+```shell
 uv venv -p 3.11.0 venv
 source venv/bin/activate
 uv pip install -r pyproject.toml --all-extras
 uv lock

+This will:
+1. Create a new Python 3.11 virtual environment
+2. Activate the environment
+3. Install all dependencies including optional extras
+4. Generate a lockfile for reproducible installations


project_config.yml (3)

1-2: Consider environment-specific configuration management

Environment-specific values like catalog_name and schema_name should be externalized to support different environments (dev, staging, prod).

Consider:

  1. Using environment variables
  2. Creating separate config files per environment (e.g., config.dev.yml, config.prod.yml); see the sketch below
  3. Using a configuration management system

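A sketch of option 2, selecting the file through an environment variable; the DEPLOY_ENV name and the per-environment file naming scheme are assumptions:

```python
import os

from hotel_reservations.config import ProjectConfig

env = os.environ.get("DEPLOY_ENV", "dev")  # e.g. dev / staging / prod
config = ProjectConfig.from_yaml(config_path=f"project_config.{env}.yml")
```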

51-53: Document target variable transformation

The configuration defines two targets:

  • original_target: booking_status
  • target: booking_status_canceled

Please document:

  1. The relationship between these targets
  2. The transformation applied
  3. The rationale for using the transformed target

1-53: Fix newline character format

The file uses incorrect newline characters. Please ensure consistent use of Unix-style newlines (\n).

src/hotel_reservations/reservations_model.py (3)

12-14: Enhance class documentation with parameters and usage examples.

While the class purpose is clear, consider adding more detailed documentation:

  • Parameters section describing the config
  • Returns section for methods
  • Example usage section

Example improvement:

-    """Class to train and evaluate a model for predicting cancellation on hotel reservations."""
+    """Class to train and evaluate a model for predicting cancellation on hotel reservations.
+
+    Parameters
+    ----------
+    config : ProjectConfig
+        Configuration object containing model parameters like eta, n_estimators, and max_depth
+
+    Examples
+    --------
+    >>> from hotel_reservations.config import ProjectConfig
+    >>> config = ProjectConfig()
+    >>> model = ReservationsModel(config)
+    >>> model.fit(X_train, y_train)
+    >>> predictions = model.predict(X_test)
+    """

15-22: Add parameter validation and type hints for class attributes.

Consider adding validation for config parameters and type hints for better code maintainability.

 def __init__(self, config: ProjectConfig):
+    if not isinstance(config, ProjectConfig):
+        raise TypeError("config must be an instance of ProjectConfig")
+    required_params = ["eta", "n_estimators", "max_depth"]
+    missing_params = [param for param in required_params if param not in config.parameters]
+    if missing_params:
+        raise ValueError(f"Missing required parameters: {missing_params}")
+
     self.config = config
-    self.scaler = StandardScaler()
-    self.model = XGBClassifier(
+    self.scaler: StandardScaler = StandardScaler()
+    self.model: XGBClassifier = XGBClassifier(
         eta=self.config.parameters["eta"],
         n_estimators=self.config.parameters["n_estimators"],
         max_depth=self.config.parameters["max_depth"],
     )

12-39: Consider making the model implementation more flexible.

The current implementation is tightly coupled with XGBoost. Consider implementing a more flexible architecture that allows easy switching between different model types.

Suggestions:

  1. Create a base model interface
  2. Use factory pattern for model creation
  3. Move model-specific parameters to separate configuration classes

Example architecture:

from abc import ABC, abstractmethod

class BaseModel(ABC):
    @abstractmethod
    def fit(self, X, y):
        pass
    
    @abstractmethod
    def predict(self, X):
        pass

class XGBoostModel(BaseModel):
    def __init__(self, config: "XGBoostConfig"):
        ...  # XGBoost-specific implementation

class LightGBMModel(BaseModel):
    def __init__(self, config: "LightGBMConfig"):
        ...  # LightGBM-specific implementation

class ModelFactory:
    @staticmethod
    def create_model(model_type: str, config: "ModelConfig") -> BaseModel:
        if model_type == "xgboost":
            return XGBoostModel(config)
        elif model_type == "lightgbm":
            return LightGBMModel(config)
        raise ValueError(f"Unknown model type: {model_type}")
notebooks/week_2/03_feature_engineering.py (4)

1-2: Consider using a package registry for production deployments

Installing from a local wheel file might cause issues in production environments where the file might not be available or may have different versions. Consider publishing the package to a private PyPI repository or using Databricks package management features.


65-73: Add error handling for data insertion

The INSERT operations should handle potential duplicate key violations or data integrity issues. Consider using MERGE or INSERT IF NOT EXISTS operations.

-spark.sql(
-    f"INSERT INTO {catalog_name}.{schema_name}.hotel_features "
-    f"SELECT Booking_ID, no_of_previous_cancellations, avg_price_per_room FROM {catalog_name}.{schema_name}.train_set_vs"
-)
+spark.sql(
+    f"""
+    MERGE INTO {catalog_name}.{schema_name}.hotel_features AS target
+    USING {catalog_name}.{schema_name}.train_set_vs AS source
+    ON target.Booking_ID = source.Booking_ID
+    WHEN NOT MATCHED THEN
+        INSERT (Booking_ID, no_of_previous_cancellations, avg_price_per_room)
+        VALUES (source.Booking_ID, source.no_of_previous_cancellations, source.avg_price_per_room)
+    """
+)

76-86: Add input validation and documentation to the UDF

The function should include input validation and documentation for better maintainability and reliability.

 spark.sql(
     f"""
 CREATE OR REPLACE FUNCTION {function_name}(no_of_week_nights INT, no_of_weekend_nights INT)
 RETURNS INT
 LANGUAGE PYTHON AS
 $$
+# Calculate total stay length by summing weekday and weekend nights
+if no_of_week_nights < 0 or no_of_weekend_nights < 0:
+    raise ValueError("Number of nights cannot be negative")
 return no_of_week_nights + no_of_weekend_nights
 $$
 """
 )

195-198: Implement a versioning strategy for model registration

Consider adding a version strategy to avoid overwriting existing models and to maintain proper versioning:

+from mlflow.tracking.client import MlflowClient
+
+client = MlflowClient()
+latest_version = client.get_latest_versions(
+    name=f"{catalog_name}.{schema_name}.hotel_reservations_model_fe",
+    stages=["None"]
+)[0].version
+
 mlflow.register_model(
     model_uri=model_uri,
     name=f"{catalog_name}.{schema_name}.hotel_reservations_model_fe",
+    tags={"version_comment": f"Feature engineering pipeline v{int(latest_version) + 1}"}
 )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9c6afac and 19c069c.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (11)
  • .github/workflows/ci.yml (1 hunks)
  • README.md (1 hunks)
  • notebooks/week_1/01_preprocessing.py (1 hunks)
  • notebooks/week_2/01_preparation.py (1 hunks)
  • notebooks/week_2/02_logging_and_registering.py (1 hunks)
  • notebooks/week_2/03_feature_engineering.py (1 hunks)
  • project_config.yml (1 hunks)
  • pyproject.toml (1 hunks)
  • src/hotel_reservations/data_processor.py (1 hunks)
  • src/hotel_reservations/reservations_model.py (1 hunks)
  • src/hotel_reservations/utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • notebooks/week_1/01_preprocessing.py
  • notebooks/week_2/01_preparation.py
  • notebooks/week_2/02_logging_and_registering.py
  • src/hotel_reservations/data_processor.py
  • src/hotel_reservations/utils.py
🧰 Additional context used
🪛 yamllint
.github/workflows/ci.yml

[error] 1-1: wrong new line character: expected \n

(new-lines)

project_config.yml

[error] 1-1: wrong new line character: expected \n

(new-lines)

🪛 LanguageTool
README.md

[uncategorized] ~11-~11: Possible missing article found.
Context: ... branch in that repository, and a PR to main branch. The code can be merged after we...

(AI_HYDRA_LEO_MISSING_THE)

🪛 Markdownlint
README.md

17-17: null
Bare URL used

(MD034, no-bare-urls)


21-21: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (9)
.github/workflows/ci.yml (1)

17-22: LGTM: Python setup and dependency installation

The Python setup and dependency installation using uv looks good. Using uv sync is efficient as it installs dependencies based on the lock file.

pyproject.toml (3)

20-24: LGTM! Development dependencies are well-defined

The development dependencies include all necessary tools for Databricks development with appropriate version constraints.


7-18: Verify impact of switching from lightgbm to xgboost

The change from lightgbm to xgboost represents a significant shift in the model framework choice. While both are gradient boosting implementations, they have different characteristics and performance profiles.

Let's check for model-related files that might need updates:

✅ Verification successful

The switch from lightgbm to xgboost is properly implemented

The codebase has been correctly updated to use XGBoost instead of LightGBM. The implementation in reservations_model.py shows a clean integration of XGBClassifier with proper hyperparameter configuration through the project config. The model training pipeline in the notebook also demonstrates correct usage of the XGBoost-based model with MLflow tracking.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for files containing model training code or lightgbm references
rg -l "lightgbm|LGBMRegressor|LGBMClassifier" || echo "No lightgbm references found"

Length of output: 114


Script:

#!/bin/bash
# Let's check for xgboost usage to ensure the transition is properly implemented
rg -l "xgboost|XGBRegressor|XGBClassifier"

# Also search for any model training related files
rg -l "model.*train|train.*model"

Length of output: 361


Script:

#!/bin/bash
# Let's examine the model implementation in reservations_model.py
cat src/hotel_reservations/reservations_model.py

# Also check the training code in notebooks
cat notebooks/week_2/02_logging_and_registering.py

Length of output: 5456


35-39: Review new linting ignore rules

The addition of two new ignore rules raises some concerns:

  1. F821 (undefined-name): This could mask undefined variable references
  2. E402 (module-level-import-not-at-top): This might indicate circular dependency issues

Consider addressing the underlying issues instead of suppressing these warnings.

Let's check for potential issues:

project_config.yml (2)

4-7: Previous review comment about learning rate is still applicable


9-14: Review A/B test experimental design

The current A/B test design:

  • Varies max_depth (6 vs 10)
  • Keeps other parameters constant
  • Uses a much lower learning rate than the main model

Consider documenting:

  1. The hypothesis being tested
  2. Why max_depth was chosen as the variable parameter
  3. Expected impact on model performance
src/hotel_reservations/reservations_model.py (1)

1-10: LGTM! Well-organized imports and clear module documentation.

The imports are logically organized and the module's purpose is clearly documented.

notebooks/week_2/03_feature_engineering.py (2)

124-125: The previous review comment about ensuring consistent feature engineering for the test set is still valid.


150-150: The previous review comment about replacing the hardcoded GIT_SHA with dynamic retrieval is still valid.

Comment on lines +169 to +171
# Evaluate the model performance
accuracy, precision = pipeline.named_steps["classifier"].evaluate(y_test, y_pred)


🛠️ Refactor suggestion

Enhance model evaluation with cross-validation and additional metrics

The current evaluation is limited to accuracy and precision on a single test set. Consider:

  1. Adding k-fold cross-validation
  2. Including additional metrics like recall, F1-score, and ROC-AUC
  3. Adding confusion matrix visualization
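A sketch of all three points on synthetic data; LogisticRegression here is only a stand-in for the notebook's actual pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, weights=[0.7, 0.3], random_state=42)
model = LogisticRegression(max_iter=1000)

# 1. k-fold cross-validation instead of a single held-out split
scores = cross_val_score(model, X, y, cv=5, scoring="f1")
print(f"F1 across folds: {scores.mean():.3f} +/- {scores.std():.3f}")

# 2. Additional metrics on a held-out set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class
print("ROC-AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# 3. Raw confusion matrix (the project's visualize_results utility can plot it)
print(confusion_matrix(y_test, y_pred))
```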


@basakeskili basakeskili left a comment


Nice!

@VanessaSchl VanessaSchl merged commit a9ffe05 into main Nov 25, 2024
2 checks passed
@VanessaSchl VanessaSchl deleted the week1+2 branch November 25, 2024 15:01
@coderabbitai coderabbitai bot mentioned this pull request Nov 25, 2024
Merged