Week1 #3

Merged: merged 36 commits on Nov 20, 2024

Conversation

@javedhassans (Collaborator) commented Nov 19, 2024

Making a pull request for week 1.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a comprehensive data exploration framework for analyzing child health data.
    • Added a new configuration file for project settings, including model training parameters.
    • Implemented functionality for logging and managing MLflow experiments.
    • Developed new classes for processing and analyzing actigraphy data.
  • Documentation

    • Enhanced the README.md with detailed explanations of the dataset used in the course.
  • Chores

    • Updated .gitignore to include additional entries for ignoring specific file types.
    • Modified CI workflow to integrate pre-commit checks.

@javedhassans javedhassans requested a review from a team as a code owner November 19, 2024 12:25

coderabbitai bot commented Nov 19, 2024

Walkthrough

The pull request introduces several significant changes across various files, focusing on enhancing the machine learning workflow for child health data analysis. Key modifications include updates to the CI workflow for integrating pre-commit checks, additions to the .gitignore file, and the introduction of new configuration and processing scripts. New classes and methods for data processing, model training, and evaluation are implemented, alongside the creation of configuration files for project parameters. Additionally, several new notebooks facilitate data exploration, feature preparation, and MLflow experiment management.

Changes

File / Change Summary

  • .github/workflows/ci.yml: Added a step to install pre-commit and simplified the command for running pre-commit checks.
  • .gitignore: Added entries to ignore .databricks and *.parquet files.
  • README.md: Added detailed explanations about the dataset used in the course, including various metrics and analyses.
  • databricks.yml: Created a configuration file for a Databricks asset bundle.
  • notebooks/week1/00.dataexploration.ipynb: Introduced a data exploration framework with methods for processing actigraphy data.
  • notebooks/week1/01.dataExploraton.py: Added functions for data exploration and preprocessing within Databricks.
  • notebooks/week2/01.prepare_dataset.py: Created a notebook for preparing the child health dataset for ML tasks.
  • notebooks/week2/02.prepare_featureTable.py: Introduced functionality for preparing a feature table using Actigraph data.
  • notebooks/week2/03.mlflow_experiment.py: Implemented MLflow experiment management functionalities.
  • notebooks/week2/04.log_and_register_model.py: Established a machine learning workflow for model logging and registration with MLflow.
  • notebooks/week2/05.wip_log_and_register_custom_model.py: Added functionality for logging and registering a custom ML model.
  • notebooks/week2/06.log_and_register_fe_model.py: Implemented a workflow for logging and registering a feature engineering model.
  • notebooks/week2/mlruns/0/meta.yaml: Added metadata configuration for MLflow experiment tracking.
  • notebooks/week2/run_info.json: Created a JSON file containing structured information about an MLflow run.
  • project_config.yml: Introduced a configuration file outlining parameters for the ML project.
  • pyproject.toml: Updated the project name and Python version requirements, and added new dependencies.
  • src/childHealth/childhealth_model.py: Introduced a LightGBM model class for child health data analysis with methods for training and evaluation.
  • src/childHealth/config.py: Created a configuration management class for project settings.
  • src/childHealth/data_processor.py: Added a class for preprocessing and feature engineering of training datasets.
  • src/childHealth/feature_engineering.py: Implemented a class for reading and processing participant data from parquet files.
  • src/childHealth/kappaLgm_model.py: Introduced a model class using LightGBM with custom evaluation metrics.
  • src/childHealth/utils.py: Added utility functions for visualization and prediction adjustment.

Poem

🐇 In the land of code where changes bloom,
New scripts and workflows dispel the gloom.
With data explored and models refined,
A journey of health, we joyfully find.
Pre-commit checks and features galore,
Together we leap, always seeking more! 🌼


coderabbitai bot left a comment

Actionable comments posted: 45

🧹 Outside diff range and nitpick comments (48)
.github/workflows/ci.yml (1)

Line range hint 1-27: Consider workflow optimizations

The workflow could benefit from the following improvements:

  1. Add dependency caching to speed up builds
  2. Set timeout limits to prevent hung jobs
  3. Add failure notifications for critical steps
name: CI

on:
  pull_request:
    branches:
      - main

jobs:
  build_and_test:
    runs-on: ubuntu-latest
+   timeout-minutes: 10
    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v3

      - name: Set up Python
        run: uv python install 3.11

+     - name: Cache dependencies
+       uses: actions/cache@v3
+       with:
+         path: .uv
+         key: ${{ runner.os }}-uv-${{ hashFiles('**/requirements.txt') }}
+         restore-keys: |
+           ${{ runner.os }}-uv-

      - name: Install the dependencies
        run: uv sync
notebooks/week2/02.prepare_featureTable.py (2)

12-17: Avoid blanket warning suppression

Global warning suppression can hide important issues. Consider suppressing only specific warnings that are known to be safe to ignore.

-warnings.filterwarnings("ignore")
+# Suppress specific warnings with justification
+warnings.filterwarnings("ignore", category=UserWarning, module="childHealth")
+warnings.filterwarnings("ignore", category=DeprecationWarning, module="databricks")

1-34: Consider adding logging and monitoring for production readiness

As this notebook processes sensitive child health data and is part of a production ML pipeline, consider implementing:

  1. Comprehensive logging for debugging and audit trails
  2. Data quality checks before feature table creation
  3. Performance monitoring for the feature engineering process
  4. Access control validation for the data volumes

Would you like assistance in implementing these production-ready features?
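
A minimal sketch of the first two items, assuming the notebook exposes a Spark DataFrame named features_df and that the required column names match the aggregation used elsewhere in the PR (both names are assumptions):

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("prepare_featureTable")

REQUIRED_COLUMNS = {"id", "X", "Y", "Z", "enmo", "light"}  # illustrative column set


def validate_features(df):
    """Fail fast if the feature DataFrame is missing columns or empty."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Feature table is missing columns: {sorted(missing)}")
    if not df.head(1):
        raise ValueError("Feature table is empty; aborting table creation")
    logger.info("Feature data passed basic quality checks (%d columns)", len(df.columns))


# validate_features(features_df)  # run before creating the feature table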

src/childHealth/config.py (3)

7-14: Add class documentation and field descriptions

While the class structure is good, it would benefit from additional documentation:

  • Add a class docstring explaining the purpose and usage
  • Add field descriptions using Pydantic's Field with constraints and descriptions

Consider this enhancement:

 from typing import Any, Dict, List
 
 import yaml
-from pydantic import BaseModel, ValidationError
+from pydantic import BaseModel, ValidationError, Field
 
 
 class ProjectConfig(BaseModel):
+    """Configuration class for the Child Health ML project.
+    
+    This class manages all configuration parameters required for the ML pipeline,
+    including feature definitions, model parameters, and data source information.
+    """
-    catalog_name: str
+    catalog_name: str = Field(description="Name of the catalog in the data warehouse")
-    schema_name: str
+    schema_name: str = Field(description="Name of the schema containing the project tables")
-    random_forest_parameters: Dict[str, Any]
+    random_forest_parameters: Dict[str, Any] = Field(description="Parameters for Random Forest model training")
-    lgb_parameters: Dict[str, Any]
+    lgb_parameters: Dict[str, Any] = Field(description="Parameters for LightGBM model training")
-    num_features: List[str]
+    num_features: List[str] = Field(description="List of numerical feature column names")
-    cat_features: List[str]
+    cat_features: List[str] = Field(description="List of categorical feature column names")
-    target: str
+    target: str = Field(description="Name of the target variable column")

31-32: Enhance usage examples with realistic configurations

The current example is too basic. Consider adding more comprehensive examples that demonstrate real-world usage.

Replace with:

# Example usage:
# config = ProjectConfig.from_yaml('conf/project_config.yml')
#
# Example YAML structure:
# catalog_name: "child_health"
# schema_name: "ml_features"
# random_forest_parameters:
#   n_estimators: 100
#   max_depth: 10
# lgb_parameters:
#   num_leaves: 31
#   learning_rate: 0.05
# num_features: ["age", "weight", "height"]
# cat_features: ["gender", "location"]
# target: "health_status"

1-32: Consider architectural improvements for configuration management

As this is a core configuration component, consider these architectural improvements:

  1. Add environment-specific configuration support (dev/staging/prod)
  2. Implement configuration validation rules for different environments
  3. Consider adding a configuration singleton to prevent multiple loads
  4. Add support for overriding configurations via environment variables

Would you like assistance in implementing any of these architectural improvements?
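
As one possible shape for items 1 and 4, a sketch that selects an environment-specific YAML file and applies environment-variable overrides (the PROJECT_ENV and CHILDHEALTH_* variable names are assumptions, not part of the current code):

import os

import yaml
from pydantic import BaseModel


class ProjectConfig(BaseModel):
    catalog_name: str
    schema_name: str

    @classmethod
    def from_yaml(cls, base_path: str) -> "ProjectConfig":
        env = os.getenv("PROJECT_ENV", "dev")
        path = base_path.replace(".yml", f".{env}.yml")  # e.g. project_config.dev.yml
        with open(path) as f:
            data = yaml.safe_load(f)
        # Allow targeted overrides such as CHILDHEALTH_CATALOG_NAME=...
        for field in cls.model_fields:
            override = os.getenv(f"CHILDHEALTH_{field.upper()}")
            if override is not None:
                data[field] = override
        return cls(**data)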

pyproject.toml (1)

2-2: Consider documenting the project's scope and requirements.

The project name change to "childhealth-mlops-with-databricks" suggests a specific focus on child health analytics. Consider:

  1. Adding a detailed project description in the README
  2. Documenting specific requirements and constraints for child health data processing
  3. Including relevant healthcare data compliance requirements if applicable
notebooks/week2/01.prepare_dataset.py (2)

3-8: Consider improving package version management

The hardcoded wheel file version (0.0.1) could lead to maintenance challenges. Consider:

  1. Using environment variables or config files for version management
  2. Documenting the reason for using --force-reinstall flag
-# MAGIC %pip install childhealth_mlops_with_databricks-0.0.1-py3-none-any.whl --force-reinstall
+# MAGIC %pip install childhealth_mlops_with_databricks-${PACKAGE_VERSION}-py3-none-any.whl

1-1: Add comprehensive documentation

The notebook would benefit from:

  1. A detailed description of the pipeline's purpose
  2. Input data requirements and format
  3. Description of processing steps and their effects
  4. Expected outputs and their format
-# Databricks notebook source
+# Databricks notebook source
+
+# Child Health Dataset Preparation Pipeline
+# =======================================
+#
+# This notebook prepares the child health dataset for machine learning tasks by:
+# 1. Loading raw data from CSV
+# 2. Preprocessing features (specify transformations)
+# 3. Splitting into train/test sets
+# 4. Saving processed datasets to the catalog
+#
+# Input Requirements:
+# - CSV file with columns: [list expected columns]
+# - Configuration file with parameters: [list required parameters]
+#
+# Output:
+# - Processed training and test datasets in the catalog
src/childHealth/utils.py (1)

1-43: Consider restructuring utilities for better separation of concerns.

From an architectural perspective:

  1. Consider separating visualization utilities from prediction adjustment logic
  2. Add type hints for better code maintainability
  3. Consider creating a dedicated visualization module for all plotting functions

Suggested structure:

src/childHealth/
├── visualization/
│   ├── __init__.py
│   ├── plotting.py  # All visualization functions
│   └── utils.py     # Visualization helpers
└── prediction/
    ├── __init__.py
    └── adjustments.py  # Prediction adjustment logic

This separation would make the codebase more maintainable and easier to test.

notebooks/week2/03.mlflow_experiment.py (3)

1-7: Consider environment-aware configuration and type hints

The tracking URI is hardcoded which could make local development challenging. Consider:

  1. Using environment variables or configuration files for the tracking URI
  2. Adding type hints for better code maintainability
 # Databricks notebook source
-import json
-
-import mlflow
+from typing import Dict, Any
+import json
+import os
+import mlflow

-mlflow.set_tracking_uri("databricks")
+# Use environment variable with fallback
+tracking_uri = os.getenv("MLFLOW_TRACKING_URI", "databricks")
+mlflow.set_tracking_uri(tracking_uri)

8-12: Enhance experiment naming and tagging strategy

The experiment name is hardcoded and the tagging strategy could be more comprehensive. Consider:

  1. Using a configurable experiment name prefix
  2. Adding more meaningful tags like:
    • environment (dev/staging/prod)
    • project_version
    • data_version
-mlflow.set_experiment(experiment_name="/Shared/child-health-basic")
-mlflow.set_experiment_tags({"repository_name": "child-health"})
+experiment_name = f"/Shared/{os.getenv('MLFLOW_EXPERIMENT_PREFIX', 'child-health-basic')}"
+mlflow.set_experiment(experiment_name=experiment_name)
+
+experiment_tags = {
+    "repository_name": "child-health",
+    "environment": os.getenv("ENVIRONMENT", "development"),
+    "project_version": "1.0.0",  # Consider using git tags
+    "data_version": "1.0"  # Consider tracking data versions
+}
+mlflow.set_experiment_tags(experiment_tags)

43-49: Use proper logging and safe dictionary access

Replace print statements with proper logging and add safe dictionary access to prevent KeyError.

-print(run_info["data"]["metrics"])
-print(run_info["data"]["params"])
+def safe_get(d: dict, *keys, default=None):
+    """Safely get nested dictionary values."""
+    for key in keys:
+        if not isinstance(d, dict) or key not in d:
+            return default
+        d = d[key]
+    return d
+
+metrics = safe_get(run_info, "data", "metrics", default={})
+params = safe_get(run_info, "data", "params", default={})
+
+logger.info(f"Run metrics: {metrics}")
+logger.info(f"Run parameters: {params}")
notebooks/week1/01.dataExploraton.py (2)

22-22: Avoid blanket warning suppression.

Suppressing all warnings could hide important issues. Consider filtering specific warning categories instead.

-warnings.filterwarnings("ignore")
+# Filter specific warning categories
+warnings.filterwarnings("ignore", category=FutureWarning)
+warnings.filterwarnings("ignore", category=UserWarning)

1-112: Add documentation and modularize the code.

The notebook lacks clear documentation and proper modularization.

Consider:

  1. Adding a module docstring explaining the purpose and workflow
  2. Breaking the code into logical functions (data loading, preprocessing, validation)
  3. Adding logging instead of print statements
  4. Creating a main function to orchestrate the workflow
  5. Adding proper error handling and cleanup

Would you like me to provide an example of how to restructure this notebook?
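
One possible skeleton for that restructuring (function names and the column used in preprocessing are illustrative):

"""Data exploration and preprocessing for the child health dataset."""

import logging

import pandas as pd

logger = logging.getLogger(__name__)


def load_data(path: str) -> pd.DataFrame:
    logger.info("Loading raw data from %s", path)
    return pd.read_csv(path)


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Imputation, encoding, and other cleaning steps go here
    return df.dropna(subset=["id"])


def validate(df: pd.DataFrame) -> None:
    if df.empty:
        raise ValueError("Preprocessed DataFrame is empty")


def main(path: str = "train.csv") -> pd.DataFrame:
    df = preprocess(load_data(path))
    validate(df)
    return df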

README.md (3)

11-11: Add missing article "the" before "main branch"

For better readability, consider adding the missing article:

-To submit the deliverable, create a feature branch in that repository, and a PR to main branch.
+To submit the deliverable, create a feature branch in that repository, and a PR to the main branch.
🧰 Tools
🪛 LanguageTool

[uncategorized] ~11-~11: Possible missing article found.
Context: ... branch in that repository, and a PR to main branch. The code can be merged after we...

(AI_HYDRA_LEO_MISSING_THE)


Line range hint 16-26: Consider adding context about UV's benefits

The setup instructions are clear, but it would be helpful to briefly explain why UV was chosen over pip (e.g., faster installation, better dependency resolution). This context would help users understand the tooling decisions.

🧰 Tools
🪛 Markdownlint

17-17: null
Bare URL used

(MD034, no-bare-urls)


30-81: Fix markdown list indentation for consistency

The dataset documentation is comprehensive, but the list indentation is inconsistent. Consider standardizing the indentation:

  • Use 2 spaces for first-level lists
  • Use 4 spaces for nested lists

Example fix for a section:

### 3. **Internet Use and Educational History**
-   - **`PreInt_EduHx-computerinternet_hoursday`**: Measures daily internet/computer usage hours before any intervention.
-   - **`Parent-Child Internet Addiction Test (PCIAT)`**: Includes **`PCIAT-PCIAT_Total`**
-      - `0`: None
-      - `1`: Mild
+- **`PreInt_EduHx-computerinternet_hoursday`**: Measures daily internet/computer usage hours before any intervention.
+- **`Parent-Child Internet Addiction Test (PCIAT)`**: Includes **`PCIAT-PCIAT_Total`**
+    - `0`: None
+    - `1`: Mild
🧰 Tools
🪛 LanguageTool

[uncategorized] ~37-~37: Loose punctuation mark.
Context: ...s or impacts. - Basic_Demos-Age: The participant’s age, likely a key dem...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~38-~38: Loose punctuation mark.
Context: ...phic feature. - Basic_Demos-Sex: Gender of the participant, encoded as `...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~42-~42: Loose punctuation mark.
Context: ...Child Internet Addiction Test (PCIAT)**: Includes **PCIAT-PCIAT_Total`**, a tot...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~50-~50: Loose punctuation mark.
Context: ...as conducted. - CGAS-CGAS_Score: A numerical scale used by mental health...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~54-~54: Loose punctuation mark.
Context: ...or blood pressure. - **Physical-BMI, Physical-Height, Physical-Weight, `...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~55-~55: Loose punctuation mark.
Context: ...l health. - **Physical-Diastolic_BP, Physical-HeartRate, `Physical-Systoli...

(UNLIKELY_OPENING_PUNCTUATION)

🪛 Markdownlint

33-33: Expected: 0; Actual: 3
Unordered list indentation

(MD007, ul-indent)


36-36: Expected: 0; Actual: 3
Unordered list indentation

(MD007, ul-indent)


37-37: Expected: 0; Actual: 3
Unordered list indentation

(MD007, ul-indent)


38-38: Expected: 0; Actual: 3
Unordered list indentation

(MD007, ul-indent)


41-41: Expected: 0; Actual: 3
Unordered list indentation

(MD007, ul-indent)


42-42: Expected: 0; Actual: 3
Unordered list indentation

(MD007, ul-indent)


43-43: Expected: 2; Actual: 6
Unordered list indentation

(MD007, ul-indent)


44-44: Expected: 2; Actual: 6
Unordered list indentation

(MD007, ul-indent)


45-45: Expected: 2; Actual: 6
Unordered list indentation

(MD007, ul-indent)


46-46: Expected: 2; Actual: 6
Unordered list indentation

(MD007, ul-indent)


49-49: Expected: 0; Actual: 3
Unordered list indentation

(MD007, ul-indent)


50-50: Expected: 0; Actual: 3
Unordered list indentation

(MD007, ul-indent)


53-53: Expected: 0; Actual: 3
Unordered list indentation

(MD007, ul-indent)


54-54: Expected: 0; Actual: 3
Unordered list indentation

(MD007, ul-indent)


55-55: Expected: 0; Actual: 3
Unordered list indentation

(MD007, ul-indent)


58-58: Expected: 0; Actual: 3
Unordered list indentation

(MD007, ul-indent)


59-59: Expected: 0; Actual: 3
Unordered list indentation

(MD007, ul-indent)


60-60: Expected: 2; Actual: 6
Unordered list indentation

(MD007, ul-indent)


61-61: Expected: 2; Actual: 6
Unordered list indentation

(MD007, ul-indent)


64-64: Expected: 0; Actual: 3
Unordered list indentation

(MD007, ul-indent)


65-65: Expected: 2; Actual: 6
Unordered list indentation

(MD007, ul-indent)


66-66: Expected: 0; Actual: 3
Unordered list indentation

(MD007, ul-indent)


69-69: Expected: 0; Actual: 3
Unordered list indentation

(MD007, ul-indent)


72-72: Expected: 0; Actual: 3
Unordered list indentation

(MD007, ul-indent)


75-75: Expected: 0; Actual: 3
Unordered list indentation

(MD007, ul-indent)


76-76: Expected: 2; Actual: 5
Unordered list indentation

(MD007, ul-indent)


77-77: Expected: 2; Actual: 5
Unordered list indentation

(MD007, ul-indent)


78-78: Expected: 2; Actual: 5
Unordered list indentation

(MD007, ul-indent)


79-79: Expected: 2; Actual: 5
Unordered list indentation

(MD007, ul-indent)


80-80: Expected: 2; Actual: 5
Unordered list indentation

(MD007, ul-indent)

notebooks/week2/05.wip_log_and_register_custom_model.py (2)

69-73: Avoid using toPandas() for large DataFrames to prevent memory issues

Converting large Spark DataFrames to Pandas using toPandas() can lead to memory errors if the data is substantial. Consider processing the data using Spark operations or sampling the data before conversion.

For example, you might sample a subset of the data:

sampled_train_set = train_set.sample(fraction=0.1)
X_train = sampled_train_set[num_features + cat_features].toPandas()
y_train = sampled_train_set[[target]].toPandas()

Or perform data processing directly in Spark without converting to Pandas.


86-86: Automate the retrieval of the Git SHA

Hardcoding the Git commit SHA can lead to inconsistencies and requires manual updates. Automate the retrieval to ensure accuracy and reduce maintenance overhead.

You can retrieve the Git SHA dynamically:

import subprocess
git_sha = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()

Or, if you're using Databricks Repos:

git_sha = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().get("repoCommitId", "unknown")
notebooks/week2/04.log_and_register_model.py (4)

69-69: Parameterize the MLflow experiment name for flexibility

The experiment name "/Shared/child-health" is hardcoded. To enhance flexibility and reusability across different environments or users, consider parameterizing the experiment name or retrieving it from the configuration file.
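
For example, a sketch that prefers an environment variable, then a config value, then the current default (assumes the notebook's config object is in scope; the experiment_name config key is an assumption):

import os

import mlflow

experiment_name = os.getenv(
    "MLFLOW_EXPERIMENT_NAME",
    getattr(config, "experiment_name", "/Shared/child-health"),
)
mlflow.set_experiment(experiment_name=experiment_name)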


81-85: Log the full classification report to MLflow for detailed evaluation

Currently, only the accuracy metric is logged to MLflow. Logging the full classification report will provide a comprehensive evaluation of the model's performance, including precision, recall, and F1-score for each class.

You can log the classification report as an artifact or log individual metrics:

# Log classification report as an artifact
with open("classification_report.txt", "w") as f:
    f.write(report)
mlflow.log_artifact("classification_report.txt")

# Alternatively, log individual metrics
report_dict = classification_report(y_test, y_pred, output_dict=True)
for label, metrics in report_dict.items():
    if isinstance(metrics, dict):
        for metric_name, metric_value in metrics.items():
            mlflow.log_metric(f"{label}_{metric_name}", metric_value)

91-91: Ensure correct inputs when inferring the model signature

When using infer_signature, the model_input and model_output should correspond to the same dataset split. Currently, X_train and y_pred are from different datasets (X_train from training data and y_pred from test predictions). Consider using X_test as model_input to match y_pred.

Apply this change:

-signature = infer_signature(model_input=X_train, model_output=y_pred)
+signature = infer_signature(model_input=X_test, model_output=y_pred)

108-112: Evaluate the necessity of loading the dataset source after the run

The code block that loads the dataset source using mlflow.data.get_source and dataset_source.load() may not be necessary unless it serves a specific purpose. If it's not required, consider removing it to streamline the code.

src/childHealth/childhealth_model.py (4)

174-179: Add plt.tight_layout() to prevent clipping in the feature importance plot.

Including plt.tight_layout() after setting up the plot can help prevent axis labels and titles from being cut off when displaying or saving the figure.

Apply this suggestion:

 plt.gca().invert_yaxis()
+plt.tight_layout()
 plt.show()

131-137: Avoid hardcoding file paths; use configurable inputs instead.

Rather than hardcoding paths such as "project_config.yml" and "train.csv", consider using command-line arguments, environment variables, or a configuration management system to specify file paths. This enhances flexibility and makes the code more portable and maintainable.
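
A minimal sketch using environment variables with the current values as fallbacks (the variable names are illustrative):

import os

CONFIG_PATH = os.getenv("CHILDHEALTH_CONFIG_PATH", "project_config.yml")
TRAIN_DATA_PATH = os.getenv("CHILDHEALTH_TRAIN_PATH", "train.csv")

config = ProjectConfig.from_yaml(CONFIG_PATH)  # assumes the module's existing imports
train_df = pd.read_csv(TRAIN_DATA_PATH)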


145-145: Verify the effect of drop='first' in OneHotEncoder.

Using drop='first' in OneHotEncoder reduces the number of generated features by dropping the first category, which can help prevent multicollinearity in linear models. However, for tree-based models like LightGBM, this is generally not necessary and might lead to loss of information. Consider removing drop='first' to retain all category information.


181-184: Enhance MLflow logging with additional artifacts and parameters.

Currently, only metrics are being logged to MLflow. Consider logging model parameters, the trained model, and artifacts like the feature importance plot to facilitate better experiment tracking, reproducibility, and model comparisons.
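
For instance, a sketch of the extra logging inside the training run (assumes model, params, accuracy, and a matplotlib figure fig are already in scope):

import mlflow
import mlflow.sklearn

with mlflow.start_run():
    mlflow.log_params(params)  # training/model parameters
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_figure(fig, "feature_importance.png")  # plot as an artifact
    mlflow.sklearn.log_model(model, artifact_path="model")  # the trained model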

src/childHealth/kappaLgm_model.py (1)

93-93: Consider enabling early stopping to improve training

The early_stopping_rounds parameter is currently commented out. Enabling early stopping can prevent overfitting and reduce training time by stopping the training when the validation score stops improving.

Apply this diff to enable early stopping:

             feval=kappa_metric,
-            # early_stopping_rounds=10
+            early_stopping_rounds=10
         )
src/childHealth/feature_engineering.py (3)

105-107: Optimize DataFrame Union Operation

Sequentially unioning DataFrames in a loop can be inefficient for large lists. Use reduce with unionAll to combine all DataFrames more effectively.

Apply this code change:

from functools import reduce

from pyspark.sql import DataFrame

if all_data:
    feature_table = reduce(DataFrame.unionAll, all_data)
    logger.info("Successfully loaded and combined all participant files.")
    return feature_table
else:
    logger.warning("No data was loaded from the files. Returning an empty DataFrame.")
    return self.spark.createDataFrame([], schema=self.schema)

171-171: Avoid Expensive count() Operation

Using feature_table.count() can be costly for large datasets. To check if the DataFrame is not empty, use head(1) which is more efficient.

Apply this code change:

-if feature_table.count() > 0:
+if feature_table.head(1):

73-75: Avoid Catching Broad Exception

Catching broad exceptions can mask underlying issues and make debugging difficult. It's better to catch specific exceptions or re-raise the exception after logging.

Modify the exception handling:

-    except Exception as e:
-        logger.error("Unexpected error reading file %s for participant %s: %s", file_path, participant_id, e)
+    except FileNotFoundError as e:
+        logger.error("File not found %s for participant %s: %s", file_path, participant_id, e)
+    except Exception as e:
+        logger.error("Unexpected error reading file %s for participant %s: %s", file_path, participant_id, e)
+        raise

Replace FileNotFoundError with the specific exceptions you expect.

notebooks/week2/06.log_and_register_fe_model.py (12)

4-5: Suggestion: Install the package using %pip install directly within the notebook.

Using the %pip magic command may not persist the package installation across notebook sessions. Consider installing the package within the notebook cell to ensure consistent behavior.

Apply this change:

-# MAGIC %pip install childhealth_mlops_with_databricks-0.0.1-py3-none-any.whl
+%pip install childhealth_mlops_with_databricks-0.0.1-py3-none-any.whl

8-8: Remove unnecessary Python environment restart.

Restarting the Python environment immediately after installing packages might not be necessary within this context. It can increase the notebook execution time.

Consider removing:

# MAGIC dbutils.library.restartPython()

13-14: Consolidate imports from the same module.

You can simplify the import statements by importing multiple classes from the same module in a single line.

Refactor the imports:

-from databricks import feature_engineering
-from databricks.feature_engineering import FeatureFunction, FeatureLookup
+from databricks.feature_engineering import FeatureEngineeringClient, FeatureFunction, FeatureLookup

And update the client initialization accordingly.


15-15: Verify the necessity of importing WorkspaceClient.

The WorkspaceClient from databricks.sdk is imported but not used in the code. This unused import can be removed to clean up the code.

Remove the unused import:

-from databricks.sdk import WorkspaceClient

21-21: Import classification_report from sklearn.metrics is unused.

The classification_report is not utilized in the code. If it's not required, consider removing it to declutter the imports.

Remove the unused import:

-from sklearn.metrics import accuracy_score, classification_report
+from sklearn.metrics import accuracy_score

29-31: Update feature engineering client initialization.

After consolidating the imports, initialize the FeatureEngineeringClient properly.

Update the client initialization:

-fe = feature_engineering.FeatureEngineeringClient()
+fe = FeatureEngineeringClient()

59-87: Optimize the aggregate_actigraphy function using a loop or list comprehension.

The aggregation definitions for each column are repetitive. Using a loop or list comprehension can make the code more maintainable and scalable.

Refactor the function:

def aggregate_actigraphy(data: DataFrame) -> DataFrame:
    """
    Aggregate actigraphy data for each participant with summary statistics.
    """
    features = ["X", "Y", "Z", "enmo", "light"]
    agg_exprs = []
    for feature in features:
        agg_exprs.extend([
            F.mean(feature).alias(f"{feature}_mean"),
            F.stddev(feature).alias(f"{feature}_std"),
            F.max(feature).alias(f"{feature}_max"),
            F.min(feature).alias(f"{feature}_min"),
        ])
    agg_exprs.append(F.mean("battery_voltage").alias("battery_voltage_mean"))
    aggregated_df = data.groupBy("id").agg(*agg_exprs)
    return aggregated_df

130-133: Remove redundant table property setting.

The delta.enableChangeDataFeed property is already set during table creation at line 122.

Consider removing the redundant property setting:

-spark.sql(
-    f"ALTER TABLE {catalog_name}.{schema_name}.actigraph_aggregated_features "
-    "SET TBLPROPERTIES (delta.enableChangeDataFeed = true);"
-)

181-181: Handle potential issues with excluded columns.

Excluding update_timestamp_utc might not be necessary if the column doesn't exist. Ensure this does not cause errors.

Verify if the exclusion is required.


239-242: Retrieve Git SHA dynamically for better reproducibility.

Hardcoding the Git SHA may lead to inconsistencies. Retrieve it dynamically to ensure it reflects the current code state.

Use the gitpython library or subprocess to get the current Git SHA:

import subprocess

git_sha = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()

Ensure that git is available in the environment.


244-271: Add error handling for model training and logging steps.

Wrap critical sections with try-except blocks to capture and log exceptions.

Example:

try:
    # Fit the pipeline and make predictions
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
except Exception as e:
    mlflow.log_param("training_failed", True)
    mlflow.log_param("error_message", str(e))
    raise e

265-271: Ensure that the model logging includes conda environment specifications.

To enhance reproducibility, log the conda environment or requirements.

Include conda_env parameter in fe.log_model:

fe.log_model(
    model=pipeline,
    flavor=mlflow.sklearn,
    artifact_path="RandomForestClasf-pipeline-model-fe",
    training_set=training_set,
    signature=signature,
    conda_env="environment.yml",
)

Create an environment.yml file with the dependencies.

src/childHealth/data_processor.py (2)

174-174: Avoid overwriting original numeric features when scaling

Overwriting the original numeric features with scaled values may lead to loss of the raw data, which could be needed later for analysis or other processes. Consider creating new columns for the scaled features to preserve the original data.

Apply this diff to add scaled features as new columns:

scaler = StandardScaler()
- self.train_df[self.num_features] = scaler.fit_transform(self.train_df[self.num_features])
+ scaled_features = scaler.fit_transform(self.train_df[self.num_features])
+ scaled_columns = [f"{col}_scaled" for col in self.num_features]
+ scaled_df = pd.DataFrame(scaled_features, columns=scaled_columns, index=self.train_df.index)
+ self.train_df = pd.concat([self.train_df, scaled_df], axis=1)

219-219: Simplify timestamp assignment

Since current_timestamp() returns the current UTC timestamp, using to_utc_timestamp(current_timestamp(), "UTC") is redundant. Consider using current_timestamp() directly to simplify the code.

Apply this diff to simplify timestamp assignment:

- spark_df = spark_df.withColumn("timestamp", to_utc_timestamp(current_timestamp(), "UTC"))
+ spark_df = spark_df.withColumn("timestamp", current_timestamp())
notebooks/week1/00.dataexploration.ipynb (2)

180-180: Use logging instead of print statements

The load_data method uses print statements to output information. It's recommended to use the logging module instead, which provides more control over logging levels and output destinations.

Apply this change:

import os
+ import logging

class ActigraphAggregation:
    def __init__(self, root_dir):
        self.root_dir = root_dir
+       self.logger = logging.getLogger(__name__)
+       logging.basicConfig(level=logging.INFO)

    def load_data(self, participant_id):
        file_path = os.path.join(self.root_dir, f"id={participant_id}", "part-0.parquet")
        if not os.path.exists(file_path):
-           print(f"File not found for participant {participant_id}")
+           self.logger.warning(f"File not found for participant {participant_id}")
            return pd.DataFrame()  # Return an empty DataFrame if file doesn't exist
        data = pd.read_parquet(file_path)
        data["id"] = participant_id
-       print(f"Loaded data columns for {participant_id}: {data.columns.tolist()}")
+       self.logger.info(f"Loaded data columns for {participant_id}: {data.columns.tolist()}")
        return data

Also applies to: 184-184, 266-266, 913-913


705-705: Make file paths configurable

Hardcoding file paths like "train.csv" (line 705) and "path/to/train.csv" (line 840) can lead to errors when running the code in different environments. Consider using configuration files, environment variables, or command-line arguments to specify file paths for better flexibility and portability.

Also applies to: 840-840

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 25dfb31 and 6d3a178.

⛔ Files ignored due to path filters (4)
  • data/childHealth.csv is excluded by !**/*.csv
  • data/data_dictionary.csv is excluded by !**/*.csv
  • data/test.csv is excluded by !**/*.csv
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (22)
  • .github/workflows/ci.yml (1 hunks)
  • .gitignore (1 hunks)
  • README.md (2 hunks)
  • databricks.yml (1 hunks)
  • notebooks/week1/00.dataexploration.ipynb (1 hunks)
  • notebooks/week1/01.dataExploraton.py (1 hunks)
  • notebooks/week2/01.prepare_dataset.py (1 hunks)
  • notebooks/week2/02.prepare_featureTable.py (1 hunks)
  • notebooks/week2/03.mlflow_experiment.py (1 hunks)
  • notebooks/week2/04.log_and_register_model.py (1 hunks)
  • notebooks/week2/05.wip_log_and_register_custom_model.py (1 hunks)
  • notebooks/week2/06.log_and_register_fe_model.py (1 hunks)
  • notebooks/week2/mlruns/0/meta.yaml (1 hunks)
  • notebooks/week2/run_info.json (1 hunks)
  • project_config.yml (1 hunks)
  • pyproject.toml (2 hunks)
  • src/childHealth/childhealth_model.py (1 hunks)
  • src/childHealth/config.py (1 hunks)
  • src/childHealth/data_processor.py (1 hunks)
  • src/childHealth/feature_engineering.py (1 hunks)
  • src/childHealth/kappaLgm_model.py (1 hunks)
  • src/childHealth/utils.py (1 hunks)
✅ Files skipped from review due to trivial changes (5)
  • .gitignore
  • databricks.yml
  • notebooks/week2/mlruns/0/meta.yaml
  • notebooks/week2/run_info.json
  • project_config.yml

🔇 Additional comments (21)
.github/workflows/ci.yml (2)

21-22: LGTM: Correct usage of uv sync

The dependency installation step is properly configured using uv's sync command.


27-27: Verify pre-commit configuration file existence

The pre-commit check is correctly configured, but we should ensure the .pre-commit-config.yaml file exists in the repository.

✅ Verification successful

Pre-commit configuration is properly set up

The .pre-commit-config.yaml file exists and is well-configured with essential hooks:

  • Standard pre-commit hooks for file checks and formatting
  • Ruff for Python linting and formatting
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for pre-commit config file and its contents

# Check if .pre-commit-config.yaml exists
if [ -f ".pre-commit-config.yaml" ]; then
    echo "Found .pre-commit-config.yaml:"
    cat .pre-commit-config.yaml
else
    echo "Warning: .pre-commit-config.yaml not found"
    # Search for any potential pre-commit configs with different names or locations
    fd -H "pre-commit.*\.yaml"
fi

Length of output: 605

pyproject.toml (4)

27-33: LGTM! Development dependencies are well organized.

The development dependencies are properly segregated and include essential tools for development workflow. The addition of pre-commit aligns well with the CI workflow changes.


35-37: LGTM! Build system configuration follows best practices.

The build system configuration is properly specified with appropriate minimum version requirements for setuptools.


53-53: LGTM! Ruff configuration maintains good code quality standards.

The linting rules and formatting configuration are well-chosen for maintaining consistent code quality.


8-23: Review dependency version constraints for consistency and security.

The dependencies mix exact version pins (==) with version ranges (>=). While exact pins provide reproducibility, they might miss security updates. Consider:

  1. Using consistent version constraint style
  2. Documenting the rationale for exact version pins
  3. Verifying the selected versions for known vulnerabilities
✅ Verification successful

Relax the pydantic pin to allow security patches

The security scan revealed that pydantic 2.9.2 is not affected by any current vulnerabilities. However, there's a pattern of security issues being discovered and fixed in pydantic, suggesting we should use a version range that allows patches while maintaining major version compatibility:

pydantic>=2.9.2,<3

The other pinned packages (databricks-feature-lookup==1.2.0 and databricks-sdk==0.32.0) show no known vulnerabilities. However, for consistency with other dependencies and to allow security patches, consider using similar version range constraints unless there are specific compatibility requirements:

databricks-feature-lookup>=1.2.0,<2
databricks-sdk>=0.32.0,<1
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for known vulnerabilities in pinned packages
echo "Checking databricks-feature-lookup==1.2.0"
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: PIP, package: "databricks-feature-lookup") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
    }
  }
}'

echo "Checking databricks-sdk==0.32.0"
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: PIP, package: "databricks-sdk") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
    }
  }
}'

echo "Checking pydantic==2.9.2"
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: PIP, package: "pydantic") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
    }
  }
}'

Length of output: 2109

notebooks/week2/03.mlflow_experiment.py (1)

39-42: ⚠️ Potential issue

Apply consistent file handling pattern

This section has the same issues as the experiment serialization above.

Please apply the same file handling improvements as suggested for the experiment serialization.

README.md (1)

30-88: Excellent dataset documentation!

The documentation provides a comprehensive overview of:

  • Clear categorization of data fields
  • Detailed explanations of measurements
  • Valuable context about data utility
  • Potential analysis directions

This will greatly help users understand and work with the dataset effectively.


notebooks/week2/04.log_and_register_model.py (3)

94-95: Specify the correct dataset version when logging input data

Using "0" as the version in dataset = mlflow.data.from_spark(...) may not reference the intended dataset version. Ensure that the correct version number is specified, or consider parameterizing it if the dataset version may change.
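
For example, the latest Delta table version could be looked up at run time instead of pinning "0" (a sketch; the table and DataFrame names are assumptions):

table_name = f"{catalog_name}.{schema_name}.train_set"
latest_version = spark.sql(f"DESCRIBE HISTORY {table_name} LIMIT 1").collect()[0]["version"]
dataset = mlflow.data.from_spark(train_set, table_name=table_name, version=str(latest_version))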


73-73: Parameterize or dynamically obtain the branch name in MLflow tags

The branch name 'week1-2' is hardcoded in the MLflow run tags. To maintain accurate tracking and reduce manual efforts, consider retrieving the branch name dynamically or parameterizing it based on the execution context.

Run the following script to verify the retrieval of the current Git branch name:
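
A minimal version of such a check:

import subprocess

branch = subprocess.check_output(
    ["git", "rev-parse", "--abbrev-ref", "HEAD"], text=True
).strip()
print(branch)  # expected to print the active branch, e.g. "week1-2"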


70-70: 🛠️ Refactor suggestion

Retrieve the Git SHA dynamically for accurate version tracking

The Git SHA is hardcoded as "830c17d988742482b639aec763ec731ac2dd4da5". To ensure accurate version tracking and reduce manual updates, retrieve the current Git commit SHA programmatically within the notebook.

Run the following script to verify the retrieval of the current Git commit SHA:
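
A minimal version:

import subprocess

git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
print(git_sha)  # should match the SHA currently hardcoded in the notebook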

src/childHealth/childhealth_model.py (2)

83-83: Ensure custom evaluation metric is compatible with LightGBM classifier.

When passing the custom kappa_metric function as eval_metric, verify that it adheres to LightGBM's expected format for evaluation functions in the scikit-learn API. The function should accept y_true and y_pred, where y_pred is the predicted probabilities or raw scores. Ensure that the reshaping and argmax operations in kappa_metric correctly handle the shape of y_pred provided by LightGBM during training.


123-124: Handle potential misalignment between feature importances and feature names.

Ensure that the feature importance scores from the model align correctly with the feature names obtained from the preprocessor. After one-hot encoding, the number of features increases, and the feature names may not directly correspond to the original feature list. Verify that get_feature_names_out() provides the correct names corresponding to the expanded feature set after encoding.

src/childHealth/kappaLgm_model.py (2)

79-80: ⚠️ Potential issue

Avoid fitting the preprocessor on validation data

Currently, the preprocessor is fitted on both training and validation data, which can lead to data leakage. It's important to fit the preprocessor only on the training data and then transform the validation data.

Apply this diff to correct the preprocessing steps:

 def train(self, X_train, y_train, X_val, y_val):
     ...
-    X_train = self.preprocessor.fit_transform(X_train)
+    self.preprocessor.fit(X_train)
+    X_train = self.preprocessor.transform(X_train)
     X_val = self.preprocessor.transform(X_val)
     ...

Likely invalid or redundant comment.


172-173: Verify the use of raw predictions in threshold optimization

In the optimize_thresholds method, you're using predict_proba to obtain predictions for threshold optimization. Ensure that the predictions used are raw scores or probabilities compatible with your rounding and thresholding logic.

Run the following script to check the output of predict_proba:

Adjust the method if necessary to ensure correct threshold optimization.

✅ Verification successful

Let me gather more context about the implementation of predict_proba and optimize_thresholds methods.


Let me try with ripgrep to get the implementation details.


Let me check the threshold_Rounder implementation to complete the verification.


Raw probabilities are correctly used in threshold optimization

The implementation is correct. The predict_proba method returns raw probabilities from the underlying model, which are then properly used in the optimize_thresholds method. The thresholding logic in threshold_Rounder correctly applies the thresholds to these probabilities to generate class predictions (0-4), and the optimization uses quadratic weighted kappa as the metric. No adjustments are needed.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the output format of predict_proba method.

# Expected: The output should be probabilities suitable for thresholding.
grep -A 5 "def predict_proba" src/childHealth/kappaLgm_model.py

Length of output: 261


Script:

#!/bin/bash
# Get the full implementation of predict_proba and optimize_thresholds
ast-grep --pattern 'def predict_proba(self, $_) {
  $$$
}'

ast-grep --pattern 'def optimize_thresholds(self, $_, $_) {
  $$$
}'

Length of output: 131


Script:

#!/bin/bash
# Get the full implementation of both methods
rg "def predict_proba" -A 15 src/childHealth/kappaLgm_model.py
rg "def optimize_thresholds" -A 15 src/childHealth/kappaLgm_model.py

Length of output: 1307


Script:

#!/bin/bash
# Get the implementation of threshold_Rounder and quadratic_weighted_kappa
rg "def threshold_Rounder" -A 10 src/childHealth/kappaLgm_model.py
rg "def quadratic_weighted_kappa" -A 10 src/childHealth/kappaLgm_model.py

Length of output: 947

notebooks/week2/06.log_and_register_fe_model.py (6)

54-54: Confirm the necessity of dropping the non_wear_flag column.

Dropping the non_wear_flag column removes potentially useful information. Ensure that this column is not required for downstream analysis.


97-123: Ensure proper handling of table creation and existing data.

Using CREATE OR REPLACE TABLE will overwrite existing data without warning. Confirm that this behavior is intended.

Additionally, consider specifying the schema explicitly or using DataFrame write operations for better control.
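As a sketch of the explicit alternative (the DataFrame and table names are placeholders for the notebook's actual objects):

# Fail loudly if the table already exists instead of silently replacing it
(
    features_df.write.format("delta")
    .mode("errorifexists")
    .saveAsTable(f"{config.catalog_name}.{config.schema_name}.actigraph_features")
)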


172-173: Include all relevant features in the FeatureLookup.

Ensure that all necessary features are included in the feature_names list. Missing features might affect model performance.
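A hedged sketch of an explicit lookup (the feature names below are placeholders, not verified columns of the feature table):

from databricks.feature_engineering import FeatureLookup

feature_lookups = [
    FeatureLookup(
        table_name=f"{config.catalog_name}.{config.schema_name}.actigraph_features",
        feature_names=["X_mean", "Y_mean", "Z_mean", "enmo_mean"],  # placeholder names
        lookup_key="id",
    )
]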


223-231: Ensure that the remainder parameter in ColumnTransformer is appropriate.

Setting remainder='passthrough' will include all unspecified columns in the output, which might include unwanted columns.

  • Confirm that only intended columns are being passed to the model.
  • Alternatively, specify remainder='drop' and explicitly include all required features.
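A minimal sketch of the remainder='drop' variant suggested above (column lists are illustrative):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_features = ["Basic_Demos-Age", "enmo_mean"]  # placeholders
cat_features = ["Basic_Demos-Sex"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_features),
    ],
    remainder="drop",  # anything not listed is excluded rather than silently passed through
)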

273-276: Verify the correctness of the model registration URI and name.

Ensure that the model_uri and name parameters are correctly specified for model registration.

  • Confirm that run_id and artifact_path are correct.
  • Check if the model name adheres to the MLflow registry naming conventions.
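For reference, a Unity Catalog registration typically looks like this (run ID, artifact path, and model name below are placeholders, not the notebook's verified values):

import mlflow

mlflow.set_registry_uri("databricks-uc")

model_uri = f"runs:/{run_id}/{artifact_path}"  # from the logging step
registered = mlflow.register_model(
    model_uri=model_uri,
    # UC registry names must be three-level: <catalog>.<schema>.<model>
    name=f"{config.catalog_name}.{config.schema_name}.child_health_model_fe",
)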

51-54: Ensure that the tables exist before loading.

Attempting to load tables that do not exist may result in runtime errors. Verify that the tables are available in the specified catalog and schema.

Run the following script to check the existence of the tables:
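A minimal PySpark sketch of such a check (the table names are assumptions based on the notebook's config):

# Verify required tables exist before loading them
for table in ["train", "actigraph_features"]:
    full_name = f"{config.catalog_name}.{config.schema_name}.{table}"
    if not spark.catalog.tableExists(full_name):
        raise ValueError(f"Required table does not exist: {full_name}")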

Comment on lines +23 to +24
- name: Install pre-commit
run: pip install pre-commit

🛠️ Refactor suggestion

Use uv instead of pip for consistency

The workflow is using uv as the package manager, but this step uses pip. For consistency and to avoid potential dependency resolution issues, consider using uv here as well.

- - name: Install pre-commit
-   run: pip install pre-commit
+ - name: Install pre-commit
+   run: uv pip install pre-commit

Comment on lines +26 to +33
Actigraph = ActigraphFileReader(
app_name="ActigraphAggregation",
root_dir="/Volumes/mlops_students/javedhassi/data/series_train.parquet/",
catalog_name=config.catalog_name,
schema_name=config.schema_name,
)
# COMMAND ----------
Actigraph.save_feature_table()

⚠️ Potential issue

Add error handling and parameter validation for feature table creation

The current implementation uses hardcoded paths and lacks error handling for the feature table creation process.

+# Validate configuration parameters
+if not config.catalog_name or not config.schema_name:
+    raise ValueError("Invalid configuration: catalog_name and schema_name must be specified")
+
+# Use configuration for root directory
+root_dir = config.get_data_path("actigraph_data")
+
+try:
     Actigraph = ActigraphFileReader(
         app_name="ActigraphAggregation",
-        root_dir="/Volumes/mlops_students/javedhassi/data/series_train.parquet/",
+        root_dir=root_dir,
         catalog_name=config.catalog_name,
         schema_name=config.schema_name,
     )
-Actigraph.save_feature_table()
+    Actigraph.save_feature_table()
+except Exception as e:
+    raise RuntimeError(f"Failed to create feature table: {str(e)}")

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +3 to +8
# MAGIC %pip install childhealth_mlops_with_databricks-0.0.1-py3-none-any.whl --force-reinstall


# COMMAND ----------

# MAGIC dbutils.library.restartPython()

🛠️ Refactor suggestion

Improve package installation reliability and version management

The current package installation approach has several potential issues:

  1. The wheel filename is hardcoded without version control
  2. Force reinstall might mask dependency conflicts
  3. No error handling for installation failures

Consider this improved approach:

-# MAGIC %pip install childhealth_mlops_with_databricks-0.0.1-py3-none-any.whl --force-reinstall
+# MAGIC %sh
+# MAGIC # Set package version
+# MAGIC PACKAGE_VERSION="0.0.1"
+# MAGIC PACKAGE_NAME="childhealth_mlops_with_databricks"
+# MAGIC
+# MAGIC # Install package with error handling
+# MAGIC if ! pip install "${PACKAGE_NAME}-${PACKAGE_VERSION}-py3-none-any.whl"; then
+# MAGIC     echo "Failed to install package"
+# MAGIC     exit 1
+# MAGIC fi

# MAGIC dbutils.library.restartPython()

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +21 to +22
# Load project configuration from YAML file
config = ProjectConfig.from_yaml(config_path="../../project_config.yml")

⚠️ Potential issue

Add error handling for configuration loading

The current configuration loading lacks error handling and uses a hardcoded relative path.

-# Load project configuration from YAML file
-config = ProjectConfig.from_yaml(config_path="../../project_config.yml")
+# Load project configuration from YAML file with error handling
+import os
+
+try:
+    config_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
+    config_path = os.path.join(os.path.dirname(config_path), "../../project_config.yml")
+    config = ProjectConfig.from_yaml(config_path=config_path)
+except Exception as e:
+    raise RuntimeError(f"Failed to load project configuration: {str(e)}")

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +16 to +28
@classmethod
def from_yaml(cls, config_path: str) -> "ProjectConfig":
"""Load configuration from a YAML file."""
try:
with open(config_path, "r") as f:
config_dict = yaml.safe_load(f)
return cls(**config_dict)
except FileNotFoundError as e:
raise FileNotFoundError(f"Configuration file not found: {config_path}") from e
except yaml.YAMLError as e:
raise ValueError(f"Error parsing YAML file: {e}") from e
except ValidationError as e:
raise ValueError(f"Validation error: {e}") from e

🛠️ Refactor suggestion

Enhance input validation and security

The YAML loading implementation is good but could benefit from additional security measures:

Consider these security enhancements:

 @classmethod
 def from_yaml(cls, config_path: str) -> "ProjectConfig":
-    """Load configuration from a YAML file."""
+    """Load configuration from a YAML file.
+    
+    Args:
+        config_path: Path to the YAML configuration file
+        
+    Returns:
+        ProjectConfig: Initialized configuration object
+        
+    Raises:
+        FileNotFoundError: If the configuration file doesn't exist
+        ValueError: If YAML parsing or validation fails
+        OSError: If the file path is invalid or inaccessible
+    """
+    if not isinstance(config_path, str):
+        raise TypeError("config_path must be a string")
+
     try:
+        config_path = os.path.abspath(os.path.expanduser(config_path))
+        if not os.path.exists(config_path):
+            raise FileNotFoundError(f"Configuration file not found: {config_path}")
+
         with open(config_path, "r") as f:
-            config_dict = yaml.safe_load(f)
+            # safe_load already uses SafeLoader, blocking arbitrary object construction
+            config_dict = yaml.safe_load(f)
+            
+        if not isinstance(config_dict, dict):
+            raise ValueError("YAML file must contain a dictionary")
+
         return cls(**config_dict)
     except FileNotFoundError as e:
         raise FileNotFoundError(f"Configuration file not found: {config_path}") from e
     except yaml.YAMLError as e:
         raise ValueError(f"Error parsing YAML file: {e}") from e
     except ValidationError as e:
         raise ValueError(f"Validation error: {e}") from e
+    except OSError as e:
+        raise OSError(f"Error accessing configuration file: {e}") from e

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +177 to +182
def handle_missing_target(self):
"""Handle missing values in the target column by replacing them with a specific value (e.g., 4.0)."""
logging.info("Handling missing target values")
new_value = 4.0 # New value to replace NaNs in the target column
self.train_df[self.target].fillna(new_value, inplace=True)
logging.info("Missing target values handled")

⚠️ Potential issue

Reconsider strategy for handling missing target values

Replacing missing values in the target column with a constant value (4.0) may introduce bias and affect model performance. It is generally better to remove records with missing target values or use an appropriate imputation strategy based on domain knowledge.
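A minimal sketch of the drop-based alternative (attribute names follow the class under review):

def handle_missing_target(self):
    """Drop rows with a missing target instead of imputing a constant."""
    logging.info("Handling missing target values")
    before = len(self.train_df)
    self.train_df = self.train_df.dropna(subset=[self.target])
    logging.info("Dropped %d rows with missing target values", before - len(self.train_df))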

Comment on lines +120 to +122
self.train_df["Age_Group"] = pd.cut(
self.train_df["Basic_Demos-Age"], bins=[0, 12, 17, 25], labels=["Child", "Teen", "Young Adult"]
)

⚠️ Potential issue

Adjust age group bins for accurate classification

pd.cut bins are right-inclusive and left-exclusive by default, so bins=[0, 12, 17, 25] produces the intervals (0, 12], (12, 17], and (17, 25]: age 0 is excluded entirely, 12-year-olds land in "Child", and "Teen" only starts at 13. If the intended grouping is Child = 0–11, Teen = 12–16, and Young Adult = 17–24, specify right=False so the intervals become [0, 12), [12, 17), and [17, 25) (noting that age 25 is then excluded).

Apply this diff to adjust the bins:

self.train_df["Age_Group"] = pd.cut(
    self.train_df["Basic_Demos-Age"],
    bins=[0, 12, 17, 25],
    labels=["Child", "Teen", "Young Adult"],
+   right=False
)

Comment on lines +99 to +105
def _convert_sex_to_binary(self):
"""Convert 'Sex' to binary encoding if it's part of numerical features and not already binary encoded."""
if "Basic_Demos-Sex" in self.cat_features:
# Map the values to binary encoding
self.train_df["Basic_Demos-Sex"] = self.train_df["Basic_Demos-Sex"].map({1.0: "Male", 0.0: "Female"})
logging.info("Converted 'Basic_Demos-Sex' to binary")


⚠️ Potential issue

Clarify method functionality in _convert_sex_to_binary

The method name _convert_sex_to_binary implies binary encoding, but the code maps the numeric 1.0/0.0 values to the categorical strings "Male" and "Female", which is the opposite: it decodes a binary column into labels. This may cause confusion and inconsistent data types. Either keep the column numeric by mapping the floats to integer codes, or rename the method to reflect that it decodes labels.

Apply this diff to correctly convert 'Sex' to binary encoding:

def _convert_sex_to_binary(self):
    if "Basic_Demos-Sex" in self.cat_features:
-       # Map the values to binary encoding
-       self.train_df["Basic_Demos-Sex"] = self.train_df["Basic_Demos-Sex"].map({1.0: "Male", 0.0: "Female"})
+       # Source values are 1.0/0.0 floats; map them to integer binary codes
+       self.train_df["Basic_Demos-Sex"] = self.train_df["Basic_Demos-Sex"].map({1.0: 1, 0.0: 0})
        logging.info("Converted 'Basic_Demos-Sex' to binary")

Comment on lines +78 to +145
"import pandas as pd\n",
"\n",
"\n",
"class ActigraphAggregation:\n",
" def __init__(self, root_dir):\n",
" self.root_dir = root_dir\n",
"\n",
" def load_data(self, participant_id):\n",
" file_path = os.path.join(self.root_dir, f\"id={participant_id}\", \"part-0.parquet\")\n",
" data = pd.read_parquet(file_path)\n",
" data[\"id\"] = participant_id\n",
" print(f\"Loaded data columns for {participant_id}: {data.columns.tolist()}\")\n",
" return data\n",
"\n",
" def aggregate_actigraphy(self, data):\n",
" \"\"\"\n",
" Aggregate the actigraphy data for a single participant.\n",
"\n",
" Parameters:\n",
" - data (DataFrame): Actigraphy data for a participant.\n",
"\n",
" Returns:\n",
" - DataFrame: Aggregated actigraphy data with one row per 'id'.\n",
" \"\"\"\n",
" aggregated_df = (\n",
" data.groupby(\"id\")\n",
" .agg(\n",
" {\n",
" \"X\": [\"mean\", \"std\", \"max\", \"min\"],\n",
" \"Y\": [\"mean\", \"std\", \"max\", \"min\"],\n",
" \"Z\": [\"mean\", \"std\", \"max\", \"min\"],\n",
" \"enmo\": [\"mean\", \"std\", \"max\", \"min\"],\n",
" \"anglez\": \"mean\",\n",
" \"non-wear_flag\": \"sum\",\n",
" \"light\": [\"mean\", \"std\", \"max\", \"min\"],\n",
" \"battery_voltage\": \"mean\",\n",
" }\n",
" )\n",
" .reset_index()\n",
" )\n",
"\n",
" # Flatten column names\n",
" aggregated_df.columns = [\"_\".join(col).strip() if col[1] else col[0] for col in aggregated_df.columns.values]\n",
" return aggregated_df\n",
"\n",
" def process_participant_data(self, participant_id):\n",
" data = self.load_data(participant_id)\n",
" aggregated_data = self.aggregate_actigraphy(data)\n",
" return aggregated_data\n",
"\n",
" def process_all_participants(self):\n",
" all_data = []\n",
" with ThreadPoolExecutor() as executor:\n",
" futures = []\n",
" for id_folder in os.listdir(self.root_dir):\n",
" if not id_folder.startswith(\"id=\"):\n",
" continue\n",
" participant_id = id_folder.split(\"=\")[-1]\n",
" futures.append(executor.submit(self.process_participant_data, participant_id))\n",
"\n",
" for future in tqdm(as_completed(futures), total=len(futures), desc=\"Processing participants\"):\n",
" participant_data = future.result()\n",
" all_data.append(participant_data)\n",
"\n",
" feature_table = pd.concat(all_data, ignore_index=True)\n",
" return feature_table"
]
},

🛠️ Refactor suggestion

Consolidate multiple definitions of ActigraphAggregation class

The ActigraphAggregation class is redefined multiple times throughout the notebook (lines 78-145, 168-245, 252-385, and 902-1032). This repetition can lead to confusion and make the code harder to maintain. Consider consolidating all methods into a single definition of the class to improve readability and reduce redundancy.

Also applies to: 168-245, 252-385, 902-1032

Comment on lines +601 to +722
" config = yaml.safe_load(file)\n",
" return config\n",
"\n",
" def load_data(self, filepath):\n",
" \"\"\"Load dataset and split into features and target based on config.\"\"\"\n",
" data = pd.read_csv(filepath)\n",
" X = data[self.config[\"num_features\"] + self.config[\"cat_features\"]]\n",
" y = data[self.config[\"target\"]]\n",
" return X, y\n",
"\n",
" def preprocess_data(self, X):\n",
" \"\"\"Preprocess the data by encoding categorical variables.\"\"\"\n",
" X = pd.get_dummies(X, columns=self.config[\"cat_features\"], drop_first=True)\n",
" return X\n",
"\n",
" def train(self, X, y):\n",
" \"\"\"Train the LightGBM model.\"\"\"\n",
" X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)\n",
"\n",
" # LightGBM dataset\n",
" train_data = lgb.Dataset(X_train, label=y_train)\n",
" val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)\n",
"\n",
" # Set LightGBM parameters for multiclass classification\n",
" params = {\n",
" \"objective\": \"multiclass\",\n",
" \"num_class\": len(y.unique()),\n",
" \"learning_rate\": self.config[\"model_parameters\"][\"learning_rate\"],\n",
" \"n_estimators\": self.config[\"model_parameters\"][\"n_estimators\"],\n",
" \"max_depth\": self.config[\"model_parameters\"][\"max_depth\"],\n",
" \"metric\": \"multi_logloss\",\n",
" }\n",
"\n",
" # Train the model\n",
" self.model = lgb.train(\n",
" params, train_data, valid_sets=[train_data, val_data], early_stopping_rounds=50, verbose_eval=10\n",
" )\n",
"\n",
" # Save the model\n",
" joblib.dump(self.model, \"lgbm_model.pkl\")\n",
" print(\"Model training complete and saved as 'lgbm_model.pkl'\")\n",
"\n",
" def evaluate(self, X, y):\n",
" \"\"\"Evaluate the model on the test set.\"\"\"\n",
" y_pred = self.model.predict(X, num_iteration=self.model.best_iteration)\n",
" y_pred = y_pred.argmax(axis=1) # Get class with max probability\n",
"\n",
" # Print classification report\n",
" print(\"Evaluation Results:\")\n",
" print(classification_report(y, y_pred))\n",
" print(\"Accuracy:\", accuracy_score(y, y_pred))\n",
"\n",
" def feature_importance(self, X):\n",
" \"\"\"Display the feature importance.\"\"\"\n",
" importance = self.model.feature_importance(importance_type=\"gain\")\n",
" feature_names = X.columns\n",
" feature_importance = pd.DataFrame({\"feature\": feature_names, \"importance\": importance})\n",
" feature_importance = feature_importance.sort_values(by=\"importance\", ascending=False)\n",
"\n",
" # Plotting feature importance\n",
" plt.figure(figsize=(10, 8))\n",
" plt.barh(feature_importance[\"feature\"], feature_importance[\"importance\"], color=\"skyblue\")\n",
" plt.xlabel(\"Importance\")\n",
" plt.title(\"Feature Importance\")\n",
" plt.gca().invert_yaxis()\n",
" plt.show()\n",
"\n",
" def load_model(self, model_path=\"lgbm_model.pkl\"):\n",
" \"\"\"Load a saved model.\"\"\"\n",
" self.model = joblib.load(model_path)\n",
" print(\"Model loaded from\", model_path)\n",
"\n",
" def predict(self, X):\n",
" \"\"\"Make predictions with the trained model.\"\"\"\n",
" X_preprocessed = self.preprocess_data(X)\n",
" y_pred = self.model.predict(X_preprocessed, num_iteration=self.model.best_iteration)\n",
" return y_pred.argmax(axis=1)\n",
"\n",
"\n",
"if __name__ == \"__main__\":\n",
" # Configuration and paths\n",
" config_path = \"project_config.yml\"\n",
" train_path = \"train.csv\"\n",
"\n",
" # Initialize model class\n",
" lgbm_classifier = MultiClassLightGBM(config_path)\n",
"\n",
" # Load and preprocess data\n",
" X, y = lgbm_classifier.load_data(train_path)\n",
" X = lgbm_classifier.preprocess_data(X)\n",
"\n",
" # Train model\n",
" lgbm_classifier.train(X, y)\n",
"\n",
" # Evaluate model\n",
" lgbm_classifier.evaluate(X, y)\n",
"\n",
" # Display feature importance\n",
" lgbm_classifier.feature_importance(X)"
]

🛠️ Refactor suggestion

Avoid code duplication between MultiClassLightGBM and ChildHealthModel classes

Both MultiClassLightGBM (lines 601-722) and ChildHealthModel (lines 1139-1250) implement similar functionality for model training and evaluation using LightGBM. To enhance maintainability and reduce redundancy, consider refactoring the code to use a single class or base class that can be reused in different contexts.

Also applies to: 1139-1250
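One possible shape for the shared logic, as a sketch (not the repo's code; subclasses would add config handling and I/O):

import lightgbm as lgb


class BaseLGBMClassifier:
    """Shared LightGBM training/prediction for the duplicated classes."""

    def __init__(self, params: dict):
        self.params = params
        self.model = None

    def train(self, X_train, y_train, X_val, y_val):
        train_set = lgb.Dataset(X_train, label=y_train)
        val_set = lgb.Dataset(X_val, label=y_val, reference=train_set)
        self.model = lgb.train(self.params, train_set, valid_sets=[train_set, val_set])

    def predict(self, X):
        # Multiclass objective returns a probability matrix; pick the argmax class
        proba = self.model.predict(X, num_iteration=self.model.best_iteration)
        return proba.argmax(axis=1)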

@javedhassans javedhassans merged commit 24b4329 into main Nov 20, 2024
2 checks passed