Week1 #1
Conversation
Walkthrough

The changes in this pull request involve multiple updates across various files to enhance the functionality of a loan prediction project. Key modifications include the addition of new source modules, notebooks, tests, and configuration files (see the files listed under Review details below).
Actionable comments posted: 38
🧹 Outside diff range and nitpick comments (19)
logging_config.py (2)
4-5: Enhance the docstring with more details.

The current docstring could be more informative to help other developers understand the function's purpose and usage. Consider expanding it like this:

```diff
- """Set up logging configuration."""
+ """Set up application-wide logging configuration.
+
+ Configures logging with the following settings:
+ - Console handler with INFO level
+ - Standard formatter with timestamp and log level
+ - Reduces verbosity of py4j loggers to ERROR level
+
+ Usage:
+     from logging_config import setup_logging
+
+     setup_logging()
+     logger = logging.getLogger(__name__)
+     logger.info("Application started")
+ """
```
6-27: Consider adding file-based logging for production environments.

While the current console-based logging is suitable for development, consider enhancing it for production use:

- Add a file handler with rotation to prevent disk space issues
- Separate configurations for different environments (dev/prod)
- Make the log level configurable through environment variables

Would you like me to provide an example implementation with these enhancements? A minimal sketch is shown below.
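As an illustration, a minimal sketch of what such a setup could look like using only the standard library; the `LOG_LEVEL` environment variable and the `app.log` path are assumptions for the example, not part of the project:

```python
import logging
import os
from logging.handlers import RotatingFileHandler


def setup_logging() -> None:
    """Configure console plus rotating-file logging (illustrative sketch)."""
    # Log level comes from the environment so dev/prod can differ without code changes.
    level = os.getenv("LOG_LEVEL", "INFO").upper()
    formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")

    console = logging.StreamHandler()
    console.setFormatter(formatter)

    # Rotate at ~5 MB, keep 3 backups, so logs cannot fill the disk.
    file_handler = RotatingFileHandler("app.log", maxBytes=5_000_000, backupCount=3)
    file_handler.setFormatter(formatter)

    logging.basicConfig(level=level, handlers=[console, file_handler])
    logging.getLogger("py4j").setLevel(logging.ERROR)  # quiet noisy Spark bridge logs
```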
tests/test_evaluate.py (1)
21-25: Add docstring to explain custom metric function.

The custom metric function would benefit from documentation explaining its purpose and behavior.

```diff
 def test_evaluator_init_custom():
     def custom_metric(y_true, y_pred):
+        """
+        Simple accuracy metric that compares predictions to true labels.
+
+        Args:
+            y_true: True binary labels
+            y_pred: Predicted probabilities or scores
+
+        Returns:
+            float: Mean accuracy score
+        """
         return np.mean(y_true == y_pred)

     evaluator = Evaluator(metric_function=custom_metric)
     assert evaluator.metric_function == custom_metric
```

pyproject.toml (2)
10-11: Consider pinning major versions for consistency.

The new dependencies use different version constraints:

- catboost==1.2.7, <2 (exact minor version)
- graphviz==0.20.3, <1 (exact minor version)

For consistency with other dependencies like lightgbm>=4.5.0, <5, consider using similar version constraints:

```diff
- "catboost==1.2.7, <2",
- "graphviz==0.20.3, <1",
+ "catboost>=1.2.7, <2",
+ "graphviz>=0.20.3, <1",
```
25-26: Consider specifying an upper bound for pip version.

While adding pytest is good for testing, the pip dependency lacks an upper version bound, which could lead to future compatibility issues.

```diff
- "pip>=24.2",
+ "pip>=24.2, <25",
```

notebooks/intermediate_notebook.py (2)
1-2: Move package installation to project dependencies.

Installing packages directly in notebooks is not recommended because it:

- Makes the environment setup non-reproducible
- May cause version conflicts
- Needs to be re-run every time the cluster restarts

Consider adding catboost to your project's requirements.txt or setup.py instead.
instead.🧰 Tools
🪛 Ruff
2-2: SyntaxError: Expected a statement
2-2: SyntaxError: Simple statements must be separated by newlines or semicolons
2-2: SyntaxError: Simple statements must be separated by newlines or semicolons
1-78: Improve notebook organization and documentation.

Consider these improvements for better notebook maintainability:

- Add markdown cells explaining each step
- Document expected inputs and outputs
- Remove empty cells at the end
- Add error handling for critical operations

Would you like me to provide an example of how to structure the notebook with proper documentation?
🧰 Tools
🪛 Ruff
2-2: SyntaxError: Expected a statement
2-2: SyntaxError: Simple statements must be separated by newlines or semicolons
2-2: SyntaxError: Simple statements must be separated by newlines or semicolons
notebooks/prettier_notebook.py (1)
17-17: Add logging statements for better observability.

While logging is set up, there are no actual logging statements throughout the notebook. This makes it difficult to monitor the execution progress and debug issues.
Add logging statements at key points:
- Data loading completion
- Feature engineering steps
- Model training progress
- Prediction generation
Example:
logger.info(f"Loaded training data with {len(dataframe)} rows") logger.info(f"Starting {nfolds}-fold cross-validation")src/loans/data_processor.py (5)
1-7: Remove extra blank line for better code organization.

```diff
 import pandas as pd
 from typing import List, Tuple, Optional
 import logging

 logger = logging.getLogger(__name__)
-
```
8-16: Enhance class and attribute documentation.

The docstring could be more descriptive about the class purpose and its attributes.

```diff
 class DataBuilder:
     def __init__(self, dataframe: Optional[pd.DataFrame] = None):
         """
-        Initializes the DataBuilder with an optional DataFrame.
+        A builder class for processing and managing DataFrame operations.
+
+        Args:
+            dataframe (Optional[pd.DataFrame]): Initial DataFrame to process.
+                If None, must be loaded using load_data method.
+
+        Attributes:
+            dataframe: The main DataFrame being processed
+            features: DataFrame containing feature columns after separation
+            target: Series containing the target variable after separation
         """
         self.dataframe: Optional[pd.DataFrame] = dataframe
         self.features: Optional[pd.DataFrame] = None
         self.target: Optional[pd.Series] = None
```
33-44: Add logging and type validation.

Consider adding logging for better traceability and type validation for the target column.

```diff
 def separate_features_and_target(self, target_column: str) -> 'DataBuilder':
     """
     Separates the dataframe into features and target attributes.
+
+    Args:
+        target_column (str): Name of the target column
+
+    Returns:
+        DataBuilder: self for method chaining
+
+    Raises:
+        ValueError: If DataFrame is not loaded or target column doesn't exist
     """
     if self.dataframe is None:
         raise ValueError("Dataframe is not loaded. Call load_data() before separating features and target.")
+    if not isinstance(target_column, str):
+        raise TypeError("Target column name must be a string")
     if target_column not in self.dataframe.columns:
         raise ValueError(f"Target column '{target_column}' does not exist in the dataframe.")
+
+    logger.info(f"Separating features and target column: {target_column}")
     self.features = self.dataframe.drop(columns=[target_column])
     self.target = self.dataframe[target_column]
     return self
```
45-59: Improve getter methods documentation and add trailing newline.

Enhance the docstrings for better clarity and add the missing newline at the end of file.

```diff
 def get_dataframe(self) -> pd.DataFrame:
     """
-    Returns the current dataframe.
+    Returns the current DataFrame.
+
+    Returns:
+        pd.DataFrame: The current DataFrame
+
+    Raises:
+        ValueError: If DataFrame is not loaded
     """
     if self.dataframe is None:
         raise ValueError("Dataframe is not loaded.")
     return self.dataframe

 def get_features_and_target(self) -> Tuple[pd.DataFrame, pd.Series]:
     """
-    Returns the features and target as a tuple.
+    Returns the separated features and target.
+
+    Returns:
+        Tuple[pd.DataFrame, pd.Series]: A tuple containing (features DataFrame, target Series)
+
+    Raises:
+        ValueError: If features and target haven't been separated
     """
     if self.features is None or self.target is None:
         raise ValueError("Features and target have not been separated. Call separate_features_and_target() first.")
-    return self.features, self.target
+    return self.features, self.target
+
```

🧰 Tools
🪛 Ruff
59-59: No newline at end of file
Add trailing newline
(W292)
8-59: Consider implementing immutability for better data integrity.

The current implementation uses mutable state with in-place modifications. Consider implementing immutability by returning new instances instead of modifying the existing state. This would prevent accidental modifications and make the code more predictable and easier to test.

Example approach:

- Remove inplace=True from DataFrame operations
- Return new DataBuilder instances with updated state
- Make class attributes private with a _ prefix
- Add property decorators for safe access

This would make the class more robust and align with functional programming principles; a sketch of this style follows the lint notes below.
🧰 Tools
🪛 Ruff
59-59: No newline at end of file
Add trailing newline
(W292)
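To make the suggestion concrete, here is a minimal sketch of the immutable style; it reuses the class and method names discussed in this review, but the body is illustrative, not the project's actual code:

```python
from typing import List, Optional

import pandas as pd


class DataBuilder:
    """Immutable variant: every operation returns a new builder."""

    def __init__(self, dataframe: Optional[pd.DataFrame] = None):
        self._dataframe = dataframe  # private; exposed via the property below

    @property
    def dataframe(self) -> pd.DataFrame:
        if self._dataframe is None:
            raise ValueError("Dataframe is not loaded.")
        return self._dataframe

    def drop_columns(self, columns: List[str]) -> "DataBuilder":
        # pandas drop() already returns a copy, so no inplace=True is needed.
        return DataBuilder(self.dataframe.drop(columns=columns))
```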
tests/test_loans.py (4)
1-7: Remove unused import roc_auc_score.

The roc_auc_score import is not used in any of the test methods. Consider removing it to keep the imports clean.

```diff
-from sklearn.metrics import roc_auc_score
```
🧰 Tools
🪛 Ruff
6-6: sklearn.metrics.roc_auc_score imported but unused. Remove unused import: sklearn.metrics.roc_auc_score (F401)
8-34: Add docstrings to fixtures for better documentation.

The fixtures are well-structured but would benefit from docstrings explaining their purpose and return values. This helps other developers understand the test setup better. Example improvement:

```diff
 @pytest.fixture
 def sample_configs():
+    """
+    Returns a dictionary containing model configuration parameters.
+
+    Returns:
+        dict: Contains model_params, categorical_variables, and model_verbose settings
+    """
     return {
         'model_params': {'iterations': 10, 'learning_rate': 0.1},
         'categorical_variables': ['cat_feature'],
         'model_verbose': False
     }
```
134-137: Improve assertion messages in error handling tests.

The error handling tests would be more maintainable with descriptive assertion messages. Also, add a newline at the end of the file.

```diff
 with pytest.raises(ValueError) as excinfo:
     loans.predict(data)
-assert "Data must contain an 'id' column." in str(excinfo.value)
+assert "Data must contain an 'id' column." in str(excinfo.value), "Expected error message about missing 'id' column not found"

 with pytest.raises(ValueError) as excinfo:
     loans.predict_cv(data)
-assert "Data must contain an 'id' column." in str(excinfo.value)
+assert "Data must contain an 'id' column." in str(excinfo.value), "Expected error message about missing 'id' column not found"

 with pytest.raises(ValueError) as excinfo:
     loans.predict_cv(data)
-assert "No models have been trained. Please call perform_cv first." in str(excinfo.value)
+assert "No models have been trained. Please call perform_cv first." in str(excinfo.value), "Expected error message about untrained models not found"
+
```

Also applies to: 164-167, 173-175
1-175: Overall: Good test coverage with room for enhancement.

The test suite provides good coverage of the main functionality with well-structured fixtures and test methods. Consider these improvements for better maintainability and robustness:

- Add docstrings to fixtures and test methods
- Enhance edge case coverage for model evaluation
- Add data validation tests (one possibility is sketched after the lint notes below)
- Improve assertion messages

The current implementation is approved for merging, with the suggested improvements to be considered for future iterations.
🧰 Tools
🪛 Ruff
6-6:
sklearn.metrics.roc_auc_score
imported but unusedRemove unused import:
sklearn.metrics.roc_auc_score
(F401)
175-175: No newline at end of file
Add trailing newline
(W292)
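As one concrete instance of the "data validation tests" bullet above, a hedged sketch; it relies on the `Loans` class and `sample_configs` fixture shown elsewhere in this review, and the import path and test body are assumptions:

```python
import pandas as pd
import pytest

from loans.predict_loans import Loans  # assumed import path for the reviewed class


def test_predict_rejects_untrained_model(sample_configs):
    """Edge case: predicting before perform_cv should fail fast."""
    loans = Loans(configs=sample_configs)
    data = pd.DataFrame({"id": [1], "feature1": [0.5], "cat_feature": ["A"]})

    # The reviewed predict() raises ValueError when self.models is empty.
    with pytest.raises(ValueError, match="No models have been trained"):
        loans.predict(data)
```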
notebooks/starter_notebook.py (2)
126-141: Type hint corrections for generator function.

The return type hint for generate_train_val_indices_for_cv should specify the types yielded by the generator for clarity. Consider updating the type hint to:

```python
def generate_train_val_indices_for_cv(
    X: pd.DataFrame, Y: pd.DataFrame, nfolds: int
) -> Generator[Tuple[np.ndarray, np.ndarray], None, None]:
    """
    Generates training and validation indices for cross-validation.
    """
```

🧰 Tools
🪛 Ruff
129-129: Blank line contains whitespace
(W293)
134-134: Blank line contains whitespace
(W293)
70-70: Remove trailing whitespace from blank lines.

Trailing whitespace is present on blank lines at lines 70, 74, 88, 95, 111, 116, 129, 134, 147, 152, 157, 165, and 208. This is flagged by static analysis tools and can be cleaned up for better code quality. Ensure your text editor or IDE is configured to remove trailing whitespace on save.

Also applies to: 74-74, 88-88, 95-95, 111-111, 116-116, 129-129, 134-134, 147-147, 152-152, 157-157, 165-165, 208-208
🧰 Tools
🪛 Ruff
70-70: Blank line contains whitespace
(W293)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (13)

- .gitignore (2 hunks)
- databricks.yml (1 hunks)
- logging_config.py (1 hunks)
- notebooks/intermediate_notebook.py (1 hunks)
- notebooks/prettier_notebook.py (1 hunks)
- notebooks/starter_notebook.py (1 hunks)
- project_config.yml (1 hunks)
- pyproject.toml (2 hunks)
- src/loans/data_processor.py (1 hunks)
- src/loans/helpers.py (1 hunks)
- src/loans/predict_loans.py (1 hunks)
- tests/test_evaluate.py (1 hunks)
- tests/test_loans.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- databricks.yml
🧰 Additional context used
🪛 Ruff
logging_config.py
30-30: No newline at end of file
Add trailing newline
(W292)
notebooks/intermediate_notebook.py
2-2: SyntaxError: Expected a statement
2-2: SyntaxError: Simple statements must be separated by newlines or semicolons
2-2: SyntaxError: Simple statements must be separated by newlines or semicolons
notebooks/prettier_notebook.py
2-2: SyntaxError: Expected a statement
2-2: SyntaxError: Simple statements must be separated by newlines or semicolons
2-2: SyntaxError: Simple statements must be separated by newlines or semicolons
notebooks/starter_notebook.py
2-2: SyntaxError: Expected a statement
2-2: SyntaxError: Simple statements must be separated by newlines or semicolons
2-2: SyntaxError: Simple statements must be separated by newlines or semicolons
70-70: Blank line contains whitespace
(W293)
74-74: Blank line contains whitespace
(W293)
88-88: Blank line contains whitespace
(W293)
95-95: Blank line contains whitespace
(W293)
111-111: Blank line contains whitespace
(W293)
116-116: Blank line contains whitespace
(W293)
129-129: Blank line contains whitespace
(W293)
134-134: Blank line contains whitespace
(W293)
147-147: Blank line contains whitespace
(W293)
152-152: Blank line contains whitespace
(W293)
157-157: Blank line contains whitespace
(W293)
165-165: Blank line contains whitespace
(W293)
208-208: Blank line contains whitespace
(W293)
src/loans/data_processor.py
59-59: No newline at end of file
Add trailing newline
(W292)
src/loans/helpers.py
14-14: No newline at end of file
Add trailing newline
(W292)
src/loans/predict_loans.py
207-207: Blank line contains whitespace
Remove whitespace from blank line
(W293)
229-229: Blank line contains whitespace
Remove whitespace from blank line
(W293)
298-298: No newline at end of file
Add trailing newline
(W292)
tests/test_evaluate.py
3-3: pandas
imported but unused
Remove unused import: pandas
(F401)
5-5: catboost.CatBoostClassifier
imported but unused
Remove unused import: catboost.CatBoostClassifier
(F401)
34-34: No newline at end of file
Add trailing newline
(W292)
tests/test_loans.py
6-6: sklearn.metrics.roc_auc_score
imported but unused
Remove unused import: sklearn.metrics.roc_auc_score
(F401)
175-175: No newline at end of file
Add trailing newline
(W292)
🔇 Additional comments (10)
src/loans/helpers.py (1)
1-2: LGTM!

The yaml import is appropriate for the functionality required.
project_config.yml (2)
16-19: LGTM! Good data processing configuration.

The configuration follows best practices by:
- Removing the identifier column ('id')
- Clearly specifying the target variable
32-34: LGTM! Good system configuration.

The configuration includes appropriate settings for:
- Reproducibility (fixed random state)
- Debugging support (verbose logging)
logging_config.py (1)
1-2: LGTM!

The imports are appropriate and well-organized.
tests/test_evaluate.py (1)
9-11: LGTM!

The fixture is well-defined and follows pytest best practices.
pyproject.toml (2)
2-2: Verify dependencies in notebooks.

The new dependencies (catboost, graphviz) should be properly imported and used in the notebooks mentioned in the AI summary.
```bash
#!/bin/bash
# Check for catboost and graphviz imports in notebooks
echo "Checking for CatBoost usage in notebooks:"
rg -l "import catboost|from catboost" || echo "No CatBoost imports found"
echo -e "\nChecking for Graphviz usage in notebooks:"
rg -l "import graphviz|from graphviz" || echo "No Graphviz imports found"
```
Also applies to: 6-6, 10-11, 25-26
2-2: Verify Python version compatibility with deployment environment.

The change from ==3.11 to >=3.11 makes the project more flexible, but ensure your deployment environment (Databricks) supports Python versions beyond 3.11 to prevent potential compatibility issues.
Also applies to: 6-6
✅ Verification successful
Let me gather more information about Python version requirements in the codebase.
Let me try one final verification focusing on Python version constraints and actual Python feature usage.
Python version requirement >=3.11 appears safe to use.

The codebase shows no usage of Python 3.11+ specific features, and the only version constraint found is in pyproject.toml itself. Since Databricks Runtime 13.0 and later supports Python 3.11, this version requirement should not cause compatibility issues.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Check if README or other docs specify Databricks runtime/Python version requirements
rg -i "python.*version|databricks.*runtime" README.md || echo "No version requirements found in README.md"
```
Length of output: 498
Script:
```bash
#!/bin/bash
# Check Python version requirements in other configuration files and documentation
fd -e yaml -e yml -e md -e rst | xargs rg -i "python.*version|runtime.*version"
# Check for any specific Databricks runtime mentions in notebooks
fd -e py -e ipynb | xargs rg -i "databricks.*runtime|dbr.*version"
# Check actual Python features usage that might require specific versions
ast-grep --pattern 'match $_: $$$'  # Python 3.10+ pattern matching
ast-grep --pattern 'def $_($$$) -> $_[[$_]]: $$$'  # Python 3.9+ type annotations with built-in collections
```
Length of output: 239
Script:
```bash
#!/bin/bash
# Check for Python version constraints in dependency files
rg -i "python.*=|python.*>|python.*<" requirements*.txt pyproject.toml setup.py || echo "No Python version constraints found"
# Check for modern Python features that might require newer versions
ast-grep --pattern "type $_[$_] = $_"  # Python 3.12+ type alias statements
ast-grep --pattern "async with $_: $$$"  # Check for async context managers
ast-grep --pattern "from typing import TypeAlias"  # Python 3.10+ explicit type aliases
# Look for any runtime configuration files
fd -e json -e conf -e cfg | xargs rg -i "python.*version|runtime.*version"
```
Length of output: 738
.gitignore (2)
34-34: LGTM!

The added empty line improves readability by providing clear section separation.
99-104: LGTM!

The additions for VS Code, CatBoost, and Databricks configurations are appropriate and well-organized.
notebooks/prettier_notebook.py (1)
32-37: Well-structured data preparation implementation!

The builder pattern implementation with method chaining makes the code clean and maintainable.
Actionable comments posted: 15
🧹 Outside diff range and nitpick comments (7)
.pre-commit-config.yaml (1)
18-25: LGTM! Consider enhancing the pytest configuration.

The basic pytest hook configuration is correct and will run tests before commits. Consider enhancing the configuration with these improvements:

```diff
 - id: pytest
   name: pytest
-  entry: pytest
+  entry: pytest tests/
   language: system
   types: [python]
   pass_filenames: false
+  stages: [commit]
+  args: ["-v", "--no-cov"]
```

This will:

- Explicitly specify the test directory
- Add verbosity for better feedback
- Disable coverage reporting for faster commits
- Explicitly set the commit stage
pyproject.toml (2)

Line range hint 7-20: Consider relaxing the catboost version constraint.

While most dependencies follow a pattern of allowing minor version updates (e.g., >=x.y.z, <major+1), catboost is pinned to an exact version. This might make dependency resolution more difficult and prevent getting bug fixes. Consider changing to:

```diff
- "catboost==1.2.7, <2",
+ "catboost>=1.2.7, <2",
```
22-26: Consider adding an upper bound to pip version requirement.

While specifying a minimum pip version is good, adding an upper bound would help prevent potential breaking changes in future pip releases. Consider changing to:

```diff
- "pip>=24.2",
+ "pip>=24.2,<25",
```

notebooks/intermediate_notebook.py (1)
1-7: Consider managing dependencies through cluster configuration.

Installing packages directly in notebooks can lead to inconsistency across runs and environments. Instead:
- Add dependencies to cluster configuration
- Use requirements.txt or environment.yml
- Create an init script for cluster setup
tests/test_loans.py (1)
18-23: Consider expanding the sample dataset for more robust testing.

The current sample size of 5 records might be too small for meaningful cross-validation testing, especially when split into training and validation sets. Consider increasing it to at least 20-30 records with varied distributions to better test model behavior.

```diff
 @pytest.fixture
 def sample_data():
-    X = pd.DataFrame({"feature1": [1, 2, 3, 4, 5], "cat_feature": ["A", "B", "A", "B", "A"]})
-    Y = pd.Series([0, 1, 0, 1, 0])
+    X = pd.DataFrame({
+        "feature1": np.random.normal(0, 1, 30),
+        "cat_feature": np.random.choice(["A", "B", "C"], 30)
+    })
+    Y = pd.Series(np.random.choice([0, 1], 30))
     return X, Y
```

notebooks/starter_notebook.py (2)
105-105: Remove unused parameter from documentation.

The parameter fold_ is documented but not used in the function signature. Remove this line from the docstring:

```diff
- - fold_ (int): The current fold number.
```
231-231: Remove unnecessary strict=False parameter.

strict=False is already the default behavior of zip() (the strict keyword was added in Python 3.10), so passing it explicitly is unnecessary:

```diff
- for model, x_val, y_val in zip(models, vals_x, vals_y, strict=False):
+ for model, x_val, y_val in zip(models, vals_x, vals_y):
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (14)

- .github/workflows/ci.yml (1 hunks)
- .gitignore (2 hunks)
- .pre-commit-config.yaml (1 hunks)
- README.md (1 hunks)
- logging_config.py (1 hunks)
- notebooks/intermediate_notebook.py (1 hunks)
- notebooks/prettier_notebook.py (1 hunks)
- notebooks/starter_notebook.py (1 hunks)
- pyproject.toml (2 hunks)
- src/loans/data_processor.py (1 hunks)
- src/loans/helpers.py (1 hunks)
- src/loans/predict_loans.py (1 hunks)
- tests/test_evaluate.py (1 hunks)
- tests/test_loans.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- README.md
🚧 Files skipped from review as they are similar to previous changes (5)
- logging_config.py
- notebooks/prettier_notebook.py
- src/loans/data_processor.py
- src/loans/helpers.py
- tests/test_evaluate.py
🔇 Additional comments (14)
.github/workflows/ci.yml (1)
21-21: LGTM!

The dependency installation using uv sync is correctly configured.
.pre-commit-config.yaml (1)
18-25: Verify pytest installation in the development environment.

Since this hook uses language: system, it requires pytest to be available in the system PATH.
✅ Verification successful
Let me check the content of pyproject.toml to verify how pytest is configured.
✅ Pytest is properly configured as a project dependency
The project has pytest (version >=8.3.3) listed as a dependency in pyproject.toml, which ensures it will be available in the system PATH when the dependencies are installed. The pre-commit hook configuration is valid and will work as expected.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify pytest is properly configured in the project
# Test 1: Check if pytest is listed as a dependency
rg -l "pytest" pyproject.toml
# Test 2: Check for any existing pytest configuration
fd -g "pytest.ini|conftest.py|pytest.conf"
```
Length of output: 89
Script:
```bash
#!/bin/bash
# Check pytest configuration in pyproject.toml
rg "pytest" pyproject.toml -C 3
```
Length of output: 178
pyproject.toml (2)
28-42: LGTM! Well-structured Ruff configuration.

The configuration properly separates linting and formatting settings, and includes a good selection of rules for code quality.
2-6: Verify project name references and Python version compatibility.

The project has been renamed and the Python version constraint has been relaxed. This might require updates in other parts of the codebase.
✅ Verification successful
Let me gather more information about the project structure and imports to ensure there are no remaining references to the old project name in the actual code.
Based on the verification results, I can now provide the final response:
Project renaming and Python version changes appear safe to proceed
The codebase verification shows:
- No references to the old project name "mlops-with-databricks" were found in the code or configuration files
- The module structure is properly organized under src/loans/
- The imports are standard sklearn packages that are compatible across Python 3.11+ versions
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for references to the old project name
echo "Checking for old project name references..."
rg -i "mlops-with-databricks" --type-not toml
# Check if there are any Python version specific features used
echo "Checking for Python version specific imports..."
rg "^from \w+\.\w+ import" -A 1
Length of output: 1277
Script:
#!/bin/bash
# Check imports and module references in Python files
echo "Checking Python imports and module references..."
rg "import.*mlops|from.*mlops"
# Check setup.py and other configuration files
echo "Checking setup.py and other configs..."
fd -e py -e cfg -e ini -e yaml -e yml -x grep -l "mlops-with-databricks" {} \;
# Check the actual module structure
echo "Checking module structure..."
fd -t d . -d 2
Length of output: 426
.gitignore (2)
99-103: LGTM! Appropriate entries for development tools.

The additions for VS Code history, CatBoost logs, and Databricks configurations are appropriate and follow best practices for excluding tool-specific files and directories.
105-105: Consider tracking uv.lock file in version control.

The uv.lock file should typically be committed to version control to ensure dependency reproducibility, similar to how requirements.txt or poetry.lock files are tracked.
notebooks/intermediate_notebook.py (5)
10-21: LGTM! Well-structured imports with proper logging setup.

Good practice to initialize logging at the start of the notebook.
25-29: Previously identified path-related issues remain.

The manual path manipulation and relative path usage issues identified in previous reviews are still present.
46-54: Previously identified cross-validation issue remains.

The concern about using only 2 folds for cross-validation remains valid; a more typical setup is sketched below.
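For reference, a minimal sketch of a more conventional setup using scikit-learn's `StratifiedKFold` with five folds; the variable names follow the notebook's conventions, and the data here is illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Illustrative data standing in for the notebook's X (features) and Y (target).
X = pd.DataFrame({"feature1": np.random.normal(size=100)})
Y = pd.Series(np.random.randint(0, 2, size=100))

# Five stratified folds: each validation split preserves the class balance,
# which matters for a binary loan_status target.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, Y):
    x_train, x_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = Y.iloc[train_idx], Y.iloc[val_idx]
```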
58-70: Previously identified test preprocessing inconsistency remains.

The test data preprocessing pipeline is still missing steps compared to training data.
Let's verify the preprocessing steps in both pipelines:
```bash
#!/bin/bash
# Compare preprocessing steps between training and test data
echo "Training pipeline:"
rg -A 5 "builder = \("
echo "\nTest pipeline:"
rg -A 5 "test_builder = test_builder"
```
33-42: 🛠️ Refactor suggestion

Consider simplifying the builder pattern implementation.

While the builder pattern is well-implemented, the intermediate assignment is unnecessary; the chain can start directly from the constructor:

```diff
-builder = DataBuilder()
-
 builder = (
-    builder.load_data(configs.get("train_file_path"))
+    DataBuilder()
+    .load_data(configs.get("train_file_path"))
     .drop_columns(configs.get("dropped_columns"))
     .separate_features_and_target(configs.get("target_column"))
 )
```

Also, the previously identified need for data validation steps remains valid.
notebooks/starter_notebook.py (1)
24-35: Exception handling for YAML file operations.

A previous review already suggested adding exception handling to this function; a sketch of what that could look like follows.
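For illustration, a minimal sketch of such exception handling, assuming the helper wraps PyYAML's `yaml.safe_load`; the function name `open_yaml_file` is an assumption for the example:

```python
import logging

import yaml

logger = logging.getLogger(__name__)


def open_yaml_file(path: str) -> dict:
    """Load a YAML config, surfacing clear errors for the common failure modes."""
    try:
        with open(path, "r") as handle:
            return yaml.safe_load(handle)
    except FileNotFoundError:
        logger.error(f"Configuration file not found: {path}")
        raise
    except yaml.YAMLError as exc:
        # Malformed YAML; exc includes the parse location.
        logger.error(f"Failed to parse YAML in {path}: {exc}")
        raise
```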
src/loans/predict_loans.py (2)
1-11: LGTM! Well-structured imports and logging setup.

The imports are appropriately organized and include all necessary dependencies for ML functionality.
207-238: LGTM! Clear and concise getter methods.

The getter methods are well-implemented and properly documented.
```python
class Loans:
    """
    Loans class for training, evaluating, and predicting loan status using machine learning models.
    """

    def __init__(
        self,
        configs: Dict[str, Any],
        evaluator: Optional[Evaluator] = None,
        model_class: Type = CatBoostClassifier,
    ) -> None:
        """
        Initialize the Loans class with configuration parameters.

        Parameters
        ----------
        configs : dict
            A dictionary containing configuration parameters.
        evaluator : Evaluator, optional
            An Evaluator object for evaluating the model.
        model_class : Type, default=CatBoostClassifier
            The class of the model to use.
        """
        self.configs = configs
        self.model_class = model_class
        self.model_params = configs.get("model_params", {})
        self.categorical_variables = configs.get("categorical_variables", [])
        self.verbose = configs.get("model_verbose", True)
        self.evaluator = evaluator or Evaluator()
        self.models: List[Any] = []
        self.vals_x: List[pd.DataFrame] = []
        self.vals_y: List[pd.Series] = []
```
🛠️ Refactor suggestion

Add configuration validation in __init__.

Consider validating the configuration parameters to ensure required fields are present and have appropriate types/values.

```diff
 def __init__(
     self,
     configs: Dict[str, Any],
     evaluator: Optional[Evaluator] = None,
     model_class: Type = CatBoostClassifier,
 ) -> None:
+    configs.setdefault("model_params", {})
+    configs.setdefault("categorical_variables", [])
+    if not isinstance(configs["categorical_variables"], list):
+        raise ValueError("categorical_variables must be a list")
     self.configs = configs
     self.model_class = model_class
     self.model_params = configs.get("model_params", {})
```

Committable suggestion was skipped due to low confidence.
```python
def predict(self, data: pd.DataFrame) -> pd.DataFrame:
    """
    Generate predictions from the first model.

    Parameters
    ----------
    data : pd.DataFrame
        A pandas DataFrame containing the features for prediction, including an 'id' column.

    Returns
    -------
    pd.DataFrame
        A pandas DataFrame with two columns: 'id' and 'loan_status',
        where 'loan_status' is the prediction from the first model.
    """
    if "id" not in data.columns:
        raise ValueError("Data must contain an 'id' column.")
    if not self.models:
        raise ValueError("No models have been trained. Please call perform_cv first.")

    data_without_id = data.drop(columns=["id"])
    model = self.models[0]  # Use only the first model
    if hasattr(model, "predict_proba"):
        predictions = model.predict_proba(data_without_id)[:, 1]
    else:
        predictions = model.predict(data_without_id)
    return pd.DataFrame({"id": data["id"], "loan_status": predictions})

def predict_cv(self, data: pd.DataFrame) -> pd.DataFrame:
    """
    Generate mean predictions from all cross-validated models.

    Parameters
    ----------
    data : pd.DataFrame
        A pandas DataFrame containing the features for prediction, including an 'id' column.

    Returns
    -------
    pd.DataFrame
        A pandas DataFrame with two columns: 'id' and 'loan_status',
        where 'loan_status' is the mean prediction from all models.
    """
    if "id" not in data.columns:
        raise ValueError("Data must contain an 'id' column.")
    if not self.models:
        raise ValueError("No models have been trained. Please call perform_cv first.")

    data_without_id = data.drop(columns=["id"])
    predictions_list = []
    for model in self.models:
        if hasattr(model, "predict_proba"):
            predictions = model.predict_proba(data_without_id)[:, 1]
        else:
            predictions = model.predict(data_without_id)
        predictions_list.append(predictions)
    predictions_mean = np.mean(predictions_list, axis=0)
    return pd.DataFrame({"id": data["id"], "loan_status": predictions_mean})
```
🛠️ Refactor suggestion

Consider adding prediction confidence metrics.

For the CV predictions, consider including confidence metrics (e.g., standard deviation) along with the mean predictions to provide insight into model uncertainty.

```diff
 def predict_cv(self, data: pd.DataFrame) -> pd.DataFrame:
     # ... existing code ...
     predictions_mean = np.mean(predictions_list, axis=0)
+    predictions_std = np.std(predictions_list, axis=0)
-    return pd.DataFrame({"id": data["id"], "loan_status": predictions_mean})
+    return pd.DataFrame({
+        "id": data["id"],
+        "loan_status": predictions_mean,
+        "prediction_std": predictions_std
+    })
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
def predict(self, data: pd.DataFrame) -> pd.DataFrame:
    """
    Generate predictions from the first model.

    Parameters
    ----------
    data : pd.DataFrame
        A pandas DataFrame containing the features for prediction, including an 'id' column.

    Returns
    -------
    pd.DataFrame
        A pandas DataFrame with two columns: 'id' and 'loan_status',
        where 'loan_status' is the prediction from the first model.
    """
    if "id" not in data.columns:
        raise ValueError("Data must contain an 'id' column.")
    if not self.models:
        raise ValueError("No models have been trained. Please call perform_cv first.")

    data_without_id = data.drop(columns=["id"])
    model = self.models[0]  # Use only the first model
    if hasattr(model, "predict_proba"):
        predictions = model.predict_proba(data_without_id)[:, 1]
    else:
        predictions = model.predict(data_without_id)
    return pd.DataFrame({"id": data["id"], "loan_status": predictions})

def predict_cv(self, data: pd.DataFrame) -> pd.DataFrame:
    """
    Generate mean predictions from all cross-validated models.

    Parameters
    ----------
    data : pd.DataFrame
        A pandas DataFrame containing the features for prediction, including an 'id' column.

    Returns
    -------
    pd.DataFrame
        A pandas DataFrame with 'id', 'loan_status' (the mean prediction across
        all models), and 'prediction_std' (the standard deviation across models).
    """
    if "id" not in data.columns:
        raise ValueError("Data must contain an 'id' column.")
    if not self.models:
        raise ValueError("No models have been trained. Please call perform_cv first.")

    data_without_id = data.drop(columns=["id"])
    predictions_list = []
    for model in self.models:
        if hasattr(model, "predict_proba"):
            predictions = model.predict_proba(data_without_id)[:, 1]
        else:
            predictions = model.predict(data_without_id)
        predictions_list.append(predictions)
    predictions_mean = np.mean(predictions_list, axis=0)
    predictions_std = np.std(predictions_list, axis=0)
    return pd.DataFrame({
        "id": data["id"],
        "loan_status": predictions_mean,
        "prediction_std": predictions_std
    })
```
week 1 files

Summary by CodeRabbit

New Features

- Introduced a databricks.yml configuration file for project asset management.
- Added a DataBuilder class for efficient data handling and preprocessing.
- Added an Evaluator class to assess model predictions with customizable metrics.
- Added a project_config.yml file for key machine learning project configurations.

Bug Fixes

- Updated .gitignore to exclude unnecessary files and directories from version control.

Tests

- Added tests for the Evaluator and Loans classes to ensure functionality and reliability.

Chores

- Updated project dependencies and configuration in pyproject.toml.