Week3 #3

Open

wants to merge 4 commits into main
Conversation

netojoseaugusto
Collaborator

@netojoseaugusto netojoseaugusto commented Nov 24, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced functionality for serving machine learning predictions and models using Databricks.
    • Implemented A/B testing for CatBoost classifiers with performance logging.
  • Improvements

    • Enhanced model serving capabilities with concurrent request handling and performance metrics collection.
    • Added new feature lookup capabilities for improved data management.
  • Dependencies

    • Updated and added specific versions for key dependencies to ensure compatibility and stability.

@netojoseaugusto netojoseaugusto requested a review from a team as a code owner November 24, 2024 20:01

coderabbitai bot commented Nov 24, 2024

Walkthrough

The changes introduce several new Python scripts for serving machine learning models and predictions using Databricks. Each script initializes necessary libraries, retrieves configuration settings from YAML files, and sets up serving endpoints for different models. The scripts also implement functionality for sending requests, measuring execution times, and handling concurrent requests. Additionally, modifications were made to the pyproject.toml file to update and add dependencies relevant to the new functionalities.

Changes

File Change Summary
notebooks/week3/01.feature_serving.py Introduced functionality for serving predictions, including loading datasets, generating predictions, and creating serving endpoints. Added send_request() function.
notebooks/week3/02.model_serving.py Implemented model serving for a CatBoost model, including sending requests and measuring performance. Added send_request() function.
notebooks/week3/03.model_serving_feature_lookup.py Created functionality for serving models with feature lookups, including online table specifications and request handling. Added open_yaml_file(file_path) in loans.helpers, and several new classes in databricks.sdk.service.
notebooks/week3/04.AB_test_model_serving.py Developed A/B testing workflow for CatBoost classifiers, including model training, logging to MLflow, and serving endpoints. Added LoansModelWrapper class and predict() method.
pyproject.toml Updated mlflow and databricks-feature-engineering dependencies to fixed versions; added new dependencies for databricks-feature-lookup, databricks-sdk, and catboost.

Poem

🐇 In the land of code, where models play,
New scripts are born, brightening the day.
With serving endpoints, they dance and sing,
Predictions and features, oh what joy they bring!
So hop along, let’s celebrate this feat,
For in the world of data, we’re never beat! 🎉


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 21

🧹 Outside diff range and nitpick comments (9)
notebooks/week3/03.model_serving_feature_lookup.py (1)

1-126: Consider architectural improvements for production readiness

While the basic functionality is implemented, consider these architectural improvements for production:

  1. Move all configuration to a proper configuration management system
  2. Implement proper logging instead of print statements
  3. Add monitoring and metrics collection for the serving endpoint
  4. Consider implementing circuit breakers for external service calls
  5. Add proper documentation for maintenance and troubleshooting

Would you like assistance in implementing any of these improvements?
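As an illustration of point 2, a print-to-logging migration could be sketched as below (the logger name, format, and helper are illustrative, not taken from the PR):

```python
import logging

# Configure a module-level logger once, instead of scattering print() calls.
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("model_serving")

def log_response(status_code: int, execution_time: float) -> str:
    """Log a serving-endpoint response at an appropriate level and return the message."""
    message = f"status={status_code} latency={execution_time:.4f}s"
    if status_code >= 400:
        logger.error(message)
    else:
        logger.info(message)
    return message

log_response(200, 0.1234)
```

Unlike bare prints, this lets the notebook raise the level to WARNING in production to silence per-request chatter without code changes.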

notebooks/week3/02.model_serving.py (4)

2-2: Use Relative Paths for Package Installation

Using an absolute path for package installation may cause issues when running the notebook in different environments. Consider using a relative path or installing the package from a package repository to improve portability.

Apply this diff to use a relative path:

-# MAGIC %pip install /Volumes/mlops_students/netojoseaugusto/package/loans-0.0.1-py3-none-any.whl
+# MAGIC %pip install ../../package/loans-0.0.1-py3-none-any.whl

93-95: Avoid Printing Full Response Text

Printing the full response text may expose sensitive information. Consider logging selectively or using appropriate logging levels.

Apply this diff to print a summary of the response:

-print("Response status:", response.status_code)
-print("Reponse text:", response.text)
+print(f"Response status: {response.status_code}")
+print(f"Response content length: {len(response.content)} bytes")
 print("Execution time:", execution_time, "seconds")

124-125: Be Cautious with High Concurrency Levels

Using a high number of concurrent requests may overload the serving endpoint or lead to throttling. Consider reducing max_workers or implementing rate limiting.

Apply this diff to reduce the number of workers:

-with ThreadPoolExecutor(max_workers=100) as executor:
+with ThreadPoolExecutor(max_workers=20) as executor:
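Either way, the bounded-concurrency pattern can be sketched independently of the notebook (the `run_bounded` helper is hypothetical, for illustration only):

```python
from concurrent.futures import ThreadPoolExecutor

def run_bounded(task, num_requests: int, max_workers: int = 20):
    """Run `task` num_requests times with at most max_workers threads in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(task, i) for i in range(num_requests)]
        # Collect results in submission order; the pool queues excess tasks
        # rather than opening more than max_workers concurrent requests.
        return [f.result() for f in futures]

results = run_bounded(lambda i: i * 2, num_requests=10)
```

Capping in-flight requests this way protects the endpoint from a thundering herd even when `num_requests` is large.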

135-138: Round Latency Metrics for Readability

For better readability of latency metrics, consider rounding the values to a reasonable number of decimal places.

Apply this diff to round the output values:

 print("\nTotal execution time:", total_execution_time, "seconds")
-print("Average latency per request:", average_latency, "seconds")
+print("Average latency per request:", round(average_latency, 4), "seconds")
notebooks/week3/01.feature_serving.py (1)

206-212: Adjust max_workers to match num_requests for optimal thread utilization

The max_workers is set to 100, but num_requests is 10. This means the thread pool is larger than necessary, which can be inefficient.

Consider adjusting max_workers to match num_requests:

-with ThreadPoolExecutor(max_workers=100) as executor:
+num_workers = min(num_requests, 100)  # don't spawn more threads than requests
+with ThreadPoolExecutor(max_workers=num_workers) as executor:
     futures = [executor.submit(send_request) for _ in range(num_requests)]

Alternatively, if you plan to increase num_requests, ensure that max_workers is appropriately set.

notebooks/week3/04.AB_test_model_serving.py (3)

2-2: Consider parameterizing the package installation path

The hardcoded installation path /Volumes/mlops_students/netojoseaugusto/package/loans-0.0.1-py3-none-any.whl might not be portable across different environments. Consider parameterizing the path or using a relative path to enhance portability and maintainability.
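One way to parameterize it is a small helper that builds the path from the current user (the helper and the way the username is obtained are assumptions for illustration; in Databricks it could come from, e.g., `spark.sql("SELECT current_user()")`):

```python
def build_wheel_path(username: str, package: str = "loans-0.0.1-py3-none-any.whl") -> str:
    """Build the Volumes wheel path for a given user instead of hardcoding it.

    Kept as a plain function of `username` so it stays testable outside Databricks.
    """
    return f"/Volumes/mlops_students/{username}/package/{package}"

# Hypothetical notebook usage:
#   wheel = build_wheel_path(current_username)
#   %pip install {wheel}
print(build_wheel_path("netojoseaugusto"))
```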


100-100: Capture git_sha dynamically

Currently, the git_sha is hardcoded as "ffa63b430205ff7". To ensure accuracy and ease of maintenance, consider capturing the Git SHA dynamically from the repository. This ensures that the logged git_sha reflects the current state of the code.

Apply this change to dynamically obtain the git_sha:

-git_sha = "ffa63b430205ff7"
+import subprocess
+
+git_sha = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode("utf-8").strip()

316-316: Typo in log message

There's a typo in the print statement: "Reponse text" should be "Response text".

Apply this change:

-print("Reponse text:", response.text)
+print("Response text:", response.text)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between fd22390 and e99d88c.

📒 Files selected for processing (5)
  • notebooks/week3/01.feature_serving.py (1 hunks)
  • notebooks/week3/02.model_serving.py (1 hunks)
  • notebooks/week3/03.model_serving_feature_lookup.py (1 hunks)
  • notebooks/week3/04.AB_test_model_serving.py (1 hunks)
  • pyproject.toml (1 hunks)
🔇 Additional comments (6)
notebooks/week3/03.model_serving_feature_lookup.py (1)

23-28: LGTM!

The Spark session and Databricks client initialization follows best practices.

notebooks/week3/02.model_serving.py (1)

52-60: ⚠️ Potential issue

Ensure Consistency Between Served Entity and Route Configuration

Verify that the served_model_name in the Route correctly references the served entity and version. Inconsistencies may lead to deployment issues.

Apply this diff to correct the served_model_name if needed:

Route(served_model_name="catboost_model_basic-2", traffic_percentage=100)

Ensure that "catboost_model_basic-2" matches the entity_name and entity_version specified.

Run the following script to confirm the available served models and versions:

✅ Verification successful

The served model name correctly matches the entity configuration

The code is correctly configured. The served_model_name="catboost_model_basic-2" in the route configuration matches the entity configuration where:

  • The base name comes from entity_name=f"{catalog_name}.{schema_name}.catboost_model_basic"
  • The suffix -2 matches the entity_version=2

The naming convention follows the expected pattern of <model_name>-<version> and is consistent across the configuration.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: List available models and versions in the workspace.

# Test: List models under the specified catalog and schema.
databricks mlflow models list --filter "name LIKE '%catboost_model_basic%'"

Length of output: 184


Script:

#!/bin/bash
# Let's analyze the code to verify the entity name and version consistency
# Search for similar model serving configurations in the codebase
rg -A 5 "served_model_name.*catboost" --type py

# Also search for entity name patterns
rg -A 5 "entity_name.*catboost" --type py

Length of output: 1916

notebooks/week3/01.feature_serving.py (4)

95-100: Consider setting perform_full_copy to True for initial online table creation

When creating an online table, perform_full_copy is set to False. This means that only changes from the Change Data Feed will be applied, and existing data may not be copied to the online store.

If you intend to have the current data available in the online table immediately, set perform_full_copy to True during the initial creation:

spec = OnlineTableSpec(
    primary_key_columns=["id"],
    source_table_full_name=feature_table_name,
    run_triggered=OnlineTableSpecTriggeredSchedulingPolicy.from_dict({"triggered": "true"}),
-    perform_full_copy=False,
+    perform_full_copy=True,
)

Please verify whether this change aligns with your data availability requirements.


36-36: Confirm that the MLflow registry URI is correctly set

The MLflow registry URI is set to "databricks-uc". Ensure that this is the intended URI for your environment.

If you're using Unity Catalog with MLflow, "databricks-uc" is appropriate. Otherwise, you might need to adjust it.


178-182: Ensure that id_list is correctly populated from the Spark DataFrame

The code uses preds_df.select("Id").rdd.flatMap(lambda x: x).collect() to create id_list. Confirm that the column "Id" exists in preds_df.

If the column is named "id" in your DataFrame, adjust the code accordingly:

-id_list = preds_df.select("Id").rdd.flatMap(lambda x: x).collect()
+id_list = preds_df.select("id").rdd.flatMap(lambda x: x).collect()

76-79: ⚠️ Potential issue

Inconsistency in column naming between 'id' and 'Id'

There is a discrepancy in the casing of the 'id' column. In the Pandas DataFrame df, the column is likely named 'id', but in later operations, 'Id' with an uppercase 'I' is used. This can lead to key errors or mismatches during feature lookup and serving.

To ensure consistency, standardize on a single casing. For example, to standardize on 'Id' (matching the later feature lookups), rename the column when building the predictions DataFrame:

-preds_df = df[["id", "person_age", "person_income"]]
+preds_df = df.rename(columns={"id": "Id"})[["Id", "person_age", "person_income"]]
 preds_df["predicted_loan_status"] = pipeline.predict(df[continuous_variables + categorical_variables])

Alternatively, if you prefer 'id', update the FeatureLookup and request payload accordingly.

Likely invalid or redundant comment.

Comment on lines 25 to 29
dev = ["databricks-connect>=15.4.1, <16",
"databricks-sdk>=0.32.0, <0.33",
"databricks-sdk==0.32.0",
"ipykernel>=6.29.5, <7",
"catboost==1.2.7",
"pip>=24.2",

🛠️ Refactor suggestion

Remove redundant dependencies and align versions

The optional dependencies section contains duplicates from the main dependencies:

  • databricks-sdk==0.32.0
  • catboost==1.2.7

These packages are already specified in the main dependencies section with the same versions.

Apply this diff to remove the redundant entries:

dev = ["databricks-connect>=15.4.1, <16",
-      "databricks-sdk==0.32.0",
       "ipykernel>=6.29.5, <7",
-      "catboost==1.2.7",
       "pip>=24.2",
       "pytest>=8.3.3"]
Comment on lines +7 to +22
dependencies = [
"lightgbm>=4.5.0, <5",
"scikit-learn>=1.5.1, <2",
"cloudpickle>=3.0.0, <4",
"catboost==1.2.7,<2",
"graphviz==0.20.3, <1",
"mlflow==2.17.0",
"numpy>=1.26.4, <2",
"pandas>=2.2.2, <3",
"cffi>=1.17.1, <2",
"scipy>=1.14.1, <2",
"matplotlib>=3.9.2, <4",
"databricks-feature-engineering==0.6",
"databricks-feature-lookup==1.2.0",
"databricks-sdk==0.32.0",
]

💡 Codebase verification

Update MLflow and Databricks dependencies to latest secure versions

Several dependencies have newer versions available with important security fixes:

  • MLflow 2.17.0 -> 2.18.0 (current version is vulnerable to path traversal attacks)
  • databricks-feature-engineering 0.6 -> 0.7.0
  • databricks-sdk 0.32.0 -> 0.38.0

Other pinned versions are current:

  • catboost 1.2.7 (latest)
  • databricks-feature-lookup 1.2.0 (latest)

Consider:

  1. Updating MLflow to at least 2.18.0 to address security vulnerabilities
  2. Updating Databricks packages to their latest versions for compatibility
  3. Adding version ranges (e.g., >=2.18.0, <3) to allow future security patches while maintaining compatibility
🔗 Analysis chain

Review dependency version management strategy

The dependencies section shows inconsistent version pinning strategies:

  • Some packages use fixed versions (e.g., mlflow==2.17.0, catboost==1.2.7)
  • Others use ranges (e.g., lightgbm>=4.5.0, <5)

While fixed versions for ML frameworks ensure reproducibility, consider:

  1. Using version ranges for utility packages to allow security updates
  2. Documenting the version pinning strategy in README.md

Let's verify the latest stable versions and any security advisories for the pinned dependencies:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check latest versions and security advisories for critical dependencies

# Check PyPI for latest versions
echo "Checking latest versions..."
for pkg in mlflow catboost databricks-feature-engineering databricks-feature-lookup databricks-sdk
do
  echo "$pkg:"
  curl -s "https://pypi.org/pypi/$pkg/json" | jq -r '.info.version'
done

# Check for security advisories
echo "Checking security advisories..."
for pkg in mlflow catboost databricks-feature-engineering databricks-feature-lookup databricks-sdk
do
  gh api graphql -f query='{
    securityVulnerabilities(first: 5, ecosystem: PIP, package: "'$pkg'") {
      nodes {
        advisory {
          summary
          severity
          publishedAt
        }
        vulnerableVersionRange
      }
    }
  }'
done

Length of output: 4186

@@ -0,0 +1,126 @@
# Databricks notebook source
# MAGIC %pip install /Volumes/mlops_students/netojoseaugusto/package/loans-0.0.1-py3-none-any.whl

⚠️ Potential issue

Avoid hardcoding user-specific paths

The wheel file path contains a hardcoded username netojoseaugusto. This could break when other users try to run the notebook.

Consider using environment variables or notebook parameters:

-# MAGIC %pip install /Volumes/mlops_students/netojoseaugusto/package/loans-0.0.1-py3-none-any.whl
+# MAGIC %pip install /Volumes/mlops_students/${USER}/package/loans-0.0.1-py3-none-any.whl

Comment on lines +74 to +75
token = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
host = spark.conf.get("spark.databricks.workspaceUrl")

⚠️ Potential issue

Validate authentication credentials

The API token and host URL are retrieved without validation.

Add validation:

token = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
host = spark.conf.get("spark.databricks.workspaceUrl")
+if not token:
+    raise ValueError("Failed to retrieve API token")
+if not host:
+    raise ValueError("Failed to retrieve workspace URL")

Comment on lines +57 to +69
workspace.serving_endpoints.create(
    name="catboost-loans-serving-fe",
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name=f"{catalog_name}.{schema_name}.catboost_fe",
                scale_to_zero_enabled=True,
                workload_size="Small",
                entity_version=1,
            )
        ]
    ),
)

⚠️ Potential issue

Add error handling and make endpoint name configurable

The endpoint creation lacks error handling and uses a hardcoded name.

Consider these improvements:

+endpoint_name = configs.get("endpoint_name", "catboost-loans-serving-fe")
+try:
    workspace.serving_endpoints.create(
-        name="catboost-loans-serving-fe",
+        name=endpoint_name,
        config=EndpointCoreConfigInput(
            served_entities=[
                ServedEntityInput(
                    entity_name=f"{catalog_name}.{schema_name}.catboost_fe",
                    scale_to_zero_enabled=True,
                    workload_size="Small",
                    entity_version=1,
                )
            ]
        ),
    )
+except Exception as e:
+    raise RuntimeError(f"Failed to create serving endpoint: {str(e)}")

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +153 to +167
start_time = time.time()
serving_endpoint = f"https://{host}/serving-endpoints/loans-feature-serving/invocations"
response = requests.post(
    f"{serving_endpoint}",
    headers={"Authorization": f"Bearer {token}"},
    json={"dataframe_records": [{"Id": "182"}]},
)

end_time = time.time()
execution_time = end_time - start_time

print("Response status:", response.status_code)
print("Reponse text:", response.text)
print("Execution time:", execution_time, "seconds")


🛠️ Refactor suggestion

Add error handling for HTTP requests to the serving endpoint

The current implementation does not handle potential errors from the requests.post() call, such as network timeouts or HTTP errors. Adding error handling will make the code more robust.

Consider wrapping the request in a try-except block and checking for HTTP errors:

start_time = time.time()
serving_endpoint = f"https://{host}/serving-endpoints/loans-feature-serving/invocations"

try:
    response = requests.post(
        serving_endpoint,
        headers={"Authorization": f"Bearer {token}"},
        json={"dataframe_records": [{"Id": "182"}]},
        timeout=10  # Optional: set a timeout
    )
    response.raise_for_status()  # Raise an exception for HTTP errors
    print("Response status:", response.status_code)
    print("Response text:", response.text)
except requests.exceptions.RequestException as e:
    print("An error occurred:", e)
finally:
    end_time = time.time()
    execution_time = end_time - start_time
    print("Execution time:", execution_time, "seconds")

from pyspark.dbutils import DBUtils
from pyspark.sql import SparkSession

from loans.helpers import open_yaml_file

💡 Codebase verification

Warning: Duplicate function definition found

The function open_yaml_file is defined in two locations:

  • src/loans/helpers.py
  • notebooks/week1/starter_notebook.py

This could lead to potential conflicts. The import in notebooks/week3/01.feature_serving.py correctly references the function from loans.helpers, but having duplicate definitions is not recommended.

Consider removing the duplicate definition from starter_notebook.py and using the one from loans.helpers consistently across the codebase.

🔗 Analysis chain

Ensure open_yaml_file is properly defined and imported

The function open_yaml_file is imported from loans.helpers. Verify that this module and function exist and are accessible in the environment.

Run the following script to check if open_yaml_file is defined once and imported correctly:

Ensure that there are no naming conflicts or missing modules.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the definition and import of `open_yaml_file`

# Find the definition of `open_yaml_file`
rg -t py "def open_yaml_file" -A 2

# Check for imports of `open_yaml_file`
rg -t py "from loans.helpers import open_yaml_file"

Length of output: 1246

Comment on lines +111 to +112
lookup_key="Id",
feature_names=["person_age", "person_income", "predicted_loan_status"],

⚠️ Potential issue

Update lookup_key and feature references to match column naming

The lookup_key is set to "Id", but if your DataFrame uses "id" (all lowercase), this will cause mismatches during feature serving. Similarly, the feature names must match the exact column names.

Adjust the lookup_key and feature_names to use consistent casing:

lookup_key="id",
feature_names=["person_age", "person_income", "predicted_loan_status"],

Or, if standardizing to "Id" as per previous comment, ensure the DataFrame columns are renamed accordingly.

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +148 to +149
id_list = preds_df["id"]


⚠️ Potential issue

Potential error accessing 'id' column on a Spark DataFrame

After converting preds_df to a Spark DataFrame in line 79, attempting to access preds_df["id"] will raise an error because Spark DataFrames do not support item access using square brackets.

To collect the 'Id' values from a Spark DataFrame, use the select and collect methods:

-id_list = preds_df["id"]
+id_list = preds_df.select("Id").rdd.flatMap(lambda x: x).collect()

Ensure that the column name matches the DataFrame's column name, considering any previous renaming for consistency.

Committable suggestion skipped: line range outside the PR's diff.


# Log parameters, metrics, and the model to MLflow
mlflow.log_param("model_type", "Catboost")
mlflow.log_params(parameters)

⚠️ Potential issue

Logging incorrect parameters for Model B

While training Model B, the code logs parameters instead of parameters_2. Since parameters_2 contains the actual parameters used for Model B, this could lead to confusion during analysis.

Apply the following change to log the correct parameters:

-mlflow.log_params(parameters)
+mlflow.log_params(parameters_2)

"scipy>=1.14.1, <2",
"matplotlib>=3.9.2, <4",
"databricks-feature-engineering>=0.6, <1"]
dependencies = [


Very strict versioning is required if you use the feature-engineering package and want to deploy an endpoint (including pandas and other packages; pyarrow must also be pinned to 14.0.2, since the default would be 15.x and model endpoint deployment would fail).
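As an illustration, the pin could look like this (the pyarrow version is from the comment above; the surrounding entries mirror the PR's pyproject.toml and are not an endorsement of those exact versions):

```toml
dependencies = [
    "databricks-feature-engineering==0.6",
    "pandas>=2.2.2, <3",
    "pyarrow==14.0.2",  # default would resolve to 15.x and break endpoint deployment
]
```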
