Week3 #3
base: main
Conversation
Walkthrough
The changes introduce several new Python scripts for serving machine learning models and predictions using Databricks. Each script initializes the necessary libraries, retrieves configuration settings from YAML files, and sets up serving endpoints for different models. The scripts also implement functionality for sending requests, measuring execution times, and handling concurrent requests. Additionally, modifications were made to the pyproject.toml dependencies.
Actionable comments posted: 21
🧹 Outside diff range and nitpick comments (9)
notebooks/week3/03.model_serving_feature_lookup.py (1)
1-126: Consider architectural improvements for production readiness
While the basic functionality is implemented, consider these architectural improvements for production:
- Move all configuration to a proper configuration management system
- Implement proper logging instead of print statements (a minimal sketch follows below)
- Add monitoring and metrics collection for the serving endpoint
- Consider implementing circuit breakers for external service calls
- Add proper documentation for maintenance and troubleshooting
Would you like assistance in implementing any of these improvements?
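On the logging point, a minimal sketch of what replacing the print statements could look like, using only the standard library logging module (the logger name and message fields are illustrative):
import logging

# Configure a module-level logger once, instead of scattering print statements.
logger = logging.getLogger("model_serving_feature_lookup")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s - %(message)s")

def log_endpoint_call(status_code: int, execution_time: float) -> None:
    """Log the outcome of a serving-endpoint call at an appropriate level."""
    if status_code == 200:
        logger.info("Endpoint call succeeded in %.3f s", execution_time)
    else:
        logger.warning("Endpoint call returned status %s after %.3f s", status_code, execution_time)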
notebooks/week3/02.model_serving.py (4)
2-2: Use Relative Paths for Package Installation
Using an absolute path for package installation may cause issues when running the notebook in different environments. Consider using a relative path or installing the package from a package repository to improve portability.
Apply this diff to use a relative path:
-# MAGIC %pip install /Volumes/mlops_students/netojoseaugusto/package/loans-0.0.1-py3-none-any.whl
+# MAGIC %pip install ../../package/loans-0.0.1-py3-none-any.whl
93-95: Avoid Printing Full Response Text
Printing the full response text may expose sensitive information. Consider logging selectively or using appropriate logging levels.
Apply this diff to print a summary of the response:
-print("Response status:", response.status_code)
-print("Reponse text:", response.text)
+print(f"Response status: {response.status_code}")
+print(f"Response content length: {len(response.content)} bytes")
print("Execution time:", execution_time, "seconds")
124-125: Be Cautious with High Concurrency Levels
Using a high number of concurrent requests may overload the serving endpoint or lead to throttling. Consider reducing max_workers or implementing rate limiting (a sketch of a throttled client follows below).
Apply this diff to reduce the number of workers:
-with ThreadPoolExecutor(max_workers=100) as executor:
+with ThreadPoolExecutor(max_workers=20) as executor:
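If you would rather keep a large pool but cap the request rate, here is a hedged sketch using a semaphore to bound in-flight calls; it assumes the notebook's existing send_request and num_requests, and the cap of 20 is an arbitrary illustration:
from concurrent.futures import ThreadPoolExecutor
from threading import Semaphore

max_in_flight = Semaphore(20)  # allow at most 20 concurrent calls to the endpoint

def throttled_request():
    # Acquire a slot before calling the endpoint, release it when done.
    with max_in_flight:
        return send_request()

with ThreadPoolExecutor(max_workers=100) as executor:
    futures = [executor.submit(throttled_request) for _ in range(num_requests)]
    results = [f.result() for f in futures]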
135-138: Round Latency Metrics for Readability
For better readability of latency metrics, consider rounding the values to a reasonable number of decimal places.
Apply this diff to round the output values:
print("\nTotal execution time:", total_execution_time, "seconds")
-print("Average latency per request:", average_latency, "seconds")
+print("Average latency per request:", round(average_latency, 4), "seconds")
notebooks/week3/01.feature_serving.py (1)
206-212: Adjust max_workers to match num_requests for optimal thread utilization
The max_workers is set to 100, but num_requests is 10. This means the thread pool is larger than necessary, which can be inefficient.
Consider adjusting max_workers to match num_requests:
with ThreadPoolExecutor(max_workers=100) as executor:
+    num_workers = min(num_requests, 100)  # Ensure we don't exceed 100 workers
-    futures = [executor.submit(send_request) for _ in range(num_requests)]
+    futures = [executor.submit(send_request) for _ in range(num_workers)]
Alternatively, if you plan to increase num_requests, ensure that max_workers is appropriately set.
notebooks/week3/04.AB_test_model_serving.py (3)
2-2: Consider parameterizing the package installation path
The hardcoded installation path /Volumes/mlops_students/netojoseaugusto/package/loans-0.0.1-py3-none-any.whl might not be portable across different environments. Consider parameterizing the path or using a relative path to enhance portability and maintainability.
100-100: Capture git_sha dynamically
Currently, the git_sha is hardcoded as "ffa63b430205ff7". To ensure accuracy and ease of maintenance, consider capturing the Git SHA dynamically from the repository. This ensures that the logged git_sha reflects the current state of the code.
Apply this change to dynamically obtain the git_sha:
-import subprocess
-git_sha = "ffa63b430205ff7"
+import subprocess
+git_sha = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode("utf-8").strip()
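A slightly more defensive sketch, assuming the notebook may sometimes run where the git CLI or repository metadata is unavailable, in which case it falls back to a placeholder rather than failing:
import subprocess

def resolve_git_sha(default: str = "unknown") -> str:
    """Return the short Git SHA of HEAD, or a fallback when git is unavailable."""
    try:
        return (
            subprocess.check_output(["git", "rev-parse", "--short", "HEAD"])
            .decode("utf-8")
            .strip()
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        return default

git_sha = resolve_git_sha()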
316-316: Typo in log message
There's a typo in the print statement: "Reponse text" should be "Response text".
Apply this change:
-print("Reponse text:", response.text)
+print("Response text:", response.text)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- notebooks/week3/01.feature_serving.py (1 hunks)
- notebooks/week3/02.model_serving.py (1 hunks)
- notebooks/week3/03.model_serving_feature_lookup.py (1 hunks)
- notebooks/week3/04.AB_test_model_serving.py (1 hunks)
- pyproject.toml (1 hunks)
🔇 Additional comments (6)
notebooks/week3/03.model_serving_feature_lookup.py (1)
23-28: LGTM!
The Spark session and Databricks client initialization follows best practices.
notebooks/week3/02.model_serving.py (1)
52-60: Ensure Consistency Between Served Entity and Route Configuration
Verify that the served_model_name in the Route correctly references the served entity and version. Inconsistencies may lead to deployment issues.
Apply this diff to correct the served_model_name if needed:
Route(served_model_name="catboost_model_basic-2", traffic_percentage=100)
Ensure that "catboost_model_basic-2" matches the entity_name and entity_version specified.
Run the following script to confirm the available served models and versions:
✅ Verification successful
The served model name correctly matches the entity configuration
The code is correctly configured. The served_model_name="catboost_model_basic-2" in the route configuration matches the entity configuration where:
- The base name comes from entity_name=f"{catalog_name}.{schema_name}.catboost_model_basic"
- The suffix -2 matches the entity_version=2
The naming convention follows the expected pattern of <model_name>-<version> and is consistent across the configuration.
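To keep the two from drifting in future edits, the route name could be derived from the same variables as the served entity rather than hardcoded. A sketch, assuming the serving classes are imported from databricks.sdk.service.serving as elsewhere in these notebooks, and that catalog_name and schema_name come from the loaded config:
from databricks.sdk.service.serving import Route, ServedEntityInput

model_name = "catboost_model_basic"
model_version = 2

served_entity = ServedEntityInput(
    entity_name=f"{catalog_name}.{schema_name}.{model_name}",
    entity_version=model_version,
    scale_to_zero_enabled=True,
    workload_size="Small",
)

# Build the route name from the same variables so a model rename or
# version bump cannot leave the entity and route out of sync.
route = Route(
    served_model_name=f"{model_name}-{model_version}",
    traffic_percentage=100,
)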
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: List available models and versions in the workspace.
# Test: List models under the specified catalog and schema.
databricks mlflow models list --filter "name LIKE '%catboost_model_basic%'"
Length of output: 184
Script:
#!/bin/bash
# Let's analyze the code to verify the entity name and version consistency
# Search for similar model serving configurations in the codebase
rg -A 5 "served_model_name.*catboost" --type py
# Also search for entity name patterns
rg -A 5 "entity_name.*catboost" --type py
Length of output: 1916
notebooks/week3/01.feature_serving.py (4)
95-100: Consider setting perform_full_copy to True for initial online table creation
When creating an online table, perform_full_copy is set to False. This means that only changes from the Change Data Feed will be applied, and existing data may not be copied to the online store.
If you intend to have the current data available in the online table immediately, set perform_full_copy to True during the initial creation:
spec = OnlineTableSpec(
    primary_key_columns=["id"],
    source_table_full_name=feature_table_name,
    run_triggered=OnlineTableSpecTriggeredSchedulingPolicy.from_dict({"triggered": "true"}),
-   perform_full_copy=False,
+   perform_full_copy=True,
)
Please verify whether this change aligns with your data availability requirements.
36-36: Confirm that the MLflow registry URI is correctly set
The MLflow registry URI is set to "databricks-uc". Ensure that this is the intended URI for your environment.
If you're using Unity Catalog with MLflow, "databricks-uc" is appropriate. Otherwise, you might need to adjust it.
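For reference, a minimal sketch of setting and checking the registry URI explicitly (the classic workspace-local registry would use "databricks" instead of "databricks-uc"):
import mlflow

# Unity Catalog registry; switch to "databricks" for the workspace-local registry.
mlflow.set_registry_uri("databricks-uc")
print("Registry URI:", mlflow.get_registry_uri())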
178-182: Ensure that id_list is correctly populated from the Spark DataFrame
The code uses preds_df.select("Id").rdd.flatMap(lambda x: x).collect() to create id_list. Confirm that the column "Id" exists in preds_df.
If the column is named "id" in your DataFrame, adjust the code accordingly:
-id_list = preds_df.select("Id").rdd.flatMap(lambda x: x).collect()
+id_list = preds_df.select("id").rdd.flatMap(lambda x: x).collect()
76-79: Inconsistency in column naming between 'id' and 'Id'
There is a discrepancy in the casing of the 'id' column. In the Pandas DataFrame df, the column is likely named 'id', but in later operations, 'Id' with an uppercase 'I' is used. This can lead to key errors or mismatches during feature lookup and serving.
To ensure consistency, standardize the column name to 'id' throughout the code. Apply this diff to fix the column naming:
-preds_df = df[["id", "person_age", "person_income"]]
+preds_df = df[["id", "person_age", "person_income"]]
-preds_df["predicted_loan_status"] = pipeline.predict(df[continuous_variables + categorical_variables])
+preds_df = preds_df.withColumnRenamed("id", "Id")
Alternatively, if you prefer using 'id', update the FeatureLookup and request payload accordingly.
Likely invalid or redundant comment.
dev = ["databricks-connect>=15.4.1, <16", | ||
"databricks-sdk>=0.32.0, <0.33", | ||
"databricks-sdk==0.32.0", | ||
"ipykernel>=6.29.5, <7", | ||
"catboost==1.2.7", | ||
"pip>=24.2", |
🛠️ Refactor suggestion
Remove redundant dependencies and align versions
The optional dependencies section contains duplicates from the main dependencies:
databricks-sdk==0.32.0
catboost==1.2.7
These packages are already specified in the main dependencies section with the same versions.
Apply this diff to remove the redundant entries:
dev = ["databricks-connect>=15.4.1, <16",
- "databricks-sdk==0.32.0",
"ipykernel>=6.29.5, <7",
- "catboost==1.2.7",
"pip>=24.2",
"pytest>=8.3.3"]
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
dev = ["databricks-connect>=15.4.1, <16", | |
"databricks-sdk>=0.32.0, <0.33", | |
"databricks-sdk==0.32.0", | |
"ipykernel>=6.29.5, <7", | |
"catboost==1.2.7", | |
"pip>=24.2", | |
dev = ["databricks-connect>=15.4.1, <16", | |
"ipykernel>=6.29.5, <7", | |
"pip>=24.2", |
dependencies = [
"lightgbm>=4.5.0, <5",
"scikit-learn>=1.5.1, <2",
"cloudpickle>=3.0.0, <4",
"catboost==1.2.7,<2",
"graphviz==0.20.3, <1",
"mlflow==2.17.0",
"numpy>=1.26.4, <2",
"pandas>=2.2.2, <3",
"cffi>=1.17.1, <2",
"scipy>=1.14.1, <2",
"matplotlib>=3.9.2, <4",
"databricks-feature-engineering==0.6",
"databricks-feature-lookup==1.2.0",
"databricks-sdk==0.32.0",
]
💡 Codebase verification
Update MLflow and Databricks dependencies to latest secure versions
Several dependencies have newer versions available with important security fixes:
- MLflow 2.17.0 -> 2.18.0 (current version is vulnerable to path traversal attacks)
- databricks-feature-engineering 0.6 -> 0.7.0
- databricks-sdk 0.32.0 -> 0.38.0
Other pinned versions are current:
- catboost 1.2.7 (latest)
- databricks-feature-lookup 1.2.0 (latest)
Consider:
- Updating MLflow to at least 2.18.0 to address security vulnerabilities
- Updating Databricks packages to their latest versions for compatibility
- Adding version ranges (e.g., >=2.18.0, <3) to allow future security patches while maintaining compatibility (a small version-check sketch follows below)
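A small sketch of a guard the notebooks could run up front to surface stale pins early; it assumes the packaging library is available, and the minimum versions are simply the ones mentioned above, to be adjusted as advisories evolve:
from importlib.metadata import version
from packaging.version import Version

# Minimums taken from the advisory notes above; adjust as they evolve.
MINIMUM_VERSIONS = {
    "mlflow": "2.18.0",
    "databricks-sdk": "0.38.0",
    "databricks-feature-engineering": "0.7.0",
}

for package_name, minimum in MINIMUM_VERSIONS.items():
    installed = Version(version(package_name))
    if installed < Version(minimum):
        print(f"WARNING: {package_name} {installed} is below the recommended minimum {minimum}")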
🔗 Analysis chain
Review dependency version management strategy
The dependencies section shows inconsistent version pinning strategies:
- Some packages use fixed versions (e.g., mlflow==2.17.0, catboost==1.2.7)
- Others use ranges (e.g., lightgbm>=4.5.0, <5)
While fixed versions for ML frameworks ensure reproducibility, consider:
- Using version ranges for utility packages to allow security updates
- Documenting the version pinning strategy in README.md
Let's verify the latest stable versions and any security advisories for the pinned dependencies:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check latest versions and security advisories for critical dependencies
# Check PyPI for latest versions
echo "Checking latest versions..."
for pkg in mlflow catboost databricks-feature-engineering databricks-feature-lookup databricks-sdk
do
echo "$pkg:"
curl -s "https://pypi.org/pypi/$pkg/json" | jq -r '.info.version'
done
# Check for security advisories
echo "Checking security advisories..."
for pkg in mlflow catboost databricks-feature-engineering databricks-feature-lookup databricks-sdk
do
gh api graphql -f query='{
securityVulnerabilities(first: 5, ecosystem: PIP, package: "'$pkg'") {
nodes {
advisory {
summary
severity
publishedAt
}
vulnerableVersionRange
}
}
}'
done
Length of output: 4186
@@ -0,0 +1,126 @@
# Databricks notebook source
# MAGIC %pip install /Volumes/mlops_students/netojoseaugusto/package/loans-0.0.1-py3-none-any.whl
Avoid hardcoding user-specific paths
The wheel file path contains a hardcoded username netojoseaugusto. This could break when other users try to run the notebook.
Consider using environment variables or notebook parameters:
-# MAGIC %pip install /Volumes/mlops_students/netojoseaugusto/package/loans-0.0.1-py3-none-any.whl
+# MAGIC %pip install /Volumes/mlops_students/${USER}/package/loans-0.0.1-py3-none-any.whl
📝 Committable suggestion
# MAGIC %pip install /Volumes/mlops_students/${USER}/package/loans-0.0.1-py3-none-any.whl
token = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
host = spark.conf.get("spark.databricks.workspaceUrl")
Validate authentication credentials
The API token and host URL are retrieved without validation.
Add validation:
token = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
host = spark.conf.get("spark.databricks.workspaceUrl")
+if not token:
+ raise ValueError("Failed to retrieve API token")
+if not host:
+ raise ValueError("Failed to retrieve workspace URL")
📝 Committable suggestion
token = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
host = spark.conf.get("spark.databricks.workspaceUrl")
if not token:
    raise ValueError("Failed to retrieve API token")
if not host:
    raise ValueError("Failed to retrieve workspace URL")
workspace.serving_endpoints.create(
    name="catboost-loans-serving-fe",
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name=f"{catalog_name}.{schema_name}.catboost_fe",
                scale_to_zero_enabled=True,
                workload_size="Small",
                entity_version=1,
            )
        ]
    ),
)
Add error handling and make endpoint name configurable
The endpoint creation lacks error handling and uses a hardcoded name.
Consider these improvements:
+endpoint_name = configs.get("endpoint_name", "catboost-loans-serving-fe")
+try:
workspace.serving_endpoints.create(
- name="catboost-loans-serving-fe",
+ name=endpoint_name,
config=EndpointCoreConfigInput(
served_entities=[
ServedEntityInput(
entity_name=f"{catalog_name}.{schema_name}.catboost_fe",
scale_to_zero_enabled=True,
workload_size="Small",
entity_version=1,
)
]
),
)
+except Exception as e:
+ raise RuntimeError(f"Failed to create serving endpoint: {str(e)}")
Committable suggestion skipped: line range outside the PR's diff.
start_time = time.time()
serving_endpoint = f"https://{host}/serving-endpoints/loans-feature-serving/invocations"
response = requests.post(
    f"{serving_endpoint}",
    headers={"Authorization": f"Bearer {token}"},
    json={"dataframe_records": [{"Id": "182"}]},
)

end_time = time.time()
execution_time = end_time - start_time

print("Response status:", response.status_code)
print("Reponse text:", response.text)
print("Execution time:", execution_time, "seconds")
🛠️ Refactor suggestion
Add error handling for HTTP requests to the serving endpoint
The current implementation does not handle potential errors from the requests.post()
call, such as network timeouts or HTTP errors. Adding error handling will make the code more robust.
Consider wrapping the request in a try-except
block and checking for HTTP errors:
start_time = time.time()
serving_endpoint = f"https://{host}/serving-endpoints/loans-feature-serving/invocations"
try:
    response = requests.post(
        serving_endpoint,
        headers={"Authorization": f"Bearer {token}"},
        json={"dataframe_records": [{"Id": "182"}]},
        timeout=10,  # Optional: set a timeout
    )
    response.raise_for_status()  # Raise an exception for HTTP errors
    print("Response status:", response.status_code)
    print("Response text:", response.text)
except requests.exceptions.RequestException as e:
    print("An error occurred:", e)
finally:
    end_time = time.time()
    execution_time = end_time - start_time
    print("Execution time:", execution_time, "seconds")
from pyspark.dbutils import DBUtils
from pyspark.sql import SparkSession

from loans.helpers import open_yaml_file
💡 Codebase verification
Warning: Duplicate function definition found
The function open_yaml_file is defined in two locations:
- src/loans/helpers.py
- notebooks/week1/starter_notebook.py
This could lead to potential conflicts. The import in notebooks/week3/01.feature_serving.py correctly references the function from loans.helpers, but having duplicate definitions is not recommended.
Consider removing the duplicate definition from starter_notebook.py and using the one from loans.helpers consistently across the codebase.
🔗 Analysis chain
Ensure open_yaml_file is properly defined and imported
The function open_yaml_file is imported from loans.helpers. Verify that this module and function exist and are accessible in the environment.
Run the following script to check if open_yaml_file is defined once and imported correctly:
Ensure that there are no naming conflicts or missing modules.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the definition and import of `open_yaml_file`
# Find the definition of `open_yaml_file`
rg -t py "def open_yaml_file" -A 2
# Check for imports of `open_yaml_file`
rg -t py "from loans.helpers import open_yaml_file"
Length of output: 1246
lookup_key="Id", | ||
feature_names=["person_age", "person_income", "predicted_loan_status"], |
Update lookup_key and feature references to match column naming
The lookup_key is set to "Id", but if your DataFrame uses "id" (all lowercase), this will cause mismatches during feature serving. Similarly, the feature names must match the exact column names.
Adjust the lookup_key and feature_names to use consistent casing:
lookup_key="id",
feature_names=["person_age", "person_income", "predicted_loan_status"],
Or, if standardizing to "Id" as per the previous comment, ensure the DataFrame columns are renamed accordingly.
Committable suggestion skipped: line range outside the PR's diff.
id_list = preds_df["id"] | ||
|
Potential error accessing the 'id' column on a Spark DataFrame
After converting preds_df to a Spark DataFrame in line 79, preds_df["id"] no longer returns the column's values: on a Spark DataFrame, square-bracket access yields a Column expression rather than a list, so downstream code that expects concrete ID values will fail.
To collect the 'Id' values from a Spark DataFrame, use the select and collect methods:
-id_list = preds_df["id"]
+id_list = preds_df.select("Id").rdd.flatMap(lambda x: x).collect()
Ensure that the column name matches the DataFrame's column name, considering any previous renaming for consistency.
Committable suggestion skipped: line range outside the PR's diff.
# Log parameters, metrics, and the model to MLflow
mlflow.log_param("model_type", "Catboost")
mlflow.log_params(parameters)
Logging incorrect parameters for Model B
While training Model B, the code logs parameters instead of parameters_2. Since parameters_2 contains the actual parameters used for Model B, this could lead to confusion during analysis.
Apply the following change to log the correct parameters:
-mlflow.log_params(parameters)
+mlflow.log_params(parameters_2)
📝 Committable suggestion
mlflow.log_params(parameters_2)
"scipy>=1.14.1, <2", | ||
"matplotlib>=3.9.2, <4", | ||
"databricks-feature-engineering>=0.6, <1"] | ||
dependencies = [ |
Very strict versioning is required if you use the feature engineering package and want to deploy an endpoint. This includes pandas and other packages; in particular, pyarrow must be pinned to 14.0.2 (the default would be 15.x, and model endpoint deployment would fail).
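One way to make those pins explicit is to pass them when logging the model, so the serving endpoint builds its environment from the same versions. A hedged sketch: pip_requirements is standard MLflow logging functionality, while the variable names and the exact pin list beyond pyarrow==14.0.2 are assumptions to adapt to this project.
import mlflow

# Pin the serving environment explicitly so endpoint creation does not
# resolve to incompatible defaults (e.g. pyarrow 15.x, per the comment above).
serving_requirements = [
    "pyarrow==14.0.2",
    "pandas==2.2.2",                         # illustrative; match the training environment
    "catboost==1.2.7",
    "databricks-feature-engineering==0.6",
]

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        python_model=loan_model,             # hypothetical pyfunc wrapper from the notebook
        artifact_path="catboost_fe",
        pip_requirements=serving_requirements,
    )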