Feature/week3 #3
Conversation
…ld and dependencies
Ran into an error with feature serving where a DateTime object is not JSON serializable; needed to find another method to use the DateTime as the primary key.
Create feature serving endpoint using feature tables and an online table
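To illustrate the serialization issue mentioned in the commit notes above, here is a minimal sketch (with illustrative column names and payload shape, not the repo's exact code) of converting a pandas DateTime column to an integer Unix timestamp so it can serve as a JSON-serializable primary key:

```python
import json

import pandas as pd

# Stand-in data: pandas Timestamps raise TypeError when passed to json.dumps
df = pd.DataFrame({"DateTime": pd.date_range("2017-01-01", periods=3, freq="h")})

# datetime64[ns] -> integer nanoseconds -> seconds since epoch
df["id"] = df["DateTime"].astype("int64") // 10**9

# The integer ids serialize cleanly in a request payload
payload = json.dumps({"dataframe_records": [{"id": int(i)} for i in df["id"]]})
print(payload)
```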
Walkthrough
This pull request introduces several changes across multiple files, primarily focusing on configuration updates, new feature implementations, and enhancements to documentation. Key modifications include the addition of default stages in the pre-commit configuration, updates to the README for environment setup and CI pipeline improvements, and the introduction of A/B testing hyperparameters in configuration files. New scripts for feature and model serving in a Databricks environment have also been added, along with adjustments to dependency specifications in `pyproject.toml`.
Changes
Possibly related PRs
Suggested labels
Suggested reviewers
Actionable comments posted: 13
🧹 Outside diff range and nitpick comments (21)
power_consumption/config.py (2)
13-19: Enhance the `ABTestHyperparameters` class implementation.
The class structure could be improved in several ways:
- Consider splitting `n_estimators` into `n_estimators_a` and `n_estimators_b` for consistency with other A/B parameters
- Add parameter validation using Pydantic's field validators
- Add a docstring explaining the class purpose and parameter usage
- Group A and B parameters together for better readability
Here's a suggested implementation:
```diff
 class ABTestHyperparameters(BaseModel):
+    """Configuration for A/B test model hyperparameters.
+
+    Contains two sets of hyperparameters (A and B) for comparing model versions.
+    All parameters should be positive numbers.
+    """
+    # Model A parameters
     learning_rate_a: float
-    learning_rate_b: float
-    n_estimators: int
     max_depth_a: int
+    n_estimators_a: int
+
+    # Model B parameters
+    learning_rate_b: float
     max_depth_b: int
+    n_estimators_b: int
+
+    @validator("learning_rate_a", "learning_rate_b", "max_depth_a", "max_depth_b",
+               "n_estimators_a", "n_estimators_b")
+    def validate_positive(cls, v, field):
+        if v <= 0:
+            raise ValueError(f"{field.name} must be positive")
+        return v
```
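A quick usage sketch of the validator pattern suggested above (Pydantic v1 style `@validator`, matching the diff; the example values mirror the A/B config elsewhere in this PR):

```python
from pydantic import BaseModel, validator


class ABTestHyperparameters(BaseModel):
    learning_rate_a: float
    learning_rate_b: float
    max_depth_a: int
    max_depth_b: int
    n_estimators_a: int
    n_estimators_b: int

    @validator("learning_rate_a", "learning_rate_b", "max_depth_a", "max_depth_b",
               "n_estimators_a", "n_estimators_b")
    def validate_positive(cls, v, field):
        if v <= 0:
            raise ValueError(f"{field.name} must be positive")
        return v


# Values taken from the A/B config in this PR
params = ABTestHyperparameters(
    learning_rate_a=0.02, learning_rate_b=0.02,
    max_depth_a=6, max_depth_b=10,
    n_estimators_a=1000, n_estimators_b=1000,
)

# A non-positive value is rejected (pydantic.ValidationError subclasses ValueError)
try:
    ABTestHyperparameters(
        learning_rate_a=-0.1, learning_rate_b=0.02,
        max_depth_a=6, max_depth_b=10,
        n_estimators_a=1000, n_estimators_b=1000,
    )
except ValueError as e:
    print(e)
```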
40-40: Consider improving `Config` class organization.
The addition looks good, but consider:
- Moving `ab_test_hyperparameters` next to `hyperparameters` for better grouping of related fields
- Adding validation to ensure the configuration is used correctly in either standard or A/B testing mode
Here's a suggested implementation:
```diff
 class Config(BaseModel):
     catalog_name: str
     schema_name: str
     hyperparameters: Hyperparameters
-    ab_test_hyperparameters: ABTestHyperparameters
     processed_features: ProcessedFeatures
     target: Target
     dataset: Dataset
+    ab_test_hyperparameters: Optional[ABTestHyperparameters] = None
+
+    @root_validator
+    def validate_hyperparameters_usage(cls, values):
+        """Ensure either standard or A/B test hyperparameters are actively used."""
+        if values.get("hyperparameters") and values.get("ab_test_hyperparameters"):
+            raise ValueError("Cannot use both standard and A/B test hyperparameters simultaneously")
+        return values
```

notebooks/week2/01_prepare_dataset.py (2)
34-36: Add validation for unique IDs
While converting IDs to strings is fine for downstream processing, we should ensure uniqueness to prevent potential issues in the A/B testing framework.
Consider adding validation after the conversion:
```diff
 train_set['id'] = train_set['id'].astype(str)
 test_set['id'] = test_set['id'].astype(str)
+# Validate ID uniqueness
+assert train_set['id'].nunique() == len(train_set), "Duplicate IDs found in train_set"
+assert test_set['id'].nunique() == len(test_set), "Duplicate IDs found in test_set"
```
30-36: Document the ID generation strategy
Since this ID generation is crucial for the A/B testing framework mentioned in the PR objectives, it would be beneficial to add documentation explaining the strategy.
Add a docstring explaining the ID generation:
```diff
+# Generate unique IDs for A/B testing by converting timestamps
+# The DateTime column is converted to Unix timestamp (seconds since epoch)
+# This ensures deterministic and unique identification of records
```

pyproject.toml (1)
29-29: Consider adding upper bound for databricks-connect.
The removal of the upper bound (`<16`) for databricks-connect seems inconsistent with the strict version pinning approach used for core dependencies. This could lead to compatibility issues with future major versions.
Consider maintaining the upper bound:
- "databricks-connect>=15.4.1", + "databricks-connect>=15.4.1,<16",notebooks/week3/03_model_serving_feature_lookup.py (1)
116-118: Fix typo and improve response handling.
There's a typo in "Reponse" and the response handling could be improved.
Apply these improvements:
-print("Response status:", response.status_code) -print("Reponse text:", response.text) -print("Execution time:", execution_time, "seconds") +print(f"Response status: {response.status_code}") +print(f"Response text: {response.text}") +print(f"Execution time: {execution_time:.2f} seconds")README.md (5)
Line range hint 13-24: Improve environment setup instructions for clarity and consistency.
The environment setup section contains inconsistencies and could be more comprehensive:
- Standardize the Python version reference (currently shows both `3.11.0` and `3.11`)
- Use consistent virtual environment paths (varies between `venv` and `.venv`)
- Add separate instructions for Windows users
Consider updating the setup instructions like this:
```diff
- uv venv -p 3.11.0 venv
- source venv/bin/activate
+ # For Unix/MacOS:
+ uv venv -p 3.11.10 .venv
+ source .venv/bin/activate
+
+ # For Windows:
+ uv venv -p 3.11.10 .venv
+ .venv\Scripts\activate
```

🧰 Tools
🪛 Markdownlint
124-124: Expected: 2; Actual: 4
Unordered list indentation
(MD007, ul-indent)
Line range hint 52-86: Consider enhancing the CI pipeline configuration.
The CI pipeline looks good but could be improved with:
- Test results artifact upload for better visibility
- Version constraints for test dependencies
Consider these additions to the CI workflow:
```diff
       - name: Run tests with coverage
         run: |
           uv pip install pytest pytest-mock pytest-cov
           uv run pytest --cov=power_consumption tests/
+          uv run pytest --junitxml=test-results.xml --cov=power_consumption --cov-report=xml tests/
+
+      - name: Upload test results
+        uses: actions/upload-artifact@v3
+        if: always()
+        with:
+          name: test-results
+          path: |
+            test-results.xml
+            coverage.xml
```

🧰 Tools
🪛 Markdownlint
124-124: Expected: 2; Actual: 4
Unordered list indentation
(MD007, ul-indent)
Line range hint 88-93: Enhance dataset documentation with format specifications.
The dataset section would benefit from additional details about:
- Expected file format and encoding
- Data validation steps
- Sample data structure
Would you like me to help create a data validation script or expand the dataset documentation?
🧰 Tools
🪛 Markdownlint
124-124: Expected: 2; Actual: 4
Unordered list indentation
(MD007, ul-indent)
124-124: Fix list indentation in Feature Engineering section.
The bullet point about Feature Engineering package limitations has incorrect indentation.
```diff
-    - Feature Engineering package limitations [documentation](https://docs.databricks.com/en/machine-learning/feature-store/python-api.html#limitations)
+  - Feature Engineering package limitations [documentation](https://docs.databricks.com/en/machine-learning/feature-store/python-api.html#limitations)
```

🧰 Tools
🪛 Markdownlint
124-124: Expected: 2; Actual: 4
Unordered list indentation
(MD007, ul-indent)
Line range hint 95-157: Improve Feature Engineering section organization and clarity.
The Feature Engineering section would benefit from:
- A clearer explanation of the UC (Unity Catalog) permission model
- Structured code examples with comments
- A troubleshooting guide for common permission issues
Consider reorganizing the code examples into separate sections for:
- Data preprocessing
- Feature generation
- Model training
Would you like help creating a structured template for this section?
🧰 Tools
🪛 Markdownlint
124-124: Expected: 2; Actual: 4
Unordered list indentation
(MD007, ul-indent)
notebooks/week3/02_model_serving.py (3)
16-31: Move imports to the top of the file
According to Python's style guidelines (PEP 8), all imports should be at the top of the file, before any executable code. This improves readability and ensures that dependencies are loaded before they are used.
Apply this diff to move the imports:
```diff
+# imports
+import time
+import requests
+import random
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from databricks.sdk import WorkspaceClient
+from databricks.sdk.service.serving import (
+    EndpointCoreConfigInput,
+    ServedEntityInput,
+    TrafficConfig,
+    Route,
+)
+from pyspark.sql import SparkSession
+from power_consumption.config import Config
+
 # DBTITLE 1,imports
-import time
-
-import requests
-import random
-from concurrent.futures import ThreadPoolExecutor, as_completed
-
-from databricks.sdk import WorkspaceClient
-from databricks.sdk.service.serving import (
-    EndpointCoreConfigInput,
-    ServedEntityInput,
-    TrafficConfig,
-    Route,
-)
-from pyspark.sql import SparkSession
-from power_consumption.config import Config
```

🧰 Tools
🪛 Ruff
16-16: Module level import not at top of file
(E402)
18-18: Module level import not at top of file
(E402)
19-19: Module level import not at top of file
(E402)
20-20: Module level import not at top of file
(E402)
22-22: Module level import not at top of file
(E402)
23-28: Module level import not at top of file
(E402)
29-29: Module level import not at top of file
(E402)
30-30: Module level import not at top of file
(E402)
143-143: Avoid printing full response text to prevent sensitive data exposure
Printing the full response text may inadvertently expose sensitive information. If the response could contain confidential data, consider logging only necessary details or using logging mechanisms with appropriate log levels.
Apply this diff to limit the output:
-print("Reponse text:", response.text) +print("Response content:", response.content[:100], "...") # Show only the first 100 characters
177-180: Handle exceptions when retrieving future results
Exceptions raised during the execution of `send_request` can cause `future.result()` to raise an exception, potentially stopping the entire load test. To ensure that all requests are accounted for and to handle failures gracefully, wrap the result retrieval in a try-except block.
Apply this diff:
```diff
 for future in as_completed(futures):
-    status_code, latency = future.result()
-    latencies.append(latency)
+    try:
+        status_code, latency = future.result()
+        latencies.append(latency)
+    except Exception as e:
+        print(f"An error occurred: {e}")
```

This will allow the load test to continue even if some requests fail.
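Putting the suggestion together, a self-contained sketch of the whole load-test loop with per-future error handling (the endpoint URL and token are placeholders, not the project's real values):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

ENDPOINT_URL = "https://<workspace-host>/serving-endpoints/<endpoint-name>/invocations"  # placeholder
TOKEN = "<databricks-token>"  # placeholder


def send_request() -> tuple[int, float]:
    """Send one scoring request and return (status_code, latency in seconds)."""
    start = time.time()
    response = requests.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"dataframe_records": [{"id": "1"}]},  # illustrative payload
    )
    return response.status_code, time.time() - start


num_requests = 100
latencies = []

with ThreadPoolExecutor(max_workers=num_requests) as executor:
    futures = [executor.submit(send_request) for _ in range(num_requests)]
    for future in as_completed(futures):
        try:
            status_code, latency = future.result()
            latencies.append(latency)
        except Exception as e:
            print(f"An error occurred: {e}")  # keep the load test running

if latencies:
    print(f"Completed {len(latencies)}/{num_requests} requests, "
          f"average latency {sum(latencies) / len(latencies):.3f}s")
```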
notebooks/week3/01_feature_serving.py (4)
41-41: Remove unused import 'SparkSession'
The `SparkSession` class is imported but not used in the script. Removing unused imports helps keep the code clean and maintainable.

```diff
-from pyspark.sql import SparkSession
```
🧰 Tools
🪛 Ruff
41-41: Module level import not at top of file
(E402)
41-41: `pyspark.sql.SparkSession` imported but unused
Remove unused import: `pyspark.sql.SparkSession`
(F401)
82-86: Consider using Spark DataFrames instead of converting to pandas
Converting large Spark DataFrames to pandas DataFrames using `.toPandas()` can lead to memory issues, especially with large datasets. Processing data using Spark DataFrames keeps the computations distributed and scalable.
Suggestion:
Refactor the code to continue using Spark DataFrames:

```diff
-train_set = spark.table(f"{catalog_name}.{schema_name}.train_set").toPandas()
-test_set = spark.table(f"{catalog_name}.{schema_name}.test_set").toPandas()
-
-df = pd.concat([train_set, test_set])
+train_set = spark.table(f"{catalog_name}.{schema_name}.train_set")
+test_set = spark.table(f"{catalog_name}.{schema_name}.test_set")
+
+df = train_set.union(test_set)
```
🪛 Ruff
82-82: Undefined name
spark
(F821)
83-83: Undefined name
spark
(F821)
95-95: Parameterize the model version instead of hardcoding
Hardcoding the model version (`/2`) can make the code less flexible and harder to maintain. Consider parameterizing the model version or using the latest version programmatically.
Suggestion:
Option 1 - Use the latest model version:
```diff
-pipeline = mlflow.sklearn.load_model(f"models:/{catalog_name}.{schema_name}.power_consumption_model/2")
+pipeline = mlflow.sklearn.load_model(f"models:/{catalog_name}.{schema_name}.power_consumption_model@latest")
```

Option 2 - Make the model version configurable via the configuration file:

```diff
+# Assuming you add `model_version` to your configuration
+model_version = config.model_version
+
-pipeline = mlflow.sklearn.load_model(f"models:/{catalog_name}.{schema_name}.power_consumption_model/2")
+pipeline = mlflow.sklearn.load_model(f"models:/{catalog_name}.{schema_name}.power_consumption_model/{model_version}")
```
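A third option, sketched here as an assumption rather than repo code (catalog/schema names are placeholders): resolve the newest registered version programmatically with `MlflowClient` instead of relying on an alias:

```python
import mlflow
from mlflow import MlflowClient

client = MlflowClient()

# Placeholders: in the notebook these come from the loaded Config
catalog_name, schema_name = "my_catalog", "my_schema"

# Find the highest registered version of the model
versions = client.search_model_versions(
    f"name='{catalog_name}.{schema_name}.power_consumption_model'"
)
latest_version = max(int(v.version) for v in versions)

pipeline = mlflow.sklearn.load_model(
    f"models:/{catalog_name}.{schema_name}.power_consumption_model/{latest_version}"
)
```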
298-319: Align the number of threads with the number of requests
The `ThreadPoolExecutor` is initialized with `max_workers=100`, but you are only sending `num_requests=10`. This could lead to inefficient resource usage. Consider aligning the number of workers with the number of requests.
Suggestion:
```diff
-num_requests = 10
+num_requests = 100  # Or adjust max_workers to match num_requests

 # Send requests concurrently
-with ThreadPoolExecutor(max_workers=100) as executor:
+with ThreadPoolExecutor(max_workers=num_requests) as executor:
     futures = [executor.submit(send_request) for _ in range(num_requests)]
```

notebooks/week3/04_AB_test_model_serving.py (3)
16-34: Consider moving import statements to the top of the file
PEP 8 recommends placing all import statements at the top of the file to enhance readability and maintain consistency. While notebooks often have imports in various cells, consolidating them at the beginning can help in understanding dependencies and improve maintainability.
🧰 Tools
🪛 Ruff
16-16: Module level import not at top of file
(E402)
18-18: Module level import not at top of file
(E402)
19-19: Module level import not at top of file
(E402)
20-20: Module level import not at top of file
(E402)
21-21: Module level import not at top of file
(E402)
22-22: Module level import not at top of file
(E402)
23-23: Module level import not at top of file
(E402)
24-24: Module level import not at top of file
(E402)
25-25: Module level import not at top of file
(E402)
26-26: Module level import not at top of file
(E402)
27-27: Module level import not at top of file
(E402)
28-28: Module level import not at top of file
(E402)
29-29: Module level import not at top of file
(E402)
30-30: Module level import not at top of file
(E402)
31-31: Module level import not at top of file
(E402)
32-32: Module level import not at top of file
(E402)
34-34: Module level import not at top of file
(E402)
119-119: Avoid hardcoding the Git SHA value
The `git_sha` variable is hardcoded:

```python
git_sha = "30d57afb2efca70cede3061d00f2a553c2b4779b"
```

Consider retrieving the current Git commit SHA programmatically to ensure that the logged information reflects the actual state of the code:
```python
import subprocess

git_sha = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()
```

This change enhances maintainability and reduces the risk of discrepancies between the code and the logged metadata.
298-298: Explicitly display predictions in the notebook
The standalone expression `predictions` is intended to display the output in the notebook. For clarity and to adhere to best practices, consider using a print statement or adding a comment to indicate that this line displays the predictions:

```diff
-predictions
+print(predictions)  # Display predictions
```

This makes the purpose of the line clear to readers unfamiliar with notebook behavior.
🧰 Tools
🪛 Ruff
298-298: Found useless expression. Either assign it to a variable or remove it.
(B018)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (12)
- `.pre-commit-config.yaml` (1 hunks)
- `README.md` (1 hunks)
- `configs/project_configs.yml` (1 hunks)
- `main.py` (0 hunks)
- `notebooks/week2/01_prepare_dataset.py` (1 hunks)
- `notebooks/week3/01_feature_serving.py` (1 hunks)
- `notebooks/week3/02_model_serving.py` (1 hunks)
- `notebooks/week3/03_model_serving_feature_lookup.py` (1 hunks)
- `notebooks/week3/04_AB_test_model_serving.py` (1 hunks)
- `power_consumption/config.py` (2 hunks)
- `pyproject.toml` (2 hunks)
- `tests/conftest.yml` (1 hunks)
💤 Files with no reviewable changes (1)
- main.py
✅ Files skipped from review due to trivial changes (1)
- .pre-commit-config.yaml
🧰 Additional context used
🪛 Markdownlint
README.md
124-124: Expected: 2; Actual: 4
Unordered list indentation
(MD007, ul-indent)
🪛 Ruff
notebooks/week3/01_feature_serving.py
11-11: Undefined name dbutils
(F821)
26-26: Module level import not at top of file
(E402)
27-27: Module level import not at top of file
(E402)
28-28: Module level import not at top of file
(E402)
30-30: Module level import not at top of file
(E402)
31-31: Module level import not at top of file
(E402)
32-32: Module level import not at top of file
(E402)
33-33: Module level import not at top of file
(E402)
34-34: Module level import not at top of file
(E402)
35-35: Module level import not at top of file
(E402)
36-39: Module level import not at top of file
(E402)
40-40: Module level import not at top of file
(E402)
41-41: Module level import not at top of file
(E402)
41-41: pyspark.sql.SparkSession
imported but unused
Remove unused import: pyspark.sql.SparkSession
(F401)
42-42: Module level import not at top of file
(E402)
82-82: Undefined name spark
(F821)
83-83: Undefined name spark
(F821)
114-114: Undefined name spark
(F821)
135-135: Undefined name spark
(F821)
221-221: Undefined name dbutils
(F821)
222-222: Undefined name spark
(F821)
notebooks/week3/02_model_serving.py
11-11: Undefined name dbutils
(F821)
16-16: Module level import not at top of file
(E402)
18-18: Module level import not at top of file
(E402)
19-19: Module level import not at top of file
(E402)
20-20: Module level import not at top of file
(E402)
22-22: Module level import not at top of file
(E402)
23-28: Module level import not at top of file
(E402)
29-29: Module level import not at top of file
(E402)
30-30: Module level import not at top of file
(E402)
85-85: Undefined name dbutils
(F821)
notebooks/week3/03_model_serving_feature_lookup.py
11-11: Undefined name dbutils
(F821)
16-16: Module level import not at top of file
(E402)
18-18: Module level import not at top of file
(E402)
19-19: Module level import not at top of file
(E402)
20-23: Module level import not at top of file
(E402)
24-24: Module level import not at top of file
(E402)
25-25: Module level import not at top of file
(E402)
27-27: Module level import not at top of file
(E402)
83-83: Undefined name dbutils
(F821)
notebooks/week3/04_AB_test_model_serving.py
11-11: Undefined name dbutils
(F821)
16-16: Module level import not at top of file
(E402)
18-18: Module level import not at top of file
(E402)
19-19: Module level import not at top of file
(E402)
20-20: Module level import not at top of file
(E402)
21-21: Module level import not at top of file
(E402)
22-22: Module level import not at top of file
(E402)
23-23: Module level import not at top of file
(E402)
24-24: Module level import not at top of file
(E402)
25-25: Module level import not at top of file
(E402)
26-26: Module level import not at top of file
(E402)
27-27: Module level import not at top of file
(E402)
28-28: Module level import not at top of file
(E402)
29-29: Module level import not at top of file
(E402)
30-30: Module level import not at top of file
(E402)
31-31: Module level import not at top of file
(E402)
32-32: Module level import not at top of file
(E402)
34-34: Module level import not at top of file
(E402)
298-298: Found useless expression. Either assign it to a variable or remove it.
(B018)
331-331: Undefined name dbutils
(F821)
🔇 Additional comments (9)
configs/project_configs.yml (2)
9-15: Consider differentiating learning rates for meaningful A/B testing.
The current configuration uses identical learning rates (0.02) for both versions. This might not provide meaningful insights into the impact of learning rate on model performance.
Consider:
- Using different learning rates to better understand their impact
- Splitting the shared `n_estimators` parameter into version-specific parameters (`n_estimators_a` and `n_estimators_b`)
- Adding parameter validation constraints (e.g., min/max values)
Example structure:
```yaml
ab_test_hyperparameters:
  version_a:
    learning_rate: 0.02
    n_estimators: 1000
    max_depth: 6
  version_b:
    learning_rate: 0.03  # Different value for comparison
    n_estimators: 1000
    max_depth: 10
  validation:
    learning_rate_range: [0.01, 0.1]
    max_depth_range: [3, 15]
    n_estimators_range: [100, 2000]
```
13-14: Verify max_depth difference impact.
The significant difference in max_depth (6 vs 10) between versions could lead to:
- Overfitting in version B (max_depth: 10)
- Different model complexities affecting inference latency
Let's verify if there are any model evaluation metrics or latency requirements in the codebase:
tests/conftest.yml (3)
13-14: Review max_depth difference for potential overfitting.
The significant difference in max_depth between variants (6 vs 10) might lead to overfitting in variant B. Consider:
- Using closer max_depth values to isolate the impact of this parameter
- Adding cross-validation in the training script to validate the choice of these values (see the sketch after the search script below)
- Documenting the rationale for these specific values
```bash
#!/bin/bash
# Search for cross-validation implementation in the training code
rg "cross_val|cv_|validation" -A 5
```
9-15: Verify A/B testing configuration integration.
The A/B test configuration looks structurally sound, but please ensure:
- The A/B testing implementation in `04_AB_test_model_serving.py` properly handles these parameters
- There's error handling for missing or invalid parameter values
- The testing framework can access these parameters correctly
```bash
#!/bin/bash
# Search for usage of these parameters in the A/B testing implementation
echo "Checking parameter usage in A/B testing code..."
rg "ab_test_hyperparameters|learning_rate_[ab]|max_depth_[ab]" -A 5

# Look for error handling related to these parameters
echo "Checking error handling..."
ast-grep --pattern 'try {
  $$$
  ab_test_hyperparameters
  $$$
} catch'
```
10-11: Verify if identical learning rates are intentional.
Both A/B variants use the same learning rate (0.02). Consider using different learning rates to better evaluate the impact of this hyperparameter on model performance.
notebooks/week2/01_prepare_dataset.py (1)
30-32: 🛠️ Refactor suggestion
Consider potential timestamp precision loss
The current implementation divides the timestamp by 10^9 to convert from nanoseconds to seconds, which might lead to collisions if there are multiple records within the same second.
Let's verify if there are potential ID collisions:
Consider using millisecond precision instead:
```diff
-train_set['id'] = train_set['DateTime'].astype('int64') // 10**9
-test_set['id'] = test_set['DateTime'].astype('int64') // 10**9
+train_set['id'] = train_set['DateTime'].astype('int64') // 10**6
+test_set['id'] = test_set['DateTime'].astype('int64') // 10**6
```
✅ Verification successful
Let me gather more information about the data to better understand the timestamp precision requirements.
Based on the gathered information, I can now provide a final response:
The timestamp precision concern is not valid in this context
The DateTime column is used as a primary key in the feature table and the data is collected hourly (as seen in test data with `freq="h"`). Converting to seconds (`10**9`) provides sufficient granularity since:
- The data has hourly frequency (3600 seconds between records)
- DateTime is used as a primary key in the feature table, confirming uniqueness
- Test fixtures and data preprocessing show consistent hourly timestamps
The current implementation is appropriate for the use case and doesn't risk ID collisions.
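A quick illustrative check of that conclusion: hourly timestamps converted to Unix seconds stay unique, since consecutive records are 3600 seconds apart:

```python
import pandas as pd

idx = pd.date_range("2017-01-01", periods=1000, freq="h")
ids = idx.astype("int64") // 10**9  # nanoseconds -> seconds since epoch

assert len(set(ids)) == len(ids), "unexpected ID collision"
print(ids[1] - ids[0])  # 3600 seconds between hourly records
```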
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Search for any timestamp-related tests or validation logic
rg -g '!*.pyc' -g '!*.git*' "DateTime.*test|test.*DateTime" .

# Look for any existing timestamp handling patterns
ast-grep --pattern 'DateTime.astype("int64")|DateTime.astype('\''int64'\'')'
```
Length of output: 245
Script:
```bash
#!/bin/bash
# Look for DataFrame creation or data loading to understand timestamp frequency
rg -g '!*.pyc' -g '!*.git*' -B 5 -A 5 "DateTime" .

# Search for any validation or uniqueness checks on IDs
rg -g '!*.pyc' -g '!*.git*' "duplicated|unique|drop_duplicates" .

# Look for any configuration or schema definitions that might specify timestamp requirements
rg -g '!*.pyc' -g '!*.git*' "schema|timestamp|frequency" .
```
Length of output: 19728
pyproject.toml (2)
66-68: LGTM! Good addition of build system configuration.
The build system configuration follows PEP 517/518 best practices and aligns well with the project's structure.
8-24: Verify compatibility with pyarrow downgrade.
The change from `pyarrow>=15.0.2` to `pyarrow==14.0.1` is a significant downgrade. Please ensure this version is compatible with your Databricks runtime and other dependencies.
Consider using version ranges with upper bounds (e.g., `>=14.0.1,<15.0.0`) instead of exact versions to:
- Allow for patch updates that include security fixes
- Maintain compatibility with other packages that might require slightly different versions
✅ Verification successful
Pyarrow version change appears safe to proceed
The search results show that pyarrow is only listed as a dependency in `pyproject.toml` and referenced in `uv.lock`, but there are no direct imports or usage patterns found in the codebase. The downgrade from `>=15.0.2` to `14.0.1` appears safe as:
- No direct pyarrow API usage was detected
- No parquet read/write operations were found that might be affected
- The version is still recent enough to be compatible with modern Databricks runtimes
While the architectural advice about using version ranges is valid, the exact version pinning is acceptable if it aligns with your team's dependency management strategy.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Check if any files import pyarrow and use features that might be affected
rg -l 'import.*pyarrow' | xargs rg 'pyarrow\.' -A 3
```
Length of output: 53
Script:
```bash
#!/bin/bash
# Let's try a broader search for pyarrow usage
rg -i "pyarrow" -A 3

# Also check for pa. prefix which is commonly used for pyarrow
rg "pa\." -A 3

# And check for any pandas operations that might use pyarrow under the hood
rg "pd\.read_parquet|to_parquet" -A 3
```
Length of output: 3525
notebooks/week3/03_model_serving_feature_lookup.py (1)
7-7: Consider using absolute path for wheel installation.
The relative path `../power_consumption-0.0.1-py3-none-any.whl` could break if the notebook location changes.
Consider using an absolute path or environment variable to specify the wheel location.
Looks awesome! You clearly paid close attention!
Description
Three serving endpoints are created for different scenarios: feature serving, model serving, and model serving with feature lookup (`01_feature_serving.py`, `02_model_serving.py`, `03_model_serving_feature_lookup.py`).
Setup of A/B testing for two different model versions using Pyfunc: we train, register, and implement a serving endpoint that uses a Pyfunc model as a wrapper for these versions (`04_AB_test_model_serving.py`). A sketch of the endpoint-creation pattern follows.
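For reviewers skimming the notebooks, here is a minimal sketch of creating a serving endpoint with the SDK classes imported in `02_model_serving.py`. Note this shows endpoint-level traffic splitting as one alternative, whereas this PR implements the A/B split inside a single Pyfunc wrapper model; endpoint name, model name, versions, and the `<model>-<version>` route-naming convention are assumptions, not the repo's exact values:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    ServedEntityInput,
    TrafficConfig,
    Route,
)

workspace = WorkspaceClient()

workspace.serving_endpoints.create(
    name="power-consumption-ab-test",  # placeholder endpoint name
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="catalog.schema.power_consumption_model",  # placeholder UC model
                entity_version="1",
                scale_to_zero_enabled=True,
                workload_size="Small",
            ),
            ServedEntityInput(
                entity_name="catalog.schema.power_consumption_model",
                entity_version="2",
                scale_to_zero_enabled=True,
                workload_size="Small",
            ),
        ],
        # Route half the traffic to each served version (illustrative 50/50 split)
        traffic_config=TrafficConfig(
            routes=[
                Route(served_model_name="power_consumption_model-1", traffic_percentage=50),
                Route(served_model_name="power_consumption_model-2", traffic_percentage=50),
            ]
        ),
    ),
)
```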
Type of change
Checklist:
Summary by CodeRabbit
Release Notes
New Features
- Unique identifier (`id` column) added to datasets for enhanced processing.
Documentation
Dependency Updates
- Dependency specifications updated in `pyproject.toml` to ensure compatibility and stability.