# 🚀 week-01-environment-setup #1

Changes from all commits.
**CODEOWNERS**

```diff
@@ -1 +1,5 @@
-* @end-to-end-mlops-databricks/teachers @MahaAmin
+# These owners will be the default owners for everything in
+# the repo. Unless a later match takes precedence,
+# @global-owner1 and @global-owner2 will be requested for
+# review when someone opens a pull request.
+* @end-to-end-mlops-databricks/teachers @MahaAmin
```
**.pre-commit-config.yaml**

```diff
@@ -13,5 +13,4 @@ repos:
     rev: v0.6.9
     hooks:
       - id: ruff
-        args: [--fix, --exit-non-zero-on-fix, --show-fixes]
       - id: ruff-format
```
**README.md**

<h1 align="center">
Marvelous MLOps End-to-end MLOps with Databricks course
</h1>

## Practical information
- Weekly lectures on Wednesdays, 16:00-18:00 CET.
- Code for the lecture is shared before the lecture.
- Presentation and lecture materials are shared right after the lecture.
- Video of the lecture is uploaded within 24 hours after the lecture.
## Course Project Description

- Every week we set a deliverable, and you implement it with your own dataset.
- To submit the deliverable, create a feature branch in this repository and open a PR to the main branch. The code can be merged after we review and approve it and the CI pipeline runs successfully.
- The deliverables can be submitted with a delay (for example, lectures 1 & 2 together), but we expect you to finish all assignments for the course before the 25th of November.
- **Dataset:** [Kaggle - Credit Card Fraud Detection Dataset 2023](https://www.kaggle.com/datasets/nelgiriyewithana/credit-card-fraud-detection-dataset-2023)
### Course Deliverables

#### PR #1

- Azure Databricks environment setup
- Select dataset: [Kaggle - Credit Card Fraud Detection Dataset 2023](https://www.kaggle.com/datasets/nelgiriyewithana/credit-card-fraud-detection-dataset-2023)
- Run Python notebooks on a Databricks cluster for the fraud_credit_cards use case
- Create `DataProcessor` and `FraudModel` classes
- Push `data.csv` to a Databricks volume
- Push `package.whl` to a Databricks volume (a CLI sketch for both uploads follows this list)
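A minimal sketch of the two volume uploads, assuming a recent Databricks CLI with Unity Catalog volume support. The data destination is taken from the read path used in the notebook later in this PR (`/Volumes/fraud_credit_cards/data/credit_cards_2023/`); the `packages` volume for the wheel is hypothetical, so adjust both paths to your workspace:

```
# data destination matches the notebook's read path
databricks fs cp data/creditcard_2023.csv \
  dbfs:/Volumes/fraud_credit_cards/data/credit_cards_2023/creditcard_2023.csv

# "packages" is a hypothetical volume name for the built wheel
databricks fs cp dist/fraud_credit_cards-0.0.1-py3-none-any.whl \
  dbfs:/Volumes/fraud_credit_cards/data/packages/fraud_credit_cards-0.0.1-py3-none-any.whl
```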
## Set up your environment

In this course, we use the Databricks 15.4 LTS runtime, which uses Python 3.11.
In our examples, we use uv. Check out the documentation on how to install it: https://docs.astral.sh/uv/getting-started/installation/

### To create a new environment and a lockfile, run:

```
uv venv -p 3.11.11 .venv
source .venv/bin/activate
uv pip install -r pyproject.toml --all-extras
uv lock
```
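`--all-extras` installs every optional dependency group declared in `pyproject.toml`. That file is not part of this diff; a hypothetical excerpt, assuming a single `dev` group along the lines the review comment at the end of this PR suggests:

```
# Hypothetical pyproject.toml excerpt; the real file is not shown in this PR.
[project.optional-dependencies]
dev = ["pre-commit>=3.8", "databricks-connect>=15.4"]
```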
### To build the fraud_credit_cards package

```
uv build
```

To install and run the fraud_credit_cards package:

```
uv pip install dist/fraud_credit_cards-0.0.1-py3-none-any.whl
```

```
uv run python main.py
```
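Note that the `main.py` shown later in this PR reads `data/creditcard_2023.csv` and `project_config.yml` from the working directory, so run it from the repository root.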
### Pre-Commit Checks

To run pre-commit checks:

```
uv run pre-commit run --all-files
```
**project_config.yml** (new file; the name is inferred from the load in `main.py`)

```
{
  "target": "Class"
}
```
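Since `main.py` reads this file with `yaml.safe_load`, the JSON form works because JSON is a subset of YAML; the equivalent plain-YAML form would be:

```
target: Class
```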
**databricks.yml**

```yaml
# This is a Databricks asset bundle definition for marvelous-databricks-course-MahaAmin.
# The Databricks extension requires a databricks.yml configuration file.
# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.

bundle:
  name: marvelous-databricks-course-MahaAmin

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-3537333413571968.8.azuredatabricks.net

# Optionally, there could be 'staging' or 'prod' targets here.
#
# prod:
#   workspace:
#     host: https://adb-3537333413571968.8.azuredatabricks.net
```
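With this file in place, the bundle can be checked and deployed from the command line; a minimal sketch, assuming a recent Databricks CLI authenticated against the workspace above:

```
databricks bundle validate
databricks bundle deploy --target dev
```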
**Databricks Connect smoke test** (new file; the script name is not shown in this diff)

```python
from databricks.connect import DatabricksSession

# Build a Spark session against the workspace configured under this profile.
spark = DatabricksSession.builder.profile("adb-3537333413571968").getOrCreate()
df = spark.read.table("samples.nyctaxi.trips")
df.show(5)
```
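Running this locally requires the `databricks-connect` package and a matching profile in `~/.databrickscfg`; a hypothetical entry (the token is a placeholder):

```
[adb-3537333413571968]
host  = https://adb-3537333413571968.8.azuredatabricks.net
token = <personal-access-token>
```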
**main.py**

```python
import logging

import yaml
from colorama import Back, Fore, Style
from sklearn.metrics import classification_report

from fraud_credit_cards.data_processor import DataProcessor
from fraud_credit_cards.fraud_model import FraudModel


def print_evaluation(y_test, y_pred, accuracy):
    print("Accuracy:", accuracy)
    print("\n" + Back.BLUE + Fore.WHITE + "Classification Report" + Style.RESET_ALL)
    report = classification_report(y_test, y_pred, output_dict=True)
    for key, value in report.items():
        if key in ["0", "1"]:
            color = Fore.GREEN if value["precision"] > 0.8 else Fore.RED
            print(f"Class {key}:")
            print(f"  Precision: {color}{value['precision']:.2f}{Style.RESET_ALL}")
            color = Fore.GREEN if value["recall"] > 0.8 else Fore.RED
            print(f"  Recall: {color}{value['recall']:.2f}{Style.RESET_ALL}")
            color = Fore.GREEN if value["f1-score"] > 0.8 else Fore.RED
            print(f"  F1-score: {color}{value['f1-score']:.2f}{Style.RESET_ALL}")
            print(f"  Support: {value['support']}")
        else:
            print(key + ":", value)


# configure logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

# load configuration
with open("project_config.yml", "r") as file:
    config = yaml.safe_load(file)

logger.info("Configuration loaded:")
print(yaml.dump(config, default_flow_style=False))

# initialize DataProcessor
data_processor = DataProcessor("data/creditcard_2023.csv", config)
logger.info("DataProcessor initialized.")

# preprocess the data
data_processor.preprocess_data()
logger.info("Data preprocessed.")

# split the data
X_train, X_test, y_train, y_test = data_processor.split_data()
logger.info("Data split into training and test sets.")
logger.debug(f"Training set shape: {X_train.shape}, Test set shape: {X_test.shape}")

# initialize and train the model
model = FraudModel(data_processor.preprocessor)
model.train(X_train, y_train)
logger.info("Model training completed.")

# evaluate the model
y_pred, accuracy, precision, recall, mse, f1 = model.evaluate(X_test, y_test)
logger.info("Model evaluation completed.")

# print evaluation report
print_evaluation(y_test, y_pred, accuracy)
```

> 🛠️ **Refactor suggestion** (review comment): avoid hardcoding file paths; the data file path should be configurable:
>
> ```diff
> -data_processor = DataProcessor("data/creditcard_2023.csv", config)
> +data_processor = DataProcessor(config.get("data_path", "data/creditcard_2023.csv"), config)
> ```
>
> Consider adding the data path to your configuration file.
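`main.py` depends on the `DataProcessor` and `FraudModel` classes from the `fraud_credit_cards` package, which are not part of this diff. A minimal sketch of the interfaces it relies on, assuming the config's `target` key names the label column and substituting `LogisticRegression` for whatever classifier the package actually uses:

```python
# Hypothetical sketch of the two classes main.py relies on; the actual
# implementations live in the fraud_credit_cards package and may differ.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class DataProcessor:
    def __init__(self, data_path: str, config: dict):
        self.df = pd.read_csv(data_path)
        self.config = config
        self.preprocessor = None

    def preprocess_data(self):
        # Scale all numeric feature columns; the target column name comes from config.
        features = self.df.drop(self.config["target"], axis=1)
        numeric = features.select_dtypes(include=["int64", "float64"]).columns.tolist()
        self.preprocessor = ColumnTransformer([("num", StandardScaler(), numeric)])

    def split_data(self, test_size=0.2, random_state=42):
        X = self.df.drop(self.config["target"], axis=1)
        y = self.df[self.config["target"]]
        return train_test_split(X, y, test_size=test_size, random_state=random_state)


class FraudModel:
    def __init__(self, preprocessor):
        # Chain the shared preprocessor with a classifier; LogisticRegression is a
        # stand-in here, while the course notebook uses CatBoostClassifier.
        self.pipeline = Pipeline([("preprocessor", preprocessor), ("classifier", LogisticRegression(max_iter=1000))])

    def train(self, X_train, y_train):
        self.pipeline.fit(X_train, y_train)

    def evaluate(self, X_test, y_test):
        # Return the same 6-tuple that main.py unpacks.
        y_pred = self.pipeline.predict(X_test)
        return (
            y_pred,
            accuracy_score(y_test, y_pred),
            precision_score(y_test, y_pred),
            recall_score(y_test, y_pred),
            mean_squared_error(y_test, y_pred),
            f1_score(y_test, y_pred),
        )
```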
**Databricks notebook** (notebook source)

```python
# Databricks notebook source
# MAGIC %md
# MAGIC # Credit Card 2023 Fraud Detection
# MAGIC
# MAGIC **Dataset:** [Kaggle-Credit-Card-Fraud-Detection-Dataset-2023](https://www.kaggle.com/datasets/nelgiriyewithana/credit-card-fraud-detection-dataset-2023/data)

# COMMAND ----------

# MAGIC %md
# MAGIC ### Installing Packages

# COMMAND ----------

# MAGIC %pip install colorama==0.4.6 catboost==1.2.0 gecs==0.1.1

# COMMAND ----------

# MAGIC %md
# MAGIC ### Import Libraries

# COMMAND ----------

import warnings

import pandas as pd
from catboost import CatBoostClassifier
from colorama import Back, Fore, Style
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore")

# COMMAND ----------

# MAGIC %md
# MAGIC ### Reading Data From DB Catalog Volume

# COMMAND ----------

df_tr = pd.read_csv("/Volumes/fraud_credit_cards/data/credit_cards_2023/creditcard_2023.csv")

# COMMAND ----------

df_tr.head()

# COMMAND ----------

# MAGIC %md
# MAGIC ### Data Preprocessing and Modeling

# COMMAND ----------

# Splitting the data into features and target
X = df_tr.drop("Class", axis=1)
y = df_tr["Class"]

# Define numeric features (remove categorical columns)
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns.tolist()

# Define preprocessing steps
numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])

preprocessor = ColumnTransformer(transformers=[("num", numeric_transformer, numeric_features)])

# Define the model
model = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", CatBoostClassifier(verbose=False))])

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Display classification report with colors and heading
print("\n" + Back.BLUE + Fore.WHITE + "Classification Report" + Style.RESET_ALL)
report = classification_report(y_test, y_pred, output_dict=True)
for key, value in report.items():
    if key in ["0", "1"]:
        color = Fore.GREEN if value["precision"] > 0.8 else Fore.RED
        print(f"Class {key}:")
        print(f"  Precision: {color}{value['precision']:.2f}{Style.RESET_ALL}")
        color = Fore.GREEN if value["recall"] > 0.8 else Fore.RED
        print(f"  Recall: {color}{value['recall']:.2f}{Style.RESET_ALL}")
        color = Fore.GREEN if value["f1-score"] > 0.8 else Fore.RED
        print(f"  F1-score: {color}{value['f1-score']:.2f}{Style.RESET_ALL}")
        print(f"  Support: {value['support']}")
    else:
        print(key + ":", value)

# COMMAND ----------
```
> 🛠️ **Refactor suggestion** (review comment): consider managing pre-commit through project dependencies. Installing pre-commit separately can lead to version conflicts and adds unnecessary build-time overhead; instead, add pre-commit to the development dependencies in pyproject.toml and let `uv sync` handle it.