Problem: Predict individual healthcare costs using regression analysis on demographic and health data, achieving Mean Absolute Error (MAE) under $3,500.
Solution: Built a deep neural network regression model with feature normalization and categorical encoding to predict medical expenses from patient characteristics.
Impact: Achieved $2,800 MAE (20% better than the requirement), demonstrating proficiency in regression modeling, data preprocessing, and TensorFlow/Keras implementation.
Completed Project
https://colab.research.google.com/drive/1TzGEH3w3DyVbxumFz2LufzQEgTDMPYgX?usp=drive_link
What: Imported TensorFlow 2.x, Keras, and a healthcare insurance dataset.
Dataset: insurance.csv containing patient information and medical costs.
Key Libraries:
- TensorFlow / Keras for neural networks
- Pandas for data manipulation
- Matplotlib for visualization
tensorflow_docsfor training callbacks
What: Examined dataset structure and features.
Features Identified:
- Numerical: age, bmi, children
- Categorical: sex, smoker, region
- Target Variable: expenses (healthcare costs)
Initial Analysis: Used dataset.tail() to inspect data structure and identify preprocessing needs.
Problem: Neural networks require numerical inputs, but the dataset contains text categories.
Solution: Manual mapping of categorical variables to numerical values.
- Sex:
male → 0,female → 1 - Smoker:
no → 0,yes → 1 - Region:
- southwest → 0
- southeast → 1
- northwest → 2
- northeast → 3
Key Action: Applied string stripping to handle whitespace before mapping.
Impact: Converted all features to numerical format for model compatibility.
Split Ratio: 80% training, 20% testing.
Method: Random sampling using frac=0.2 for the test set.
Implementation:
- Test dataset: 20% random sample from original data
- Train dataset: Remaining 80%, created by dropping test indices
Label Separation:
- Popped the
expensescolumn to create labels train_labels: target values for trainingtest_labels: target values for evaluation
Problem: Features exist on different scales (age: 18–64, BMI: 15–50, expenses: $1k–$60k), causing training instability.
Solution: Z-score normalization (standardization).
- Calculated mean and standard deviation from training data only
- Applied formula:
(x − mean) / std
Critical Decision: Used training statistics for both training and test data to prevent data leakage.
Impact:
- Faster convergence during training
- Improved gradient descent stability
- Better model generalization
Model Type: Sequential feedforward neural network for regression.
Architecture:
- Input layer: 6 features after encoding
- Hidden Layer 1: 64 neurons, ReLU activation
- Hidden Layer 2: 64 neurons, ReLU activation
- Output Layer: 1 neuron for continuous cost prediction
Rationale:
- Two hidden layers capture non-linear relationships
- ReLU activation avoids vanishing gradients
- 64 neurons balance learning capacity and overfitting risk
Optimizer: RMSprop (learning rate = 0.001)
- Adaptive learning rate suitable for regression
- Handles sparse gradients effectively
Loss Function: Mean Squared Error (MSE)
- Standard loss for regression
- Penalizes large errors more heavily
Metrics Tracked:
- MAE (Mean Absolute Error): primary evaluation metric
- MSE (Mean Squared Error): secondary monitoring metric
Training Configuration:
- Epochs: 500
- Validation split: 20% of training data
- Verbose: 0 with
EpochDotscallback
Data Used: Normalized training data
Monitoring: Validation loss tracked to detect overfitting.
Result: Model learned relationships between patient features and healthcare costs.
Test Evaluation: model.evaluate() on normalized test data.
Key Metric: Mean Absolute Error (MAE)
Performance Achieved: $2,800 MAE
Target: ≤ $3,500
✅ Beat target by $700 (20% improvement).
Interpretation: Model predictions are, on average, $2,800 away from actual healthcare costs.
What: Generated a scatter plot of true versus predicted expenses.
Plot Elements:
- X-axis: Actual healthcare costs (
test_labels) - Y-axis: Model predictions
- Diagonal line: Perfect prediction reference
- Equal aspect ratio for accurate comparison
Interpretation:
- Points close to the diagonal indicate accurate predictions
- Scatter distribution highlights model strengths and weaknesses
- Visual confirmation of $2,800 MAE performance
Automated Test: Built-in evaluation verifying MAE ≤ $3,500.
Result:
✅ "You passed the challenge. Great job!"
Verification: Model successfully generalizes to unseen test data.