This notebook demonstrates how to build and evaluate a regression tree model to predict taxi tip amounts using the NYC Taxi dataset. Below is a summary of the key steps and findings:
- Data Preparation:
  - Loaded the dataset containing taxi trip information
  - Examined correlations between features and the target variable (tip_amount)
  - Identified low-correlation features that could potentially be removed
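The correlation check described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the data is a small synthetic stand-in for the taxi dataset, and the 0.1 cutoff for "low correlation" is a hypothetical threshold chosen for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the taxi data (assumption: NOT the real dataset).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "fare_amount": rng.uniform(5, 60, n),
    "trip_distance": rng.uniform(0.5, 20, n),
    "payment_type": rng.integers(1, 5, n),
    "VendorID": rng.integers(1, 3, n),
})
# In this toy data the tip depends only on the fare, plus noise.
df["tip_amount"] = 0.15 * df["fare_amount"] + rng.normal(0, 1, n)

# Pearson correlation of each feature with the target (target row dropped).
corr = df.corr()["tip_amount"].drop("tip_amount")

# Hypothetical cutoff: features below it are candidates for removal.
threshold = 0.1
low_corr = corr[corr.abs() < threshold].index.tolist()

print(corr.abs().sort_values(ascending=False))
print("Candidates for removal:", low_corr)
```

Correlation only captures linear relationships, so a feature flagged here may still be useful to a tree model; the list is a starting point for experiments rather than a final decision.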
- Model Training:
  - Split data into training (70%) and testing (30%) sets
  - Built a Decision Tree Regressor with max_depth=8
  - Trained the model on the training data
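A minimal sketch of the training steps above, using scikit-learn's `train_test_split` and `DecisionTreeRegressor`. The feature matrix is synthetic, and the `random_state` values are assumptions added for reproducibility; neither comes from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic features and target (assumption: stand-in for the taxi data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 50, size=(1000, 3))
y = 0.15 * X[:, 0] + rng.normal(0, 1, 1000)

# 70% of rows for training, 30% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# max_depth=8 caps how many successive splits any path in the tree can make.
model = DecisionTreeRegressor(max_depth=8, random_state=42)
model.fit(X_train, y_train)

print(X_train.shape, X_test.shape, model.get_depth())
```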
- Evaluation:
  - Evaluated model performance using MSE (Mean Squared Error) and R² score
  - MSE: 1.784
  - R²: 0.001 (near zero, so the model explains almost none of the variance in tip_amount)
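To make the two metrics concrete, here is a small worked example on made-up numbers (not values from the notebook). It also shows why an R² near zero is poor: a model that always predicts the mean of the target scores exactly 0.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values chosen so the arithmetic is easy to follow by hand.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

# MSE = mean of squared errors = (0.25 + 0 + 0.25 + 0) / 4 = 0.125
mse = mean_squared_error(y_true, y_pred)

# R² = 1 - SS_res / SS_tot = 1 - 0.5 / 5.0 = 0.9
r2 = r2_score(y_true, y_pred)

# Baseline: always predict the mean of y_true. SS_res equals SS_tot, so R² = 0.
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, baseline)

print(mse, r2, r2_baseline)
```

A negative R² means the model does worse than this constant-mean baseline on the data being scored.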
- Experimentation:
  - Tested different max_depth values (4, 12)
  - Removed low-correlation features to simplify the model
  - Visualized the decision tree structure
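The depth experiment and the tree inspection could look roughly like this. The data is synthetic, and `export_text` is used here as a dependency-free way to view the tree; the notebook may have used a graphical plot instead, so treat that choice as an assumption.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data (assumption): only the first column drives the target.
rng = np.random.default_rng(0)
feature_names = ["fare_amount", "tolls_amount", "trip_distance"]
X = rng.uniform(0, 50, size=(1000, 3))
y = 0.15 * X[:, 0] + rng.normal(0, 2, 1000)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Record (train R², test R²) for each depth to expose overfitting:
# a widening gap between the two is the usual warning sign.
scores = {}
for depth in (4, 8, 12):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    scores[depth] = (
        r2_score(y_train, tree.predict(X_train)),
        r2_score(y_test, tree.predict(X_test)),
    )

for depth, (train_r2, test_r2) in scores.items():
    print(f"max_depth={depth}: train R²={train_r2:.3f}, test R²={test_r2:.3f}")

# Text rendering of a deliberately shallow tree so the output stays readable.
shallow = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_train, y_train)
tree_text = export_text(shallow, feature_names=feature_names)
print(tree_text)
```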
- Feature Importance:
  - The top 3 features affecting tip amount are:
    - fare_amount (highest correlation)
    - tolls_amount
    - trip_distance
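One way to rank features is through the fitted tree's `feature_importances_` attribute, sketched below on synthetic data. Note the assumption: the summary ranks features by correlation, so this impurity-based ranking is an alternative view and is not necessarily how the notebook produced its top 3.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): tip depends on fare and, weakly, on tolls.
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "fare_amount": rng.uniform(5, 60, n),
    "tolls_amount": rng.choice([0.0, 6.0], n),
    "trip_distance": rng.uniform(0.5, 20, n),
    "VendorID": rng.integers(1, 3, n),
})
y = 0.15 * X["fare_amount"] + 0.1 * X["tolls_amount"] + rng.normal(0, 1, n)

model = DecisionTreeRegressor(max_depth=8, random_state=42).fit(X, y)

# Impurity-based importances: non-negative and normalized to sum to 1.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
top3 = importances.head(3).index.tolist()

print(importances)
print("Top 3:", top3)
```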
- Model Performance:
  - The initial model performed poorly (R² ≈ 0)
  - Reducing max_depth to 4 improved performance slightly
  - Increasing max_depth to 12 worsened performance (negative R²), indicating overfitting
- Feature Selection:
  - Removing low-correlation features (payment_type, VendorID, etc.) had minimal impact on model performance
  - A simplified model using only the top 3 features performed similarly to the full-feature model
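The full-versus-reduced comparison can be sketched as below. The data is synthetic and the helper `holdout_r2` is a hypothetical name introduced for this example; only the column names and `max_depth=8` come from the summary.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the taxi data (assumption: NOT the real dataset).
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "fare_amount": rng.uniform(5, 60, n),
    "tolls_amount": rng.choice([0.0, 6.0], n),
    "trip_distance": rng.uniform(0.5, 20, n),
    "payment_type": rng.integers(1, 5, n),
    "VendorID": rng.integers(1, 3, n),
})
df["tip_amount"] = 0.15 * df["fare_amount"] + rng.normal(0, 1, n)

# Split once so both models are scored on the same held-out rows.
train, test = train_test_split(df, test_size=0.3, random_state=42)

def holdout_r2(columns):
    """Fit a depth-8 tree on the given columns and return its test-set R²."""
    model = DecisionTreeRegressor(max_depth=8, random_state=42)
    model.fit(train[columns], train["tip_amount"])
    return r2_score(test["tip_amount"], model.predict(test[columns]))

all_features = [c for c in df.columns if c != "tip_amount"]
top3 = ["fare_amount", "tolls_amount", "trip_distance"]

r2_full = holdout_r2(all_features)
r2_top3 = holdout_r2(top3)
print(f"all features: R²={r2_full:.3f}, top 3 only: R²={r2_top3:.3f}")
```

Reusing one split for both models keeps the comparison fair, since any difference in R² then comes from the feature set rather than from which rows landed in the test set.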