End-to-end notebook for training gradient-boosted models (LightGBM & XGBoost) to predict whether a loan will be paid back. Includes lightweight EDA, feature engineering, cross-validated training, and submission file generation.
```
.
├── prediction_models_for_loan_payback.ipynb   # Main notebook
├── train.csv                                  # Training data
├── test.csv                                   # Test data
├── sample_submission.csv                      # Kaggle/competition template
├── lgbm_submission.csv                        # LightGBM predictions (created by the notebook)
├── xgb_submission.csv                         # XGBoost predictions (created by the notebook)
└── README.md
```
Columns (train):
- Numerical: `annual_income`, `debt_to_income_ratio`, `credit_score`, `loan_amount`, `interest_rate`, `loan_paid_back` (target)
- Categorical: `gender`, `marital_status`, `education_level`, `employment_status`, `loan_purpose`, `grade_subgrade`
Basic checks performed:
- Drop `id`
- Verify shapes (train: 593,994 × 12; test: 254,569 × 11)
- No nulls/duplicates detected
- Moderate class imbalance (~80% repaid)
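A minimal sketch of these checks (assuming the CSVs sit in the repo root; the notebook's exact bookkeeping may differ):

```python
import pandas as pd

train = pd.read_csv("train.csv").drop(columns=["id"])
test = pd.read_csv("test.csv")

print(train.shape, test.shape)            # verify shapes
print(train.isna().sum().sum())           # nulls: expect 0
print(train.duplicated().sum())           # duplicates: expect 0
print(train["loan_paid_back"].mean())     # class balance: ~0.80 repaid
```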
Tested with Python ≥ 3.10.
```bash
# 1) (optional) create a virtual environment
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate

# 2) install dependencies
pip install numpy pandas scikit-learn lightgbm xgboost scipy matplotlib seaborn

# 3) run the notebook
jupyter lab   # or: jupyter notebook
```

Common scientific Python stack plus LightGBM/XGBoost. Warnings are suppressed for cleaner logs.
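The warning suppression is typically a one-liner at the top of the notebook, e.g.:

```python
import warnings
warnings.filterwarnings("ignore")  # silence library warnings for cleaner notebook logs
```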
Reads `train.csv`, `test.csv`, `sample_submission.csv`, and the two submission files (if present).
- Summary statistics for numeric features
- Correlation heatmap
- Distribution plots & boxplots
- Skewness computation
- Target removal from numeric set before plotting correlations
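A sketch of the correlation and skewness steps, assuming `train` is loaded as in the checks above (plot styling in the notebook may differ):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Numeric view with the target removed before plotting correlations
numeric = train.select_dtypes("number").drop(columns=["loan_paid_back"])

sns.heatmap(numeric.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation heatmap (numeric features)")
plt.show()

# Skewness per numeric feature
print(numeric.skew().sort_values(ascending=False))
```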
- Skew handling: `log1p` on highly skewed numeric features (`annual_income`, `debt_to_income_ratio`)
- Outlier clipping: IQR clipping per numeric column (train & test aligned)
- Shift checks: train vs. test KDE overlays for key features
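One plausible implementation of the skew handling and IQR clipping, with clip bounds fitted on train and reused on test so the two stay aligned (`numeric` is the numeric-feature frame from the EDA sketch above):

```python
import numpy as np

# log1p on the highly skewed columns, applied to train and test alike
for col in ["annual_income", "debt_to_income_ratio"]:
    train[col] = np.log1p(train[col])
    test[col] = np.log1p(test[col])

# IQR clipping: bounds computed on train, applied to both frames
for col in numeric.columns:
    q1, q3 = train[col].quantile([0.25, 0.75])
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    train[col] = train[col].clip(lo, hi)
    test[col] = test[col].clip(lo, hi)
```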
Adds domain-inspired features to enrich signal (consolidated in the sketch after this list):

- Ratios / capacities:
  - `loan_to_income = loan_amount / (annual_income + 1)`
  - `total_debt = debt_to_income_ratio * annual_income`
  - `available_income = annual_income * (1 - debt_to_income_ratio)`
  - `affordability = available_income / (loan_amount + 1)`
  - `payment_to_income = monthly_payment / (annual_income/12 + 1)`
- Payment proxy:
  - `monthly_payment = loan_amount * (1 + interest_rate/100) / 12`
- Composite risk and interactions:
  - `risk_score = 40*dti + 30*(1 - credit_score/850) + 2*interest_rate`
  - `credit_interest = credit_score * interest_rate / 100`
  - `income_credit = log1p(annual_income) * credit_score / 1000`
  - `debt_loan = debt_to_income_ratio * log1p(loan_amount)`
- Log transforms:
  - `log_income = log1p(annual_income)`
  - `log_loan = log1p(loan_amount)`
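The formulas above, consolidated into one helper for illustration (`add_features` is a name introduced here, not necessarily the notebook's):

```python
import numpy as np

def add_features(df):
    """Derived features as listed above."""
    df = df.copy()
    dti = df["debt_to_income_ratio"]

    # Payment proxy (first, since payment_to_income depends on it)
    df["monthly_payment"] = df["loan_amount"] * (1 + df["interest_rate"] / 100) / 12

    # Ratios / capacities
    df["loan_to_income"] = df["loan_amount"] / (df["annual_income"] + 1)
    df["total_debt"] = dti * df["annual_income"]
    df["available_income"] = df["annual_income"] * (1 - dti)
    df["affordability"] = df["available_income"] / (df["loan_amount"] + 1)
    df["payment_to_income"] = df["monthly_payment"] / (df["annual_income"] / 12 + 1)

    # Composite risk and interactions
    df["risk_score"] = 40 * dti + 30 * (1 - df["credit_score"] / 850) + 2 * df["interest_rate"]
    df["credit_interest"] = df["credit_score"] * df["interest_rate"] / 100
    df["income_credit"] = np.log1p(df["annual_income"]) * df["credit_score"] / 1000
    df["debt_loan"] = dti * np.log1p(df["loan_amount"])

    # Log transforms
    df["log_income"] = np.log1p(df["annual_income"])
    df["log_loan"] = np.log1p(df["loan_amount"])
    return df

train = add_features(train)
test = add_features(test)
```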
`LabelEncoder` for all categorical columns; fitted on train and applied to test (ensures alignment).
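A minimal sketch of that encoding step (assumes every test category also appears in train, which the alignment note implies):

```python
from sklearn.preprocessing import LabelEncoder

cat_cols = ["gender", "marital_status", "education_level",
            "employment_status", "loan_purpose", "grade_subgrade"]

for col in cat_cols:
    le = LabelEncoder()
    train[col] = le.fit_transform(train[col])
    test[col] = le.transform(test[col])  # would raise on categories unseen in train
```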
Cross-validation: `StratifiedKFold` (5 folds) scored with ROC-AUC; a generic sketch of the loop follows the model list below.
- LightGBM (GBDT): tuned parameters (e.g., `n_estimators=1320`, `num_leaves=93`, `max_depth=5`, `learning_rate=0.05`, plus subsampling/regularization).
  - Reported fold AUCs: [0.9235, 0.9239, 0.9220, 0.9230, 0.9221]
  - OOF ROC-AUC: 0.92291
- XGBoost (hist): tuned parameters (`max_depth=6`, `n_estimators=732`, `learning_rate≈0.0669`, regularization, `max_bin=504`, etc.).
  - 5-fold CV performed; per-fold AUCs are printed in the notebook logs.
- Fit final LightGBM on the full train set.
- Generate test probabilities for both models.
- Save submissions:
  - `lgbm_submission.csv` with `loan_paid_back` probabilities from LightGBM
  - `xgb_submission.csv` with `loan_paid_back` probabilities from XGBoost

Both files follow the `sample_submission.csv` schema: `id`, `loan_paid_back`.
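Continuing from the CV sketch, a hedged sketch of the final fit and submission write (the XGBoost file follows the same pattern):

```python
import pandas as pd
import lightgbm as lgb

# Fit the final LightGBM on the full training set (same parameters as in CV)
final_lgbm = lgb.LGBMClassifier(
    n_estimators=1320, num_leaves=93, max_depth=5,
    learning_rate=0.05, random_state=42,
)
final_lgbm.fit(X, y)

# Fill the sample_submission template and write it to disk
sub = pd.read_csv("sample_submission.csv")
sub["loan_paid_back"] = final_lgbm.predict_proba(test[X.columns])[:, 1]
sub.to_csv("lgbm_submission.csv", index=False)
# xgb_submission.csv is produced the same way from the fitted XGBoost model
```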
- Place `train.csv`, `test.csv`, and `sample_submission.csv` in the repo root.
- Open `prediction_models_for_loan_payback.ipynb`.
- Run all cells top-to-bottom.
  - CV metrics print to the console.
  - `lgbm_submission.csv` and `xgb_submission.csv` are written to disk.
- Imbalance: The target is ~80/20; ROC-AUC is the primary metric to avoid accuracy pitfalls.
- Categoricals: `grade_subgrade` is informative; label encoding is simple, so consider target or CatBoost encoding for potential gains.
- Reproducibility: `random_state=42` is used across models/splits.
- Performance: with the provided settings, LightGBM achieves ~0.923 OOF ROC-AUC.
- Calibrated probabilities (Platt/Isotonic) and threshold tuning for business KPIs
- Target/cat encoding for high-cardinality categories
- Feature interaction search (e.g., polynomial on select ratios)
- SHAP analysis for interpretability & bias checks
- Model ensembling / stacking (blend LGBM & XGB)
- Hyperparameter search with Optuna
- Robust pipelines with sklearn's `ColumnTransformer` + `Pipeline`
- Train/serving parity and model versioning
Specify a license here (e.g., MIT). If omitted, the project is “all rights reserved” by default.