This project develops a credit risk model using HistGradientBoosting to prioritize clients with a high probability of default. The goal is to improve decision-making in credit approval and focus efforts on higher-risk applicants while maintaining operational efficiency.
Dataset: Home Credit Default Risk (application_train.csv, application_test.csv, bureau, previous_application, etc.)
Tools & Libraries:
- Python
- pandas, numpy
- matplotlib, seaborn
- scikit-learn
- SHAP
We began with Exploratory Data Analysis (EDA) and used permutation importance to select the 11 features that most influence the target variable TARGET.
Missing values were inspected, notably in EXT_SOURCE_1 (~56% missing). We tested imputing the median and adding missing flags, but the ROC-AUC showed minimal change (~0.0001 difference), so imputation was deemed optional.
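The imputation experiment can be sketched as follows. This is a minimal illustration rather than the project's actual code: the toy array stands in for EXT_SOURCE_1, and combining the median fill and the missing flag through `SimpleImputer(add_indicator=True)` is an assumption about how the two steps might be implemented.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column standing in for EXT_SOURCE_1 (values are illustrative only).
X = np.array([[0.2], [np.nan], [0.6], [np.nan], [0.4]])

# Median imputation plus a binary "was missing" indicator column.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X)

filled = X_imputed[:, 0]        # original values, NaN replaced by the column median
missing_flag = X_imputed[:, 1]  # 1.0 where the value was missing, 0.0 otherwise
```

HistGradientBoostingClassifier handles NaN values natively, which is consistent with the negligible ROC-AUC difference reported above.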
Feature selection and correlation:
- Selected top 11 features using permutation importance.
- Verified that removing features slightly reduces ROC-AUC, confirming importance of selected features.
Image placeholder: permutation feature importance plot
The model was trained using a Pipeline with a ColumnTransformer to process numerical and categorical features:
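```python
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline

# `preprocessor` is the ColumnTransformer mentioned above (defined elsewhere in the project).
pipe = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', HistGradientBoostingClassifier(
        max_iter=300,
        learning_rate=0.05,
        random_state=42
    ))
])
```

The trained model was evaluated using several key metrics: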
- ROC-AUC: 0.75
- PR-AUC: 0.23 (baseline: 0.08)
- Lift@10%: 3.33
- KS: 0.37
These metrics indicate that the model ranks clients by default risk well above chance: PR-AUC is nearly three times the 0.08 baseline, and a Lift@10% of 3.33 means the top-scored decile contains about a third of all defaulters.
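For reference, Lift@10% and KS can be computed directly from labels and scores. The functions below are a minimal sketch using common definitions of both metrics, not necessarily the exact code behind the numbers above; the function names and toy data are illustrative.

```python
import numpy as np

def lift_at_k(y_true, y_score, k=0.10):
    """Default rate in the top-k fraction by score, divided by the overall default rate."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n_top = max(1, int(np.ceil(k * len(y_true))))
    top_idx = np.argsort(-y_score, kind="stable")[:n_top]
    return float(y_true[top_idx].mean() / y_true.mean())

def ks_statistic(y_true, y_score):
    """Largest gap between the cumulative capture of defaulters and non-defaulters.

    Evaluated after every ranked client, so it assumes scores are not heavily tied.
    """
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score), kind="stable")
    y_sorted = y_true[order]
    cum_pos = np.cumsum(y_sorted) / y_sorted.sum()
    cum_neg = np.cumsum(1 - y_sorted) / (1 - y_sorted).sum()
    return float(np.max(np.abs(cum_pos - cum_neg)))

# Toy example: 2 defaulters among 10 clients, scores already sorted from high to low.
y_demo = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
s_demo = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
lift_demo = lift_at_k(y_demo, s_demo, k=0.2)
ks_demo = ks_statistic(y_demo, s_demo)
```

ROC-AUC and PR-AUC are available in scikit-learn as `roc_auc_score` and `average_precision_score`. The PR-AUC baseline is the share of positives in the data, which is why it is reported next to the score.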
The model’s decision threshold was carefully analyzed to balance business objectives and operational constraints. Different strategies were considered:
- Target Recall: Capture at least 40% of defaults.
  - Operationally infeasible due to reviewing a very large portion of clients.
- Operational Capacity: Only a fixed percentage of clients can be reviewed (e.g., 10%).
  - High precision but low recall; useful if intervention costs are high.
- Maximum F1: Optimizes the balance between precision and recall.
  - Provides a reasonable recall with acceptable precision and manageable intervention population.
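The three strategies above can each be expressed as a rule for choosing a score cut-off. The sketch below is illustrative only: the function names and toy data are hypothetical, and the simple loops favor readability over speed (scikit-learn's `precision_recall_curve` is the more efficient route on real data).

```python
import numpy as np

def precision_recall_at(y_true, y_score, threshold):
    """Precision and recall when every score >= threshold is flagged."""
    y_true = np.asarray(y_true)
    flagged = np.asarray(y_score) >= threshold
    tp = np.sum(flagged & (y_true == 1))
    precision = tp / flagged.sum() if flagged.sum() else 0.0
    recall = tp / (y_true == 1).sum()
    return float(precision), float(recall)

def threshold_for_recall(y_true, y_score, target_recall=0.40):
    """Highest threshold whose recall reaches the target."""
    for t in np.unique(y_score)[::-1]:
        if precision_recall_at(y_true, y_score, t)[1] >= target_recall:
            return float(t)
    return float(np.min(y_score))

def threshold_for_capacity(y_score, capacity=0.10):
    """Cut-off that flags roughly the top `capacity` fraction of clients."""
    return float(np.quantile(y_score, 1 - capacity))

def threshold_for_max_f1(y_true, y_score):
    """Threshold with the highest F1 among the observed scores."""
    best_t, best_f1 = None, -1.0
    for t in np.unique(y_score):
        p, r = precision_recall_at(y_true, y_score, t)
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1

# Toy data: 3 defaulters among 10 clients.
y_demo = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
s_demo = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
t_recall = threshold_for_recall(y_demo, s_demo, target_recall=0.40)
t_capacity = threshold_for_capacity(s_demo, capacity=0.10)
t_f1, f1_value = threshold_for_max_f1(y_demo, s_demo)
```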
A profit-based approach was also applied, defining:
- Benefit per True Positive: 100
- Cost per False Positive: 10
The threshold maximizing expected profit was identified at 0.089, achieving a gain of 247,330 units.
At this operational point:
- Recall: 0.63 → captures 63% of actual defaulters.
- Precision: 0.17 → roughly twice the 0.08 base rate, acceptable given the severe class imbalance.
- Predicted positive rate: ≈ 29.5%.
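The profit-based search can be sketched as a scan over candidate thresholds. The benefit and cost values come from the text; the function names and toy data are hypothetical, and the real analysis may have used a different threshold grid.

```python
import numpy as np

BENEFIT_TP = 100  # benefit per true positive (from the text)
COST_FP = 10      # cost per false positive (from the text)

def profit_at(y_true, y_score, threshold):
    """Total profit when every client with score >= threshold is flagged."""
    y_true = np.asarray(y_true)
    flagged = np.asarray(y_score) >= threshold
    tp = int(np.sum(flagged & (y_true == 1)))
    fp = int(np.sum(flagged & (y_true == 0)))
    return BENEFIT_TP * tp - COST_FP * fp

def best_profit_threshold(y_true, y_score):
    """Scan every observed score as a candidate cut-off and keep the most profitable."""
    candidates = np.unique(y_score)
    profits = [profit_at(y_true, y_score, t) for t in candidates]
    best = int(np.argmax(profits))
    return float(candidates[best]), profits[best]

# Toy data: 3 defaulters among 10 clients.
y_demo = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
s_demo = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
best_threshold, best_profit = best_profit_threshold(y_demo, s_demo)
```

With a 100:10 benefit-to-cost ratio, flagging a client pays off in expectation once the default probability exceeds 10 / (100 + 10) ≈ 0.091. If the scores were well calibrated, the optimal threshold would sit near that value, which is close to the 0.089 reported above.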
To understand model predictions at both global and individual levels, SHAP (SHapley Additive exPlanations) was applied.
- Features such as EXT_SOURCE_2, EXT_SOURCE_3, DAYS_BIRTH, AMT_CREDIT, and DAYS_EMPLOYED were identified as the most influential.
- Contributions are moderate and no single variable dominates the predictions, indicating balanced signal across the dataset.
- High-risk client: Driven primarily by low values in EXT_SOURCE_2 and EXT_SOURCE_3, showing high confidence for default prediction.
- Borderline client: Risk is spread across multiple features, indicating lower model certainty and a good candidate for manual review.
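The individual explanations rest on SHAP's additive decomposition. As a reminder of the standard formulation (not something specific to this project), the model output for a client $x$ with $M$ features is written as:

```latex
f(x) = \phi_0 + \sum_{j=1}^{M} \phi_j(x)
```

Here $\phi_0$ is the average model output over the background data and $\phi_j(x)$ is the contribution of feature $j$ for that client. For tree-based classifiers the output $f(x)$ is commonly expressed in log-odds rather than probability; which scale the project's plots use is not stated, so that is an assumption. Under this reading, the high-risk client corresponds to a few large positive $\phi_j$ (from EXT_SOURCE_2 and EXT_SOURCE_3), while the borderline client corresponds to many small contributions with no dominant term.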
The model was applied to the test set to identify the top 10% of clients by predicted risk.
- EXT_SOURCE_2 and EXT_SOURCE_3: Top-risk clients show significantly lower values, indicating weaker credit history and financial reliability.
- AMT_CREDIT: Moderate credit amounts requested, typical of mid-to-high risk clients.
- DAYS_BIRTH: Average age of top-risk clients is 30–36 years, suggesting shorter credit history and potential income volatility.
- DAYS_EMPLOYED: Shorter employment tenure, indicating lower income stability.
- Missing values in EXT_SOURCE features are more prevalent in the top-risk segment, which aligns with higher observed default rates.
- Default rate in top 10% risk segment: 22.9%
- Default rate in overall population: 19.8%
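A segment profile like the one above can be produced by flagging the top decile of scores and comparing feature averages. The sketch below uses a toy DataFrame: the column names follow the Home Credit schema, but the values and the `segment` labels are made up for illustration.

```python
import pandas as pd

# Toy scored data (illustrative values only).
scored = pd.DataFrame({
    "risk_score":   [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.90],
    "EXT_SOURCE_2": [0.70, 0.65, 0.60, 0.62, 0.55, 0.50, 0.52, 0.45, 0.40, 0.10],
    "DAYS_BIRTH":   [-20000, -18000, -17000, -16000, -15500,
                     -15000, -14000, -13500, -13000, -12000],
})

# DAYS_BIRTH is stored as negative days relative to the application date.
scored["AGE_YEARS"] = -scored["DAYS_BIRTH"] / 365.25

# Flag the top 10% of clients by predicted risk.
cutoff = scored["risk_score"].quantile(0.90)
is_top = scored["risk_score"] >= cutoff
scored["segment"] = is_top.map({True: "top_10pct", False: "rest"})

# Compare average feature values between the top-risk segment and the rest.
profile = scored.groupby("segment")[["EXT_SOURCE_2", "AGE_YEARS"]].mean()
```

When labels are available, adding the target column to the same `groupby` gives the default rate per segment.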
A credit risk model was developed using HistGradientBoosting to prioritize clients with a high probability of default.
- ROC-AUC: 0.75
- PR-AUC: 0.23 (baseline: 0.08)
- Lift@10%: 3.33
- KS: 0.37
Business Impact:
- The model captures ~33% of defaults by reviewing only 10% of the population, significantly improving risk review efficiency.
- Profit curve analysis identified an operational threshold of ~0.089, maximizing expected benefit.
- High-risk clients (top 10% predicted risk) should be subject to stricter credit approval policies.
- Borderline applicants may require additional verification or reduced credit limits.
- Low-risk clients can benefit from faster approvals to enhance customer acquisition.
- EXT_SOURCE_2 and EXT_SOURCE_3 are the most influential predictors.
- DAYS_BIRTH, AMT_CREDIT, DAYS_EMPLOYED also contribute significantly.
- Lower EXT_SOURCE values strongly increase predicted default risk.
- The model consistently ranks clients by default risk.
- Accumulated gain analysis confirms the model concentrates defaults in the top-ranked population compared to random selection.
- High-risk segment exhibits a default rate of 22.9% versus 19.8% in the overall training population.
- Probability calibration improvements
- Advanced handling of class imbalance
- Feature engineering from bureau and previous application tables
- Model comparison with other gradient boosting methods
The trained model was applied to the app_test dataset to generate risk probabilities (risk_score) for each client.
Image placeholder: risk_score distribution
- Name: Bastián Burgos / bastianb-analytics
- Email: bastian.burgos.c@gmail.com