Credit risk modeling with HistGradientBoosting, featuring evaluation, SHAP explainability, threshold optimization, and high-risk client analysis.

bastianb-analytics/Credit-Risk

Credit Risk Prediction – HistGradientBoosting Model

Overview

This project develops a credit risk model using HistGradientBoosting to prioritize clients with a high probability of default. The goal is to improve decision-making in credit approval and focus efforts on higher-risk applicants while maintaining operational efficiency.

Dataset: Home Credit Default Risk (application_train.csv, application_test.csv, bureau, previous_application, etc.)

Tools & Libraries:

  • Python
  • pandas, numpy
  • matplotlib, seaborn
  • scikit-learn
  • SHAP

Data Exploration and Preprocessing

We began by analyzing the dataset with Exploratory Data Analysis (EDA) and then selected the 11 features with the strongest influence on the target variable TARGET using permutation importance.

Missing values were inspected, notably in EXT_SOURCE_1 (~56% missing). We tested median imputation combined with missing-value flags, but ROC-AUC changed by only ~0.0001, so imputation was deemed optional.
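The imputation check described above can be sketched with scikit-learn's SimpleImputer, whose add_indicator option appends the missing-value flag automatically (the toy values below stand in for EXT_SOURCE_1 and are purely illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with ~50% missing values, standing in for EXT_SOURCE_1
x = np.array([[0.4], [np.nan], [0.7], [np.nan]])

# Median imputation; add_indicator=True appends a binary "was missing" flag
imputer = SimpleImputer(strategy="median", add_indicator=True)
x_imp = imputer.fit_transform(x)

# Each row now holds [imputed_value, missing_flag]
print(x_imp)
```

Comparing ROC-AUC with and without this step is what showed the ~0.0001 difference reported above.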

Feature selection and correlation:

  • Selected top 11 features using permutation importance.
  • Verified that removing features slightly reduces ROC-AUC, confirming the importance of the selected features.

Image placeholder: feature importance (permutation) plot


Model Training

The model was trained using a Pipeline with a ColumnTransformer to process numerical and categorical features:

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline

# Preprocessing and model combined in a single pipeline
pipe = Pipeline(steps=[
    ('preprocess', preprocessor),  # ColumnTransformer for numeric/categorical features
    ('model', HistGradientBoostingClassifier(
        max_iter=300,
        learning_rate=0.05,
        random_state=42
    ))
])
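The preprocessor referenced in the pipeline is not shown above; a minimal sketch could look like the following (the column lists are illustrative, not the actual 11 selected features):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative column lists; the real project uses the 11 selected features
num_cols = ["EXT_SOURCE_2", "AMT_CREDIT"]
cat_cols = ["NAME_CONTRACT_TYPE"]

preprocessor = ColumnTransformer(transformers=[
    # HistGradientBoosting handles NaNs natively, so numerics pass through
    ("num", "passthrough", num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# Toy rows to show the transformed output shape
toy = pd.DataFrame({
    "EXT_SOURCE_2": [0.55, None],
    "AMT_CREDIT": [200000.0, 450000.0],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans"],
})
out = preprocessor.fit_transform(toy)
print(out.shape)  # 2 rows: 2 numeric + 2 one-hot columns
```

Passing numeric columns through unimputed is deliberate: HistGradientBoostingClassifier supports missing values natively, which is consistent with imputation being optional here.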

Model Evaluation Metrics

The trained model was evaluated using several key metrics:

  • ROC-AUC: 0.75
  • PR-AUC: 0.23 (baseline: 0.08)
  • Lift@10%: 3.33
  • KS: 0.37

These metrics indicate that the model can effectively rank clients according to default risk, capturing a significant proportion of defaulters while maintaining discriminative power over the minority class.
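ROC-AUC and PR-AUC come directly from scikit-learn, while Lift@k and KS need a few lines of custom code; a minimal sketch on synthetic scores (the printed numbers are illustrative and will differ from the model's reported values):

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import average_precision_score, roc_auc_score

def lift_at_k(y_true, y_score, k=0.10):
    """Default rate in the top-k scored fraction, relative to the base rate."""
    n = int(np.ceil(len(y_true) * k))
    top = np.argsort(y_score)[::-1][:n]
    return y_true[top].mean() / y_true.mean()

def ks_statistic(y_true, y_score):
    """Max gap between the score distributions of positives and negatives."""
    return ks_2samp(y_score[y_true == 1], y_score[y_true == 0]).statistic

# Synthetic scores with an ~8% positive base rate, as in the data
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.08, size=5000)
scores = np.where(y == 1, rng.normal(0.6, 0.2, 5000),
                  rng.normal(0.4, 0.2, 5000))

auc = roc_auc_score(y, scores)            # ROC-AUC
pr = average_precision_score(y, scores)   # PR-AUC
lift = lift_at_k(y, scores)               # Lift@10%
ks = ks_statistic(y, scores)              # KS
print(auc, pr, lift, ks)
```

PR-AUC is compared against the positive base rate (the 0.08 baseline above), since a random ranker achieves roughly that value under heavy class imbalance.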

Image placeholder: ROC curve and PR curve

Threshold Optimization

The model’s decision threshold was carefully analyzed to balance business objectives and operational constraints. Different strategies were considered:

  • Target Recall: Capture at least 40% of defaults.
    • Operationally infeasible due to reviewing a very large portion of clients.
  • Operational Capacity: Only a fixed percentage of clients can be reviewed (e.g., 10%).
    • High precision but low recall; useful if intervention costs are high.
  • Maximum F1: Optimizes the balance between precision and recall.
    • Provides a reasonable recall with acceptable precision and manageable intervention population.

A profit-based approach was also applied, defining:

  • Benefit per True Positive: 100
  • Cost per False Positive: 10

The threshold maximizing expected profit was identified at 0.089, achieving a gain of 247,330 units.

At this operational point:

  • Recall: 0.63 → captures the riskiest clients.
  • Precision: 0.17 → acceptable given the severe class imbalance.
  • Predicted positive rate: ≈ 29.5%.
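The profit-based threshold search can be sketched as a scan over candidate thresholds, scoring each by the benefit/cost figures defined above (the toy scores below are illustrative):

```python
import numpy as np

def best_profit_threshold(y_true, y_score, benefit_tp=100, cost_fp=10):
    """Scan candidate thresholds; return the one maximizing expected profit."""
    best_t, best_profit = 0.5, -np.inf
    for t in np.unique(y_score):
        pred = y_score >= t
        tp = int(np.sum(pred & (y_true == 1)))
        fp = int(np.sum(pred & (y_true == 0)))
        profit = benefit_tp * tp - cost_fp * fp
        if profit > best_profit:
            best_t, best_profit = t, profit
    return best_t, best_profit

# Toy example: two defaulters scored high, one non-defaulter in between
y_true = np.array([1, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.1])
t, gain = best_profit_threshold(y_true, y_score)
print(t, gain)  # 0.8 200
```

On the real validation scores, the same scan is what yields the ~0.089 threshold and the reported gain of 247,330 units.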
Image placeholder: profit curve

Explainability Analysis (SHAP)

To understand model predictions at both global and individual levels, SHAP (SHapley Additive exPlanations) was applied.

Global Feature Importance

  • Features such as EXT_SOURCE_2, EXT_SOURCE_3, DAYS_BIRTH, AMT_CREDIT, and DAYS_EMPLOYED were identified as the most influential.
  • Contributions are moderate and no single variable dominates the predictions, indicating balanced signal across the dataset.
Image placeholder: SHAP summary plot

Individual Case Analysis

  • High-risk client: Driven primarily by low values in EXT_SOURCE_2 and EXT_SOURCE_3, showing high confidence for default prediction.
  • Borderline client: Risk is spread across multiple features, indicating lower model certainty and a good candidate for manual review.
Image placeholders: SHAP plots for a high-risk client and a borderline client

High-Risk Population Analysis

The model was applied to the test set to identify the top 10% of clients by predicted risk.
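Selecting the top decile can be sketched with a quantile cutoff on the predicted scores (toy random scores below stand in for the model's output):

```python
import numpy as np
import pandas as pd

# Toy risk scores standing in for the model's predicted probabilities
rng = np.random.default_rng(42)
scores = pd.Series(rng.random(1000), name="risk_score")

# Clients at or above the 90th percentile form the top-10% risk segment
cutoff = scores.quantile(0.90)
top_decile = scores[scores >= cutoff]

print(len(top_decile))  # ~10% of clients
```

The feature distributions and default rates reported below are computed on this segment versus the full population.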

Key Observations

  • EXT_SOURCE_2 and EXT_SOURCE_3: Top-risk clients show significantly lower values, indicating weaker credit history and financial reliability.
  • AMT_CREDIT: Moderate credit amounts requested, typical of mid-to-high risk clients.
  • DAYS_BIRTH: Average age of top-risk clients is 30–36 years, suggesting shorter credit history and potential income volatility.
  • DAYS_EMPLOYED: Shorter employment tenure, indicating lower income stability.
Image placeholder: feature distributions in the top-risk segment

Statistical Summary

  • Missing values in EXT_SOURCE features are more prevalent in the top-risk segment, which aligns with higher observed default rates.
  • Default rate in top 10% risk segment: 22.9%
  • Default rate in overall population: 19.8%
Image placeholder: descriptive statistics (quartiles) for the top-risk segment

Executive Summary & Business Recommendations

A credit risk model was developed using HistGradientBoosting to prioritize clients with a high probability of default.

Model Performance

  • ROC-AUC: 0.75
  • PR-AUC: 0.23 (baseline: 0.08)
  • Lift@10%: 3.33
  • KS: 0.37

Business Impact:

  • The model captures ~33% of defaults by reviewing only 10% of the population, significantly improving risk review efficiency.
  • Profit curve analysis identified an operational threshold of ~0.089, maximizing expected benefit.

Business Recommendations

  • High-risk clients (top 10% predicted risk) should be subject to stricter credit approval policies.
  • Borderline applicants may require additional verification or reduced credit limits.
  • Low-risk clients can benefit from faster approvals to enhance customer acquisition.

Key Risk Drivers

  • EXT_SOURCE_2 and EXT_SOURCE_3 are the most influential predictors.
  • DAYS_BIRTH, AMT_CREDIT, DAYS_EMPLOYED also contribute significantly.
  • Lower EXT_SOURCE values strongly increase predicted default risk.

Final Conclusions

  • The model consistently ranks clients by default risk.
  • Accumulated gain analysis confirms the model concentrates defaults in the top-ranked population compared to random selection.
  • High-risk segment exhibits a default rate of 22.9% versus 19.8% in the overall training population.

Limitations & Future Work

  • Probability calibration improvements
  • Advanced handling of class imbalance
  • Feature engineering from bureau and previous application tables
  • Model comparison with other gradient boosting methods

Submission & Contact

Model Predictions on Test Data

The trained model was applied to the app_test dataset to generate risk probabilities (risk_score) for each client.
Image placeholder: distribution of predicted risk_score

Contact
