Handling Class Imbalance in Random Forest
Python Scikit-learn SMOTE Classification
Context
Standard Random Forest classifiers degrade on the imbalanced datasets common in fraud detection and credit risk scoring: the majority class dominates each bootstrap sample, so minority-class recall collapses.
Data & Modeling
Systematically compared five class-imbalance strategies, SMOTE, ADASYN, Tomek links, cost-sensitive learning, and ensemble balancing, across imbalance ratios from 1:10 to 1:100.
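A minimal sketch of how such a comparison harness can be wired up with imbalanced-learn pipelines; the resamplers and models below are the standard library classes, but the exact configurations used in this project are assumptions.

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import TomekLinks
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# Each strategy is an imbalanced-learn pipeline so resampling happens
# inside each CV training fold only, never on the held-out fold.
strategies = {
    "baseline": Pipeline([("clf", RandomForestClassifier(random_state=0))]),
    "smote": Pipeline([
        ("resample", SMOTE(random_state=0)),
        ("clf", RandomForestClassifier(random_state=0)),
    ]),
    "adasyn": Pipeline([
        ("resample", ADASYN(random_state=0)),
        ("clf", RandomForestClassifier(random_state=0)),
    ]),
    "tomek": Pipeline([
        ("resample", TomekLinks()),  # undersampling via boundary cleaning
        ("clf", RandomForestClassifier(random_state=0)),
    ]),
    # Cost-sensitive learning: reweight classes instead of resampling.
    "cost_sensitive": Pipeline([
        ("clf", RandomForestClassifier(class_weight="balanced", random_state=0)),
    ]),
    # Ensemble balancing: each tree trains on a balanced bootstrap sample.
    "balanced_rf": Pipeline([
        ("clf", BalancedRandomForestClassifier(random_state=0)),
    ]),
}
```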
Results
Cost-sensitive RF combined with SMOTE achieved a 15–20% F1 improvement over the baseline Random Forest on highly skewed datasets.
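A sketch of that winning combination, assuming SMOTE oversampling inside the pipeline plus class weighting on the forest; the sampling ratio and hyperparameters shown are placeholders, not the tuned values.

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Combine both remedies: SMOTE rebalances the training folds, while
# class_weight penalizes minority-class errors more heavily at fit time.
cost_sensitive_smote_rf = Pipeline([
    ("smote", SMOTE(sampling_strategy=0.5, random_state=0)),  # assumed ratio
    ("rf", RandomForestClassifier(
        n_estimators=500,            # placeholder, not the tuned value
        class_weight="balanced",
        random_state=0,
    )),
])
```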
Takeaways
Next steps: extend the comparison to gradient-boosted ensembles and evaluate on real-world credit default data; a possible starting point is sketched below.
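One way the gradient-boosting extension could look (an assumption, not part of the reported experiments): scikit-learn's HistGradientBoostingClassifier accepts class_weight directly, so the same cost-sensitive + SMOTE recipe carries over.

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import HistGradientBoostingClassifier

# Same recipe with a gradient-boosted model; class_weight on
# HistGradientBoostingClassifier requires scikit-learn >= 1.2.
smote_gbdt = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("gbdt", HistGradientBoostingClassifier(class_weight="balanced",
                                            random_state=0)),
])
```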
Evaluation
- 5-fold stratified CV across imbalance ratios (1:10 to 1:100); see the sketch after this list
- Metrics: F1, ROC AUC, and precision-recall AUC
- Tested on synthetic and real-world credit datasets
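A sketch of the evaluation protocol above; the synthetic-data parameters are illustrative, and `strategies` refers to the hypothetical pipeline dictionary sketched earlier.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = {"f1": "f1", "roc_auc": "roc_auc", "pr_auc": "average_precision"}

# Sweep minority fractions roughly spanning 1:10 to 1:100 imbalance.
for minority_frac in (0.09, 0.05, 0.02, 0.01):
    X, y = make_classification(
        n_samples=20_000, n_features=20,        # illustrative sizes
        weights=[1 - minority_frac, minority_frac],
        random_state=0,
    )
    for name, pipe in strategies.items():
        scores = cross_validate(pipe, X, y, cv=cv, scoring=scoring)
        print(name, minority_frac,
              {k: scores[f"test_{k}"].mean() for k in scoring})
```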