
EVALUATING RANDOM FOREST AND XGBOOST FOR BANK CUSTOMER CHURN PREDICTION ON IMBALANCED DATA USING SMOTE AND SMOTE-ENN

*Reyuli Andespa - Department of Statistics and Data Science, Bogor Agricultural University (IPB), Jalan Raya Dramaga, Kampus IPB Darmaga, Bogor, 16680, Jawa Barat, Indonesia
Kusman Sadik - Department of Statistics and Data Science, Bogor Agricultural University (IPB), Jalan Raya Dramaga, Kampus IPB Darmaga, Bogor, 16680, Jawa Barat, Indonesia
Cici Suhaeni - Department of Statistics and Data Science, Bogor Agricultural University (IPB), Jalan Raya Dramaga, Kampus IPB Darmaga, Bogor, 16680, Jawa Barat, Indonesia
Agus M Soleh - Department of Statistics and Data Science, Bogor Agricultural University (IPB), Jalan Raya Dramaga, Kampus IPB Darmaga, Bogor, 16680, Jawa Barat, Indonesia
Open Access Copyright (c) 2025 MEDIA STATISTIKA under http://creativecommons.org/licenses/by-nc-sa/4.0.

Abstract
The banking industry faces significant challenges in retaining customers, as churn can critically affect both revenue and reputation. This study introduces a robust churn prediction framework by comparing the performance of XGBoost and Random Forest algorithms under imbalanced data conditions. The novelty of this research lies in integrating the SMOTE and SMOTE-ENN techniques with machine learning algorithms to enhance model performance and reliability on highly imbalanced datasets. Unlike conventional approaches that rely solely on oversampling or undersampling, this study demonstrates that the hybrid combination of XGBoost and SMOTE provides superior predictive accuracy, stability, and efficiency. Hyperparameter optimization using GridSearchCV was conducted to identify the most effective parameter configurations for both algorithms. Model performance was evaluated using the F1-Score and Area Under the Curve (AUC). The results indicate that XGBoost with SMOTE achieved the best performance, with an F1-Score of 0.8730 and an AUC of 0.9828, showing an optimal balance between precision and recall. Feature importance analysis identified Months_Inactive_12_mon, Total_Trans_Amt, and Total_Relationship_Count as the most influential predictors. Overall, this approach outperforms traditional resampling and modeling techniques, providing practical insights for data-driven customer retention strategies in the banking industry.
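The two ideas the abstract leans on, SMOTE oversampling and the F1-Score, can be illustrated with a small standard-library sketch. This is not the authors' supplementary script and does not use the libraries the study presumably relied on; the function names `smote_sketch` and `f1_score`, the toy data, and the defaults `k=5` and `seed=0` are all assumptions made for illustration. SMOTE is shown in its basic form: each synthetic minority point is placed at a random position on the line segment between a minority sample and one of its k nearest minority neighbours. The F1-Score is computed as 2PR / (P + R) from precision P and recall R.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Illustrative SMOTE: interpolate between a minority sample and one
    of its k nearest minority-class neighbours (hypothetical helper)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        idx = rng.randrange(len(minority))
        base = minority[idx]
        # Rank the other minority points by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != idx]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2PR / (P + R), the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy minority class (churned customers) with two made-up numeric features.
toy_minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
toy_synthetic = smote_sketch(toy_minority, n_new=6, k=3)
```

In practice resampling of this kind is applied to the training split only, so that synthetic points never leak into the data used to compute F1-Score and AUC.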

Note: This article has supplementary file(s).

Supplementary files:
Data Set: Data Bank Customer (Subject: Bank Customer Churn; 1MB)
Other: Script Python (Subject: Customer churn; XGBoost; Random Forest; SMOTE; Imbalanced Data; 2MB)
Keywords: Customer Churn; XGBoost; Random Forest; SMOTE; Imbalanced Data.

