
COMPARISON OF SMOTE RANDOM FOREST AND SMOTE K-NEAREST NEIGHBORS CLASSIFICATION ANALYSIS ON IMBALANCED DATA

*Jus Prasetya - Master's Program in Mathematics, Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Sekip Utara BLS 21, Yogyakarta 55281, Indonesia
Abdurakhman Abdurakhman - Department of Mathematics, Universitas Gadjah Mada, Indonesia
Open Access. Copyright (c) 2022 MEDIA STATISTIKA, licensed under CC BY-NC-SA 4.0 (http://creativecommons.org/licenses/by-nc-sa/4.0/).

Abstract
In machine learning, classification analysis aims to minimize misclassification and maximize prediction accuracy. The defining characteristic of an imbalanced classification problem is that one class significantly outnumbers the others. SMOTE (Synthetic Minority Oversampling Technique) interpolates between existing minority-class samples to generate new synthetic samples. Random forest is a classification method built from an ensemble of mutually independent classification trees; K-Nearest Neighbors is a classification method that labels a new sample according to the classes of its nearest neighbors. In this study, SMOTE enlarges the minority class, class 1 (cervical cancer), to 585 observations, bringing the dataset to 1,208 samples in total. SMOTE random forest achieved an accuracy of 96.28%, sensitivity of 99.17%, specificity of 93.44%, precision of 93.70%, and AUC of 96.30%. SMOTE K-Nearest Neighbors achieved an accuracy of 87.60%, sensitivity of 77.50%, specificity of 97.54%, precision of 96.88%, and AUC of 82.27%. SMOTE random forest therefore yields an excellent classification model and SMOTE K-Nearest Neighbors a good one, whereas random forest and K-Nearest Neighbors applied directly to the imbalanced data yield failed classification models.
Keywords: Machine Learning; Classification; SMOTE; Random Forest; k-Nearest Neighbors
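
As an illustration of the pipeline described in the abstract, the sketch below oversamples the minority class with SMOTE, fits random forest and K-Nearest Neighbors classifiers, and computes the five reported metrics. This is a minimal sketch, not the authors' code (their analysis was done in R): it assumes Python with scikit-learn and imbalanced-learn, uses a synthetic stand-in for the cervical cancer data, and leaves hyperparameters at their defaults.

```python
# Minimal sketch (not the authors' code): SMOTE oversampling followed by
# random forest and k-NN classification, evaluated with the five metrics
# reported in the abstract. Assumes scikit-learn and imbalanced-learn.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the cervical cancer data: roughly 95% class 0
# (healthy) and 5% class 1 (cervical cancer).
X, y = make_classification(n_samples=700, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# SMOTE interpolates between each minority sample and its nearest minority
# neighbors until both classes are the same size. It is applied to the
# training split only, so no synthetic points leak into the test set.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

def report(name, model):
    """Fit on the SMOTE-balanced data and print the five metrics."""
    model.fit(X_res, y_res)
    y_pred = model.predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: accuracy={(tp + tn) / (tp + tn + fp + fn):.4f} "
          f"sensitivity={tp / (tp + fn):.4f} "
          f"specificity={tn / (tn + fp):.4f} "
          f"precision={tp / (tp + fp):.4f} AUC={auc:.4f}")

report("SMOTE random forest", RandomForestClassifier(random_state=0))
report("SMOTE k-NN", KNeighborsClassifier(n_neighbors=5))
```

Because the data and hyperparameters here are placeholders, the printed numbers will differ from those reported in the abstract; the point is the order of operations, in particular that oversampling precedes model fitting and never touches the held-out test data.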

