Resolving Data Imbalance using SMOTE for the Analysis and Prediction of Hate Speech Sentences

*Sutikman Sutikman  -  Doctor of Information System, School of Post Graduate Studies, Diponegoro University, Jl. Imam Bardjo S.H., No. 5, Pleburan, Semarang 50241, Indonesia
Heri Sutanto  -  Physics Department, Faculty of Science and Mathematics, Diponegoro University, Jl. Prof. Soedarto, S.H., Tembalang, Semarang 50275, Indonesia
Aris Puji Widodo  -  Department of Informatics, Faculty of Science and Mathematics, Diponegoro University, Jl. Prof. Soedarto, S.H., Tembalang, Semarang 50275, Indonesia
Open Access Copyright (c) 2025 Jurnal Sistem Informasi Bisnis

Abstract

Hate speech is communication that expresses hostility or contempt towards particular individuals, groups, or ethnicities, with the intent to demean them. This research examines expressions posted on Twitter and classifies them as hate speech or not using machine learning methods. The study uses Term Frequency-Inverse Document Frequency (TF-IDF) for feature extraction and the Synthetic Minority Over-sampling Technique (SMOTE) to mitigate class imbalance in the data. The machine learning models evaluated are Logistic Regression (LR), Decision Tree (DT), Gradient Boosting (GB), and Random Forest (RF). Among these models, Logistic Regression performed best, achieving an accuracy of 91.43%, precision of 88.83%, recall of 93.99%, and an F1 score of 97.10%.
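To illustrate the oversampling step the abstract refers to, the following is a minimal sketch of the core SMOTE idea in plain Python: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority-class neighbours. This is not the authors' implementation. The function name `smote`, its parameters, and the two-dimensional toy vectors (standing in for TF-IDF rows) are assumptions made for illustration only.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (the core SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base sample.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy feature vectors; real TF-IDF rows are sparse and high-dimensional.
majority = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.1, 0.2], [0.0, 0.0], [0.2, 0.2]]
minority = [[0.8, 0.9], [0.9, 0.8], [1.0, 1.0]]

# Generate just enough synthetic samples to match the majority class size.
new_points = smote(minority, n_synthetic=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

In a classification workflow, oversampling of this kind is normally applied to the training split only, so that the test set keeps the original class distribution and contains no synthetic samples.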

Keywords: Hate Speech Analysis; Data Imbalance; SMOTE; Machine Learning Models; Twitter Sentiment Classification

