Resolving Data Imbalance using SMOTE for the Analysis and Prediction of Hate Speech Sentences

Sutikman Sutikman; Heri Sutanto; Aris Puji Widodo

doi:10.14710/vol15iss2pp198-203

*Sutikman Sutikman - Doctor of Information System, School of Post Graduate Studies, Diponegoro University, Jl. Imam Bardjo S.H., No. 5, Pleburan, Semarang 50241, Indonesia

Heri Sutanto - Physics Department, Faculty of Science and Mathematics, Diponegoro University, Jl. Prof. Soedarto, S.H., Tembalang, Semarang 50275, Indonesia

Aris Puji Widodo - Department of Informatics, Faculty of Science and Mathematics, Diponegoro University, Jl. Prof. Soedarto, S.H., Tembalang, Semarang 50275, Indonesia

Abstract

Hate speech is communication that expresses hostility or discontent towards particular individuals, groups, or ethnicities, with the intent to belittle them. This research examines expressions posted on Twitter and assesses whether they should be categorized as hate speech using machine learning. Term Frequency-Inverse Document Frequency (TF-IDF) is used for feature extraction, and the Synthetic Minority Over-sampling Technique (SMOTE) is applied to mitigate class imbalance in the data. The models evaluated are Logistic Regression (LR), Decision Tree (DT), Gradient Boosting (GB), and Random Forest (RF). Among these, LR performed best, with a reported accuracy of 91.43%, precision of 88.83%, recall of 93.99%, and F1 score of 97.10%.
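To illustrate how SMOTE addresses imbalance, the following is a minimal sketch in plain Python, not the authors' implementation (the abstract does not describe it). SMOTE creates each synthetic minority sample by interpolating between an existing minority sample and one of its k nearest minority neighbours. The function name `smote`, its parameters, and the toy two-dimensional vectors standing in for TF-IDF features are all assumptions made for illustration. This variant picks the base sample at random rather than looping over every minority sample.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic new points interpolated between minority samples
    and their k nearest minority-class neighbours (hypothetical helper)."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy feature vectors (assumed values): six majority-class and three minority-class samples.
majority = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.1, 0.2], [0.0, 0.0], [0.2, 0.2]]
minority = [[0.9, 0.8], [1.0, 0.9], [0.8, 1.0]]

# Generate enough synthetic samples to match the majority class size.
new_points = smote(minority, n_synthetic=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
```

In practice, oversampling is usually applied only to the training split so that synthetic points do not leak into evaluation data. Libraries such as imbalanced-learn provide a maintained `SMOTE` class that would normally be preferred over a hand-written version.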

Keywords: Hate Speech Analysis; Data Imbalance; SMOTE; Machine Learning Models; Twitter Sentiment Classification

Article Info

Section: Research Articles

Language: EN

In Vol 15, No 2 (2025): Volume 15 Number 2 Year 2025

Authors who submit the manuscripts to Journal JSINBIS must understand and agree that if the manuscript is accepted for publication, the copyright of the article belongs to JSINBIS and Diponegoro University as the journal publisher.

Copyright includes the exclusive right to reproduce and provide articles in all forms and media, including reprints, photographs, microfilm and any other similar reproductions, as well as translations. The author retains the right to:

Reproduce all or part of the published material for the author's own use as classroom teaching material or as oral presentation material in various forums;
Reuse part or all of the material as compilation material for the author's written work;
Make copies of the published material for distribution within the institution where the author works.

JSINBIS and Diponegoro University and the Editors make every effort to ensure that no false or misleading data, opinions or statements are published in this journal. The content of articles published in JSINBIS is the sole and exclusive responsibility of the respective authors.