A Novel Fusion of Machine Learning Methods for Enhancing Named Entity Recognition in Indonesian Language Text

*Widyawan Widyawan - Universitas Gadjah Mada, Indonesia
Bayu Prasetiyo Utomo - Universitas Gadjah Mada, Indonesia
Muhammad Nur Rizala - Universitas Gadjah Mada, Indonesia

DOI: https://doi.org/10.21456/vol14iss4pp311-320

BibTeX Citation Data:

@article{JSINBIS65126,
    author = {Widyawan Widyawan and Bayu Prasetiyo Utomo and Muhammad Nur Rizala},
    title = {A Novel Fusion of Machine Learning Methods for Enhancing Named Entity Recognition in Indonesian Language Text},
    journal = {Jurnal Sistem Informasi Bisnis},
    volume = {14},
    number = {4},
    year = {2024},
    keywords = {NER; BERT; Pre-training; Machine Learning},
    abstract = {Named Entity Recognition (NER) is an important application of machine learning: it processes text to extract entities such as people, organizations, laws, religions, and locations. NER for the Indonesian language still faces significant challenges due to the lack of high-quality labelled datasets, which limits the development of more advanced models. To address this issue, we used several pre-trained BERT models (bert-base-uncased, indobenchmark/indobert-base-p1, indolem/indobert-base-uncased) and datasets (NERGRIT-IndoNLU, NERGRIT-Corpus, NERUGM, and NERUI). This study proposes a novel fusion approach that integrates deep learning architectures such as CNN, Bi-LSTM, Bi-GRU, and CRF to detect 19 entity types. This approach enhances BERT’s sequence modelling and feature extraction capabilities, while CRF improves entity prediction by enforcing global word-sequence constraints. Experimental results demonstrate that the fusion approach outperforms previous methods. With the bert-base-uncased model, accuracy reached 94.75\%, while indobenchmark/indobert-base-p1 achieved 95.75\% and indolem/indobert-base-uncased achieved 95.85\%. This study emphasizes the effectiveness of combining deep learning architectures with pre-trained transformers to improve NER performance in the Indonesian language. The proposed methodology offers significant advancements in entity extraction for languages with limited datasets, such as Indonesian.},
    issn = {2502-2377},
    pages = {311--320},
    doi = {10.21456/vol14iss4pp311-320},
    url = {https://ejournal.undip.ac.id/index.php/jsinbis/article/view/65126}
}

Abstract

Named Entity Recognition (NER) is an important application of machine learning: it processes text to extract entities such as people, organizations, laws, religions, and locations. NER for the Indonesian language still faces significant challenges due to the lack of high-quality labelled datasets, which limits the development of more advanced models. To address this issue, we used several pre-trained BERT models (bert-base-uncased, indobenchmark/indobert-base-p1, indolem/indobert-base-uncased) and datasets (NERGRIT-IndoNLU, NERGRIT-Corpus, NERUGM, and NERUI). This study proposes a novel fusion approach that integrates deep learning architectures such as CNN, Bi-LSTM, Bi-GRU, and CRF to detect 19 entity types. This approach enhances BERT’s sequence modelling and feature extraction capabilities, while CRF improves entity prediction by enforcing global word-sequence constraints. Experimental results demonstrate that the fusion approach outperforms previous methods. With the bert-base-uncased model, accuracy reached 94.75%, while indobenchmark/indobert-base-p1 achieved 95.75% and indolem/indobert-base-uncased achieved 95.85%. This study emphasizes the effectiveness of combining deep learning architectures with pre-trained transformers to improve NER performance in the Indonesian language. The proposed methodology offers significant advancements in entity extraction for languages with limited datasets, such as Indonesian.
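
The role the abstract assigns to the CRF layer, enforcing sequence-level constraints on the predicted tags, can be illustrated with a small sketch. This is not the authors' implementation. It assumes a BIO tagging scheme, a hypothetical five-tag subset (the 19 entity types are not listed in the abstract), made-up per-token scores standing in for the encoder output, and fixed transition scores (0 for an allowed transition, negative infinity for a forbidden one) where a trained CRF would use learned values. The example compares per-token argmax with Viterbi decoding under those constraints.

```python
import math

# Hypothetical tag subset for illustration only.
TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]


def allowed(prev: str, curr: str) -> bool:
    """BIO constraint: I-X may only follow B-X or I-X."""
    if curr.startswith("I-"):
        entity = curr[2:]
        return prev in ("B-" + entity, "I-" + entity)
    return True


def greedy(emissions):
    """Pick the best tag for each token independently (no sequence constraint)."""
    return [max(TAGS, key=lambda t: emit.get(t, 0.0)) for emit in emissions]


def viterbi(emissions):
    """Return the highest-scoring tag sequence that satisfies `allowed`.

    `emissions` is a non-empty list of {tag: score} dicts, one per token;
    missing tags default to a score of 0.0.
    """
    # A sequence may not open with an I- tag.
    score = {
        t: -math.inf if t.startswith("I-") else emissions[0].get(t, 0.0)
        for t in TAGS
    }
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = {}, {}
        for curr in TAGS:
            # Assumed transition score: 0 if allowed, -inf otherwise,
            # so only allowed predecessors are considered.
            candidates = [p for p in TAGS if allowed(p, curr)]
            best_prev = max(candidates, key=lambda p: score[p])
            new_score[curr] = score[best_prev] + emit.get(curr, 0.0)
            pointers[curr] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final tag.
    tag = max(TAGS, key=lambda t: score[t])
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


# Made-up scores for the tokens "Joko Widodo mengunjungi Jakarta".
emissions = [
    {"O": 1.0, "B-PER": 0.9},
    {"I-PER": 2.0, "O": 0.5, "B-PER": 0.3},
    {"O": 2.0},
    {"B-LOC": 1.5, "I-LOC": 1.6, "O": 0.2},
]

print(greedy(emissions))   # ['O', 'I-PER', 'O', 'I-LOC']  (invalid BIO sequence)
print(viterbi(emissions))  # ['B-PER', 'I-PER', 'O', 'B-LOC']
```

Per-token argmax yields an I- tag directly after O, which is not a valid BIO sequence; decoding over the whole sequence trades a slightly lower score on the first token for a consistent labelling.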

Keywords: NER; BERT; Pre-training; Machine Learning

Funding: Ministry of Education, Culture, Research and Technology

Article Info

Section: Research Articles

Language: EN

In Vol 14, No 4 (2024)

Authors who submit manuscripts to JSINBIS must understand and agree that, if the manuscript is accepted for publication, the copyright of the article belongs to JSINBIS and Diponegoro University as the journal publisher.

Copyright includes the exclusive right to reproduce and provide articles in all forms and media, including reprints, photographs, microfilm, and any other similar reproductions, as well as translations. The author retains the right to:

Reproduce all or part of the published material for the author's own use as teaching material in class or as oral presentation material in various forums;
Reuse part or all of the material as compilation material for the author's written work;
Make copies of published materials for distribution within the institution where the author works.

JSINBIS, Diponegoro University, and the Editors make every effort to ensure that no false or misleading data, opinions, or statements are published in this journal. The content of articles published in JSINBIS is the sole and exclusive responsibility of the respective authors.
