
A Novel Fusion of Machine Learning Methods for Enhancing Named Entity Recognition in Indonesian Language Text

*Widyawan Widyawan  -  Universitas Gadjah Mada, Indonesia
Bayu Prasetiyo Utomo  -  Universitas Gadjah Mada, Indonesia
Muhammad Nur Rizala  -  Universitas Gadjah Mada, Indonesia
Open Access Copyright (c) 2024 Jurnal Sistem Informasi Bisnis

Abstract

Named Entity Recognition (NER) is an important application of machine learning that processes text to extract entities such as people, organizations, laws, religions, and locations. NER for the Indonesian language still faces significant challenges because high-quality labelled datasets are scarce, which limits the development of more advanced models. To address this issue, we used three pre-trained BERT models (bert-base-uncased, indobenchmark/indobert-base-p1, and indolem/indobert-base-uncased) and four datasets (NERGRIT-IndoNLU, NERGRIT-Corpus, NERUGM, and NERUI). This study proposes a novel fusion approach that integrates the deep learning architectures CNN, Bi-LSTM, Bi-GRU, and CRF to detect 19 entity types. The CNN and recurrent layers complement BERT's feature extraction and sequence modelling capabilities, while the CRF improves entity prediction by enforcing global constraints over the label sequence. Experimental results show that the fusion approach outperforms previous methods: accuracy reached 94.75% with bert-base-uncased, 95.75% with indobenchmark/indobert-base-p1, and 95.85% with indolem/indobert-base-uncased. This study highlights the effectiveness of combining deep learning architectures with pre-trained transformers to improve NER performance in the Indonesian language. The proposed methodology offers significant advancements in entity extraction for languages with limited datasets, such as Indonesian.
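
To illustrate how a CRF-style decoder can enforce constraints over a whole label sequence, the following is a minimal sketch and not the authors' implementation. It assumes a BIO tagging scheme, a small hypothetical tag set (the abstract does not list the 19 entity types), and made-up per-token scores standing in for the output of the BERT-based encoder. A trained CRF learns soft transition scores; this sketch substitutes a single hard rule (an I-X tag may only follow B-X or I-X) to keep the idea visible.

```python
import math
from typing import List

# Hypothetical tag set for illustration only.
TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

# Made-up per-token scores (rows: tokens, columns: TAGS), standing in
# for the scores an encoder would emit for a three-token sentence.
EMISSIONS = [
    [2.0, 1.8, 0.1, 0.0, 0.0],
    [0.5, 0.2, 3.0, 0.0, 0.0],
    [0.3, 0.0, 0.1, 2.5, 0.4],
]


def allowed(prev: str, curr: str) -> bool:
    """BIO rule: I-X may only follow B-X or I-X; everything else is free."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True


def greedy(emissions: List[List[float]]) -> List[str]:
    """Pick the best tag for each token independently (no constraints)."""
    return [TAGS[max(range(len(TAGS)), key=row.__getitem__)] for row in emissions]


def viterbi(emissions: List[List[float]]) -> List[str]:
    """Return the highest-scoring tag path that respects the BIO rule."""
    n = len(TAGS)
    # A sequence may not start with an I- tag.
    score = [
        -math.inf if TAGS[j].startswith("I-") else emissions[0][j]
        for j in range(n)
    ]
    back = []  # back[t][j] = best previous tag index for tag j at step t + 1
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n):
            cands = [
                score[i] if allowed(TAGS[i], TAGS[j]) else -math.inf
                for i in range(n)
            ]
            best_i = max(range(n), key=cands.__getitem__)
            new_score.append(cands[best_i] + emit[j])
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final tag.
    j = max(range(n), key=score.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    return [TAGS[k] for k in reversed(path)]


print("greedy :", greedy(EMISSIONS))
print("viterbi:", viterbi(EMISSIONS))
```

With these scores, per-token decoding yields `O, I-PER, B-LOC`, which is invalid because `I-PER` follows `O`. Constrained decoding instead returns `B-PER, I-PER, B-LOC`, trading a slightly lower score on the first token for a well-formed entity span.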

Keywords: NER; BERT; Pre-training; Machine Learning
Funding: Ministry of Education, Culture, Research and Technology


