COMPARISON OF MISSING VALUE IMPUTATION USING MEAN, BAYESIAN KNN, AND NON-BAYESIAN KNN ON TEP GENE EXPRESSION DATA

Mastika Mastika; Titin Siswantining; Alhadi Bustamam

doi:10.14710/medstat.18.1.61-72

Mastika Mastika - Master’s Program in Mathematics, Universitas Indonesia, Depok, Indonesia

*Titin Siswantining - Master’s Program in Mathematics, Universitas Indonesia, Depok, Indonesia

Alhadi Bustamam - Master’s Program in Mathematics, Universitas Indonesia, Depok, Indonesia

Abstract

Analysis of gene expression data, particularly cancer data, is often complicated by missing values. One way to address this is imputation. This study evaluates three imputation methods on Tumor Educated Platelets (TEP) gene expression data: mean imputation, K-Nearest Neighbors (KNN), and KNN with Bayesian optimization using Gaussian Process modeling. Missing values were introduced under the Missing Completely at Random (MCAR) mechanism at progressively higher rates of 5%, 10%, 15%, and so on up to 60%, and performance was evaluated with three metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), and Normalized Root Mean Squared Error (NRMSE). The results show that the three methods perform similarly, with MAE, MSE, and NRMSE values differing only at small decimal places. Although Bayesian optimization was expected to improve the accuracy of KNN, the improvement on this dataset is not significant. These findings indicate that simple methods such as mean imputation and KNN-based imputation remain competitive on TEP data, where 14,020,496 of the 16,512,496 values (approximately 84.91%) are zeros.
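The evaluation protocol described in the abstract (mask values under MCAR, impute, then score only the masked cells) can be sketched as follows. This is a minimal illustration and not the authors' code: the synthetic zero-inflated matrix, the function names, the fixed k = 5, the distance definition, and the choice to normalize RMSE by the standard deviation of the true masked values are all assumptions, since the abstract does not specify them. The Bayesian-optimized variant would additionally tune the KNN settings, which is not shown here.

```python
import numpy as np


def mcar_mask(shape, rate, rng):
    """Return a boolean mask that marks a fixed fraction of cells, chosen uniformly at random."""
    n_cells = int(np.prod(shape))
    n_missing = int(round(rate * n_cells))
    flat = np.zeros(n_cells, dtype=bool)
    flat[rng.choice(n_cells, size=n_missing, replace=False)] = True
    return flat.reshape(shape)


def mean_impute(x_missing):
    """Fill each NaN with the mean of the observed values in its column."""
    observed = ~np.isnan(x_missing)
    counts = observed.sum(axis=0)
    sums = np.where(observed, x_missing, 0.0).sum(axis=0)
    col_means = sums / np.maximum(counts, 1)  # a fully missing column falls back to 0
    return np.where(observed, x_missing, col_means)


def knn_impute(x_missing, k=5):
    """Fill each NaN with the mean of that feature over the k nearest rows that observe it.

    Distance (an assumption): mean squared difference over features observed in both rows.
    """
    observed = ~np.isnan(x_missing)
    x0 = np.where(observed, x_missing, 0.0)
    fallback = mean_impute(x_missing)  # used when no valid neighbor exists
    filled = x_missing.copy()
    for i in np.where(~observed.all(axis=1))[0]:
        shared = observed & observed[i]
        n_shared = shared.sum(axis=1)
        sq = (((x0 - x0[i]) ** 2) * shared).sum(axis=1)
        dist = np.where(n_shared > 0, sq / np.maximum(n_shared, 1), np.inf)
        dist[i] = np.inf  # a row is never its own neighbor
        order = np.argsort(dist)
        for j in np.where(~observed[i])[0]:
            donors = [r for r in order if observed[r, j] and np.isfinite(dist[r])][:k]
            filled[i, j] = x_missing[donors, j].mean() if donors else fallback[i, j]
    return filled


def scores(x_true, x_imputed, mask):
    """MAE, MSE, and NRMSE computed on the masked cells only."""
    err = x_imputed[mask] - x_true[mask]
    mae = np.abs(err).mean()
    mse = (err ** 2).mean()
    spread = x_true[mask].std()  # normalizer is an assumption; other conventions exist
    nrmse = np.sqrt(mse) / spread if spread > 0 else float("nan")
    return mae, mse, nrmse


# Synthetic stand-in for a zero-inflated expression matrix (about 85% zeros).
rng = np.random.default_rng(0)
expressed = rng.random((60, 20)) < 0.15
x_true = np.where(expressed, rng.lognormal(1.0, 1.0, size=(60, 20)), 0.0)

for rate in (0.05, 0.30, 0.60):
    mask = mcar_mask(x_true.shape, rate, rng)
    x_missing = np.where(mask, np.nan, x_true)
    for name, impute in (("mean", mean_impute), ("knn", knn_impute)):
        mae, mse, nrmse = scores(x_true, impute(x_missing), mask)
        print(f"rate={rate:.2f} {name:>4}: MAE={mae:.4f} MSE={mse:.4f} NRMSE={nrmse:.4f}")
```

Scoring only the masked cells keeps the comparison focused on values the imputers actually had to guess. On data dominated by zeros, most masked cells are themselves zero, which is one plausible reason simple and KNN-based imputers end up with similar error values.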

Note: This article has supplementary file(s).

Keywords: Mean Absolute Error; Mean Squared Error; Normalized Root Mean Squared Error; Gaussian Process; Optimization

Funding: Directorate of Research and Community Engagement, Ministry of Education, Science, and Technology

Article Info

Section: Articles

Language: EN

In Vol 18, No 1 (2025): Media Statistika
