COMPARISON OF MISSING VALUE IMPUTATION USING MEAN, BAYESIAN KNN, AND NON-BAYESIAN KNN ON TEP GENE EXPRESSION DATA

Mastika Mastika  -  Master’s Program in Mathematics, Universitas Indonesia, Depok, Indonesia
*Titin Siswantining  -  Master’s Program in Mathematics, Universitas Indonesia, Depok, Indonesia
Alhadi Bustamam  -  Master’s Program in Mathematics, Universitas Indonesia, Depok, Indonesia
Open Access. Copyright (c) 2025 MEDIA STATISTIKA, licensed under http://creativecommons.org/licenses/by-nc-sa/4.0.

Abstract
Analysis of gene expression data, particularly from cancer studies, is often complicated by missing values, and imputation is one way to address them. This study evaluates three imputation methods on Tumor Educated Platelets (TEP) gene expression data: mean imputation, K-Nearest Neighbors (KNN), and KNN with Bayesian optimization using Gaussian Process modeling. Missing values were introduced under the Missing Completely at Random (MCAR) mechanism at increasing rates of 5%, 10%, 15%, and so on up to 60%, and performance was evaluated with three metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), and Normalized Root Mean Squared Error (NRMSE). The results show that the three methods perform similarly, with MAE, MSE, and NRMSE differing only by small decimal amounts. Although Bayesian optimization was expected to improve the accuracy of KNN, the improvement on this dataset is not significant. These findings indicate that simple methods such as mean imputation and KNN remain competitive on TEP data, which is dominated by zeros: 14,020,496 of the 16,512,496 values (approximately 84.91%) are zero.
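
The evaluation protocol summarized in the abstract (hide entries under MCAR, impute, then score only the hidden entries) can be sketched as follows. This is a minimal illustration and not the authors' code: the data are synthetic and zero-inflated rather than the TEP dataset, the function names `mcar_mask`, `mean_impute`, `knn_impute`, and `scores` are hypothetical, the KNN distance (mean squared difference over jointly observed columns) is one common choice, and NRMSE is assumed here to be the RMSE divided by the standard deviation of the hidden true values, since the abstract does not state the normalization. Bayesian optimization of `k` is left out.

```python
import numpy as np

def mcar_mask(shape, rate, rng):
    """True marks a hidden entry; each entry is hidden independently with probability `rate`."""
    return rng.random(shape) < rate

def mean_impute(x_missing):
    """Fill each NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(x_missing, axis=0)
    col_means = np.where(np.isnan(col_means), 0.0, col_means)  # column with no observed value
    return np.where(np.isnan(x_missing), col_means, x_missing)

def knn_impute(x_missing, k=5):
    """Fill each NaN with the average of that column over the k nearest rows that observe it."""
    nan = np.isnan(x_missing)
    out = mean_impute(x_missing)  # fallback when no usable neighbor exists
    for i in np.where(nan.any(axis=1))[0]:
        both = ~nan[i] & ~nan  # columns observed in row i and in each other row
        diff = np.where(both, x_missing - x_missing[i], 0.0)
        counts = both.sum(axis=1)
        dist = np.full(len(x_missing), np.inf)
        ok = counts > 0
        dist[ok] = (diff[ok] ** 2).sum(axis=1) / counts[ok]
        dist[i] = np.inf  # a row is never its own neighbor
        order = np.argsort(dist)
        for j in np.where(nan[i])[0]:
            donors = [r for r in order if not nan[r, j] and np.isfinite(dist[r])][:k]
            if donors:
                out[i, j] = x_missing[donors, j].mean()
    return out

def scores(x_true, x_imputed, mask):
    """MAE, MSE, and NRMSE computed on the hidden entries only."""
    err = x_imputed[mask] - x_true[mask]
    mae = np.abs(err).mean()
    mse = (err ** 2).mean()
    sd = x_true[mask].std()  # assumed normalizer; the paper's choice is not stated
    nrmse = np.sqrt(mse) / sd if sd > 0 else float("nan")
    return mae, mse, nrmse

# Synthetic stand-in for sparse expression data: roughly 85% zeros.
rng = np.random.default_rng(0)
x = rng.gamma(2.0, 1.0, size=(60, 20)) * (rng.random((60, 20)) > 0.85)

for rate in (0.05, 0.30, 0.60):
    mask = mcar_mask(x.shape, rate, rng)
    x_missing = np.where(mask, np.nan, x)
    for name, method in (("mean", mean_impute), ("knn", knn_impute)):
        mae, mse, nrmse = scores(x, method(x_missing), mask)
        print(f"rate={rate:.2f} {name:>4}: MAE={mae:.4f} MSE={mse:.4f} NRMSE={nrmse:.4f}")
```

Scoring only the hidden entries matters here: with most values equal to zero, errors averaged over the whole matrix would be dominated by entries that were never imputed.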

Note: This article has supplementary file(s).

Keywords: Mean Absolute Error; Mean Squared Error; Normalized Root Mean Squared Error; Gaussian Process; Optimization
Funding: Directorate of Research and Community Engagement, Ministry of Education, Science, and Technology

Last update: 2025-10-16 20:53:46