
ENSEMBLE-BASED LOGISTIC REGRESSION ON HIGH-DIMENSIONAL DATA: A SIMULATION STUDY

*Tintrim Dwi Ary Widhianingsih - Department of Statistics, Institut Teknologi Sepuluh Nopember, Indonesia
Heri Kuswanto - Department of Statistics, Institut Teknologi Sepuluh Nopember, Indonesia
Dedy Dwi Prastyo - Department of Statistics, Institut Teknologi Sepuluh Nopember, Indonesia
Open Access. Copyright (c) 2024 MEDIA STATISTIKA, licensed under http://creativecommons.org/licenses/by-nc-sa/4.0.

Abstract
The dramatic growth of computing power has ushered in the big data era, in which data sizes escalate across many fields. Beyond large sample sizes, high-dimensional data arise, where the number of features exceeds the number of samples. Although such datasets invite modern, computation-intensive approaches, logistic regression is still widely applied in practice because of its simplicity and explainability. Applying logistic regression to high-dimensional data, however, raises issues of multicollinearity, overfitting, and computational complexity. Logistic Regression Ensemble (Lorens; Lim et al., 2009) and Ensemble Logistic Regression (ELR; Zakharov & Dupont, 2011) are logistic-regression-based alternatives proposed to address these problems. Lorens adopts the ensemble concept by splitting the features into mutually exclusive partitions to form several data subsets, whereas ELR incorporates feature selection into the algorithm by drawing a subset of features according to a probability-based ranking. This paper examines the effectiveness of Lorens and ELR for high-dimensional data classification through a simulation study under three scenarios: varying feature sizes, imbalanced high-dimensional data, and multicollinearity. The simulation study reveals that ELR outperforms Lorens and attains more stable performance across different feature sizes and imbalance settings. Conversely, Lorens achieves more reliable performance than ELR in the simulation with multicollinearity.
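As a rough illustration of the Lorens procedure described above, the following minimal Python sketch shuffles the feature indices, splits them into mutually exclusive partitions, fits one logistic regression per partition, and averages the members' class probabilities. The function name lorens_fit_predict, the use of scikit-learn's LogisticRegression, and the probability-averaging combination rule are illustrative assumptions rather than the exact implementation evaluated in the paper.

import numpy as np
from sklearn.linear_model import LogisticRegression

def lorens_fit_predict(X_train, y_train, X_test, n_partitions=5, seed=0):
    # Shuffle the feature indices and split them into mutually exclusive partitions,
    # so each ensemble member sees a disjoint subset of the features.
    rng = np.random.default_rng(seed)
    partitions = np.array_split(rng.permutation(X_train.shape[1]), n_partitions)

    # Fit one logistic regression per partition and average the class-1 probabilities
    # (averaging is an assumed combination rule; majority voting is a common alternative).
    avg_proba = np.zeros(X_test.shape[0])
    for part in partitions:
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train[:, part], y_train)
        avg_proba += model.predict_proba(X_test[:, part])[:, 1]
    avg_proba /= n_partitions
    return (avg_proba >= 0.5).astype(int)

ELR differs in that it does not partition the feature space exhaustively; as described above, each ensemble member is instead built on a subset of features drawn according to a probability-based ranking.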
Keywords: Affordable Medicine; Classification; ELR; High-Dimensional Data; Lorens

References

  1. Ahn, H., Moon, H., Fazzari, M. J., Lim, N., Chen, J. J., & Kodell, R. L. (2007). Classification by ensembles from random partitions of high-dimensional data. Comput. Stat. Data Anal., 6166-6179
  2. Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences (pp. 6745-6750). National Acad Sciences
  3. Annest, A., Bumgarner, R. E., Raftery, A. E., & Yeung, K. Y. (2009). Iterative Bayesian Model Averaging: a method for the application of survival analysis to high-dimensional microarray data. BMC Bioinform.
  4. Ayesha, S., Hanif, M. K., & Talib, R. (2020). Overview and comparative study of dimensionality reduction techniques for high dimensional data. Inf. Fusion, 44-58
  5. Bhattacharjee, A., & Meyerson, M. (2003). Classification of Human Lung Carcinomas by mRNA Expression Profiling. Springer
  6. Blagus, R., & Lusa, L. (2013). SMOTE for high-dimensional class-imbalanced data. BMC Bioinform., 106
  7. Bolon-Canedo, V., Sanchez-Marono, N., & Alonso-Betanzos, A. (2016). Feature selection for high-dimensional data. Prog. Artif. Intell., 65-75
  8. Buhlmann, P. (2012). Bagging, Boosting and Ensemble Methods. Springer Berlin Heidelberg
  9. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res., 321-357
  10. Chung, D., & Keles, S. (2010). Sparse partial least squares classification for high dimensional data. Statistical Applications in Genetics and Molecular Biology
  11. Deng, Y., Chang, C., Ido, M. S., & Long, Q. (2016). Multiple imputation for general missing data patterns in the presence of high-dimensional data. Scientific reports, 21689
  12. Destrero, A., Mosci, S., Mol, C. D., Verri, A., & Odone, F. (2009). Feature selection for high-dimensional data. Comput. Manag. Sci., 25-40
  13. Dietterich, T. G. (2000). Ensemble Methods in Machine Learning. In Multiple Classifier Systems, First International Workshop, MCS 2000, Cagliari, Italy, June 21-23, 2000, Proceedings (pp. 1-15). Springer
  14. Duan, Q., Ajami, N. K., Gao, X., & Sorooshian, S. (2007). Multi-model ensemble hydrologic prediction using Bayesian model averaging. Advances in Water Resources, 1371-1386
  15. Gao, L., Song, J., Liu, X., Shao, J., Liu, J., & Shao, J. (2017). Learning in high-dimensional multimedia data: the state of the art. Multim. Syst., 303-313
  16. Haghighi, M., Caicedo, J. C., Cimini, B. A., Carpenter, A. E., & Singh, S. (2022). High-dimensional gene expression and morphology profiles of cells across 28,000 genetic and chemical perturbations. Nature Methods, 1-8
  17. Hoerl, A. E., & Kennard, R. W. (2000). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 80-86
  18. Hua, J., Xiong, Z., Lowey, J., Suh, E., & Dougherty, E. R. (2005). Optimal number of features as a function of sample size for various classification rules. Bioinform., 1509-1515
  19. Joe, H. (2006). Generating random correlation matrices based on partial correlations. Journal of Multivariate Analysis, 2177-2189
  20. Johnstone, I. M., & Titterington, D. M. (2009). Statistical challenges of high-dimensional data. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 4237-4253
  21. Kuswanto, H., Melasasi, J. N., & Ohwada, H. (2018). Enzyme Classification on DUD-E Database Using Logistic Regression Ensemble (Lorens). Innovative Computing, Optimization, and Its Applications: Modelling and Simulations, 93-109
  22. Li, Y., Chai, Y., Yin, H., & Chen, B. (2021). A novel feature learning framework for high-dimensional data classification. Int. J. Mach. Learn. Cybern., 555-569
  23. Lim, N. (2007). Classification by ensembles from random partitions using logistic regression models. State University of New York at Stony Brook
  24. Lim, N., Ahn, H., Moon, H., & Chen, J. J. (2009). Classification of High-Dimensional Data with Ensemble of Logistic Regression Models. Journal of Biopharmaceutical Statistics, 160-171
  25. Lin, W.-J., & Chen, J. J. (2013). Class-imbalanced classifiers for high-dimensional data. Briefings Bioinform., 13-26
  26. Qiu, W., & Joe, H. (2020). clusterGeneration: Random Cluster Generation (with Specified Degree of Separation Between Clusters). R package
  27. Ray, P., Reddy, S. S., & Banerjee, T. (2021). Various dimension reduction techniques for high dimensional data analysis: a review. Artif. Intell. Rev., 3473-3515
  28. Rokach, L. (2010). Ensemble-based classifiers. Artif. Intell. Rev., 1-39
  29. Romero, C., Ventura, S., Pechenizkiy, M., & Baker, R. S. (2010). Handbook of educational data mining. CRC press
  30. Shu, C., & Burn, D. H. (2004). Artificial neural network ensembles and their application in pooled flood frequency analysis. Water Resources Research
  31. Sotiriou, C., Neo, S.-Y., McShane, L. M., Korn, E. L., Long, P. M., Jazaeri, A., . . . Liu, E. T. (2003). Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proceedings of the National Academy of Sciences (pp. 10393-10398). National Acad Sciences
  32. Suhartono, Faulina, R., Lusia, D. A., Otok, B. W., Sutikno, & Kuswanto, H. (2012). Ensemble method based on ANFIS-ARIMA for rainfall prediction. In 2012 International Conference on Statistics in Science, Business and Engineering (ICSSBE) (pp. 1-4)
  33. Thudumu, S., Branch, P., Jin, J., & Singh, J. J. (2020). A comprehensive survey of anomaly detection techniques for high dimensional big data. J. Big Data, 42
  34. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 267-288
  35. Wang, W., Baladandayuthapani, V., Morris, J. S., Broom, B. M., Manyam, G., & Do, K.-A. (2013). iBAG: integrative Bayesian analysis of high-dimensional multiplatform genomics data. Bioinform., 149-159
  36. Widhianingsih, T. D., Kuswanto, H., & Prastyo, D. D. (2020). Logistic Regression Ensemble (LORENS) Applied to Drug Discovery. MATEMATIKA: Malaysian Journal of Industrial and Applied Mathematics, 43-49
  37. Xu, Y., Yu, Z., Cao, W., & Chen, C. L. (2023). A Novel Classifier Ensemble Method Based on Subspace Enhancement for High-Dimensional Data Classification. IEEE Trans. Knowl. Data Eng., 16-30
  38. Zakharov, R., & Dupont, P. (2011). Ensemble Logistic Regression for Feature Selection. In Pattern Recognition in Bioinformatics - 6th IAPR International Conference, PRIB 2011, Delft, The Netherlands, November 2-4, 2011. Proceedings (pp. 133-144). Springer
  39. Zou, H., & Hastie, T. (2005). Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 301-320
