
ENSEMBLE-BASED LOGISTIC REGRESSION ON HIGH-DIMENSIONAL DATA: A SIMULATION STUDY

*Tintrim Dwi Ary Widhianingsih - Department of Statistics, Institut Teknologi Sepuluh Nopember, Indonesia
Heri Kuswanto - Department of Statistics, Institut Teknologi Sepuluh Nopember, Indonesia
Dedy Dwi Prastyo - Department of Statistics, Institut Teknologi Sepuluh Nopember, Indonesia
Open Access. Copyright (c) 2024 MEDIA STATISTIKA, licensed under http://creativecommons.org/licenses/by-nc-sa/4.0.

Abstract
The dramatic growth of computing power has ushered in the big data era, in which data sizes escalate across many fields. Beyond large sample sizes, high-dimensional data arise, where the number of features exceeds the number of samples. Although such datasets invite modern, computation-intensive approaches, logistic regression is still widely applied in practice because of its simplicity and explainability. Applying logistic regression to high-dimensional data, however, raises issues of multicollinearity, overfitting, and computational complexity. Logistic Regression Ensemble (Lorens; Lim et al., 2009) and Ensemble Logistic Regression (ELR; Zakharov & Dupont, 2011) are logistic-regression-based alternatives proposed to address these problems. Lorens adopts the ensemble concept by splitting the features into mutually exclusive partitions to form several data subsets, whereas ELR incorporates feature selection into the algorithm by drawing a subset of features according to a probability-based ranking. This paper examines the effectiveness of Lorens and ELR for high-dimensional data classification through a simulation study under three scenarios: varying feature sizes, imbalanced high-dimensional data, and multicollinearity. The simulation study reveals that ELR outperforms Lorens and attains more stable performance across different feature sizes and imbalance settings. Conversely, Lorens achieves more reliable performance than ELR in the simulation with multicollinearity.
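As a rough illustration of the Lorens procedure described above, the following minimal Python sketch shuffles the feature indices, splits them into mutually exclusive partitions, fits one logistic regression per partition, and averages the members' class probabilities. The function name lorens_fit_predict, the use of scikit-learn's LogisticRegression, and the probability-averaging combination rule are illustrative assumptions rather than the exact implementation evaluated in the paper.

import numpy as np
from sklearn.linear_model import LogisticRegression

def lorens_fit_predict(X_train, y_train, X_test, n_partitions=5, seed=0):
    # Shuffle the feature indices and split them into mutually exclusive partitions,
    # so each ensemble member sees a disjoint subset of the features.
    rng = np.random.default_rng(seed)
    partitions = np.array_split(rng.permutation(X_train.shape[1]), n_partitions)

    # Fit one logistic regression per partition and average the class-1 probabilities
    # (averaging is an assumed combination rule; majority voting is a common alternative).
    avg_proba = np.zeros(X_test.shape[0])
    for part in partitions:
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train[:, part], y_train)
        avg_proba += model.predict_proba(X_test[:, part])[:, 1]
    avg_proba /= n_partitions
    return (avg_proba >= 0.5).astype(int)

ELR differs in that it does not partition the feature space exhaustively; as described above, each ensemble member is instead built on a subset of features drawn according to a probability-based ranking.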
Keywords: Affordable Medicine; Classification; ELR; High-Dimensional Data; Lorens

References

  1. Ahn, H., Moon, H., Fazzari, M. J., Lim, N., Chen, J. J., & Kodell, R. L. (2007). Classification by ensembles from random partitions of high-dimensional data. Comput. Stat. Data Anal., 6166-6179
  2. Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences (pp. 6745-6750). National Acad Sciences
  3. Annest, A., Bumgarner, R. E., Raftery, A. E., & Yeung, K. Y. (2009). Iterative Bayesian Model Averaging: a method for the application of survival analysis to high-dimensional microarray data. BMC Bioinform.
  4. Ayesha, S., Hanif, M. K., & Talib, R. (2020). Overview and comparative study of dimensionality reduction techniques for high dimensional data. Inf. Fusion, 44-58
  5. Bhattacharjee, A., & Meyerson, M. (2003). Classification of Human Lung Carcinomas by mRNA Expression Profiling. Springer
  6. Blagus, R., & Lusa, L. (2013). SMOTE for high-dimensional class-imbalanced data. BMC Bioinform., 106
  7. Bolon-Canedo, V., Sanchez-Marono, N., & Alonso-Betanzos, A. (2016). Feature selection for high-dimensional data. Prog. Artif. Intell., 65-75
  8. Buhlmann, P. (2012). Bagging, Boosting and Ensemble Methods. Springer Berlin Heidelberg
  9. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res., 321-357
  10. Chung, D., & Keles, S. (2010). Sparse partial least squares classification for high dimensional data. Statistical Applications in Genetics and Molecular Biology
  11. Deng, Y., Chang, C., Ido, M. S., & Long, Q. (2016). Multiple imputation for general missing data patterns in the presence of high-dimensional data. Scientific reports, 21689
  12. Destrero, A., Mosci, S., Mol, C. D., Verri, A., & Odone, F. (2009). Feature selection for high-dimensional data. Comput. Manag. Sci., 25-40
  13. Dietterich, T. G. (2000). Ensemble Methods in Machine Learning. In Multiple Classifier Systems, First International Workshop, MCS 2000, Cagliari, Italy, June 21-23, 2000, Proceedings (pp. 1-15). Springer
  14. Duan, Q., Ajami, N. K., Gao, X., & Sorooshian, S. (2007). Multi-model ensemble hydrologic prediction using Bayesian model averaging. Advances in Water Resources, 1371-1386
  15. Gao, L., Song, J., Liu, X., Shao, J., Liu, J., & Shao, J. (2017). Learning in high-dimensional multimedia data: the state of the art. Multim. Syst., 303-313
  16. Haghighi, M., Caicedo, J. C., Cimini, B. A., Carpenter, A. E., & Singh, S. (2022). High-dimensional gene expression and morphology profiles of cells across 28,000 genetic and chemical perturbations. Nature Methods, 1-8
  17. Hoerl, A. E., & Kennard, R. W. (2000). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 80-86
  18. Hua, J., Xiong, Z., Lowey, J., Suh, E., & Dougherty, E. R. (2005). Optimal number of features as a function of sample size for various classification rules. Bioinform., 1509-1515
  19. Joe, H. (2006). Generating random correlation matrices based on partial correlations. Journal of Multivariate Analysis, 2177-2189
  20. Johnstone, I. M., & Titterington, D. M. (2009). Statistical challenges of high-dimensional data. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 4237-4253
  21. Kuswanto, H., Melasasi, J. N., & Ohwada, H. (2018). Enzyme Classification on DUD-E Database Using Logistic Regression Ensemble (Lorens). Innovative Computing, Optimization, and Its Applications: Modelling and Simulations, 93-109
  22. Li, Y., Chai, Y., Yin, H., & Chen, B. (2021). A novel feature learning framework for high-dimensional data classification. Int. J. Mach. Learn. Cybern., 555-569
  23. Lim, N. (2007). Classification by ensembles from random partitions using logistic regression models. State University of New York at Stony Brook
  24. Lim, N., Ahn, H., Moon, H., & Chen, J. J. (2009). Classification of High-Dimensional Data with Ensemble of Logistic Regression Models. Journal of Biopharmaceutical Statistics, 160-171
  25. Lin, W.-J., & Chen, J. J. (2013). Class-imbalanced classifiers for high-dimensional data. Briefings Bioinform., 13-26
  26. Qiu, W., & Joe, H. (2020). clusterGeneration: Random Cluster Generation (with Specified Degree of Separation Between Clusters). R package
  27. Ray, P., Reddy, S. S., & Banerjee, T. (2021). Various dimension reduction techniques for high dimensional data analysis: a review. Artif. Intell. Rev., 3473-3515
  28. Rokach, L. (2010). Ensemble-based classifiers. Artif. Intell. Rev., 1-39
  29. Romero, C., Ventura, S., Pechenizkiy, M., & Baker, R. S. (2010). Handbook of educational data mining. CRC press
  30. Shu, C., & Burn, D. H. (2004). Artificial neural network ensembles and their application in pooled flood frequency analysis. Water Resources Research
  31. Sotiriou, C., Neo, S.-Y., McShane, L. M., Korn, E. L., Long, P. M., Jazaeri, A., . . . Liu, E. T. (2003). Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proceedings of the National Academy of Sciences (pp. 10393-10398). National Acad Sciences
  32. Suhartono, Faulina, R., Lusia, D. A., Otok, B. W., Sutikno, & Kuswanto, H. (2012). Ensemble method based on ANFIS-ARIMA for rainfall prediction. In 2012 International Conference on Statistics in Science, Business and Engineering (ICSSBE) (pp. 1-4)
  33. Thudumu, S., Branch, P., Jin, J., & Singh, J. J. (2020). A comprehensive survey of anomaly detection techniques for high dimensional big data. J. Big Data, 42
  34. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 267-288
  35. Wang, W., Baladandayuthapani, V., Morris, J. S., Broom, B. M., Manyam, G., & Do, K.-A. (2013). iBAG: integrative Bayesian analysis of high-dimensional multiplatform genomics data. Bioinform., 149-159
  36. Widhianingsih, T. D., Kuswanto, H., & Prastyo, D. D. (2020). Logistic Regression Ensemble (LORENS) Applied to Drug Discovery. MATEMATIKA: Malaysian Journal of Industrial and Applied Mathematics, 43-49
  37. Xu, Y., Yu, Z., Cao, W., & Chen, C. L. (2023). A Novel Classifier Ensemble Method Based on Subspace Enhancement for High-Dimensional Data Classification. IEEE Trans. Knowl. Data Eng., 16-30
  38. Zakharov, R., & Dupont, P. (2011). Ensemble Logistic Regression for Feature Selection. In Pattern Recognition in Bioinformatics - 6th IAPR International Conference, PRIB 2011, Delft, The Netherlands, November 2-4, 2011. Proceedings (pp. 133-144). Springer
  39. Zou, H., & Hastie, T. (2005). Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 301-320
