
Hybrid feature-selection and diversity-guided stacking framework for interpretable ensemble learning: Application to COVID-19 mortality prediction

  • Farideh Mohtasham ,

    Contributed equally to this work with: Farideh Mohtasham, Seyed Saeed Hashemi Nazari

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft

    Affiliation Gastroenterology and Liver Diseases Research Center, Research Institute for Gastroenterology and Liver Diseases, Shahid Beheshti University of Medical Sciences, Tehran, Iran

  • Seyed Saeed Hashemi Nazari ,

    Contributed equally to this work with: Farideh Mohtasham, Seyed Saeed Hashemi Nazari

    Roles Conceptualization, Formal analysis, Methodology, Validation, Writing – review & editing

    Affiliation Department of Epidemiology, School of Public Health & Safety, Shahid Beheshti University of Medical Sciences (SBMU), Tehran, Iran

  • Mohamad Amin Pourhoseingholi ,

    Roles Conceptualization, Data curation, Methodology, Software, Validation, Visualization, Writing – review & editing

    ‡ These authors also contributed equally to this work.

    Affiliation National Institute for Health and Care Research (NIHR) Nottingham Biomedical Research Center, Hearing Sciences, Mental Health and Clinical Neurosciences, School of Medicine, University of Nottingham, Nottingham, United Kingdom

  • Kaveh Kavousi ,

    Roles Conceptualization, Formal analysis, Methodology, Project administration, Supervision, Validation, Writing – review & editing

    kkavousi@ut.ac.ir

    Affiliation Laboratory of Complex Biological Systems and Bioinformatics (CBB), Department of Bioinformatics, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran

  • Mohammad Reza Zali

    Roles Conceptualization, Methodology, Supervision, Validation, Writing – review & editing

    ‡ These authors also contributed equally to this work.

    Affiliation Gastroenterology and Liver Diseases Research Center, Research Institute for Gastroenterology and Liver Diseases, Shahid Beheshti University of Medical Sciences, Tehran, Iran

Abstract

Background

Reliable predictive modeling in high-dimensional biomedical data requires a balance between accuracy, interpretability, and computational efficiency. However, existing ensemble methods often overlook model diversity or rely on ad hoc feature-selection approaches, which limit generalizability. This study introduces a hybrid feature-selection and diversity-guided stacking framework designed to improve robustness and scalability across clinical and other data-intensive domains.

Methods

The proposed framework integrates a hybrid feature-selection pipeline—combining Variance Inflation Factor (VIF), Analysis of Variance (ANOVA), Sequential Backward Elimination (SBE), and Lasso regression—to reduce multicollinearity and overfitting. It also employs a diversity-aware stacking strategy that constructs sub-model sets based on pairwise diversity measures (Disagreement, Yule’s Q, and Cohen’s Kappa) and non-pairwise metrics (Entropy and Kohavi–Wolpert). Sixteen base classifiers and five meta-learners were trained using repeated 10-fold cross-validation. The framework was evaluated using data from 4,778 hospitalized COVID-19 patients with 116 clinical and laboratory attributes, preprocessed using robust scaling and ROSE-based class balancing.

Results

The optimal configuration, which stacked Random Forest and XGBoost models using a Neural Network meta-learner, achieved 91.4% accuracy (95% CI: 89.8–92.8), AUC = 0.955, F1 = 0.801, and MCC = 0.746, outperforming the best individual model (AdaBoost, 90.2%). Training time (~450 s) and per-case inference time (<0.2 s) demonstrated computational feasibility. Feature-importance analysis and SHAP-based interpretation confirmed clinical relevance and interpretability.

Conclusions

The hybrid feature-selection and diversity-guided stacking framework improves predictive accuracy and interpretability while maintaining computational efficiency. Although validated using COVID-19 mortality data, the approach is broadly applicable to biomedical, environmental, and engineering prediction tasks that require interpretable and scalable ensemble learning.

1 Introduction

Machine learning (ML) has become an essential component of modern biomedical research, enabling the discovery of complex, nonlinear patterns and supporting improved risk stratification across diverse clinical domains. The primary goal of ML research is to develop efficient, interpretable, and generalizable algorithms capable of delivering reliable performance across heterogeneous datasets [1]. Efficiency in ML encompasses not only time and memory requirements but also data utilization and interpretability—key considerations for deployment in high-stakes clinical environments, where transparency, reproducibility, and auditability are critical.

Traditional ML models, however, often encounter significant challenges, including data quality issues, overfitting, class imbalance, and limited interpretability, which restrict their usefulness in clinical applications. These limitations have driven growing interest in ensemble learning, in which multiple algorithms are combined to reduce both variance and bias, thereby improving predictive performance compared with individual models [2–4]. Among ensemble approaches, stacking has emerged as a particularly powerful technique. Stacking integrates diverse base learners and employs a meta-learner to combine their predictions, leveraging complementary error structures to improve accuracy and robustness [5–7].

Recent advancements in various disciplines highlight both the promise and the challenges of ensemble modeling. In cardiovascular diagnostics, Feng et al. (2023) developed a hybrid model that combined hemodynamic modeling with ML to achieve >90% accuracy in under two seconds per case, demonstrating that hybrid approaches can deliver high computational efficiency while maintaining strong performance [8]. Similarly, Wang et al. (2023) used the Super Learner algorithm to improve prediction of cumulative lead exposure, though at the cost of substantial computational load due to the combination of multiple algorithms [9]. In proteomics and chemical sciences, feature selection and dimensionality reduction have proven effective in enhancing prediction accuracy while reducing model complexity.

Xu et al. (2022) benchmarked 13 ML models for protein-level inference from RNA features across more than 2,500 samples and 20 datasets. Their findings demonstrated that combining appropriate feature selection with classical models and voting ensembles improved accuracy, although computation time varied widely [10].

Reda et al. (2023) applied variable selection with partial least squares regression to predict olive-oil quality parameters using near-infrared spectroscopy, showing that variable reduction improved accuracy and decreased computational demands [11].

In oncology, stacking and hybrid ensembles have also yielded substantial gains in predictive performance and generalization. Mohammed et al. (2021) applied a CNN-based stacking ensemble to multi-cancer RNA-Seq classification, achieving superior accuracy to single models while retaining computational feasibility [12]. Wang et al. (2025) designed a multimodal stacking framework that integrated radiomics and deep learning for head-and-neck cancer prognosis (C-index = 0.93), demonstrating the influence of meta-learner selection on scalability [13]. Kwon et al. (2019) found that gradient boosting performed best as a meta-learner for accuracy, while generalized linear models minimized error in breast cancer classification, highlighting the trade-offs between model complexity and efficiency [14]. Other architectures, such as the relevance-aware capsule network [15], deep convolutional neural networks [16], and U-Net–based MRI segmentation models [17], have demonstrated that improvements in accuracy commonly require substantially higher training time and memory, emphasizing the need to balance predictive strength with practicality and interpretability.

Ensemble learning has been widely adopted in other clinical areas. Abualnaja et al. [18] analyzed 32 studies involving 142,459 patients with meningiomas and reported that combined radiomic and clinical ensemble models achieved AUCs of 0.74–0.81, demonstrating robust multimodal representation. Lei et al. [19] reported comparable findings for multimodal ensemble models. Other studies in cardiovascular disease have shown similar results. Dhingra et al. [20] developed an ensemble model (PRESENT-SHD) using 261,228 ECGs, achieving AUROC values of 0.85–0.90 across multiple hospitals, indicating strong cross-population stability. Tseng et al. [21] used XGBoost and random forest models to predict acute kidney injury following cardiac surgery (AUC = 0.843), demonstrating ensemble learning’s value in perioperative risk prediction.

In infectious diseases research, Sawesi et al. [22] reviewed 17 leptospirosis studies and found that ML and deep learning methods—including CNN ensembles—achieved high accuracy (80–98%), though most lacked external validation. Chiasakul et al. [23] reported that AI methods for venous thromboembolism prediction outperformed traditional risk scores (mean AUC 0.79 vs 0.61), although many studies exhibited bias and limited generalizability.

Ensemble learning has also consistently outperformed single models in forecasting outbreaks of dengue, influenza, Ebola, and COVID-19. Early COVID-19 mortality forecasts demonstrated that ensembles delivered greater accuracy and precision than individual models [24].

Stacking is particularly suited to heterogeneous clinical datasets, such as COVID-19 mortality prediction, which depends on numerous clinical, biochemical, and physiological indicators [25]. Berliana and Bustamam [4] demonstrated that a two-level stacking model achieved more than 97% accuracy with CT data and 99% with chest X-ray images. Cui et al. [26] introduced a nested heterogeneous ensemble integrating SVR, ELM, and logistic regression, achieving improved generalization. Li et al. [27] predicted early mortality using five base models and a genetic-algorithm optimization procedure, achieving an AUC of 0.907 in a cohort of 4,711 patients. Other studies demonstrated that hybrid ensembles incorporating supervised and unsupervised learning improved performance by over 10%, and that boosted models remained competitive with strong clinical relevance [28,29].

Despite these advances, several methodological limitations persist. Systematic reviews highlight widespread issues such as small and unrepresentative datasets, weak handling of missing data, lack of external validation, and overreliance on discrimination metrics alone [29–31]. Many studies also neglect calibration, effect-size estimation, or fairness analyses, reducing clinical interpretability [32,33]. Research shows a nonlinear relationship between predictive gain and computational cost: complex models often deliver higher accuracy but at significant increases in resource consumption [34–37]. While innovations such as subgraph learning [35] and simplified coronary models [8] can mitigate these burdens, a careful balance of accuracy, efficiency, and interpretability remains necessary.

Sample size adequacy is another concern. Many COVID-19 models are trained on datasets too small for their complexity, leading to instability and overfitting [38]. Class imbalance is also common in mortality modeling; although oversampling and weighting strategies are widely used, these must be validated to avoid artificial distortions [39,40].

Furthermore, many ensemble studies rely heavily on tree-based models such as random forest, XGBoost, and LightGBM, limiting diversity and restricting the full advantages of ensemble learning [41,42]. Quantitative measures of diversity—such as Yule’s Q, Disagreement, Cohen’s Kappa, or Double-Fault—remain rarely used in COVID-19 modeling despite consistent evidence that diversity improves generalization [42–44].

To address these limitations, this study introduces a computationally efficient, diversity-guided stacking ensemble framework that integrates heterogeneous base classifiers and interpretable meta-learners to predict COVID-19 mortality. Our approach incorporates:

  1. Hybrid feature-selection using variance inflation factor (VIF) analysis, ANOVA, sequential backward elimination (SBE), and Lasso regression to control multicollinearity and enhance interpretability;
  2. Controlled ensemble depth to balance predictive gain and computational feasibility; and
  3. Lightweight meta-learners capable of capturing nonlinear dependencies among diverse base learners.

We constructed sub-model ensembles using multiple diversity metrics across 16 machine learning algorithms and assessed model performance using discrimination, calibration, and statistical significance tests, including Wilcoxon, McNemar, and DeLong analyses. Model interpretability was enhanced through SHAP-based explanation of global and local prediction behavior.

This study presents a generalizable diversity-aware ensemble framework designed to balance accuracy, interpretability, and computational cost. Although applied here to COVID-19 mortality prediction, the approach is suitable for a wide range of biomedical prediction problems that require robust, interpretable, and scalable machine learning solutions.

2 Materials and methods

2.1 Overview of the proposed framework

As illustrated in Fig 1, this study adopts a multi-stage framework for predicting mortality risk, integrating standard machine learning techniques with the proposed algorithmic innovations. The workflow contains two primary layers:

Fig 1. Research methodology of the proposed machine learning framework.

https://doi.org/10.1371/journal.pone.0341198.g001

Foundational Stage – Data preprocessing, normalization, and training of base models using established machine learning procedures.

Algorithmic Stage – A diversity-guided stacking ensemble that integrates hybrid feature selection, explicit model diversity assessment, and a comparison of multiple meta-learners to optimize predictive performance, interpretability, and computational efficiency.

Data from 4,778 confirmed COVID-19 cases were cleaned through exclusion of incomplete records, iterative multivariate imputation, and normalization. The hybrid feature-selection process removed multicollinearity using Variance Inflation Factor (VIF), followed by Analysis of Variance (ANOVA), Sequential Backward Elimination (SBE), and Lasso regression to select 15 key predictors.

Sixteen machine learning classifiers were trained using stratified and balanced datasets and evaluated via repeated 10-fold cross-validation. To enhance predictive performance, ensemble sets were constructed based on correlation and statistical diversity metrics, then stacked using five different meta-learners. Models were evaluated based on discrimination and calibration performance and validated using significance tests. Interpretability was assessed using feature importance and SHAP-based analyses.

2.2 Data source and ethical approval

Data were obtained from 4,778 confirmed COVID-19 patients admitted to three general hospitals in Tehran, Iran, between March 2020 and March 2021. Demographic, clinical, laboratory, symptom, comorbidity, vital sign, and outcome information was extracted from clinical records reviewed by trained medical staff. Laboratory findings were collected on the first day of admission through the hospital information system, and COVID-19 diagnosis was confirmed using real-time polymerase chain reaction (RT-PCR) of nasal or oropharyngeal swab samples.

The study followed formal institutional requirements and received ethical approval from the Institutional Review Board (IRB) of Shahid Beheshti University of Medical Sciences (IR.SBMU.RIGLD.REC.1401.032).

Informed consent was waived due to the retrospective nature of the study, and data were anonymized prior to analysis in accordance with the Declaration of Helsinki. This dataset provides comprehensive temporal, demographic, and clinical information suitable for developing predictive models for COVID-19 mortality. Additional details on the epidemiological profile of the cohort are available in Hatamabadi et al. [45].

2.3 Data preprocessing

Missing data were assessed and addressed prior to model development. Patients with missing values in any categorical variable or more than two missing continuous variables were excluded. Among the 123 available variables (52 categorical and 71 numeric), no categorical variables contained missing values. Seven numerical variables with more than 5% missingness were removed to reduce risk of bias. Remaining missing values were imputed using an iterative multivariate approach implemented in Scikit-learn [46], which models each variable with missing entries as a function of all other features in a chained regression process. This preserves multivariate relationships and minimizes bias under the Missing at Random (MAR) assumption.

The resulting imputed dataset was previously validated by Hatamabadi et al. [47] to confirm realistic variable distributions and consistent multivariate relationships. To account for skewed distributions and sensitivity to outliers, continuous variables were standardized using robust scaling [48], which centers variables on the median and scales using the interquartile range (IQR). This approach enhances stability in clinical models by reducing the influence of extreme values.
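The two preprocessing steps described above can be sketched in a few lines. The study's pipeline was implemented in R; this is an illustrative Python analogue using scikit-learn's chained-regression imputer and robust scaler on synthetic data (all variable names and the missingness rate are assumptions, not the study's data).

```python
# Sketch of the preprocessing steps described above: iterative multivariate
# (chained-regression) imputation under the MAR assumption, then robust scaling.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.03] = np.nan  # ~3% missing, below the 5% removal cut-off

# Each column with missing entries is modelled as a function of all the others.
X_imp = IterativeImputer(random_state=0).fit_transform(X)

# Robust scaling: centre on the median, scale by the interquartile range (IQR),
# which limits the influence of extreme laboratory values.
X_scaled = RobustScaler().fit_transform(X_imp)
```

After this step every column of `X_scaled` has median 0 and unit IQR, so outliers no longer dominate distance-based learners.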

2.4 Feature selection

Feature Selection (FS) is essential for managing high-dimensional clinical datasets by reducing redundant, irrelevant, or correlated predictors while improving model accuracy, computational efficiency, and interpretability [49,50].

A multi-stage hybrid feature-selection strategy was implemented to progressively eliminate multicollinearity, non-informative predictors, and weak contributors. Four complementary methods were applied:

  1. Variance Inflation Factor (VIF): Used to detect and remove highly collinear continuous variables, thereby improving model stability and avoiding inflated variance estimates [51,52].
  2. Analysis of Variance (ANOVA): Applied next to evaluate between-group differences using F-tests and eliminate non-discriminative features with minimal computational cost [53,54].
  3. Sequential Backward Elimination (SBE): Iteratively removed the least informative features based on cross-validated model performance, preserving meaningful interactions and improving generalization [55].
  4. Lasso regression: Imposed regularization to shrink weak coefficients to zero, providing sparse and stable model structures suited for correlated predictors [56].

This integrated pipeline—removing collinearity (VIF), filtering weak predictors (ANOVA), refining via model performance (SBE), and enforcing sparsity (Lasso)—produced a compact, interpretable feature set optimized for ensemble learning.
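The four-stage pipeline can be sketched end to end. The study used R; the Python sketch below is an assumption-laden analogue on synthetic data: VIF is computed from per-column regressions, ANOVA uses F-statistics, SBE uses scikit-learn's backward `SequentialFeatureSelector`, and the Lasso stage uses L1-penalized logistic regression. Thresholds (VIF < 10, top-5 F, 3 SBE features, C = 0.5) are illustrative, not the paper's settings.

```python
# Illustrative four-stage hybrid feature selection: VIF -> ANOVA -> SBE -> Lasso.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector, f_classif
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
X[:, 7] = 0.95 * X[:, 0] + rng.normal(scale=0.1, size=300)  # collinear pair
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

def vif(M):
    """VIF_j = 1 / (1 - R^2_j), from regressing column j on the other columns."""
    vals = []
    for j in range(M.shape[1]):
        others = np.delete(M, j, axis=1)
        r2 = LinearRegression().fit(others, M[:, j]).score(others, M[:, j])
        vals.append(1.0 / (1.0 - r2))
    return np.array(vals)

# Stage 1 (VIF): iteratively drop the most collinear column until all VIF < 10.
Xv = X.copy()
while vif(Xv).max() >= 10.0:
    Xv = np.delete(Xv, int(vif(Xv).argmax()), axis=1)

# Stage 2 (ANOVA): keep the five columns with the largest F-statistics.
F, _ = f_classif(Xv, y)
Xa = Xv[:, np.argsort(F)[::-1][:5]]

# Stage 3 (SBE): backward elimination guided by cross-validated performance.
sbe = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=3,
                                direction="backward", cv=3).fit(Xa, y)
Xs = sbe.transform(Xa)

# Stage 4 (Lasso): L1 regularization shrinks weak coefficients to exactly zero.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(Xs, y)
selected = np.flatnonzero(lasso.coef_[0] != 0.0)
```

Note how each stage removes a different kind of redundancy: the collinear duplicate falls at stage 1, noise columns at stages 2–3, and weak contributors at stage 4.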

2.5 Model training and selection

Sixteen base machine learning models were trained, including ten standard algorithms and six boosting or bagging methods, representing diverse methodological families. Hyperparameters were optimized through grid search using the caret package [57,58], with 10-fold cross-validation repeated 10 times [59] to balance bias and variance. Models with minimal benefit from extensive tuning (e.g., GLM, LDA, CART, Naïve Bayes) retained default settings to maximize computational efficiency.

Hyperparameter ranges were informed by existing literature and empirical results from clinical prediction research. Table 1 summarizes the optimized settings for each model.

Table 1. Parameter settings for the 16 base machine learning models.

https://doi.org/10.1371/journal.pone.0341198.t001

Model selection was based on three criteria:

  1. Algorithmic diversity,
  2. Demonstrated success in biomedical or COVID-19 prediction studies, and
  3. Complementary bias–variance profiles.

The final pool (supported by systematic evidence, e.g., Bottino et al. [60]) included linear models (GLM, Lasso, Ridge, Elastic Net), probabilistic models (Naïve Bayes, LDA), instance-based learners (KNN), tree-based models (CART, C5.0, Random Forest, XGBoost, GBM, Treebag), and Neural Networks.

This diversity ensured coverage of linear and nonlinear relationships, uncertainty modeling, and hierarchical interactions common in clinical decision data.

Cross-validated performance guided final model selection, which was subsequently confirmed through independent test-set validation.

2.6 Diversity-guided sub-model construction

Traditional stacking often selects base models with low prediction correlation to ensure each contributes complementary information. Highly correlated models (>0.75) add redundancy and weaken ensemble gains [61].

Sixteen candidate models were initially generated using the caretList() function from the caretEnsemble package [57,58]. Prediction correlations were then calculated; where a pair exceeded 0.75, the less accurate model of the pair was removed, based on accuracy over 10 repeats of 10-fold cross-validation.
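The pruning rule just described, drop the weaker member of any model pair whose predictions correlate above 0.75, can be sketched directly. The inputs below (prediction vectors, accuracies, model names) are illustrative stand-ins, not the study's actual cross-validated outputs.

```python
# Minimal sketch of correlation-based pruning: for each pair of models whose
# predictions correlate above the threshold, drop the less accurate one.
import numpy as np

def prune_correlated(pred, acc, names, threshold=0.75):
    """pred: (n_samples, n_models) predicted probabilities;
    acc: per-model accuracy; returns the surviving model names."""
    corr = np.corrcoef(pred, rowvar=False)
    keep = set(range(pred.shape[1]))
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if i in keep and j in keep and corr[i, j] > threshold:
                keep.discard(i if acc[i] < acc[j] else j)  # drop the weaker model
    return [names[k] for k in sorted(keep)]

rng = np.random.default_rng(2)
base = rng.random(100)
pred = np.column_stack([base + rng.normal(0, 0.05, 100),   # "glm"
                        base + rng.normal(0, 0.05, 100),   # "lasso" (redundant with glm)
                        rng.random(100)])                  # "knn" (independent)
survivors = prune_correlated(pred, acc=[0.88, 0.90, 0.85],
                             names=["glm", "lasso", "knn"])
```

Here "glm" and "lasso" are near-duplicates, so the less accurate "glm" is removed while the independent "knn" survives despite its lower accuracy, which is exactly the behaviour that preserves complementary information.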

To enhance diversity beyond correlation filtering, additional sub-model sets were constructed using explicit diversity metrics capturing complementary error patterns among classifiers. Pairwise measures (Disagreement, Yule’s Q, Cohen’s Kappa, Double-Fault) and non-pairwise metrics (Entropy, Kohavi–Wolpert) were computed following Tattar [62].

The contingency table defining model agreements (n11 and n00) and disagreements (n10 and n01) across N observations is presented in Table 2.

Table 2. Contingency table illustrating agreement and disagreement between two classifiers.

https://doi.org/10.1371/journal.pone.0341198.t002

2.6.1 Disagreement measure.

This quantifies the proportion of instances where the two classifiers differ in prediction (1):

Dis = (n10 + n01) / N    (1)

Higher values indicate greater diversity and reduced redundancy.

2.6.2 Yule’s Q-statistic.

Yule’s Q (or Q-statistic) assesses the strength and direction of association between two classifiers’ predictions (range: –1 to +1). Lower absolute values denote weaker association and thus higher diversity (2):

Q = (n11 n00 − n01 n10) / (n11 n00 + n01 n10)    (2)

2.6.3 Cohen’s Kappa statistic.

A widely used measure that evaluates inter-model agreement while adjusting for chance. Low or negative Kappa values suggest that classifiers make independent errors, which enhances ensemble robustness.

2.6.4 Double-fault measure.

Measures the proportion of cases where both classifiers misclassify the same instance. Smaller values indicate complementary error patterns and reduced correlated failures (3):

DF = n00 / N    (3)

Two non-pairwise metrics were also used:

  • Entropy measure [63]: Reflects the overall variability of predictions across all classifiers, ranging from 0 (perfect agreement, no diversity) to 1 (maximum diversity).
  • Kohavi-Wolpert Measure [62]: Derived from error variance decomposition, it quantifies the dispersion of predictions across classifiers; higher values imply greater diversity and richer ensemble representation.

Pairwise metrics identified redundant learners, while non-pairwise metrics assessed overall heterogeneity within sub-model sets. This integrated diversity evaluation ensured complementary base learners and improved the generalization performance of the stacking ensemble.
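The pairwise measures above reduce to small functions of the Table 2 counts. A plain-Python sketch, using the convention that n11 = both classifiers correct, n00 = both wrong, and n10/n01 = exactly one correct (the counts here are illustrative, with N = 100):

```python
# Pairwise diversity measures computed from the 2x2 contingency counts of
# Table 2, following the definitions in Eqs (1)-(3).
def disagreement(n11, n10, n01, n00):
    N = n11 + n10 + n01 + n00
    return (n10 + n01) / N            # Eq (1): share of split decisions

def yules_q(n11, n10, n01, n00):
    # Eq (2): near 0 = weak association (high diversity); near +/-1 = strong
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

def double_fault(n11, n10, n01, n00):
    N = n11 + n10 + n01 + n00
    return n00 / N                    # Eq (3): both classifiers wrong together

counts = (60, 15, 15, 10)             # illustrative: n11, n10, n01, n00
```

For these counts, Disagreement = 0.30, Q ≈ 0.45, and Double-Fault = 0.10, so the pair split decisions often but rarely failed on the same cases, the profile of a useful ensemble member pair.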

2.7 Meta-learner integration

Predictions from the sub-model sets were integrated using a stacking framework with five meta-learners:

  • Generalized Linear Model (GLM)
  • Linear Discriminant Analysis (LDA)
  • Random Forest (RF)
  • Gradient Boosting Machine (GBM)
  • Neural Network (NN)

Linear models (GLM, LDA) were chosen for their transparency and stable inference, while tree-based (RF, GBM) and neural meta-learners modeled nonlinear dependencies among base models. This balanced design supports both interpretability and computational efficiency, aligning with the study’s objective of developing a robust, clinically applicable framework for mixed-type healthcare data.

Stacking was implemented using the caretEnsemble package to fuse the diverse predictive outputs.
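The study used R's caretEnsemble; as a rough Python analogue, scikit-learn's StackingClassifier trains base learners, generates out-of-fold predictions, and fits a meta-learner on them. The configuration below loosely mirrors the RF + XGBoost → neural network stack reported in the Results, with GradientBoosting standing in for XGBoost and synthetic data standing in for the clinical cohort.

```python
# Sketch of diversity-aware stacking with a neural-network meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("gbm", GradientBoostingClassifier(random_state=0))],
    final_estimator=MLPClassifier(max_iter=2000, random_state=0),
    cv=5,  # out-of-fold base-model predictions feed the meta-learner
)
acc = stack.fit(X_tr, y_tr).score(X_te, y_te)
```

Using out-of-fold predictions (the `cv` argument) is what prevents the meta-learner from overfitting to base-model training errors, the same role the repeated cross-validation plays in the caret workflow.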

2.8 Model evaluation and statistical analysis

Performance was assessed using an independent test dataset. Discrimination metrics included accuracy, sensitivity, specificity, precision, F1-score, Cohen’s Kappa, area under the ROC curve (AUC), and the Matthews correlation coefficient (MCC). These metrics collectively capture overall correctness, class-specific detection, and robustness under class imbalance—critical in mortality prediction, where false negatives carry severe clinical risk and false positives may strain resources. AUC quantified global discrimination across thresholds, while MCC provided a balanced evaluation under uneven outcome distributions [64].
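All of the discrimination metrics listed above can be computed from predicted labels and probabilities; a small illustrative sketch (the labels and probabilities below are made up, not study results):

```python
# Computing the discrimination metrics described above on toy predictions.
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
p_hat = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.3, 0.7, 0.6, 0.85, 0.15])
y_pred = (p_hat >= 0.5).astype(int)   # threshold at 0.5 for label metrics

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "kappa": cohen_kappa_score(y_true, y_pred),
    "mcc": matthews_corrcoef(y_true, y_pred),  # robust under class imbalance
    "auc": roc_auc_score(y_true, p_hat),       # threshold-free discrimination
}
```

Note that AUC is computed from the probabilities while the other metrics depend on the chosen threshold, which is why both views matter when mortality classes are imbalanced.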

Calibration was assessed using reliability curves to compare predicted probabilities against observed outcomes. Statistical comparisons employed:

  • Wilcoxon signed-rank test for accuracy differences,
  • Effect sizes [65] interpreted per Cohen’s criteria (large = 0.5, medium = 0.3, small = 0.1) [66],
  • Holm correction to control the family-wise error rate [67],
  • McNemar’s test to compare classification outcomes between paired models,
  • DeLong’s test to assess the statistical significance of AUC differences under the null hypothesis of equal performance [68].

All tests were conducted using appropriate R packages, including rstatix [69] and pROC.
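Among these tests, McNemar's test is simple enough to sketch from scratch. Under the null hypothesis, the discordant pairs (cases one model gets right and the other wrong) follow Binomial(b + c, 0.5); the pure-Python sketch below implements the exact two-sided form, which is one common variant rather than necessarily the implementation used in the paper's R analysis.

```python
# Exact McNemar test on the discordant-pair counts b and c.
from math import comb

def mcnemar_exact(b, c):
    """b: cases model A classifies correctly and B incorrectly; c: the reverse.
    Two-sided exact p-value: sum of all outcomes no more likely than b."""
    n = b + c
    pmf = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    return min(1.0, sum(p for p in pmf if p <= pmf[b] + 1e-12))

p_unbalanced = mcnemar_exact(b=25, c=10)   # one model wins most discordant cases
p_balanced = mcnemar_exact(b=15, c=15)     # perfectly split discordant cases
```

When the discordant cases split heavily toward one model the p-value is small, whereas a perfectly even split gives p = 1, i.e., no evidence that the paired classifiers differ.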

2.9 Model interpretability

Feature importance and effect analyses were conducted to explain individual predictions and quantify how specific feature values influenced model outputs, using tools from the iml package [70]. To further enhance interpretability, we employed SHAP (SHapley Additive exPlanations), a model-agnostic framework that quantifies each feature’s contribution to predicted outcomes [71].

SHAP analysis was applied directly to the final stacked ensemble, treating it as a single predictive function. This approach decomposed predicted probabilities into feature-level attributions of the original clinical variables rather than intermediate model outputs, enabling transparent interpretation of global importance, feature interactions, and local instance-level effects driving mortality predictions.
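The idea behind SHAP, each feature receives its average marginal contribution over all feature coalitions, can be shown exactly on a tiny model. This pure-Python sketch is not the shap library used in the study; it enumerates coalitions for a hypothetical 3-feature additive model, where the Shapley values recover each term's contribution exactly.

```python
# Exact Shapley values by coalition enumeration for a toy additive model.
from itertools import combinations
from math import factorial

def model(x, baseline, present):
    # Features absent from the coalition are replaced by their baseline value.
    z = [x[i] if i in present else baseline[i] for i in range(len(x))]
    return 2 * z[0] + 3 * z[1] - z[2]    # stand-in for a trained predictor

def shapley(x, baseline):
    n = len(x)
    phi = []
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            for S in combinations(rest, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (model(x, baseline, set(S) | {i})
                              - model(x, baseline, set(S)))
        phi.append(total)
    return phi

phi = shapley(x=[1.0, 2.0, 3.0], baseline=[0.0, 0.0, 0.0])
```

For this additive model the attributions equal coefficient times feature value (2, 6, and −3), and they sum to the gap between the prediction and the baseline output, the additivity property that makes SHAP decompositions of the stacked ensemble interpretable.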

3 Results

3.1 Data characteristics and preparation

The study analyzed 4,778 confirmed COVID-19 cases, comprising 116 clinical, laboratory, and demographic features. The overall mortality rate was 22% (1,050 patients). Males accounted for 59.6% of deaths. The mean age of deceased patients was 70.8 years (SD = 15.6), compared with 58.3 years (SD = 16.9) among survivors. Mortality showed significant associations with comorbidities such as hypertension, diabetes, and heart failure, highlighting their importance in COVID-19 risk assessment (Fig 2).

Fig 2. Comorbidity distribution and mortality associations in the study cohort.

https://doi.org/10.1371/journal.pone.0341198.g002

Numeric features were standardized using the robust scaler based on the interquartile range (IQR) to reduce the influence of outliers. The dataset was split into training (70%, n = 3,345) and testing (30%, n = 1,433) subsets, maintaining consistent mortality rates (21.97%) across both. The original training set exhibited class imbalance (22% “Death” vs. 78% “Alive”), which was corrected using the ROSE package [72], resulting in an approximately 1:1 ratio (“Death”: 1,616; “Alive”: 1,729). This ensured robust model training and reliable performance evaluation.
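ROSE is an R package that generates smoothed synthetic samples; as a simplified stand-in, the numpy sketch below shows the effect of rebalancing a 22%/78% training set (counts chosen to match the reported 3,345-patient split) to roughly 1:1 by randomly oversampling the minority class. The exact-copy oversampling here is an illustrative simplification, not ROSE's synthetic-sample mechanism.

```python
# Rebalancing an imbalanced training set to ~1:1 by minority oversampling.
import numpy as np

rng = np.random.default_rng(3)
y_train = np.array(["Death"] * 735 + ["Alive"] * 2610)   # ~22% minority class

minority = np.flatnonzero(y_train == "Death")
majority = np.flatnonzero(y_train == "Alive")
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
balanced_idx = np.concatenate([majority, minority, extra])

counts = {c: int((y_train[balanced_idx] == c).sum())
          for c in ("Death", "Alive")}
```

As noted in the Introduction, any such resampling must be validated (e.g., by evaluating only on the untouched test split, as done here) to avoid artificially inflated performance estimates.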

3.2 Feature selection and correlation analysis

VIF analysis removed highly collinear variables, reducing the dataset to 109 predictors. ANOVA eliminated 39 features with limited predictive value. SBE and Lasso further refined the selection to 15 key mortality predictors: age, neutrophil count (NEUT), lactate dehydrogenase (LDH), ferritin, phosphorus (P), ventilator oxygen saturation (O₂sat.Ventilator), total iron-binding capacity (TIBC), fasting blood sugar (FBS), procalcitonin, serum sodium (Na), muscle pain, chronic kidney disease (CKD), taste/smell loss, D-dimer, and erythrocyte sedimentation rate (ESR).

This selection achieved an optimal balance between model complexity and predictive performance, as adding more features did not improve cross-validation accuracy. The final feature set captured multiple pathophysiological domains relevant to COVID-19 outcomes, including inflammation (NEUT, ESR, ferritin), oxygenation (O₂sat.Ventilator), metabolism (FBS, Na, P, TIBC), and organ dysfunction (procalcitonin, CKD). Symptom-based predictors such as muscle pain and loss of taste/smell further enhanced discrimination between severe and non-severe disease.

Correlation analysis (S1–S2 Tables in S1 File) revealed generally weak associations among predictors, indicating low multicollinearity. Moderate correlations were observed between ferritin and ESR (r = 0.26), ferritin and TIBC (r = –0.34), and TIBC and ESR (r = –0.35), reflecting physiologically coherent inflammation–iron metabolism dynamics. Overall low inter-feature correlations support model stability and generalizability.

All selected predictors are routinely collected in clinical settings, ensuring clinical interpretability and feasibility for integration into real-world decision-support systems (Fig 3).

Fig 3. Correlation between selected features and the outcome (Death/Alive) in the training dataset.

https://doi.org/10.1371/journal.pone.0341198.g003

3.3 Model performance evaluation

Fig 4 and Table S3 in S1 File summarize model performance over 10 repeats of 10-fold cross-validation across ten base classifiers and six boosting/bagging algorithms applied to all 15 predictive features. Accuracy estimates with 95% confidence intervals indicated that AdaBoost achieved the highest performance, with a mean accuracy of 92.81% (SD = 0.013). Optimized hyperparameters for each model are presented in S1 Fig in S1 File.

Fig 4. Performance comparison of base, boosting, and bagging machine learning algorithms using repeated 10-fold cross-validation on the training data.

https://doi.org/10.1371/journal.pone.0341198.g004

3.4 Diversity-guided sub-model construction

The first sub-model set was generated using a traditional stacking approach, producing 16 candidate models with the caretList() function. Correlation analysis from repeated 10-fold cross-validation revealed strong dependencies among several learners. The Generalized Linear Model (GLM) demonstrated high correlations with LDA (0.940), Lasso (0.984), Ridge (0.966), and Elastic Net (0.966); similarly, C5.0 and Random Forest (RF) were highly correlated (0.806). AdaBoost also showed high correlations with NN (0.798), GBM (0.848), and XGBoost (0.950). To reduce redundancy, the less accurate model from each correlated pair was removed, retaining four classifiers—Ridge, KNN, CART, and AdaBoost—for the first sub-model set.

To improve model complementarity, additional sub-model sets were constructed using diversity metrics.

Pairwise analyses indicated that GLM, LDA, and other linear models exhibited high agreement, whereas GBM, NN, and RF displayed greater disagreement, suggesting complementary error patterns (Fig 5).

Fig 5. Disagreement metrics among classifier predictions on the test dataset.

https://doi.org/10.1371/journal.pone.0341198.g005

Non-pairwise metrics further quantified ensemble heterogeneity. Entropy values ranged from 0.64 to 0.90, with the NN–GBM combination showing the greatest diversity. Yule’s Q ranged from 0.39 to 0.47, while Kohavi–Wolpert values ranged from 0.099 to 0.195. Negative Kappa values in some sets indicated substantial prediction disagreement, reinforcing diversity among classifiers (Fig 6, S4 Table in S1 File).

Fig 6. Inter-rater agreement among classifier predictions on the test dataset.

https://doi.org/10.1371/journal.pone.0341198.g006
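Two of the non-pairwise measures can be computed directly from a classifier-by-case correctness matrix, following Kuncheva's definitions: the entropy measure E and the Kohavi-Wolpert variance KW, both driven by l_i, the number of ensemble members that classify case i correctly. The 0/1 flags below are toy data.

```python
import numpy as np

# rows = classifiers (L = 3), columns = test cases (N = 6); 1 = correct
correct = np.array([
    [1, 1, 0, 1, 0, 1],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 1, 0, 1, 1],
])
L, N = correct.shape
l = correct.sum(axis=0)                        # correct votes per case

# Entropy measure: maximal when votes split as evenly as possible
E = np.mean(np.minimum(l, L - l) / (L - np.ceil(L / 2)))
# Kohavi-Wolpert variance: averages the vote-split variability per case
KW = np.sum(l * (L - l)) / (N * L**2)

print(f"entropy = {E:.3f}, Kohavi-Wolpert = {KW:.3f}")
```

On this toy matrix KW lands near the upper end of the 0.099-0.195 range reported for the study's sub-model sets, by construction rather than by design.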

Ultimately, eight sub-model sets were selected based on these metrics (Table 3, Fig 7).

Table 3. Diversity metrics for the eight selected sub-model sets on the test dataset.

https://doi.org/10.1371/journal.pone.0341198.t003

Fig 7. Comparison of diversity metrics across eight selected sub-model sets on the test dataset.

https://doi.org/10.1371/journal.pone.0341198.g007

This diversity-driven selection ensured inclusion of models differing in both architecture and error behavior, enhancing ensemble robustness and reducing correlated prediction errors.

3.5 Stacking model evaluation and statistical comparison

Stacking was performed using five meta-learners: Generalized Linear Model (GLM), Linear Discriminant Analysis (LDA), Neural Network (NN), Gradient Boosting Machine (GBM), and Random Forest (RF). Table 4 summarizes accuracies across stacking configurations on the independent test dataset.

Table 4. Accuracy of stacking sub-model sets using five different meta-learners on the test dataset.

https://doi.org/10.1371/journal.pone.0341198.t004

Not all stacking configurations improved upon the strongest base classifier. Ensembles composed of highly correlated models (e.g., Ridge–KNN–CART–AdaBoost) yielded limited gains, indicating performance saturation. In contrast, heterogeneous combinations—such as NB + GBM, RF + XGB, and NB + C5.0 + GBM—achieved significant improvements (accuracy up to 0.914). The NN meta-learner consistently outperformed other meta-learners by capturing nonlinear relationships among base-model outputs.
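A heterogeneous configuration of this kind can be sketched with scikit-learn's StackingClassifier: tree-based base learners whose out-of-fold predicted probabilities feed a neural-network meta-learner. The paper performs this in R (caretStack); here GradientBoostingClassifier stands in for XGBoost, and the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("gbm", GradientBoostingClassifier(random_state=0))],
    final_estimator=MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                                  random_state=0),
    stack_method="predict_proba",   # meta-learner sees class probabilities
    cv=5)                           # out-of-fold predictions avoid leakage

stack.fit(X_tr, y_tr)
print(f"test accuracy = {stack.score(X_te, y_te):.3f}")
```

The cv argument is the key detail: the meta-learner is trained only on out-of-fold base-model outputs, so it never sees predictions made on data the base models were fit to.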

Statistical analyses (Table 5) showed that AdaBoost remained superior to some stacking configurations, supported by significant Wilcoxon results favoring the single model.

Table 5. Results of significance tests comparing the best base classifier with the stacking model in each sub-model set.

https://doi.org/10.1371/journal.pone.0341198.t005

Wilcoxon signed-rank, McNemar’s, and ROC-based tests compared stacking models with their best-performing base learners, applying Holm-adjusted p-values to control family-wise error and reporting effect sizes (r). Most pairwise comparisons (e.g., GBM vs. NN–GBM stack) showed small effects (r < 0.2, p > 0.05), indicating negligible practical gain. However, combinations such as NB + GBM with GLM meta-learner, AdaBoost + KNN with GBM meta-learner, and RF + XGB with NN meta-learner achieved significant improvements (r > 0.8), reflecting meaningful accuracy gains.
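McNemar's test, one of the paired comparisons used here, depends only on the discordant test cases (one model right, the other wrong). A minimal exact version via the binomial distribution, on toy correctness flags rather than the study's predictions; in the paper the resulting p-values are then Holm-adjusted across the family of comparisons.

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
base_ok = rng.random(200) < 0.85      # base learner correct on ~85% of cases
stack_ok = rng.random(200) < 0.91     # stacked model correct on ~91%

b = int(np.sum(base_ok & ~stack_ok))  # base right, stack wrong
c = int(np.sum(~base_ok & stack_ok))  # stack right, base wrong

# Exact McNemar: under H0 the discordant pairs split 50/50
p = binomtest(b, b + c, 0.5).pvalue
print(f"discordant pairs: b = {b}, c = {c}, p = {p:.4f}")
```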

The best-performing configuration—stacking RF and XGB with an NN meta-learner—achieved:

  • Accuracy: 0.914 (95% CI: 0.898–0.928)
  • AUC: 0.955
  • F1 score: 0.801
  • MCC: 0.746

This model outperformed both individual classifiers and other stacking variants (Tables 6–7, Fig 8). Wilcoxon tests showed large effect sizes (r > 0.5), confirming substantial performance improvements, whereas McNemar’s and DeLong’s tests indicated that some pairwise differences were not significant after correction. ROC curves (Fig 9) demonstrated strong sensitivity and specificity, and calibration plots (Fig 10) showed excellent alignment between predicted and observed outcomes.
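The four headline metrics (accuracy, AUC, F1, MCC) are standard scikit-learn calls; the toy labels and probabilities below are illustrative, not the study's test-set values.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.2, 0.3, 0.6, 0.8, 0.9, 0.7, 0.2, 0.4, 0.1])
y_pred = (y_prob >= 0.5).astype(int)   # hard labels from a 0.5 threshold

print("accuracy:", accuracy_score(y_true, y_pred))
print("AUC:     ", roc_auc_score(y_true, y_prob))   # uses probabilities
print("F1:      ", f1_score(y_true, y_pred))
print("MCC:     ", matthews_corrcoef(y_true, y_pred))
```

Note that AUC is threshold-free (computed from probabilities), while F1 and MCC depend on the chosen cut-off; MCC is the most informative single number under class imbalance.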

Table 6. Performance evaluation of selected stacking models that outperform the most accurate individual algorithm in their respective combinations.

https://doi.org/10.1371/journal.pone.0341198.t006

Table 7. Statistical comparison between stacked random forest (RF) and XGBoost (XGB) utilizing a neural network (NN) meta-learner and other stacking models.

https://doi.org/10.1371/journal.pone.0341198.t007

Fig 8. Performance metrics of the selected stacked models on the training dataset.

https://doi.org/10.1371/journal.pone.0341198.g008

Fig 9. ROC curves of the best-performing stacked models on the test dataset.

Stacking NB and GBM using the GLM meta-learner (stack.GLM.GBM.NB); SVM and GBM using the GBM meta-learner (stack.GBM.SVM.GBM); RF, CART, NN, GBM, XGB, and Treebag using the Random Forest meta-learner (stack.RF.RF.CART.NN.GBM.XGB.Treebag); RF and XGB using the Neural Network meta-learner (stack.NN.RF.XGB); NB, C5.0, and GBM using the GBM meta-learner (stack.GBM.NB.C5.0.GBM).

https://doi.org/10.1371/journal.pone.0341198.g009

Fig 10. Calibration plot of the stacked Random Forest (RF) and XGBoost (XGB) model using a Neural Network (NN) meta-learner under repeated 10-fold cross-validation.

https://doi.org/10.1371/journal.pone.0341198.g010

3.6 Computational complexity and training time

Computational complexity was assessed on a standard workstation (Intel Core i5 M520 @ 2.40 GHz, 8 GB RAM, Windows 10, 64-bit). Training times varied across models: XGB required ~98 s, RF ~306 s, and AdaBoost ~869 s. The stacked RF–XGB–NN model required ~450 s, faster than AdaBoost despite its greater structural complexity (Table 8).

Table 8. Training and prediction times for single models and stacking ensembles.

https://doi.org/10.1371/journal.pone.0341198.t008

Inference times were uniformly low, ranging from 0.01 s per patient (XGB) to 0.70 s (AdaBoost). The stacked model achieved 0.17 s per prediction, supporting deployment in near real-time clinical settings through electronic decision-support tools.
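Wall-clock measurements of this kind reduce to bracketing fit() and predict() calls with a monotonic timer; the model and data below are stand-ins for the paper's setup, not a reproduction of the reported times.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=15, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)

t0 = time.perf_counter()          # monotonic clock, suited to benchmarking
model.fit(X, y)
train_s = time.perf_counter() - t0

t0 = time.perf_counter()
model.predict(X[:1])              # single-patient inference
infer_s = time.perf_counter() - t0

print(f"train: {train_s:.2f} s, per-prediction: {infer_s:.4f} s")
```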

3.7 Model interpretation

Feature importance analysis of the stacked RF–XGB–NN ensemble identified age as the most influential predictor of mortality, demonstrating strong predictive stability (variability ±0.06, permutation error 0.237) (Fig 11, S5 Table in S1 File).

Fig 11. Most influential predictors contributing to “Death” outcomes in the stacked RF–XGB model with an NN meta-learner.

https://doi.org/10.1371/journal.pone.0341198.g011

Neutrophil count (NEUT), phosphorus levels, and oxygen saturation while on a ventilator (O2sat.Ventilator) followed as critical predictors, reflecting infection severity, metabolic status, and respiratory function, respectively. Additional features such as lactate and ferritin contributed meaningfully, consistent with their roles in sepsis, inflammation, and critical illness.
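The variability and permutation-error figures quoted above suggest a permutation-based importance scheme: shuffle one feature at a time and record the drop in score. A scikit-learn sketch on synthetic data; the feature indices are illustrative, not the study's column names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=15, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# n_repeats permutations per feature give a mean importance and a spread
result = permutation_importance(model, X_te, y_te, n_repeats=10,
                                random_state=0, scoring="accuracy")

top = np.argsort(result.importances_mean)[::-1][:3]
for i in top:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```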

Model-agnostic SHAP analysis, applied to the final stacked model, decomposed individual predictions into feature-level contributions. For the “Death” class, advancing age (φ = –0.18), reduced O₂sat.Ventilator (φ = –0.02), and elevated NEUT (φ = –0.08) were the dominant mortality drivers (Fig 12, S6 Table in S1 File). Reduced sodium (NA, φ = –0.08) and phosphorus (P, φ = –0.04) also contributed to poor outcomes, whereas higher lactate levels had a modest positive contribution (φ = 0.02). Features such as muscle pain, taste/smell disturbances, CKD, FBS, ESR, ferritin, and D-dimer had minimal SHAP contributions.

Fig 12. SHAP-based interpretation of the stacked RF–XGB model using a Neural Network meta-learner.

https://doi.org/10.1371/journal.pone.0341198.g012
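The Shapley decomposition underlying Fig 12 can be made concrete on a toy case: each feature's value φ is its average marginal contribution over all orderings of the features. Real analyses use the shap package; the three feature names echo the paper's predictors, but the value function v() and its numbers are invented purely to make the arithmetic visible.

```python
from itertools import permutations

features = ["age", "NEUT", "O2sat"]

def v(coalition):
    # Hypothetical model output when only these features are "known";
    # the scores are arbitrary illustration values, not study estimates.
    scores = {frozenset(): 0.0,
              frozenset({"age"}): 0.30,
              frozenset({"NEUT"}): 0.10,
              frozenset({"O2sat"}): 0.05,
              frozenset({"age", "NEUT"}): 0.45,
              frozenset({"age", "O2sat"}): 0.40,
              frozenset({"NEUT", "O2sat"}): 0.20,
              frozenset({"age", "NEUT", "O2sat"}): 0.60}
    return scores[frozenset(coalition)]

phi = {f: 0.0 for f in features}
orders = list(permutations(features))
for order in orders:
    seen = set()
    for f in order:
        phi[f] += v(seen | {f}) - v(seen)   # marginal contribution of f
        seen.add(f)
phi = {f: p / len(orders) for f, p in phi.items()}

print(phi)   # by the efficiency property, the values sum to v(all features)
```

Exact enumeration is only feasible for a handful of features; shap's approximations exist precisely because this computation is exponential in the number of features.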

Interaction analysis (Fig 13, S7 Table in S1 File) showed that age exhibited strong interactions with O₂sat.Ventilator, NEUT, phosphorus, and ferritin, highlighting their synergistic effects on mortality risk. Together, these findings demonstrate that the stacking ensemble not only improves predictive accuracy but also maintains meaningful clinical interpretability.

Fig 13. Interaction effects between age and key clinical predictors influencing mortality in the stacked RF–XGB model with an NN meta-learner.

https://doi.org/10.1371/journal.pone.0341198.g013

4 Discussion

This study introduced a hybrid feature-selection and diversity-guided stacking framework designed to improve predictive accuracy, interpretability, and computational efficiency in high-dimensional clinical data. Although demonstrated on a large cohort of 4,778 COVID-19 patients, the proposed approach is broadly applicable to biomedical, environmental, and engineering domains that require scalable and transparent ensemble learning.

Our framework integrates hybrid feature selection—combining VIF, ANOVA, SBE, and Lasso—with a diversity-based stacking strategy that systematically quantifies inter-model complementarity using both pairwise (e.g., Yule’s Q, Disagreement, Kappa) and non-pairwise (e.g., Entropy, Kohavi–Wolpert) diversity measures. This approach directly addresses major limitations of prior COVID-19 prognostic models, including small sample sizes, poor calibration, and redundant base learners [29–31]. It also mitigates the trade-offs observed in traditional models, wherein improved predictive performance often comes at the cost of computational demand or reduced interpretability—limitations frequently encountered in deep neural networks and boosting algorithms [34–37].
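The VIF step of that pipeline reduces to VIF_j = 1/(1 − R²_j), where R²_j comes from regressing feature j on the remaining features. A manual sketch on toy data; the threshold of 5 is a common convention, not necessarily the paper's cut-off.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + 0.05 * rng.normal(size=200)   # near-duplicate feature

for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    # R^2 of feature j regressed on all other features
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    vif = 1.0 / (1.0 - r2)
    flag = "  <- would be dropped (VIF > 5)" if vif > 5 else ""
    print(f"feature {j}: VIF = {vif:.1f}{flag}")
```

The two collinear columns receive very large VIFs while the independent ones stay near 1, which is exactly the redundancy the multicollinearity filter removes before ANOVA, SBE, and Lasso are applied.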

A key advantage of this study is the large sample size, which is substantially greater than that of many prior investigations. This enhances both the stability and generalizability of our findings. A multicenter study in Tehran reported a case fatality rate (CFR) of 10.05% across 19 hospitals [73]. Consistent with guidelines recommending at least 20 events per predictor variable for outcomes with low prevalence [74], we determined that a minimum of 3,000 patients would be required to reliably identify significant mortality predictors. Our dataset exceeded this threshold, reducing the risk of overfitting and supporting the robustness of the derived model.

The use of robust scaling for standardization and ROSE-based resampling for class balance was essential to model reliability; class imbalance is a common barrier in predictive modeling [75], particularly in the context of COVID-19 [76]. The original training dataset exhibited pronounced class imbalance, with “Death” cases constituting only 22% of the cohort. This imbalance skewed model learning toward the majority “Alive” class, inflating accuracy but suppressing sensitivity, a critical limitation in mortality prediction. ROSE resampling created a fully balanced 1:1 dataset, substantially improving sensitivity and enabling models to better recognize minority-class (Death) cases. As expected, this rebalancing slightly reduced specificity due to the presence of synthetic samples. Although ROSE improves minority-class recognition and stabilizes cross-validation metrics, synthetic oversampling may introduce mild calibration shifts or artificial patterns, as noted in previous studies [77,78]. To mitigate this, final model performance was strictly evaluated on the original unbalanced test set, ensuring that reported results reflect real clinical distributions rather than synthetic balance.
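ROSE generates synthetic minority examples by a smoothed bootstrap: resample minority cases, then perturb them with kernel noise. ROSE itself is an R package; this NumPy sketch reproduces the idea on toy data, with the bandwidth factor chosen arbitrarily rather than by ROSE's kernel-width formula.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy cohort mirroring the paper's 22% minority ("Death") share
X_min = rng.normal(loc=[2.0, -1.0], scale=0.5, size=(22, 2))
X_maj = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(78, 2))

n_new = len(X_maj) - len(X_min)                 # reach a 1:1 balance
seeds = X_min[rng.integers(0, len(X_min), n_new)]   # bootstrap minority rows
h = 0.25 * X_min.std(axis=0)                    # per-feature bandwidth (ad hoc)
X_syn = seeds + rng.normal(0.0, h, size=seeds.shape)  # smoothed perturbation

X_bal = np.vstack([X_maj, X_min, X_syn])
y_bal = np.array([0] * len(X_maj) + [1] * (len(X_min) + n_new))
print(np.bincount(y_bal))                       # balanced class counts
```

The smoothing is what distinguishes ROSE from plain random oversampling: synthetic points fall near, but not exactly on, observed minority cases.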

Through iterative refinement, the dataset was reduced from 115 to 15 clinically interpretable features—such as age, neutrophil count, ferritin, and O₂ saturation—all of which are routinely available in electronic health records. These variables span immunologic, metabolic, and respiratory domains, collectively capturing key biological mechanisms underlying severe COVID-19 outcomes.

Traditional stacking methods that use highly correlated base learners provided only limited performance gains, confirming that redundancy constrains the potential benefits of ensemble modeling. In contrast, the proposed diversity-guided stacking approach achieved superior accuracy and generalization by leveraging heterogeneous classifiers with complementary error patterns.

For instance, Ribeiro et al. [79] demonstrated that stacking models can enhance predictive performance in COVID-19 outcomes. Their stack-ensemble model, which incorporated support vector regression, effectively forecasted mortality among 14,267 COVID-19 patients in Brazil. Similarly, our findings reinforce the notion that combining strong, heterogeneous classifiers through stacking is more effective than relying on a single best-performing classifier [53].

This observation aligns with emerging literature on ensemble learning. Hussain et al. [80] highlighted the superiority of hybrid classifier systems in improving prediction accuracy, reporting an AUC of 0.96 with a deep stacking neural network for mortality-risk prediction. Together, these studies support the effectiveness of diversity-aware ensemble approaches in high-stakes biomedical prediction.

Our stacking model significantly outperformed established machine learning approaches reported in the literature. For example, Yakovyna et al. [28] applied a combination of supervised and unsupervised learning techniques but did not achieve the level of discrimination observed in our stacking model. Rahmatinejad et al. [29] reported high Brier scores for Random Forest and improved precision and sensitivity for XGBoost; however, our RF–XGB–NN stacking configuration provided substantially higher accuracy and AUC, supported by rigorous statistical analyses, including Wilcoxon, McNemar, and DeLong tests. Furthermore, compared to a study using over 500 EHR variables to train an RF model for sepsis mortality prediction [81], our stacking method achieved notably better calibration and discrimination while maintaining computational efficiency.

Our findings also outperform those of de Paiva et al. [82], who analyzed 10,897 COVID-19 patients using various machine learning models—including FNet transformers, convolutional neural networks, support vector machines, LightGBM, and traditional statistical approaches such as LASSO and Generalized Additive Models (GAM). Their best models achieved an AUROC of 0.826 and a MacroF1 score of 65.4%. In contrast, our stacking framework delivered an F1 score of 80.1% and AUC of 0.955, demonstrating substantially improved predictive accuracy and reliability.

The Neural Network meta-learner emerged as the most effective combiner of base-model outputs. Neural meta-learning has previously been shown to outperform linear or tree-based approaches by capturing higher-order, nonlinear relationships among model outputs, particularly in biomedical prediction tasks [83–85]. Our cross-validation results confirmed this, with the NN meta-learner providing superior discrimination and calibration across diverse sub-model sets.

Feature importance analysis identified age as the strongest predictor of mortality, followed by neutrophil count (NEUT), phosphorus levels, oxygen saturation (O₂sat.Ventilator), and lactate. Elevated NEUT counts have been widely associated with severe disease and cytokine storm responses, while reduced oxygen saturation is a direct marker of respiratory compromise [86,87]. SHAP analysis further showed that advancing age and high NEUT levels significantly increased mortality risk. These results reinforce established clinical findings that older age and immune dysregulation critically influence COVID-19 severity [88].

Beyond performance metrics, the proposed framework emphasizes computational efficiency and real-world applicability. Although training complexity increased relative to single models, the computation remained manageable and feasible for operational clinical environments. Inference latency (<0.2 seconds per prediction) is sufficiently low for real-time or near-real-time decision support, including bedside applications and automated triage systems.

Overall, this study presents a generalizable, diversity-aware ensemble framework that balances accuracy, interpretability, and computational efficiency. While validated for COVID-19 mortality prediction, the approach is adaptable to broader biomedical domains where heterogeneous data, transparency, and performance stability are critical.

5 Limitations and biases

This study has several limitations that should be considered when interpreting the findings. First, the retrospective EHR-based design may introduce selection and information biases, as data collection depended on available hospital records, which may not fully capture all relevant clinical variables [89]. Retrospective EHR studies are particularly vulnerable to incomplete or non-standardized data, including missing values, heterogeneous logging practices, and inconsistent variable definitions across hospitals [90]. Moreover, implicit clinician bias, referral patterns, and disparities in diagnosis or treatment may have influenced the data used for training, thereby propagating systemic inequities into predictive algorithms [91].

Second, the dataset was limited to three Iranian hospitals, which constrains the generalizability and external validity of the model. Differences in genetic, sociodemographic, cultural, and healthcare system characteristics across populations may influence both feature distributions and outcome risks, and prior reviews have shown that COVID-19 prediction models often perform poorly outside their development setting [92]. For example, variations in comorbidity prevalence, access to intensive care resources, and laboratory reference ranges may affect model performance in non-Iranian populations. Without independent external validation, our results should be interpreted cautiously when applied elsewhere. Recent reviews further emphasize that prediction models can never be considered fully “validated,” as transportability depends on population, setting, and temporal context [78,93]. Accordingly, validation in multinational, diverse cohorts is essential before clinical translation.

Third, although we applied resampling (ROSE) to mitigate class imbalance, oversampling approaches may embed artificial patterns that distort calibration or inflate predictive accuracy. This limitation has been repeatedly recognized in both COVID-19 studies and broader clinical prognostic modeling research [77,78].

Fourth, although ensemble learning (stacking) enhanced predictive performance, it introduced computational complexity and challenges for clinical deployment. Even with SHAP-based interpretability, ensemble models remain partially opaque, and post-hoc explanations represent approximations rather than causal insights. This raises concerns about clinical trust and automation bias, particularly if model outputs are adopted uncritically [93].

Fifth, our models did not incorporate unmeasured contextual confounders such as evolving treatment regimens, viral variants, or social determinants of health. As shown in previous literature, omission of such factors can bias effect estimates and limit real-world applicability [91]. Relatedly, model performance drift is a potential risk, as COVID-19 epidemiology, therapeutic strategies, and patient characteristics have evolved over time, necessitating ongoing monitoring and recalibration [78].

Finally, several design-related limitations warrant consideration. Statistical literature highlights that non-random sampling and limited site representativeness reduce the generalizability of predictive models, particularly when outcome heterogeneity is present [92]. Furthermore, most machine learning studies—including our own—focus primarily on discrimination metrics (e.g., AUC), with less emphasis on calibration and fairness assessments, which limits their clinical interpretability and adoption [93].

Despite these limitations, we employed rigorous validation strategies, including repeated cross-validation, calibration evaluation, and effect size reporting, consistent with TRIPOD recommendations [94]. These measures enhance robustness and transparency; however, independent, prospective, multi-center validation in larger and more diverse populations remains essential before clinical implementation.

6 Conclusion

In conclusion, this study demonstrates that a diversity-guided stacking ensemble—integrating Random Forest, XGBoost, and a Neural Network meta-learner—can achieve high predictive accuracy and interpretability for COVID-19 mortality risk. By combining a hybrid feature selection pipeline with heterogeneous base learners, the framework effectively reduced redundancy, captured nonlinear interactions, and maintained computational efficiency suitable for near real-time deployment.

Key predictors such as age, neutrophil count, phosphorus, and oxygen saturation consistently aligned with known clinical mechanisms of severe COVID-19, supporting both the statistical and biological validity of the model. SHAP-based interpretation further illustrated how interactions among these variables shape mortality risk, helping to bridge predictive performance with clinical insight.

Nevertheless, this work has several limitations, including its retrospective design, reliance on data from a single regional health system, and the absence of external validation, all of which may restrict generalizability. Future research should extend this framework to multi-center or multi-disease cohorts, integrate multimodal data sources (e.g., imaging, genomics), and evaluate real-time performance in prospective clinical environments.

Ultimately, the proposed stacking strategy represents a scalable and interpretable modeling paradigm that can be readily adapted to a wide range of clinical prediction tasks beyond COVID-19, advancing the application of ensemble learning for precision medicine and healthcare decision support.

Supporting information

S1 File.

This file contains supplementary tables, figures, and additional analyses including correlation matrices, model tuning parameters, performance summaries, feature importance results, interaction analyses, and SHAP outputs.

https://doi.org/10.1371/journal.pone.0341198.s001

(DOCX)

Acknowledgments

This article was part of the Ph.D. dissertation in epidemiology at Shahid Beheshti University of Medical Sciences (SBMU).

References

  1. Das K, Behera RN. A survey on machine learning: concept, algorithms and applications. International Journal of Innovative Research in Computer and Communication Engineering. 2017;5(2):1301–9.
  2. Windeatt T, Ghaderi R. Binary labelling and decision-level fusion. Information Fusion. 2001;2(2):103–12.
  3. Ghasemieh A, Lloyed A, Bahrami P, Vajar P, Kashef R. A novel machine learning model with Stacking Ensemble Learner for predicting emergency readmission of heart-disease patients. Decision Analytics Journal. 2023;7:100242.
  4. Berliana AU, Bustamam A. Implementation of Stacking Ensemble Learning for Classification of COVID-19 using Image Dataset CT Scan and Lung X-Ray. In: 2020 3rd International Conference on Information and Communications Technology (ICOIACT), 2020. 148–52. https://doi.org/10.1109/icoiact50329.2020.9332112
  5. Khyani D, Jakkula S, Gowda S, Anusha KJ, Swetha KR. An interpretation of stacking and blending approach in machine learning. Int Res J Eng Technol. 2021;8(07).
  6. Mienye ID, Sun Y. A Survey of Ensemble Learning: Concepts, Algorithms, Applications, and Prospects. IEEE Access. 2022;10:99129–49.
  7. Graczyk M, Lasota T, Trawiński B, Trawiński K. Comparison of bagging, boosting and stacking ensembles applied to real estate appraisal. In: Intelligent Information and Database Systems: Second International Conference, ACIIDS, Hue City, Vietnam, March 24-26, 2010 Proceedings, Part II, 2010.
  8. Feng Y, Li B, Fu R, Hao Y, Wang T, Guo H, et al. A simplified coronary model for diagnosis of ischemia-causing coronary stenosis. Comput Methods Programs Biomed. 2023;242:107862. pmid:37857024
  9. Wang X, Bakulski KM, Mukherjee B, Hu H, Park SK. Predicting cumulative lead (Pb) exposure using the Super Learner algorithm. Chemosphere. 2023;311(Pt 2):137125. pmid:36347347
  10. Xu W, He H, Guo Z, Li W. Evaluation of machine learning models on protein level inference from prioritized RNA features. Brief Bioinform. 2022;23(3):bbac091. pmid:35352096
  11. Reda R, Saffaj T, Bouzida I, Saidi O, Belgrir M, Lakssir B, et al. Optimized variable selection and machine learning models for olive oil quality assessment using portable near infrared spectroscopy. Spectrochim Acta A Mol Biomol Spectrosc. 2023;303:123213. pmid:37523847
  12. Mohammed M, Mwambi H, Mboya IB, Elbashir MK, Omolo B. A stacking ensemble deep learning approach to cancer type classification based on TCGA data. Sci Rep. 2021;11(1):15626. pmid:34341396
  13. Wang B, Liu J, Zhang X, Lin J, Li S, Wang Z, et al. A stacking ensemble framework integrating radiomics and deep learning for prognostic prediction in head and neck cancer. Radiat Oncol. 2025;20(1):127. pmid:40804402
  14. Kwon H, Park J, Lee Y. Stacking Ensemble Technique for Classifying Breast Cancer. Healthc Inform Res. 2019;25(4):283–8. pmid:31777671
  15. Alhussen A, Anul Haq M, Ahmad Khan A, Mahendran RK, Kadry S. XAI-RACapsNet: Relevance aware capsule network-based breast cancer detection using mammography images via explainability O-net ROI segmentation. Expert Systems with Applications. 2025;261:125461.
  16. Haq MA, Khan I, Ahmed A, Eldin SM, Alshehri ALI, Ghamry NA. DCNNBT: A novel deep convolution neural network-based brain tumor classification model. Fractals. 2023;31(06):2340102.
  17. Yousef R, Khan S, Gupta G, Siddiqui T, Albahlal BM, Alajlan SA, et al. U-Net-Based Models towards Optimal MR Brain Image Segmentation. Diagnostics (Basel). 2023;13(9):1624. pmid:37175015
  18. Abualnaja SY, Morris JS, Rashid H, Cook WH, Helmy AE. Machine learning for predicting post-operative outcomes in meningiomas: a systematic review and meta-analysis. Acta Neurochir (Wien). 2024;166(1):505. pmid:39688716
  19. Lei J, Zhai J, Zhang Y, Qi J, Sun C. Supervised Machine Learning Models for Predicting Sepsis-Associated Liver Injury in Patients With Sepsis: Development and Validation Study Based on a Multicenter Cohort Study. J Med Internet Res. 2025;27:e66733. pmid:40418571
  20. Dhingra LS, Aminorroaya A, Sangha V, Pedroso AF, Shankar SV, Coppi A, et al. Ensemble deep learning algorithm for structural heart disease screening using electrocardiographic images: PRESENT SHD. Journal of the American College of Cardiology. 2025;85(12):1302–13.
  21. Tseng PY, Chen YT, Wang CH, Chiu KM, Peng YS, Hsu SP. Prediction of the development of acute kidney injury following cardiac surgery by machine learning. Critical Care. 2020;24(1):478.
  22. Sawesi S, Jadhav A, Rashrash B. Machine Learning and Deep Learning Techniques for Prediction and Diagnosis of Leptospirosis: Systematic Literature Review. JMIR Med Inform. 2025;13:e67859. pmid:40440642
  23. Chiasakul T, Lam BD, McNichol M, Robertson W, Rosovsky RP, Lake L, et al. Artificial intelligence in the prediction of venous thromboembolism: A systematic review and pooled analysis. Eur J Haematol. 2023;111(6):951–62. pmid:37794526
  24. Cramer EY, Ray EL, Lopez VK, Bracher J, Brennen A, Castro Rivadeneira AJ, et al. Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the United States. Proc Natl Acad Sci U S A. 2022;119(15):e2113561119. pmid:35394862
  25. Dessie ZG, Zewotir T. Mortality-related risk factors of COVID-19: a systematic review and meta-analysis of 42 studies and 423,117 patients. BMC Infect Dis. 2021;21(1):855. pmid:34418980
  26. Cui S, Wang Y, Wang D, Sai Q, Huang Z, Cheng TCE. A two-layer nested heterogeneous ensemble learning predictive method for COVID-19 mortality. Appl Soft Comput. 2021;113:107946. pmid:34646110
  27. Li J, Li X, Hutchinson J, Asad M, Liu Y, Wang Y, et al. An ensemble prediction model for COVID-19 mortality risk. Biol Methods Protoc. 2022;7(1):bpac029. pmid:36438173
  28. Yakovyna V, Shakhovska N, Szpakowska A. A novel hybrid supervised and unsupervised hierarchical ensemble for COVID-19 cases and mortality prediction. Sci Rep. 2024;14(1):9782. pmid:38684770
  29. Rahmatinejad Z, Dehghani T, Hoseini B, Rahmatinejad F, Lotfata A, Reihani H, et al. A comparative study of explainable ensemble learning and logistic regression for predicting in-hospital mortality in the emergency department. Sci Rep. 2024;14(1):3406. pmid:38337000
  30. Wynants L, Van Calster B, Collins GS, Riley RD, Heinze G, Schuit E, et al. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ. 2020;369.
  31. Jamshidi MB, Lalbakhsh A, Talla J, Peroutka Z, Hadjilooei F, Lalbakhsh P, et al. Artificial Intelligence and COVID-19: Deep Learning Approaches for Diagnosis and Treatment. IEEE Access. 2020;8:109581–95. pmid:34192103
  32. Sperrin M, Grant SW, Peek N. Prediction models for diagnosis and prognosis in Covid-19. BMJ. 2020.
  33. Banoei MM, Dinparastisaleh R, Zadeh AV, Mirsaeidi M. Machine-learning-based COVID-19 mortality prediction model and identification of patients at low and high risk of dying. Crit Care. 2021;25(1):328. pmid:34496940
  34. Saxena A, Nixon B, Boyd A, Evans J, Faraone SV. A systematic review of the application of graph neural networks to extract candidate genes and biological associations. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics. 2025;:e33031.
  35. Ji C, Yu N, Wang Y, Ni J, Zheng C. SGLMDA: A Subgraph Learning-Based Method for miRNA-Disease Association Prediction. IEEE/ACM Trans Comput Biol Bioinform. 2024;21(5):1191–201. pmid:38446654
  36. Wang J, Li J, Yue K, Wang L, Ma Y, Li Q. NMCMDA: neural multicategory MiRNA-disease association prediction. Brief Bioinform. 2021;22(5):bbab074. pmid:33778850
  37. Li J, Lin H, Wang Y, Li Z, Wu B. Prediction of potential small molecule-miRNA associations based on heterogeneous network representation learning. Front Genet. 2022;13:1079053. pmid:36531225
  38. Riley RD, Ensor J, Snell KIE, Archer L, Whittle R, Dhiman P, et al. Importance of sample size on the quality and utility of AI-based prediction models for healthcare. Lancet Digit Health. 2025;7(6):100857. pmid:40461350
  39. Yang Y, Khorshidi HA, Aickelin U. A review on over-sampling techniques in classification of multi-class imbalanced datasets: insights for medical problems. Front Digit Health. 2024;6:1430245. pmid:39131184
  40. Salmi M, Atif D, Oliva D, Abraham A, Ventura S. Handling imbalanced medical datasets: review of a decade of research. Artif Intell Rev. 2024;57(10).
  41. Malhotra R, Khanna M. Particle swarm optimization-based ensemble learning for software change prediction. Information and Software Technology. 2018;102:65–84.
  42. Yakovyna V, Shakhovska N, Szpakowska A. A novel hybrid supervised and unsupervised hierarchical ensemble for COVID-19 cases and mortality prediction. Sci Rep. 2024;14(1):9782. pmid:38684770
  43. Tian W, Jiang W, Yao J, Nicholson CJ, Li RH, Sigurslid HH, et al. Predictors of mortality in hospitalized COVID-19 patients: A systematic review and meta-analysis. J Med Virol. 2020;92(10):1875–83. pmid:32441789
  44. de Paiva BBM, Pereira PD, de Andrade CMV, Gomes VMR, Souza-Silva MVR, Martins KPMP, et al. Potential and limitations of machine meta-learning (ensemble) methods for predicting COVID-19 mortality in a large inhospital Brazilian dataset. Sci Rep. 2023;13(1):3463. pmid:36859446
  45. Hatamabadi H, Sabaghian T, Sadeghi A, Heidari K, Safavi-Naini SAA, Looha MA. Epidemiology of COVID-19 in Tehran, Iran: A cohort study of clinical profile, risk factors, and outcomes. BioMed Research International. 2022;2022.
  46. Klomp T. Iterative Imputation in Python: A Study on the Performance of the Package IterativeImputer. University Utrecht. 2022.
  47. Barough SS, Safavi-Naini SAA, Siavoshi F, Tamimi A, Ilkhani S, Akbari S, et al. Generalizable machine learning approach for COVID-19 mortality risk prediction using on-admission clinical and laboratory features. Sci Rep. 2023;13(1):2399. pmid:36765157
  48. Sharma V. A Study on Data Scaling Methods for Machine Learning. INJGAADMIN, ftjijgasr. 2022;1(1).
  49. Mishra S, Pradhan RK. Analyzing the impact of feature correlation on classification accuracy of machine learning model. In: 2023.
  50. Chandrashekar G, Sahin F. A survey on feature selection methods. Computers & Electrical Engineering. 2014;40(1):16–28.
  51. Alin A. Multicollinearity. Wiley Interdiscip Rev Comput Stat. 2010;2(3):370–4.
  52. Daoud JI. Multicollinearity and regression analysis. Journal of Physics: Conference Series. 2017.
  53. Sheskin DJ. Handbook of parametric and nonparametric statistical procedures. CRC Press. 2020.
  54. Moorthy U, Gandhi UD. RETRACTED ARTICLE: A novel optimal feature selection technique for medical data classification using ANOVA based whale optimization. J Ambient Intell Human Comput. 2020;12(3):3527–38.
  55. Ladha L, Deepa T. Feature Selection Methods and Algorithms. International Journal on Computer Science and Engineering (IJCSE). 2023;55.
  56. Okser S, Pahikkala T, Airola A, Salakoski T, Ripatti S, Aittokallio T. Regularized machine learning in the genetic prediction of complex traits. PLoS Genet. 2014;10(11):e1004754. pmid:25393026
  57. Kuhn M. Building predictive models in R using the caret package. Journal of Statistical Software. 2008;28:1–26.
  58. Kuhn M. Variable selection using the caret package. 2012.
  59. Berrar D. Cross-validation. 2019. 542–5.
  60. Bottino F, Tagliente E, Pasquini L, Napoli AD, Lucignani M, Figà-Talamanca L, et al. COVID Mortality Prediction with Machine Learning Methods: A Systematic Review and Critical Appraisal. J Pers Med. 2021;11(9):893. pmid:34575670
  61. Brownlee J. Machine learning mastery with R: Get started, build accurate models and work through projects step-by-step. Machine Learning Mastery. 2016.
  62. Tattar PN. Hands-On Ensemble Learning with R: A Beginner’s Guide to Combining the Power of Machine Learning Algorithms Using Ensemble Techniques. Packt Publishing Ltd. 2018.
  63. Kuncheva LI. Combining pattern classifiers: methods and algorithms. John Wiley & Sons. 2014.
  64. Wang L, Mo T, Wang X, Chen W, He Q, Li X, et al. A hierarchical fusion framework to integrate homogeneous and heterogeneous classifiers for medical decision-making. Knowledge-Based Systems. 2021;212:106517.
  65. 65. Sullivan GM, Feinn R. Using Effect Size-or Why the P Value Is Not Enough. J Grad Med Educ. 2012;4(3):279–82. pmid:23997866
  66. 66. Fritz CO, Morris PE, Richler JJ. Effect size estimates: current use, calculations, and interpretation. J Exp Psychol Gen. 2012;141(1):2–18. pmid:21823805
  67. 67. Zhu Y, Guo W. Family-Wise Error Rate Controlling Procedures for Discrete Data. Statistics in Biopharmaceutical Research. 2019;12(1):117–28.
  68. 68. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC. Package ‘pROC’. 2021.
  69. 69. Kassambara A. Comparing groups: Numerical variables. Sydney, Australia: Datanovia. 2019.
  70. 70. Molnar C, Schratz P. Package ‘iml’. R CRAN. 2020.
  71. 71. Roth AE. The Shapley value: essays in honor of Lloyd S. Shapley. Cambridge University Press. 1988.
  72. 72. Lunardon N, Menardi G, Torelli N. ROSE: a package for binary imbalanced learning. R journal. 2014;6(1).
  73. 73. Zali A, Gholamzadeh S, Mohammadi G, Azizmohammad Looha M, Akrami F, Zarean E. Baseline characteristics and associated factors of mortality in COVID-19 patients; an analysis of 16000 cases in Tehran, Iran. Arch Acad Emerg Med. 2020;8(1):e70. pmid:33134966
  74. 74. Ogundimu EO, Altman DG, Collins GS. Adequate sample size for developing prediction models is not simply related to events per variable. J Clin Epidemiol. 2016;76:175–82. pmid:26964707
  75. 75. Chamseddine E, Mansouri N, Soui M, Abed M. Handling class imbalance in COVID-19 chest X-ray images classification: Using SMOTE and weighted loss. Appl Soft Comput. 2022;129:109588. pmid:36061418
  76. 76. Javidi M, Abbaasi S, Naybandi Atashi S, Jampour M. COVID-19 early detection for imbalanced or low number of data using a regularized cost-sensitive CapsNet. Sci Rep. 2021;11(1):18478. pmid:34531477
  77. 77. de Jong VMT, Rousset RZ, Antonio-Villa NE, Buenen AG, Van Calster B, Bello-Chavolla OY, et al. Clinical prediction models for mortality in patients with covid-19: external validation and individual participant data meta-analysis. BMJ. 2022;378:e069881. pmid:35820692
  78. 78. Van Calster B, Steyerberg EW, Wynants L, van Smeden M. There is no such thing as a validated prediction model. BMC Med. 2023;21(1):70. pmid:36829188
  79. 79. Ribeiro MHDM, da Silva RG, Mariani VC, Coelho LDS. Short-term forecasting COVID-19 cumulative confirmed cases: Perspectives for Brazil. Chaos Solitons Fractals. 2020;135:109853. pmid:32501370
  80. 80. Hussain S, Songhua X, Aslam MU, Hussain F. Clinical predictions of COVID-19 patients using deep stacking neural networks. J Investig Med. 2024;72(1):112–27. pmid:37712431
  81. 81. Taylor RA, Pare JR, Venkatesh AK, Mowafi H, Melnick ER, Fleischman W, et al. Prediction of In-hospital Mortality in Emergency Department Patients With Sepsis: A Local Big Data-Driven, Machine Learning Approach. Acad Emerg Med. 2016;23(3):269–78. pmid:26679719
  82. 82. de Paiva BBM, Pereira PD, de Andrade CMV, Gomes VMR, Souza-Silva MVR, Martins KPMP, et al. Potential and limitations of machine meta-learning (ensemble) methods for predicting COVID-19 mortality in a large inhospital Brazilian dataset. Sci Rep. 2023;13(1):3463. pmid:36859446
  83. 83. An N, Ding H, Yang J, Au R, Ang TFA. Deep ensemble learning for Alzheimer’s disease classification. J Biomed Inform. 2020;105:103411. pmid:32234546
  84. 84. Gupta A, Jain V, Singh A. Stacking Ensemble-Based Intelligent Machine Learning Model for Predicting Post-COVID-19 Complications. New Gener Comput. 2022;40(4):987–1007. pmid:34924675
  85. 85. Kablan R, Miller HA, Suliman S, Frieboes HB. Evaluation of stacked ensemble model performance to predict clinical outcomes: A COVID-19 study. Int J Med Inform. 2023;175:105090. pmid:37172507
  86. 86. Liu Y, Du X, Chen J, Jin Y, Peng L, Wang HHX, et al. Neutrophil-to-lymphocyte ratio as an independent risk factor for mortality in hospitalized patients with COVID-19. J Infect. 2020;81(1):e6–12. pmid:32283162
  87. 87. Wu C, Chen X, Cai Y, Xia J, Zhou X, Xu S, et al. Risk Factors Associated With Acute Respiratory Distress Syndrome and Death in Patients With Coronavirus Disease 2019 Pneumonia in Wuhan, China. JAMA Intern Med. 2020;180(7):934–43. pmid:32167524
  88. 88. Xu W, Sun N-N, Gao H-N, Chen Z-Y, Yang Y, Ju B, et al. Risk factors analysis of COVID-19 patients with ARDS and prediction based on machine learning. Sci Rep. 2021;11(1):2933. pmid:33536460
  89. 89. Sedgwick P. Retrospective cohort studies: advantages and disadvantages. BMJ. 2014;348(jan24 1):g1072–g1072.
  90. 90. Agniel D, Kohane IS, Weber GM. Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. BMJ. 2018;361:k1479. pmid:29712648
  91. 91. Perets O, Stagno E, Yehuda EB, McNichol M, Anthony Celi L, Rappoport N, et al. Inherent Bias in Electronic Health Records: A Scoping Review of Sources of Bias. medRxiv. 2024;:2024.04.09.24305594. pmid:38680842
  92. 92. Degtiar I, Rose S. A Review of Generalizability and Transportability. Annu Rev Stat Appl. 2023;10(1):501–24.
  93. 93. Collins GS, Dhiman P, Ma J, Schlussel MM, Archer L, Van Calster B, et al. Evaluation of clinical prediction models (part 1): from development to external validation. BMJ. 2024;384:e074819. pmid:38191193
  94. 94. Collins GS, Reitsma JB, Altman DG, Moons KGM, TRIPOD Group. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. The TRIPOD Group. Circulation. 2015;131(2):211–9. pmid:25561516