Abstract
Background
Machine learning (ML) algorithms are increasingly used in healthcare to support clinical decision-making. While models with similar overall performance are often considered interchangeable for deployment, they may produce divergent predictions, a phenomenon known as algorithmic multiplicity. In such cases, the choice of algorithm may introduce bias. This study investigates the impacts of algorithmic multiplicity in mortality prediction and assesses the influence of patient characteristics on model decisions.
Methods
A cohort of 4,377 adult patients (≥18 years) with RT-PCR–confirmed covid-19 from five tertiary care hospitals in Brazil was followed from March to August 2020. Five popular ML models for structured data were trained on demographic and laboratory data collected at early hospital admission to predict in-hospital mortality. Model performance, feature importance, and algorithmic prediction similarity were evaluated. Feature distributions were compared between patients correctly or incorrectly classified by all models using two-sample t-tests or Mann–Whitney U tests, as applicable, at the 5% significance level. Subgroup performance differences were assessed using 10-fold cross-validation applied to five k-means–delineated clusters, compared by one-way ANOVA. Within-cluster predictive divergence was assessed using 95% confidence intervals.
Results
All models achieved high overall predictive performance (µ = 0.855, σ² = 0.0072). However, the comparison of individual-level predictions revealed substantial heterogeneity, with pairwise prediction correlations ranging from R² = 0.56 to 0.80. Unsupervised k-means clustering identified five clinically distinct patient subgroups with mortality rates ranging from 22% to 80%, within which model performance varied significantly (F = 73.18, p < 0.001). Notably, TabPFN and LightGBM showed superior performance in the “Anemia” cluster, whereas TabPFN underperformed in the “Immunodeficient” cluster (non-overlapping 95% confidence intervals).
Conclusions
This study demonstrates that ML models with similar overall performance can yield substantially divergent predictions at both the individual and subgroup levels, and that no single algorithm consistently outperforms others across all patient subgroups. These findings highlight the limitations of relying solely on global performance metrics and underscore the need for context-aware evaluation of ML models in heterogeneous clinical populations.
Citation: Magalhães JCN, Chiavegatto Filho ADP (2026) Predictive divergence in machine learning models for clinical mortality risk: A multicohort study of covid-19 patients. PLoS One 21(3): e0344354. https://doi.org/10.1371/journal.pone.0344354
Editor: Marcela Pagano, Universidade Federal de Minas Gerais, BRAZIL
Received: September 21, 2025; Accepted: February 19, 2026; Published: March 6, 2026
Copyright: © 2026 Magalhães, Chiavegatto Filho. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data used in this study were obtained from five distinct hospitals and are not publicly available due to restrictions imposed by the Brazilian General Data Protection Law (Lei Geral de Proteção de Dados – LGPD), which safeguards individual privacy even in de-identified datasets. Access to the data may be granted upon reasonable request to the Laboratory of Big Data and Predictive Analysis in Healthcare at the School of Public Health, University of São Paulo (labdaps@usp.br). Reasonable requests must be for legitimate research purposes and must include a clear plan for maintaining data confidentiality and complying with ethical standards. Analytical code can be found on https://github.com/labdaps/iacovbr_predictive_divergence.
Funding: Funding for this research was provided by the National Council for Scientific and Technological Development, under grant No. 444610/2024-3.
Competing interests: The authors have declared that no competing interests exist.
Introduction
In recent years, rapid advances in artificial intelligence (AI), especially in its machine learning (ML) subfield, have driven the increasing adoption of these technologies in medicine [1], particularly in the development of computer vision systems for medical imaging analysis [2]. In parallel, there has been growing interest in models based on structured variables, which leverage routinely collected demographic and laboratory data to provide cost-effective and highly scalable solutions in hospital settings [3]. In the context of mortality prediction, such models may support clinical decision-making by enabling early risk stratification, identifying patients who may require closer monitoring or earlier escalation of care, and informing the prioritization of clinical resources in high-risk settings. The scale of this adoption is evident in the Artificial Intelligence Index Report 2025, which documents a fivefold increase in published clinical trials evaluating AI-based systems for clinical decision support between 2019 and 2024 [4].
These models are no longer confined to research settings and are increasingly being deployed in real-world clinical practice. In a 2024 survey of 67 health systems across the United States, 90% of respondents reported deploying AI tools for imaging and radiology in at least limited clinical settings, while 67% had implemented models for early sepsis detection and 52% had deployed models to predict hospital readmission risk [5]. However, this widespread use of ML in healthcare has also raised significant fairness and ethical concerns, particularly given the high-stakes nature of clinical decision-making [5,6].
As algorithmic performance improves, selecting an appropriate model becomes more challenging. Accuracy alone is often insufficient, as multiple models may achieve statistically indistinguishable performance. This situation reflects an established yet still underexplored phenomenon: algorithmic multiplicity. Originally introduced by Breiman (2001), algorithmic multiplicity describes scenarios in which multiple models with comparable predictive performance yield different predictions for the same instances [7,8]. Formally, let h₀ denote a base classifier and ε > 0 a tolerance level. The set of competing classifiers is defined as R(h₀, ε) = {h : err(h) ≤ err(h₀) + ε}, where err(h₀) denotes the error rate of the base classifier. A prediction is said to exhibit algorithmic multiplicity if there exists a model h ∈ R(h₀, ε) such that h(x) ≠ h₀(x) for at least one training instance x [8].
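For concreteness, the definition of algorithmic multiplicity can be sketched in a few lines of code. The helper names `competing_set` and `multiplicity_rate`, the tolerance value, and the 0/1 prediction format are illustrative assumptions, not part of the original formulation:

```python
import numpy as np

def competing_set(preds, y, base_idx=0, eps=0.02):
    """Indices of models whose error rate is within eps of the base model's.

    preds: list of 0/1 prediction arrays, one per model; y: true labels.
    """
    errs = [np.mean(p != y) for p in preds]
    return [i for i, e in enumerate(errs) if e <= errs[base_idx] + eps]

def multiplicity_rate(preds, base_idx=0):
    """Fraction of instances where at least one model disagrees with the base."""
    base = preds[base_idx]
    disagree = np.zeros(len(base), dtype=bool)
    for i, p in enumerate(preds):
        if i != base_idx:
            disagree |= (p != base)
    return float(disagree.mean())
```

Two models can thus both sit inside the competing set while still disagreeing on a non-trivial fraction of individual patients, which is exactly the situation examined in this study.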
This definition indicates that algorithmic multiplicity is not merely a theoretical artifact but has practical consequences. The indeterminacy resulting from multiplicity can lead to arbitrary and unfair decisions, since given equivalent performance, the very choice of an algorithm may introduce bias [6,8,9]. This is particularly concerning in large-scale applications, as such arbitrariness might result in severe outcomes for individuals or groups systematically disadvantaged by the chosen model [10]. On the other hand, understanding this multiplicity paves the way for selecting models based on criteria such as interpretability and fairness, without significantly compromising accuracy, thus allowing for greater flexibility in model choice [6].
In this context, the present study investigates the effects of algorithmic multiplicity using covid-19 mortality prediction as a clinically relevant and data-rich case study. Data from Brazilian hospitals provides a particularly suitable context due to the country’s large geographic extent and demographic diversity, yielding a heterogeneous patient population. Five state-of-the-art ML algorithms reported in the literature were evaluated [2,11], each trained with samples of varying sizes. The objectives were to compare predictive performance, identify consistent error patterns, and assess the influence of sample characteristics on predictive divergence.
Although the empirical analysis focuses on covid-19 mortality, the methodological insights regarding predictive variability, subgroup-specific errors, and model selection under algorithmic multiplicity apply to a broad range of clinical prediction tasks. By characterizing algorithmic multiplicity in a large-scale clinical dataset, this work aims to contribute to a deeper understanding of predictive divergence in ML systems and may inform the development of more robust, equitable, and transparent clinical decision support tools.
Methods
Data source and study design
A multicohort retrospective study, aiming to investigate the effects of algorithmic multiplicity on mortality prediction for covid-19 in hospitals across Brazil, was conducted with 4,377 adult patients (≥18 years) with RT-PCR–confirmed covid-19, who were followed between March and August 2020. Data were obtained from the IACOV-BR database, which integrates records from five hospitals, with cohort sizes ranging from 247 to 1,776 patients. Only anonymized data, originally collected for clinical care and limited to patients who had already been discharged, were included. These data were accessed for research purposes on June 6, 2024. The study received ethical approval from the University of São Paulo’s Institutional Review Board (IRB), in accordance with the resolutions of the National Health Council (CNS) and National Research Ethics Committee (CONEP), under reference number 32872920.4.1001.5421, which included a waiver of consent. This approval also covered the utilization of data and collaboration with all hospitals participating in the IACOV-BR network.
Data preprocessing
The dataset was split using random sampling stratified by the outcome, with 70% allocated for training and 30% for testing. Categorical variables with two or more categories were one-hot encoded into sets of dummy variables. Continuous variables were normalized by z-score transformation, using the mean and standard deviation calculated from the training set. For pairs of variables with correlation above 90%, one variable of each pair was eliminated, as were variables with more than 90% missing data. The remaining missing values were handled by median imputation.
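A minimal sketch of this preprocessing pipeline, assuming a pandas DataFrame with the outcome in a `death` column (the function name, column names, and ordering of steps are illustrative assumptions, not the study's exact code):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def preprocess(df, outcome="death", corr_thresh=0.9, miss_thresh=0.9):
    # 70/30 split, stratified by the outcome
    train, test = train_test_split(df, test_size=0.3, stratify=df[outcome],
                                   random_state=0)
    # drop variables with more than 90% missing data (training-set statistics)
    keep = train.columns[train.isna().mean() <= miss_thresh]
    train, test = train[keep].copy(), test[keep].copy()
    # one-hot encode categorical variables
    train = pd.get_dummies(train, dtype=float)
    test = pd.get_dummies(test, dtype=float).reindex(columns=train.columns,
                                                     fill_value=0.0)
    # for each highly correlated pair, drop the second variable
    corr = train.drop(columns=[outcome]).corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [c for c in upper.columns if (upper[c] > corr_thresh).any()]
    train, test = train.drop(columns=drop), test.drop(columns=drop)
    # median imputation and z-score scaling with training-set statistics
    feats = [c for c in train.columns if c != outcome]
    med, mu, sd = train[feats].median(), train[feats].mean(), train[feats].std()
    for part in (train, test):
        part[feats] = (part[feats].fillna(med) - mu) / sd.replace(0, 1.0)
    return train, test
```

Note that all statistics (missingness, correlations, medians, means, standard deviations) are computed on the training set only, so no information leaks from the test set.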
Given the imbalanced nature of the dataset, class balancing was performed using the Random Oversampling technique, applied to the training set. SMOTE and Borderline SMOTE were also tested but were discarded as they did not improve predictive performance. Hyperparameters were selected through Bayesian optimization using HyperOpt, coupled with 10-fold cross-validation.
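Random oversampling itself is simple to sketch: minority-class rows are duplicated at random until classes are balanced, mirroring what imbalanced-learn's `RandomOverSampler` does. The function name below is illustrative:

```python
import numpy as np

def random_oversample(X, y, random_state=0):
    """Duplicate rows of each minority class at random until all classes
    match the majority-class count (apply to the training set only)."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c in classes:
        c_idx = np.flatnonzero(y == c)
        extra = rng.choice(c_idx, size=n_max - len(c_idx), replace=True)
        idx.extend(np.concatenate([c_idx, extra]))
    idx = np.array(idx)
    return X[idx], y[idx]
```

Because rows are only duplicated, never synthesized, no artificial feature combinations are introduced; this may explain why it performed on par with SMOTE-style methods here.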
Model development and evaluation
Five popular models for structured data, XGBoost, LightGBM, Catboost, Random Forest, and TabPFN, were trained on demographic and laboratory data collected during early hospital admission (within ±24 h from RT-PCR exam) to predict in-hospital mortality. A total of 22 predictors were included: age, sex, heart rate, respiratory rate, systolic pressure, diastolic pressure, mean arterial pressure, temperature, hemoglobin, platelets, hematocrit, erythrocytes, mean corpuscular hemoglobin (MCH), red cell distribution width (RDW), mean corpuscular volume (MCV), leukocytes, neutrophils, lymphocytes, basophils, eosinophils, monocytes, and C-reactive protein (CRP). These features were selected based on their routine availability during hospital admissions, including in low-resource settings, and their demonstrated predictive performance [12].
The predictive performance was measured through the area under the receiver operating characteristic curve (AUC). Additionally, accuracy, precision, recall, and F1-score were calculated. Feature importance was assessed using the Shapley Additive Explanations (SHAP) method for XGBoost, Catboost, LightGBM, and Random Forest. TabPFN interpretability was constrained due to tool compatibility, necessitating the use of proxy interpretability via Shapley Interaction Quantification (SHAP-IQ).
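The threshold-based metrics above can be computed directly with scikit-learn. In this sketch the 0.5 decision threshold is an assumption, as the study does not report the operating point:

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, accuracy_score,
                             precision_score, recall_score, f1_score)

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute AUC plus threshold-based metrics from predicted probabilities."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "auc": roc_auc_score(y_true, y_prob),          # threshold-free
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }
```

AUC is computed from the raw probabilities and is therefore invariant to the choice of threshold, while the remaining metrics depend on it.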
Cross-model prediction similarity
The similarity among the algorithms was first evaluated through paired distribution plots of the probabilistic predictions, on which the R² statistic was calculated. Additionally, histograms of the probability distributions for each model were analyzed to gain insights into the classification behavior of each algorithm.
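Because the R² of a simple least-squares fit between two variables equals their squared Pearson correlation, the pairwise similarity statistic can be sketched as follows (the function name is hypothetical):

```python
import numpy as np

def pairwise_r2(preds):
    """R-squared of a linear fit between each pair of models' predicted
    probabilities; preds maps model name -> array of probabilities."""
    names = list(preds)
    out = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            x, y = np.asarray(preds[a]), np.asarray(preds[b])
            r = np.corrcoef(x, y)[0, 1]   # Pearson correlation
            out[(a, b)] = r ** 2
    return out
```

An R² near 1 means two models rank and scale patients almost identically; values such as the 0.56 reported here indicate substantial individual-level disagreement despite similar aggregate performance.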
Misclassification patterns
To investigate misclassification patterns associated with sample characteristics, two patient groups were defined: patients unanimously correctly or incorrectly classified by all five models. The distribution of predictor variable values was compared between these groups. Variables with homoscedasticity confirmed by Levene’s test were analyzed using a two-sample t-test, while those with unequal variance between groups were compared using the non-parametric Mann–Whitney U test. Variables for which the null hypothesis was rejected at the 5% significance level were considered statistically significant. The findings were then compared with the set of variables deemed important for model decisions, as calculated using Shapley values.
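The variance-dependent choice of test can be sketched with SciPy; the wrapper name `compare_groups` is illustrative:

```python
import numpy as np
from scipy import stats

def compare_groups(a, b, alpha=0.05):
    """Use Levene's test to choose between a two-sample t-test (equal
    variances) and a Mann-Whitney U test (unequal variances).

    Returns (test_name, p_value, significant_at_alpha)."""
    _, p_levene = stats.levene(a, b)
    if p_levene >= alpha:                 # homoscedasticity not rejected
        _, p = stats.ttest_ind(a, b)
        name = "t-test"
    else:
        _, p = stats.mannwhitneyu(a, b)
        name = "mann-whitney"
    return name, p, p < alpha
```

In practice this routine would be applied once per predictor, comparing the unanimously correct and unanimously incorrect patient groups.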
Subgroup predictive performance
To assess algorithm performance across patient subgroups, the training set was partitioned into five clusters using the unsupervised k-means algorithm. The number of clusters was selected with the elbow method, applied to the curve of the within-cluster sum of squares (WCSS). These clusters were characterized using the ten most important predictors on average across all algorithms, as determined by SHAP, and visualized through radar charts. The clustered data were used to re-train all five models using 10-fold cross-validation. The resulting AUC values were compared across subgroups by a one-way ANOVA test. Within-group predictive divergence was analyzed using 95% confidence intervals.
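A compact sketch of this subgroup analysis, using scikit-learn's KMeans inertia as the WCSS and a single representative classifier (Random Forest) for brevity; the function names, the classifier choice, and the hyperparameters are illustrative assumptions:

```python
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def wcss_curve(X, k_max=8, random_state=0):
    """Within-cluster sum of squares (inertia) for k = 1..k_max,
    for elbow-method inspection."""
    return [KMeans(n_clusters=k, n_init=10, random_state=random_state)
            .fit(X).inertia_ for k in range(1, k_max + 1)]

def cluster_auc(X, y, n_clusters=5, cv=10, random_state=0):
    """Cross-validated AUC within each k-means cluster, compared by
    a one-way ANOVA across clusters."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X)
    scores = []
    for c in range(n_clusters):
        mask = labels == c
        clf = RandomForestClassifier(n_estimators=30, random_state=random_state)
        scores.append(cross_val_score(clf, X[mask], y[mask],
                                      cv=cv, scoring="roc_auc"))
    f_stat, p = stats.f_oneway(*scores)
    return scores, f_stat, p
```

In the study this retraining step was repeated for each of the five algorithms; a small cluster also means fewer samples per cross-validation fold, so the confidence intervals widen accordingly.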
All analyses were conducted using Python 3, with ML models implemented through Scikit-learn.
Results
Demographic characteristics of the sample
As presented in Table 1, the analyzed sample consisted of 4,377 adult patients with RT-PCR–confirmed covid-19 admitted across five different hospitals. The list of hospitals and their respective locations is presented in S1 Table. The mean age of participants was 56.1 years (SD = 17.2), and 55.1% were male. Ethnicity was reported by only 15.5% of participants, with 11.5% identifying as White and 3.9% as Black, Mixed, or Asian. Patients who died during the study period were, on average, older (65 years vs 52 years for survivors) and more likely to be male (60.7% vs 52.0% among survivors).
Model performance and prediction agreement
To assess baseline predictive performance and the impact of site-specific heterogeneity, we compared algorithmic performance across hospitals using both locally trained and aggregated models.
Table 2 shows the AUC results by algorithm with local training in each hospital, as well as for training with aggregated data from all hospitals. The list of hyperparameters used for the aggregated data training, selected via Bayesian Optimization using HyperOpt coupled with 10-fold cross-validation, can be seen in S2 Table.
Regarding the site-specific results, in absolute terms, TabPFN achieved the best performance, with the highest AUC values in four of the five cases (80%). Sample size did not show a clear association with prediction quality, whereas higher mortality rates were associated with lower AUC values on average.
As for aggregate performance metrics, they were largely comparable across algorithms both in terms of AUC (Table 2) and accuracy, precision, recall, and F1-score (S3 Table). However, as similar performance metrics do not necessarily imply similar prediction behavior, we proceeded to examine the concordance and distribution of probabilistic predictions across models.
Fig 1 illustrates the similarity between models. The diagonal panels display histograms showing the individual distributions of predictions. It can be noted that Random Forest tends to generate more intermediate values compared to the other models, which show a greater propensity for extreme values. This characteristic is especially evident in LightGBM, which demonstrates a strong prevalence of predictions close to zero or one.
Matrix of pairwise scatter plots showing the correlation between predicted probabilities of in-hospital mortality generated by the different machine-learning models (XGBoost, Random Forest, LightGBM, Catboost, and TabPFN). Each point represents an individual patient, colored by hospital. Solid lines indicate fitted linear regression models, with the corresponding coefficient of determination (R²) reported in each panel. Diagonal panels display histograms of predicted probability distributions for each model.
The lower panels contain scatter plots that illustrate the linear relationship between pairs of models, with a regression line in blue and an annotation of the R² statistic for each model pair. This analysis reveals that the highest correlation occurs between XGBoost and Catboost, while the lowest is observed between Random Forest and LightGBM. A notable discrepancy is observed in the frequency of probabilities close to zero for patients from Hospital 2, which can be attributed to its lower mortality rate (6.9%) (Table 2). Greater distribution dispersion is evident for pairs of boosting-based models (XGBoost, LightGBM, and Catboost), while comparisons involving XGBoost and Random Forest, as well as Random Forest and TabPFN, show a more concentrated distribution. This is surprising, as it does not align with architectural similarities. These results highlight important differences in the distributions and prediction patterns among the algorithms, revealing significant predictive divergence even when the variance in predictive performance is minimal.
Error pattern analysis
Given the observed divergence in probabilistic predictions despite comparable performance metrics, we next investigated whether misclassifications were associated with specific patient characteristics.
Fig 2 presents the patient profiles stratified by error group, based on the vector of group means derived from variables normalized via z-score transformation. The expected value for this vector corresponds to the null vector (represented by the grey line), a consequence of the transformation’s definition as z = (x – μ)/σ. Variables whose values lie close to this line are evenly distributed across groups and, thus, do not substantially contribute to the differentiation among error profiles. In contrast, variables that deviate from this pattern may be associated with increased susceptibility to errors in each algorithm.
Profile plot showing the mean standardized value of each variable among patients misclassified by each machine-learning model. A reference group of correctly classified patients (“No errors”) is included for comparison. All variables were normalized using z-score transformation; values above and below zero indicate higher or lower mean levels relative to the overall cohort.
It can be noted that, although there are variations in the error profiles across different models, they tend to follow a similar pattern, which contrasts with the profile observed in patients for whom no model produced errors. These observations suggest that certain population-level characteristics may be associated with a greater propensity for algorithm misclassification. Even so, certain variables exhibit different levels of influence across algorithms. For instance, respiratory rate appears to have a greater impact on the predictions generated by Catboost, whereas platelet count contributes less significantly to the predictions made by LightGBM, when compared to the other models.
When analyzing model training using the combined dataset from all hospitals, it was observed that 66.5% of test set patients were not misclassified by any algorithm, while 11.5% were misclassified by all of them. A comparison of predictor value distributions between these two groups, using a two-sample t-test for normally distributed homoscedastic variables, revealed that the null hypothesis was rejected at the 5% significance level for age, mean corpuscular volume (MCV), and lymphocyte count. For variables with unequal variance between groups, comparisons were conducted using the non-parametric Mann–Whitney U test; at the 5% significance level, the null hypothesis was rejected for neutrophils, red blood cell count, respiratory rate, platelet count, and C-reactive protein (CRP). These findings are consistent with the variable importance profiles derived from Shapley values (Fig 3).
Boxplots showing feature importance estimated using Shapley Additive Explanations (SHAP) for each machine-learning model. Each boxplot represents the distribution of SHAP values across models for a given variable. The ten most influential predictors overall are displayed, ranked by their aggregated importance.
Together, these findings suggest that model errors are not randomly distributed across the population but are instead associated with specific clinical and laboratory profiles, motivating further stratification of the cohort.
Cluster-specific insights
To further characterize population heterogeneity and assess its impact on predictive performance, we applied unsupervised clustering to identify clinically distinct patient subgroups.
The training set was split into five clusters by k-means. This number was chosen with the elbow method, applied to the curve of the within-cluster sum of squares (WCSS). Fig 4 illustrates these clusters’ profiles in radar charts created using the ten most important predictors according to the Shapley values analysis.
Radar charts depicting the mean standardized values of the ten most influential predictors (as identified by SHAP analysis) within each patient cluster. Standardization was performed using z-score normalization based on the overall cohort. Each chart summarizes the characteristic clinical profile of the corresponding cluster.
Analysis of the radar plots and group-wise mean values of the normalized variables indicates that the first cluster, labeled “Anemia”, comprising 529 patients, is characterized by individuals with higher average age (+0.23) and near-average respiratory rate (−0.09), C-reactive protein (+0.05), and leukocyte count (−0.12). This cluster also presents a slight reduction in neutrophils (−0.24), a moderate decrease in platelet count (−0.43), and a pronounced deficiency in red blood cells (−1.61), accompanied by a substantial increase in red cell distribution width (RDW) (+1.10). This profile is indicative of a group of anemic patients with marked anisocytosis (elevated RDW). The observed mortality rate within this cluster was 79.58%.
“Young” represents the group with the lowest mean age (−0.69), with slightly reduced respiratory rate (−0.21), and near-average values for leukocytes (−0.06), neutrophils (−0.13), C-reactive protein (−0.30), and RDW (−0.26). This group exhibits elevated lymphocyte (+0.39) and red blood cell (+0.63) counts. Comprising a total of 1,310 patients, this cluster appears to represent a relatively healthy population with a robust immune response and low levels of systemic inflammation. The observed mortality rate for this group was 23.20%.
Consisting of 1,286 patients, the third cluster is distinguished by an older age profile (+0.42), elevated leukocyte (+0.39), neutrophil (+0.69), and platelet counts (+0.75), as well as moderately increased lymphocytes (+0.36). C-reactive protein (−0.02) and red blood cell count (−0.20) are near the average. This profile is indicative of pronounced immune activation, which could reflect either a more widespread infectious process or a well-functioning immune response. The mortality rate in this group was 53.49%.
The “Recovery” cluster is the smallest, including 295 patients with slightly increased age (+0.26), reduced respiratory rate (−0.18), and normal leukocyte (−0.06) and neutrophil (+0.00) counts. It is characterized by elevated lymphocytes (+0.38) and monocytes (+0.54), along with markedly reduced C-reactive protein (−0.53). This group exhibited the lowest mortality rate among all clusters, at 22.37%.
The fifth cluster comprises 964 patients with moderately elevated age (+0.17) and substantially increased respiratory rate (+0.53), alongside reduced leukocyte (−0.35), neutrophil (−0.60), lymphocyte (−0.92), and monocyte (−0.65) counts. These patients also show elevated C-reactive protein (+0.57) and slightly increased red blood cell count (+0.25). This profile suggests a systemic inflammatory state with respiratory compromise, potentially indicative of sepsis or respiratory failure. The associated mortality rate was 73.92%. These findings are summarized in Table 3.
The five algorithms were retrained on data specific to each patient cluster using 10-fold cross-validation, and their predictive performance was assessed based on the mean area under the ROC curve (Fig 5). An analysis of variance (ANOVA) revealed significant heterogeneity in performance across clusters (F = 73.18, p < 0.001). Notably, the best performance occurred in the “Recovery” cluster, which had the smallest sample size (295) and a low mortality rate (22.37%). Meanwhile, in the “Immune Activation” cluster (n = 1,286, mortality rate = 53.49%), all algorithms underperformed.
Bar plots showing the predictive performance of each machine-learning algorithm within individual patient clusters, evaluated using 10-fold cross-validation. Bars indicate mean performance across bootstrap samples, and error bars represent 95% confidence intervals.
Based on 95% confidence intervals, all algorithms exhibited statistically equivalent performance in the “Young,” “Immune Activation,” and “Recovery” clusters; of these, “Young” and “Recovery” also had the lowest mortality rates. In contrast, in the “Anemia” cluster, TabPFN and LightGBM demonstrated significantly superior performance. Notably, in the “Immunodeficient” cluster, TabPFN exhibited a mean AUC of 0.75 (95% CI: 0.72–0.78), markedly lower than the other algorithms, which achieved mean AUCs near 0.85. These results underscore the variation in algorithmic performance across clinically distinct subpopulations. They highlight the necessity of incorporating population-specific characteristics into model selection for clinical prediction tasks and suggest that future studies should explicitly evaluate subgroup-specific performance rather than relying solely on aggregate metrics.
Discussion
This study demonstrates that all five evaluated algorithms achieved consistently high performance across diverse clinical settings, with TabPFN emerging as the top-performing and most stable model in most scenarios. Notably, TabPFN required no hyperparameter tuning and exhibited the lowest performance variance across hospitals. These results corroborate those of Hollmann et al. (2025), who found TabPFN to outperform several benchmark models, including Random Forest, XGBoost, Catboost, and LightGBM, on datasets with up to 10,000 samples [11].
Our findings on predictive performance are consistent with those reported by Savalli and Wichmann using the same database [12,13]. Additionally, the variable importance profiles align with those identified by Smith and Alvarez (2021), who highlighted age, lymphocyte count, and neutrophil count as key predictors of covid-19 mortality [14]. The recurrence of these variables across different analytical frameworks supports the robustness of our approach. Importantly, the observation of substantial predictive divergence among models with similar global performance metrics contributes new empirical evidence to the discourse on algorithmic multiplicity [6–8,10,15]. This reinforces the notion that even models with similar architectures can produce markedly different individual-level predictions, as evidenced by low R² values.
A major strength of this study lies in its multifaceted evaluation framework, which integrates algorithm performance across multiple hospitals, individual prediction consistency, algorithmic similarity analysis, variable importance via SHAP values, and subgroup-based performance comparison. A key novelty is the extension of algorithmic multiplicity analyses to a country as large and socio-demographically diverse as Brazil. This broader context enables models to capture complex, context-specific interactions and provides a more realistic assessment of model behavior under heterogeneous clinical conditions.
Nevertheless, several limitations must be considered. First, the analysis was restricted to a single national dataset collected during the early phase of the covid-19 pandemic, which may limit generalizability to subsequent viral variants or international populations. Second, although preprocessing and training procedures were standardized to ensure fair comparison, the interpretability of some models—most notably TabPFN—remains limited due to incompatibility with widely used explainability tools. Third, while the unsupervised clustering approach proved useful for delineating patient subgroups, it may be sensitive to initialization and hyperparameter choices, warranting caution when extrapolating clinical interpretations from these clusters.
An additional limitation is the restricted scope of evaluated models. The analysis was limited to five ML algorithms and therefore does not capture the full spectrum of available approaches. As a result, conclusions regarding algorithmic multiplicity may be primarily applicable to high-capacity, non-linear predictors. Future research should systematically expand the model space to examine how architectural assumptions, learning objectives, and inductive biases contribute to predictive multiplicity. Such comparative efforts would support the development of more general guidelines for model selection, ensemble construction, and clinical deployment in heterogeneous healthcare settings.
The error mechanisms identified in this work highlight several avenues for developing more equitable ML models. Future research should prioritize strategies to mitigate biases arising from predictive multiplicity, including dual-prioritization bias correction through subgroup-specific modeling [16] and ensemble methods guided by multi-objective optimization to jointly balance accuracy and fairness [17]. Meta-learning approaches, which leverage prior knowledge to accelerate adaptation to new tasks, also hold promise for deployment in dynamic and heterogeneous clinical environments [18]. Moreover, advances in interpretability—particularly for complex architectures such as TabPFN—and validation across multiple institutions and broader patient populations will be critical for improving model transparency and generalizability. Ultimately, adaptive ensemble systems that explicitly incorporate fairness and stability metrics may offer a principled pathway toward more equitable and personalized clinical decision-support tools.
Conclusions
Using a large, multicentric cohort of hospitalized covid-19 patients in Brazil, this study provides empirical evidence that ML models with comparable overall performance can produce substantially divergent predictions at both the individual and subgroup levels when applied to mortality prediction.
Despite achieving similarly high aggregate AUC values, the evaluated algorithms exhibited heterogeneous probabilistic outputs, distinct classification patterns, and marked variability in performance across clinically meaningful patient subgroups. These findings indicate that reliance on global performance metrics alone may obscure clinically relevant differences in model behavior and may inadvertently amplify inequities when models are deployed in heterogeneous patient populations.
Although focused on covid-19 mortality, the methodological insights presented here are broadly applicable to other clinical prediction tasks and healthcare contexts. As ML continues to be integrated into clinical decision-making, systematically characterizing and addressing algorithmic multiplicity may support the development of more transparent, reliable, and equitable predictive systems.
Supporting information
S1 Table. Characteristics of participating hospitals included in the study.
https://doi.org/10.1371/journal.pone.0344354.s001
(DOCX)
S2 Table. Model hyperparameters.
Hyperparameters for each model, selected using Bayesian optimization on the aggregated dataset.
https://doi.org/10.1371/journal.pone.0344354.s002
(DOCX)
S3 Table. Performance metrics by model.
Detailed performance metrics for each model evaluated on the aggregated dataset.
https://doi.org/10.1371/journal.pone.0344354.s003
(DOCX)
Acknowledgments
We would like to thank the IACOV-BR Network, in alphabetic order: Ana Claudia Martins Ciconelle (Institute of Mathematics and Statistics, University of São Paulo); Ana Maria Espírito Santo de Brito (Instituto de Medicina, Estudos e Desenvolvimento—IMED, São Paulo, São Paulo); Bruno Pereira Nunes (Universidade Federal de Pelotas—UFPel); Dárcia Lima e Silva (Hospital Santa Lúcia); Fernando Anschau (Setor de Pesquisa da Gerência de Ensino e Pesquisa do Grupo Hospitalar Conceição, RS – Brasil; Programa de Pós-Graduação em Neurociências da Universidade Federal do Rio Grande do Sul); Henrique de Castro Rodrigues (Serviço de Epidemiologia e Avaliação/Direção Geral do HUCFF/UFRJ); Hermano Alexandre Lima Rocha (Unimed Fortaleza. Fortaleza, Ceará, Brasil; Departamento de Saúde Comunitária. Universidade Federal do Ceará. Fortaleza, Ceará, Brasil); João Conrado Bueno dos Reis (Hospital São Francisco); Liane de Oliveira Cavalcante (Hospital Santa Julia de Manaus); Liszt Palmeira de Oliveira (Instituto Unimed-Rio; Universidade do Estado do Rio de Janeiro); Lorena Sofia dos Santos Andrade (Universidade de Pernambuco—UPE/UEPB); Luiz Antonio Nasi (Hospital Moinhos de Vento); Marcelo de Maria Felix (InRad—Institute of Radiology, School of Medicine, University of São Paulo); Marcelo Jenne Mimica (Departamento de Ciências Patológicas Faculdade de Ciências Médicas da Santa Casa de São Paulo); Maria Elizete de Almeida Araujo (Federal University of Amazonas, University Hospital Getulio Vargas, Manaus, AM, Brazil); Mariana Volpe Arnoni (Serviço de Controle de Infecção Hospitalar Santa Casa de São Paulo); Rebeca Baiocchi Vianna (Hospital Santa Lúcia); Renan Magalhães Montenegro Junior (Complexo Hospitalar da Universidade Federal do Ceará – EBSERH); Renata Vicente da Penha (Hospital Evangélico de Vila Velha); Rogério Nadin Vicente (Hospital Santa Catarina de Blumenau); Ruchelli França de Lima (Hospital Moinhos de Vento); Sandro Rodrigues Batista (Faculdade de Medicina, Universidade Federal 
de Goiás, Goiânia, Goiás; Secretaria de Estado da Saúde de Goiás, Goiânia, Goiás); Silvia Ferreira Nunes (Fundação Santa Casa de Misericórdia do Pará—FSCMP; Mestrado Profissional em Gestão e Saúde na Amazônia); Tássia Teles Santana de Macedo (Escola Bahiana de Medicina e Saúde Pública); Valesca Lôbo e Sant’ana Nuno (Hospital Português da Bahia). We would also like to thank all those people who somehow contributed to the progress of this research, in alphabetical order: Adriana Weinfeld Massaia; Alexandre Amaral; Ana Maria Pereira Rangel; Antônia Célia de Castro Alcantara; Bruna Donida; Bruno Mendes Carmon; Carisi Polanczyk; Carolina Zenilda Nicolao; Claiton Marques de Jesus; Denise Corrêa Nunes; Diana Almeida; Eduardo Menezes Lopes; Elias Bezerra Leite; Elimar Ponzzo Dutra Leal; Fernanda Arns de Castro; Fernanda Colares de Borba Netto; Flávia Araújo; Flávio Lúcio Pontes Ibiapina; Gerência de Ensino e pesquisa do Complexo Hospitalar da Universidade Federal do Ceará – EBSERH; Hospital Português da Bahia; Humberto Bolognini Tridapalli; Iasmin Luiza Leite; Laura Freitas de Faveri; Lena Claudia Maia Alencar; Luciane Kopittke; Luciano Hammes; Luiz Alberto Mattos; Marly Suzielly Miranda Silva; Mayara Rocha de Oliveira; Mohamed Parrini; Pablo Viana Stolz; Paloma Farina de Lima; Paulo Pitrez; Pollyana Bueno Siqueira; Rafaella Côrti Pessigatti; Raul José de Abreu Sturari Junior; Rodrigo Smania Garrastazu Almeida; Rogério Farias Bitencourt; Rubens Vasconcelos Barreto; Tatiane Lima Aguiar; Thyago Gregório Mota Ribeiro.
References
- 1. Davenport T, Kalakota R. The potential for artificial intelligence in healthcare. Future Healthc J. 2019;6(2):94–8. pmid:31363513
- 2. Kamalov F, Cherukuri AK, Sulieman H, Thabtah F, Hossain A. Machine learning applications for COVID-19: a state-of-the-art review. Data Science for Genomics. Elsevier. 2023. p. 277–89.
- 3. Kuo K-M, Talley PC, Chang C-S. The accuracy of machine learning approaches using non-image data for the prediction of COVID-19: a meta-analysis. Int J Med Inform. 2022;164:104791. pmid:35594810
- 4. Maslej N, Fattorini L, Perrault R, Gil Y, Parli V, Nienga Kariuki J. The AI Index 2025 Annual Report. Stanford (CA): Stanford University. 2025.
- 5. Poon EG, Lemak CH, Rojas JC, Guptill J, Classen D. Adoption of artificial intelligence in healthcare: survey of health system priorities, successes, and challenges. J Am Med Inform Assoc. 2025;32(7):1093–100. pmid:40323320
- 6. Black E, Raghavan M, Barocas S. Model multiplicity: opportunities, concerns, and solutions. In: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. 2022.
- 7. Breiman L. Statistical modeling: the two cultures (with comments and a rejoinder by the author). Statist Sci. 2001;16(3).
- 8. Marx C, Calmon F, Ustun B. Predictive multiplicity in classification. In: Proceedings of the International Conference on Machine Learning. PMLR; 2020.
- 9. Watson-Daniels J, Parkes DC, Ustun B. Predictive multiplicity in probabilistic classification. AAAI. 2023;37(9):10306–14.
- 10. Meyer AP, et al. Perceptions of the fairness impacts of multiplicity in machine learning. arXiv. 2024.
- 11. Hollmann N, Müller S, Purucker L, Krishnakumar A, Körfer M, Hoo SB, et al. Accurate predictions on small data with a tabular foundation model. Nature. 2025;637(8045):319–26. pmid:39780007
- 12. Wichmann RM, Fernandes FT, Chiavegatto Filho ADP, IACOV-BR Network. Improving the performance of machine learning algorithms for health outcomes predictions in multicentric cohorts. Sci Rep. 2023;13(1):1022. pmid:36658181
- 13. Savalli C, Wichmann RM, Filho FB, Fernandes FT, Filho ADPC, IACOV-BR Network. Multicenter comparative analysis of local and aggregated data training strategies in COVID-19 outcome prediction with Machine learning. PLOS Digit Health. 2024;3(12):e0000699. pmid:39723970
- 14. Smith M, Alvarez F. Identifying mortality factors from Machine Learning using Shapley values - a case of COVID19. Expert Syst Appl. 2021;176:114832. pmid:33723478
- 15. Hsu H, Calmon F. Rashomon capacity: a metric for predictive multiplicity in classification. In: Advances in Neural Information Processing Systems. 2022. p. 28988–9000.
- 16. Afrose S, Song W, Nemeroff CB, Lu C, Yao DD. Subpopulation-specific machine learning prognosis for underrepresented patients with double prioritized bias correction. Commun Med (Lond). 2022;2:111. pmid:36059892
- 17. Zhang Q, Liu J, Zhang Z, Wen J, Mao B, Yao X. Mitigating unfairness via evolutionary multiobjective ensemble learning. IEEE Trans Evol Computat. 2023;27(4):848–62.
- 18. Vettoruzzo A, Bouguelia M-R, Vanschoren J, Rognvaldsson T, Santosh KC. Advances and challenges in meta-learning: a technical review. IEEE Trans Pattern Anal Mach Intell. 2024;46(7):4763–79. pmid:38265905