Uncovering the effects of model initialization on deep model generalization: A study with adult and pediatric chest X-ray images

Model initialization techniques are vital for improving the performance and reliability of deep learning models in medical computer vision applications. While much literature exists on non-medical images, the impacts on medical images, particularly chest X-rays (CXRs) are less understood. Addressing this gap, our study explores three deep model initialization techniques: Cold-start, Warm-start, and Shrink and Perturb start, focusing on adult and pediatric populations. We specifically focus on scenarios with periodically arriving data for training, thereby embracing the real-world scenarios of ongoing data influx and the need for model updates. We evaluate these models for generalizability against external adult and pediatric CXR datasets. We also propose novel ensemble methods: F-score-weighted Sequential Least-Squares Quadratic Programming (F-SLSQP) and Attention-Guided Ensembles with Learnable Fuzzy Softmax to aggregate weight parameters from multiple models to capitalize on their collective knowledge and complementary representations. We perform statistical significance tests with 95% confidence intervals and p-values to analyze model performance. Our evaluations indicate models initialized with ImageNet-pretrained weights demonstrate superior generalizability over randomly initialized counterparts, contradicting some findings for non-medical images. Notably, ImageNet-pretrained models exhibit consistent performance during internal and external testing across different training scenarios. Weight-level ensembles of these models show significantly higher recall (p<0.05) during testing compared to individual models. Thus, our study accentuates the benefits of ImageNet-pretrained weight initialization, especially when used with weight-level ensembles, for creating robust and generalizable deep learning solutions.


Introduction
The prowess of deep learning (DL) has been well established for medical imaging artificial intelligence (AI) applications, with automation enabling improved and efficient image acquisition, quality assessment, object detection and tracking, disease screening, diagnostics, and prediction [1]. As a subset of machine learning (ML), DL comprises multilayered neural networks for automated feature extraction and prediction, outperforming traditional techniques in accuracy and robustness.
Chest X-rays (CXRs) are a routinely used diagnostic imaging modality. Despite lower sensitivity compared to computed tomography (CT) scans, CXRs offer several advantages, including cost-effectiveness, reduced radiation exposure, and accessibility, making them practical in resource-limited settings [2,3]. Several CXR datasets are available to the ML community, which has resulted in significant advances in disease detection [4-8]. This listing is not intended to be exhaustive, as new datasets are being made available with increasing frequency.
A key step in developing high-performing DL solutions is determining appropriate model initialization strategies [9]. Model initialization refers to the method of assigning initial values to neural network weights and biases. Optimal selection of the initialization strategy depends on several factors: data characteristics, including dimensionality, variability due to differences in patient anatomy, disease states, and image acquisition procedures, and the requirement for expert interpretation, as well as the activation functions and optimization algorithms selected in the design [10]. Understanding the intricacies of model initialization and its impact on performance is essential for devising effective training methodologies and addressing various issues in the training process, including vanishing or exploding gradients, slow convergence, and unstable training dynamics. An appropriately selected initialization strategy can also yield reliable and enhanced medical AI performance, which is crucial for precision medicine applications.
The significance of model initialization is amplified when we consider challenges in model generalization, which arise primarily from feature distribution shifts between training datasets and real-world use. For example, a model trained and tested on adult CXR data from the same source (internal testing) may achieve significantly higher performance than when tested on adult CXR data from another source (external testing) [11]. Additional performance degradation may be observed when pediatric images exhibiting the same disease(s) are included in the testing. The inherent high-dimensional complexity and variability of medical images exacerbate this problem, causing models to overfit the training data. In this work, we present findings from our investigations on the impact of different model initialization techniques on DL models and propose mechanisms to improve generalizability.
A review of DL literature on model initialization reveals two main techniques, namely cold-start and warm-start, each with distinct implications for model training dynamics, generalizability, and performance [10]. The cold-start method initializes new weights and biases with small random values, which results in training a new model from scratch. This technique offers an unbiased foundation but deprives the model of initialization guidance, resulting in slower convergence. Conversely, the warm-start strategy leverages weights and biases from a model previously trained on data with similar content. The initialization guidance offered by this approach enables faster model convergence and potentially enhanced performance. However, a previous study [10] reported that warm-start consistently underperforms with non-medical images, yielding models with poorer generalization and lower prediction accuracy compared to cold-start models. The Shrink and Perturb method proposed in [10] shrinks existing model weights toward zero and adds noise, resulting in faster training than cold-start and improved generalization over warm-start models. However, that and other studies focused on non-medical images [9,10,12-15], leaving a gap in understanding the impact of model initialization techniques on medical computer vision. Unlike non-medical images, medical images have unique characteristics, including (i) variations in imaging modalities, e.g., CT, MRI, ultrasound, X-ray, pathology, and endoscopy, where each modality captures different aspects of the human body at varying levels of resolution, contrast, and noise; (ii) image acquisition conditions, including patient positioning, imaging protocols, and the expertise of medical professionals during acquisition, that impact image quality and appearance; (iii) varying anatomical structures that depict internal organs, tissues, systems, and physiological processes, providing vital information for the diagnosis, treatment planning, and monitoring of diseases; (iv) limited and imbalanced data, where instances of specific diseases or conditions with varying levels of progression are significantly fewer than healthy cases; and (v) ethical and regulatory considerations in handling medical data, since they involve sensitive patient information, requiring confidentiality and other safeguards [16,17].
Model generalizability is defined as the ability of a trained model to capture generalized patterns and perform well on unseen data. Medical computer vision relies on model generalizability for several reasons [11], including accommodating patient diversity, adapting to various data sources and quality, addressing ethical considerations, and enhancing clinical utility. A general model is robust to different data sources and population distributions, considering factors such as the patient's ethnicity, sex, and the severity of the disease(s) expressed in the image. Further, in many ML applications, data continuously flows into the system, which may require regular model updates that can be unreasonable or difficult to implement. Therefore, developing reliable and generalizable models mandates both internal and external (out-of-distribution) testing [18].
Most of the literature has focused on assessing internal generalization due to the lack of widely available datasets [5,19-21], and the findings, though significant, may not guarantee optimal model performance on external data. Federated learning methods use decentralized training to address challenges in achieving external generalization by incorporating diverse data distributions [22]. This approach could mitigate the risk of performance degradation when the model encounters unseen data distributions. However, it has limitations, such as requiring consistent communication and synchronization between data sources, which can be challenging in real-world settings with privacy concerns or network instability [23]. Further, data interoperability and completeness issues can limit generalization gains. Therefore, while federated learning provides a path toward external generalization, it also introduces new challenges. This presents an opportunity to consider and evaluate other novel and efficient methods for achieving external generalization.
For this work, we use adult and pediatric CXRs to evaluate model generalizability, as they simultaneously exhibit significant similarities and differences in anatomy and disease presentation across age groups [24]. These include: (i) Developmental stages: the evolving thoracic anatomy of pediatric patients is distinct in appearance from that of adults, with thinner chest walls and more compliant rib cages in children. (ii) Unique abnormalities: pediatric disease can present differently than in adults, or similar presentations could indicate different diseases. (iii) Imaging technique: distinct protocols for pediatric CXRs can result in variations in intensity and contrast; further, inspiration may be inconsistent across patients. (iv) Patient pose: pediatric patients may need to be held down, resulting in the presence of other hands in the image and unusual, variable patient poses. These discrepancies present challenges for DL models trained on adult data when directly applied to pediatric cases, potentially leading to sub-optimal generalizability and reduced clinical utility. Prior work in pediatric CXR image analysis includes the development and evaluation of a ResNet-50 model trained to classify pediatric CXRs as showing pneumonia-consistent manifestations or normal lungs [11]. The model demonstrated comparatively better performance on the internal test set (area under the curve (AUC): 0.95) than on the external NIH-CXR test set (AUC: 0.54), highlighting potential limitations in model generalizability. There is limited literature analyzing the generalizability of deep models trained on adult CXRs to the pediatric population.
Our study presents key contributions to address the knowledge gap in the current literature regarding the impact of model initialization methods on the generalizability of DL models applied to external adult and pediatric populations after training on internal adult CXR data. We specifically focus on scenarios with periodically arriving training data, a common challenge for medical computer vision algorithms. Our investigation delves into the performance of widely used model initialization methods, providing insights into their adaptability and their implications for generalizability. Furthermore, we propose novel weight-level ensemble methods to improve model generalizability. This understanding will pave the way for the successful deployment of DL models in medical imaging applications, ultimately improving clinical decision-making and patient outcomes.

Datasets
This retrospective study utilizes the following datasets: (i) RSNA-CXR dataset: This publicly available CXR collection results from a collaboration between the RSNA, the Society of Thoracic Radiology (STR), and the National Institutes of Health (NIH) for the Kaggle pneumonia detection challenge [25]. The objective was to support the design and development of image analysis and ML algorithms through a challenge targeting automatic classification of CXRs as normal, or containing non-pneumonia-related or pneumonia-related opacities. The collection comprises 26,684 de-identified anterior-posterior (AP) and posterior-anterior (PA) CXRs in DICOM format, featuring 8,851 normal lungs and 17,833 other abnormal radiographic patterns, of which 6,012 manifest pneumonia-related opacities. We use this dataset to train, validate, and internally test the DL models.
(ii) Indiana-CXR dataset: The Indiana CXR dataset contains 7,470 frontal and lateral CXR projections [26] in DICOM format, accompanied by multiple annotations, including indications, findings, and impressions in textual form. These images are sourced from hospitals affiliated with the Indiana University School of Medicine. Among these, 2,378 PA CXRs exhibit abnormal pulmonary manifestations, and 1,726 CXRs have normal lung appearances. This de-identified dataset is stored at the National Library of Medicine (NLM) and has been exempted from Institutional Review Board review (OHSRP # 5357). We use this dataset as the external adult test set.
(iii) VINDR-PCXR dataset: The VINDR-PCXR dataset is a publicly available pediatric CXR collection of 9,125 scans gathered from three major Vietnamese hospitals [27]; we use it as an external pediatric test set. Overall, the external test sets consist of adult CXRs from the Indiana-CXR collection and pediatric CXRs from the NIH-CXR and VINDR-PCXR collections. We categorize the pediatric CXRs into three groups: Ped-2 (1 day to under 24 months), Ped-11 (24 months to under 11 years), and Ped-18 (11 years to under 18 years), based on the lung developmental stages from infancy to adulthood, as discussed in [29]. Table 2 shows the categorization of test CXRs according to various age groups.

Lung region delineation and cropping
We utilize a UNet [30] model with an ImageNet-pretrained Inception-V3 encoder backbone from our previous study [31] to delineate the lung regions and crop them to the size of a bounding box. Lung cropping prevents the DL model from learning features irrelevant to cardiopulmonary disease detection. We resize the cropped lung bounding boxes to 256×256 pixels and normalize them to the range [0, 1] to reduce computational complexity.
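As a minimal illustration of this preprocessing step, the sketch below crops a synthetic CXR array to a lung bounding box and min-max normalizes it to [0, 1]. The bounding-box format and the synthetic image are assumptions for illustration; in practice the box comes from the UNet segmentation mask, and resizing to 256×256 would use an image library such as PIL or OpenCV.

```python
import numpy as np

def crop_and_normalize(image: np.ndarray, bbox: tuple) -> np.ndarray:
    """Crop a CXR to the lung bounding box and min-max normalize to [0, 1].

    `bbox` is (row_min, row_max, col_min, col_max) -- a hypothetical format;
    the actual boxes would be derived from the UNet lung-segmentation mask.
    """
    r0, r1, c0, c1 = bbox
    crop = image[r0:r1, c0:c1].astype(np.float32)
    lo, hi = crop.min(), crop.max()
    if hi > lo:
        crop = (crop - lo) / (hi - lo)  # scale intensities to [0, 1]
    return crop

# Example with a synthetic 12-bit X-ray-like array
img = np.random.randint(0, 4096, size=(1024, 1024))
lungs = crop_and_normalize(img, (100, 900, 150, 850))
```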

Model architecture and training
For the model architecture, we employ VGG-16 [32]. We truncate it at its deepest pooling layer and append a global average pooling (GAP) layer and a final dense layer with two nodes and Softmax activation. This modified model, referred to as VGG-16-M, predicts whether the CXRs show normal lungs or other cardiopulmonary abnormalities. We choose the VGG-16 model for its simplicity, effectiveness, and well-documented performance in medical image classification tasks, particularly with CXRs [33-35]. Selecting an optimal model falls beyond the scope of this research, as our study aims to analyze the impact of model initialization strategies on deep model generalization. The proposed techniques can be applied to any model suited to the characteristics of the data under study. Table 3 lists the data and model terminologies used in this study.

Optimizing the weight-scaling factor
The Shrink and Perturb technique [10] shrinks the existing model weights by multiplying them by a factor α and adds a small noise β to accelerate DL model convergence and enhance generalization compared to standard cold-start and warm-start methods. Let W be the set of model weights. We calculate the updated weights W′ using Equation (1):

W′ = αW + β    (1)

Here, α denotes the weight-scaling factor and β the small additive noise. Previous experiments [10] used discrete α values and fixed β at 0.01. In contrast, while we also fix β at 0.01, we apply Bayesian optimization via Gaussian Process (GP) minimization [36] to identify the optimal α for shrinking the weights of the Cold-RP and Cold-IP models. These are subsequently used to initialize the weights of the Shrink-RF and Shrink-IF models, respectively. Bayesian optimization using GP minimization reduces susceptibility to local minima, enabling more effective identification of the optimal α within a continuous interval than grid or random search over discrete values. GP minimization explores the search space more thoroughly and converges efficiently by modeling the objective function as a sample from a Gaussian process. We define the continuous search space for α within the range [0.1, 0.9]. We create a function that accepts α as input and performs the following steps: (i) instantiate and compile the model with the current weights, (ii) train and validate the model, storing the best model weights, validation loss, α, and training history whenever the validation loss decreases, and (iii) perform GP minimization for 100 function calls and 30 random starts to converge to the optimal α with minimal validation loss. The hyperparameters for GP minimization follow the default settings in the scikit-optimize Python library.
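The Shrink and Perturb update of Equation (1) can be sketched as follows. This is a minimal NumPy illustration that assumes Gaussian noise scaled by β, with toy arrays standing in for a model's layer weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def shrink_and_perturb(weights, alpha, beta=0.01):
    """Apply W' = alpha * W + beta * noise to each weight tensor.

    A minimal sketch of the Shrink and Perturb update [10]; the noise is
    assumed Gaussian here, and `weights` is a list of NumPy arrays standing
    in for a model's layer weights.
    """
    return [alpha * w + beta * rng.standard_normal(w.shape) for w in weights]

# Toy layer weights (kernel and bias) for demonstration
layer_weights = [rng.standard_normal((3, 3)), rng.standard_normal((3,))]
shrunk = shrink_and_perturb(layer_weights, alpha=0.7209)  # alpha found for Shrink-RF
```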

Weight-level ensembles
We also propose ensemble methods that merge the weights of multiple models. Our approach differs from traditional techniques that aggregate model predictions [37-39]. Our proposed weight-level ensembles harness the power of diverse weight initializations, capitalizing on complementary learning dynamics to foster robust generalization in complex, high-dimensional medical data landscapes.
We perform Equal Weight Averaging (EWA), which combines the weights of multiple trained models to create an average model. This technique aims to enhance classification performance by leveraging the complementary strengths of individual models in capturing data patterns. We achieve this by iterating through each model's layers, retrieving and averaging the layer weights with equal weight factors, resulting in a new model with the same architecture for prediction.
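A minimal sketch of EWA, assuming each model is represented as a list of NumPy weight arrays (in a Keras workflow these would typically come from `model.get_weights()`):

```python
import numpy as np

def equal_weight_average(models_weights):
    """Equal Weight Averaging: average corresponding layer weights across models.

    `models_weights` is a list of models, each a list of NumPy weight arrays
    (all models share the same architecture). Returns the averaged weights,
    which would then be loaded into a fresh model of the same architecture.
    """
    n = len(models_weights)
    return [sum(layers) / n for layers in zip(*models_weights)]

# Two toy "models", each with a kernel and a bias layer
m1 = [np.ones((2, 2)), np.zeros(2)]
m2 = [3 * np.ones((2, 2)), 2 * np.ones(2)]
avg = equal_weight_average([m1, m2])
```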
We introduce a novel F-score-weighted Sequential Least-Squares Quadratic Programming (F-SLSQP)-based weighted ensemble method to determine the optimal multiplication factors for combining the weights of multiple models in the ensemble. We identify these optimal factors by minimizing the error, as defined in Equation (2), through SLSQP-based constrained minimization [40].
The process of determining the optimal multiplication factors involves the steps listed below. Additionally, we present a novel method for developing an attention-guided ensemble incorporating a learnable Fuzzy Softmax layer (AGELFS). This technique utilizes attention mechanisms [41] to emphasize relevant features of each model while mitigating less significant ones. The ensemble construction involves the following steps: (i) instantiating and freezing the constituent models with their respective weights, (ii) processing training input through these models and appending a GAP layer to each model's output, (iii) concatenating the outputs of the GAP layers, and (iv) applying the attention mechanism followed by the learnable Fuzzy Softmax layer to produce the final prediction.

Performance evaluation and statistical significance analysis
We examine model performance using key metrics, including balanced accuracy, precision, recall, the area under the precision-recall curve (AUPRC), F-score, and Matthews Correlation Coefficient (MCC).
Each metric provides valuable insights into the model's effectiveness in various aspects of the classification task. We assess the statistical significance of the MCC using 95% binomial confidence intervals (CIs), computed with the Clopper-Pearson exact method, to distinguish model efficacy.
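For reference, the Clopper-Pearson exact binomial CI follows the standard beta-quantile construction; the sketch below is a generic illustration (the counts shown are hypothetical, not results from the study):

```python
from scipy.stats import beta

def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact (Clopper-Pearson) binomial CI for k successes in n trials.

    Standard construction: the bounds are quantiles of beta distributions;
    by convention the lower bound is 0 when k == 0 and the upper is 1 when
    k == n.
    """
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper

# Hypothetical example: 850 correct predictions out of 1,000 test CXRs
lo, hi = clopper_pearson(k=850, n=1000)
```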
We determine the p-values based on the CI-based Z-test [43]. We obtain the MCC values and their corresponding 95% CIs for the compared models. For each model, we compute the standard error (SE) using Equation (4):

SE = (CI_upper − CI_lower) / (2 × 1.96)    (4)

Here, CI_upper and CI_lower represent the upper and lower bounds of the CI, respectively. We compute the difference in MCC (ΔMCC) and the pooled SE of the difference (SEΔ) using Equations (5) and (6), respectively:

ΔMCC = MCC1 − MCC2    (5)

SEΔ = √(SE1² + SE2²)    (6)

Here, MCC1, MCC2, SE1, and SE2 are the MCC and SE values of the compared models. We compute the Z-score from this difference using Equation (7):

Z = ΔMCC / SEΔ    (7)

We calculate the corresponding p-value for the Z-score using a standard normal (Z) table. A threshold of 0.05 is utilized to establish statistical significance using the 95% CIs. If the p-value is less than 0.05, the difference in performance, as gauged by MCC, is statistically significant. We repeat this process to assess the statistical significance of the recall values for the proposed weight-level ensembles.
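The CI-based Z-test above can be sketched end-to-end as follows; the MCC values and CIs in the example are hypothetical, chosen only to exercise the computation.

```python
import math

def p_value_from_cis(mcc1, ci1, mcc2, ci2):
    """Two-sided p-value for the difference of two MCCs, given their 95% CIs.

    Follows the CI-based Z-test described in the text: per-model SE from the
    CI width (Equation 4), difference and pooled SE (Equations 5-6), Z-score
    (Equation 7), then a normal-tail p-value in place of a Z-table lookup.
    """
    se1 = (ci1[1] - ci1[0]) / (2 * 1.96)   # Equation (4)
    se2 = (ci2[1] - ci2[0]) / (2 * 1.96)
    delta = mcc1 - mcc2                    # Equation (5)
    se_delta = math.sqrt(se1**2 + se2**2)  # Equation (6)
    z = delta / se_delta                   # Equation (7)
    # Two-sided p-value from the standard normal CDF
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Hypothetical MCCs with 95% CIs for two compared models
z, p = p_value_from_cis(0.62, (0.58, 0.66), 0.50, (0.46, 0.54))
```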

Results and discussion
We first present a comparative analysis between the performances of the Cold-RP and Cold-IP models (Table 4). We proceed to train and evaluate models on 100% of the data, i.e., the RSNA-F dataset, with the aforementioned configurations for the Cold-RF, Warm-RF, Shrink-RF, Cold-IF, Warm-IF, and Shrink-IF models (Table 3). The weights of the Cold-RP and Cold-IP models used to initialize the Shrink-RF and Shrink-IF models are shrunk by optimal scaling factors of 0.7209 (α1) and 0.9 (α2), respectively, as determined by Bayesian optimization through GP minimization in the constrained continuous interval [0.1, 0.9]. We also apply ensemble methods to evaluate whether the generalization performance on internal and external test sets can surpass that of the individual models. Table 7 presents the ensemble performances when predicting the internal adult test set.

Conclusion and future scope
The VINDR-PCXR dataset was developed to support computer-aided diagnosis algorithm development for pediatric CXR interpretation. It consists of 9,125 CXR scans, in DICOM format, collected from three major Vietnamese hospitals between 2020 and 2021. The pediatric dataset includes de-identified images of 5,354 males, 3,709 females, and 62 patients with unknown gender. Among the 8,755 pediatric CXRs, 5,876 show normal lungs and 2,879 exhibit other cardiopulmonary abnormalities, with age distributions as follows: 5,335 CXRs for ages 1 day to under 24 months, 3,351 CXRs for ages 24 months to under 11 years, and 69 CXRs for ages 11 to under 18 years. We use this dataset as an external pediatric test set. (iv) NIH-CXR dataset: The NIH-CXR dataset is a publicly accessible, large-scale collection of de-identified CXRs [28] compiled by the NIH Clinical Center. It contains 112,120 frontal-view CXR images in PNG format from 30,805 unique patients. The dataset includes 14 cardiopulmonary disease labels, text-mined from radiological reports using a Natural Language Processing (NLP) labeler. Among these, 5,257 pediatric CXRs represent normal lungs (n = 3,066) and other cardiopulmonary abnormalities (n = 2,191), divided into three age groups: 34 CXRs from pediatric patients aged 1 day to under 24 months, 1,787 CXRs for ages 24 months to under 11 years, and 3,486 CXRs for ages 11 to under 18 years. The pediatric group consists of 3,018 males and 2,239 females, while 106,863 CXRs belong to patients older than 18 years. We use this dataset as an external pediatric test set. We further partition the RSNA-CXR dataset at the patient level into 70% for training, 10% for validation, and 20% for internal testing. The training and validation sets are additionally divided into two equal-sized subsets to simulate periodic data arrival for training and validation and to facilitate the simplest case of warm-start. The DL model trains to convergence on the first half of the data and then trains on the full collection, representing 100% of the data. We name the first half RSNA-Partial (P) and the full collection RSNA-Full (F). The internal test set remains the same for both the RSNA-P and RSNA-F datasets.
The F-SLSQP optimization involves: (i) defining a function to compute the weighted average of weights for the ensemble models, (ii) defining a function to create a new model with the same architecture as the models in the ensemble, (iii) creating a global variable for the best multiplication factors, (iv) defining a function to calculate the error from the weighted average of the models, (v) setting the optimization parameters, including the constraints and bounds, where the constraint ensures the sum of scaling factors equals 1.0 and the bounds ensure each scaling factor is within the range [0, 1], (vi) executing the SLSQP algorithm multiple times (n = 100) to minimize the error and find the optimal multiplication factors, (vii) performing weighted averaging with the optimal multiplication factors to create the weighted ensemble model, and (viii) compiling and saving the weighted ensemble model for prediction.
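The constrained SLSQP search in steps (v)-(vi) can be sketched with SciPy as follows. For brevity this toy version blends prediction vectors rather than full layer weights, and uses a simple mean-squared error as a stand-in for the study's Equation (2); the data, restart count, and objective are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)

# Toy stand-ins: per-model validation outputs and binary ground truth.
preds = [rng.random(200) for _ in range(3)]   # three constituent models
y_true = rng.integers(0, 2, size=200).astype(float)

def ensemble_error(factors):
    """Error of the factor-weighted combination (MSE stand-in for Equation 2)."""
    blended = sum(f * p for f, p in zip(factors, preds))
    return float(np.mean((blended - y_true) ** 2))

n_models = len(preds)
constraints = {"type": "eq", "fun": lambda f: np.sum(f) - 1.0}  # factors sum to 1
bounds = [(0.0, 1.0)] * n_models                                # each in [0, 1]

best = None
for _ in range(10):  # random restarts (the study uses n = 100)
    f0 = rng.dirichlet(np.ones(n_models))
    res = minimize(ensemble_error, f0, method="SLSQP",
                   bounds=bounds, constraints=constraints)
    if best is None or res.fun < best.fun:
        best = res

factors = best.x  # optimal multiplication factors for the weighted ensemble
```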
Recall that the Cold-RP model initializes the VGG-16 backbone of the VGG-16-M model with random weights and trains it on the RSNA-P dataset. Conversely, the Cold-IP model initializes the VGG-16 backbone with ImageNet-pretrained weights and also trains it on the RSNA-P dataset.

Fig 2 depicts histograms illustrating the distribution of Softmax activations for the positive (1 - Abnormal) and negative (0 - No Finding) classes when predicting the RSNA-P test set using the Cold-RP and Cold-IP models. The Softmax histograms provide insight into the correctness and confidence of each model's predictions, as well as differences in Softmax outputs and overall performance. The x-axis represents Softmax activations, and the y-axis indicates the density of these activations. The histograms' shape and density reveal a more distinct separation between the two classes in the Cold-IP model, characterized by two clear peaks near 0 and 1. This distinction may result from the Cold-IP model's initialization with ImageNet-pretrained weights, allowing it to leverage useful features learned from a large-scale dataset.

Fig 4. AUPRC of the models while predicting the internal adult test set.
We further analyze the weight distribution similarity of the Cold-IF, Warm-IF, and Shrink-IF models using scatter plots (Fig 7). The plots visually depict the relationship between the weight distributions of each model pair, namely (Cold-IF, Warm-IF), (Cold-IF, Shrink-IF), and (Warm-IF, Shrink-IF). Each point in a scatter plot represents a pair of weights from the compared models, with the x-axis and y-axis representing the weights of the respective models. Dense point distributions along the diagonal indicate higher weight similarity, while more dispersed distributions suggest less similarity. The scatter plots demonstrate a dense diagonal distribution, indicating a strong positive correlation and highly similar weight distributions for the compared models. This similarity implies that the models learned similar features and representations during training, resulting in comparable Softmax predictions for the positive and negative classes, as supported by their performance metrics.

Fig 6. Heatmap showing EMD values between each model pair for the Cold-IF, Warm-IF, and Shrink-IF models.
Diverse model initialization techniques are instrumental for deep model optimization, affecting convergence speed, reducing the risk of overfitting, and improving generalizability. Our qualitative and quantitative analyses validate the claim that cold-start approaches can decelerate convergence, while warm-start methods, such as ImageNet-pretrained weight initialization, enhance convergence and performance. Furthermore, improper weight initialization can introduce biases that inadvertently favor certain classes or feature sets, which, in turn, increases the risk of model overfitting and reduces generalizability. To mitigate this risk, we perform ensemble learning and propose novel weight-level ensemble methods that improve performance over the individual constituent models. These ensembles can harness a broader range of feature representations, making them more adaptable and effective when handling unseen data. This adaptability is particularly relevant in medical computer vision, where models must demonstrate exceptional generalizability across diverse patient populations and imaging modalities. Future research could explore alternative ensemble methods, such as advanced stacking or voting techniques, to further improve generalization. Further, incorporating demographic factors during model initialization could enable the development of personalized DL models for medical image analysis, extending the scope of this research to other medical imaging tasks and modalities. Pursuing these research directions could help improve medical computer vision DL models for reliable healthcare applications.

S3 Figure. Histograms showing the distribution of Softmax activations for each model pair: (a) Cold-RF and Cold-IF; (b) Warm-RF and Warm-IF; and (c) Shrink-RF and Shrink-IF.

Table 1 provides details of this partition.

Table 3. Data and model terminologies.
The checkpoint exhibiting the lowest validation loss is used to generate predictions for both the internal and external test datasets. Test performance is evaluated at the optimal classification threshold, determined by maximizing the F-score on the validation dataset.
Training uses a batch size of 64. We utilize the Adam optimizer with an initial learning rate of 0.001 to minimize the categorical cross-entropy loss. Model checkpoints are stored via callbacks whenever a decrease in validation loss is observed.
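The F-score-maximizing threshold selection described above can be sketched as follows, with synthetic validation scores standing in for the model's predicted abnormality probabilities.

```python
import numpy as np

def best_threshold(probs, labels):
    """Pick the classification threshold that maximizes F-score on validation data.

    Sweeps candidate thresholds over the predicted abnormality probabilities;
    a minimal sketch of the validation-based thresholding step.
    """
    best_t, best_f = 0.5, -1.0
    for t in np.linspace(0.01, 0.99, 99):
        pred = (probs >= t).astype(int)
        tp = int(np.sum((pred == 1) & (labels == 1)))
        fp = int(np.sum((pred == 1) & (labels == 0)))
        fn = int(np.sum((pred == 0) & (labels == 1)))
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f = 2 * precision * recall / (precision + recall)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

# Synthetic validation set: abnormal cases score higher than normal ones
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=500)
probs = np.clip(labels * 0.6 + rng.random(500) * 0.5, 0, 1)
t, f = best_threshold(probs, labels)
```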

Table 4 shows that the Cold-IP model achieves notably higher values for the other performance metrics compared to the Cold-RP model.

Table 4. Performance of models initialized with random and ImageNet-pretrained weights on the internal adult test set. The terms B. Acc., P, R, and F denote balanced accuracy, precision, recall, and F-score, respectively. Bold numerical values denote superior performance in their respective columns.

Table 5 displays the performance metrics, while Fig 4 illustrates the AUPRC achieved by each model when predicting the RSNA-F test set (i.e., the internal adult test set). We observe that the models initialized with ImageNet-pretrained weights (Cold-IF, Warm-IF, Shrink-IF) converge considerably faster and significantly outperform their randomly initialized counterparts (Cold-RF, Warm-RF, Shrink-RF) in terms of MCC (p<0.00001) and other metrics.

Table 5. Performance of models on the internal adult test set. Bold numerical values denote superior performance in their respective columns. The * denotes a statistically significant difference in MCC within each model pair, i.e., (Cold-RF, Cold-IF), (Warm-RF, Warm-IF), and (Shrink-RF, Shrink-IF) (p<0.00001).

Table 6. Comparing model performances when predicting the external adult and pediatric test sets. Bold numerical values denote superior performance in their respective columns. Lower EMD values indicate higher weight similarity due to the shared ImageNet-pretrained weight initialization of the Cold-IF, Warm-IF, and Shrink-IF models. This similarity, supported by the low EMD values, aligns with the observation that the models' performance differences are not pronounced.

Table 7. Model performances achieved with the internal adult test set. Bold numerical values denote superior performance in their respective columns.

We select the baseline model based on the best MCC performance reported for the individual models in Table 5. We observe that the Attention-Guided Ensemble with Learnable Fuzzy Softmax (AGELFS) of the Cold-IF and Shrink-IF models delivers significantly superior recall (p<0.00001) and marginally higher AUPRC and F-score values compared with the other ensemble methods. The AGELFS of the Cold-IF and Warm-IF models delivers higher, but not significantly superior, values for balanced accuracy and precision. The learned fuzziness values for the Softmax layer in the AGELFS ensembles are 1.113, 1.113, and 1.039. Several factors may explain why recall improves significantly while other metrics do not: (i) Precision-recall tradeoff: improved sensitivity raises recall; however, an increase in false positive (FP) predictions could counteract precision improvements, resulting in relatively unchanged F-score, MCC, and AUPRC. (ii) Ensemble learning bias-variance tradeoff: ensemble learning aims to reduce the bias and variance of individual models for better generalization. The EWA ensemble decreases variance without significantly impacting bias. Since recall is sensitive to reducing false negatives (FN) (i.e., variance reduction), it can show significant improvement while other metrics remain unchanged if bias remains relatively constant. (iii) Imbalanced datasets: in imbalanced datasets, EWA ensemble techniques can improve recall for the minority class without significantly affecting other metrics. This is evident in the external pediatric test sets, where abnormal CXRs are fewer than normal samples. The EWA ensemble model's robustness against overfitting and improved generalization in identifying minority-class samples may not lead to significant changes in other metrics. These considerations also apply to the significantly superior recall values obtained using the AGELFS of the Cold-IF and Shrink-IF models on the internal adult test set.