
The external validity of machine learning-based prediction scores from hematological parameters of COVID-19: A study using hospital records from Brazil, Italy, and Western Europe

  • Ali Safdari,

    Roles Data curation, Formal analysis, Software, Writing – original draft

    Affiliation Department of Biological Sciences, Birla Institute of Technology and Science, Pilani, Hyderabad Campus, Hyderabad, Telangana, India

  • Chanda Sai Keshav ,

    Contributed equally to this work with: Chanda Sai Keshav, Deepanshu Mody, Kshitij Verma, Utsav Kaushal, Vaadeendra Kumar Burra

    Roles Data curation, Methodology, Software

    Affiliation Department of Biological Sciences, Birla Institute of Technology and Science, Pilani, Hyderabad Campus, Hyderabad, Telangana, India

  • Deepanshu Mody ,

    Contributed equally to this work with: Chanda Sai Keshav, Deepanshu Mody, Kshitij Verma, Utsav Kaushal, Vaadeendra Kumar Burra

    Roles Formal analysis, Methodology, Software, Validation

    Affiliation Department of Biological Sciences, Birla Institute of Technology and Science, Pilani, Hyderabad Campus, Hyderabad, Telangana, India

  • Kshitij Verma ,

    Contributed equally to this work with: Chanda Sai Keshav, Deepanshu Mody, Kshitij Verma, Utsav Kaushal, Vaadeendra Kumar Burra

    Roles Data curation, Formal analysis

    Affiliation Department of Biological Sciences, Birla Institute of Technology and Science, Pilani, Hyderabad Campus, Hyderabad, Telangana, India

  • Utsav Kaushal ,

    Contributed equally to this work with: Chanda Sai Keshav, Deepanshu Mody, Kshitij Verma, Utsav Kaushal, Vaadeendra Kumar Burra

    Roles Formal analysis, Methodology, Software, Validation, Visualization

    Affiliation Department of Biological Sciences, Birla Institute of Technology and Science, Pilani, Hyderabad Campus, Hyderabad, Telangana, India

  • Vaadeendra Kumar Burra ,

    Contributed equally to this work with: Chanda Sai Keshav, Deepanshu Mody, Kshitij Verma, Utsav Kaushal, Vaadeendra Kumar Burra

    Roles Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft

    Affiliation Department of Biological Sciences, Birla Institute of Technology and Science, Pilani, Hyderabad Campus, Hyderabad, Telangana, India

  • Sibnath Ray,

    Roles Conceptualization, Formal analysis, Investigation

    Affiliation Gencrest Private Limited, 301-302, B-Wing, Corporate Center, Mumbai, India

  • Debashree Bandyopadhyay

    Roles Conceptualization, Funding acquisition, Investigation, Project administration, Resources, Software, Supervision, Validation, Writing – review & editing

    banerjee.debi@hyderabad.bits-pilani.ac.in

    Affiliation Department of Biological Sciences, Birla Institute of Technology and Science, Pilani, Hyderabad Campus, Hyderabad, Telangana, India

Abstract

The unprecedented worldwide pandemic caused by COVID-19 has motivated several research groups to develop machine learning-based approaches that aim to automate the diagnosis or screening of COVID-19 at large scale. The gold standard for COVID-19 detection, quantitative real-time polymerase chain reaction (qRT-PCR), is expensive and time-consuming. Alternatively, haematology-based detection is fast and near-accurate, although less explored. The external validity of haematology-based COVID-19 predictions on diverse populations is yet to be fully investigated. Here we report the external validity of machine learning-based prediction scores from haematological parameters recorded in different hospitals of Brazil, Italy, and Western Europe (raw sample size 195,554). The XGBoost classifier performed consistently better than the other six ML classifiers tested, on all the datasets. The working models include a set of either four or fourteen haematological parameters. The internal performances of the XGBoost models (AUC scores from 84% to 97%) were superior to those of ML models reported in the literature for some of these datasets (AUC scores from 84% to 87%). Meta-validation of the external performances revealed reliable performance (AUC score 86%) along with good accuracy of the probabilistic prediction (Brier score 14%), particularly when the model was trained and tested on fourteen haematological parameters from the same country (Brazil). The external performance was reduced when the model was trained on datasets from Italy and tested on Brazil (AUC score 69%) and Western Europe (AUC score 65%), presumably affected by factors such as ethnicity, phenotype, immunity, and reference ranges across the populations.
The advance of the present study is the development of a COVID-19 prediction tool that is reliable and parsimonious, using fewer hematological features than the earlier meta-validated study, and based on a sufficient sample size (n = 195,554). Thus, the current models can be applied at other demographic locations, preferably with prior training of the model on the same population. Availability: https://covipred.bits-hyderabad.ac.in/home; https://github.com/debashreebanerjee/CoviPred.

Introduction

The COVID-19 infection has posed the deadliest threat to the health of the human population in the 21st century. Likely, the danger is far from over, given the emerging variants of COVID-19, such as alpha (B.1.1.7), beta (B.1.351), gamma (P.1), delta (B.1.617.2), lambda (C.37), and omicron (B.1.1.529) [1], along with other frequently mutating respiratory pathogens such as influenza virus A (H1N1) [2]. The most common clinical feature of COVID-19 is pneumonia with fever, cough, fatigue, headache, diarrhoea, hypoxia, and dyspnoea. The latest variant, omicron, shares some symptoms with the earlier SARS-CoV-2 strains, although with lesser severity due to mild infection in the lower respiratory tract and a reduced probability of hospitalization [1,3]. In mild COVID-19 infection, either no pneumonia (asymptomatic) or only mild pneumonia is observed. In moderate infection, dyspnoea, hypoxia, and lung injury may occur. In severe infection, respiratory failure to multi-organ failure occurs. In brief, severe cases of COVID-19 can lead to a systemic infection affecting almost all of the major organ systems. Due to the nature of the disease, timely detection of COVID-19 is of utmost importance; hence, detection techniques play a pivotal role in its diagnosis. There are two major types of COVID-19 tests: a) molecular tests (qRT-PCR tests) and b) rapid antigen tests. There is also a less common antibody-based detection technique. Molecular qRT-PCR tests, considered the gold standard for COVID-19 detection, detect the viral RNA load in a patient. The sensitivity (low false-negative rate) of qRT-PCR ranges from 88% to 96% [4,5]. Although qRT-PCR tests are considered the gold standard, they have several limitations, such as manual errors during sample (nasal and oral swab) collection and operational errors [6].
Moreover, the time required for the experiment and the availability of detection kits at mass level become difficult in a vast population with a large number of infections. The test is also costly for low-income groups. The rapid antigen test (RAT) is an alternative to qRT-PCR that detects the load of viral protein in an individual and is much faster than qRT-PCR (on average only 15 minutes). RAT results have high specificity, in the range 98%–99% (low false positives), but low sensitivity, 70%–72% [7–9]. This is because RATs are more sensitive in the symptomatic and transmissive stages of the disease, when the viral load is higher [7,10]. The advantages of RAT are that it can be performed at mass scale [11] and provides point-of-care self-testing. Thus, alongside their advantages, both qRT-PCR and RAT techniques have their own limitations. Development of a rapid, accurate and low-cost detection protocol could circumvent the shortcomings of both methods and supplement initial screening at large scale and for low-income populations, particularly in a country like India, with the second-largest population in the world.

Literature reports indicate the availability of several hematological biomarkers in COVID-19 patients, making them potential candidates for developing an alternative protocol. COVID-19 patients exhibit a wide range of hematologic abnormalities that change with disease progression, severity, and mortality [12]. For example, white blood cells sense [13] and respond to microbial threats [14]. Similarly, blood platelet expression and platelet counts are altered in COVID-19 patients [15–17]. Platelet hyperactivity was demonstrated as one of the unique features of COVID-19 infection [18]. Abnormal levels of CRP, D-dimer, procalcitonin, and troponin were observed in the deceased. The most effective mortality biomarkers identified were ESR, INR, PT, CRP, D-dimer and ferritin. Neutrophilia, leukocytosis and erythrocytopenia were identified as mortality risk predictors [13]. As per earlier reports, high d-CWL and d-CFL values largely confirmed the COVID-19 diagnosis, while the d-CIT, d-CT, and d-PPT biomarkers were efficient in the prognosis of COVID-19 disease [19]. White blood cell counts (WBC), lymphocyte counts, C-reactive protein (CRP), and D-dimer were used in the prognosis and diagnosis of COVID-19 [20]. Hence, a complete blood count (CBC) could serve as a biomarker for COVID-19. Screening for COVID-19 infection in terms of CBC has been attempted by various research groups worldwide [21–26].

Machine learning approaches for COVID-19 disease prediction have been reported in the literature, based on incident moments [27], SEIR models [28], etc. Some research groups used machine learning (ML) approaches to exploit the haematological parameters for prognosis, diagnosis and risk factor predictions [29–35]. Each of those models was developed for disease prediction on a specific population; their Area Under Curve (AUC) performance ranges from 84% to 87%. So far, only a handful of reports have tested the applicability of haematology-based ML models across different ethnicities and populations [36]. The combination of haematological parameters varies with ethnicity in non-COVID individuals; for example, mean corpuscular volume (fL) and white blood cell counts (10⁹/L) differ between African-Americans and whites [37]. A study conducted at a hospital in San Francisco between April 2017 and January 2018 showed that the reference intervals of neutrophil, lymphocyte and eosinophil counts, hemoglobin, mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH) and mean corpuscular hemoglobin concentration (MCHC) were significantly different (p<0.05) across four racial/ethnic groups, namely Asian, Black, Hispanic and White [38]. Various blood biomarkers (WBC, CRP, eosinophils, monocytes, platelets, etc.) have varied worldwide since the inception of COVID-19 in 2020 [17]. A set of blood biomarkers varies between severe (ICU) and non-severe (non-ICU) COVID-19 patients [39,40]. A study conducted in China in January 2020 showed that patients admitted with COVID-19 presented with lymphocytopenia (83.2% of the patients), thrombocytopenia (36.2%), and leukopenia (33.7%) [41]. A similar study conducted in South Africa (March to June 2020) showed that a decrease in median lymphocyte count and a rise in D-dimer had no significant association with outcome [26].
Another study conducted in India (August 2020 to January 2021) noted severe elevation of D-dimer levels in some patients, along with haemoglobin, red blood cell (RBC) count, haematocrit, neutrophil and lymphocyte counts [42]. All these observations together suggested that race/ethnicity-specific variations occur in CBC parameters, both for reference intervals and for COVID-19 patients. Furthermore, a recent study from Iran showed that hematological parameters varied even within the same ethnicity across different COVID-19 pandemic waves [43]. The authors showed that MCV and RDW-CV increased most during the first wave, whereas lymphocyte count, MCHC, PLT count, and RDW-SD increased most during the second wave, and so on. However, alterations in some of the blood parameters, leading to lymphocytopenia [41,44], leucopenia, and thrombocytopenia [45–47], are more or less common in COVID-19.

Based on the above observations, we hypothesized that an ML model developed on haematological parameters would yield the best (COVID-19) probabilistic predictions when trained and (externally) validated on the same population. The hypothesis was, to some extent, supported by a literature report in which an ML model trained on CBC parameters from Italy (training dataset) and externally validated on three other Italian datasets produced high sensitivity values (ranging from 85% to 91%), whereas the same model externally validated on three Brazilian datasets produced low sensitivity values (ranging from 29% to 37%) [36]. In the current study, we aim to optimize the features in ML models so that they are at least acceptable (in terms of meta-validation results) across populations. The eXtreme Gradient Boost (XGBoost) model was benchmarked as the best-performing model across the datasets compared to the published literature. The advance of the current work is the development of a COVID-19 prediction tool that is reliable and parsimonious, based on a sufficient sample size (n = 195,554).

Method

Description of clinical datasets for training, validation, and prediction

There are three major datasets curated from publicly available hospital sources (https://www.kaggle.com/einsteindata4u/covid19; https://zenodo.org/record/4081318#.X4RWqdD7TIU; https://repositoriodatasharingfapesp.uspdigital.usp.br/).

Dataset 1.

Dataset-1 was generated based on anonymized patient data publicly available from Hospital Israelita Albert Einstein, in São Paulo, Brazil https://www.kaggle.com/einsteindata4u/covid19. The data were recorded from February 26th, 2020, to March 23rd, 2020. The patients hospitalized under i) regular ward, ii) semi-intensive care unit and iii) intensive care unit were included in this study. The cases and controls for this dataset include the patients whose samples were collected to perform the SARS-CoV-2 RT-PCR and additional laboratory tests during a visit to the hospital.

The initial data set consisted of 558 positive and 5086 negative cases of COVID-19. This dataset was processed to minimize the null-value columns and eliminate the negative instances with many null values. The value (xi) in each cell was pre-normalized (at the source) to a mean value (μ) of zero and unit standard deviation (σ); this was termed the ‘normalized count’: xi′ = (xi − μ)/σ. The same normalization scheme has been used throughout the subsequent datasets. Columns with more than 90% null values were dropped. Because the initial dataset was already small, lower cutoff values would have reduced the dataset beyond usability for model training. The records (rows) showing positive results were retained by default, and the negative records were retained only if they had more than 10% non-null entries.
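For illustration, the z-score normalization applied at the data source can be sketched in Python with pandas (the column name and values below are hypothetical, not from the actual dataset):

```python
import numpy as np
import pandas as pd

def z_normalize(col: pd.Series) -> pd.Series:
    """Standardize a column to zero mean and unit standard deviation,
    skipping null entries: xi' = (xi - mu) / sigma."""
    mu = col.mean(skipna=True)
    sigma = col.std(skipna=True)
    return (col - mu) / sigma

# Hypothetical haematological column with a missing entry.
df = pd.DataFrame({"hemoglobin": [13.5, 14.2, np.nan, 12.8, 15.1]})
normalized = z_normalize(df["hemoglobin"])
# The non-null normalized values have mean ~0 and standard deviation ~1.
```

Null entries remain null after normalization, which matters later because XGBoost can consume them directly while other classifiers require imputation.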

Thus, the negative data size dropped to 1446, enriching the relative size of the positive data; the negative-to-positive sample size ratio was reduced to 2.59, four times less than that in the published model (11.51) [21]. This newly processed dataset, enriched with positive results, was termed dataset 1. In total, dataset 1 contains thirty-seven features and 2004 records: 558 positives and 1446 negatives (Table 1). Here ‘features’ refers to the x-parameters used to train the model; the definition excludes the y-parameter, the SARS-COV-2 result (positive or negative). This definition is used consistently in the subsequent datasets. These thirty-seven features were categorized into four classes, namely, i) age, ii) severity of the infection, iii) hematological features, and iv) co-morbidities (S1 Table in S1 Appendix). The hematology analyser used was not known. We further processed dataset 1 by dropping the co-morbidities. Thus, the total number of features was reduced to 18 and the number of records to 602, with 83 positives and 519 negatives.
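The column- and row-filtering logic described above can be sketched as follows (a minimal sketch only; the column names, target labels, and toy records are illustrative, not taken from the study's code or data):

```python
import numpy as np
import pandas as pd

def filter_dataset(df: pd.DataFrame, target: str = "sars_cov_2_result") -> pd.DataFrame:
    """Sketch of the dataset-1 filtering: drop columns that are more than
    90% null, keep all positive records, and keep negative records only
    if more than 10% of their entries are non-null."""
    # Drop feature columns with more than 90% null values.
    keep_cols = [c for c in df.columns
                 if c == target or df[c].isna().mean() <= 0.90]
    df = df[keep_cols]
    # Positive records are retained by default.
    positives = df[df[target] == "positive"]
    # Negative records need more than 10% non-null entries.
    negatives = df[df[target] == "negative"]
    negatives = negatives[negatives.notna().mean(axis=1) > 0.10]
    return pd.concat([positives, negatives])

# Toy example: one fully-null column that should be dropped.
demo = pd.DataFrame({
    "sars_cov_2_result": ["positive", "negative", "negative"],
    "all_null": [np.nan, np.nan, np.nan],
    "f2": [1.0, np.nan, 2.0],
})
out = filter_dataset(demo)
```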

Dataset 2.

This dataset was obtained from San Raffaele Hospital (OSR), Italy [23]. The inclusion criterion was admission to the emergency department of the hospital from February 19, 2020, to May 31, 2020. Patients admitted between February and April 2020 were excluded if they had potentially confounding pathologies or other sources of bias, such as insufficient data availability; all patients admitted during May 2020 were included. In the original OSR dataset, there were 1736 entries with a total of 72 features, among which 36 were haematological features. Fifty-two percent of the patients were COVID-19 positive, as determined by RT-PCR tests on nasopharyngeal swabs.

These 1736 entries were processed such that all rows (records) with more than 66% null values were dropped. The cutoff was chosen by trial and error; with other (higher or lower) values, the dataset became either too small or too sparse. In the case of dataset 1, whose initial size was already small, a 66% cutoff would have made the dataset unfit for ML model training. The processed dataset contained 1388 records: 765 positives and 623 negatives (Table 1). This dataset includes 31 features: age, sex, a feature for suspicion (representing subjective analysis of the patient by a physician), and 28 haematological parameters (Fig 1). The haematology analyser used was a Sysmex XE-2100. The ratio of negative to positive records was 0.81, indicating a greater number of positives than negatives (Table 1).

Fig 1. Haematological features used in different datasets.

https://doi.org/10.1371/journal.pone.0316467.g001

Dataset 3.

Dataset 3 was obtained from the Covid Data Sharing initiative created by a consortium led by FAPESP (Sao Paulo Research Foundation) and USP (University of Sao Paulo, Brazil). The data originated from three prominent private hospitals in Sao Paulo, Brazil—Fleury Institute, Sírio-Libanês Hospital, and Albert Einstein Hospital, from November 1st, 2019, to July 1st, 2020 (https://repositoriodatasharingfapesp.uspdigital.usp.br/). The data was anonymized from patients tested for COVID-19 (serology or RT-PCR). The haematology analyser used therein was not known.

The raw data obtained from the data-sharing initiative had multiple rows (records) per patient, each containing different clinical features (the "long form" of the dataset). The "long form" was converted, using an in-house Python code, to the "wide form," where one row corresponds to all the clinical features of a patient. The "wide form" of the dataset has 189227 records and 454 features. Many of these 454 column headers were duplicates of the same features (due to different reference ranges). After deduplication, the feature number was reduced to 104. The deduplicated features were further filtered by excluding i) records with no qRT-PCR result available and ii) all rows with more than 66% null values (as for dataset 2). Twenty-seven hematological indices (features) were identified based on the above cutoff (Fig 1). The final dataset size was 6488: 4533 negatives and 1955 positives.
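The long-to-wide conversion can be sketched with pandas (the in-house code is not published, so this is only an illustrative reconstruction; patient IDs, analyte names, and values are hypothetical):

```python
import pandas as pd

# Hypothetical "long form": one row per (patient, analyte) measurement.
long_df = pd.DataFrame({
    "patient_id": ["P1", "P1", "P2", "P2"],
    "analyte":    ["hemoglobin", "platelets", "hemoglobin", "platelets"],
    "value":      [13.5, 210.0, 14.1, 190.0],
})

# "Wide form": one row per patient, one column per analyte.
wide_df = long_df.pivot_table(index="patient_id",
                              columns="analyte",
                              values="value",
                              aggfunc="first").reset_index()
```

With `aggfunc="first"`, repeated measurements of the same analyte for one patient collapse to the earliest entry; other aggregation choices (mean, last) are equally possible depending on the clinical convention.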

Description of the clinical dataset for blind prediction

Western European dataset.

This dataset was obtained from several Western European hospitals, as reported elsewhere [25]. The dataset covers patients from the first day of hospitalization to nearly five weeks [25]. The published data were in the form of twenty separate tables, which we merged into a single file comprising 2587 entries and thirty-seven features. According to the source authors [25], there are two stages of the disease: a) an early stage, from day zero through day three (four days in total), and b) an advanced stage, comprising all subsequent days. This blind prediction dataset includes only four hematological parameters, consistent with sub-dataset 2-four-features.

Machine Learning (ML) approaches

The machine learning (ML) algorithms were implemented in Python (3.7.13) using the following libraries: NumPy (1.21.6), Pandas (1.3.5), XGBoost (0.90), scikit-learn (1.0.2), Seaborn (0.11.2), Matplotlib (3.2.2), and Pickle (4.0).

Different algorithms:

Extreme Gradient Boost (XGBoost)

The algorithm primarily employed was the Extreme Gradient Boost (XGBoost) classifier, which implements gradient-boosted decision trees (with enhanced speed and performance) and trains a class-weighted (or cost-sensitive) version for imbalanced classification [48]. XGBoost handles null entries natively: at each tree split, a default branch direction is learned for missing values, so records containing nulls can be used without imputation.

Other classifiers tested on these datasets were logistic regression, Fisher linear discriminant, Naïve Bayes, SVM, random forest, and K-Nearest Neighbour (KNN).

Logistic regression

Logistic regression predicts the output of a categorical dependent variable by fitting an "S"-shaped logistic function whose output is bounded between the two extreme values, 0 and 1 [49].

Fisher linear discriminant

The Fisher linear discriminant classifier maximizes the separation between the projected class means and minimizes the class overlap, leading to well-separated classes [50].

Naive Bayes

Naive Bayes is a classification technique based on the Bayes theorem with an assumption of independent predictors; each feature is assumed to be independent of the other features within a class [51].

Support Vector Machine (SVM)

The SVM algorithm seeks the best line or decision boundary to segregate n-dimensional space into classes, so that new data points can be assigned correctly. The best decision boundary, a hyperplane, is constructed from the extreme points (support vectors) [52].

Random forest

Random forest is an ensemble learning method–a combination of multiple classifiers to solve a complex problem and improve model performance. As the name suggests, a random forest contains several decision trees built on various subsets of the given dataset and averages their predictions to improve predictive accuracy [53].

K Nearest Neighbour (KNN)

The KNN algorithm stores all the available data and classifies a new data point by similarity, assigning it to the class most common among its nearest neighbours [54].

Model training criteria

The proportion between the training and testing sets is 90:10. Ten-fold cross-validation with random splits was performed on all the datasets.
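The splitting scheme can be sketched with scikit-learn (fully synthetic toy data; logistic regression stands in here for any of the classifiers above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

# Toy stand-in for a hematology feature matrix and binary target.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# 90:10 train/test split, stratified on the class labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42, stratify=y)

# 10-fold cross-validation with random (shuffled) splits.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(), X_train, y_train,
                         cv=cv, scoring="roc_auc")
```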

Hyper-parameter used in XGBoost classifier.

To normalize the imbalance between the numbers of negative and positive data points in the XGBoost classifier, the hyper-parameter "scale_pos_weight" (https://xgboost.readthedocs.io/en/stable/parameter.html#parameters-for-tree-booster) was introduced. The scale_pos_weight value scales the gradient for the positive class. For example, "scale_pos_weight" = 2 gives the positive class twice the weight of the negative class and penalizes misclassification of the positive class more heavily. The loss curve (optimized to obtain a better model) is therefore affected differently by positive and negative misclassifications. A large scale_pos_weight can help the model achieve better performance on positive-class prediction (overfitting the positive class) at the cost of worse performance on the negative class or both classes. Hence, we have consistently used the default scale_pos_weight (the ratio of the number of negative to positive entries) throughout this report.
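The default weighting can be computed directly from the class counts (the helper function is ours; the counts below are the dataset-1 figures quoted earlier):

```python
import numpy as np

def default_scale_pos_weight(y: np.ndarray) -> float:
    """Ratio of negative to positive training examples, the value
    passed to XGBoost's scale_pos_weight throughout this report."""
    n_neg = int((y == 0).sum())
    n_pos = int((y == 1).sum())
    return n_neg / n_pos

# Dataset-1 class counts: 1446 negatives, 558 positives.
y = np.array([0] * 1446 + [1] * 558)
w = default_scale_pos_weight(y)  # ~2.59, the ratio reported in the text
```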

Imputation for other ML models.

Unlike XGBoost, most ML algorithms cannot handle null values, thus requiring data imputation. We imputed missing values with the IterativeImputer module of the scikit-learn package (https://scikit-learn.org/stable/modules/impute.html#multivariate-feature-imputation), which imputes values for the null data points of each feature iteratively. It does so by fitting a regressor to the other feature columns (x-parameters) for records with known values of the target feature (y-parameter) and then predicting the missing values of the target feature. A Chi-squared test was performed between the imputed and non-imputed datasets, which returned statistic values of zero in almost all cases (S2 Table in S1 Appendix). The null hypothesis tested in the Chi-squared test is that the two populations follow the same distribution (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html). A Chi-squared statistic of zero means the null hypothesis cannot be rejected; hence, the two populations behave similarly.
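A minimal sketch of the imputation step (the 3×3 matrix is illustrative only; note that IterativeImputer requires the experimental-feature import):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix with one missing value; each column is imputed by regressing
# on the remaining columns, iteratively, until convergence.
X = np.array([[7.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [10.0, 5.0, 9.0]])

imputer = IterativeImputer(random_state=0)
X_imputed = imputer.fit_transform(X)
```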

Performance metrics.

Internal Evaluation. Four metrics were used for internal evaluation of the models, namely sensitivity, specificity, accuracy and AUC scores. The performance metrics were defined in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) (Eqs 1–3). False negatives, in terms of diagnosis, are the cases where COVID-19-positive patients would be classified as negative and possibly allowed to go home. These are thus more harmful than false-positive cases, i.e., healthy individuals predicted as COVID-19 positive. In a screening task, accuracy is defined as the overall success in correctly identifying both patients and healthy individuals.

Sensitivity = TP / (TP + FN) (1)

Specificity = TN / (TN + FP) (2)

Accuracy = (TP + TN) / (TP + TN + FP + FN) (3)
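The standard confusion-matrix metrics above translate directly into code (the counts below are illustrative, not results from the study):

```python
def sensitivity(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of actual positives correctly identified: TP/(TP+FN)."""
    return tp / (tp + fn)

def specificity(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of actual negatives correctly identified: TN/(TN+FP)."""
    return tn / (tn + fp)

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all cases correctly classified."""
    return (tp + tn) / (tp + tn + fp + fn)

# Illustrative confusion-matrix counts.
tp, tn, fp, fn = 80, 90, 10, 20
```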

For all the above-mentioned metrics, interval values were computed at the 95% confidence limit using the standard normal-approximation formula (Eq 4):

metric ± 1.96 × √(metric × (1 − metric) / n) (4)

where the constant 1.96 is the number of standard deviations corresponding to the 95% confidence limit and n is the number of samples.

The fourth metric was the Area Under the ROC Curve (AUC). The AUC was computed from prediction scores using the roc_auc_score module (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html) of the sklearn.metrics library. A ROC curve (Receiver Operating Characteristic curve) plots the performance (True Positive Rate (TPR) versus False Positive Rate (FPR)) of a classification model at all classification thresholds. In terms of diagnosis, AUC reflects how well the model ranks actual positive cases above negative ones.

TPR is synonymous with sensitivity, also known as recall. FPR is FP/(FP + TN). AUC measures the area under the ROC curve (TPR versus FPR) from (0,0) to (1,1) along the x-axis (FPR axis). AUC ranges from 0 to 1; 0 implies a model whose predictions are all wrong, and 1 indicates a model whose predictions are all correct.

External Evaluation. In addition to the above metrics, a few more measures, capable of handling both balanced and imbalanced datasets, were considered for external evaluation.

Validation of the sample size for the external datasets (dataset cardinality). Minimum Sample Size (MSS) was computed to validate the sample size of the external dataset following the method described in the literature [36,55].

Measuring data similarity between the training and the external validation datasets. The data similarities between the training and external validation datasets were determined using the Kolmogorov–Smirnov (KS) test [56]. The KS test is a non-parametric test that determines whether two given samples come from the same distribution.
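A sketch of the two-sample KS test with SciPy (synthetic samples standing in for a haematological feature in the training and external datasets):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, size=500)     # e.g. training-set values
external_feature = rng.normal(loc=0.0, size=500)  # drawn from the same distribution

stat, p_value = ks_2samp(train_feature, external_feature)
# A small KS statistic / large p-value means the two samples are
# consistent with coming from one distribution; a shifted external
# sample would instead give a large statistic and a tiny p-value.
```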

Handling data imbalance between training and test datasets. To handle potential imbalance in the target distribution, “Balanced Accuracy” and “F-beta” scores were computed using the sklearn module in Python (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html). “Balanced Accuracy” is defined as the average of sensitivity and specificity (Eq 5); the best value is 1 and the worst value is 0 (when adjusted = False). The F-beta score is a robust scoring scheme for balanced and unbalanced datasets. F-beta combines “precision” and “recall” through a weighted harmonic mean of the two (Eqs 6–8). A harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the individual values in a data set. To note, “precision” is the fraction of positive predictions that are correct, and “recall” is the fraction of all actual positives that are correctly predicted. A larger value of beta gives more weight to recall, while a smaller value of beta gives more weight to precision. The F-beta score lies between 0 and 1 (1 is the best value). In this study we used beta equal to two, emphasizing recall.

Balanced Accuracy = (Sensitivity + Specificity) / 2 (5)

Precision = TP / (TP + FP) (6)

Recall = TP / (TP + FN) (7)

F-beta = (1 + β²) × Precision × Recall / (β² × Precision + Recall) (8)

TP, TN, FP and FN represent true positives, true negatives, false positives and false negatives, respectively.
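Both external metrics are available directly in scikit-learn (the label vectors below are illustrative, chosen so that sensitivity and specificity are both 0.75):

```python
from sklearn.metrics import balanced_accuracy_score, fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]  # TP=3, FN=1, TN=3, FP=1

bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of sensitivity and specificity
f2 = fbeta_score(y_true, y_pred, beta=2)           # beta=2 weights recall over precision
```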

Model Calibration using Brier Score: The Brier score evaluates the accuracy of a probabilistic prediction; it behaves like a cost function [57]. For a binary prediction, the score is defined as in Eq 9:

BS = (1/N) Σᵢ (pᵢ − oᵢ)² (9)

where pᵢ is the predicted probability of occurrence of event i, oᵢ is the actual outcome (0 or 1) of the event, and N is the number of predictions.
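The Brier score is also available in scikit-learn (the labels and probabilities below are illustrative):

```python
from sklearn.metrics import brier_score_loss

y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.9, 0.8, 0.3]

# Mean squared error of the predicted probabilities:
# ((0.1)^2 + (0.1)^2 + (0.2)^2 + (0.3)^2) / 4 = 0.0375
bs = brier_score_loss(y_true, y_prob)
```

Lower is better: a perfectly calibrated, perfectly confident model scores 0, and a model that always predicts 0.5 scores 0.25.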

Model utility using standardized net benefit: The standardized net benefit (sNBpt) was computed using the formula given by Riley et al. [58] (Eq 10):

sNBpt = [Sensitivity × φ − (1 − Specificity) × (1 − φ) × pt / (1 − pt)] / φ (10)

where φ represents the observed outcome event proportion; for example, if there are 83 COVID-19-positive patients in a total sample size of 602, φ is 0.14. pt represents the probability threshold, generally taken as 0.5, corresponding to the assumption that fifty percent of the population is positive and the remainder negative. However, this is mostly not the case in reality. Earlier reports used pt = 0.08 for the highest-risk group [59].
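A sketch of the standardized net benefit computation, assuming the Riley et al. formulation (the helper function is ours, not the study's code; the sensitivity, specificity, and threshold values are illustrative):

```python
def standardized_net_benefit(sens: float, spec: float,
                             phi: float, pt: float) -> float:
    """Net benefit at probability threshold pt, divided by the observed
    outcome event proportion phi (assumed Riley et al. formulation)."""
    nb = sens * phi - (1.0 - spec) * (1.0 - phi) * pt / (1.0 - pt)
    return nb / phi

# Illustrative values: phi = 83/602 (the dataset-1 proportion), pt = 0.5.
snb = standardized_net_benefit(sens=0.85, spec=0.90, phi=83 / 602, pt=0.5)
```

Note that with phi = 0.5 and pt = 0.5 the expression collapses to sensitivity − (1 − specificity), a useful sanity check.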

Meta-validation of the external performances: The meta-validation procedure was adopted from the method described in the literature [36]. Here, the performances of the ML models were assessed in two dimensions: i) dataset similarity (between the training dataset and the external validation dataset), measured using the KS test described above, and ii) dataset cardinality, measured in terms of the minimum sample size (MSS) described above. The performance was evaluated in terms of discrimination (balanced accuracy), utility (standardized net benefit) and calibration (Brier score). Two training/external-validation dataset pairs were used for meta-validation in this study.

Results and discussion

Correlation between features and SARS-COV-2 results in different clinical datasets

Three independent clinical datasets (dataset 1, dataset 2, and dataset 3) were curated and processed from hospitals in Brazil and Italy. The point-biserial correlations (positive or negative) and p-values were computed between the features and the SARS-COV-2 results (Fig 2).
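The point-biserial computation can be sketched with SciPy (fully synthetic data; the feature is a hypothetical stand-in for a parameter such as platelet count, constructed to decrease in positive cases):

```python
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(0)
sars_result = rng.integers(0, 2, size=200)           # 0 = negative, 1 = positive
# Hypothetical feature that is lower, on average, in positive records.
feature = rng.normal(size=200) - 0.8 * sars_result

r, p = pointbiserialr(sars_result, feature)
# r < 0 here: the feature decreases in SARS-CoV-2-positive records,
# mirroring the negative correlations reported for platelets, leukocytes,
# and eosinophils in the text.
```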

Fig 2. Point biserial correlation coefficients between SARS-COV-2 results and individual features for a) dataset 1 b) dataset 2 and c) dataset 3.

Parameters with p values < 0.05 are shown in blue, remaining values are in red.

https://doi.org/10.1371/journal.pone.0316467.g002

For dataset 1, the following features were correlated (p-value <0.05) with the SARS-COV-2 result: age, regular ward admittance, IC admittance, hemoglobin, hematocrit, platelets, MPV, leukocytes, eosinophils and monocytes. Among the CBC parameters, monocytes, hemoglobin, hematocrit, RBC and MPV showed a significant increase in their values in SARS-COV-2 patients (positive point-biserial correlation). The remaining parameters decreased during infection (negative point-biserial correlation) (Fig 2). Careful observation revealed that in non-admitted patients the increase in monocytes was largest, suggesting that innate immunity was handling the infection. On the other hand, platelet volume (MPV) increased and platelet counts decreased in regular ward patients, clearly indicating an increase in platelet size. Thus, the immune system is affected and the number of immune cells decreases, justifying the negative correlation of eosinophils, leukocytes, and platelet count with SARS-COV-2 disease. Low platelet counts were reported for severe COVID-19 patients and were even lower in non-survivors than in survivors [60]. The correlation coefficient values between SARS-COV-2 results and the different features reported elsewhere were similar to these observations [23]. For dataset 2, twenty-six features were correlated (p-value <0.05) with the SARS-COV-2 result (Fig 2). The following parameters showed positive point-biserial correlations: sex, aspartate aminotransferase, alanine aminotransferase, lactate dehydrogenase, hemoglobin, hematocrit and MCHC; i.e., these parameter values increased in COVID-19 patients. Negative point-biserial correlations were observed for leukocytes, platelets, erythrocytes, eosinophils, basophils, neutrophils, lymphocytes and monocytes. For dataset 3, twenty-two features were correlated (p<0.05) with the SARS-COV-2 result (Fig 2).
Seven parameters, namely, serum ferritin, serum magnesium, MPV, lactate dehydrogenase, GGT, aspartate amino transferase and alanine transferase have showed positive point biserial correlations with SARS-COV-2 results. Whereas basophil, eosinophil, erythrocytes, hematocrits, hemoglobin, leukocytes, lymphocytes, neutrophils, monocytes, platelets, MCHC, serum albumin, serum calcium and total bilirubin have showed negative point biserial correlations with SARS-COV-2 results.
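The point-biserial statistics above pair a binary outcome (SARS-COV-2 result) with a continuous parameter. A minimal sketch using SciPy is shown below; the numbers are illustrative toy values, not data from any of the study datasets:

```python
# Point-biserial correlation between a binary SARS-CoV-2 result and a
# continuous CBC parameter (toy values, not study data).
from scipy import stats

covid_result = [1, 1, 1, 0, 0, 0, 0, 0]                  # 1 = positive, 0 = negative
monocytes = [9.1, 8.7, 9.4, 7.2, 6.8, 7.0, 6.5, 7.1]     # e.g. % of leukocytes

r, p = stats.pointbiserialr(covid_result, monocytes)
# r > 0 here: the parameter is higher in positive cases, matching the
# sign convention used for Fig 2 (positive point-biserial correlation).
print(r, p)
```

Parameters with p < 0.05 are the ones counted as "correlated" in the comparisons above.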

Comparing datasets 1 to 3, a few features correlated with SARS-COV-2 results (p < 0.05) were common across the datasets: hemoglobin, hematocrit, platelets, leukocytes, eosinophils and monocytes. These features either decreased (negative point-biserial correlation) or increased (positive correlation) in SARS-COV-2 patients, to different degrees according to disease severity. Platelets, leukocytes and eosinophils consistently showed negative correlations with SARS-COV-2 results across all the datasets. Monocytes also showed negative correlations, except in dataset 1. Other parameters, such as basophils, neutrophils and lymphocytes, showed negative correlations in both datasets 2 and 3. Hemoglobin and hematocrit showed positive correlations in dataset 2 but negative correlations in dataset 3. These observations indicate that certain hematological parameters depend on the demographic population. In summary, the features correlated (p < 0.05) with SARS-COV-2 results across all three datasets were hemoglobin, hematocrit, platelets, leukocytes, eosinophils and monocytes; the features common across datasets 2 and 3 were hemoglobin, platelets, leukocytes, eosinophils, monocytes, hematocrit, erythrocytes, lymphocytes, basophils, neutrophils, LDH, serum calcium, MCHC and ALT. To generalize further, we compared these observations with hematological parameters from Indian populations that correlated with D-dimer values [42]; henceforth, this dataset is referred to as the Indian dataset. The features common across datasets 1, 2, 3 and the Indian dataset were platelets, eosinophils, monocytes and leukocytes. These four features were used earlier by Banerjee et al. (on the Brazilian dataset) for hematology-based ML model development [21]; hence, the same set of features was used here for external validation of the models.

Curation of working sub-datasets

We curated three primary datasets from hospitals in Brazil and Italy. Based on the hematological parameters common across datasets 1, 2, 3 and the Indian dataset, four-feature subsets were derived from datasets 1 and 2, named 1-four-features and 2-four-features, respectively. The 1-four-features sub-dataset contained 602 records (83 positives, 519 negatives), giving a negative-to-positive sample-size ratio of 6.25. The 2-four-features sub-dataset contained 1736 records (816 positives, 920 negatives); the negative-to-positive ratio thus increased to 1.13, compared with 0.81 in dataset 2.

Apart from the four-feature datasets, we also considered the standard full blood count [21,61]: hematocrit, haemoglobin, platelets, mean platelet volume (MPV), red blood cells (RBC), lymphocytes, mean corpuscular haemoglobin concentration (MCHC), leukocytes, basophils, neutrophils, mean corpuscular haemoglobin (MCH), eosinophils, mean corpuscular volume (MCV), monocytes and red blood cell distribution width (RBCDW). Most of these parameters were present across all the datasets, although not all showed correlations (p < 0.05) with SARS-COV-2 results. These fourteen-feature subsets, derived from datasets 1 and 3, were named 1-fourteen-features and 3-fourteen-features. The 1-fourteen-features sub-dataset contains 602 records (83 positives, 519 negatives); the negative-to-positive sample-size ratio was 6.25. The 3-fourteen-features sub-dataset included 12105 records in total, with a negative-to-positive ratio of 0.356, i.e., mildly skewed towards positive results. Hence, a scale_pos_weight of 0.356 was used to treat the data imbalance.
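The scale_pos_weight values quoted above follow the usual XGBoost convention of weighting the positive class by the negative-to-positive count ratio. A minimal sketch (the per-class counts for the 3-fourteen-features set are inferred from the reported total of 12105 and ratio of 0.356, not stated in the text):

```python
# Class-imbalance weight for XGBoost: the usual convention is
# scale_pos_weight = n_negative / n_positive, up-weighting whichever
# class is in the minority.
def scale_pos_weight(n_negative: int, n_positive: int) -> float:
    """Weight applied to the positive class during XGBoost training."""
    return n_negative / n_positive

# 1-four-features sub-dataset: 519 negatives, 83 positives
w1 = round(scale_pos_weight(519, 83), 2)       # matches the reported 6.25

# 3-fourteen-features sub-dataset: counts inferred from the reported
# total (12105) and negative-to-positive ratio (0.356)
w3 = round(scale_pos_weight(3178, 8927), 3)    # below 1: positives dominate
```

A weight below 1, as in the second case, down-weights the majority positive class rather than up-weighting it.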

These four curated sub-datasets were used to train seven different machine learning models. A summary of all the raw and processed datasets, along with the four sub-datasets, is represented as a scheme (Fig 3).

Fig 3. Description of data sources used for training, testing and external validation of different ML-models based on haematological features for COVID-19 characterization.

https://doi.org/10.1371/journal.pone.0316467.g003

Benchmarking of seven different ML models on working sub-datasets

The 10-fold cross-validation studies on the four sub-datasets showed that the random splits did not affect the performance metrics (only nominal changes in the standard deviation values, ranging from 0.01 to 0.19) (S3 Table in S1 Appendix). The performances of the different models on the four sub-datasets were measured using receiver operating characteristic (ROC) curves (Fig 4). A ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) of a classification model at all classification thresholds; in diagnostic terms, the AUC reflects how well the model ranks actual positives above negatives. For all four sub-datasets, the ROC curves showed optimal performance for the XGBoost model. XGBoost, with its ternary tree splits (missing values are routed along a separate branch), handled the incomplete, imbalanced data better than the other ML models; notably, even though data imputation was performed for the other models, the results were still best for XGBoost. When all the performance metrics, namely accuracy, sensitivity, specificity and AUC score, were considered together, XGBoost still performed the best (S4 Table in S1 Appendix). The elapsed computational time was comparable across the ML models, except for the support vector machine, which took slightly longer (S5 Table in S1 Appendix). Based on this benchmarking, the XGBoost model was selected for subsequent studies.
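The AUC used to rank the seven models has a simple probabilistic reading: it equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case (the Mann-Whitney interpretation). A minimal sketch with toy scores, not study outputs:

```python
# ROC-AUC via the Mann-Whitney interpretation: the fraction of
# positive/negative pairs in which the positive case is scored higher
# (ties count half).
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]   # toy model scores
labels = [1,   1,   1,    0,   0,   0]     # toy true classes
print(auc(scores, labels))                 # 8 of 9 pairs correctly ranked
```

An AUC of 0.5 corresponds to random ranking, 1.0 to perfect separation, which is why the curves in Fig 4 are compared against the diagonal.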

Fig 4. Receiver operating characteristic (ROC) curves across different ML models for the four sub-datasets.

https://doi.org/10.1371/journal.pone.0316467.g004

The sub-dataset 1-four-features showed optimal performance (AUC score 0.94) across all four metrics. The sensitivity was 1.0, i.e., 100% correct prediction of true positives (TP), presumably due to overcorrection of the positive class in a small dataset (n = 602) with few positives (n = 83), which led to a large scale_pos_weight of 6.25. As mentioned in the Methods section, a large scale_pos_weight improves positive-class prediction at the cost of negative-class prediction. Models other than XGBoost showed low sensitivity values for the 1-four-features and 1-fourteen-features datasets (S4 Table in S1 Appendix), presumably attributable to the small data size and the even smaller positive population. Most likely, XGBoost, with its ternary tree splits, handled the class imbalance and missing data more effectively than the imputations performed for the other ML methods. The low-sensitivity problem was absent in sub-dataset 2-four-features, where the numbers of positives and negatives were comparable. The performance of the XGBoost model on the 3-fourteen-features sub-dataset was reasonable (Table 2), whereas the performances of the other models on the same sub-dataset were much lower (S4 Table in S1 Appendix). This overall assessment indicated the general preferability of XGBoost over the other models.
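The sensitivity/specificity trade-off described above follows directly from the confusion-matrix definitions. A minimal sketch; the counts are illustrative, chosen to reproduce the reported sensitivity of 1.0 and specificity of about 0.91 for 1-four-features, and are not the study's actual confusion matrix:

```python
# Metrics from confusion-matrix counts (tp, fn, tn, fp illustrative).
def sensitivity(tp, fn):
    """True positive rate."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate."""
    return tn / (tn + fp)

def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity and specificity; robust to class imbalance."""
    return 0.5 * (sensitivity(tp, fn) + specificity(tn, fp))

# A large scale_pos_weight pushes sensitivity toward 1.0 at the cost of
# specificity: all 83 positives caught, some of the 519 negatives missed.
sens = sensitivity(tp=83, fn=0)
spec = specificity(tn=472, fp=47)
bacc = balanced_accuracy(tp=83, fn=0, tn=472, fp=47)
```

With few positives, a single missed positive moves sensitivity far more than a single false positive moves specificity, which is why the small-positive-population datasets are fragile on this metric.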

Table 2. Internal evaluation of the XGBoost model on different datasets and comparison with published datasets.

Intervals computed at the 95% confidence level.

https://doi.org/10.1371/journal.pone.0316467.t002

Comparison of internal performances of the XGBoost model with published reports

The purpose of the internal validation of the XGBoost model was to evaluate its ability to generalize on any given dataset.

The internal performances of the XGBoost model were compared with methods reported in the published literature (Fig 5). The XGBoost models developed on the 1-four-features and 1-fourteen-features datasets showed AUC scores of 0.92 and 0.94, respectively (Table 2). The sensitivity and specificity obtained on the 1-fourteen-features sub-dataset were 0.75 and 0.89, respectively, and those on the 1-four-features sub-dataset were 1.0 and 0.91 (Table 2). These values were better than those reported in [21] (sensitivity 0.43, specificity 0.91). The model by Banerjee et al. [21] was developed on the Kaggle dataset (the same source as dataset 1 in this work) and was trained on a dataset with highly skewed numbers of positive (n = 39) and negative cases (n = 598); the negative-to-positive ratio was much improved in the current work. Moreover, that report did not provide information on the analytical instruments, analytical principles or units of measurement, limiting the model's replicability and generalizability. Using the same dataset [21], Avila et al. [62] developed a Bayesian model with improved sensitivity and specificity (76.7%). Further CBC-based ML models have been described in the literature. One was developed on a very small dataset (n = 171) collected over a limited time frame (March 7th to March 19th, 2020) [63], with reported sensitivity and specificity of 83% and 82%, respectively. Another group [64] developed a logistic regression model trained on only 380 CBC records and reported high sensitivity (93%) but low specificity (43%). Other ML models, namely Gradient Boosting (n = 3356) [65] and K-nearest-neighbor and Random Forest (n = 1624) [23], were trained on a large number of haematological and biochemical parameters and reported reasonable AUC scores of 0.85 and 0.78, respectively.

Considering the overall raw sample size used in this work (n = 195554) and the robustness of XGBoost with ternary tree splits (its ability to handle the class imbalance of the datasets, as evident from the comparison with the other ML methods) (Table 2 and S4 Table in S1 Appendix), the present approach appears more reliable and general.

Fig 5. Comparative performances of different sub-datasets trained on XGBoost model.

The datasets with published AUC scores [21] were compared.

https://doi.org/10.1371/journal.pone.0316467.g005

XGBoost models used for external evaluation across the populations

The purpose of the external evaluation was to test the sensitivity and specificity of the developed models on independent datasets and to identify potential suspect cases by implementing the XGBoost models.

As per the internal evaluation, the XGBoost model performed best on the sub-dataset 1-four-features; the performance on sub-dataset 1-fourteen-features was comparable, with a slightly lower AUC score (0.94 versus 0.92). On these scores, the XGBoost models developed on sub-datasets 1-four-features and 1-fourteen-features appear to be attractive candidates for external evaluation. However, models developed on these datasets have limitations, notably the risk of overfitting due to the small sample size. Moreover, the Kaggle dataset (the source of dataset 1) lacks descriptions of the analytical instruments, analytical principles and measurement units; hence, models built on these training data raise concerns of reproducibility and generalizability [66]. Therefore, we selected two other XGBoost models, with four and fourteen parameters, built on sub-dataset 2-four-features (Italy) and sub-dataset 3-fourteen-features (Brazil), albeit with a slightly lower AUC score of 0.842 in both cases. These two served as the final working (training) models for external evaluation.

External evaluation of XGBoost models with four hematological parameters across Italian and Brazilian populations

External evaluation of the four-parameter model was performed on the test dataset 1-four-features from Brazil; note that the training dataset, 2-four-features, was from Italy. The sensitivity was 0.81, with a low specificity of 0.45. The discrimination was evaluated using two metrics, the AUC score and the balanced accuracy, which were 0.69 and 0.63, respectively (Table 3a). The AUC score can be affected by class imbalance, whereas balanced accuracy is designed for imbalanced datasets and is thus more reliable in this particular case. The clinical utility was computed as the standardized net benefit, using the formulation reported earlier by Riley et al. [58]. The value was -2.6 at a prediction threshold of 0.5. Considering the large class imbalance (the ratio of positive cases to total sample size was 0.14), we also used a probability threshold of 0.15, which yielded a standardized net benefit of -0.084. Both values were worse than the "treat none" scenario, i.e., no clinical decision administered to any patient. Under a "treat all" scenario, i.e., all patients treated irrespective of COVID-19 status, the net benefit was -0.012 and 0.72 for probability thresholds of 0.15 and 0.5, respectively. Of note, standardized net benefit has previously been computed at different probability thresholds depending on the degree of risk [59].
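The standardized net benefit comparison above can be made concrete with the usual decision-curve formulation, which we take to be the one behind the Riley et al. [58] values: net benefit weights false positives by pt/(1-pt) at threshold pt, and standardization divides by prevalence. A minimal sketch; the sensitivity/specificity inputs are the 0.5-threshold values from Table 3a reused for illustration, not the threshold-specific values the study would have used:

```python
# Decision-curve net benefit at probability threshold pt.
def net_benefit(sens, spec, prev, pt):
    # false positives are weighted by the odds of the threshold
    return sens * prev - (1 - spec) * (1 - prev) * pt / (1 - pt)

def standardized_net_benefit(sens, spec, prev, pt):
    # standardization divides by prevalence so 1.0 is the ideal value
    return net_benefit(sens, spec, prev, pt) / prev

def treat_all_net_benefit(prev, pt):
    # "treat all": every patient classified positive (sens=1, spec=0)
    return net_benefit(1.0, 0.0, prev, pt)

# External-set prevalence 0.14; lowering pt from 0.5 to 0.15 reduces the
# false-positive weight from 1.0 to ~0.18.
snb = standardized_net_benefit(0.81, 0.45, 0.14, 0.15)
nb_all = treat_all_net_benefit(0.14, 0.5)
```

"Treat none" is the zero line; any model whose curve falls below it at the chosen threshold, as happened here, has negative utility at that threshold.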

Table 3. External evaluation of the XGBoost algorithm based on a) 4 hematological features and b) 14 hematological features, trained and tested across different datasets.

Intervals computed at the 95% confidence level.

https://doi.org/10.1371/journal.pone.0316467.t003

External evaluation of XGBoost models with fourteen hematological parameters within the Brazilian populations

The fourteen-parameter XGBoost model was trained on the 3-fourteen-features dataset (n = 12105) and tested on the 1-fourteen-features dataset (n = 602), both from Brazilian populations. However, the samples in the two datasets were collected at different time points and can therefore be considered independent data sources. The AUC score and balanced accuracy for this prediction were 0.86 and 0.73, respectively (Table 3b), better than the cross-population performance of the four-feature XGBoost model. There could be multiple reasons for the better performance of the fourteen-feature model: a) the larger training dataset, b) the training and prediction data coming from the same demographic location (Brazil), and c) the combination of more features with a larger dataset, which presumably yields better results. The computed standardized net benefit [58] at a probability threshold of 0.5 was -0.07. Considering the large imbalance of COVID-19-positive cases in the external validation dataset (the ratio of positive cases to total sample size was 0.14), the standardized net benefit was also computed at a probability threshold of 0.15, yielding a value of 0.33. Both values, at the two probability thresholds, were better than the "treat none" or "treat all" scenarios. Thus, the standardized net benefit indicated lower risk and greater benefit when implementing this model on the same population.

Meta-validation of the ML models on external dataset performances

Meta-validation was performed on two training / external-validation dataset pairs: i) 2-four-features (Italian) / 1-four-features (Brazilian) and ii) 3-fourteen-features (Brazilian) / 1-fourteen-features (Brazilian). The performances of the ML models were assessed along two dimensions, i) dataset similarity (between the training and external validation datasets) and ii) minimum sample size (MSS), and evaluated in terms of balanced accuracy, standardized net benefit and Brier score. The meta-validation results were depicted using an external performance diagram (Fig 6). The results showed that the minimum sample size exceeded 110% for both dataset pairs, as apparent from the hue brightness (Fig 6). The 3-fourteen-features / 1-fourteen-features model (training and external validation sets both from Brazil) performed well ("good" according to the external performance diagram). As reported earlier [36], an external validation can be considered successful when the model exhibits at least good performance on some of the external datasets. The training and external validation datasets from Brazil were of "slight" similarity and had adequate sample size (Fig 6). "Slight" similarity (less than 40%) of the external validation dataset to the training dataset implies a reliable test set in terms of conservative estimates of model performance [36]. For the 2-four-features / 1-four-features model (training dataset from Italy, test dataset from Brazil), the sample size was also adequate according to the hue brightness, and the data similarity between the training and test datasets was "low" (less than 20%), hence reliable; however, the model utility fell outside the "acceptable" region.
According to the previous study [36], several discrepancies between the Italian and Brazilian datasets led to poor performance on external validation, for example, i) the lack of predictive features, such as the "Suspect" feature, in the Brazilian dataset, and ii) the different instruments used for data collection in the two datasets (at minimum, the instrument names were unknown for the Brazilian datasets).
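Of the three meta-validation metrics above, the Brier score is the simplest to state: the mean squared difference between predicted probabilities and binary outcomes. A minimal sketch with toy numbers, not the study's predictions:

```python
# Brier score: 0 is perfect calibration; an uninformative constant
# prediction of 0.5 for every case scores 0.25.
def brier_score(probs, outcomes):
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

probs = [0.9, 0.8, 0.3, 0.2]   # illustrative predicted probabilities
outcomes = [1, 1, 0, 0]        # illustrative true labels
print(brier_score(probs, outcomes))
```

Lower is better, which is why the inter-population value of 0.3 discussed below still counts in this study's favour against some higher published external-performance scores.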

Fig 6. The external performance diagram, generated using the online tool (https://qualiml.pythonanywhere.com), depicting the results of external validation studies on COVID-19 diagnosis trained and tested on i) the same population, Brazil-Brazil (Br), and ii) different populations, Italy-Brazil (It).

The minimum sample size is depicted by the hue brightness. The width of each ellipse equals the width of the 95% confidence interval for the given performance metric.

https://doi.org/10.1371/journal.pone.0316467.g006

Comparison of meta-validation results with the literature reports

Recently, Cabitza et al. from Italy performed meta-validation studies for COVID-19 diagnosis based on hematological parameters [36]. Their training dataset comprised twenty-one hematological features curated from two hospitals in Italy, along with eight external validation datasets: three from Italy, three from Brazil, one from Spain and one from Ethiopia. The total dataset size (training plus external validation) was n = 7046, much lower than in this study (n = 195554, raw data). Their analyses indicated that model performance was good or excellent when the external datasets were from Italy, whereas the external performances were below acceptable values for the Spanish and Brazilian datasets (Fig 8 of ref [36]). The performances were quantified with various metrics; for example, the average F2-score was 87%±3% for the Italian datasets and 37%±4% for the Brazilian datasets, clearly indicating that the model performed well when trained and tested on the same population (in this case, Italy). The authors pointed out that the poor performance on external datasets from different populations could have various causes, such as differences in testing equipment, reference ranges, ethnic variability, phenotypic variability and human immune response. In the present study, we considered two external validation settings: i) training and external datasets from the same population (Brazil) and ii) training (Italy) and external (Brazil) datasets from different populations. The intra-population F2-score was 80% and the inter-population F2-score was 49%; the inter-population F2-score in this study was thus better than the published report (stated above). Similarly, the Brier score for intra-population external validation in this study (Fig 6) was comparable to the published report (Table 4 of [36]), whereas the Brier score for the inter-population setting (0.3) was lower than some of the external performances in the earlier study, indicating better accuracy of the probabilistic predictions.
The somewhat lower intra-population external-validation performance in this study could be due to the anonymised data in dataset 1 (obtained from the Kaggle dataset) or the missing unit information for the haematological features in that dataset. Comparing model utility in terms of standardized net benefit, the current model performed poorly for inter-population external validation at both probability thresholds (Table 3) compared with the earlier report [36]; one possible reason is again the anonymity of the data and instrumentation for dataset 1. The standardized net benefit for intra-population validation was better at the probability threshold of 0.15 and close to the acceptable range reported earlier [36]; however, the utility of this model remained lower than that of the reported models [36]. The limitation could be the unknown instrumentation, units and normalization schemes of the Brazilian dataset, which left a gap in comparing and calibrating the results. The strength of the current study lies in its data size (n = 195554) compared with the above-mentioned report (n = 7046). Moreover, the current study can predict COVID-19 using fewer haematological features (four or fourteen) than the previous study (twenty-one) without compromising external model performance. In addition, the F2-score and Brier score for the inter-population external validations were superior to some of those reported earlier. Thus, the current models appear more parsimonious and more generally applicable.

Impact of Instantial Variation (IV) on the external validation datasets

Instantial variation (IV) is defined as the possible within-subject variation that is not due to population differences or errors but is intrinsic to a given instance or the measurement process [67]. Biological variations have been estimated for different blood parameters, e.g., leukocytes [68], platelet parameters [69], and red blood cell and reticulocyte parameters [70], with blood samples collected from healthy individuals at intervals of weeks (medium-term variation) or days (short-term variation). In most cases, the biological variations differed only marginally between the short-term and medium-term instances. The impact of IV on the quality of ML models has been tested previously; model performances were poorer on IV-perturbed data and improved when the IV-perturbed data were augmented with synthetic data [67].

As each instance differs from the others, we studied the effect of IV on the XGBoost model tested on the external validation datasets a) across the Italian and Brazilian populations and b) within the Brazilian population. The IV-perturbation and data augmentation were implemented in Python v. 3.10.4 using numpy v. 1.23.0, scikit-learn v. 1.1.1 and scikit-weak v. 0.2.0, adapting the Zenodo code provided in the literature [67]. The IV-perturbation was performed by resampling each data point from a Gaussian distribution whose width was obtained from the variability scores reported for the different hematological parameters [67-70]. For the IV-perturbation-plus-augmentation method, one hundred synthetic data points were generated from the same Gaussian distributions. Three XGBoost performance metrics were reported (Table 4).
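The perturbation step described above can be sketched as follows; the coefficient-of-variation values and feature values are placeholders, not the ones taken from refs [67-70] or from the study data:

```python
# IV-perturbation: resample each measurement from a Gaussian centred on
# the observed value, with sd set by the within-subject coefficient of
# variation of that parameter.
import numpy as np

def iv_perturb(X, cv, rng):
    """Resample each column j of X with sd = cv[j] * X[:, j]."""
    X = np.asarray(X, dtype=float)
    sd = np.abs(X * np.asarray(cv))      # per-entry standard deviation
    return rng.normal(loc=X, scale=sd)

rng = np.random.default_rng(0)
X = np.array([[250.0, 7.1], [180.0, 9.3]])   # e.g. platelets, leukocytes
cv = [0.09, 0.11]                            # placeholder within-subject CVs

X_pert = iv_perturb(X, cv, rng)
# Augmentation: pool many perturbed copies of the same records.
X_aug = np.vstack([iv_perturb(X, cv, rng) for _ in range(100)])
```

Each perturbed copy simulates re-measuring the same patient, which is what makes the standard deviation over one hundred iterations a measure of IV robustness.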

Table 4. XGBoost model performance metrics from one hundred iterations on the external validation datasets for a) 2-four-features (Italian) / 1-four-features (Brazilian) and b) 3-fourteen-features (Brazilian) / 1-fourteen-features (Brazilian), with the IV-perturbation and IV-perturbation-plus-augmentation methods.

Baseline values are also reported. The baseline data were generated using lower fuzziness with resampling in a single step, in contrast to the perturbation and perturbation-plus-augmentation data, where higher fuzziness was applied by sequentially resampling the baseline data. Standard deviation values are shown in parentheses.

https://doi.org/10.1371/journal.pone.0316467.t004

A notable point in the performance metrics was the very small standard deviation over the one hundred iterations, more than an order of magnitude lower than the confidence intervals reported elsewhere (Fig 2 of ref [67]). Although IV-perturbation affected model performance on the inter-population data (Italy versus Brazil), the intra-population data were least affected (Table 4). Thus, the XGBoost model was robust on the external validation dataset from the same population, indicating its potential applicability. There could be three reasons for the smaller IV-perturbation effect on the XGBoost model (at least for one dataset) compared with the literature report [67]: i) the external validation data size was sufficiently large, ii) the data were internally normalized (as mentioned in the Methods section), and iii) the inherent robustness of the XGBoost model.

Blind prediction of XGBoost models with four hematological parameters on West European populations

To further validate the efficacy of the working models, we considered one more dataset from the published literature, with thirty-seven features including data points from different stages (time points) of COVID-19 [25]. The dataset was used without pre-processing (no feature, record or data point removed). The source authors reported two distinct stages of COVID-19 patients, W.E.-early and W.E.-advanced. Distributions of the four haematological parameters across the datasets 1-four-features, 2-four-features, W.E.-early and W.E.-advanced were compared (Fig 7). For leukocytes and platelets, the distributions were almost the same across all the datasets. For eosinophils and monocytes, the distributions of 2-four-features and W.E.-early were similar, as were those of 1-four-features and W.E.-advanced. The external performance of the model on the W.E.-early dataset (0.65) was higher than on the W.E.-advanced dataset (0.52) (Table 5). Of note, the W.E.-early and W.E.-advanced datasets contain information only from COVID-19 patients, with no negative controls; hence, only the sensitivity metric was reported (Table 5).
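The distribution comparison in Fig 7 is visual. A quantitative complement (our suggestion, not part of the published analysis) would be a two-sample Kolmogorov-Smirnov test between a training and a test dataset for one parameter; the arrays below are synthetic:

```python
# Two-sample KS test as a quantitative check of distribution proximity
# between a training set and an external test set (synthetic data).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
train_platelets = rng.normal(250, 60, size=500)  # synthetic "training" dist.
test_platelets = rng.normal(255, 65, size=200)   # similar synthetic "test" dist.

stat, p = ks_2samp(train_platelets, test_platelets)
# A small KS statistic / large p-value indicates no detectable shift,
# the kind of proximity argued for W.E.-early vs 2-four-features.
```

A detectable shift for a test set, as between W.E.-advanced and the Italian training data, would flag the drop in external sensitivity before deployment.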

Fig 7. Distributions of four hematological parameters across four different datasets (two training datasets, 1-four-features and 2-four-features, and two test datasets, W.E.-early and W.E.-advanced).

The hematological parameters are a) platelets, b) leukocytes, c) eosinophils and d) monocytes. These distributions indicate the proximity of the individual test datasets to the training datasets.

https://doi.org/10.1371/journal.pone.0316467.g007

Table 5. Blind prediction of XGBoost model trained on dataset 2-four-feature and tested on W.E.-early and W.E.-advanced datasets.

The early and advanced datasets contain only COVID-19-positive patient results; no negatives were available. Hence, only sensitivity values were reported.

https://doi.org/10.1371/journal.pone.0316467.t005

Deployment of prediction server

We deployed a web server that accepts two sets of inputs for binary COVID-19 prediction: i) four hematological parameters (leukocyte, monocyte, eosinophil, and platelet counts) and ii) fourteen parameters (CBC and WBC differentials). The server outputs the COVID-19 result, either positive or negative, with the COVID-19 probability reported as a percentage. Link to the web server: https://covipred.bits-hyderabad.ac.in/home. The design of the web server is given in the supplementary information.
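The output convention described above (a positive/negative call plus a probability percentage) can be sketched as follows; this is our illustration of the convention, not the deployed covipred code, whose implementation is not published here:

```python
# Convert a model's positive-class probability into the server-style
# output: a binary call plus the probability as a percentage.
def format_prediction(p_positive: float, threshold: float = 0.5):
    label = "positive" if p_positive >= threshold else "negative"
    return label, round(100 * p_positive, 1)

label, pct = format_prediction(0.87)
print(label, pct)   # positive 87.0
```

The threshold defaults to 0.5 here for illustration; as discussed under the net-benefit analysis, a lower threshold may be preferable for imbalanced populations.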

Conclusion

Considering the need to develop an alternate protocol for rapid, near-accurate, and cheaper COVID-19 detection technique, we aimed to externally validate the haematology-based ML prediction by optimizing the features, which is yet to be fully understood. We have integrated published clinical records from Brazil, Italy, and West Europe hospitals. The data from Brazil and Italy were classified into four sub-datasets and trained on seven different ML methods. The XGBoost algorithm consistently performed superior to other ML methods. The internal performances of the XGBoost models were compared with the published reports available on the same datasets; the models reported here outperformed the published reports. The meta validation of the ML models on external datasets indicated the acceptability of the external performances; these results were either comparable or superior to the published reports. In this study, two sets of haematological parameters were selected for ML models, i) four features–leukocytes, monocytes, platelets and eosinophils (those were common across three different countries, namely, Brazil, Italy and India) and ii) fourteen features–CBC and WBC differentials. Both the set of parameters are represented by basic routine blood tests, usually, available in low-resource settings (CBC dataset). The ML methodology developed and externally validated here, was based on routine blood examination outcomes–available for inpatients and emergency-admitted patients. The current models, developed on a large sample cohort, can be used parsimoniously (as it used lesser number of haematological features compared to the previous study) for COVID-19 diagnosis. Moreover, the XGBoost model was marginally perturbed by Instantial Variations (IV), albeit noise, in the intra-population external validation dataset. 
To note, ML models have their own limitations, in terms of dataset dependencies, size of the datasets, ethnic variabilities, phenotypic variabilities, analytical instrumentations for clinical chemistry tests, etc. This shortcoming was, to some extent, reflected in the external validation; for four-feature models—specificity value was good but sensitivity value was poor (Table 3a). As reviewed earlier, variation in analytical instrumentations may exhibit extreme heterogeneity, thus may cause concern for the further usage of the model [66]. On the other hand, external validation for fourteen-feature model exhibited good sensitivity but poor specificity values (Table 3b). One way to overcome this limitation would be using the blood test results from same analytical instrumentations. However, as suggested in the literature, CBC standardization was less problematic due to change in analytical instrumentations compared to other tests [23]. Nevertheless, ML-based models are low cost and depends on rapid blood test exam, providing a good start for initial screening. Moreover, these results could be combined and compounded with qRT-PCR tests with an expectation of higher accuracy and sensitivity for the suspected cases. Thus, large scale identification of COVID-19 patients can be done in timely manner. Two XGBoost models, based on these two sets of features, were selected for external evaluations. The external performance of the fourteen-parameter XGBoost model trained and tested on the Brazilian dataset was comparable to that of the internal performance. However, the external performances of the four-parameter XGBoost model trained on the Italian dataset and tested on a) Brazilian and b) West European datasets were poorer than the internal evaluation. The results promise the utility of these models when trained and tested on the same populations. However, it also warns to use the model, with caution, trained on one population and tested on another. 
The outcome of this work has the potential to serve as an initial screen for COVID-19 based on haematological parameters. In future work, we aim to train and test these models on the Indian population for use at healthcare centres in India.

Supporting information

S1 Appendix. Five supporting tables and an additional description of the webserver development.

https://doi.org/10.1371/journal.pone.0316467.s001

(DOCX)

References

  1. World Health Organization. Use of SARS-CoV-2 antigen-detection rapid diagnostic tests for COVID-19 self-testing: interim guidance, 9 March 2022. World Health Organization; 2022.
  2. Lubna S, Chinta S, Burra P, Vedantham K, Ray S, Bandyopadhyay D. New substitutions on NS1 protein from influenza A (H1N1) virus: Bioinformatics analyses of Indian strains isolated from 2009 to 2021. Health Science Reports. 2022 May;5(3):e626.
  3. Menni C, Valdes AM, Polidori L, Antonelli M, Penamakuri S, Nogal A, et al. Symptom prevalence, duration, and risk of hospital admission in individuals infected with SARS-CoV-2 during periods of omicron and delta variant dominance: a prospective observational study from the ZOE COVID Study. The Lancet. 2022 Apr;399(10335):1618–24. pmid:35397851
  4. Pecoraro V, Negro A, Pirotti T, Trenti T. Estimate false-negative RT-PCR rates for SARS-CoV-2: A systematic review and meta-analysis. Eur J Clin Investigation. 2022 Feb;52(2):e13706.
  5. Pu R, Liu S, Ren X, Shi D, Ba Y, Huo Y, et al. The screening value of RT-LAMP and RT-PCR in the diagnosis of COVID-19: systematic review and meta-analysis. Journal of Virological Methods. 2022 Feb;300:114392. pmid:34856308
  6. Syal K. Guidelines on newly identified limitations of diagnostic tools for COVID-19 and consequences. Journal of Medical Virology. 2021 Apr;93(4):1837–42. pmid:33200414
  7. Khalid MF, Selvam K, Jeffry AJN, Salmi MF, Najib MA, Norhayati MN, et al. Performance of Rapid Antigen Tests for COVID-19 Diagnosis: A Systematic Review and Meta-Analysis. Diagnostics. 2022 Jan 4;12(1):110. pmid:35054277
  8. Brümmer LE, Katzenschlager S, McGrath S, Schmitz S, Gaeddert M, Erdmann C, et al. Accuracy of rapid point-of-care antigen-based diagnostics for SARS-CoV-2: An updated systematic review and meta-analysis with meta-regression analyzing influencing factors. PLoS Med. 2022 May 26;19(5):e1004011. pmid:35617375
  9. Tapari A, Braliou GG, Papaefthimiou M, Mavriki H, Kontou PI, Nikolopoulos GK, et al. Performance of Antigen Detection Tests for SARS-CoV-2: A Systematic Review and Meta-Analysis. Diagnostics. 2022 Jun 4;12(6):1388. pmid:35741198
  10. Guglielmi G. Fast coronavirus tests: what they can and can’t do. Nature. 2020 Sep 24;585(7826):496–8. pmid:32939084
  11. Kahanec M, Lafférs L, Schmidpeter B. The impact of repeated mass antigen testing for COVID-19 on the prevalence of the disease. J Popul Econ. 2021 Oct;34(4):1105–40. pmid:34219976
  12. Taj S, Kashif A, Arzinda Fatima S, Imran S, Lone A, Ahmed Q. Role of hematological parameters in the stratification of COVID-19 disease severity. Annals of Medicine and Surgery. 2021 Feb;62:68–72. pmid:33437468
  13. Tahir Huyut M, Huyut Z, İlkbahar F, Mertoğlu C. What is the impact and efficacy of routine immunological, biochemical and hematological biomarkers as predictors of COVID-19 mortality? International Immunopharmacology. 2022 Apr;105:108542. pmid:35063753
  14. Beadling C, Slifka MK. Quantifying viable virus-specific T cells without a priori knowledge of fine epitope specificity. Nat Med. 2006 Oct 1;12(10):1208–12. pmid:17013384
  15. Manne BK, Denorme F, Middleton EA, Portier I, Rowley JW, Stubben C, et al. Platelet gene expression and function in patients with COVID-19. Blood. 2020 Sep 10;136(11):1317–29.
  16. Güçlü E, Kocayiğit H, Okan HD, Erkorkmaz U, Yürümez Y, Yaylacı S, et al. Effect of COVID-19 on platelet count and its indices. Rev Assoc Med Bras. 2020 Aug;66(8):1122–7. pmid:32935808
  17. Mertoglu C, Huyut M, Olmez H, Tosun M, Kantarci M, Coban T. COVID-19 is more dangerous for older people and its severity is increasing: a case-control study. Med Gas Res. 2022;12(2):51. pmid:34677152
  18. Comer SP, Cullivan S, Szklanna PB, Weiss L, Cullen S, Kelliher S, et al. COVID-19 induces a hyperactive phenotype in circulating platelets. PLoS Biol. 2021 Feb 17;19(2):e3001109. pmid:33596198
  19. Huyut MT, İlkbahar F. The effectiveness of blood routine parameters and some biomarkers as a potential diagnostic tool in the diagnosis and prognosis of Covid-19 disease. International Immunopharmacology. 2021 Sep;98:107838. pmid:34303274
  20. Huyut MT, Huyut Z. Forecasting of Oxidant/Antioxidant levels of COVID-19 patients by using Expert models with biomarkers used in the Diagnosis/Prognosis of COVID-19. International Immunopharmacology. 2021 Nov;100:108127.
  21. Banerjee A, Ray S, Vorselaars B, Kitson J, Mamalakis M, Weeks S, et al. Use of Machine Learning and Artificial Intelligence to predict SARS-CoV-2 infection from Full Blood Counts in a population. International Immunopharmacology. 2020 Sep;86:106705. pmid:32652499
  22. Djakpo DK, Wang Z, Zhang R, Chen X, Chen P, Antoine MMLK. Blood routine test in mild and common 2019 coronavirus (COVID-19) patients. Bioscience Reports. 2020 Aug 28;40(8). pmid:32725148
  23. Cabitza F, Campagner A, Ferrari D, Di Resta C, Ceriotti D, Sabetta E, et al. Development, evaluation, and validation of machine learning models for COVID-19 detection based on routine blood tests. Clinical Chemistry and Laboratory Medicine (CCLM). 2021 Feb 23;59(2):421–31.
  24. Brinati D, Campagner A, Ferrari D, Locatelli M, Banfi G, Cabitza F. Detection of COVID-19 Infection from Routine Blood Exams with Machine Learning: A Feasibility Study. J Med Syst. 2020 Aug;44(8). pmid:32607737
  25. Linssen J, Ermens A, Berrevoets M, Seghezzi M, Previtali G, van der Sar-van der Brugge S, et al. A novel haemocytometric COVID-19 prognostic score developed and validated in an observational multicentre European hospital-based study. eLife. 2020 Nov 26;9.
  26. Abdullah I, Cornelissen HM, Musekwa E, Zemlin A, Jalavu T, Mashigo N, et al. Hematological findings in adult patients with SARS CoV-2 infection at Tygerberg Hospital Cape Town South Africa. Health Science Reports. 2022 May;5(3). pmid:35509400
  27. Canals L M, Canals C A, Cuadrado N C. Incidence moments: a simple method to study the memory and short term forecast of the COVID-19 incidence time-series. Epidemiologic Methods. 2022 Feb 26;11(s1).
  28. Al-Raeei M, El-Daher MS, Solieva O. Applying SEIR model without vaccination for COVID-19 in case of the United States, Russia, the United Kingdom, Brazil, France, and India. Epidemiologic Methods. 2021 Mar 3;10(s1).
  29. Huyut M, Üstündağ H. Prediction of diagnosis and prognosis of COVID-19 disease by blood gas parameters using decision trees machine learning model: a retrospective observational study. Med Gas Res. 2022;12(2):60. pmid:34677154
  30. Huyut MT, Velichko A. Diagnosis and Prognosis of COVID-19 Disease Using Routine Blood Values and LogNNet Neural Network. Sensors. 2022 Jun 25;22(13):4820. pmid:35808317
  31. Huyut M. Automatic Detection of Severely and Mildly Infected COVID-19 Patients with Supervised Machine Learning Models. IRBM. 2023 Feb;44(1):100725. pmid:35673548
  32. Huyut MT, Velichko A, Belyaev M. Detection of Risk Predictors of COVID-19 Mortality with Classifier Machine Learning Models Operated with Routine Laboratory Biomarkers. Applied Sciences. 2022 Nov 28;12(23):12180.
  33. Velichko A, Huyut MT, Belyaev M, Izotov Y, Korzun D. Machine Learning Sensors for Diagnosis of COVID-19 Disease Using Routine Blood Values for Internet of Things Application. Sensors. 2022 Oct 17;22(20):7886. pmid:36298235
  34. Huyut MT, Huyut Z. Effect of ferritin, INR, and D-dimer immunological parameters levels as predictors of COVID-19 mortality: A strong prediction with the decision trees. Heliyon. 2023 Mar;9(3):e14015. pmid:36919085
  35. Campagner A, Carobene A, Cabitza F. External validation of machine learning models for COVID-19 detection based on complete blood count. Health Information Science and Systems. 2021 Dec;9:1–5.
  36. Cabitza F, Campagner A, Soares F, de Guadiana-Romualdo LG, Challa F, Sulejmani A, Seghezzi M, Carobene A. The importance of being external. Methodological insights for the external validation of machine learning models in medicine. Computer Methods and Programs in Biomedicine. 2021 Sep 1;208:106288. pmid:34352688
  37. Beutler E, West C. Hematologic differences between African-Americans and whites: the roles of iron deficiency and α-thalassemia on hemoglobin levels and mean corpuscular volume. Blood. 2005 Jul 15;106(2):740–5.
  38. Lee S, Ong CM, Zhang Y, Wu AH. Narrowed reference intervals for complete blood count in a multiethnic population. Clinical Chemistry and Laboratory Medicine (CCLM). 2019 Aug 27;57(9):1382–7. pmid:30753155
  39. Mertoglu C, Huyut MT, Arslan Y, Ceylan Y, Coban TA. How do routine laboratory tests change in coronavirus disease 2019? Scandinavian Journal of Clinical and Laboratory Investigation. 2021 Feb;81(1):24–33. pmid:33342313
  40. Üstündağ H, Mertoğlu C, Huyut MT. Oxyhemoglobin Dissociation Curve in COVID-19 Patients. Meandros. 2023 May 4;24(1):58–64.
  41. Guan WJ, Ni ZY, Hu Y, Liang WH, Ou CQ, He JX, et al. Clinical characteristics of coronavirus disease 2019 in China. New England Journal of Medicine. 2020 Apr 30;382(18):1708–20. pmid:32109013
  42. Chinnathambi PS, Sripriya T, Krishnakanth G. Analysis of Haematological Parameters of Peripheral Blood in COVID-19 Patients with a Special Emphasis on D-dimer.
  43. Charostad J, Rezaei Zadeh Rukerd M, Shahrokhi A, Aghda FA, Ghelmani Y, Pourzand P, et al. Evaluation of hematological parameters alterations in different waves of COVID-19 pandemic: A cross-sectional study. PLoS One. 2023 Aug 25;18(8):e0290242. pmid:37624800
  44. Chong VCL, Lim KGE, Fan BE, Chan SSW, Ong KH, Kuperan P. Reactive lymphocytes in patients with COVID-19. Br J Haematol. 2020 Jun;189(5):844.
  45. Fan BE, Chong VCL, Chan SSW, Lim GH, Lim KGE, Tan GB, et al. Hematologic parameters in patients with COVID-19 infection. American J Hematol. 2020 Jun;95(6).
  46. Henry BM, de Oliveira MHS, Benoit S, Plebani M, Lippi G. Hematologic, biochemical and immune biomarker abnormalities associated with severe illness and mortality in coronavirus disease 2019 (COVID-19): a meta-analysis. Clinical Chemistry and Laboratory Medicine (CCLM). 2020 Jun 25;58(7):1021–8. pmid:32286245
  47. Jiang S, Huang Q, Xie W, Lv C, Quan X. The association between severe COVID-19 and low platelet count: evidence from 31 observational studies involving 7613 participants. Br J Haematol. 2020 Jul;190(1). pmid:32420607
  48. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016 Aug 13. p. 785–794.
  49. Sperandei S. Understanding logistic regression analysis. Biochemia Medica. 2014 Feb 15;24(1):12–8. pmid:24627710
  50. Balakrishnama S, Ganapathiraju A. Linear discriminant analysis: a brief tutorial. Institute for Signal and Information Processing. 1998 Mar 2;18(1998):1–8.
  51. McCallum A. Graphical models, lecture 2: Bayesian network representation. Retrieved 2019.
  52. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995 Sep;20(3):273–97.
  53. Breiman L. Random forests. Machine Learning. 2001;45(1):5–32.
  54. Cover T, Hart P. Nearest neighbor pattern classification. IEEE Trans Inform Theory. 1967 Jan;13(1):21–7.
  55. Bradley AA, Schwartz SS, Hashino T. Sampling uncertainty and confidence intervals for the Brier score and Brier skill score. Weather and Forecasting. 2008 Oct;23(5):992–1006.
  56. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods. 2020 Mar;17(3):261–72. pmid:32015543
  57. Murphy AH. A new vector partition of the probability score. Journal of Applied Meteorology and Climatology. 1973 Jun;12(4):595–600.
  58. Riley RD, Debray TPA, Collins GS, Archer L, Ensor J, van Smeden M, et al. Minimum sample size for external validation of a clinical prediction model with a binary outcome. Statistics in Medicine. 2021 Aug 30;40(19):4230–51. pmid:34031906
  59. Riley RD, Snell KIE, Ensor J, Burke DL, Harrell FE Jr, Moons KGM, et al. Minimum sample size for developing a multivariable prediction model: part I, continuous outcomes. Statistics in Medicine. 2019;38:1276–96.
  60. Wool GD, Miller JL. The Impact of COVID-19 Disease on Platelets and Coagulation. Pathobiology. 2021;88(1):15–27. pmid:33049751
  61. Troussard X, Vol S, Cornet E, Bardet V, Couaillac J, Fossat C, et al. Full blood count normal reference values for adults in France. J Clin Pathol. 2014 Apr;67(4):341–4. pmid:24170208
  62. Avila E, Kahmann A, Alho C, Dorn M. Hemogram data as a tool for decision-making in COVID-19 management: applications to resource scarcity scenarios. PeerJ. 2020 Jun 29;8:e9482. pmid:32656001
  63. Formica V, Minieri M, Bernardini S, Ciotti M, D’Agostini C, Roselli M, et al. Complete blood count might help to identify subjects with high probability of testing positive to SARS-CoV-2. Clin Med. 2020 Jul;20(4):e114–e119.
  64. Joshi RP, Pejaver V, Hammarlund NE, Sung H, Lee SK, Furmanchuk A, et al. A predictive tool for identification of SARS-CoV-2 PCR-negative emergency department patients using routine test results. Journal of Clinical Virology. 2020 Aug;129:104502. pmid:32544861
  65. Yang HS, Hou Y, Vasovic LV, Steel PAD, Chadburn A, Racine-Brzostek SE, et al. Routine Laboratory Blood Tests Predict SARS-CoV-2 Infection Using Machine Learning. Clinical Chemistry. 2020 Nov 1;66(11):1396–404. pmid:32821907
  66. Carobene A, Milella F, Famiglini L, Cabitza F. How is test laboratory data used and characterised by machine learning models? A systematic review of diagnostic and prognostic models developed for COVID-19 patients using only laboratory data. Clinical Chemistry and Laboratory Medicine (CCLM). 2022 Nov 25;60(12):1887–901. pmid:35508417
  67. Campagner A, Famiglini L, Carobene A, Cabitza F. Everything is varied: The surprising impact of instantial variation on ML reliability. Applied Soft Computing. 2023 Oct 1;146:110644.
  68. Buoro S, Carobene A, Seghezzi M, Manenti B, Pacioni A, Ceriotti F, et al. Short- and medium-term biological variation estimates of leukocytes extended to differential count and morphology-structural parameters (cell population data) in blood samples obtained from healthy people. Clinica Chimica Acta. 2017 Oct 1;473:147–56. pmid:28705776
  69. Buoro S, Seghezzi M, Manenti B, Pacioni A, Carobene A, Ceriotti F, et al. Biological variation of platelet parameters determined by the Sysmex XN hematology analyzer. Clinica Chimica Acta. 2017 Jul 1;470:125–32. pmid:28479317
  70. Buoro S, Carobene A, Seghezzi M, Manenti B, Dominoni P, Pacioni A, et al. Short- and medium-term biological variation estimates of red blood cell and reticulocyte parameters in healthy subjects. Clinical Chemistry and Laboratory Medicine (CCLM). 2018 May 24;56(6):954–63. pmid:29303771