
Assessing generalizability of a dengue classifier across multiple datasets

  • Bingqian Lu ,

    Contributed equally to this work with: Bingqian Lu, Yanni Li

    Roles Conceptualization, Data curation, Formal analysis, Software, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Data Science Institute, Columbia University, New York, New York, United States

  • Yanni Li ,

    Contributed equally to this work with: Bingqian Lu, Yanni Li

    Roles Conceptualization, Data curation, Formal analysis, Software, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Statistical Sciences, Wake Forest University, Winston-Salem, North Carolina, United States

  • Ciaran Evans

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Project administration, Software, Supervision, Visualization, Writing – original draft, Writing – review & editing

    evansc@wfu.edu

    Affiliation Department of Statistical Sciences, Wake Forest University, Winston-Salem, North Carolina, United States

Abstract

Early diagnosis of dengue fever is important for individual treatment and monitoring disease prevalence in the population. To assist diagnosis, previous studies have proposed classification models to detect dengue from symptoms and clinical measurements. However, there has been little exploration of whether existing models can be used to make predictions for new populations. In this study, we assess the generalizability of dengue classification models to new datasets.

We trained logistic regression models on five publicly available dengue datasets from previous studies, using three explanatory variables identified as important in prior work: age, white blood cell count, and platelet count. These five datasets were collected at different times in different locations, with a variety of disease rates and patient ages. A model was trained on each dataset, and predictive performance and model calibration were evaluated on both the original (training) dataset and the other (test) datasets from different studies. By comparing the model’s performance when applied to data from a new location, we are able to assess the model’s generalizability to new populations.

We further compared performance with larger models and other classification methods. In-sample area under the receiver operating characteristic curve (AUC) values for the logistic regression models ranged from 0.74 to 0.89, while out-of-sample AUCs ranged from 0.55 to 0.89. Matching age ranges in training/test datasets increased AUC values and balanced the sensitivity and specificity. Adjusting the predicted probabilities to account for differences in dengue prevalence improved calibration in 20/28 training/test pairs. Results were similar when other explanatory variables were included and when other classification methods (decision trees and support vector machines) were used. The in-sample performance of the logistic regression model was consistent with previous dengue classifiers, suggesting that this model is a reasonable choice in a variety of settings, with decent overall performance. However, adjustments are required to make predictions on new datasets. Practitioners can use existing dengue classifiers in new settings, but should be careful when patient ages and disease rates differ.

Introduction

Dengue fever is an acute mosquito-borne infection found worldwide, most common in tropical and subtropical areas, with a substantial and growing disease burden. In countries such as Vietnam, Malaysia, and Colombia, more severe dengue cases have been reported in recent decades [1, 2], and the number of dengue cases doubled every 10 years from 1990 to 2013 [3]. In 2019, dengue was observed in Afghanistan for the first time [4], and the World Health Organization (WHO) has also reported that dengue has been recorded in more than 100 countries worldwide and is now spreading to Europe [5, 6]. Furthermore, accurate estimation of the number of cases per year is complicated by low levels of reporting, a lack of inexpensive diagnostic methods, and inconsistent comparative analyses; these issues lead to underestimates of dengue cases, particularly in developing countries, and the annual number of cases may be as high as 390 million [5, 7].

Dengue diagnosis

Dengue is characterized by symptoms such as mild (37.5°C ≤ T ≤ 38.3°C) or high fever (>40°C), muscle and joint pain, severe headache, and skin rash [6, 8–10]. The standard guidelines for dengue diagnosis are given by the WHO [6], but the symptoms used for diagnosis overlap with those of several other diseases, such as yellow fever, Zika, and West Nile virus. It can therefore be difficult to identify dengue in the first few days of illness [11], and previous research suggests that the diagnostic guidelines need further modifications [12, 13].

Beyond general symptoms, the WHO recommends examining viral RNA and antibodies for dengue detection. Viral RNA detection offers the advantage of providing faster results, typically within 24 to 48 hours. However, viral RNA detection tools require more expensive equipment and complicated processes, which limit their use in many developing countries impacted by dengue fever [14]. Dengue shock syndrome (DSS), the most common life-threatening complication of dengue, caused by a serious loss of blood from the circulatory system, usually develops between the 4th and 6th day of illness, and its mortality rate can be up to 20% [11, 15]. Thus, identifying dengue at an early stage is crucial for impacted countries.

According to the WHO, the gold standard of dengue diagnosis is composed of RT-PCR viral detection, IgM/IgG antibody detection, and NS1 detection [16]. As antigens can be detected earlier, rapid antigen tests are often used together with antibody detection in early diagnosis [17], and the NS1 rapid test is generally considered very sensitive in the early stages of dengue [17]. In practice, both NS1 and IgM/IgG antibody rapid tests are widely used for diagnosis, depending on factors such as the patient’s clinical presentation, the timing of the illness, and the availability of the test [18]. In a study comparing the diagnostic performance of the different rapid tests, simultaneous NS1 and IgM/IgG antibody tests gave the best performance, with 92% sensitivity and 96% specificity [19]. Another comparison study among RT-PCR viral detection, IgM/IgG antibodies, and NS1 detection suggests that viral RNA and NS1 antigen detection were more suitable than IgM antibodies in the early days after symptom onset. However, starting from days 4–5, IgM detection became the most effective diagnostic method. On days 6–7 and in later samples, both NS1 antigen and IgM detection outperformed viral RNA detection as diagnostic markers for dengue infection [20].

Classification methods for diagnosis

While a combination of antigen and antibody tests performs well for detecting dengue, these rapid tests are not always available. To supplement rapid tests, previous studies have investigated the use of classifiers trained on common diagnostic measurements such as symptoms, blood work, and demographic variables. For example, Tuan et al. [11], Ngim et al. [2], and Andries et al. [21] recorded counts of white blood cells, platelets, and lymphocytes. Physical symptoms recorded include rashes, body temperature, bleeding, arthralgia, asthenia, cough, and headache, and demographic variables include sex and age [2, 11, 22, 23].

These symptoms, demographics, and laboratory measurements have been employed by previous studies to predict whether an ill patient actually has dengue. Tuan et al. [11] used logistic regression to classify patients, and found through model selection that age, white blood cell count, and platelet count were the most important predictors. Their final model has a sensitivity of 74.8% and a specificity of 76.3% for detecting dengue in children, and their results indicate that combining their classifier with rapid tests can slightly improve the performance of the rapid test. Ngim et al. [2] also applied logistic regression, using a penalized model to decrease small-sample bias and improve estimation. Variables were selected using stepwise selection, similar to Tuan et al. [11], and the final model had an area under the receiver operating characteristic (ROC) curve (AUC) of 0.82, a sensitivity of 78.4%, and a specificity of 74.6%.

Beyond logistic regression, a variety of other models have been proposed, with performance similar to Tuan et al. [11] and Ngim et al. [2]. For example, Tanner et al. [24] employed a C4.5 decision tree classifier, using predictors such as platelet count, white blood cell count, lymphocyte count, body temperature, hematocrit, and neutrophil count, and reported a specificity of 80.2% and a sensitivity of 78.2%. Park et al. [25] used structural equation models (SEMs), with variables also including white blood cell count, platelet count, age, lymphocyte count, and hematocrit; their final model had an AUC of 0.84 at fever day -3 for dengue prediction.

Limitations of previous studies

In a recent systematic review, Neto et al. [26] found limited comparisons of existing methods for dengue classification in the literature, and a dearth of available data from previous studies that would allow researchers to reproduce and compare prior results. Comparisons of previous work are further complicated by the fact that each dataset contains different explanatory variables, so a model from one study may not be usable with new data if one or more variables in the model is not recorded. Additionally, previous studies were conducted in a wide variety of locations and at different times, and it is unclear whether models built in one location and time can generalize to a new setting. To the best of our knowledge, no previous work has assessed the generalizability of dengue classification models.

Contributions

The purpose of this paper is to evaluate whether a dengue classifier trained on one dataset can make useful predictions on a new dataset. Our work focuses on logistic regression models, which have been widely used in previous work, are easy for practitioners to interpret [11], and have demonstrated competitive performance compared to other classifiers. We identified five publicly available datasets from previous dengue studies, all of which contain commonly used explanatory variables: age, white blood cell count, and platelet count. To assess generalizability, we conduct an extensive comparison of in-sample and out-of-sample performance for models built on each of the five datasets. We further explore possible reasons a classifier may fail to generalize, and investigate adjustments to make predictions more generalizable. Finally, we consider other possible classification methods (CART classification trees and support vector machines (SVMs)), and other potential explanatory variables available in subsets of the five datasets studied here.

If dengue classifiers generalize to new data, researchers could adapt existing models for new locations, saving the substantial time and effort required to collect new data and fit a model. If classifiers do not generalize, however, then existing models have limited utility beyond the specific data used to train them.

Methods

Datasets

Data was used from five different papers on predicting dengue. These papers were chosen because the datasets were made publicly available by the authors, and considered a range of dengue classification approaches. Each dataset was collected in a different location and contains varying numbers of patients and features. All raw data, and the source code for all the analysis in this manuscript, can be found in our GitHub repository (https://github.com/ciaran-evans/dengue-generalizability). The five datasets were first downloaded on July 25, 2022.

All five datasets are publicly available and anonymized, and none of the authors were involved in the original studies, nor did the authors have any contact with any of the participants involved in the original studies. At no point did any of the authors have access to any information beyond the publicly available, de-identified data described here, nor did we have any means of identifying any of the participants in these studies. The sources for the five datasets used in our research are detailed below, and no additional data was used in the research. The Wake Forest University Institutional Review Board confirmed that the research presented here does not constitute human subjects research, and that no IRB approval was required.

Dataset 1.

Tuan et al. [11] collected data on 5726 patients, aged 1–15 years, who presented with possible dengue fever at seven hospitals in southern Vietnam between 2010 and 2012, with between 340 and 1589 patients from each hospital. Of these patients, 1698 (29.7%) were diagnosed with dengue using a combination of RT-PCR, IgM serology, and NS1 detection by ELISA. Patients were eligible for the study if their fever and symptom history was less than 72 hours. Blood samples, personal information, and symptoms were collected from each patient at the time of enrolment. A total of 35 variables were recorded, including white blood cell count, platelet count, temperature, height, and weight, and symptoms such as vomiting, rashes, skin bleeding, and mucosal bleeding. We omitted two patients with missing values for white blood cell count and platelet count.

Dataset 2.

Gasem et al. [27] collected data on 1486 fever patients (body temperature at least 38°C) aged 1–98 from eight different sites in Indonesia between July 2013 and June 2016, with 65 patients from the smallest site and 267 from the largest site. Of these patients, 467 (31.4%) were diagnosed with dengue. A total of 51 variables were recorded on the symptoms, laboratory results, and final diagnosis for each subject, and the site at which each subject was recorded. Diagnoses included Streptococcus, dengue, Chikungunya, influenza, and respiratory syncytial virus (RSV), among others. To focus on identifying dengue fever, we grouped all non-dengue diagnoses together as one category for analysis. We omitted one patient with a missing value for platelet count.

Dataset 3.

Saito et al. [23] collected data on 1573 patients aged 1–85, with a fever lasting at most 21 days and suspected of bloodstream infection, in the Philippines from 2015 to 2019. Of these patients, 256 (16.3%) were diagnosed with dengue. A total of 284 variables were recorded on the symptoms, laboratory results, medical history, and the diagnosis from laboratory tests for each subject. We omitted 21 patients with missing values for platelet count or white blood cell count, leaving 1552 rows in the data.

Dataset 4.

Park et al. [25] collected data on 257 children aged 6 months to 15 years from Thailand, with fever lasting less than 72 hours. Of these patients, 156 (60.7%) were diagnosed with dengue; of the 156 dengue patients, 51 were diagnosed with dengue hemorrhagic fever, and nine were diagnosed with dengue shock syndrome. A total of 21 variables were recorded, including the patient’s age, sex, final diagnosis, and laboratory results. Each of the laboratory results was recorded twice at different time points: once four days before patients had a fever (fever day -3), and once two days before patients had a fever (fever day -1). In the dataset, age is recorded as a two-year range (e.g., 12–13); we converted the range to a single numerical value by taking the midpoint of the range (e.g., 12.5).

Dataset 5.

Ngim et al. [2] collected data on 368 adult patients with possible dengue infection in Malaysia in 2018. Of these patients, 167 (45.4%) were ultimately diagnosed with dengue using a laboratory test. A total of 82 variables were recorded, including dengue status, white blood cell count, platelet count, age, and symptoms such as fatigue, headache, and chills.

Choice of explanatory variables

There are many possible explanatory variables available in each dataset, and most of the available variables do not appear in all five datasets. However, previous research suggests that a small number of features can be used for dengue classification, so to assess generalizability we restricted our attention to some of the most common and widely-used features available across studies.

Using variable selection [28], Tuan et al. simplified a full logistic regression model containing 17 variables with several interaction terms to a simpler, final model with only three explanatory variables: patient age, white blood cell count, and platelet count [11]. Their final model with these three variables had a sensitivity of 74.8%, a specificity of 76.3%, and an AUC of 0.829. Similarly, Cavailler et al. [29] created a logistic regression model with five variables: positive tourniquet test, absence of upper respiratory infection symptoms, platelet count, white blood cell count, and liver transaminases (sensitivity: 75.7%, specificity: 65.7%, AUC: 0.71). Tuan et al.’s work is also similar to the model created by Ho et al. [30], who created a logistic regression model with patient age, temperature, white blood cell count, and platelet count, and reported an AUC of 0.846. Ho et al. also compared the performance of their small logistic regression model to a larger model with 18 explanatory variables, and saw a similar AUC of 0.841. Ngim et al. [2] also chose a relatively small logistic regression model with seven variables, including white blood cell count and platelet count.

To assess generalizability of dengue classification, we focused on logistic regression with three explanatory variables which are available in all five of the datasets we studied: age, white blood cell count, and platelet count. These explanatory variables were also identified as important dengue predictors by Park et al. [25] and Potts et al. [31].

Logistic regression

We focused on a logistic regression model to predict dengue status because logistic regression is familiar to many researchers, is relatively easy to interpret, works well with a small number of explanatory variables, and has been widely used in prior studies [2, 11, 29, 30].

Furthermore, logistic regression has competitive performance with other classification methods. For example, Tuan et al. compared logistic regression with classification and regression trees (CART) and random forests, and chose logistic regression for their final model [11]. Ho et al. reported similar results when comparing logistic regression with decision trees and deep neural networks; the respective AUCs of the three methods were 0.797, 0.794, and 0.810 [30]. The AUCs for structural equation models (SEMs) reported by Park et al. were similar, ranging from 0.73 to 0.94 depending on the time of illness [25].

Logistic regression was fit using R version 4.2.1 [32]. All code for model fitting and analysis can be found in our GitHub repository (https://github.com/ciaran-evans/dengue-generalizability).

Assessing in-sample performance

Before assessing generalizability of dengue predictions to different datasets, we summarized the in-sample performance of the logistic regression model (with age, white blood cell count, and platelet count as explanatory variables) on each of the five datasets. Summarizing in-sample performance allows us to better understand the expected performance of the model, which helps develop a baseline for comparison across datasets.

In-sample performance was assessed by fitting a logistic regression model on each dataset, and evaluating the model’s performance on the same dataset. Cross-validation was used to evaluate performance on new observations from each dataset. If data were collected from multiple sites, cross-validation was performed by holding out each site in turn, as in Tuan et al. [11]. Otherwise, 10-fold cross-validation was used. Predictive performance was measured using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and area under the ROC curve (AUC); these performance metrics were calculated using the predicted probabilities from cross-validation for each observation.
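The hold-one-site-out scheme above can be sketched in code. The authors fit their models in R; the following is a minimal Python illustration using scikit-learn, where `site_cv_probs` is a hypothetical helper (not the authors' code) that holds out each study site in turn and collects out-of-fold predicted probabilities:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def site_cv_probs(X, y, site):
    """Leave-one-site-out cross-validation: hold out each site in turn,
    fit on the remaining sites, and collect out-of-fold predicted
    probabilities for the held-out site."""
    X, y, site = np.asarray(X), np.asarray(y), np.asarray(site)
    probs = np.empty(len(y))
    for s in np.unique(site):
        held = site == s
        model = LogisticRegression(max_iter=1000).fit(X[~held], y[~held])
        probs[held] = model.predict_proba(X[held])[:, 1]
    return probs
```

For single-site datasets, the same idea applies with 10 random folds in place of sites.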

Sensitivity, specificity, PPV, and NPV all require that the predicted probabilities produced by the logistic regression model are converted into binary predictions, which requires a choice of threshold (AUC, on the other hand, considers all possible thresholds for binarization). If a particular threshold was chosen in the original study from which we obtained the data (e.g., Tuan et al. [11] used a threshold of 0.33), we used that threshold for consistency. If no threshold was chosen, we selected the threshold which maximized the geometric mean of sensitivity and specificity [33, 34].
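The threshold maximizing the geometric mean of sensitivity and specificity can be found directly from the ROC curve. A minimal Python sketch (using scikit-learn; `gmean_threshold` is an illustrative helper, not the authors' implementation):

```python
import numpy as np
from sklearn.metrics import roc_curve

def gmean_threshold(y_true, probs):
    """Choose the classification threshold that maximizes the geometric
    mean of sensitivity and specificity along the ROC curve."""
    fpr, tpr, thresholds = roc_curve(y_true, probs)
    # tpr is sensitivity; 1 - fpr is specificity
    gmeans = np.sqrt(tpr * (1 - fpr))
    return thresholds[np.argmax(gmeans)]
```

Predicted probabilities at or above the returned threshold are then classified as dengue.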

Assessing model generalizability

A model generalizes to a new dataset or population when it can be used to make useful predictions on new observations. In the context of many classification methods, including logistic regression, predictions on new data can be presented as a predicted probability (in this case, the estimated probability that the individual has dengue). For these predicted probabilities to be useful, we therefore require (a) that the predictions can distinguish dengue patients from non-dengue patients, and (b) that the estimated probability be approximately calibrated, i.e. close to the true probability of dengue. Here we assess both aspects of generalizable predictions, and also explore adjustments to improve calibration of the predicted probabilities.

To assess generalizability of the logistic regression model with age, white blood cell count, and platelet count as explanatory variables, we considered all possible pairs of the available datasets, with one of the datasets as the training data and one of the remaining datasets as the test data. The logistic regression model was fit on the training set, and applied to the test set to calculate a predicted probability of dengue for each patient in the test set. We then calculated AUC, sensitivity, specificity, positive predictive value, and negative predictive value for the test set using these predicted probabilities. For metrics which required a threshold for binary classification, we considered both the thresholds for the training and test data from our assessment of in-sample performance.

In practice, we usually have only the training data with which to train and assess the model, so when the model is applied to new data, predictions and metrics would be computed using the threshold chosen on the training data. Since we have information on both the training and test datasets, we examined thresholds from both when calculating sensitivity and specificity. Exploring both thresholds provides insight into whether the choice of threshold is generalizable.

Restricting age ranges.

It is well known that statistical methods perform poorly when asked to extrapolate to new data beyond the range of the training sample. As age is an explanatory variable in our logistic regression model, and the range of patient ages differs substantially between datasets, we re-assessed model generalizability after restricting the age range in test sets.

When the training set was Dataset 1 or Dataset 4 (age <16 years), and the test set was Dataset 2 or Dataset 3, we restricted the age range of the test set to also be below 16 years. When the training set was Dataset 5 (age >16 years), and the test set was Dataset 2 or Dataset 3, we restricted the age range of the test set to also be above 16 years. We re-calculated all performance metrics for predictions on the age-restricted test sets, and compared the results to predictions on the unrestricted test sets. When re-calculating metrics that require a threshold for binary classification, we used only the threshold from the training data, to evaluate whether restricting the age range would be a practical way to improve the model’s generalizability.

Assessing calibration.

To make a binary decision about whether or not a patient has dengue, it is important that the predicted probabilities of dengue should be close to the true probabilities in the test data, in which case the model predictions are said to be calibrated. We assessed calibration through calibration plots, in which the predicted probabilities for the test data are binned into deciles, and the true rate of dengue cases in each bin is plotted against the bin midpoint. When predictions are calibrated, the points should be close to the diagonal line with intercept 0 and slope 1; deviations from this diagonal indicate mis-calibrated predictions. Mis-calibrated predictions do not necessarily impact AUC (because the AUC is calculated over all possible thresholds, and any monotonic transformation of the predictions will result in the same AUC), but will lead to problems when applying a threshold to binarize predicted probabilities.
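The decile-binning behind a calibration plot can be computed in a few lines. A Python sketch (NumPy only; `calibration_table` is an illustrative helper, not the authors' code):

```python
import numpy as np

def calibration_table(y_true, probs, n_bins=10):
    """Bin predicted probabilities into deciles and compare the mean
    prediction in each bin with the observed dengue rate."""
    y_true, probs = np.asarray(y_true), np.asarray(probs)
    edges = np.quantile(probs, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, probs, side="right") - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # (mean predicted probability, observed dengue rate) per bin;
            # calibrated predictions put these points near the line y = x
            rows.append((probs[mask].mean(), y_true[mask].mean()))
    return rows
```

Plotting the observed rate against the mean prediction for each bin yields the calibration plot described above.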

Calibrating classifier predictions.

If model predictions fail to generalize, it is because the distribution of data has changed between training and test sets. It is generally impossible to correct the predictions and improve generalizability without making assumptions about the specific nature of this distributional change. Under certain assumptions, however, it is possible that the predicted probabilities will distinguish positive and negative cases, and can be calibrated to produce good estimates of the probability of dengue in new patients.

Two common assumptions made about distributional change are covariate shift and label shift. Covariate shift [35, 36] posits a change in the distribution of the explanatory variables, but no change in the relationship between the explanatory variables and the response. If covariate shift holds, then classifier predictions should generalize without correction, because the training and test sets have the same relationship between predictors and response. Conversely, label shift [37–39] assumes that the distribution of the response (i.e., dengue prevalence) changes between datasets, but the distributions of age, white blood cell count, and platelet count remain the same conditional on dengue status. Label shift has received substantial attention in the classification and machine learning literature, and previous work has suggested that label shift may be an appropriate assumption for modeling diseases, in which the disease rate can change but the symptoms remain the same [39]. Label shift may also be suitable for the datasets considered here, as the proportion of study patients who were diagnosed with dengue ranges from 16% in Dataset 3 to 61% in Dataset 4.

Under label shift, predictions on test data can be calibrated if the prevalence of dengue in the test data is known, without having to re-fit the model on the test data. Let $\hat{p}_i$ denote the predicted probability of dengue for the $i$th patient in the test set, using a model fit on a different training set. Let $\pi_{\text{train}}$ denote the marginal prevalence of dengue in the training set, and $\pi_{\text{test}}$ the marginal prevalence of dengue in the test set. If $\pi_{\text{test}}$ is known, then the calibrated predictions $\tilde{p}_i$ on the test set can be calculated by applying Bayes’ rule:

$$\tilde{p}_i = \frac{\dfrac{\pi_{\text{test}}}{\pi_{\text{train}}}\,\hat{p}_i}{\dfrac{\pi_{\text{test}}}{\pi_{\text{train}}}\,\hat{p}_i + \dfrac{1 - \pi_{\text{test}}}{1 - \pi_{\text{train}}}\,(1 - \hat{p}_i)} \tag{1}$$

After restricting age ranges in the test set as described above, we applied Eq 1 and assessed performance of the label shift-adjusted predictions with calibration plots. We also compared predictive performance by calculating deviance for the test data (−2 times the binomial log-likelihood) using the uncorrected predictions and the label shift-adjusted predictions.
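The Bayes' rule adjustment of Eq 1 amounts to reweighting the predicted odds by the ratio of prevalences. A minimal Python sketch (the function name is ours, not the authors'):

```python
def label_shift_adjust(p_hat, pi_train, pi_test):
    """Re-calibrate a predicted probability of dengue for a new prevalence
    under the label shift assumption (the Bayes' rule adjustment of Eq 1)."""
    w_pos = pi_test / pi_train            # reweighting for dengue cases
    w_neg = (1 - pi_test) / (1 - pi_train)  # reweighting for non-dengue cases
    num = w_pos * p_hat
    return num / (num + w_neg * (1 - p_hat))
```

When the test prevalence equals the training prevalence, the adjustment leaves the predictions unchanged; a higher test prevalence shifts predictions upward.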

If the rate of dengue in an area is unknown, it may be possible to estimate this rate by leveraging the label shift assumption. Lipton et al. [39] proposed a discretization method for estimating test prevalence under label shift, which involves inverting an estimated confusion matrix. An alternative is a fixed point method proposed by Saerens et al. [37], which selects the estimate of the test prevalence that most closely matches the mean of the corrected predictions from Eq 1. We assessed the performance of both methods, and compared them to the simple estimate of test prevalence obtained by averaging the uncorrected predicted probabilities for the test observations.
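The fixed point method can be sketched as a simple EM-style iteration, in the spirit of Saerens et al. [37]. This Python sketch (ours, not the authors' implementation) repeatedly applies the Eq 1 adjustment and updates the prevalence estimate to the mean of the corrected predictions:

```python
import numpy as np

def estimate_test_prevalence(p_hat, pi_train, n_iter=100):
    """Fixed-point estimate of test-set dengue prevalence under label
    shift: iterate until the prevalence estimate matches the mean of
    the Bayes'-rule-corrected predictions."""
    p_hat = np.asarray(p_hat, dtype=float)
    pi = p_hat.mean()  # initialize at the naive average of predictions
    for _ in range(n_iter):
        w_pos = pi / pi_train
        w_neg = (1 - pi) / (1 - pi_train)
        adj = w_pos * p_hat / (w_pos * p_hat + w_neg * (1 - p_hat))
        pi = adj.mean()  # update toward the fixed point
    return pi
```

If the mean prediction already equals the training prevalence, the iteration leaves the estimate unchanged.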

Exploring other explanatory variables and classification methods

To assess whether predictive performance and generalizability could be improved with other explanatory variables, we identified three alternative sets of explanatory variables which were shared by at least three of the five datasets (see Table 1). For each alternative set of explanatory variables, predictive performance on all possible pairs of the available datasets was assessed, with one of the datasets as the training set and one of the remaining datasets as the test set. Performance was measured on the test set using AUC, sensitivity, specificity, positive predictive value, and negative predictive value. When the training and test sets were the same, cross-validation was used as described previously. For each training set, we compared a logistic regression model built on all variables in the alternative set; the original logistic regression model with age, white blood cell count, and platelet count; and the “best” logistic regression model selected through best subset selection with Akaike’s information criterion (AIC) [40]. The bestglm R package was used for best subset selection with logistic regression [41].

Table 1. Definitions of the alternative sets of explanatory variables considered to assess in-sample performance and predictive generalizability.

https://doi.org/10.1371/journal.pone.0323886.t001

Model comparison.

In addition to logistic regression models, we assessed the performance of CART classification trees and support vector machines (SVMs) with a radial basis kernel [40] on all four sets of explanatory variables shown in Table 1. Performance was compared using the AUC for each pair of training and test sets. Decision trees and SVMs were fit using the rpart and e1071 packages, respectively, in R [42, 43].
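The decision trees and SVMs were fit with rpart and e1071 in R; the pairwise train/test AUC comparison can be illustrated with scikit-learn analogues on synthetic stand-in data (the variable ranges and coefficients below are invented for illustration, not taken from the study datasets):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def make_data(n):
    # Toy stand-ins for age, white blood cell count, and platelet count.
    X = np.column_stack([rng.uniform(1, 80, n),
                         rng.normal(8, 3, n),
                         rng.normal(200, 60, n)])
    # Lower WBC and platelet counts are associated with dengue in prior work.
    logit = -0.15 * X[:, 1] - 0.01 * X[:, 2] + 2.0
    y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
    return X, y

X_train, y_train = make_data(800)  # plays the role of the training dataset
X_test, y_test = make_data(400)    # plays the role of a test dataset

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "svm_rbf": SVC(kernel="rbf", probability=True, random_state=0),
}
aucs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_test)[:, 1]
    aucs[name] = roc_auc_score(y_test, probs)  # test-set AUC per method
```

With real data, `X_train`/`X_test` would come from two different studies, giving one AUC per method for each training/test pair.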

Results

Feature distributions in each dataset

The distribution of each explanatory variable in each dataset is shown in Fig 1, which shows that these features share certain similarities in their distributions. Full summary statistics for each variable are provided in S1 Table. The white blood cell count and platelet count in each dataset were scaled to a common measuring unit for comparison. Despite differences in the range of age, the age distribution was unimodal and right-skewed in almost all datasets, except for Dataset 4, where the distribution of age was unimodal and roughly symmetric. White blood cell count had a right-skewed distribution centered around 8 in all datasets. The distribution of platelet count was also right-skewed, except in Dataset 1 and Dataset 5, in which the distributions of platelet count were unimodal and roughly symmetric; the platelet count distributions had similar centers across the five datasets. Lastly, for Dataset 4, which collected white blood cell count and platelet count both two days and four days before the onset of fever, the distributions of the measurements collected four days before the fever were more similar to the distributions observed in the other datasets. The means of platelet count and white blood cell count from data collected two days before the fever were the lowest among the five datasets (S1 Table).

Fig 1. Distributions of explanatory variables across datasets.

The distribution of age, white blood cell count (WBC), and platelet count (PLT) for patients in each dataset, shown separately for dengue and non-dengue patients.

https://doi.org/10.1371/journal.pone.0323886.g001

In-sample performance

In-sample performance for each dataset is summarized in Table 2. The model trained on Dataset 5 had a sensitivity of 72%, a specificity of 64%, and an AUC of 0.74 (95% CI: (0.69, 0.79)), the lowest among the five datasets and slightly lower than the AUC (0.82) reported by Ngim et al. [2]. All other datasets had sensitivities and specificities around 75%, and AUCs above 0.82. The AUCs for Dataset 1 (0.83, 95% CI: (0.81, 0.84)) and Dataset 4, Day -3 (0.82, 95% CI: (0.77, 0.88)) were close to the AUCs reported by [11] and [25] (0.83 and 0.84, respectively), while the AUC for Dataset 4, Day -1 (0.86, 95% CI: (0.81, 0.90)) was slightly lower than the value of 0.93 reported by Park et al. [25]. AUC values were not reported by Gasem et al. [27] and Saito et al. [23] for Datasets 2 and 3; we found respective AUC values of 0.89 (95% CI: (0.87, 0.91)) and 0.85 (95% CI: (0.83, 0.87)), demonstrating relatively good predictive performance.
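Confidence intervals for an AUC can be obtained in several ways; as a hedged sketch (not necessarily the procedure used in the paper), a simple percentile bootstrap on synthetic labels and scores looks like:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 500
y = rng.binomial(1, 0.3, n)                   # hypothetical dengue labels
scores = rng.normal(loc=y * 1.0, scale=1.0)   # hypothetical model scores

boot_aucs = []
for _ in range(2000):
    idx = rng.integers(0, n, n)               # resample patients with replacement
    if y[idx].min() == y[idx].max():          # skip degenerate one-class resamples
        continue
    boot_aucs.append(roc_auc_score(y[idx], scores[idx]))
lo, hi = np.percentile(boot_aucs, [2.5, 97.5])
print(f"AUC = {roc_auc_score(y, scores):.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```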

Table 2. In-sample performance results for each dataset.

https://doi.org/10.1371/journal.pone.0323886.t002

Generalizability

Model performance for each training/test pair is shown in Fig 2 and summarized in S2 Table. As expected, AUC values tended to be slightly lower than in-sample performance when using different training/test sets. However, the differences were mostly small, which supports model generalizability. The exception was training on Datasets 1 and 4 and testing on Datasets 2 and 3; these combinations had substantially lower AUCs on the test data, likely because Datasets 1 and 4 included only pediatric patients, whereas the other datasets included adult patients. The lowest AUC value, 0.547, was observed when the model was trained on Dataset 4, Day -3 and tested on Dataset 2. In contrast, when we tested models trained on Dataset 2 and Dataset 3 on Datasets 1 and 4, all AUC values were greater than 0.75. Datasets 2 and 3 include both pediatric and adult patients, so extrapolation is not required to predict on Datasets 1 and 4. However, models trained on Dataset 5 (age >16 years) still predicted well on Datasets 1 and 4, which may suggest that models trained on datasets spanning a wider age range can extrapolate more accurately to younger patients outside that range.
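The train-on-one, test-on-another design can be sketched as follows. The "datasets" here are synthetic stand-ins with different age ranges and dengue prevalences (all numbers hypothetical); in the paper each dataset comes from a different study:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)

def make_dataset(n, age_lo, age_hi, prevalence):
    """Synthetic cohort: dengue cases have lower WBC and PLT on average."""
    y = rng.binomial(1, prevalence, n)
    age = rng.uniform(age_lo, age_hi, n)
    wbc = rng.normal(8 - 3 * y, 2, n)
    plt = rng.normal(220 - 80 * y, 50, n)
    return np.column_stack([age, wbc, plt]), y

datasets = {
    "pediatric": make_dataset(300, 1, 15, 0.45),   # like Datasets 1 and 4
    "all_ages":  make_dataset(300, 1, 80, 0.25),   # like Datasets 2 and 3
    "adult":     make_dataset(300, 17, 80, 0.35),  # like Dataset 5
}

# Fit on each dataset, then evaluate on every dataset (diagonal = in-sample)
for train_name, (X_tr, y_tr) in datasets.items():
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    for test_name, (X_te, y_te) in datasets.items():
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        tag = "in-sample" if train_name == test_name else "transfer"
        print(f"train={train_name:9s} test={test_name:9s} AUC={auc:.3f} ({tag})")
```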

Fig 2. Model performance for each pair of training/test datasets.

Each panel represents one test set, with predictive performance on that test set for each training set. Predictive performance is measured by AUC, sensitivity, and specificity. Sensitivity and specificity are both threshold dependent, and so we consider thresholds from both the training and test data (Table 2).

https://doi.org/10.1371/journal.pone.0323886.g002

There were also several training/test pairs with higher AUCs than the observed in-sample performance. The highest AUC value, 0.894, was observed when the model was trained on Dataset 3 and tested on Dataset 2; this AUC was higher than the in-sample performance of both Dataset 2 and Dataset 3 (Table 2). Similarly, training on Dataset 1 and testing on Dataset 4, Day -1 also gave an AUC higher than the in-sample performances for both Dataset 1 and Dataset 4. The error bars in Fig 2 suggest that these differences in AUC are likely due to chance.

While AUC is threshold-independent, sensitivity and specificity are not. Fig 2 shows that using a pre-specified threshold resulted in substantial imbalance between sensitivity and specificity, suggesting that an alternative way of choosing the threshold is needed for the model to be effective in practice.
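One simple alternative (a sketch, not a method proposed in the paper) is to pick the ROC operating point on the training data where sensitivity and specificity are closest to equal; labels and scores below are synthetic:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(3)
y = rng.binomial(1, 0.3, 1000)                        # hypothetical labels
prob = 1 / (1 + np.exp(-(y * 2.0 + rng.normal(0, 1, 1000) - 1)))

fpr, tpr, thresholds = roc_curve(y, prob)
sens, spec = tpr, 1 - fpr
best = np.argmin(np.abs(sens - spec))                 # closest-to-balanced point
print(f"threshold={thresholds[best]:.3f} "
      f"sensitivity={sens[best]:.3f} specificity={spec[best]:.3f}")
```

Because this threshold is chosen on the training data, its balance can still degrade on a test population with a different dengue prevalence, as the results above illustrate.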

Restricting age ranges.

Since Dataset 2 and Dataset 3 had wider age ranges than the other datasets, we restricted their age range and re-assessed the model’s generalizability when testing on the restricted Dataset 2 and Dataset 3.

As shown in Fig 3, restricting the age range of the test set to match the training set improved the AUC and often led to a more balanced combination of sensitivity and specificity. An exception was training on Dataset 4, Day -1 with a threshold of 0.64 and testing on Dataset 3; the AUC improved, but the imbalance between sensitivity and specificity was not alleviated. Since 0.64 was the largest threshold across the five datasets (and the training threshold for Dataset 3, 0.21, was the smallest), the remaining imbalance after filtering the data may be caused by the large discrepancy between the threshold value and the proportion of dengue patients in Dataset 3.

Fig 3. Model performance with restricted age ranges.

Changes in AUC, sensitivity, and specificity are shown when restricting the age range of test data from Dataset 2 and Dataset 3. The Unrestricted performance metrics are calculated on the full test sets, while the Restricted performance metrics are calculated after restricting the test data to the age range of the training data (<16 for Datasets 1 and 4, >16 for Dataset 5). For sensitivity and specificity, the threshold for each training set (Table 2) was used.

https://doi.org/10.1371/journal.pone.0323886.g003

The model trained on Dataset 5 and tested on the restricted Dataset 2 and Dataset 3 showed a similar pattern, with improved AUC values but still-imbalanced sensitivity and specificity, which may likewise be due to the discrepancy between the dengue prevalence and the chosen threshold.

Calibration and label shift.

Several calibration plots for different training/test pairs are shown in Fig 4 (calibration plots for all pairs are shown in S1 Fig and S2 Fig). The uncorrected predicted probabilities (applying the logistic regression model, fit on the training data, directly to the test data) tended to be poorly calibrated, particularly when the prevalence of dengue differed between training and test sets. Adjusting the predicted probabilities with the dengue prevalence in the test data (Eq 1) often improved calibration, as shown in Fig 4. The deviance for the test data decreased in 20 out of 28 training/test pairs when using these label shift-adjusted predictions (S3 Table).
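The prior-probability correction of Saerens et al. [37], which is the form of adjustment we understand Eq 1 to take, can be sketched as follows; the predicted probabilities and prevalences used here are hypothetical:

```python
import numpy as np

def label_shift_adjust(p, prev_train, prev_test):
    """Rescale predicted probabilities from the training prevalence
    to the (known) test prevalence, per Saerens et al. [37]."""
    w_pos = prev_test / prev_train
    w_neg = (1 - prev_test) / (1 - prev_train)
    return (w_pos * p) / (w_pos * p + w_neg * (1 - p))

p = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
adjusted = label_shift_adjust(p, prev_train=0.5, prev_test=0.2)
print(np.round(adjusted, 3))
```

When the test prevalence is lower than the training prevalence, every predicted probability is shrunk toward 0, which is what the corrected calibration plots reflect.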

Fig 4. Example calibration plots for several training/test pairs.

The logistic regression model is fit on the training set, then applied to the test set to calculate a predicted probability for each test observation. These predicted probabilities are binned into deciles, and the true rate of dengue cases in each bin is plotted against the bin midpoint. Plots are shown for both the raw predictions, and for the label shift-adjusted predictions using Eq 1.

https://doi.org/10.1371/journal.pone.0323886.g004

Estimating dengue prevalence in test data.

The corrected calibration plots in Fig 4 require the true dengue prevalence in each test set. Fig 5 shows the results of three different estimates of dengue prevalence in the test data, for each training/test pair. The discretization method [39] performed worst, with several estimates outside the (0, 1) range and a median log ratio of 0.244 relative to the true dengue prevalence. Performance of the fixed point method [37] was also poor, with a median log ratio of 0.184 relative to the true prevalence. Somewhat surprisingly, simply estimating prevalence with the mean of the uncorrected predicted probabilities performed best, with a median log ratio of 0.016. However, none of the three methods is consistently reliable, with particularly poor performance on test Datasets 4 and 5.
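As a sketch, under the hedged assumption that the fixed point method refers to the EM-style iteration of Saerens et al. [37], the fixed point and mean-of-probabilities estimates can be computed from a classifier's predicted probabilities; the test labels and scores below are synthetic:

```python
import numpy as np

def fixed_point_prevalence(p, prev_train, n_iter=100):
    """EM-style iteration: re-adjust probabilities to the current
    prevalence estimate, then update the estimate to their mean."""
    prev = prev_train
    for _ in range(n_iter):
        w_pos = prev / prev_train
        w_neg = (1 - prev) / (1 - prev_train)
        p_adj = (w_pos * p) / (w_pos * p + w_neg * (1 - p))
        prev = p_adj.mean()
    return prev

rng = np.random.default_rng(4)
# Synthetic test set whose true prevalence (0.2) differs from training (0.5)
y = rng.binomial(1, 0.2, 2000)
p = 1 / (1 + np.exp(-rng.normal(2 * y - 1, 1)))   # hypothetical scores

print("mean of probabilities:", round(p.mean(), 3))
print("fixed point estimate: ", round(fixed_point_prevalence(p, 0.5), 3))
```

How close either estimate lands to the true prevalence depends on how well calibrated the classifier is on the test population, which is consistent with the unreliability observed in Fig 5.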

Fig 5. Estimating prevalence of dengue in test data.

For each training/test pair, the prevalence of dengue in the test data is estimated by the discretization method, the fixed point method, and by simply averaging the predicted probabilities. The solid horizontal line in each panel represents the true prevalence of dengue in that test set.

https://doi.org/10.1371/journal.pone.0323886.g005

Other explanatory variables

Predictive performance for logistic regression models on each pair of training and test datasets, for the three alternative sets of explanatory variables (Table 1), is shown in S4 Table, S5 Table, and S6 Table. Overall, performance with the original subset (age, white blood cell count, and platelet count) is comparable to or better than performance with other explanatory variables in the model.

For the first alternative subset, AUC is around 0.8 for most logistic regression models considered. An exception occurs when training the full model (all variables in Alternative Subset 1) or the BSS model (the model chosen by best subset selection with AIC) on Dataset 1 and predicting on Dataset 3 or Dataset 5. In these cases, performance on the test data is extremely poor, with an AUC below 0.5. This poor performance is not seen with logistic regression on the original explanatory variables.

For the second alternative subset, the difference in predictive performance between the three different logistic regression models (original variables, all variables in Alternative Subset 2, and the BSS model) is very small (generally within 0.01 or 0.02) for all pairs of training and test sets. For the third alternative subset, performance is again similar between all three logistic regression models, particularly when the age range is restricted to be comparable.

These results show that logistic regression with age, white blood cell count, and platelet count achieves similar performance to larger models with additional explanatory variables, which aligns with previous results from Tuan et al. [11] and Potts et al. [31]. Furthermore, in some cases (see Alternative Subset 1 in particular) using fewer variables allowed for improved generalizability to new data.

Other classification methods

Predictive performance for logistic regression models is compared to decision trees and SVMs, for each set of explanatory variables shown in Table 1, in S3 Fig, S4 Fig, S5 Fig, and S6 Fig.

In general, it appears that the logistic regression model, as a parametric model with a very specific shape assumption, is the most susceptible to extrapolation issues when the datasets have different age ranges. Otherwise, the AUC values for the logistic regression model tend to be higher than for the decision tree, especially when Dataset 1 or Dataset 4 is the test set. The exception is predicting on Dataset 5, for which the decision tree outperforms both the logistic regression and the SVM. The SVM performs similarly to the logistic regression model, with slightly worse performance on Datasets 1 and 2, but less sensitivity to age range differences.

Finally, we note that for all three models, performance on the different sets of explanatory variables (Table 1) is similar, which further supports the choice of age, white blood cell count, and platelet count as our primary explanatory variables of interest.

Discussion

Many previous studies have investigated the use of statistical models to diagnose dengue in patients. When these models work well, they have the potential to quickly identify dengue fever without requiring potentially expensive or slow laboratory tests. However, each model has been developed for a different population, and collecting training data and fitting an appropriate model can be laborious and time-consuming tasks. If existing models can generalize to new populations, practitioners could make quick diagnoses without needing to create their own new model.

The goal of this paper is to assess the generalizability of a dengue classifier to new populations. We have focused on the Early Dengue Classifier proposed by Tuan et al. [11] as a good candidate, because it is easy for practitioners to use, it does not involve too many explanatory variables (and Tuan et al. demonstrated similar performance to more complicated models), and the three covariates have been identified as important dengue predictors in multiple other studies. Our analysis explores whether the model performs well when trained on a new dataset (in-sample performance), and whether a pre-trained model can predict well on new test data from a different population (out-of-sample performance). Furthermore, we compare the performance of this model with other logistic regression models incorporating more explanatory variables, and with two other classification methods. We found comparable (or occasionally worse) performance when using larger models, and also when using decision trees and support vector machines. We therefore focus on the simpler logistic regression model proposed by Tuan et al. [11] for its ease of use, its interpretability, and its simplicity: requiring few variables makes it easier to implement in new settings.

Our in-sample results in Table 2 show that, when re-trained on data from new settings, this logistic regression model performs well in a variety of settings with different age ranges and dengue prevalences. Indeed, the in-sample AUCs for each of the five datasets are above 0.75, and are between 0.8 and 0.9 for four of the five datasets. The similarities in AUCs across datasets suggest that the chosen logistic regression model is an effective choice for dengue classification in a wide variety of settings, and could be adapted to new environments by collecting new training sets. This performance is also similar to the AUCs reported in many previous studies using a range of classification techniques [11, 25, 29, 30], which supports that the model is competitive with other classifiers.

After confirming that the logistic regression model is a good choice across datasets, we next assessed whether the model could be applied to data from new settings without retraining. If the model were generalizable, we would expect its performance on a new dataset to be similar to the in-sample performance we observe. However, there are several potential challenges when applying an existing model to new data: differences in feature distributions (e.g., different age groups), different rates of dengue prevalence, and selecting an appropriate threshold for binary classification.

When we assessed the generalizability of the model with no restriction on the age range of the datasets, AUC values were all above 0.75 for training/test pairs in which the age ranges were similar or the training dataset had a wider age range than the test dataset, though most of these AUC values were slightly lower than the in-sample performance, as expected. For training/test pairs in which the training dataset had a narrower age range, we observed substantial declines in AUC compared with in-sample performance (S2 Table). These disparities in generalization performance demonstrate that a model trained on a dataset with a broader age range can extrapolate well when tested on a dataset with a narrower age range, but not vice versa.

When we re-assessed the generalizability of the model with restrictions on the age range of the datasets, AUC values improved for all datasets (Fig 3), indicating that similar age ranges in the training and test datasets improve the generalizability of the model. Since the model achieved an AUC above 0.75, close to the in-sample performance, for all training/test pairs with similar age ranges, the logistic regression model studied in this paper can be applied to a new dataset with minimal reduction in overall performance (as measured by AUC, which summarizes the ability to distinguish dengue and non-dengue cases), provided the model is trained on an age range similar to that of the new dataset.

However, AUC does not capture whether predicted probabilities are calibrated, nor how to choose a threshold for binary predictions. In both assessments of generalizability, we observed an imbalance between sensitivity and specificity (S2 Table). When we evaluated the model with no restrictions on the age range, all training/test pairs showed a substantial imbalance between sensitivity and specificity, whether the threshold was taken from the training or the test dataset. The imbalance was somewhat alleviated when we re-evaluated the model with restricted age ranges using only the thresholds from the training datasets, but a discrepancy between sensitivity and specificity remained for Dataset 4, Day -1 and Dataset 5 (Fig 3). Thus, restricting the age range mitigates the imbalance between sensitivity and specificity only to an extent. Since a model with imbalanced sensitivity and specificity may not be helpful for practitioners making diagnoses, an alternative way of choosing a threshold is needed for the model to be effective in practice.

Additionally, although the AUC of the model appears relatively good, the predicted probabilities were consistently higher or lower than the observed dengue rates (Fig 4), due to discrepancies in dengue prevalence between the training and test datasets, which are expected in practice. This suggests that an existing model may not be applicable to new data without adjusting for differences in dengue prevalence.

The label shift assumption [37, 39] provides a natural method to account for changes in disease prevalence, and the datasets have roughly similar distributions of white blood cell count and platelet count, conditional on dengue status (Fig 1). Furthermore, in many of the training/test pairs, the predicted probabilities could be calibrated using a label shift assumption (Fig 4, and S1 Fig and S2 Fig). However, label shift adjustment requires knowing the dengue prevalence in the test set. If the population prevalence is unknown, or changes (e.g. due to seasonal trends), calibrating the model predictions would be impossible without collecting new data. Attempts to leverage the label shift assumption to estimate test prevalence were unsuccessful (Fig 5).

In conclusion, we have shown that the logistic regression model from Tuan et al. [11] can be applied to a new dataset without re-training, but its ability to help clinicians make diagnoses is limited. Specifically, for the model to generalize to a new dataset, the age range of the training and test datasets should be roughly the same, or the age range of the training dataset should be wider than that of the test dataset. Although restricting the age range preserves the model’s overall predictive ability, the model may still be poorly suited to practice, because it is hard to find a threshold that yields a relatively balanced sensitivity and specificity. Future studies aiming to improve the model’s usefulness in clinical settings could investigate how to choose a threshold that balances sensitivity and specificity. Additionally, since the predicted probabilities made by the model tend to be biased, and the calibration techniques required may be too complicated to apply in practice, one could also focus on adjusting the distribution of the explanatory variables before training the model, so that its predicted probabilities are well calibrated without post hoc correction.

Supporting information

S1 Table. Summary statistics for each dataset.

https://doi.org/10.1371/journal.pone.0323886.s001

(PDF)

S2 Table. Full performance metrics for each combination of training/test datasets.

https://doi.org/10.1371/journal.pone.0323886.s002

(PDF)

S3 Table. Label shift adjustment and estimation for each training/test pair.

https://doi.org/10.1371/journal.pone.0323886.s003

(PDF)

S4 Table. In-sample and generalizability performance metrics for logistic regression with each training/test pair, Alternative Subset 1.

Model denotes whether the model evaluated is the original model (age, WBC, and PLT), the full model (using all available variables in the subset), or the model chosen by best subsets selection (BSS) with that training data. Missing values in the PPV indicate that there were 0 predicted positive cases.

https://doi.org/10.1371/journal.pone.0323886.s004

(PDF)

S5 Table. In-sample and generalizability performance metrics for logistic regression with each training/test pair, Alternative Subset 2.

Model denotes whether the model evaluated is the original model (age, WBC, and PLT), the full model (using all available variables in the subset), or the model chosen by best subsets selection (BSS) with that training data.

https://doi.org/10.1371/journal.pone.0323886.s005

(PDF)

S6 Table. In-sample and generalizability performance metrics for logistic regression with each training/test pair, Alternative Subset 3.

Model denotes whether the model evaluated is the original model (age, WBC, and PLT), the full model (using all available variables in the subset), or the model chosen by best subsets selection (BSS) with that training data.

https://doi.org/10.1371/journal.pone.0323886.s006

(PDF)

S1 Fig. Calibration plots for all training/test pairs (uncorrected predictions).

The logistic regression model is fit on the training set, and applied to the test set to calculate predicted probabilities. These raw predicted probabilities are compared against true instances of dengue.

https://doi.org/10.1371/journal.pone.0323886.s007

(TIF)

S2 Fig. Calibration plots for all training/test pairs (label shift-adjusted predictions).

The logistic regression model is fit on the training set, and applied to the test set to calculate predicted probabilities. The predicted probabilities are adjusted with a label shift correction to account for a different rate of dengue fever in the test dataset (see Eq 1 in the manuscript).

https://doi.org/10.1371/journal.pone.0323886.s008

(TIF)

S3 Fig. Comparison of predictive performance with the original explanatory variables for three different classification methods, for each pair of training/test datasets.

Recall that the original explanatory variables are Age, WBC, and PLT (see Table 1 in the main manuscript). Each panel represents one test set, with the predictive performance on that test set (the AUC) displayed for each training set. Performance is shown with and without age range restrictions. Without age range restrictions, all complete cases in the training and test datasets are used. With age range restrictions, the test dataset is restricted (if possible) to match the age range of the training data (<16 if training data is Dataset 1 or Dataset 4 and test data is Dataset 2 or Dataset 3; >16 if training data is Dataset 5 and test data is Dataset 2 or Dataset 3).

https://doi.org/10.1371/journal.pone.0323886.s009

(TIF)

S4 Fig. Comparison of predictive performance with the alternative subset 1 explanatory variables for three different classification methods, for each pair of training/test datasets.

The variables in Alternative Subset 1 can be found in Table 1. Each panel represents one test set, with the predictive performance on that test set (the AUC) displayed for each training set. Performance is shown with and without age range restrictions. Without age range restrictions, all complete cases in the training and test datasets are used. With age range restrictions, the test dataset is restricted (if possible) to match the age range of the training data (<16 if training data is Dataset 1 or Dataset 4 and test data is Dataset 2 or Dataset 3; >16 if training data is Dataset 5 and test data is Dataset 2 or Dataset 3).

https://doi.org/10.1371/journal.pone.0323886.s010

(TIF)

S5 Fig. Comparison of predictive performance with the alternative subset 2 explanatory variables for three different classification methods, for each pair of training/test datasets.

The variables in Alternative Subset 2 can be found in Table 1. Each panel represents one test set, with the predictive performance on that test set (the AUC) displayed for each training set. Performance is shown with and without age range restrictions. Without age range restrictions, all complete cases in the training and test datasets are used. With age range restrictions, the test dataset is restricted (if possible) to match the age range of the training data (<16 if training data is Dataset 1 or Dataset 4 and test data is Dataset 2 or Dataset 3; >16 if training data is Dataset 5 and test data is Dataset 2 or Dataset 3).

https://doi.org/10.1371/journal.pone.0323886.s011

(TIF)

S6 Fig. Comparison of predictive performance with the alternative subset 3 explanatory variables for three different classification methods, for each pair of training/test datasets.

The variables in Alternative Subset 3 can be found in Table 1. Each panel represents one test set, with the predictive performance on that test set (the AUC) displayed for each training set. Performance is shown with and without age range restrictions. Without age range restrictions, all complete cases in the training and test datasets are used. With age range restrictions, the test dataset is restricted (if possible) to match the age range of the training data (<16 if training data is Dataset 1 or Dataset 4 and test data is Dataset 2 or Dataset 3; >16 if training data is Dataset 5 and test data is Dataset 2 or Dataset 3).

https://doi.org/10.1371/journal.pone.0323886.s012

(TIF)

References

  1. World Health Organization Regional Office for South-East Asia. Comprehensive guideline for prevention and control of dengue and dengue haemorrhagic fever. World Health Organization; 2011.
  2. Ngim CF, Husain SMT, Hassan SS, Dhanoa A, Ahmad SAA, Mariapun J, et al. Rapid testing requires clinical evaluation for accurate diagnosis of dengue disease: a passive surveillance study in Southern Malaysia. PLoS Negl Trop Dis. 2021;15(5):e0009445. pmid:34014983
  3. Stanaway JD, Shepard DS, Undurraga EA, Halasa YA, Coffeng LE, Brady OJ, et al. The global burden of dengue: an analysis from the Global Burden of Disease Study 2013. Lancet Infect Dis. 2016;16(6):712–23. pmid:26874619
  4. Sahak MN. Dengue fever as an emerging disease in Afghanistan: epidemiology of the first reported cases. Int J Infect Dis. 2020;99:23–7. pmid:32738489
  5. Bhatt S, Gething PW, Brady OJ, Messina JP, Farlow AW, Moyes CL, et al. The global distribution and burden of dengue. Nature. 2013;496(7446):504–7. pmid:23563266
  6. World Health Organization. Dengue and severe dengue. Available from: https://www.who.int/news-room/fact-sheets/detail/dengue-and-severe-dengue
  7. Wilder-Smith A, Byass P. The elusive global burden of dengue. Lancet Infect Dis. 2016;16(6):629–31. pmid:26874620
  8. Khan MB, Yang Z-S, Lin C-Y, Hsu M-C, Urbina AN, Assavalapsakul W, et al. Dengue overview: an updated systemic review. J Infect Public Health. 2023;16(10):1625–42. pmid:37595484
  9. Srikiatkhachorn A, Mathew A, Rothman AL. Immune-mediated cytokine storm and its role in severe dengue. Semin Immunopathol. 2017;39(5):563–74. pmid:28401256
  10. Kularatne SAM. Dengue fever. BMJ. 2015;351:h4661. pmid:26374064
  11. Tuan NM, Nhan HT, Chau NVV, Hung NT, Tuan HM, Tram TV, et al. Sensitivity and specificity of a novel classifier for the early diagnosis of dengue. PLoS Negl Trop Dis. 2015;9(4):e0003638. pmid:25836753
  12. Hadinegoro SRS. The revised WHO dengue case classification: does the system need to be modified? Paediatr Int Child Health. 2012;32(Suppl 1):33–8. pmid:22668448
  13. Horstick O, Martinez E, Guzman MG, Martin JLS, Ranzinger SR. WHO dengue case classification 2009 and its usefulness in practice: an expert consensus in the Americas. Pathog Glob Health. 2015;109(1):19–25. pmid:25630344
  14. Guzmán MG, Kourí G. Dengue diagnosis, advances and challenges. Int J Infect Dis. 2004;8(2):69–80. pmid:14732325
  15. Guzman MG, Harris E. Dengue. Lancet. 2015;385(9966):453–65.
  16. World Health Organization, Special Programme for Research and Training in Tropical Diseases, World Health Organization Department of Control of Neglected Tropical Diseases, and World Health Organization Epidemic and Pandemic Alert. Dengue: guidelines for diagnosis, treatment, prevention and control. World Health Organization; 2009.
  17. Centers for Disease Control and Prevention. Dengue virus antigen detection; 2019. Available from: https://www.cdc.gov/dengue/healthcare-providers/testing/antigen-detection.html
  18. Luo R, Fongwen N, Kelly-Cirino C, Harris E, Wilder-Smith A, Peeling RW. Rapid diagnostic tests for determining dengue serostatus: a systematic review and key informant interviews. Clin Microbiol Infect. 2019;25(6):659–66. pmid:30664935
  19. Mata VE, Andrade CAF de, Passos SRL, Hökerberg YHM, Fukuoka LVB, Silva SA da. Rapid immunochromatographic tests for the diagnosis of dengue: a systematic review and meta-analysis. Cad Saude Publ. 2020;36(6):e00225618. pmid:32520127
  20. Huhtamo E, Hasu E, Uzcátegui NY, Erra E, Nikkari S, Kantele A, et al. Early diagnosis of dengue in travelers: comparison of a novel real-time RT-PCR, NS1 antigen detection and serology. J Clin Virol. 2010;47(1):49–53. pmid:19963435
  21. Andries A-C, Duong V, Ly S, Cappelle J, Kim KS, Lorn Try P, et al. Value of routine dengue diagnostic tests in urine and saliva specimens. PLoS Negl Trop Dis. 2015;9(9):e0004100. pmid:26406240
  22. Utama IMS, Lukman N, Sukmawati DD, Alisjahbana B, Alam A, Murniati D, et al. Dengue viral infection in Indonesia: epidemiology, diagnostic challenges, and mutations from an observational cohort study. PLoS Negl Trop Dis. 2019;13(10):e0007785. pmid:31634352
  23. Saito N, Solante RM, Guzman FD, Telan EO, Umipig DV, Calayo JP, et al. A prospective observational study of community-acquired bacterial bloodstream infections in Metro Manila, the Philippines. PLoS Negl Trop Dis. 2022;16(5):e0010414. pmid:35613181
  24. Tanner L, Schreiber M, Low JGH, Ong A, Tolfvenstam T, Lai YL, et al. Decision tree algorithms predict the diagnosis and outcome of dengue fever in the early phase of illness. PLoS Negl Trop Dis. 2008;2(3):e196. pmid:18335069
  25. Park S, Srikiatkhachorn A, Kalayanarooj S, Macareo L, Green S, Friedman JF, et al. Use of structural equation models to predict dengue illness phenotype. PLoS Negl Trop Dis. 2018;12(10):e0006799. pmid:30273334
  26. da Silva Neto SR, Tabosa Oliveira T, Teixeira IV, Aguiar de Oliveira SB, Souza Sampaio V, Lynn T, et al. Machine learning and deep learning techniques to support clinical diagnosis of arboviral diseases: a systematic review. PLoS Negl Trop Dis. 2022;16(1):e0010061. pmid:35025860
  27. Gasem M, Kosasih H, Tjitra E, Alisjahbana B, Karyana M, Lokida D, et al. An observational prospective cohort study of the epidemiology of hospitalized patients with acute febrile illness in Indonesia. PLoS Negl Trop Dis. 2020;14(1):e0007927.
  28. Meinshausen N, Bühlmann P. Stability selection. J R Stat Soc Ser B: Stat Methodol. 2010;72(4):417–73.
  29. Cavailler P, Tarantola A, Leo YS, Lover AA, Rachline A, Duch M, et al. Early diagnosis of dengue disease severity in a resource-limited Asian country. BMC Infect Dis. 2016;16(1):512. pmid:27670906
  30. Ho T-S, Weng T-C, Wang J-D, Han H-C, Cheng H-C, Yang C-C, et al. Comparing machine learning with case-control models to identify confirmed dengue cases. PLoS Negl Trop Dis. 2020;14(11):e0008843. pmid:33170848
  31. Potts JA, Gibbons RV, Rothman AL, Srikiatkhachorn A, Thomas SJ, Supradish P-O, et al. Prediction of dengue disease severity among pediatric Thai patients using early clinical laboratory indicators. PLoS Negl Trop Dis. 2010;4(8):e769. pmid:20689812
  32. R Core Team. R: a language and environment for statistical computing; 2022. Available from: https://www.R-project.org/
  33. Kubat M, Matwin S. Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of the fourteenth international conference on machine learning; 1997. p. 179–86.
  34. Barandela R, Sánchez JS, García V, Rangel E. Strategies for learning in class imbalance problems. Pattern Recogn. 2003;36(3):849–51.
  35. Bickel S, Brückner M, Scheffer T. Discriminative learning under covariate shift. J Mach Learn Res. 2009;10(9).
  36. Gretton A, Smola A, Huang J, Schmittfull M, Borgwardt K, Schölkopf B. Covariate shift by kernel mean matching. In: Quiñonero-Candela J, Sugiyama M, Schwaighofer A, Lawrence N, editors. Dataset shift in machine learning. The MIT Press; 2008. p. 131–60.
  37. Saerens M, Latinne P, Decaestecker C. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Comput. 2002;14(1):21–41. pmid:11747533
  38. Storkey A. When training and test sets are different: characterizing learning transfer. In: Quiñonero-Candela J, Sugiyama M, Schwaighofer A, Lawrence N, editors. Dataset shift in machine learning. The MIT Press; 2008. p. 3–28.
  39. Lipton Z, Wang Y, Smola A. Detecting and correcting for label shift with black box predictors. In: Proceedings of the international conference on machine learning. PMLR; 2018. p. 3122–30.
  40. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference, and prediction. 2nd ed. Springer Science & Business Media; 2009.
  41. McLeod A, Xu C, Lai Y. bestglm: best subset GLM and regression utilities; 2020. Available from: https://CRAN.R-project.org/package=bestglm
  42. Therneau T, Atkinson B. rpart: recursive partitioning and regression trees; 2022. Available from: https://CRAN.R-project.org/package=rpart
  43. Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F. e1071: misc functions of the department of statistics, probability theory group (formerly: E1071). TU Wien; 2023. Available from: https://CRAN.R-project.org/package=e1071