
Weekly dengue forecasts in Iquitos, Peru; San Juan, Puerto Rico; and Singapore

  • Corey M. Benedum ,

    Roles Conceptualization, Formal analysis, Methodology, Software, Validation, Writing – original draft, Writing – review & editing

    corey.benedum@gmail.com

    Affiliations Draper, Cambridge, Massachusetts, United States of America, Department of Epidemiology, Boston University School of Public Health, Boston, Massachusetts, United States of America

  • Kimberly M. Shea,

    Roles Conceptualization, Methodology, Supervision, Writing – original draft, Writing – review & editing

    Affiliation Department of Epidemiology, Boston University School of Public Health, Boston, Massachusetts, United States of America

  • Helen E. Jenkins,

    Roles Conceptualization, Methodology, Supervision, Writing – original draft, Writing – review & editing

    Affiliation Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, United States of America

  • Louis Y. Kim,

    Roles Conceptualization, Methodology, Supervision, Writing – original draft, Writing – review & editing

    Affiliation Draper, Cambridge, Massachusetts, United States of America

  • Natasha Markuzon

    Roles Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing

    Affiliation Draper, Cambridge, Massachusetts, United States of America


Abstract

Background

Predictive models can serve as early warning systems and can be used to forecast future risk of various infectious diseases. Conventionally, regression and time series models are used to forecast dengue incidence, using dengue surveillance (e.g., case counts) and weather data. However, these models may be limited in terms of model assumptions and the number of predictors that can be included. Machine learning (ML) methods are designed to work with a large number of predictors and thus offer an appealing alternative. Here, we compared the performance of ML algorithms with that of regression models in predicting dengue cases and outbreaks 4 to 12 weeks in advance. Because many countries lack sufficient health surveillance infrastructure, we also evaluated the contribution of dengue surveillance and weather data to the predictive power of these models.

Methods

We developed ML, regression, and time series models to forecast weekly dengue case counts and outbreaks in Iquitos, Peru; San Juan, Puerto Rico; and Singapore from 1990 to 2016. Forecasts were generated using available weekly dengue surveillance and weather data. We evaluated the agreement between model forecasts and actual dengue observations using Mean Absolute Error (MAE) and the Matthews Correlation Coefficient (MCC).

Results

For near-term predictions of weekly case counts made with surveillance data, ML models had 21% and 33% less error than regression and time series models, respectively. However, using weather data only, ML models did not demonstrate a practical advantage. When forecasting weekly dengue outbreaks 12 weeks in advance, ML models achieved a maximum MCC of 0.61.

Conclusions

Our results identified 2 scenarios in which ML models are advantageous over regression models: 1) predicting weekly dengue case counts 4 weeks ahead when dengue surveillance data are available and 2) predicting weekly dengue outbreaks 12 weeks ahead when dengue surveillance data are unavailable. Given the advantages of ML models, dengue early warning systems may be improved by the inclusion of these models.

Author summary

Accurate and timely forecasts of dengue fever can help mitigate the impact of the disease. Currently, regression and time series models are frequently used to predict dengue cases and outbreaks. However, these models may be limited in terms of model assumptions and the number of predictors that can be included. Machine learning (ML) models offer an appealing alternative as they have a nonlinear framework and can be applied to high dimensional data. In this study, we compared the performance of ML algorithms with that of regression and time series models in predicting dengue cases and outbreaks 4 to 12 weeks in advance in 3 dengue-endemic regions. Model predictions were based upon local dengue surveillance (e.g., case counts), population, temporal, and weather data. Because many countries lack sufficient health surveillance infrastructure, we also evaluated the contribution of dengue surveillance and weather data to the predictive power of the models. Our results identified 2 scenarios in which ML models performed better than conventional models: 1) predicting weekly dengue case counts 4 weeks ahead when dengue surveillance data are available and 2) predicting weekly dengue outbreaks 12 weeks ahead when dengue surveillance data are unavailable. This research suggests that ML models can be a beneficial tool for dengue early warning systems.

Introduction

Dengue fever, a mosquito-borne disease, poses a significant public health concern due to its re-emergence in tropical and sub-tropical regions [1]. In many countries where dengue is present, the disease is endemic. Globally, researchers estimate that dengue infects 390 million people per year [2]; however, only 50–100 million cases are detected due to the high asymptomatic rate [1–6]. Estimating dengue burden can be problematic due to delays in case identification, strong intra- and inter-annual variation in incidence, and the majority of cases being clinically mild or asymptomatic [7–10]. As a result, implementing effective vector control operations can be challenging [11]. To overcome these issues, the development of accurate and timely early warning systems capable of predicting future dengue incidence that do not depend upon current dengue case data remains an active area of research [5].

Several modeling approaches have been evaluated as early warning models for various infectious diseases. Time series and regression models are commonly used but have had various levels of success [5,7,12–28]. These models offer a robust and easily interpretable framework; however, these approaches can be limited by the underlying model assumptions (e.g., linear relationships between predictors and outcome) and the number of predictors that can be included [29,30]. Mechanistic models, which model individual components of a dynamic system, have accurately described outbreaks of influenza and mosquito-borne diseases [31–37]; yet, the data required to parameterize these models are difficult to obtain, and the necessary model assumptions (e.g., disease infectivity) may not be clear until after the outbreak [7]. Ensemble approaches, which integrate multiple forecasting methods, have performed well and lately have received increased interest. Using dengue and climate data from Iquitos and San Juan, Buczak et al. [38] developed an ensemble of 300 models, which included Method of Analogs and Holt-Winters models, to predict various characteristics of dengue outbreaks (e.g., peak week, peak week incidence, and total cases in a season). The approach employed by Buczak et al. performed well when predicting peak week and total number of cases in a season but had difficulty predicting when the peak week would occur [38]. Yamana et al. [39] integrated multiple models, including a mechanistic model among others, to forecast dengue incidence in San Juan. In this study, the authors used Bayesian model averaging to integrate model results and found that the ensemble approach outperformed each of the individual models.

In contrast to the previously described approaches, machine learning (ML) models offer an appealing alternative and have already been used to successfully predict infectious disease case counts and outbreaks [40–46]. Similar to mechanistic models, ML models have a nonparametric and nonlinear modeling structure, but unlike regression and mechanistic models, ML approaches are independent of a priori specification of variable relationships and can accommodate high dimensional data. Additionally, several ML models employ an ensemble framework to improve model accuracy. Though ML models have demonstrated good accuracy, the performance of these models has typically not been compared with the performance of more conventional approaches [45].

Regardless of the selected statistical framework, dengue prediction models typically use 2 types of inputs: a measure of prior dengue case counts and local weather conditions [7]. Prior dengue case counts are included because there is a strong relationship between current and subsequent levels of dengue, given the infectious nature of the disease. A weather component is included to describe how short-term changes in atmospheric conditions affect dengue vectors, hosts, and the infectious agent itself. In the case of dengue, rainfall plays an integral role in creating suitable breeding conditions for its vector, the Aedes mosquito [47–49]. Temperature is also known to affect larval development, adult biting behavior, and the replication rate of the dengue virus [4,47,50–53]. Likewise, humidity improves egg longevity by preventing environmental desiccation [54–56].

In this study, we developed models using dengue surveillance (e.g., case counts), population, and weather data from 3 dengue-endemic locations to predict dengue case counts and outbreaks (i.e., where the number of reported cases exceeded a predefined threshold) 4 to 12 weeks in advance. We selected these 2 outcomes because case counts are an objective prediction measure where uncertainty can be easily quantified, while weekly outbreak occurrence is more relevant within the context of public health decision making [7]. We used forecast horizons of 4 and up to 12 weeks to develop models that can provide real-time updates and to provide timely warnings to give governmental authorities adequate response time, respectively [11]. We then used these models to examine 3 questions: (1) “How well do ML models (i.e., Random Forest [RF] and Random Forest-Univariate Flagging Algorithm [RF-UFA]) forecast dengue, relative to commonly used prediction models (i.e., Poisson regression, Logistic regression, and autoregressive integrated moving average [ARIMA] models)?” (2) “How is model accuracy impacted by the availability–or lack of–current dengue surveillance data?” and (3) “Among data used in our models, what were the strongest predictors of the weekly number of reported dengue cases?”

Materials and methods

Study areas

We predicted dengue case counts and outbreaks in 3 endemic locations: Iquitos, Peru; San Juan, Puerto Rico; and Singapore. Iquitos is a geographically isolated port city located on the Amazon River with a population of approximately 400,000 people [57,58]. Rainfall occurs year round and is heaviest between November and May. The mean daily temperatures of the coolest and hottest months are 25.6°C and 27.5°C, respectively. San Juan is the capital and largest city in Puerto Rico. It is located on the Northeastern coast of the island, and has an approximate population of 400,000 people. Rainfall primarily occurs between April and November, leaving the other months relatively dry. The mean daily temperatures of the coolest and hottest months are 25.3°C and 28.7°C, respectively. Singapore is a city state off the Southern-most tip of the Malay Peninsula, and has approximately 5.6 million inhabitants. Rainfall is heaviest during the Northeast monsoon season, which typically occurs from November to March [59]. A second drier monsoonal period occurs between June and October. The mean daily temperatures of the coolest and hottest months in Singapore are 26.5°C and 28.4°C, respectively.

Dengue surveillance data, predictors, and outcomes

Weekly dengue case counts for Iquitos were available between June 2000 and June 2013 from a passive surveillance network representing approximately 40% of the Iquitos population [57,60,61]. Weekly case counts for San Juan were available from April 1990 to April 2013 and were ascertained from a combination of active and passive surveillance systems [62]. All confirmed dengue cases, regardless of severity, were reported in Iquitos and San Juan. Further, when the number of samples exceeded local testing capacity, the number of positive cases among those not tested was estimated by multiplying the number of untested cases by the rate of laboratory-positive cases amongst those that were tested [57,60,62]. In both locations, all dengue and DHF cases were reported together. For Singapore, weekly dengue and DHF cases [63] were reported separately and were available between January 2000 and December 2016 from the Ministry of Health. Dengue is a nationally notifiable disease in Singapore, meaning that all clinically diagnosed and laboratory-confirmed cases must be reported to the Ministry of Health within 24 hours [28,63]. Clinically confirmed cases were then confirmed with serologic or virologic testing by the Ministry of Health. Data from each of the 3 study locations are publicly available [61,64].

Using weekly case counts, we created surveillance-based predictors for our models (S1 Table). We summarized observed dengue case counts with weekly and cumulative totals starting from the beginning of the year. We also summarized the annual number of dengue cases in the past 1 to 3 years.

These data also served as the prediction outcomes, “weekly case counts” and “weekly outbreaks.” We created the binary outcome variable, weekly outbreaks, to indicate whether or not weekly case counts exceeded a predefined threshold. For this study, the outbreak threshold was set at 1.5 standard deviations above the mean weekly reported cases. The outbreak indicator is defined as:

\[ \mathrm{Outbreak}_T = \begin{cases} 1, & \mathrm{Cases}_T > \text{Outbreak threshold} \\ 0, & \text{otherwise} \end{cases} \tag{1} \]

where $\mathrm{Cases}_T$ is the number of reported dengue cases for week “T” (the week of interest). The outbreak threshold was defined as:

\[ \text{Outbreak threshold} = \overline{\mathrm{Cases}}_{\mathrm{training}} + 1.5\sqrt{\frac{\sum_{T=1}^{n}\left(\mathrm{Cases}_{T,\mathrm{training}} - \overline{\mathrm{Cases}}_{\mathrm{training}}\right)^{2}}{n-1}} \tag{2} \]

where $\mathrm{Cases}_{T,\mathrm{training}}$ is the number of cases reported for week T in the training data (a subset of the study data, including outcome [e.g., weekly outbreaks] and predictor variables [e.g., 7-day average temperature, 7-day average absolute humidity], used to develop the predictive model and discussed in more detail in section Prediction approach); $\overline{\mathrm{Cases}}_{\mathrm{training}}$ is the average weekly case count in the training data; and n is the number of observed weeks in the training data.
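As a minimal sketch of the outbreak definition described above, the following Python snippet derives the threshold from training weeks only and then labels later weeks. It assumes the standard deviation is the sample standard deviation (n − 1 denominator), which the text does not state; the function names and the weekly counts are hypothetical.

```python
from statistics import mean, stdev


def outbreak_threshold(training_cases, k=1.5):
    """Training mean plus k sample standard deviations.

    Assumption: stdev() uses an n - 1 denominator; the source text
    only says "1.5 standard deviations above the mean".
    """
    return mean(training_cases) + k * stdev(training_cases)


def label_outbreaks(cases, threshold):
    """Return 1 for weeks whose case count exceeds the threshold, else 0."""
    return [1 if c > threshold else 0 for c in cases]


# Hypothetical weekly case counts, for illustration only.
training = [2, 4, 3, 5, 30, 4, 2, 6]
testing = [3, 25, 7, 40]

threshold = outbreak_threshold(training)
labels = label_outbreaks(testing, threshold)
print(round(threshold, 2), labels)
```

Computing the threshold from the training weeks alone keeps information from the withheld test weeks out of the outcome definition.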

Population data and predictors

We used government data to generate population estimates for each study area. Population estimates for the Iquitos metropolitan area (2000–2014) and the San Juan-Carolina-Caguas Metropolitan Statistical Area (1990, 1999–2014) came from the National Statistical Institute of Peru and the U.S. Census Bureau, respectively [61]. Population estimates for Singapore (2000–2016) were obtained from the Ministry of Trade and Industry, Department of Statistics and are derived from registry-based administrative data [65,66]. For years without population estimates, we imputed the missing data with a linear regression model in which total population was regressed on year.
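The imputation step can be sketched as an ordinary least-squares line of population against year, evaluated at the years with no estimate. This is an illustration under that single-predictor assumption; the function names and population figures are hypothetical and are not the study's data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct years")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_population(known, years):
    """Keep known estimates; fill the remaining years from the fitted line."""
    slope, intercept = fit_line(list(known.keys()), list(known.values()))
    return {y: known.get(y, slope * y + intercept) for y in years}


# Hypothetical population estimates (thousands) with a gap after 1990.
known = {1990: 100.0, 1999: 190.0, 2000: 200.0}
filled = impute_population(known, range(1990, 2001))
print(round(filled[1995], 1))
```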

Singapore is unique in that it has a highly mobile population with large influxes of travelers. To account for the variation in nonresidents, we identified government data detailing the monthly number of air passenger arrivals at Changi Airport (1999–2016) [67]. With these population and air travel data we created additional predictor variables for our models (S2 Table).

Temporal predictors

Inter- and intra-annual variations in dengue cases have been observed across the globe, providing evidence for multi-year periodicity, which has been estimated to be approximately 3 years [57,68–70]. To account for the temporal variation in dengue cases, we summarized time by including the month in which the week of interest occurs and 1 to 4 year periodic components as predictor variables (S2 Table). The periodic components were the sine and cosine functions described below:

\[ \sin\left(\frac{2\pi t}{12a}\right) \tag{3} \]

\[ \cos\left(\frac{2\pi t}{12a}\right) \tag{4} \]

where t is the number of months since the start of the study period and a is the inter-annual period length in years.
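A short sketch of how such periodic predictors could be generated follows. It assumes the components take the form sin(2πt / 12a) and cos(2πt / 12a), so that a period of a years spans 12a months; the function and feature names are hypothetical.

```python
import math


def periodic_components(t_months, period_years):
    """Sine and cosine terms for a cycle lasting `period_years` years.

    `t_months` is the number of months since the start of the study
    period, so one full cycle spans 12 * period_years months.
    """
    angle = 2 * math.pi * t_months / (12 * period_years)
    return math.sin(angle), math.cos(angle)


# Features for 1 to 4 year periods at month 18 of the study period.
features = {
    f"{name}_{a}yr": value
    for a in (1, 2, 3, 4)
    for name, value in zip(("sin", "cos"), periodic_components(18, a))
}
print(features)
```

Using the sine and cosine together lets a model represent a cycle of any phase, which neither term can do alone.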

Weather data and predictors

We ascertained daily temperature, humidity, and rainfall summaries (i.e., averages, minimums, maximums, and totals) from the National Oceanic and Atmospheric Administration and the National Environment Agency, Singapore (Table 1). We obtained weather measurements from weather stations, remotely sensed imagery, and meteorological reanalysis to account for the various strengths and limitations of each data source (see Weather data limitations in Supplemental S1 Text for a brief overview of these limitations) [12,71–77]. Daily weather summaries obtained from remotely sensed imagery and meteorological reanalysis were collected from the gridded cell surrounding the weather station used for each study area. We collected daily weather summaries from January 1999 to March 2014 for Iquitos, January 1989 to April 2013 for San Juan, and January 1999 to December 2016 for Singapore.

We created weather-based predictors for our models (S3 Table) by aggregating daily weather summaries into multi-day and multi-week summaries. Temperature and humidity predictors included 7-, 14-, 21-, and 28-day moving averages and standard deviations. As temperature alone does not account for the optimal temperature ranges for the Aedes mosquito and may not accurately represent the temperature-dengue relationship, we created additional temperature predictors based upon the Temperature Suitability Index (TSI) [78]. Rainfall predictors included 7-, 14-, 21-, and 28-day moving averages, standard deviations, and total number of days with any recorded rainfall. We also summarized daily total rainfall for cumulative periods of 1 to 20 weeks. Since the effect of rainfall on mosquito abundance has been found to differ across seasons [70,79], we created additional rainfall predictors that summarized daily total rainfall for cold, warm, and hot periods, which were based upon average daily temperature and the extreme minimum and maximum TSI thresholds [70,78].
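The moving-window summaries described above can be sketched as follows. This is an illustration only: it assumes trailing windows that end on the most recent day and a sample standard deviation, and the helper names and daily values are hypothetical.

```python
from statistics import mean, stdev


def trailing_summary(daily_values, window):
    """Mean and sample SD of the most recent `window` daily values."""
    if len(daily_values) < window:
        raise ValueError("not enough daily observations for this window")
    recent = daily_values[-window:]
    return mean(recent), stdev(recent)


def rainy_days(daily_rain, window):
    """Number of days with any recorded rainfall in the trailing window."""
    return sum(1 for r in daily_rain[-window:] if r > 0)


# Hypothetical daily mean temperatures (deg C) for the last 28 days.
daily_temp = [26.0 + (d % 7) * 0.5 for d in range(28)]
temp_features = {w: trailing_summary(daily_temp, w) for w in (7, 14, 21, 28)}

# Hypothetical daily rainfall totals (mm) for the last 7 days.
wet_days = rainy_days([0.0, 0.0, 1.2, 0.0, 3.4, 0.0, 0.0], 7)
print(temp_features[7], wet_days)
```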

Missing weather data

We observed missing daily weather measurements in each area due to non-reporting or instrument failure (S1 Fig). We imputed missing weather data using multiple imputation by chained equations with the MICE R package [80]. For this study, we created 10 imputation sets which we then averaged to obtain a final value for each missing observation [81].

Prediction approach

In our analysis, we developed models to predict dengue case counts and outbreaks based upon the temporal variation in dengue activity, regional population, and weather. Fig 1 reflects the general framework used in this study for developing predictive ML (i.e., RF, RF-UFA) and regression-based models (i.e., Poisson regression, Logistic regression) using historical and near-real time data as input. In our approach, we trained (i.e., fit to data) models with a subset of the study data (i.e., training data) and evaluated the accuracy of model forecasts on the last 4 years’ worth of data (i.e., testing data) that had been withheld during model training. We evaluated each model on 1 year’s worth of testing data at a time and in chronological order. After model evaluation, the test set was added to the training data and the process was repeated for the subsequent year of test data. This resulted in each model being redeveloped and retrained for each year of testing data. Each model made 4 and 12 week prospective forecasts from week “T” using the previous 26 weeks of predictor data (T-1, T-2, …, T-26).
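The year-by-year evaluation loop described above amounts to an expanding training window. The sketch below shows only how the splits could be enumerated, assuming each test block is identified by a single year label; the function name and the year labels are hypothetical.

```python
def expanding_window_splits(years, n_test_years=4):
    """Yield (training_years, test_year) pairs in chronological order.

    The last `n_test_years` years are withheld; after each evaluation
    the evaluated year joins the training set for the next round.
    """
    ordered = sorted(set(years))
    if len(ordered) <= n_test_years:
        raise ValueError("need more years than the number of test years")
    first_test = len(ordered) - n_test_years
    for i in range(first_test, len(ordered)):
        yield ordered[:i], ordered[i]


# Hypothetical study period of 2000 through 2016.
splits = list(expanding_window_splits(range(2000, 2017)))
for train_years, test_year in splits:
    print(f"train {train_years[0]}-{train_years[-1]}, test {test_year}")
```

Because a model is refitted inside each split, no test year ever contributes to the model that forecasts it.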

Fig 1. General framework to develop RF and regression prediction models.

To assess how each model’s predictive accuracy was affected by the lack of current dengue surveillance data, we trained models to predict dengue case counts and outbreaks using only population, temporal, and weather predictor variables. We compared the performance of these models with the performance of the same models when surveillance data inputs were included.

https://doi.org/10.1371/journal.pntd.0008710.g001

For each trained ML and regression model, we analyzed the predictor variables and assessed their importance. The variable analysis allowed us to (1) identify the strongest predictors of dengue case counts for each study area and (2) perform variable reduction, a conventional approach to improve model accuracy. During variable reduction, we removed weak and non-informative predictors by ranking each variable according to the variable’s measure of importance, which is defined later. After ranking each variable, we removed all non-informative variables and selected the top 1%, 5%, and 10% most important variables. We then retrained each model using the 3 subsets of predictors and evaluated the predictive accuracy of these models. This process was performed for each test set.
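The ranking-and-selection step can be sketched as below. Two details are assumptions, since the text does not specify them: a variable is treated as non-informative when its importance is zero or negative, and the percentage is taken over the informative variables that remain. The function name and importance scores are hypothetical.

```python
import math


def select_top_fraction(importance, fraction):
    """Return the names of the top `fraction` of informative variables.

    Assumptions: importance <= 0 means non-informative, and at least
    one variable is kept whenever any informative variable exists.
    """
    informative = {name: score for name, score in importance.items() if score > 0}
    ranked = sorted(informative, key=informative.get, reverse=True)
    if not ranked:
        return []
    # round() guards against floating-point noise before taking the ceiling.
    k = max(1, math.ceil(round(fraction * len(ranked), 9)))
    return ranked[:k]


# Hypothetical importance scores for 21 predictors (v0 has no importance).
importance = {f"v{i}": float(i) for i in range(21)}
subsets = {f: select_top_fraction(importance, f) for f in (0.01, 0.05, 0.10)}
print(subsets)
```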

For this study, all models and statistical analyses were implemented in the R programming environment, version 3.3.3 [82].

Predicting weekly outbreaks

We observed substantial imbalance in the proportion of outbreak and non-outbreak weeks for each study area. Class imbalance can cause a predictive model to classify all predictions as the same class in an effort to maximize model accuracy, resulting in an uninformative model. To overcome the limitation of class imbalance [83], we trained the models on a “balanced” dataset where we under-sampled non-outbreak observations to create a 1:1 ratio of outbreak to non-outbreak observations in the training set. To account for sampling variability, we created 500 training sets which we used to train each model and averaged the predictions. Additionally, we optimized model performance by selecting the classification threshold (i.e., the minimum prediction value required for an observation to be classified as “outbreak”) that maximized model performance.
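The balancing step can be illustrated with a short sketch that draws one balanced training set by under-sampling non-outbreak weeks and then averages predictions across repeated draws. It assumes outbreak weeks are the minority class; the function names, seed, and toy data are hypothetical, and the model-fitting step itself is omitted.

```python
import random


def balanced_sample(rows, labels, rng):
    """Under-sample non-outbreak rows to a 1:1 ratio with outbreak rows."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if len(neg) < len(pos):
        raise ValueError("expected non-outbreak weeks to be the majority class")
    keep = pos + rng.sample(neg, len(pos))
    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]


def average_predictions(prediction_sets):
    """Element-wise mean of predictions from models trained on each sample."""
    n = len(prediction_sets)
    return [sum(column) / n for column in zip(*prediction_sets)]


rng = random.Random(0)  # fixed seed so the illustration is repeatable
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
rows = list(range(len(labels)))  # stand-ins for predictor rows
sample_rows, sample_labels = balanced_sample(rows, labels, rng)
averaged = average_predictions([[0.2, 0.8], [0.4, 0.6]])
print(sample_labels, averaged)
```

In the study's setting, the draw would be repeated 500 times, with one model fitted per draw before averaging.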

Machine learning models

In our study, we used RF to predict weekly case counts and weekly outbreaks and RF-UFA to predict weekly outbreaks only. RF is an ensemble ML algorithm based upon decision trees and has been previously used to analyze time series data [40,45,84]. RF-UFA is an extension of the RF algorithm where the Univariate Flagging Algorithm (UFA) is used to transform continuous predictors into binary predictors [85]. UFA transforms continuous predictors by identifying an optimal threshold that is associated with a statistically significant (p ≤ 0.01) higher (“high-risk”) or lower (“low-risk”) risk of the outcome. All RF models were fitted with the randomForest R package [86]. A more detailed explanation of both models is available in Supplemental S1 Text “Overview of machine learning models.”

Regression models

We used 2 types of generalized linear regression models in our study: Poisson regression to predict weekly case counts and Logistic regression to predict weekly outbreaks. Unlike RF, regression models are not well suited for high dimensional data analysis and require additional measures to prevent overfitting. To minimize this risk, we used the Least Absolute Shrinkage and Selection Operator (LASSO) algorithm [87–89]. We identified the optimal penalty parameter using 10-fold cross validation, selecting the parameter that minimized the cross validation mean absolute error (MAE) for Poisson regression models and the misclassification error rate for Logistic regression models [89]. All Poisson regression and Logistic regression models were implemented with the glmnet R package [90].

Time series model

We developed an autoregressive integrated moving average (ARIMA) model to forecast weekly dengue case counts in each study location. As ARIMA models cannot be applied to high dimensional data, model predictions were based upon the time series of observed case counts only. In this study, we also evaluated seasonal ARIMA (SARIMA) models and found that the added seasonal component did not consistently improve model performance; as such, we do not present the results of the SARIMA model.

The ARIMA model parameters were identified by finding the parameters that resulted in the best fit of the training data. To identify the best fitting parameters, we performed a stepwise search and selected the parameters which minimized the model Akaike Information Criterion (AIC). The ARIMA model was implemented using the forecast R package [91].

Variable importance

Variable importance is a measure of how much a single variable contributes to the overall predictive accuracy of a model. For RF-based models, we ranked variables according to their “percentage increase in mean squared error” when predicting weekly case counts and by their “mean decrease in accuracy” when predicting weekly outbreaks [92]. Both metrics measure how much error would be introduced into the model’s predictions if the variable were to be removed from the model. For Poisson regression and Logistic regression, we ranked variables according to the absolute value of the standardized coefficient, a conventional ranking approach for regression models [93].

Model evaluation

We evaluated the performance of each model with the withheld testing data. To quantify model accuracy, we selected accuracy metrics that measure how well model predictions approximate observed outcomes. When predicting weekly case counts, we used mean absolute error (MAE), which measures how far a prediction deviates from the observed outcome. The MAE is defined as follows:

\[ \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{5} \]

where n is the number of observations, $y_i$ is the observed number of dengue cases for week i, and $\hat{y}_i$ is the predicted number of dengue cases for week i. The MAE is considered to be an unbiased estimator because it only considers the variance and not the magnitude of the errors [45]. Since the magnitude of reported dengue cases varied widely by study area, we also report the normalized MAE (nMAE). The nMAE provides an estimate of the prediction error relative to the average number of weekly cases in the testing data and allows for better comparisons of model accuracy between study areas and forecast horizons. We calculated the nMAE by dividing the MAE by the average weekly number of dengue cases. The nMAE is defined as follows:

\[ \mathrm{nMAE} = \frac{\mathrm{MAE}}{\frac{1}{n}\sum_{i=1}^{n} y_i} \tag{6} \]

where n is the number of observations, $y_i$ is the observed number of dengue cases for week i, and MAE is the mean absolute error. The best value that can be obtained for both MAE and nMAE is 0, while the worst value is unbounded.
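Both error metrics follow directly from their verbal definitions, as the sketch below illustrates: MAE is the mean of the absolute forecast errors, and nMAE divides that by the mean observed weekly count. The function names and the observed and predicted counts are hypothetical.

```python
def mae(observed, predicted):
    """Mean absolute error between observed and predicted weekly counts."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(y - p) for y, p in zip(observed, predicted)) / len(observed)


def nmae(observed, predicted):
    """MAE divided by the mean observed weekly count."""
    mean_observed = sum(observed) / len(observed)
    if mean_observed == 0:
        raise ValueError("nMAE is undefined when the mean observed count is 0")
    return mae(observed, predicted) / mean_observed


# Hypothetical observed and forecast weekly case counts.
observed = [10, 20, 30, 40]
predicted = [12, 18, 33, 35]
print(mae(observed, predicted), nmae(observed, predicted))
```

Here the absolute errors are 2, 2, 3, and 5, giving an MAE of 3 cases per week; with a mean of 25 observed cases per week, the nMAE is 0.12.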

For models forecasting weekly outbreaks, we quantified how well model predictions approximated observed outcomes with the Matthews Correlation Coefficient (MCC) [94]. MCC measures the correlation between a binary outcome and prediction and, unlike other measures, is insensitive to class imbalance [95,96]. MCC is defined as follows:

\[ \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \tag{7} \]

where TP is the number of true positives; TN is the number of true negatives; FP is the number of false positives; and FN is the number of false negatives. The best value that can be obtained for MCC is +1, while the worst value is -1.
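A minimal sketch of the MCC computation from binary outbreak labels is given below. Returning 0 when the denominator is zero is a common convention rather than something stated in the text; the function name and the labels are hypothetical.

```python
import math


def mcc(observed, predicted):
    """Matthews correlation coefficient for binary (0/1) labels."""
    if len(observed) != len(predicted):
        raise ValueError("inputs must be of equal length")
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention (assumption) when a row or column sum is zero
    return (tp * tn - fp * fn) / denominator


# Hypothetical observed and predicted outbreak weeks.
observed = [1, 1, 0, 0, 0, 0, 0, 1]
predicted = [1, 0, 0, 0, 1, 0, 0, 1]
score = mcc(observed, predicted)
print(round(score, 3))
```

In this example TP = 2, TN = 4, FP = 1, and FN = 1, so the MCC is (8 − 1) / 15, or about 0.467.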

Results

Weekly dengue case counts for each study area are presented in Fig 2. We observed substantial inter-annual variation as well as wide ranges in the number of weekly reported cases during the observational periods by study area. Reported weekly case counts ranged from 0 to 116 in Iquitos, 0 to 461 in San Juan, and 3 to 888 in Singapore. The average number of weekly cases varied greatly by study area as well. The average number of weekly cases was 7.57, 38.84, and 115.96 for Iquitos, San Juan, and Singapore, respectively. In 2013, we observed a notable increase in the number of reported dengue cases in Singapore, which was the result of a large dengue outbreak throughout all of Southeast Asia [97–100].

Fig 2. Weekly observations of reported dengue cases by study area.

In this figure, left-hand panels (red curves) represent training data, while right-hand panels (blue curves) represent the testing data.

https://doi.org/10.1371/journal.pntd.0008710.g002

In our study, we developed multiple ML, regression-based, and time series models under various data availability and forecast horizon settings. Since the objective of this study was to compare ML (i.e., RF and RF-UFA) models with conventional forecasting models, we only describe the results for models with the best performance under each data-forecast horizon scenario. In our evaluation, models with the smallest nMAE or largest MCC were defined as the best performing models.

Forecasting dengue case counts

In Iquitos (4 week ahead forecasts: Fig 3; 12 week ahead forecasts: S2 Fig), neither the RF nor the Poisson regression model fully captured the sharp increase in dengue cases in 2011. Interestingly, during the typical peak dengue period, the predictions made by the Poisson regression model had the highest level of uncertainty, as demonstrated by the wide confidence intervals. Unlike the Poisson regression model’s predictions, RF model forecasts had small confidence intervals regardless of the transmission period (peak or non-peak season). Forecasts made by the ARIMA model (S3 Fig) typically captured the transmission dynamics (i.e., increased cases during the peak season and fewer cases during the low dengue season); however, ARIMA model forecasts varied only marginally from year to year, indicating an inability to differentiate between large and small epidemics.

Fig 3. 4 week forecast accuracy of the temporal pattern of dengue case counts, Iquitos, Peru, June 2009–June 2013.

Observed weekly case counts (black area) are compared with 4 week ahead forecasts made by Random Forest and Poisson regression models. Dotted lines represent 95% confidence intervals around the model’s prediction. RF model standard errors were estimated using the infinitesimal jackknife for bagging approach [101].

https://doi.org/10.1371/journal.pntd.0008710.g003

In San Juan, both RF and Poisson models captured the general trend in dengue case counts regardless of the inclusion of surveillance data (Fig 4). When surveillance data were included, both RF and Poisson model forecasts were closer to observed case counts than when surveillance data were not included. As was observed in Iquitos, Poisson model forecasts showed higher levels of uncertainty, especially during the peak dengue period. Confidence intervals around RF model forecasts remained consistent throughout the testing period. When forecasting dengue cases 12 weeks in advance (S4 Fig), RF and Poisson regression models again reflected the general trends in dengue cases. Similarly, the ARIMA model (S5 Fig) at times captured the general dynamics; however, there were several occasions where the model predicted increases in dengue cases several weeks after the observed peak week.

Fig 4. 4 week forecast accuracy of the temporal pattern of dengue case counts, San Juan, Puerto Rico, April 2009–April 2013.

Observed weekly case counts (black area) are compared with 4 week ahead forecasts made by Random Forest and Poisson regression models. Dotted lines represent 95% confidence intervals around the model’s prediction. RF model standard errors were estimated using the infinitesimal jackknife for bagging approach [101].

https://doi.org/10.1371/journal.pntd.0008710.g004

In Singapore (Fig 5), when surveillance data were included in the model, RF and Poisson regression 4 week ahead predictions did not reflect the general trend in dengue cases for the first 2 sets of testing data (2013 and 2014). In the last 2 test sets (2015 and 2016), 4 week ahead forecasts for both RF and Poisson regression captured the general trend in dengue cases, suggesting that the training data were not representative of the first 2 test sets (2013 and 2014). A similar trend was also observed for the ARIMA model (S6 Fig), which was unable to capture the general trend in the first 2 test sets (2013 and 2014) but improved in the last 2 test sets (2015 and 2016). When surveillance data were removed, both RF and Poisson regression models performed poorly. When model forecasts were extended to 12 weeks in advance (S7 Fig), both RF and Poisson regression performed poorly, even when the model inputs included surveillance data. Similarly, the ARIMA model’s 12 week ahead forecasts did not reflect the general trend in dengue cases.

Fig 5. 4 week forecast accuracy of the temporal pattern of dengue case counts, Singapore, January 2013–December 2016.

Observed weekly case counts (black area) are compared with 4 week ahead forecasts made by Random Forest and Poisson regression models. Dotted lines represent 95% confidence intervals around the model’s prediction. RF model standard errors were estimated using the infinitesimal jackknife for bagging approach [101].

https://doi.org/10.1371/journal.pntd.0008710.g005

Table 2 summarizes the nMAE and MAE of the residuals between observed weekly dengue case counts and model predictions for the optimal RF, Poisson regression, and ARIMA models by study area and the data used to make the predictions (results for all evaluated models are available in S4 Table). When the evaluated models predicted dengue cases 4 weeks ahead and surveillance data were included, RF forecasts were more accurate than those of both Poisson regression and ARIMA models. We estimated RF nMAEs as 0.87, 0.27, and 0.40 in Iquitos, San Juan, and Singapore, respectively. On average, RF forecasts had 21% and 33% less error than Poisson regression and ARIMA models, respectively. As model performance may differ by dengue season, we also evaluated model accuracy during peak and non-peak dengue periods [102–104]. During peak dengue season (S5 Table), the RF model had less error than Poisson regression and ARIMA models in San Juan (RF nMAE: 0.22) and Singapore (RF nMAE: 0.37). In Iquitos, the ARIMA model had the least error (ARIMA nMAE: 0.70). During the non-peak dengue season (S6 Table), Poisson regression had the least error in Iquitos (Poisson nMAE: 0.91), while RF had the smallest nMAE in San Juan (RF nMAE: 0.37). In Singapore, RF and Poisson regression had identical nMAEs (0.43).

Table 2. Optimal model performance when predicting weekly dengue case counts.

https://doi.org/10.1371/journal.pntd.0008710.t002

We also evaluated each model’s ability to make long-term (12 week ahead) forecasts of dengue case counts. Compared with RF and Poisson regression, ARIMA had a smaller nMAE in Iquitos and Singapore (0.85 and 0.40, respectively). However, in San Juan, RF (nMAE: 0.48) had less error than Poisson regression (nMAE: 0.59) and ARIMA (nMAE: 1.16). We observed similar trends in performance during the peak dengue season (S5 Table). During non-peak dengue season (S6 Table), RF was more accurate than Poisson regression and ARIMA in Iquitos and San Juan (Iquitos RF nMAE: 1.34; San Juan RF nMAE: 0.59). In Singapore, ARIMA performed better than both RF and Poisson regression (ARIMA nMAE: 0.43).

To understand how model accuracy is affected when current surveillance data are unavailable, we retrained models using only population, temporal, and weather data inputs. We found that for near term forecasts RF nMAEs were equal to 0.96, 0.59, and 0.61 in Iquitos, San Juan and Singapore respectively. We observed Poisson regression nMAEs equal to 0.88, 0.50 and 0.58 in Iquitos, San Juan, and Singapore. During peak dengue season RF had the least amount of error in Iquitos (RF: 0.80; Poisson: 0.89) and Singapore (RF: 0.57; Poisson: 0.62) but more error than the Poisson regression model in San Juan (RF: 0.58; Poisson: 0.45). During the non-peak season, the Poisson regression model had smaller or similar nMAEs (Poisson Iquitos: 0.85; San Juan: 0.59; Singapore: 0.55) relative to RF (RF Iquitos: 1.36; San Juan: 0.59; Singapore: 0.63).

For long-term forecasts in Iquitos and San Juan, the RF model (nMAE = 0.96 and 0.57 respectively) was less accurate than Poisson regression; we estimated Poisson regression nMAEs as 0.87 and 0.56 for Iquitos and San Juan respectively. In Singapore, we estimated RF and Poisson regression nMAEs as 0.62 and 0.65, indicating similar model accuracy.
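The error metrics reported above can be illustrated with a minimal sketch. This section does not restate how the MAE is normalized, so the version below assumes that nMAE is the MAE divided by the mean observed case count; that choice, the function names, and the toy numbers are all illustrative assumptions and not the study's code or data.

```python
from statistics import mean


def mae(observed, predicted):
    """Mean absolute error between paired observed and forecast counts."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return mean(abs(o - p) for o, p in zip(observed, predicted))


def nmae(observed, predicted):
    """MAE scaled by the mean observed count (ASSUMED normalization)."""
    avg_observed = mean(observed)
    if avg_observed == 0:
        raise ValueError("mean observed count is zero; nMAE is undefined")
    return mae(observed, predicted) / avg_observed


def relative_error_reduction(model_error, baseline_error):
    """Fraction by which model_error is smaller than baseline_error."""
    return (baseline_error - model_error) / baseline_error


# Toy weekly counts, invented for illustration only.
observed = [10, 20, 30, 40]
predicted = [12, 18, 33, 35]

print(mae(observed, predicted))
print(nmae(observed, predicted))
print(relative_error_reduction(0.8, 1.0))
```

Under this assumed normalization, an nMAE of 1.0 would mean the typical forecast error is as large as the average weekly case count, which makes errors comparable across study areas with very different case volumes.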

The strongest predictors of dengue case counts

Using variable analysis, we identified the strongest RF model predictors of weekly dengue case counts (Figs 6–8). When models included surveillance inputs, previous dengue levels were the strongest predictors for near term forecasts. When model forecasts were based upon only population, temporal, and weather data, the strongest predictors included population size, 3- and 4-year periodicity, multi-week cumulative rainfall, peak daily rainfall (Iquitos only), the average and variation in minimum daily temperature (Iquitos only), and monthly air passenger arrivals (Singapore only). Of note, these predictors were typically distributed over lag periods greater than 15 weeks. Across all study areas, we found that the inclusion of surveillance predictors had a much smaller impact on the model’s long-term forecast accuracy.

Fig 6. Top 10 most important predictors for the Random Forest model when predicting weekly dengue case counts, Iquitos, Peru.

The 10 most important predictors to the Random Forest model prior to variable reduction. Predictor importance was quantified as the percentage increase in mean squared error. Red bars indicate the model included surveillance data inputs while blue bars indicate the model did not include surveillance data inputs. Predictors are shown for forecasts made 4 (A) and 12 (B) weeks in advance.

https://doi.org/10.1371/journal.pntd.0008710.g006

Fig 7. Top 10 most important predictors for the Random Forest model when predicting weekly dengue case counts, San Juan, Puerto Rico.

The 10 most important predictors to the Random Forest model prior to variable reduction. Predictor importance was quantified as the percentage increase in mean squared error. Red bars indicate the model included surveillance data inputs while blue bars indicate the model did not include surveillance data inputs. Predictors are shown for forecasts made 4 (A) and 12 (B) weeks in advance.

https://doi.org/10.1371/journal.pntd.0008710.g007

Fig 8. Top 10 most important predictors for the Random Forest model when predicting weekly dengue case counts, Singapore.

The 10 most important predictors to the Random Forest model prior to variable reduction. Predictor importance was quantified as the percentage increase in mean squared error. Red bars indicate the model included surveillance data inputs while blue bars indicate the model did not include surveillance data inputs. Predictors are shown for forecasts made 4 (A) and 12 (B) weeks in advance.

https://doi.org/10.1371/journal.pntd.0008710.g008
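The figure captions above quantify predictor importance as the percentage increase in mean squared error. A rough sketch of that idea follows: permute one predictor, re-score the model, and report how much the MSE grows relative to the unpermuted baseline. This is a simplified, model-agnostic version evaluated on a fixed data set; random forest implementations typically compute it on out-of-bag samples per tree. The function names, the toy model, and the toy data are assumptions for illustration.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between paired values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pct_increase_mse(predict, rows, y, col, n_repeats=20, seed=0):
    """Average percentage increase in MSE after shuffling column `col`.

    `predict` is any function mapping one row (a list) to a prediction.
    """
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in rows])
    if baseline == 0:
        raise ValueError("baseline MSE is zero; percentage change undefined")
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[col] for row in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with only the chosen column permuted.
        permuted = [row[:col] + [v] + row[col + 1:]
                    for row, v in zip(rows, shuffled)]
        permuted_mse = mse(y, [predict(row) for row in permuted])
        increases.append(100.0 * (permuted_mse - baseline) / baseline)
    return sum(increases) / n_repeats


# Toy data: the outcome depends on column 0 only; column 1 is noise.
rows = [[1, 5], [2, 3], [3, 8], [4, 1], [5, 7], [6, 2]]
y = [2.5, 3.5, 6.5, 7.5, 10.5, 11.5]


def toy_model(row):
    """Stand-in for a fitted model; it ignores column 1 entirely."""
    return 2 * row[0]


informative = pct_increase_mse(toy_model, rows, y, col=0)
uninformative = pct_increase_mse(toy_model, rows, y, col=1)
print(informative, uninformative)
```

A predictor the model relies on produces a large increase when shuffled, while a predictor the model ignores produces none, which is the ordering the importance bars in Figs 6 to 8 convey.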

Forecasting dengue outbreaks

Table 3 presents model MCCs, summarizing how well the optimal RF, RF-UFA, and Logistic regression models predicted weekly dengue outbreaks 4 and 12 weeks in advance (results for all evaluated models are available in S7 Table). When predictions were made 4 weeks in advance and based upon surveillance, population, temporal, and weather data, both RF and RF-UFA performed worse than Logistic regression in San Juan and Singapore (Logistic San Juan: 0.84; Singapore: 0.57). RF-UFA had the largest MCC in Iquitos (0.56). For long-term forecasts, RF-UFA outperformed all other models, with MCCs of 0.58, 0.61, and 0.30 in Iquitos, San Juan, and Singapore, respectively. On average, RF-UFA MCCs were 125% and 79% larger than RF and Logistic regression model MCCs, respectively.

Table 3. Optimal model performance when predicting weekly dengue outbreaks.

https://doi.org/10.1371/journal.pntd.0008710.t003

When model predictions were based upon population, temporal, and weather data only, we found that RF-UFA was the most accurate model when predicting 4 weeks ahead (Iquitos: 0.49; San Juan: 0.66; Singapore: 0.22). For long-term predictions, RF-UFA performed best in Iquitos (MCC: 0.58) and Singapore (MCC: 0.27), while in San Juan, RF-UFA (MCC: 0.61) and Logistic regression (MCC: 0.62) had similar performance.
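For readers less familiar with the metric, the Matthews correlation coefficient (MCC) used above can be computed from the four cells of a binary confusion matrix. The sketch below uses the standard formula; the function name and the toy outbreak labels are illustrative assumptions and are not taken from the study data.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = outbreak week)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Toy example: 8 weeks, 3 of which were true outbreak weeks.
observed_outbreak = [1, 1, 1, 0, 0, 0, 0, 0]
predicted_outbreak = [1, 1, 0, 0, 0, 0, 0, 1]

print(mcc(observed_outbreak, predicted_outbreak))
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction. Because it uses all four confusion-matrix cells, it remains informative when outbreak weeks are much rarer than non-outbreak weeks.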

To evaluate RF-UFA’s utility as an early warning tool, we compared the total number of high- and low-risk flags per week with weekly dengue case counts (Figs 9–11). Using Pearson’s correlation, we estimated the correlation between high-risk flags and dengue cases to be 0.60, 0.69, and 0.73 in Iquitos, San Juan, and Singapore, respectively. We observed a weaker negative correlation between the number of low-risk flags and dengue cases in Iquitos (-0.35) and Singapore (-0.37), but a strong negative correlation in San Juan (-0.79).
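The comparison described above reduces to a Pearson correlation between two weekly series: the number of risk flags raised and the number of dengue cases. A minimal sketch follows; the function name and the toy weekly series are invented for illustration and do not reproduce the study's flags or counts.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)


# Toy weekly series (hypothetical): flags raised and cases observed.
high_risk_flags = [0, 1, 3, 5, 4, 1]
low_risk_flags = [6, 5, 2, 0, 1, 5]
weekly_cases = [2, 5, 20, 40, 30, 6]

print(pearson_r(high_risk_flags, weekly_cases))
print(pearson_r(low_risk_flags, weekly_cases))
```

In this toy series, weeks with many high-risk flags coincide with high case counts (positive correlation) and weeks with many low-risk flags coincide with low case counts (negative correlation), mirroring the direction of the relationships reported above.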

Fig 9. RF-UFA forecast accuracy of the temporal pattern of dengue outbreaks, Iquitos, Peru, June 2009–June 2013.

The number of high-risk (red) and low-risk (blue) flags per week that are met 12 weeks in advance are plotted against weekly dengue case counts (black) in the testing data. Grey regions represent observed outbreak weeks. Thresholds were identified using UFA and are associated with dengue outbreaks 12 weeks into the future. Black dashed lines indicate the beginning of a new test set.

https://doi.org/10.1371/journal.pntd.0008710.g009

Fig 10. RF-UFA forecast accuracy of the temporal pattern of dengue outbreaks, San Juan, Puerto Rico, April 2009–April 2013.

The number of high-risk (red) and low-risk (blue) flags per week that are met 12 weeks in advance are plotted against weekly dengue case counts (black) in the testing data. Grey regions represent observed outbreak weeks. Thresholds were identified using UFA and are associated with dengue outbreaks 12 weeks into the future. Black dashed lines indicate the beginning of a new test set.

https://doi.org/10.1371/journal.pntd.0008710.g010

Fig 11. RF-UFA forecast accuracy of the temporal pattern of dengue outbreaks, Singapore, January 2013–December 2016.

The number of high-risk (red) and low-risk (blue) flags per week that are met 12 weeks in advance are plotted against weekly dengue case counts (black) in the testing data. Grey regions represent observed outbreak weeks. Thresholds were identified using UFA and are associated with dengue outbreaks 12 weeks into the future. Black dashed lines indicate the beginning of a new test set.

https://doi.org/10.1371/journal.pntd.0008710.g011

Discussion

In this study, we developed RF, regression, and ARIMA models to predict dengue cases and outbreaks in 3 geographic locations. For near term forecasts, we found that RF performed better than both Poisson regression and ARIMA when the model had access to prior dengue surveillance data (Table 2). On average, RF predictions had 21% and 33% less error than Poisson regression and ARIMA models, respectively. These results are consistent with other studies comparing the forecasting capabilities of RF with regression and time series models [40,45,84]. We believe that RF’s better performance is due to the model’s ability to capture the nonlinear dynamics that are part of dengue ecology [105] and to learn the trajectory of an outbreak from previously observed outbreaks. When forecasts were extended to 12 weeks in advance, the ARIMA model had the least error in Iquitos and Singapore. However, in San Juan, RF performed better than Poisson regression and ARIMA. Our observation of the ARIMA model outperforming both the RF and Poisson models may be due to the ARIMA model’s ability to describe key underlying factors without being overly complex [106]. The performance of these models in providing short- and long-term forecasts appears to indicate that, for short-term prediction, models benefit from an increase in complexity, as the outcome is more certain and the added complexity increases model accuracy. However, for long-term predictions, where the outcome is less certain, the additional model complexity appears to hurt model accuracy.

In a forecasting challenge that used similar dengue and weather data from Iquitos and San Juan, mechanistic, statistical, and multimodel ensemble models were used to predict 3 dengue outcomes: peak incidence, week of peak incidence, and total incidence [106]. Model performance was highly variable, and models did not consistently perform well across locations and prediction targets. Similar to our study, the models did not perform well during high incidence seasons, potentially because only a few high incidence seasons were available for training. Further, Johansson et al. (2019) found that, on average, models which included biologically meaningful data and mechanisms had lower accuracy [106]. This result appears to support our finding that ML models can, at times, better leverage biologically meaningful data as they utilize a more flexible framework and do not require a priori assumptions of the predictor-target relationship.

Due to delays in case identification, current surveillance data may not be available in real time. To evaluate this limitation, we removed model inputs related to surveillance data and reassessed model performance. We found that predicted values generated by both RF and Poisson regression were similar to the general trend in dengue case counts in Iquitos and San Juan but not in Singapore. Our results show that both models were sensitive to the lack of surveillance data: model error increased when these inputs were removed. The increase in error is most likely a result of the combination of similar yearly weather patterns and high inter-annual variation in dengue spread. As such, these models are unable to fully anticipate whether future dengue levels will be high or low when surveillance data are unavailable.

In each study area, the random forest model had a high degree of confidence in its predictions, as evidenced by the narrow confidence intervals. Although the confidence intervals were narrow, the observed number of weekly cases was typically not included within them. This is due to the way that the random forest model estimates the standard error: as the variation in predictions among the individual trees [102]. This result indicates that there was little variation in predicted values between individual trees.

For some scenarios, such as vector control planning, the accurate prediction of outbreak periods may prove sufficient to provide an early warning of an imminent dengue outbreak. The RF-UFA model was able to forecast weekly dengue outbreaks 12 weeks in advance, with model MCCs ranging from 0.27 to 0.61 (Table 3). Further, the RF-UFA model was able to indicate periods of low dengue risk 12 weeks in advance (Figs 9–11). Of interest, RF-UFA performed well even when surveillance data inputs were removed from the model. In our analysis of the RF-UFA model, we found that the number of weekly high- and low-risk flags correlated well with dengue cases. Twelve weeks has been identified as the optimal lead time to enact widespread vector control efforts [11]; based upon our study results, RF-UFA could be a beneficial addition to an early warning system due to its ability to identify changes in dengue spread risk.

Another study objective was to identify the strongest predictors of dengue case counts (Figs 6–8). According to our models, the strongest predictors were previous levels of dengue cases, indicating that factors such as force of infection have a stronger influence on local transmission than weather factors. These results do not imply that weather is unimportant but rather that, once suitable weather conditions are achieved, outbreak risk becomes a function of other drivers such as vector control, population immunity, and virus infectivity. Interestingly, in Johansson et al. (2019), models which incorporated weather and surveillance data typically performed worse than models based only on surveillance data, suggesting that previous levels of dengue cases are the strongest predictors [106]. The authors further hypothesized that surveillance predictors alone may contribute information equivalent to that of weather predictors regarding future dengue levels, and that the addition of weather data may overly complicate the model [106].

For each study area, when we removed surveillance inputs from the models and predictions were based upon population, temporal, and weather data only, the strongest predictors typically described multi-week weather patterns distributed over lag periods greater than 15 weeks. The observed relationships in our study are most likely due to the phase difference between seasonal signals causing the variables to become correlated rather than being related through a causal mechanistic link [57]. The strongest weather predictors demonstrated low week-to-week variation but larger month-to-month variation. In addition, the observed lag periods are towards the maximum period over which weather variables have been observed to affect dengue spread. In Singapore, monthly air travel patterns distributed over long lag periods were also a strong predictor of dengue cases. Though global travel has been identified as an important driver of dengue outbreaks in Singapore [107], the effect of imported cases has been observed to persist a maximum of 14 to 16 weeks, suggesting that this finding is also due to phase differencing [108–112].

Our study has some limitations. Data availability may have negatively affected model performance. We could not obtain vector control data; vector control is critical in diminishing the size of an outbreak [11,113–115] and may confound the relationship between predictors and prediction outcomes, causing the model to learn biased predictor-outcome relationships.

To train our models, we used dengue case counts as reported by passive surveillance systems. As such, asymptomatic and clinically mild cases were most likely missed, suggesting that model predictions are underestimates of the true number of cases [2].

Our study highlighted various limitations for each modeling approach. When predicting dengue case counts, RF consistently underestimated observed extreme values, for example the 2011 outbreak in Iquitos and the 2013 outbreak in Singapore (Fig 5 and Fig 7). This consistent underestimation is a direct result of the RF’s inability to predict outside of the training set’s outcome distribution [92]. Despite this limitation, the RF model typically identified when dengue cases would peak. In contrast, Poisson regression would occasionally overestimate peak weeks with a delay, due to the model’s reliance upon the previous week’s reported cases and the linear relationship imposed by the model. When predicting weekly outbreaks, we found that all models performed poorly in Singapore, where there was an unprecedented increase in dengue cases beginning in 2013 due to a severe dengue outbreak throughout Southeast Asia [97–100]. As a result, the models were unable to account for this shift in dengue dynamics.
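The underestimation of extreme values noted above follows from how tree ensembles form predictions: each leaf returns an average of training outcomes, so no prediction can exceed the largest outcome seen in training. The sketch below demonstrates this with a deliberately tiny, one-predictor bagged regression tree written from scratch. It is not the random forest implementation used in the study, and every name and number in it is an illustrative assumption.

```python
import random


def fit_tree(xs, ys, min_leaf=2):
    """Fit a 1-D regression tree by minimizing the sum of squared errors."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    leaf = ("leaf", sum(y for _, y in pairs) / n)
    if n < 2 * min_leaf:
        return leaf
    best = None
    for i in range(min_leaf, n - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - mean_l) ** 2 for y in left)
               + sum((y - mean_r) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, i)
    if best is None:
        return leaf
    i = best[1]
    threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    left_x, left_y = zip(*pairs[:i])
    right_x, right_y = zip(*pairs[i:])
    return ("split", threshold,
            fit_tree(left_x, left_y, min_leaf),
            fit_tree(right_x, right_y, min_leaf))


def predict_tree(tree, x):
    """Walk from the root to a leaf and return that leaf's mean outcome."""
    while tree[0] == "split":
        _, threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree[1]


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Bagging: fit each tree on a bootstrap resample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_tree([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return sum(predict_tree(tree, x) for tree in forest) / len(forest)


# Toy training data: the outcome grows linearly, reaching at most 57.
train_x = list(range(20))
train_y = [3 * x for x in train_x]
forest = fit_forest(train_x, train_y)

# At x = 40 the underlying relationship gives 120, but the ensemble
# cannot return more than the largest training outcome.
extrapolated = predict_forest(forest, 40)
print(extrapolated, max(train_y))
```

An outbreak larger than any in the training period therefore cannot be predicted at its full magnitude by this kind of model, even when the timing of the peak is identified correctly.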

In evaluating RF-UFA performance, we found that this model suffered from false positives in Iquitos and San Juan. Typically, the model predicted an earlier onset and a later end to the outbreak period and, on occasion, would incorrectly predict extended outbreak periods during the traditional peak dengue months. This is certainly problematic and requires further attention since too many false positives can lead to alarm fatigue and can rapidly deplete limited resources [116].

Conclusions

In this study, we compared the ability of ML, regression, and time-series based modeling approaches to forecast dengue case counts and outbreaks. When using dengue surveillance, population, temporal, and weather data as model inputs, RF was more accurate than both Poisson regression and ARIMA models for near term predictions, while the ARIMA model performed best for long-term predictions. We also found that when predicting dengue outbreaks, RF-UFA outperformed both RF and logistic regression models when using only population, temporal, and weather data as model inputs. Given the potential advantages of ML models, the forecasting capabilities of dengue early warning systems may be improved by their inclusion.

Supporting information

S1 Text. Additional Materials and Methods.

https://doi.org/10.1371/journal.pntd.0008710.s001

(DOCX)

S1 Table. Surveillance predictor variables.

https://doi.org/10.1371/journal.pntd.0008710.s002

(DOCX)

S2 Table. Population and temporal predictor variables.

https://doi.org/10.1371/journal.pntd.0008710.s003

(DOCX)

S4 Table. Normalized mean absolute error and mean absolute error for all evaluated Random Forest and Poisson regression models when predicting weekly dengue case counts.

Abbreviations: nMAE: normalized mean absolute error; MAE: mean absolute error.

https://doi.org/10.1371/journal.pntd.0008710.s005

(DOCX)

S5 Table. Optimal model performance when predicting weekly dengue case counts during the typical peak dengue season.

*The ARIMA model was only developed using previously observed case counts. Abbreviations: nMAE: normalized mean absolute error; MAE: mean absolute error. Iquitos peak dengue season: January to July [102]. San Juan peak dengue season: May to November [104]. Singapore peak dengue season: September to February [103].

https://doi.org/10.1371/journal.pntd.0008710.s006

(DOCX)

S6 Table. Optimal model performance when predicting weekly dengue case counts during the typical low dengue season.

*The ARIMA model was only developed using previously observed case counts. Abbreviations: nMAE: normalized mean absolute error; MAE: mean absolute error. Iquitos peak dengue season: January to July [102]. San Juan peak dengue season: May to November [104]. Singapore peak dengue season: September to February [103].

https://doi.org/10.1371/journal.pntd.0008710.s007

(DOCX)

S7 Table. Matthew’s Correlation Coefficient for each statistical modeling approach when predicting weekly outbreaks.

Abbreviations: MCC: Matthew’s Correlation Coefficient.

https://doi.org/10.1371/journal.pntd.0008710.s008

(DOCX)

S1 Fig. Description of missing weather data.

S1 Fig describes the amount of missing data per variable by study area. In Iquitos and Singapore, weather stations were the primary source of missing data; for San Juan, remote sensed imagery was the most affected data source. Among all days in the data collection period, 69.5% in Iquitos, 1.1% in San Juan, and 0.6% in Singapore had at least 1 missing measurement.

https://doi.org/10.1371/journal.pntd.0008710.s009

(TIF)

S2 Fig. 12 week forecast accuracy of the temporal pattern of dengue case counts, Iquitos, Peru, June 2009–June 2013.

Observed weekly case counts (black area) are compared with 12 week ahead forecasts made by Random Forest and Poisson regression models. Dotted lines represent 95% confidence intervals around the model’s prediction. RF model standard errors were estimated using the infinitesimal jackknife for bagging approach [101].

https://doi.org/10.1371/journal.pntd.0008710.s010

(TIF)

S3 Fig.

ARIMA model 4 (A) and 12 (B) week forecast accuracy of the temporal pattern of dengue case counts, Iquitos, Peru, June 2009–June 2013. Observed weekly case counts (black area) are compared with 4 and 12 week ahead forecasts (panels A and B respectively) made by the ARIMA model. Dotted lines represent 95% confidence intervals around the model’s prediction.

https://doi.org/10.1371/journal.pntd.0008710.s011

(TIF)

S4 Fig. 12 week forecast accuracy of the temporal pattern of dengue case counts, San Juan, Puerto Rico, April 2009–April 2013.

Observed weekly case counts (black area) are compared with 12 week ahead forecasts made by Random Forest and Poisson regression models. Dotted lines represent 95% confidence intervals around the model’s prediction. RF model standard errors were estimated using the infinitesimal jackknife for bagging approach [101].

https://doi.org/10.1371/journal.pntd.0008710.s012

(TIF)

S5 Fig.

ARIMA model 4 (A) and 12 (B) week forecast accuracy of the temporal pattern of dengue case counts, San Juan, Puerto Rico, April 2009–April 2013. Observed weekly case counts (black area) are compared with 4 and 12 week ahead forecasts (panels A and B respectively) made by the ARIMA model. Dotted lines represent 95% confidence intervals around the model’s prediction.

https://doi.org/10.1371/journal.pntd.0008710.s013

(TIF)

S6 Fig. 12 week forecast accuracy of the temporal pattern of dengue case counts, Singapore, January 2013–December 2016.

Observed weekly case counts (black area) are compared with 12 week ahead forecasts made by Random Forest and Poisson regression models. Dotted lines represent 95% confidence intervals around the model’s prediction. RF model standard errors were estimated using the infinitesimal jackknife for bagging approach [101].

https://doi.org/10.1371/journal.pntd.0008710.s014

(TIF)

S7 Fig.

ARIMA model 4 (A) and 12 (B) week forecast accuracy of the temporal pattern of dengue case counts, Singapore, January 2013–December 2016. Observed weekly case counts (black area) are compared with 4 and 12 week ahead forecasts (panels A and B respectively) made by the ARIMA model. Dotted lines represent 95% confidence intervals around the model’s prediction.

https://doi.org/10.1371/journal.pntd.0008710.s015

(TIF)

References

  1. 1. Rezza G. Aedes albopictus and the reemergence of Dengue. BMC Public Health. 2012;12: 72. pmid:22272602
  2. 2. Bhatt S, Gething PW, Brady OJ, Messina JP, Farlow AW, Moyes CL, et al. The global distribution and burden of dengue. Nature. 2013;496: 504–507. pmid:23563266
  3. 3. Beatty ME, Letson W, Edgil DM, Margolis HS. Estimating the total world population at risk for locally acquired dengue infection. American Journal of Tropical Medicine and Hygiene. 2007. pp. 221–221. pmid:17690390
  4. 4. Hales S, De Wet N, Maindonald J, Woodward A. Potential effect of population and climate changes on global distribution of dengue fever: an empirical model. The Lancet. 2002;360: 830–834.
  5. 5. Hii YL, Zhu H, Ng N, Ng LC, Rocklöv J. Forecast of dengue incidence using temperature and rainfall. PLoS Negl Trop Dis. 2012;6: e1908. pmid:23209852
  6. 6. Schaefer TJ, Wolford RW. Dengue Fever. StatPearls [Internet]. StatPearls Publishing; 2018.
  7. 7. Johansson MA, Reich NG, Hota A, Brownstein JS, Santillana M. Evaluating the performance of infectious disease forecasts: A comparison of climate-driven and seasonal dengue forecasts for Mexico. Sci Rep. 2016;6: srep33707. pmid:27665707
  8. 8. Quirine A, Clapham HE, Lambrechts L, Duong V, Buchy P, Althouse BM, et al. Contributions from the silent majority dominate dengue virus transmission. PLoS Pathog. 2018;14: e1006965. pmid:29723307
  9. 9. Duong V, Lambrechts L, Paul RE, Ly S, Lay RS, Long KC, et al. Asymptomatic humans transmit dengue virus to mosquitoes. Proc Natl Acad Sci. 2015;112: 14688–14693. pmid:26553981
  10. 10. Clapham HE, Tricou V, Van Vinh Chau N, Simmons CP, Ferguson NM. Within-host viral dynamics of dengue serotype 1 infection. J R Soc Interface. 2014;11: 20140094. pmid:24829280
  11. 11. Hii YL, Rocklöv J, Wall S, Ng LC, Tang CS, Ng N. Optimal Lead Time for Dengue Forecast. PLoS Negl Trop Dis. 2012;6: e1848. pmid:23110242
  12. 12. Sewe MO, Tozan Y, Ahlm C, Rocklöv J. Using remote sensing environmental data to forecast malaria incidence at a rural district hospital in Western Kenya. Sci Rep. 2017;7. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5453969/
  13. 13. Teklehaimanot HD, Lipsitch M, Teklehaimanot A, Schwartz J. Weather-based prediction of Plasmodium falciparum malaria in epidemic-prone regions of Ethiopia I. Patterns of lagged weather effects reflect biological mechanisms. Malar J. 2004;3: 41. pmid:15541174
  14. 14. Briët OJ, Vounatsou P, Gunawardena DM, Galappaththy GN, Amerasinghe PH. Models for short term malaria prediction in Sri Lanka. Malar J. 2008;7: 76. pmid:18460204
  15. 15. Wangdi K, Singhasivanon P, Silawan T, Lawpoolsri S, White NJ, Kaewkungwal J. Development of temporal modelling for forecasting and prediction of malaria infections using time-series and ARIMAX analyses: A case study in endemic districts of Bhutan. Malar J. 2010;9: 251. pmid:20813066
  16. 16. Abeku TA, De Vlas SJ, Borsboom G, Tadege A, Gebreyesus Y, Gebreyohannes H, et al. Effects of meteorological factors on epidemic malaria in Ethiopia: a statistical modelling approach based on theoretical reasoning. Parasitology. 2004;128: 585–593. pmid:15206460
  17. 17. Eastin MD, Delmelle E, Casas I, Wexler J, Self C. Intra- and interseasonal autoregressive prediction of dengue outbreaks using local weather and regional climate for a tropical environment in Colombia. Am J Trop Med Hyg. 2014;91: 598–610. pmid:24957546
  18. 18. Hu W, Clements A, Williams G, Tong S. Dengue fever and El Nino/Southern Oscillation in Queensland, Australia: a time series predictive model. Occup Environ Med. 2010;67: 307–311. pmid:19819860
  19. 19. Karim MdN Munshi SU, Anwar N Alam MdS. Climatic factors influencing dengue cases in Dhaka city: A model for dengue prediction. Indian J Med Res. 2012;136: 32–39. pmid:22885261
  20. 20. Depradine CA, Lovell EH. Climatological variables and the incidence of Dengue fever in Barbados. Int J Environ Health Res. 2004;14: 429–441. pmid:15545038
  21. 21. Luz PM, Mendes BVM, Codeço CT, Struchiner CJ, Galvani AP. Time series analysis of dengue incidence in Rio de Janeiro, Brazil. Am J Trop Med Hyg. 2008;79: 933–939. pmid:19052308
  22. 22. Martinez EZ, Silva EAS da, Fabbro ALD. A SARIMA forecasting model to predict the number of cases of dengue in Campinas, State of São Paulo, Brazil. Rev Soc Bras Med Trop. 2011;44: 436–440. pmid:21860888
  23. 23. Fuller DO, Troyo A, Beier JC. El Niño Southern Oscillation and vegetation dynamics as predictors of dengue fever cases in Costa Rica. Environ Res Lett ERL Web Site. 2009;4: 140111–140118. pmid:19763186
  24. 24. Gharbi M, Quenel P, Gustave J, Cassadou S, La Ruche G, Girdary L, et al. Time series analysis of dengue incidence in Guadeloupe, French West Indies: forecasting models using climate variables as predictors. BMC Infect Dis. 2011;11: 166. pmid:21658238
  25. 25. Phung D, Huang C, Rutherford S, Chu C, Wang X, Nguyen M, et al. Identification of the prediction model for dengue incidence in Can Tho city, a Mekong Delta area in Vietnam. Acta Trop. 2015;141: 88–96. pmid:25447266
  26. 26. Siriyasatien P, Phumee A, Ongruk P, Jampachaisri K, Kesorn K. Analysis of significant factors for dengue fever incidence prediction. BMC Bioinformatics. 2016;17: 166. pmid:27083696
  27. 27. Chan EH, Sahai V, Conrad C, Brownstein JS. Using web search query data to monitor dengue epidemics: a new model for neglected tropical disease surveillance. PLoS Negl Trop Dis. 2011;5: e1206. pmid:21647308
  28. 28. Althouse BM, Ng YY, Derek A. Cummings T.. Prediction of Dengue Incidence Using Search Query Surveillance. PLoS Negl Trop Dis. 2011;5: e1258. pmid:21829744
  29. 29. Racloz V, Ramsey R, Tong S, Hu W. Surveillance of dengue fever virus: a review of epidemiological models and early warning systems. PLoS Negl Trop Dis. 2012;6: e1648. pmid:22629476
  30. 30. Cordell HJ. Detecting gene–gene interactions that underlie human diseases. Nat Rev Genet. 2009;10: 392–404. pmid:19434077
  31. 31. Shaman J, Karspeck A. Forecasting seasonal outbreaks of influenza. Proc Natl Acad Sci. 2012;109: 20425–20430. pmid:23184969
  32. 32. Yang W, Cowling BJ, Lau EH, Shaman J. Forecasting influenza epidemics in Hong Kong. PLoS Comput Biol. 2015;11: e1004383. pmid:26226185
  33. 33. Yang W, Karspeck A, Shaman J. Comparison of Filtering Methods for the Modeling and Retrospective Forecasting of Influenza Epidemics. PLOS Comput Biol. 2014;10: e1003583. pmid:24762780
  34. 34. DeFelice NB, Little E, Campbell SR, Shaman J. Ensemble forecast of human West Nile virus cases and mosquito infection rates. Nat Commun. 2017;8. pmid:28233783
  35. 35. Shaman J, Pitzer VE, Viboud C, Grenfell BT, Lipsitch M. Absolute Humidity and the Seasonal Onset of Influenza in the Continental United States. PLOS Biol. 2010;8: e1000316. pmid:20186267
  36. 36. Mangal TD, Paterson S, Fenton A. Predicting the Impact of Long-Term Temperature Changes on the Epidemiology and Control of Schistosomiasis: A Mechanistic Model. PLOS ONE. 2008;3: e1438. pmid:18197249
  37. 37. Reiner RC, Perkins TA, Barker CM, Niu T, Chaves LF, Ellis AM, et al. A systematic review of mathematical models of mosquito-borne pathogen transmission: 1970–2010. J R Soc Interface. 2013;10: 20120921. pmid:23407571
  38. 38. Buczak AL, Baugher B, Moniz LJ, Bagley T, Babin SM, Guven E. Ensemble method for dengue prediction. PLOS ONE. 2018;13: e0189988. pmid:29298320
  39. 39. Yamana TK, Kandula S, Shaman J. Superensemble forecasts of dengue outbreaks. J R Soc Interface. 2016;13: 20160410. pmid:27733698
  40. 40. Kane MJ, Price N, Scotch M, Rabinowitz P. Comparison of ARIMA and Random Forest time series models for prediction of avian influenza H5N1 outbreaks. BMC Bioinformatics. 2014;15: 276. pmid:25123979
  41. 41. Ruiz MO, Chaves LF, Hamer GL, Sun T, Brown WM, Walker ED, et al. Local impact of temperature and precipitation on West Nile virus infection in Culex species mosquitoes in northeast Illinois, USA. Parasit Vectors. 2010;3: 19. pmid:20302617
  42. 42. Rehman NA, Kalyanaraman S, Ahmad T, Pervaiz F, Saif U, Subramanian L. Fine-grained dengue forecasting using telephone triage services. Sci Adv. 2016;2: e1501215. pmid:27419226
  43. 43. Aramaki E, Maskawa S, Morita M. Twitter Catches the Flu: Detecting Influenza Epidemics Using Twitter. Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics; 2011. pp. 1568–1576. Available: http://dl.acm.org/citation.cfm?id=2145432.2145600
  44. 44. Wu Y, Lee G, Fu X, Hung T. Detect climatic factors contributing to dengue outbreak based on wavelet, support vector machines and genetic algorithm. 2008. Available: http://oar.a-star.edu.sg/jspui/handle/123456789/700
  45. 45. Carvajal TM, Viacrusis KM, Hernandez LFT, Ho HT, Amalin DM, Watanabe K. Machine learning methods reveal the temporal pattern of dengue incidence using meteorological factors in metropolitan Manila, Philippines. BMC Infect Dis. 2018;18: 183. pmid:29665781
  46. 46. Johnson LR, Gramacy RB, Cohen J, Mordecai E, Murdock C, Rohr J, et al. Phenomenological forecasting of disease incidence using heteroskedastic Gaussian processes: A dengue case study. Ann Appl Stat. 2018;12: 27–66.
  47. 47. Christophers S, et al. Aedes aegypti (L.) the yellow fever mosquito: its life history, bionomics and structure. 1960 [cited 15 Apr 2017]. Available: https://www-cabdirect-org.ezproxy.bu.edu/cabdirect/abstract/19602901825
  48. 48. Morrison AC, Gray K, Getis A, Astete H, Sihuincha M, Focks D, et al. Temporal and geographic patterns of Aedes aegypti (Diptera: Culicidae) production in Iquitos, Peru. J Med Entomol. 2004;41: 1123–1142. pmid:15605653
  49. 49. Hammond SN, Gordon AL, Lugo E del C, Moreno G, Kuan GM, López MM, et al. Characterization of Aedes aegypti (Diptera: Culcidae) production sites in urban Nicaragua. J Med Entomol. 2007;44: 851–860. pmid:17915519
  50. 50. Focks DA, Haile DG, Daniels E, Mount GA. Dynamic life table model for Aedes aegypti (Diptera: Culicidae): analysis of the literature and model development. J Med Entomol. 1993;30: 1003–1017. pmid:8271242
  51. 51. Hopp MJ, Foley JA. Global-scale relationships between climate and the dengue fever vector, Aedes aegypti. Clim Change. 2001;48: 441–463.
  52. 52. Johansson MA, Cummings DA, Glass GE. Multiyear climate variability and dengue—El Niño Southern Oscillation, weather, and dengue incidence in Puerto Rico, Mexico, and Thailand: a longitudinal data analysis. PLoS Med. 2009;6: e1000168. pmid:19918363
  53. 53. Rueda LM, Patel KJ, Axtell RC, Stinner RE. Temperature-dependent development and survival rates of Culex quinquefasciatus and Aedes aegypti (Diptera: Culicidae). J Med Entomol. 1990;27: 892–898. pmid:2231624
  54. 54. Surtees G. Effects of irrigation on mosquito populations and mosquito-borne diseases in man, with particular reference to ricefield extension. Int J Environ Stud. 1970;1: 35–42.
  55. 55. Russell RC. Mosquito-borne arboviruses in Australia: the current scene and implications of climate change for human health. Int J Parasitol. 1998;28: 955–969. pmid:9673874
  56. 56. Faull KJ, Williams CR. Intraspecific variation in desiccation survival time of Aedes aegypti (L.) mosquito eggs of Australian origin. J Vector Ecol. 2015;40: 292–300. pmid:26611964
  57. 57. Stoddard ST, Wearing HJ, Reiner RC Jr, Morrison AC, Astete H, Vilcarromero S, et al. Long-term and seasonal dynamics of dengue in Iquitos, Peru. PLoS Negl Trop Dis. 2014;8: e3003. pmid:25033412
  58. 58. Forshey BM, Laguna-Torres VA, Vilcarromero S, Bazan I, Rocha C, Morrison AC, et al. Epidemiology of influenza-like illness in the Amazon Basin of Peru, 2008–2009. Influenza Other Respir Viruses. 2010;4: 235–243. pmid:20836798
  59. 59. Seidahmed OM, Eltahir EA. A Sequence of Flushing and Drying of Breeding Habitats of Aedes aegypti (L.) Prior to the Low Dengue Season in Singapore. PLoS Negl Trop Dis. 2016;10: e0004842. pmid:27459322
  60. 60. Forshey BM, Guevara C, Laguna-Torres VA, Cespedes M, Vargas J, Gianella A, et al. Arboviral etiologies of acute febrile illnesses in Western South America, 2000–2007. PLoS Negl Trop Dis. 2010;4: e787. pmid:20706628
  61. 61. US Department of Commerce, NOAA. Dengue Forecasting. NOAA’s National Weather Service; [cited 19 Apr 2017]. Available: https://dengueforecasting.noaa.gov/
  62. 62. Sharp TM, Hunsperger E, Santiago GA, Muñoz-Jordan JL, Santiago LM, Rivera A, et al. Virus-specific differences in rates of disease during the 2010 Dengue epidemic in Puerto Rico. PLoS Negl Trop Dis. 2013;7: e2159. pmid:23593526
  63. 63. Ong A, Goh KT. A guide on infectious diseases of public health importance in Singapore. Communicable Diseases Division, Ministry of Health [and] Communicable Disease Centre, Tan Tock Seng Hospital; 2011.
  64. 64. Ministry of Health Singapore. Weekly Infectious Diseases Bulletin. [cited 21 Apr 2017]. Available: https://www.moh.gov.sg/content/moh_web/home/statistics/infectiousDiseasesStatistics/weekly_infectiousdiseasesbulletin.html
  65. 65. Singapore Residents By Age Group, Ethnic Group And Gender, End June, Annual. In: Data.gov.sg [Internet]. [cited 31 Jan 2019]. Available: https://data.gov.sg/dataset/resident-population-by-ethnicity-gender-and-age-group?resource_id%3Df9dbfc75-a2dc-42af-9f50-425e4107ae84
  66. 66. Jialin C, Lip TY. Challenges in the Development of Register-Based Population Statistics. Strategic Resource and Population Division, Singapore Department of Statistics; 2017. Available: https://www.singstat.gov.sg/-/media/files/publications/population/ssnmar17-pg1-7.pdf
  67. 67. Air Passenger Arrivals—Total by Region and Selected Country of Embarkation. In: Data.gov.sg [Internet]. [cited 31 Jan 2019]. Available: https://data.gov.sg/dataset/air-passenger-arrivals-total-by-region-and-selected-country-of-embarkation?resource_id%3D4b634602-570d-47af-bae2-403135179249
  68. 68. Thai KT, Cazelles B, Van Nguyen N, Vo LT, Boni MF, Farrar J, et al. Dengue dynamics in Binh Thuan province, southern Vietnam: periodicity, synchronicity and climate variability. PLoS Negl Trop Dis. 2010;4: e747. pmid:20644621
  69. 69. Cummings DA, Irizarry RA, Huang NE, Endy TP, Nisalak A, Ungchusak K, et al. Travelling waves in the occurrence of dengue haemorrhagic fever in Thailand. Nature. 2004;427: 344. pmid:14737166
  70. 70. Stratton MD, Ehrlich HY, Mor SM, Naumova EN. A comparative analysis of three vector-borne diseases across Australia using seasonal and meteorological models. Sci Rep. 2017;7: 40186. pmid:28071683
  71. 71. Mendelsohn R, Kurukulasuriya P, Basist A, Kogan F, Williams C. Climate analysis with satellite versus weather station data. Clim Change. 2007;81: 71–83.
  72. 72. Davey CA, Pielke RA. Microclimate Exposures of Surface-Based Weather Stations: Implications For The Assessment of Long-Term Temperature Trends. Bull Am Meteorol Soc. 2005;86: 497–504.
  73. 73. Chabot-Couture G, Nigmatulina K, Eckhoff P. An Environmental Data Set for Vector-Borne Disease Modeling and Epidemiology. PLOS ONE. 2014;9: e94741. pmid:24755954
  74. 74. Hay SI, Lennon JJ. Deriving meteorological variables across Africa for the study and control of vector-borne disease: a comparison of remote sensing and spatial interpolation of climate. Trop Med Int Health. 1999;4: 58–71. pmid:10203175
  75. 75. Hay SI, Tatem AJ, Graham AJ, Goetz SJ, Rogers DJ. Global Environmental Data for Mapping Infectious Disease Distribution. Adv Parasitol. 2006;62: 37–77. pmid:16647967
  76. 76. Kalluri S, Gilruth P, Rogers D, Szczur M. Surveillance of Arthropod Vector-Borne Infectious Diseases Using Remote Sensing Techniques: A Review. PLOS Pathog. 2007;3: e116. pmid:17967056
  77. 77. Climate Change Science Program (US). Reanalysis of historical climate data for key atmospheric features: Implications for attribution of causes of observed change. US Climate Change Science Program; 2008.
  78. 78. de Wet N, Slaney D, Ye W, Hales S, Warrick RA. Hotspots: exotic mosquito risk profiles for New Zealand. The International Global Change Institute (IGCI), University of Waikato; 2005.
  79. 79. Hu W, Tong S, Mengersen K, Oldenburg B. Rainfall, mosquito density and the transmission of Ross River virus: A time-series forecasting model. Ecol Model. 2006;196: 505–514.
  80. 80. van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate imputation by chained equations in R. J Stat Softw. 2011;45. pmid:22289957
  81. 81. Schafer JL, Olsen MK. Multiple imputation for multivariate missing-data problems: A data analyst’s perspective. Multivar Behav Res. 1998;33: 545–571.
  82. 82. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 2013. 2014.
  83. 83. Olson DL. Data Set Balancing. Data Mining and Knowledge Management. Springer, Berlin, Heidelberg; 2005. pp. 71–80. https://doi.org/10.1007/978-3-540-30537-8_8
  84. 84. Petukhova T, Ojkic D, McEwen B, Deardon R, Poljak Z. Assessment of autoregressive integrated moving average (ARIMA), generalized linear autoregressive moving average (GLARMA), and random forest (RF) time series regression models for predicting influenza A virus frequency in swine in Ontario, Canada. PLOS ONE. 2018;13: e0198313. pmid:29856881
  85. 85. Sheth M, Welsch R, Markuzon N. The Univariate Flagging Algorithm (UFA): a Fully-Automated Approach for Identifying Optimal Thresholds in Data. arXiv preprint arXiv:1604.03248. 2016 [cited 17 May 2017]. Available: https://arxiv-org.ezproxy.bu.edu/abs/1604.03248
  86. 86. Liaw A, Wiener M. Classification and regression by randomForest. R News. 2002;2: 18–22.
  87. 87. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B Methodol. 1996;58: 267–288.
  88. 88. Guo P, Liu T, Zhang Q, Wang L, Xiao J, Zhang Q, et al. Developing a dengue forecast model using machine learning: A case study in China. PLoS Negl Trop Dis. 2017;11: e0005973. pmid:29036169
  89. 89. Shi Y, Liu X, Kok S-Y, Rajarethinam J, Liang S, Yap G, et al. Three-month real-time dengue forecast models: an early warning system for outbreak alerts and policy decision support in Singapore. Environ Health Perspect. 2015;124: 1369–1375. pmid:26662617
  90. 90. Friedman J, Hastie T, Tibshirani R. glmnet: Lasso and elastic-net regularized generalized linear models. R Package Version. 2009;1.
  91. 91. Hyndman RJ, Athanasopoulos G, Bergmeir C, Caceres G, Chhay L, O’Hara-Wild M, et al. Package ‘forecast’. 2019. Available: https://cran.r-project.org/web/packages/forecast/forecast.pdf
  92. 92. Breiman L. Random forests. Mach Learn. 2001;45: 5–32.
  93. 93. Grömping U. Variable importance in regression models. Wiley Interdiscip Rev Comput Stat. 2015;7: 137–152.
  94. 94. Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta BBA-Protein Struct. 1975;405: 442–451.
  95. 95. Chicco D. Ten quick tips for machine learning in computational biology. BioData Min. 2017;10: 35. pmid:29234465
  96. 96. Powers DM. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. 2011.
  97. 97. Ng L-C, Koo C, Mudin RNB, Amin FM, Lee K-S, Kheong CC. 2013 dengue outbreaks in Singapore and Malaysia caused by different viral strains. Am J Trop Med Hyg. 2015;92: 1150–1155. pmid:25846296
  98. 98. Guo C, Zhou Z, Wen Z, Liu Y, Zeng C, Xiao D, et al. Global epidemiology of dengue outbreaks in 1990–2015: a systematic review and meta-analysis. Front Cell Infect Microbiol. 2017;7: 317. pmid:28748176
  99. 99. Wang B, Yang H, Feng Y, Zhou H, Dai J, Hu Y, et al. The distinct distribution and phylogenetic characteristics of dengue virus serotypes/genotypes during the 2013 outbreak in Yunnan, China: Phylogenetic characteristics of 2013 dengue outbreak in Yunnan, China. Infect Genet Evol. 2016;37: 1–7. pmid:26597450
  100. 100. Hapuarachchi HC, Koo C, Rajarethinam J, Chong C-S, Lin C, Yap G, et al. Epidemic resurgence of dengue fever in Singapore in 2013–2014: A virological and entomological perspective. BMC Infect Dis. 2016;16: 300. pmid:27316694
  101. 101. Wager S, Hastie T, Efron B. Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. J Mach Learn Res. 2014;15: 1625–1651. pmid:25580094
  102. 102. International Association for Medical Assistance to Travellers. Peru: Dengue | IAMAT. [cited 1 Dec 2019]. Available: https://www.iamat.org/country/peru/risk/dengue
  103. 103. International Association for Medical Assistance to Travellers. Singapore: Dengue | IAMAT. [cited 1 Dec 2019]. Available: https://www.iamat.org/country/singapore/risk/dengue
  104. 104. International Association for Medical Assistance to Travellers. Puerto Rico: Dengue | IAMAT. [cited 1 Dec 2019]. Available: https://www.iamat.org/country/puerto-rico/risk/dengue
  105. 105. Lowe R, Gasparrini A, Van Meerbeeck CJ, Lippi CA, Mahon R, Trotman AR, et al. Nonlinear and delayed impacts of climate on dengue risk in Barbados: A modelling study. PLoS Med. 2018;15: e1002613. pmid:30016319
  106. 106. Johansson MA, Apfeldorf KM, Dobson S, Devita J, Buczak AL, Baugher B, et al. An open challenge to advance probabilistic forecasting for dengue epidemics. Proc Natl Acad Sci. 2019;116: 24268–24274. pmid:31712420
  107. 107. Lee K-S, Lo S, Tan SS-Y, Chua R, Tan L-K, Xu H, et al. Dengue virus surveillance in Singapore reveals high viral diversity through multiple introductions and in situ evolution. Infect Genet Evol. 2012;12: 77–85. pmid:22036707
  108. 108. Shang C-S, Fang C-T, Liu C-M, Wen T-H, Tsai K-H, King C-C. The Role of Imported Cases and Favorable Meteorological Conditions in the Onset of Dengue Epidemics. PLoS Negl Trop Dis. 2010;4: 1–9. pmid:20689820
  109. 109. Huang X, Clements AC, Williams G, Milinovich G, Hu W. A threshold analysis of dengue transmission in terms of weather variables and imported dengue cases in Australia. Emerg Microbes Infect. 2013;2: e87. pmid:26038449
  110. 110. Polwiang S. The Estimation of Imported Dengue Virus From Thailand. J Travel Med. 2015;22: 194–199. pmid:25728849
  111. 111. Yan H, Ding Z, Yan J, Yao W, Pan J, Yang Z, et al. Epidemiological characterization of the 2017 dengue outbreak in Zhejiang, China and molecular characterization of the viruses. Front Cell Infect Microbiol. 2018;8.
  112. 112. Peng H-J, Lai H-B, Zhang Q-L, Xu B-Y, Zhang H, Liu W-H, et al. A local outbreak of dengue caused by an imported case in Dongguan China. BMC Public Health. 2012;12: 83. pmid:22276682
  113. 113. Liu T, Zhu G, He J, Song T, Zhang M, Lin H, et al. Early rigorous control interventions can largely reduce dengue outbreak magnitude: experience from Chaozhou, China. BMC Public Health. 2017;18: 90. pmid:28768542
  114. 114. Wu H, Wu C, Lu Q, Ding Z, Xue M, Lin J. Evaluating the effects of control interventions and estimating the inapparent infections for dengue outbreak in Hangzhou, China. PLOS ONE. 2019;14: e0220391. pmid:31393899
  115. 115. Cheng Q, Jing Q, Spear RC, Marshall JM, Yang Z, Gong P. The interplay of climate, intervention and imported cases as determinants of the 2014 dengue outbreak in Guangzhou. PLoS Negl Trop Dis. 2017;11: e0005701. pmid:28640895
  116. 116. Buczak AL, Baugher B, Babin SM, Ramac-Thomas LC, Guven E, Elbert Y, et al. Prediction of high incidence of dengue in the Philippines. PLoS Negl Trop Dis. 2014;8: e2771. pmid:24722434