Forecasting dengue fever in Brazil: An assessment of climate conditions

Local climate conditions play a major role in the biology of the Aedes aegypti mosquito, the main vector responsible for transmitting dengue, Zika, chikungunya and yellow fever in urban centers. For this reason, a detailed assessment of the periods in which changes in climate conditions affect the number of human cases may improve the timing of vector-control efforts. In this work, we develop new machine-learning algorithms to analyze climate time series and their connection to the occurrence of dengue epidemic years for seven Brazilian state capitals. Our method explores the impact of two key variables, frequency of precipitation and average temperature, over a wide range of time windows in the annual cycle. Our results indicate that each Brazilian state capital considered has its own climate signatures that correlate with the overall number of human dengue cases. However, for most of the studied cities, the winter preceding an epidemic year shows strong predictive power. Understanding such climate contributions to the vector's biology could lead to more accurate prediction models and early warning systems.


Completing missing climate data via compressive sensing
The time series of our selected climate dataset contain episodic gaps on days when variables (temperature and precipitation) were not recorded. To fill in the missing data gaps, we employ two different methods: compressive sensing and interpolation (see Fig. A for illustrative examples). For temperature time series with 2 or more consecutive missing recordings, we use a recently developed compressive sensing method based on L1 convex optimization to approximate the missing data [1][2][3][4][5]. The compressive sensing method attempts to reconstruct a signal from a sparse subsampling of the time series data. In this case, the sparse subsampling arises from the fact that we have missing data. We have chosen to fill in the missing data through this matrix completion procedure for two reasons: (i) the mathematical methods for handling missing data in this way have matured significantly over the past decade, and (ii) mixing data from two different recording devices (satellite data and ground recording data) is statistically unjustified. Specifically, the data collected by ground stations at fixed locations are the most reliable sources of climate information. Satellite data is typically noisier and less precise, in part due to the limitations of its resolution in comparison to ground recordings. Ultimately, using one data source as a proxy for another is its own statistically interesting, yet challenging, data science problem [6][7][8].
The signal reconstruction problem is nothing more than a large underdetermined system of linear equations. To be more precise, consider the conversion of time series data to the frequency domain via the discrete cosine transform (DCT),

f = ψc,    (1)

where f is the signal vector in the time domain and c is the vector of cosine transform coefficients representing the signal in the DCT domain. The matrix ψ represents the DCT transform itself. The key observation is that most of the coefficients of the vector c are zero, i.e., the time series is sparse in the Fourier domain. Note that the matrix ψ is of size n × n while f and c are n × 1 vectors. The choice of basis functions is critical in carrying out the compressed sensing protocol: the signal must be sparse in the chosen basis. For the cosine basis used here, the signal is clearly sparse, allowing us to accurately reconstruct it from sparse samples. The idea is now to sample the signal randomly (and sparsely), so that

b = φf,    (2)

where b is a vector of a few (m) random samples of the original signal f (ideally m ≪ n). Thus φ is a subset of randomly permuted rows of the identity operator. More complicated sampling can be performed, but this simple example illustrates all the key features. Note that b is an m × 1 vector while the matrix φ is of size m × n. Approximate signal reconstruction can then be performed by solving the linear system

Ax = b,    (3)

where b is an m × 1 vector, x is an n × 1 vector, and A = φψ is a matrix of size m × n. Here x is the sparse approximation to the full DCT coefficient vector. Thus, for m ≪ n, the resulting linear algebra problem is highly underdetermined. The idea is then to solve the underdetermined system using an appropriate norm constraint that best reconstructs the original signal, for which the sparsity-promoting L1 norm is highly appropriate. The signal reconstruction is performed by solving

min ||x||_1 subject to Ax = b.    (4)

If the original signal had exactly m non-zero coefficients, the reconstruction could be made exact (see Ref. [1], Ch. 18). We applied this technique specifically to the climate series of Rio de Janeiro, Salvador and São Luís. For the other state capitals, we simply interpolate the time series linearly whenever a single daily recording is missing. We note that there were intractably large gaps in the INMET precipitation series for Rio de Janeiro, which forced us to use alternative data sources made available by the city's alert system for rain events [9].
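A minimal sketch of this reconstruction on synthetic data may help fix ideas. We use a LASSO relaxation of the L1 problem (4) rather than the exact equality-constrained program; the signal, sample sizes, and regularization strength below are illustrative, not the exact routine applied to the climate series:

```python
import numpy as np
from scipy.fft import idct
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, m = 512, 128

# Synthetic signal that is sparse in the DCT domain: three nonzero
# coefficients, so f = psi @ c_true with psi the inverse-DCT matrix.
c_true = np.zeros(n)
c_true[[4, 17, 40]] = [1.0, 0.6, 0.3]
psi = idct(np.eye(n), axis=0, norm='ortho')    # columns = DCT basis vectors
f = psi @ c_true

# phi: keep m random samples of f (the days that are not missing).
keep = np.sort(rng.choice(n, size=m, replace=False))
b = f[keep]
A = psi[keep, :]                               # A = phi @ psi, size m x n

# Sparsity-promoting recovery: LASSO as a relaxation of
#   min ||x||_1  subject to  A x = b.
lasso = Lasso(alpha=1e-4, fit_intercept=False, max_iter=100000)
lasso.fit(A, b)
f_rec = psi @ lasso.coef_                      # reconstructed full signal

rel_err = np.linalg.norm(f_rec - f) / np.linalg.norm(f)
```

With roughly a quarter of the samples retained, the reconstruction error is small because the signal is genuinely sparse in the chosen basis, which is the property the method relies on.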

Variable Selection
Here we present a simple method based on the Singular Value Decomposition (SVD) [1] to utilize our climate data to find periods of high separability between dengue vs non-dengue years. Our original dataset consists of the following daily measurements: (1) maximum temperature, (2) minimum temperature, (3) mean temperature, (4) humidity and (5) precipitation. These variables were not pre-selected but were, instead, the ones made available by INMET [10]. Our SVD algorithm works as follows:
1. We select climate data over the same period (t0, p) for different years and build a corresponding matrix X(t0, p) that allows for an SVD decomposition.
2. We select data from k climate variables over the years, always starting at t0 and ending p days later.
3. We stack and normalize the data associated with year j in a block matrix Bj(t0, p), for j = 1, ..., N. All blocks are reshaped into column vectors, forming a new matrix X = X(t0, p), which yields the decomposition X = UΣV^T. The columns of U (the SVD modes) form an orthogonal basis for the space generated by the columns of X, and the projections onto the principal components are given by the matrix ΣV^T(t0, p). For details see [11].
Figure B shows a panel with SVD heatmaps for all state capitals considered in this study, over the same range of t0 and p values used in the SVM method. In the top row, we find the critical periods (highest separability between dengue vs non-dengue years) using all variables. In the bottom row, we find critical periods selecting only average temperature and precipitation. For most cities, and especially for the city of Rio de Janeiro, these two variables detect almost the same periods of high separability in the [t0 × p] heatmaps as all variables combined. We highlight a few of the similar high-separability periods with green boxes (placed by visual inspection) to illustrate this fact. This result supports our choice of a sparse and generalizable variable set that gives a regression performance on par with using all variables.
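The three steps above can be sketched as follows. The yearly array shapes and the synthetic data are purely illustrative; in practice the blocks would be filled with the INMET daily records:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical daily records: N years x 365 days x k climate variables
# (here k = 2, standing in for mean temperature and precipitation).
N, k = 10, 2
data = rng.normal(size=(N, 365, k))

def svd_projection(data, t0, p):
    """Build X(t0, p) from the per-year blocks and return its SVD."""
    window = data[:, t0:t0 + p, :]            # block B_j for each year j
    B = window.reshape(N, -1)                 # reshape each block to a vector
    B = (B - B.mean(axis=1, keepdims=True)) / B.std(axis=1, keepdims=True)
    X = B.T                                   # columns = years; X is (p*k, N)
    return np.linalg.svd(X, full_matrices=False)

U, s, Vt = svd_projection(data, t0=150, p=60)

# Columns of U are the SVD modes; Sigma @ V^T gives the projections
# (principal components) of each year onto those modes.
projections = np.diag(s) @ Vt
```

Scanning (t0, p) over a grid and scoring how well the leading projections separate epidemic from non-epidemic years produces heatmaps of the kind shown in Figure B.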
Such an approach to the selection of a parsimonious variable set is consistent with commonly used regression techniques such as the LASSO, or with information criteria such as the AIC (Akaike information criterion) and BIC (Bayesian information criterion), whereby goodness-of-fit is penalized by the total number of variables. Thus our methodology is consistent with well-known and commonly used techniques in the statistical sciences for variable selection.
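The LASSO analogy can be made concrete with a toy example. The data below are synthetic, and the two "informative" columns merely stand in for mean temperature and precipitation; this is not a computation performed in our pipeline:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)

# Five candidate climate features, of which only columns 0 and 1
# actually drive the synthetic response.
X = rng.normal(size=(200, 5))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

# Cross-validated LASSO shrinks the irrelevant coefficients to (near) zero,
# effecting the penalized variable selection described above.
model = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(np.abs(model.coef_) > 0.05)
```

The surviving coefficients identify the parsimonious variable set, in the same spirit as our retention of average temperature and precipitation.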
In addition, there is the issue of generalizability. While for some cities a particular combination of variables might give an equal or better outcome (the city of Manaus, for example, yields a better result if one considers humidity instead of precipitation), these particular choices do not generalize. We therefore decided to keep the two variables that are collectively the best representatives of the climate dataset across all cities. Indeed, the two variables we have kept are the only two that generalize across all the cities, despite their distinct climates and clustering patterns. This generalizability argument is also consistent with LASSO and BIC/AIC model/variable selection theory [12,13].

Figure B: Comparison between the use of all climate variables to detect high-separability periods (top row) vs using average temperature & precipitation alone (bottom row). Rio de Janeiro, Belo Horizonte, Aracajú, and Salvador show very similar separability periods using this widely available and easily accessible environmental data (highlighted by the green boxes to facilitate visual inspection). Different combinations of variables, even if they might perform better for a particular city, do not generalize to all others.

Examples of periods with high and low separability of the climate signatures
For each state capital we selected special time windows in which there is a clear separation between the climate signatures preceding epidemic and non-epidemic years. Figure C illustrates the distinct separation of the data for each individual city, suggesting that a universal model for climate effects across all cities may be unattainable. The separability of the data further suggests that epidemics may be accurately predicted in a given state capital six to nine months in advance of their outbreak. Figure D shows specific time windows in which the epidemic and non-epidemic climate variables are poorly distinguishable, and therefore not suitable for dengue prediction. Unlike Fig. C, the mixing of the data suggests poor predictability across all cities. In both cases, this separability notion is made quantitatively precise by the SVM scores.
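The SVM score used to quantify separability can be illustrated with a toy two-class setup. The data are synthetic stand-ins for the windowed climate signatures of epidemic vs non-epidemic years; the kernel choice mirrors the RBF kernel reported below, but nothing else here is taken from our actual fits:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Synthetic "climate signatures": epidemic years (label 1) drawn from a
# distribution shifted away from non-epidemic years (label 0).
X_epi = rng.normal(loc=1.0, scale=0.5, size=(20, 2))
X_non = rng.normal(loc=-1.0, scale=0.5, size=(20, 2))
X = np.vstack([X_epi, X_non])
y = np.array([1] * 20 + [0] * 20)

# Fit an RBF-kernel SVM; a high classification score on a given time
# window indicates highly separable climate signatures for that window.
clf = SVC(kernel='rbf').fit(X, y)
score = clf.score(X, y)
```

Windows like those in Figure C yield scores near 1, while the poorly distinguishable windows of Figure D yield scores near chance level.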

Aracajú
The most accurate predictions were obtained with (i) a nonlinear RBF kernel, (ii) an SVM threshold of α = 0.9, and (iii) the EP-strategy to calculate the outbreak probability. Surprisingly, the same EP-rectangle was used in all out-of-sample predictions, giving an EP-window within June 1st-19th (winter). Fig. E highlights this rectangle (green box) and the respective results in the Tj × δj−1 plane for each year. Only the year 2006 was wrongly predicted (FP). There is a clear separation between dengue and non-dengue years at a temperature threshold around 26 °C.

Salvador
Figure I (top) shows the best prediction result for the city of Salvador, using (i) an RBF kernel, (ii) an SVM threshold of α = 0.95, and (iii) the AA-strategy to calculate the outbreak probability. The (t0, p) rectangles used in the prediction covered most of the year but were especially clustered around December-February (boxed in magenta). All years except 2002 (FN) and 2010 (FN) were correctly predicted (82% accuracy).
Predictions using (i) a linear kernel, (ii) α = 0.9, and (iii) the EP-strategy also gave good results (highlighted in Fig. I, bottom). Eight years were correctly predicted (73% accuracy), but the years 2008 (FP), 2010 (FN) and 2012 (FN) were not. The EP-strategy was only slightly less accurate than the AA-strategy, yielding EP-windows between August 30th and December 11th (spring and summer). The epidemic years typically showed lower precipitation rates in the selected EP-rectangles.

Additional Practical Considerations
Throughout the manuscript, we refer to a "user" as someone with access to new or different data (both climate and epidemiological) who wishes to calculate dengue outbreak probabilities using our methodology.
The user's first task is to (i) split the yearly climate data into a training set and a testing/prediction set. Then, they must choose between (ii) SVM kernels (linear or nonlinear), (iii) alpha values, (iv) rectangle sizes, and (v) prediction strategies. If the user is simply appending new climate data to the Brazilian capitals analyzed in this work, they could leverage our best parameter/strategy choices shown in Table 1. For other cities, the user might want to follow our work as a guideline. Ultimately, one should run prediction routines for each combination of parameters, build confusion matrices, and compare the distinct choices listed above in terms of their accuracy. Table 1 also demonstrates that cities may require significantly different choices. Thus, we recommend that a user trying to predict a dengue outbreak in a novel city systematically test all sensible combinations. We point out that the user should be careful when choosing rectangle sizes: smaller rectangles will increase the heatmap resolution but will contain fewer (t0, p) points. Our choice of dimensions 5 and 6 was reasonable for our specific dataset, but will likely not generalize to others. This tradeoff between resolution and content in rectangle sizes, and how to control it to maximize prediction accuracy, is worth investigating in future studies.
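The bookkeeping for steps (i)-(v) can be sketched as a simple parameter sweep. Everything below is hypothetical scaffolding on synthetic data: the alpha values are carried along only to show the sweep structure, and the signed SVM margin is used as a crude stand-in for our outbreak-probability strategies:

```python
import numpy as np
from itertools import product
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(4)

# Toy yearly features/labels; step (i): split into train and test years.
X = np.vstack([rng.normal(-1.0, 0.7, size=(8, 2)),
               rng.normal(1.0, 0.7, size=(8, 2))])
y = np.array([0] * 8 + [1] * 8)
idx = rng.permutation(16)
X, y = X[idx], y[idx]
X_train, X_test, y_train, y_test = X[:11], X[11:], y[:11], y[11:]

# Steps (ii)-(v): sweep kernels and alpha-like thresholds, build a
# confusion matrix for each combination, and record its accuracy.
results = {}
for kernel, alpha in product(['linear', 'rbf'], [0.9, 0.95]):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    margin = clf.decision_function(X_test)     # signed distance to boundary
    y_pred = (margin > 0).astype(int)
    cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
    results[(kernel, alpha)] = np.trace(cm) / cm.sum()

best = max(results, key=results.get)           # highest-accuracy choice
```

The combination with the best confusion-matrix accuracy is then the one carried forward, which is exactly how the per-city choices in Table 1 were selected.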

Tables with epidemic/non-epidemic years and missing climate data gaps
We provide tables with the estimated population, total number of dengue cases, incidence per 100,000 inhabitants, and details of our climate data-completion protocols (where applicable).

Prediction Results for each state capital
Here the reader can find the best prediction results for all 7 state capitals considered in our work. We have chosen a criterion based on highest prediction accuracy (see manuscript for details) for selecting the SVM kernel, threshold α, and prediction strategy. The following tables contain the evaluated probabilities of dengue outbreaks for each test year. For those state capitals where the EP-strategy performed best, we also exhibit the corresponding dates of the EP-chosen rectangles for each out-of-sample prediction. For those capitals where the AA-strategy had the best results, we provide a full list of the time windows ((t0, p)-rectangles) that were common to all out-of-sample predictions. The AA-months are those months containing all the time windows that were found. In all tables, "D" represents epidemic years and "ND" non-epidemic years. For one capital, we find the following time windows (in the format DD/MM) common to all out-of-sample predictions: 11/09-9/10, 11/09-14/10, 12/08-24/09, 12/08-24/10, 12/08-29/09, 17/09-15/10, 18/08-25/10 and 23/09-16/10. For another, we find a total of 168 time windows common to all out-of-sample predictions, covering almost all months of the year. For a third, we find the following time windows common to all out-of-sample predictions: 10/12-3/03, 10/12-6/02, 10/12-11/02, 10/

Confusion Matrices
We provide all confusion matrices; see the manuscript for details on how we compute them. The best results are highlighted in bold red. For those capitals where we find the same accuracy for different α values, we choose the highest α in order to promote the best separability scores.