Linear and Machine Learning modelling for spatiotemporal disease predictions: Force-of-Infection of Chagas disease

Background Chagas disease is a long-lasting disease with a prolonged asymptomatic period. Cumulative indices of infection such as prevalence do not shed light on the current epidemiological situation, as they integrate infection over long periods. Instead, metrics such as the Force-of-Infection (FoI) provide information about the rate at which susceptible people become infected and permit sharper inference about temporal changes in infection rates. FoI is estimated by fitting (catalytic) models to available age-stratified serological (ground-truth) data. Predictive FoI modelling frameworks are then used to understand spatial and temporal trends indicative of heterogeneity in transmission and changes effected by control interventions. Ideally, these frameworks should be able to propagate uncertainty and handle spatiotemporal issues. Methodology/principal findings We compare three methods in their ability to propagate uncertainty and provide reliable estimates of FoI for Chagas disease in Colombia as a case study: two Machine Learning (ML) methods (Boosted Regression Trees (BRT) and Random Forest (RF)), and a Linear Model (LM) framework that we had developed previously. Our analyses show consistent results between the three modelling methods under scrutiny. The predictors (explanatory variables) selected, as well as the location of the most uncertain FoI values, were coherent across frameworks. RF was faster than BRT and LM, and provided estimates with fewer extreme values when extrapolating to areas where no ground-truth data were available. However, BRT and RF were less efficient at propagating uncertainty. Conclusions/significance The choice of FoI predictive models will depend on the objectives of the analysis. ML methods will help characterise the mean behaviour of the estimates, while LM will provide insight into the uncertainty surrounding such estimates. Our approach can be extended to the modelling of FoI patterns in other Chagas disease-endemic countries and to other infectious diseases for which serosurveys are regularly conducted for surveillance.


Introduction
Chagas disease is a neglected tropical disease estimated to affect between 6 and 7 million persons worldwide. While only endemic in 21 countries in Latin America, the number of Chagas disease cases detected in Europe, North America, and the Far East has greatly increased, due to migration of infected populations [1]. Being able to identify how the cases are distributed in space and whether the control interventions implemented have been successful is critical to identifying how resources should be allocated to eliminate the disease as a public health problem in the 2021-2030 time horizon [2]. As a long-lasting and chronic disease, analyses based solely on the current prevalence of infection (typically measured as seroprevalence) has limited scope. Indeed, the prevalence recorded at a given time does not reflect the current epidemiological situation, as infection may have occurred in the past. The Force-of-Infection (FoI), i.e. the rate at which susceptible individuals become infected, is a modelling-derived metric that can be used to understand changes in incidence in space and time as a result of deliberate control interventions and/or secular changes, including environmental change [3]. However, the use of FoI raises its own challenges, particularly those regarding quantification and propagation of uncertainty when used as a response variable in predictive models. A catalytic model (fitted to age-structured seroprevalence data, often using Bayesian methods) has been used to obtain the FoI and thus, the FoI values for each serosurvey and each year are posterior distributions and not only single values [4]. When the derived FoI is used to fit predictive models, the mean or median values of FoI are predominantly used, neglecting the uncertainty surrounding the estimated values [5][6][7]. Furthermore, when a non-constant (e.g. a yearly-varying) FoI is assumed, each serosurvey analysed becomes a temporal series at a certain location, requiring specific and computationally-intensive methods to be included into predictive models [8]. Machine Learning methods could represent a faster and more flexible framework to implement such models.
Machine Learning (ML) methods are computational processes based on probabilities and algorithms that use prior knowledge to produce predictions. ML can handle non-linear and non-parametric models that are able to flout the linearity, normality (Gaussian distribution) and equal variance assumptions of statistical models [9]. Essentially, ML methods make no assumptions about the statistical distribution of the data [9].
These methods have previously been used in contexts in which those assumptions are challenged, such as spatial, temporal and spatiotemporal analyses of infectious diseases, e.g. mapping of human leptospirosis [10,11], severe fever with thrombocytopenia syndrome [12], lymphatic filariasis [13], or to identify individuals with a higher risk of HIV infection based on socio-behavioural-driven data [14].
Two types of ML models have been extensively used in the context of infectious disease epidemiology, namely, Boosted Regression Trees (BRT) and Random Forest (RF). Although they are not spatial approaches (as data locations and sampling patterns are ignored to produce estimates), they have shown potential in spatial modelling [15], in particular, when used with appropriate sampling strategies [16]. Specifically, BRT and RF have been used to study the spatial spread of numerous infectious diseases, including epidemics among swine farms in the USA [17], Ebola case-fatality ratio [18], risk factors for visceral leishmaniasis [19,20], African swine fever [21], scrub typhus [22], dengue incidence [23], and dengue FoI [5]. RF also proved its potential in modelling epidemics in a spatiotemporal framework, outperforming time series models [17].
This paper aims to compare the performance of two ML methods, namely, BRT and RF, with a Linear Model (LM) framework we have previously developed [8] in their ability to predict the FoI of Chagas disease across space and time. We use detailed data from Colombia as a case study and describe the advantages and disadvantages of using such Machine Learning methods compared to Linear Model frameworks, specifically focussing on their ability to handle uncertainty on the response variable.

Data sources
Current and past exposure to Chagas disease can be characterised by estimating the (timevarying) Force-of-Infection (FoI), i.e. temporal changes in the per susceptible rate of parasite acquisition [3,4]. Using results of 76 age-stratified serosurveys conducted at municipal level in Colombia between 1998 and 2014 (S1 Fig), yearly-varying FoI values were estimated, for each serosurvey, by fitting a catalytic model to age-stratified seropositivity data (see [4] for details). For each serosurvey, FoI estimates, for the period ranging from the birth of the oldest participants to the year when the serosurvey was conducted, were obtained using a Bayesian framework to fit the catalytic model to data, thus allowing for extraction of the full joint posterior distribution of the yearly FoI estimates. We refer to those municipalities where at least one serosurvey was conducted as municipalities 'in catchment areas', whereas those municipalities for which serosurveys were not conducted, not available, or not used in our analyses, are referred as 'out of catchment areas'. S1 Fig in the Supporting Information file depicts the geographical distribution of the available serosurveys ('ground-truth' data).
The predictors used in these analyses included demographic, entomological and climatic variables (recorded at the municipality level), contextual information about the serosurveys (location, year conducted and setting, i.e. urban, rural, mixed and indigenous (as defined in [4]), and information from public blood banks on the prevalence of Chagas disease and number of blood units tested (available at the departmental level). A full list and description of the predictors is available in S1 Table of the Supporting Information File.

Linear Model (LM) framework
The LM framework relied on a list of plausible linear combinations of predictors selected based on expert knowledge and to avoid correlation between predictors. They were then integrated into an ensemble model using model averaging with weights based on the performance indicators of each individual linear model. The 10 best models for each setting type (urban, rural and indigenous) were averaged and used to obtain FoI predictions. The LM framework has been fully described in [8].

Machine Learning (ML) framework
Both ML methods tested in this paper are based on decision trees. A decision tree is an intuitive process that builds an algorithm by generating a step-by-step tree, whereby the dataset is repeatedly split to make a decision at each node. The splitting relies on optimising a variablespecific threshold that best discriminates the data into two branches at each node. Sequentially, the entire dataset is divided by defining new variable-specific thresholds defining the nodes in the decision tree.
The size of the tree, its complexity (reflecting predictors' interactions), the number of observations in the terminal nodes and the criteria to stop the process are defined as model hyperparameters and form the basis of more complex designs.

Boosted Regression Trees (BRT)
Boosting Regression Trees (BRT) or Gradient Boosting Trees (GBT) are based on the building of a large number of small decision trees. The boosting aspect refers to fitting repeatedly very simple and basic classifiers, in which a different subset of the data is used for fitting at each iteration [9]. The Gradient technique is used to reduce the variance in the model; sequentially, each new tree added to the model is fitted to explain the remaining variance from the previous trees.
While BRT is considered a robust ML method, including its use for spatiotemporal analyses [19,20,22,24], it is known as having a tendency to overfit, unless a very large amount of data is available [25].

Random Forest (RF)
Random Forest (RF), first described by Breiman in 2001 [26], consists of a large collection of decision trees [9]. To grow an RF tree, random inputs and predictors are selected at each node [26], and this randomness is thought to reduce overfitting. RF is also considered a robust ML method that can handle outliers and noise while being faster than bagging-and boostingbased methods [26].
RF is not explicitly designed to explore spatial observations [15], and is known to produce suboptimal prediction when sampling is spatially biased and/or in the presence of strong spatial correlation [15]. However, spatiotemporal resampling strategies and variable selection processes have been developed to overcome this challenge [16,27].

Models' workflow
In order to assess the importance of integrating uncertainty on the response variable, we implemented two approaches The former relies on generating and assessing model predictions using the median estimates of the FoI for each serosurvey as an outcome (referred to as Med-FoI). The latter seeks to propagate the uncertainty linked to the catalytic model-derived 'observations' by accounting for the full posterior distribution of the FoI estimates (referred to as FullPostFoI).
With the MedFoI approach, models are fitted to the median FoI estimates and the performance indicator, the predictive R 2 , is based on central tendencies only. With the FullPostFoI approach, models are fitted on the full posterior distribution of FoI estimates and the performance indicator is based both on central tendency and on the amount of overlap between the 'observed' and predicted distribution of the outcome. This allowed us to quantify the ability of the models to match the uncertainty surrounding the FoI estimates (i.e. the outcome) inherited from the catalytic model (Fig 1). The percentage of overlap was obtained using the "overlap" function from the Overlapping R-package [28] and provided the proportion of the area of two kernel density estimations that overlap [29].
For the two approaches, we defined six different coefficients of determination (R 2 ) linked to the sampling strategy. An R 2 was estimated based directly on the data used to fit the models; a predictive R 2 was estimated based on a proportion of the dataset that was not used to fit the models, i.e. the cross-validation (CV) set. In addition, both urban-and rural-specific predictive R 2 were estimated based on the urban/rural data from the CV set. Finally, in the ML frameworks, a resample R 2 was estimated based on out-of-sample data for each resample iterations (see Fig 2 for a schematic representation of these approaches). While the LM framework necessitated transformation of the data to normalise them, ML methods should, in principle, to be able to handle data without requiring normalisation (i.e. without requiring that their distribution is Gaussian). However, this process can help improve the performance of the model and was, therefore, tested (i.e. ML approaches were used to predict the FoI values both on non-transformed and log-transformed scales).

Fig 2. Description of the modelling workflow for the Linear Models (LM) and the Machine Learning (ML) frameworks. ML framework include Boosted
Regression Trees (BRT) and Random Forest (RF) methods). CV denotes cross-validation; Pred R 2 urban and Pred R 2 rural denote urban-and rural-specific predictive R 2 values that were estimated based on the urban/rural data from the CV set; %Overlap indicates the proportion of the 'observed' and predicted distributions that overlap (see Fig 1), assessed over all settings and for urban and rural settings separately. https://doi.org/10.1371/journal.pntd.0010594.g002

PLOS NEGLECTED TROPICAL DISEASES
While the LM framework relies on a list of plausible and pre-defined linear models including interactions between factors (predictors) and excluding correlated predictors, the ML framework is implemented in two steps, to be fitted only on the ten most important variables. At first, ML models were fitted using all the predictors available, then the importance/influence of each predictor was assessed, and the 10 most influential factors were used in the second step, during which the models were fitted again, and predictions extracted.
Finally, ML requires a tuning step, during which the best hyperparameters are selected. Also, the resampling strategy used within the LM framework relies on random resampling while a spatiotemporal resampling strategy has been used for the ML framework. A detailed description of the tuning of hyperparameters and the comparison of several resampling strategies is available in S1 Appendix of the Supporting Information File, including details about the tuning of hyperparameters and the comparison of several resampling strategies.

Indicators used to compare LM with ML frameworks
The best ML models obtained were then compared with the LM framework previously developed [8] in terms of their performance indicators, predictions, and ability to propagate uncertainty. In addition to these aspects, the models' ability to deal with temporal and spatial correlation, as well as their different computational aspects entered the comparison.
To allow comparison of predictive ability across multiple serosurveys, the distributions of predictions were standardised to the 'observations', allowing us to visualise whether the median and confidence intervals of the predictions matched those (median and credible intervals) of the catalytic model-derived FoI 'observations'. This process was performed at the serosurvey level to assess how much of the uncertainty inherited from the FoI calculation (via catalytic model fitting) was propagated into the predictions (see S2 Appendix in the Supporting Information File for details).
The uncertainty in the predictions was quantified using the Coefficient of Variation based on the standardised Median Absolute Deviation (MAD-CV), as FoI values were not normally distributed [30]. (Note that MAD-CV refers to coefficient of variation, whilst CV denotes cross validation.) The residual spatial correlation was assessed using the Moran's I heterogeneity test from the "spdep" R package [31]. For the LM framework, the Moran's I test was applied on all the residuals (originating from the cross validation (CV) and fitting sets) excluding those from the rural-urban mixed settings (as LM model selection was based on setting type and no model was explored, selected or averaged for mixed settings, and thus no predictions were produced for the 'observed' FoI values corresponding to such settings). For ML models, the Moran's I test was applied to the residuals of the CV set. Residuals for a single year were used to exclude potential temporal autocorrelation, and for presentation purposes, we selected 2005 as the year with the largest number of independent FoI 'observations'.
The residual temporal correlation was tested using a Durbin-Watson test (DW) [32] (see Eq 1 for the DW statistic). In order to capture the residual correlation inherited from the estimation of the FoI values through fitting the catalytic model, the residuals being compared were always from the same serosurvey and for consecutive years. Thus, the DW statistic provided the residual serial correlation for a lag of one year, where r denotes the residuals for i serosurveys, lag is one year (for consecutive, yearly series of serosurveys), and n the number of 'observations' tested.
All ML analyses were run under the mlr3 framework (an object-oriented machine learning framework in R) [33] using R-4.0.3 software [34].

Comparison of the performance of LM and ML frameworks
The predictive R 2 values for the LM framework obtained, on average, for its 5 best-fitting models, were 77% and 70%, with %overlap of 54% and 39% for urban and rural settings, respectively [8]. For the ML frameworks, the MedFoI approach yielded substantially better predictive R 2 values (ranging between 90% and 98%), but the degree of overlap between the distributions of the FoI 'observations' and the predictions was substantially lower (19%-25%), reflective of a tighter distribution around the central estimates and thus indicating over-confidence in the predictions when using such a simple approach (i.e. an approach that ignores the uncertainty linked to the outcome). The FullPostFoI approach gave a more balanced performance indicator, with predictive R 2 values ranging between 39% and 70%, and %overlap between 22% and 42% (Table 1). For both BRT and RF methods, the use of log-transformation to normalise the distribution of the FoI 'observations' consistently led to improved results (Table 1), with predictive R 2 values ranging between 59% and 70%, and %overlap between 34% and 42%.
Nested resampling, tested on the RF method with log-transformation, did not substantially improve model performance. Thus, the following subsections focus on the results obtained by fitting the frameworks on the full posterior distribution of the log-transformed FoI.

Comparison of the influence of predictor variables
The factors selected for the ML models were consistent with those that had been selected for the LM framework (Table 2); a Spearman correlation test showed that there was substantial rank correlation of the predictors included among the three models investigated (with Spearman rank correlation coefficient, r S , between LM and BRT = 0.50; between LM and RF = 0.54, and between BRT and RF = 0.64, all p-values <0.05). For MedFoI models, the performance indicator is the value of R 2 alone. Therefore % overlap is presented for comparison but was not used in fitting or selecting models.
For FullPostFoI, the performance indicator is the arithmetic mean between R 2 and % of overlap (see Fig 1).
https://doi.org/10.1371/journal.pntd.0010594.t001 Table 2 The type of setting was the most important factor for both the LM and BRT. Latitude, year when the serosurvey was conducted, and population density also played an important role. Poverty, climatic and entomological features had a moderate role.

Predictors
For the ML frameworks, blood-bank and intervention-related features were less influential than for the LM framework. Generally, the year (temporal trend) seemed to play a greater role in the ML models.

Comparison of spatial trends in predictions
All methods showed generally similar spatial trends and comparable levels of uncertainty ( Fig  3) for FoI prediction across Colombia (using the year 1990 for illustration as the pattern is consistent in time). Generally, FoI estimates were higher in northern and eastern municipalities and lower in the south of the country, with the latter being associated with higher uncertainty (Fig 3). The BRT framework predictions showed increased spatial heterogeneity, while predictions from the LM framework resulted in more spatially uniform predictions.
When comparing FoI predictions directly across the three methods, for urban and rural settings (Fig 4), we found good agreement between all of them, particularly between RF and LM. Generally, the BRT tended to predict higher FoI values in both settings. The patterns observed in the entire country seemed to follow what was observed in the catchment areas (municipalities where at least one serosurvey was conducted).

Comparison of temporal trends in predictions across serosurveys
When comparing 'observations' with predictions over time, all methods performed well regarding their ability to capture central trends (Fig 5). However, the LM framework was better at capturing uncertainty, as the confidence bounds of the predictions mirrored more closely the credible intervals (CrI) of the 'observations'.
The median uncertainty across municipalities (Table 3) was comparable using any of the methods and restricting the assessment to 'in catchment area' only (i.e. municipalities where at least one serosurvey had been conducted) or not ('out of catchment area'). However, for some municipalities, the uncertainty associated with the LM framework increased dramatically.
Comparatively, the RF method produced more uniform uncertainty across predictions, with median and range similar to those yielded by the other (BRT and LM) methods, but with fewer municipalities with substantial uncertainty (defined as MAD-CV>2), and only a moderate number of municipalities with extreme uncertainty (defined as MAD-CV>5).

Residual spatial and temporal correlation
While the ML-based methods did not show any significant spatial correlation in their residuals, this was not the case with the LM framework (Table 4). For all models, the DW test's statistic (see Eq 1) showed a significant residual temporal correlation between residuals from the

Computational aspects
Computationally, RF and BRT required the least effort (31 and 42 hrs respectively, on standard laptop, with an i7-8565U processor and 16.0 GB RAM) ( Table 5). By contrast, although

PLOS NEGLECTED TROPICAL DISEASES
implementation of LM required far fewer R-packages than the ML framework, it took over twice the time to run when compared to RF (72 hr). Also, the computer hard-drive space that was required to store 'objects' and model outputs was about 20 times higher for LM than for the ML framework. Finally, the overall implementation of the models was substantially simpler for the ML framework; particularly to make adjustments and updates.

Discussion
Our comparative analyses indicated generally consistent results among the three modelling methods investigated to generate Chagas disease FoI predictions, namely, the linear model (LM) framework we previously developed [8], and the two Machine Learning (ML) methods explored here (Boosted Regression Trees (BRT) and Random Forest (RF)). The predictors that were selected, as well as the location of the most uncertain FoI values were coherent and generally consistent among the three methods (Table 2 and Figs 3 and 4). Not entirely surprising, RF was faster to run than BRT and LM [26] (Table 5).

PLOS NEGLECTED TROPICAL DISEASES
Based on the performance indicators used, RF performed best (Table 1) but did less well when considering the propagation of uncertainty in the FoI inherited from the catalytic model ( Fig 5). Also, RF generated fewer municipality-level predicted values with substantial or extreme uncertainty (Table 3). All methods, when fitted on the median FoI alone (MedFoI approach), were unable to capture the uncertainty in the response variable (the FoI 'observations' generated by fitting the catalytic model to the age-stratified serosurveys), leading to overconfident predictions (with high predictive R 2 values but smaller % of overlap values). This highlights an important issue not fully addressed in the literature, as most publications using FoI data to infer spatiotemporal patterns of infectious disease incidence tend to use the central FoI estimates alone to fit predictive models (i.e. using what we labelled here as the MedFoI approach). We argue that neglecting to appreciate and propagate the uncertainty inherent in their estimation [5][6][7] may therefore lead to significant over-confidence in predictions. This issue, already highlighted in our previous LM work [8], is not mitigated by implementing ML frameworks, and deserves careful consideration, not only from a methodological perspective, but importantly, when the results are applied in policy-relevant contexts [35].
Indeed, quantifying and communicating uncertainty in FoI appropriately is critical when the results of predictive models are used to inform stakeholders and public health programme managers on the level of certitude associated with exposure risk or number of cases. Thus areas/populations for which exposure has been certainly high or low can be differentiated from those with exposure levels or number of cases that necessitate further investigation due to highly uncertain estimation.
Even when the three methods showed good performance and generally good agreement at the serosurvey level (Fig 4), the residuals remained correlated in time (Table 4). Thus, the Table 3. Uncertainty across Chagas disease Force-of-Infection predictions for the three frameworks under comparison. The uncertainty was estimated using the Median Absolute Deviation Coefficient of Variation (MAD-CV) of the predictions for Colombia in 1990, in (urban and rural) areas where at least one serosurvey had been conducted ('in catchment area') and where no data were available or used in the analyses ('out of catchment area'). The number of municipalities where MAD-CV is greater than 2 (substantial uncertainty) or greater than 5 (extreme uncertainty) is also included.

MAD-CV values (range)
Number correlation inherited from the FoI calculation was not fully accounted for in any of our methods, i.e. none of the predictors included was able to account for the full extent of this correlation. While the final ML models showed no evidence of residual spatial correlation (Table 4), the spatial extrapolation shown (Fig 3) should be interpreted with caution, as the (ground-truth) serosurveys available had only been conducted in a relatively small number of municipalities and tended to be aggregated in the same area (S1 Fig). When using RF, the degree of uncertainty inside and outside 'catchment areas' was consistent, suggesting reliable extrapolation. This contrasted with the LM framework, which predicted large uncertainty in some municipalities.
Most of the earliest serosurveys (up to early 2000) seemed to have targeted high-risk populations [36], presumably because the perceived risk of Chagas disease transmission in those areas was higher and required improved situational awareness. However, using only such information to make predictions across Colombia would have led to higher predicted FoI in areas where no ground-truth data had been collected. By contrast, the most recent serosurveys (2010-2014) seem to have been conducted on more representative samples of the population, presumably motivated by providing a more realistic assessment of the epidemiological situation nationally and demonstrating progress in reducing vectorial transmission. We, therefore, included the year when the serosurvey took place to account for this bias and, in fact, this variable appeared to be one of the most influential in all three methods and particularly for BRT and RF (Table 2). This demonstrates the crucial importance of understanding the motivation behind the implementation of serosurveys in order to assess the sampling strategy and ultimately quantify potential biases that may interfere with the representativeness of FoI estimates.
Finally, and regarding computational aspects, the LM framework required substantial userinput to prepare the data for model fitting (including data transformation; choice of predictors included in each model; tests for spatial and temporal correlation, etc.). In contrast, ML frameworks were faster (particularly RF) and required less pre-processing of the data and hard-drive space (Table 5). These features render the ML models more flexible, more readily updatable, and thus easier and simpler to be extended to other Chagas disease-endemic countries, and potentially to other infectious diseases, including neglected tropical diseases, for which serological surveys are regularly conducted as surveillance tools to assess epidemiological situation, incidence, and impact of control interventions across spatial and temporal scales [5,6,37,38].

Concluding remarks
ML methods are increasingly used to derive computationally efficient algorithms for data analysis that are agnostic to the distributional properties of such data. They represent an attractive modelling tool for the generation of predictive maps of important infectious disease epidemiological metrics, such as the FoI. Most published literature on the subject use measures of FoI central tendency, neglecting to quantify, propagate and ultimately communicate the uncertainty appropriately. We show that the uncertainty on the input variables cannot be neglected whatever statistical method is used and furthermore that the choice of modelling framework requires careful consideration according to the ultimate objectives of the modelling endeavour. If the aim is, for instance, to use the predicted FoI patterns to provide numbers of cases and estimates of the associated disease burden, ML framework (and particularly RF) would indeed be an optimal choice, as capturing the median (central tendency behaviour) may be sufficient and computationally efficient. However, if the objective is to identify areas where serological surveillance surveys are scarce and should be conducted to improve the reliability of FoI estimates and provide ground-truth data, we conclude that the LM framework, albeit more timeconsuming and computationally intensive, would provide a better indication of where uncertainty is greatest. Although in this paper we focused on Chagas disease in Colombia as a case study, the modelling frameworks compared here can be applied to other Chagas diseaseendemic countries and to infectious diseases (including neglected tropical diseases) for which age-stratified serological data are regularly collected.