Abstract
Accurate mapping and disaggregation of key health and demographic risk factors have become increasingly important for disease surveillance, which can reveal geographical social inequalities for improved health interventions and for monitoring progress on relevant Sustainable Development Goals (SDGs). Household surveys like the Demographic and Health Surveys have been widely used as a proxy for mapping SDG-related household characteristics. However, there is no consensus on the workflow to be used, and different methods have been implemented with varying complexities. This study aims to compare multiple modelling frameworks to model indicators of human vulnerability to malaria (SDG Target 3.3) in Senegal. These indicators were categorised into socioeconomic (e.g., stunting prevalence, wealth index) and malaria prevention indicators (e.g., indoor residual spraying, insecticide-treated net ownership). We compared three categories of the commonly used methods: (1) spatial interpolation methods (i.e., inverse distance weighting, thin plate splines, kriging), (2) ensemble methods (i.e., random forest), and (3) Bayesian geostatistical models. Most indicators could be modelled with medium to high predictive accuracy, with R2 values ranging from 0.40 to 0.86. No method or method category emerged as the best, but performance varied widely. Overall, socioeconomic indicators were generally better predicted by covariate-based models (e.g., random forest and Bayesian models), while methods using spatial autocorrelation alone (e.g., thin plate splines) performed better for variables with heterogeneous spatial structure, such as ethnicity and malaria prevention indicators. Increasing the complexity of the models did not always improve predictive performance, e.g., thin plate splines sometimes outperformed random forest or Bayesian geostatistical models. 
Beyond performance, we compared the different methods using other criteria (e.g., the ability to constrain the prediction range or to quantify prediction uncertainty) and discussed their implications for selecting a modelling approach tailored to the needs of the end user.
Citation: Morlighem C, Nnanatu CC, Visée C, Fall A, Linard C (2025) Spatial interpolation of health and demographic variables: Predicting malaria indicators with and without covariates. PLoS One 20(5): e0322819. https://doi.org/10.1371/journal.pone.0322819
Editor: Jinyi Wu, Wuhan Fourth Hospital, CHINA
Received: April 14, 2024; Accepted: March 26, 2025; Published: May 29, 2025
Copyright: © 2025 Morlighem et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: This study uses data from the 2017 Senegal Continuous DHS (with questions on population and housing, respondent’s characteristics, nutrition of children and adults, malaria and water, sanitation and hygiene). DHS datasets are publicly available by registration and request to the DHS Program (https://dhsprogram.com/data/dataset_admin/login_main.cfm). The Guide to DHS statistics (https://dhsprogram.com/data/Guide-to-DHS-Statistics/) as well as the 2017 Senegal DHS report can be used to calculate the indicators of this study. National boundary shapefiles can be downloaded from the GADM website (https://gadm.org/). R scripts used to conduct this study are available at https://doi.org/10.6084/m9.figshare.24874218.v1. A simulated DHS dataset is provided at this link to demonstrate how the code works. All other data (covariates, interpolated surfaces, prediction grid) are available at the link above.
Funding: CM is a Research Fellow from the Fonds de la Recherche Scientifique (F.R.S.-FNRS) (https://www.frs-fnrs.be/fr/). The funder did not play any role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
The Sustainable Development Goals (SDGs) were launched by the United Nations as a set of 17 goals and 169 related targets to be achieved through a collective effort by 2030, addressing the global challenges of social and economic inequalities, climate change and planet degradation. These goals include ending poverty and hunger, providing quality education and health for all, and achieving gender equality, among others [1]. The SDGs were defined with the aim of ‘leaving no one behind’, to ensure that everyone is included in this 2030 agenda [1]. Achieving these goals requires consistent tracking of progress towards the SDGs, and to this end, a series of 232 SDG indicators have been defined for evaluation and monitoring [1]. Yet, global and regional indicators mask large disparities at the sub-regional level, but also at finer scales, arising from complex social and environmental processes [2]. In this context, mapping these indicators helps to reveal spatial inequalities at different scales below the sub-regional level (e.g., urban/rural, intra-urban), thus contributing to ‘leaving no one behind’ [3]. Maps of SDG indicators at a finer scale can serve several purposes: they can help track the progress towards the SDG attainment, support decision-making to achieve the SDG agenda [3,4], help evaluate the impact of interventions and health investments [5], and can also be used as base covariate layers for spatial models of other health and demographic variables [4], e.g., malaria risk models [6,7].
Common approaches to mapping SDG indicators (e.g., poverty mapping) often rely on census data and geospatial data analysis, sometimes combined with ancillary survey data, as in small area estimation [3,8]. However, censuses are typically conducted only every 10 years, or even less frequently in some low-income countries, whereas monitoring SDG progress requires regularly updated maps [3]. Other drawbacks include the unreliability and/or unavailability of census data and their coarse spatial resolution in some resource-constrained settings [4]. Given these limitations, it is challenging for many government agencies to use census data to effectively monitor and assess SDG progress and plan interventions [4].
Since the mid-2000s, household survey programs such as the Demographic and Health Surveys (DHS), Malaria Indicator Surveys (MIS) and Multiple Indicator Cluster Surveys (MICS) have served as a substitute for census data, especially in countries where the last census is outdated [3,5]. In particular, DHS provide estimates of key health and demographic indicators, based on nationally representative samples, that can help measure progress towards various SDGs: percent distribution of population by wealth quintiles (SDG 1: ‘No poverty’), prevalence of stunting and anemia (SDG 2: ‘Zero hunger’), HIV and malaria prevalence (SDG 3: ‘Good health and well-being’), literacy rate (SDG 4: ‘Quality education’), and percentage use of improved sanitation (SDG 6: ‘Clean water and sanitation’), among others. These indicators can be aggregated at the survey cluster level, for which geographic coordinates are available, and then interpolated to create continuous surfaces [9]. The resulting maps help assess progress towards the SDGs, as DHS are consistently conducted in more than 90 developing countries every 3–5 years [10] (or almost annually in some countries, e.g., Senegal). DHS hence provide an opportunity to map SDG indicators more accurately, at a higher spatial resolution, with regular updates in many countries [4]. Overall, DHS and MIS (from the same DHS program) have already been used for various mapping applications: mapping malaria prevalence [11–14], vaccination coverage [15,16], HIV prevalence [17], poverty [8,18,19], population age structure [4], ethnicity [20], female genital mutilation prevalence [21–23], etc.
However, although the mapping of DHS indicators is a widely studied topic, there is no consensus on the modelling workflow to be used for this purpose. Different types of models have been implemented with different levels of complexity and model inputs. Many of these studies used covariate-based approaches such as Bayesian geostatistical models [4,11,12,15,18] or machine learning techniques (e.g., random forest modelling, boosted regression trees) [8,13,19]. Yet, other research has shown that spatial interpolation methods that rely on recovering the spatial autocorrelation pattern in the data (e.g., inverse distance weighting, splines) can also be effective for mapping DHS indicators [17,20]. Given the wide variety of modelling approaches, some studies have compared Bayesian geostatistical models with other covariate-based methods: machine learning models [3,16], multivariate regression [24,25] and generalised additive models using spline interpolation [25]. Another study [26] further compared several methods that use only spatial autocorrelation, including Bayesian geostatistical models, kriging and spatial random forest. Nonetheless, there has been no systematic comparison of all these approaches that use spatial covariates and/or spatial autocorrelation, and it is still unclear how these methods perform for different indicators. Although the DHS program recommends the use of model-based geostatistics with covariates [27], from a policy perspective, it is important to compare their added value with other approaches. For example, spatial interpolation methods that do not rely on ancillary covariate data can save significant time and computational effort in data processing and model implementation, compared to complex methods that use covariates or require higher computational power.
Among the SDGs, SDG Target 3.3 focuses on ending epidemics of communicable diseases, including malaria, which accounted for 249 million cases in 2022 [28]. In this context, malaria risk maps support policymakers in targeting control interventions and achieving the SDG target. However, these maps are often based solely on the hazards that influence the suitability of the environment for malaria mosquito vectors, such as climate and land cover variables, and rarely consider the vulnerability of society to malaria. Yet, socioeconomic factors and human behaviour regarding the use of preventive measures are known to influence malaria risk [29]. In general, the risk of malaria is known to be higher in areas with higher levels of poverty, as people may be more exposed to malaria disease, for example by living in houses with poor housing materials [30]. Malaria prevention measures such as insecticide-treated nets and indoor residual spraying have a direct impact on people’s ability to anticipate malaria [29]. Women’s education can also play an important role in prevention, as it may increase household income through better jobs and improve mothers’ knowledge of malaria prevention [29,31]. Ethno-religious beliefs may also influence perceptions of malaria and the uptake of preventive measures [32]. Some people may also be more biologically susceptible to malaria, such as stunted children [29,33] or people suffering from malaria co-infections (e.g., schistosomiasis [34]). While hazard-related variables are more readily available from earth observation data, household surveys such as the DHS have the potential to provide such malaria vulnerability indicators, which can be mapped into continuous surfaces. Beyond malaria, mapping vulnerability indicators from DHS may also be relevant to other vector-borne diseases for which epidemiological survey data are not always available, such as dengue, and where maps of vulnerability indicators can assist in planning health interventions.
The aim of this paper is to produce continuous surfaces of useful malaria-related indicators from the DHS with the following sub-objectives: 1) compare three categories of methods for predicting DHS indicators (spatial interpolation methods, ensemble methods and Bayesian geostatistical models), 2) assess the added value of covariate-based methods over methods that rely only on spatial autocorrelation, and 3) provide a comprehensive assessment of the strengths and weaknesses of these methods to guide users in their choice. In addition, the codes to implement these methods are also provided (see [35]). We used DHS indicators that drive malaria vulnerability: socioeconomic and malaria prevention indicators. This study focuses on Senegal, which has set a goal of eliminating malaria by 2030 [36]. Such interpolated surfaces of malaria-related indicators could (1) be used at a later stage to build spatially integrated malaria risk models, and (2) be useful for the planning of health interventions in Senegal to achieve the 2030 elimination target. In addition, the high availability of DHS data provides an opportunity for a follow-up study in Senegal or replication in other countries.
Materials and methods
DHS data and indicators
The DHS program provides cross-sectional estimates of demographic and health indicators sampled at the national level from over 400 surveys in more than 90 low- and middle-income countries. The DHS sampling frame is stratified by geographic region and rural/urban areas within each region. Within each stratum, primary sampling units (PSUs), or clusters, are defined using enumeration areas (EAs) provided by the most recent national population census. PSUs are selected with a probability proportional to their population size, and a group of households (around 25–30) is further selected within each PSU for questionnaire interview [10,37]. Geolocations for cluster centroids are made available along with the DHS recode files, but their geographic coordinates are randomly displaced up to 2 km in urban areas and 5 km in rural areas (with a further 1% of rural clusters displaced up to 10 km) to protect the privacy of the survey participants [37,38]. While rural areas are on average less affected by this displacement, previous work showed that it significantly affects the accuracy of spatial models of DHS indicators at the urban scale [8,9,14]. We further account for this displacement at the covariate extraction stage.
In this study, we used data from the Continuous DHS conducted in Senegal in 2017 by the National Agency of Statistics and Demography of Senegal (ANSD). This survey was chosen for two main reasons. First, it is the most recent household survey with the largest number of clusters for Senegal (i.e., 400). Second, it was conducted mostly during the wet season, which lasts from June to October [39]. This ensures that indicators related to malaria prevention are measured during the season of high malaria transmission. Based on DHS recommendations [27], we selected several indicators for spatial interpolation at the national level for Senegal. These fall into two categories: socioeconomic indicators and malaria prevention indicators. These indicators were aggregated at the cluster level following instructions from the Senegal DHS 2017 report [39] and general guidance provided by the DHS program [10]. They are summarized in Table 1 and described in more detail in S1 Text.
Geospatial covariates
In this study, we compiled a set of open-source environmental and socioeconomic covariates that have been shown to correlate with DHS indicators in previous work [3,9,15]. However, in contrast to existing studies, we assembled datasets with higher spatial resolution (1 km at the coarsest) to improve the accuracy of predictions at the urban scale, as recommended in [3,8]. We selected covariates that matched 2017 as closely as possible, i.e., consistent with the DHS data. All the covariates collected in this study are listed with their characteristics in Table 2 and are described in more detail in S1 Text. Due to differences in spatial resolution, projection and extent, all covariates were resampled to a common 1x1 km grid resolution and continuous covariates were converted to z-scores to account for different units of measurement. Covariates were extracted in 5 km buffers around rural DHS cluster centroids and 2 km buffers around urban clusters, following DHS recommendations [43]. For continuous covariates, we extracted the average value per buffer. For categorical covariates, we extracted the average proportion of each class and the average minimum distance to each class [15,44].
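The standardisation and buffer-based extraction described above can be illustrated with a minimal Python sketch. It assumes a projected grid with cell centres expressed in kilometres and a toy covariate; the function names `zscore` and `buffer_mean` and the grid itself are hypothetical and are not taken from the study's R scripts.

```python
import math

def zscore(values):
    """Standardise covariate values to zero mean and unit variance."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

def buffer_mean(cells, centre, radius_km):
    """Average a continuous covariate over the grid cells inside a circular buffer.

    cells  : list of (x_km, y_km, value) tuples, one per grid cell centre (assumed layout)
    centre : (x_km, y_km) of the displaced cluster centroid
    """
    inside = [v for x, y, v in cells
              if math.hypot(x - centre[0], y - centre[1]) <= radius_km]
    return sum(inside) / len(inside) if inside else float("nan")

# Toy 1 km grid (20 x 20 cells) whose covariate value increases eastwards.
cells = [(x + 0.5, y + 0.5, float(x)) for x in range(20) for y in range(20)]

# 2 km buffer for an urban cluster, 5 km buffer for a rural cluster.
urban = buffer_mean(cells, centre=(10.0, 10.0), radius_km=2.0)
rural = buffer_mean(cells, centre=(10.0, 10.0), radius_km=5.0)
standardised = zscore([1.0, 2.0, 3.0])
```

Averaging over a buffer rather than reading the single cell under the centroid is what makes the extracted value less sensitive to the random displacement of cluster coordinates.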
Modelling approaches
Three categories of methods are tested for modelling DHS indicators: (1) spatial interpolation methods (i.e., inverse distance weighting, thin plate splines, kriging), (2) ensemble methods (i.e., random forest regression) and (3) Bayesian geostatistical models (see Table 3).
Spatial interpolation methods.
Inverse distance weighting: Inverse distance weighting (IDW) is an exact spatial interpolation method, meaning that the interpolated surface passes exactly through the sample points. It estimates the value ($\hat{z}(s_0)$) of a random variable $Z$ at an unsampled location $s_0$ as the weighted average of its values ($z(s_i)$) at the $n$-nearest observations at locations $s_i$ [60]:

$$\hat{z}(s_0) = \sum_{i=1}^{n} w_i \, z(s_i) \qquad (1)$$

where $w_i$ is the interpolation weight assigned to each point at location $s_i$ for calculating the interpolated value at location $s_0$ [60]. It is computed as follows:

$$w_i = \frac{d_{i0}^{-p}}{\sum_{j=1}^{n} d_{j0}^{-p}} \qquad (2)$$

where $d_{i0}$ is the Euclidean distance between $s_0$ and $s_i$, and $p$ is the power assigned to that distance. The higher the value of $p$, the less the influence distant observation points have on the target point [60]. The best fit for the number of neighbours ($n$) and the power parameter ($p$) was found using a 50-repeated 4-fold random cross-validation.
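As an illustration of the IDW estimator, the following Python sketch predicts a value at one target location from its nearest observations, with weights proportional to an inverse power of the distance. The function name `idw_predict`, its default arguments and the toy observations are assumptions made for this example; in the study, the number of neighbours and the power parameter were tuned by cross-validation rather than fixed.

```python
import math

def idw_predict(obs, target, n_neighbours=8, power=2.0):
    """Inverse distance weighted prediction at `target`.

    obs    : list of (x, y, z) observations
    target : (x, y) location to predict
    """
    # Sort observations by distance to the target and keep the nearest ones.
    by_distance = sorted(
        (math.hypot(x - target[0], y - target[1]), z) for x, y, z in obs
    )
    nearest = by_distance[:n_neighbours]
    # IDW is an exact interpolator: return the observed value at a sample point.
    if nearest[0][0] == 0.0:
        return nearest[0][1]
    weights = [d ** -power for d, _ in nearest]
    total = sum(weights)
    return sum(w * z for w, (_, z) in zip(weights, nearest)) / total

# Four toy clusters on the corners of a unit square (hypothetical proportions).
obs = [(0.0, 0.0, 0.2), (1.0, 0.0, 0.4), (0.0, 1.0, 0.6), (1.0, 1.0, 0.8)]
centre_value = idw_predict(obs, (0.5, 0.5))   # equidistant points: plain average
corner_value = idw_predict(obs, (0.0, 0.0))   # coincides with a sample point
```

Because the prediction is a convex combination of observed values, it can never fall outside the observed range, which is relevant when the indicator is a proportion.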
Thin plate spline: Fitting a thin plate spline (TPS) can be thought of as fitting a thin steel plate through the sample points, and this fit can be more or less smooth; the fit can pass through the sample points exactly, or it can be more flexible and deviate slightly from them [61]. This consists of fitting a function through the sample points such that the energy required to bend the steel plate is minimised [61]. This is achieved by minimising the following objective function:
$$\sum_{i=1}^{n} \left(z_i - f(x_i, y_i)\right)^2 + \lambda J(f) \qquad (3)$$

where $(x_i, y_i)$ are the coordinates of location $s_i$ in the 2D space, $z_i$ is the value of observation at location $s_i$, $J(f)$ is the penalty function, $\lambda$ is the penalty parameter and $f$ is the function fitted to the observation points. The first term in (3) represents the goodness-of-fit (measured by the squared residuals) and the second term represents the average curvature of the TPS allowed by the penalty parameter $\lambda$ [20]. Function $f$ fitted through the sample points is given by [61]:

$$f(x, y) = a_0 + a_1 x + a_2 y + \sum_{i=1}^{n} b_i \, \phi(r_i), \qquad \phi(r) = r^2 \log r \qquad (4)$$

where $r_i$ is the Euclidean distance between observation at location $s_i$ and the point with coordinates $(x, y)$, and $a_0, a_1, a_2, b_i$ are unknown coefficients to be estimated. While $a_0 + a_1 x + a_2 y$ represent the global linear trend of the spline, the TPS radial basis function $\phi$ controls the amount of local distortion [20,61]. Solving (3), we can find a closed-form solution for the parameters $a_j$ and $b_i$ for a given value of $\lambda$. At any (unsampled) $(x, y)$ location, the value of the TPS is given by (4) [61]. The penalty parameter was tuned using a 50-repeated 4-fold random cross-validation, as the recommended generalized cross-validation method can lead to unreliably small parameter estimates [62].
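The closed-form solution mentioned for (3) and (4) can be sketched in plain Python. The sketch assumes the standard 2D thin plate spline basis (squared distance times log distance) and the usual block linear system in which the penalty parameter is added to the diagonal of the basis matrix; scaling conventions for the penalty differ between software packages, so `lam` here is not directly comparable to the tuned values in the study. All function names are hypothetical.

```python
import math

def _phi(r):
    """TPS radial basis function r^2 log r, with phi(0) = 0."""
    return 0.0 if r == 0.0 else r * r * math.log(r)

def _solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_tps(pts, z, lam=0.0):
    """Return (b, a): radial coefficients and linear trend coefficients."""
    n = len(pts)
    A = [[0.0] * (n + 3) for _ in range(n + 3)]
    for i, (xi, yi) in enumerate(pts):
        for j, (xj, yj) in enumerate(pts):
            A[i][j] = _phi(math.hypot(xi - xj, yi - yj))
        A[i][i] += lam                      # lam = 0 gives exact interpolation
        for k, p in enumerate((1.0, xi, yi)):
            A[i][n + k] = p                 # linear trend block
            A[n + k][i] = p                 # side conditions on b
    sol = _solve(A, list(z) + [0.0, 0.0, 0.0])
    return sol[:n], sol[n:]

def predict_tps(pts, b, a, x, y):
    """Evaluate the fitted spline at (x, y)."""
    trend = a[0] + a[1] * x + a[2] * y
    return trend + sum(bi * _phi(math.hypot(x - xi, y - yi))
                       for bi, (xi, yi) in zip(b, pts))

# Toy example: five non-collinear points with arbitrary values.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.2)]
z = [0.1, 0.9, 0.4, 0.7, 0.5]
b, a = fit_tps(pts, z)                      # exact fit through the points
smooth_b, smooth_a = fit_tps(pts, z, lam=1.0)
```

Unlike IDW, nothing in this system constrains predictions to the observed range, which is consistent with the out-of-range values reported for TPS in the results.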
Kriging: Kriging has also been used in previous work to map DHS indicators [63–65]. Like IDW, it estimates the value of a random variable at unsampled locations as the weighted average of its values at sample points. The difference with IDW is that the weights are estimated based on a variogram function that recovers the spatial autocorrelation pattern in the data [60]. The variogram function describes how the dissimilarity (semi-variance) between the values of the random variable $Z$ at pairs of sample points increases with the distance $h$ separating these points:

$$\gamma(h) = \frac{1}{2} E\left[\left(Z(s_i) - Z(s_j)\right)^2\right], \qquad \lVert s_i - s_j \rVert = h \qquad (5)$$

Kriging is a geostatistical method, hence the response variable is decomposed into a general non-random spatial trend modelled by the expectation $\mu(s)$ and a spatially correlated random term $\varepsilon(s)$ that represents the local deviation from this trend:

$$Z(s) = \mu(s) + \varepsilon(s) \qquad (6)$$

Ordinary and universal kriging differ in the assumptions made about the spatial trend. Ordinary kriging (OK) assumes a stationary and unknown spatial trend. The variogram function hence models both the spatial trend and the residual term [60]. In universal kriging (UK), the spatial trend is not constant over space, but depends on covariates. The global trend is then modelled as a linear combination of covariates:

$$\mu(s) = X(s)^{T} \beta \qquad (7)$$

where $X(s)$ is the vector of covariates at location $s$ and $\beta$ are the regression coefficients. In this approach, the variogram function is applied only to the spatially correlated residuals, which are then interpolated at unsampled locations using kriging weights. The final interpolated value is the sum of the interpolated residual with the global trend calculated from the covariates [66].
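To make the variogram concrete, the sketch below computes an empirical semi-variogram with the classical method-of-moments estimator: half the mean squared difference between pairs of observations, grouped into distance bins. The binning scheme, the function name `empirical_variogram` and the toy transect are assumptions for illustration; fitting a variogram model to these points and deriving kriging weights (detailed in S2 Text) are not shown.

```python
import math
from itertools import combinations

def empirical_variogram(obs, bin_width, n_bins):
    """Semi-variance per distance bin.

    obs : list of (x, y, z) observations
    Returns a list of length n_bins; bins without any pair are None.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for (x1, y1, z1), (x2, y2, z2) in combinations(obs, 2):
        k = int(math.hypot(x1 - x2, y1 - y2) // bin_width)
        if k < n_bins:
            sums[k] += (z1 - z2) ** 2
            counts[k] += 1
    # Semi-variance: half the average squared difference within each bin.
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]

# Toy transect where the value equals the x coordinate, so dissimilarity
# grows steadily with separation distance.
transect = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0), (2.0, 0.0, 2.0), (3.0, 0.0, 3.0)]
gamma = empirical_variogram(transect, bin_width=1.0, n_bins=4)
```

For universal kriging, the same computation would be applied to the residuals of the covariate regression rather than to the raw indicator values.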
The best set of covariates was selected by fitting separate linear regression models for each DHS indicator and implementing stepwise feature selection. This process identified the combination of features that minimized the AIC (Akaike Information Criterion). To avoid multicollinearity issues, only covariates with a variance inflation factor (VIF) below 5 were retained [15]. Note that ethnicity-related variables were not used to model Fula ethnicity (to avoid circularity). Details on the calculation of interpolation weights and hyperparameter tuning are given in S2 Text.
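The multicollinearity check can be sketched as follows. The variance inflation factor of a covariate equals the corresponding diagonal element of the inverse of the covariates' correlation matrix, which is the identity used here. The helper names and the toy covariates are hypothetical, and how the study sequenced the VIF filter with the stepwise AIC selection is not specified beyond the threshold of 5.

```python
import math

def correlation_matrix(cols):
    """Pearson correlation matrix of a list of equally long covariate columns."""
    n = len(cols[0])
    scaled = []
    for c in cols:
        mean = sum(c) / n
        norm = math.sqrt(sum((v - mean) ** 2 for v in c))
        scaled.append([(v - mean) / norm for v in c])
    return [[sum(p * q for p, q in zip(ci, cj)) for cj in scaled] for ci in scaled]

def invert(A):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]

def vif(cols):
    """Variance inflation factor of each covariate column."""
    inv = invert(correlation_matrix(cols))
    return [inv[j][j] for j in range(len(cols))]

# Two toy covariates with a correlation of 0.8 (hypothetical values).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
factors = vif([x1, x2])
retained = [j for j, f in enumerate(factors) if f < 5]   # threshold from the text
```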
Random forest regression.
Random forest (RF) regression is a popular machine learning algorithm that uses ensemble learning. An RF model consists of multiple decision trees that model the relationship between the response variable and a set of covariates [67]. Decision trees are built using bagging, where each tree is trained on a random sample drawn with replacement from the observation dataset. This reduces the correlation between trees and thus the risk of overfitting [67]. Originally used for ecological niche modelling, this method has been widely used to model various other spatial phenomena due to several advantages: RF handles large datasets, it is robust to multicollinearity and it handles non-linear relationships.
RF models were built using a 5-repeated 5-fold random cross-validation (i.e., 25 RF models in total). Hyperparameter tuning was performed by further dividing each fold into four sub-folds and fitting 50 RF models to tune (1) the number of covariates used at each node split, (2) the minimum number of observations per terminal node, (3) the fraction of observations used per decision tree and (4) the number of trees in the model [10].
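The repeated cross-validation scheme can be sketched as an index generator: each repeat reshuffles the observations and splits them into folds, so that 5 repeats of 5 folds yield 25 train/test partitions. The function name `repeated_kfold`, the seed and the use of 320 observations (80% of the 400 clusters, assuming tuning is done on the training set) are assumptions for this example; the nested sub-folds used for hyperparameter tuning are not shown.

```python
import random

def repeated_kfold(n_obs, n_folds=5, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for repeated random k-fold CV."""
    rng = random.Random(seed)
    for rep in range(n_repeats):
        idx = list(range(n_obs))
        rng.shuffle(idx)                               # new random partition per repeat
        folds = [idx[k::n_folds] for k in range(n_folds)]
        for k in range(n_folds):
            test = folds[k]
            train = [i for j, f in enumerate(folds) if j != k for i in f]
            yield rep, k, train, test

splits = list(repeated_kfold(320))
n_models = len(splits)   # one model would be fitted per partition
```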
To further address multicollinearity, covariates were selected using recursive feature elimination. This method iteratively removes the least important covariate from the model until the predictive performance is the highest, i.e., until the root mean square error on the cross-validation test sets is the lowest [68,69]. Covariate importance is measured as the standardised increase in the average Out of Bag (OOB) error after random permutation of the covariate values [68]. A larger increase in OOB error indicates a more important covariate. The latitude and longitude of the DHS cluster locations were added as supplementary covariates to the original set of covariates, as RF models are not explicitly spatial. Note that ethnicity-related variables were not used to model Fula ethnicity (to avoid circularity).
Bayesian geostatistical models.
Bayesian geostatistical models treat model parameters (e.g., regression coefficients) as random variables with a probability distribution [70]. Similarly, each response value in the data is considered to be sampled from a distribution. This approach makes it possible to quantify uncertainty in the predictions by constructing the posterior predictive distributions of the response variable [70]. Bayesian geostatistical models have been increasingly used to model health and demographic outcomes in recent years [4,11,12,15,18].
Estimating the posterior distributions of model parameters can be very computationally intensive for large spatial datasets. The integrated nested Laplace approximation (INLA) approach for latent Gaussian models now allows us to approximate the posterior distributions, significantly reducing the computational time compared to classical Markov chain Monte Carlo (MCMC) approaches [71]. In spatio-temporal models, INLA Bayesian inference is combined with the stochastic partial differential equation (SPDE) approach [72].
In this paper, Bayesian generalised linear models were implemented using INLA-SPDE within the R-INLA package [71], with the DHS indicators as response variables. The models can be summarised as follows:
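$$\eta(s_i) = X(s_i)^{T} \beta + \xi(s_i) + \varepsilon(s_i) \qquad (8)$$

where $\eta(s_i)$ is the realisation of a spatial field at cluster location $s_i$, $X(s_i)$ is the vector of covariates at location $s_i$ and $\beta$ are the regression coefficients. $\xi$ is a vector of SPDE random effects representing the structured spatial random variation in $\eta$. It is modelled as a zero-mean Gaussian Process (GP) with Matérn covariance matrix $\Sigma$ [71]. Spatial dependence between two observations of the GP at locations $s_i$ and $s_j$ is modelled using the Matérn covariance function, defined as follows:

$$\mathrm{Cov}\left(\xi(s_i), \xi(s_j)\right) = \frac{\sigma^2}{2^{\nu - 1} \Gamma(\nu)} \left(\kappa \lVert s_i - s_j \rVert\right)^{\nu} K_{\nu}\left(\kappa \lVert s_i - s_j \rVert\right) \qquad (9)$$

where $\lVert s_i - s_j \rVert$ is the Euclidean distance between $s_i$ and $s_j$, $\sigma^2$ is the marginal variance, $\nu$ controls the smoothness of the spatial process, $K_{\nu}$ is the modified Bessel function of the second kind and $\kappa$ is a scale parameter controlling the range, i.e., the distance from which there is no more spatial autocorrelation in the data. The unexplained random variation is included in the term $\varepsilon(s_i)$.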
All DHS indicators representing proportions (Table 1) were modelled using a binomial likelihood, with a zero-inflated binomial likelihood for IRS given the high prevalence of zero values. A Gaussian likelihood was used for the wealth index. As with UK, covariate selection was performed using stepwise selection on linear regression models (excluding ethnicity-related variables for modelling Fula ethnicity). Multicollinearity was addressed by checking the VIF. Details on the SPDE approach and prior specification are provided in S2 Text.
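The behaviour of the Matérn covariance function described above can be illustrated with a short Python sketch. To stay within the standard library, it uses the closed forms that the Matérn function takes for smoothness values of 0.5 and 1.5, where the Bessel function reduces to an exponential; these smoothness values are chosen for illustration only and are not the setting used in the study. The range formula, the square root of 8 times the smoothness divided by the scale parameter, is the empirical relation commonly used with the SPDE approach, at which the correlation has dropped to roughly 0.1. Function names are hypothetical.

```python
import math

def matern(d, sigma2, kappa, nu):
    """Matérn covariance at distance d for nu = 0.5 or nu = 1.5 (closed forms)."""
    x = kappa * d
    if nu == 0.5:
        return sigma2 * math.exp(-x)               # exponential covariance
    if nu == 1.5:
        return sigma2 * (1.0 + x) * math.exp(-x)
    raise ValueError("closed form implemented only for nu = 0.5 and nu = 1.5")

def practical_range(kappa, nu):
    """Distance at which the spatial correlation is roughly 0.1."""
    return math.sqrt(8.0 * nu) / kappa

# Hypothetical parameters: marginal variance 2, scale 0.05 per km.
sigma2, kappa = 2.0, 0.05
rng_km = practical_range(kappa, nu=1.5)
corr_at_range = matern(rng_km, sigma2, kappa, nu=1.5) / sigma2
```

A larger scale parameter therefore implies a shorter range, i.e., spatial autocorrelation that fades over smaller distances.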
Model validation and predictions
Model validation followed the validation scheme of [3], which is also used to compare different methods. DHS data were divided into a training and a test set using an 80–20 split ratio. All methods were trained on the training set, with cross-validation applied for hyperparameter tuning when necessary (as detailed in each method section). Predictive performance was evaluated on the 20% test set using (i) the root mean square error (RMSE), (ii) the mean absolute error (MAE), and (iii) the R2 (coefficient of determination) calculated as follows [3]:
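$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2} \qquad (10)$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left|y_i - \hat{y}_i\right| \qquad (11)$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2} \qquad (12)$$

where $n$ is the number of observations, $y_i$ is the value of observation $i$, $\hat{y}_i$ is its predicted value and $\bar{y}$ is the mean value over all observations. The RMSE and MAE are restricted to non-negative values, with lower values indicating better performance. The R2 ranges from minus infinity to 1, with higher values indicating better performance. For Bayesian geostatistical models, metrics were calculated using the posterior mean estimates. Predictive maps of DHS indicators (Table 1) were generated by each method on a common 1x1 km resolution grid.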
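The three validation metrics can be computed as in the following Python sketch. It uses the definition of R2 as one minus the ratio of the residual sum of squares to the total sum of squares, which is the definition consistent with a range from minus infinity to 1. The function names and the toy test-set values are hypothetical.

```python
import math

def rmse(y, y_hat):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def r2(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

# Toy test set of cluster-level proportions (hypothetical values).
observed = [0.2, 0.4, 0.6, 0.8]
predicted = [0.3, 0.4, 0.5, 0.9]
scores = (rmse(observed, predicted), mae(observed, predicted), r2(observed, predicted))

# A model that predicts worse than the observed mean gets a negative R2.
poor = r2(observed, [1.0, 1.0, 1.0, 1.0])
```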
Finally, all methods were compared using the following criteria: predictive performance (RMSE, MAE and R2 [3]), restriction of the prediction range to the observation range, sensitivity to the number and spatial distribution of observations, uncertainty quantification, computational efficiency, need for covariate collection and processing, ease of implementation/intuitiveness and handling of missing response data.
Results
Model validation
Model predictive performance was evaluated on the test set and measured with the RMSE, MAE and R2 [3]. R2 values are summarised in Table 4, Figs 1 and 2. RMSE and MAE scores are given in S1 Table. Some DHS indicators could be accurately predicted, with best R2 values between 0.62 and 0.86 (literacy, sanitation, Fula ethnicity, IRS and wealth index). Only two indicators were not accurately predicted, with R2 values around 0.30 (ITN ownership for 2 and anemia). For the remaining indicators, the models achieved medium predictive performance, with best R2 scores ranging from 0.40 to 0.49 (ITN access, stunting and ITN ownership) (Table 4). The RMSE and MAE scores also indicate overall good model performance, with RMSE values ranging from 0.087 to 0.194 for all proportion-type indicators and 0.330 for the wealth index. MAE values ranged from 0.044 to 0.150 for all proportion-type indicators and 0.258 for the wealth index (S1 Table).
R2 values are evaluated on the test set (20% of the data). R2 is the coefficient of determination (values range from minus infinity to 1), with higher values indicating better model performance. Orange, green and blue colours respectively indicate spatial interpolation methods, ensemble methods and Bayesian geostatistical models. Abbreviations: IDW (inverse distance weighting), TPS (thin plate spline), OK (ordinary kriging), UK (universal kriging), RF (random forest), BM (Bayesian model), ITN (insecticide-treated net), IRS (indoor residual spraying).
DHS indicators are grouped by socioeconomic (a) and malaria prevention (b) categories. For each indicator, the R2 values from all methods are combined to create the boxplot. R2 values are evaluated on the test set (20% of the data). R2 is the coefficient of determination (values range from minus infinity to 1), with higher values indicating better model performance. Blue and orange colours respectively indicate covariate-based models and covariate-independent models. Abbreviations: ITN (insecticide-treated net), IRS (indoor residual spraying).
None of the three categories of methods emerged as the best performer (Fig 1 and S1 Table). For all socioeconomic indicators except Fula ethnicity, the best performing method always used covariates and was most often RF (Table 4 and S1 Table). In addition, RF was always in the top 3 methods for socioeconomic indicators (Fig 1b–1f and S1 Table), except for Fula ethnicity, for which TPS and OK performed best. Compared to socioeconomic indicators, malaria prevention indicators showed more variation in the best performing methods, which ranged between TPS (based on spatial autocorrelation), BM and UK (both using covariates) (Table 4). When BM and UK performed best, they only slightly outperformed methods without covariates (TPS or OK), with R2 differences of up to 0.02 (see column ‘Best diff.’ in Table 4). In comparison, for socioeconomic variables, covariate-based methods outperformed covariate-independent methods by up to 0.28 in R2. These results suggest that methods that rely only on spatial autocorrelation tend to perform better in modelling malaria prevention indicators than socioeconomic indicators (with the exception of Fula ethnicity). Fig 2 further highlights these findings, with covariate-based methods showing higher R2 values for socioeconomic variables (Fig 2a). For malaria prevention indicators, there is less difference between the performance of covariate-independent and covariate-based methods (Fig 2b). Note that these trends are also consistent with the RMSE and MAE values (S1 Table).
Geospatial covariates
The results of covariate selection for UK, RF and BM are detailed in S2 Table. On average, RF used 28 covariates, whereas BM and UK used only 12 (S2 Table). These methods selected different types of covariates. Some covariates were selected for most DHS indicators by RF (e.g., climatic variables, longitude, nighttime lights), while UK and BM used more land cover variables (e.g., proportion of crops). A few variables were used in all models, in particular accessibility variables such as walking time to health facilities and distance to major roads (S2 Table), which are key variables influencing health indicators. It is expected that different methods would select different variables; while RF handles multicollinearity [67], UK and BM require multicollinearity checks prior to modelling. RF deals with non-linear relationships between the response variable and the covariates [67], whereas the feature selection for UK and BM was based on linear regression models. Lastly, unlike UK and BM, RF does not explicitly account for spatial autocorrelation.
Predictive maps
We produced predictive maps of the DHS indicators at a resolution of 1 km in Senegal. Fig 3 shows maps of all DHS indicators predicted using their respective best performing method. Most indicators show a west-east gradient, with better socioeconomic status and access to malaria prevention in the western regions. Fula ethnicity and IRS do not follow this pattern and are instead highly clustered in space.
For each indicator, the best performing method was used, that is, thin plate splines (Fula ethnicity, ITN ownership, ITN access), random forest (stunting, anemia, sanitation, wealth index) and Bayesian models (literacy, ITN ownership for 2, IRS). ‘Out of range’ indicates predicted values that are outside the possible range of values of the indicator (below 0 or above 1 for proportions). National boundaries were downloaded from GADM. Abbreviations: TPS (thin plate spline), RF (random forest), BM (Bayesian model), ITN (insecticide-treated net), IRS (indoor residual spraying).
To compare the predictions between methods, each indicator was also predicted using all the methods in this study. Fig 4 shows, as an example, the predicted access to basic sanitation service at 1x1 km for Senegal. Similar maps can be found for other DHS indicators in the electronic supplementary material (S1–S18 Figs). All maps in Fig 4 show a similar pattern of predicted access to basic sanitation service, with a decreasing gradient from west to east. However, due to the specificities of each method and their relative performance (see Table 4), the maps show differences in predictions at a finer scale. Because they use covariates, RF and BM produced a higher level of detail (Fig 4e and 4f) than TPS and OK (Fig 4b and 4c), which predicted smoother interpolation surfaces. TPS and UK predicted out-of-range values, i.e., values outside the possible range for the indicator (from 0 to 1 as it is a proportion) (Fig 4b and 4d). Kriging and Bayesian models made it possible to quantify the uncertainty in the predictions, as the prediction variance and the standard deviation respectively (see Fig 5 for an example).
The maps show the spatial distribution of the proportion of households with access to basic sanitation in Senegal. ‘Out of range’ indicates predicted values that are outside the possible range of values of the indicator (below 0 or above 1). Out-of-range predictions were made by thin plate spline and universal kriging. National boundaries were downloaded from GADM.
Uncertainty is measured as a prediction variance (var) for kriging methods (a, b) and as a standard deviation (SD) for Bayesian models (c). Higher values of SD or variance indicate areas with greater uncertainty in the predicted access to basic sanitation, reflecting lower confidence in the accuracy of the predictions. National boundaries were downloaded from GADM.
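The 'Out of range' class shown on the maps can be derived from any predicted surface with a simple mask. The sketch below assumes a proportion indicator bounded by 0 and 1; the function name and the cell values are hypothetical.

```python
def out_of_range_mask(predictions, lower=0.0, upper=1.0):
    """Flag predictions that fall outside the plausible range of the indicator."""
    return [p < lower or p > upper for p in predictions]


# Hypothetical predictions for four grid cells of a proportion indicator
cells = [-0.05, 0.20, 0.75, 1.10]
mask = out_of_range_mask(cells)
share = sum(mask) / len(mask)  # fraction of cells with implausible values
print(mask, share)
```

Reporting the share of flagged cells alongside the accuracy metrics gives a quick indication of how usable a map is for decision-making.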
Comparison of methods
Although predictive performance is an essential feature for model comparison, there are other criteria to consider. These comparison criteria are summarised in Table 5 and discussed in the following sections.
IDW is straightforward, intuitive and computationally efficient. In addition, the interpolated values are constrained to the range of observed values because the interpolation weights always lie between 0 and 1 (and sum to 1) [60]. However, as the weights rely solely on distance and do not consider the spatial pattern in the data, IDW is highly sensitive to the spatial distribution and number of observations. In areas with dense observations, the sample points used for interpolation may be close to the target point. In areas with sparse observations, neighbouring sample points may be located at large distances, leading to inaccurate predictions [60]. IDW does not quantify prediction uncertainty and does not account for missing data in the response variable.
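The range constraint follows directly from how the weights are built, as the sketch below illustrates. It assumes planar (projected) coordinates and a distance power of 2; both are illustrative choices rather than the settings used in the study, and the sample points are hypothetical.

```python
from math import hypot


def idw_predict(points, values, target, power=2.0):
    """Inverse distance weighting at one target location.

    Returns the prediction and the normalised weights.
    """
    dists = [hypot(x - target[0], y - target[1]) for x, y in points]
    if 0.0 in dists:
        # Target coincides with a sample point: use that observation only
        raw = [1.0 if d == 0.0 else 0.0 for d in dists]
    else:
        raw = [d ** -power for d in dists]
    total = sum(raw)
    weights = [r / total for r in raw]  # each in [0, 1], summing to 1
    prediction = sum(w * v for w, v in zip(weights, values))
    return prediction, weights


# Hypothetical cluster coordinates (km) and observed proportions
points = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (40.0, 40.0)]
values = [0.2, 0.5, 0.9, 0.4]
prediction, weights = idw_predict(points, values, target=(3.0, 4.0))
print(round(prediction, 3), [round(w, 3) for w in weights])
```

Because the prediction is a convex combination of the observations, it can never fall below the smallest or above the largest observed value, whatever the target location.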
TPS can be implemented easily, without any covariate processing. In addition, it can recover both completely linear and highly convex surfaces [20]. A drawback of this flexibility is that TPS may overfit noise in the data, leading to less reliable predictions in certain cases. If the spatial distribution of sample points is irregular, TPS may overfit in regions with dense data points and oversmooth in regions with sparse data. In addition, TPS can predict values largely outside the plausible range of values, i.e., the so-called Runge effect, where the interpolated surface oscillates between the sample points and ‘overshoots’ [60]. TPS made out-of-range predictions for four indicators (Fig 4b and S9b, S15b and S17b Figs).
Kriging offers an improvement over IDW for estimating interpolation weights by using the spatial autocorrelation pattern in the response variable. The prediction variance helps to assess the uncertainty of the predictions. High prediction variance means that the predicted value is based on sample points located at greater distances. Kriging is more robust to the relative distribution and number of observations [60], provided that there are enough observations to recover the spatial pattern. Kriging methods sometimes predict out-of-range values due to negative weights when the dataset contains both highly clustered and more distant observations [73]. Local kriging methods could compensate for negative weights, but are computationally inefficient [20]. OK and UK made out-of-range predictions for one (S15c Fig) and seven indicators respectively (Fig 4d and S3d, S7d, S9d, S11d, S15d and S17d Figs).
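Unlike IDW weights, kriging weights are only required to sum to 1, so individual weights can be negative. The sketch below illustrates this with a deliberately small one-dimensional ordinary kriging example. The Gaussian covariance model, its parameters and the two observations are assumptions chosen for illustration; it is a simpler configuration than the clustered case described in [73] and is not the variogram fitted in the study.

```python
from math import exp


def gaussian_cov(h, sill=1.0, length=2.0):
    """Gaussian covariance model (assumed for illustration)."""
    return sill * exp(-((h / length) ** 2))


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ordinary_kriging(xs, zs, x0):
    """Ordinary kriging at x0; returns prediction, variance and weights."""
    n = len(xs)
    # Kriging system: [C 1; 1' 0] [w; mu] = [c0; 1]
    a = [[gaussian_cov(abs(xi - xj)) for xj in xs] + [1.0] for xi in xs]
    a.append([1.0] * n + [0.0])
    b = [gaussian_cov(abs(xi - x0)) for xi in xs] + [1.0]
    sol = solve(a, b)
    w, mu = sol[:n], sol[n]
    pred = sum(wi * zi for wi, zi in zip(w, zs))
    var = gaussian_cov(0.0) - sum(wi * ci for wi, ci in zip(w, b[:n])) - mu
    return pred, var, w


# Two hypothetical observations of a proportion, predicted beyond the last one
xs, zs = [0.0, 1.0], [0.2, 0.9]
pred, var, weights = ordinary_kriging(xs, zs, x0=2.0)
print(round(pred, 3), round(var, 3), [round(w, 3) for w in weights])
```

In this configuration the more distant observation receives a negative weight, so the prediction continues the upward trend and exceeds 1 even though both observations are valid proportions.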
RF deals with multicollinearity and non-linear relationships between covariates and the response variable [67]. Being based on decision trees, it implicitly captures interactions between covariates. As a non-spatial model, it is less sensitive to the spatial distribution of observations than the previous methods, provided there are enough observations to recover the relationships between the response and the covariates. In addition, predictions are constrained to the range of observations. However, RF is computationally intensive when tuning the hyperparameters and performing stepwise covariate selection. It is less intuitive than IDW, TPS and kriging, and does not quantify prediction uncertainty. The standard RF model does not handle missing values in the response, but some recent implementations (e.g., missForest) allow imputation of missing data at model fit [74].
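The reason why tree ensembles cannot predict outside the observed range can be shown with a toy example. The sketch below bags one-split regression trees ('stumps') on a single covariate; it is a simplified stand-in for RF, not the implementation or the hyperparameters used in the study, and the data are hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree; each leaf predicts its mean response."""
    best = None
    for t in sorted(set(xs))[:-1]:  # thresholds that leave both leaves non-empty
        left = mean(y for x, y in zip(xs, ys) if x <= t)
        right = mean(y for x, y in zip(xs, ys) if x > t)
        sse = sum((y - (left if x <= t else right)) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, t, left, right)
    if best is None:  # all covariate values identical: a single leaf
        overall = mean(ys)
        return (float("inf"), overall, overall)
    return best[1:]


def fit_forest(xs, ys, n_trees=50, seed=0):
    """Bagging: fit each stump on a bootstrap sample of the observations."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict(forest, x):
    """Average the leaf means of all trees."""
    return mean(left if x <= t else right for t, left, right in forest)


# Hypothetical covariate (e.g., nighttime lights) and observed proportions
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.10, 0.15, 0.30, 0.55, 0.70, 0.85]
forest = fit_forest(xs, ys)
far_prediction = predict(forest, 1000.0)  # far beyond the observed covariates
print(round(far_prediction, 3))
```

Every leaf value is an average of observed responses and the ensemble averages the leaves, so even an extreme covariate value yields a prediction between the smallest and largest observation.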
By defining priors and a likelihood in the model, Bayesian models allow predictions to be constrained to a range of values, avoiding out-of-range predictions [70]. If the model is robust, the estimation of the posterior distributions of the parameters is less affected by the removal or addition of observations in the model. In addition, methods that use covariates are less sensitive to uneven spatial distribution of sample points, because even if they fail to recover the spatial autocorrelation pattern, they also depend on the covariates to model the response variable. Uncertainty can be quantified through the standard deviation and the 2.5th and 97.5th percentiles of the posterior distribution. Unlike the previous methods, Bayesian models handle missing response data and provide imputation during model fitting. However, they can be time-consuming to fit with many covariates and may be less intuitive than other models for users without statistical training.
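The way a likelihood and link function bound the predictions, and how the posterior summaries are obtained, can be sketched as follows. The example assumes a logit link for a proportion indicator, which this section does not confirm, and it replaces the fitted model with simulated draws of the linear predictor at a single grid cell; all numbers are hypothetical.

```python
import random
from math import exp
from statistics import mean, quantiles, stdev


def inv_logit(eta):
    """Map a linear predictor on the real line to a proportion in (0, 1)."""
    return 1.0 / (1.0 + exp(-eta))


rng = random.Random(42)

# Stand-in for posterior draws of the linear predictor at one grid cell;
# in a real analysis these would come from the fitted geostatistical model.
eta_draws = [rng.gauss(1.5, 1.2) for _ in range(4000)]
p_draws = [inv_logit(eta) for eta in eta_draws]

cuts = quantiles(p_draws, n=40)  # cut points every 2.5 percentage points
summary = {
    "mean": mean(p_draws),
    "sd": stdev(p_draws),
    "q2.5": cuts[0],
    "q97.5": cuts[-1],
}
print({name: round(value, 3) for name, value in summary.items()})
```

Whatever the value of the linear predictor, the transformed draws stay strictly between 0 and 1, so the posterior mean and the credible interval are always valid proportions.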
Discussion
Achieving the SDG 2030 agenda requires fine-resolution mapping of SDG indicators to monitor progress and support targeted health interventions [3,4]. In this context, DHS data have been widely used to map SDG indicators across countries and over time, using a variety of methods in terms of complexity and model inputs [4,8,11–13,15,17–20]. In this paper, we compared three categories of methods to model DHS indicators: (1) spatial interpolation methods (i.e., IDW, TPS, OK, UK), (2) ensemble methods (i.e., RF), and (3) Bayesian geostatistical models (BM). Focusing on SDG Target 3.3, we applied these methods to map socioeconomic and malaria prevention indicators at 1 km resolution in Senegal.
Our findings show that most indicators could be mapped with medium to high predictive accuracy, with R2 values ranging from 0.40 to 0.86. This is in line with previous studies modelling DHS indicators in African countries [3,9,15]. However, there was no consensus on the best method or category of methods. RF was the best method for modelling most socioeconomic indicators. Previous studies have already highlighted its good performance with DHS data (e.g., malaria prevalence [75], wealth index [8,19]). For malaria prevention indicators, there was more diversity in the best performing approaches (ranging between TPS, UK and BM). Increasing model complexity did not always improve predictive performance, e.g., TPS sometimes outperformed RF or Bayesian geostatistical models (e.g., ITN access in Table 4), as in previous work [25]. Previous research that estimated spatial accessibility to health facilities has questioned the use of data-demanding methods when (outcome) data are of poor quality [76]. This may be the case here, where the displacement of DHS cluster coordinates for anonymisation affects the spatial resolution of the data [38]. Nevertheless, other research has shown that Bayesian models can accurately model DHS indicators in various settings [3,4,12,15,18,24]. Overall, our conclusion that no single method consistently excels in all contexts aligns with [3], which found no best approach between Bayesian geostatistical models and artificial neural networks. Similarly, [25] highlights that model performance depends on factors such as the study area, data scarcity and the structure of the spatial pattern.
Although there was no clear best method, we found that socioeconomic indicators were generally better predicted using covariates, while methods relying only on spatial autocorrelation performed better for Fula ethnicity and malaria prevention indicators. This may be explained by the spatial clustering of both ethnicity and access to malaria prevention, which may be better captured by methods recovering spatial autocorrelation patterns. Ethnic groups may be spatially clustered due to historical factors such as migration patterns and settlement practices (see the distribution of Fula ethnicity in Fig 3a). Access to malaria prevention may be influenced by policy decisions on intervention planning, resulting in strong spatial autocorrelation in areas (e.g., administrative unit, health district) where interventions are implemented. This may explain why TPS performed well for these indicators. The TPS surface can adapt to both abrupt changes, capturing local variations (by passing through the sample points), and smooth trends (by deviating from the sample points) [20,61]. This makes TPS particularly suitable for datasets with heterogeneous spatial structures (e.g., variable values change abruptly in space, as seen with intervention-related variables and ethnic groups). Another explanation for the lower performance of covariate-based methods could be that not all health variables are environmentally linked [9]. While socioeconomic status is more stable over time, access to malaria prevention can change rapidly following interventions such as ITN distribution, making it less predictable by long-term covariates. The lack of association between malaria prevention access and covariates may also be due to seasonal effects, as prevention behaviour and access peak during the rainy season, which was not accounted for in the covariates.
The high cloud cover in Senegal during the rainy season limits the availability of satellite imagery [48], making it challenging to compile covariates from the wet season. Annual composites (e.g., NDVI and NDWI) may therefore better represent the dry season [48], missing temporary water bodies and other seasonal features that correlate with malaria risk.
This paper also provides a list of criteria that can be used to compare models for mapping DHS indicators and health and demographic variables in general. Constraining predicted values to the observed range is essential, as highly accurate maps can be useless with out-of-range predictions. Low sensitivity to the number and spatial distribution of observations is also important; overall, covariate-based methods may be less sensitive because the structure of the response variable is also captured by the covariates. Distance-based approaches such as IDW will typically be less accurate in areas with sparse observations [60]. Computational efficiency, covariate processing and intuitiveness of the methods also need to be considered. Where complex approaches perform better, the trade-off between complexity and information gain might be assessed. For example, RF sometimes only slightly improved predictive performance over IDW (e.g., see ITN ownership in Table 4). However, RF requires more computational resources and storage to handle large covariate datasets, whereas IDW can be implemented in GIS software without coding or extensive covariate processing. Lastly, the ability to quantify the uncertainty in predictions and handle missing data are other interesting features to consider. These criteria were used to compare the methods employed in this study. IDW, for instance, is an intuitive, computationally efficient method that does not predict out-of-range values [60]. However, it is sensitive to the spatial distribution of observations and cannot quantify prediction uncertainty or handle missing data effectively. TPS offers more flexibility, being able to model both linear and highly convex surfaces [20], but it can overfit data and become less reliable, especially with irregular spatial distributions of sample points. In some cases, it can lead to out-of-range predictions, known as the Runge effect. 
Kriging improves upon IDW by incorporating spatial autocorrelation into the interpolation process and quantifying prediction uncertainty [60]. This makes it more robust to uneven spatial distributions of observations, although it can sometimes lead to out-of-range predictions [73]. RF handles non-linear relationships with covariates and multicollinearity [67], and is less sensitive to the spatial distribution of observations than spatial models. Predictions are constrained by the range of observations, but RF is computationally intensive and does not quantify prediction uncertainty. Bayesian geostatistical models constrain predictions to a defined range, handle missing data inherently and allow uncertainty to be quantified [70]. They are less sensitive to the number of observations and spatial distribution, but are computationally intensive and require statistical expertise.
The choice of method for predicting DHS indicators depends on the purpose of the predictive maps. Planning health interventions may require accurate maps with uncertainty estimates, in which case Bayesian geostatistical models are appropriate. Yet, simpler approaches may sometimes be preferred to complex geostatistical models. Where computational resources and time allow, or where relationships with covariates are of interest, Bayesian or RF models can also be used. Maps used for communication (e.g., reports) or as covariates to model other variables could be based on simpler methods such as IDW or Bayesian models without covariates to avoid circularity issues. Simpler methods can also be used if the main objective is to identify hotspots or coldspots (e.g., areas with low access to malaria prevention) rather than to obtain the actual estimates. Research that aims to model multiple outcomes simultaneously could investigate joint Bayesian models, which account for correlations between the spatial structures of all response variables. However, these models are significantly more complex and require more computational resources. Overall, the results of this study suggest using covariate-based models for socioeconomic indicators and methods based on spatial autocorrelation for variables with heterogeneous spatial structures, such as malaria prevention or ethnicity. We do not recommend using methods that predict out-of-range values (TPS and kriging) as out-of-range predictions are useless for decision-making. Future work will investigate TPS with tension or combine TPS with other interpolation methods (see [20]) to avoid out-of-range predictions. Similarly, kriging could incorporate corrections for negative weights, see the algorithms developed in [73].
This study has some limitations related to the data and methods used. Although the DHS data were collected during the rainy season, the survey covers several months, during which malaria transmission and prevention behaviours may vary. Seasonal indicators, such as malaria prevention, may therefore be less accurately captured by DHS. In addition, IRS coverage was better predicted than ITN-related indicators (Table 4), probably because IRS refers to households sprayed in the year prior to the survey, whereas ITN-related indicators are measured at the time of the survey, which varies spatially as the DHS is conducted across the country. Other limitations associated with DHS are the displacement of survey cluster coordinates and the fact that DHS are designed to be representative at the level of coarse administrative areas, rather than at the cluster level [10]. Future work could address these limitations by incorporating the month of data collection, the survey design and the uncertainty in cluster coordinates [77] into the Bayesian geostatistical models. Bayesian models could also include interactions between covariates and random effects to account for the urban/rural character of DHS clusters and potential non-linear relationships with other covariates, although this would increase computational costs. Covariate-based models could also benefit from covariates on the availability and quality of malaria services at health facilities, obtained from the Service Provision Assessment surveys of the DHS program. Note that this study does not raise ethical concerns as the DHS data are provided after anonymisation and the output maps are model outcomes with a spatial resolution of 1 km.
Overall, our results show that DHS data can be used to produce interpolated surfaces of vulnerability factors that influence malaria risk. Such interpolated surfaces can support policy makers in planning malaria control interventions by showing areas with poor access to malaria prevention. ITN and IRS coverage maps can then be used to target ITN and IRS interventions in Senegal to help achieve the 2030 elimination target. Furthermore, this work can be extended to other vector-borne diseases, such as dengue, for which epidemiological data are not readily available, making maps of vulnerability indicators even more important for intervention planning. Although we focused only on indicators from the Senegal 2017 DHS, the findings of this study may be useful in other settings. The results may be generalisable to other DHS surveys, as these are conducted according to standardised protocols and are comparable across countries [10]. In addition, we found that methods based on spatial autocorrelation performed better for malaria prevention and ethnicity variables due to their heterogeneous spatial structures. This is likely to be true for other variables with similar structures (e.g., vaccination coverage, which also depends on policy decisions, or prevalence of female genital mutilation, which is related to ethno-cultural factors [23]). Lastly, the strengths, weaknesses and recommendations regarding the methods used in this study, along with the criteria for choosing a method and the code provided with this paper, are relevant for modelling other DHS indicators and demographic and health variables in general. We encourage future research to apply the methods of this study to other countries and to additional malaria-related indicators, such as access to health care and the prevalence of migrants, to further validate and extend these findings.
Conclusions
In this paper, we compared three categories of methods for modelling DHS indicators: (1) spatial interpolation methods, (2) ensemble methods, and (3) Bayesian geostatistical models. We focused on DHS indicators that are potential drivers of malaria in Senegal, i.e., socioeconomic and malaria prevention indicators. Our main results show that there was no consensus on the best method or category of methods for modelling all indicators. Overall, socioeconomic indicators were better predicted by covariate-based models, while malaria prevention access and ethnicity variables tended to be better predicted by methods relying (only) on spatial autocorrelation. Beyond the predictive performance, there are other criteria to consider when mapping DHS indicators, such as the ability to constrain the range of predicted values or the intuitiveness of the method, and the criteria to consider when choosing a method depend on the end application of the predictive maps. We encourage future research to replicate the methods of this study using other DHS datasets in Senegal and in other countries where DHS are available.
Supporting information
S1 Text. Description of DHS indicators and geospatial covariates.
https://doi.org/10.1371/journal.pone.0322819.s001
(DOCX)
S2 Text. Details on kriging and Bayesian geostatistical models.
https://doi.org/10.1371/journal.pone.0322819.s002
(DOCX)
S1 Table. RMSE and MAE values (training and cross-validation) for all modelling approaches and all DHS indicators.
https://doi.org/10.1371/journal.pone.0322819.s003
(DOCX)
S2 Table. Covariates selected by RF, UK and BM for all DHS indicators.
https://doi.org/10.1371/journal.pone.0322819.s004
(DOCX)
S3 Table. Model parameter values for all methods.
https://doi.org/10.1371/journal.pone.0322819.s005
(DOCX)
S1 Fig. Predicted stunting in children at 1x1 km for Senegal with each modelling approach.
The maps show the spatial distribution of the proportion (ranging from 0 to 1) of children under 5 years old that are moderately or severely stunted in Senegal. National boundaries were downloaded from GADM.
https://doi.org/10.1371/journal.pone.0322819.s006
(TIF)
S2 Fig. Uncertainty maps of the predicted stunting in children.
Uncertainty is measured as a prediction variance (var) for kriging methods (a, b) and as a standard deviation (SD) for Bayesian models (c). Higher values of SD or variance indicate areas with greater uncertainty in the predicted indicator, reflecting lower confidence in the accuracy of the predictions in these regions. National boundaries were downloaded from GADM.
https://doi.org/10.1371/journal.pone.0322819.s007
(TIF)
S3 Fig. Predicted anemia prevalence in children at 1x1 km for Senegal with each modelling approach.
The maps show the spatial distribution of the proportion (ranging from 0 to 1) of children with mild, moderate or severe anemia in Senegal. Gridded surfaces are produced at a resolution of 1x1 km for all methods examined in the study. The ‘Out of range’ label indicates predicted values that are outside the possible range of values of the indicator (below 0 or above 1). Out-of-range predictions were made by universal kriging. National boundaries were downloaded from GADM.
https://doi.org/10.1371/journal.pone.0322819.s008
(TIF)
S4 Fig. Uncertainty maps of the predicted anemia prevalence in children.
Uncertainty is measured as a prediction variance (var) for kriging methods (a, b) and as a standard deviation (SD) for Bayesian models (c). Higher values of SD or variance indicate areas with greater uncertainty in the predicted indicator, reflecting lower confidence in the accuracy of the predictions in these regions. National boundaries were downloaded from GADM.
https://doi.org/10.1371/journal.pone.0322819.s009
(TIF)
S5 Fig. Predicted wealth index at 1x1 km for Senegal with each modelling approach.
The maps show the spatial distribution of the household wealth index in Senegal. Gridded surfaces are produced at a resolution of 1x1 km for all methods examined in the study. National boundaries were downloaded from GADM.
https://doi.org/10.1371/journal.pone.0322819.s010
(TIF)
S6 Fig. Uncertainty maps of the predicted wealth index.
Uncertainty is measured as a prediction variance (var) for kriging methods (a, b) and as a standard deviation (SD) for Bayesian models (c). Higher values of SD or variance indicate areas with greater uncertainty in the predicted indicator, reflecting lower confidence in the accuracy of the predictions in these regions. National boundaries were downloaded from GADM.
https://doi.org/10.1371/journal.pone.0322819.s011
(TIF)
S7 Fig. Predicted literacy rate in women at 1x1 km for Senegal with each modelling approach.
The maps show the spatial distribution of the proportion (ranging from 0 to 1) of women who are literate in Senegal. Gridded surfaces are produced at a resolution of 1x1 km for all methods examined in the study. The ‘Out of range’ label indicates predicted values that are outside the possible range of values of the indicator (below 0 or above 1). Out-of-range predictions were made by universal kriging. National boundaries were downloaded from GADM.
https://doi.org/10.1371/journal.pone.0322819.s012
(TIF)
S8 Fig. Uncertainty maps of the predicted literacy rate in women.
Uncertainty is measured as a prediction variance (var) for kriging methods (a, b) and as a standard deviation (SD) for Bayesian models (c). Higher values of SD or variance indicate areas with greater uncertainty in the predicted indicator, reflecting lower confidence in the accuracy of the predictions in these regions. National boundaries were downloaded from GADM.
https://doi.org/10.1371/journal.pone.0322819.s013
(TIF)
S9 Fig. Predicted ITN ownership at 1x1 km for Senegal with each modelling approach.
The maps show the spatial distribution of the proportion (ranging from 0 to 1) of households with at least one insecticide-treated net (ITN) in Senegal. Gridded surfaces are produced at a resolution of 1x1 km for all methods examined in the study. The ‘Out of range’ label indicates predicted values that are outside the possible range of values of the indicator (below 0 or above 1). Out-of-range predictions were made by thin plate spline and universal kriging. National boundaries were downloaded from GADM.
https://doi.org/10.1371/journal.pone.0322819.s014
(TIF)
S10 Fig. Uncertainty maps of the predicted ITN ownership.
Uncertainty is measured as a prediction variance (var) for kriging methods (a, b) and as a standard deviation (SD) for Bayesian models (c). Higher values of SD or variance indicate areas with greater uncertainty in the predicted indicator, reflecting lower confidence in the accuracy of the predictions in these regions. ITN stands for insecticide-treated net. National boundaries were downloaded from GADM.
https://doi.org/10.1371/journal.pone.0322819.s015
(TIF)
S11 Fig. Predicted ITN ownership for 2 at 1x1 km for Senegal with each modelling approach.
The maps show the spatial distribution of the proportion (ranging from 0 to 1) of households with at least one insecticide-treated net (ITN) for every two people who slept in the house the night before the survey. Gridded surfaces are produced at a resolution of 1x1 km for all methods examined in the study. The ‘Out of range’ label indicates predicted values that are outside the possible range of values of the indicator (below 0 or above 1). Out-of-range predictions were made by universal kriging. National boundaries were downloaded from GADM.
https://doi.org/10.1371/journal.pone.0322819.s016
(TIF)
S12 Fig. Uncertainty maps of the predicted ITN ownership for 2.
Uncertainty is measured as a prediction variance (var) for kriging methods (a, b) and as a standard deviation (SD) for Bayesian models (c). Higher values of SD or variance indicate areas with greater uncertainty in the predicted indicator, reflecting lower confidence in the accuracy of the predictions in these regions. ITN stands for insecticide-treated net. National boundaries were downloaded from GADM.
https://doi.org/10.1371/journal.pone.0322819.s017
(TIF)
S13 Fig. Predicted ITN access at 1x1 km for Senegal with each modelling approach.
The maps show the spatial distribution of the proportion (ranging from 0 to 1) of population with access to an insecticide-treated net (ITN) in their household in Senegal. Gridded surfaces are produced at a resolution of 1x1 km for all methods examined in the study. National boundaries were downloaded from GADM.
https://doi.org/10.1371/journal.pone.0322819.s018
(TIF)
S14 Fig. Uncertainty maps of the predicted ITN access.
Uncertainty is measured as a prediction variance (var) for kriging methods (a, b) and as a standard deviation (SD) for Bayesian models (c). Higher values of SD or variance indicate areas with greater uncertainty in the predicted indicator, reflecting lower confidence in the accuracy of the predictions in these regions. ITN stands for insecticide-treated net. National boundaries were downloaded from GADM.
https://doi.org/10.1371/journal.pone.0322819.s019
(TIF)
S15 Fig. Predicted IRS coverage at 1x1 km for Senegal with each modelling approach.
The maps show the spatial distribution of the proportion (ranging from 0 to 1) of households that were sprayed with a residual insecticide in the last year prior to the survey. Gridded surfaces are produced at a resolution of 1x1 km for all methods examined in the study. The ‘Out of range’ label indicates predicted values that are outside the possible range of values of the indicator (below 0 or above 1). Out-of-range predictions were made by thin plate spline, ordinary kriging and universal kriging. IRS stands for indoor residual spraying. National boundaries were downloaded from GADM.
https://doi.org/10.1371/journal.pone.0322819.s020
(TIF)
S16 Fig. Uncertainty maps of the predicted IRS coverage.
Uncertainty is measured as a prediction variance (var) for kriging methods (a, b) and as a standard deviation (SD) for Bayesian models (c). Higher values of SD or variance indicate areas with greater uncertainty in the predicted indicator, reflecting lower confidence in the accuracy of the predictions in these regions. IRS stands for indoor residual spraying. National boundaries were downloaded from GADM.
https://doi.org/10.1371/journal.pone.0322819.s021
(TIF)
S17 Fig. Predicted proportion of Fula people at 1x1 km for Senegal with each modelling approach.
The maps show the spatial distribution of the proportion (ranging from 0 to 1) of people belonging to the Fula ethnic group in Senegal. Gridded surfaces are produced at a resolution of 1x1 km for all methods examined in the study. The ‘Out of range’ label indicates predicted values that are outside the possible range of values of the indicator (below 0 or above 1). Out-of-range predictions were made by thin plate spline and universal kriging. National boundaries were downloaded from GADM.
https://doi.org/10.1371/journal.pone.0322819.s022
(TIF)
S18 Fig. Uncertainty maps of the predicted proportion of Fula people.
Uncertainty is measured as a prediction variance (var) for kriging methods (a, b) and as a standard deviation (SD) for Bayesian models (c). Higher values of SD or variance indicate areas with greater uncertainty in the predicted indicator, reflecting lower confidence in the accuracy of the predictions in these regions. National boundaries were downloaded from GADM.
https://doi.org/10.1371/journal.pone.0322819.s023
(TIF)
References
- 1. United Nations. Transforming our world: the 2030 Agenda for Sustainable Development. New York: United Nations General Assembly; 2015. Available from: https://wedocs.unep.org/20.500.11822/9814
- 2. Kraak M, Ricker B, Engelhardt Y. Challenges of Mapping Sustainable Development Goals Indicators Data. IJGI. 2018;7(12):482.
- 3. Bosco C, Alegana V, Bird T, Pezzulo C, Bengtsson L, Sorichetta A, et al. Exploring the high-resolution mapping of gender-disaggregated development indicators. J R Soc Interface. 2017;14(129):20160825. pmid:28381641
- 4. Alegana VA, Atkinson PM, Pezzulo C, Sorichetta A, Weiss D, Bird T, et al. Fine resolution mapping of population age-structures for health and development applications. J R Soc Interface. 2015;12(105):20150073. pmid:25788540
- 5. Alegana VA, Wright J, Bosco C, Okiro EA, Atkinson PM, Snow RW, et al. Malaria prevalence metrics in low- and middle-income countries: an assessment of precision in nationally-representative surveys. Malar J. 2017;16(1):475. pmid:29162099
- 6. Giardina F, Gosoniu L, Konate L, Diouf MB, Perry R, Gaye O, et al. Estimating the burden of malaria in Senegal: Bayesian zero-inflated binomial geostatistical modeling of the MIS 2008 data. PLoS One. 2012;7(3):e32625. pmid:22403684
- 7. Riedel N, Vounatsou P, Miller J, Gosoniu L, Chizema-Kawesha E, Mukonka V. Geographical patterns and predictors of malaria risk in Zambia: Bayesian geostatistical modelling of the 2006 Zambia national malaria indicator survey (ZMIS). Malar J. 2010;9(1):37.
- 8. Georganos S, Gadiaga AN, Linard C, Grippa T, Vanhuysse S, Mboga N, et al. Modelling the Wealth Index of Demographic and Health Surveys within Cities Using Very High-Resolution Remotely Sensed Information. Remote Sensing. 2019;11(21):2543.
- 9.
Gething P, Tatem A, Bird T, Burgert-Brucker CR. Creating Spatial Interpolation Surfaces with DHS Data. Rockville, Maryland, USA: ICF International; 2015. (DHS Spatial Analysis Reports No. 11).
- 10.
Croft TN, Marshall AMJ, Allen CK. Guide to DHS Statistics 7. Rockville, Maryland, USA: ICF International; 2018.
- 11. Ejigu BA. Geostatistical analysis and mapping of malaria risk in children of Mozambique. PLoS One. 2020;15(11):e0241680. pmid:33166322
- 12. Giardina F, Kasasa S, Sié A, Utzinger J, Tanner M, Vounatsou P. Effects of vector-control interventions on changes in risk of malaria parasitaemia in sub-saharan africa: a spatial and temporal analysis. Lancet Glob Health. 2014;2(10):e601-15.
- 13. Kabaria CW, Molteni F, Mandike R, Chacky F, Noor AM, Snow RW, et al. Mapping intra-urban malaria risk using high resolution satellite imagery: a case study of Dar es Salaam. Int J Health Geogr. 2016;15(1):26. pmid:27473186
- 14. Morlighem C, Chaiban C, Georganos S, Brousse O, van Lipzig NPM, Wolff E, et al. Spatial Optimization Methods for Malaria Risk Mapping in Sub-Saharan African Cities Using Demographic and Health Surveys. Geohealth. 2023;7(10):e2023GH000787. pmid:37811342
- 15. Utazi CE, Thorley J, Alegana VA, Ferrari MJ, Takahashi S, Metcalf CJE. High resolution age-structured mapping of childhood vaccination coverage in low and middle income countries. Vaccine. 2018;36(12):1583–91.
- 16. Utazi C, Yankey O, Chaudhuri S, Olowe I, Danovaro-Holliday C, Lazar A, et al. Geostatistical and Machine Learning Approaches for High-Resolution Mapping of Vaccination Coverage. Preprints [preprint]; 2024 Dec. Available from: https://www.preprints.org/manuscript/202412.2190/v1
- 17. Larmarange J, Vallo R, Yaro S, Msellati P, Méda N. Methods for mapping regional trends of HIV prevalence from demographic and health surveys (DHS). Cybergeo Eur J Geogr. 2011;2011(558).
- 18. Tatem D, Gething D, Pezzulo D, Weiss D, Bhatt D. Development of high-resolution gridded poverty surfaces. International Journal of Applied Earth Observation and Geoinformation. 2014;30:150–61.
- 19. Zhao X, Yu B, Liu Y, Chen Z, Li Q, Wang C, et al. Estimation of Poverty Using Random Forest Regression with Multi-Source Data: A Case Study in Bangladesh. Remote Sensing. 2019;11(4):375.
- 20. Müller-Crepon C, Hunziker P. New spatial data on ethnicity: introducing SIDE. J Peace Res. 2018;55(5):687–98.
- 21. Nnanatu CC, Atilola G, Komba P, Mavatikua L, Moore Z, Matanda D, et al. Evaluating changes in the prevalence of female genital mutilation/cutting among 0-14 years old girls in Nigeria using data from multiple surveys: A novel Bayesian hierarchical spatio-temporal model. PLoS One. 2021;16(2):e0246661. pmid:33577614
- 22. Kandala N-B, Nnanatu CC, Atilola G, Komba P, Mavatikua L, Moore Z, et al. Analysing Normative Influences on the Prevalence of Female Genital Mutilation/Cutting among 0-14 Years Old Girls in Senegal: A Spatial Bayesian Hierarchical Regression Approach. Int J Environ Res Public Health. 2021;18(7):3822. pmid:33917443
- 23. Morlighem C, Visée C, Nnanatu C. Comparison of fgm prevalence among nigerian women aged 15-49 years using two household surveys conducted before and after the covid-19 pandemic. BMC Public Health. 2024;24(1):1866.
- 24. Cleary E, Hetzel M, Siba P, Lau C, Clements A. Spatial prediction of malaria prevalence in Papua New Guinea: a comparison of Bayesian decision network and multivariate regression modelling approaches for improved accuracy in prevalence prediction. Malar J. 2021;20(1):269.
- 25. Wong KLM, Brady OJ, Campbell OMR, Benova L. Comparison of spatial interpolation methods to create high-resolution poverty maps for low- and middle-income countries. J R Soc Interface. 2018;15(147):20180252. pmid:30333244
- 26. Wong S, Flegg JA, Golding N, Kandanaarachchi S. Comparison of new computational methods for spatial modelling of malaria. Malar J. 2023;22(1):356. pmid:37990242
- 27.
DHS Spatial Interpolation Working Group. Spatial Interpolation with Demographic and Health Survey Data: Key Considerations. Rockville, Maryland, USA: ICF International; 2014. (DHS Spatial Analysis Reports No. 9).
- 28.
World Health Organization. World malaria report 2023. Geneva: World Health Organization; 2023.
- 29. Sarfo JO, Amoadu M, Kordorwu PY, Adams AK, Gyan TB, Osman AG, et al. Malaria amongst children under five in sub-Saharan Africa: a scoping review of prevalence, risk factors and preventive interventions. Eur J Med Res. 2023;28(1):80.
- 30. Semakula H, Song G, Achuu S, Zhang S. A Bayesian belief network modelling of household factors influencing the risk of malaria: A study of parasitaemia in children under five years of age in sub-Saharan Africa. Environ Model Softw. 2016;75:59–67.
- 31. Njau JD, Stephenson R, Menon MP, Kachur SP, McFarland DA. Investigating the important correlates of maternal education and childhood malaria infections. Am J Trop Med Hyg. 2014;91(3):509–19.
- 32. Bates I, Fenton C, Gruber J, Lalloo D, Medina Lara A, Squire SB, et al. Vulnerability to malaria, tuberculosis, and HIV/AIDS infection and disease. Part 1: determinants operating at individual and household level. Lancet Infect Dis. 2004;4(5):267–77. pmid:15120343
- 33. Oldenburg CE, Guerin PJ, Berthé F, Grais RF, Isanaka S. Malaria and nutritional status among children with severe acute malnutrition in Niger: a prospective cohort study. Clin Infect Dis. 2018;67(7):1027–34.
- 34. Yang D, He Y, Wu B, Deng Y, Li M, Yang Q, et al. Drinking water and sanitation conditions are associated with the risk of malaria among children under five years old in sub-Saharan Africa: A logistic regression model analysis of national survey data. J Adv Res. 2019;21:1–13. pmid:31641533
- 35. Morlighem C. Spatial interpolation of health and demographic variables: predicting malaria indicators with and without covariates. Database: figshare. 2025.
- 36.
PNLP. Bulletin Epidemiologique annuel 2022 du Paludisme au Sénégal. 2022.
- 37. Corsi DJ, Neuman M, Finlay JE, Subramanian SV. Demographic and health surveys: a profile. Int J Epidemiol. 2012;41(6):1602–13. pmid:23148108
- 38.
Burgert CR, Colston J, Roy T, Zachary B. Geographic displacement procedure and georeferenced data release policy for the Demographic and Health Surveys. Calverton, Maryland, USA: ICF International; 2013. (DHS Spatial Analysis Reports No. 7).
- 39.
ANSD [Sénégal], ICF. Sénégal: Enquête Démographique et de Santé Continue (EDS-Continue 2017). Rockville, Maryland, USA: ANSD and ICF; 2018.
- 40. Seck MC, Thwing J, Fall FB, Gomis JF, Deme A, Ndiaye YD, et al. Malaria prevalence, prevention and treatment seeking practices among nomadic pastoralists in northern Senegal. Malar J. 2017;16(1):413. pmid:29029619
- 41. White NJ. Anaemia and malaria. Malar J. 2018;17(1):371. pmid:30340592
- 42. Degarege A, Fennie K, Degarege D, Chennupati S, Madhivanan P. Improving socioeconomic status may reduce the burden of malaria in sub Saharan Africa: A systematic review and meta-analysis. PLoS One. 2019;14(1):e0211205. pmid:30677102
- 43.
Perez-Heydrich C, Warren JL, Burgert CR, Emch ME. Guidelines on the Use of DHS GPS Data. Calverton, Maryland, USA: ICF International; 2013. (DHS Spatial Analysis Reports No. 8.).
- 44. Brousse O, Georganos S, Demuzere M, Dujardin S, Lennert M, Linard C, et al. Can we use local climate zones for predicting malaria prevalence across sub-Saharan African cities?. Environ Res Lett. 2020;15(12):124051. pmid:35211191
- 45. Karger DN, Conrad O, Böhner J, Kawohl T, Kreft H, Soria-Auza RW, et al. Climatologies at high resolution for the earth’s land surface areas. Sci Data. 2017;4:170122. pmid:28872642
- 46. Wan Z, Hook S, Hulley G. MODIS/Terra Land Surface Temperature/Emissivity Daily L3 Global 1km SIN Grid V061 [dataset]. NASA EOSDIS Land Processes DAAC; 2021.
- 47. Wan Z, Hook S, Hulley G. MODIS/Aqua Land Surface Temperature/Emissivity Daily L3 Global 1km SIN Grid V061 [dataset]. NASA EOSDIS Land Processes DAAC; 2021.
- 48. Simonetti D, Pimple U, Langner A, Marelli A. Pan-tropical Sentinel-2 cloud-free annual composite datasets. Data Brief. 2021;39:107488. pmid:34729386
- 49. Brown C, Brumby S, Guzder-Williams B, Birch T, Hyde S, Mazzariello J, et al. Dynamic world, near real-time global 10 m land use land cover mapping. Sci Data. 2022;9(1):251.
- 50. Marconcini M, Metz-Marconcini A, Üreyen S, Palacios-Lopez D, Hanke W, Bachofer F. Outlining where humans live, the world settlement footprint 2015. Sci Data. 2020;7(1):242.
- 51.
Pesaresi M, Politis P. GHS-BUILT-S R2023A - GHS built-up surface grid, derived from Sentinel2 composite and Landsat, multitemporal (1975-2030) [dataset]. European Commission, Joint Research Centre (JRC); 2023 May. https://doi.org/10.2905/9F06F36F-4B11-47EC-ABB0-4F8B7B1D72EA
- 52.
Pesaresi M, Politis P. GHS-BUILT-H R2023A - GHS building height, derived from AW3D30, SRTM30, and Sentinel2 composite (2018) [dataset]. European Commission, Joint Research Centre (JRC); 2023 May. https://doi.org/10.2905/85005901-3A49-48DD-9D19-6261354F56FE
- 53. Elvidge CD, Zhizhin M, Ghosh T, Hsu F-C, Taneja J. Annual Time Series of Global VIIRS Nighttime Lights Derived from Monthly Averages: 2012 to 2019. Remote Sensing. 2021;13(5):922.
- 54. Weiss DJ, Nelson A, Gibson HS, Temperley W, Peedell S, Lieber A, et al. A global map of travel time to cities to assess inequalities in accessibility in 2015. Nature. 2018;553(7688):333–6. pmid:29320477
- 55. Weiss DJ, Nelson A, Vargas-Ruiz CA, Gligorić K, Bavadekar S, Gabrilovich E, et al. Global maps of travel time to healthcare facilities. Nat Med. 2020;26(12):1835–8. pmid:32989313
- 56. Lloyd CT, Chamberlain H, Kerr D, Yetman G, Pistolesi L, Stevens FR, et al. Global spatio-temporally harmonised datasets for producing high-resolution gridded population distribution datasets. Big Earth Data. 2019;3(2):108–39. pmid:31565697
- 57. Tatem AJ, Campbell J, Guerra-Arias M, de Bernis L, Moran A, Matthews Z. Mapping for maternal and newborn health: the distributions of women of childbearing age, pregnancies and births. Int J Health Geogr. 2014;13(1):2.
- 58. Farr T, Rosen P, Caro E, Crippen R, Duren R, Hensley S. The shuttle radar topography mission. Rev Geophys. 2007;45(2):RG2004.
- 59. Robinson T, Wint G, Conchedda G, Boeckel T, Ercoli V, Palamara E, et al. Mapping the global distribution of livestock. PLoS ONE. 2014;9(5):e96084.
- 60.
Ledoux H, Ohori K, Peters R. Computational modelling of terrains. Zenodo. 2020.
- 61. Keller W, Borkowski A. Thin plate spline interpolation. J Geod. 2019;93:1251–69.
- 62. Lukas M, de Hoog F, Anderssen R. Efficient algorithms for robust generalized cross-validation spline smoothing. J Comput Appl Math. 2010;235(1):102–7.
- 63. Alemayehu MA, Agimas MC, Shewaye DA, Derseh NM, Aragaw FM. Spatial distribution and determinants of limited access to improved drinking water service among households in ethiopia based on the 2019 ethiopian mini demographic and health survey: spatial and multilevel analyses. Front Water. 2023;5.
- 64. Wasswa R, Kananura RM, Muhanguzi H, Waiswa P. Spatial variation and attributable risk factors of anaemia among young children in Uganda: Evidence from a nationally representative survey. PLOS Glob Public Health. 2023;3(5):e0001899. pmid:37195979
- 65. Bulstra CA, Hontelez JAC, Giardina F, Steen R, Nagelkerke NJD, Bärnighausen T, et al. Mapping and characterising areas with high levels of HIV transmission in sub-Saharan Africa: A geospatial analysis of national survey data. PLoS Med. 2020;17(3):e1003042.
- 66.
Lichtenstern A. Kriging methods in spatial statistics [BSc Thesis]. Technische Universität München; 2013.
- 67. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
- 68. Gregorutti B, Michel B, Saint-Pierre P. Correlation and variable importance in random forests. Stat Comput. 2017;27(3):659–78.
- 69. Morlighem C, Chaiban C, Georganos S, Brousse O, Van de Walle J, van Lipzig NPM, et al. The Multi-Satellite Environmental and Socioeconomic Predictors of Vector-Borne Diseases in African Cities: Malaria as an Example. Remote Sensing. 2022;14(21):5381.
- 70.
Krainski ET, Gómez-Rubio V, Bakka H, Lenzi A, Castro-Camilo D, Simpson D, et al. Advanced Spatial Modeling with Stochastic Partial Differential Equations Using R and INLA. Boca Raton, FL: Chapman & Hall/CRC Press; 2019.
- 71. Rue H, Martino S, Chopin N. Approximate Bayesian Inference for Latent Gaussian models by using Integrated Nested Laplace Approximations. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2009;71(2):319–92.
- 72. Lindgren F, Rue H, Lindström J. An explicit link between Gaussian fields and Gaussian Markov random fields: The stochastic partial differential equation approach. J R Stat Soc Ser B Stat Methodol. 2011;73(4):423–98.
- 73. Deutsch CV. Correcting for negative weights in ordinary kriging. Comput Geosci. 1996;22(7):765–73.
- 74. Hong S, Lynn HS. Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction. BMC Med Res Methodol. 2020;20(1):199. pmid:32711455
- 75. Georganos S, Brousse O, Dujardin S, Linard C, Casey D, Milliones M, et al. Modelling and mapping the intra-urban spatial distribution of Plasmodium falciparum parasite rate using very-high-resolution satellite derived indicators. Int J Health Geogr. 2020;19(1):38. pmid:32958055
- 76. Bihin J, De Longueville F, Linard C. Spatial accessibility to health facilities in Sub-Saharan Africa: comparing existing models with survey-based perceived accessibility. Int J Health Geogr. 2022;21(1):18. pmid:36369009
- 77.
Altay U, Paige J, Riebler A, Fuglstad GA. Fast geostatistical inference under positional uncertainty: Analysing DHS household survey data. 2022.