
Conceived and designed the experiments: PWG SIH. Performed the experiments: PWG. Analyzed the data: PWG APP. Contributed reagents/materials/analysis tools: PWG APP. Wrote the paper: PWG APP SIH.

The authors have declared that no competing interests exist.

Risk maps estimating the spatial distribution of infectious diseases are required to guide public health policy from local to global scales. The advent of model-based geostatistics (MBG) has allowed these maps to be generated in a formal statistical framework, providing robust metrics of map uncertainty that enhance their utility for decision-makers. In many settings, decision-makers require spatially aggregated measures over large regions such as the mean prevalence within a country or administrative region, or national populations living under different levels of risk. Existing MBG mapping approaches provide suitable metrics of local uncertainty—the fidelity of predictions at each mapped pixel—but have not been adapted for measuring uncertainty over large areas, due largely to a series of fundamental computational constraints. Here the authors present a new efficient approximating algorithm that can generate for the first time the necessary joint simulation of prevalence values across the very large prediction spaces needed for global scale mapping. This new approach is implemented in conjunction with an established model for

Reliable disease maps can support rational decision making. These maps are often made by interpolation: taking disease data from field studies and predicting values for the gaps between the data to make a complete map. Such maps always contain uncertainty, however, and measuring this uncertainty is vital so that the reliability of risk maps can be determined. A modern approach called model-based geostatistics (MBG) has led to increasingly sophisticated ways of mapping disease and measuring spatial uncertainty. Many health management decisions are made for administrative areas (e.g., districts, provinces, countries) and disease maps can support these decisions by averaging their values over the regions of interest. Carrying out this aggregation in conjunction with MBG techniques has not previously been possible for very large maps, however, due largely to the computational constraints involved. This study has addressed this problem by developing a new algorithm that allows aggregation of a global MBG disease map over very large areas. It is used to estimate

Risk maps estimating the spatial distribution of infectious diseases in relation to underlying populations are required to support public health decision-making at local to global scales

MBG models take point observations of disease prevalence from dispersed survey locations and generate continuous maps by interpolating prevalence at unsampled locations across raster grid surfaces. The most striking advantage of MBG in disease mapping is its handling of uncertainty. Interpolating sparse, often imperfectly sampled, survey data to predict disease prevalence across wide regions results in inherently uncertain risk maps, with the level of uncertainty varying spatially as a function of the density, quality, and sample size of available survey data, and moderated by the underlying spatial variability of the disease in question. MBG approaches allow these sources of uncertainty to be propagated to the final mapped output, predicting a probability distribution (known formally as a posterior predictive distribution) for the prevalence at each location of interest. Where predictions are made with small uncertainty, these distributions will be tightly concentrated around a central value; where uncertainty is large they will be more dispersed. These techniques have been used to generate robust and informative risk maps for malaria

Implementation of MBG models over even relatively small areas is extremely computationally expensive. Not only are the matrix algebra operations required to generate predictions at each individual pixel costly compared to simpler interpolation methods

There is often a need to evaluate disease prevalence aggregated across spatial regions, temporal periods, or combinations of both

The solution to the problem outlined above is to replace per-pixel simulation of prevalence realisations with the simultaneous or ‘joint’ simulation of all pixels to be aggregated, recreating appropriate spatial and temporal correlation between them
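The principle of joint simulation can be sketched in a few lines: drawing many pixels at once through a factor of their covariance matrix yields realisations that reproduce the required correlation between neighbouring locations. The Python sketch below (the study's own code was written in R) uses an illustrative exponential covariance function and parameters, not the study's fitted model:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-d transect of 50 pixel centres
x = np.arange(50, dtype=float)

# illustrative exponential covariance: C(h) = sigma2 * exp(-|h| / range)
sigma2, range_ = 1.0, 10.0
C = sigma2 * np.exp(-np.abs(x[:, None] - x[None, :]) / range_)

# a 'joint' draw multiplies standard normals by a factor of C, so every
# realisation carries the specified correlation between pixels
L = np.linalg.cholesky(C + 1e-10 * np.eye(len(x)))
realisations = L @ rng.standard_normal((len(x), 2000))  # 2000 joint draws

# neighbouring pixels are strongly correlated, as the model requires
emp_corr = float(np.corrcoef(realisations[0], realisations[1])[0, 1])
print(f"empirical lag-1 correlation: {emp_corr:.2f}")
```

Averaging any subset of pixels within one realisation, and repeating across realisations, yields a distribution for the aggregated mean that respects this correlation.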

In this paper we use a new approximate algorithm for joint simulation to quantify, for the first time, aggregated uncertainty over space and time in a global scale MBG disease model for

PAR estimates form a fundamental metric for malaria decision-makers at national and international levels

In the remainder of this introductory section we outline the computational challenges of large scale joint simulation and review existing approaches to overcoming them. In the

A general form for MBG models can be defined as follows:

In MBG, the aim is to estimate the joint posterior distribution of the model parameters

In a per-pixel implementation, the predictive distributions

Switching from a per-pixel implementation to a joint simulation over many prediction locations increases profoundly the computational challenge. The efficiency of a per-pixel approach arises from the effective reduction of the joint prediction problem to many small independent calculations; a direct joint simulation over a full prediction grid would instead require of the order of 6×10^{7} seconds (around 694 days). In practice these scaling factors, along with those of memory and storage requirements, mean direct joint simulation using the equations outlined above is generally limited to predictions at a maximum of around 10,000 points
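The scale of the problem can be seen with a back-of-envelope calculation: one direct joint draw from an n-point Gaussian process requires a factorisation (e.g. Cholesky) of the n×n covariance matrix, costing roughly n³/3 floating-point operations and 8n² bytes of dense storage. The flop rate assumed below is illustrative, not a measurement from the study:

```python
def cholesky_cost(n, flops_per_second=1e9):
    """Approximate time (s) and dense storage (bytes) to factorise an
    n x n covariance matrix by Cholesky decomposition."""
    flops = n ** 3 / 3          # leading-order Cholesky operation count
    memory_bytes = 8 * n ** 2   # double-precision dense storage
    return flops / flops_per_second, memory_bytes

t, m = cholesky_cost(10_000)            # around the cited practical limit
print(f"n = 10,000:    {t:8.0f} s, {m / 1e9:6.1f} GB")

t, m = cholesky_cost(1_000_000)         # a single continental monthly grid
print(f"n = 1,000,000: {t / 86400:8.0f} days, {m / 1e12:6.0f} TB")
```

The cubic growth in time and quadratic growth in memory explain why direct joint simulation is infeasible beyond a few tens of thousands of points.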

In response to the strict computational limits of direct joint simulation outlined above, a wide range of algorithmic and mathematical tools have been developed that increase substantially the maximum number of prediction locations

The most widely used family of joint simulation algorithms in geostatistics is known as sequential simulation

In a disease mapping context, the goal is to generate joint simulations conditioned by observed prevalence or incidence data. This precludes the direct use of a wider class of algorithms developed for unconditional joint simulation

A new global map of PfPR_{2–10} was generated across a regular spherical grid within the limits of stable transmission. The set of realisations of PfPR_{2–10} for each pixel provided an appropriate measure of local uncertainty, with which the precision of PfPR_{2–10} predictions could be assessed at all individual pixel locations worldwide.

The aim of the current study was to extend the predictions described above to estimate PfPR_{2–10} over spatially and temporally aggregated regions. This presented an unprecedented challenge in geostatistical disease modelling for a number of reasons. Firstly, the target prediction space was exceptionally large: a grid of resolution equivalent to 5×5 km at the equator spanning the extent of stable

Sequential simulation algorithms maintain feasible memory and computation requirements over large grids by limiting conditioning data to small local neighbourhoods, but the repeated identification of local data, evaluation of local covariance matrices, and subsequent linear algebra calculations are prohibitively inefficient for very large numbers of prediction locations. The extremely efficient algorithms developed for unconditional joint simulation over regular grids, such as circulant embedding, also reach memory limits for very large prediction tasks, and are not suited to sphere-based grids. In this study a new algorithm was developed that adopted and extended the principle of traditional sequential simulation - that joint simulation over very large areas can be broken down into many small simulations conditioned on nearby values - but incorporated some of the efficiencies exploited by unconditional algorithms operating on a grid whilst overcoming the complications of sphere-based grid systems.

Firstly, the decomposition of a conditional joint simulation into an unconditional joint simulation and a per-pixel conditioning stage was exploited (Eq. 5). The bulk of the computational challenge therefore lay in generating unconditioned realisations of the zero-mean Gaussian process over very large space-time grids: a grid of approximately 6.2×10^{8} pixel-months was the largest individual prediction task (for the Africa region, with 1718 columns, 1315 rows and 276 months). Each grid pixel was 0.04165 decimal degrees in height and width, corresponding to approximately 5×5 km at the equator. Defining a regular grid in terms of spherical coordinates meant that the width of pixels varied with latitude. A second stage was then required to condition the field on the observed data.
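The two-stage decomposition can be sketched in one dimension: an unconditional realisation is drawn jointly over the data and prediction locations, and is then corrected towards the observed values using kriging weights (the classical "conditioning by kriging" construction). The covariance function, locations, and data values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def cov(a, b, sigma2=1.0, range_=10.0):
    # illustrative exponential covariance between two sets of 1-d locations
    h = np.abs(a[:, None] - b[None, :])
    return sigma2 * np.exp(-h / range_)

x_data = np.array([5.0, 20.0, 40.0])  # observation locations (assumed)
y_data = np.array([0.8, -0.3, 1.1])   # observed zero-mean values (assumed)
x_pred = np.linspace(0.0, 49.0, 50)   # prediction locations

# stage 1: one unconditional joint draw over data + prediction locations
x_all = np.concatenate([x_data, x_pred])
L = np.linalg.cholesky(cov(x_all, x_all) + 1e-8 * np.eye(len(x_all)))
z = L @ rng.standard_normal(len(x_all))
z_data, z_pred = z[:len(x_data)], z[len(x_data):]

# stage 2: per-location conditioning via kriging weights
w = np.linalg.solve(cov(x_data, x_data), cov(x_data, x_pred))
z_cond = z_pred + w.T @ (y_data - z_data)

# the conditioned realisation honours the data at observation locations
# (up to the small numerical jitter added above)
idx = [int(np.argmin(np.abs(x_pred - xd))) for xd in x_data]
mismatch = float(np.max(np.abs(z_cond[idx] - y_data)))
print(f"max mismatch at data locations: {mismatch:.1e}")
```

Because stage 2 is a cheap per-pixel correction, the expensive part of the problem is exactly the unconditional simulation, as the text describes.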

In this first stage the principle of sequential simulation was adapted. Rather than randomly visiting individual prediction locations in turn, each column of each monthly surface of the 3-d space-time prediction grid was jointly simulated in sequence, scanning left-to-right across each monthly surface and from the earliest month (January 1985) to the latest (December 2007). Each simulated column then became available for conditioning subsequent columns. As with conventional sequential simulation, the size of the conditioning set was prevented from becoming prohibitively large by limiting conditioning data to a set of local prediction locations. A key inefficiency in the use of local conditioning neighbourhoods in conventional sequential simulation algorithms is that, because the spatial (or spatiotemporal) configuration of data with respect to the target prediction location in each neighbourhood is likely to be unique, the data-to-data and data-to-prediction covariance evaluations must be repeated for every sequential prediction. In this algorithm, however, the locations of conditioning data relative to the prediction column were defined in a fixed configuration, termed the conditioning ‘footprint’, so that these covariance evaluations could be computed once and reused.
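The gain from a fixed relative footprint can be sketched in one dimension: because every step of the scan sees its conditioning data in the same relative configuration, the kriging weights and conditional variance are computed once and reused at every step. The exponential covariance and its parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def cov(h, sigma2=1.0, range_=10.0):
    # illustrative exponential covariance as a function of lag
    return sigma2 * np.exp(-np.abs(h) / range_)

n, p = 2000, 20  # scan length; footprint of p nearest previous locations

# covariances for the fixed *relative* footprint geometry, computed once
offsets = np.arange(1.0, p + 1.0)
K_dd = cov(offsets[:, None] - offsets[None, :])
k_dp = cov(offsets)
w = np.linalg.solve(K_dd, k_dp)   # conditioning (kriging) weights, fixed
var_c = cov(0.0) - k_dp @ w       # conditional variance, also fixed

# scan: the first p values are drawn jointly, then every subsequent value
# is conditioned on its footprint using the precomputed weights
z = np.empty(n)
L0 = np.linalg.cholesky(K_dd + 1e-12 * np.eye(p))
z[:p] = L0 @ rng.standard_normal(p)
for i in range(p, n):
    footprint = z[i - p:i][::-1]  # nearest previous value first
    z[i] = w @ footprint + np.sqrt(var_c) * rng.standard_normal()

# the scanned field reproduces the model's short-range correlation
emp = float(np.corrcoef(z[:-1], z[1:])[0, 1])
print(f"lag-1 correlation: {emp:.3f} (model: {cov(1.0):.3f})")
```

In the study's 3-d setting the same idea applies column-by-column rather than value-by-value, with one precomputed factorisation shared by every interior column of the scan.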

This procedure resulted in a number of computational advantages over conventional sequential simulation. Firstly, since the locations of conditioning data,

Because the algorithm scanned left-to-right, only columns to the left of the prediction column could be included from the same month. Preceding months could include columns both to the right, directly ‘below’ and to the left. The exact configuration of the footprint in terms of the number and spacing of preceding columns and the number and spacing of preceding months could be varied. Similarly, the density of included columns could be varied such that pixels from every 2nd, 5th, or 10th row of each column, for example, could be included rather than from every row. Analogous to the specification of conditioning neighbourhoods in conventional sequential simulation, the success of the procedure presented here was dependent on a suitable configuration of the conditioning footprint. This configuration represented a trade-off between the computational cost of the algorithm, which scaled sharply with larger and more dense footprints, and the extent to which the resulting unconditioned field approximated the hypothetical result obtained using a direct simulation. The appropriate tool for identifying a suitable footprint configuration was the extent to which the resulting simulated field reproduced the required covariance properties specified by the model’s space-time covariance function.
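The diminishing returns underlying this trade-off can be illustrated numerically: as progressively more previous locations are added to a 1-d footprint, the conditional variance of the prediction shrinks by progressively smaller amounts, so most of the benefit comes from a modest footprint. The Matérn 3/2 covariance and parameters below are illustrative choices, not the study's fitted model:

```python
import numpy as np

def matern32(h, sigma2=1.0, rho=10.0):
    # illustrative Matern 3/2 covariance as a function of lag
    a = np.sqrt(3.0) * np.abs(h) / rho
    return sigma2 * (1.0 + a) * np.exp(-a)

def conditional_variance(p):
    """Variance of one value given its p nearest previous values."""
    offs = np.arange(1.0, p + 1.0)
    K = matern32(offs[:, None] - offs[None, :]) + 1e-10 * np.eye(p)
    k = matern32(offs)
    return float(matern32(0.0) - k @ np.linalg.solve(K, k))

# each doubling of the footprint buys a smaller reduction in variance
for p in (1, 2, 4, 8, 16):
    print(f"footprint size {p:2d}: conditional variance "
          f"{conditional_variance(p):.6f}")
```

A practical configuration stops enlarging the footprint once the residual change in the conditional distribution is negligible relative to the computational cost.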

A final algorithmic complication arose from the constraints placed on the footprint configuration by the spatial and temporal boundaries of the grid. Clearly a footprint using

The sequence of schematic diagrams shows the algorithm at six different stages. In this schematic, the prediction space is 25 columns by 25 rows by two months. In each diagram the target column to be predicted is marked in red, pixels already predicted to the left or below the target column are shaded, those yet to be predicted are left white. The ‘footprint’ of conditioning data used in each prediction is shaded blue. In this example the full footprint extent is specified to include, in the target month, seven columns to the left of the target column and, in the preceding month, seven columns to the left, seven to the right and the column directly below the target column. This full extent is thinned to include only every second column and row. Diagram 1 (lower left) shows the algorithm at an early stage: having already simulated values in the first three columns of the first month, the target column being simulated is the fourth from left. The full footprint is truncated and consists of only two columns to the left of the target column. As the algorithm scans across this first month, more columns become available to the left and the footprint grows (diagrams 2 and 3). In diagram 4 the algorithm has moved to the second month, and the footprint can now begin to include simulated pixels from the preceding month. In diagram 5 the full footprint is shown, truncated neither to the left nor right. As the algorithm scans further to the right to complete the second month, the footprint becomes truncated once more, this time by the right-hand margin (diagram 6).

Values of the unconditioned field

All code was written in the R programming language

Having implemented the algorithm described above, the jointly-simulated conditioned field could be transformed to yield realisations of PfPR_{2–10}. Crucially, in contrast to the original per-pixel implementation, these realisations could be aggregated to predict mean PfPR_{2–10} across the 12 months of 2007 for three scales of spatial aggregation: continental, national, and at the first sub-national administrative unit level, quantities that span the spectrum of information scales required by malaria public-health decision-makers.

Previous approaches to estimating PAR have used modelled surfaces of PfPR_{2–10}. In the current study, each realisation of PfPR_{2–10} was converted into a categorical map identifying pixels where prevalence was predicted as either low stable (PfPR_{2–10}≤5%), medium stable (5%<PfPR_{2–10}≤40%) or high stable (PfPR_{2–10}>40%) transmission. These prevalence classes have been proposed previously as of particular relevance to decision-makers when developing optimal strategies for intervention and control
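A minimal sketch of this classification step, using the thresholds given above (the sample pixel values are illustrative):

```python
import numpy as np

def endemicity_class(pfpr):
    """Classify PfPR2-10 values (%) into the three stable-transmission
    classes: low (<=5%), medium (>5% and <=40%), high (>40%)."""
    pfpr = np.asarray(pfpr, dtype=float)
    return np.where(pfpr <= 5.0, "low",
           np.where(pfpr <= 40.0, "medium", "high"))

# illustrative pixel values: 2% and 5% fall in the low class,
# 17.5% and 40% in the medium class, and 63% in the high class
print(endemicity_class([2.0, 5.0, 17.5, 40.0, 63.0]))
```

In a PAR calculation, each realisation's categorical map would then be combined with a gridded population surface and summed within each class and administrative unit, so that the spread across realisations carries through to the PAR estimates.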

In the original implementation of the global PfPR_{2–10} model, validation procedures tested the ability (i) to accurately predict PfPR_{2–10} at individual pixel locations and, (ii) crucially, to provide posterior predictive distributions of PfPR_{2–10} that represented appropriate measures of local uncertainty for each pixel. Correspondingly, the joint simulation implemented in the current study was tested for the ability to (i) accurately predict mean PfPR_{2–10} over aggregated sets of pixels and (ii) generate appropriate posterior predictive distributions of these aggregated means. A hold-out set of 10% of the data was selected using a spatially-declustered stratified random sampling procedure described previously. Because the available surveys did not measure PfPR_{2–10} continuously over large spatial regions and time periods, validation sets could not be formed from contiguous areas. As an alternative, sets were made by aggregating non-contiguous pixels from the hold-out set dispersed through space and time. These sets were made by simple random sampling from the full hold-out set and consisted of 1000 sets each of sizes between 2 and 100 pixels. Sets made in this way were dispersed in both space and time and this was preferred to an alternative strategy of defining spatial or temporal-only sets so that the full space-time functionality of the algorithm could be assessed. Additionally, very few long time series of data existed at the same spatial location, preventing the definition of time-only validation sets. For each set, the true arithmetic mean PfPR_{2–10} was extracted, along with the corresponding posterior predictive distribution of the mean generated by the joint simulation.

For each simulated set, a point estimate of the mean PfPR_{2–10} was derived using the mean of the posterior predictive distribution, and the error between this prediction and the observed true mean PfPR_{2–10} was calculated. A plot was constructed showing for each set the error value on the
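A synthetic sketch of this error analysis, using invented pixel-level truths and predictions rather than the study's data, shows the expected behaviour: mean error near zero at every set size, and mean absolute error shrinking as more pixels are averaged:

```python
import numpy as np

rng = np.random.default_rng(5)

# invented hold-out pixels: 'true' PfPR values (%) and per-pixel point
# predictions with an assumed pixel-level error standard deviation
n_pix = 5000
truth = rng.uniform(0.0, 80.0, n_pix)
pred = truth + rng.normal(0.0, 12.0, n_pix)

def error_stats(set_size, n_sets=1000):
    """Mean error and mean absolute error of set-mean point predictions."""
    idx = rng.integers(0, n_pix, (n_sets, set_size))
    err = pred[idx].mean(axis=1) - truth[idx].mean(axis=1)
    return float(err.mean()), float(np.abs(err).mean())

stats = {size: error_stats(size) for size in (1, 25, 100)}
for size, (me, mae) in stats.items():
    print(f"set size {size:3d}: mean error {me:+6.2f}, MAE {mae:6.2f}")
```

Plotting the individual errors against set size, with moving averages of the mean error and mean absolute error, reproduces the style of summary described above.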

The following procedure was implemented to assess the fidelity of the posterior predictive distributions of mean PfPR_{2–10} for each set as models of uncertainty. Firstly, each distribution was summarised using 100 equally spaced quantiles. Secondly, each quantile was considered in turn and the proportion of true mean PfPR_{2–10} values, across all the aggregated sets, that exceeded the corresponding predicted mean PfPR_{2–10} value for that quantile was calculated. This proportion was interpreted as an “observed” probability threshold and was plotted against the “predicted” probability threshold associated with that quantile. In a perfect model it would be expected, for example, that 50% of the true set means would exceed the values predicted by the corresponding 0.5 quantile of each posterior predictive distribution, 90% would exceed the values predicted by the 0.1 quantile, and 99% the value predicted by the 0.01 quantile. By calculating the actual proportions of true means exceeding each of the 100 quantiles, a “coverage” plot was generated that compared these observed and predicted probability thresholds across the range of probabilities from zero to one. In a perfect model, all plotted values would lie on the 1∶1 line indicating an exact correspondence between predicted and observed probability thresholds, and an exact representation of the uncertainty in aggregated predictions. This procedure was carried out for 1000 sets each of size 1, 2, 5, 10, 15, 20, 30, 40, and 50 pixels drawn by simple random sampling from the full space-time hold-out set.
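The quantile-based coverage calculation can be sketched with synthetic, perfectly calibrated predictive distributions, for which the observed proportion of truths exceeding each q-quantile should approximate 1 − q:

```python
import numpy as np

rng = np.random.default_rng(3)

n_sets, n_draws = 2000, 500
mu = rng.normal(0.0, 1.0, n_sets)            # latent set-mean levels
truths = mu + rng.normal(0.0, 0.5, n_sets)   # 'true' set means
# posterior samples drawn from the correct predictive distribution,
# so the model is perfectly calibrated by construction
samples = mu[:, None] + rng.normal(0.0, 0.5, (n_sets, n_draws))

quantiles = np.linspace(0.01, 0.99, 99)
predicted = np.quantile(samples, quantiles, axis=1)    # (99, n_sets)
observed = (truths[None, :] > predicted).mean(axis=1)  # exceedance rates

# calibration check: observed exceedance should track 1 - q closely
max_gap = float(np.max(np.abs(observed - (1.0 - quantiles))))
print(f"max |observed - (1 - q)| = {max_gap:.3f}")
```

Plotting `observed` against `1 - quantiles` gives exactly the coverage plot described above; a well-calibrated model tracks the 1∶1 line.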

Prediction errors between the point estimates and true mean PfPR_{2–10} values were examined across many simulated sets of different sizes. The mean error, shown by the green line, was effectively zero for all set sizes, indicating that the model produces unbiased predictions with no overall tendency to over- or under-predict mean PfPR_{2–10}. As would be expected, the dispersion of these errors reduced with set size, as did their mean magnitude (shown by the mean absolute error line in red). Mean absolute error for sets of size 1, 25, 50 and 100 pixels was 11.4, 2.7, 1.9 and 1.3 PfPR_{2–10} percentage points, respectively. In the coverage plot, a slightly larger proportion of true mean PfPR_{2–10} values exceeded the predicted 0.7 probability threshold than the 70% predicted by the model. This suggested that the rising limb of the posterior predictive distributions tended to rise too gently towards the peak, thus underestimating slightly the probability of a given prediction taking values smaller than its point estimate. Taken as a whole, the coverage plot suggested that posterior predictive distributions were likely to represent predictive uncertainty reasonably well, albeit with the slight deviations from a perfect model discussed above.

A validation procedure generated many sets of aggregated pixels for which a posterior predictive distribution and point estimate of the set-mean PfPR_{2–10} could be compared to the true value. In (A) the error between the point estimate and true value is plotted against the size (number of pixels) of each aggregated set (black dots). Also shown are smoothed moving averages of the mean error (green line) and mean absolute error (red line) in relation to set size. (B) is a coverage plot comparing, for aggregated sets of different sizes, the correspondence between predicted probability thresholds (as provided by the modelled posterior predictive distributions of mean PfPR_{2–10} in the validation sets) and actual probability thresholds (defined as the observed proportions of true set means exceeding the predicted threshold values).

Each of the 500 realisations represents a complete surface of PfPR_{2–10} within the global limits of stable transmission, aggregated temporally across the 12 months of 2007. None of these maps, taken individually, is intended to represent the true pattern of global prevalence. Each is driven by the underlying data but represents a random draw from a universe of possible maps given the model specification, the information in the data, and the resultant modelled uncertainty. Whilst the large-scale regional patterns of endemicity are similar in each realisation, small scale heterogeneity exists between each, and this variation across the 500 realisations defined the form of the posterior predictive distribution of the global surface. The validation procedures explained above provided evidence of the suitability of these surfaces to be aggregated spatially or temporally to provide appropriate posterior predictive distributions of mean PfPR_{2–10} within spatiotemporal units of different sizes. Posterior predictive distributions were generated for mean PfPR_{2–10} across the entire African continent, across three individual countries (Ghana, Democratic Republic of Congo (DRC) and Kenya), and across a first level administrative unit in each of these countries (Ashanti Region, Ghana; Kinshasa Province, DRC; Nyanza Province, Kenya). As would be expected, the dispersion of the distributions, which can be interpreted directly as the modelled uncertainty in the predicted mean PfPR_{2–10}, tended to decrease as aggregated predictions were made over progressively larger regions, such that the continent-wide mean was predicted with lower uncertainty than were national-level means which, in turn, were less uncertain than first-level administrative unit means. Dispersion was also moderated, however, by predictive uncertainty influenced by the availability of input survey data in different regions.
This explains why the posterior predictive distribution for DRC, a country with very few available survey data, is substantially more dispersed than that for Kenya, for which many survey points exist, despite constituting a much larger spatial unit of aggregation. These example plots also illustrate in general terms why joint simulation is necessary when predicting aggregated prevalence. Under a standard per-pixel implementation with all locations simulated independently, the variance of the aggregated mean PfPR_{2–10} would decline in inverse proportion to the number of pixels in the aggregated unit (and its standard deviation in proportion to the square root of that number). At even the first administrative unit level, this would result in artificially small variances for the posterior predictive distributions. At national and continental levels the predicted uncertainty would effectively be zero. Under the joint simulation approach presented here, the space-time variance structure is preserved and this resulted in even the continent-wide prediction retaining a non-negligible level of uncertainty.
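The contrast between the two approaches can be demonstrated directly: averaging independently simulated pixels collapses the variance of a regional mean towards zero, whereas jointly simulated, strongly correlated pixels retain most of it. The covariance parameters below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

n, draws = 400, 2000                  # pixels per region; realisations
x = np.arange(n, dtype=float)
# illustrative exponential covariance with a long correlation range
C = np.exp(-np.abs(x[:, None] - x[None, :]) / 50.0)
L = np.linalg.cholesky(C + 1e-10 * np.eye(n))

joint = L @ rng.standard_normal((n, draws))   # jointly simulated pixels
indep = rng.standard_normal((n, draws))       # independent per-pixel draws

var_joint = float(joint.mean(axis=0).var())   # variance of regional mean
var_indep = float(indep.mean(axis=0).var())   # collapses as 1/n
print(f"variance of the regional mean  joint: {var_joint:.3f}, "
      f"independent: {var_indep:.4f}")
```

With independent draws the variance of the 400-pixel mean is close to 1/400 of the pixel variance; the joint draws retain a far larger share because correlated pixel errors do not average out.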

Examples of five of the 500 realisations of PfPR_{2–10} generated

The crucial feature of the jointly simulated realisations of PfPR_{2–10} was that they could be aggregated over arbitrary spatial and/or temporal regions to generate posterior predictive distributions of mean PfPR_{2–10}, constituting appropriate models of regional uncertainty. In this example, such predicted distributions are provided at three different scales, predicting mean PfPR_{2–10} for the entire African continent (i); across the nations of Ghana (ii), Democratic Republic of Congo (iv) and Kenya (vi); and across a first administrative level unit of those countries (Ashanti Region, Ghana (iii); Kinshasa Province, DRC (v); Nyanza Province, Kenya (vii)).

Estimated 2007 populations living under low, medium, and high stable transmission risk (PfPR_{2–10}≤5%, 5%<PfPR_{2–10}≤40%, and PfPR_{2–10}>40%, respectively) are presented by country.

Map A shows the estimated population at risk of high stable transmission (PfPR_{2–10}>40%) for each of the 80

Numerous algorithms exist that seek to increase the efficiency of joint simulation, including the widely-used family of sequential simulation algorithms and those based on spectral decompositions operating on regular lattices. Whilst these elegant algorithms expand considerably the magnitude of joint simulation tasks that can be achieved relative to a direct calculation, none could produce simulations on arbitrary input grids at the scale required for the global disease maps addressed in this study. We have overcome these limitations using a practical elaboration of standard sequential simulation that is empirically highly efficient but imposes no special requirements on the input grid or covariance function. The approach represents an important increase in the feasibility of aggregated uncertainty assessment over very large prediction spaces, expanding the scope of geostatistical models in global scale epidemiology.

The current study builds on a modelling framework for the global mapping of PfPR_{2–10}. A threshold of PfPR_{2–10}>40% has been proposed as separating lower transmission settings, where universal coverage of insecticide-treated bed-nets could interrupt transmission

The direct practical utility to decision-makers of accompanying uncertainty metrics is less well established since they have not been available previously. Uptake by decision-makers will be aided by the packaging of uncertainty measures into easily understood information and the ranked uncertainty maps presented in

In this study we have presented the extension of a jointly simulated prediction framework for PfPR_{2–10}. A related metric is the basic reproductive number, R_{0}, which provides biological insight into the intensity of malaria transmission and is particularly useful when assessing the effect of current or future interventions. Because R_{0} can be estimated as a function of prevalence, the aggregated prediction framework presented here could also make regional estimates of R_{0} useful for strategic planning.

The approach presented in this study can be applied readily to any large-scale MBG prediction of infectious disease prevalence and corresponding populations at risk. An important caveat for further applications, however, is that the algorithm cannot be treated as a black-box that will generate appropriate output without user supervision. The algorithm relies on a key assumption: that the use of a relatively small proportion of conditioning data proximate to each target prediction column generates predicted values that are sufficiently similar to a theoretical (although infeasible) direct joint simulation based on all locations simultaneously. In reality, the footprint-based predictions will approach the theoretical “true” values asymptotically, such that the use of progressively more conditioning data in the footprint will result in progressively smaller increases in convergence between the two sets of values. This leads to a delicate trade-off between feasible computational demand and appropriate predictive precision. Suitable resolutions of this trade-off cannot be prescribed

Automatic optimization of the footprint would be a useful area for future research. Approaches to this problem, based on evaluation of the Markov properties of 2-d fields, have been proposed

The expansion of MBG in epidemiology has been rapid and led to major advances in the handling of uncertainty in disease risk maps. To date, fundamental computational constraints have precluded the use of such models for predictions of aggregated prevalence and populations at risk required by decision-makers across national and continental spatial scales. In this study we have designed, implemented and tested a new algorithm that overcomes the prohibitive computational barriers of large scale joint simulation allowing, for the first time, appropriate handling of aggregated uncertainty in global scale disease maps. This epidemiological insight has been extended to defining national populations at risk with appropriate confidence intervals, which are released here in the public domain to support informed efforts in disease burden estimation.

Populations at risk under different levels of

(0.41 MB DOC)

The data used in this paper were critically dependent on the contributions made by a large number of people in the malaria research and control communities and these individuals are listed on the MAP website (