Using Community-Level Prevalence of Loa loa Infection to Predict the Proportion of Highly-Infected Individuals: Statistical Modelling to Support Lymphatic Filariasis and Onchocerciasis Elimination Programs

Lymphatic Filariasis and Onchocerciasis (river blindness) constitute pressing public health issues in tropical regions. Global elimination programs, involving mass drug administration (MDA), have been launched by the World Health Organisation. Although the drugs used are generally well tolerated, individuals who are highly co-infected with Loa loa are at risk of experiencing serious adverse events. Highly infected individuals are more likely to be found in communities with high prevalence. An understanding of the relationship between individual infection and population-level prevalence can therefore inform decisions on whether MDA can be safely administered in an endemic community. Based on Loa loa infection intensity data from individuals in Cameroon, the Republic of the Congo and the Democratic Republic of the Congo we develop a statistical model for the distribution of infection levels in communities. We then use this model to make predictive inferences regarding the proportion of individuals whose parasite count exceeds policy-relevant levels. In particular we show how to exploit the positive correlation between community-level prevalence and intensity of infection in order to predict the proportion of highly infected individuals in a community given only prevalence data from the community in question. The resulting prediction intervals are not substantially wider, and in some cases narrower, than the corresponding binomial confidence intervals obtained from data that include measurements of individual infection levels. Therefore the model developed here facilitates the estimation of the proportion of individuals highly infected with Loa loa using only estimated community level prevalence. It can be used to assess the risk of rolling out MDA in a specific community, or to guide policy decisions.

The empirical distributions of log{ρ i /{1 − ρ i )} and of log(λ i ) are unimodal and moderately negatively skewed, suggesting that linear mixed effects modelling on these transformed scales is not unreasonable. The empirical distribution of theκ i is extremely skewed, with most values less than 1; also, the larger values ofκ i have larger associated standard errors. The likelihood-ratio statistic indicates a significant difference between a model with a single shape parameter κ, estimated asκ = 0.55, and 156 separate parameters κ i (D = 239 on 155 degrees of freedom, p < 0.001). However, estimates for the village-specific prevalencesρ i and scale parametersλ i were almost unchanged whether or not we estimated separate values of κ for each village ( Figure B).

S.2 Parameter estimation
Parameters were estimated using maximum likelihood in conjunction with a quasi-Monte Carlo algorithm to evaluate the integrals on the right-hand side of equation (6) (main article). The algorithm was implemented using a two-dimensional Halton sequence for the generation of the integration points. The number of points used was 1000. All the candidate covariates included in the model were centred. Maximum NDVI was centered at 0.8, elevation at 1000 and the remaining covariates at their observed means. Numerical maximisation of the likelihood was performed with the R function nlminb() which uses PORT routines for optimisation. These routines are based on a reverse-communication trust-region quasi-Newton method.
For the numerical optimisation, transformed parameters with unconstrained ranges were used. The correspondence between the transformed parameters θ, as listed in Table A, and the original parameterisation is as follows for the model without covariates: θ 1 is the intercept for the regression on the log-odds of prevalence; θ 2 is the intercept for the regression on log(λ); θ 3 is the logarithm of the shape parameter κ in the Weibull distribution; θ 4 and θ 5 are the logarithms of the variances, σ 2 U and σ 2 V respectively, of the random effects U and V ; θ 6 is log{(1 + φ)/(1 − φ)} where φ is the correlation between U and V .
For the model with covariates the correspondence is: θ 1 is the intercept for the regression on the log-odds of prevalence; θ 2 is the slope parameter for the effect of forest cover on the log-odds of prevalence; θ 3 and θ 4 are the slope parameters for the elevation effect on log-odds of prevalence covering the elevation ranges 0-1000 and 1000+, respectively; θ 5 is the slope parameter for the effect of temperature (× 10) on log-odds of prevalence; θ 6 is the intercept for the regression on log(λ); θ 7 is the slope parameters for the effect of forest cover on log(λ)); θ 8 and θ 9 are the slope parameters for the elevation effect on log(λ) covering the elevation ranges 0-1000 and 1000+, respectively; θ 10 is the slope parameters for the effect of temperature (× 10)) on log(λ)); θ 11 is the logarithm of the shape parameter κ in the Weibull distribution; θ 12 and θ 13 are the logarithms of the variances, σ 2 U and σ 2 V respectively, of the random effects U and V ; θ 14 is log{(1 + φ)/(1 − φ)} where φ is the correlation between U and V .
For the model that uses site as a covariate the correspondence is as follows: θ 1 is the intercept for the regression on the log-odds of prevalence; θ 2 to θ 5 are the differences in effect on the log-odds of prevalence between the study site being DRC North-West, Congo, Cameroon West and Camroon East, respectively, and it being DRC Bas Congo; θ 6 is the intercept for the regression on log(λ); θ 7 to θ 10 are the differences in effect on log(λ) between the study site being DRC North-West, Congo, Cameroon West and Cameroon East, respectively, and it being DRC Bas Congo; θ 11 is the logarithm of the shape parameter κ in the Weibull distribution; θ 12 and θ 13 are the logarithms of the variances, σ 2 U and σ 2 V respectively, of the random effects U and V ; θ 14 is log{(1 + φ)/(1 − φ)} where φ is the correlation between U and V .
Estimates and standard errors for the transformed parameters, and 95% confidence intervals for the original parameters, are given in Table A. S.3 Algorithm to simulate samples from the plug-in predictive distribution Let Z denote the number of infected individuals out of a sample of size n, Y the vector of infection levels for each of the infected individuals, and (U, V ) the bivariate random effect, as defined in equations (3) and (4) (main article). Write T = ρ(U ){1 − G(c; V )}, suppressing the dependence of T on model parameters and covariates as these are assumed known. We wish to draw samples from the predictive distribution of T given Z and n. In general we can factorise the joint distribution of the random variables U , V , Z and Y as Integrating both sides of (1) with respect to Y gives For the model defined by equations (2), (3) and (4) Also, [V |U ] is a univariate Normal distribution with known mean and variance, m = φU × σ U /σ V and v = (1 − φ 2 )σ 2 V . Hence, the problem reduces to sampling first a value, u say, from the predictive distribution [U |Z] and then sampling from the Normal distribution [V |u]. The second of these is straightforward; we used the R function rnorm(). For the first, write The algebraic expressions for each of [U ] and [Z|U ] are known; the former is a univariate Normal density, the latter a binomial probability with n trials and success probabilityρ(U ).
[Z|U ]du, can easily be calculated using Gauss-Hermite quadrature. Thus knowing the probability distribution [U |Z], we can sample from it using the inverse probability integral transform as follows. First we compute a look-up table for the cumulative distribution F (U |Z) for all the combinations of n and Z observed in the data. For this we generate an equally spaced vector of possible values of U covering the interval [µ u − 3σ u , µ u + 3σ u ] with a spacing of 0.1. The cumulative probability F (U = u i |Z) of each of these values u i , is approximated by Given n and Z for a specific village, a sample from [U |Z] is then generated through the following steps.
1. generate a sample y 1 , ...y n from the unifrom distribution U (0,1) 2. for each element y j of this sample, find the largest cumulative probability F (u i |Z) for the given n and Z such that y j ≥ F (u i |Z) 3. if y j = F (u i |Z), the corresponding sample from [U |Z] is u i , otherwise the corresponding sample will lie between u i and u i+1 and we estimate it by linearly interpolating the two points and inverting this line at y j .
Using the above approach we generate 100 000 samples from [U |Z], and in turn from [V |U ], which we then use to calculate the sampled values from the plug-in predictive distribution of T . Table B gives point predictions and 95% predictive intervals for T = P(Y > 8000/ml blood for each of the 222 villages.       S.5 Data-analysis to show that site differences make a very small contribution to the variability between villages.

S.6 Model Validation
Although desirable, formal model validation is not possible as the target of prediction, the proportion of individuals in a village, which exceeds infection levels of a certain threshold, cannot be observed unless all individuals in a community are tested and their blood samples analysed. However, in order to get a sense of the accuracy of out-of-sample predictions, we predict the proportions of highly infected individuals in communities and compare these with the observed proportion of highly infected individuals in the samples and their 95% confidence intervals based on the binomial sampling distribution for datasets from Cameroon, Gabon and Equatorial Guinea that have not been used in the model development. The following tables and figures show the results of this.  Equatorial Guinea Table E: Model-based 95% prediction intervals (PI) for T , the proportion of individuals with parasite counts greater than 8000/ml blood the proportion of individuals with parasite counts greater than 30000/ml blood as well as the 95% confidence intervals for the true proportions based on the binomial sampling distribution of the observed numbers in each village. n, the number of individuals tested, Z, the number of individuals tested positive for Loa Loa, Prop 8k is the observed proportion with infection levels greater than 8000mf/ml blood, Prop 30k is the observed proportion with infection levels greater than 30000mf/ml blood Estimation of the true proportion of individuals with infection levels >8,000 mf/ml blood based on the binomial sampling distribution Model based prediction of the proportion of individuals with infection levels >8,000 mf/ml blood Figure E: Comparison of the proportion of individuals with more than 8,000 microfilariae per ml blood as predicted by the model based on the observed prevalence in a sample and the binomial sampling distribution based on the observed proportion in a sample. The black dots and lines are the centre and the extent of the 95% confidence interval of the binomial sampling distribution. The red dots and lines are the centre and extend of the 95% predicted interval. Estimation of the true proportion of individuals with infection levels >30,000 mf/ml blood based on the binomial sampling distribution Model based prediction of the proportion of individuals with infection levels >30,000 mf/ml blood Figure F: Comparison of the proportion of individuals with more than 30,000 microfilariae per ml blood as predicted by the model based on the observed prevalence in a sample and the binomial sampling distribution based on the observed proportion in a sample. The black dots and lines are the centre and the extent of the 95% confidence interval of the binomial sampling distribution. The red dots and lines are the centre and extend of the 95% predicted interval.