Model-based small area estimation methods and precise district-level HIV prevalence estimates in Uganda

Background Model-based small area estimation methods can help generate parameter estimates at the district level, where planned population survey sample sizes are not large enough to support direct estimates of HIV prevalence with adequate precision. We computed district-level HIV prevalence estimates and their 95% confidence intervals for districts in Uganda. Methods Our analysis used direct survey and model-based estimation methods, including Fay-Herriot (area-level) and Battese-Harter-Fuller (unit-level) small area models. We used regression analysis to assess for consistency in estimating HIV prevalence. We use a ratio analysis of the mean square error and the coefficient of variation of the estimates to evaluate precision. The models were applied to Uganda Population-Based HIV Impact Assessment 2016/2017 data with auxiliary information from the 2016 Lot Quality Assurance Sampling survey and antenatal care data from district health information system datasets for unit-level and area-level models, respectively. Results Estimates from the model-based and the direct survey methods were similar. However, direct survey estimates were unstable compared with the model-based estimates. Area-level model estimates were more stable than unit-level model estimates. The correlation between unit-level and direct survey estimates was (β1 = 0.66, r2 = 0.862), and correlation between area-level model and direct survey estimates was (β1 = 0.44, r2 = 0.698). The error associated with the estimates decreased by 37.5% and 33.1% for the unit-level and area-level models, respectively, compared to the direct survey estimates. Conclusions Although the unit-level model estimates were less precise than the area-level model estimates, they were highly correlated with the direct survey estimates and had less standard error associated with estimates than the area-level model. Unit-level models provide more accurate and reliable data to support local decision-making when unit-level auxiliary information is available.

Introduction jmusinguzi@infocom.co.ug 2. Data from the Uganda Population and Housing Census, conducted in 2014 is available from the census report (https://www.ubos.org/wp-content/uploads/ publications/03_20182014_National_Census_ Main_Report.pdf). This data is publicly available and can be extracted from the report as described in the methods section of our study. 3. Antenatal HIV testing data, extracted from DHIS2, can been accessed upon request from Ministry of health Uganda (https://www.health.go.ug/), through the permanent secretary, Ministry of Health Uganda. Dr Diana Atwine. email: diana.atwine@health.go.ug 4. Lot Quality Assurance Sampling survey data can also be accessed upon request from ministry of local government (https://molg.go.ug/) or Ministry of Health through the permanent secretary Dr Diana Atwine. Email: diana.atwine@health.go.ug. It contains individual level sensitive information We confirm that none of the authors had special privileges to access any of the datasets. We obtained ethical clearance from the Uganda National Council of Science and Technology to use the datasets. This is a key requirement for anyone accessing these datasets in Uganda. prevalence in South Africa [23]. To our knowledge, our study is the first to use unit-level models to estimate local HIV prevalence in Uganda.
Uganda's HIV prevalence distribution varies across geographical regions and among population groups. National HIV prevalence is 6.2% among persons aged 15-64 years (women, 7.6%; men, 4.7%) and varies from 3.1% to 8.0% across regions [24]. Availability of district-level prevalence information is therefore critical for resources allocation and decision-making.
We applied unit-level and area-level models to estimate HIV prevalence for districts in Uganda. We compared the precision of the model-based estimates to direct survey estimates and the precision of the unit-level model to the area-level model estimates. Our findings include several alternative district-level estimates that may be helpful for monitoring districtlevel HIV/AIDS intervention programs.

Data sources
Uganda Population HIV Impact Assessment. The Uganda Population HIV Impact Assessment (UPHIA) 2016-2017 used a two-stage, stratified cluster-sampling design. In the first stage, 520 enumeration areas (EAs) or clusters were randomly selected using probability proportional to the number of households in the cluster; in the second stage, a sample of 25 households were randomly selected using equal probabilities from each EA. The EAs were based on the 2014 National Population and Housing Census (NPHC) [25]. Uganda's Ministry of Health conducted the survey with technical support from ICAP at Columbia University, Centers for Disease Control and Prevention, and ICF/Macro International. For detailed survey information, see the official survey report [24]. For our study, we analyzed data from 16,828 adults aged 15-64 years from 70 districts; participants provided written informed consent were tested for HIV during the survey.
Lot quality assurance sampling surveys. In Uganda, annual LQAS surveys are used to obtain district-level indicator estimates for monitoring health interventions [26]. In LQAS, each program area is defined to be a district subdivided into 4-7 supervision areas (SA). SA comprise either a sub-county or neighboring sub-counties with similar socioeconomic characteristics. Using probability proportional to the number of households in the village, survey staff randomly select a sample of 19 villages for districts with 5-7 SAs and 24 villages for those with four SAs. A minimum sample of 19 and 24 respondents per SA for districts with 5-7 and four SAs, respectively, are selected to maintain the combined SA misclassification (Alpha + Beta) errors to <15%. These sample sizes enable computation of district (program)-level coverage with <10% error margin for indicators [27]. A reference household is randomly identified using equal probability sampling within the SA and the next nearest household from the exit of the reference household is selected to start the survey. Eligible and consenting adult respondents are interviewed from selected households. Parallel sampling is applied to select eligible respondents for the following population categories: mothers of children aged 0-59 months, youth aged 15-24 years, women aged 15-49 years, and men aged 15-54 years. A sample of 19 or 24 respondents are selected for each of the population groups. The design does not permit selection of youth, men, and women from the same household because similar indicators are assessed in these population groups. For full details, see the survey reports [16]. We analyzed data from youth, men, and women.
Other sources of covariate data. Summary district level covariate data were obtained from the NPHC 2014 [28], and HIV prevalence from antenatal care attendance was obtained from the District Health Information System version 2 (DHIS2) [29]. Variables obtained from the NPHC 2014 include district population density, percentage of the population living in urban areas, and proportion of individuals who accessed a health facility in the 12 months preceding the survey.

Statistical analysis
We computed district-level HIV prevalence estimates for 70 districts that conducted both UPHIA and LQAS surveys in 2016 using direct survey, area-level SAE models, and unit-level SAE models. We further assessed the estimates for consistency in estimating HIV prevalence and compared the precision of the estimates via the confidence intervals and the coefficient of variation of the estimates.
1. Direct survey estimates. If y ij is the HIV status (positive, 1; negative, 0) of the j-th individual in the i-th district and p i is the proportion of HIV-positive people in district i, then taking into account the sampling weights, the direct estimate of district HIV prevalence is obtained as follows: Where m is the number of districts and N i is the number of individuals in district i. Using the direct survey estimate (Eq 1.1), we used the UPHIA 2016-2017 dataset to compute weighted district HIV prevalence estimates and associated standard errors (SE) based on the sampled observation from each district. More details about the survey weights are available [24]. SE were computed using standard survey estimation (linearization) methods.
2. Area-level model estimation. The multivariate Fay-Herriot model [1] is the most commonly used explicit area-level model to estimate area parameters when area-level auxiliary data are available. In area-level models, the outcome obtained from direct estimation is regressed against summary/aggregate explanatory variables that are available only at the administrative/geographical area of interest [18]. The model is defined in two stages: developing a sampling model for the direct survey estimates and applying a linking model to obtain area-level parameter estimates.
2.1. Model estimation and prediction. Taking y i ¼ gð� y i Þ, assuming g(.) is a logit link function, and relating it to a vector of p area-level covariates, z i , where z i =(z 1i ,z 2i , . . . ,z pi ) for the m areas, the linking model for the area-level parameter θ i is defined as Where: β = (β 1 ,β 2 , . . . ,β p ) T is a p × 1 vector of regression coefficients and u 0 i s are area-specific random effects assumed to be independent and identically distributed ( The area-level random effects, υ i , capture the unstructured heterogeneity among the areas (districts) not explained by the sampling error variance.
The unbiased direct estimator of θ i is obtained using a sampling model in model 2.2.
for i = 1,2, . . . ,m, e i~N (0,ψ i ). Where e i is the sampling error with known sampling variance, Var (e i ) = ψ i and E(e i ) = 0 for all areas. Model 2.2 is referred to as the sampling model because θ i is unobservable and is estimated based on the sampled observations in the area. Combining 2.1 and 2.2, we obtain the mixed model Where u i e Nð0; s 2 v Þ, e i~N (0,ψ i ), i = 1,2,. . .,m and υ i is independent of e i The Fay-Herriot, area-level model estimate is then obtained as a weighted combination of the direct (ŷ i ) and regression-synthetic estimators (z i T b).
The weighting component is the ratio of the model error variance to the total variance.
From Eq 2.4, the estimateŷ FH i tends toŷ i for large values of the model variance (ŝ 2 v ) and tends to z i T b for small values of the model variance relative to s 2 i . Model parameters and SE are estimated using maximum likelihood methods [18,30].
2.2. Auxiliary variables for the area-level model. Variants of the area-level model were fitted with different combinations of the predictor variables as shown in S1 File. The general population direct HIV prevalence estimate (p i ) was found to be related to HIV prevalence from ANC attendance (P ANCi ) y i = logit(p i ) and z i = logit(p ANCi ). This model had the lowest value of the Akaike Information Criterion (AIC) ( Table 1 in S1 File). The models were fitted using the SAE package in R version 3.6.2 [31].
3. Unit-level model estimation. When unit-level or individual-level auxiliary data X ij ¼ ðx ij1 ; x ij2 ; ; x ij3 ; . . . :x ijq ; Þ for a vector of q auxiliary variables are available, the Battese-Harter-Fuller [17] unit-level model is often applied to obtain small area parameter estimates. Under the unit-level model, data for both the outcome and the explanatory variables are available for each population element in the district, irrespective of the administrative area/domain of interest [18].
3.1. Model estimation and prediction. Letting p ij be the probability of individual j from district i being HIV positive (p ij ¼ Prðy ij ¼ 1 j x ij ; m i Þ, where y ij corresponds to the HIV status of individual i in district, a logistic regression model with area-level effect is used to estimate p i [18,32]. Where y ij is assumed to be independent Bernoulli (p ij ) conditioned on p ij 's with area random effects υ i ; β is the vector of regression model parameters. υ i is assumed to be independent and identically distributed with E(υ i ) = 0 and Var ðu i Þ ¼ s 2 v . Assuming absence of area-level auxiliary information, the indirect estimators of district HIV prevalence, based on model 3.1 is the empirical best predictor obtained as follows: Model fitting and parameter estimation were implemented using the SAE [31] package in R software, version 3.6.2 [31] The MSE ofp i is obtained using the parametric bootstrap estimation method for finite populations [33,34] as described in S2 File.
3.2. Auxiliary variables for the unit-level model. Auxiliary variables from the 2016 LQAS data include age group in years (15-19, 20-24, 25-34, 35-44, and �45), sex (male and female), level of education (none, primary, secondary, and tertiary), marital status (single, married, and previously married: widowed/divorced/separated) and number of sexual partners including spouse in the 12 months preceding the survey (0, 1, and �2). We selected these variables because they were significantly associated with HIV positivity in our previous study [6].

Comparison of the unit-level and area-level model estimates.
We used summary statistics, regression analysis, and graphical assessment for consistency to compare estimates from direct, area-level, and unit-level models. We assessed gain in precision of the model-based estimates compared to the direct survey estimates using the coefficient of variation of the estimate and ratio of the MSE. We further computed ratios of the unit-level model and the area-level model estimates to assess for improvement in precision of the unit-level model estimates.

Ethics approval and consent to participate
Ethical clearance to conduct this study was obtained from the University of Witwatersrand Human Research Ethics Committee (HREC), clearance Certificate number M171053. Further clearance was obtained from the Uganda National Council for Science and Technology (UNCST) with registration number HS2366. Data for the study were obtained from surveys conducted in Uganda. The study was a secondary analysis of data, so consent to participate is not applicable.

Selected characteristics of survey participants
For both surveys, most respondents were women, had incomplete primary education, were married/cohabiting, and had one sexual partner in the 12 months preceding the surveys (Table 1).

HIV prevalence estimates
HIV prevalence estimates for the districts included in the analysis are summarized in Table 2. District-specific estimates with 95% confidence intervals are presented in S1 Table. On average, direct survey estimates were higher (mean = 0.064, SD = 0.034) than area-level (mean = 0.056, SD = 0.018) and unit-level model (mean = 0.058, SD = 0.026) estimates. Direct survey estimates also had a higher variation than the model-based estimates, with minimum and maximum values of 0.010 and 0.148, respectively. The estimates from the area-level model had the least variation (Table 2). Unit-level model HIV prevalence estimates Fig 1C had a similar pattern compared to the direct survey prevalence estimates (Fig 1A). HIV prevalence estimates based on the area-level model had the least recognizable pattern between districts ( Fig 1B) consistent with the summary statistics in Table 2. HIV prevalence was generally higher in districts in Central, South Western, and Northern regions of Uganda and in districts bordering lakes.

Model diagnostics
We regressed model-based estimates against direct survey estimates to assess bias and reliability of the model-based estimates. The bias diagnostics plot also presents a scatter plot of the estimates and shows the effect of extreme values in the estimates. Unit-level model estimates were strongly correlated with the direct survey estimates (β 1 = 0.66; r 2 = 0.862; Fig 2B) compared to area-level model estimates (β 1 = 0.44; r 2 = 0.698; Fig 2A).

Precision and consistency of HIV prevalence estimates
The model-based estimates were similar to the direct survey estimates, but the point estimates had less variation compared to the direct survey estimates (Fig 3A and 3C). We also note that direct survey estimates were significantly different (i.e., they were either higher or lower) compared with model-based estimates for small survey sample sizes Fig 3A and  3C), demonstrating the shrinkage of the model-based estimates toward the point estimates. However, direct survey and model-based estimates tended to be similar with  Table). The average improvement in the precision of the estimates was 37.5% and 33.1% for the unit-level and area-level model, respectively (data not shown). The coefficient of variation of the direct survey estimates were generally larger and had a higher variation compared to the coefficient of variation of the model-based estimates irrespective of the survey sample size in the districts (Fig 3B and 3D). Table 3 summarizes the coefficients of variation and shows that the mean coefficient of variation for direct survey estimates was 138.0% (346.9) compared to 22.4% (7.5) for the area-level and 23.7% (18.5) unit-level models.
Assuming estimates are considered as reliable for decision making if the coefficient of variation is <20% [31], then <50% of the districts have reliable data for decision making based on the direct survey estimates, whereas >50% of the districts would have reliable data based on the SAE methods (Table 3). Specifically, only 14 (20%), 36 (51.4%), and 36 (51.4%) of the districts would have reliable information for decision making based on direct survey estimates, area-level model estimates, and unit-level model estimates, respectively (data not shown).

Consistency of model-based estimates
Unit-level model estimates varied more than the area-level model estimates (Fig 4A). Additionally, the coefficients of variation of the unit-level model estimates were consistently larger for districts with small survey sample sizes and consistently smaller for districts with large survey sample sizes (Fig 4B). Implying that the unit-level model estimates converge more rapidly to the point estimate compared to the area-level model estimates as the survey sample sizes in the districts increased.

Discussion
Study findings show the feasibility of applying SAE models to population survey data with auxiliary information from community LQAS surveys and DHIS2 data to obtain more precise HIV prevalence estimates for districts in Uganda. Both the unit-level and area-level model estimates were similar to the direct survey estimates, although the unit-level model estimates were more correlated with the direct survey estimates compared to area-level model estimates. A graphical assessment shows that the model-based estimates were close to the direct survey estimates, were less polarized, and had no extreme values/outliers. Mapping model-based HIV prevalence estimates shows a similar pattern between the unit-level estimates and the direct survey estimates. Estimates for all approaches were generally similar to regional survey prevalence estimates [24].
The coefficient of variation of both the unit-level and area-level model estimates were lower than the coefficient of variation of the direct survey estimates. However, the coefficient of variation of the unit-level model estimates were larger than the coefficient of variation of the arealevel model estimates, which suggests that area-level model estimates were more precise. Coefficient of variation is computed as a ratio of the SE to the mean and expressed as a percentage. It therefore expresses the sampling variability of the estimates from the point estimate, implying that estimates with large coefficients of variation would be considered over-dispersed and therefore unreliable for decision-making.
Although population surveys provide more accurate national and regional information, they yield only partial information for district-level planning and allocating resources. Uganda, like many other low and middle-income countries, lacks the resources to collect representative data for monitoring social services at the district-level; however, in Uganda's de-centralized

PLOS ONE
governance model, districts plan, implement, and monitor interventions on behalf of the central government. Using simple methods to generate more precise HIV prevalence estimates for districts will therefore augment district-level decision making. Application of SAE methods elsewhere in Africa have shown more precise district-level estimates [21][22][23] as observed in our study.
We note that the unit-level model estimates were highly correlated with direct survey estimates, although they were less precise compared to the area-level model estimates, which is contrary to SAE's methodological theory [18]. Lower correlation of the area-level model estimates with direct survey estimates compared to the unit-level model estimates versus direct survey estimates may be attributed to the aggregated area-level covariates, which mask any internal variations or heterogeneity of units. The heterogeneous spread of HIV is well documented in literature [6,23,24]. For example, HIV prevalence is higher in urban areas compared to rural areas [6,23,24]. Similarly, females have a higher HIV positivity compared to males although the differences between males and females varies by age group [24]. The difference is wider for the younger age group 15-24 years and converges with increasing age [24]. It is therefore important to take into consideration, this internal differences in obtaining the estimates as demonstrated by the BHF model applied in this study.
Additionally, use of antenatal data for HIV prevalence monitoring in the general population has limitations and biases including selection bias of routine data. Antenatal HIV surveillance data excludes non-pregnant women and men and includes information from health facilities in urban and easily accessible areas [35]; public health facilities that report to national HMIS system [20]; younger and more educated women who have higher antenatal care attendance rates [36][37][38]. These limitations imply that the antenatal survey data may not accurately reflect the general population HIV prevalence distribution as observed in our study.
Overall improvement in the precision of the estimates was 37.5% and 33.1% for the unitlevel and area-level model, respectively. A study combining population survey with routine data found 28% improvement in the precision of estimates but lower correlation with estimates based on routine data [5]. Use of risk factor data, representative of the general population, therefore improves the precision of the estimates compared to combining routine and survey data or use of more aggregate information.
Study findings were consistent with regional HIV prevalence estimates based on survey data. Districts with higher HIV prevalence were from regions with overall higher HIV prevalence (e.g., Kaborale district in Western region; Masaka and Mpigi in central 1 region; and Mbarara in South Western region) as observed in the national level survey [24]. These districts are urban and are major transport corridors for truck drivers. Masaka district also is inhabited by fishing communities, which typically have higher HIV prevalence [8,9,39,40].
Although the model-based methods assume independence of area random effects, independence is unlikely to hold between neighboring districts. Spatial estimation approaches attempt to solve this, although these methods have their own limitations [23]. Borders and boundaries are arbitrary, and individuals tend to seek healthcare services in neighboring districts or even other regions that are convenient or offer better quality services. For example, some districts in Uganda do not have Health Centre IV or hospitals, which are known to provide better quality healthcare and a broad range of health services [23]. Furthermore, 40 out of 112 districts in the country did not conduct LQAS surveys in 2016; this implies that BHF model estimates could not be obtained for these districts limiting the breath of model evaluation. Data-driven district-level decision making for HIV programs requires precise and reliable estimates, but survey sample sizes from population surveys are significantly smaller for districts than for regions. Our study shows that with external auxiliary data, estimates that are more precise can be obtained, but precision depends on whether the information is available at the sampling unit level or area level. Unit-level model estimates, although less precise than area-level estimates, were more consistent with direct survey estimates. This application also promotes use of annual community LQAS data, which is readily available to district service managers. SAE models also can be developed in freely available software, such as R and STATA. Models that use both area-level and unit-level covariates to obtain SAE could help provide more precise and accurate HIV prevalence estimates in other settings.