Comparative performance of multiple-list estimators of key population size

Estimates of the sizes of key populations (KPs) affected by HIV, including men who have sex with men, female sex workers and people who inject drugs, are required for targeting epidemic control efforts where they are most needed. Unfortunately, different estimators often produce discrepant results, and an objective basis for choice is lacking. This simulation study provides the first comparison of information-theoretic selection of loglinear models (LLM-AIC), Bayesian model averaging of loglinear models (LLM-BMA) and Bayesian nonparametric latent-class modeling (BLCM) for estimation of population size from multiple lists. Four hundred random samples from populations of size 1,000, 10,000 and 20,000, each including five encounter opportunities, were independently simulated using each of 30 data-generating models obtained from combinations of six patterns of variation in encounter probabilities and five expected per-list encounter probabilities, producing a total of 36,000 samples. Population size was estimated for each combination of sample and sequentially cumulative sets of 2-5 lists using LLM-AIC, LLM-BMA and BLCM. LLM-BMA and BLCM were quite robust and performed comparably in terms of root mean-squared error and bias, and outperformed LLM-AIC. All estimation methods produced uncertainty intervals which failed to achieve the nominal coverage, but LLM-BMA, as implemented in the dga R package, produced the best balance of accuracy and interval coverage. The results also indicate that two-list estimation is unnecessarily vulnerable, and that it is better to estimate the sizes of KPs based on at least three lists.


Introduction
Among the 1.7 million new HIV infections globally in 2018, 54% occurred among key populations (KPs), particularly female sex workers (FSW), people who inject drugs (PWID), men who have sex with men (MSM), transgender women, clients of sex workers, and sex partners of other KP members [1]. Even in the generalized HIV epidemics in eastern and southern Africa where 75% of new infections occurred among the general population, targeted scale-up of antiretroviral therapy and other interventions among KPs may be the most efficient way to avert new infections [2,3]. For those reasons, provision of HIV services to KPs has long been an important component of the (United States) President's Emergency Plan for AIDS Relief [4] and The Global Fund to Fight AIDS, Tuberculosis and Malaria [5]. Scaling and targeting life-saving HIV services to KPs, and evaluating the efficacy of those services, require knowledge about the sizes of KPs [6].
KP members are often adversely affected by discrimination and stigma [7]. Stigma and criminalization [8] create incentives for key population members to remain hidden, which challenges both population size estimation (PSE) and provision of HIV services. Therefore multiple methods of PSE have been recommended [9]. PSE based on the method known by the monikers "capture-recapture" and the "multiplier method" is a statistically principled approach which has been widely used to estimate the sizes of KPs [10][11][12][13][14][15][16][17][18][19][20]. Such multiple-list PSE is commonly based on only two lists, but three-or-more-list estimation [21][22][23][24] is becoming increasingly common.
Ratio estimation of population sizes from partial observations from two lists (sources or sampling events) dates to 1786 [25], and later became known as "capture-recapture" or "mark-recapture" estimation among animal ecologists [26,27]. Although early applications and developments focused heavily on non-human animal populations, the methods have been applied more broadly, including human birth registration [28], census undercount [29], and epidemiological applications [30][31][32] which trace back to at least 1968 [33]. "Multiplier" or "service-multiplier" estimation in the public-health literature [34][35][36][37] is a rediscovery of ratio estimation of population sizes. The essential data are counts of population members that are recorded on two lists (sources), wherein individuals on the first list can be defined as "marked" and those on the second list are tabulated as either previously encountered ("recaptured") or newly encountered. Estimation from two lists requires the strong assumptions that 1) the population is static over the observation interval, 2) previously encountered individuals are identified without error, 3) individuals are sampled independently, and 4) all population members share a common and constant probability of encounter. The first assumption is approximately satisfied when sampling occurs over short time intervals. The second and third assumptions remain uncertain in KPs because humans can choose whether or not to be interviewed or to disclose a previous encounter. The fourth assumption is untenable for KPs; it is inconceivable that all KP members share a common and constant probability of encounter. Rather, we should expect that individual KP members are highly inhomogeneous in their encounter probabilities.
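As a concrete illustration, the two-list ratio estimate described above can be computed in a few lines of R. The counts below are hypothetical, and the Chapman small-sample correction is a standard adjustment included here only for illustration:

```r
# Two-list ("capture-recapture" / "multiplier") ratio estimation.
# n1: members on list 1 ("marked"); n2: members on list 2;
# m:  members appearing on both lists ("recaptured").
n1 <- 120; n2 <- 150; m <- 30               # hypothetical counts

N_lp <- n1 * n2 / m                         # Lincoln-Petersen estimator: 600
N_ch <- (n1 + 1) * (n2 + 1) / (m + 1) - 1   # Chapman correction: approx. 588.4
```

Both estimators inherit the four assumptions listed above; in particular, neither protects against inhomogeneous encounter probabilities.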
Subsequent statistical developments included accommodation of more than two encounter sources (survey rounds or service rosters) [38], which enables relaxation of the fourth assumption via the development of model-based estimation using distribution mixtures [39,40] and loglinear models (LLMs) of observation frequencies [41] or encounter probabilities [42]. Multiple variations of LLMs which satisfy different assumptions about inhomogeneities in encounter probabilities are commonly fitted to the data, and then the "best" model by some criterion (usually Akaike's Information Criterion, AIC) is selected for estimation of N. That conventional approach is henceforth denoted LLM-AIC. Unfortunately, two or more LLM variations can fit data equally well and yet produce very different estimates and uncertainty intervals [43].
Discrepant estimates occur because the population-size parameter N is generally not identified [40,44,45]. Roughly, a parameter is said to be unidentified whenever its true value remains unknown given an infinite number of observations. Recognition of the unidentifiability of PSE parameters seems absent from the epidemiological and public-health literature, yet it has enormous implications for estimation. The lower bound of population size [46] is identified [47], but is rarely the desired target for KPs.
More recent Bayesian developments eliminate the need for model selection and may improve robustness. Bayesian model averaging of loglinear models (LLM-BMA) reduces the volatility resulting from choice of a single model by properly accounting for model uncertainty [48]. The feasible set of LLMs is fitted and N is estimated as the model-probability-weighted average from those LLMs. Bayesian nonparametric latent-class modeling (BLCM) [49] abandons the LLM framework in favor of estimation from distribution mixtures. Consider that, with sufficient information, a population that is inhomogeneous with respect to encounter probability could be correctly stratified into some potentially large number of homogeneous classes. In that case, a well-informed distribution mixture could be employed to estimate N. However, the number of homogeneous classes is unknown in practice. Instead, BLCM "learns" the most probable latent classes from the data in a Bayesian way, and prevents over-parameterization by imposing a parsimonious prior distribution.
Public-health scientists who estimate the sizes of KPs need to know the performances of alternative estimation methods in order to make informed choices. To obtain a more objective basis for choice, LLM-AIC, LLM-BMA and BLCM were compared using simulated populations of known size and different patterns of variation in encounter probability. Secondarily, the frequencies with which LLM-AIC correctly matches the underlying data-generating models were quantified, and the performance of selected heterogeneity corrections in LLM-AIC estimation was compared.

Study design
The numbers of population members in simulated samples from known populations were estimated using LLM-AIC, LLM-BMA and BLCM. The population sizes, inhomogeneities in encounter probabilities and the number of observation events/lists were varied in the simulated samples to assess the effects of those factors on PSE. The simulated data enabled comparison among methods based on their abilities to estimate the true population size.

Sample simulation
Four hundred random samples from populations of size N = 1,000, 10,000 and 20,000, each including five encounter opportunities, were independently simulated from each combination of six models of variation in encounter probabilities p and five expected per-list encounter probabilities E(p), producing a total of 36,000 samples from which to estimate N based on 2-5 lists. The population sizes were chosen to align with KP size estimates, which commonly fall in the range of 10^3-10^4, and 20,000 was a compromise for computational feasibility in simulations.
The patterns in encounter probability are standards from the literature [50][51][52]. The choice of inhomogeneity patterns was a simplification for comparative purposes; the patterns in KPs may be nearly infinite. The "heterogeneity" model M h accommodates encounter probabilities which vary among individuals. Humans are capable of complex behaviors and preferences, including variations in propensities to seek social and sexual contacts, attend particular venues, or seek services from organizations through which encounters may be listed. Therefore we should not expect KP members to share a common encounter probability. The "temporal" model M t allows encounter probabilities to vary over lists/times. Encounter probabilities might vary with many temporal factors including weather, economic conditions, day of the week, and variations in law-enforcement efforts. The assumption that KP members have temporally constant encounter probability is extreme and risks biased estimation. The "behavioral" model M b imposes a common expected probability of first encounter on each individual; after the first encounter, that individual's encounter probability is henceforth increased or decreased. For this study, the expected encounter probabilities were reduced by 50% after the first encounter. Behavioral effects can arise when, for example, the first contact tends to be either pleasing or displeasing to KP members. For example, FSW might seek out subsequent contacts with surveyors if the "mark" (typically a uniquely identifiable gift) received during their first contact was perceived to be desirable. Conversely, MSM and PWID might avoid subsequent contacts with recognizable surveyors in order to minimize their risk of recrimination or prosecution. Given the complexity of human behavior, we should anticipate combinations of all three basic patterns of inhomogeneity.
Models M th, M bh and M tbh are combinations of those basic patterns.

Individual encounter histories were simulated from beta-Bernoulli distributions given by y_ijk ~ Bernoulli(p_ijk) and p_ijk ~ Beta(θ_ijk), where p_ijk denotes the encounter probability for sample i, i = 1, . . ., 400, individual j, j = 1, . . ., N and list k, k = 1, . . ., 5. The θ_ijk are 2 × 1 vectors of shape parameters (β_1, β_2) (Table 1), which were chosen to produce expected encounter probabilities E(p) = 0.025, 0.050, 0.100, 0.150 and 0.200 given a coefficient of variation of 0.85. The inhomogeneities in encounter probabilities, as measured by the standard deviation of the Beta distribution, ranged by more than a factor of eight, from 0.021 to 0.170 (Fig 1). The complete encounter history for individual j in sample i over lists 1, . . ., k is the k-element vector y_ij of zeros and ones, wherein a one in position k indicates that the individual appears on list k and a zero indicates absence. Given a total of K lists, there are 2^K − 1 observable encounter histories and one unobservable history consisting entirely of zeros. The unobservable encounter histories were removed from the simulated data prior to estimation. Given encounter probability p and k lists of encounters, the proportion of population members observed for the first time on list k is given by p_1(k) = p(1 − p)^(k−1), which has expectation with respect to the Beta distribution

E[p_1(k)] = Γ(β_1 + 1) Γ(β_2 + k − 1) Γ(β_1 + β_2) / [Γ(β_1) Γ(β_2) Γ(β_1 + β_2 + k)],

where Γ(·) denotes the Gamma function, and β_1 and β_2 are the shape parameters of the Beta distribution (S1 Text). Therefore the expected percentages of the populations observed at least once ranged from 4.9% for two lists with E(p) = 0.025, to 54.7% for five lists with E(p) = 0.200 (Table 1). That range may encompass the most likely sampling percentages from encounters within KPs affected by HIV. For example, sampling encountered approximately 10%, 22% and 30% of the estimated sizes of the MSM, PWID and FSW populations, respectively, in Kampala, Uganda [22].
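The beta-Bernoulli generating process can be sketched in R as below for the heterogeneity model M h; the shape parameters here are illustrative rather than the exact values in Table 1:

```r
set.seed(1)
N  <- 1000            # true population size
K  <- 5               # number of lists / encounter opportunities
b1 <- 1.0; b2 <- 9.0  # illustrative Beta shape parameters; E(p) = b1/(b1+b2) = 0.1

# Model M_h: each individual draws one encounter probability, constant over lists
p <- rbeta(N, b1, b2)
Y <- matrix(rbinom(N * K, size = 1, prob = rep(p, K)), nrow = N, ncol = K)

# Remove the unobservable all-zero histories, as happens in real multiple-list data
Y_obs <- Y[rowSums(Y) > 0, , drop = FALSE]

# Expected fraction observed at least once:
# 1 - E[(1-p)^K] = 1 - Beta(b1, b2 + K) / Beta(b1, b2)
expected_obs <- 1 - beta(b1, b2 + K) / beta(b1, b2)
c(observed = nrow(Y_obs) / N, expected = expected_obs)
```

The temporal and behavioral patterns differ only in how p varies over the columns of Y; the closed-form check in the last lines follows directly from the Beta expectation used above.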

Estimation
The population-size parameter N was estimated from each combination of estimation method, sample replicate, data-generating model and sequentially cumulative sets of K = 2, . . ., 5 lists. The first estimation method was traditional LLM-AIC estimation as implemented in the Rcapture package [53] for R [54]. This traditional application of model selection to multiple-list population-size estimation ignores model uncertainty. The Rcapture package is comprehensive, and was used only to implement estimation of models M b, M t, M h, M bh and M th. Models M bt and M bth are not loglinear; the latter cannot be fitted using Rcapture, and estimation of the former is unstable, so neither was used in this study. Fitted models were compared using AIC, and the model having the smallest AIC was selected for estimation of N. The Rcapture package enables use of alternative heterogeneity corrections in models M h, M bh and M th. Use of more than one heterogeneity correction is problematic because estimates can vary substantially among the correction methods and yet share a common AIC, leaving the analyst without any objective basis for choice. Estimates from the "Poisson2" heterogeneity correction for models M h, M bh and M th were used for comparison in this simulation study, per the demonstration of superiority in S1 Table. The second method was LLM-BMA [48], as implemented in the dga R package [55], which accounts for model uncertainty. The dga package is currently limited to 3-5-list sampling. The set of feasible estimation models is a large superset of our data-generating models, and increases geometrically in size with the number of lists included in the estimation. Each feasible model is fitted, and a posterior model probability is computed for each. The final PSE is the probability-weighted average of the model-specific estimates.
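A minimal sketch of the LLM-AIC procedure with the Rcapture package follows; Y_obs denotes a 0/1 matrix of observed encounter histories (one row per observed individual), and the output column names follow the Rcapture documentation, which may vary across package versions:

```r
library(Rcapture)

# Fit the suite of closed-population loglinear models (M0, Mt, Mh, Mth, Mb, Mbh,
# with several heterogeneity corrections) to the encounter-history matrix.
fit <- closedp.t(Y_obs)
fit$results               # per model: abundance, stderr, deviance, df, AIC, BIC

# LLM-AIC: select the AIC-best model and take its abundance estimate
best  <- which.min(fit$results[, "AIC"])
N_hat <- fit$results[best, "abundance"]

# closedpCI.t() provides profile-likelihood confidence intervals for a chosen
# model and heterogeneity correction (see the package documentation for arguments).
```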
The prior maximum number of unobserved population members was set to 10N, based on the premise that the true size of KPs might be known within an order of magnitude. The hyperparameter for the hyper-Dirichlet prior on list intersection probabilities was set to 2 −K , where K = 3, . . ., 5 denotes the number of lists included in the estimation, as recommended by the package authors. A brief sensitivity analysis of the prior specification is presented in S1 Text. Estimation of N is based on Laplace approximation, which nonetheless becomes computationally time-consuming with increasing K because of the large number of feasible models.
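LLM-BMA with the dga package can be sketched roughly as below for three lists, following the documented bma.cr() interface. The counts are hypothetical, and the cell-ordering convention for the 2 × 2 × 2 table should be checked against the package documentation:

```r
library(dga)
data(graphs3)   # decomposable-graph set shipped with dga for 3 lists

# Y: 2x2x2 table of counts of the 8 encounter histories for 3 lists;
# the all-zero cell (unobserved individuals) is unknown and left at 0.
Y <- array(0, dim = c(2, 2, 2))
Y[2,1,1] <- 45; Y[1,2,1] <- 60; Y[1,1,2] <- 55   # hypothetical: one list only
Y[2,2,1] <- 12; Y[2,1,2] <- 10; Y[1,2,2] <- 14   # hypothetical: two lists
Y[2,2,2] <- 4                                    # hypothetical: all three lists

n_obs    <- sum(Y)             # number of distinct observed individuals
Nmissing <- 0:(10 * n_obs)     # support for the unobserved count (prior upper bound)
delta    <- 1 / 2^3            # hyper-Dirichlet hyperparameter 2^-K, here K = 3

# Posterior weights over (decomposable model, number unobserved)
weights <- bma.cr(Y, Nmissing, delta, graphs3)

# Model-averaged posterior over N = n_obs + Nmissing, and its posterior mean
post_N <- colSums(weights)
N_hat  <- sum((n_obs + Nmissing) * post_N)
```

The prior upper bound enters only through the range of Nmissing, which is how the 10N constraint described above was imposed.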
Last, population size was estimated using Bayesian nonparametric latent-class modeling [49], (BLCM) as implemented in the LCMCR R package. The value for the maximum number of latent classes was set to 10. The prior distribution for the vector of latent-class probabilities is a stick-breaking formulation of a Dirichlet process prior having parameter α. That prior concentrates the probability mass on the first few latent classes to avoid overfitting. The hyperprior for α is a Gamma distribution having parameters a and b, which were both set to 0.25 to provide a reasonably vague specification for the simulations [49]. A brief sensitivity analysis of the prior specification is presented in S1 Text. Estimation is based on Markov Chain Monte Carlo (MCMC) simulation. Based on a preliminary analysis, pre-convergence "burn-in" samples of 500,000 iterations were discarded and the posterior sample consisted of an additional 50,000 iterations out of 5,000,000 after thinning by 100 to reduce autocorrelation. In practice, far fewer burn-in iterations are typically required. The numbers chosen here assured convergence and stable estimation of posterior quantiles with small Monte Carlo error.
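BLCM estimation with the LCMCR package can be sketched as follows, per the package documentation; the data must be a data frame of factors with levels "0" and "1", and the chain lengths here are far shorter than those used in the study:

```r
library(LCMCR)

# Encounter histories as a data frame of 0/1 factors, one column per list
df <- as.data.frame(lapply(as.data.frame(Y_obs),
                           function(x) factor(x, levels = c(0, 1))))

# Sampler with at most 10 latent classes and a Gamma(0.25, 0.25) hyperprior on alpha
sampler <- lcmCR(captures = df, K = 10, a_alpha = 0.25, b_alpha = 0.25)

# Posterior sample of N after burn-in, thinned to reduce autocorrelation
N_post <- lcmCR_PostSampl(sampler, burnin = 10000, samples = 500, thinning = 100)
quantile(N_post, c(0.025, 0.5, 0.975))   # median and 95% credible interval
```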
The resulting LLM-AIC, LLM-BMA and BLCM estimates N̂_i were compared using estimated root mean-squared error, RMSE-hat = sqrt((1/400) Σ_i (N̂_i − N)^2), bias (= E[N̂_i] − N) and the estimated coverage probabilities of uncertainty intervals (95% profile-likelihood confidence intervals for LLM-AIC, and Pr = 0.95 credible intervals for LLM-BMA and BLCM). Mean-squared error is the sum of sampling variance and squared bias, and is an omnibus measure of accuracy and precision of estimation. LLM-AIC, LLM-BMA and BLCM estimates were compared over the aggregated set of data-generating models in order to assess estimation of real populations, for which the underlying data-generating processes are never known. Finally, the ability of LLM-AIC to correctly match underlying data-generating models M h, M t, M b, M bh and M th was evaluated to illustrate a consequence of unidentified parameters. All computations were performed using R 4.0.3 [54]. R code and population-size estimates are provided in S1 File.
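The comparison metrics reduce to a few lines of R; in this sketch, N_hat, lower and upper are hypothetical vectors over the 400 replicates and N_true is the known population size:

```r
# N_true: known population size; N_hat: estimates from the 400 replicate samples;
# lower, upper: corresponding uncertainty-interval endpoints.
rmse_hat <- sqrt(mean((N_hat - N_true)^2))           # estimated root mean-squared error
bias_hat <- mean(N_hat) - N_true                     # estimated bias
coverage <- mean(lower <= N_true & N_true <= upper)  # empirical interval coverage

# Decomposition noted in the text: mean-squared error is (approximately)
# the sampling variance plus the squared bias: var(N_hat) + bias_hat^2
```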

Comparative performance of LLM-AIC, LLM-BMA and BLCM estimation
Population-size estimates from all methods exhibited at least some evidence of multiple modes across expected encounter probabilities and numbers of encounter events over the mix of data-generating models (Fig 2). The LLM-AIC estimates exhibited the largest ranges, usually spanning more than seven orders of magnitude. The distributions of LLM-AIC estimates were reasonably compact for estimating populations of 1,000 only where the per-event expected encounter probability was 0.2 over five sampling events. LLM-BMA and BLCM modeling performed nearly equally, but BLCM estimation produced distributions having longer lower tails. LLM-BMA and BLCM estimation outperformed LLM-AIC estimation in terms of both root mean-squared error (RMSE) and bias (Table 2). The estimated RMSEs and bias of the LLM-AIC estimates were effectively infinite for all combinations of population size and expected encounter probability when estimating from two lists, and estimates sometimes exceeded 10^19, which is a manifestation of unidentified parameters. LLM-AIC estimation became moderately reliable in terms of RMSE and bias from three-event sampling only where the expected per-event encounter probabilities were at least 0.150. In contrast, RMSEs and bias from both BLCM and LLM-BMA estimation indicated that those methods produced estimates within the correct order of magnitude across all expected encounter probabilities and three or more sampling events, and BLCM estimation produced similarly reasonable estimates from two sampling events. Uncertainty intervals from LLM-AIC and BLCM estimation almost always failed to achieve nominal coverage, and produced intervals which were too narrow (Table 3). In contrast, the credible intervals from LLM-BMA estimation tended to be too wide, with coverage probabilities frequently larger than 0.98.
Relative RMSE was largely constant across true population sizes (Fig 3), indicating that the results of this study apply at least over the range of simulated population sizes. Relative RMSE was also independent of the expected encounter probabilities.

Ability of LLM-AIC to match the data-generating models
Loglinear model selection offers the hope of inferring the type of variation in encounter probabilities. For example, if the AIC-best fitting model happens to be M h, then one might hope to conclude that encounter probabilities differed among individuals, but not over time, and that there was no behavioral effect. That would be more than one should hope for, because the parameter N is almost always unidentified. The simulation results provide a concrete illustration of the consequences of unidentifiability. No more than 8.1% of the replicate data sets generated by M th and no more than 24.8% of the replicates generated by M bh were correctly identified by the AIC-best model in populations of size 1,000 and 20,000 (Table 4). Correct matching of data generated by M b and M h to the corresponding models increased with increasing expected detection probability and, less distinctly, with the number of sampling lists. Matchings by data-generating model are shown in S2 Table.

Discussion
PSE is inherently challenging, especially for KPs affected by discrimination, prosecution and stigma. Unlike inanimate or non-human population members, people can refuse contact and acceptance or disclosure of marks. For key populations those marks are typically inexpensive small gifts or membership on some service list. It is unreasonable to expect that all people will share a common and constant propensity to seek services from a particular entity, to accept interpersonal contact and marks, and to disclose prior receipt of a mark. Therefore M h may be the simplest plausible form of inhomogeneity among KPs, and more complex forms than those used in these simulations may be in play. For example, some KP members might increase their encounter probability after the first contact if they find the gift marks attractive while others might decrease their subsequent encounter probabilities, leading to distribution mixtures of different behavioral effects. A priori, the analyst confronting PSE has no knowledge of patterns of variation in encounter probabilities. Worse, the lack of identifiability of model parameters [44,45] precludes the possibility of reliable inference about the form of inhomogeneity, as is clearly illustrated by the results of this study. Therefore the analyst can never be confident that any model matches the underlying data-generating process. Model uncertainty is especially problematic where the estimates differ substantially, which is often the case. The only practical recourse is to use estimation methods which are robust to model uncertainty and inhomogeneities in encounter probabilities.
LLM-BMA and BLCM estimation demonstrated considerable robustness. Both generally outperformed LLM-AIC in terms of sample RMSE and bias, except where per-list encounter probabilities were at least 0.1. Two-list LLM-AIC estimation-which has been commonplace for PSE-was unreliable across all three population sizes and all expected encounter probabilities. LLM-BMA and BLCM estimates were generally comparable and never produced effectively infinite RMSEs. RMSEs decreased with increasing numbers of lists across all three methods, which should be unsurprising given that the observed fraction of a population increases with the number of lists. All three PSE methods failed to achieve the nominal 95% coverage for uncertainty intervals in these simulations. LLM-AIC and BLCM estimation produced intervals with substantially less than the nominal coverage, while LLM-BMA estimation, as implemented in the dga package for R, produced highly conservative intervals. Overall, LLM-BMA estimation tended to produce the best balance of accuracy and interval coverage in this study.
The dga package for R is convenient for loglinear model averaging, but other options are available with greater effort. For example, frequentist model averaging has been proposed [56], but was not considered here because it requires custom coding by the analyst. Likely more important, frequentist model averaging lacks the theoretical grounding of BMA and does not exploit prior information on N, so that practically infinite estimates are not precluded. In practice, some upper bound of convenience on N is always known. For example, the number of FSW cannot be larger than the female population, and the number of MSM is highly unlikely to be more than 10% of the male population in most settings [57,58]. Therefore the ability to constrain the upper bound on N in the prior for Bayesian model averaging, as implemented in the dga package, is an advantage.
The limitations of this study arise from reliance on Monte Carlo simulation, which provides weaker conclusions than formal mathematical proof. However, simulation is the only practical way to compare estimates with known population sizes. Monte Carlo simulation relies on machine-generated pseudo-random numbers, and therefore results will vary slightly across different streams of random numbers. All results from this simulation study are conditional on the choice of data-generating models, and also on control and prior parameters for the estimation models. The choice of data-generating models was broad and representative of commonly expected patterns of variation in encounter probability, but was not exhaustive. Results may differ for other data-generating models, other control and prior parameters for estimation, and other true population sizes and numbers of lists. Still, this simulation study provides the first cross-cutting comparison of the performance of distinctly different PSE methods, and provides an objective basis for choice among those methods.

Conclusion
The results of this simulation study strongly suggest that some form of comprehensive model averaging or latent-class modeling should be the default choice for PSE, and that estimation should be based on data from at least three encounter events or lists. The two Bayesian approaches, LLM-BMA and BLCM, were more robust than LLM-AIC. LLM-BMA, as implemented in the freely available dga R package, is particularly appealing because the analyst will almost always have some prior information on population size. Although none of the methods produced uncertainty intervals that achieved nominal coverage, the conservative intervals produced by LLM-BMA came closest in these simulations.
All of the estimation methods compared in this study are implemented in freely available R packages. However, they are also easily accessible to those unfamiliar with R via the web-based Multiple Source Recapture application at https://www.epiapps.com/.
Supporting information

S1 Text. Supplemental methods and results. A PDF file containing the derivation of the probability of first encounters in successive lists and brief sensitivity analyses of the prior distributions for LLM-BMA and BLCM estimation. (PDF)

S1 Table. Comparison of LLM-AIC heterogeneity corrections. Sample sizes and estimated root mean-squared errors (RMSE-hat) from Poisson2, Darroch and Gamma3.5 heterogeneity corrections in loglinear estimation models M h and M th for estimation of population size N from five lists generated from models M h, M bh, M th, and M bth. (PDF)

S2 Table. Frequencies (percentages) with which individual data-generating models were matched to various AIC-best estimation models. The column labels are self-explanatory. This table is an expansion of Table 3 showing matchings by each data-generating model. In practice, the analyst would not know the data-generating model. (PDF)

S1 File. R code and population-size estimates. A zip archive containing the R code used to generate the random samples from the simulated populations and to estimate population size from those samples. The population-size estimates are also included because obtaining those from the samples is computationally burdensome. (ZIP)