Choosing a Cluster Sampling Design for Lot Quality Assurance Sampling Surveys

Lot quality assurance sampling (LQAS) surveys are commonly used for monitoring and evaluation in resource-limited settings. Recently several methods have been proposed to combine LQAS with cluster sampling for more timely and cost-effective data collection. For some of these methods, the standard binomial model can be used for constructing decision rules as the clustering can be ignored. For other designs, considered here, clustering is accommodated in the design phase. In this paper, we compare these latter cluster LQAS methodologies and provide recommendations for choosing a cluster LQAS design. We compare technical differences in the three methods and determine situations in which the choice of method results in a substantively different design. We consider two different aspects of the methods: the distributional assumptions and the clustering parameterization. Further, we provide software tools for implementing each method and clarify misconceptions about these designs in the literature. We illustrate the differences in these methods using vaccination and nutrition cluster LQAS surveys as example designs. The cluster methods are not sensitive to the distributional assumptions but can result in substantially different designs (sample sizes) depending on the clustering parameterization. However, none of the clustering parameterizations used in the existing methods appears to be consistent with the observed data, and, consequently, choice between the cluster LQAS methods is not straightforward. Further research should attempt to characterize clustering patterns in specific applications and provide suggestions for best-practice cluster LQAS designs on a setting-specific basis.


Introduction
When designing global health monitoring and evaluation (M&E) surveys in resource limited settings, the key objective is to balance cost, precision, and feasibility. Lot quality assurance sampling (LQAS) surveys were originally proposed for industrial quality control [1] and have more recently been applied in global health M&E for classification of health indicators [2]. In LQAS surveys, samples of size n are collected from supervision areas (SAs), and the number of successes X is compared to a decision rule d to classify an SA as acceptable or unacceptable [3]. Because LQAS is a classification procedure, LQAS surveys typically require smaller sample sizes than estimation-based surveys. Further, the simplicity of classification facilitates local-level data analysis, reducing the time between data collection and data-driven actions.
LQAS designs are typically determined by four parameters p l , p u , α, and β. Specifically, the sample size n and rule d are selected such that the risk of misclassifying an SA as acceptable or unacceptable is small, namely: The risk α is the probability of classifying as low when p = p u , i.e. coverage is truly high; and the risk β is the probability of classifying as high when the health indicator coverage p = p l , i.e. coverage is truly low. The risk α can then be interpreted as the 'service provider risk,' namely the risk of intervening in an area that actually has high coverage; and the risk β interpreted as the 'population risk,' namely the risk that a truly under-performing area is classified as having adequate coverage. It is important to note that the LQAS equations in 1 are often parameterized differently in the literature depending on the application, and the interpretation of the risks changes with the parameterization. The α and β risks are not restricted for SAs with p l < p < p u , i.e. in the "grey region." Choice of the design parameters p l , p u , α, and β has been discussed elsewhere [3][4][5] and is not addressed in this paper. Typically, the probabilities in Eq 1 are calculated by assuming data are collected using simple random sampling (SRS) from the target population and then modeling X * Binomial(n, p). Implementing SRS requires enumerating and randomly sampling individuals from the target population, a challenging feat in geographically disperse and hard to reach populations. Cluster sampling surveys are popular in M&E, where enumerating clusters (often villages) and sampling within clusters is less costly [6]. Commonly used rapid M&E cluster surveys include 30 × 30 emergency nutrition surveys [7,8] and 30 × 7 Expanded Program of Immunization (EPI) surveys [9].
Recently, three methods have been proposed to adjust LQAS sample sizes and decision rules to accommodate cluster sampling. We refer to these methods as the Pezzoli [10][11][12][13], Hedt [14], and Hund [15] methods. In this paper, we contrast these methods and provide recommendations for choosing a cluster LQAS design. First, we review two-stage cluster sampling surveys and describe the three methods for melding cluster sampling with LQAS. We then evaluate these methods using example designs from vaccination and nutrition surveys. Finally, we provide recommendations for choosing a cluster LQAS design based on these results.

Cluster surveys
Global health M&E surveys often implement two-stage cluster sampling, with clusters sampled using probability proportional to size (PPS) and individuals sampled using an approximation to SRS. When clusters are sampled with replacement (or the probability of sampling the same cluster more than once is small), this design is 'self-weighting'. That is, every individual in the target population has an equal probability of being sampled and the data do not have to be weighted during the analysis. Cluster LQAS decision rules rely on use of self-weighting designs, and throughout this paper we assume that data are collected using a self-weighted two-stage cluster sampling design.
When individuals within a cluster are more similar than those between clusters, the data cannot be analyzed as a simple random sample. Consider a population of size N consisting of K total clusters of size M j for cluster j. Define X ij as the health indicator status (1 success, 0 failure) of subject i in cluster j; X j = ∑ i X ij as the number of 'successes' in cluster j; and X = ∑ j X j as the number of 'successes' in the sample. Suppose k clusters are sampled and m individuals are sampled per cluster (n = km) and define p j as the coverage in cluster j. In the presence of clustering, the cluster-level coverages {p j } differ from the overall coverage p; one measure of clustering is the standard deviation of p j across clusters, denoted σ. Another measure of clustering is the intraclass correlation, ρ. For two-stage designs in which cluster coverages {p j } have mean p and standard deviation σ and X j jp j * Binomial(m, p j ), the intraclass correlation is defined as [16]: The intraclass correlation ρ varies between 0 and 1 with higher values denoting more clustering (technically, under alternative models, ρ can be negative when individuals within a cluster are less similar than between clusters, though this setting is typically not relevant to M&E surveys and is not considered here).

Cluster sampling in LQAS
Initially, cluster sampling LQAS surveys were designed to minimize the amount of variance inflation due to clustering without directly inflating the sample size for clustering. For instance, within-cluster sample sizes of m = 1 have been used in LQAS surveys (e.g. [17]; [18]) to achieve the same precision as a SRS survey. In 67 × 3 and 33 × 6 LQAS nutrition surveys, small withincluster sample sizes were selected to minimize cluster sampling variance inflation [19][20][21]. Stroh and Birmingham (2009) use cluster sampling for LQAS neonatal tetanus elimination surveys and ignore the cluster sampling design, noting that cluster sampling approximates SRS 'when an attribute to be measured in a population is rare or is homogeneously distributed' [22]. The common theme between these approaches is that the cluster sampling design can be ignored if the within cluster sample size is small or if clustering is negligible. In these situations, the LQAS sample size and decision rule can be calculated using the standard binomial model with no adjustments required to accommodate the use of cluster sampling. In practice, clustering is often not negligible, and visiting many clusters and selecting few individuals per cluster may be less logistically feasible or cost-effective than sampling more individuals per cluster. To address these limitations, three methods have recently been developed to explicitly accommodate clustering sampling at the design-phase in cluster LQAS surveys: the Pezzoli [10][11][12][13], Hedt [14], and Hund [15] methods. In this paper, we compare these three methods (Hund, Hedt, and Pezzoli) for explicitly accounting for the cluster sampling design and do not discuss designs that ignore the cluster sampling design. We describe these methods below.
Pezzoli method. Pezzoli et. al [10] proposed the first cluster LQAS method, developed to assess vaccination coverage. The mathematical representation of the Pezzoli model described in [11,12] is: We refer to this model as a binomial-scaled model, because p j is modeled as a binomial random variable scaled by η to ensure that p j 2 (0,1). Clustering is incorporated through specifying a value for σ, the standard deviation of the cluster level coverages. To apply the Pezzoli design, η is selected such that σ 2 = p(1 − p)/η. Using the relationship between σ and ρ, it follows that η = 1/ρ. This method is subject to some rounding error, since η is rounded to an integer.
To select designs, the authors choose p = {p l , p u } and σ 2 (0,.1) and determine decision rules via simulation based on the distribution of X = ∑ j X j . There is currently no user-friendly software for implementing the Pezzoli method that allows users to select their own design, though Stata simulation code is available from the authors of [11]. We have written R functions for designing these surveys and provide examples using this code at https://github.com/lbhund/ ClusterLQAS_PLOS.
Hedt method. Hedt et. al [14] propose modeling X j as a beta-binomial random variable with mean p and intraclass correlation ρ: The betabinomial model can also be written as a two-stage model, with X j * Binomial(m, p j ) and p j * Beta(p, ρ) (where the beta distribution is parameterized based on the mean p and intraclass correlation ρ). The beta distribution has support on (0,1), and this betabinomial model is common for clustered binary data. The final decision rule is based on the distribution of X = ∑ j X j . Hedt et. al [14] provide R code for selecting sample sizes and decision rules for this method; the R package in [15] can also be used to calculate sample sizes and decision rules using the Hedt method. Hund method. Hund and Pagano (2014) do not specify a distribution for p j and instead use an overdispersed binomial model for X [15]: where D is the survey design effect allowing variability in X above that of the binomial distribution. The design effect is the ratio of the variances using cluster sampling compared to SRS and is the multiplicative factor by which the sample size should be inflated to achieve the same precision as SRS when using cluster sampling [23]. For the two-stage cluster sample with large K, the design effect is D = 1 + (m − 1)ρ. The Hund method can also be written in terms of the 'effective sample size' as X Ã * Binomial(n Ã , p), where X Ã = X/D and n Ã = n/D is the effective sample size. Thus, the sample size and decision rule can be calculated under SRS and inflated by the design effect to accommodate clustering. The R package in [15] can be used to calculate sample sizes and decision rules for this method. A limitation of the Hund method is that α and β errors are inexact due to rounding error when transforming from the effective sample size scale.

Evaluating the designs
There are two primary differences between the methods above: the distributional assumptions (binomial-scaled, beta, and quasi-binomial models) and clustering parameterization using σ versus ρ (that is, the difference between treating σ or ρ as known and the same for p l and p u ). These differences are summarized in Table 1. We compare the impacts of these differences across the Pezzoli, Hedt, and Hund methods and determine situations in which the choice of method results in a substantively different design. As example designs, we use three couplets for p l and p u : 55-70%; 75-90%; and 90-95%. The first two couplets are from the 6 × 10 twostage cluster designs for vaccination coverage proposed in [12]. The last couplet is consistent with the 5-10% couplet used in the 33 × 6 or 67 × 3 malnutrition LQAS surveys. We parameterize these designs according to Eq 1 and specify X as the number of successes (vaccinated children or children with no evidence of malnutrition) out of n. As noted above, both the Pezzoli and Hund models are subject to some rounding error; we use nearest integer rounding for all calculations. All analyses have been conducted using R 3.1.1 [24], and code for reproducing all figures and tables, along with the R package from [15], is available at https://github. com/lbhund/ClusterLQAS_PLOS.
Comparing the distributional assumptions. First, we conduct a simulation study to evaluate whether the choice of distributional model (binomial-scaled, beta, and quasi-binomial) makes a substantive difference in the classification precision of the cluster LQAS designs. For the 2 p l and p u threshold couplets from [12], we use the 6 × 10 design with decision rules d = 38 and 50; for the 90-95% couplet, we use the 33 × 6 design with d = 13. We fix σ = .1 for all models (we repeat the analysis fixing σ = .05; results are similar and are not presented). That is, we select the clustering parameters at p l and p u such that σ = .1 for all models and thresholds (deviating from the specific methods described above) to isolate the impact of the probability distribution choice. For each p l , p u couplet, we generate 10,000 draws from each of the three distributions fixing m, k, d, and σ = .1 and examine how α and β change with the distributional assumption. For the quasi-binomial and betabinomial models, we calculate ρ using Eq 2. We then repeat the simulation fixing ρ = .1 (rather than fixing σ = .1) and calculate σ for the binomial-scaled model using Eq 2.
Fixing σ versus ρ. Historically, survey sample size calculations rely on the design effect and thus assume ρ is known and constant across levels of coverage p [16,25]. The Hund and Hedt methods also model clustering assuming ρ is known and fixed as a function of p. The Pezzoli method parameterizes clustering differently, assuming σ is fixed as a function of p. Fixing ρ results in less cluster-level variance (smaller σ) as p moves away from .5. Fixing σ results in the same variance of p j for all levels of p (and larger ρ as moves away from .5). Intuitively, variability in p j may decrease as p moves away from .5 (e.g. less variability in the coverage of a rare event), though there is also evidence that ρ decreases as p moves away from .5 [25,26]. It is thus not immediately clear as to whether fixing σ, ρ, or neither is appropriate.
To examine design differences according to the clustering parameterization, we compare the Pezzoli, Hund, and Hedt designs when σ = .1 and when ρ = .1. That is, we first calculate ρ l and ρ u such that σ = .1 at p l and p u , respectively, and compare fixing both ρ l and ρ u for the Hedt and Hund designs to the Pezzoli design with σ = .1. We then repeat this exercise, calculating σ l and σ u such that ρ = .1 at p l and p u , respectively, and compare fixing both σ l and σ u for the Pezzoli design to the Hedt and Hund designs with ρ = .1. We again evaluate the three couplets 55-70%, 75-90%, and 90-95% and specify α = β = .1 and m = 10.
We then examine reported estimates of p, σ, and ρ from [12] and [27] to assess how estimates of ρ and σ vary as a function of estimated vaccination coverage in practice. Before delving into this data example, we note that the standard deviation of the estimatedp j s, s ¼ wherep is the estimated coverage, is an overestimate of σ. Some have criticized the Pezzoli method for using an upper bound of σ = .1 in the designs and cite examples of σ >.1 in practice. Using s as an estimate of σ with small m likely accounts for some of this phenomenon. While s estimates sdðp j Þ, sdðp j Þ is higher than σ, because sdðp j Þ includes additional variability associated with the estimation ofp j . That is, the standard deviation of the estimated cluster coverages is not an accurate estimate of σ. In the Appendix, we derive an estimator for σ 2 , denotedŝ 2 :ŝ From Eq 6, it is clear that using s to estimate σ is only appropriate when m is large (which is seldom the case in cluster LQAS surveys). We estimateŝ using reportedp and standard errors s= ffiffi ffi k p for 20 different 6 × 10 surveys in [12]. Minetti et. al [27] reportp, s, and intraclass correlation estimatesr from 41 different 10 × 15 vaccination coverage surveys. Rather than using Eq 6, we estimateŝ ¼rpð1 ÀpÞ for the data in [27]. For each dataset, we estimate the relationship betweenr andp and betweenŝ andp using a nonparametric loess smooth [28] based on the loess function in R. We use the default parameters in the loess function, using a weighted least squares fit based on 75% of the data for the local smoothing.
For each data example, we then estimate operating characteristic (OC) curves for the 6 × 10 design with d = 50 for the 75-90% couplet using the binomial-scaled distribution but varying the values of ρ and σ. First, we assume ρ is fixed at the value corresponding to the mean ofr across all SAs; next, we allow ρ to vary as a function of p, estimating ρ from the nonparametric loess smooth forr. We then repeat for σ, resulting in 4 OC curves for each data example.

Distributional assumptions
In Fig 1, we contrast the shapes of the beta and binomial-scaled distributions for p j (the Hund method does not impose a distribution on p j ). We plot the shape of these distributions with σ = .1 and p = .5, .7 and .9. When p = .5, these distributions are symmetric and similar. When p 6 ¼.5, both distributions are skewed, with the beta distribution more skewed than the binomial. Additionally, as p ! 1 and as σ increases, the binomial distribution becomes more discrete. When p = .9 and σ = .1, η = 9 and p j can only take on 10 distinct values: (0,1/9,. . .,8/9,1).
While Fig 1 highlights important differences in beta and binomial-scaled distributions, our simulations suggest that the distribution choice does not have a large impact on the α and β risk levels (Table 2). Specifically, when the clustering levels are fixed (to either σ = .1 or ρ = .1), the differences in risks between the binomial-scaled and beta model are negligible for these designs, even with the relatively high levels of clustering specified. The quasi-binomial model generally results in similar α and β errors to the other models, though differences occur due to rounding error in small sample sizes. Results of these simulations thus suggest that specification of the distribution for p j does not markedly impact the α and β errors.

Clustering parameterization
In Tables 3 and 4, designs are compared based on the clustering parameterization (fixed σ versus fixed ρ). For the 55-70% and 75-90% couplets, the designs are similar regardless of how clustering is parameterized. For the 90-95% couplet, the sample size is much larger in the Hedt and Hund methods with fixed ρ u versus fixed ρ l ; similarly, the sample size is much larger in the Pezzoli method with fixed σ l versus fixed σ u . Recall that the examples in Tables 3 and 4 assume that m, the sample size within clusters, is fixed and increments the overall sample size by adding additional clusters. As noted above, clustering variance inflation increases as the withincluster sample size increases. Subsequently, discrepancies between the three methods are more pronounced when the number of sampled clusters k is fixed, rather than m. As an example, if we were to fix the number of clusters to k = 6 rather than fixing the within-cluster sample size m = 10, the total sample size n would range between 72 and 150 when σ = .1 (analogous to the 70-90 sample size range in Table 3 with m fixed).
Eq 2 helps explain the sensitivity of the sample size to the clustering parameterization when p l and p u are near 0 or 1. The parameters ρ and σ will be similar for both p l and p u when p l (1 − p l ) % p u (1 − p u ). Hence, ρ l and ρ u are negligibly different when σ = .1 for the 55-70% couplet, whereas ρ u is nearly double ρ l for the 75-90% and 90-95% couplets. To illustrate this difference Table 2. Comparing distributional assumptions between the methods when σ = .1 and ρ = .1. p l = .55, p u = .7 p l = .75, p u = .9 p l = .9, p u = .95 α β α β α β for the 75-90% couplet, we plot beta distributions fixing σ = .1, fixing ρ l , and fixing ρ u in Fig 2. Unlike the distributional assumptions, choice of clustering parameterization can substantially impact the final sample size for the survey when p l and p u are near 0 or 1. Next, we examine whether existing data support one of the two proposed parameterizations. In Figs 3 and 4,ŝ andr are plotted as a function ofp for the data in [12] and [27], respectively. Bothr andŝ decrease as a function of estimated vaccination coverage (less clustering occurs in more highly vaccinated areas), consistent with [25] and [26]. Fixing either σ or ρ likely results in some level of misspecification of α and β. The OC curves corresponding to the different clustering specifications are also shown in Figs 3 and 4. In the example from [12], the α and β risks are substantially higher when the mean versus the lowess smooth estimate of σ and ρ are used. This inflation of the risks occurs because σ and ρ decrease as a function of p but the average coverage across SAs is lower than 75%. The risks are not inflated in the example from [27] because the range of vaccination coverages is similar to the 75-90% couplet thresholds.

Discussion
After reviewing three methods for implementing cluster sampling in LQAS surveys, we conclude that the cluster LQAS methods result in similar designs when p l and p u are far from 0 or 1 or when clustering is minimal. In these situations, the user should not be concerned about which of the cluster methods to use, as the methods should all result in similar designs. Our simulation study suggests that the choice of distributional assumptions also does not appear to substantively impact the survey design.
Choice of the clustering parameterization is the major factor that can impact the final cluster LQAS survey design. Specifically, if p l or p u is near 0 or 1, the design (sample size) changes substantially depending on the clustering parameterization, namely whether ρ or σ is fixed as a function of p. That is, when p l .1 and p u !.9, clustering parameters should be selected carefully to avoid severe misspecification of classification risks. As an example, assume p l , p u >.5. When σ and ρ result in the same amount of clustering for p = p l (that is, σ = ρp l (1 − p l )), the    [12], with loess smooth overlayed. Bottom panel: Solid line design with fixed σ will require a larger n; for p = p u , the design with fixed ρ will require a larger n. Differences in n between these options can be substantial when p u is close to 1 or p u −p l is large.
We did not find evidence that ρ is constant as a function of p (assumed by Hedt and Hund) nor that σ is constant as a function of p (assumed by Pezzoli). In two vaccination coverage examples, estimates of both ρ and σ decrease as coverage increases, consistent with [25,26]. Using historical data to plot clustering estimatesŝ andr as a function of estimated coveragep can help evaluate the clustering parameterization assumptions. Our data analysis suggests that, if clustering design parameters (σ or ρ) are determined from historical data, only estimates ofŝ andr from SAs with estimated coverages near the range of p l and p u should be incorporated. To facilitate incorporation of historical data, we provide R functions that estimate σ using summary statistics commonly available in the literature. We strongly encourage cluster LQAS users to publish their estimates of these summary statistics (p, s, andr) for each SA to provide more historical data for specifying these design parameters in the future.
When the Hund or Hedt methods are used, we propose the following steps to choose a design: 1. Define p m as whichever of p l or p u is closer to .5. Estimate ρ using existing data or existing literature using SAs with coverage estimates near p m . If data are not available, consider how much variability is expected around p m to select ρ.
2. Determine the cost of sampling a cluster and sampling each individual per cluster.
3. Iterate through choices of m and k to minimize cost for α, β, p l , p u , and ρ.
The same steps could be applied to the Pezzoli method, replacing ρ with σ. The R package accompanying this paper can help achieve Steps 1 and 3. Additionally, users can examine the sensitivity of their designs to the selected model (as in Tables 3 and 4) by calculating different possible designs with varied clustering parameterizations (after selecting the relevant values of p l , p u , α and β) using the R package. Knowledge of the design's sensitivity to the clustering parameterization contributes to understanding the robustness of the design.
In Step 1 above, determining ρ or σ based on p m has two advantages: (1) σ and ρ are likely to decrease as p moves away from .5, so this clustering estimate is conservative (note that, using this algorithm, fixing σ is more conservative than fixing ρ); and (2) visualizing a highly skewed distribution in terms of a mean and standard deviation is not straightforward, and the skewness of the beta and binomial-scaled distributions increases as p moves away from .5. Returning to Fig 1, the beta and binomial-scaled distributions with mean p = .5 are symmetric, and one can easily visually check that σ = .1 by examining the spread of the distribution. With mean p = .9, there is no way to visually inspect these highly skewed distributions and conclude that σ = .1.
If the clustering parameters are underestimated in practice, the survey design will not have the desired precision; that is, the true α and β risks will be higher than those specified by design. One way to protect against this situation is to employ a sequential design [22,29]. In a sequential design, the data are collected in multiple stages. In theory, the clustering parameters could be estimated as the data are collected and the design could then be changed to accommodate the observed clustering values. However, the variance of clustering parameters is high in small samples. Further, sequential designs are logistically more challenging. Future research should evaluate such sequential designs in simulation and in practice. Users should keep in mind that overestimating clustering parameters at the design phase is always more conservative and adds to the cost of the survey.
In practice, to select an LQAS survey design, the user first must decide whether SRS is feasible and practical. If SRS is used, survey design can proceed using a standard binomial model and the simple steps described in the Introduction. If SRS is not used, two-stage cluster sampling is a commonly used practical alternative. In some situations, the cluster sampling design can be ignored when small within-cluster sample sizes are used [17,[19][20][21] or when clustering is known to be negligible [22]. Survey design is much simpler when the cluster sampling design is ignored, because the user can act as though the data are a simple random sample and use standard LQAS procedures for design; that is, the user does not need to specify a clustering model or specify the clustering amount (i.e. ρ or σ). When the cluster sampling design cannot be ignored, choice between the cluster LQAS methods (Pezzoli, Hedt, and Hund) is not straightforward. Historically, ρ, not σ, is fixed as a function of p in design calculations [16,25]. Uniting the methodology for LQAS surveys with other M&E estimation surveys may help de-mystify LQAS. The binomial-scaled model is not used in the statistics or survey sampling literature, possibly due to the discrete nature of this distribution. However, based on the empirical calculations in this article, choice between the Pezzoli, Hedt, and Hund methods should depend more upon the anticipated clustering patterns in the study population than the distributional specification (though prior information on clustering is likely quite limited). Hence, further research should attempt to characterize clustering patterns in specific applications and provide suggestions for best-practice cluster LQAS designs on a setting-specific basis. We only considered data examples in the context of vaccination coverage cluster LQAS surveys in this paper, but the validity of different clustering parameterizatons should be evaluated in other contexts, for instance nutrition.
Cluster sampling in LQAS has also been recently discussed in the context of Large Country LQAS (LC-LQAS) [18,30,31]. These LC-LQAS surveys select SAs using cluster sampling with the goals of estimating prevalence over a large geographic region and classifying indicators at the SA-level. The LC-LQAS surveys are distinct from cluster LQAS surveys discussed in this paper which sample clusters within SAs, rather than sampling SAs. Future work could explore accommodating uncertainty in ρ or σ at the design phase. In the data examples in this paper, both ρ and σ vary with p, complicating selection of a cluster LQAS method. All three methods could be modified to allow σ or ρ to vary as a function of p, though this modification would introduce an additional design parameter into the LQAS procedure. Future work should investigate the merits of considering designs with different clustering parameters corresponding to p l and p u or exploring different clustering parameterizations [25]. Additionally, all of the cluster surveys discussed in this paper rely on using a self-weighting sampling design. Future work should explore how to extend LQAS designs for unequal probability weights.