Two-stage lot quality assurance sampling framework for monitoring and evaluation of neglected tropical diseases, allowing for imperfect diagnostics and spatial heterogeneity

Background Monitoring and evaluation (M&E) is a key component of large-scale neglected tropical diseases (NTD) control programs. Diagnostic tests deployed in these M&E surveys are often imperfect, and it remains unclear how this affects population-based program decision-making. Methodology We developed a 2-stage lot quality assurance sampling (LQAS) framework for decision-making that allows for both imperfect diagnostics and spatial heterogeneity of infections. We applied the framework to M&E of soil-transmitted helminth control programs as a case study. For this, we explored the impact of diagnostic performance (sensitivity and specificity), spatial heterogeneity (intra-cluster correlation), and survey design on program decision-making around the prevalence decision thresholds recommended by WHO (2%, 10%, 20% and 50%) and on the associated total survey costs. Principal findings The survey design currently recommended by WHO (5 clusters and 50 subjects per cluster) may lead to incorrect program decisions around the 2% and 10% prevalence thresholds, even when perfect diagnostic tests are deployed. To reduce the risk of incorrect decisions around the 2% prevalence threshold, including more clusters (≥10) and deploying highly specific diagnostic methods (≥98%) are the most cost-saving strategies when spatial heterogeneity is moderate-to-high (intra-cluster correlation >0.017). The higher cost and lower throughput of improved diagnostic tests are compensated by the lower required sample sizes, though only when the cost per test is <6.50 US$ and the sample throughput is ≥3 per hour. Conclusion/Significance Our framework provides a means to assess and update M&E guidelines and to guide product development choices for NTD. Using soil-transmitted helminths as a case study, we show that current M&E guidelines may severely fall short, particularly in low-endemic and post-control settings. Furthermore, specificity rather than sensitivity is a critical parameter to consider.
When the geographical distribution of an NTD within a district is highly heterogeneous, sampling more clusters (≥10) may be required.


Introduction
Neglected tropical diseases (NTD) are a diverse group of 20 parasitic, bacterial, and viral diseases, several of which are zoonotic, food-borne or vector-borne in nature [1][2][3]. Worldwide, they affect more than 1.5 billion people, but disproportionately impact the most impoverished communities in tropical countries [3], particularly those on the African continent [1]. Enormous progress has been made so far in controlling NTD, yet significant challenges remain, such as the lack of improved diagnostic tests, new interventions, and monitoring and evaluation (M&E) strategies. In response, WHO has recently published its new road map, aiming (i) to reduce the number of people requiring interventions against NTDs by 90%, (ii) to reduce the global disease burden by 75%, (iii) to eliminate at least one NTD in 100 countries, and (iv) to eradicate two NTD by the end of 2030 [4].
To reach these ambitious 2030 targets, periodic follow-up surveys to measure progress and determine whether scaling down or stopping interventions is justified (so-called M&E) are an important aspect of these large-scale NTD programs. To track progress towards achieving the 2030 targets for NTD, WHO developed and published M&E guidelines for 15 of the 20 NTD [4]. Generally, these M&E guidelines involve recommendations for (i) which diagnostic test to use, (ii) which survey design to use (e.g., number of subjects and number of clusters), and (iii) the corresponding decision rules for continuing or reducing the frequency (or intensity) of interventions. However, the diagnostic tests deployed in these M&E surveys are often imperfect, and it remains unclear how this affects decision-making at the population level. For example, Gass (2020) recently urged a paradigm shift from sensitivity to specificity and a holistic approach to developing WHO M&E guidelines for population-based interventions, aligning survey designs and decision rules with the deployed diagnostic test [5].
In a first attempt to gain insight into this complex interplay between diagnostic performance, survey designs and decision rules, we previously developed a general framework based on a 1-stage lot quality assurance sampling (LQAS) approach that considers the impact of diagnostic performance at the individual level (sensitivity and specificity), survey design (e.g., number of subjects), and decision rules on the correctness of program decision-making at the population level [6]. We found that specificity rather than sensitivity becomes more important as a program approaches the endgame. In addition, the study found that the requirements for both parameters are inversely correlated, resulting in multiple combinations of sensitivity and specificity that allow for reliable decision-making. The study further highlighted that improving diagnostic performance results in smaller sample sizes for the same quality of program decision-making. Thus, the additional cost per test of a diagnostic with improved performance can be compensated by lower operational costs in the field. These findings have been instrumental in defining the required performance of diagnostic tests deployed in M&E of control programs targeting soil-transmitted helminthiasis [6]. However, an important limitation of this framework is the assumption that all tested subjects originate from the same cluster (e.g., community/school). As such, spatial heterogeneity of infections across clusters was ignored. Also, a detailed cost assessment was not included (e.g., operational costs to collect and screen samples, cost per test, and sample throughput). This is important because improved diagnostic tests might allow decisions to be based on fewer samples, which would compensate for the increased cost per test and reduced sample throughput.
In the present study, we developed a 2-stage LQAS framework for decision-making that allows for both imperfect diagnostics and spatial heterogeneity of infections. We then applied the framework using M&E of soil-transmitted helminth (STH) control programs as a case study. For this, we explored the impact of diagnostic performance (sensitivity and specificity), spatial heterogeneity (intra-cluster correlation), and survey design (number of clusters and number of subjects per cluster) on the correctness of program decision-making based on WHO-recommended thresholds, as well as the associated total survey costs. In addition, we evaluated to what extent an increased cost per test and a reduced sample throughput might be compensated by lower sample size requirements when using improved diagnostic tests.

Development of a 2-stage LQAS framework for population-based decision-making using imperfect diagnostic tests
General concepts of the framework. The 2-stage LQAS framework for population-based decision-making using imperfect diagnostic tests consists of four steps (Fig 1). In the first step, n clust clusters (e.g., schools or communities) are randomly selected from a given implementation unit i (e.g., a geographical area where preventive chemotherapy (PC) is administered). In the second step, n sub subjects are randomly selected within each cluster j. In the third step, all n tot (= n clust × n sub ) subjects are screened using an imperfect diagnostic test D that has a sensitivity se d and a specificity sp d . In the fourth and final step, the program decision for unit i is made based on the number of positive test results (X i+ ). The frequency (or intensity) of the intervention will be reduced (e.g., scaled down from 6-monthly to annual PC) if X i+ is less than some decision cut-off c. In contrast, when X i+ ≥ c, the frequency or intensity of the intervention remains unchanged or will even be increased. Note that X i+ includes both true and false positive test results. The framework also allows for spatial heterogeneity through the so-called intra-cluster correlation ρ i . This is a measure of the extent to which positive test results are clustered within clusters of an implementation unit i, and as such, of how prevalences vary between clusters. The intra-cluster correlation can take values between zero and one, with higher values representing more clustering of positive test results (i.e., higher geographical heterogeneity in infection levels).
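The four steps above can be sketched as a short simulation. This is an illustrative sketch using the paper's toy parameters (se d = 80%, sp d = 98%, ρ i = 0.02, c = 90); the function and variable names are our own, and the beta parameterisation of between-cluster variation follows the model described below.

```python
import random

def simulate_survey(n_clust, n_sub, pi_i, rho_i, se_d, sp_d, rng):
    """One realisation of the 2-stage survey: draw a true prevalence
    per sampled cluster (step 1), then screen n_sub subjects per
    cluster with an imperfect test (steps 2-3); returns X_i+."""
    total = 1 / rho_i - 1                     # alpha_i + beta_i
    alpha, beta = pi_i * total, (1 - pi_i) * total
    x_pos = 0
    for _ in range(n_clust):
        pi_ij = rng.betavariate(alpha, beta)  # cluster-level true prevalence
        # probability of a positive test result (true + false positives)
        p_pos = se_d * pi_ij + (1 - sp_d) * (1 - pi_ij)
        x_pos += sum(rng.random() < p_pos for _ in range(n_sub))
    return x_pos

# step 4: program decision based on the decision cut-off c
rng = random.Random(42)
x = simulate_survey(n_clust=5, n_sub=50, pi_i=0.5, rho_i=0.02,
                    se_d=0.80, sp_d=0.98, rng=rng)
decision = "reduce intervention" if x < 90 else "continue/intensify"
```

For π i = 50%, a single draw of X i+ typically falls near 127 out of 250, well above the cut-off c = 90, so the intervention would usually be continued.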
Ultimately, we wish to determine the survey design (n clust and n sub ) and the decision cut-off c that allow for an acceptable risk of incorrect program decisions when an imperfect diagnostic test D is deployed in a unit i with intra-cluster correlation ρ i . In other words, we should aim to minimize the probability that interventions are implemented at a frequency or intensity higher than officially necessary based on the true underlying prevalence (i.e., when the true prevalence π i is under the program decision threshold T), as this would waste both time and resources. Simultaneously, we should aim to minimize the probability that interventions are scaled down prematurely (i.e., when π i ≥ T), which would lead to a preventable increase in infection and morbidity. Here, we assume that the target prevalence threshold T is defined in terms of the true prevalence π i (not as measured by some imperfect diagnostic test). We further assume that the value of T is appropriate for scaling interventions up or down, which may [7] or may not be the case [8][9][10]. Fig 2A illustrates the program decision-making process based on LQAS, using a toy example. It describes the probability of continuing or scaling up the frequency or intensity of the intervention as a function of the true underlying prevalence π i at the implementation unit level. For this, we assumed that a theoretical diagnostic method D was deployed (sensitivity se d = 80% and specificity sp d = 98%) to screen n clust = 5 clusters and n sub = 50 subjects per cluster. For illustrative purposes, the LQAS decision cut-off was set at c = 90 (in a sample of n tot = n clust × n sub = 250).
Given a program decision threshold T = 50% (vertical line in Fig 2A), we can now deduce the error probability of unnecessarily continuing or upscaling the intervention at a frequency or intensity greater than needed (ε overtreat ) when the true prevalence π i < T, and the error probability of prematurely reducing the frequency or intensity of interventions (ε undertreat ) when π i ≥ T. These error probabilities can be considered community-level analogues of one minus the negative predictive value and one minus the positive predictive value [5], as used in recent NTD modelling exercises [8,11]. In Fig 2B, we deduce the decision cut-off c for which ε overtreat and ε undertreat do not exceed acceptable risk levels for arbitrary choices of π i < T and π i ≥ T. For this purpose, we define the minimal survey performance in terms of the acceptable risk levels (E overtreat = 25% and E undertreat = 5%) for a particular range of π i close to the threshold T, defined by a lower limit (LL) and an upper limit (UL). Together, they form a "grey zone" that dictates the slope of the sigmoid curve (from the bottom left to the top right corner of the grey zone). We then try to find the range of values of the decision cut-off c that satisfy the conditions ε overtreat (π i = LL = 37.5%) ≤ E overtreat and ε undertreat (π i = UL = 62.5%) ≤ E undertreat , and we select the value of c that offers the lowest attainable risk of ε undertreat , as we considered this the most important risk to mitigate. As can be seen from Fig 2B, the solid lines indicate the values of c that satisfy these conditions (red line: c = 85; blue line: c = 111), and we chose a decision cut-off of 85, as it offers a lower ε undertreat (0.5%) than 111 does (ε undertreat = 4.7%). By contrast, the dotted lines represent values of c that do not satisfy the conditions (green line: c = 83; purple line: c = 113).

Fig 2. Program decision-making process based on the 2-stage LQAS framework allowing for an imperfect diagnostic test. Panel A illustrates the risk of making a wrong program decision (ε overtreat and ε undertreat ) for a given true prevalence π i when a decision cut-off c = 90 is applied. The vertical line indicates the program prevalence threshold T = 50%, whereas the horizontal dotted lines represent ε overtreat and ε undertreat for π i = 37.5% and π i = 62.5%, respectively. Panel B represents the decision cut-off c for a choice of grey-zone width (lower limit (LL) = 37.5% and upper limit (UL) = 62.5%) and acceptable risk levels (E overtreat = 25% and E undertreat = 5%) for a particular range of π i . The solid sigmoid curves (red and blue) indicate the values of c that satisfy the conditions ε overtreat (π i = LL) ≤ E overtreat and ε undertreat (π i = UL) ≤ E undertreat , whereas the dotted sigmoid curves (green and purple) do not. All graphs are based on the same theoretical diagnostic test D (se d = 80% and sp d = 98%), survey design (n clust = 5 and n sub = 50), an intra-cluster correlation ρ i = 0.02, and 10,000 Monte Carlo simulations.

PLOS NEGLECTED TROPICAL DISEASES

https://doi.org/10.1371/journal.pntd.0010353.g002

The mathematical backbone of the framework
The mathematical backbone of the 2-stage LQAS population-based decision-making using an imperfect diagnostic test is summarized by the three equations below (Eqs (1)-(3)). Tables 1 and 2 describe the definitions of the parameters and derived variables used in these equations, respectively. Eq (1) describes the 2-stage sampling process that results in the total number of positive test results within a cluster.

Table 1. Definitions of the parameters that describe the 2-stage LQAS framework.

Parameter: Definition
α i , β i : Shape parameters of a beta distribution describing the variation in true cluster-level prevalence within an implementation unit i
π i : Expected cluster-level true prevalence within implementation unit i
π ij : True prevalence in cluster j from implementation unit i; π ij ~ Beta(π i , ρ i ) or, synonymously, π ij ~ Beta(α i , β i )

For this, we used a 2-stage beta-binomial model, where the beta distribution represents the variation in true cluster-level prevalence π ij within an implementation unit i, and the binomial distribution represents the sampling variation in the number of positive tests X ij in cluster j of implementation unit i when an imperfect diagnostic test D with sensitivity se d and specificity sp d is applied. The shape parameters α i and β i , both greater than zero, define the beta distribution, where the index i indicates that these parameters can be implementation-unit specific. For the present study, we parameterized the beta distribution in terms of the expected value π i = α i / (α i + β i ) and the intra-cluster correlation ρ i = 1 / (α i + β i + 1). An added benefit of using the intra-cluster correlation as a measure of variation is that it is independent of the mean π i of the beta distribution, unlike the variance σ i ². To illustrate our framework, we considered an intra-cluster correlation ρ i of 0.02 [12]. The binomial distribution for positivity of individual test results was parameterized in terms of p ij+ (= se d × π ij + (1 − sp d ) × (1 − π ij )), the expected proportion of positive test results within a cluster j from implementation unit i, and the number of subjects screened per cluster (n sub ):

π ij ~ Beta(α i , β i ), X ij ~ Binomial(n sub , p ij+ ), X i+ = Σ j X ij (1)

Eqs (2) and (3) below represent the risks associated with incorrect program decision-making, denoting the parts of the curve in Fig 2A left and right of the prevalence threshold T, respectively. Here, Eq (2) describes the probability that the interventions are implemented at a frequency or intensity higher than required for the true underlying prevalence when π i < T:

ε overtreat (π i ) = P(X i+ ≥ c | π i ), for π i < T (2)

Vice versa, Eq (3) describes the probability that interventions are prematurely stopped or scaled down when π i ≥ T:

ε undertreat (π i ) = P(X i+ < c | π i ), for π i ≥ T (3)

To be able to calibrate a value of the decision cut-off c, we need to choose what risk of over- and undertreatment (ε overtreat and ε undertreat ) we find acceptable for a point π i < T and another point π i ≥ T. Together, these two true prevalence points, which correspond to the LL and UL of the grey zone, dictate the slope of the sigmoid curve illustrated in Fig 2B (the higher the desired steepness, the lower the risk we are willing to accept). We express this as ε overtreat (π i = LL) ≤ E overtreat and ε undertreat (π i = UL) ≤ E undertreat .

Simulation framework to determine the decision cut-off c
The determination of the decision cut-off c is straightforward in a 1-stage LQAS ignoring the spatial heterogeneity of infections across clusters [8], but there is no analytical solution for a 2-stage LQAS allowing for spatial heterogeneity [6]. We therefore applied a Monte Carlo (MC) simulation framework to determine the decision cut-off c. In brief, we first simulated the distribution of the total number of positive test results (X i+ ) based on 10,000 MC draws from the 2-stage beta-binomial model (Eq (1)). A set of MC draws was produced for each combination of chosen prevalence threshold T and values of the limits of the grey zone (LL, UL), conditional on survey design (n clust , n sub ) and spatial variation in true prevalences (ρ i ).
Subsequently, for each combination of c ∈ (0, n clust × n sub ) and prevalence threshold T, we calculated the probability of overtreatment ε overtreat (Eq (2)) when π i = LL and the probability of undertreatment ε undertreat (Eq (3)) when π i = UL. Suitable values of the decision cut-off c were those that resulted in both ε overtreat (π i = LL) ≤ E overtreat and ε undertreat (π i = UL) ≤ E undertreat . When multiple values of c satisfied these grey-zone conditions (Eqs (2) and (3)), we selected the value of c that offered the lowest attainable risk of ε undertreat , as we considered this the most important risk to mitigate.
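The calibration procedure above can be sketched as a minimal Monte Carlo implementation. This is an illustrative sketch, not the authors' exact code: function names are ours, and the sampler mirrors the 2-stage beta-binomial model with the parameterisation π i = α i /(α i + β i ) and ρ i = 1/(α i + β i + 1).

```python
import random

def simulate_x_pos(n_clust, n_sub, pi_i, rho_i, se_d, sp_d, rng):
    """One draw of the total number of positive tests X_i+ from the
    2-stage beta-binomial model (Eq (1))."""
    total = 1 / rho_i - 1
    alpha, beta = pi_i * total, (1 - pi_i) * total
    x = 0
    for _ in range(n_clust):
        p_ij = rng.betavariate(alpha, beta)
        p_pos = se_d * p_ij + (1 - sp_d) * (1 - p_ij)
        x += sum(rng.random() < p_pos for _ in range(n_sub))
    return x

def calibrate_cutoff(LL, UL, E_over=0.25, E_under=0.05,
                     n_clust=5, n_sub=50, rho_i=0.02,
                     se_d=0.80, sp_d=0.98, n_mc=10_000, seed=1):
    """Return the admissible decision cut-off c with the lowest
    eps_undertreat, or None if no c satisfies both conditions."""
    rng = random.Random(seed)
    x_ll = [simulate_x_pos(n_clust, n_sub, LL, rho_i, se_d, sp_d, rng)
            for _ in range(n_mc)]
    x_ul = [simulate_x_pos(n_clust, n_sub, UL, rho_i, se_d, sp_d, rng)
            for _ in range(n_mc)]
    admissible = []
    for c in range(n_clust * n_sub + 1):
        eps_over = sum(x >= c for x in x_ll) / n_mc   # continue when pi_i = LL
        eps_under = sum(x < c for x in x_ul) / n_mc   # scale down when pi_i = UL
        if eps_over <= E_over and eps_under <= E_under:
            admissible.append(c)
    # eps_undertreat increases with c, so the smallest admissible c wins
    return min(admissible) if admissible else None
```

For the toy example (LL = 37.5%, UL = 62.5%), `calibrate_cutoff(0.375, 0.625)` returns a cut-off close to the c = 85 found in Fig 2B, with some Monte Carlo variation.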
In Fig 3, we further illustrate this process using the toy example of Fig 2. The top row (Fig 3A-3D) shows the process for the lower limit of the grey zone, whereas the bottom row (Fig 3E-3H) illustrates this for the upper limit. The first three panels in each row (Fig 3A-3C and Fig 3E-3G) represent the iterative process to determine the distribution of X i+ based on the 2-stage beta-binomial model; the fourth panel in each row (Fig 3D and 3H) describes ε overtreat and ε undertreat across all potential values of c between zero and n clust × n sub . Clearly, a decision cut-off c between 85 (Fig 3D: ε overtreat (π i = LL) ≤ E overtreat ) and 111 (Fig 3H: ε undertreat (π i = UL) ≤ E undertreat ) allowed for an acceptable risk of erroneous decision-making (see also Fig 2B). Note that in this example there are multiple options for c, but this may not always be the case. Indeed, it is anticipated that for some combinations of n clust , n sub , se d , sp d , LL, UL, and ρ i , there will not be any decision cut-off that fulfils the conditions ε overtreat (π i = LL) ≤ E overtreat and ε undertreat (π i = UL) ≤ E undertreat .

Fig 3. Determination of the decision cut-off c based on the 2-stage LQAS framework allowing for imperfect tests. This figure describes the simulation framework to determine the decision cut-off c that allows for adequate (E overtreat = 25% and E undertreat = 5%) decision-making at a true underlying prevalence π i of implementation unit i equal to 37.5% (LL) and 62.5% (UL) when an imperfect theoretical diagnostic test D (se d = 80% and sp d = 98%) was deployed to screen n clust = 5 clusters and n sub = 50 subjects per cluster. We fixed the intra-cluster correlation at ρ i = 0.02 at both limits of the grey zone. The top row graphs (Panels A-D) illustrate the process for the lower limit of the grey zone, whereas the bottom row graphs (Panels E-H) show this for the upper limit. The horizontal dotted line in Panel D represents E overtreat , and in Panel H E undertreat . The dotted vertical lines in these panels represent the decision cut-off c, and the bullets the resulting ε overtreat (Panel D) and ε undertreat (Panel H). All graphs are based on the same set of 10,000 Monte Carlo simulations.

https://doi.org/10.1371/journal.pntd.0010353.g003

Application of the 2-stage LQAS framework to soil-transmitted helminths
Monitoring and evaluation of soil-transmitted helminth control programs as a case study. STH are a group of intestinal roundworms, including Ascaris lumbricoides (giant roundworm), Trichuris trichiura (whipworm), and Ancylostoma duodenale and Necator americanus (hookworms). WHO recommends controlling STH-attributable morbidity through periodic administration of anthelmintic drugs to children and other at-risk populations living in endemic areas. The frequency of these large-scale deworming programs is based on the observed prevalence of STH infections (any species). That is, at the start of the program, it is recommended to distribute drugs twice a year when the prevalence is at least 50% and once a year when the prevalence is at least 20%. During the implementation phase, the prevalence of any STH infection is periodically re-evaluated to verify whether objectives are being met and, if necessary, to adjust the frequency of drug administration (prevalence ≥ 50%: 3x PC/year; 50% > prevalence ≥ 20%: maintain PC frequency; 20% > prevalence ≥ 10%: 1x PC/year; 10% > prevalence ≥ 2%: 1x PC/2 years; prevalence < 2%: stop PC). To monitor and evaluate STH control programs, WHO recommends screening 5 schools and 50 children per school. Traditionally, STH have been diagnosed by detecting worm eggs in a stool smear using a compound light microscope (the Kato-Katz thick smear method). This method is cheap and straightforward, yet notoriously imperfect, and sensitivity is more of a limitation (mainly for the detection of low-intensity infections) than specificity [13,14]. Notably, there is a lack of studies investigating whether its performance is sufficient to make reliable program decisions to stop PC, and how better diagnostics or sampling strategies could lead to improvement.
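The re-evaluation decision rules listed above can be expressed as a small lookup. The thresholds are as stated in the text; the function name and return labels are our own shorthand.

```python
def pc_frequency(prevalence):
    """Map the re-evaluated prevalence of any STH infection to the
    WHO-recommended preventive chemotherapy (PC) decision."""
    if prevalence >= 0.50:
        return "3x PC/year"
    if prevalence >= 0.20:
        return "maintain PC frequency"
    if prevalence >= 0.10:
        return "1x PC/year"
    if prevalence >= 0.02:
        return "1x PC/2 years"
    return "stop PC"
```

Note that the decision boundaries of this rule are exactly the four program prevalence thresholds T (2%, 10%, 20% and 50%) studied in this paper.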
We will illustrate the application of the 2-stage LQAS framework in the context of M&E of STH control programs, with the aims (i) to determine the required diagnostic performance when the current WHO survey design is applied, (ii) to further optimize the survey design for imperfect diagnostic tests, and (iii) to customize the sample throughput and cost per test according to improvements in diagnostic performance. For each of these objectives, we applied the aforementioned 2-stage LQAS framework to strategically selected scenarios of n clust , n sub , se d , sp d , LL, UL, and ρ i , and verified whether we could find a decision cut-off c that fulfils the conditions ε overtreat (π i = LL) ≤ E overtreat and ε undertreat (π i = UL) ≤ E undertreat . To allow for two different operational definitions of 'reliable' program decision-making ('adequate' vs. 'ideal'), we set E undertreat at 5% for both definitions, whereas we set E overtreat at either 25% ('adequate') or 10% ('ideal'). These values for E undertreat and E overtreat have also previously been used to determine the required sensitivity and specificity of diagnostic tests for other helminth diseases [5,6]. S1 Table provides an overview of the different values of n clust , n sub , se d , sp d , LL, UL, ρ i , E undertreat and E overtreat considered in our simulations, and S2 Table shows a summary of the simulated diagnostic methods. In the following sections, we briefly justify the choices made to meet each objective. Also note that we assume that the sensitivity and specificity of a diagnostic test D do not vary across individuals, even though this assumption may not strictly hold, as the sensitivity of detecting worm eggs in stool depends on the intensity of STH infection, which is known to vary between individuals [15].

Determine the required diagnostic performance when applying the current WHO survey design.
To determine the required diagnostic performance when applying the current WHO survey design (n clust = 5 and n sub = 50) across each of the 4 program decision thresholds T (2%, 10%, 20% and 50%), we varied both se d and sp d from 60% to 100% (with 1% increments), resulting in a 41 × 41 grid of hypothetical diagnostic tests.
Given the wide range in program decision thresholds T (2% to 50%), we opted to define the limits of the grey zone proportional to T. We arbitrarily set the limits at T±25%. For example, for a T equal to 2%, LL and UL were set at 1.5% and 2.5%, respectively, while for a T equal to 50%, these numbers were 37.5% and 62.5%. We assumed that spatial heterogeneity (ρ i = 0.02) was the same for all true prevalences of infection π i . For example, this assumption translates to a central 95%-confidence interval (95%-CI) of cluster-level true prevalence π ij of 36.2-63.8% for an implementation unit i with an expected cluster-level true prevalence π i of 50%; analogously, for π i = 2% this translates to a 95% CI of 0.0-7.3%.
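The translation from intra-cluster correlation to between-cluster prevalence variation quoted above can be checked numerically. The sketch below (function name ours) uses the parameterisation π i = α i /(α i + β i ) and ρ i = 1/(α i + β i + 1), and estimates the central 95% interval of cluster-level prevalence by simulation.

```python
import random

def cluster_prevalence_interval(pi_i, rho_i, n_draws=100_000, seed=1):
    """Central 95% interval of cluster-level true prevalence pi_ij,
    where pi_ij ~ Beta(alpha_i, beta_i) with mean pi_i and
    intra-cluster correlation rho_i = 1 / (alpha_i + beta_i + 1)."""
    total = 1 / rho_i - 1                     # alpha_i + beta_i
    alpha, beta = pi_i * total, (1 - pi_i) * total
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(alpha, beta) for _ in range(n_draws))
    return draws[int(0.025 * n_draws)], draws[int(0.975 * n_draws)]
```

For π i = 50% and ρ i = 0.02 this recovers an interval close to the 36.2-63.8% quoted above, and for π i = 2% an upper limit close to 7.3%, up to Monte Carlo error.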
Optimise the survey design when using an imperfect diagnostic test. To optimise the survey design, we explored the impact of the width of the grey zone (LL and UL), se d , sp d , n clust and ρ i on n sub . To this end, we determined the required n sub that allowed for adequate (E undertreat = 5% and E overtreat = 25%) program decisions around a program decision threshold of 2% for different scenarios of the width of the grey zone (T±25%, ±40%, ±50%, ±60% and ±75%), se d (80%−100%, with 1% increments), sp d (98%−100%, with 0.1% increments), and n clust (5, 10, 15 and 20), with ρ i (initially) fixed at 0.02. We focused on T = 2% because the requirements for n clust and n sub become more stringent with decreasing values of T [6].
Finally, we further assessed the impact of the geographical variation in prevalence between clusters (ρ i ) and the number of sampled clusters (n clust ) on the minimum required number of subjects per cluster (n sub ). To this end, we arbitrarily fixed the limits of the grey zone at 1% and 3%, assumed a theoretical diagnostic test with se d = 80% and sp d = 98%, explored three values for n clust (10, 15 and 20) and varied the intra-cluster correlation ρ i from 0.012 (lower spatial heterogeneity of infections across the clusters) to 0.032 (higher spatial heterogeneity of infections across the clusters) with 0.004 increments.
We also estimated how the total survey costs (C tot ) may change with the diagnostic test characteristics (se d and sp d ), which dictate the minimum required n clust and n sub given LL, UL, E undertreat and E overtreat , reagent costs, and sample throughput. As a benchmark, we used a theoretical reference diagnostic test D t1 (se dt1 = 80% and sp dt1 = 98%, as in the example used in Fig 2), for which we assume that reagent costs and throughput are the same as for a single Kato-Katz thick smear, as recently estimated by Coffeng et al. [16]. These cost estimates include (i) the reagent cost for collection (0.57 US$ per sample) and testing (1.38 US$ per sample), (ii) the salary for a single mobile field team comprising one nurse and three laboratory technicians (90 US$ per day, assuming 8 working hours), and (iii) the cost per day for car rental, including the salary of the driver and gasoline (90 US$ per day). As in Coffeng et al. [16], we adopt the assumption that a team collects samples in the morning (8:00-12:00) and that all collected samples are processed in the afternoon (13:00-17:00), which implies that the number of samples that can be collected daily is limited. Therefore, we calculated for each survey design the number of working days (n days ) required to screen all recruited subjects. For this, as in Coffeng et al. [16], we assume that it takes a person on average 412 seconds to process and test a single stool sample (which in Coffeng et al.'s analysis included preparation and examination of the stool smear, digitization of demographic information, and recording of the egg counts). Hence, the sample throughput for the hypothetical reference test was 9 samples per person per hour, and the reagent costs to collect and test a sample were 0.57 US$ and 1.38 US$, respectively.
The number of working days n days required to screen all recruited subjects was calculated from the number of samples the team can process during one afternoon session (Eq (4)). Finally, (iv) we included a cost of 180 US$ (salary for one team + one car rental) per school included in the survey. This cost also covers the time required to inform the schools of the purpose of the study. The total survey cost C tot is then as follows:

C tot = cost of consumables to collect and test samples + operational cost to collect and process samples + cost to inform schools = n clust × n sub × (cost to collect sample + reagent cost to test sample) + n days × (salary of field team per day + cost of car rental per day) + n clust × cost to inform school (5)

We subsequently applied C tot to determine which n clust and n sub represent the number of clusters and subjects per cluster that allow for adequate (E undertreat = 5% and E overtreat = 25%) program decisions around a program decision threshold T of 2% (LL = 1% and UL = 3%) for a given se d and sp d . We fixed ρ i at 0.02. We assumed that (i) there were no additional costs for laboratory infrastructure, (ii) the team only received compensation during working days (Monday-Friday), and (iii) the team did not take any breaks during processing.
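The cost model can be sketched as follows. This is an illustrative sketch, not the authors' exact implementation: the daily processing capacity (three technicians working the four-hour afternoon session) and the per-cluster rounding of survey days are our assumptions based on the description above.

```python
import math

def total_survey_cost(n_clust, n_sub,
                      cost_collect=0.57,     # US$ reagent cost to collect one sample
                      cost_test=1.38,        # US$ reagent cost to test one sample
                      throughput=9,          # samples per person per hour
                      n_processing_staff=3,  # laboratory technicians in the team
                      processing_hours=4,    # afternoon session (13:00-17:00)
                      day_cost=90 + 90,      # team salary + car rental per day (US$)
                      school_cost=180):      # cost to inform each school (US$)
    """Total survey cost C_tot (Eq (5)), with n_days derived under the
    assumption that each cluster is surveyed on its own working day(s)."""
    capacity = n_processing_staff * processing_hours * throughput
    n_days = n_clust * math.ceil(n_sub / capacity)
    consumables = n_clust * n_sub * (cost_collect + cost_test)
    return consumables + n_days * day_cost + n_clust * school_cost
```

Under these assumptions, the reference design (n clust = 5, n sub = 50) with Kato-Katz cost characteristics comes to 2,287.50 US$: 487.50 US$ for consumables, 900 US$ for five team-days, and 900 US$ to inform five schools.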
Customize the sample throughput and cost per test according to the improvements in diagnostic performance. To assess how the total cost of a survey depends on the diagnostic test performance (which dictates the minimal adequate design in terms of n clust and n sub ), the sample throughput, and the cost per test, we defined three theoretical diagnostic methods D t1 −D t3 with the same sensitivity (se dt1 = se dt2 = se dt3 = 80%) but varying specificity (sp dt1 = 98%, sp dt2 = 96% and sp dt3 = 94%), and applied these to varying scenarios of sample throughput (3−50 samples processed by one person per hour, in increments of 1 sample) and cost per test (1 US$−10 US$, in increments of 0.01 US$), applying Eqs (4) and (5) to estimate the total survey cost. For each combination of the parameters mentioned above, we then expressed the total survey cost relative to the total cost of a survey based on the theoretical reference diagnostic test D t3 (se dt3 = 80% and sp dt3 = 94%) with the cost characteristics of single Kato-Katz thick smears. In each of the scenarios, ρ i was fixed at 0.02, and the grey zone was defined as T±50% (LL = 1% and UL = 3%).
Application of the framework to develop evidence-based guidelines and direct R&D. To further illustrate how our framework can contribute to the development of more evidence-based WHO guidelines, we determined the required number of subjects per cluster that allows for adequate program decision-making (E overtreat = 25% and E undertreat = 5%) when 10, 15, or 20 clusters are sampled (intra-cluster correlation ρ i of 0.02). We considered the 2% prevalence threshold (LL = 1% and UL = 3%) separately for two theoretical diagnostic tests D tpp1 (Se tpp1 = 60% and Sp tpp1 = 99%) and D tpp2 (Se tpp2 = 86% and Sp tpp2 = 94%). The values for the diagnostic performance of both tests were based on the minimal required sensitivity and specificity defined in the recently published WHO target product profiles (TPPs) for M&E of STH control programs [17]. We further estimated the total survey cost using Eqs (4) and (5). To this end, we used the required sample throughput (minimal: 7 samples per person per hour; ideal: 10 samples per person per hour) and reagent costs (minimal: 3 US$; ideal: 1 US$) listed in the WHO TPPs [17].
Further, we used our framework to gain insight into where R&D should be directed. For this, we fixed the diagnostic performance of a single Kato-Katz thick smear KK at se kk = 55% and sp kk = 95% for the detection of low-intensity infections [18][19][20][21]. Then, we considered two hypothetical improved Kato-Katz thick smear methods, one with improved sensitivity KK se (se kkse = 60% and sp kkse = 95%) and one with improved specificity KK sp (se kksp = 55% and sp kksp = 99%). We then estimated the required number of subjects per cluster and the associated total survey cost when 10, 15, or 20 clusters were sampled around the 2% program prevalence threshold (LL = 1% and UL = 3%), such that adequate decision-making (E overtreat = 25% and E undertreat = 5%) was ensured. Additionally, we estimated the highest acceptable reagent cost per test for the diagnostic test KK sp . For this, we determined the survey cost for this improved Kato-Katz thick smear under varying scenarios of sample throughput (3−50 samples per person per hour) and reagent cost per test (1 US$−10 US$), relative to the total cost of a survey based on the hypothetical diagnostic test that meets the WHO TPPs for M&E of STH control programs in terms of diagnostic performance (D tpp1 : Se tpp1 = 60% and Sp tpp1 = 99%), sample throughput (7 samples per person per hour), and reagent cost (3 US$ per test), while setting the number of clusters at 10.

Results
The required diagnostic performance when applying the current WHO survey design

Fig 4 illustrates the required sensitivity and specificity when applying the WHO survey design (n clust = 5 and n sub = 50 per cluster) around the three highest program prevalence thresholds T (Fig 4A: 10%; Fig 4B: 20%; Fig 4C: 50%). We fixed the intra-cluster correlation ρ i at 0.02 and defined the limits of the grey zone as a proportion of T (LL = T − 0.25 × T; UL = T + 0.25 × T). Each panel represents a contour plot highlighting all combinations of sensitivity and specificity that rendered a decision cut-off c fulfilling ε overtreat(p i = LL) ≤ E overtreat and ε undertreat(p i = UL) ≤ E undertreat, for both adequate program decision-making (all combinations within the light blue and dark blue areas) and ideal program decision-making (all combinations within the dark blue area).

Fig 4. The required sensitivity and specificity when applying the current WHO study design. This figure indicates the required sensitivity and specificity when applying the current WHO study design (n clust = 5 and n sub = 50) that allow for adequate (E overtreat = 25% and E undertreat = 5%) or ideal (E overtreat = 10% and E undertreat = 5%) decision-making around the three highest program prevalence thresholds T (Panel A: 10%; Panel B: 20%; Panel C: 50%). The intra-cluster correlation ρ i was fixed at 0.02, and the limits of the grey zone were defined as a proportion of the program prevalence threshold T (LL = T − 0.25 × T; UL = T + 0.25 × T). The red area indicates that the combination of the survey design (n clust, n sub) and the test performance (se d and sp d) was inadequate for decision-making, while the light blue and the dark blue areas represent combinations that allow for adequate (E overtreat = 25% and E undertreat = 5%) and ideal (E overtreat = 10% and E undertreat = 5%) decision-making, respectively. All graphs are based on the same set of 10,000 Monte Carlo simulations. https://doi.org/10.1371/journal.pntd.0010353.g004
The red area indicates all combinations of sensitivity and specificity that did not render a decision cut-off c fulfilling the conditions for adequate decision-making. Three crucial aspects can be observed in Fig 4. First, the requirements for diagnostic test performance become less stringent as the program threshold approaches 50%. None of the combinations of sensitivity and specificity allowed for adequate or ideal decision-making around program thresholds of 10% (the area of Fig 4A is entirely red) and 2% (not shown, as it is identical to 10%). For thresholds of 20% and 50%, program decisions were either adequate (Fig 4B contains red and light blue) or adequate to ideal (Fig 4C contains all three colours), respectively. Second, the requirements for specificity and sensitivity are inversely correlated. For instance, for adequate program decisions around a 50% threshold (Fig 4C) with a diagnostic test of 60% specificity, the sensitivity must be at least 70%, and vice versa. Third, the requirements for specificity are more stringent than those for sensitivity, which already becomes apparent when program decisions are made around a T of 20% (Fig 4B). When employing a diagnostic method with perfect specificity, the sensitivity should not drop below 72% to guarantee adequate decision-making. In contrast, the specificity can only drop to 86% when the sensitivity is set at 100%.
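The decision rule behind these contour plots can be sketched in a few lines of code: draw cluster-level prevalences from a distribution matching the intra-cluster correlation, apply an imperfect test, and compare the total number of test positives to the cut-off c. The beta-binomial cluster model and the example cut-off below are our own illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(2022)

def prob_exceed_cutoff(p_true, se, sp, icc, n_clust, n_sub, c, n_sim=10_000):
    """Monte Carlo estimate of P(total positives > c) when the true mean
    prevalence is p_true. Cluster prevalences follow a Beta distribution
    with mean p_true and intra-cluster correlation icc = 1 / (a + b + 1)
    (an assumed parameterisation)."""
    s = 1.0 / icc - 1.0
    p_clust = rng.beta(p_true * s, (1.0 - p_true) * s, size=(n_sim, n_clust))
    # apparent (test-positive) prevalence under imperfect sensitivity/specificity
    q = se * p_clust + (1.0 - sp) * (1.0 - p_clust)
    positives = rng.binomial(n_sub, q).sum(axis=1)
    return float((positives > c).mean())

# WHO design (5 clusters x 50 subjects), perfect test, hypothetical cut-off c = 4
# Risk of falsely continuing treatment when true prevalence sits at LL = 1%
eps_overtreat = prob_exceed_cutoff(0.01, se=1.0, sp=1.0, icc=0.02,
                                   n_clust=5, n_sub=50, c=4)
# Risk of prematurely stopping when true prevalence sits at UL = 3%
eps_undertreat = 1.0 - prob_exceed_cutoff(0.03, se=1.0, sp=1.0, icc=0.02,
                                          n_clust=5, n_sub=50, c=4)
```

A combination (se d, sp d) is coloured blue in Fig 4 when some cut-off c keeps both of these error probabilities below the chosen maxima.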

Optimise the survey design when using an imperfect diagnostic test
We now proceed to optimise the survey design for a program prevalence threshold of 2%, for which the WHO-recommended survey design is not sufficient. The first step towards this is to evaluate how stringency (i.e., the width of the grey zone) affects the minimum survey design. Fig 5 displays the impact of the width of the grey zone on the minimum required number of subjects per cluster (n sub; Fig 5A), the total number of subjects sampled across n clust clusters (n tot; Fig 5B), the corresponding decision cut-off (c; Fig 5C), and the associated survey costs (C tot; Fig 5D) required for adequate program decision-making (E overtreat = 25%; E undertreat = 5%). We fixed the intra-cluster correlation ρ i at 0.02 and assumed the use of a theoretical diagnostic test D t1 (se dt1 = 80% and sp dt1 = 98%). Each of these four metrics (n sub, n tot, c and C tot) decreased as the grey zone became wider (i.e., when less stringent requirements for decision-making are defined). For example, when the limits of the grey zone were defined as 40% of the program threshold (LL = T − 0.4 × T = 0.02 − 0.02 × 0.40 = 0.012; UL = T + 0.4 × T = 0.02 + 0.02 × 0.40 = 0.028) and 10 clusters were randomly selected, the minimum required number of subjects per cluster was 540 (n tot = 5,400), whereas this was 175 (n tot = 1,750) when the limits were defined as T±50% (LL = 1%, UL = 3%) (Fig 5A and 5B). Additionally, it is important to note that this difference in n tot further decreased as the grey zone became wider. Indeed, when the width of the grey zone was T±75%, the required n tot was 650 when n clust equalled 5 and 500 when n clust was set at 10 (Fig 5B). Furthermore, Fig 5C shows trends very similar to Fig 5B, because the decision cut-off also decreased as the grey zone became wider. Also, for a relatively small grey zone, a higher number of clusters (either 15 or 20) is most cost-efficient. As a result, it was not unexpected to observe that for a grey zone width of T±75%, sampling 5 clusters became the most cost-efficient design (Fig 5D).
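In code, finding the minimum n sub for a given grey zone reduces to a search: for each candidate sample size, simulate the distribution of total positives at LL and at UL, and check whether some cut-off c satisfies both error constraints. The sketch below is our own simplified search under an assumed beta-binomial cluster model, not the authors' algorithm; the step size of 25 subjects is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)

def total_positives(p_true, se, sp, icc, n_clust, n_sub, n_sim=10_000):
    """Simulated totals of test positives across all clusters."""
    s = 1.0 / icc - 1.0
    p_clust = rng.beta(p_true * s, (1.0 - p_true) * s, size=(n_sim, n_clust))
    q = se * p_clust + (1.0 - sp) * (1.0 - p_clust)
    return rng.binomial(n_sub, q).sum(axis=1)

def min_design(se, sp, icc, n_clust, ll, ul, e_over=0.25, e_under=0.05):
    """Smallest n_sub (in steps of 25) and a cut-off c satisfying
    P(positives@LL > c) <= e_over and P(positives@UL <= c) <= e_under."""
    for n_sub in range(25, 2001, 25):
        pos_ll = total_positives(ll, se, sp, icc, n_clust, n_sub)
        pos_ul = total_positives(ul, se, sp, icc, n_clust, n_sub)
        # smallest cut-off that controls the overtreatment risk at LL
        c = int(np.quantile(pos_ll, 1.0 - e_over))
        if (pos_ul <= c).mean() <= e_under:
            return n_sub, c
    return None  # no feasible design within the search range

design = min_design(se=0.80, sp=0.98, icc=0.02, n_clust=10, ll=0.01, ul=0.03)
```

Widening the grey zone separates the two simulated distributions, which is why the required n sub and cut-off c in Fig 5 fall as the grey zone grows.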
To facilitate the interpretation of the impact of the other parameters on program decision-making, we fixed the limits of the grey zone in all subsequent figures at 1% (= LL) and 3% (= UL). These limits corresponded to T±50% (black dots in Fig 5) and resulted in a total sample size n tot of about 2,000 (200 subjects per cluster) and a decision cut-off of 60 when 10 clusters were sampled, a sample size that we considered feasible under field conditions. We now proceed to evaluate the impact of diagnostic test performance on the minimum survey design. Fig 6 illustrates the impact of the diagnostic performance (se d ≥80% and sp d ≥98%) and the number of sampled clusters (n clust: 5, 10, 15 and 20) on the minimum required number of subjects per cluster (n sub) for adequate program decision-making (E overtreat = 25% and E undertreat = 5%) around a 2% program prevalence threshold. We fixed the intra-cluster correlation ρ i at 0.02. Generally, this figure highlights three important aspects. First, improving the diagnostic performance can substantially reduce the minimum required number of subjects per cluster and the total sample size. For example, when sampling 5 clusters and deploying a diagnostic test D (se d = 80% and sp d = 98%), the minimum required number of subjects per cluster was ~1,300. However, it dropped to 400 subjects per cluster when the diagnostic test specificity was set at 99.9% (Fig 6A). Second, these panels confirm that specificity is more important than sensitivity when evaluating prevalence around 2%. In other words, improving the specificity has more impact on the minimum required sample size than improving the sensitivity. This can be readily seen from the almost vertical orientation of the contour lines in all panels of Fig 6, which indicates that sample size changes fastest with specificity (i.e., about perpendicular to the contour lines). Third, as expected, sampling fewer clusters leads to a greater number of subjects per cluster and ultimately to an increased total sample size. For example, when sensitivity was set at 80% and specificity at 98%, the minimum required number of subjects per cluster was ~1,300 when 5 clusters were sampled (n tot = 6,500). For 10, 15 and 20 clusters, these numbers were about 150 (n tot = 1,500), 80 (n tot = 1,200), and 50 (n tot = 1,000), respectively (Fig 6B).

Fig 5. The impact of the width of the grey zone on the minimum required number of subjects per cluster (Panel A), the total number of subjects sampled across n clust clusters (Panel B), the decision cut-off c (Panel C), and the associated total survey cost (Panel D) for adequate program decision-making (E overtreat = 25% and E undertreat = 5%). We considered a program prevalence threshold T of 2%, fixed the intra-cluster correlation ρ i at 0.02 and assumed a theoretical diagnostic test D t1 (se dt1 = 80% and sp dt1 = 98%). The width of the grey zone is expressed as a proportion of the program prevalence threshold T. The black bullet across the four panels indicates the chosen hypothetical reference grey zone width (T±50%), which was chosen to further illustrate the impact of the diagnostic performance (Fig 6) and of geographical variation in prevalence between clusters on program decision-making (Fig 7), and the relative change in total survey cost based on newer diagnostic tests (Fig 8). All graphs are based on the same set of 10,000 Monte Carlo simulations. https://doi.org/10.1371/journal.pntd.0010353.g005
Last, we consider the impact of geographical variation on the optimal survey design for a threshold of 2%, assuming a test sensitivity and specificity of 80% and 98%, respectively. Fig 7 displays the impact of the geographical variation in prevalence (expressed as the intra-cluster correlation, which ranged from 0.012 to 0.032) on the number of subjects per cluster (n sub; Fig 7A), the total number of subjects sampled across n clust clusters (n tot; Fig 7B), the corresponding decision cut-off (c; Fig 7C), and the associated total survey costs (C tot; Fig 7D) required for adequate program decision-making (E overtreat = 25% and E undertreat = 5%) around a 2% program prevalence threshold. Each of these four metrics increased as STH infections became more heterogeneously distributed. In other words, if the spatial heterogeneity differs between implementation units, the optimal design for decision-making will differ as well. For example, when fixing the number of clusters at 10, a minimum of 120 subjects per cluster was required when the intra-cluster correlation ρ i equalled 0.012, while 188 subjects per cluster were needed when ρ i was set at 0.025 (Fig 7A). Additionally, regarding the total survey cost (Fig 7D), sampling fewer clusters is preferred when infections are more homogeneously distributed among clusters (ρ i <0.017). For example, it is apparent from Fig 7D that sampling 10 or 15 clusters was the most cost-effective when the intra-cluster correlation ranged from 0.012 to 0.017. However, in case of a more heterogeneous distribution, Fig 7D indicates that it becomes more cost-effective to sample more clusters and fewer subjects per cluster. For example, when fixing the intra-cluster correlation ρ i at 0.012, the total survey cost was ~6,000 US$ when 10 clusters were sampled and ~7,000 US$ when 20 clusters were sampled. In contrast, for an intra-cluster correlation of 0.025, these numbers were ~8,500 US$ for 10 clusters and ~7,800 US$ for both 15 and 20 clusters (Fig 7D).
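The intra-cluster correlation can be mapped onto a concrete cluster-prevalence distribution. Under a beta-binomial model (our assumption, for illustration), a mean prevalence p and ICC ρ correspond to Beta(a, b) with a = p(1/ρ − 1) and b = (1 − p)(1/ρ − 1), so the between-cluster variance equals ρp(1 − p):

```python
import numpy as np

def beta_params(p, icc):
    """Beta(a, b) with mean p and intra-cluster correlation icc = 1/(a+b+1),
    so that Var(cluster prevalence) = icc * p * (1 - p)."""
    s = 1.0 / icc - 1.0
    return p * s, (1.0 - p) * s

a, b = beta_params(p=0.02, icc=0.02)     # 2% threshold, moderate heterogeneity
rng = np.random.default_rng(3)
clusters = rng.beta(a, b, size=200_000)  # simulated cluster-level prevalences
# empirical mean ~ 0.02 and variance ~ 0.02 * 0.02 * 0.98
```

Doubling the ICC doubles the between-cluster variance, which is why more heterogeneous settings require larger samples or more clusters (Fig 7).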
Fig 7. The impact of the geographical variation in prevalence (expressed as the intra-cluster correlation ρ i) on the minimum required number of subjects per cluster (n sub; Panel A), the total number of subjects sampled across n clust clusters (n tot; Panel B), the corresponding decision cut-off c (Panel C), and the associated total survey costs (C tot; Panel D) required for adequate program decision-making (E overtreat = 25% and E undertreat = 5%) around a 2% program prevalence threshold. In this figure, the sensitivity and specificity were set at 80% and 98%, respectively, and the limits of the grey zone at 1% and 3%. All graphs are based on the same set of 10,000 Monte Carlo simulations. https://doi.org/10.1371/journal.pntd.0010353.g007

Customize the sample throughput and cost per test according to the improvements in specificity
Let us now consider how the potentially increased cost and lower throughput of improved diagnostic tests may be offset by the lower sample sizes required due to higher test specificity. Fig 8 illustrates the survey cost for varying scenarios of sample throughput and reagent cost (i.e., the cost to test one sample) relative to the total cost of a survey based on a hypothetical reference diagnostic test with the same sample throughput (9 samples per hour per person) and cost characteristics (1.38 US$ reagent cost per test) as single Kato-Katz thick smears. Three theoretical diagnostic methods (D t1 to D t3) were defined with varying specificity (sp dt1 = 98%, sp dt2 = 96% and sp dt3 = 94%) while setting the sensitivity at 80% (se dt1 = se dt2 = se dt3 = 80%) for all three tests. As the diagnostic test specificity will have an impact on the sample size (number of clusters and individuals per cluster) and thus the total survey cost (see Fig 6), we considered as reference the diagnostic method with the lowest specificity, D t3.

Fig 8. The contour lines illustrate the total survey cost when applying imperfect diagnostic tests with varying sample throughput and reagent cost per sample relative to the total cost of a survey based on the hypothetical reference diagnostic test with the sample throughput (9 samples per hour per person) and cost characteristics (1.38 US$ reagent cost per test) of a single Kato-Katz thick smear. We considered three hypothetical diagnostic tests D t3 (reference diagnostic test; Panel A), D t2 (Panel B), and D t1 (Panel C), each with a different diagnostic performance (D t1: se dt1 = 80%, sp dt1 = 98%; D t2: se dt2 = 80%, sp dt2 = 96%; D t3: se dt3 = 80%, sp dt3 = 94%). The intra-cluster correlation ρ i was set at 0.02 and the number of clusters at 10. The number of subjects per cluster and the total survey cost were defined as the minimum required number of subjects per cluster and the minimal cost required for adequate decision-making (E overtreat = 25% and E undertreat = 5%) around a program prevalence threshold of 2%, with the grey zone defined as T±50% (LL = 1% and UL = 3%). All three panels were based on 10,000 Monte Carlo simulations. https://doi.org/10.1371/journal.pntd.0010353.g008
The estimated total survey cost for the reference diagnostic test D t3 (10 clusters and 350 individuals per cluster tested) was ~14,500 US$ (indicated by a black bullet in Fig 8A). From Fig 8A, it becomes clear that to keep the total survey cost the same (i.e., on the purple line), the requirements for reagent costs and sample throughput are correlated. For example, when we increase the reagent cost by 0.50 US$ to 1.88 US$, the required sample throughput should be at least 13 instead of 9 to result in the same total survey cost. The panel also indicates that the association describing the break-even point is non-linear. In other words, an increase of 0.50 US$ will not always require increasing the sample throughput by 4. Indeed, if we further increase the reagent cost by 0.50 US$ to 2.38 US$, the required throughput needs to be increased to 23 (an increase of 10), underscoring that the compensation of increased reagent cost by higher sample throughput is limited. In fact, for a survey based on the reference diagnostic test D t3, the reagent cost per test cannot exceed 2.90 US$. Fig 8B and 8C illustrate the same trade-off between sample throughput and reagent costs for diagnostic tests D t2 (se dt2 = 80% and sp dt2 = 96%) and D t1 (se dt1 = 80% and sp dt1 = 98%), respectively. Generally, these panels indicate that when fixing the number of clusters at n clust = 10, the improved diagnostic performance reduced the required number of subjects per cluster (350 vs. 250 vs. 175), which allowed for diagnostic tests with a lower sample throughput (downward shift of the contour lines) and higher reagent costs (shift of the contour lines to the right) while keeping the total survey cost the same (purple line). For example, to compensate for a 0.50 US$ increase in reagent costs, the required sample throughput should be at least 6 for diagnostic test D t2 and 4 for diagnostic test D t1, instead of 13 for D t3.
Furthermore, the highest possible reagent cost per test that still ensures the same total survey cost was ~4.30 US$ for diagnostic test D t2 and ~6.50 US$ for diagnostic test D t1.
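The break-even reasoning behind Fig 8 can be reproduced with a simple linear cost model. The paper's Eq (4) and Eq (5) are not reproduced here; the functions below assume, purely for illustration, that total cost is consumables plus technician time (at a hypothetical hourly wage) plus a fixed field cost.

```python
def survey_cost(n_tot, reagent_usd, throughput_per_hr,
                wage_usd_per_hr=5.0, fixed_usd=1_000.0):
    """Hypothetical linear survey cost (NOT the paper's Eq (4)-(5)):
    consumables + technician time + fixed field costs."""
    return (n_tot * reagent_usd
            + wage_usd_per_hr * n_tot / throughput_per_hr
            + fixed_usd)

def breakeven_reagent(n_tot, throughput_per_hr, target_cost,
                      wage_usd_per_hr=5.0, fixed_usd=1_000.0):
    """Highest reagent cost per test that keeps the total at target_cost."""
    return (target_cost - fixed_usd
            - wage_usd_per_hr * n_tot / throughput_per_hr) / n_tot

# Reference: a D_t3-like design, 10 clusters x 350 subjects, Kato-Katz-like costs
ref_cost = survey_cost(3_500, reagent_usd=1.38, throughput_per_hr=9)
# A more specific test needs only 10 x 175 subjects; how expensive may it be?
max_reagent = breakeven_reagent(1_750, throughput_per_hr=9, target_cost=ref_cost)
```

Under this model, halving the required sample size roughly doubles the affordable per-test cost, mirroring the paper's finding that higher reagent costs are compensated only up to a limit.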

Application of the framework to develop evidence-based guidelines and direct R&D
Given a 2% program prevalence threshold and a choice of grey zone width (T±50%) for adequate program decision-making (E overtreat = 25% and E undertreat = 5%), we investigated the potential contribution of our framework to more evidence-based WHO guidelines. For this, we considered two hypothetical tests D tpp1 (se tpp1 = 60% and sp tpp1 = 99%) and D tpp2 (se tpp2 = 86% and sp tpp2 = 94%), both identified as potential diagnostic tests in the recently published WHO TPPs for M&E of STH. The intra-cluster correlation (ρ i) was fixed at 0.02 while sampling 10, 15, and 20 clusters. Furthermore, we set the sample throughput at 7 and 10 samples per person per hour and the reagent cost at 1 US$ and 3 US$ per test [17]. Table 3 lists the required number of subjects per cluster for each scenario. It is evident from this table that for diagnostic test D tpp1 the number of subjects per cluster equalled 165 for 10 clusters, 92 for 15 clusters, and 60 for 20 clusters. For diagnostic test D tpp2, these numbers were 330 (10 clusters), 162 (15 clusters), and 110 (20 clusters), respectively. It also becomes clear that the requirements for both sample throughput and reagent cost are less stringent for diagnostic test D tpp1. For approximately the same survey cost (~11,000 US$), the sample throughput and cost per test for diagnostic test D tpp1 can be 7 samples per hour and 3 US$, respectively, while these need to be 10 samples per hour and 1 US$ for diagnostic test D tpp2.
Let us now illustrate the potential contribution of our framework to R&D, considering a 2% program prevalence threshold (LL = 1% and UL = 3%). Table 3 highlights that R&D should be directed towards improved specificity rather than sensitivity. For instance, when sampling 20 clusters, 190 subjects per cluster must be sampled with the sensitivity-improved test KK se, whereas only 80 are required with the specificity-improved test KK sp.

Discussion
Diagnostic tests currently deployed in M&E surveys are often imperfect. In the present study, we developed a 2-stage LQAS framework for program decision-making that allows for both imperfect diagnostics and spatial heterogeneity of infections. We applied the framework using M&E of STH control programs as a case study. For this, we explored the impact of the diagnostic performance, spatial heterogeneity, and survey design on both the program decision-making around the prevalence decision thresholds recommended by WHO and the total survey costs. In addition, we assessed the trade-off between cost per test, sample throughput and diagnostic performance of tests. Finally, we illustrated how our framework could support the development of evidence-based guidelines and direct R&D.

Revision of WHO guidelines for M&E of STH programs may be warranted
The survey design currently proposed by WHO (5 schools and 50 subjects per school per implementation unit) allows for adequate (prevalence threshold of 20%) to ideal (prevalence threshold of 50%) program decision-making. However, the risk of an incorrect program decision around the prevalence thresholds of 2% and 10% may be too high given the geographical heterogeneity (ρ i = 0.02) in infection levels and the predefined grey zone (LL = 1% and UL = 3%) assumed here. Given that even perfect diagnostic tests cannot reduce this risk to an acceptable level under these assumptions (S1 Fig), revision of the WHO guidelines for M&E of STH programs may be warranted.

A paradigm shift from sensitivity to specificity is warranted
Our study confirms that the requirements for diagnostic test parameters are inversely correlated [6] and that specificity rather than sensitivity is the critical diagnostic parameter to consider when NTD programs progress towards the endgame [5]. These observations have already been explained in more detail elsewhere [5,6], and hence we restrict ourselves to underscoring their impact on research and development (R&D) of future diagnostic tests for NTD. First, we highlight that an important paradigm shift on diagnostic performance will be required from the NTD research community. Where the focus has mainly been on diagnostic sensitivity (sustained by the focus and conclusions of various research efforts [22][23][24]), it has now become clear that obtaining a high diagnostic specificity will be the driving force for R&D.

Fig 9. The contour lines illustrate the total survey cost when applying a Kato-Katz thick smear with improved specificity KK sp (se kksp = 55% and sp kksp = 99%) with varying sample throughput and reagent cost per sample relative to the total cost of a survey based on a test meeting the soil-transmitted helminth TPP diagnostic performance D tpp1 (se tpp1 = 60% and sp tpp1 = 99%) with the corresponding sample throughput (7 samples per hour per person) and cost characteristics (3 US$ reagent cost per test). The dotted line indicates the maximum reagent cost per test for the improved Kato-Katz thick smear (1.85 US$) when the sample throughput was set at 7. The intra-cluster correlation ρ i was set at 0.02 and the number of clusters at 10. The number of subjects per cluster was estimated as the minimum required number of subjects for adequate decision-making (E overtreat = 25% and E undertreat = 5%) around a program prevalence threshold of 2%, with the grey zone defined as T±50% (LL = 1% and UL = 3%). The total survey cost was defined as the minimal cost required. The graph was based on 10,000 Monte Carlo simulations. https://doi.org/10.1371/journal.pntd.0010353.g009
This need for more specific diagnostic tests is already apparent in the different TPPs that WHO recently published for onchocerciasis [25], lymphatic filariasis [26], soil-transmitted helminthiasis [17] and schistosomiasis [27]. In each of these TPPs, the required specificity did not drop below 94%, while the requirements for sensitivity were relaxed to 60%. Second, we suggest a re-appraisal of the value of the current diagnostic standards. For example, a single Kato-Katz thick smear is the current diagnostic standard for STH, and although it lacks sensitivity to detect infections of low intensity ([13,14,28,29]: Ascaris: ~55%, Trichuris: ~80%, hookworms: ~70%), it has a high specificity (≥98%) [19,30]. In other words, this cheap and simple method might still be valuable at the program's endgame, though the survey design will need to be adapted accordingly. For example, assuming the aforementioned diagnostic performance of a single Kato-Katz thick smear and the operational costs for the Kato-Katz thick smear described in the methods section, sampling 20 clusters and 230 subjects per cluster (total sample size = 4,600) is the most cost-effective survey design for Ascaris that allows for adequate program decision-making (E overtreat = 25%, E undertreat = 5%) around the 2% prevalence threshold (lower limit = 1% and upper limit = 3%) (Table 3). However, this survey design will require substantial financial resources (20,236 US$), though improvements to the Kato-Katz thick smear may drastically reduce this cost. Although there are several alternative diagnostics for STH (e.g., Mini-FLOTAC, McMaster, FECPAK G2 and qPCR), we strongly doubt whether they will outcompete the Kato-Katz thick smear: their specificity too is not perfect [13,31], and they come with substantially higher operational costs.

Compensating improved diagnostic tests' higher cost and lower throughput with lower sample sizes
Our results confirm that improving diagnostic performance results in smaller sample sizes for the same quality of program decision-making. They also highlight that improved diagnostic performance and the correspondingly reduced sample sizes can compensate for more costly tests and lower sample throughput. In addition, our results highlight that there is a limit to the extent to which higher reagent costs can be compensated by higher sample throughput and vice versa. For example, in our theoretical scenarios, the highest possible cost per test equalled 6.50 US$ and the minimal sample throughput 3 samples per person per hour when test sensitivity is 80% and specificity is 98%. These findings are crucial for R&D and will guide developers in making strategic choices. Indeed, different pairs of sensitivity and specificity have been included in the WHO TPP, yet each of them will come with a different survey design (see above), and because of this, the requirements for both sample throughput and reagent cost per test will differ.

Still a long way to go
Although we have illustrated the added value of our framework for developing more evidence-based M&E guidelines and making strategic R&D choices, there is still a long way to go. First, the optimal design will be specific to each implementation unit, as spatial heterogeneity is expected to differ across implementation units due to local dispersion of houses and hygienic behaviour. Second, the framework is applicable to any NTD, but it will need to be adapted for each specific NTD separately. This is because not only will the program prevalence thresholds be different, the spatial heterogeneity is also expected to vary across NTD. For instance, the geographical distribution of some NTD like lymphatic filariasis and schistosomiasis may be more focal (higher spatial heterogeneity) due to relatively low vector mobility, as compared to, for example, onchocerciasis (highly mobile vector) and STH (widespread environmental contamination). Moreover, it will be equally important for each NTD community to agree on the acceptable width of the grey zone separately for each program threshold. Third, although we focused primarily on the prevalence of STH infections of any intensity, the current framework can also be used to assess moderate-to-heavy intensity infections (WHO is striving to reduce the number of children that carry a moderate-to-heavy intensity infection to less than 2%) [32]. In this case, the presence of infections of any intensity is replaced by the presence of moderate-to-heavy intensity infections. Finally, we assumed that sensitivity and specificity do not vary across individuals. Yet, this assumption does not hold, as the sensitivity of the Kato-Katz thick smear (and most other parasitological diagnostic methods) is known to be positively related to infection intensity and therefore differs between individuals [15], as well as between the lower and upper boundary of the grey zone around a decision threshold.
We believe that relaxing this assumption will impact the survey design, the sample size and the corresponding decision cut-off for any program prevalence thresholds. Therefore, future research should prioritize the expansion of the framework allowing for varying sensitivity as a function of egg counts and day-to-day variation in egg excretion. This expansion would allow for more insights into the impact of increased sampling (1 vs. 2 stool samples) and diagnostic effort (single vs. duplicate Kato-Katz thick smear) on the survey design, which in turn would provide a framework to make more evidence-based and cost-efficient recommendations.
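As a first step in that direction, intensity-dependent sensitivity can be modelled explicitly. The sketch below assumes, hypothetically, that egg counts on a single Kato-Katz slide follow a Poisson distribution with mean epg/24 (one slide examines roughly 1/24 g of stool), so the probability of detecting at least one egg rises with infection intensity; the authors' future framework may use a different egg-count model.

```python
import math

def kk_sensitivity(epg, grams_per_slide=1.0 / 24.0):
    """Hypothetical intensity-dependent sensitivity of a single Kato-Katz
    thick smear: P(at least one egg on the slide) under a Poisson model
    for egg counts, given eggs per gram (epg) of stool."""
    return 1.0 - math.exp(-epg * grams_per_slide)

# light infections are frequently missed, heavy ones almost never
low, heavy = kk_sensitivity(24), kk_sensitivity(1_000)
```

Replacing a fixed se d with such a curve would let the simulated number of test positives depend on the intensity profile of each cluster, rather than on prevalence alone.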

Conclusion
Our framework allows for the assessment and updating of M&E guidelines for NTD and can guide product development choices. Using STH as a case study, we show that current M&E guidelines may severely fall short, especially in low-endemic and post-control settings. Furthermore, specificity rather than sensitivity becomes a critical parameter to consider, and sampling more clusters (≥10) may be necessary in case of considerable heterogeneity in the geographical distribution of NTD infections.
Supporting information

S1 Table. Parameterisation of the variables used to illustrate the 2-stage LQAS framework for STH control programs. T: program decision prevalence threshold; LL: lower limit of the grey zone; E overtreat: highest allowed probability of falsely continuing or upscaling an intervention within an implementation unit i; E undertreat: highest allowed probability of prematurely stopping or scaling down interventions within an implementation unit i; ρ i: intra-cluster correlation; se d: sensitivity of an imperfect diagnostic test D; sp d: specificity of an imperfect diagnostic test D; n clust: number of clusters; n sub: number of subjects per cluster. _: output variable and hence not fixed at one value. *: The number of subjects per cluster was determined as the minimum required number of subjects that allow for adequate program decision-making (E overtreat = 25%, E undertreat = 5%) around a 2% program prevalence threshold. A total of 10,000 Monte Carlo simulations was used. (DOCX)

S2 Table. Summary of diagnostic methods used to illustrate the 2-stage LQAS framework for STH control programs. TPPs: target product profiles. (DOCX)

S1 Fig. The relationship between the risk of under- and overtreatment when deploying a perfect diagnostic test. ε undertreat: probability of prematurely reducing interventions within an implementation unit i; ε overtreat: probability of falsely continuing or upscaling an intervention frequency within an implementation unit i. The blue area indicates combinations of the risk of under- and overtreatment that allow for adequate decision-making (ε undertreat = 5%, ε overtreat = 25%). We defined the grey zone as T±50% (2%: lower limit = 1% and upper limit = 3%). The intra-cluster correlation ρ i was set at 0.02, the number of clusters at 5, and the number of subjects per cluster at 50. The graph was based on 10,000 Monte Carlo simulations. (TIF)