Assessment of the required performance and the development of corresponding program decision rules for neglected tropical diseases diagnostic tests: Monitoring and evaluation of soil-transmitted helminthiasis control programs as a case study

Recently, the World Health Organization established the Diagnostic Technical Advisory Group to identify and prioritize diagnostic needs for neglected tropical diseases, and ultimately to describe the minimal and ideal characteristics for new diagnostic tests (the so-called target product profiles (TPPs)). We developed two generic frameworks: one to explore and determine the required sensitivity (the probability of correctly detecting diseased persons) and specificity (the probability of correctly detecting persons free of disease), and another to determine the corresponding sample sizes and decision rules based on a multi-category lot quality assurance sampling (MC-LQAS) approach that accounts for imperfect tests. We applied both frameworks to the monitoring and evaluation of soil-transmitted helminthiasis control programs. Our study indicates that specificity, rather than sensitivity, becomes more important as a program approaches the endgame of elimination, and that the requirements for both parameters are inversely correlated, resulting in multiple combinations of sensitivity and specificity that allow for reliable decision making. The MC-LQAS framework highlighted that improving diagnostic performance results in a smaller sample size for the same level of program decision making. In other words, the additional cost per test of a better-performing diagnostic may be compensated by lower operational costs in the field. Based on our results, we propose the required minimal and ideal diagnostic sensitivity and specificity for diagnostic tests applied in the monitoring and evaluation of soil-transmitted helminthiasis control programs.

Traditionally, STHs have been diagnosed by detecting worm-specific eggs in stool using a compound light microscope. Since the 1990s, the Kato-Katz thick smear has been the WHO-recommended diagnostic standard for quantifying eggs in stool [7], and hence it has been used to guide soil-transmitted helminthiasis control programs. During the last decade, a variety of new diagnostic tests have been introduced to the STH field, including both other microscopy-based [8-10] and DNA-based methods [11]. Each of these tests has important advantages and disadvantages compared with the Kato-Katz. Important advantages are a clearer microscopic view [8,9], a higher clinical sensitivity (the proportion of diseased individuals correctly diagnosed as infected) [12,13], opportunities for automated egg counting and quality control [10,14], the ability to differentiate hookworm species [11] and the ability to simultaneously detect parasites other than STHs [8,9,11]. The chief limitations of these novel tests are the need for well-equipped laboratories with well-trained technicians, the need to transport samples to a distant laboratory, the higher cost of processing large numbers of samples [15,16], and the lack of standardized protocols for DNA-based methods [11,17,18]. Currently, most diagnostic technologies based on biomarkers other than eggs or DNA (e.g., antigens, antibodies and metabolites) or on other sample matrices (e.g., serum and urine) are either unexplored or still in the research phase [19-22]. As these new diagnostic technologies transition from research to routine program tools, careful consideration must be given to their performance when used by NTD programs to make public health decisions.
In the present study, we developed a generic framework to explore the impact of diagnostic test sensitivity and specificity at the individual level on program decision making at the population level, with the ultimate aim of better defining the minimum TPP sensitivity and specificity targets for diagnostic tests for PC-targeted NTDs. To this end, we first explored the impact of diagnostic sensitivity and specificity on the probability of making an incorrect program decision within a soil-transmitted helminthiasis control program: unnecessarily selecting a PC frequency that is greater than indicated by the true prevalence, or prematurely reducing the frequency of PC. Subsequently, we developed a multi-category lot quality assurance sampling (MC-LQAS) framework that incorporates imperfect test performance to determine the corresponding sample size and associated decision rules.

Required sensitivity and specificity
General framework. A program decision is generally based on the outcome of an epidemiological survey in which N_tot subjects are screened for the presence of any infection. The observed prevalence (the proportion of positive test results N+ out of N_tot, which includes both false and true positive test results) is then compared to a program decision threshold (T). Rather than a proportion, one can equivalently verify whether the number of positive test results N+ equals or exceeds a corresponding count T'. When we assume a diagnostic test D with a sensitivity Se_d and a specificity Sp_d, a true underlying prevalence equal to Prev_true and a sample size of N_tot, the probability of observing at least T' positive results can be written as

P(N+ ≥ T') = Σ_{k=T'}^{N_tot} C(N_tot, k) × Prob+^k × (1 − Prob+)^(N_tot − k),   (1)

where the probability that a single test result is positive equals

Prob+ = Se_d × Prev_true + (1 − Sp_d) × (1 − Prev_true).   (2)

It is important to note that T' is not a fixed value; rather, it is a function of the total number of subjects screened (N_tot), the program decision threshold (T) and the diagnostic performance of the test (Se_d and Sp_d), which is best illustrated with a few toy examples. Assume that we screen 500 subjects (N_tot) with a perfect test (Se_d = Sp_d = 100%) and that the program decision threshold T is set at 50%; then T' equals 250. If 1,000 subjects are screened with a perfect test, T' equals 500. Given the same N_tot (1,000 subjects) and diagnostic performance but a T of 2% instead of 50%, T' equals 20. When an imperfect diagnostic test (Se_d = 80% and Sp_d = 80%) is used to screen 1,000 subjects and decisions are made around a program decision threshold T of 2%, T' equals 212, or more generally

T' = N_tot × (Se_d × T + (1 − Sp_d) × (1 − T)).   (3)

Combining (1)-(3) allows one to explore the impact of Se_d and Sp_d on the probability of making an incorrect program decision around a set of program decision thresholds T. For example, suppose 500 subjects (N_tot) are randomly selected from a population where the true underlying prevalence equals 45% (Prev_true) and a threshold of 50% (T) is used to make program decisions. The probability of N+ ≥ T', and therefore of unnecessarily selecting a PC frequency that is higher than indicated by the true prevalence, equals 1.4% when a perfect test (Se_d = Sp_d = 100%) is applied and 9.7% for an imperfect test (Se_d = Sp_d = 80%). Similarly, one can determine the probability of prematurely reducing the PC frequency. For example, if we change the true underlying prevalence from 45% to 55% (Prev_true ≥ T), the probability of N+ < T', and therefore of prematurely reducing the PC frequency, equals 1.1% (= 1 − the probability of N+ ≥ T') when a perfect test (Se_d = Sp_d = 100%) is applied and 8.2% for the same imperfect test (Se_d = Sp_d = 80%).
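The calculation above can be sketched in a few lines of code. The snippet below is a minimal Python sketch (it assumes scipy is available; the function and variable names are ours, not taken from the original analysis) that implements Eqs (1)-(3) and reproduces the probabilities quoted in the toy example.

```python
from math import ceil
from scipy.stats import binom

def prob_at_least_t_prime(n_tot, se, sp, prev_true, threshold):
    """Probability of observing at least T' positive test results (Eqs 1-3)."""
    # Eq (2): probability that a single test result is positive (true + false positives)
    prob_pos = se * prev_true + (1 - sp) * (1 - prev_true)
    # Eq (3): decision cut-off T' expressed as a count of positive test results
    t_prime = n_tot * (se * threshold + (1 - sp) * (1 - threshold))
    # Eq (1): binomial tail probability P(N+ >= T')
    return binom.sf(ceil(t_prime) - 1, n_tot, prob_pos)

# Toy examples from the text (N_tot = 500, T = 50%)
print(prob_at_least_t_prime(500, 1.0, 1.0, 0.45, 0.50))      # ~0.014: overtreatment error, perfect test
print(prob_at_least_t_prime(500, 0.8, 0.8, 0.45, 0.50))      # ~0.097: overtreatment error, imperfect test
print(1 - prob_at_least_t_prime(500, 1.0, 1.0, 0.55, 0.50))  # ~0.011: undertreatment error, perfect test
print(1 - prob_at_least_t_prime(500, 0.8, 0.8, 0.55, 0.50))  # ~0.082: undertreatment error, imperfect test
```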
Data generation. For this analysis, we fixed N_tot at 500 but varied both Se_d and Sp_d from 60% to 100% in 1% increments (resulting in 41 × 41 theoretic diagnostic tests) and Prev_true from 0% to 100% in 0.2% increments. The program decision thresholds included those currently recommended for an STH control program (2%, 10%, 20% and 50%). In addition, we included program thresholds of 1% and 5%, because the current program thresholds are based on the prevalence observed with the Kato-Katz thick smear, for which we know the specificity is not 100% [23,24]. As a consequence, the observed prevalence may overestimate the true underlying prevalence as the latter approaches zero.
Analysis of generated data. To further illustrate the interpretation of the obtained data, we worked out a toy example in Fig 1. This figure represents the probability of N+ ≥ T' over a wide range of Prev_true when an imperfect diagnostic test (Se_d = Sp_d = 80%) is applied. Given a program decision threshold T of 50% (vertical solid line), we can deduce both the error of unnecessarily selecting a PC frequency that is greater than needed (ε_overtreat) and the error of prematurely reducing the frequency of PC (ε_undertreat). These errors are analogous to 1 minus the negative predictive value and 1 minus the positive predictive value, as used in recent NTD modelling studies on optimal program decision thresholds [25-27]. Subsequently, we can also deduce to what extent this diagnostic test allows for reliable decision making. In the present study, we use two operating definitions of 'reliable' based on both errors. In both definitions, we set the highest allowed probability of prematurely reducing the PC frequency (E_undertreat) at 5%, whereas the highest allowed probability of falsely continuing or increasing the PC frequency (E_overtreat) was set at either 10% or 25%. Generally, a lower value for E_undertreat is preferred, as prematurely reducing the PC frequency may lead to an increase in infection and morbidity. The two values for E_overtreat allow us to differentiate between adequate (E_overtreat = 25%) and ideal (E_overtreat = 10%) program decision making scenarios. In the remainder of the document, we refer to (in)adequate and (less than) ideal program decision making when E_overtreat is set at 25% and 10%, respectively. The values for E_undertreat and E_overtreat used here have also been applied previously to determine the sensitivity and specificity of diagnostic tests for other helminth diseases [28].
In the toy example (Fig 1), the diagnostic test performed at ε_undertreat ≤ 5% when Prev_true was at least 55.8% and at ε_overtreat ≤ 25% when Prev_true was not higher than 47.2%. In other words, any program decision made within the Prev_true interval ]47.2%; 55.8%[ is considered inadequate when applying this test; we will refer to this interval as the 'grey zone'. It is expected that, for a given sample size, the grey zone narrows with higher levels of diagnostic sensitivity and specificity. Because the width of the grey zone also depends on binomial variation, and thus on the program decision threshold itself, we quantified the grey zone for each combination of Se_d, Sp_d and program decision threshold separately.

Fig 1 (caption fragment). The yellow areas highlight the program errors ε_overtreat (Prev_true < 50%) and ε_undertreat (Prev_true ≥ 50%). The horizontal black dashed lines represent an ε_overtreat equal to 25% and an ε_undertreat equal to 5% (= 100% − 95%); the vertical red dashed lines indicate the corresponding Prev_true. The grey zone indicates the range of Prev_true for which the diagnostic test is considered inadequate to make a well-informed program decision (ε_overtreat > 25% and ε_undertreat > 5%).
https://doi.org/10.1371/journal.pntd.0009740.g001
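The grey zone can be located numerically by scanning Prev_true on the same 0.2% grid used in the data-generation step. The sketch below is a simplified Python illustration (names are of our choosing; the tail-probability helper from the previous sketch is repeated here so the block runs on its own) that returns the bounds of the grey zone for a given test, sample size and threshold.

```python
import numpy as np
from math import ceil
from scipy.stats import binom

def prob_at_least_t_prime(n_tot, se, sp, prev_true, threshold):
    """P(N+ >= T') following Eqs (1)-(3)."""
    prob_pos = se * prev_true + (1 - sp) * (1 - prev_true)
    t_prime = n_tot * (se * threshold + (1 - sp) * (1 - threshold))
    return binom.sf(ceil(t_prime) - 1, n_tot, prob_pos)

def grey_zone(n_tot, se, sp, threshold, e_over=0.25, e_under=0.05):
    """Bounds of the grey zone: the Prev_true interval in which neither error criterion is met."""
    prevs = np.arange(0.0, 1.0001, 0.002)  # 0.2% increments, as in the data generation step
    below = [p for p in prevs if p < threshold
             and prob_at_least_t_prime(n_tot, se, sp, p, threshold) <= e_over]
    above = [p for p in prevs if p >= threshold
             and 1 - prob_at_least_t_prime(n_tot, se, sp, p, threshold) <= e_under]
    lower = max(below) if below else 0.0  # highest Prev_true below T still giving eps_overtreat <= E_overtreat
    upper = min(above) if above else 1.0  # lowest Prev_true at/above T already giving eps_undertreat <= E_undertreat
    return lower, upper

lo, hi = grey_zone(500, se=0.8, sp=0.8, threshold=0.50)
print(lo, hi, hi - lo)  # approximately reproduces the grey zone of Fig 1 (about 47% to 56%)
```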
In order to further differentiate diagnostic tests with small grey zones from those with wider ones, we classified the grey zones into three levels (1-3) for each program decision threshold T separately. This classification was based on the 25th and 75th percentiles of the width of the grey zones (level 1: width < 25th percentile; level 2: 25th percentile ≤ width < 75th percentile; level 3: width ≥ 75th percentile; see S1 Table) across all potential diagnostic methods that allowed for adequate program decision making, i.e., all methods that allowed for adequate decision making (E_overtreat set at 25%) at a true underlying prevalence of zero and of 100%. Finally, we arbitrarily classified the diagnostic tests into 'minimal' and 'optimal' based on their levels of grey zone across the six program decision thresholds. A diagnostic test was considered 'optimal' when it resulted in a level 1 grey zone for at least 3 of the 6 program decision thresholds and did not result in a level 3 grey zone for any of them. In all other cases, the diagnostic test was considered 'minimal'.
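The level classification itself is a simple percentile rule. A possible implementation is shown below (Python with numpy; purely illustrative, and the example widths are invented for demonstration).

```python
import numpy as np

def classify_widths(widths):
    """Assign levels 1-3 to grey-zone widths using the 25th and 75th percentiles (one threshold at a time)."""
    widths = np.asarray(widths, dtype=float)
    q25, q75 = np.percentile(widths, [25, 75])
    # level 1: width < q25; level 2: q25 <= width < q75; level 3: width >= q75
    return np.where(widths < q25, 1, np.where(widths < q75, 2, 3))

# e.g. grey-zone widths (in percentage points) at one threshold for a set of candidate tests
print(classify_widths([3.2, 4.8, 5.6, 7.0, 9.4, 12.8]))
```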

MC-LQAS framework
General framework for LQAS. Lot quality assurance sampling (LQAS) is a technique to gather the minimal amount of information required for decision making, using as small a sample size as possible. Instead of constructing a precise estimate of a population parameter, LQAS aims to establish whether the population parameter is above or below some decision cut-off c with some desired minimal probability. For STH, LQAS can be used to verify whether the observed number of positive test results (N+) in a random sample (N_tot) equals or exceeds a predefined decision cut-off c [29,30], followed by continuing the current PC frequency if this is the case and reducing the PC frequency in all other cases. The sample size N_tot and the corresponding decision cut-off c are chosen to satisfy two conditions. The first is that for some prevalence Prev_true less than the program decision threshold T (Prev_true<T), the probability ε_overtreat of selecting a PC frequency that is higher than indicated by the true underlying prevalence does not exceed the target probability E_overtreat. The second is that for some Prev_true equal to or above the program decision threshold T (Prev_true≥T), the probability ε_undertreat of prematurely reducing the PC frequency is not higher than E_undertreat. Based on Eqs (1)-(3), these conditions can be written as

ε_overtreat = P(N+ ≥ c | N_tot, Prev_true<T) ≤ E_overtreat,   (4)
ε_undertreat = P(N+ < c | N_tot, Prev_true≥T) ≤ E_undertreat,   (5)

where both probabilities are binomial tail probabilities obtained from Eq (1), with the per-subject positivity Prob+ of Eq (2) evaluated at the respective Prev_true.

Process to determine the decision cut-off c within LQAS. Fig 2 further illustrates the process of determining the appropriate decision cut-off for two theoretical diagnostic tests. In this example, we determined the decision cut-off c for a sample size of 500 subjects (N_tot) that allowed for E_overtreat ≤ 25% and E_undertreat ≤ 5% at a Prev_true<T arbitrarily set at 45% and at a Prev_true≥T arbitrarily set at 55% (program decision threshold T = 50%), respectively. To contrast the findings, we determined c for both a perfect (Se_d = Sp_d = 100%) and an imperfect test (Se_d = Sp_d = 80%).
For both theoretical diagnostic tests there is a range of possible values for c. For a perfect test (Se_d = Sp_d = 100%), any value between 233 (Fig 2B) and 257 (Fig 2A) can be used, whereas for an imperfect test (Se_d = Sp_d = 80%) the range of possible values is narrower, spanning only 244 (Fig 2E) to 247 (Fig 2D). This reduction in the options for c with an imperfect test is also reflected in the panels representing the probability that the number of positive test results (N+) in a random sample of N_tot subjects is at least c over a wide range of true underlying prevalence (Prev_true) (Fig 2C and 2F). Whereas the two lines almost overlap for an imperfect test, there is a shift in Prev_true of 5 percentage points between the two lines for a perfect test.
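A minimal sketch of how such a cut-off range can be searched exhaustively is given below (Python; the function names are ours, not from the original analysis). It enumerates all candidate cut-offs c and keeps those satisfying conditions (4) and (5), which should roughly reproduce the ranges quoted above for the two theoretical tests.

```python
from scipy.stats import binom

def prob_positive(se, sp, prev_true):
    """Eq (2): probability that a single test result is positive."""
    return se * prev_true + (1 - sp) * (1 - prev_true)

def cutoff_range(n_tot, se, sp, prev_below, prev_above, e_over=0.25, e_under=0.05):
    """Feasible LQAS decision cut-offs c satisfying both error conditions (Eqs 4 and 5)."""
    p_below = prob_positive(se, sp, prev_below)
    p_above = prob_positive(se, sp, prev_above)
    feasible = [c for c in range(n_tot + 1)
                if binom.sf(c - 1, n_tot, p_below) <= e_over      # P(N+ >= c | Prev_true < T) <= E_overtreat
                and binom.cdf(c - 1, n_tot, p_above) <= e_under]  # P(N+ <  c | Prev_true >= T) <= E_undertreat
    return (min(feasible), max(feasible)) if feasible else None

print(cutoff_range(500, 1.0, 1.0, 0.45, 0.55))  # perfect test: roughly (233, 257), as in Fig 2A and 2B
print(cutoff_range(500, 0.8, 0.8, 0.45, 0.55))  # imperfect test: roughly (244, 247), as in Fig 2D and 2E
```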
Expansion of the framework to MC-LQAS. In STH control programs, decisions are made around multiple program decision thresholds, and hence an MC-LQAS (based on multiple decision cut-offs) is more appropriate. In 2012, Olives et al. described the mathematical underpinnings of a multi-category LQAS for schistosomiasis based on two decision cut-offs, resulting in three categories (three-way MC-LQAS) [31].

Fig 2 (caption). The different panels illustrate the process of determining the decision cut-off c when 500 subjects (N_tot) are randomly recruited, for both a perfect test (sensitivity (Se_d) = specificity (Sp_d) = 100%; Panels A-C) and an imperfect test (Se_d = Sp_d = 80%; Panels D-F). Panels A and D represent the cumulative error of prematurely reducing preventive chemotherapy (PC) (ε_undertreat) when the true underlying prevalence was arbitrarily set at 55% (Prev_true≥T). The horizontal dashed line represents an ε_undertreat of 5%; the red dashed line represents the limiting decision cut-off c resulting in an ε_undertreat ≤ 5%. The red area under the curve highlights all possible values of c resulting in an ε_undertreat ≤ 5%. Panels B and E represent the cumulative error of selecting a PC frequency that is higher than needed (ε_overtreat) when the true underlying prevalence was arbitrarily set at 45% (Prev_true<T). The horizontal dashed line represents an ε_overtreat of 25%; the blue dashed line represents the lowest possible decision cut-off c resulting in an ε_overtreat ≤ 25%. The blue area under the curve highlights all possible values of c resulting in an ε_overtreat ≤ 25%. Panels C and F represent the probability (in %) of the number of positive test results (N+) in a random sample of N_tot subjects being at least c over a wide range of true underlying prevalence (Prev_true), based on the two extreme decision cut-offs (red line: lowest possible value; blue line: highest possible value). The vertical solid line represents the program decision threshold T of 50%. The horizontal black dashed lines represent an ε_overtreat equal to 25% and an ε_undertreat equal to 5% (= 100% − 95%). The grey zone indicates the range of Prev_true for which decision making is inadequate (ε_overtreat > 25% (blue dashed line) and ε_undertreat > 5% (red dashed line)). In this example, the grey zone ranges from 45% to 55% by design.
https://doi.org/10.1371/journal.pntd.0009740.g002

Fig 3. The build-up of multi-category LQAS for STH control program decision making using an imperfect test.
The different panels illustrate the build-up of a multi-category LQAS around 4 program decision thresholds T (2%, 10%, 20% and 50%) when applying an imperfect test (sensitivity (Se_d) = 76% and specificity (Sp_d) = 99%) to 500 randomly selected subjects (N_tot). Panel A provides the probability (in %) of the number of positive test results (N+) in a random sample of N_tot subjects (= 500) being at least c, separately for each of the 4 thresholds, their corresponding decision cut-offs (c_2% = 13, c_10% = 41, c_20% = 84, c_50% = 182) and the true underlying prevalence Prev_true (Prev_true<2%: 0.0%, Prev_true≥2%: 4.0%; Prev_true<10%: 7.5%, Prev_true≥10%: 12.5%; Prev_true<20%: 15.0%, Prev_true≥20%: 25.0%; Prev_true<50%: 45.0%, Prev_true≥50%: 55.0%). Note that these Prev_true values define the borders of the grey zones around the program thresholds, and for these Prev_true values ε_overtreat ≤ 25% and ε_undertreat ≤ 5%. The vertical solid lines represent the program decision thresholds T (orange: 2%, red: 10%, green: 20% and blue: 50%). The horizontal black dashed lines represent an ε_overtreat equal to 25% and an ε_undertreat equal to 5% (= 100% − 95%). The grey zones indicate the ranges of Prev_true for which decision making is inadequate (ε_overtreat > 25% and ε_undertreat > 5%). Panel B provides the same information as Panel A, but highlights the error of falsely scaling up the PC frequency (solid surfaces). Panels C and D represent the probability of correct program decision making across a wide range of Prev_true, where Panel C provides an overview of the relative contribution of ε_overtreat (colored areas) to the program decision making.
The sum of ε_undertreat and ε_overtreat gives the probability of making an incorrect program decision; in other words, 1 − (ε_undertreat + ε_overtreat), or 1 − ε, gives the probability of correct program decision making. Fig 3C and 3D represent the probability of correct program decision making across a wide range of Prev_true, where Fig 3C provides an overview of the relative contribution of ε_undertreat and ε_overtreat to the program decision making. It is important to note that the different decision cut-offs c_Ti in this example are not based on (4) and (5) for each threshold separately; rather, they were determined using the conditions below.

For each program decision threshold T_i (T_1 = 2%, T_2 = 10%, T_3 = 20%, T_4 = 50%) with decision cut-off c_Ti, the cut-offs were chosen such that

P(N+ ≥ c_Ti | N_tot, Prev_true<Ti) ≤ E_(2i−1),
P(N+ < c_Ti | N_tot, Prev_true≥Ti) ≤ E_(2i),

where the E given Prev_true<Ti (indicated with an odd subscript) represents the allowed probability of selecting a PC frequency that is greater than indicated by the true underlying prevalence, and the E given Prev_true≥Ti (indicated with an even subscript) represents the allowed probability of prematurely reducing the PC frequency. In this example, the E given Prev_true<Ti was set at 25% and that given Prev_true≥Ti at 5%.

Determining the sample size N_tot and the decision cut-offs c for the required sensitivity and specificity within MC-LQAS. We determined the sample size (N_tot) and the corresponding decision cut-offs c_Ti for those theoretical diagnostic tests that allowed for adequate or ideal program decision making. We varied N_tot from 150 to 2,000 (in increments of 1); the corresponding decision cut-offs were based on the conditions above. In this MC-LQAS, we considered all thresholds currently used in STH control programs (2%, 10%, 20% and 50%). For the corresponding Prev_true limits, we used those of the example illustrated in Fig 3 (Prev_true<2%: 0.0%, Prev_true≥2%: 4.0%; Prev_true<10%: 7.5%, Prev_true≥10%: 12.5%; Prev_true<20%: 15.0%, Prev_true≥20%: 25.0%; Prev_true<50%: 45.0%, Prev_true≥50%: 55.0%). The E at Prev_true≥T was set at 5%; the E at Prev_true<T was set at 25% for adequate program decision making and at 10% for ideal program decision making.
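As an illustration of this search, the sketch below (Python; names are ours, and the published analysis may impose additional constraints, for example on how the cut-offs relate across thresholds) scans N_tot from 150 to 2,000 and returns the smallest sample size for which every program decision threshold admits at least one feasible decision cut-off.

```python
import numpy as np
from scipy.stats import binom

def feasible_cutoff_exists(n_tot, se, sp, prev_lo, prev_hi, e_over, e_under):
    """True if some cut-off c satisfies both error conditions (cf. Eqs 4 and 5) for this threshold."""
    p_lo = se * prev_lo + (1 - sp) * (1 - prev_lo)
    p_hi = se * prev_hi + (1 - sp) * (1 - prev_hi)
    c = np.arange(n_tot + 1)
    over_ok = binom.sf(c - 1, n_tot, p_lo) <= e_over     # P(N+ >= c | Prev_true below T) within E_overtreat
    under_ok = binom.cdf(c - 1, n_tot, p_hi) <= e_under  # P(N+ <  c | Prev_true at/above T) within E_undertreat
    return bool(np.any(over_ok & under_ok))

def min_sample_size(se, sp, e_over=0.25, e_under=0.05):
    """Smallest N_tot in 150-2,000 for which every program threshold admits a decision cut-off."""
    # Prev_true bounds used in the text for the four program decision thresholds
    bounds = {0.02: (0.000, 0.040), 0.10: (0.075, 0.125),
              0.20: (0.150, 0.250), 0.50: (0.450, 0.550)}
    for n_tot in range(150, 2001):
        if all(feasible_cutoff_exists(n_tot, se, sp, lo, hi, e_over, e_under)
               for lo, hi in bounds.values()):
            return n_tot
    return None

# Compare candidate tests; improving specificity should yield the larger reduction in N_tot
for se, sp in [(0.96, 0.96), (1.00, 0.96), (0.96, 1.00)]:
    print(se, sp, min_sample_size(se, sp))
```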

Required sensitivity and specificity
Fig 4 compares the grey zones obtained with diagnostic tests of different performance (including a perfect test, D_4), and Fig 5 illustrates the impact on the grey zone of (i) the allowed program error (E_overtreat = 25% (Fig 5A) vs. E_overtreat = 10% (Fig 5B)), (ii) the program decision threshold (50% (Fig 5A) vs. 2% (Fig 5C)) and (iii) the diagnostic performance (diagnostic test D_2 (Fig 5C) vs. diagnostic test D_3 (Fig 5D)). Taken together, these figures highlight three important aspects. First, they indicate that program decision making becomes inadequate (ε_overtreat > 25% and ε_undertreat > 5%) when the true underlying prevalence (Prev_true) approaches the program decision threshold T, even if a perfect diagnostic method (D_4) is applied. Second, they confirm that improved diagnostic tests (Fig 4), less stringent program errors (Fig 5A and 5B) and lower program thresholds (Fig 5B and 5C) allow for narrower grey zones. Third, improving the specificity has a greater impact on program decision making than improving the sensitivity, and the impact of specificity increases as the program decision threshold shifts towards 2%. Indeed, for a program threshold of 50%, the grey zones of diagnostic methods D_2 (Se_d2 = 100% and Sp_d2 = 60%) and D_3 (Se_d3 = 60% and Sp_d3 = 100%) are equally wide (Fig 4), whereas for lower program decision thresholds the grey zone of diagnostic method D_3 is smaller than that of diagnostic method D_2 (at the 2% threshold: ~3 vs. ~8 percentage points) (Fig 5C and 5D).

Fig 5 (caption fragment). Panels A, B and C are based on diagnostic test D_2 (Se_d2 = 100% and Sp_d2 = 60%); Panel D is based on diagnostic test D_3 (Se_d3 = 60% and Sp_d3 = 100%). The grey area represents the range of true underlying prevalence for which program decision making is inadequate (ε_overtreat > 25% and ε_undertreat > 5%; Panels A, C and D) or not ideal (ε_overtreat > 10% and ε_undertreat > 5%; Panel B).
https://doi.org/10.1371/journal.pntd.0009740.g005

Fig 6 further summarizes the width of the grey zone for each of the 1,681 theoretic diagnostic tests by means of contour plots (each line represents the same width of grey zone) for adequate program decision making (S1 Fig provides the contour plots for ideal decision making). This figure highlights that multiple combinations of sensitivity and specificity can result in the same width of grey zone. For example, there are 408 combinations that result in a grey zone ~10 percentage points wide around a program decision threshold T of 10%. However, for each of these combinations the sensitivity and specificity are inversely correlated (if the sensitivity increases, the specificity decreases). Indeed, when the sensitivity is set at 60%, the specificity should not drop below ~83%. Similarly, a sensitivity of at least ~91% is required to obtain the same level of accurate decision making when the specificity is fixed at 60%. The figure also indicates that not all combinations can be recommended for monitoring and evaluation of STH programs, as the width of the grey zone would be too large to be relevant. An extreme case is program decisions around a 2% threshold, where grey zones wider than 5 percentage points would include a true underlying prevalence of zero, and hence would result in unnecessarily distributing drugs when the disease has already been eliminated.

Fig 6. The width of grey zones around 6 program decision thresholds for 1,681 theoretic diagnostic tests. These contour plots illustrate the width of the grey zone for each of the unique combinations of sensitivity and specificity when decision making is adequate (ε_overtreat ≤ 25% and ε_undertreat ≤ 5%), each line representing the same width of grey zone. The number beside each line represents the floor value of the width of the grey zone in % (e.g., any value ≥10% and <11% is shown as 10%).
https://doi.org/10.1371/journal.pntd.0009740.g006
Of the 1,681 pairs of sensitivity (n = 41) and specificity (n = 41) that were evaluated, 207 combinations allowed for adequate (ε_overtreat ≤ 25% and ε_undertreat ≤ 5%) program decision making and 61 resulted in ideal program decisions (ε_overtreat ≤ 10% and ε_undertreat ≤ 5%) across each of the 6 program decision thresholds. In other words, these combinations allowed for adequate or ideal decision making when the true underlying prevalence was zero or 100%, for all thresholds. Tables 1 and 2 provide an overview of the different possible diagnostic tests and their corresponding grey zones for an ε_overtreat of at most 25% and 10%, respectively. For simplicity, we classified the width of the grey zone into three levels (1-3) for each threshold separately. This classification was based, for each program decision threshold separately, on the 25th and 75th percentiles of the width of the grey zones (level 1: width < 25th percentile; level 2: 25th percentile ≤ width < 75th percentile; level 3: width ≥ 75th percentile; see S1 Table).
Generally, these tables highlight four important aspects. First, they confirm that not all pairs of sensitivity and specificity allow for reliable decision making throughout all program phases. For example, combinations with a specificity <94% are not included in Table 1. Second, they also confirm that the diagnostic requirements become more stringent as program thresholds shift towards 1%. This is because level 3 of the width of the grey zone in both tables is restricted by the program threshold of 1%. In other words, there are a number of diagnostic tests that allowed for adequate or ideal program decision making around program decision thresholds between 2% and 50%, but failed to do so around a threshold T of 1%. Third, the requirements for specificity and sensitivity are inversely correlated with each other; if the requirements are relaxed for one parameter, those for the other become more stringent. For example, if the specificity is 100% in Table 1, the lowest sensitivity resulting in sufficient program decision making is 60%, whereas for a specificity of 94%, a sensitivity of at least 86% is required for sufficient decision making.
Fourth, when comparing Table 1 and Table 2 it becomes apparent that ideal program decisions require improved diagnostic tests. In contrast to adequate program decision making (Table 1), for which there are 207 potential diagnostic tests, there are only 61 for ideal program decision making (Table 2). In addition, the requirements for specificity are more stringent: for ideal decision making the specificity cannot drop below 99% (Table 2), whereas it was 94% for adequate decision making (Table 1).
In Table 3 we cross-tabulated the pairs of sensitivity and specificity across the two levels of program decision making (adequate vs. ideal) and the two types of diagnostic test (minimal vs. optimal). From the panel displaying the required sample sizes (Fig 7A), we can deduce that improving the specificity has more impact on the sample size than improving the sensitivity. For example, improving the sensitivity from 96% to 100% while the specificity remains 96% allows the sample size to be reduced only to 285, whereas improving the specificity from 96% to 100% while the sensitivity is fixed at 96% allows the sample size to be reduced further to 209. Not unexpectedly, the sample size increases when ideal rather than adequate program decision making is required, as illustrated in Fig 7B. Fig 7C illustrates the variation in decision cut-offs, highlighting that these values decrease as diagnostic tests approach perfection, which can be partially explained by the variation in sample size (see Fig 7A). The data used to determine the required diagnostic performance, the sample size and the corresponding decision cut-offs are provided in S1 Data.

Discussion
This study presents a generic and readily adaptable framework to explore the impact of diagnostic test sensitivity and specificity at the individual level on program decision making, in this instance applied to STH decision thresholds. Our results emphasize that specificity, rather than sensitivity, will become increasingly important at the endgame, as decision-relevant prevalence thresholds become lower. Although it is commonly stated that sensitivity is the most important diagnostic parameter when the prevalence drops [32-34], our study suggests the opposite. Indeed, the outcome of the simulation study indicated that there are fewer options for specificity (≥94%) than for sensitivity (≥60%) when it comes to sufficient program decision making, and that increasing specificity improved the overall accuracy of program decision making (narrower grey zones; Fig 6, Tables 1 and 2 and S1 Fig). Expanding this to explore the outcome of decision making using MC-LQAS further highlighted that improving specificity would result in considerably lower operational costs in the field (fewer subjects required to make adequate or ideal program decisions; Fig 7).

Table 2 (caption). The table represents the width of the grey zone around the six program decision thresholds T (1%, 2%, 5%, 10%, 20% and 50%) that allowed for sufficient decision making (ε_overtreat ≤ 10% and ε_undertreat ≤ 5%) for each of the 61 pairs of sensitivity (Se_d) and specificity (Sp_d). For simplicity, the width of the grey zone was classified into three levels (1-3) for each threshold separately, based on the 25th and 75th percentiles of the width of the grey zones (level 1: width < 25th percentile; level 2: 25th percentile ≤ width < 75th percentile; level 3: width ≥ 75th percentile; see S1 Table) across all potential diagnostic methods that allowed for adequate program decision making (ε_overtreat set at 25%) at a true underlying prevalence of zero. Diagnostic tests were considered 'optimal' (blue) when they resulted in a level 1 grey zone around at least 3 of the 6 thresholds and did not result in a level 3 grey zone at any of them; in all other cases, the diagnostic test was considered 'minimal' (white).

Generally, our findings are very much in line with recent similar work [28]. In fact, these observations are not unexpected, and this can best be illustrated by an extreme case. Assume the disease is truly absent in the population and samples are processed with an imperfect test: following Eq (2) with Prev_true = 0, a proportion of roughly 1 − Sp_d of the subjects will nevertheless test positive, so imperfect specificity alone can push the observed prevalence above a low program threshold, whereas the sensitivity plays no role when there are no infections left to detect.

Table 3. The diagnostic performance of minimal and optimal diagnostic tests for adequate and ideal decision making. Diagnostic tests were considered 'optimal' when they resulted in a level 1 grey zone for at least 3 of the 6 thresholds and did not result in a level 3 grey zone at any of them; in all other cases, the diagnostic test was considered 'minimal'. For simplicity, the width of the grey zone was classified into three levels (1-3) for each threshold and ε_undertreat separately, based on the 25th and 75th percentiles of the width of the grey zones (level 1: width < 25th percentile; level 2: 25th percentile ≤ width < 75th percentile; level 3: width ≥ 75th percentile; see S1 Table). For adequate decision making, ε_overtreat ≤ 25%, whereas for ideal decision making ε_overtreat ≤ 10%; for both levels of decision making, ε_undertreat ≤ 5%.

Sensitivity and specificity need to be determined for each program use case
In the present study, we focused on defining the required specificity and sensitivity that allowed for adequate or ideal decision making at each program treatment threshold. This strategy will result in diagnostic tests that can be used across all program decision thresholds; however, diagnostic tests that perform well at only a single threshold (e.g., tests that perform well in high-prevalence settings) are excluded by this approach. Indeed, all combinations of sensitivity and specificity allowed for adequate and ideal program decisions around the program thresholds of 20% and 50%. In other words, the required diagnostic performance will need to be determined for each program use case separately (see also Fig 6 and S1 Fig). For this, it will be equally important for the STH community to agree on the acceptable width of the grey zone for each program threshold separately, which in turn would provide better-justified criteria for classifying diagnostic tests as 'optimal' or 'minimal' than those arbitrarily used in the present study.

Specificity and sensitivity are inversely correlated
Although the lowest possible specificity and sensitivity are 94% and 60% for adequate decision making and 99% and 60% for ideal program decision making (Table 3), it is important to note that the diagnostic requirements for specificity and sensitivity are inversely correlated. As a consequence, it would be inappropriate to report the lowest values of specificity and sensitivity independently in a TPP, as this could lead to the development of diagnostic tests that result in poor program decision making. Rather, combinations (pairs) of specificity and sensitivity need to be incorporated. S2 Table lists the pairs of sensitivity and specificity that were eventually recommended to the STH subgroup. They include the pairs summarized in Table 3, excluding all combinations with a perfect sensitivity or specificity, because these were deemed unrealistic.

Currently used diagnostic methods may not allow for reliable decision making throughout an STH program
When comparing the recommended diagnostic performance (S2 Table) with the sensitivity and specificity reported in a meta-analysis for a selection of currently available microscopy-based methods (e.g., direct smear, formol-ether concentration, Kato-Katz thick smear, McMaster and (Mini-)FLOTAC), it is clear that the direct smear, formol-ether concentration, a single Kato-Katz and McMaster did not meet the requirements for detecting infections of any intensity for at least one of the three soil-transmitted helminths (Table 2 of [12]), and that in low-endemic areas only FLOTAC would be a potential candidate (Table 3 of [12]). In a more recent study, and assuming a perfect specificity [13], both a single and a duplicate Kato-Katz, Mini-FLOTAC and qPCR did meet the required sensitivity for STH infections of any intensity (Table 3 of [13]), but for low-intensity infections only qPCR remains a potential candidate (Table 4 of [13]). FECPAK G2 did not meet any of the requirements. Although both studies indicate the potential of FLOTAC and qPCR, there are important logistical obstacles to rolling them out in large-scale deworming programs [16-18].

Extension of the (MC-)LQAS framework allows both development and comparison of program decision algorithms for imperfect tests
To our knowledge, this is the first description of a five-way MC-LQAS framework that accounts for imperfect tests. The expansion of this framework not only allows program decision algorithms to be developed for imperfect tests, but can also be used to gain insight into operational costs. For example, we showed that additional investments to improve a test (e.g., its specificity) may provide downstream benefits by reducing the survey sample sizes required for making adequate program decisions. This is because diagnostic tests with improved specificity require smaller sample sizes for the same level of program decision making. In other words, any additional cost per diagnostic test with improved diagnostic performance can be compensated by savings in operational costs for testing in the field or laboratory. We therefore recommend that future cost analyses split the operational costs for testing into the material cost per test and the number of tests that one person can process per hour. This level of costing detail would support more evidence-based recommendations in the TPPs. A purely hypothetical numerical illustration of this trade-off is given below.
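The sketch below (Python) models the per-survey testing cost using the suggested split into material and labour components. None of the unit costs or sample sizes used here are reported in the paper; they are invented solely to illustrate how a dearer but better-performing test can still be cheaper per survey once the smaller required sample size is taken into account.

```python
def survey_testing_cost(n_tot, material_cost_per_test, tests_per_hour, hourly_wage):
    """Per-survey testing cost, splitting the material cost per test from the labour (throughput) cost."""
    labour_cost_per_test = hourly_wage / tests_per_hour
    return n_tot * (material_cost_per_test + labour_cost_per_test)

# Hypothetical figures, purely for illustration (not reported in the paper):
# a cheaper but less specific test needing a larger sample vs. a dearer, more specific test
print(survey_testing_cost(n_tot=1000, material_cost_per_test=0.60, tests_per_hour=10, hourly_wage=4.0))  # 1000 x 1.00 = 1000
print(survey_testing_cost(n_tot=400, material_cost_per_test=1.50, tests_per_hour=6, hourly_wage=4.0))    # 400 x ~2.17 = ~867
```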

MC-LQAS framework needs to be adapted for 2-stage clustered sampling
In the current MC-LQAS framework we ignored the clustered nature of STH and assumed that all subjects originate from a single cluster (e.g., one school or community). However, program decisions are not made for each cluster separately; rather, they are made for an administrative or geographical area (the so-called implementation unit) based on the aggregation of results across multiple clusters, with a number of subjects per cluster. In other words, programs employ two-stage cluster sampling, whereby clusters are first randomly selected within an implementation unit and a set number of subjects is then selected within each cluster. The development of a two-stage cluster sampling MC-LQAS simulation approach was beyond the scope of the present study. A possible way forward would be to build the MC-LQAS around a two-stage beta-binomial model, where the beta distribution describes the prevalence (proportion of positive test results) across clusters and the binomial distribution describes the proportion of positive test results within a cluster, as sketched below.
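One hedged sketch of such an approach is shown below (Python). It assumes a beta-binomial model parameterised by a mean prevalence and an intra-cluster correlation; all parameter values are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def prob_at_least_c_clustered(c, n_clusters, n_per_cluster, se, sp, mean_prev, icc, n_sim=20_000):
    """Monte Carlo estimate of P(N+ >= c) under two-stage sampling with a beta-binomial cluster model."""
    # Beta parameters chosen so that the mean is mean_prev and the intra-cluster correlation is icc
    a = mean_prev * (1 - icc) / icc
    b = (1 - mean_prev) * (1 - icc) / icc
    cluster_prev = rng.beta(a, b, size=(n_sim, n_clusters))        # true prevalence drawn per cluster
    prob_pos = se * cluster_prev + (1 - sp) * (1 - cluster_prev)   # per-subject positivity, Eq (2) per cluster
    positives = rng.binomial(n_per_cluster, prob_pos).sum(axis=1)  # positives aggregated over clusters
    return (positives >= c).mean()

# Illustrative values only: 10 clusters of 50 subjects, an imperfect test, moderate clustering
print(prob_at_least_c_clustered(c=250, n_clusters=10, n_per_cluster=50,
                                se=0.80, sp=0.80, mean_prev=0.55, icc=0.05))
```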

Both frameworks are generalizable to moderate-to-heavy intensity STH and any NTD program using population-based decision thresholds
Although the aforementioned frameworks were illustrated for program decision making around the prevalence of any STH infection, both frameworks are agnostic to the level of infection intensity and to the pathogen. For example, the results can also be used to make program decisions on whether the prevalence of moderate-to-heavy-intensity STH infections has dropped below 2% [1]. Based on the diagnostic performance recommended in S2 Table and the recently reported probability of Mini-FLOTAC, McMaster and qPCR to correctly classify moderate-to-heavy-intensity infections when compared to Kato-Katz (Table 4 of [35]), we can deduce that only Mini-FLOTAC meets these requirements, though not for all STH species. Given that schistosomiasis control programs use similar program decision thresholds [36], this framework will also provide insights for this NTD.
Supporting information S1 Table. The thresholds used to classify the width of the grey zone into three levels. This classification was based on the 25th and 75th percentiles of the width of the grey zones across all potential diagnostic methods that allowed for adequate program decision making, for each program threshold T separately (level 1: width < 25th percentile; level 2: 25th percentile ≤ width < 75th percentile; level 3: width ≥ 75th percentile). (DOCX) S2 Table. The minimum and ideal sensitivity and specificity recommended by the STH subgroup.
(DOCX) S1 Fig. The width of grey zones around 6 program decision thresholds for 1,681 theoretic diagnostic tests. These contour plots illustrate the width of the grey zone for each of the 1,681 unique combinations of sensitivity and specificity when decision making is ideal (ε_overtreat ≤ 10% and ε_undertreat ≤ 5%); each line represents the same width of grey zone. The number beside each line represents the floor value of the width of the grey zone in % (e.g., any value ≥10% and <11% is shown as 10%).
(TIF) S1 Data. The data used to determine the required diagnostic performance, the sample size and the corresponding decision cut-offs. (CSV)