Accuracy of Five Algorithms to Diagnose Gambiense Human African Trypanosomiasis

Background
Algorithms to diagnose gambiense human African trypanosomiasis (HAT, sleeping sickness) are often complex due to the unsatisfactory sensitivity and/or specificity of available tests, and typically include a screening (serological), confirmation (parasitological) and staging component. There is insufficient evidence on the relative accuracy of these algorithms. This paper presents estimates of the accuracy of five algorithms used by past Médecins Sans Frontières programmes in the Republic of Congo, Southern Sudan and Uganda.
Methodology and Principal Findings
The sequence of tests in each algorithm was programmed into a probabilistic model, informed by distributions of the sensitivity, specificity and staging accuracy of each test, constructed based on a literature review. The accuracy of algorithms was estimated in a baseline scenario and in a worst-case scenario introducing various near worst-case assumptions. In the baseline scenario, sensitivity was estimated as 85–90% in all but one algorithm, with specificity above 99.9% except for the Republic of Congo, where CATT serology was used as an independent confirmation test: here, positive predictive value (PPV) was estimated at <50% in realistic active screening prevalence scenarios. Furthermore, most algorithms misclassified about one third of true stage 1 cases as stage 2, and about 10% of true stage 2 cases as stage 1. In the worst-case scenario, sensitivity was 75–90% and PPV no more than 75% at 1% prevalence, with about half of stage 1 cases misclassified as stage 2.
Conclusions
Published evidence on the accuracy of widely used tests is scanty. Algorithms should carefully weigh the use of serology alone for confirmation, and could enhance sensitivity through serological suspect follow-up and repeat parasitology. Better evidence on the frequency of low-parasitaemia infections is needed. Simulation studies should guide the tailoring of algorithms to specific scenarios of HAT prevalence and availability of control tools.


Introduction
The diagnosis of gambiense human African trypanosomiasis (HAT, sleeping sickness) in routine conditions is complex [1]. Because infection prevalence is usually low (<1-2%), diagnostic tests require a high sensitivity and specificity to achieve adequate positive predictive value (PPV). Furthermore, accurate classification into stage 1 (haemo-lymphatic) and 2 (meningo-encephalitic) is crucial: the stage 1 treatment, pentamidine, is ineffective for stage 2 due to limited blood-brain barrier penetration [2], while, of the two stage 2 treatments, melarsoprol is highly toxic [3] and eflornithine-nifurtimox is cumbersome to administer.
No single HAT diagnostic test currently offers satisfactory sensitivity and specificity. Diagnostic algorithms therefore combine several tests and feature a screening, confirmation and staging component. The Card Agglutination Test for Trypanosomiasis (CATT) [4], highly sensitive when performed in whole blood (CATT-wb) but insufficiently specific (<96%), is used for screening. After CATT-wb or CATT plasma screening, various parasitological confirmation tests are applied either alone or in sequence on blood and/or neck gland aspirate, so as to maximise specificity while maintaining acceptable levels of sensitivity. Various dilutions of the CATT in plasma (between 1:4 and 1:16) may also be performed ahead of parasitology to reduce the number of individuals needing parasitological testing. Parasitological positives (T+) undergo lumbar puncture and are classified as stage 2 if parasites are found in cerebrospinal fluid (CSF), or if a given threshold of CSF white blood cell (WBC) density (ranging from 5 to 20/µL) is exceeded [5]. Individuals with strong CATT reactions (dilutions ≥1:4) but no parasitological evidence of infection (T−) are generally considered serological suspects. Some control programmes follow up suspects for up to one year, repeating parasitological tests. Others consider them non-cases or treat them presumptively. The underlying infection prevalence affects the relative efficiency of these different strategies [6,7,8].
The accuracy of HAT diagnostic algorithms has not been documented in detail, partly because their complexity precludes straightforward analysis. Here, we present estimates of the accuracy of five different diagnostic algorithms used by Médecins Sans Frontières (MSF) in past gambiense HAT control programmes, derived from summary estimates of the reported accuracy of individual HAT tests and a probabilistic model. The Uganda programmes were in Adjumani District (1991-1996) and in Arua and Yumbe Districts (1995-2002). The Southern Sudan project made progressive modifications to its algorithm, but only the first (old) and the last (new) algorithms used by that project are assessed here.

Description of the MSF algorithms
As initial screening tests, all algorithms used the CATT-wb, and the Congo and Sudan algorithms also used systematic gland palpation among CATT-wb negatives. Parasitology (performed in the field during active screening) included microscopic examination of aspirate from punctured palpable cervical glands (GP) [9], done in all algorithms, complemented by capillary tube centrifugation (CTC or the Woo test [10]; theoretical detection limit 100 parasites/mL, reported limit 500-600/mL) or the Quantitative Buffy Coat (QBC; 15/mL, 15-300/mL) technique [11] in Southern Sudan, and the mini anion exchange centrifugation technique (mAECT; 5/mL, 15-100/mL) [12] or QBC in Uganda. Furthermore, the Southern Sudan algorithms used the QBC as the parasitological test during passive screening (testing of patients spontaneously presenting to a HAT treatment centre), and the CTC during active screening.
All programmes initially did systematic follow-up of serological suspects, but this was eventually stopped in Congo and Kiri due to low follow-up rates and high workload; in Kiri, this strategy was replaced with systematic treatment of suspects positive at CATT dilution ≥1:16, later restricted to villages with observed prevalence ≥2%. The Congo algorithm treated CATT ≥1:8 positive but T− individuals as cases regardless of CSF WBC density.
Staging of HAT in T+ (and CATT ≥1:8 positive in Congo) individuals was done at the fixed treatment centre by lumbar puncture and double centrifugation of the CSF (CSF-DC). If CSF-DC revealed no parasites, staging was based on WBC density thresholds. These thresholds were either >5 or >10/µL as per country guidelines [13].
With the exception of Congo, all algorithms performed lumbar puncture in T− but CATT dilution (≥1:4 or ≥1:16) positive individuals for simultaneous confirmation and staging. For these patients, the WBC density threshold was increased to >20/µL; furthermore, those not meeting stage 2 criteria were not automatically considered stage 1 cases, but rather suspects, creating a differential in sensitivity according to whether the case was stage 1 or stage 2.
Differences among algorithms reflect adherence to national HAT guidelines (for example, in Congo the WBC threshold was higher); the availability on the market of certain parasitological tests at different times (for example, the mAECT is a more recent development and interruptions in the production line have occurred); different operational strategies (in Congo MSF aimed to cover a large, sparse territory with single active screening visits with the overriding objective of maximum coverage and thus sensitivity); and, to some extent, decisions by individual programme coordinators or MSF sections (in the past decade, an inter-sectional working group has worked toward greater standardisation).
Studies were included in the review only if they had tested the accuracy of T. brucei gambiense diagnosis among untreated cases, and if they featured an acceptable diagnostic gold standard, defined as follows: (i) for screening and confirmation tests, testing with GP or CTC and at least one of the following: QBC, mAECT, enzyme-linked immunosorbent assay (ELISA), Kit for In Vitro Identification (KIVI), or animal inoculation; (ii) for the specificity of the CATT-wb, testing of individuals not living in HAT endemic areas; (iii) for staging tests, testing of CSF, among T+ cases only, with polymerase chain reaction (PCR), in vitro culture, or immunological markers of infection including raised IgM levels [16].
Studies that were not designed for testing validity, but contained sufficient data for accuracy estimation, were included. In some studies, we considered the experimental test used by investigators as the gold standard, and vice versa: in these cases, we inverted the two and re-calculated accuracy. The accuracy of CATT dilutions was only evaluated from studies among CATT-wb positives, since the algorithms only performed such dilutions after the CATT-wb screening, i.e. the parameter of interest was relative accuracy compared to the CATT-wb. Reports of CATT accuracy from foci where parasites frequently lack the LiTat1.3 gene [1] (Nigeria, Cameroon) were excluded.
Details on studies meeting inclusion criteria are provided in Text S1, and the amount of information available for each diagnostic test is summarised in Table 1. An additional nine studies were excluded from either the sensitivity or specificity reviews because the gold standard was inadequate [17,18,19,20,21,22] or the study design did not allow for diagnostic accuracy estimation [23,24,25]. One study of staging accuracy [26] was excluded because the IgM threshold used was deemed too high.

Author Summary
Gambiense human African trypanosomiasis (HAT, sleeping sickness) usually features low prevalence. The two stages of the disease require different treatments, and stage 2 is fatal if untreated. HAT diagnosis must therefore be highly sensitive (i.e., detect as many true cases as possible) and specific (i.e., minimize false positives). HAT diagnostic algorithms are complex and involve several tests to screen for, confirm and stage infection. We analyzed five algorithms used by Médecins Sans Frontières HAT programmes. We combined published data on the accuracy of each test in the algorithm with a computer program that simulates all possible algorithm branches. We found that all algorithms had reasonable sensitivity (85-90%); specificity was high (>99.9%) except for the Republic of Congo, where confirmation did not rely on microscopic evidence, resulting in frequent false positives (but also higher sensitivity). Algorithms misclassified about one third of stage 1 cases as stage 2, but stage 2 classification was highly accurate. The use of serology alone for confirmation merits caution. HAT diagnosis could be made more sensitive by following up serological suspects and repeating microscopic examinations. Computer simulations can help to adapt algorithms to local conditions in each HAT programme, such as the prevalence of infection and operational constraints.

Probability distributions of test accuracy
Individual estimates of test accuracy were combined into probability distributions for further modelling. Distributions for the accuracy of successive CATT dilutions were constructed by fitting polynomial functions to plots of available sensitivity or specificity point estimates versus the natural logarithm of the dilution, with observations weighted proportionately to each study's sample size (Figure S1a, Figure S1b in Text S1). The fitted values and their 95% confidence intervals at each dilution of interest were used to construct binomial distributions.
Probability distributions for other tests were constructed as follows. First, exact binomial probability distributions were built around the point estimate of each study. Second, each study's distribution was weighted proportionately to the study's sample size. Third, the individual study distributions were summed, and the resulting distribution was scaled so that the area under the curve totalled unity. An illustration is provided for the CTC (Figure 6).
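The three steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' code. The study counts are hypothetical, the function names are invented for this example, and each study's exact binomial distribution is interpreted here as the binomial likelihood of the observed result evaluated over a grid of candidate sensitivities, which is an assumption.

```python
import math

def binomial_curve(positives, n, grid):
    """Binomial likelihood of `positives` out of `n` at each candidate sensitivity."""
    coeff = math.comb(n, positives)
    return [coeff * p**positives * (1 - p)**(n - positives) for p in grid]

def pooled_distribution(studies, grid_size=1001):
    """Sample-size-weighted sum of per-study curves, scaled to total probability 1."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    pooled = [0.0] * grid_size
    for positives, n in studies:
        curve = binomial_curve(positives, n, grid)
        area = sum(curve)  # step 1: each study's curve is normalised
        for i, value in enumerate(curve):
            pooled[i] += n * value / area  # step 2: weight by sample size
    total = sum(pooled)
    return grid, [value / total for value in pooled]  # step 3: scale to unity

# Hypothetical (true cases detected, true cases tested) pairs, not data from the review.
studies = [(45, 80), (30, 40), (112, 200)]
grid, density = pooled_distribution(studies)
mean_sensitivity = sum(p * d for p, d in zip(grid, density))
print(f"pooled mean sensitivity: {mean_sensitivity:.3f}")
```

Because the study curves are summed rather than multiplied, the pooled distribution stays wide or multi-modal when studies disagree, which suits heterogeneous estimates.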
For the QBC, there was only one published estimate of sensitivity, from a small study (n = 11). The technique is reported to have similar sensitivity to the mAECT [12,20], which is plausible given their comparable detection limits: therefore, the same distribution was used for the QBC as for the mAECT.
Finally, the specificity of parasitological tests for confirmation was fixed at 100%: the presence of trypanosomes is unequivocal, and trained microscopists should ordinarily not report false positives.

Alternative worst-case scenario
For the purpose of planning for long-term transmission control, it might be useful to consider minimum requirements to guarantee success even if conditions in reality are less favourable than published evidence suggests. Accordingly, more conservative accuracy estimates were obtained by applying a set of worst-case scenario assumptions (Table 2). These assumptions sought to account for the fact that even the most sensitive tests (QBC, mAECT) are likely to miss low parasitaemias (<5-15 trypanosomes/mL). Studies of T− suspects, based on PCR assays for T. brucei s.l. [27] featuring 100% specificity in controls from non-endemic regions [28,29,30,31], have reported 22% positivity in Cameroon [30]; 19-37% in the Ivory Coast [29]; and 15% in Equatorial Guinea and Angola [32].

Probabilistic model
R software was used to program the different algorithms into a sequence of conditional probabilities, so as to calculate sensitivity, specificity, and staging accuracy (defined as the probability of being correctly classified into either stage) of the algorithm as a whole, given any values of accuracy for individual tests. Equations for the accuracy estimation of each algorithm are provided in Text S1.
Because some algorithms used CSF-DC and WBC count for confirmation as well as staging, sensitivities vary according to whether the true positive case is in stage 1 or stage 2, and were thus computed separately. Furthermore, scenarios with and without follow-up of serological suspects were evaluated, i.e. assuming none or all such cases are re-tested according to the stipulated schedule (in practice, the follow-up rate varies by site [33]).
The sensitivity and specificity of any given test for the baseline scenario were specified by the probability distributions constructed above, summarised in Table 3. The model was run 10 000 times for each algorithm and for both the baseline and worst-case (incorporating the adjustments in Table 2) scenarios. During each run, a random value was sampled from each input probability distribution. Median sensitivity, specificity and staging accuracy were then computed based on the output distribution of results from the 10 000 runs, along with their 95% percentile interval (i.e. the interval comprising 95% of the run results).
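The sampling procedure described above can be sketched as follows. This is a deliberately simplified illustration rather than the model in Text S1. It assumes a hypothetical two-test algorithm (screening followed by parasitological confirmation), conditional independence of the two tests given true infection status, and Beta distributions with invented parameters standing in for the empirical input distributions. It is written in Python, whereas the original model used R.

```python
import random
import statistics

def simulate_algorithm(n_runs=10_000, confirm_spec=1.0, seed=1):
    """Monte Carlo accuracy of a screening-then-confirmation algorithm."""
    rng = random.Random(seed)
    sensitivities, specificities = [], []
    for _ in range(n_runs):
        # Hypothetical Beta inputs; one random draw per input distribution per run.
        screen_sens = rng.betavariate(95, 5)
        screen_spec = rng.betavariate(96, 4)
        confirm_sens = rng.betavariate(90, 10)
        # A true case is detected only if both tests in the sequence are positive.
        sensitivities.append(screen_sens * confirm_sens)
        # A non-case is misdiagnosed only if both tests are falsely positive.
        specificities.append(1 - (1 - screen_spec) * (1 - confirm_spec))
    return sensitivities, specificities

def summarise(values):
    """Median and 95% percentile interval of the run results."""
    ordered = sorted(values)
    lower = ordered[int(0.025 * len(ordered))]
    upper = ordered[int(0.975 * len(ordered)) - 1]
    return statistics.median(ordered), lower, upper

sens_runs, spec_runs = simulate_algorithm()
median, lower, upper = summarise(sens_runs)
print(f"sensitivity: median {median:.3f}, 95% interval {lower:.3f}-{upper:.3f}")
```

With parasitological specificity fixed at 100%, as in the baseline scenario, the specificity of this toy algorithm is 100% in every run. Passing a lower `confirm_spec` shows how even rare false positives at the confirmation step propagate to the algorithm as a whole.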
The resulting negative and positive predictive values (NPV, PPV) were also calculated assuming 0.1%, 1% or 10% infection prevalence. The ratio of non-cases needlessly treated to true cases treated (over-treatment ratio) was also calculated for each algorithm and prevalence scenario, assuming a stage 1 to stage 2 ratio of two among prevalent infections detected actively in never-before screened communities, consistent with empirical observations in most MSF projects (Francesco Checchi, unpublished observations). However, this assumption is of negligible importance: the converse (a ratio of 0.5) would result in nearly identical estimates (data not shown), since differences in sensitivity between stage 1 and stage 2 are small and of limited influence given that HAT is a low-prevalence infection (PPV and NPV are largely determined by specificity).
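These quantities follow from the standard definitions of predictive values. The sketch below shows one way to compute them. The stage-specific sensitivities and the specificity are hypothetical values chosen for illustration, and combining the two sensitivities as an average weighted by the stage 1 to stage 2 ratio is an assumption about how the ratio enters the calculation.

```python
def overall_sensitivity(sens_stage1, sens_stage2, stage1_to_stage2_ratio=2.0):
    """Average of stage-specific sensitivities, weighted by the stage mix."""
    weight_stage1 = stage1_to_stage2_ratio / (stage1_to_stage2_ratio + 1)
    return weight_stage1 * sens_stage1 + (1 - weight_stage1) * sens_stage2

def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV, over-treatment ratio) per person screened."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    # Non-cases needlessly treated per true case treated.
    over_treatment = false_pos / true_pos
    return ppv, npv, over_treatment

# Hypothetical inputs, not estimates from Table 4.
sensitivity = overall_sensitivity(0.86, 0.92)
for prevalence in (0.001, 0.01, 0.1):
    ppv, npv, ratio = predictive_values(sensitivity, 0.999, prevalence)
    print(f"prevalence {prevalence:.1%}: PPV {ppv:.3f}, NPV {npv:.4f}, "
          f"over-treatment ratio {ratio:.3f}")
```

At low prevalence the false positive term dominates the PPV denominator, which is why small changes in specificity move PPV far more than comparable changes in sensitivity.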

Sensitivity, specificity and staging accuracy
Accuracy estimates for the baseline scenario are shown in Table 4. Sensitivity including suspect follow-up was highest in Congo, and considerably lower than elsewhere for the new Kiri algorithm, which screened out cases negative at a high CATT dilution (<1:16). Specificity was 99.9% or 100% everywhere with the exception of Congo (99.1%).
The theoretical sensitivity gain from suspect follow-up was considerable: about 3-4% everywhere, but 10-20% in Kiri, where T−, CATT dilution ≥1:4 positives were followed up. There was no appreciable specificity cost from suspect follow-up. Algorithms were predicted to misclassify about one in ten of the stage 2 cases as stage 1; conversely, about one third of stage 1 cases were treated as stage 2, with the exception of Congo, where the higher WBC threshold (>10/µL) resulted in a small increase in stage 2 misclassification, but only 13% stage 1 misclassification (note however the wide percentile intervals).
In the worst-case scenario (Table 5), sensitivity was 10-15% lower everywhere except for Congo (where conservative assumptions mostly did not affect the CATT ≥1:8 dilution test), and around 50% for the new Kiri algorithm. Specificity decreased below 99.8% except for the new Kiri algorithm. Stage misclassification affected more than half of stage 1 cases.
Overall, the Congo and new Kiri algorithms offered opposite extreme characteristics: the former guaranteed very high sensitivity but had low specificity; the latter was highly specific even under worst-case scenario assumptions, but had low sensitivity.
Predictive values and over-treatment ratios
NPV was uniformly high (Table 6). Because of low specificity, the predicted PPV of the Congo algorithm was also low at most plausible prevalence levels (<50% for any prevalence <1%), resulting in a high over-treatment ratio. Because PPV is extremely sensitive to minimal changes in specificity, predicted PPVs with high specificity values should be interpreted with caution (e.g. in Uganda, median specificity was 99.94%, but was rounded to 99.9%, which results in a 20% decrease in PPV at prevalence 0.1%). Only the new Kiri algorithm achieved perfect PPV at any prevalence (however, the resultant elimination of over-treatment was counterbalanced by a policy of treating serological suspects with pentamidine in high-prevalence villages).

Interpretation of findings
This study suggests that diagnostic algorithms previously used by MSF had a sensitivity of 85-90% in a baseline scenario analysis, except for an algorithm in Southern Sudan in which only individuals CATT ≥1:16 positive underwent blood and CSF parasitological exams. At least theoretically, and irrespective of its efficiency and cost-effectiveness, the follow-up of serological suspects does yield an appreciable increase in sensitivity; however, this benefit may largely be negated in the field because of low suspect follow-up rates (suspect follow-up is costly as it often requires active patient tracing). Among other studies of HAT diagnostic algorithms (all starting with CATT-wb positivity), Miezan et al. [34]
All algorithms also appeared to have an acceptable PPV except for Congo's, where serological diagnosis probably resulted in a high frequency of stage 1 false positives (see below). Furthermore, reliance on the conventional HAT staging approach (parasitology and WBC threshold of >5 leucocytes/µL) may have captured the vast majority of stage 2 cases but misclassified about one third of stage 1 cases as stage 2: this harm-benefit ratio is nonetheless likely to be favourable, since the risk of death from undetected stage 2 HAT is probably 100% [37], while the risk of death due to stage 2 drug toxicity among stage 1 cases misclassified as stage 2 is less than 5%, and <2% wherever eflornithine-nifurtimox has replaced melarsoprol as first-line treatment. Misclassification of stage 2 cases could partly be avoided by introducing some clinical criteria in the algorithm (e.g. patients with typical signs and symptoms of stage 2, and who are classified as stage 1, should be retested or treated empirically).
Our findings refer to the relatively favourable conditions of HAT diagnosis provided for by a well-resourced non-governmental organisation with access to the best available technology, ability to train and supervise staff and considerable field logistics. Many HAT programmes, particularly those implemented by national control programmes after humanitarian agencies and other donors discontinue support, do not dispose of such resources, and must use simpler algorithms, sometimes relying on blood smears and cervical node microscopy alone for parasitological testing in remote active screening campaigns. Such simple algorithms are likely to feature a much lower accuracy than those we have evaluated here: national programmes should receive continued technical and material support in order to offer adequate HAT diagnosis.

Plausibility of worst-case scenario assumptions
While worst-case scenario estimates may be implausibly low, the question of whether current tests miss a larger proportion of cases than currently thought, as suggested by PCR data, should be explored further. While in non-endemic areas PCR appears extremely specific, among CATT-wb negatives in endemic areas PCR positives do occur: 4/73 (5.5%) in Ivory Coast [29], 3/222 (1.4%) in Cameroon [30], and 1/36 (2.8%) in Equatorial Guinea and Angola [32]. These observations could be explained as (i) false PCR positives due to cross-reactivity with other antigens, including DNA from non-gambiense T. brucei s.l. transiently infecting the host; or (ii) true T. b. gambiense infections undetectable by other tests due to low parasite density.
The former explanation is supported by the finding that a study in an Ivory Coast focus employing a PCR assay specific for T. b. gambiense yielded no PCR positives [31], while all studies with high PCR positivity relied on non-gambiense specific assays. However, the Ivory Coast assay used had a detection limit comparable to the mAECT, and may have failed to detect cases of low parasitaemia (by contrast, the non-gambiense specific Cameroon assay developed by Penchenier et al. [30] has a reported limit of 1/mL).
The latter explanation requires the existence of infections that maintain extremely scanty parasitaemia and are not or only mildly pathogenic [37].
Better evidence should come from the development of T. brucei gambiense-specific molecular assays that also have a detection limit appreciably lower than parasitology, and their application to long-term follow-up of T− serological suspects [38]. Estimating the true sensitivity of tests would require knowledge of the typical distribution of parasitaemias in human hosts, but this is difficult to measure precisely because of the detection limit of current methods (presumably, if a large database of known parasite densities were assembled, the resulting distribution could be treated as truncated, and extrapolated below the minimum detection limit). Data on cattle are available, but may not apply to humans due to differences in host-parasite interactions. In the meantime, we suggest that worst-case assumptions be used for determining requirements of programmes aiming for long-term control or local elimination.
Figure 6. Steps to build a probability distribution of CTC test sensitivity. Each report is denoted by the name of the first author and the year of publication. In step three, the final probability distribution is then normalised to unity (i.e. the total probability = 1). doi:10.1371/journal.pntd.0001233.g006
Table 2 (fragment). Worst-case scenario assumptions.
Specificity of CATT-wb. Rationale: results from non-HAT exposed populations may be unrepresentative (e.g. HAT-exposed populations may also have higher prevalence of parasitic infections, such as P. falciparum, that may cross-react with the CATT [56]). Adjustment: re-constructed probability distribution by including reports from apparently HAT-negative controls in HAT-endemic sites.
Specificity of GP, CTC, mAECT, QBC, CSF-DC. Rationale: rare false positives could occur due to microscopic artefacts, e.g. microfilaria, or clerical mistakes. Adjustment: 99.5-100.0% of the baseline scenario (uniform distribution); as no evidence was found, this range is assumed to be plausible.
Staging accuracy of CSF-DC for stage 2. Rationale: one study [16] reported much lower accuracy based on a gold standard consisting of various markers of neuro-inflammation including intrathecal IgM. Adjustment: specificity from study in question (73.3%) adopted instead of those used in the baseline scenario.

Implications for field diagnosis
Specificity is key to maximising PPV. Very low HAT infection prevalence (e.g. <0.2%) is common in many communities screened actively, implying poor PPV, considerable over-treatment, and inflated prevalence estimates for even the most specific algorithms considered here. However, in many programmes the majority of cases are detected passively. The prevalence of infection among individuals spontaneously presenting to the fixed HAT centre is higher, and was above 2% in all MSF programmes where these algorithms were used (Table 7). These observed prevalence figures suggest that PPV is generally high during passive screening (>95% everywhere except Congo).
Assuming reasonable laboratory quality, all parasitological tests are likely to be 100% specific, and reliance on these alone for confirmation should guarantee perfect PPV. By contrast, this study suggests that use of a CATT 1:8 dilution positive test as criterion for confirming infection, irrespective of parasitological results, entails a heavy specificity price. Field data appear to corroborate this finding. Among true cases, the proportion diagnosed via the CATT 1:8 dilution (serologically) should in theory not depend on HAT stage (serological tests in blood are believed by some to be less sensitive in stage 2, but no published evidence for this was found).
On the other hand, among false positives, most cases diagnosed serologically would be classified as stage 1, since during staging all would be negative for CSF-DC and most would have normal WBC density. A preponderance of stage 1 is thus indicative of considerable over-diagnosis. Within the three Congo sites, serological cases were 1559/2857 (54.6%) of naïve (previously untreated) cases, of which 1364/1559 (87.5%) were in stage 1, compared to 624/1298 (48.1%) of cases confirmed parasitologically. Furthermore, serological cases were 244/629 (38.8%) of cases detected passively, and 1244/2152 (57.8%) of cases detected actively. In a simple logistic regression model, both stage 1 classification and active screening were associated with serological diagnosis (odds ratios 7.45 [95%CI 6.13-9.05] and 1.35 [95%CI 1.10-1.66] respectively). Altogether, these observations suggest considerable over-diagnosis of HAT (nearly all classified as stage 1) in Congo. Inojosa et al. found a similarly low PPV of an algorithm based on the CATT 1:8 dilution in Angola (13.2% with 0.07% prevalence) [22]. Diagnosis through CATT serology does improve sensitivity considerably; however, we suggest that its use be restricted to (i) passive screening and (ii) active screening in remote communities with suspected high prevalence where there is likely to be only one opportunity for screening, and where melarsoprol is not used as first-line therapy or the algorithm minimises misclassification of stage 1 cases. Furthermore, we recommend use of a 1:16 dilution in lieu of 1:8. Control programs that use algorithms with serological criteria aim to reduce transmission at the expense of
The main reason for lack of sensitivity of the parasitological tests is likely to be low parasite density. As HAT parasitaemia is known to undulate on a daily basis, some laboratories perform repeat blood parasitological tests so as to increase chances of detecting parasites. Repeat tests could be a simple way to improve sensitivity. Better evidence on the typical period between peak and trough parasitaemia would be helpful to optimise the timing of blood sampling. Clearly, keeping suspects for days at the treatment centre in order to repeat tests would present serious acceptability challenges; however, a single overnight stay might be feasible, and, furthermore, the selection of suspects in whom to perform repeat tests might also be restricted to those displaying typical signs and symptoms of HAT.
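As a rough illustration of the potential gain from repeat testing, the sketch below computes the cumulative sensitivity of several parasitological tests on the same infected individual. It assumes that repeat results are independent, which is an optimistic simplification: a patient with persistently scanty parasitaemia would tend to be missed on every occasion. The single-test sensitivity of 0.6 is hypothetical.

```python
def repeat_test_sensitivity(single_test_sensitivity, n_tests):
    """Probability that at least one of n independent tests is positive."""
    probability_all_negative = (1 - single_test_sensitivity) ** n_tests
    return 1 - probability_all_negative

# Hypothetical single-test sensitivity, not an estimate from the review.
for n_tests in (1, 2, 3):
    cumulative = repeat_test_sensitivity(0.6, n_tests)
    print(f"{n_tests} test(s): cumulative sensitivity {cumulative:.3f}")
```

Under these assumptions most of the gain comes from the first repeat, which is consistent with the suggestion that a single overnight stay might capture much of the benefit.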
These findings also have implications for burden estimation, since they introduce a need to adjust observed prevalence or incidence data for imperfect sensitivity, PPV below 100% due to low specificity (particularly for active screening data), and unequal stage 1 and stage 2 misclassification probabilities.

Study limitations
The literature review revealed a dearth of quality studies of HAT test accuracy, with the exception of the CATT-wb. Many were imprecise (only two presented a sample size rationale) and featured less than optimal gold standards. The mAECT, used in a variety of programmes, appears to be supported by only one large study, and for the QBC only one study was found. This uncertainty may introduce information bias in the construction of accuracy distributions. More specifically, the adoption of specificity estimates for the CATT from populations from non-endemic areas may have led to overly optimistic estimates (this was partly addressed in the worst-case scenario analysis).
Our method of constructing accuracy distributions attempts to use existing data with minimal assumptions about their parametric form. Arguably, meta-analysis could have been used instead, with distributions provided by the confidence intervals of the summary estimates from pooled studies. However, preliminary analysis showed evidence of heterogeneity in study estimates for several HAT tests: under these conditions, meta-analysis is discouraged. Furthermore, there is lack of consensus on appropriate methods for meta-analysis of diagnostic test studies [39,40].
Bayesian approaches to diagnostic accuracy estimation [41,42], which do not require a gold standard, could be a useful alternative to the method used here, and should also be explored.
More generally, this study's theoretical estimates overlook some practical realities of field work. For example, algorithms are sometimes not performed as indicated (e.g. gland palpation may be skipped due to heavy workload); some diagnostic decisions are taken on clinical grounds (though probably rarely), overriding laboratory results; and patient attrition is an issue (e.g. suspect follow-up rates are generally low). Thus, the algorithms' accuracy in routine conditions may be higher or lower than our estimates, the latter being more likely.

Conclusions
Algorithms using non-parasitological diagnosis have lower specificity leading to varying degrees of overtreatment. Overestimation of disease burden could be avoided by excluding individuals diagnosed serologically from the case counts. Differences between active and passive screening should be considered.
Ways to improve sensitivity include follow-up of serological suspects and repeat blood parasitological testing. This study highlights the urgent need to pursue research on new HAT diagnostics [43]. Improved tests should ideally replace most of the present algorithms, and be feasible in outpatient settings (e.g. as simple serological rapid tests), thus enabling integration of HAT services [44]. In the present scenario of falling prevalence, any new tests will need to be practically 100% specific. However, high sensitivity will remain necessary to maximise the chances of elimination. No single algorithm will be appropriate for all epidemiological settings: rather, our study demonstrates the value of estimating the accuracy of the algorithm as a whole, and could be replicated in a variety of prevalence scenarios, or integrated in a cost-effectiveness analysis that would help control programmes, particularly those working with limited resources, optimise the use of available diagnostics.

Supporting Information
Text S1 Model equations and results of the diagnostic literature review. (DOC)