Commercial Nucleic-Acid Amplification Tests for Diagnosis of Pulmonary Tuberculosis in Respiratory Specimens: Meta-Analysis and Meta-Regression

Background: Hundreds of studies have evaluated the diagnostic accuracy of nucleic-acid amplification tests (NAATs) for tuberculosis (TB). Commercial tests have been shown to give more consistent results than in-house assays. Previous meta-analyses have found high specificity but low and highly variable estimates of sensitivity. However, reasons for variability in study results have not been adequately explored. We performed a meta-analysis on the accuracy of commercial NAATs to diagnose pulmonary TB and meta-regression to identify factors that are associated with higher accuracy.
Methodology/Principal Findings: We identified 2948 citations from searching the literature. We found 402 articles that met our eligibility criteria. In the final analysis, 125 separate studies from 105 articles that reported NAAT results from respiratory specimens were included. The pooled sensitivity was 0.85 (range 0.36-1.00) and the pooled specificity was 0.97 (range 0.54-1.00). However, both measures were significantly heterogeneous (p<.001). We performed subgroup and meta-regression analyses to identify sources of heterogeneity. Even after stratifying by type of commercial test, we could not account for the variability. In the meta-regression, the threshold effect was significant (p = .01) and the use of other respiratory specimens besides sputum was associated with higher accuracy.
Conclusions/Significance: The sensitivity and specificity estimates for commercial NAATs in respiratory specimens were highly variable, with sensitivity lower and more inconsistent than specificity. Thus, summary measures of diagnostic accuracy are not clinically meaningful. The use of different cut-off values and the use of specimens other than sputum could explain some of the observed heterogeneity. Based on these observations, commercial NAATs alone cannot be recommended to replace conventional tests for diagnosing pulmonary TB. Improvements in diagnostic accuracy, particularly sensitivity, need to be made in order for this expensive technology to be worthwhile and beneficial in low-resource countries.


INTRODUCTION
Tuberculosis (TB) is a major global health problem. Each year, 8 to 9 million people develop disease, and 2 million die [1]. Pulmonary TB is the most common form of the disease [2]. Diagnosis of TB relies on the detection of acid-fast bacilli by microscopy (smear) and culture. Microscopy is rapid, specific, and inexpensive but has low sensitivity [3,4]. Culture is more sensitive, but results can take several weeks. In addition, culture may be falsely negative in 10-20% of cases [5]. Better efforts to control TB require faster and more accurate diagnostic tests [6][7][8]. Nucleic acid amplification tests (NAATs), which can give results in 3-6 hours, have been developed to address these issues [9].
The polymerase chain reaction (PCR) is the most common NAAT. Tests are either "in-house" ("home-brew"), when they are based on a protocol developed in a non-commercial laboratory, or commercial kits. Several commercial NAATs exist, and each uses a different method to amplify specific nucleic-acid regions in the Mycobacterium tuberculosis complex. These kits include: the GenProbe Amplified M. tuberculosis Direct test (AMTD), the Roche Amplicor MTB test, the Cobas Amplicor test, the Abbott LCx test, and the BD-ProbeTec (SDA) test. Another NAAT, the Loop-mediated Isothermal Amplification (LAMP) test, has recently been developed, but research experience with this test is limited [10]. Table 1 provides a summary of the different commercial tests. The LCx kit is no longer in use, and Becton Dickinson has produced an enhanced version of the SDA test (BD-ProbeTec-ET). The Food and Drug Administration (FDA) has approved the use of select commercial NAATs for respiratory specimens only. In addition, the AMTD and Amplicor tests are licensed for testing smear-positive specimens, while the FDA recently approved a second-generation AMTD (E-AMTD) test for smear-negative specimens [11]. The LCx, BD-ProbeTec-ET, and LAMP tests are currently not FDA-approved.
Systematic reviews of previous studies have suggested that diagnostic accuracy varies more among in-house NAATs than among commercial tests [12,13]. A meta-analysis on the use of in-house PCR assays for testing sputum samples found significant heterogeneity and could not summarize the measures of diagnostic accuracy (i.e. sensitivity and specificity) [14]. Several meta-analyses have evaluated the accuracy of commercial NAATs in both pulmonary and extrapulmonary TB [12,13,15-17]. Most of them have reported high and consistent specificity but low and inconsistent estimates of sensitivity [12,13,15]. Smear-negative patients may be the most likely group to benefit from the use of NAATs. If the NAAT result is positive, a faster diagnosis can lead to an earlier initiation of therapy [11]. However, studies have shown that sensitivity is lower for smear-negative TB compared to smear-positive TB [12,13,15,18]. One meta-analysis on the use of commercial NAATs for only smear-negative patients found that the sensitivity estimates were too low and variable to be used for confirming diagnosis in this group [16]. Another recent meta-analysis evaluated diagnostic accuracy for pulmonary TB stratified by smear status [18]. It concluded that the low sensitivity in smear-negative patients precludes the use of commercial NAATs for ruling out TB. The high specificity in this group of patients, however, is useful for ruling in TB. The same study also noted that the high sensitivity in smear-positive samples could be helpful in ruling out a diagnosis of pulmonary TB due to infection by nontuberculous mycobacteria (NTM) [18]. In our meta-analysis, we used a comprehensive search strategy to determine the accuracy of commercial NAATs for diagnosing pulmonary TB in combined smear-positive and smear-negative respiratory specimens. We further used meta-regression analysis to explore factors that may account for differences among studies.

Study selection
We identified 2948 citations from the initial search. After screening titles and abstracts, 471 English and Spanish articles were eligible for full-text review. Of these, 69 articles were excluded, and 402 articles on the use of commercial NAATs for all forms of TB were included (screening done by two reviewers). A total of 142 articles focused on respiratory specimens [sputa, bronchial aspirates, bronchoalveolar lavages (BAL), and tracheal aspirates] for the diagnosis of pulmonary TB. Some articles considered gastric aspirates as respiratory specimens. They were accepted if the number of gastric aspirates was less than 5% of the total sample size. A total of 37 articles were further excluded from data extraction, and 105 articles were included in our meta-analysis. Several articles compared more than one NAAT against the same reference standard in head-to-head trials, in which case each comparison was considered as a separate study. Thus, the total number of studies in the final analysis was 125. Figure 1 displays how the studies were selected.

Data extraction
We created and piloted a data extraction form with a subset of eligible studies. Based upon experience gained in the pilot study, the data extraction form was finalized. The final set of studies was assessed with the standardized form by two reviewers (DIL and LLF), and any differences were resolved by consensus. Many articles compared NAAT results to more than one reference standard, and we used a hierarchical approach to choose one comparison from each study: (1) culture result plus clinical data (most preferred reference standard) (2) culture result alone and (3) clinical data alone (least preferred reference standard). We used the specimen as the unit of analysis when possible. We also chose to use data that were not subject to discrepant analyses (i.e. unresolved data) when available, since resolved data after discrepant analyses are a potential source of bias and result in higher estimates of accuracy [126]. In addition, NTM and inhibited specimens were excluded if possible.

Assessment of study quality
We assessed the quality of studies using the following criteria, suggested as important for diagnostic studies [127]: (1) Was there a comparison of the commercial NAAT with an independent, appropriate reference standard? (2) Was the NAAT result interpreted without knowledge of the results of the reference standard (blinded interpretation) and vice-versa? (3) Did the whole sample or a randomly selected subset of the sample receive verification using the reference standard? and (4) Did the study prospectively recruit consecutive patients suspected of having pulmonary tuberculosis (i.e. cross-sectional vs case-control design)?

Data synthesis and meta-analysis
Data were analyzed using Meta-Disc (version 1.4) software [128]. We pooled the data with the DerSimonian-Laird random effects model (REM) [20,[129][130][131]. The REM gives more conservative estimates with wider confidence intervals because it assumes that the meta-analysis includes only a sample of all possible studies [19,132,133]. In addition, the REM accounts for both within-study variability (random error) and between-study variability (heterogeneity). Accuracy measures include: sensitivity, specificity, positive likelihood ratio (LR+), negative likelihood ratio (LR-), and the diagnostic odds ratio (DOR). Sensitivity is the proportion of positive test results among those with the target disease. Specificity is the proportion of negative test results among those without the disease. In a clinical setting, likelihood ratios are considered useful. The LR+ measures how much more frequently a positive test result is found in diseased versus non-diseased individuals. The LR-, on the other hand, measures how much more likely a negative result is in diseased versus non-diseased individuals. The DOR, or the odds of a positive result in diseased individuals compared to the odds of a positive result in non-diseased individuals, combines both likelihood ratios and is a global measure of test performance [134]. A value of 1 would indicate that the test cannot discriminate between people with and without disease. The DOR is calculated as LR+/LR-, or [sensitivity/(1 - specificity)]/[(1 - sensitivity)/specificity] [134].
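The accuracy measures defined above can be sketched in a few lines of Python. This is only an illustration of the definitions, not the Meta-Disc implementation; the class name, function name, and the counts in the example are hypothetical, and studies with empty cells (which would cause division by zero) are not handled here.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Counts from one study: NAAT result cross-classified against the reference standard."""
    tp: int  # NAAT positive, reference positive
    fp: int  # NAAT positive, reference negative
    fn: int  # NAAT negative, reference positive
    tn: int  # NAAT negative, reference negative


def accuracy_measures(t: TwoByTwo) -> dict:
    """Return sensitivity, specificity, LR+, LR- and DOR as defined in the text."""
    sens = t.tp / (t.tp + t.fn)      # positives among those with disease
    spec = t.tn / (t.tn + t.fp)      # negatives among those without disease
    lr_pos = sens / (1 - spec)       # LR+ = sensitivity / (1 - specificity)
    lr_neg = (1 - sens) / spec       # LR- = (1 - sensitivity) / specificity
    dor = lr_pos / lr_neg            # DOR = LR+ / LR-
    return {
        "sensitivity": sens,
        "specificity": spec,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
        "dor": dor,
    }


# Hypothetical study, not taken from the meta-analysis.
example = TwoByTwo(tp=85, fp=3, fn=15, tn=97)
measures = accuracy_measures(example)
```

Note that the DOR computed this way is algebraically the same as the cross-product ratio (tp × tn)/(fp × fn) of the 2×2 table.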
Each study in the meta-analysis contributed a pair of numbers: sensitivity and specificity. Since these measures tend to be strongly correlated and vary with the thresholds (cut-off values for determining test positives) used across the individual studies, it is standard practice to analyze sensitivity and specificity proportions as pairs, and to also explore the effect of the threshold on study results. To do this, we performed summary receiver operating characteristic (SROC) curve analysis [131,135]. The SROC plot displays each study's sensitivity and specificity estimates within the ROC space. A regression curve is fitted through the distribution of pairs of sensitivity and specificity. A shoulder-like curve indicates that the variability between studies may be due to the threshold effect (i.e. variation in cut-off values used across studies) and that an underlying common DOR exists that does not change with the threshold [130,135,136]. A curve without a shoulder shows that sensitivity and specificity are not correlated. The area under the regression curve also measures the overall accuracy of diagnostic tests. If the area under the curve (AUC) is 100%, then the test differentiates perfectly between diseased and non-diseased individuals. An AUC of 50% indicates that the test discriminates no better than chance [130,135,136].
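As a rough illustration of how an SROC regression curve can be fitted, the sketch below uses the linear logit-difference approach (regressing D = logit(TPR) - logit(FPR) on S = logit(TPR) + logit(FPR)). That this matches the exact model and weighting used in the analysis is an assumption; the fit here is unweighted, and the function names and example pairs are hypothetical.

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1 - p))


def sroc_fit(pairs):
    """Fit D = a + b*S by unweighted least squares.

    pairs: list of (sensitivity, specificity), all strictly between 0 and 1.
    D is the log diagnostic odds ratio; S is a proxy for the test threshold.
    """
    d = [logit(se) - logit(1 - sp) for se, sp in pairs]
    s = [logit(se) + logit(1 - sp) for se, sp in pairs]
    n = len(pairs)
    s_mean = sum(s) / n
    d_mean = sum(d) / n
    sxx = sum((x - s_mean) ** 2 for x in s)
    sxy = sum((x - s_mean) * (y - d_mean) for x, y in zip(s, d))
    b = sxy / sxx
    a = d_mean - b * s_mean
    return a, b


def sroc_sensitivity(a: float, b: float, fpr: float) -> float:
    """Expected sensitivity on the fitted curve at a given false-positive rate.

    Solving D = a + b*S for logit(TPR) gives
    logit(TPR) = (a + (1 + b) * logit(FPR)) / (1 - b).
    """
    x = (a + (1 + b) * logit(fpr)) / (1 - b)
    return 1 / (1 + math.exp(-x))


# Hypothetical (sensitivity, specificity) pairs, not data from the included studies.
pairs = [(0.80, 0.99), (0.85, 0.97), (0.90, 0.95), (0.95, 0.90)]
a, b = sroc_fit(pairs)
curve_point = sroc_sensitivity(a, b, fpr=0.05)
```

When the slope b is close to zero, the log DOR does not change with the threshold proxy S, which corresponds to the common underlying DOR described above; in that case exp(a) estimates that common DOR.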

Meta-regression
Heterogeneity in meta-analysis refers to a high degree of variability in study results (e.g. variability in sensitivity estimates). Such heterogeneity could be due to variability in thresholds (cutoff values), disease spectrum and populations studied, variations in NAAT protocols, and study quality across studies. When significant heterogeneity is present, summary estimates from meta-analyses are hard to interpret. We investigated heterogeneity using subgroup (stratified) analysis and meta-regression analysis [137]. In the subgroup analysis, we computed pooled DOR estimates in various strata to determine if accuracy is higher in specific subgroups. The meta-regression analysis is an extension of the SROC model [135]. In this linear regression model, studies are the units of analysis. The DOR is the outcome (dependent) variable. The independent variables are the covariates that might be associated with the variability in the DOR. Based on previous meta-analyses [12][13][14], potentially relevant covariates for our meta-regression model included: prospective or retrospective study direction, recruitment method, blinded interpretation, type of test, specimen type, reference standard, and data resolution. There were insufficient numbers to compare categories of differing study design, degree of verification, and smear status.
The meta-regression model generates relative diagnostic odds ratios (RDOR) as the output [134,137]. An RDOR is a ratio of two DORs. An RDOR of 1.0 indicates that a particular covariate (e.g. blinded study design) does not affect the overall DOR. An RDOR >1.0 indicates that studies with a particular characteristic (e.g. those that employed a specific target sequence in the PCR) have a higher DOR than studies without this characteristic. For an RDOR <1.0, the reverse holds.
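The interpretation of the RDOR can be illustrated with a deliberately simplified case. The sketch below assumes a regression of ln(DOR) on a single binary covariate, with no threshold term and no weighting, so the covariate's coefficient is simply the difference in mean ln(DOR) between the two groups and the RDOR is its exponential. The actual model described in the text includes more covariates and estimates between-study variance; the function name and the DOR values are hypothetical.

```python
import math
from statistics import mean


def relative_dor(studies):
    """Return the RDOR for one binary study characteristic.

    studies: list of (dor, has_characteristic) tuples.
    The coefficient is the difference in mean ln(DOR) between studies
    with and without the characteristic; RDOR = exp(coefficient).
    """
    with_x = [math.log(dor) for dor, flag in studies if flag]
    without_x = [math.log(dor) for dor, flag in studies if not flag]
    coefficient = mean(with_x) - mean(without_x)
    return math.exp(coefficient)


# Hypothetical DORs: two studies without and two with the characteristic.
studies = [
    (math.exp(5.0), False),
    (math.exp(5.4), False),
    (math.exp(5.7), True),
    (math.exp(6.1), True),
]
rdor = relative_dor(studies)  # > 1.0: studies with the characteristic have a higher DOR
```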

RESULTS
The average sample size of the included studies was 715 (range 57-7539). With the exception of one study, all of our studies were cross-sectional. A majority (86%) of the studies were prospective in design. A total of 45 (36%) studies used consecutive or random sampling, while 29 (23%) studies recruited patients using some form of convenience sampling. The convenience sample was chosen from a bigger group of patients or was selected from a screening program. All but two studies reported complete verification of NAAT results with the same reference standard. Most of the studies (96%) collected both smear-positive and smear-negative specimens, and 84% compared NAAT results to culture as the reference standard. Ninety-five (76%) studies tested respiratory specimens, while 30 (24%) studies only used sputum specimens. We were able to analyze unresolved data (i.e. not subjected to discrepant analyses) in 88 (70%) studies. Past evidence has shown that investigators do not report all the study components in their publications [6,138]. In our analysis, 103 (82%) studies did not report blinding status, and 51 (41%) studies did not explicitly report the method of patient recruitment. Table 2 gives the characteristics of the studies in our meta-analysis.
The overall sensitivity and specificity estimates were 0.85 (range 0.36-1.00) and 0.97 (range 0.54-1.00), respectively. Figures 2 and 3 show the accuracy measures from all the studies in a forest plot. Specificity appears to be more consistent than sensitivity. Thirteen of 125 studies (10%) gave specificity estimates less than 90%. Most of them included patients who were either on treatment or had a history of prior disease. The overall LR+ was 32.74 (95% CI: 26.02, 41.22), and the overall LR- was 0.14 (95% CI: 0.12, 0.16). The pooled DOR was 268.88 (95% CI: 212.07, 340.9). We used Chi-square analysis to detect heterogeneity in the summary results. All of them showed highly significant heterogeneity (p<.001). Thus, pooled measures of the tests' diagnostic accuracy are not meaningful and do not adequately describe the data. Table 3 displays the accuracy measures and their corresponding statistics for the Chi-square test of heterogeneity.
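To make the heterogeneity test and the random effects pooling concrete, the following sketch computes Cochran's Q (the chi-square heterogeneity statistic, referred to a chi-square distribution with k - 1 degrees of freedom) and the DerSimonian-Laird pooled estimate for a generic effect such as ln(DOR). It is an illustration under stated assumptions, not the Meta-Disc code: the function name and inputs are hypothetical, and each study's within-study variance is assumed to be supplied by the caller.

```python
import math


def dersimonian_laird(effects, variances):
    """Random effects pooling with the DerSimonian-Laird estimator.

    effects:   per-study effect estimates (e.g. ln DOR)
    variances: per-study within-study variances of those estimates
    """
    k = len(effects)
    w = [1 / v for v in variances]                       # inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                        # between-study variance
    w_star = [1 / (v + tau2) for v in variances]         # random effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return {"q": q, "df": df, "tau2": tau2, "pooled": pooled, "se": se}


# Hypothetical two-study example with clearly different effects.
result = dersimonian_laird(effects=[0.0, 2.0], variances=[1.0, 1.0])
```

Because the between-study variance is added to every study's variance, the random effects weights are more equal than the fixed effect weights and the pooled confidence interval is wider, which is the conservative behaviour described for the REM.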
Heterogeneity is a common concern for diagnostic meta-analyses. This variability may result from the threshold effect or differences in test methods and study characteristics [135]. Figure 4 shows the SROC plot with studies weighted by their inverse variance. The shoulder-like curve indicates that the threshold effect exists in our meta-analysis: there is a trade-off between sensitivity and specificity among the studies. Subgroup analysis is also used to identify other sources of variability by stratifying data into relatively more homogeneous strata [137]. Table 4 compares the DOR estimates for the study characteristics. The heterogeneity could be explained in some strata, but these strata contained small numbers of studies. We stratified by type of commercial kit since the kits have standardized protocols. The variability in LR- did not persist for the SDA test (Table 5). The SDA test amplifies IS6110, which is usually present in a high number of copies in MTB and may increase sensitivity. However, only 6 studies evaluated the SDA test.
A meta-regression analysis was performed to help explain the variation that remained after subgroup analysis. Table 6 shows the RDOR estimates from the meta-regression analysis using the Restricted Maximum Likelihood (REML) method to measure between-study variance. The threshold effect coefficient (S = -0.21) was significant (p = 0.01), consistent with the SROC plot. The "S" coefficient is a way to measure the effect of different thresholds on the DOR among studies, and the negative value indicates that the thresholds increase specificity at the expense of sensitivity [16]. Thus, the heterogeneity found in our meta-analysis could be explained in part by the use of different cut-off values in the studies. In addition, studies that evaluated respiratory specimens had almost a two-fold increase in DOR compared to studies that used only sputum. None of the other covariates in the model reached statistical significance.
Previous meta-analyses have shown that including bronchial specimens gave higher accuracy estimates compared to studies that only collected sputum [16,18].

Principal findings
Lack of rapid and accurate diagnostics for TB has been a major concern for global TB control. NAATs were introduced as promising novel tests for TB, and several commercial assays were introduced into the market. However, their actual performance has been less than optimal [12][13][14][15][16][17][18]. Since hundreds of studies have been published on NAATs, there is now the opportunity to perform meta-analyses and meta-regression to explore factors that influence NAAT performance.
In this meta-analysis, we performed extensive literature searches and identified a total of 125 separate studies from 105 articles that reported NAAT results from respiratory specimens. The results showed that sensitivity and specificity estimates for commercial NAATs in respiratory specimens were highly variable, with sensitivity lower and more inconsistent than specificity. Thus, summary measures of diagnostic accuracy are not clinically meaningful. The use of different cut-off values and the use of specimens other than sputum could explain some of the observed heterogeneity.

Implications of the findings
The most notable advantage of commercial NAATs is their rapid turn-around time, which may have important implications for patient management and TB control. However, they appear to be affected by a trade-off between sensitivity and specificity: specificity appears to be maximized at the cost of sensitivity. Reasons for their low sensitivity include a low concentration of bacilli (i.e. paucibacillary specimens), as in smear-negative sputum specimens, or the presence of inhibitory substances [139]. We did not find high rates of inhibition in the studies reviewed (range 1%-7.5%). In addition, the small volumes of specimen (template) used in each commercial test may offer additional explanations. A recent meta-analysis on NAATs for TB lymphadenitis found that studies which used volumes of template >20 µl were more accurate than studies that used smaller template volumes [17]. Furthermore, study results may be influenced by the reference standard used to compare test results. It is well known that culture is not 100% sensitive and can give false-negative results. The lack of a diagnostic gold standard remains one of the biggest obstacles for evaluating new diagnostics, especially in HIV-infected persons and in paucibacillary disease (e.g. extrapulmonary TB and pediatric disease). The true accuracy of commercial NAATs may actually be higher than reported when using an imperfect reference standard [140].
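A small worked example may help show why an imperfect reference standard can make a test look less accurate than it is. The sketch below rests on strong simplifying assumptions that are ours, not the paper's: the reference standard (e.g. culture) is taken to be 100% specific but less than 100% sensitive, and NAAT and reference errors are assumed to be independent given true disease status. The function name and all input values are hypothetical.

```python
def apparent_accuracy(true_sens, true_spec, ref_sens, prevalence):
    """Apparent NAAT sensitivity and specificity against an imperfect reference.

    Assumes the reference is 100% specific and that NAAT and reference
    results are independent given true disease status.
    """
    diseased_ref_neg = prevalence * (1 - ref_sens)   # true cases the reference misses
    healthy = 1 - prevalence                         # all classified reference-negative

    # Every reference-positive is truly diseased, so under independence the
    # apparent sensitivity equals the true sensitivity.
    apparent_sens = true_sens

    # Reference-negatives mix healthy people with missed true cases; NAAT
    # positives among the missed cases are counted as false positives.
    naat_neg_among_ref_neg = diseased_ref_neg * (1 - true_sens) + healthy * true_spec
    apparent_spec = naat_neg_among_ref_neg / (diseased_ref_neg + healthy)
    return apparent_sens, apparent_spec


# Hypothetical inputs: reference misses 20% of true cases, 20% prevalence.
sens_obs, spec_obs = apparent_accuracy(
    true_sens=0.85, true_spec=0.97, ref_sens=0.80, prevalence=0.20
)
```

Under these assumptions the apparent specificity falls below the true specificity, because correctly detected cases that the reference missed are scored as false positives.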
Our results show a high degree of variability in accuracy across studies. The increased power of a meta-analysis can determine a test's overall diagnostic ability, but a summary measure is misleading in the presence of significant heterogeneity. In previous meta-analyses [12][13][14], subgroup analyses did not fully explain the variability found in NAAT results across studies. Even when stratifying by commercial test, our results remained heterogeneous. Other setting-specific factors, such as background TB prevalence rates or laboratory experience, could help account for this variation. Aside from the threshold effect, meta-regression analysis found that studies which collected several types of respiratory specimens were associated with higher diagnostic accuracy, possibly since the induction of aspirates yields a higher recovery of bacteria. Our findings agree with previous meta-analyses that suggest commercial NAATs cannot replace culture and microscopy but should be interpreted along with conventional tests and clinical data for diagnosing TB [12,13,15]. NAATs are also not useful for monitoring treatment progress since they can detect non-viable bacteria and give false-positive results [141]. However, they can distinguish M. tuberculosis from NTM [9]. This may be helpful in high-NTM populations, such as HIV/AIDS patients.

Limitations of NAAT studies
Systematic reviews and meta-analyses are critical for evidence-based clinical practice [131,142]. However, they are only as good as the quality of the studies that they include. There is growing concern that primary research on TB diagnostics is not methodologically rigorous [143,144]. In a review of 12 recent meta-analyses of various TB tests, studies were plagued by limitations such as lack of blinding, use of a case-control design, and lack of random or consecutive patient sampling methodology [6]. One review of 31 meta-analyses on several diseases found higher accuracy measures associated with studies that used non-consecutive sampling methods [138]. In our meta-regression, the use of some convenience sampling gave a DOR that was 1.5-fold higher than the DOR for studies that used random or consecutive sampling. This difference did not reach statistical significance (p = 0.15). In addition, 41% of our studies did not report how their patients were recruited. Thus, besides poor methodological quality, poor reporting of study components is another problem [6]. In our meta-analysis, 82% of the studies did not report blinding status. Not blinding investigators to reference standard results when interpreting the NAAT result has been shown to overestimate the DOR [13,16,145]. Another limitation of existing NAAT studies is the lack of data on whether NAATs actually have an impact on patient outcomes and how much value NAATs contribute over and above the information already obtained by conventional testing. Most studies only provided information on sensitivity and specificity.

Strengths and limitations of the systematic review
Our systematic review had several strengths. First, we used a comprehensive search strategy with various overlapping approaches. This enabled us to retrieve a large number of studies. Moreover, two reviewers independently completed screening, study selection, and data extraction. Finally, we analyzed data within specific subgroups to lessen the effect of heterogeneity and used meta-regression to identify factors associated with higher accuracy. Our review had limitations as well. Despite searching several sources, it is possible that we may have missed some eligible studies. Further, we could only extract data from English language studies, and this could have introduced bias in our results. Lastly, despite using subgroup analysis and meta-regression methods, considerable heterogeneity remained unexplained. Even if sensitivity were to be improved, an important issue that will remain is the implementation of these new tools in developing countries. Commercial kits, whose prices range from US$25-50 per test, are popular in the US and other developed countries [9,11]. The US Centers for Disease Control and Prevention (CDC) has reported that commercial NAATs are used mostly in hospitals, health departments, and independent laboratories in the US [146]. However, many developing countries still use in-house PCR assays, which only cost about $15 per test [147]. Ironically, the poorest countries are often the ones burdened by the highest number of cases and are therefore unlikely to benefit from expensive technologies. Realizing this, agencies such as the Foundation for Innovative New Diagnostics (FIND), the WHO, and the Stop TB Working Group for New Diagnostics have launched initiatives to make technologies for detecting TB and other neglected diseases affordable and accessible for developing countries [148].