Within-Subject Interlaboratory Variability of QuantiFERON-TB Gold In-Tube Tests

Background The QuantiFERON®-TB Gold In-Tube test (QFT-GIT) is a viable alternative to the tuberculin skin test (TST) for detecting Mycobacterium tuberculosis infection. However, within-subject variability may limit test utility. To assess variability, we compared results from the same subjects when QFT-GIT enzyme-linked immunosorbent assays (ELISAs) were performed in different laboratories. Methods Subjects were recruited at two sites and blood was tested in three labs. Two labs used the same type of automated ELISA workstation, 8-point calibration curves, and electronic data transfer. The third lab used a different automated ELISA workstation, 4-point calibration curves, and manual data entry. Variability was assessed by interpretation agreement and comparison of interferon-γ (IFN-γ) measurements. Data for subjects with discordant interpretations or discrepancies in TB Response >0.05 IU/mL were verified or corrected, and variability was reassessed using a reconciled dataset. Results Ninety-seven subjects had results from three labs. Eleven (11.3%) had discordant interpretations and 72 (74.2%) had discrepancies >0.05 IU/mL using unreconciled results. After correction of manual data entry errors for 9 subjects, and exclusion of 6 subjects due to methodological errors, 7 (7.7%) subjects were discordant. Of these, 6 (85.7%) had all TB Responses within 0.25 IU/mL of the manufacturer's recommended cutoff. Non-uniform error of measurement was observed, with greater variation in higher IFN-γ measurements. Within-subject standard deviation for TB Response was as high as 0.16 IU/mL, and limits of agreement ranged from −0.46 to 0.43 IU/mL for subjects with mean TB Response within 0.25 IU/mL of the cutoff. Conclusion Greater interlaboratory variability was associated with manual data entry and higher IFN-γ measurements. Manual data entry should be avoided. 
Because variability in measuring TB Response may affect interpretation, especially near the cutoff, consideration should be given to developing a range of values near the cutoff to be interpreted as “borderline,” rather than negative or positive.


Introduction
Interferon gamma (IFN-γ) release assays (IGRAs) are designed to detect both latent Mycobacterium tuberculosis infection (LTBI) and infections manifesting as active tuberculosis disease, collectively referred to as M. tuberculosis infection (MtbI). IGRAs are a popular, viable, and often preferred alternative to the traditional tuberculin skin test (TST) in some settings [1][2][3]. Despite inadequacies in diagnostic standards for identifying MtbI, numerous studies have assessed the sensitivity and specificity of IGRAs [2][3][4]. However, few studies have assessed the within-subject variability of IGRA results. Within-subject variability includes differences in test results due to both subject fluctuations and test performance fluctuations. Excessive variability in IGRA results may limit their utility for detecting MtbI. A limited number of studies have assessed IGRA variability among people where treatment might affect serial test results [5][6][7][8][9] or among contacts, healthcare workers (HCW), or residents of high-TB burden countries where ongoing transmission may affect serial IGRA results [10][11][12][13][14][15][16][17][18][19][20][21][22][23][24]. Rarely have investigators examined variability due solely to test performance fluctuations on blood collected at the same time [13,20]. No published investigation has addressed variability when IGRAs are performed in different laboratories on blood collected at the same time.
The QuantiFERON®-TB Gold In-Tube test (QFT-GIT, Cellestis Limited, Carnegie, Victoria, Australia) is one of two commercially available IGRAs currently in use in the U.S. The goal of this study was to determine the within-subject variability of the QFT-GIT when performed in different laboratories on blood collected at the same time and to investigate potential reasons for variability.

Ethics Statement
The Centers for Disease Control and Prevention (CDC) and Wilford Hall Medical Center human subjects institutional review boards approved this study. All subjects provided written informed consent.

Subject Selection
Subjects were recruited from among Air Force and CDC staff located in San Antonio, Texas, and Atlanta, Georgia, respectively, as part of a larger study investigating parameters that affect QFT-GIT variability. Prior unpublished assessments among a similar cohort found a broad range of IFN-γ measurements, and that 40% to 50% of persons with self-reported prior positive TST results were positive by QFT-GIT as compared to <3% for the general U.S. population. To increase the proportion of subjects with positive QFT-GIT results and to assess subjects with a continuous range of IFN-γ measurements, including those with IFN-γ measurements near the cutoff separating positive and negative interpretations, only persons with self-reported prior positive TST results were recruited. Exclusion criteria were age of less than 18 years or a history of an adverse reaction to TST (e.g., blistering, scarring, or anaphylaxis). All subjects completed a detailed study questionnaire.

QFT-GIT Procedure
Blood from each subject was collected at a single sitting into three sets of QFT-GIT tubes so that the assay could be completed in three different labs (Lab1, Lab2, and Lab3), all with extensive experience and demonstrated proficiency. Approximately 1 mL of blood was collected into three tubes containing only heparin (Nil tube); three tubes containing heparin, dextrose, and the mitogen phytohemagglutinin A (Mitogen tube); and three tubes containing heparin, dextrose, and Mtb antigens (TB Antigen tube). Mtb antigens consisted of a single mixture of peptides representing ESAT-6, CFP-10, and TB7.7 as described in the package insert. Tubes with identical lot numbers were used. Tube contents were mixed with a Stuart rock-and-roll mixer (SciTech Instruments, Inc., Franklin, NJ) for 3 minutes at 33 RPM with the tube cap end lowered 20° to ensure that the entire inner surface of each tube was covered with blood. Within 1 hour of collection, the tubes were placed upright in an incubator at 37 ± 0.5°C. The tubes were incubated for 23 to 24 hours, after which they were centrifuged at 3,000 g for 10 minutes. Centrifuged tubes were stored and shipped at 2°C to 8°C. Temperatures during incubation, storage, and shipping were confirmed with a SL300 temperature data logger (SupCo, Allenwood, NJ). The IFN-γ concentrations in plasmas from the Nil tube, the TB Antigen tube, and the Mitogen tube (abbreviated Nil, TB, and Mitogen, respectively) were determined by enzyme-linked immunosorbent assay (ELISA), performed 13 to 15 days after blood collection using reagents included in QFT-GIT kits. No attempt was made to assure that QFT-GIT ELISA kits had identical lot numbers. All test parameters were within specifications stipulated in the QFT-GIT package insert. Test results were interpreted as indicated in the Cellestis package insert and CDC guidelines [2,25]. The interpretation was “positive” if the Nil was ≤8.0 IU/mL and the TB Response was ≥0.35 IU/mL and ≥25% of the Nil.
The interpretation was “negative” if the Nil was ≤8.0 IU/mL, the Mitogen Response was ≥0.5 IU/mL, and the TB Response was <0.35 IU/mL or <25% of the Nil. The interpretation was “indeterminate” if (a) the Nil was >8.0 IU/mL or (b) the Nil was ≤8.0 IU/mL, the Mitogen Response was <0.5 IU/mL, and the TB Response was <0.35 IU/mL or <25% of the Nil. For subjects with discordant interpretations, discrepancies in TB Response >0.05 IU/mL, or unusual IFN-γ measurements [26], results were recalculated based on verified OD values entered directly from the ELISA reader printout and used to create a reconciled dataset.
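The interpretation criteria above can be sketched as a small decision function. This is an illustrative sketch only, not the software used in the study. It assumes, following the package insert convention, that TB Response = TB − Nil and Mitogen Response = Mitogen − Nil. The function name and argument names are hypothetical.

```python
def interpret_qft_git(nil, tb, mitogen, cutoff=0.35):
    """Classify one QFT-GIT result from IFN-gamma concentrations (IU/mL).

    Assumption: TB Response = TB - Nil and Mitogen Response = Mitogen - Nil.
    """
    tb_response = tb - nil
    mitogen_response = mitogen - nil

    # (a) High background makes the test uninterpretable.
    if nil > 8.0:
        return "indeterminate"

    # Positive: TB Response at or above the cutoff AND at least 25% of Nil.
    if tb_response >= cutoff and tb_response >= 0.25 * nil:
        return "positive"

    # Not positive: an adequate Mitogen Response is required to call negative.
    if mitogen_response >= 0.5:
        return "negative"

    # (b) Low TB Response with an inadequate Mitogen Response.
    return "indeterminate"


if __name__ == "__main__":
    print(interpret_qft_git(nil=0.1, tb=0.6, mitogen=5.0))
    print(interpret_qft_git(nil=4.0, tb=4.5, mitogen=10.0))
```

The second call illustrates the 25%-of-Nil condition: a TB Response of 0.5 IU/mL exceeds the cutoff but is less than 25% of a Nil of 4.0 IU/mL, so the result is not positive.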

Statistical Methods
For assessment of variability in test interpretations (variability in qualitative results), the percentage of subjects with concordant results from tests performed at the three different labs was determined. For each pair of labs, positive agreement, negative agreement, and agreement beyond chance (Cohen's kappa statistic) were calculated. For the assessment of variability in quantitative results, Nil, TB, and TB Response distributions were compared using the Wilcoxon signed-rank test. Five additional indices of quantitative variability, the last two of which were derived from the standard deviation of the differences (SDdiff), were examined including (1) within-subject coefficient of variation (W-S CV%), (2) intraclass correlation coefficient (ICC), (3) mean difference between two labs (bias), (4) the smallest detectable difference (SDD), and (5) the within-subject standard deviation (W-S SD). SDD = 1.96 × SDdiff and is the smallest change in a second measurement that must occur to detect a change above the variability (e.g., noise) with 95% confidence [27,28]. W-S SD = ±(SDdiff/√2) [29] and represents 68% of the variation expected around the true value [30]. Limits of agreement (LOA) = bias ± SDD and encompass the range around the bias that contains 95% of within-subject differences [31]. ICCs were calculated using the SAS macro ICC_SAS [32]. W-S CV% was calculated as described by Bland (root mean square approach) [33] for Nil and TB but estimated for TB Response using the formula √[(W-S CV% TB)² + (W-S CV% Nil)²]. The W-S CV%s for the TB Response could not be directly determined due to inflation caused by zeroes and low means in the denominator (a result of subjects with both positive and negative TB Response values). A confidence level of 0.95 was used in all hypothesis tests.
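The relationships among bias, SDdiff, SDD, W-S SD, and LOA defined above can be illustrated with a short sketch for one pair of labs. This is not the SAS or Analyse-It code used in the study. The function names are hypothetical, and the W-S CV% function reflects one reading of the root mean square approach (per-subject variance divided by the squared per-subject mean, averaged, then square-rooted), which should be treated as an assumption.

```python
import math
import statistics


def pairwise_indices(lab_a, lab_b):
    """Bias, SDdiff, SDD, W-S SD, and LOA for paired measurements (IU/mL)."""
    diffs = [a - b for a, b in zip(lab_a, lab_b)]
    bias = statistics.mean(diffs)          # mean difference between labs
    sd_diff = statistics.stdev(diffs)      # sample SD of the differences
    sdd = 1.96 * sd_diff                   # smallest detectable difference
    ws_sd = sd_diff / math.sqrt(2)         # within-subject SD
    return {
        "bias": bias,
        "sd_diff": sd_diff,
        "sdd": sdd,
        "ws_sd": ws_sd,
        "loa": (bias - sdd, bias + sdd),   # limits of agreement
    }


def ws_cv_percent_rms(lab_a, lab_b):
    """W-S CV% by a root mean square approach (assumed form, see lead-in).

    Requires a nonzero mean for every subject; zero or near-zero means
    inflate the result, as noted in the text for TB Response.
    """
    cv_squared = []
    for a, b in zip(lab_a, lab_b):
        within_var = (a - b) ** 2 / 2      # variance of two measurements
        subject_mean = (a + b) / 2
        cv_squared.append(within_var / subject_mean ** 2)
    return 100.0 * math.sqrt(statistics.mean(cv_squared))


def combined_cv_percent(cv_tb, cv_nil):
    """Estimate W-S CV% for TB Response from the TB and Nil values."""
    return math.sqrt(cv_tb ** 2 + cv_nil ** 2)


if __name__ == "__main__":
    # Made-up values for illustration only; not study data.
    lab1 = [0.10, 0.40, 1.00, 2.00]
    lab2 = [0.12, 0.30, 1.20, 1.70]
    print(pairwise_indices(lab1, lab2))
```

With these made-up values the differences have a mean of 0.045 and a standard deviation of 0.21, so SDD is about 0.41 and W-S SD is about 0.15.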
Stratified analyses for quantitative indices were performed on concordant positive, concordant negative, and discordant groups and three groups stratified by mean TB Response of <0.10 IU/mL, 0.10 through 0.60 IU/mL, and >0.60 IU/mL. Indices of variability were not reported for groups with fewer than 10 subjects to avoid inaccuracies due to small sample size. SAS v9.2 (SAS, Cary, NC) and “Analyse-It” v2.22 for Excel (Analyse-It Software, Ltd., Leeds, UK) were used to perform the analyses.

Subject Characteristics
Study participation is depicted in Figure 1. Of the 174 people asked to participate, 103 consented, and 97 had QFT-GIT tests completed in all three labs. Characteristics of study subjects are shown in Table 1.

Qualitative Results Using Original Data
Comparisons of test interpretations among all three labs using original (unreconciled) data are shown in Table 2. No QFT-GIT result was indeterminate. Eleven of 97 subjects (11.3%) had discordant results. Comparisons of test interpretations between

Recognition of Data Entry and Methodological Errors
No errors in electronically transferred data were identified. Two types of manual data entry errors at Lab3 were identified, affecting results for nine subjects. The first type of error was a misalignment of results for eight subjects so that TB, Nil, and TB Response values were assigned to the wrong subjects. The second type of error, affecting a ninth subject, occurred as a result of a misplaced decimal point due to human error that caused inaccuracy in reported TB and TB Response values. A line listing of QFT-GIT results from these nine subjects is shown in Table S3. These errors were corrected in the reconciled dataset. A third type of error was recognized for six subjects who had extremely high IFN-γ concentrations reported for TB values in Lab3 (range 37.4 to 102.5 IU/mL) when compared to Lab1 and Lab2 (range 8.6 to 18.4 IU/mL) and when compared to other Lab3 TB values (all >7 times the interquartile range of 3.33 IU/mL). TB and TB Response values for these six subjects and a seventh subject with the next highest Lab3 TB and TB Response values are shown in Table S4. The large discrepancies and high TB values reported by Lab3 were due to misinterpreted OD values reported by the ELISA workstation. OD values above the working range of the Lab3 reader were reported as “9.999”, resulting in calculation of exaggerated and inaccurate IFN-γ concentrations. This was a methodological error. OD values above the working range were reported in the other labs as “OWR” (outside of working range), thus preventing calculation of an IFN-γ concentration. Because the ODs reported as “9.999” could not be verified for the six subjects with exaggerated TB values, data from these six subjects were excluded from the reconciled dataset.
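The over-range problem described above suggests a simple guard: treat a reader's placeholder value as "outside of working range" instead of passing it to the calibration curve. The sketch below is hypothetical and does not reproduce any workstation's software. The names parse_od and safe_concentration, the max_od limit, and the od_to_conc callable (standing in for a lab's calibration curve) are all assumptions for illustration.

```python
OWR = "OWR"  # marker for "outside of working range"


def parse_od(raw, max_od, sentinels=("9.999",)):
    """Return the OD as a float, or OWR if it cannot be trusted.

    raw       : OD as reported by the reader (string)
    max_od    : upper limit of the reader's working range (assumed known)
    sentinels : placeholder strings a reader may emit for over-range wells
    """
    text = raw.strip()
    if text in sentinels or text.upper() == OWR:
        return OWR
    value = float(text)
    return OWR if value > max_od else value


def safe_concentration(raw, od_to_conc, max_od):
    """Convert an OD to a concentration only when the OD is in range."""
    od = parse_od(raw, max_od)
    if od == OWR:
        return OWR  # no concentration is calculated for over-range wells
    return od_to_conc(od)


if __name__ == "__main__":
    # Toy linear calibration for illustration only; not a real standard curve.
    toy_curve = lambda od: 2.0 * od
    for reported in ["1.250", "9.999", "OWR"]:
        print(reported, "->", safe_concentration(reported, toy_curve, max_od=3.5))
```

Returning a marker rather than a number forces downstream code to handle over-range wells explicitly, which mirrors how the other two labs' "OWR" report prevented calculation of a concentration.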

Qualitative Results Using Reconciled Data
Comparisons of test interpretations among all three labs using reconciled data are shown in Table 2. No QFT-GIT result was indeterminate. Seven of 91 subjects (7.7%) had discordant results after data were reconciled. Comparisons of test interpretations between pairs of labs are shown in Table 3 using reconciled data. Nil concentrations, TB Responses, and QFT-GIT interpretations are shown in Table S1 for the 7 subjects with discordant interpretations using reconciled data. Of these seven, six (85.7%) had all TB Responses within 0.25 IU/mL of the 0.35 IU/mL cutoff. Of 19 subjects who had one or more TB Responses within 0.25 IU/mL of the cutoff, 7 (36.8%) had discordant QFT-GIT interpretations, while none of the 72 subjects with no TB Response in this range had discordance.
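The qualitative agreement measures used here, percent concordance among three labs and Cohen's kappa for a pair of labs, can be computed as in the sketch below. This is an illustration with hypothetical function names, not the study's analysis code. The example list is constructed only to match the reported counts (84 concordant and 7 discordant of 91 subjects) and does not represent individual subject data.

```python
from collections import Counter


def three_lab_concordance(results):
    """Percent of subjects whose interpretations agree in all three labs.

    results: list of (lab1, lab2, lab3) interpretation tuples.
    """
    concordant = sum(1 for r in results if len(set(r)) == 1)
    return 100.0 * concordant / len(results)


def cohen_kappa(lab_a, lab_b):
    """Agreement beyond chance between two labs' interpretations.

    Undefined (division by zero) if chance agreement equals 1, e.g. when
    both labs report the same single category for every subject.
    """
    n = len(lab_a)
    observed = sum(x == y for x, y in zip(lab_a, lab_b)) / n
    count_a, count_b = Counter(lab_a), Counter(lab_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / n ** 2
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Constructed to match the reported counts only (84 + 7 = 91 subjects).
    results = [("positive",) * 3] * 84 + [("positive", "negative", "positive")] * 7
    print(round(three_lab_concordance(results), 1))
```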

Quantitative Results Using Reconciled Data
Median and mean Nil, TB, and TB Response values using reconciled data are shown in Table 4. Nil values >0.4 IU/mL did not change. No subjects had TB Responses <−0.35 IU/mL or Mitogen Responses <−0.5 IU/mL. Examination of the reconciled data with Bland-Altman difference plots (Figure 2)   Table 5. Bias and LOA showed greater variability in TB Response among subjects with concordant positive interpretations than those with concordant negative interpretations. Bias in TB Response ranged from 0.00 IU/mL when data from Lab1 and Lab2 were compared for subjects with concordant negative interpretations to 1.82 IU/mL when data from Lab1 and Lab3 were compared for subjects with concordant positive interpretations. SDD ranged from 0.08 to 9.61 IU/mL in these groups, respectively. Indices for TB Response variability tracked indices of variability for TB. Nil values were less variable between strata and between labs than TB or TB Response values. W-S SD followed a similar trend with variability of concordant positives > variability of total population > variability of concordant negatives. Examination of ICC revealed that concordant negatives were less correlated than concordant positives. Variability adjusted for each subject's mean value (W-S CV%) was similar for subjects with concordant negative and concordant positive results for Lab1 vs. Lab2, but much larger in concordant negatives for TB and TB Response when Lab1 or Lab2 was compared to Lab3.
Bias, upper and lower LOA, W-S SD, and their 95% confidence intervals (CIs) for TB Response using an alternative stratification scheme (<0.10 IU/mL, 0.10 to 0.60 IU/mL, and >0.60 IU/mL) based on the subject's mean value from the three labs are shown in Table 6. These results indicate a similar trend of increasing variability with increasing TB Response. The values for the middle group (0.10 IU/mL to 0.60 IU/mL) are intended to provide an estimate of the variability of TB Response surrounding the assay cutoff. W-S SD for this group ranged from ±0.08 IU/mL to ±0.16 IU/mL with the largest upper 95% CI boundary for this group being 0.25 IU/mL (Lab1 vs. Lab2).

Comparison of Results Using Original and Reconciled Data
Correction of the manual data entry errors for 9 subjects changed the test interpretations for six subjects: from positive to negative for three and from negative to positive for three (Table S3). Table S1 shows that correcting manual data entry errors resolved the discordance observed in the original results for five subjects, but generated discordance for another subject. While 11.3% of subjects had discordant interpretations among the three labs using original data, 7.7% had discordant interpretations using reconciled data (Table 2). As shown in Table 3, of the Lab3 comparisons, those involving the original data showed lower agreement than those involving reconciled data, while minimal change was observed for Lab1 vs. Lab2, with lowering of the denominator from 97 to 91. Removal of the six subjects with extremely high Lab3 TB and TB Response values did not change the number of subjects with discordant interpretations because these six subjects were concordantly positive. While 36.4% of subjects with discordance using original data had all TB Responses within 0.25 IU/mL of the cutoff, 85.7% of those with discordance using reconciled data had all TB Responses within 0.25 IU/mL of the cutoff. Quantitative indices of test variability were lowered by correcting the data entry errors. Comparison of quantitative results of original and reconciled data showed that Lab3 median and mean TB and TB Response values decreased following correction of the misplaced decimal point and exclusion of the six subjects with exaggerated TB and TB Response values (Table 4). Median and mean TB and TB Response values for Lab1 and Lab2 also decreased with exclusion of these six subjects. Quantitative variability in TB and TB Response values decreased with data reconciliation as demonstrated by reductions in LOA, W-S SD, ICC, and W-S CV% when unstratified results from each pair of labs were compared using original data (Table S2) versus reconciled data (Table 5).

Discussion
We observed substantial within-subject interlaboratory variability in QFT-GIT interpretations and IFN-γ measurements when blood samples collected from the same person at the same time were tested in three different labs. Of the 97 subjects tested in three labs, 11% had discordant QFT-GIT interpretations based on the original reported data. Electronic transfer of data was not possible for one of the three labs testing specimens for this study, and a portion of the variability in test interpretation was associated with manual data entry errors. Data entry errors included data misalignments and a misplaced decimal point that were encountered with manual data entry but not electronic data transfers. All three labs used an automated ELISA workstation to assist in performing QFT-GIT, and this may have avoided additional data entry errors. As compared to manually performed ELISAs, automated ELISA workstations can read specimen barcodes that discriminate subjects and QFT-GIT tube type (i.e., Nil tube, TB Antigen tube, Mitogen tube) and assign OD values to specific specimens. This avoids some inaccuracies that have been attributed in prior studies to data entry errors and transposition of IFN-γ measurements [26].
A third type of error was recognized for six subjects who had exaggerated TB values in one lab due to errors in interpreting OD values when they were over the working range of the ELISA workstation. Certain lots of ELISA kits with higher activity as evidenced by higher OD values for standards tended to have higher ODs for plasma samples and have more TB ODs above the working range for the ELISA readers (data not shown). Data from the six subjects with OD values over the working range were excluded from the reconciled dataset. Removal of these subjects with methodological errors did not appreciably alter interpretation agreement because all were concordantly positive.
Corrections of data entry errors made a substantial difference in interpretative agreement between each lab and among all three labs. When reconciled data from Lab1 vs. Lab2, Lab1 vs. Lab3, or Lab2 vs. Lab3 were compared, 94.5%, 93.4%, and 96.7% of interpretations agreed, respectively. However, among all three labs, 92.3% of subjects had concordant results after the data were reconciled.
Several pieces of evidence suggest that the majority of discordance in QFT-GIT interpretation remaining after data reconciliation was due to variability in measuring TB Response. While none of the subjects with discordance attributed to data entry errors had all TB Response values within 0.25 IU/mL of the cutoff separating positive and negative interpretations, 86% of those with discordance after data were reconciled had all TB Response values within this range. Additionally, 37% of the subjects who had one or more TB Response values within this range after data were reconciled had discordance, but none of the subjects without a TB Response within this range had discordance. These statistics do not describe the actual magnitude of variability in TB Response.
We examined the magnitude of variability in TB Response and the two IFN-γ measurements used to calculate TB Response. Of the many indices of variability, LOA may be the most informative. LOA is expressed in units of test measurement and includes bias. W-S CV% masks the impact of IFN-γ concentration magnitude on variability, while ICC and W-S SD do not take into account the bias between measurements. Variability, as measured by LOA, was greater for higher IFN-γ measurements. This was observed for Nil, TB, and TB Response, but because TB and TB Response values tended to be larger than Nil values, greater variability was observed for TB and TB Response. While subjects with concordant positive interpretations had more variability in TB Response than those with concordant negative interpretations, the variability near the cutoff is of greater importance because of its effect on interpretive agreement. Bland-Altman analysis allows assessment of variability in paired measurements and identifies the range of measurements encompassing 95% of TB Response variability associated with repeat testing. Because variability is not uniform across the range of TB Response values, applying a global measure of variability derived from the entire range may not be suitable near the cutoff. Among the 14 subjects with a mean TB Response of 0.10 through 0.60 IU/mL (i.e., 0.35 ± 0.25 IU/mL), which included 6 of the 7 subjects with discordant QFT-GIT interpretations, the upper LOA was as high as 0.43 IU/mL and the lower LOA was as low as −0.46 IU/mL (Table 6). The 95% CIs for LOAs may be relatively large because of the small number of subjects with mean TB Response values near the cutoff. Clinicians, naive to the direction of comparison, can expect results from a second lab to be within 0.46 IU/mL of the first with 95% certainty.
Because this estimate of variability is determined for a range (i.e., 0.10 through 0.60 IU/mL), it overestimates variability for TB Response values near 0.10 IU/mL and underestimates variability for TB Response values near 0.60 IU/mL. Another consideration is that for a particular TB Response, changes in only one direction can alter test interpretation.
The amount of uncertainty in interpreting QFT-GIT that is acceptable has not been established. Whereas LOA encompasses a range for 95% of the test-retest differences, bias ± W-S SD encompasses 52% of the variability expected with retesting [30]. W-S SD also reflects the variability relative to the true value such that 68% of measurements will be within one W-S SD of the theoretical true value (typically estimated as the subject's mean value) [30]. W-S SD for TB Response was as high as 0.16 IU/mL for subjects with mean TB Response near the cutoff (i.e., 0.10 through 0.60 IU/mL). W-S SD, which is also referred to as “wobble”, is intended to describe random variation. What we measured as interlaboratory bias could be misinterpreted as random variation if testing were performed in a random selection of laboratories.
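The 95%, 52%, and 68% figures quoted above follow from normal-distribution coverage once W-S SD is written as SDdiff/√2. The short check below assumes normally distributed differences and measurements, which is an assumption of this sketch and is not examined here. The function name is hypothetical.

```python
import math


def normal_coverage(k):
    """Probability that a normal deviate lies within k SDs of its mean."""
    return math.erf(k / math.sqrt(2))


# LOA = bias +/- 1.96 * SDdiff covers about 95% of test-retest differences.
loa_coverage = normal_coverage(1.96)

# bias +/- W-S SD spans only 1/sqrt(2) of an SDdiff on either side,
# because W-S SD = SDdiff / sqrt(2).
ws_sd_difference_coverage = normal_coverage(1 / math.sqrt(2))

# A single measurement lies within one W-S SD of the true value.
ws_sd_true_value_coverage = normal_coverage(1.0)

if __name__ == "__main__":
    print(f"LOA: {loa_coverage:.3f}")
    print(f"bias +/- W-S SD: {ws_sd_difference_coverage:.3f}")
    print(f"true value +/- W-S SD: {ws_sd_true_value_coverage:.3f}")
```

The three values come out at roughly 0.950, 0.520, and 0.683, matching the percentages in the text.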
We harmonized testing methods as much as possible, so that there were no differences in delays to incubation, incubation time, incubation temperature, and minimal differences in duration of storage. However, there were areas where consistency could not be maintained. For example, labs used QFT-GIT kits with different lot numbers, different automated ELISA workstations, different calibration curves, and different reporting methods. Greater variability may have occurred with less harmonization of test methods.
Various borderline zones around the cutoff have been proposed to address variability [14,15,18,19,20,34]. However, prior investigations have not considered interlaboratory variability or the impact of nonuniform variability in measuring TB Response. Most prior investigations of variability have been challenged to analyze relatively small sample sizes. The small number of subjects near the cutoff also challenged our stratified analysis. Despite the lack of available data from interlaboratory reproducibility studies, our estimates of discordance (11.3% to 7.7%) seem to be in keeping with those seen in intralaboratory between-run estimates of discordance [13,18,19,20].
Interlaboratory variability is a symptom of a larger problem of IGRA imprecision. IGRA imprecision may also explain a portion of the variability encountered with serially performed IGRAs among healthcare workers [10][11][12][13][14][15][16][17][18][19][20][21][22][23][24]. We measured test variation that is not attributable to subject variation (e.g., due to new infection, treatment, or fluctuations in immune status). Blood samples were collected at the same time to exclude the effect of subject variation due to time. Additional studies are needed to assess IGRA imprecision and understand the components of variation seen in serial testing. The imprecision demonstrated with serial testing and by interlaboratory variability is also relevant when interpreting individual or initial IGRA results.
Table 6. W-S SD, bias, and LOA in three strata based on the mean TB Response for Lab1, Lab2, and Lab3.
In conclusion, greater interlaboratory variability was associated with manual data entry and higher IFN-γ measurements. Manual data entry should be avoided. Our data suggest that variability in measuring TB Response may affect QFT-GIT interpretation, especially when near the cutoff. Therefore, consideration should be given to interpreting such responses as “borderline” rather than negative or positive, and clinical decisions regarding treatment or the need to repeat these tests should be based on individualized clinical judgment considering the risk of infection, the risk of disease, and the proximity of the TB Response to the cutoff. In the population we studied, interpreting TB Response values of 0.10 through 0.60 IU/mL as “borderline” would have avoided most changes in test interpretation due to measurement variability. However, this may not be the appropriate range for the entire population for whom QFT-GIT is recommended. Additional studies are needed to determine the optimal range of values for borderline results and to explore the impact of using a borderline interpretation.
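A borderline zone of the kind discussed above could be expressed as follows. This is a simplified, hypothetical sketch: it classifies on TB Response alone and omits the Nil and Mitogen criteria. The default bounds of 0.10 and 0.60 IU/mL are the range examined in this study population, not a validated recommendation.

```python
def interpret_with_borderline(tb_response, low=0.10, high=0.60):
    """Three-way reading of TB Response (IU/mL) with a borderline zone.

    Simplification: Nil and Mitogen criteria are not applied here.
    low, high: inclusive bounds of the borderline zone (illustrative).
    """
    if tb_response < low:
        return "negative"
    if tb_response > high:
        return "positive"
    return "borderline"


if __name__ == "__main__":
    for value in (0.05, 0.34, 0.36, 0.75):
        print(value, interpret_with_borderline(value))
```

Under this scheme, values of 0.34 and 0.36 IU/mL, which fall on opposite sides of the 0.35 IU/mL cutoff, receive the same "borderline" reading rather than opposite interpretations.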