Proficiency of phenotypic drug susceptibility testing for Mycobacterium tuberculosis in China, 2008–2021

This study analyzed the results of proficiency testing for anti-tuberculosis drug susceptibility testing (DST) in China. The number of laboratories participating in proficiency testing, together with the sensitivity, specificity, reproducibility, and accordance rate, was calculated from the results of 13 rounds of DST proficiency testing conducted from 2008 to 2021. A total of 30 strains of Mycobacterium tuberculosis with known susceptibility results were sent to each laboratory in each round from 2008 to 2019, and 20 strains in 2020 and 2021. The number of participating laboratories ranged from 30 in 2009 to 546 in 2021. L-J DST was the predominant method. Specificity was generally higher than sensitivity. Specificity improved for all drugs over the years, while sensitivity did not improve for amikacin and capreomycin. The accordance rate for pyrazinamide and kanamycin and the reproducibility for capreomycin and pyrazinamide did not improve significantly over the years. Most participating laboratories significantly improved the quality of their DST through consecutive rounds of proficiency testing, except for second-line injectable drugs and pyrazinamide. The results highlight the importance of developing novel methods and/or improving existing methods of phenotypic DST for certain drugs.


Introduction
Drug-resistant tuberculosis (TB) remains a major public health concern in China. There were an estimated 33,000 (27,000-39,000) multidrug-resistant/rifampin-resistant tuberculosis (MDR/RR-TB) cases in China in 2021, of which 16,766 were detected, accounting for 50.8% of the estimated MDR/RR-TB cases [1]. Over the past years, most provincial- and city-level TB laboratories have established capacity for drug susceptibility testing (DST) using either molecular or phenotypic tools. Further expanding access to these tools and improving their quality have therefore become more essential. Rapid molecular tests have been recommended as the initial diagnostic tool. Phenotypic DST is still useful for cases that are highly suspected of resistance but test susceptible by molecular methods, and for detecting susceptibility to second-line drugs that are not covered by the initial molecular tools. Establishing a TB laboratory network to assess the proficiency of phenotypic DST is critical. Proficiency testing uses interlaboratory comparison to assess the performance of a laboratory's tests. The nationwide DST proficiency testing (DST-PT) program organized by the National Tuberculosis Reference Laboratory started in 2008. Most previous studies on proficiency testing focused on first-line drugs [2][3][4][5], and very few on second-line drugs [6,7]. Proficiency data covering both first- and second-line drugs are lacking, especially within a single country. This study retrospectively analyzed phenotypic DST proficiency testing over ten years for both first- and second-line drugs in a high TB burden country.

Participating laboratories
Thirteen rounds of DST-PT were implemented from 2008 to 2021, with no round in 2015. The number of laboratories participating in DST-PT increased from 30 (round 1) to over 500 (round 13), most of which were city-level laboratories (Table 1).

Origin and composition of proficiency testing panels
Each DST-PT panel came from the World Health Organization (WHO)'s coordinating supranational reference laboratory (SRL) in Belgium or from the Hong Kong TB SRL. DST using Lowenstein-Jensen (L-J) medium and the liquid Mycobacterium Growth Indicator Tube (MGIT) system was performed at China's National Tuberculosis Reference Laboratory (NTRL). Strains were sub-cultured and dispensed into 2.0 ml plastic cryovials containing L-J medium or Middlebrook 7H9 with glycerol before transportation to 31 Provincial TB Reference Laboratories (PTRLs). PTRLs further sub-cultured the strains and transported them to city- or county-level laboratories.
The definition of provincial-, city-, and county-level laboratories was based on China's administrative divisions; that is, the level of a laboratory is defined by the level of the medical institution to which it is affiliated. Approval from the relevant department is required before transportation.
Triple packaging compliant with UN packing instruction P620 was used to transport panel strains, both from NTRL to provincial-level laboratories and from provincial-level to lower-level laboratories. Panel strains were transported either by air, by a company qualified to transport materials containing infectious substances, or by trained staff using vehicles (S1 Fig). A total of 30 strains, consisting of 10 pairs of duplicate strains and 10 single strains, were used in rounds 1-11, while 20 strains, composed of 9 pairs of duplicate strains, one single MTB strain, and one single nontuberculous mycobacteria (NTM) isolate, were used in the last two rounds (Table 2). One to three NTM strains were added for strain identification from the 9th round onward. The drugs tested followed those in the proficiency testing provided by the supranational reference laboratory network in the same period.

Identification and drug susceptibility testing methods
Participating laboratories were required to use their routine DST methods for proficiency testing, such as L-J, MGIT, or a commercial minimal inhibitory concentration (MIC) method. The numbers of laboratories using each method are summarized in Table 3. Critical concentrations for L-J and MGIT followed the WHO guidelines (Table 4). Cutoff values for the MIC method were based on the manufacturer's instructions. From the 9th round onward, identification by biochemical test or rapid immunochromatographic assay (such as Capillia TB) was required before DST. Contaminated strains were decontaminated using 4% NaOH, sub-cultured, and then subjected to the subsequent tests. If contamination was concentrated in a specific strain, provincial laboratories could decontaminate and resend that strain. If the contamination could not be resolved, laboratories reported the result as contamination, and the strain was excluded by NTRL when calculating the performance indicators, as described in the data analysis section.

Data analysis
Results were compared with consensus results, defined as at least 80% concordance on "susceptible" or "resistant" among all reported results, as previously published [3]. Four performance indicators were calculated: the ability to detect true resistance (rate of detection of judicially resistant strains, RD; sensitivity); the ability to detect true susceptibility (rate of detection of judicially susceptible strains, SD; specificity); intralaboratory agreement between duplicate strains (reproducibility, RP); and the accordance rate (number of correct results divided by the total number of results excluding contaminated strains, AR; efficiency). If the result for one strain of a pair was unavailable because of contamination or no growth, the pair was excluded from the reproducibility analysis.
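The indicator definitions above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names, the "R"/"S"/None result coding, and the example panel are hypothetical and chosen only to make the definitions concrete.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical coding: "R" = resistant, "S" = susceptible,
# None = no usable result (contamination or no growth).
Result = Optional[str]


def consensus(calls: List[str], threshold_pct: int = 80) -> Optional[str]:
    """Return "R" or "S" if at least threshold_pct of the calls agree, else None."""
    for label in ("R", "S"):
        # Integer arithmetic avoids floating-point edge cases at exactly 80%.
        if calls and 100 * calls.count(label) >= threshold_pct * len(calls):
            return label
    return None


def _rate(numerator: int, denominator: int) -> Optional[float]:
    """Proportion, or None when there is nothing to evaluate."""
    return numerator / denominator if denominator else None


def indicators(
    reported: Dict[str, Result],
    reference: Dict[str, str],
    pairs: List[Tuple[str, str]],
) -> Dict[str, Optional[float]]:
    """Compute RD, SD, RP and AR for one laboratory and one drug."""
    # Strains without a usable result are excluded from every indicator.
    valid = {s: r for s, r in reported.items() if r is not None}
    resistant = [s for s in valid if reference[s] == "R"]
    susceptible = [s for s in valid if reference[s] == "S"]
    true_r = sum(valid[s] == "R" for s in resistant)
    true_s = sum(valid[s] == "S" for s in susceptible)
    # A duplicate pair counts only if both members have a usable result.
    usable = [(a, b) for a, b in pairs if a in valid and b in valid]
    agree = sum(valid[a] == valid[b] for a, b in usable)
    return {
        "RD": _rate(true_r, len(resistant)),      # sensitivity
        "SD": _rate(true_s, len(susceptible)),    # specificity
        "RP": _rate(agree, len(usable)),          # reproducibility
        "AR": _rate(true_r + true_s, len(valid)), # accordance rate
    }


# Illustrative panel: two duplicate pairs and two single strains.
reference = {"s1": "R", "s2": "R", "s3": "S", "s4": "S", "s5": "R", "s6": "S"}
reported = {"s1": "R", "s2": "S", "s3": "S", "s4": "S", "s5": "R", "s6": None}
pairs = [("s1", "s2"), ("s3", "s4")]
result = indicators(reported, reference, pairs)
print(result)
```

In this sketch a contaminated strain (s6) drops out of every denominator, and a duplicate pair is counted for reproducibility only when both members have a result, mirroring the exclusion rules described above.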
SAS software version 9.4 was used for statistical analysis. The chi-square test was used to assess differences among rounds, and the Cochran-Armitage trend test was used to assess the trend of improvement. A significant difference was defined as P<0.05.
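For readers without SAS, the trend test can be sketched in plain Python. This is an illustrative textbook form of the Cochran-Armitage statistic with a normal approximation and equally spaced round scores; the function name and the example counts are hypothetical, and the output may differ in detail from the SAS procedure used in this study.

```python
import math
from typing import Optional, Sequence, Tuple


def cochran_armitage(
    correct: Sequence[int],
    totals: Sequence[int],
    scores: Optional[Sequence[float]] = None,
) -> Tuple[float, float]:
    """Return (z, two-sided p) for a linear trend in proportions.

    correct[i] / totals[i] is the proportion of correct results in round i.
    scores default to 1, 2, ..., k (equally spaced rounds; an assumption).
    """
    k = len(totals)
    if scores is None:
        scores = list(range(1, k + 1))
    n = sum(totals)
    p_bar = sum(correct) / n  # pooled proportion under the null hypothesis
    # Score-weighted sum of observed minus expected correct results.
    t_stat = sum(s * (c - m * p_bar) for s, c, m in zip(scores, correct, totals))
    s1 = sum(m * s for s, m in zip(scores, totals))
    s2 = sum(m * s * s for s, m in zip(scores, totals))
    variance = p_bar * (1 - p_bar) * (s2 - s1 * s1 / n)
    if variance <= 0:
        return 0.0, 1.0  # no variation, so no evidence of trend
    z = t_stat / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Hypothetical counts of correct results in three consecutive rounds.
z, p = cochran_armitage(correct=[40, 45, 50], totals=[50, 50, 50])
print(f"z = {z:.3f}, p = {p:.5f}")
```

A positive z with P<0.05 would be read as a significant improving trend across rounds; the chi-square test answers the separate question of whether the rounds differ at all.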

Ethical review
The present study is based on routine work and involved only laboratory testing of mycobacteria, with no individual human subjects or personal information; ethical review was therefore waived.

Sensitivity
The average sensitivity with 95% confidence intervals (CIs) of RFP, INH, EMB, SM, PZA, KM, AK, CPM and OFX ranged from 78.54% (95% CI: 75.02-82.07) to 99.73%. For RFP, the sensitivity was higher than 90% in all rounds except the 3rd round. For INH, the sensitivity was higher than 90% in 10 rounds, the exceptions being rounds 2-4. For EMB, the sensitivity was lower than 80% in the 2nd, 3rd, 4th, 5th, 7th and 8th rounds and has remained above 90% since round 9. For SM, 2/8 rounds showed sensitivity lower than 90%. For KM, the sensitivity in the first four rounds was around 80%, and higher than 90% in the other rounds. For AK, the sensitivity was higher than 90% in all rounds but fluctuated between rounds. For CPM, the sensitivity also varied considerably, especially in rounds 5, 10 and 11. For OFX, the sensitivity was higher than 90% except in round 4 (84.75%). The sensitivity differed significantly between rounds for all drugs (p<0.001). As a whole, the sensitivity of most drugs had a

Specificity
For EMB, the specificity was lower than 90% in rounds 3 and 4. For SM, the specificity was lower than 90% in round 2. For PZA, the specificity was lower than 90% in rounds 7, 10 and 12. The specificity differed significantly between rounds for all drugs (p<0.001). As a whole, the specificity for all drugs improved across rounds (p<0.05) (Fig 1).

Reproducibility
The reproducibility of SM was lower than 90% in the first four rounds. For PZA, the reproducibility was lower than 90% in almost all rounds, the exception being the 13th round. For CPM, the reproducibility was lower than 90% in rounds 10 and 11. There were significant differences between rounds for all drugs (p<0.001) (Fig 1). From the first to the last round, the reproducibility also improved for most drugs (p<0.001), except for CPM (p = 0.9884) and PZA (p = 0.1509).

Discussion
Proficiency testing is a critical element of quality assurance, through which a laboratory can identify the major problems in its DST. The present study showed that the number of laboratories with phenotypic DST capacity expanded rapidly in China over the past years, and that the quality of DST for most tested drugs also improved. Establishing provincial- and city-level laboratories with phenotypic DST capacity was one of the objectives of the 10th five-year program (2011-2015) for tuberculosis control and prevention in China, which drove the rapid increase in the number of city-level laboratories providing DST services since 2011. Performance in rounds 3 and 4 was worse than in rounds 1 and 2, and then improved in the subsequent rounds. We think this result is related to the higher proportion of provincial laboratories in the first two rounds, most of which had participated in the drug resistance surveillance supported by WHO and the Global Fund since 1994 [8][9][10][11] and were thus more proficient than the city-level laboratories, which lacked DST experience in the earlier rounds.
For RFP and INH, the two most important first-line anti-tuberculosis drugs, all performance indicators were good except in rounds 2, 3 and 4. For EMB, the sensitivity and reproducibility were not good, especially in the earlier rounds. Most EMB resistance mechanisms confer only modest MIC increases, resulting in significant overlap with the MIC distribution of wild-type strains [12], so the current binary DST results make it difficult to resolve this overlap. In addition, the defined critical concentration is very close to the MIC required to achieve anti-mycobacterial activity, which increases the probability of misclassifying susceptibility or resistance and leads to poor reproducibility of phenotypic DST results [13]. Since 2018, WHO has no longer recommended EMB DST as a routine test [14]. Improvement measures have been explored. For example, the Clinical and Laboratory Standards Institute (CLSI) document introduced an inconclusive MIC breakpoint of 4 μg/mL for EMB, commenting that an MIC of 4 μg/mL obtained by broth microdilution does not correlate with either a susceptible or a resistant result, and suggesting repeat testing with another method (e.g., a genotypic or an alternative broth method). This raises other issues that urgently need to be resolved, such as the reliability of current molecular-based EMB resistance detection tools and of phenotypic methods at the current critical concentrations in L-J and MGIT media. PZA was included in the proficiency testing only from the 7th round in 2013. MGIT is the only phenotypic method that can be used to detect PZA resistance, which limits the rollout of PZA DST. The specificity may be affected by over-inoculation or a non-homogeneous bacterial suspension, which reduces the effect of PZA because of the increased pH [15,16]. A non-homogeneous bacterial suspension, for example one containing large clumps when the turbidity is adjusted, can also cause the true bacterial count to vary between suspensions and thus affect the reproducibility. In the present study, although the sensitivity and specificity of PZA improved, the specificity remained below 90% without significant improvement. Specificity problems were also reported by another study on proficiency testing for pyrazinamide [17]. The critical concentration itself may also produce inconsistent results for isolates with a PZA MIC close to this concentration [18,19]. Phenotypic DST for PZA should therefore be further optimized, given the weakness shown in this study. DST-PT for SM was discontinued from round 9, consistent with WHO's drug profile, because SM is to be considered only if AK cannot be used and because its performance was unreliable even in the supranational tuberculosis reference laboratories, where the sensitivity and specificity showed high variability [20]. For the second-line injectable drugs, the best sensitivity was observed for AK, while CPM did not improve even after around ten years of implementation. The specificity and reproducibility of these three second-line drugs were good. AK is now classified as a Group C drug recommended for the treatment of RR-TB and, according to the WHO guideline, is to be considered only if DST results confirm susceptibility. OFX performed well, in line with other study results [20], although testing of OFX is no longer recommended because it is no longer used for treating resistant TB, and laboratories should transition to testing the later-generation FQs, such as LFX and MFX. Other studies have also pointed out that, except for INH and RMP, the accuracy and reproducibility of the other 7 drugs are poor [3,17,20-23].
From the implementation perspective, contamination in round 3 was concentrated in the laboratories of certain provinces, which indicates that further subculture in provincial-level laboratories can increase the risk of contamination, especially in the early years when provincial laboratory staff were not yet technically proficient. After practice and training, proficiency improved. Given the large number of laboratories and staff, DST training in China adopts a hierarchical approach, from the national level to the provincial level and from the provincial level to the city and county levels. In addition to training, laboratories with false-susceptible or false-resistant results are required to identify the root cause and solve the problem. Afterwards, contamination occurred very infrequently. The confidence intervals of the first five rounds are relatively wide, reflecting the poor performance of the participating laboratories. This is related on the one hand to the small number of participating laboratories, and on the other hand to the instability of results for some drugs and the proficiency of staff. Although the intervals are wide, most of the lowest points are above 80%; however, the sensitivity for four drugs (RFP, EMB, KM and CPM) fell below 80%. In addition to the above reasons, this is also related to the ratio of resistant to susceptible strains. There were too few drug-resistant strains in each round, generally 3-11 strains. In particular, only 3 strains in the fifth round were resistant to CPM, and the overall sensitivity for CPM in that round was 52.45%. If a laboratory misses resistance in just one strain, its sensitivity is 66.67% (2/3), which lowers the measured sensitivity for CPM, so the low value of this indicator is not due to laboratory detection ability. Similarly, the low detection rate and reproducibility for RFP resistance in the third round were also related to this reason (9 resistant strains). It is therefore suggested that laboratories organizing DST proficiency testing pay attention to the stability of the strains and the number of drug-resistant strains when preparing the panel. WHO and the International Union Against Tuberculosis and Lung Disease recommend that the panel contain 50% resistant strains for each drug [4]. However, it is now more difficult than before to reach 50% resistant strains for all tested drugs, which include critical first-line and second-line drugs and even new or repurposed drugs with very low resistance rates to date, such as bedaquiline and linezolid, with only 10 strains. More research on how to constitute panels is therefore needed.
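The effect of a small number of resistant strains can be made concrete with a short Python sketch. The function name is hypothetical and the calculation is simple arithmetic on a single laboratory's results, not a reanalysis of the study data.

```python
def sensitivity_pct(n_resistant: int, n_missed: int) -> float:
    """Sensitivity (%) of one laboratory that misses n_missed resistant strains."""
    if n_resistant <= 0 or not 0 <= n_missed <= n_resistant:
        raise ValueError("need 0 <= n_missed <= n_resistant and n_resistant > 0")
    return 100.0 * (n_resistant - n_missed) / n_resistant


# One missed strain with 3, 9 or 11 resistant strains in the panel,
# the panel sizes mentioned in the text.
for n in (3, 9, 11):
    print(n, round(sensitivity_pct(n, 1), 2))
```

With 3 resistant strains a single error costs a third of the indicator, whereas with 11 it costs about 9 percentage points, which is why the number of resistant strains per drug matters when a panel is assembled.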
It should also be noted that the good results of the proficiency testing were related to the fact that most strains selected by WHO had MICs far from the critical concentration [3], that mixtures of strains and heteroresistant strains were excluded, and that the judicial result was used as the gold standard; the panel strains therefore differ to some extent from routine clinical strains. Enhanced internal quality control is thus recommended to ensure the quality of routine DST services. MIC results, as a potential phenotypic DST method, are more useful than categorical DST results for classifying resistance mutations. Phenotypic DST remains a useful method for detecting drug resistance that cannot be identified by current commercial molecular tools, such as Xpert MTB/RIF, Xpert MTB/RIF Ultra, the line probe assays recommended by WHO, and some Chinese domestic products (Genechip MDR-TB detection assay and Melt-curve drug resistance detection assay).
In addition to the improved technical performance indicators, there have also been improvements at the implementation level over the years of proficiency testing. The DST methods, the measures taken to reduce contamination, and the data analysis method have all been improved (S1 Table). However, result reporting still relies on manual entry through Excel spreadsheets and email, so a more convenient information platform needs to be established.
This study has certain limitations. First, judicial results were used to evaluate the proficiency of laboratories, and MIC and genotypic results were not obtained to accurately analyze false results, which may be due to low-level resistance. In the future, we need to determine the minimal inhibitory concentration and characterize molecular resistance mutations in order to select representative and stable strains [24]; this additional information can help resolve discrepant results and indicate future directions for proficiency testing. Second, although the MIC method was used in some laboratories, there is no recommended critical concentration, and the standards and quality of commercial MIC plates vary. Third, the drugs tested did not include some of the critical drugs in MDR/RR-TB treatment regimens, such as moxifloxacin, BDQ, LZD, CFZ and DLM, which have been added to the proficiency testing since 2022 and will be systematically analyzed in the future.

Conclusions
Most participating laboratories significantly improved the quality of their DST through consecutive rounds of proficiency testing. The current DST-PT program should therefore be continued, with the drugs and methods updated. All laboratories conducting DST are encouraged to participate in annual proficiency testing and to strengthen internal quality control. However, the reliability of DST for some drugs still needs to be resolved.

Fig 1. Performance of DST proficiency testing for each drug in different rounds. The mean values of RD and SD with 95% confidence intervals and the mean values of RP and AR are shown. The difference in performance among rounds is based on a Cochran-Armitage trend test. RD, rate of detection of resistant strains; SD, rate of detection of susceptible strains; RP, reproducibility; AR, rate of accordant results among total test results. https://doi.org/10.1371/journal.pone.0304265.g001

Table 3. Number of participating laboratories using different DST methods.
*A few laboratories that use more than one method are counted multiple times accordingly. https://doi.org/10.1371/journal.pone.0304265.t003