Reference set of Mycobacterium tuberculosis clinical strains: A tool for research and product development

The Mycobacterium tuberculosis complex (MTBC) causes tuberculosis (TB) in humans and various other mammals. The human-adapted members of the MTBC comprise seven phylogenetic lineages that differ in their geographical distribution. There is growing evidence that this phylogeographic diversity modulates the outcome of TB infection and disease. For decades, TB research and development has focused on the two canonical MTBC laboratory strains H37Rv and Erdman, both of which belong to Lineage 4. Relying on only a few laboratory-adapted strains can be misleading as study results might not be directly transferrable to clinical settings where patients are infected with a diverse array of strains, including drug-resistant variants. Here, we argue for the need to expand TB research and development by incorporating the phylogenetic diversity of the MTBC. To facilitate such work, we have assembled a group of 20 genetically well-characterized clinical strains representing the seven known human-adapted MTBC lineages. With the “MTBC clinical strains reference set” we aim to provide a standardized resource for the TB community. We hope it will enable more direct comparisons between studies that explore the physiology of MTBC beyond the laboratory strains used thus far. We anticipate that detailed phenotypic analyses of this reference strain set will increase our understanding of TB biology and assist in the development of new control tools that are broadly effective.

Introduction Tuberculosis (TB) remains an urgent public health problem causing 10.4 million new cases and 1.3 million deaths every year [1]. The TB epidemic is worsening due to growing drug resistance and the absence of a universally effective vaccine against the transmissible pulmonary form of the disease [1].
The outcome of TB infection and disease is highly variable, ranging from rapid clearing by the innate immune response to life-long latent infection and various forms of active pulmonary and extra-pulmonary disease. In the past, most of this variation was attributed to the host and environmental factors. Because of the limited genetic diversity within the Mycobacterium tuberculosis complex (MTBC) compared to other bacteria [2], the view has been that no relevant phenotypic variation should be expected. However, recent advances in whole genome sequencing of large MTBC clinical strain collections from global sources have revealed more genomic diversity than previously appreciated. Specifically, the human-adapted MTBC comprises seven phylogenetic lineages that differ in their geographic distribution, and individual members of these lineages can differ by up to~2,000 single nucleotide polymorphisms (SNPs). This is equivalent to the phylogenetic distance between M. tuberculosis sensu stricto and M. bovis, which is a typical pathogen of cattle.
In addition to the genomic diversity across MTBC clinical strains, findings from many experimental studies have led to a change in paradigm by demonstrating the phenotypic impact of this genetic diversity. For example, studies have reported differences between clinical strains with respect to their transcriptomic profiles [3,4], protein and metabolite levels [5] [4], methylation profiles [5], drug susceptibility [6] and cell wall structure [7][8][9]. In addition, MTBC genetic diversity has also been shown to influence disease severity and human to human transmission, with "modern" lineages showing a faster progression to disease and shorter latency periods compared to strains from the "ancestral" clades [10][11][12][13].
Most of what we know about TB biology today is based on work performed during many decades, most of which has relied on the two canonical laboratory strains H37Rv and Erdman. Both of these strains, as well as the clinical strain CDC1551 used by some TB laboratories more recently, belong to MTBC Lineage 4 [14]. A notable exception is HN878, which belongs to Lineage 2 and is a gaining prominence as a laboratory representative of the Beijing family of strains [15].
H37Rv was first isolated from a patient (H37) with pulmonary tuberculosis in 1905 at the Trudeau Sanatorium in Saranac Lake, New York, while Erdman was isolated from human sputum by William H. Feldman in 1945, at Mayo Clinic, Rochester.
Since its original isolation, H37Rv has been used extensively in biomedical research. The sequence of its genome was published by Cole and colleagues in 1998, which was a breakthrough in TB research [16]. Indeed, H37Rv and its genome sequence still provide the backbone for most of TB biological research today, informing studies ranging from basic biochemistry and microbiology to global omics profiling, systems biology, drug discovery and immunology. However, H37Rv has been passaged countless times in various laboratories, and despite retaining its virulence in mice, it has adapted to laboratory conditions [17]. The same is likely true for Erdman and CDC1551 which have been isolated later than H37Rv, but which by now, have also been passaged in the laboratory for several decades. Hence, despite the great progress in our understanding of TB generated through studies based on laboratory strains, there are good reasons to expect that the findings from these studies do not paint the full picture and could benefit from being validated in more genetic backgrounds.
Despite the increasing number of experimental studies revealing important phenotypic differences across MTBC clinical strains, many of these studies have been difficult to reproduce between different laboratories, and the data are often contradictory. Moreover, linking experimental phenotypes to clinical and epidemiological characteristics of MTBC lineages or strains has been particularly challenging. We propose that part of these challenges could be overcome by standardizing the complement of clinical MTBC strains we study. As a first step, we suggest to broaden the scope of basic and translational TB research by incorporating a set of genetically well-characterized clinical strains representative of the known phylogenetic diversity of the pathogen. In time, the community would accumulate a significant body of data that could support new findings that are more relevant to global TB. To this end it is important that there is a collective agreement to avoid passaging these strains extensively and minimize laboratory adaptation.
Over the years, our group has been collecting strains from around the world and characterizing them by whole genome sequencing. Our main aim was to draw evolutionary and phylogeographic inferences [18], however, we also realized the importance of studying this diversity more broadly, which is why we used our global collection of MTBC clinical strains and the associated phylogenomic data to rationally select a subset to be used as reference strains for future research. We believe this set of strains will be of value for the TB research community.
The "MTBC clinical strain reference set" comprises 20 clinical strains covering all 7 known human-adapted MTBC lineages. These strains have been submitted to the Mycobacterial culture bank of the Belgian Coordinated Collections of Microorganism (BCCM/ITM) and will be available for anyone interested in the phenotypic impact of MTBC diversity (http://bccm. belspo.be/).

Strain selection
We based our initial selection of strains to be included in this reference set on phylogenetic trees that were built with a combination of genomes from our collection and other publicly available genomes representing the known global diversity of the human-adapted MTBC [19]. Initially, we picked 43 strains that were intended to represent a diverse sampling of each lineage, comprising several sub-lineages where appropriate. We strove to include strains that represent as much as possible the phylogenetic breadth of each lineage, thus attempting to capture most of the within-lineage diversity The strains had to be free of known drug resistance mutations and carry only genomic deletions that were congruous with their phylogenetic background, without any rare genomic abnormalities. Moreover, we included strains that were already used in experimental work in the past; N0072, N0157, N0031, N0145 and N0155 [4,20]. The remaining strains were selected from the large number of isolates present in the combined collections of the authors. Specifically: N0004, N0054, N0069 and N0136 were contributed by UCSF, University of California; N1268, N1272, N1274 and N1283, were contributed by the Research Center in Borstel, Germany; N1176, N1201, N1202 and N1216 were contributed by the Noguchi Memorial Institute for Medical Research in Accra, Ghana; N0091 was contributed by MRC-Gambia and N3913 in the Victorian Infectious Diseases Reference Laboratory in Melbourne. None of the strains were isolated specifically for this study.

Bacterial culture and DNA extraction
All MTBC isolates included into the "MTBC clinical strain reference set", were processed and derived from single colonies. Strains were grown in 7H9/Tween 0.05% medium (BD) +/-40mM sodium pyruvate. We extracted genomic DNA for whole genome sequencing (WGS) from cultures in the late exponential phase of growth using the CTAB method [21].

Spoligotyping
Spoligotyping was performed according to internationally standardized protocols [22]. We used KvarQ to derive in silico spoligotypes from FASTQ files containing the WGS information [23] when necessary.

Sequence read alignment and variant determination
The obtained FASTQ files were processed with Trimmomatic v 0.33 (SLIDINGWINDOW: 5:20) [24] to clip Illumina adaptors and trim low quality reads. Any reads shorter than 20 bp were excluded for the downstream analysis. Overlapping paired-end reads were then merged with SeqPrep v 1.2 (overlap size = 15) (https://github.com/jstjohn/SeqPrep). We used BWA v 0.7.13 (mem algorithm) [25] to align the resultant reads to the reconstructed ancestral sequence of MTBC obtained in [19]. Duplicated reads were marked by the Mark Duplicates module of Picard v 2.9.1 (https://github.com/broadinstitute/picard) and excluded. The Realigner Target Creator and Indel Realigner modules of GATK v 3.4.0 [26] were used to perform local realignment of reads around indels. To avoid false positive calls Pysam v 0.9.0 (https:// github.com/pysam-developers/pysam) was used to exclude reads with alignment score lower than (0.93 � read_length)-(read_length � 4 � 0.07)), corresponding to more than 7 miss-matches per 100 bp. SNPs were called with Samtools v 1.2 mpileup [27] and VarScan v 2.4.1 [28] using the following thresholds: minimum mapping quality of 20, minimum base quality at a position of 20, minimum read depth at a position of 7-fold and without strand bias. Only SNPs considered to have reached fixation within a patient were considered (at a within-host frequency of �90%). Conversely, when the SNP within-host frequency was �10% the ancestor state was called. Additionally, we excluded genomes with average coverage < 15-fold (after all the referred filtering steps). All SNPs were annotated using snpEff v4.11, in accordance with the M. tuberculosis H37Rv reference annotation (NC000962). SNPs falling in regions such as PPE and PE-PGRS, phages, insertion sequences and in regions with at least 50 bp identities to other regions in the genome were excluded from the analysis as in [29]. Drug resistance-conferring mutations were annotated based on a previously published list [23]. Determination of sublineage was done using the phylogenetic SNPs according to Stucki et al. [29] and to Coll et al. [30].

Detection of genomic duplications
We used the output of Samtools v1.2 mpileup and VarScan 2.4.1 to extract the mapping coverage depth of short sequencing reads (coverage) per genomic position. We split the genome into bins of 500 base pairs and calculated the median coverage for each bin. We then computed the z-score for all the bins across the genome of each strain. Finally, we calculated the median of coverage z-score medians for 80 consecutive bins spanning 40,000 base pairs. We plotted both the z-scores and z-score pool medians and identified large genomic deletions by visual inspection of the resultant coverage plots. Duplicated regions appear as regions with a marked increase in the apparent coverage.

Phylogenetic reconstruction
To put the reference strain set into a phylogenetic context, we combined them into a phylogeny containing publicly available genomes (n = 232) [19]. We used all 16,614 variable positions to infer a Maximum Likelihood phylogeny using the MPI parallel version of RAxML [31]. We used the GTR model as implemented in RAxMLto perform 1,000 rapid bootstrap inferences, followed by a thorough maximum-likelihood search [31]. We show the best-scoring Maximum Likelihood topology. The phylogeny was rooted using Mycobacterium canettii as an out-group.

Selection of the "MTBC clinical strain reference set"
Starting from the 43 candidates, we excluded strains that we could not re-grow in the laboratory and those that had any mutation known to confer drug resistance [32]. We then selected 20 pan-susceptible clinical strains, based on full genome data for inclusion as reference strains ( Table 1). The selected set covers all 7 known phylogenetic lineages of the human-adapted MTBC, but does not include any animal-adapted members of the MTBC. For most lineages, we selected 3 representative strains to maximize the within-and between-lineage diversity ( Fig  1). Given the current limited availability of Lineage 7 strains, we were able to include only a  single representative. In the case of Lineage 2, we chose 4 strains to cover the wide range of genomic deletions found in this lineage [33]. These include N0155, a clinical strain that was used by our group in the past [20], and all the Lineage 2 strains that were transcriptionally profiled by Rose et al. [4]. We also included a "Proto-Beijing" strain (N0031), which belongs to a Lineage 2 clade that is phylogenetically basal compared to all other Lineage 2/Beijing strains. Proto-Beijing strains are also characterized by a deletion of RD105 but no deletion in RD207, which differentiates "proto-Beijing" from the regular "Beijing strains [34]. Moreover, N0031 shows an ancestral spoligotype with all DR spaces present (Table 1). Its basal branching provides an important contrast to the classical Beijing strains which carry deletions in both RD105 and RD207 [34,35]. The Lineage 1 strains N0157 and N0072 have also been transcriptionally profiled [4]. In the case of Lineage 4, we selected representative strains of the "generalist" and "specialist" groups as defined by Stucki et al [29], respectively N0136 and N1216. The Lineage 4 strain N1283 is a representative of the sub-lineage L4.2 [29], which is also referred to as "Ural" based on spoligotyping [36]. The phylogenetic relationships of the 20 reference strains with respect to other representative MTBC strains are shown in Fig 1. Phenotypic resistance to STR was observed in N1274. This strain did not carry any mutation in the most common STR DR associated genes; rpsL or in the region 530_900 of rrs, however, a rare mutation was found in gydB (Rv3919c) at the AA position R137P. The rest of the "MTBC clinical strains reference set" was confirmed to be phenotypically drug susceptible to all main TB drugs.

Genomic characteristics
All annotated SNPs and insertions/deletions (indels) identified by comparison with the reconstructed MTBC ancestor sequence [37] and considered fixed (at frequency of �90%) for each strain are provided as supplementary files (S1 and S2 Tables). The general characteristics of the genome of each strain are presented in Table 2. We were able to observe all the large genomic deletions reported before [33] as gaps in sequencing coverage. For example, all Lineage 2 strains carried the deletion in RD105 and all but the Proto-Beijing strain also had a deletion in RD207. N0145 and N0155 shared the deletion in RD181, while N0145 harboured an additional deletion in RD150. Given several reports in the literature regarding the importance of the duplication of a part of the genome that includes DosR and DosS (Rv3133c and Rv3134c) for MTBC virulence [4], we looked for areas of excessive read coverage within the genomes (S1 Fig). We identified four strains showing evidence of overlapping duplications covering DosR/S-N0031, N0145, N0155 and N1283 (S2 Fig). The first three strains belong to Lineage 2 while N1283 belongs to Lineage 4, corroborating past suggestions of convergent evolution [38,39]. We did not detect any other genomic duplications of a comparable size in the genomes (S1 Fig).

Recommendations for growing and preserving the reference strains
Strains have been deposited in the Belgian Coordinated Collections of Microorganism (BCCM) and can be obtained from the BCCM/ITM: http://bccm.belspo.be/about-us/bccmitm. Upon receipt, we suggest to grow a large culture of each strain in 7H9 (BD) and freeze multiple glycerol stocks for future use to avoid the acquisition of genetic changes due to laboratory adaptation during sequential sub-culturing [17]. Note that some strains require the addition of 40mM sodium pyruvate for optimal growth (Table 1).

Conclusions
For decades, TB research has almost exclusively focused on the two laboratory-adapted MTBC reference strains H37Rv and Erdman. Both strains have provided a common language across TB laboratories allowing knowledge to be built incrementally, with interoperable protocols, results and resources. However, insufficient attention has been given to the fact that both of these strains show patterns of laboratory adaptation and that they do not adequately represent the phylogenetical breadth of the human-adapted MTBC. The new "MTBC clinical reference set" presented here covers much of this diversity and will provide the TB research community the opportunity to go beyond one single strain/lineage. The potential of sharing and integrating the experimental data generated with this strain set will enrich our understanding of the relationship between genotype and phenotype and potentially lead to fundamental new insights into TB biology. The true impact of genetic diversity in MTBC is slowly coming into focus; however there are still considerable gaps in our understanding. For example, it is known that clinical isolates show variations in drug susceptibility, but the basis for this is unclear [40]. Similarly, the association between drug resistance and specific strain backgrounds has been proposed in several studies, but the underlying mechanism remains unknown [9]. Vaccine and diagnostics development are two areas where understanding the impact of genetic diversity could be key to delivering effective products [14]. Similarly, we are only beginning to scratch the surface of the interplay between bacterial and human genetics at the immune interface [41]. These aspects of MTBC physiology deserve further attention especially due to their potential to have real clinical relevance. At a minimum, testing new TB diagnostics, drugs and vaccines against this strain set will help ensure these innovations are broadly effective.