Comparative Whole-Genome Analysis of Clinical Isolates Reveals Characteristic Architecture of Mycobacterium tuberculosis Pangenome

The tubercle complex consists of closely related mycobacterium species which appear to be variants of a single species. Comparative genome analysis of different strains could provide useful clues and insights into the genetic diversity of the species. We integrated genome assemblies of 96 strains from Mycobacterium tuberculosis complex (MTBC), which included 8 Indian clinical isolates sequenced and assembled in this study, to understand its pangenome architecture. We predicted genes for all the 96 strains and clustered their respective CDSs into homologous gene clusters (HGCs) to reveal a hard-core, soft-core and accessory genome component of MTBC. The hard-core (HGCs shared amongst 100% of the strains) was comprised of 2,066 gene clusters whereas the soft-core (HGCs shared amongst at least 95% of the strains) comprised of 3,374 gene clusters. The change in the core and accessory genome components when observed as a function of their size revealed that MTBC has an open pangenome. We identified 74 HGCs that were absent from reference strains H37Rv and H37Ra but were present in most of clinical isolates. We report PCR validation on 9 candidate genes depicting 7 genes completely absent from H37Rv and H37Ra whereas 2 genes shared partial homology with them accounting to probable insertion and deletion events. The pangenome approach is a promising tool for studying strain specific genetic differences occurring within species. We also suggest that since selecting appropriate target genes for typing purposes requires the expected target gene be present in all isolates being typed, therefore estimating the core-component of the species becomes a subject of prime importance.

The tubercle complex consists of closely related mycobacterium species which appear to be variants of a single species. Comparative genome analysis of different strains could provide useful clues and insights into the genetic diversity of the species. We integrated genome assemblies of 96 strains from Mycobacterium tuberculosis complex (MTBC), which included 8 Indian clinical isolates sequenced and assembled in this study, to understand its pangenome architecture. We predicted genes for all the 96 strains and clustered their respective CDSs into homologous gene clusters (HGCs) to reveal a hard-core, soft-core and accessory genome component of MTBC. The hard-core (HGCs shared amongst 100% of the strains) was comprised of 2,066 gene clusters whereas the soft-core (HGCs shared amongst at least 95% of the strains) comprised of 3,374 gene clusters. The change in the core and accessory genome components when observed as a function of their size revealed that MTBC has an open pangenome. We identified 74 HGCs that were absent from reference strains H37Rv and H37Ra but were present in most of clinical isolates. We report PCR validation on 9 candidate genes depicting 7 genes completely absent from H37Rv and H37Ra whereas 2 genes shared partial homology with them accounting to probable insertion and deletion events. The pangenome approach is a promising tool for studying strain specific genetic differences occurring within species. We also suggest that since selecting appropriate target genes for typing purposes requires the expected target gene be present in all isolates being typed, therefore estimating the core-component of the species becomes a subject of prime importance.

Introduction
Mycobacterium tuberculosis complex (MTBC) is the etiologic agent of tuberculosis (TB), one of the largest killer infectious disease causing worldwide morbidity and mortality. Bacterial strains belonging to the same species vary considerably in their genetic repertoire. The complex encompasses a group of closely related gram-positive bacteria including, M. tuberculosis (Mtb) and M. africanum-the typical human pathogens [1,2]; M. bovis-the bovine form [3]; M. canettii-whose actual host is unknown; and several other lineages infecting mammals such as M. microti, M. caprae, M. pinnipedii [4] and M. orygis [5]. The species of the tubercle complex are not merely closely related but appears to be variants of a single species. TB manifests in a variety of clinically distinct and phenotypically diverse infections in humans with a distinctly long sub-clinical latency period [6,7]. Despite numerous improved diagnostic methods, and an effective combination therapy, the global emergence and spread of tuberculosis continues unabated, with an estimated 8.7 million people getting new infections every year [8]. This scenario has been complicated by widespread emergence of multidrug resistant (MDR) (resistance to first-line of Mtb drugs, isoniazid and rifampicin), extensively-drug resistant (XDR) (resistance to any member of the quinolone family and one of the second-line injectable drugs in addition to first-line drugs isoniazid and rifampicin) [9] and lately, totally drug resistant (TDR) (resistance to a wide range of TB drugs) strains [10,11]. The infection also forms a lethal combination when associated with clinical immunosuppressive states, which are either induced due to long-term treatment using immune-suppressants [12] or its synergy with Human Immunodeficiency Virus [13]. Recent studies on global phylo-geography of the pathogen and analysis of its virulence and drug resistance have suggested strain-specific genetic differences and a geographically constrained clonal population structure of MTBC [14,15]. Detailed studies of spoligotypes and associations with drug resistance phenotypes have been extensively reported in geographically limited isolates [16][17][18]. MTBC is also known to have a large arsenal of genomic features relating to the virulence, including its ability to persist and replicate in extremely hostile intracellular conditions and establish long-term infections in the host [7,19].
The availability of the complete genome sequence of Mtb laboratory strain H37Rv in 1998 [2], provided useful insights into the biology of the pathogen. The reference H37Rv genome, encodes for a repertoire of close to 4,000 protein-coding genes [2]. Previous comparative genome analysis of different strains has shown that such an approach could provide useful clues and insights into the genetic diversity of the species [20]. The availability of high throughput and cost effective nucleic acid sequencing technologies has opened up new possibilities towards understanding genomic diversity spanning a large number of members of the same species and has revolutionized the in-depth analyses of complex microbial populations [21][22][23]. Previous studies from other bacterial genera highlighted significant differences arising in the genomes of laboratory strains and clinical isolates [24]. Recent studies involving re-sequencing of a large number of clinical isolates have also provided key insights into the genomic variability of the pathogen. The study revealed that the variability among MTBC strains is more widespread than was initially anticipated [25]. Nevertheless, extensive characterization of the genomic structure and diversity in terms of gene repertoires of laboratory and clinical isolates has not been studied in detail. The recent availability of complete and near-complete sequences of clinical isolates of MTBC provides an interesting and novel opportunity to address the question of conservation and variation to understand the pangenome of the species. A pangenome is a union of entire genetic pool of several strains of a species under comparison, essentially consisting of a core genome containing genes and sequences shared in all strains and an accessory genome comprising of genes and sequences, which may be absent from one or more strains and genes that are unique to each strain [26].
In the present study, we elucidate the pangenome structure of M. tuberculosis complex using 96 strains, which comprise both complete as well as draft genomes including eight genomes of clinical isolates from India sequenced and assembled as part of this study. The data pool had MDR and XDR strains as well. We predicted genes for all the 96 strains and clustered their respective CDSs to reveal a hard-core, soft-core and accessory genome component of MTBC. Additionally, we also identified a subset of 74 gene clusters from the accessory component of MTBC, which was present in at-least a third of clinical isolates (i.e. >1/3 rd of total dataset), but absent from the laboratory maintained reference strains (Mtb H37Ra and H37Rv). The study demonstrates the power of genomics based approach enabling close assessment and detection of conserved gene sets within closely related strains of MTBC.

Results
We analyzed a total of 96 strains in the present study to describe the pangenome of MTBC. Briefly, the data pool comprised of 25 complete reference genomes and 71 draft assemblies. The draft assemblies comprised of eight Indian strains derived from the 'Open Source Drug Discovery (OSDD) open access repository' and were sequenced and assembled in this study ( Table 1). The sequencing and assembly protocols of the eight OSDD isolates has been described in detail in methods section. The 88 strains derived from public domain encompassed diverse geographical locations around the globe and were sequenced and assembled using different sequencing techniques and methodologies. Reference guided genome assemblies were excluded from our analysis as estimating the size of bacterial pangenome requires de-novo assemblies of the genome in complete or partial genomes with large contigs. A detailed account of the dataset including full organism name, their status of completeness, sequencing platform and coverage is presented in S1 Table. Strain characterization

Genome sequencing and assembly
Estimating the size of the bacterial pangenome requires de-novo assemblies of the genome into complete or partial genomes with large contigs. De-novo genome assembly also provides an opportunity to reveal differences in newly sequenced strains with respect to reference genomes. A recent report [27] suggests that the majority of genes in the prokaryotic genomes can be easily re-constructed with short read sequencing technologies. The 96 MTBC genome sequences ( The 101bp paired-end raw sequence reads were de novo assembled at different k-mers as detailed in materials and methods. Majority of the samples had the best assembly at k-mer value of greater than 50 with the highest N50 values ranging from 14Kbps-28Kbps. The entire assembly resulted in draft genomes of over 4.3Mbps. The genome assembly statistics for the strains are summarized in Table 2. The finest contig assemblies generated from the de novo assembly are available at http://genome.igib.res.in/Mtb_Pangenome.

Pangenome estimation and analysis
Gene prediction and clustering. Whole genome sequences (WGS) of the complete MTBC genomes and scaffold/contig assemblies of draft genomes were used to predict their coding sequences (CDS) / Open Reading Frames (ORFs). Standard bacteria/archaea translation codes were used for CDS prediction. An account of the number of CDS predicted along for each of the 96 strains is provided in Table 1. The predicted gene pool of the 96 strains was used to estimate the pangenome of Mtb. The gene repertoire of each of the 96 Mtb isolates comprises of 4,162 ± 119 genes (mean ± standard deviation). For some genomes the number of predicted genes could be an over-estimation because of the draft nature of the genomes and lower quality scaffolds with stretches of N's. The protein translations of CDSs predicted from the complete genomes and scaffold assemblies of draft genomes were clustered based on sequence identity using CD-HIT as detailed in the methods section. The clustering approach resulted in a total number of 8,099 Homolog Gene Clusters (HGCs), which represents the MTBC "pangenome". The clusters are referred to as HGCs because a given cluster can have paralog sequences as well.
Hard-core versus Soft-core. Genes shared amongst all strains of a species under study is referred to as the 'hard-core' genome. A 'soft-core' genome of a species is the number of genes shared amongst a large proportion of genomes under study, say >90% of the strains under consideration. The soft-core could perhaps serve to be more biologically relevant in view of the large number of draft quality genomes used in the analysis. The number of HGCs shared amongst all 96 MTBC strains i.e. the 'hard-core' comprised of 2,066 HGCs and remaining 6,033 HGCs formed the accessory gene pool. In order to choose an optimum threshold for defining the 'soft-core', we checked out for the presence of essential genes in the core component. Genes which are critical for the growth and survival of an organism are considered as essential genes for that organism. Three independent studies differing in the mycobacterium growth conditions have been performed in the past to define the essential genes of Mycobacterium tuberculosis H37Rv [28][29][30] and have been catalogued in the Database of Essential Genes (DEG) [31].
We performed a BLAST search of all 8,099 representative cluster sequences against the essential genes of Mtb in the DEG database and identified 2,149 matches. Since the BLAST search retrieved multiple and redundant matches we selected only the ones present in all the three studies. There were 348 overlapping genes from the three studies and 597 sequences from our data had homology with these 348 genes. An optimal 'soft-core' was defined at 95% where maximum number of our essential gene matches was falling into the core component (S2 Fig  Thus in our study, the 'soft-core' component of MTBC is defined as clusters present in at least 95% of the strains. The soft-core of MTBC comprises of 3,374 HGCs and 4,725 HGCs are present in the accessory component. The global phylogeny of MTBC. We investigated the genetic diversity within MTBC strains by inferring phylogeny based on genes. In contrast to the single gene based trees which have low inter-species discriminatory power, multi-gene approaches offer more robust phylogenetic trees [32,33]. In order to infer phylogeny of MTBC strains, we used the hard-core component i.e. genes shared amongst all the isolates under consideration. From the 2,066 hardcore, we removed the HGCs having paralogs thus leaving 971 orthologous gene clusters (OGCs) in the hard-core. CDS translations of the 971 OGCs from all 96 strains were subjected to multiple sequence alignment followed by bootstrapping, generating a total of 100 resamples. Pairwise differences amongst sequences were estimated and the resultant distance matrix was used to build the tree. The original and the resampled trees were compared and the resultant tree referred here as the core-gene tree, is presented in Fig 1 and  The resultant tree of different species of MTBC is in concordance with the previously published reports where M. canettii formed the most ancient lineage of the MTBC [4] and in our results also, all strains of M. canettii formed a distinct clade. M. africanum was found to be more closely related to animal adapted strains of MTBC depicting similar evolutionary history as of M. bovis and M. orygis which has been reported earlier as well [34]. Except Mtb OSDD472, all other Indian clinical isolates sequenced in this study (orange colored in Fig 1) were all found to be clustering in a common clade and appeared to be very closely related. The Mtb reference strains H37Rv and H37Ra also clustered distinctly, depicting their close association. We found that the phylogenomic relationship between human and animal adapted strains and species identified by using the core-gene component was largely consistent with previous reports [4,34,35] and additionally such a multi-gene approach served to better resolve strain specific genomic differences.
Core and accessory genome size evolution. In view of defining the size of MTBC pangenome the primary question that arises is whether sufficient number of genomes has been sequenced to describe the core and accessory gene content of the species. For this we observed Pangenome of Mycobacterium tuberculosis the change in the core and accessory gene component as a function of their size with increasing number of sampled genomes over the entire 96 genomes (Fig 2). The total genome component of the 96 MTBC strains was analysed to study the core and accessory genome size evolution in terms of exponential decay and growth models. The models are based on the median values of the conserved and accessory genome HGCs which in turn are obtained from the random permutations of genome comparisons and limiting the number of possible combinations to 100, for each new genome being added. The exponential decay model in Fig 2A suggests that the number of core HGCs tends to approach a plateau near 2,000 HGCs whereas the accessory HGCs tends to reach a plateau near 6,000 HGCs (Fig 2B) for the 96 strains under comparison. Since there is no distinctly sharp plateau formation, we estimate that the MTBC has an open pangenome i.e. the number of distinct genes found in MTBC strains is infinite as opposed to finite number of genes in a closed pangenome.
Annotation distribution of pangenome of MTBC. Each of the HGCs has a representative sequence which is the parent sequence of every cluster. All HGCs (i.e. 8,099) were annotated by querying their corresponding representative sequences against BLAST2GO as described in methods section. The overall annotation distribution (Fig 3) obtained from BLAST2GO showed that out of the 8,099 protein sequences, 47.77% (3,869) sequences were fully annotated with GO slim terms. 26.63% (2,157) sequences were without any BLAST hits (i.e. the sequence had absolutely no homology to any of the sequences present in the NCBI databases). Based on the results of BLAST hits obtained, the gene ontology mapping process retrieved GO terms distributed in BLAST matches. 23.43% (1,898) sequences failed to retrieve any GO terms associated with them.
Analysis of the soft-core component of the pangenome. In a total of 3,374 soft-core sequences, 2,921 sequences had GO terms associated with them (i.e. fully annotated) and the remaining ones had one or the other result missing (as in Fig 3 pie). The soft-core encompassed majority of the essential genes of Mtb as described above. The major COG (Clusters of Orthologous Groups) classes represented in the essential genes were E (amino acid transport and metabolism), J (translation, ribosomal structure and biogenesis), H (coenzyme transport and metabolism), R (general function), C (energy production and conversion), and M (cell wall or membrane or envelope biogenesis). A detailed account of the spread of the COG categories of essential genes is in S4 Fig. The three main Gene ontology terms under which all genes and gene products are represented include molecular function, cellular component and biological process. The 3,374 softcore HGCs were annotated with a total of 14,297 GO terms out of which 93 were unique GO categories. Each sequence can have multiple GO terms associated with it, therefore representing redundant terms. A total of 5,023 sequences were spread over 41 GO terms of molecular function category and a distribution of these terms in the soft-core component is presented in  A closer look into unique genes of all other genomes suggested that there could be multiple factors that can account for this intra-and inter-species variability such as genome size, nature of genomes (draft, complete or finished), size of contigs, genome scaffolds generated by introduction of N's, sequence coverage, etc (S1 Table) and at present it is difficult to generalize or streamline the basis of this variability and also to differentiate true genes from artefacts.
An overlap of the number of accessory clusters shared within a strain pair is presented in Fig 6. The diagonal shows the total number of accessory clusters present in any strain being considered. The highest number of accessory clusters are present in the strain Mtb T17 (897) and the minimum number is present in Mbov BCG Korea (364). The upper and lower triangles are mirror images and show the number of overlapping clusters in each strain pair. The minimum number of overlapping clusters is 109 between Mcan CIPT 140070008 and Mbov BCG Phipps and maximum is 521, between Mbov BCG Sweden and Mbov BCG prague. The overlapping clusters formed a symmetrical matrix i.e. a square matrix which is equivalent to its transpose.
Clusters of genes absent in the H37Rv and H37Ra genome. We further analyzed the differences in the gene pool of clinical isolates (from across the globe) and laboratory maintained  strains by looking out for genes that were conspicuously missing in the genome assemblies of the laboratory strains, H37Rv and H37Ra, but were present in at-least or greater than 1/3 rd of the clinical strains. We obtained 74 clusters meeting this criterion from the accessory genome of 4,725 HGCs. A cross-validation of the 74 clusters against whole genome sequences of H37Rv and H37Ra was done using BLAT [37]. We identified 10 clusters which had absolutely no match with H37Ra and H37Rv genomes, whereas the other 64 shared partial homology with H37Rv and H37Ra in variable regions of the genomes with a sequence length difference in the alignment. Fig 7 shows the 74 clusters absent from H37Ra and H37Rv (first 4 columns in Fig 7) but present in at-least 1/3 rd of the clinical isolates along with their annotations. Experimental validation of the genes in the pangenome specific to clinical isolates. We randomly picked 9 candidate HGCs sequences (Fig 7, genes 1-9) from the 74 clusters which were present in most of the clinical isolates and employed a classical PCR approach to experimentally validate these putative genic loci. PCR validation was performed on genomic DNA of the eight Indian OSDD strains sequenced in this study and laboratory strains Mtb ATCC H37Ra and Mtb ATCC H37Rv procured from ATCC. We designed primers for the 8 available OSDD strains using primer BLAST [37]. A detailed account of the forward and reverse primers used for the validation and their expected product sizes are summarized in S2 Table. All primers provided amplification of genomic DNA in expected sizes, substantiating and validating the presence of these genes in the strains under consideration. The PCR result for these 9 gene sequences are shown in Fig 8. Out of the 9 genes which were PCR validated, 7 genes were absolutely absent from H37Rv and H37Ra whereas 2 genes (Gene 4 and gene 8) displayed appearance of differentially sized gene products (these 2 genes shared partial homology with H37Ra and H37Rv).
The PCR amplification of Gene 4 showed presence of differentially sized products in Mtb ATCC H37Rv and Mtb ATCC H37Ra genomes (Fig 8A). In order to reveal the nature of these differentially sized products we used the predicted ORF sequence of Gene 4 (annotated as hypothetical protein) to align against the reference whole genomes of Mtb ATCC H37Rv and Mtb ATCC H37Ra using BLAST [38] (Fig 8B). The sequence alignment of Gene 4 (Texts C  and D in S1 File) on the negative strand of Mtb ATCC H37Rv and Mtb ATCC H37Ra showed an insertion sequence of 175bp in Gene 4 of OSDD strains in comparison to the reference genomes. Detailed insights into the genomic loci of Gene 4 alignment with reference strains showed partial overlapping of Gene 4 with hypothetical proteins in Mtb ATCC H37Rv and Mtb ATCC H37Ra thereby resulting in the formation of differentially sized gene products (Figures A and B in S1 File). The PCR results of Gene 8 also showed differentially sized products in Mtb ATCC H37Rv and Mtb ATCC H37Ra genomes (Fig 8C). The alignment of the predicted ORF sequence of Gene 8 (no annotation) against the reference whole genomes of Mtb ATCC H37Rv and Mtb ATCC H37Ra (Texts C and D in S2 File) showed a deletion of 1,352bp sequence in Mtb OSDD strains. The genomic loci where Gene 8 aligns, majorly consists of repeat region and IS6110 transposase subunits in Mtb ATCC H37Rv and Mtb ATCC H37Ra ( Figures  A and B in S2 File). The corresponding insertion and deletion sequences of the putative ORFs of Gene 4 and Gene 8 respectively were confirmed by di-deoxy sequencing of their PCR products in Mtb OSDD472, Mtb ATCC H37Rv and Mtb ATCC H37Ra (Texts A and B in S1 and S2 Files).
The PCR results of Gene 4 and Gene 8 highlights the active role of genome re-arrangements taking place in the Mtb genomes and how the pathogen evolves by means of structural variations. Predicted functional annotation of the 9 candidate genes (Table 3) showed that 'Gene 9' is a NADP-dependent oxido-reductase. Gene 1, 2, 4 and 5 are hypothetical proteins whereas Gene 8 has no annotation associated with it. Genes 3 and 7 are involved in transcriptional and regulatory activity respectively and Gene 6 is a transposase involved in DNA binding.

Discussion
The close association of different species of the MTBC complex presents a challenging issue in understanding the genetic pool of the species. The difficulty is further enhanced by the fact that bacteria exchange genetic material in unique and unusual ways making the assessment of their core and accessory gene repertoire evolution even more complex. The study of genetic variability within natural population of pathogens may provide insight into their evolution and pathogenesis. Comparative genome analysis of the species in terms of variation in the distribution of SNPs, Indels, CRISPR-cas locus and nucleotide diversity has shown that their evolution is a complex process [35]. Also high level of sequence variation in the PE/PPE genes has showed absence of selective constraints and thereby suggesting neutral evolution [39]. The laboratory strain, M. tuberculosis H37Rv, has long been considered as the representative reference template for tuberculosis and has been extensively used to uncover potential drug targets [40,41]. However, selecting appropriate target gene/s for typing purposes requires that the expected target gene should be present in all isolates being typed. Therefore, estimating the corecomponent of the species becomes a subject of prime importance. Over the course of evolution genomes can undergo many large and small scale changes. For instance, recombination causes genome rearrangements, horizontal transfer introduces new sequences, and deletions remove segments of the genome [42,43]. Structural changes in the genome of closely related species have been reported in literature [44,45]. Such genome structural changes could also have phenotypic correlates [46]. Earlier studies have shown the emergence of significant genetic differences in genomes of laboratory strains and clinical isolates [24,25]. The newly emerging strains of a pathogen may have genomic differences, which cannot be addressed adequately using a sole reference genome. Over the past few years, WGS has widely been adopted for identifying genome-wide differences between related species [44,45], and significant work has been done for establishing WGS as a standard typing tool [47]. The current work not only complements previously done studies [35,39,48] but also provides an extensive characterization of the genomic structure of a large number of Mtb isolates from across the globe and highlights the rich diversity present in their genetic catalogue. Though relatively low sequencing costs have provided useful means to re-sequence genomes and simultaneously perform large scale comparative studies on pathogenic isolates [21,49] but the quality of the sequenced genomes remains a major concern. There is always a possibility of accidental incorporation of sequencing errors due to differences in sequencing protocols as well as genome assembly methods. Thus owing to the large number of draft quality genomes used in this work, we describe both hard-core and soft-core genome component of MTBC.
Understanding the differences existing within MTBC and evolutionary forces shaping these differences is not only crucial for uncovering its success as a pathogen but it is also expected to reveal key virulence mechanisms if any [50]. In the present study, we used a pangenome approach to identify a core set of genes which is shared by 96 strains of MTBC, and included most of the key essential genes of the bacterium. Our study comprised of diverse strains of MTBC from across the globe which included 10 MDR and 6 XDR strains and 8 Indian clinical isolates sequenced and assembled in this study. Global phylogeny based on a multi-gene approach involving 971 genes revealed the evolutionary relatedness of the 96 strains. The multigene tree was able to clearly differentiate between human-and animal-adapted strains and also phylo-geographically separated strains such as the newly sequenced 8 OSDD strains which formed a distinct clade and the laboratory strains clustering together showing their close association.
The MTBC pangenome comprise of 8,099 HGCs of which 2,066 gene clusters were present in the hard-core (100%) and 3,374 HGCs in the soft-core genome which was shared amongst 95% of the total strains analyzed. The variable component of the pangenome comprised of 4,725 gene clusters. Previous studies performed by Trost et al and Zakham et al on fourteen and twenty one mycobacterial genomes respectively have also shown that the mycobacteria has a fairly small core genome as compared to the accessory genome component [48,51]. From the accessory genome, a subset of 74 clusters was identified which was not represented in the reference laboratory templates, Mtb ATCC H37Rv and Mtb ATCC H37Ra, but were present in most of the clinical isolates being analyzed. We confirmed the complete absence of 7 candidate gene sequences from the laboratory templates using PCR and 2 genes sharing partial homology and having differentially sized gene products suggesting probable genome re-arrangement events of insertion and deletion. We argue that using a sole reference genome based approach could have severe limitations with under-representation of the actual repertoire of potential drug targets.
Our data suggest that a pangenome based approach applied to whole genome sequences can not only identify conserved genome regions but also detect rich information on genomic differences existing within strains of same species as has also been shown from previous comparative genomic studies on Mtb [35,48,51]. In a recent study by Liu et al, a comparative analysis of 7 newly sequenced clinical isolates and 7 complete reference genomes of Mtb showed that genomic variations might play an important role in the genomic plasticity of Mtb [35]. Such comparative genomics approaches are a promising tool for studying strain specific genetic differences occurring within species. With the availability of cheaper and high-throughput DNA sequencing techniques, there are a large number of MTBC strains being sequenced and raw sequencing reads being deposited in SRA [52]. Therefore, we hope that we will be able to extend these analyses encompassing a much larger dataset with more representative sequences corresponding to various geographical regions and strain types to elucidate the geographical signatures of pangenome components. We propose that a pangenome approach could identify novel and functionally relevant subsets of gene clusters with distinct functional characteristics and could provide a starting point towards re-looking at standard drug target discovery pipelines, which currently rely heavily on reference or template genomes.

Materials and Methods
The Strains and DNA isolation A total of 96 strains from the Mtb complex were used in this study to construct and analyse the Pangenome architecture of MTBC. Out of these 96 genomes (data acquisition: 30-December-2013), 25 strains comprised of the whole genome sequences of MTBC and were obtained from the Refseq database of NCBI [53], 63 were draft genome assemblies present in the form of Scaffolds and/or contigs in NCBI whereas the remaining 8 strains were clinical isolates from India ( Table 1). The DNA of the 8 strains was procured from the Open Source Drug Discovery Open Access Repository and corresponds to strain IDs OSDD472, OSDD326, OSDD071, OSDD504, OSDD630, OSDD386, OSDD487 and OSDD518. The strains were originally isolated at the National JALMA Institute of Leprosy and other Mycobacterial Diseases, Agra, India. Genomic DNA was isolated using the Mycobacterium tuberculosis genomic DNA isolation kit following manufacturer's instruction (Premas Biotech, India, Cat No: PB-MTB-0200). Briefly, one ml of culture was centrifuged at 6,000 rpm for one min and supernatant was discarded. The pellet was re-suspended in 200 ul of WB1 buffer and incubated at 100°C for 5 min. The solution was centrifuged at 8,000 rpm for one min at room temperature and the supernatant was discarded. To the pellet, 250 ul of LB1 buffer was added and vortexed for 20 sec followed by 5 min incubation at room temperature. Buffer LB2 (250 ul) was added followed by 20ul of Proteinase K solution (20mg/ml). The solution was mixed and incubated at 65°C for 20 min with intermittent vortex. To this 10ul of LB3 buffer was added and mixed, followed by incubation at 65°C for 10 min. The solution was then centrifuged at 12,000 rpm for 1 min and the resultant supernatant transferred to a fresh tube. To this an equal volume of Phenol:Chloroform was added and mixed followed by centrifugation at 12,000 rpm for 5 min. Carefully the aqueous phase was transferred to a fresh tube and 800ul of chilled WB2 buffer was added followed by mixing. The solution was centrifuged at 12,000 rpm for 20 min and the supernatant was discarded. To this, 400ul of chilled WB3 buffer was added and centrifuged at 12,000 rpm for 1 min. The supernatant was discarded and the pellet was air-dried. Finally 100ul of RE buffer was added and the DNA solution was stored at 2-8°C.

Genome Sequencing, Spoligotyping and Drug sensitivity profiling
Briefly, 101 base paired-end DNA libraries were generated using Illumina paired-end sample preparation kit following standard manufacturer protocol [54]. 5 μg of genomic DNA (gDNA) was nebulized to generate double stranded DNA fragments of size less than 800 bp. The sticky ends of the fragmented DNA were converted to blunt ends using T4 DNA polymerase and Klenow enzyme and a single "A" base was added to 3 0 end using polymerase activity of Klenow fragments (3 0 to 5 0 exo minus). To this "A" overhang, manufacturer specified paired end adapter with a single "T" overhang was ligated. Of these ligated products, *350 bp fragment was selected from a 2% agarose gel and selectively enriched by PCR using adapter specific primers. A minimum base quality of 30 was used for filtering the data. The DNA libraries were indexed and sequenced using the Illumina Genome Analyzer IIx platform following standard protocols provided by the manufacturer. The reads of newly sequenced genomes have been deposited in NCBI Sequence Read Archive (SRA) (http://www.ncbi.nlm.nih.gov/sra) with accessions SRR784917, SRR786188, SRR786373, SRR786397, SRR786667, SRR786668, SRR786669 and SRR786670 [52]. Spoligotyping and drug sensitivity profiling were performed using standard protocol and spoligotyping kit from Ocimum Biosolutions Ltd. by Premas Biotech, India.

De-novo Assembly
De-novo assemblies were performed for the eight Indian OSDD isolates sequenced in the study. Raw paired-end reads obtained after sequencing were de novo assembled using Velvet software [55] using different hash lengths (from 11 to 63). A 10 fold coverage cut-off was used for each hash length, and the best assembly for each of the strain was selected based on the highest N50 values (N50 denotes the weighted median statistic meaning that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value) and also the total assembly size. It was observed that assemblies with higher N50 values and assembly size tend to produce a decent and reliable assembled genome. Only contigs greater than 300bp were included in the final assembly output for further analysis. The contig assemblies are available at http://genome.igib.res.in/Mtb_Pangenome.

Gene Prediction and Translation
Gene prediction and their corresponding protein translations were performed for all the 96 Mtb strains using Prodigal (version 2.5) which has improved translation initiation site recognition capability and produces less number of false positives [56]. Mtb gene prediction by Prodigal has been reported to be better than or equivalent to commonly used gene prediction algorithms [56]. WGS of the reference MTBC genomes and scaffolds/contigs of draft genomes were used to predict their CDS. We used Prodigal in normal mode with default parameters. Genome sequences of each strain was provided as input and gene predictions along with its CDS translations were obtained as output using standard command line options. We didn't allow the predicted genes to run off the edges of the contigs. The raw gene predictions are available at http://genome.igib.res.in/Mtb_Pangenome/Pred_genes.

Homologous gene clustering
We analysed the genomes using homologous clustering approach. Clustering tasks were performed with the tool CD-HIT [57], using an empirical threshold of 70% global sequence identity which is calculated as number of identical amino acids in alignment divided by full length of the shorter sequence. Orthologous clusters were determined by selecting HGCs which had only one sequence from each strain. If a HGC had more than one sequence from same strain it was considered a paralog cluster.

Core-gene tree
A core-gene tree was calculated for all the 96 MTBC strains using the 100% core (i.e. hardcore) protein sequences using the approach followed by Kaas et al [47]. Briefly, in order to create the core-gene tree, we used only orthologous clusters. A multiple sequence alignment of the protein sequences for each cluster was done using MUSCLE version 3.8.31 [58]. The alignments were concatenated using FASconCAT [59]. Bootstrapping was done to generate 500 resamples using Seqboot version 3.695 [60]. Distance matrices were calculated for both original alignment as well as bootstrapped alignment using protdist version 3.695 [60]. Trees for both the original and bootstrapped matrices were determined by FastME available from NCBI [61]. FastME is based on the minimum evolution method and has been shown to have better accuracy than other commonly used methods such as the Neighbor joining (NJ) [61]. A comparative study of trees obtained from original alignment and resampled alignment was done by Com-pareToBootstrap [62]. The final un-rooted core-gene tree was visualized in FigTree (http:// tree.bio.ed.ac.uk/software/figtree/).

Evolution of Core and accessory genome size
We also estimated the change in the sizes of core and accessory gene content of MTBC each time a new strain is added to the analysis. For this we developed exponential decay (for core component) and exponential growth (for accessory component) models using MATLAB (version R2009b). We first generated random combinations (limiting it to 100, to avoid the combinatorial explosion) of genomes being sampled each time (i.e. 1, 2, 3, 4, . . ...96) using a bespoke Perl script followed by counting the presence or absence of genes in these combinations. Since the aim was to observe the change in size of the core and accessory genome with each new genome added to the analysis and to decide the saturation levels, we plotted the exponential models of the median values of the counts obtained as has been reported previously also [63].

Pangenome estimation and annotation
Core and Pangenome estimation was performed by clustering all the translated CDS sequences of the 96 Mtb strains. The representative sequences of all clusters were then subjected to annotation using BLAST2GO [64]. The BLAST2GO basically performs annotation in three main steps: Firstly, it performs BLAST searches to find similar sequences for the input dataset using NCBI BLAST; Secondly, the program extracts the GO terms associated to each of the obtained BLAST hits using BLAST hit accessions using four different mappings and at last, returns an evaluated GO annotation for the query sequence(s) using an annotation rule [64]. Data mining on the results were carried out using custom Perl scripts. Data was parsed so as to reveal core and accessory genome clusters, clusters unique to each strain, clusters shared among a given strain pair, clusters of genes (pertaining to their absence in laboratory strains, hence called novel) present in greater than 1/3 rd of the clinical isolates and absent from the laboratory strains.

Gene sequences specific to clinical strains
From the accessory gene clusters, we filtered clusters which were absent in the laboratory strains, Mtb H37Rv and Mtb H37Ra, but were present in more than 1/3 rd of the clinical strains being considered. The clusters were further matched against the laboratory whole genome sequences using BLAT [36], so as to reveal the level of similarity if any, in the laboratory strains.

Validation of genes by PCR
Some of the genes identified from the newly sequenced strains were validated using PCR amplification followed by di-deoxy chain termination sequencing. List of validated genes, primer pairs and annealing temperature used for PCR has been summarized in S2 Table. PCR primers were designed using the NCBI primer BLAST [37]. Apart from 8 clinical isolates obtained from OSDD repository we have procured the Mycobacterium tuberculosis reference strains (H37RV-ATCC 25618D-5 and H37Ra-ATCC 25177D-5) from ATCC for the PCR validation. PCR mix constituted by 5 μl of 10X Taq buffer (Fermentas), 3 μl of 25mM MgCl2 (Fermentas), 2μl 10mM dNTP (Fermentas), 1 μl of respective forward and reverse primers (10mM), 36μl nuclease free water (Ambion) and 5 units Taq (Fermentas) along with 1 μl of respective sample DNA. Thermal cycling was performed by a PTC 200 thermal cycler (Biorad) with an initial denaturation by 94°for 5 minutes followed by 30 cycles of 94°for 45 seconds, respective annealing temperature as per S2 Table for 45 seconds, 72°for 45 seconds and a final extension of 72°for 10 minutes. The resultant PCR product was checked in 2% agarose 1X TAE gel and thereafter performed di-deoxy chain termination sequencing as originally described by Sanger et al [65]. Essential gene based criteria for setting threshold for defining soft-core. We identified the number of overlapping essential genes by merging the three independent studies done on Mycobacterium tuberculosis H37Rv (Mtb). The study resulted in 348 genes shared amongst the three studies. After performing a BLAST search, 2,149 sequences out of the total 8,099 representative sequences matched with essential genes of Mtb. Out of the 2,149 BLAST hits 597 genes matched with the 348 overlapping genes. To set an optimal threshold for defining the soft-core we looked out for maximum number of essential genes matching with the core. The most optimal cut-off was found to be at 95% after which there was no significant rise in the number of essential genes falling in the core component and also assuming that all the 348 genes will be present at least once in the 95% core (i.e. 485 genes shared with BLAST hits).  replication, recombination and repair. PLoS One 6: e16020. doi: 10.1371/journal.pone.0016020 PMID: