Significance of the Identification in the Horn of Africa of an Exceptionally Deep Branching Mycobacterium tuberculosis Clade

Molecular and phylogeographic studies have led to the definition within the Mycobacterium tuberculosis complex (MTBC) of a number of geotypes and ecotypes showing a preferential geographic location or host preference. The MTBC is thought to have emerged in Africa, most likely the Horn of Africa, and to have spread worldwide with human migrations. Under this assumption, there is a possibility that unknown deep branching lineages are present in this region. We genotyped by spoligotyping and multiple locus variable number of tandem repeats (VNTR) analysis (MLVA) 435 MTBC isolates recovered from patients. Four hundred and eleven isolates were collected in the Republic of Djibouti over a 12 year period, with the other 24 isolates originating from neighbouring countries. All major M. tuberculosis lineages were identified, with only two M. africanum and one M. bovis isolates. Upon comparison with typing data of worldwide origin we observed that several isolates showed clustering characteristics compatible with new deep branching. Whole genome sequencing (WGS) of seven isolates and comparison with available WGS data from 38 genomes distributed in the different lineages confirms the identification of ancestral nodes for several clades and most importantly of one new lineage, here referred to as lineage 7. Investigation of specific deletions confirms the novelty of this lineage, and analysis of its precise phylogenetic position indicates that the other three superlineages constituting the MTBC emerged independently but within a relatively short timeframe from the Horn of Africa. The availability of such strains compared to the predominant lineages and sharing very ancient ancestry will open new avenues for identifying some of the genetic factors responsible for the success of the modern lineages. Additional deep branching lineages may be readily and efficiently identified by large-scale MLVA screening of isolates from sub-Saharan African countries followed by WGS analysis of a few selected isolates.

1 Univ Paris-Sud, Institut de Génétique et Microbiologie, UMR 8621, Orsay, France, 2 CNRS, Orsay, France, 3 Laboratoire de biologie clinique, hô pital d'instruction des armées Percy, Clamart, France, 4 Hô pital militaire Bouffard, Djibouti, Djibouti, 5 Hô pital d'instruction des armées Alphonse Laveran, Marseille, France, 6 Department of Veterinary Tropical Diseases, Faculty of Veterinary Science, University of Pretoria, Onderstepoort, Pretoria, South Africa, 7 DGA/MRIS-Mission pour la Recherche et l'Innovation Scientifique, Bagneux, France Abstract Molecular and phylogeographic studies have led to the definition within the Mycobacterium tuberculosis complex (MTBC) of a number of geotypes and ecotypes showing a preferential geographic location or host preference. The MTBC is thought to have emerged in Africa, most likely the Horn of Africa, and to have spread worldwide with human migrations. Under this assumption, there is a possibility that unknown deep branching lineages are present in this region. We genotyped by spoligotyping and multiple locus variable number of tandem repeats (VNTR) analysis (MLVA) 435 MTBC isolates recovered from patients. Four hundred and eleven isolates were collected in the Republic of Djibouti over a 12 year period, with the other 24 isolates originating from neighbouring countries. All major M. tuberculosis lineages were identified, with only two M. africanum and one M. bovis isolates. Upon comparison with typing data of worldwide origin we observed that several isolates showed clustering characteristics compatible with new deep branching. Whole genome sequencing (WGS) of seven isolates and comparison with available WGS data from 38 genomes distributed in the different lineages confirms the identification of ancestral nodes for several clades and most importantly of one new lineage, here referred to as lineage 7. Investigation of specific deletions confirms the novelty of this lineage, and analysis of its precise phylogenetic position indicates that the other three superlineages constituting the MTBC emerged independently but within a relatively short timeframe from the Horn of Africa. The availability of such strains compared to the predominant lineages and sharing very ancient ancestry will open new avenues for identifying some of the genetic factors responsible for the success of the modern lineages. Additional deep branching lineages may be readily and efficiently identified by large-scale MLVA screening of isolates from sub-Saharan African countries followed by WGS analysis of a few selected isolates.

Introduction
Tuberculosis (TB) has occurred in humans for at least several thousand years. Archaeological findings from a number of Neolithic sites in Europe, pre-Columbian sites in South America and sites from ancient Egypt to the Greek and Roman empires showed typical skeletal changes associated with tuberculosis, at least some of which could be confirmed with ancient DNA analysis [1,2,3,4,5,6,7,8,9]. The progenitor species of the M. tuberculosis complex (MTBC) is unknown, as, in contrast with other pathogenic mycobacteria, no environmental source has been found so far for this human pathogen. One model is that all the members of the MTBC derived from an ancestor which might be related to ''Mycobacterium canettii'', the smooth group of M. tuberculosis (SmTB). The first ''M. canettii'' strain was described forty years ago, and its relative position with respect to the MTBC is not yet fully established [10,11,12,13,14,15,16]. These isolates are exceptional in that their evolution is not clonal and they undergo recombination with unknown mycobacteria. Their morphotype (smooth colonies) is also very different and was the basis for their initial identification [17].
In contrast, the MTBC sensu stricto (i.e. not including ''M. canettii'') is highly clonal. One hypothesis is that the apparent switch from non-clonality to clonality resulted from the emergence of the obligate pathogen with a restricted ecological niche including humans and some animal species [13,17,18] and limited opportunities for detectable recombination events [19]. During the past 20 years, and owing to the advent of molecular tools, some of which can be applied to routinely type thousands of isolates, our understanding of the population structure of the MTBC has increased tremendously. Different complementary methods have contributed to this knowledge: spoligotyping [20], IS6110 typing [21], Large-scale polymorphism (LSP) [10], variable number of tandem repeats (VNTR) typing [22], SNP analysis [23], partial or whole genome sequencing of selected representative strains [14,24,25]. Six main deep branching human-adapted lineages have been defined so far on the basis of canonical deletions and point mutations [15,26,27] or characteristic spoligotype signatures [18,25,28] and confirmed by partial or whole genome sequencing [14,24,25]. They can be further divided in strain families [14,15,18,25,26,27,28]. Lineage 1 contains the East-African-Indian (EAI) and some Manu spoligotype families, lineage 2 the Beijing group, lineage 3 the Central Asian (CAS) spoligotype family and lineage 4 the H/T, X and LAM families. Lineages 5 and 6 both corresponding to Mycobacterium africanum are observed predominantly and at very high frequency in Western Africa [29,30,31,32]. Finally, there exist a number of animal associated lineages or ecotypes including M. bovis, M. caprae, M. pinnipedii-M. microti, M. orygis, M. bovis from antelopes (Dassie bacillus) and M. mungi [18,33] which all belong to the same clade.
The clonal population structure of the MTBC complex implies that it emerged from a unique location. This location is currently unknown. The most recent common ancestor (MRCA) of the extant MTBC lineages is estimated to have lived some 5.000-40.000 years ago [19,34,35,36]. The presence of most lineages in Africa makes this continent a likely place of emergence of the MTBC. More significantly perhaps, the vast majority of the ''M. canettii'' isolates collected so far (less than 70) originated from the Horn of Africa and the geographic origin of the others is uncertain [13]. Sub-Saharan Africa is also the region where the documented most recent common ancestor of the human population originated [37,38,39,40]. Under this hypothesis, characterising the genetic diversity of MTBC strains isolated in Sub-Saharan Africa including the Horn of Africa might be of major importance in trying to reconstruct the emergence of the MTBC. In this way, one might expect to identify the ecotype as defined by Cohan and col. [41,42] corresponding to the place of birth of the MTBC. This ecotype should be recognisable by its deep branching in the MTBC and by its restricted geographic spread. A study by Godreuil et al. investigated 62 isolates from patients from the Republic of Djibouti (Horn of Africa) by spoligotyping and multiple locus VNTR analysis (MLVA) using the MIRU-VNTR24 assay [22,43]. All isolates could be assigned to either lineage 1, lineage 3 or lineage 4. This indicated that unknown deep branching M. tuberculosis geotypes, if they are present in Djibouti in spite of periodic selection [41], are rare. Hence a significantly larger number of isolates would need to be screened in order to identify them.
In the present investigation 435 M. tuberculosis isolates from the Horn of Africa (mostly collected in the Republic of Djibouti) have been genotyped, allowing the assessment of the current genetic diversity of M. tuberculosis in Djibouti and the identification of rare isolates of interest. Whole genome draft sequencing was applied to seven selected isolates, leading to the identification of a previously unknown and very deep lineage which is an excellent candidate for representing the original ecotype of the MTBC. The phylogenetic position of this lineage provides new insights on the early history of the MTBC.

Genotyping and clustering of MTBC isolates
A total of 411 MTBC isolates, including two M. africanum and one M. bovis, were recovered over 12 years in Djibouti, the Republic of Djibouti. Together with 24 isolates from three neighbouring countries (Somalia, Sudan and Kenya) they were genotyped by spoligotyping and MLVA using the MLVA24 Orsay assay [13] and their susceptibility to four antibiotics was determined. Spoligotyping and MLVA defined respectively 80 and 211 genotypes, and a combination of the two methods led to 235 genotypes (eleven MLVA genotypes could be separated into 2 or 3 groups according to the spoligotype whereas 16 spoligotypes could be separated by MLVA). Lineage assignment was deduced on the basis of MLVA clustering and presence of spoligotype signatures [28,44]. Of the 80 observed spoligotypes, thirty-three were classified as orphan by SITVITWEB and several were found mostly in East Africa or Saudi Arabia (Table S1).
The results of MLVA clustering for the 435 isolates are shown on Figure 1 in a condensed format using a minimum spanning tree (MST) representation and in Figure S1 as a dendrogram presenting the isolates in each cluster, their spoligotype, the Spoligo-international-type (SIT) when available from SITVIT-WEB and the clade (the same color code is used in both figures). Two ''M. canettii'' and two reference strains (H37Rv and CDC1551) were also included. The predominant human lineages are represented. Lineage 4 which includes two T/H subfamilies (colored in red and green) and the LAM subfamily (colored in dark blue) is the most abundant (252 isolates, 57.9%). Lineage 3 (CAS clade, colored in bright blue) is represented by 98 isolates (22.5%), lineage 1 (EAI clade colored in purple) by 70 isolates (16%) and lineage 2 (colored in yellow and including the Beijing clade) by 11 isolates (2.75%). The diversity inside lineage 1 (46 combined genotypes for 70 strains) is slightly higher than that of the lineage 4 (120 combined genotypes for 252 strains) and lineage 3 (CAS) (55 combined genotypes for 98 strains).
To get an idea of the diversity of our collection as compared to isolates from diverse origins worldwide, we performed an MLVA clustering using available data for 700 isolates from Europe, West Africa and Asia, including data from the 186 isolates in the VNTRplus database [45]. The MST in Figure 2 shows the MLVA clustering achieved when using the 19 loci shared by two 24-loci assays, MLVA24 Orsay [13] and MIRU-VNTR24 [22] (MLVA panels are recalled in Table S2). In this figure, the isolates from the present investigation are shown in white whereas the additional isolates are colored as indicated according to lineage and/or spoligotype clade. Most isolates from Djibouti do not differ significantly from the bulk but some large subclusters are more specifically found in Djibouti. Several isolates are distantly linked and they will be discussed in the following paragraphs.

The lineage 4
Two hundred and fifty-two isolates belong to lineage 4, characterized by the absence of spacers S33-S36 in the DR locus. The larger cluster with 150 isolates includes reference strain CDC1551 (called here lineage 4-CDC and shown in bright green in Figures 1, 2 and S1) and the second with 78 isolates includes reference strain H37Rv (called here lineage 4-H37 and shown in red in Figures 1, 2 and S1). In these clusters there is no exact correlation between MLVA type and spoligotype except for some particular subclones. The most represented spoligotype is T1 (SIT53, 55 isolates), the very widespread lineage 4 ancestral form with the signature deletion of spacers S33-S36 (octal code 777777777760771), present in both 4-H37 and 4-CDC clusters.
In lineage 4-H37 (red), 38 isolates with the same MLVA genotype are separated into two groups by spoligotyping. Four isolates are SIT37 missing S13 (octal code 777737777760771), and thirty four have a spoligotype not present in SITVITWEB with an absence of contiguous spacers S4-S23 (octal code 700000017760771) which might be a one-step variant from SIT37 or SIT53. Interestingly this group contains only multi drug resistant (MDR) isolates collected across the whole sampling period (1999-2010, Figure S1) and may represent an ongoing outbreak.
In lineage 4-CDC, fifty-eight isolates with the same MLVA genotype show the ancestral SIT53 pattern or the T3-ETH (Ethiopia) SIT149 variant (additionally missing spacers S10 to S19, octal code 777000377760771). This predominant clone observed in 43 isolates was recovered along the whole sampling period and might also reflect the presence of an ongoing outbreak (Table S1).

The lineage 3 (CAS) and lineage 2 (including Beijing)
The 98 CAS strains isolated in Djibouti (colored bright blue in Figure 1 and Figure S1) are separated into two groups by MLVA, one of which showing little diversity with 35 SIT21 (CAS-Kili) isolates and one variant. The second group contains 14 SIT26 (CAS1-Delhi, octal code 703777740003771) founder genotype isolates, and variants of SIT26: SIT25 missing the extra spacers S37 and S38 (16 isolates), SIT247 (5 isolates) and 6 other spoligotypes (12 isolates). Ten isolates from Sudan cluster in the CAS-Delhi group. Again in this clade the founder genotype is one of the most abundant genotypes.
The eleven lineage 2 isolates belong to the Beijing-family with a classical spoligotype (deletion S1 to S34, octal code 000000000003771) whereas 7 MLVA genotypes are observed. When clustered with 126 Beijing isolates from China, Thailand and Cambodia they localise into two groups together with isolates showing identical genotypes (data not shown). We searched for Beijing clade specific deletions and found that all 11 isolates were deleted for RD150, RD181 and RD142 demonstrating that they belong to the same currently highly successful Beijing sublineage. All the Beijing isolates were MDR.

Lineage 1, lineage 6 and Percy556
The cluster colored in purple in Figure 1 and Figure S1 shows the 70 EAI isolates. Three main subgroups are observed, the larger one including essentially isolates from Djibouti and the Horn of Africa (as shown by comparison with other isolates worldwide, Figure 2) with a spoligotype identical or similar to EAI8-MDG (SIT109 Madagascar). A second group containing mostly isolates from Djibouti shows a spoligotype close to EAI-SOM (SIT48 Somalia). The variability inside the EAI clade is large with 18 spoligotypes and 43 MLVA types for a total of 70 isolates. The other lineage 1 clade, represented by the Manu1, 2, 3 spoligotype is not detected [46].
The two M. africanum isolates from Djibouti (Percy34 and Percy119) cluster with our collection of 81 M. africanum isolates from West-Africa in an outgroup position of lineage 6 West African 2 ( Figure 2, two open circles within the pink cluster). Their spoligotype shows the canonical deletion S39 but interestingly S8 is present whereas M. africanum are frequently deleted of S7 to S9 [18,47]. The unique M. bovis isolate does not show any particularities when compared to 30 other M. bovis isolates.
Lastly, one isolate, Percy556 isolated in 2007 could not be strongly linked to any lineage ( Figure 1 and Figure 2). Percy256, another isolate from the Percy collection (recovered in France in 1998 from a patient of unknown origin [48]) clusters closely with Percy556 by MLVA (Figure 2, arrowed). Similar to EAI lineage 1 isolates, and in contrast with lineages 2 to 6, their genome is intact for regions of deletion TbD1 and RD9. In both isolates the number of repeats at VNTR Miru26 is 5 whereas it is 2 in all tested EAI isolates. Portions of the katG and gyrA genes were sequenced showing that they belong to Principal Genetic Group (PGG) 1 [35]. Finally both isolates show an identical spoligotype (SIT910, octal code 700000007177771). In particular they possess spacers S34 and S39 which is a feature found in some lineage 1 isolates [49] and some rare Beijing-family ancestor isolates [50] in relation with the Manu ancestor lineage [46]. SIT910 spoligotype is very rare as it is represented by only three isolates in the SITVITWEB database, two from Ethiopia and one from the USA. Minimum spanning tree representation of the Horn of Africa isolates with respect to 700 isolates of various origins. The 435 isolates from the Horn of Africa are displayed in white, whereas the isolates of worldwide origins are colored according to lineage using the same color code as in Figure 1. Red arrows: isolates selected for sequencing. The minimum spanning tree analysis is based upon the 19 VNTR loci shared by the 24 loci MLVA assays used by [13] and [22]. doi:10.1371/journal.pone.0052841.g002 Whole genome SNP analysis Most isolates from Djibouti fit into well characterised lineages in agreement with previous reports [43]. However a few isolates appear to be only weakly connected to previously known lineages as shown in Figure S1 and Figure 2. This is the case for M. africanum Percy119 and Percy34 (weakly linked to lineage 6, Figure 2), and especially for Percy556. We performed draft whole genome sequencing in order to more precisely locate these isolates within the MTBC complex. Isolates Percy717, an M. africanum isolate from Gabon weakly connected to lineage 5 West African 1, Percy209 an isolate from a patient originating from Mali and Percy256 closely related to Percy556 were also included ( Figure 2). Percy209 shows a SIT523 spoligotype (Manu ancestor spoligotype with all 43 spacers present) and is connected to lineage 2 ( Figure 2). Percy644 a Beijing clade isolate was incorporated as a control. The seven isolates were positioned with respect to 38 strains including 14 fully sequenced genomes (Table 1). By mapping each dataset against the H37Rv genome, 13382 SNPs were identified, corresponding to 13358 positions in the H37Rv genome (at 24 positions, three states are observed, as can be deduced from Table  S3 so that the number of distinct SNPs is 13382). Figure 3 shows a MST deduced from Table S4 SNP data and allowing the creation of hypothetical missing links. The size of the minimum spanning tree is 13463, an excess of 81 corresponding to an homoplasia level of 0.6%. Fifty-six SNPs are responsible for the homoplasia. Fortythree occur only twice, but 13 SNPs are responsible for almost half of the homoplasia. SNP s1319 in Rv0280 (a PPE family protein) occurs in seven positions in the MST. Rv2082, annotated as hypothetical protein in H37Rv refseq NC_000962.2 contains 5 homoplasic SNPs accounting for 14 out of the 81 excess. The red star in the central position of the Figure 3 MST indicates the approximate position of the MRCA for the MTBC complex, according to Namouchi et al. [19], i.e. the branching point of the ''M. canettii'' taxon, under the hypothesis that ''M. canettii'' can indeed be considered as an outgroup. Three branches were observed to radiate from this red star; one corresponds to lineage 1 including EAI (pink), a second one leads to lineages 5, 6 and the animal lineage including M. bovis (brown, green, orange), and the third one leads to lineages 2, 3 and 4 (blue, purple, red). Lineages 2-3 and lineage 4 split at a distance of 241 SNPs from the red star. The secondary split between lineages 2 and 3 occurs very shortly afterwards, 42 SNPs later. The distance from the red star to the tip of the branches is closely related within each lineage and across lineages: it is approximately 700-800 SNPs long for lineages 1 to 5, a little more than 900 SNPs for lineage 6 and 7, and a little more than 1000 SNPs in the animal lineage ( Table 1). The Percy644 control strain fits in the lineage 2 (including the Beijing clade) as expected. It contributes to 1.5% of the size of the MST (202 bp terminal branch). Similarly, Percy717, Percy34 and Percy119 belong to their predicted lineages, 5, 6 and 6 respectively and they contribute to 8.4% of the total MST. Interestingly these two groups of strains define two new deep branching nodes and significantly older MRCAs within these lineages (indicated by blue stars in Figure 3). These nodes are located approximately midway along the lengthy initial lineage 5 and lineage 6 branches. Percy209 contributes a 354 SNPs terminal branch corresponding to 2.6% of the MST. Percy209 from Mali and M4100A from South Korea both showing a Manu ancestor SIT523 spoligotype split simultaneously from the rest of lineage 2 leading to the Beijing sublineage.
As initially suggested by MLVA and spoligotyping, Percy256 and Percy556 are very exceptional compared to all the other studied genomes. They introduce a 876 SNPs long terminal branch to the tree, which contributes to 6.5% of the MST. This remarkable branch splits from the branch leading to lineages 2, 3, and 4 only 66 SNPs away from the red star. Previous investigations identified seven interruptions of coding sequences (ICDS), numbered ICDS0013, 0037, 0038, 0045, 0055, 0073, 0078, which occurred along superlineage 2-3-4 before the split between lineage 4 and lineages 2-3 [51]. ICDS0045 could not be analysed in Percy256 and Percy556, whereas ICDS0013 (Rv0618-Rv0619) and ICDS0038 (Rv1549-Rv1550) are present in Percy256 and Percy556. The four others are still intact in both strains. We propose to call this new lineage, lineage 7. Percy256 and Percy556 isolated respectively in 1998 and 2007, are separated from each other by only 8 SNPs.
Rooting the MTBC Lineage 7 is unique by being the only lineage with a geographic location apparently limited to the Horn of Africa and as such coincident with the ''M. canettii'' lineage. It is also quite remarkable by its rarity. This strongly suggests that its current members constitute the geotypes resulting from the evolution of the M. tuberculosis ancestor in its original geographic location. Figure 4 draws a linear tree running from the putative ancestral human obligate pathogen to Percy256 and Percy556, as can be deduced from the MST in Figure 3. In this view, all contemporary worldwide lineages and sublineages are the result of three independent ''Out of the Horn of Africa'' events. The first two, corresponding to lineage 1 on one hand, and lineages 5, 6 and M. bovis on the other, are coincident or almost coincident. The third superlineage, subsequently leading to lineages 2, 3 and 4, emerged 66 SNPs later.
We attempted to estimate the time span which may correspond to these 66 SNPs. For this purpose, we took advantage of the previous investigation by Gardy et al. who sequenced 32 isolates from an outbreak, and additional historical isolates from the same region, collected a few years before [52]. We re-analysed the raw sequence reads applying the same SNPs detection procedure as above. Figure 5 shows the results obtained in the form of a MST. The outbreak isolates complex collected over three years contains 17 SNPs: 23 outbreak isolates show an identical genotype, whereas the other 9 are disposed in a star-like pattern around the central and presumably founder genotype, at a distance of 1 up to 4 SNPs. The historical isolates are defining a very close ancestral node. This provides a rough estimate of the MTBC branches growth rate, of approximately 1 SNP per 10 years in an outbreak context, which would convert into 10 thousand years for the whole MTBC, in reasonable agreement with independent estimates. It also indicates that lineage 1 and lineages 5-6-animal might have emerged within a single outbreak, whereas lineage 2-3-4 emerged a few centuries later.

Regions of deletion along lineage 7 (Percy256 and Percy556)
To further explore the Percy556 genome, we aligned it against H37Rv and predicted the absence in Percy556 of 13 regions of more than 2 kb (Table S5). They were given arbitrary names according to their position in the list of missing regions larger than 1 kb. Two of them encompass phage-like elements: Del62_(1779276, 1788519) which is flanked by 12 bp direct repeats and Del103_(2973436, 2980972) containing phiRv2 prophage genes flanked by 28 bp repeats. Del97_(2969985, 2972110) encompass the tRNA Val locus and an integrase. Eight missing regions correspond to genes encoding PPE or PE_PGRs family proteins. Deletion Del79_(2211370, 2219425) encompass 8 kb of the mammalian cell entry (Mce) operon 3 (mce3) including the mce3B, mce3C, mce3D and mce3F genes ( Figure 5). Del79 is within the 12.7 kb RD7 region previously shown to be absent from M. bovis and lineage 6 West African 2 strains (spoligotype AFRI1) [10,53]. In order to more precisely define the junction, we performed an assembly of reads mapping in the interval corresponding to the deletion and found a 1128 bp fragment which is present in genomes representing lineage 1, 2 and 3 and in the ''M. canettii'' CIPT140010059 (accession number refseq NC_015848.1) genome but absent in the lineage 4 genomes. The last deletion Del154_(3884925, 3887201) of approximately 2.2 kb encompass genes putatively involved in metabolic pathways, Rv3468 (involved in dTDP-L-Rhamnose biosynthesis), Rv3469 (aromatic hydrocarbon degradation) and Rv3470 (valine and isoleucine biosynthesis).
To check the distribution of Del79 and Del154, we selected oligonucleotide primers inside the region of deletion and in flanking regions and we performed a PCR amplification on genomic DNA samples belonging to the different lineages. Del79 was tested using primers inside Rv1971 (mce3F) and no amplification was observed in Percy556 and Percy256. The same results were obtained with Percy34 and Percy119, the lineage 6 M. africanum isolates which is not unexpected given the overlap of Del79 with the RD7 deletion (data not shown and Figure 6). Using primers flanking the deletion a 1.6 kb product was obtained in Percy556 and Percy256 but as expected not in any other isolate including two ''M. canettii''. Sequencing of this fragment confirmed the extent of the deletion as compared to H37Rv and showed the presence of a tuf-like gene and a non-coding fragment both absent from the H37Rv genome and present in genomes of lineages 1, 2 and 3. Del154 was tested using primers inside Rv3468 and a product of 340 bp was observed in every sample except Percy556 and Percy256.  Figure 7, in which each node is numbered and the relative proportion of the different mutations is shown by disks under the assumption that node 7 is the ancestral node. The disk of the terminal branches (leaves) are shown with an equal size independently of the branch size, whereas internal disks sizes are proportional to branch length. The previously reported large  excess of G/CRA/T mutations are illustrated here with no significant differences between internal and terminal branches across the whole tree. This indirectly confirms the ancestral status of node 7: selecting any other node (with the exception of node 6) as ancestor would create strikingly different mutation pattern along some branches compared to the others. Interestingly, the pattern of mutations along branch (6,7) is slightly abnormal compared to the others, with a deficit of G to A and an excess of A to G mutations which might indicate that the more precise position of the root of the MTBC is located somewhere along this branch. Lineage 7 shows a very similar pattern compared to other branches. Considered collectively, the 30 SNPs distinguishing the outbreak isolates and close neighbors in Figure 5 also show a similar pattern.

MLVA allows the identification of rare and phylogenetically important isolates
Comas et al. recently argued that VNTR-based genotyping could not perfectly define phylogenies in MTBC as compared to single nucleotide polymorphism or large sequence polymorphism for phylogenetic studies [25]. Indeed the homoplasy level is higher with VNTRs as compared to carefully selected SNPs. However although MLVA cannot be expected to provide robust phylogenetic support, it offers a high level of discrimination as well as correct and unbiased clustering assignment. It is also currently the only simple and widely affordable technique which allows the study of large populations of samples: it is able to identify isolates which are only distantly related to known strains and clusters, in a more robust way than spoligotyping and consequently is appropriate to provide a rapid and global overview of a bacterial population. In the present study the MLVA clustering combined with spoligotyping allowed the clade assignment of all 435 isolates except for Percy556, which turned out to be a member of a previously unknown lineage. Similarly the two M. africanum isolates, placed in an outgroup position by MLVA clustering, proved to be defining new ancestral nodes within lineage 6. Percy717, an M. africanum West African 1 isolate from Gabon which was differentiated by MLVA from the other members of this sublineage is also defining a very ancestral node within lineage 5.

Geographical origin of the different lineages
Phylogeographic studies have shown that lineages within M. tuberculosis strains can be associated with particular localisations in the world presumably as a result of human migrations. Whilst the Republic of Djibouti occupies a very small surface, the patients harbour strains of all the major MTBC clades specifically affecting humans. The larger group possess the deletion of spacers S33 to S36 the signature of lineage 4 isolates. MLVA distributes the H, T and X clades isolates into two subfamilies, lineage 4-H37 and lineage 4-CDC. The LAM spoligotypes mostly correspond to LAM1 and LAM9. Interestingly, among the lineage 4-CDC, a  Figure 3 minimum spanning tree was calculated. The disk size reflects the branch length except for terminal branches (leaves) for which identical size disks are used. The mutation direction is deduced by comparing the genotypes of the hypothetical nodes (numbered in red) and considering node 7 as the ancestral node. The color code and branch scale are indicated. The disk on branch (6,7) is slightly abnormal, suggesting that the MTBC MRCA is located somewhere along this branch rather than coincident with node 7. doi:10.1371/journal.pone.0052841.g007 group of isolates shows a large deletion of spacer S4 to S23 previously unreported and likely to be a recently emerging clone in Djibouti (here called the Djibouti family).
Lineage 3 CAS represents the second largest clade and is separated into CAS-Kili previously found in Tanzania [55] and CAS-Delhi highly represented in India and Saudi Arabia [56]. The genetic diversity inside the CAS-Delhi family suggests a long time of evolution or recolonisation by more diverse sublineages whereas the CAS-Kili family appears to be emerging in the Horn of Africa, or has not had the opportunity to diversify into multiple geotypes. The EAI clade from lineage 1 displays the larger diversity. However, we observe no Manu1, 2 or 3 lineage isolate. The Manu1-3 lineage is relatively frequent in India where it has been shown to be related to the EAI clade and to belong to lineage 1 characterised by the RD239 deletion [49,57,58].
To compare the strain diversity in Djibouti to that of neighbouring countries and countries of East Africa mostly spoligotyping data are available in the literature as summarised in Table S6 [46,55,56,59,60,61,62,63,64,65]. SITVITWEB also provides a very convenient tool to view available data. The clade identification from the spoligotype is not always straightforward, and only 70 to 80% of isolates have been assigned to a clade. In brief, the pattern in East Africa, Egypt and Saudi Arabia is highly similar, with a predominance of lineages 1, 3 and 4. In South Africa a study of 252 isolates showed the predominance of lineage 2 and lineage 4 strains [64]. In Ethiopia, most of the 141 isolates in SIVITWEB are lineage 4 (T, in particular T3-ETH). In Sudan, 2/ 3 of the 49 isolates investigated are CAS lineage 3.
In agreement with previous investigations, only two M. africanum and one M. bovis strains were found in the present study [43]. The M. africanum lineages 5 and 6 are currently mostly limited to West Africa where they are very abundant. They represent for instance an estimated 20% of MTBC isolates in Burkina-Faso [31] but are rare in East African countries [66]. Curiously the two lineage 6 M. africanum isolates from Djibouti appear to possess spacer S8 which is absent from West African isolates, and whole genome sequence analysis places them in a new branch defining a node closer to the junction with the M. bovis lineage.

A new method for whole genome SNP identification
There is not yet a standardised approach to identify phylogenetically relevant SNPs in whole genome sequence data. Very often, SNPs deduced by aligning short reads, contigs, or whole genome sequences on a reference genome are filtered in order to mask different kinds of repeated elements [14,52,67,68]. At least two reasons can be invoked to justify this masking step. Firstly short reads cannot confidently be used to reconstitute such elements, and consequently, analyses using a mixture of fully sequenced genomes and NGS short reads data should not take into account SNPs located in repeated elements. Secondly, intrachromosomal homologous recombination within such elements have been shown to exist in M. tuberculosis [69]. The use of genetic variations resulting from internal shuffling to infer phylogeny would require specific tools compared to the analysis of de novo SNPs inherited clonally. It is also common practice to discard SNPs occurring in very close proximity, typically less than 10 bp apart [52]. As a result different investigations of the same sequence data may produce slightly different lists of SNPs. In the present investigation, no filtering step was applied to the initially produced list of SNPs. This is due to the constitution of pseudoshort reads from the fully sequenced genomes previous to the mapping step on the reference genome. In this way, most if not all genetic variations potentially resulting from intrachromosomal exchanges between repeated elements are de facto erased.

Identification of a new lineage and branch length
Lineage 7 identified here is an excellent candidate for representing a local M. tuberculosis geotype as predicted by evolutionary models for clonal bacterial species [42]. Percy256 and Percy556 with spoligotype SIT910 happen to share a remarkable characteristic in terms of spoligotype, the presence of spacers S34 and S39. Apart from the rare SIT523 Manu ancestor with all forty-three spacers identified predominantly in Asia, this signature is also present in SIT343, SIT415, SIT461, SIT1729 and SIT1764. SIT415 and SIT461 were observed in India and countries of Indian emigration. In contrast, SIT343 was observed in 5 isolates, two from Ethiopia, two from the UK, and one from the USA. SIT1729 was observed in one Somalian and one Ethiopian isolate. SIT1764 was observed in two Kenyan isolates. These isolates are candidates to represent additional geotypes within lineage 7. The observation that the few suspected members of lineage 7 (sharing the same or significantly similar spoligotype) as well as all ''M. canettii'' isolates with an unambiguous geographic origin originate from the Horn of Africa suggests that lineage 7 emerged and remained in this region of the world. Interestingly, 5 among 141 isolates from Ethiopia present in SITVIWEB contain both S34 and S39. This is much higher than the proportion observed in Djibouti, and may suggest that Ethiopia contains the epicentre of lineage 7, or at least is closer to it. SITVITWEB currently contains little or no data from additional neighboring countries such as Somalia or Uganda. In the present study thirty percent of the 411 isolates collected in Djibouti were from Ethiopian or Somalian patients so that we cannot exclude that Percy556 originate from Ethiopia via an Ethiopian patient. Due to the strict anonymisation of samples, this question could not be addressed here.
The topology of the MTBC phylogeny, and in particular the relative position of lineage 7, is a clear indication that all three superlineages constituting the extant MTBC emerged separately from the Horn of Africa. The first two superlineages, comprising lineage 1, and lineages 5, 6, Bovis respectively, might have emerged almost simultaneously. A more precise positioning of the root of the MTBC will be required in order to firmly establish the temporal order of emergence of these two superlineages. The third superlineage, comprising lineages 2, 3, and 4, emerged later. One striking observation is that the length from root to tip is very similar or even slightly longer in the new lineage 7 compared to the other lineages in spite of the fact that lineage 7 is much less widespread than others. This might be consistent with the observation by Ford et al. [54] that the mutation rate of M. tuberculosis is very similar during latency, active disease or even exponential growth in culture media. Hopefully, future investigations of ancient M. tuberculosis DNA by whole genome sequencing might allow the more precise calibration of the molecular clock within the MTBC. Under the 5,000-40,000 years current estimate, the third superlineage which later colonized the whole of Eurasia would have emerged ca 300-3000 years after the first two. In that respect, it may indeed be called the Modern superlineage, whereas the first two are the Ancient superlineages. Subsequent worldwide spread of these three independent branches via human migration resulted in the current MTBC geotypes. The animal lineage is the unique exception where host preference has been the basis for true ecotypes emergence and preservation [42]. It is remarkable that the Modern superlineage, which would have split 300-3000 years after the first two, was able to spread, suggesting that the area was at that time a major demographic and presumably economic actor. This may also suggest that it possessed a selective advantage and comparative genome analysis of lineage 7 versus lineage 2-3-4 may help identify the underlying genetic factors. It is also remarkable that no other similarly successful event occurred afterwards, suggesting that the human niche was by then already largely occupied by the descendants of the first three superlineages or that the region experienced major economical changes. Interestingly, in their investigation of Egyptian mummies, Zink et al. detected both the Ancient (M. africanum) and the Modern lineages. More precisely, all four samples from the oldest tomb investigated (2050-1650 BC) had the characteristic M. africanum spoligotype, whereas five out of six samples investigated in tombs dated 1450-500 or 1250-500 BC were Modern with a typical lineage 4 spoligotype signature (incidentally, 3 among the M. africanum isolates possessed spacer 8, as observed here for Percy34 and Percy117, which is a rare feature among contemporary M. africanum). It is tempting to speculate that within a few centuries, Modern superlineage representatives replaced the Ancient Africanum lineage in Egypt and East Africa.
The transience of M. tuberculosis geotypes and dynamics of M. tuberculosis population structure in a given place [70] is further illustrated in the present investigation. In the snapshot of the M. tuberculosis diversity in the republic of Djibouti, some lineages are clearly new comers, and the country may be a place of intense competition between the many geotypes which are present. The rare lineage 2 Beijing isolates identified in the present study are most likely the result of re-import events, because no member of the RD150-, RD181-or RD142-intact sub-lineages were observed. Similarly, it will be interesting in future investigations to characterise in more details lineage 1 EAI and lineage 3 CAS isolates from the Horn of Africa, in order to better understand at what time the different sublineages were re-introduced. Some of the isolates investigated here are positioned on long MLVA branches and may be relevant choices for future whole genome SNP analysis.

Ethics statement
The study obtained approval of local ethics committee (the Paul Faure Centre Ethics Committee). The committee estimated that patient informed consent was not necessary since collection of sputum or pus was part of the patients' usual care and no additional sample was necessary for the present study. All samples were anonymised.
Patients. In Djibouti, Republic of Djibouti, Horn of Africa, the Paul Faure Anti-tuberculosis Centre (PFC) cares for patients with suspected TB. Among 421 cultures positive for mycobacteria obtained between 1997 and 2011, 411 were shown to contain a unique MTBC genotype and were retained for the present study. The origin of the patients was not always known but about 70% of them were Djiboutian, and the others were from Somalia or Ethiopia. Sputum was collected and sent to the Percy military hospital, France, for isolation of M. tuberculosis. In addition 24 isolates recovered from patients in neighbouring countries were included in the study: Somalia (2 isolates), Sudan (19 isolate) and Kenya (3 isolates). For comparison, two ''M. canettii'' isolates, Percy525 and Percy673 [13], and the reference strains H37Rv and CDC1551 were included.
Sample processing and identification tests. Acid-fast bacilli (AFB) in samples were detected by auramine staining. Positive results were confirmed by Ziehl-Neelsen staining. Lowenstein-Jensen (LJ) slants and Coletsos slants were inoculated with each sample. The samples were cultured in Mycobacterial Growth Indicator Tubes (MGIT, Becton-Dickinson, Le Pont de Claix, France) and MTBC were directly detected by 16S rRNA amplification (Amplified Mycobacterium tuberculosis Direct Test, AMTD; Gen-Probe-bioMerieux, Lyon, France). Mycobacteria were identified according to conventional biochemical procedures. The concentration of thiophen-2-carboxylic acid hydrazide used was 2 g/L. Each positive culture was also tested with an M. tuberculosis complex DNA probe (Accuprobe; Gen-Probe-bioMerieux, Lyon, France). Susceptibility of the MTBC strains to isoniazid (INH, 0.2 g/L), rifampicin (RIF, 40 g/L), streptomycin (SM, 4 g/L), ethambutol (EMB, 2 g/L) were determined by the proportion method on LJ medium.

DNA preparation
Thermolysates were prepared by suspending bacteria into water and heating at 90uC for 30 min. DNA for whole genome draft sequencing was prepared as follows. Bacteria were killed by heating 20 min at 80uC and treated by lysozyme at 2 mg/ml final concentration for 30 min at room temperature. Then one volume of lysis buffer (20 mM Tris PH 8, 20 mM EDTA PH 8, 20 mM NaCl, 1%SDS) and 50 mg/mL Proteinase K were added and the lysate was incubated 2 hours at 50uC. NaCl was added to a final concentration of 0.7 M, followed by CTAB at a 1% final concentration and the solution was thoroughly mixed before incubation for 10 min at 65uC. Chloroform extraction was performed and the aqueous phase containing DNA was purified using two successive extractions with phenol and chloroform. Finally the DNA was precipitated with 2 volumes ethanol.

MLVA data management and analyses
PCR amplification of VNTR loci from thermolysates and electrophoresis of products on agarose gels were as described in [48]. The twenty-four loci selection previously proposed by Fabre et al. was used (Table S2; [13]) with the list of primers indicated in Table S7. The reference H37Rv M. tuberculosis strain was included as a control in the MLVA analysis. The MLVA typing data is presented in Table S8 and has been incorporated in the Mycobacterium tuberculosis database on the MLVAbank website [71].
Gel images were analyzed using the BioNumerics software package version 6.6 (Applied-Maths, Sint-Martens-Latem, Belgium) as previously described [48]. The number of repeats in each allele was deduced from the amplicon size. The resulting data were analyzed as a character dataset. Clustering analysis was done using the categorical parameter and the UPGMA coefficient. The minimum spanning tree was constructed with the following options: (a) in case of equivalent solutions in terms of calculated distances, the selected tree was the one containing the highest number of links between genotypes differing at only one locus (''Highest number of single locus variants'' option); (b) the creation of hypothetical types (missing links) reducing the total length of the tree was allowed.

Analysis of the DR locus by Spoligotyping and PCR amplification
Spoligotyping was performed using a membrane from Isogen (De Meern, The Netherlands) as described by the manufacturer. The resulting data was imported into BioNumerics as a character data set. Spoligotypes were analysed using SITVITWEB [44,57,72].

Deletion analysis
The presence/absence of the RD7, RD9 and Tbd1 regions was analysed by PCR using the primers described by Brosch et al. [10]. Beijing-family strains were analysed for the RD142, RD150 and RD181 deletion as described [73]. Investigation of newly observed regions of deletion in strain Percy556 was performed using primers canettii'' CIPT140010059 (NC_015848.1) were downloaded from Genbank. Sequence read archives were downloaded from the short reads archive (sra) hosted by NCBI [14,52,74] or from the Mycobacterium tuberculosis Comparative Sequencing and Mycobacterium tuberculosis Diversity Sequencing Projects, Broad Institute of Harvard and MIT [75].
In order to harmonize the SNP detection procedure, the fourteen whole-genome sequences were split into short overlapping 75 bp reads resulting in a 756 coverage. Artificial and real short reads data sets were mapped on the reference H37Rv strain genome using BioNumerics asking for a similarity of at least 90%. A set of SNPs with a coverage higher than 106 was deduced for each genome sequence data. Individual lists were compiled, removing SNP positions at which one or more isolate displayed an ambiguous residue call or missing data. The list of SNP positions and main characteristics is provided in Table S3 and the SNP data for each genome is provided in Table S4.
A minimum spanning tree was drawn using BioNumerics and allowing the creation of hypothetical intermediates. Nodes were numbered by BioNumerics and the list of SNPs contributing to each branch were automatically retrieved to be included in Table  S3.
In order to identify missing regions in the genome of strain Percy 556 compared to H37Rv, a mapping assembly on this reference strain genome was done using BioNumerics. Missing regions larger than 2 kb and specific for Percy556 are listed in Table S5.
All NGS sequencing reads produced in the course of this work are deposited in the European Nucleotide Archive (ENA project accession ERP001885, [76]) maintained by the European Bioinformatics Institute (EBI).
In order to study the Interrupted Coding Sequences (ICDSs) identified by Deshayes et al. [51], a de novo assembly was carried out on the Percy256 sequence data. The assembly was performed on 3 millions Illumina 100 bp reads, using the BioNumerics assembly module (Power Assembler). The algorithm chosen for this step was Velvet, with k set to 31 and a minimum contig length of 1000 bp. The region corresponding to each ICDS was extracted from the H37Rv genome, and compared to the Percy256 de novo assembly. Figure S1 Dendrogram of the 435 isolates based upon MLVA24 Orsay . The spoligotype, strain Id, geographic origin, spoligotype clade, year of isolation and antibiotic resistance status are indicated. The color code used is as indicated in Figure 1. The dendrogram is based upon data produced using the previously published MLVA24 Orsay assay, the categorical coefficient and UPGMA clustering [13]. Two reference genomes (H37Rv and CDC1551) and two M. canettii isolates, Percy525 and Percy673 are included for comparison. (PPT)