Genomics and Machine Learning for Taxonomy Consensus: The Mycobacterium tuberculosis Complex Paradigm

Infra-species taxonomy is a prerequisite to compare features such as virulence in different pathogen lineages. Mycobacterium tuberculosis complex taxonomy has rapidly evolved in the last 20 years through intensive clinical isolation, advances in sequencing and in the description of fast-evolving loci (CRISPR and MIRU-VNTR). On-line tools to describe new isolates have been set up based on known diversity either on CRISPRs (also known as spoligotypes) or on MIRU-VNTR profiles. The underlying taxonomies are largely concordant but use different names and offer different depths. The objectives of this study were 1) to explicit the consensus that exists between the alternative taxonomies, and 2) to provide an on-line tool to ease classification of new isolates. Genotyping (24-VNTR, 43-spacers spoligotypes, IS6110-RFLP) was undertaken for 3,454 clinical isolates from the Netherlands (2004-2008). The resulting database was enlarged with African isolates to include most human tuberculosis diversity. Assignations were obtained using TB-Lineage, MIRU-VNTRPlus, SITVITWEB and an algorithm from Borile et al. By identifying the recurrent concordances between the alternative taxonomies, we proposed a consensus including 22 sublineages. Original and consensus assignations of the all isolates from the database were subsequently implemented into an ensemble learning approach based on Machine Learning tool Weka to derive a classification scheme. All assignations were reproduced with very good sensibilities and specificities. When applied to independent datasets, it was able to suggest new sublineages such as pseudo-Beijing. This Lineage Prediction tool, efficient on 15-MIRU, 24-VNTR and spoligotype data is available on the web interface “TBminer.” Another section of this website helps summarizing key molecular epidemiological data, easing tuberculosis surveillance. Altogether, we successfully used Machine Learning on a large dataset to set up and make available the first consensual taxonomy for human Mycobacterium tuberculosis complex. Additional developments using SNPs will help stabilizing it.


Introduction
Bacterial taxonomy has logically emerged when technology unveiled the microscopic level of life at the end of the nineteenth century. Macroscopic organisms were being classified since Aristotle, 350 BC [1,2], and the same taxonomic levels (Species, Genus, Family, Order, Class, Phylum, and Reign) were chosen. The process of assigning an organism inside a definite taxonomy will thereafter be referred to as classification, the term taxonomies being used for the classification schemes themselves. For the Species level, microbiologists have applied various concepts of species definition (phenotypical, morphological, ecological), trying to identify groups of organisms with identical biochemical features, studying colony morphology, nutrition requirements, etc., as well as host symptoms for pathogenic species. The similarities between lineages were randomly or intuitively ordered. With the advent of molecular biology, DNA-DNA hybridization has been recognized as a powerful, objective and consistent tool for characterizing lineages, and the threshold of 70% identity has been proposed for species delineation [1]. 16S rRNA sequence has then attracted attention: the presence of conserved regions ensured amplification in any bacteria. 16S rRNA sequences retrieved environmental samples helped describe bacterial diversity among uncultivable organisms. The comparison between DNA-DNA hybridization data with 16S rRNA sequences has shown that organisms with less than 97% identity in 16S rRNA sequence could safely be considered as belonging to different species [3]. Whole Genome sequences of Bacteria are now easily acquired due to their relative small size (below 10 Mb). The Average Nucleotide Identity (ANI) assesses DNA identity between two genomes and proves very concordant with DNA-DNA Hybridization. A threshold of 95% ANI is now advised to name new species [4]: individuals with more than 95% ANI should be considered as belonging to the same species.
Tuberculosis agent was first isolated by Robert Koch in 1882, the name Bacillus tuberculosis was proposed to the community by Zopf in 1883, which was changed for Mycobacterium tuberculosis in 1907 [5]. In 1912, a bacillus isolated by E. Nocard on cows and today known as M. bovis was specifically used to set up a vaccine after serial passage experiments led by Calmette and Guérin; it did not receive a different species name at that time, possibly because it was known to infect humans and trigger symptoms very similar to those of M. tuberculosis. In contrast, a bacillus collected on rodents, initially referred to as M. tuberculosis var. muris, was officially raised to the species level in 1957 [6] in a period where scientists promoted the use of infectivity profile as a mean of pathogen characterization [7]. Following this rationale, an increasing number of species were named according to their principal animal host even if alternative mammal hosts were common and disregarding DNA similarity criteria: M.  [8,9,10,11,12,13]. Two lineages infecting humans were raised to the level of species because of specific metabolic and phenotypic features: M. africanum and M. canettii [14,15]. In fact, the DNA diversity within and between all these lineages as studied by Whole Genome Sequencing proved limited, except for M. canettii's showing higher intra-and interlineage diversity [16,17,18]. According to molecular data such as average nucleotide identity (ANI) and 16S rRNA divergence, all these "species" could be considered as a single one despite the clear diversity in host spectrum. In contrast, some M. tuberculosis isolates proved more distant from one another than one was from any animal isolates. The resemblance between all the species listed above has led to the advent of "M. tuberculosis complex" (MTC) terminology for more than 30 years [19]. The diversity among human isolates harboring similar metabolic features being higher than the diversity between animal isolates whatever the host species [20], infra-species taxonomy of M. tuberculosis affecting humans is as important as the description of animal isolates diversity.
RFLP data detecting IS6110 insertion sequence and/or that of CRISPR locus (long called "Direct Repeat" or "DR locus", and the method, "Spoligotyping") have initiated MTC infraspecies MTC taxonomies strictly based on genotyping data [21,22,23]. The taxonomy based on CRISPR locus, relying on the presence/absence of specific spacers, became soon the most extensively used and a worldwide database referred to as SpolDB and later SITVITWEB is including an increasing number of sublineages since 1999 [24,25,26]. The CRISPR locus was found to have carried 68 spacers in the most recent common ancestor to all MTC species except M. canettii [27,28] and to have subsequently evolved by the loss of spacers or the integration of IS6110 sequences [29]. Recurrent "signatures", i.e. the absence of specific spacers, easily detected by experts, led to the naming of LAM, CAS, S, X, etc. sublineages [30]. The relevance of the corresponding taxonomy has been promptly acknowledged by the tuberculosis community based on the good congruence with geographical data, previously described ecological species M. africanum, M. bovis, sublineages such as Beijing and Haarlem detected using IS6110-RFLP [31,32]. Other studies criticized this taxonomy based on the fact that the deletion of each spacer considered independently can suffer from convergence [33,34]. However, these criticisms were defeated for major signatures as defined in SITVITWEB [35,36]. The reason for the reliability of well-known signatures despite convergence effects on individual spacers is that these signatures take into account the spatial organization of the locus. In the end, CRISPR-derived taxonomy is still widely used with 100 hits in Pubmed during the last 12 months (as assessed on February 12 th , 2015) using keywords "(spoligo Ã OR CRISPR) AND tuberculosis". The automation of CRISPR data use for classification has led to several web interfaces: spoligoforest to infer transmission chains SPOTCLUST and TB-Lineage for labeling new isolates [29,37,38], with TB-Lineage using a simplified taxonomy indicating large lineages as defined by Gagneux et al. [38,39].
From 2000 on, tuberculosis taxonomy was complexified by the advent of large deletions [40,41] and minisatellites termed MIRU-VNTR (mycobacterial interspersed repetitive units, variable number tandem repeats) [42,43,44,45,46,47]. The large acquisition of MIRU-VNTR data soon suggested that at least some specific labels provided using spoligotype patterns could be flawed [48]. An independent database and the corresponding taxonomy was set up to classify isolates according to the standardized combination of 24 MIRU-VNTR patterns [49], and/ or Regions of deletion: MIRU-VNTRPlus [47,50]. These 24-VNTR genotypes can be used in parallel for molecular epidemiology as they include many loci with high discriminatory power [51,52,53]. The most variable loci form a 15-MIRU-VNTR set that is now collected for epidemiological surveillance by most Reference labs, in combination or not with spoligotyping [54]. The next step in MTC diversity exploration was the analysis of Single Nucleotide Polymorphisms (SNP) either using high-throughput SNP typing [55,56,57] or Whole Genome Sequencing [39,58,59,60,61,62,63]. These approaches largely validated spoligotype and MIR-U-VNTR based taxonomies [34]. Several studies propose new classification tools using SNPs detection, the most precise and consensual being that derived from the data mining in more than 1,000 genomes [62].
Altogether, several tools currently exist for assigning M. tuberculosis complex isolates to taxonomic groups at different depths, but little time was by now invested in characterizing the concordance between the corresponding taxonomies. As a consequence, TB epidemiologists are often puzzled when trying to characterize their isolates. There is a clear need for using large studies and powerful algorithms on large datasets for establishing consensual infra-specific MTC taxonomy.
Machine learning is a statistics science identifying relevant information in large datasets even when some data are missing. It involves pattern recognition in prototypes i.e. the identification of formal rules correlated to a characteristic of interest [64]. Once patterns have been identified, assignation of unknown individuals is easy and very fast. Such method has been applied previously on spoligotyping data and helped identifying informative spacers to recognize experts-based groups [65]. Weka is a work bench set up in 1997 and implementing state of the art machine learning algorithms. This ability proved critical for reducing assignation errors [66]. It is very popular and was used in studies as diverse as epilepsy characterization using imaging data [67] and methylated DNA patterns linked to genetic diseases [68]. Its swiftness enables to implement it in parallel on different type of data, so that complete annotation of very large dataset can be reached in a timeframe of one minute.
In this work, we first completed the genotyping of a large dataset of 3,454 human M. tuberculosis isolates from the National Reference Center of Netherlands (RIVM) collected between 2004 and 2008 [69]. This data was used to further describe TB diversity and transmission dynamics in Netherlands and to clarify the potential of spoligotyping in molecular epidemiology. We then used this large database as a reference for human M. tuberculosis worldwide diversity after complementing it with genotypes from Lineages 5 and 6 particularly underrepresented in the RIVM dataset. We annotated all these genotypes according to existing classification tools to search for stable correspondences between the underlying taxonomies and proposed a new consensus where the correspondences are made explicit. We finally used Weka software to learn in parallel classification procedures, handling Spoligotype or MIRU-VNTR data, reproducing original and new taxonomies, and made the most successful procedure available on-line. Altogether, our work successfully clarifies the correspondence between the existing M. tuberculosis complex taxonomies. This consensus can be retrieved for any sufficiently informed new genotype (best when including at least Spoligotype + 15 MIRU-VNTR) using our new web interface, TBminer.

Isolates
All isolates analyzed in this study were cultured on Lowenstein Jensen solid media or 7H9 liquid medium in MGIT960 device. Three thousand four hundred and fifty four (n = 3,454) were from the National Reference Center of Netherlands, also committed in worldwide quality control studies. 225 isolates were from diverse hospitals in Pakistan (Faisalabad n = 6; Islamabad National Reference Center n = 109; Karachi n = 29; Lahore n = 21; Rawalpindi n = 60). DNA was extracted using the standard procedure using Cetyl-trimethyl-ammonium bromide (CTAB) [70]. No information concerning the patients was included in the analysis so that no approval from any ethical comity was required.
High throughput Luminex-based spoligotyping A total of 3,454 DNA samples, extracted from isolates collected between 2004 and 2008 by the RIVM and sent as concentrated CTAB (Cetyl-trimetthyl-ammonium bromide) extracts, were genotyped by high-throughput Luminex spoligotyping [23,27,71,72]. Briefly, 1 μL of~50 ng/ μL DNAs were amplified by PCR using DRb and biotinylated DRa in 25μL. PCR product (2μL) were hybridized with coupled polystyrene microbeads (2500 Microplex microbeads per individual target, Luminex Corp, Austin, USA) for 30mn at 55°C in 1.5X TMAC (Tetra-methyl ammonium chloride). After washing, 2μL streptavidin-R-phycoerythrin (1mg/mL, Invitrogen, USA) was added, microbeads were centrifuged and washed again, and after resuspending in 1x TMAC, the plates were read at 52°C. Interpretation was made using home-made excel matrixes helping control of cut-off selection between positive and negative values for each spacer. A Quality Control check, done by an independent investigator on 5% of randomly selected samples showed a perfect reproducibility of the patterns.

Assignation to sublineages using available classifications
Files of 500 24-VNTR and spoligotyping genotypes complying with all specified requirements were loaded onto TB-lineage and MIRU-VNTRPlus websites. In MIRU-VNTRPlus, default settings were changed to assign isolates to the closest inside the curated database as soon as the distance is of 0.5 maximum (default = 0.17), so that most isolates could be classified (n = 3382). TB-lineage could classify 3283 isolates (i.e. 97%).
To assign genotypes according to SITVITWEB taxonomy, we first made use of an Excel version of SpolDB4 uncovering the 2881 first SITs implemented in SITVITWEB. The assignations were slightly corrected by taking into account recent knowledge on relatedness among Euro-American sublineages. For instance, CAMEROON genotypes previously referred to as LAM10_CAM and subsequently found not to be related to LAM were simply labeled CAM, TURKISH isolates previously referred to as LAM7_TUR and subsequently found not to be related to LAM were simply labeled TUR, H4 subsequently found not to be related to Haarlem were renamed URAL, and more precisely URAL1 when spacer 2 was present and URAL2 when spacer 2 was absent. For the unclassified isolates, we used expert knowledge of C. Sola, mainly applying rules as described in Filliol et al. [73,74].
To assign each genotype according to Borile et al. taxonomy, we computed distances to the 32 Borile references based on shared blocks of absent spacers [36]. Every isolates was assigned to the group of the most similar reference, and unassigned when equal distances were found with at least 2 references.

Taxonomy consensus identification
Correspondences between all taxonomies were listed using an in-house algorithm to identify synonyms. As 24-VNTR signatures are known to be less prone to convergence than deletions of individual spacers in spoligotype patterns [34], when assignations by SITVITWEB and MIR-U-VNTRPlus taxonomies were conflictive, we privileged MIRU-VNTRPlus assignation. To make the synonym explicit, we tended to concatenate short versions of the different synonyms unless it was too long (Table 1).

Classification learning using Weka
A curated database enlarged to include more M. africanum isolates was imported into Weka. More specifically, curation removed all isolates for which at least one taxonomy was not able to provide an assignation or when one VNTR had a zero value (indeed absence of results was at some point in this long-lasting study recorded as 0 which may have lead to erroneous profiles); this concerned all lineages (data not shown) and almost all MIRU-VNTR so that it is not likely to have introduced any bias in the database. M. africanum isolates added to the RIVM curated dataset came from a Nigerian study. Altogether, the database counted 2,904 isolates including  Weka was used to train a classifier using different Machine learning algorithms and a Vote procedure. Machine algorithms handle characteristics such as genotype patterns at different loci, and a selected feature to be predicted. Every genotypic characteristic handled independently is called an "attribute". Different algorithms exist. A short description of their different characteristics is as follows: 1) J48 algorithm is the Weka version of the C4.5 algorithm. J48 is used to learn decision trees using quantitative values of attributes. First, the pair (attribute, value) that optimizes a criterion (entropy, gini index) is used to split the data in two sample. Then, for each sample, if it is pure (only one lineage represented) or if it contains less than a predefine number of data, the tree growing is stopped, else another split is determined based on the same algorithm; 2) JRip algorithm (with Rip standing for Repeated Incremental Pruning, and J for Java, the programming language used to implement Weka) an algorithm already implemented on tuberculosis data for TB-lineage [38] consists in identifying an ordered list of complex rules using quantitative values, beginning with less prevalent class; if the first set of rules is fulfilled, the isolate is assigned to the corresponding lineage, otherwise the next set of rules is examined; a default assignation is proposed at the end of the list if no set of rules has been fulfilled; 3) Naive Bayes algorithm is based on the assumption that attributes are independent. Despite the fact that this assumption is rarely true, the obtained classifier has frequently good performances. The main idea of Naive Bayes is to determine the lineage maximizing the probability of being associated to a given set of attributes (here spacers in spoligotype pattern, individual MIRU-VNTRs). This probability is the product of all the marginal probabilities of each attribute associated to the lineage (this product makes sense only under the assumption of attributes independence); PART uses rules in the same way JRip does, but these rules are built using a decision tree as does J48; Random Forest consists in assigning each isolates using a multitude of decision trees and provides the assignation that is the mode among all of these trees; the decision trees are built partly randomly. All these methods can undergo meta-bagging procedures to reduce the impact of overfit to the data. Overfitting occurs when lots of data are used to set up the classifiers so that some irrelevant features are included in the decision trees or rules. Meta-bagging consists in randomly selecting features and data used to induce the classifiers, a step that actually reduces the chances for learning irrelevant features.
All algorithms can be combined and contribute to a Vote step during which the most frequent assignation is selected.
Evaluation was performed for each algorithm using stratified cross-validation. Stratified cross-validation consists in partitioning the training dataset in a specified number of folds k, learn on k-1 folds and test the algorithm of the remaining fold. The number of folds was varied between 3 and n-1 (leave-one-out procedure, n corresponding to the total number of items). When the number of folds is low, the computed accuracy is more likely to match that obtained on an independent dataset as the testing set is quite large, however, the rules or trees inferred may be less precise as they have been set up with smaller training data.
The "Vote-10" algorithm (using all 5 original algorithms as well as all their 5 meta-bagging derivates) was applied on the complete training dataset to build up the Lineage prediction classifier available on TBminer website.

Datasets for independent evaluation of Lineage Prediction tool
The first independent dataset was that of MIRU-VNTRPlus database uncovering 186 isolates from both human and animal tuberculosis.

Characteristics of the RIVM dataset and contribution of spoligotyping to molecular epidemiology
We characterized by spoligotyping 3,454 DNA samples retrieved from as many clinical isolates of the RIVM collection (2004)(2005)(2006)(2007)(2008). Genotypes obtained by 24-MIRU-VNTR and IS6110 RFLP profiles were already available [69]. Ninety-nine percent of the patterns were successfully obtained (99.4%; n = 3432). According to TB incidence estimates in the Netherlands for this period (n = 1330 per year as estimated in 2004), this represents around 65% of all TB cases [75]. Assignation of genotypic profiles to described lineages was performed independently for each isolate and for each taxonomy using available on-line tools: 1) SITVITWEB database using spoligotype patterns and upgraded by the expert eye for new profiles, 2) MIRU-VNTR-Plus assignation tool implementing similarities on 24 MIRU-VNTR patterns, 3) TB-Lineage using both spoligotype patterns and 24 MIRU-VNTR information, 4) Borile et al.-derived inhouse algorithm using spoligotype profiles only [36]. The resulting file is available as S1 Table. Alternative taxonomies were largely concordant as detailed below. Most prevalent lineage according to TB-Lineage taxonomy was the so-called "lineage 4" (also referred to as Euro-American lineage, 68%, n = 2,345). Lineages 1, 2 and 3 (corresponding to EAI, Beijing and CAS SITVITWEB denominations) corresponded each to slightly less than 10% of the isolates (n = 324, 228 and 291 respectively), lineages 5 and 6 (designated as Afri2 and Afri1 in SITVIT-WEB) and M. bovis lineage representing around 1% of the isolates each (n = 30, 18 and 47 respectively, Fig 1). Among the 171 isolates (5.0%) that could not be classified using TB- lineage, most of them were classified by MIRU-VNTRPlus as being part of Euro-American lineage (Haarlem, n = 50) or EAI (n = 23). Altogether, this RIVM dataset is typical of that of a Western country (predominance of Euro-American lineage) with worldwide immigration providing larger diversity. Euro-American isolates infra-lineage diversity (67.9% of the total dataset) can be described using the MIRU-VNTRPlus classification. Most were labeled Haarlem (n = 777; 22.5% of total dataset), second, LAM (n = 548; 15.9%) and third, Cameroon (n = 260; 7.5%).
We assessed the clustering level in this database and set up an online tool to automate it, available at info-demo.lirmm.fr/TBminer/ (Clusters Analysis option). We used 24-VNTR profiles to define clusters as it stands as the current Gold standard for molecular epidemiology since 2006 [49,69,76]. We detected 109 to 295 clustered isolates each year (S2 Table). When considering the global 5-years set, the number of clustered isolates reached 1,362 which was higher than the sum of clustered isolates for each separate year. This indicates that unique isolates collecting during different years carry the same genotype. Such phenomenon may illustrate the likely underestimation of clustered cases if performing smaller studies over a shorter period as previously shown theoretically [77]. Alternatively, it may be caused by a lack of discriminatory power in the genotyping techniques, as recently shown by Whole Genome Sequencing [78]. Recent Transmission Index as computed using the "n-1" method [79] varied from 10.8% to 26.6% when considering each year separately but reached 29% when considering all 5-years sampling period (S2 Table). Mean cluster size was 3.8 isolates but one cluster belonging to LAM lineage was made of 64 isolates and nineteen clusters exhibited more than 10 isolates. We explored if these clusters could be split when considering spoligotyping data ( Table 2). All spoligotype patterns of isolates belonging to the same 24-VNTR cluster were similar. The isolates carrying different spoligotype signatures often were collected on different years. Inside these 24-VNTR clusters, spoligotype patterns harboring more deletions were rather posterior to patterns with fewer deletions (n = 21), albeit with several exceptions (n = 7; Table 2). This confirms that both rare mutations in spoligotype pattern without changes in the 24-VNTR set may occur, and that convergence in 24-VNTR data for related isolates is not so rare. On the whole dataset, adding spoligotyping to 24-VNTR typing refined clustering by 10.1% (S1 Fig), likely identifying most convergence events among 24-VNTR clusters that could be detected using Whole Genome data [59].
All major lineages except Beijing were represented among these large clusters and no significant difference with the global representation of lineages could be detected (Fisher exact test n1 = 9, n2 = 2, p = 0.275).
Set-up of a correspondence table between the different M. tuberculosis taxonomies and proposition of a consensual "expert" taxonomy We took advantage of the large resulting database to examine the correspondence between the existing MTC taxonomies. Preliminarily, we removed all isolates for which no assignation existed for at least one of the taxonomy. Dataset was subsequently enriched in Afri1 and Afri2 isolates by including 52 additional isolates from a published study [80]. The resulting dataset counted 2,904 genotype profiles with classifications by TB-Lineage, Miru-VntrPlus, SITVIT-WEB and Borile (S3 Table). When browsing this material, we identified good concordance for 22 sublineages between the TB-lineage, MIRU-VNTRPlus, Borile and SITVITWEB classification tools. In addition, we could find a good correspondence with the SNP-based classification set up by Coll et al. [63] (Table 1). When assignations were discordant, we kept MIRU-VNTR-Plus assignation only if the spoligotype signature was not contradicting it. Consequently, T5_RUS1 isolates for instance, that carry a larger deletion than standard LAM isolates in their spoligotype pattern, were classified as LAM as suggested by MIRU-VNTRPlus, a classification independently confirmed by Mokrousov [81]. The same rule led us to label "T2_Uganda" a part of the isolates labeled as Uganda by MIRU-VNTRPlus: those that carry the spacer 40-deletion in their spoligotype pattern. We named these consensual sublineages by merging short versions of the synonym assignations. For instance, isolates labeled New-1 according to MIR-U-VNTRPlus were found mostly H4-Ural2 according to SITVITWEB, and were thus named L4_Ural2(New1) in the proposed "expert" taxonomy (Table 1). LAM3_S SITVITWEB sublineage could not find any clear correspondence in the other taxonomies; we therefore propose to temporarily abandon this sublineage and simply label the corresponding isolates "L4" as they clearly belong to the Euro-American lineage also known as lineage 4. We similarly propose to temporarily abandon MIRU-VNTRPlus Ghana sublineage. Our interpretation of such absence of correspondence for some sublineages with other taxonomies is that they represent an insufficient number of isolates to be relevant in a worldwide classification. Similarly, these sublineages are not described in the SNP-inspired taxonomy of Coll et al. [62,63]. We measured the concordance of every existing taxonomies with the newly proposed "expert" one on all isolates of our database. This consensual "expert" taxonomy reached almost perfect concordance with that of TB-lineage as retrieved using 24VNTR and spoligotype data (99.7%; discordant points n = 9; Fig 2). Although priority in naming was given to MIR-U-VNTRPlus taxonomy during consensus building (see Methods), the "expert" taxonomy reached a very good concordance with the SITVITWEB taxonomy refined using expert knowledge (85% of concordant precise assignations; n = 2,464; Fig 2).
Rapid machine learning algorithms to achieve fine and specific classification using spoligotyping and/or MIRU-VNTR data Applying different classification tools on the same dataset helps understanding the real diversity in a sample, but this procedure is very time-consuming. We took advantage of our large amount of data and on machine learning to set-up a fast webtool combining all existing taxonomies including the newly proposed consensual "expert" one. We aimed at providing a classifier using spoligotype pattern, 24-VNTR profile and/or 15-MIRU profile. The 15-MIRU panel is a subset of the 24-VNTR loci with high discriminatory power. It has been validated alone or rather in combination with spoligotyping as an alternative to the 24-VNTR typing scheme for epidemiological studies [49].
We used spoligotype data to predict spoligotype-inspired taxonomies (SITVITWEB, Borile et al., and TB-lineage) and we used VNTR data to predict the 24-VNTR-inspired taxonomy of MIRU-VNTRPlus. For expert taxonomy, we chose to predict it using VNTR data as it was the first criterion inspected to set it up. Second, we reasoned that, when handling new isolates, we should check if the assignations provided by the learnt classifier obtained with Weka using the standard taxonomies are reproducing concordant associations as listed in the correspondence table set up in this study (Table 1). If they are concordant, this could comfort the expert assignation provided independently by the learnt classifier obtained with Weka. This examination of concordance between all original assignations will be thereafter referred to as "consensus classification". For each taxonomy, we trained five basic classifying algorithms (J48, JRip, Naïve Bayes, PART and Random Forest, see Methods), one algorithm called Vote-5 using the mode of these 5 classifications, and one algorithm called Vote-10 using the mode of these 5 classifications and their 5 meta-bagging counterparts (see Methods). The training dataset included all the 2,904 isolates described above. Accuracies of the different algorithms were stable when assessed with different levels of stratified cross-validation (S4 Table) and were impacted both by the number of lineages in the taxonomy to be inferred and by the input genotypic data (Table 3): TB-lineage with only 7 lineages was the most easy to predict, with 99.8% accuracy using the Vote-10 classifier. Spoligotype-related predictions reached higher accuracies potentially due to the simplest nature of this locus (complexity of 2 43 as compared to more than 5 24 i.e. a difference in complexity of more than 1,000). 15-MIRU data allowed for higher performances than 24-VNTR data when using the finest algorithms such as Vote-10, showing that phylogenetic information can be identified even among the most variable MIRU-VNTR loci. On all types of genotypic data, the finest algorithm Vote-10 outperformed all basic classifiers and was therefore chosen for on-line implementation.

Lineage Prediction available on-line on TBminer website
A Lineage prediction tool on the website TBminer was set up to make the Vote-10 classifier available to all users. Underlying rules and trees were derived using the complete training dataset. The web interface allows to upload any genotypic data including CRISPR spacer data (1 to 43 standard spacer set), and 24 standard MIRU-VNTR numbers of repeats. When including at least 43-spacers spoligotype data or 15-MIRU, TBminer Lineage Prediction webtool provides an output file with 7 new columns, one column for each of the 7 predictions mentioned in Table 3 (Pred1 to Pred5bis), as well as 2 consensus columns clarifying whether the predictions using spoligotype data and MIRU-VNTR data are concordant (Pred6 and Pred6bis; Fig 3).

Precise performances of Lineage Prediction tool on the training set
For TB-lineage taxonomy, sensitivity and specificity of Pred1 reached 100% for all sublineages (data not shown). For MIRU-VNTRPlus taxonomy (Pred2 and 2bis), median sensitivity of sublineage prediction was 98.4% and 98.1% when performed on 24VNTR and 15MIRU respectively. Both predictions had very high sensitivity for all sublineages, the minimum corresponding in both cases to Ghana sublineage (90% and 85% respectively). The median specificities were 99.6% and 98.7% respectively with minima for UgandaII sublineage (93% and 90% respectively, data not shown).
For SITVITWEB-expert taxonomy that includes 50 sublineages, Pred4 predicted most sublineages with a sensitivity of 100% (n = 44) and the minimum sensitivity was 94.1% for T4_CEU1. Most sublineages also were predicted with a specificity of 100% (n = 42) and the minimum specificity was 83% for H3-T3 sublineage.

Validation of the automatic assignations on independent datasets
We tested the performance of our webtool to predict the assignations (as provided by all existing taxonomic tools and the newly proposed "expert one") of the isolates included in the reference database of MIRU-VNTRPlus. This database includes not only human but also animal isolates. We first classified all isolates using the standard tools (SITVITWEB adjusted using the expert eye, MIRU-VNTRPlus, TB-lineage) and inferred their expert assignation as described above (Table 1). We then ran our webtool to retrieve all assignations predicted by the learnt classifier obtained with Weka.
We observed that the prediction of MIRU-VNTRPlus assignations (Pred2 tool) had an accuracy of 100% when assigning human isolates from lineages 1 to 6 ( Fig 4A). It failed in predicting animal sublineages other than M. bovis as expected due to the absence of such isolates in the training dataset. Most of them (n = 14 out of 21; 67%) were assigned to the closely related lineage Lineage6_Afri1(WestAfrican2). Altogether, diversity picture of the whole sample as provided by Pred2 tool in Lineage Prediction module of TBminer was very similar to that of MIRU-VNTRPlus tool, overestimating only the prevalence of West African 2 lineage and being unable to identify peculiar animal isolates as well as M. canettii (Fig 4A).
The performance of the consensus tool (consensus between spoligo and MIRU-VNTR data as appearing in column 6_Consensus_Pred1_2_3_4) was characterized by measuring its ability to reproduce the "expert" assignation. The consensus tool proved able to identify more precise labels than the expert taxonomy for most Lineage_1 isolates and a significant proportion from Lineage_0 (Bovis-BCG) and Lineage_4 (such as H1 as compared to H3) (Fig 4B). It provided a less precise assignation for less than 20% of the Lineage 4 (possibly due to a lower rate of false positive) and was unable to recognize both animal isolates other than M. bovis and M. canettii isolates. Altogether, this tool provides precise assignations for typical human isolates and is able to detect when no known/implemented lineage corresponds to the uploaded genotype.
We performed the same approach on a representative dataset of Pakistanis clinical isolates for which complete genotypes (at most one VNTR lacking) were available (n = 225). This dataset is interesting because it uncovers human isolates of an origin very different from that of the dataset used to set up TBminer Lineage Prediction webtool. Our rationale was that, if good performances could be observed with this dataset, this would mean that TBminer Lineage Prediction tool is robust to the geographical origin of the isolates. For SITVITWEB taxonomy, Pred4 reached an accuracy of 99.5% (1 error out of 190 isolates, mispredicting a T2 assignation for a LAM3 genotype). For isolates with no prototypic assignation, examination of the profiles found the predicted assignation to be plausible for 83% of the isolates (n = 29 out of 35), so that altogether, 28 isolates (12.4%) reached a better assignation using TBminer than using SITVIT-WEB. For MIRU-VNTRPlus taxonomy, Lineage prediction tool Pred2 provided a concordance of 94.1% with the reference assignation using 24-VNTR genotypes, and reached 99.4% concordance for the most prevalent lineage (Delhi/CAS). Concordance was low for isolates exhibiting spoligotype patterns apparently discordant with their MIRU-VNTR profiles: for instance, isolates classified as Cameroon by MIRU-VNTRPlus lacked the deletion of spacers 23 to 25 in their spoligotype pattern, the deletion that originally characterizes this sublineage [82,83]. This suggests that no appropriate label exists in the current system for the corresponding isolates. In this case, the consensus proposed by TBminer, "Lineage4", may be more accurate than the tentative discordant assignations proposed by the existing taxonomies. The Consensus tool (6_Consensus) exhibited a perfect concordance with the expert assignation for 92% of the isolates, 6% of which being proposed a finer assignation by the automatized tool only (Fig 5). Potential errors of this consensus tool were very rare and limited to sublineage inside Lineage 4 (1.3%). For 6.7% of the isolates, no output could be provided by this tool: no consensus was identified between MIRU-VNTR-based and spoligotype pattern-based classifications. These isolates need further studies, either using SNP or using expert knowledge of the genetic diversity in the region. For instance, we identified isolates carrying a Beijing spoligotype but a VNTR profile characteristic of CAS isolates. As recent studies have identified Pseudo-Beijing isolates in the Middle East region, these specific isolates might well represent new cases of this sublineage identified as a bona-fide Lineage3-CAS [84]. Interestingly, the consensus based on 15-MIRU and spoligotype (6bis) performed almost as well as that using 24-VNTR (91.6% instead of 92% concordance), with the only discordance reported for 24-VNTR being moved to the unassigned group (No consensus) (data not shown).
We proceeded to a third validation test using the SNP-classification provided by Abadia et al. [56]. In that study, several SIT were shown to be associated to phylogenetic SNPs classifying them differently than SITVITWEB. For instance, SIT316 previously labeled as Haarlem 1 because of lacking spacers 26-31 (but also lacking spacer 25 and spacer 40) was found to be related to the T2 sublineage (absence of the mgtC SNP characteristic for Haarlem sublineage and presence of the recR specific to T2). Similarly, SIT254, originally classified as T5-RUS1 was renamed LAM because of carrying the ligB mutation. This SIT harbors the very specific signature of LAM (21-24 spacers deletion) but it was not recognized as such until 2014 because of other missing spacers [56,81]. Here, we compared the SITVITWEB assignations, the expert assignation taking into account new knowledge such as the belonging of former T5_RUS1 group to LAM sublineage [81], the SNP naming, and the Consensus assignation provided by Lineage Prediction (6 and 6bis) for isolates having a discordant naming according to SITVIT-WEB and SNP taxonomies as found in [56]. We also included 6 isolates harbouring SIT742 as they are likely highly related to SIT316, and therefore likely belonging to the same SNP lineage. Out of 21 isolates, 12 had a match between the SNP naming and the Lineage Prediction (6) naming (57%, see Table 4). Two isolates only (10%), annotated as Haarlem 3 under SITVIT-WEB taxonomy and classified as H by our algorithm were classified as X by SNP classification, the other being linked to an imprecise but not spurious assignation (Table 4).

Discussion
We characterized M. tuberculosis complex diversity in the Netherlands using the 3 most widely-used genotyping techniques, 24-MIRU-VNTR and spoligotyping and IS6110-RFLP on a set of clinical isolates collected over 5 years (2004-2008). To analyze this diversity we set up an automatic Cluster analysis tool available on a new website dedicated to molecular epidemiology and classification for human tuberculosis, called TBminer. Combined with a few additional genotypes from Lineages 5 and 6 (M. africanum lineages), the resulting database allowed us to identify correspondences between the existing taxonomies based on spoligotype or MIR-U-VNTR data, leading to the setting up of a consensual "expert" taxonomy. We then set up a tool based on machine learning for identifying the alternative assignations of new isolates according to these different taxonomies. This tool is available as "Lineage Prediction" on the TBminer website.

M. tuberculosis complex diversity in the Netherlands
Our study confirmed the large prevalence of the Euro-American lineage in the Netherlands as observed in Western countries and Africa, and especially so called "T" isolates [25]. Beijing prevalence proved stable as compared to the 1993-2000 period, accounting for 6.6% of the TB cases as compared to between 5 and 8% depending on the year between 1995 and 2000 [85]. Although tuberculosis prevalence steadily decreases among inlanders and increases among immigrants i.e. M. tuberculosis strains are more and more collected from non-Western European citizens [86], this features had no significant consequence on lineage prevalence by now. Recent transmission rate differed if evaluated on a one-year period or over a five year period. This reminds that the concept of recent transmission should be used cautiously because current inland transmission may reveal only few years after having occurred [77]. Alternatively, this observation might be due to active transmission in the original country of immigrants.
Consensus taxonomy for M. tuberculosis complex Subclassification of M. tuberculosis complex has developed thanks to the use of high-mutation rate markers: IS6110-RFLP, CRISPR, MIRU-VNTR. The monophyly of groups defined using spoligotype patterns (CRISPR diversity) or multiple MIRU-VNTR genotyping was questioned due to the possibility of convergence in these loci. Some sublineages such as LAM were in fact wrongly delimited as evidenced by discordant genotypes using the alternative genotyping method: LAM10_CAM isolates grouped with other LAM according to spoligotyping carry MIRU-VNTR patterns largely discordant with other LAMs, and T5_RUS1 isolates clearly distinguished from LAM according to spoligotyping carried in contrasts patterns that indicated a clear relatedness with LAM [81]. The redefinitions of groups were confirmed by the use of SNPs [62,63]. SNP subclassification tends thus to become the new standard. As its cost when using complete SNP panels is still high, a combination of a global panel and a restricted panel using geographically-specific SNPs could become future practice for correct and precise taxonomical assignation of any local TB clinical isolate. Confronting VNTR markers with assignation based on spoligotype patterns can still provide helpful information for classifying new isolates. The large database set up in this study allowed us to build a new classification tool efficient when used on both 15-MIRU-VNTR and spoligotype patterns. This on-line induction algorithm proved fairly robust. We believe that until Whole Genome Sequencing decreases to below 40 euros per isolate, the continuous use of 15-MIRU-VNTR typing in combination with spoligotyping is relevant to both infer main epidemiological events and get a clear picture of the circulating diversity. As our tool greatly speeds up the bioinformatical analysis of produced genotyping patterns, it may be of great use to epidemiologists.
A limitation to our tool is that it was built exclusively on widespread human M. tuberculosis lineages, excluding animal sublineages such as M. caprae, M. microtii, M. pinnipedii, as well as M. canettii and the recently described Lineage7 isolates [16]. M. caprae and M. microtii genotypes from MIRU-VNTRPlus database were confidently classified as Lineage 0 or Lineage 6 by our tool. Most M. canettii genotypes (21-VNTR data) as available in Blouin et al. [18] found no concordant assignation by Pred2, Pred2bis, Pred5 and Pred5bis in our Lineage prediction tool (n = 60; 75%), 15% were erroneously assigned to L4 Lineage, 4% to EAI_Lineage 1 and 1% to Animal lineage_Lineage 0 (data not shown). Altogether, we can infer that our Lineage Prediction tool as available on TBminer website is 100% reliable for large lineage assignation based on 15-MIRU and spoligotyping when implemented on human epidemiological datasets in most regions of the world. The only possible region where our tool may provide little information and likely some spurious information could be the Horn of Africa where the diversity deviates from the data we used to implement our tool.

Search for consensus taxonomies among Bacteria using Machine Learning
Taxonomies rely on the biological data characterizing the classified organisms. When new tools are developed to characterize diversity, new taxonomies are usually set up. It is only when confronting multiple information that a consensus can emerge. Taxonomies built on different genetic data do not match either due to convergence events or due to horizontal gene transfer. Dealing with a single taxonomy without trying to search for a consensus limits inferences on the properties of lineages as some characteristics may have been attributed to a complete subgroup defined by taxonomy A, but a subpart that classification B could have identified may really carry the property of interest. Building a consensus is usually a long process. We think that our approach can speed up definition of consensual taxonomies. We recommend the following steps as developed in this study (Fig 6): 1) set up of a large collection of isolates representative of the largest diversity and type them with all available tools; 2) inform every alternative assignation according to existing taxonomies; 3) identify the concordant assignations, and propose a name made of all synonyms (but of reasonable length) for the consensual groups; 4) implement Machine learning algorithms, for instance in Weka, to learn all taxonomies independently; 5) Making the tool for consensus assignation available on-line. This last step of making the consensus available on-line, is in our opinion, very important as it will clearly help all users to get acquainted with it. By providing not only the consensus assignation but also the assignations according to existing taxonomies, users can build their trust in the consensus by checking if the assignation provided for the known taxonomy is relevant to their expert knowledge.
Given the deluge of genomic data produced using next generation sequencing, our approach could also be used to check SNPs informativity in a very near future. Indeed, SNPs diversity may cover various level of informativity, which remains for the time-being poorly explored at the statistical level, in particular concerning epistatic mutations significance.

Conclusion
We developed an approach making use of standard typing data for human M. tuberculosis isolates to infer a consensual taxonomy concordant with most up-to-date data on Whole Genome Sequencing diversity. We believe that this tool will not only increase the understanding of clinical and epidemiological experts about the tuberculosis worldwide diversity, but it will also help them build refined knowledge on the genetic diversity circulating in their country. We hope that the same approach can benefit other human pathogens having alternative taxonomies according to Serotypes, CRISPRtypes, MLST-types such as Salmonella, Listeria, Brucella, and more broadly to other organisms such as bacterial plantpathogens.