Strain Classification of Mycobacterium tuberculosis Isolates in Brazil Based on Genotypes Obtained by Spoligotyping, Mycobacterial Interspersed Repetitive Unit Typing and the Presence of Large Sequence and Single Nucleotide Polymorphism

Rio de Janeiro is endemic for tuberculosis (TB) and has the second highest prevalence of the disease in Brazil. Here, we present the bacterial population structure of 218 isolates of Mycobacterium tuberculosis, derived from 186 patients who were diagnosed between January 2008 and December 2009. Genotypes were generated by means of spoligotyping, 24 MIRU-VNTR typing and detection of fbpC103, RDRio and RD174. The results confirmed earlier data showing that the predominant genotypes in Rio de Janeiro are those of the Euro-American lineages (99%). However, we observed differences between classification by spoligotyping and by 24 MIRU-VNTR typing, with respectively 43.6% vs. 62.4% of LAM, 34.9% vs. 9.6% of T and 18.3% vs. 21.5% of Haarlem. Among isolates classified as LAM by MIRU typing, 28.0% did not present the characteristic spoligotype profile with absence of spacers 21 to 24 and 33 to 36, and we designated these conveniently as "LAM-like"; 79.3% of these presented the LAM-specific SNP fbpC103. The frequencies of RDRio and RD174 in the LAM strains, as defined both by spoligotyping and 24 MIRU-VNTR loci, were respectively 11% and 15.4%, demonstrating that RD174 is not always a marker for LAM/RDRio strains. We conclude that, although spoligotyping alone is a tool for classification of strains of the Euro-American lineage, when combined with MIRU-VNTRs, SNPs and RD typing it leads to a much better understanding of the bacterial population structure and phylogenetic relationships among strains of M. tuberculosis in regions with high incidence of TB.


Introduction
Tuberculosis (TB) is an infectious disease with an effective treatment but remains an important cause of morbidity and mortality in many countries. In Brazil, 73,000 new TB cases and 4,600 deaths were registered in 2011. The southeast region accounts for 44.8% of all cases reported in the country, and the state of Rio de Janeiro has the highest disease incidence (72.3/100,000) and mortality rate (5.7/100,000) in the country [1].
A few years ago, Mycobacterium tuberculosis (Mtb) strains designated as RDRio were reported as being a predominant genotype in Rio de Janeiro [2]. These strains have a deletion of 26.3 kb and seem to be restricted to the Latin American-Mediterranean (LAM) family. The RDRio strains have been associated with higher levels of recent transmission and of multidrug resistance (MDR), but there are contradictory data about their relation with disease severity [3], [4], [5], [6]. Another Region of Difference, called RD174, has been described as a co-marker of RDRio and as a marker for the LAM type [7], [8], [9], [10], and strains with the RD174 deletion were found to have an increased secondary case rate ratio in San Francisco [10].
The generalized use of spoligotyping and MIRU-VNTR typing resulted in the construction of large international databases of genotypes, allowing the study of the global distribution and phylogeny of M. tuberculosis [23], [24], [25], [26], [27], [28], [29]. Combined MIRU-VNTR typing and spoligotyping also helps to reveal epidemiologically meaningful clonal diversity of M. tuberculosis strain lineages and is useful to explore internal phylogenetic ramifications [8]. In addition, SNPs and LSPs represent robust markers for inferring phylogenies and for strain classification [30], [31], [32], [33], [34], [35]. Besides the presence of RDs, that of certain SNPs is also characteristic of LAM strains [8], [9], [10], such as fbpC103, a G to A substitution in codon 103 of the gene encoding antigen 85 complex C (Ag85C, Rv0129c), resulting in a Glu103Asp amino acid replacement in a protein that is involved in the biosynthesis of cell wall components of M. tuberculosis [7], [36]. Other LAM-specific SNPs are (i) the silent C to G ligB mutation at genome position 3426795 [37], validated by Abadia et al. [17], and (ii) the SNP at codon 240 of mgtC (C to T) that was discovered by Homolka et al. [38].
The objective of this study was to evaluate 24 MIRU-VNTR typing for alternative classification of strain families/lineages by means of phylogenetic tree building and comparison with a MIRU database (MIRU-VNTRplus), as compared with spoligotype-based classification (SITVITWEB and SITVIT2), and to increase the consistency of the analysis by including a SNP and two LSPs. In addition, we assessed the differentiating power of both techniques in a population that is almost exclusively of the Euro-American lineages.

Study setting
The study was based on a convenience sampling of TB patients diagnosed between January 2008 and December 2009 at the "Hospital Municipal Raphael de Paula e Souza", Curicica, Rio de Janeiro, Brazil. Two hundred eighteen clinical isolates were obtained from 188 patients and 28 cases had more than one isolate, including 25 patients with two and three patients with three isolates. Multiple isolates from the same patient were included to evaluate the consistency of the genotyping results and to check for possible multiple infections.

Ethics statement
This study was approved by the Research Ethics Committee of the Municipal Health and Civil Defense of the city of Rio de Janeiro (number 160/09, CAAE: 0182.0.314.000-9). Isolates of this study were obtained through bacteriological culture from clinical specimens of patients diagnosed at the Hospital Raphael de Paula e Souza as part of routine diagnosis and drug susceptibility testing. The ethical committee stated that, as genotyping of M. tuberculosis isolates is complementary to routine diagnosis, there was no need for written or verbal consent. No information other than that provided for the diagnosis was used.

Culture, DNA extraction and identification of Mycobacterium tuberculosis complex
Sputum samples were cultured on Löwenstein-Jensen (LJ) medium following standard microbiological laboratory procedures as a routine of the Clinical Analysis Laboratory of the "Hospital Raphael de Paula e Souza". After cultivation, bacterial mass was re-suspended in 400 µl of sterile distilled water and heat inactivated at 90°C for one hour, followed by DNA extraction and purification using the CTAB method [39].

MIRU-VNTR typing
Amplification of 24 MIRU-VNTR loci was performed using a commercial typing kit (Genoscreen, Lille, France) and automated MIRU-VNTR analysis was performed as previously described [14]. Fragment size of the amplicons was analyzed on an ABI 3730 DNA sequence analyzer (Applied Biosystems, California, USA) and the number of copies of each locus was determined by automated assignment using the Genemapper 4.0 software (Applied Biosystems, California, USA). In the case of doubtful results, the size of the repeats was double checked on agarose gels by size estimation against a DNA ladder (50 and 100 bp) and the positive control (H37Rv), and by comparison with a reference table as described [13].

PCR-RFLP of fbpC103
For characterization of the SNP fbpC103, we adapted the procedure described by Gibson et al. in 2008 [7]. Amplifications were performed in 50 µl reactions containing 40 pmol each of the primers Ag85C103F (5'-CTG GCT GTT GCC CTG ATA CTG CGA GGG CCA-3') and Ag85C103R (5'-CGA GCA GCT TCT GCG GCC ACA ACG TT-3'), 2 mM MgCl2, 0.2 mM dNTPs, 1 U Taq DNA polymerase (Invitrogen, Brazil), buffer (10 mM Tris-HCl, 1.5 mM MgCl2, 50 mM KCl, pH 8.3), 5% DMSO (v/v) and 10 ng of target DNA. Amplification was performed in a Veriti Thermal Cycler (Applied Biosystems, Foster City, CA), starting with denaturation at 95°C for 5 min, followed by 45 cycles of 1 min at 95°C, 1 min at 60°C and 4 min at 72°C, and a final extension for 10 min at 72°C. The amplified products of 519 bp were analyzed on 2% agarose gels after staining with ethidium bromide. Fifteen microliters of the amplified products were subjected to enzymatic digestion with 1 U of MnlI (New England BioLabs Inc., USA) at 37°C for four hours, generating three fragments of 365 bp, 96 bp and 48 bp for the wild type allele and two bands of 461 bp and 48 bp when the SNP G103A is present. Bands were detected in 3% agarose gels and their size was estimated by comparison with a 100 bp DNA ladder (Invitrogen).
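The interpretation of the MnlI digest can be written down as a small lookup. The following is only an illustrative sketch: the function name `call_fbpc103` and the 10 bp tolerance are assumptions introduced here, and real gel readings (partial digestion, mixed alleles) would need more care than this.

```python
# Expected MnlI fragment sizes (bp) as stated in the text, largest first.
WILD_TYPE = (365, 96, 48)
MUTANT_G103A = (461, 48)


def call_fbpc103(bands, tolerance=10):
    """Call the fbpC103 allele from estimated band sizes (bp).

    `tolerance` is a hypothetical allowance for gel sizing error.
    """
    observed = sorted(bands, reverse=True)

    def matches(expected):
        return len(observed) == len(expected) and all(
            abs(o - e) <= tolerance for o, e in zip(observed, expected)
        )

    if matches(WILD_TYPE):
        return "wild type"
    if matches(MUTANT_G103A):
        return "G103A"
    return "undetermined"


print(call_fbpc103([48, 365, 96]))
print(call_fbpc103([460, 50]))
```

Any pattern that fits neither expectation, such as an undigested product, is returned as "undetermined" rather than forced into one of the two alleles.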

RD174 deletion
For detection of RD174, we used the protocol described earlier [7], performing multiplex PCR using two primers that anneal to the RD174 flanking regions and one that anneals to the internal sequence. For amplification, we used 40 pmol of each of the primers RD174F (5'-AGC TGC TCC GGC CGG TCG TCG TCC TTG TC-3'), RD174Fi (5'-TAT GCC GCA GCC CGG GCA TCC GTG ATT A-3') and RD174R (5'-ATC GTG AAC GCA GCG GTT TCG ACG GCA TCT-3') in a 50 µl reaction containing 2 mM MgCl2, 0.2 mM dNTP, 1 U Taq DNA polymerase, 1× buffer, 1 M betaine and 10 ng of bacterial DNA. Amplification was performed starting with a 5 min incubation at 95°C, followed by 45 cycles of 1 min at 95°C, 1 min at 60°C and 4 min at 72°C, and a final extension for 10 min at 72°C. To determine the size of the amplicons, 10 µl of PCR product was applied to a 2% agarose gel and, after electrophoresis, bands of either 300 bp (intact RD174) or 500 bp (deleted) were observed.

RDRio deletion
Detection of RDRio was performed using the multiplex PCR protocol described by Lazzarini et al. (2007) [2]. For amplification, we used 20 pmol of each of the primers BridgeF

Classification and definition of genotype-based lineages and families
Spoligotype and MIRU patterns were defined according to the definitions in the SITVITWEB database [23] (http://www.pasteurguadeloupe.fr:8081/SITVIT_ONLINE/) in the version of November 20, 2012. In addition, spoligotypes were compared with those in SITVIT2 (a proprietary database of the "Institut Pasteur de la Guadeloupe", which is an updated in-house version of the publicly released SpolDB4 and SITVITWEB [23], [24], to be released in 2014). This comparison led to the characterization of families and lineages by spoligotyping and to the assignment of Spoligotype International Types (SIT), MIRU International Types (12-MIT, 15-MIT and 24-MIT) and 5-locus VNTR International Types (VIT). The SIT, MIT and VIT were designated identical patterns when shared by two or more patient isolates, whereas "orphan" patterns were those observed in a single isolate that did not correspond to any of the patterns present in the SITVIT2 database. In order to classify "orphan" patterns, we used SpotClust (http://tbinsight.cs.rpi.edu/run_spotclust.html).
The 24 MIRU-VNTR profiles and spoligotypes were also compared with the MIRU-VNTRplus database [28].

Data analysis
The number and size of genotype clusters were defined initially by introducing numerical values into an Excel table. Different criteria were used for the definition of clusters: identical spoligotypes, identical 15- or 24-locus MIRU-based genotypes, or both combined. Excel tables were imported into BioNumerics software (version 6.1, Applied Maths, Belgium) for construction of similarity matrices and phylogenetic trees, using the Jaccard index (spoligotypes) and the categorical coefficient (MIRU-VNTR) for construction of Neighbor-Joining (NJ) trees and Minimum Spanning Trees (MST).
The MIRU-VNTR allelic diversity (h) at each of the 24 loci was calculated by the equation described by Graur and Li [41], and the diversity index was calculated as the Hunter-Gaston diversity index (HGDI) [42]. Statistical analyses were performed using Epi Info version 3.51 (Centers for Disease Control and Prevention, Atlanta, GA, USA), using the chi-square (χ²) test or Fisher's exact test for the comparison of proportions, with a confidence level of 95%. A p value < 0.05 was considered significant.
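The two similarity measures named above can be illustrated with a few lines of code. This is a minimal sketch and not the BioNumerics implementation: the function names are hypothetical, spoligotypes are assumed to be encoded as 43-character binary strings ('1' = spacer present), and the two example patterns are built only from the spacer absences given in this paper for the T prototype (spacers 33-36) and for H3 (additionally spacer 31).

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard index between two binary spoligotype strings."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    both = sum(x == "1" and y == "1" for x, y in zip(a, b))
    either = sum(x == "1" or y == "1" for x, y in zip(a, b))
    return both / either if either else 1.0


def categorical_distance(p, q) -> float:
    """Fraction of MIRU-VNTR loci whose copy numbers differ."""
    if len(p) != len(q) or not p:
        raise ValueError("profiles must be non-empty and of equal length")
    return sum(x != y for x, y in zip(p, q)) / len(p)


# Hypothetical prototype patterns (spacers 1..43, left to right).
t_prototype = "1" * 32 + "0" * 4 + "1" * 7               # spacers 33-36 absent
h3_prototype = "1" * 30 + "0" + "1" + "0" * 4 + "1" * 7  # spacer 31 also absent

print(jaccard_similarity(t_prototype, h3_prototype))
print(categorical_distance([2, 2, 1, 3, 2, 3, 1], [2, 2, 1, 3, 2, 3, 2]))
```

The Jaccard index ignores spacers that are absent in both patterns, whereas the categorical measure treats every locus equally and only asks whether the copy numbers are identical.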
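As an illustration of the two diversity measures, the sketch below computes the HGDI from a list of genotype labels and a per-locus allelic diversity from a list of copy numbers. It assumes the commonly used unbiased form h = (n/(n-1))(1 - Σx_i²), where x_i is the frequency of allele i among n isolates; whether this is exactly the form applied in the study is an assumption, and the function names and example data are hypothetical.

```python
from collections import Counter


def hgdi(types) -> float:
    """Hunter-Gaston diversity index for a list of genotype labels."""
    counts = Counter(types)
    n = sum(counts.values())
    if n < 2:
        raise ValueError("at least two isolates are required")
    return 1.0 - sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def allelic_diversity(alleles) -> float:
    """Unbiased allelic diversity h at one locus (assumed form, see lead-in)."""
    counts = Counter(alleles)
    n = sum(counts.values())
    if n < 2:
        raise ValueError("at least two isolates are required")
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return (n / (n - 1)) * (1.0 - sum_sq)


# Hypothetical copy numbers observed at one locus in four isolates.
copies = [2, 2, 3, 4]
print(hgdi(copies), allelic_diversity(copies))
```

Under this assumption the two expressions are algebraically identical, so the per-locus h is simply the HGDI applied to the alleles of a single locus.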
To visualize the classification difference observed between Spoligotyping and 24 loci MIRU-VNTRs, we constructed a Confusion Matrix for the frequency of correct and incorrect predictions.
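The quantities derived from such a confusion matrix can be sketched as follows. This example assumes that one classification is taken as the reference and the other as the prediction; the toy counts are invented for illustration and are not the study data, and the function names are hypothetical.

```python
def class_rates(pairs, label):
    """Sensitivity (TPR), specificity (TNR) and precision for one family.

    `pairs` is a list of (reference_family, predicted_family) tuples.
    """
    tp = fn = fp = tn = 0
    for reference, predicted in pairs:
        if reference == label and predicted == label:
            tp += 1
        elif reference == label:
            fn += 1
        elif predicted == label:
            fp += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


def accuracy(pairs) -> float:
    """Fraction of isolates with concordant classification."""
    return sum(ref == pred for ref, pred in pairs) / len(pairs)


# Invented toy data: (reference, prediction) for ten isolates.
toy = ([("LAM", "LAM")] * 3 + [("LAM", "T")] + [("T", "T")] * 2
       + [("H", "T")] + [("H", "H")] * 3)

print(class_rates(toy, "T"))
print(accuracy(toy), 1 - accuracy(toy))  # accuracy and error rate
```

The error rate is the complement of the accuracy, which is how the two values reported for the whole matrix relate to each other.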

Classification by spoligotyping
The identification of spoligotype profiles, as well as their assignment to the family and lineage level, was performed by comparison with profiles deposited in the SITVITWEB database and in SITVIT2, and compared with the classification obtained with the previous version of the spoligotype database (Table S1).
The patterns were classified based on the database tool that allows construction of a Neighbor-Joining based phylogenetic tree, visualizing the proximity of a particular genotype to a set of reference strains down to the genotype family level. One hundred thirty-six isolates (62.4%) were classified as LAM (127 patterns), 47 (21.5%) as Haarlem (45 patterns), 21 (9.6%) as T (21 patterns), 11 (5%) as S (11 patterns), and one isolate (0.45%) each as Beijing and EAI. Considerable differences were observed between spoligotyping and 24 MIRU-VNTR loci-based classifications, even after excluding possible small typing errors by repeating the assays. The differences observed between the two classifications are presented in Table 3 and in the confusion matrix (Figure 1). The precision, accuracy, sensitivity and error rate were respectively 0.74, 0.64, 0.82 and 0.36. The T lineage showed the highest rate of incongruent classification (sensitivity = 0.26).
The "LAM-like" isolates. The thirty-seven isolates that did not present the typical LAM spoligotype signature (absence of spacers 21-24 and 33-36) were positioned within the LAM branch in the neighbor-joining tree using MIRU-VNTRplus and were conveniently designated as "LAM-like". These had been classified by spoligotyping earlier as T (n = 24), H (n = 10) and three "Unknown" patterns (SIT2511/n = 2 and SIT1952/n = 1; classified by SpotClust as EAI and H1, respectively). Ninety-one percent of these strains showed one copy of MIRU24 and two copies of MIRU04. Intriguingly, when both 24 MIRU-VNTRs and spoligotypes were included for tree building, these isolates were localized among those classified as T and H.
The Haarlem family. Forty-seven isolates were classified as belonging to the H family, representing 21.5% of all isolates; 85% of the isolates initially classified as Haarlem by spoligotyping were confirmed by 24 MIRU-VNTR loci (TPR = 0.65, TNR = 0.88) (Table 3 and Figure 1). Among these isolates, 16 (34%) presented a spoligotype profile compatible with the T family, 3 (6.4%) with the LAM family and 2 isolates (2% each) with the X and EAI families, respectively. The MIRU-VNTR alleles characteristic of Haarlem are two copies of MIRU02 (100%) and MIRU39 (89.1%) and three copies of MTUB34 (100%), MIRU16 (91.3%), MIRU27 (97.8%), ETR-A (95.6%) and ETR-C (91.3%). Unlike LAM isolates, which have a single copy of MTUB30, H isolates present either three or four copies of this locus. Combined analysis of 24 MIRU-VNTR loci and spoligotyping showed that isolates of genotype H1 (absence of spacers 26-31) have two copies of MIRU04, ETR-B and MTUB29 and three copies of MIRU16, MIRU31, ETR-A and ETR-C. Isolates of genotype H3 (absence of spacer 31) share two copies of MIRU02, ETR-C, MIRU04 and MTUB29 and three copies of MIRU27, ETR-A, ETR-B and MTUB34.
The S family. Eleven isolates were classified as belonging to the S family and four of these exhibited SIT4, which in SpolDB4 was classified as LAM3-S convergent, having absence of spacers 1-24 and 33-34 (the characteristic pattern of the S family is absence of spacers 9-10 and 33-34). Among the other isolates, we observed SITs 53/n = 1, 102/n = 1, 378/n = 1, 2500/n = 1, 3907/n = 2 and 3909/n = 1; only two of these had the characteristic pattern of the S family (SIT3909/LAM8 and SIT2500/Unknown by SITVIT2 and H1 by SpotClust). Isolates of the S family shared their copy number at six loci: one copy of MIRU24 and MTUB21, two copies each of MIRU20 and MIRU39 and three copies of MIRU10 and MIRU27.
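The spoligotype criterion used to separate LAM from "LAM-like" isolates can be expressed as a simple check. The sketch below assumes a 43-character binary encoding ('1' = spacer present, spacer 1 first) and tests only the absence of spacers 21-24 and 33-36 described in the text; the function name and the example patterns are hypothetical, and a database assignment involves more than this single rule.

```python
# Spacers (1-based) whose absence defines the LAM signature in the text.
LAM_ABSENT_SPACERS = tuple(range(21, 25)) + tuple(range(33, 37))


def has_lam_signature(spoligotype: str) -> bool:
    """True when spacers 21-24 and 33-36 are all absent."""
    if len(spoligotype) != 43 or set(spoligotype) - {"0", "1"}:
        raise ValueError("expected a 43-character string of 0s and 1s")
    return all(spoligotype[s - 1] == "0" for s in LAM_ABSENT_SPACERS)


# Hypothetical patterns built from the spacer absences named in the text.
lam_prototype = "1" * 20 + "0" * 4 + "1" * 8 + "0" * 4 + "1" * 7
t_prototype = "1" * 32 + "0" * 4 + "1" * 7  # only spacers 33-36 absent

print(has_lam_signature(lam_prototype))
print(has_lam_signature(t_prototype))
```

An isolate that fails this check but groups with LAM reference strains by 24 MIRU-VNTR loci corresponds to what the text calls "LAM-like".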

RDRio, RD174 and fbpC103 analysis
The genotypes defined by the presence of RDRio, RD174 and the SNP fbpC103 were added to the classification based on 24 MIRU-VNTR typing (Table S2). Thirty-six isolates (16.5%) were excluded from the final analysis, either because they showed genotypes suggestive of multiple infection or because of failure in at least one of the three genotyping procedures. The results of 182 strains are summarized in Table 3 and for simplification (Table 4 and Table 5), but not all were RD174; two isolates classified as LAM2 SIT3908 did not have RD174 (isolated from different patients). The frequency of RD174 (n = 25) was therefore higher than that of RDRio (32.5% vs 20.7%; p = 0.053 and χ² = 2.69) in LAM. Eleven LAM isolates (14.3%) presented RD174 but not RDRio; all were positive for fbpC103.
The allelic diversity of the 24 MIRU-VNTR loci in LAM/RDRio isolates (Figure S2) showed that the combined copy number of the MIRU04, MIRU20, MIRU24, MIRU31, ETR-A, MTUB21 and MTUB30 loci was a signature of this genotype (2213231) (Table 4). In general, these loci present low variability in LAM, except for MTUB21, which is moderately variable in such isolates (Table 2). Upon comparing 24 MIRU-VNTR signatures of LAM, LAM-like and non-LAM strains (Table 4), we observed two isolates that presented the hypothetical ancestral MIRU-VNTR signature (224226153321) for RDRio that was suggested by Lazzarini et al. (2007) [2]. One isolate (C2009) was classified as LAM1 SIT2539 and the other (C1966) as H3 SIT50 by spoligotyping and "LAM-like" by 24 MIRU-VNTR; both presented the SNP fbpC103 and were deleted for RDRio and RD174. Among the LAM/RDRio strains, the frequency of LAM subtypes as defined by spoligotyping using SITVIT2 was 31.5% of LAM2, 25% of LAM9, 12.5% each of LAM6 and LAM5 and 6.3% each of LAM4 and LAM1. Figure 2 is a graphical representation of these three markers in the LAM strains as defined by 24 MIRU-VNTR loci.
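The comparison of the RD174 and RDRio frequencies in LAM can be checked with a short calculation. The sketch below is a plain Pearson chi-square test on a 2x2 table without continuity correction; it is not the Epi Info procedure used in the study. The counts of 25/77 and 16/77 are back-calculated from the reported percentages (32.5% and 20.7%) and are therefore an assumption, as is the function name.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]].

    Returns the statistic and the two-sided p value.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For one degree of freedom the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Assumed counts: RD174 deleted in 25 of 77 LAM isolates, RDRio in 16 of 77.
chi2, p = chi_square_2x2(25, 77 - 25, 16, 77 - 16)
print(round(chi2, 2), p)
```

With these inferred counts the statistic is about 2.69, in line with the value reported above. The p value returned here is two-sided and may differ from the reported one if a different variant of the test was applied.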
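The seven-locus signature can be derived from a full profile by concatenating the copy numbers in the locus order given above. This is a minimal sketch: the dictionary layout and the function name are assumptions, and the example profile contains only the seven loci with the copy numbers stated in the text.

```python
# Locus order used for the signature, as listed in the text.
SIGNATURE_LOCI = ("MIRU04", "MIRU20", "MIRU24", "MIRU31",
                  "ETR-A", "MTUB21", "MTUB30")


def miru_signature(profile: dict) -> str:
    """Concatenate copy numbers of the signature loci into one string.

    Assumes single-digit copy numbers at these loci.
    """
    return "".join(str(profile[locus]) for locus in SIGNATURE_LOCI)


# Hypothetical profile restricted to the signature loci.
example = {"MIRU04": 2, "MIRU20": 2, "MIRU24": 1, "MIRU31": 3,
           "ETR-A": 2, "MTUB21": 3, "MTUB30": 1}

print(miru_signature(example))  # 2213231
```

A profile that differs at a single locus, for example with two copies of MTUB30, yields a different string and would not match the signature.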

Discussion
The city of Rio de Janeiro is the capital of the state of Rio de Janeiro, located in the southeast of Brazil. The state has a population of 16 million inhabitants, six and a half million of whom live in the capital, which is the second largest city and a major tourist attraction in Brazil (Census 2010, http://cidades.ibge.gov.br/xtras/perfil.php?lang=&codmun=330455, accessed 12/18/2013). According to the Ministry of Health, 11,639 new TB cases were recorded in the state in 2011, with an incidence of 72.3/100,000 inhabitants and the highest mortality rate (5.1/100,000 inhabitants) at the national level [1]. Rio de Janeiro city has strong economic and social contrasts, with a large portion of the population living in numerous suburbs consisting of "communities", urban areas where housing conditions, health, education and security are extremely precarious. These factors are directly related to the number of TB cases detected.
In the present study, we performed spoligotyping and 24 MIRU-VNTR typing and characterized the SNP fbpC103 and the status of RD174 and RDRio, to decipher the population structure and the phylogenetic relationships of the MTBC strains of TB cases in this particular population that attended a single reference center in the city of Rio de Janeiro. Mycobacterium tuberculosis is classified into six phylogenetic lineages, each of which can be divided into sublineages [32]. Members of the Euro-American lineages, represented by the genetically closely related LAM, H, T, S and X spoligotyping families, are the most common in South America [19], [20], [21]; this was recently confirmed also in Brazil [22] and in different states of the country [8], [43], [44], [45], [46], [47], [48]. However, differences in the frequency of these families are observed, which could be due to differences in population and immigration history in each region, to period-associated genotype frequencies or to differences in sample size. We here confirm the predominance of the Euro-American lineage (also known as lineage 4), with only two isolates classified as Indo-Oceanic lineage (EAI5) and East Asian lineage (Beijing), the first being more prevalent in east Africa, southeast Asia and south India and the second in east Asia, Russia and South Africa [32]. Recently, the frequency of such strains in Brazil was described as being less than 1%, except for a higher frequency in the state of Pará, north Brazil [22].
The most prevalent families as defined by spoligotyping were LAM (43.6%), T (34.9%) and H (18.3%); however, classification based on 24 MIRU-VNTR loci and Neighbor-Joining based phylogenetic tree building using the database tool (www.miruvntrplus.org) showed considerable differences from that obtained by spoligotyping, being respectively 62.4%, 9.6% and 21.6% (error rate of 0.36). Among the isolates with discordant results, most were isolates initially classified as T (n = 31), H (n = 11) and EAI (n = 3) by spoligotyping that were classified as LAM by 24 MIRU-VNTR typing; these isolates, which did not show the LAM prototype (absence of spacers 21-24 and 33-36), were conveniently named LAM-like. Subsequently, 78.3% of these isolates were confirmed to be LAM by the presence of the LAM-specific SNP fbpC103 [7], [38], and when considering the presence of this SNP as an absolute marker for the LAM family, the frequency of this lineage is 58.9%. We also have preliminary data showing that the supposed LAM-specific marker ligB404 was observed in isolates that had been defined by spoligotyping as being T or H (data not shown). Very recently, Mokrousov et al. reported differences in the classification of LAM strains as defined by spoligotyping and SNP analysis (fbpC103 and ligB404) [49], although both SNPs had previously been validated in different collections of clinical isolates and reference strains as being specific for the LAM family [7], [38]. It is now common knowledge that spoligotyping has limitations as a tool for predicting the exact phylogenetic relationships between strains of the MTBC, particularly in modern strains (Euro-American, East Asian and Indian and East African) [16], [18], [24], [48], [49], [50], [51], [52], mainly due to homoplasy [16], [24]. Phylogenetic grouping by MIRU-VNTR is more accurate than that by spoligotyping but depends on the number of loci included in the analysis [49], and classification errors are reduced when analyzing 24 loci [16].
Indeed, several authors have suggested that SNPs are more suitable than spoligotyping and MIRU-VNTR typing for phylogenetic classification [16], [38], [53], [54], [55], [56]. The fbpC103 SNP is described as a good marker for the LAM family [7], [38], [49], [57], [58], [59]. Our original intention was to evaluate whether RDRio strains were still prevalent in Rio de Janeiro, as seen previously by Lazzarini in clinical isolates collected between 2002 and 2003 [2]. For this purpose, we believed that a single marker able to differentiate LAM from non-LAM, combined with 24 MIRU-VNTR typing and spoligotyping, would be sufficient; however, the scenario proved to be more complex.
We here present the first data comparing classification by spoligotyping, MIRU-VNTRs and fbpC103 in the Euro-American lineage prevalent in Rio de Janeiro and Brazil. Among these different types of markers, we observed four major groups: (i) strains classified as LAM by spoligotyping and 24 MIRU-VNTR loci without the LAM-characteristic SNP fbpC103, (ii) strains not classified as LAM by spoligotyping and 24 MIRU-VNTR loci but carrying the fbpC103 SNP, (iii) strains classified by spoligotyping as non-LAM but as LAM by 24 MIRU-VNTR loci (LAM-like) and with the SNP fbpC103 and (iv) strains classified by spoligotyping as non-LAM and as LAM by 24 MIRU-VNTR loci (LAM-like) but not presenting the SNP fbpC103. These different scenarios could be explained by convergent evolution of spoligotypes and of MIRU-VNTR loci (even when including 24 loci), because a limited number of loci were evaluated that might evolve rapidly and are therefore susceptible to pronounced convergence [53], and/or by the existence of an ancestral progenitor of the Euro-American lineages containing the SNP fbpC103. In addition, a possibly important limitation of the current classification based on MIRU-VNTRs and their similarity with genotypes present in MIRU-VNTRplus is that, despite including well characterized strains, the database contains a limited number of strains that does not reflect the real genetic diversity of isolates belonging to the MTBC; another limitation of this study is that no additional specific SNPs for H, T, S and X were investigated. Whole Genome Sequencing (WGS) is superior to conventional genotyping for MTBC [60] and has been used in areas not yet studied, ranging from the global level (phylogeography) to the local level (transmission chains and diversity of circulating strains), the single patient (clonal diversity) and the bacterium itself (evolutionary studies) [61].
We intend to compare these isolates by WGS (in progress) to better understand the evolution of the Euro-American lineages and to develop new methodologies that allow more rapid and accurate typing.
Analyzing the data obtained in this study, the spoligotypes of the T family showed the largest number of divergent results when compared to classification by 24 MIRU-VNTR loci (TPR = 0.26, TNR = 0.99). The prototype of the T family is characterized by the absence of spacers 33-36 only [48] and has been observed in almost every country, representing 20% of all isolates deposited in the SITVITWEB database. Despite its high frequency, this genotype family is still considered "ill-defined" and includes non-monophyletic groups [54], [55]. In South America, the frequency of this family is 26.7% and in Brazil 18.6% (370/1991), with the T1 SIT53 subfamily being the most prevalent SIT [22], as observed also in this study. Here, among the 76 isolates classified as T by spoligotyping, only 21 (27.6%) were confirmed by 24 MIRU-VNTR typing; the rest were reclassified as LAM (n = 31), S (n = 9) and H (n = 16). Interestingly, SIT53 was associated with mixed infection in a study conducted in South Africa, a country characterized by a high prevalence of TB [57], and Lazzarini et al. [62], using a computational approach, showed that this is the most frequent false spoligotype derived from mixed infections. However, among isolates with this SIT, we did not observe double signals indicative of mixed infections during 24 MIRU typing. We also observed that spoligotypes indicative of the T family were sometimes grouped with spoligotypes of the H family by MIRU-VNTR typing. This could be related to the fact that the prototype spoligotypes defining T1 SIT53 and H3 SIT50 differ only in the absence of spacer 31. We also observed that in our population the absence of spacer 31 is not crucial for classification as either H or T; what differentiates the two is the number of copies of ETR-A, ETR-B and ETR-C.
The H3 subfamily is characterized by the combination 323 of these alleles, while T strains present considerable diversity at these loci (one to three copies of ETR-A, two or three copies of ETR-B and three to five copies of ETR-C).
The RD designated RDRio was first reported as a new M. tuberculosis lineage in Rio de Janeiro in 2007 by Lazzarini et al. [2] and is a deletion of 26.3 kb restricted to the LAM family, in particular to subfamilies LAM1, LAM2, LAM4, LAM5, LAM6 and LAM9. This deletion affects 10 genes, including two genes encoding proline-proline-glutamic acid (PPE) proteins [2], and the association between RDRio strains and high prevalence may be related to virulence and/or to specific adaptation to Latin American and European populations; however, studies based on epidemiological and clinical characteristics have proven to be inconclusive or contradictory [4], [63], [64]. The RDRio lineage has been identified in different countries [5], [6], [7], [49] and, in Brazil, has been described in different states, including Rio de Janeiro, Rio Grande do Sul, Minas Gerais, Espírito Santo and Rio Grande do Sul [3], [8], [63], [64]. The frequency ranges from 30 to 38% of isolates tested and was associated with MDR-TB [4], [63], [64] and with genotype clustering [5], [63], indicating a higher rate of recent infection and transmissibility. Another deletion, RD174, initially described as a marker for the LAM family and as a co-marker for RDRio [7], [8], was associated with high transmissibility [10].
The frequency of RDRio strains in the present study, with an overall frequency of 11% and 20.7% in LAM, is lower than that observed in other studies; even when including the eight isolates that showed mixed RDRio/WT signals, the resulting frequencies of 14.7% and 28.2% are still lower than previously observed. This difference could be related to the relatively low frequency of the LAM1, LAM2, LAM5, LAM6 and LAM9 related subfamilies in our study sample. Earlier studies on the classification of RDRio strains were based on genotypes defined by spoligotyping and/or 12 MIRUs only [2], [5], [6], [7], [8], [63], [64] and, again, this is the first study that used 24 MIRU-VNTR typing to conduct a more detailed phylogenetic analysis. We verified the signature of MIRU-VNTR loci for RDRio and RD174, as compared with that of the hypothetical ancestor of RDRio as defined by 12 MIRUs [2], and found that, besides sharing two copies of MIRU04 and MIRU20, they also share one copy of MIRU24, three copies of MIRU31, two copies of ETR-A, three copies of MTUB21 and one copy of MTUB30, yielding 2213231 as the fingerprint for RDRio. All LAM/RD174 isolates, with or without RDRio, carried two copies of MIRU20 and ETR-A and one copy of MTUB31. We observed that LAM3 has two copies of MTUB30 (2213232) and we propose that this subfamily, which does not carry RDRio, has evolved independently. We also observed RDRio in two isolates (2.6%) with a spoligotype not indicative of LAM; a small number of such cases have also been observed by other research groups [5], [7], [64]. One of these isolates, which had been classified by spoligotyping as H3 SIT50, was reclassified by 24 MIRU as "LAM-like" and drew attention because it had the MIRU signature of the hypothetical ancestor of RDRio and carried the SNP fbpC103 and RD174.
Overall, we observed four scenarios: (i) isolates with both RDRio and RD174, (ii) isolates showing only RDRio, (iii) isolates showing only RD174 and (iv) isolates that did not carry either of the deletions, suggesting that the two markers evolved at different time points. This is different from earlier data [7] claiming that RD174 is an absolute co-marker for RDRio; therefore, studies that use the presence of RD174 to infer the presence of RDRio [8] may overestimate the frequency of the latter. In 2007, Lazzarini et al. [2] suggested that RDRio arose by homologous recombination between genes and, although all neighboring sequences were identical, such an event could have happened more than once. This suggests that the RD174 deletion occurred before RDRio, but this needs to be confirmed, as we also observed RDRio strains that had intact RD174. In Figure S3, we propose a possible evolution of members of the Euro-American lineages, with the limitations that the spoligotype-defined lineages S, X and others are not represented and that the sampling is from Rio de Janeiro only.
In a recent study, Hill et al. (2012) [53] mention the difficulty of studying the evolution of the Euro-American lineage (LAM, Haarlem, T, X and S) using spoligotyping, due to the large number of IS6110 copies in such strains that may result in IS6110-mediated deletions in the DR locus. This might be the reason why inference of bacterial evolution based exclusively on spoligotyping is not robust and why a wide range of profiles is reported as unclassified in SITVITWEB. Our approach, combining spoligotyping, MIRU-VNTRs, SNPs and RDs, allowed the reclassification of 13 SITs that were not ranked in SITVITWEB, the definition of 29 new spoligotypes and a refined classification of isolates belonging to the Euro-American lineage. Our data also support the idea that absence of spacers 21-24 is not sufficient for classification as LAM, nor is absence of spacer 31 sufficient to differentiate T and H, the latter being indicative of subfamilies H3 SIT50 and T1 SIT53. Possible explanations are that ancestral lineages are currently circulating (plesiomorphic state) or that the isolates are undergoing homoplasic evolution and reversion to the plesiomorphic state.

Figure S1 Dendrogram constructed with BioNumerics software version 6.6 from the results of 24 MIRU-VNTR loci and spoligotyping, using the similarity coefficient for categorical data and the neighbor-joining algorithm. 1st column (after spoligotypes): isolate number; 2nd column: patients who have more than one isolate in the study (n = 27) received numbering 1-27, and the different strains present the same numbering; 3rd column: Spoligotype International Types (SIT); 4th column: classification obtained through SITVITWEB (family and subfamily).
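The four groups listed above amount to a small decision table over three binary observations per isolate. The sketch below is only an illustration of that logic; the function name and the "other" category for combinations not listed among the four groups are assumptions introduced here.

```python
def marker_group(lam_by_spoligotyping: bool, lam_by_miru: bool,
                 has_fbpc103: bool) -> str:
    """Assign an isolate to one of the four groups (i)-(iv) described in the text."""
    if lam_by_spoligotyping and lam_by_miru and not has_fbpc103:
        return "i"    # LAM by both methods, SNP absent
    if not lam_by_spoligotyping and not lam_by_miru and has_fbpc103:
        return "ii"   # non-LAM by both methods, SNP present
    if not lam_by_spoligotyping and lam_by_miru:
        # "LAM-like": non-LAM spoligotype, LAM by 24 MIRU-VNTR loci
        return "iii" if has_fbpc103 else "iv"
    return "other"    # combinations not listed among the four groups


print(marker_group(False, True, True))
print(marker_group(True, True, True))
```

Isolates for which all three markers agree, such as a LAM strain by both methods that carries the SNP, fall into the "other" category in this sketch.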