Prospecting Environmental Mycobacteria: Combined Molecular Approaches Reveal Unprecedented Diversity

Background Environmental mycobacteria (EM) include species commonly found in various terrestrial and aquatic environments, encompassing animal and human pathogens in addition to saprophytes. Approximately 150 EM species can be separated into fast and slow growers based on sequence and copy number differences of their 16S rRNA genes. Cultivation methods are not appropriate for diversity studies; few studies have investigated EM diversity in soil despite their importance as potential reservoirs of pathogens and their hypothesized role in masking or blocking M. bovis BCG vaccine. Methods We report here the development, optimization and validation of molecular assays targeting the 16S rRNA gene to assess diversity and prevalence of fast and slow growing EM in representative soils from semi tropical and temperate areas. New primer sets were designed also to target uniquely slow growing mycobacteria and used with PCR-DGGE, tag-encoded Titanium amplicon pyrosequencing and quantitative PCR. Results PCR-DGGE and pyrosequencing provided a consensus of EM diversity; for example, a high abundance of pyrosequencing reads and DGGE bands corresponded to M. moriokaense, M. colombiense and M. riyadhense. As expected pyrosequencing provided more comprehensive information; additional prevalent species included M. chlorophenolicum, M. neglectum, M. gordonae, M. aemonae. Prevalence of the total Mycobacterium genus in the soil samples ranged from 2.3×107 to 2.7×108 gene targets g−1; slow growers prevalence from 2.9×105 to 1.2×107 cells g−1. Conclusions This combined molecular approach enabled an unprecedented qualitative and quantitative assessment of EM across soil samples. Good concordance was found between methods and the bioinformatics analysis was validated by random resampling. Sequences from most pathogenic groups associated with slow growth were identified in extenso in all soils tested with a specific assay, allowing to unmask them from the Mycobacterium whole genus, in which, as minority members, they would have remained undetected.


Introduction
Environmental mycobacteria (EM), also known as nontuberculous, opportunistic or atypical mycobacteria, belong to the genus Mycobacterium but are generally considered distinct from the M. tuberculosis complex, especially in terms of the impact on public health [1]. EM comprise both saprophytic species and a number of pathogens that have been identified as causative agents of a wide range of diseases such as pulmonary, cutaneous and disseminated diseases, posing a particular risk for those that are immunocompromised [2][3][4]. EM species are characteristically and phylogenetically separated on the basis of growth into fast growing and slow growing mycobacteria [5], with pathogenicity predominantly correlated with slow growth [6]. The number of newly identified Mycobacteria species from the environment or in clinical settings has risen from about 100 species to 157 in the last decade [7,8] (and http://www.bacterio.cict.fr/m/mycobacterium.html); of these a third are potential human and animal pathogens. In addition, evidence suggests that exposure to EM interferes with the efficacy of the BCG vaccination against adult pulmonary tuberculosis [9].
EM are ubiquitous in the environment, having been isolated from a variety of sources including natural waters, drinking water distribution systems [10][11][12], biofilms, hot tubs and spas, peat, acid brown swamps, potting soils [13] reed beds [14] and acidic boreal forest soils [15][16][17][18]. They have also been isolated from dust, milk and from the salivary glands of insects [19][20][21]. To date, EM diversity and prevalence in the environment has been mainly assessed using cultivation, which is an insensitive method particularly for selective groups. Therefore, to date, few studies have reported slow-growing EM in soil isolation studies [15,22,23]. Previously, we reported one of the first analysis of diversity of slow growing mycobacteria in soil [24] and subsequently developed a suite of techniques for reliable extraction and detection of M. bovis in soil. Direct diversity analysis of EM in soil has focused on 16S rRNA genes using PCR-DGGE, clone libraries and T-RFLP [24][25][26][27][28][29][30]. These studies aimed either at the whole genus or at slow growers, but through targeting the whole genus, they risk underrepresentation of the low prevalence slow growers which have a single rrn operon [31] copy. To address this issue we report here the development of a combined approach to determine the diversity of EM which includes slow growers and hence the M. tuberculosis and the M. avium complex too, using the 16S rRNA gene as target. Here a method was devised which exploited a particular signature in the Mycobacterium genus 16S rRNA gene, known as the long helix 18 [6], which is associated with pathogenicity and is harboured by the majority of slow growers known to date. Our aims were i) to perform a bioinformatics analysis with a customized approach for EM detection using tagencoded Titanium FLX amplicon pyrosequencing ( [32]), ii) to compare deep sequencing with community profiling using PCR-DGGE and iii) to determine if ratios of fast to slow growing EM correlate with their distribution in selected soil environments via quantitative PCR (qPCR).

Bacterial strains and growth conditions
The strains used in this study are listed in Table 1. All species were stored at 280uC as glycerol stocks. The stocks were resuscitated on Middlebrook 7H10 selective agar plates (BD, Oxford, UK) and grown at 37uC until colony formation. A single colony was then inoculated onto Middlebrook Broth 7H9 (BD, Oxford, UK) with constant shaking at 150 rpm. For Mycobacterium avium subsp. paratuberculosis a supplement of ferric mycobactin J (Allied Monitor, Fayette, Missouri, USA) was added to both the media and broth following the manufacturer instructions.

Soil Samples
A Material Transfer Agreement was made with the Institute of Biodiversity and Conservation, Ethiopia, for export of soils. A DEFRA Licence to import, move and keep prohibited soil (Licence No. PHL 208/6138) was issued. The study was carried out on private land and the owners of the land gave permission to conduct the study on the sites. Samples were collected in duplicate from the upper layer (0-5 cm) of sites in and around the market town of Mehal Meda, Ethiopia (soils 1-4) and from Cryfield field, Cryfield Farm, Warwick University, Warwick, United Kingdom [33]. Soil sample 1 (1108), described as very compacted and dry soil was collected from the ground used for cattle and goods in the market place of Mehal Meda town (10u319N, 39u669E). Soil sample 2 (1109), was collected from the compacted soil floor at the entrance to a house near Mehal Meda (10u339N, 39u749E). Soil sample 3 (1110), was collected from a ploughed field in Mezezo (9u979N 39u749E). Soil sample 4 (1111), was collected from a house yard on a roadside in Sembo (9u429N 39u359E). Soil samples were collected in October 2008 during the Ethiopian dry season. The soil sample 5 from Cryfield, (52u369N 1u569E) was collected in August 2008. Samples were stored at room temperature and were thoroughly mixed before extraction. The sampling was preliminary to a subsequent field survey throughout varying agroclimatic regions in Ethiopia.

DNA Extraction
Genomic DNA from the various reference strains of Mycobacteria was extracted using the DNeasy Blood & Tissue Kit (QIAGEN, Ltd., Crawley, UK) following manufacturer's instructions. Total soil DNA was extracted from soil samples (0.5 g60.2 g) with the FastDNA Spin Kit (MP Biomedicals, OH, USA) as per manufacturer's instructions.

Primer design
The primers used in this study are shown in Table 2. Primers previously published by our group (Young et al., 2005) were initially assessed as their specificity had been proven. These primers targeted two different sequences in the 16S rRNA gene, encompassing the regions V2-V4 (termed Mycobacterium genus specific JSY1SF/R primer set) and V3-V6 which contains the long helix 18 insertion (of a variable size of 28 to 31 bp), distinctively present in a large subgroup of slow growing mycobacteria [6,24], termed JSY16Sslow primer set. Preliminary work showed that the length of the amplicons generated by the  V3 region  and specifically targeting the helix 18 insertion and upstream  regions, globally, V1-V3 (termed APTK16S). Theoretically, for the regions targeted in this study, at a species differentiation similarity cutoff of 97%, a minimum difference of 12 bp from each strain is expected in order to distinguish a species, based on the minimal sequence length of 405 bp amplified by the primers. Fiftyone selected reference strains including 21 slow growers (listed in Table S3 and Table S4 in File S1) were analyzed with a similarity matrix and differed from each other by 7-74 bp in the V1-V3 region and by 4-76 bp in the V2-V4 region.
As qPCR intrinsically requires shorter amplicons, we used the same software to design a suitable forward primer APslowF for the slow grower specific qPCR assay and a suitable forward primer APmycoF for the genus specific qPCR assay to be used with JSY16SR.
For DGGE, A GC clamp (59CGCCCGCCGCGCGCGGC-GGGCGGGGCGGGGGCACGGGGGG 39) was added to the reverse primers JSY16SR and APTK16SR and used in PCR reactions.

PCR and quantitative PCR conditions
In order to reliably generate PCR amplicons for DGGE, nesting was applied to all soil extracted DNA using the primer set pA and pH (Table 2) to first amplify the whole 16S rRNA gene [34]. Direct PCR attempts had failed to sufficiently produce amplicons in a standardized fashion for all samples. Reactions were performed in a 50 ml volume containing 1 ml of DNA, 25 ml of Promega PCR master mix (Promega Ltd., Southampton, UK), 1 ml (10 mmol) of each primer, 2.5 ml DMSO, 2 ml (10 mg/ml) Bovine Serum Albumine (BSA)(Sigma-Aldrich, Dorset, UK), and 17.5 ml of filter sterilized MonoQ water.
The PCR cycle was 94uC for 1 min, followed by 40 cycles of 94uC for 1 min, 62uC for 1 min, 72uC for 1.30 min and a single extension step of 72uC for 7 mins. Subsequently, specific 16S rRNA gene regions were amplified in the following manner: each 50 ml reaction contained 1 ml of DNA, 25 ml of Promega PCR master mix (Promega Ltd., Southampton, UK), 1 ml (1mmol) of each primer, 2.5 ml DMSO, 2 ml (10 mg/ml) Bovine Serum Albumine (BSA)(Sigma-Aldrich, Dorset, UK), and 17.5 ml of filter sterilized MonoQ water. The thermal protocol for the JSY16SF/R primer set was 94uC for 5 mins, followed by 35 cycles at 94uC, 55uC for 1 min, 65uC for 1 min, and finally a single extension step of 65uC for 5 mins. For the APTK16SF/R primer set it was 94uC for 5 min, followed by 35 cycles at 94uC for 1 min, 59.9uC for 1 min, 67uC for 1 min, and finally a single extension step of 67uC for 5 mins.
For qPCR, amplification reactions were set up and run in triplicate with ABI 7500 Fast Real-Time PCR System (Applied Biosystems Inc., CA, USA). For both the genus specific and the slow growers specific qPCR reactions, 12.5 ml of SYBRH Green PCR Master Mix (Applied Biosystems Inc, CA, USA) were mixed with 2.5 ml of a 0.45 mM primer Mastermix (made up from a stock of 10 mM of each primer), 0.25 ml BSA (10 mg/ml) (Sigma-Aldrich, Dorset, UK), and 4.75 ml of filter sterilized MonoQ water to which 1 ml of DNA template directly extracted from soil was added for each sample. Plates were analyzed with the auto threshold and a melting curve analysis was included. Mycobacterium bovis BCG genomic linear DNA standard dilutions were used to generate a standard curve for absolute quantification and were run in triplicate on each quantitative plate. They were converted to genomic equivalents [33] and ranged from 8.5610 5 copies/ml to 8.5 copies/ml. The thermal cycling protocol used was an initial step of 2 min at 50uC followed by 10 mins at 95uC, then by 40 cycles consisting of 20 secs at 95uC and 1 min at 60 C. The final dissociation step consisted of 15 secs at 95uC, 1 min at 60uC for and 15 secs at 95uC. As seen previously with a different assay (Pontiroli et al. [33]), the use of an inhibition control via an internal amplification control confirmed the lack of inhibitors with good amplification of DNA on DNA extracted from soils of varying characteristics with the FastDNA Spin Kit, and the yields of the extractions here were very comparable across all samples tested, so copy numbers were corrected only by subtracting the noise due to background fluorescence typically generated by SYBRH Green at the plateau stage of the Real Time PCR. For all reactions, the R 2 values of the standard curve were at least 0.97 and the reaction efficiency varied between 90-100%. Estimates of the titer of the fast growers were obtained by subtracting the counts of the slow growers from the total cell counts and by dividing the number of the fast growers by two, considering the double copy of rrn operons of the latter.

Denaturing gradient gel electrophoresis (DGGE)
DGGE was performed using the DCode Universal Detection System according to the manufacturer's instructions (Bio-Rad Laboratories, CA, USA). An 8% polyacrylamide gel (acrylamidebis-acrylamide 37.5: 1) in 0.5 X TAE buffer was used. The denaturing gradient for the Mycobacterium genus specific JSY16S PCR products was 45%-55% where 100% denaturant corresponds to 7M Urea and 40% [vol/vol] deionized formamide (Muyzer et al., 1993). The denaturant gradient for the slow growing specific APTK16S PCR products was 40%-60%. PCR products (22 ml) with 8 ml of Fermentas 6x DNA loading dye were loaded into each lane of the gel. Electrophoresis was carried out initially for 10 minutes at 150 V and subsequently for 960 mins at 60 V at 60uC in 7 litres of 0.5 X TAE buffer (40 mM Tris-acetate and 1 mM EDTA, pH 8.0). The gels were stained with Ethidium bromide (10 mg/ml) for 15 mins and destained with distilled water for 20 minutes and photographed (GeneFlash Gel Documentation System, Syngene, Cambridge, UK). Major bands were excised and then reamplified with either the JSY16S primer set or the APTK16S primer set without the GC clamp. The PCR products were purified using the QIAquick Gel Extraction Kit (QIAGEN, Ltd., Crawley, UK) according to the manufacturer's instructions, and then sequenced using an ABI PRISM 3130xl Genetic Analyser (Life Technologies Ltd, Paisley, United Kingdom). All sequences were aligned using ClustalW (http://www.ebi.ac.uk/ Tools/msa/clustalw2) and compared with the nucleotide database in Genbank using BLASTn (http://www.ncbi.nih.gov/) and with Silva v.104 database using a standalone BLASTn (see below). In order to determine the limit of detection of the technique, i.e. to assess the varying analytical sensitivities obtained with the different primer sets, soil microcosms consisting of 0.5 g (60.1) sterile soil microcosms were seeded in duplicate with a dilution series of M. bovis BCG cells which ranged from of 10 1 to 10 6 equivalents of genomic DNA. DNA was extracted and was amplified either directly with the JSY16S and APTK16S primer sets or after nesting, necessitating a preliminary amplification with pA and pH primer sets prior to the DGGE run, in order to determine the detection limit of the DGGE method. M. bovis BCG genomic DNA was used as control for an independent run in the electrophoresis.
Pyrosequencing sample preparation, tag design and sequencing PCR DNA samples were normalized to 100 ng/ml and 1 ml aliquots were used to amplify the genus and slow grower specific regions with customized primers (Table 2). Pyrosequencing was carried out using tag encoded FLX-Titanium amplicon pyrosequencing similar to that described previously [32] including the use of Titanium reagents and a mixture of Hot Start and HotStart high fidelity Taq polymerases as well as a one-step PCR, with 30 cycles of PCR to reduce chimera formation [35]. Pyrosequencing was performed at the Research and Testing Laboratory (Lubbock, TX) based upon RTL protocols (www.researchandtesting.com).
Pipeline for sequences analysis. A customised bioinformatics pipeline was developed to process sequences, using individual and shell scripts in a Linux environment. Sequences were trimmed in mothur [36] using the quality files with a 50 bp moving window at an average quality score cutoff of 35, and a maximum number of homopolymers below eight. Further processing as per recent recommendations [37][38][39] was done with Perl and Bioperl scripts which trimmed pyrosequencing tag sequences, screened for presence of the forward PCR primer sequence, and removed sequences with any ambiguous base calls. Based on expected amplicon sizes and sequence length frequency distributions of bacterial sequences in the Silva reference database (v.104), sequences were further limited to a range of 468-472 bp with JSY16S primer set and 420-424 bp for the amplicons obtained with APTK16S. Sequences which passed these screens were aligned to the reference dataset used for blastn using PyNAST [40] implemented in QIIME [41] and putative chimeric sequences were identified with ChimeraSlayer [42] following the authors' specifications, in order to identify and remove the chimeric sequences potentially generated by pyrosequencing. The proportion of chimeras detected was 5.66% for the genus data set (1544 intragenus chimeras) and 0.58% for the slow grower data set (59 intragenus and 4 intraorder chimeras).
Sequences which passed all the screens described above were compared with blastn run locally against the reference database described above derived from the Silva project (v.104) curated seed database [43]. For blastn, the highest ranked hit to this dataset was retained using an e-value of 10 24 . The number of reads pre-and post quality control are shown in Table 3.
Classification of metadata into OTU based on similarity to reference sequences. Filtered sequences were clustered into similarity-based OTUs (Operational Taxonomic Units) using a shell script which combined the two softwares CD-HIT-EST [44] and mothur version 1.5.0 [45] by using a series of scripts edited in Perl and R (http://www.r-project.org). Based on preliminary testing, four sequence similarity cutoffs for CD-HIT-EST (99, 97, 95, 93%), were set as default in the shell script.

Random resampling
In order to take into account any potential bias resulting from different numbers of sequences among different soil samples, a script was developed in R to randomly select equal numbers of sequences from each site for each primer set; the resulting fasta files were used as input files for the above described processing pipeline. Output data was compared to non randomized results (Table 3) obtained with the clustering approach previously described.

Statistical analysis
The cluster output files were automatically converted to tabular format with Perl scripts and input into mothur which generated similarity based distance matrices and computed derivative diversity estimates such as Shannon (H') and Simpson and the nonparametric species richness estimators ACE and Chao1. Concomitantly, an automated summary analysis of OTU frequency distributions, including rarefaction curves and CCA was generated in R. Clustering analysis, principle component analysis, P test significance and Unifrac significance test were

Phylogenetic assignment via standalone BLAST
Quality controlled sequences and representative sequences were compared to the SILVA SSU rRNA database (Release 104, 2010); we modified the Silva reference database to ensure the most accurate possible classification of Mycobacteria to the species level by adding all available Mycobacteria sequences (n = 899). Sequences were compared at first to the whole SSU-16S rRNA and then only to the Mycobacteria species from the customised database with BLASTn. For each query sequence, the best match was retained for sequences with at least 97% identity and minimum query coverage of 90%.

Phylogenetic diversity analysis
Sequences were aligned against a template alignment containing 16S rRNA genes from Mycobacteria exported from the SILVA v.104 database using the align.seqs function in mothur v. 1.5.0 [45]. The aligned fasta file was then imported into ARB [46] and used for the construction of phylogenetic trees using maximum likelihood and neighbour-joining algorithms. Output files containing cluster related information and frequency distribution were retrieved from the pipeline and selected cluster representative sequences imported into ARB in order to compare their topology with the Mycobacteria database. MrBayes (http:// mrbayes.sourceforge.net/index.php) was also used for phylogenetic reconstruction with default parameters except for a general time reversible (GTR) evolutionary model and gamma distributed rate variation across sites. The analysis was run for 2,000,000 generations, after which the convergence diagnostic, the average standard deviation of split frequencies, was ,0.008. To assess the differences between soil samples using phylogenetic inferences, the FastUniFrac web interface was used [47,48] which uses branch length and position to compare actual phylogenies to a null model of randomly permuted sites. Phylogenetic trees were built with ARB neighbour-joining method and the sample mapping file and the category mapping file were used as input for FastUniFrac analysis.

Real Time quantification of EM
Successful, sensitive and specific qPCR was achieved with both primer sets, with the APTK16S primer set amplifying exclusively the slow growing mycobacteria in the panel tested as expected (Table 1).
EM communities were detected from bulk soil DNA extracted from five soils; counts were in the range of 2.9610 5 to 1.2610 7 cells g 21 of soil for slow growers and 2.3610 7 to 2.7610 8 gene targets g 21 of soil when the Mycobacterium genus was targeted ( Table 3). The vast majority of fast growers have two copies of 16S rRNA genes compared to a single copy in slow growers ( [31]), allowing estimation of fast grower titer of between 1.135610 7 and 1.29610 8 and of slow growers between 2.9610 5 to 1.2610 7 cells g 21 of soil. This reflects the fact that fast growers are typically more abundant and outcompete slow growers in the environment. In addition to the differences in copy number of 16S rRNA genes the V3-V6 amplicons may skew results giving an underestimation of the slow growing mycobacteria, especially as the amplification targets of the two groups are from different regions. Global counts measured with both primer sets were in the same range as those reported in the study of Jacobs et al. ( [30]) where a Taqman assay targeting another portion of the 16S rRNA gene was used.

Profiling bacterial communities by DGGE
Resolution of EM at the species level with DGGE was achieved using both genus specific and slow grower only targets (Fig. 1). Optimization with clinical isolates singly and as equimolar mixtures was performed prior to profiling of soil indigenous EM communities ( Figure 1A, B). The optimized protocol was then used for comparative analysis of the total soil bacterial community ( Fig. 1 C and D). A range of DGGE bands were sequenced (Table 4). Genus specific DGGE bands yielded predominantly fast growers (.97% sequence similarity) (DGGE-1A and DGGE-1I) whereas slow growers specific DGGE yielded exclusively slow growing mycobacteria (DGGE-2A and DGGE-2I). All of the genus and slow growers DGGE blast matches apart from one (M. haemophilum) were also identified in the pyrosequencing data for the corresponding samples. It was expected that the DGGE bands would represent the most prevalent species present in the sample, as judged by pyrosequencing, as it was the case for M. moriokaense in sample 1110 and M. colombiense in samples 1108 and 1109. However, this was not always the case; for example, in the Cryfield sample M. sp. JS623 and M. sp. GR-2001-270 only represented a small proportion of the pyrosequencing reads. This suggests that DGGE bands may not consistently represent the most abundant species, but that the species identified are likely to be present in the sample as shown by the pyrosequencing.

OTU based diversity analysis of EM
Five soil samples were deep sequenced with two primer sets targeting the EM genus and slow growers only. A total of 27258 (genus specific) and 10970 (slow growers specific) reads were retained for analysis post quality control ( Table 3). The number of reads of individual samples after trimming was in the range of 1000 to 9000 for the genus specific data set and 1000 to 4000 for the slow growers data set. The number of clusters obtained after binning with CD-HIT at the cutoff of 97% similarity is also given ( Table 3). Binning was also done at the 99% cutoff. As an example, for the slow growing EM dataset, the more stringent binning cutoff revealed a higher number of clusters (336 versus 33). Notwithstanding the higher number of clusters, in reality this only corresponded to a few additional species obtained as compared to those obtained with the 97% id clustering cutoff (i.e. 24 taxonomical identities for 335 clusters against 15 identities with 33 clusters). The rarefaction curves (Fig. 2) indicated that the sampling effort was adequate with both primers sets.
With the genus specific primers, a total of 314 clusters or OTUs were identified (Table 3 and Table S1 in File S1) across the five sites. This represented a ten-fold higher figure than the number of OTUs obtained with the slow grower data set. However, only 177 (56%) of these OTUs contained more than 10 sequences per cluster and there were 66 singleton clusters (21%), which suggests either high diversity of rare Mycobacterium species, or potential sequencing errors which is considered less likely due to the quality controls adopted for culling chimeras and sequences with ambigous base calls.
After phylogenetic assignment using BLASTn, 88% of the data set (279 clusters) corresponded to 96 species that were culturable or clinical isolates, the remaining 35 clusters were matched to uncultured or unidentified species. However, only 153 clusters (50%), representing 86% of the sequences (23396), could be assigned to known species at the 97% similarity cutoff. All sequences could be assigned to the Mycobacterium genus by classifying the sequences against the Silva Mycobacterium database (overall, identities had ranged from 92 to 100%, with the cutoff of 97% used to retain species). Only 60% ca. of these 153 clusters, corresponding to 29 species, had a prevalence higher than .1% (Table S1 in File S1). After classification with BLASTn to the reference Mycobacterium database, the top ten species identified matched to uncultured Mycobacterium sp. For the slow grower data set, a total of 33 representative clusters were identified across five soils (Table 3, and Table S2 in File S1). Of these, four were singletons, 11 were unique clusters to one soil, whilst the remaining 22 were shared between the five soil samples. Excluding the singletons, only 11 clusters had a prevalence higher than 1.5%. About 99% of the 10970 (9941 reads) were matched to a sequence in the Silva database with a similarity higher than 97%; those below this cutoff were excluded from the analysis. These sequences were classified (Table S2 in File S1) as M. colombiense (7 clusters, which represented 40% of the total) and M. gordonae (8 clusters, 13%), M. aemonae (1 cluster, 13%), M. asiaticum (4 clusters, 11% of the total), M. malmoense (2 clusters, 8% of the total), M. riyadhense, (1 cluster, 7%) M. sp. NLA000202017 (1 cluster, 2.75%) and M. angelicum (2 clusters, 1.4%). The predominant species listed above were often represented by several clusters, but generally only a few of these contained the majority of sequences, this was the case for M. colombiense where Cluster 4 was the most conspicous (4153 sequences, 38.13% of the total). Interestingly, only sequences from the Cryfield site were identified as M. tuberculosis complex in Cluster 14, but these had very low prevalence (0.1% of the total). For the slow grower data set chimeras were predominantly found in cluster 13, M. angelicum (29/142), cluster 3, M. aemonae (18/1519) and cluster 0, M. riyadhense (7/760).

Diversity indices Shannon (H') and
Simpson varied considerably between the genus and slow grower sequences. For the genus data set, they were generally higher than with the slow growers  (Table 4). (D) EM soil community profiling using the slow grower mycobacteria specific 16S rRNA gene specific primers (APTK16S). L is a reference sizing ladder (TrackIt TM 50 bp DNA Ladder, Invitrogen, Ltd., Paisley, UK), C is the negative PCR control, samples 1-4 are the four Ethiopian soils (1108, 1109, 1110, 1111) and 5 is Cryfield. The arrows (2A-2I) represent the bands that were excised and sequenced (Table 4). doi:10.1371/journal.pone.0068648.g001 data set (Table 3), reflecting that the diversity attained was higher when the whole genus was targeted.
Considering the diversity estimators CHAO1 and ACE, a x 2 test between the number of expected sequences and those observed, revealed that the sequencing effort was adequate for the slow grower data set (p.0.05). For the genus specific primer set, however, there were statistically significant differences (p,0.05) between the number of unique OTU observed and those expected, indicating that the EM diversity for these soil samples was not fully explored, although the rarefaction curves had plateaued (Fig. 2 A).

Random sequence resampling
For any biological system, there is an intrinsic variability between samples, with uneveness being introduced during the sampling and during the downstream molecular assays. In this case the number of reads obtained for each sample varied considerably between samples thought to be due to amplification biases, inhibitory compounds carried over in the soil extracts, or Table 4. Sequence comparisons of bands excised from DGGE gels (Fig. 1 C and D) and blasted against the 16S rRNA gene database retrieved from SILVA.  the stochasticity of sequencing reactions. Therefore, randomised resampling was also performed in order to remove any potential bias in the computation of the indices (Table 3) and to assess the reliability of the bioinformatics approach. The genus data set size was normalized to 1135 sequences and the slow grower data set to 1381 sequences. When controlling for sequencing effort, no statistical differences were observed in the number of clusters, the diversity indices H' or Simpson.

Standalone blast of the sequences
Independently from the results obtained with the taxon based approach with clustering sequences into OTUs, entire data sets were also compared to the same blast databases, with only sequences having similarity equal or higher than 97% retained. A comparison of diversity of EM within soils, revealed site specific assemblages of mycobacteria ( Fig. 3A and 3B). Overall, 16 species for the genus data set and 7 species for the slow growing EM data set were .1% prevalent.
It is true to say that the genus data set predominantly harboured previously described species (compared with the OTU based diversity analysis) with the remaining sequences that were similar to those originally isolated from a variety of sources including chlorinated contaminated sites or activated sludge, marine and riverbed sediment or soil. Cryfield site uniquely harbored species A comparison of the clustering approach with the the taxon independent approach, i.e. blasting the clusters to the same database used for the standalone blast of the sole sequences, revealed that the same species were found with both approaches, reinforcing the validity of these independent methods.

Linking OTUs to phylogeny
For both data sets, when the reference sequences representing the various clusters were aligned to the reference set of mycobacterial sequences and imported into ARB, their position on the tree correlated well with the independent result of the standalone blast analysis (Fig. 4A). For reasons of brevity, only the tree relative to the slow data set is shown, with clusters with the highest prevalence labeled in red.

Phylogenetic diversity analysis
Unweighted FastUnifrac was used to compare the diversity of microbial communities using a phylogenetic based approach. The distribution of sequences was significantly different between sites regardless of whether all samples were analyzed together, each sample individually or with pairwise analysis. For both the genus level and the slow grower data set, the distance metrics indicated that branch lengths across samples were significantly unique and not due to chance (uncorrected p,0.0001, corrected p,0.002). Significant differences (p,0.002) between each pair of samples were observed after the Bonferroni correction for multiple comparisons. The exception being for the slow growers data set, where if the samples were compared individually, sample distance p values for sample 1109 indicated that the tree topology for this site was not significantly unique. The P test within the FastUnifrac package, which reveals if samples are significantly clustered on the tree, was also performed for both data sets. For the genus data set, when comparing each pair of samples, clustering was highly significant (raw p,0.001 and after the Bonferroni correction for multiple comparisons, p,0.002). When all samples were analysed together (unweighted mode, 500 permutations) again clustering was significant (p,0.01) (qualitatively illustrated in Fig. 4 B). For the slow data set this test gave similar results to the genus level. Overall, the comparative phylogenetic analysis suggests that EM communities are significantly different in the five soil samples.

Analytical sensitivity of DGGE method
The analytical sensitivity with the genus specific JSY16S primer set was 10 5 BCG gene targets g 21 of soil; nested PCR did not improve the limit of detection. When the slow grower specific APTK16S was used, the detection limit was 10 3 gene targets g 21 of soil which improved to 10 2 BCG gene targets g 21 of soil with the nested approach.

Discussion
This paper reports a new approach devised to detect and quantify EM in soil, using a combination of qPCR, DGGE and Pyrosequencing. To our knowledge, this is the first paper to report analysis of EM diversity using the same target gene, whilst attempting to distinguish slow growers from the rest of the mycobacteria. We report successful testing and validation of these methods for the detection of EM in two very distinct sites.
Novel primers were designed and optimized to detect slow growing mycobacteria, and previously designed genus specific primers were validated with two of the three methods and optimized for qPCR. We were able to segregate slow and fast growing members of the Mycobacterium genus and, as expected, fast growers were more prevalent in all soils comparing all three methods. Our data shows that the prevalence of EM as tested by qPCR is consistent with what has been reported for the genus: using a qPCR assay targeting a different region of the rRNA operon, cell counts ranged from 10 4 to 10 7 cells g 21 of house dust [29] and [30].
High-throughput pyrosequencing offers an opportunity to study in depth the diversity of mycobacteria in soil. Two complementary approaches were taken with the analyses, an OTU based and a phylogeny based method. A comparison of the clustering approach with the the taxon independent approach, i.e. blasting the clusters to the same database used for the standalone blast of the sole sequences, revealed that the same species were found with both approaches, reinforcing the validity of these independent methods. A number of points regarding the optimisation of the pyrosequencing merit discussion.
A 97% binning cutoff was chosen -analysis was also done at a more stringent clustering cutoff of 99% but this did not significantly alter the most preponderant species identities found or the overall numbers of sequences affiliated to these.
The less stringent cutoff of 97% was selected as the conservative choice given the demonstrated errors in pyrosequencing [39] and so it was used for the OTU based analysis. CD-HIT-EST is a powerful clustering tool which enables management of large amount of data with a reduced set of representative sequences and downstream diversity analysis. However, in our case, where the target was a specific genus and a subdivision within a genus, this approach resulted in some very similar clusters; hence the subsequent diversity analysis is very likely overestimating diversity due to inflated numbers of OTUs. Similarly, for the slow growing EM dataset, a binning cutoff at 99% against 97% increased the number of the OTU by ten-fold without significantly increasing the number of exact species matches.
Using universal primers for pyrosequencing, the proportion of sequences that belonged to the Mycobacterium genus in the soil varied from 0.49% to 2.3% of the total reads (S. Dowd pers. comm.) which indicated a significant numerical presence of EM in the environment, which was also confirmed in our study by qPCR.
Determining the exact prevalence of EM load within the environment required a molecular approach and was achieved by dual focus on both fast and slow growing mycobacteria as the latter are difficult to detect due to higher prevalence of fast growers, and because we can cultivate only a very small proportion of the estimated 10 10 bacterial species [49] contained within a gram of soil. Moreover, cultivation-based approaches are well known to underestimate and bias the number of species by orders of magnitude [50]. Although different regions of the 16S rRNA gene were targeted for the genus specific and slow grower analysis, our work supports indeed the hypothesis that the number of EM species in the environment is higher than the number of species that has been isolated to date. Pyrosequencing indicated more unknown groups within the genus data set with most of the new diversity grouping with fast growers (14%), whereas about only 1% of sequences in the slow grower data set did not match known identities in the database. In addition, for the genus data set only, the Chao1 and ACE statistics suggested that greater diversity could be expected to be recovered with additional sequencing.
In general, the DGGE profiles correlated well with the pyrosequencing approach, detecting the most abundant species, particularly for the slow growers' data set. This included M. colombiense, a recently isolated opportunistic pathogen [51], and M. riyadhense, previously related to the M. tuberculosis group based on biochemical tests, but recently separated based on phylogeny [52].
In fact, only one slow grower DGGE band (M. haemophilum from Cryfield soil) was not also identified through pyrosequencing. For the genus data set there was a fairly good correlation between the methods in some cases. For example, M. moriokaense was the most abundant species detected in both DGGE and pyrosequencing for sample 1110. However, prevalent sequences in pyrosequenced libraries did not always correspond to prevalent DGGE bands. This suggests that DGGE bands may not consistently represent the most abundant species, but that the species identified are likely to be present in the sample as shown by the pyrosequencing. However pyrosequencing has its own amplification biases as for instance those due to the variability in the EM content across different technical replicates. Differences in PCR amplification including a preliminary round of nesting may have been the causes of minor discrepancies observed between DGGE and pyrosequencing. In addition, DGGE may not be as robust as deep sequencing for inferring beta diversity correlations within the genus, as illustrated by two bands matching to M. moriokaense that were isolated from the DGGE gel at different positions. Overall, DGGE has its well documented drawbacks [53], but in our case it has proved, with new and customized primers for EM, to provide a reliable snapshot of the environmental microbial diversity, as previously shown in our lab [54].
Slow growing bacteria present a considerable challenge for community analysis and thus a molecular approach is currently the only feasible option for studying slow growing mycobacterial reservoirs in the environment. However, caveats must be considered due to the well-known biases introduced by PCR based methodology, which is common to all the three methods employed in this study. In our work, in order to control for the unevenness of the reads across samples, a random resample of the sequences was also performed to test the reproducibility of the data from the OTU dependent approach. Although some variation could be observed and anticipated, overall the differences in the diversity indexes of the original and the resampled data sets were not significant, with higher diversity still being observed for the genus data set. Whilst there are other drawbacks in PCR based approaches [55,56], these are at present the only sufficiently valid methods to estimate diversity and prevalence of target genera in the environment.
In this study, we attempted identification of Mycobacterium to the species level using several approaches, all based on the 16S rRNA genes which are commonly used for phylogenetic studies concerning the genus Mycobacterium, in particular the first 500 bp, which include its major hypervariable regions ( [57]; [5]; [31]). However, a non-negligible limit of such phylogenetic reconstructions is the poor robustness of the trees. A 97% cutoff for species level discrimination comparisons to a comprehensive reference database is also a globally accepted procedure. However, a 97% species identity cut off may not properly resolve some species complexes such as M. tuberculosis and M. avium or individuals exceptionally close to the M. tuberculosis group such as M. lacus, M. marinum and M. ulcerans for the slow growers specific target region and for a few fast growers such as M. szulgai (Tables S3 and S4 in File S1) for the genus specific region.
To improve phylogenetic accuracy and robustness, more recently the multilocus approach [58] has been used in two large studies ( [59]; [60]) in which four and seven different nucleotide fragments, respectively, were concatenated. These approaches are useful, but are not yet applicable for culture independent high throughput methods for environmental surveys of mycobacteria as reported here [61]. More recent studies based on datasets obtained with next generation sequencing technologies, which report results from the Human Microbiome Project [62] and species level analysis of the latter datasets [63] show that accurate species labeling is dependent on the choice of the region. For instance, for the Staphylococcus genus and, globally, for the larger HMP dataset the V1-V3 region had a higher species resolving power, and this is the same target region for the analysis of the slow growers EM in this study.
The approach described here can now be used for larger scale environmental surveys devoted to establishing if EM prevalence and diversity and human exposure to these EM could be linked to the TB infectious status at different geographical settings, and what role these conditions play in the reduction of the BCG vaccine efficacy at southern latitudes. In conclusion, we report here of the successful use of PCR based methodology employing two primer sets on the metagenome of five example soils with varying characteristics. These methods revealed with an unprecedented level of resolution the prevalence and diversity of EM in soil.

Supporting Information
File S1 Table S1, Clusters characteristics for the genus data set. Only representative sequences with a prevalence higher than 1% are displayed, alongside with the information related to nu of sequences per cluster, the relative proportion and the sequence accession number after BLASTn analysis. Table S2, Clusters characteristics for the slow growing EM data set. Representative sequences are displayed alongside with the information related to the nu of sequences per cluster, the relative proportion and the sequence accession number after BLASTn analysis. Table S3, Sequence difference count matrix for 21 slow growing strains as compared to 51 slow and fast growing EM for theAPTK16S primers target region (slow growing EM specific). ID = identical strains. Table S4, Sequence difference count matrix for 51 EM strains for the JSY16S primers target region (Mycobacterium genus specific). ID = identical strains. (DOCX)