16S Classifier: A Tool for Fast and Accurate Taxonomic Classification of 16S rRNA Hypervariable Regions in Metagenomic Datasets

  • Nikhil Chaudhary ,

    ‡ These authors contributed equally to this work.

    Affiliation MetaInformatics Laboratory, Metagenomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, Madhya Pradesh, India

  • Ashok K. Sharma ,

    ‡ These authors contributed equally to this work.

    Affiliation MetaInformatics Laboratory, Metagenomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, Madhya Pradesh, India

  • Piyush Agarwal,

    Affiliations MetaInformatics Laboratory, Metagenomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, Madhya Pradesh, India, Department of Physics, Indian Institute of Science Education and Research Bhopal, Madhya Pradesh, India

  • Ankit Gupta,

    Affiliation MetaInformatics Laboratory, Metagenomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, Madhya Pradesh, India

  • Vineet K. Sharma

    vineetks@iiserb.ac.in

    Affiliation MetaInformatics Laboratory, Metagenomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, Madhya Pradesh, India

Abstract

The diversity of microbial species in a metagenomic study is commonly assessed using 16S rRNA gene sequencing. With the rapid developments in genome sequencing technologies, the focus has shifted towards sequencing the hypervariable regions of the 16S rRNA gene instead of the full-length gene. Therefore, 16S Classifier was developed using a machine learning method, Random Forest, for fast and accurate taxonomic classification of short hypervariable regions of the 16S rRNA sequence. It displayed precision values of up to 0.91 on the training datasets and up to 0.98 on the test dataset. On real metagenomic datasets, it showed up to 99.7% accuracy at the phylum level and up to 99.0% accuracy at the genus level. 16S Classifier is freely available at http://metagenomics.iiserb.ac.in/16Sclassifier and http://metabiosys.iiserb.ac.in/16Sclassifier.

Introduction

In the last decade, metagenomics has emerged as one of the most significant developments in the study of microbial ecology, making it possible to access, in principle, almost 100% of the genetic material present in unculturable microbes [1]. The more than 98% of bacteria that cannot be cultured using traditional methodologies can be sequenced directly from their natural environments using metagenomic approaches [2]. Furthermore, rapid developments in sequencing technologies have made sequencing easier, faster and far more economical, providing a unique opportunity to explore the microbial diversity of the most complex environments. The two common strategies adopted in a metagenomic project are the random shotgun approach and the targeted approach [3]. The former involves sequencing all genomic fragments and is used to uncover the large functional gene diversity inherent in microbial communities. The latter involves sequencing a marker gene, such as 16S rRNA, which helps in estimating the diversity, evolutionary distance and relative abundance of different microbes in their complex environments [4]. The 16S rRNA gene has been the most commonly used genetic marker for reconstructing prokaryotic phylogenies since it is conserved in all prokaryotes [5, 6]. The distinctive feature that makes the 16S rRNA gene a suitable genetic marker is the presence of nine hypervariable regions (HVRs), V1-V9, flanked by conserved regions which can be used to amplify the variable regions. The sequences of the HVRs have been used for the taxonomic identification of microbial species in several metagenomic studies [7–11].

In the early metagenomic projects, the complete 16S rRNA gene was commonly sequenced using the traditional Sanger sequencing methodology [7, 12]. This approach, though informative, was laborious and expensive, and provided a limited depth of sequencing which was insufficient to uncover the complete bacterial diversity present in a complex environment. The next-generation sequencing technologies provide short reads and enormous sequencing depth at a much lower cost [13]. This has shifted the focus towards sequencing short HVRs of the 16S rRNA gene at greater depths instead of sequencing the complete gene [14]. This approach works primarily because the lengths of the different variable regions of the 16S rRNA gene lie in the range of 100–300 bp, which can be easily covered using the short paired-end reads produced by commonly used next-generation sequencing technologies [15, 16].

The taxonomic classification of environmental 16S rRNA gene sequences is carried out using either a homology-based or a prediction-based approach. The former requires the alignment of a query 16S rRNA sequence with all the 16S rRNA sequences present in a reference database [17], such as the Ribosomal Database Project [18], Greengenes [19] and SILVA [20]. Several homology-based tools and pipelines are currently available for the analysis of 16S rRNA environmental sequences, such as MEGAN [21], PyNAST [22], UCLUST [23], QIIME [24], EzTaxon [25] and MG-RAST [26]. The major limitations of this approach are the large computational time needed for classification and the dependence on the availability of a homologous sequence in the reference database [27]. Prediction-based approaches are useful in this scenario. One of the most commonly used tools for taxonomic classification is the RDP Classifier, which uses a Naive Bayesian Classifier [28, 29]. It performs well on complete 16S rRNA sequences; however, it provides limited accuracy for individual HVRs, which are short in length [30].

Since recent metagenomic projects routinely sequence only a single HVR or a combination of two or more HVRs, specialized tools are needed for the accurate identification and classification of species using short variable sequences. Therefore, 16S Classifier has been developed using Random Forest (RF), a machine learning based approach, for the taxonomic classification of short 16S rRNA HVRs and complete 16S rRNA gene sequences obtained from metagenomic projects.

Methods

Construction of datasets

A total of 1,262,986 16S rRNA sequences along with their taxonomic information were retrieved from the Greengenes database (version 13_5), which provides a curated database of full-length 16S rRNA sequences [19, 31]. A list of primer pairs specific for each HVR and for combinations of HVRs was prepared from the literature (Table A in S1 File). Since 16S rRNA sequences vary in length, the HVRs were extracted from the complete 16S rRNA gene sequences by aligning the primer pairs using the Fuzznuc program available in the EMBOSS software suite [32]. Primer pairs which could extract the sequence of an HVR from more than 50% of the total sequences present in the database were selected. The V1 and V9 regions were not included: for V1, the known primers could extract sequences from only up to 25% of the total sequences, and for V9, primer pairs could not be found. In addition, these HVRs are not commonly used individually in metagenomic studies. The sequences of each HVR were divided into separate groups based on their taxonomic ranks, from phylum to genus, as per the taxonomy data retrieved from the Greengenes database.

The sequences in each taxonomic rank group were clustered using the CD-HIT (v 4.6) program [33]. For the complete 16S rRNA gene sequences, clustering was performed at a global sequence identity threshold of ‘0.999’ for sequences belonging to the taxonomic rank genus, and a threshold of ‘1’ was used for the higher taxonomic ranks, to remove redundant sequences which may lead to over-training. For all HVRs, the clustering was performed at a global sequence identity threshold of ‘1’ for all taxonomic rank groups. For each taxonomic rank group, all representative sequences obtained after using CD-HIT were used as the training dataset for the respective HVR (Table 1).

Table 1. Summary of the number of HVR sequences which were used for the training and testing of RF*.

https://doi.org/10.1371/journal.pone.0116106.t001

Random Forest (RF)

RF, which is available in R (randomForest package, http://cran.r-project.org/), was the method of choice for developing 16S Classifier for the following reasons: i) fast and easy implementation, ii) ability to analyse large datasets due to its robust classification algorithm, iii) ability to accept a large number of input variables without overfitting, and iv) ability to provide very high accuracy along with information about the importance of variables [34]. RF is an implementation of the bagging approach, where each tree is independently constructed and works as an independent model [35]. Further, RF uses an ensemble learning method for classification and regression, creating many classifier trees and then combining their results, since the result from an ensemble is generally more reliable than that of an individual model [36].

Bootstrapping was used to grow classification trees in the forest using the training dataset. About two thirds of the data was randomly selected to grow each classification tree, and the remaining one third, considered out-of-bag (OOB), was used for prediction. At each split node, a subset of variables (of size mtry) was randomly selected as candidates for the split. Permutation variable importance and the Gini index can be used to examine the importance of a particular variable for classification. Of these, the permutation importance value is the most commonly used, and it was therefore used in this study since it is directly related to the predictive ability [37]. The error of RF, measured in terms of OOB error, depends on the correlation between any two trees and on the strength of each tree in the forest [38].
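The "about two thirds in-bag, one third out-of-bag" split follows from sampling with replacement: for a large dataset, the expected fraction of items never drawn in one bootstrap sample approaches 1/e (about 0.368). The following is a minimal sketch of that property only, not the randomForest implementation; the function name and seed are illustrative assumptions.

```python
import random


def oob_fraction(n_samples, seed=0):
    """Draw one bootstrap sample of size n_samples (with replacement)
    and return the fraction of items that were never drawn (out-of-bag)."""
    rng = random.Random(seed)
    # Indices that were picked at least once form the in-bag set.
    in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(in_bag) / n_samples


if __name__ == "__main__":
    # For large n this is close to 1/e, i.e. roughly one third of the data.
    print(round(oob_fraction(100_000), 3))
```

Each tree sees a different bootstrap sample, so every training sequence is out-of-bag for roughly a third of the trees, which is what allows an error estimate without a separate validation set.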

Optimization of parameters

Optimization of parameters was carried out to obtain the best RF model with the lowest OOB error. The sequences from HVR V3 were used for the optimization since this region is commonly used in metagenomic studies [39] and has an appropriate length (~150 bp) which can be easily covered using next-generation sequencing technologies. Furthermore, this region could be extracted from a large (~98% in this study) diversity of bacterial genomes using its specific primer pair. Nucleotide k-mers of size 2 to 6 were evaluated as input features for the training of RF, with the frequency of each k-mer calculated for every sequence.
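A minimal sketch of how k-mer frequency features of this kind can be computed is given here. It assumes that the frequency of a k-mer is its count divided by the number of valid k-mer windows in the sequence, and that windows containing ambiguous bases are skipped; both choices, and the function name, are assumptions for illustration and may differ from the normalization used by the authors.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return a dict mapping every possible DNA k-mer to its relative
    frequency in seq (assumed normalization: count / valid windows)."""
    seq = seq.upper()
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only windows made purely of A/C/G/T contribute to the total.
    total = sum(counts[m] for m in all_kmers)
    return {m: (counts[m] / total if total else 0.0) for m in all_kmers}


if __name__ == "__main__":
    features = kmer_frequencies("ACGTACGT", k=4)
    print(len(features), features["ACGT"])
```

With k = 4 the feature vector always has 256 entries regardless of the sequence length, which is what makes fixed-size input to a classifier possible for HVRs of different lengths.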

The performance of different k-mer models was tested using the tuneRF function available in the randomForest package. tuneRF searches for the optimal mtry value (the value with the least OOB error), beginning from a given default value, for constructing the RF model. The default mtry value for each k-mer model was calculated as half of the square root of the total number of possible k-mers for that model, whereas the ‘stepFactor’ and ‘improve’ values were set to 1.5 and 0.02, respectively. The OOB error for the 2-mer and 3-mer models was higher than for the 4-mer, 5-mer and 6-mer models (Fig. 1a). Although the 5-mer and 6-mer models showed a marginal (up to ~1%) improvement in the accuracy of prediction (lower OOB error) compared to the 4-mer model, this improvement does not justify the several-fold increase in the time taken to prepare a model and the larger (up to ~4 times) training data size (Fig. 1b and 1c). Therefore, 4-mer was selected as the k-mer size at mtry = 8 (selected using tuneRF) to construct the RF models.
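The rule for the default mtry value can be worked through directly: there are 4^k possible k-mers over the four nucleotides, and the default is half of the square root of that number. The short sketch below only reproduces this arithmetic; the function names are illustrative.

```python
import math


def n_possible_kmers(k):
    """Number of distinct k-mers over the alphabet A, C, G, T."""
    return 4 ** k


def default_mtry(k):
    """Half of the square root of the number of possible k-mers."""
    return math.isqrt(n_possible_kmers(k)) // 2


if __name__ == "__main__":
    for k in range(2, 7):
        print(k, n_possible_kmers(k), default_mtry(k))
```

For k = 4 this gives 256 possible k-mers and a default mtry of 8. The number of features grows four-fold with each increment of k.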

Figure 1. Optimization of parameters using hypervariable region V3.

(a) OOB error at different mtry values for 2-mer, 3-mer, 4-mer, 5-mer and 6-mer models, (b) Effect of k-mer size on time required for the calculation, (c) Size of the input file (used for training) for different k-mer size. From the figure (a), it is apparent that the OOB error for 2-mer and 3-mer models was higher as compared to 4-mer, 5-mer and 6-mer models. The figures (b) and (c) show that the time taken and the training data size were several fold higher for 5-mer and 6-mer models as compared to the 4-mer model.

https://doi.org/10.1371/journal.pone.0116106.g001

RF is able to handle a large number of predictor variables, yet achieving better or similar accuracy using the minimum number of variables is highly desirable for optimal performance. A total of 256 variables are possible using a k-mer size of 4 and can be used as the input. Therefore, to select the minimum number of variables required for an optimal prediction, the importance of each variable at the selected mtry value (mtry = 8) was examined using the permutation variable importance value obtained from the RF model (Fig. A in S1 File). From the complete set of 256 variables, subsets were created by successively removing the 25 least important variables. Using this approach, three new subsets were formed consisting of 231, 206 and 181 variables, which were further used as the input to RF at ntree = 100 and mtry = 8. The OOB errors obtained using these three subsets of variables were compared with the OOB error obtained using the complete set of variables (Fig. 2). The OOB error increased as variables were removed from the total set. Hence, all 256 variables were selected as input variables for constructing the RF model.
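The successive removal of the least important variables can be sketched as follows. This is an illustration under stated assumptions: the variables are ranked once by a given importance score and each subset drops a further 25 from the bottom of that ranking. Whether importance was recomputed after each removal is not stated in the text, and the function name and the placeholder scores are hypothetical.

```python
def importance_subsets(importance, step=25, rounds=3):
    """Given {variable: importance}, return a list of variable subsets,
    each dropping a further `step` least-important variables."""
    # Rank variables from most to least important.
    ranked = sorted(importance, key=importance.get, reverse=True)
    subsets = []
    for r in range(1, rounds + 1):
        keep = len(ranked) - r * step
        subsets.append(ranked[:keep])
    return subsets


if __name__ == "__main__":
    # Placeholder importance scores for 256 hypothetical variables.
    scores = {f"var{i}": float(i) for i in range(256)}
    print([len(s) for s in importance_subsets(scores)])
```

Starting from 256 variables, three rounds of removing 25 give subsets of 231, 206 and 181 variables, matching the sizes described in the text.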

Figure 2. OOB error shows a slight increase on removing variables.

The optimizations were carried out using hypervariable region V3, 4-mer as input and mtry = 8 (the values of these parameters were selected based on Fig. 1).

https://doi.org/10.1371/journal.pone.0116106.g002

To examine the effect of increasing the number of trees (ntree) on the OOB error, the value of ntree (at mtry = 8) was gradually increased to 1000. On increasing the number of trees, a gradual decrease in OOB error was observed, which nearly saturated at ntree = 1000; therefore, ntree = 1000 was selected as the number of trees for constructing the RF models (Fig. 3). The tuneRF function was used to optimise the value of mtry for constructing the RF model for each HVR separately. The final models were created using 4-mer frequencies as the input features, with all 256 variables and ntree = 1000, at the optimum mtry value obtained from the tuneRF function using 10-fold cross validation. A decrease in OOB error was observed for each model on increasing the number of trees (Fig. 4).

Figure 3. Decrease in OOB error was observed on increasing the number of trees (ntree) at mtry = 8.

This optimization was carried out using hypervariable region V3, 4-mer as input variable, mtry = 8 and 256 variables.

https://doi.org/10.1371/journal.pone.0116106.g003

Figure 4. OOB error decreases on increasing the number of trees (ntree) at optimum mtry for different HVRs.

For each individual hypervariable region, the mtry value was optimized separately (using 4-mer as input) and was used for constructing the model at ntree = 1000. V2_mtry8 represents hypervariable region V2 at optimum mtry 8, and the other hypervariable regions are labelled similarly.

https://doi.org/10.1371/journal.pone.0116106.g004

Test datasets

Two test datasets were prepared to evaluate the performance of 16S Classifier. The first test dataset was prepared by randomly extracting ~10% of the HVR sequences from each cluster belonging to the different taxonomic rank groups (Table 1). To examine the effect of sequencing errors, 1% mutations were randomly introduced into the HVR sequences using an in-house Perl script. Test datasets were prepared using this approach for all HVRs. The second test dataset was prepared using real sequence datasets for the different HVRs that are publicly available in the SRA database of NCBI (Table B in S1 File) [40]. The data for the complete 16S rRNA sequences was obtained from the oral cavity samples of 10 healthy individuals (GenBank accession numbers FJ976202 to FJ976448) [12].
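The in-house Perl script is not described further, so the following Python sketch only illustrates one plausible way to introduce 1% random mutations. It assumes substitutions only (no insertions or deletions) and mutates exactly round(1% of the sequence length) distinct positions; the function name and these choices are assumptions, not the authors' procedure.

```python
import random


def mutate(seq, rate=0.01, seed=None):
    """Return a copy of seq with round(len(seq) * rate) positions
    substituted by a different nucleotide (substitutions only)."""
    rng = random.Random(seed)
    n_mut = round(len(seq) * rate)
    # Choose distinct positions so each mutation changes a new site.
    positions = rng.sample(range(len(seq)), n_mut)
    out = list(seq)
    for pos in positions:
        # Pick any base other than the one currently at this position.
        out[pos] = rng.choice([b for b in "ACGT" if b != out[pos]])
    return "".join(out)


if __name__ == "__main__":
    original = "ACGT" * 50  # 200 bp toy sequence
    mutated = mutate(original, seed=1)
    print(sum(a != b for a, b in zip(original, mutated)))
```

For a 200 bp sequence this changes two positions, which gives a sense of how small a perturbation 1% is for a typical HVR.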

Publicly available programs

The BLAST package (version 2.2.26, NCBI) and RDP Classifier (version 2.2) were used for comparison with the results of 16S Classifier [28, 41]. The same version of the Greengenes database which was used for the training of 16S Classifier was used as the reference data for BLAST and as the training data for RDP Classifier.

Results and Discussions

Performance Analysis of HVR models

The performance of the models was assessed using measures computed from the confusion matrix, including sensitivity, specificity, precision and accuracy, where TP = True Positive, FP = False Positive, FN = False Negative and TN = True Negative.

The above measures were calculated for all taxonomic rank groups for a given HVR model. The values for each measure were averaged over all groups to obtain the values for that HVR model. Since the number of ‘True Negatives’ (from the confusion matrix) was very large compared to the number of ‘False Positives’, the values of specificity and accuracy were almost one for all models. Among the models of individual HVRs, the models for the V2, V4 and V8 HVRs displayed the highest precision values of 0.85, 0.87 and 0.85, respectively. These HVRs were also longer (>200 bp) than the other individual HVRs. The models for the V6 and V7 regions showed the lowest precision values (0.63 and 0.65, respectively) and were also the shortest (86 and 107 bp, respectively) among the individual HVRs (Table 1 and Table 2).
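The measures named in this section have standard definitions in terms of TP, FP, FN and TN, and a small sketch shows why specificity and accuracy approach one when true negatives dominate. The standard textbook formulas are assumed here, and the counts in the example are invented purely for illustration; they are not results from the study.

```python
def classification_measures(tp, fp, fn, tn):
    """Standard confusion-matrix measures for one class."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


if __name__ == "__main__":
    # Hypothetical counts: few positives, a very large number of negatives.
    for name, value in classification_measures(80, 20, 20, 100_000).items():
        print(f"{name}: {value:.4f}")
```

In a multi-class setting with many taxa, almost every sequence is a true negative for any single taxon, so specificity and accuracy are close to one even when precision and sensitivity are much lower. This is why precision and sensitivity are the more informative measures here.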

Table 2. Performance of RF models on the different HVRs and complete 16S rRNA.

https://doi.org/10.1371/journal.pone.0116106.t002

Similarly, among the RF models of the combined HVRs, the V34 and V35 regions, which had the longest (>400 bp) lengths, displayed the highest precision values (0.90 and 0.91, respectively). However, the V45 region, which had a much smaller length of 331 bp, also displayed a similar precision value of 0.90. The shortest (236 bp) region, V67, showed the lowest precision value of 0.77. These results indicate that precision is positively correlated (R = 0.85, p≤0) with the length of the HVR. The RF model of the complete 16S rRNA also displayed the highest precision value of 0.91.

Performance on Test Datasets

The performance of 16S Classifier was evaluated on two test datasets. The first test dataset consists of HVR sequences into which 1% mutations were introduced to simulate the effect of sequencing errors. This dataset helps to estimate the accuracy of 16S Classifier when the HVR sequences contain sequencing errors. The performance of 16S Classifier was assessed on individual test datasets for all HVRs (Table 3). 16S Classifier displayed the highest sensitivity (0.98) and precision (0.98) in the case of the V23 region. The highest precision value (0.98) was also observed for the V34 and V45 HVRs. Only for the short HVRs, such as V5 (106 bp), V6 (86 bp) and V7 (107 bp), did 16S Classifier display lower sensitivity (0.78–0.82) and precision (0.83–0.87) values. For all other HVRs, the sensitivity and precision values were in the range of 0.89–0.97 and 0.92–0.97, respectively.

The second dataset consisted of real sequence datasets for all HVRs. The primer regions were removed from the sequences before analysing them using 16S Classifier. The performance of 16S Classifier was compared with that of RDP Classifier (v 2.2), with BLAST (v 2.2.26) used as the reference; these are two commonly used methods for the taxonomic assignment of 16S rRNA sequences. The taxonomic assignments of the BLAST program were taken as the reference to determine the correct taxonomic lineage of the sequences in the real datasets (Text B in S1 File). The performance of 16S Classifier and RDP Classifier was evaluated on the test dataset for each HVR.

For all HVRs and at all taxonomic ranks (except at the genus rank for V7), the results of 16S Classifier were more accurate than those of RDP Classifier (Table 4 and Fig. B in S1 File). At the phylum, class, order, family and genus levels, 16S Classifier displayed up to 42.9%, 40.7%, 41.0%, 57.9% and 73.8% higher accuracy, respectively, than RDP Classifier. These results indicate that 16S Classifier shows much higher accuracy than RDP Classifier at lower taxonomic ranks, such as genus, and attest to its accuracy on different HVRs at all taxonomic ranks. In the case of complete 16S rRNA sequences, 16S Classifier and RDP Classifier displayed comparable accuracy. The time taken for taxonomic analysis by 16S Classifier, RDP Classifier and BLAST was compared using a sample dataset of 5,000 HVR sequences of the V3 region on a Linux workstation with 64 GB RAM and an Intel Xeon 2.4 GHz CPU. 16S Classifier took ~40 seconds, RDP Classifier took ~300 seconds and BLAST took 32,370 seconds on the same dataset. These results indicate that 16S Classifier is much faster in carrying out the taxonomic assignments than the other available methods.
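The relative speed implied by these timings can be made explicit with a few lines of arithmetic. The sketch below simply divides the reported times; the 40 and 300 second figures are approximate in the text, so the resulting ratios are approximate as well, and the names used are illustrative.

```python
# Wall-clock times in seconds reported for 5,000 V3 sequences
# (the first two values are approximate in the text).
TIMINGS_S = {"16S Classifier": 40, "RDP Classifier": 300, "BLAST": 32370}


def times_slower(tool, baseline="16S Classifier", timings=TIMINGS_S):
    """How many times longer `tool` took than `baseline`."""
    return timings[tool] / timings[baseline]


if __name__ == "__main__":
    for tool in ("RDP Classifier", "BLAST"):
        print(f"{tool}: {times_slower(tool):.2f}x the time of 16S Classifier")
```

On these figures, RDP Classifier took about 7.5 times as long as 16S Classifier, and BLAST about 800 times as long.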

Table 4. Comparison of the performance of 16S Classifier with RDP Classifier on real datasets.

https://doi.org/10.1371/journal.pone.0116106.t004

Implementation with QIIME pipeline

The QIIME pipeline has recently become the most commonly used and standard pipeline for the taxonomic analysis of 16S rRNA data obtained from metagenomic datasets [42]. It provides options to use available methods such as RDP Classifier, BLAST, MOTHUR and RTAX for the taxonomic classification of the representative Operational Taxonomic Unit (OTU) sequences obtained after the clustering step in the pipeline. 16S Classifier is compatible with the QIIME pipeline and can be easily used to carry out the taxonomic assignment of OTU sequences within it. It accepts the representative sequences of OTUs in QIIME format and produces output in the format accepted by the QIIME pipeline for downstream analysis. To the best of our knowledge, 16S Classifier is the only available machine learning based tool which can carry out efficient, sensitive and accurate taxonomic assignment of any of the 16S rRNA HVRs which are commonly used in metagenomic projects. It also performed well on complete 16S rRNA sequences, with accuracy comparable to that of RDP Classifier. Thus, wide usage of this tool is anticipated in different metagenomic projects. The standalone software and the webserver of 16S Classifier are available at http://metagenomics.iiserb.ac.in/16Sclassifier and http://metabiosys.iiserb.ac.in/16Sclassifier. The instructions for installing and using the software are provided in Text A in S1 File.

Supporting Information

S1 File. Supporting text, tables, and figures.

Text A. Instructions for running the stand-alone version of 16S Classifier on the Linux PC. Text B. Performance evaluation of BLAST. Table A. Information on the selected primer pairs used for extracting the different HVRs. Table B. Information on the publicly available datasets for different HVRs which were used as the real datasets for comparative analysis. Table C. Accuracy of BLAST and 16S Classifier on the randomly selected test sequences. Fig. A. List of top 30 variables which displayed significant mean decrease in accuracy. Fig. B. Comparison of 16S Classifier with RDP Classifier on real datasets. The results of BLAST were used as the reference for comparing the result of 16S Classifier and RDP Classifier.

https://doi.org/10.1371/journal.pone.0116106.s001

(DOCX)

Acknowledgments

We thank the MHRD, Govt of India, funded Centre for Research on Environment and Sustainable Technologies (CREST) at IISER Bhopal for its support. However, the views expressed in this manuscript are those of the authors alone and no approval of the same, explicit or implicit, by MHRD should be assumed.

Author Contributions

Conceived and designed the experiments: VKS NC AKS. Performed the experiments: NC AKS. Analyzed the data: NC AKS AG. Contributed reagents/materials/analysis tools: NC AKS. Wrote the paper: VKS NC AKS. Contributed to the web server development: PA NC AKS.

References

  1. Thomas T, Gilbert J, Meyer F (2012) Metagenomics - a guide from sampling to data analysis. Microb Inform Exp 2: 3. pmid:22587947
  2. Wooley JC, Godzik A, Friedberg I (2010) A primer on metagenomics. PLoS computational biology 6: e1000667. pmid:20195499
  3. Fuhrman JA (2012) Metagenomics and its connection to microbial community organization. F1000 Biol Rep 4: 15. pmid:22912649
  4. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 66–74. pmid:15001713
  5. Janda JM, Abbott SL (2007) 16S rRNA gene sequencing for bacterial identification in the diagnostic laboratory: pluses, perils, and pitfalls. Journal of Clinical Microbiology 45: 2761–2764. pmid:17626177
  6. Case RJ, Boucher Y, Dahllöf I, Holmström C, Doolittle WF, et al. (2007) Use of 16S rRNA and rpoB genes as molecular markers for microbial ecology studies. Applied and Environmental Microbiology 73: 278–288. pmid:17071787
  7. Petrosino JF, Highlander S, Luna RA, Gibbs RA, Versalovic J (2009) Metagenomic pyrosequencing and microbial identification. Clinical Chemistry 55: 856–866. pmid:19264858
  8. Hao X, Chen T (2012) OTU analysis using metagenomic shotgun sequencing data. PloS one 7: e49785. pmid:23189163
  9. Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, et al. (2008) A core gut microbiome in obese and lean twins. Nature 457: 480–484. pmid:19043404
  10. Andersson AF, Lindberg M, Jakobsson H, Bäckhed F, Nyrén P, et al. (2008) Comparative analysis of human gut microbiota by barcoded pyrosequencing. PloS one 3: e2836. pmid:18665274
  11. Dethlefsen L, Huse S, Sogin ML, Relman DA (2008) The pervasive effects of an antibiotic on the human gut microbiota, as revealed by deep 16S rRNA sequencing. PLoS Biology 6: e280. pmid:19018661
  12. Bik EM, Long CD, Armitage GC, Loomer P, Emerson J, et al. (2010) Bacterial diversity in the oral cavity of 10 healthy individuals. The ISME journal 4: 962–974. pmid:20336157
  13. Desai A, Marwah VS, Yadav A, Jha V, Dhaygude K, et al. (2013) Identification of optimum sequencing depth especially for de novo genome assembly of small genomes using next generation sequencing data. PloS one 8: e60204. pmid:23593174
  14. Mizrahi-Man O, Davenport ER, Gilad Y (2013) Taxonomic classification of bacterial 16S rRNA genes using short sequencing reads: evaluation of effective study designs. PloS one 8: e53608. pmid:23308262
  15. Zhang J, Kobert K, Flouri T, Stamatakis A (2014) PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics 30: 614–620. pmid:24142950
  16. Aravindraja C, Viszwapriya D, Pandian SK (2013) Ultradeep 16S rRNA Sequencing Analysis of Geographically Similar but Diverse Unexplored Marine Samples Reveal Varied Bacterial Community Composition. PloS one 8: e76724. pmid:24167548
  17. Jonasson J, Olofsson M, Monstein HJ (2002) Classification, identification and subtyping of bacteria based on pyrosequencing and signature matching of 16S rDNA fragments. Apmis 110: 263–272. pmid:12076280
  18. Cole JR, Wang Q, Cardenas E, Fish J, Chai B, et al. (2009) The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic acids research 37: D141–D145. pmid:19004872
  19. DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, et al. (2006) Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Applied and environmental microbiology 72: 5069–5072. pmid:16820507
  20. Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, et al. (2007) SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic acids research 35: 7188–7196. pmid:17947321
  21. Mitra S, Stärk M, Huson DH (2011) Analysis of 16S rRNA environmental sequences using MEGAN. BMC genomics 12: S17. pmid:22369513
  22. Caporaso JG, Bittinger K, Bushman FD, DeSantis TZ, Andersen GL, et al. (2010) PyNAST: a flexible tool for aligning sequences to a template alignment. Bioinformatics 26: 266–267. pmid:19914921
  23. Edgar RC (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26: 2460–2461. pmid:20709691
  24. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, et al. (2010) QIIME allows analysis of high-throughput community sequencing data. Nature methods 7: 335–336. pmid:20383131
  25. Chun J, Lee J-H, Jung Y, Kim M, Kim S, et al. (2007) EzTaxon: a web-based tool for the identification of prokaryotes based on 16S ribosomal RNA gene sequences. International Journal of Systematic and Evolutionary Microbiology 57: 2259–2261. pmid:17911292
  26. Meyer F, Paarmann D, D’Souza M, Olson R, Glass EM, et al. (2008) The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC bioinformatics 9: 386. pmid:18803844
  27. Gupta A, Kapil R, Dhakan DB, Sharma VK (2014) MP3: A Software Tool for the Prediction of Pathogenic Proteins in Genomic and Metagenomic Data. PloS one 9: e93907. pmid:24736651
  28. Wang Q, Garrity GM, Tiedje JM, Cole JR (2007) Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and environmental microbiology 73: 5261–5267. pmid:17586664
  29. Claesson MJ, Wang Q, O’Sullivan O, Greene-Diniz R, Cole JR, et al. (2010) Comparison of two next-generation sequencing technologies for resolving highly complex microbiota composition using tandem variable 16S rRNA gene regions. Nucleic Acids Research 38: e200. pmid:20880993
  30. Qunfeng D, Claudia V (2012) Evaluation of the RDP classifier accuracy using 16S rRNA gene variable regions. Metagenomics 2012.
  31. McDonald D, Price MN, Goodrich J, Nawrocki EP, DeSantis TZ, et al. (2011) An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. The ISME journal 6: 610–618. pmid:22134646
  32. Mullan LJ, Bleasby AJ (2002) Short EMBOSS user guide. Briefings in Bioinformatics 3: 92–94. pmid:12002228
  33. Fu L, Niu B, Zhu Z, Wu S, Li W (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28: 3150–3152. pmid:23060610
  34. Biau G (2012) Analysis of a random forests model. The Journal of Machine Learning Research 98888: 1063–1095.
  35. Panov P, Džeroski S (2007) Combining bagging and random subspaces to create better ensembles: Springer.
  36. Breiman L (2001) Random forests. Machine learning 45: 5–32.
  37. Strobl C, Boulesteix A-L, Zeileis A, Hothorn T (2007) Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC bioinformatics 8: 25. pmid:17254353
  38. Zhang J, Zulkernine M. A hybrid network intrusion detection technique using random forests; 2006. IEEE. pp. 8 pp.
  39. Huse SM, Dethlefsen L, Huber JA, Welch DM, Relman DA, et al. (2008) Exploring microbial diversity and taxonomy using SSU rRNA hypervariable tag sequencing. PLoS genetics 4: e1000255. pmid:19023400
  40. Leinonen R, Sugawara H, Shumway M (2010) The sequence read archive. Nucleic Acids Research: gkq1019.
  41. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. Journal of Molecular Biology 215: 403–410. pmid:2231712
  42. D’Argenio V, Casaburi G, Precone V, Salvatore F (2014) Comparative Metagenomic Analysis of Human Gut Microbiome Composition Using Two Different Bioinformatic Pipelines. BioMed research international 2014. https://doi.org/10.1155/2014/325340 pmid:24719854