RESCRIPt: Reproducible sequence taxonomy reference database management

doi:10.1371/journal.pcbi.1009581

Fig 1.

Current RESCRIPt functionality for processing and curating reference sequence data.

Arrows indicate suggested workflows. Dotted arrows and edges indicate optional steps for customized workflows.

Fig 2.

Comparison of sequence information from SILVA, Greengenes, GTDB, and NCBI-RefSeq 16S rRNA gene databases.

A, Sequence length distributions (after removing outliers, see materials and methods). B, Number of unique sequences in each database. C, Entropy of full-length sequences and different kmer lengths in each database.
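
To make the entropy measure in panel C concrete, the following is a minimal sketch that assumes the metric is Shannon entropy computed over k-mer frequencies pooled across all sequences. The caption does not state the exact definition, so the function name `kmer_entropy` and the pooling choice are illustrative assumptions and not RESCRIPt's implementation.

```python
import math
from collections import Counter


def kmer_entropy(sequences, k):
    """Shannon entropy (in bits) of the pooled k-mer frequency distribution.

    Hypothetical helper for illustration; not part of RESCRIPt.
    """
    counts = Counter()
    for seq in sequences:
        # Slide a window of width k across each sequence.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        # No k-mers could be extracted (all sequences shorter than k).
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Under this definition, a database whose k-mers are spread evenly over many distinct values scores higher than one dominated by a few repeated k-mers.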

Fig 3.

Comparison of taxonomic information and simulated classification accuracy from SILVA, Greengenes, GTDB, and NCBI-RefSeq 16S rRNA gene databases.

A, Number of unique taxonomic labels; B, Taxonomic entropy; C, proportion of unclassified taxa at each rank; D, optimal classification accuracy (as F-Measure) without cross-validation (simulating best possible classification accuracy when the true label is known but classification accuracy may be confounded by other similar hits in the database). Cross-validation was not used because two of the databases (GTDB and NCBI-RefSeq) lack replicate species. Rank labels on x-axis: D = domain, P = phylum, C = class, O = order, F = family, G = genus, S = species.

Fig 4.

Comparison of taxonomic coverage among SILVA, Greengenes, GTDB, and NCBI-RefSeq 16S rRNA gene databases.

Each panel displays the proportion of taxa represented in one reference database (as indicated in the panel title) that are shared with each other database at each taxonomic rank. The legends indicate which groups are being compared: the reference alone (always 1.0, shown for clarity), pairs consisting of the reference and one other database, trios consisting of the reference and two other databases, and the proportion of the reference’s labels that are shared by all four databases. Rank labels on x-axis: D = domain, P = phylum, C = class, O = order, F = family, G = genus, S = species.
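
The proportions described above can be sketched with plain set operations. This is a hedged illustration only: `shared_proportion` is a hypothetical helper, and it assumes the taxonomic labels at a given rank have already been extracted into one collection per database.

```python
def shared_proportion(reference, *others):
    """Fraction of labels in `reference` that also occur in every set in `others`.

    Hypothetical helper for illustration; not part of RESCRIPt.
    """
    reference = set(reference)
    if not reference:
        return 0.0
    # With no other databases, the intersection is the reference itself,
    # which gives the "reference alone" value of 1.0.
    shared = reference.intersection(*map(set, others))
    return len(shared) / len(reference)
```

Calling it with one other database gives a pairwise value, with two gives a trio value, and with all three gives the proportion of the reference's labels shared by all four databases.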

Fig 5.

Average family-level taxonomic composition of EMP EMPO 3 sample types.

Family-level classification as predicted by SILVA, Greengenes, GTDB, or NCBI-RefSeq classifiers. Samples were grouped by EMPO 3 type to compare the average family-level taxonomic composition of each sample type. Only taxa detected at a minimum of 10% relative frequency in at least one group are shown.

Fig 6.

Comparison of sequence information across each successive sequence quality filtering step as applied to the SILVA 16S rRNA gene database.

A, Sequence length distributions. B, Number of unique sequences. C, Entropy of full-length sequences and different kmer lengths. Note: the subsequent sequence length filtering had no effect on the data because the NR99 reference database is already pre-trimmed as specified above. Base: the complete SILVA NR99 database. Culled: after removal of sequences containing homopolymers of 8 or more bases and/or 5 or more ambiguous bases. LengFiltByTax: after sequence length filtering based on taxonomy, i.e. removal of archaeal and bacterial sequences shorter than 900 and 1200 bp, respectively. DereplicateUniq: after taxonomy and sequence dereplication using “uniq” mode (i.e. identical sequences with differing taxonomy are not merged). NoAmbigLabels: after removal of any sequence data associated with ambiguous labels (typically at lower taxonomic ranks).
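
The culling and taxonomy-based length filters named in this legend can be sketched as simple predicates. This is an illustrative approximation, not RESCRIPt's code: the function names, the use of IUPAC degenerate codes to define "ambiguous", and the substring match on taxonomy strings are all assumptions made for the example.

```python
import re

# IUPAC degenerate nucleotide codes, treated here as "ambiguous bases" (assumption).
AMBIGUOUS = set("RYSWKMBDHVN")


def passes_cull(seq, homopolymer_length=8, num_ambiguous=5):
    """Keep a sequence only if it has no homopolymer run of `homopolymer_length`
    or more bases and fewer than `num_ambiguous` ambiguous bases."""
    seq = seq.upper()
    n_ambiguous = sum(base in AMBIGUOUS for base in seq)
    # One character followed by at least (homopolymer_length - 1) repeats of itself.
    run_pattern = r"(.)\1{%d,}" % (homopolymer_length - 1)
    has_long_run = re.search(run_pattern, seq) is not None
    return n_ambiguous < num_ambiguous and not has_long_run


def passes_length(seq, taxonomy, min_lengths):
    """Keep a sequence if it meets the minimum length of the first label in
    `min_lengths` found in its taxonomy string; unmatched taxa are kept."""
    for label, min_len in min_lengths.items():
        if label in taxonomy:
            return len(seq) >= min_len
    return True
```

For the thresholds in this legend, `min_lengths` would be `{"Archaea": 900, "Bacteria": 1200}`.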

Fig 7.

Comparison of taxonomic information and simulated classification accuracy across several successive steps of quality filtering of the SILVA NR99 16S rRNA gene database.

A, Number of unique taxonomic labels; B, Taxonomic entropy; C, optimal classification accuracy from the evaluate-fit-classifier action (as F-Measure) without cross-validation (simulating best possible classification accuracy when the true label is known but classification accuracy may be confounded by other similar hits in the database); D, optimal classification accuracy from the evaluate-cross-validate action (as F-Measure), which simulates a pseudo-realistic classification task in which a set of query sequences may not have an exact match in the reference database. See the Fig 6 legend for label descriptions. Rank labels on x-axis: D = domain, P = phylum, C = class, O = order, F = family, G = genus, S = species.

Fig 8.

Taxonomic information (A-C) and classification accuracy (D-E) of Greengenes 16S rRNA gene database clustered at different similarity thresholds. Subpanels show taxonomic/classification characteristics at each taxonomic level: A, Number of unique taxonomic labels; B, Taxonomic entropy; C, number of taxa that terminate at that level; D, optimal classification accuracy (as F-Measure) without cross-validation (simulating best possible classification accuracy when the true label is known but classification accuracy may be confounded by other similar hits in the database); E, classification accuracy (F-Measure) with cross-validation (simulating realistic classification tasks when the correct label is unknown). Rank labels on x-axis: D = domain, P = phylum, C = class, O = order, F = family, G = genus, S = species.

Fig 9.

Taxonomic information (A–C) and classification accuracy (D–E) of UNITE ITS domain database with different filtering and clustering settings. Filtered versions include the "all Eukaryotes" 2020.04.02 release version containing all Eukaryotes ("All Euks"), filtered to contain only Fungi, and filtered to contain only Fungi with at least order-level taxonomic annotation ("Fungi Order"). Cluster levels indicate which UNITE release version was used: sequences clustered at 97% similarity, 99% similarity, or the UNITE "dynamic" species hypothesis threshold. Subpanels show taxonomic/classification characteristics at each taxonomic level: A, Number of unique taxonomic labels; B, Taxonomic entropy; C, proportion of taxa that terminate at that level; D, optimal classification accuracy (as F-Measure) without cross-validation (simulating best possible classification accuracy when the true label is known but classification accuracy may be confounded by other similar hits in the database); E, classification accuracy (F-Measure) with cross-validation (simulating realistic classification tasks when the correct label is unknown). Rank labels on x-axis: K = kingdom, P = phylum, C = class, O = order, F = family, G = genus, S = species. See Materials and Methods for more details on how these databases were created and processed.

Fig 10.

Comparison of sequence information from BOLD COI gene database for available arthropod and chordate sequences.

Differences in datasets reflect whether sequences were trimmed to a particular primer region (boldANML) or not (boldFull), and whether sequences were dereplicated (100) or clustered at a particular percent identity (97, 98, 99). A, Number of unique sequences. B, Entropy of sequences and different kmer lengths.

Fig 11.

Comparison of taxonomic information and simulated classification accuracy from BOLD COI gene database for available arthropod and chordate sequences.

Differences in datasets reflect whether sequences were trimmed to a particular primer region (boldANML) or not (boldFull), and whether sequences were dereplicated (_100) or clustered at a particular percent identity (_97, _98, _99). A, Number of unique taxonomic labels; B, Taxonomic entropy; C, proportion of unclassified taxa at each rank; D, optimal classification accuracy (as F-Measure) without cross-validation (simulating best possible classification accuracy when the true label is known but classification accuracy may be confounded by other similar hits in the database). E, Classification accuracy with cross-validation. Rank labels on x-axis: K = kingdom, P = phylum, C = class, O = order, F = family, G = genus, S = species.

Fig 12.

Comparison of sequence information from BOLD and NCBI GenBank COI gene databases for available arthropod and chordate sequences.

All sequences were dereplicated and trimmed to a common primer region. NCBI references either contained a cross-reference term to BOLD (“ncbiOB”) or not (“ncbiNB”) or were combined together (“ncbiAll”). A, Number of unique sequences (note difference in scales between Arthropoda and Chordata). B, Entropy of sequences and different kmer lengths.

Fig 13.

Comparison of taxonomic information and simulated classification accuracy from BOLD and NCBI GenBank COI gene databases for available arthropod and chordate sequences.

All sequences were dereplicated and trimmed to a common primer region. NCBI references either contained a cross-reference term to BOLD (“ncbiOB”) or not (“ncbiNB”) or were combined together (“ncbiAll”). A, Number of unique taxonomic labels; B, Taxonomic entropy; C, proportion of unclassified taxa at each rank; D, optimal classification accuracy (as F-Measure) without cross-validation (simulating best possible classification accuracy when the true label is known but classification accuracy may be confounded by other similar hits in the database). E, Classification accuracy with cross-validation. Rank labels on x-axis: K = kingdom, P = phylum, C = class, O = order, F = family, G = genus, S = species.

Fig 14.

An example of using RESCRIPt for reproducible genomics workflows.

HEV genomes were downloaded from NCBI-GenBank and used to make a reference genome classifier based on the following geographic locations: Bangladesh (BD), China (CN), France (FR), India (IN), and the United Kingdom (UK). The interoperability of RESCRIPt with other QIIME 2 plugins enables users to chain together a variety of functions into fully reproducible workflows that record processing decisions in data provenance. A, a simplified data provenance graph highlighting our workflow leveraging RESCRIPt, q2-sourmash, q2-diversity, q2-sample-classifier, and EMPeror. B, PCoA plot of individual HEV genomes based on MASH signature comparison results. C, k-nearest-neighbor classification accuracy based on MASH signature dissimilarities and geographic location.
