RESCRIPt: Reproducible sequence taxonomy reference database management

doi:10.1371/journal.pcbi.1009581

Fig 6

Comparison of sequence information across each successive sequence quality filtering step as applied to the SILVA 16S rRNA gene database.

A, Sequence length distributions. B, Number of unique sequences. C, Entropy of full-length sequences and of different kmer lengths. Note: the sequence length filtering step had no effect on the data because the NR99 reference database is already pre-trimmed, as specified above. Base: the complete SILVA NR99 database. Culled: after removal of sequences containing homopolymers of 8 or more bases and/or 5 or more ambiguous bases. LengFiltByTax: sequence length filtering based on taxonomy, i.e. removal of archaeal sequences shorter than 900 bp and bacterial sequences shorter than 1200 bp. DereplicateUniq: taxonomy and sequence dereplication using “uniq” mode (i.e. identical sequences with differing taxonomy are not merged). NoAmbigLabels: sequences associated with ambiguous labels (typically at lower taxonomic ranks) are removed from the data set.
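The thresholds named in the legend can be illustrated with a minimal Python sketch. This is not RESCRIPt's implementation: the function names (`passes_cull`, `passes_length`, `kmer_entropy`) are hypothetical, the taxonomy match is a simple substring test, and the entropy is assumed here to be the Shannon entropy (in bits) of the pooled kmer counts, which may differ from how panel C was computed.

```python
import math
import re
from collections import Counter

# IUPAC degenerate nucleotide codes (assumption: these count as "ambiguous bases").
AMBIGUOUS = set("RYSWKMBDHVN")

# Minimum lengths per domain, taken from the figure legend.
MIN_LENGTHS = {"Archaea": 900, "Bacteria": 1200}


def longest_homopolymer(seq):
    """Return the length of the longest run of one repeated base."""
    return max((len(m.group(0)) for m in re.finditer(r"(.)\1*", seq)), default=0)


def passes_cull(seq, homopolymer_length=8, num_ambiguous=5):
    """Keep a sequence only if it has no homopolymer of 8+ bases
    and fewer than 5 ambiguous bases."""
    seq = seq.upper()
    n_ambiguous = sum(base in AMBIGUOUS for base in seq)
    return (longest_homopolymer(seq) < homopolymer_length
            and n_ambiguous < num_ambiguous)


def passes_length(seq, taxonomy):
    """Keep a sequence if it meets the minimum length for its domain.
    Sequences whose taxonomy matches no listed domain are kept."""
    for label, min_len in MIN_LENGTHS.items():
        if label in taxonomy:
            return len(seq) >= min_len
    return True


def kmer_entropy(seqs, k):
    """Shannon entropy (bits) of the pooled kmer counts across all sequences."""
    counts = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Applying `passes_cull` and then `passes_length` to each record mirrors the order of the Culled and LengFiltByTax steps; `kmer_entropy` can then be evaluated on the surviving sequences for several values of `k` to compare steps.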

doi: https://doi.org/10.1371/journal.pcbi.1009581.g006