Database size positively correlates with the loss of species-level taxonomic resolution for the 16S rRNA and other prokaryotic marker genes

Fig 3

Workflow diagram of the analysis done for the Listeria marker gene simulated databases (16S rRNA and 40 marker genes).

First, 5,014 Listeria draft genomes were downloaded from RefSeq, and the 16S rRNA and 40 marker genes were predicted with Barrnap and FetchMG, respectively. Genes shorter than half or longer than twice the mean length for a given marker gene were removed. To create the simulated databases for each marker gene, we randomly subsampled the sequences into subsets ranging in size from 1,000 to 5,000 sequences in increments of 1,000 sequences. We repeated this process 100 times to estimate the variability of our results. Each simulated database was clustered at 95%, 97%, 99%, and 100% identity, requiring that shorter sequences fully align to longer ones.
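The length filter and subsampling steps described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names `filter_by_length` and `simulate_databases` are hypothetical, sampling is assumed to be without replacement, the mean length is assumed to be computed over all predicted sequences for a marker before filtering, and the clustering step is left out because the caption does not name the tool used.

```python
import random
from statistics import mean


def filter_by_length(seqs, low=0.5, high=2.0):
    """Keep sequences whose length lies within [low, high] x the mean length.

    `seqs` maps a sequence name to its sequence string for ONE marker gene.
    The 0.5 and 2.0 bounds mirror "half" and "twice" the mean in the caption.
    """
    avg = mean(len(s) for s in seqs.values())
    return {
        name: s
        for name, s in seqs.items()
        if low * avg <= len(s) <= high * avg
    }


def simulate_databases(seqs, sizes=range(1000, 5001, 1000), replicates=100, seed=0):
    """Yield (size, replicate, sampled_names) for each simulated database.

    Assumption: each database is a random draw WITHOUT replacement.
    Sizes larger than the number of available sequences are skipped.
    """
    rng = random.Random(seed)   # fixed seed so the draws are reproducible
    names = sorted(seqs)        # stable order before sampling
    for size in sizes:
        if size > len(names):
            continue
        for rep in range(replicates):
            yield size, rep, rng.sample(names, size)
```

With the defaults, one marker gene yields 5 sizes x 100 replicates = 500 simulated databases, provided at least 5,000 sequences survive the length filter. Each sampled set would then be clustered at the four identity thresholds.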

doi: https://doi.org/10.1371/journal.pcbi.1012343.g003