16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses
Fig 7
Maturation of the body site classification decision as sequence embeddings are introduced.
Embedding results were generated using 256 dimensional embeddings of 10-mers without denoising (since the number of reads varies throughout the figure). One American Gut tongue sample is shown, which was misclassified by lasso as “fecal.” Read activations are defined as the linear combination of a given sequence embedding and the regression coefficients obtained from lasso for a particular body site. A: The trajectory of the body site classification decision via multinomial lasso. The cumulative activation is the sum of all read activations across all nodes (dimensions of the embedding) up until the introduction of a specific read. A body site is favored when it has the largest cumulative activation of the three body sites. Reads were sorted and introduced based on their taxon (phylum designations are color coded). B: The (non-cumulative) activations (across all nodes) for body site as reads are introduced (with no accumulation from previous reads). Genera labels are shown for reads with large activations. C-F: The (non-cumulative) activations for individual nodes and specific body sites as reads are introduced.