16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses

doi:10.1371/journal.pcbi.1006721

Fig 1.

t-SNE projection of sequence embeddings from KEGG 16S sequences.

Embedding results were generated using 256-dimensional embeddings of 10-mers that were denoised. A: 2-dimensional projection via t-SNE of the sequence embedding space from 14,520 KEGG 16S sequences. Each point is one sequence, colored by its phylum designation. B: t-SNE projection of sequences that belong to different genera within the same family.

Fig 2.

t-SNE projection of 6-mer frequency table from KEGG 16S sequences.

A: 2-dimensional projection via t-SNE of the sequence embedding space from 14,520 KEGG 16S sequences. Each point is one sequence, colored by its phylum designation. B: t-SNE projection of sequences that belong to different genera within the same family.
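
As a rough illustration of what a 6-mer frequency table contains, the sketch below counts overlapping k-mers per sequence with a sliding window. The function names (`kmer_counts`, `kmer_table`) are hypothetical, and the use of raw counts over a fixed A/C/G/T vocabulary is an assumption; the caption does not state how the table was normalized or how ambiguous bases were handled.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k=6):
    """Count overlapping k-mers in one sequence using a sliding window."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_table(seqs, k=6, alphabet="ACGT"):
    """Build a sequences-by-k-mers count table over all 4**k possible k-mers.

    Assumption: k-mers containing characters outside `alphabet` (e.g. N)
    have no column and are therefore dropped.
    """
    columns = ["".join(p) for p in product(alphabet, repeat=k)]
    rows = []
    for seq in seqs:
        counts = kmer_counts(seq, k)
        rows.append([counts.get(kmer, 0) for kmer in columns])
    return columns, rows


# Toy usage with made-up sequences (not KEGG data).
columns, rows = kmer_table(["ACGTACGTAC", "TTTTTTTTGG"], k=6)
```

Each row of such a table could then be passed to a projection method such as t-SNE, one row per sequence.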

Fig 3.

Within-taxon distribution of pairwise sequence alignment similarity versus pairwise embedding similarity.

Embedding results were generated using 256-dimensional embeddings of 10-mers that were denoised. For a given taxonomic level, the red violin plots represent the distribution of pairwise cosine similarity between all sequence embeddings from the 14,520 KEGG 16S rRNA sequences, whereas the blue violin plots represent the distribution of pairwise nucleotide sequence identity from global alignment via VSEARCH. Both sets of scores were z-scored to make them visually comparable. Linear regression best-fit lines are shown to ease interpretation.
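
The comparison described here can be sketched as follows: compute all pairwise cosine similarities between sequence embeddings, then z-score them so they share a scale with the z-scored alignment identities. This is a minimal sketch with NumPy on random toy data; the helper names are hypothetical, and the identity values are placeholders rather than actual VSEARCH output.

```python
import numpy as np


def pairwise_cosine(embeddings):
    """Return cosine similarity for every unordered pair of row vectors."""
    X = np.asarray(embeddings, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # scale rows to unit length
    sims = unit @ unit.T                                 # full similarity matrix
    i, j = np.triu_indices(len(X), k=1)                  # keep each pair once
    return sims[i, j]


def zscore(values):
    """Center to mean 0 and scale to standard deviation 1."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / v.std()


rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(5, 256))      # 5 toy sequences, 256 dimensions
toy_identity = rng.uniform(0.7, 1.0, size=10)   # placeholder pairwise identities

cos_z = zscore(pairwise_cosine(toy_embeddings))
id_z = zscore(toy_identity)
```

After z-scoring, both score sets have mean 0 and unit variance, which is what allows the two violin plots to be drawn on a common axis.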

Fig 4.

Agreement between consensus sequence embeddings and their cluster embeddings.

For each cluster, all KEGG 16S sequences were embedded into a cluster embedding, and the cluster’s VSEARCH consensus sequence was embedded into a consensus embedding. The pairwise cosine similarities between all consensus and cluster embeddings are shown, sorted by the (arbitrary) cluster membership index. Darker shading indicates larger cosine similarity between cluster and consensus embeddings. The dark diagonal therefore indicates that each cluster embedding is similar to the consensus embedding for that cluster.

Table 1.

Clustering analysis of KEGG sequence embeddings.

Fig 5.

Lower dimensional projections of k-mer, sequence, and sample embeddings.

Embedding results were generated using 256-dimensional embeddings of 10-mers that were denoised. A: A 2-dimensional projection via independent component analysis of the 10-mer embedding space from the GreenGenes training sequences. 406,922 unique 10-mers are shown. The positions of 10-mers that differ by one nucleotide from AAAAAAAAAA are labeled to demonstrate that it is not simply sequence similarity that is preserved, since these sequences span a wide range of the embedding space. The k-mers were sorted alphabetically and ranked; the alphabetical progression of the indexes is shaded from yellow to green. B: A 2-dimensional projection via independent component analysis. 705,598 total sequence embeddings from 21 randomly chosen American Gut samples (7 from each class) are shown. Each point is one sequence, colored by its phylum designation (only the 7 most abundant phyla are shown). C: 2-dimensional t-SNE projection of the 11,341 American Gut sample embeddings. Each point is one sample, colored by its body site label.

Fig 6.

Lower dimensional projections of 6-mer sample embeddings using the k-mer method.

2-dimensional t-SNE projection of the 11,341 American Gut sample embeddings based on the k-mer method (where k = 6). Each point is one sample, colored by its body site label.

Table 2.

Sample embedding classification performance.

Fig 7.

Maturation of the body site classification decision as sequence embeddings are introduced.

Embedding results were generated using 256-dimensional embeddings of 10-mers without denoising (since the number of reads varies throughout the figure). One American Gut tongue sample is shown, which was misclassified by lasso as “fecal.” Read activations are defined as the linear combination of a given sequence embedding and the regression coefficients obtained from lasso for a particular body site. A: The trajectory of the body site classification decision via multinomial lasso. The cumulative activation is the sum of all read activations across all nodes (dimensions of the embedding) up to the introduction of a specific read. A body site is favored when it has the largest cumulative activation of the three body sites. Reads were sorted and introduced based on their taxon (phylum designations are color coded). B: The (non-cumulative) activations (across all nodes) for each body site as reads are introduced (with no accumulation from previous reads). Genus labels are shown for reads with large activations. C-F: The (non-cumulative) activations for individual nodes and specific body sites as reads are introduced.
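
The activation quantities in this caption can be sketched as below, assuming the read embeddings are stored as rows of a matrix and the lasso coefficients as one row per body site. All function names are hypothetical, the intercept terms of the multinomial model are ignored, and the per-node reading of panels C-F (an elementwise product before summing) is an interpretation of the caption rather than a confirmed detail.

```python
import numpy as np


def read_activations(read_embeddings, coefs):
    """Activation of each read for each body site (panel B).

    read_embeddings: (n_reads, n_nodes); coefs: (n_sites, n_nodes).
    Returns (n_reads, n_sites): the dot product of embedding and coefficients.
    """
    return np.asarray(read_embeddings, dtype=float) @ np.asarray(coefs, dtype=float).T


def cumulative_activations(activations):
    """Running sum of read activations in order of introduction (panel A)."""
    return np.cumsum(activations, axis=0)


def node_activations(read_embeddings, coefs, site_index):
    """Per-node contribution of each read for one body site (assumed panels C-F)."""
    return np.asarray(read_embeddings, dtype=float) * np.asarray(coefs, dtype=float)[site_index]


def favored_site(cumulative, sites):
    """Body site with the largest cumulative activation after each read."""
    return [sites[k] for k in np.argmax(cumulative, axis=1)]


# Toy values (not from the paper): 3 reads, 2 nodes, 3 body sites.
sites = ["fecal", "skin", "tongue"]
reads = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
coefs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])

acts = read_activations(reads, coefs)
cum = cumulative_activations(acts)
trajectory = favored_site(cum, sites)
```

In this toy case the favored site switches from "fecal" to "skin" once the second read is introduced, which mirrors how the decision trajectory in panel A can change as reads accumulate.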

Fig 8.

Regions within Lachnospiraceae reads to which skin-associated k-mers mapped.

The top 1,000 k-mers with the largest activations (the linear combination of the k-mer embedding and regression coefficients obtained via lasso) for skin were identified. A random sample of 25,000 reads (from skin samples) containing these k-mers underwent multiple alignment. Shown is the relative position of k-mers found only in Lachnospiraceae reads from skin samples. The position of a given k-mer in the alignment spans from its start position to its end position, including gaps. The frequency with which these k-mers mapped to positions (left column) in the multiple alignment was quantified (y-axis). The order of the k-mers in the heatmap (x-axis) was obtained via hierarchical clustering (Ward’s method) on Bray-Curtis distances. The Lachnospiraceae genus that most frequently occurred at a given alignment position for a given k-mer is colored (right column).
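
The k-mer ordering step can be approximated with SciPy, as in the sketch below: Bray-Curtis distances between k-mer columns, Ward linkage, then the leaf order of the resulting tree. The matrix layout (alignment positions as rows, k-mers as columns) and the toy counts are assumptions, and the software actually used for the figure is not stated in the caption. Note that SciPy accepts non-Euclidean distances for Ward linkage, although its documentation describes the method as formally defined for Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)

# Toy frequency matrix: 8 alignment positions (rows) x 5 k-mers (columns).
# Counts start at 1 so no column is all zeros (Bray-Curtis is undefined there).
freq = rng.integers(1, 20, size=(8, 5)).astype(float)

# Condensed Bray-Curtis distances between k-mers (columns, hence the transpose).
distances = pdist(freq.T, metric="braycurtis")

# Ward linkage on the precomputed distances, then the dendrogram leaf order.
tree = linkage(distances, method="ward")
kmer_order = leaves_list(tree)

# Reorder heatmap columns so that similar k-mers sit next to each other.
freq_ordered = freq[:, kmer_order]
```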

Fig 9.

Distribution of pairwise cosine similarity for k-mer, sequence, and sample embeddings of the American Gut data.

Sequence and sample embeddings were calculated from 21 randomly selected (7 for each body site) samples. k-mer embeddings were those used during training with the GreenGenes sequences. Shown are distributions of pairwise cosine similarities as a function of k-mer size (rows), denoising (blue versus red), and embedding space (columns).
