Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy

doi:10.1371/journal.pone.0258693

Fig 1.

Workflow for alignment-free genome comparisons used in this work.

(A) Genomes are processed from left to right with a sliding window of fixed length resulting in k-mer databases. As an example, k-mer extraction is shown for a short section of three DNA sequences with k = 4. (B) k-mer set comparisons, e.g. union and intersection, are then computed for pairs of genomes. Arrows indicate the k-mers shared between k-mer sets. (C) From the set comparisons, similarity scores are calculated resulting in a pairwise similarity matrix. (D) Hierarchical clustering of the similarity matrix yields a tree which can be compared to a reference phylogenetic tree.
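
Steps (A) to (C) of this workflow can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the three toy sequences and the function names `kmer_set` and `jaccard` are hypothetical, Jaccard similarity is used as the score because it is the measure named in the later figures, and reverse-complement (canonical) k-mers are ignored for simplicity.

```python
from itertools import combinations

def kmer_set(seq, k):
    """Slide a window of length k across seq and collect the distinct k-mers."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical toy sequences, not taken from the paper.
genomes = {
    "g1": "ACGTACGTGA",
    "g2": "ACGTACGTCC",
    "g3": "TTTTGGGGCC",
}
k = 4

# (A) one k-mer set per genome
kmers = {name: kmer_set(seq, k) for name, seq in genomes.items()}

# (B) + (C) pairwise set comparison turned into a similarity score
similarity = {
    (x, y): jaccard(kmers[x], kmers[y])
    for x, y in combinations(genomes, 2)
}

for pair, score in similarity.items():
    print(pair, score)
```

Here g1 and g2 share four of their eight distinct 4-mers, giving a similarity of 0.5, while g3 shares none with either. Step (D) would pass the resulting matrix (converted to distances, e.g. 1 minus similarity) to any hierarchical clustering routine.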


Fig 2.

Sequence space coverage (SSC) largely depends on genome length and GC-content.

(A) SSC for each genome is plotted for odd values of k from 7 to 19. SSC exhibits a sigmoidal relationship with log-transformed genome length. (B) SSC of prokaryotic genomes for k = 11 (gold in panel A) is plotted with points colored by GC-content. GC-content has a moderately strong positive correlation with log-transformed genome length (change from blue to magenta; Pearson correlation coefficient r = 0.606, p < 0.001). SSC decreases as GC-content deviates from 0.5, the value expected for a random sequence (bright green).
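
As a rough illustration of the two quantities in this figure, the sketch below computes SSC and GC-content for a single sequence. It assumes SSC is the fraction of the 4^k possible k-mers that occur at least once in the genome; that definition is inferred from the name and is not given in this caption. The function names and the toy sequence are hypothetical.

```python
def sequence_space_coverage(seq, k):
    """Fraction of the 4**k possible k-mers observed at least once (assumed definition)."""
    seq = seq.upper()
    observed = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    # Drop windows that contain ambiguous bases such as N.
    observed = {km for km in observed if set(km) <= set("ACGT")}
    return len(observed) / 4 ** k

def gc_content(seq):
    """Fraction of unambiguous bases that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

toy = "ACGTACGTGA"  # hypothetical example sequence
print(sequence_space_coverage(toy, 2))            # 6 of 16 possible 2-mers
print(gc_content(toy))
print(sequence_space_coverage("AAAAAAAAAA", 2))   # a single 2-mer out of 16
```

The last line shows the effect described in panel B in its most extreme form: a sequence with a heavily skewed base composition can only reach a small part of k-mer space.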


Fig 3.

Normalized sequence space coverage (NSSC) of genomes attains a minimum within the range of k = 9–19.

(A) SSC from Fig 2A was normalized by the expected SSC estimated for random sequences of the same length (Eq 5). (B) For each genome, the k-mer length at which it attains a minimum NSSC (Eq 6) was plotted against log-transformed genome length. Dotted lines in (A) and (B) represent thresholds (set at 0.5) from logistic regression to predict the k-mer lengths at which minimum NSSC will occur based on log genome length. (C) A schematic of the NSSC curve for three idealized genomes of different lengths (purple, green and blue dotted lines; see S1A Fig for examples). At low k-mer lengths, NSSC is high due to full sequence space coverage. At high k-mer lengths, NSSC is again high, this time due to high k-mer specificity. In the range of k = 9–19, NSSC attains a minimum at k*, denoted by star symbols, representing minimal entropy and maximum intragenomic shared k-mers relative to estimates for random sequences of the same length.
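
The normalization and the search for k* can be sketched as follows. Eq 5 and Eq 6 are not reproduced in this caption, so the closed-form expectation used here (treating the windows of a uniform random sequence as independent draws from the 4^k possible k-mers) is an assumption and may differ from the estimate used in the paper. The function names and the observed SSC values are hypothetical.

```python
import math

def expected_random_ssc(length, k):
    """Approximate expected SSC of a uniform random sequence of the given length.

    Assumption: the length - k + 1 windows are treated as independent uniform
    draws, so a given k-mer is missed with probability (1 - 4**-k) ** n_windows.
    """
    n_windows = max(length - k + 1, 0)
    p = 4.0 ** -k
    # 1 - (1 - p) ** n, written with log1p/expm1 for numerical stability.
    return -math.expm1(n_windows * math.log1p(-p))

def nssc(observed_ssc, length, k):
    """Observed SSC divided by the expectation for a random sequence."""
    return observed_ssc / expected_random_ssc(length, k)

def k_star(ssc_by_k, length):
    """k-mer length at which NSSC is smallest."""
    return min(ssc_by_k, key=lambda k: nssc(ssc_by_k[k], length, k))

# Hypothetical observed SSC values for a 1 Mbp genome (illustration only).
length = 1_000_000
ssc_by_k = {9: 0.95, 11: 0.15, 13: 0.012}

for k, ssc in ssc_by_k.items():
    print(k, round(nssc(ssc, length, k), 3))
print("k* =", k_star(ssc_by_k, length))
```

With these made-up inputs the ratio dips at k = 11 and rises on either side, which is the shape drawn schematically in panel C.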


Fig 4.

21-mer Jaccard similarity clusters genomes across different levels of taxonomy.

A heatmap of pairwise 21-mer Jaccard similarity is shown for 1634 genera representatives arranged by hierarchical clustering with optimal leaf ordering to minimize the distance between successive leaves. Leaf order starts from the top left of the heatmap, and many of the clusters made up of organisms predominantly from a named taxon are numbered and labeled (brackets on the edge or arrows at a corner of clusters) with names listed in the legend (see S1 Table for ordered genera list and S2 Table for a more detailed account of named clusters). The three large clusters, corresponding to the superkingdom (domain) level (eukaryota, bacteria, and archaea), are colored and labeled in the hierarchical clustering tree shown above the heatmap. One group (21*; labeled in both the heatmap and tree) within the bacteria cluster is made up of a mix of an archaeon, a fungus, and several bacteria, all characterized by a low GC-content (<31%, bottom 5th percentile). The first dichotomy of each superkingdom cluster is also labeled in the hierarchical clustering tree (E1, E2, B1, B2, A1, and A2).


Fig 5.

Prokaryotic genome clustering differs significantly with varying k-mer lengths used to compute similarity.

(A-D) Heatmaps of pairwise 11-, 21-, 31-, and 41-mer Jaccard similarity are shown for 1266 prokaryotic genera representatives arranged by hierarchical clustering with optimal leaf ordering by 21-mer similarity (same as in Fig 4). (E-F) For comparison, heatmaps of pairwise 21- and 41-mer Jaccard similarity are shown ordered by optimal leaf ordering by 41-mer similarity. For increasing k-mer lengths, the signal of similarity between some groups is diminished, for example between Haloarchaea and a group of bacteria that likely share horizontally transferred genes (B and D, white arrows). Conversely, the signal of similarity between some split taxa becomes more apparent off the diagonal with increasing k, for example Alphaproteobacteria (D and F, magenta arrows) and Gammaproteobacteria (D and F, red arrows). While more of the genomes in these proteobacteria classes are within a single group, the groups are separated farther from each other with 41-mer ordering as this phylum-level signal is diminished (F). Color legend for Jaccard similarity is shown to the right of each plot.


Fig 6.

Hierarchically clustered trees of large prokaryotic taxa visually demonstrate an optimal range of k-mer lengths for genome comparisons.

(A-D) Hierarchical clustering trees with optimal leaf ordering computed from pairwise 11-, 21-, 31-, and 41-mer Jaccard similarity of 1266 prokaryotic genera representative genomes. Leaves are colored by large taxonomic groups, including 11 bacterial phyla and 3 archaeal phyla, with proteobacterial classes separated (see legend at bottom). For short k-mer lengths, genomes do not cluster well by taxon groups due to k-mer homoplasy, as seen in the 11-mer tree with mixed leaf colors (A). For large k-mer lengths, the similarity between distant taxa decreases until the signal is too low to cluster them together (few if any long k-mers shared), as seen in the 41-mer tree, which has more bacteria and archaea mixed together than the 21- and 31-mer trees (see left- and right-end groups in D). The 21- and 31-mer trees (B and C) separate bacteria and archaea well and cluster phyla together closely. The 31-mer tree clusters Alpha- and Gammaproteobacteria (light blues) together better than the 21-mer tree, but Delta- and Epsilonproteobacteria are further away (dark blues). These k-mer lengths fall in an optimal range which balances k-mer sharing and specificity.


Fig 7.

Distributions of log-transformed Jaccard similarity for different lowest-common-ancestor (LCA) prokaryotic taxon levels.

Ridgeline plots show the distributions of log10 Jaccard similarity (A) and median log10 Jaccard similarity (B) at different LCA taxon levels. Distributions were computed by kernel density estimation (see Methods). The dotted lines represent an estimated 95% average nucleotide identity (ANI), at approximately -0.67 log10 Jaccard similarity (see Eq 2), which is commonly used as a species-level threshold. Asterisks adjacent to the organism and superkingdom distributions in A represent the 1177 out of ~1.3M pairs which shared zero k-mers; of these, 252 are pairs with an LCA above the phylum level, i.e. bacteria-archaea pairs, and 925 are archaea-archaea pairs.
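
The position of the dotted line can be checked numerically. Eq 2 is not reproduced in this caption; the sketch below assumes it is the Mash-style relationship ANI ≈ 1 + (1/k) ln(2J / (1 + J)) and that k = 21, as in the neighboring figures. Under those assumptions, a 95% ANI maps to a log10 Jaccard similarity close to the quoted -0.67. The function names are hypothetical.

```python
import math

def jaccard_from_ani(ani, k):
    """Invert the assumed relationship ANI = 1 + ln(2J / (1 + J)) / k for J."""
    x = math.exp(-k * (1.0 - ani))  # x equals 2J / (1 + J)
    return x / (2.0 - x)

def ani_from_jaccard(j, k):
    """Assumed Mash-style estimate of ANI from Jaccard similarity J."""
    return 1.0 + math.log(2.0 * j / (1.0 + j)) / k

k = 21  # assumed k-mer length
j95 = jaccard_from_ani(0.95, k)
log10_j95 = math.log10(j95)

print(round(j95, 3), round(log10_j95, 2))
```

Running the conversion in the other direction with `ani_from_jaccard(j95, k)` returns 0.95, which confirms that the two helper functions are inverses of each other.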


Fig 8.

Aberrant trajectories of median similarity across taxonomic levels indicate potential misclassifications in prokaryotic reference databases.

(A) For prokaryotic genera representatives that had at least one pairwise comparison at all lowest-common-ancestor (LCA) levels from genus to cellular organism (n = 291), we plotted trajectories of their median log10 Jaccard similarity for k = 21 (gray lines), along with overlaid boxplots to show the overall distribution at each level. Two trajectories are highlighted to show an example of median similarity always decreasing as LCA distance increases (orange; Bradyrhizobium diazoefficiens) and an example of an aberrant trajectory for which median similarity increases as the LCA taxon goes from family to order (green; Roseomonas gilardii). (B) For the same group of prokaryotes, boxplots show the distribution of the difference in median log10 Jaccard similarity between successive taxon levels (e.g. genus minus family, family minus order, etc). The horizontal red dotted line, at zero, represents an equal median similarity from a genome to the two compared LCA taxa. Negative values, below this line, are unexpected and are potentially due to misclassifications in the database, e.g. a species having a higher median similarity to organisms which share the same order than to those in its family (green line in A). In total, about one third (106/291) of the genera analyzed had at least one value below zero.
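
The check in panel B amounts to differencing each genome's median similarity between successive LCA levels and flagging negative values. The sketch below shows that logic on made-up numbers; the level names, the function names, and the similarity values are all hypothetical and only four levels are included to keep the example short.

```python
from statistics import median

# Assumed ordering of LCA levels from closest to most distant (subset only).
LEVELS = ["genus", "family", "order", "class"]

def median_trajectory(log_sims_by_level):
    """Median log10 Jaccard similarity of one genome at each LCA level."""
    return {level: median(log_sims_by_level[level]) for level in LEVELS}

def aberrant_steps(trajectory):
    """Return (closer, farther, delta) for every step where the delta is negative.

    delta = median at the closer level minus median at the farther level, so a
    negative value means the genome looks more similar to more distant relatives.
    """
    steps = []
    for closer, farther in zip(LEVELS, LEVELS[1:]):
        delta = trajectory[closer] - trajectory[farther]
        if delta < 0:
            steps.append((closer, farther, delta))
    return steps

# Hypothetical log10 Jaccard similarities for a single genome.
log_sims = {
    "genus": [-1.0, -1.2],
    "family": [-2.5, -2.7],
    "order": [-2.0, -2.2],
    "class": [-3.0],
}

trajectory = median_trajectory(log_sims)
flagged = aberrant_steps(trajectory)
print(flagged)
```

In this toy case only the family-to-order step is flagged, mirroring the kind of aberrant trajectory highlighted in green in panel A.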
