Fig 1.
kMermaid unambiguously maps nucleotide sequences to functionally homogeneous protein clusters.
To classify a metagenomic read, the nucleotide query sequence undergoes six-frame translation (Step 1), and frames containing stop codons are removed. Each amino acid k-mer in the remaining non-truncated coding frames (Step 2) is then mapped to the protein clusters that contain that k-mer in the database. An assignment score is calculated for each protein cluster by summing the frequencies, within that cluster, of the k-mers in the query sequence (Step 3). The query is assigned to the protein cluster with the highest assignment score, i.e., the cluster in which the query’s k-mers are most frequently observed (Step 4). A description of kMermaid’s implementation, including pseudo-code, is provided in S1 Methods.
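The four steps in this caption can be illustrated with a minimal Python sketch. It is not kMermaid's implementation (see S1 Methods for the authors' pseudo-code). The function names, the `kmer_db` layout (k-mer mapped to per-cluster frequencies), the toy database values, and the choice of k = 3 are all assumptions made for illustration; the sketch also assumes the standard genetic code and breaks ties arbitrarily.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons only; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def open_frames(read):
    """Step 1: six-frame translation, dropping frames that contain a stop codon."""
    read = read.upper()
    strands = (read, read.translate(_COMPLEMENT)[::-1])  # forward, reverse complement
    frames = [translate(strand[offset:]) for strand in strands for offset in range(3)]
    return [frame for frame in frames if "*" not in frame]


def classify(read, kmer_db, k):
    """Steps 2-4: sum per-cluster k-mer frequencies and pick the top-scoring cluster.

    kmer_db is a hypothetical structure: {kmer: {cluster_id: frequency}}.
    Returns (best_cluster or None, {cluster_id: assignment_score}).
    """
    scores = defaultdict(float)
    for frame in open_frames(read):
        for i in range(len(frame) - k + 1):            # Step 2: amino acid k-mers
            hits = kmer_db.get(frame[i:i + k], {})
            for cluster, freq in hits.items():         # Step 3: sum frequencies
                scores[cluster] += freq
    best = max(scores, key=scores.get) if scores else None  # Step 4: highest score
    return best, dict(scores)


# Toy database with made-up frequencies (illustrative values only).
toy_db = {
    "MKV": {"c1": 0.6, "c2": 0.1},
    "KVL": {"c1": 0.5},
    "VLA": {"c2": 0.2},
    "LAG": {"c1": 0.4},
}
read = "ATGAAAGTTCTGGCTGGT"  # forward frame 0 encodes MKVLAG
best, scores = classify(read, toy_db, k=3)
print(best, scores)  # best is 'c1' for this toy database
```

In this toy example, one of the six frames contains a stop codon and is discarded; only the forward frame encoding MKVLAG has k-mers present in the database, so the other frames contribute nothing to the scores.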
Fig 2.
Protein clusters underlying kMermaid mitigate multi-mapping and allow cluster-specific prevalences of k-mers.
(a) The reduction in the number of reads mapped to >1 protein using default configurations of BLASTX and DIAMOND compared with cluster assignments applied to the BLASTX and DIAMOND outputs, respectively, across 29 human fecal samples. (b) The percentage of proteins co-clustered by our clustering process that are also co-clustered in the NCBI PCLA prokaryotic protein clusters, for all overlapping proteins, plotted as bars. The x-axis shows individual proteins ordered by cluster, i.e., proteins in the same cluster are plotted next to each other, and the color of each bar corresponds to the percent similarity. (c) The distribution (kernel density estimate, KDE) of keyword percentage, i.e., the percentage of cluster members whose names contain the most common word among all protein names in the cluster, across all clusters (yellow) and for clusters with specific common keywords of interest. (d) Visual representation of the kMermaid cluster frequency model, sorted by the number of clusters in which a k-mer is present. Colors represent the number of clusters in which a unique k-mer is found, and the y-axis corresponds to the k-mer frequency in the cluster. The panel shows a representative random subset of 50K k-mers.
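The keyword percentage in panel (c) can be made concrete with a short sketch for a single cluster. This is one plausible reading of the definition, not the authors' code: the function name, the lowercase whitespace tokenization, the rule that each word counts once per member, and the example protein names are all assumptions.

```python
from collections import Counter


def keyword_percentage(protein_names):
    """Return (most common word, percent of cluster members whose name contains it).

    Assumed tokenization: lowercase, split on whitespace, each word counted
    at most once per cluster member.
    """
    member_words = [set(name.lower().split()) for name in protein_names]
    counts = Counter(word for words in member_words for word in words)
    if not counts:
        return None, 0.0
    keyword, n_members = counts.most_common(1)[0]
    return keyword, 100.0 * n_members / len(protein_names)


# Hypothetical cluster of four protein names.
cluster_names = [
    "DNA polymerase III subunit alpha",
    "DNA polymerase I",
    "polymerase",
    "hypothetical protein",
]
keyword, pct = keyword_percentage(cluster_names)
print(keyword, pct)  # polymerase 75.0
```

Here three of the four members carry the word "polymerase", giving a keyword percentage of 75%. Values near 100% across clusters would indicate consistent naming within each cluster.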
Fig 3.
kMermaid sensitivity and resource benchmarking on simulated microbial protein data.
(a) The percentage of reads classified correctly by kMermaid compared with leading read-to-protein mapping tools, averaged across 10 simulated datasets for each combination of read length and mutation rate. (b) The number of reads classified by each tool, normalized by the number of input reads and averaged across 10 simulated datasets for each combination of read length and mutation rate. (c) kMermaid (green) provides up to a 25-fold decrease in runtime (in seconds, log-transformed) compared to BLASTX and has runtimes comparable to DIAMOND (blue). The y-axis is truncated, and tools that exceeded a 24-hour runtime for larger input sizes are denoted with an asterisk. (d) kMermaid (green) requires a fixed, low memory allocation in comparison to other read-to-protein mapping tools. BLASTX was excluded from comparisons with more than 1 million sequences due to infeasible running times. Methods exceeding 16 GB of RAM are denoted with an asterisk.
Fig 4.
Biological applications, function-specific performance, and evidence of remote homology detection of kMermaid.
(a) Receiver operating characteristic (ROC) curves demonstrating the ability of kMermaid’s assignment score to correctly classify reverse-translated nucleotide segments of varying lengths from unseen protein sequences that were added to RefSeq in early 2025. (b) Agreement between BLASTX alignments and kMermaid protein assignments on 29 fecal samples from ulcerative colitis patients. (c) Boxplot showing kMermaid agreement with BLASTX for clusters with specific functional annotations. (d) Violin plots showing the k-mer frequency scores for six clusters of reads that were unclassified by BLASTX but correctly functionally classified by kMermaid.