kMermaid: Ultrafast metagenomic read assignment to protein clusters by hashing of amino acid k-mer frequencies
Fig 2
Protein clusters underlying kMermaid mitigate multi-mapping and allow cluster-specific prevalences of k-mers.
(a) The reduction in the number of reads mapped to >1 protein using default configurations of BLASTX and DIAMOND compared to cluster assignments, when applied to the BLASTX and DIAMOND outputs, respectively, across 29 human fecal samples. (b) The percent of proteins co-clustered by our clustering process that are also co-clustered in the NCBI PCLA prokaryotic protein clusters, for all overlapping proteins, plotted as bars. The x-axis contains individual proteins ordered by cluster, i.e., proteins in the same cluster are plotted next to each other, and the color of the bar corresponds to the percent similarity. (c) The distribution (kernel density estimation, KDE) of keyword percentage, i.e., the percent of cluster members whose names contain the most common word among all protein names in the cluster, across all clusters (yellow) and for clusters with specific common keywords of interest. (d) Visual representation of the kMermaid cluster frequency model, sorted by the number of clusters in which a k-mer is present. Colors represent the number of clusters in which a unique k-mer is found, and the y-axis corresponds to the k-mer frequency in the cluster. The panel shows a representative random 50K subset of all k-mers.
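The keyword percentage in panel (c) can be illustrated with a minimal sketch. This is not the kMermaid implementation: the function name `keyword_percentage`, the tokenization rule (lowercased alphanumeric runs), counting a word at most once per protein, the alphabetical tie-break, and the absence of any stop-word filtering are all assumptions made for illustration.

```python
import re
from collections import Counter


def keyword_percentage(protein_names):
    """Return (keyword, percent) for one cluster.

    keyword: the word found in the names of the most cluster members.
    percent: share of cluster members whose name contains that word.
    Assumption: words are lowercased alphanumeric runs, and each word
    is counted at most once per protein name.
    """
    # One set of words per protein, so repeats within a name count once.
    word_sets = [
        set(re.findall(r"[a-z0-9]+", name.lower())) for name in protein_names
    ]
    counts = Counter(word for words in word_sets for word in words)
    if not counts:
        return None, 0.0
    # Highest member count wins; ties are broken alphabetically (assumption).
    keyword, n_members = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return keyword, 100.0 * n_members / len(protein_names)


# Hypothetical cluster, not taken from the paper's data.
example_cluster = [
    "DNA gyrase subunit A",
    "DNA gyrase subunit B",
    "gyrase inhibitor",
    "hypothetical protein",
]
print(keyword_percentage(example_cluster))
```

In this hypothetical cluster, "gyrase" appears in three of the four names, so the keyword percentage is 75%. Applying the function to every cluster and passing the resulting percentages to a KDE routine would give a distribution of the kind plotted in the panel.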