kMermaid: Ultrafast metagenomic read assignment to protein clusters by hashing of amino acid k-mer frequencies

  • Anastasia Lucas,

    Roles Software, Writing – original draft, Writing – review & editing

    Affiliations Genomics and Computational Biology Graduate Group, University of Pennsylvania - Perelman School of Medicine, Philadelphia, Pennsylvania, United States of America, Molecular and Cellular Oncogenesis Program, Ellen and Ronald Caplan Cancer Center, The Wistar Institute, Philadelphia, Pennsylvania, United States of America

  • Daniel E. Schäffer,

    Roles Software, Writing – review & editing

    Affiliations Molecular and Cellular Oncogenesis Program, Ellen and Ronald Caplan Cancer Center, The Wistar Institute, Philadelphia, Pennsylvania, United States of America, Computer Science and Artificial Intelligence Laboratory & Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America

  • Jayamanna Wickramasinghe,

    Roles Data curation, Writing – review & editing

    Affiliation Bioinformatics Facility, The Wistar Institute, Philadelphia, Pennsylvania, United States of America

  • Noam Auslander

    Roles Conceptualization, Software, Supervision, Writing – original draft, Writing – review & editing

    nauslander@wistar.org

    Affiliations Molecular and Cellular Oncogenesis Program, Ellen and Ronald Caplan Cancer Center, The Wistar Institute, Philadelphia, Pennsylvania, United States of America, Department of Cancer Biology, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America

Abstract

Shotgun metagenomic sequencing can determine both the taxonomic and functional content of microbiomes. However, functional classification for metagenomic reads remains highly challenging as protein mapping tools require substantial computational resources and yield ambiguous classifications when short reads map to homologous proteins originating from different bacteria. Here we introduce kMermaid for the purpose of uniquely mapping bacterial short reads to taxa-agnostic clusters of homologous proteins, which can then be used for downstream analysis tasks such as read quantification and pathway or global functional analysis. Using a nested hash map containing amino acid k-mer profiles as a model for protein assignment, kMermaid achieves the sensitivity of popular existing protein mapping tools while remaining highly resource efficient. We evaluate kMermaid on simulated data and data from human fecal samples as well as demonstrate the utility of kMermaid for classifying reads originating from new, unseen proteins. kMermaid allows for highly accurate, unambiguous and ultrafast metagenomic read assignment into protein clusters, with a fixed memory usage, and can easily be employed on a typical computer.

Author summary

Whole-genome shotgun sequencing has allowed for the collection of a wealth of metagenomic data. Evidence that microbiomes play key roles in human health and disease is growing, but approaches for studying functional metagenomic content are still limited. Current protein mapping approaches do not allow for direct quantification of protein coding potential because short reads commonly map to similar proteins in different bacteria. Mapping metagenomic sequencing reads to proteins in such a way that a microbiome’s coding potential can be quantified is a key first step to pinpointing specific functional mechanisms or associations of disease. Here, we present a framework to first group similar proteins together, then uniquely map reads directly to these homologous protein groups. Our results show that by using k-mer frequencies stored in a two-layer hash map, we can sensitively classify metagenomic reads from high-depth sequencing data in only a few hours. We present our protein mapping method in an easy-to-use, resource efficient Python package, kMermaid. kMermaid results can be directly quantified which in turn will enable linkage of microbiome amino acid content to numerous health and disease phenotypes.

Introduction

The gut microbiome has recently emerged as a new frontier for non-invasive biomarker discovery and new therapeutic intervention. As the field of metagenomics has matured in terms of popularity and technical advancement [1,2], there has been increasing recognition of the importance of functional analysis of microbiomes [3]. Whole-genome shotgun sequencing has allowed for the collection of vast amounts of metagenomic data which can be used to gain insights about both the taxonomic and functional composition of microbiomes. Functional profiling and quantitative comparisons of microbial proteins have immense potential to reveal microbe-microbe and host-microbe interactions, establish new microbial biomarkers, and provide predictions based on microbiomes [4,5]. However, the quantitative analyses required for these tasks are contingent on the functional classification of shotgun metagenomic reads as a preprocessing step, which remains a computational challenge.

Functional read classification is a broad notion that can encompass mapping reads directly to proteins or mapping to higher level functional classes, such as ortholog protein groups or pathways. Several methods and pipelines, such as eggNOG-mapper [6], PANNZER2 [7], BlastKOALA [8], HUMAnN3 [9], and fmh-funprofiler [10], have recently been developed to perform these higher-level classifications. Such tools use a variety of computational algorithms, including alignment and sketching techniques, to report mappings to functional reference databases, such as KEGG Orthology or eggNOG [11]. In contrast, direct protein mapping is often performed using alignment-based methods, which rely on homology between a metagenomic sequence and microbial proteins in reference databases [5]. BLASTX [12] remains the gold standard for sensitivity despite being infeasibly slow for typical metagenomic experiments producing tens of millions of reads [13]. DIAMOND [14,15] was developed in part to address computational challenges associated with BLASTX and allows for ultrafast read-to-protein alignments, making it one of the most widely used metagenomic protein mapping tools. Another popular method, MMseqs2 [16], was primarily developed to cluster metagenomic nucleotide and protein sequences, but also has translated protein search capabilities. While ultra-fast, neither DIAMOND nor MMseqs2 addresses BLASTX’s other challenge of multimapping, i.e., when a single read aligns to more than one protein. Multimapping often occurs due to homologous proteins or domains originating in different taxa, and can be problematic for downstream read counting and subsequent analysis [17–19]. Mapping to higher-level functional classes can resolve multimapping, but not without the loss of the granular information protein annotation provides. Therefore, there is a need for resource-efficient methods that can provide unique read-to-protein level maps.

Methods for efficient and sensitive taxonomic classification of metagenomic sequences have addressed similar challenges in computational efficiency of taxonomic assignment by using k-mer based approaches, which are faster than alignments [20]. Most notably, Kraken [21] introduced accurate and highly efficient taxonomic classification by mapping k-mers to lowest common ancestors. This approach was later provided with improved resource usage through Kraken2 [22] and improved precision through KrakenUniq [23]. Other high speed taxonomic classification methods are CLARK [24], another k-mer based approach, as well as Centrifuge [25] and Kaiju [26], which are based on FM-indexing. Importantly, Kaiju demonstrated that the use of protein-level sequence comparisons substantially improves taxonomic classification. k-mer based approaches could offer similar advancements for protein mapping; however, such methods are currently lacking.

Here, we introduce kMermaid, a new method for ultrafast and resource-efficient protein mapping of metagenomic reads (Fig 1, S1 Methods). kMermaid uniquely maps query nucleotide sequences into taxa-agnostic clusters of highly homologous proteins using a precomputed k-mer frequency model (S1 Fig, Methods). The underlying rationale for kMermaid is that proteins with high sequence homology have similar biological functions and thus should be grouped together for downstream analysis. To this end, mapping the sequences to clusters representing homologous groups of proteins irrespective of taxa addresses issues along both computational and biological axes. The read-to-cluster approach resolves the problem of alignment ambiguity, referred to here as multi-mapping, i.e., when a single read similarly aligns to multiple proteins. The resulting aligned proteins are often functionally similar but originate in different species. By aggregating at the protein cluster level, kMermaid can capture novel biological effects that may be overlooked when performing analyses conditioned on taxa or when aggregating reads into broader functional categories, such as ortholog groups or pathways. kMermaid can classify tens of millions of sequences in just a few hours, providing the computational speed and resource efficiency needed for the large volumes of data generated through metagenomic sequencing experiments, while matching the sensitivity of BLASTX. Through comprehensive benchmarking against other widely used metagenomic protein mapping tools, we show that kMermaid achieves fast, resource-efficient, and sensitive metagenomic read classification into functional units expected to improve downstream quantitative analysis.

Fig 1. kMermaid unambiguously maps nucleotide sequences to functionally homogeneous protein clusters.

To classify a metagenomic read, a nucleotide query sequence undergoes a six-frame translation (Step 1), and frames containing stop codons are removed. Each amino acid k-mer in a non-truncated coding frame (Step 2) is then mapped to the protein clusters that contain that k-mer in the database. An assignment score is calculated to evaluate a match between the query sequence and each protein cluster, by summing frequencies of k-mers in the query sequence in every protein cluster (Step 3). The query is classified to the protein cluster assigned with the highest assignment score which corresponds to the cluster in which the k-mers of the query are most frequently observed (Step 4). A description of and pseudo-code for kMermaid’s implementation is provided in the S1 Methods.

https://doi.org/10.1371/journal.pcbi.1013470.g001
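Steps 1 and 2 of Fig 1 can be illustrated with a minimal Python sketch. It is not kMermaid's own code: the function names are hypothetical, the standard codon table is assumed (its amino acid assignments match the bacterial table), and any codon containing an ambiguous base is translated to "X".

```python
from itertools import product

# Standard genetic code, laid out in NCBI order (bases T, C, A, G).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(read):
    """Three forward frames followed by three reverse-complement frames."""
    read = read.upper()
    frames = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


def coding_frames(read):
    """Keep only the frames that contain no stop codon ('*')."""
    return [frame for frame in six_frame_translation(read) if "*" not in frame]


def kmers(protein, k=5):
    """All overlapping amino acid k-mers of a translated frame."""
    return [protein[i:i + k] for i in range(len(protein) - k + 1)]


frames = coding_frames("ATGGCTAAAGGTGAACTG")
print(frames[0], kmers(frames[0]))  # MAKGEL ['MAKGE', 'AKGEL']
```

For this toy read, two of the six frames contain a stop codon and are discarded, so four frames would go on to k-mer lookup.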

Results and discussion

Using k-mer frequencies to map reads to homologous protein clusters

The main motivation for kMermaid is that short metagenomic reads are rarely mapped to a single protein and instead often map to multiple functionally similar proteins. Obtaining a unique read-to-protein mapping requires grouping these homologous proteins into some broader functional unit. Such functional units become especially critical for downstream quantitative analyses to prevent issues such as double counting multi-mapped reads. We find that most reads map to at least five protein hits using BLASTX, which is the minimum recommended value for the number of hits reported, while only 7% of the reads can be uniquely mapped by alignment to a single protein by either BLASTX or DIAMOND in BLASTX mode (Fig 2a). In contrast, by aggregating multi-mapped hits from single proteins into homologous protein clusters (see Methods), we find that more than 93% of the reads can be uniquely mapped to a single cluster or functional unit. In other words, for more than 93% of the reads, all BLASTX hits for the read belong to a single cluster. DIAMOND follows a similar trend to BLASTX. Together, this demonstrates that our clusters resolve the majority of ambiguous alignments without loss of information from multimapping.
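The effect of aggregating hits into clusters can be sketched on toy data. The read names, protein identifiers, and cluster labels below are invented for illustration, and the helper `fraction_unique` is hypothetical; it only counts how many reads collapse to a single label before and after mapping proteins to clusters.

```python
def fraction_unique(hits_per_read, protein_to_cluster=None):
    """Fraction of reads whose hits collapse to exactly one label.

    With protein_to_cluster=None the labels are the protein hits themselves;
    otherwise each hit is first replaced by the cluster it belongs to.
    """
    unique = 0
    for hits in hits_per_read.values():
        if protein_to_cluster is None:
            labels = set(hits)
        else:
            labels = {protein_to_cluster[p] for p in hits}
        if len(labels) == 1:
            unique += 1
    return unique / len(hits_per_read)


# Toy alignment output: read -> list of protein hits (all values invented).
hits = {
    "read1": ["P1", "P2", "P3"],  # three homologs from one cluster
    "read2": ["P4"],              # already unique at the protein level
    "read3": ["P1", "P4"],        # hits spanning two clusters stay ambiguous
    "read4": ["P2", "P3"],
}
clusters = {"P1": "C1", "P2": "C1", "P3": "C1", "P4": "C2"}

print(fraction_unique(hits))            # 0.25
print(fraction_unique(hits, clusters))  # 0.75
```

Only reads whose hits span more than one cluster, such as `read3`, remain ambiguous after aggregation.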

Fig 2. Protein clusters underlying kMermaid mitigate multi-mapping and allow cluster-specific prevalences of k-mers.

(a) The reduction in the number of reads mapped to >1 protein using default configurations of BLASTX and DIAMOND compared to cluster assignments, when applied to the BLASTX and DIAMOND outputs, respectively, across 29 human fecal samples. (b) The percent of co-clustered proteins using our clustering process also co-clustered in the NCBI PCLA prokaryotic protein clusters for all overlapping proteins, plotted as bars. The x-axis contains individual proteins ordered by cluster, i.e., proteins in the same cluster are plotted next to each other, and the color of the bar corresponds to the percent similarity. (c) The distribution (kernel density estimation, KDE) of keyword percentage, i.e., the percent of cluster members whose names contain the most common word among all protein names in the cluster, across all clusters (yellow), and for clusters with specific common keywords of interest. (d) Visual representation of the kMermaid cluster frequency model sorted by the number of clusters a k-mer is present in. Colors represent the number of clusters in which a unique k-mer is found, and the y-axis corresponds to the k-mer frequency in the cluster. The panel shows a representative random 50K subset of all k-mers.

https://doi.org/10.1371/journal.pcbi.1013470.g002

We developed kMermaid to uniquely and efficiently map microbial short read sequencing into homologous protein clusters using this underlying clustering framework. A user can provide a file with nucleotide sequences from whole-genome shotgun sequencing to kMermaid, for querying against our precomputed model of 1,793,361 proteins aggregated into 32,308 clusters. kMermaid will uniquely assign each nucleotide read to a cluster and provide a readable functional annotation, i.e., protein label, based on its cluster representative. Because approximately 25% of our cluster representatives had non-descriptive names, e.g., “hypothetical protein,” in RefSeq, we employed HH-suite3 [27] remote homology detection to produce descriptive cluster names. In total, we reannotated 8,617 proteins, of which 6,488 were with high confidence (S1 Table). These include 601 phage proteins, 436 membrane proteins, 230 transcriptional proteins, and 205 lipoproteins (S1 Table). The composition of kMermaid’s clusters is also highly consistent with preexisting, smaller-scale cluster annotations. We verified that 96% of proteins share 100% similarity with existing NCBI protein clusters [28], i.e., all proteins in the kMermaid cluster also co-occur in the broader NCBI-derived clusters (Fig 2b). In addition, using keywords of NCBI-assigned protein names, we show that the functional annotations of the proteins are highly homogenous within clusters (Fig 2c). Therefore, we concluded that kMermaid’s cluster model and corresponding annotations are sufficiently biologically accurate and highly reflective of the cluster content.

kMermaid’s internal pipeline assigns sequencing reads to clusters according to a frequency-based assignment score calculated from amino acid (AA) k-mer frequencies. A higher assignment score indicates that k-mers within a query sequence are more frequently observed in the assigned cluster than in other clusters in the underlying model. For some query sequences, a single k-mer may uniquely determine cluster assignment, while other query sequences may contain multiple k-mers that have a higher combined frequency in the assigned (maximal) cluster compared with any other cluster. With this in mind, we reasoned that a k-mer found in many clusters may be less informative, so that its contribution to the assignment score of any one cluster is noisy. In the improbable case that all k-mers were present in all clusters at a similar frequency, our cluster assignment would be close to random chance. On the other hand, k-mers that are only found in one or two clusters allow for deterministic classification and increase our confidence in the resulting read assignments. Out of approximately 2.5 million AA 5-mers in our model, presence in a single cluster was the most common scenario (22%) and 81% of all AA 5-mers were found in <10 clusters (Fig 2d). We refer to these 5-mers as “deterministic,” such that the presence of a deterministic k-mer in a query sequence is highly informative for cluster assignment. We also found that the AA 5-mers that are present in many clusters have relatively similar frequencies across the clusters when examining the top 10 clusters where they are most frequently found, implying that common 5-mers should not bias the assignment score. As anticipated, given the relative percentage of 5-mers that tend to be deterministic and that the assignment score considers multiple k-mers for each query sequence, kMermaid clusters fully resolve read alignment ambiguity or multi-mapping in more than 95% of the cases (Fig 2a).
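The scoring idea can be sketched with a nested dictionary of the form `model[kmer][cluster] = frequency`. This is only an illustration under stated assumptions, not the package's implementation: the function names are hypothetical, the "frequency" of a k-mer is taken here to be the fraction of cluster members containing it, and scores are pooled over all coding frames of a read. The exact definitions used by kMermaid are given in its S1 Methods.

```python
from collections import defaultdict


def build_model(clusters, k=5):
    """Build {kmer: {cluster_id: frequency}} from {cluster_id: [proteins]}.

    Assumption: frequency = fraction of proteins in the cluster that
    contain the k-mer at least once.
    """
    model = defaultdict(dict)
    for cluster_id, proteins in clusters.items():
        counts = defaultdict(int)
        for protein in proteins:
            for kmer in {protein[i:i + k] for i in range(len(protein) - k + 1)}:
                counts[kmer] += 1
        for kmer, count in counts.items():
            model[kmer][cluster_id] = count / len(proteins)
    return dict(model)


def assign(frames, model, k=5):
    """Sum k-mer frequencies per cluster and return (best_cluster, score).

    Returns (None, 0.0) when no k-mer of the query occurs in the model.
    """
    scores = defaultdict(float)
    for frame in frames:
        for i in range(len(frame) - k + 1):
            for cluster_id, freq in model.get(frame[i:i + k], {}).items():
                scores[cluster_id] += freq
    if not scores:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy clusters with invented sequences.
model = build_model({"C1": ["MAKGELT", "MAKGELS"], "C2": ["QWERTYK"]})
print(assign(["MAKGEL"], model))  # ('C1', 2.0)
```

In this toy model every k-mer occurs in exactly one cluster, so each is "deterministic" in the sense used above. A k-mer shared by many clusters would add a similar amount to each of their scores and change the ranking very little.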

Performance evaluation on simulated data with known labels

To provide validation for our model, we comprehensively benchmarked the accuracy and sensitivity of kMermaid against the most widely used metagenomic protein mapping methods: BLASTX, DIAMOND, and MMseqs2. We first benchmarked kMermaid using simulated data sets with varying rates of point mutations selected to mimic biological mutation rates and sequencing error rates. To compare against ground truth, the reads were simulated from the RefSeq data used in the second step of the clustering procedure (see Methods) so that their true cluster labels would be known. kMermaid, DIAMOND, and MMseqs2 were almost always able to map a read to the correct protein at the level of their reporting, i.e., either the single protein (BLASTX, DIAMOND, MMseqs2) or the protein cluster (kMermaid) (Fig 3a). As expected, when BLASTX and DIAMOND were restricted to reporting only one match per read, their performance dropped considerably as highly homologous proteins will have the same alignment scores. The performance of methods that align to a single protein increases substantially when viewing the singly-aligned results at the cluster level in concordance with the notion that clustering proteins by homology resolves ambiguous alignments in most cases (Fig 2a). Encouragingly, kMermaid was able to assign reads to the correct cluster in nearly all cases, albeit with a slight decrease in coverage (percentage of reads classified) when query sequences are > 500 nucleotides and the mutation rates are high (Fig 3b). Because kMermaid uses a cumulative scoring model, it is expected that the assignments for long, highly mutated sequences are noisier, especially at the default scoring threshold which is tuned to short reads. Similar trends were observed when we simulated reads guaranteed to have a certain number of mutations rather than using a probabilistic rate (S2a and S2b Fig). Thus, by resolving multi-mapped reads at the cluster level, kMermaid shows improved sensitivity compared to methods that assign reads only to individual proteins.
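The two simulation schemes mentioned above, a per-base mutation probability and a fixed number of mutations per read, can be sketched as follows. This is an assumption-laden illustration rather than the simulator used in the benchmark: the function names are hypothetical, a nucleotide coding sequence is assumed to be available, and substitutions are drawn uniformly from the three alternative bases.

```python
import random


def mutate_rate(seq, rate, rng):
    """Substitute each base independently with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def mutate_exact(seq, n_mutations, rng):
    """Substitute exactly `n_mutations` distinct positions."""
    out = list(seq)
    for pos in rng.sample(range(len(seq)), n_mutations):
        out[pos] = rng.choice([b for b in "ACGT" if b != out[pos]])
    return "".join(out)


def simulate_read(gene, read_length, rate, rng):
    """Cut a random window from `gene` and apply point mutations to it."""
    start = rng.randrange(len(gene) - read_length + 1)
    return mutate_rate(gene[start:start + read_length], rate, rng)


rng = random.Random(0)  # fixed seed so the simulation is reproducible
gene = "ATGGCTAAAGGTGAACTGACCGTTCAG"
read = simulate_read(gene, read_length=15, rate=0.05, rng=rng)
print(len(read))  # 15
```

Because each simulated read is cut from a sequence whose cluster is known, the true label travels with the read and can be compared directly with the assignment of each tool.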

Fig 3. kMermaid sensitivity and resource benchmarking on simulated microbial protein data.

(a) The percentage of reads classified correctly by kMermaid compared with leading read-to-protein mapping tools averaged across 10 simulated datasets per each combination of read length and mutation rate. (b) The number of reads classified by each tool normalized by the number of input reads averaged across 10 simulated datasets per each combination of read length and mutation rate. (c) kMermaid (green) provides up to a 25-fold decrease in runtimes (in seconds, log-transformed) compared to BLASTX and has comparable runtimes to DIAMOND (blue). The y-axis has been truncated and tools that exceeded a 24-hour run time for larger input sizes are denoted with an asterisk. (d) kMermaid (green) requires a fixed, low memory allocation in comparison to other read-to-protein mapping tools. BLASTX was excluded from comparisons with more than 1 million sequences due to the infeasible running times. Methods exceeding 16GB of RAM are denoted with an asterisk.

https://doi.org/10.1371/journal.pcbi.1013470.g003

Computational efficiency of kMermaid

Shotgun metagenomics experiments commonly yield tens of millions of sequences per sample, and each must be queried against a large reference database for classification purposes. Computational efficiency therefore remains a challenge. BLASTX is perhaps the most established nucleotide-to-protein aligner, but its computational time is infeasible for typical large files, necessitating the development of alternatives that can process these reads in a reasonable timeframe. We benchmarked kMermaid’s single-CPU runtime and RAM usage against BLASTX, DIAMOND, and MMseqs2 (Fig 3c, d). DIAMOND is regarded as one of the fastest accurate approaches for protein mapping of metagenomic reads and was in part developed to address the runtime limitations of BLASTX. kMermaid ran 1,600 times faster than BLASTX on files with 100,000 sequences while DIAMOND ran 1,000 times faster. Both methods also provided substantial reductions in running time over MMseqs2. kMermaid and DIAMOND were able to classify 500K sequences in a minimum of 2.7 minutes and 3.7 minutes, respectively. The same input when given to BLASTX took over six days to complete. Further, classifying 40M sequences took kMermaid 3.3 hours (compared to 1.9-2.2 hours for DIAMOND), which highlights its usability for experimental shotgun metagenomic sequencing files. Since kMermaid, like some other methods, classifies each read independently of the other reads in the same input file, it easily lends itself to parallelization by means of splitting input files into smaller chunks, allowing for further speed improvement when resources are available.
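Because reads are classified independently, chunk-level parallelism is straightforward. The sketch below illustrates the pattern only: `classify_chunk` is a hypothetical placeholder rather than a kMermaid function, and a thread pool is used purely to keep the example self-contained. For CPU-bound classification in practice, separate processes or separate jobs per chunk file would be the more natural choice.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def classify_chunk(reads):
    """Placeholder classifier (hypothetical): label each read by its length."""
    return [(read, len(read)) for read in reads]


def classify_parallel(reads, chunk_size=2, workers=2):
    """Classify chunks concurrently; map() returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(classify_chunk, chunked(reads, chunk_size))
        return [item for chunk in results for item in chunk]


reads = ["ACGT", "ACGTAC", "AC", "ACGTACGT", "A"]
print(classify_parallel(reads) == classify_chunk(reads))  # True
```

Since no read depends on another, the chunked result is identical to the sequential one, and the chunk size only affects scheduling overhead.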

Along with speed, RAM usage is another potentially limiting factor when input files are large. We have developed kMermaid to be highly memory efficient. We therefore compared kMermaid to leading tools including DIAMOND, which has excellent running times, but achieves this performance in speed at the expense of higher memory and multiple CPU utilization. With these limitations in mind, kMermaid performs read assignments in a way that only requires the precomputed k-mer frequency model, and not the input data, to be loaded into memory. As such, kMermaid requires a fixed amount (2GB) of memory per run regardless of file size, whereas DIAMOND and MMseqs2 generally require memory to scale with the increasing input file size (Fig 3d). BLASTX was excluded from comparisons with more than 1 million sequences due to long running times.

Using kMermaid to map new or unseen proteins

A key challenge in metagenomics is the classification of unknown microbial sequences that are not present in existing reference databases. To understand how kMermaid performs on unknown microbial proteins, we classified segments of 22,435 new RefSeq protein sequences deposited between January and May 2025, after our frequency model was developed. We compared the resulting mappings to BLASTX alignments with the same RefSeq database used to construct kMermaid’s underlying database. Importantly, we observed that the kMermaid assignment score is correlated with BLASTX percent identity (Spearman r = 0.83, 0.82, 0.8 for reads of length 125, 150, and 200, respectively; S3a, b Fig), highlighting the importance of choosing a more stringent score threshold when classifying reads which are likely to originate from unseen microbes. We further used the area under the receiver operating characteristic curve (AUROC) to assess kMermaid’s ability to correctly classify reads, where a correct classification was based on BLASTX alignment. kMermaid was able to achieve AUROCs of 0.93 and ≥0.96 for reads matching BLASTX results and higher confidence BLASTX results filtered at a more stringent percent-identity threshold of 66.6%, respectively (Fig 4a). We observed that kMermaid scores ranging from 6.3-8, depending on the read length with longer reads requiring higher scores, were able to achieve false positive rates ≤0.05 while still maintaining true positive rates around 80% (S2 Table). BLASTX was able to classify a slightly higher percentage of reads (13.7-14.9%) compared to kMermaid (11.7-13.6%), but kMermaid was found to be highly concordant with BLASTX results, with around 95% of assignments correct assuming BLASTX as the gold standard (S2 Table).
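Choosing a score threshold under a false positive rate constraint can be sketched as below. The scores and labels are invented toy values, the function names are hypothetical, and a label of `True` is assumed to mean that the assignment agrees with the reference alignment. AUROC is computed through its rank (Mann-Whitney) interpretation.

```python
def roc_points(scores, labels):
    """Return (threshold, TPR, FPR) for every distinct score, ascending."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((t, tp / n_pos, fp / n_neg))
    return points


def pick_threshold(scores, labels, max_fpr=0.05):
    """Lowest threshold whose FPR is within budget (keeps TPR highest)."""
    for t, tpr, fpr in roc_points(scores, labels):
        if fpr <= max_fpr:
            return t, tpr, fpr
    return None


def auroc(scores, labels):
    """Probability that a correct assignment outscores an incorrect one."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy scores (invented) and whether each assignment matched the reference.
scores = [9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0]
labels = [True, True, True, False, True, False, False, False]

print(pick_threshold(scores, labels))  # (7.0, 0.75, 0.0)
print(auroc(scores, labels))           # 0.9375
```

Scanning thresholds from low to high works because the false positive rate can only fall as the threshold rises, so the first threshold that meets the budget also retains the most true positives.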

Fig 4. Biological applications, function-specific performance, and evidence of remote homology detection of kMermaid.

(a) Receiver operating curves demonstrating the ability of kMermaid’s assignment score to correctly classify reverse-translated nucleotide segments of varying lengths from unseen protein sequences that were added to RefSeq in early 2025. (b) Agreement of BLASTX alignments and kMermaid protein assignments on 29 fecal samples from ulcerative colitis patients. (c) Boxplot showing kMermaid agreement with BLASTX for clusters with specific functional annotations. (d) Violin plots showing the k-mer frequency scores for six clusters of reads unclassified with BLASTX that were correctly functionally classified by kMermaid.

https://doi.org/10.1371/journal.pcbi.1013470.g004

kMermaid is highly sensitive for protein cluster mapping of human fecal samples

Even though kMermaid performed well on simulated reads, it is difficult to account for the additional challenges and noise associated with real, experimental data by simulations alone. Therefore, we performed additional testing using real sequencing data from 29 publicly available human fecal samples of ulcerative colitis patients [29], comparing kMermaid assignments against BLASTX. On average, kMermaid results agreed with BLASTX alignments 83.3% of the time and the agreement rate was highly consistent across the 29 samples (Fig 4b). Assuming BLASTX hits to be the ground truth, kMermaid was able to maintain a balance between retaining a high percentage of the assignments that agree with BLASTX hits as well as a high ratio of assignments that agree with BLASTX to assignments that disagree with BLASTX (Figs 4b and S3c) at the default kMermaid assignment score ≥ 3. Given that the overwhelming majority of BLASTX hits belong to a single cluster (Fig 2a), the consistent agreement between BLASTX and kMermaid provides strong evidence for kMermaid’s ability to accurately and sensitively classify short reads from human fecal metagenomic sequencing.

Cluster specific results

A primary objective of kMermaid is to achieve high performance for classification at the read level. To comprehensively assess kMermaid’s performance, we also evaluated its cluster specific performance. We compared the kMermaid read assignments of experimental metagenomic sequencing input samples used for benchmarking to the assignment by BLASTX, within each cluster. Interestingly, we find that clusters related to restriction, toxins, transposons, and those in the GCN5-related N-acetyltransferases family (GNAT) had high agreement with BLASTX, whereas clusters related to ABC transporters tended to have relatively low agreement with BLASTX, and therefore likely lower accuracy (Fig 4c). We also confirmed that the proportion of reads concordant with BLASTX was correlated with the mean kMermaid assignment score for all reads mapping to the cluster, a trend that was not confounded by the number of reads mapped to the cluster (S3d Fig). Importantly, by investigating reads that were assigned with a high kMermaid assignment score but were not classified by BLASTX, we identified reads with remote homology to proteins within kMermaid clusters. We verified a correct functional classification of reads assigned to six such kMermaid clusters using both PSI-BLAST [30] and HHblits3 from HH-suite3 [27] and validated these kMermaid functional annotations (Fig 4d and S1 File).
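A per-cluster summary of this kind can be sketched as follows. The record layout, cluster names, and scores are invented for illustration, and `cluster_agreement` is a hypothetical helper. A read counts as concordant when its kMermaid cluster equals the cluster of its BLASTX hit.

```python
from collections import defaultdict


def cluster_agreement(records):
    """Summarize agreement with the reference for each assigned cluster.

    records: iterable of (kmermaid_cluster, reference_cluster, score).
    Returns {cluster: {"agreement", "mean_score", "n_reads"}}.
    """
    stats = defaultdict(lambda: [0, 0, 0.0])  # agreeing reads, reads, score sum
    for assigned, reference, score in records:
        entry = stats[assigned]
        entry[0] += assigned == reference
        entry[1] += 1
        entry[2] += score
    return {
        cluster: {"agreement": agree / n, "mean_score": total / n, "n_reads": n}
        for cluster, (agree, n, total) in stats.items()
    }


# Toy records (all values invented).
records = [
    ("GNAT", "GNAT", 8.0),
    ("GNAT", "GNAT", 6.0),
    ("ABC transporter", "ABC transporter", 4.0),
    ("ABC transporter", "other", 2.0),
]
summary = cluster_agreement(records)
print(summary["GNAT"])             # {'agreement': 1.0, 'mean_score': 7.0, 'n_reads': 2}
print(summary["ABC transporter"])  # {'agreement': 0.5, 'mean_score': 3.0, 'n_reads': 2}
```

Keeping `n_reads` alongside the agreement rate makes it possible to check whether a relationship between agreement and mean score is driven by cluster size.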

Conclusions

Metagenomic sequencing allows for the functional profiling of diverse microbes facilitating numerous biomedical applications, such as biomarker discovery and disease prediction [31–33]. To date, k-mer and binning based methods have been immensely useful in allowing efficient and sensitive classification of short read metagenomic sequencing into taxonomic units [21–24,34–37]. However, to the best of our knowledge, analogous methods that achieve sensitive and efficient classification at the protein level have not been developed. As a result, there remain critical limitations in our ability to classify short microbial reads for the ultimate task of downstream analysis and biological inference. Notable limitations of functional read assignment methods are ambiguous alignments and computational costs which are prohibitive for the large volumes of data typically generated by next generation sequencing experiments. The loss of granularity from higher-level functional mappings into pathways additionally prohibits analyses where amino acid sequences may be needed such as microbial peptide binding prediction [38,39]. As such, there is a need for methods that can capture and retain the underlying biology of microbial function in a granular, computationally feasible, and accessible manner.

To address these challenges, we have developed kMermaid, a novel ultrafast method for the unambiguous and sensitive classification of short reads into functional units consisting of protein clusters. We show that by using a well-known concept of clustering homologous proteins into a single functional unit, kMermaid rapidly resolves the majority of ambiguous BLASTX protein alignments while retaining granular amino acid level information. kMermaid uses a precomputed k-mer frequency model based on high-confidence protein clusters encompassing almost two million microbial proteins from RefSeq. Our designated clustering allows the assignment of diverse proteins into clusters with mostly homogenous k-mers or words, enhancing the potential of the model to correctly capture distinct functions. Using both simulated short reads and sequencing data from real human fecal samples, we demonstrate that kMermaid classifies reads with high accuracy and sensitivity compared to BLASTX but runs up to 2,500% faster on typical files with tens of millions of sequences. Additionally, we were able to verify kMermaid’s ability to correctly classify reads that were unclassified by BLASTX, for sequences sharing remote homology to proteins within kMermaid clusters. This striking performance is likely achieved through kMermaid’s composite assignment scoring which uses information on a set of proteins in a cluster to classify each read in contrast to BLASTX, which is based on pairwise comparisons.

kMermaid assigns a short read to a single protein cluster from a fixed set (database) based on a global maximum k-mer frequency assignment score, which implies certain limitations. First, in the case of ties, it assigns the sequence to whichever cluster reaches the maximum score first, which is effectively arbitrary. Despite this, we have shown that ties should not be a widespread issue since most multi-mapped reads are mapped to a single cluster (Fig 2a) and many amino acid 5-mers are unique (Fig 2d). Second, it is possible that there exist additional clusters of proteins which are not homologous or functionally similar to any provided through our precomputed kMermaid database. Because of this, it is recommended that researchers use our method as a first pass means of dealing with the computational burden of BLASTX and perform alignment-based verification for select proteins of interest. Orthogonally, additional kMermaid databases/models can be (re)trained to reflect periodic increases in the quantity and diversity of available sequences. Third, although we did find evidence supporting some ability for remote homology detection, like any method that relies on sequence comparisons, kMermaid cannot classify truly novel or unseen proteins. Last, there are several biological limitations of inferring function from short reads that kMermaid does not address, including misclassification of multi-domain proteins and operons. These limitations will remain for any metagenomic protein mapping tool designed for short reads and researchers wanting a more thorough view of the functional content of metagenomes should consider methods that include some degree of contig assembly prior to classification [40–42].
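The tie behaviour noted in the first limitation can be made concrete with a small sketch. The order in which kMermaid visits clusters is not specified here, so the insertion order of a Python dictionary stands in for it, and both function names are hypothetical.

```python
def argmax_first(scores):
    """Return the first cluster that attains the maximum score."""
    best_cluster, best_score = None, float("-inf")
    for cluster, score in scores.items():
        if score > best_score:  # strict comparison: later ties never win
            best_cluster, best_score = cluster, score
    return best_cluster


def tied_clusters(scores):
    """List every cluster that shares the maximum score."""
    top = max(scores.values())
    return [cluster for cluster, score in scores.items() if score == top]


scores = {"C7": 2.0, "C3": 2.0, "C9": 1.0}
print(argmax_first(scores))   # C7
print(tied_clusters(scores))  # ['C7', 'C3']
```

Reporting the tied clusters alongside the winner is one inexpensive way to flag the reads for which alignment-based verification is most worthwhile.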

In summary, we present kMermaid, a novel, sensitive, and runtime and memory efficient approach for the task of assigning protein identities to short microbial sequences. Future studies can utilize kMermaid for the discovery of microbial functional biomarkers and as a precursor to downstream quantitative functional analyses.

Methods

Forming functionally similar microbial protein clusters using a two-step clustering procedure

kMermaid is foremost designed to address ambiguity in read alignment, where most shotgun metagenomic reads align against multiple microbial proteins with similar alignment scores and are therefore classified as multiple functionally related proteins by the alignment (Fig 2a). To circumvent this issue, kMermaid uses a well-defined concept and groups functionally related proteins into functional units prior to the assignment such that a read can be uniquely classified into a single functional unit. We therefore constructed a set of comprehensive and high confidence clusters of microbial proteins by employing a two-step clustering procedure using CD-HIT [43,44]. CD-HIT uses an incremental greedy algorithm to identify representative sequences and cluster remaining sequences by sequence similarity using short word filtering. We first clustered 43,176 proteins from the NCBI RefSeq non-redundant microbial protein database [45] using CD-HIT with a similarity threshold of 65% and a word size of 5. This first clustering step resulted in 32,308 clusters allowing non-redundant clusters. In the second step, to further expand and diversify protein members within these clusters, we applied CD-HIT-2D [44] with a 70% sequence similarity threshold to cluster all RefSeq microbial proteins against the previously selected cluster representatives (N = 1,797,426 proteins from RefSeq, dataset downloaded in May 2023). A set of expanded clusters was created from this process such that the final dataset clusters 1,793,361 proteins into 32,308 functional groups.
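The two-step idea can be illustrated with a toy sketch. This is not CD-HIT: the similarity measure below (shared 5-mers over the smaller 5-mer set) is a simplified stand-in for sequence identity, and the function names `greedy_cluster` and `expand_clusters` are hypothetical. The sketch only shows the control flow: a longest-first greedy pass that creates representatives, followed by a second pass that adds sequences to existing representatives without creating new clusters.

```python
def kmer_set(seq, k=5):
    """Return the set of overlapping k-mers (short words) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=5):
    """Stand-in similarity (assumption): shared k-mers over the smaller k-mer set."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / min(len(ka), len(kb))


def greedy_cluster(seqs, threshold=0.65):
    """Step 1: longest-first greedy clustering; returns {representative: members}."""
    clusters = {}
    for seq in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if similarity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]  # no match: seq becomes a new representative
    return clusters


def expand_clusters(clusters, new_seqs, threshold=0.70):
    """Step 2: add sequences to existing representatives; never create clusters."""
    expanded = {rep: list(members) for rep, members in clusters.items()}
    unassigned = []
    for seq in new_seqs:
        for rep in expanded:
            if similarity(seq, rep) >= threshold:
                expanded[rep].append(seq)
                break
        else:
            unassigned.append(seq)  # left out, as some proteins are in step 2
    return expanded, unassigned


seed = ["MKTAYIAKQRQISFVK", "MKTAYIAKQRQISFV", "GGGSGGGSGGGSGGGS"]
step1 = greedy_cluster(seed)
step2, leftover = expand_clusters(step1, ["MKTAYIAKQRQISFVKA", "WWWWWWWW"])
```

In this toy run the number of clusters is fixed by the first step, and the second step only grows them, which mirrors how the cluster count stays at 32,308 while membership expands.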

Assigning functions to clusters of hypothetical or uncharacterized microbial proteins

kMermaid annotates its underlying protein clusters based on the name of the representative sequence as determined in the initial CD-HIT phase. Even though these proteins are in the representative microbial protein database, 8,617 (approximately 25%) of the cluster representatives are best annotated by NCBI or RefSeq as hypothetical proteins, i.e., proteins of unknown or unverified function. To annotate these protein clusters and assign them protein names, we used HHblits from HH-suite3 [27] for remote homology detection against two databases derived from the Protein Data Bank and UniProtKB (PDB70 [46] and Uniclust30 [47] v2023, respectively) and selected the match with the lowest e-value across the databases that did not map to hypothetical, unknown, or uncharacterized proteins. We were able to confidently assign a protein name or function to 6,488 of the 8,617 hypothetical clusters (e-value < 0.01); the remaining clusters were assigned names with lower confidence.

Creating kMermaid’s k-mer frequency cluster model using nested hashing

The goal of kMermaid is to functionally classify short reads using the previously defined clusters of functionally similar microbial proteins. To this end, we built a k-mer frequency model by obtaining all amino acid (AA) k-mers of all protein sequences in each of 32,308 clusters and computing the cluster-level frequency of each AA k-mer (S1 Fig). The value of k was chosen through hyperparameter search, by simulating truncations of each protein in the database into 50 overlapping AA segments. We evaluated k values of 3, 4, 5 and 6 for classifying truncated protein sequences, measuring accuracy as the fraction of truncated sequences which were correctly assigned to their cluster. Both k = 5 and k = 6 achieved similarly high accuracy (>0.99), but the number of k-mers increased sharply from 2,574,615 for k = 5 to 12,043,100 for k = 6. Therefore, k = 5 was selected, achieving high accuracy with substantially fewer parameters.
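To put the two reported model sizes in context, the short calculation below compares them with the number of possible k-mers. It assumes an alphabet of exactly 20 standard amino acids (an assumption; database sequences may contain additional symbols), so the fractions are only approximate.

```python
ALPHABET_SIZE = 20  # assumption: only the 20 standard amino acids

# Unique k-mer counts reported in the text for k = 5 and k = 6.
observed = {5: 2_574_615, 6: 12_043_100}

for k, n_observed in observed.items():
    n_possible = ALPHABET_SIZE ** k
    print(f"k={k}: {n_observed:,} of {n_possible:,} possible k-mers "
          f"({n_observed / n_possible:.1%})")

growth = observed[6] / observed[5]
print(f"the k=6 model holds {growth:.2f}x as many k-mers as the k=5 model")
```

Under this assumption, the k = 5 model covers roughly four fifths of all possible 5-mers, while moving to k = 6 multiplies the number of stored k-mers by about 4.7 for a similar accuracy.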

We then obtained overlapping 5-mers of each protein amino acid sequence. The k-mer frequencies for each cluster were defined by the count of the k-mer in the cluster divided by the total number of proteins in the cluster (note that frequency can be > 1 if k-mers appear multiple times on average in the proteins of a cluster). The underlying model is then stored in a two-level hash map where the top-level map stores, for each k-mer $w$, a map of the clusters containing $w$, and the second level maps each cluster $C$ containing $w$ to the frequency of $w$ in $C$, $f_C(w)$ (S1 Methods, Algorithm 1). The resulting nested hash map can be written as $M[w][C] = f_C(w)$. This precomputed model, consisting of 2,574,615 unique 5-mers (all the naturally occurring AA 5-mers in the underlying protein cluster database), is then used to determine the cluster to which a query sequence is assigned. The map is distributed along with the kMermaid package and is implemented as the default frequency model.
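A minimal sketch of such a nested map is shown below. It is an illustration under stated assumptions, not the packaged implementation: the function name `build_model`, the toy cluster identifiers, and the toy sequences are all hypothetical, and plain Python dictionaries stand in for whatever hash map the tool actually uses.

```python
from collections import defaultdict


def kmers(seq, k=5):
    """Yield all overlapping k-mers of an amino acid sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_model(clusters, k=5):
    """Build a nested map: k-mer -> {cluster id -> frequency}.

    `clusters` maps a cluster id to the list of its protein sequences.
    Frequency = occurrences of the k-mer in the cluster / number of proteins.
    """
    model = defaultdict(dict)
    for cluster_id, proteins in clusters.items():
        counts = defaultdict(int)
        for protein in proteins:
            for w in kmers(protein, k):
                counts[w] += 1
        for w, count in counts.items():
            model[w][cluster_id] = count / len(proteins)
    return dict(model)


# Toy input (hypothetical): two clusters with two and one proteins.
toy_clusters = {
    "C1": ["MKTAYIAK", "MKTAYLAK"],
    "C2": ["GGGGGGG"],
}
model = build_model(toy_clusters)
```

Here `model["MKTAY"]` holds a frequency of 1.0 for `C1` because both proteins contain it once, `model["KTAYI"]` holds 0.5, and `model["GGGGG"]` holds 3.0 for `C2`, an instance of a frequency above 1 when a k-mer repeats within a protein.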

Assigning protein maps to reads using the pre-computed k-mer model

To assign a read into a protein cluster, a six-frame translation is applied to a query sequence $q$ and translations containing a stop codon (truncated frames) are discarded. Next, all overlapping AA 5-mers are extracted from the non-truncated frames. A score representing the strength of a match between $q$ and each cluster is then calculated by the summation of the precomputed model k-mer frequencies for each k-mer in the query sequence. The score for query sequence $q$ and each cluster $C$ is computed as:

$$S(q, C) = \sum_{w \in q} f_C(w)$$

where $f_C(w)$ is the model frequency of each k-mer $w$ from $q$ in cluster $C$, i.e., the average occurrence of $w$ in proteins of $C$. Finally, the sequence is assigned to the cluster with the global maximum score across all clusters (Fig 1; S1 Methods, Algorithm 2). kMermaid annotates the query sequence by the cluster representative for the cluster corresponding to this maximum assignment score. This scoring approach effectively assigns higher confidence to scores when k-mers within a query sequence are more frequently observed in a cluster. kMermaid uses a k of 5 AA, chosen via hyperparameter search as described previously, for both the base model construction and the assignment procedure, and reports assignments with an assignment score >3.
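The assignment procedure can be sketched as follows. This is a simplified illustration, assuming the standard genetic code and a model stored as a dictionary of dictionaries (k-mer to cluster to frequency); the names `six_frame_translations` and `assign_read`, the toy model, and the toy read are hypothetical. Ties are resolved here by taking the first cluster that reaches the maximum, which is one possible reading of the tie behavior described in the Discussion.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_translations(read):
    """Translate a nucleotide read in all six reading frames."""
    read = read.upper()
    frames = []
    for strand in (read, read.translate(COMPLEMENT)[::-1]):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            # Codons with ambiguous bases (e.g. N) are translated to 'X'.
            frames.append("".join(CODON_TABLE.get(c, "X") for c in codons))
    return frames


def assign_read(read, model, k=5, min_score=3.0):
    """Return (cluster, score) for the best-scoring cluster, or None."""
    scores = defaultdict(float)
    for frame in six_frame_translations(read):
        if "*" in frame:  # discard frames truncated by a stop codon
            continue
        for i in range(len(frame) - k + 1):
            for cluster, freq in model.get(frame[i:i + k], {}).items():
                scores[cluster] += freq
    if not scores:
        return None
    best = max(scores, key=scores.get)  # first cluster reaching the maximum
    return (best, scores[best]) if scores[best] > min_score else None


# Toy model and read (hypothetical); the read encodes MKTAYIAK in frame 1.
toy_model = {
    "MKTAY": {"C1": 1.0, "C2": 0.5},
    "KTAYI": {"C1": 1.0},
    "TAYIA": {"C1": 1.0},
    "AYIAK": {"C1": 1.0, "C2": 0.25},
}
toy_read = "ATGAAAACTGCTTATATTGCTAAA"
result = assign_read(toy_read, toy_model)
```

In this toy case the read scores 4.0 against `C1` and 0.75 against `C2`, so it is assigned to `C1`; a read whose best score does not exceed the reporting threshold is left unassigned.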

Evaluating the clusters of functionally similar microbial proteins

As the protein clusters lie at the base of kMermaid’s approach, we validated their correctness using two orthogonal analyses aimed at verifying that the clusters produced contain homologous proteins with shared biological function.

  1. Compatibility with NCBI protein clusters. To demonstrate kMermaid’s ability to construct biologically relevant clusters, we compared the results of the two-step CD-HIT clustering to the datasets from the NCBI Protein Clusters [28], which groups together proteins by sequence similarity. A subset of 102,380 of the proteins contained in our expanded cluster model was also clustered through the prokaryotic PCLA protein clusters dataset within this database. Proteins that were in this overlapping subset and were also in a non-singleton kMermaid cluster were used for comparison (N = 102,367, mapped to 9,984 kMermaid clusters). For each of these proteins, we evaluated their tendency to be co-clustered with the same proteins in both PCLA and kMermaid clusters by computing the percent of co-clustered proteins by kMermaid clustering that were also co-clustered in PCLA. The number of kMermaid clusters was chosen as the denominator for this evaluation metric to verify the correctness of kMermaid clusters rather than to assess its ability to maximize clusters, which is not an objective of this approach.
  2. Within-cluster keyword similarity. High-throughput text analysis was performed on the protein name annotations to further investigate the similarity and homogeneity of the clusters. Trends and frequencies of word presence in clusters were used to evaluate the functional homogeneity of clusters. After removing ubiquitous and generic words (e.g., “bacteria” or “protein”), we computed the frequency of the most common keyword found in each cluster, i.e., the fraction of proteins in the cluster containing the most common keyword in that cluster. Most clusters demonstrated a common keyword frequency of 1, indicating that our clusters are highly homogeneous in key functions.
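The keyword metric in the second analysis can be sketched as below. The stop-word list, the whitespace tokenization, and the function name `top_keyword_frequency` are assumptions made for illustration; the actual text processing may differ.

```python
from collections import Counter

# Hypothetical list of generic words to ignore (the text gives two examples).
GENERIC_WORDS = {"protein", "bacteria"}


def top_keyword_frequency(names, generic_words=GENERIC_WORDS):
    """Return (keyword, fraction of proteins whose name contains it)."""
    counts = Counter()
    for name in names:
        # Count each keyword at most once per protein name.
        words = {w for w in name.lower().split() if w not in generic_words}
        counts.update(words)
    if not counts:
        return None, 0.0
    keyword, n = counts.most_common(1)[0]
    return keyword, n / len(names)


homogeneous = ["ABC transporter permease", "ABC transporter protein",
               "ABC family protein"]
mixed = ["hypothetical protein", "kinase", "histidine kinase", "protein"]

print(top_keyword_frequency(homogeneous))
print(top_keyword_frequency(mixed))
```

A cluster in which every protein name shares one keyword has a frequency of 1, as in the first toy cluster; the second toy cluster reaches only 0.5.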

Benchmarking against established protein mapping tools

We benchmarked kMermaid against popular methods that can be used for protein mapping: BLASTX, DIAMOND, and MMseqs2. Each method was run against an underlying database containing the same 1,797,426 RefSeq protein sequences described previously. For consistency, we used the default or recommended configurations of each method. BLASTX and DIAMOND in BLASTX mode were run with e-values of 1e-4 and 1e-3 (default), respectively. MMseqs2 was run with a min-length set to 16 and e-value set to 1e-4. Each method except for BLASTX was set to report the default number of matches based on the e-value. BLASTX was set to report a maximum of 1 match for simulated data and 5 matches for experimental data due to its excessive running time when reporting all matches. DIAMOND was additionally run with a maximum of 1 match to demonstrate the consistency of mapping when using higher level groups rather than exact matches, as well as to provide a fairer comparison for resource benchmarking.

Performance assessment using reads simulated from RefSeq sequences.

To demonstrate that kMermaid correctly assigns proteins to clusters, we benchmarked kMermaid using data simulated from nucleotide sequences of 1,383 microbial coding frames downloaded from RefSeq for which the true cluster identity is known, i.e., proteins that already exist in kMermaid’s model and thus have a ground-truth cluster label. From these, simulated data were generated with varying mutation rates and read lengths. Mutation rates were chosen to be representative of bacterial mutation rates [48] and error rates in next-generation sequencing data [49]. For continuous rates, the number of mutations per sequence was determined probabilistically using a binomial distribution and the location of the mutation in the sequence was determined by random sampling. A mutation was defined as a random assignment of any nucleotide that did not match the original position. Query sequences were then created by segmenting the mutated sequence to the specified read length, $l$, starting at some random position $i$ such that $i + l \le n$, where $n$ is the length of the sequence. Since low mutation rates could probabilistically result in no mutations, we additionally simulated reads guaranteed to have 1, 2, 3, or 4 mutations resulting in an amino acid change. To guarantee an amino acid change we employed the following procedure: 1) randomly select a substring of read length, $l$, starting at some random position $i$ such that $i + l \le n$, 2) perform a translation on the protein nucleotide sequence for the correct open reading frame based on the original sequence, 3) randomly select an amino acid to change, 4) concatenate the original substring with a randomly-selected reverse translation of the amino acid. Protein maps were assigned using the same reference database of 1,797,426 proteins that were used to develop the kMermaid database using default configurations of BLASTX, DIAMOND, kMermaid, and MMseqs2 or as described above. In lieu of the defaults, BLASTX was run with the recommended minimum value of 1 maximum match to accommodate a reasonable running time. DIAMOND was additionally run with only (at most) 1 match as a point of comparison for unambiguous reporting. Results from the simulations were averaged across 10 replicate datasets generated with a different random seed for each combination of parameters.
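The continuous-rate part of the simulation can be sketched as follows. This is an illustration under assumptions: the binomial draw is implemented as a sum of independent per-base trials, mutated positions are sampled without replacement, and the function names `mutate` and `sample_read` are hypothetical. The guaranteed amino acid change procedure is not covered here.

```python
import random


def mutate(seq, rate, rng):
    """Substitute bases; the number of mutations is Binomial(len(seq), rate)."""
    # A sum of len(seq) Bernoulli(rate) trials is one binomial draw.
    n_mut = sum(rng.random() < rate for _ in range(len(seq)))
    bases = list(seq)
    for pos in rng.sample(range(len(seq)), n_mut):
        # A mutation is any nucleotide other than the original one.
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases)


def sample_read(seq, read_len, rng):
    """Cut a read of length read_len at a random start that fits in seq."""
    if read_len > len(seq):
        raise ValueError("read length exceeds sequence length")
    start = rng.randint(0, len(seq) - read_len)  # inclusive bounds
    return seq[start:start + read_len]


rng = random.Random(0)  # fixed seed; one seed per replicate dataset
gene = "ATGAAAACTGCTTATATTGCTAAA"
read = sample_read(mutate(gene, 0.05, rng), 12, rng)
```

Using a separate seeded generator per replicate, as in `random.Random(seed)`, keeps each simulated dataset reproducible.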

Benchmarking computational resource utilization.

We also compared the speed and maximum memory usage of kMermaid to BLASTX, DIAMOND, and MMseqs2. DIAMOND, which is up to 20,000 times faster than BLASTX, is widely considered the fastest protein aligner that can maintain the sensitivity of BLASTX results and is included in our benchmarking as a standard for efficient resource consumption. To compare the runtime of each method, we created random subsets of a single fasta file containing nucleotide metagenomic sequences from a published immunotherapy trial of melanoma patients [50]. For running time comparisons, we tested input files with a varying number of sequences ranging from 5,000–40 million with 10 replicates each to account for machine or algorithmic variability. BLASTX and DIAMOND sequence queries were performed against the same database that was used to create the kMermaid model described above. All comparisons were run on a Linux kernel using 1 task and 1 CPU per task. Because most methods can run tens of millions of sequences in under a day, we set an upper time limit of 72 hours and denote jobs that were unable to be completed in that time frame. Since no jobs took between 24 and 72 hours, we truncated the upper limit of the y-axis for visualizations to 24 hours.

Individual metagenomic sequencing experiments can yield large volumes of data and file sizes are commonly on the order of tens of gigabytes. As such, efficient memory usage is another important factor to consider when choosing analysis tools. We compared the memory (RAM) usage between all methods for input files containing 100,000, 1 million, 10 million, 20 million, 40 million, 60 million, and 80 million reads, with the latter numbers corresponding to the total number of reads commonly generated from a standard paired-end sequencing experiment (10–40M each, combined paired end). Because our SLURM cluster was not set up to report maximum memory usage, maximum RAM usage was inferred by submitting jobs with increasing memory allocations (500MB, 1GB, 2GB, 3GB, 4GB, 6GB, 8GB, 10GB, 12GB, 14GB, 16GB) until the job completed successfully without reporting a memory-related error. For example, a job reported as requiring 12GB of memory failed with 10GB and completed with 12GB, so its maximum RAM usage was more than 10GB but no more than 12GB. We flagged jobs that could not be completed with 16GB of memory, a limit meant to reflect the feasibility of analysis on a laptop. Runtime in seconds was computed manually using the date function in Linux. BLASTX was excluded from comparisons of over 1,000,000 sequences due to its infeasibly high running time, although we acknowledge that BLASTX memory usage is generally minimal.
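The bracketing logic behind the reported memory values can be written as a short sketch. The callable `job_succeeds` is a hypothetical stand-in for submitting a job with a given memory allocation and checking whether it finished without a memory error; no scheduler interface is implied.

```python
# Memory allocations tried, in GB, as listed in the text.
TIERS_GB = [0.5, 1, 2, 3, 4, 6, 8, 10, 12, 14, 16]


def infer_memory_bracket(job_succeeds, tiers_gb=TIERS_GB):
    """Return (lower, upper): the job failed at lower and succeeded at upper.

    upper is None when the job fails even at the largest allocation.
    """
    lower = 0
    for tier in tiers_gb:
        if job_succeeds(tier):
            return lower, tier
        lower = tier
    return lower, None


# Hypothetical job whose true peak usage is 11 GB.
bracket = infer_memory_bracket(lambda gb: gb >= 11)
```

For the hypothetical 11 GB job, the result is the interval from 10 to 12 GB, which is the kind of value reported as "12GB" in the benchmarks.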

Analysis of unseen RefSeq microbial protein data

Since kMermaid was first implemented in 2023, we were able to test the ability of our tool to classify unknown microbial proteins by using 22,435 annotated protein sequences deposited in RefSeq between January 2025 and May 2025 (obtained May 2025). Because our method uses nucleotide reads as input, we had to perform a reverse translation of the amino acid sequences. If an amino acid reverse mapped to multiple tri-nucleotide sequences without a stop codon, a tri-nucleotide sequence was selected at random, which created an even more challenging classification task. We also removed sequences with missing or non-standard amino acid sequences at this stage. We then created datasets containing randomly selected 125, 150, and 200 base pair substrings of the reverse translated sequences with 10 different seeds each. We then ran BLASTX with max_target_seqs = 5 and evalue = 0.0001 and kMermaid with default configurations. Spearman correlation was computed on all BLASTX and kMermaid results; datapoints were down sampled to 150,000 across all 30 datasets for visualization (S2a Fig). We additionally performed post-hoc filtering of BLASTX results using percent identity ≥66.6 to obtain higher confidence hits for comparison with kMermaid. Because the reference databases for each method did not contain the truth assignment, we considered a correct hit, or true positive, to be a sequence for which the BLASTX and kMermaid protein map matched. We computed AUROC using the kMermaid scores for each read assigned by both BLASTX and kMermaid. We then determined the kMermaid scoring threshold at which the false positive rate fell below 0.05 for each read length. These thresholds were used for subsequent comparisons with BLASTX (S2 Table). We note that while kMermaid assignment scores generally correlate with higher confidence for data of consistent read length, the specific score thresholds reported are likely specific to this analysis given the cluster-specific trends and biases in recently deposited protein sequences, i.e., over-representation of specific proteins.
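The threshold selection can be sketched as follows, under stated assumptions: a read counts as a predicted positive when its score is strictly greater than the threshold, a label is true when the kMermaid and BLASTX protein maps match, and the lowest qualifying threshold is chosen. The function names and toy data are hypothetical, and the pairwise AUROC is only practical for small inputs.

```python
def auroc(scores, labels):
    """AUROC as the probability that a correct read outscores an incorrect one."""
    pos = [s for s, ok in zip(scores, labels) if ok]
    neg = [s for s, ok in zip(scores, labels) if not ok]
    # Requires at least one correct and one incorrect read.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def threshold_at_fpr(scores, labels, max_fpr=0.05):
    """Lowest score threshold whose false positive rate is below max_fpr."""
    neg = [s for s, ok in zip(scores, labels) if not ok]
    if not neg:
        return min(scores)
    for t in sorted(set(scores)):
        false_positives = sum(s > t for s in neg)
        if false_positives / len(neg) < max_fpr:
            return t
    return max(scores)  # only reached when max_fpr <= 0


# Toy scores and match labels (hypothetical).
toy_scores = [2, 9, 4, 7, 1, 8, 3, 6]
toy_labels = [False, True, False, True, False, True, True, False]

print(auroc(toy_scores, toy_labels))
print(threshold_at_fpr(toy_scores, toy_labels))
```

In practice this would be run separately for each read length, since the text reports read length-specific thresholds.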

Protein mapping of reads from fecal samples from ulcerative colitis patients

The goal of kMermaid is to efficiently map reads to proteins, while maintaining the accuracy and sensitivity of BLASTX. We benchmarked kMermaid’s functional read classification against BLASTX using paired end reads from ulcerative colitis patients enrolled in the LOTUS fecal matter transplant clinical trial [29] available on the NCBI’s Sequence Read Archive (SRA). Two samples were excluded from analysis based on data incompletion (low read depth). Due to the infeasibly long running times incurred by BLASTX, we randomly subsampled each fasta file to 100,000 reads. We used a small, representative subset (n = 3) of the available samples for in-depth follow-up analyses of all reads in the samples. BLASTX was set to max_target_seqs = 5, evalue = 1e-4 and to max_target_seqs = 3, evalue = 0.01 for these analyses, respectively. We additionally filtered BLASTX results at percent identity >66.6% where specified. We compared the overall coverage, defined as the percentage of reads that were able to be classified by both BLASTX and kMermaid, as well as the percentage of BLASTX hits that kMermaid was able to classify. We further investigated the correctness of kMermaid’s assignments using BLASTX results as a gold standard for correct read assignment, by examining reads which were assigned by both methods. Correct assignment by kMermaid was defined for reads as a non-empty overlap between proteins in the BLASTX hits and the assigned kMermaid clusters.

Cluster-specific results and remote homology detection

To evaluate kMermaid’s performance for specific biological functions, we calculated the accuracy across clusters with similar functional annotations. We used the LOTUS clinical trial data to evaluate function specific performance in a real metagenomic sequencing cohort. To this end, for every cluster we calculated the ratio of reads correctly assigned to that cluster (i.e., assigned to that cluster by BLASTX and kMermaid), out of all the reads assigned to that cluster by kMermaid. Then, we evaluated the distribution of cluster performances for distinct functions, i.e., clusters named with common keywords (S3 Table).
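The per-cluster ratio can be sketched as below, using the overlap-based definition of a correct assignment (a non-empty intersection between the BLASTX hits of a read and the members of its kMermaid cluster). The data structures and the name `per_cluster_concordance` are hypothetical.

```python
from collections import defaultdict


def per_cluster_concordance(kmermaid, blastx_hits, cluster_members):
    """Fraction of reads per kMermaid cluster that agree with BLASTX.

    kmermaid:        read id -> assigned cluster id
    blastx_hits:     read id -> set of protein ids hit by BLASTX
    cluster_members: cluster id -> set of protein ids in the cluster
    """
    assigned = defaultdict(int)
    correct = defaultdict(int)
    for read, cluster in kmermaid.items():
        assigned[cluster] += 1
        # Correct when any BLASTX hit is a member of the assigned cluster.
        if blastx_hits.get(read, set()) & cluster_members[cluster]:
            correct[cluster] += 1
    return {c: correct[c] / assigned[c] for c in assigned}


# Toy inputs (hypothetical).
toy_kmermaid = {"r1": "C1", "r2": "C1", "r3": "C2"}
toy_blastx = {"r1": {"p1"}, "r2": {"p9"}, "r3": {"p3", "p4"}}
toy_members = {"C1": {"p1", "p2"}, "C2": {"p3"}}

ratios = per_cluster_concordance(toy_kmermaid, toy_blastx, toy_members)
```

These per-cluster ratios can then be grouped by keywords in the cluster names to compare performance across functions.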

To evaluate kMermaid’s ability to identify sequences with remote homology to proteins in the database, we examined reads from the LOTUS clinical trial data that were not classified by BLASTX but were assigned with high kMermaid assignment scores (>20). We explored such reads that were confidently assigned by kMermaid to six clusters, and carefully verified that these reads have remote homology to their kMermaid-assigned clusters using PSI-BLAST and HHblits from HH-suite3 [27] (S1 File).

Supporting information

S1 Fig. kMermaid’s internal k-mer frequency model with all steps outlined.

https://doi.org/10.1371/journal.pcbi.1013470.s001

(TIF)

S2 Fig. Performance on simulated reads with known labels, with 1–4 introduced mutations per read.

(a) Percent correct labels by each method for 3 typical read lengths evaluated. (b) Percent of input reads mapped by each method for 3 typical read lengths evaluated.

https://doi.org/10.1371/journal.pcbi.1013470.s002

(TIF)

S3 Fig. Biological applications.

(a) Spearman correlation between kMermaid’s assignment score and the maximum BLASTX percent identity per read for all reverse-translated nucleotide segments (lengths = 125, 150, 200 base pairs) of each unseen RefSeq protein that was mapped by both methods without thresholding. To prevent overplotting, reads were down sampled to 100,000 (40% of total). (b) Histograms showing the kMermaid score of all reverse-translated nucleotide segments (lengths = 125, 150, 200 base pairs) of each unseen RefSeq protein that was mapped by both kMermaid and BLASTX at > 66.6 percent identity. The dashed lines denote the read length-specific thresholds determined by maintaining a false positive rate < 0.05. (c) The percent of all input reads able to be classified by kMermaid compared to BLASTX for sequencing from 3 representative colitis samples, chosen randomly. kMermaid’s optimal scoring threshold was determined by maximizing the percentage of the assignments that agree with BLASTX hits (Sensitivity, dark blue) while retaining a high ratio of assignments that agree with BLASTX to assignments that disagree with BLASTX (Accuracy, light blue). (d) Correlation between the proportion of reads concordant with BLASTX and the mean assignment scores (log-transformed) for all proteins in the cluster. Distributions of these metrics broken down by number of reads mapped to the cluster where clusters in the bottom tertile have the lowest number of mapped reads and clusters in the top tertile contain the highest.

https://doi.org/10.1371/journal.pcbi.1013470.s003

(TIF)

S1 Table. Protein names and descriptions of cluster representatives of the protein clusters underlying the kMermaid model.

https://doi.org/10.1371/journal.pcbi.1013470.s004

(CSV)

S2 Table. Performance evaluation for sequences from unseen RefSeq microbial protein data.

https://doi.org/10.1371/journal.pcbi.1013470.s005

(XLSX)

S3 Table. The median kMermaid model performance when classifying protein clusters containing different, specific keywords.

https://doi.org/10.1371/journal.pcbi.1013470.s006

(CSV)

S1 File. Selected reads that failed to be classified with BLASTX but were correctly classified with kMermaid share remote sequence homology with proteins in their associated clusters.

https://doi.org/10.1371/journal.pcbi.1013470.s007

(TXT)

S1 Methods. Pseudo-code for model training and read assignment.

https://doi.org/10.1371/journal.pcbi.1013470.s008

(PDF)

References

  1. Chiu CY, Miller SA. Clinical metagenomics. Nat Rev Genet. 2019;20(6):341–55.
  2. Ko KKK, Chng KR, Nagarajan N. Metagenomics-enabled microbial surveillance. Nat Microbiol. 2022;7(4):486–96. pmid:35365786
  3. Gao Y, Li D, Liu Y-X. Microbiome research outlook: past, present, and future. Protein Cell. 2023;14(10):709–12. pmid:37219087
  4. Nayfach S, Pollard KS. Toward Accurate and Quantitative Comparative Metagenomics. Cell. 2016;166(5):1103–16. pmid:27565341
  5. Prakash T, Taylor TD. Functional assignment of metagenomic data: challenges and applications. Brief Bioinform. 2012;13(6):711–27. pmid:22772835
  6. Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, Huerta-Cepas J. eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Mol Biol Evol. 2021;38(12):5825–9. pmid:34597405
  7. Törönen P, Medlar A, Holm L. PANNZER2: a rapid functional annotation web server. Nucleic Acids Res. 2018;46(W1):W84–8. pmid:29741643
  8. Kanehisa M, Sato Y, Morishima K. BlastKOALA and GhostKOALA: KEGG Tools for Functional Characterization of Genome and Metagenome Sequences. J Mol Biol. 2016;428(4):726–31. pmid:26585406
  9. Beghini F, McIver LJ, Blanco-Míguez A, Dubois L, Asnicar F, Maharjan S, et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. Elife. 2021;10:e65088. pmid:33944776
  10. Hera MR, Liu S, Wei W, Rodriguez JS, Ma C, Koslicki D. Metagenomic functional profiling: to sketch or not to sketch? Bioinformatics. 2024;40(Suppl 2):ii165–73. pmid:39230701
  11. Hernández-Plaza A, Szklarczyk D, Botas J, et al. eggNOG 6.0: enabling comparative genomics across 12 535 organisms. Nucleic Acids Res. 2022;51(D1):D389–D394.
  12. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of Molecular Biology. 1990;215(3):403–10.
  13. Gweon HS, Shaw LP, Swann J, De Maio N, AbuOun M, Niehus R, et al. The impact of sequencing depth on the inferred taxonomic composition and AMR gene content of metagenomic samples. Environ Microbiome. 2019;14(1):7. pmid:33902704
  14. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2014;12(1):59–60.
  15. Buchfink B, Reuter K, Drost H-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods. 2021;18(4):366–8.
  16. Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35(11):1026–8. pmid:29035372
  17. Golob JL, Minot SS. In silico benchmarking of metagenomic tools for coding sequence detection reveals the limits of sensitivity and precision. BMC Bioinformatics. 2020;21(1):459. pmid:33059593
  18. Schaeffer L, Pimentel H, Bray N, Melsted P, Pachter L. Pseudoalignment for metagenomic read assignment. Bioinformatics. 2017;33(14):2082–8. pmid:28334086
  19. Raghupathy N, Choi K, Vincent MJ, Beane GL, Sheppard KS, Munger SC, et al. Hierarchical analysis of RNA-seq reads improves the accuracy of allele-specific expression. Bioinformatics. 2018;34(13):2177–84. pmid:29444201
  20. Moeckel C, Mareboina M, Konnaris MA, Chan CSY, Mouratidis I, Montgomery A, et al. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J. 2024;23:2289–303. pmid:38840832
  21. Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15(3).
  22. Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. 2019;20(1):257. pmid:31779668
  23. Breitwieser FP, Baker DN, Salzberg SL. KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biol. 2018;19(1):198. pmid:30445993
  24. Ounit R, Wanamaker S, Close TJ, Lonardi S. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics. 2015;16(1):236. pmid:25879410
  25. Kim D, Song L, Breitwieser FP, Salzberg SL. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 2016;26(12):1721–9. pmid:27852649
  26. Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun. 2016;7:11257. pmid:27071849
  27. Steinegger M, Meier M, Mirdita M, Vöhringer H, Haunsberger SJ, Söding J. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics. 2019;20(1):473. pmid:31521110
  28. Klimke W, Agarwala R, Badretdin A, Chetvernin S, Ciufo S, Fedorov B, et al. The National Center for Biotechnology Information’s Protein Clusters Database. Nucleic Acids Res. 2009;37(Database issue):D216-23. pmid:18940865
  29. Haifer C, Paramsothy S, Kaakoush NO, Saikal A, Ghaly S, Yang T, et al. Lyophilised oral faecal microbiota transplantation for ulcerative colitis (LOTUS): a randomised, double-blind, placebo-controlled trial. The Lancet Gastroenterology & Hepatology. 2022;7(2):141–51.
  30. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402. pmid:9254694
  31. Segata N, Izard J, Waldron L, Gevers D, Miropolsky L, Garrett WS, et al. Metagenomic biomarker discovery and explanation. Genome Biol. 2011;12(6).
  32. Pasolli E, Truong DT, Malik F, Waldron L, Segata N. Machine Learning Meta-analysis of Large Metagenomic Datasets: Tools and Biological Insights. PLoS Comput Biol. 2016;12(7):e1004977.
  33. LaPierre N, Ju CJ-T, Zhou G, Wang W. MetaPheno: A critical evaluation of deep learning and machine learning in metagenome-based disease prediction. Methods. 2019;166:74–82. pmid:30885720
  34. Břinda K, Sykulski M, Kucherov G. Spaced seeds improve k-mer-based metagenomic classification. Bioinformatics. 2015;31(22):3584–92. pmid:26209798
  35. Choi I, Ponsero AJ, Bomhoff M, Youens-Clark K, Hartman JH, Hurwitz BL. Libra: scalable k-mer-based tool for massive all-vs-all metagenome comparisons. Gigascience. 2019;8(2):giy165. pmid:30597002
  36. Shen W, Xiang H, Huang T, Tang H, Peng M, Cai D, et al. KMCP: accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping. Bioinformatics. 2022;39(1).
  37. Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, Ijaz UZ, et al. Binning metagenomic contigs by coverage and composition. Nat Methods. 2014;11(11):1144–6. pmid:25218180
  38. Medzhitov R. Recognition of microorganisms and activation of the immune response. Nature. 2007;449(7164):819–26. pmid:17943118
  39. Cusick MF, Libbey JE, Fujinami RS. Molecular mimicry as a mechanism of autoimmune disease. Clin Rev Allergy Immunol. 2012;42(1):102–11. pmid:22095454
  40. Elbasir A, Ye Y, Schäffer DE, Hao X, Wickramasinghe J, Tsingas K, et al. A deep learning approach reveals unexplored landscape of viral expression in cancer. Nat Commun. 2023;14(1):785. pmid:36774364
  41. Shang J, Jiang J, Sun Y. Bacteriophage classification for assembled contigs using graph convolutional network. Bioinformatics. 2021;37(Suppl_1):i25–33. pmid:34252923
  42. Lugli GA, Milani C, Mancabelli L, van Sinderen D, Ventura M. MEGAnnotator: a user-friendly pipeline for microbial genomes assembly and annotation. FEMS Microbiol Lett. 2016;363(7):fnw049. pmid:26936607
  43. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–9. pmid:16731699
  44. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–2.
  45. Tatusova T, Ciufo S, Fedorov B, O’Neill K, Tolstoy I. RefSeq microbial genomes database: new representation and annotation strategy. Nucleic Acids Res. 2014;42(Database issue):D553–9. pmid:24316578
  46. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28(1):235–42. pmid:10592235
  47. Mirdita M, von den Driesch L, Galiez C, Martin MJ, Söding J, Steinegger M. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 2017;45(D1):D170–6. pmid:27899574
  48. Watford S, Warrington SJ. Bacterial DNA Mutations. In: StatPearls. StatPearls Publishing; 2023. [Accessed August 10, 2023]. http://www.ncbi.nlm.nih.gov/books/NBK459274/
  49. Ma X, Shao Y, Tian L, Flasch DA, Mulder HL, Edmonson MN, et al. Analysis of error profiles in deep next-generation sequencing data. Genome Biol. 2019;20(1):50. pmid:30867008
  50. Peters BA, Wilson M, Moran U, Pavlick A, Izsak A, Wechter T, et al. Relating the gut metagenome and metatranscriptome to immunotherapy responses in melanoma patients. Genome Med. 2019;11(1):61. pmid:31597568