HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data

Identification of specific species in metagenomic samples is critical for several key applications, yet many tools available require large computational power and are often prone to false positive identifications. Here we describe High-AccuracY and Scalable Taxonomic Assignment of MetagenomiC data (HAYSTAC), which can estimate the probability that a specific taxon is present in a metagenome. HAYSTAC provides a user-friendly tool to construct databases, based on publicly available genomes, that are used for competitive read mapping. It then uses a novel Bayesian framework to infer the abundance and statistical support for each species identification and provide per-read species classification. Unlike other methods, HAYSTAC is specifically designed to efficiently handle both ancient and modern DNA data, as well as incomplete reference databases, making it possible to run highly accurate hypothesis-driven analyses (i.e., assessing the presence of a specific species) on variably sized reference databases while dramatically improving processing speeds. We tested the performance and accuracy of HAYSTAC using simulated Illumina libraries, both with and without ancient DNA damage, and compared the results to other currently available methods (i.e., Kraken2/Bracken, KrakenUniq, MALT/HOPS, and Sigma). HAYSTAC identified fewer false positives than both Kraken2/Bracken, KrakenUniq and MALT in all simulations, and fewer than Sigma in simulations of ancient data. It uses less memory than Kraken2/Bracken, KrakenUniq as well as MALT both during database construction and sample analysis. Lastly, we used HAYSTAC to search for specific pathogens in two published ancient metagenomic datasets, demonstrating how it can be applied to empirical datasets. HAYSTAC is available from https://github.com/antonisdim/HAYSTAC.


Introduction
Metagenomics allows high-throughput sequencing of complex microbial communities that may not be possible to grow in a laboratory. Accurate identification of specific species within a complex community is critical for metagenomic applications in a wide variety of fields, including medicine, microbiome sciences, environmental sciences, biosurveillance, and biomolecular archaeology. Confidently assessing the presence of individual species is also crucial for taxonomic profiling [1], inferring genomic evolution and phylogenetics [2], tracing disease outbreaks [3], generating diagnostics from biofluids [4] and studying ancient pathogens [5].
Accurately identifying specific microbial species from ancient metagenomes, however, can be challenging for numerous reasons. Firstly, short read sequences originating from closely related species may be difficult to distinguish from each other. In addition, the species of interest may have low abundance in the sample, or the genomes in the reference database may not be representative of the strain present in a sample [15]. Contamination from laboratory environment or reagents may also obscure identification [16]. Identifying specific species within ancient metagenomes is further complicated by issues inherent to ancient DNA, including short molecules (typically between 30-60 bp [17]), post-mortem miscoding lesions (e.g., cytosine deamination), contamination from soil and sediment bacteria in the burial environment, sample handling, and storage [18].
When present in ancient metagenomes, pathogen DNA is usually found in low abundance [15], and researchers often initially screen many samples before deciding which libraries should be more deeply sequenced in order to obtain whole genome data, or be further processed using targeted capture techniques [19,20,5]. Screening for pathogens is typically performed by shallow shotgun sequencing followed by in silico analyses of short read data using taxonomic classifiers. Following screening, libraries with a positive identification for a species of interest are then more deeply sequenced and investigated, with or without targeted capture [5]. Confident species identification from metagenomic screening data is therefore critical for studies of ancient pathogens.
Tools for detecting microbial species from metagenomes include metagenomic classifiers that report taxonomic relative abundances [21], such as Kraken2 [22,23], KrakenUniq [24] and the Megan Alignment Tool (MALT) [25], as well as species identification pipelines that classify individual reads, such as Sigma [3], Strain Prediction and Analysis using Representative Sequences (SPARSE) [26], Bayesian Reestimation of Abundance with KrakEN (Bracken) [27], and Heuristic Operations for Pathogen Screening (HOPS, specifically designed for ancient DNA) [28]. The metagenomic classifiers Kraken2, KrakenUniq and MALT provide a good starting point for exploratory studies, especially when it is not known which pathogens may be present [29]. These tools, however, can be slow, require a large amount of RAM [24,25,28], and are prone to false positive identifications of low abundance species [30]. The species identification pipelines Sigma [3] and SPARSE [26] are based on more sophisticated statistical models but they require alignment to a large database, which can be slow [26], and do not provide statistical support for individual species identifications.
To address these issues, we developed High-AccuracY and Scalable Taxonomic Assignment of MetagenomiC data (HAYSTAC -https://github.com/antonisdim/HAYSTAC), a lightweight, fast, and user-friendly species identification tool. HAYSTAC evaluates the presence of a particular species of interest in a metagenomic sample and provides statistical support for the species assignment. Our method is designed to estimate the probability that a specific taxon is present in a metagenomic sample given a set of sequencing reads and a database of reference genomes. We first derive the probability that the set of reads has originated from a single species (single source identification). This is useful, for example, for taxonomic identification of unknown source material (e.g. [31]). We then expand this method to identify multiple species in a metagenomic sample (multiple source identification).
To apply HAYSTAC, the user begins by constructing a database from publicly available genomes via a user-friendly automated process. Sample reads are then mapped to all reference genomes in the database using a Bowtie2 wrapper [32]. HAYSTAC then applies a novel Bayesian framework to infer abundance and statistical support for each species identification, and provides a per-read classification, allowing for both hypothesis-driven and exploratory analyses. HAYSTAC can build and make use of arbitrarily sized databases, with minimal effects on sensitivity and specificity. This means that HAYSTAC can provide reliable species identifications by aligning reads to a handful of reference genomes (instead of thousands), without misidentifying taxa because of the restricted database size. This feature dramatically improves its performance relative to other methods. Altogether, HAYSTAC provides a robust method to assess the presence of a specific species in a metagenomic sample without the need to run multiple programs to assess assignment accuracy, such as is required by other methods (i.e., Bracken for Kraken2 [27] and HOPS for MALT).
To evaluate HAYSTAC's performance, we applied it to both simulated microbiome datasets and to empirical datasets that were generated from DNA extracts derived from archaeological human remains. We further used HAYSTAC to assess the presence of putative respiratory pathogens in oral microbiome-derived metagenomes from ancient human dental calculus.
While dental plaque is known to harbour respiratory pathogens [33], few studies to date have reported these species in its mineralized analogue, dental calculus [34,35], a discrepancy that may reflect identification biases in studies of the latter. Improved identification of ancient respiratory pathobionts (host-associated microbial taxa that may cause disease under specific conditions) has the potential to add to our understanding of pathogen evolutionary dynamics and even aid vaccine development.

Method overview
HAYSTAC is a metagenomic species identifier that works with arbitrary user defined databases. The database can be built from any combination of an NCBI search query, a user specified list of NCBI accession codes, the RefSeq representative database of prokaryotic species, or a list of user provided reference assemblies. To construct the database, HAYSTAC uses a heuristic method to select a single representative genome per taxon from those available in the user provided input data (Fig 1). Fig 1. HAYSTAC's workflow. HAYSTAC consists of three main modules: (i) DATABASE, which builds a database of reference genomes from various user input sources; (ii) SAMPLE, which handles downloading sequencing files from the SRA and pre-processing of samples prior to analysis; and (iii) ANALYSE, which performs an analysis of a sample against a database by applying the mathematical model (see methods) for taxonomic abundance estimation.
A database-wide index is then built, followed by a competitive read mapping (very fast local alignment mode in Bowtie2, with default parameters) of each sample against the full database. This step ensures that reads with no matches in the database are excluded in all subsequent analyses ("Unknown Source" category).
Reads passing this initial inclusion filter are then aligned against every individual genome in the database (with Bowtie2 in end-to-end mode). For each alignment, we count the number of transitions (T s ) and transversions (T v ), as each type of mismatch contributes differently to our statistical model, in order to handle ancient DNA damage. We then compute a posterior probability for each read/taxon alignment pair, using both mismatch count and expected mismatch probability (σ, default of 0.05). Accuracy can be improved by adjusting the mismatch probability (σ) using prior information on intra-and interspecific variability in the specific set of species analysed.
Reads that possess a posterior probability exceeding a user-defined threshold (default of 0.75, S5 and S6 Figs) are then assigned to a single taxon. Reads with a lower posterior probability are assigned to the "ambiguous source" category. For each taxon, the count of reads passing the posterior probability threshold are used to compute the mean posterior abundance (and 95% confidence interval); i.e., the relative abundance of a taxon in a sample, using a Dirichlet distribution of the sample input data as a whole. A final validation step, for low depth sequencing data, using breadth of coverage is then used to assess if the assigned reads represent a random genome sample or cluster around specific genomic regions.

Computational performance
We assessed the computational performance of HAYSTAC relative to Sigma, Kraken2/ Bracken, KrakenUniq and MALT by measuring peak memory usage (maximum resident set size) and elapsed execution time (wall clock) for each software package on a machine with 48 CPU cores and 392 GB of RAM under controlled conditions. To control for variability in each measurement, we ran five independent replicates of each job, and performed a linear regression across the observed values.
We first assessed performance when building databases containing 10, 100 and 500 species. For all tested sizes, HAYSTAC uses less memory than MALT, Kracken2/Bracken or KrakenUniq, as well as being faster than KrakenUniq for all database sizes and faster than MALT and Kracken2/Bracken for the 10 and 100 species databases and only marginally slower than MALT and Kracken2/Bracken for the 500 species database. For all databases HAYSTAC was slower and used more memory than Sigma, which does not require large indices for its alignments (Fig 2; S9 Table).
We then assessed the performance of these methods when analysing a one million read dataset against the same three database sizes. HAYSTAC uses less memory than Kraken2/ Bracken, KrakenUniq and MALT for all database sizes, but it uses more memory and is slower than SIGMA for all database sizes (Fig 2; S10 Table).
Lastly, we assessed the performance of these methods when analysing datasets of 10 thousand, 100 thousand and 1 million reads, against a database of 500 species (S1 Fig; S10 Table). Memory performance was unchanged from the earlier benchmarks, showing that for all methods, peak memory usage scales with database size rather than sample size. With the exception of Kraken2/ Bracken and KrakenUniq, all other methods saw only a modest increase in runtime with larger input sizes, showing that database size also has a substantial impact on overall runtime.

Analyses of simulated data sets using RefSeq prokaryotic database
To characterise the sensitivity and specificity of HAYSTAC, we compared its performance to that of three other widely used metagenomic profiling tools: Sigma, Kraken2/Bracken, KrakenUniq and MALT. We simulated four datasets for these comparisons (Table 1), and then profiled each dataset with all five tools. Each simulated dataset was designed to test a different aspect of performance. For each test, we computed: (i) false positive count (total number Computational performance of HAYSTAC and other methods. Linear regression of the elapsed time (wall clock) and peak memory usage (maximum resident set size) for a sample of size 1 million reads and reference databases containing either 10, 100 or 500 genomes, each with 5 replicates. When constructing the database, HAYSTAC uses substantially less memory and runs faster than either Kraken2/Bracken, KrakenUniq or MALT for restricted database sizes. When performing analyses, HAYSTAC uses less memory than Kraken2/Bracken, KrakenUniq or MALT, while its runtime was only marginally slower.
https://doi.org/10.1371/journal.pcbi.1010493.g002 of species that were reported yet not present in the simulated sample); (ii) false negative count (total number of species present in the simulation yet were not identified); and (iii) true positive count (total number of species correctly identified). We first generated a dataset of samples comprising 10 species to quantify the ability of each program to accurately detect species in a simple metagenomic community (hereafter "Simple Microbiome" dataset). We generated two additional sets of profiles for this dataset: one that included modern human DNA ("contaminated") and one that did not ("noncontaminated"). This dataset was created using gargammel [36] by randomly sampling 10 species from a modified RefSeq prokaryote representative genome database (see Methods) and building simulated libraries of one million single-end reads with an equal number of reads per species (100K), both with and without contamination from the human reference genome (25% of the total simulated reads).
All tools, apart from KrakenUniq, correctly identified the 10 species in each simulated sample set (both with and without human contamination; Fig 3). KrakenUniq in both sample sets identified 9.5 species. Kraken2/Bracken (average of 2.5 species), KrakenUniq (average of 0.5 species) and MALT (average of 0.5 species) programs reported false positives in the non-contaminated sample set. False positive species counts, however, dramatically increased in the sample set with human contamination. HAYSTAC reported the lowest false positive count (5) relative to Sigma (6), Kraken2/Bracken (12.5), KrakenUniq (10.5) and MALT (12.5) (Fig 3). This suggests that HAYSTAC is less prone to false positive identifications caused by human contamination in bacterial reference genomes found in the RefSeq representative prokaryotic database. All false positive identification (across all programs) involved species whose RefSeq genomes are known to contain human DNA contamination [37] (S7 Table). This result strongly supports the pre-filtering and removal of all human reads prior to microbial metagenomic profiling.
We then tested HAYSTAC's performance when profiling samples with damage patterns characteristic of aDNA (i.e., sequence fragmentation and cytosine deamination). To do so, we generated eight simulated samples consisting of 10 million reads each, both with and without aDNA damage (hereafter "General Microbiome" dataset). Each sample contained simulated reads from either 100 species (four simulated samples; two random sets of species all with equal abundance) or 500 species (four samples; two random sets of species all with equal abundance) that were randomly sampled from the RefSeq representative prokaryotic database (Table 1).
Overall, the incidence of false negatives was low. HAYSTAC reported 2.0 false negative species on average in the 100 species sample set, and 14.0 in the 500 species sample set. Sigma  Table).
We next performed more realistic simulations based on species compositions that mimic an oral microbial community from a previously analysed subgingival plaque sample ( [30]; hereafter "Oral Microbiome" dataset). These simulations included 12 simulated samples of 5 million paired-end reads each, with and without aDNA damage, and with 176-180 species in varying or constant species abundance ( Table 1). Out of the 196 species that were simulated for the Oral Microbiome dataset, only 115 were included in the RefSeq representative prokaryotic database. From these 115 species, only 4 species were simulated from the same genomes that were present in the modified representative prokaryotic RefSeq database we used (S6 Table).
As expected, given the incompleteness of the database, false negatives were elevated overall ( Table 2) Table). Overall HAYSTAC outperformed the other methods in the ancient Oral Microbiome sample set with the lowest number of false positive identifications, and the highest number of true positive identifications in both the ancient and modern Oral Microbiome datasets.

Analyses of simulated data sets using genus-level database
We next assessed how accurately HAYSTAC identified specific candidate species using smaller reference databases. To do so, we re-analysed 20 simulated samples included in the General and Oral Microbiome datasets (General Microbiome 100 species, General Microbiome 500 species, Oral Microbiome aDNA damage, and Oral Microbiome modern datasets) ( Table 1) using nine different genus-specific databases (see Methods). For each randomly selected genus, we constructed a database that included all genomes found in the RefSeq representative prokaryotic database. The nine genera that were selected for this analysis were: Bacteroides, Burkholderia, Campylobacter, Clostridium, Corynebacterium, Desulfitobacterium, Mycobacterium, Solimonas, and Streptococcus. Each database consisted of all RefSeq genomes from all species belonging to the genus instead of the full RefSeq representative prokaryotic database. This resulted in databases with between 3-91 species each, as opposed to 5,652 species when  Table). False positives,  Fig 5). We also used HAYSTAC to compare the mean posterior abundance computed using the full RefSeq database with abundance computed using the genus specific databases (Fig 6). We found that while genus specific abundances were in cases slightly higher than those computed with the full RefSeq database, there was a very strong correlation between the two (R 2 = 0.999; slope = 1; p-value = 0; GLM: y = x+3.1�10 −6 ).

Pathogen identification from metagenomic communities
We then tested the ability of HAYSTAC and HOPS (an extension of MALT) to distinguish specific human pathogen species from close relatives when occurring at low abundance in a metagenomic sample. To do so we selected 100 (human) pathobionts, each one belonging to a different genus, as well as 1 non-pathogenic microbial taxon from each of the selected genera (total of 200 taxa, 2 per genus, S11 Table). We subsequently simulated a total of 200 libraries (1 library per species), each containing 100 reads with damage patterns characteristic to aDNA. This data was then added to one of our previously simulated Oral Microbiome dataset samples. This resulted in 200 replicates of an Oral Microbiome sample (anc200e2repgn), with each replicate containing 1 human pathobiont at low abundance. Due to the lack of taxonomic specificity present in a random sample of only 100 reads per pathogen genome, the overall true for either a genus specific database or the entire RefSeq representative database of prokaryotic species. Using a genus specific database has a small positive bias in mean posterior abundance for taxa within that genus (paired t-test p-value < 2.2�10 −16 , mean of the differences = 3.9�10 −6 ), nevertheless the overall abundance levels are highly correlated (R 2 = 0.999). Computational runtime for the genus specific analyses are faster and use less memory, making genus specific analyses suitable for rapid initial screening (e.g. 1 million reads against a Corynebacterium specific database runs approximately 6.15 faster than against a database containing 500 species and uses approximately 3.9 times less memory).
HAYSTAC was run using a reference database consisting of the longest complete genomes for each of the species in the genus Yersinia (including plasmid sequences). Our method was able to confidently assign between 6,720 and 856,467 reads to Y. pestis across all seven samples (Fig 8). HAYSTAC on average uniquely assigned an order of magnitude fewer reads compared to the number of aligned reads reported in the original study [9]. In addition, we found evidence for additional Yersinia species in all samples, including Y. pseudotuberculosis (606- Of these, Y. ruckeri and Y. similis are known to cause infections in animals (specifically fish), but have also been reported to infect humans [38]. These identifications, however, are likely spurious, since Y. ruckeri and Y. similis are phylogenetically basal to other Yersinia species [39,40] and may be attracting reads belonging diverged sequences.
The posterior abundances for Yersinia pestis in the RISE samples ranged from 0.0093-1.02%, corresponding to between 6,720-856,467 assigned reads. Despite the large number of assigned reads, posterior abundances only exceeded 0.01% for six of the seven samples (RISE00, RISE139, RISE386, RISE397, RISE505 and RISE509). This highlights the need to consider both relative and absolute abundance when making positive identifications.

Case study 2: Historic dental calculus
Dental plaque is a biofilm that develops naturally on teeth, and during life it periodically mineralizes to form dental calculus [41], a robust substrate that can preserve biomolecules such as DNA from the biofilm and the host [34]. Although this substrate can act as a reservoir for respiratory pathobionts, analysis of ancient dental calculus metagenomic data is often challenging because the low abundance of these species makes it difficult to confidently identify them [34]. To test the performance of HAYSTAC on pathobiont aDNA sequences within host-associated ancient microbiomes, we analysed data from 48 dental calculus samples from a 19 th century hospital in Oxford [42,43].
We screened these libraries for potential upper respiratory pathobionts from the Corynebacterium, Streptococcus, Klebsiella and Bordetella genera (Fig 9), in an effort to test if such pathobionts could be preserved and identified in dental calculus. Members of the genera Corynebacterium, Streptococcus and Bordetella had also been included in the Oral Microbiome dataset. Within the Corynebacterium genus, we identified Corynebacterium matruchotii (abundance 0.01-1.96%; 37 samples) and C. durum (abundance 0.01-0.59%; 15 samples) both of which are oral commensal taxa. We did not, however, identify Corynebacterium diphtheriae, the species that causes diphtheria. We identified multiple species of Streptococcus, with the commensal Streptococcus sanguinis being identified in 33 samples at relatively high abundance (0.01-2.62%). The pathobiont Streptococcus pneumoniae was also positively identified at low abundance (0.01-0.07%) in 10 samples. We did not find any evidence for species belonging to the genera Klebsiella and Bordetella.
Members of the Haemophilus genus were rarely identified, which is expected given that many species from that genus are generally low-abundance in dental plaque compared to other oral surfaces such as the tongue [44], and are similarly low-abundance in dental calculus [43]. However, Haemophilus parainfluenzae was found in 18 samples (abundance between 0.01-0.46%), H. haemolyticus was identified in two samples (abundance 0.01-0.03%), and H. parahaemolyticus was identified only in sample CS18 (abundance 0.02%). An uncharacterized oral species, Haemophilus oral taxon 036, was identified in three samples (abundance 0.01-0.04%) and the commensal H. sputorum was also identified in three samples (abundance 0.01-0.03%).
As expected, our analysis indicates that while species naturally occurring in the oral cavity (e.g., C. matruchotii and durum), were detected in most samples, upper respiratory system pathobionts (e.g. S. pneumoniae, H. parainfluenzae) were more rare. Although pathogens were found generally at very low abundance, in only a few individuals, our results indicate that HAYSTAC was able to positively identify some potentially pathogenic species such as Streptococcus pneumoniae and Haemophilus parainfluenzae, suggesting that dental calculus is a promising substrate to study ancient respiratory pathobionts.

Discussion
Our results demonstrate that HAYSTAC provides a robust and computationally efficient framework to statistically assess the presence of specific species in a metagenomic sample. HAYSTAC identified fewer false positives than Kraken2/Bracken, KrakenUniq or MALT in all simulations, and fewer than Sigma in simulations of ancient data. When analysing modern data, HAYSTAC's performance can be further optimised by reducing the base mismatch probability parameter (σ) at run-time leading to a more stringent alignment strategy, and increasing the minimum relative abundance threshold for a species call. Especially reducing the base mismatch probability (σ) will reduce the number of false positive identifications, because HAYSTAC calculates the maximum number of mismatches for a valid alignment based on both the base mismatch probability (σ = 0.05) and the average read length of reads in a library (e.g., 150 bp for the Oral Microbiome modern dataset, which allows for more mismatches to be allowed). It uses less memory than Kraken2/Bracken, KrakenUniq and MALT. Furthermore, our results show that the use of reduced reference databases (e.g., a few selected genera as opposed to the entire RefSeq representative prokaryotic database), can dramatically reduce computational time (approximately 6.15 times faster when using a database of 84 species than a database of 500) while still providing highly accurate species identification. HAYSTAC's underlying computational framework allows for an easy installation of the package, and the efficient maintenance of its source code and software dependencies. Altogether, this makes HAYSTAC a flexible tool to rapidly and accurately profile metagenomic samples and screen them for pathogens, a feature that is attractive for a large range of applications, including archaeogenetics and biosurveillance.

Species concept, horizontal gene transfer and species identification
Although we demonstrated that both HAYSTAC and Sigma provide more robust species assignments than MALT, Kraken2/Bracken or KrakenUniq, their internal computation is based on the assumption that intraspecific polymorphism is lower than interspecific polymorphism. While this is true in most cases, the concept of a species among microorganisms is often not straightforward. This is the case, for example, in the Mycobacterium tuberculosis complex in which interspecific variability is very low [45], and occasionally lower than the default value of expected mismatch (i.e., expected level of intraspecific polymorphism) set in HAYSTAC (σ = 0.05; see Eq 4 below). HAYSTAC deals with this issue by introducing an "ambiguous source" category, to which reads that are equally likely to match two or more genomes are assigned. This ensures that they do not contribute to posterior abundance computations, which could possibly force false identifications, while also not dismissing them into "unknown source" category with reads that cannot be classified. In cases where the levels of intra-and interspecific diversity are known (e.g., within a genus) it is possible to adjust the probability of the expected mismatches between a read and a genome (σ; see Eq 4), which can increase power while reducing false positive rates when analysing a specific dataset. This feature means HAYSTAC's performance can be finetuned to specific questions and datasets.
Identifying species within a metagenomic sample can also be complicated by horizontal gene transfer (HGT). HGT can lead to spurious, highly supported, species assignments-especially if the sequences originate from an outgroup species that is not represented in the database. HAYSTAC mitigates this issue by inspecting the evenness of coverage across the genome for shallow sequencing data. Specifically, reads that belong to an outgroup species and that map to a genome in a reference database because of an HGT or due to contamination are expected to map only to parts of the reference genome (e.g. the part that was horizontally transferred). HAYSTAC assesses whether reads are clustered more than expected by chance across a reference genome using an evenness of coverage metric (see Methods). This provides the user with a quantification of the evenness of spread, and thus a proxy signature of potential HGT or misassignment of reads from a taxon that is not represented in the reference database.

Issues with databases
Another issue facing metagenomics is contaminated sequences incorporated into reference genomes. For example, multiple bacterial RefSeq representative assemblies contain human sequences [37]. This can lead to spurious, highly supported, species assignments when profiling metagenome samples. Including genome sequences from potential contaminant sources (e.g., the human reference genome) in the database should alleviate these issues [46]. Alternatively, it is also possible to filter reads before starting a taxonomic identification/assignment analysis (i.e., by excluding those that map to the contaminant's reference genome). Potential contaminants, however, are not always known, and there are many gaps in reference sequence databases. In fact, inadequate species representation in databases is likely to be an important source of false positives for all species identification methods [15].
Although reads belonging to a species that is not represented in a database are typically assigned to a higher taxonomic node ("ambiguous source" category in HAYSTAC), incorrect species assignment can occur if the missing taxon is closely related to another species in the database, or if the read derives from a conserved element found within a taxonomic group with sparse representation in the database. Although this is not considered a major problem for general profilers, as closely related species are likely to play a similar function in the ecosystem, such misassignments can greatly complicate the identification of specific pathogen species, especially if they have close environmental relatives.

Conclusions
Metagenomic data are now routinely produced and analysed, providing new insights into microbial communities that were previously unknown. Development of metagenomic analytical tools, however, has mostly focused on frameworks that can profile an entire microbial community. We showed that these methods, however, can produce high levels of false positives when focusing on specific species identification. Here we present a tool that can robustly assess whether a specific species is present in a metagenomic sample in one step, without the need to combine different pipelines to validate the results. HAYSTAC provides a user-friendly and automated method for rapidly constructing databases, and produces robust identifications regardless of the database size, allowing for both rapid hypothesis-driven analyses and exhaustive profiling. Because of the mathematical framework it employs, HAYSTAC reliably produces the lowest number of false positive identifications, making it a valuable tool for both ancient and modern DNA microbial research.

Methods
Single source identification. We collect a series of observations (reads:r 1 �r n ) and we denote the full set of observations as R. We map all the reads to a database of reference genomes, and we define the event of sampling from the i-th reference genome in our database as G j . Assuming that all reads come from the same source (i.e., a single species represented in the database), the events of sampling from a given genome are exhaustive and mutually exclusive, i.e.
Given that each read is a different realization of sampling process (i.e., an independent observation), we can express the likelihood as: Therefore, the posterior probability of sampling from a given genome G j is:

Likelihood
We next derive a likelihood function to express the probability of a read given a genome P(r i | G j ). To do so, we use a previously published expression of this likelihood [3] based on the number of mismatches between a read and a reference to which it is aligned: where z is the number of mismatches between read r i and genome G j , U is the maximum number of mismatches allowed in the alignment and l is the length of read r i (in bp). The method assumes a uniform probability of mismatch within a given alignment (σ) to account for population level variability and sequencing errors. We expand this framework to leverage the fact that transitions and transversions occur at significantly different rates, particularly in ancient samples due to the effect of deaminations.
For each alignment we can then express the likelihood of generating a read r i given that it has been sampled from the reference genome G j based on both number of transitions (t) and transversions (v).
Let σ t be the probability of observing a transition and σ v the probability of observing a transversion, we can express the likelihood for a single read as: where l i represent the length of the read r i , while v i|j and t i|j denote the number of transversions and transitions between the read r i and the genome G j respectively.

Posterior
By substituting the likelihood term (Eq 4) into Eq 3, we can obtain an explicit form for the posterior probability of sampling a set of reads (R) from a given reference genome: By rescaling both numerator and denominator by a factor of ð1 À s v À s t Þ P i l i we get: Let V j = ∑ i v i|j denote the total number of observed transversions between all reads and genome G j while T j = ∑ i t i|j represents the total number of transitions.
s v À s t denote instead the transversion to match ratio and the transition to match ratio, respectively. By introducing this more compact notation, the previous equation can be re-written as:

Numerical representation
Our method assumes a fixed probability of mismatches (σ v +σ t ) which can be specified by the user and has a default value of 5%. We estimate the transition/transversion ratio based on the total number of mismatches per categories observed in the entire dataset and we obtain the value for σ v and σ t by solving the following linear system: : If we assume a uniform prior in which each genome has a probability of 1/n we can simplify Eq 7 even further obtaining: When analysing a large number of reads, evaluating Eq 9 might become problematic due to numerical representation issues. To accommodate for this we use an equivalent expression of the posterior probability: Each term in the summation at the denominator of Eq 10 represents the likelihood of genome G k relative to genome G J . There might be cases in which even the relative likelihood is so close to zero that cannot be represented numerically. We then take a conservative approach and assign to d the smallest number that the machine can represent (for a 64bit system it should be 2.2250738585072014� 10 −308 which we roughly approximate with 10 −300 ).

Multiple sources identification
We then expand this method to identify multiple species from metagenomic data. The multiple source identification approach involves the following steps: • Assigning each read to a reference genome Operatively we apply Eq 9 to each read separately and obtain a posterior probability for a read to originate from a given genome: Thus, for each aligned read we obtain a vector of posterior probabilities (one value for each genome in our dataset). If a posterior probability value is above a given threshold (default: 0.75) we consider the read r i to be informative and assign it to that reference genome. If instead all posterior values are below the threshold, we consider the read to be uninformative and we assign it to the n+1 category where n is the number of reference genomes. This means that any read r i that has been successfully aligned, can only be assigned to a single category: either to one reference genome or to the "Ambiguous Source" (AS, the n+1 category). All reads which belong to species that do not possess a representative reference genome (or a close relative) in the database and therefore haven't been successfully aligned to any of the reference genomes, will then be assigned to the n+2 category called "Unknown Source" (US).
Repeating this procedure 8r i 2R allows us to populate a matrix M N × (n+2) in which each row represents a read and each column represents a reference genome in our database or either the AS or US categories. Hence, the assignment matrix represents a series of N observations from a multinomial distribution of n+2 variables (where N is the total number of reads). This distribution is parametrized by a probability mass function q = [q 1 ,q 2 ,. . .,q n ,q AS ,q US ].
The likelihood of N observations has thus the following form: where x j is the sum of the column j of M, i.e. the number of occurrences of genome G j .

Dirichlet prior
The Dirichlet distribution (Dir) is a conjugate prior for the multinomial distribution, and it has desirable mathematical properties which make it appropriate to model the composition of metagenomes.
Let π 0 indicate the prior distribution and π � the posterior. From Bayes theorem, We employ a uniform prior: Because the Dirichlet is a conjugate prior for the multinomial likelihood (Eq 12), the posterior density has the same form, i.e where, again, x j is the number of observations (reads) of genome G j in the assignment matrix M. We can then use this framework to compute the minimum abundance of a species (represented by its closest reference genome) in a metagenome.

Posterior mean and confidence interval
Let γ j denote the posterior abundance of genome G j . From the property of the Dirichlet we can obtain the posterior mean as follows: meaning that the denominator of each posterior mean corresponds to the total number of reads plus the number of categories (the number of reference genome + 2). The marginal distribution of each abundance is: and we can use this to obtain the 95% CI numerically.

Computational architecture
HAYSTAC is written in Python and built upon the snakemake workflow engine [47]. It operates as a wrapper around custom python scripts and several common bioinformatics tools, including: AdapterRemoval [48], bowtie2 [32], DeDup [49], mapDamage [50], samtools [51], SeqKit [52], seqtk [53] and sra-tools [54]. It uses the conda package manager, with bioconda [55], to manage its dependencies. Employing snakemake's feature to use conda for installing software dependencies required by a rule in a containerised fashion, allows HAYSTAC to have a minimum list of dependencies as a package. Furthermore, it facilitates maintaining software dependencies up to date, if necessary, without the need to make any major code changes.

A tool for constructing databases
HAYSTAC includes a user-friendly tool to download genomes from NCBI and build customisable reference databases. After deciding a priori which taxa to include in their database (e.g., anywhere between a single genus to many thousands of taxa), the user constructs a database from any combination of: (i) an NCBI search query; (ii) a user specified list of NCBI accession codes; (iii) the RefSeq representative database of prokaryotic species; and/or (iv) a list of user provided reference assemblies (in FASTA format).
To construct a valid NCBI search query, visit the NCBI Nucleotide database website (https://www.ncbi.nlm.nih.gov/nucleotide/) and use the search feature to obtain a correctly formatted query string from the "Search details" box. This search query can then be used directly with HAYSTAC to automatically download and build a reference database based on the accession codes present in the resultset returned by the query.
We caution that the choice of reference database can dramatically affect species identification for all methods that use reference databases. For example, issues can arise if the reference genomes in the database are not complete, as reads may map uniquely to one reference genome because other incomplete genomes in the database do not possess that specific sequence. To mitigate this issue, HAYSTAC uses a simple heuristic to select the longest genome available per species in the input result set, which limits issues arising from unequal size of reference genomes in the database.

Processing and aligning reads
Pre-processed reads can be directly passed onto the alignment step (Fig 1). This is useful when dealing with modern DNA. The user can also decide whether to remove sequencing adapters in the case of raw reads, and for ancient/degraded DNA collapse overlapping mate pairs using a wrapper for AdapterRemoval [48]. Reads are then aligned to a database using Bowtie2 [32]. We first align to a single Bowtie2 index that contains all the reference genomes, to obtain a BAM file that contains one single best alignment per read. For this alignment step we use the local alignment mode in Bowtie2. This allows us to discard reads that did not map to a single reference genome. This BAM is used by HAYSTAC to compute the expected T s /T v ratio from the data, which is used as prior in Eq 4. Reads that pass this first filtering step are then realigned to each reference genome separately using Bowtie2 (one index per reference) in endto-end mode. We set a maximum number of mismatches to calculate the minimum alignment score during the Bowtie alignment. The maximum mismatch allowed during this second alignment step is calculated by the average read length of the input library. Allowing more mismatches results in slower run time as the number of reads included in the analytical step increases.

Computing abundance
Alignment files are then processed to calculate a likelihood score for each read/reference alignment and build a likelihood matrix that possesses n×r entries, where n is the number of reference genomes and r is the number of reads that pass the first filtering stage. This is done by computing T s , and T v for each alignment, and computing the likelihood using Eq 4. Here the user can change the probability of mismatch within a given alignment (σ; default 0.05) (see Eq 4). For each read we then compute a posterior probability using Eq 7. Reads are then filtered based on their posterior probability (default 0.75) and included in the Dirichlet assignment step. Reads that have not been aligned to any taxa in the database get assigned to the "Unknown Source" category, a category that absorbs reads whose reference genomes are not included in our database. Reads that have been aligned to multiple taxa in our database but obtain a posterior probability below the threshold that is set for the Dirichlet assignment are also assigned to another special category called the "Ambiguous Source", a category that includes reads that were not informative enough to be uniquely aligned to a specific taxon. Both of these categories are included in the mean posterior abundance calculations.

Assessing evenness of coverage
We use an evenness of coverage ratio to assess if the reads assigned to a specific taxon represent a random sample of its genome or whether they are clustering around specific regions. The evenness of coverage ratio is defined as the genome coverage of a taxon over the fraction of its genome that is covered by aligned reads. In the case of a true positive identification the reads should not be clustering around specific genome regions (evenness ration< 10 as estimated from empirical datasets), something that would indicate an HGT event. This test is particularly useful for shallow sequencing data that do not allow for more than 1× genome coverage of a given reference genome.

Representative species RefSeq database
For our analyses we build a database out of all the representative species of the prokaryotic RefSeq database. A list of all the species that were included can be found here: https://ftp.ncbi. nlm.nih.gov/genomes/GENOME_REPORTS/prok_representative_genomes.txt. In cases where the RefSeq representative database contains more than one strain for a given species, the first listed accession was picked. We also added complete genomes from the genera Klebsiella, Streptococcus, Corynebacterium, Bordetella, Haemophilus and Yersinia, for these species that were not present in the prokaryotic representative RefSeq. When possible (i.e., when genome assembly with plasmid data were available) we included plasmid sequences in the database.

Simulations
We generated 3 different simulated datasets. The first dataset was generated to determine the accuracy of different methods under simple conditions. We did not apply any chemical damage pattern and the read length was kept constant at 60 bp. This was done by randomly sampling 10 species from the prokaryotic representative RefSeq database (see above) twice, to get two different sets of 10 species. For each set we then simulated 2 different sequencing library samples of 1 million single-end reads using gargammel [36]. One set had no human genome included and all taxa were present at 10% abundance. In the second set 25% of reads were from the human genome, and all remaining taxa were present at 7.5% abundance. This dataset included 4 simulated sample sequencing files.
The second dataset was produced to compare performance of the different tools when analysing ancient and modern DNA, and did not include the human genome. To do so, we first generated two sets of 100 species randomly sampled from the prokaryotic representative RefSeq database, and two sets of 500 randomly sampled species. The species were different between each of the 2 random samplings for both the 100 species and 500 species sets. For each species profile in both the 100 species and 500 species sets, we simulated two different libraries (8 total libraries) using gargammel with 10 million single-end reads of length 125bp from an Illumina HiSeq 2500 run. The species abundances were kept consistent within each library, where libraries of 100 species have all taxa at 1% abundance and libraries of 500 species have all taxa at 0.2% abundance. The first library incorporated aDNA damage patterns based on the profile of reads that mapped to the Tannerella forsythia genome from a real ancient dental calculus sample (CS21 from [43]) and the second library did not include aDNA damage patterns.
To assess our method with more realistic data sets, we analysed simulated data previously published by Velsko et al. [30] designed to resemble a dental plaque community, which included species that were not in the prokaryotic representative RefSeq database. Twelve datasets of 200 species each were generated: with and without ancient DNA damage, each with even species abundance (0.5%) or with abundance based on values observed in an ancient oral microbiome bacterial community (see S8 Table). The type of simulation included members of the Corynebacterium genus without Corynebacterium diphtheriae; the second type included all members of the Corynebacterium genus (including Corynebacterium diphtheriae); the third type did not include any member of the Corynebacterium genus. Each of these twelve samples contained 5 million paired end reads of varying length, that were collapsed with AdapterRemoval.
Lastly, we simulated reads from human pathogens and closely related species to test HAYS-TAC's ability to identify low abundance pathogens in metagenomic data. To do so, we simulated 200 single-end Illumina sequencing libraries from 100 pathobionts and 100 nonpathogenic closely related species respectively, with aDNA damage patterns (estimated from simulated sample anc200e2repgn), which we then added to one of the simulated ancient Oral Microbiome dataset samples (anc200e2repgn). These libraries were then analysed using both HAYSTAC and HOPS (see below for more details).
For the pathogen detection analysis, HAYSTAC was run with a mismatch probability of 0.2 and a read assignment probability threshold of 0.5. For the HOPS analysis we initially ran cMALT (a version of MALT adapted to aDNA libraries) with the following flags activated −asm−tails5, while setting all the other parameters at default settings. The cMALT output was subsequently analysed with HOPS [28] with default values but providing a list of candidate species that included 100 pathogen species, and using the outputs characterised as ancient by HOPS. Before counting the true and false positive reads assigned to pathogen taxa by each method, we removed the all the microbial taxa each method would identify in the initial library of the ancient Oral Microbiome sample anc200e2repgn, so that the background species composition of the library would not affect the identification of the spiked pathobionts.

Case study 1: Ancient Yersinia pestis
We tested HAYSTAC using a published dataset of ancient human bone and tooth samples in which Yersinia pestis was identified [9], to compare performance against standard techniques used to identify specific pathogens in aDNA samples. We downloaded reads from all seven samples in which the authors detected Y. pestis (RISE00, RISE139, RISE386, RISE505, RISE509, RISE511) from the European Nucleotide Archive (NCBI BioProject accession: PRJEB10885). We then used HAYSTAC to build a database that contains the longest complete genome for each species of the genus Yersinia and aligned reads using default parameters. HAYSTAC was then used to compute posterior abundance of each species.

Case study 2: Historic dental calculus
We also tested HAYSTAC using an historic dental calculus dataset [43] (PRJEB30331). Raw reads were processed as described in Velsko et al. [43]. We attempted to identify a set of respiratory pathogens in these calculus samples, including Corynebacterium diphtheriae and Bordetella pertussis, the causative agents of diphtheria and whooping cough, respectively (Fig 8). We also tested for the presence of species belonging to the Haemophilus (ideally Heamophilus influenzae or parainfluenzae), Klebsiella (ideally Klebsiella pneumoniae) and Streptococcus (Streptococcus pneumoniae) genera. These species were selected to test if pathobionts of the upper respiratory system could be found in dental calculus.

Performance test
For the performance tests we created a reproducible conda environment, inside which all the different pipelines were installed and run. For each method we used the GNU time command to measure elapsed time (wall clock) and peak memory usage (maximum resident set size). Scripts to run the performance benchmarks are available from https://github.com/antonisdim/ haystac_paper We used a total of 500 accessions, one for each species, which were a subset of the accessions used for the analysis of all real and simulated datasets. We built databases with an increasing number of species (10, 100, 500) and the same database was given as input to all the methods. We only measured the database building time and maximum memory usage for each method, as HAYSTAC was used to efficiently fetch all 500 reference genomes from NCBI.
For sample inputs we used samples from the Human Microbiome Project, with accession SRS078671. We merged the forward-mate and the singleton reads in one file that we then treated as single end reads. That file was subsequently subsampled to generate fastq files of 10 K, 100 K, and 1 M SE reads respectively, that were used as inputs for all the tests.
For the sample analysis we performed two tests, one that measured max memory usage and runtime for samples of increasing size against the database of 500 species, and a second where the input file size was kept constant at 1 million reads, while the database size was being varied, again measuring the same variables for memory and elapsed execution time.