Karp: Accurate and fast taxonomic classification using pseudoalignment

Pooled DNA from multiple unknown organisms arises in a variety of contexts, for example microbial samples from ecological or human health research. Determining the composition of pooled samples can be difficult, especially at the scale of modern sequencing data and reference databases. Here we propose the novel pooled DNA classification method Karp. Karp combines the speed and low-memory requirements of k-mer based pseudoalignment with a likelihood framework that uses base quality information to better resolve multiply mapped reads. In this text we apply Karp to the problem of classifying 16S rRNA reads, commonly used in microbiome research. Using simulations, we show Karp is accurate across a variety of read lengths and when samples contain reads originating from organisms absent from the reference. We also assess performance in real 16S data, and show that relative to other widely used classification methods Karp can reveal stronger statistical association signals and should empower future discoveries.


Introduction
The study of microbial community composition has been revolutionized by 2 modern genetic sequencing. Experimenters can forgo the laborious work of culturing cells and detect a broader range of taxa than was previously possible. 4 This improved ability to describe the microbes present in a pooled sample has led to important findings in human health (Davenport et al., 2014;Wu et al., 6 2011;Turnbaugh et al., 2009) and ecology (Metcalf et al., 2016;Godon et al., 2016). These findings rely on quantification of the taxa present in experimental 8 samples, and towards that goal many methods have been developed. The everincreasing scale of both sequencing data and relevant reference databases require 10 that such methods be efficient in addition to accurate. Here we present a novel method, Karp, which combines the speed of k-mer-based pseudoaligning with a 12 likelihood framework that incorporates base quality information. In this work we use Karp to classify the taxonomy of pooled 16S microbiome data quickly 14 and with an accuracy superior to widely adopted alternative methods.
Microbiome samples are commonly generated using either shotgun sequenc- 16 ing or the sequencing of marker genes, most often the gene encoding 16S ribosomal RNA. Classifying the output of shotgun sequencing can be difficult, as 18 limited reference databases exist for entire bacterial genomes, so whole genome sequencing generally either requires computationally intensive de novo assembly 20 methods (Cleary et al., 2015;Howe et al., 2014;Boisvert et al., 2012) or limits the range of organisms available for study (Scholz et al., 2016). Alternatively, 22 several large reference databases exist for microbial 16S sequences (Cole et al., 2014;Quast et al., 2013;DeSantis et al., 2006). The 16S gene contains alternat-24 ing regions of highly conserved and highly variable sequences, making it easy to target and well powered for differentiating taxa. Many experiments target one 26 or several of the 16S hypervariable regions and sequence to a high depth (Howe et al., 2016;Ahn et al., 2011;Chakravorty et al., 2007).
tions of 16S sequencing experiments the improvement in accuracy that Karp provides relative to Kallisto, as well as modern similarity-based methods (us-100 ing Quantitative Insights Into Microbial Ecology (QIIME)), and the Wang et al. (2007) naive Bayesian classifier (using Mothur). We also use simulations to 102 demonstrate how Karp leads to better estimates of important summary statistics and remains robust when sequences from organisms absent from our reference 104 database are present at high frequencies in samples. Finally, we assess performance in a real 16S dataset with 368 samples drawn from two individuals over 106 two days. In this data Karp finds more taxa with stronger association signals that differ between the two individuals. Karp also maintains comparable clas-108 sification errors when a random forest is employed to classify which location or individual each sample originated from.  Figure 1 gives an outline of Karp's classification process.
The first step in using Karp is the construction of a k-mer index of the M ref-116 erence sequences. This index catalogs the subset of the M reference haplotypes that contain each unique k-mer of a given length. Next, the query reads are 118 pseudoaligned using the k-mer index. Query reads that pseudoalign to multiple references (multiply-mapped reads) are locally aligned to each potential refer-120 ence, and each reference's best alignment is kept. Queries that pseudoalign to a single reference are assigned without alignment. Next, for multiply-mapped 122 reads the likelihood that they originated from each potential reference is calculated using the best alignment and the base-quality scores that correspond to 124 the read. After the likelihoods for every query read have been calculated, an EM-algorithm is used to estimate the relative frequencies of each reference hap-126 lotype contributing to the pool. More details about the method are provided in 5 . CC-BY-NC 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder.  (2) The query reads are locally aligned to the possible references.
(3) Using the best alignment, the likelihood that a read originated from a specific reference is calculated. (4) Using the read likelihoods an EM-algorithm is employed to estimate the relative abundances of the reference haplotypes in the pool of query reads.
the following sections.

Pseudoaligning and alignment
Aligning millions of reads against hundreds of thousands of references is 130 impractical in both memory and time. However, calculating the probability that a read originated from a given reference sequence using base-quality information 132 requires an alignment. To overcome this challenge, Karp uses pseudoalignment as a filter before performing local alignment. Pseudoalignment is a fast and 134 memory efficient way to narrow the space of possible references from which a 6 query read may have originated. Our pseudoaligning algorithm is directly based on that of Kallisto (Bray et al., 2016). Briefly, first an indexed de Bruijn Graph of the reference database is constructed, with each observed k-mer mapped to 138 an equivalence class of reference sequences that it is contained in. Next, each query read is decomposed into its constituent k-mers, which are searched against 140 the index. An intelligent coding of the index allows for a minimal number of ing errors are independent, we can then formulate the probability of read r j arising from reference h k , which we label l j,k , as where, if we assume every base is equally likely when an error occurs A more complete definition of the probability would sum over all possible 174 alignments of r j to h k . However, in non-repetitive marker gene sequence the best local alignment typically contributes such a large proportion of the probability 176 weight, that excluding alternate local alignments has a negligible impact on results but substantially improves computation. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/097949 doi: bioRxiv preprint

Estimating reference haplotype proportions
As previously noted, the aim of our method is to estimate a vector F = 180 (f 1 , ..., f M ), containing the frequencies of the M possible reference haplotypes in a pooled DNA sample. If we were to observe which reference haplotype gave 182 rise to each read in our sample, the maximum likelihood estimate of F,F, would follow directly from the multinomial likelihood. In reality, we observe 184 the reads r, but the reference haplotypes that they originate from, η, are unobserved. To estimate F we therefore employ an EM algorithm, with a form 186 common to mixture model problems. Details of our EM algorithm are provided in supplementary section 7.1.

188
Karp modifies the standard mixture EM algorithm in two ways to speed up performance. The first is an assumption that if a read r j uniquely pseudoaligns 190 to a reference h k then P (η j,k = 1|r, F) = 1. For haplotype h k , label the number of reads that uniquely map as t k and define N * = N − M k=1 t k . Then we can 192 write the likelihood of the data as and our update step as This assumption also provides a logical initial estimate of The second speed-up that Karp uses is an implementation of SQUAREM different thresholds on the simulated and real data presented in this study.

Read likelihood filter 214
Our EM method relies on the fact that all the reads in our sample originated from haplotypes present in our reference database. In real data this assump-216 tion can be problematic; the classification of microbial taxonomy is an ongoing project and many taxons have yet to be identified or referenced. To preserve 218 the accuracy of our frequency estimates in the presence of reads from haplotypes absent from our references we implemented a filter on the maximum read 220 likelihood value (Kessner et al., 2013).
Specifically, using the base-quality scores of the query reads we calculate The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/097949 doi: bioRxiv preprint quality score distributions were encountered, cutoff values between -3.0 and -1.5 yielded similar results, a finding in line with Kessner et al. (2013), and which 232 supports a default value of -2.0. In the real 16S data from Lax et al. (2015) we explored thresholds between −0.5 and −7.0, and generally those > −1.5 yielded 234 the lowest classification error rates (Supplementary Figure S4 and Table 3). For more details about the filter see supplement S6.

Karp collapse mode
The default approach in Karp estimates the relative frequencies of the in-238 dividual haplotypes present in the reference database. In many microbiome databases there is not a one-to-one relationship between reference haplotypes 240 and taxonomic labels; multiple haplotypes share a single label. When little information exists to distinguish closely related haplotypes apart, estimating 242 the relative frequencies at the taxon level rather than haplotypes can improve accuracy. To accommodate this, Karp includes a collapse option, which adds 244 a step to the estimation procedure. When the collapse option is used, after pseudoalignment and local alignment Karp calculates the average likelihood for 246 each taxonomic label, and uses these likelihoods in the EM algorithm to estimate taxonomic frequencies. This can be interpreted in a Bayesian context as 248 the likelihood a read is from a taxon under a uniform prior of its true reference sequence within that taxa. Karp output in collapse mode provides counts 250 at each taxonomic level from species to phylum. Because it is estimating the frequencies of fewer categories, collapse mode is often faster than Karp's default. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/097949 doi: bioRxiv preprint reads were simulated. For each read a reference haplotype was drawn at ran-260 dom according to its frequency in the original frequency vector. Then, along the chosen reference sequence a read start position was selected uniformly and 262 a number of bases corresponding to the desired read length were copied from the reference. In the case of paired-end reads, the distance between pairs was  The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/097949 doi: bioRxiv preprint with 75bp single-end reads, 130 samples with 151bp paired-end reads, and 170 290 samples with 301bp paired-end reads, each with a unique mix of 1,000 reference haplotypes from GreenGenes. With Kallisto and Karp the raw forward 292 and reverse reads were directly classified. For the QIIME algorithms the script join_paired_ends.py was run and the resulting contigs were classified. 294 We simulated an additional 100 samples with 75bp single-end reads to compare how each method's frequency estimates impacted the estimation of common 296 sample summary statistics. Many statistics, such as β Diversity, summarize the sharing of taxa between samples, so instead of 1,000 unique taxa in each sample, 298 we used a shared pool of 1,000 taxa for all 100 samples, and further increased the similarity between samples by introducing correlation between the reference 300 frequencies. The reference haplotype frequencies for each sample were a linear combination of a random Dirichlet variable generated in a manner identical to 302 the simulations above and the reference frequencies of the preceding sample. In this way the samples again covered the full range of Shannon Diversity values, 304 however the frequencies of shared taxa was potentially much higher, providing a broader range of summary statistic values in the simulations.

306
Next we compared how the methods performed when the simulated samples contained reads generated from taxa that were absent from the reference 308 database being used for classification. We selected one phylum (Acidobacteria), one order (Pseudomonadales), and one genus (Clostridiisalibacter) at random 310 from the taxa in GreenGenes with more than 30 reference sequences. Then, for each missing taxa, we simulated 10 samples where 50% of the reads originated 312 from 3 different members and at least 5% of the reads came from closely related taxa (kingdom Bactera for Acidobacteria, class Gammaproteobacteria for 314 Pseudomonadales, and family Clostridiaceae for Clostridiisalibacter). Next, we create 3 reduced GreeGenes reference databases, each with one of the missing 316 taxa (including all lower ranking members) expunged. Finally, we classified the simulated samples using both the appropriate reduced reference database and 318 the full GreenGenes database.
Finally, we examined how sensitive our results were to the assumption that 320 13 . CC-BY-NC 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It We include the scaling factor of 1/1000 in order to transform the value into 336 an estimate of the average per-reference error rate, as our pooled simulation samples include 1,000 individual reference sequences. We use the same scaling 338 factor when looking at errors in the estimation of higher order taxonomy for consistency, although the true number of references at any given taxonomic 340 level will be < 1, 000.

Real data 342
To test the performance of Karp with real data we reanalyzed samples originally published by Lax et al. (2015). In brief, these samples were collected 344 from the floor, shoes, and phones of two study participants every hour for two 12-hour time periods over the course of two successive days. From these sam-346 ples the V4 region of the 16S rRNA gene was amplified and sequenced using 14 . CC-BY-NC 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/097949 doi: bioRxiv preprint the Illumina HiSeq2000 (llumina, San Diego, USA). Because this dataset contains many samples of known origin it is useful for assessing performance by measuring classification accuracy and the power to detect differences.

350
The data is publicly available at https://figshare.com/articles/\%20Forensic_ analysis_of_the_microbiome_of_phones_and_\%20shoes/1311743, and af-  The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/097949 doi: bioRxiv preprint Finally, we tested for differences in the mean abundance of taxa between person 1 and person 2, first in the phone samples, and then between the shoe 380 samples. After subsampling 25,000 reads for each sample, we tested each taxon with > 250 total reads across all samples using Welch's t-test in R, and recorded 382 the corresponding p-value and t-statistic. We used the p.adjust function in R to calculate false discovery rates (FDR) from the t-statistic p-values once all taxa 384 had been tested. The simreads program we used to simulate sequence data with an empiri-398 cal distribution of base-quality scores is available at https://bitbucket.org/ dkessner/harp.

Comparison of competing methods with simulations 402
To test the performance of Karp against alternatives we simulated 110 independent samples, each with 1 × 10 6 75bp single-end reads drawn from 1,000 404 reference haplotypes selected at random from the GreenGenes database. Each simulation used a unique set of 1,000 references, and the frequencies of each 406 16 . CC-BY-NC 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/097949 doi: bioRxiv preprint reference was varied to create a range of Shannon Diversity in the 110 samples (Supplemental Figure S5). We classified sequences against the full GreenGenes 408 database using Karp, Kallisto, the Wang et al.. (2007) Naive Bayes classifier implemented in Mothur, and several algorithms from QIIME including UCLUST, 410 USEARCH, and SortMeRNA. We estimated errors as described in section 2.8.
At the level of individual reference haplotypes, which here we also refer 412 to as operating taxonomic units or OTUs, we calculated estimation error for references with > 1 read present or classified. On average Karp had the lowest 414 errors (34% smaller than Kallisto, 65−66% smaller than UCLUST, USEARCH, and SortMeRNA, Figure 2A). When we limited our comparison to references 416 with a frequency > 0.1%, Karp remained the most accurate (errors 31% smaller than Kallisto, and 68−70% smaller than the QIIME algorithms, Figure 2B). The 418 accuracy of all methods improved with increasing diversity, and Karp's average error was 48% smaller when diversity was > 6.2 than when it was < 0.7.

420
Many reference haplotypes share the same taxonomic label, and it is possible researchers would be interested in hypothesis at the level of genus or species 422 rather than individual references. We aggregated counts for references with identical labels and again compared with the truth in our simulated samples.  The difference in classification error observed here is relevant for downstream analysis. When we calculated summary statistics using OTUs with frequencies 432 > 0.1% in 100 independent simulated samples, Karp's estimates were on average closer to the truth than either Kallisto or UCLUST (  The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. Karp had the lowest error when Shannon Diversity was high (> 3.4) ( Figure   3A). When we aggregated counts for OTUs with identical taxonomic labels and 448 compared the abundance estimates of species with frequency > 0.1%, Karp had the lowest average errors (50%, 84%, and 94% less than Kallisto, UCLUST, and 450 the Wang et al. Naive Bayes respectively, Figure 3B). Of the three read lengths examined, Karp's advantage was greatest for the 301bp paired-end reads. For 452 these reads Kallisto's strict pseudoalignment threshold struggled to make as-

19
. CC-BY-NC 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/097949 doi: bioRxiv preprint For computational reasons we were unable to calculate Naive Bayes estimates for the 301bp samples.

20
. CC-BY-NC 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/097949 doi: bioRxiv preprint signments, and on average classified only 3.8% of reads (versus 53% with Karp).
Note that Kallisto's performance could be improved by subsampling shorter regions from the longer reads, although this would be removing information that 456 Karp is currently using to assign reads. Also, the Naive Bayes classification could not be computed for the 301bp paired-end reads with the computational 458 resources available for this project and was therefore not compared. In references with frequency > 0.1% Karp was on average the most accurate method 460 across the entire range of Shannon Diversity (errors 90% smaller than Kallisto, 62% smaller than UCLUST, Figure 3C).

462
In microbiome classification problems it is not uncommon to have taxa present in sequenced samples that are absent from reference databases. We isalibacter Karp's classification using the reduced reference database was more accurate than UCLUST using the full reference database.

478
The model that underpins Karp relies on knowing the probability of a sequencing error at a given position in a read. Our work assumes that the base-480 quality scores are accurate estimates for the probability of sequencing error. In real data it has been recognized that base-quality scores are not always accurate, The copyright holder for this preprint (which was not peer-reviewed) is the author/funder.  Figure 4: Accuracy when the reference database used for classification is missing taxa found in the sample. For each of one phylum (Acidobacteria), one order (Pseudomonadales), and one genus (Clostridiisalibacter), 10 samples were simulated where 50% of the reads originated from the noted taxa. Each sample was classified with the full GreenGenes database and also a reduced version of the database lacking all members of the taxa which had been used to simulate the sample. The accuracy of estimates by Karp, Kallisto, and UCLUST for the 50% of the samples that did not originate from the absent taxa were compared with their true frequencies. Black bars give 95% confidence intervals.

22
. CC-BY-NC 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/097949 doi: bioRxiv preprint ibration is complicated (possibly requiring spike-ins of known sequence during the experiment or alignment to conserved reference sequence), so we explored 486 how differences between the expected sequencing error rate as represented by the base-quality scores and the actual sequencing error rate effect Karp's accuracy.

488
Under a model where the actual probability of a sequencing error was 10 Q −5 for quality score Q, rather than the 10 The increased accuracy of Karp comes at some computational cost, especially 496 relative to Kallisto, however it is still quite feasible for modern data. Table 2 compares the performance of the methods while classifying samples with either 498 10 6 75bp single-end, 151bp paired-end, or 301bp paired-end reads using 12 cores, and in all cases even the full mode of Karp requires < 3 hours. Karp was run 500 with default settings, in both full and collapse mode. For Karp and Kallisto the 75bp reads require longer to classify than the 151bp reads due to the larger 502 number of multiply-mapped reads with the shorter length. 504 We classified 368 16S rRNA samples collected from the shoes, phones, and floors of two study participants using Karp, Kallisto, and UCLUST. For each 506 classification method we subsampled without replacement 25,000 reads from each sample, either from individual references or else after aggregating counts 508 within taxonomic labels, and then performed several analyses. For robustness, we performed the subsampling 10 times for each method and analysis. First, 510 we used the random forest classification method with 1,000 trees to classify subsets of the data. We classified the shoe samples as coming from person 1 512 or person 2. We did the same with the phone samples, and then within each individual we classified the phone samples as either from the front or back 514 23 . CC-BY-NC 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder.  of their phone, and their shoe samples as coming from the front left, front right, back left, or back right. In these analyses we measured the classification 516 error using the known identity of each sample (Table 3). When we aggregated counts by taxonomic label there were not enough reads at the species level to 518 subsample, so we performed the classification with genus-level labels. Error rates were lower when classification was done using individual references rather 520 than counts aggregated by genera. The error rates for classifying the shoe surfaces from person 2's samples were greater than the baseline error rate (if 522 every sample had been assigned the most common label), suggesting there was not power to perform this classification.

524
Next, we performed a PCA decomposition on matrices with 25,000 reads subsampled from each of the floor and shoe samples. From this we calculated the 526 average correlation between PCA1 for each floor sample and the shoe samples collected at the same location and time (   Finally, we tested for differences in the mean abundance of taxa between 530 person 1 and person 2 using Welchs' t-tests. We tested OTUs with at least 25 . CC-BY-NC 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/097949 doi: bioRxiv preprint 250 observed reads across all samples, and tested the phone and shoe samples 532 separately. When we looked at taxa that varied significantly between individuals with a false discovery rate (FDR) < 0.05 in all 10 matrices of subsampled reads,

540
In both simulations and real 16S data we have shown that Karp is an accurate and computationally feasible method for estimating the relative frequencies of The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/097949 doi: bioRxiv preprint tically assign reads to references, like Karp and Kallisto, and those that make a hard assignment, like UCLUST or USEARCH, can arise. With Karp, when 562 references are nearly identical they will receive nearly equal probability weights from each read that maps to them, and the result will be many closely related 564 references at low frequencies. With UCLUST or other similarity score methods, the references are sorted and the first of the closely related references to appear 566 in the sorting order will be assigned all or nearly all the references, regardless of if it is the actual contributing organism. The truth in this case, is that the 568 sequencing data does not contain enough information to accurately distinguish between the references, and both methods end up at sub-optimal, albeit different 570 solutions. Under such conditions researches need to have a realistic expectation of what they can resolve in their data, and it is likely that inferences of higher-572 level taxonomic abundances rather than individual references are more likely to be robust.

574
Current experimental protocols and downstream clustering algorithms make using a single hypervariable region in the 16S gene a standard approach. A 576 single hypervariable region is short enough that it is rare for reads from a single organism to form multiple OTUs during clustering. However, it is important 578 to understand that the sequencing of a smaller reference costs researchers information that could make it possible to improve quantification accuracy and 580 distinguish between closely related references. Our simulations suggest experimenters could benefit substantially from sequencing more of the 16S gene than 582 is often presently used.
In addition to k-mer length, Karp users can adjust the thresholds for mini- The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/097949 doi: bioRxiv preprint experimental run time as well.

592
In both ecology and human health a greater understanding of the microbiome promises medical and scientific breakthroughs. Modern sequencing technology 594 gives us unprecedented access to these microbial communities, but only if we can correctly interpret the pooled DNA that sequencing generates can we hope to 596 make significant progress. Towards that end, Karp provides a novel combination of speed and accuracy that makes it uniquely suited for scientists seeking to 598 make the most out of their samples. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/097949 doi: bioRxiv preprint

Acknowledgments
In actuality, we observe the reads but the reference haplotypes that they 782 originate from are unobserved. To estimate F we therefore employ an EM algorithm. Briefly, the E-step of our procedure can be written The M-step directly follows from the form of our likelihood, and the algorithm 786 updates the estimates of F until convergence according tô (10)

Likelihood filter 788
When we classify pooled microbiome data it is likely that some reads originate from taxa that are absent from our reference database. Filtering these reads The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/097949 doi: bioRxiv preprint Given a set of query reads with their corresponding base-quality scores, we 794 can calculate the mean and variance for the distribution of likelihood values that would result if every query read were aligned to the actual reference that gave 796 rise to it, such that every mismatch was the result of sequencing error. This calculation requires only the query read base-quality scores, not the actual reads 798 or a reference database, and is carried out before Karp begins pseudoalignment.
Recalling the notation of section 2.3, a read of length L, has bases r [0] , r [1] , ..., r [L] 800 and corresponding base-quality scores q [0] , q [1] , ..., q [L] . If each read r originated from a reference h, our goal is to calculate E log P (r|q, h) and V ar log P (r|q, h) .

802
First, for each position i ∈ 1, .., L define the empirical distribution of basequality scores, Q [i] , in a sample of N reads by where I is an indicator function and q j, [i] is the base-quality score at position i on read j. This distribution is independent of h.

806
Assuming that each position along a read is independent we can write: For each position i the probability of sequencing error is a known function of 808 the base-quality score, (q [i] Note that this expression does not depend on h [i] or r [i] . By combining equa-812 tions 11, 12, and 13 we have an expression for E log P (r|q, h) . To calculate V ar log P (r|q, h) we again use the assumption that bases are independent 814 and write: The likelihood filter is applied after the query reads have been locally aligned 816 to the reference database and the corresponding likelihood values have been determined. Then, a z-score is computed for each query read using its largest 818 likelihood value and the mean and variance of the "null" likelihood distribution (Equations 13 and 14). If this z-score is too low it is evidence that the 820 true reference that the read originated from is absent from the database, and correspondingly the read is removed. formation exists, for example if a single hypervariable region has been sequenced, 834 lower thresholds can give more optimal solutions ( Figure S4 and Tables S1 and   S2). This is because a lower threshold avoids removing organisms truly in the

39
. CC-BY-NC 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder.  Figure S3: The impact of the EM frequency threshold is smaller when analyzing error in the estimates of more common references. Solid lines present the error calculated using all references classified, dashed lines give the error when only references with an actual or estimated frequency above > 0.1%, a cut-off used frequently in this study. In such cases the chosen frequency threshold is less important.

40
. CC-BY-NC 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder.   figure   S4. Higher counts reflect better performance, the maximum count in row 1 would be 12, and in row 2 would be 8. Particularly for the full version of Karp, setting a lower minimum frequency resulted in lower classification error rates.

Parameters with Lowest Classification Error
Likelihood z-score -0.5 -1 -1.5 -2 -3 -4 -5 -6 -7  Analysis onlyphones_person1 onlyphones_person2 onlyphones_personID onlyshoes_personID Figure S4: Evaluating impact of Karp tuning parameters (EM frequency threshold and maximum likelihood z-score) on shoe and phone data from Lax et al..(2015). We classified each sample using a range of tuning parameters and then performed random forest classification with 10,000 subsampled reads per sample and 1000 trees. We used both the Full and Collapse mode of Karp. Each colored line represents a different analysis, and the bars give 95% confidence intervals based on 10 replicates. When a parameter setting failed to assign taxonomy to enough reads to classify at a subsampled depth of 10,000 reads where other parameter settings successfully quantified to that depth, it was counted as an error for the purpose of random forest classification.

42
. CC-BY-NC 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/097949 doi: bioRxiv preprint q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q (each an operational taxonomic unit: OTU). The frequencies at which reads were generated from contributing references were varied to create datasets with a range of Shannon Diversity.
As diversity increases the frequency distribution begins to approach a uniform distribution.

43
. CC-BY-NC 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/097949 doi: bioRxiv preprint  Figure S6: Impact of assumption that base-quality scores accurately represent probability of sequencing error. For two different models of sequencing error we simulated 50 samples and classified them with Karp, Kallisto, and UCLUST/USEARCH. Each method is represented by a different colored line, and bars represent 95% confidence intervals (A) In our first model the true rate of sequencing error varied with the base-quality score, but was smaller than Karp's model assumes. (B) In our second model, errors were distributed uniformly at 1% of bases in each read, independent of whatever base-quality score was assigned.

44
. CC-BY-NC 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/097949 doi: bioRxiv preprint

45
. CC-BY-NC 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/097949 doi: bioRxiv preprint were compared to their true counts. Likewise, for 301bp paired-end reads counts were aggregated for OTUs classified in the same (E) Genus, (F) Family, (G) Order, or (H) Class and taxa with a frequency > 0.1% were compared to their true counts.

46
. CC-BY-NC 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/097949 doi: bioRxiv preprint