dbOTU3: A new implementation of distribution-based OTU calling

Distribution-based operational taxonomic unit-calling (dbOTU) improves on other approaches by incorporating information about the input sequences’ distribution across samples. Previous implementations of dbOTU presented challenges for users. Here we introduce and evaluate a new implementation of dbOTU that is faster and more user-friendly. We show that this new implementation has theoretical and practical improvements over previous implementations of dbOTU, making the algorithm more accessible to microbial ecology and biomedical researchers.


Introduction
Preheim et al. [1] formulated the distribution-based OTU-calling (dbOTU) algorithm, an extremely accurate algorithm for grouping DNA sequences from microbial communities into operational taxonomic units (OTUs) for ecological or biomedical research. Unlike most OTUcalling approaches, which group sequences based only on the similarities of the sequences themselves, this algorithm also uses information about the distribution of sequences across samples. This allows dbOTU to distinguish ecologically-distinct but sequence-similar organisms or populations. a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 • how they evaluate the distribution (i.e., ecological similarity) criterion, and • the details of the software itself.
These differences are summarized in Table 1. The first implementation. The original implementation (github.com/spacocha/ Distribution-based-clustering) which we term "dbOTU1", was coded in Perl and R (with accessory shell scripts) and took matrices of genetic dissimilarities and a table of sequence counts as input. The table of sequence counts, similar to an OTU table, countained information about how many times each unique sequence appeared in each sample.
The algorithm performed best when it was provided with two dissimilarity matrices, one computed from unaligned sequences and one computed from aligned sequences. To perform the alignment, the original publication recommended the align.seqs command in mothur [2]. To compute the dissimilarities matrices, the publication recommended the -makematrix option in FastTree [3]. The values in FastTree's dissimilarity matrix are Jukes-Cantor distances À 3 4 log 1 À 4d 3 À Á , where d is the proportion of mismatched sites, that is, the number of mismatches in the aligned sequences divided by the length of the sequences. (If the sequences are aligned and contain gaps, the gaps are not counted in the denominator.) The genetic criterion in dbOTU1 was articulated as the minimum of the aligned and unaligned Jukes-Cantor distances, which was a work-around for the fact that using the aligned sequences sometimes led to a greater distance between two sequences than would be computed by comparing the unaligned sequences.
The distribution criterion was evaluated using the p-value from the χ 2 test of independence as implemented in the chisq.test function in R [4], which was called in a separate process from a Perl script. The p-value was compared against a threshold for determining whether two sequences should be merged into one OTU. If the p-value fell below some treshold, the distributions of the two sequences across the samples were considered too distinct for the two sequences to be merged. (Because the p-values were used as an operational threshold for merging OTUs-rather than as explicit statements about the statistical significance of the similarity of two sequences' distributions-these p-values were never subjected to multiple hypothesis correction.) Many of the comparisons involved sequences with small numbers of counts, for which the analytical, asymptotic calculation of the p-value of a χ 2 test is not accurate. This implementation therefore used a simulated p-value, available through the R command's simulate.p. value option, with 10 4 simulated contingency tables. This empirical calculation was slow and was the principal bottleneck in running dbOTU1.
The second implementation. The second implementation (github.com/spacocha/ dbOTUcaller), which we call "dbOTU2", was coded in Python 2 and interfaced with R  It took unaligned sequences,  aligned sequences, and the sequence count table as input. Like dbOTU1, dbOTU2 used the minimum of the aligned and unaligned genetic dissimilarities for its genetic criterion. Unlike dbOTU1, which used the Jukes-Cantor distance, dbOTU2 used the simple proportion d of mismatched sites and computed these dissimilarities directly from the sequence files only when required by the algorithm. This selective computation avoided loading the matrix of N 2 pairwise distances. Hereafter we refer to this metric, the minimum of aligned and unaligned proportion of mismatched sites, as the dbOTU2 metric.
Like the first implementation, this one used the simulated p-value from R's chisq.test for its distribution criterion. Unlike dbOTU1, dbOTU2 called the test via rpy2 from the Python script. This removed the need for reading and writing temporary files, but it was still slow and required both R and Python.

This implementation
This implementation, dbOTU3, aims to improve speed and ease of use. It is written in pure Python and is compatible with Python 2 and 3. For the genetic criterion, this implementation uses the Levenshtein edit distance, the number of single-position insertions, deletions, or substitutions required to transform one sequence into another, as an approximation for the sequence dissimilarity. (The Levenshtein distance is implemented in the python-Levenshtein package; github.com/ztane/python-Levenshtein.) For the distribution criterion, this implementation uses a likelihood-ratio test. The merit of these choices for the genetic and distribution criteria are discussed in the Results.

New genetic and distribution criteria
This implementation evaluates the genetic criterion using the Levenshtein edit distance: a candidate sequence will not be merged into an OTU if 2E/(ℓ 1 + ℓ 2 ) is greater than some threshold, where E is the Levenshtein edit distance, ℓ 1 is the length of the candidate sequence, and ℓ 2 is the length of the OTU's sequence. As shown in the Results, this metric is a good approximation of the proportion of mismatched sites in an alignment.
This implementation evaluates the distribution criterion using the p-value from a likelihood-ratio test. Define x ðiÞ 1 as the number of counts that the OTU has in sample i and x ðiÞ 2 as the number of counts the candidate sequence has. Define also X 1 ¼ P i x ðiÞ 1 and similarly X 2 . The alternative hypothesis for this test is that the OTU and candidate sequence are distributed "differently", that is, that each of the x ðiÞ 1 and x ðiÞ 2 are drawn from different random variables, each of which we model as Poisson-distributed [5]. Thus, the alternative hypothesis is where there are no constraints on the Poisson parameters.
The null model asserts that the OTU and candidate sequence are distributed "the same", that is, that the candidate sequence's counts in each sample is drawn from a Poisson random variable whose parameter is proportional to the parameter of the OTU's Poisson variables, where the constant of proportionality is the same across samples. Specifically, the null model is We expect that 0 < ρ < 1 because the candidate sequence is overall less abundant than the OTU. Asserting maximum likelihood for each model shows that so that the test statistic is where L 1 is the likelihood of the alternative model, L 0 is the likelihood of the null model, Accuracy and speed of new implementation To evaluate the performance of the new implementation, we compared the results of calling OTUs with dbOTU3, dbOTU2, and with UPARSE [6]. We used the Turnbaugh mock community data set [7] analyzed in the original dbOTU publication.
We trimmed all sequences in Mock_nochimeras.fna to 187 nucleotides, the length of the shortest sequence in that file. To align the sequences in Mock_clean.fna with those in Mock_nochimeras.fna, we trimmed the first 14 nucleotides from each sequence in Mock_clean.fna and then trimmed the remaining sequences to 187 nucleotides.
To generate a table of sequence counts, we dereplicated the unique trimmed sequences from Mock_nochimeras.fna. For each sequence in Mock_clean.fna, we checked if that sequence appeared among the unique sequences. If so, we counted it as present in the sample corresponding to the metadata for that sequence. The dereplicated sequences and table of sequence counts are in the Supporting Information (S1 and S2 Files).
To generate the aligned sequences required for dbOTU2, we aligned the unique sequences using mothur's align.seqs command.
dbOTU2 and dbOTU3 were run with a genetic dissimilarity threshold of 0.1 and with p = 0.001 as the distribution criterion threshold. dbOTU2 was run using the aligned and unaligned sequences. UPARSE OTUs were called using usearch -cluster_otus and the same list of unique sequences (but with the summary "size" information required by usearch) at similarity thresholds of 95%, 97%, and 100%. Notably, UPARSE includes chimera checking, which identified chimeras among the sequences from Mock_nochimeras.
fna. The specific sequences that UPARSE identified as chimeras depended on the clustering similarity threshold.
We compared the results of the OTU callers with the true composition of each mock community sample (Table S3 in ref. [7]). To link the sequence data with the true mock community composition, which is expressed in terms of the abundances of input isolates, we identified, for each dereplicated sequence, the most genetically-similar reference sequence in MockIsolatesV2.fna, specifically, the one with the smallest proportion d of unaligned mismatched sites (S3 File). To compare compositions, we combined the abundance data from all the dereplicated sequences that mapped to the same species.
To benchmark the speed of the new implementation, the three implementations were run on the mock community data using a personal computer.

Accuracy and speed of the new genetic criterion
When dbOTU3 was run on the mock community data, the genetic criterion was evaluated for 3,688 pairs of sequences. We used those pairs of sequences to benchmark the accuracy and speed of the new genetic criterion.
To evaluate the performance of the Levenshtein metric, we compared the dissimilarities computed by it, by the dbOTU2 metric, and by a gold standard-pairwise alignment using Clustal Omega [8]-for each of the 3,688 pairs of sequences.
To benchmark the speed of the new genetic criterion relative to other in-Python options, we measured the time required to compute the dissimilarities between these pairs using three methods: the dbOTU3's Levenshtein metric, Biopython's pairwise2.align.globalxx [9], and an external call to Clustal Omega. These computations were performed on a personal computer.

Accuracy and speed of the new distribution criterion
When dbOTU3 was run on the mock community data, the distribution criterion was evaluated for 47 OTU/sequence pairs. (In terms of the outlined algorithm in the Introduction, dbOTU3 performed 47 comparisons in step 4.) We used those OTU/sequence pairs to benchmark the accuracy and speed of the new distribution criterion.
To benchmark the speed of the new likelihood-ratio test, we computed the distribution tests for the 47 OTU/sequence pairs using the method of dbOTU1 (i.e., an external call to R's chisq.test command with 10 4 simulations), the method of dbOTU2 (the same R command as for dbOTU1 but called using rpy2), and dbOTU3's likelihood-ratio test.
To evaluate the performance of dbOTU3's likelihood-ratio test, we compared the p-values computed by this new test against a gold standard: the simulated χ 2 test with 10 7 simulations.

Results
The new implementation is faster and performs similarly to the previous one When processing this relatively small mock community dataset, dbOTU3 was also about tenfold faster than dbOTU1 and dbOTU2 ( Table 2). dbOTU3 also yieled a nearly identical species-level composition when compared to dbOTU2 (Fig 1). Both dbOTU implementations performed similarly to 100% clustering with UPARSE but less similarly to other clustering thresholds with UPARSE.
The Levenshtein dissimilarity is fast and performs comparably to the previous metric The Levenshtein genetic dissimilarity was much faster than either in-Python alignments made using Biopython or out-of-Python alignments using Clustal Omega ( Table 3).
The relationship between the genetic dissimilarities measured by the gold standard (the pairwise alignments made with Clustal Omega), the dbOTU2 metric, and dbOTU3's Levenshtein metric are shown in Fig 2. Both metrics perform well for dissimilarities below 10%. Above 10% true dissimilarity, the Levenshtein metric underestimates dissimilarities, while the dbOTU2 metric mostly overestimates dissimilarities.
To quantify the performance of the new Levenshtein dissimilarity, we separately assessed the two roles the genetic dissimilarity plays in the dbOTU algorithm. First, as laid out in step 2 of the algorithm in Section Overview of the algorithm, the genetic dissimilarity is used to classify OTU/sequence pairs as above or below the genetic criterion threshold: sequences that are too dissimilar to the existing OTUs are immediately made into new OTUs. Second, as laid out in step 4 of the algorithm, sequences that are sufficiently similar to at least one OTU have their distributions compared to those genetically-similar OTUs. The candidate sequence is compared to the most genetically-similar OTU first, then the next-most similar, and so on.
To evaluate the first role of the genetic dissimilarity, we considered its ability to correctly classify sequence pairs as above or below the genetic dissimilarity threshold. We compared the classification performance of dbOTU3's Levenshtein metric and the dbOTU2 metric against a gold standard, pairwise-alignments by Clustal Omega (Table 4). The dbOTU2 metric is prone to erroneously asserting that two sequences are too dissimilar to be considered for merging, while the dbOTU3 metric is prone to erroneously asserting that two sequences are sufficiently similar to be subject to the distribution test and potentially merged.
To evaluate the second role of the genetic dissimilarity, we computed the correlation between the dbOTU metric and the gold standard for sequences pairs that passed the genetic criterion at the default 10% dissimilarity threshold. Among these sequence pairs, the Levenshtein metric is slightly better correlated with the true distance than the dbOTU2 metric metric ( Table 5).

The likelihood-ratio test is fast and mostly reproduces the previous distribution criterion
The likelihood-ratio test used in dbOTU3 was markedly faster than the distribution criteria used in dbOTU1 and dbOTU2 ( Table 6).
The likelihood-ratio test also performed similarly to the gold standard (χ 2 test with 10 7 simulations). Of the 47 OTU/sequence comparisons, the gold standard determined that the OTU "mothur alignment" refers to using mothur to align the input sequences, which was required before running dbOTU1 and dbOTU2. The time required to compute the FastTree distance matrix, which is required for dbOTU1, was small (≲0.1 seconds) and is not shown. Errors show the standard deviations over 10 runs. https://doi.org/10.1371/journal.pone.0176335.t002 Distribution-based OTU calling and candidate sequence were differently distributed in 16 cases. The likelihood-ratio test concurred in those 16 cases. In one case, shown in Table 7, the gold standard classified the candidate sequence and OTU as similarly-distributed but the likelihood-ratio test classified them as differently-distributed. This performance (16 true dissimilars and 1 false dissimilar) corresponds to an accuracy of F 1 = 97%.  The genetic similarity metrics used in dbOTU2 and dbOTU3 were used to classify pairs of sequences as sufficiently genetically similar to be subjected to the distribution criterion. To compute the accuracies, dissimilarities for the 3,688 pairs of sequences compared while running dbOTU3 on the mock community data (i.e., the same ones shown in Fig 2) were computed by the gold standard, dbOTU2, and dbOTU3 (Levenshtein) metrics. Here a "similar" result means that a metric concludes that a pair of sequences are sufficiently genetically similar to be considered for merging into an OTU; a "dissimilar" results means that the sequences are too dissimilar to be merged. The "level" is the genetic dissimilarity threshold used for the test. For example, a dbOTU2 true similar at the 5% level means that the dbOTU2 metric and the gold standard both concluded that the two sequences were at least 95% similar and thus candidates for merging. A false dissimilar, by contrast, means that the sequences were genetically similar but the metric concluded they were not, thus erroneously excluding them from a distribution criterion check.
Up to this point, the p-value threshold for the likelihood-ratio test was fixed at the same value as was used for the gold standard, the simulated χ 2 test. To check if the likelihood-ratio test would perform better when using a different threshold, we varied the likelihood-ratio test's threshold, keeping the χ 2 test's threshold fixed, and computed the new test's accuracy ( Table 8). The likelihood-ratio test performs best, relative to the χ 2 test, when its p-value threshold is about twenty-fold smaller.

Discussion
This implementation is overall more user-friendly than previous implementations. The software itself is faster, and the codebase is smaller. The main codebase is all in a single file and is supported by online documentation and unit tests. The input files for this implementation are easier for the user to prepare because they do not require sequence alignment.
Aside from software, this implementation has two main differences from the previous implementations. The first is the genetic dissimilarity criterion. Ideally, the genetic dissimilarity metric would be computed by a pairwise alignment of sequences. Unfortunately, there is no efficient implementation of this kind of alignment in pure Python. We found that calling Clustal Omega as a separate process was more than a hundred times slower than computed the Levenshtein distance. If this kind of computation is efficiently implemented in Python The correlations are between the values computed by the gold standard (pairwise alignment with Clustal Omega) and by the metrics used in dbOTU2 and dbOTU3. The correlations were computed for the 167 pairs of sequences that were compared while running dbOTU3 on the mock community data and for which the true genetic dissimilarity is at most 10% (i.e., those points in Fig 2 for which the true dissimilarity is at most 10%). Ranges are 95% confidence intervals.
In the meantime, the Levenshtein dissimilarity is efficient and well-supported. Compared to the dbOTU2 genetic dissimilarity metric, the Levenshtein metric is less likely to erroneously exclude a sequence from being merged into an OTU on the grounds of genetic dissimilarity, but it is conversely more likely to erroneously conclude that a candidate sequence and OTU are genetically similar. dbOTU3 users should treat the genetic threshold as an inclusion criterion rather than an exclusion criterion: sequence pairs more similar than the threshold will almost certainly be treated as similar, but pairs less similar than the threshold can also be expected to be (erroneously) treated as similar and therefore subjected to the distribution criterion.
The second major difference from previous implementations is the distribution criterion. In previous implementations, the simulated χ 2 test, which is computationally demanding, was used in place of the asymptotic χ 2 test, which is not computationally demanding, because the asymptotic χ 2 test is not accurate when some cells have low counts (i.e., the candidate sequence is overall rare or is absent or nearly absent from some samples). This implementation uses a likelihood-ratio test that performed similarly to the simulated χ 2 test when using the same p-value threshold, and adjusting the threshold for the likelihood-ratio test to twenty-fold smaller perfectly reproduced the criterion emerging from the χ 2 test. We therefore recommend that users migrating from a previous dbOTU implementation adjust the p-value threshold similarly.
In the analysis of the mock community data, these differences in implementation of the genetic and distribution criteria had a negligible effect on the resulting inferred community compositions. Therefore, the degree to which this implementation's results did not recapitulate the "true" compositions is addressed by the original publication, which rigorously showed that the dbOTU algorithm was more accurate than alternative methods. S3 File. Mapping from sequence to isolate. A tab-separated list of sequence IDs (the same that appear in S1 and S2 Files) and the corresponding isolate listed in Table S3 of ref. [7]. (TSV)