DUDE-Seq: Fast, flexible, and robust denoising of nucleotide sequences

Motivation: We consider the correction of errors from nucleotide sequences produced by next-generation sequencing. The error rate in reads has been increasing with the shift of focus of mainstream sequencers from accuracy to throughput. Denoising in high-throughput sequencing is thus becoming a crucial component for boosting the reliability of downstream analyses. Results: Our methodology, named DUDE-Seq, is derived from a general setting of reconstructing finitevalued source data corrupted by a discrete memoryless channel and provides an effective means for correcting substitution and homopolymer indel errors, the two major types of sequencing errors in most high-throughput sequencing platforms. Our experimental studies with real and simulated data sets suggest that the proposed DUDE-Seq not only outperforms existing alternatives in terms of error-correction capabilities and time efficiency, but also boosts the reliability of downstream analyses. Further, the flexibility of DUDE-Seq enables us to robustly apply it to different sequencing platforms and analysis pipelines by a simple update of the noise model. Availability: http://data.snu.ac.kr/pub/dude-seq Contact: tsmoon@dgist.ac.kr, sryoon@snu.ac.kr


Introduction
A new generation of high-throughput, low-cost sequencing technologies, referred to as nextgeneration sequencing (NGS) technologies [1], is reshaping biomedical research, including large-scale comparative and evolutionary studies [2][3][4]. Compared with automated Sanger sequencing, NGS platforms produce significantly shorter reads in large quantities, posing various new computational challenges [5].
There are several DNA sequencing methodologies that use NGS [6,7] including whole genome sequencing (WGS), chromatin immunoprecipitation (ChIP) sequencing, and targeted sequencing. WGS is used to analyze the genome of an organism to capture all variants and identify potential causative variants; it is also used for de novo genome assembly. ChIP sequencing identifies genome-wide DNA binding sites for transcription factors and other proteins. Targeted sequencing (e.g., exome sequencing and amplicon sequencing), the focus of a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 this paper, is a cost-effective method that enables researchers to focus on investigating areas of interest that are likely to be involved in a particular phenotype. According to previous studies [8,9], targeted sequencing often results in the complete coverage of exons of disease-related genes, while alternative methods result in approximately 90-95% coverage. Hence, in clinical settings, researchers tend to rely on targeted sequencing for diagnostic evaluations.
To detect sequences based on fluorescent labels at the molecular level, NGS technologies normally rely on imaging systems requiring templates that are amplified by emulsion polymerase chain reaction (PCR) or solid-phase amplification [1]. These amplification and imaging processes can generate erroneous reads, the origin of which can be traced to the incorrect determination of homopolymer lengths, the erroneous insertion/deletion/substitution of nucleotide bases, and PCR chimeras [6]. Substitution errors dominate for many platforms, including Illumina, while homopolymer errors, manifested as insertions and deletions (indels), are also abundant for 454 pyrosequencing and Ion Torrent.
Erroneous reads must be properly handled because they complicate downstream analyses (e.g., variant calling and genome assembly), often lowering the quality of the whole analysis pipeline [7] Soft clipping, in which 3'-ends of a read are trimmed based on the quality scores of individual bases, may be the simplest approach, but it results in a loss of information [10]. More sophisticated methods focus on detecting and correcting errors in sequence data [11][12][13][14][15][16][17][18][19][20]. Given the widespread use of Illumina sequencing platforms, most error-correction algorithms have targeted substitution errors [10].
As summarized in recent reviews [10,21], current error-correction methods for NGS data can be categorized as follows: k-mer (i.e., oligonucleotides of length k) frequency/spectrumbased, multiple sequence alignment (MSA)-based, and statistical error model-based methods. The idea behind k-mer-based methods [13,20,[22][23][24][25] is to create a list of "trusted" k-mers from the input reads and correct untrusted k-mers based on a consensus represented by this spectrum. In addition to the length of the k-mer, coverage (k-mer occurrences) information is important to determine trusted k-mers. Under the assumption that errors are rare and random and that coverage is uniform, for sufficiently large k, it is reasonable to expect that most errors alter k-mers to inexistent ones in a genome. Thus, for high-coverage genome sequences obtained by NGS, we may identify suspicious k-mers and correct them based on a consensus. MSA-based methods [12,16,26] work by aligning related sequences according to their similarities and correcting aligned reads, usually based on a consensus in an alignment column, using various techniques. This alignment-based scheme is inherently well-suited for correcting indel errors. Early methods suffered from computational issues, but recent approaches utilize advanced indexing techniques to expedite the alignments. In statistical error model-based methods [27][28][29], a statistical model is developed to capture the sequencing process, including error generation. In this regard, an empirical confusion model is often created from datasets, exploiting the information obtained from, e.g., alignment results, Phred quality scores (a measure of the quality of nucleobases generated by automated DNA sequencing) [30], or other parameters.
While the above methods often exhibit good performance for various platforms, they also have several limitations. First, k-mer-based schemes tend to be ineligible when the coverage is expected to vary over the queried sequences, as in transcriptomics, metagenomics, heterogeneous cell samples, or pre-amplified libraries [21]. Second, MSA-based methods, which do not suffer from the above issue related to non-uniform coverage, often require the application of heuristic and sophisticated consensus decision rules for the aligned columns, and such rules may be sensitive to specific applications or sequencing platforms. Third, statistical error model-based methods typically use computationally expensive schemes (e.g., expectationmaximization) owing to additional stochastic modeling assumptions for the underlying DNA sequences. Moreover, little attention is given to the validity and accuracy of such modeling assumptions, let alone to theoretical analysis of whether near optimum or sound error-correction performance is attained. Finally, many existing schemes applying the three methods often return only representative (consensus) denoised sequences created by merging input sequences; hence, the number of sequences is often not preserved after denoising. In some applications, this may result in inconsistencies in downstream analyses. To address these limitations, many existing tools combine the three methods in a complementary manner to improve performance [10,21].
In this paper, as an alternative, we applied an algorithm called Discrete Universal DEnoiser (DUDE) [31] for accurate DNA sequence denoising. DUDE was developed for a general setting of reconstructing sequences with finite-valued components (source symbols) corrupted by a noise mechanism that corrupts each source symbol independently and statistically identically. In the DNA denoising literature, such a noise model is equivalent to the confusion matrix commonly used in statistical error-model-based methods. As demonstrated in the original paper [31], DUDE exhibits rigorous performance guarantee for the following setting; even when no stochastic modeling assumptions are made for the underlying clean source data, only with the assumption of known noise mechanism, DUDE is shown to universally attain the optimum denoising performance for any source data the data increase. We note that the above setting of DUDE naturally fits the setting for DNA sequence denoising, i.e., it is difficult to establish accurate stochastic models for clean DNA sequences, but it is simple and fairly realistic to assume noise models (i.e., confusion matrices) for sequencing devices based on reference sequences.
The DUDE algorithm, which will be explained in details in the next section, possesses flavors that are somewhat connected to all three representative methods mentioned above, in a single scheme. Specifically, DUDE works with double-sided contexts of a fixed size that are analogous to k-mers. Moreover, like MSA, DUDE applies a denoising decision rule to each noisy symbol based on aggregated information over certain positions in the reads. However, unlike MSA, which makes a decision based on the information collected from the symbols in the same aligned column, DUDE makes a decision using the information collected from positions with the same double-sided context. Finally, the denoising decision rule of DUDE utilizes information from the assumed noise model, like in most statistical error model-based methods, but does not assume any stochastic model on the underlying sequence, thus resulting in a computationally efficient method. The method of incorporating the noise model is also simple, making it easy to flexibly apply DUDE to different sequencing platforms by simply changing the confusion matrix model in the algorithm.
With the above unique nature of the DUDE algorithm, we show in our experiments that it outperforms other state-of-the-art schemes, particularly for applications to targeted amplicon sequencing. Specifically, among the applicable areas of targeted amplicon sequencing (e.g., cancer gene, 16S rRNA, plant, and animal sequencing [32]), we used 16S rRNA benchmark datasets obtained with different library preparation methods and DNA polymerases to confirm the robustness of our algorithm across various sequencing preparation methods. Targeted amplicon sequencing datasets often have deeper sequencing coverage than those of WGS or ChIP datasets, which frequently makes conventional k-mer-based techniques often suffer from the amplification bias problem [33]. By contrast, for DUDE-Seq, as the sequencing coverage becomes deeper, context-counting vectors can accumulate more probable contexts, and the robustness of denoising typically improves. We apply two versions of DUDE separately for substitution and homopolymer errors, the two major types of sequencing error. For substitution errors, our approach directly utilizes the original DUDE with appropriate adaptation to DNA sequences and is applicable to reads generated by any sequencing platform. For homopolymer errors, however, we do not apply the original DUDE, which was developed in a framework that does not cover errors of the homopolymer type. To correct homopolymer errors, we therefore adopt a variant of DUDE for general-output channels [34]. Our homopolymer-error correction is applicable to cases in which base-called sequences and the underlying flowgram intensities are available (e.g., pyrosequencing and Ion Torrent). For brevity, we refer to both of these DUDE-based approaches as DUDE-Seq, but the correction type will be easily distinguishable by the reader.

Discrete Universal DEnoiser (DUDE)
In this section, we formally introduce the DUDE algorithm along with its notation and its connection to DNA sequence denoising. Fig 1 shows the concrete setting of the discrete denoising problem. We denote the underlying source data as {x i } and assume each component takes values in some finite set X . The resulting noisy version of the source corrupted by a noise mechanism is denoted as {Z i }, and its components take values in, again, some finite set Z. As mentioned in the Introduction, DUDE assumes that the noise mechanism injects noises that are independent and statistically identical, and such a mechanism is often referred to as a Discrete Memoryless Channel (DMC) in information theory. The DMC is completely characterized by the channel transition matrix, also known as the confusion matrix, P 2 R jX jÂjZj , of which the (x, z)-th element, P(x, z), stands for Pr(Z i = z|x i = x), i.e., the conditional probability that the noisy symbol takes value z, given that the original source symbol is x. We denote random variables with uppercase letters and the individual samples of random variables or deterministic symbols with lowercase letters. Thus, the underlying source data, which are treated by DUDE as individual sequences (and not a stochastic process), are denoted by the lowercase {x i }, and the noise-corrupted sequences, i.e., sequences of random variables, are denoted by uppercase {Z i }. Furthermore, throughout this paper, we generally denote a sequence (n-tuple) as a n = (a 1 ,. . .,a n ), for example, where a j i refers to the subsequence (a i ,. . .,a j ). As shown in Fig 1, a discrete denoiser observes the entire noisy data Z n and reconstructs the original data withX n ¼ ðX 1 ðZ n Þ; . . . ;X n ðZ n ÞÞ. The goodness of the reconstruction by a discrete denoiserX n is measured by the average loss, where Lðx i ;x i Þ is a single-letter loss function that measures the loss incurred by estimating x i withx i at location i. The loss function can be also represented with a loss matrix L 2 R jX jÂjX j . DUDE in [31] is a two-pass algorithm that has linear complexity with respect to the data size n. During the first pass, given the realization of the noisy sequence z n , the algorithm collects the statistics vector for all a 2 Z, which is the count of the occurrence of the symbol a 2 Z along the noisy sequence z n that has the double-sided context ðl k ; r k Þ 2 Z 2k . Note that m is similar to the counts across the aligned columns for the simple majority voting in MSA-based denoising methods. However, in DUDE, the count is collected regardless of whether the positions in the reads are aligned or not, but considering whether the position has the same context. Additionally, the context length k is analogous to the k-mer length. Once the m vector is collected, for the second pass, DUDE then applies the rulê for each k + 1 i n − k, where π z i is the z i -th column of the channel matrix P, and lx is thê x-th column of the loss matrix Λ. Furthermore, stands for the element-wise product operator for two vectors. The intuitive explanation of Eq (2) is as follows: when we rearrange the right-hand side of Eq (2), we obtain and we can show that π a P −T m T (z n , l k , r k ) approximates the empirical count vector of the underlying clean symbol at the middle location that resulted in the noisy context l k ar k . Thus, the denoising rule Eq (2), re-expressed in Eq (3), finds a reconstruction symbolx that minimizes the expected loss with respect to the empirical estimate (obtained by utilizing the inverse of P) of the count vector of the underlying x i given the noisy context z iþk iÀ k . At a high level, DUDE is not a simple majority voting rule based on m; instead, it incorporates the DMC model P (the confusion matrix) and loss function Λ to obtain a more accurate estimation of the clean source symbol. For more detailed and rigorous arguments on the intuitive description of Eq (2), we refer readers to the original paper [31, Section IV-B].
Note that formula Eq (2) assumes X ¼ Z ¼X and P is invertible for simplicity, but Weissman et al. [31] deal with more general cases as well. The form of Eq (2) also shows that DUDE is a sliding window denoiser with window size 2k + 1; i.e., DUDE returns the same denoised symbol at all locations with the same value z iþk iÀ k . DUDE is guaranteed to attain the optimum performance by the sliding window denoisers with the same window size as the observation length n increases. For more details on the theoretical performance analyses, see Weissman et al. [31,Section V].
The original DUDE dealt exclusively with the case of jX j and jZj finite. Dembo and Weissman [34] DUDE to the case of discrete input and general output channels; the noisy outputs do not have to have their values in some finite set, but can have continuous values as well. As in [31], the memoryless noisy channel model, which is characterized in this case by the set of densities ff x g x2X , was assumed to be known. As shown in [34, Fig 1], the crux of the arguments is to apply a scalar quantizer Q(Á) to each continuous-valued noisy output {Y i } and to derive a virtual DMC, Γ 2 R jX jÂjZj , between the discrete input {X i } and the quantized (hence, discrete) output {Z i }. Such Γ can be readily obtained by the knowledge of ff x g x2X and evaluating the following integral for each (x, z): Γ(x, z) = R y:Q(y) = z f x (y)dy. Once the virtual DMC is obtained, the rest of the algorithm in [34] proceeds similarly as the original DUDE; specifically, it obtains the statistics vector m for the quantized noisy outputs {Z i } during the first pass, and then applies a sliding window denoising rule similar to Eq (2), which depends on the statistics vector m, the virtual DMC Γ, ff x g x2X , and the noisy sequence Y n , during the second pass. A concrete denoising rule can be found in [34,Eqs (16), (19) and (20)]. In [34], a formal analysis of the generalized DUDE shows that it attains the optimum denoising performance among sliding window denoisers with the same window size, that base their denoising decisions on the original continuous-valued outputs Y n . We refer readers to the paper for more details. In the next section, we show how we adopt this generalized DUDE in our DUDE-Seq to correct homopolymer errors in DNA sequencing.

DUDE-Seq: DUDE for DNA sequence denoising Substitution errors
As described in the previous section, the setting of the original DUDE algorithm naturally aligns with the setting of substitution-error correction in DNA sequence denoising. We can set X ¼ Z ¼ fA; C; G; Tg, and the loss function as the Hamming loss, namely, Lðx;xÞ ¼ 0, if x ¼x, and Lðx;xÞ ¼ 1, otherwise. Then, the two-pass sliding window procedure of DUDE for collecting the statistics vector m and the actual denoising can be directly applied as shown in the toy example in Fig 2. Before we formally describe our DUDE-Seq for substitution-error correction, however, we need to address some subtle points.
First, the original DUDE in Eq (2) assumes that the DMC matrix P is known beforehand, but in real DNA sequence denoising, we need to estimate P for each sequencing device. As described in the Experimental Results section in detail, we performed this estimation following the typical process for obtaining the empirical confusion matrix, i.e., we aligned the predefined reference sequence and its noise-corrupted sequence and then determined the ratio of substitution errors and obtain the estimated P. Second, the original DUDE assumes that the noise mechanism is memoryless, i.e., the error rate does not depend on the location of a base within the sequence. In contrast, for real sequencing devices, the actual error rate, namely, the conditional probability Pr(Z i = z|X i = x) may not always be the same for all location index values i. For example, for Illumina sequencers, the error rate tends to increase towards the ends of reads, as pointed out in [21]. In our DUDE-Seq, however, we still treat the substitution error mechanism as a DMC and therefore use the single estimated P obtained as above, which is essentially the same as that obtained using the average error rate matrix. Our experimental results show that such an approach still yields very competitive denoising results. Thirdly, the optimality of the original DUDE relies on the stationarity of the underlying clean sequence, thus requiring a very large observation sequence length n to obtain a reliable statistics vector m. In contrast, most sequencing devices generate multiple short reads of lengths 100 * 200.
Hence, in DUDE-Seq, we combined all statistics vectors collected from multiple short reads to generate a single statistics vector m to use in Eq (2).
Addressing the above three points, a formal summary of DUDE-Seq for the substitution errors is given in Algorithm 1. Note that the pseudocode in Algorithm 1 skips those bases whose Phred quality score s are higher than a user-specified threshold and invokes DUDE-Seq only for the bases with low quality scores (lines [10][11][12][13][14]. This is in accord with the common practice in sequence preprocessing and is not a specific property of the DUDE-Seq algorithm. Furthermore, for simplicity, we denoted z n as the entire noisy DNA sequence, and m T ðz n ; z iÀ 1 iÀ k ; z iþk iþ1 Þ represents the aggregated statistics vector obtained as described above.

Algorithm 1 The DUDE-Seq for substitution errors
Require: Observation z n , Estimated DMC matrix P 2 R 4Â4 , Hamming loss Λ 2 R 4Â4 , Context size k, Phred quality score Q n Ensure: The denoised sequenceX n 1: Define mðz n ; l k ; r k Þ 2 R 4 for all (l k , r k )2{A,C,G,T} 2k . 2: Initialize m(z n , l k , r k )[a] = 0 for all (l k , r k )2{A,C,G,T} 2k and for all a 2 {A,C,G,T} 3:

Remarks
1. Incorporating flanking sequences in DUDE-Seq is quite straightforward; we can simply use the one-sided contexts l 2k or r 2k once DUDE-Seq reaches the flanking regions. In our experiments, however, we did not perform such modification (lines 7-8 of Algorithm 1) since we normally used small k values (around k = 5). As demonstrated in our experimental results, the effect of such small flanking regions is not significant on the final denoising results, and we can achieve satisfactory results without considering flanking regions. However, in general, should longer values of k be needed, we can easily modify the algorithm to incorporate one-sided contexts in the flanking regions, and such modification will clearly improve the final denoising result.
2. DUDE-Seq does not need to consider reverse complements of the input sequences to collect m's, since forward and reverse reads are handled separately in our experiments. Reverse complements are typically considered when we need to handle double-stranded sequences without knowing whether each read corresponds to the forward or reverse strand.

Homopolymer errors
Homopolymer errors, particularly in pyrosequencing, occur while handling the observed flowgram, and a careful understanding of the error injection procedure is necessary to correct these errors. As described in [35], Exploiting the fact that the order of DNA bases is always fixed at {T, A, C, G}, we can apply the setting of the generalized DUDE in [34] to correct homopolymer errors as follows. Because we know the exact DNA base that corresponds with each intensity value, the goal is the correct estimatimation of homopolymer lengths from the observed intensity values. Hence, we can interpret the intensity distributions {P(f|N)} as the memoryless noisy channel models with a continuous-output, where the channel input is the homopolymer length N. We set the upper bound of N to 9 according to the convention commonly used for handling flowgram distributions in the targeted amplicon sequencing literature [35][36][37]. When the usual rounding function is used as a scalar quantizer, as mentioned above, and the virtual DMC G 2 R 10Â10 can be obtained by calculating the integral for each 0 i 9, 1 j 9 and Gði; 0Þ ¼ R 0:5 0 Pðf jiÞdf . With this virtual DMC model, we apply a scheme inspired by the generalized DUDE to correctly estimate the homopolymer lengths, which results in correcting the insertion and deletion errors. That is, we set X ¼ Z ¼ f0; 1; . . . ; 9g, and again use the Hamming loss L 2 R 10Â10 . With this setting, we apply Q R (f) to each f i to obtain the quantized discrete output z i , and obtain the count statistics vector m from z n during the first pass. Then, for the second pass, instead of applying the more involved denoising rule in [34, we employ the same rule as Eq (2) with Γ in place of P to obtain the denoised sequence of integersX n based on the quantized noisy sequence Z n . Although it is its implementation is easier and it has a faster running time than that of the generalized DUDE. Once we obtainX n , from the knowledge of the DNA base for each i, we can reconstruct the homopolymer error-corrected DNA sequenceD (the length of which may not necessarily be equal to n). Algorithm 2 summarizes the pseudo-code of DUDE-Seq for homopolymer-error correction.

Experimental results Setup
We used both real and simulated NGS datasets and compared the performance of DUDE-Seq with that of several state-of-the-art error correction methods. The list of alternative tools used for comparison and the rationale behind our choice s are described in the next subsection. When the flowgram intensities of base-calling were available, we corrected both homopolymer and substitution errors; otherwise, we only corrected substitution errors. The specifications of the machine we used for the analysis are as follows: Ubuntu 12.04.3 LTS, 2 × Intel Xeon X5650 CPUs, 64 GB main memory, and 2 TB HDD.

Algorithm 2 The DUDE-Seq for homopolymer errors
Require: Flowgram data f n , Flowgram densities fPðf jNÞg Compute Γ(i, j) following Eq (5) of the main text ⊳ Computing the virtual DMC Γ 9: end for 10: end for 11: for i 1,. . .,n do Obtain DUDE-Seq has a single hyperparameter k, the context size, that needs to be determined. Similar to the popular k-mer-based schemes, there is no analytical method for selecting the best k for finite data size n, except for the asymptotic order result of kjX j 2k ¼ oðn= log nÞ in [31], but a heuristic rule of thumb is to try values between 2 and 8. Furthermore, as shown in Eq (2), the two adjustable matrices, Λ and P, are required for DUDE-Seq. The loss Λ used for both types of errors is the Hamming loss. According to Marinier et al. [38], adjusting the sequence length by one can correct most homopolymer errors, which justifies our use of Hamming loss in DUDE-Seq. In our experiments, the use of other types of loss functions did not result in any noticeable performance differences. The DMC matrix P for substitution errors is empirically determined by aligning each sampled read to its reference sequence, as in [35]. Fig 4 shows the non-negligible variation in the empirically obtained P's across the sequencing platforms, where each row corresponds to the true signal x and each column corresponds to the observed noisy signal z. In this setting, each cell represents the conditional probability P(z|x). In our experiments, dataset P1-P8 used P for GS FLX, Q19-Q31 used P for Illumina, and S5, A5 used P for Simulation data. The details of each dataset are explained in the following sections.
In order to evaluate the results, we used Burrows-Wheeler Aligner (BWA) [39] and SAMtools [40]. We aligned all reads to their reference genome using BWA with the following parameters: [minimum seed length: 19, matching score: 1, mismatch penalty: 4, gap open penalty: 6, gap extension penalty: 1]. After the mapped regions were determined using BWA in SAM format, we chose uniquely mapped pairs using SAMtools. The Compact Idiosyncratic Gapped Alignment Report (CIGAR) string and MD tag (string for mismatching positions) for each of the resultant pairs in the SAM file were reconstructed to their pairwise alignments using sam2pairwise [41].

Evaluation metric
As a performance measure, we define the per-base error rate of a tool after denoising as in which '# aligned bases' represents the number of mapped bases (i.e., matches and mismatches) after mapping each read to its reference sequence, and '# mismatched bases' represents the number of the erroneous bases (i.e., insertions, deletions, and substitutions) among the aligned bases.
We also employ an alternative definition that adjusts the error rate by incorporating the degree of alignment. To this end, we define the relative gain of the number of aligned bases after denoising by a tool over raw data as gða tool Þ ¼ # aligned bases after denoising À # aligned bases in raw # aligned bases in raw : Based on this, the adjusted error rateê tool of a denoising tool is defined as follows: where e tool and e raw represent the (unadjusted) error rates of the denoised data and the raw data, respectively. In other words, Eq (8) is a weighted average of e tool and e raw , in which the weights are determined by the relative number of aligned bases of a tool compared to the raw sequence. We believeê tool is a fairer measure as it penalizes the error rate of a denoiser when there is a small number of aligned bases. The relative gain of the adjusted error rate over raw data is then defined as which we use to evaluate the denoiser performance. While evaluating a clustering result, we employ a measure of concordance (MoC) [42] which is a popular similarity measure for pairs of clusterings. For two pairs of clusterings P and Q with I and J clusters, respectively, the MoC is defined as where f ij is the number of the common objects between cluster P i and Q j when p i and q j are the numbers of the objects in cluster P i and Q j , respectively. A MoC of one or zero represents perfect or no concordance, respectively, between the two clusters.

Software chosen for comparison
It is impossible to compare the performance of DUDE-Seq with that of all other schemes. Hence, we selected representative baselines using the following reasoning.
1. We included tools that can represent different principles outlined in the Introduction, namely, k-mer-based (Trowel, Reptile, BLESS, and fermi), MSA-based (Coral), and statistical error model-based (AmpliconNoise) methods.

2.
We considered the recommendations of [21] to choose baseline tools that are competitive for different scenarios, i.e., for 454 pyrosequencing data (AmpliconNoise), non-uniform coverage data, such as metagenomics data (Trowel, fermi, Reptile), data dominated by substitution errors, such as Illumina data (Trowel, fermi, Reptile), and data with a high prevalence of indel errors (Coral).
3. For multiple k-mer-based tools, we chose those that use different main approaches/data structures: BLESS (k-mer spectrum-based/hash table and bloom filter), fermi (k-mer spectrum and frequency-based/hash table and suffix array), Trowel (k-mer spectrum-based/ hash table), and Reptile (k-mer frequency and Hamming graph-based/replicated sorted k-mer list). 5. We mainly chose tools that return read-by-read denoising results to make fair error-rate comparisons with DUDE-seq. We excluded tools that return a substantially reduced number of reads after error correction (caused by filtering or forming consensus clusters). Examples of excluded tools are Acacia, ALLPATHS-LG, and SOAPdenovo.
6. We also excluded some recently developed tools that require additional mandatory information (e.g., the size of the genome of the reference organism) beyond the common setting of DNA sequence denoising in order to make fair error-rate comparisons. Examples of excluded tools are Fiona, Blue, and Lighter. Incorporating those tools that require additional information into the DUDE-Seq framework and comparisons with the excluded tools would be another future directions.

Real data: 454 pyrosequencing
Pyrosequenced 16S rRNA genes are commonly used to characterize microbial communities because the method yields relatively longer reads than those of other NGS technologies [43]. Although 454 pyrosequencing is gradually being phased out, we test ed DUDE-Seq with 454 pyrosequencing data for the following reasons: (1) the DUDE-Seq methodology for correcting homopolymeric errors in 454 sequencing data is equally applicable to other sequencing technologies that produce homopolymeric errors, such as Ion Torrent; (2) using pyrosequencing data allows us to exploit existing (experimentally obtained) estimates of the channel transition matrix Γ (e.g., [35]), which is required for denoising noisy flowgrams by DUDE-Seq (see Algorithm 2); (3) in the metagenomics literature, widely used standard benchmarks consist of datasets generated by pyrosequencing. In metagenome analysis [44], grouping reads and assigning them to operational taxonomic units (OTUs) (i.e., binning) are essential processes, given that the majority of microbial species have not been taxonomically classified. By OTU binning, we can computationally identify closely related genetic groups of reads at a desired level of sequence differences. However, owing to erroneous reads, nonexistent OTUs may be obtained, resulting in the common problem of overestimating ground truth OTUs. Such overestimation is a bottleneck in the overall microbiome analysis; hence, removing errors in reads before they are assigned to OTUs is a critical issue [35]. With this motivation, in some of our experiments below, we used the difference between the number of assigned OTUs and the ground truth number of OTUs as a proxy for denoising performance; the number of OTUs was determined using UCLUST [45] at identity threshold of 0.97 which is for species assignment.
We tested the performance of DUDE-Seq with the eight datasets used in [35], which are mixtures of 94 environmental clones library from eutrophic lake (Priest Pot) using primers 787f and 1492r. Dataset P1 had 90 clones that are mixed in two orders of magnitude difference while P2 had 23 clones that were mixed in equal proportions. In P3, P4, and P5 and P6, P7, and P8, there are 87 mock communities mixed in even and uneven proportions, respectively. In all datasets, both homopolymer and substitution errors exist, and the flowgram intensity values as well as the distributions are available [35]. Therefore, DUDE-Seq tries to correct both types of errors using the empirically obtained P and the flowgram intensity distributions {P(f|N)}.
We first show the effect of k on the performance of DUDE-Seq in Fig 5. The vertical axis shows the ratio between the number of OTUs assigned after denoising with DUDE-Seq and the ground truth number of OTUs for the P1, P2, and P8 dataset. The horizontal axis shows the k values used for correcting the substitution errors (i.e., for Algorithm 1), and color-coded curves were generated for different k values used for homopolymer-error correction (i.e., for Algorithm 2). As shown in the figure, correcting homopolymer errors (i.e., with k = 2 for Algorithm 2) always enhanceed the results in terms of the number of OTUs in comparison to correcting substitution errors alone (i.e., Algorithm 1 alone). We observe that k = 5 for Algorithm 1 and k = 2 for Algorithm 2 produce the best results in terms of the number of OTUs. Larger k value work better for substitution errors owing to the smaller alphabet size of the data, i.e., 4, compared to that of homopolymer errors, i.e., 10. Motivated by this result, we fixed the context sizes of substitution error correction and homopolymer error correction to k = 5 and k = 2, respectively, for all subsequent experiments.
In Fig 6(a), we report a more direct analysis of error correction performance. We compared the performance of DUDE-Seq with that of Coral [16], which is an MSA-based state-of-the-art scheme. It aligns multiple reads by exploiting the k-mer neighborhood of each base read and produces read-by-read correction results for pyrosequencing datasets, similar to DUDE-Seq. Furthermore, as a baseline, we also present ed the error rates for the original, uncorrected sequences (labeled 'Raw'). We did not include the results of AmpliconNoise [35], a state-ofthe-art scheme for 454 pyrosequencing data, in the performance comparison because it does not provide read-by-read correction results, making a fair comparison of the per-base error correction performance with DUDE-Seq difficult. We observeed that DUDE-Seq(1+2), which corrects both substitution errors and homopolymer errors, always outperforms Coral, and the relative error reductions of DUDE-Seq(1+2) with respect to 'Raw,' without any denoising, was up to 23.8%. Furthermore, the homopolymer error correction further drives down the error rates obtained by substitution-error correction alone; hence, DUDE-Seq(1+2) always outperforms DUDE-Seq(1).
In Fig 6(b), we compare the error correction performance of three schemes, Amplicon-Noise, Coral, and DUDE-Seq, in terms of the MoC. AmpliconNoise assumes a certain statistical model on the DNA sequence and runs an expectation-maximization algorithm for denoising. Here, the two clusterings in the comparison are the golden OTU clusterings and the clusterings returned by denoisers. We observe that for all eight datasets, the number of OTUs generated by DUDE-Seq is consistently closer to the ground truth, providing higher MoC values than those of the other two schemes.
Furthermore, Fig 6(c) compares the running time of the three schemes for the eight datasets. We can clearly see that DUDE-Seq is substantially faster than the other two. Particularly, we stress that the running time of DUDE-Seq, even when implemented and executed with a single CPU, is two orders of magnitude faster than that of parallelized AmpliconNoise, run on four powerful GPUs. We believe that this substantial boost over state-of-the-art schemes with respect to running time is a compelling reason for the adoption of DUDE-Seq in microbial community analysis.

Real data: Illumina sequencing
Illumina platforms, such as GAIIx, MiSeq, and HiSeq, are currently ubiquitous platforms in genome analysis. These platforms intrinsically generate paired-end reads (forward and reverse reads), due to the relatively short reads compared to those obtained by automated Sanger sequencing [46]. Merging the forward and reverse reads from paired-end sequencing yeilds elongated reads (e.g., 2 × 300 bp for MiSeq) that improve the performance of downstream pipelines [47].
Illumina platforms primarily inject substitution errors. A realistic error model is not the DMC, though, as the error rates of the Illumina tend to increase from the beginning to the end of reads. Thus, the assumptions under which the DUDE was originally developed do not exactly apply to the error model of Illumina. In our experiments with DUDE-Seq, however, we still used the empirically obtained DMC model P in Fig 4, which was computed by averaging all error rates throughout different Illumina platforms.
In our experiments, we used 13 real Illumina datasets (named Q19-Q31) reported previously [32], including sequencing results from four organisms (Anaerocellum thermophilum Z-1320 DSM 6725, Bacteroides thetaiotaomicron VPI-5482, Bacteroides vulgatus ATCC 8482, and Caldicellulosiruptor saccharolyticus DSM 8903) targeting two hypervariable regions, V3 and V4, using different configurations (see the caption for Table 1 and Fig 7 for details). To examine how the number of reads in a dataset affects denoising performance, we derived 10 subsets from the original datasets by randomly subsampling 10,000 to 100,000 reads in increments of 10,000 reads. In addition to Coral, we compared the performance of DUDE-Seq with that of BLESS [48], fermi [49], and Trowel [25], which are representative k-mer-based state-of-the-art tools. BLESS corrects "weak" k-mers that exist between consecutive "solid" k-mers, assuming that a weak k-mer has only one error. Fermi corrects sequencing errors in underrepresented kmers using a heuristic cost function based on quality scores and does not rely on a k-mer occurrence threshold. Trowel does not use a coverage threshold for its k-mer spectrum and iteratively boosts the quality values of bases after making corrections with k-mers that have high quality values. Fig 7 shows the per-base error rates, defined in Eq (6), for the tools under comparison using the first eight datasets (Q19-Q26) and their subsets created as described above (thus, a total of 80 datasets per tool). BLESS did not run successfully on these datasets, and hence its results are not shown. First, we can confirm that DUDE-Seq is effective in reducing substitution errors for data obtained using the Illumina platform in all tested cases of targeted amplicon sequencing, with relative error rate reductions of 6.40-49.92%, compared to the 'Raw' sequences. Furthermore, among the tools included in the comparison, DUDE-Seq produced the best results for the largest number of datasets. For Q24 and Q25, fermi was most effective, but was outperformed by DUDE-Seq in many other cases. Coral was able to denoise to some extent but was inferior to DUDE-Seq and fermi. Trowel gave unsatisfactory results in this experiment.
Before presenting our next results, we note that while the error rate defined in Eq (6) is widely used for DNA sequence denoising research as a performance measure, it occasionally misleading and cannot be used to fairly evaluate the performance of denoisers. This is because only errors at aligned bases are counted in the error rate calculation; hence, a poor denoiser may significantly reduce the number of aligned bases, potentially further corrupting the noisy  sequence, but it can have a low error rate calculated as in Eq (6). In our experiments with the datasets Q27-Q31, we detected a large variance in the number of aligned bases across different denoising tools; thus, it was difficult to make a fair comparison among the performance of different tools with Eq (6). We note that in the experiments presented in Figs 6(a) and 7, such a large variance was not detected. To alleviate this issue, we employ the alternative definition of the per-base error rate of a tool in Eq (8). Fig 8 shows the results obtained for 100,000-read subsets of each of the Q19-Q31 datasets, i.e., all datasets, for DUDE-Seq and the four alternative denoisers. Because the datasets Q27-Q31 had two subsets of 100,000 reads, we used a total of 18 datasets to draw Fig 8, one each from Q19-Q26 and two each from Q27-Q31. As mentioned previously, BLESS could not run successfully on Q19-Q26; hence, there are only 10 points for BLESS in the plots. Fig 8(a), 8(b) and 8(c) presents the distribution of gðê tool Þ, g(a tool ), and running times for each tool, respectively. For each distribution, the average value is marked with a solid circle. As shown in Fig 8  (b), we clearly see that Coral and Trowel show a large variance in the number of aligned bases. For example, Coral only aligns 30% of bases compared to the raw sequence after denoising for some datasets. With the effect of this variance in aligned bases adjusted, Fig 8(a) shows that DUDE-Seq produces the highest average gðê tool Þ, i.e., 19.79%, among all the compared tools. Furthermore, the variability of g(a tool ) was the smallest for DUDE-Seq, as shown in Fig 8(b), suggesting its robustness. Finally, in Fig 8(c), we observe that the running times were significantly shorter for DUDE-Seq and Trowel than for Coral and fermi. Overall, we can conclude that DUDE-Seq is the most robust tool, with a fast running time and the highest average accuracy after denoising.
In summary, we observe from Figs 7 and 8 that DUDE-Seq robustly outperforms the competing schemes for most of the datasets tested. We specifically emphasize that DUDE-Seq shows a strong performance, even though the DMC assumption does not hold for the sequencer. We believe that the better performance of DUDE-Seq relative to other state-of-theart algorithms (based on MSA or k-mer spectrums) on real Illumina datasets strongly demonstrates the competitiveness of DUDE-Seq as a general DNA sequence denoiser for targeted amplicon sequencing.

Experiments on simulated data
We performed more detailed experiments using Illumina simulators in order to further highlight the strong denoising performance of DUDE-Seq, including the effects on downstream analyses. Fig 9(a) shows the results obtained using the Grinder simulator [50] and a comparison with Coral. Trowel and Reptile require quality scores as input, which are provided by the GemSIM simulator, but not by the Grinder simulator; hence, we could not include Trowel and Reptile in Fig 9(a). We generated nine synthetic datasets of forward reads that had error rates at the end of the sequence varying from 0.2% to 1.0%, as denoted on the horizontal axis. For all cases, the error rate at the beginning of the sequence was 0.1%. We again used the average DMC model for the entire sequence for DUDE-Seq. Note that the error rates for the 'Raw' data, i.e., the red bars, match the average of the error rates at the beginning and the end of the sequence. From the figure, consistent with the real datasets analyzed in Section, we clearly see that DUDE-Seq significantly outperforms Coral for all tested error rates.
To evaluate the performance of DUDE-Seq for paired-end reads, we generated datasets, shown in Table 2, with the GemSIM sequencing data simulator [51]. As shown in the table, we used 23 public reference sequences [35] to generate the dataset A5 and a single reference sequence for S5. We used the error model v5 that has error rate s of 0.28% for forward reads and 0.34% for reverse reads. In Fig 9(b), in addition to DUDE-Seq, Coral, fermi, and Trowel, we included the results obtained using Reptile [20], another k-mer spectrum-based method that outputs read-by-read denoising results. We again observe from the figure that DUDE-Seq outperforms the alternatives by significant margins.
In Table 3, we show that the error-corrected reads produced by DUDE-Seq can also improve the performance of downstream pipelines, such as paired-end merging. We applied four different paired-end merging schemes, CASPER [52], COPE [53], FLASH [47], and PAN-DAseq [54], for the two datasets A5 and S5 in Table 2. The metrics are defined as usual. A true positive (TP) is defined as a merge with correct mismatching resolution in the overlap region, and a false positive (FP) is defined as a merge with incorrect mismatching resolution in the overlap region. Furthermore, a false negative (FN) is a merge that escapes the detection, and a true negative (TN) is defined as a correct prediction for reads that do not truly overlap. The accuracy and F1 score are computed based on the above metrics [55]. For each dataset, we compared the results for four cases: performing paired-end merging without any denoising, after correcting errors with Coral, after correcting errors with fermi, and after correcting errors with DUDE-Seq. Reptile and Trowel were not included in this experiment because they were generally outperformed by Coral and fermi, as shown in Fig 9(b). The accuracy and F1 score results show that correcting errors with DUDE-Seq consistently yields better paired-end merging performance, not only compared to the case with no denoising, but also compared to the

Discussion
Our experimental results show that DUDE-Seq can robustly outperform k-mer-based, MSAbased, and statistical error model-based schemes for both real-world datasets, such as 454 pyrosequencing and Illumina data, and simulated datasets, particularly for targeted amplicon sequencing. This performance advantage in denoising further allowed us to obtain improved results in downstream analysis tasks, such as OTU binning and paired-end merging. Furthermore, the time demand of DUDE-Seq-based OTU binning is order(s) of magnitude lower than that of the current state-of-the-art schemes. We also demonstrated the robustness and flexibility of DUDE-Seq by showing that a simple change in P matrix is enough to apply the exact same DUDE-Seq to data obtained using different sequencing platforms. In particular, we experimentally showed that even when the memoryless channel assumption does not hold, as in Illumina data, DUDE-Seq still solidly outperforms state-of-the-art schemes.
The sliding window nature of DUDE-Seq resemble s the popular k-mer-based schemes in the literature. However, while all existing k-mer-based schemes rely on heuristic threshold selection for determining errors in the reads, regardless of the error model of the sequencing platform, DUDE-Seq applies an analytic denoising rule that explicitly takes the error model P into account. Therefore, even for identical noisy reads z n , DUDE-Seq may result in different denoised sequences, depending on the P's of different sequencing platforms, whereas the kmer-based scheme will always result in the exact same denoised sequence. The performance gains reported in this paper compared to state-of-the-art baselines, including those for k-merbased schemes, substantiate the competitiveness of our method for targeted amplicon sequencing.
Another advantage of DUDE-Seq is its read-by-read error-correction capability. This feature is important for a number of bioinformatics tasks, including de novo sequencing, metagenomics, resequencing, targeted resequencing, and transcriptome sequencing, which typically require the extraction of subtle information from small variants in each read. In addition to the types of tasks presented in this paper (i.e., per-based error correction, OTU binning, and paired-end merging), we plan to apply DUDE-Seq to additional tasks, as mentioned above.
Additional venues for further investigation include the procedure for estimating the noise mechanism represented by P, which is currently empirically determined by aligning each read to the reference sequence and is therefore sensitive to read mapping and alignment. For more robust estimation, we may employ an expectation-maximization-based algorithm, as was recently proposed for estimating substitution emissions for the data obtained using nanopore technology [56]. Considering uncertainties in P may also be helpful; hence, it may be useful to investigate the relevance of the framework in [57]. Additionally, it will likely be fruitful to utilize the information in Phred quality scores to make decisions about noisy bases and to finetune the objective loss function in our approach. Using a lossy compressed version of the quality scores is one possible direction for boosting the inferential performance of some downstream applications, as shown in [58]. Furthermore, particularly for the homopolymer error correction, there are several hyperparameters whose choices can be experimented with in the future to potentially achieve substantial performance boosts. Examples include the choice of alphabet size (in lieu of the current value of 10), the choice of the loss function that may be proportional to the difference between the true and estimated value of N (in lieu of the current Hamming loss), and the choice of quantization (in lieu of Eq (4)). Moreover, we may apply the full generalized DUDE in [34] for homopolymer error correction to determine if better performance can be achieved at the cost of increased complexity. Applying DUDE-Seq to other types of sequencing technology with homopolymer errors (e.g., Ion Torrent) would also be possible as long as we can acquire flow (e.g., ionogram) density distributions to estimate Γ. Currently, there exists no public data repository that includes such information for Ion Torrent, and thus existing Ion Torrent denoisers often ignore homopolymer errors or rely on simplistic noise modeling and iterative updates that unrealistically limit the maximum length of homopolymer errors that can be handled, let alone computational efficiency [36]. Finally, we plan to test DUDE-Seq on several other sequencing platforms, such as PacBio and Oxford Nanopore, which tend to result in longer and more noisy sequences, to further substantiate the robustness and effectiveness of our algorithm. Applying the recently developed deep neural networks -based Neural DUDE algorithm [59] to DNA sequence denoising beyond targeted amplicon sequencing could be another fruitful direction.
Supporting information S1 File. Fig A, DUDE-Seq web interface. This is a screenshot of the website accompanying the proposed DUDE-Seq method (http://data.snu.ac.kr/pub/dude-seq). For users who prefer a graphical user interface, this website provides a web-based execution environments for DUDE-Seq. Through this screen, a user can specify the parameters for each of the two error types (in the figure, DUDE-Seq (1) stands for for the substitution error correction described in Algorithm 1 and DUDE-Seq (2) stands for the homopolymer error correction shown in Algorithm 2), and upload the input file of her choice. The DUDE-Seq process starts automatically by clicking the "SUBMIT" button. For advanced users who prefer batch processing, the source code of DUDE-Seq is also available at http://github.com/datasnu/dude-seq. All the used datasets are also available on the Sequence Read Archive (SRA) under the accession number SRP000570 (SRS002051-SRS002053) at https://www.ncbi.nlm.nih.gov/sra/SRP000570 and the European Nucleotide Archive (ENA) under the accession number PRJEB6244 (ERS671332-ERS671344) at http://www.ebi.ac.uk/ena/data/view/ PRJEB6244. Fig B, Website output: sequence complexity. The DUDE-Seq website provides analysis results from applying the DUST algorithm [60] and block-entropy to the outputs from denoising by DUDE-Seq. The DUST algorithm masks low-complexity regions that have highly biased distribution of nucleotides based on counting 3-mer frequencies in 64-base windows. The DUST score is computed based on how often different trinucleotides occur as follows: score ¼ X k i¼1 n i ðn i À 1Þðw À 2Þs 2ðl À 1Þl where k = 4 3 is the trinucleotide size, w = 64 is the window size, n i is the number of the words i in a window, l is the number of the possible words in a window, and s is the scaling factor. The score is scaled from 0 to 100 and a high score implies a low complexity metagenome. The block-entropy is calculated using Shannon's diversity index [61]. The block-entropy evaluates the entropy of the trinucleotides in a sequence as follows: where k = 4 3 is the trinucleotide size, n i is the number of the words i in a window, and l is the number of the possible words in a window. The entropy is also scaled from 0 to 100 and a low entropy implies a low complexity metagenome. Fig C, Website output: tag sequence probability. Another output from the DUDE-Seq website is the tag sequence probability of reads [62]. This is to reveal the existence of artifacts at the ends, i.e., adapter or barcode sequences at the 5'-or 3'-end. Fig D, Website output: sequence duplication. The accompanying website also carries out sequence duplication analysis based on the denoised outputs from DUDE-Seq, in order to reveal artificial duplicates. As shown in the figure, five types of duplication statistics [63] are provided: exact duplicates, 5' duplicates, 3' duplicates, exact duplicates with the reverse complement of another sequence, and 5' or 3' duplicates with the reverse complement of another sequence. (PDF)