Minimum error correction-based haplotype assembly: Considerations for long read data

The single nucleotide polymorphism (SNP) is the most widely studied type of genetic variation. A haplotype is defined as the sequence of alleles at SNP sites on each haploid chromosome. Haplotype information is essential in unravelling the genome-phenotype association. Haplotype assembly is a well-known approach for reconstructing haplotypes, exploiting reads generated by DNA sequencing devices. The Minimum Error Correction (MEC) metric is often used for reconstruction of haplotypes from reads. However, problems with the MEC metric have been reported. Here, we investigate the MEC approach to demonstrate that it may result in incorrectly reconstructed haplotypes for devices that produce error-prone long reads. Specifically, we evaluate this approach for devices developed by Illumina, Pacific BioSciences and Oxford Nanopore Technologies. We show that imprecise haplotypes may be reconstructed with a lower MEC than that of the exact haplotype. The performance of MEC is explored for different coverage levels and error rates of data. Our simulation results reveal that in order to avoid incorrect MEC-based haplotypes, a coverage of 25 is needed for reads generated by Pacific BioSciences RS systems.


Introduction
Among the various types of genetic variations, single nucleotide polymorphisms (SNPs) are the most widely studied among others in genome wide association studies (GWAS). The genome of diploids like humans consists of two homologous pairs: the paternal and maternal chromosomes. A haplotype, the sequence of alleles at SNP sites on each homologous chromosome, can be measured through direct experiments or can be reconstructed by computational approaches [1,2]. Due to the high cost of experimental methods, the computational techniques have attracted more attention. These techniques can be categorized as phasing or assembly approaches. Phasing makes use of the genotypes of multiple individuals to infer the haplotype. In the haplotype assembly approach, sets of reads generated by DNA sequencing devices are exploited for haplotype reconstruction. While haplotype assembly can be performed for a single individual, phasing cannot. Moreover, phasing is difficult in the presence of low-frequency and de novo variants. a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 The history of DNA sequencing technologies consists of three generations. Firstly, the lowthroughput Sanger sequencing machines were built in the late 1980s, thanks to the invention of the chain termination procedure. Subsequently, multiplexing strategies were used for the development of the so-called second generation technologies of the early 2000s. Today, Illumina is the dominant platform of this second generation, providing massively high throughput, up to billions of reads, with a length of a few hundred bases and an error probability lower than 0.001 [3]. Utilizing such short reads incurs limitations, precluding assembly of repetitive regions and detection of structural variants larger than read length. The third-generation of sequencing technology, namely single-molecule sequencing as provided by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), produces exceptionally long reads of up to a million bases. The bottleneck of this third-generation technology is the low per-base accuracy in comparison to that of the second generation, such that the error probability may exceed 0.1 [4]. Both second and third generation sequencing technologies have been used for haplotype assembly.
Although the sequencing reads provided by all above-mentioned technologies do not keep track of the haplotypic origin of reads, a haplotype assembly algorithm tends to reconstruct the haplotypes using overlaps among reads. In the absence of sequencing errors, this is a trivial problem to solve. A simple bipartitioning scheme can be used to divide reads into two groups corresponding to two haplotypes, such that those reads in each group do not conflict. But in real cases, the presence of errors makes the problem computationally hard to solve. Several criteria have been proposed in the literature, including minimum fragment removal (MFR), minimum SNP removal (MSR) and minimum error correction (MEC) [5]. The idea behind MFR is to find the minimum number of reads containing errors, which should then be removed. The heuristic algorithms for solving this model are time-consuming and not suitable for low coverage input data. In MSR-based algorithms, several SNP positions are removed to make haplotyping possible. Thus, the haplotypes contain some gaps, leading to a high rate of missing SNPs, which is undesired.
The dominant objective function utilized for the haplotype assembly problem is the MEC, also known as the minimum letter flip [2]. This function is also used in evaluating the performance of different haplotype reconstruction algorithms [6,7]. Minimizing the MEC function can be rewritten as a MAXCUT problem, which is NP-hard, leading to a large number of heuristic algorithms [5]. Some examples include the HapCUT algorithm (which iteratively computes max-cuts of a read graph [8]), a branch-and-bound genetic algorithm approach [9], an integer linear programming approach [10] and a clustering approach [11], as well as multiple dynamic programming approaches [12][13][14][15][16].
Despite the existence of all these methods utilizing the MEC for haplotype reconstruction, it is crucial to note that this criterion may fail to identify the exact haplotype when there is a high error rate in the reads [9,17]. In addition, a negative correlation between the haplotype accuracy and the MEC has already been reported in [18], as discussed in the Results section. While this issue has been mentioned briefly in previous studies, it has never been systematically investigated in an effort to understand the implications across different sequencing platforms.
In this work, we provide insight into the MEC function to clarify the above ambiguities. The following section presents the fragment matrix model, defines MEC and introduces two theorems regarding MEC performance. The performance curve for MEC is introduced and discussed in the Results section. Furthermore, several DNA sequencing devices are evaluated based on their characteristics, including the error probability values. Finally, simulations of long and short reads are provided to explore practical consequences.

Methods
For diploids, haplotype assembly is the process of reconstructing two haplotypes from overlapped aligned reads. Throughout this paper, we only consider bi-allelic SNPs-that is, SNPs with only one alternative allele against the reference allele [8,12]. Below we describe the construction of the fragment matrix. Prior to this construction, we remove those reads that cover less than two SNP sites, because these are not informative for haplotype assembly. Non-SNP bases of each read are also omitted.

Fragment matrix model
We assume that there are N reads obtained from both chromosomes. For a haplotype with the length of l, an N × l fragment matrix R is constructed whose rows embed the reads and whose columns correspond to the heterozygous SNP sites [19,20]. The SNP sites not covered by the reads are coded with zero. Then, bases of reads are converted to −1 (alternative allele) or 1 (reference allele), assuming bi-allelic SNPs.
As an example of an error-free case, consider the first exon of HLA-A, a gene on chromosome 6 -with NCBI reference sequence number NG_029217.2. Its first 40 bases are presented in Fig 1a. It contains five bi-allelic SNP sites (refSNP): C/T (rs753601428), C/G (rs529070997), G/T (rs41560714), A/C (rs551138783) and A/G (rs778615037). The procedure of constructing the fragment matrix is depicted in Fig 1d. In this example, the exact haplotypes that should be reconstructed by the haplotype assembly algorithms are {CGTAG} and {TCGCA}.
The fragment matrix R can be modeled using a matrix completion approach [19,20]. In the error-free case, R is a partially observed matrix modelled as where M is the completed version of matrix R (see section B of S1 Appendix for more details).

PLOS ONE
P O is the observation operator defined as in which O is the set of indices of known entries. In order to generalize the model to the more realistic case allowing erroneous entries, we use an additive measurement error model inspired by [11,19,20]: To define the error matrix E, we should first clarify what we mean by an error. A substitution error is the conversion of a DNA base to one of the other three possible bases during the sequencing procedure. As mentioned earlier, during fragment matrix construction, only two bases (reference and alternative alleles) for each SNP site are permitted and other possible bases are ignored; as a result, a substitution to the ignored bases does not affect the entries of the fragment matrix. Accordingly, we introduce the term bi-allelic substitution, or simply bisubstitution to make it distinguishable from generally defined substitution. A bi-substitution error occurs when a reference allele is converted to the alternative allele or vice versa. Consequently, an error in the entries of P O (M) is simplified as a change from −1 to 1 or vice versa. This can be formulated as an addition of 2 (or −2) to each erroneous entry of P O (M) which is represented in error matrix E. We assumed that each non-zero entry of R is erroneous with a probability of p e , the bi-substitution error probability, independent of the other entries. This value equals one third of the substitution error probability of the sequencing device p s .

MEC definition
If the reads contain no errors, the corresponding rows of fragment matrix are compatible with each other and haplotypes are extracted using a simple clustering technique. However, in practice, sequencing devices may produce erroneous reads due to which the compatibility of reads is lost. To cope with this problem, the MEC approach is employed by inverting the sign of some entries of the fragment matrix to make it compatible [9]: 1. Find the minimum number of entries of R that should be inverted to make the fragment matrix compatible.
2. Cluster the rows of the augmented fragment matrix and reconstruct the haplotype.
For fragment matrix R of dimension N × l and candidate haplotype vector h c of length l, the MEC function is calculated as in which r i is the i th row of R and the extended Hamming distance (EHD) is defined as Dða; bÞ ¼ P l k¼1 dðaðkÞ; bðkÞÞ [8,10]. Furthermore, d(�, �) is a mismatch indicator which penalizes its dissimilar arguments by one: Therefore, the EHD function represents the number of mismatches between two vectors. From this point of view, MEC(R, h c ) indicates the whole number of mismatches between each row of R and the vector h c . It is notable that the function D(�, �) is not a distance from the mathematical point of view [21], though it is named as such (See sections A and C of S1 Appendix).

Analysis of MEC performance
Consider h opt as an optimal solution resulting from a given method by minimizing the MEC function. The question arises: does minimizing this function guarantee reaching the exact haplotypes (i.e., the true haplotypes of the individual)? In Theorem 1, we demonstrate not only that this solution offers no guarantee of finding the exact haplotype, but also that the MEC function will not lead to the exact haplotype. Theorem 1. There exists a vector h d different from the exact haplotype h ex with a lower MEC, when the k th column of the fragment matrix, R, contains some erroneous entries whose number E (k) is greater than half of its coverage. In a mathematical expression: where c (k) is the coverage (or the read depth) of the k th SNP site. The coverage indicates the number of reads that covers the SNP and is equal to the number of known entries of the k th column of R. We conclude that the ratio E (k) /c (k) , called the bi-substitution rate, plays a key role in the evaluation of a sequencing device. From a practical perspective, E (k) , the number of nonzero values of the k th column of E, represents the number of bi-substitutions at the corresponding genomic position (see section Fragment matrix model). The proof of Theorem 1 is presented in section B of S1 Appendix. The core idea of the proof is to consider h d equal to h ex except in its k th entry, whose sign is inverted. This guarantees a lower MEC. Note that if the antecedent is not satisfied, the MEC approach works properly. In practice, fulfilling the antecedent of Theorem 1 is a major point to be investigated further. To explore this point, Theorem 2 presents the probability of the antecedent not occurring. Theorem 2. The probability of obtaining a minimum MEC value for the exactly correct haplotype (P{c-MEC}) is equal to in which p e is the bi-substitution error probability. Proof: According to Theorem 1, the MEC approach works properly when the number of erroneous entries of each column is lower than half of its corresponding coverage. Based on the above assumption, the number of erroneous entries of each column of R is independent of the other columns. Then, we have: An erroneous entry gets the opposite sign due to the bi-allelic assumption. This follows a Bernoulli distribution of ±1 with the probability of error p e . Thus, the number of errors in the j th column follows a Binomial distribution given by P E ðjÞ ¼ k f g ¼ c ðjÞ k � � p k e ð1 À p e Þ c ðjÞ À k .

PLOS ONE
Minimum error correction-based haplotype assembly: Considerations for long read data Therefore, we can write: Accordingly, using (8) and (9) the proof of Theorem 2 is complete.

Performance curves of MEC
The outcome of Theorem 2 is calculated for various scenarios with different probabilities of error and coverage levels. This is done by introducing performance curves for MEC. The yaxis indicates the probability of obtaining a correct MEC P{c-MEC} and the x-axis the bi-substitution error probability p e . In practice, the average coverage of input data provided for haplotype assembly varies from very low to very high levels. Based on the existing literature on coverage distribution among different genomic positions [19, 22, 23], we consider two different distributions, including Poisson and quasi-uniform (i.e., the analogue of the uniform distribution defined for a discrete random variable), as well as constant coverage levels. The error probability of various datasets may also differ dramatically due to the specifications of the DNA sequencer.
In Fig 2a, it is seen that P{c-MEC} is inversely proportional to the sequencing error probability p e . Additionally, depending on the coverage distribution, each P{c-MEC} begins to drop after a particular threshold. For example, for the Poisson distribution with mean λ = 10 and l = 1k, this threshold is p e = 2%. In this case, the MEC approach is unable to reconstruct the exact haplotype for p e > 2%. This problem arises when the number of errors in column is more than half of its coverage, as expressed in Theorem 1. The existence of such a column is more likely as the error probability increases. Fig 2b presents our investigation on the effect of the haplotype length on P{c-MEC}. It demonstrates that a higher haplotype length l leads to incorrect haplotypes at a lower bi-substitution error probability p e .

Evaluation of sequencing technologies: Theory
Here, we analyze the MEC for different DNA sequencing devices based on our reasoning. Table 1 presents the results of the evaluation of different devices launched by Illumina, PacBio and ONT. For each device, the evaluation employs the typical number of reads per run, the read length and error probability as reported in literature [4,[24][25][26]. In order to provide a fair comparison, we set the coverage value at 10. To calculate the number of runs needed (denoted by n) for such coverage, we used the averaged coverage formula, the Lander-Waterman equation, as following: where l r , N t and G show the read length, the total number of reads per run and the human genome length, respectively. The applicability of the MEC approach for data generated by each device is reported in the last column of Table 1, based on the value of P{c-MEC}. This shows that the MEC criterion

PLOS ONE
works well for short reads produced by Illumina devices, but not for long reads produced by PacBio or ONT. A larger value of n corresponds to a higher sequencing cost for each device. It should be noted that for each run, long-read devices are far more expensive than short-read devices.

Evaluation of sequencing technologies: Simulations
We run various simulations to provide a deeper understanding of MEC-based haplotype assembly. First, using DNA sequencing data, we estimate how often MEC failures can occur based on Theorem 1. The accuracy of the reconstructed haplotype is also investigated in terms of switch error rate and haplotype block length.
On the satisfaction of Theorem 1. Here, we inspect the effect of short and long sequencing reads along with their corresponding error profiles for the satisfaction of antecedent of Theorem 1. To do so, we use the bi-substitution rate defined in the Methods section.
We briefly present the details of our simulations. We consider the 21st chromosome of the human genome (GRCh38) [27] as the reference DNA sequence. Bi-allelic SNPs are introduced at a rate of one in a thousand bases [28] across the mentioned reference using haplo-generator, part of the haplosim package [29]. For generating PacBio long reads, we use the PBSIM package [26] in which the PacBio error profile is used. Then, we align the reads using minimap2 [30]. We run the ART package [31] for generating short paired-end reads and Burrows-Wheeler Aligner (BWA) [32] for aligning them. We sort the aligned reads using the samtools package [33]. Afterwards, using the mpileup subprogram of samtools [33], alleles for each position are extracted from the sorted aligned reads. Then, the required statistics for all introduced SNPs are calculated.
For both Illumina reads and PacBio long reads, the number of SNPs with a bi-substitution rate of greater than or equal to 0.5 are depicted in Fig 3. In S1 and S2 Figs, we depict the histogram of bi-substitution rates of SNP sites. For coverage values up to 25 for PacBio data, there are some positions in which the bi-substitution rate is greater than 0.5. This leads to the satisfaction of the antecedent of Theorem 1 and thus MEC failure. When we set the coverage greater than or equal to c = 30, no SNP site with high bi-substitution rate remains.
Haplotype reconstruction accuracy. We now examine the direct effect of coverage on the accuracy of the reconstructed haplotype. We utilize the well-known HapCUT algorithm as a MEC-based haplotype assembly method.
The output of HapCUT consists of haplotype blocks, whose continuity can be evaluated by calculating the average block length. Larger haplotype blocks, indicating that haplotypes are reconstructed more continuously, are of interest. To evaluate the accuracy of the reconstructed haplotype, we calculate the switch error rate by dividing the number of switch errors by the  Fig 4a and 4b, respectively. The

PLOS ONE
results are provided for 20 independently generated datasets. As seen in both figures, by increasing the coverage, the accuracy and continuity of the reconstructed haplotype increases. For a dataset with low coverage, specifically lower than 25 per haploid, not only are there many switches but the reconstructed haplotype is highly fragmented as well. This corroborates the findings in Fig 3.

Discussion
The issue addressed in this paper has been recognized previously by Duitama et al. [18], who note that a candidate haplotype with lower MEC is associated with lower reconstruction accuracy. This result can be predicted from the model we described.
It should be noted that, while we assume errors to be an independent and identically distributed (iid), in reality this may not hold true, although this assumption has been used before widely [6,34,35]. Though PacBio reads have no systematic error, errors in alignment and variant calling may exist due to high numbers of insertions and deletions. Acquiring comprehensive error models for all sequencing technologies is a difficult task and exploiting them in our model would make the derivation unfeasible. Therefore, we used an approach that is simplified yet close to reality.
However the focus of this paper is on diploid, we present the MEC formula for polyploids and we show that MEC failure may also happen in a specific polyploid case in section D of S1 Appendix.

Conclusion
We investigated the reliability of the MEC approach for haplotype assembly. We demonstrate that in some practical circumstances, an imprecise haplotype may be reconstructed with a lower MEC than that of the exact haplotype. The theoretical MEC performance curves were obtained for different coverage values and error rates. Based on our analyses, we evaluated some DNA sequencing devices by the MEC criterion. It was found that this approach can generate misleading results for low-coverage error-prone long reads generated by Pacific BioSciences and Oxford Nanopore Technologies platforms. In order to address this issue, one should exploit a high coverage for long reads. The results provided in this study suggest that using MEC-based haplotype assembly methods on available long reads, reconstruction of the true haplotypes is not feasible for coverage lower than 25 per haploid (i.e., 50 overall). An important future direction for this work is to do a thorough research on the extent of the issues with MEC for the polyploid genome.