NGS based haplotype assembly using matrix completion

We apply matrix completion methods for haplotype assembly from NGS reads to develop the new HapSVT, HapNuc, and HapOPT algorithms. This is performed by applying a mathematical model to convert the reads to an incomplete matrix and estimating unknown components. This process is followed by quantizing and decoding the completed matrix in order to estimate haplotypes. These algorithms are compared to the state-of-the-art algorithms using simulated data as well as the real fosmid data. It is shown that the SNP missing rate and the haplotype block length of the proposed HapOPT are better than those of HapCUT2 with comparable accuracy in terms of reconstruction rate and switch error rate. A program implementing the proposed algorithms in MATLAB is freely available at https://github.com/smajidian/HapMC.


Introduction
The Single Nucleotide Polymorphism (SNP) is a kind of genetic variation with a frequency greater than 1% in population. In diploid organisms, genomes are organized into pairs of chromosomes, a paternal and a maternal copy. The sequence of SNPs on each copy of a pair of chromosomes is called a haplotype. A genotype is the conflation of two haplotypes on the homologous chromosomes. An SNP is called homozygous, if a pair of alleles at this locus is made up of two identical nucleotides, and is heterozygous, otherwise.
From the evolutionary point of view, the SNP happens as a consequence of mutation. However, since the mutation rate is low, several mutations of a locus rarely occur. Thus, it is usual to assume that the majority of SNPs are bi-allelic, meaning that each SNP can be chosen from just two of the four possible nucleotides, i.e., A, T, C, and G [1]. Accordingly, in this work we similarly use this assumption. The haplotype is widely used in the Genome Wide Association Studies (GWAS), clinical genetics, linkage analysis, drug-design, and personalized medicine [2].
To extract a haplotype, one may use the following three approaches where the last two approaches are mathematical: 1. Applying high-cost experimental and expensive methods for every single individual which is of course not desirable [2]. a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 2. Haplotype phasing wherein the haplotypes are inferred from the genotypes of multiple individuals. As such, a method based on the maximum parsimony assumption [3] and statistical methods like SHAPEIT, developed based on the Hidden Markov Model [1,4] may be mentioned. Note that using this approach, the haplotype of an individual can not be found separately and also is challenged by the low-frequency and also de novo variants [2].
3. Estimating haplotypes from Next Generation Sequencing (NGS) reads i.e. nucleotide sequence of fragments. Using this approach, known as the haplotype assembly, haplotyping of a single individual becomes feasible. In this regard, HapCUT2 [5], HapTree [6], and HapSAT [7] are three famous methods developed based on probabilistic models. These methods are sensitive to the selected model and thus fragile to the model error.
A recent method for haplotype assembly is AltHap [8] which has shown accurate results compared to H-PoP [9], SCGD [10], and HapTree [6]. The H-PoP is a heuristic algorithm originated from the Balanced Optimal Partition (BOP) optimization model which benefits from the Minimum Error Correction (MEC) as well as the maximum fragments cut approaches [11]. The SDhaP [12] is also another heuristic method based on correlation clustering and non-convex optimization which does not guarantee reaching the global optimum.
The innovation of this article is threefold. First, the haplotype assembly is mathematically formulated based on matrix completion methods. Secondly, three new algorithms called the Haplotype assembly based on Singular Value Thresholding (HapSVT), Haplotype assembly based on Nuclear norm minimization (HapNuc), and Haplotype assembly based on OPT-SPACE (HapOPT) are proposed. Next, in the section of Results, these algorithms are compared to some benchmark methods in terms of the reconstruction rate and the switch error rate.

Model of haplotypes
To exploit the NGS reads as the raw data, a computational modeling is needed. For this purpose, similar to [10], we first convert the sequence of nucleotides which can be either reads or haplotypes into a sequence of numbers. The SNP nucleotides are converted to 1 and −1 for the wild and rare alleles, respectively. As an example, Table 1 depicts the alleles of the β 2 AR gene [3] for which the maternal and paternal haplotypes of an individual are shown by h m and h p , respectively. The corresponding codewords based on the above modeling are presented in the last column.
Next, assuming that each read has been aligned to the reference genome, the non-SNP sites of each read are omitted. Then, the reads are coded using the procedure described in Table 1, and are completed by adding zeros for the length of l as shown for 10 aligned reads in Table 2. As seen in this example, for the 1st row, we get {-1 1 1 0 0 0 0 0 0 0} with 3 sites of ±1 and 7 sites of zeros.
Without loss of generality, by representing the codewords of Table 2 by the vectors r i , i = 1, . . ., N, we form the read matrix R, where N is the number of reads. In fact, R is an incomplete matrix with the rank of 2 which consists of the maternal and paternal haplotypes in its Table 1. Haplotypes of β 2 AR genes and their corresponding codewords. rows. At this stage, we may utilize matrix completion methods to complete this low rank matrix. To do so, by estimating the zero entries of R, we obtain the completed matrix H which has the same dimension as R, i.e., N × l where l is the haplotype length. According to Table 2, these matrices are given by (1) and (2). 1 1 À 1 1 1 1 1 À 1 1 1 1 1 À 1 1 1 1 1 À 1 1 1 À 1 1 1 À 1 À 1 À 1 À 1 1 À 1 1 À 1 1 1 À 1 À 1 À 1 À 1 1 À 1 1 ð2Þ Table 2. Example of aligned reads for β 2 AR genes and the considered codewords.

Reads Nucleotides Codewords
From H, one can observe that only two of its rows are different and thus the desired haplotypes are given by These vectors can then be decoded to the sequence of nucleotides using the first row of Table 1. To the best of our knowledge, no algorithm has been reported to distinguish between the maternal and paternal haplotypes and therefore h p and h m may be interchanged with each other.
It should be noted that the above example is an error-free case to clarify the procedure of data modeling which can be trivially solved. For the erroneous case; which is the subject of our work, R is an incomplete version of H + N where N shows the noise matrix [8].

Proposed methods
We present three new algorithms for haplotype assembly whose general block diagram is illustrated in Fig 1. The goal is to estimate h p and h m from the noisy reads. The first two blocks have been explained before. In the third block, we receive an incomplete matrix R with a few known entries where the set of indices of known entries is given by O [10]. Then, we intend to estimate the unknown entries based on rank assumption. Mathematically, this is modeled by the following optimization problem: It is worth mentioning that here we have not only considered the case of all-heterozygous variants, but also included the case of both heterozygous and homozygous variants. This can be realized as a point of this work in comparison to some other methods that are restricted to heterozygous variants. In the all-heterozygous case, the two haplotypes will be the negative of each other, i.e., h p = −h m and thus the rank of H will be one (See (5)).
To solve (5), the nuclear norm minimization, Singular Value Thresholding (SVT), and OPTSPACE methods have already been reported [13], based on which we introduce three new algorithms called the HapSVT, HapNuc, and HapOPT.

Haplotype assembly based on Singular Value Thresholding (HapSVT)
To explain the proposed HapSVT algorithm, we first introduce the SVT which is based on Singular Value Decomposition (SVD) [14] defined for the read matrix R as where H denotes the hermitian operator, and U and V have orthonormal columns with the dimension of N × r and l × r, respectively. By applying the singular value shrinkage operator where It is worth noting that D τ (R) is the optimal value of the optimization problem where k�k F is the Frobenius norm and k�k� shows the nuclear norm as the summation of singular values.
To perform the matrix completion part as shown in Fig 1, we recursively use the SVT in two steps. In the first step, starting with the initial matrix Y 0 = R, the singular value shrinkage operator is used as Then, in the second step, the difference between the projected matrix X k and the initial matrix is compensated for the known entries using for k = 1, 2, . . ., where P O ð�Þ is an operator which keeps the entries of the matrix corresponding to O unchanged, and sets the other entries to zero. The iterations continue until the condition k P O ðX k À RÞk F < �k R k F is satisfied and the last X k is reported as the completed matrix H.
To extract h p and h m , we compute the reduced row echelon form of H and by using the first two pivot positions, two independent rows of H are obtained. Then, in order to acquire the paternal and maternal haplotypes the entries are quantized to 1 and −1. The procedures of the HapSVT algorithm is depicted in Algorithm 1.
input: N aligned reads output: Haplotypes h m , h p / � Read Matrix Preparation � / 1 Convert the sequences of nucleotides (reads) to the sequences of numbers. 2 Add zeros to each read to construct r i s with the length of l. 3 Construct the read matrix R (N × l).

Haplotype assembly based on Nuclear norm minimization (HapNuc)
A popular method for matrix completion is based on relaxing the non-convex rank function to a convex function. Since the number of nonzero singular values determines the rank of a matrix, an approximation of the rank function is defined by the summation of singular values, known as the nuclear norm [15]. In this way, the optimization problem is cast as This problem can be solved easily using the CVX, a MATLAB based package [16]. It has been shown that the nuclear norm minimization has strong mathematical guarantees to achieve the optimal solution [15,17,18]. To develop the new HapNuc algorithm, we substitute the SVT part of Algorithm 1 by nuclear norm minimization.

Haplotype assembly based on OPTSPACE (HapOPT)
Another method for matrix completion is known as OPTSPACE [19] in which unlike the two previous methods, we assume that the rank of the desired matrix H is known. The OPTSPACE consists of the following three steps: a) trimming, b) projection, and c) cleaning, as explained below. a) In the trimming step, those columns of R with the degrees larger than 2|O|/l are set to zero where |�| shows the cardinality of a set and l is the haplotype length. The degree of a column (or a row) shows the number of its known entries. This step is also performed for the rows of R with the degrees larger than 2|O|/N where N is the number of reads.
b) The trimmed R obtained from Step (a) is projected to the space of rank r matrices using where P r (S) = diag(σ 1 , . . .σ r ) and U and V are given by (6).
c) The cleaning step is performed by solving the following optimization problem, which contains two minimization parts. The inner part results in a function in terms of X and Y. To solve the outer minimization part, we use a gradient based recursive method whose initial matrices are computed from Step (b), i.e., X 0 = U and Y 0 = V. Then, this recursive method leads to the optimal solution H ¼ X opt S opt Y H opt . To finalize the third new Hap-OPT algorithm, we should substitute the SVT part of Algorithm 1 by the above three steps.

Results
Using extensive simulations, we compare the performance of the proposed HapSVT, HapNuc, and HapOPT algorithms with that of the three recent benchmark algorithms AltHap [8], Hap-CUT2 [5], and SDhaP [12]. It has already been shown that these algorithms outperform some other algorithms like RefHap [20], SCGD [10], HapTree [6], and H-PoP [9]. For comparison purposes, a well-known criterion is the reconstruction rate defined as [21] rr whereĥ p andĥ m are the reconstructed haplotypes which are compared to the known maternal and paternal haplotypes, h m and h p . Moreover, HDð�; �Þ is the augmented hamming distance between two vectors which counts the number of non-identical sites using where Dð�; �Þ is defined as To consider another criterion for performance evaluation, we make use of the SWitch Error Rate (SWER), defined as the number of switches divided by the haplotype length [22]. A switch happens when the parental origin of an allele with respect to that of the previous allele differs from one parent to another. For example, by considering h p = [1, 1, 1, 1] and h m = [−1, −1, −1, −1] as the grand truth haplotypes and the estimated haplotypes asĥ p ¼ ½1; 1; À 1; À 1� andĥ m ¼ ½À 1; À 1; 1; 1�, one switch has been occurred.

Simulated data
First, we use the simulated data [21] generated based on real human haplotypes in the HapMap project. This dataset; which contains different read matrices with various error rates and coverage values originated from different haplotype lengths, has vastly been used in previous studies [10,23,24]. We choose the longest available haplotype from the dataset with the length of l = 700. The coverage value of the NGS paired-end reads varies from c = 3 to its greatest value c = 10. The average number of reads are N = 561, 936, and 1873 for coverage values of c = 3, 5, and 10, respectively. The number of SNPs covered in each read is a constant value equal to 7.4. Also, 10% (and 20%) of the entries of the read matrix are contaminated by noise with uniform distribution. The results are averaged over 100 independent trials of the experiment. Table 3 shows the reconstruction rates for different coverage values and error rates. The corresponding SWERs are also depicted in Table 4. In this case, HapCUT2 is not examined, since it needs the Variant Call Format (VCF) file which is not available for this simulated dataset [21]. As seen in both Tables 3 and 4, the proposed HapOPT algorithm outperforms the others in terms of the reconstruction rate as well as the SWER. It is worth reminding that the SDhaP solves a non-convex optimization problem using a heuristic technique with the gradient descent algorithm which does not guarantee reaching the global optimum. Furthermore, as a consequence of increasing the coverage value, a better performance is achieved by a lower SWER and a higher reconstruction rate.

Real fosmid data
We evaluate the proposed algorithms on the sequence data of the individual NA12878 fabricated based on a fosmid approach [20]. The coverage of this data set is c = 3 and the average read length is 40 kb, and hence, is a low-coverage and long-read dataset. For evaluation purposes, we consider the trio-phased haplotype from the GATK resource bundle, as the grand truth containing 1.3 million heterozygous variants in common with fosmid dataset [22,25]. This dataset has already been used in several studies [5,8,22].
In the simulated dataset used in the last section, each read overlaps at least one another read, while for the real data these overlaps do not necessarily occur. In this situation, our algorithm incorporates the overlaps for haplotype estimation, and as a result, the output of each algorithm is some disjoint parts of the whole haplotype, called haplotype blocks. To evaluate a common length for these blocks, we consider their mean and also the AN50 defined as the median of blocks lengths in base pairs weighted by a proportion of correctly estimated alleles [6]. Also, we define the SNP Missing Rate (SMR) for each chromosome as the ratio of the number of missing SNPs in the estimates and the haplotype length [26]. The results on the real fosmid data are shown in Table 5. One can see that both HapOPT and AltHap algorithms achieve lower SNP missing rates in comparison to HapCUT2 and SDhaP. Moreover, HapOPT and AltHap have a better span in terms of AN50.
To assess the accuracy of different algorithms, the corresponding reconstruction rates [5,22] are presented in Fig 2. Moreover, we have considered both short and long SWERs [5,22]. By a long switch, we mean that the parental origin does not change for at least two SNPs and if two switches occur one after each other, we consider it as a short switch. These two metrics are reported on real fosmid data in Figs 3 and 4.
From the above results, one can observe that HapOPT outperforms SDhaP and AltHap in terms of the reconstruction rate as well as long and short SWERs with a reasonable runtime as reported in Table 6. Note that although, HapCUT2 achieves the best accuracy, still its SNP missing rate is greater than that of HapOPT. These results on the whole show that HapOPT is a promising tool for haplotype assembly with the best SNP missing rate and a good accuracy in terms of reconstruction rate and SWER.

Conclusion
We have exploited matrix completion methods including SVT, nuclear norm minimization, and OPTSPACE for haplotype estimation. This was led to developing the new HapSVT, Hap-Nuc, and HapOPT algorithms. Our experimental comparison on simulated data revealed that HapOPT is more accurate than SDhaP and AltHap in terms of reconstruction rate and switch error rate. Also, the results on real noisy fosmid data showed that the accuracy of HapOPT is better than that of SDhaP and AltHap and also is comparable to that of HapCUT2 in terms of the reconstruction rate and the short and long SWERs. Moreover, it was shown that HapOPT    outperforms the recently addressed algorithms, HapCUT2 and SDhaP, in terms of the mean, SNP missing rate, and AN50 of the haplotype block length. Furthermore, the proposed algorithm is not restricted to the heterozygous assumption, as commonly considered in peer algorithms. On the whole, we can conclude that using the proposed HapOPT, the haplotype is reconstructed more completely and continuously with acceptable accuracy. Also, the proposed optimization problem is capable of estimating haplotypes for different ploidy levels. Our research direction for future is to work on polyploids.