A Class Representative Model for Pure Parsimony Haplotyping under Uncertain Data

The Pure Parsimony Haplotyping (PPH) problem is a NP-hard combinatorial optimization problem that consists of finding the minimum number of haplotypes necessary to explain a given set of genotypes. PPH has attracted more and more attention in recent years due to its importance in analysis of many fine-scale genetic data. Its application fields range from mapping complex disease genes to inferring population histories, passing through designing drugs, functional genomics and pharmacogenetics. In this article we investigate, for the first time, a recent version of PPH called the Pure Parsimony Haplotype problem under Uncertain Data (PPH-UD). This version mainly arises when the input genotypes are not accurate, i.e., when some single nucleotide polymorphisms are missing or affected by errors. We propose an exact approach to solution of PPH-UD based on an extended version of Catanzaro et al. [1] class representative model for PPH, currently the state-of-the-art integer programming model for PPH. The model is efficient, accurate, compact, polynomial-sized, easy to implement, solvable with any solver for mixed integer programming, and usable in all those cases for which the parsimony criterion is well suited for haplotype estimation.


Introduction
The human genome is divided in 23 pairs of chromosomes thereof, one copy is inherited from the father and the other from the mother. When a nucleotide site of a specific chromosome region shows a variability within a population of individuals then it is called Single Nucleotide Polymorphism (SNP). Specifically, a site is considered a SNP if for a minority of the population a certain nucleotide is observed (called the least frequent allele) while for the rest of the population a different nucleotide is observed (the most frequent allele) [2]. The least frequent allele, or mutant type, is generally encoded as '1', as opposed to the most frequent allele, or wild type, generally encoded as '0' [3]. A haplotype is a set of alleles, or more formally, a string of length p over an alphabet S~f0,1g [4]. Haplotypes represent a fundamental source of information for disease association studies. In fact, over 90% of sequence variation among individuals is due to common variant sites, most of which arose from single historical mutation events on the ancestral chromosome [5]. Hence, in a group of people affected by a disease, the SNPs causing or associated with the disease will be enriched in frequency compared with the corresponding frequencies in a group of unaffected individuals. This observation was of considerable assistance, for example, in the identification of the genes responsible for type 1 diabetes [6][7][8], type 2 diabetes [9,10], Alzheimer's disease [11], deep vein thrombosis [12], inflammatory bowel disease [13][14][15], hypertriglyceridaemia [16], schizophrenia [17], asthma [18], stroke [19], myocardial infarction [20], cystic fibrosis and diastrophic dysplasia [21,22].
Extracting haplotypes from a population of individuals is not an easy task. In fact, the current molecular sequencing techniques only provide information about the conflation of the paternal and maternal haplotypes of an individual (also called genotype) rather than haplotypes themselves [23]. When the family-based genetic information of a population is available, haplotypes can be retrieved experimentally [24]. However, the experimental approach is generally laborious, cost-prohibitive, requires advanced molecular isolation strategies [25], and sometimes not even possible [26]. In absence of a family-based genetic information, a valid alternative to the experimental approach is provided by computational methods which estimate, by means of specific criteria, haplotypes from the set of genotypes extracted from a population of individuals.
A genotype can be formally defined as a string of length p over an alphabet S~f0,1,2g, where the symbols '0' or '2' denote homozygous sites (of wild and mutant type, respectively) and the symbol '1' denotes heterozygous sites. As an example, the sequence S0,2,1T encodes a genotype in which: the first SNP is homozygous of wild type; the second SNP is homozygous of mutant type; and finally the third SNP is heterozygous. A genotype is said to be degenerate if it does not contain '1's. A genotype g k is said to be resolved from a pair of haplotypes fh i ,h j g, in symbols g k~hi +h j , if the p-th entry of g k , denoted as g kp , is equal to the sum of the p-th entries of h i and h j , denoted as h ip and h jp , respectively. For example, the genotype g k~S 1,2,1,0T is resolved from h i~S 0,1,1,0T and h j~S 1,1,0,0T. Haplotyping a set of genotypes G means finding the set of haplotypes resolving G.
It is worth noting that, given a genotype and denoted n as the number of its heterozygous sites, there exist 2 n{1 possible haplotypes that may resolve it [26]. As an example, genotype S0,2,1,1T may be resolved from either the pair of haplotypes fS0,1,0,0T,S0,1,1,1Tg or the pair fS0,1,1,0T,S0,1,0,1Tg. This fact involves as a necessary consequence the use of a criterion to select pairs of haplotypes among plausible alternatives. Gusfield [27] and Wang and Xu [28] observed that the number of distinct haplotypes existing in a large population of individuals is generally much smaller than the overall number of distinct genotypes observed in that population. This insight has suggested that, for low-rate recombination genes at least, the criterion of minimizing the overall number of haplotypes necessary to resolve a set of genotypes may have good chances to recover the biological haplotype set. This criterion, formally introduced by Gusfield [27], is known as the pure parsimony criterion of haplotype estimation and was of considerable assistance, for example, in the identification of the genes responsible for psoriasis and severe alopecia areata [2]. Haplotyping a set of genotypes under the parsimony criterion involves solving an optimization problem, called the Pure Parsimony Haplotyping (PPH) problem, that can be stated as follows: Problem. The Pure Parsimony Haplotyping (PPH) problem Given a set G of m non-degenerating genotypes, having s SNPs each, find the minimum set H of haplotypes such that for each genotype g k [ G there exists a pair of haplotypes fh i ,h j g [ H resolving g k .
As an example, an instance of PPH and the corresponding solution is shown in Table 1.
PPH is known to be polynomially solvable when each genotype has at most two heterozygous sites [29], and N P-hard when each genotype has at least three heterozygous sites [26].
Recently, Brown and Harrower [30] introduced an interesting version of PPH called the Pure Parsimony Haplotype problem under Uncertain Data (PPH-UD). This version mainly arises when the input genotype set G is not accurate, i.e., when some SNPs are missing or affected by errors, a situation that often occurs in practice. In this case, the input of the problem may include also a binary matrix B, called the error mask matrix, whose generic entry b kp is equal to 1 if the p-th SNP of genotype g k is uncertain (i.e., missing or affected by an error), and 0 otherwise. When a given SNP is uncertain its actual value could significantly deviate from its true value. For example, the true value of a wild type homozygous SNP affected by uncertainty could be homozygous of mutant type or even heterozygous. Similarly, the true value of a heterozygous SNP affected by uncertainty could be homozygous of wild or mutant type. The presence of uncertain data modifies the standard definition of resolution for a genotype. Specifically, Brown and Harrower [30] stated that when uncertainty occurs in the input data a genotype g k is resolved by a pair of haplotypes fh i ,h j g if g kp~hip zh jp zb kp e kp , for all SNPs p, being e kp a integer variables assuming values in the set E~f{2,{1,0,1,2g. Brown and Harrower [30] described an integer programming model able to tackle instances of PPH affected by uncertain data. Unfortunately, the authors did not offer experimental evidence of the performances of their model due to its unpractical runtimes. In this article we address this critical issue by introducing a possible integer linear programming model to solve exactly instances of PPH-UD. The model is based on an extension of Catanzaro et al. [1] Class Representative Model (CRM), currently one of the best integer programming model for PPH described in the literature. The model that we propose is efficient, compact, polynomial-sized, easy to implement, solvable with any solver for mixed integer programming, and usable in all those cases for which the parsimony criterion is well suited for haplotype estimation.

Methods
As shown in Catanzaro et al. [1], any feasible solution of PPH induces a family of subsets of genotype such that: (i) each subset represents one unique haplotype with elements in the subset being genotypes carrying the haplotype, (ii) each genotype belongs to exactly two subsets, and (iii) every pair of subsets intersects in at most one genotype. This principle can be exploited also when dealing with PPH-UD. Specifically, let associate an index to each subset S of genotypes induced by a haplotype h. If i is the smallest index of a genotype belonging to S, then i is the index associated to S and the subset will be denoted as S i . Since each genotype k belongs to exactly two subsets (as it must be explained by exactly two haplotypes) it may happen that k is itself the genotype with smallest index in both subsets. In this case a dummy genotype k' is added, and the subset S k' is created. As an example, one can imagine that the haplotype h 1 induces the subset S i~f g i ,g j ,g k , . . .g, h 2 induces the subset S i'~f g i ,g l ,g r ,g s , . . .g, h 3 induces the subset S k~f g k ,g l ,g s ,g t , . . .g, and so on. We remark that the index k' can be considered only if k was previously used, i.e., if the subset S k already exists.
Since at most 2m haplotypes are necessary to resolve m genotypes [26], then the indices i of the subsets S i can vary inside the index set Q~K|K', where K~f1, . . . ,mg and K'~f1', . . . ,m'g. Assume that an order is defined on Q in such a way that 1v1'v2v2'v . . . vmvm'. Define x i , Vi [ Q, as a decision variable equal to 1 if, in the solution, there exists a haplotype inducing a subset S i of genotypes whose smallest index genotype is g i , and 0 otherwise. Denote y k ij , Vk [ K,Vi,j [ Q, as a decision variable equal to 1 if genotype k belongs to the subsets S i and S j , and 0 otherwise. Denote P as the set of the input SNPs and z ip , Vi [ Q, p [ P, as a decision variable equal to 1 if the haplotype inducing the subset S i of genotypes has such a value at p-th site, and 0 otherwise. Variables z ip describe explicitly the haplotypes of the solution.
For every non-null entry of the error mask matrix B denote e c kp as a decision variable accounting for the difference between the value of g kp and the true underlying value. Specifically, e c kp is equal to 1 if the p-th entry of genotype g k is corrected with a value c [ C~f{2,{1,1,2g, and 0 otherwise. Finally, let E LB and E UB be a lower and an upper bounds on the overall number of errors in G. Then, the following model is a valid formulation of PPH-UD:   7) account for the correction imposed on the p-th SNP of haplotypes h i and h i' when g i has its pth SNP equal to 0. In this case, two situations may occur at the pth SNP: either no correction is performed, or a correction is performed by setting to 1 one between e 1 kp and e 2 kp . If a correction is performed and e 1 kp is set to 1 at the p-th SNP then one haplotype will be homozygous of wild type and the other of mutant type. On the contrary, if e 2 kp is set to 1 then both haplotypes will be homozygous of mutant type. Constraints (1.8) account for the correction imposed on the p-th SNP of haplotypes h i and h i' when g i has its p-th SNP equal to 2. Similarly to constraints (1.7), also in this case two situations may occur: either no correction is performed, or a correction is performed by setting to 1 one between e {1 kp and e {2 kp . If a correction is performed and e {1 kp is set to 1 at the p-th SNP then one haplotype will homozygous of wild type and the other of mutant type. On the contrary, if e {2 kp is set to 1 then both haplotypes will be homozygous of wild type. Finally, constraints (1.9) account for the correction imposed on the p-th SNP of haplotypes h i and h i' when g i has its p-th SNP equal to 1. If a correction is performed and e 1 kp is set to 1 at the p-th SNP then both haplotypes will be homozygous of mutant type. On the contrary, if e {1 kp is set to 1 then both haplotypes are homozygous of wild type. Constraints (1.10) establish the relations between variables z is and y k ij in absence of uncertainty. Specifically, they force the p-th site of the haplotype h i to be equal to 0 when at least one genotype g k , whose p-th entry equal to 0, belongs to the induced subset S i . By analogy, constraints (1.11) force the p-th site of the haplotype h i to be equal to 1 when at least one genotype g k , whose p-th entry equal to 2, belongs to the induced subset S i . Constraints (1.12-1.13) force one of the two p-th sites of haplotypes h i and h j to be equal to 1 when the p-th entry of genotype g k is equal to 1. Constraints (1.14-1.

Reducing model size
The particular nature of the set of indices Q can be exploited to reduce the size of the CRM for PPH-UD. This operation proves do not need to be defined. Moreover, the sets of redundant variables can be further expanded by taking into account the entries of the error mask matrix and by observing that for each triplet of genotypes fg i ,g j ,g k g such that the respective p-th SNP is g ip~0 , g jp~0 , g kp~1 , and b ip~bjp~bkp~0 , variable y k ij is necessarily equal to 0 since the containment of genotype g k to the subsets S i and S j would violate the sum operator among haplotypes at least at p-th SNP. Extending this argument to all the possible combinations of triplets of genotypes that violate the haplotype sum operator, it is easy to see that the following sets of variables are redundant and can be removed from the model: Note that, removing the redundant variables y k ij belonging to the sets R 4 -R 6 can be performed in O(m 3 s). Finally, a similar process of reduction can be applied to variables z ip both by removing those whose value is fixed by constraints (1.6) (e.g., when g kp~0 or when g kp~2 ). In this way, only variables z ip involved in constraints (1.6) when g kp~1 and in constraints (1.7-1.9) need to be defined.

Results and Discussion
In this section we analyze the performances of the Class Representative Model (CRM) to solve instances of the pure  [31], Wang and Xu [28], and Marchini et al. [32], and we refer the interested reader to their respective articles. As in Catanzaro et al. [1], we used the standard Brown and Harrower's datasets [30] for testing the performances of our model. Specifically, through Hudson's MS program [33], Brown and Harrower created two families of datasets (called the uniform and nonuniform datasets) by randomly pairing the resulting haplotypes. The distinction in the two simulated methods comes in how the random pairing is performed. In the uniform datasets the haplotypes are randomly paired by sampling uniformly from the set of distinct haplotypes. In the nonuniform datasets the haplotypes are sampled uniformly from the collection of haplotypes generated by the coalescent process. In this collection, haplotypes may not be unique, so some haplotypes are sampled with higher frequency than others. Both the uniform and nonuniform datasets consist of collections of 30 or 50 genotypes having 10, 30, 50, 75 or 100 SNPs each. Each dataset contains a number of instances variable between 15 and 50. The authors also considered biological data from chromosomes 10 and 21, over all four Hap-Map [21] populations. For each input the authors selected sequences having 30, 50, and 75 SNPs, respectively, giving a total of 8 datasets consisting of 3 instances each. Brown and Harrower's datasets are not subjected to uncertainty, for this reason we considered four sets of random generated error mask matrices having an error ratio (i.e., the number of entries equal to 1 divided mp) equal to 1%, 5%, 10%, and 15% respectively. Brown and Harrower's datasets and the error mask matrices used in our experiments can be downloaded at the address: http://home pages.ulb.ac.be/,dacatanz/PPHerr.zip.
In Tables 2-5 we show the performances of the CRM for PPH-UD under different error ratios by showing, conservatively, the same information described in Brown and Harrower [30] and Catanzaro et al. [1]. Specifically, the columns of Tables 2-5 evidence the average, the maximum, and the minimum of: the solution time, the gap (i.e., the difference between the optimal value found and the value of linear relaxation at the root node of the search tree, divided by the optimal value), and the number of nodes expanded in each group of instances belonging to a given dataset. The results have been obtained by implementing the CRM for PPH-UD in Mosel 2.0 of Xpress-MP, Optimizer version 18, running on a Pentium 4, 3.2 GHz, equipped with 2 GByte RAM and operating system Gentoo release 7 (kernel linux 2.6.17).
In our experiments we activated the Xpress-MP Optimizer automatic cuts, the Xpress-MP pre-solving strategy, and used the Xpress-MP primal heuristic to generate the first upper bound.
In order to obtain a qualitative measure of the running time performances of the CRM for PPH-UD, we compared the numerical results of the model with the corresponding ones of the CRM for PPH (RM version, see Catanzaro et al. [1]) running on the same datasets in absence of uncertainty. The performances of the CRM for PPH are shown in Table 6. Moreover, in order to obtain a measure of the accuracy of the CRM for PPH-UD, we used the following procedure: fixed a generic instance of PPH-UD, we computed the optimal solution provided by CRM for PPH in absence of uncertainty and considered the corresponding set of haplotypes as the ''correct set''; subsequently, we computed the optimal solution provided by CRM for PPH-UD in presence of uncertainty (i.e., when taking into account the corresponding input error mask matrix) and assumed, as measure of the accuracy, the ratio between the number of equal haplotypes in both solutions divided the overall number of haplotypes in the solution provided by CRM for PPH. When such a ratio is equal to 1 it means that CRM for PPH-UD was able to recover the correct haplotype set, otherwise, when the ratio is smaller than 1 it means that CRM for PPH-UD was able to recover only a fraction of such a set. The accuracy (expressed in percentage) of the CRM for PPH-UD under increasing error ratios is shown in Table 7. For sake of notation, in the following subsections we shall denote CRM1 and CRM2 as the CRM for PPH and the CRM for PPH-UD, respectively.

Uniform Datasets
The experiments relative to the uniform datasets showed that, when considering an error ratio of 1% already, CRM2 takes significantly more time than CRM1 to solve Brown and Harrower's datasets, confirming the hardness of PPH-UD with respect to PPH. Specifically, Tables 2-5 show, as general trend, that the higher the error ratio the slower the runtime performances of the model. For example, while CRM1 took in average 8 seconds to solve the most difficult dataset having 10 SNPs, CRM2 took in average at least 10 seconds independently from the error ratio, and even longer on instances 03, 05, 06, 08 and 11 of dataset 50610r4 where 19.657, 62.160, 34.429, 40.416, and 15.974 seconds, respectively, were needed to find the optimum. This trend persists also in the instances having 30 SNPs, where CRM1 took in average 11.772 seconds while CRM2 needed an average solution time of 43.273 seconds when considering an error ratio of 5%, with the exception of instances 02, 08, 09, 11, 13 and 14 which needed 62.182, 51.462, 60.079, 60.020, 122.443, and 58.514 seconds, respectively. We observed that the overall performances of CRM2 with an error ratio of 1% generally tend to be similar to the ones of CRM2 with an error ratio of 5%; moreover, we experienced also a generalized decrement of the average solution time when considering an error ratio of 10% and, vice versa, an increment of the average solution time when considering an error ratio of 15%. When considering instances having a larger number of SNPs, we experienced a generalized increment of the average solution time taken by CRM2, proportional to the increment of the error ratio. Interestingly, the average gap and number of branches performed by CRM2, although not directly comparable with the corresponding one of CRM1, results relatively small, confirming the tightness of the class representative model also for uncertain data.
The accuracy of CRM2 in the uniform datasets result very good. Specifically, the average accuracy is over 90% in the majority of the analyzed datasets and independently from the error ratio. However, it is worth noting that in some instances the accuracy may decrease significantly (see, e.g., datasets 30650 and 30675) and proportionally to the increment of the error ratio, by suggesting, as general trend, the fact that the higher the error ratio the more difficult is to recover the correct haplotype set.

Nonuniform Datasets
The general trend observed in the uniform datasets persists also in the nonuniform datasets. Specifically, as shown in Table 4, CRM2 took in average 10 times more the average solution time taken by CRM1 to solve instances having 10 SNPs, reaching a maximum solution time of 135.199 seconds when tackling instance 06 affected by an error ratio of 5%. Similarly, when tackling instances having 30 SNPs, CRM2 took in average 4 times more the average solution time taken by CRM1, reaching a maximum solution time of 193.406 seconds when tackling instance 00 affected by an error ratio of 5%. When dealing with instances having more than 30 SNPs, CRM2 took significantly more than CRM1 reaching a solution time of 7210.800 seconds when tackling the instance 100-30.03 affected by an error ratio of 5%.
The accuracy of CRM2 in the nonuniform datasets result still good, but slightly poorer than in the uniform datasets. Specifically, the average accuracy is over 80% in the majority of the analyzed datasets and independently from the error ratio. Similarly to the uniform datasets, in some instances the accuracy may decrease significantly (see, e.g., datasets 30675). However, in the worst case, the decrement results much smaller than the corresponding one in the uniform datasets.

Biological Datasets
To complete the performance analysis on Brown and Harrower's datasets, we tested CRM2 on the biological datasets. Once again, the general trend observed in the uniform and nonuniform datasets persists also in the biological datasets: CRM2 results significantly slower than CRM1, a part from datasets CHR10-CEU and CHR21-CEU in which the trend is inverted due to the peculiar nature of both datasets. While the average gap of CRM1 never exceeded 2:6%, the average gap of CRM2 was 10% or more, confirming the hardness of the biological datasets. However, we stress once again the fact that PPH and PPH-UD are de facto two different problems, hence intrinsic values such as the gap cannot be directly compared. Our analysis just aims at offering experimental evidence of the tightness of the class representative model in tackling instances of the pure parsimony haplotyping problem under uncertain data.
The small number of instances constituting each biological dataset (three instances per dataset) prevents a clear statistical characterization of the performances of CRM2 in terms of accuracy. As general trend, we have observed that the accuracy approaches 100% in the majority of the biological datasets analyzed. Nevertheless, in a number of datasets this trend changes, leading the accuracy level to low values. Investigating the reason why this phenomenon arises and the possible corresponding remedies warrants additional analysis.

Conclusion
In this article we have investigated, for the first time, a recent version of PPH, called the Pure Parsimony Haplotype problem under Uncertain Data (PPH-UD). This version mainly arises when the input genotypes are not accurate, i.e., when some single nucleotide polymorphisms are missing or affected by errors. We proposed an exact approach to solution of PPH-UD based on an extended version of Catanzaro et al. [1] class representative model for PPH, possibly one of the best integer programming models described so far in the literature on PPH. The model is efficient, accurate, compact, polynomial-sized, easy to implement, solvable with any solver for mixed integer programming, and usable in all those cases for which the parsimony criterion is well suited for haplotype estimation.