A Class Representative Model for Pure Parsimony Haplotyping under Uncertain Data

  • Daniele Catanzaro ,

    dacatanz@ulb.ac.be

    Affiliation Graphes et Optimisation Mathématique (G.O.M.), Computer Science Department, Université Libre de Bruxelles (U.L.B.), CP 210/01, Brussels, Belgium

  • Martine Labbé,

    Affiliation Graphes et Optimisation Mathématique (G.O.M.), Computer Science Department, Université Libre de Bruxelles (U.L.B.), CP 210/01, Brussels, Belgium

  • Luciano Porretta

    Affiliation Graphes et Optimisation Mathématique (G.O.M.), Computer Science Department, Université Libre de Bruxelles (U.L.B.), CP 210/01, Brussels, Belgium

Abstract

The Pure Parsimony Haplotyping (PPH) problem is an NP-hard combinatorial optimization problem that consists of finding the minimum number of haplotypes necessary to explain a given set of genotypes. PPH has attracted increasing attention in recent years because of its importance in the analysis of fine-scale genetic data. Its applications range from mapping complex disease genes to inferring population histories, and include drug design, functional genomics and pharmacogenetics. In this article we investigate, for the first time, a recent version of PPH called the Pure Parsimony Haplotype problem under Uncertain Data (PPH-UD). This version mainly arises when the input genotypes are not accurate, i.e., when some single nucleotide polymorphisms are missing or affected by errors. We propose an exact approach to the solution of PPH-UD based on an extended version of the class representative model for PPH of Catanzaro et al. [1], currently the state-of-the-art integer programming model for PPH. The model is efficient, accurate, compact, polynomial-sized, easy to implement, solvable with any solver for mixed integer programming, and usable in all those cases for which the parsimony criterion is well suited for haplotype estimation.

Introduction

The human genome is organized into 23 pairs of chromosomes; in each pair, one copy is inherited from the father and the other from the mother. When a nucleotide site of a specific chromosome region shows variability within a population of individuals, it is called a Single Nucleotide Polymorphism (SNP). Specifically, a site is considered a SNP if for a minority of the population a certain nucleotide is observed (called the least frequent allele) while for the rest of the population a different nucleotide is observed (the most frequent allele) [2]. The least frequent allele, or mutant type, is generally encoded as ‘1’, as opposed to the most frequent allele, or wild type, generally encoded as ‘0’ [3]. A haplotype is a set of alleles or, more formally, a string of length $n$ over the alphabet $\{0, 1\}$ [4]. Haplotypes represent a fundamental source of information for disease association studies. In fact, over 90% of sequence variation among individuals is due to common variant sites, most of which arose from single historical mutation events on the ancestral chromosome [5]. Hence, in a group of people affected by a disease, the SNPs causing or associated with the disease will be enriched in frequency compared with the corresponding frequencies in a group of unaffected individuals. This observation was of considerable assistance, for example, in the identification of the genes responsible for type 1 diabetes [6]–[8], type 2 diabetes [9], [10], Alzheimer's disease [11], deep vein thrombosis [12], inflammatory bowel disease [13]–[15], hypertriglyceridaemia [16], schizophrenia [17], asthma [18], stroke [19], myocardial infarction [20], cystic fibrosis and diastrophic dysplasia [21], [22].

Extracting haplotypes from a population of individuals is not an easy task. In fact, current molecular sequencing techniques only provide information about the conflation of the paternal and maternal haplotypes of an individual (also called the genotype) rather than the haplotypes themselves [23]. When the family-based genetic information of a population is available, haplotypes can be retrieved experimentally [24]. However, the experimental approach is generally laborious and cost-prohibitive, requires advanced molecular isolation strategies [25], and is sometimes not even possible [26]. In the absence of family-based genetic information, a valid alternative to the experimental approach is provided by computational methods which estimate, by means of specific criteria, haplotypes from the set of genotypes extracted from a population of individuals.

A genotype can be formally defined as a string of length $n$ over the alphabet $\{0, 1, 2\}$, where the symbols ‘0’ and ‘2’ denote homozygous sites (of wild and mutant type, respectively) and the symbol ‘1’ denotes heterozygous sites. As an example, the sequence $021$ encodes a genotype in which the first SNP is homozygous of wild type, the second SNP is homozygous of mutant type, and the third SNP is heterozygous. A genotype is said to be degenerate if it does not contain ‘1’s. A genotype $g$ is said to be resolved from a pair of haplotypes $(h_1, h_2)$, in symbols $g = h_1 \oplus h_2$, if for every site $p$ the $p$-th entry of $g$, denoted as $g(p)$, is equal to the sum of the $p$-th entries of $h_1$ and $h_2$, denoted as $h_1(p)$ and $h_2(p)$, respectively. For example, the genotype $021$ is resolved from $010$ and $011$. Haplotyping a set of genotypes $G$ means finding a set of haplotypes $H$ resolving $G$.
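The resolution rule can be checked mechanically. The following is a minimal sketch, assuming genotypes and haplotypes are stored as plain strings of digits; the function names `resolves` and `is_degenerate` are illustrative and do not come from the article.

```python
def resolves(h1, h2, g):
    """Return True if genotype g is the site-wise sum of haplotypes h1 and h2.

    Encoding taken from the text: haplotype entries are 0 (wild type) or
    1 (mutant type); genotype entries are 0 or 2 (homozygous) or 1 (heterozygous).
    """
    if not (len(h1) == len(h2) == len(g)):
        return False
    return all(int(a) + int(b) == int(c) for a, b, c in zip(h1, h2, g))


def is_degenerate(g):
    """A genotype is degenerate when it has no heterozygous ('1') site."""
    return "1" not in g


# Genotype with sites: homozygous wild type, homozygous mutant, heterozygous.
print(resolves("010", "011", "021"))  # True
print(resolves("010", "010", "021"))  # False: the third site would sum to 0
```

Storing sequences as strings keeps the sketch short; an array of small integers would be the natural choice for larger instances.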

It is worth noting that, given a genotype $g$ and denoting by $k$ the number of its heterozygous sites, there exist $2^{k-1}$ possible pairs of haplotypes that may resolve it [26]. As an example, genotype $011$ may be resolved from either the pair of haplotypes $(000, 011)$ or the pair $(001, 010)$. Consequently, a criterion is needed to select pairs of haplotypes among plausible alternatives. Gusfield [27] and Wang and Xu [28] observed that the number of distinct haplotypes existing in a large population of individuals is generally much smaller than the overall number of distinct genotypes observed in that population. This insight suggests that, at least for genes with low recombination rates, the criterion of minimizing the overall number of haplotypes necessary to resolve a set of genotypes has a good chance of recovering the biological haplotype set. This criterion, formally introduced by Gusfield [27], is known as the pure parsimony criterion of haplotype estimation and was of considerable assistance, for example, in the identification of the genes responsible for psoriasis and severe alopecia areata [2]. Haplotyping a set of genotypes under the parsimony criterion involves solving an optimization problem, called the Pure Parsimony Haplotyping (PPH) problem, that can be stated as follows:
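To make the growth in the number of candidate resolutions concrete, the sketch below enumerates every unordered pair of haplotypes resolving a genotype. It assumes the string encoding used in the text; the function name `resolving_pairs` is hypothetical. The first heterozygous site is fixed to (0, 1) so that each unordered pair is produced once.

```python
from itertools import product


def resolving_pairs(g):
    """List the unordered haplotype pairs whose site-wise sum equals genotype g."""
    het = [i for i, c in enumerate(g) if c == "1"]
    # Homozygous sites are forced: '0' -> (0, 0) and '2' -> (1, 1).
    base = ["1" if c == "2" else "0" for c in g]
    if not het:
        h = "".join(base)
        return [(h, h)]
    first, rest = het[0], het[1:]
    pairs = []
    for bits in product("01", repeat=len(rest)):
        h1, h2 = base[:], base[:]
        h1[first], h2[first] = "0", "1"  # symmetry breaking on the first heterozygous site
        for i, b in zip(rest, bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


print(resolving_pairs("011"))        # [('000', '011'), ('001', '010')]
print(len(resolving_pairs("1111")))  # 8
```

A genotype with four heterozygous sites already admits eight pairs, which is why a selection criterion such as parsimony is needed.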

Problem. The Pure Parsimony Haplotyping (PPH) problem

Given a set $G$ of $m$ non-degenerate genotypes, having $n$ SNPs each, find a minimum-cardinality set $H$ of haplotypes such that for each genotype $g \in G$ there exists a pair of haplotypes in $H$ resolving $g$.

As an example, an instance of PPH and the corresponding solution are shown in Table 1.
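For very small instances the problem statement can be turned directly into an exhaustive search. The sketch below is not the class representative model discussed in this article: it is a brute-force enumeration with exponential running time, given only to illustrate what an optimal solution looks like. The function names and the three-genotype instance are illustrative assumptions.

```python
from itertools import combinations, product


def candidate_pairs(g):
    """All unordered haplotype pairs whose site-wise sum equals genotype g."""
    het = [i for i, c in enumerate(g) if c == "1"]
    pairs = set()
    for bits in product("01", repeat=len(het)):
        h1 = ["1" if c == "2" else "0" for c in g]
        h2 = h1[:]
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, ("1" if b == "0" else "0")
        pairs.add(frozenset(("".join(h1), "".join(h2))))
    return pairs


def brute_force_pph(genotypes):
    """Return a smallest haplotype set resolving every genotype (tiny inputs only)."""
    options = [candidate_pairs(g) for g in genotypes]
    pool = sorted(set().union(*(p for opts in options for p in opts)))
    for size in range(1, len(pool) + 1):
        for subset in combinations(pool, size):
            chosen = set(subset)
            # Every genotype needs at least one resolving pair inside the chosen set.
            if all(any(p <= chosen for p in opts) for opts in options):
                return sorted(chosen)
    return []


print(brute_force_pph(["011", "021", "012"]))  # ['001', '010', '011']
```

In this toy instance the second and third genotypes each admit a single resolution, which forces three haplotypes; the first genotype is then resolved for free by the pair (001, 010).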

Table 1. Graphical representation of an instance of PPH and the corresponding solution.

https://doi.org/10.1371/journal.pone.0017937.t001

PPH is known to be polynomially solvable when each genotype has at most two heterozygous sites [29], and NP-hard when each genotype has at least three heterozygous sites [26].

Recently, Brown and Harrower [30] introduced an interesting version of PPH called the Pure Parsimony Haplotype problem under Uncertain Data (PPH-UD). This version mainly arises when the input genotype set is not accurate, i.e., when some SNPs are missing or affected by errors, a situation that often occurs in practice. In this case, the input of the problem may also include a binary matrix $E$, called the error mask matrix, whose generic entry $e_{kp}$ is equal to 1 if the $p$-th SNP of genotype $g_k$ is uncertain (i.e., missing or affected by an error), and 0 otherwise. When a given SNP is uncertain, its observed value could deviate significantly from its true value. For example, the true value of a wild type homozygous SNP affected by uncertainty could be homozygous of mutant type or even heterozygous. Similarly, the true value of a heterozygous SNP affected by uncertainty could be homozygous of wild or mutant type. The presence of uncertain data modifies the standard definition of resolution for a genotype. Specifically, Brown and Harrower [30] stated that when uncertainty occurs in the input data a genotype $g$ is resolved by a pair of haplotypes $(h_1, h_2)$ if $h_1(p) + h_2(p) = g(p) + \epsilon_p$ for all SNPs $p$, $\epsilon_p$ being an integer variable assuming values in the set $\{-2, -1, 0, 1, 2\}$. Brown and Harrower [30] described an integer programming model able to tackle instances of PPH affected by uncertain data. Unfortunately, the authors did not offer experimental evidence of the performance of their model because of its impractical runtimes. In this article we address this critical issue by introducing an integer linear programming model that solves instances of PPH-UD exactly. The model is based on an extension of the Class Representative Model (CRM) of Catanzaro et al. [1], currently one of the best integer programming models for PPH described in the literature. The model that we propose is efficient, compact, polynomial-sized, easy to implement, solvable with any solver for mixed integer programming, and usable in all those cases for which the parsimony criterion is well suited for haplotype estimation.

Methods

As shown in Catanzaro et al. [1], any feasible solution of PPH induces a family of subsets of genotypes such that: (i) each subset represents one unique haplotype, the elements of the subset being the genotypes carrying that haplotype, (ii) each genotype belongs to exactly two subsets, and (iii) every pair of subsets intersects in at most one genotype. This principle can also be exploited when dealing with PPH-UD. Specifically, let us associate an index with each subset $S$ of genotypes induced by a haplotype $h$. If $i$ is the smallest index of a genotype belonging to $S$, then $i$ is the index associated with $S$ and the subset will be denoted as $S_i$. Since each genotype belongs to exactly two subsets (as it must be explained by exactly two haplotypes), it may happen that $g_i$ is itself the genotype with smallest index in both subsets. In this case a dummy genotype $g_{i'}$ is added, and the subset $S_{i'}$ is created. As an example, one can imagine that the haplotype $h_1$ induces the subset $S_1$, $h_2$ induces the subset $S_{1'}$, $h_3$ induces the subset $S_2$, and so on. We remark that the index $i'$ can be considered only if $i$ was previously used, i.e., if the subset $S_i$ already exists.
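The class representative idea can be illustrated on a resolved instance. The sketch below assumes that a solution is given as a dictionary mapping each genotype index to its pair of haplotypes; it builds the subset of genotypes induced by each haplotype, checks properties (ii) and (iii), and labels each subset with its smallest genotype index, using a primed label for the dummy index when the plain one is already taken. The function names and the tie-breaking rule (lexicographic order of the haplotype strings) are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def induced_subsets(resolution):
    """Map each haplotype to the sorted list of genotype indices carrying it."""
    subsets = defaultdict(list)
    for k in sorted(resolution):
        for h in set(resolution[k]):
            subsets[h].append(k)
    return dict(subsets)


def check_family(subsets, genotype_ids):
    """Check that each genotype lies in exactly two subsets and that any two
    subsets share at most one genotype."""
    membership = {k: 0 for k in genotype_ids}
    for members in subsets.values():
        for k in members:
            membership[k] += 1
    two_each = all(count == 2 for count in membership.values())
    small_overlap = all(
        len(set(a) & set(b)) <= 1 for a, b in combinations(subsets.values(), 2)
    )
    return two_each and small_overlap


def class_labels(subsets):
    """Label every subset by its smallest genotype index; the second subset
    sharing that index receives the primed (dummy) label."""
    labels, used = {}, set()
    for h in sorted(subsets, key=lambda hap: (subsets[hap][0], hap)):
        rep = subsets[h][0]
        labels[h] = str(rep) if rep not in used else f"{rep}'"
        used.add(rep)
    return labels


# Genotypes 1: 011, 2: 021, 3: 012, resolved with three haplotypes.
resolution = {1: ("001", "010"), 2: ("010", "011"), 3: ("001", "011")}
subsets = induced_subsets(resolution)
print(subsets)                          # 001 -> [1, 3], 010 -> [1, 2], 011 -> [2, 3]
print(check_family(subsets, [1, 2, 3]))  # True
print(class_labels(subsets))            # {'001': '1', '010': "1'", '011': '2'}
```

Genotype 1 is the smallest-index member of two subsets, so one of them takes the dummy label 1', which matches the remark that a primed index is used only once the corresponding plain index is in use.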

Since at most $2m$ haplotypes are necessary to resolve $m$ genotypes [26], the indices of the subsets can vary inside the index set $\hat{K} = K \cup K'$, where $K = \{1, \dots, m\}$ and $K' = \{1', \dots, m'\}$. Assume that an order is defined on $\hat{K}$ in such a way that $1 < 1' < 2 < 2' < \dots < m < m'$. Define $x_i$, $i \in \hat{K}$, as a decision variable equal to 1 if, in the solution, there exists a haplotype inducing a subset of genotypes whose smallest index genotype is $g_i$, and 0 otherwise. Denote by $y^k_{ij}$, $i, j \in \hat{K}$, $k \in K$, a decision variable equal to 1 if genotype $g_k$ belongs to the subsets $S_i$ and $S_j$, and 0 otherwise. Denote by $P$ the set of the input SNPs and by $z_{ip}$, $i \in \hat{K}$, $p \in P$, a decision variable equal to 1 if the haplotype inducing the subset of genotypes $S_i$ has value 1 at the $p$-th site, and 0 otherwise. Variables $z_{ip}$ describe explicitly the haplotypes of the solution.

For every non-null entry $e_{kp}$ of the error mask matrix, denote by $\delta^v_{kp}$ a decision variable accounting for the difference between the value of $g_k(p)$ and the true underlying value. Specifically, $\delta^v_{kp}$ is equal to 1 if the $p$-th entry of genotype $g_k$ is corrected with a value $v \in \{-2, -1, 1, 2\}$, and 0 otherwise. Finally, let $L$ and $U$ be a lower and an upper bound on the overall number of errors in $G$. Then, the following model is a valid formulation of PPH-UD:

Formulation. Class Representative Model (CRM) for PPH-UD

(1.1)–(1.23)

The objective function (1.1) represents the number of distinct haplotypes or, equivalently, the cardinality of $H$. Since the index $i'$ is considered only if $i$ is already used, constraints (1.2) imply that if the haplotype inducing $S_i$ is not used, then the haplotype inducing $S_{i'}$ should not be used. Constraints (1.3) impose that each genotype $g_k$ must belong to exactly two subsets $S_i$ and $S_j$, and constraints (1.4) force $x_i$ to be 1, i.e., to take the haplotype inducing $S_i$ into account, if some genotype $g_k$ is resolved by it. Constraints (1.5) are a consequence of the definition of the dummy genotype $g_{i'}$. They constitute a special version of constraints (1.4) for the case in which genotype $g_i$ is resolved by the haplotype inducing $S_{i'}$. Constraints (1.6) impose the sum operation among haplotypes in the absence of uncertainty. Constraints (1.7–1.9) translate the sum operation among haplotypes when uncertainty occurs in the input data. Specifically, constraints (1.7) account for the correction imposed on the $p$-th SNP of the haplotypes inducing $S_i$ and $S_j$ when $g_k$ has its $p$-th SNP equal to 0. In this case, two situations may occur at the $p$-th SNP: either no correction is performed, or a correction is performed by setting to 1 one of $\delta^{1}_{kp}$ and $\delta^{2}_{kp}$. If a correction is performed and $\delta^{1}_{kp}$ is set to 1, then one haplotype will carry the wild type allele at the $p$-th SNP and the other the mutant type allele. On the contrary, if $\delta^{2}_{kp}$ is set to 1, then both haplotypes will carry the mutant type allele. Constraints (1.8) account for the correction imposed on the $p$-th SNP of the haplotypes inducing $S_i$ and $S_j$ when $g_k$ has its $p$-th SNP equal to 2. As for constraints (1.7), two situations may occur: either no correction is performed, or a correction is performed by setting to 1 one of $\delta^{-1}_{kp}$ and $\delta^{-2}_{kp}$. If a correction is performed and $\delta^{-1}_{kp}$ is set to 1, then one haplotype will carry the wild type allele at the $p$-th SNP and the other the mutant type allele. On the contrary, if $\delta^{-2}_{kp}$ is set to 1, then both haplotypes will carry the wild type allele. Finally, constraints (1.9) account for the correction imposed on the $p$-th SNP of the haplotypes inducing $S_i$ and $S_j$ when $g_k$ has its $p$-th SNP equal to 1. If a correction is performed and $\delta^{1}_{kp}$ is set to 1, then both haplotypes will carry the mutant type allele at the $p$-th SNP. On the contrary, if $\delta^{-1}_{kp}$ is set to 1, then both haplotypes will carry the wild type allele. Constraints (1.10) establish the relations between variables $y^k_{ij}$ and $z_{ip}$ in the absence of uncertainty. Specifically, they force the $p$-th site of the haplotype inducing $S_i$ to be equal to 0 when at least one genotype $g_k$ whose $p$-th entry is equal to 0 belongs to the induced subset $S_i$. By analogy, constraints (1.11) force the $p$-th site of the haplotype inducing $S_i$ to be equal to 1 when at least one genotype $g_k$ whose $p$-th entry is equal to 2 belongs to the induced subset $S_i$. Constraints (1.12–1.13) force exactly one of the two $p$-th sites of the haplotypes inducing $S_i$ and $S_j$ to be equal to 1 when the $p$-th entry of genotype $g_k$ is equal to 1. Constraints (1.14–1.17) are the analogues of constraints (1.10–1.13) in the presence of uncertainty in the input data. Constraints (1.18–1.20) impose that at most one variable $\delta^v_{kp}$ can be equal to 1 for each uncertain entry. Finally, constraints (1.21–1.22) impose the upper and lower bounds on the error variables $\delta^v_{kp}$.

Reducing model size

The particular nature of the set of indices $\hat{K}$ can be exploited to reduce the size of the CRM for PPH-UD. This reduction proves fundamental to improving the efficiency of the whole model. Specifically, as shown in Catanzaro et al. [1], variables belonging to one of the sets (2)–(4) do not need to be defined. Moreover, the sets of redundant variables can be further expanded by taking into account the entries of the error mask matrix $E$ and by observing that, for each triplet of genotypes $(g_i, g_j, g_k)$ whose certain entries at some SNP $p$ are incompatible with the sum operator, variable $y^k_{ij}$ is necessarily equal to 0, since the containment of genotype $g_k$ in the subsets $S_i$ and $S_j$ would violate the sum operator among haplotypes at least at the $p$-th SNP. Extending this argument to all the possible combinations of triplets of genotypes that violate the haplotype sum operator, it is easy to see that the variables in the sets (5)–(7) are redundant and can be removed from the model. Note that removing the redundant variables belonging to these sets can be performed in polynomial time. Finally, a similar reduction can be applied to variables $z_{ip}$ by removing those whose value is fixed by constraints (1.6) (e.g., when $g_k(p) = 0$ or when $g_k(p) = 2$). In this way, only the variables $z_{ip}$ involved in constraints (1.6) when $g_k(p) = 1$ and in constraints (1.7–1.9) need to be defined.

Results and Discussion

In this section we analyze the performance of the Class Representative Model (CRM) in solving instances of the pure parsimony haplotyping problem under uncertain data. Like Brown and Harrower [30] and Catanzaro et al. [1], we emphasize that our experiments aim simply to evaluate the runtime performance of our model for solving PPH. We neither attempt to study the efficiency of PPH for haplotype inference nor compare the accuracy of our algorithm to haplotype inference solvers that do not use the parsimony criterion. This analysis has already been performed by Gusfield [31], Wang and Xu [28], and Marchini et al. [32], and we refer the interested reader to their respective articles.

As in Catanzaro et al. [1], we used the standard datasets of Brown and Harrower [30] to test the performance of our model. Specifically, using Hudson's MS program [33], Brown and Harrower created two families of datasets (called the uniform and nonuniform datasets) by randomly pairing the resulting haplotypes. The two families differ in how the random pairing is performed. In the uniform datasets the haplotypes are randomly paired by sampling uniformly from the set of distinct haplotypes. In the nonuniform datasets the haplotypes are sampled uniformly from the collection of haplotypes generated by the coalescent process. In this collection haplotypes may not be unique, so some haplotypes are sampled with higher frequency than others. Both the uniform and nonuniform datasets consist of collections of 30 or 50 genotypes having 10, 30, 50, 75 or 100 SNPs each. Each dataset contains between 15 and 50 instances. The authors also considered biological data from chromosomes 10 and 21, over all four HapMap [21] populations. For each input the authors selected sequences having 30, 50, and 75 SNPs, respectively, giving a total of 8 datasets consisting of 3 instances each. Brown and Harrower's datasets are not subject to uncertainty; for this reason we considered four sets of randomly generated error mask matrices having an error ratio (i.e., the number of entries equal to 1 divided by the total number of entries) equal to 1%, 5%, 10%, and 15%, respectively. Brown and Harrower's datasets and the error mask matrices used in our experiments can be downloaded at the address: http://homepages.ulb.ac.be/~dacatanz/PPHerr.zip.
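An error mask matrix with a prescribed error ratio can be produced in a few lines. The article states only that the masks were randomly generated, so the sampling scheme below (uniform sampling of cells without replacement) and the function names are assumptions made for illustration.

```python
import random


def random_error_mask(m, n, ratio, seed=None):
    """Build an m x n binary mask with round(ratio * m * n) entries set to 1.

    Assumption: uncertain cells are drawn uniformly without replacement.
    """
    rng = random.Random(seed)
    n_errors = round(ratio * m * n)
    cells = rng.sample(range(m * n), n_errors)
    mask = [[0] * n for _ in range(m)]
    for c in cells:
        mask[c // n][c % n] = 1
    return mask


def error_ratio(mask):
    """Number of entries equal to 1 divided by the total number of entries."""
    total = sum(len(row) for row in mask)
    return sum(map(sum, mask)) / total


mask = random_error_mask(30, 10, 0.05, seed=0)
print(sum(map(sum, mask)))  # 15 uncertain entries out of 300
print(error_ratio(mask))    # 0.05
```

Fixing the seed makes a mask reproducible, which matters when the same instance is solved under several error ratios.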

In Tables 2–5 we show the performance of the CRM for PPH-UD under different error ratios, reporting the same information described in Brown and Harrower [30] and Catanzaro et al. [1]. Specifically, the columns of Tables 2–5 report the average, the maximum, and the minimum of: the solution time, the gap (i.e., the difference between the optimal value found and the value of the linear relaxation at the root node of the search tree, divided by the optimal value), and the number of nodes expanded in each group of instances belonging to a given dataset. The results were obtained by implementing the CRM for PPH-UD in Mosel 2.0 of Xpress-MP, Optimizer version 18, running on a Pentium 4, 3.2 GHz, equipped with 2 GByte RAM and operating system Gentoo release 7 (kernel linux 2.6.17). In our experiments we activated the Xpress-MP Optimizer automatic cuts and the Xpress-MP pre-solving strategy, and used the Xpress-MP primal heuristic to generate the first upper bound.
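The gap defined in the paragraph above is a one-line computation. The sketch below restates it; the function name `root_gap` and the numbers in the usage line are hypothetical and are not taken from the tables.

```python
def root_gap(optimal_value, lp_relaxation_value):
    """Relative gap: (optimal value - root LP relaxation value) / optimal value."""
    if optimal_value == 0:
        raise ValueError("the gap is undefined when the optimal value is 0")
    return (optimal_value - lp_relaxation_value) / optimal_value


# Hypothetical instance: 10 haplotypes at the optimum, root relaxation value 9.5.
print(f"{root_gap(10, 9.5):.1%}")  # 5.0%
```

A small gap means that the linear relaxation already gives a bound close to the optimum, so few branch-and-bound nodes are typically needed.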

Table 2. Performances of the CRM for PPH-UD when considering input data having an error ratio of 1%.

https://doi.org/10.1371/journal.pone.0017937.t002

Table 3. Performances of the CRM for PPH-UD when considering input data having an error ratio of 5%.

https://doi.org/10.1371/journal.pone.0017937.t003

Table 4. Performances of the CRM for PPH-UD when considering input data having an error ratio of 10%.

https://doi.org/10.1371/journal.pone.0017937.t004

Table 5. Performances of the CRM for PPH-UD when considering input data having an error ratio of 15%.

https://doi.org/10.1371/journal.pone.0017937.t005

To obtain a qualitative measure of the running time performance of the CRM for PPH-UD, we compared the numerical results of the model with those of the CRM for PPH (RM version, see Catanzaro et al. [1]) running on the same datasets in the absence of uncertainty. The performance of the CRM for PPH is shown in Table 6. Moreover, to measure the accuracy of the CRM for PPH-UD, we used the following procedure. For a given instance of PPH-UD, we computed the optimal solution provided by the CRM for PPH in the absence of uncertainty and considered the corresponding set of haplotypes as the “correct set”. We then computed the optimal solution provided by the CRM for PPH-UD in the presence of uncertainty (i.e., when taking into account the corresponding input error mask matrix) and took, as a measure of accuracy, the number of haplotypes common to both solutions divided by the overall number of haplotypes in the solution provided by the CRM for PPH. A ratio equal to 1 means that the CRM for PPH-UD was able to recover the correct haplotype set; a ratio smaller than 1 means that it recovered only a fraction of that set. The accuracy (expressed as a percentage) of the CRM for PPH-UD under increasing error ratios is shown in Table 7. For ease of notation, in the following subsections we denote by CRM1 and CRM2 the CRM for PPH and the CRM for PPH-UD, respectively.
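The accuracy measure described above reduces to a set intersection. The sketch below assumes that both solutions are available as collections of haplotype strings; the function name `accuracy` and the example haplotypes are illustrative.

```python
def accuracy(correct_set, recovered_set):
    """Haplotypes common to both solutions divided by the size of the correct set.

    correct_set: haplotypes from the model without uncertainty (the reference).
    recovered_set: haplotypes from the model that uses the error mask.
    """
    correct, recovered = set(correct_set), set(recovered_set)
    if not correct:
        raise ValueError("the reference haplotype set must not be empty")
    return len(correct & recovered) / len(correct)


reference = ["001", "010", "011"]
recovered = ["001", "010", "111"]
print(f"{accuracy(reference, recovered):.1%}")  # 66.7%
```

Note that the measure is normalized by the size of the reference set only, so extra haplotypes in the recovered solution do not lower it further.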

Table 6. Performances of the CRM for PPH (RM version) on Brown and Harrower's datasets [30].

https://doi.org/10.1371/journal.pone.0017937.t006

Uniform Datasets

The experiments on the uniform datasets showed that, already at an error ratio of 1%, CRM2 takes significantly more time than CRM1 to solve Brown and Harrower's datasets, confirming the hardness of PPH-UD with respect to PPH. Specifically, Tables 2–5 show, as a general trend, that the higher the error ratio, the slower the model. For example, while CRM1 took on average 8 seconds to solve the most difficult dataset having 10 SNPs, CRM2 took on average at least 10 seconds independently of the error ratio, and even longer on instances 03, 05, 06, 08 and 11 of dataset 50×10r4, where 19.657, 62.160, 34.429, 40.416, and 15.974 seconds, respectively, were needed to find the optimum. This trend persists in the instances having 30 SNPs, where CRM1 took on average 11.772 seconds while CRM2 needed an average solution time of 43.273 seconds when considering an error ratio of , with the exception of instances 02, 08, 09, 11, 13 and 14, which needed 62.182, 51.462, 60.079, 60.020, 122.443, and 58.514 seconds, respectively. We observed that the overall performance of CRM2 with an error ratio of generally tends to be similar to that of CRM2 with an error ratio of ; moreover, we observed a general decrease of the average solution time when considering an error ratio of and, vice versa, an increase of the average solution time when considering an error ratio of . When considering instances having a larger number of SNPs, we observed a general increase of the average solution time taken by CRM2, proportional to the increase of the error ratio. Interestingly, the average gap and the number of branches performed by CRM2, although not directly comparable with those of CRM1, are relatively small, confirming the tightness of the class representative model also for uncertain data.

The accuracy of CRM2 on the uniform datasets is very good. Specifically, the average accuracy is over in the majority of the analyzed datasets, independently of the error ratio. However, it is worth noting that in some instances the accuracy may decrease significantly (see, e.g., datasets 30×50 and 30×75) and in proportion to the increase of the error ratio, suggesting, as a general trend, that the higher the error ratio, the more difficult it is to recover the correct haplotype set.

Nonuniform Datasets

The general trend observed in the uniform datasets persists in the nonuniform datasets. Specifically, as shown in Table 4, CRM2 took on average 10 times the average solution time taken by CRM1 to solve instances having 10 SNPs, reaching a maximum solution time of 135.199 seconds when tackling instance 06 affected by an error ratio of . Similarly, when tackling instances having 30 SNPs, CRM2 took on average 4 times the average solution time taken by CRM1, reaching a maximum solution time of 193.406 seconds when tackling instance 00 affected by an error ratio of . When dealing with instances having more than 30 SNPs, CRM2 took significantly longer than CRM1, reaching a solution time of 7210.800 seconds when tackling the instance 100-30.03 affected by an error ratio of .

The accuracy of CRM2 on the nonuniform datasets is still good, but slightly poorer than on the uniform datasets. Specifically, the average accuracy is over in the majority of the analyzed datasets, independently of the error ratio. As in the uniform datasets, in some instances the accuracy may decrease significantly (see, e.g., datasets 30×75). However, in the worst case, the decrease is much smaller than the corresponding one in the uniform datasets.

Biological Datasets

To complete the performance analysis on Brown and Harrower's datasets, we tested CRM2 on the biological datasets. Once again, the general trend observed in the uniform and nonuniform datasets persists in the biological datasets: CRM2 is significantly slower than CRM1, apart from datasets CHR10-CEU and CHR21-CEU, in which the trend is inverted because of the peculiar nature of both datasets. While the average gap of CRM1 never exceeded , the average gap of CRM2 was or more, confirming the hardness of the biological datasets. However, we stress once again that PPH and PPH-UD are de facto two different problems, hence intrinsic values such as the gap cannot be directly compared. Our analysis only aims to offer experimental evidence of the tightness of the class representative model in tackling instances of the pure parsimony haplotyping problem under uncertain data.

The small number of instances constituting each biological dataset (three instances per dataset) prevents a clear statistical characterization of the performance of CRM2 in terms of accuracy. As a general trend, we observed that the accuracy approaches in the majority of the biological datasets analyzed. Nevertheless, in a number of datasets this trend changes and the accuracy drops to low values. Investigating why this phenomenon arises, and the possible remedies, warrants additional analysis.

Conclusion

In this article we have investigated, for the first time, a recent version of PPH, called the Pure Parsimony Haplotype problem under Uncertain Data (PPH-UD). This version mainly arises when the input genotypes are not accurate, i.e., when some single nucleotide polymorphisms are missing or affected by errors. We proposed an exact approach to the solution of PPH-UD based on an extended version of the class representative model for PPH of Catanzaro et al. [1], possibly one of the best integer programming models described so far in the literature on PPH. The model is efficient, accurate, compact, polynomial-sized, easy to implement, solvable with any solver for mixed integer programming, and usable in all those cases for which the parsimony criterion is well suited for haplotype estimation.

Acknowledgments

The authors thank the anonymous reviewers for their valuable comments on the previous version of the manuscript.

Author Contributions

Conceived and designed the experiments: DC ML. Performed the experiments: LP. Analyzed the data: DC ML LP. Wrote the paper: DC.

References

  1. 1. Catanzaro D, Godi A, Labbé M (2009) A class representative model for pure parsimony haplotyping. INFORMS Journal on Computing 22: 195–209.D. CatanzaroA. GodiM. Labbé2009A class representative model for pure parsimony haplotyping.INFORMS Journal on Computing22195209
  2. 2. Catanzaro D, Andrien M, Labbé M, Toungouz-Nevessignsky M (2010) An integer programming model for hla association studies: A case study for psoriasis and severe alopecia areata. Human Immunology. D. CatanzaroM. AndrienM. LabbéM. Toungouz-Nevessignsky2010An integer programming model for hla association studies: A case study for psoriasis and severe alopecia areata.Human Immunology In Press. In Press.
  3. 3. Zhang XS, Wang RS, Wu LY, Chen L (2006) Models and algorithms for haplotyping problem. Current Bioinformatics 1: 105–114.XS ZhangRS WangLY WuL. Chen2006Models and algorithms for haplotyping problem.Current Bioinformatics1105114
  4. 4. Catanzaro D, Labbé M (2009) The pure parsimony haplotyping problem: Overview and computational advances. International Transactions in Operational Research 16: 561–584.D. CatanzaroM. Labbé2009The pure parsimony haplotyping problem: Overview and computational advances.International Transactions in Operational Research16561584
  5. 5. Li WH, Sadler LA (1991) Low nucleotide diversity in man. Genetics 129: 513–523.WH LiLA Sadler1991Low nucleotide diversity in man.Genetics129513523
  6. 6. Bell GI, Horita S, Karam JH (1984) A polymorphic locus near the human insulin gene is associated with insulin-dependent diabetes mellitus. Diabetes 33: 176–183.GI BellS. HoritaJH Karam1984A polymorphic locus near the human insulin gene is associated with insulin-dependent diabetes mellitus.Diabetes33176183
  7. 7. Dorman SJJ, LaPorte RE, Stone RA, Trucco M (1990) Worldwide differences in the incidence of type I diabetes are associated with amino acid variation at position 57 of the HLA-DQ β chain. Proceedings of the National Academy of Sciences of the USA 87: 7370–7374.SJJ DormanRE LaPorteRA StoneM. Trucco1990Worldwide differences in the incidence of type I diabetes are associated with amino acid variation at position 57 of the HLA-DQ β chain.Proceedings of the National Academy of Sciences of the USA8773707374
  8. 8. Nisticó L, Buzzetti R, Pritchard LE, Van der Auwera B, Giovannini C, et al. (1996) The ctla-4 gene region of chromosome 2q33 is linked to, and associated with, type I diabetes. Human Molecular Genetics 5: 1075–1080.L. NisticóR. BuzzettiLE PritchardB. Van der AuweraC. Giovannini1996The ctla-4 gene region of chromosome 2q33 is linked to, and associated with, type I diabetes.Human Molecular Genetics510751080
  9. 9. Altshuler D, Hirschhorn JN, Klannemark M, Lindgren CM, Vohl MC, et al. (2000) The common ppar γ pro12ala polymorphism is associated with decreased risk of type 2 diabetes. Nature Genetics 26: 76–80.D. AltshulerJN HirschhornM. KlannemarkCM LindgrenMC Vohl2000The common ppar γ pro12ala polymorphism is associated with decreased risk of type 2 diabetes.Nature Genetics267680
  10. 10. Deeb SS, Fajas L, Nemoto M, Pihlajamäki J, Mykkänen L, et al. (1998) A Pro12Ala substitution in PPAR γ 2 associated with decreased receptor activity, lower body mass index and improved insulin sensitivity. Nature Genetics 20: 284–287.SS DeebL. FajasM. NemotoJ. PihlajamäkiL. Mykkänen1998A Pro12Ala substitution in PPAR γ 2 associated with decreased receptor activity, lower body mass index and improved insulin sensitivity.Nature Genetics20284287
  11. 11. Strittmatter WJ, Roses AD (1996) Apolipoprotein E and Alzheimer's disease. Annual Reviews - Neuroscience 19: 53–77.WJ StrittmatterAD Roses1996Apolipoprotein E and Alzheimer's disease.Annual Reviews - Neuroscience195377
  12. Dahlbäck B (1997) Resistance to activated protein C caused by the factor V R506Q mutation is a common risk factor for venous thrombosis. Journal of Thrombosis and Haemostasis 78: 483–488.
  13. Rioux JD, Daly MJ, Silverberg MS, Lindblad K, Steinhart H, et al. (2001) Genetic variation in the 5q31 cytokine gene cluster confers susceptibility to Crohn disease. Nature Genetics 29: 223–228.
  14. Hugot JP, Chamaillard M, Zouali H, Lesage S, Cézard JP, et al. (2001) Association of NOD2 leucine-rich repeat variants with susceptibility to Crohn's disease. Nature 411: 599–603.
  15. Ogura Y, Bonen DK, Inohara N, Nicolae DL, Chen FF, et al. (2001) A frameshift mutation in NOD2 associated with susceptibility to Crohn's disease. Nature 411: 603–606.
  16. Pennacchio LA, Olivier M, Hubacek JA, Cohen JC, Cox DR, et al. (2001) An apolipoprotein influencing triglycerides in humans and mice revealed by comparative sequencing. Science 294: 169–173.
  17. Stefansson H, Petursson H, Sigurdsson E, Steinthorsdottir V, Bjornsdottir S, et al. (2002) Neuregulin 1 and susceptibility to schizophrenia. American Journal of Human Genetics 71: 877–892.
  18. Van Eerdewegh P, Little RD, Dupuis J, Del Mastro RG, Falls K, et al. (2002) Association of the ADAM33 gene with asthma and bronchial hyperresponsiveness. Nature 418: 426–430.
  19. Gretarsdottir S, Thorleifsson G, Reynisdottir S, Manolescu A, Jonsdottir S, et al. (2003) The gene encoding phosphodiesterase 4D confers risk of ischemic stroke. Nature Genetics 35: 131–138.
  20. Ozaki K, Ohnishi Y, Iida A, Sekine A, Yamada R, et al. (2002) Functional SNPs in the lymphotoxin-α gene that are associated with susceptibility to myocardial infarction. Nature Genetics 32: 650–654.
  21. The International HapMap Consortium (2003) The International HapMap Project. Nature 426: 789–796.
  22. The International HapMap Consortium (2005) A haplotype map of the human genome. Nature 437: 1299–1314.
  23. Halldórsson BV, Bafna V, Edwards N, Lippert R (2003) Combinatorial problems arising in SNP and haplotype analysis. In: Calude CS, editor. Discrete Mathematics and Theoretical Computer Science, Springer-Verlag, volume 2731 of Lecture Notes in Computer Science.
  24. Lu X, Niu T, Liu JS (2003) Haplotype information and linkage disequilibrium mapping for single nucleotide polymorphisms. Genome Research 13: 2112–2117.
  25. Clark VJ, Methey N, Dean M, Peterson RJ (2001) Statistical estimation and pedigree analysis of CCR2–CCR5 haplotypes. Human Genetics 108: 484–493.
  26. Lancia G, Pinotti MC, Rizzi R (2004) Haplotyping populations by pure parsimony: Complexity of exact and approximate algorithms. INFORMS Journal on Computing 16: 348–359.
  27. Gusfield D (2001) Inference of haplotypes from samples of diploid populations: Complexity and algorithms. Journal of Computational Biology 8: 305–324.
  28. Wang L, Xu Y (2003) Haplotype inference by maximum parsimony. Bioinformatics 19: 1773–1780.
  29. Lancia G, Rizzi R (2006) A polynomial solution to a special case of the parsimony haplotyping problem. Operations Research Letters 34: 289–295.
  30. Brown D, Harrower IM (2006) Integer programming approaches to haplotype inference by pure parsimony. IEEE/ACM Transactions on Computational Biology and Bioinformatics 3: 141–154.
  31. Gusfield D (2003) Haplotype inference by pure parsimony. In: Annual Symposium in Combinatorial Pattern Matching. Springer-Verlag, Berlin, Germany, volume 2676 of Lecture Notes in Computer Science. pp. 144–155.
  32. Marchini J, Cutler D, Patterson N, Stephens M, Eskin E, et al. (2006) A comparison of phasing algorithms for trios and unrelated individuals. American Journal of Human Genetics 78: 437–450.
  33. Hudson RR (1990) Gene genealogies and the coalescent process. Oxford Survey of Evolutionary Biology 7: 1–44.