Powerful Haplotype-Based Hardy-Weinberg Equilibrium Tests for Tightly Linked Loci

Recently, there have been many case-control studies proposed to test for association between haplotypes and disease, which require the Hardy-Weinberg equilibrium (HWE) assumption of haplotype frequencies. As such, haplotype inference of unphased genotypes and development of haplotype-based HWE tests are crucial prior to fine mapping. The goodness-of-fit test is a frequently-used method to test for HWE for multiple tightly-linked loci. However, its degrees of freedom dramatically increase with the increase of the number of loci, which may lack the test power. Therefore, in this paper, to improve the test power for haplotype-based HWE, we first write out two likelihood functions of the observed data based on the Niu's model (NM) and inbreeding model (IM), respectively, which can cause the departure from HWE. Then, we use two expectation-maximization algorithms and one expectation-conditional-maximization algorithm to estimate the model parameters under the HWE, IM and NM models, respectively. Finally, we propose the likelihood ratio tests LRT and LRT for haplotype-based HWE under the NM and IM models, respectively. We simulate the HWE, Niu's, inbreeding and population stratification models to assess the validity and compare the performance of these two LRT tests. The simulation results show that both of the tests control the type I error rates well in testing for haplotype-based HWE. If the NM model is true, then LRT is more powerful. While, if the true model is the IM model, then LRT has better performance in power. Under the population stratification model, LRT is still more powerful. To this end, LRT is generally recommended. Application of the proposed methods to a rheumatoid arthritis data set further illustrates their utility for real data analysis.


Introduction
In studies of genetic epidemiology, complex diseases are often associated with multiple (interacting) markers [1][2][3].As such, haplotype-based analysis has gained increasing attention as it can potentially be more efficient than a single-marker-based analysis [4][5][6][7][8][9].Therefore, haplotype inference of unphased genotypes may be expected to play an important role in disease fine mapping [10].Nowadays, there are many statistical and computational methods available for inferring haplotypes based on different types of data, such as unrelated individuals.One of the popular approaches is the likelihood method, and the maximum likelihood estimation via the expectation-maximization (EM) algorithm [11] is a frequently employed method for haplotype inference.For genotype data of unrelated individuals, an EM-based maximum likelihood method for the estimation of haplotype frequencies was first proposed by Excoffier and Slatkin [12].We call it EM algorithm in this paper for easy description later.However, the EM algorithm needs the assumption that the population under study is in Hardy-Weinberg equilibrium (HWE), otherwise the estimates of haplotype frequencies may be biased.
Recently, there have been many case-control studies proposed to test for association between haplotypes and disease.The likelihood ratio test (LRT) was constructed from the maximum likelihood functions for cases, controls and the pooled data of cases and controls, to test for haplotype-disease association, which requires the assumption of HWE in the pooled sample data [3].Prospective likelihood methods based on logistic regression or generalized linear models were investigated by Schaid et al. [13], Stram et al. [14], Zaykin et al. [15], and others.These methods treat unobserved haplotypes as covariates in a regression model and compute the conditional expectation of the covariates given genotype observations under the null hypothesis of no association with a HWE assumption in the pooled sample of cases and controls.Zhao et al. [16] proposed a prospective estimatingequation approach for the assessment of disease association with haplotypes when adjustment for covariates, which needs the HWE assumption of haplotype frequencies only in the control sample.The pooled sample of cases and controls is not necessarily in HWE.On the other hand, a retrospective likelihood method can be used in detecting haplotype-disease association in a case-control study and also requires HWE only in the control population [17].Therefore, the detection of haplotype-based HWE is crucial prior to fine mapping and positional cloning studies for case-control designs.
The goodness-of-fit test is a frequently-used method to test for HWE for multiple tightly-linked loci.However, when the number of loci under study increases, the degrees of freedom dramatically increase, which may lack the test power.As such, in this paper, to investigate more powerful haplotype-based HWE tests, we first recall three models which can cause Hardy-Weinberg disequilibrium (HWD).One was proposed originally by Niu et al. [6], which includes a parameter K and is called Niu's model (NM) in this paper for convenience; the second one is the inbreeding model (IM) with incorporating the inbreeding coefficient r [18]; the third one is a population stratification (PS) model, which can also lead to HWD.Then, we write out two likelihood functions of the observed data based on the NM and IM models, respectively.We develop an expectation-conditional-maximization (ECM) algorithm [19] for the NM model to estimate the parameter K and haplotype frequencies and suggest an EM algorithm for the IM model (denoted by IEM algorithm here) to estimate the inbreeding coefficient r and haplotype frequencies.Note that K~1 or r~0 means that HWE holds.So, we further propose two LRT tests LRT 1 and LRT 2 to test for haplotype-based HWE under the NM and IM models, respectively.We simulate the HWE, Niu's, inbreeding and population stratification models to assess the validity and compare the performance of these two LRT tests.The simulation results show that both of the tests control the size well in testing for haplotype-based HWE.If the Niu's model is true, then LRT 1 is more powerful.While, if the inbreeding model is true, then LRT 2 has better performance in power.Under the population stratification model, LRT 2 is still more powerful.Therefore, LRT 2 is generally recommended.In addition, we obtain the sum of absolute differences (SAD) between the true and estimated haplotype frequencies [20], and compare the performance of the EM, ECM and IEM algorithms in estimating the haplotype frequencies.If the true model is the Niu's model, then the ECM algorithm has more accurate estimates of haplotype frequencies than the EM and IEM estimates.However, for all the other simulation settings, the EM algorithm is not so much affected by the departure from HWE, and the EM and IEM algorithms almost have the same performance in controlling SAD, which is less than the ECM estimates.Application of the proposed methods to the Rheumatoid Arthritis (RA) data set from the North American Rheumatoid Arthritis Consortium (NARAC) further illustrates their utility for real data analysis.

Likelihood Function and EM Algorithm under HWE
Consider a sample of n unrelated individuals and q single nucleotide polymorphism (SNP) markers.Assume that the SNPs are tightly linked so that the recombination fraction between any SNP pair is zero.For each SNP, there are two alleles 1 and 2. Let H~fh 1 ,h 2 , . . .,h m g be the set of all possible haplotypes at these q loci, where m~2 q .We assume that h i is the frequency of haplotype h i (i~1,2, . . .,m), so the set of haplotype frequencies can be denoted by H~fh 1 ,h 2 , . . .,h m g.Let G~fG 1 ,G 2 , . . .,G n g be the set of the observed genotypes of all the n individuals, where G j is the genotype of the j th individual.For the j th individual, the number of haplotype combinations compatible with G j is s j .Therefore, the likelihood function of the sample can be expressed as where H jk denotes the k th haplotype combination compatible with genotype G j for the j th individual.
To make the haplotype frequency estimation easy and feasible, the EM algorithm was employed [11].Let Z~(Z 1 ,Z 2 , . . .,Z n ) be the true haplotype combinations of the sample which are actually unobserved, and Z j is the true haplotype combination of the j th individual.Then the log-likelihood function of the complete data is where I( : ) is an indicator function and I( : )~1 if Z j ~Hjk and 0 otherwise.Note that under HWE, the probability P(H jk ~hr h s jH) of unordered haplotype pair h r h s is h 2 r if r~s and 2h r h s otherwise.Further, Excoffier and Slatkin [12] proposed the following EM algorithm to obtain the maximum likelihood estimates of h i (i~1,2, . . .,m) at iteration (tz1), , t~0,1,2, . . .where t i jk is the number of times that haplotype h i occurs in the k th haplotype combination for the j th individual and takes values of 0, 1 or 2, and P(H jk j Ĥ H (t) ) is the value of the probability P(H jk jH) based on the estimated haplotype frequencies Ĥ H (t) ~fĥ h (t)  1 , ĥ h (t) 2 , . . ., ĥ h (t) n g at iteration t.

Two Forms of HWD
Note that the underlying assumption of HWE is strong and HWE does not hold usually.One may consider the following form of HWD, where r is the inbreeding coefficient which is generally positive [21].Note that Equation (3) is reduced to HWE when r~0.We denote this form of HWD as ''inbreeding model (IM)'' for convenient description in this paper.Another form of the departure from HWE was originally proposed by Niu et al. [6] as follows.Assume that the probability of unordered haplotype pair h r h s is proportional to ah 2 r if r~s and 2bh r h s otherwise, with two parameters a and b.Obviously, the HWE assumption holds if a~b.Note that the sum of all these terms for all the m haplotypes at the q loci may not be 1.Then, HWD can be defined as the following form: Let K~a=b.Then, we assume K §1 due to the positive inbreeding coefficient r.We denote this form of HWD as ''Niu's model (NM)'' for convenience.

Likelihood Function and Haplotype-Based HWE Test under Niu's Model
Using Equations ( 2) and ( 4), the log-likelihood function of the complete data under the Niu's model can be expressed as where Y~(K,H).In fact, there is only one additional parameter K included in Equation ( 5), compared to the likelihood function under HWE.So, we propose the following expectation-conditional-maximization (ECM) algorithm to estimate the haplotype frequencies and the parameter K.It consists of one expectation step (E-step) and m conditional-maximization steps (CM-steps) at each iteration.In E-step at iteration (tz1), we can get the following Q function after taking the conditional expectation of Equation ( 5), given the observed genotype data G and current estimate Ŷ where P(Z j ~hr h s jG j ,Y (t) ) is the conditional probability of the haplotype pair h r h s given G j and Ŷ Y (t) , which is 0 if there is no haplotype pair compatible with genotype G j .
In CM-steps, we maximize the Q function in Equation ( 6) to estimate Y. Let Ŷ Y (x=mzt) be the estimate of Y in the x th CM-step among m CM-steps at iteration (tz1).The detailed CM-steps are as follows: N Give the initial value Y (0) ~(K (0) ,H (0) ), where t) in the first CM-step, maximize the Q function by taking the first-order derivation with respect to K so as to get the estimate of K, and then where ) to estimate the haplotype frequencies H. Thus, from the second CM-step to the m th CMstep, h i 's (i~2,3, . . .,m) are estimated step by step and h 1 is then estimated by 1{ ĥ h {x be the set of the haplotype frequency estimates for all the haplotypes but h 1 and h x in the x th CM-step.Then, ~(ĥ h (t) 3 , ĥ h (t) 4 , . . ., ĥ h (t) m ) in the second CM-step for estimating h 2 .As such, in the x th CM-step (x~2,3, . . .,m), by maximizing ), it is shown in Text S1 that a cubic equation with respect to ĥ h (tz1) x is obtained, A ĥ h (tz1) where the coefficients A, B, C and D are, respectively, and the vector E and the matrix F are respectively Moreover, the cubic equation above is alway solvable, and its solution can be obtained by Shengjin's formulas [22].Note that the likelihood function converges no matter which initial values of Y are chosen.So, if there are two or three solutions between 0 and 1, then we can choose the solution which is closer to ĥ h (t)  x in the former step.After this step, tz1) , ĥ h (t) 1 , ĥ h (tz1) 2 , . . ., ĥ h (tz1) x , ĥ h (t) xz1 , . . ., ĥ h (t) m{1 , ĥ ). N Repeat the steps above until the observed log-likelihood function of Equation (1) converges.Equation (1) can be written to be L(Y)~P n j~1 P sj k~1 P(H jk jY) Â Ã under the Niu's Model.Note that HWE holds when K~1 and HWE is violated otherwise.Therefore, a likelihood ratio test (LRT) for HWE is naturally constructed based on the estimated haplotype frequencies as follows, where L( Ŷ Y 0 ) and L( Ŷ Y) are the values of the observed likelihood function under the null hypothesis of HWE and under the HWD alternative, respectively.Obviously, this LRT statistic asymptotically follows a Chi-square distribution with the degree of freedom being 1 when HWE holds.

Likelihood Function and Haplotype-Based HWE Test under Inbreeding Model
Borrowing the idea of Zeng and Lin on how to estimate the haplotype frequencies based on case-control data for testing for association [18], here we rewrite the likelihood function for unrelated individuals under study and then propose a haplotypebased HWE test under the inbreeding model.Let H j be a random variable, which takes values from s j possible haplotype combinations compatible with G j of the j th individual.Suppose that r §0, and W j is a Bernoulli variable with success probability r.Let P(V 1j ~hr =h r )~h r and P(V 2j ~hr =h s )~h r h s , where V 1j and V 2j are discrete random variables, and the haplotype before ''/'' is paternal and haplotype after ''/'' is maternal.So, W j V 1j z(1{W j )V 2j has the same distribution as H j , and we treat W j , V 1j and V 2j as missing.Then, the log-likelihood function of the complete data under the inbreeding model is where W~(r,H).
To estimate the parameters Y in Equation ( 9), the EM algorithm is considered.In E-step, the Q function is ) In M-step, the estimation of W at iteration (tz1) can be obtained by solving the following equation So, r can be estimated by where r r (t) and ĥ h (t) i are the estimates of r and h i at iteration t, respectively.The haplotype frequencies can be estimated by ) where c is a normalizing constant, and and E½(1{W j )I(V 2j ~hi =h r )jG j , Ŵ W (t) can be calculated as follows, E½(1{W j )I(V 2j ~hi =h r )jG j , Ŵ W (t) P m f ~1 P m g~1 (1{r r (t) ) ĥ h (t) f ĥ h (t) g I(V 2j ~hi =h r ) We call this process IEM algorithm for distinguishing it from the previous EM algorithm under HWE.Note that under the IM model, HWE holds when r~0, and HWE is not true when r=0.Therefore, we propose the following LRT to test for haplotype-based HWE, where L( Ŵ W 0 ) and L( Ŵ W) are the values of the observed likelihood function under the null hypothesis of HWE and under the HWD alternative, respectively.Obviously, this LRT statistic asymptotically follows a Chi-square distribution with the degree of freedom being 1 when HWE holds.

Software Implementation
Based on the above EM, ECM and IEM algorithms, we have written a software HAP-HWE to conduct the proposed haplotypebased HWE tests, which is implemented in R (http://www.rproject.org)and is freely available at http://www.echobelt.org/web/UploadFiles/HAP-HWE.html.For each of the EM, ECM and IEM algorithms, let N denote the number of haplotypes that occur in all the possible haplotype combinations compatible with the observed genotypes G in the sample.As such, the initial values of all these N haplotype frequencies are taken as 1=N at t~0.For the ECM and IEM algorithms, the initial values of K and r are taken as 1 and 0.01, respectively.The convergence criterion is that the absolute difference between the estimated values of the loglikelihood function at two consecutive iterations is smaller than The input data file is a standard linkage pedigree file containing pedigree relationship, genotype and phenotype information, with each row being for an individual.The HAP-HWE software will only use the founders in the sample and automatically exclude the nonfounders from the analysis.Further, a haplotype block file is needed with each row representing a haplotype block, which can be easily exported from other existing software, such as Haploview [23].Then, our HAP-HWE software will analyze the haplotype blocks one by one.The usage of the HAP-HWE software and other details refer to Text S2.
Our HAP-HWE software outputs: (i) the convergence processes of the log-likelihood function under the EM, ECM and IEM algorithms, (ii) the haplotypes with frequency estimates being larger than 10 {5 and the associated frequency estimates under the three algorithms, (iii) the estimated value of K, the value of LRT 1 and the corresponding P value under the Niu's model, and (iv) the estimated value of r, the value of LRT 2 and the corresponding P value under the inbreeding model.The output results will be saved in a text file (named ''results.txt'') in the working directory.In addition, like other haplotype frequency estimation methods, our methods also face running time and storage space problems because of the large number of possible haplotypes.In our software, to reduce storage space, each haplotype is represented by an integer, rather than a vector of alleles.

Simulation Settings
To assess the validity and compare the performance of two LRT tests in testing for haplotype-based HWE, we consider three models with three tightly-linked SNPs that can lead to HWD: Niu's model (NM), inbreeding model (IM) and population stratification (PS) model.For both the NM and IM models, the true marginal haplotype distribution is given in Table 1.For the NM model, the value of K is taken from 1.0 to 1.5 in increments of 0.05.Firstly, we calculate the probabilities of all the haplotype combinations from Equation ( 4).Then, one haplotype combination for each individual is randomly chosen.For the IM model, the inbreeding coefficient r is taken from 0 to 0.1 in increments of 0.01.Firstly, we calculate the probabilities of all the haplotype combinations from Equation ( 3), and then one haplotype combination is selected at random for each individual.Finally, we combine these two haplotypes to form the unphased genotype for the individual.To investigate how the population admixture affects the performance of two haplotype-based HWE tests, we consider the following PS model with two subpopulations I and II, where the corresponding haplotype distributions are given in Table 2, respectively.The proportion t of the subpopulation I is taken to be 0.6 and 0.8.
Note that when K~1 and r~0, HWE holds for the NM and IM models, respectively.So, we simulate the type I error rates of the proposed HWE tests when K~1 or r~0, and make power comparison when Kw1 and rw0.The PS model is also used to simulate the powers of both of the tests.For all the models, we generate samples of unrelated individuals at these three loci and the sample size is taken as 500, 1000 and 1500, respectively.The number of simulation replicates is fixed at 1000 and the significance level a is taken to be 5%.
As additional findings in this paper, we can also compare the efficiency of the EM, ECM and IEM algorithms in haplotype inference.The accuracy of haplotype frequency estimates is assessed by the sum of absolute differences (SAD) between the true and estimated frequencies, which was proposed by Fallin and Schork [20] and defined as SAD~X m i~1 jh i { ĥ h i j where h i and ĥ h i are the true and estimated haplotype frequencies of h i , respectively.It ranges from 0 (when the estimation is perfect) to 1.  is close to the nominal 5% level, while the size result of LRT 2 is less than 0.05, when K~1 (i.e.HWE holds).This means that in testing for haplotype-based HWE, LRT 1 controls the size well and LRT 2 is conservative under the NM model.The powers of both LRT 1 and LRT 2 are larger when K increases from 1.1 to 1.5 and the sample size n is fixed.However, LRT 1 is more powerful than LRT 2 .In addition, when K~1 and n is unchanged, the EM, ECM and IEM algorithms perform similarly in the estimation of haplotype frequencies.However, with the increase of the K value, the SAD measure of the ECM algorithm does not have much change and is much smaller than the EM and IEM algorithms.The SADs of the EM and IEM algorithms are very close to each other and become larger when K is larger.On the other hand, with the sample size increasing, the SAD measures of all the three algorithms become less and two proposed LRT tests have more powers.

Simulation Results
Table 4 shows the estimate of r, mean SAD of haplotype frequency estimates, simulated size and powers of two HWE tests for different values of inbreeding coefficient r and different sample sizes n under the inbreeding model.We can see from the table that the mean estimated value r r over 1000 replicates is close to its true value.As shown in Table 3, LRT 1 performs better in controlling the size than LRT 2 under the IM model.However, LRT 2 is more powerful than LRT 1 under this situation.On the other hand, both the EM and IEM algorithms have the same performance and the corresponding SADs are stable across different values taken for r (0 to 0.1) in the estimation of haplotype frequencies.However, the ECM estimate gets larger with the increase of r and performs worse than the EM and IEM estimates.When the sample size is larger, the corresponding SADs appear to be smaller and two proposed LRT tests are more powerful.
Table 5 displays the mean SAD of haplotype frequency estimates and simulated powers of two HWE tests based on 1000 simulation replicates, under the PS model, with the proportion t of subpopulation I being taken as 0.6 and 0.8, and the sample size being fixed at 500, 1000 and 1500.From the table, we find that LRT 2 is more powerful than LRT 1 , irrespective of the t value or the sample size n.In the estimation of haplotype frequencies, the EM and IEM algorithms perform similarly in SAD and have better SADs than the ECM estimate, which signifies that the EM and IEM algorithm are more robust to population stratification than the ECM algorithm.
Table 5. Mean and standard deviation (SD) of K and r estimates, mean of sum of absolute differences (SAD) of haplotype frequency estimates for EM, ECM and IEM algorithms, power comparison of two HWE tests under population stratification model, with the proportion t of subpopulation I being taken as 0.6 and 0.8, and the sample size being fixed at 500, 1000 and 1500.

Application to NARAC Data Set
We apply our HAP-HWE software to the Rheumatoid Arthritis (RA) data set from the North American Rheumatoid Arthritis Consortium (NARAC) [24], which was made available through the Genetic Analysis Workshop 15 [25].In the data set, there are 757 pedigrees comprised of 8017 individuals (2481 founders and 5536 nonfounders), which were genotyped at 5407 SNP markers over the 22 autosomes.In each pedigree, there is at least one affected nonfounder with RA.
Note that information on haplotype blocks is needed prior to the HAP-HWE analysis.In this application, we use the existing software Haploview (version 4.2) [23] to define haplotype blocks, with all the arguments being taken as the default values.Then, 181 haplotype blocks are identified, 150 blocks including 2 SNPs, 19 blocks including 3 SNPs, 7 blocks including 4 SNPs, 1 block including 5 SNPs, 2 blocks including 6 SNPs, 1 block including 9 SNPs and 1 block including 13 SNPs.
On the other hand, HAP-HWE only uses the founders and excludes the nonfounders from the analysis.Further, there is a large proportion of missing genotypes for individuals in the data set.Therefore, the reduced data set used for the HAP-HWE analysis contains only a few founders in the data set.On the average, there are about 295 pedigrees (about 367 unrelated individuals) used for each haplotype block, ranging from 288 to 296 (ranging from 358 to 369).
Table 6 lists the results of the application to the NARAC data set.The significance level is fixed at a~5%.There are 13 haplotype blocks (out of 181) with at least one of the P values of the LRT 1 and LRT 2 being less than 5%.However, after multiple testing based on Bonferroni correction (a 0 ~0:05=1812 :76|10 {4 ), only the seventh haplotype block including 6 SNPs (rs347117, rs383902, rs395601, rs387812, rs347115 and rs610877) on chromosome 15 is statistically significant with the P value of the LRT 1 being 2:36|10 {4 .Figure 1 gives the Haploview LD display for this haplotype block.On the other hand, Min et al. [26] reported that chromosome 15p34 at rs347117 showed a possible linkage peak to RA by using the nonparametric linkage Z score (Z~2:80), which may support our finding.

Discussion
In this paper, we first wrote out two likelihood functions of the observed data based on the NM model and IM model.Then, we developed the ECM algorithm for the NM model to estimate the parameter K and haplotype frequencies and suggested the IEM algorithm for the IM model to estimate the inbreeding coefficient r and haplotype frequencies.Note that K~1 or r~0 means that HWE holds.So, we further proposed two LRT tests to test for haplotype-based HWE.We simulated the HWE, Niu's, inbreeding and population stratification models to assess the validity and compare the performance of these two LRT tests.The simulation results showed that both of the two tests are valid in testing for the haplotype-based HWE.If the Niu's model is true, then LRT 1 is more powerful.While, if the inbreeding model is true, then LRT 2 has better performance in power.Under the population stratification model, LRT 2 is still more powerful.Therefore, if the population model is unknown in practice, LRT 2 is generally recommended due to its good performance.Furthermore, we compared the performance of the EM, ECM and IEM algorithms in estimating the haplotype frequencies.If the true model is the Niu's model, then the ECM algorithm has more accurate estimates of haplotype frequencies than the EM and IEM estimates.However, for all the other simulation settings, the EM algorithm is not so much affected by the departure from HWE, and the EM and IEM algorithms almost have the same performance in controlling SAD, which is less than the ECM estimates.We also demonstrate the practical utility of the proposed methods by the application to the Rheumatoid Arthritis (RA) data

Figure 1 .
Figure 1.Haplotype LD display for the seventh haplotype block on chromosome 15.The red box denotes that the LOD value between any two loci is larger than or equal to 2.0.The numbers in the red boxes are the corresponding values of D 0 and the empty box denotes that D 0 ~1.doi:10.1371/journal.pone.0077399.g001

Table 1 .
Haplotype distribution for Niu's model and inbreeding model.

Table 3 .
Mean and standard deviation (SD) of K and r estimates, mean of sum of absolute differences (SAD) of haplotype frequency estimates for EM, ECM and IEM algorithms, simulated size and powers of two HWE tests for different values of K and n, under Niu's model.

Table 4 .
Mean and standard deviation (SD) of K and r estimates, mean of sum of absolute differences (SAD) of haplotype frequency estimates for EM, ECM and IEM algorithms, simulated size and powers of two HWE tests for different values of r and n, under inbreeding model.

Table 3
lists the estimate of K, mean SAD of haplotype frequency estimates, simulated size and powers of two HWE tests for different values of K and different sample sizes n under the Niu's model.It is shown in the table that the mean estimated value K K over 1000 replicates is close to its true value.The type I error rate of LRT