Testing Hardy-Weinberg Proportions in a Frequency-Matched Case-Control Genetic Association Study

In case-control genetic association studies, cases are subjects with the disease and controls are subjects without the disease. At the time of case-control data collection, information about secondary phenotypes is also collected. In addition to studies of primary diseases, there has been some interest in studying genetic variants associated with secondary phenotypes. In genetic association studies, the deviation from Hardy-Weinberg proportion (HWP) of each genetic marker is assessed as an initial quality check to identify questionable genotypes. Generally, HWP tests are performed based on the controls for the primary disease or secondary phenotype. However, when the disease or phenotype of interest is common, the controls do not represent the general population. Therefore, using only controls for testing HWP can result in a highly inflated type I error rate for the disease- and/or phenotype-associated variants. Recently, two approaches, the likelihood ratio test (LRT) approach and the mixture HWP (mHWP) exact test were proposed for testing HWP in samples from case-control studies. Here, we show that these two approaches result in inflated type I error rates and could lead to the removal from further analysis of potential causal genetic variants associated with the primary disease and/or secondary phenotype when the study of primary disease is frequency-matched on the secondary phenotype. Therefore, we proposed alternative approaches, which extend the LRT and mHWP approaches, for assessing HWP that account for frequency matching. The goal was to maintain more (possible causative) single-nucleotide polymorphisms in the sample for further analysis. Our simulation results showed that both extended approaches could control type I error probabilities. 
We also applied the proposed approaches to test HWP for SNPs from a genome-wide association study of lung cancer that was frequency-matched on smoking status and found that the proposed approaches can keep more genetic variants for association studies.


Introduction
Case-control genetic association studies using unrelated individuals to find genetic variations associated with a particular disease are popular and useful. In a case-control study design, cases are those subjects with the primary disease (e.g., lung cancer, diabetes, breast cancer) and controls are those free of the primary disease. In addition to the case-control status for the primary disease, information about secondary phenotypes is also collected at the time of case-control data collection. We define secondary phenotypes as traits associated with the primary disease of interest (i.e., predictors of the primary disease), such as smoking behavior and body mass index (BMI). In addition to studies of primary diseases, there has been some interest in studying genetic variants associated with secondary phenotypes. Many case-control studies of primary diseases are frequency-matched on the secondary phenotypes. Frequency matching on known confounders is an important and commonly used design in case-control studies [1] to reduce the effects of confounding factors. For example, some lung cancer studies are frequency-matched on smoking behavior, as smoking is a known confounder for the association between lung cancer and other risk factors.
In genetic association studies, the deviation from Hardy-Weinberg proportion (HWP) of each genetic marker is typically assessed as an initial quality check procedure to identify single-nucleotide polymorphisms (SNPs) with questionable genotypes. Genetic markers that deviate from HWP are usually attributed to genotyping errors and are removed from further analysis. In general, the HWP test assumes that the genotypes are sampled from the general population, and therefore, the expected genotype counts in the test should be evaluated from the general population. When the HWP test is conducted in controls only, the observed control counts are compared against the expected control counts. Recent papers [2][3][4] have shown, however, that when the disease in a case-control study is common in the general population, the controls (none of whom have the disease) do not accurately represent the general population. Therefore, using only controls (of primary disease or secondary phenotype) for HWP testing can result in highly inflated type I error probabilities for the primary disease- and/or secondary phenotype-associated SNPs and might lead investigators to discard potential causal SNPs of the disease or secondary phenotype of interest.
Recently, new approaches have been proposed for assessing HWP in the general population for genetic case-control association studies [2][3][4]. The approaches proposed by Li and Li [2] and Yu et al [4] are based on a general likelihood ratio framework. The likelihood-based approach compares the likelihood that is maximized under the alternative hypothesis with the likelihood that is maximized under the null hypothesis (under HWP). Wang and Shete [3] proposed a mixture HWP (mHWP) exact test that uses a mixed sample of cases and controls that mimics the general population. Both the likelihood-based approach and the mHWP exact test can control the type I error rates for genetic variants associated or unassociated with the primary disease.
Both the likelihood-based and mHWP approaches will work if the study of primary disease is not frequency-matched (as shown in the Supporting Information Table S1, Table S2, Table S3 and Table S4). In this situation, individuals with the secondary phenotype are randomly sampled from among the primary disease cases and controls. However, if the case-control study is frequency-matched on the secondary phenotype, individuals with and without the secondary phenotype are not sampled randomly but on the basis of the constraints of the primary disease cases and controls. In this situation, both the likelihood-based approach and the mHWP exact test would lead to the rejection of potential causal variants associated with the primary disease and/or secondary phenotype, which would decrease the likelihood of identifying the causal or associated genetic variants in the follow-up association studies. For example, for the mHWP exact test, although the proportion of the primary disease in the mixture sample would be similar to its prevalence in the general population, the proportion of individuals with the secondary phenotype may not be consistent with the prevalence of the secondary phenotype in the general population, owing to the frequency-matching design. Thus, using the recently proposed approaches to assess HWP in the general population could introduce artificial deviations from HWP and produce inflated type I error rates for primary disease- and/or secondary phenotype-associated genetic markers.
In this article, we show that when a case-control study of primary disease was frequency-matched on the secondary phenotype, all the existing approaches failed to conserve the type I error probabilities. Therefore, we proposed alternative approaches for assessing HWP that account for frequency matching. These approaches extend the likelihood ratio test (LRT) approach [2] and the mHWP exact test [3]. We considered multiple associated and unassociated genetic variants in frequency-matched case-control studies with respect to the secondary phenotype. Simulation studies performed to investigate the performance of the proposed approaches showed that the type I error probabilities were well controlled by both extended approaches. Furthermore, we observed that, between the two approaches, the extended mHWP exact test was more likely to keep potential secondary trait and/or primary disease causal SNPs for further analysis when the secondary phenotype was more common. We also applied the proposed approaches to a real lung cancer case-control genetic association study frequency-matched on smoking behavior.

Materials and Methods
We assumed a diallelic locus with two alleles, A and a. We denoted the three genotypes (AA, Aa, and aa) by a categorical random variable $X$ taking the values 0, 1, and 2, respectively. If the allele frequency of A is $p$ and the allele frequency of a is $1-p$, then the expected genotype frequencies of AA, Aa, and aa are $P_0 = p^2$, $P_1 = 2p(1-p)$, and $P_2 = (1-p)^2$, respectively, assuming HWP in the population. We defined a binary random variable, $D \in \{0, 1\}$, to indicate the case-control status of the primary disease, with 0 representing controls and 1 representing cases. We also defined the status of the secondary phenotype as a binary random variable, $T \in \{0, 1\}$, with 0 representing absence of the secondary phenotype and 1 representing presence of the secondary phenotype. Let $K_{ij}$ denote the joint probability of secondary phenotype $T = i$ and primary disease $D = j$, where $i, j = 0, 1$, in the general population. The prevalence of the primary disease in the general population is denoted as $f_D$. It is easy to see that $f_D = K_{01} + K_{11}$. In our studies, we assumed that the prevalence value and the joint probabilities $K_{ij}$ were known because this information can usually be obtained from the literature or previous studies. We assumed a case-control association study of $N$ individuals, with $m$ controls and $n$ cases, with respect to the primary disease.
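The notation above can be made concrete with a minimal Python sketch. The function names and the example joint probabilities are hypothetical and chosen only for illustration; the formulas themselves are the HWP genotype frequencies and the identity for the disease prevalence stated in the text.

```python
def hwp_genotype_frequencies(p):
    """Expected frequencies (P0, P1, P2) of AA, Aa, aa under HWP,
    where p is the frequency of allele A."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p
    return (p * p, 2.0 * p * q, q * q)


def disease_prevalence(K):
    """Prevalence of the primary disease, f_D = K_01 + K_11,
    with K[(i, j)] = Pr(T = i, D = j)."""
    return K[(0, 1)] + K[(1, 1)]


# Hypothetical joint probabilities (illustration only, not from the study)
K_example = {(0, 0): 0.55, (1, 0): 0.15, (0, 1): 0.15, (1, 1): 0.15}
P0, P1, P2 = hwp_genotype_frequencies(0.4)
f_D = disease_prevalence(K_example)
```

With these illustrative inputs, the three genotype frequencies sum to one and the disease prevalence is the sum of the two joint probabilities with D = 1.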

Extended likelihood ratio test (eLRT) approach
To extend the LRT approach, we followed a strategy similar to that described by Li and Li [2]. To account for both primary disease and secondary trait associations, we considered the conditional probabilities of the genotypes given different primary disease and secondary phenotype status for the likelihood-based approach. We denoted this conditional probability as $\Pr(X = k \mid T = i, D = j) = p_{ij|k} P_k / K_{ij}$, where $i, j = 0, 1$, $k = 0, 1, 2$, and $p_{ij|k} = \Pr(T = i, D = j \mid X = k)$ is the conditional probability that an individual is observed to have primary disease $D = j$ and secondary phenotype $T = i$ given the genotype $X = k$. Given $m$ controls and $n$ cases, we denoted $m_{ki}$ as the number of control subjects with genotype $X = k$ and secondary phenotype $T = i$, and we denoted $n_{ki}$ as the number of case subjects with genotype $X = k$ and secondary phenotype $T = i$. Given the sample data, the likelihood can be written as
$$L = \prod_{k=0}^{2} \left(\frac{p_{11|k} P_k}{K_{11}}\right)^{n_{k1}} \left(\frac{p_{01|k} P_k}{K_{01}}\right)^{n_{k0}} \left(\frac{p_{10|k} P_k}{K_{10}}\right)^{m_{k1}} \left(\frac{p_{00|k} P_k}{K_{00}}\right)^{m_{k0}}.$$
The data are sampled from four trinomial distributions for the genotypes, with each distribution corresponding to one of the blocks of individuals with different primary disease and secondary phenotype status; therefore, 8 parameters at most can be estimated. The above likelihood function involves 15 parameters; therefore, it is necessary to add multiple constraints to the parameters. Let $p_{j|k} = \Pr(D = j \mid X = k)$, with $j = 0, 1$ and $k = 0, 1, 2$, denote the conditional probabilities of an individual with primary disease status $j$ given genotype $k$, which can be written as $p_{j|k} = \sum_{i=0}^{1} p_{ij|k}$. Also, we know that $p_{1|k} + p_{0|k} = 1$ for all $k = 0, 1, 2$. We defined the genotype relative risk for genotype $k$ compared with reference genotype 0 for different scenarios: $r_{ijk} = p_{ij|k} / p_{ij|0}$ and $r_{jk} = p_{j|k} / p_{j|0}$ for $k = 1, 2$, with both $r_{ij0}$ and $r_{j0}$ equal to 1. Because we assumed that the joint probabilities of primary disease and secondary phenotype, $K_{ij}$, were known in advance, the conditional probability $p_{ij|k}$ can be expressed using the joint probability and genotype relative risk as $p_{ij|k} = K_{ij} r_{ijk} / (P_0 + r_{ij1} P_1 + r_{ij2} P_2)$, with $i, j = 0, 1$ and $k = 0, 1, 2$. Similarly, because we fixed the prevalence of the primary disease, the conditional probability $p_{j|k}$ can be expressed as $p_{j|k} = f_D r_{jk} / (P_0 + r_{j1} P_1 + r_{j2} P_2)$, with $j = 0, 1$ and $k = 0, 1, 2$. The final constraint was added for the genotype frequencies, where $P_0 + P_1 + P_2 = 1$. Given these constraints for the parameters, the likelihood function can be re-written as
$$L = \prod_{k=0}^{2} P_k^{\,n_{k1} + n_{k0} + m_{k1} + m_{k0}} \left(\frac{p_{11|k}}{K_{11}}\right)^{n_{k1}} \left(\frac{p_{1|k} - p_{11|k}}{K_{01}}\right)^{n_{k0}} \left(\frac{p_{10|k}}{K_{10}}\right)^{m_{k1}} \left(\frac{1 - p_{1|k} - p_{10|k}}{K_{00}}\right)^{m_{k0}},$$
where
$$p_{11|k} = \frac{K_{11}\, r_{11k}}{1 - P_1 - P_2 + r_{111} P_1 + r_{112} P_2}, \quad p_{10|k} = \frac{K_{10}\, r_{10k}}{1 - P_1 - P_2 + r_{101} P_1 + r_{102} P_2}, \quad p_{1|k} = \frac{f_D\, r_{1k}}{1 - P_1 - P_2 + r_{11} P_1 + r_{12} P_2}$$
for $k = 0, 1, 2$.
We denoted $\hat{K}_{ij}$, $\hat{f}_D$, and $\hat{f}_T$ as the estimated joint probability of secondary phenotype and primary disease and the estimated prevalence values of the primary disease and secondary phenotype in the general population, respectively. This modified likelihood function for hypothesis testing now involves 8 parameters, $\{P_1, P_2, r_{111}, r_{112}, r_{101}, r_{102}, r_{11}, r_{12}\}$. Under the null hypothesis that the genetic variant is in HWP, $P_1 = 2p(1-p)$ and $P_2 = (1-p)^2$, with $p$ as the minor allele frequency (MAF). Thus, the number of parameters needing to be estimated in the likelihood function is 7 under the null hypothesis and 8 under the alternative hypothesis that the genetic variant is not in HWP. Therefore, the eLRT can be performed in a manner similar to the test proposed by Li and Li [2]. The eLRT statistic is defined as $2(\ln \hat{L}_1 - \ln \hat{L}_0)$, where $\hat{L}_1$ is the maximized likelihood under the alternative hypothesis and $\hat{L}_0$ is the maximized likelihood under the null hypothesis. Asymptotically, the test statistic follows a one-degree-of-freedom chi-square distribution under the null hypothesis. To maximize the likelihood, we employed the 'fminsearchcon' function [5] in Matlab, which implements the simplex algorithm.
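To illustrate the last step of the eLRT, the following Python sketch converts two maximized log-likelihoods into the test statistic and its asymptotic p value. It assumes the maximization has already been carried out elsewhere; the function names are hypothetical, and the closed form used for the one-degree-of-freedom chi-square tail, $\Pr(\chi^2_1 > s) = \operatorname{erfc}(\sqrt{s/2})$, is a standard identity rather than something taken from the original study.

```python
import math


def lrt_statistic(loglik_alt, loglik_null):
    """Likelihood ratio statistic 2 * (ln L1_hat - ln L0_hat)."""
    return 2.0 * (loglik_alt - loglik_null)


def chi2_1df_pvalue(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of
    freedom, using P(Z^2 > s) = erfc(sqrt(s / 2)) for standard normal Z."""
    if stat <= 0.0:
        # Numerical optimization can return a slightly negative statistic;
        # treating it as zero is an assumption of this sketch.
        return 1.0
    return math.erfc(math.sqrt(stat / 2.0))
```

For instance, `chi2_1df_pvalue(lrt_statistic(loglik_alt, loglik_null))` would give the p value compared against the nominal significance level.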

Extended mHWP (emHWP) exact test
The basic concept of the extended mHWP (emHWP) is that, given the data of a case-control association study of primary disease frequency-matched on secondary phenotype, we try to construct a mixture sample from the data to represent the general population. In this mixture sample, the proportions of primary disease and secondary phenotype can mimic the prevalence values of primary disease and secondary phenotype in the general population, respectively.
Consider a case-control study with $N$ individuals, $N = N_{00} + N_{10} + N_{01} + N_{11}$, where $N_{ij}$ is the number of individuals in a block of sample data with secondary trait status $i$ and primary disease status $j$, where $i, j = 0, 1$. Let $N_m$ be the sample size of the mixture sample. To mimic the general population, the proportion of individuals in the mixture sample with secondary trait status $i$ and primary disease status $j$ should be consistent with the corresponding joint probability in the general population. Therefore, in the mixture sample, the number of individuals with secondary phenotype $i$ and primary disease $j$ should be $N_m \times \hat{K}_{ij}$ (Figure S1). For each block of individuals with secondary phenotype $i$ and primary disease $j$, the number of individuals in the mixture sample cannot exceed the number in the original dataset. So, $N_m \times \hat{K}_{ij} \le N_{ij}$, and $N_m \le \min(N_{00}/\hat{K}_{00}, N_{01}/\hat{K}_{01}, N_{10}/\hat{K}_{10}, N_{11}/\hat{K}_{11})$. In our study, we chose $N_m = \min(N_{00}/\hat{K}_{00}, N_{01}/\hat{K}_{01}, N_{10}/\hat{K}_{10}, N_{11}/\hat{K}_{11})$ to achieve the largest possible mixture sample size and then randomly sampled $N_m \times \hat{K}_{ij}$ individuals from the blocks of individuals with secondary phenotype $i$ and primary disease $j$. In the mixture sample, the HWP exact p value was evaluated [6]. We then employed the re-sampling procedure to obtain $M$ mixture samples and assess $M$ HWP exact p values, as was done in the original study by Wang and Shete [3]. The empirical distribution-based non-parametric density was constructed on the basis of the $M$ mixture sample p values (please see details of kernel density estimation in [3]). The maximum likelihood estimator (MLE) of this empirical distribution was obtained as the final estimate of the p value for the emHWP exact test in the general population. Simulations were conducted to decide the number of mixture samples $M$, and we selected $M = 500$ in our study.
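The construction of one mixture sample can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, block sizes are rounded down to integers, and the exact test shown is the usual HWP exact test conditional on the observed allele counts, which is assumed here to correspond to the test cited as [6]. The re-sampling over $M$ mixture samples and the kernel density step described in [3] are not shown.

```python
import math
import random


def mixture_block_sizes(N, K):
    """N[(i, j)]: observed block counts; K[(i, j)]: population joint
    probabilities. Returns N_m = min(N_ij / K_ij) and integer block sizes
    floor(N_m * K_ij), none of which exceeds N_ij."""
    N_m = min(N[b] / K[b] for b in N)
    # Small tolerance guards against floating-point results such as 749.9999...
    sizes = {b: min(N[b], math.floor(N_m * K[b] + 1e-9)) for b in N}
    return N_m, sizes


def draw_mixture(blocks, sizes, rng):
    """blocks[(i, j)]: list of genotypes (0, 1, 2) observed in that block.
    Samples sizes[(i, j)] genotypes without replacement from each block."""
    sample = []
    for b, genotypes in blocks.items():
        sample.extend(rng.sample(genotypes, sizes[b]))
    return sample


def hwe_exact_pvalue(n_AA, n_Aa, n_aa):
    """Exact HWP p value conditional on allele counts: the sum of the
    probabilities of all heterozygote counts that are no more likely
    than the observed one."""
    n = n_AA + n_Aa + n_aa
    n_A = 2 * n_AA + n_Aa
    n_a = 2 * n - n_A

    def log_prob(h):
        return (h * math.log(2.0)
                + math.lgamma(n + 1)
                - math.lgamma((n_A - h) // 2 + 1)
                - math.lgamma(h + 1)
                - math.lgamma((n_a - h) // 2 + 1)
                + math.lgamma(n_A + 1) + math.lgamma(n_a + 1)
                - math.lgamma(2 * n + 1))

    # Heterozygote counts must share the parity of the allele count n_A
    hets = range(n_A % 2, min(n_A, n_a) + 1, 2)
    probs = {h: math.exp(log_prob(h)) for h in hets}
    p_obs = probs[n_Aa]
    return min(1.0, sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9)))
```

In use, one would tabulate the genotype counts of each drawn mixture sample, pass them to `hwe_exact_pvalue`, and repeat the draw $M$ times with a seeded `random.Random` instance for reproducibility.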

Simulation studies
We performed simulation studies to investigate the performance of the proposed eLRT and emHWP approaches, and compared the proposed approaches to the existing approaches for HWP testing: the LRT approach proposed by Li and Li [2] and the mHWP exact test proposed by Wang and Shete [3]. We considered four independent SNPs with different associations to the primary disease and/or secondary phenotype. In addition to the genetic risk factors, we also accounted for environmental factors, including sex, ethnicity, and age, in the simulation models. The case-control status was simulated on the basis of two logistic models as follows:
$$\operatorname{logit}(\Pr(T = 1)) = a_0 + a_1 X_1 + a_2 X_2 + a_3 X_3 + a_4 X_4 + a_5 X_{sex} + a_6 X_{ethn} + a_7 X_{age},$$
$$\operatorname{logit}(\Pr(D = 1)) = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4 + b_5 X_{sex} + b_6 X_{ethn} + b_7 X_{age} + b_8 T.$$
In the two logistic models, $X_i$, $i = 1, \ldots, 4$, represent random variables of SNPs, and $X_{sex}$, $X_{ethn}$, and $X_{age}$ represent random variables corresponding to the environmental factors. The first logistic model was used to generate secondary phenotype status given the dataset of realizations of SNPs and environmental factors. Then, the primary disease status was generated using the second logistic model, which was conditional on the values of SNPs, environmental factors, and the secondary phenotype. We defined all the regression coefficients ($a_i$, $i = 1, \ldots, 7$, and $b_i$, $i = 1, \ldots, 8$) and prevalences of the genetic and environmental factors for the purpose of the simulation studies, as listed in Table 1. With these settings, we assumed different associations of genetic variants: (1) SNP 1 is associated with both primary disease and secondary phenotype; (2) SNP 2 is associated with primary disease only; (3) SNP 3 is associated with secondary phenotype only; and (4) SNP 4 is not associated with either primary disease or secondary phenotype. The associations among all the genetic variants, environmental factors, secondary phenotype, and primary disease can be represented by a network structure, as shown in Figure S2 in the Supporting Information.
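The two-stage generation of secondary phenotype and primary disease status can be sketched in Python as below. All numerical values are hypothetical placeholders, since the coefficients and prevalences of Table 1 are not reproduced here; the binary coding of the environmental factors and the dominant coding of the SNPs are likewise assumptions of this sketch.

```python
import math
import random


def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical coefficients (NOT the values of Table 1):
# SNP 1 affects both T and D, SNP 2 only D, SNP 3 only T, SNP 4 neither.
A_COEF = [-1.0, 0.4, 0.0, 0.4, 0.0, 0.1, 0.1, 0.1]       # a_0 .. a_7
B_COEF = [-1.5, 0.4, 0.4, 0.0, 0.0, 0.1, 0.1, 0.1, 0.5]  # b_0 .. b_8


def simulate_individual(rng, a, b, maf=0.4, env_prev=(0.5, 0.5, 0.5)):
    """Returns (genotypes, T, D) for one individual."""
    # Four independent SNPs in HWP: number of minor alleles out of two draws
    genotypes = [sum(rng.random() < maf for _ in range(2)) for _ in range(4)]
    # Dominant coding: 1 if at least one minor allele is carried
    x_snp = [1 if g > 0 else 0 for g in genotypes]
    # Binary environmental factors (sex, ethnicity, age group), assumed coding
    x_env = [1 if rng.random() < q else 0 for q in env_prev]
    x = x_snp + x_env
    # First model: secondary phenotype T
    eta_t = a[0] + sum(c * v for c, v in zip(a[1:], x))
    t = 1 if rng.random() < expit(eta_t) else 0
    # Second model: primary disease D, conditional on T
    eta_d = b[0] + sum(c * v for c, v in zip(b[1:8], x)) + b[8] * t
    d = 1 if rng.random() < expit(eta_d) else 0
    return genotypes, t, d
```

Calling `simulate_individual(random.Random(0), A_COEF, B_COEF)` repeatedly would build up a simulated population from which cases and controls can then be sampled.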
The genotypes of the SNPs were generated with the use of the genotype frequencies assuming the SNPs were in HWP. In the simulation study, we assumed that the SNPs were either common SNPs with an MAF of 40% or less common SNPs with an MAF of 10%. The values of the environmental factors were generated on the basis of their prevalence values. By using different values for the intercept coefficients $a_0$ and $b_0$, we defined different prevalence values for the primary disease and secondary phenotype in the general population, ranging from 10% to 70%, which can represent different common diseases and common secondary traits. We did not study the scenario of a very rare disease or secondary phenotype (e.g., prevalence < 5%) because it has been shown in previous studies [2,3] that the standard approach for testing HWP based on controls only (with respect to the primary disease or secondary phenotype) can work well in those situations.
Given the values of the genetic variants and environmental factors, for each scenario (i.e., one pair of specific $f_D$ and $f_T$), we generated a large amount of data on the population of interest based on the above logistic models. Therefore, the joint probability of primary disease and secondary phenotype, $\hat{K}_{ij}$, $i, j = 0, 1$, can be estimated from the simulated population, which would be used later in the analysis based on the proposed approaches. The case-control status was simulated assuming a dominant genetic model for all genetic variants; however, the proposed approaches are not restricted to a dominant genetic model. When a frequency-matched study based on the secondary trait is considered, the proportions of individuals with the secondary phenotype should be approximately equal in primary disease cases and controls [1]. That is, given that the secondary trait $T$ is a binary random variable, the frequency-matching case-control design can be expressed using the following inequality based on conditional probabilities: $|\Pr(T = 1 \mid D = 0) - \Pr(T = 1 \mid D = 1)| \le c$, where $c$ is a small fraction. We assumed the constant $c = 0.02$ in our study. Therefore, we first randomly sampled the cases of primary disease and estimated the proportion of individuals with the secondary phenotype, $\Pr(T = 1 \mid D = 1)$. According to the estimated $\Pr(T = 1 \mid D = 1)$, the proportion of individuals with the secondary phenotype in primary disease controls, $\Pr(T = 1 \mid D = 0)$, was drawn as a random number from a uniform distribution on $(\Pr(T = 1 \mid D = 1) - c, \Pr(T = 1 \mid D = 1) + c)$. The controls were then sampled to satisfy this proportion, $\Pr(T = 1 \mid D = 0)$. In this way, we simulated 1,000,000 replicates, each with 2,000 primary disease cases and 2,000 controls frequency-matched by the secondary phenotype.
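The frequency-matching step described above can be sketched in Python as follows. The function and argument names are hypothetical, and rounding the number of controls with the secondary phenotype to the nearest integer is an assumption of this sketch.

```python
import random


def sample_matched_controls(case_t, pool_t1, pool_t0, n_controls, c, rng):
    """case_t: list of T values (0/1) for the sampled cases.
    pool_t1, pool_t0: disease-free individuals with T = 1 and T = 0.
    Returns the sampled controls and the target Pr(T = 1 | D = 0)."""
    # Proportion with the secondary phenotype among the sampled cases
    p_case = sum(case_t) / len(case_t)
    # Target proportion among controls, drawn within +/- c of the case value
    target = rng.uniform(p_case - c, p_case + c)
    target = min(1.0, max(0.0, target))
    # Number of controls with T = 1 needed to meet the target proportion
    n_t1 = round(n_controls * target)
    controls = (rng.sample(pool_t1, n_t1)
                + rng.sample(pool_t0, n_controls - n_t1))
    return controls, target
```

The pools must contain at least as many individuals as are requested from them; otherwise `random.sample` raises a `ValueError`.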

Simulation study results
We report the observed type I error probabilities of the different approaches for testing HWP at two different significance levels for all the scenarios (i.e., the different combinations of prevalence values for the primary disease and secondary phenotype). In addition to the 0.05 nominal significance level used for candidate gene association studies, we considered the nominal significance level 0.0001, which is used as a threshold for HWP testing in genome-wide association studies [7]. All the results were evaluated based on 1,000,000 replicates. For the common SNPs (MAF = 40%), the results associated with SNP 1, SNP 2, SNP 3, and SNP 4 are reported in Tables 2, 3, 4, and 5, respectively. For the less common SNPs (MAF = 10%), the results are reported in Tables 6, 7, 8, and 9, respectively. Four existing approaches for testing HWP and the two proposed approaches were studied: LRT_t and mHWP_t are the LRT approach [2] and the mHWP exact test [3], respectively, which use the presence and absence of the secondary phenotype as cases and controls; LRT_d and mHWP_d use the presence and absence of the primary disease as cases and controls; the eLRT approach proposed in this article is an extension of the LRT approach proposed by Li and Li [2]; and the emHWP exact test proposed in this article is an extension of the mHWP exact test proposed by Wang and Shete [3]. Table 2 reports the type I error probabilities of different approaches for testing HWP for SNP 1 (MAF = 40%) at 0.05 and 0.0001 significance levels. SNP 1 was associated with both the primary disease and the secondary phenotype in the simulations. The LRT approach and the mHWP exact test using individuals with presence and absence of the secondary phenotype as cases and controls (LRT_t and mHWP_t) provided similar type I error rates, and neither could control the type I error rate in most of the scenarios. 
Both approaches also performed very similarly when using individuals with presence and absence of the primary disease as cases and controls (LRT_d and mHWP_d); both could control the type I error rate in more scenarios than LRT_t and mHWP_t but still resulted in an inflated type I error rate in many scenarios. Finally, the newly proposed eLRT approach and emHWP exact test both controlled the type I error rate well. For example, when prevalence values of both the primary disease and secondary phenotype were 0.3, given a 0.05 significance level, the type I error rates of the LRT_t, mHWP_t, LRT_d, and mHWP_d approaches were 0.207040, 0.215840, 0.118910, and 0.125500, respectively, which were all highly inflated; the type I error rates of the eLRT and emHWP approaches were 0.050629 and 0.045782, respectively, which agreed very well with the nominal significance value of 0.05. When the nominal significance level was 0.0001 and both f D and f T were set as 0.3, we observed a similar trend in type I errors: the type I error rates of the existing approaches were 0.003054, 0.002029, 0.000867 and 0.000449, respectively, which were highly inflated, whereas the type I error rates of the eLRT and emHWP approaches were 0.000162 and 0.000019, respectively, which agreed very well with the nominal significance value of 0.0001.
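As a minimal sketch of how observed type I error rates of this kind are obtained from simulation replicates, the Python snippet below computes the rejection fraction at a nominal level together with the binomial Monte Carlo standard error expected under exact control. The function names are hypothetical, and rejecting when the p value is less than or equal to the nominal level is an assumed convention.

```python
import math


def empirical_type1_error(p_values, alpha):
    """Fraction of null replicates whose p value is at or below alpha."""
    rejections = sum(1 for p in p_values if p <= alpha)
    return rejections / len(p_values)


def monte_carlo_se(alpha, n_replicates):
    """Binomial standard error of the rejection fraction when the true
    rejection probability equals alpha."""
    return math.sqrt(alpha * (1.0 - alpha) / n_replicates)
```

Comparing the gap between an observed rate and the nominal level with `monte_carlo_se` gives a rough sense of how much of that gap could be attributed to simulation noise.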
When the genetic variant was only associated with the primary disease (SNP 2 , Table 3), the LRT approach and the mHWP exact  test using individuals with presence and absence of primary disease as cases and controls (LRT_d and mHWP_d) could conserve the type I error rates for all the scenarios. This was expected because SNP 2 was associated with the primary disease only, and this assumption was the focus in the original studies of these two approaches [2,3]. However, LRT_t and mHWP_t led to inflated type I error rates. And again, we observed that the type I error rates were well controlled by both of the proposed approaches, eLRT and the emHWP exact test. When the genetic variant was only associated with the secondary phenotype (SNP 3 , Table 4), it was not surprising to see that the LRT approach and the mHWP exact test using individuals with presence and absence of the secondary phenotype as cases and controls (LRT_t and mHWP_t) could control the type I error rates in all scenarios. However, LRT_d and mHWP_d led to inflated type I error rates in many situations. As in the results for SNP 1 and SNP 2 , both the proposed approaches (eLRT and emHWP) still controlled type I error rates well for all scenarios.
Last, when the genetic variant was not associated with the primary disease or the secondary phenotype (SNP 4 , Table 5), all of the approaches controlled the type I error rates well for all scenarios.
Therefore, the results reported in Tables 2, 3, 4, and 5 for the common SNPs (MAF = 40%) show that the proposed eLRT and emHWP approaches could control the type I error rates for all SNPs with different types of associations with primary or secondary phenotypes and all scenarios with different prevalence values. It also should be noted that when the primary disease was less common (e.g., $\hat{f}_D$ = 0.1 to 0.5) and the secondary phenotype was very common (e.g., $\hat{f}_T$ = 0.5 to 0.7), the eLRT approach tended to have a larger type I error rate than the emHWP exact test, which means that the emHWP exact test is more likely to keep the promising genetic variants than the eLRT approach in these situations. It is possible that actual studies of primary disease and secondary phenotype could fall within these ranges of prevalence values. For example, in a study of overweight based on data collected for studying type 2 diabetes, the prevalence of type 2 diabetes (primary disease) was about 10% in the U.S. [8] and the prevalence of overweight was about 66% in the U.S. [9]. In this situation, the emHWP test would be preferable to the eLRT approach. At a very low nominal significance level (0.0001), the eLRT, but not the emHWP, approach had a slightly inflated type I error rate. Thus, the emHWP exact test is also more likely to keep the promising genetic variants than the eLRT approach at a low nominal significance level. When the SNPs of interest were less common (MAF = 10%, Tables 6, 7, 8, and 9), we observed similar trends in the results for all SNPs with different associations. As expected, the inflation in type I error rates of the existing approaches was not as pronounced as that for common SNPs (MAF = 40%, Tables 2, 3, 4, and 5). In particular, we noticed that the LRT_d and mHWP_d approaches could conserve the type I error better in many, but not all, situations. 
For example, for SNP 1 ( Table 6, associated with both the primary disease and secondary phenotype), when the prevalence values of both primary disease and secondary phenotype were 0.3, given a 0.05 significance level, the type I error rates of the different existing approaches were 0.072453, 0.066110, 0.058761, and 0.053680, respectively, whereas the type I error rates of the proposed approaches were 0.049599 and 0.035411, respectively, which were well-controlled at the 0.05 level. The performance of the emHWP exact test for the less common SNPs was very similar to that for the common SNPs at both the 0.05 and 0.0001 significance levels for all scenarios. However, the eLRT approach had inflated type I error rates at the 0.0001 level of significance in some situations for the less common SNPs. This observation further suggested that the emHWP exact test is more favorable than the eLRT approach in these situations. Although the previously proposed LRT approach and mHWP exact test would work for certain SNPs in some scenarios, in reality, the HWP tests are performed before the association tests. Therefore, one would not know the underlying real associations of SNPs with the primary disease and/or secondary phenotype when performing the HWP tests, and the existing approaches might lead to the removal of genetic variants potentially associated with the primary disease and/or secondary phenotype. In contrast, the proposed emHWP test performed uniformly well at controlling the type I error rates for all four SNPs with different associations to the primary disease and secondary phenotype in all scenarios.
We also conducted simulation studies to evaluate the performances of all the approaches to HWP testing for the unmatched case-control study of primary disease and reported the type I error results in Supporting Information Tables S1, S2, S3, and S4. We considered common SNPs with MAF = 40% and defined different prevalence values for primary disease and secondary phenotype in the general population, ranging from 10% to 90%. The type I errors were evaluated at a nominal significance level of 0.05. All the results were based on 1,000 replicates, each with 1,000 primary disease cases and 1,000 randomly sampled controls. As in the frequency-matched case-control studies, the proposed eLRT approach and the emHWP exact test were both able to control the type I error rates for all SNPs and all scenarios in the unmatched case-control studies. Therefore, the proposed eLRT approach and emHWP exact test are robust for different study designs. In addition, the LRT approach and the mHWP exact test using individuals with presence and absence of primary disease as cases and controls also performed well for all SNPs and all scenarios, as expected.

Application to lung cancer data
To examine the performance of the proposed eLRT and emHWP approaches, we also applied them to the case-control study of lung cancer frequency-matched on smoking status. This analysis included 2,291 individuals, with 1,154 lung cancer patients and 1,137 controls frequency-matched to the cases by age, sex, and smoking status [10]. The data were collected for a case-control study of lung cancer. All the case and control subjects were ever smokers: 1,260 former smokers and 1,031 current smokers. All the individuals were Caucasian. Lung cancer cases were accrued at The University of Texas MD Anderson Cancer Center and were histologically confirmed. Controls were ascertained through a multi-specialty physician practice from the same area. Questionnaire data were obtained by personal interview in the original study. This study was approved by the institutional review board at MD Anderson Cancer Center, and all participants provided written informed consent (LAB10-0347). In the lung cancer genome-wide association study, 317,498 tagging SNPs were genotyped [11]. We only included the autosomal SNPs in this study. We further excluded the SNPs with MAF < 0.05, and therefore, 300,738 SNPs were left in the analysis. We were interested in determining how many SNPs would be rejected in the quality check procedure using the different approaches for testing HWP. From the simulation studies, we found that the LRT approach and mHWP exact test performed very similarly; therefore, we only reported the results obtained using the LRT approach with either (1) the presence and absence of lung cancer as cases and controls (LRT_d) or (2) current and former smokers as cases and controls (LRT_t). To evaluate eLRT and emHWP, we first obtained the prevalence values of lung cancer, $\hat{f}_D$, and current smokers, $\hat{f}_T$, in ever smokers from the literature (0.14 and 0.498, respectively) [12,13]. 
We then estimated the conditional probability of lung cancer cases given current smokers in the ever smokers as 0.2545 [12]. Therefore, we could calculate the estimated joint probabilities of lung cancer and smoking status $\hat{K}_{ij}$, where $i, j = 0, 1$, with 1 representing lung cancer patients and current smokers and 0 representing lung cancer-free controls and former smokers. For example, $\hat{K}_{11} = 0.2545 \times 0.14 = 0.0356$ and $\hat{K}_{10} = 0.14 - 0.0356 = 0.1044$. $\hat{K}_{01}$ and $\hat{K}_{00}$ can then be calculated accordingly. Table 10 reports the numbers of SNPs that would be rejected and removed in the quality check procedure using the different HWP testing approaches, including LRT_d, LRT_t, eLRT, and emHWP, at different commonly used nominal significance levels (from 0.005 to 0.000001) in genome-wide association studies. We observed that for all significance levels, the proposed eLRT and emHWP approaches always rejected fewer SNPs than the LRT approach. The emHWP approach always rejected the fewest SNPs, whereas LRT_t always rejected the most SNPs, among all four approaches. For example, when the nominal significance level was 0.0001, 1,121 and 812 SNPs would be rejected and removed by using the LRT_t and LRT_d, respectively, whereas only 798 and 637 SNPs would be rejected and removed by using the proposed eLRT and emHWP approaches, respectively. Compared with the LRT_t approach, the emHWP approach would keep 484 more SNPs for further analysis of the association of lung cancer and/or smoking status.
Table 6. Estimated type I error probability for test of deviation from HWP of SNP 1, a causal SNP to both primary disease and secondary phenotype (MAF = 10%), at 0.05 and 0.0001 significance levels in simulation studies* using different approaches for HWP testing.

Discussion
In this article, we propose two new approaches (eLRT and emHWP) for testing HWP in genetic association studies in which the primary disease cases and controls are frequency-matched on the secondary phenotype. These two approaches extend two recently proposed approaches, the LRT approach [2] and the mHWP exact test [3], to account for the frequency-matching design with respect to the secondary phenotype. When the case-control study of the primary disease is frequency-matched on a secondary phenotype that is correlated with the primary disease, the design can be viewed statistically as the analysis of a single phenotype with four possible categories, and the likelihood function in the eLRT approach was constructed under this view. Similar reasoning was applied in developing the emHWP exact test. Moreover, the proposed approaches can be extended to obtain estimates and standard errors of the allele frequency. The performance of the two approaches was investigated via simulation studies, as well as an analysis of an association study of lung cancer frequency-matched on smoking status. We compared the proposed approaches with the existing LRT approach and mHWP exact test. In our simulation studies, when the study of the primary disease was frequency-matched on the secondary phenotype, the existing LRT and mHWP exact test yielded inflated type I error rates in many scenarios for the different SNPs. In contrast, the newly proposed emHWP approach uniformly and effectively controlled the type I error probability in all scenarios examined for different SNPs associated with the secondary phenotype and/or the primary disease. In some scenarios (when $f_D$ is small and $f_T$ is large), the emHWP approach is more likely than the eLRT approach to keep SNPs associated with the primary disease and/or secondary phenotype in the analysis. 
The performance of the emHWP approach for less common SNPs (MAF = 10%) was similar to that for common SNPs at the different significance levels for all the SNPs. The eLRT approach, on the other hand, behaved slightly differently at a low significance level when the SNPs were less common. It tended to yield inflated type I error rates at a low significance level, especially when the disease was less common but the secondary phenotype was very common (i.e., $f_D = 0.1$ and $f_T = 0.7$) and the SNPs were associated with the primary disease. Therefore, we recommend the emHWP exact test as the better HWP test, as it has a greater chance of keeping potentially associated SNPs for future association analysis.
In reality, the prevalence values of the primary disease and secondary phenotype, as well as their joint distribution, cannot be known for certain. Therefore, we also used simulations to evaluate the robustness of the proposed approaches to the estimated prevalence values and joint distribution. We considered a range of primary disease prevalence values and conditional probabilities centered on the true prevalence and true conditional probabilities. The error terms were defined as 20% of the true values; for example, $\Delta_D = 20\% \times f_D$. The misspecified secondary phenotype prevalence and joint probabilities can be evaluated from the primary disease prevalence and conditional probabilities defined above. All the results were very similar to those obtained using the true prevalence values and joint distribution. Therefore, misspecification of the prevalence values and joint distribution within this range did not inflate the type I error rate of the proposed approaches (as in the previous work [3]). The interactive effects of secondary phenotype and genetic variants on primary disease might have some impact on the test for deviation from HWP for genetic variants, which could be an interesting topic for future study. In addition to the simulation studies, we also applied the eLRT and emHWP approaches to a real case-control genetic association study of lung cancer frequency-matched on smoking status and compared the numbers of rejected SNPs with those obtained using the LRT approach in the quality check procedure. In the original lung cancer study, the lung cancer controls were frequency-matched to the cases by smoking status. The proposed approaches always rejected, and thus removed, fewer SNPs than the LRT approach. We are not claiming that the SNPs kept using our approaches are causal or associated with the primary disease or secondary phenotype, as such claims would require validation by independent studies as dictated by the genome-wide association study guidelines. 
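As an aside on the robustness check described above, the set of misspecified inputs can be sketched as follows. This is a minimal illustration, not the authors' simulation code. It assumes that the perturbed conditional probabilities are those of the secondary phenotype given disease status, that each quantity takes its true value minus, equal to, or plus a 20% error term, and that the hypothetical "true" values passed in at the bottom are placeholders rather than values from the study. The function names are likewise hypothetical.

```python
from itertools import product


def joint_from_conditionals(f_d, p_t_given_d, p_t_given_not_d):
    """Joint probabilities K[(i, j)]: i = disease status, j = secondary phenotype."""
    return {
        (1, 1): f_d * p_t_given_d,
        (1, 0): f_d * (1.0 - p_t_given_d),
        (0, 1): (1.0 - f_d) * p_t_given_not_d,
        (0, 0): (1.0 - f_d) * (1.0 - p_t_given_not_d),
    }


def misspecified_scenarios(f_d, p_t_given_d, p_t_given_not_d, rel_error=0.20):
    """Enumerate inputs perturbed by -Delta, 0, or +Delta (Delta = rel_error x true value)."""
    scenarios = []
    for s_d, s_1, s_0 in product((-1, 0, 1), repeat=3):
        f_d_mis = f_d * (1 + s_d * rel_error)
        p1_mis = p_t_given_d * (1 + s_1 * rel_error)
        p0_mis = p_t_given_not_d * (1 + s_0 * rel_error)
        k = joint_from_conditionals(f_d_mis, p1_mis, p0_mis)
        # The implied secondary phenotype prevalence is the column margin.
        f_t_mis = k[(1, 1)] + k[(0, 1)]
        scenarios.append({"f_D": f_d_mis, "f_T": f_t_mis, "K": k})
    return scenarios


# Hypothetical true values, chosen only so that every perturbed
# probability stays inside [0, 1].
scenarios = misspecified_scenarios(f_d=0.1, p_t_given_d=0.8, p_t_given_not_d=0.6)
print(len(scenarios), "misspecified input sets")
```

Each scenario would then be supplied to the HWP test in place of the true prevalence values and joint distribution, and the resulting type I error rates compared.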
Our main goal is to reduce the chance of filtering out potentially associated SNPs at the data cleanup stage. In other words, the eLRT and emHWP approaches have a higher likelihood of keeping SNPs for further association analysis, and, according to our simulation results, the additional SNPs kept could potentially be associated with either the secondary phenotype or the primary disease. To summarize, in this article, we extended the recently proposed HWP testing approaches, the LRT approach and the mHWP exact test, to frequency-matched case-control studies. We showed that when the study of the primary disease is unmatched, the proposed eLRT and emHWP approaches are robust and provide results similar to those obtained with the existing approaches; when the study of the primary disease is frequency-matched with respect to the secondary phenotype, the proposed approaches are better HWP tests than the existing approaches. For studies frequency-matched on the secondary phenotype, the eLRT and emHWP approaches will improve our ability to keep SNPs potentially associated with the secondary phenotype and/or the primary disease.

Supporting Information
Figure S1 Construction of a mixture sample from the dataset of a case-control study of primary disease. (DOC)
Figure S2 Network structure representing associations among genetic variants, environmental factors, secondary phenotype, and primary disease. (DOC)
Table S1 Estimated type I error probability for test of deviation from HWP of SNP 1, a SNP causal to both primary disease and secondary phenotype (MAF = 40%), at a 0.05 significance level in simulation studies using different approaches for HWP testing. (DOC)