Are Molecular Haplotypes Worth the Time and Expense? A Cost-Effective Method for Applying Molecular Haplotypes

Because current molecular haplotyping methods are expensive and not amenable to automation, many researchers rely on statistical methods to infer haplotype pairs from multilocus genotypes, and subsequently treat these inferred haplotype pairs as observations. These procedures are prone to haplotype misclassification. We examine the effect of these misclassification errors on the false-positive rate and power for two association tests. These tests include the standard likelihood ratio test (LRTstd) and a likelihood ratio test that employs a double-sampling approach to allow for the misclassification inherent in the haplotype inference procedure (LRTae). We aim to determine the cost–benefit relationship of increasing the proportion of individuals with molecular haplotype measurements in addition to genotypes to raise the power gain of the LRTae over the LRTstd. This analysis should provide a guideline for determining the minimum number of molecular haplotypes required for desired power. Our simulations under the null hypothesis of equal haplotype frequencies in cases and controls indicate that (1) for each statistic, permutation methods maintain the correct type I error; (2) specific multilocus genotypes that are misclassified as the incorrect haplotype pair are consistently misclassified throughout each entire dataset; and (3) our simulations under the alternative hypothesis showed a significant power gain for the LRTae over the LRTstd for a subset of the parameter settings. Permutation methods should be used exclusively to determine significance for each statistic. For fixed cost, the power gain of the LRTae over the LRTstd varied depending on the relative costs of genotyping, molecular haplotyping, and phenotyping. The LRTae showed the greatest benefit over the LRTstd when the cost of phenotyping was very high relative to the cost of genotyping. This situation is likely to occur in a replication study as opposed to a whole-genome association study.


Introduction
With the advent of the HAPMAP project [1,2], the popularity of haplotype-based case-control genetic association studies has grown markedly. The alleles present at multiple genetic markers across a given chromosome form a haplotype [3]. It has been suggested that association studies utilizing haplotypes formed from single nucleotide polymorphisms (SNPs) may be more powerful than single locus association [4][5][6][7][8][9][10][11].
Several statistical methods are available to perform tests of haplotype-based case-control association. One method calculates the likelihood of the data in terms of the estimated haplotype frequencies. An alternative method relies on the use of a contingency table containing the case-control counts for each inferred haplotype. The counts in the contingency table can be determined either by inferring phased haplotypes for each individual or by multiplying each haplotype frequency estimate by the total number of haplotypes in the study. Many researchers find the latter method appealing since it applies the same format as the classic genotypic and allelic case-control studies, and explicitly accounts for each phased haplotype. As a result, many researchers employ this method in practice [18,35,[38][39][40]. In the event that all phased haplotypes have been called correctly, this method can provide additional power [41,42]. This situation is analogous to tests of association using allele estimates from individual genotypes as compared with allele frequency estimates from DNA-pooling data [43].
However, misclassifications can lower a study's power and/ or affect the false-positive rate. The act of calling haplotype pairs from multilocus genotypes in the phase-ambiguous situation is similar to the act of dichotomizing continuous measures. Royston et al. document a loss in power when dichotomizing continuous predictor variables in a regression analysis [44]. In the context of our study, a misclassification results when the haplotype pair called for an individual is not the true underlying haplotype pair. Non-differential misclassification occurs when the misclassification rates are the same in cases and controls. When non-differential misclassification exists, the test suffers a loss in power, but the falsepositive rate remains unchanged [45,46]. In contrast, differential misclassification inflates the test's false-positive rate and may diminish its power [47]. We conjecture that in the absence of differential genotype misclassification, all haplotype misclassification is non-differential when haplotype frequency distributions are the same in cases and controls, i.e., under the null hypothesis.
Although there have been several studies aimed at evaluating the accuracy of haplotype inference and haplotype frequency estimation procedures [26,29,30,32,35,37], to our knowledge, no systematic study of the effects of haplotype misclassification has been documented. Thus, the purpose of this work is to address the effects of haplotype misclassification on the false-positive rate and power of commonly used tests of haplotype-based association. Specifically, this research aims to (1) classify the nature of the misclassification present in calling phased haplotypes; (2) determine the appropriateness of using the asymptotic v 2 distribution and permutation methods to evaluate the significance of the test statistics we employ; and (3) compare the power of our test statistic which accounts for haplotype misclassification with the power of the standard likelihood ratio test statistic when the costs are fixed.

Test Statistics
In order to detect an association between a haplotype pair and disease status, we employed two statistical tests on 2 3 n contingency tables where n is the number of haplotype pair categories found by inference. These tests include the standard likelihood ratio test (LRT std ) and a likelihood ratio test that employs a double-sampling approach to allow for the misclassification inherent in the haplotype inference procedure (LRT ae ). The LRT std is a likelihood ratio statistic that treats the called haplotype pairs as observations, and as a result, the likelihood is the multinomial distribution where the called haplotype pairs are the categories [48]. The LRT ae statistic is a likelihood ratio statistic that employs a doublesampling procedure to account for the misclassification present in haplotype inference. On all the individuals in the study, there is a fallible measure [49,50], the haplotype pairs inferred from the multilocus genotypes, and on a subset of these individuals, there is a second measure that is considered to be infallible [49,50], molecular haplotypes. By comparing the fallible data with infallible data, the LRT ae procedure estimates the misclassification rates present in the fallible data and incorporates this information into the likelihood calculation [51]. The details regarding the LRT std and LRT ae statistics including notation and computation are provided in Protocol S1.

Permuted and Asymptotic p-Values
We applied two methods for evaluating the p-value or statistical significance of each statistic. The first method relies on using the central v 2 distribution to find the p-value since, according to statistical theory under the null hypothesis of no association, twice the natural logarithm of the likelihood ratio follows the central v 2 distribution asymptotically for large sample sizes [3,48]. In addition, it has been shown that when Cochran's rule is followed (more than five observations in each cell of the contingency table), the presence of nondifferential misclassification does not affect the distribution of the likelihood ratio test statistics under the null hypothesis of no association [46,51]. The second method employs permutation testing to generate the distribution of the test statistic under the null hypothesis and to determine its statistical significance. In this article, p-values found with the former and latter approaches are referred to as asymptotic pvalues and permutation p-values, respectively.

Description of Data Generation and Analysis
To investigate the behavior of these test statistics for a variety of situations, we applied these statistical tests to many simulated datasets. Figure 1 illustrates the procedure we used to simulate the data and to evaluate the false-positive rate or type I error and power at fixed significance levels for each statistic. For the analysis of each replicate dataset simulated, the multilocus genotype data from cases and controls were pooled to infer haplotype pairs for each individual. These

Synopsis
Localizing genes for complex genetic diseases presents a major challenge. Recent technological advances such as genotyping arrays containing hundreds of thousands of genomic ''landmarks,'' and databases cataloging these ''landmarks'' and the levels of correlation between them, have aided in these endeavors. To utilize these resources most effectively, many researchers employ a genemapping technique called haplotype-based association in order to examine the variation present at multiple genomic sites jointly for a role in and/or an association with the disease state. Although methods that determine haplotype pairs directly by biological assays are currently available, they rarely are used due to their expense and incongruity to automation. Statistical methods provide an inexpensive, relatively accurate means to determine haplotype pairs. However, these statistical methods can provide erroneous results. In this article, the authors compare a standard statistical method for performing a haplotype-based association test with a method that accounts for the misclassification of haplotype pairs as part of the test. Under a number of feasible scenarios, the performance of the new test exceeded that of the standard test.
inferred haplotypes are sufficient for the computation of LRT std ; however, LRT ae requires additional information in the form of molecular haplotypes for a subset of the individuals in the study. Two alternative procedures for selecting individuals for the double sample (individuals with molecular haplotypes in addition to genotypes) were employed. In one selection scheme, individuals were selected randomly. In the other selection scheme, individuals possessing the most ambiguity in their statistically inferred haplotype pairs were prioritized in selecting the double sample. Specifically, we double-sampled those individuals with the smallest posterior probabilities associated with their inferred haplotype pair up to a posterior probability threshold, d, of 0.85 or until the number of individuals specified by the maximum double-sample proportion was reached. Therefore, under this second scheme, the number of individuals double-sampled varied between replicate datasets. In this article, the former and latter procedures for determined the double sample are referred to as random and threshold double-sample selection, respectively.

Two-SNP Scenario
Evaluation of false-positive rate for permutation and asymptotic p-values. For the simplest non-trivial case, the scenario in which the haplotype under evaluation includes two SNPs, we applied a fractional factorial design [52] to perform a comprehensive study of type I error. For the type I error, haplotype pairs were inferred using both SNPHAP v 1.3.1 (see Electronic Database Information) and PHASE v 2.1.1 [27] (see also Electronic Database Information). Table 1 contains the fractional factorial design settings for the study of type I error for the scenario involving two SNP markers. We consider a 1/2(2 k ) fractional factorial design, where k ¼ 6.
Because of redundancy, we were able to reduce the number of experimental runs from 32 to 18. For instance, under the null hypothesis of no association, a run with 1,000 cases and 250 controls is equivalent to a run with 250 cases and 1,000 controls (with all other factors having equal settings to those for the first run). During each run, 10,000 replicate datasets were simulated. We performed the 18 runs with both of the two alternative procedures for selecting the double sample: random and threshold double-sample selection. For the threshold double-sample selection method, d was 0.85, and the maximum double-sample proportion was set to the value of a in the fractional factorial design.
To evaluate each test statistic's ability to maintain the correct type I error, we examined the distribution of the pvalues computed for data simulated under the null hypothesis of no association. We performed two goodness-of-fit tests, the Kolmogorov-Smirnov (KS) and the Anderson-Darling (AD) tests [53] to determine whether the p-values deviate significantly from the standard uniform distribution, and examined the false-positive rate for significant thresholds of 0.05, 0.01, and 0.001.
Evaluation of power for fixed cost. We also evaluated the behavior of these statistics under the hypothesis that an unobserved disease locus exists in linkage disequilibrium (LD) with the haplotype under study. Table 2 contains the factorial design settings for the power study in the scenario involving two SNP markers. The factorial design includes three factors: disease model, genotype relative risk [54] for the homozygote genotype (R 2 ), and the disease allele frequency (DAF). Each factor contains two levels. For the disease model factor, the two levels are a dominant disease model and a multiplicative disease model. The dominant disease model requires that R 2 ¼ R 1 whereas the multiplicative disease model requires that where R 1 and R 2 are the genotype relative risks for the heterozygote and homozygote genotypes, respectively. Specifically, the genotype relative risks are defined as the following. If the penetrances, f i , are defined by f i ¼Pr(affectedji copies of disease allele), where i ¼ 0, 1, or 2, the genotype relative risks, R 1 and R 2 , are defined by R 1 ¼f 1 /f 0 and R 2 ¼f 2 /f 0 , respectively [54].
As with the study of type I error, we inferred the haplotypes for the power simulations with both SNPHAP v 1.3.1 and PHASE v 2.1.1. The proportion of individuals doublesampled, a, for the LRT ae method (random double-sample selection) was set at 0.75. For the threshold double-sample  selection, d was set to 0.85, and the maximum double-sample proportion was 0.75. In the power simulations, the conditional haplotype frequencies were found from the specified disease model parameters by the method described previously [7,55] (also see the PAWE Web site at http://linkage. rockefeller.edu/derek/pawe1.html). However, we selected a specific haplotype to be in LD with the disease locus. During each run, 1,000 replicate datasets consisting of 500 cases and 500 controls were simulated. For these simulations, the disease prevalence was 0.025; the LD between the disease locus and the linked haplotype was 0.9 (measured by D9 [56]); and the population haplotype frequencies were 0.05, 0.15, 0.25, and 0.55. The selection of the specific haplotype in LD with the disease locus depended on the DAF. The haplotype occurring with a frequency most similar to that of the disease allele was selected. Thus, the haplotypes with frequencies of 0.05 and 0.25 were selected as the variant in LD with the disease when the DAF was set at 0.07 and 0.27, respectively. As with the evaluation of the false-positive rate, we performed all eight runs from the factorial design using both random and threshold double-sample selection.
To compare the power of the two test statistics, we evaluated the power of the statistics under fixed cost conditions. Since the LRT ae requires the additional cost associated with obtaining molecular haplotypes on a subset of the samples, we reduced the number of samples when the LRT ae statistic was applied so that the same total cost would be incurred as for the runs with the LRT std . The reduced sample size for the LRT ae sample was computed using Equation 1, where N DS is the sample size for the LRT ae ; N is the sample size for the LRT std ; C p is the cost of phenotyping, C g is the cost of genotyping; r is the cost ratio of molecular haplotyping to genotyping (C mh /C g ); and a is the proportion of individuals in the LRT ae sample that have molecular haplotypes determined (double-sampling proportion). We consider the phenotyping costs, C p , to include costs associated with ascertainment and diagnosis. We illustrate fixed-costs sample sizes for the following example. With settings of C p /C g ¼ 25, r ¼ 5, a ¼ 0.75, and N ¼ 1,000 for the LRT std method, the corresponding total sample size for the LRT ae method, N DS , is 874. The reader should note that the reduced sample size results from the additional cost incurred by double-sampling 75% of the total sample for the LRT ae method. If C p /C g ¼ 1,000, note that this term will dominate the expression in Equation 1, and the fixed-cost sample size, N DS , will not differ greatly from the sample size for the LRT std , N. All power simulations were performed under fixed-cost conditions. Since the doublesample proportion, a, varies from replicate to replicate when the threshold double-sample selection method is employed, we first performed several test runs to determine the mean double-sample proportion, a. Using a, we computed N DS *, the total sample size for the LRT ae determined from the expectation of a. Here, the asterisk is added to indicate that total sample size is computed using the expected value of a, as compared with the random double-sample selection, in which a is a fixed quantity. For a specific disease model, we performed a comprehensive study of the power difference between the LRT ae and LRT std for the situation of a haplotype comprising two SNPs.

Multi-SNP Scenario
Evaluation of false-positive rate and power for fixed costs. Through additional simulations, we investigated the behavior of these statistics when applied to haplotypes comprising larger numbers of SNPs. Because these simulations required additional computational time, we only used SNPHAP v 1.3.1 (see Electronic Database Information) for inferring haplotypes. Our simulations were based on haplotype frequencies from two datasets: (1) a dataset of molecular haplotypes with very high levels of pair-wise LD between markers [14] and (2) a dataset of multilocus genotypes from the TAP2 gene within the major histocompatibility complex, a region with low pairwise LD between markers [1,2] (see also Electronic Database Information), hereafter referred to as the Horan and the HapMap TAP2 datasets, respectively. Figure 2 displays the inter-marker LD for each of these two datasets using GOLD plots [57]. For the Horan dataset, we determined the generating population haplotype frequencies for our simulations directly using the counting method [3]. For the HapMap TAP2 dataset, we found the generating population haplotype frequencies for our simulations indirectly using SNPHAP v 1.3.1 (see Electronic Database Information). In the latter case, haplotype frequencies were estimated from the parents of each trio in the Yoruba population group from the International HapMap Project. For the type I error simulation studies, 1,000 replicate datasets containing 250 cases and 250 controls were simulated. For the type I error runs based on the Horan data and the HapMap TAP2 data, we simulated haplotypes comprising 15 SNPs and ten SNPs, respectively, whereas for the power runs, we simulated haplotypes comprising five SNPs [1,2,14]. Figure 2 specifies the SNPs we used from each of the datasets in the type I error and power runs. For the Horan dataset, we provide the SNP markers' positions (relative to the transcription start site of This table presents the settings for all parameters considered in the power simulations assuming the haplotype under investigation contains two SNP markers. We consider a 2 k factorial design, where k ¼ 3. The dominant disease model requires that R 2 ¼ R 1 while the multiplicative disease model requires R 2 ¼ R 1 2 , where R 1 and R 2 are the genotype relative risks for the heterozygote and homozygote genotypes, respectively. For the random double-sample selection method, the proportion of individuals double-sampled (a) was 0.75 whereas a haplotype pair posterior probably threshold (d) of 0.85 and a maximum double-sample proportion of 0.75 were used for the threshold double-sample selection method. The cost ratio of molecular haplotyping to genotyping (r) was 5. For each combination of settings, 1,000 replicate datasets comprising 500 cases and 500 controls were simulated. The disease prevalence was 0.025; the LD between the disease locus and the linked haplotype was 0.9 (measured by D9); and the population haplotype frequencies were 0.05, 0.15, 0.25, and 0.5. The haplotype with frequency of 0.05 was linked to the disease locus when DAF ¼ 0.07, and the haplotype with frequency 0.25 was linked to the disease locus when DAF ¼ 0. 27 the GH1 gene) whereas for the HapMap TAP2 dataset, we provide the name of the SNP marker. As a result, we simulated haplotypes using 17 haplotype variants with frequencies greater than 0.01 for both the Horan and HapMap TAP2 type I error simulations. In addition, we simulated haplotypes using five and ten haplotype variants with frequencies greater than 1/(2s), where s is the total number of individuals, for the Horan and HapMap TAP2 power simulations, respectively. For each scenario, we normalized the frequencies so that they summed to unity. As with the power studies for the two-SNP scenario, the selection of the specific haplotype in LD with the disease locus depended on the DAF. The rationale for the selection procedure is provided in the Results section addressing multi-SNP power. For multi-marker type I error and power studies, we employed both the random and threshold double- sample selection methods in computing the LRT ae statistic. When the random double-sample selection method was used, the double-sample proportion, a, was 0.75. When the threshold double-sample method was used, the setting of d ¼ 0.85 was used, and the maximum proportion of individuals included in the double-sample was 0.75.
Identifying the nature of haplotype pair misclassification. For all the simulations performed, we recorded the details of the misclassifications that occurred. Specifically, for every replicate, we computed the misclassification rates, h j9j ¼ Prð observed haplotype pair classification is j j true haplotype pair classification is j9Þ; where j96 ¼j are integers ranging from 1 to the maximum number of haplotype pairs [51]. Previous research studying genotype misclassification rates in tests of genotypic association provides the motivation for ascertaining these values [58,59]. This notation is also used in Protocol S1.

Two-SNP Scenario
Our results for type I error and power were almost identical from the simulations utilizing SNPHAP v 1.3.1 and PHASE v 2.1.1 for the haplotype inference. Although we present graphs and tables that display the results provided by SNPHAP v 1.3.1 for the haplotype inference, the reader should note that similar results were found using PHASE v 2.1.1.
Evaluation of false-positive rates for permutation and asymptotic p-values. The type I error simulations demonstrated that the approach for determining statistical significance is critical for maintaining the correct false-positive rate. Although the KS and Anderson-Darling (AD) test results indicated that the distribution of permutation p-values was consistent with the standard uniform distribution, they also indicated that the distributions of asymptotic p-values did not resemble the standard uniform distribution. These results were reinforced by the false-positive rates we found. For all the simulation runs displayed in Table 1, Figure 3 shows the false-positive rate for a significance threshold of 0.05 for LRT std and LRT ae (using the random and threshold doublesample selection methods) association tests in which statistical significance was indicated by permutation and asymptotic p-values. The graph shows that asymptotic p-values for LRT ae are anti-conservative whereas those for LRT std fluctuate between conservative and anti-conservative values. In contrast, the permutation p-values for both statistics consistently maintain the nominal significance level of 0.05. We found that the asymptotic and permuted p-values demonstrated similar behavior for significance thresholds of 0.01 and 0.001 (unpublished data). SNPHAP v 1.3.1 was used for the haplotype inference for the simulation results displayed in the graph. These results are not surprising since several simulation parameter settings have expected cell counts of less than five counts, violating Cochran's rule [60].
Evaluation of power for fixed cost. Based on the results for the false-positive rates, we conclude that power can only be evaluated using the permutation p-values. We compare the power of LRT ae (using the random and threshold doublesample selection methods) to LRT std . Table 3 presents summary statistics for the power difference (LRT ae power À LRT std power) at various significance levels for the two cost ratios C p /C g ¼ 25 and C p /C g ¼ 1,000 using the eight parameter settings from the factorial design ( Table 2). Note that in all runs, we set the cost ratio of molecular haplotyping to  Table 1. For all 18 runs, LRT ae was computed with the random and threshold double-sample selection methods. When the threshold double-sample method was used to compute LRT ae , the setting of d ¼ 0.85 was used, and the maximum proportion of individuals included in the double sample was the value for a specified by the fractional factorial design. SNPHAP v 1.3.1 was used for the haplotype inference for the simulation results displayed in the graph. DOI: 10.1371/journal.pgen.0020127.g003 genotyping, r, to be 5, and the proportion of individuals to be double-sampled, a, to be 0.75 (for the random double-sample selection method). The values reported correspond to the simulations utilizing SNPHAP v 1.3.1.
For the random double-sample selection method, the minimum power difference observed occurred when C p /C g ¼ 25 for a dominant disease model with R 2 ¼ 2 and DAF ¼ 0.27 at a significance level of 0.01. For these settings, the LRT ae power was 0.544 and LRT std power was 0.606. The maximum power difference observed occurred when C p /C g ¼ 1,000 for a dominant disease model with R 2 ¼ 3.5 and DAF ¼ 0.07 at a significance level of 0.001. For these settings, the LRT ae power was 0.910 and LRT std power was 0.775.
For the threshold double-sample selection method, the minimum power difference observed occurred when C p /C g ¼ 25 for a dominant disease model with R 2 ¼ 2 and DAF ¼ 0.27 at a significance level of 0.05. For these settings, the LRT ae power was 0.821 and LRT std power was 0.831. The maximum power difference observed occurred when C p /C g ¼ 1,000 for a dominant disease model with R 2 ¼ 2 and DAF ¼ 0.07 at a significance level of 0.05. For these settings, the LRT ae power was 0.573 and LRT std power was 0.411.
Power difference as a function of double-sample proportion and cost ratio. In the spirit of response surface analysis for factorial design [52], we performed a more thorough analysis of the parameter settings that provided the maximum power difference with LRT ae computed with the random double-sample selection method. These parameter settings are a dominant disease model with R 2 ¼ 3.5 and DAF ¼ 0.07. These setting provided the additional benefit of power results greater than 75% for both the LRT ae and LRT std methods at the 0.05, 0.01, and 0.001 significance levels for both cost ratios of C p /C g ¼ 25 and C p /C g ¼ 1,000. The analysis involved computation of the LRT ae with the random doublesample selection method. Figure 4 displays the two-dimensional contour plot of the power difference between the LRT ae and the LRT std as a function of r, the cost ratio of molecular haplotyping to genotyping, and a, the proportion of individuals double-sampled. These power differences are computed for the fixed parameter settings of C p /C g ¼ 25 ( Figure 4A) and C p /C g ¼ 1,000 ( Figure 4B) at significance level ¼ 0.001 for the disease model described immediately above.
The values of r considered in the contour plots are 1, 5, 10, 25, and 50 whereas the values of a considered are 0.25, 0.50, 0.75, and 1.0. One should note that a ¼ 1.0 indicates that all individuals in the study are double-sampled regardless of phase ambiguity. Simulations were performed with 1,000 replicates and 10,000 permutations for each combination of parameters, and SNPHAP v 1.3.1 was used for the haplotype inference. The sample size for the LRT std , N, was 1,000 (equal numbers of cases and controls). Figure 4A shows that the LRT ae provides a power advantage over the LRT std when r is less than 10 and a is greater than 0.5. The maximum power gain is 0.16 and occurs when r and a are 1.0. Conversely, when the r is greater than 10, LRT ae is less powerful than LRT std for these parameter settings. The maximum power loss is 0.58 and occurs when r is 50 and a is 1.0. Note that for these values, the total sample available for the LRT ae method, N DS (Equation 1), is 342 whereas the total sample available for the LRT std method, N, is 1,000. Figure 4B illustrates that LRT ae is always at least as powerful as the LRT std when C p /C g ¼ 1,000. We observe the minimum power gain of 0.02 when r is 50 and a is 0.25 and the maximum power gain of 0.17 when r and a are 1.0. Furthermore, Figure 4B indicates that for any cost ratio, r, increasing the double-sampling proportion, a, always increases the power gain with the maximum power gain occurring when a ¼ 1.0.

Multi-SNP Scenario
Evaluation of false-positive rates for permutation and asymptotic p-values. Table 4  Evaluation of power for fixed cost. In our power study for haplotypes comprising five SNPs, we again used the disease model parameter settings that provided the maximum power difference (LRT ae power À LRT std power) for the two-SNP factorial design (Table 2) with LRT ae computed using random  This table presents summary statistics for the power difference between the LRT ae and LRT std methods (p-values evaluated using permutation) at the 0.05, 0.01, and 0.001 significance levels. Results are shown for LRT ae computed using both the random and threshold double-sample selection methods. The methods are compared for fixed costs where the power for LRT ae is computed under two conditions: (1) the cost ratio of phenotyping to genotyping (C p /C g ) is 25 and (2) the cost ratio of phenotyping to genotyping (C p /C g ) is 1,000. The sample size for LRT std , N, is 1,000 (500 cases and 500 controls). For the LRT ae statistic, settings of a ¼ 0.75 (random double-sample selection method) and r ¼ 5 were used. When the threshold doublesample selection method was used to compute LRT ae , the setting of d ¼ 0.85 was used, and the maximum proportion of individuals included in the double-sample was 0.75. Haplotype pairs were inferred using SNPHAP v 1.3.1. DOI: 10.1371/journal.pgen.0020127.t003 double-sample selection. These parameter settings are a dominant disease model with R 2 ¼ 3.5 and DAF ¼ 0.07. We based the population haplotype frequencies on the Horan and HapMap TAP2 datasets as described in the Methods section. For each dataset, we selected the haplotype with frequency closest to 0.05 as the haplotype in LD with the disease locus. By this choice of haplotype, we approximated the frequency of the linked haplotype for the two-SNP scenario (see Material and Methods section) when DAF ¼ 0.07. As with the two-SNP power study, the LD between the disease locus and the linked haplotype was 0.9 (measured by D9) [56]. The cost ratio of molecular haplotyping to genotyping (r) was 5. When the random double-sample selection method was used to compute LRT ae , the double-sample proportion (a) was 0.75. When the threshold double-sample method was used to compute LRT ae , the setting of d ¼ 0.85 was used, and the maximum proportion of individuals included in the doublesample was 0.75. For the Horan dataset, the power estimates for the LRT std and the LRT ae were almost identical at the 0.05, 0.01, and 0.001 significance levels for cost ratios (C p /C g ) of both 1,000 and 25 (unpublished data). The high pair-wise intermarker LD present in the Horan dataset causes the haplotype inference to occur with almost complete fidelity. In the absence of misclassification, the LRT ae statistic reduces to the LRT std . Therefore, the high degree of similarity in power for these statistics is not surprising.
For the HAPMAP TAP2 dataset, Table 5 displays the power estimates and the corresponding 95% confidence intervals (CIs) for the LRT std and LRT ae methods at the 0.05, 0.01, and 0.001 significance levels assuming fixed costs. When C p /C g ¼ 1,000, the LRT ae provides a substantial power benefit over the LRT std with the power difference ranging from 6% and 7% at a significance level of 0.05, to 14% and 21% at a significance level of 0.001 for random double-sample selection and threshold double-sample selection, respectively. When C p /C g ¼ 25, the advantage of the LRT ae over the LRT std is still substantial for threshold double-sample selection, but more modest for random double-sample selection. For the three significance levels under investigation, the power difference ranged from 7% to 22%, and 1% to 3.5% for threshold and random double-sample selection, respectively.
We found that the median power gain of the LRT ae over the LRT std for the threshold double-sample selection method was consistently greater than that for the random doublesample selection method for the runs associated with the factorial design settings displayed in Table 2 and the HAPMAP TAP2 power simulations (see Tables 3 and 5). Furthermore the power gain for the threshold double-sample selection method occurred for either setting of C p /C g . For the threshold double-sample selection method, a was small (less than 21%) in our simulations so that our computed N DS * values had a minimum of 963 individuals.

Discussion
In practice, few researchers employ molecular haplotyping techniques in genetic case-control studies. The absence of a high-throughput procedure relative to current SNP genotyping technologies is arguably the main reason that this methodology is not more widely used. Another related reason is the cost in terms of both the time and money associated with employing this methodology. Our research suggests that the additional costs involved in molecular haplotyping may be worth the effort, especially if the cost of phenotyping is high relative to the cost of genotyping for a study. Ji et al. found analogous results for the effects of genotype misclassification on genotypic test of association [61]. In practice, this situation arises for replication studies. A genome-wide scan involving thousands of SNP markers along with subsequent fine mapping in an initial set of case and control individuals may identify a number of promising regions for follow-up studies. These follow-up or replication studies involve recruiting an independent sample of cases and controls for which only SNPs in the promising regions will be genotyped [62]. In replication studies for complex traits, the cost ratio of phenotyping to genotyping may be on the order of thousands. For these situations, the LRT ae for testing haplotype association should provide the most utility. It is interesting to note, however, that applying the threshold double-sample selection method provided comparable powers for both high and low phenotyping to genotyping cost ratios. This finding suggests that this selection strategy may provide additional power for an initial genome-wide association study, as well as for a replication study.
One potential limitation of these test statistics that we selected is the increase in degrees of freedom associated with using haplotype pairs rather than individual haplotypes. In general, larger degrees of freedom may result in a loss of power. That is, methods that fully account for uncertainty in the phase-assignment process [11,63,64] may be more powerful than LRT ae because the LRT ae method examines haplotype pairs rather than single haplotypes and therefore has more degrees of freedom. We chose these statistics for the following reasons: (1) The most general misclassification model involves modeling errors in haplotype pairs rather than in individual haplotypes [51,65,66]. (2) When haplotype pair frequencies deviate from Hardy-Weinberg Equilibrium in either case or control sample populations, test statistics that use single haplotype frequencies may increase falsepositive rates and/or lose power [67,68]. (3) In contrast with  methods that use single haplotype frequencies, the Cochran-Armitage Linear Test of Trend maintains the nominal falsepositive rate and does not lose power [68][69][70]. To our knowledge, a version of this test that incorporates doublesampling procedures to correct for haplotype miscalls does not currently exist. A point for further research involves identifying the scenarios that produce differential and non-differential haplotype pair misclassification, respectively, as well as identifying the effects of each kind of misclassification on type I error and power. Under the null hypothesis that haplotype frequency distributions are equal in case and control populations, theoretical and simulation studies (including ours in this work) suggest that misclassification is non-differential. Under the alternative hypothesis, it is conceivable that haplotype pair misclassification rates may be different in case and control populations. Although recent research [47,71] indicates that differential misclassification increases the type I error, the effects of differential misclassification on the power of these statistics are unclear.
Although the current perception may be that molecular haplotyping costs are not cost-effective, recent publications suggest that for relatively small regions of the genome, accurate molecular haplotyping is no more expensive than performing fluorescent polymerase chain reactions [18]. In addition, current techniques are able to provide molecular haplotypes for an entire chromosome at a cost ratio (C mh /C g ) of approximately 5 (C. Ding, personal communication). Finally, as technology improves, the costs associated with molecular haplotyping will likely decrease, and the throughput will likely increase.

Conclusion
In this work, our simulations showed that the misclassification present in calling phased haplotypes from multilocus genotypes using statistical methods is complete. That is, each misclassified haplotype pair is consistently misclassified as the same incorrect haplotype pair throughout the entire dataset. In addition, our simulations under the null hypothesis of no association demonstrate that applying the theoretical v 2 distribution to evaluate the significance of test statistics produces conservative and anticonservative p-values whereas applying permutation methods consistently produces pvalues that maintain the nominal false-positive rate. Consequently, permutation methods should be exclusively used to determine statistical significance for the tests we perform. As expected, the LRT ae provides the greatest advantage in terms of power over the LRT std in situations in which more haplotype misclassification errors are present. These situations arise when the haplotype under investigation comprises many SNP markers with low pair-wise intermarker LD.
For fixed costs, the power gain of the LRT ae over the LRT std varied depending on the relative costs of genotyping, molecular haplotyping, and phenotyping. In general, the LRT ae showed the greatest benefit over the LRT std when the cost of phenotyping was very high relative to the cost of genotyping. This situation is likely to occur in a candidate gene replication study as opposed to a genome-wide association study. For intermediate phenotyping to genotyping cost ratios (e.g., C p /C g ¼ 25), the LRT ae may still provide a power advantage if the cost ratio of molecular haplotyping to genotyping is low (C mh /C g , 10 for a ! 0.5). Currently, inexpensive long-range PCR methods for molecular haplotyping are under development. As technology improves leading to less-expensive molecular haplotyping methods, the LRT ae will become applicable to a wider set of circumstances.

Electronic Database Information
The documentation for SNPHAP and PHASE can be found at http://www-gene.cimr.cam.ac.uk/clayton/software and http:// www.stat.washington.edu/stephens/software.html, respectively.
Data for the estimation of haplotype frequencies from SNP markers within the TAP2 gene were downloaded from http:// www.hapmap.org/downloads/index.html.en (HapMap public release #16c.1).
LRT ae software is available for free download from ftp:// linkage.rockefeller.edu/software/lrtae.