PLoS Genetics (Public Library of Science, San Francisco, USA). doi:10.1371/journal.pgen.1000741
Research Article. Subjects: Genetics and Genomics/Complex Traits; Genetics and Genomics/Genetics of Disease

On the Analysis of Genome-Wide Association Studies in Family-Based Designs: A Universal, Robust Analysis Approach and an Application to Four Genome-Wide Association Studies

Sungho Won^{1,2}, Jemma B. Wilk^{3}, Rasika A. Mathias^{4}, Christopher J. O'Donnell^{5,6}, Edwin K. Silverman^{7,8,9}, Kathleen Barnes^{10}, George T. O'Connor^{11}, Scott T. Weiss^{7,9,12}, Christoph Lange^{9,12,13,*}

1 Department of Statistics, Chung-Ang University, Seoul, Korea
2 Research Center for Data Science, Chung-Ang University, Seoul, Korea
3 Department of Neurology, Boston University School of Medicine, Boston, Massachusetts, United States of America
4 Genometrics Section, Inherited Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, Baltimore, Maryland, United States of America
5 National Heart, Lung, and Blood Institute and Framingham Heart Study, Bethesda, Maryland, United States of America
6 Cardiology Division, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts, United States of America
7 Channing Laboratory, Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts, United States of America
8 Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Boston, Massachusetts, United States of America
9 Harvard Medical School, Boston, Massachusetts, United States of America
10 Department of Medicine, School of Medicine, Johns Hopkins University, Baltimore, Maryland, United States of America
11 Pulmonary Center, Boston University School of Medicine, Boston, Massachusetts, United States of America
12 Center for Genomic Medicine, Brigham and Women's Hospital, Boston, Massachusetts, United States of America
13 Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, United States of America

Editor: Nicholas J. Schork, University of California San Diego and The Scripps Research Institute, United States of America
* Email: clange@hsph.harvard.edu
Conceived and designed the experiments: SW CL. Performed the experiments: SW. Analyzed the data: SW JBW RM CJO EKS KB GTO STW. Contributed reagents/materials/analysis tools: SW. Wrote the paper: SW CL.
The authors have declared that no competing interests exist.
Published November 26, 2009. PLoS Genet 5(11): e1000741. Received May 18, 2009; accepted October 26, 2009.
This is an open-access article distributed under the terms of the Creative Commons Public Domain declaration, which stipulates that, once placed in the public domain, this work may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose.
For genome-wide association studies in family-based designs, we propose a new, universally applicable approach. The new test statistic exploits all available information about the association, while, by virtue of its design, it maintains the same robustness against population admixture as traditional family-based approaches that are based exclusively on the within-family information. The approach is suitable for the analysis of almost any trait type, e.g. binary, continuous, time-to-onset, multivariate, etc., and combinations of those. We use simulation studies to verify all theoretically derived properties of the approach, estimate its power, and compare it with other standard approaches. We illustrate the practical implications of the new analysis method by an application to a lung-function phenotype, forced expiratory volume in one second (FEV1), in four genome-wide association studies.
Author Summary
In genome-wide association studies, the multiple testing problem and confounding due to population stratification have been intractable issues. To avoid sensitivity to population stratification, family-based designs have considered only the transmission of genotypes from founders to non-founders, which leads to a loss of information. Here we propose a novel analysis approach that combines the mutually independent FBAT and screening statistics in a robust way. The proposed method is more powerful than the other family-based methods, while it preserves the complete robustness of family-based association tests, which by themselves achieve much lower power. Furthermore, the proposed method is virtually as powerful as population-based approaches/designs, even in the absence of population stratification. By its nature, the proposed method is robust as long as the FBAT is valid, and it achieves optimal efficiency if our linear model for the screening test reasonably explains the observed data in terms of covariance structure and population admixture. We illustrate the practical relevance of the approach by an application to four genome-wide association studies.
The CAMP Genetics Ancillary Study is supported by U01 HL075419, U01 HL65899, P01 HL083069, R01 HL086601, and T32 HL07427 from the National Heart, Lung, and Blood Institute, National Institutes of Health. CL is supported by the National Institutes of Health grant R01MH081862. Framingham Heart Study genotype and phenotype data are publicly available through the NHLBI's SNP Health Association Resource (SHARe) initiative (http://public.nhlbi.nih.gov/GeneticsGenomics/home/share.aspx). The British 1958 Birth Cohort DNA collection is funded by the Medical Research Council grant G00000934 and the Wellcome Trust grant 068545/Z/02. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Introduction
During the analysis phase of genome-wide association studies, one is confronted with numerous statistical challenges. One of them is the decision about the “right” balance between maximizing statistical power and, at the same time, remaining robust against confounding. In family-based designs, the possible range of analysis options spans from a traditional family-based association analysis [1]–[4], e.g. TDT, PDT, FBAT, to the application of population-based analysis methods that have been adapted to family data [1]–[3]. By definition, the first group of approaches is completely immune to population admixture and model misspecification of the phenotype, and can be applied to any phenotype that is permissible in the family-based association testing framework (FBAT [4]–[6]). The second category of approaches maximizes the statistical power by a population-based analysis: the phenotypes are modeled as a function of the genotype, and population-based methods such as genomic control [7],[8], STRUCTURE [9] and EIGENSTRAT [10] are applied to account for the effects of population admixture and stratification. Hybrid approaches that combine elements of both population-based and family-based analysis methods, e.g. the Van Steen algorithm [11] and the Ionita weighting schemes [12],[13], have been suggested to bridge the two types of analysis strategies. In contrast to other methods that combine family data and unrelated samples [14]–[17], such hybrid testing strategies maintain the two key features of the family-based association tests: the robustness against confounding due to population admixture and heterogeneity, and the analysis flexibility of the approach with respect to the choice of the target phenotype. Such two-stage testing strategies utilize the information about the association at a population level, the between-family component, to prioritize SNPs for the second step of the approach, in which they are tested formally for association with a family-based test. 
The hybrid approaches can achieve power levels that are similar to approaches in which standard population-based methods are applied to family data, but the optimal combination of the two sources of information (the between-family component and the within-family component) is not straightforward in the hybrid approaches.
In this communication, we propose a new family-based association test for genome-wide association studies that combines all sources of information about association, the between-family and the within-family information, into one single test statistic. The new test is robust against population admixture even though both components, the between-family and the within-family components, are used to assess the evidence for association. The approach is applicable to all phenotypes or combinations of phenotypes that can be handled in the FBAT approach, e.g. binary, continuous, time-to-onset, multivariate, etc. [4]–[6],[18]. While the correct model specification for the phenotypes will increase the power of the proposed test statistic, misspecification of the phenotypic model does not affect the validity of the approach. Using extensive simulation studies, we verify the theoretically derived properties of the test statistic, assess its power and compare it with other standard approaches. An application to the Framingham Heart Study (FHS) illustrates the value of the approach in practice. A new genetic locus for the lung-function phenotype FEV1 (forced expiratory volume in the first second) is discovered and replicated in three independent genome-wide association studies.
Methods
We assume that in a family-based association study, n family members have been genotyped at m loci with a genome-wide SNP chip. For each marker locus, a family-based association test is constructed based on the offspring phenotype and the within-family information. The within-family information is defined as the difference between the observed genetic marker score and the expected genetic marker score, which is computed conditional upon the parental genotypes or the sufficient statistic [19] under the assumption of Mendelian transmissions. We denote the family-based association test for the ith marker locus by FBAT_{i}. Such an FBAT statistic can be the standard TDT, an FBAT for quantitative/qualitative traits, FBAT-GEE for multivariate traits, etc. [4],[6],[18],[20],[21]. Similarly, for the ith marker, the between-family information can be used to assess the evidence for association at a population-based level by computing a Van Steen-type [11] “screening statistic” T_{i}. The screening statistic is computed based on the offspring phenotype and the parental genotypes/sufficient statistic. The statistic T_{i} can be either a Wald test for the genetic effect size that is estimated based on the conditional mean model [22], or the estimated power of the family-based test FBAT_{i} [23]. However, while the FBAT_{i} statistic is robust against population stratification, the screening statistic T_{i} is susceptible to confounding. For this reason, the Van Steen-type testing strategies have used the between-family information only as weights for the p-values of the FBAT statistic, but never as a component of the test statistic itself.
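The within-family component described above can be illustrated with a minimal sketch for trios and an additive genotype coding (0/1/2 copies of the minor allele). It is not the FBAT software: the function name `fbat_z` and the use of the sample mean as phenotype offset are assumptions made for illustration. Under Mendelian transmission, the expected offspring genotype given the parents is the mean of the two parental genotypes, and each heterozygous parent contributes a variance of 1/4.

```python
import math

def fbat_z(phenotypes, child, father, mother, offset=None):
    """Z-score of a simple additive FBAT for trios; genotypes are coded 0/1/2."""
    if offset is None:
        # Assumption for this sketch: centre the phenotype at the sample mean.
        offset = sum(phenotypes) / len(phenotypes)
    u = 0.0  # sum of T_j * (X_j - E[X_j | parents])
    v = 0.0  # sum of T_j^2 * Var(X_j | parents)
    for y, x, gf, gm in zip(phenotypes, child, father, mother):
        t = y - offset
        expected = (gf + gm) / 2.0                 # Mendelian expectation
        variance = 0.25 * ((gf == 1) + (gm == 1))  # 1/4 per heterozygous parent
        u += t * (x - expected)
        v += t * t * variance
    # Families with two homozygous parents are uninformative (variance 0).
    return u / math.sqrt(v) if v > 0 else 0.0
```

Because the statistic conditions on the parental genotypes, only transmissions from heterozygous parents contribute, which is the source of the robustness against population stratification.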
Construction of an overall family-based association test including the population-based and family-based components
In order to construct a family-based association test that incorporates both the within-family and the between-family information, the Z-statistics that correspond to the p-values of FBAT_{i} and T_{i} are computed. The statistic Z_{α}^{*} is the α quantile of the standard normal distribution. pFBAT_{i} and pT_{i} are the p-value of the FBAT statistic and the one-sided p-value of the screening statistic, where the direction of the one-sided screening statistic is defined by the directionality of FBAT_{i}. Based on the statistical independence of FBAT_{i} and T_{i} [11] under the null hypothesis, we can obtain an overall family-based association test statistic Z_{i} by combining both Z-statistics in a weighted sum, Z_{i} = w_{FBAT}Z_{FBAT_{i}} + w_{T}Z_{T_{i}}, where Z_{FBAT_{i}} and Z_{T_{i}} denote the two Z-statistics and the parameters w_{FBAT} and w_{T} are standardized weights so that the overall family-based association test Z_{i} has a normal distribution with mean 0 and variance 1, i.e. w_{FBAT}^{2}+w_{T}^{2} = 1. In the literature, this approach of combining two test statistics is known as the Liptak method [24]. However, the Liptak method differs here from its standard application in that the two test statistics have to be combined so that confounding in the screening statistic T_{i} cannot affect the validity of the overall family-based association test statistic Z_{i}. In the context of a genome-wide association study (GWAS), we are able to achieve this goal by using rank-based p-values for the screening statistic T_{i} instead of their asymptotic p-values.
The “screening statistics” T_{i} are sorted based on their evidence for association so that T_{(m)} denotes the screening statistic with the least amount of evidence for association and T_{(1)} the screening statistic with the largest amount of evidence for association. The rank-based p-value, (i – 0.5)/m, is then assigned to the ordered screening test statistic T_{(i)}. If there is a tie, the average of the ranks is used for the computation of the rank-based p-value. Since the null hypothesis will be true for the vast majority of the SNPs in a GWAS, the rank-based p-values provide an alternative way to assess the significance of the population-based screening statistic T_{i}. The overall association test Z_{i} is then computed based on the Z-score for the asymptotic p-value of the FBAT statistic and the Z-score for the rank-based p-value of the screening statistic T_{i}. In Text S1 we show that the overall association test Z_{i} maintains the global significance level α in all situations, including population admixture and stratification. This can be understood intuitively as well. The smallest rank-based p-value is 0.5/m. Using the Bonferroni correction to adjust for multiple testing, the individual, adjusted significance level is α/m, which will always be smaller than the smallest rank-based p-value, 0.5/m, unless the prespecified global significance level α is greater than 0.5. This implies that the overall family-based association test can never achieve genome-wide significance based on the rank-based p-values alone. The FBAT statistic has to contribute evidence for the association as well in order for the overall family-based association test to reach genome-wide significance. Finally, we have to address the specification of the weights w_{FBAT} and w_{T} in the overall family-based association test statistic Z_{i}. 
While any combination of weights w_{FBAT} and w_{T} will provide a valid test statistic Z_{i}, the most powerful overall statistic Z_{i} is approximately achieved when the ratio of the weights is equal to the ratio of the standardized effect sizes, i.e. the expected effect size of the regression coefficient divided by its (estimated) standard deviation. For quantitative traits in unascertained samples, one can show that optimal power levels are achieved for equal weights, i.e. w_{FBAT} = w_{T}. In general, the equal weighting scheme seems to provide good power levels for any disease mode of inheritance and for different trait types, e.g. binary traits, time-to-onset, etc. The theoretical derivation of optimal weighting schemes for such scenarios is ongoing research and will be published subsequently.
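The rank-based p-values and the weighted combination can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function names are hypothetical, and both p-values are treated as one-sided upper-tail p-values, whereas the text aligns the direction of the one-sided screening statistic with that of FBAT_{i}.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def rank_based_pvalues(screen_stats):
    """Assign (rank - 0.5)/m; the largest statistic gets rank 1, ties share the average rank."""
    m = len(screen_stats)
    order = sorted(range(m), key=lambda k: -screen_stats[k])
    pvals = [0.0] * m
    pos = 0
    while pos < m:
        end = pos
        while end + 1 < m and screen_stats[order[end + 1]] == screen_stats[order[pos]]:
            end += 1
        avg_rank = (pos + 1 + end + 1) / 2.0  # mean of the 1-based ranks in the tie group
        for k in order[pos:end + 1]:
            pvals[k] = (avg_rank - 0.5) / m
        pos = end + 1
    return pvals

def liptak_overall(p_fbat, p_rank, w_fbat=0.5 ** 0.5, w_t=0.5 ** 0.5):
    """Return (Z_i, upper-tail p-value) for weights with w_fbat^2 + w_t^2 = 1."""
    z_fbat = -_STD_NORMAL.inv_cdf(p_fbat)  # Z-score matching the FBAT p-value
    z_t = -_STD_NORMAL.inv_cdf(p_rank)     # Z-score matching the rank-based p-value
    z = w_fbat * z_fbat + w_t * z_t
    return z, _STD_NORMAL.cdf(-z)
```

With equal weights, m = 500,000 and no within-family evidence (`p_fbat = 0.5`), the best-ranked SNP (`p_rank = 0.5/m`) gives an overall p-value far above the Bonferroni threshold 0.05/m, in line with the intuition that the screening statistic alone cannot produce genome-wide significance.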
Furthermore, it is important to note that, instead of the Liptak method, Fisher's method for combining p-values could have been used as well to construct an overall family-based association test, which would have the same robustness properties as the overall test based on the Liptak method. However, simulation studies (data not shown) suggest that the highest power levels are consistently achieved with the Liptak method. We therefore omit the approach based on Fisher's method here.
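For comparison, Fisher's method can be sketched in a few lines. The statistic −2Σ ln p_{k} follows a chi-square distribution with 2K degrees of freedom under the null hypothesis when the K p-values are independent, and for even degrees of freedom the tail probability has a closed form. The function name is hypothetical.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Fisher's combined p-value for independent p-values (chi-square with 2K df)."""
    s = -sum(math.log(p) for p in pvalues)  # half of the chi-square statistic
    term, total = 1.0, 1.0
    for j in range(1, len(pvalues)):
        term *= s / j                        # s^j / j!
        total += term
    return math.exp(-s) * total
```

Unlike the Liptak method, this combination has no weights, so the relative contribution of the within-family and between-family components cannot be tuned.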
Results
Type I error for 500K GWAS
In the first part of the simulation study, the type-1 error of the proposed family-based association test, denoted as LIP, was assessed in the absence and in the presence of population admixture. In all our simulations, we use the Wald test based on the conditional mean model [22] with the between-family component for pT_{i}. For various scenarios, we verified that the proposed overall family-based association test maintains the α-level.
For simplicity, we assume in the simulation studies that random samples are given, i.e. no ascertainment, and that the parental genotypes are known. Assuming Hardy-Weinberg equilibrium, the parental genotypes are generated by drawing from Bernoulli distributions defined by the allele frequencies. The offspring genotypes are obtained by simulated Mendelian transmissions from the parents to the offspring. For the jth trio, the offspring phenotype Y_{j} is simulated from a normal distribution with mean aX_{j} and variance 1, i.e. N(aX_{j}, 1), where the parameter a represents the genetic effect size and the variable X_{j} denotes the offspring genotype. Under the null hypothesis of no association, the genetic effect size parameter a is set to 0.
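The data-generating steps described above can be sketched for a single marker as follows. This is an illustrative sketch, not the simulation code used in the study; the function name and the tuple layout are assumptions.

```python
import random

def simulate_trios(n_trios, allele_freq, effect=0.0, rng=None):
    """Simulate (father, mother, child, phenotype) tuples for one marker."""
    rng = rng or random.Random()
    trios = []
    for _ in range(n_trios):
        # Hardy-Weinberg equilibrium: each parent carries two independent Bernoulli alleles.
        father_alleles = [int(rng.random() < allele_freq) for _ in range(2)]
        mother_alleles = [int(rng.random() < allele_freq) for _ in range(2)]
        # Mendelian transmission: each parent passes on one allele chosen at random.
        child = rng.choice(father_alleles) + rng.choice(mother_alleles)
        # Phenotype Y_j ~ N(a * X_j, 1); effect = 0 corresponds to the null hypothesis.
        phenotype = rng.gauss(effect * child, 1.0)
        trios.append((sum(father_alleles), sum(mother_alleles), child, phenotype))
    return trios
```

Passing a seeded `random.Random` instance makes a replicate reproducible, e.g. `simulate_trios(2000, 0.3, rng=random.Random(1))`.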
For scenarios in which population admixture is present, we assume that the admixture is created by the presence of two subpopulations whose phenotypic means differ by 0.2. The allele frequencies for each marker in the two subpopulations are generated by the Balding-Nichols model [25]. That is, for each marker, the allele frequency p in an ancestral population is generated from a uniform distribution between 0.1 and 0.9, U(0.1, 0.9). Then, the marker allele frequencies for the two subpopulations are independently sampled from the beta distribution with parameters (p(1−F_{ST})/F_{ST}, (1−p)(1−F_{ST})/F_{ST}) for all markers in each replicate of the simulated GWAS. A survey reported F_{ST} estimates with a median of 0.008 and a 90th percentile of 0.028 among Europeans; the corresponding values are 0.027 and 0.14 among Africans, and 0.043 and 0.12 among Asians [26]. The value for Wright's F_{ST} was assumed to be 0.05, 0.1, 0.2, or 0.3. Each trio was assigned to one of the two subpopulations with 50% probability.
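The Balding-Nichols sampling step can be sketched with the standard library alone. The function name and return layout are hypothetical; the beta parameters follow the description above, which gives each subpopulation frequency mean p and variance F_{ST}p(1−p).

```python
import random

def balding_nichols_freqs(n_markers, fst, n_subpops=2, rng=None):
    """Return a list of (ancestral_freq, [subpopulation freqs]) per marker."""
    rng = rng or random.Random()
    scale = (1.0 - fst) / fst
    markers = []
    for _ in range(n_markers):
        p = rng.uniform(0.1, 0.9)  # ancestral allele frequency
        subpop = [rng.betavariate(p * scale, (1.0 - p) * scale)
                  for _ in range(n_subpops)]
        markers.append((p, subpop))
    return markers
```

Larger values of `fst` shrink `scale` and therefore spread the subpopulation frequencies further from the ancestral value.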
In the absence and presence of population stratification (F_{ST} = 0.05, 0.1, 0.2, and 0.3), Table 1 shows the empirical type-1 error rates of the overall association test statistic Z_{i} for a GWAS with 500,000 SNPs. The estimates for the empirical significance levels in Table 1 are based on 2,000 replicates. The empirical genome-wide significance level is calculated as the proportion of replicates for which the minimum p-value among the 500,000 markers is less than 0.05/500,000. We consider the proposed equal weights for w_{FBAT} and w_{T} at the genome-wide significance level 0.05, and Table 1 shows that the type-1 error rate is preserved well. For different significance levels, we calculate in Table 2 the empirical proportions of SNPs for which the overall family-based association test Z_{i} is significant at the α-levels of 0.05, 0.01, 10^{−3}, 10^{−4} and 10^{−5}. The simulation studies are conducted in the absence and in the presence of population admixture. Table 2 does not provide any evidence for a departure of the empirical significance levels from the theoretical levels, both in the absence and presence of population substructure. These results confirm our theoretical conclusion that Z_{i} is robust against population stratification and maintains the correct type-1 error.
Table 1. Empirical type-1 error for 500K GWAS at genome-wide significance level 0.05.

| F_{ST} | Empirical error rate |
|---|---|
| 0.00 | 0.0505 |
| 0.05 | 0.0395 |
| 0.10 | 0.0425 |
| 0.20 | 0.0450 |
| 0.30 | 0.0445 |
Table 2. Average of empirical proportion at 500K GWAS.

| F_{ST} | c = 5×10^{−2} | c = 1×10^{−2} | c = 1×10^{−3} | c = 1×10^{−4} | c = 1×10^{−5} |
|---|---|---|---|---|---|
| 0.00 | 5.00×10^{−2} | 9.97×10^{−3} | 9.91×10^{−4} | 9.86×10^{−5} | 9.66×10^{−6} |
| 0.05 | 5.00×10^{−2} | 9.97×10^{−3} | 9.91×10^{−4} | 9.85×10^{−5} | 9.76×10^{−6} |
| 0.10 | 5.00×10^{−2} | 9.96×10^{−3} | 9.88×10^{−4} | 9.78×10^{−5} | 9.79×10^{−6} |
| 0.20 | 4.99×10^{−2} | 9.95×10^{−3} | 9.87×10^{−4} | 9.76×10^{−5} | 9.60×10^{−6} |
| 0.30 | 4.98×10^{−2} | 9.92×10^{−3} | 9.82×10^{−4} | 9.68×10^{−5} | 9.40×10^{−6} |
In the next set of simulation studies, we assess the effects of local population stratification on the overall family-based association test. We generate local population stratification under the following assumptions: there are two subpopulations, G_{1} and G_{2}, which differ from each other in two marker regions. We assume that a subject can come from any of the four possible combinations at the two regions, i.e. (G_{1}, G_{1}), (G_{1}, G_{2}), (G_{2}, G_{1}) and (G_{2}, G_{2}). The two regions consist of 10K and 90K SNPs, respectively, and if subjects are from the same subpopulation in a genetic region, their assumed allele frequencies of the markers in that region are equal. For example, the allele frequencies of each marker in marker region 1 are the same for samples in (G_{1}, G_{1}) and (G_{1}, G_{2}), but they are different for (G_{1}, G_{1}) and (G_{2}, G_{2}). In the simulation study, we generate the parental genotypes based on these allele frequency assumptions and obtain the offspring genotypes based on simulated Mendelian transmissions. Using the Balding-Nichols model, we considered F_{ST} values of 0.001, 0.005, 0.01 and 0.05 in the simulation studies. The offspring's phenotype was generated under the null hypothesis, but we assumed that each subpopulation stratum had a different phenotypic mean: 0 for (G_{1}, G_{1}), 0.2 for (G_{1}, G_{2}), 0.4 for (G_{2}, G_{1}) and 0.6 for (G_{2}, G_{2}). Each replicate consists of 2,000 trios with an equal number of trios for all four possible combinations. The data were analyzed with the proposed overall family-based association test and with standard linear regression after adjusting for population admixture with EIGENSTRAT [10]. 
For EIGENSTRAT, we applied the principal component analysis to the mean of the paternal and maternal genotypes at each locus, because the parents of each offspring are from the same subpopulation. The residuals obtained from regressing the offspring genotypes and the phenotypes, respectively, on the eigenvectors are then used to calculate the generalized Armitage trend test [27]. Table 3 provides the empirical type-1 error for both analysis approaches based on 2,000 replicates. While EIGENSTRAT exhibits an inflated type-1 error, the proposed overall family-based test maintains the theoretical significance level.
Table 3. Average of empirical proportion at 100K GWAS.

| Method | F_{ST} | c = 5×10^{−2} | c = 1×10^{−2} | c = 1×10^{−3} | c = 1×10^{−4} | c = 1×10^{−5} |
|---|---|---|---|---|---|---|
| EIGENSTRAT | 0.001 | 5.07×10^{−2} | 1.02×10^{−2} | 1.04×10^{−3} | 1.05×10^{−4} | 1.02×10^{−5} |
| EIGENSTRAT | 0.005 | 5.44×10^{−2} | 1.17×10^{−2} | 1.36×10^{−3} | 1.72×10^{−4} | 2.45×10^{−5} |
| EIGENSTRAT | 0.01 | 5.86×10^{−2} | 1.39×10^{−2} | 2.09×10^{−3} | 3.62×10^{−4} | 7.57×10^{−5} |
| EIGENSTRAT | 0.05 | 8.20×10^{−2} | 3.24×10^{−2} | 1.32×10^{−2} | 6.58×10^{−3} | 3.39×10^{−3} |
| LIP | 0.001 | 5.00×10^{−2} | 9.99×10^{−3} | 9.93×10^{−4} | 9.89×10^{−5} | 9.70×10^{−6} |
| LIP | 0.005 | 5.00×10^{−2} | 9.99×10^{−3} | 1.00×10^{−3} | 1.01×10^{−4} | 1.00×10^{−5} |
| LIP | 0.01 | 5.00×10^{−2} | 9.99×10^{−3} | 9.97×10^{−4} | 9.96×10^{−5} | 9.99×10^{−6} |
| LIP | 0.05 | 5.00×10^{−2} | 9.98×10^{−3} | 9.94×10^{−4} | 9.89×10^{−5} | 9.98×10^{−6} |
Empirical power in simulated 500K GWAS for quantitative traits
For the analysis of quantitative traits, Table 4 provides the empirical power for a 500K GWAS from 2,000 replicates when there is no population stratification. Under the assumption of an additive disease model for a quantitative trait, the genetic effect, a, is given as a function of the heritability, h^{2}, the minor allele frequency, p_{D}, and the phenotypic variance, σ^{2}, by: a = σh/[2p_{D}(1−p_{D})(1−h^{2})]^{0.5}. In the simulation study, we assume heritabilities of h^{2} = 0.001, 0.005, 0.01 and 0.015 for 2,000, 2,500 and 3,000 trios. The allele frequency of the disease locus, p_{D}, is 0.3 and the phenotypic variance is 1. We compare the achieved power levels of the proposed overall family-based association test, Z_{i}, with the weighting approach by Ionita-Laza et al. [12], the original Van Steen approach [11], the QTDT approach [28] and a population-based analysis, i.e. linear regression of the phenotype Y on the genotype X. Bonferroni correction is used to adjust for multiple testing in the population-based analysis, FBAT, QTDT and the proposed method. The results in Table 4 suggest that the proposed association test achieves power levels that represent a major improvement over the existing methods for family-based association tests (Van Steen [11] or Ionita-Laza [12]). Our approach reaches nearly the same power levels as the population-based analysis. For the power comparisons shown in Figure 1, Figure 2, and Figure 3, the number of trios is assumed to be 1,000 in a 500K GWAS and the empirical power is calculated based on 10,000 replicates at an α-level of 0.001 for all genetic models. The results confirm that Liptak's method combining T_{i} and FBAT_{i} has similar power to the population-based method, and that the choice of equal weights performs well. 
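The effect-size formula can be checked numerically with a short sketch. The function names are hypothetical, and the sketch assumes that σ^{2} acts as the residual variance of a phenotype simulated as N(aX, σ^{2}), which is the reading under which the formula inverts exactly: the genotype X has variance 2p_{D}(1−p_{D}) under Hardy-Weinberg equilibrium, so h^{2} = a^{2}·2p_{D}(1−p_{D}) / (a^{2}·2p_{D}(1−p_{D}) + σ^{2}).

```python
import math

def additive_effect(h2, p, sigma=1.0):
    """Effect size a = sigma * h / sqrt(2 p (1 - p) (1 - h^2))."""
    return sigma * math.sqrt(h2 / (2.0 * p * (1.0 - p) * (1.0 - h2)))

def implied_heritability(a, p, sigma=1.0):
    """Share of phenotypic variance explained by an additive locus with effect a."""
    genetic_var = a * a * 2.0 * p * (1.0 - p)
    return genetic_var / (genetic_var + sigma * sigma)
```

For example, `additive_effect(0.01, 0.3)` gives an effect of roughly 0.155 per allele, and feeding that value back into `implied_heritability` recovers h^{2} = 0.01.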
The simulation results in Table 4 also suggest that the QTDT [28] approach achieves power levels similar to those of the standard FBAT approach, which is consistent with previously reported findings in the literature [29]. However, both standard FBAT and QTDT are still much less powerful than the proposed overall family-based association test. Table 5 shows the empirical power for a GWAS with 100,000 SNPs in the presence of population stratification. For the parameters of this simulation study, we assume F_{ST} = 0.001, 0.005, 0.01, and 0.05, and an additive mode of inheritance at the disease locus with values for the heritability of h^{2} = 0.005, 0.01 and 0.015. The disease allele frequency p_{D} in the ancestral population is assumed to be 0.3. The phenotypic data are simulated so that the phenotypic means for the two subpopulations are 0 and 0.2, respectively. Each individual/trio is assigned to either subpopulation with probability 0.5. The parental genotypes are used to estimate the ancestry for EIGENSTRAT as before. Various methods have been suggested to adjust for population stratification in population-based designs, and we compare the proposed method with the EIGENSTRAT approach [10]. In order to maximize the power of the proposed method, we apply the EIGENSTRAT approach to the population-based component T_{i} of our approach, i.e. principal component analysis based on the parental genotypes and the offspring's phenotype is integrated into the generalized Armitage test for T_{i} [27]. To keep the power comparisons unbiased, the population-based components of the approaches by Van Steen and Ionita-Laza are also adjusted for population admixture, using the EIGENSTRAT approach. The results in Table 5 show that the proposed test statistic Z_{i} is considerably more powerful than the population-based analysis adjusted with EIGENSTRAT. QTDT is slightly more powerful than FBAT, but, as in Table 4, it is much less powerful than LIP. 
This suggests that EIGENSTRAT should be applied only to the between-family component in family-based association studies. Our unpublished work showed that the proposed approach can be less powerful than the combination of population-based analysis and EIGENSTRAT if pT_{i} is calculated from the conditional mean model [11],[22] without adjusting for population stratification.
Figure 1. Empirical power at 0.001 significance level for additive disease.
POP is the empirical power of the standard population-based method. T is the empirical power of the Wald test based on the conditional mean model [22] for between-family components. LIP is the empirical power of the combined p-values with Liptak's method. In this figure, the curves for FBAT and T overlap completely.
Figure 2. Empirical power at 0.001 significance level for dominant disease.
POP is the empirical power of the standard population-based method. T is the empirical power of the Wald test based on the conditional mean model [22] for between-family components. LIP is the empirical power of the combined p-values with Liptak's method. In this figure, the curves for FBAT and T overlap completely.
Figure 3. Empirical power at 0.001 significance level for recessive disease.
POP is the empirical power of the standard population-based method. T is the empirical power of the Wald test based on the conditional mean model [22] for between-family components. LIP is the empirical power of the combined p-values with Liptak's method. In this figure, the curves for FBAT and T overlap completely.
Table 4. Empirical power for GWAS under no population stratification.

| N_{trio} | h^{2} | POP | FBAT | QTDT | LIP | VAN | ION |
|---|---|---|---|---|---|---|---|
| 2,000 | 0.001 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 2,000 | 0.005 | 0.0200 | 0.0025 | 0.0010 | 0.0185 | 0.0080 | 0.0130 |
| 2,000 | 0.01 | 0.2085 | 0.0125 | 0.0180 | 0.1955 | 0.0990 | 0.1505 |
| 2,000 | 0.015 | 0.5725 | 0.0765 | 0.0150 | 0.5350 | 0.3045 | 0.4515 |
| 2,500 | 0.001 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 2,500 | 0.005 | 0.0385 | 0.0030 | 0.0030 | 0.0370 | 0.0155 | 0.0210 |
| 2,500 | 0.01 | 0.3970 | 0.0430 | 0.0430 | 0.3760 | 0.2025 | 0.2960 |
| 2,500 | 0.015 | 0.8135 | 0.1420 | 0.1790 | 0.7995 | 0.5525 | 0.7380 |
| 3,000 | 0.001 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 3,000 | 0.005 | 0.0740 | 0.0020 | 0.0070 | 0.0675 | 0.0325 | 0.0495 |
| 3,000 | 0.01 | 0.5720 | 0.0810 | 0.0855 | 0.5495 | 0.3175 | 0.4710 |
| 3,000 | 0.015 | 0.9175 | 0.2665 | 0.3265 | 0.8980 | 0.7055 | 0.8630 |
Table 5. Empirical power for GWAS under population stratification.

| F_{ST} | h^{2} | FBAT | QTDT | LIP | VAN | ION | EIG |
|---|---|---|---|---|---|---|---|
| 0.001 | 0.005 | 0.0000 | 0.0010 | 0.0083 | 0.0000 | 0.0000 | 0.0000 |
| 0.001 | 0.010 | 0.0000 | 0.0030 | 0.1157 | 0.0826 | 0.1157 | 0.0579 |
| 0.001 | 0.015 | 0.0000 | 0.0085 | 0.3884 | 0.2975 | 0.3471 | 0.2562 |
| 0.005 | 0.005 | 0.0000 | 0.0000 | 0.0083 | 0.0083 | 0.0083 | 0.0083 |
| 0.005 | 0.010 | 0.0000 | 0.0020 | 0.0909 | 0.0579 | 0.0661 | 0.0661 |
| 0.005 | 0.015 | 0.0083 | 0.0080 | 0.3223 | 0.2810 | 0.3140 | 0.1901 |
| 0.01 | 0.005 | 0.0000 | 0.0015 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 0.01 | 0.010 | 0.0000 | 0.0010 | 0.0909 | 0.0826 | 0.0579 | 0.0331 |
| 0.01 | 0.015 | 0.0083 | 0.0135 | 0.3636 | 0.2975 | 0.3388 | 0.2645 |
| 0.05 | 0.005 | 0.0000 | 0.0000 | 0.01653 | 0.0330 | 0.0248 | 0.0000 |
| 0.05 | 0.010 | 0.0083 | 0.0035 | 0.0992 | 0.0744 | 0.0826 | 0.0165 |
| 0.05 | 0.015 | 0.0165 | 0.0080 | 0.3140 | 0.2645 | 0.2727 | 0.2066 |
Application to a genome-wide association study in the Framingham Heart Study
For the assessment of the severity of pulmonary diseases, the volume of air that a subject can blow out within one second after taking a deep breath is an important endophenotype. It is referred to as the forced expiratory volume in one second (FEV1). FEV1 is an important measure of lung function, and we apply the proposed method to a family-based GWAS of FEV1. The proposed method is applied to the 550K GWAS data set of the Framingham Heart Study (FHS) for FEV1, and we then check whether the selected SNPs replicate in the British 1958 Birth Cohort (BBC), another population sample, as well as in two samples of asthmatics, the Childhood Asthma Management Program (CAMP) [30] and an Afro-Caribbean group of families from Barbados (ACG) [31]. In FHS, 9,274 subjects were genotyped and 10,816 subjects had at least one FEV1 measurement. Of the 8,637 participants with both genotyping and FEV1 measures, only those with a call rate of 97% or higher were included. We adjusted for the covariates age, sex and the quadratic term of height, which are known to be associated with FEV1. For the within-family component, the FBAT statistic for quantitative traits was applied. Markers were excluded from the analysis if the number of informative families was less than 20, or the minor allele frequency was less than 0.05. In total, 306,264 SNPs were used for analysis and, based on this number of SNPs, the rank-based empirical p-values, pT_{i}, and the genome-wide significance level with Bonferroni correction were obtained. With n and n_{inf} denoting the total number of individuals and the number of informative trios, respectively, the ratio n_{inf} : (2n−n_{inf}) is used for the weights of Z_{i} because some of the parental phenotypes are available.
Table 6 shows the p-values for the top 10 SNPs from the proposed method. In our analysis, the Bonferroni-adjusted threshold for the genome-wide significance level 0.05 is 1.636×10^{−7}, and our results show that only the first-ranked SNP, rs805294, is significant, at the genome-wide level 0.2 with Bonferroni correction. For rs805294, we also checked the significance in the other data sets, BBC, CAMP [30] and ACG [31]. In CAMP, 1,215 subjects in 422 families were genotyped and there are 488 informative trios for rs805294; in ACG, there were only 33 informative trios (Table 7). In the BBC, 1,372 unrelated subjects were genotyped with the Affymetrix chip and 1,323 unrelated subjects were genotyped with the Illumina chip. In CAMP and ACG, we adjusted for age, sex and the quadratic term of height, and in the BBC for age, sex, height, recent chest infection and nurse. Table 7 also shows that rs805294 is significant and that the direction of the effect is the same in all considered studies except for the ACG sample. In particular, in the ACG study, the MAF of the SNP is different from that in the other studies, which indicates a different local LD structure; the ACG sample is from an Afro-Caribbean population, in contrast to the other studies, which only include Caucasian study subjects. In addition, the ACG sample lacks statistical power for this particular SNP, i.e. there are only 33 informative trios in this sample. Thus, the inconsistent finding in the ACG study could be attributable to genetic heterogeneity, i.e. different local LD structure/flip-flop phenomena [32], or to insufficient statistical power. For the meta-analysis, the sample sizes are used as weights for Liptak's method and we use 13∶13∶5∶1 = FHS∶BBC∶CAMP∶ACG as weights because the between-family information is used only for FHS. 
If the p-value from the Illumina gene chip in BBC is combined with the p-values from FHS, CAMP and ACG, the combined p-values from Liptak's method with the proposed weights and from Fisher's method are 1.534×10^{−8} and 1.081×10^{−7}, respectively; they become 4.625×10^{−9} and 3.554×10^{−8} if one-tailed p-values in the direction observed in FHS are used for BBC, CAMP and ACG. If the p-value from the Affymetrix gene chip in BBC is combined with the other studies instead, the combined p-values are 3.787×10^{−8} (Liptak's method) and 1.890×10^{−7} (Fisher's method) for two-tailed tests, and 1.098×10^{−8} (Liptak's method) and 6.236×10^{−8} (Fisher's method) for one-tailed tests. We therefore conclude that rs805294 is significantly associated with FEV1 at a genome-wide scale; the gene associated with rs805294, LY6G6C, will be investigated in further studies.
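The two combination rules used above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, Fisher's method is applied to the p-values exactly as reported, and the weighted Liptak combination assumes one-sided p-values and the textbook weighted-Z form, in which the weights enter linearly and are normalised by the root of their sum of squares. How the 13∶13∶5∶1 weights and the test directions entered the published Liptak values is not specified here, so the sketch does not attempt to reproduce them.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used to map p-values to z-scores


def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p) is chi-square with 2k df under the null."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # chi-square statistic / 2
    # Closed-form chi-square survival function for an even number of df (2k):
    # exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total


def liptak_combined_p(one_sided_p, weights):
    """Weighted Liptak (weighted Z) combination of one-sided p-values."""
    z_scores = [-_STD_NORMAL.inv_cdf(p) for p in one_sided_p]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    combined_z = numerator / denominator
    # Upper-tail probability of the combined z-score.
    return 0.5 * math.erfc(combined_z / math.sqrt(2.0))
```

With the two-tailed p-values of Table 7 (Illumina column for the BBC), `fisher_combined_p([5.929e-7, 6.534e-3, 1.370e-2, 7.84e-1])` evaluates to about 1.08×10^{−7}, in line with the Fisher value quoted above.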
Table 6. Applications to forced expiratory volume in one second in the Framingham Heart Study.
doi:10.1371/journal.pgen.1000741.t006

| SNP | Chrom | Position | MAF | Num. Info. Fam. | FBAT_{i} | pT_{i} | Z_{i} |
|---|---|---|---|---|---|---|---|
| rs805294 | 6 | 31796196 | 0.340 | 918 | 4.300×10^{−3} | 2.073×10^{−5} | 5.929×10^{−7} |
| rs10863838 | 1 | 208750806 | 0.450 | 1016 | 7.408×10^{−5} | 2.535×10^{−3} | 2.553×10^{−6} |
| rs6794842 | 3 | 119308208 | 0.331 | 950 | 3.226×10^{−2} | 2.400×10^{−5} | 6.654×10^{−6} |
| rs804963 | 14 | 85918211 | 0.460 | 1031 | 9.786×10^{−2} | 2.775×10^{−6} | 7.060×10^{−6} |
| rs525914 | 11 | 119200660 | 0.187 | 711 | 9.204×10^{−4} | 1.888×10^{−3} | 2.081×10^{−5} |
| rs1886280 | 10 | 89347496 | 0.362 | 971 | 1.797×10^{−2} | 2.297×10^{−4} | 2.511×10^{−5} |
| rs710469 | 3 | 188467212 | 0.491 | 1058 | 3.202×10^{−3} | 1.388×10^{−3} | 2.639×10^{−5} |
| rs10799746 | 1 | 22497833 | 0.168 | 651 | 1.388×10^{−2} | 3.538×10^{−4} | 2.748×10^{−5} |
| rs1225888 | 20 | 15972225 | 0.449 | 1007 | 7.518×10^{−5} | 1.736×10^{−2} | 2.994×10^{−5} |
| rs4638547 | 15 | 71122046 | 0.377 | 999 | 3.454×10^{−5} | 2.760×10^{−2} | 3.549×10^{−5} |
Table 7. Descriptive statistics and results of rs805294 in different studies.
doi:10.1371/journal.pgen.1000741.t007

| | FHS | British Cohort (Affy) | British Cohort (Illumina) | CAMP | BAR (ACG) |
|---|---|---|---|---|---|
| Num. Info. Fam. | 918 | | | 488 | 33 |
| Sample Size | | 1372 | 1323 | | |
| MAF | 0.34 | 0.36 | 0.36 | 0.33 | 0.22 |
| P-values | −5.929×10^{−7} | −1.234×10^{−2} | −6.534×10^{−3} | −1.370×10^{−2} | 7.84×10^{−1} |
Discussion
Genome-wide association studies have become one of the most important tools for the identification of new disease loci in the human genome. However, even though advances in genotyping technology have enabled a new generation of genetic association studies that provide robust and replicable findings, population stratification/genetic heterogeneity and the multiple testing problem continue to be major issues in the statistical analysis that have to be resolved in each study. While family-based association tests provide analysis results that are completely robust against confounding due to population substructure, the analysis approach is not optimal in terms of statistical power. Numerous approaches have been suggested to minimize this disadvantage of family-based association tests, but these approaches had to compromise in terms of either robustness or efficiency.
In this communication, we develop an approach that efficiently utilizes all available data while maintaining complete robustness against confounding due to population substructure. The proposed method combines the p-values of the family-based tests (the within component) with the rank-based p-values of the population-based analysis (the between component) to achieve optimal power levels. The use of rank-based p-values for the population-based component is similar in spirit to the genomic control approach. In principle, genomic control rescales the variance that is inflated by population stratification, under the assumption of a constant F_{ST}. Rank-based p-values instead rescale the statistics directly through their ranks, which always yields uniformly distributed p-values and thus remains valid even when F_{ST} varies, for example because of local population stratification.
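The contrast between the two rescaling ideas can be illustrated with a short Python sketch. It rests on several assumptions that are not taken from the paper: the rank-based p-value is defined here as rank/(M+1) with ties broken arbitrarily, the genomic control inflation factor uses the common median-based estimator for 1-df chi-square statistics, and the inflation factor of 1.3 in the simulation is purely hypothetical.

```python
import random
from statistics import median

CHI2_1DF_MEDIAN = 0.4549  # approximate median of a chi-square with 1 df


def genomic_control_lambda(chi2_stats):
    """Inflation factor: observed median statistic divided by the null median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def rank_based_p_values(stats):
    """Empirical p-values from ranks; the largest statistic gets the smallest p."""
    m = len(stats)
    order = sorted(range(m), key=lambda i: stats[i], reverse=True)
    p_values = [0.0] * m
    for rank, index in enumerate(order, start=1):
        p_values[index] = rank / (m + 1)  # assumed definition: rank / (M + 1)
    return p_values


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated null statistics, uniformly inflated by a hypothetical factor.
    inflated = [1.3 * rng.gauss(0.0, 1.0) ** 2 for _ in range(10_000)]
    print("estimated lambda:", round(genomic_control_lambda(inflated), 2))
    p = rank_based_p_values(inflated)
    print("rank-based p-values span:", min(p), "to", max(p))
```

Because the rank-based p-values depend only on the ordering of the statistics, any monotone inflation of the statistics leaves them unchanged, whereas genomic control has to estimate a single inflation factor and divide it out.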
Although our simulations are limited to independent unascertained samples and quantitative traits, the proposed approach can easily be extended to ascertained samples, large pedigrees, or different trait types. By replacing the parental genotypes with the sufficient statistics of Rabinowitz & Laird [19], the FBAT-statistic and the screening-statistic can be adapted straightforwardly to designs with extended pedigrees [23]. Similarly, parental phenotypes can be incorporated into the conditional mean model [23] or its non-parametric extensions [33] as additional outcome variables. The optimal weights can vary between scenarios, and further theoretical investigation is ongoing, but limited initial simulation studies suggest that equal weights, while not always the most powerful choice in such situations, will always result in a more powerful analysis than currently used methods.
Supporting Information
The validity of the proposed method.
(0.04 MB DOC)
Framingham Heart Study genotype and phenotype data are publicly available through the NHLBI's SNP Health Association Resource (SHARe) initiative (http://public.nhlbi.nih.gov/GeneticsGenomics/home/share.aspx). We acknowledge the CAMP investigators and research team for collection of CAMP Genetic Ancillary Study data and use of genotype data from the British 1958 Birth Cohort DNA collection. We further acknowledge the families in Barbados for their generous participation in this study. We are grateful to Drs. Raana Naidu, Paul Levett, Malcolm Howitt and Pissamai Maul, Trevor Maul, and Bernadette Gray for their contributions in the field; Dr. Malcolm Howitt and the Polyclinic and A&E Department physicians in Barbados for their efforts and their continued support; as well as Drs. Henry Fraser and Anselm Hennis at the Chronic Disease Research Centre.
References
1. Aulchenko YS, de Koning DJ, Haley C (2007) Genomewide rapid association using mixed model and regression: a fast and simple method for genomewide pedigree-based quantitative trait loci association analysis.
2. Chen WM, Abecasis GR (2007) Family-based association tests for genomewide association scans.
3. Elston RC, Gray-McGuire C (2004) A review of the ‘Statistical Analysis for Genetic Epidemiology’ (S.A.G.E.) software package.
4. Lange C, Blacker D, Laird NM (2004) Family-based association tests for survival and times-to-onset analysis.
5. Laird NM, Horvath S, Xu X (2000) Implementing a unified approach to family-based tests of association.
6. Lange C, Silverman EK, Xu X, Weiss ST, Laird NM (2003) A multivariate family-based association test using generalized estimating equations: FBAT-GEE.
7. Devlin B, Roeder K (1999) Genomic control for association studies.
8. Devlin B, Roeder K, Wasserman L (2001) Genomic control, a new approach to genetic-based association studies.
9. Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multilocus genotype data.
10. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA (2006) Principal components analysis corrects for stratification in genome-wide association studies.
11. Van Steen K, McQueen MB, Herbert A, Raby B, Lyon H (2005) Genomic screening and replication using the same data set in family-based association testing.
12. Ionita-Laza I, McQueen MB, Laird NM, Lange C (2007) Genomewide weighted hypothesis testing in family-based association studies, with an application to a 100K scan.
13. Murphy A, Weiss ST, Lange C (2008) Screening and replication using the same data set: testing strategies for family-based studies in which all probands are affected.
14. Nagelkerke NJ, Hoebee B, Teunis P, Kimman TG (2004) Combining the transmission disequilibrium test and case-control methodology using generalized logistic regression.
15. Epstein MP, Veal CD, Trembath RC, Barker JN, Li C (2005) Genetic association analysis using data from triads and unrelated subjects.
16. Chen YH, Lin HW (2008) Simple association analysis combining data from trios/sibships and unrelated controls.
17. Zhu X, Li S, Cooper RS, Elston RC (2008) A unified association analysis approach for family and unrelated samples correcting for stratification.
18. Lange C, DeMeo DL, Laird NM (2002) Power and design considerations for a general class of family-based association tests: quantitative traits.
19. Rabinowitz D, Laird N (2000) A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information.
20. Spielman RS, Ewens WJ (1998) A sibship test for linkage in the presence of association: the sib transmission/disequilibrium test.
21. Lange C, Laird NM (2002) On a general class of conditional tests for family-based association studies in genetics: the asymptotic distribution, the conditional power, and optimality considerations.
22. Lange C, Lyon H, DeMeo D, Raby B, Silverman EK (2003) A new powerful non-parametric two-stage approach for testing multiple phenotypes in family-based association studies.
23. Lange C, DeMeo D, Silverman EK, Weiss ST, Laird NM (2003) Using the noninformative families in family-based association tests: a powerful new testing strategy.
24. Liptak T (1958) On the combination of independent tests.
25. Balding DJ, Nichols RA (1995) A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity.
26. Cavalli-Sforza LL, Piazza A (1993) Human genomic diversity in Europe: a summary of recent research and prospects for the future.
27. Armitage P (1955) Tests for linear trends in proportions and frequencies.
28. Abecasis GR, Cardon LR, Cookson WO (2000) A general test of association for quantitative traits in nuclear families.
29. Diao G, Lin DY (2006) Improving the power of association tests for quantitative traits in family studies.
30. (1999) The Childhood Asthma Management Program (CAMP): design, rationale, and methods. Childhood Asthma Management Program Research Group.
31. Barnes KC, Neely JD, Duffy DL, Freidhoff LR, Breazeale DR (1996) Linkage of asthma and total serum IgE concentration to markers on chromosome 12q: evidence from Afro-Caribbean and Caucasian populations.
32. Lin PI, Vance JM, Pericak-Vance MA, Martin ER (2007) No gene is an island: the flip-flop phenomenon.
33. Jiang H, Harrington D, Raby BA, Bertram L, Blacker D (2006) Family-based association test for time-to-onset data with time-dependent differences between the hazard functions.