The authors have declared that no competing interests exist.
Conceived and designed the experiments: DJL SML. Performed the experiments: DJL. Analyzed the data: DJL. Contributed reagents/materials/analysis tools: DJL SML. Wrote the paper: DJL SML. Designed the software used in analysis: DJL.
Next-generation sequencing has made possible the detection of rare variant (RV) associations with quantitative traits (QT). Due to high sequencing cost, many studies can only sequence a modest number of selected samples with extreme QT. Association testing in individual studies can therefore be underpowered. Besides the primary trait, many clinically important secondary traits are often measured. It would be highly beneficial if multiple studies could be jointly analyzed to detect associations with commonly measured traits. However, analyzing secondary traits in selected samples can be biased if sample ascertainment is not properly modeled. Some methods exist for analyzing secondary traits in selected samples, in which some burden tests can be implemented. However, their p-values can only be evaluated analytically via asymptotic approximations, which may not be accurate. Additionally, potentially more powerful sequence kernel association tests, variable selection-based methods, and burden tests that require permutations cannot be incorporated. To overcome these limitations, we developed a unified method for analyzing secondary trait associations with RVs (STAR) in selected samples, incorporating all RV tests. Statistical significance can be evaluated either through permutations or analytically. STAR makes it possible to apply more powerful RV tests to analyze secondary trait associations. It also enables joint analysis of multiple cohorts ascertained under different study designs, which greatly boosts power. The performance of STAR and commonly used RV association tests was comprehensively evaluated using simulation studies. STAR was also applied to a dataset from the SardiNIA project in which samples with extreme low-density lipoprotein levels were sequenced. A significant association between
Next-generation sequencing has greatly expanded our ability to identify missing heritability due to rare variants. In order to increase the power to detect associations, one desirable study design is to combine samples from multiple cohorts for mapping commonly measured traits. However, many current studies sequence selected samples (e.g. samples with extreme QT), which can bias the analysis of secondary traits, unless the sampling ascertainment mechanisms are properly adjusted. We developed a unified method for detecting secondary trait associations with rare variants (STAR) in selected and random samples, which can flexibly incorporate all rare variant association tests and allow joint analysis of multiple cohorts ascertained under different study designs. We demonstrate via simulations that STAR greatly boosts the power for detecting secondary trait associations. As an application of STAR, a dataset from the SardiNIA project was analyzed, where DNA samples from well-phenotyped individuals with extreme low-density lipoprotein levels were sequenced.
Next-generation sequencing has already revolutionized the study of complex traits and made possible the detection of rare variant associations. Many sequence-based association studies are currently being performed, some of which have already led to the discovery of associations with clinically important traits, such as lipid levels
Many methods have been developed for detecting rare variant associations
Several methods exist for detecting secondary trait associations in selected samples. For example, a retrospective likelihood method was developed for mapping
However, none of these methods for detecting secondary associations incorporates the sequence kernel association test (SKAT), a powerful method based on a variance component score test. SKAT can be more powerful when causal variants have bidirectional effects and/or a large proportion of the variants within the gene region are non-causal.
Standard permutation algorithms cannot be applied to obtain empirical p-values. This is because, when the primary and secondary traits are correlated and the genetic region is associated with the primary trait, neither the secondary trait residuals nor the locus genotypes are interchangeable under the null hypothesis. Statistical significance can therefore only be evaluated via asymptotic approximations, which have notable limitations: (1) Due to the low frequency of rare variants, the asymptotic approximation for some tests may be inaccurate, which can lead to either inflated type I error or loss of power. (2) For some rare variant association methods, the analytical distribution of the test statistic is unknown, so statistical significance has to be evaluated empirically. Rare variant tests that require evaluating p-values via permutation are often more powerful than the methods implemented in MTA, e.g. CMC or GRANVIL. It is therefore desirable that these tests can be applied to analyze secondary traits.
To overcome the limitations of existing methods, a unified model was developed to detect secondary trait associations using selected samples. In the samples with extreme primary quantitative traits, through re-parameterizing the likelihood functions, interchangeable residuals for the secondary traits can be obtained under the null hypothesis. The residuals are approximately independent, and normally distributed. We proved theoretically that the analysis of secondary trait associations can be equivalently implemented by analyzing the correlation between the secondary trait residuals and the gene/genetic region. Therefore, any rare variant association test that can analyze QT in random population based studies can be incorporated in STAR. In addition, multiple cohorts can be jointly analyzed through conventional mega-analysis methods that use individual participant data or meta-analysis methods that use summary level statistics.
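Once interchangeable residuals are available, an empirical p-value can be obtained by permuting them against the genotypes. The following is a minimal sketch of that idea, not the STARSEQ implementation. It assumes the residuals have already been computed from the re-parameterized null likelihood (which is not reproduced here), and it uses a simple centered burden score as a stand-in test statistic. The function name `permutation_pvalue` and its parameters are hypothetical.

```python
import random


def permutation_pvalue(residuals, burden, n_perm=5000, seed=0):
    """Two-sided empirical p-value for association between residuals and a burden score.

    residuals: secondary trait residuals, assumed interchangeable under the null.
    burden:    per-individual aggregate of rare variant genotypes (assumed coding).
    """
    n = len(residuals)
    if n != len(burden):
        raise ValueError("residuals and burden must have the same length")

    # Center the burden score so the statistic does not depend on the residual mean.
    mean_b = sum(burden) / n
    centered = [b - mean_b for b in burden]

    def statistic(res):
        # Absolute value gives a two-sided test.
        return abs(sum(r * c for r, c in zip(res, centered)))

    observed = statistic(residuals)
    rng = random.Random(seed)
    shuffled = list(residuals)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute residuals, keep genotypes fixed
        if statistic(shuffled) >= observed - 1e-12:
            hits += 1

    # Adding 1 to numerator and denominator avoids reporting p = 0.
    return (hits + 1) / (n_perm + 1)
```

Any other statistic that takes a residual vector and a genotype summary could be substituted for the inner `statistic` function without changing the permutation logic.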
A variety of popular rare variant tests have been implemented in the STAR framework and the power to detect secondary trait associations was evaluated. Specifically, we considered the weighted sum statistic (WSS)
The performances for these methods were compared using extensive simulation studies. Genetic data were simulated under a realistic population genetic model as described by Kryukov et al
The STAR method was also used to analyze a published sequence dataset from the SardiNIA project
An R package, STARSEQ, which implements the STAR method, is available through the Comprehensive R Archive Network (CRAN) at
The STAR model can be used with any rare variant test to detect associations with secondary traits in studies that use extreme sampling. The multi-site genotype for individual
Under the null hypothesis of no gene/secondary trait associations, following the MTA framework, a multivariate generalized linear model can be implemented to estimate nuisance parameters
The residual terms for the secondary traits, i.e.
In order to obtain unbiased results, sampling schemes have to be properly modeled. Ascertainment corrected likelihood can be used, which jointly models sample ascertainment status
The likelihood function in
In this model, the residual errors
Burden tests, such as CMC and WSS, aggregate multiple rare variants across a genetic region and analyze them jointly. The following model can be used to obtain score statistics for a burden test:
Score tests can be formally constructed from the joint likelihood for testing the null hypothesis of no gene/secondary trait associations, i.e.
It is clear from formula (9) that
Using similar ideas, we show in (
The KBAC test was previously developed for the analysis of binary trait associations
Type I error and power were evaluated for the following rare variant association tests that were extended in STAR, i.e. CMC, KBAC, WSS, SKAT and VT. Genetic data were generated according to a four parameter demographic model for Europeans
Data for selective sampling studies were simulated, where for each dataset a total of 5,000 individuals with extreme primary trait values were selected from a cohort of 100,000 individuals. A two-sided alternative hypothesis was tested for each method. Although p-values for CMC and SKAT can be obtained analytically, they can be either conservative or anti-conservative
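The selection step can be illustrated with a small sketch. It assumes, for illustration only, that the selected samples are split equally between the lower and upper tails of the primary trait and that the two traits are standard bivariate normal with correlation `rho`. The genetic model of the simulations is not reproduced. The function names `simulate_traits` and `select_extremes` are hypothetical.

```python
import random
from math import sqrt


def simulate_traits(n, rho, seed=0):
    """Draw n pairs of standard normal traits with correlation rho (no genetic effect)."""
    rng = random.Random(seed)
    primary, secondary = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        primary.append(z1)
        secondary.append(rho * z1 + sqrt(1 - rho * rho) * z2)
    return primary, secondary


def select_extremes(primary, n_select):
    """Return (lower, upper) index lists: n_select/2 individuals from each tail.

    The equal split between tails is an assumption of this sketch.
    """
    if n_select < 2 or n_select % 2 or n_select > len(primary):
        raise ValueError("n_select must be even, at least 2, and at most len(primary)")
    half = n_select // 2
    order = sorted(range(len(primary)), key=lambda i: primary[i])
    return order[:half], order[-half:]
```

With correlated traits, the mean of the secondary trait among the individuals selected from the upper tail is shifted well away from zero, which is the distortion that a naive analysis of the selected samples ignores.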
In order to illustrate the application of STAR for combining multiple cohorts, a meta-analysis of three studies was simulated. The primary trait for each study is different and a common secondary trait is measured for all studies. In the first study, the gene region is associated with the primary trait, and causal variants have an effect of −0.5. The correlation between the primary and secondary traits is 0.6. In the second study, the primary trait is associated with the gene region, and causal variants have an effect of 0.25. The primary and secondary traits are correlated with coefficient 0.4. In the third study, the gene region is not associated with the primary trait, and the correlation between the primary and secondary traits is −0.2. In each study, a different pool of 50,000 samples was simulated and 2,500 individuals with extreme primary trait were selected and analyzed for association. P-values for all rare variant tests in each study were obtained based upon 5,000 permutations. Meta-analysis is performed by combining Z-score statistics, which are transformed from p-values and weighted by the square root of the sample sizes in each study
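The combination step described above can be sketched as a sample-size-weighted Z-score method. This is a minimal illustration under stated assumptions: each two-sided p-value is converted to a Z-score magnitude, and a direction of effect per study is supplied by the caller through the hypothetical `signs` argument (defaulting to the same direction for every study). How direction is handled for tests without a natural sign, such as SKAT, is not specified here. Input p-values must lie strictly above 0 and at most 1.

```python
from math import sqrt
from statistics import NormalDist


def combine_pvalues(pvalues, sample_sizes, signs=None):
    """Weighted Z-score meta-analysis of two-sided p-values.

    Weights are the square roots of the study sample sizes.
    Returns (z_meta, two-sided p_meta).
    """
    if len(pvalues) != len(sample_sizes):
        raise ValueError("pvalues and sample_sizes must have the same length")
    if signs is None:
        signs = [1] * len(pvalues)  # assumption: all effects in the same direction

    norm = NormalDist()
    weights = [sqrt(n) for n in sample_sizes]
    # -inv_cdf(p / 2) is the upper-tail quantile and is stable for small p.
    z_scores = [s * -norm.inv_cdf(p / 2) for p, s in zip(pvalues, signs)]

    z_meta = sum(w * z for w, z in zip(weights, z_scores)) / sqrt(sum(w * w for w in weights))
    p_meta = 2 * norm.cdf(-abs(z_meta))
    return z_meta, p_meta
```

For instance, `combine_pvalues([0.05, 0.05, 0.05], [2500, 2500, 2500])` yields a combined p-value far smaller than any single-study p-value, whereas two equally sized studies with equal p-values and opposite signs cancel to a Z-score of zero.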
In order to evaluate type I errors, data were simulated under the assumption that the secondary trait effects for all variants were 0. The empirical distribution of p-values was obtained using 10,000 replicates. For evaluating power, two scenarios were considered: (A) causal variants have a unidirectional effect of 0.5; (B) causal variants have bidirectional effects, where 80% of the causal variants have effect 0.5 and the other 20% have effect −0.5. The power for analyzing each individual study and for the meta-analysis was evaluated using 10,000 replicates at a significance level of α = 0.05.
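Both the empirical type I error and the empirical power reduce to the proportion of replicates whose p-value falls at or below α. A minimal sketch of that calculation, with a binomial standard error added for reference, is given below. The function name `rejection_rate` is hypothetical, and the standard error is an addition of this sketch rather than a quantity reported in the study.

```python
from math import sqrt


def rejection_rate(pvalues, alpha=0.05):
    """Proportion of replicates with p <= alpha, and its binomial standard error."""
    n = len(pvalues)
    if n == 0:
        raise ValueError("at least one replicate is required")
    rate = sum(1 for p in pvalues if p <= alpha) / n
    se = sqrt(rate * (1 - rate) / n)
    return rate, se
```

Applied to replicates simulated under the null, the rate estimates the type I error. Applied to replicates simulated under an alternative, it estimates power. With 10,000 replicates and a true rate of 0.05, the standard error is about 0.0022.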
Association analyses were performed for the nine genes that were sequenced from the SardiNIA project
Type I error for STAR was investigated when (1) the gene region is associated with neither the primary trait nor the secondary trait, and (2) the gene region is associated with the primary trait but not the secondary trait. The quantile-quantile plots of empirical p-values and their theoretical expectations are displayed for different rare variant tests. It can be seen that all tests incorporated in the STAR method have well controlled type I error. The p-values for the five tests are slightly conservative even when permutation is used to evaluate significance. This can occur when either the aggregate variant frequencies are low or the sample size is not sufficiently large. For example, when the primary trait effect is
Five tests were evaluated, i.e. CMC, WSS, KBAC, VT and SKAT. Empirical p-values for each test were plotted against their theoretical expectations. A variety of scenarios with different primary trait effects and trait residual correlations were examined, which include (A)
As a comparison, we also evaluated type I errors of linear regression analysis that ignores sample ascertainment mechanisms. When the gene region is not associated with the primary trait, type I errors for all rare variant tests are well controlled. However, if the gene region is also associated with the primary trait, the distribution of p-values under the null hypothesis is highly skewed and the type I errors for all tests are seriously inflated (
The power of detecting secondary trait associations was compared for a variety of rare variant tests (
Power is calculated for CMC, WSS, KBAC, VT, and SKAT implemented in the STAR framework. Secondary trait effects are assumed to be fixed and unidirectional with
Power is calculated for CMC, WSS, KBAC, VT, and SKAT implemented in the STAR framework. Secondary trait effects are assumed to be bidirectional with fixed magnitude (i.e.
VT can be more powerful than methods that use fixed variant frequency threshold, when the secondary trait effects are unidirectional. This is because using a fixed variant frequency threshold may result in the inclusion of higher frequency non-causal variants or the exclusion of more frequent causal variants from the analyses. For example, when the primary trait effect is
The variance component score test SKAT is less powerful than burden tests when causal variant effects are unidirectional. For example, when
When the gene region is associated with both the primary and secondary traits, the power to detect secondary trait associations can be greater than when the gene region is only associated with the secondary trait. This is because variants with pleiotropic effects can be more enriched through extreme sampling. For example, when secondary trait effects are
The power and type I errors for STAR were evaluated for a simulated meta-analysis of three studies. As shown in (
We also evaluated the power of the STAR method under the alternative hypothesis (
Sequence data from the SardiNIA project were analyzed to detect associations with multiple lipid and metabolic traits. First, association analyses were carried out for the primary trait, LDL levels (
| CMC | WSS | KBAC | VT | SKAT |
| 0.014* | 0.029* | 0.026* | 0.045* | 0.317 |
| 0.820 | 0.946 | 0.942 | 0.964 | 0.971 |
| 0.050* | 0.025* | 0.035* | 0.009# | 0.234 |
| 0.272 | 0.299 | 0.381 | 0.491 | 0.491 |
For CMC, WSS, KBAC, and SKAT, only variants with MAF≤1% were analyzed.
For VT, variants with MAF≤5% were analyzed.
The statistical significance of all tests was obtained empirically via 5,000 permutations. Nominally significant p-values are labeled with an asterisk. P-values that are significant after Bonferroni corrections are labeled with a pound sign.
Next we analyzed secondary trait associations with the four genes, i.e.
Gene | Trait | CMC | WSS | KBAC | VT | SKAT |
| TCL | 3.07E-01 | 6.04E-01 | 6.75E-01 | 2.76E-01 | 8.88E-01 |
| HDL | 7.00E-01 | 9.71E-01 | 5.35E-01 | 9.21E-01 | 6.30E-01 |
| BMI | 5.98E-01 | 2.57E-01 | 7.05E-01 | 5.22E-01 | 2.29E-01 |
| DiasBP | 9.10E-01 | 1.71E-01 | 1.52E-01 | 2.22E-01 | 3.79E-01 |
| SysBP | 7.54E-01 | 6.74E-01 | 5.37E-01 | 9.22E-01 | 7.69E-01 |
| TG | 8.76E-01 | 3.60E-01 | 2.46E-01 | 4.16E-01 | 7.06E-01 |
| INSULIN | 8.30E-01 | 4.85E-01 | 4.07E-01 | 5.96E-01 | 1.68E-01 |
| TCL | 6.67E-01 | 8.18E-01 | 6.97E-01 | 3.93E-01 | 1.54E-01 |
| HDL | 8.71E-01 | 2.78E-01 | 2.21E-01 | 4.14E-01 | 6.90E-02 |
| BMI | 3.81E-01 | 7.72E-01 | 7.66E-01 | 9.50E-01 | 9.84E-01 |
| DiasBP | 5.63E-01 | 8.10E-01 | 8.58E-01 | 4.41E-01 | 5.29E-01 |
| SysBP | 5.39E-01 | 9.47E-01 | 9.22E-01 | 8.26E-01 | 8.62E-01 |
| TG | 5.60E-01 | 9.22E-01 | 9.14E-01 | 6.12E-01 | 4.98E-01 |
| INSULIN | 5.14E-01 | 9.74E-01 | 9.79E-01 | 9.73E-01 | 9.85E-01 |
| TCL | 2.31E-02* | 3.60E-02* | 2.90E-02* | 4.90E-02* | 8.93E-01 |
| HDL | 1.13E-01 | 1.59E-01 | 2.35E-01 | 4.19E-01 | 4.61E-01 |
| BMI | 1.01E-01 | 2.62E-01 | 1.94E-01 | 3.74E-01 | 7.41E-01 |
| DiasBP | 1.64E-02* | 2.70E-02* | 2.50E-02* | 4.70E-02* | 2.33E-01 |
| SysBP | 9.14E-04# | 3.08E-04# | 1.20E-03# | 3.00E-03# | 6.00E-03# |
| TG | 4.73E-01 | 9.21E-01 | 9.64E-01 | 9.88E-01 | 9.97E-01 |
| INSULIN | 3.76E-01 | 7.91E-01 | 7.77E-01 | 4.88E-01 | 9.67E-01 |
| TCL | 1.98E-02* | 4.80E-02* | 2.30E-02* | 1.52E-01 | 9.22E-01 |
| HDL | 4.98E-02* | 6.70E-02 | 5.80E-02 | 1.44E-01 | 2.33E-01 |
| BMI | 3.85E-01 | 6.81E-01 | 7.27E-01 | 4.11E-01 | 8.73E-01 |
| DiasBP | 4.24E-01 | 8.03E-01 | 8.42E-01 | 1.18E-01 | 5.05E-01 |
| SysBP | 2.76E-01 | 5.67E-01 | 5.79E-01 | 1.28E-01 | 7.43E-01 |
| TG | 3.29E-01 | 5.25E-01 | 6.53E-01 | 6.56E-01 | 6.31E-01 |
| INSULIN | 7.53E-01 | 6.15E-01 | 4.83E-01 | 1.12E-01 | 5.46E-01 |
Secondary traits, total cholesterol levels (TCL), high density lipoprotein (HDL), body mass index (BMI), diastolic blood pressure (DiasBP), systolic blood pressure (SysBP), triglyceride (TG) and insulin levels (INSULIN) were studied.
For CMC, WSS, KBAC, and SKAT, variants with MAF≤1% were analyzed.
For VT, variants with MAF≤5% were analyzed.
Statistical significance for all tests was obtained empirically via 5,000 permutations. Nominally significant p-values are labeled with an asterisk, while the associations that are significant after Bonferroni corrections are labeled with a pound sign.
It is interesting to note that
We also compared the analysis using STAR and standard linear regressions (
In this article, we present a likelihood model that can be used to analyze secondary trait associations in selected samples. The method corrects for the bias in the distribution of the secondary traits induced by selective sampling. All rare variant association analysis methods can be extended within the STAR framework. STAR makes it possible to apply more powerful rare variant association tests to the analysis of secondary traits and allows joint analysis of cohorts that were ascertained for different primary traits. The power for detecting associations with secondary traits can thereby be greatly enhanced. In addition to performing gene-based association analysis, the STAR method and STARSEQ software can also be applied to detect single variant associations (data not shown).
Currently, many sequence based genetic studies are being performed to detect associations with complex traits. Due to the high cost of sequencing, the sample sizes for many of these studies are small. It was previously shown that in order to have sufficient power (e.g. >80%) to detect association with rare variants in an exome-wide study, in some cases it is necessary to sequence at least 10,000 samples with extreme traits from a cohort of 100,000
Previously, the CMC and GRANVIL tests were extended for analyzing secondary traits, with p-values evaluated analytically
A permutation algorithm is often a necessary ingredient of rare variant association tests. Even if asymptotic approximations exist for some rare variant association tests such as SKAT and CMC, they may not be accurate and type I errors may be inflated or deflated
Under the STAR framework, we compared the power of several rare variant tests for analyzing secondary traits in selected samples. It is clear from our comparisons that when causal variants have unidirectional effects, burden tests perform better than SKAT. However, when variants with effects in opposite directions are present, SKAT can be more powerful than burden test based methods. Given that the goal of the article is to introduce a method for analyzing secondary traits in selected samples, rather than to compare different rare variant tests, our simulations are not as comprehensive as some existing reviews, such as Basu and Pan
In the analysis of the SardiNIA dataset, we adjusted the blood pressure of individuals undergoing antihypertensive therapy. The ranks of the sample blood pressure traits changed only slightly after the adjustment. Given that we quantile-normalized the trait prior to the association analysis, the impact of the adjustment on the results is minimal. In order to evaluate the robustness of the results, we also analyzed the associations with blood pressure when no adjustments were made, and the results are very similar (data not shown). A significant association was identified between rare variants in
With the large scale application of next generation sequencing to study complex traits, samples from many existing cohorts will be sequenced. There can be insufficient power for analyzing associations in each individual study. It would be highly beneficial if samples from multiple cohorts can be combined for analyzing commonly measured traits. STAR is thus important and will greatly accelerate the process of identifying genes involved in complex trait etiology.
Conditional distribution of the secondary trait for individuals with extreme primary trait values in the upper and lower 5% tails. The density function of the secondary traits is plotted when the primary trait effect
(TIF)
Biases of the secondary trait effects due to extreme sampling under the null hypothesis. The bias in the secondary trait effects i.e.
(TIF)
Quantile-Quantile plot of p-values for rare variant tests in linear regression models under the null hypothesis of no gene/secondary trait associations. Sample ascertainment mechanism was ignored in the linear regression analysis. Five tests were evaluated, i.e. CMC, WSS, KBAC, VT and SKAT. Empirical p-values for each test were plotted against their theoretical expectations. A variety of scenarios with different primary trait effects and trait residual correlations were examined, which include (A)
(TIF)
The power for detecting association with secondary traits in selected samples. The power is shown for CMC, WSS, KBAC, VT, and SKAT implemented in the STAR framework. It is assumed that the secondary trait effects for causal variants are unidirectional, and their magnitudes are inversely proportional to the minor allele frequencies with
(TIF)
The power for detecting association with secondary traits in selected samples. The power is shown for CMC, WSS, KBAC, VT, and SKAT implemented in the STAR framework. It is assumed that 80% of the causal variants increase the mean secondary trait value, and the remaining variants decrease the mean secondary trait value. The magnitudes of the secondary trait effects are inversely proportional to the minor allele frequencies, with
(TIF)
The power for detecting associations with secondary traits in selected samples. Power is shown for CMC, WSS, KBAC, VT, and SKAT implemented in the STAR framework. Secondary trait effects are assumed to be fixed and unidirectional with
(TIF)
The power for detecting association with secondary traits in selected samples. Power is shown for CMC, WSS, KBAC, VT, and SKAT implemented in the STAR framework. It is assumed that secondary trait effects are bidirectional with fixed magnitude (i.e.
(TIF)
Quantile-Quantile plot for meta-analysis p-values under the null hypothesis. Meta-analysis for three studies was simulated. The primary trait in each study is assumed to be different and a common secondary trait is measured in all studies. The gene region is not associated with the secondary trait. In the first study, the gene region is associated with the primary trait, and causal variants have an effect of −0.5. The correlation between the primary and secondary traits is 0.6. In the second study, the primary trait is also associated with the gene region, and causal variants have an effect of 0.25. The primary and secondary traits are correlated with coefficient 0.4. In the third study, the gene region is not associated with the primary trait, and the correlation between the primary and secondary traits is −0.2. CMC, WSS, KBAC, VT and SKAT were used to detect associations. In each study, a pool of 50,000 samples was simulated and 2,500 individuals with extreme primary trait were selected and analyzed. P-values for all rare variant tests were obtained based upon 5,000 permutations. The empirical distribution of p-values was obtained using 10,000 replicates.
(TIF)
Power for meta-analysis using CMC, WSS, KBAC, VT and SKAT. Meta-analysis for three studies was simulated. The primary trait in each study is assumed to be different and a common secondary trait is measured in all studies. Power for the five tests was displayed when (A) causal variants have unidirectional effect of 0.5, and (B) causal variants have bidirectional effects, i.e. 80% of the causal variants have effect 0.5 and the other 20% have effect −0.5. In the first study, the gene region is associated with the primary trait, and causal variants have an effect of −0.5. The correlation between the primary and secondary traits is 0.6. In the second study, the primary trait is also associated with the gene region, and causal variants have an effect of 0.25. The primary and secondary traits are correlated with coefficient 0.4. In the third study, the gene region is not associated with the primary trait, and the correlation between the primary and secondary traits is −0.2. In each study, a different pool of 50,000 samples was simulated and 2,500 individuals with extreme primary trait were selected and analyzed. P-values for all rare variant tests were obtained based upon 5,000 permutations. The power for analyzing each individual study and meta-analysis was evaluated using 10,000 replicates.
(TIF)
Correlations of phenotypes from the SardiNIA cohort. Eight traits that were analyzed for associations are included, i.e. high density lipoprotein (HDL), low density lipoprotein (LDL), triglyceride (TG), total cholesterol levels (TCL), diastolic blood pressure (DiasBP), systolic blood pressure (SysBP), insulin levels (INSULIN), and body mass index (BMI). Correlations were estimated using 2044 unrelated individuals extracted from the SardiNIA pedigrees.
(DOC)
Analysis of secondary trait associations using standard linear regression. Sample ascertainment mechanisms were ignored in the analysis. Seven secondary traits were analyzed, including total cholesterol levels (TCL), high density lipoprotein (HDL), body mass index (BMI), diastolic blood pressure (DiasBP), systolic blood pressure (SysBP), triglyceride (TG) and insulin levels (INSULIN). Gene-based association analysis was performed using CMC, WSS, KBAC, VT and SKAT.
(DOC)
Extension of Kernel Based Adaptive Cluster to the Analysis of Quantitative Traits.
(PDF)
Biases of Naïve Inferences of Secondary Trait Associations in Selected Samples.
(PDF)
Details for the Null Likelihood Model.
(PDF)
Practical Issues for Inferences under the Ascertainment Corrected Likelihood.
(PDF)
Constructing Variance Component Score Tests from Ascertainment Corrected Likelihood.
(PDF)
We would like to thank Dr. Gonçalo Abecasis for sharing sequence data from the SardiNIA project. We would also like to thank Dr. Shamil Sunyaev for sharing the simulated genetic datasets. Additionally we thank Drs. Serena Sanna, Carlo Sidore, Giorgio Pistis, and Bingshan Li for their helpful discussions and comments.