
The authors have declared that no competing interests exist.

Conceived and designed the experiments: PMV JY. Performed the experiments: JY PMV MEG. Analyzed the data: JY PMV. Contributed reagents/materials/analysis tools: NRW SHL AAEV GBC. Wrote the paper: PMV JY. Developed the online tool: GH JY.

We have recently developed analysis methods (GREML) to estimate the genetic variance of a complex trait/disease and the genetic correlation between two complex traits/diseases using genome-wide single nucleotide polymorphism (SNP) data in unrelated individuals. Here we use analytical derivations and simulations to quantify the sampling variance of the estimate of the proportion of phenotypic variance captured by all SNPs for quantitative traits and case-control studies. We also derive the approximate sampling variance of the estimate of a genetic correlation in a bivariate analysis, when two complex traits are measured on either the same or different individuals. We show that the sampling variance is inversely proportional to the number of pairwise contrasts in the analysis and to the variance in SNP-derived genetic relationships. For bivariate analysis, the sampling variance of the genetic correlation additionally depends on the harmonic mean of the proportion of variance explained by the SNPs for the two traits and on the genetic correlation between the traits, and it depends on the phenotypic correlation when the traits are measured on the same individuals. We provide an online tool for calculating the power of detecting genetic (co)variation using genome-wide SNP data. The new theory and online tool will be helpful for planning experimental designs to estimate the missing heritability that has not yet been fully revealed through genome-wide association studies, and to estimate the genetic overlap between complex traits (diseases), in particular when the traits (diseases) are not measured on the same samples.

Genome-wide association studies (GWAS) have identified thousands of genetic variants for hundreds of traits and diseases. However, the genetic variants discovered from GWAS explain only a small fraction of the heritability, raising the question of “missing heritability”. We have recently developed approaches (called GREML) to estimate the overall contribution of all SNPs to the phenotypic variance of a trait (disease) and the proportion of genetic overlap between traits (diseases). A frequently asked question is how many samples are required to estimate the proportion of variance attributable to all SNPs and the proportion of genetic overlap with useful precision. In this study, we derive the standard errors of the estimated parameters from theory and find that they are highly consistent with those observed in published results and those obtained from simulation. The theory, together with an online application tool, will be helpful for planning experimental designs to quantify the missing heritability, and to estimate the genetic overlap between traits (diseases), especially when it is infeasible to have the traits (diseases) measured on the same individuals.

Genome-wide association studies (GWAS) have been extremely successful in identifying genetic variants associated with complex traits and diseases in humans. Because a very large number of SNPs is tested, a stringent significance threshold, conventionally a P-value of 5×10^{−8}, is used to report a significant finding. Therefore, if many genes each with a small effect affect the trait, most of these genetic variants will fail to pass the stringent threshold and remain undetected. This is one of the explanations of the ‘missing heritability’ question, namely that genetic variants identified from GWAS so far explain only a fraction of the heritability for complex traits.

We previously named the SNP-based method mentioned above GREML

The methods of using SNP data to estimate genetic variance in unrelated individuals have been detailed elsewhere

For power calculation, we need to know the sampling variance of the estimate of

For unrelated individuals, where the phenotypic correlation between individuals is small, mixed linear model analysis using the REML approach is asymptotically equivalent to simple regression analysis of pairwise phenotypic similarity/difference on pairwise genetic similarity, as measured by identity-by-descent (IBD) or identity-by-state (IBS) at genome-wide markers ^{−5} for genome-wide coverage of common SNPs in human populations
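The regression view described above can be illustrated with a small simulation. The sketch below is not the GREML implementation; it is a minimal regression of pairwise phenotype products on pairwise genomic relationships, under several assumptions that are not in the text: independent simulated SNPs, every SNP being causal, standardisation by the sample mean and standard deviation, and a predicted sampling variance of the form 2 / (N^2 var(A_ij)), which is our reading of the statement that the sampling variance is inversely proportional to the number of pairwise contrasts and to the variance of the relationships. All function and variable names are hypothetical.

```python
import numpy as np


def simulate_and_regress(n=1000, m=500, h2=0.5, seed=1):
    """Simulate unrelated individuals and regress y_i*y_j on A_ij (i < j)."""
    rng = np.random.default_rng(seed)

    # Independent SNPs with allele frequencies between 0.1 and 0.5 (assumed).
    freqs = rng.uniform(0.1, 0.5, size=m)
    geno = rng.binomial(2, freqs, size=(n, m)).astype(float)

    # Standardise each SNP, then build the genomic relationship matrix.
    z = (geno - geno.mean(axis=0)) / geno.std(axis=0)
    grm = z @ z.T / m

    # Phenotype = additive effect of all SNPs + residual, total variance ~ 1.
    effects = rng.normal(0.0, np.sqrt(h2 / m), size=m)
    y = z @ effects + rng.normal(0.0, np.sqrt(1.0 - h2), size=n)
    y = (y - y.mean()) / y.std()

    # Off-diagonal pairs only: one contrast per pair of individuals.
    upper = np.triu_indices(n, k=1)
    a_ij = grm[upper]
    y_prod = np.outer(y, y)[upper]

    var_a = np.var(a_ij, ddof=1)
    slope = np.cov(a_ij, y_prod)[0, 1] / var_a  # regression estimate of h2

    # Assumed form of the approximation: var(h2_hat) ~ 2 / (N^2 var(A_ij)).
    se_pred = np.sqrt(2.0 / (n ** 2 * var_a))
    return slope, var_a, se_pred


h2_hat, var_a, se_pred = simulate_and_regress()
print(f"h2 estimate {h2_hat:.3f}, var(A_ij) {var_a:.5f}, predicted SE {se_pred:.3f}")
```

With only 500 independent SNPs the variance of the relationships is far larger than it would be for genome-wide human data, so the predicted standard error here is much smaller than for a real sample of the same size; the point of the sketch is the dependence on N and var(A_ij), not the numerical value.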

The first three columns are the averaged standard error observed from 100 simulations under three heritability levels. The last column is the predicted standard error from our approximation theory. The plotted data can be found in

For a bivariate analysis where the two traits are measured on the same individuals, the mixed linear model is written separately for each trait, with the subscripts 1 and 2 indexing the terms of the model for the first and second trait.

| Same sample | | | | | | Different samples | | | | | |
| r_{G} | N | Est. | SE (Obs.) | s.e.m. | SE (Approx.) | N_{1} | N_{2} | Est. | SE (Obs.) | s.e.m. | SE (Approx.) |
| | 4000 | 0.00 | 0.128 | 0.0024 | 0.114 | 1000 | 3000 | 0.04 | 0.288 | 0.0126 | 0.264 |
| | 6000 | 0.00 | 0.085 | 0.0009 | 0.076 | 2000 | 4000 | 0.00 | 0.191 | 0.0062 | 0.161 |
| | 8000 | −0.01 | 0.065 | 0.0006 | 0.057 | 3000 | 5000 | −0.01 | 0.129 | 0.0021 | 0.118 |
| | 10000 | 0.00 | 0.053 | 0.0004 | 0.046 | 4000 | 6000 | −0.01 | 0.103 | 0.0014 | 0.093 |
| | 4000 | 0.42 | 0.112 | 0.0019 | 0.108 | 1000 | 3000 | 0.32 | 0.295 | 0.0124 | 0.309 |
| | 6000 | 0.39 | 0.076 | 0.0009 | 0.072 | 2000 | 4000 | 0.39 | 0.230 | 0.0141 | 0.182 |
| | 8000 | 0.38 | 0.057 | 0.0006 | 0.054 | 3000 | 5000 | 0.37 | 0.136 | 0.0036 | 0.131 |
| | 10000 | 0.39 | 0.046 | 0.0005 | 0.043 | 4000 | 6000 | 0.38 | 0.107 | 0.0015 | 0.103 |
| | 4000 | 0.80 | 0.081 | 0.0024 | 0.091 | 1000 | 3000 | 0.62 | 0.417 | 0.0274 | 0.418 |
| | 6000 | 0.80 | 0.056 | 0.0017 | 0.061 | 2000 | 4000 | 0.83 | 0.248 | 0.0127 | 0.232 |
| | 8000 | 0.80 | 0.042 | 0.0012 | 0.046 | 3000 | 5000 | 0.86 | 0.198 | 0.0069 | 0.164 |
| | 10000 | 0.79 | 0.036 | 0.0010 | 0.036 | 4000 | 6000 | 0.83 | 0.133 | 0.0034 | 0.127 |
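A quick arithmetic check on the tabulated SE (Approx.) values shows how the approximation scales with sample size. The sketch below only re-uses numbers copied from the table; it is a descriptive check, not a derivation. The grouping into three blocks of four rows follows the table layout, and the scaling quantities (N for the same-sample design, the square root of N_{1}×N_{2} for the different-sample design) are our choice for the illustration.

```python
import math

# (N, SE approx.) for the same-sample design, one list per block of four rows.
same_sample = [
    [(4000, 0.114), (6000, 0.076), (8000, 0.057), (10000, 0.046)],
    [(4000, 0.108), (6000, 0.072), (8000, 0.054), (10000, 0.043)],
    [(4000, 0.091), (6000, 0.061), (8000, 0.046), (10000, 0.036)],
]

# (N1, N2, SE approx.) for the different-sample design, first block only
# (the rows whose estimates are close to zero).
different_first_block = [
    (1000, 3000, 0.264), (2000, 4000, 0.161),
    (3000, 5000, 0.118), (4000, 6000, 0.093),
]


def spread(values):
    """Ratio of largest to smallest value; 1.0 means perfectly constant."""
    return max(values) / min(values)


# If SE is proportional to 1/N, then N * SE is constant within a block.
same_products = [[n * se for n, se in block] for block in same_sample]

# For two separate samples, sqrt(N1 * N2) * SE plays the same role.
diff_products = [se * math.sqrt(n1 * n2) for n1, n2, se in different_first_block]

for block in same_products:
    print([round(p) for p in block], "spread", round(spread(block), 3))
print([round(p) for p in diff_products], "spread", round(spread(diff_products), 3))
```

Within each block the products agree to within rounding of the tabulated values, which is what one would expect if the approximate standard error of the genetic correlation is inversely proportional to sample size for a fixed set of genetic parameters.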

For a bivariate analysis where the two traits are measured on different sets of individuals, e.g. height in males and blood pressure in females, the variance-covariance matrix is defined in terms of the following quantities. y_{1} is an N_{1}×1 vector of phenotypes in sample set #1 (e.g. males), and y_{2} is an N_{2}×1 vector of phenotypes in sample set #2 (e.g. females), with N_{1} and N_{2} being the sample sizes of the two sets. A_{1} is an N_{1}×N_{1} GRM for individuals in sample set #1, A_{2} is an N_{2}×N_{2} GRM for individuals in sample set #2 and A_{12} is an N_{1}×N_{2} GRM between the two sets of samples.

Analogous to the univariate analysis, estimation of genetic covariance by a bivariate mixed linear model analysis is asymptotically equivalent to the following linear regression model_{12}, i.e. the genetic relationship between the

We know from the derivations above that

For case-control studies, the proportion of variance in case-control status (0 or 1) that is explained by all SNPs on the observed scale (

| N_{cases} | N_{controls} | Prevalence | Est. | SE (Obs.) | SE (Approx.) |
| 1604 | 1953 | 0.001 | 0.851 | 0.088 | 0.089 |
| 3290 | 3849 | 0.020 | 0.364 | 0.049 | 0.044 |
| 3154 | 6981 | 0.080 | 0.231 | 0.036 | 0.031 |
| 9087 | 12171 | 0.010 | 0.410 | 0.015 | 0.015 |
| 6704 | 9031 | 0.010 | 0.441 | 0.021 | 0.020 |
| 9041 | 9381 | 0.150 | 0.177 | 0.017 | 0.017 |
| 3303 | 3428 | 0.010 | 0.310 | 0.046 | 0.047 |
| 4163 | 12040 | 0.050 | 0.253 | 0.020 | 0.020 |
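The SE (Approx.) column of this table can be recovered from the total sample size alone. The sketch below assumes an approximation of the form var ≈ 2 / (N^2 var(A_ij)) with N the total number of cases plus controls, and assumes var(A_ij) = 2×10^{−5} for the variance of the off-diagonal genetic relationships; both the functional form and that value are assumptions made for this illustration, chosen because they are consistent with the tabulated numbers. The function name `approx_se` is hypothetical.

```python
import math

VAR_A = 2e-5  # assumed variance of off-diagonal GRM elements


def approx_se(n_total, var_a=VAR_A):
    """Approximate SE of the variance explained on the observed scale."""
    return math.sqrt(2.0 / (n_total ** 2 * var_a))


# (cases, controls, SE approx. as tabulated), copied from the table above.
studies = [
    (1604, 1953, 0.089), (3290, 3849, 0.044), (3154, 6981, 0.031),
    (9087, 12171, 0.015), (6704, 9031, 0.020), (9041, 9381, 0.017),
    (3303, 3428, 0.047), (4163, 12040, 0.020),
]

predicted = [round(approx_se(cases + controls), 3) for cases, controls, _ in studies]
tabulated = [se for _, _, se in studies]

for (cases, controls, se_tab), se_pred in zip(studies, predicted):
    print(f"N = {cases + controls:5d}  predicted {se_pred:.3f}  tabulated {se_tab:.3f}")
```

Under these assumptions the expression simplifies to roughly 316 / N, so doubling the total sample size halves the standard error.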

To calculate power, however, we would need to specify _{case} is the number of cases, _{control} is the number of controls, and

The SE is predicted from the approximation theory given different levels of disease prevalence (

Here N_{1} and N_{2} are the total numbers of cases and controls of the two case-control studies, respectively. This also applies to a bivariate analysis of a quantitative trait and a case-control disease study on different sets of samples, i.e.

| Disease | | | | Disease | | | | | | |
| Prevalence | N_{cases} | N_{controls} | Est. | Prevalence | N_{cases} | N_{controls} | Est. | Genetic correlation (Est.) | SE (Obs.) | SE (Approx.) |
| 0.01 | 9032 | 7980 | 0.40 | 0.01 | 6664 | 5258 | 0.39 | 0.68 | 0.044 | 0.049 |
| 0.01 | 9051 | 10385 | 0.38 | 0.15 | 8998 | 7823 | 0.16 | 0.43 | 0.055 | 0.057 |
| 0.01 | 9111 | 12146 | 0.41 | 0.01 | 3226 | 3308 | 0.29 | 0.16 | 0.059 | 0.057 |
| 0.01 | 9013 | 10115 | 0.42 | 0.05 | 4108 | 9936 | 0.22 | 0.08 | 0.046 | 0.045 |
| 0.01 | 6665 | 7408 | 0.42 | 0.15 | 8997 | 7680 | 0.17 | 0.47 | 0.061 | 0.063 |
| 0.01 | 6704 | 9030 | 0.43 | 0.01 | 3207 | 3294 | 0.31 | 0.04 | 0.065 | 0.061 |
| 0.01 | 6656 | 7041 | 0.38 | 0.05 | 4099 | 9873 | 0.25 | 0.05 | 0.053 | 0.052 |
| 0.15 | 9031 | 9370 | 0.17 | 0.01 | 3239 | 3331 | 0.31 | 0.05 | 0.089 | 0.090 |
| 0.15 | 8936 | 8668 | 0.16 | 0.05 | 4098 | 11233 | 0.24 | 0.32 | 0.071 | 0.073 |
| 0.01 | 3156 | 3254 | 0.27 | 0.05 | 4181 | 12022 | 0.23 | −0.13 | 0.087 | 0.090 |

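Statistical power is calculated from the population value of the parameter and its sampling variance, which was derived above. If the parameter is θ, the test statistic for its estimate follows a non-central χ^{2} distribution with 1 degree of freedom and a non-centrality parameter (NCP) of θ^{2}/var(θ̂). Power is the probability that this non-central χ^{2} variable is larger than the central χ^{2} threshold that is determined by the type I error rate.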
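The power calculation described above can be sketched with the Python standard library alone. A χ^{2} variable with 1 degree of freedom and non-centrality parameter NCP is the square of a normal variable with mean √NCP and unit variance, so the tail probability can be written with the normal distribution function. The sketch assumes a two-sided test at type I error rate `alpha` and an NCP equal to the squared parameter divided by its sampling variance; the function name `power_1df` and the example values are hypothetical and are not taken from the online calculator.

```python
from statistics import NormalDist


def power_1df(theta, se, alpha=0.05):
    """Power of a 1-df chi-squared test of theta = 0, given its standard error.

    Assumes NCP = (theta / se) ** 2 and a two-sided test at level alpha.
    """
    if se <= 0:
        raise ValueError("se must be positive")
    norm = NormalDist()
    root_ncp = abs(theta) / se                # square root of the NCP
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)  # sqrt of the central chi2 threshold
    # P[(Z + root_ncp)^2 > z_crit^2] for a standard normal Z.
    return norm.cdf(root_ncp - z_crit) + norm.cdf(-root_ncp - z_crit)


# Illustrative values only: a parameter of 0.2 estimated with SE 0.05 or 0.10.
for se in (0.05, 0.10):
    print(f"theta = 0.2, SE = {se:.2f}: power = {power_1df(0.2, se):.3f}")
```

When the parameter is zero the function returns `alpha`, as it should, and power rises towards 1 as the standard error shrinks relative to the parameter.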

We have also developed an online calculator (GCTA Power Calculator,

We have derived the approximate sampling variance of the estimate of variance explained by all common SNPs (

The sampling variance of

Analytical expressions for the sampling variance of the estimates of genetic (co)variance from pedigree analyses have been around for over 50 years

Methods for calculating the power of detecting quantitative trait loci (QTL) in family-based linkage studies have been investigated extensively in the past two decades

For a given population, a set of common SNPs and the method of calculating the genetic relationship matrix that we have used here,

If there are unknown related samples in the data (cryptic relatedness),

Using the same experimental design of a sample of conventionally unrelated individuals, the experimenter can increase power by increasing sample size. Fortunately, the number of pairwise contrasts, and hence the precision of the estimate, increases quadratically with sample size because every new sample is contrasted with all existing samples. The sampling variance of the estimate of the genetic correlation is generally much larger than that of the proportion of variance explained from a univariate analysis, consistent with the theory of the sampling variance of genetic correlations in pedigree designs.

Likelihood ratio test (LRT) statistic vs. Chi-squared test-statistic in a univariate analysis.

(PDF)

Standard error of the estimate of

(PDF)

Simulations.

(PDF)

Sampling variance of genetic correlation.

(PDF)

Acknowledgments to dbGaP data.

(PDF)

This study uses data obtained from dbGaP through accession numbers [phs000090] and [phs000091] (a full list of acknowledgments to the dbGaP data can be found in