The authors have declared that no competing interests exist.

Conceived and designed the experiments: NZ NP ALP. Performed the experiments: NZ SP. Analyzed the data: NZ PK NP BP GB ALP. Wrote the paper: NZ ALP.

Important knowledge about the determinants of complex human phenotypes can be obtained from the estimation of heritability, the fraction of phenotypic variation in a population that is determined by genetic factors. Here, we make use of extensive phenotype data in Iceland, long-range phased genotypes, and a population-wide genealogical database to examine the heritability of 11 quantitative and 12 dichotomous phenotypes in a sample of 38,167 individuals. Most previous estimates of heritability are derived from family-based approaches such as twin studies, which may be biased upwards by epistatic interactions or shared environment. Our estimates of heritability, based on both closely and distantly related pairs of individuals, are significantly lower than those from previous studies. We examine phenotypic correlations across a range of relationships, from siblings to first cousins, and find that the excess phenotypic correlation in these related individuals is predominantly due to shared environment as opposed to dominance or epistasis. We also develop a new method to jointly estimate narrow-sense heritability and the heritability explained by genotyped SNPs. Unlike existing methods, this approach permits the use of information from both closely and distantly related pairs of individuals, thereby reducing the variance of estimates of heritability explained by genotyped SNPs while preventing upward bias. Our results show that common SNPs explain a larger proportion of the heritability than previously thought, with SNPs present on Illumina 300K genotyping arrays explaining more than half of the heritability for the 23 phenotypes examined in this study. Much of the remaining heritability is likely to be due to rare alleles that are not captured by standard genotyping arrays.

Phenotype is a function of a genome and its environment. Heritability is the fraction of variation in a phenotype determined by genetic factors in a population. Current methods to estimate heritability rely on the phenotypic correlations of closely related individuals and are potentially upwardly biased, due to the impact of epistasis and shared environment. We develop new methods to estimate heritability over both closely and distantly related individuals. By examining the phenotypic correlation among different types of related individuals such as siblings, half-siblings, and first cousins, we show that shared environment is the primary determinant of inflated estimates of heritability. For a large number of phenotypes, it is not known how much of the heritability is explained by SNPs included on current genotyping platforms. Existing methods to estimate this component of heritability are biased in the presence of related individuals. We develop a method that permits the inclusion of both closely and distantly related individuals when estimating heritability explained by genotyped SNPs and use it to make estimates for 23 medically relevant phenotypes. These estimates can be used to increase our understanding of the distribution and frequency of functionally relevant variants and thereby inform the design of future studies.

Although genome-wide association studies (GWAS) have resulted in the discovery of thousands of novel associations of loci to hundreds of phenotypes

The narrow-sense heritability of a phenotype (

There are two major challenges in comparing

Here, we analyze the heritability of 23 complex phenotypes in an Icelandic cohort of 38,167 individuals, leveraging both a population-wide genealogical database and genotype data from over 300,000 SNPs that have been long-range phased across and between chromosomes (i.e. where not only the phase, but also the parental origin of alleles has been determined)

We further introduce a new variance components method that provides simultaneous estimates of

For most of the 23 phenotypes examined here, our results show that

Below, we provide an overview of the approaches we used to estimate various components of heritability. The details of these approaches are provided in the Methods section.

We used a linear mixed model approach to estimate components of heritability. In this approach, each phenotype is modeled using a multivariate normal distribution. Each of the components of heritability that we estimated corresponds to a different model of the phenotypic covariance.
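As a compact reference, a generic single-component form of such a model can be written as below. The notation (X, \beta, K, \sigma_g^2, \sigma_e^2) is standard variance-components notation assumed here for illustration; it is not copied from the Methods.

```latex
Y \sim \mathcal{N}\!\left(X\beta,\; \sigma_g^{2} K + \sigma_e^{2} I\right),
\qquad
h^{2} = \frac{\sigma_g^{2}}{\sigma_g^{2} + \sigma_e^{2}}
```

Here K is a genetic relationship matrix, X holds the fixed-effect covariates, and I is the identity matrix. Under this reading, a joint model with two relationship matrices K_1 and K_2 would replace the covariance with \sigma_1^{2} K_1 + \sigma_2^{2} K_2 + \sigma_e^{2} I.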

Narrow-sense heritability (

The fine-scale estimates of IBD used here rely on long-range phasing data that are not available in most data sets. An estimate of

Previous approaches to estimating the heritability explained by genotyped SNPs (

Broad-sense heritability (H^{2}) is the sum of additive, dominant, and epistatic components of heritability. The additive, dominant, environmental (ADE) model can be used to obtain joint estimates of dominance and additive components of heritability, using two covariance matrices based on IBD2 (two copies shared IBD) and IBD

Below, we investigate all of these modeling approaches.

Ideally, estimates of narrow-sense heritability of a particular phenotype would be based on a genetic relationship matrix constructed from causal variants, representing the true genetic contribution to the phenotype _{IBD}). However, fine-scale estimation of _{IBD} in population samples is dependent on information about the chromosomal phase of alleles, which requires long-range phasing of the data _{IBD}

IBD-based estimates of ^{2}, age, sex, and geographic region were included as covariates to prevent confounding. These estimates range from 0.099 for recombination rate to 0.691 for height. The only quantitative trait yielding an estimate not significantly different from 0 was sex-ratio of offspring. For each of the eight quantitative phenotypes with published estimates of

Quantitative trait | N^{a} | h^{2} (IBD-based) | s.e. | | s.e. | Published h^{2} |

Body Mass Index (kg/m^{2}) | 20000 | 0.422 | 0.018 | 0.433 | 0.018 | 0.4–0.6 |

Cholesterol High Density Lipoprotein | 19977 | 0.446 | 0.017 | 0.457 | 0.018 | 0.5 |

Cholesterol Low Density Lipoprotein | 4547 | 0.196 | 0.062 | 0.198 | 0.063 | 0.376 |

Height (cm) | 20000 | 0.691 | 0.016 | 0.704 | 0.016 | 0.8 |

Menarche Age (years) | 15150 | 0.443 | 0.022 | 0.454 | 0.022 | 0.4–0.7 |

Menopause Age (years) | 5540 | 0.400 | 0.047 | 0.409 | 0.048 | 0.4–0.6 |

Monocyte White Blood Cell | 9651 | 0.343 | 0.032 | 0.351 | 0.032 | 0.378 |

Waist-Hip Ratio | 5538 | 0.181 | 0.037 | 0.187 | 0.038 | 0.3–0.6 |

Sex Ratio of offspring | 15000 | 0.026 | 0.017 | 0.021 | 0.018 | - |

Total Children | 15000 | 0.103 | 0.017 | 0.111 | 0.018 | - |

Recombination Rate | 10259 | 0.099 | 0.023 | 0.110 | 0.030 | - |

^{a}N is the number of individuals used in the analysis of each phenotype.

Previous studies, based on either genealogical or direct estimates of IBD sharing, have been limited to closely related individuals (first-cousins or closer), and may therefore be upwardly biased due to the impact of shared environment, dominance, or epistatic interactions

In most cases, researchers do not have access to long-range phased genotypes with which to estimate h^{2}. One suggested solution to this problem is the use of _{IBS}, the genome-wide proportion of alleles shared identical-by-state (IBS) at all genotyped loci, as a substitute for _{IBD} _{IBS} provides a poor estimate of _{IBD} for distantly related pairs of individuals). Taking advantage of long-range phase based estimates of _{IBD}, we sought to evaluate the use of _{IBS} for the estimation of h^{2}. For this purpose, we computed _{IBS} as defined in ^{2}, the _{IBS} matrix captures information from two distinct sources, depending on the degree of relationship between pairs of individuals. For large values of IBS it estimates genetic covariance over all SNPs in the genome. For low values of IBS it estimates genetic covariance over just those SNPs on the genotyping platform _{IBS} therefore tend to lie between the true value,

To avoid this bias, we implemented a different approach, retaining all individuals for the calculation of h^{2}, but setting values of _{IBS} less than or equal to a threshold t (_{IBS>t}) to 0, for t = 0.00, 0.025 and 0.05. This threshold defines the separation between closely and distantly related individuals. We evaluated this approach using both simulations and real data sets and observed a significant downward bias of narrow-sense heritability estimated from thresholded IBS (_{IBS}^{2}, by means of

Recently, Yang et al _{IBS}_{IBS}_{IBS}

To enable unbiased calculation of _{IBS}_{IBS}_{IBD}_{IBS}_{>t}, because fine-scale _{IBD}_{IBD}_{IBS}_{IBS}_{>t} and _{IBS}

^{2}>0. Our results are concordant with the previous estimates of ^{2} ratio of 0.53 (the average across all the traits) and found that only height, with a value of 0.58 was significantly different (p-value<0.0017, see

Phenotype | | s.e. | | s.e. | | |

Body Mass Index | 0.424 | 0.018 | 0.229 | 0.017 | 0.540 | 0.16(0.03) |

Cholesterol High Density Lipoprotein | 0.450 | 0.017 | 0.239 | 0.017 | 0.531 | 0.12(0.05) |

Cholesterol Low Density Lipoprotein | 0.199 | 0.063 | 0.103 | 0.065 | 0.518 | - |

Height | 0.687 | 0.016 | 0.399 | 0.017 | 0.581 | 0.42(0.03) |

Menarche Age | 0.451 | 0.022 | 0.225 | 0.022 | 0.499 | - |

Menopause Age | 0.409 | 0.048 | 0.136 | 0.053 | 0.333 | - |

Monocyte White Blood Cell | 0.343 | 0.032 | 0.198 | 0.032 | 0.577 | - |

Waist Hip Ratio | 0.188 | 0.037 | 0.140 | 0.055 | 0.745 | 0.13(0.05) |

Total Children | 0.102 | 0.028 | 0.043 | 0.023 | 0.422 | - |

For dichotomous phenotypes, ascertainment in samples with closely related pairs of individuals induces an upward bias in narrow-sense heritability jointly estimated from IBS above a threshold (_{IBS>t}_{IBS}

Phenotype | | s.e. | Prevalence | |

Alcohol Dependence | 0.235 | 0.030 | 0.07 | |

Asthma | 0.264 | 0.067 | 0.13 | |

Autoimmune Systemic RA SLE SSc AS | 0.200 | 0.048 | 0.02 | |

Autoimmune Tcell mediated | 0.192 | 0.033 | 0.05 | |

Breast Cancer | 0.117 | 0.051 | 0.12 | |

Coronary Artery Disease | 0.146 | 0.017 | 0.06 | 0.39(0.06) |

Hypertension in Pregnancy | 0.083 | 0.043 | 0.03 | |

Osteoarthritis | 0.126 | 0.026 | 0.1 | |

Prostate Cancer | 0.204 | 0.056 | 0.09 | |

Rheumatoid Arthritis | 0.261 | 0.061 | 0.01 | 0.63(0.06); 0.32(0.07) |

Type 2 Diabetes | 0.254 | 0.041 | 0.08 | 0.44(0.06) |

RA estimate without the MHC region.

RA in our study contained a mixture of CCP-positive and CCP-negative cases, while the previously published work is based on CCP-positive cases only

Our results are lower than previous estimates of the heritability explained by genotyped SNPs (

Shared environment, dominance effects and epistasis (i.e. non-additive interaction between variants) can upwardly bias estimates of

First, we estimated additive and dominance-like effects simultaneously under an ADE (additive, dominant, and environmental) model with variance components _{IBD}_{IBD}_{2}, where the latter represents the fraction of the genome with both chromosomes shared IBD for each pair of individuals _{IBD}^{2}. The only class of relationship with significant IBD2 is siblings who share an expected ¼ of their genome IBD2. This analysis will therefore focus overwhelmingly on the difference between siblings and other classes of relationship. Siblings are also subject to epistatic interactions and shared environment and so this analysis will be influenced by all three factors (shared environment, dominance, and epistasis). We note that this analysis will not detect shared environment effects that decay exactly in proportion to genome-wide IBD.

We initially examined a subset of 11 quantitative and dichotomous traits, viewed as likely candidates for environmental effects, in a subset of 15,000 genotyped individuals using the ADE framework. The results for these phenotypes are shown in ^{2}, they do not distinguish between the possible sources of inflation. The fact that the narrow sense heritability estimate

Phenotype | | s.e. | | s.e. | N^{a} | p-value |

Body Mass Index | 0.090 | 0.069 | 0.381 | 0.023 | 15000 | 0.18 |

Coronary Artery Disease | 0.387 | 0.078 | 0.164 | 0.023 | 6661 CA 11774 CO | 3.36E-04 |

Cholesterol High Density Lipoprotein | 0.141 | 0.066 | 0.423 | 0.023 | 15000 | 0.03 |

Cholesterol Low Density Lipoprotein | 0.257 | 0.071 | 0.202 | 0.023 | 13121 | 2.81E-04 |

Osteoarthritis | 0.279 | 0.075 | 0.181 | 0.022 | 2319 CA 11666 CO | 3.66E-05 |

Type 2 Diabetes | 0.363 | 0.072 | 0.301 | 0.022 | | 2.86E-08 |

Total number of children | 0.073 | 0.068 | 0.095 | 0.019 | 15000 | 0.27 |

Total number of children (Mothers) | 0.180 | 0.066 | 0.145 | 0.020 | 15000 | 4.19E-03 |

Breast Cancer | 0.154 | 0.081 | 0.128 | 0.022 | 2214 CA 11687 CO | 0.05 |

Prostate Cancer | 0.296 | 0.082 | 0.144 | 0.027 | 1792 CA 8328 CO | 9.01E-03 |

Hypertension in Pregnancy | 0.826 | 0.074 | 0.072 | 0.021 | 419 CA 10085 CO | 3.33E-16 |

For dichotomous phenotypes these estimates are reported on the observed scale. ^{a}N is the number of individuals used in the analysis of each phenotype (CA = cases; CO = controls). The p-values are from the likelihood ratio test of the ADE model against the _{IBD}

In order to address this issue, we performed a pedigree-based analysis, making use of genealogical information

The differences in heritability estimate between classes of relationship are consistent with a shared-environment only effect on phenotypic correlation, but not with a dominance only or epistasis only effect on phenotypic correlation.

If the heritability estimate of two copy IBD when fit jointly with IBD (_{IBD2} due to dominance and would have larger estimates of _{IBD} ^{2}. Again, this is not what is observed. Finally, if

Our results show that

Two additional results from _{IBD}

We have made use of long-range phased genotype data and genealogical information from an Icelandic cohort to shed light on the problem of missing heritability, and the relative contributions of common and rare sequence variants and environmental factors to complex human phenotypes.

First, we examined IBD based estimation of narrow-sense heritability

Second, we developed a new method to estimate ^{2}. The estimated value of

Finally, we investigated the impact of shared environment, dominance, and epistasis on estimates of

A standard way to quantify the contribution of environmental effects is to fit an ACE model

Interestingly, our estimate of the heritability of height (0.69) is lower than previous estimates (0.8)

We conclude that, for quantitative traits, more than half of ^{2}, this fraction is larger than previous estimates

This research was approved by the Data Protection Commission of Iceland and the National Bioethics Committee of Iceland. The appropriate informed consent was obtained for all sample donors.

We analyzed 38,167 individuals from the deCODE data set genotyped on the Illumina 300K chip. Owing to the sensitive nature of genotype data, access to these data can only be granted at the headquarters of deCODE Genetics in Iceland. The details of the genotyping, quality control, IBD estimation and genealogy are described elsewhere

The deCODE Genetics genealogy database, containing all contemporary Icelanders and most of their ancestors going back to the year 1650, was used to determine the genealogical relationships between individuals

A description of the phenotypes is given in

We used a linear mixed model approach to estimate the components of heritability in our data sets. In this approach, each phenotype Y is modeled from a multivariate normal distribution

For a normalized phenotype with mean 0 and variance 1, to estimate _{IBS>t} and _{IBS} respectively. To jointly estimate _{IBD} and _{IBS}. The intuition for this approach is that the large elements of genetic covariance matrix _{IBS} are good estimates of the pair-wise IBD of individuals. Indeed, this is why _{IBS} only provide information about SNPs in LD with those on the genotyping platform. This is why the _{IBS} applied to unrelated individuals in the approach of Yang et al. estimates _{IBS} into two components, one provides estimates of the phenotypic variance explained by SNPs on the genotyping platform, and the other provides an estimate of the remaining phenotypic variance. The total narrow-sense heritability is the sum of the parameters for _{IBS>t} and _{IBS}, and the heritability explained by genotyped SNPs is the parameter for _{IBS}.

To estimate _{IBD2} is a combination of shared environment, dominance, and epistatic effects. The parameter for _{IBD} is the narrow-sense heritability. Under the assumption of no shared environment or epistatic effects, the sum of these estimates is the broad-sense heritability. In all cases we adjusted for age, gender, and region of Iceland as fixed effects. We use REML including a constant vector instead of ML to prevent bias in heritability estimation

The _{IBS} matrix is estimated as defined in _{IBS>t} is the _{IBS} matrix with all entries less than t set to zero, with the exception of the diagonal, which is not changed.
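The thresholding step described above can be sketched in a few lines. This is a minimal illustration using plain nested lists; the function name `threshold_kinship` and the toy matrix values are hypothetical and not taken from the study.

```python
def threshold_kinship(k_ibs, t):
    """Return a copy of k_ibs with off-diagonal entries below t set to 0.

    Diagonal entries are left unchanged, as described in the text.
    """
    n = len(k_ibs)
    return [
        [k_ibs[i][j] if (i == j or k_ibs[i][j] >= t) else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Toy 3x3 IBS matrix (illustrative values only): individuals 0 and 1 are
# closely related, individual 2 is unrelated to both.
k_ibs = [
    [1.00, 0.48, 0.01],
    [0.48, 0.98, -0.02],
    [0.01, -0.02, 1.01],
]
k_thresh = threshold_kinship(k_ibs, t=0.05)
```

With t = 0.05 the close-relative entry (0.48) is retained while the small off-diagonal entries are zeroed, so the thresholded matrix carries information only from closely related pairs.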

Entry (i,j) of the _{IBD} matrix is the fraction of the genome shared IBD between individuals i and j. Pair-wise IBD estimates were obtained as described in _{IBD2} matrix is defined similarly, but pair-wise IBD2 estimates (both chromosomes IBD) are used in place of pair-wise IBD estimates. That is, a pair of individuals is IBD2 at a particular SNP if they are IBD on both haplotypes. Entry (j,j) is set to 1.0. We note that none of the kinship matrices defined here, and hence none of the resulting heritability estimates, relies on the genealogy; all are based on direct estimates from the genetic data.

We performed a set of experiments over simulated genotype data in order to verify that our estimates of _{1},p_{2},…,p_{N} where the number of SNPs N = 5,000 and each p_{i} is drawn uniformly between 0.05 and 0.5. We then repeated this process to create unobserved genotypes for each individual. The observed genotypes represent those on a genotyping platform and the unobserved represent those not in LD with the genotyped SNPs.

To simulate a pair of individuals with x% of their genome shared IBD, we copied x% of the SNPs of the first individual's haplotypes onto the corresponding SNPs of the second individual in the pair. We normalized the genotypes to have mean 0 and variance 1 and set the effect size for each SNP to be the square root of 0.5/10000. The phenotype for each individual is the sum over all SNPs of the product of the normalized genotype and the effect size, plus noise drawn from a random distribution with mean 0 and variance 0.5. This gives a phenotype with mean 0, variance 1, and a heritability of 50%, of which 50% is due to observed genotypes and 50% is due to unobserved genotypes.
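The phenotype simulation can be illustrated with a small, scaled-down sketch. Several details are assumptions made for the example rather than statements about the original implementation: noise is taken to be Gaussian, genotypes are normalized with their expected mean 2p and variance 2p(1-p), whole genotypes are copied at a random subset of SNPs, 1,000 SNPs are used instead of 10,000 to keep the run short, and observed and unobserved SNPs are not distinguished. The names `simulate_pair` and `pearson` are hypothetical.

```python
import math
import random


def simulate_pair(freqs, share, effect, noise_var, rng):
    """Return the phenotypes of two individuals sharing `share` of their SNPs."""

    def genotype(p):
        # Sum of two haplotypes, each carrying the allele with probability p.
        return (rng.random() < p) + (rng.random() < p)

    g1 = [genotype(p) for p in freqs]
    g2 = [genotype(p) for p in freqs]
    # Copy a random fraction `share` of SNPs from individual 1 onto individual 2.
    for idx in rng.sample(range(len(freqs)), int(share * len(freqs))):
        g2[idx] = g1[idx]

    def phenotype(g):
        # Normalize each genotype by its expected mean 2p and variance 2p(1-p).
        genetic = sum(
            effect * (x - 2 * p) / math.sqrt(2 * p * (1 - p))
            for x, p in zip(g, freqs)
        )
        return genetic + rng.gauss(0.0, math.sqrt(noise_var))

    return phenotype(g1), phenotype(g2)


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
n_snps = 1000  # scaled down from the 10,000 SNPs implied by the text
freqs = [rng.uniform(0.05, 0.5) for _ in range(n_snps)]
effect = math.sqrt(0.5 / n_snps)  # total genetic variance of 0.5
pairs = [simulate_pair(freqs, 0.5, effect, 0.5, rng) for _ in range(300)]

first, second = zip(*pairs)
pair_corr = pearson(first, second)
values = first + second
mean_value = sum(values) / len(values)
pheno_var = sum((v - mean_value) ** 2 for v in values) / (len(values) - 1)
```

Under these assumptions the phenotypic variance should be close to 1, and the correlation between members of a pair should be close to the shared fraction times the heritability (0.5 × 0.5 = 0.25), up to sampling noise.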

We then constructed relationship matrices K using the observed genotypes. For _{IBD}, we assumed that we had access to the true value of IBD for each pair of individuals. We constructed data sets of 1,400 individuals with several different types of relationship structure, created 1,000 replicates of each data set, and estimated narrow-sense heritability _{IBD}, _{IBS>t}, and _{IBS}. We estimated the joint estimates

The results shown in _{IBD} and _{IBS>t} give good estimates of _{IBS} is not a good estimator of either _{IBD}+_{IBS} and _{IBS>t}+_{IBS} provide joint estimates of _{IBS}>0.025 is removed from the data set, the estimate of _{IBS>t}, but the variance is significantly higher.

When the related individuals are closely related (e.g. K = 0.5), _{IBS>t} is a good estimator of K_{IBD} and the mean heritability estimate is the true _{IBS} is therefore biased towards the heritability explained by genotyped SNPs. This is why _{IBS} to estimate narrow-sense heritability without thresholding can lead to biased heritability estimates.

When the relatedness of individuals in the data set is moderate (e.g. K = 0.125), the joint model _{IBS>t}+_{IBS} does not provide a good estimate of since _{IBS>t} will be influenced by genotyped variants. However, _{IBS>t} as an estimate for the true heritability depends on the relatedness structure of the data set. In data sets with families, such as the cohort examined here, or the FHS data set

We performed a similar set of experiments to those described above, but this time used real genotype data with simulated phenotypes in order to verify that issues due to LD, IBD estimation, population structure, or other similar confounders did not affect our results. We selected 8000 individuals randomly from the complete data set and generated two sub-phenotypes for each individual. We generated two sets of causal variants C_{1} and C_{2} by selecting a causal variant every 500 SNPs along the even chromosomes for C_{1} and repeating the process along the odd chromosomes for C_{2}. We chose effect sizes _{1}_{2}_{1} and C_{2} respectively and set the sub-phenotypes for an individual by summing the product of their genotypes (0,1,2) times the corresponding effect sizes. We then added random noise _{1}_{2}_{1}_{2}_{1}_{2}

We recomputed the relationship matrices K using only the even chromosomes. For each simulated phenotype we estimated heritability using _{IBD}, _{IBS}, _{IBS>t}, and _{IBS>t}+_{IBS}. The results are shown in _{IBS}, which always lies between _{IBS} by examining the estimate of

For a given class of relatives, (e.g. siblings), for each phenotype we computed the correlation between the phenotype across all pairs of that class. The heritability estimate was then generated by dividing the correlation by the fraction of the genome expected to be shared IBD (e.g. 0.5 for siblings). It is not possible to place standard errors on the heritability estimate of the phenotypes due to the complex relatedness structure of the individuals in each class. One pair of siblings for example, might be the grandfather and granduncle of another pair of siblings. However, it is possible to compute an empirical mean and standard deviation across traits.
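The per-class estimate described above amounts to a correlation divided by an expected sharing fraction. The sketch below illustrates this with toy data; the function names are hypothetical, and the sharing fractions for half-siblings and first cousins are the standard expected values, added here as an assumption since the text only states 0.5 for siblings.

```python
import math

# Expected fraction of the genome shared IBD for each class of relatives.
# Only the sibling value is given in the text; the others are standard values.
EXPECTED_IBD = {"siblings": 0.5, "half-siblings": 0.25, "first cousins": 0.125}


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def class_heritability(pairs, relationship):
    """Phenotypic correlation across pairs divided by expected IBD sharing."""
    xs, ys = zip(*pairs)
    return pearson(xs, ys) / EXPECTED_IBD[relationship]


# Toy phenotype values for four sibling pairs (illustrative only).
sibling_pairs = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (4.0, 4.0)]
h2_siblings = class_heritability(sibling_pairs, "siblings")
```

For these toy pairs the correlation is 0.2, so the sibling-based estimate is 0.2 / 0.5 = 0.4.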

To compare the classes of relatives we computed the empirical mean and standard deviation of the differences of the heritability estimates across traits. The standard error of the difference is the standard deviation estimate divided by the square root of the number of phenotypes (17 in this case). We applied a Wald test to determine the p-value
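A minimal sketch of this comparison is given below. It assumes the sample standard deviation (n - 1 denominator) and a two-sided p-value from the standard normal distribution; neither choice is specified in the text, and the function name `wald_test` and the input values are hypothetical.

```python
import math


def wald_test(differences):
    """Wald test of whether the mean per-trait difference is zero.

    Returns (mean, standard error, z statistic, two-sided p-value).
    """
    n = len(differences)
    mean = sum(differences) / n
    # Sample standard deviation across traits (assumed n - 1 denominator).
    sd = math.sqrt(sum((d - mean) ** 2 for d in differences) / (n - 1))
    se = sd / math.sqrt(n)
    z = mean / se
    # Two-sided p-value under a standard normal reference distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return mean, se, z, p_value


# Illustrative differences in heritability estimates between two classes
# of relatives for three traits (the study used 17 phenotypes).
mean_diff, se_diff, z_stat, p_value = wald_test([0.1, 0.2, 0.3])
```

With only a handful of traits a t reference distribution would be more conservative than the normal one used here; the normal form is shown because it is the usual meaning of a Wald test.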

To determine the significance of the combined effects of shared environment, dominance, and epistatic interaction we constructed a one degree of freedom likelihood ratio test. We computed the likelihood of the ADE model fit with covariance matrices _{IBD2}+_{IBD} and the likelihood of the narrow-sense heritability estimated from _{IBD}.
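The test statistic and p-value for such a comparison can be sketched as follows, assuming the two maximized log-likelihoods are already available and that the statistic is referred to a chi-square distribution with one degree of freedom. The function name and the log-likelihood values are hypothetical.

```python
import math


def lrt_pvalue_1df(loglik_full, loglik_null):
    """One degree of freedom likelihood ratio test.

    Returns (statistic, p-value), where the statistic is twice the gain in
    log-likelihood of the full model over the nested null model.
    """
    stat = max(2.0 * (loglik_full - loglik_null), 0.0)
    # Survival function of a chi-square with 1 df: P(Z^2 > stat).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Illustrative log-likelihoods for a full (two-matrix) and a null
# (one-matrix) model; these numbers are made up for the example.
stat, p_value = lrt_pvalue_1df(loglik_full=-1000.0, loglik_null=-1001.920729410347)
```

The example values are chosen so that the statistic sits at the conventional 5% critical value of about 3.84.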

The heritability estimates in

Average heritability estimates of type 2 diabetes, coronary artery disease, and hypertension in pregnancy for six classes of relationship. The differences in heritability estimate between classes of relationship are consistent with a shared-environment only effect on phenotypic correlation, but not with a dominance only or epistasis only effect on phenotypic correlation.

(PDF)

Definition of subscripted parameters.

(DOCX)

Narrow-sense heritability (^{2}

(DOCX)

Heritability estimates over simulated data for a range of relatedness structures. Model refers to the relatedness structure of the data. 0.5 represents 700 pairs of individuals with 50% of their genome IBD. 0.5,0.025 represents 350 pairs with 50% IBD and 350 pairs with 2.5% IBD. 0.125,0.025 represents 350 pairs with 12.5% IBD and 350 pairs with 2.5% IBD. We used a threshold

(DOCX)

Heritability estimates from data simulated over even and odd chromosomes of 8,000 individuals from the deCODE cohort.

(DOCX)

Heritability estimates from data simulated over even and odd chromosomes of 8,000 individuals from the deCODE cohort.

(DOCX)

Narrow-sense heritability estimated from thresholding IBS (

(DOCX)

Narrow-sense heritability (^{2}

(DOCX)

Differences in heritability estimates between pairs of classes of relationships. If there is no effect of shared environment, dominance, or epistatic interaction then

(DOCX)

IBS based estimates (

(DOCX)

Using extended genealogy to estimate components of heritability for 23 quantitative and dichotomous traits.

(DOCX)

We are grateful to A. Helgason, D. Gudbjartsson, G. Thorleifsson, A. Kong, and K. Stefansson for valuable discussions, assistance with deCODE data, and comments on the manuscript.