Estimating variance components in population scale family trees

doi:10.1371/journal.pgen.1008124

Fig 1.

A demonstration of the Sci-LMM IBD matrix construction algorithm.

(a) An example pedigree with 26 individuals. (b) A heat-map representing the IBD matrix, where zero elements are white to emphasize sparsity. (c) A heat-map representing the lower-triangular Cholesky factor of the IBD matrix (i.e. the matrix L in the factorization A = LHL^T, where A is the IBD matrix). The value of entry i,j is the expected fraction of the genome that is shared between individual i and her ancestor j.
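The factorization described in panel (c) can be illustrated on a toy pedigree. The following is a minimal dense-matrix sketch, not the Sci-LMM implementation: the function name `pedigree_factor`, the input layout, and the assumption of no inbreeding (so that the middle matrix, called H in the caption, is diagonal with entries 1, 0.75 or 0.5) are all illustrative assumptions. A population-scale implementation would use sparse matrices.

```python
import numpy as np

def pedigree_factor(parents):
    """Return (L, h) for a pedigree, assuming no inbreeding (illustrative only).

    parents[i] is a (father, mother) tuple of row indices, with None for an
    unknown parent. Parents must appear before their children.
    L[i, j] is the expected genome fraction individual i inherits from
    ancestor j; h holds the diagonal of the middle matrix.
    """
    n = len(parents)
    L = np.eye(n)          # every individual shares all of its genome with itself
    h = np.ones(n)         # founders: diagonal entry 1
    for i, (father, mother) in enumerate(parents):
        known = [p for p in (father, mother) if p is not None]
        for p in known:
            # half of each known parent's ancestry row is passed to the child
            L[i, :i] += 0.5 * L[p, :i]
        # 0.75 with one known parent, 0.5 with two (no-inbreeding assumption)
        h[i] = 1.0 - 0.25 * len(known)
    return L, h

# Toy pedigree: 0 and 1 are founders, 2 and 3 are their children,
# 4 is a child of 2 with an unknown second parent.
toy_parents = [(None, None), (None, None), (0, 1), (0, 1), (2, None)]
L, h = pedigree_factor(toy_parents)
A = L @ np.diag(h) @ L.T   # relatedness matrix reconstructed from the factors
```

In this toy example, row 4 of `L` holds 0.5 for the parent (individual 2) and 0.25 for each grandparent, which matches the caption's reading of entry i,j as the genome fraction shared with ancestor j.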


Fig 2.

Evaluating the estimation accuracy of Sci-LMM.

(a-c) Box plots comparing REML and HE estimation accuracy (RMSE) across simulated datasets (each box represents 10 experiments), under varying sample sizes, using (a) only IBD, (b) IBD and epistasis, or (c) IBD, epistasis and dominance variance components. HE is more accurate than REML for smaller sample sizes, but REML outperforms HE as the sample size increases. Results for analyses with three matrices and 500,000 individuals are omitted because of the excessive computational time required. (d-e) Comparing REML and HE estimation accuracy when using IBD, epistasis and dominance matrices under various sparsity factors (the fraction of non-zero matrix entries) with either (d) 100,000 individuals, or (e) 250,000 individuals. The estimation accuracy of both REML and HE increases with the number of non-zero entries.


Fig 3.

Analysis of Sci-LMM computation time.

(a) Computation time required to compute an IBD matrix from pedigree data under different sparsity factors as a function of sample size. (b) Computation time required to compute an IBD matrix from pedigree data as a function of the number of non-zero relationships, demonstrating a linear relationship. The maximal number of evaluated non-zero relationships increases with the sparsity cutoff, because we only generated matrices with up to a million individuals. (c) Variance component estimation time (using REML), as a function of sample size, when using different combinations of covariance matrices. Epis: epistasis; Dom: dominance. (d) Same as (c), but for HE regression instead of REML estimation. Here we evaluated datasets with up to 2 million individuals that were not investigated in (c), owing to technical limitations of the sparse matrix factorization routines used in our REML implementation.


Fig 4.

Results of analysis of a real pedigree with 441,000 individuals.

(a) A histogram of genetic similarity across 441,000 individuals, using only the closest relationship between every pair of individuals. The degree of relationship between a pair of individuals is given by −log₂(K_ij)−1, where K_ij is their IBD coefficient (Methods). The dataset includes approximately 9.7 million pairs of individuals whose least common ancestor lived at least 10 generations earlier. (b) The estimated fraction of longevity variance attributed to different variance components (and their 95% CI).
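The degree-of-relationship formula in panel (a) can be evaluated directly. The sketch below only performs the arithmetic of −log₂(K_ij)−1; the function name `relationship_degree` is hypothetical, and which familial relationship a given coefficient corresponds to depends on the scale of K_ij defined in the Methods, which is not reproduced here.

```python
import math

def relationship_degree(k_ij):
    """Evaluate -log2(K_ij) - 1 for a positive IBD coefficient (illustrative)."""
    if k_ij <= 0:
        raise ValueError("the IBD coefficient must be positive")
    return -math.log2(k_ij) - 1

# Each halving of the coefficient raises the degree by one.
degree_quarter = relationship_degree(0.25)    # -log2(0.25) - 1 = 1.0
degree_eighth = relationship_degree(0.125)    # -log2(0.125) - 1 = 2.0
```

Because the formula is logarithmic in base 2, coefficients that differ by a factor of two fall into adjacent histogram bins.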
