The authors have declared that no competing interests exist.

Conceived and designed the experiments: GdlC DS. Performed the experiments: GdlC AIV YCK. Analyzed the data: GdlC AIV YCK. Wrote the paper: GdlC DS AIV YCK RF.

Despite important advances from Genome Wide Association Studies (GWAS), for most complex human traits and diseases, a sizable proportion of genetic variance remains unexplained and prediction accuracy (PA) is usually low. Evidence suggests that PA can be improved using Whole-Genome Regression (WGR) models where phenotypes are regressed on hundreds of thousands of variants simultaneously. The Genomic Best Linear Unbiased Prediction (G-BLUP, a ridge-regression type method) is a commonly used WGR method and has shown good predictive performance when applied to plant and animal breeding populations. However, breeding and human populations differ greatly in a number of factors that can affect the predictive performance of G-BLUP. Using theory, simulations, and real data analysis, we study the performance of G-BLUP when applied to data from related and unrelated human subjects. Under perfect linkage disequilibrium (LD) between markers and QTL, the prediction R-squared (R^{2}) of G-BLUP reaches trait-heritability, asymptotically. However, under imperfect LD between markers and QTL, prediction R^{2} based on G-BLUP has a much lower upper bound. We show that the minimum decrease in prediction accuracy caused by imperfect LD between markers and QTL is given by (1−b)^{2}, where b is the regression of marker-derived genomic relationships on those realized at causal loci. For closely related individuals b is close to one, so the decrease in R^{2} is small. However, with distantly related individuals b reaches very low values, imposing a very low upper bound on prediction R^{2}. Our simulations suggest that for the analysis of data from unrelated individuals, the asymptotic upper bound on R^{2} may be of the order of 20% of the trait heritability. We show how PA can be enhanced with use of variable selection or differential shrinkage of estimates of marker effects.

Despite great advances in genotyping technologies, the ability to predict complex traits and diseases remains limited. Increasing evidence suggests that many of these traits may be affected by a large number of small-effect genes that are difficult to detect in single-variant association studies. Whole-Genome Regression (WGR) methods can be used to confront this challenge and have exhibited good predictive power when applied to animal and plant breeding populations. WGR is receiving increased attention in the field of human genetics. However, human and breeding populations differ greatly in factors that can affect the performance of WGRs. Using theory, simulation and real data analysis, we study the predictive performance of the Genomic Best Linear Unbiased Predictor (G-BLUP), one of the most commonly used WGR methods. We derive upper bounds for the prediction accuracy of G-BLUP under perfect and imperfect LD between markers and genotypes at causal loci and validate such upper bounds using simulation and real data analysis. Imperfect LD between markers and causal loci can impose a very low upper bound on the prediction accuracy of G-BLUP, especially when data involve unrelated individuals. In this context, we propose and evaluate avenues for improving the predictive performance of G-BLUP.

Many important human traits and diseases are moderately to highly heritable. This, together with advances in genotyping and sequencing technologies, brought the promise of genomic medicine

The ability of a model to predict yet-to-be observed phenotypes (hereinafter referred to as PA, for prediction accuracy) constitutes one of its most important properties from the perspective of its potential use for preventive and personalized medicine. The study by Makowsky et al. reported an overall prediction R^{2} of 0.25. However, the R^{2} ranged from 0.36 for individuals having 3 or more close relatives in the training data set to 0.11 for individuals with no close relatives in the training data set. The result confirms previous findings from the field of animal breeding

In this article, using theory, simulation and real data analysis we study the factors that affect the extent of missing heritability and the prediction accuracy of G-BLUP for the analysis of human data. The article is organized as follows. The ^{2} formula under perfect LD between markers and QTL follows from standard properties of the multivariate normal density and similar results have been presented before ^{2} for the case of imperfect LD. Predictions from the formulas derived in the methods section are validated in

Standard quantitative genetic models describe phenotypes _{i}^{th}^{th}^{th}

Genomic BLUP can be motivated in many different ways: as a Ridge Regression (RR,

In empirical analyses, genomic relationships are usually computed using crossproduct terms between genotypes. In such cases, estimates derived from G-BLUP methods are equivalent to those that can be derived by regressing phenotypes on marker genotypes using a linear model,

Estimation of variance parameters in G-BLUP is possible only when

G-BLUP methods can also be used to predict yet-to-be observed phenotypes. For instance, using a set of ^{th}^{th}

The weights in this linear score (4) are given by the

The predictive ability of a model is commonly assessed using the variance of prediction errors (or prediction error variance), ^{2} in this article) can be quantified using

Formulas for PEV and R^{2} for the case when the model holds have been derived elsewhere ^{2} in validation data sets is given by^{2} of prediction of genetic values,

To get further insight on the role played by TRN-TST relationships, assume for now that the off-diagonal elements of ^{2} formula reduces to

The above expression gives interesting insights into the impact of realized genomic relationships (at causal loci) between TRN and testing (TST) samples on prediction accuracy. When individuals in the TRN data set are genetically independent (i.e., when ^{th}^{th}^{2}

The set of causal loci is typically unknown and in practice, marker genotypes are used to compute genomic relationships. The patterns of realized genomic relationships at different sets of loci (e.g., markers and causal loci) vary across the genome ^{2}. To circumvent this problem, a closed-form expression for an upper bound to R^{2} is obtained instead. The details of the derivation are given in section 2.2 of the Supplementary Methods. Briefly, to arrive at a closed-form upper bound it is assumed that genomic relationships realized at causal loci among pairs of individuals in the TRN data set are known. Therefore, we consider only the impacts of imperfect LD between markers and QTL that occur through misspecification of genetic relationships at causal loci between individuals in the TST and those in the TRN data set.

Prediction equations in GBLUP are given by

A regression similar to (6) was used by

Using (2) and (6) it can be shown (see eq. (S.14) of the Supplementary Methods) that the R^{2} that can be attained using markers that are in imperfect LD with genotypes at causal loci (^{2} that could be obtained with the same TRN sample if markers were in perfect LD with the QTL. ^{2} that occurs due to imperfect LD between markers and QTL. ^{2} due to imperfect LD between markers and QTL.^{2} under imperfect LD.

The principles used to derive expression (7) can also be applied when predictions are computed from pedigree information as opposed to markers. In this case

To obtain further insight on the impacts of imperfect LD between markers and QTL on the proportion of missing heritability and on PA, a simulation study and real data analysis were performed using data sets from related and from unrelated individuals.

We used two publicly available data sets: one involving family data (the Framingham Heart Study, hereinafter denoted as FHS,

Use of the FHS data set required IRB approval, which was obtained from the IRB of the University of Alabama at Birmingham (Protocol # X100712003). GEN did not qualify as human subject data.

Individuals in the FHS were genotyped with a 500 K SNP platform (K = 1,000) and those in GEN were genotyped with a 1,000 K SNP platform. We identified a set of 400 K SNPs which were available both in GEN and FHS and passed quality control criteria which consisted of removing markers with more than 10% of uncalled genotypes and with minor allele frequency (MAF) smaller than 0.5%. These 400 K SNPs were used for simulations and for data analysis, as described next. In both data sets individuals with unknown height, age or sex were removed in order to allow comparison with analysis of the real data presented in the next section.
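The marker quality-control rule described above (drop SNPs with more than 10% uncalled genotypes or MAF below 0.5%) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `qc_filter`, the 0/1/2 allele-count coding, and the use of `np.nan` for uncalled genotypes are all assumptions.

```python
import numpy as np

def qc_filter(genotypes, max_missing=0.10, min_maf=0.005):
    """Return a boolean mask of SNPs (columns) that pass QC.

    genotypes: (n_individuals, n_snps) array of 0/1/2 allele counts,
               with np.nan marking uncalled genotypes (assumed encoding).
    """
    missing_rate = np.isnan(genotypes).mean(axis=0)
    called = (~np.isnan(genotypes)).sum(axis=0)
    allele_sum = np.nansum(genotypes, axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        # Each called genotype carries two alleles.
        freq = np.where(called > 0, allele_sum / (2 * called), np.nan)
        maf = np.minimum(freq, 1.0 - freq)
        # Comparisons with nan are False, so SNPs with no calls are dropped.
        keep = (missing_rate <= max_missing) & (maf >= min_maf)
    return keep

# Toy data: 4 individuals x 4 SNPs (hypothetical values).
X = np.array([[0, 1, 0, np.nan],
              [1, 2, 0, np.nan],
              [2, 1, 0, 1.0],
              [1, 0, 0, np.nan]])
mask = qc_filter(X)  # SNP 3 is monomorphic, SNP 4 is 75% uncalled
```

Applying the same mask to both data sets would then give the shared marker panel used downstream.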

In GEN, only nominally unrelated individuals of Caucasian origin with less than 5% of missing genotypes were included. This left 5,854 individuals out of which 5,800 were randomly sampled for the analysis. In FHS individuals with no data from relatives and subjects with more than 5% of uncalled genotypes were removed. The 7,865 individuals that passed these criteria were ranked using an index consisting of the sum of squares of the additive relationships (computed from pedigree) of each individual with the rest of the data set

The simulation study incorporates the observed genotypes from the FHS and GEN data sets as follows. From the available 400 K SNPs, 300 K were randomly sampled and designated as markers (hereinafter genotypes at these loci are denoted as ^{−5}, to all markers) or (b) at random but sampling an excess of SNPs with low MAF (Low-MAF). Here we assigned sampling probability of 0 to markers with MAF greater than 0.15, 1.76×10^{−5} probability to markers with MAF between 0.05 and 0.15 and three times higher probability, 5.28×10^{−5}, to markers with MAF smaller or equal to 0.05. These sampling probabilities were defined based on MAF computed using the average MAF observed in FHS and GEN. In the RAND scenario, markers and genotypes at causal loci were sampled from the same distribution; on the other hand, in the Low-MAF scenario markers and genotypes at causal loci have different distribution of allele frequencies.
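A minimal sketch of the Low-MAF sampling scheme, using the per-marker probabilities quoted above. Treating those probabilities as relative weights for sampling without replacement, the function names, the number of causal loci (`n_causal`), and the synthetic MAF values are assumptions made only for illustration.

```python
import numpy as np

def low_maf_weights(maf):
    """Per-SNP sampling probabilities for the Low-MAF scenario."""
    maf = np.asarray(maf, dtype=float)
    w = np.zeros_like(maf)                     # MAF > 0.15: never sampled
    w[(maf > 0.05) & (maf <= 0.15)] = 1.76e-5
    w[maf <= 0.05] = 5.28e-5                   # three times the mid-range value
    return w

def sample_causal(maf, n_causal, rng):
    """Draw n_causal distinct SNP indices with probability proportional to the weights."""
    w = low_maf_weights(maf)
    return rng.choice(len(w), size=n_causal, replace=False, p=w / w.sum())

rng = np.random.default_rng(1)
maf = rng.uniform(0.005, 0.5, size=10_000)     # synthetic MAFs, not the FHS/GEN values
causal = sample_causal(maf, n_causal=200, rng=rng)
```

In the RAND scenario the same helper would be called with equal weights for all SNPs, so that markers and causal loci share one allele-frequency distribution.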

The effects of causal loci

The simulated data were analyzed using a G-BLUP model with genomic relationships computed either using genotypes at causal loci (

There are several ways of computing genomic relationships

In the analysis of the simulated data a Bayesian model with uninformative priors for variance parameters was used to estimate variance parameters and to predict phenotypes and genetic values. Details of the model used for analysis as well as specifics of the MCMC implementation are given in Supplementary Methods.

For each of the 30 MC replicates, we used data from 5,300 individuals to train the models (TRN) and data from 500 individuals for testing (TST). Phenotypes of individuals assigned to TST were regarded as missing and prediction accuracy was assessed by means of the R^{2} between predicted and observed phenotypes in the TST data sets, estimated as the square of Pearson's product-moment correlation. The assignment of individuals to TRN/TST data sets was completely at random in GEN. In FHS we designed a sampling strategy that guarantees that prediction used only ancestors and nominally unrelated individuals. Specifically, TST data sets were drawn from the most recent cohort, and the algorithm used to construct the TRN data set avoided contemporaneous relatives between individuals in the TRN and TST data sets.
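The evaluation scheme for GEN (a completely random TRN/TST split, with accuracy measured as the squared Pearson correlation) can be sketched as below. The helper names are hypothetical, and the cohort-aware sampling used for FHS is not reproduced here.

```python
import numpy as np

def random_partition(n, n_test, rng):
    """Randomly split indices 0..n-1 into training and testing sets."""
    idx = rng.permutation(n)
    return idx[n_test:], idx[:n_test]

def prediction_r2(y_obs, y_pred):
    """Squared Pearson product-moment correlation between observed and predicted values."""
    return float(np.corrcoef(y_obs, y_pred)[0, 1] ** 2)

rng = np.random.default_rng(0)
trn, tst = random_partition(5800, 500, rng)   # 5,300 TRN and 500 TST individuals
```

Repeating the split with 30 different seeds would give the 30 partitions over which R^{2} is averaged.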

In the real data analysis height is used as a model trait and we report estimates of variance parameters and assess PA in FHS and GEN. The models for the analysis were similar to those used in the simulation (see below for details) but alternatives that incorporate results from previous GWAS into the whole-genome regression approach are also considered. In both data sets height was preadjusted with estimates of effects of age and sex (estimated within each data set). Estimates of variance components were derived by fitting models to each of the full data sets (N = 5,800, in both cases) and to the combined data sets (N = 11,600). PA was assessed for FHS and GEN using the same 30 TST data sets used in the simulation, each containing 500 individuals.

For assessing PA, the training was done within data set (N-TRN = 5,300 in each partition) or using a combined data set. For the combined data set analyses, when testing was carried out in GEN, the TRN data set included the 5,300 individuals from GEN (those not used for TST in each partition) plus 5,800 from FHS. Similarly, when testing in FHS, the TRN data set included 5,300 individuals from FHS and 5,800 from GEN.

The baseline model was a G-BLUP using all available markers (p = 400 K SNPs). As shown in the next section, results from the simulation study suggest that imperfect LD between genotypes at markers and those at causal loci can have dramatic impacts on PA, especially when data involve unrelated individuals. In practice, the set of causal loci is unknown; however, it is possible to use information from existing GWAS to either select or weight the markers included in the analysis. Therefore, we also evaluated G-BLUP models using a subset of markers selected on the basis of their association p-values for human height published by the GIANT consortium
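As a rough sketch of the baseline predictor, the code below computes G-BLUP predictions from a crossproduct-based genomic relationship matrix and checks numerically that they coincide with ridge-regression predictions on the same standardized markers. It assumes the ratio of residual to genomic variance (`lam`) is known, whereas the article estimates variance parameters with a Bayesian model. Function names and the synthetic data are illustrative only.

```python
import numpy as np

def genomic_relationships(X):
    """Return G = ZZ'/p and Z, with the columns of X centered and standardized."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return Z @ Z.T / Z.shape[1], Z

def gblup_predict(G, y, trn, tst, lam):
    """Predict genetic values of tst from trn phenotypes; lam = residual/genomic variance."""
    K = G[np.ix_(trn, trn)] + lam * np.eye(len(trn))
    return G[np.ix_(tst, trn)] @ np.linalg.solve(K, y[trn] - y[trn].mean())

rng = np.random.default_rng(0)
n, p, lam = 60, 200, 1.0
freqs = rng.uniform(0.2, 0.5, size=p)                   # synthetic allele frequencies
X = rng.binomial(2, freqs, size=(n, p)).astype(float)   # synthetic genotypes
G, Z = genomic_relationships(X)
y = Z @ rng.normal(0.0, 0.1, size=p) + rng.normal(size=n)
trn, tst = np.arange(50), np.arange(50, 60)

g_hat = gblup_predict(G, y, trn, tst, lam)

# Equivalent ridge regression on markers: the penalty is lam * p because G = ZZ'/p.
Zt = Z[trn]
beta = np.linalg.solve(Zt.T @ Zt + lam * p * np.eye(p), Zt.T @ (y[trn] - y[trn].mean()))
ridge_hat = Z[tst] @ beta
```

The variable-selection variant amounts to calling `genomic_relationships` on the subset of columns retained after ranking SNPs by their published p-values.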

As an alternative to this variable selection approach we evaluated the use of genomic matrices computed by weighting all markers differentially

Results from the simulation study are reported first; this is followed by the results of the real data analysis.

Type of loci | Scenario | Data set | MAF <3% | MAF 3%–5% | MAF 5%–10% | MAF 10%–15% | MAF >15%
Tag | | FHS | .061 | .049 | .119 | .116 | .654
Tag | | GEN | .065 | .049 | .119 | .115 | .652
Causal | RAND | FHS | .063 | .047 | .117 | .123 | .651
Causal | RAND | GEN | .066 | .048 | .117 | .117 | .651
Causal | Low-MAF | FHS | .310 | .233 | .239 | .207 | .011
Causal | Low-MAF | GEN | .321 | .237 | .231 | .201 | .010

The eigenvalue decomposition of the marker-derived genomic relationship matrices revealed that the cumulative variance explained by the first five eigenvalues was 0.35, 0.51, 0.64, 0.78 and 0.90% in FHS and 0.35, 0.51, 0.61, 0.69, and 0.77% in GEN, respectively. Ordinary least squares regression of adjusted height on the first PC explained a proportion of the variance (in the training sample) equal to 4% in FHS and to 2% in GEN. Therefore, although both data sets exhibit some extent of population stratification, the proportion of variance of genotypes explained by the leading principal components was low.
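The two stratification diagnostics reported above (cumulative share of variance captured by the leading eigenvalues of the genomic relationship matrix, and the R^{2} of an OLS regression of the phenotype on the first principal component) can be computed along these lines. The function name and the synthetic genotypes are assumptions; the numbers it produces are unrelated to the FHS/GEN values.

```python
import numpy as np

def pc_summary(G, y, k=5):
    """Cumulative variance share of the top-k eigenvalues of G and R^2 of y on PC1."""
    vals, vecs = np.linalg.eigh(G)            # eigh returns eigenvalues in ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]    # reorder so the largest comes first
    cum_share = np.cumsum(vals[:k]) / vals.sum()
    # With an intercept and a single regressor, OLS R^2 equals the squared correlation.
    r2_pc1 = float(np.corrcoef(vecs[:, 0], y)[0, 1] ** 2)
    return cum_share, r2_pc1

rng = np.random.default_rng(0)
X = rng.binomial(2, 0.3, size=(100, 500)).astype(float)   # synthetic genotypes
Z = (X - X.mean(axis=0)) / X.std(axis=0)
G = Z @ Z.T / Z.shape[1]
y = rng.normal(size=100)                                   # synthetic phenotype
cum_share, r2_pc1 = pc_summary(G, y)
```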

Estimates of ^{2}, averaged across 30 MC replicates are displayed in

Scenario | Genetic information used to compute relationships | Estimated genomic heritability, FHS | Estimated genomic heritability, GEN | Prediction R^{2}, FHS | Prediction R^{2}, GEN
RAND | Causal loci | 0.775 (0.009) | 0.773 (0.010) | 0.545 (0.040) | 0.517 (0.031)
RAND | Markers | 0.774 (0.018) | 0.737 (0.040) | 0.263 (0.048) | 0.071 (0.023)
RAND | Pedigree | 0.764 (0.020) | — | 0.223 (0.047) | —
Low-MAF | Causal loci | 0.777 (0.007) | 0.775 (0.008) | 0.551 (0.026) | 0.536 (0.026)
Low-MAF | Markers | 0.748 (0.018) | 0.573 (0.058) | 0.240 (0.029) | 0.049 (0.019)
Low-MAF | Pedigree | 0.755 (0.023) | — | 0.224 (0.033) | —

Estimated genomic heritability: average (over 30 MC replicates) estimated posterior mean of the ratio of genomic variance over the sum of genomic and residual variance.

Prediction R^{2}: average prediction R^{2} (phenotypes) over 30 training (N = 5,300)-testing (N = 500) partitions.

The estimates (± estimated standard error)

For the FHS we also fitted a pedigree-based model to the simulated data and the estimates of proportion of variance explained with pedigrees (0.764 and 0.755 in the RAND and Low-MAF scenarios, respectively) were very similar, only slightly lower, but not significantly different based on the MC SEs, to those obtained when genotypes at markers were used.

When genotypes at marker and causal loci are in perfect LD, the R^{2} between predicted and observed phenotypes in TST data sets (averaged across 30 MC replicates) ranged from 0.517–0.551, with very minor differences across data sets and scenarios. The R^{2} for prediction of genetic values (not presented in the

The PA attained when marker genotypes were used to compute genomic relationships was much lower than that achieved using genotypes at causal loci. Reduction in R^{2} due to imperfect LD between markers and QTL ranged from 52% (for FHS in the RAND scenario, computed as 100×[1−0.263/0.545]) to 91% (for GEN in the Low-MAF scenario, computed as 100×[1−0.049/0.536]). In both data sets the reduction in R^{2} was higher in the Low-MAF scenario than in the RAND scenario; however, regardless of the simulation scenario, the reduction was much larger in GEN (86–91%) than in FHS (52–56%).
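The percentage reductions quoted above follow directly from the prediction R^{2} values reported for markers and for causal loci. The short script below reproduces the arithmetic for all four data set and scenario combinations; the dictionary layout is only a convenience for this illustration.

```python
# (R^2 with markers, R^2 with causal loci), taken from the simulation results table
r2 = {
    ("FHS", "RAND"): (0.263, 0.545),
    ("FHS", "Low-MAF"): (0.240, 0.551),
    ("GEN", "RAND"): (0.071, 0.517),
    ("GEN", "Low-MAF"): (0.049, 0.536),
}

# Reduction in R^2 due to imperfect LD, in percent: 100 * [1 - R2_markers / R2_causal]
reduction = {key: 100.0 * (1.0 - markers / causal) for key, (markers, causal) in r2.items()}

for (data_set, scenario), value in reduction.items():
    print(f"{data_set:3s} {scenario:8s} {value:5.1f}%")
```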

Importantly, in many cases, the value of the estimated genomic heritability was not a good indicator of prediction R^{2}. For instance, in FHS the use of markers, as opposed to causal loci, did not induce a great extent of missing heritability, but the PA attained with markers was less than 50% of that attained when genotypes at causal loci were used. Another example can be seen in GEN; here, in the RAND scenario when markers were used we observed only a small extent of missing heritability (the estimated value was 0.737, against 0.773 with genotypes at causal loci); however, the reduction in R^{2} due to use of markers that were in imperfect LD with causal loci was dramatic (86%, computed as 100×[1−0.071/0.517]).

Finally, in FHS (RAND scenario) the prediction accuracy of the pedigree model (R^{2} 0.223) was, as one would expect, lower than that of the marker-based model (R^{2} 0.263). Relative to the pedigree model, using markers leads to a gain in PA in the R^{2} (correlation) scale of 17.9% (8.6%), computed as 100×[0.263/0.223–1].

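According to expression (7) of section 2, imperfect LD between markers and QTL results in a minimum reduction in prediction R^{2} of (1−b_{n+1})^{2}, where b_{n+1} is the regression coefficient of marker-derived (or pedigree-derived) relationships on genomic relationships realized at causal loci.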

Data set | Information used to compute relationships | Simulation scenario | Regression coefficient (b_{n+1}) | Minimum reduction in R^{2}, (1−b_{n+1})^{2} | Observed reduction in R^{2}
Framingham | Pedigree | Random | 0.295 | 50% | 59%
Framingham | Pedigree | Low-MAF | 0.285 | 51% | 60%
Framingham | Markers | Random | 0.371 | 40% | 52%
Framingham | Markers | Low-MAF | 0.334 | 44% | 56%
GENEVA | Markers | Random | 0.127 | 76% | 86%
GENEVA | Markers | Low-MAF | 0.089 | 83% | 91%

: For each individual in the testing (TST) data sets we computed the regression of marker or pedigree derived relationships on genomic relationships computed at causal loci,

: Upper bound calculated using expression (7).

: Reduction in prediction R^{2} observed when data was analyzed using markers relative to that obtained when data was analyzed using genotypes at causal loci (see

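The validity of inequality (7) can be evaluated by comparing the minimum reduction factor in prediction R^{2}, (1−b_{n+1})^{2}, with the observed reduction (the last two columns of the table above). In all cases the minimum reduction was smaller than the observed reduction in R^{2}, as expected from (7). The minimum and the observed reductions in R^{2} are of similar magnitude, with a difference between the two of roughly 10 percentage units.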
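The tabulated minimum reductions are consistent with squaring one minus the reported regression coefficient. The check below recomputes them from the coefficients and compares them with the observed reductions; the symbol `b` and the list layout are notational assumptions for this sketch.

```python
# (data set, information, scenario, regression coefficient b, observed reduction in %)
rows = [
    ("Framingham", "Pedigree", "Random", 0.295, 59),
    ("Framingham", "Pedigree", "Low-MAF", 0.285, 60),
    ("Framingham", "Markers", "Random", 0.371, 52),
    ("Framingham", "Markers", "Low-MAF", 0.334, 56),
    ("GENEVA", "Markers", "Random", 0.127, 86),
    ("GENEVA", "Markers", "Low-MAF", 0.089, 91),
]

# Minimum reduction in R^2 implied by the bound, in percent: 100 * (1 - b)^2
minimum = [100.0 * (1.0 - b) ** 2 for *_, b, _ in rows]
# Share of the observed reduction accounted for by the minimum reduction
share = [m / obs for m, (*_, obs) in zip(minimum, rows)]

for (data_set, info, scenario, b, obs), m, s in zip(rows, minimum, share):
    print(f"{data_set:10s} {info:8s} {scenario:7s} b={b:.3f} min={m:4.1f}% obs={obs}% share={s:.2f}")
```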

To get further insight, and to check the validity of the linear relationship postulated by

The plot displays realized relationships between one individual in TST and all the other individuals in TRN for GEN (right panel) and FHS (left panel). Genomic relationships computed using markers are given in the vertical axis and those computed using genotypes at causal loci are in the horizontal coordinate.
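For a single TST individual, the regression summarized in the plot can be sketched as an ordinary least-squares slope of marker-derived relationships on relationships at causal loci. Including an intercept, the function name, and the synthetic relationship values are assumptions; the article's expression (6) may be specified differently.

```python
import numpy as np

def relationship_regression(g_marker, g_causal):
    """OLS slope (with intercept) of marker-derived on causal-locus relationships.

    Both inputs hold the relationships between one TST individual and every
    TRN individual, computed from markers and from causal loci respectively.
    """
    x = g_causal - g_causal.mean()
    return float(x @ (g_marker - g_marker.mean()) / (x @ x))

rng = np.random.default_rng(0)
g_causal = rng.normal(0.0, 0.02, size=5300)                    # synthetic relationships
g_marker = 0.3 * g_causal + rng.normal(0.0, 0.01, size=5300)   # true slope set to 0.3
b = relationship_regression(g_marker, g_causal)
```

Averaging this slope over the individuals in the TST sets would give a data-set-level coefficient comparable to the values tabulated in the article.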

The effect of the degree of familial relationships on the regression between genomic relationships realized at markers and at causal loci is further illustrated in

Data set | Framingham | Framingham | Framingham | Framingham | GENEVA | GENEVA
Relationships | Related | Related | Unrelated | Unrelated | Unrelated | Unrelated
Scenario | RAND | Low-MAF | RAND | Low-MAF | RAND | Low-MAF
Average | 0.992 | 0.998 | 0.119 | 0.078 | 0.127 | 0.089
q_{5%} | 0.898 | 0.827 | 0.085 | 0.048 | 0.087 | 0.051
q_{95%} | 1.083 | 1.174 | 0.184 | 0.130 | 0.329 | 0.269

Relationships: relationship between the individual whose phenotype is predicted and those used for model training. q_{5%} and q_{95%} represent the 5% and 95% empirical percentiles of the estimated regression coefficients.

Data set | Pedigree | G-BLUP | wG-BLUP
Framingham (N = 5,800) | 0.857 (0.016) | 0.837 (0.016) | 0.814 (0.013)
GENEVA (N = 5,800) | — | 0.374 (0.049) | 0.268 (0.026)
Framingham+GENEVA (N = 11,600) | — | 0.721 (0.016) | 0.632 (0.015)

On the other hand, the analysis of data from unrelated individuals (GEN) exhibited a great extent of missing heritability: roughly 53% for G-BLUP (computed as 100×[1−0.374/0.80]) and even more for wG-BLUP. These results are also in agreement with previous reports for the trait (e.g.,

The prediction R^{2} values within FHS obtained with P-BLUP and G-BLUP were of the order of 0.28. These results are similar to what was observed for this data set in the simulation, with two main differences: (a) the values of R^{2} obtained in the simulation were 10–20% lower than those in the real data and (b) in the simulation, G-BLUP outperformed P-BLUP but in the real data analysis the opposite happened. This observation is consistent with the conjecture that P-BLUP may be capturing environmental and non-additive effects that are not captured by G-BLUP. The PA of G-BLUP in GEN was very poor (R^{2} of 3.1% when training with GEN only) and this is in agreement with the simulation results. The PA of wG-BLUP was higher than that of G-BLUP in FHS (11% gain in R^{2}, calculated as 100×[0.311/0.281–1]) and substantially higher in GEN (the R^{2} of wG-BLUP, 0.086, was almost 3 times that of G-BLUP). Combining FHS and GEN for training was beneficial for prediction of GEN but not for prediction of FHS. With use of a TRN data set that included FHS and GEN and with wG-BLUP we attained a prediction R^{2} of 11% with unrelated individuals. Interestingly, in the case of GEN wG-BLUP leads to a higher proportion of missing heritability and to higher prediction accuracy, stressing again that there is no direct relationship between estimates of genomic heritability and prediction R^{2}.

Training N-FHS | Training N-GEN | Testing N-FHS | Testing N-GEN | Pedigree-BLUP | G-BLUP | wG-BLUP
5,300 | — | 500 | — | 0.284 (0.048) | 0.281 (0.051) | 0.311 (0.037)
5,300 | 5,800 | 500 | — | | 0.273 (0.048) | 0.290 (0.036)
— | 5,300 | — | 500 | | 0.031 (0.013) | 0.086 (0.020)
5,800 | 5,300 | — | 500 | | 0.036 (0.015) | 0.110 (0.027)

The above results indicate that the use of differential weights in the computation of genomic relationships may be beneficial, especially with data from unrelated individuals. Another alternative is to use p-values from GWAS to select predictors. We therefore examined the prediction R^{2} obtained with FHS and GEN using subsets of markers selected on the basis of the associated p-values reported by the GIANT consortium. In FHS, R^{2} increased monotonically with marker density in the range 0–100 K SNPs, suggesting no benefit of pre-selecting markers within that range. In this data set, R^{2} with 400 K SNPs was only slightly lower than that obtained using the top 100 K SNPs; therefore, we conclude that there is little benefit of performing variable selection when family data are used. On the other hand, for GEN, benefits of pre-selection of markers are clearly observed: in this case R^{2} increases sharply up to 5 K SNPs, achieving a prediction R^{2} substantially higher than the G-BLUP with 400 K SNPs (7.5% relative to 3.1% or 9.9% relative to 3.6% in the analyses with training using GEN or GEN combined with FHS, respectively), and decreases thereafter with higher marker density. However, wG-BLUP gave higher prediction accuracy than the use of 5 K selected SNPs (R^{2} of 8.6% and 11% when training with GEN alone or with GEN and FHS combined, respectively), suggesting that the use of 'smooth weights' may be better than the use of 0/1 weights, which are implicitly used when markers are pre-selected. We note again that the estimates of genomic heritability did not follow the same patterns as those of R^{2}; for instance, the analysis with the top 5 K SNPs yielded smaller genomic heritability but higher prediction accuracy than the analysis with 400 K SNPs.
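The contrast between 0/1 weights and smooth weights can be made concrete with a weighted genomic relationship matrix. In the sketch below, pre-selecting the top SNPs is shown to be the special case of 0/1 weights. The smooth weights (minus log10 of the p-value) are only an illustrative choice and should not be read as the weighting used in the article; the p-values and genotypes are synthetic.

```python
import numpy as np

def weighted_G(Z, w):
    """Weighted genomic relationships: sum_j w_j z_j z_j' / sum_j w_j."""
    w = np.asarray(w, dtype=float)
    return (Z * w) @ Z.T / w.sum()

rng = np.random.default_rng(0)
X = rng.binomial(2, 0.3, size=(50, 300)).astype(float)   # synthetic genotypes
Z = (X - X.mean(axis=0)) / X.std(axis=0)
p_values = rng.uniform(1e-8, 1.0, size=300)              # stand-ins for GWAS p-values

# 0/1 weights: keep only the 30 SNPs with the smallest p-values.
top = np.argsort(p_values)[:30]
w01 = np.zeros(300)
w01[top] = 1.0
G_selected = weighted_G(Z, w01)

# Smooth weights (hypothetical choice): every SNP contributes, scaled by -log10(p).
G_smooth = weighted_G(Z, -np.log10(p_values))
```

Either matrix can then replace the unweighted genomic relationship matrix in a G-BLUP fit.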

Dotted horizontal lines give the prediction R^{2} obtained when all markers (p = 400,000) were used.

In recent years GWAS have uncovered unprecedented numbers of variants associated with many important complex human traits and diseases. However in most cases the joint effects of variants detected so far explain only a small proportion of the genetic variance of those traits, a problem referred to as the missing heritability of complex traits

The purpose of this study was to shed light on some of the factors that affect the predictive performance of G-BLUP and to identify avenues by which this methodology can be improved. In particular we focused on studying how imperfect LD between markers and QTL affects the extent of shrinkage in prediction R^{2}, relative to the prediction R^{2} obtained with the same sample and data structure if genotypes at causal loci were known. Several authors have studied the factors affecting prediction accuracy of G-BLUP. For instance, Goddard ^{2} to features of the trait (e.g, ^{2}) of the sample (e.g., size of the training data set) and of the genome (e.g., span of LD and how this affects the number of independently segregating segments). The studies of Goddard

One limitation of Goddard's approach is that it does not account for the effects of recent familial relationships (the derivations are solely based on population LD). Our approach captures, via the regression of marker-derived genomic relationships on those realized at causal loci, both effects of LD between markers and QTL as well as cosegregation between markers and QTL that occurs because of recent family relationships. On the other hand, Daetwyler's approach

In Goddard's approach ^{2} under perfect LD; and we make almost no statements about this quantity, other than the ones that follow from the properties of the multivariate normal distribution. Instead, we focus on quantifying the effects of LD on R^{2} that occurs through misspecification of TRN-TST relationships. Importantly, our simulation results show that the proposed upper bound formulas account for 80–90% of the observed shrinkage in R^{2}.

In summary, our approach is complementary to that of Goddard ^{2} due to imperfect LD without making strong assumptions.

The ability of G-BLUP to separate true signal (

If the variance of the realized genomic relationships (across regions of the genome) is small relative to their expected value, the patterns of realized genomic relationships at markers will provide a good description of the patterns of realized genetic relationships at unobserved causal loci. Hill and Weir (2011)

In this study we have chosen to center and to standardize markers using estimates of allele frequency derived from the sample. As stated, centering does not have an effect on predictions or on estimates of variance parameters

The results of the simulation study indicate that when markers and QTL are in perfect LD, no missing heritability is observed, as expected. This holds regardless of whether the training sample comprises data from related or unrelated individuals. When markers and QTL are in imperfect LD two contrasting situations were encountered: (a) with family data no missing heritability was observed, and (b) with unrelated individuals, we either observed a small extent of missing heritability (when markers and QTL were sampled from the same distribution of loci, the RAND scenario) or a greater extent of missing heritability (this happened when the distribution of allele frequency at markers and causal loci was different, the Low-MAF scenario). The estimates of variance components and of genomic heritability for human height reported here are consistent with previous results for this trait. In other words, no missing heritability was observed in the analysis of family data

Predictions based on G-BLUP are weighted averages of phenotypes in the TRN data set (see, ^{2} (see Supplementary Methods). These expressions are valid ^{2} has an upper bound given by an index that is the product of the heritability of the trait times a weighted sum of squares of the realized genomic relationships between the individuals used for TRN and those in the TST data set (see ^{2} converges to the heritability of the trait. However, this does not occur ^{2} can have an upper bound that is much lower than the heritability of the trait. Assuming a linear relationship between the realized genomic relationships at markers and at causal loci, an upper bound to prediction R^{2} under imperfect LD between markers and QTL (^{2} that can be obtained (using the same TRN sample) if markers and QTL were in perfect LD

The regression coefficient ^{2}. When the TRN and TST data set are related due to close familial relationships, the regression of genomic relationships at markers on those at causal loci ^{2} due to imperfect LD, ^{2} is therefore predicted (of the order of 80% computed as 100×[1–2×0.1+0.1^{2}]). Importantly, the minimum shrinkage in R^{2} predicted by our formula matched very closely the observed shrinkage due to imperfect LD estimated in the simulation (roughly, the minimum shrinkage factor was 80–90% of the observed shrinkage in R^{2}, see

The maximum R^{2} that can be attained under perfect LD (assuming infinitely large samples and that the model holds) is h^{2}, the heritability of the trait. Imperfect LD between markers and QTL induces shrinkage in R^{2}; in the case of data sets of nominally unrelated individuals similar to GEN, a minimum shrinkage in R^{2} of 80% is anticipated; therefore, the expected asymptotic upper bound for R^{2} is 20% of h^{2}, or 16% in the case of height. This estimate applies to data sets with characteristics similar to those of the GEN data set. Prediction problems involving individuals that are more (less) distantly related than the average individual in GEN are expected to have a lower (higher) upper bound on R^{2}. Similarly, our estimates reflect the specifics of the SNP chip used and how genomic relationships were computed.
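The arithmetic behind the 20% and 16% figures can be written out as a short worked example. Here $b$ denotes the regression of marker-derived relationships on those at causal loci and $h^{2}$ the trait heritability; the symbols and the rounded value $b \approx 0.1$ for unrelated individuals are assumptions made for this illustration.

```latex
\begin{align*}
\text{minimum shrinkage} &= (1-b)^{2} = 1 - 2b + b^{2}
  = 1 - 2(0.1) + (0.1)^{2} = 0.81 \approx 80\% \\
R^{2}_{\max} &\approx \left[\,1 - (1-b)^{2}\,\right] h^{2}
  \approx 0.20\, h^{2} \\
\text{height } (h^{2} = 0.8):\quad
R^{2}_{\max} &\approx 0.20 \times 0.8 = 0.16
\end{align*}
```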

In finite samples, as pointed out in previous studies ^{2} to values smaller than ^{2}. Some proposed formulas for the expected value of R^{2} under perfect LD take the forms ^{2} under perfect LD. However, the derivation of these formulas assumes that genotypes at causal loci are fully orthogonal. We applied these formulas using ^{2} values of 0.47 and 0.37 using the formulas suggested in ^{2} under perfect LD ranged between 0.52–0.54.

GENEVA and the FHS contain samples drawn from relatively homogeneous populations. On the other hand, when allele frequencies vary across subpopulations, so does the relative contribution of each locus affecting the trait to genetic variance in each of the subpopulations. This raises the question of which estimates of allele frequencies one should use when analyzing data involving different subpopulations. In the present study this was not an issue because the correlation of estimates of allele frequencies derived from GEN and FHS was virtually 1 (0.99). However, when this is not the case, if genomic relationships are scaled with estimates of allele frequencies derived from the entire sample, then marker-derived genomic relationships will provide a poorer description of the realized genetic relationships in each of the sub-populations. This may result in a lower estimate of genomic heritability and in a larger R^{2}-shrinkage factor.

Both FHS and GEN, especially the former, show some degree of population stratification, as judged by the inspection of the loadings of the 1^{st} two eigenvectors derived from ^{st} k principal components of ^{2} that does not account for genetic similarity attributable to substructure.

The effectiveness of G-BLUP depends critically on the extent to which marker derived genomic relationships reflect the patterns of realized genetic relationships at causal loci. The size of the coefficient of variation of realized genomic relationships across regions of the genome depends on the number of independently segregating segments among the pair of individuals whose realized genomic relationship we wish to assess. For pairs of unrelated individuals this is largely controlled by the span of LD in the population. For pairs of related individuals this is largely controlled by within family disequilibrium. In animal and plant breeding populations G-BLUP has exhibited very good predictive performance because the two conditions needed for G-BLUP to perform well are generally met: LD span over long regions and data include highly related individuals. Under these conditions variable selection is difficult to perform and may not be needed because the patterns of genetic similarity realized at markers and at causal loci are similar.

However, the analysis of human data from unrelated individuals represents the exact opposite situation. Here LD spans over shorter regions

In conclusion, we have provided an analytical framework to quantify the maximum prediction R^{2} that can be attained using G-BLUP and have compared the properties of G-BLUP in samples of related and unrelated individuals. The analytical expressions derived are consistent with our simulation and empirical results and suggest that the analysis of nominally unrelated individuals presents a number of challenges that standard G-BLUP does not address. These can be partly met by incorporating prior knowledge of the relative importance of SNPs for a given trait. Further research will be required to optimize the modeling of such prior knowledge towards improved trait prediction.

Squared-correlation between genotypes at adjacent markers observed in the FHS (vertical axis) and GEN datasets. The average inter-marker distance in the platform was 7.2 kb. The blue lines give the median squared-correlation in both datasets.

Average squared correlation between genotypes at various lags (number of markers in between the two used to compute the squared correlation) observed in FHS (dots) and GEN (line). The average inter-marker distance in the platform was 7.2 kb.

Heritability estimates by dataset, simulation scenario and genetic information used.

R-squared (R^{2}) between realized and predicted

R-Squared (R^{2}) between realized and

R-squared (R^{2}) between realized and predicted

R-Squared (R^{2}) between realized and

Includes the analytical derivations leading to the R-squared formulas presented in the article, and details of the methods used to estimate variance components and for prediction.

The authors thank the organizers and participants of both the FHS and GEN studies, Dr. David B. Allison for his support in the process of acquiring the data, and dbGaP for providing access to it. Dr. Michael Weedon supplied us with the p-values from the GIANT consortium, and Professor W.G. Hill and Swetlana Miller made valuable comments on an earlier version of this manuscript. During the review process we benefited from valuable contributions made by the editor in charge of this article and by three anonymous reviewers.