Estimation of genomic prediction accuracy from reference populations with varying degrees of relationship

Genomic prediction is emerging in a wide range of fields including animal and plant breeding, risk prediction in human precision medicine and forensic. It is desirable to establish a theoretical framework for genomic prediction accuracy when the reference data consists of information sources with varying degrees of relationship to the target individuals. A reference set can contain both close and distant relatives as well as ‘unrelated’ individuals from the wider population in the genomic prediction. The various sources of information were modeled as different populations with different effective population sizes (Ne). Both the effective number of chromosome segments (Me) and Ne are considered to be a function of the data used for prediction. We validate our theory with analyses of simulated as well as real data, and illustrate that the variation in genomic relationships with the target is a predictor of the information content of the reference set. With a similar amount of data available for each source, we show that close relatives can have a substantially larger effect on genomic prediction accuracy than lesser related individuals. We also illustrate that when prediction relies on closer relatives, there is less improvement in prediction accuracy with an increase in training data or marker panel density. We release software that can estimate the expected prediction accuracy and power when combining different reference sources with various degrees of relationship to the target, which is useful when planning genomic prediction (before or after collecting data) in animal, plant and human genetics.


Introduction
Genomic prediction of (additive) genetic effects and phenotypes is emerging in a wide range of fields including animal and plant breeding, risk prediction in human medicine and forensics [1][2][3][4]. Genomic prediction requires modeling of the association between genome-wide single nucleotide polymorphisms (SNPs) and phenotypes. The success of genomic prediction is measured by its accuracy, i.e. how reliable a future phenotype of target individuals can be predicted.
Genomic prediction requires a reference population of individuals having information on both genotype and phenotype. The accuracy of genomic prediction depends on various parameters, including sample size of the reference and its genetic structure. An important a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 parameter in relation to the latter is the effective size of the population. The effective population size is a predictor of the effective number of chromosome segments that are represented in the population [5][6][7]. Theoretical predictions have usually considered a homogeneous population of individuals that are essentially unrelated. However, in most practical applications, the reference population used for genomic predictions possibly consists of many sub-groups with individuals having a variety of relatedness to the target individual, e.g. direct relatives, more distant relatives, and individuals that are considered unrelated. It is relevant to assess the contribution of these various sources to prediction accuracy before actually conducting an experiment.
A number of studies have shown that genomic predictions are more accurate if the genomic relationship between the proband and the reference population is higher, both in humans [8][9][10][11] and in other species [12][13][14]. Habier et al (2013) [15] distinguished between three types of information in genomic prediction; linkage disequilibrium, additive-genetic relationships and co-segregation of QTL predicted from markers genotypes with a pedigree. They argued that it would be useful to understand how these sources contribute to the accuracy of genomic predictions, especially when designing reference populations for breeding programs. They show these contributions via simulated examples but did not provide methods that allow simple predictions for their contribution to accuracy. Pszczola et al. (2012) [16] showed that the relationship between the reference population and the proband should be maximized to achieve an optimal design using a simulation study. However, they also did not attempt to derive the expected prediction accuracy from an optimal design in advance. Hayes et al. (2009) [17] considered the influence of direct relatives on genomic prediction. They followed the same approach as the general theory, i.e. by considering the number of independently segregating chromosome segments within families. They showed the accuracy of genomic prediction from varying sizes of the first and second degree of relatives, but did not consider the information from combined sources [18]. It should also be noted that those studies that derived genomic prediction accuracy from theory using effective number of chromosome segments (M e ) [5,6,[19][20][21], did not consider the correlation between relatedness at different chromosomes, therefore overestimating M e and underestimating whole-genome prediction accuracy [7].
Wientjes et al (2016) [22] proposed a simple selection index approach to combine information from different populations. They considered a genetic correlation between genetic effects expressed in different populations. We propose to use the same approach to combine different sources of information from different subsets within a population, where the different subsets have a different degree of relationship with the target individual. To predict the accuracy, we derive the number of effective chromosome segments from a hypothetical N e associated with each subset, and we show that that combining such subsets using selection index theory gives the same result as using a prediction from an M e derived from the variation in genomic relationships between the overall reference data and the target. Prediction accuracy is derived from variation in genomic relationship rather than the variation in genomic relationship as a deviation for the expected relationship among members of the reference set, as was proposed by Goddard et al (2011) and also applied by Wientjes et al (2016). This approach leads to a theoretical concept useful for assessing the accuracy of genomic predictions in advance, and we illustrate this with examples based on real data.

Predicting genomic selection accuracy
The accuracy of genomic breeding values (GBV) or (genomic profile score in the context of human risk prediction [23]) based on genome-wide SNP genotypes can be predicted from theory [5][6][7]24], assuming that prediction is based on a reference population with phenotypes and genotypes for the same genome-wide SNPs that are linked to quantitative trait loci (QTL). The accuracy depends on i) the proportion of genetic variance at QTL captured by markers and ii) the accuracy of estimating marker effects. The proportion of genetic variance at QTL captured by markers (b) depends on linkage disequilibrium (LD) between markers and QTL, which in turn depends on the number of markers (M) and the number of 'effective chromosome segments' (M e ) [5], that is Various forms of prediction of M e have been presented [5,6,21] that were however inconsistent to each other, and without considering the correlation between chromosomes. Recently, we presented a prediction formula with the form [7] where N e = effective population size; L = average chromosome length; N chr = number of chromosomes. This formula accounts for mutation, and that without considering mutation should be referred to Eq (10) in Lee et al. (2017) [7]. If N chr = 30 and L = 1, Eq (1) The accuracy of the genomic prediction of a phenotype can be written as [5] where r y;ĝ is the correlation coefficient between the true phenotypes (y) and estimated GBV, h 2 is the heritability of the trait, and N is the number of phenotypic observations. Other measures for genomic prediction accuracy, particularly for human risk prediction, such as the area under the receiver operating characteristic curve (AUC) or odds ratio of case-control status contrasting the higher or lower risk group are described elsewhere [7].

M e and genomic relationship
After collecting genotypic information of the reference data and the target individual, it is possible to obtain an empirical M e from a genomic relationship matrix (GRM). The elements in x Tm x jm =M where x Tm and x jm are the standardised genotype coefficients (mean 0 and variance 1) for the target individual (T) and jth individual in the reference data at the mth locus. It is possible to construct a GRM for each locus, and the elements in the GRM at the mth locus are G Tj(m) = x Tm x jm . Then, the variance of the mean of G Tj(m) across all where N chr is the number of chromosomes. This expression is equivalent to (1) when using Sved (1971) [25] for deriving the expected squared correlation between genotypes, which is dependent on N e . An empirical M e can be derived from the GRM as the variance in relationships between a target individual T and N individuals in the reference population. Goddard et al. (2011) [5] suggested their theoretical derivation had to assume a homogeneous population of individuals that are essentially unrelated. However, Eq (4) show that the assumption about unrelated individuals is not necessary so that any random samples from the population can be used, irrespective of whether they are highly related or not (see Results).

Effective population size in a reference data set
One of critical parameters to determine the accuracy of genomic prediction is the effective population size (N e ). It is not very common to represent a reference population by a single value of N e when it consists of several cohorts of individuals with different relationships to the target individual. Wientjes et al. (2016) [22] used a single value for M e representing a reference set consisting of two populations. Here, we generalized that concept for any number of subsets of the reference population based on the relationship between N e , M e and var(G Tj ), leading to a certain value for M e and N e for a reference population consisting of several cohorts. For any subset of the reference data set, there are realized relationships with the target sample. From Eq (5), a value of M e , which is the inverse of the variance of the genomic relationships between the target and the reference sample, can be assigned to the reference data. Then, a single value of N e , which is a function of M e from Eqs (1) or (2), can be obtained for the reference data. The effective population size of the reference set is therefore a parameter specific to the data used and it describes genomic diversity of the reference individuals used for prediction of genomic breeding value relative to the target individual which breeding value is being predicted. This N e value can be smaller than the effective size of the population from which the reference individuals were sampled, but it can also be larger, depending on whether the reference individuals chosen were closer or more distantly related to the reference set. Based on this concept of N e , reflecting information content of the reference sample in relation to the target sample, we consider three information sources consisting of i) close relatives of the proband, e.g. N e = 10, ii) distant relatives or individuals from the local area of the proband, e.g. N e = 100 and iii) a wider population sample of individuals that are not related to the proband, e.g. N e = 1,000.
The GBV can be estimated based on each of these information sources, and the accuracy of the estimation can be calculated as above, e.g. r g;ĝ ðiÞ from Eq (3) where i represents the ith information source. It is also possible to estimate GBV based on combined data of all three information sources. Assuming a random sample from the same population for each source, the accuracy of the GBV based on the combined data set can then be calculated using standard selection index theory as where g is the vector with covariances between each of the GBV and the true breeding value, and P is the variance-covariance matrix of the set of GBV. The accuracy of the GBV based on the combined data set can also be estimated based on the weighted M e from the three information sources. Assuming a random sample from the same population for each source, the weighted M e can be obtained as where p k is the proportion of the sample size over the total sample for each information source. The accuracy of the GBV based on the weighted M e is identical with that using standard selection index theory above (Eq (6)).
Following Wientjes et al. (2016) [22] we can further generalize for a case where genetic correlations among multiple reference populations and those between reference populations and the target are not one. Eq (6) can be generalized as . . .
where r G k;T is the genetic correlation between the k th reference population and the target set, and similarly, r G i;j is the genetic correlation between the i th and j th reference population (i = j = 1~k).
Eq (8) can be used when multiple reference populations have both quantitative traits and diseases. If the k th reference population is measured for binary response (disease or disorder), the reliability term, r 2 g;ĝ , in Eq (8) can be replaced with where u is a genetic profile score on the 0,1 disease scale [26,27], K is the population prevalence for the disease, P is the proportion of cases in the total sample N of cases and controls, and z is the density at the threshold on the normal distribution in the liability threshold model. When the target data set is measured for a binary response (e.g. diseases), the AUC or odds ratio of case-control status contrasting the higher or lower risk group can be also estimated [7]. These measures for genomic prediction accuracy with multiple heterogeneous reference populations can be obtained using MTG2, publicly available software (https://sites.google.com/site/ honglee0707/mtg2).

Power of genomic prediction
The power depends on the sample size in the target data set and the overall reliability of the genomic prediction (Eq (8)).
where NCP is the non-centrality parameters, N T is the sample size in the target data and r 2 y;ĝ is the coefficient of determination of the predictor. Then, the power can be written as where F is the cumulative standard normal function and α is the significance level. When using the odds ratio of case-control status contrasting the higher or lower risk group, the noncentrality parameters can be derived as [28] NCP ¼ lnðORÞ where N top and N bottom is the number of individuals in the top and bottom percentile, and P top and P bottom is the proportion of cases in each group. The power can be estimated as above.

Simulation
In a simulation, a stochastic gene-dropping method [29,30] was used to simulate 4,000 SNPs for each of 30 chromosomes, each of length L = 1 Morgan with N e = 50, 500 and 1000 for 50, 500 and 1000 generations, respectively. Recombination and mutations were modelled according to the genetic distance between SNPs and the mutation rate of 1e-08 per site per generation [31]. In the final generation, we constructed a genomic relationship matrix for a random set of 3000 individuals. Among the 3000 individuals, we randomly selected 1000 individuals as target data and 2000 individual as reference data, and estimated variance of the genomic relationships between the target and reference data to validate Eqs (1), (2), (4) and (5).

Evaluation of the formulas
For each of the three information sources contributing to genomic prediction we varied values for N e , sample size in reference data and marker density. We compared the expected accuracy of GBV from the sample of N e = 1000 with predictions that additionally included information from the sample of N e = 100 and N e = 10. The total number in the reference population was kept equal between the comparisons.

Real data analysis
We used publicly available data from the Framingham heart study (phs000007.v26.p10.c1) [32]. There were 6950 individuals genotyped for 500,568 SNPs. Stringent quality control for genotype data and phenotype adjustment for confounders were applied to the data (the details can be found in Lee et al. (2016) [7]). The quality control included SNP call rate > 0.95, individual call rate > 0.95, HWE p-value > 0.0001, MAF > 0.01 and individual population outliers < 6 SD from the first and second principal components (PC). After QC, 6920 individuals and 389,265 SNPs remained. Among them, 4243 individuals were phenotyped for height and body mass index (BMI). We made three different information sources to form the reference data that were tested in 100 replicated analyses (Table 1). Initially, we randomly selected 800 individuals out of 4243 phenotyped individuals as a target data set. For reference data set #1, we selected 50% of individuals that were highly related (> relatedness of 0.3) to the 800 target individuals (N 1 = 617 ± 19). For reference data set #2, we selected 80% of moderately related individuals (> relatedness of 0.1) of the 800 target individuals (N 2 = 1254 ± 30). For reference data set #3, we took the rest of the individuals that were not selected for reference data set #1 and #2 (N 3 = 1572 ± 33). There was no overlap sample between target data set and reference data sets #1, #2 and #3.
Using the real genotype data, the genomic relationships between the reference and target sample were constructed. Empirical M e was estimated from Eq (5) for reference #1, 2 and 3, and that for combined data. We took a median rather than mean because the distribution of variance of the genomic relationship between target and reference sample was skewed. The correlation between the true phenotypes (that were not used in the analyses) and estimated GBV in the target data set was estimated for the combined data set, which was used as the genomic prediction accuracy (r y;ĝ ). Phenotypes were adjusted for birth year, sex, and the first 10 PCs were used to control non-genetic confounding effects, e.g. population stratification.

Results
In the simulation study, as shown in Fig 1A, 1B and 1C, the expected (from Eq (1)) and empirically observed M e from the simulated genotyped data (using Eq (5)) are in good agreement, however, they are considerably lower than the expectation from the previous formulas [5,6,21], which confirms the result from Lee et al. (2017) [7]. It is noted that Eq (5) is still valid in the subset with a smaller N e = 50 that has a significant proportion of high related individuals, indicating that the assumption about unrelated individuals (made in Godard et al. (2011) [5]) can be relaxed. It is shown that whether using a high or low effective population size, the mean and variance of genomic relationships is generally agreed with the expectation from the theory (S1 Fig and S2 Fig).
In the evaluation of the formulas, we first tested how the prediction accuracy was changed with varying marker density, using formula (1) and (3) and b = M / (M e + M) (Fig 2). For N e = 10,000, the accuracy gradually increased with marker density, but the slope became flat when using the number of SNPs exceeded 100,000 (Fig 2A). For N e = 1,000, the accuracy did not increase with marker density as long as the number of SNPs was higher than 50,000 (Fig 2B).
For N e = 100, there was no improvement of the accuracy if the number of SNPs was more than 10,000 ( Fig 2C). This would be expected because the proportion of genetic variance at QTL captured by markers (b = M / (M e + M) approached one when the number of SNPs (M) was more than 100,000, 50,000 and 10,000 for N e = 10,000, 1000 and 100, respectively (Fig 3), as M e was equal to 21,248, 2,313 and 254 for these three cases.
Next, we quantified the contribution of each information source when varying sample size in the reference data using formula (1), (3) and (5) (Fig 4). It was assumed that the number of SNPs was sufficient to capture most of causal variants (e.g. > 50,000). When adding 100 individuals of N e = 100 or N e = 10 to the reference sample with N e = 1000, the accuracy was slightly or substantially improved (Fig 4A). The improvement was larger when adding more individuals (500) (Fig 4B). Results showed that an information source of a smaller N e was more important when the samples sizes of each information source were the same. When the total number of reference data was increased, the importance of adding an information source of a smaller N e was relatively decreased (Fig 4). When heritability was higher, overall accuracy was increased, and the relative contribution from an information source of a smaller N e , i.e. the close relatives, was reduced ( Fig 5). Fig 6 confirms again that the smaller N e , the better the prediction accuracy when using each information source separately. However, the sample sizes can be also varied across the information sources, as there are generally a lot fewer close relatives than individuals from the wider population. In Fig 6A, the accuracy at a sample size of 100 for N e = 10 was 0.73, which was lower than that of a sample size of 1,000 for N e = 100 (0.81) or that of a sample size of 20,000 for N e = 1,000 (0.83). With a higher heritability, the result is similar in that the 20,000 records in the information source of N e = 1000 gave a better accuracy than the 100 records of close relatives (N e = 10).
In real situations, the most common and desirable design may combine all of the information sources to maximize the prediction accuracy. We plotted the accuracy using a composite design consisting of N e = 1000 + N e = 100 (N = 500) + N e = 10 (N = 50), compared to that using N e = 1000 (Fig 7). The accuracy for a composite design was substantially increased especially when the total number of reference sample is low (Fig 7). Fig 8 illustrates the real data analyses. The median of empirically estimated M e from the inverse of the variance of the genomic relationship (Eq 5) over 100 replicates was 2254 (SD = 50), 3989 (SD = 104) and 28848 (SD = 920) for reference #1, #2 and #3, respectively (Table 1). Empirically estimated M e based on the combined data was 4836 (SD = 106) while expected M e was 5309 (SD = 88), approximately confirming Eq (7). The (small) difference between empirical observation and expectation was probably due to skewed distribution of the variance of the genomic relationships.
Given M e , N and h 2 , the expected accuracy agreed well with the observed accuracy when using Framingham data (Fig 9). The reported heritabilities, h 2 = 0.8 [33][34][35] for height and h 2 = 0.46 [36,37] for BMI, were used. We also quantified the importance of marker density using the real data. In agreement with Fig 2, the prediction accuracy is not much decreased even with 50,000 SNPs that were randomly selected from 389,265 SNPs (Fig 10).

Discussion
This work shows a simple approach for modeling genomic prediction in a reference data set that contained several subpopulations that differ in relatedness to the target set, and by modeling these subpopulations as having different effective population size. The model allows assessing the prediction accuracy before actually conducting an experiment so that designing genomic prediction can be precise and effective in animal, plant and human genetics. For example, it can address a question how much the prediction accuracy can be increased by adding 10,000 (conventionally) unrelated individuals into the current experiment consisting of 100 relatives in the reference data. The value for N e in Eq (1) can be approximated based on prior knowledge of a population, and the relatedness of the sample with the target, possibly supported by some genotype information that maybe available on cohorts, or samples thereof. As a reference table, we added approximated value of N e for structured population of full and Estimation of genomic prediction accuracy half sibs (S1 Table and S2 Table). Prediction in advance indeed relies on arbitrary modelling a number of cohorts, but it would be a useful exercise, as illustrated in the results when considering marker density and various sizes of the subsets of the training data. The theory is also useful for an animal breeder to predict the value of genotyped animals in an own herd versus those in a wider references population consisting of a larger number of more distantly related individuals.
The genotypic and phenotypic information of close and distant relatives of the proband can be effectively used as a part of the unified reference panel that also include a large number of individuals that are not related to the predicted subject to improve the accuracy further as illustrated in Fig 7. For a random sample from the same homogenous population, e.g. within the same breed or ethnicity, an optimal design should consist of both close and distant relatives and unrelated individuals, e.g. a composite design, to maximise the prediction accuracy ( Fig  7). That is, the composite design takes advantage of effective information from smaller number of relatives while it also use information from a greater number of unrelated individual.
We showed that the prediction accuracy derived for a population with unrelated individuals turns out to be higher, compared to previous quantifications that overestimated M e for a larger number of chromosomes [5,6,21,38]. Using the same theory, we also showed that the information from close relatives could increase the accuracy even further, especially for smaller reference populations (Figs 3-6). It is important to note that the assumption about using unrelated individuals in estimating empirical M e from genomic relationship [5] is not strictly Estimation of genomic prediction accuracy necessary and can be relaxed (Eqs (4) and (5)). The theory and empirical observation from simulation study agreed well (Fig 1) even when using a population with a smaller effective population size (N e = 50) that consisted of a significant proportion of high relatedness.
Previous studies related to genomic prediction accuracy have suggested that M e can be derived from the variation in the differences between realized and expected relationships [6,22], i.e. D = G-A where G is a genomic relationship matrix and A is a numerator relationship matrix based on pedigree. Those studies validated their results also with simulation. If the individuals used in the training set have a low expected relationship to the target individuals, then there is not much difference between the variations in D versus G. However, when some closer relatives are used, var(G) is larger than var(D) and M e is therefore smaller. Note that non-random sampling of individuals used for the training set can cause a difference between the N e of the population that was simulated, and the N e of the data set that was used for prediction.
We have not tested the theory for multi-breed reference populations, i.e. those that are heterogeneous in the sense of consisting of populations from different genetic background, i.e. different breeds or ethnicities, each with different minor allele frequencies, different LD structure and different effects for causal variants. Wientjes et al. (2016) [22] explicitly addressed the problem of different effects for causal variants (i.e. genetic correlation less than one) when combining data from two populations. Individuals from different populations share genomic relationships that are lower than those among members within each population. Evidence in literature suggests low prediction accuracies when using information from different breeds or Estimation of genomic prediction accuracy populations, which could be viewed as predicting from populations with very large N e . Moreover, we have not considered historical population dynamics such as bottleneck and admixture, but assumed a constant N e over the historical generations, which leads to simplifications that make the formulae tractable and easy to derive. Further work is required to extend the theory accounting for admixture populations and historical population dynamics.
We have shown an improved theory for the prediction of the effective number of chromosome segments, which is a key parameter in genomic prediction accuracy [7]. The theory accounts for the correlation between relationships at different chromosomes and as a result the effective number of chromosome segments is smaller than predicted from previous theory [5,6,21]. As a result, the increase of the genomic prediction accuracy appears to be less reliant on higher marker density unless N e is very large (e.g. > 10,000) (Fig 2), compared to what have been quantified by previous theory [5,6,21]. The previous theory overestimates M e (mostly due to neglecting correlation between chromosomes), therefore underestimates the proportion of genetic variance at QTL captured by markers. Little improvement of prediction accuracy with increasing SNP marker density has been empirically observed in a number of studies [39][40][41]. This may also have important implication in genomic prediction as to designing marker density in animal, plant and human genetics.
The ability to quantify the accuracy in relation to various degrees of relationships (e.g. close relatives, distant relatives, local or extensive population sample) is important for predicting outcomes of genomic prediction for specific designs. This study has addressed this question, and the theory has been implemented in MTG2 software (https://sites.google.com/site/ honglee0707/mtg2). Therefore, a user can know the expected prediction accuracy and the power [26] before designing an experiment of genomic prediction. Our approach can be applied both before and after collecting the data.
Supporting information S1 Fig. The distribution of off-diagonal of the genomic relationships matrix among 1000 individuals when effective population size is 50. Although a substantial proportion includes close relationships with an effective population size of 50, the mean and variance of the genomic relationships is -0.001and 0.0072, respectively, which agrees with the expected value from the theory (0 and 0.0077). This indicates that Eq (5) is valid with any random sample even having close relationship. With an effective population size of 1000 (having hardly close relationships as expected), the mean and variance of the genomic relationships is -0.001 and 0.0004, respectively, which agrees with the expected value from the theory (0 and 0.0004). It is noted that whether using a high or low effective population size, the mean and variance of genomic relationships is generally matched with the expectation, supporting that any random samples from the population can used for Eq (5). (DOCX) S1 Table. Approximated value for effective population size (N e ) for structured population of full sib families (kinship coefficient of 0.5). Given the family structure, we estimated variance of relationships, which can be used to obtain M e from Eq (5). Assuming that the number of chromosomes is 30 chromosomes and the length of each chromosome is 1 Morgan (total 30 Morgan), the effective population size was estimated from Eq (2). (DOCX) S2 Table. Approximated value for effective population size (N e ) for structured population of half sib families (kinship coefficient of 0.25). Given the family structure, we estimated variance of relationships, which can be used to obtain M e from Eq (5). Assuming that the number of chromosomes is 30 chromosomes and the length of each chromosome is 1 Morgan (total 30 Morgan), the effective population size was estimated from Eq (2). (DOCX)