Identification of superior parental lines for biparental crossing via genomic prediction

A parental selection approach based on genomic prediction has been developed to help plant breeders identify a set of superior parental lines from a candidate population before conducting field trials. A classical parental selection approach based on genomic prediction usually involves truncation selection, i.e., selecting the top fraction of accessions on the basis of their genomic estimated breeding values (GEBVs). However, truncation selection inevitably results in the loss of genomic diversity during the breeding process. To preserve genomic diversity, the selection of closely related accessions should be avoided during parental selection. We thus propose a new index to quantify the genomic diversity for a set of candidate accessions, and analyze two real rice (Oryza sativa L.) genome datasets to compare several selection strategies. Our results showed that the pure truncation selection strategy produced the best starting breeding value but the least genomic diversity in the base population, leading to less genetic gain. On the other hand, strategies that considered only genomic diversity resulted in greater genomic diversity but less favorable starting breeding values, leading to more genetic gain but unsatisfactorily performing recombination inbred lines (RILs) in progeny populations. Among all strategies investigated in this study, compromised strategies, which considered both GEBVs and genomic diversity, produced the best or second-best performing RILs mainly because these strategies balance the starting breeding value with the maintenance of genomic diversity.


Introduction
Biparental crossing is a common scheme used for pure-line breeding in self-pollinated crops such as rice (Oryza sativa L.), wheat (Triticum aestivum L.), soybean (Glycine max [L.] Merr.), and oat (Avena sativa L.). Plant breeders cross two inbred parental lines to produce the F 1 population. Then, a subset of diverse individuals from the F 2 population is selected to produce potential recombination inbred lines (RILs) after several generations of selfing. Parental lines play a fundamental role in line development, and significantly affect the performance of the resulting RILs. However, the identification of superior parental lines from germplasm collections for maximizing selection response in subsequent cycles remains challenging for plant breeders [1,2]. Another practical concern is that the number of possible crosses in such a breeding program is often far greater than what can be handled in the field. Therefore, developing a method that can identify a limited number of superior parents before field trials would be of great help to plant breeders. Genomic selection, based on the statistical method of genomic prediction (GP), has been used to improve breeding efficiency in dairy cattle [3] and a variety of crops [4][5][6][7][8]. The main concept of GP is to capture all the effects of quantitative trait loci (QTLs) using high-density DNA markers over the whole genome [9]. The most commonly used DNA markers are single nucleotide polymorphisms (SNPs). A GP model is first built using the phenotypic and genotypic data of a training population. Then, genomic estimated breeding values (GEBVs) for the candidate individuals with known genotypic data are predicted through the resulting GP model. Two kinds of mixed linear model methods are widely employed to obtain GEBVs: (i) best linear unbiased prediction (BLUP) based on markers, and (ii) BLUP based on a genomic relationship matrix. To perform marker-based BLUP, the marker effects are treated as random effects, and GEBVs of individuals are calculated by multiplying their marker scores by these BLUP estimates; the ridge regression BLUP (rr-BLUP) method [9,10] follows this approach. To perform genomic relationship matrix-based BLUP, the genotypic values of individuals are treated as random effects and estimated through a genomic relationship matrix; this approach is used in the genomic BLUP (GBLUP) model [11,12].
Gaynor et al. [13] proposed a two-part strategy for implementing genomic selection for line development, which addresses two components: (i) a product development component, to identify inbred lines either for hybrid parent development or cultivar release, and (ii) a population improvement component, to increase the frequency of favorable alleles through rapid recurrent genomic selection. Gaynor et al. [13] conducted a stochastic simulation and showed that programs using the two-part strategy generated up to 2.5-and 1.5-fold more genetic gain than conventional programs and the best performing standard genomic selection strategy, respectively. Additionally, Yao et al. [14] combined GP with Monte Carlo simulation to select superior parents in wheat breeding programs before field trials. The authors used the criterion of usefulness function on a selection index, which incorporates yield and two quality traits, to evaluate a cross, and concluded that the use of the usefulness function for parental selection resulted in higher genetic gain than the use of mid-parent GEBV, implying that the strategy for parental selection cannot only consider GEBVs of the candidate accessions.
By selecting parental lines with the highest GEBVs, breeders hope to maximally pass the favorable traits of parental lines on to their progeny. However, the truncation selection approach risks the elimination of several favorable QTLs from the breeding population because of a lack of genomic diversity [15]. Therefore, in this study, we took both GEBV and genomic diversity into account for identifying superior parents in a biparental crossing program. We constructed a GBLUP model for a specific target trait to predict the GEBVs of the candidate accessions. We first proposed a new index to quantify the genomic diversity of a set of candidate accessions. Subsequently, we simulated the genotypic data for progeny populations derived from a cross over successive generations, and predicted the GEBVs of the simulated progeny populations through the trained GBLUP model. Then, we made generation advancement decisions according to the resulting GEBVs. Finally, we assessed a set of parental lines based on F 10 RILs. We compared the performance of several selection strategies via analysis of two real rice genome datasets.

Rice genome datasets
Dataset I. The rice genome dataset originally collected for genome-wide association study (GWAS) in Zhao et al. [16] was used to illustrate the proposed procedure. This dataset contains 44,100 SNP variants and 36 traits of 413 O. sativa accessions, which comprises five subpopulations and one admixed group. SNPs with missing rate > 0.05 and minor allele frequency < 0.05 were removed from the dataset. To reduce redundant collinearity in the genomic relationship matrix, one SNP was randomly selected from each bin of 20,000 bp over each chromosome. A scatter plot based on the first two principal components (PCs) using the retained 11,047 SNPs is displayed in Fig 1, which is almost the same as that in the corresponding plot generated using all 44,100 SNPs [16]. The SNP genotype at each locus was coded as -1, 0, or 1, where 1 indicates homozygous genotype of the major allele; -1 indicates homozygous genotype of the minor allele; and 0 indicates heterogenous genotype. After SNP coding, any missing locus was imputed as 1. Six traits were analyzed: brown rice seed width (BRSW), florets per panicle (FPP), flowering time at Arkansas (FTAA), flowering time at Faridpur (FTAF), plant height (PH), and panicle number per plant (PNPP).
Dataset II. The rice genome dataset, which was collected for genomic selection study [8], SNPs were used in this study. One SNP marker was selected per 0.1-cM interval on each chromosome because the chosen subset of the full marker set has been shown to be efficient enough for genomic selection in this collection of rice germplasm [8].

Monte Carlo simulation of the genotypic data of progeny populations
To simulate the genotypic data of progeny populations, the Gramene Annotated Nipponbare Sequence [17] was used to estimate recombination rates between two adjacent SNPs. The Gramene Annotated Nipponbare Sequence database contains both physical and linkage distances between SNPs, which can be downloaded from http://archive.gramene.org. The genetic positions of SNPs were estimated via linear interpolation between the two markers flanking each SNP. Once the genetic positions were obtained, the recombination rates between adjacent SNPs were estimated using Haldane's mapping function [18]: where r AB is the recombination rate between markers A and B; X AB is the linkage distance between markers A and B; and e is Euler's number, a mathematical constant approximately equal to 2.71828. Based on a series of Bernoulli distributions and the estimated recombination rates, the crossover of each chromosome was simulated to yield the sequence of a gamete. Then, two gametes were paired to produce the genotypic data for the progeny.

GBLUP model
The following GBLUP model was considered for GP: where y denotes the vector of phenotypic values of a training population with n individuals; μ is a constant term; 1 n is the vector of order n with all elements equal to 1; g represents the vector of genotypic values; and e is the vector of random errors. It is assumed that g follows a multivariate normal distribution MVNð0; s 2 g KÞ, where 0 is a zero vector; s 2 g is the genetic variance of additive effects; and K is a genomic relationship matrix among the individuals. Furthermore, e follows MVNð0; s 2 e I n Þ, where s 2 e is the random error variance, and I n denotes the identity matrix of order n. Here, g and e are assumed to be mutually independent. In this study, the genomic relationship matrix K = MM T /p was considered, where M is the marker score matrix, and p is the number of SNP markers.
The model parameters of the GBLUP model can be estimated through the Henderson's equation [19], as follows: where the regularization parameter λ is given by The mmer () function in the R package sommer [20] was used to obtain the restricted maximum likelihood estimates (REMLs) for the two variance components of s 2 g and s 2 e , and the resulting estimates were entered into Eq (2) to obtainm andĝ .
Ifĝ bp is considered as the vector of estimated genotypic values for a breeding population, and K bp is considered as the genomic relationship matrix between the breeding and training populations, the following equation is obtained: The GEBV for the breeding population isĝ bp plus the estimate of the constant termm.

Index for quantifying genomic diversity
Let g 0 be the vector of genotypic values, and K 0 be the genomic relationship matrix for a particular set of accessions with size n 0 . According to the GBLUP model in Eq (1), the covariance matrix for g 0 is given by: The determinant of the covariance matrix represents the overall variability of the genotypic values, which is calculated as: Clearly, the determinant of Eq (3) is proportional to the D-score defined below: For a fixed number of n 0 , a subset of accessions chosen from a breeding population with the maximal D-score will have greater genomic diversity than the competing choices with size n 0 . The concept of the D-score is adopted from optimum experimental designs [21].
A simple example is given below to illustrate the D-score. Suppose that there are three accessions (n = 3) in the candidate set with the genomic relationship matrix: For n 0 = 2, the D-score for g 1 and g 2 is calculated as jK 0 j ¼ 1 0:9 the D-scores for g 1 and g 3 and for g 2 and g 3 are given as 0.75 and 0.99, respectively. Clearly, the two accessions with g 2 and g 3 genotypic values have greater genomic variation (smaller genomic correlation) than the other competing choices. Closely related individuals could be excluded from the maximal D-score set. The genetic algorithm presented in Ou and Liao [22] was used to search a subset of accessions from a candidate population, such that it can attain the maximal D-score of Eq (4).

Procedure for selecting parental lines
To evaluate a variety of strategies for determining parental lines, the following steps were carried out: 1. For a specific target trait, all phenotypic values available from each rice genome dataset were used to build the corresponding GBLUP model of Eq (1). 3. For each subset of 10 accessions determined by the seven strategies, any two parental lines were crossed to produce 45 F 1 hybrids. Here, we started to simulate the genotypic data for successive generations of progeny populations through the Monte Carlo simulation. Each of the 45 F 1 hybrids produced 60 individuals of the F 2 population by self-pollination, resulting in 2,700 F 2 individuals. After obtaining the GEBVs for the 2,700 F 2 individuals via the trained GBLUP model, the top 45 F 2 individuals were retained. Again, these 45 F 2 individuals were used to produce 2,700 F 3 individuals (60 F 3 individuals per F 2 individual), and the top 45 F 3 individuals were retained. The same procedure was repeated to produce 2,700 F 10 individuals, which were assumed to be a fixed population.

The
4. For the resulting 2,700 F 10 individuals generated according to each strategy, we found the best F 10 RILs with the highest GEBVs.
A flowchart of this procedure is displayed in Fig 2. This procedure was repeated 30 times to obtain the best F 10 RILs from each repetition for each strategy. The average of the GEBVs for the best F 10 RILs was then calculated and used as the measure of efficiency for the strategy. Then, pairwise comparisons were performed among the GEBV averages, based on the least significant difference (LSD) test. Note that for BRSW, FPP, and PNPP in Dataset I and for YLD in Dataset II, larger GEBVs are preferable (i.e., for these traits, the larger the GEBV, the better). The remaining five traits (FTAA, FTAF, and PH in Dataset I, and FT and PH in Dataset II) follow the rule: that the smaller the GEBV, the better.

Calculation of genetic gain
To understand the genetic improvement in a target trait using different strategies, the genetic gain was estimated as: where GEBV F 10 denotes the average GEBV of the resulting 2,700 F 10 RILs; and GEBV P denotes the average GEBV of 10 parental lines selected using each strategy [23]. The larger the absolute value of the genetic gain, the greater the improvement in the target trait. The average of genetic gains from 30 repetitions was reported for each strategy, and multiple comparisons among the genetic gain averages were performed using the LSD test.

Comparison of strategies based on the best F 10 RILs
The GEBV averages of the best F 10 RILs and results of the LSD test are displayed in Tables 1  and 2 for Datasets I and II, respectively. The strategies that considered both GEBV and genomic diversity (GEBV-GD-30, -50, -100) generally showed satisfactory efficiency because they achieved the best or second-best performance for all traits. Therefore, these types of strategies could be used as a reliable means for selecting parental lines. On the other hand, strategies accounting for only genomic diversity (GD-O-30, -50, -100) did not show satisfactory efficiency for all traits, except GD-O-100, which was satisfactory for YLD in Dataset II. The GEBV-O strategy showed the best or second-best performance for FPP and PH in Dataset I and for PH and FT in Dataset II, but it also showed the worst or second-worst performance for the remaining four traits in Dataset I and for YLD in Dataset II. These data indicate that GEBV-O is a high-risk strategy. In general, the results of the LSD test showed significant differences in GEBV averages between the best/second-best and worst/second-worst performances for all traits in both datasets. We also displayed the average GEBV ± standard deviation (SD) of the best RILs selected by 30 repetitions over consecutive generations in Figs 3, 4 and 5. Four strategies including GEBV-O, GEBV-GD-30, -50, and -100 selected the same best individual from the 30 repetitions at the parental generation and also at the F 1 generation; therefore, no SD is shown with the corresponding GEBV averages. The GEBV averages of the best parental lines selected by the strategies can be ranked as GEBV-O = GEBV-GD-30 = GEBV-GD-50 = GEBV-GD- The desirability at the parental generation decreased as the degree of diversity increased for the three strategies, considering the genomic diversity only. Additionally, the desirability declined from the parental generation to F 1 generation for every strategy because of the heterozygous alleles in F 1 hybrids.
To explore the extent to which the top two accessions contributed to the subset of 10 parental lines determined by four strategies (GEBV-O, GEBV-GD-30, -50, and -100), we compared each subset with a reduced group consisting of F 1 hybrids, whose parental lines contained at least one of the top two accessions for each subset. Each reduced group consisted of 17 F 1

PLOS ONE
Identification of superior parental lines via genomic prediction

PLOS ONE
Identification of superior parental lines via genomic prediction

PLOS ONE
Identification of superior parental lines via genomic prediction hybrids. Similarly, we followed the analysis procedure to obtain the GEBV averages of the best F 10 RILs from 30 repetitions based on the reduced group. The results are displayed in Table 3, with the corresponding GEBV averages based on the group of the original 45 F 1 hybrids. The results showed no practical significance between these two groups for all the traits using the four strategies (Table 3). Therefore, the reduced group can be an alternative to the full group.

Genetic gains with different strategies
The average genetic gains and results of the LSD test are displayed in Tables 4 and 5 for Datasets I and II, respectively. It is also reasonable to evaluate the performance of the strategies according to the endpoint of GEBV F 10

Discussion
Dataset II was specifically collected for genomic selection. All the available accessions in this dataset are indica or indica-admixed. The results of performance based on the best F 10 RILs (Table 2) revealed that all seven strategies showed similar performance for the three target traits. The resulting GEBV averages of the best F 10 RILs ranged from 6472 to 6546 kg/ha for YLD, from 85.889 to 91.852 cm for PH, and from 77.725 to 78.410 days for FT. This could be because the candidate accessions in Dataset II are elite breeding lines, with limited genetic diversity and similar phenotypic values for the target traits. However, the results of the LSD test revealed that the two strategies (GD-O-100 and GEBV-GD-100) with greater genomic diversity for YLD led to significantly larger YLD than the other five strategies. Four strategies  including GEBV-O, GEBV-GD-30, -50, and -100 performed equally well for PH but performed significantly better than GD-O-30, -50, and -100. It is well known that Dataset I contains more genomic diversity than Dataset II since it consists of five subpopulations and one admixed group. The higher genomic diversity of Dataset I could result in a bigger difference between GEBV-GD-30/50/100 strategies and the GEBV-O strategy for some traits. For example, the difference in the GEBV averages among the best F 10 RILs between GEBV-GD-50 and GEBV-O was approximately -9.06 days for FTAA and -2.55 days for FTAF in Dataset I (Table 1), but the corresponding difference was only -0.09 days for FT in Dataset II (Table 2). However, the flowering time is very sensitive to environmental  conditions, implying that genomic diversity cannot solely amount to the differences in results between these two datasets. More interestingly, the higher genomic diversity of Dataset I could lead to a larger genetic gain for a specific trait. The average genetic gain using the seven strategies for PH in Dataset I was -42.15 cm (Table 4); however, the corresponding mean in Dataset II was only -13.79 cm ( Table 5). The average GEBV of the best F 10 RILs for YLD was the highest using the GD-O-100 strategy on Dataset II (Table 2). However, the corresponding GEBV averages for two yield components, FPP and PNPP, were the lowest in Dataset I (Table 1). This is possible because the analysis results were based on two different collections of rice lines. There is little diversity among the RILs in Dataset II; therefore, the difference in the average GEBV for YLD among the strategies seems to be negligible. Note that the LSD test revealed only two significance groups in YLD. Nonetheless, the results of FPP and PNPP analysis using the GD-O-100 strategy in Dataset I appear to be reasonable.
Apparently, the number of accessions fixed in the proposed strategies seemed to be arbitrary, similar to the selection of 10 parental lines, retaining the top 2 accessions, and searching 10 or another 8 accessions from the three candidate sets composed of the top 30, 50, and 100 accessions, respectively. A user certainly can adjust these numbers in the strategies according to their own study. Additionally, historical phenotypic data were required to build the GP model. If the historical phenotypic data are not available, then a pilot experiment is needed to phenotype a set of accessions, which can be determined using an optimization algorithm [22]. Two R functions used to perform the proposed procedure for selecting parental lines are provided in S1 File.
We addressed the issue that incorporating genomic diversity into the conventional truncation selection could improve the likelihood of identifying superior parental lines. More importantly, we showed that combining GP with Monte Carlo simulation could help breeders to discover superior parental lines before conducting field experiments. It is well known that phenotype is affected by the genotype (G), environment (E), and G × E interaction. In reality, environment can have a significant impact on the performance of progeny populations during the growth period of each generation until reaching the F 10 generation. Thus, parental lines selected from our simulation study may not perform as expected. Therefore, conducting field experiments to validate our study would be worthwhile. As mentioned earlier, the works of Gaynor et al. [13] and Yao et al. [14] support that the strategy for selecting better parental lines through GP with Monte Carlo simulation should prove useful in plant breeding. The study of Vanavermaete et al. [15] supports our theory that considering both GEBV and genomic diversity in parental selection is a promising strategy.
In this study, we focused on single-trait selection; therefore, the proposed approach could select different parental lines for different target traits. In practice, it is desirable to extend the approach to multiple-trait selection. A straightforward extension is to apply a selection index incorporating multiple traits, and then treat the selection index as a new target trait for the current single-trait approach. Another possible modification is to directly implement multipletrait GP models, and then use an appropriate selection index for evaluating candidate accessions. Jia and Jannink [24], Hayashi and Iwata [25], and Guo et al. [26] have shown that multiple-trait GP models provide better prediction accuracy than single-trait GP models for a lowheritability trait, which shows strong correlation with a high-heritability trait. We will present results of the multiple-trait approach in a future communication.

Conclusion
Combining GP with Monte Carlo simulation can serve as a practical means of detecting superior parents for crop pre-breeding programs. Different strategies can be implemented to identify a set of superior parental lines from a candidate population. The strategy that considers only GEBV will have a higher starting average GEBV among selected parental lines, but it may lead to a less genetic gain. On the other hand, strategies that consider genomic diversity only can retain greater genomic diversity but cannot simultaneously have a favorable starting GEBV average, and therefore may not produce RILs with satisfactory performance. Strategies that consider both GEBV and genomic diversity balance the starting GEBV average and genomic diversity among parental lines; these strategies show satisfactory genetic gain and produce high-performing RILs.