Seed Quality Traits Can Be Predicted with High Accuracy in Brassica napus Using Genomic Data

Improving seed oil yield and quality are central targets in rapeseed (Brassica napus) breeding. The primary goal of our study was to examine and compare the potential and the limits of marker-assisted selection and genome-wide prediction of six important seed quality traits of B. napus. Our study is based on a bi-parental population comprising 202 doubled haploid lines and a diverse validation set including 117 B. napus inbred lines derived from interspecific crosses between B. rapa and B. carinata. We used phenotypic data for seed oil, protein, erucic acid, linolenic acid, stearic acid, and glucosinolate content. All lines were genotyped with a 60k SNP array. We performed five-fold cross-validations in combination with linkage mapping and four genome-wide prediction approaches in the bi-parental population. Quantitative trait loci (QTL) with large effects were detected for erucic acid, stearic acid, and glucosinolate content, blazing the trail for marker-assisted selection. Despite substantial differences in the complexity of the genetic architecture of the six traits, genome-wide prediction models had only minor impacts on the prediction accuracies. We evaluated the effects of training population size, marker density and phenotyping intensity on the prediction accuracy. The prediction accuracy in the independent and genetically very distinct validation set still amounted to 0.14 for protein content and 0.17 for oil content reflecting the utility of the developed calibration models even in very diverse backgrounds.

Genomic-assisted crop improvement can either be based on marker-assisted selection [12] or genome-wide predictions [12,13].In marker-assisted selection, the performance of individuals is predicted using a few diagnostic markers associated with the traits under consideration [14].In contrast, genome-wide prediction exploits many markers without performing markerspecific significance tests [15].The accuracy of marker-assisted selection and genome-wide predictions depends on the genetic architecture underlying the traits under consideration.Marker-assisted selection is most effective if the trait is controlled by a few genes with large effects.If the genetic architecture is complex, quantitative trait loci (QTL) detection is not reliable and genome-wide prediction is more powerful [16].
The presence of QTL underlying quality traits in rapeseed has been investigated in linkage and linkage disequilibrium mapping studies [1,3,[9][10][11][17][18][19][20][21][22][23][24][25][26][27][28].Accumulated information of the QTL accounting for seed quality traits such as seed fatty acid has also been identified in other Brassica species, such as B. oleracea and B. juncea [29,30], which could provide reference for the comparison between species.However, linkage and linkage disequilibrium mapping, are often afflicted by upwards biased estimates in terms of the proportion of genotypic variance explained by QTL.Therefore, cross-or independent validations have been suggested to obtain unbiased estimates of QTL effects but have been applied only in a limited number of studies in rapeseed [31,32].
This study is based on a published dataset from the bi-parental TN DH population comprising 202 DH lines, which has been intensively used to study the genetic architecture of important agronomic traits [9-11, 22, 23] and were genotyped with an Infinium 60K-SNP array [52] being extensively used in Brassica [24,53,54].The two parents of the TN DH mapping population originated from the European and Chinese genepools and have been used widely for rapeseed breeding programs in both target regions.Our objectives were to (i) test for the presence of QTL exhibiting reliable and large effects using five-fold cross-validations, (ii) investigate the effect of the genetic architecture on the superiority of different genomewide prediction models, (iii) examine the potential to improve the prediction accuracy by modeling digenic epistatic effects, (iv) validate the prediction accuracy in a genetically independent population, and (v) discuss the consequences for implementing genome-wide predictions in applied rapeseed breeding programs.

Plant materials and field trials
A bi-parental DH population of B. napus denoted as TN DH has been developed, comprising 202 unique lines [22].The DH lines were derived from a microspore culture based on the F 1 cross between Tapidor and Ningyou7.The parent Tapidor is a European winter cultivar with low erucic acid and glucosinolate content in the seeds.The parent Ningyou7 is a Chinese semi-winter cultivar with high erucic acid and glucosinolate content in seeds.The TN DH mapping population along with its two parents was grown in 11 winter and semi-winter ecotype environments (S1 Table ).The phenotypic data was generated and used in a previous linkage mapping study, which was based on a limited set of markers [9-11, 22, 23].The experimental design was a randomized complete block design with 3 replications.Every plot comprised three rows with a total plot size of 3.0 to 4.0 m 2 .
Phenotypic data was collected for six important seed quality traits for each DH line and parent: seed oil content (%) and protein content (%), which were separately defined as the percentages of the oil and protein in the total seed dry weight, respectively; three important components of the fatty acid in the seed oil: the erucic acid content (%), the linolenic acid content (%), and the stearic acid content (%); and the content of glucosinolates in the total seed dry weight (µmol/g).The quality traits were determined based on near infrared reflectance spectroscopy measuring three technical and three biological replicates.The details of the phenotyping are outlined in detail in previous studies [10,11,22].
A total of 117 genetically independent B. napus inbred lines were used in this study for validating the prediction accuracy based on the TN DH population.The validation population was developed based on hundreds of crosses between B. rapa and B. carinata accessions [55,56].The validation population was grown in one semi-winter environment (Wuhan, China) in 2013-2014 in a trial with three replicates.Every plot comprised two rows with a total plot size of 2.0 to 3.0 m 2 .Seed oil content and protein content was measured using the same method as that used for the TN DH population.

Phenotypic data analyses
The best linear unbiased estimates (BLUEs) of phenotypic values and variance components were estimated by the following linear mixed model using ASREML-R software [57]: The genotype effects were treated as fixed effects and the other effects were treated as random.To estimate variance components, all effects were treated as random.Broad-sense heritability was calculated as the ratio of genotypic to phenotypic variance: where N E refers to the number of environments, N R is the average number of replications per location, s 2 G is the genotypic variance, s 2 GxE is the variance of genotype times environment interaction, and s 2 E refers to the error variance.

Genotypic data analyses
The 202 DH lines of the TN DH population and the two parents were previously fingerprinted using a 60k SNP array based on an Illumina Infinium assay [52].Quality control was performed and those markers have been removed which are either monomorphic, have missing values of >5%, a minor allele frequency <5%, or degree of heterozygosity >5% in the DH population.After applying the quality check outlined above, 180 DH lines with 13,678 high-quality SNP markers remained.By aligning the marker sequence of the 13,678 SNPs to the reference "Darmor-bzh" genome of B. napus version 4.1 [58] via BLAST analysis, 9,628 SNP markers could be assigned a unique physical position in the genome with the parameters of 100% alignment, E value <10 −20 and mismatch <2 (S2 Table ).After removing redundant SNPs in full linkage disequilibrium (LD), 1,527 markers representing recombination loci (referred to as representative markers) remained (S2 Table ).The 1,527 representative markers included 1,052 representative markers from 1,052 genetic bins and 475 single markers.From each of the genetic bins, one marker with the least missing rate and the best available physical alignment position was selected as representative marker.In this way, a total of 1,527 representative markers were obtained and used for the subsequent analysis.Pairwise LD between markers was calculated as the squared Pearson moment correlation coefficient using R package genetics [59].The 117 lines of the validation population were genotyped using the same SNP array and the 1,527 representative markers selected in the TN DH population were used for prediction.

QTL mapping and genome-wide prediction
For the QTL mapping, the SNP markers were coded according to the F 1 metric [60].The genome-wide QTL mapping method is based on the inclusion of cofactors [7] obtained by stepwise multiple linear regressions using the Bayesian information criterion [61].The genome-wide scan was conducted comparing the full model comprising the SNP and all cofactors versus a reduced model including only cofactors.We used a false-discovery rate (FDR) of P<0.1 to test for significance.The proportion of the phenotypic variance explained (PVE) by all QTLs, was estimated using the adjusted R 2 values fitting a multiple regression [62].
We performed a five-fold cross-validation of the QTL mapping in which the total population of 180 DH lines was randomly divided into two groups with 100 replications according to the ratio of 4:1 (one group with 144 lines and the other group with 36 lines).One hundred and forty-four lines were used as the training set and the remaining 36 were used as the test set.QTL mapping was performed in each training set and estimated QTL effects were used to predict the genetic values of the lines of the test set.The prediction accuracy was defined as the correlation between the predicted and observed phenotypic values standardized with the square root of the heritability.
For the genome-wide prediction, four different models were used in this study.We implemented three methods exploiting the additive marker effect: genomic best linear unbiased prediction (GBLUP), ridge regression best linear unbiased prediction (RR-BLUP) [63], and BayesCπ [64].To accelerate computation speed and eliminate the impact of LD on the prediction accuracy of BayesCπ, we removed SNPs with r 2 >0.95.For BayesCπ, the Gibbs sampling ran 20,000 times, and the first 6,000 cycles were used as burn in.We also implemented an extended GBLUP model denoted as EG-BLUP, which models digenic epistatic effects as well as additive effects [65].The accuracies of all these genome-wide prediction methods were determined based on the adjusted entry means for the 180 genotypes applying five-fold cross-validation.Details of the implementation of the models have been described elsewhere [41,42,65].We performed 100 cross-validation runs and estimated the accuracy as the Pearson correlation coefficient between predicted and observed values standardized with the square root of the heritability.
To evaluate the dependence of prediction accuracy on training set size, we applied crossvalidation with randomly selected subsets of n (n = 48, 80, 112, 144) lines from the full data to form the training set and used the remaining lines as the test set.To evaluate the dependence of prediction accuracy on marker density, we selected subsets of m (m = 100, 1,000, 5,000, 13,678) evenly distributed markers from the full dataset and applied five-fold cross-validations using all 180 lines.The sampling procedure was randomly repeated 100 times for each scheme, and the prediction accuracies were averaged across the 100 cross-validation runs.We focused in the above outlined analyses of sampling of marker subsets and training set sizes on the traits seed oil content and protein content.The traits were selected because oil content was evaluated in a large number of 11 environments and protein content exhibited a high heritability.
We also evaluated the prediction accuracy using an independent validation population.The marker effects were estimated based on RR-BLUP and the TN DH population.Marker effects were used to predict the performance of the 117 individuals of the validation population.The prediction accuracy was again estimated as the Pearson correlation coefficient between predicted and observed values standardized with the square root of the heritability.Heritability was estimated using the variance components estimated for the TN DH population.

Intensive field evaluation of the TN DH population resulted in high-quality phenotypic data
We combined the information on seed protein content with previously published data for other seed quality traits of the TN DH population.We observed a wide variation of BLUEs approximating a normal distribution for most traits, except for erucic acid content (Fig 1 , S3 Table ).The analyses across environments revealed significant (P<0.001)variances for genotypes, environments, and interactions between genotypes and environments (Table 1).Broadsense heritability estimates were high for the six traits, ranging from 0.81 for protein content to 0.98 for erucic acid content.Consequently, the intensive phenotyping resulted in high-quality data representing an excellent source for dissecting the genetic basis of the six traits.
In total, 80% of the pairwise trait comparisons were significantly (P<0.001)associated with Pearson moment correlation coefficients ranging from -0.84 between erucic acid content and stearic acid content to 0.66 between erucic acid content and glucosinolate content (Fig 1).Interestingly, protein content was only poorly associated with erucic acid, glucosinolate, and stearic acid content.This lack of associations points to independent biochemical pathways and genes controlling the two classes of traits.
Large differences in the complexity of the genetic architecture of the six seed quality traits Altogether, 151 SNP markers passed the FDR significance level of P<0.1 in the genome-wide QTL mapping scan (Figs 2 and S1).The QTL numbers for the six traits ranged from 8 to 59 and were distributed across 19 chromosomes of B. napus.Phenotypic variance explained by a single putative QTL exceeded 5% for 27 SNPs and reached 45% for a QTL located on chromosome C03 controlling erucic acid content (Table 2).A second major QTL was detected on chromosome A08 for erucic acid content, explaining 31% of the phenotypic variance.However, the majority of the QTLs, especially those influencing oil and protein content, exhibited only minor effects.Among the detected QTLs, seven were putative pleiotropic QTLs influencing two traits.For instance, the marker "Bn-scaff_15794_1-p347392", which was physically aligned to C03 and detected as a putative pleiotropic QTL, explained 26% and 45% of the phenotypic variance for stearic acid and erucic acid concentration, respectively.
We used five-fold cross-validation to reliably estimate the potential of marker-assisted selection (MAS).The average accuracy of MAS ranged from 0.47 for protein content to 0.81 for erucic acid content (Table 3).These values were substantially lower compared to the noncross-validated results (Table 2), underlining the need to validate findings of linkage mapping.

Accuracies of genome-wide prediction in the TN DH population
We used four different models to investigate the efficiency of genome-wide prediction for the six seed quality traits.Genomic selection significantly showed higher prediction accuracies than MAS for all traits, with the most pronounced differences observed for linolenic acid, oil, and protein content (Table 3).The average prediction accuracy of RR-BLUP was the highest, while BayesCπ performed best for erucic acid and glucosinolate content.The most complex model comprising main and epistatic effects, EG-BLUP, performed best for linolenic acid content.In general, traits with high heritability could be predicted with higher accuracy compared to traits with low heritability.
As expected for a bi-parental mapping population, a large number of markers were in tight LD and could thus be grouped into genetic bins because of the absence of recombination events.We reduced the co-linearity among markers and removed redundant markers in full linkage disequilibrium, resulting in a subset of 1,527 SNP markers (S2 Table, S2 Fig) .Prediction accuracy increased on average by 3% using the reduced 1,527 representative marker set compared to genomic selection based on all SNPs (Table 3).

Effects of marker density, training population size, and number of environments on prediction accuracy
Genome-wide prediction based on RR-BLUP performed best on average and, in addition, was computationally efficient.Therefore, we conducted comprehensive analyses on the factors driving the accuracy in genome-wide prediction exclusively based on RR-BLUP.We varied the training population size and marker density and examined the accuracy of genome-wide predictions in our study.The accuracy remained in the range of 0.44 to 0.67 for all traits using only 48 lines as the training set (Fig 3).Interestingly, prediction accuracy reached a peak with 1,000 randomly selected markers and decreased only marginally for a subset of 100 markers.The prediction accuracy increased by ~4% for all six traits when using a representative set of markers compared to the 1,527 random evenly distributed markers (Table 2).Thus, our results indicated that to improve the accuracy of genome-wide prediction in a bi-parental population, the population size is more important than the density of markers.
We further studied the effects of the number of environments and training population size on the accuracy of genomic selection by focusing on oil and protein content.The traits were selected because oil content was evaluated in a large number of 11 environments and protein content exhibited a high heritability.We randomly selected training sets comprising n = 48, 80, 112, and 144 lines evaluated for oil content evaluated in subsets of environments (k = 2, 3,. .., 11 for oil content; k = 2, 3, 4, 5 for protein content).The accuracy was estimated as the Pearson moment correlation coefficient between predicted genotypic values and the adjusted entry means of all remaining lines evaluated across all environments.This type of cross-validation allows for the study of the prediction accuracy assuming reduced phenotyping intensity.As the test set was not evaluated in any of the environments, their performance could not be estimated by phenotypic correlations between environments.The prediction accuracies based on phenotypic data from only two environments were 0.73 for oil content and 0.60 for protein content (Fig 4).Compared to the accuracy evaluated with the full dataset, the accuracy decreased only in the range of 3% to 6%.The accuracy remained at 0.55 for oil content when only 48 lines and 2 environments were used.

Discussion
Erucic acid, stearic acid, and glucosinolate content are promising targets for marker-assisted selection Understanding the genetic basis of seed oil yield and quality is important for efficient rapeseed breeding [10].Previous studies revealed differences in the complexity of the genetic architecture of the six quality traits examined in our study [1, 3, 9-11, 17, 20, 22, 72], which were further substantiated using five-fold cross-validations (Table 2; S3 Table ).Oil, protein, and linolenic content are characterized by the absence of a reliable large-effect QTL, while erucic acid, stearic acid, and glucosinolate content are to a large degree controlled by a few QTL exhibiting large effects.For instance, the major QTL located in A08 and C03 (Table 2) totally explained 76.66% of the phenotypic variance for erucic acid, which has been widely identified previously in TN DH population and other mapping populations of B. napus [21,22,24,73].The major QTL located on C03 and explaining 16.44% of the phenotypic variance for oil content was identified in both of TN DH population and KN DH population [74].The QTL with large genetic effects for total seed glucosinolates located in A08, C01 and C03, were also identified previously in this and other mapping populations [9,19,75].These QTLs are interesting targets for marker-assisted selection, which can be applied in rapeseed breeding in  [9]   4 Bn-A04-p13930713  [9]   6 Bn-scaff_15794_1-p437864 1.45E-25 18.07 774 C03 55837809 TN-q.mcG-C3c(Feng et al.2012) [9]   7 Bn-C14160250-p3687  [9]   Total/average 151 1.94E-05 3.221133333 1 More detailed information of each genetic bin is listed in S2 Table. 2 The physical position is presented by the start position of each SNP with unique position to the reference genome of B. napus, Darmor-bzh 4.1, and more information is also available in S2 Table. 3Not available because of absent of alignment or multiple alignment positions.
The FDR (false-discovery rate) significance level is P<0.1for the detection of associated markers. doi:10.1371/journal.pone.0166624.t002 combination with the enrichment of target alleles for F 2 populations prior to producing DH populations.Besides of the consistent identified QTL, we also detected several new QTLs accounting for these seed quality traits with minor effects in TN DH population compared to the previous QTL identification in this population [10,11,22], which possibly because of the improved detection power using the high density SNP markers compared to the previous QTL identification using the relatively low-density markers.For example, the QTL "Bn-Scaf-fold000217-p20168" in C05, "Bn-scaff_16130_1-p1013445" and "Bn-scaff_16130_1-p1039452" located in C07 was newly identified for seed oil content of this population compared to that detected in Jiang et al., (2014).It is important to note that due to the absence of a physical position of 2,828 SNPs without alignment to the reference genome, we could not compare those QTL without unique physical position with previous studies.

Genetic architecture marginally impacts the choice of the genome-wide prediction model
Previous simulation studies revealed that equal shrinkage of marker effects as applied in RR-BLUP can be inappropriate for traits influenced by QTLs exhibiting large effects [13,76].In these cases, Bayesian models such as BayesB or BayesCπ, which allow specific shrinkage of every marker [77], are expected to outperform RR-BLUP.The superiority of BayesB over RR-BLUP has been reported for glucosinolate content in a previous genome-wide prediction study based on a diverse panel of 391 rapeseed lines derived from nine families [31].Superiority of Bayes models versus RR-BLUP has also been observed for flowering time in the TN DH population [50].In accordance with this observation, prediction accuracies for erucic acid and glucosinolate content were maximized when applying BayesCπ, with improvements of 8-10% compared to RR-BLUP (Table 2).In contrast, for stearic acid, RR-BLUP outperformed BayesCπ despite the presence of large-effect QTL.This is most likely due to two reasons.First, the ratio between the phenotypic variance explained by the two large-effect QTL versus that explained by the remaining small-effect QTL is approximately 1 to 1 for stearic acid content, while this ratio is 5 to 1 for erucic acid and glucosinolate content.Second, one large-effect  QTL controlling stearic acid is reflected in several marker-trait associations with SNPs being in tight linkage disequilibrium (r 2 >0.8), while the QTLs are reflected by only a limited number of SNPs for erucic acid and glucosinolate content.Epistasis, the interaction between genes [78], is an additional potential force influencing the choice of the biometrical model for genome-wide prediction [65].Previous linkage and linkage disequilibrium mapping studies in rapeseed indicated that epistatic effects are involved in fatty acid metabolism [11,47].Consequently, we implemented EG-BLUP for genome-wide prediction, which explicitly considers digenic additive by additive epistatic effects [65].We observed, however, higher prediction accuracies of EG-BLUP compared to the other genome-wide prediction models only for linolenic acid content (Table 2).Moreover, the gains in prediction accuracy were only marginal.These negligible benefits are in contrast to the non-cross-validated results of previous linkage and linkage disequilibrium studies [11,47] and point to the strong need to validate the role of epistatic effects.In summary, the accuracy of genomic selection does not crucially depend on the choice of a suitable genome-wide prediction model and is an attractive alternative to marker-assisted selection.

Implementation of genome-wide prediction in rapeseed breeding
The successful implementation of genome-wide prediction in rapeseed breeding requires that a certain threshold of prediction accuracy is realized [40,79].Previous model studies in wheat and maize suggested a threshold for the prediction accuracy of 0.5 [80,81].We chose two important traits, oil content and protein content, to illustrate the size of the training population, the number of environments, and the marker density required to reach a prediction accuracy of 0.5 for the bi-parental population.
In accordance with previous studies based on bi-parental populations [82][83][84][85], approximately one thousand markers were required before the prediction accuracy plateaued (Fig 3).Increasing the number of markers introduced problems due to collinearities.Prediction accuracies were higher for a reduced a set of 1,527 SNPs, which represented recombination loci in the population, in contrast to the full 13,678-marker set (Table 3).Thus, to improve the accuracy of genome-wide prediction in a bi-parental population, the population size indicating recombination events obtained is more important than the density of markers.
The number of lines has a greater impact on the prediction accuracy than the number of environments (Figs 3 and 4).The prediction accuracy is already stagnating at three environments, and thus it is more efficient to invest in training population size.For protein content, approximately 144 lines evaluated in two environments were needed to reach an accuracy of 0.6.For oil content, prediction accuracy amounted to 0.6 when the training population was decreased to 80 lines and the number of environments reduced to two.These results suggest that genome-wide prediction can be successfully implemented in bi-parental populations even with small training population sizes and is an attractive complement to phenotypic selection to improve seed quality traits.
The prediction accuracy within bi-parental populations is of central importance examining the potential to implement genome-wide prediction in breeding programs exploiting the double-haploid technology.Moreover, it is of interest to study the potential to use the prediction model also in unrelated populations.We examined an extreme validation scenario for the prediction of seed oil and protein content using a genetically diverse sample of 117 lines which were based on crosses between B. rapa and B. carinata accessions [55,56].The prediction accuracy in this independent and genetically very distinct validation population still amounted to 0.14 for protein content and 0.17 for oil content.While interpreting the prediction accuracies it has to be considered that the validation population exhibits genome segments from B. rapa/B.carinata.However, the used Brassica 60K-SNP array was developed based on the AC genome sequence of B. rapa, B. olearaca and B. napus.Thus, the lack of unique polymorphisms of B. carinata is expected to impair the prediction accuracies.Taking this into consideration, our independent validation reflects the high quality of the developed calibration models even in very diverse backgrounds highlighting the prospects of genome-wide prediction for routine rapeseed breeding programs.

Fig 1 .
Fig 1. Distributions and pairwise correlations for Best Linear Unbiased Estimates of six seed traits evaluated for 202 lines of the TN DH population in multi-environmental field trials.All correlations passed significance tests with P-values less than 0.001 except for the correlation between protein content and erucic acid, glucosinolates, and stearic acid content.doi:10.1371/journal.pone.0166624.g001

Fig 2 .
Fig 2. Manhattan plots based on composite interval QTL mapping for the six seed quality traits.The x-axis represents the corresponding physical position of each SNP of the 13,678 SNPs across the genome from chromosome A01 to A10 and C01 to C09.Those markers without unique alignment to the reference genome were arranged in the axis noted as "not assigned".The Y-axis represents the corresponding false-discovery rate (FDR) of each QTL indicating the significance for QTL calling.The PVE, i.e. proportion of the phenotypic variance explained by each QTL, is listed in Table 2.doi:10.1371/journal.pone.0166624.g002

Table 3 . Average prediction accuracy of four genomic selection methods and marker assisted selection (MAS) for six seed quality traits of the TN DH population. Marker Type Method Oil content Protein content Erucic acid Linolenic acid Stearic acid Gluco-sinolates Average
The 1,527 representative SNPs are specifically selected from the 1,527 individual genetic bins of the TN DH population, while the 1,527 random SNPs are randomly selected from the total 13,678 polymorphic SNPs of the TN DH population.Marker assisted selection (MAS) is based on markers significantly associated with the respective traits outlined in detail in the Material and Methods. *doi:10.1371/journal.pone.0166624.t003