Leveraging genomic prediction to scan germplasm collection for crop improvement

The objective of this study was to explore the potential of genomic prediction (GP) for soybean resistance against Sclerotinia sclerotiorum (Lib.) de Bary, the causal agent of white mold (WM). A diverse panel of 465 soybean plant introduction accessions was phenotyped for WM resistance in replicated field and greenhouse tests. All plant accessions were previously genotyped using the SoySNP50K BeadChip. The predictive ability of six GP models were compared, and the impact of marker density and training population size on the predictive ability was investigated. Cross-prediction among environments was tested to determine the effectiveness of the prediction models. GP models had similar prediction accuracies for all experiments. Predictive ability did not improve significantly by using more than 5k SNPs, or by increasing the training population size (from 50% to 90% of the total of individuals). The GP model effectively predicted WM resistance across field and greenhouse experiments when each was used as either the training or validation population. The GP model was able to identify WM-resistant accessions in the USDA soybean germplasm collection that had previously been reported and were not included in the study panel. This study demonstrated the applicability of GP to identify useful genetic sources of WM resistance for soybean breeding. Further research will confirm the applicability of the proposed approach to other complex disease resistance traits and in other crops.


Introduction
Sclerotinia stem rot or white mold (WM), caused by Sclerotinia sclerotiorum (Lib.) de Bary, is a devastating disease of many economically important crops including soybean [1]. Since its first report from Ontario, Canada [2], WM has become one of the major diseases impacting soybean production in the United States [3]. Current strategies, such as increased row spacing, reduced irrigation before and during crop flowering, and biocontrol, have not effectively controlled this disease [4]. Despite that fungicides reduce the WM disease severity and yield losses are reduced, the profit does not increase using fungicides [5]. Considering the low effectiveness for all of these strategies, the development of resistant cultivars remains an effective and economic approach to cope with WM [6]. a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 Plant breeders have traditionally used specialized disease nurseries or indoor growth environments to determine the expression of resistance in cultivars. However, due to the low to moderate heritability of trait [7] and its vulnerability to environmental influences [8], phenotyping for WM resistance is time-consuming and costly. Several quantitative trait loci (QTL) associated with the disease's resistance have been reported [9], demonstrating the quantitative inheritance of WM resistance in soybean. Although marker-assisted selection (MAS) has played an important role in soybean breeding for disease and pest resistance (e.g. soybean cyst nematode resistance [10], its application in the improvement of WM resistance is challenging because WM resistance is controlled by many loci with small effects. Genomic prediction entails building a prediction model by associating marker information with phenotypic information in a model training step [11]. Individual genetic material(s) that have been genotyped and phenotyped comprise the training population. The prediction model is then applied to a set of selection candidates that have been genotyped but not evaluated phenotypically (validation population). The primary difference between GP and traditional forms of MAS is that GP foregoes QTL identification through statistical significance tests and testing of significant markers by modeling all scored markers simultaneously. By utilizing genomewide molecular markers, GP is becoming a promising method for the selection of complex traits in plant breeding programs [12] and has been applied to multiple crops including wheat, maize, and barley [13][14][15]. The prediction accuracies of GP have been reported to be 28% greater than some MAS and 95% as accurate as phenotypic selection for a single trait in wheat [16].
Relatively few GP studies have been reported in soybean [17][18][19][20] for quantitative traits including seed yield [17], seed weight [20], resistance to sudden death syndrome [21] and soybean cyst nematode [18]. However, the efficiency of GP on WM resistance, a complex trait in soybean, is currently unclear, making it an ideal trait to examine the efficiency of GP on soybean complex traits.
The effectiveness of GP depends on the correlation between the predicted genotypic value and the underlying true genotypic value [22]. This correlation, also called prediction accuracy, of GP has been expressed as a function of the training population size (TPS), trait heritability on an entry-mean basis (h 2 ), and marker density [23,24]. Simulation and cross-validation studies have indicated that prediction accuracy generally increases as more individuals are included in the training population [16,[25][26][27][28][29][30] and more markers were used [31]. However, when genome wide selection (GWS) is applied in a structured population, increasing the number of markers or the number of individuals of training population did not necessary lead to an increase of prediction accuracy [32]. Therefore, the first steps are to determine the training population size and number of markers to use for GP of genomic estimated breeding value (GEBV), and the appropriate prediction model to use to obtain high predictive ability, the correlation between the predicted breeding values and the observed phenotypic values as true breeding values are unknown in real datasets. The next steps are to validate the models and testing them to predict diverse accessions previously unseen in the training models. These efforts will not only provide information on the applicability of these prediction models but also complement the efforts to diversify the genetic basis of commercial soybean breeding programs by identifying and utilization of diverse accessions to broaden the genetic base of disease resistance.
Here we report the results of performing GP cross-validation for WM resistance from a collection of 465 plant introduction (PI) soybean lines from the USDA soybean germplasm collection. For the GP models, the training population composition, marker number, and the statistical method for the calculation of GEBV were varied to determine their effect on WM GP accuracy. To assess GP accuracy, cross validation was done to predict white mold (WM) in the field and greenhouse in 2014 and 2015. Finally, using the GP models, we validated previously reported sources of WM resistance as well as identified new sources of WM resistance in the entire USDA soybean germplasm collection, which houses 19,652 accessions from several geographical origins.

Phenotypic evaluation and GP method comparison
Continuous distribution was observed in the field experiments, asymmetric distribution was observed in GH2014, and a kurtotic distribution was observed in GH2015 based on the predicted value of the accession random effects in the logistic model (Fig 1). Asymmetric distribution was observed to Field2014, Field2015 and GH2014 and normal distribution to GH2015 for original  The PI lines were considered resistant when their predicted value was low or equal compared with the resistant checks. High genotypic correlation was observed between all experiments (S2 Fig). The genetic correlation between Field2014 and Field2015 was the highest (cor = 0.72) while the lowest correlation was between GH2014 and GH2015 (cor = 0.46). Correlations between field and greenhouse were almost the same ranging from 0.53 (Field2014 and GH2015) to 0.54 (Field2014 and GH2014, Field2015 and GH2014, Field2015 and GH2015).
The broad-sense heritability estimate for disease severity index (DSI) was 0.64 across field environments, suggesting that selections done for white mold resistance in a field setting would be effective. Genetic diversity estimated by principal component analysis (PCA) showed the diversity in the mini-core panel used in this study, and no pattern among PI accessions was detected (S3 Fig). When applying PCA to the genomic relationship matrix, 141 components of a total of 465 were needed to explain 80% of the total variance of genotypes. The first component captured 15.41% of the total variation of marker genotypes, whereas the second component explained 6.66% of the total variation (S3 Fig).
All GP methods tested in this study had similar prediction abilities of WM resistance in soybean, and these ranged from 0.43 to 0.47 in the field and from 0.18 to 0.25 in greenhouse testing (S4 Fig). The predictive ability from the field experiments was approximately twice that of the greenhouse experiments using the same model.

Effect of marker density and training population size in the genomic prediction
For all experiments and training population sizes, minimal increase was observed in the predictive ability for a given training population size beyond 5k SNPs (Fig 2). The increase for prediction ability comparing 100 markers and 5k markers was 14%, 7%, 8% and 35% in Field2014, Field2015, GH2014 and GH2015, respectively. Little or no difference in predictive ability was observed with sets of 10k, 15k, 20k, 25k, 30k and 36,105 SNPs. Variation in population size did not affect prediction accuracies, so only training population with 352 genotypes (80%) is shown (Fig 2). The effect of marker number and training population size is reported using RR-BLUP because this model maintained prediction accuracies similar to other models and had a lower computational requirement compared with Bayesian models.
A slight increase in predictive ability was observed with increased TPS from 220 genotypes (50% of the panel) to 396 genotypes (90% of the panel) (Fig 3). Similar results were obtained for varying sizes of marker sets for all TPSs; therefore, only data for the 5k-marker set are shown (Fig 3).

Genomic prediction capacity in different experiments
When the same experiment was used to train and validate the model, 10-fold cross-validation was used, and those models were used as control to compare with models performed using different experiments in training and validation population. When one field experiment (Field2014 or Field2015) was used as a TPS and another as the validation population set (VPS), the prediction accuracies were very similar ( Table 1). The predictive ability dropped 62% when one greenhouse experiment (GH2014 or GH2015) was used as the TPS and the other as the VPS. When field data were used as the TPS, the predictive ability in the greenhouse decreased. For example, when Field2014 is used as the TPS, the predicted accuracy of GH2014 was 20% less than the predictive ability of GH2014 when GH2014 was the TPS. When greenhouse experiments were used as the TPS, the predictive ability in the field decreased significantly. For example, when GH2014 was used as the TPS, the prediction accuracies for Field2014 and Field2015 were 0.31 and 0.30, respectively, while the best prediction accuracies for Field2014 and Field2015 were 0.44 and 0.47, respectively. The GP model using Field2014 as the TPS accurately predicted the Field2015 phenotype, and vice versa.

Testing GP model
The GEBV estimated by the GP model using RR-BLUP with 5k markers and 352 genotypes for twenty-nine previously and independently reported resistant accessions was similar to the GEBV of the 10% most resistant genotypes (Fig 4).
The comparison of the 5% most resistant accession in the USDA soybean germplasm collection with the 5% (23 accessions) and 10% (46 accessions) most resistant accessions identified in this study revealed that thirty-five of the fourth-six (76%) accessions were common with the 5% most resistant accessions from the USDA soybean germplasm collection (Fig 4 and S1 Table). The country of origin of the 5% most resistant accessions in the USDA soybean   Table 1. Predictive ability and standard deviation reported using RR-BLUP from white mold evaluation experiments in field and greenhouse (2014 and 2015), when one environment was used to train the model and validated on another environment. The predictive ability (estimated by 10 fold cross-validation) using the same experiment for training and validation population was used as control.  The identification and mobilization of useful genetic variation from germplasm bank for use in breeding programs is clearly a necessity not only for sustaining current rates, but also for increasing future rates of crop genetic improvement [33].
And it was also observed that the variability between field and greenhouse experiments were different (Fig 1 and S1 Fig).
Based on the differences between field and greenhouse experiments, they were evaluated for white mold separately and the correlation between them were estimated (S2 Fig). This allowed the comparison of different environments in an attempt to understand the modes of resistance underlying each of them and possibly explain the lack of correlation between field and greenhouse evaluations that has been previously reported [34][35][36][37][38][39]. These low correlations observed in this study support the fact that greenhouse experiments, although informative, do not correlate as well with field responses. At a correlation coefficient of 0.72, Field2014 and Field2015 observed the highest correlation coefficient between experiments. Therefore, in order to identify new sources of WM resistance, field screenings should be utilized as they are more representative of farmers' fields.
Moreover, the heritability estimate in this study was similar to previously reported estimates for DSI among recombinant inbred line (RIL) populations which ranged from 0.30 to 0.71 in individual field environments and 0.59 across environments [8]. This heritability means that it is possible to obtain gain with selection in our study.
Since the genetic material used in this study consisted of diverse accessions from the USDA soybean germplasm collection (S3 Fig), it does not represent the pedigreed structure that a soybean breeding program, developing commercial cultivars, will encounter. However, the results from the genetic structure and composition of entries in this study would be applicable to germplasm enhancement programs using diverse collections to obtain parental materials. Therefore, the genetic variability observed in the study panel make it suitable to develop genome wide predictions for the identification of soybean accession resistant to white mold.
There was no difference between GP methods compared in this study (S4 Fig). Several studies have shown that, in comparison with ridge regression methods, more complex statistical methods give little increase in the accuracy of GP for polygenic traits [40][41][42][43]. Despite there were no statistics differences between methods in this study, RR-BLUP performed a little better than some Bayesian methods in certain experiments. Moreover the RR-BLUP model has other advantages compared with Bayesian methods such as relative simplicity, reduced computing time, and the well-known optimality properties of mixed models for selection [44].

Effect of marker density and training population size in the genomic prediction
Models fitted using at least 5K markers were capable enough to predict WM (Fig 2). SNP set and verified that there was no significant difference in the best performing GP method for grain yield, days for 50% flowering and plant height [31]. Jarquín et al. (2014) evaluated the effect of SNP filtering on accuracy assessed by building a series of G-BLUP models using SNP datasets created by applying combinations of MAF and PMV filtering criteria [17]. Overall, marker filtering criteria did not have a large effect on accuracy for grain yield in soybean, but it had effects on the accuracy for plant height and days to maturity, when was observed that the accuracy improved when more markers were used to train the GP model.
The comparable performance of a limited number of markers (5k) relative to the complete marker data set (36,105) could be due to marker saturation because random markers was selected per chromosome, but the same number of marker were picked in each one, i.e., in the 5k samples, 250 markers were selected randomly per chromosome. With larger linkage disequilibrium (LD), the addition of more markers will not increase the accuracy of predictive models [31]. Therefore, it might be desirable to reduce the SNP numbers to ease computational requirements when predicting individual SNP effects and summing effects for GP. However, more saturated SNP datasets may be more desirable for computing GP of multifamily selection schemes of more diverse germplasm. The RR-BLUP approach is more computationally efficient compared with Bayesian models with computational demands scaling with individual number rather than marker number [17]. Because correlation between the number of markers and predictive ability was not observed in this research, a good GP model to predict WM in soybean can be fitted using about 5k markers in a diverse genotype collection. The use of a small SNP set can lead to cost savings. Use of a uniform or common SNP set will allow consistent utilization of genome wide prediction in the research and breeding programs.
When the training population size increased, a small increase in the predictive ability was observed (Fig 3). Several studies have demonstrated that the accuracy of GP is highly influenced by the population used to calibrate the model [29,32,43,46]. However, Charmet et al.
[2014] observed that predicted accuracies did not improve significantly with the increase of the training population size when unrelated populations from different breeding programs were merged to create a new population. Similarly, the lack of significant increase in predictive ability with an increased number of individuals in the training population could therefore be explained by the high diversity of soybean lines in this study (S4 Fig). On the other hand, Boligon et al. (2012) concluded that strategies that maximize the phenotypic variance, by selecting individuals from the two-tail distribution, are preferred to those using genotypes with the largest or lowest phenotypic deviation [47]. Thus, capturing most of the phenotypic variance in the training population seems to be the key to increase the prediction ability [32]. Indeed the TPS did not influence the predictive ability because the highest and lowest classes were represented by few individuals (Fig 1). Therefore, it may be useful to choose genotypes that provide equal distribution across the rating scale (resistant, score = 0 to susceptible, score = 3) to compose the training population in future analyses. The variability would then be maximized, the predictive ability improved, and fewer plants should be phenotyped for use in the training population, saving money and time.

Prediction in different environments
The main objective in GP analysis is to use one generation of data to train the model which would then be used to predict the data of the next generation [40]. Models built using TPS from a field experiment had similar accuracies in predicting results of a different field experiment. The same happened when models were fitted using one greenhouse experiment as the TPS and another as the VPS (Table 1). However, when models fitted using field data as the TPS were used to predict greenhouse performance, the predictive ability was low, and vice versa.
Field and greenhouse experiments differed for inoculation and phenotypic evaluation. Field2014 and Field2015 experiments were inoculated when plots that had reached crop growth stage R1, while GH2014 and GH2015 were inoculated when the third trifoliate was fully expanded, approximately four and six weeks after transplanting respectively. Moreover Field2014 and Field2015 was rated when plants reached the growth stage R5, about 30 days after inoculation (DAI), while GH2014 and GH2015 were rated 14 DAI. Additionally, genetic mechanism of WM resistance is likely not common between the field and greenhouse screening tests. These differences may explain why prediction ability in field was twice than greenhouse.
The models derived from field data had prediction accuracies twice that of those derived from greenhouse data (Table 1 and S4 Fig). This may be explained by the different number of plants per genotype used in each experiment. In the greenhouse experiment, one or two plants per replication were phenotyped in 2015 and 2014, respectively, while the number of plants phenotyped in the field was higher. The lower number of plants in the greenhouse experiments perhaps increase the experiment error, which resulted in the lower predictive ability of predictive models using greenhouse data.
Good predictive ability was observed when the same type of environment was used. The phenotypic correlation between environments ranged from 0.15 (GH2014 and GH2015) to 0.46 (Field2014 and Field2015) ( Table 1). Lopez-Cruz et al. (2015) compared models with and without the G x E interaction to estimate prediction accuracy for yield in wheat [48]. They concluded that the across-environment model (the same method used in this study) was the worst method to predict between different environments when the correlation between two environments was less than 0.4. Their findings supports our observation and highlights the importance of considering G x E when analyzing multi-environmental data and involving G x E interactions in the model for increasing the predictive ability [48].

Testing GP model
Twenty-nine previously reported resistant PI lines were compared with our results. Those resistant PI lines were: PI391589B, PI507352, PI561345, PI196157, PI398637, PI358318A, PI189919, PI189861, PI437527, PI549066, PI567157A, PI416776, PI561331, PI437764, PI507353, PI548312, PI504502, PI437072, PI194634, PI281850, PI423941, PI423949, PI424242, PI458520, PI503336, PI593972, PI593973, PI603148 and PI243547 [49]. However, we observed that just four PI lines (PI424242, PI398637, PI281850 and PI549066) previously reported were in the 5% most resistant accessions from the entire USDA soybean germplasm collection, while 23 PI lines were in the 10% most resistant accessions. Several factors can influence these results since PI lines previously identified as resistant were reported on phenotypic values, and may be influenced by environment conditions, pathogen infection, inoculation procedure, or evaluation methods. Therefore, based on GWS outcomes, there are several genotypes in the USDA soybean germplasm collection that are predicted to be more resistant than the previously identified and reported WM resistant PI accessions, and these PI lines need to tested for obtain validation. GP model based on the training population and marker density analyses (RR-BLUP using 5k SNPs-250 SNPs per chromosome and 352 genotypes in the TPS) in the study accurately predicted WM resistance in PI lines in the USDA soybean germplasm collection that had already been previously reported to be resistant (Fig 4). This suggests that GP analysis using diverse accessions from the germplasm-genebank collection for different crop species and traits needs to be explored to identify useful accessions as parental sources in breeding programs.
Most resistant accessions from the USDA soybean germplasm collection have different countries of origin (S5 Fig) and belonged to different maturities (S6 Fig). Although, our GP model was developed using Maturity groups I, II and III [50], soybean accessions were accurately predicted and validated to be resistant in different maturity groups. These results suggest that the prediction ability is not dependent on maturity group, and this approach needs to be tested in other crop species and biotic stresses, and also would be a useful strategy for gene bank collections to identify selective accessions for further phenotypic validation for use as parents in genetic enhancement programs.
Therefore, we concluded that for WM GP, results from the different GP models did not vary significantly, and minimal response was observed for changes in TPS and marker numbers. GP approaches were shown to be useful to the breeder to assist in selecting among the USDA soybean germplasm collection. The results from this research will assist the establishment of GP approaches for germplasm enhancement using plant introductions and gene back accessions to help maximize the useful genetic diversity for disease resistance breeding.

Genetic materials
This study included 465 PIs from the USDA soybean germplasm collection and consisted of 144, 168, and 153 accessions of maturity groups (MG) I, II, and III, respectively. This panel was a subgroup of a large Glycine max core collection of 1,685 soybean genotypes that represented the genetic diversity of the entire USDA soybean germplasm collection as determined through pedigree and marker allele analysis [51]. Accessions originated from twenty-seven different countries/regions: China, Japan, Russia, South Korea, North Korea, France, Taiwan, Georgia, Belgium, Algeria, Austria, Eastern Europe, Germany, Hungary, Indonesia, Iran, Moldova, Morocco, Poland, Portugal, Romania, Turkey, United States, Ukraine, Uzbekistan, Vietnam, and Yugoslavia. Twenty-eight accessions were of unknown origin. . WM inoculum preparation, inoculation, and rating were modified from the cotton pad method described by Bastien et al. (2014). S. sclerotiorum cultures were started from surface-sterilized sclerotia obtained from a field in Northeast Iowa in 2013 [52]. Inoculation was done in each row at the R1 growth stage [53] by placing a mycelium-soaked cotton ball on the lowest flower-bearing petiole of each plant. Fields were irrigated by overhead sprinklers until the ratings were completed in order to create an epiphytotic disease nursery. When plants had reached the R5 growth stage [29-33 days after inoculation (DAI)], WM disease ratings were taken according to the methodology described in a previous study [54]. The disease severity scale for each plant was: 0 = no symptoms, 1 = lateral branches showing lesions, 2 = lesions on the main stem, and 3 = lesions on the main stem resulting in poor podfill or plant death. The disease severity was used to calculate the disease severity index (DSI) of each plot. The DSI ranged from 0, no disease, to 100, all plants rated were dead or had poor pod-fill due to disease.

Phenotyping for white mold resistance
No specific permissions were required for field locations/activities, as experiment sites are managed by the Iowa State University and university researchers are permitted to perform experiments as described in this study. Field studies did not involve endangered or protected species.
Greenhouse. In 2014, the greenhouse experiment was planted using a randomized complete block design with two replications (GH2014). An experimental unit consisted of one plant. Plants were inoculated at the V3 growth stage using a cut-petiole method, which is preferred for indoor screening [55]. In 2015 (GH2015), inoculation was done at the V5 growth stage, and the experiment consisted of one replication with two plants per experimental unit. WM infested plants in both years were rated for 3 days after inoculation (DAI). Disease severity was based on the scale previously described. In 2015, the second replication was lost due to greenhouse malfunction. Since greenhouse is a controlled environment, fewer replications are sufficient and therefore we retained 2015 data. Greenhouse experiments were conducted to complement field studies, as the greenhouse protocol allows for earlier assessment of disease and is less time and resource intensive.

Genotypic data
All 465 plant accessions had been previously genotyped on an Illumina Infinium SoySNP50K BeadChip [56], and the data is publically accessible at SoyBase (http://soybase.org). SNPs with a missing rate greater than 10% were excluded from further analyses and the remaining missing data were imputed using BEAGLE version 3.3.1 with default parameter settings [57,58]. The remaining missing data were imputed, and SNPs with a minor allele frequency (MAF) lower than 0.05 were ruled out, and these quality control steps left 36,105 SNPs.

Statistics analysis
A completely random genotype x environment model were used to calculate variance components of individual factors (s 2 G , s 2 GE , and s 2 e ) is described below: In which y ij is the phenotypic value for the i th genotype in the j th environment; μ is the overall mean; G i is the random effect of the i th genotype; E j is the random effect of the j th environment; GE ij is the GenotypexEnvironment effect assigned as random; and ijk is the residual error. Variance components were calculated using the lmer function in R package lme4 [59]. Broad-sense heritability (H) was calculated across environments for DSI in field environments. The equation used was: Where G is genotype and E is environment, e is the number of environments and r is the number of replicates. All experiments and years were analyzed separately. When an experiment was replicated and the disease response was ordinal, a logistic mixed model analysis was used to obtain BLUP for each accession. A mixed model approach was chosen to accommodate the unbalanced sample size among cultivars due to emergence rates. The analysis was performed using the clmm function (part of the ordinal package; [59]) executed in the R statistical analysis software [60]. The model for both field experiments (Field2014 and Field2015) was where θ i is the intercept for the ith response category (i = 0,1,2), R j is the effect of the jth replication, A k is the effect of the kth accession, RA jk is replication × accession interaction, D l is the effect of the lth inoculation date (inoculation was done per plot at the R1 growth stage therefore inoculations were made on multiple dates in the entire nursery), S m is the effect of the mth row, and DS lm is inoculation date × row interaction. The term Y jklm represents a vector of ratings and π ijklm is the probability that Y jklm will be rated at or below the ith response category. Replication was assumed to be a fixed effect, and all other terms were assumed to be random effects. Similarly, the model for GH2014 and GH2015 was Due to lack of replication in GH2015 experiments, phenotypic data for each accession consisted of the average response of the two plants.
Principal components analysis (PCA) was performed to estimate the diversity among soybean accessions. PCA was calculated based on all SNPs. MANOVA was used to estimate correlations between experiments and years based on GEBV estimated by GP models. PCA and MANOVA were performed using the stats package in R software.

Genomic prediction models
Six GP methods were used to analyze greenhouse and field experiments: ridge regression best linear unbiased prediction (RR-BLUP), Bayes A, Bayes B, Bayes Cπ, Bayesian LASSO (BLASSO), and Reproducing Kernel Hilbert Spaces (RKHS) regression. The genetic value estimated by the logistic regression was used as input in all GP models.
RR-BLUP, Bayes A, and Bayes B were described by in a previous study [11]. RR-BLUP assumes that each marker has an equal variance, V G /M, where V G is the genetic variance, and M is the number of markers. In Bayes A, each marker effect is drawn from a normal distribution with its own variance: N (0, s 2 gi ). The variance parameters are in turn sampled from a scaled inverse chi-squared distribution. In the Bayes B approach, the prior for the proportion of markers associated with zero phenotypic variance, π, is assumed unknown. Other prior hyperparameters for marker variance components in Bayes A and Bayes B were as given in Meuwissen et al. (2001) [11].
Bayes Cπ assumes common marker variances and allows for some markers to have no effect [61]. Additionally, Bayes Cπ jointly estimates π from the training data to avoid an incorrect π that can negatively affect prediction accuracy [62].
In BLASSO [63], marker effects are assigned independent Gaussian priors with marker-specific variances (s 2 e t 2 j ). At the next level of the hierarchical model, the t 2 j s are assigned exponential priors EXP½t 2 j jl 2 : At a deeper level of the hierarchy, λ 2 is assigned a gamma prior with rate (δ) and shape (r) which, in this study, was the default in the Bayesian Generalized Linear Regression (BGLR) package in R software. Finally, inverse chi-square priors were assigned to the variance parameters, and the scale and degree of freedom parameters were set to S u = S e = 1 and d:f :e = d:f :u = 4, respectively.
In RKHS regression, genetic values are viewed as a Gaussian process. When markers and a pedigree are available, genetic values are modeled as the sum of two components: g i = u i + f i, where u i is the mean, and f i is a Gaussian process with a (co)variance function proportional to the evaluations of a reproducing kernel, K(x i , x j ), evaluated in marker genotypes. Here, x i and x j are vectors of marker genotype codes for the i th and j th individuals, respectively. All hyperparameters were assumed following [64]. The genomic relationship matrix was estimated using the function A.mat (rrBLUP package) in R software.

Cross validation
The ten-fold cross-validation (CV) was performed to avoid an inflated estimate of the predictive ability of GP. In ten-fold CV, the genotypes set was divided into ten equally-sized subgroups. Of these, nine subgroups were used as the training population to fit each prediction model while the remaining subgroup was used as the validation population to assess the correlation between the observed and predicted trait values. This process was repeated ten times, with each subgroup being the validation population exactly once. This process was repeated fifty times by randomizing the genotypes and re-forming the folds. The mean of the 50 correlation coefficients was reported.
Predictive ability was assessed using the Pearson correlation of the predicted GEBV and the genetic value estimated by the logistic regression in the validation population [65].

Marker density and training population size
The effect of marker numbers on GP predictive ability was determined through ten-fold crossvalidation by including random samples of 100, 200, 500, 1k, 5k, 10k, 15k, 20k, 25k, and 30k SNPs from the full marker set. These samples were compared with the model which used 36,105 markers. For this analysis, the marker file was divided in 20 sets, where each set was composed with all SNPs in each chromosome. Then the same number of SNPs was picked in each set, i.e., each chromosome. The SNPs were picked randomly within each set. All codes were run in R software.
The impact of TPS on predictive ability was evaluated using training sets of variable sizes (220 genotypes [50% of the population], 264[60% of the population], 308[70% of the population], 352[80% of the population], and 396[90% of the population] genotypes). All marker subsets described above were used to evaluate these training population sizes. The predictive ability was estimated for each combination of marker number and TPS.

Genomic prediction capacity in different experiment
To verify the prediction capacity of the GP model to predict WM in unrelated experiments, each experiment (two field and GH experiments each) was used as the training population to validate the other three experiments. The RR-BLUP method was fitted using all markers, and 352 (80% of total population) genotypes were used in the training population.
In order to test the ability of the GP model to identify WM resistant genotypes from the USDA soybean germplasm collection, twenty-nine plant-introduction (PI) soybean accessions, previously reported to have WM resistance [49], were chosen. Marker data were available on these twenty-nine soybean PI accessions from the same SNPs run on the 465 PI accessions used in the model. Using the GP model and available marker information, the GEBV was obtained on these twenty-nine accessions. Boxplot was used to compare twentynine plant-introduction (PI) soybean accessions with the 5% (23 accessions) and 10% (46 accessions) most resistant PI lines and the 5% (23 accessions) and 10% (46 accessions) most susceptible PI lines found in our results. The 5% and 10% more resistant PI lines was assigned being the accessions with the lowest GEBV, meanwhile the 5% and 10% more susceptible PI lines was assigned being the accessions with the highest GEBV based on the GP model.
SNPs effect estimated in the training population by RR-BLUP using the average between Field2014 and Field2015 was used to predict the entire USDA soybean germplasm collection (19,652 accessions). The average between Field2014 and Field2015 was chosen to perform this analysis because the field experiment provided the greatest precision based on the estimated broad-sense heritability and GWS accuracy. The 5% (982 accessions) most resistant accessions were selected and compared with the 5% and 10% most resistant accessions selected in our study.
A total of 20,000 burn-ins (number of iterations before the Bayesian analysis convergence) and 40,000 saved iterations, obtained from the Markov chain Monte Carlo (MCMC) method, was used in all Bayesian methods. The convergence of Bayesian models was checked by inspecting trace plots of variance parameters.