Enhancing genomic selection by fitting large-effect SNPs as fixed effects and a genotype-by-environment effect using a maize BC1F3:4 population

The popularity of genomic selection (GS) has increased owing to its prospects in commercial breeding. It is necessary to enhance GS to increase its efficiency. In this study, a maize BC1F3:4 population, consisting of 481 families, was evaluated for days to anthesis in four environments, and genotyped with DNA chips including 55,000 single nucleotide polymorphisms (SNPs). This population was used to investigate whether GS could be enhanced by borrowing information from the genetic basis and genotype-by-environment (G × E) interaction. The results showed that: 1) fitting the top four large-effect SNPs as fixed effects could increase prediction accuracy, including three minor-effect SNPs explaining less than 10% phenotypic variance; 2) the increase of prediction accuracy when fitting large-effect SNPs as fixed effects was related to the decrease of genetic variance; 3) generally, the GS model fitting large-effect SNPs as fixed effects and G × E component enhanced GS. Therefore, we propose fitting large-effect markers as fixed effects and G × E effect for crop breeding projects in order to obtain accurately predicted phenotypic data and conduct efficient selection of desired plants.


Introduction
Plant quantitative genetics is a burgeoning field, enabling the identification of a great number of quantitative trait locus/loci (QTL) and genes in crops. The QTL or gene information (including its position and effect) should be summarized and transferred to molecular markers to better serve crop breeding [1,2]. The conventional use of QTL or gene information in plant breeding typically involves of marker-assisted selection (MAS), which requires the identification of significant QTL and selection of desired plants in advanced populations [3]. In the PLOS  MAS method, the target QTL are usually major QTL, and minor QTL are often not detected due to the probability of false negative, thus the QTL information is not fully exploited [3]. Genomic selection (GS), which was introduced in animals and later applied to crop genetic research, could make use of minor-effect QTL for the improvement of target traits [4,5]. GS entails the prediction of genomic estimated breeding values (GEBVs) of a validation population based on a training population, for which both phenotypic and genotypic data are available [4]. Factors influencing the prediction accuracy (PA) of GS should be considered and optimally controlled to accurately estimate GEBVs. These factors include, but are not limited to, population size, marker density, heritability, linkage disequilibrium, the genetic architecture of the target traits (QTL number, QTL effects, and QTL interactions), genetic relatedness between the training and validation populations, and GS models [6][7][8]. Some factors are difficult to modify when both genotypic and phenotypic data are available, whereas others can be optimized using statistical approaches. There might be differences in PAs among GS models, including ridge regression best linear unbiased prediction (rrBLUP), genomic best linear unbiased prediction (GBLUP), or the Bayesian alphabet [5,9,10]. However, the selection of an optimal GS model has proven difficult, as there is no clear indication of which model improves PA in all cases [9,11,12].
Examining of the components of GS models may provide insight into improving the PAs of GS models. In an rrBLUP model, the linear model is y = Xβ+Zu+ε, which is composed of fixed effect β and random effect u. Modifying the fixed and random effects might improve GS models. For example, a simulation study found that fitting major-effect SNPs as fixed effects could enhance genomic prediction [13,14]. A study on wheat stem rust resistance found that the PA of a GBLUP model using markers linked to Sr2 (involved in stem rust resistance) as fixed effects was larger than that of an ordinary GBLUP [14]. Generally, it is beneficial to use known genetic or gene information to improve GS models. In maize, it is common to select breeding materials from the offsprings of F 1 plants [15]. Therefore, it is necessary to confirm the effect of fitting large-effect markers as fixed effects in GS models using a maize biparental population.
The use of additional random variables might be helpful in enhancing the prediction of GS models. As suggested by a study on wheat, the PA of a GS model including a genotype-byenvironment (G × E) effect was higher than that of other GS models [16]. In another study on dairy cattle, the PA of a G × E GS model was the higher than other models for predicting metabolic body weight [17]. It is important to study the effect of modeling G × E in maize, which shows adaptations to various environments ranging from tropical to temperate regions. Flowering time is crucial for the adaptation of maize to diverse ecological regions, and is easily influenced by the environment, especially temperature and photoperiod [18,19]. It was demonstrated that QTL controlling flowering time were closely related to environmental conditions, some QTL showed significant QTL × environmental interaction effects [18,20]. Therefore, flowering time is suitable for analyzing the contribution of G × E to the PA of GS model.
In this study, 481 maize BC 1 F 3:4 families were constructed using elite inbred lines. Meanwhile, genotypic and phenotypic data of the 481 families were obtained. The objectives of this study were to investigate whether fitting large-effect SNPs as fixed effects could increase the PAs of GS models, and whether modeling G × E interaction could increase the PAs of GS models. Afterward, we assessed the performance of the GS models fitting large-effect SNPs as fixed effects and G × E interaction. This analysis would provide insight into improving the PAs of GS models using available phenotypic and genotypic data.

Plant materials and phenotyping
A biparental population was constructed using elite inbred lines Zheng58 and PH4CV. Zheng58 is the female parent of Zhengdan958 and PH4CV is the male parent of Xianyu335. Zhengdan958 and Xianyu335 are popular hybrids in China [21,22]. The F 1 plants were backcrossed to PH4CV to produce the BC 1 F 1 seeds. Each BC 1 F 1 plant was pollinated with bulked pollens collected from at least ten other BC 1 F 1 plants in the summer of 2014 in Shunyi, Beijing. The offsprings of these BC 1 F 1 plants were defined as bulk-BC 1 F 2 . In the summer of 2015 in Shunyi, Beijing, forty-three bulk-BC 1 F 2 families were sown, with each family in a one-row plot. Three plants in each row were self-pollinated to produce the BC 1 F 3 seeds. In the winter of 2015 in Sanya, Hainan, 481 BC 1 F 3 plants from three BC 1 F 2 ears sown and self-pollinated to produce 481 BC 1 F 3:4 families. The flowchart for the construction of materials used in this study was demonstrated in S1 Fig. The BC 1 F 3:4 families were sown in Shunyi, Beijing, and Changji, Xinjiang, in the summer of 2016 and 2017, the four environments were identified as 16BJ, 17BJ, 16XJ and 17XJ, respectively. In each environment, the BC 1 F 3:4 families were planted in a randomized complete design with two replicates. Within each replicate, each family was sown in a one-row plot. The row space was 50cm and the distance between two neighboring plants was 25cm. DA was recorded when 50% of the plants in each plot reached anthesis. The phenotypic data of DA are included in S1 File.

Phenotype data analysis
The best linear unbiased estimates (BLUEs) of the 481 BC 1 F 3:4 families were estimated following the model: where y ijm is the phenotype of the i th (i = 1,2 . . .,481) genotype in the j th (j = 1,2,3,4) environment, the m th (m = 1,2) replicate effect was nested in each environment. μ is the overall mean, g i is the genotype effect, e j is the environmental effect, ge ij is the G × E effect, δ (j)m is the replicate effect, and ε ijm~N (0, s 2 ε ) is the error term. N stands for normal distribution. To compute BLUEs, g i was treated as a fixed effect, and the other effects were treated as random effects with each random effect following a specific normal distribution. The model was fitted using the R package lme4 [23].
To calculate broad-sense heritability on an entry-mean basis (H 2 ), all variables were treated as random effects to estimate their variances using the above model, which was fitted using R package lme4 [23]. The variances of genotype, G × E and error term were identified as s 2 g ; s 2 ge , and s 2 ε , respectively. The formula for calculating H 2 is [24]: where N e is the number of environments, and r is the number of replicates.

Genotyping and data preprocessing
Fresh leaf tissues of the 481 BC 1 F 3 plants were collected and DNA of each plant was extracted using a cetyltrimethyl ammonium bromide method [25]. DNA samples were sent to Capital-Bio Corporation for DNA chip assay, which included 55,000 SNP loci covering the whole genome [26]. The physical position of the SNP markers was based on the B73 RefGen_V3 sequence. SNPs with a calling rate larger than 97% were used. The genotyping data were filtered by removing SNPs with missing data in any parent, SNPs that were non-polymorphic between parents, and SNPs with a missing rate larger than 0.05. Missing markers were imputed with the expected values calculated from estimates of allele frequencies [10], the processed genotypic data were included in S2 File.

GWAS and selection of large-effect SNPs
GWAS was performed using the R package sommer [27] following the model: where y � is an N×1 matrix of the BLUEs, β is a vector of fixed effects, g is the genetic effect, and is treated as a random effect with normal distribution g � Nð0; Ks 2 u Þ; τ is the additive marker effects, ε is the residual and follows the normal distribution ε � Nð0; Is 2 ε Þ. X, Z, and W are the corresponding design matrixes. K was estimated using the A.mat function in the R package rrBLUP with the following formula [5]: for the j th marker, p j and q j are the allele frequencies of "A" and "a", respectively. SNP markers were coded as -1, 0, 1 for the genotypes "aa", "Aa", and "AA", where "aa", "Aa", and "AA" were homozygous Zheng58, heterozygous and homozygous PH4CV alleles, respectively. W was computed by subtracting P from M as suggested by VanRaden, where the i th column of P is 2(p i − 0.5), M is the genotype matrix, and p i is the minor allele frequency of locus i [28]. Considering that flowering time was controlled by a small number of QTL in most biparental populations [2], we selected the top 50 SNPs with the largest -log 10 (P) value to find the SNPs with the largest effects. The 50 SNPS were fitted in a multiple linear model, from which SS reg and SS tol for each SNP were computed. Here, SS reg is the sum of square of each selected SNP, SS tol is the sum of square of the linear model. Phenotypic variance explained (PVE) of each SNP was calculated by dividing SS reg into SS tol [29].

The effect of fitting large-effect SNPs as fixed effects on the PAs of GS models
The BLUEs were used to test how many large-effect SNPs should be used as fixed effects. PA, calculated as the correlation coefficient between predicted and observed phenotypic data, was obtained by running 100 five-fold cross validations (CVs). The linear mixed model was as follows [30]: where y is the BLUEs, β is a matrix containing the fixed effects, u is the genetic effect treated as a random effect with u � Nð0; Ks 2 u Þ, ε is the error term with the distribution ε � Nð0; Is 2 ε Þ. s 2 u and s 2 ε are the genetic and error variances, respectively. The additive relationship matrix K was calculated according to a previous report [31]. X and Z are the corresponding design matrixes. The above model was fitted using R package BGLR, Gaussian processes (RKHS) model was used for estimating the variances of random effects. The number of iterations and burn-in were set to 20,000 and 5,000, respectively [10].
When the top large-effect SNPs were fitted as fixed effects, β included the intercept and the effects of the large-effect SNPs, the genotypic data of the large-effect SNPs were added as columns of the X matrix. Meanwhile, the top SNPs were removed from overall markers when calculating K matrix [32]. Two-tailed student's t-test analysis was used to test whether fitting one more SNP as fixed effect could increase PA by comparing the 100 PAs calculated by fitting top n SNPs with the 100 PAs calculated by fitting top n-1 SNPs (n≧1).
The above t-test analysis revealed that fitting the top four large-effect SNPs as fixed effects was optimal. To test the effect of adding the four large-effect SNPs on PA, four randomlyselected markers were chosen as fixed effects and PA was calculated correspondingly. This process was repeated for 200 times, then the 200 PAs were compared with the PA calculated using the four large-effect SNPs as fixed effects.
To calculate the PA of MAS using the top four large-effect SNPs, the four SNPs was fitted in a multiple regression model using the lm function in R. The phenotype was estimated using the predict function [33]. The PAs were calculated using 100 CVs. In order to prove the effect of MAS using the top four SNPs, we also calculate the PAs of MAS using four randomlyselected SNP.

GS using three environment models, with and without large-effect SNPs fitted as fixed effects
(1) Single environment (SE) model. The SE model can be expressed as: where y i is a vector of phenotypic data in the i th environment, μ i is the overall mean, β i is a vector of the marker effect, X is the genotype matrix, and ε i is the residual.
(2) A-E model. In this model, the marker effect of each SNP in all environments is assumed to be constant, and supposing that we have n environments [16,34,35], the model is:   where y i is the phenotype in the i th (1, 2, . . ., n) environment, μ i is the overall mean in the i th environment, X i is the genotype matrix, and ε i is the residual error.
(3) G × E model. In the G × E model, y i and μ i were the same as those in the A-E model, marker effect β was decomposed into two parts, a constant main effect β 0 and the environment-specific effect β i . The mixed linear model is:     ; The three environment models were analyzed in the R package BGLR [10]. The code for implementing A-E and G × E was revised from a previous report [36].

Cross-validation strategies
The variance components were estimated by fitting the full data set to each of the three models (the SE, A-E, and G × E models). The full data were scaled to standard normal distribution with mean and variance set to zero and one, respectively. In all cases, the number of iterations and burn-in were set to 20,000 and 5,000, respectively.
In the SE analysis, prediction accuracy was calculated using 100 five-fold CVs.
In the multiple environments GS models (the A-E and G × E models), two different CV schemes (CV1 and CV2, S1 Table) were used according to different breeding practices [16,35,37]. Briefly, CV1 was designed to predict the performance of newly-developed or untested lines that were not evaluated in any environment. CV2 was designed to predict the phenotype of some materials that was missing or not evaluated in some environments. Because one pair of environments was used to perform multi-environments GS each time, and the number of families in the two environments was different, the CV was performed based on the minimum number of families evaluated in the pair of environments.
PA was calculated as the correlation coefficient between the predicted and observed phenotype for either of the three models.

Phenotypic data analysis
DA of the 481 BC 1 F 3:4 families were evaluated in four environments over two years (16BJ, 17BJ, 16XJ, and 17XJ), the families flowered earlier in 17BJ (Table 1, Fig 1A and 1B). The correlation coefficients between each pair of environments varied from 0.48 to 0.63, suggesting that DA shared a common genetic basis across all environments (Fig 1A). The heritability estimated across multiple environments and the coefficients of variance proved the stability of DA (Table 1).

Genotypic data analysis
In total, 11,781 polymorphic SNP markers were obtained after filtering, these markers distributed across the whole genome with a sufficiently high density for GS analysis (Fig 2A). Genotypic analysis of the 481 BC 1 F 3 plants revealed that the backgrounds of most plants were the homozygous PH4CV genotype, which covered 65.4% of the genome on average. The average coverages of homozygous Zheng58 and heterozygous genotypes were 16.0% and 18.6%, respectively ( Fig 2B; S2 Fig; S2 Table). Zheng58 alleles were present across the whole genome, although it was the donor parent (Fig 2B), suggesting that the BC 1 F 3 population was segregating across the whole genome.

GWAS and mutiple linear regression analysis identified large-effect SNPs
GWAS was used to identify the genetic basis of DA, QQ plot revealed that the GWAS model was well-fitted in the population under study. Manhattan plot revealed that the highest peak was on chromosome 2, followed by chromosome 9 (Fig 3A and 3B). To identify the loci with large effects, the top 50 SNPs with the largest -log 10 (P) values were selected and fitted using a multiple linear regression model, then PVE of each SNP was calculated. Chr3_159867173, an SNP on chromosome 3, had the largest PVE of 11.88%, followed by Chr2_56238969, Chr9_154782803 and Chr3_23119818, explaining 7.52%, 4.81% and 4.59% of total phenotypic variance, respectively (Fig 3C).

The PA of GS model fitting the top four large-effect SNPs as fixed effects outperform the other models
The BLUEs were used to determine how many large-effect SNPs should be fitted as fixed effects. The PAs of GS models increased with the increase of the number of top SNPs fitted as fixed effects. Student's t-test analysis revealed that the increase of PA was not significant when the number large-effect SNPs increased from four to five ( Fig 3D). Therefore, fitting the top Fitting large-effect SNPs as fixed effects and G × E enhances genomic prediction four SNPs with the largest effects was optimal. To further demonstrate that the PA increase didn't happen by chance, four randomly-selected SNPs were fitted as fixed effects using the GBLUP model. The PA calculated by fitting the four large-effect SNPs as fixed effects was larger than each of the 200 PAs of GS models fitting four randomly-selected SNPs as fixed effects (Fig 3E). By comparing the PA of GS models with the PA of MAS, we found that the PA of MAS using the top four SNPs was lower than the PA of GS models. The PA of MAS using the top four SNPs was higher than that of MAS using four randomly-selected SNPs (Fig 3F).
The above analysis revealed that the four large-effect SNPs should represent real QTL and could be fitted as fixed effects in the following analysis.

Fitting the four large-effect SNPs as fixed effects generally decreased genetic variances and increased PA
To investigate the effects of fitting the four large-effect SNPs as fixed effects, the variance components were dissected using the full data. For the SE model, the genetic variances of the GS models decreased when the four large-effect SNPs were fitted as fixed effects (Table 2; S3 Fig).
For the A-E and G × E models, the most evident differences were the decreases of the genetic variances when the four large-effect SNPs were fitted as fixed effects. For each of the two G × E interaction variances (s 2 u 1 and s 2 u 2 in Table 2), no constant differences were found when the four SNPs were fitted as fixed effects ( Table 2). The analysis demonstrated that fitting the four large-effect SNPs as fixed effects would generally decrease the genetic variances, and that the four loci had constant effects across the four environments.
The PAs of the three models (the SE, A-E, and G × E models) was calculated to demonstrate the effect of fitting the four large-effect SNPs as fixed effects. For the SE model, it was demonstrated that the PA increased when the four large-effect SNPs were fitted as fixed effects for each environment (S3 Fig). We also found that fitting the top four SNP identified in each environment as fixed effects could also increase PA (S4 Fig). For the multi-environment GS model including A-E model and G × E model, two cross-validation (CV) schemes, named as CV1 and CV2, were used (S1 Table). For the A-E model, fixing the four SNPs generally resulted in higher PAs for the CV1 and CV2 schemes (excluding one case in CV1 and three cases in CV2, Table 3). The results were similar for the G × E model. Generally speaking, the results suggested that fitting the four large-effect SNPs as fixed effects was advisable for each of the three models.

The G × E models with the four large-effect SNPs fitted as fixed effects generally had better performance
When comparing the two multi-environment models (the A-E and G × E models) without fitting any SNP as a fixed effect, the G × E models had better performance than the A-E models in ten of the twelve cases for the CV1 scheme, and in eight of the twelve cases for the CV2 scheme (Table 3). When comparing the two multi-environment models with the four largeeffect SNPs fitted as fixed effects, the G × E models outperformed the A-E models in nine of the twelve cases for the CV1 scheme and in eight of the twelve cases for the CV2 scheme Fitting large-effect SNPs as fixed effects and G × E enhances genomic prediction Fitting large-effect SNPs as fixed effects and G × E enhances genomic prediction (Table 3). The results supported that the PAs of the G × E models were generally larger than those of the A-E models. Because both fitting large-effect SNPs as fixed effects and modeling G × E interaction could increase PA, it was assumed that the best prediction could be achieved using the models including both of the two factors. By looking through each row in Table 3, it could be found that the G × E models fitting the four large-effect SNPs as fixed effects had the highest PAs in ten of the twelve cases for the CV1 scheme and in eight of the twelve cases for the CV2 scheme. Therefore, including the two factors into the GS models should be a powerful strategy for enhancing GS efficiency.

Discussion
With the fast development of genome sequencing technology and the continual decreasing of the genotyping cost, efficient selection is becoming increasingly important for any commercial breeding programme. GS can increase breeding efficiency by making prediction at the seedling stage as soon as DNA of the prediction population was available, thus help breeders to exclude undesired genotypes. GS could also increase breeding efficiency by making the best prediction and selecting the desired plants at the decision-making stage of a GS breeding programme. This study was designed to examine how to make the best use of available genotypic and phenotypic data to make the best prediction by including additional components to the GS models. The analysis in this study suggested that, compared with the use of crude GS models, manipulating existing data using statistical approaches enhanced genomic prediction without increasing any cost.
The finding that using known genetic loci as fixed effects could increase PA highlighted the importance of obtaining and assessing these data [14,38]. There are two general approaches to Table 3. Fitting the four large-effect SNPs as fixed effects and a G × E component generally enhanced genomic prediction. Fitting large-effect SNPs as fixed effects and G × E enhances genomic prediction obtaining these data: summarizing the chromosome position of QTL and genes by retrieving published articles; performing QTL analysis using established training population with both genotypic and phenotypic data. It should be noted that the collected historical QTL and gene information might not be useful if not validated in the breeding population. However, even when QTL and gene information are validated in the training population, this information should be carefully examined. QTL of a specific trait can be influenced by heritability and genetic architecture, the target trait might be controlled by one or two major QTL, or by many minor-effect QTL. It was demonstrated using simulation data that the selection efficiency increased with the increase of heritability for a given genetic architecture where only one locus with a major effect was fitted as a fixed effect. The increase in prediction accuracy was negligible when the effect of a locus fitted as fixed effect was 5% [13]. However, in this study, we proved using real data that the increase in prediction accuracy was significant even the effects of three loci were less than 10% (Fig 3D), which might be related to the relatively high heritability of DA in this study. Our analysis showed that fitting large-effect SNPs as fixed effects enhance GS. The prerequisite is that each of the large-effect SNPs should be in linkage disequilibrium with a real QTL. The four SNPs were in the chromosome regions of maize bin 3.05, 2.04, 9.07 and 3.04 according to ISU Integrated IBM 2009 (https://www.maizegdb.org/data_center/map). These regions contained consensus QTL according to QTL meta-analysis [2,39], suggesting that the four SNPs detected in this study should represent real QTL.

Pairs of environments
Marker effects estimated using mixed models might not reflect the real genetic effects, because the genetic variance of each SNP was assumed to follow some prior distribution, and modeling of this prior distribution might affect the estimation of marker effects, especially for large-effect SNPs [10,13,40]. When markers with large effects were fitted as fixed effects, only the effects of the remaining SNP markers should be estimated, strong shrinkage could be avoided in estimating the effects of large-effect SNPs when solving GS models [5,10]. Thus, the GEBVs can be estimated accurately when major genes are fitted as fixed effects.
In maize breeding programmes, a frequently-used strategy is to select lines from the advanced generation formed by crossing two elite inbred lines. However, GS studies modeling large-effect SNPs as fixed effects and G × E interaction effects using this kind of breeding population are relatively few. Therefore, it is necessary to examine how the PAs of GS models would change when the two factors are included in the GS models using a maize biparental population. Our study was conducted to address this concern, and we ultimately proved that fitting large-effect SNPs as fixed effects in the GS models would increase PA in a maize biparental population, even the effects of some SNPs were less than 5%. Furthermore, GS models fitting large-effect SNPs as fixed effects and G × E effects generally had the best performance. Our results should be useful for molecular crop breeding.

Conclusion
GWAS and multiple linear regression analysis was successfully applied to identify large-effect SNPs. Using the BLUEs, it was demonstrated that fitting the four large-effect SNPs as fixed effects increased PA and decrease genetic variance. We further demonstrated that combining G × E interaction and fitting large-effect SNPs as fixed effects could generally increase PA.
Supporting information S1 File. Phenotypic data. DA of the BC 1 F 3:4 population composed of 481 families was evaluated in four environments, including 16BJ, 16XJ, 17BJ, and 17XJ.

(ZIP)
Fitting large-effect SNPs as fixed effects and G × E enhances genomic prediction S2 File. Genotypic data. In the data file, 1, 0, and -1 represented homozygous PH4CV, heterozygous, and homozygous Zheng58 genotypes, respectively. (ZIP) S1 Fig. Flowchart for  The genetic variance (a) and residual variance (b) for each environment were dissected with or without the four large-effect SNPs as fixed effects. PA was also calculated for each environment with or without the four large-effect SNPs as fixed effects (c). Fixed and Random indicated fitting the four large-effect SNPs as fixed effects and four randomly-selected SNP as fixed effects, respectively. ��� on top of gray column indicated significantly different from its left column at P < 0.001 level. P value were determined by two-tailed Student's t-test. (TIF) S4 Fig. Fitting the four SNPs identified in each environment as fixed effects could increase PA. GWAS was performed using the phenotypic data in each environment and the top four SNPs were identified accordingly. For each environment, PA was calculated with or without the top four SNPs as fixed effects. Fixed and Random indicated fitting the top four SNPs as fixed effects and four randomly-selected SNPs as fixed effects, respectively. ��� indicated significantly different at P < 0.001 level. P value were determined by two-tailed Student's t-test. (TIF) S1 Table. The two cross validation schemes adopted to test the PA of AE and G � E GS models. CV1 and CV2 mean two cross validation schemes, Env1, Env2, . . .. . ., Envn means there are n environments, NA means phenotype was not evaluated in the specific environment, N means there are N lines. Fitting large-effect SNPs as fixed effects and G × E enhances genomic prediction