Epistatic Association Mapping in Homozygous Crop Cultivars

The genetic dissection of complex traits plays a crucial role in crop breeding. However, genetic analysis and crop breeding have heretofore been performed separately. In this study, we designed a new approach that integrates epistatic association analysis in crop cultivars with breeding by design. First, we proposed an epistatic association mapping (EAM) approach in homozygous crop cultivars. The phenotypic values of complex traits, along with molecular marker information, were used to perform EAM. In our EAM, all the main-effect quantitative trait loci (QTLs), environmental effects, QTL-by-environment interactions and QTL-by-QTL interactions were included in a full model and estimated by empirical Bayes approach. A series of Monte Carlo simulations was performed to confirm the reliability of the new method. Next, the information from all detected QTLs was used to mine novel alleles for each locus and to design elite cross combination. Finally, the new approach was adopted to dissect the genetic basis of seed length in 215 soybean cultivars obtained, by stratified random sampling, from 6 geographic ecotypes in China. As a result, 19 main-effect QTLs and 3 epistatic QTLs were identified, more than 10 novel alleles were mined and 3 elite parental combinations, such as Daqingdou and Zhengzhou790034, were predicted.


Introduction
Germplasm resources play crucial roles in genetics, evolution and breeding, by forming the physical foundation of the study of genetic diversity [1]- [3], fueling much evolutionary research [4]- [6] and providing the raw material for breeders to produce new cultivars or to further improve the existing ones, due to the existence of many valuable genes in genetic resources [7]- [9]. The identification of valuable genes and markers associated with traits of interest will greatly increase the efficiency of plant breeding programs. However, these beneficial genes are largely unexplored due to the lack of appropriate statistical techniques. Meanwhile, as the complexity of the trait increase, breeding problems increase, for example, favorable alleles in exotic genetic resources are in unadapated genetic backgrounds and linked to other unfavorable alleles. This means that methods to utilize these favorable alleles in crop breeding also need to be further addressed. Accordingly, there is a critical need for in-depth study of methodologies for mining elite alleles in germplasm resources and for the utilization of these elite alleles in crop breeding.
During the past several decades, many attempts have been made to mine elite alleles for objective traits of interest. In early studies, many genes for qualitative traits in crop breeding were studied with morphological and biochemical approaches [10]- [13], and those for complex diseases in human genetics were identified by both sibling pair analysis [14]- [18] and pedigree analysis [19]- [21]. The introduction of molecular markers has facilitated the genetic association analysis of complex diseases in humans, animals and plants. Single-marker association analysis [22] and, later, genomewide association study (GWAS) have been widely used in human genetics [23]. There has been substantial research of two aspects of GWAS: population structure [24]- [29] and mixed genetic models [30]- [32]. However, only one QTL was analyzed at a time in the above models. Likewise, although epistasis association analysis has been utilized in human genetics [33]- [37], all of the main genetic effects and gene interaction effects have not been simultaneously included in one genetic model. A full genetic model, including all the main and epistatic effects, could improve the power of QTL detection [38]- [41]. Several parameter estimation approaches such as LASSO [41], [42], empirical Bayes [43], and penalized maximum likelihood [38], [40] make this full genetic model possible. Therefore, epistasis association analysis with a full genetic model is feasible in crop germplasm resources.
In the past, most crop breeding methods were based on selection for observable phenotypes and breeding efficiency without markers is simply a function of heritability and choice of parental material. To date molecular markers have improved efficiency of selection largely for traits under simple genetic control and in specific conditions where marker selection is easier/cheaper than phenotypic selection [44]- [50]. However, this approach is only feasible for the improvement of one or several independent genes. If there are interactions among the objective genes, breeding strategy must be addressed by the incorporation of the epistasis [51], [52]. Carlborg and Haley [53] showed that epistasis is a common response to selection in breeding programs. Therefore, genetic interaction should be considered in crop breeding strategies.
One purpose of the genetic analysis of quantitative traits is to design a suitable breeding strategy, called breeding by design [54]. However, genetic analysis and crop breeding have traditionally been performed separately; for example, most genetic analyses exclusively use biparental crosses, but these are rarely used alone in commercial breeding. Therefore, the results of these biparental cross experiments have limited roles in breeding practice [55]- [57]. However, direct mapping of QTLs in natural populations, such as crop cultivars, is both economical and practical because the population being mapped is readily available, and the identified QTLs are directly applicable [31].
The purpose of this study was to develop an epistatic association mapping (EAM) approach in homozygous crop cultivars. We described detailed genetic and statistical models of epistasis association analysis in crop cultivars. All the parameters were estimated using the empirical Bayes approach. Our methods were confirmed by real data analysis in soybean and by a series of Monte Carlo simulation experiments.

Phenotypic variation
We measured seed length in 215 soybean cultivars. The minimum, maximum, average, median, standard deviation, coefficient of variation, skewness and kurtosis values were 5. 30, 11.85, 7.94, 7.86, 0.99, 12.43, 0.61 and 0.91, respectively. Results from ANOVA showed that there is significant difference among cultivars (P,10 24 ) and there are no significant differences between years (P = 0.192) and among cultivar 6 year interactions (P = 0.328). This means that in the cultivar population, there is a large amount of genetic variation, which exhibits a continuous normal distribution (Fig. 1).

Epistasis association mapping
Two years of phenotypic observations, along with information on 134 SSR molecular markers, were used to dissect the genetic basis of seed length in soybean. In the full model, 9,180 effects needed to be estimated, 40 times larger than the sample size. We adopted a two-stage method [58]. Nineteen main-effect QTLs and 3 epistatic QTLs for seed length in soybean were detected by EAM ( Table 1). All of these QTLs were nearly evenly distributed along the soybean genome, except for chromosomes H, J and L. Among these QTLs, the proportion of the total phenotypic variance was from 0.25% to 10.44% for main-effect QTLs and from 5.08% to 7.38% for epistatic QTLs, and each of 12 QTLs contributed greater than 5.0% of the variance. In addition, five loci were involved in epistatic interactions, and only one of these five (sat_342) had a significant main effect. This lack of main effects may create difficulties in detecting epistasis with other methods.
To compare the proposed approach with regular genome-wide association study (GWAS), the GWAS was used to analyze the above dataset. Results showed that three main-effect QTL, linked with markers satt382, sat_254 and satt441, respectively, were detected (Fig. 2a) and no significant environmental and epistatic interactions were identified (Fig. 2). These results are similar to those by the proposed approach in two aspects. First, the three main-effect QTLs detected by the GWAS are also identified by the proposed method. Second, no significant environmental interaction is detected by the above two approaches. However, there are some differences as well. The main difference is that the new approach can detect more main-effect and epistatic QTLs than the GWAS.

Mining elite alleles
The allelic effects of the cultivars were evaluated for all the identified loci for soybean seed length. The reduced model that includes the total mean, the population structure, all the identified loci and the residual error was a mixed model equation. In the reduced model, the allelic effects at each locus were estimated by a maximum likelihood approach. If we want to increase the trait value, we should take the allele with the largest positive effect per main-effect QTL as novel allele. If decreasing the trait value is our selection objective, we should take the allele with the largest negative effect per main-effect QTL as novel allele. The same is true for allele combination of epistatic QTL. The summary statistics for novel allele or allele combination are given in Table 2. These results show that there is one novel allele for each maineffect locus or one novel allele combination for each epistatic QTL. For example, for the locus linked to marker satt656, all the allelic effects are showed in Fig. 3, and novel allele is the allele with an effect of 2.63. Similarly, for the interaction between markers sat_342 and AW277661, novel allele combination is the allele combination with an effect of 1.29. The novel allele and allele combination were found in the Zhengzhou 790034 and Guangxibayuehuang cultivars, respectively.

Predictions for elite cross combination
The elite cross combinations could be predicted from all the detected loci and their effects by using the method described below. In a hypothetical cross between two cultivars, all types of RILs would be produced. In these RILs, seed length could be predicted by the combined effects of all the detected loci. The best RIL with maximum seed length in one cross would represent the cross. The best cross with maximum seed length in all the crosses could be selected by comparing all the crosses. In this study, the best three crosses were Daqingdou 6 Zhengzhou790034, Zhenghe-zhibanzi 6 Zhengzhou790034, and Liyangdawuhuangdou 6 Zhengzhou 790034. The presence of Zhengzhou790034 in the three best crosses indicated that it contained the best allele or allele combination.

Monte Carlo simulation studies
Evaluation of the performance of the proposed approach. The first simulation experiment was designed to investigate the effect of QTL heritability on QTL mapping in crop cultivars. The results show that the precision and power of the detection of QTLs increase with increasing QTL heritability, and that the false positive rate (FPR) is only 0.0244% (Table S2).
In the second simulation experiment, we investigated the effect of sample size by randomly sampling 100, 200, or 300 nonfounder lines. The other parameters were the same as those in the first simulation experiment. As expected, the precision and power increased with increasing sample size (Table S3). Sample sizes under 300 yield much better results than those under 200; we recommend a sample size of 300 for future studies.
The third simulation experiment compared the effect of the number of alleles on QTL mapping in crop cultivars. We set the numbers of alleles at 2, 3 and 4; other parameters were the same as those in the first simulation experiment. The results showed that precision and power decrease as the number of alleles increases (Table S4). The results also imply that the SNP or indel markers are better than the other markers.
In the fourth simulation experiment, the effect of allelic frequency on QTL mapping was assessed by setting the frequency ratio of the two alleles as 1:1 (uniform distribution), 1:2 (skewed distribution) or 1:3 (skewed distribution). The other parameters were the same as those in the first simulation experiment. The results showed that skewed distribution decreased the statistical power (Table S5), indicating that rare alleles should be preferentially studied in association analyses.
The detection of QTL-by-environment interaction. To investigate whether environmental effects could be detected, all the cultivars were evaluated in multiple environments. In the fifth  simulation experiment, two environments, ten main-effect QTL and five QTL-by-environment interactions were simulated. The new method holds greater power for detecting QTL-byenvironment interactions than for the main-effect QTL, and the FPR is lower than 0.06% (Table 3). To further demonstrate the performance of the new method, in the sixth simulation experiment, we designed a large genome with high density markers. In total, 510 markers were simulated on ten chromosome segments 1,000 cM long, with an average marker interval of 2 cM. The other parameters were the same as those in the fifth simulation experiment. The same trend in the fifth experiment was obtained ( Table 4), indicating that our method works in large genomes with a high marker density.
The identification of QTL-by-QTL interaction. To demonstrate whether QTL-by-QTL interactions could be detected, all epistatic effects between two main-effect QTLs were included in the full model. In the final simulation experiment, 50 markers were evenly distributed in five linkage groups 450 cM in length. Five main-effect QTLs, 3 QTL-by-environment interactions and 5 QTL-by-QTL interactions were simulated. The results (Table 5) show that the estimates for the positions and variances of simulated QTLs are close to their true values, and the power in the detection of QTL is high (e.g., over 80% for the QTLs with a heritability over 2%), especially for QTL-by-QTL interactions.

Discussion
The approach proposed in this work has several advantages over the approaches of previous association analysis studies. First, main, environmental, QTL-by-environment and QTL-by-QTL interactions were simultaneously considered in our full genetic model, improving the statistical power [38]- [41]. Although multi-locus genetic models have been proposed in plant genetics [59]- [62], they have difficulty combining both QTL-by-environment and QTL-by-QTL interactions. Epistasis association mapping has been developed in human genetics [33]- [37], but here the epistasis was identified by two-dimensional scan, and significant effects in the two-dimensional scan were further tested in one genetic model. Second, epistasis association analysis was first integrated with crop breeding by design. In the past, the results from QTL mapping have had limited utility in breeding practice, due to the use of a simple cross population or the neglect of epistasis in the detection of QTLs. We designed an elite cross combination to take these two issues into account. Third, it is easy to extend the proposed approach to nested association analysis. The commonality is that all the individuals in the mapping populations are inbred lines. The difference is that the pedigree is general for the present study and relatively simple for nested association analysis. Therefore, the new method is suitable for nested association analysis and human genetics. Fourth, the FPR is minimized in the new method. A shrinkage estimation method, empirical Bayes (eBayes), was adopted to estimate all types of effects in the full model so that the FPR was less than 0.06%.
At present the most widely used genome-wide association study (GWAS) is analysis of variance or mixed model approaches with the control of false discovery rate. In theory, it is similar to singlemarker analysis for main-effect QTL and two-marker analysis for epistatic QTL, and the difference is that the GWAS requires the setting of a significance threshold at the genome-wide level. However, it does not overcome the shortcomings of marker   QTLs can not be mapped in the soybean genome-wide association study. Prediction of elite cross combination is based on the assumption that dominance and dominance-type epistasis effects are absent. If the breeding objective is the development of inbred lines or cultivars as often the case in self-pollinated crops, the prediction may be useful. If these non-additive effects are important, then the prediction would not reliable. This issue needs to be addressed in the future. Xu [41] described a linear model in which the dimensions of the genotypic value vector and its incidence matrix depend on the number of genotypes for the locus. In theory, this model matches the situation under study. However, the model dimensions will increase rapidly. Therefore, it is preferable to gather more samples or reduce the number of effects considered [38], [63] to reduce the dimensions of the model. In this study, we designed a special incidence matrix such that there is one variable for each main-effect QTL. Simulation studies show that this approach works well. If the number of markers is large, the number of effects in the model is enormous. In this case, the two-stage method of He and Zhang [58] is recommended. We adopted this approach in our analysis of real data, and the results were consistent with those of He and Zhang [58] and He et al. [64]. The new approach works well if the marker interval length is approximately 5 cM. However, one must delete some closely linked markers if the interval length is less than 5 cM [64].
We compared the QTLs of seed length in soybeans with the QTLs in previous studies. Although few common markers existed between their data and ours, some loci that we detected were also detected in previous studies. Seven QTLs linked to markers sat_342, satt534, satt514, sat_365, sat_254, sat_419 and sat_274 in this study were detected by Xu et al. [65]; four QTLs associated with markers satt411, satt329, satt022 and AW277661 in this paper were identified by Salas et al. [66]; one QTL close to marker sat_256 was confirmed by Li et al. [67]; and one QTL next to marker satt514 was mapped by Liang et al. [68]. The above results further confirmed the feasibility of the approach proposed in this study.

Soybean samples
We recently assembled a soybean association panel with 215 cultivars provided by the National Center for Soybean Improvement, China. All the cultivars were obtained by stratified random sampling from six geographic ecotypes in China [69], planted in three-row plots in a completely randomized design and evaluated at the Jiangpu experimental station at Nanjing Agricultural University in 2008 and 2009. The plots were 1.5 m wide and 2 m long. Five individuals and 20 seeds in the middle row of each plot were randomly picked to measure seed length by digital vernier caliper. The measurements were averaged over 20 seeds, and the mean was used in this study.
Approximately 0.3 g of fresh leaves obtained in 2008 from each cultivar was used to extract genomic DNA using the cetyltrimethylammonium bromide method as described by Lipp et al. [70]. To screen for polymorphisms among all the cultivars, PCR was performed with 134 simple sequence repeat (SSR) primer pairs. The primer sequences were obtained from the soybean database Soybase (http://www.ncbi.nlm.nih.gov). PCR was performed as described by Xu et al. [65].

Population structure
For the soybean data, the STRUCTURE program was used to investigate the population structures of all selected cultivars [26]. The number of subpopulations (K) was set from 2 to 10. In the Markov chain Monte Carlo (MCMC) Bayesian analysis for each K, the length of a Markov chain consisted of 110,000 sweeps. The first 10,000 sweeps (the burn-in period) were deleted, and thereafter, the chain was used to calculate the mean of loglikelihood. This process was repeated 20 times, and the total average for mean log-likelihood at fixed K was used. STRUC-TURE analysis with 134 SSR molecular markers showed that the  log-likelihood increased with the increase of the model parameter K, so a suitable number of K could not be determined. In this situation, using the ad hoc statistic DK, based on the rate of change in the log-probability of data between successive K values, STRUCTURE accurately detected the uppermost hierarchical level of structure [71]. Here, the DK value was much higher for the model parameter K~4 than for other values of K. By combining this high DK value with knowledge of the breeding history of these cultivars, we chose a value of 4 for K. The Q matrix was calculated based on SSR markers and incorporated into the mixed model of epistasis association analysis.

Genetic model
The phenotypic value of a quantitative trait for the ith cultivar in the jth environment (i~1, Á Á Á ,n; j~1, Á Á Á ,R), y ij , may be described by the following mixed model: where c QQ,(m{1)m Þ ' are the corresponding effects; and m is the total average. The first three terms were viewed as fixed effects and the following three terms were considered random effects; therefore, model (1) was rewritten as Y~XbzZcze ð2Þ where X~1 X P X E ð Þ , Z~Z Q Z QE Z QQ ð Þ , b~m,b' P ,b' E À Á ' and c~c' Q ,c' QE ,c' QQ À Á ' .

Parameter estimation
Several methods exist to simultaneously estimate the parameters in model (2); for example, eBayes [41], [43]. Here, we adopted eBayes. Briefly, the parameter vector in model (2) is h~bcs 2 À Á . The priors and the likelihood are not described in detail here. The iteration process is given below.
The fixed effects were calculated by: where V~P m j~1 Z j Z T j s 2 j zIs 2 . Note that there is not an explicit solution for the estimation of s 2 j , and it is updated by maximizing where t~{1:0 and v~0:0005.
The random effects, c j , were predicted by best linear unbiased prediction (BLUP): The posterior variance of c j is The proportion of phenotype variance explained by one random effect may be calculated by Likelihood ratio test The traditional likelihood ratio test (LRT), as described by Zhang and Xu [38], could not be performed in this study, due to an oversaturated epistatic genetic model. We proposed the following two-stage selection process to screen all the effects. In the first stage, all the effects with Dc j . sDw10 {6 are picked up. In the second stage, the full model is modified so that only the effects that passed the first round of selection are included. Due to the smaller dimensionality of the reduced model, we can use the maximum likelihood method to reanalyze the data and perform the LRT. The procedure for the LRT is below.
The overall null hypothesis is no effect of the QTL at the locus of interest, denoted by H 0 : a 1~Á Á Á~a T~0 , where a t is the effect of the tth allele. If we solve the maximum likelihood estimation of the parameters under the restriction of H 0 : a 1~Á Á Á~a T~0 and calculate the log-likelihood value using the solutions with this restriction, we obtain L(ĥ hDH 0 ). We can also evaluate the log-likelihood value of the solutions without restrictions and obtain L(ĥ h). Therefore, the LR test statistic is Other test statistics can be used in similar ways. The significance threshold of the LOD score was set at 2.5 for our real data analysis, whereLOD~LR=4:605.

Genome-wide association study
First, phenotypic values for seed length in 215 soybean cultivars were corrected using population structure obtained by STRUC-TURE software. Then, the corrected phenotypes along with SSR marker information were used to carry out genome-wide association studies for main-effect QTLs, environmental interactions and QTL-by-QTL interactions by ANOVA. Finally, critical values at the 0.05 level of significance were determined by 1000 permutation experiments and thus significant QTL could be identified.

Simulation design
We performed seven simulation experiments in this study. In the first, the simulated pedigree was the maize pedigree described by Zhang et al. [31], [61]. The number of inbred lines within the maize pedigree was 404(n). Of these, n 0 (~103) were base (founder) lines, which were in linkage equilibrium so that the genotypes for markers and QTLs with two alleles could be simulated. Non-founders (n 1 = 301) were bred via repeated selfpollination of a hybrid between two inbred lines. Thus, each nonfounder line represents a recombinant inbred line (RIL) with respect to a known pair of parents. The genotypes of all the nonfounders could be generated from the genotypes of their parents, analogous to simulating the genotypes of RILs from their parents. All of the non-founder lines could be used to detect QTLs. To mimic the actual linkage maps that did not have equally spaced markers, 153 markers were simulated on ten chromosome segments of length ,2258.70 cM, with an average marker interval of 14.86 cM. A total of 20 QTLs, all of which overlapped with the markers, were simulated; the sizes and locations of the QTLs are listed in Table 3. The allelic effects were calculated by relating the genetic variance of the QTL to both the allelic frequencies and the allelic number. The phenotypic value of each line was the sum of the corresponding QTL genotypic values and the residual error, with an assumed normal distribution. Each simulation run consisted of 200 replicates. For each simulated QTL, we counted the samples in which the LOD statistic surpassed 3.0. The ratio of the number of such samples (m) to the total number of replicates (200) represented the empirical power of this QTL. The falsepositive rate was calculated as the ratio of the number of falsepositive effects to the total number of zero effects considered in the full model. The other simulation experiments were performed similarly. All simulated parameters are given in Table S1.