Interacted QTL Mapping in Partial NCII Design Provides Evidences for Breeding by Design

The utilization of heterosis in rice, maize and rapeseed has revolutionized crop production. Although elite hybrid cultivars are mainly derived from the F1 crosses between two groups of parents, named NCII mating design, little has been known about the methodology of how interacted effects influence quantitative trait performance in the population. To bridge genetic analysis with hybrid breeding, here we integrated an interacted QTL mapping approach with breeding by design in partial NCII mating design. All the potential main and interacted effects were included in one full model. If the number of the effects is huge, bulked segregant analysis were used to test which effects were associated with the trait. All the selected effects were further shrunk by empirical Bayesian, so significant effects could be identified. A series of Monte Carlo simulations was performed to validate the new method. Furthermore, all the significant effects were used to calculate genotypic values of all the missing F1 hybrids, and all these F1 phenotypic or genotypic values were used to predict elite parents and parental combinations. Finally, the new method was adopted to dissect the genetic foundation of oil content in 441 rapeseed parents and 284 F1 hybrids. As a result, 8 main-effect QTL and 37 interacted QTL were found and used to predict 10 elite restorer lines, 10 elite sterile lines and 10 elite parental crosses. Similar results across various methods and in previous studies and a high correlation coefficient (0.76) between the predicted and observed phenotypes validated the proposed method in this study.


Introduction
Genetic mating design plays an important role in crop genetics and breeding. In hybrid breeding, this design was widely used to evaluate general combining ability of parents and specific combining ability of the F 1 between two parents. Therefore, many elite hybrid cultivars were bred and utilized in crop production. In classical quantitative genetics, mating design is one of main components. It provided a lot of information about two-order genetic parameters for quantitative traits. However, only the collective effects of all the polygenes were estimated. The introduction of molecular markers has facilitated the mapping of quantitative trait loci (QTL) in numerous species, and substantial progress has been achieved in bi-parental segregation population but not in the mating design. As we know, North Carolina (NC) design II has been considered to be one of the most powerful mating designs for combining ability and heterosis analyses, and its application has expanded into many crops. Therefore, there is a critical need for in-depth study of the methodology for mapping QTL in this design.
During the past several decades, many attempts have been made to detect QTL for quantitative traits in bi-parental segregation populations. For example, single marker analysis [1], twomarker analysis [2], interval mapping [3], composite interval mapping [4,5], multiple interval mapping [6], Bayesian method [7,8], and Bayesian-based likelihood approach [9][10][11]. However, this bi-parental segregation population is rarely used alone in commercial breeding, and therefore the results from these single-cross experiments have limited roles in breeding practice [12]. To overcome this shortcoming, crop cultivar population was used to conduct genomewide association study (GWAS) [13,14], and the main and interacted effects of detected QTL were used to carry out breeding by design [14][15][16]. However, this approach is useful only for inbred line breeding but not for hybrid breeding, because only additive and additive-by-additive (aa) effects [14] but not dominant-related effects [17][18][19][20][21] were estimated in the above GWAS.
Up to now many mating designs have been proposed, such as four-way cross [22], triple testcross (TTC) [23], diallel design [24,25], and NC mating design [26]. Current studies on the topic focus mainly on three aspects. The first is the classical genetic analysis [27][28][29], for example, combining ability analysis [30]. These analyses dissected the genetic foundation of quantitative traits, which provided useful information for crop breeding. However, the position and effect of individual QTL are unclear. To answer this question, the QTL mapping approach is available, which is the second main area of the work on the mating design. In four-way cross, He et al. [12] extended the main-effect QTL mapping of Xu [31] into the interacted QTL mapping. In TTC, Z 1 , Z 2 and Z 3 were used to detect augmented additive, augmented dominant and dominance-by-additive (da) interaction effects, respectively, in Kusterer et al. [32] and Melchinger et al. [33], and to unbiasedly estimate all the main and epistatic effects in He et al. [34]. In the NCIII, pair mean Z 1 and pair difference Z 2 were used to detect augmented additive effect and augmented dominant effect in Melchinger et al. [35], and epistasis in Garcia et al. [36] and He et al. [37]. In addition, Rebaϊ & Goffinet [38] and Lenarcic et al. [39] developed a general regression-based method and Bayesian approach, respectively, for QTL detection in diallel design; Li et al. [40] and Wang et al. [41] proposed analysis of variance approach for the detection of main and interacted QTL of quantitative and endosperm traits in NCIII and TTC, respectively; and Reif et al. [42] used TTC with near-isogenic lines as base population to detect epistasis. However, relatively little has been known about NCII mating design. Recently, different base populations along with several testers were used to dissect the genetic foundation of heterosis using QTL mapping of GCA and SCA, such as BC 1 F 8 [43], RIL [44] and introgression lines [45,46]. More recently, the comparison across different base populations was conducted as well [47]. However, these studies are based on main-effect QTL model. Finally, the mating design has been adopted in crop breeding, for example, four-way cross in maize breeding, and NCII mating design in rice and maize hybrid breeding. However, genetic analysis and crop breeding have been often performed separately.
In hybrid breeding, one group of parents is crossed with another group of parents, namely NCII mating design, in order to select elite hybrid cross. However, partial (or unbalanced) crosses are often conducted in breeding practice. In this study, partial F 1 hybrids along with their parents, a partial NCII mating design, were used to conduct genetic analysis. Here all the main and interacted effects were included in one full model. Two-dimension interacted effects between each pair of QTL were considered as interacted term. If the number of effects in the model was 10 times more than sample size, two groups of extreme individuals were used to test which effects were related to the trait. All the selected effects were further shrunk by empirical Bayesian, so significant effects could be identified. A series of Monte Carlo simulations was performed to validate the new method. The validated approach was used to dissect the genetic basis of oil content in rapeseed in partial NCII mating design. Based on the above information, novel parents and cross combinations would be predicted.

Effect of QTL heritability on mapping QTL
In the first simulation experiment, the effect of QTL heritability on QTL mapping in the NCII population was evaluated by letting QTL heritability be set as 0.02, 0.05 and 0.08. Note that the number of effects in the full model is 84,050, which is 116 times more than sample size. At this situation, high throughput QTL-effect screening approach described in this study was adopted. The results are shown in Fig. 1 and Supporting Information S1 Table. A general trend was found. In other words, the power of QTL detection increases as QTL heritability increases. Relatively small estimates for the average and standard deviation of absolute bias between estimated and true effects as well as false positive rate (FPR) were observed in the above three situations. In the same QTL heritability, the power is higher for main-effect QTL than for interacted QTL. In addition, the lowest power in the detection of the dominant-by-dominant (dd) interaction is observed in all the situations. The reason may be due to the low proportion of heterozygous genotypes in the mapping population.

Effect of sample size on mapping QTL
In the second simulation experiment, we explored the effect of sample size on QTL mapping by letting sample size be set as 400, 500 and 600. Note that the proportion of the paternal lines, maternal lines and their hybrids was set at 1:1:2. The others were the same as those in the first simulation experiment. The results are shown in Fig. 2 and Supporting Information S2 Table. A general trend was also found, for example, the power of QTL detection increases as sample size increases. Relatively small estimates for the average and standard deviation of absolute bias between estimated and true effects as well as the FPR were found in the above three situations.

Effect of population structure on mapping QTL
In the third simulation experiment, we investigated the effect of population structure on QTL mapping. The results are shown in Fig. 3 and Supporting Information S3 Table. If all the parents were viewed as mapping population, only additive and aa effects could be detected. It is reasonable. This is because only homozygous genotypes were included in the mapping population. If all the F 1 hybrids were viewed as mapping population, additive and dominant effects could be identified with satisfactory powers. Although all kinds of interacted effects could be detected, their powers were not high, with the highest power for aa effect and with the lowest power for dd effect. Meanwhile, the average and standard deviation of absolute bias between estimated and true effects were larger for dominant-related effects than for additive and aa effects. If parents and F 1 hybrid were mixed with the same proportion, their powers in the detection of aa, additive-by-dominant (ad) and da effects were significantly higher than those in the only F 1 hybrids.

Mapping QTL for oil content in Brassica napus
One Brassica napus breeding dataset from Professor Jinxing Tu at Huazhong Agricultural University was used for the further demonstration. The dataset was collected from a partial NCII breeding design that contained 298 sterile lines, 143 restorer lines and their 284 F 1 hybrids. The phenotype analyzed was oil content. A total of 205 markers were used in the analysis. The total number of effects included in the genetic model is 84,050. Therefore, high throughput QTL-effect screening approach described in this study was adopted.
In the first step, 72 parental lines or F 1 hybrids with the highest oil content and 72 parental lines or F 1 hybrids with the lowest oil content were selected from the mapping population. Using the two groups of individuals, χ 2 test of independence was carried out to identify whether one marker under consideration was associated with the trait. In the second step, all the markers associated with the trait were included in the genetic model of the full mapping population. Therefore, association mapping could be conducted. All the results were listed in Table 1. 8 main-effect QTL and 37 interactedQTL were found to be associated with oil content, and explained 17.74% and 42.27% of phenotypic variance, respectively. Of these QTL, the proportion of phenotypic variance explained by each QTL varied from 0.52% to 4.12%, the LOD score varied from 2.13 to 14.04, and most QTL were additive (7 QTL) or aa (25 QTL). A few dominant-related QTL might be due to the low proportion (only 17.99%) of heterozygous genotypes in the mapping population. Correlation coefficient of 0.76 between the estimated genotypic value and the observation supported the proposed approach in this study.
To further confirm the above results, three other approaches, including EBLASSO [11], genome-wide efficient mixed-model association study [48] and regression-based association study [49], were used to re-analyze this dataset. All the results were showed in Table 1. Among all the main-effect QTL, three were identified simultaneously by all the four methods and seven were detected by at least three approaches. Among all the interacted QTL, one was detected simultaneously by all the four methods, eight were identified by at least two approaches; and 14 were partially confirmed because one same locus and two linked loci associated with these interacted QTL were found.  More importantly, some similar results were observed in previous studies (Table 1). One marker linked to main-effect QTL and 16 markers linked to the interacted QTL in this study were same as those in previous studies, and three markers linked to main-effect QTL and 32 markers linked to the interacted QTL were close to those in previous studies [50][51][52][53][54][55][56][57]. For example, additive QTL around marker BRAS078A was consistent with that in Zhao et al. [50], Wang et al. [51] and Delourme et al. [52].
Using the above mapping results, the genotypic values for all the missing F 1 hybrids could be predicted. These predicted values were further used to calculate general combining ability (GCA) and specific combining ability (SCA). Based on these estimates, novel parents and elite F 1 hybrids could be predicted ( Table 2). Note that novel restorer line R092 could produce elite hybrid crosses: B1341 × R092, B0984 × R092, B0641 × R092, B0857 × R092 and B0066 × R092.

Discussion
There have been several advantages for the current study. First, one interacted QTL mapping approach for quantitative traits in partial NCII mating design had been proposed in this study. Most genetic analyses of previous studies in the designs are combining ability analysis in the polygenic system and little has been known about detecting individual QTL. Then, interaction genetic analysis in this study had been integrated with crop breeding. This overcomes the shortcoming that genetic analysis in bi-parental segregation population has limited roles in breeding practice [12]. Although some similar studies have been reported [13][14][15]58], most previous studies are for inbred line breeding. As for the prediction of hybrid performance, to use multiple parents and their progenies as genetic population is a good strategy. However, a single marker analysis in Schrag et al. [30,59] and progenies of F 2 -derived lines in Windhausen et al. [60] were adopted. Clearly, new development appeared in this study. Third, how to estimate a large of parameters in oversaturated genetic model had been considered. In this study, the number of effects in the genetic model is 116 times than sample size. To solve this issue, bulked segregant analysis (BSA) along with empirical Bayesian method were used to estimate all the parameters. This approach was confirmed to be feasible in both Monte Carlo simulation studies and real data analysis. Although high proportion of phenotypic variation was contributed by interaction terms in real data analysis, main and interacted QTL could be clearly identified across various approaches and various groups, and the QTL detection power was larger for main QTL than for interacted QTL in a series of simulation experiments. Actually, this high proportion phenomenon was also observed in Mackay [61]. Meanwhile, 14 of 37 interacted QTL were frequently identified in a series of cross validation experiments, and these interacted QTL were similar to the commonly interacted QTL across four approaches in this study (Table 1), although prediction accuracy in cross validation needs to be addressed. To improve the accuracy, genome-wide prediction may be available in the future project. Finally, the new approach might be extended into other mating designs, i.e., unbalanced or balanced factorial crosses, NCI design, diallel crosses and recurrent breeding population, although the new method was designed for partial NCII mating design. In recurrent breeding population, there were heterozygous genotypes in the parents so that the F 1 hybrid might be a mixture of multiple genotypes. At this situation, family average idea in the widely-used F 2:3 design [62] was available.
Combining ability analysis has been found to be an effective method in crop breeding. When NCII mating design was completely carried out, it is easy to calculate general combining ability (GCA) and specific combining ability (SCA). At this situation, novel parents and elite hybrid crosses could be easily predicted. These results could be used to direct crop breeding practice. In crop breeding practice, however, partial NCII mating design (unbalanced factorial crosses) is frequently conducted. At this case, how to estimate combining ability is pending, although a mixed linear model approach for phenotypic values was adopted in Schrag et al. [30,59]. In this study, we proposed one method to deal with this issue. That is, the information from the interacted QTL mapping was used to estimate the genotypic values for all the missing F 1 hybrids, so elite parental combination could be found. Furthermore, GCA could be estimated and favorable parents could be predicted [63,64]. If all the effects with non-zero estimates were used to predict the genotypic values for all the missing F 1 hybrids, this is similar to genome-wide selection [65][66][67][68]. The above genetic analysis was for a single trait. In crop breeding, multiple traits would be improved simultaneously. At this case, we needed to pyramid favorable alleles of all the detected QTL for multiple traits. For example, if oil content, thioglycoside, erucidic acid, yield per plant and thousand kernel weight were considered simultaneously in the real data analysis. Of 268 detected QTL for the five traits (data not shown), B1348 and R484 were found to have 199 and 204 favorable alleles, respectively. These two sterile restorer lines might be considered in crop breeding practice.
If the number of effects in a genetic model is much larger than sample size, it is difficult to estimate these effects. Although at present there have been some methods available, for example, Bayesian shrinkage estimation [7], LASSO [69,70], penalized maximum likelihood [9], empirical Bayesian [10], EBLASSO [11] and empirical Bayesian elastic net [71], these methods are widely used in bi-parental segregation populations but not in genetic mating population. In this study, one biological approach, named BSA or DNA pooling, was used to choose effects that are associated with the trait of interest. Although BSA was proposed in bi-parental segregation populations [72], this analysis was also useful in large scale association studies [73,74]. When considering interacted QTL in genetic model, main and interacted QTL could be clearly identified. In this study, therefore, this analysis along with the test of independence makes most effects be excluded from the full genetic model, and the reduced model is estimable. This approach in this study is an alternative way in the parameter estimation of oversaturated genetic model.
In quantitative genetics, epistasis refers to any statistical interaction between genotypes at two (or more) loci [61]. In our study, aa, ad, da and dd interactions are actually statistical terms rather than epistasis, such as defined by Cockerham's model [75], although some mathematical relationships exist [76]. In F 2 population, Kao & Zeng [77] gave a very classic example for mapping epistasis, in which orthogonality between the main effect and epistasis was emphasized, because orthogonality between main effect and epistasis may be important for statistical clarification and interpretation. Note that this orthogonality depends on allelic frequency in the studied population [61]. However, only statistical interaction was considered in this study.

Mapping population and trait evaluation
In NCII mating design, all sterile lines (298) in Brassica napus need to be crossed to all restorer lines (143). However, it is infeasible in breeding practice. Here only 284 F 1 hybrids were conducted at Huazhong Agricultural University (Wuhan, China). In other words, each of 143 restorer lines was crossed to a pair of sterile lines to produce 284 hybrids. Therefore, the mapping population was a partial NCII mating design (unbalanced factorial crosses), including 298 sterile lines, 143 restorer lines and their 284 F 1 hybrids. Seed oil contents for each parent and F 1 hybrid were measured by near infrared reflectance spectroscopy, for technical detail the reader was referred to the original study of Tillmann [78].

Genetic model
Let y i be phenotypic observation of the ith accessions (parent or F 1 ) in the above partial NCII design. The genetic model for y i is expressed as ðx ij x ik ðaaÞ jk þx ij z ik ðadÞ jk þz ij x ik ðdaÞ jk þ z ij z ik ðddÞ jk Þ þ " i ð1Þ where μ is the population mean; m is the number of putative QTL; a j and d j are the additive and dominant effects of the jth QTL (j = 1,ÁÁÁ,m), respectively; x ij and z ij are the dummy variables of the ith individual for a j and d j , respectively; (aa) jk , (ad) jk , (da) jk and (dd) jk are aa, ad, da, and dd interaction effects between the jth and kth QTL (j = 1,ÁÁÁ,m−1;k > j), respectively; and is residual error with a N(0,σ 2 ) distribution.
For the sake of clarity of notation, we redefine the design matrix and the regression coefficients as follows. Let Y = (y 1 , y 2 , ÁÁÁ, y n ) T , β = μ and X = (1, 1, ÁÁÁ, 1) T ; γ is the main and interacted effects, and Z is the dummy variable for γ. The above model is now rewritten as Parameter estimation by empirical Bayesian There are several methods available in the estimation of parameters in model (2), e.g., penalized maximum likelihood [9,82], empirical Bayesian [10], hierarchical generalized linear model [83,84]. Here we adopt empirical Bayesian, for technical detail the reader is referred to the original study of Xu [10]. The method is briefly described here. The parameters β and σ 2 are always included in the model, the uniform prior is assigned to the two parameters: P(β) / 1 and P(σ 2 ) / 1. We adopt the normal prior for each of the genetic effects (γ k ) in model (2): Pðg k Þ / Nð0; s 2 k Þ. The scaled inverse χ 2 prior distribution is further assigned to s 2 k : Pðs 2 k Þ ¼ InvÀw 2 ðs 2 k jt; oÞ / ðs 2 k Þ À tþ2 2 expðÀo=2s 2 k Þ. Clearly, Y in model (2) follows a multivariate normal distribution with mean μ = Xβ and variance-covariance . Therefore, the main steps for parameter estimation are described as below.

Likelihood ratio test
In this study, a two-stage approach was adopted to conduct significance test of each effect. First, empirical Bayesian was used to select the significant effects in model (2). Then, all the selected effects were tested by using likelihood ratio test. For technical detail the reader is referred to the original study of Zhang & Xu [9] and Lü et al. [14]. For simplicity, the critical LOD score for declaring a significant effect at the 0.05 level was set at 2.0.

High throughput QTL-effect screening
A full genetic model should include potential pair-wise interaction effects of all loci. If the number of main and interacted effects is 10 times less than sample size, empirical Bayesian works well. Note that the model is saturated quickly as the number of loci increases. For example, in this study the number of SSR marker loci is 205, the numbers of potential main and interacted effects are 410 and 83,640, respectively. Therefore, a variable selection technique is usually considered to exclude those interactions with negligible effects. The procedures and steps were described as below.
1. Test of independence between effect and trait. First, extreme phenotypic individuals, 10% highest and 10% lowest, were selected from the mapping population. Then, contingency table was constructed based on the extreme individual and marker genotype. Third, χ 2 test of independence was conducted to test whether the targeted marker was associated with the trait. Finally, 100 main effects or 100 epistatic effects, with the minimum P-values, were selected to enter the next step. This method is BSA; 2. Empirical Bayesian estimation and likelihood ratio test. All the main or interacted effects selected in the first step were included in one genetic model and estimated by empirical Bayesian. Among these effects, the effects with non-zero estimates were remained. Likelihood ratio test was used to determine whether these effects were significantly associated with the trait. The critical LOD value was set at 2.0.
3. Correction of trait phenotype using the significantly associated effects. The corrected phenotypes for all the individual were y 0 i ¼ y i À Wb, where b was vector of effects for the significantly associated markers, and W was the designed matrix for b; 4. Repeat the step (1) to (3) until no more additional significantly associated effects were detected; 5. Empirical Bayesian analysis for selected main and epistatic effects. All the significantly main and interacted effects were included one genetic model and estimates by empirical Bayesian.
The software for parameter estimation is available as S1 Software.

Hybrid prediction (HP)
In hybrid breeding, elite parents were predicted from general combining ability (GCA) and elite parental combinations were predicted from specific combining ability (SCA). In NCII mating design, both GCA for each parent and SCA for each parental combination could be calculated. In a partial NCII mating design, however, some F 1 hybrids were missing so that GCA and SCA were not calculated. Using the information of QTL detected above, these missing values could be predicted. Once all the information was obtained, SCA and GCA could be calculated. Therefore, elite parents and parental combinations could be predicted.

Monte Carlo simulation design
We performed three simulation experiments in this study. In the first simulation experiment, the effect of QTL heritability on the new method was assessed. The QTL size (h 2 i ), being the proportion of total phenotypic variance explained by the QTL, was 0.02, 0.05 and 0.08, respectively. In each case, two additive, two dominant, one aa, one ad, one da, and one dd QTL were simulated. The genetic variance of the ith QTL, s 2 iðGÞ , was calculated from where residual variance σ 2 = 1. The allelic effects were calculated by relating s 2 iðGÞ to the allelic frequencies and effects. All the QTL were overlapped with the markers and listed in Table 3. All the genotypes of 441 parents and 284 F 1 hybrids in the partial NCII were exactly same as those in real data analysis in this study. The simulated phenotypic value of each parent or F 1 hybrid was the sum of the corresponding QTL genotypic values and residual error, with an assumed normal distribution. Each simulation run consisted of 100 replicates. For each simulated QTL, we counted the samples in which the LOD statistic surpassed 2.0. The ratio of the number of such samples to the total number of replicates represented the empirical power of this QTL. The FPR was calculated as the ratio of the number of false positive effects to the total number of zero effects. We also calculated absolute bias between estimated and true effects of each QTL in each sample. Therefore, the average and standard deviation of absolute biases across 100 replicates could be obtained. In the second simulation experiment, we evaluated the effect of sample size on the new method by letting the sample size be set as 400 (80 restorer lines + 160 sterile lines + 160 F 1 hybrids), 500 (100 + 200 + 200) and 600 (120 + 240 + 240). All the QTL sizes were 0.05. Other parameters were the same as those in the first simulation experiment. In the third simulation experiment, we investigated the effect of population structure on the new method by letting the population structure be set as all the parents (300 restorer lines + 300 sterile lines), parents + F 1 (150 restorer lines + 150 sterile lines + 300 F 1 hybrids) and all the F 1 (600 F 1 hybrids). Other parameters were the same as those in the second simulation experiment.
Supporting Information S1