An Efficient Hierarchical Generalized Linear Mixed Model for Mapping QTL of Ordinal Traits in Crop Cultivars

Many important phenotypic traits in plants are ordinal. However, relatively little is known about the methodologies for ordinal trait association studies. In this study, we proposed a hierarchical generalized linear mixed model for mapping quantitative trait locus (QTL) of ordinal traits in crop cultivars. In this model, all the main-effect QTL and QTL-by-environment interaction were treated as random, while population mean, environmental effect and population structure were fixed. In the estimation of parameters, the pseudo data normal approximation of likelihood function and empirical Bayes approach were adopted. A series of Monte Carlo simulation experiments were performed to confirm the reliability of new method. The result showed that new method works well with satisfactory statistical power and precision. The new method was also adopted to dissect the genetic basis of soybean alkaline-salt tolerance in 257 soybean cultivars obtained, by stratified random sampling, from 6 geographic ecotypes in China. As a result, 6 main-effect QTL and 3 QTL-by-environment interactions were identified.


Introduction
Many characters of biological interest and economic importance vary in an ordinal form, i.e. disease and tolerance, but are not inherited in a simple Mendelian fashion. More importantly, they cause substantial yield loss. To decrease the loss, developing resistance cultivar is the most economic and effective way. Therefore, there is a critical need for in-depth study of methodology for mining elite alleles for ordinal traits.
During the past several decades, many attempts have been made to mine elite alleles for binary and ordinal traits. The methodologies of mapping quantitative trait loci (QTL) for discrete traits have been well established within the framework of threshold model. On the early stage, almost all the approaches are based on single QTL genetic model [1][2][3][4][5][6][7]. Later on, several methods have been proposed to simultaneously identify multiple QTL for ordinal traits [8,9]. Recently, Bayesian methodology has been used to map multi-QTL and epistatic QTL for binary and ordinal traits [10][11][12][13][14]. However, all the above approaches are based on bi-parental segregating populations.
Many commercial inbred lines are available in crops. A large amount of elite alleles have preserved among these lines. Mining these elite alleles is the prerequisite in the integration of genetic analysis with crop breeding. Up to now, some approaches for mining elite alleles in crop cultivars have been developed [15][16][17][18][19][20]. All kinds of QTL can be effectively identified, elite alleles can be easily mined and novel parental combination can be effectively predicted [18]. However, these approaches in crop cultivars are for quantitative traits but not for discrete traits. As for discrete traits, too much complication comes from seemingly simple descriptions and unknown population structure meanwhile in fact the underlying biological model may be complicated. Accordingly, genetic analyses may be more challenging for discrete traits than for continuous traits. If pedigree information among these lines is known, Bayesian linkage analysis [21] and variance-components approach [22] have been presented. If the pedigree information is not known, relatively little has been known, except for Iwata et al. [23] and Hoggart et al. [24]. Although Iwata et al. [23] have developed Bayesian multilocus association analysis, the method is implemented via Markov chain Monte Carlo, and computing time becomes a major concern. Although Hoggart et al. [24] proposed simultaneous analysis of all SNPs in genome-wide association study, the method is for case-control dataset.
Multi-QTL mapping for discrete and quantitative traits is now the state-of-the-art method [18][19][20]24,25]. However, it is difficult to implement under the maximum-likelihood framework. At present the Bayesian method implemented via expectationmaximization algorithm [26] is specialized to handle complicated models and thus it is the ideal tool for mapping multiple QTL for ordinal trait in crop cultivars. Accordingly, in this study empirical Bayes approach of Xu [26] and the computational algorithm of Yi et al [27] were incorporated into the hierarchical generalized linear model of Yi et al [12] to map main-effect QTL (M-QTL) and QTL-by-environment (QE) interaction for ordinal traits in crop cultivars. The new method was validated by a series of Monte Carlo simulation experiments and real data analysis in soybean.

Phenotypic variation for soybean alkaline-salt tolerance
We measured lengths of main root (LR) of 257 soybean cultivars under the cases of control (CK), 100 mM NaCl and 10 mM Na2CO3. These original trait observations might be transferred into alkaline tolerance index (ATI) and salt tolerance index (STI). To measure the degree of salt-alkaline tolerance, these indexes were partitioned into five grades: high tolerance, tolerance, middle tolerance, sensitivity, and high sensitivity. In other words, this data is ordinal. The phenotypic distributions were shown in Fig. 1 and Table S1. All the two discrete indexes almost exhibited skewed distribution, indicating the existence of genetic variation. Results from x 2 test showed that there is significant relationship between the tolerance and environment (x 2 = 44.83 and P,1e-4 for ATI, and x 2 = 13.29 and P = 0.004 for STI), indicating the existence of environmental interaction.

Mapping M-QTL and QE interaction for ATI and STI
A total of 6 M-QTL (3 for ATI, and 3 for STI) and 3 QE interactions (one for ATI, and 2 for STI) for soybean alkaline-salt tolerance are detected by new method, and mapped to chromosomes A1, B2, I, L, N and O. Among them, one QTL, associated with marker sat_274, is responsible simultaneously for the above two traits; seven QTL are consistent with those of continuous ATI and STI using enriched compression mixed linear model (ECMLM) [28] and epistatic association mapping (EAM) [18] methods, and the other two were also confirmed by test of independence (x 2 test); and one M-QTL and one QE interaction are associated simultaneously with marker satt270. A summary of all detected QTL is shown in Table 1.
4 ATI QTL, with proportion of phenotypic variance explained by single QTL (PVE) of 3.29-11.04%, are detected and mapped to chromosomes A1, B2 and O. Of these QTL, there are three M-QTL (18.96%) and one QE interaction (11.04%); and three QTL are further identified by ECMLM (or EAM) and x 2 test. It should be noted that the PVE by qAT10-2 and qATI5e, associated respectively with sat_274 and sat_344, are greater than 10%. 5 STI QTL, with PVE of 4.21-9.17%, are detected and mapped to chromosomes I, L, N and O. Of these QTL, there are three M-QTL (21.06%) and two QE interactions (13.48%); and all the QTL, except for qSTI10, are further identified by ECMLM (or EAM) and x 2 test. It should be noted that the PVE of all the QTL are less than 10%.

Mining elite alleles
The summaries of elite allele and its representative carrier are shown in Table 1. As for the qATI14 associated with Sat_342, there are 12 alleles and one unknown allele. The effects for all these alleles can be estimated by maximum likelihood method. Of these effects, the 260 bp allele has the smallest effect 20.73, being an elite allele, which can be found in soybean cultivar Zunyizongzidou. Similarly, as for the qSTI3e associated with satt270, the 223 bp allele shows the smallest effect in 2010, elite allele combination is the 223 bp allele62010 with an effect of 20.90.

Predicting novel parental combination
In a hypothetical cross of two cultivars, all the recombinant inbred lines (RILs) from the cross may be produced. In these RILs, the trait values can be predicted by the effects of all the detected loci. The best RIL with minimum value would represent the cross. Therefore, the best cross could be selected from all the crosses. It was found that any cultivar-pair does not pyramid all the elite alleles of the detected QTL. However, some four-cultivar combinations might pyramid all the elite alleles of salt-alkaline tolerances in this study, for example, the best two combinations were Zunyizongzidou 6 Hunanqiudou 16Ludou 16Qi 588-8, and Zunyizongzidou6Hunanqiudou 16Ludou 26Qi 588-8, which are used to simultaneously improve the two traits.

Prediction for potential candidate genes
The summary of potential candidate genes for alkaline-salt tolerance in soybean is shown in Table 2. A total of 7 soybean genes homologous to Arabidopsis are linked to 7 markers detected in this study, with physical distances of 206.21-129132.42 kb; and one gene (Glyma03g38040) is closely linked to the associated markers (satt022) in this study, within 210 kb in physical distance.

Monte Carlo simulation studies
Comparison of new method with both single-QTL method and test of independence. In the first simulation experiment, each simulated sample was analyzed by three methods. One is multi-QTL-based method in this study (new method), one is to use the new method under the condition of single-QTL model and one is test of independence. All the results are shown in Fig. 2. Among the three methods, the statistical power of the new method is the maximum, and the false positive rate (FPR) of the new method is the minimum. The estimates of QTL effects and threshold values from the new method are closer to the corresponding true values than those from single-QTL method, although all the estimates were slightly biased. Relatively small variations were observed in the new method for the estimates of position and effects of QTL as well as the threshold values. Therefore, the new method works relatively well.
Effect of phenotypic distribution on QTL mapping. In the second simulation experiment, the effect of the shape of phenotypic distribution on the new method was assessed by letting the phenotypic distribution of five ordinal categories be set as 1:1:1:1:1 (uniform distribution), 1:2:4:2:1 (symmetrical distribution) and 8:5:3:1:1 (skewed distribution). Other parameters were the same as those in the first simulation experiment. The results are given in Fig. 3. We found that skewed distribution has decreased the statistical power. The optimal power occurred in the situation where the phenotypic distribution is bell-shaped. Relatively small variations were also observed in the three situations for the estimates of position and effects of QTL as well as the threshold values.
Effect of the number of categories on QTL mapping. In the third simulation experiment, we evaluated the effect of the number of categories on the new method. The design of the simulation was similar to that described in the first simulation experiment, except for the number of phenotypic categories. We simulated three levels for the number of categories: 2, 6 and 9. The corresponding phenotypic distributions were 1:1, 1:3:6:6:3:1 and 1:2:4:6:9:6:4:2:1, respectively. The results are given in Fig. 4, which shows that the estimate of QTL position is very close to its true value in the three cases, and the power for QTL detection increases as the number of categories increases. The reason is that increasing the number of categories has increased the information of predicting the liability from the observed categorical phenotype. In addition, relatively small variations were also observed in the three situations for the estimates of QTL effects and the threshold values.
Effect of sample size on QTL mapping. In the fourth simulation experiment, we assumed the pedigree to have the numbers of non-founders of 100, 200, 300 and 500, and the number of founders of 50. One hundred and one equally spaced markers, each with three alleles, were placed on each of three 1000 cM chromosome segments; and eighteen QTL, each with three alleles, were simulated with heritabilities of 0.01-0.15. Other parameters were given in Table 3. The results of five QTL are shown in Fig. 5. As expected, the QTL detection power increases and the variations for the estimates of QTL parameters and the threshold values decreases as sample size or QTL heritability increases.
Effect of the number of founders on QTL mapping. In the last simulation experiment, we assumed the pedigree to have the number of non-founders of 200, and the numbers of founders of 25, 50 and 75. Other parameters were the same as those in the fourth simulation experiment. The results of five QTL are shown in Fig. 6. As expected, the QTL detection power increases as the founder number increases and relatively small variations and Table 1. Association mapping for ordinal alkaline-salt tolerance in 257 soybean cultivars. biasedness for the estimates of QTL parameters and the threshold values were observed.

Discussion
In this study the probability of y j , Pr y j j. À Á , is viewed as an approximate normal distribution so that empirical Bayes approach could be adopted to estimate genetic effects in the hierarchical generalized linear model for ordinal trait association studies. As a result, M-QTL and QE interaction for ordinal traits in crop cultivars can be identified, elite alleles can be mined and novel parental combinations can be predicted. Clearly, it integrates genetic analyses with crop breeding design. More importantly, the mapping results in this study are reliable because they have been validated in four aspects. First, seven QTL detected by new method are consistent with those by at least one of three approaches: ECMLM, EAM and single marker analysis ( Table 1). Second, a total of 7 potential candidate genes homologous to Arabidopsis are found to be around 7 associated markers ( Table 2). Third, some QTL were simultaneously identified among alkaline-salt tolerance index, original and ordinal traits, for example, Sat_342 and Satt348 were associated with alkaline tolerance, and Satt270 was associated with salt tolerance. Finally, the results from Monte Carlo simulation studies show that new method improves statistical power and precision, and reduces FPR.
The major contribution of this study is the pseudo data normal approximation of the likelihood function for ordinal trait association studies. The normal likelihood approximation was first developed by Wolfinger and O'Connell [29] and continued by Gelman et al [30]. McGilchrist [31] used a different approach for the same problem, but much easier to understand. Although the method has been explored for binary and binomial traits in linkage studies [32], this study is the first report of the pseudo data approximation for ordinal trait association studies.
We compared the new method with that of Lü et al. [18]. There are some commons between the two approaches. For example, the similar effects of phenotypic distribution (the number of categories, sample size and heritability) on QTL mapping in homozygous cultivars are observed. However, the differences exist as well. For example, the trait is quantitative in Lü et al. [18] and ordinal in this study; and the power for the detection of QTL is lower for this study than for Lü et al. [18], because limited information is observed for ordinal traits. As the number of categories increases, it is better to use the normal trait hierarchical linear mixed model. Note that the main benefit of this study comes from small number of categories. Although Iwata et al. [23] and Hoggart et al. [24] are for ordinal traits, in this study main-QTL, environmental effect and QTL-by-environment interactions were simultaneously considered in our full genetic model, improving the statistical power and estimation precision.
As compared with genome-wide association studies in Yu et al. [16] and Zhang et al. [17], kinship matrix was not considered in this study. In fact, this term is related to background control, which is similar to co-variable markers in composite interval mapping. Note that all the main-effect QTL and QTL-byenvironment interactions are included in the full genetic model of this study. Thus, it is unnecessary to consider this term in the current study. In addition, in real data analysis we also consider the effect of population structure on association studies. As a result, a slightly different result is observed while Q matrix is deleted from the above full model. Epistasis, the interaction between QTL, plays an important role in the dissection of genetic architecture for complex traits. To date, several approaches have been developed, including multiple interval mapping, Bayesian approach, and penalized maximum likelihood method. Most of these methods are for quantitative traits in bi-parental segregating populations. In homozygous cultivars, it is relatively difficult. Because of its complexity, it will be investigated separately in a future project.

Soybean samples
257 soybean cultivars used in this study were mainly provided by the National Center for Soybean Improvement, China. All the cultivars were obtained by stratified random sampling from six geographic ecotypes in China, planted in three-row plots in a completely randomized design and evaluated at the Jiangpu experimental station at Nanjing Agricultural University in 2009 and 2010. The plots were 1.5 m wide and 2 m long. Twelve seeds for each cultivar were sown in a 30620615 cm plastic container with the 3.5 cm height sand and then treated with control (CK), 100 mM NaCl and 10 mM Na 2 CO 3 , and each with two replications. They were grown in a growth chamber under white fluorescent light (600 mmol m 22 s 21 ; 14 h light/10 h dark) at 2561uC. Length of main root (LR, centimetre) for healthy seedlings were measured from 5 plants 7 days after sowing. To measure the degree of salt-alkaline tolerance, original trait observations might be transferred into salt-alkaline tolerance index for each trait using the below equations where x CK ,x NaCl and x Na 2 CO 3 stand for phenotypic values in control, saline and alkaline treatments, respectively [33]. The tolerance grades 1 to 5, used in this study, were indicated by 0-20%, 20-40%, 40-60%, 60-80% and 80-100%, respectively. Approximately 0.3 g of fresh leaves obtained in 2008 from each cultivar was used to extract genomic DNA using the cetyltrimethylammonium bromide method as described by Lipp et al. [34]. To screen for polymorphisms among all the cultivars, polymerase chain reaction (PCR) was performed with 135 simple sequence repeat (SSR) primer pairs. The primer sequences were obtained from the soybean database Soybase (http://www.ncbi. nlm.nih.gov). PCR was performed as described by Xu et al. [35] and Wei et al. [36].

Population structure
For the soybean sample, the STRUCTURE software [37] was used to investigate the population structures of all selected cultivars. The number of subpopulations (K) was set from 2 to 10. In the Markov chain Monte Carlo (MCMC) Bayesian analysis for each K, the length of a Markov chain consisted of 110,000 sweeps. The first 10,000 sweeps (the burn-in period) were deleted, and thereafter, the chain was used to calculate the mean of loglikelihood. This process was repeated 20 times, and the total average for mean log-likelihood at fixed K was used. STRUC-TURE analysis with 135 SSR molecular markers showed that the log-likelihood increased with the increase of the model parameter K, so a suitable number of K could not be determined. In this situation, using the ad hoc statistic DK, based on the rate of change in the log-probability of data between successive K values [38], STRUCTURE accurately detected the uppermost hierarchical level of structure. Here, the DK value was much higher for the model parameter K = 4 than for other values of K [33]. By combining this high DK value with knowledge of the breeding history of these cultivars, we chose a value of 4 for K. The Q matrix was calculated based on SSR markers and incorporated into the hierarchical generalized linear mixed model in this study.

Generalized linear mixed Model
Let l j (j~1,2, Á Á Á ,n) be the vector of underlying latent variable or liability of cultivar j. For the jth cultivar, it is postulated that where b~(b 1 , Á Á Á ,b q ) 0 is non-genetic effects, i.e., population mean (b 1 ) and environmental effect (b 2 , Á Á Á ,b q ) 0 ; c k is allelic effect for k~1, Á Á Á ,v and allele-by-environment interaction effect for k~vz1, Á Á Á ,vq, n~P p i~1 g i {1 ð Þ, and g i is the number of alleles for locus i i~1, Á Á Á ,p ð Þ ; x jm and u jk are dummy variables of b m and c k for cultivar j, respectively; and e j is the random residual error with anN 0,s 2 À Á distribution. s 2~1 will be adopted here because the liabilities are unobservable.
Methods of estimating allelic effects and allele-by-environment interaction effects are the same. For the sake of clarity of notation, we redefine the design matrix and the regression coefficients as follows.
Let Y~y j È É n j~1 denote the vector of observed ordinal data. Here each y j represents an assignment into C ordinal categories. These classes result from the hypothetical existence of Cz1 thresholds (t 0~{ ?vt 1 v Á Á Á vt C~z ?) in the latent scale. The relationship between y j and l j is indicated by The conditional probability that y j falls in category c, given b, c and t~t 0 ,t 1 , Á Á Á ,t C ð Þ , is given by where W(.) is the cumulative distribution function of standard normal distribution. The data are conditionally independent, given b, c and t. Therefore, log-likelihood function can be written as where h~t 1 , Á Á Á ,t C ,b 1 , Á Á Á ,b q ,c 1 , Á Á Á ,c vq È É ; and I y j~c is an indicator function taking value of 1 if y j~c and 0 otherwise.

Prior distribution and joint posterior density
The parameters b and c are treated as fixed and random effects, respectively. The number of random effects in the above genetic model is very large so that the model is oversaturated. Therefore, the hierarchical generalized linear mixed model is adopted in this study. It is assumed that each genetic effect c k has a different variance s 2 k . The following prior distributions are chosen for building the hierarchical model where t and a are the constants given in advance. When I y j~c Pr y j~c DX j ,Z j ,t,b,c where w~h,s 2

Parameter estimation
Genetic effect. As shown in Wolfinger and O'Connell [29], Pr y j jX j ,Z j ,h À Á is an approximate normal distribution where w(.) is the probability density function of standard normal distribution. The conditional log-posterior distribution, related to c, is indicated by Using expectation-maximization empirical Bayes approach of Xu [26], the expectation of the quadratic term required in the maximization step is expressed as hyperparameters. According to joint posterior density in equation (6), conditional posterior distribution is Here the mode is used to estimate the corresponding parameter, such as, Non-genetic effect b. Formula for the fixed effect follows the standard procedure of mixed model methodology, we have Thresholds. Using the Newton-Raphson method, the thresholdt are estimated by where t s ð Þ c is the estimate of parametert c at the sth iteration,

Statistical test
A two-stage selection process in Lü et al. [18] was used to conduct likelihood ratio test (LRT) for all the QTL. In the first stage, all the markers were included in the model. If the estimate of an absolutely allelic effect (environmental interaction effect) at the kth locus k~1, Á Á Á ,p ð Þis greater than 10 {4 , the kth locus is picked up. In the second stage, we modified the full model only to contain the effects passing the first round of selection. If doing so, we can use the maximum likelihood method to perform the LRT.
The overall null hypothesis is no effect of the oth QTL (or interacted QTL), denoted by H 0 : c 1~c2~Á Á Á~c T~0 , where c a is the ath effect for the QTL. If we solve the maximum likelihood estimation of the parameters under the restriction of the H 0 and calculate the log-likelihood value using the solutions with this restriction, we obtain L(ĥ h 0 jH 0 ). We can also evaluate the loglikelihood value of the solutions without restrictions and obtain L(ĥ h 0 ).

Simulation design
We performed six simulation experiments in this study. In the first, the simulated pedigree was similar to the maize pedigree described by Zhang et al. [15]. In current pedigree, the numbers of founders and non-founders were 100 and 200, respectively. Of these, founder lines were in linkage equilibrium so that the genotypes for markers and QTL with three alleles could be simulated. In other words, three alleles for each locus were assigned in equal proportions to each founder. Non-founders were bred via repeated self-pollination of a hybrid between two inbred lines. Thus, each non-founder line represents a RIL with respect to a known pair of parents. The genotypes of all the non-founders could be generated from the genotypes of their parents, analogous to simulating the genotypes of RIL from their parents. All of the non-founder lines could be used to detect QTL. Thirty-three equally spaced markers were simulated on three-chromosome segments 300 cM long. A total of 3 QTL, all of which overlapped with the markers, were placed at 50 cM of each chromosome; the QTL size, being the proportion of total phenotypic variance explained by the QTL, is 0.05, 0.10 and 0.15, respectively. The allelic effects were calculated by relating the genetic variance of the QTL to the allelic frequencies and effects. The phenotypic value of each line was the sum of the corresponding QTL genotypic values and the residual error, with an assumed normal distribution. These phenotypic values could be transferred into five ordinal categories with four threshold values: 21.2816, 20.5244, 0.5244 and 1.2816. Therefore, the frequencies of the five ordinal categories occurring in all the inbred lines have a ratio of 1:2:4:2:1. Each simulation run consisted of 100 replicates. For each simulated QTL, we counted the samples in which the LOD statistic surpassed 2.0. The ratio of the number of such samples (m) to the total number of replicates (100) represented the empirical power of this QTL. The FPR was calculated as the ratio of the number of false positive effects to the total number of zero effects considered in the full model. The other simulation experiments were performed similarly. All simulated parameters are given in Table 3.
A SAS program is available from the authors on request.