Generalized Linear Model for Mapping Discrete Trait Loci Implemented with LASSO Algorithm

Generalized estimating equation (GEE) algorithm under a heterogeneous residual variance model is an extension of the iteratively reweighted least squares (IRLS) method for continuous traits to discrete traits. In contrast to mixture model-based expectation–maximization (EM) algorithm, the GEE algorithm can well detect quantitative trait locus (QTL), especially large effect QTLs located in large marker intervals in the manner of high computing speed. Based on a single QTL model, however, the GEE algorithm has very limited statistical power to detect multiple QTLs because of ignoring other linked QTLs. In this study, the fast least absolute shrinkage and selection operator (LASSO) is derived for generalized linear model (GLM) with all possible link functions. Under a heterogeneous residual variance model, the LASSO for GLM is used to iteratively estimate the non-zero genetic effects of those loci over entire genome. The iteratively reweighted LASSO is therefore extended to mapping QTL for discrete traits, such as ordinal, binary, and Poisson traits. The simulated and real data analyses are conducted to demonstrate the efficiency of the proposed method to simultaneously identify multiple QTLs for binary and Poisson traits as examples.


Introduction
Corresponding to continuous and discrete random variables in statistics, quantitative traits are classified into continuous and discrete traits in quantitative genetics. In contrast to discrete traits, continuous traits especially normally distributed ones are analyzed by taking advantage of the extensively developed inference methods available for linear models. Actually, mapping methods for continuous quantitative traits are developed prior to discrete traits. The earliest QTL mapping for continuous traits can be traced back to the interval mapping developed by Lander and Botstein [1], while the first group of people to map ordinal traits using the EM algorithm is credited to Hackett and Weller [2] and Xu and Atchley [3].
Binary and categorical discrete traits are commonly observed that typically follow binomial and multinomial distributions. A binary trait including only two categories is a special case of categorical or ordinal trait. Also, binomial trait and multinomial trait can be regarded as the derivatives of binary and categorical or ordinal traits, defined by the proportions of the number of events happened among the total number of trials. Traits measured as counts are often called Poisson traits because they are usually modeled by a Poisson distribution. The generalized linear model (GLM) therefore becomes a natural choice for analyzing the discrete traits with the above mentioned distributions [4,5]. Some applications of GLM to mapping QTLs have been conducted for binary traits [3,6,7], ordinal traits [2,8] and Poisson traits [9,10].
The IRLS for the normally distributed traits was extended to analyze binary traits [11][12][13], which greatly improves the computational efficiency with little loss in power. Especially, the EM algorithm within the framework of GLM have been developed to simplify QTL mapping for binary traits and ordinal traits [14,15]. In addition, the GEE approach for GLM has been adopted to comprehensively analyze multiple mixed traits of continuous and discrete trait components [16].
Based on the interval mapping with GLM models, a set of mapping methods [7,[17][18][19] have been developed to simultaneously map multiple QTLs for discrete traits. As an alternative, Bayesian mapping method with reversible-jump MCMC sampling [20] was proposed to infer QTLs for binary traits [7]. Subsequently, Yi et al. [18] applied a stochastic search variable selection method for Bayesian mapping of ordinal traits which remarkably improved the sampling efficiency for model parameters. By fitting a continuous prior distribution on genetic effects, most recently, hierarchical generalized linear models and computationally efficient algorithms have been further developed for genome-wide analysis of QTL for various types of phenotypes in experimental crosses [19]. In Bayesian mapping, only the method with reversible-jump MCMC sampling belongs to fully Bayesian. However, the reversible-jump MCMC sampling is usually subject to poor mixing. Some Bayesian methods use imputed QTL genotypes based on the conditional expectations of the genotype given on the flanking marker information, but ignores the uncertainties of the imputation process. Unfortunately, no efficient and convincing method for statistical inference of QTLs is provided.
As an extension of the IRLS method for continuous traits to discrete traits, GEE algorithm under a heterogeneous residual variance model can well characterize QTLs, especially large QTLs located in large marker intervals, in the manner of high computing speed [21]. Such a method has been developed for the interval mapping of discrete traits. Thus, it has a limited statistical power to identify multiple QTLs, without considering other linked QTLs. The objectives of this study is to derive a penalized generalized linear model with LASSO penalty, and then, to iteratively estimate the non-zero genetic effects of those loci over the entire genome, under a heterogeneous residual variance model. The iteratively reweighted LASSO algorithm [22] is extended to QTL mapping for discrete traits such as ordinal, binary, and Poisson traits.

Generalized linear genetic model
In a QTL mapping analysis, a genetically designed population is required to construct linkage relationship between putative QTL and markers. In such a population, n individuals are observed for phenotypic values and are genotyped for reasonably dense codominant markers. A genetic linkage map can be constructed based on the observed markers or can be obtained from prior studies. Based on the linkage information, QTLs are identified by inferring the significance of genetic effects for loci on or between markers.
If the trait of interest is normally distributed, the effects of these loci on phenotype are generally represented by the following linear model: where y i for i~1,2, Á Á Á ,n is the phenotypic value of the normal trait; b 0 is the population mean; k is the number of putative loci; b j is the genetic effect at the jth locus; x ij is the indicator variable determined by the genotypes of the jth locus. Note that only of the additive genetic effect is considered here for the simplification of description, which is suitable for backcross, double haploid and recombinant inbred lines. In model (1), the residual error e i is assumed to follow a normal distribution with N(0,s 2 ). Hence, the expectation of a normal phenotype has a linear predictor: In many situations, however, a trait in question are measured in discrete form, including binary, binomial, categorical, multinomial and Poisson traits. Their distributions, summarized in Table 1, belong to an exponential distribution family with different link functions. The relationship between the mean of a discrete phenotype (m) and the linear predictor of genetic effects of k loci (g) is formulated by means of a link function, denoted by for i~1,2, Á Á Á ,n. This is the GLM for multiple QTL mapping with discrete traits. In the model, g is the link function. Moreover, the variance of discrete phenotype V (y i ) can be derived from the distribution of each discrete trait (Also shown in Table 1), which is useful for estimating model parameters.

Genetic effect estimation
Theoretically speaking, the reweighted least square method by Wedderburn [5] can be used to estimate the parameters in model (3). But implementation of this method is not straightforward, due to the fact that the number of parameters estimated may be far greater than sample size and the values of indicator variables are missing at loci between markers. According to the least squares method of Haley and Knott [23], the missing values of indicator variables can be simply replaced by its expectation of conditional probability given on flanking marker genotypes. This replacement, however, could result in over dispersion (Xu 1998). From the linear predictor in model (3), the over-dispersion is calculated as where Var(x ij ) and Cov(x ij ,x ij 0 ) depend on genetically designed population, as derived in Liu et al. [22]; w e g is the parameter of dispersion in an exponential family. To adjust for the heterogeneity of over-dispersion [21], the linear predictor is standardized by the over-dispersion parameter, which lead to Instead of g i , the standardized linear predictor g i ' is substituted into model (3). As in the reweighted least square method for a generalized linear model, it is defined that By Taylor expansion, the log-likelihood about model parameters is quadratically approximated by with v i~1 s i and x' ij~1 s i x ij .
With more model parameters than sample size, the unique estimates of genetic effects can not be obtained by minimizing the log-likelihood above. Actually, there are few non-zero and significant genetic effects in model (3), because the number of QTLs for a trait is generally not large. In this case, the LASSO penalized method with a coordinate descent step can efficiently shrink most of genetic effects to zeros by minimizing the following function [24,25]: where l is a tuning parameter which will be chosen by cross validation.
In solving model parameters with LASSO, iterations are required, as response variable j i , independent variables v i and x' ij as well as weighted value w i are all a function of the estimated parameters. Without heterogeneous over-dispersion, the software R/glmnet can be applied to efficiently search for sparse solutions in the oversaturated GLM model [25]. Taking the glmnet as the inner loop, the iterative procedure is implemented in the following steps:

Statistical inference for QTLs
The LASSO for oversaturated GLM can provide estimates of non-zero genetic effects, but cannot do significance test for the estimates. After shrinkage estimation, the number of non-zero effects is generally less than the sample size. Substituting for the glmnet in iterative procedure, therefore, the reweighted least squares method for common GLM can be employed to estimate non-zero genetic effects: The t test statistic is used to infer the significance of the non-zero effects, which is calculated as It needs to be specially noticed that genetic effects re-estimated by reweighted least square method may be biased upward due to high variable selection of LASSO above. Meanwhile, population structure and marker density influence distribution of the t test statistic still. Permutation tests [26] is therefore introduced to adjust the critical value of t test statistic. The loci corresponding to significant genetic Meanwhile, population structure and mark effects are determined as the QTLs for trait of interest.

Simulations
The purpose of simulation is to demonstrate the efficiency of the method proposed here (IRglmnet for short), by comparing it to the GEE (IRGEE for short) algorithm under a heterogeneous residual variance model and unweighted glmnet (UWglmnet for short). Six chromosomes, each of length 100 cM with 11 evenly placed codominant markers are simulated in a backcross population with a sample size of 200 and 400. Total 10 QTLs are simulated on the 6 chromosomes, whose positions and genetic effects are listed in Table 2 and Table 3. Assuming population mean to be zero, the expectation g i is calculated based on the simulated genotypes of the QTLs and the expectation m i of discrete traits based on the link function. Taking binary and Poisson traits as examples, phenotypic values are randomly generated from binomial and Poisson distributions with their known expectations.

Alopecia areata in mouse
To locate QTLs linked with alopecia areata, an F 2 population was generated from crossing the strain of C3H/HeJ and C57BL/ 6J mice [27]. The 138 alopecia areata and 214 clinically normal mice were genotyped at 12 months of age using 211 microsatellite markers. Linkage maps and marker positions were reported on the website www.informatics.jax.org.

Tiller numbers in rice
This is a Poisson dataset for mapping QTL of tiller numbers in rice [28]. A doubled-haploid (DH) population of 123 lines was derived from the cross between two inbred lines, semidwarf IR64 and tall Azucena [29]. Based on this population, a genetic linkage map of 2005 cM long was constructed using 175 genetic markers.
For the 123 DH lines, each containing five plants, tiller numbers were observed every 10 days until all lines had headed. Phenotypic value of each line was obtained by averaging over five plants.

Simulation study
The simulated datasets are analyzed by using IRglmnet, UWglmnet and IRGEE. For convenience to compare the three mapping methods, all test statistics are transformed to -log(p), where p is the probability of greater than the realized statistic values. The critical values of the test statistic are determined through simulating 1000 samples under the null model with zero genetic effects. They are slightly distinguishable among the three mapping methods and two sample sizes (Results not shown). The simulations are replicated 500 times for estimating QTL parameters and assessing the statistical power of QTL detection. Statistical power of QTL detection is counted by each locus as the percentage of the number of those simulations that statistic value exceeds critical value at the locus.
The statistical performances of different scenarios are presented in Table 2 for position estimate comparison, in Table 3 for QTL parameter estimate comparison and in Table 4 for power comparison, As can be seen, the IRglmnet is mostly identical to the UWglmnet and the two glmnet methods are advantageous to IRGEE in terms of power to detect QTL. The IRglmnet methods can accurately estimate QTL genetic effects, while UWglmnet somewhat underestimates and IRGEE can not estimate well QTL effects. Meanwhile, both glmnets are able to detect QTL positions with higher precision than IRGEE. Under the same genetic design and sample size, the two traits analyzed are evidently distinct in QTL parameter estimation and power of QTL detection at the QTLs of small genetic effects. As compared to normally distributed traits in Liu et al. [22], the statistical powers to detect QTL for the two analyzed traits are higher at all the simulated QTLs except for the QTLs of small genetic effects, even with large sample size. This implies that in principle, it is generally difficult to detect QTLs for discrete traits. In addition, the statistical performance of each mapping method increases as sample size and QTL genetic effect increase, as observed in usual QTL mapping.
The advantage of our proposed method lies in the computing efficiency as well. Although the IRglmnet method with cross validation takes more computing time than the UWglmnet, the two glmnet methods run considerably fast, as compared to the IRGEE. On an Intel core 4 PC with a 3.8 GHz processor, the UWglmnet and IRglmnet for binary data consume 10.8 seconds and 24.5 seconds on average, respectively, whereas the IRGEE takes 1.2 minutes under a sample size of 200. For Poisson data, the UWglmnet, IRglmnet and IRGEE run 14.3 seconds, 33.6 seconds and 1.8 minutes, respectively. The difference in computing timeconsuming gets larger as the sample size increases.

Mapping QTL for alopecia areata
In an F 2 population, there are three genotypes at each locus, denoted by QQ, Qq and qq, so that QTL genetic effect can be partitioned into additive and dominance effects. The linear predictor for alopecia areata traits is described as where, m is the population mean, a j is the additive effect at the jth locus, z ij is the indicator variable corresponding to the additive effect, defined as +1 for QQ, 0 for Qq and 21 for qq. d j is the dominance effect at the jth locus, w ij is the indicator variable corresponding to the dominance effect, defined as 0 for homozygote and 1 for heterozygote. With probit link function, the dataset is analyzed by using IRglmnet, UWglmnet and IRGEE methods. The genome-wide critical threshold values for declaring QTL significance are obtained by using 1000 permutation tests. The critical values are distinguishable between the two glmnets methods and IRGEE method, which is marked by horizontal reference line in Figure 1 and Figure 2. The comparative plots in the profiles of -log(p) statistics between IRglmnet and IRGEE methods are depicted in Figure 1 and Figure 2 by the mode of inheritance. The overdispersion parameter for each individual is much closed to 1 in running IRglmnet method, so that the results obtained with the UWglmnet are exactly the same as those obtained with the IRglmnet. The QTLs are generally determined according to the peaks that exceed corresponding critical values. As can be seen from the two Figures, the IRglmnet finds not only those QTLs detected with IRGEE but also more QTLs than those detected with IRGEE. Surprisingly, most of QTLs are located on markers in the genetic map with moderate density. Table 5 tabulates parameter estimates of QTLs detected with three mapping methods. A total of 10 QTLs are identified for alopecia areata, of which, 6 are inherited in the additive mode and 4 in the dominance mode. Interestingly, the QTLs on chromosome 1 and chromosome 8 are completely different in the mode of inheritance between the two glmnet methods and IRGEE method. The proportions of phenotypic variation explained by the detectable QTLs varied from 2% to 49%. The largest heritability Table 5. Estimated QTL parameters obtained with the three mapping methods for alopecia areata in an F 2 mouse population.  (49%) is of the QTL on chromosome 17, which is more than five times of the second highest heritability (9%). The estimates for genetic effects from the IRglmnet are twice more than those from the UWglmnet, but their estimated heritabilies are roughly the same, except for the largest heritability. Meanwhile, the three mapping methods are able to consistently detect major locus on mouse chromosome 17 and minor locus on chromosome 9, as reported in Sundberg et al. [27].

Mapping QTL for tiller numbers
Taking the tiller numbers at the third stage (30 days after transplanting) as an example, the QTLs for the traits are located by using the three competing mapping methods. Figure 3 (upper panel) illustrates that there are two peaks that pass through the horizontal line of critical value 2.043 at 5% genome-wide significant level. This suggests that the two QTLs are indentified by the IRglmnet and the UWglmnet. One of the detected QTLs is located between markers MK23 and MK24 on chromosome 2, another between markers MK48 and MK49 on chromosome 3. They explain 1.9% and 2.1% of phenotypic variance, respectively. There is almost no difference in the estimated heritability between the UWglmnet and the IRglmnet, but the estimates for genetic effect obtained with UWglmnet are less than that with IRglmnet. In contrast, the IRGEE finds only one QTL of those identified by both glmnet methods, as shown in Figure 3 (lower panel).

Discussion
Two extensions are realized to map QTL for discrete traits: one is that of the GEE algorithm under a heterogeneous residual variance model by Xu and Hu [21] for a single QTL model to multiple QTL model, and another is that of IRLASSO for the continuous normal traits [22] to discrete ones. The glmnet with coordinate descent step is used for fast estimation of non-zero effects, followed by few non-zero genetic effects estimated and statistically inferred with a regular GLM. Like regular interval mapping, the method proposed here can, not only estimate QTL effects, but also assess the significance of QTLs. Although our mapping method is developed for improving linkage analysis with low marker density, it is also appropriate for missing genotypes that always happen in QTL mapping. R/glmnet can efficiently fit binary, categorical and Poisson data with logistic and Poisson regression models [22]. However, it can not be used to directly analyze binomial and multinomial data and only logit and log link functions are used in solving for the oversaturated GLM. In this study, a general LASSO procedure for the GLM is derived based on all possible link function for discrete traits, which can be incorporated into the R/glmnet with little modification. From a statistical viewpoint, the choice for link function can be somewhat arbitrary, but it is necessary to understand the biological meaning of discrete traits. For instance, the biological mechanism of binary and categorical traits can be well interpreted by the threshold model with the probit link function.