## Figures

## Abstract

Generalized estimating equation (GEE) algorithm under a heterogeneous residual variance model is an extension of the iteratively reweighted least squares (IRLS) method for continuous traits to discrete traits. In contrast to mixture model-based expectation–maximization (EM) algorithm, the GEE algorithm can well detect quantitative trait locus (QTL), especially large effect QTLs located in large marker intervals in the manner of high computing speed. Based on a single QTL model, however, the GEE algorithm has very limited statistical power to detect multiple QTLs because of ignoring other linked QTLs. In this study, the fast least absolute shrinkage and selection operator (LASSO) is derived for generalized linear model (GLM) with all possible link functions. Under a heterogeneous residual variance model, the LASSO for GLM is used to iteratively estimate the non-zero genetic effects of those loci over entire genome. The iteratively reweighted LASSO is therefore extended to mapping QTL for discrete traits, such as ordinal, binary, and Poisson traits. The simulated and real data analyses are conducted to demonstrate the efficiency of the proposed method to simultaneously identify multiple QTLs for binary and Poisson traits as examples.

**Citation: **Xing J, Gao H, Wu Y, Wu Y, Li H, Yang R (2014) Generalized Linear Model for Mapping Discrete Trait Loci Implemented with LASSO Algorithm. PLoS ONE 9(9):
e106985.
https://doi.org/10.1371/journal.pone.0106985

**Editor: **Rongling Wu, Pennsylvania State University, United States of America

**Received: **June 2, 2013; **Accepted: **August 12, 2014; **Published: ** September 11, 2014

**Copyright: ** © 2014 Xing et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Funding: **This work was financially supported by the National Natural Science Foundations of China (30972077 and 31172190) and the Applied National Science and Technology Support Program in China (2012BAD26B01). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests: ** The authors have declared that no competing interests exist.

## Introduction

Corresponding to continuous and discrete random variables in statistics, quantitative traits are classified into continuous and discrete traits in quantitative genetics. In contrast to discrete traits, continuous traits especially normally distributed ones are analyzed by taking advantage of the extensively developed inference methods available for linear models. Actually, mapping methods for continuous quantitative traits are developed prior to discrete traits. The earliest QTL mapping for continuous traits can be traced back to the interval mapping developed by Lander and Botstein [1], while the first group of people to map ordinal traits using the EM algorithm is credited to Hackett and Weller [2] and Xu and Atchley [3].

Binary and categorical discrete traits are commonly observed that typically follow binomial and multinomial distributions. A binary trait including only two categories is a special case of categorical or ordinal trait. Also, binomial trait and multinomial trait can be regarded as the derivatives of binary and categorical or ordinal traits, defined by the proportions of the number of events happened among the total number of trials. Traits measured as counts are often called Poisson traits because they are usually modeled by a Poisson distribution. The generalized linear model (GLM) therefore becomes a natural choice for analyzing the discrete traits with the above mentioned distributions [4], [5]. Some applications of GLM to mapping QTLs have been conducted for binary traits [3], [6], [7], ordinal traits [2], [8] and Poisson traits [9], [10]. The IRLS for the normally distributed traits was extended to analyze binary traits [11]–[13], which greatly improves the computational efficiency with little loss in power. Especially, the EM algorithm within the framework of GLM have been developed to simplify QTL mapping for binary traits and ordinal traits [14], [15]. In addition, the GEE approach for GLM has been adopted to comprehensively analyze multiple mixed traits of continuous and discrete trait components [16].

Based on the interval mapping with GLM models, a set of mapping methods [7], [17]–[19] have been developed to simultaneously map multiple QTLs for discrete traits. As an alternative, Bayesian mapping method with reversible-jump MCMC sampling [20] was proposed to infer QTLs for binary traits [7]. Subsequently, Yi et al. [18] applied a stochastic search variable selection method for Bayesian mapping of ordinal traits which remarkably improved the sampling efficiency for model parameters. By fitting a continuous prior distribution on genetic effects, most recently, hierarchical generalized linear models and computationally efficient algorithms have been further developed for genome-wide analysis of QTL for various types of phenotypes in experimental crosses [19]. In Bayesian mapping, only the method with reversible-jump MCMC sampling belongs to fully Bayesian. However, the reversible-jump MCMC sampling is usually subject to poor mixing. Some Bayesian methods use imputed QTL genotypes based on the conditional expectations of the genotype given on the flanking marker information, but ignores the uncertainties of the imputation process. Unfortunately, no efficient and convincing method for statistical inference of QTLs is provided.

As an extension of the IRLS method for continuous traits to discrete traits, GEE algorithm under a heterogeneous residual variance model can well characterize QTLs, especially large QTLs located in large marker intervals, in the manner of high computing speed [21]. Such a method has been developed for the interval mapping of discrete traits. Thus, it has a limited statistical power to identify multiple QTLs, without considering other linked QTLs. The objectives of this study is to derive a penalized generalized linear model with LASSO penalty, and then, to iteratively estimate the non-zero genetic effects of those loci over the entire genome, under a heterogeneous residual variance model. The iteratively reweighted LASSO algorithm [22] is extended to QTL mapping for discrete traits such as ordinal, binary, and Poisson traits.

## Methods

### Generalized linear genetic model

In a QTL mapping analysis, a genetically designed population is required to construct linkage relationship between putative QTL and markers. In such a population, *n* individuals are observed for phenotypic values and are genotyped for reasonably dense co-dominant markers. A genetic linkage map can be constructed based on the observed markers or can be obtained from prior studies. Based on the linkage information, QTLs are identified by inferring the significance of genetic effects for loci on or between markers.

If the trait of interest is normally distributed, the effects of these loci on phenotype are generally represented by the following linear model:(1)where for is the phenotypic value of the normal trait; is the population mean; *k* is the number of putative loci; is the genetic effect at the *j*th locus; is the indicator variable determined by the genotypes of the *j*th locus. Note that only of the additive genetic effect is considered here for the simplification of description, which is suitable for backcross, double haploid and recombinant inbred lines.

In model (1), the residual error is assumed to follow a normal distribution with . Hence, the expectation of a normal phenotype has a linear predictor:(2)

In many situations, however, a trait in question are measured in discrete form, including binary, binomial, categorical, multinomial and Poisson traits. Their distributions, summarized in Table 1, belong to an exponential distribution family with different link functions. The relationship between the mean of a discrete phenotype () and the linear predictor of genetic effects of *k* loci () is formulated by means of a link function, denoted by(3)for .

This is the GLM for multiple QTL mapping with discrete traits. In the model, *g* is the link function. Moreover, the variance of discrete phenotype can be derived from the distribution of each discrete trait (Also shown in Table 1), which is useful for estimating model parameters.

### Genetic effect estimation

Theoretically speaking, the reweighted least square method by Wedderburn [5] can be used to estimate the parameters in model (3). But implementation of this method is not straightforward, due to the fact that the number of parameters estimated may be far greater than sample size and the values of indicator variables are missing at loci between markers. According to the least squares method of Haley and Knott [23], the missing values of indicator variables can be simply replaced by its expectation of conditional probability given on flanking marker genotypes. This replacement, however, could result in over dispersion (Xu 1998). From the linear predictor in model (3), the over-dispersion is calculated as(4)where and depend on genetically designed population, as derived in Liu et al. [22]; is the parameter of dispersion in an exponential family. To adjust for the heterogeneity of over-dispersion [21], the linear predictor is standardized by the over-dispersion parameter, which lead to

(5)Instead of , the standardized linear predictor is substituted into model (3). As in the reweighted least square method for a generalized linear model, it is defined that(6)(7)(8)

By Taylor expansion, the log-likelihood about model parameters is quadratically approximated by(9)with and .

With more model parameters than sample size, the unique estimates of genetic effects can not be obtained by minimizing the log-likelihood above. Actually, there are few non-zero and significant genetic effects in model (3), because the number of QTLs for a trait is generally not large. In this case, the LASSO penalized method with a coordinate descent step can efficiently shrink most of genetic effects to zeros by minimizing the following function [24], [25]:(10)where is a tuning parameter which will be chosen by cross validation.

In solving model parameters with LASSO, iterations are required, as response variable , independent variables and as well as weighted value are all a function of the estimated parameters. Without heterogeneous over-dispersion, the software R/glmnet can be applied to efficiently search for sparse solutions in the oversaturated GLM model [25]. Taking the glmnet as the inner loop, the iterative procedure is implemented in the following steps:

- Initialize and all genetic effects as zeros
- Shrink genetic effects with the unweighted glmnet
- Update using non-zero genetic effects
- Shrink genetic effects with the iteratively weighted glmnet
- Repeat step 2 and 4 until certain convergence criteria are satisfied.

### Statistical inference for QTLs

The LASSO for oversaturated GLM can provide estimates of non-zero genetic effects, but cannot do significance test for the estimates. After shrinkage estimation, the number of non-zero effects is generally less than the sample size. Substituting for the glmnet in iterative procedure, therefore, the reweighted least squares method for common GLM can be employed to estimate non-zero genetic effects:(11)

The variance-covariance matrix of the model parameters is estimated by(12)

The *t* test statistic is used to infer the significance of the non-zero effects, which is calculated as (13)

It needs to be specially noticed that genetic effects re-estimated by reweighted least square method may be biased upward due to high variable selection of LASSO above. Meanwhile, population structure and marker density influence distribution of the *t* test statistic still. Permutation tests [26] is therefore introduced to adjust the critical value of *t* test statistic. The loci corresponding to significant genetic Meanwhile, population structure and mark effects are determined as the QTLs for trait of interest.

### Simulations

The purpose of simulation is to demonstrate the efficiency of the method proposed here (IRglmnet for short), by comparing it to the GEE (IRGEE for short) algorithm under a heterogeneous residual variance model and unweighted glmnet (UWglmnet for short). Six chromosomes, each of length 100 cM with 11 evenly placed co-dominant markers are simulated in a backcross population with a sample size of 200 and 400. Total 10 QTLs are simulated on the 6 chromosomes, whose positions and genetic effects are listed in Table 2 and Table 3. Assuming population mean to be zero, the expectation is calculated based on the simulated genotypes of the QTLs and the expectation of discrete traits based on the link function. Taking binary and Poisson traits as examples, phenotypic values are randomly generated from binomial and Poisson distributions with their known expectations.

### Alopecia areata in mouse

To locate QTLs linked with alopecia areata, an F_{2} population was generated from crossing the strain of C3H/HeJ and C57BL/6J mice [27]. The 138 alopecia areata and 214 clinically normal mice were genotyped at 12 months of age using 211 microsatellite markers. Linkage maps and marker positions were reported on the website www.informatics.jax.org.

### Tiller numbers in rice

This is a Poisson dataset for mapping QTL of tiller numbers in rice [28]. A doubled-haploid (DH) population of 123 lines was derived from the cross between two inbred lines, semidwarf IR64 and tall Azucena [29]. Based on this population, a genetic linkage map of 2005 cM long was constructed using 175 genetic markers. For the 123 DH lines, each containing five plants, tiller numbers were observed every 10 days until all lines had headed. Phenotypic value of each line was obtained by averaging over five plants.

## Results

### Simulation study

The simulated datasets are analyzed by using IRglmnet, UWglmnet and IRGEE. For convenience to compare the three mapping methods, all test statistics are transformed to -log(*p*), where *p* is the probability of greater than the realized statistic values. The critical values of the test statistic are determined through simulating 1000 samples under the null model with zero genetic effects. They are slightly distinguishable among the three mapping methods and two sample sizes (Results not shown). The simulations are replicated 500 times for estimating QTL parameters and assessing the statistical power of QTL detection. Statistical power of QTL detection is counted by each locus as the percentage of the number of those simulations that statistic value exceeds critical value at the locus.

The statistical performances of different scenarios are presented in Table 2 for position estimate comparison, in Table 3 for QTL parameter estimate comparison and in Table 4 for power comparison, As can be seen, the IRglmnet is mostly identical to the UWglmnet and the two glmnet methods are advantageous to IRGEE in terms of power to detect QTL. The IRglmnet methods can accurately estimate QTL genetic effects, while UWglmnet somewhat underestimates and IRGEE can not estimate well QTL effects. Meanwhile, both glmnets are able to detect QTL positions with higher precision than IRGEE. Under the same genetic design and sample size, the two traits analyzed are evidently distinct in QTL parameter estimation and power of QTL detection at the QTLs of small genetic effects. As compared to normally distributed traits in Liu et al. [22], the statistical powers to detect QTL for the two analyzed traits are higher at all the simulated QTLs except for the QTLs of small genetic effects, even with large sample size. This implies that in principle, it is generally difficult to detect QTLs for discrete traits. In addition, the statistical performance of each mapping method increases as sample size and QTL genetic effect increase, as observed in usual QTL mapping.

The advantage of our proposed method lies in the computing efficiency as well. Although the IRglmnet method with cross validation takes more computing time than the UWglmnet, the two glmnet methods run considerably fast, as compared to the IRGEE. On an Intel core 4 PC with a 3.8 GHz processor, the UWglmnet and IRglmnet for binary data consume 10.8 seconds and 24.5 seconds on average, respectively, whereas the IRGEE takes 1.2 minutes under a sample size of 200. For Poisson data, the UWglmnet, IRglmnet and IRGEE run 14.3 seconds, 33.6 seconds and 1.8 minutes, respectively. The difference in computing time-consuming gets larger as the sample size increases.

### Mapping QTL for alopecia areata

In an F_{2} population, there are three genotypes at each locus, denoted by QQ, Qq and qq, so that QTL genetic effect can be partitioned into additive and dominance effects. The linear predictor for alopecia areata traits is described as where, is the population mean, is the additive effect at the *j*th locus, is the indicator variable corresponding to the additive effect, defined as +1 for QQ, 0 for Qq and −1 for qq. is the dominance effect at the *j*th locus, is the indicator variable corresponding to the dominance effect, defined as 0 for homozygote and 1 for heterozygote.

With probit link function, the dataset is analyzed by using IRglmnet, UWglmnet and IRGEE methods. The genome-wide critical threshold values for declaring QTL significance are obtained by using 1000 permutation tests. The critical values are distinguishable between the two glmnets methods and IRGEE method, which is marked by horizontal reference line in Figure 1 and Figure 2. The comparative plots in the profiles of –log(p) statistics between IRglmnet and IRGEE methods are depicted in Figure 1 and Figure 2 by the mode of inheritance. The overdispersion parameter for each individual is much closed to 1 in running IRglmnet method, so that the results obtained with the UWglmnet are exactly the same as those obtained with the IRglmnet. The QTLs are generally determined according to the peaks that exceed corresponding critical values. As can be seen from the two Figures, the IRglmnet finds not only those QTLs detected with IRGEE but also more QTLs than those detected with IRGEE. Surprisingly, most of QTLs are located on markers in the genetic map with moderate density.

In each plot, the genome-wide critical value is marked by a horizontal reference line. Chromosomes are separated by the vertical dotted lines and marker positions are indicated by the ticks on the horizontal axis.

In each plot, the genome-wide critical value is marked by a horizontal reference line. Chromosomes are separated by the vertical dotted lines and marker positions are indicated by the ticks on the horizontal axis.

Table 5 tabulates parameter estimates of QTLs detected with three mapping methods. A total of 10 QTLs are identified for alopecia areata, of which, 6 are inherited in the additive mode and 4 in the dominance mode. Interestingly, the QTLs on chromosome 1 and chromosome 8 are completely different in the mode of inheritance between the two glmnet methods and IRGEE method. The proportions of phenotypic variation explained by the detectable QTLs varied from 2% to 49%. The largest heritability (49%) is of the QTL on chromosome 17, which is more than five times of the second highest heritability (9%). The estimates for genetic effects from the IRglmnet are twice more than those from the UWglmnet, but their estimated heritabilies are roughly the same, except for the largest heritability. Meanwhile, the three mapping methods are able to consistently detect major locus on mouse chromosome 17 and minor locus on chromosome 9, as reported in Sundberg et al. [27].

### Mapping QTL for tiller numbers

Taking the tiller numbers at the third stage (30 days after transplanting) as an example, the QTLs for the traits are located by using the three competing mapping methods. Figure 3 (upper panel) illustrates that there are two peaks that pass through the horizontal line of critical value 2.043 at 5% genome-wide significant level. This suggests that the two QTLs are indentified by the IRglmnet and the UWglmnet. One of the detected QTLs is located between markers MK23 and MK24 on chromosome 2, another between markers MK48 and MK49 on chromosome 3. They explain 1.9% and 2.1% of phenotypic variance, respectively. There is almost no difference in the estimated heritability between the UWglmnet and the IRglmnet, but the estimates for genetic effect obtained with UWglmnet are less than that with IRglmnet. In contrast, the IRGEE finds only one QTL of those identified by both glmnet methods, as shown in Figure 3 (lower panel).

In each plot, the genome-wide critical value is marked by a horizontal reference line. Chromosomes are separated by the vertical dotted lines and marker positions are indicated by the ticks on the horizontal axis.

## Discussion

Two extensions are realized to map QTL for discrete traits: one is that of the GEE algorithm under a heterogeneous residual variance model by Xu and Hu [21] for a single QTL model to multiple QTL model, and another is that of IRLASSO for the continuous normal traits [22] to discrete ones. The glmnet with coordinate descent step is used for fast estimation of non-zero effects, followed by few non-zero genetic effects estimated and statistically inferred with a regular GLM. Like regular interval mapping, the method proposed here can, not only estimate QTL effects, but also assess the significance of QTLs. Although our mapping method is developed for improving linkage analysis with low marker density, it is also appropriate for missing genotypes that always happen in QTL mapping.

R/glmnet can efficiently fit binary, categorical and Poisson data with logistic and Poisson regression models [22]. However, it can not be used to directly analyze binomial and multinomial data and only logit and log link functions are used in solving for the oversaturated GLM. In this study, a general LASSO procedure for the GLM is derived based on all possible link function for discrete traits, which can be incorporated into the R/glmnet with little modification. From a statistical viewpoint, the choice for link function can be somewhat arbitrary, but it is necessary to understand the biological meaning of discrete traits. For instance, the biological mechanism of binary and categorical traits can be well interpreted by the threshold model with the probit link function.

Our proposed method can be simplified to a notable extent in the practice of QTL mapping. As demonstrated in simulations, the UWglmnet is mostly consistent with IRglmnet method in terms of power to detect QTL, although it underestimates the genetic effects of QTLs. By removing the iteratively weighted step for IRglmnet, non-zero effects can be solved by using UWglmnet for simple computation. The iteratively weighted step is only used in GLM analysis for reestimating few non-zero genetic effects. In R/glmnet, additionally, cross validation (CV) is always introduced to improve shrinking efficiency. The fewer non-zero genetic effects are retained after the CV.glmnet, but more computing time is needed. Actually, the CV step can be ignored in the QTL mapping, as the CV.glmnet just drops the redundant non-zero genetic effects of those obtained with the glmnet without CV. Prior to the GLM analysis, the redundant non-zero genetic effects solved by using the glmnet without CV can be simply removed by restricting that there is only one QTL within a marker interval. Without the iteratively weighted step and cross-validation for glmnet, our proposed mapping method can save more computing time than the GEE algorithm under a heterogeneous residual variance model by Xu and Hu [21]. Our proposed method has been coded in the program on the IRLASSO [22], which can handle normally distributed, binary, binomial, ordinal, multinomial and Poisson traits. The program is freely available upon request from the authors.

## Author Contributions

Conceived and designed the experiments: RY. Performed the experiments: JX HG. Analyzed the data: HL. Contributed reagents/materials/analysis tools: YW YNW. Wrote the paper: RY YW.

## References

- 1. Lander ES, Botstein D (1989) Mapping mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121: 185–199.
- 2. Hackett CA, Weller JI (1995) Genetic mapping of quantitative trait loci for traits with ordinal distributions. Biometrics 51: 1252–1263.
- 3. Xu S, Atchley WR (1996) Mapping quantitative trait loci for complex binary diseases using line crosses. Genetics 143: 1417–1424.
- 4.
McCullagh JAN (1989) Generalized linear models; Hall C, editor. New York.
- 5. Wedderburn R (1974) Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 61: 439–447.
- 6. Deng W, Chen H, Li Z (2006) A logistic regression mixture model for interval mapping of genetic trait loci affecting binary phenotypes. Genetics 172: 1349–1358.
- 7. Yi N, Xu S (2000) Bayesian mapping of quantitative trait loci for complex binary traits. Genetics 155: 1391–1403.
- 8. Rao S, Xu S (1998) Mapping quantitative trait loci for ordered categorical traits in four-way crosses. Heredity (Edinb) 81 (Pt 2): 214–224.
- 9. Cui Y, Kim DY, Zhu J (2006) On the generalized poisson regression mixture model for mapping quantitative trait loci with count data. Genetics 174: 2159–2172.
- 10. Cui Y, Yang W (2009) Zero-inflated generalized Poisson regression mixture model for mapping quantitative trait loci underlying count trait with many zeros. J Theor Biol 256: 276–285.
- 11. Xu S (1998) Iteratively reweighted least squares mapping of quantitative trait loci. Behav Genet 28: 341–355.
- 12. Yi N, Xu S (1999) Mapping quantitative trait loci for complex binary traits in outbred populations. Heredity (Edinb) 82 (Pt 6): 668–676.
- 13. Yi N, Xu S (1999) A random model approach to mapping quantitative trait loci for complex binary traits in outbred populations. Genetics 153: 1029–1040.
- 14. Xu S, Yi N, Burke D, Galecki A, Miller RA (2003) An EM algorithm for mapping binary disease loci: application to fibrosarcoma in a four-way cross mouse family. Genet Res 82: 127–138.
- 15. Xu C, Zhang YM, Xu S (2005) An EM algorithm for mapping quantitative resistance loci. Heredity (Edinb) 94: 119–128.
- 16. Lange C, Whittaker JC (2001) Mapping quantitative trait Loci using generalized estimating equations. Genetics 159: 1325–1337.
- 17. Li J, Wang S, Zeng ZB (2006) Multiple-interval mapping for ordinal traits. Genetics 173: 1649–1663.
- 18. Yi N, Banerjee S, Pomp D, Yandell BS (2007) Bayesian mapping of genomewide interacting quantitative trait loci for ordinal traits. Genetics 176: 1855–1864.
- 19. Yi N, Banerjee S (2009) Hierarchical generalized linear models for multiple quantitative trait locus mapping. Genetics 181: 1101–1113.
- 20. Sillanpaa MJ, Arjas E (1998) Bayesian mapping of multiple quantitative trait loci from incomplete inbred line cross data. Genetics 148: 1373–1388.
- 21. Xu S, Hu Z (2010) Methods of plant breeding in the genome era. Genet Res (Camb) 92: 423–441.
- 22.
Liu Y, Yang T, Li H, Yang R (2012) Iteratively reweighted LASSO for mapping multiple quantitative trait loci. Brief Bioinform.
- 23. Haley CS, Knott SA (1992) A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity (Edinb) 69: 315–324.
- 24. Yuan M, Lin Y (2007) Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society B 68: 49–67.
- 25. Friedman J, Hastie T, Tibshirani R (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw 33(1): 1–22.
- 26. Churchill GA, Doerge RW (1994) Empirical threshold values for quantitative trait mapping. Genetics 138: 963–971.
- 27.
Sundberg JP, Boggess D, Silva KA, McElwee KJ, King LE, et al.. (2003) Major locus on mouse chromosome 17 and minor locus on chromosome 9 are linked with alopecia areata in C3H/HeJ mice. 771–775 p.
- 28. Yan J, Zhu J, He C, Benmoussa M, Wu P (1998) Molecular Dissection of Developmental Behavior of Plant Height in Rice (Oryza sativa L.). Genetics 150: 1257–1265.
- 29. Huang N, Parco A, Mew T, Magpantay G, McCouch S, et al. (1997) RFLP mapping of isozymes, RAPD and QTLs for grain shape, brown planthopper resistance in a doubled haploid rice population. Molecular Breeding 3: 105–113.