Functional Mapping of Dynamic Traits with Robust t-Distribution

Functional mapping has been a powerful tool in mapping quantitative trait loci (QTL) underlying dynamic traits of agricultural or biomedical interest. In functional mapping, multivariate normality is often assumed for the underlying data distribution, partially due to the ease of parameter estimation. The normality assumption however could be easily violated in real applications due to various reasons such as heavy tails or extreme observations. Departure from normality has negative effect on testing power and inference for QTL identification. In this work, we relax the normality assumption and propose a robust multivariate -distribution mapping framework for QTL identification in functional mapping. Simulation studies show increased mapping power and precision with the distribution than that of a normal distribution. The utility of the method is demonstrated through a real data analysis.


Introduction
Since the seminal work of interval mapping [1], quantitative trait loci (QTL) mapping with molecular markers has been a standard means in targeting genetic regions harboring potential genes of interest underlying various traits of interest in biomedical and agricultural research. TL mapping originated for single trait analysis, then later was considered for multiple traits for the improvement of mapping precision and power (e.g., [2]). When a trait is measured through many developmental stages, e.g., body height measured over many time points, the trait reveals the dynamic expression of the underlying genes that are associated with the trait. These traits, which can be expressed as a function of time, were termed ''function-valued traits'' by Pletcher and Geyer [3] or ''infinite-dimensional characters'' by Kirkpatrick and Heckman [4]. Mapping QTLs or genes underlying the dynamics of a developmental characteristic has been a longstanding challenging topic in genetic mapping. Recently, Wu and his colleagues (e.g., [4][5][6]) have developed a series of mapping approaches for dynamic traits by integrating mathematical functions into a QTL mapping framework, opening a new era for genetic mapping. The so-called functional mapping approach enables one to propose either parametric or non-parametric functions to model the developmental mean function of a dynamic trait. By testing mean differences for different QTL genotype categories in a genome-wide linkage scan, one can identify potential genes that govern the dynamics of a trait.
In general, functional mapping assumes a joint multivariate normal distribution of a developmental trait. The mean of the multivariate normal is modeled through functions of time, and trait correlations among different developmental stages are fully considered. These treatments make functional mapping more powerful than single trait analysis for a developmental trait [4]. The multivariate normality assumption is commonly assumed for all the methods developed for functional mapping in the literature. In real data analysis, this assumption could be easily violated as in the case for single trait analysis [8]. In a single trait analysis, von Rohr and Hoeschele [8] showed that deviations from normality may lead to false positive QTL detection. The authors proposed to replace the normality assumption with the t-distribution to allow for heavy tails and skewness of a trait distribution. In human linkage analysis with the variance components model, Peng and Siegmund [9] also showed that departure from multivariate normality for the trait vector could dramatically reduce the mapping power when multivariate normality is assumed. As an alternative, the authors proposed to substitute the multivariate normal with a multivariate t-distribution and showed great power improvement.
For a developmental trait, the multivariate normality assumption is often a concern, especially for a small sample size. For many applied problems, the tails of the data distribution are often longer than a normal distribution assumes. In the presence of extreme observations, statistical inference based on the normal distribution is less robust. This could lead to low power or false positives under a functional mapping framework. The lack of robustness with respect to outliers and heavy tails that results from using a Gaussian model makes the multivariate t-distribution a powerful alternative.
In this work, we relax the multivariate normality assumption in functional mapping and propose a robust multivariate t-distribution for the error terms. The proposed method is implemented in a mapping framework that is different from Peng and Siegmund's treatment [9]. A mixture multivariate t-distribution is proposed and an expectation-maximization (EM) algorithm is derived to estimate various parameters of interests. To make the method more flexible for any developmental traits, a non-parametric Bspline technique is incorporated to model the developmental mean function. An antedependence covariance model is applied to model the non-stationary covariance structure [10]. Extensive simulations are conducted to evaluate the model performance. The utility of the method is demonstrated by reanalyzing a real data set for the purpose of identify genes underlying the variation of rice tiller numbers.

Methods
The mixture model and the multivariate t likelihood function Consider a backcross design initiated with two inbreed lines with contrasting phenotypic difference. A genetic linkage map can be constructed with molecular markers. Suppose there is a putative segregating QTL, with alleles Q and q, that affects the trait of interest, but by different degrees. For a backcross population with n observations, each one is measured over p time points. The phenotypic vector y~½y(t 1 ), Á Á Á ,y(t p ) T follows a multivariate distribution with a density function f (y; y,g), where y and g denote location and scale parameters.
In a QTL mapping study, the location and QTL genotype are generally unobservable. Suppose the QTL genotypes contributing to the variation of a dynamic quantitative trait are QQ and Qq. This missing data problem can be overcome by modeling the observed phenotypic data with a finite mixture model where f j (y i ; y j ,g) is the probability density function with the location parameters y j corresponding to QTL genotype j (~1 for QQ and 0 for Qq); g contains the scale parameters common to all components; and p ijj is the mixture proportion of individual i given the QTL genotype j. For a backcross design, the mixture proportions can be obtained via the conditional probabilities of QTL genotypes given the flanking marker in a standard backcross design [11].
As we mentioned in section Introduction, multivariate normality is a general concern in functional mapping when extreme observations or heavy tails are observed. To make the functional mapping more flexible, we assume the multivariate t distribution for y. The multivariate t density function for individual i given genotype j is given by where for genotype j ( = 0, 1), m j~½ m j (t 1 ), . . . ,m j (t p ) denotes the mean vector, S j is a positive definite covariance matrix, n j is the degree of freedom, and V j~( m j ,S j ,n j ) contains all the parameter of interest corresponding to genotype j. The Mahalanobis distance between y i and m j with respect to S j is denoted as At a specific time point t, the relationship between the observation and the mean can be expressed by a linear model where c i~0 or 1 if the QTL genotype is Qq or QQ, respectively; and e i (t) is the error term following a t distribution with mean zero and variance s 2 (t). The errors at two different time points t 1 and t 2 , are correlated with correlation coefficient r(t 1 ,t 2 ). Assuming independence among individuals, the joint likelihood function can be expressed as where p ijj~P (c ijj~1 ), and p ij0 +p ij1~1 . The unknown parameter vector V consists of two sets of parameters. One set, denoted as V l , determines the locations of the QTL with respect to markers; and the other set, denoted as V g~( V m ,V c ,V n ), determines the multivariate t distribution of the trait corresponding to each QTL genotype, where V m , V c and V u define the mean vectors, the covariance matrices and the degree of freedom.

Modeling the dynamic mean function
One of the challenges in functional mapping lies in the complexity of the developmental pattern as well as the intraindividual variation of a longitudinal trait. Rather than estimating the discrete means at p time points, functional mapping treats a developmental trait as a dynamic process which is fitted by a continuous function [7]. For a typical growth trait, a parametric logistic function would fit most data well [12] and it has been broadly applied in many applications (e.g., [5,13]). For other developmental characteristics such as a process that experiences programmed cell death, it is infeasible to find a mathematic function to describe the process, thus a joint modeling approach may be an option (e.g., [14]). Legendre polynomials have been shown to be useful in modeling irregular developmental processes (e.g., [15,16]). With recent statistical advances in nonparametric regression, a natural and flexible way to model an irregular developmental process is in a nonparametric fashion in which the data specify the best fit [17].
Here we adopt a nonparametric B-spline technique to model the time-dependent mean function. As aforementioned, the phenotype values are recorded at p time points, denoted as t 1 ƒt 2 ƒ . . . ƒt p . At a particular time point t Ã , we can fit the dynamic genotypic means corresponding to the QTL genotypes QQ and Qq by using B-spline functions with different orders. Denote the B-spline basis function in a matrix as B which can be defined by the degree and the order of a piecewise polynomial. For the uniform quadratic B-spline with mth order, the basis matrix is expressed as A column vector of the basis matrix B(t Ã ) is called a base function. For the two QTL genotypes QQ and Qq (corresponding to j~1 and 0 respectively), the base genotypic vector is expressed as j j~½ j j0 ,j j1 ,:::j j(m{1) '. The vector contains the coefficients to be estimated for genotype j. The B-spline function depends on the observed time points, the number and the relative positions of the knots. The criteria to determine the knots are open to discussion [17]. For the real data analyzed in this study, equidistantly distributed inner knots are selected since the rice tiller numbers are observed with the same duration (10 days). Around p 2 Â Ã inner knots should be selected, as suggested in Yang et al [17]. We choose 3 evenly distributed knots and with this representation, the dynamic genotypic mean at time t, m j (t), can be estimated by m j (t)~j j 0 B(t). It is shown later on simulation study that the estimation on the mean curves is satisfactory. This serves as a credential for our choice. Further investigation also indicates that the estimation are not sensitive to various spline bases.

Modeling the covariance function
Though nonparametric modeling of the time-dependent mean functions has been extensively studied, research on the modeling of the covariance structures via non-parametric approaches is rarely reported due to various difficulties [18]. In the original functional mapping [5], a stationary covariance function such as the first-order autoregressive (AR(1)) model was applied. Structured antedependence (SAD) model was later on adopted in functional mapping [19] for the purpose of relaxing the stationarity assumption. The SAD model is a non-stationary model which has been applied in many studies [20]. The SAD model with order r for modeling the error term in Eq. (2) is denoted by where i (t) is the ''innovation'' term assumed to be independent and distributed as N (0,s 2 t ); and w k (k~1, Á Á Á ,r) are the antedependence coefficients. Therefore, the variance-covariance matrix of the a developmental process can be expressed as where S e~d iagfs 2 1 ,s 2 2 , . . . ,s 2 p g is a diagonal matrix. For the firstorder SAD or SAD(1) model, the matrix Q can be expressed as In general, the SAD order r can be selected through an information criterion (see [19]). Since the purpose of this study is not to compare the performance of various modeling approaches for the covariance structure, we simply adopt the SAD(1) function due to its non-stationarity property and simplicity.

Parameter estimation
The Expectation-Maximization (EM) algorithm, originally proposed by Dempster et al. [21], was applied to obtain the maximum likelihood estimates (MLEs) of the unknown parameters contained in V g~( V m ,V c ,V n ). The detailed algorithm is given in the Appendix S1. Note that the QTL position is generally considered as an unknown parameter which can be estimated   together with other mean and variance parameters. This, however, could dramatically increase the complexity of an estimation algorithm. As commonly treated in QTL mapping studies, we do not directly estimate the QTL-segregating parameters. Instead, we use a grid search approach to estimate the QTL location by searching for a putative QTL at every 1 or 2cM on an interval bracketed by two flanking markers. This linkage scan is done for the entire linkage map. The log-likelihood ratio test statistic for a QTL at a testing position is displayed graphically, to generate a log-likelihood ratio plot called the LR profile plot. The genomic position corresponding to a peak of the profile is the MLE of the QTL location.

Hypothesis testing
Once the MLEs of parameters are obtained at each testing position, we are interested in testing whether there exists a QTL at a marker interval that governs the developmental process. The hypotheses for such a test can be formulated by The null hypothesis H 0 states that the data can be fitted by only one curve in the reduced model, while the alternative hypothesis H 1 states that there exist two different curves to fit the data in the full model. The likelihood ratio test (LRT) has been the standard test in testing the QTL effect. DenoteV V 0 andV V 1 as the MLEs of the unknown poarameters under H 0 and H 1 , respectively. The LRT test statistic can be computed as the log-likelihood ratio of the reduced model to the full model, i.e., LR~{2½logL(V V 0 ){ logL(V V 1 ). The genome-wide significance threshold can be determined through an empirical approach based on permutation tests proposed by Churchill and Doerge [22]. Following the overall genetic test described above, we can further test if a QTL triggers an effect on a certain time interval ½t 1 ,t 2 using a regional test approach based on the areas under the curve (AUC). The hypothesis for such a test can be formulated as where AUC j for genotype j is calculated as AUC j~Ð The significance of the test can be assessed through permutation tests [22].

Simulation
We simulated a backcross population with a 100cM long linkage group, composed of 6 equidistant markers, under the assumption that QTL governs the whole developmental process. A putative QTL that affects a developmental process was assumed to be located 48cM away from the first marker on the linkage group, in between the 3rd and 4th markers. The Haldane map function was used to convert the map distance into the recombination fraction. A developmental trait with 9 equally spaced time points was generated under various combinations of heritability levels (H 2 = 0.1 vs 0.4) and sample sizes (n = 100 vs 400). The covariance was simulated assuming a first-order SAD structure.
In the simulation, we evaluated how well the parameters (including the QTL position as well as the mean and covariance parameters) can be estimated, how robust the multivariate t statistic is when data generate from a multivariate normal, and how poor the performance of multivariate mixture normal will be if the model is misspecified. Several simulation scenarios were considered. Tables 1 and 2 list the results assuming that the data generating and data analyzing models were the same. Tables 3 and 4 list the results assuming the data generating and analyzing models were not the same. In all simulation scenarios, we observed that increases in sample size and heritability always lead to more accurate parameter estimations. For example, in Table 1, the standard error for the mean parameter j 00 of genotype Qq reduces from 0.14 to 0.06 while the sample size increases from 100 to 400 under a heritability level of 0.1. Meanwhile, given a sample size 400, the standard error decreases from 0.06 to 0.03 as H 2 increases from 0.1 to 0.4, a two-fold decrease.
For a multivariate t distribution, the degree of freedom (n) controls the shape of the distribution. A small value for n indicates that the normal assumption might be inappropriate for the data. Assuming n~3, we simulated data assuming a multivariate t distribution. Table 1 (denoted as MVTT) shows that the parameters can be reasonably estimated with good precision. When both sample size and heritability level increase, the precision for the QTL position estimation is improved with reduced standard error. The same simulated data were further analyzed assuming a multivariate normal distribution for the error term. The results are tabulated in Table 3 (denoted as MVTN). It is clear that when the error distribution is misspecified, large standard errors were observed for all the parameters. In particular, the QTL position is poorly estimated with large standard errors under a small sample size and low heritability level. For example, the standard error increases from 2.94 to 5.17 under n~100 and H 2~0 :4, when data were analyzed with the proposed and the multivariate normal model. Under a small sample size, the multivariate t distribution is more robust than a multivariate normal.
Next we simulated data under the multivariate normal assumption and analyzed the data with the corresponding data generating model (denoted as MVNN) and the proposed t distribution model (denoted as MVNT). We used the results in Table 2 as a reference to compare the performance of the multivariate t model in Table 4, since the results in Table 2 was obtained with the true model. Under a small sample size (n~100) and low heritability level (H 2~0 :1), not surprisingly the results with the multivariate t model are better for the multivariate normal model. For example, the standard error for the QTL position estimate is 4.35 in MVNT, while it is 5.41 in MVNN. Moreover, the bias in MVNN is also larger (1.72 vs 0.3). This result demonstrates the robustness of the t modeling under small samples. As sample size and heritability level increase, the results are very comparable. In real applications, due to various source of noise and for better estimation of the QTL position, a safe strategy is to apply the mixture multivariate t model in functional mapping. In functional mapping, the likelihood ratio (LR) statistic is used as the indicator of a QTL signal. The larger the LR value at a genomic position, the stronger the evidence of a QTL at that position. The LR test statistics for the above four scenarios are also compared across the simulated genetic linkage group, averaged over 100 simulation replicates. Figure 1 explicitly displays the difference in LR values under the different combinations of sample size and heritability level. When data were generated assuming a multivariate normal distribution, the results obtained with the t model (dashed curve) are very similar to those obtained with the normal model (solid curve). However, when data were generated assuming the multivariate t distribution, the t model (dotted curve) clearly outperforms the normal model (dash-dotted curve). This evidence indicates the superiority and robustness of the multivariate t mixture model in functional mapping.

A case study
We applied the method to a real data set to identify QTLs governing the variation of rice tiller number development to show the utility of the approach. A detailed description of the data can be found in Huang et al. [23] and Yan et al. [24]. In brief, semidwarf IR64 and tall Azucena, two inbred lines, were crossed to generate an F 1 progeny population. A doubled haploid (DH) population of 123 lines was constructed through doubling haploid chromosomes of the F 1 gametes. For this population, 40 isozyme and RAPD markers, and 135 RFLP markers were genotyped to construct a genetic linkage map of length 2005cM covering 12 rice chromosomes. Tiller numbers were measured every 10 days from 10 days after transplanting until all lines had headed. Nine developmental measurements were recorded for each rice. A plot of the original data can be found in Fig. 2 of Cui et al. [14].
We performed a genome-wide linkage scan at every 2cM interval to locate potential QTLs that trigger effects for the programmed cell death of rice tillers. Figure 2 shows the genom-wide log-likelihood ratio profile plots, where the results obtained with the multivariate t and the multivariate normal models are indicated by the solid and dashed curves, respectively, with the respective 5% genom-wide permutation threshold indicated by the horizontal solid and dashed lines (obtained with 1,000 permutations). The plot indicates one QTL located in chromosome 3 between marker RZ519 and Pgi{1. The QTL was also reported in our previous analysis [14,15]. The other peaks did not pass the genome-wide significance threshold. A test of multivariate normality for the phenotype data without considering the marker data shows evidence of departure from normality, indicating that a multivariate t model may be more appropriate for the data. The LR values for the two models across the 12 chromosomes are very comparable, with the multivariate t model generating slightly higher LR values in many positions.
The estimated QTL position on chromosome 3 and the corresponding marker interval as well as the MLEs of the model parameters are tabulated in Table 5. The tiller number developmental trajectories of the detected QTL are shown in Figure 3, with tiller number trajectories for all individuals indicated in the background. The gap between the two trajectories over the developmental stages is quite clear, indicating a developmental mean difference in tiller number between individuals carrying the Figure 2. The LR profile plot across the 12 rice chromosomes, fitted with the proposed multivariate t mixture model (solid curve) and a multivariate normal mixture model (dash-dotted curve). The genomic position corresponding to the peak of the curve is the MLE of the QTL location (indicated by the arrows). The 5% genome-wide threshold value for claiming the existence of a QTL is given as the horizonal dotted and dash-dotted lines for the two models. The marker positions on the linkage groups are indicated as ticks [23]. doi:10.1371/journal.pone.0024902.g002 two different genotypes. Individuals carrying genotype QQ have high mean tiller numbers during the observed developmental stage, hence are preferable for selection in breeding.

Discussion
Functional mapping has been shown to be a powerful approach and also a standard means in mapping QTLs underlying the dynamics of quantitative traits [7]. However, most current methods in functional mapping assume a multivariate normal distribution for the time-course error term, which could be easily violated in reality. In this work, we extended the current functional mapping approach assuming a robust multivariate t distribution for the error term, built upon the maximum likelihood framework while implemented with a full EM algorithm to estimate the model parameters. Extensive simulations show that the proposed model outperforms the mixture multivariate normal model when the underlying distribution is from a multivariate t distribution. Even if the underlying distribution is normal, the proposed t modeling approach performs as well or even better than the normal model (especially under a small sample size). Given its robustness, the proposed t model should be adopted in a regular functional mapping study, especially when the sample size is small.
In the original functional mapping study, a developmental mean process is generally modeled with a mathematical function such as the logistic function for a growth trait [5]. In this study, we modeled the developmental mean process using a nonparametric spline technique, given its flexibility in modeling patterns of data distribution which does not follow any particular mathematical form (e.g., [17,25]). The correlation structure was modeled by the non-stationary SAD model, which was studied in Zhao et al. [19] for functional mapping. Since the focus of this work is not on the modeling of the mean and the correlation structure, we simply adopted these approaches and did not compare the impact of different modeling approaches on the power of QTL identification. This investigation will be considered in our future work.
In real data analysis, there is not much significant deviation between the LR profile plot of the mixture t and the normal model. This is due to the fact that the data distribution is quite close to the multivariate normal. The same data were analyzed before with different models to approximate the developmental mean process [14,15]. The QTL showing genome-wide significance in this study is consistent with the one found in our previous work, while some other QTLs in chromosome 1 reported in Cui et al. [15] did not pass genome-wide significance in this analysis. This is largely due to differences in the modeling of the mean process. As previous investigation shown, the power and precision in QTL identification are quite sensitive to the way the mean and covariance structures are modeled [14,15,19]. In reality, the true mean and covariance function are generally unknown. This raises a very practical issue in functional mapping. What we can do to improve mapping power and precision is by modeling the error distribution with more robust approaches such as the one proposed in this work. We expect that the method developed can enhance the full power of functional mapping in understanding the genetic architecture of dynamic traits.

Supporting Information
Appendix S1 Derivation of the EM algorithm. (PDF)  Functional Mapping with Robust t-distribution PLoS ONE | www.plosone.org