Figures
Abstract
Interval mapping approaches have been playing significant role for quantitative trait locus (QTL) mapping to discover genetic architecture of diseases or traits with molecular markers. Composite interval mapping (CIM) is one of the superior approaches of the interval mapping for discovering both linked and unlinked putative QTL positions. However, estimators of this approach are not robust against phenotypic outliers. As a result, it fails to detect true QTL positions in presence of outliers. In this study, we investigated the performance of β-Composite Interval Mapping (BetaCIM) for detecting both linked and unlinked important QTLs positions from the robustness points of views. Performance of this approach depends on the value of tuning parameter β. It reduces to the classical CIM approach for β →0. We described and formulated the cross-validation procedure for selecting trait specific optimum value of β. It was observed that the optimum value of β depends on both amount of contaminated observations and their scatteredness. BetaCIM approach discover similar QTL positions as classical IM/CIM in absence of phenotypic outliers, but gives better results in presence of phenotypic outliers in terms of detecting true QTLs and effects estimation. We formulated the generalized forms of robust QTL analysis and developed an R-package named “BetaCIM” by implementing this robust approach. Left and right kidney weight data sets of mouse intercross population (129 S1/SvlmJ × A/J) were analyzed by using BetaCIM, CIM, and IM approaches. For right kidney weight (RKW) CIM and BetaCIM provided similar LOD score profile, and both approaches identified 3 QTL positions. IM approach also identified 3 QTL positions. For left kidney weight (LKW), there was evidence of one outlying observation; and in this case the BetaCIM approach identified 2 QTL positions. However, none of the QTLs were significant by CIM and IM approaches at 5% level of significance. Gene expression ontology (GEO) search showed that the candidate genes (Otof and A330033J07Rik) of the identified QTLs for LKW were expressed in kidney. Both simulation and real data analysis results showed that BetaCIM approach improves the performance over the existing methods in presence of phenotypic outliers. Otherwise, it keeps almost equal performance.
Citation: Monir MM, Khatun M, Mollah MNH (2018) β-composite Interval Mapping for robust QTL analysis. PLoS ONE 13(12): e0208234. https://doi.org/10.1371/journal.pone.0208234
Editor: David A. Lightfoot, College of Agricultural Sciences, UNITED STATES
Received: December 26, 2017; Accepted: November 14, 2018; Published: December 3, 2018
Copyright: © 2018 Monir et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The R-package for implementing the proposed methodology can be downloaded at http://www.bbcba.org/softwares/BetaCIM_1.01.zip.
Funding: The authors received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
QTL mapping approaches have been used successfully to discover the genetic variants by using experimental cross populations [1, 2]. Arranging a cross between two inbred lines, which are different in quantitative trait, then scoring the segregating progeny for the trait and for a number of genetic marker are the basic steps of QTL mapping [3]. Due to advances in molecular biology, the availability of genetic markers has rapidly increased, leading to extensive use of QTL mapping in genetic studies of quantitative traits.
Genetic effects associated with marker genotypes are generally confounded by the position of functional QTL and its actual effect. By exploiting this property, interval mapping [3] has become a superior way to find QTLs [4, 5]. However, in the case of multiple linked QTLs, genetic effects need to be sufficiently separated [6]. Interval-mapping approach cannot separate the effects of linked putative positions completely [7]. Composite Interval Mapping (CIM) is a powerful analytical technique that improves the reliability and accuracy of QTL mapping by separating the effect of QTL from its locations [6, 8]. The CIM approach has advantages over other mapping methods for detecting linked QTLs. By utilizing the properties of multiple regression, test statistic of this approach was constructed to be unaffected by the QTLs outside of testing interval [6]. CIM can eliminate influence of genetic background by using a set of representative markers of the background QTL as covariates. Likelihood ratio test statistics for this approach was constructed by joining interval mapping for QTL position and multiple regression analysis for background markers that permits to identify linked QTL located near at the same chromosome.
Real phenotypic data might be contaminated by some abnormal observations, known as outliers. In general, outliers have a large impact on any classic statistical estimator. For QTL mapping, the presence of phenotypic outlying observations might be the reason for misrepresenting multiple linked QTLs and could hinder the efficient and accurate resolvability of QTLs [9]. For experimental population, phenotypic datasets often contain outlying observations that may seriously affect the estimates of model parameters and can lead to the wrong detection of QTL positions and their estimated effects [10, 11]. Mollah and Eguchi [12, 13] proposed β-Composite Interval Mapping (BetaCIM) for robustly identifying putative QTL positions, where parameters were estimated by maximizing β-likelihood functions. Maximization of β-likelihood function is equivalent to minimization of Beta-Divergence [14–17]. However, in their study, they did not provide any theoretical discussion about the robustness. Moreover, they did not investigate the performance for detection of linked QTLs. In this work, we conducted simulation study to observe the performance of BetaCIM approach for detecting multiple linked and unlinked QTLs and compared the analysis results with classical interval mapping (IM) and composite interval mapping (CIM) approaches in presence and absence of phenotypic outliers. We provided the theoretical discussion about the robustness property, cross validation procedure of selecting the tuning parameter β, and generalized the formulas of BetaCIM approach. Two real datasets, left and right kidney weight data sets of mouse intercross population (129 S1/SvlmJ × A/J), were analyzed to demonstrate the usefulness of using BetaCIM approach for QTL mapping. We developed an R package, named BetaCIM, for implementing this approach.
Materials and methods
Genetic model
Genetic cross between two parental inbred lines P1 and P2 is performed to produce an F1 population, consist of all heterozygotes genotypes, which are used to produce the segregating progeny B1 = F1× parent (backcross) or an F2 = F1×F1 (intercross). The genetic model for intercross population is as follows
(1)
Here G2, G1 and G0 are the genotypic values of genotypes QQ, Qq and qq. D is known as the genetic design matrix and E = [a, d]T is the vector of genetic parameters. The first and second columns of D, denoted by D1 and D2, represent the status of the additive and dominance effects.
The genetic model of backcross population can be written as
(2)
All notations are similar as described in the upper of this section.
CIM statistical model for QTL mapping
Phenotypic traits might be controlled by several QTLs. Estimators of simple interval mapping approach might be biased for confounding effects of multiple background QTLs [18]. Usually QTLs are highly linkage disequilibrium with corresponding flanking markers, therefore highly significant markers choosing through stepwise procedure could be a good representative set of background QTLs. Composite interval mapping approach modified the simple interval-mapping approach by including several significant markers as cofactors. Some others confounding factors (e.g. sex, age, diet etc.) also might have influence on traits, can be used as cofactor in the model to adjust their effects.
Suppose, we want to test for a QTL on a marker interval for F2 population, then the statistical model for composite interval mapping can be written as
(3)
where yj is the phenotypic value of the jth individual; a is the additive effect of the testing QTL position; d is the dominance effect of testing QTL position; Xj is the matrix, may contain some chosen markers and other explanatory variables; γ is the vector of partial regression coefficients including the general mean effect μ; εj is a random error. The value of (
,
) is (1, -0.5) if QTL genotype is QQ, (0, 0.5) if genotype is Qq, and (-1, -0.5) if genotype is qq.
Statistical genetic model for backcross population can be written as
(4)
Notations of this equation are similar as described in upper of this section. The value of is 1 if QTL genotype is QQ, and 0 if genotype is Qq.
Robustification of CIM approach using beta-likelihood estimators
Composite interval mapping approach use the estimators derived from classical likelihood function to estimate genetic parameters, which might produce false positive or reduce detecting power of true loci in presence of phenotypic outliers. Instead of classical estimators of genetic parameters, β-Composite Interval Mapping uses the robust estimators derived from beta-likelihood function for robustly estimating the QTL positions in presence and absence of phenotypic outliers.
In the model (3), observation yj’s are influenced by three QTL genotypes QQ, Qq, and qq; therefore, each phenotypic observation (yj) is assumed to follow a mixture of three possible gaussian densities with different means and mixing proportion. The distribution function of each phenotypic observation (yj’s) can be defined as
where θ = (p,a,d,γ,σ2), ϕ() is a standard normal probability density function,
,
, and
. The mixing proportions pji’s, which are functions of the QTL position parameter p, are the conditional probabilities of QTL genotypes given marker genotypes. For n individuals, the objective function for estimating θ is defined as
(5)
where lβ(θ|X) = [∫{f(y|θ,X}β+1dy]β/(β+1)
It was induced from the beta-divergence [19] for estimation of the parameters. It reduces to the log likelihood function for β →0. That is
Therefore, the objective function (5) called as beta-likelihood function. Maximization of this type of beta-likelihood function is equivalent to the minimization of beta-divergence [19] for estimating model parameter θ. The β-LOD score for the evidence of a QTL in a marker interval from the robustness point of view is defined by
(6)
where θ0 and θ are the restricted and unrestricted parameter spaces. For β →0, the LODβ reduces to the classical LOD criterion. The threshold value to reject the null hypothesis can be computed by permutation test [19].
Generalized form of the formulas for QTL mapping
The formulas for robust estimators of genetic parameters were elaborately described in related publications [12, 13]. In this section, we described the generalized forms of the robust estimators for QTL mapping. If k marker intervals (k putative QTLs) are considered jointly in mapping, the dimension of the genetic design matrix D augment to 2k × k for a backcross population and to 3k × 2k for an F2 population when epistasis is ignored. Let us consider a backcross population as an example to see how to use these general formulas for other genetic models and populations. If we want to consider three marker intervals (three putative QTLs) simultaneously and use an additive model, the genetic model can be defined as
(7)
where G111, G110, G101, G100, G011, G010, G001 and G000 represent genotypic values of QTL genotypes AABBCC, AABBCc, AABbCC, AABbCC, AABbCc, AaBBCC, AaBBCc, AaBbCC and AaBbCc respectively. The notations a1, a2 and a3 represent the effects of QTLs A, B, and C respectively. The genetic design matrix D with dimension 8 × 3 specifies that the corresponding likelihood is a mixture of 8 normal densities and has 3 genetic parameters (excluding μ) to be estimate. Accordingly, the matrix of β weighted posterior probabilities Πβ, defined in the related publication [12], is an n × 8 matrix. In deriving the β-estimators, in the E-step Πβ of the eight QTL genotypes are updated, and in the M-step the following equations
(8)
(9)
(10)
are applied to maximize Q(θ | θ(t)). Here,
δ is an indicator variable. For details about Q(θ | θ(t)) and Πβ, see the related publication [12]. To infer the joint conditional probability matrix Q for the three putative QTLs, we use the property that if there is no interference in crossing over, the conditional distributions of the individual putative QTL genotypes given the flanking marker genotypes are independent, irrespective of whether the QTLs are linked or not. This independence property simplifies the inference of Q matrix. If pair-wise epistasis of QTLs A × B, A × C, and B × C are also analyzed, the dimensions of genetic design matrix D in Eq (7) augment to 8 × 6. Columns 4, 5, and 6, which are the products of columns 1 and 2, 1 and 3, and 2 and 3, of the genetic design matrix represents the status of the epistatic parameters of different genotypes. The Q matrix is the same as that for the additive model. If higher order of QTL epistasis is considered, the corresponding column vector of D can be extended by the same procedure. For more details discussion about general formulas for CIM algorithm, please see the paper of Kao and Zeng [20].
Robustness
Let G be the distribution function of g, then we can view the β-estimator, which is a function of G defined by
(11)
where
(12)
is the objective function. Therefore, the robustness of the β-estimator can be investigated by the influence function. The influence function (IF) for the β-estimator at y under the distribution function G is defined as
(13)
where Δy is the probability measure that puts mass 1 at the point y. An estimator is said to be B-robust if its influence function is a bounded function of y [21]. Since the β-estimator satisfies the properties of M-estimator, the influence function for the robust estimator also can be written as
(14)
where ψβ(y;θ) = ∂Ωβ(y;θ)/∂θ is the estimating function for the β-estimator and
(15)
is a matrix which does not depend on y; thus, the B-robustness is equivalent to the boundness of the estimating function for the M-estimator as well as the β-estimator [14, 22]. To prove the boundedness of the estimating function ψβ(y;θ) for the β-estimator, let us consider the general form of estimating function as defined by
(16)
Obviously, the boundedness of the estimating function depends only on the second term of the right-hand side of (16), since Cβ(θ) is independent on observations. In the second term of the right-hand side of (16), we have
(17)
(18)
(19)
(20)
Thus, we can conclude that if β > 0, then all components of estimating function are bounded with respect to y and X. This is because all the terms corresponding to the Eq (17–20) are of the form with f(z) being polynomial in z, which is bounded in
. In case of β = 0 or equivalently the maximum likelihood estimator, all the terms corresponding to the Eq (17–20) become unbounded. Typically, for example,
. Thus, we may conclude that the estimating function ψβ(z;θ) for the β-estimator is bounded for β > 0. Therefore, the β-estimators are B-robust against outliers.
Selection of the tuning parameter β
The value of the tuning parameter β plays a key role in the performance of the BetaCIM method. It controls the trade-off between robustness and efficiency of estimators. This method shows good performance for a wide range of β. A large β decreases the efficiency and increases the robustness of an estimator, and vice-versa for the smaller β. However, an optimum value for β depends on the initialization of model parameters, data contamination rates, type of data contamination, type of datasets and so on. So heuristic selection of the tuning parameter β, may produces misleading results in some satiations. To find an optimum β for minimum β-divergence method, Mollah et al. [16, 17] used β-divergence with a fixed value β0 of β as a measure for evaluation of the minimum β-divergence estimators. In this paper, we also use the same measure for β selection using cross validation. To define the measure for β selection using K-fold cross validation, the entire dataset into K subsets
,
, …,
and let
. Then the measure for β selection by K-fold cross validation can be defined by
(21)
where
are estimated using dataset
and
(22)
with
(23)
where nk is the number of observation in the subset
. Under null hypothesis the distribution function is
, where
. We select an appropriate β by the minimizer of
for β. To compute β-LOD score for testing the evidence of a QTL in a specific position in a chromosome, we need to select β by cross validation under null hypothesis and alternative hypothesis separately. So, it takes a lot of time to compute the genome-wide β-LOD scores. However, we can use the same β by cross validation under null hypothesis, for each of alternative cases also to save the computational time, since the estimators show good performance for a wide range of β > 0 also. If cross validation results find the value of β significantly larger than 0, it indicates that outliers contaminate the dataset. If cross validation results find the value of β very close to 0, it indicates that the dataset is not contaminated by outliers and the robustified CIM algorithm reduce to the traditional CIM.
Results
Simulation study
Simulations studies were conducted for Backcross (BC) and Intercross (F2) populations based on the assumptions of multiple linked and unlinked QTLs, and in presence and absence of phenotypic outliers. LOD scores were calculated for classical interval mapping and composite interval mapping approaches and β-LOD scores were calculated for BetaCIM approach that is equivalent to classical LOD scores for β→0. Trait specific optimum β was selected by using k-fold cross-validation procedure, implemented in our BetaCIM R-package. We simulated genotype data by using the implemented functions (sim.map and sim.cross) of popular R/qtl package. Then by setting effects corresponding to some specific markers the phenotypic data sets were generated. Mean sum of squares of the parameters were calculated by using the following formula-
Detection power of the QTL was calculated as the numbers of times calculated LOD scores of a QTL position exceed the significant threshold level divided by the number of simulations. Threshold value of the test statistic was calculated by using permutation test.
Multiple unlinked QTLs
We simulated data for backcross population by assuming the QTLs located far enough to be linkage equilibrium to each other’s. Four QTLs situated in different chromosomes were considered for simulation study. Four chromosomes each with fifteen markers separated in 10cM intervals were simulated. Phenotypic data sets were generated by assuming some of the specific marker positions (Chromosome-Marker: C1M3, C2M6, C3M4 and C4M4) as QTLs. Variations of phenotypic data was contributed 50% by QTLs and 50% by random error. We generated 100 simulated data and analyzed by using IM, CIM, and BetaCIM approaches, and then plotted the average LOD scores (Fig 1). Sample size for each simulation was 300. Moreover, simulation was conducted with 5% contaminated phenotype data to investigate the robustness property of classical approaches and BetaCIM approaches. Same approach was used to generate contaminated data described in previous publications [12, 15], randomly selected 5% of the observations and then added random numbers.
(A) Analysis results in absence of outliers; (B) analysis results in presence of 5% outliers. Threshold for each method were calculated using permutation test with 1000 replicates.
Simulation results showed that all approaches could identify the true loci in absence of phenotypic outliers, and provided good estimates of genetic parameters with smaller standard error (Table 1).
Although, detection powers of 3 QTL positions were similar for IM, CIM and BetaCIM approaches, for QTL C4M4 detection power of IM approach was low (power = 64%) as compared to others two methods. Therefore, in the case of multiple unlinked QTL, detection power of the QTLs is better for CIM and BetaCIM approaches as compared to IM. Moreover, in the case of 5% phenotypic outlying observations, IM and CIM approaches failed to detect true QTL positions (Fig 1B). Detection powers of these approaches were very low (8~27% for IM, and 6~31% for CIM), as well as standard errors of the parameters were very high, implying the estimated genetic effects highly varied across different simulated data sets (Table 1). Interestingly, the BetaCIM method provides consistent results, just as there is no outlying observations, identified all QTL positions with high power and provided good estimates of genetic parameters with smaller standard errors (Table 1). Therefore, BetaCIM approach significantly improved the performance over classical CIM approach in the case of multiple unlinked QTLs and presence of phenotypic outliers.
Multiple linked QTLs
One of the crucial properties of CIM is the ability of identifying multiple linked QTLs. In this scenario, we conducted simulation with similar parameter setting as previous publication for CIM [6]. We simulated data from four chromosomes, each with 16 markers and separated in 15 10cM intervals. The traits were controlled by 10 QTLs with positions and effects given in Table 2. Among the 10 QTLs, 9 were in 1st three chromosomes and 1 was in chromosome 4.
Together, the QTLs account for 50% and 70% of the phenotypic variance for two different sets of simulated data. Sample size was 300. In Fig 2, we plotted the average LOD scores for 300 simulations under two different scenarios, with and without phenotypic outliers, and for two different heritability settings. Results showed that BetaCIM can provide similar LOD score profile as CIM approach in absence of phenotypic outliers (Fig 2A and 2C).
Total genetic heritability was set as 50% and 70% for two different sets of simulated data. 5%-contaminated data was added to phenotypes to investigate the robustness property of the approaches. (A) Analysis results in the case of 50% genetic heritability and no outlying observation; (B) analysis results in the case of 50% genetic heritability and 5% outlying observations; (C) Analysis results in the case of 70% genetic heritability and no outlying observation; (D) analysis results in the case of 70% genetic heritability and 5% outlying observations. Threshold for each method were calculated using permutation test with 1000 replicates.
With 50% genetic heritability both approaches identified 6 QTLs out of 10 QTLs at 5% level of significance (Fig 2A). These approaches provided lower picks at others five true QTL positions. Although, with 70% genetic heritability both approaches detected one additional QTL positions, referring that increasing of genetic heritability of the phenotypic traits can increase the detection power of the approaches. IM approach detected 4 true QTL positions with 50% and 70% genetic heritability respectively. This approach provided wider picks, failed to separate multiple linked QTLs and detected several wrong QTL positions (Fig 2A and 2C). Therefore, in the case of multiple linked QTLs, CIM and BetaCIM provided better result than IM approach.
Simulation with single replicate showed that IM approach could provide larger mean sum square error (MSE = 0.124) compared to CIM and BetaCIM approaches, referring that parameter estimation of IM approach could be biased in the case of multiple linked QTL. CIM and BetaCIM approaches provided similar estimates of the genetic parameters and equal mean sum square error (MSE = 0.007), indicating the CIM and BetaCIM approaches could provide better results compared to the IM approach in the case of multiple linked QTL.
With 5% contaminated observations, CIM and IM failed to detect true QTL positions (Fig 2B and 2D). However, BetaCIM provided similar results as without outliers, detected the true QTL positions (Fig 2B and 2D). Therefore, BetaCIM is robust against phenotypic contamination, can detect true QTL positions in presence and absence of outliers. Outlying observations had large impact on parameter estimation of IM and CIM approaches. Estimation was biased upward or downward for these approaches (Table 2), and provided larger MSE. BetaCIM approach significantly improved the performance in effects estimation and identifying true QTLs in the case of outlying observations, provided smaller MSE. Simulation with F2 population provided similar results for the approaches (Text A, Figs A and B, and Table A in S1 File).
Beta selection for QTL analysis
In simulation study, we observed that BetaCIM approach significantly improved the analysis results in presence and absence of phenotypic outliers. Tuning parameter β plays key role for controlling effects of contaminated data. Value of β depends on proportion of contamination in phenotypic data; its value can increase with respect to increasing outlying observations. We calculated the average value of β with different proportion of phenotypic outliers (Fig 3).
Triangles are locating at the median optimum value of beta. (A-E) plots for β selection by cross-validation in presence of 0%, 2%, 5%, 10% and 20 outliers, respectively.
In 100 simulations without phenotypic outliers, the median of optimum values of the tuning parameter β was 0.001, referring that without phenotypic outlier optimum value of β is very small. We observed that optimum value of the tuning parameter β increase with respect to increasing outlying observations. For example, in the case of 2~20% of outlying observations median of optimum values of β were varies to 0.041~0.291. Therefore, selecting trait specific tuning parameter β by using cross validation is crucial for the BetaCIM approach. Selecting the tuning parameter could be useful for real data analysis that may help in efficient estimation of QTL positions and effects. Implemented function of “BetaCIM” R-package can select trait specific optimum beta for analysis.
Real data analysis
Data from an experiment on multiple traits in the mouse was downloaded from mouse phenome database (https://phenome.jax.org/projects/Feng1). The data was for an intercross between 129S1/SvlmJ and A/J inbred mouse strains. There were several phenotypic traits scored in the cross. We analyzed left and right kidney weight of mouse to identify the QTLs underlying these traits. There were total 336 intercross individuals, aged 8 weeks, and typed at 91 markers. For more details about the data see the related publication [23]. IM, CIM and BetaCIM approaches were used for analyses (Fig 4).
(A, D) Plots of phenotypic distributions for right and left kidney weights, respectively; (B, E) Plots for β selection by cross-validation from left and right kidney weights, respectively. (C, F) LOD score profiles for left and right kidney weights, respectively. Threshold for each method were calculated using permutation test with 1000 replicates.
From Fig 4A, we observed that the right kidney weight (RKW) data symmetrically distributed and there had no extreme observations. None of the phenotypic observation was larger than Q3+3*IQR or smaller than Q1-3*IQR, where Q1, Q3 and IQR are first quartile, 3rd quartile, and inter quartile range of RKW data. In this case, optimum value of beta was 0.001 that also indicated there had no contaminated observations in the data set. Analysis with CIM and BetaCIM approaches provided similar LOD score profile and identified 3 QTL positions at 5% level of significance (Fig 4C). Highest picks of the identified QTL positions by using CIM and BetaCIM approaches were in 42.14 cM of chromosome 9 (LOD = 5.18, LODβ = 5.17) with nearby marker rs3676158 (position = 40.876 cM, LOD = 5.16, LODβ = 5.15); in 67.941 cM of chromosome 10 at the marker rs3674646 (LOD = 3.90, LODβ = 3.89); and in 45.055 cM of chromosome 13 at the marker rs3716022 (LOD = 3.53, LODβ = 3.42). IM approach also identified three QTL positions; the highest significant picks were 45.132 cM of chromosome 9 (LOD = 3.79) with nearby marker rs3676158 (LOD = 3.67), in 67.941 cM of chromosome 10 at the marker rs3674646 (LOD = 3.46); and 35.299 cM of chromosome 13 within the marker interval rs3676930 (position 24.6737 cM) and rs3716022 (position 45.055 cM). The candidate markers rs3676158 and rs3674646 are the intron variants of genes Unc13c, and Grip1 respectively, however gene information of another variant rs3716022 is unknown. Unc13c is responsible for an additional step of molecular and/or positional "superpriming" that substantially increases the efficacy of Ca(2+)-triggered release [24]. Its play a role in vesicle maturation during exocytosis as a target of the diacylglycerol second messenger pathway, and may be involved in the regulation of synaptic transmission at parallel fiber (http://www.uniprot.org/uniprot/Q8K0T7). Grip1 play a role as a localized scaffold for the assembly of a multiprotein signaling complex and as mediator of the trafficking of its binding partners at specific subcellular location in neurons (http://www.uniprot.org/uniprot/Q925T6).
We analyzed the left kidney weight (LKW) data of the same population. In this case, there was evidence of one outlying phenotypic observation (phenotypic value of one individual observation was larger than Q3+3*IQR, where Q3 is the third quartile and IQR is interquartile range of LKW data). Optimum value of the tuning parameter β was larger than before (opt β = 0.041), indicated there had some contaminated observations in the data set. For LKW, IM and CIM approaches identified none of the QTL positions at 5% level of significance (Fig 4F). However, BetaCIM approach identified 2 QTL positions, which are 12.43~21.43 cM of chromosome 5 and 20.30~29.30 cM of chromosome 13. LOD scores for CIM and IM approaches at the QTL positions were high but not significant at 5% level of significance (Fig 4F). These two QTLs hold the markers rs3023765 (position = 16.550 cM, LODβ = 3.78) and rs3676930 (position = 24.6737 cM, LODβ = 4.54), which are the variant of the genes Otof and A330033J07Rik. Mutations in human orthologous of mice Otof gene are a cause of neurosensory nonsyndromic recessive deafness, hearing loss [25]. Individual with moderate chronic kidney disease (CKD) have a higher prevalence of hearing loss than those of the same age without CKD (https://www.kidney.org/news/ekidney/november10/HearingLoss_November10). Mice lacking Otof display hearing loss. It expressed in the cochlear IHC, vestibular type I sensory hair cells, eye, heart, skeletal muscle, liver, kidney, lung and testis (http://www.uniprot.org/uniprot/Q9ESF1). Another novel gene A330033J07Rik also expressed in kidney (https://www.ncbi.nlm.nih.gov/geoprofiles/7902881). Therefore, the candidate genes may have relevant function for kidney weight.
Discussions
This paper discusses the robustification of CIM algorithm for identification of both linked and unlinked QTLs by maximizing β-likelihood function using the EM like algorithm. The value of the tuning parameter β plays a key role on the performance of the BetaCIM method. An optimum value for the tuning parameter β can be selected by using k-fold cross validation. Simulation studies showed that the value of tuning parameter β increase with respect of increasing proportion of outlying observations (Fig 3). Therefore, optimum value of the tuning parameter depends on data contamination rate, could vary across real data sets. We implemented the cross-validation procedure of selecting trait specific optimum β in BetaCIM R-package that could help to select trait specific value of the tuning parameter. In simulations, we observed that the BetaCIM approach significantly improved the performance over IM and CIM approaches in presence of outliers; otherwise, it keeps equal performance with CIM. This approach can identify all the QTL positions that can be detected by using CIM in absence of phenotypic outliers. And in presence of phenotypic outliers only BetaCIM approach can provide consistent results (Figs 1 and 2). CIM and IM approaches could fail to detect true QTL positions in presence of phenotypic outliers.
Again, simulation with 100 replicates, we observe that BetaCIM approach can provide reliable estimates of genetic parameters in presence and absence of phenotypic outliers (Table 1). The CIM and BetaCIM can provide smaller mean sum square error (MSE) compared to the IM approach in absence of phenotypic contamination (Table 2). Interestingly, in presence of phenotypic contamination MSE of CIM approach can be larger than IM approach, might due to biased estimations of the effects of background genetic makers that use as cofactor. IM approach does not use the background markers as cofactor. Therefore, genetic parameter estimation by using CIM approach is more sensitive to outliers compared to IM approach. However, in presence of phenotypic outliers BetaCIM provided smaller MSE compared to IM and CIM approaches, thus this approach overcomes the deficiency of CIM approach. Again, QTL detection powers of the BetaCIM approach are similar in absence and presence of phenotypic outliers, whereas detection power was very small for CIM and IM approaches in presence of phenotypic outliers (Table 1). In real situations, phenotypic data might be contaminated by different environmental exposures [11], as well as may contain some measurement errors. A robust approach can provide reliable results in real situations.
We analyzed two real data sets: right kidney weight (RKW) and left kidney weight (LKW) of mouse intercross population [23]. In real data analysis, trait specific optimum β were equal to 0.001 and 0.041 for two different datasets, indicating the presence of some unusual observations in the LKW data set (Fig 4B and 4E). After checking the phenotypic observations, we observed that there was evidence of outlying observation in LKW, but not in RKW. CIM and BetaCIM approaches provided similar LOD score profile for RKW, but different for LKW. The CIM and IM approach failed to detect the QTL positions for LKW, might due to presence of phenotypic contamination, whereas BetaCIM approach identified two QTL positions at 5% level of significance. Identified candidate genes of these QTL regions were expressed in Kidney. Simulation and real data analysis showed that BetaCIM approach could be useful to robustly identify the QTL positions and unbiased estimate of genetic parameters of experimental populations.
Supporting information
S1 File. Includes Text A, Table A, Fig A, and Fig 2.
- Text A in S1 File. Simulations with Intercross Population.
- Table A in S1 File. Detection power of different approaches for multiple linked QTLs in the case of F2 population. Results were calculated from 500 simulations. IM: Interval Mapping approach; CIM: Composite Interval Mapping approach; BetaCIM: Beta likelihood based composite interval mapping approach.
- Fig A in S1 File. Results from 100 simulations for F2 population in the case of unlinked QTLs. (A) without phenotypic outliers, and (B) with 5% phenotypic outliers. Threshold for each method were calculated using permutation test with 1000 replicates.
- Fig B in S1 File. Results from 100 simulations for F2 population in the case of multiple linked QTLs. (A) without phenotypic outliers, and (B) with 5% phenotypic outliers. Threshold for each method were calculated using permutation test with 1000 replicates.
https://doi.org/10.1371/journal.pone.0208234.s001
(DOC)
Acknowledgments
The authors thank the investigators of the mouse data sets used in this study for demonstrating the useful of the BetaCIM approach.
References
- 1. Calderon CI, Yandell BS, Doebley JF. Fine Mapping of a QTL Associated with Kernel Row Number on Chromosome 1 of Maize. PloS one. 2016;11(3):e0150276. Epub 2016/03/02. pmid:26930509; PubMed Central PMCID: PMC4773258.
- 2. Foiada F, Westermeier P, Kessel B, Ouzunova M, Wimmer V, Mayerhofer W, et al. Improving resistance to the European corn borer: a comprehensive study in elite maize using QTL mapping and genome-wide prediction. TAG Theoretical and applied genetics Theoretische und angewandte Genetik. 2015;128(5):875–91. Epub 2015/03/12. pmid:25758357.
- 3. Lander ES, Botstein D. Mapping mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics. 1989;121(1):185–99. Epub 1989/01/01. pmid:2563713; PubMed Central PMCID: PMC1203601.
- 4. Thoday JM. Location of Polygenes. Nature. 1961;191(478):368–&. WOS:A19618738B00567.
- 5. Sax K. The Association of Size Differences with Seed-Coat Pattern and Pigmentation in PHASEOLUS VULGARIS. Genetics. 1923;8(6):552–60. Epub 1923/11/01. pmid:17246026; PubMed Central PMCID: PMC1200765.
- 6. Zeng ZB. Precision Mapping of Quantitative Trait Loci. Genetics. 1994;136(4):1457–68. WOS:A1994NB82100022. pmid:8013918
- 7. Li H, Ye G, Wang J. A modified algorithm for the improvement of composite interval mapping. Genetics. 2007;175(1):361–74. Epub 2006/11/18. pmid:17110476; PubMed Central PMCID: PMC1775001.
- 8. Zeng ZB. Theoretical Basis for Separation of Multiple Linked Gene Effects in Mapping Quantitative Trait Loci. P Natl Acad Sci USA. 1993;90(23):10972–6. WOS:A1993MK09400019.
- 9. Jansen RC, Stam P. High-Resolution of Quantitative Traits into Multiple Loci Via Interval Mapping. Genetics. 1994;136(4):1447–55. WOS:A1994NB82100021. pmid:8013917
- 10. Fernandes E, Pacheco A, Penha-Goncalves C. Mapping of quantitative trait loci using the skew-normal distribution. Journal of Zhejiang University Science B. 2007;8(11):792–801. Epub 2007/11/02. pmid:17973340; PubMed Central PMCID: PMC2064950.
- 11. Hayat Y, Yang J, Xu HM, Zhu J. Influence of outliers on QTL mapping for complex traits. Journal of Zhejiang University Science B. 2008;9(12):931–7. Epub 2008/12/11. pmid:19067460; PubMed Central PMCID: PMC2596284.
- 12. Mollah MNH, Eguchi S. Robust Composite Interval Mapping for QTL Analysis by Minimum beta-Divergence Method. IEEE International Conference on Bioinformatics and Biomedicine. 2008:115–20.
- 13. Mollah MNH, Eguchi S. Robust QTL analysis by minimum β-divergence method. International Journal of Data Mining and Bioinformatics. 2010;4(4):471. pmid:20815143
- 14. Mihoko M, Eguchi S. Robust blind source separation by beta divergence. Neural computation. 2002;14(8):1859–86. pmid:12180405
- 15. Mollah MNH. Detection of Quantitative Trait Loci in Presence of Phenotypic Contamination. International Journal of Biometrics and Bioinformatics. 2010;4(2):13–21.
- 16. Mollah MNH, Eguchi S, Minami M. Robust prewhitening for ICA by minimizing β-divergence and its application to FastICA. Neural Processing Letters. 2007;25(2):91–110.
- 17. Mollah MNH, Sultana N, Minami M, Eguchi S. Robust extraction of local structures by the minimum β-divergence method. Neural Networks. 2010;23(2):226–38. pmid:19963342
- 18. Doerge RW. Mapping and analysis of quantitative trait loci in experimental populations. Nature reviews Genetics. 2002;3(1):43–52. Epub 2002/02/02. pmid:11823790.
- 19. Churchill GA, Doerge RW. Empirical threshold values for quantitative trait mapping. Genetics. 1994;138(3):963–71. pmid:7851788
- 20. Kao CH, Zeng ZB. General formulas for obtaining the MLEs and the asymptotic variance-covariance matrix in mapping quantitative trait loci when using the EM algorithm. Biometrics. 1997;53(2):653–65. Epub 1997/06/01. pmid:9192455.
- 21.
Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA. Robust Statistics: The Approach Based on Influence Functions. Robust Statistics: John Wiley & Sons, Inc.; 2005.
- 22. Fujisawa H, Eguchi S. Robust estimation in the normal mixture model. Journal of Statistical Planning and Inference. 2006;136(11):3989–4011.
- 23. Feng M, Deerhake ME, Keating R, Thaisz J, Xu L, Tsaih SW, et al. Genetic analysis of blood pressure in 8 mouse intercross populations. Hypertension. 2009;54(4):802–9. Epub 2009/08/05. pmid:19652078; PubMed Central PMCID: PMC2854560.
- 24. Ishiyama S, Schmidt H, Cooper BH, Brose N, Eilers J. Munc13-3 superprimes synaptic vesicles at granule cell-to-basket cell synapses in the mouse cerebellum. The Journal of neuroscience: the official journal of the Society for Neuroscience. 2014;34(44):14687–96. Epub 2014/10/31. pmid:25355221.
- 25.
Shearer AE, Smith RJH. OTOF-Related Deafness. In: Pagon RA, Adam MP, Ardinger HH, Wallace SE, Amemiya A, Bean LJH, et al., editors. GeneReviews(R). Seattle (WA)1993.