JDS, JMA, and LK conceived and designed the experiments, analyzed the data, and wrote the paper.

The authors have declared that no competing interests exist.

With the ability to measure thousands of related phenotypes from a single biological sample, it is now feasible to genetically dissect systems-level biological phenomena. The genetics of transcriptional regulation and protein abundance are likely to be complex, meaning that genetic variation at multiple loci will influence these phenotypes. Several recent studies have investigated the role of genetic variation in transcription by applying traditional linkage analysis methods to genomewide expression data, where each gene expression level was treated as a quantitative trait and analyzed separately from one another. Here, we develop a new, computationally efficient method for simultaneously mapping multiple gene expression quantitative trait loci that directly uses all of the available data. Information shared across gene expression traits is captured in a way that makes minimal assumptions about the statistical properties of the data. The method produces easy-to-interpret measures of statistical significance for both individual loci and the overall joint significance of multiple loci selected for a given expression trait. We apply the new method to a cross between two strains of the budding yeast

Complex traits are frequently under control of multiple loci. This new method simultaneously maps multiple gene expression quantitative trait loci and assesses their significance within

Genetic linkage analysis has traditionally been applied to one or very few traits at a time. It is now possible to simultaneously measure thousands of related “traits” from high-throughput technologies such as DNA [

Existing linkage analysis techniques have already been applied to genomewide expression in yeast, mice, maize, and humans [

Although several other approaches exist for mapping multiple loci that are linked to a quantitative trait, none of these methods allows the individual and

To address these issues, we have developed a new method for mapping multiple loci and identifying epistatic interactions when analyzing thousands of phenotypes, such as gene expression levels. Information shared across expression traits is employed in a way that allows us to make minimal assumptions about the statistical properties of the data. Our method permits easy-to-interpret statistical significance analysis of individual loci, as well as the overall joint significance of multiple loci identified for any given expression trait. Strengths of both the model selection and composite interval mapping methods have been incorporated, which turns out to be more straightforward when analyzing many traits simultaneously. Rather than trying to estimate the true model underlying the expression trait by seeking the “best model,” or by assuming a certain model of genetic background and testing for the inclusion of additional loci, we propose to measure

We applied the method to the

The data used in this study were derived from a cross between two haploid strains of the budding yeast

Initially, we applied an exhaustive 2D linkage scan in order to identify expression traits that are significantly linked to pairs of loci or that are significant for epistasis. In performing these significance tests, we considered a linear model that fully parameterizes the quantitative trait in terms of all four possible genotypes. This model can be written as

Traditionally, genetic linkage is said to exist between the trait and a pair of loci if any of the locus effects is significantly different from zero, even if not all of them are [

The test for pair-wise linkage was performed as follows. For each expression trait, a linear model was fit by least squares to each pair of loci. The locus pair with the largest F-statistic comparing the full model to the baseline model was selected for that trait. For the test of epistasis, a similar procedure was performed, except an F-statistic was computed that compared the full model to the purely additive model, which directly assesses the contribution of the interaction term. The significance of each locus pair selected for linkage was computed using a standard permutation technique against the null hypothesis of no linkage to either locus [
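The two F-tests described above can be illustrated with a minimal sketch. It assumes haploid genotypes coded 0/1 and simulated data; the function names (`rss`, `f_statistic`, `two_locus_designs`) are hypothetical helpers introduced here for illustration and are not taken from the authors' software.

```python
import numpy as np

def rss(y, X):
    """Residual sum of squares from a least-squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def f_statistic(y, X_small, X_big):
    """F-statistic comparing a nested smaller model to a larger model."""
    n = len(y)
    p_small, p_big = X_small.shape[1], X_big.shape[1]
    rss_small, rss_big = rss(y, X_small), rss(y, X_big)
    return ((rss_small - rss_big) / (p_big - p_small)) / (rss_big / (n - p_big))

def two_locus_designs(g1, g2):
    """Baseline, additive, and full (with interaction) design matrices."""
    one = np.ones_like(g1, dtype=float)
    baseline = np.column_stack([one])
    additive = np.column_stack([one, g1, g2])
    full = np.column_stack([one, g1, g2, g1 * g2])
    return baseline, additive, full

# Simulated example (all values are made up for illustration).
rng = np.random.default_rng(0)
n = 100
g1 = rng.integers(0, 2, n).astype(float)
g2 = rng.integers(0, 2, n).astype(float)
y = 1.0 * g1 + 0.5 * g2 + 1.5 * g1 * g2 + rng.normal(0.0, 1.0, n)

baseline, additive, full = two_locus_designs(g1, g2)
f_linkage = f_statistic(y, baseline, full)     # full model vs. baseline model
f_epistasis = f_statistic(y, additive, full)   # full model vs. additive model
print(f_linkage, f_epistasis)
```

In an exhaustive scan, `f_linkage` would be computed for every locus pair and the maximum retained for each trait; the significance of that maximum would then come from permutations rather than from the standard F distribution.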

The exhaustive 2D search proved to be unsatisfactory for a number of reasons. First, the number of multiple-locus models that have to be considered is computationally and statistically challenging for pairs of loci, and prohibitive for three or more loci. With 3,312 markers and 6,216 expression traits, one has to consider more than 18 million single-locus models simply to test for linkage between every expression trait and every locus. More than 27 billion tests have to be performed to consider all two-locus models for every expression trait, and more than 27 trillion tests to consider all three-locus models for every expression trait. In addition, it is likely that by searching over so many models, the statistical power to detect linkage is severely attenuated because of the multiple comparisons problem. Second, when employing an exhaustive 2D scan, there is no statistically rigorous method to test for joint linkage, which exists only if both loci have nonzero terms in the full model. In other words, the significance of an individual locus selected for an expression trait is confounded with the overall significance of the pair of loci. Since
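The scale of the search space can be checked with a short calculation. This sketch assumes that each unordered set of distinct markers counts as one model; under that assumption the direct counts are at least as large as the rounded lower bounds quoted above.

```python
from math import comb

n_markers = 3312   # markers, as stated in the text
n_traits = 6216    # expression traits, as stated in the text

tests = {}
for k in (1, 2, 3):
    n_models = comb(n_markers, k)        # unordered sets of k distinct markers
    tests[k] = n_models * n_traits       # one test per model per trait
    print(f"{k}-locus: {n_models:,} models per trait, {tests[k]:,} tests in total")
```

The count of two-locus models per trait also matches the "more than 5 million" figure that a sequential search avoids.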

One potential way to improve the exhaustive 2D scan is to use another method for selecting pairs of loci. In particular, one can select loci in a sequential manner, cutting down the number of models considered to 2 × 3,312, instead of more than 5 million. One readily available method for selecting loci in a sequential manner is forward stepwise regression. Here, one selects a primary locus that shows the most significant one-dimensional (1D) linkage, i.e., the one that has the largest LOD score. This is equivalent to identifying the locus that yields the smallest residual sum of squares when regressing the expression trait on the inheritance pattern at that locus [

One can use this forward stepwise regression technique simply as a way to select pairs of loci for each trait, and then the significance analysis can be repeated as before. It has been hypothesized that failing to consider all possible two-locus models through an exhaustive 2D search (e.g., selecting loci in a sequential fashion) may lead to a loss in power or to missing important interactions between loci [

The number of significant traits over a range of

(A) Plot of the number of traits significant for linkage versus the

(B) Plot of the number of traits significant for epistasis versus the

The sequential search was also more powerful for identifying epistasis relative to the exhaustive 2D scan.

Therefore, for this particular experiment, the sequential search is more powerful than the exhaustive 2D search in identifying pair-wise linkage and detecting epistasis. The sequential search also appears to extract a biological signal that is similar to that from the 2D search. However, it is not possible to conclude whether these properties would hold in other experiments or for different sample sizes. Also, the comparison was made based on significance assessed against the null hypothesis of no linkage, which is not a solution to the problem of detecting joint linkage. The sequential approach as implemented above still suffers from the problem that significance can be driven by a single locus while the other locus is a false positive. However, sequentially selecting loci allows their individual significance to be assessed, which we show is crucial in detecting true joint linkage. We discuss how to assess individual and joint significance for the sequential approach below; we note that the same methods would not work without a number of potentially unjustifiable assumptions for the exhaustive 2D search.
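The sequential search compared above can be sketched as follows. This is a minimal illustration on simulated data, assuming 0/1 haploid genotypes, no duplicated marker columns, and a secondary locus chosen by the F-statistic for the full two-locus model (with interaction) versus the one-locus model; `sequential_two_locus` and its helpers are hypothetical names, not the authors' implementation.

```python
import numpy as np

def _rss(y, X):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def _f(y, X0, X1):
    """F-statistic for nested models X0 (smaller) and X1 (larger)."""
    n, p0, p1 = len(y), X0.shape[1], X1.shape[1]
    r0, r1 = _rss(y, X0), _rss(y, X1)
    return ((r0 - r1) / (p1 - p0)) / (r1 / (n - p1))

def sequential_two_locus(y, G):
    """Select a primary and then a secondary locus from genotype matrix G (samples x loci)."""
    n, m = G.shape
    one = np.ones(n)
    M0 = one[:, None]
    # Stage 1: locus with the largest F-statistic for M1 versus M0.
    f1 = np.array([_f(y, M0, np.column_stack([one, G[:, j]])) for j in range(m)])
    primary = int(np.argmax(f1))
    g1 = G[:, primary]
    M1 = np.column_stack([one, g1])
    # Stage 2: locus with the largest F-statistic for M2 versus M1, given the primary locus.
    f2 = np.full(m, -np.inf)
    for j in range(m):
        if j == primary:
            continue
        g2 = G[:, j]
        M2 = np.column_stack([one, g1, g2, g1 * g2])
        f2[j] = _f(y, M1, M2)
    secondary = int(np.argmax(f2))
    return primary, float(f1[primary]), secondary, float(f2[secondary])

# Simulated example: loci 3 and 17 both affect the trait (made-up effect sizes).
rng = np.random.default_rng(1)
G = rng.integers(0, 2, size=(200, 30)).astype(float)
y = 3.0 * G[:, 3] + 1.5 * G[:, 17] + rng.normal(size=200)
primary, f_primary, secondary, f_secondary = sequential_two_locus(y, G)
print(primary, secondary)
```

Only about twice the number of markers are fitted per trait, and each stage yields its own statistic, which is what allows the significance of each locus to be assessed separately.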

We developed a method to overcome the following problems associated with existing approaches: a prohibitively large number of multi-locus models are considered, a clear measure of significance among individual loci is not available, and the desired alternative hypothesis that

The overall probabilities of linkage from Step 3 provide a ranking of the traits from most significant to least significant. It is then necessary to select a set of traits, each of which has a high probability of being linked to all loci simultaneously. In order to guide this choice, we propose a method to assess the statistical significance of a given set of traits.

The starting point for the method is to define a multi-locus model that may include varying numbers of loci, where it is clear how one modifies the model to include an additional locus. Here, we continue to use the fully parameterized model. For zero, one, and two loci, the model may be written, respectively, as

where, for example, “locus1” is the main effect for the primary locus, and “locus1

The Bayesian posterior probability that the primary locus for each trait shows linkage can be written as

The above probabilities give a measure of significance to each locus. However, one would also like to know the

These joint-linkage probabilities can be used to select traits that are significant for having all loci jointly linked by calling all traits significant that have a joint-linkage probability exceeding some threshold. For example, all traits with

A trait is defined to be a “false discovery” for joint linkage if any of its selected loci is a false positive. In standard multiple hypothesis testing situations, the false discovery rate (FDR) has been estimated as the estimated number of false positives divided by the observed number of tests called significant. For a given threshold placed on the traits, it is straightforward to count how many are called significant, but it is not as easy to estimate the expected number of false positives because the null distribution of the joint-linkage probabilities is not available. However, when identifying pairs of loci for each trait, the probability that a trait is a false discovery is 1 −

where again the summation is taken over all traits called significant for a two-locus linkage. This estimate can be justified in the context of Bayesian representations of the FDR, but it also has connections to
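The estimate described here can be sketched in a few lines: among traits called significant, average one minus the joint-linkage probability. The sketch assumes that the joint probability is the product of the locus-specific conditional probabilities (a chain-rule assumption made for illustration); the function names and the input probabilities are hypothetical.

```python
import numpy as np

def joint_linkage_probability(locus_probs):
    """Product over selection stages of the locus-specific probabilities, one row per trait."""
    return np.prod(np.asarray(locus_probs, dtype=float), axis=1)

def estimate_fdr(joint_probs, threshold):
    """Estimated FDR among traits whose joint-linkage probability exceeds the threshold."""
    joint_probs = np.asarray(joint_probs, dtype=float)
    significant = joint_probs > threshold
    n_sig = int(significant.sum())
    if n_sig == 0:
        return 0.0, 0
    # Each significant trait contributes its probability of being a false discovery.
    fdr = float((1.0 - joint_probs[significant]).sum() / n_sig)
    return fdr, n_sig

# Made-up locus-specific probabilities for three traits (primary, secondary).
locus_probs = [[0.99, 0.95], [0.90, 0.50], [0.98, 0.90]]
joint = joint_linkage_probability(locus_probs)
fdr, n_sig = estimate_fdr(joint, threshold=0.84)
print(joint, fdr, n_sig)
```

Raising the threshold shrinks the significant set and lowers the estimated FDR, so the threshold can be tuned until a target FDR is reached.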

In practice, the locus-specific and joint-linkage probabilities must be estimated. Due to the massive amount of available data, we form nonparametric estimates of the probabilities rather than making assumptions about their distributions. At each stage of the locus selection, the strength of linkage is quantified by a standard F-statistic used to compare two models (M1 versus M0 or M2 versus M1). The statistics associated with the primary and secondary loci for each trait are the maximal F-statistics among all loci. Since these maximal statistics do not have a known null distribution, the null distributions are simulated. The quantitative trait values are permuted and the maximal statistics are recomputed [

A key aspect of our proposed approach is that loci are selected one at a time for each given expression trait. In the traditional approach, pairs of loci are selected together so that among these locus pairs, zero, one, or two loci may be truly linked. Therefore, it is not possible to model all three cases without making a number of assumptions. However, since we select only one locus at a time, there are only two possible outcomes at each selection step: the locus is either linked or not. The statistics calculated at each locus selection stage are therefore draws from a mixture of two distributions, one for linked loci and one for unlinked loci. The permutation null statistics represent one component of this mixture and can be used in conjunction with the observed statistics to conservatively estimate the locus-specific linkage probabilities.
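The permutation null statistics mentioned above can be simulated along the following lines for the primary-locus stage. This is a sketch under stated assumptions: 0/1 genotypes with no monomorphic markers, simulated data, and the single-locus F-statistic computed from the squared correlation, which is equivalent to the one-degree-of-freedom F-test in simple regression. The function names are hypothetical, and the secondary-locus null, which must condition on the primary locus, is not covered.

```python
import numpy as np

def max_f_single_locus(y, G):
    """Maximal single-locus F-statistic (one-locus model vs. baseline) over all columns of G."""
    n = len(y)
    yc = y - y.mean()
    Gc = G - G.mean(axis=0)
    r = (Gc.T @ yc) / np.sqrt((Gc ** 2).sum(axis=0) * (yc @ yc))
    r2 = r ** 2
    f = r2 * (n - 2) / (1.0 - r2)
    return float(f.max())

def permutation_null_max_f(y, G, n_perm=100, seed=0):
    """Null maximal F-statistics obtained by permuting the trait values."""
    rng = np.random.default_rng(seed)
    return np.array([max_f_single_locus(rng.permutation(y), G) for _ in range(n_perm)])

# Simulated example: locus 0 is linked to the trait (made-up effect size).
rng = np.random.default_rng(2)
G = rng.integers(0, 2, size=(100, 50)).astype(float)
y = 2.0 * G[:, 0] + rng.normal(size=100)

observed = max_f_single_locus(y, G)
null = permutation_null_max_f(y, G, n_perm=100, seed=3)
p_value = (1 + (null >= observed).sum()) / (1 + len(null))
print(observed, p_value)
```

Permuting the trait values breaks any association with genotype while preserving the correlation among markers, so the null maxima reflect the same search over loci as the observed maximum.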

The estimated density of the observed statistics is plotted (solid black). This density is modeled as a weighted mixture of probability densities corresponding to the “null” unlinked secondary loci (solid grey) and the “alternative” linked secondary loci (dashed grey). The estimated posterior probability of linkage is also shown (dashed black).
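The relationship between the curves in this figure follows from Bayes' theorem. As a sketch, write $g_0$ and $g_1$ for the null and alternative densities, $\pi_0$ and $\pi_1 = 1 - \pi_0$ for their mixture weights, and $F$ for an observed statistic; the symbol names are chosen here for illustration.

```latex
g(F) = \pi_0\, g_0(F) + \pi_1\, g_1(F),
\qquad
\Pr(\text{linked} \mid F)
  = \frac{\pi_1\, g_1(F)}{g(F)}
  = 1 - \frac{\pi_0\, g_0(F)}{g(F)} .
```

The last form needs only the null density, which the permutations supply, and the overall density of the observed statistics. If $\pi_0$ is overestimated, the estimated posterior probability of linkage can only decrease, which is the sense in which the estimate is conservative.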

We applied the proposed method for two-locus linkage analysis to the

Recall that when a more liberal definition of two-locus linkage was used, where only one locus was required to be linked, about 4,000 linkages were called significant at an FDR of 5%. However, in that situation it was not clear whether both loci or just a single locus were truly linked. Because we identify only 170 significant joint linkages at an FDR of 10%, it appears that many of the 4,000 significant linkages from the other approach were due to only a single locus being truly linked. When comparing our method to a traditional 1D linkage scan where the top two linkage peaks are taken as significant, we find 3.3 to 8.7 times more linkages at FDR cut-offs ranging from 1% to 10% (

To better understand the molecular mechanisms underlying the observed linkages for these traits, we searched for

Several previous linkage analyses of gene expression levels in yeast and other organisms have shown that linkages are nonrandomly distributed throughout the genome and tend to cluster into specific locations [

(A) A plot of the significant locus pair positions when each chromosome has been partitioned into equally sized bins less than or equal to 550 kb. The number of significant traits showing linkage to locus pairs in each pair-wise bin is denoted. The number on each axis indicates the chromosome number; a dash denotes a bin division.

(B) A plot constructed analogously to (A), except bins less than or equal to 50 kb are used, and only bins with three or more traits significant for joint linkage are numbered.

Another interesting observation that emerges from the spatial distribution of joint linkages is that distinct groups are connected by a common linkage peak. For example, of the 10 pair-wise bins with three or more linked traits, there are three that share a common linkage to the exact same position on Chromosome 15 (

We developed a new, computationally efficient statistical method for simultaneously mapping multiple QTL. Whereas conventional linkage analysis has been widely and successfully applied to study one or very few traits at a time, our method is appropriate for analyzing thousands of phenotypes. Pairs of loci were identified sequentially rather than by considering all possible combinations, an approach that proved empirically more powerful. The model used to select pairs of loci included an interaction term allowing for possible epistasis. This sequential approach will of course miss locus pairs with primarily epistatic effects (i.e., little or no main effect for either locus), and these may be biologically interesting or important. Also, we have not included any special modifications to handle the case where two QTL are closely linked, although such modifications are likely possible. Even though it is not likely that two-locus models give a complete picture of gene regulation [

A major challenge that our method overcomes is to assign joint significance to the pairs of loci. When identifying linked loci in a sequential manner, it is tempting to apply a readily available significance threshold at each stage. For example, existing

As technological advances in gene, protein, and metabolite profiling continue to be made, we anticipate that statistical methods such as the one proposed here will provide important insights into the genetic architecture of complex and quantitative traits.

These expression data have recently been reported elsewhere [_{2}(sample/BY reference), averaged over two dye-swapped arrays. Each array [

As previously reported [

All pairs of loci were tested for linkage to a given trait based on an F-statistic comparing the least-squares-fitted two-locus full model to the null model of no linkage. In order to ease the computational burden, we considered only 613 equally spaced loci and we did not consider any pairs of loci located on the same chromosome. For each trait, the pair of loci with the largest F-statistic was selected. A

In order to compare the power of the 2D and sequential selection procedures, the sequential locus selection procedure (described below) was also performed exactly as above, on the same loci, the same null permutations, etc. Therefore, the only aspect compared is the exact procedure used to choose a pair of loci. The

For each fixed trait

Let _{i}_{1} be the maximal F-statistic corresponding to the primary locus for each trait _{i}_{2} be the maximal F-statistic corresponding to the secondary locus for each trait _{ij}

Statistics from the null distributions were simulated by randomly permuting the ordering of the arrays and calculating a new maximal F-statistic for each trait [_{i}_{1}^{0b} and _{i}_{2}^{0b} for

The observed _{ij}_{i1}^{0b} are directly used to estimate the locus-specific linkage probabilities. Define ℓ_{ij}_{ij}_{i}_{1}, ℓ_{i}_{2}, … , ℓ_{i}_{L}. Since there is an enormous amount of data available, we can avoid making some of these assumptions. Let _{ij}_{ij}_{i}_{1} = 1, … , ℓ_{i,j}_{−1} = 1, Data) with _{ij}_{i}_{1} = 1, … , ℓ_{i,j}_{−1} = 1, _{ij}_{ij}_{ij}^{0b} are calculated under the assumption that ℓ_{i}_{1} = 1, … , ℓ_{i,j}_{−1} = 1 so this is a coherent formulation.

All of the information shown in _{i}_{1} have probability density functions _{0} and _{1}, respectively. Then if π_{0} of the primary loci are not linked and π_{1} are linked, a randomly selected _{i}_{1} follows the mixture density g = π_{0}_{0} + π_{1}_{1}. (The density functions _{0} and _{1}and prior probabilities π_{0} and π_{1} are not assumed to be the same at each locus selection stage. Also, if _{0} and _{1} as the average of these.) According to Bayes theorem, the posterior probability of linkage for the primary locus is

Since _{i}_{1} are observations from _{0}_{0} + π_{1}_{1} function, and the simulated null _{0}, these two sets of statistics can be used to estimate the likelihood ratio _{0}/_{0}(_{i}_{1} are called “successes” and the

The quantity π_{0} is estimated by

This estimate was originally formulated for use in estimating _{0}, thus providing a conservative estimate. Adjusting the tuning parameter _{0} has been suggested as

for _{i}_{2} and simulated null statistics

We ranked the traits for significance by the magnitude of the _{λ} be the set of traits called significant with this threshold and

Setting λ = 0.84, we estimate the FDR to be 10%. The estimate can be generalized to _{λ} by those traits where

Rather than motivating this estimate of the FDR from a model-based Bayesian framework [

In situations where the FDR can be written in this way, it has been shown in a variety of scenarios that estimates of the form

control the FDR as long as the estimated number of expected false discoveries is conservative as the number of traits gets large [_{0}(

In a traditional 1D linkage scan, a statistic is calculated at each marker and a significance threshold is applied to these in order to find markers showing significant linkage. It is possible that more than one linkage statistic will exceed the threshold. Therefore, one could view this procedure as a multiple locus linkage analysis. We compared this approach to our proposed method by thresholding the two top linkage peaks for each trait. (Peaks were considered to be distinct if they lay on different chromosomes.) A

The “

(44 KB PDF).

The genome was split into 50-kb pair-wise bins, and the number of significant linkages at FDR less than or equal to 10% falling into each bin was recorded. For any bin containing more than three linkages, the exact marker positions and expression traits are listed below.

(49 KB PDF).

The expression data reported in this paper have been deposited in the Gene Expression Omnibus (GEO) (

This work was supported in part by NIH grants R01 HG002913–01 (JDS) and R37 MH59520–06 (LK), National Science Foundation Postdoctoral Fellowship 0305916 (JMA), and by the Howard Hughes Medical Institute (LK). LK is a James S. McDonnell Centennial Fellow. We thank J. Whittle for generating microarray data, R. Brem for several useful discussions, and three anonymous referees for helpful comments on the manuscript.

1D: one-dimensional

2D: two-dimensional

FDR: false discovery rate

kb: kilobase

QTL: quantitative trait loci