## Abstract

### Background

The ultimate goal of genetic mapping of quantitative trait loci (QTL) is the positional cloning of genes involved in any agriculturally or medically important phenotype. However, only a small portion (≤ 1%) of the QTL detected have been characterized at the molecular level, despite the report of hundreds of thousands of QTL for different traits and populations.

### Methods/Results

We develop a statistical model for detecting and characterizing the nucleotide structure and organization of haplotypes that underlie QTL responsible for a quantitative trait in an F_{2} pedigree. The discovery of such haplotypes by the new model will facilitate the molecular cloning of a QTL. Our model is founded on population genetic properties of genes that are segregating in a pedigree, constructed with the mixture-based maximum likelihood context and implemented with the EM algorithm. The closed forms have been derived to estimate the linkage and linkage disequilibria among different molecular markers, such as single nucleotide polymorphisms, and quantitative genetic effects of haplotypes constructed by non-alleles of these markers. Results from the analysis of a real example in mouse have validated the usefulness and utilization of the model proposed.

**Citation:** Hou W, Yap JSF, Wu S, Liu T, Cheverud JM, Wu R (2007) Haplotyping a Quantitative Trait with a High-Density Map in Experimental Crosses. PLoS ONE 2(8): e732. https://doi.org/10.1371/journal.pone.0000732

**Academic Editor:** Peter Heutink, Vrije Universiteit Medical Centre, Netherlands

**Received:** April 10, 2007; **Accepted:** July 13, 2007; **Published:** August 15, 2007

**Copyright:** © 2007 Hou et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Funding:** The preparation of this manuscript is supported by NSF grant (0540745) to RW.

**Competing interests:** The authors have declared that no competing interests exist.

## Introduction

The basic principle of quantitative trait locus (QTL) mapping is the cosegregation of the alleles at a QTL with those at one or a set of known polymorphic markers genotyped on a genome in an experimental cross [1], [2]. If a QTL cosegregates with molecular markers, the genetic effects of the QTL on a quantitative trait and its genomic position can be estimated from the marker genotypes and phenotypic values of the trait. This estimation process assumes that the QTL is located within an interval bounded by a pair of flanking markers; a test statistic calculated under the reduced model (there is no QTL) and the full model (there is a QTL) is used to test the existence of the QTL and estimate its position. This so-called interval mapping approach and its extensions [3]–[5] are robust and powerful for the detection of major QTL and present the most efficient way to utilize marker information when marker maps are sparse [6]. However, interval mapping cannot infer any information about the sequence structure and organization of the QTL. Partly for this reason, only a few QTL mapped from markers have been successfully cloned [7]–[9], despite the considerable number of QTL reported in the literature.

Interval QTL mapping also faces an unsolved statistical difficulty when it is used with a high-density linkage map. As more markers are genotyped, genetic maps for QTL identification tend toward being infinitely dense. For such a map, in which markers are located everywhere over the genome, test statistics at nearby intervals are no longer independent. Thus, the critical threshold used to declare the existence of a QTL by interval mapping is difficult to determine analytically. Although an empirical alternative based on permutation tests has been proposed for threshold determination [10], the extensive computing it requires may reduce the efficiency of interval mapping.

Despite its unsuitability for interval mapping of QTL, an infinitely dense map provides an important tool for characterizing genetic variants that contribute to quantitative variation via the analysis of haplotypes composed of non-alleles at a set of highly linked markers. Recent genetic studies suggest that a gene may determine a complex trait, such as body weight or drug response, through its haplotype rather than its genotype [11], [12]. The completion of the genome projects for several important organisms, including *Arabidopsis*, chicken, human, mouse and poplar, has made massive amounts of DNA sequence data available. In particular, single nucleotide polymorphisms (SNPs), being the most common type of variant in the DNA sequence, provide a powerful means for genotyping the whole genome or any part of it. This facilitates the identification of specific SNP-constructed haplotypes that are responsible for quantitative traits. A set of SNPs that causes quantitative differences among individuals is called a quantitative trait nucleotide (QTN). Liu et al. [13] proposed a statistical model for estimating and testing haplotype effects at a QTN in a random sample drawn from a natural population. This model is based on the population genetic properties of gene segregation. Through the implementation of the EM algorithm, population genetic parameters of SNPs, such as haplotype frequencies, allele frequencies and linkage disequilibria, and quantitative genetic parameters, such as haplotype effects of a QTN, are estimated in closed form.

The motivation of this work is to derive a statistical model for discovering haplotypes responsible for quantitative variation in a mapping population derived from an experimental cross. Unlike a natural population, in which gene co-segregation analysis is based on linkage disequilibria [14], experimental crosses, such as the backcross or F_{2}, have usually been analyzed in terms of the linkage between different markers and QTL. In this article, we frame a general statistical model for estimating the linkage between different SNPs and testing haplotype effects within the context of linkage disequilibrium analysis in an F_{2} pedigree. We show that the new model can test for the dependence of SNPs when a multi-point analysis is performed. We have derived closed forms for the EM algorithm to estimate a variety of genetic parameters. A worked example is used to validate the usefulness and utilization of the model.

## Methods

### Haplotype and diplotype

A haplotype represents a linear arrangement of nucleotides (alleles) at different SNPs on a single chromosome, or part of a chromosome. The pair of haplotypes is called a diplotype. The observed phenotype of a diplotype is called a genotype. A diplotype is always constructed by two haplotypes, one from the maternal parent and the other from the paternal parent. Suppose there are two different SNPs in the same genomic region, one with two alleles *A* and *a* and the other with two alleles *B* and *b*. Allele *A* from SNP 1 and allele *B* from SNP 2 are located on the first homologous chromosome, whereas allele *a* from SNP 1 and allele *b* from SNP 2 are located on the second homologous chromosome. Thus, [*AB*] is one haplotype and [*ab*] is a second haplotype, and both constitute a diplotype [*AB*][*ab*] (Fig. 1).

In a practical genetic analysis, we can only observe the genotype expressed as *Aa/Bb*. However, the double heterozygote may be one (and only one) of two possible diplotypes, [*AB*][*ab*] and [*Ab*][*aB*]. These two diplotypes cannot be directly observed and must be inferred from SNP genotype data (Fig. 2). In practice, it is important to estimate haplotype effects on a quantitative trait based on the diplotypes and therefore genotypes. For example, suppose that an animal carrying haplotype [*AB*] grows better than animals that carry any of the other haplotypes, [*Ab*], [*aB*] and [*ab*]. Then the same genotype *Aa/Bb* may perform differently, depending on which diplotype it carries. If this genotype is diplotype [*AB*][*ab*], the animal will grow better. If the animal is diplotype [*Ab*][*aB*], its growth will be poorer. The statistical model being developed will be used to determine which diplotype is associated with better growth in experimental crosses.
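The ambiguity described above can be made concrete with a short sketch. It enumerates every unordered diplotype that is compatible with an observed multi-SNP genotype. The function name `compatible_diplotypes` and the representation of a genotype as a list of allele pairs are assumptions made for this illustration; they are not part of the model itself.

```python
from itertools import product


def compatible_diplotypes(genotype):
    """Return the set of unordered diplotypes consistent with a genotype.

    `genotype` is a list of per-SNP allele pairs, e.g. [("A", "a"), ("B", "b")].
    This representation is hypothetical and chosen only for illustration.
    """
    first, rest = genotype[0], genotype[1:]
    diplotypes = set()
    # Fix the phase at the first SNP, then try both orientations at each other SNP.
    for flips in product([False, True], repeat=len(rest)):
        hap1, hap2 = [first[0]], [first[1]]
        for (x, y), flip in zip(rest, flips):
            if flip:
                x, y = y, x
            hap1.append(x)
            hap2.append(y)
        # Sort the two haplotypes so that the pair is unordered.
        diplotypes.add(tuple(sorted(("".join(hap1), "".join(hap2)))))
    return diplotypes


# The double heterozygote Aa/Bb is compatible with two diplotypes,
# whereas a genotype heterozygous at one SNP only has a single diplotype.
print(compatible_diplotypes([("A", "a"), ("B", "b")]))
print(compatible_diplotypes([("A", "A"), ("B", "b")]))
```

Only genotypes that are heterozygous at two or more SNPs yield more than one diplotype, which is why the double heterozygote has to be treated as a mixture.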

### Linkage disequilibrium in the F_{2} intercross

**A general model:** Haplotype analysis in the backcross is straightforward because the diplotype is determined for every backcross genotype. Simple analysis of variance can be used to detect haplotype effects on a quantitative trait. This is not the case in the F_{2}, in which the double heterozygote is a mixture of two possible diplotypes.

Suppose many SNPs are genotyped, each of which segregates in a 1:2:1 Mendelian ratio in the F_{2} population. As seen in the human genome [15], these SNPs are divided into different haplotype blocks. For a given block, there is a particular number of representative SNPs, or htSNPs, that uniquely identify the common haplotypes in this block or QTN. Several algorithms have been developed to identify a minimal subset of htSNPs that can characterize the most common haplotypes [16]–[18]. Consider a QTN that contains *L* htSNPs among which there exist linkage disequilibria of different orders. The two alleles, 1 and 0, at each of these SNPs are symbolized by *r*_{1},…,*r*_{L}, respectively. For a cross initiated with two inbred parents, the allele frequencies for each of these htSNPs should be 1/2. A haplotype frequency, denoted as $p_{r_1 \cdots r_L}$, is decomposed into the following components: (1) where the *D*'s are the linkage disequilibria of different orders among particular SNPs.

In total, $L$ SNPs form $2^L$ haplotypes expressed as $[r_1 \cdots r_L]$; $2^{L-1}(2^L+1)$ diplotypes, i.e., pairs of maternally (m) and paternally (p) derived haplotypes, expressed as $[r_1^m \cdots r_L^m][r_1^p \cdots r_L^p]$ ($r_1^m, r_1^p, \ldots, r_L^m, r_L^p = 1, 0$); and $3^L$ genotypes expressed as $r_1 r'_1/\cdots/r_L r'_L$ ($r_1 \ge r'_1, \ldots, r_L \ge r'_L = 1, 0$). Only genotypes can be observed. The number of genotypes is smaller than the number of diplotypes because the genotypes that are heterozygous at two or more SNPs contain multiple different diplotypes. Diplotype (and therefore genotype) frequencies can be expressed in terms of haplotype frequencies. We use $P_{[r_1^m \cdots r_L^m][r_1^p \cdots r_L^p]}$ and $P_{r_1 r'_1/\cdots/r_L r'_L}$ to denote the diplotype and genotype frequencies, respectively, and $n_{r_1 r'_1/\cdots/r_L r'_L}$ to denote the genotype observation.

**A special case: Two-point linkage disequilibrium:** For two given SNPs (**S**_{1} and **S**_{2}), there are four different haplotypes in a cross population. According to the definition given above, these four haplotypes are denoted as [11], [10], [01] and [00], whose frequencies in a cross population are, respectively, expressed as

$$p_{11} = \tfrac{1}{4} + D, \quad p_{10} = \tfrac{1}{4} - D, \quad p_{01} = \tfrac{1}{4} - D, \quad p_{00} = \tfrac{1}{4} + D, \tag{2}$$

where $D$ is the linkage disequilibrium between the two SNPs. Assume that the two SNPs are linked with a recombination fraction $r$. The haplotype frequencies can be expressed in terms of $r$, i.e., $p_{11} = \tfrac{1}{2}(1-r)$, $p_{10} = \tfrac{1}{2}r$, $p_{01} = \tfrac{1}{2}r$ and $p_{00} = \tfrac{1}{2}(1-r)$. Combined with equation (2), this establishes the relation between the linkage disequilibrium and the recombination fraction as

$$D = \tfrac{1}{4}(1 - 2r) \tag{3}$$

or

$$r = \tfrac{1}{2}(1 - 4D). \tag{4}$$
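As a numerical check of the two-point relation, the sketch below computes the four haplotype frequencies produced by an F_{1} of diplotype [11][00] for a given recombination fraction and converts between *r* and *D*. It assumes the standard gamete frequencies (1−*r*)/2 for the two non-recombinant haplotypes and *r*/2 for the two recombinant haplotypes; the function names are hypothetical.

```python
def haplotype_freqs_from_r(r):
    """Haplotype frequencies in the F2 when the F1 carries diplotype [11][00]."""
    return {"11": (1 - r) / 2, "10": r / 2, "01": r / 2, "00": (1 - r) / 2}


def ld_from_r(r):
    """Linkage disequilibrium implied by a recombination fraction r."""
    return (1 - 2 * r) / 4


def r_from_ld(d):
    """Recombination fraction implied by a linkage disequilibrium d."""
    return (1 - 4 * d) / 2


p = haplotype_freqs_from_r(0.1)
# With allele frequencies of 1/2, p11 exceeds 1/4 by exactly D.
print(p["11"] - 0.25, ld_from_r(0.1))
# Free recombination (r = 0.5) corresponds to no disequilibrium.
print(ld_from_r(0.5))
```

Under these assumptions *D* ranges from 0 (unlinked SNPs) to 1/4 (complete linkage), so a test of *D* = 0 is also a test of *r* = 1/2.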

**A special case: Three-point linkage disequilibrium:** For three given SNPs (**S**_{1}, **S**_{2} and **S**_{3}), there are eight different haplotypes, i.e., [111], [110], [101], [100], [011], [010], [001] and [000]. The haplotype frequencies in a cross population are, respectively, expressed as (5) where *D*_{12}, *D*_{23} and *D*_{13} are the linkage disequilibria between SNPs **S**_{1} and **S**_{2}, between **S**_{2} and **S**_{3}, and between **S**_{1} and **S**_{3}, respectively, and *D*_{123} is the linkage disequilibrium among the three SNPs. The four disequilibrium coefficients can be estimated, by solving equation (5), as (6). The three first-order linkage disequilibria can be used to describe the linkage between different SNPs and crossover interference, whereas the second-order linkage disequilibrium, *D*_{123}, is thought to be associated with chromatid interference.

### Haplotyping a trait with two SNPs

Our interest is to search for the haplotype diversity that can explain phenotypic variation in a complex trait. The association between haplotype diversity and phenotypic variation has been detected in several studies of drug responses [11], [12]. This allows us to assume that a particular haplotype is different from other haplotypes for a given trait. Here, our focus is on modelling haplotype effects in experimental crosses. Although haplotypes (comprising diplotypes) can be directly observed in the backcross, this is not possible for the F_{2} because its heterozygous genotypes are not concordant with diplotypes or haplotypes. For the F_{2} population, the effects of different haplotypes on the phenotype need to be inferred from observed zygotic genotypes. The inference of diplotypes for a particular genotype is statistically a missing data problem that can be formulated by a finite mixture model.

**Mixture model:** The statistical method for the genomewide scan of QTN is formulated on the basis of a finite mixture model. The mixture model assumes that each observation comes from one of an assumed set of distributions. The mixture model derived to detect haplotype effects on a quantitative trait based on SNP genotype data contains three major parts: (1) the mixture proportions of each distribution, denoted as the relative frequencies of different diplotypes for the same SNP genotype, (2) the mean for each diplotype in the density function, and (3) the residual variance common to all diplotypes.

For simplicity, we consider a QTN that is composed of only two SNPs, each with two alleles designated as 1 and 0. These two SNPs segregating in the F_{2} population form four haplotypes whose frequencies are arrayed in the vector Θ_{p} = (*p*_{11}, *p*_{10}, *p*_{01}, *p*_{00}). All the genotypes are consistent with diplotypes, except for the double heterozygote, 10/10, which contains two different diplotypes: [11][00] with a frequency of 2*p*_{11}*p*_{00} and [10][01] with a frequency of 2*p*_{10}*p*_{01} (Table 1). The relative frequencies of different diplotypes for the double heterozygote are a function of haplotype frequencies.

A total of $n$ individuals in the F_{2} are classified into 9 genotypes for the two SNPs, each genotype with an observation generally expressed as $n_{r_1 r'_1/r_2 r'_2}$ ($r_1 \ge r'_1$, $r_2 \ge r'_2 = 1, 0$). The frequency of each genotype can be expressed in terms of haplotype frequencies (Table 1). Considering a quantitative trait controlled by diplotype (rather than genotype) diversity, the phenotypic value of the trait ($y_i$) for individual $i$ is expressed by a linear model, i.e.,

$$y_i = \sum_{[r_1^m r_2^m][r_1^p r_2^p]} \xi_i \, \mu_{[r_1^m r_2^m][r_1^p r_2^p]} + e_i, \tag{7}$$

where the sum runs over all diplotypes, $\xi_i$ is the indicator variable defined as 1 if a diplotype considered is compatible with subject $i$ and as 0 otherwise, $\mu_{[r_1^m r_2^m][r_1^p r_2^p]}$ is the genotypic value for diplotype $[r_1^m r_2^m][r_1^p r_2^p]$, and $e_i$ is the residual error distributed as $N(0, \sigma^2)$.

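Assume that this QTN triggers an effect on the trait because at least one haplotype is different from the remaining three. Without loss of generality, let [11] be such a distinct haplotype, called the *risk haplotype* and designated as *A*. All the other, non-risk haplotypes, [10], [01] and [00], are collectively expressed as *A̅*. The risk and non-risk haplotypes form three *composite diplotypes*: *AA* (**2**), *AA̅* (**1**) and *A̅A̅* (**0**). Let $\mu_2$, $\mu_1$ and $\mu_0$ be the genotypic values of the three composite diplotypes, respectively (Table 1). The means for different composite diplotypes and the residual variance are arrayed in a quantitative genetic parameter vector $\Theta_q = (\mu_2, \mu_1, \mu_0, \sigma^2)$.

**Likelihoods:** With the above notation, we construct two likelihoods, one for haplotype frequencies ($\Theta_p$) based on SNP data (**S**) and the other for quantitative genetic parameters ($\Theta_q$) based on haplotype frequencies ($\Theta_p$), phenotypic ($y$) and SNP data (**S**). They are, respectively, expressed as

$$\begin{aligned}
\log L(\Theta_p \mid \mathbf{S}) = \text{constant} &+ 2n_{11/11}\log p_{11} + n_{11/10}\log(2p_{11}p_{10}) + 2n_{11/00}\log p_{10} \\
&+ n_{10/11}\log(2p_{11}p_{01}) + n_{10/10}\log(2p_{11}p_{00} + 2p_{10}p_{01}) + n_{10/00}\log(2p_{10}p_{00}) \\
&+ 2n_{00/11}\log p_{01} + n_{00/10}\log(2p_{01}p_{00}) + 2n_{00/00}\log p_{00}, \\
\log L(\Theta_p, \Theta_q \mid y, \mathbf{S}) = &\sum_{i=1}^{n_{11/11}} \log f_2(y_i) + \sum_{i=1}^{n_{11/10}+n_{10/11}} \log f_1(y_i) \\
&+ \sum_{i=1}^{n_{10/10}} \log\left[\varphi f_1(y_i) + (1-\varphi) f_0(y_i)\right] + \sum_{i=1}^{\dot{n}} \log f_0(y_i),
\end{aligned} \tag{8}$$

where $n_{r_1 r'_1/r_2 r'_2}$ is the number of individuals with genotype $r_1 r'_1/r_2 r'_2$, $\dot{n} = n_{11/00} + n_{10/00} + n_{00/11} + n_{00/10} + n_{00/00}$, and $f_j(y_i)$ is a normal distribution density function of composite diplotype $j$ ($j = 2, 1, 0$), i.e.,

$$f_j(y_i) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(y_i - \mu_j)^2}{2\sigma^2}\right].$$

It can be seen from the above likelihood functions that, although most zygote genotypes contain a single component (diplotype), the double heterozygote is a mixture of two possible diplotypes weighted by $\varphi$ and $1-\varphi$, with

$$\varphi = \frac{p_{11}p_{00}}{p_{11}p_{00} + p_{10}p_{01}}, \tag{9}$$

which represents the relative frequency of diplotype [11][00] for the double heterozygote.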

It should be noted that $L(\Theta_p, \Theta_q \mid y, \mathbf{S})$ relies on the haplotype frequencies defined in $L(\Theta_p \mid \mathbf{S})$ and, thus, the latter is thought to be nested within the former. The estimates of parameters that maximize $L(\Theta_p \mid \mathbf{S})$ can also maximize $L(\Theta_p, \Theta_q \mid y, \mathbf{S})$.

**The EM algorithm:** A closed-form solution for the EM algorithm has been derived to estimate the unknown parameters that maximize the two likelihoods of (8) [13]. The estimates of haplotype frequencies are based on the log-likelihood function $L(\Theta_p \mid \mathbf{S})$, whereas the estimates of diplotype genotypic means and residual variance are based on the log-likelihood function $L(\Theta_p, \Theta_q \mid y, \mathbf{S})$. These two different types of parameters can be estimated using a two-stage hierarchical EM algorithm.

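At the higher hierarchy of the EM algorithm, the E step calculates the relative frequency ($\varphi$) of diplotype [11][00] in the double heterozygote using equation (9). The M step estimates the haplotype frequencies based on the probabilities calculated in the previous iteration using

$$\begin{aligned}
\hat{p}_{11} &= \frac{1}{2n}\left(2n_{11/11} + n_{11/10} + n_{10/11} + \varphi\, n_{10/10}\right), \\
\hat{p}_{10} &= \frac{1}{2n}\left(2n_{11/00} + n_{11/10} + n_{10/00} + (1-\varphi)\, n_{10/10}\right), \\
\hat{p}_{01} &= \frac{1}{2n}\left(2n_{00/11} + n_{10/11} + n_{00/10} + (1-\varphi)\, n_{10/10}\right), \\
\hat{p}_{00} &= \frac{1}{2n}\left(2n_{00/00} + n_{10/00} + n_{00/10} + \varphi\, n_{10/10}\right),
\end{aligned} \tag{10}$$

where $n_{r_1 r'_1/r_2 r'_2}$ is the observed number of individuals with genotype $r_1 r'_1/r_2 r'_2$.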
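The higher-hierarchy iteration can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the genotype labels (such as `"11/10"` for the SNP 1 genotype followed by the SNP 2 genotype), the function names and the starting values are all choices made for this example, and the updates follow the usual haplotype-counting form of the E and M steps for two SNPs.

```python
def phi_double_het(p):
    """Relative frequency of diplotype [11][00] within the double heterozygote."""
    coupling = p["11"] * p["00"]
    repulsion = p["10"] * p["01"]
    return coupling / (coupling + repulsion)


def em_haplotype_freqs(counts, max_iter=500, tol=1e-12):
    """Estimate two-SNP haplotype frequencies from F2 genotype counts.

    `counts` maps hypothetical genotype labels such as "11/10" to numbers of
    individuals; labels that are absent are treated as zero counts.
    """
    n = sum(counts.values())
    c = lambda g: counts.get(g, 0)
    p = {"11": 0.25, "10": 0.25, "01": 0.25, "00": 0.25}  # start at D = 0
    for _ in range(max_iter):
        phi = phi_double_het(p)  # E step
        dh = c("10/10")          # number of double heterozygotes
        # M step: count haplotype copies, splitting double heterozygotes by phi.
        new_p = {
            "11": (2 * c("11/11") + c("11/10") + c("10/11") + phi * dh) / (2 * n),
            "10": (2 * c("11/00") + c("11/10") + c("10/00") + (1 - phi) * dh) / (2 * n),
            "01": (2 * c("00/11") + c("10/11") + c("00/10") + (1 - phi) * dh) / (2 * n),
            "00": (2 * c("00/00") + c("10/00") + c("00/10") + phi * dh) / (2 * n),
        }
        shift = max(abs(new_p[h] - p[h]) for h in p)
        p = new_p
        if shift < tol:
            break
    return p


# Made-up counts that match r = 0.2 exactly (p11 = p00 = 0.4, p10 = p01 = 0.1).
counts = {"11/11": 160, "11/10": 80, "11/00": 10, "10/11": 80, "10/10": 340,
          "10/00": 80, "00/11": 10, "00/10": 80, "00/00": 160}
print(em_haplotype_freqs(counts))
```

The counts in the usage example are invented so that the true frequencies are a fixed point of the iteration; real data would only approach such values up to sampling error.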

At the lower hierarchy of the EM algorithm, the E step calculates the posterior probability ($\Omega_{[11][00]i}$) that individual $i$ with the double heterozygous genotype carries diplotype [11][00] by

$$\Omega_{[11][00]i} = \frac{\varphi f_1(y_i)}{\varphi f_1(y_i) + (1-\varphi) f_0(y_i)}. \tag{11}$$

Note that for all the other genotypes, each of which contains a single diplotype, such posterior probabilities do not exist.

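By assuming that [11] is a risk haplotype, the M step estimates the genotypic values ($\mu_j$) for each composite diplotype and the residual variance based on the calculated posterior probabilities by

$$\hat{\mu}_2 = \frac{\sum_{i=1}^{n_{11/11}} y_i}{n_{11/11}}, \quad
\hat{\mu}_1 = \frac{\sum_{i=1}^{\ddot{n}} y_i + \sum_{i=1}^{n_{10/10}} \Omega_{[11][00]i}\, y_i}{\ddot{n} + \sum_{i=1}^{n_{10/10}} \Omega_{[11][00]i}}, \quad
\hat{\mu}_0 = \frac{\sum_{i=1}^{\dot{n}} y_i + \sum_{i=1}^{n_{10/10}} (1-\Omega_{[11][00]i})\, y_i}{\dot{n} + \sum_{i=1}^{n_{10/10}} (1-\Omega_{[11][00]i})}, \tag{12}$$

$$\hat{\sigma}^2 = \frac{1}{n}\left\{\sum_{i=1}^{n_{11/11}} (y_i - \hat{\mu}_2)^2 + \sum_{i=1}^{\ddot{n}} (y_i - \hat{\mu}_1)^2 + \sum_{i=1}^{\dot{n}} (y_i - \hat{\mu}_0)^2 + \sum_{i=1}^{n_{10/10}} \left[\Omega_{[11][00]i}(y_i - \hat{\mu}_1)^2 + (1-\Omega_{[11][00]i})(y_i - \hat{\mu}_0)^2\right]\right\}, \tag{13}$$

where $\ddot{n} = n_{11/10} + n_{10/11}$ and $\dot{n} = n_{11/00} + n_{10/00} + n_{00/11} + n_{00/10} + n_{00/00}$ count the individuals whose genotypes correspond unambiguously to composite diplotypes *AA̅* and *A̅A̅*, respectively. Iterations including the E and M steps are repeated at the higher hierarchy between equations (9) and (10) and at the lower hierarchy among equations (11), (12) and (13) until the estimates of the parameters converge to stable values. The sampling errors of these parameters can be estimated by calculating Louis' [19] observed information matrix.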
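The lower-hierarchy iteration can be sketched as a small normal-mixture EM. The sketch assumes that [11] is the risk haplotype, that phenotypes have already been grouped by composite diplotype, and that the mixing weight `phi` has been obtained from the haplotype frequencies; the function names, argument layout and data are hypothetical and serve only to illustrate the E and M steps described above.

```python
import math


def normal_pdf(y, mu, sigma2):
    """Normal density with mean mu and variance sigma2."""
    return math.exp(-((y - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


def em_composite_means(y2, y1, y0, y_dh, phi, max_iter=500, tol=1e-10):
    """Estimate mu2, mu1, mu0 and sigma^2 when [11] is the risk haplotype.

    y2, y1, y0: lists of phenotypes whose genotype fixes the composite
    diplotype (assumed non-empty); y_dh: phenotypes of double heterozygotes.
    """
    n = len(y2) + len(y1) + len(y0) + len(y_dh)
    mu2 = sum(y2) / len(y2)  # AA individuals are never ambiguous
    mu1 = sum(y1) / len(y1)
    mu0 = sum(y0) / len(y0)
    everyone = y2 + y1 + y0 + y_dh
    grand = sum(everyone) / n
    sigma2 = sum((y - grand) ** 2 for y in everyone) / n
    omega = [phi] * len(y_dh)
    for _ in range(max_iter):
        # E step: posterior probability that a double heterozygote is [11][00].
        omega = []
        for y in y_dh:
            w1 = phi * normal_pdf(y, mu1, sigma2)
            w0 = (1 - phi) * normal_pdf(y, mu0, sigma2)
            omega.append(w1 / (w1 + w0))
        # M step: weighted means and pooled residual variance.
        s = sum(omega)
        new_mu1 = (sum(y1) + sum(w * y for w, y in zip(omega, y_dh))) / (len(y1) + s)
        new_mu0 = (sum(y0) + sum((1 - w) * y for w, y in zip(omega, y_dh))) / (
            len(y0) + len(y_dh) - s)
        ss = (sum((y - mu2) ** 2 for y in y2)
              + sum((y - new_mu1) ** 2 for y in y1)
              + sum((y - new_mu0) ** 2 for y in y0)
              + sum(w * (y - new_mu1) ** 2 + (1 - w) * (y - new_mu0) ** 2
                    for w, y in zip(omega, y_dh)))
        new_sigma2 = ss / n
        shift = max(abs(new_mu1 - mu1), abs(new_mu0 - mu0), abs(new_sigma2 - sigma2))
        mu1, mu0, sigma2 = new_mu1, new_mu0, new_sigma2
        if shift < tol:
            break
    return mu2, mu1, mu0, sigma2, omega


# Invented phenotypes: one double heterozygote resembles AA-bar, the other A-bar A-bar.
result = em_composite_means([4.0, 4.2], [3.0, 3.2, 2.8], [1.0, 1.2, 0.8], [3.1, 0.9], phi=0.9)
print(result)
```

In this toy data set the two double heterozygotes are assigned almost entirely to different diplotypes, which shows how the phenotype sharpens the prior weight `phi` into an individual-specific posterior.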

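Haplotype frequencies can be expressed as a function of allelic frequencies and linkage disequilibrium. Based on equation (2), we solve for the linkage disequilibrium between two SNPs by

$$\hat{D} = \hat{p}_{11}\hat{p}_{00} - \hat{p}_{10}\hat{p}_{01}. \tag{14}$$

With the genotypic means of composite diplotypes, we can estimate the overall mean ($\mu$) and the additive ($a$) and dominant ($d$) genetic effects due to the QTN detected, respectively, by

$$\hat{\mu} = \tfrac{1}{2}(\hat{\mu}_2 + \hat{\mu}_0), \quad \hat{a} = \tfrac{1}{2}(\hat{\mu}_2 - \hat{\mu}_0), \quad \hat{d} = \hat{\mu}_1 - \tfrac{1}{2}(\hat{\mu}_2 + \hat{\mu}_0).$$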
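The two post-processing steps in this paragraph are simple enough to sketch directly. The code assumes the conventional quantitative genetic definitions (the overall mean as the midpoint of the two homozygous composite diplotypes, the additive effect as half their difference, and the dominant effect as the deviation of the heterozygote from the midpoint) and the usual two-locus disequilibrium *p*_{11}*p*_{00} − *p*_{10}*p*_{01}; the function names and input values are hypothetical.

```python
def ld_from_haplotype_freqs(p):
    """Two-point linkage disequilibrium from the four haplotype frequencies."""
    return p["11"] * p["00"] - p["10"] * p["01"]


def genetic_effects(mu2, mu1, mu0):
    """Overall mean, additive effect and dominant effect of the risk haplotype."""
    mu = (mu2 + mu0) / 2
    a = (mu2 - mu0) / 2
    d = mu1 - mu
    return mu, a, d


# Invented estimates for illustration only.
p_hat = {"11": 0.4, "10": 0.1, "01": 0.1, "00": 0.4}
print(ld_from_haplotype_freqs(p_hat))

mu, a, d = genetic_effects(4.0, 3.5, 2.0)
# |d| < |a| indicates partial dominance, |d| = |a| complete dominance.
print(mu, a, d, abs(d) < abs(a))
```

Comparing the magnitude of `d` with that of `a` is the basis for describing an effect as partially dominant.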

**Model selection:** The likelihood $L(\Theta_p, \Theta_q \mid y, \mathbf{S})$ is formulated by assuming that haplotype [11] is a risk haplotype. However, the real risk haplotype is unknown from raw data ($y$, **S**). An additional step for the choice of the most likely risk haplotype should be implemented. The simplest way to do so is to calculate the likelihood values by assuming that any one of the four haplotypes can be a risk haplotype (Table 1). Thus, we obtain four possible likelihood values under different risk haplotypes; that is, (1) for [11], (2) for [10], (3) for [01], and (4) for [00]. Under each possible risk haplotype $k$ ($k = 1, \ldots, 4$), we estimate the quantitative genetic parameters. The largest likelihood value calculated is thought to correspond to the most likely risk haplotype.

In practice, it is also possible that there exists more than one risk haplotype for a QTN. Relative to the bi-“allelic” QTN with one risk haplotype, such a QTN is called a multi-“allelic” QTN. If there are two risk haplotypes, we will have six composite diplotypes. Assuming that [11] (denoted by *A*_{1}) and [10] (denoted by *A*_{2}) are risk haplotypes and the remaining haplotypes [01] and [00] are non-risk haplotypes (denoted by *A*_{3}), then six composite diplotypes, expressed as *A*_{1}*A*_{1}, *A*_{1}*A*_{2}, *A*_{1}*A*_{3}, *A*_{2}*A*_{2}, *A*_{2}*A*_{3} and *A*_{3}*A*_{3}, can be specified according to the diplotype distribution as shown in Table 1. In total, there are six such haplotype combinations for a two-SNP QTN, each combination corresponding to a likelihood value. Based on the calculated likelihoods, we can determine the most likely risk and non-risk haplotype combination. If there are three risk haplotypes, we will have 10 different composite diplotypes. The optimal risk and non-risk haplotype combination will be selected from three combinations based on the likelihoods.

The likelihood can be used as a criterion to select the optimal risk and non-risk haplotype combination when the number of risk haplotypes is the same. However, when the number of risk haplotypes differs, an AIC- or BIC-based model selection strategy [20] should be used because different numbers of parameters are estimated in this case.
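A minimal sketch of the information-criterion comparison is given below. It uses the standard definitions AIC = 2*k* − 2 log *L* and BIC = *k* log *n* − 2 log *L*, where *k* is the number of quantitative genetic parameters (for example, three means plus one variance for one risk haplotype, and six means plus one variance for two risk haplotypes). The model labels, log-likelihood values and the function `select_model` are invented for illustration.

```python
import math


def aic(loglik, n_params):
    """Akaike information criterion (smaller is better)."""
    return 2 * n_params - 2 * loglik


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion (smaller is better)."""
    return n_params * math.log(n_obs) - 2 * loglik


def select_model(models, n_obs, criterion="aic"):
    """Return the label of the model with the smallest criterion value.

    `models` maps a label to a (log-likelihood, number of parameters) pair.
    """
    def score(item):
        loglik, k = item[1]
        return aic(loglik, k) if criterion == "aic" else bic(loglik, k, n_obs)
    return min(models.items(), key=score)[0]


# Hypothetical log-likelihoods for two candidate risk haplotype structures.
models = {
    "one risk haplotype [11]": (-100.0, 4),
    "two risk haplotypes [11],[10]": (-99.5, 7),
}
print(select_model(models, n_obs=502, criterion="aic"))
print(select_model(models, n_obs=502, criterion="bic"))
```

With these invented numbers, the small gain in likelihood from the larger model does not offset its three extra parameters, so both criteria retain the single risk haplotype model.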

**Hypothesis tests:** We can test two major hypotheses in the following sequence: (1) the association between two SNPs by testing their linkage disequilibrium, and (2) the difference of a given haplotype from the remaining haplotypes by testing the significance of haplotype additive and dominant effects on the trait. The linkage disequilibrium between two given SNPs can be tested using the hypotheses

$$H_0: D = 0 \quad \text{vs.} \quad H_1: D \neq 0. \tag{15}$$

The log-likelihood ratio test statistic for the significance of LD is calculated by comparing the likelihood values under $H_1$ (full model) and $H_0$ (reduced model) using

$$LR_1 = -2\left[\log L(\tilde{\Theta}_p \mid \mathbf{S}) - \log L(\hat{\Theta}_p \mid \mathbf{S})\right], \tag{16}$$

where the tilde and hat denote the haplotype frequencies under $H_0$ and $H_1$, respectively. The $LR_1$ is considered to asymptotically follow a $\chi^2$ distribution with one degree of freedom.

Diplotype or haplotype effects on the trait, i.e., the existence of a QTN, can be tested using the following hypotheses:

$$H_0: \mu_2 = \mu_1 = \mu_0 \quad \text{vs.} \quad H_1: \text{at least one of the equalities above does not hold}. \tag{17}$$

The log-likelihood ratio test statistic ($LR_2$) under these two hypotheses can be similarly calculated,

$$LR_2 = -2\left[\log L(\hat{\Theta}_p, \tilde{\Theta}_q \mid y, \mathbf{S}) - \log L(\hat{\Theta}_p, \hat{\Theta}_q \mid y, \mathbf{S})\right], \tag{18}$$

where the tildes and hats denote the MLEs of parameters under the null and alternative hypotheses of (17), respectively. Although the critical threshold for determining the existence of a QTN can be based on empirical permutation tests, the $LR_2$ may asymptotically follow a $\chi^2$ distribution with two degrees of freedom, so that the threshold can be obtained from the $\chi^2$ distribution table.
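For the two tests described here, the asymptotic p-values have simple closed forms, so no statistical library is needed for a quick check. The sketch below assumes the likelihood ratio statistic is minus twice the difference between the null and alternative log-likelihoods, and uses the facts that the upper tail of a χ² distribution is erfc(√(x/2)) for one degree of freedom and exp(−x/2) for two. The function names and log-likelihood values are hypothetical.

```python
import math


def lr_statistic(loglik_null, loglik_alt):
    """Log-likelihood ratio statistic from the two maximized log-likelihoods."""
    return -2.0 * (loglik_null - loglik_alt)


def chi2_upper_tail(x, df):
    """Upper-tail probability of a chi-square variable for df = 1 or df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 and df = 2 are covered by this sketch")


# Hypothetical log-likelihoods under the null and alternative hypotheses.
lr2 = lr_statistic(-105.0, -100.0)
print(lr2, chi2_upper_tail(lr2, df=2))
```

The p-value from `chi2_upper_tail` can then be compared with a nominal level, or with a level adjusted for multiple testing when the whole genome is scanned.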

### Haplotyping a trait with multiple SNPs

**Haplotype structure:** The statistical method for QTN mapping is exemplified by a set of three SNPs, **S**_{1}–**S**_{3}, for a QTN. The two alleles, 1 and 0, at each SNP are symbolized by $r_1$, $r_2$ and $r_3$, respectively. Eight haplotypes, [111], [110], [101], [100], [011], [010], [001] and [000], formed by these three SNPs, have the frequencies arrayed in $\Theta_p = (p_{111}, p_{110}, p_{101}, p_{100}, p_{011}, p_{010}, p_{001}, p_{000})$. Some genotypes are consistent with diplotypes, whereas the others that are heterozygous at two or more SNPs are not. Each double heterozygote contains two different diplotypes. The triple heterozygote, i.e., 10/10/10, contains four different diplotypes, [111][000] (with a probability of $2p_{111}p_{000}$), [110][001] (with a probability of $2p_{110}p_{001}$), [101][010] (with a probability of $2p_{101}p_{010}$) and [100][011] (with a probability of $2p_{100}p_{011}$). The relative frequencies of different diplotypes for each double or triple heterozygote are a function of haplotype frequencies (Table 2).

In the F_{2} population, there are 27 genotypes for the three SNPs. Let $n_{r_1 r'_1/r_2 r'_2/r_3 r'_3}$ ($r_1 \ge r'_1$, $r_2 \ge r'_2$, $r_3 \ge r'_3 = 1, 0$) be the number of offspring for a genotype. As seen in Table 2, the frequency of each genotype is expressed in terms of haplotype frequencies. Similar to equation (7), the phenotypic value of the trait for individual $i$ is expressed, at the diplotype level, as

$$y_i = \sum_{[r_1^m r_2^m r_3^m][r_1^p r_2^p r_3^p]} \xi_i \, \mu_{[r_1^m r_2^m r_3^m][r_1^p r_2^p r_3^p]} + e_i, \tag{19}$$

where the sum runs over all diplotypes, $\xi_i$ is the indicator variable defined as 1 if a diplotype considered is compatible with subject $i$ and as 0 otherwise, $\mu_{[r_1^m r_2^m r_3^m][r_1^p r_2^p r_3^p]}$ is the genotypic value for diplotype $[r_1^m r_2^m r_3^m][r_1^p r_2^p r_3^p]$, and $e_i$ is the residual error distributed as $N(0, \sigma^2)$. Note that **m** and **p** stand for the maternally and paternally derived alleles, respectively.

By assuming [111] as a risk haplotype (labelled by *A*) and all the others as non-risk haplotypes (labelled by *A̅*), Table 2 provides the formulation of genotypic values for three composite diplotypes, $\mu_2$ for *AA*, $\mu_1$ for *AA̅* and $\mu_0$ for *A̅A̅*. The haplotype effect parameters and the residual variance are arrayed in a quantitative genetic parameter vector $\Theta_q = (\mu_2, \mu_1, \mu_0, \sigma^2)$.

**Likelihoods and algorithms:** With the above notation, we construct two likelihoods, one for haplotype frequencies ($\Theta_p$) based on SNP data (**S**) and the other for quantitative genetic parameters ($\Theta_q$) based on haplotype frequencies ($\Theta_p$), phenotypic ($y$) and SNP data (**S**). They are, respectively, expressed as (20) where the $\varphi_{\cdot}$'s ($\bar{\varphi}_{\cdot} = 1 - \varphi_{\cdot}$) are defined below, and $f_j(y_i)$ ($j = 2, 1, 0$) is a normal distribution density function of composite diplotype $j$.

A two-stage hierarchical EM algorithm is derived to estimate haplotype frequencies and quantitative genetic parameters. At the higher hierarchy of the EM framework, we calculate the proportions of a particular diplotype within double or triple heterozygous genotypes (E step) by (21). The relative proportions calculated by equation (21) are then used to estimate the haplotype frequencies with (22)

At the lower hierarchy of the EM framework, we calculate the posterior probabilities of a double or triple heterozygous individual $i$ to be a particular diplotype (*A̅*) (E step), where [111] is assumed to be the risk haplotype, expressed as (23). With the posterior probabilities calculated by equation (23), we then estimate the quantitative genetic parameters, $\Theta_q$, based on the log-likelihood equations. These equations have forms similar to, but more complicated than, equations (12) and (13).

Hypothesis tests can be made for linkage disequilibria among three SNPs and haplotype effects. Four different linkage disequilibria, *D*_{12}, *D*_{13}, *D*_{23} and *D*_{123}, that describe the linkage among three SNPs can each be tested using null hypotheses analogous to those in equation (15). The log-likelihood ratios for each hypothesis are thought to follow a χ^{2} distribution.

***R*-SNP model:** The idea for haplotyping a quantitative trait has been described for two- and three-SNP models. It is possible that these models are too simple to characterize genetic variants for quantitative variation. Following the same analytical line as the two- and three-SNP sequencing models, a model can be developed to include an arbitrary number of SNPs whose sequences are associated with the phenotypic variation. A key issue for the multi-SNP sequencing model is how to distinguish among 2^{r−1} different diplotypes for the same genotype heterozygous at *r* loci. The relative frequencies of these diplotypes can be expressed in terms of haplotype frequencies. The integrative EM algorithm can be employed to estimate the MLEs of haplotype frequencies. A general formula for estimating haplotype frequencies can be derived.
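The combinatorics behind the multi-SNP model can be checked with a few helper functions. The sketch uses the counts given in the Methods (2^{L} haplotypes, 2^{L−1}(2^{L}+1) diplotypes and 3^{L} genotypes for *L* SNPs, and 2^{r−1} diplotypes for a genotype heterozygous at *r* loci) and verifies that summing the diplotypes over all genotypes recovers the total number of diplotypes. The function names are hypothetical.

```python
from math import comb


def n_haplotypes(L):
    return 2 ** L


def n_diplotypes(L):
    return 2 ** (L - 1) * (2 ** L + 1)


def n_genotypes(L):
    return 3 ** L


def diplotypes_per_genotype(r):
    """Number of diplotypes hidden in a genotype heterozygous at r loci."""
    return 1 if r == 0 else 2 ** (r - 1)


def n_diplotypes_by_summation(L):
    """Sum the hidden diplotypes over all genotypes, grouped by r.

    There are comb(L, r) ways to choose the heterozygous loci and 2 ** (L - r)
    ways to choose the homozygous state at the remaining loci.
    """
    return sum(comb(L, r) * 2 ** (L - r) * diplotypes_per_genotype(r)
               for r in range(L + 1))


for L in (2, 3, 4):
    print(L, n_haplotypes(L), n_diplotypes(L), n_genotypes(L), n_diplotypes_by_summation(L))
```

The gap between the numbers of diplotypes and genotypes grows quickly with *L*, which is why the mixture part of the model becomes the main computational burden as more SNPs are added.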

## Results

The statistical model described above can be used to map and identify QTNs for a quantitative trait in an F_{2} population. Because the marker data we have for mouse are microsatellites rather than SNPs, we use these microsatellite markers as a surrogate for SNPs to demonstrate the utility of the model. Our marker data were from Vaughn et al.'s [21] study, in which a linkage map composed of 19 chromosomes was constructed with 96 microsatellite markers for 502 F_{2} mice (259 males and 243 females) derived from two strains, the Large (LG/J) and Small (SM/J). This map has a total map distance of ∼1780 cM (in Haldane's units) and an average interval length of ∼23 cM. The F_{2} progeny were measured for their body mass at 10 weekly intervals starting at age 7 days. The raw weights were corrected for the effects of each covariate due to dam, litter size at birth, parity and sex [21]. Here, only adult body weights at week 10 are used for “QTN” analysis.

For each F_{2} mouse, the parental origin of alleles at each marker can be discerned in molecular studies. Let *L* and *S* be the alleles inherited from the Large (LG/J) and Small (SM/J) strains, respectively. For any pair of markers, there are four different haplotypes, *LL*, *LS*, *SL* and *SS*, whose frequencies are accordingly denoted as $p_{LL}$, $p_{LS}$, $p_{SL}$ and $p_{SS}$. By assuming each of the four haplotypes in turn to be the risk haplotype, the above model allows for the estimation of haplotype frequencies by the EM iteration at the higher hierarchy and of composite genotypic values by the EM iteration at the lower hierarchy. The estimated haplotype frequencies are used to estimate linkage disequilibrium based on equation (14) and the recombination fraction (*r*) based on equation (4). This estimation process is moved from the first (**M**_{1}–**M**_{2}) to the last pair of markers (**M**_{6}–**M**_{7}) on chromosome 1 and then from chromosome 1 to 19.

Table 3 tabulates the results of the MLEs of haplotype frequencies and log-likelihoods under the assumptions of different risk haplotypes. A total of 96 markers are sparsely located on 19 mouse chromosomes, with the estimated recombination fractions from the linkage disequilibrium model [8] consistent with those obtained from the linkage model [21]. Significant likelihood ratios for testing haplotype effects were determined by critical values obtained from the *χ*^{2} distribution with two degrees of freedom, with a Bonferroni adjustment to the type I error. The adjusted critical values for the two- and three-marker QTN models are 18.20 and 18.76, respectively, at the 5% significance level. Significant haplotype effects are detected for a total of eight marker pairs (Table 3), which include one pair on chromosome 4, two consecutive pairs on chromosome 6, four consecutive pairs on chromosome 7 and one pair on chromosome 14. For some pairs, multiple significant risk haplotypes were detected. Risk haplotypes purely composed of alleles inherited from the LG/J or SM/J parent exert a positive or negative additive effect on body weight, respectively. Based on the relative values of estimated additive and dominant effects, the significant marker pairs detected display partial dominant effects (Table 3).
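A Bonferroni-adjusted critical value for a χ² statistic with two degrees of freedom has a closed form, because the upper-tail probability at *x* is exp(−*x*/2). The sketch below shows the calculation for an arbitrary number of tests. The number of tests behind the thresholds reported above is not stated in the text, so `n_tests` is a placeholder and the sketch is not intended to reproduce those values; the function name is hypothetical.

```python
import math


def bonferroni_chi2_df2_threshold(alpha, n_tests):
    """Critical value x such that P(chi2_2 > x) equals alpha / n_tests."""
    return -2.0 * math.log(alpha / n_tests)


# A single test at the 5% level gives the familiar chi-square (df = 2) cutoff.
print(bonferroni_chi2_df2_threshold(0.05, 1))
# The threshold grows only logarithmically with the number of tests.
for m in (10, 100, 1000):
    print(m, bonferroni_chi2_df2_threshold(0.05, m))
```

Because the threshold grows with the logarithm of the number of tests, scanning many marker pairs and candidate risk haplotypes raises the cutoff only moderately.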

The results from the three-marker model are basically consistent with those from the two-marker model (Table 4). The advantage of the three-marker model is that it incorporates the interferences between adjacent marker intervals into the estimation process and, thus, can potentially increase the estimation precision of haplotype effects.

## Discussion

Quantitative trait locus (QTL) mapping aims to identify narrow chromosomal segments for a quantitative trait by using a statistical method, and has proven its value to study the genetic architecture of the trait in a variety of species [6]–[8]. The limitations of this technique lie in its inability to characterize the structure and organization of DNA sequences and statistical difficulty in deriving the distribution of test statistics under the null hypothesis of no QTL [22]. At least partly for these reasons, despite thousands of QTL reported for different traits and populations, a very small portion of them have been cloned [9]. With the completion of the genome projects for several important organisms, a new line of thought in the post genomic era has begun to emerge for the identification of specific combinations of nucleotides or haplotypes that contribute to a complex quantitative trait [13], [23].

Theory and methods for haplotype discovery have been well established for natural populations [13], in which the non-random association among different single nucleotide polymorphisms (SNPs), specified by the coefficients of linkage disequilibria, lays a foundation for the mixture model of haplotyping a quantitative trait. In this article, we derived a statistical model for detecting haplotypes and estimating their effects on the quantitative variation of a trait in experimental crosses. We used the principle of linkage disequilibrium analysis to characterize the linkage among different markers, which is usually described by recombination fractions, in a commonly used F_{2} population initiated with two inbred lines. We established an interchangeable relationship between linkage and linkage disequilibrium. The merit of this relationship in trait haplotyping includes the incorporation of interference between adjacent marker intervals into the estimation and testing of haplotype effects when multiple markers are modelled simultaneously.
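
The interchangeability of linkage and linkage disequilibrium can be illustrated with the textbook two-marker case. This is a minimal sketch under stated assumptions, not the parameterization used in the model itself: the F_{1} is taken to be a coupling-phase double heterozygote, alleles from one inbred parent are coded `1` and those from the other `0` (an arbitrary labelling), and all function names are hypothetical. Under these assumptions the F_{1} produces parental gametes with frequency (1 − r)/2 each and recombinant gametes with frequency r/2 each, so that D = (1 − 2r)/4 and, conversely, r = (1 − 4D)/2.

```python
def f1_gamete_frequencies(r: float) -> dict:
    """Two-marker gamete frequencies produced by a coupling-phase F1 (11/00).

    r is the recombination fraction between the two markers (0 <= r <= 0.5).
    The allele coding is an assumption made for illustration only.
    """
    if not 0.0 <= r <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    return {
        "11": (1.0 - r) / 2.0,  # parental gamete
        "10": r / 2.0,          # recombinant gamete
        "01": r / 2.0,          # recombinant gamete
        "00": (1.0 - r) / 2.0,  # parental gamete
    }


def linkage_disequilibrium(r: float) -> float:
    """D = p11 * p00 - p10 * p01, which simplifies to (1 - 2r) / 4."""
    p = f1_gamete_frequencies(r)
    return p["11"] * p["00"] - p["10"] * p["01"]


def recombination_fraction(d: float) -> float:
    """Invert the relationship: r = (1 - 4D) / 2."""
    return (1.0 - 4.0 * d) / 2.0


if __name__ == "__main__":
    for r in (0.0, 0.1, 0.5):
        d = linkage_disequilibrium(r)
        print(r, d, recombination_fraction(d))
```

Complete linkage (r = 0) gives the maximum disequilibrium of 1/4, whereas free recombination (r = 0.5) gives D = 0, so estimating one quantity in such a cross determines the other.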

The haplotyping model developed in this article was used to analyze a published F_{2} population of mouse [21], with microsatellite markers used as surrogates for SNPs so that we could detect the effects of haplotypes constructed from microsatellite alleles. The whole mouse genome was scanned for haplotype effects on body weight with both a two-marker and a multi-marker model. The two models gave consistent results, which suggest that four regions on mouse chromosomes 4, 6, 7, and 14 contribute to variation in body weight. These findings are in good agreement with those from traditional interval QTL mapping [21], but our haplotype discovery is more informative because it characterizes the specific haplotype structure and organization responsible for trait variation.

We have proposed a new model for haplotyping a quantitative trait in the F_{2} progeny population. The tenet of the model can be extended to haplotype a complicated trans-generational pedigree, founded with multiple original parents and involving individuals of different relatedness. The model can also be modified to dissect the epistatic effects of different genes [23] and the interaction of genes with the environment. For these extensions, haplotype selection, which aims to detect the risk haplotypes that are expressed differently from the others, presents many challenges but is crucial for detecting the association between haplotype diversity and phenotypic variation.

Our haplotyping model offers a powerful tool for the positional cloning of QTL that are important for a complex trait. Flint et al. [9] reviewed the potential of currently available cloning strategies, such as probabilistic ancestral haplotype reconstruction, Yin-Yang crosses and in silico analysis of sequence variants, to identify genes that underlie QTL in rodents. Our model, in conjunction with these strategies, may open a new gateway to a detailed picture of the genetic architecture of a complex trait.

## Author Contributions

Conceived and designed the experiments: RW. Performed the experiments: JC. Analyzed the data: WH SW TL. Wrote the paper: RW.

## References

- 1. Lander ES, Botstein D (1989) Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121: 185–199.
- 2. Lynch M, Walsh B (1998) Genetics and Analysis of Quantitative Traits. Sunderland, MA: Sinauer Associates.
- 3. Jansen RC, Stam P (1994) High resolution mapping of quantitative traits into multiple loci via interval mapping. Genetics 136: 1447–1455.
- 4. Zeng Z-B (1994) Precision mapping of quantitative trait loci. Genetics 136: 1457–1468.
- 5. Kao C-H, Zeng Z-B, Teasdale RD (1999) Multiple interval mapping for quantitative trait loci. Genetics 152: 1203–1216.
- 6. Mackay TFC (2001) Quantitative trait loci in *Drosophila*. Nat Rev Genet 2: 11–20.
- 7. Frary A, Nesbitt TC, Frary A, Grandillo S, van der Knaap E, et al. (2000) *fw2.2*: A quantitative trait locus key to the evolution of tomato fruit size. Science 289: 85–88.
- 8. Li CB, Zhou AL, Sang T (2006) Rice domestication by reducing shattering. Science 311: 1936–1939.
- 9. Flint J, Valdar W, Shifman S, Mott R (2005) Strategies for mapping and cloning quantitative trait genes in rodents. Nat Rev Genet 6: 271–286.
- 10. Churchill GA, Doerge RW (1994) Empirical threshold values for quantitative trait mapping. Genetics 138: 963–971.
- 11. Judson R, Stephens JC, Windemuth A (2000) The predictive power of haplotypes in clinical response. Pharmacogenomics 1: 15–26.
- 12. Bader JS (2001) The relative power of SNPs and haplotype as genetic markers for association tests. Pharmacogenomics 2: 11–24.
- 13. Liu T, Johnson JA, Casella G, Wu RL (2004) Sequencing complex diseases with HapMap. Genetics 168: 503–511.
- 14. Lou X-Y, Casella G, Littell RC, Yang MKC, Wu RL (2003) A haplotype-based algorithm for multilocus linkage disequilibrium mapping of quantitative trait loci with epistasis in natural populations. Genetics 163: 1533–1548.
- 15. Patil N, Berno AJ, Hinds DA, Barrett WA, Doshi JM, et al. (2001) Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294: 1719–1723.
- 16. Zhang K, Deng M, Chen T, Waterman MS, Sun F (2002) A dynamic programming algorithm for haplotype block partitioning. Proc Natl Acad Sci 99: 7335–7339.
- 17. Sebastiani P, Lazarus SW, Kunkel LM, Kohane IS, Ramoni M (2003) Minimal haplotype tagging. Proc Natl Acad Sci 100: 9900–9905.
- 18. Eyheramendy S, Marchini J, McVean G, Myers S, Donnelly P (2007) A model-based approach to capture genetic variation for future association studies. Genome Res 17: 88–95.
- 19. Louis TA (1982) Finding the observed information matrix when using the EM algorithm. J Roy Stat Soc Ser B 44: 226–233.
- 20. Burnham KP, Anderson DR (1998) Model Selection and Inference. A Practical Information-Theoretic Approach. New York: Springer.
- 21. Vaughn TT, Pletscher LS, Peripato A, King-Ellison K, Adams E, et al. (1999) Mapping quantitative trait loci for murine growth - A closer look at genetic architecture. Genet Res 74: 313–322.
- 22. Lander ES, Schork NJ (1994) Genetic dissection of complex traits. Science 265: 2037–2048.
- 23. Lin M, Wu RL (2006) Detecting sequence-sequence interactions for complex diseases. Current Genomics 7: 59–72.