DTR, JD, TMB, MBM, and DBA conceived and designed the experiments. DTR, JD, and DBA performed the experiments. DTR, JD, LKV, and DBA analyzed the data. DTR, JD, HKT, JRF, MAP, and DBA contributed reagents/materials/analysis tools. DTR, JD, LKV, HKT, TMB, RPK, RF, MAP, NL, MBM, and DBA wrote the paper.

The authors have declared that no competing interests exist.

Individual genetic admixture estimates, determined both across the genome and at specific genomic regions, have been proposed for use in identifying specific genomic regions harboring loci influencing phenotypes in regional admixture mapping (RAM). Estimates of individual ancestry can be used in structured association tests (SAT) to reduce confounding induced by various forms of population substructure. Although presented as two distinct approaches, we provide a conceptual framework in which both RAM and SAT are special cases of a more general linear model. We clarify which variables are sufficient to condition upon in order to prevent spurious associations and also provide a simple closed form “semiparametric” method of evaluating the reliability of individual admixture estimates. An estimate of the reliability of individual admixture estimates is required to make an inherent errors-in-variables problem tractable. Casting RAM and SAT methods as a general linear model offers enormous flexibility enabling application to a rich set of phenotypes, populations, covariates, and situations, including interaction terms and multilocus models. This approach should allow far wider use of RAM and SAT, often using standard software, in addressing admixture as either a confounder of association studies or a tool for finding loci influencing complex phenotypes in species as diverse as plants, humans, and nonhuman animals.

In recent years, scientific efforts to find genes influencing disease and health-related traits have sought to capitalize on the unique genetic characteristics of admixed populations. Admixture can refer to the event of two or more genetically diverse populations intermating and producing an admixed population. Admixture creates the potential for efficient identification of trait-influencing genes. However, genetic association studies using admixed populations are also prone to incorrectly concluding that a gene is linked and associated with a trait even when it is not. Several researchers have produced promising statistical methodologies for genetic association studies within admixed populations. In this paper, the authors show how these statistical methods can be unified in a broadly applicable regression framework and discuss which variables should be included in the regression models for valid testing. Because the variables required in this regression framework can only be measured with error, the authors show the consequences of these measurement errors and present measurement error correction methods applicable to this problem. By recasting the statistical methods for genetic association studies within admixed populations as regression models, a broader range of modeling and hypothesis testing becomes available.

When two or more populations have been separated by geographic or cultural boundaries for many generations, differential selection pressures, drift, and spontaneous mutations may lead to different allele frequencies in each population. If individuals from these founding populations subsequently mate, disequilibrium among linked markers in their offspring may span a greater genetic distance than typically found in panmictic populations. This extended disequilibrium can greatly facilitate the ability to detect regions of the genome harboring phenotype-influencing loci by reducing both the number of marker loci required and the cost when compared to disequilibrium mapping in panmictic populations [

Recently, with the availability of genome-wide markers, the wider use and application of Bayesian statistical methods, the use of Markov chain Monte Carlo and hidden Markov methods, and the insight of several investigative groups [

The overall aim of this paper is to provide a general model that conceptually unites RAM and SAT methodologies into an extensible form. To accomplish this, we provide an overview of the problem and existing methods, followed by methodologic clarification. We then present our model and illustrate its properties via simulation. These simulations are not meant to provide a comprehensive description of the operating characteristics of the methods across many situations, but rather to illustrate key methodological points.

Before presenting a unifying approach, we review the justification and underlying principles of both methods.

Hoggart et al. (p. 1492 in [

We will show that how one attempts to control for parental ancestry is critical to determining whether one eliminates potential confounding due to variations in parental ancestry. To our knowledge, there are four published approaches to SAT [

The overall issue of confounding due to admixture disequilibrium, generalized to any population, is portrayed in the path diagram of

This figure was created based on the rules of path diagrams outlined in [_{i}, ^{th} QTL (the putative QTL, QTL 1, is observed, whereas QTL 2 is unobserved). Note that for a specific QTL, only two possible values of _{i}, _{i},

The figure indicates that association testing is not a simple issue. The relationship between the putative quantitative trait locus (QTL) and phenotype is the one of interest, but it can be confounded by other variables. First, note that QTLs and individual admixture can be directly influenced by random variation due to meiosis. In addition, both the phenotype and measured admixture are potentially subject to measurement error. Furthermore, measured admixture is directly affected by individual admixture, which in turn is affected by individual ancestry. Naturally, the ancestry of the parents, represented by P_{1} and P_{2}, affects individual ancestry. Individual ancestry can directly affect the putative QTL, which in turn can affect the phenotype, so individual ancestry has an indirect effect on the phenotype via the putative QTL. The right-hand side of the path diagram is a mirror image of the left-hand side, with the unobserved QTL replacing the putative QTL, and represents the potential path of spurious associations. The diagram also indicates that the product of parental ancestries affects both QTLs. Justification for these paths is provided below.

The consequences of failing to control for variation in ancestry are illustrated in

A dataset was simulated from idealized circumstances for the purposes of illustration. The dataset contained 1,000 individuals that were admixed from parental populations _{10} scale) for the test for the effect of G2 for each dataset. The bar plot of the right section of each panel represents the observed ratio of the empirical to the nominal type I error for each simulation.

(A) Not controlling for ancestry leads to inflated type I error. The degree of type I error rate inflation increases with smaller α levels.

(B) Controlling for only the linear term of individual ancestry is sufficient only when the confounding QTL affects the phenotype in a purely additive fashion. In this case, there was no excess of type I errors.

(C) When the QTL affects the phenotype in a nonadditive fashion (in this case, through overdominance), controlling for the linear term of ancestry is insufficient to remove the confounding effect. The type I error rates remain quite inflated even after including true individual ancestry in the model.

(D) When the QTL affects the phenotype in an overdominant fashion, controlling for true individual ancestry and the product of parental ancestries effectively eliminates the confounding. In this case, the ratios of empirical to nominal α levels are within sampling error of 1.0.
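The contrast between panels (C) and (D) can be reproduced in a small simulation. The following Python sketch is only an illustration under simplifying assumptions that are not taken from the simulations reported here: both parents of an individual share the same ancestry, drawn uniformly on [0, 1]; alleles at the unobserved QTL and at the unlinked test marker are fixed between the parental populations; the QTL acts only through its heterozygote (overdominance); and the marker is tested through its heterozygote indicator. The function names, sample sizes, and effect size are hypothetical.

```python
import numpy as np


def t_stat_last(X, y):
    """OLS t statistic for the coefficient of the last column of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[-1] / np.sqrt(cov[-1, -1])


def het_indicator(p1, p2, rng):
    """1.0 if exactly one of the two inherited alleles comes from population A."""
    from_a_1 = rng.random(p1.size) < p1  # allele transmitted by parent 1
    from_a_2 = rng.random(p2.size) < p2  # allele transmitted by parent 2
    return (from_a_1 != from_a_2).astype(float)


def type1_rates(n=1000, reps=200, effect=2.0, crit=1.96, seed=1):
    """Empirical rejection rates for a null marker under two adjustment sets."""
    rng = np.random.default_rng(seed)
    hits_linear = hits_full = 0
    for _ in range(reps):
        p1 = rng.uniform(size=n)
        p2 = p1.copy()                 # assumption: parents share the same ancestry
        ancestry = (p1 + p2) / 2
        qtl_het = het_indicator(p1, p2, rng)     # unobserved, overdominant QTL
        marker_het = het_indicator(p1, p2, rng)  # unlinked marker with no effect
        y = effect * qtl_het + rng.normal(size=n)
        ones = np.ones(n)
        X_linear = np.column_stack([ones, ancestry, marker_het])
        X_full = np.column_stack([ones, ancestry, p1 * p2, marker_het])
        hits_linear += abs(t_stat_last(X_linear, y)) > crit
        hits_full += abs(t_stat_last(X_full, y)) > crit
    return hits_linear / reps, hits_full / reps
```

Calling `type1_rates()` returns the rejection rate when only the linear ancestry term is included, followed by the rate when the product of parental ancestries is added. Under these assumptions the first should be well above the nominal 0.05 and the second close to it.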

We define region-specific admixture as a characteristic of segments of the genomes of individuals. For any given region of the genome, one's region-specific admixture from population

Several approaches to RAM [

There are a number of methodologic points that have been alluded to but have not been completely elucidated in the literature pertaining to how one should condition upon (control for) ancestry within RAM and SAT. Within the next few sections, we seek to clarify these points.

It is unclear from past writing whether it is sufficient to control for individual admixture, individual ancestry, or both to eliminate confounding due to the admixture process. We first clarify that, although the terms are sometimes used interchangeably, an individual's admixture and an individual's ancestry are not equivalent variables. To illustrate, consider a set of full siblings that does not include any monozygotic twins. Because they are full siblings, all individuals in the set have equal individual ancestry from specific populations or regions. In fact, all individuals in the set have ancestry equal to the mean or midpoint of their parents' ancestries, represented as P_{1} and P_{2}. However, due to recombination, all individuals will have slightly different admixture values.
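The distinction can be made concrete with a toy simulation. The sketch assumes that each sibling's genome consists of independently inherited segments and that a segment transmitted by a parent derives from the reference population with probability equal to that parent's ancestry. The function name, the number of segments, and the parental ancestries (0.8 and 0.4) are illustrative choices only.

```python
import numpy as np


def sibling_admixture(p1, p2, n_sibs=5, n_segments=200, seed=0):
    """Realized admixture of full siblings whose parents have ancestries p1 and p2."""
    rng = np.random.default_rng(seed)
    # Each row is a sibling; each column is one independently inherited segment.
    from_parent1 = rng.random((n_sibs, n_segments)) < p1
    from_parent2 = rng.random((n_sibs, n_segments)) < p2
    # Fraction of the 2 * n_segments inherited copies that trace to the reference population.
    return (from_parent1.sum(axis=1) + from_parent2.sum(axis=1)) / (2 * n_segments)


ancestry = (0.8 + 0.4) / 2                 # identical for every full sibling
admixture = sibling_admixture(0.8, 0.4)    # varies from sibling to sibling
```

Here `ancestry` is a single number shared by the whole sibship, whereas `admixture` holds one realized value per sibling, scattered around that midpoint.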

Here we show by counterexamples that it is not sufficient to control for individual admixture and it is also not sufficient to control for individual ancestry. We then show that it is sufficient to control for both individual ancestry and the product of parental ancestry. Throughout the paper and our examples, ^{th} individual, ^{th} locus, ^{th} locus, and

Given variations in parental ancestry, controlling for individual admixture is not sufficient. Imagine an organism with _{j}^{th} segment) and the ancestry values are all scaled to have variance 1.0. Given the assumptions above, all segment-specific admixture values will have equal covariance with ancestry. Denote this covariance as β_{j}_{1} and _{j}_{2} is ^{2}. The correlation coefficient between _{j}_{1} and

The correlation coefficient can be written in terms of simple correlation coefficients
^{7} base pairs in total length) with short genomes and less in organisms such as crayfish (diploid chromosome number = 200, 8.22 × 10^{9} base pairs in total length) with long genomes (see

Let _{1} and _{2} denote Bernoulli-distributed random variables indicating whether or not one has inherited two alleles from population _{ij}^{th} locus for the i^{th} individual. Assume that the two loci are unlinked and that we begin with two inbred populations, _{1} and _{2} individuals from populations

Expected Population Resulting from Two Generations of Random Mating between Two Inbred Populations

As can be seen in _{ij}_{1} will be correlated with _{2}_{1} = 1 or _{2} = 1.

Some models (e.g., [

However, the locus-specific effects on complex and quantitative traits cannot a priori be assumed to be additive and can even be overdominant [

The premise of conditioning on parental ancestry was first introduced by McKeigue [

Let _{1i} and _{2i} denote the individual ancestries from population

Furthermore, conditional on _{1i} and _{2i}_{i}_{1i}, _{2i}), _{i}_{1i}, _{2i}), and _{i}_{1i}, _{2i}) is sufficient to eliminate confounding by unlinked loci. Given that _{i}_{1i}, _{2i}) + _{i}_{1i}, _{2i}) + _{i}_{1i}, _{2i}) = 1, it is only necessary to control for any two in a model. We choose to control for _{i}_{1i}, _{2i}) and _{i}_{1i}, _{2i}). If we let _{i}_{0} ≡ _{0} + _{1}, β_{1} ≡ −2_{1}, and β_{2} ≡ _{1} + _{2} and substituting terms yields:
_{1i} + _{2i})/2 is individual ancestry (_{i}
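The reparameterization can be checked numerically. The sketch below assumes that the two retained conditional probabilities are those of inheriting zero and two alleles from the reference population at an unlinked locus, (1 − P_{1i})(1 − P_{2i}) and P_{1i}P_{2i}. It uses placeholder names `a0`, `a1`, and `a2` for the coefficients before substitution, and the numerical values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p1 = rng.uniform(size=1000)   # ancestry of parent 1
p2 = rng.uniform(size=1000)   # ancestry of parent 2

# Assumed conditional probabilities at an unlinked locus.
prob_zero = (1 - p1) * (1 - p2)   # no allele from the reference population
prob_two = p1 * p2                # both alleles from the reference population

# Placeholder coefficients of the model written in terms of the two probabilities.
a0, a1, a2 = 0.5, -1.2, 0.7
before = a0 + a1 * prob_zero + a2 * prob_two

# Substitution stated in the text: beta0 = a0 + a1, beta1 = -2 * a1, beta2 = a1 + a2.
b0, b1, b2 = a0 + a1, -2 * a1, a1 + a2
ancestry = (p1 + p2) / 2
after = b0 + b1 * ancestry + b2 * p1 * p2

identical = bool(np.allclose(before, after))
```

If the assumed probabilities are the right ones, `identical` is true for any parental ancestries, which is why individual ancestry and the product of parental ancestries together carry the same information as the two probabilities.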

One may choose to condition on parental ancestry only if parental ancestry is found to be statistically significant when included in the model or if significant structure is detected in the sample as was described by Pritchard et al. [_{0}, the type 1 error rate remains ≤ α, which generally defines a valid test in the frequentist context, then conditional conditioning is not a valid testing strategy. That is, even though covariates may not meet criteria for statistical significance in a finite sample, this does not mean they are not confounders, and failing to include them in the model can lead to inflated type 1 error rates [

We simulated datasets containing a phenotype Y that is associated with a marker G_{1} and true ancestry. We also simulated another marker G_{2} that is not associated with Y, but like Y, is correlated with true ancestry. Therefore, any significant association between Y and G_{2} is considered a false positive. We consider the full model Y_{i} = β_{0} + β_{1}A_{i} + β_{2}P_{1i}P_{2i} + β_{3}G_{2} + ɛ_{i}, where A_{i} denotes individual ancestry and P_{1i}P_{2i} the product of parental ancestries, and first test H_{0}: β_{1} = 0 and β_{2} = 0. If this test is significant, the p-value represented by the blue dots is obtained from the full model; otherwise, it is obtained from the reduced model Y_{i} = β_{0} + β_{3}G_{2} + ɛ_{i}. The blue dots therefore correspond to testing G_{2} in a sequential fashion; that is, by first testing H_{0}: β_{1} = 0 and β_{2} = 0, and relying on the outcome of this test to decide whether to control for ancestry. The correct α levels are obtained by always including ancestry terms in the model regardless of their levels of significance.
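The cost of deciding whether to adjust for ancestry on the basis of a preliminary significance test can be illustrated with a small simulation. This sketch is not the design used for the figure. It assumes that both parents share the same ancestry, that the phenotype depends linearly on ancestry, that the null marker is coded additively, and that a large-sample chi-square cutoff (5.991) is adequate for the 2-df pretest. All names and parameter values are hypothetical.

```python
import numpy as np


def ols(X, y):
    """Return coefficients, residual sum of squares, and coefficient covariance."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = float(resid @ resid)
    cov = rss / (X.shape[0] - X.shape[1]) * np.linalg.inv(X.T @ X)
    return beta, rss, cov


def compare_strategies(n=400, reps=500, b=0.5, seed=2):
    """Rejection rates for a null marker: pretest-based versus always-adjusted."""
    rng = np.random.default_rng(seed)
    seq_hits = always_hits = 0
    for _ in range(reps):
        p = rng.uniform(size=n)   # assumption: P1 = P2 = p, so ancestry is p
        g2 = (rng.random(n) < p).astype(float) + (rng.random(n) < p)
        y = b * p + rng.normal(size=n)   # phenotype depends on ancestry, not on g2
        ones = np.ones(n)
        X_full = np.column_stack([ones, p, p * p, g2])   # ancestry and product term
        X_red = np.column_stack([ones, g2])              # no ancestry terms
        beta_f, rss_f, cov_f = ols(X_full, y)
        beta_r, rss_r, cov_r = ols(X_red, y)
        z_full = beta_f[-1] / np.sqrt(cov_f[-1, -1])
        z_red = beta_r[-1] / np.sqrt(cov_r[-1, -1])
        # Pretest of the two ancestry coefficients; 2 * F is compared with the
        # chi-square(2) 5% critical value as a large-sample approximation.
        f_stat = ((rss_r - rss_f) / 2) / (rss_f / (n - 4))
        ancestry_significant = 2 * f_stat > 5.991
        z_seq = z_full if ancestry_significant else z_red
        seq_hits += abs(z_seq) > 1.96
        always_hits += abs(z_full) > 1.96
    return seq_hits / reps, always_hits / reps
```

`compare_strategies()` returns the rejection rate of the sequential strategy followed by that of the always-adjusted model. With the default settings the pretest often fails to reject even though ancestry is a genuine confounder, so the sequential rate should sit well above the nominal level.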

Here we introduce general models for RAM and SAT that are highly extensible. We define the following notation: _{i},^{th} individual, the proportion of the i^{th} individual's ancestors that came from parental population _{ijk},^{th} individual has inherited ^{th} locus from an ancestor that was from parental population _{ijk},^{th} individual has ^{th} locus of a specified type. We use _{i}

RAM model:

SAT model:

These general linear models are very flexible. First, dichotomous (e.g., case vs. control), ordinal, time-to-event, or continuous phenotypes can be accommodated by letting the regression model be logistic, Poisson, Cox, or ordinary least squares, respectively. This flexibility is important. Investigators frequently want not only to assess genetic association for dichotomous and static phenotypes such as lupus (yes vs. no) in a case-control study, but also to assess genetic association with longitudinal outcomes (e.g., clinical course in medical research or growth rate in agricultural research), adjusting for covariates including demographic variables and ancestry. Such longitudinal phenotypes can also be accommodated by this general model via the use of mixed models and related techniques for longitudinal data [
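As an illustration of this flexibility, the following sketch fits a logistic version of an ancestry-adjusted association model to a simulated dichotomous phenotype. It is a minimal example under assumed conditions: the data-generating values, the function names, and the hand-written Newton-Raphson fitting routine are hypothetical, and in practice standard statistical software would be used.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Logistic regression by Newton-Raphson; returns coefficients and standard errors."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        weights = mu * (1.0 - mu)
        hessian = X.T @ (X * weights[:, None])
        beta = beta + np.linalg.solve(hessian, X.T @ (y - mu))
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    weights = mu * (1.0 - mu)
    cov = np.linalg.inv(X.T @ (X * weights[:, None]))
    return beta, np.sqrt(np.diag(cov))


def simulate_binary_phenotype(n=4000, marker_effect=0.8, seed=4):
    """Simulate a dichotomous phenotype with ancestry and marker effects (toy values)."""
    rng = np.random.default_rng(seed)
    p1 = rng.uniform(size=n)          # ancestry of parent 1
    p2 = rng.uniform(size=n)          # ancestry of parent 2
    ancestry = (p1 + p2) / 2
    # Additively coded marker: number of alleles from the reference population.
    g = (rng.random(n) < p1).astype(float) + (rng.random(n) < p2)
    eta = -1.0 + 1.0 * ancestry + marker_effect * g
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(float)
    # Design matrix: intercept, individual ancestry, product of parental ancestries, marker.
    X = np.column_stack([np.ones(n), ancestry, p1 * p2, g])
    return X, y


X, y = simulate_binary_phenotype()
coef, se = fit_logistic(X, y)
marker_z = coef[3] / se[3]   # Wald statistic for the marker, adjusted for ancestry
```

Only the link function changes relative to the least-squares case. The covariates that control for ancestry are the same.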

Another advantage of the models in _{ijk}_{ijk}_{ijk}

As already discussed, the models in

Multivariate RAM model:

Multivariate SAT model:

The ξ_{m}

To our knowledge, no current RAM or SAT test allows related individuals to be included as subjects. (We distinguish the inclusion of related individuals as subjects from the requirement that parents or other relatives be included in some testing procedures as a means of controlling for ancestry [e.g., [

The general linear model offered can be extended to allow one to test for linkage conditional upon association with a polymorphism in a region and, thereby, test whether that polymorphism appears to account for an observed linkage signal that was detected with RAM. The right side of _{ijk}_{ijk}_{ijk}_{ijk}

Until now, we have assumed that all variables are known without error. In reality, this will not be the case and is an important point to recognize. Any of the variables involved can be measured with error and we now address the consequences of error in each and propose responses to ensure validity of the tests in terms of type 1 error rate control. Throughout, we assume that the measurement errors are independent of each other and of all of the variables under study. We also do not dwell on how one should calculate estimates of individual and parental admixture or estimates of the reliability thereof when used as estimates of individual and parental ancestry. For now, we simply assume that it is possible to do so and briefly address ways in which this might best be accomplished in the

It is well known that genotyping errors occur and, when they occur, result in reduced power [_{ijk},

Phenotypes are also often measured with error but, again, this will only serve to lower power of the tests we offer and not inflate type 1 error rates [

Unless a perfectly informative marker (i.e., a marker with allele frequencies of zero and one in one parental population and complementary frequencies in the other, respectively) is available at exactly the locus under study, the degree of regional admixture for any individual will only be known probabilistically. Let us denote the (Bayesian posterior) probabilities of individual region-specific admixture as:
_{ij}_{1} and _{ij}_{2} with _{ij}_{1} and _{ij}_{2}, respectively, in the various regression models in an analogous manner to what would be done in some multipoint mapping approaches in experimental crosses (see p. 433 in [
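Substituting posterior probabilities for the unobserved indicators can be sketched as follows. The example is a toy under stated assumptions: the region-specific state (0, 1, or 2 alleles from the reference population) has prior probabilities 0.25, 0.5, and 0.25, and marker information is summarized by a single noisy signal with normal error. In practice the posterior probabilities would come from an admixture-estimation program. All names and values are hypothetical.

```python
import numpy as np


def posterior_probs(signal, prior, noise_sd):
    """Posterior P(state = s | signal) for s in {0, 1, 2} under normal noise."""
    states = np.arange(3)
    lik = np.exp(-0.5 * ((signal[:, None] - states[None, :]) / noise_sd) ** 2)
    post = lik * prior[None, :]
    return post / post.sum(axis=1, keepdims=True)


def regress_on_posteriors(n=5000, effects=(1.0, 2.0), noise_sd=0.5, seed=5):
    """Regress the phenotype on posterior probabilities instead of true indicators."""
    rng = np.random.default_rng(seed)
    prior = np.array([0.25, 0.5, 0.25])
    state = rng.choice(3, size=n, p=prior)            # true, unobserved state
    signal = state + noise_sd * rng.normal(size=n)    # stand-in for marker data
    post = posterior_probs(signal, prior, noise_sd)
    y = effects[0] * (state == 1) + effects[1] * (state == 2) + rng.normal(size=n)
    # Posterior probabilities of one and two alleles replace the indicators.
    X = np.column_stack([np.ones(n), post[:, 1], post[:, 2]])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta
```

Because the expected phenotype given the marker information is linear in the posterior probabilities, the coefficients returned by `regress_on_posteriors()` estimate the same effects as a regression on the true indicators, with some loss of precision.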

Error in the estimates of parental ancestry poses the greatest challenge. As several authors [

We simulated a randomly mating population of organisms based upon the “island model” or intermixture admixture process [

Montana and Pritchard [

While many methods are available (e.g., [

Another alternative is the simulation extrapolation (SIMEX) approach [
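The logic of SIMEX can be sketched in a few lines. The example below is a simplified illustration, not the implementation used for the results reported here. It assumes that the measurement error variance of the ancestry estimate is known, that both parents share the same ancestry, that the marker has no effect on the phenotype, and that a quadratic extrapolation function is adequate. All names and parameter values are hypothetical.

```python
import numpy as np


def marker_coef(w, g, y):
    """OLS coefficient of the marker g, adjusting for the error-prone ancestry estimate w."""
    X = np.column_stack([np.ones_like(w), w, g])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[2]


def simex_marker_coef(w, g, y, sigma_u, lambdas=(0.0, 0.5, 1.0, 1.5, 2.0),
                      n_sim=20, seed=0):
    """SIMEX estimate of the marker coefficient with quadratic extrapolation."""
    rng = np.random.default_rng(seed)
    means = []
    for lam in lambdas:
        # Simulation step: add extra error with variance lam * sigma_u**2, then refit.
        sims = [marker_coef(w + np.sqrt(lam) * sigma_u * rng.normal(size=w.size), g, y)
                for _ in range(n_sim)]
        means.append(np.mean(sims))
    # Extrapolation step: fit a quadratic in lambda and evaluate it at lambda = -1,
    # which corresponds to no measurement error.
    quad = np.polyfit(lambdas, means, deg=2)
    return np.polyval(quad, -1.0)


def simex_demo(n=10_000, effect_of_ancestry=4.0, sigma_u=0.2, seed=3):
    """Naive and SIMEX-corrected coefficients for a marker with no true effect."""
    rng = np.random.default_rng(seed)
    ancestry = rng.uniform(size=n)
    g2 = (rng.random(n) < ancestry).astype(float) + (rng.random(n) < ancestry)
    y = effect_of_ancestry * ancestry + rng.normal(size=n)
    w = ancestry + sigma_u * rng.normal(size=n)   # ancestry estimated with error
    naive = marker_coef(w, g2, y)
    corrected = simex_marker_coef(w, g2, y, sigma_u)
    return naive, corrected
```

`simex_demo()` returns the naive and the SIMEX-corrected marker coefficients. Because the true coefficient is zero in this setup, the naive estimate reflects residual confounding left by the error in the ancestry estimate. With a quadratic extrapolant the correction is typically partial rather than complete.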

Several other methods exist [

The dataset used to create this graph was generated under the same conditions as used to generate the data for

(A) Type I error inflation caused by measurement error in the individual ancestry estimate. Ignoring possible measurement error in the ancestry estimate may also lead to a high type I error rate.

(B) Observed false positive rate after correction for measurement error; in this example we used the SIMEX algorithm as described in Cook and Stefanski [

Our purpose here has not been to become bogged down in the logistics of setting up RAM and SAT studies or to provide detailed evaluations of the performance characteristics of specific designs and analytic implementations. Rather, our goal was to articulate a unified and generalizable approach to RAM and SAT. We have shown through proofs, counterexamples, and small simulations that it is necessary and sufficient to condition on both individual ancestry and the product of parental ancestries, and it is not sufficient to “conditionally condition” on parental ancestries, in order to control for confounding in admixture studies. We provide a general linear model that is extensible to a multitude of study designs, conditions, and populations of interest that are briefly presented, but left to future work for detailed descriptions. Within

Simulation studies were performed using the software SAS (Cary, North Carolina, United States) under the “general island” and intermixture models presented by Zhu et al. [_{i}_{0} + β_{1}_{i}_{2}_{1i}_{2i} + β_{3}_{ij}_{1} + β_{4}_{ij}_{2} + ɛ_{i}


The authors would like to thank Dr. Chenxi Wang for providing some initial code for the SIMEX implementation; Dr. Raymond Carroll for helpful advice; Drs. David Siegmund, Jonathan Pritchard, Robert Elston, and Hongyu Zhao for helpful comments on earlier drafts; and Dr. Barbara Gower for graciously providing some of the data used to calculate reasonable parameters for our simulations.

QTL, quantitative trait loci (or locus)

RAM, regional admixture mapping

SAT, structured association testing

SIMEX, simulation extrapolation