Conceived and designed the experiments: BP NZ CAH DR NP JGW ALP. Performed the experiments: BP. Analyzed the data: BP AT ALP. Contributed reagents/materials/analysis tools: BP GL GKC WHLK IR MF DSS XZ EL LAL LAC QY ELA SKM JD JM ML GJP RCM CBA EMJ LB WZ JJH RGZ SJN EVB SAI MFP SJC SLD JLRG CDP SB LE JNH BEH SM CAH DR NP JGW ALP. Wrote the paper: BP NZ DR NP JGW ALP.

The authors have declared that no competing interests exist.

While genome-wide association studies (GWAS) have primarily examined populations of European ancestry, more recent studies often involve additional populations, including admixed populations such as African Americans and Latinos. In admixed populations, linkage disequilibrium (LD) exists both at a fine scale in ancestral populations and at a coarse scale (admixture-LD) due to chromosomal segments of distinct ancestry. Disease association statistics in admixed populations have previously considered SNP association (LD mapping) or admixture association (mapping by admixture-LD), but not both. Here, we introduce a new statistical framework for combining SNP and admixture association in case-control studies, as well as methods for local ancestry-aware imputation. We illustrate the gain in statistical power achieved by these methods by analyzing data of 6,209 unrelated African Americans from the CARe project genotyped on the Affymetrix 6.0 chip, in conjunction with both simulated and real phenotypes, as well as by analyzing the FGFR2 locus using breast cancer GWAS data from 5,761 African-American women. We show that, at typed SNPs, our method yields an 8% increase in statistical power for finding disease risk loci compared to the power achieved by standard methods in case-control studies. At imputed SNPs, we observe an 11% increase in statistical power for mapping disease loci when our local ancestry-aware imputation framework and the new scoring statistic are jointly employed. Finally, we show that our method increases statistical power in regions harboring the causal SNP in the case when the causal SNP is untyped and cannot be imputed. Our methods and our publicly available software are broadly applicable to GWAS in admixed populations.

This paper presents improved methodologies for the analysis of genome-wide association studies in admixed populations, which are populations that came about by the mixing of two or more distant continental populations over a few hundred years (e.g., African Americans or Latinos). Studies of admixed populations offer the promise of capturing additional genetic diversity compared to studies over homogeneous populations such as Europeans. In admixed populations, correlation between genetic variants exists both at a fine scale in the ancestral populations and at a coarse scale due to chromosomal segments of distinct ancestry. Disease association statistics in admixed populations have previously considered either one or the other type of correlation, but not both. In this work we develop novel statistical methods that account for both types of genetic correlation, and we show that the combined approach attains greater statistical power than that achieved by applying either approach separately. We provide analysis of simulated and real data from major studies performed in African-American men and women to show the improvement obtained by our methods over the standard methods for analyzing association studies in admixed populations.

Genome-wide association studies (GWAS) are the currently prevailing approach for identifying genetic variants with a modest effect on the risk of common disease, and have identified hundreds of common risk variants for a wide range of diseases and phenotypes

GWAS disease mapping in homogeneous populations relies on linkage disequilibrium (LD) between nearby markers to identify SNP association

It is important to complement theoretical methods development with empirical evaluation on large real data sets. To this end, we have evaluated our methods using 6,209 unrelated African Americans from the CARe cardiovascular consortium as well as 5,761 unrelated African-American women from a GWAS for breast cancer. We ran comprehensive simulations based on real genotypes and phenotypes simulated under a variety of assumptions. Our main focus was on case-control phenotypes, in which case-only admixture association is particularly valuable. Our analysis of simulated and real (coronary heart disease, type 2 diabetes and breast cancer) case-control phenotypes shows that our combined SNP and admixture association approach attains significantly greater statistical power than can be achieved by applying either approach separately. Although our main focus is on case-control phenotypes, we also provide a detailed evaluation of association statistics for quantitative phenotypes, using simulated and real (LDL and HDL cholesterol) phenotypes.

Since the general assumption in GWAS is that the causal SNP is not directly typed in the study, it is important to assess how the newly introduced scores perform in the context of genotype imputation. First, we show that imputation accuracy is marginally improved when local ancestry is taken into account in the imputation procedure. Second, our analysis in African Americans shows that for case-control studies our methods for combining SNP and admixture association outperform other approaches even in the presence of imputation. Finally, we show that when the causal SNP is not typed and cannot be reliably imputed our methods yield higher statistical power at finding the region harboring the causal variant when compared to previous approaches. Based on these findings we provide recommendations for the use of our combined approach in GWAS of admixed populations.

We analyzed data from 6,209 unrelated African Americans from the CARe consortium who were genotyped on the Affymetrix 6.0 chip, and merged in genotype data from the HapMap3 project (see

We used the Armitage trend test with correction for genome-wide ancestry as a baseline for the evaluation of other approaches, as this approach was used in previous association analyses using CARe data χ^{2}(2dof) score, but as we show below, the higher degrees of freedom lead to a reduction in statistical power. We instead propose a mixed χ^{2}(1dof) score that jointly evaluates both SNP and admixture association using a single SNP odds ratio, by using the implied ancestry odds ratio (see χ^{2}(1dof) SNP association score conditioned on local ancestry to a χ^{2}(2dof) SNP association score which allows different odds ratios for African versus European local ancestry (see χ^{2}(1dof) MIX score that accounts for both admixture and case-control signal using a single SNP odds ratio and the χ^{2}(2dof) SUM score that allows for independent SNP and ancestry odds ratios.

We also explored whether it is necessary to assign African or European ancestry to each allele for a sample and SNP in which both local ancestry and genotype are heterozygous. Although the HAPMIX algorithm supports this functionality, it represents a significant complexity, particularly if representing local ancestry inference in terms of real-valued probabilities. We focus below on scores based on diploid local ancestry (AA, AE or EE) that do not require this extra information, and show that these scores perform nearly as well as scores that are based on haploid local ancestry (A or E) for each of two chromosomes with local ancestry inference and phasing performed jointly.

We randomly selected 100,000 autosomal SNPs and, for each SNP, assigned simulated phenotypes based on either a null model or causal model for that SNP. Under the null model, we chose 1,000 cases and 1,000 controls at random. Under the causal model, we chose 1,000 cases and 1,000 controls corresponding to odds ratios

We compared 5 scores: Armitage trend test with correction for genome-wide ancestry (ATT), SNP association conditioned on local ancestry (SNP1), admixture association using cases only (ADM), sum of SNP1 and ADM (SUM), and our new mixed score (MIX). All of these are χ^{2}(1dof) scores, except for SUM which is χ^{2}(2dof). We note that the strength of the induced admixture signal at highly differentiated SNPs (as measured by the ancestry odds ratio) in the simulated data fits the model assumed in the MIX score.
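The cost of the extra degree of freedom in SUM can be made concrete by comparing the statistic each kind of score must reach to be declared significant. The following is a minimal sketch, assuming the genome-wide significance level of P<5e-08 used in the supplementary tables; the helper names `chi2_sf` and `chi2_threshold` are illustrative and not part of the released software.

```python
import math

def chi2_sf(x, dof):
    """Upper-tail probability of a chi-square statistic (closed forms for 1 or 2 dof)."""
    if dof == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if dof == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only 1 or 2 degrees of freedom are handled in this sketch")

def chi2_threshold(alpha, dof, lo=0.0, hi=200.0):
    """Smallest statistic whose p-value is at most alpha, located by bisection."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_sf(mid, dof) > alpha:
            lo = mid  # p-value still too large, move the lower bound up
        else:
            hi = mid
    return hi

ALPHA = 5e-8  # genome-wide significance level (assumed from the supplementary tables)
thr_1dof = chi2_threshold(ALPHA, 1)
thr_2dof = chi2_threshold(ALPHA, 2)
print(f"required statistic: 1 dof = {thr_1dof:.2f}, 2 dof = {thr_2dof:.2f}")
```

For 2 degrees of freedom the threshold is exactly -2 ln(5e-08), about 33.62, compared with roughly 29.7 for 1 degree of freedom. A 2dof score therefore has to gain about four χ^{2} units from its second component just to break even.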

In

We plot the average power of each score as a function of allele frequency difference between CEU and YRI, for the R = 1.5 simulation only.

Typed Genotypes
Score | R = 1.2 random | R = 1.2 Δ>0.4 | R = 1.5 random | R = 1.5 Δ>0.4 | R = 2.0 random | R = 2.0 Δ>0.4
ATT χ^{2}(1dof) | 0.0017 | 0.0026 | 0.3803 | 0.5533 | 0.8351 | 0.9769
SNP1 χ^{2}(1dof) | 0.0014 | 0.0012 | 0.3628 | 0.4181 | 0.8279 | 0.9362
ADM χ^{2}(1dof) | 0.0001 | 0.0013 | 0.0081 | 0.0903 | 0.0737 | 0.6306
SUM χ^{2}(2dof) | 0.0012 | 0.0028 | 0.3555 | 0.624 | 0.8287 | 0.9874
MIX χ^{2}(1dof) | 0.0021 | 0.0046 | 0.4131 | 0.6899 | 0.8486 | 0.9907

Imputed Genotypes
Score | R = 1.2 random | R = 1.2 Δ>0.4 | R = 1.5 random | R = 1.5 Δ>0.4 | R = 2.0 random | R = 2.0 Δ>0.4
ATT χ^{2}(1dof) | 0.0010 | 0.0008 | 0.2871 | 0.2988 | 0.7620 | 0.7762
ATT-dose χ^{2}(1dof) | 0.0010 | 0.0008 | 0.3009 | 0.3134 | 0.7775 | 0.7938
SNP1 χ^{2}(1dof) | 0.0009 | 0.0007 | 0.2673 | 0.3013 | 0.7483 | 0.8748
ADM χ^{2}(1dof) | 0.0001 | 0.0013 | 0.0081 | 0.0903 | 0.0737 | 0.6306
SUM χ^{2}(2dof) | 0.0007 | 0.002 | 0.2668 | 0.5086 | 0.7567 | 0.9729
MIX χ^{2}(1dof) | 0.0013 | 0.0034 | 0.3184 | 0.5915 | 0.778 | 0.9786

We also assessed all scores on null simulated data (R = 1) using the standard genomic control λ_{GC}, which attained a value of 1.001 for MIX, 0.986 for SNP1 and 0.999 for the ATT score, respectively. We observed a λ_{GC} of 1.101 for the ADM score, which is suggestive of inflation, although we note that, for 1,000 cases and a thousand independent genomic regions (as expected in the ADM score), a λ_{GC} of 1.101 can arise by chance. However, since multiple factors (e.g. deviations from random mating, correlation in errors of local ancestry estimates) could potentially lead to inflation of the ADM statistic, we have also devised an admixture statistic, ADMGC, that incorporates the empirical variance of the average local ancestry (see λ_{GC}. Furthermore, we show how to incorporate ADMGC within the MIX framework to obtain a new version of our score (MIXGC) that incorporates the new admixture component. As expected, both ADMGC and MIXGC attain λ_{GC} of 1.000 (data not shown) in simulated null data. We note that MIXGC should be used when there is significant indication of inflation. As this was not the case here, we chose to use MIX for all results below.
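For readers unfamiliar with the genomic control factor, the sketch below uses the usual definition for χ^{2}(1dof) scores: the median observed statistic divided by the median of the χ^{2}(1dof) distribution (approximately 0.4549). The function name `lambda_gc` and the simulated null statistics are illustrative assumptions, not CARe data.

```python
import random
import statistics

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DOF_MEDIAN = 0.4549364

def lambda_gc(stats):
    """Genomic control inflation factor for a collection of chi-square(1dof) statistics."""
    return statistics.median(stats) / CHI2_1DOF_MEDIAN

# Hypothetical null data: squared standard normal deviates follow chi-square(1dof).
rng = random.Random(0)
null_stats = [rng.gauss(0.0, 1.0) ** 2 for _ in range(100_000)]
lam = lambda_gc(null_stats)

# Uniformly inflating every statistic by 10% inflates lambda_GC by the same factor.
lam_inflated = lambda_gc([1.1 * s for s in null_stats])
print(f"null: {lam:.3f}, inflated: {lam_inflated:.3f}")
```

A value close to 1 indicates a well-calibrated statistic; values above 1 indicate inflation.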

We also assessed the performance of our scores when the disease model assumptions are not met. We simulated causal SNPs under various disease models, such as dominant and recessive models, or with two independent causal SNPs present within an admixture block. To simulate two independent causal SNPs within the same admixture block, we restricted to SNPs less than 5 Mb apart and with LD less than 0.1 (as measured by r^{2}). Results in

We also looked at heterogeneous effects across Europeans and Africans by simulating 100,000 causal SNPs with R = 1.5 (under no heterogeneity) and assessing the scores at SNPs with different levels of LD with the simulated causal SNP in the two populations. Different LD across populations will induce heterogeneous effects as a function of the allele frequencies and the population-specific LD pattern. Results in

Due to the limited number of markers present on the genotyping platforms, it is often the case that the causal SNP is not directly typed within the GWAS. However, genotypes typed in a study can be used as predictors, in conjunction with haplotypes over denser sets of SNPs from external repositories of human variation such as the HapMap

Following a standard masking approach, we masked 100,000 SNPs at random from the CARe data, imputed them and assessed imputation accuracy using a standard accuracy measure, the squared correlation between imputed and true ‘masked’ genotypes. We observe an average imputation r^{2} of 0.858 when our local ancestry aware framework is used, as opposed to 0.855 under the standard cosmopolitan approach, confirming that there is a small gain in accuracy by conditioning imputation on local ancestry. We observe a smaller improvement in imputation performance than the one reported in

We plot the average imputation accuracy as a function of allele frequency difference between CEU and YRI both when CEU+YRI was used as reference and when using the local ancestry aware framework.
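The accuracy measure used in the masking experiment, the squared correlation between imputed and true masked genotypes, can be computed per SNP as in the following minimal sketch. The function name and the toy genotype vectors are illustrative assumptions.

```python
def squared_correlation(true_genotypes, imputed_genotypes):
    """Squared Pearson correlation (r^2) between true and imputed genotypes at one SNP."""
    n = len(true_genotypes)
    if n != len(imputed_genotypes) or n < 2:
        raise ValueError("need two equally long vectors with at least two samples")
    mean_t = sum(true_genotypes) / n
    mean_i = sum(imputed_genotypes) / n
    s_tt = sum((t - mean_t) ** 2 for t in true_genotypes)
    s_ii = sum((i - mean_i) ** 2 for i in imputed_genotypes)
    s_ti = sum((t - mean_t) * (i - mean_i)
               for t, i in zip(true_genotypes, imputed_genotypes))
    if s_tt == 0.0 or s_ii == 0.0:
        return float("nan")  # r^2 is undefined for a monomorphic vector
    return s_ti * s_ti / (s_tt * s_ii)

# Toy example: one of six masked genotypes is imputed incorrectly.
truth = [0, 1, 2, 1, 0, 2]
imputed = [0, 1, 2, 1, 1, 2]
r2_perfect = squared_correlation(truth, truth)
r2_one_error = squared_correlation(truth, imputed)
```

Averaging this quantity over all masked SNPs gives a summary of the kind reported above (0.858 versus 0.855).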

A straightforward approach for extending association statistics at imputed SNPs is to use the maximum likelihood estimates for unobserved genotypes. Although this procedure does not fully account for the uncertainty in the imputed genotypes, it has been previously shown to perform well when there is considerable confidence in the imputed genotype calls. Throughout this paper we compute statistics over the maximum likelihood genotype calls. Although our novel scores could potentially be improved by fully incorporating the imputation uncertainty in the likelihood framework we note that the MIX score outperforms the standard ATT score, even when the ATT score accounts for the imputation uncertainty through the use of dosages instead of maximum likelihood genotype calls (see

We masked the 100,000 SNPs that were used for simulation of phenotypes and imputed genotypes at these SNPs using our local ancestry aware imputation framework (see

An important aspect in disease scoring statistics is to assess their performance when the causal SNP is untyped and, due to various reasons (e.g. not present in the reference panel), cannot be imputed. To address this scenario we randomly picked 100,000 autosomal SNPs and simulated case-control phenotypes for R = 1.5 using the methodology described above. For all the SNPs we evaluated the statistics at 40 SNPs in the neighborhood of the simulated SNP and, for each score, computed the maximum statistic in this region by either masking the simulated causal SNP or by including it in the computation of the maximum. Results in

Score | Average maximum χ^{2} value | Proportion of regions that are genome-wide significant
ATT χ^{2}(1dof) | 26.17 | 0.3834
SNP1 χ^{2}(1dof) | 25.47 | 0.3622
ADM χ^{2}(1dof) | 4.23 | 0.0135
SUM χ^{2}(2dof) | 28.62 | 0.3571
MIX χ^{2}(1dof) | 27.46 | 0.4158

As a sanity check we evaluated these scores using data from the CARe study for two case-control phenotypes: coronary heart disease (CHD) and type 2 diabetes (T2D), for which associations at several loci have been reported previously

CHD
SNP | chrom | position (build35) | CEUfreq | YRIfreq | ATT | SNP1 | ADM | SUM | HET | MIX
rs17577085 | 5 | 141,843,788 | 0.11 | 0.00 | 2.66 | 1.54 | 1.46 | 2.00 | 0.00 | 2.06
rs4244029* | 5 | 141,893,025 | 0.08 | 0.27 | 2.66 | 2.84 | 1.31 | 3.06 | 0.56 | 2.51
Best score | 5 | - | - | - | 2.66 | 2.84 | 1.93 | - | 2.51
rs325105 | 6 | 147,805,960 | 0.47 | 0.012 | 2.62 | 1.65 | 0.81 | 1.57 | 0.65 | 2.15
rs325129* | 6 | 147,848,836 | 0.25 | 0.74 | 3.22 | 2.55 | 1.05 | 2.57 | 0.26 | 3.12
Best score | 6 | - | - | - | 2.86 | 1.18 | 2.79 | - | 3.13
rs6475606 | 9 | 22,071,850 | 0.5 | 0.01 | 1.87 | 2.72 | 0.11 | 2.11 | 2.04 | 2.38
rs1333047* | 9 | 22,114,504 | 0.49 | 0.99 | 2.32 | 3.64 | 0.00 | 2.95 | 2.05 | 2.96
Best score | 9 | - | - | - | 2.50 | 0.32 | 2.95 | - | 2.99

T2D
SNP | chrom | position (build35) | CEUfreq | YRIfreq | ATT | SNP1 | ADM | SUM | HET | MIX
rs13424957 | 2 | 165,575,897 | 0 | 0.28 | 4.58 | 4.19 | 0.61 | 3.76 | 0.00 | 4.46
rs13396952* | 2 | 165,562,786 | 0.02 | 0.3 | 4.41 | 4.04 | 0.57 | 3.61 | 0.13 | 4.29
Best score | 2 | - | - | - | 4.19 | 1.01 | 3.76 | - | 4.46
rs7901695 | 10 | 114,744,078 | 0.28 | 0.53 | 4.11 | 4.36 | 0.75 | 4.01 | 0.16 | 3.97
rs7903146* | 10 | 114,748,339 | 0.25 | 0.29 | 5.37 | 5.03 | 0.80 | 4.69 | 0.19 | 5.05
Best score | 10 | - | - | - | 5.03 | 1.25 | 4.69 | - | 5.05

Finally, we note that due to the fundamental difference between the asymptotically equivalent goodness-of-fit (ATT) and likelihood-ratio χ^{2}(1dof) tests (MIX), the scores may differ in either direction, but the likelihood-ratio approach used in the MIX score is theoretically appropriate (see

For a test analysis with a larger number of cases and potentially greater case-only admixture information, we also evaluated the above scores at the known FGFR2 breast cancer locus. The difference between the SUM score (χ^{2}(2 dof) = 22.74) and the MIX score (χ^{2}(1 dof) = 17.04) provides some evidence (χ^{2}(1 dof) = 5.7, P-value = 0.016) that rs2981578 may not be the unique causal variant at the FGFR2 locus. We also note that the HET score (χ^{2}(1 dof) = 1.80) provides little to no evidence in support of the hypothesis of heterogeneity at this SNP. Complete results of the breast cancer GWAS will be presented elsewhere (C. Haiman and colleagues, unpublished data).

Statistic | ATT | ADM | MIX | SNP1 | HET | SUM
χ^{2} value | 13.99 | 6.16 | 17.04 | 16.57 | 1.80 | 22.74
-log10(p-value) | 3.74 | 1.88 | 4.44 | 4.33 | 0.75 | 4.94
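The two rows of the table above are linked by the χ^{2} upper-tail probability, with 1 degree of freedom for every score except SUM, which has 2. The sketch below recomputes the -log10(p-value) row from the χ^{2} row; the function name `neg_log10_p` is an illustrative assumption, and only the closed forms for 1 and 2 degrees of freedom are implemented.

```python
import math

def neg_log10_p(stat, dof):
    """-log10 of the chi-square upper-tail p-value for 1 or 2 degrees of freedom."""
    if dof == 1:
        p = math.erfc(math.sqrt(stat / 2.0))
    elif dof == 2:
        p = math.exp(-stat / 2.0)
    else:
        raise ValueError("only 1 or 2 degrees of freedom are handled in this sketch")
    return -math.log10(p)

# (chi-square value, degrees of freedom) for each score at the FGFR2 SNP, from the table.
fgfr2_scores = {
    "ATT": (13.99, 1),
    "ADM": (6.16, 1),
    "MIX": (17.04, 1),
    "SNP1": (16.57, 1),
    "HET": (1.80, 1),
    "SUM": (22.74, 2),
}

recomputed = {name: neg_log10_p(stat, dof) for name, (stat, dof) in fgfr2_scores.items()}
for name, value in recomputed.items():
    print(f"{name}: {value:.2f}")
```

Although SUM has the largest χ^{2} value, its advantage over MIX on the p-value scale is much smaller, because the SUM statistic is referred to a distribution with 2 degrees of freedom.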

We again used the Armitage trend test with correction for genome-wide ancestry as the baseline for our analyses. We also considered a SNP association score conditioned on local ancestry, as well as an admixture score that associates the local ancestry to the continuous phenotype with genome-wide ancestry as a covariate. (There is no analogue to a case-only admixture score for quantitative traits). As in the dichotomous case, we summed the SNP association score conditioned on local ancestry with the admixture score to produce a χ^{2}(2dof) score, but show below that the higher degrees of freedom lead to a reduction in statistical power. Finally, we considered a χ^{2}(1dof) heterogeneity score that tests for a difference in effect size conditional on African or European ancestry, by comparing a model that allows different effect sizes to a model with a uniform effect size (see

Analogous to simulations of dichotomous phenotypes, for 100,000 randomly chosen SNPs we used CARe genotypes and simulated phenotypes for 2,000 samples based on a null model or a causal model with effect sizes

We compared 4 scores: Armitage trend test with correction for genome-wide ancestry (QATT), SNP association conditioned on local ancestry (QSNP1), local ancestry admixture association (QADM), and sum of QSNP1 and QADM (QSUM). All of these are χ^{2}(1dof) scores, except for QSUM which is χ^{2}(2dof). Results are displayed in

Typed Genotypes
QATT χ^{2}(1dof) | 0.0013 | 0.0009 | 0.2165 | 0.3223 | 0.8566 | 0.9883
QSNP1 χ^{2}(1dof) | 0.0012 | 0.0005 | 0.1951 | 0.2087 | 0.8437 | 0.9422
QADM χ^{2}(1dof) | 0 | 0.0001 | 0.0004 | 0.0048 | 0.0229 | 0.2594
QSUM χ^{2}(2dof) | 0.0006 | 0.0003 | 0.1636 | 0.2473 | 0.8353 | 0.9839

Imputed Genotypes
QATT χ^{2}(1dof) | 0.0008 | 0.0009 | 0.1526 | 0.1677 | 0.7853 | 0.7993
QSNP1 χ^{2}(1dof) | 0.0007 | 0.0008 | 0.1346 | 0.1398 | 0.7663 | 0.7772
QADM χ^{2}(1dof) | 0 | 0.0001 | 0.0004 | 0.0048 | 0.0229 | 0.2594
QSUM χ^{2}(2dof) | 0.0004 | 0.0004 | 0.1115 | 0.1245 | 0.7617 | 0.7762

We evaluated the above scores using data from two quantitative phenotypes from CARe, LDL and HDL cholesterol, for which associations at several loci have previously been reported. Results for genotyped and imputed SNPs in the region are displayed in χ^{2}(2 dof) QSUM score outperforms the χ^{2}(1 dof) ATT score. We point out that the presence of multiple causal variants, or alternatively an untyped/unimputed variant with large allele frequency differentiation, may invalidate the assumptions made by the QATT score and lead to poor performance. This suggests that the QSUM score can be of value in a minority of instances where strong admixture associations exist. We caution that in such cases an additional multiple hypothesis testing correction may be needed and that the QSNP1 score conditioned on local ancestry will be needed for localization

Incorporating admixture association signals into GWAS of admixed populations is likely to be particularly informative for diseases for which risk differs depending on ancestry. Cardiovascular disease (CVD) is a prime example, as African ancestry is associated to higher CVD mortality and to CVD risk factors such as hypertension, serum lipid levels and left ventricular hypertrophy

By analyzing real and simulated case-control phenotypes, we have shown that the MIX score, which incorporates both SNP and admixture association signals, yields a significant increase in statistical power over commonly used scores such as the Armitage trend test with correction for global ancestry. For randomly ascertained quantitative traits, in contrast to case-control phenotypes, there is no case-only admixture score and thus no benefit from joint modeling of admixture and SNP association. Thus, for quantitative phenotypes, the QATT score in general yields higher statistical power than the other scores compared. We therefore recommend the use of the MIX and QATT scores for dichotomous and quantitative traits, respectively, in future GWAS in admixed populations. However, we note that in various scenarios (e.g., multiple causal variants, heterogeneous effects, absence of the causal variant from the typed or imputed markers) the assumptions made by the MIX and QATT scores may be invalid, and using them can lead to poor performance. For this reason, we recommend that special consideration be given to regions with high signals of admixture association, in which the SUM and QSUM scores may produce higher association signals than MIX and QATT. As a future direction, we note that an improved score for non-randomly ascertained quantitative traits could potentially be developed, which would generalize both the MIX score for dichotomous traits and the QATT score for randomly ascertained quantitative traits.

As GWAS in European populations have demonstrated, association statistics need not be limited to SNPs that have been genotyped, because imputation algorithms that we and others have developed can be used to infer the genotypes of untyped SNPs by making use of haplotype information from HapMap. Our methods also perform well in the setting of imputation, when the causal SNP is not genotyped. As future work we consider the extension of our likelihood based scores to fully account for imputation uncertainty, where a promising direction is to define the likelihood as a full integration over the missing data given the observed data and the parameters of the model

Our results using simulated phenotypes show that, although benefiting from a reduced multiple-hypothesis testing burden, the admixture scoring yields lower power for finding associations when compared to SNP association scoring. An explanation is the limited number of SNPs that show high allelic differentiation among the ancestral populations (e.g., in our simulations only 7.6% of the SNPs have an allelic differentiation greater than 0.4 between Europeans and Africans). However, we note that the question of whether there exists a combined SNP and admixture score that benefits from reduced multiple hypothesis testing for the admixture component of the score is an important open question that requires further consideration.

While this paper focuses on frequentist approaches for disease scoring in admixed populations, we mention that joint modeling of admixture and SNP association signals could be developed in a Bayesian framework

Although in this work we have focused on African Americans, in theory our approaches can be extended to other admixed populations such as Latino populations, which inherit ancestry from up to three continental ancestral populations (European, Native American and African). The approaches presented in this work can be extended to three-way admixed populations either by adopting a one-ancestry-versus-the-rest strategy or by jointly modeling the three ancestry odds ratios so that a single SNP odds ratio would lead to implied ancestry odds ratios for each ancestry. However, we caution that in the context of Latino populations, more work is needed to assess the performance and possible biases of the local ancestry estimates and their potential effects on methods that incorporate admixture and case-control signals into disease scoring statistics.

A final consideration is in fine-mapping causal loci. Here the availability of samples—or chromosomal segments—of distinct ancestry is valuable

The CARe project has been approved by the Committee on the Use of Humans as Experimental Subjects (COUHES) of the Massachusetts Institute of Technology, and by the Institutional Review Boards of each of the nine parent cohorts.

Affymetrix 6.0 genotyping and QC filtering of African-American samples from the CARe cardiovascular consortium was performed as described previously

When run in default mode, HAPMIX outputs local ancestry estimates as the expected probability of 0, 1 or 2 copies of European ancestry at each SNP (see ref.

We selected a random subset of 100,000 autosomal SNPs. For each SNP, we simulated phenotypes for ^{2} of being chosen.
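As a rough illustration of this kind of phenotype simulation, the sketch below draws cases with probability proportional to R^{g}, where g is the number of risk alleles carried, and then draws controls uniformly from the remaining samples. Both the multiplicative weighting and the uniform choice of controls are assumptions made for this example and may differ in detail from the procedure used in the study. The function name, the sample size and the allele frequency are hypothetical.

```python
import random

def simulate_case_control(genotypes, odds_ratio, n_cases, n_controls, rng):
    """Pick cases with weight odds_ratio**g (without replacement), controls uniformly.

    Weighted sampling without replacement uses the Efraimidis-Spirakis key
    u**(1/w): the n_cases samples with the largest keys become cases.
    """
    keyed = []
    for index, g in enumerate(genotypes):
        weight = odds_ratio ** g
        keyed.append((rng.random() ** (1.0 / weight), index))
    keyed.sort(reverse=True)
    cases = [index for _, index in keyed[:n_cases]]
    case_set = set(cases)
    remaining = [i for i in range(len(genotypes)) if i not in case_set]
    controls = rng.sample(remaining, n_controls)
    return cases, controls

# Hypothetical cohort: 6,000 samples, risk allele frequency 0.3 at the causal SNP.
rng = random.Random(1)
genotypes = [sum(rng.random() < 0.3 for _ in range(2)) for _ in range(6000)]
cases, controls = simulate_case_control(genotypes, 2.0, 1000, 1000, rng)

mean_cases = sum(genotypes[i] for i in cases) / len(cases)
mean_controls = sum(genotypes[i] for i in controls) / len(controls)
print(f"mean risk allele count: cases {mean_cases:.2f}, controls {mean_controls:.2f}")
```

With an odds ratio of 1 every sample has the same weight, so the procedure reduces to choosing cases and controls at random, as in the null model.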

A χ^{2}(1dof) statistic via the Armitage trend test with adjustment for genome-wide ancestry, as described previously
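A minimal sketch of a trend test with ancestry adjustment is given below. It relies on the identity that the unadjusted Armitage trend statistic equals N times the squared correlation between genotype and phenotype, and it adjusts for genome-wide ancestry by regressing ancestry out of both variables first. This residualization and the scaling by N are assumptions of the sketch; the exact adjustment and scaling used for the ATT score may differ. All names and the toy data are illustrative.

```python
def residualize(values, covariate):
    """Residuals of a simple linear regression of values on one covariate."""
    n = len(values)
    mean_v = sum(values) / n
    mean_c = sum(covariate) / n
    s_cc = sum((c - mean_c) ** 2 for c in covariate)
    if s_cc == 0.0:
        slope = 0.0  # constant covariate: only the mean is removed
    else:
        slope = sum((c - mean_c) * (v - mean_v)
                    for c, v in zip(covariate, values)) / s_cc
    return [(v - mean_v) - slope * (c - mean_c) for c, v in zip(covariate, values)]

def adjusted_trend_chi2(genotypes, phenotypes, ancestry):
    """N * r^2 between ancestry-adjusted genotype and ancestry-adjusted phenotype."""
    g = residualize(genotypes, ancestry)
    y = residualize(phenotypes, ancestry)
    s_gg = sum(a * a for a in g)
    s_yy = sum(b * b for b in y)
    s_gy = sum(a * b for a, b in zip(g, y))
    if s_gg == 0.0 or s_yy == 0.0:
        return 0.0
    return len(g) * s_gy * s_gy / (s_gg * s_yy)

# Toy data: 4 cases (1) and 4 controls (0); ancestry is uncorrelated with both variables.
genotypes = [2, 2, 1, 1, 1, 1, 0, 0]
phenotypes = [1, 1, 1, 1, 0, 0, 0, 0]
ancestry = [0.1, 0.3, 0.1, 0.3, 0.1, 0.3, 0.1, 0.3]
stat = adjusted_trend_chi2(genotypes, phenotypes, ancestry)
```

In this toy example r^{2} = 0.5 and N = 8, so the statistic is 4.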

A χ^{2}(1dof) likelihood ratio test that compares the null hypothesis of case-control odds ratio R = 1 with the alternate hypothesis of R ≠ 1, where R is assumed to be the same across populations, while the allele frequencies are treated as nuisance parameters.

For every local ancestry _{1}_{2} (AA, AE, or EE) and phenotype

Then, the χ^{2} statistic with 1 degree of freedom is:

A χ^{2}(1dof) likelihood ratio test that compares the local ancestry in the disease cases to the average local ancestry across the genome in the same disease cases. This is more powerful than comparing cases to controls, since no statistical noise is introduced from controls

Let _{i}_{i}

Then the likelihood is ^{2}(1dof) likelihood ratio test defined as:
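To give a feel for how a case-only admixture likelihood ratio test behaves, the sketch below implements a generic textbook-style version and should not be read as the ADM formula of this paper. It assumes that each copy of European ancestry at the locus multiplies disease risk by an ancestry odds ratio ψ, so that for a case with genome-wide European ancestry θ the probability of carrying k European copies is proportional to Binomial(k; 2, θ)·ψ^{k}. The log likelihood ratio against ψ = 1 is then Σ_i [k_i log ψ − 2 log(1 − θ_i + θ_i ψ)]. The function name, the search interval and the toy data are all assumptions.

```python
import math

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0

def case_only_admixture_lrt(european_copies, theta, log_psi_range=(-5.0, 5.0)):
    """Return (psi_hat, chi2_1dof) for cases only, under the assumed multiplicative model."""
    total_copies = sum(european_copies)

    def log_lr(log_psi):
        psi = math.exp(log_psi)
        return total_copies * log_psi - 2.0 * sum(
            math.log(1.0 - t + t * psi) for t in theta)

    # The log likelihood ratio is concave in log(psi), so a golden-section
    # search over a bounded interval finds its maximum.
    a, b = log_psi_range
    c = b - INV_PHI * (b - a)
    d = a + INV_PHI * (b - a)
    fc, fd = log_lr(c), log_lr(d)
    while b - a > 1e-10:
        if fc > fd:
            b, d, fd = d, c, fc
            c = b - INV_PHI * (b - a)
            fc = log_lr(c)
        else:
            a, c, fc = c, d, fd
            d = a + INV_PHI * (b - a)
            fd = log_lr(d)
    log_psi_hat = 0.5 * (a + b)
    return math.exp(log_psi_hat), max(0.0, 2.0 * log_lr(log_psi_hat))

# Toy example: 1,000 cases with genome-wide European ancestry 0.2, so 400 European
# copies are expected at any locus; 500 are observed at the locus under test.
theta = [0.2] * 1000
psi_hat, chi2_excess = case_only_admixture_lrt([1] * 500 + [0] * 500, theta)
psi_null, chi2_null = case_only_admixture_lrt([1] * 400 + [0] * 600, theta)
```

Because only cases enter the likelihood, no sampling noise from controls is introduced, which is the property emphasized above for the ADM score.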

A χ^{2}(2dof) that sums the SNP1 and the ADM statistics

A χ^{2}(1dof) test that combines the SNP1 and ADM likelihood functions by using the implied ancestry odds ratio

The MIX likelihood is defined as the product of the likelihoods for SNP1 and ADM as ^{2} statistic with 1 degree of freedom as:

A χ2(1dof) test that compares the alternate hypothesis of different odds ratios in different ancestries with the null model that assumes the same odds ratio. The likelihood _{A} and R_{E}) which leads to ^{2}(1dof) statistic as:

We incorporate the observed variance of the average local ancestry across the genome assuming that the average local ancestry ^{2}(1dof) statistic, ADMGC, that incorporates the empirical variance and in the ADM score as:^{2}(1dof) statistic MIXGC, that incorporates the empirical variance of the average local ancestry:

Many of the likelihoods defined above require a multidimensional optimization. The number of parameters optimized in the likelihoods is 3 for the SNP1 score, 1 for the ADM score, 3 for the MIX score and 4 for the HET score. (The HET score can be reduced to two independent 2-parameter optimizations by considering cases and controls separately.) For the ADM score, Newton’s method was used. For the SNP1, MIX and HET scores, Brent’s algorithm was used (GSL software library implementation; see Web Resources). The maximization is performed in one dimension over each parameter in turn, repeating for each parameter until the algorithm converges. In rare instances, extreme variation in the slope of the log likelihood as a function of odds ratio can cause the algorithm to fail to converge; in these situations a simple binary search is used.
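The parameter-by-parameter maximization described above can be sketched as a cyclic coordinate ascent. To keep the example self-contained, a golden-section search stands in for Brent's algorithm as the one-dimensional optimizer, and a simple concave quadratic stands in for the log likelihood; both substitutions, and all names, are assumptions of this sketch rather than the implementation used in the study.

```python
import math

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0

def golden_section_max(f, lo, hi, tol=1e-10):
    """Maximize a unimodal one-dimensional function on [lo, hi]."""
    a, b = lo, hi
    c = b - INV_PHI * (b - a)
    d = a + INV_PHI * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:  # the maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - INV_PHI * (b - a)
            fc = f(c)
        else:        # the maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + INV_PHI * (b - a)
            fd = f(d)
    return 0.5 * (a + b)

def coordinate_ascent(f, start, bounds, max_sweeps=200, tol=1e-6):
    """Maximize f over each parameter in turn until no parameter moves more than tol."""
    x = list(start)
    for _ in range(max_sweeps):
        previous = list(x)
        for j, (lo, hi) in enumerate(bounds):
            def along_axis(value, j=j):
                trial = list(x)
                trial[j] = value
                return f(trial)
            x[j] = golden_section_max(along_axis, lo, hi)
        if max(abs(p - q) for p, q in zip(previous, x)) < tol:
            break
    return x

def toy_log_likelihood(params):
    """Concave stand-in for a log likelihood, with maximum at (1.6, -2.4)."""
    x, y = params
    return -(x - 1.0) ** 2 - (y + 2.0) ** 2 - 0.5 * x * y

optimum = coordinate_ascent(toy_log_likelihood, [0.0, 0.0], [(-10.0, 10.0), (-10.0, 10.0)])
```

Coordinate-wise maximization needs only a robust one-dimensional optimizer and no derivatives, at the price of slower convergence when the parameters are strongly coupled.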

We employed the widely used MACH

Even when the true odds ratio is the same across populations, different imputation quality across the segments with different ancestries can lead to different estimates for the allelic odds ratios in European versus African segments. We account for this by adjusting the observed allelic odds ratios in the SNP1 and the MIX scores as follows. Following a derivation similar to

We randomly selected 100,000 autosomal SNPs and simulated phenotypes as described above using R = 1.5. For all the compared scores, we computed the maximum statistic over all SNPs across a region centered on the SNP of interest (taking the 20 SNPs upstream and 20 SNPs downstream). We computed the maximum of the statistics either over 41 SNPs by including the simulated causal SNP or over 40 SNPs by ignoring the statistics at the simulated causal SNP.
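The regional maximum described above can be computed as in the following minimal sketch, in which `stats` is assumed to hold one score per SNP in chromosomal order. The function name and the toy values are illustrative.

```python
def region_max(stats, causal_index, flank=20, include_causal=True):
    """Maximum statistic over `flank` SNPs on each side of the causal SNP.

    With include_causal=False the statistic at the causal SNP itself is ignored,
    which mimics a causal SNP that is neither typed nor imputable.
    """
    lo = max(0, causal_index - flank)
    hi = min(len(stats), causal_index + flank + 1)
    window = [s for j, s in enumerate(stats[lo:hi], start=lo)
              if include_causal or j != causal_index]
    return max(window)

# Toy example: 100 SNPs with increasing scores and a strong signal at the causal SNP.
stats = [float(j) for j in range(100)]
stats[50] = 1000.0
max_with_causal = region_max(stats, 50, include_causal=True)      # over 41 SNPs
max_without_causal = region_max(stats, 50, include_causal=False)  # over 40 SNPs
```

Near the ends of a chromosome the window is truncated, so fewer than 41 SNPs contribute to the maximum.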

Case-control phenotypes for coronary heart disease (CHD) and type 2 diabetes (T2D) were ascertained as described previously

The FGFR2 locus has been associated with breast cancer in women of European and Asian descent

For each of 100,000 autosomal SNPs, we simulated phenotypes for

Let ^{2}(1dof) statistic as

Let ^{2}(1dof) statistic as

The model is ^{2}(1dof) statistic as

We sum the two χ^{2}(1dof) statistics to produce a χ^{2}(2dof) statistic.

Let ^{2}(2dof) statistic as ^{2}(2dof) statistic minus the

LDL and HDL cholesterol phenotypes in CARe samples were ascertained as described previously. We analyzed 5,801 samples for LDL and 5,946 samples for HDL for which phenotypic data were available, restricting to 6,209 unrelated individuals as defined above. For every analyzed SNP we performed imputation within a region of 10Mb centered on the SNP of interest using the MACH imputation method under the local ancestry aware framework. We assessed the scoring statistics at all SNPs within 100Kb of the SNPs of interest.

Principal components analysis of CARe and HapMap3 samples. Only the HapMap3 populations CEU, YRI and CHB were used to compute principal components.

(0.11 MB PDF)

Average local ancestry of 6,209 CARe samples.

(0.07 MB PDF)

Proportion of SNPs with imputation accuracy difference in European versus African segments under a specified threshold. The imputation accuracy in European (African) segments was estimated for each SNP as the squared correlation between true masked genotypes and imputed genotypes restricted to samples containing 2(0) European (African) alleles at that locus.

(0.04 MB PDF)

Proportion of SNPs achieving genome-wide significance as a function of the expected difference in odds ratios between Africans and Europeans. Scores were computed at SNPs neighboring 100,000 simulated causal SNPs (R = 1.5) that tag the simulated causal SNP with different LD in Europeans versus Africans.

(0.03 MB PDF)

Average value and statistical power of simulated case-control MIX score in African Americans imputed genotypes under various imputation settings (MIX*-denotes no adjustment for differences in imputation error rates). For each setting we list the average χ^{2} value and proportion of SNPs for which the score attains genome-wide significance (defined as P<5e-08), for random SNPs as well as SNPs in the top decile of population differences (Δ>0.4), for

(0.03 MB DOC)

Average statistic and statistical power of case-control scores in African Americans computed for different number of cases and R = 1.5. The number of controls is set to 1000. For each score we list the average χ^{2} value and proportion of SNPs for which the score attains genome-wide significance (defined as P<5e-08 for all scores except ADM, P<1e-05 for ADM). In general all the scores show decrease in performance with the decrease in number of cases. The increase in performance of MIX over ATT score diminishes with the number of cases: for 100 cases, the increase of average χ^{2} in MIX over ATT is less than 1%, while for 1000 cases, the same increase is greater than 5%.

(0.03 MB DOC)

Average statistic and statistical power of case-control scores in African Americans computed under various disease models. 1000 cases and 1000 controls were simulated at 100,000 SNPs with odds ratio R. For each score we list the average χ^{2} value and proportion of SNPs for which the score attains genome-wide significance (defined as P<5e-08 for all scores except ADM, P<1e-05 for ADM). In the multiple causal scenarios, for each of the 100,000 SNPs, a nearby SNP (distance less than 5Mb and with r^{2}<0.1) was selected and a disease model with two causal SNPs was simulated in which both SNPs had an odds ratio of 1.5. With the exception of the ‘Dominant’ scenario in which ATT and MIX obtain similar results, in all remaining cases MIX outperforms the other scores in terms of power.

(0.04 MB DOC)

Results for LDL and HDL quantitative phenotypes. (a) We list results for each score (-log in base 10 of the p-value) for genotyped SNPs that have previously been associated to LDL in CARe samples, the imputed (* denotes imputed SNPs) or genotyped SNPs producing the most significant P-values, and the best score for each of the five scores. (b) Analogous to (a), for SNPs associated to HDL. The value achieving the smallest p-value is denoted in bold.

(0.08 MB DOC)

Supplementary Note.

(0.09 MB DOC)