Conceived and designed the experiments: ALP NP DR SM. Performed the experiments: ALP AT NP KCB NR IR THB RM DR SM. Analyzed the data: ALP AT NP DR SM. Contributed reagents/materials/analysis tools: ALP NP KCB NR IR THB RM DR SM. Wrote the paper: ALP AT NP KCB NR IR THB RM DR SM.
The authors have declared that no competing interests exist.
Identifying the ancestry of chromosomal segments in admixed genomes has a wide range of applications, from disease mapping to learning about history. Most methods require the use of unlinked markers, but by using all markers from genome-wide scanning arrays it should in principle be possible to infer the ancestry of even very small segments with exquisite accuracy. We describe a method, HAPMIX, which employs an explicit population genetic model to perform such local ancestry inference based on fine-scale variation data. We show that HAPMIX outperforms other methods, and we explore its utility for inferring ancestry, learning about ancestral populations, and inferring dates of admixture. We validate the method empirically by applying it to populations that have experienced recent and ancient admixture: 935 African Americans from the United States and 29 Mozabites from North Africa. HAPMIX will be of particular utility for mapping disease genes in recently admixed populations, as its accurate estimates of local ancestry permit admixture and case-control association signals to be combined, enabling more powerful tests of association than with either signal alone.
The genomes of individuals from admixed populations consist of chromosomal segments of distinct ancestry. For example, the genomes of African American individuals contain segments of both African and European ancestry, so that a specific location in the genome may inherit 0, 1, or 2 copies of European ancestry. Inferring an individual's local ancestry, their number of copies of each ancestry at each location in the genome, has important applications in disease mapping and in understanding human history. Here we describe HAPMIX, a method that analyzes data from dense genotyping chips to infer local ancestry with very high precision. An important feature of HAPMIX is that it makes use of data from haplotypes (blocks of nearby markers), which are more informative for ancestry than individual markers. Our simulations demonstrate the utility of HAPMIX for local ancestry inference, and empirical applications to African American and Mozabite data sets uncover important aspects of the history of these populations.
The identification of chromosomal segments of distinct continental ancestry in admixed populations is an important problem, with a wide range of applications from disease mapping to understanding human history. Early efforts to solve this problem used coarse sets of unlinked markers
Here, we describe a haplotype-based method, HAPMIX, which applies an extension of the population genetic model of Li and Stephens
We apply HAPMIX to 935 African American individuals genotyped at ∼650,000 markers. By studying a large set of individuals from an admixed population of high relevance to disease mapping, we validate the effectiveness of this method in a practical setting and specifically show that the ancestry estimates are not systematically biased within the limits of our resolution. To illustrate how the method can provide insights into the history of an anciently admixed population, we also apply HAPMIX to a data set of 29 individuals from the Mozabite population of northern Africa that were genotyped at ∼650,000 markers as part of the Human Genome Diversity Panel (HGDP)
For the African American data, informed consent was obtained from each study participant, and the study protocol was approved by the institutional review board at either the Johns Hopkins University or Howard University.
HAPMIX assumes that the admixed population being analyzed has arisen from the admixture of two ancestral populations, and that phased data are available from unadmixed reference populations that are closely related to the true ancestral populations (e.g. phased data from HapMap
The central idea of the method is to view haplotypes of each admixed individual as being sampled from the reference populations: for example, haplotypes of an African American individual could be sampled from phased African and European chromosomes from HapMap. At each position in the genome, HAPMIX estimates the likelihood that a haplotype from an admixed individual is a better statistical match to one reference population or the other. A Hidden Markov Model (HMM) is used to combine these likelihoods with information from neighboring loci, to provide a probabilistic estimate of ancestry at each locus. The method allows transitions at two scales. The small-scale transitions are between haplotypes from within a reference population, typically at a scale of every few tens of thousands of bases
The black lower line represents a chromosomal segment from an admixed individual, carrying a number of typed mutations (black circles). The underlying ancestry is shown in the bottom color bar, and reveals an ancestry change from the first population (red) to the second population (blue). The admixed chromosome is modeled as a mosaic of segments of DNA from two sets of individuals drawn from different reference populations (red and blue horizontal lines respectively) closely related to the progenitor populations for the admixture event. The yellow line shows how the admixed chromosome is constructed in terms of this mosaic. The dotted line above the bottom color bar shows the reference population being copied from along the chromosome – note that at most positions, this is identical to the true underlying ancestry, but with occasional “miscopying” from the other population (blue dotted segment occurring within red ancestry segment). Note also that switches between chromosomes being copied from, representing historical recombinations, are rapid (6 switches), while ancestry changes, representing recombination since admixture, are much rarer (1 switch). Finally, note that at most positions the type of the admixed chromosome is identical to that of the chromosome being copied from, but an exception to this occurs at one site, shown as a grey circle, and representing mutation or genotyping error. In our inference framework, we observe only the variation data for the admixed and reference individuals: the yellow line, and the underlying ancestry, must be inferred as the hidden states in a HMM.
An important strength of HAPMIX is the way it analyzes diploid data from admixed individuals. A naïve way to use population genetic methods to infer ancestry would be to pre-process such a data set using phasing software, and then to assume that this guess about the underlying phased haplotype is correct. However, phase switch errors that arise from this procedure (which are common even with the best phasing algorithms
HAPMIX is also notable in inferring probabilities for whether an individual has 0, 1, or 2 alleles of a particular ancestry at each locus. As our simulations show, these estimates are well-calibrated. Thus, when the method generates a probability
HAPMIX is fundamentally different from existing methods such as ANCESTRYMAP and LAMP
Our approach to inferring ancestry segments, implemented in HAPMIX, is based on extending a Hidden Markov Model (HMM) previously developed by Li and Stephens to model linkage disequilibrium in population genetic data
We extend the Li and Stephens model to allow inference on ancestry segments for individuals drawn from an admixed population. We begin by supposing that we have two previously sampled collections of phased haplotypes,
We begin by modeling the ancestry segments. Assume the admixture event occurred at a single time
To fully specify our model, we must consider the structure of variation conditional on these admixture segments. Our model remains computationally tractable while accommodating important features typical of real data such as mutation, recombination, genotyping error, reference populations that are drifted from the true ancestral populations, and incomplete sampling of diversity in the reference populations reflected in the samples drawn from these populations. We assume that all mutant sites take the form of single nucleotide polymorphisms (SNPs) with two alleles that can be represented as 0 and 1 (however, our approach could be extended to more complex mutation models).
We suppose that sections of the genome with true ancestry from population 1 are formed as mosaics of the haplotypes in the two parental groups. Specifically, at any given position with this ancestry, an individual from
For sections of the genome with ancestry from population 2, we formulate our model in an analogous way, with corresponding parameters
Some additional remarks about the interpretation of these parameters may be useful. As in the original Li and Stephens implementation,
For a typical application of HAPMIX, we expect to have data from a collection of discrete typed sites. Suppose we have
This probability allows us to calculate the likelihood of the observed data in the offspring for each possible underlying state. At sites with missing data in the offspring chromosome, the appropriate likelihood contribution is simply 1.0.
Choices of
Equations (0.1) and (0.2) describe a HMM for the underlying state (which includes information on ancestry) as we move along the genome, and the underlying Markov process is reversible. Given a set of parameters, HAPMIX exploits these properties and uses standard HMM techniques either to infer posterior probabilities of underlying states efficiently, via the forward-backward algorithm, or to sample random state paths from the correct joint posterior distribution, using a standard modification of this algorithm. In addition to parameter values, the software takes as input a recombination map for the regions to be analyzed, phased “parental” chromosomes from the two reference populations, and “offspring” data from the admixed population being analyzed.
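The forward-backward computation can be illustrated on a deliberately reduced model. The sketch below is not the HAPMIX state space, which also tracks the reference population and haplotype being copied. It assumes only two hidden ancestry states, a hypothetical per-interval switch probability (`switch`), stationary state probabilities (`prior`), and per-site emission likelihoods supplied by the caller (`emis`). All of these names are illustrative. Sites with missing data are given a likelihood of 1.0 under every state, matching the treatment of missing data described earlier.

```python
def forward_backward(emis, switch, prior):
    """Posterior state probabilities for a toy ancestry chain.

    emis   : list of per-site lists, emis[s][k] = P(data at site s | state k);
             a row of 1.0 values encodes a site with missing data.
    switch : list of length len(emis) - 1; hypothetical probability that the
             chain redraws its state between site s and site s + 1.
    prior  : stationary probabilities of the states (assumed to sum to 1).
    """
    n_sites = len(emis)
    n_states = len(prior)

    def normalize(v):
        total = sum(v)
        return [x / total for x in v]

    # Forward pass, rescaled at every site to avoid numerical underflow.
    fwd = [normalize([prior[k] * emis[0][k] for k in range(n_states)])]
    for s in range(1, n_sites):
        p = switch[s - 1]
        prev = fwd[-1]
        total = sum(prev)
        # With probability p the state is redrawn from `prior`;
        # otherwise the previous state is kept.
        pred = [(1 - p) * prev[k] + p * prior[k] * total for k in range(n_states)]
        fwd.append(normalize([pred[k] * emis[s][k] for k in range(n_states)]))

    # Backward pass with the same transition structure.
    bwd = [[1.0] * n_states for _ in range(n_sites)]
    for s in range(n_sites - 2, -1, -1):
        p = switch[s]
        msg = [emis[s + 1][k] * bwd[s + 1][k] for k in range(n_states)]
        jump = sum(prior[k] * msg[k] for k in range(n_states))
        bwd[s] = normalize([(1 - p) * msg[k] + p * jump for k in range(n_states)])

    # Combine both passes into per-site posteriors.
    return [normalize([f * b for f, b in zip(fwd[s], bwd[s])])
            for s in range(n_sites)]


# Toy usage: two sites favoring state 0, one missing site, two favoring state 1.
posterior = forward_backward(
    emis=[[0.95, 0.05], [0.95, 0.05], [1.0, 1.0], [0.05, 0.95], [0.05, 0.95]],
    switch=[0.05, 0.05, 0.05, 0.05],
    prior=[0.8, 0.2],
)
```

Because each transition either keeps the current state or redraws it from `prior`, every update costs time linear in the number of states rather than quadratic. The chain in this sketch is also reversible with respect to `prior`.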
A naïve implementation of the forward/backward algorithm would require computation time proportional to 4
It is straightforward to extend our approach to allow imputation of missing data, while simultaneously labeling underlying ancestry, in an analogous manner to methods employed in several existing approaches to imputation for samples drawn from panmictic populations
Typically, we actually have multiple “offspring” samples (either haploid chromosomes or diploid genotypes, see below) from the admixed population of interest. For the analyses in this paper, we used HAPMIX to analyze data from each sample independently, using the same parental chromosomes in each case. Although in principle improvements to ancestry inference could result from considering the problem in multiple samples jointly, there are formidable computational challenges in adapting our approach to allow this (one possibility might be to employ MCMC, as used for unlinked sites
Typically, real data consist of unphased genotypes for individuals drawn from a population, with haplotypic phase unknown. Many approaches already exist to infer phase from such data
The phasing is implemented using a HMM adapted from that described above (0.1) and employing a composite hidden state at each location, of the form (
Emission probabilities are also adapted from the haploid case. For genotype data, there are 3 possible emissions at typed sites, which we denote as genotypes
Having defined the HMM for this setting, we again use standard techniques to obtain posterior probabilities on (joint) ancestry for the two chromosomes, and then sample states from this posterior distribution. We note that as a by-product of sampling complete states jointly for the two chromosomes together, we are phasing the original data with respect to the underlying ancestry. This may help reduce phasing error rates in admixed populations compared to methods that ignore local ancestry, although we do not pursue this issue here.
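As a small illustration of how a joint posterior on the ancestry of the two chromosomes can be summarized at a single locus, the sketch below collapses it into probabilities of 0, 1, or 2 copies of one ancestry. The input format (a dictionary keyed by ancestry pairs, with 1 denoting European and 0 denoting African ancestry) and the function names are assumptions made for this example, not the output format of the HAPMIX software.

```python
def copy_number_probs(joint):
    """Collapse a joint ancestry posterior into copy-number probabilities.

    joint[(a1, a2)] is the posterior probability that chromosome 1 has
    ancestry a1 and chromosome 2 has ancestry a2 (hypothetical coding:
    1 = European, 0 = African).
    Returns [P(0 copies), P(1 copy), P(2 copies)] of European ancestry.
    """
    probs = [0.0, 0.0, 0.0]
    for (a1, a2), p in joint.items():
        # (0, 1) and (1, 0) both contribute to the one-copy class, so the
        # summary does not depend on which chromosome carries which ancestry.
        probs[a1 + a2] += p
    return probs


def expected_copies(probs):
    """Posterior mean number of European copies at the locus."""
    return probs[1] + 2.0 * probs[2]


# Toy usage with made-up posterior values.
locus = {(0, 0): 0.70, (0, 1): 0.10, (1, 0): 0.15, (1, 1): 0.05}
probs = copy_number_probs(locus)
mean_copies = expected_copies(probs)
```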
We can adapt the computational speedups described above to the diploid setting, so that while a naïve implementation of the forward algorithm would take time proportional to 16
Irrespective of whether the true ancestry is known (as in simulations) or unknown (as in real data), an estimate of the
In simulated data sets where the true ancestry is known, the estimated
We simulated individuals of admixed African and European ancestry by constructing their genomes from a mosaic of real Yoruba and French individuals genotyped on the Illumina 650Y chip as part of the Human Genome Diversity Panel (HGDP)
We constructed 40 haploid admixed genomes (
It is important to distinguish between the
The reference populations used as input to HAPMIX consisted of 60 YRI individuals (120 haploid chromosomes) and 60 CEU individuals (120 haploid chromosomes) from the International HapMap Project
We repeated our simulations at
To investigate the scenario of an even more inaccurate reference population, as well as the asymmetric scenario in which only one reference population is inaccurate, we also repeated our simulations at
We modified our original simulations at
By running HAPMIX in the mode that samples random paths, which produces integer-valued guesses of local ancestry for each individual and each marker, it is possible to reconstruct chromosomal segments from the ancestral populations. We investigated whether these reconstructed segments provide an accurate proxy for the true ancestral populations by using allele counts to compute values of
By comparing the overall likelihoods produced by HAPMIX at various parameter settings, it is possible to evaluate which parameters provide the best fit to the data, irrespective of whether or not the choice of parameter settings significantly impacts the accuracy of local ancestry inference. We investigated how effectively the number of generations
We used HAPMIX to analyze 935 African American samples collected from volunteers living in the Baltimore–Washington, D.C. metropolitan region and genotyped on the Illumina 650Y chip as part of an asthma study. All subjects gave verbal and written consent. The Johns Hopkins and Howard University Institutional Review Boards (IRBs) determined that the samples were consented for genetic research, but not for public release of genotype data. Roughly half of these samples were asthma cases and half were non-asthmatic controls, but all phenotypic information was ignored in the current study (disease mapping analyses of these data will be described elsewhere; K. Barnes et al., unpublished data). We note that irrespective of whether asthmatic cases considered separately exhibit an admixture association signal, one would not expect to observe such a signal in a combined analysis of all 935 samples ignoring phenotypic information, due to dilution of the signal. The analyses were restricted to 510,324 autosomal markers which passed quality controls in the 935 African Americans and were polymorphic in phased YRI and phased CEU data from HapMap. We ran HAPMIX using YRI and CEU as input reference populations, setting
To draw inferences about the ancestral populations of African Americans, we ran HAPMIX in the mode that samples random paths to reconstruct chromosomal segments from the ancestral populations (see above), and used the resulting allele counts to compute
To produce an estimator of the number of generations since admixture for each individual with >20 ancestry segments, we note that the genetic map used as input to the software has total length 35.5 Morgans. For an individual with admixture proportion α, we expect to observe a fraction 2α(1-α) of all recombination events occurring since admixture (i.e. those that result in a change in ancestry). Given λ generations since admixture, we therefore expect to see a total of 142 λ α(1-α) events in a diploid individual. Estimating α using the observed genome-wide ancestry proportion μ for that individual, if
We excluded 3 clear outlier individuals who had more than 20 inferred generations since admixture, because we believe this is likely to indicate partial ancestry from a third source population in these individuals.
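The arithmetic behind the per-individual estimate of time since admixture can be sketched as follows. The expected count of ancestry switches, 142 λ α(1-α), comes from the text: a 35.5 Morgan map, two chromosome copies, and a fraction 2α(1-α) of recombination events that change ancestry. The estimator itself is an assumption made for illustration, a simple method-of-moments inversion of that expectation with the observed ancestry proportion μ in place of α, and the function names are hypothetical.

```python
MAP_LENGTH_MORGANS = 35.5  # total genetic map length quoted in the text


def expected_switches(generations, alpha, map_length=MAP_LENGTH_MORGANS):
    """Expected ancestry switches in a diploid genome.

    2 * map_length * generations recombination events accumulate since
    admixture, and a fraction 2 * alpha * (1 - alpha) of them change ancestry.
    """
    return 2 * map_length * generations * 2 * alpha * (1 - alpha)


def estimate_generations(n_switches, mu, map_length=MAP_LENGTH_MORGANS):
    """Solve expected_switches(lam, mu) = n_switches for lam.

    Assumed moment-matching form; mu is the observed genome-wide
    ancestry proportion of the individual.
    """
    if not 0.0 < mu < 1.0:
        raise ValueError("ancestry proportion must be strictly between 0 and 1")
    return n_switches / (4 * map_length * mu * (1 - mu))


# Toy usage: 6 generations at 20% European ancestry gives about 136 switches.
n_expected = expected_switches(6, 0.2)
lam_hat = estimate_generations(n_expected, 0.2)
```

The denominator becomes small as μ approaches 0 or 1, so the estimate is unstable for individuals with few ancestry segments, which is in keeping with restricting the analysis to individuals with more than 20 segments.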
We analyzed 29 Mozabite samples from the HGDP data set. A total of 30 Mozabite individuals were originally genotyped as part of the HGDP, but one individual (HGDP01281) was excluded due to cryptic relatedness. We ran HAPMIX on the 29 Mozabite individuals using YRI and CEU as the input reference populations. We inferred the number of generations since admixture that provided the best fit to the data, and computed
We ran HAPMIX on a total of 13 populations from the HGDP data that were of African, European, or Middle Eastern ancestry. For each population, we used YRI and CEU as the input reference populations, and estimated the European-related mixture proportion. For populations with European-related ancestry that was estimated to be more than 0% and less than 100%, we also estimated the number of generations since mixture.
The HAPMIX software is available for downloading at the following URL:
We began by examining the performance of HAPMIX in a set of 20 simulated admixed individuals, with an average of 80% African ancestry and 20% European ancestry, and generated with admixture occurring 6 generations ago (
(A) Results comparison for a simulated recently admixed sample on chromosome 1. On each plot, the y-axis denotes the number of European chromosomal copies predicted by each method. The centromere of the chromosome is blanked out in white. The top plot shows the true number of European chromosomes, while the subsequent labeled plots show the results of applying each respective method. (B) Results comparison for a real African American individual across chromosome 1. Plots are constructed as in (A). We note the visible similarity to the simulation results.
A more challenging setting for ancestry inference is when admixture occurs further back in time, resulting in smaller ancestry segments. We therefore repeated the above comparisons with increasing lambda (
For each admixture time, results are based on analyzing 20 admixed individuals, simulated using an average genome-wide proportion of 80% African and 20% European ancestry. For each method, we plot the squared correlation between predicted and true number of European copies as a function of
To investigate whether the probabilities of 0, 1, or 2 copies of European ancestry reported by HAPMIX are well-calibrated, we binned the predicted probabilities into bins of size 0.05 and compared, for each
(A) For simulated admixed data sets, constructed as described in
We also used the HAPMIX predictions to compute an estimate of the squared correlation between predicted and true number of European copies (see
Although most of our simulations focused on individuals of mixed African and European ancestry, we also considered a more general set of two-way mixtures of African, European, Chinese and/or Japanese populations. We again observed that HAPMIX outperformed other methods (see
In many real-world settings, the true reference populations for a particular admixture event may not have had suitable genetic data gathered, or may no longer exist. To test for the effect of this situation on HAPMIX, we repeated our simulations at
We also repeated our simulations at
We investigated how the accuracy of HAPMIX varies with data size, by varying either the number of markers or the number of reference chromosomes, in our
We also investigated how the accuracy of HAPMIX is affected when the parameters used as input are inaccurately specified (see
0.05 | 0.98 (0.20) | 0.82 (0.18) |
0.10 | 0.98 (0.20) | 0.83 (0.19) |
0.40 | 0.98 (0.20) | 0.83 (0.21) |
0.80 | 0.98 (0.20) | 0.83 (0.22) |
2 | 0.98 | n/a |
4 | 0.98 | n/a |
6 | 0.68 | |
8 | 0.98 | 0.72 |
10 | 0.98 | 0.77 |
20 | 0.98 | 0.81 |
40 | 0.97 | 0.83 |
100 | 0.94 | |
200 | n/a | 0.83 |
400 | n/a | 0.80 |
We are interested in applying HAPMIX to improve our understanding of ancestral populations contributing to admixture events. To explore the usefulness of the software for this purpose, we analyzed segments of inferred African or inferred European ancestry from our
trueAFR | trueEUR | |||
Yoruba | French | 0.001 | 0.001 | |
Yoruba | French | 0.000 | 0.003 | |
Mandenka | Basque | 0.000 | 0.003 | |
Mandenka | Basque | 0.001 | 0.003 | |
Yoruba | Druze | 0.000 | 0.006 | |
Yoruba | Druze | 0.001 | 0.007 |
Although the correspondence between inferred ancestral segments and true ancestral populations is reasonably tight, it is not perfect, with
Our results show that supplying the correct value of the number of generations since admixture to HAPMIX has virtually no impact on the accuracy of inference of local ancestry (
We also tried running HAPMIX to infer the date of admixture using data simulated under a double-admixture scenario (
We ran HAPMIX on 935 African American samples to obtain local ancestry estimates at each location in the genome (see
In addition to verifying that predictions are accurate on average, it is also important to check that there are no regions of the genome showing systematically inaccurate ancestry predictions. Such regions could produce spurious signals of selection after admixture in scans of control individuals, or spurious admixture association signals in scans of disease cases
We used HAPMIX to estimate the value of
We sought to investigate whether our precise ancestry inference revealed a correlation between time since admixture and ancestry proportion across individuals. For each individual separately, we estimated a time since admixture (
Each grey point shows an estimate of the time λ since admixture corresponding to one of 935 analyzed African American individuals (
We analyzed 29 HGDP samples from the Mozabite population of North Africa, which has previously been reported to inherit a mixture of both European-related ancestry and ancestry related to sub-Saharan Africans
We further investigated whether local ancestry inference in Mozabite samples matches our expectations from simulated data by simulating an anciently admixed sample with admixture parameters chosen to be similar to Mozabite. Specifically, we assumed 80% European and 20% African ancestry (French and Yoruba from HGDP) and 100 generations since admixture. HAPMIX results on chromosome 1, along with true ancestry, are displayed in
As in
The plots are constructed as for
Different Mozabite individuals within our sample had different estimates of sub-Saharan African ancestry proportions, with a majority at close to 20%, but several individuals having a somewhat higher fraction. Exploration of the causes of this variation (
Which modern-day populations are most closely related to the founder populations for the Mozabite? Following the promising results of our simulation study, we used inferred segments of African-related or European-related ancestry to estimate
To understand the performance of HAPMIX on real populations with a wider range of histories, we applied the method to 13 different HGDP populations that were of African, Middle Eastern, or European origin. Using YRI and CEU as ancestral populations, HAPMIX inferred that 5 of these populations had greater than 0% and less than 100% European-related ancestry (
Population | No. of samples | Estimated percent European ancestry from HAPMIX | Estimated generations since mixture from HAPMIX |
Yoruba | 21 | 0% | N/A |
Mandenka | 21 | 2% | 120 |
Mozabite | 26 | 77% | 100 |
Bedouin | 45 | 91% | 90 |
Palestinian | 41 | 93% | 75 |
Druze | 39 | 97% | 60 |
Adygei | 16 | 100% | N/A |
Basque | 24 | 100% | N/A |
French | 28 | 100% | N/A |
Italian | 12 | 100% | N/A |
Orcadian | 14 | 100% | N/A |
Russian | 25 | 100% | N/A |
Tuscan | 8 | 100% | N/A |
We have described a method that takes advantage of haplotype information to accurately infer segments of chromosomal ancestry in admixed samples, even in the case of ancient admixture. The method is likely to be useful both for disease mapping in admixed populations and for drawing inferences about human history, as our empirical analyses of samples from African American and HGDP populations have demonstrated. The ability to reconstruct chromosomal segments from ancestral populations that contributed to recent or ancient admixture is a particular advance, as it implies that genetic analyses need not be restricted to extant populations but can also be applied to populations that have only left admixed descendants today
HAPMIX has particularly important applications for disease gene mapping, especially in African Americans where the ancestry estimates are exceedingly accurate and where we have shown that they are not systematically biased. With the accurate estimates of ancestry that emerge from HAPMIX it should be possible to carry out dense case-control association studies with hundreds of thousands of markers, which simultaneously test for admixture association
While our analyses show that HAPMIX—because of its explicit use of a population genetic model—has better power to infer locus-specific ancestry than many recent methods, the method also has some limitations in the range of scenarios in which it can be used. For example, it is not currently designed for the analysis of mixtures of more than two ancestral populations, and it requires the use of reference populations. Future directions for extending the HAPMIX method include allowing more than two ancestral populations, using the admixed samples as a pool of reference haplotypes instead of relying on input haplotypes from reference populations, and automating the fitting of model parameters. In addition, although determining the number of generations since admixture with high accuracy is not necessary for effective inference of local ancestry, our results motivate additional work to enable detection of multiple admixture events at different points in time in order to refine the inferences that can be made about human history.
Supplementary note.
(0.05 MB DOC)
Appendix.
(0.10 MB DOC)
We thank D. Falush and J. Marchini for helpful comments.