The authors have declared that no competing interests exist.

Conceived and designed the experiments: JK Pickrell, JK Pritchard. Analyzed the data: JK Pickrell. Contributed reagents/materials/analysis tools: JK Pickrell. Wrote the paper: JK Pickrell, JK Pritchard.

Current address: Department of Genetics, Harvard Medical School, Boston, Massachusetts, United States of America

Many aspects of the historical relationships between populations in a species are reflected in genetic data. Inferring these relationships from genetic data, however, remains a challenging task. In this paper, we present a statistical model for inferring the patterns of population splits and mixtures in multiple populations. In our model, the sampled populations in a species are related to their common ancestor through a graph of ancestral populations. Using genome-wide allele frequency data and a Gaussian approximation to genetic drift, we infer the structure of this graph. We applied this method to a set of 55 human populations and a set of 82 dog breeds and wild canids. In both species, we show that a simple bifurcating tree does not fully describe the data; in contrast, we infer many migration events. While some of the migration events that we find have been detected previously, many have not. For example, in the human data, we infer that Cambodians trace approximately 16% of their ancestry to a population ancestral to other extant East Asian populations. In the dog data, we infer that both the boxer and basenji trace a considerable fraction of their ancestry (9% and 25%, respectively) to wolves subsequent to domestication and that East Asian toy breeds (the Shih Tzu and the Pekingese) result from admixture between modern toy breeds and “ancient” Asian breeds. Software implementing the model described here, called

With modern genotyping technology, it is now possible to obtain large amounts of genetic data from many populations in a species. An important question that can be addressed with these data is: what is the history of these populations? There is a long history in population genetics of inferring the relationships among populations as a bifurcating tree, analogous to phylogenetic trees for representing the evolution of species. However, it has long been recognized that, since populations from the same species exchange genes, simple bifurcating trees may be an incorrect representation of population histories. We have developed a method to address this issue, using a model which allows for both population splits and gene flow. In application to humans, we show that we are able to identify a number of both previously known and unknown episodes of gene flow in history, including gene flow into Cambodians from a population only distantly related to modern East Asians. In application to dogs, we show that the boxer and basenji breeds have a considerable component of ancestry from grey wolves subsequent to domestication.

The extant populations in a species result from an often-complex demographic history, involving population splits, gene flow, and changes in population size. It has long been recognized that genetic data can be used to learn about this history

There are many statistical approaches to demographic inference from genetic data. One approach is to develop an explicit population genetic model for the history of a set of populations, framed in terms of the effective population sizes of the populations, the times of population splits, the times of demographic events (such as population bottlenecks), and other relevant parameters. The values of these parameters can then be learned from the data using a variety of techniques, often involving simulation

Another type of approach to learning about population history uses methods that summarize the major components of genetic variation in a sample by clustering or principal components analysis

A different class of approaches focuses on the relationships between populations, by representing a set of populations as a bifurcating tree

In this paper, we present a unified statistical framework for building population trees and testing for the presence of gene flow between diverged populations. In this framework, the relationship between populations is represented as a graph, allowing us to model both population splits and gene flow. Graph-based models are of growing interest in phylogenetics

The starting point for our model was first proposed by Cavalli-Sforza and Edwards

Our approach to this problem is to first build a maximum likelihood tree of populations. We then identify populations that are poor fits to the tree model, and model migration events involving these populations. Below, we first describe this approach in an idealized setting, and then describe the modifications necessary for implementation in practice.

In the simplest case, consider a single SNP, and let the frequency of one of the alleles at this SNP in an ancestral population be

Now consider a descendant population of

We can write down the expectation and variance of
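To make the Gaussian approximation to drift concrete, the following minimal sketch compares it with a simulated binomial Wright-Fisher process. It is an illustration only, not the implementation used in this paper. The parameter values, the function name `wright_fisher_final_freqs`, and the assumption that the drift parameter is approximately t/(2N) for t generations in a population of N diploid individuals are ours.

```python
import numpy as np

def wright_fisher_final_freqs(p0, n_diploid, generations, replicates, rng):
    """Simulate binomial Wright-Fisher drift and return the final allele frequencies."""
    freqs = np.full(replicates, p0, dtype=float)
    for _ in range(generations):
        # Each generation resamples 2N allele copies from the current frequency.
        freqs = rng.binomial(2 * n_diploid, freqs) / (2 * n_diploid)
    return freqs

rng = np.random.default_rng(0)
p0, n_diploid, generations = 0.3, 1000, 100   # hypothetical values
final = wright_fisher_final_freqs(p0, n_diploid, generations, 20000, rng)

# Assumed drift parameter: c = t / (2N), valid when t is small relative to N.
c = generations / (2 * n_diploid)
gaussian_var = c * p0 * (1 - p0)
# Exact Wright-Fisher variance of the allele frequency after t generations.
exact_var = p0 * (1 - p0) * (1 - (1 - 1 / (2 * n_diploid)) ** generations)

print(f"simulated mean:     {final.mean():.4f}")
print(f"simulated variance: {final.var():.5f}")
print(f"Gaussian variance:  {gaussian_var:.5f}")
print(f"exact variance:     {exact_var:.5f}")
```

With a small amount of drift, the expected frequency stays at the ancestral value and the Gaussian variance is close to the exact one; the approximation degrades as the drift parameter grows.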

Now consider a set of four populations, all related to an ancestral population by a tree, as depicted in

A. An example tree. B. The covariance matrix implied by the tree structure in A. Note that the covariance here is with respect to the allele frequency at the root, and that each entry has been divided by
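The way a tree implies a covariance matrix can be sketched in a few lines. Under the Gaussian drift model, the covariance between two populations (relative to the root allele frequency) is the total drift on the branches they share on the way from the root. The topology, branch lengths, and names below are hypothetical and are not those of the figure; entries are taken to be scaled by the common factor p0(1 - p0), where p0 is the root frequency, which is an assumption of this sketch.

```python
import numpy as np

# Hypothetical four-population tree: ((pop1, pop2), (pop3, pop4)).
# Each node maps to (parent, drift on the branch leading to that node).
TREE = {
    "n12": ("root", 0.02),
    "n34": ("root", 0.03),
    "pop1": ("n12", 0.01),
    "pop2": ("n12", 0.04),
    "pop3": ("n34", 0.02),
    "pop4": ("n34", 0.05),
}

def branches_to_root(tree, node):
    """Return the branches (named by their child node) on the path from node to the root."""
    path = set()
    while node in tree:
        path.add(node)
        node = tree[node][0]
    return path

def tree_covariance(tree, pops):
    """Covariance around the root frequency, assumed scaled by p0 * (1 - p0)."""
    paths = {pop: branches_to_root(tree, pop) for pop in pops}
    cov = np.zeros((len(pops), len(pops)))
    for i, pop_i in enumerate(pops):
        for j, pop_j in enumerate(pops):
            shared = paths[pop_i] & paths[pop_j]
            cov[i, j] = sum(tree[node][1] for node in shared)
    return cov

POPS = ["pop1", "pop2", "pop3", "pop4"]
W = tree_covariance(TREE, POPS)
print(np.round(W, 3))
```

In this toy tree, pop1 and pop2 share only the 0.02 branch, so their covariance is 0.02, while populations on opposite sides of the root share no branch and have covariance zero.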

Let us use

To extend this framework to include migration, we allow populations to have ancestry from multiple parental populations
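Because an admixed population's allele frequency is modeled as a weighted combination of its parental frequencies, its covariances follow from linearity. The sketch below illustrates this under simplifying assumptions of ours: the parental populations are themselves sampled populations (in the model the source may be an ancestral node), the covariance values are hypothetical, and `add_admixed_population` is an illustrative helper rather than part of any released software.

```python
import numpy as np

# Hypothetical covariance (scaled by p0 * (1 - p0)) among three unadmixed populations.
W_PARENTS = np.array([
    [0.03, 0.02, 0.00],
    [0.02, 0.06, 0.00],
    [0.00, 0.00, 0.05],
])

def add_admixed_population(cov, weights, private_drift):
    """Append a population whose frequency is a weighted mix of existing populations.

    weights: ancestry fractions over the existing populations (must sum to 1).
    private_drift: drift accumulated by the admixed population after the mixture.
    """
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("ancestry weights must sum to 1")
    cross = cov @ weights                               # covariance with each existing population
    variance = weights @ cov @ weights + private_drift  # variance of the admixed population
    top = np.hstack([cov, cross[:, None]])
    bottom = np.append(cross, variance)[None, :]
    return np.vstack([top, bottom])

# 30% of ancestry from population 0 and 70% from population 2.
W_MIXED = add_admixed_population(W_PARENTS, [0.3, 0.0, 0.7], private_drift=0.01)
print(np.round(W_MIXED, 4))
```

The admixed population ends up with non-zero covariance with both sides of the tree, which is the pattern that a strictly bifurcating tree cannot reproduce.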

Additionally, there is a choice of whether the edge from

With these simplifications, the variance of

As described above,

In practice, the multivariate normal model in

Now assume that we have genotyped

We now want to write down a likelihood for

Finally, we wanted to define measures for how well the model fits the data. First, we define the matrix of residuals in this model,

We implemented an algorithm, called

After building the tree, we fix the position of the root. (In the tree model, the position of the root is not identifiable, because the evolution of allele frequencies along the tree is reversible under the Gaussian model when drift is assumed to be small. In a graph model the position of the root is partially identifiable; nevertheless, in all applications we assume that it is fixed using prior information about known outgroups.) We then calculate the residual covariance matrix,

After finding the single migration edge that most increases the likelihood, we attempt a series of local changes to the graph structure (Methods). We then iterate over this procedure to add additional migration edges. In principle, migration edges could be added until they are no longer statistically significant (see the following paragraph). In practice, however, we prefer to stop adding migration events well before this point so that the resulting graph remains interpretable.

After building the maximum likelihood graph, we would like to quantify our uncertainty in the resulting graph structure. In particular, we would like to quantify our confidence in individual migration events. However, because the likelihood in

Consider a given migration edge, with corresponding weight

We tested the performance of the

A. The basic outline of the demographic model used. B. Trees inferred by

First, we tested the performance of the algorithm on truly tree-like data. We generated 100 independent simulations of 20 chromosomes from each population using the above demographic model without migration, and inferred population trees. The inferred trees perfectly matched the simulated model in all cases (

We used these simulations without migration to test the calibration of our p-values for migration events. For each simulation, after building the maximum likelihood tree, we introduced a migration event between two random populations and tested it for significance. As expected if the p-values are properly calibrated, their distribution is approximately uniform (

Finally, we performed tree simulations in a situation where fixed differences and new mutations (rather than shared polymorphisms inherited from a common ancestor) were common between the populations; in this context the population genetic interpretation of the model breaks down. We performed simulations where all the true branch lengths were 50 times longer than in the original model, corresponding to a history where the 20 populations share a common ancestor approximately 100,000 generations in the past. Again, we see no errors in the topology of the inferred trees (Figures S5, S6). In this situation, the covariances between closely-related populations tend to be slightly underestimated; in more extreme situations this could lead to spurious inferences of migration (Figures S5, S6). However, overall, these simulations suggest that the model will still be useful even in situations where the population genetic interpretation is not strictly correct.

We then introduced migration events into our simulations. We generated simulations under the same model described above; however, we now simulated an admixture event approximately 100 generations before the present where one population receives a fraction of its ancestry (either 10% or 30%) from one of the other populations. We tried ten different pairwise combinations of populations, and generated 100 simulations for each pair. For each simulation, we ran

We next asked whether the mixture “weights” inferred in the model can be interpreted as admixture proportions. To do this, we simulated admixture events of varying proportion between the first and tenth population in the serial bottleneck model described above, set the graph to the true topology, and estimated the mixture weight. The weights are indeed correlated with the true ancestry fraction, but underestimate relatively high admixture proportions in these simulations (

To test the performance of the

A. Maximum likelihood tree. Plotted is the maximum-likelihood tree. Populations are colored according to geographic location (black: archaic humans, red: Africa, brown: Middle East, green: Europe, blue: Central Asia, purple: America, orange: East Asia). The scale bar shows ten times the average standard error of the entries in the sample covariance matrix (

We then sequentially added migration events to the tree. In

Plotted is the structure of the graph inferred by

Several additional migration events in the human data have not been previously examined in detail, but are consistent with previous clustering analysis of these populations

Two inferred edges were unexpected. First, perhaps the most surprising inference is that Cambodians trace about 16% of their ancestry to a population equally related to both Europeans and other East Asians (while the remaining 84% of their ancestry is related to other southeast Asians). This is partially consistent with clustering analyses, which indicate shared ancestry between Cambodians and central Asian populations

Finally, we infer an admixture edge from the Middle East (a population related to the Mozabite, a Berber population from northern Africa) to southern European populations (

To test the robustness of our results to SNP ascertainment, we additionally ran

While human populations have been extensively studied, we next applied the model to dogs, a species where considerably less is known about population history. In particular, we applied the model to a dataset consisting of about 60,000 SNPs genotyped in 82 dog breeds or wild canids

A. Maximum likelihood tree. Populations are colored according to breed type. Dark blue: wild canids, grey: ancient breeds, brown: spitz breeds, black: toy dogs, red: spaniels, maroon: scent hounds, dark red: working dogs, light green: herding dogs, light blue: mastiff-like dogs, purple: small terriers, orange: retrievers, dark green: sight hounds. The scale bar shows ten times the average standard error of the entries in the sample covariance matrix (

We sequentially added migration events to the tree in

Plotted is the structure of the graph inferred by

We infer that the bull mastiff is the result of an admixture event between bulldogs and mastiffs. This is a known event

The most visually apparent residuals in

Another breed that stands out in this analysis is the boxer (note that many of the SNPs used in this study were ascertained using a boxer individual, so we may have increased power to identify migration events involving this breed). We infer a significant genetic contribution from wolves to the boxer (

Previous analyses of these data have noted that the “toy breeds” of dog cluster together (vonHoldt et al., 2010). We find that the Chinese toy breeds (the Pekingese and the Shih Tzu) result from admixture between a population related to ancient East Asian dog breeds and a modern population related to the Brussels griffon and the pug (

Finally, we noticed that two of the sight hounds (the Borzoi and the Italian greyhound) do not cluster with the other sight hounds in the tree, namely the greyhound, whippet and Irish wolfhound (

Overall, we conclude that there has been considerable gene flow between dog breeds over the course of domestication; there are many additional migration events that merit further examination (

In this paper, we have developed a unified model for inferring patterns of population splits and mixtures from genome-wide allele frequency data. We have shown that this model is accurate in simulations, largely recapitulates the known relationships between well-studied human populations, and is able to identify new relationships between populations in both humans and dogs.

The

There are a number of assumptions, both implicit and explicit, in the interpretation of the

We have also modeled migration between populations as occurring at single, instantaneous time points. This is, of course, a dramatic simplification of the migration process. This model will work best when gene flow between populations is restricted to a relatively short time period. Situations of continuous migration violate this assumption and lead to unclear results (

We also rely on the implicit assumption that the history of the species being analyzed is largely tree-like. We have made this assumption to simplify the search for the maximum likelihood graph; additionally, we speculate that in graphs with complex structure, there will be many graphs that lead to identical covariance matrices, and thus several different histories will be compatible with the data. That said, improvements to the search algorithm could allow the assumption of approximate treeness to be somewhat relaxed. Currently, if the number of admixed populations is large relative to the number of unadmixed populations, this assumption breaks down. For example, in the human data, note that we see no evidence of the documented gene flow from Neandertals to all non-African populations

A number of extensions to the sort of model described here are of potential interest. First, the historical relationships between populations could be useful as null demographic models for the detection of natural selection

As described in the Results, we developed an algorithm called

To search the space of possible graphs, we take a hill-climbing approach. We start by finding a local optimum tree, taking an algorithmic approach similar to Felsenstein

After adding all populations, we calculate the residual covariance matrix,

The

We implemented three- and four-population tests as described in Reich et al.
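For orientation, the three- and four-population statistics can be sketched as follows. This is a minimal illustration on data simulated under a Gaussian drift model, not the implementation used here: it omits the finite-sample bias correction for the three-population statistic, uses an unweighted delete-one-block jackknife with equal-sized blocks, and all population names, drift values, and block sizes are hypothetical.

```python
import numpy as np

def f3(target, ref_a, ref_b):
    """Per-SNP three-population statistic (target; ref_a, ref_b), without bias correction."""
    return (target - ref_a) * (target - ref_b)

def f4(a, b, c, d):
    """Per-SNP four-population statistic (a, b; c, d)."""
    return (a - b) * (c - d)

def block_jackknife(per_snp, block_size):
    """Mean of per-SNP values and its delete-one-block jackknife standard error."""
    n_blocks = len(per_snp) // block_size
    blocks = np.asarray(per_snp[: n_blocks * block_size]).reshape(n_blocks, block_size)
    estimate = blocks.mean()
    leave_one_out = (blocks.sum() - blocks.sum(axis=1)) / (blocks.size - block_size)
    spread = np.sum((leave_one_out - leave_one_out.mean()) ** 2)
    std_err = np.sqrt((n_blocks - 1) / n_blocks * spread)
    return estimate, std_err

# Simulate frequencies on the tree ((A, B), (C, D)) plus a 50/50 mixture of A and C.
# Frequencies are not clipped to [0, 1]; this is a Gaussian toy model only.
rng = np.random.default_rng(1)
n_snps = 20000
root = rng.uniform(0.2, 0.8, n_snps)
scale = np.sqrt(root * (1 - root))

def drift(parent, c):
    """Add Gaussian drift with parameter c, scaled by the root heterozygosity term."""
    return parent + np.sqrt(c) * scale * rng.standard_normal(n_snps)

anc_ab, anc_cd = drift(root, 0.02), drift(root, 0.02)
pop_a, pop_b = drift(anc_ab, 0.01), drift(anc_ab, 0.01)
pop_c, pop_d = drift(anc_cd, 0.01), drift(anc_cd, 0.01)
mixed = drift(0.5 * pop_a + 0.5 * pop_c, 0.005)

f4_est, f4_se = block_jackknife(f4(pop_a, pop_b, pop_c, pop_d), block_size=200)
f3_est, f3_se = block_jackknife(f3(mixed, pop_a, pop_c), block_size=200)
print(f"f4 z-score (tree-like, expected near 0): {f4_est / f4_se:.2f}")
print(f"f3 z-score (admixed target, expected negative): {f3_est / f3_se:.2f}")
```

On tree-like data the four-population statistic should not differ significantly from zero, whereas a significantly negative three-population statistic indicates that the target population is admixed.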

For the four-population test for treeness, we calculate the

The human data we used were downloaded from

Since we have only a single allele from the Neandertal and Denisova populations, we cannot calculate heterozygosity in these populations for unbiased estimation of the covariance matrix. To account for this, we simply chose a relatively low level of heterozygosity and assigned it to both populations. In the Yoruba ascertained SNPs, we used a heterozygosity of 0.13, and for the French ascertained SNPs, we used a heterozygosity of 0.2. In practice, this only affected the lengths of the terminal branches to Neandertal and Denisova; running

Allele counts for the dog breeds and wild canids reported in Boyko et al. were downloaded from

The ascertainment scheme used for SNP discovery in dogs was complicated

All simulations were performed using

A graph with a mixture event. Capital letters represent nodes, branch length parameters are in blue, and weight parameters are in red.

(PDF)

Replicates of inferred trees from simulated data. We generated tree-like data using the topology in

(PDF)

Inferred trees on ascertained data. We generated tree-like data using the topology in

(PDF)

Histogram of p-values for migration in simulated data. We generated 100 tree-like datasets using the topology in

(PDF)

Consensus tree in simulations with long branches. We generated 100 tree-like datasets using the topology in

(PDF)

Example trees from simulations with long branches. We generated 100 tree-like datasets using the topology in

(PDF)

Representative errors in simulations. We examined the simulations in which

(PDF)

Replicate graphs inferred in the human data. These graphs were generated in the same manner as

(PDF)

Residual fit from graph of human data presented in the main text. Plotted are the residuals from the fit of the graph presented in

(PDF)

Graph inferred from SNPs ascertained in a single French individual. The graph was generated in the same manner as

(PDF)

Trees inferred using the human data including the Oceanians. We show the maximum likelihood trees and residuals for the human data including the Oceanian populations, plotted in the same manner as in

(PDF)

Graphs inferred using the human data including the Oceanians. We show the maximum likelihood graphs for the human data including the Oceanian populations, plotted in the same manner as in

(PDF)

Residual fit from graph of dog data presented in the main text. Plotted are the residuals from the fit of the graph presented in

(PDF)

Supplementary Information.

(PDF)

We thank three anonymous reviewers, David Reich, Nick Patterson, Graham Coop, Peter Ralph, Daniel Falush, and Daniel Lawson for helpful comments and suggestions.
