
JH conceived and designed the model and analyses, selected the datasets, wrote the computer programs, performed the analyses, and wrote the paper.

The author has declared that no competing interests exist.

The founding of New World populations by Asian peoples is the focus of considerable archaeological and genetic research, and there persist important questions on when and how these events occurred. Genetic data offer great potential for the study of human population history, but there are significant challenges in discerning distinct demographic processes. A new method for the study of diverging populations was applied to questions on the founding and history of Amerind-speaking Native American populations. The model permits estimation of founding population sizes, changes in population size, time of population formation, and gene flow. Analyses of data from nine loci are consistent with the general portrait that has emerged from archaeological and other kinds of evidence. The estimated effective size of the founding population for the New World is fewer than 80 individuals, approximately 1% of the effective size of the estimated ancestral Asian population. By adding a splitting parameter to population divergence models it becomes possible to develop detailed portraits of human demographic history. Analyses of Asian and New World data support a model of a recent founding of the New World by a population of quite small effective size.

A new population-genetic method for assessing human demographic history reveals that the founding population of the New World had an effective size of fewer than 80 individuals.

Archaeological evidence, as well as anatomical, linguistic, and genetic evidence, has shown that the original human inhabitants of the Western Hemisphere arrived from Asia during the Late Pleistocene [

For complex historical subjects such as the colonization of the Americas, there are many ways that models can be constructed, examined, and compared. One approach is to develop a portrait based on a particular kind of data, such as linguistic [

To accommodate the stochastic variance among loci, population geneticists have turned in recent years to Bayesian and likelihood methods that explicitly take into account the range of possible gene tree histories that are consistent with a given dataset [

Population 1 begins with size sN_{A}, where s is the fraction of N_{A} that founds population 1, from which it moves to size N_{1} at the time of sampling. Similarly, population 2 begins with size (1 − s)N_{A}, from which it moves to size N_{2} at the time of sampling.

(A) The basic IM model. The demographic terms are effective population sizes (N_{1}, N_{2}, and N_{A}), gene flow rates (m_{1} and m_{2}), and population splitting time (t). Also shown are parameters scaled by the neutral mutation rate (u). Note that in the coalescent, where time runs backward, the migration rate from population 1 to population 2 (i.e., m_{1}) actually corresponds to the movement of genes from population 2 to population 1 as time moves forward.

(B) The IM model with changing population size. An additional parameter, s, is the fraction of N_{A} that forms N_{1} (i.e., the fraction 1 − s forms N_{2}).

These models were applied to questions on the founding of New World populations from Asia. A total of nine DNA sequence datasets that included Asian and Native American (Amerind-speaking) samples were drawn from the literature (F_{ST} values (between Asian and New World samples) observed among the loci. Of the nine loci included in the present study, three have fairly high F_{ST} values, while the remainder are either negative (undefined) or near zero (

In some cases locations are based on actual geographic locations, in other cases the locations are the approximate center of the geographic region occupied by ethnic groups identified in the original references (

Asian samples were arbitrarily designated as being from population 1 and the New World samples from population 2. In this case, 1 −

The overall picture that emerges is one in which the New World was very recently founded by a small number of individuals (effective size of about 70), and then grew by a factor of about 10. The data do suggest that there has been gene exchange between Asia and the New World since that time; however, the likelihood surfaces are quite flat, so confidence in gene flow estimates is low.

The method assumes that the loci have not been subject to recombination or to directional or balancing selection. For recombination, we used only those loci that showed no evidence of recombination by the four-gamete test [
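The four-gamete screen mentioned above can be sketched as follows. This is a minimal illustration, not the procedure actually used to select the loci: the function name and the representation of haplotypes as equal-length strings are assumptions. Under an infinite-sites model without recombination, two biallelic sites can show at most three of the four possible two-site combinations, so any pair showing all four is flagged.

```python
from itertools import combinations


def four_gamete_violations(haplotypes):
    """Return pairs of site indices at which all four two-site gametes occur.

    `haplotypes` is a list of equal-length strings, one per sampled sequence
    (a hypothetical input format chosen for this sketch).
    """
    n_sites = len(haplotypes[0])
    violations = []
    for i, j in combinations(range(n_sites), 2):
        alleles_i = {h[i] for h in haplotypes}
        alleles_j = {h[j] for h in haplotypes}
        # The test is only defined for pairs of biallelic sites.
        if len(alleles_i) != 2 or len(alleles_j) != 2:
            continue
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            violations.append((i, j))
    return violations
```

A locus would pass this screen when the returned list is empty.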

See Dataset S1 and Protocol S1 for more detail.

^{a} The inheritance scalar was set to reflect the expected effective population size experienced by a locus relative to an autosome, assuming equal sex ratios and variance in reproductive success: autosomal loci, 1.0; X-linked loci, 0.75; maternally or paternally inherited loci, 0.25.

^{b} The percentage of basepairs that differ between a human and a chimpanzee sequence.

^{c} F_{ST} is the proportion of variation that lies between samples pooled for Asia and the New World for each locus [

^{d} Data from [

^{e} Full-length mtDNA sequences were used [

^{f} Concatenated data from several noncoding regions of the nonrecombining portion of the Y chromosomes (NRY) [

^{g} Data from [

^{h} Data from [

^{i} Haplotypes were determined over multiple points across this locus [

^{j} Data from [

The estimated posterior distributions are shown in _{upper}, _{upper}” in

Probability densities for each of the parameters described in _{1}; (B) θ_{2}; (C) θ_{A}; (D) _{1}; and (H) _{2}. The analysis in which a high upper limit on the prior distribution for _{upper}_{upper}

The archaeological portrait of early New World populations has largely centered around widespread Clovis sites that have an earliest estimated age of about 13,000 y before the present [

With regard to migration, each of the three analyses shows nonzero peaks for both directions of gene flow. These may well reflect the occurrence of more than one episode of migration to the New World. For example, it has been suggested on the basis of mitochondrial DNA haplotypes and glaciation history that an initial migration along a coastal route may have been followed later by another migration, possibly through an ice-free noncoastal corridor [m_{1} and m_{2} are broad, and all have high probability at the lower limit of resolution, indicating that zero gene flow is nearly as well supported by the data as are nonzero gene flow levels.

The ancestral population parameter, θ_{A}, shows a relatively narrow distribution with a very consistent peak location across the three analyses. These attributes are partly to be expected, given that the very large majority of the variation in the samples is older than _{A} than for the other parameters. The estimated effective size of the ancestral population is about 9,000 (_{1}) revealed broad distributions and estimates that are near those for the ancestral population. Although the estimates of current effective size in Asia vary among the analyses (

Parameter estimates are shown for three models described in the text. For those parameters in which the complete posterior distribution appeared to be estimated, the 90% highest posterior density interval was also determined and given as a range (in parentheses). This range is the shortest interval that contains 90% of the probability.

^{a} The location of the highest value of

NA, not applicable
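The "shortest interval that contains 90% of the probability" can be illustrated with a sample-based sketch. It assumes the posterior is available as a list of sampled values, which may differ from how the intervals in the table were actually computed; the function name is hypothetical.

```python
import math


def hpd_interval(samples, mass=0.90):
    """Shortest interval covering at least `mass` of the sampled values."""
    xs = sorted(samples)
    n = len(xs)
    # Number of consecutive sorted points the interval must span.
    k = max(1, math.ceil(mass * n))
    # Slide a window of k points and keep the narrowest one.
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]
```

For a unimodal posterior this window approximates the highest posterior density interval; for a flat or truncated distribution it is not meaningful, which is consistent with reporting intervals only where the complete distribution appeared to be estimated.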

In contrast to the Asian population, the New World population parameter (θ_{2}) is much smaller, and suggests a recent New World effective population size of less than 1,000 (

The conversion of model parameters to demographic terms is described in “Analyses” in

^{a} The estimated time associated with the highest value of

In order to gain a sense of how consistent the data actually are with the model and the parameter estimates, 500 simulated datasets were generated under the model in

Shown, both within and between populations, are the values of the average number of differences between pairs of sequences.

Exp, expected; Obs, observed
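The summary statistic in this comparison, the average number of differences between pairs of sequences, can be computed as in the sketch below. The function names and the use of aligned, equal-length strings are assumptions made for illustration.

```python
from itertools import combinations, product


def count_differences(a, b):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_pairwise_within(seqs):
    """Average pairwise differences among sequences from one population."""
    pairs = list(combinations(seqs, 2))
    return sum(count_differences(a, b) for a, b in pairs) / len(pairs)


def mean_pairwise_between(seqs1, seqs2):
    """Average pairwise differences between sequences from two populations."""
    pairs = list(product(seqs1, seqs2))
    return sum(count_differences(a, b) for a, b in pairs) / len(pairs)
```

Applying the same functions to each simulated dataset would give the distribution against which the observed values can be compared.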

The method described is one of several new approaches that can glean information about ancestral population sizes [

Taken together, the analyses in this study suggest a recent founding of the New World Amerind-speaking peoples by a small population of effective size near 70, followed by population growth in the New World. It is interesting that the analyses do not suggest much population size change in Asia since the time of the founding of the New World population. Given the very broad distributions for θ_{1}, it is possible that the true value of this parameter is much higher than suggested by the peak location, and that there has been considerable population growth in Asia. The analyses reveal very broad distributions for migration parameters, and although the peak locations suggest that gene flow has been fairly high (2Nm values greater than 1; see

As parameter-rich as the method is, neither this nor any mathematical model can be expected to fully represent the complex history of two related populations. However, the same is essentially true of narrative models, as investigators are always constrained by limited data and the need to keep explanations as simple as possible given their data. In this light, the IM model provides a fairly complete framework for some oft-debated questions on human history. With the addition of a new parameter, the IM framework can now also be used to address questions about the founding size of populations and of population size change.

In the context of human demographic history, the most problematic assumption under the IM model is that each population is panmictic. Certainly this is not the case today, and it is likely to have been even less true in times past. This raises the general and important question of how local patterns of population structure affect regional or global estimates of diversity [_{A} and N_{1}, respectively. However, the generally low estimates of effective population size argue against this particular kind of population structure.

The analyses presented here share with some other genetic studies estimated dates for the peopling of the Americas that are more recent than archeologically based estimates [

The available data do not yet allow precise estimates of founding time or of whether there has been gene flow between the New World and Asia following the initial founding event. However, the new method implements a parameter-rich model of divergence and has the potential to recover the history of complex divergence processes. The method can also be applied to a large number of loci, with large sample sizes, and in the future can be expected to provide increasingly detailed portraits of human population divergence.

Given the prevailing model of the founding of New World populations via a Bering land bridge, the descendant populations were defined as the Amerind speakers of the New World and the peoples of northeastern Asia. Greenberg et al. [

The model fitting requires data from loci that do not show evidence of recombination and that do not show clear evidence of directional or balancing selection. All available datasets from the literature that met these criteria and that had multiple DNA sequences from both of the designated sample regions were selected. The selected loci are listed in

At the center of the method for estimating the parameters is an expression for the posterior probability distribution of model parameters Θ, given the data. For the case of multiple loci

where Θ refers to the vector of parameters of the model, _{i}_{i} is the genealogy for locus _{i}|_{i}) is calculated using the mutation model for that locus and the branch lengths in the genealogy. The probability _{i}|Θ) is calculated using expressions from basic coalescent theory [
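For orientation, the multilocus posterior described here can be written in the following general form. The notation is an assumption of this sketch: X_i denotes the data for locus i, G_i the genealogy for locus i, n the number of loci, and P a generic probability or density. Loci are treated as independent given Θ, and each genealogy is integrated out.

```latex
P(\Theta \mid X) \;\propto\; P(\Theta) \prod_{i=1}^{n} \int P(X_i \mid G_i)\, P(G_i \mid \Theta)\, dG_i
```

The first factor inside the integral depends on the mutation model and branch lengths, and the second on the coalescent model, which matches the division of labor described in the text.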

The integration in

Over the course of a simulation the genealogy for a given locus varies for topology, branch lengths, and migration times. However, the probability of the data for a locus given a particular genealogy, _{i}|_{i}), depends only upon the branch lengths and the mutation model for that locus [_{i}|_{i}) is not a function of _{i}|Θ), depends strongly on

The calculation of _{i}|Θ) is most directly done by taking the product of the probabilities of each of the coalescent and migration events that occur in the genealogy. Griffiths and Tavare [_{τ}/N_{0} of the population size at time τ, relative to that at time 0, they provide a general expression for the distribution of coalescent times. For population 1, the effective size goes from N_{1} at time zero, to _{A} at time t. If it is assumed that the size change is exponential over this period, then for population 1,

and for population 2,
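Under the stated assumption of exponential change between time 0 and time t, the two relative size functions follow directly from the endpoint sizes. In this sketch, s is written for the fraction of N_{A} that founds population 1 and ν for the ratio of the population size at time τ to that at time 0; both symbols are assumptions of the reconstruction rather than confirmed notation.

```latex
\nu_1(\tau) = \frac{N_1(\tau)}{N_1} = \left(\frac{s\,N_A}{N_1}\right)^{\tau/t},
\qquad
\nu_2(\tau) = \frac{N_2(\tau)}{N_2} = \left(\frac{(1-s)\,N_A}{N_2}\right)^{\tau/t},
\qquad 0 \le \tau \le t .
```

Each function equals 1 at τ = 0 and reaches the founding size relative to the present size at τ = t.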

One additional complication that arises is that when the population is growing exponentially back into the past (decreasing in size as time moves forward), there is a finite probability that the time to coalescence will be infinity [_{A} is less than N_{1}, it is necessary to calculate the probability of coalescence time conditioned on there being a coalescent event.
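The effect of exponential size change on coalescence can be made concrete with a short calculation. This is a sketch under stated assumptions: a single pair of lineages in one diploid population, time measured in generations, a pairwise coalescence rate of 1/(2N), and hypothetical function and argument names. With r the ratio of the founding size to the present size, the integrated rate over [0, τ] is t(1 − r^{−τ/t}) / (2N ln r). If such a trend with r > 1 were extended indefinitely into the past, this integral would converge to t / (2N ln r), which is why a coalescence time can be infinite with positive probability.

```python
import math


def cumulative_coalescent_intensity(tau, n_now, n_at_split, t_split):
    """Integrated pairwise coalescence rate from time 0 back to time tau.

    n_now      -- effective size at the time of sampling (e.g., N_1)
    n_at_split -- effective size at the splitting time (e.g., s * N_A)
    t_split    -- splitting time in generations
    """
    if not 0.0 <= tau <= t_split:
        raise ValueError("tau must lie between 0 and t_split")
    log_ratio = math.log(n_at_split / n_now)
    if math.isclose(log_ratio, 0.0, abs_tol=1e-12):
        # Constant size: the rate is 1 / (2N) throughout.
        return tau / (2.0 * n_now)
    # Closed-form integral of 1 / (2 N(u)) with N(u) = n_now * ratio**(u / t).
    decay = -math.expm1(-log_ratio * tau / t_split)
    return t_split * decay / (2.0 * n_now * log_ratio)


def prob_no_coalescence(tau, n_now, n_at_split, t_split):
    """Probability that a pair of lineages has not coalesced by time tau."""
    return math.exp(
        -cumulative_coalescent_intensity(tau, n_now, n_at_split, t_split)
    )
```

When the founding size is much smaller than the present size, the intensity accumulates rapidly near the splitting time, so most coalescent events within the population are expected close to the founding.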

Migration under an exponentially changing population size can also be incorporated under this same framework with two changes. First, unlike coalescence, where the rate is inversely proportional to population size, the rate of migration is directly proportional to population size. Second, as time goes backward in the coalescent, the migration rate from population 1 to population 2 (i.e., m_{1}) actually corresponds to the movement of genes from population 2 to population 1 as time moves forward. This means that in the coalescent under changing population size, we expect that the migration rate from population 1 to 2 will vary with the size of population 2. Thus the corresponding relative rate function for migration from population 2 to population 1 is

and for migration in the reverse direction it is

These intensity functions for coalescence and migration were used to develop an expression for _{i}|Θ) that includes

where _{i} is the current genealogy for locus

For genealogy updates the same proposal distribution of genealogies that was used in the case without _{i}|_{i}) is the Hastings term for the proposal probability of the genealogy for locus

The IM computer program [_{1} = 10; θ_{2} = 10; θ_{A} = 10; _{1} = 0.04; and _{2} = 0.1. For each parameter, the mean of the 20 estimates is shown, and in general these are fairly close to the true value, though there is considerable variance for the peak locations in individual runs. To test whether the locations of these distributions are consistent with the true values of the parameters (i.e., the values used in the simulations), probabilities were combined by treating each simulation as an independent test of the same hypothesis [_{i}, i_{i}_{i}_{i}

The input parameters for the simulations were as follows: (A) θ_{1} = 10; (B) θ_{2} = 10; (C) θ_{A} = 10; (D) _{1}= 0.04; (G) _{2}= 0.2 ; and _{A} = 0.5). For each simulated dataset, coalescent simulations were done for each of 20 loci with identical mutation rates under an infinite sites mutation model, each with sample sizes of 10 for each of the two populations. Each simulated dataset was analyzed using wide uniform prior distributions for each parameter. Each analysis began with a burn-in period of 300,000 steps followed by a primary chain of 3 million to 10 million steps. The curves for parameters θ_{1} through _{2}

is χ^{2} distributed with 40 degrees of freedom (i.e., two times the number of densities). The _{1}, 35.5; θ_{2}, 26.4; θ_{A}, 41.7; _{1}, 29.9; _{2}, 44.1; and the mean of the seven values was 35.0. In the corresponding χ^{2} distribution, 90% of the probability mass falls above 29.05; 50% falls above 39.3; and 10% falls above 51.8 [^{2} distribution with a mean (35.0) close to the 50% point of the χ^{2} distribution (39.3).
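The reference points quoted for the χ^{2} distribution with 40 degrees of freedom can be checked with a few lines of code. The sketch assumes that the combined statistic is Fisher's −2 Σ ln p_{i} over the 20 simulated datasets, which matches the stated degrees of freedom but is not spelled out in the surviving text; the function names are hypothetical. For even degrees of freedom the upper-tail probability has a closed form, so no statistics library is needed.

```python
import math


def fisher_statistic(p_values):
    """Fisher's combined statistic, -2 * sum(ln p), for independent tests."""
    return -2.0 * sum(math.log(p) for p in p_values)


def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of df.

    Uses the identity P(X > x) = exp(-x/2) * sum_{j<k} (x/2)^j / j!,
    where k = df / 2.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total
```

Calling `chi2_upper_tail_even_df(value, 40)` on one of the reported statistics gives the probability of observing a value at least that large if the 20 probabilities were uniformly distributed.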

From these simulations, and many others (additional results provided in

Each of the three analyses was done using at least three independent runs, with ten or more independent chains under Metropolis coupling [

To convert estimates of parameters that include the mutation rate to more easily interpreted units, a value of 6 million y since the splitting of human and chimpanzee lineages was used [^{−6} mutations per year. The geometric mean is used rather than an arithmetic mean, because under the multilocus model, the mutation rate by which demographic parameters are scaled is the geometric mean of the individual locus-specific mutation rates [

To convert the estimates of the population mutation rate parameters (θ_{1}, θ_{2}, and θ_{A}) to estimates of effective population size (N_{1}, N_{2}, and N_{A}, respectively) a measure of mutation rate on a scale of generations is needed. Thus, an assumption was made of 20 y per generation, and the geometric mean divergence between humans and chimpanzees for each species contrast was divided by 12 million y then multiplied by 20 y per generation. These calculations yielded a geometric mean value of 9.32 × 10^{−5} mutations per generation. These mutation rate values were then used to convert individual θ estimates to effective population size estimates (i.e., θ = 4Nu, and N = θ/4u).
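The conversion just described amounts to the arithmetic below. The mutation rate and generation time are the values stated in the text; the θ value and the founding fraction used in the usage lines are hypothetical numbers chosen only to illustrate the scale of the results, and the function names are assumptions.

```python
U_PER_GENERATION = 9.32e-5   # geometric mean mutation rate per generation (from the text)
YEARS_PER_GENERATION = 20    # assumed generation time (from the text)


def mutation_rate_per_year(u_per_generation=U_PER_GENERATION,
                           years_per_generation=YEARS_PER_GENERATION):
    """Mutation rate on a scale of years."""
    return u_per_generation / years_per_generation


def theta_to_effective_size(theta, u_per_generation=U_PER_GENERATION):
    """Invert theta = 4 * N * u to obtain the effective population size N."""
    return theta / (4.0 * u_per_generation)


def founding_size(fraction, ancestral_size):
    """Effective founding size as a fraction of the ancestral effective size."""
    return fraction * ancestral_size


# Hypothetical illustration: a theta near 3.36 corresponds to N of about 9,000.
n_ancestral = theta_to_effective_size(3.3552)
# Hypothetical founding fraction of 0.8% of the ancestral population.
n_founders = founding_size(0.008, n_ancestral)
```

Because N is inversely proportional to the assumed mutation rate, any error in the divergence time or generation time rescales every size estimate by the same factor, while ratios such as the founding fraction are unaffected.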

Migration parameters in the model can be used to obtain population migration rate estimates (i.e.,

This is the input file that contains all of the data and that was analyzed using the IM computer program.

(582 KB TXT).

(92 KB DOC).

John Wakeley, Tad Schurr, and David Meltzer provided input on an early draft of the paper. Rasmus Nielsen provided some helpful suggestions on parameter updating. Thanks also to three reviewers for very helpful suggestions and critique.

IM, isolation with migration