Figure 1.
Populations Included in This Study
The world map shows the 78 populations investigated in the combined dataset, with the locations of the 29 populations studied in the Americas shown in detail in the larger map. The 25 newly examined populations, including the Siberian Tundra Nentsi, are marked in red, and the previously genotyped HGDP-CEPH populations are marked in yellow.
Table 1.
Heterozygosity and FST (×100) for Various Geographic Regions
Figure 2.
The Mean and Standard Error Across 678 Loci of the Number of Distinct Alleles as a Function of the Number of Sampled Chromosomes
(A) Geographic regions worldwide. (B) Subregions within the Americas. For a given locus, region, and sample size g, the number of distinct alleles averaged over all possible subsamples of g chromosomes from the given region is computed according to the rarefaction method [24,25]. For each sample size g, loci were considered only if their sample sizes were at least g in each geographic region. Error bars denote the standard error of the mean across loci.
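As an illustration of the rarefaction calculation described in this caption, the following minimal sketch computes the expected number of distinct alleles in a random subsample of g chromosomes at a single locus. It uses the standard hypergeometric form of rarefaction; the function name and the example allele counts are assumptions made for illustration and are not taken from the study's data or software.

```python
from math import comb


def allelic_richness(allele_counts, g):
    """Expected number of distinct alleles among g chromosomes drawn
    without replacement from a sample with the given allele counts.

    allele_counts: list of observed counts, one entry per allele (illustrative).
    g: subsample size; must not exceed the total number of chromosomes.
    """
    n = sum(allele_counts)
    if g > n:
        raise ValueError("subsample size g exceeds the number of sampled chromosomes")
    total = comb(n, g)
    # An allele with count c is absent from the subsample with
    # probability C(n - c, g) / C(n, g); it is present otherwise.
    return sum(1 - comb(n - c, g) / total for c in allele_counts)


# Hypothetical locus: two alleles observed 3 times and 1 time.
richness_at_g2 = allelic_richness([3, 1], 2)
```

Averaging this quantity over loci, and taking the standard error across loci, would give curves of the kind plotted in the figure.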
Table 2.
Heterozygosity for Newly Sampled Populations and for the Five Previously Sampled Native American Populations (Pima, Maya, Piapoco, Karitiana, and Surui)
Figure 3.
Heterozygosity in Relation to Geography
(A) Relationship between heterozygosity and geographic distance from East Africa. Populations in Sub-Saharan Africa and Oceania are marked with gray triangles and squares, respectively, and the remaining non-American populations from Europe, Asia, and northern Africa are marked with gray pentagons. Within the Americas, populations are color-coded and symbol-coded by language stock (see Figure 8). Denoting heterozygosity by H and geographic distance in thousands of kilometers by D, the regression line for the graph is H = 0.7679 − 0.00658D, with correlation coefficient −0.862. (B) The fit of a linear decline of heterozygosity with increasing distance from a putative source, considering Native American populations only. The color of a point indicates a correlation coefficient r between expected heterozygosity and geographic distance from the point, with darker colors denoting more strongly negative correlations. Across the Americas, the correlation ranges from −0.436 to 0.575, and color bins are set to equalize the number of points drawn in the four colors. From darkest to lightest, the four colors represent points with correlations in (−0.436, −0.424), (−0.424, −0.316), (−0.316, 0.494), and (0.494, 0.575), respectively.
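The regression in panel (A) can be sketched as an ordinary least-squares fit of H on D. In the snippet below, only the coefficients 0.7679 and 0.00658 come from the caption; the function names are hypothetical, and the example distances are arbitrary values placed exactly on the reported line, not observed data.

```python
from math import sqrt


def caption_line(d_thousand_km):
    """Heterozygosity predicted by the regression line reported in the caption."""
    return 0.7679 - 0.00658 * d_thousand_km


def linear_fit(x, y):
    """Ordinary least-squares fit y = intercept + slope * x.

    Returns (intercept, slope, r), where r is the Pearson correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r


# Arbitrary distances (thousands of km) with values generated from the line itself.
distances = [0.0, 5.0, 10.0, 20.0]
heterozygosities = [caption_line(d) for d in distances]
intercept, slope, r = linear_fit(distances, heterozygosities)
```

Because the example points lie exactly on the line, the fit recovers the caption's coefficients and r is −1; real population data scatter around the line, which is why the reported correlation is −0.862.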
Figure 4.
Heterozygosity and Least-Cost Paths in a Coastal Migration Scenario
(A) R² (square of the correlation) between heterozygosity H and effective geographic distance (least-cost distance), assuming differential permeability of coastal regions compared to inland regions. Correlations significant at the 0.05 level are indicated by closed symbols, and those that are not significant are indicated by open symbols. (B) Least-cost routes for the scenario with 1:10 coastal/inland cost ratio.
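To illustrate what a least-cost distance means under a 1:10 coastal/inland cost ratio, the toy sketch below runs Dijkstra's algorithm on a small grid in which coastal cells cost 1 and inland cells cost 10. The grid, the 4-connected neighborhood, and the rule that a step costs the mean of the two cell costs are all assumptions chosen for illustration; the study's actual geographic grid and software are not reproduced here.

```python
import heapq

COAST, INLAND = 1.0, 10.0  # 1:10 coastal/inland cost ratio, as in panel (B)


def least_cost_distance(cost, start, goal):
    """Dijkstra's algorithm on a 4-connected grid of per-cell costs.

    Moving between adjacent cells costs the mean of the two cell costs
    (an illustrative convention, not necessarily the one used in the study).
    """
    rows, cols = len(cost), len(cost[0])
    best = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        dist, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            return dist
        if dist > best.get((r, c), float("inf")):
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                new_dist = dist + (cost[r][c] + cost[nr][nc]) / 2
                if new_dist < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_dist
                    heapq.heappush(heap, (new_dist, (nr, nc)))
    return float("inf")


# Hypothetical 3x3 landscape: an inland block bordered by coastal cells.
grid = [
    [COAST, INLAND, COAST],
    [COAST, INLAND, COAST],
    [COAST, COAST, COAST],
]
coastal_route_cost = least_cost_distance(grid, (0, 0), (0, 2))
```

In this toy landscape the cheapest route between the two top corners detours along the coast (cost 6) rather than crossing the inland cell directly (cost 11), which is the qualitative behavior a coastal migration scenario is meant to capture.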
Figure 5.
Unsupervised Analysis of Worldwide Population Structure
The number of clusters in a given plot is indicated by the value of K. Individuals are represented as thin vertical lines partitioned into segments corresponding to their membership in genetic clusters indicated by the colors.
Figure 6.
Supervised Population Structure Analysis, Using Five Clusters, Four of Which Were Forced to Correspond to Africans, Europeans, East Asians Excluding Siberians, and Siberians
Figure 7.
Unsupervised Analysis of Native American Population Structure
The colored plots at the left show the estimated population structure of Native Americans, obtained using STRUCTURE. The number of clusters in a given plot is indicated by the value of K on the right side of the figure. Next to the K = 7 plot, the population names and the major language stocks of the populations are also displayed. The left-to-right order of the individuals is the same in all plots. The diagram on the right summarizes the outcomes of 100 replicate STRUCTURE runs for each of several values of K. Each row represents a value of K, and within each row, each box represents a clustering solution that appeared at least 12 times in 100 replicates (see Methods). The number of appearances of a solution is listed above the box, and the boxes are arrayed from left to right in decreasing order of the frequencies of the solutions to which they correspond. The DISTRUCT plot shown on the left corresponds to the leftmost box on the right side of the figure. An approximate description of the clusters is located inside the box, with each row in the box representing a different cluster. The numbers 1, 2, and 3 are used to refer to the green cluster in the K = 2 DISTRUCT plot, the blue cluster in the K = 2 DISTRUCT plot, and the yellow cluster in the K = 9 DISTRUCT plot, respectively. The following population abbreviations are also used: A, Ache; Arh, Arhuaco; Cab, Cabecar; Chip, Chipewyan; E, Embera; G, Guaymi; K, Karitiana; Kog, Kogi; P, Pima; S, Surui; T, Ticuna (both Ticuna groups combined); W, Waunana. Clusters are indicated using set notation; for example {A} represents a cluster containing Ache only, and 2\{A,S} represents a cluster that corresponds to cluster 2 (the blue cluster for K = 2), excluding Ache and Surui. An asterisk indicates approximately 50% membership of a population in a cluster. 
A line is drawn from a box representing a solution with K clusters to a box representing a solution with K+1 clusters if the solution with K+1 clusters refines the solution with K clusters—that is, if all of the clusters in the solution with K+1 clusters subdivide the clusters in the solution with K clusters. In case of ties for the highest-frequency solution (K = 4 and K = 5), boxes are oriented in order to avoid the crossing of lines between them.
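The refinement rule that determines where lines are drawn can be stated compactly in code. The sketch below treats each clustering solution as a list of sets of population abbreviations; the function name and the example memberships are hypothetical and are not the solutions shown in the figure.

```python
def refines(finer, coarser):
    """Return True if every cluster in `finer` lies entirely inside
    some cluster of `coarser`, i.e. `finer` subdivides `coarser`."""
    return all(any(f <= c for c in coarser) for f in finer)


# Hypothetical solutions using abbreviations from the caption (A, S, K, P).
k2 = [{"A", "S", "K"}, {"P"}]
k3_refining = [{"A"}, {"S", "K"}, {"P"}]      # splits the first K = 2 cluster
k3_not_refining = [{"A", "P"}, {"S"}, {"K"}]  # {A, P} straddles two K = 2 clusters

draw_line = refines(k3_refining, k2)
no_line = refines(k3_not_refining, k2)
```

Under this rule a line would connect the K = 2 box to the first K = 3 solution but not to the second.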
Figure 8.
Neighbor-Joining Tree of Native American Populations
Each language stock is given a color, and if all populations subtended by an edge belong to the same language stock, the clade is given the color that corresponds to that stock. Branch lengths are scaled according to genetic distance, but for ease of visualization, a different scale is used on the left and right sides of the middle tick mark at the bottom of the figure. The tree was rooted along the branch connecting the Siberian populations and the Native American populations, and for convenience, the forced bootstrap score of 100% for this rooting is indicated twice.
Table 3.
Correlation of Genetic and Linguistic Distances
Figure 9.
The Mean and Standard Error Across 678 Loci of the Number of Private Alleles as a Function of the Number of Sampled Chromosomes
For a given locus, region, and sample size g, the number of private alleles in the region—averaging over all possible subsamples that contain g chromosomes each from the five regions—is computed according to an extension of the rarefaction method [25]. For each sample size g, loci were considered only if their sample sizes were at least g in each geographic region. Error bars denote the standard error of the mean across loci.
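A minimal sketch of the private-allele extension of rarefaction follows. For each allele, it multiplies the probability that the allele appears in a subsample of g chromosomes from the focal region by the probabilities that it is absent from subsamples of g chromosomes from every other region, then sums over alleles. The function name, data layout, and example counts are assumptions for illustration only.

```python
from math import comb


def private_allelic_richness(counts_by_region, region, g):
    """Expected number of alleles private to `region` when g chromosomes
    are subsampled from each region at one locus.

    counts_by_region: dict mapping region name -> dict of allele -> count.
    """
    def prob_absent(region_counts, allele):
        n = sum(region_counts.values())
        if g > n:
            raise ValueError("subsample size g exceeds a region's sample size")
        return comb(n - region_counts.get(allele, 0), g) / comb(n, g)

    alleles = set().union(*(c.keys() for c in counts_by_region.values()))
    total = 0.0
    for allele in alleles:
        # Present in the focal region's subsample ...
        p = 1 - prob_absent(counts_by_region[region], allele)
        # ... and absent from every other region's subsample.
        for other, counts in counts_by_region.items():
            if other != region:
                p *= prob_absent(counts, allele)
        total += p
    return total


# Hypothetical two-region locus (the figure itself uses five regions).
example = {"X": {"a": 2, "b": 2}, "Y": {"a": 4}}
private_in_x = private_allelic_richness(example, "X", 2)
private_in_y = private_allelic_richness(example, "Y", 2)
```

In this example allele b is found only in region X, so X has the larger expected number of private alleles; allele a counts as private to Y only in the subsamples of X that happen to miss it.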
Figure 10.
Allele Frequency Distribution at Tetranucleotide Locus D9S1120
For each population, the sizes of the colored bars are proportional to allele frequencies in the population, with alleles color-coded as in the legend. Alleles are ordered from bottom to top by increasing size, with the smallest allele, a Native American private allele of size 275, shown in red, and the largest allele, 315, shown in dark blue.