Admixture in Latin America: Geographic Structure, Phenotypic Diversity and Self-Perception of Ancestry Based on 7,342 Individuals

The current genetic makeup of Latin America has been shaped by a history of extensive admixture between Africans, Europeans and Native Americans, a process taking place within the context of extensive geographic and social stratification. We estimated individual ancestry proportions in a sample of 7,342 subjects ascertained in five countries (Brazil, Chile, Colombia, México and Perú). These individuals were also characterized for a range of physical appearance traits and for self-perception of ancestry. The geographic distribution of admixture proportions in this sample reveals extensive population structure, illustrating the continuing impact of demographic history on the genetic diversity of Latin America. Significant ancestry effects were detected for most phenotypes studied. However, ancestry generally explains only a modest proportion of total phenotypic variation. Genetically estimated and self-perceived ancestry correlate significantly, but certain physical attributes have a strong impact on self-perception and bias self-perception of ancestry relative to genetically estimated ancestry.


Introduction
Understanding the basis of variation in human physical appearance has been a topic of long-standing research interest. However, little is known about the genetic basis of most of this variation. An exception is pigmentation, which has been the focus of considerable research, particularly in Europeans [1][2][3][4]. Refining our knowledge of the genetics of physical appearance in human populations is of considerable evolutionary, biomedical and forensic importance. This research is also of broad social interest due to its bearing on debates around notions of self-identity, ethnicity and race.
Latin America provides an advantageous setting in which to examine the impact of genetic variation on physical appearance. The region has a history of extensive admixture between three continental populations: Africans, Europeans and Native Americans [5,6]. Latin America also provides an informative context in which to explore the perception of variation in physical appearance. The region has a unique history relating to the social and cultural politics of ethnicity, race and nation [7][8][9]. A considerable number of genetic studies have examined admixture in Latin America [10][11][12][13][14]. However, these analyses have mostly been based on relatively small samples and focused mainly on describing patterns of variation in admixture proportions between individuals and countries/regions. Few studies have examined the impact of genetic ancestry on physical appearance or the relationship of these to individual notions of ethnicity and ancestry [15,16].
In this paper we present the first phase of a research program focused on the genetics of physical appearance in Latin Americans. We base this program on a sample of over 7,000 individuals ascertained in five countries: Brazil, Chile, Colombia, México and Perú. Information was obtained for a range of sociodemographic variables, physical attributes and self-perception of ancestry. Here we report analyses based on individual mean genome admixture proportions. Coordinate-based spatial analyses illustrate the significant variation in ancestry existing across Latin America, in agreement with demographic history and census information. Significant effects of ancestry were detected for most of the phenotypes examined, and the direction of these effects agrees with the phenotypic differentiation of Africans, Europeans and Native Americans. Finally, we observe that certain phenotypes have a strong impact on self-perception and that these phenotypes bias self-perceived relative to genetically estimated ancestry.

Results
Summary descriptive statistics for the study sample collected are presented in Table 1.

Ancestry estimation
We estimated individual African/European/Native American admixture proportions with data for 30 highly informative SNPs using the ADMIXTURE program [17]. These markers were chosen from the 5,000 proposed by Paschou et al. (2010) [18] as highly informative for continental ancestry estimation (see Methods). The selected set of markers produced individual ancestry estimates in 372 Colombians, included in a recent genome-wide association study [19], with correlations of ~70% (for the three continental ancestries) compared to estimates obtained with 50,000 markers (LD-pruned), and identical sample means. Although we estimated individual ancestry with a relatively small number of markers, we verified that the inferences drawn are robust to the level of uncertainty of the estimates obtained (see below).

Geographic variation of ancestry
Consistent with previous studies, we observe extensive variation in ancestry between countries (Table 1) as well as between individuals within countries (Text S1) and between socioeconomic strata (Text S2) [12,13,20,21,22]. In order to obtain a spatial representation of variation in ancestry we obtained interpolated maps based on the geographic coordinates for the birthplaces of research volunteers. The geographic distribution of these birthplaces (Figure 1 and Figure S4) overlaps with regional population density from national census data (Figure S5). Consistent with this pattern, the number of volunteers for each birthplace correlates with census size for these localities: Brazil (r = 0.32, p-value <10^-5), Chile (r = 0.51, p-value <10^-4), Colombia (r = 0.54, p-value <10^-13), Mexico (r = 0.44, p-value <10^-8), Perú (r = 0.41, p-value <10^-4). Few volunteer birthplaces were thus located in sparsely populated regions (e.g. Amazonia) and geographic interpolation of ancestry in those regions should be regarded with special caution.
The Brazilian sample (Figure 1A) shows widespread European ancestry with the highest levels being observed in the south. African ancestry is also widespread (except for the south) and reaches its highest values in the east of the country. Native American ancestry is highest in the north-west (Amazonia). The Chilean sample (Figure 1B) shows the least regional variation, with low levels of African ancestry throughout the country. European and Native American ancestry are relatively uniform, although somewhat higher European ancestry is seen around the main urban areas of the north and centre, Native ancestry predominating elsewhere, particularly in the south. The Colombian sample (Figure 1C) shows highest African ancestry in the coastal regions (particularly on the Pacific) and highest European ancestry in central areas. Native ancestry appears highest in the south-west and in the east of the country (Amazonia) but interpolations in these areas are based on few data points. In the Mexican sample (Figure 1D) Native American ancestry is highest in the centre/south of the country with the north showing the highest proportion of European ancestry. African ancestry is generally low across Mexico except for a few coastal regions. The Peruvian sample (Figure 1E) shows substantial Native American ancestry throughout the country, particularly in the south; European ancestry appears highest around northern/central areas. African ancestry in Peru is generally low, except for parts of the northern coast.
To evaluate the statistical significance of the observed spatial variation in ancestry we calculated Moran's Index (I) of association between each individual ancestry component and spatial location. These were significant for the three ancestries in all countries (p-values <0.02). Since the three ancestry components are not independent, we also calculated canonical correlation coefficients between ancestry and geographic location. These were also significant for all countries (p-values <0.001). The variation in ancestry seen in the admixture maps of Figure 1 also results in highly significant correlations of the three ancestries with altitude of birthplace (p-values <2×10^-16 for the three ancestries): African and European ancestry decrease with altitude (r of -0.24 and -0.39, respectively), while Native American ancestry increases (r = 0.48).
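As an illustration of the statistic named above, the following is a minimal sketch of Moran's I for one ancestry component measured at a set of locations. The inverse-distance weighting, the use of planar Euclidean distance on the coordinates and the function name `morans_i` are assumptions made for this example; the text does not specify the weighting scheme actually used, and significance testing (for instance by permutation) is not shown.

```python
import math


def morans_i(values, coords):
    """Moran's I with inverse-distance weights (assumed scheme, w_ii = 0).

    values: one ancestry value per location.
    coords: (x, y) tuples, assumed distinct so that no distance is zero.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    weighted_cross = 0.0  # sum over i != j of w_ij * dev_i * dev_j
    weight_total = 0.0    # sum over i != j of w_ij
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w = 1.0 / math.dist(coords[i], coords[j])
            weight_total += w
            weighted_cross += w * dev[i] * dev[j]
    sum_sq = sum(d * d for d in dev)
    return (n / weight_total) * weighted_cross / sum_sq


# Toy data: four locations on a line, with clustered versus alternating values.
line = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
clustered = morans_i([0.0, 0.0, 1.0, 1.0], line)
alternating = morans_i([0.0, 1.0, 0.0, 1.0], line)
```

On this toy layout the clustered pattern yields a larger I than the alternating one, which is the behaviour the index is meant to capture: similar values at nearby locations push I upwards.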
The Kriging interpolation scheme used in building the maps of Figure 1 uses the mean ancestry at each birthplace and does not provide information on the extent of individual variation in ancestry at each map location. In the 102 birthplaces with 10 or more individuals sampled we observe that the standard deviation in the three individual ancestry estimates extends over a wide range: African (0.012-0.022), European (0.046-0.273) and Native American (0.039-0.274). We evaluated the correlation of this variation in individual ancestry with the census size of these localities and found a significant positive correlation for all ancestries (r > 0.3, p-values <0.01).

Author Summary
Latin America has a history of extensive mixing between Native Americans and people arriving from Europe and Africa. As a result, individuals in the region have a highly heterogeneous genetic background and show great variation in physical appearance. Latin America offers an excellent opportunity to examine the genetic basis of the differentiation in physical appearance between Africans, Europeans and Native Americans. The region is also an advantageous setting in which to examine the interplay of genetic, physical and social factors in relation to ethnic/racial self-perception. Here we present the most extensive analysis of genetic ancestry, physical diversity and self-perception of ancestry yet conducted in Latin America. We find significant geographic variation in ancestry across the region, this variation being consistent with demographic history and census information. We show that genetic ancestry impacts many aspects of physical appearance. We observe that self-perception is highly influenced by physical appearance, and that variation in physical appearance biases self-perception of ancestry relative to genetically estimated ancestry.

Phenotypic diversity and genetic ancestry
Regression of phenotypic variation on genetic ancestry (taking Native American as reference) demonstrates a significant effect for most of the traits examined (p-value <10^-3 using a conservative Bonferroni multiple testing correction, Table 2). Among the non-facial phenotypes (accounting for sex, country, age, educational attainment and wealth) higher European ancestry is associated with: increased height, lighter pigmentation (of hair, skin and eyes) (Figure S6), greater hair curliness and male pattern baldness. Hair graying approaches statistical significance (p-value ~10^-2). Higher African ancestry is associated with: increased height, higher skin pigmentation and greater hair curliness. The proportion of phenotypic variance explained by ancestry is highest for skin pigmentation (19%) followed by hair shape (8%) and color of eyes and hair (4% and 5%, respectively) but at most 1% for the other phenotypes.
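One way to think about the "proportion of phenotypic variance explained by ancestry" is as the gain in R² when the European and African ancestry proportions are added to a model that already contains the covariates, with Native American ancestry omitted as the reference because the three proportions sum to one. The sketch below illustrates that idea with a small ordinary least squares routine; it is an assumption about how such a quantity could be computed, not a description of the analysis behind Table 2, and all names and toy values are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(y, predictors):
    """R^2 of an OLS fit of y on the predictor columns plus an intercept."""
    n = len(y)
    cols = [[1.0] * n] + [list(p) for p in predictors]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(n)]
    mean = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data: one covariate (age) and two ancestry proportions.
age = [20.0, 30.0, 25.0, 40.0, 35.0, 28.0]
eur = [0.1, 0.5, 0.9, 0.3, 0.7, 0.2]
afr = [0.2, 0.1, 0.05, 0.3, 0.1, 0.4]
trait = [30.0 + 10.0 * e - 5.0 * a for e, a in zip(eur, afr)]

r2_covariates = r_squared(trait, [age])
r2_full = r_squared(trait, [age, eur, afr])
variance_explained_by_ancestry = r2_full - r2_covariates
```

The toy trait is constructed to depend only on ancestry, so the full model fits it essentially perfectly; real phenotypes would of course show a much smaller gain.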
We also observed highly significant effects of educational attainment (p-value 3.87×10^-13) and age (p-value <2×10^-16) on height, with height increasing for individuals born more recently at a rate of ~1 cm every 10 years (Text S3). Genetic ancestry also has a range of effects on facial features, both in terms of size and shape, after accounting for height and BMI (in addition to the other covariates). Higher European ancestry is associated with reduced eye fold and an overall smaller face (centroid size).
Face size and shape effects were also evaluated through the analysis of all pair-wise inter-landmark distances (Table S3). Amongst these distances, 133 and 2 show significant effects of European and African ancestry, respectively (p-values <10^-6 assuming a conservative Bonferroni multiple testing correction; Table S3). The most significant effects of European ancestry (P < 10^-10) involve mainly distances between landmarks placed on the lips and nose. Face shape variation, independent of size, was assessed via Principal Components (PCs) of Procrustes 3D coordinates. Significant effects of European ancestry were detected for PCs 1 and 3-5, while African ancestry impacts on PCs 1, 2 and 4 (Table 2, Text S5 and Figure S3). These 5 PCs account for ~55% of the variation in face shape captured by the 36 landmarks placed on the facial photographs, with ancestry explaining up to 5% of the variance in PC scores (for PC4). Examination of the correlation between inter-landmark distances and facial PCs indicates that the highest correlation of distances between landmarks of the lips and nose is with PC4 (results not shown), consistent with this PC showing the largest proportion of variance explained by ancestry (Table 2).

Genetic ancestry, phenotypic diversity and self-perception
Four ethno/racial categories ("Black", "White", "Native" and "Mixed") are commonly used across Latin America in national censuses and other population surveys. We contrasted genetic ancestry and skin pigmentation (as measured by the melanin index) across these four self-estimated categories for the countries sampled (Figure 2 and Table S4). Within each country there is a gradient of decreasing European ancestry (and increasing pigmentation) for the "White", "Mixed" and "Native/Black" categories. Across countries, skin pigmentation is relatively uniform within ethnicity categories, except for "Black". For "White", "Native" and "Mixed" the mean melanin index across countries varies within ~2 units, while the range for "Black" is ~25 units. By contrast, genetic ancestry varies greatly between countries for all ethnicity categories. For example, European ancestry varies across countries by about 40% for "White", "Mixed" and "Native" and about 20% for "Black" (Figure 2; estimates for African and Native American ancestry are shown in Table S4).
Contrasting self-perceived (ranked into five bands at 20% increments) and genetically estimated continental ancestry we observe a moderate, but highly significant, correlation: America: r = 0.48, P < 2.2×10^-6, Europe: r = 0.48, P < 2.2×10^-6, Africa: r = 0.32, P < 2.2×10^-6. However, there is a trend for higher self-perceived Native American and African ancestry to exceed the genetic estimates (Figure 3). Similarly, there is a trend for lower self-perceived Native American and European ancestry to underestimate the genetic ancestry (Figure 3). To explore these trends further we performed a multiple linear regression of the difference between self-perceived and genetically estimated ancestry (i.e. the bias, see Methods), using genetic ancestry and covariates as predictors (Table 3).
As expected, we observe that genetic ancestry has a highly significant effect (p-value <2×10^-16 for all ancestries) and the negative sign of the regression coefficients reflects the orientation of bias seen in Figure 3. At increasing European genetic ancestry, there is greater underestimation in self-perception (a more negative bias). By contrast, with increasing African genetic ancestry there is less overestimation (less positive bias). For Native American ancestry, there is an overestimation (positive bias) at low levels, and an underestimation at high levels of ancestry (negative bias).
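To make the sign convention of the bias concrete, the sketch below computes the difference between self-perceived and genetically estimated ancestry for one continental component. Representing each of the five 20% bands by its midpoint is an assumption made for this example, as are the names `BAND_MIDPOINTS` and `perception_bias`; the exact definition used in the analysis is the one given in the Methods.

```python
# Assumed numeric value for each self-perception band (midpoint of the 20% bracket).
BAND_MIDPOINTS = {1: 0.10, 2: 0.30, 3: 0.50, 4: 0.70, 5: 0.90}


def perception_bias(band, genetic_proportion):
    """Self-perceived minus genetically estimated ancestry for one component.

    Positive values indicate overestimation in self-perception,
    negative values indicate underestimation.
    """
    if not 0.0 <= genetic_proportion <= 1.0:
        raise ValueError("genetic_proportion must lie in [0, 1]")
    return BAND_MIDPOINTS[band] - genetic_proportion


# A person reporting "very high" (band 5) with a genetic estimate of 0.60.
overestimate = perception_bias(5, 0.60)
# A person reporting "none or very low" (band 1) with a genetic estimate of 0.40.
underestimate = perception_bias(1, 0.40)
```

Under this convention the negative regression coefficients described above mean that the bias becomes more negative (or less positive) as the genetic proportion of that ancestry increases.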
Most of the phenotypic traits that show ancestry effects (Table 2) also have a significant effect on self-perception bias (Table 3). There is a particularly strong effect of pigmentation: individuals with lower skin pigmentation tend to overestimate their European ancestry while individuals with higher pigmentation overestimate their Native American and African ancestries. Similarly, lighter eye and hair color lead to an overestimation of European ancestry and an underestimation of Native American ancestry (but not African ancestry). Hair type is strongly associated with an overestimation of African ancestry. Marginally significant associations are seen with other phenotypes, including facial features such as eye fold (leading to an underestimation of European ancestry) and landmark coordinate PCs (Table 3). An effect of social factors on perception bias is evidenced by the observation that greater wealth is significantly associated with an overestimation of European ancestry and that there is significant variation in bias between countries (Table 3). We examined the impact on these results of the uncertainty associated with the ancestry estimates by repeating the regression analyses using ancestry estimates obtained with a subset of 15 markers (Methods). We found that the same covariates had significant effects and that the regression coefficients were not significantly different in the two sets of regression analyses.

Discussion
Since the late 15th century, the population of what is now called "Latin America" has undergone major demographic changes within the context of a highly diversified physical and social environment [6,23]. These changes include the occurrence of waves of immigration from various parts of Africa and Europe, the resulting decline of the Native populations most exposed to the immigrants and a variable admixture between these groups. There have also been a number of noticeable population movements in the region. For example, in recent generations there has been an extensive migration to the cities, Latin America now being the most urbanized region of the world (about 80% of its population is currently considered urban) [24]. Three of the countries we sampled (Brazil, Mexico and Colombia) are the most populous in the region and the combined population of the five countries examined here accounts for ~70% of Latin Americans. Although ours is a convenience sample, the individuals studied show considerable variation in birthplace and for a range of biological and social variables, illustrating the extensive heterogeneity of Latin Americans.
The interpolated ancestry maps obtained (Figure 1) are consistent with other genetic studies [20,21,25,26] and with census information on the distribution of the main ethnicity groups within each country (available at www.ine.cl; geoftp.ibge.gov.br; www.igac.gov.co; www.censo2010.org.mx; www.indepa.gob.pe). Altogether, these data underline the extensive genetic structure existing within and between Latin American countries. It is possible to relate this genetic heterogeneity to well documented historical factors [6,23,27]. Broadly, Native American ancestry is highest in areas that were densely populated in pre-Columbian times (particularly Meso-America and the Andean highlands) as well as in regions that received relatively little non-native immigration and which currently have relatively low population densities (e.g. Amazonia). During the colonial period Africans were brought to Latin America as forced labour mainly to coastal tropical areas, particularly in the Caribbean and Brazil [28]. Brazil was the main recipient of African slaves in the region (representing about 40% of all African slaves brought to the Americas [29]). Early (mostly male) Iberian immigrants settled across the continent, admixing extensively with Native Americans and Africans [5]. These were followed by further currents of European immigration, including individuals from various parts of Europe (often arriving as a result of governmental initiatives) and resulting in the dense settlement of specific geographic regions (such as the south of Brazil). The larger variance in individual ancestry observed for larger urban centres is consistent with the increasing urbanization of Latin America seen in recent generations, the cities absorbing immigrants with diverse genetic backgrounds. Other than demographic history, it is possible that assortative mating has also contributed to shaping population structure across Latin America. The Iberian "Conquest" (i.e. the first century of settlement) was characterized by extensive admixture between Natives and immigrants (driven by the highly predominant immigration of males) [5]. However, during the subsequent colonial period society became increasingly stratified, including the instauration during the 18th century of a caste system regulating marriages [6,27]. These restrictions were mostly abolished with the establishment of republican governments in the 19th century [6]. However, a number of studies have documented continuing assortative mating in Latin America, in relation to genetic ancestry, physical appearance and a range of social factors [30][31][32][33][34].
The pattern of variation we observe between physical appearance and genetic ancestry is consistent with information on the variation in frequency of the traits examined in Native Americans, Europeans and Africans. Constitutive skin pigmentation (i.e. in areas not exposed to light), hair and eye color and hair type are traits with little environmental sensitivity and show large differences between continental populations [35]. As expected, increased European ancestry shows a highly significant association with lighter skin, hair and eye pigmentation. A number of allelic variants impacting on these traits have been identified in Europeans and certain of these show large allele frequency differences between Europeans and non-Europeans [1,2,36]. We also found a highly significant effect of ancestry on hair type, individuals with higher Native American ancestry showing greater frequency of straight hair, a phenotype that is essentially fixed in Native Americans. Recent studies in East Asians implicate a p.Val370Ala substitution in the EDAR gene in hair morphology [37][38][39]. One of the ancestry informative markers typed here (rs260690) is located in the first intron of EDAR, is in high linkage disequilibrium with the p.Val370Ala variant in the HapMap dataset and is strongly associated with hair type in our sample, after accounting for ancestry (Text S4), suggesting that variants at EDAR could be impacting on hair morphology in Latin Americans. Greater European ancestry also correlates significantly with higher rates of male balding and (marginally) with hair greying (our sample is perhaps underpowered to detect these effects due to its relatively young age; Table 1). Although no thorough comparative data are available, classical population studies indicate that hair greying and androgenetic alopecia are rarer, less severe and of later onset in Native Americans than in other continental populations [40] and our data point to the existence of loci influencing the continental distribution of these traits. Studies in Europeans have recently identified loci associated with androgenetic alopecia [41,42], but no similar analyses have been performed for hair greying.

Figure 3 (caption fragment). The medium shade indicates the 5%-95% range, and the lightest shade refers to samples outside this range. For European ancestry, self-perception tends to underestimate genetic ancestry (the distributions are mostly above the diagonal). By contrast, self-perception tends to overestimate African ancestry (the distributions are mostly below the diagonal). At increasing levels of Native American genetic ancestry self-perception first underestimates then overestimates genetic ancestry (the distributions are on both sides of the diagonal). Simulations and plots were carried out using MATLAB [61]. doi:10.1371/journal.pgen.1004572.g003
Recent genome-wide association analyses in Europeans have implicated loci for variation in height and related anthropometric traits [43,44]. However, these traits are also strongly influenced by environmental factors, including nutrition [45]. In the sample studied here we find that Native American ancestry correlates significantly with lower height and we also detect a significant effect of socioeconomic position (Text S3), lower socioeconomic position correlating with decreased height. The significant effect of age on height, with younger individuals tending to be taller than older ones, suggests that the two socioeconomic indicators examined here (education and wealth) capture only part of the environmental variation impacting on height. The rate of increase in height for individuals born more recently (~0.1 cm/year) estimated here is similar to that obtained from extensive longitudinal surveys in Latin America (~1 cm per decade in the last century), an observation that has been interpreted as resulting from the historical improvement in living standards across the region [45,46]. It is thus possible that the ancestry effect on height that we detect could be influenced by environmental factors that correlate with ancestry and that are not captured by the socioeconomic variables examined here.
The ancestry effects that we detect for facial features (eye fold, face shape and size), but not for head circumference, agree with the notion of a greater developmental and evolutionary constraint on neuro-cranium than on facial variation. This is also in line with proposals that human facial features include a range of environmental adaptations [47][48][49]. Aspects of face shape variation captured by principal components analysis that are influenced by genetic ancestry include mainly width and height of the face, facial flatness, position of the glabella and fronto-temporal points, extent of eye fold and the relative size and position of lips and nose (a fuller description of face shape variation associated with each PC is presented in Text S5 and Figure S3). Two genome-wide association scans in Europeans have identified a few loci associated with aspects of face shape [50,51] but these results are pending confirmation by further studies. No genetic variants have yet been implicated in intercontinental differentiation for facial features.

Table 3. Multiple linear regression of the difference (D) between self-perceived and genetically estimated ancestry for the three continental components.

Our joint analysis of genetic, phenotypic and self-perception variation emphasizes the strong impact of physical appearance on self-perception. Comparison of skin pigmentation across self-perceived ethno/racial categories shows remarkable consistency between countries, underlining the weight given to this trait in self-perception [52]. The large variation in genetic ancestry between countries for each ethnicity category illustrates the relatively low predictive power of physical appearance for genetic ancestry. Although we detected highly significant effects of ancestry on many of the phenotypes examined, the observed correlations are relatively low (Table 2).
The poor reliability of physical appearance as an indicator of genetic ancestry likely relates to the impact of environmental variation on some of these traits, and to their specific genetic architecture. In particular, a few genetic variants could have relatively large phenotypic effects (as documented for pigmentation [2,36]). The impact of physical appearance on self-perception of ancestry likely relates to admixture in Latin America having largely occurred many generations ago and to the frequent unavailability of reliable genealogical information. The contrast between self-perceived and genetically estimated admixture proportions confirms the impact of physical appearance on self-perception and shows how certain traits, particularly but not exclusively related to pigmentation, can bias self-perception of ancestry. This biased perception of physical attributes is likely to be influenced by social and individual factors shaping the interpretation of phenotypic variation. The effect of such factors is illustrated by the observation of differences in bias across countries and the positive correlation between wealth and European ancestry (Table 2). An effect of wealth on self-perception of ancestry has also been the subject of study in the sociological literature on Latin America [52].
In conclusion, our study sample illustrates the extensive geographic variation in genetic ancestry seen across Latin America, reflecting the heterogeneous demographic history of the region. The highly significant impact of genetic ancestry on physical appearance is consistent with some of the phenotypic variation seen in Latin Americans stemming from genetic loci with differentiated allele frequencies between Africans, Europeans and Native Americans [53]. Further analysis of the study sample collected here should enable the identification of such loci. The significant correlation between self-perceived and genetically estimated ancestry is consistent with the observed effects of genetic ancestry on physical appearance. However, self-perception is biased, possibly due to non-biological factors affecting the perception of phenotypic variation and to the genetic architecture of physical appearance traits. Our findings exemplify the informativity of Latin America for studies encompassing genetic, phenotypic and sociodemographic information and the interest of a multidisciplinary approach to human diversity studies.

Study subjects
Recruitment took place mainly in five locations: México City (México), Medellín (Colombia), Lima (Perú), Arica (Chile) and Porto Alegre (Brazil). With the exception of Chile, most subjects recruited in these cities were students and staff from the universities participating in this research. In Chile about 2/3 of the subjects recruited were professional soldiers. In Brazil ~10% of samples were collected in smaller towns of the states of Rio Grande do Sul, Bahia and Rondonia. Adult subjects of both sexes were invited to participate mainly through public lectures and media presentations. Maps showing the number of volunteers in each unique birthplace are presented in Figure S4. Being a convenience sample, the main collection sites are overrepresented on these maps for each country. We obtained ethics approval from: Escuela Nacional de Antropología e Historia (México), Universidad de Antioquia (Colombia), Universidad Peruana Cayetano Heredia (Perú), Universidad de Tarapacá (Chile), Universidade Federal do Rio Grande do Sul (Brazil) and University College London (UK). All participants provided written informed consent. Blood samples were collected by a certified phlebotomist and DNA extracted following standard laboratory procedures.

Phenotypic data
A physical examination of each volunteer was carried out by the local research team using the same protocol and instruments at all recruitment sites. We obtained: height, weight, head, hip and waist circumference, cheilion-cheilion width and sellion-gnation height. All measurements were obtained in duplicate and the mean of the two measurements retained for further analyses. We recorded eye colour into five categories (1-blue/grey, 2-honey, 3-green, 4-light brown, 5-dark brown/black), and natural hair colour into four categories (1-red/reddish, 2-blond, 3-dark blond/light brown or 4-brown/black). Balding in males was recorded using a modified Hamilton scale as: 0) no hair loss, 1) frontal baldness only, 2) frontal hair loss with mild vertex baldness, 3) frontal hair loss with moderate vertex baldness, and 4) frontal hair loss with severe vertex baldness. Similarly, greying was recorded along a five point scale: 0) for no greying, 1) for predominant non-greying, 2) for ~50% greying, 3) for predominant greying and 4) for totally white hair. Due to the small number of individuals in categories 1-4 for male pattern balding and greying, we pooled these categories so as to contrast only two categories (presence or absence of the trait). Macroscopic hair type was categorized by visual inspection as 1-straight, 2-wavy, 3-curly or 4-frizzy. A quantitative measure of constitutive skin pigmentation (the Melanin Index) was obtained using the DermaSpectrometer DSMEII reflectometer (Cortex Technology, Hadsund, Denmark). Measurements were obtained from both inner arms and the mean of the two readings used in the analyses.
Five digital photographs of the face: left side (-90°), left angle (-45°), frontal (0°), right angle (45°), right side (90°) were taken from ~1.5 meters at eye level using a Nikon D90 camera fitted with a Nikkor 50 mm fixed focal length lens. The frontal facial photographs were used to score (by visual inspection) the presence of an eye fold along the upper eye lids using a three point scale: 0) absence, 1) partial (interior, middle or outer fold) and 2) full (along the entire eye lid). All photographs were annotated manually with 36 anatomical landmarks and 3D landmark coordinates extracted using the software Photomodeler (http://www.photomodeler.com/, Eos Systems Inc, Vancouver, Canada) (Figure S1). Landmark configurations were superimposed by Generalized Procrustes Analysis [54] and Principal Components (PCs) of the 3D landmark coordinates extracted using the software MorphoJ [54]. To ease visualization of the 3D shape changes associated with each PC we obtained deformation surfaces via a thin plate spline algorithm.
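Two of the size measures used in the Results can be derived directly from a landmark configuration: centroid size (the square root of the summed squared distances of the landmarks from their centroid) and the set of all pair-wise inter-landmark distances. The following is a minimal sketch of both for 3D landmark coordinates; the function names and the toy configuration are hypothetical, and the Procrustes superimposition and PC extraction performed in MorphoJ are not reproduced here.

```python
import math
from itertools import combinations


def centroid_size(landmarks):
    """Square root of the summed squared distances of landmarks to their centroid."""
    n = len(landmarks)
    centroid = tuple(sum(p[k] for p in landmarks) / n for k in range(3))
    return math.sqrt(sum(math.dist(p, centroid) ** 2 for p in landmarks))


def pairwise_distances(landmarks):
    """Map each unordered landmark pair (i, j), i < j, to its Euclidean distance."""
    return {
        (i, j): math.dist(landmarks[i], landmarks[j])
        for i, j in combinations(range(len(landmarks)), 2)
    }


# Toy configuration: four landmarks at the corners of a 2 x 2 square.
square = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
size = centroid_size(square)
distances = pairwise_distances(square)

# With 36 landmarks there are 36 * 35 / 2 = 630 unordered pairs.
n_pairs_for_36 = len(pairwise_distances([(float(i), 0.0, 0.0) for i in range(36)]))
```

For a configuration of 36 landmarks this yields 630 inter-landmark distances per face, which gives a sense of the multiple-testing burden when each distance is tested for an ancestry effect.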

Socioeconomic information and self-perception of ancestry
A structured questionnaire was applied to each volunteer. We obtained information on two indicators of socioeconomic position (Text S2). The first indicator is highest education level attained, categorized as: (1) none/primary/technical, (2) secondary and (3) university and post-graduate. The second indicator is a wealth index obtained from a list of items used to assess living standards. These items were: home ownership, number of bathrooms at the place of residence, ownership of household items (cars, bicycles, fridge/freezer/dishwasher, TVs, radios, CD/DVD players, vacuum cleaner, washing machine) and availability of domestic service. We used polychoric principal component analysis to examine the variability of each country sample and retained the first principal component as an indicator of wealth. To allow comparisons across countries we converted each individual's wealth score to a decile within that individual's country. The questionnaire included items exploring self-perception of ethnicity in the categories "Black", "Native", "White" and "Mixed", and self-perception of African, European, and Native American ancestry proportions. The latter was explained as a personal estimation of the proportion of ancestors that had a particular continental origin. We proposed a five-point scale, expressed in 20% brackets (and in words): 1) 0-20% (none or very low), 2) 20-40% (low), 3) 40-60% (moderate), 4) 60-80% (high) and 5) 80-100% (very high or total). The questionnaire also recorded information on the place of birth of the volunteer.
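The within-country decile conversion can be illustrated with a minimal Python sketch. It assumes the wealth score (the first polychoric principal component, not computed here) is already available per individual. The function name `wealth_deciles`, the rank-based decile rule and the tie-breaking by input order are illustrative assumptions, not details given in the text.

```python
from collections import defaultdict


def wealth_deciles(records):
    """Convert wealth scores to within-country deciles (1 = lowest, 10 = highest).

    records: list of (country, wealth_score) tuples; the score stands in for
    the first polychoric principal component described in the text.
    Returns a list of deciles in the same order as the input.
    """
    # Group individuals by country, remembering their original position.
    by_country = defaultdict(list)
    for idx, (country, score) in enumerate(records):
        by_country[country].append((score, idx))

    deciles = [0] * len(records)
    for members in by_country.values():
        members.sort()  # ascending score; ties broken by input order (assumption)
        n = len(members)
        for rank, (_, idx) in enumerate(members):
            # Rank-based decile: rank 0..n-1 maps to 1..10 (assumed convention).
            deciles[idx] = rank * 10 // n + 1
    return deciles
```

Because deciles are computed separately for each country, a given decile reflects relative position within a country rather than a common absolute wealth level.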

Genetic admixture estimation
In order to select 30 markers highly informative for estimating African/European/Native American ancestry, we started from the list of 5,000 markers, highly informative for world-wide continental ancestry estimation, identified by Paschou et al. (2010) [18] using the approach of Rosenberg et al. (2003) [55] based on the worldwide CEPH-HGDP cell panel genotyped with Illumina's Human610-Quad beadchip (including data for about 600,000 SNPs [56]). The full list of these 5,000 markers is at: http://www.cs.rpi.edu/~drinep/HGDPAIMS/WORLD_5000_INFAIMs.txt. Of these, allele genotype data is available in Native Americans for 3,848 markers [57], of which 2,392 have been placed on subsequent Illumina bead-chip products. This subset of markers was retained for selection of those to be typed here so as to facilitate subsequent data comparison and integration. We ranked these 2,392 markers based on allele frequency differences in European-Native American or European-African samples. Amongst markers with the highest inter-continental allele-frequency differences we selected those with lowest heterozygosity in Native Americans (so as to reduce the effect of variable allele frequencies between Native Americans on ancestry estimation). Of the final set of 30 markers retained, 13 are monomorphic in 408 Native Americans (from 47 populations from México Southwards [57]); the rest have minor allele frequencies ranging from 0.01 to 0.15 (median 0.06) in that group of populations. The list of markers typed is provided in Table S1. Genotyping was carried out by LGC Genomics (www.lgcgenomics.com/). In a sample of Colombians recently included in a genome-wide association study that used Illumina's 610 chip [19], this set of 30 markers produced individual ancestry estimates with correlations of ~0.7 (for all three ancestries) compared with ancestry estimates obtained using an LD-pruned set of 50,000 markers from the chip data, and identical mean estimates.
We compared the accuracy of these estimates with estimates obtained using markers from the list of 446 proposed by Galanter et al. (2012) [58], specifically for studying admixture in Latin Americans. From this list, 152 markers are present on Illumina's 610 chip (i.e. ~5 times the number of markers that we used) and produced estimates with correlations of ~0.85 with the ancestry estimates from the 50,000 marker set. By contrast, when the set of markers we selected was reduced to 15, the resulting ancestry estimates had a correlation of ~0.6 with the 50,000 marker set estimates, again showing that there is a diminishing return in accuracy when one increases the number of SNPs used in ancestry estimation.
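The two-step marker selection (rank by inter-continental allele-frequency difference, then prefer low heterozygosity in Native Americans) might be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the function name `select_markers`, the use of the larger of the two frequency differences as the ranking score, the candidate cut-off `n_candidates`, and expected heterozygosity computed as 2p(1-p) are all hypothetical choices.

```python
def select_markers(freqs, n_candidates, n_keep):
    """Pick ancestry-informative markers in two steps.

    freqs: dict mapping marker id -> (p_afr, p_eur, p_nam), the frequency of
    one reference allele in African, European and Native American samples.
    """
    def delta(f):
        # Assumed score: larger of the European-Native American and
        # European-African allele-frequency differences.
        p_afr, p_eur, p_nam = f
        return max(abs(p_eur - p_nam), abs(p_eur - p_afr))

    def het_native(f):
        # Expected heterozygosity in Native Americans for a biallelic marker.
        p = f[2]
        return 2.0 * p * (1.0 - p)

    # Step 1: keep the markers with the largest frequency differences.
    ranked = sorted(freqs, key=lambda m: delta(freqs[m]), reverse=True)
    candidates = ranked[:n_candidates]
    # Step 2: among those, keep the markers least variable in Native Americans.
    return sorted(candidates, key=lambda m: het_native(freqs[m]))[:n_keep]
```

A marker that is monomorphic in Native Americans has heterozygosity 0 in this sketch and is therefore favoured in the second step.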
Individual African, European and Native American ancestry proportions were estimated with the ADMIXTURE program [17] in supervised runs where African, European and Native American reference groups (K = 3) were provided (see below). Unsupervised runs at K = 3 produced very similar estimates (Figure S2), confirming our choice of ancestry-informative markers and parental populations. Standard errors of the individual ancestry estimates were obtained by bootstrap using the program's default parameters (200 replication runs). Data from a total of 876 individuals sampled in putative parental populations were used in ancestry estimation and specified in the supervised ADMIXTURE runs. These were selected from HAPMAP, the CEPH-HGDP cell panel [56] and from published Native American data [57] as follows: 169 Africans (from 5 populations from Sub-Saharan West Africa), 299 Europeans (from 7 West and South European populations) and 408 Native Americans (from 47 populations from México Southwards). The full list of the putative parental population samples (and their sizes) is provided in Table S2.

Geographic analyses
The birthplace names of all individuals were consolidated into a list of unique locations organized into three fields: city/municipality, region/state and country. Geographic coordinates (and altitude) for each placename were obtained via the Google Maps Geocoding API (https://developers.google.com/maps/documentation/geocoding/). The GeodesiX software (http://www.calvert.ch/geodesix/) was used for the geocoding query. We used the Global Rural-Urban Mapping Project version 1 data set (GRUMPv1; http://sedac.ciesin.columbia.edu/data/set/grump-v1-settlement-points) to attribute census size to these localities (see Supp. Text S6) [59]. We used census data for 1990, as the median age in our country samples ranges between 20 and 25.
Geographic maps displaying spatial variation in individual admixture were obtained with Kriging interpolation using the software ArcGIS 9.3 (http://www.esri.com/software/arcgis). The cartographic database was geo-referenced to the SIRGAS geodesic system (Geocentric Reference System for the Americas, www.ibge.gov.br/home/geociencias/geodesia/sirgasing/index.html) using a Universal Transverse Mercator projection. CorelDRAW X3 (Corel Corporation, Ottawa, Canada) was used to edit the map images. When a geographic location had multiple data entries (i.e. volunteers), the Kriging interpolation scheme used the mean ancestry at that location. The correlation between the standard deviation of individual ancestry variation (at locations with more than 10 samples) and census size was tested using Spearman's rank correlation (as population sizes are generally non-normal but rather distributed exponentially). Statistical significance was obtained via permutation of individual birthplaces.
We tested the null hypothesis of ancestry being spatially uniformly distributed using two approaches. Firstly, we obtained Moran's I index for each ancestry component (African, European, Native American) separately. This index tests for spatial uniformity of a variable using standard autocorrelation models. We evaluated significance by permuting birthplace locations for every individual while keeping the number of individuals sampled per location constant. To assign a single value to each location we used the average ancestry, recalculating this average after every permutation. Secondly, we used canonical correlation analysis. A disadvantage of Moran's method is that the three ancestry variables are not independent, complicating the interpretation of p-values. Canonical correlation allows one to combine the three ancestries into a single variable: it is the maximal correlation between two sets of linear combinations of multiple variables. In our case, the three ancestries constitute one set and the geographical coordinates (latitude and longitude) constitute the second set. Adding quadratic and cubic powers of the geographic coordinates improved the fit, consistent with the curved shape of the ancestry gradients and the existence of regions with markedly different ancestry. Adding a fourth power did not improve the fit any further. P-values were obtained by permutation as above.
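The permutation scheme for Moran's I can be sketched in Python as below. The text does not specify the spatial weight matrix, so this sketch assumes inverse-distance weights on plain Euclidean distance between coordinates, and a one-sided test for positive autocorrelation. The function names and the parameters `n_perm` and `seed` are illustrative. Locations are assumed to have distinct coordinates.

```python
import random
from math import hypot


def morans_i(values, coords):
    """Moran's I with inverse-distance weights (assumed weighting scheme)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = 0.0
    w_sum = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                dist = hypot(coords[i][0] - coords[j][0],
                             coords[i][1] - coords[j][1])
                w = 1.0 / dist
                num += w * dev[i] * dev[j]
                w_sum += w
    denom = sum(d * d for d in dev)
    return (n / w_sum) * (num / denom)


def location_means(ancestry, loc_ids, n_locations):
    """Average individual ancestry within each location."""
    sums = [0.0] * n_locations
    counts = [0] * n_locations
    for a, loc in zip(ancestry, loc_ids):
        sums[loc] += a
        counts[loc] += 1
    return [s / c for s, c in zip(sums, counts)]


def permutation_pvalue(ancestry, loc_ids, coords, n_perm=999, seed=0):
    """Observed Moran's I and a one-sided permutation p-value.

    Shuffling the location labels across individuals keeps the number of
    individuals per location constant; means are recomputed every time.
    """
    rng = random.Random(seed)
    n_loc = len(coords)
    observed = morans_i(location_means(ancestry, loc_ids, n_loc), coords)
    shuffled = list(loc_ids)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        stat = morans_i(location_means(ancestry, shuffled, n_loc), coords)
        if stat >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Here `loc_ids` holds, for each individual, an integer index into `coords`. A geodesic distance would be a more faithful choice than Euclidean distance on latitude and longitude for real data.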

Statistical analyses
To evaluate the effect of ancestry on phenotype we used multivariate regression models including basic covariates (age, sex, country, education, wealth, and optionally BMI and height). Depending on the trait we used multiple linear (for continuous and ordinal traits) or logistic (for binary traits) regression. The categorical traits in Table 2 were considered ordinal variables (converted into four or five integer levels as specified in Table 1). The justification for doing so is the convention that for an ordinal variable with several categories there is little difference between fitting a linear regression model and an ordered probit model [48]. This holds because for these traits we can assume an underlying continuous variable (for eye or hair colour it could be the amount of pigment, for hair shape it could be the curvature of hair). Since an underlying continuous variable converted into ordered categories is the main assumption behind the probit model, the two analyses are expected to agree. We confirmed this by fitting both models and checking that the results are similar.
Regression results corresponding to the ancestry variables are presented in Table 2 along with R² from this full model. A baseline regression model with only the covariates was also fitted, leaving out ancestry, and the difference in R² between the two models was taken to be the proportion of variance in the phenotype explained by ancestry. Standard errors of the individual ancestry estimates (provided by the ADMIXTURE software) were incorporated in the multivariate regressions via the errors-in-variables model [49]. This adjusts the estimated regression coefficients and p-values for all covariates. The error in estimating a variable generally leads to an underestimation of the regression coefficients. However, the p-value still approaches zero under the alternative hypothesis, provided sample sizes are sufficiently large. For the ancestry estimates, the error in estimation was relatively low (~1-5%); consequently, for our large sample sizes the reduction in effect size for each variable was modest (~5-10%).
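The comparison of the full and baseline models can be illustrated with a short NumPy sketch. It uses ordinary least squares only and does not include the errors-in-variables adjustment described above; the function names are illustrative assumptions. Because the three ancestry proportions sum to one, the sketch expects only two of them as predictors.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X plus an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def variance_explained_by_ancestry(covariates, ancestry, y):
    """Difference in R^2 between the full model and the covariate-only model.

    covariates: (n, k) array of baseline covariates (e.g. age, sex, wealth).
    ancestry:   (n, 2) array with two of the three ancestry proportions.
    y:          (n,) array with the phenotype.
    """
    r2_base = r_squared(covariates, y)
    r2_full = r_squared(np.column_stack([covariates, ancestry]), y)
    return r2_full - r2_base
```

Since the baseline model is nested in the full model, the difference returned by this sketch cannot be negative (up to floating-point error).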
To evaluate the relationship between self-perceived and genetically estimated ancestry we performed a bias analysis. This bias was defined as self-perception minus the estimated genetic ancestry. Overestimation therefore means that self-perception exceeds the genetic estimate, while underestimation indicates that self-perception is lower than the genetic estimate. Each genetic ancestry estimate was obtained as a percentage (proportion), while self-perception was recorded into five bands at intervals of 20%. The bias in self-perception was therefore considered zero if the percentage of genetic ancestry fell within the chosen self-perception interval. Otherwise, bias was measured to be the distance of the closest boundary of the self-perception interval to the genetic ancestry percentage. We then performed multivariate linear regression of the bias on the genetic ancestry estimates and other variables (Table 3). The advantage of analysing the bias is that the regression model is easily interpretable. If self-perception was accurate (bias of zero) all the regression coefficients would be non-significant. If the bias is non-zero and some variables show significant effects, the signs of the coefficients are interpretable as leading to overestimation (positive coefficients) or underestimation (negative coefficients) of ancestry, as indicated above.
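The bias definition can be made concrete with a small Python sketch. The band boundaries follow the five-point questionnaire scale; the function name `perception_bias` and the treatment of the bands as closed intervals (so that a value exactly on a boundary counts as inside either adjacent band) are assumptions made for illustration.

```python
# Band index -> (lower, upper) bounds in percent, as on the questionnaire scale.
BANDS = {1: (0, 20), 2: (20, 40), 3: (40, 60), 4: (60, 80), 5: (80, 100)}


def perception_bias(band, genetic_pct):
    """Signed bias (self-perception minus genetic estimate), in percentage points.

    band:        self-perceived ancestry band (1 to 5).
    genetic_pct: genetically estimated ancestry, in percent (0 to 100).
    """
    lower, upper = BANDS[band]
    if lower <= genetic_pct <= upper:
        return 0.0  # genetic estimate falls inside the chosen band
    if genetic_pct < lower:
        # Self-perception exceeds the genetic estimate: overestimation (> 0).
        return float(lower - genetic_pct)
    # Self-perception is below the genetic estimate: underestimation (< 0).
    return float(upper - genetic_pct)
```

For example, an individual with a genetic estimate of 35% who chooses the 60-80% band has a bias of +25, whereas the same individual choosing the 0-20% band has a bias of -15.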