High-Throughput Sequencing of a South American Amerindian

  • André M. Ribeiro-dos-Santos ,

    Contributed equally to this work with: André M. Ribeiro-dos-Santos, Jorge Estefano Santana de Souza

    Affiliation Instituto de Ciências Biológicas, Universidade Federal do Pará, Belém, Pará, Brazil

  • Jorge Estefano Santana de Souza ,

    Contributed equally to this work with: André M. Ribeiro-dos-Santos, Jorge Estefano Santana de Souza

    Affiliations Centro Regional de Hemoterapia, Faculdade de Medicina de Ribeirão Preto, Universidade de São Paulo, Ribeirão Preto, São Paulo, Brazil, Institute of Bioinformatics and Biotechnology, São Paulo, São Paulo, Brazil

  • Renan Almeida,

    Affiliation Institute of Bioinformatics and Biotechnology, São Paulo, São Paulo, Brazil

  • Dayse O. Alencar,

    Affiliation Instituto de Ciências Biológicas, Universidade Federal do Pará, Belém, Pará, Brazil

  • Maria Silvanira Barbosa,

    Affiliation Instituto de Ciências Biológicas, Universidade Federal do Pará, Belém, Pará, Brazil

  • Leonor Gusmão,

    Affiliations Instituto de Ciências Biológicas, Universidade Federal do Pará, Belém, Pará, Brazil, Institute of Molecular Pathology and Immunology, University of Porto, Porto, Portugal

  • Wilson A. Silva Jr,

    Affiliations Centro Regional de Hemoterapia, Faculdade de Medicina de Ribeirão Preto, Universidade de São Paulo, Ribeirão Preto, São Paulo, Brazil, Departamento de Genética, Faculdade de Medicina de Ribeirão Preto, Universidade de São Paulo, Ribeirão Preto, São Paulo, Brazil

  • Sandro J. de Souza,

    Affiliations Institute of Bioinformatics and Biotechnology, São Paulo, São Paulo, Brazil, Brain Institute, Universidade Federal do Rio Grande do Norte, Natal, Rio Grande do Norte, Brazil

  • Artur Silva,

    Affiliation Instituto de Ciências Biológicas, Universidade Federal do Pará, Belém, Pará, Brazil

  • Ândrea Ribeiro-dos-Santos,

    Affiliation Instituto de Ciências Biológicas, Universidade Federal do Pará, Belém, Pará, Brazil

  • Sylvain Darnet,

    Affiliation Instituto de Ciências Biológicas, Universidade Federal do Pará, Belém, Pará, Brazil

  • Sidney Santos

    sidneysantos@ufpa.br/sidney.santos@pq.cnpq.br

    Affiliation Instituto de Ciências Biológicas, Universidade Federal do Pará, Belém, Pará, Brazil


Abstract

The emergence of next-generation sequencing technologies allowed access to the vast amounts of information contained in the human genome. This information has contributed to the understanding of individual and population-based variability and has improved the understanding of the evolutionary history of different human groups. However, the genome of a representative of the Amerindian populations had not previously been sequenced. Thus, the genome of an individual from a South American tribe was completely sequenced to further the understanding of the genetic variability of Amerindians. A total of 36.8 giga base pairs (Gbp) were sequenced and aligned with the human genome. The aligned data covered 95.92% of the human genome, with an estimated miscall rate of 0.0035 per sequenced bp. The data obtained from the alignment were used for SNP (single-nucleotide polymorphism) and INDEL (insertion-deletion) calling, which resulted in the identification of 502,017 polymorphisms, of which 32,275 were potentially new high-confidence SNPs and 33,795 were new INDELs, specific to South Native American populations. The authenticity of the sample as a member of the South Native American populations was confirmed through the analysis of the uniparental (maternal and paternal) lineages. The autosomal comparison distinguished the investigated sample from other continental populations and revealed a close relation to the Eastern Asian populations and the Aboriginal Australian. Although the findings do not rule out the classical model of the settlement of the Americas, they bring new insights to the understanding of human population history. The present study indicates that remarkable genetic variability in human populations remains to be identified, and it contributes to the understanding of the genetic variability of South Native American populations and of the history of human populations.

Introduction

The emergence of next-generation sequencing (NGS) technologies, such as Solexa (Illumina) [1], 454 (Roche) [2] and SOLiD (Life Technologies) [3], allowed access to the vast amounts of information that are contained in the genomes of various organisms. In recent years, two large projects, HapMap [4] and the 1,000 Genomes Project [5], and other more specific initiatives [6]–[14] sought to document the genetic variability in the human genomes of different ethnic and geographic groups. This type of investigation allowed the identification of rare genetic variants [11], [15], the ability to make more precise inferences from association studies with complex diseases [16], and the formulation of new insights into the history of the formation and migration of human populations [6]–[10], [12]–[14], [17]–[19].

The Americas were the last continents to be occupied by human populations. According to the most widely accepted model, which is based on anthropological, archaeological, and genetic evidence [18], [20]–[22], the American natives originated in Eastern Asia 20 to 30 thousand years ago (kya) and expanded across the Americas along the North-South direction [18]. Despite the large number of publications on the occupation of the Americas [18], [20], [23], [24], several questions have not yet been answered. Information on individual and/or population-based genetic variability, such as the information provided by whole genome sequencing (WGS), might contribute to the solution of many of these questions.

The literature records at least 15 fully annotated WGSs of individuals from European [8], Asian [6], [7], [11], [13]–[15], Oceanic [12], and African [9] populations, but none from a Native American. To obtain a better understanding of the genetic variability of Native Americans and to infer genetic associations with complex diseases, such as diabetes mellitus type 2 and coronary artery diseases frequently detected in South American populations, the full genome of an individual from a South American indigenous population was sequenced. The data obtained served as a basis for the comparison with representatives from different geographic areas and for the discovery of new polymorphisms (SNPs – single nucleotide polymorphisms; and INDELs – insertion/deletion polymorphisms), and they represent the first WGS of a South Native American individual.

Results

Whole genome sequencing and mapping

The two mate-paired full-slide runs that were performed produced 36.8 Gbp of raw data (Table 1). The total number of Gbp that were mapped on the human genome was 31.4 (85.5%). Of these mapped Gbp, 25.9 Gbp (70.34%) exhibited single alignment, and 35.62% of the reads were paired. A 95.92% genome mapping was achieved with an average sequencing coverage of 8.23× and a physical coverage of 99.83×. The sequencing error rate was estimated to be 0.0035 per base pair (bp), as suggested by Fujimoto et al. [11].

Identification and annotation of SNPs and INDELs

Approximately 2.1 million SNP and INDEL polymorphisms were identified. Of these polymorphisms, 502,017 were considered to have high confidence (coverage equal to or higher than 10× and a minimum support of 4×), of which 435,947 (86.84%) were already described in the Single Nucleotide Polymorphism database (dbSNP) and 66,070 (32,275 SNPs and 33,795 INDELs) were considered potentially new polymorphisms (Table 2).

Table 2. High-quality polymorphisms identified and not identified in the dbSNP build 135 database.

https://doi.org/10.1371/journal.pone.0083340.t002
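The totals reported for the high-confidence polymorphisms can be cross-checked with simple arithmetic. The following is a minimal sketch that uses only the counts quoted in the text; the variable names are illustrative and not part of the original analysis.

```python
# Counts quoted in the text for the high-confidence polymorphisms.
total_high_conf = 502_017
known_in_dbsnp = 435_947

# Potentially new polymorphisms are those not described in dbSNP.
potentially_new = total_high_conf - known_in_dbsnp
known_fraction = known_in_dbsnp / total_high_conf

print(potentially_new)          # 66070
print(f"{known_fraction:.2%}")  # 86.84%
```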

The total number of high-confidence SNPs with coverage equal to or higher than 10× was 433,310 (available in Table S1). These polymorphisms were classified according to their mutation type (transition or transversion) and their position relative to the closest gene and its features (e.g., synonymous, non-synonymous, stop codon loss, stop codon gain, intronic, 5′-UTR, 3′-UTR, near 3′ end of gene, 5′ splice site, and 3′ splice site; Table 3). The mutations were most frequently found in intronic regions; the next most common locations were the 3′ end of the gene (500 bp after the 3′ end of the gene) and the 3′-UTR. Of the 32,275 potentially new SNPs, 1,899 (5.88%) and 123 (0.38%) exhibited a coverage that was equal to or higher than 20× and 50×, respectively. The following analyses were based on the complete set of high-quality SNPs, as described above.
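The transition/transversion classification mentioned above follows a standard rule: a substitution within the purines (A, G) or within the pyrimidines (C, T) is a transition, and any other substitution is a transversion. A minimal sketch of that rule is shown here; the function name is hypothetical and this is not the annotation pipeline used in the study.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def mutation_type(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    # Both bases in the same chemical class means a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(mutation_type("C", "T"))  # transition
print(mutation_type("A", "C"))  # transversion
```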

Table 3. Nature (genomic position and type of mutation) of the SNPs identified in the dbSNP database.

https://doi.org/10.1371/journal.pone.0083340.t003

Maternal and paternal lineages

The maternal and paternal lineages of the sample were assessed through a comparison of the SNPs found in the mitochondrial DNA (mtDNA) and the SNPs found in chromosome Y (Y-DNA) [18], [25], [26]. The mtDNA exhibited an average coverage of 729×, and 41 variable positions were detected along the mitochondrial genome; among these, 16 mutations (064 C>T; 146 T>C; 153 A>G; 235 A>G; 663 A>G; 1,736 A>G; 4,248 T>C; 4,824 A>G; 8,027 G>A; 8,794 C>T; 12,007 G>A; 16,111 C>T; 16,223 C>T; 16,290 C>T; 16,319 G>A; and 16,362 T>C) are commonly found among Amerindian populations and placed the sample within the A2 haplogroup [21], [25]–[30]. The analysis of the Y-DNA revealed the mutations M242, M346, and M3, which allowed the classification of the sample within haplogroup Q1a3a* [31].
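To illustrate how a list of motif mutations like the one above can be compared against a set of observed variants, the sketch below parses the 16 mutations quoted in the text and measures how many of them appear in an observed set. This is only an illustration under simple assumptions: the function names are hypothetical, and actual haplogroup assignment relies on the full PhyloTree classification rather than on a flat motif list.

```python
import re

# The 16 mutations quoted in the text, as "position REF>ALT" items.
A2_MOTIF_TEXT = (
    "064 C>T; 146 T>C; 153 A>G; 235 A>G; 663 A>G; 1,736 A>G; 4,248 T>C; "
    "4,824 A>G; 8,027 G>A; 8,794 C>T; 12,007 G>A; 16,111 C>T; 16,223 C>T; "
    "16,290 C>T; 16,319 G>A; 16,362 T>C"
)


def parse_mutations(text):
    """Turn 'pos REF>ALT' items into a set of (position, ref, alt) tuples."""
    variants = set()
    for item in text.split(";"):
        match = re.fullmatch(r"\s*([\d,]+)\s+([ACGT])>([ACGT])\s*", item)
        if match is None:
            raise ValueError(f"unrecognized mutation: {item!r}")
        position = int(match.group(1).replace(",", ""))
        variants.add((position, match.group(2), match.group(3)))
    return variants


def motif_support(observed, motif):
    """Fraction of the motif mutations that are present among the observed variants."""
    return len(observed & motif) / len(motif)


a2_motif = parse_mutations(A2_MOTIF_TEXT)
print(len(a2_motif))  # 16
```

A sample carrying every listed mutation would give `motif_support(observed, a2_motif) == 1.0`.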

HapMap's population-based comparisons

We compared the set of high-confidence SNPs that were found in this study with those obtained in the HapMap project phases 1, 2, and 3 [4]. Of the reliable SNPs identified in the sample, 205,634 had not been previously genotyped in any of the populations included in the database, whereas 227,676 had already been found in at least one individual. The Asian populations (CHB and JPT) shared the most SNPs with the sample (222,644; Figure S1). Moreover, these comparisons revealed that 2,955 SNPs had been genotyped in all of the populations included in HapMap. The genotype data for these 2,955 SNPs (dataset A, see Materials and Methods) were used in the following analyses.

The results of the STRUCTURE analyses are shown in Figure 1A. With a K value of four, the Amerindian sample clustered with the Asian populations (JPT, CHD, and CHB), with only 1% of external contribution. The results also indicated a group of the European populations (CEU and TSI), a group of the African populations (YRI, ASW, LWK, and MKK), and the admixed group of Mexican (MEX) and Indian (GIH) samples.

Figure 1. Comparative population-based genetic analysis of the Amerindian with the populations in the HapMap database.

The Amerindian sample (IND) was compared to 20 randomly selected samples from each of the following populations in the HapMap database: JPT (Japanese in Tokyo, Japan), CHD (Chinese in metropolitan Denver, Colorado, USA), CHB (Han Chinese in Beijing, China), CEU (residents of Utah, USA with Northern and Western European ancestry from the CEPH collection), TSI (Tuscans in Italy), GIH (Gujarati Indians in Houston, Texas, USA), YRI (Yoruba in Ibadan, Nigeria), ASW (individuals with African ancestry in Southwest USA), LWK (Luhya in Webuye, Kenya), MKK (Maasai in Kinyawa, Kenya), and MEX (individuals with Mexican ancestry in Los Angeles, California, USA). The populations were clustered according to their geographic origin as follows: East-Asia (JPT, CHB, and CHD), Europe (CEU and TSI), South-West Asia (SWA, formed by the GIH population), Africa (YRI, ASW, LWK, and MKK) and North America (NA, formed by the MEX population). A) Diagram of the genetic contribution of the models for a value of K in the range of four to seven. The x-axis represents the different samples that were clustered according to the population and the geographic area of origin. B) Principal Component Analysis (PCA) of the Amerindian sample (indicated with the arrow) and the samples extracted from the HapMap database. The abscissa represents the 1st component, and the ordinate represents the 2nd component. C) Heat map of the FST index between the investigated populations and the Amerindian sample.

https://doi.org/10.1371/journal.pone.0083340.g001

With a K value in the range of five to seven, no gene flow with the Native American sample was found. At this level of analysis, the European (CEU and TSI), Asian (JPT, CHD and CHB), and African (YRI, ASW, LWK and MKK) populations were found to form homogeneous groups, although the LWK and MKK populations exhibited a small degree of mixture, mainly with Europeans. As expected, the Mexican (MEX) and Indian (GIH) populations exhibited a high proportion of admixture with the various populations.

Two other approaches were used to estimate the genetic differences: discriminant analysis of the principal components (DAPC), which illustrated the differences among the individuals and populations (Figure 1B); and the fixation index (FST) among the 12 populations. To facilitate the visualization of the results, the FST measures were plotted using a heat map (Figure 1C). Both results showed that the sample was isolated from the other populations. With the exception of the Mexican population, which is known to result from a mixture of Europeans and Amerindians, the Eastern Asian populations (JPT, CHB, and CHD) were found to be closest to the South Amerindian sample. The African and European populations were the most distant from the sample.
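For readers unfamiliar with the fixation index, the sketch below computes a simple Wright-style FST between two populations from per-SNP allele frequencies, using FST = (HT − HS) / HT with heterozygosities summed over SNPs. It is an assumption-laden illustration only: the function name is hypothetical, both populations are weighted equally, no sample-size correction is applied, and it does not reproduce the estimator implemented in Arlequin.

```python
def fst_two_populations(freqs_a, freqs_b):
    """Wright-style F_ST from per-SNP allele frequencies in two populations.

    Uses F_ST = (H_T - H_S) / H_T, where H_S is the mean within-population
    expected heterozygosity and H_T is the expected heterozygosity of the
    pooled population (equal weights), both summed over SNPs.
    """
    if len(freqs_a) != len(freqs_b):
        raise ValueError("frequency lists must have the same length")
    h_s_sum = 0.0
    h_t_sum = 0.0
    for p_a, p_b in zip(freqs_a, freqs_b):
        p_bar = (p_a + p_b) / 2
        h_s_sum += (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2
        h_t_sum += 2 * p_bar * (1 - p_bar)
    if h_t_sum == 0:
        raise ValueError("no polymorphic SNPs in the pooled population")
    return (h_t_sum - h_s_sum) / h_t_sum


# Identical frequencies give 0; fixed differences give 1.
print(fst_two_populations([0.5, 0.2], [0.5, 0.2]))
print(fst_two_populations([1.0, 0.0], [0.0, 1.0]))
```

When one "population" is a single individual, as here, the allele frequencies can only take the values 0, 0.5 and 1, so any such estimate is necessarily noisy.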

1,000 genomes population-based comparisons

The high-confidence SNPs were also compared with those found by the 1,000 Genomes Project phase 1 [5] and in the Aboriginal Australian genome [12]. A set of 13,177 shared SNPs was identified (dataset B, see Materials and Methods) and used for the STRUCTURE and Threepop analyses. The analysis of f3 statistics did not reveal any gene flow between the Native American sample and any other population (Table 4).

The STRUCTURE analyses (Figure 2, Figure S2) reveal an important contribution of Eastern Asian populations to the Native American sample (K less than 4). For K between 4 and 5, the sample is distinguished from the other groups, with more than 88% specific contribution. For K values greater than or equal to 6, a new group emerges, isolating the Aboriginal Australian sample with more than 90% specific contribution. Notably, in these analyses our sample showed a 30% contribution from the Aboriginal population.

Figure 2. Population genetic structure analysis of the 1,000 Genomes Project, Aboriginal Australian and Native South American genotype dataset.

The diagram of genetic contribution was obtained using the Structure software for models of 4 to 6 subpopulations. The populations were grouped and labeled as follows, according to their major continental ancestry: IND (Native South American individual, this work); ABO (Aboriginal Australian individual, Rasmussen et al. 2011); Eastern Asian (CHS, CHB and JPT); Europeans (CEU, IBS, TSI, FIN and GBR); African (ASW, LWK, MKK and YRI); and New Americans (CLM, PUR and MXL). The plot represents the rate of contribution from each subpopulation (colors) to the samples (x axis).

https://doi.org/10.1371/journal.pone.0083340.g002

The American populations (CLM, PUR and MXL) presented a similar pattern of contribution for K values greater than or equal to 5. They had the same population structure, formed by three major groups: European, African and Native American.

Discussion

The aim of the present study was to describe potential new genetic variations to be further analyzed among South Native American populations through the use of NGS technologies. The specialized literature describes the fully sequenced genomes of samples of other populations, including samples of Japanese [11], Chinese [15], Korean [6], [7], Indian [13], [14], Aboriginal Australian [12], and African [9] populations; these samples are available for comparison analyses. The accumulation of additional data might contribute to a better understanding of the genetic diversity that is present in those populations and to an improved historical description of the expansion of the human population across the various continents.

The results of the present study contribute to the catalogs of genetic variation that have been organized by various projects, such as the 1,000 Genomes [5] and the HapMap [4] projects, which have not included any South Native American samples to date.

The comparison of the findings with the available dbSNP, HapMap and 1,000 genomes datasets showed a high number of potentially new polymorphisms. This result indicates the importance of acquiring information on the genetic variability of human populations through the sequencing of individuals from different ethnic groups, particularly those that have not been previously investigated, such as the Native Americans. This approach is especially important for studying the health of isolated populations, for example in the search for markers associated with complex diseases such as type 2 diabetes, coronary diseases, hypertension and others.

The analyses performed limit the possibility of sequencing errors and sample contamination. The mapping, genome coverage and SNP detection in this study were similar to those of previous studies [12], [13], [15]. The quality of these results is attested by the low error rate (estimated as 0.0035 per bp), which is similar to the one obtained by Fujimoto et al. [11].

STRUCTURE analyses were performed to establish the relationship between the investigated sample and samples from several different populations. The high number of SNPs shared with Eastern Asian populations corroborates the classical model of the occupation of the Americas.

The ancestral origin of the sample was assessed through the analysis of the maternal (mitochondrial DNA) and paternal (Y-DNA) lineages. The A2 mitochondrial haplogroup is typical of the indigenous populations of the Americas, including the South American indigenous groups, and it is found in approximately 15% of the Amazonian tribes [32], 3.7% of the Andean tribes [33], and 5.9% of the southern South American tribes [33]. The Q1a3a* Y-DNA haplogroup [31] is specific to Native American tribes [34]. It is the most frequent among the South Native American tribes (more than 90%; [35]). Other lineages belonging to clade Q have been detected only as rare exceptions and at a much lower frequency (e.g., [36]–[41]). Therefore, the analysis of the uniparental markers indicated that both lineages were derived from South Native American populations.

The introgression of European genes into Native American populations, particularly through the crossings between European males and Amerindian females [42], is relatively well known in the specialized literature. The dataset A analyses showed no admixture of the sample with any continental population in models with more than four subpopulations. In the 1,000 genomes dataset analyses with K larger than 5, no contribution of African or European populations was found. To further explore our data for possible recent admixture, the f3 statistic was applied to the 1,000 genomes dataset, and no admixture scheme was statistically significant. The data support the conclusion that the investigated sample does not exhibit traits of a recent interethnic mixture and, thus, can be considered an authentic South Native American individual.

Although no evidence of recent admixture was found, the 1,000 genomes STRUCTURE analyses indicated an important, steady contribution of the Aboriginal Australian to the South Native American sample for K larger than 5. Three scenarios can explain the similarities between the Aboriginal sample and ours: i) an ancient shared history between the groups [43]; ii) ancient migrations other than the classical Bering model [35], [44]; and iii) genetic drift, considering that each group consists of only one sample. This matter can only be clarified by expanding the number of samples and the datasets available for analysis.

The present study identified 32,275 potentially new SNPs with a low error rate. These new SNPs may contribute to the understanding of the composition and genetic history of South Native American populations. The authenticity of the sample was demonstrated through the analysis of the maternal and paternal lineages and of the autosomal data. In addition, the autosomal analysis was able to distinguish the sample from the other continental populations and showed that it is close to the Eastern Asian populations. Another important finding was the indication of a shared history between our sample and the Aboriginal Australian, although no firm conclusion can be drawn. Although the findings do not rule out the classical model of the settlement of the Americas, they bring new insights to the understanding of human population history.

Materials and Methods

Ethics Statement

Ethical consent was obtained according to the Helsinki Declaration. Ethical approval was obtained from the Brazilian National Committee on Research Ethics (CONEP- Parecer No 1062/2006). A signed informed consent was obtained from the village leaders since most subjects were not literate in Portuguese, but all subjects provided verbal assent to participate. A FUNAI/FUNASA health agent, who helped explain the aims and scope of the study to individuals, accompanied all activities.

Preparation of the Sample and Sequencing Library

The genomic DNA was obtained from a healthy male individual from an Amazonian tribe. Over the years, our group has gathered a pool of samples from several Amazonian tribes. To guarantee the best ethical conduct regarding our samples and their tribes, the studied sample was selected randomly from our pool of male samples together with Roewer's Amazon samples (recently investigated in Roewer et al. [35]). From the genomic DNA sample, a 1,000-bp insert mate-pair library was prepared. The library was sequenced using the SOLiD v.4 Plus platform (Life Technologies, CA, USA) according to the manufacturer's protocol. For the purpose of the present study, two runs were performed, each on a full slide.

Ultra-Deep Sequencing of SOLiD v.4 Plus and Mapping

The SOLiD platform generated thousands of reads with a length of 50 bp. The results were transferred to the processing server, where both runs were aligned to the reference human genome (NCBI Genome Reference build 37 – HG19/GRCh37 Feb. 2009) using the Bioscope™ v.1.3 platform (Life Technologies, CA, USA) following the manufacturer's recommended protocol. After the alignment, the output of the Bioscope software was converted to the BAM format, and the reads were paired.

Error Rate

The sequencing error rate was estimated using the data from chromosomes X and Y, as suggested by Fujimoto et al. [11]. The regions that were classified by RepeatMasker [45] as containing pseudo-autosomal contigs, short repeats, and repeat sequences were not considered. The predominant base call at each position was considered correct, and the other calls were considered erroneous.
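The estimation rule described above (predominant call correct, all other calls erroneous) can be sketched as follows. The input format, a list of per-position base counts from the haploid X and Y regions of a male sample, and the function name are assumptions made for illustration; the masking of repeat and pseudo-autosomal regions is presumed to have been done beforehand.

```python
def estimate_error_rate(pileup_counts):
    """Estimate the miscall rate from per-position base counts.

    pileup_counts: iterable of dicts mapping a base to the number of reads
    supporting it at one position of a haploid region. The predominant base
    is taken as correct and every other call is counted as an error.
    """
    total_calls = 0
    error_calls = 0
    for counts in pileup_counts:
        depth = sum(counts.values())
        if depth == 0:
            continue  # an uncovered position carries no information
        total_calls += depth
        error_calls += depth - max(counts.values())
    if total_calls == 0:
        raise ValueError("no covered positions")
    return error_calls / total_calls


# Two positions, 20 calls in total, one of which disagrees with the majority.
example = [{"A": 9, "C": 1}, {"G": 10}]
print(estimate_error_rate(example))  # 0.05
```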

Polymorphism call

The polymorphisms were called using the mpileup software of the SAMtools v.0.1.17 package [46] and the Bioscope v.1.3 platform. The following criteria were applied for the detection of the high-quality SNPs: (1) the mapping quality of the reads (MAPQ) should be greater than 25 (Phred scale), (2) the SNP coverage should be equal to or greater than 10×, and (3) a 10-bp window should contain no more than three polymorphisms. The zygosity of each polymorphism was determined according to the filtered total and variant coverage. A variant was called heterozygous when the ratio of variant coverage (variant/total) was less than 0.66 and mutant homozygous when the ratio was above 0.66. The large INDEL calling analyzed the read pairs and identified those that exhibited significant changes in the average distance between pairs; the results were filtered according to the following criteria: (1) minimum coverage of 3×, (2) pairing quality equal to or greater than 25 (Phred scale), (3) mapping with a minimum size of 30 bp, and (4) maximum coverage of 10,000×. The small INDEL calling analyzed the small gaps in the alignment of the reads, and the results were filtered according to the following criteria: (1) the polymorphism was supported by at least two pieces of evidence and (2) the quality of the best mapping among the reads should be equal to or greater than five. All of the detected polymorphisms were compared to the polymorphisms described in the dbSNP build 135 database [47] to identify the rsID (dbSNP identification code) and genomic loci.
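The SNP filtering criteria and the zygosity rule described above can be expressed compactly. The sketch below is a hypothetical restatement of those thresholds, not the SAMtools or Bioscope implementation; the function names are illustrative, and the treatment of a ratio of exactly 0.66 (called homozygous here) is an assumption, since the text does not specify it.

```python
def passes_snp_filters(mapq, coverage, variants_in_window):
    """Apply the three high-quality SNP criteria listed in the text.

    mapq: mapping quality of the supporting reads (Phred scale)
    coverage: read depth at the SNP position
    variants_in_window: polymorphisms found in the surrounding 10-bp window
    """
    return mapq > 25 and coverage >= 10 and variants_in_window <= 3


def call_zygosity(variant_coverage, total_coverage, threshold=0.66):
    """Call zygosity from the ratio of variant to total coverage."""
    if total_coverage <= 0:
        raise ValueError("total coverage must be positive")
    ratio = variant_coverage / total_coverage
    # Assumption: a ratio exactly equal to the threshold is called homozygous.
    return "heterozygous" if ratio < threshold else "homozygous"


print(passes_snp_filters(mapq=30, coverage=12, variants_in_window=1))  # True
print(call_zygosity(variant_coverage=6, total_coverage=12))   # heterozygous
print(call_zygosity(variant_coverage=11, total_coverage=12))  # homozygous
```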

Maternal and Paternal lineage analysis

The sample ancestry was inferred by studying the maternal and paternal lineages. Both lineages were based on motifs (of mitochondrial and Y DNA, respectively) that characterize ethnic groups. The mitochondrial mutations were identified by aligning the mitochondrial reads to the Andrews reference sequence (rCRS, revised Cambridge Reference Sequence) [48], [49]. The haplogroup was determined according to the PhyloTree build 15 [26] classification and other papers [21], [25], [27]–[30]. The Y-DNA haplogroup was determined according to the binary classification tree of Karafet et al. [31].

Population-based Comparisons

To further investigate our sample, two datasets of bi-allelic SNP genotype data were collected. The first dataset (dataset A) consists of the genotype data of 220 HapMap samples and of our sample for 2,955 SNPs that were genotyped in all HapMap samples and identified as high quality in our sample. The 220 samples were selected randomly, including 20 representatives of each population: Japanese in Tokyo, Japan (JPT); Han Chinese in Beijing, China (CHB); Chinese in Metropolitan Denver, Colorado (CHD); Gujarati Indians in Houston, Texas (GIH); Yoruba in Ibadan, Nigeria (YRI); African ancestry in Southwest USA (ASW); Maasai in Kinyawa, Kenya (MKK); Luhya in Webuye, Kenya (LWK); Utah residents with Northern and Western European ancestry from the CEPH collection (CEU); Tuscan in Italy (TSI); and Mexican ancestry in Los Angeles, California (MEX).

The second dataset (dataset B) consists of the genotype data of 1,092 samples from the 1,000 Genomes Project phase 1 [5], of the Aboriginal Australian [12], and of our sample for 13,177 SNPs. These SNPs were identified as high quality in the studied sample and were genotyped in 90% of the 1,000 Genomes Project samples and in the Aboriginal Australian. This dataset included the following numbers of individuals per population: 89 GBR (British individuals from England and Scotland); 93 FIN (HapMap Finnish individuals from Finland); 100 CHS (Han Chinese South); 55 PUR (Puerto Rican in Puerto Rico); 60 CLM (Colombian in Medellin, Colombia); 14 IBS (Iberian populations in Spain); 85 CEU; 88 YRI; 88 CHB; 89 JPT; 97 LWK; 61 ASW; 66 MXL; and 98 TSI.
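The genotyping-rate criterion used to select the dataset B SNPs can be illustrated with a small filter. This sketch assumes that the criterion means "genotyped in at least 90% of the samples" and that missing genotypes are represented as `None`; the function name and data layout are hypothetical and do not reflect the original pipeline.

```python
def snps_passing_call_rate(genotypes, min_call_rate=0.90):
    """Return the SNP identifiers genotyped in enough samples.

    genotypes: dict mapping a SNP identifier to a list with one genotype
    per sample, where None marks a missing genotype.
    """
    kept = []
    for snp_id, calls in genotypes.items():
        if not calls:
            continue  # no samples at all, nothing to evaluate
        called = sum(g is not None for g in calls)
        if called / len(calls) >= min_call_rate:
            kept.append(snp_id)
    return kept


# Hypothetical genotypes for ten samples at two SNPs.
toy = {
    "snp_a": ["AA"] * 9 + [None],       # 90% genotyped, kept
    "snp_b": ["AG"] * 8 + [None, None],  # 80% genotyped, dropped
}
print(snps_passing_call_rate(toy))  # ['snp_a']
```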

To infer the genetic distance, the FST was calculated using the Arlequin v.3.5 software [50], and the differences were visualized with a discriminant analysis of the principal components (DAPC), included in the adegenet v.1.3-1 library [51] of the R statistical package [52], using dataset A. DAPC was chosen over the conventional principal component analysis (PCA) because it maximizes the differences among groups, which results in a more accurate representation of the differences between the populations. The test of “treeness” for 3 populations (f3 statistic [53]) was also performed with the Threepop v0.1 software (included in the treemix package [54]) using dataset B to evaluate the sample admixture. The populations included in this dataset were grouped according to their continent as: Asian (CHB, CHS, and JPT); European (GBR, TSI, IBS, CEU, and FIN); African (ASW, MKK, LWK, and YRI); New Americans (MXL, PUR, and CLM); Australian Aboriginal; and South Native American. The f3 statistic tests whether population 1 is admixed with respect to the other two populations (formally written as f3(X1; X2, X3)); significantly negative f3 values indicate that population 1 is admixed.
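The idea behind the f3 statistic can be shown with a naive estimator: for allele frequencies x1, x2 and x3 at each SNP, f3(X1; X2, X3) is the average of (x1 − x2)(x1 − x3). The sketch below is illustrative only; the function name is hypothetical, and the finite-sample corrections and the significance testing performed by the Threepop software are not reproduced here.

```python
def f3_statistic(freqs_1, freqs_2, freqs_3):
    """Naive f3(X1; X2, X3): mean over SNPs of (x1 - x2) * (x1 - x3)."""
    n = len(freqs_1)
    if n == 0 or len(freqs_2) != n or len(freqs_3) != n:
        raise ValueError("need three non-empty frequency lists of equal length")
    products = (
        (x1 - x2) * (x1 - x3)
        for x1, x2, x3 in zip(freqs_1, freqs_2, freqs_3)
    )
    return sum(products) / n


# A target whose frequencies lie between the two sources gives a negative value.
print(f3_statistic([0.5, 0.5], [0.0, 1.0], [1.0, 0.0]))  # -0.25
```

A negative value arises when the target population's allele frequencies tend to fall between those of the two reference populations, which is the signature of admixture that the test looks for.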

The admixture analyses were performed using the Structure v. 2.3.4 software [55] with 150,000 learning repetitions. Dataset A was analyzed with four to seven subpopulations and dataset B with two to nine subpopulations.

Supporting Information

Figure S1.

Venn diagram of the shared SNPs among the Amerindian sample and the CEU, YRI, JPT, and CHB populations. The Venn diagram represents the total set of SNPs that were found in the Amerindian sample and those SNPs that are shared with the genotyped samples of the CEU (residents of Utah, USA with Northern and Western European ancestry from the CEPH collection), YRI (Yoruba in Ibadan, Nigeria), CHB (Han Chinese in Beijing, China), and JPT (Japanese in Tokyo, Japan) populations. Due to the high similarity between the genotypes of the CHB and JPT populations, their representatives were grouped into a single set (CHB+JPT) that represents the mutations that are shared between the Amerindian sample and the CHB and JPT populations.

https://doi.org/10.1371/journal.pone.0083340.s001

(EPS)

Figure S2.

Complete population genetic structure analysis of the 1,000 Genomes Project, Aboriginal Australian and Native South American genotype dataset. The diagram of genetic contribution was obtained using the Structure software for models of 2 to 9 subpopulations. The populations were grouped and labeled as follows, according to their major continental ancestry: IND (Native South American individual, this work); ABO (Aboriginal Australian individual, Rasmussen et al. 2011); Eastern Asian (CHS, CHB and JPT); Europeans (CEU, IBS, TSI, FIN and GBR); African (ASW, LWK, MKK and YRI); and New Americans (CLM, PUR and MXL). The plot represents the rate of contribution from each subpopulation (colors) to the samples (x axis).

https://doi.org/10.1371/journal.pone.0083340.s002

(TIF)

Table S1.

Total high-confidence SNPs detected. A complete list of all high-confidence SNPs detected in the sample, including their chromosomal position relative to the hg19 reference genome, the reference and mutant bases observed with their respective coverage, and the dbSNP identification.

https://doi.org/10.1371/journal.pone.0083340.s003

(ZIP)

Acknowledgments

The authors thank Lutz Roewer for contributing to our sample pool.

Author Contributions

Conceived and designed the experiments: ARdS AS SJS SS WASJ. Performed the experiments: ARdS AS DOA MSB. Analyzed the data: AMRdS JESS RA SJS SD WASJ. Contributed reagents/materials/analysis tools: ARdS AS SS. Wrote the paper: AMRdS ARdS LG SS SD.

References

  1. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, et al. (2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456: 53–59
  2. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, et al. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376–380
  3. McKernan KJ, Peckham HE, Costa GL, McLaughlin SF, Fu Y, et al. (2009) Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res 19: 1527–1541
  4. Gibbs RA, Belmont JW, Hardenbol P, Willis TD, Yu F (2003) The International HapMap Project. Nature 426: 789–796
  5. The 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073
  6. Ahn SM, Kim TH, Lee S, Kim DS, Ghang H, et al. (2009) The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome Res 19: 1622–1629
  7. Kim J-I, Ju YS, Park H, Kim S, Lee S, et al. (2009) A highly annotated whole-genome sequence of a Korean individual. Nature 460: 1011–1015
  8. Tong P, Prendergast JGD, Lohan AJ, Farrington SM, Cronin S, et al. (2010) Sequencing and analysis of an Irish human genome. Genome Biol 11: R91
  9. Schuster SC, Miller W, Ratan A, Tomsho LP, Giardine B, et al. (2010) Complete Khoisan and Bantu genomes from southern Africa. Nature 463: 943–947
  10. Shapiro B, Hofreiter M (2010) Analysis of ancient human genomes. BioEssays 32: 388–391
  11. Fujimoto A, Nakagawa H, Hosono N, Nakano K, Abe T, et al. (2010) Whole-genome sequencing and comprehensive variant analysis of a Japanese individual using massively parallel sequencing. Nat Genet 42: 931–936
  12. Rasmussen M, Guo X, Wang Y, Lohmueller KE, Rasmussen S, et al. (2011) An Aboriginal Australian genome reveals separate human dispersals into Asia. Science 334: 94–98
  13. Patowary A, Purkanti R, Singh M, Chauhan RK, Bhartiya D, et al. (2012) Systematic analysis and functional annotation of variations in the genome of an Indian individual. Hum Mutat 33: 1133–1140
  14. Gupta R, Ratan A, Rajesh C, Chen R, Kim HL, et al. (2012) Sequencing and analysis of a South Asian-Indian personal genome. BMC Genomics 13: 440
  15. Wang J, Wang W, Li R, Li Y, Tian G, et al. (2008) The diploid genome sequence of an Asian individual. Nature 456: 60–65
  16. Gonzaga-Jauregui C, Lupski JR, Gibbs RA (2012) Human Genome Sequencing in Health and Disease. Annu Rev Med 63: 35–61
  17. Burbano HA, Hodges E, Green RE, Briggs AW, Krause J, et al. (2010) Targeted investigation of the Neandertal genome by array-based sequence capture. Science 328: 723–725
  18. Fagundes NJR, Kanitz R, Bonatto SL (2008) A reevaluation of the Native American mtDNA genome diversity and its bearing on the models of early colonization of Beringia. PLoS One 3: e3157
  19. Green RE, Krause J, Briggs AW, Maricic T, Stenzel U, et al. (2010) A draft sequence of the Neandertal genome. Science 328: 710–722
  20. 20. O'Rourke DH, Raff JA (2010) The human genetic history of the Americas: the final frontier. Curr Biol 20: R202–7
  21. 21. Forster P, Harding R, Torroni A, Bandelt HJ (1996) Origin and evolution of Native American mtDNA variation: a reappraisal. Am J Hum Genet 59: 935–945.
  22. 22. Rothhammer F, Dillehay TD (2009) The late Pleistocene colonization of South America: An interdisciplinary perspective. Ann Hum Genet 73: 540–549.
  23. 23. Reich D, Patterson N, Campbell D, Tandon A, Mazieres S, et al. (2012) Reconstructing Native American population history. Nature 488: 370–374
  24. 24. Bisso-Machado R, Bortolini MC, Salzano FM (2012) Uniparental genetic markers in South Amerindians. Genet Mol Biol 35: 365–387
  25. 25. Silva Jr WA, Bonatto SL, Holanda AJ, Ribeiro-dos-Santos AK, Paixão BM, et al. (2002) Mitochondrial Genome Diversity of Native Americans Supports a Single Early Entry of Founder Populations into America. Am J Hum Genet 71: 187–192.
  26. 26. Van Oven M, Kayser M (2009) Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Hum Mutat 30: E386–94
  27. 27. Smith DG, Malhi RS, Eshleman J, Lorenz JG, Kaestle FA (1999) Distribution of mtDNA haplogroup X among Native North Americans. Am J Phys Anthropol 110: 271–284
  28. 28. Kong Q-P, Bandelt H-J, Sun C, Yao Y-G, Salas A, et al. (2006) Updating the East Asian mtDNA phylogeny: a prerequisite for the identification of pathogenic mutations. Hum Mol Genet 15: 2076–2086
  29. 29. Starikovskaya EB, Sukernik RI, Derbeneva OA, Volodko N V, Ruiz-Pesini E, et al. (2005) Mitochondrial DNA diversity in indigenous populations of the southern extent of Siberia, and the origins of Native American haplogroups. Ann Hum Genet 69: 67–89
  30. 30. Ribeiro-Dos-Santos AKC, Santos SEB, L MA, Guapindaia V, Zago MA (1996) Heterogeneity of mitochondrial DNA haplotypes in Pre-Columbian natives of the Amazon region. Am J Phys Anthropol 101: 29–37
  31. 31. Karafet TM, Mendez FL, Meilerman MB, Underhill PA, Zegura SL, et al. (2008) New binary polymorphisms reshape and increase resolution of the human Y chromosomal haplogroup tree. Genome Res 18: 830–838
  32. 32. Santos SE, Ribeiro-dos-Santos AK, Meyer D, Zago MA (1996) Multiple founder haplotypes of mitochondrial DNA in Amerindians revealed by RFLP and sequencing. Ann Hum Genet 60: 305–319.
  33. 33. Rodriguez-Delfin LA, Rubin-de-Celis VE, Zago MA (2001) Genetic diversity in an Andean population from Peru and regional migration patterns of Amerindians in South America: data from Y chromosome and mitochondrial DNA. Hum Hered 51: 97–106
  34. 34. Zegura SL, Karafet TM, Zhivotovsky LA, Hammer MF (2004) High-resolution SNPs and microsatellite haplotypes point to a single, recent entry of Native American Y chromosomes into the Americas. Mol Biol Evol 21: 164–175
  35. 35. Roewer L, Nothnagel M, Gusmão L, Gomes V, González M, et al. (2013) Continent-wide decoupling of Y-chromosomal genetic variation from language and geography in native South Americans. PLoS Genet 9: e1003460
  36. 36. Bailliet G, Ramallo V, Muzzio M, García A, Santos MR, et al. (2009) Brief communication: Restricted geographic distribution for Y-Q* paragroup in South America. Am J Phys Anthropol 140: 578–582
  37. 37. Toscanini U, Gusmão L, Berardi G, Gomes V, Amorim A, et al. (2011) Male lineages in South American native groups: evidence of M19 traveling south. Am J Phys Anthropol 146: 188–196
  38. 38. Bortolini MC, Salzano FM, Thomas MG, Stuart S, Nasanen SPK, et al. (2003) Y-chromosome evidence for differing ancient demographic histories in the Americas. Am J Hum Genet 73: 524–539
  39. 39. Demarchi DA, Mitchell RJ (2004) Genetic structure and gene flow in Gran Chaco populations of Argentina: evidence from Y-chromosome markers. Hum Biol 76: 413–429.
  40. 40. Sala A, Argüelles CF, Marino ME, Bobillo C, Fenocchio A, et al. (2010) Genetic analysis of six communities of Mbyá-Guaraní inhabiting northeastern Argentina by means of nuclear and mitochondrial polymorphic markers. Hum Biol 82: 433–456
  41. 41. Blanco-Verea A, Jaime JC, Brión M, Carracedo A (2010) Y-chromosome lineages in native South American population. Forensic Sci Int Genet 4: 187–193
  42. 42. Dos Santos SEB, Rodrigues JD, Ribeiro-dos-Santos AK, Zago MA (1999) Differential contribution of indigenous men and women to the formation of an urban population in the Amazon region as revealed by mtDNA and Y-DNA. Am J Phys Anthropol 109: 175–180
  43. 43. Gonçalves VF, Stenderup J, Rodrigues-Carvalho C, Silva HP, Gonçalves-Dornelas H, et al. (2013) Identification of Polynesian mtDNA haplogroups in remains of Botocudo Amerindians from Brazil. Proc Natl Acad Sci U S A 110: 6465–6469
  44. 44. Balter M (2008) Archaeology. Ancient algae suggest sea route for first Americans. Science 320: 729
  45. 45. Smit AFA, Hubley R, Green P. (n.d.) RepeatMasker Open-3.0.
  46. 46. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078–2079
  47. 47. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, et al. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29: 308–311.
  48. 48. Anderson S, Bankier AT, Barrell BG, de Bruijn MH, Coulson AR, et al. (1981) Sequence and organization of the human mitochondrial genome. Nature 290: 457–465.
  49. 49. Andrews RM, Kubacka I, Chinnery PF, Lightowlers RN, Turnbull DM, et al. (1999) Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nat Genet 23: 147
  50. 50. Excoffier L, Lischer HEL (2010) Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Mol Ecol Resour 10: 564–567
  51. 51. Jombart T, Ahmed I (2011) adegenet 1.3-1: new tools for the analysis of genome-wide SNP data. Bioinformatics 27: 3070–3071
  52. 52. R Core Team (2013) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
  53. 53. Reich D, Thangaraj K, Patterson N, Price AL, Singh L (2009) Reconstructing Indian population history. Nature 461: 489–494
  54. 54. Pickrell JK, Pritchard JK (2012) Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet 8: e1002967
  55. 55. Falush D, Stephens M, Pritchard JK (2007) Inference of population structure using multilocus genotype data: dominant markers and null alleles. Mol Ecol Notes 7: 574–578