A Novel Candidate Region for Genetic Adaptation to High Altitude in Andean Populations

Humans living at high altitude (≥2,500 meters above sea level) have acquired unique abilities to survive the associated extreme environmental conditions, including hypoxia, cold temperature, limited food availability and high levels of free radicals and oxidants. Long-term inhabitants of the most elevated regions of the world have undergone extensive physiological and/or genetic changes, particularly in the regulation of respiration and circulation, when compared to lowland populations. Genome scans have identified candidate genes involved in altitude adaption in the Tibetan Plateau and the Ethiopian highlands, in contrast to populations from the Andes, which have not been as intensively investigated. In the present study, we focused on three indigenous populations from Bolivia: two groups of Andean natives, Aymara and Quechua, and the low-altitude control group of Guarani from the Gran Chaco lowlands. Using pooled samples, we identified a number of SNPs exhibiting large allele frequency differences over 900,000 genotyped SNPs. A region in chromosome 10 (within the cytogenetic bands q22.3 and q23.1) was significantly differentiated between highland and lowland groups. We resequenced ~1.5 Mb surrounding the candidate region and identified strong signals of positive selection in the highland populations. A composite of multiple signals like test localized the signal to FAM213A and a related enhancer; the product of this gene acts as an antioxidant to lower oxidative stress and may help to maintain bone mass. The results suggest that positive selection on the enhancer might increase the expression of this antioxidant, and thereby prevent oxidative damage. In addition, the most significant signal in a relative extended haplotype homozygosity analysis was localized to the SFTPD gene, which encodes a surfactant pulmonary-associated protein involved in normal respiration and innate host defense. Our study thus identifies two novel candidate genes and associated pathways that may be involved in high-altitude adaptation in Andean populations.


Introduction
It is generally accepted that anatomically modern humans emerged in Africa and radiated from there to colonize most of the world's land masses [1]. During this "out of Africa" diaspora, modern humans encountered new habitats with a very diverse set of ecological conditions in contrast to the African homeland, e.g., in the form of new geographic environments, climates, diets and/or pathogens. Humans adapted successfully to these new conditions both culturally and biologically, the latter involving physiological acclimatization and/or genetic adaptation. One of the extreme habitats successfully colonized by humans is high altitude.
The main environmental stresses of living in elevated plateaus and mountainous regions are the decrease of temperature and humidity, the increase in solar electromagnetic radiation, and hypobaric hypoxia (defined as the decrease in oxygen intake for metabolic processes due to reduced barometric pressure) [2,3]. Although the term "high altitude" has no precise definition, it is generally taken to refer to regions that are 2,500-3,000 meters (m) or more above sea level, as the majority of newcomers arriving in such regions present certain clinical, physiological, anatomical and biochemical changes (reviewed in [3]). Moreover, some populations of humans have developed a physique that enables permanent habitation of high-altitude regions despite the severe conditions of hypoxia and other environmental stressors.
Three main high-altitude regions of the world have supported relatively large human populations for millennia: the Ethiopian highlands of the Semien Mountains [4,5], the Tibetan Plateau and Himalayan valleys [6,7], and the Andes of South America [2,8,9]. In order to overcome hypobaric hypoxia, the human body needs to adjust the cascade of metabolic processes for oxygen uptake and utilization. However, there is no universal pattern of response to hypoxia. People living in each of the above mentioned regions exhibit diverse respiratory, circulatory, hematological, and even pathological patterns of acclimatization and/or adaptation. For example, there is a relatively low hemoglobin concentration in Ethiopian and Tibetan highlanders as opposed to Andean or European populations living at high altitude [4,10,11]. Tibetans, similar to sojourners, display higher hypoxic ventilatory response (HVR) which results in increased ventilation compared with Andeans [12,13]. Furthermore, chronic mountain sickness (CMS), a disease defined as loss of adaptation to altitude [14], is common in the Andes, occasionally found in the Himalayas, and absent from the Ethiopian highlands [5,15]. CMS has a strong familial component, and it has been also noted that in Bolivian Andeans CMS is predominant in males of mixed or entirely European genetic background [16,17]. Moreover, having been born and raised within multigenerational high-altitude residence families appears to confer a substantial advantage in survival and performance at high-altitude environments (reviewed in [18]). This is in accordance with expectations that distinctive traits between high and low-altitude natives (or between different high-altitude native groups) may reflect genetic adaptations resulting from natural selection.
The characteristic morphology and physiology of high-altitude natives, in particular Tibetans and Andeans, has been studied in detail [11], enabling the identification of underlying candidate genes or groups of genes (e.g., [19]), such as the hypoxia-inducible transcription (HIF) pathway, renin-angiotensin system (RAS), and nitric oxide synthases (NOSs) [20][21][22][23][24][25][26]. However, because of the limited genomic scope of many candidate-gene studies, functionally relevant variation may have been overlooked. Moreover, the use of rather dissimilar lowland population outgroups, for example comparing Andeans with controls of European or indigenous North American genetic background (e.g., [20,21]), lessens statistical power. More recently, studies applying genome-wide scans have independently identified genes whose products participate in the HIF pathway (e.g. EGLN1 and EPAS1), and represent strong targets of selection at high altitude, especially for the Tibetan population [27][28][29][30][31][32]. Besides the HIF pathway, two genes involved in heart performance (VEGFB and ELTD1) have also been implicated in the elevated hematocrit characteristic of high-altitude populations in the Andes [33], while new candidategene sets have been proposed in Ethiopians as well [34][35][36].
In the present study, we focused on three indigenous populations from Bolivia, namely two groups of native inhabitants living at high altitude (!3600 meters above sea level) in the Andes, Aymara and Quechua, and as a control group the Guarani from the Gran Chaco lowlands. Using a genome-scan approach and pooled population samples, we identified a candidate region in chromosome 10 that exhibited significant allele frequency differences between the high-and lowland populations. Targeted resequencing of approximately 1.5 Mb of the region of interest revealed strong signals of positive selection in both Andean groups, suggesting that this genomic region harbors genes for high-altitude adaptation in these populations.

Results
We collected saliva samples for the extraction of genomic DNA from 55 (24 males / 31 females) Aymara (AYM) and 21 (18 males / 3 females) Quechua (QUE) from locations above 3,600 meters, and 23 (14 males / 9 females) Guarani (GUA) from the lowlands (see Materials and Methods) in Bolivia (Fig 1). Samples were obtained from healthy individuals with no known medical record or indication of CMS or any other altitude disease. All sampled individuals were unrelated and of self-identified ancestry as Aymara, Quechua or Guarani. Given the high historical rates of post-Columbian colonization male-mediated admixture into Native American communities [37], we performed determination of Y chromosome haplogroups for confirmation of the donor's ethnicity in all of the male samples. Indeed, the most common haplogroup found in our collection of Bolivian males was Q (predominant branch of the Y phylogeny observed in modern-day Amerindians of Central and South America), at frequencies of roughly 80% in all three groups (S1 Fig).

Pooled DNA Microarray genotyping and estimation of allele frequencies
To search for highly-differentiated genic regions among the high-and lowland groups, and therefore putatively involved in altitude adaptation, we pooled DNA samples from each population independently in triplicate (see Materials and Methods), genotyped each pool on the Affymetrix Genome-Wide Human SNP Array 6.0, and estimated allele frequency differences between each pair of populations based on the intensity of the hybridization signals. This approach has been used in several studies [38][39][40][41]. On average, the SNP call rate was 87.5% for the Aymara pools, 85.1% for the Quechua pools, and 88.2% for the Guarani pools. These rates are comparable to previous results with the Affymetrix platform for pooled DNA (e.g., [39]) or even with DNA from single samples (e.g., [42]).
We used frequency estimates for every called SNP obtained from the microarray experiments to perform pairwise comparisons among the populations, and we identified candidate regions that contained highly differentiated SNPs (see Materials and Methods). The tests were done between all pairs of the three groups as well as between the highlander group (combining Aymara and Quechua, from here on referred to as HL) and the lowlander group (Guarani, from here on LL). Aymara and Quechua share similar environments and lifestyles, and previous studies have found that they are genetically similar as well [33,43,44]. Since the main goal of this study was to investigate loci related to high-altitude adaptation, the following results and discussions will be mainly focused on the HL-LL comparison, as in previous studies [24,32], unless otherwise specified.
Based on the HL-LL comparison, we detected 9 candidate regions containing SNPs exhibiting large allele frequency differences (! 0.3); these regions were also supported by a t-test (Fig 2). In particular, a region on chromosome 10 (81.7~82.2 Mb) was enriched for several differentiated SNPs. In total, we found 56 SNPs having estimated allele frequency differences above 0.25, with four of them above 0.4 (Fig 2).

Validation of SNPs with large allele frequency differences
We selected the 14 SNPs with the highest estimated allele frequency differences in the 9 detected candidate regions for individual genotyping validation, with an emphasis on the chr10:81.7~82.2 Mb region (with 6 SNPs); the individually genotyped SNPs are listed in  Table 1 and also indicated in Fig 2. Each of the 14 SNPs was typed in the full set of individuals, not just those used in constructing the pools (see Materials and Methods). Observed allele frequencies were calculated by genotype counting and we performed a Fisher's exact test for significant differences in allele frequencies, correcting for multiple testing. We used a very conservative cutoff of 0.5 x 10 -8 , assuming 1 million random markers and a single test level of

Signals of population differentiation and positive selection
As the region on chromosome 10 (spanning approximately 500 kb from 81.7 to 82.2 Mb) contained several SNPs exhibiting significant allele-frequency differences between high-and lowland populations (Table 1), we investigated this signal in more detail. We performed targeted resequencing of a 1. This suggests that Aymara and Quechua may have shared very similar demographic/selection events, and that the extreme differentiation between the highlanders and Guarani in this region is unlikely to be accounted for solely by neutral genetic drift.
To formally test the latter hypothesis, we computed neutral simulations assuming several different demographic scenarios. Previous studies showed that the peopling of the Americas occurred more than 15,000 years ago through Beringia [46][47][48], with the initial colonization of the Andes around 11,000 years ago [6]. An admixture graph of Native American populations suggested that Guarani are descendants of Amazonia ancestry, which separated from high-altitude populations before the latter occupied the Andes [49]. To simplify the model, we set the divergence time between high and low-altitude populations at 10,000 years ago, and the divergence of the two high-altitude populations at 5,000 years ago. Population size estimates were obtained by using the demographic trajectory of Mexicans based on 1000 Genomes data [50] (red line in S3 Fig) as a surrogate for the common ancestral history of Quechua, Aymara and Guarani (see Materials and Methods). Given the much smaller recent population sizes of these three groups compared to Mexicans, we set constant population sizes of 9,000 after divergence for Quechua, Aymara and Guarani, instead of the sharp growth in the Mexicans. This scenario is referred to as the standard demographic model (green line in S3 Fig None of these neutral models could account for the strong divergence observed between Guarani and the two highlander groups in the chromosome 10 candidate region. The significant divergence signals strongly support the occurrence of positive selection, either in HL, LL, or in both groups. Table 2  We also calculated Tajima's D [51] and Fay and Wu's H [52] for the resequenced data (see Materials and Methods). Interestingly, in both Aymara and Quechua, when compared to the standard demographic model, Tajima Since the genic region chr10:82.0~82.3 Mb in Aymara and Quechua hosts the strongest and most consistent signals of selection, and the genetic profiles in this region are highly similar between Aymara and Quechua as shown in previous studies [33,43,44], we carried out in depth analyses of positive selection on the merged HL data. First, a composite of multiple signals like (CMSL) test was constructed based on six different selection tests: XP-CLR and iHS both exhibit the highest signals within the 82.0~82.3 Mb interval although the iHS peak locates upstream from that of XP-CLR (Fig 3E and 3F). The maximum XP-CLR value is 46.99 (P value = 9.01 x 10 -4 ) and maximum |iHS| is 3.144 (P value = 0.00732, Fig 3E  and 3F). When all these signals are combined together to derive a summary CMSL score (see Materials and Methods), the empirical CMSL scores are highly significant compared to the neutral CMSL distribution obtained from the simulations under the standard demographic The CMSL scores narrow the signal to a~57 kb region that contains one protein coding gene, FAM213A, and an enhancer that significantly influence the expression of this gene [56]. The peak region of CMSL is similar under all models (Fig 3G and S5 Fig), and the enhancer (chr10:82176099-82176325) is close to the highest CMSL signal (chr10:82174949).
Another test commonly used to identify candidate regions of positive selection is the relative extended haplotype homozygosity (REHH) test, which is based on the principle of long extended haplotypes [57]. We scanned the entire candidate region and found widespread REHH signals ( Fig 4A). The strongest signal occurred between position chr10:81699238 and chr10:81701722, within the SFTPD gene, where a major core haplotype with a frequency of 52.6% (haplotype 1 in Fig 4B) decays much slower than the other two haplotypes (haplotype 2 and 3 in Fig 4B). The P value for the observed excessive EHH of haplotype 1 is 8.4 x 10 -10 , indicating a strong signal of positive selection.

Discussion
Nowadays, it is estimated that worldwide some 140 million [58] people reside permanently at an altitude of 2,500 meters or more above sea level, and that countless others sojourn to high plateaus and mountainous regions for leisure or professional activities. The physiology of humans living at high altitude has been the subject of over a century of research, especially in Tibetan, Ethiopian and Andean populations which have acquired long-term physiological, anatomical, and biochemical responses to high-altitude environmental stress when compared to lowland inhabitants. Recent advances in genomic technologies are providing opportunities to explore the genetic basis of their adaptive traits, particularly in the regulatory systems of respiration and circulation.
In the present study, we focused on three populations from Bolivia, namely two groups of native inhabitants of the Andes: Aymara and Quechua, and Guarani from the Gran Chaco lowlands as a neighboring control group. Special care was taken to obtain samples from members of long-term high-altitude residence families, avoiding the collection of recent immigrants. Moreover, we collected samples only from healthy individuals, especially with no known medical record or indication of CMS We performed a genome-wide scan of over 900,000 SNPs using microarray technology on pooled DNA samples (see Results), thus examining the genetic profile of each group in the search for large differences in allele frequencies among them. After validation of the individual genotypes for the variants exhibiting the largest allele frequency differences (on average more than 38%), we applied multiple-test corrections and detected a region in chromosome 10 harboring several SNPs that achieved statistical significance. Genotyping of pooled samples decreases the power to detect weaker signals of population differences; however the fact that we do detect a strong signal of population differentiation that is likely to be due to selection further substantiate the utility of pooled data in genome scan studies [38][39][40][41].
The region on chromosome 10 is a novel candidate region for high-altitude adaptation, which has not been detected in previous studies. We further verified that the signal of high differentiation between HL and LL groups for the chromosome 10 region was unlikely to arise by demographic events alone after carrying out simulations under various demographic models. In order to investigate in more detail the potential signal of selection in chromosome 10, we performed targeted resequencing of~1.5 Mb surrounding the region of interest. Table 3 lists all protein coding genes in the candidate region and all non-synonymous SNPs observed in the resequencing data; regulatory elements could also be the target of positive selection, and several enhancers are included in the candidate region (Fig 3G and S1 Table).
The strongest evidence of positive selection was in the region of 82.0~82.3 Mb in chromosome 10, where several genes (ANXA11, MAT1A, DYDC1, DYDC2, FAM213A, TSPAN14 and SH2D4B) are included in or near the borders of this region (Fig 3G). One non-synonymous SNP (rs1049550) was detected in the ANXA11 gene, is predicted to be 'damaging' by Polyphen [59], and showed a significant differentiation between highlanders and lowlanders. Strong associations have been repeatedly found between genetic polymorphisms of ANXA11 and sarcoidosis, a systemic immune disorder characterized by destructive, noncaseating epithelioid granulomatous lesions (i.e., nodules caused by inflammation that do not lead to cell death) [60][61][62]. It is most often located in the lung or associated lymph nodes. The sarcoidosis-associated SNPs are listed in Table 4. In addition, a genome wide association study of chronic obstructive pulmonary disease identified one SNP in an intron of ANXA11 [63]. The risk allele is rs6585424-G (Table 4, P value = 1 x 10 -10 ).
The FAM213A gene was localized by the CMSL test under all models of population history (Fig 3G and S5 Fig). Also known as PAMM: peroxiredoxin (PPX)-like 2 activated in M-CSF- stimulated monocytes [65], it has been shown that the expression of FAM213A can protect cells from oxidative stress and modulate osteoclast differentiation through inhibition of NF-κB and c-Jun activation, which may affect bone resorption and help to maintain bone mass [65]. Oxidative stress is one of the most detrimental effects of hypobaric hypoxia, which is caused by increased reactive oxygen species (ROS), reactive oxygen and nitrogen species (RONS), decreased antioxidants and reduction in pulmonary nitric oxide (NO) bioavailability (reviewed in [66]). Antioxidant supplementation has been shown to have beneficial effects and reduced the oxidative stress of some individuals [67]. The expression levels of antioxidants were upregulated in hypoxia tolerant rats [68], and also in sojourners after a high-altitude stay, even if not sufficient to ameliorate oxidative stress completely [69]. These studies suggest that antioxidants are quite important in protecting against oxidative stress, and adaptive effects on the antioxidant system could be influenced by genetic factors, which differ between highlanders and sojourners. Moreover, as a consequence of preventing oxidative damage, the expression of FAM213A could abolish osteoclast formation, resulting in the maintenance of bone mass. It is unclear if this function of FAM213A would be beneficial for high-altitude adaptation; however, studies have shown accelerated growth in lung volume and chest dimensions in highlanders vs. lowlanders [70][71][72], which might be a developmental compensatory response to high-altitude hypoxia [73].
In addition to the FAM213A gene itself, the target region of positive selection includes an enhancer of FAM213A. Two SNPs are located in this enhancer; one (rs77999529) exhibits a low minor allele frequency in various human populations, while the second (rs150230265) exhibits significant allele frequency differences between HL and LL (FST = 0.229, P value = 0.014, S1 Table). The global distribution of the allele frequencies of rs150230265 is shown in S6 Fig,  which suggests that the derived G allele is restricted to Native American populations. Moreover, the derived allele is at highest frequency (0.382) in HL, and hence could be considered a candidate mutation. The fact that the rs150230265 SNP does exist at low frequency in low-altitude Native American populations (but nowhere else) makes selection on standing variation a possibility, which would further reduce the signal of selection in tests for selective sweeps. Our results suggest that elevated expression of FAM213A by positive selection on the enhancer could help protect against oxidative damage in a hypoxia environment. The mutation and the enhancer could thus be novel candidates for further experimental studies and therapeutic targets.
Although FAM213A was detected as a candidate gene in the CMSL analysis, it was not identified by the REHH analysis. Instead, a different candidate gene in the resequenced region, the SFTPD gene, was identified by this analysis. These different results are not surprising, given that different methods have different power to detect selection, especially in the case of partial selective sweeps and/or selection on standing variation [74]. SFTPD encodes lung surfactant protein D (SP-D), which contributes to the lung's defense against inhaled microorganisms and may participate in the extracellular reorganization or turnover of pulmonary surfactant. Pulmonary surfactant in turn lowers the surface tension at the air-liquid interface in the alveoli of the mammalian lung and is essential for normal respiration. Given the low oxygen levels at high altitude, altering the surfactant surface tension could be beneficial. A genome-wide association study of chronic obstructive pulmonary disease identified two risk alleles in an intron of SFTPD: the G allele of rs3923564 (P value = 2 x 10 -27 ) and the T allele of rs7078012 (P value = 5 x 10 -9 ) [63]. Several non-synonymous SNPs in SFTPD were detected in our resequencing data ( Table 3). The rs3088308 SNP involves a serine to threonine substitution, was predicted to be 'damaging' by Polyphen, and exhibits significant differentiation between HL and LL (FST = 0.537, P value = 5.46 x 10 -5 ). However, the derived allele frequency is higher in LL than in HL. Another SNP (rs721917) involves a methionine to threonine substitution and exhibits significant differentiation between HL and LL (FST = 0.271, P value = 7.6 x 10 -3 ) with a higher frequency of the methionine-encoding allele in HL. This mutation has been investigated intensively and influences oligomerization, function, and the concentration of SP-D in serum [75]. The Thr/Thr genotype had significantly lower SP-D serum levels, and is associated with increased disease-susceptibility [76][77][78]. The Met allele was associated with defense to respiratory syncytial virus [76]. The third non-synonymous SNP is rs2243639 (Thr/Ala), which also showed significant differentiation between HL and LL (FST = 0.181, P value = 0.028).
In addition to SFTPD, there are two other genes coding for surfactant pulmonary-associated proteins (SFTPA1 and SFTPA2) which are within the genomic region resequenced, but outside the region showing the highest signals in the CMSL and REHH tests. Mutations in SFTPA1 and SFTPA2 are associated with idiopathic pulmonary fibrosis [79], and (along with SFTPD) play an essential role in surfactant homeostasis and in the defense against respiratory pathogens [80,81]. Given that these surfactant proteins play a role in both lung function and disease resistance, it is unclear which of these (or perhaps both) might be the driving force behind the signals of selection that we detect in the HL populations.
The novel candidate genes for high-altitude adaption identified here are in accordance with previous evidence that the functional adaptations of Andean, Tibetan, and Ethiopian natives to high altitude differ [11]. Andeans exhibit lower levels of resting ventilation, a more 'blunted' HVR, higher levels of pulmonary hypertension and an increased frequency of CMS. In Tibetans, the exhaled NO is elevated compare to Andean and lowlanders [82], which was associated with higher blood flow through the lung [83]. Similar hemoglobin phenotypes among Tibetan and Ethiopian highlanders associate with different genetic loci, and the variants at those loci are present in most populations regardless of altitude [84]. Overall, populations in different continents have adapted to high altitude through different adaptation processes as a result of convergent evolution [85,86].
A recent study showed that altitude adaptation in Tibetans may have arisen via introgression of Denisovan-like DNA [87]. Thus, modern humans could obtain genetic adaptations to local environments through admixture with other hominin species [88][89][90]. Native American populations migrated from Siberia, where admixture might have happened between ancestors of modern Asians and archaic humans (including Neanderthals and Denisovans) [91][92][93]. We therefore checked our sequence data and found no haplotype specifically shared with Denisovans in the region surrounding both FAM213A and SFTPD genes (S7 Fig). These results further support different routes to functional adaptation in Tibetan and Andean high-altitude natives [11].
In summary, we identified a novel candidate region for high-altitude adaptation in Andeans, with several genes and/or enhancers potentially under positive selection. In particular, multiple tests localized the signal to FAM213A and a related enhancer encoding an antioxidant to reduce oxidative stress, which might be beneficial for adaptation to high altitude in the Andes. However, further functional studies are needed to elucidate the role of this gene (as well as the other candidates) in high-altitude adaptation.

Sample collection and DNA extraction
We collected in total 99 saliva samples from South American indigenous individuals from Bolivia. The participants were informed about our study objectives and provided written consent for the anonymous use of the biological material for academic research. This research was approved by the Ethics Committee of the University of Leipzig Medical Faculty. All sampled individuals were unrelated and of self-identified ancestry as either Aymara, Quechua or Guarani.
They were members of long-term residence families from the places where samples were gathered; sample collection from recent immigrants was avoided. Furthermore, special care was taken to obtain samples only from healthy individuals, with no known medical record or indication of CMS or other altitude-related illness. The Aymara individuals were sampled in El Alto (N = 24, situated at 4,100 m altitude above sea level), Tiwanaku (N = 24, 3,885 m), and La Paz (N = 7, 3,600 m). The Quechua individuals were sampled in Oruro-Soracachi (N = 21, 3,750 m), and the Guarani individuals were sampled in Santa Cruz-Gran Chaco (N = 23, 416 m). Genomic DNA was extracted from the saliva samples following the protocol published elsewhere [94], and the fraction of endogenous (i.e., human) DNA present in the extracts was quantitated as described previously [94].  [95]. The 24 loci were typed and used for haplogroup assignment as described in [96].

DNA pooling and microarray genotyping
To search for candidate genomic regions of high differentiation between high and low-altitude Bolivian populations, we genotyped pooled samples on microarrays; this approach has been used successfully in other studies [38][39][40][41]. A total of nine equimolar DNA mixtures were constructed, consisting of one pool of 18 Aymara, one pool of 17 Quechua, and one pool of 18 Guarani samples; each pool was prepared independently in triplicate with the same individuals, thus resulting in three technical replicates for each pool. We selected the individual genomic extracts containing the highest fractions of endogenous DNA, with all selected extracts containing ! 30% human DNA [97]. Each individual sample contributed 100 ng human DNA to the mixture. Pooled DNA solutions were diluted to a working concentration of 50 ng/μl with ddH2O. Affymetrix Reference Genomic DNA 103 was used as a positive control for the microarray experiments. Genotyping was performed using the Affymetrix Genome-Wide Human SNP Array 6.0 according to the manufacturer's protocol. Each of the nine DNA pools and the positive control sample were assayed on a separate microarray. Each array was scanned using the Affymetrix GeneChip Scanner 3000 with the High-Resolution Scanning Upgrade. The cell intensity files were analyzed using the Affymetrix Genotyping Console (GTC v2.1), and the concordance of called genotypes (excluding missing data) between replicates and between the positive control and its consensus genotypes provided by Affymetrix was analyzed using GTC v2.1. The concordance for the pooled Aymara genotypes was 97.5% (on average for the pairwise comparison among the three replicates), for the Quechua was 96.7%, and for the Guarani was 97.9%. The concordance of the positive control compared to the consensus genotypes provided by Affymetrix was 99.7%.

Allele frequencies from DNA pools and highly differentiated genic regions
The allele frequency per called SNP and population was estimated from the raw probe intensity data of each microarray as previously described [38][39][40]; the allele frequency data are available from the authors upon request. Briefly, we computed the Relative Allele Signal (RAS) score as an estimate of the allele frequency. In order to have consistent calculations, we only considered the first three probe sets for each SNP locus and removed SNPs whose standard deviation of RAS across technical replicates and/or probe sets in any group of individuals was greater than 0.1. Then, the allele frequency for each group of individuals was estimated by averaging across the technical replicates and the probe sets: p ¼ X j X k RAS j;k n j n k , where j is the technical replicate and k is the probe set. The allele frequency difference was taken as j p 1 À p 2 j, where p 1 and p 2 are the allele frequencies in two different groups. We calculated the allele frequency differences between groups in a pairwise fashion, and we also compared the Guarani against Aymara and Quechua individuals combined together into a single highland group.
We applied a t-test to formally evaluate the statistical significance of the calculated allele frequency differences. Therefore, T ¼ ðp 1 Àp 2 Þ 2 varðp 1 Àp 2 Þ , where T should follow a w 2 1 distribution. The overall variance: varðp 1 Àp 2 Þ consists of two parts: V s +V p , where V s represents the sampling variance, and V p represents the component from the pooling process. V s is given byp 1 ð1Àp 1 Þ and V p is given by varðp 1 Þ 2n 1 þ varðp 2 Þ 2n 2 , where n is the sample size. We applied a multi-locus approach to search for highly differentiated genic regions. SNPs were ranked according to either the allele frequency difference or P value significance. The top 0.1% SNPs were connected if they were within 100 kb distance, and a differentiated region was called if there were more than 10 top SNPs connected.

Validation of estimated allele frequencies
Confirmation of the allele frequencies estimated from the RAS scores for the 14 SNPs with the largest HL-LL allele frequency differences was performed using the ABI PRISM SNaPshot Multiplex System (Applied Biosystems by Life Technologies) according to the manufacturer's protocol. Primers for the single PCRs and for the subsequent extension reactions were designed using the UCSC In-Silico PCR tool (http://genome.csdb.cn/cgi-bin/hgPcr/). Primer interactions within the multiplex were evaluated and minimized using the NetPrimer online software (http://www.premierbiosoft.com/). Briefly, single PCRs amplified the target region surrounding the SNP of interest for each individual contained in the full collection set. The amplicons were then assembled into four separate multiplexes and analyzed on an ABI 3130xl Genetic Analyzer. The SNP calling was performed using the ABI GeneMapper ID v3.2 software. As a positive control for the SNaPshot experiments, the sample HapMap #NA06985 CEPH/UTAH Pedigree 1341 was assayed along with the Bolivian samples. The called genotypes for the control were compared with the consensus genotypes for the same 14 SNP loci obtained from the HapMap website; the concordance was 100%. Additionally, one Bolivian sample was assayed in single primer extension reactions for each of the 14 SNPs, and the called genotypes were compared to the genotypes obtained from the multiplex approach; the concordance was 100%. For the 14 SNPs re-genotyped individually, we performed a Fisher's exact test to validate the results obtained from the DNA pooling approach. The Fisher's exact test was performed using R (http://www.r-project.org/).

Capture array and resequencing
We used Agilent custom 1M capture arrays in order to resequence the target region of interest. We designed overlapping microarray probes of 60 bases targeting over 1.5 Mb of the region of interest in chromosome 10 (chr10:81113000-82664000). Probes were tiled every 3 bases across the target region. Probes containing repetitive elements were discarded [98]. We used the human reference sequence NCBI Build 37 (hg19) to design the probes.
Illumina GAIIx libraries were prepared following Meyer and Kircher [99], with some differences noted below. All samples were sheared with the Bioruptor UCD-200 (Diagenode) down to a range of approximately 200-800 bps. The adapter fill-in step was performed using Dynabeads MyOne Streptavidin C1 (Invitrogen). The beads were prepared and libraries immobilized by aliquoting 25μl bead suspensions for each sample, washing twice with 2X-BWT buffer and eluting in 25μl 2X-BWT buffer. A magnetic plate was used for all washing steps. The adapted sample libraries were added to the bead suspension, pipette-mixed, and incubated for 15 minutes at room temperature. The supernatant was then discarded while the plate was on a magnet and the beads were washed twice with 100μl 1X-BWT buffer. The fill-in step was performed by adding the master mix used in Meyer and Kircher [99] after removing the buffer, and no subsequent SPRI purification was necessary.
Individual-specific indexes were used to multiplex the libraries prior to hybridization enrichment. These were attached by performing a PCR amplification using the Phusion Mastermix (New England Biolabs, NEB). After indexing, samples were pooled in equimolar ratios and hybridized. After hybridization, quantitative PCRs were performed on the sample pools with the DyNAmo qPCR kit (NEB). Based on the resulting qPCR amplification plots, the sample pools were amplified using the Phusion Mastermix so that they did not reach plateau. Each sample pool was sequenced on a single lane of an Illumina GAIIx run by single-end sequencing using 36 cycles.

Resequence data processing
The raw sequencing reads were aligned to the human reference genome sequence GRCh37 by BWA v0.70 [100] with default parameters. The alignments were transferred to indexed binary alignment map (BAM) files by SAMtools [101] and duplicates removed with the Picard tool v.1.66.
Genotypes were called by the GATK UnifiedGenotyper v1.4 [102] with the following parameters: the minimal base quality score setting was 20, the minimal mapping quality score setting was 30, and the confidential Phred-scale threshold for genotyping calling setting was 50; the default settings were used for all other parameters. Furthermore, the GATK VariantRecalibrator tool was used to score variant calls by a machine-learning algorithm and to identify a set of high-quality SNPs using the Variant Quality Score Recalibration (VQSR) procedure. The insertions and deletions (indels) were filtered by GATK, resulting in 1,983 SNPs (with average coverage 9X) for the following analyses.

Population Genetic Analyses and Selection Tests
To analyze the population differences and detect signals of natural selection in either high or low-altitude populations, we employed several methods with both empirical polymorphism data and simulated data. These methods were based on population differentiation, the allele frequency spectrum, properties of haplotypes, and composite signals: FST test. F ST is a measurement of population differentiation. We calculated it in pairwise manner by using the unbiased estimator of Weir and Cockerham [45].
ΔDAF test. We calculated ΔDAF [53] between a putative selected population and a nonselected population. ΔDAF scores range between -1 and 1. SNPs with positive scores indicate a higher derived allele frequency in the selected population. The ancestral allele states were as determined by the 1000 Genomes Project [50].
XP-CLR test. XP-CLR test is a likelihood method for detecting selective sweeps based on multilocus allele frequency differentiation between a putative selected population and a non-selected population [54]. We set 0.05 cM sliding window sizes and uniform grid points with a spacing of 2 kb. The maximum number of SNPs was set to 200 for each window.
Tajima's D test. Tajima's D [51] was performed with a sliding window of 20 kb and no overlap between adjacent windows. The calculation was performed by an in-house Perl script.
Fay&Wu's H test. Fay & Wu's H [52] was also calculated by an in-house Perl script with the same sliding window approach as in Tajima's D test. iHS test. This method is based on the length of the haplotype associated with ancestral vs. derived alleles; derived alleles subject to positive selection tend to have unusually long haplotypes, as such alleles have risen to high frequency too quickly for recombination and/or new mutations to break down the length of the associated haplotype. The iHS test partitions haplotypes into an ancestral group and a derived group according to the allele states of core SNPs; iHS is defined as the log ratio of the integrated EHH (extended haplotype homozygosity) for these two groups [55].
REHH test. The REHH is another test based on haplotype length and structure, and was calculated with the Sweep software [57]. We set the option 'matching distance' to be 'marker H of about 0.04'.
CMSL test. Numerous methods have been developed to detect positive selection based on various patterns of genetic variation, and hundreds of candidate regions have been identified. But usually these regions are typically large and the causal variants remain unknown. A composite of multiple signals method narrows down the candidate regions and aids in identifying the causal variant [53]. Recently, another framework combing P values in large scale genomic data was used to detect selection [103]. This test is based on Fisher's combination test [104].
The statistic is computed as Z F ¼ À2 logP i , where k is the number of SNPs in one region and P i is the empirical P value of one test for the SNP i. In Luisi's study, F ST , ΔDAF and iHS statistics are calculated, and regions with high Z F scores indicate positive selection. Following this approach, we used a CMSL method by combining F ST , ΔDAF, Tajima's D, Fay & Wu's H, XP-CLR and the iHS test. The Z F statistic is computed as above, where k is the number of tests and P i is the P value of the SNP in test i. We obtained the P value from empirical distributions by simulations.

Simulations
Simulations were used to calculate the P value of the scores in the empirical data. To account for the impact of demography on the detection of selection, we did simulations under a wide range of demographic scenarios inferred by pairwise sequentially Markovian coalescent (PSMC) model, which is a method to infer the history of population size change based on a single genome sequence [105]. In this study, we sequenced nearly 1.5 Mb, which is not enough for inferring a high resolution N e trajectory. We therefore used the N e estimated from the Mexican (MXL) population in the 1000 Genomes Project [50](red line in S3 Fig), with the modification of a constant N e in recent history instead of a sharp expansion, as our standard demographic model. We set the divergence time between high and low-altitude populations at 10,000 years ago, and the divergence of the two high-altitude populations at 5,000 years ago. We also used two other models, one with a more intense bottleneck (N e reduced by 50% during the most recent 10,000 years) and one with a constant N e of 7,000 for the entire history.
We used a different formula for the time interval boundaries in PSMC: t i ¼ 0:1e ð i n Þ 2 logð1þ10T max Þ À 0:1; i ¼ 0; . . .; n We set n to be 30, to reduce the complexity of the search space. The squared exponential growth of time intervals results in more intervals in the recent past and much fewer intervals in the ancient past, as recent N e needs more information for accurate inference and is more important for our purposes. We simulated 2 Mb neutral segments with MSMS [106] with 100 replicates for each of the three demographic scenarios.
Supporting Information Haplotype frequencies in modern humans (from this study and 1000 Genomes data) and Denisovan genome sequence in FAM213A and SFTPD (low quality Denisovan sites were filtered following [87]). Green is haplotype shared with Denisovan; blue is the first dominant haplotype in HL; red is the second dominant haplotype which contain the derived allele of rs150230265 in HL; purple is the first dominant haplotype in LL; gray are haplotypes with frequencies <10% in Bolivian populations. The radii are scaled by sample sizes. (A) Haplotypes are 10kb extended on both sides from rs150230265. (B) Haplotypes are 10kb extended on both sites from the three non-synonymous SNPs in SFTPD.