MJR, CSC, and DAN conceived and designed the experiments. JMA, MAE, MDS, and LK analyzed the data. JMA, DAN, and LK wrote the paper.
The authors have declared that no conflicts of interest exist.
Identifying regions of the human genome that have been targets of natural selection will provide important insights into human evolutionary history and may facilitate the identification of complex disease genes. Although the signature that natural selection imparts on DNA sequence variation is difficult to disentangle from the effects of neutral processes such as population demographic history, selective and demographic forces can be distinguished by analyzing multiple loci dispersed throughout the genome. We studied the molecular evolution of 132 genes by comprehensively resequencing them in 24 African-Americans and 23 European-Americans. We developed a rigorous computational approach for taking into account multiple hypothesis tests and demographic history and found that while many apparent selective events can instead be explained by demography, there is also strong evidence for positive or balancing selection at eight genes in the European-American population, but none in the African-American population. Our results suggest that the migration of modern humans out of Africa into new environments was accompanied by genetic adaptations to emergent selective forces. In addition, a region containing four contiguous genes on Chromosome 7 showed striking evidence of a recent selective sweep in European-Americans. More generally, our results have important implications for mapping genes underlying complex human diseases.
An analysis of 132 human genes suggests that the migration of modern humans out of Africa into new environments was accompanied by genetic adaptations to emergent selective forces.
Despite intense study and interest, a detailed understanding of the evolutionary and demographic forces that have shaped extant patterns of human genomic variation remains elusive. An important goal in studies of DNA sequence variation is to identify loci that have been targets of natural selection and thus contribute to differences in fitness between individuals in a population. Identifying regions of the human genome that have been subject to natural selection will provide important insights into recent human history (
The neutral theory of molecular evolution (
One way out of this conundrum is to recognize that population demographic history affects patterns of variation at all loci in a genome in a similar manner, whereas natural selection acts upon specific loci (
Here, we describe an extensive analysis of the molecular evolution of 132 genes that were comprehensively resequenced in 24 African-Americans and 23 European-Americans. In total, over 2.5 Mb of baseline reference DNA was sequenced, spanning 20 autosomal chromosomes and the X chromosome. The sampling of a large number of loci dispersed throughout the genome has allowed us to clarify the relative contributions of demography and selection to patterns of genetic variation at individual genes. Specifically, we developed a rigorous computational approach for taking into account multiple hypothesis tests and demographic history, and we found that while many apparent selective events can instead be explained by demography, there is also strong evidence for positive or balancing selection at eight genes in the European-derived population. In addition, we describe a striking example of a previously unreported recent selective sweep in European-Americans that spans four contiguous genes on Chromosome 7. More generally, our data provide insight into the demographic histories of African-American and European-American populations and have important implications for genetic association studies of complex diseases, as several of the genes showing evidence of selection have been implicated in susceptibility to complex human diseases.
We resequenced 132 genes primarily involved in inflammation, blood clotting, and blood pressure regulation and discovered a total of 12,890 SNPs (
Genes that are nominally significant (
The direction of Tajima's D, Fu and Li's D*, and Fu and Li's F* is potentially informative about the evolutionary and demographic forces that a population has experienced. For example, negative values reflect an excess of rare polymorphisms in a population, which is consistent with either positive selection or an increase in population size. Positive values indicate an excess of intermediate-frequency alleles in a population and can result from either balancing selection or population bottlenecks. In the European-American sample, we observed eleven significantly positive and five significantly negative values for one or more of these three test statistics (
The observations of both significantly positive and significantly negative values of Tajima's D, Fu and Li's D*, and Fu and Li's F*, combined with the largely nonoverlapping set of significant genes, could reflect selective pressures unique to one population (i.e., local adaptation), different demographic histories, spurious results, or most likely some complex combination of all of these factors. Although these results are intriguing, their interpretation is confounded by two issues: (1) We have not corrected for multiple hypothesis tests, and (2) rejection of the standard neutral model can result from either selective or demographic forces. In the subsequent sections, we develop approaches to address these issues with the dual goals of identifying genes that possess strong evidence of natural selection and of inferring population demographic history.
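As an illustration of how the sign of such a statistic tracks the allele frequency spectrum, the following is a minimal sketch of Tajima's D using the standard published formula. It is not the authors' pipeline: the function name `tajimas_d` and its input format (one allele count per segregating site) are assumptions made for this example.

```python
import math


def tajimas_d(n, allele_counts):
    """Tajima's D for a sample of n chromosomes (illustrative sketch).

    allele_counts: for each segregating site, the number of sampled
    chromosomes carrying one of the two alleles (1 <= k <= n - 1).
    This input format is a hypothetical choice for the example.
    """
    S = len(allele_counts)
    if n < 4:
        raise ValueError("the variance term is zero for n < 4")
    if S == 0:
        raise ValueError("at least one segregating site is required")
    if any(k < 1 or k > n - 1 for k in allele_counts):
        raise ValueError("each count must lie between 1 and n - 1")

    # Average number of pairwise differences (pi), summed over sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in allele_counts)

    # Constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    # D contrasts pi with Watterson's estimator S / a1.
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))


# Excess of rare variants (all singletons) gives a negative value.
d_rare = tajimas_d(10, [1, 1, 1, 1, 1])
# Excess of intermediate-frequency variants gives a positive value.
d_intermediate = tajimas_d(10, [5, 5, 5, 5, 5])
```

In this toy input, a sample made up entirely of singletons yields a negative D, whereas sites at 50% frequency yield a positive D, matching the interpretation given above. The sign alone cannot separate selection from demography.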
In order to robustly correct for multiple hypothesis tests, the conventional practice of assuming no recombination when determining significance is not appropriate, because it results in conservative
In the European-American sample, we observed 22 genes that were significant at an FDR of 5% (i.e., we expect approximately one false positive in this set of genes) for one or more tests of the allele frequency distribution (
D, D*, F*, and H denote Tajima's D, Fu and Li's D*, Fu and Li's F*, and Fay and Wu's H, respectively. Nominal
Although neutrality tests of the allele frequency distribution reveal many significant deviations, it is impossible to unambiguously interpret these data as evidence for natural selection, because the null model used to assess significance makes unrealistic assumptions about population demographic history. In principle, it is possible to distinguish between demography and selection, because demography affects all loci in the genome, whereas selection acts upon specific loci. Thus, by sampling a large number of loci dispersed throughout the genome, we can begin to construct a more realistic null hypothesis by which to evaluate the evidence for or against selection (
To this end, we used the empirical data to explore four different demographic models (
(A) Schematic diagram of each demographic model and its associated parameters (see
(B) Average and 95% confidence intervals of Tajima's D (blue bars), Fu and Li's D* (red bars), and Fu and Li's F* (pale yellow bars) for the observed data and each demographic model (using the parameters that most closely match the empirical data). Results from the standard neutral model (Constant) are also shown.
We reestimated the significance of Tajima's D, Fu and Li's D*, Fu and Li's F*, and Fay and Wu's H in each population for each of the four demographic models using the best-fit parameter values. All simulations included recombination and correction for multiple tests using the FDR method (with an FDR of 5%) as described above. Population history can clearly have a profound effect on tests of natural selection (
(A and B) The significance of observed values of Tajima's D (red), Fu and Li's D* (pale yellow), Fu and Li's F* (pale blue), and Fay and Wu's H (dark blue) was reassessed for each best-fit demographic model in European-Americans (A) and African-Americans (B). Results from the standard neutral model (Constant) are shown for comparison. The number of significant genes for each demographic model is noted above each category in (A) and (B). For example, there were a total of 19 significant test statistics across all four tests of neutrality assuming a bottleneck model for Europeans, which define ten unique genes. Therefore, each gene is supported by approximately two (19/10) tests of neutrality.
(C) The distribution of the number of significant genes across the five demographic models in European-Americans and African-Americans. For example, in European-Americans, 40 genes were significant in at least one of the demographic models, and 27 genes were significant in at least two of the demographic models.
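The counts of significant genes reported here depend on controlling the FDR at 5% across many tests. As a hedged illustration of what such a correction does, the sketch below implements the Benjamini-Hochberg step-up procedure, one widely used FDR method. It is an assumption that this resembles the procedure applied in this study; the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Flag which tests are significant at the given FDR level.

    Illustrative Benjamini-Hochberg step-up procedure; this may differ
    from the FDR method actually used in the study.
    """
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value satisfies p_(k) <= fdr * k / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank

    # Every test ranked at or below the cutoff is declared significant.
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


# Hypothetical p-values for ten genes (not data from this study).
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(example, fdr=0.05)
```

With these made-up values, only the two smallest p-values pass, even though five are below a nominal 0.05 threshold. The same logic underlies the expectation stated earlier: at an FDR of 5%, a set of 22 significant genes should contain about one false positive.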
Biological Process terms were assigned using the Panther classification scheme (
One particularly interesting region of the genome is located at 7q and contains four contiguous demographically robust selection genes (
(A–D) Exons for
(E) The distribution of FST across the 115-kb region. The average FST for all SNPs across the 132 genes is shown as a dashed red line. The dashed green line indicates the threshold for significantly (
In summary, we have found that both population demographic history and natural selection shaped patterns of DNA sequence variation in the 132 genes studied here. By studying multiple unlinked loci dispersed throughout the genome, we were able to develop a rigorous computational approach to distinguish between the confounding effects of natural selection and demographic history on patterns of genetic variation. Using this strategy, we found that approximately two-thirds of the genes that were initially significant could be accounted for by population demographic history. Thus, our analyses clearly demonstrate the importance of considering both neutral and nonneutral forces when interpreting DNA sequence variation.
An interesting feature of our data is that the majority of deviations from neutrality, and all of the demographically robust selection genes, are not shared between the two population samples, suggesting that local adaptation has played an important role in recent human evolutionary history. Consistent with this observation, several possible examples of local adaptation in humans have previously been reported (
An alternative explanation for why we observed fewer significant results in African-Americans than in European-Americans is that African-Americans are an admixed population (
It is important to point out that some genes that do not meet our rigorous definition of a high-confidence selection gene may have nonetheless been targets of selection, such as
Recently,
The strongest signature of selection that we observed occurs on Chromosome 7q in European-Americans. The signature of selection extends for at least 115 kb and spans the genes
More generally, our results have several implications for mapping genes underlying complex human diseases. Specifically, four of the high-confidence selection genes have been implicated in various complex diseases (
Human DNAs were obtained from the Coriell Institute (Camden, New Jersey, United States). We analyzed DNA from 24 African-Americans from the Human Variation Panel, African-American Panel of 50 (HD50AA) and DNA from 23 European-Americans derived from various CEPH pedigrees. We also sequenced each gene in a common chimpanzee (
We calculated the following summary statistics of nucleotide variation for each gene: θ̂=
We initially assessed the significance of these statistics by comparing the observed values to 10,000 coalescent simulations (
We quantified the allele frequency differences between the European- and African-American samples by the statistic FST as described in
We estimated the time since the selective sweep for the Chromosome 7q region in European-Americans by analyzing the amount of nucleotide diversity that has accumulated on the selected haplotype as described in
We assessed the impact of demographic history on the robustness of the statistical tests of neutrality by using coalescent theory to simulate data under four different population histories, including a bottleneck, exponential expansion, population structure according to an island model that allows symmetric migration between demes, and population structure assuming population splitting with no subsequent migration. For each model we simulated data under a wide variety of parameters by conditioning on the observed sample size and θ̂W for each population. The bottleneck model is specified by the parameters
For each demographic model, we calculated the average value of Tajima's D, Fu and Li's D*, and Fu and Li's F* and compared the results to the observed values of these statistics. For the bottleneck and exponential expansion models, we identified the parameter values that most closely matched the observed data by identifying the parameter combination that minimized the function
LocusLink ID numbers (
Coriell (
We would like to thank members of the SeattleSNPs team (M. Ahearn, T. Armel, E. Calhoun, M. Chung, C. Hastings, P. Keyes, P. Lee, S. Kuldanek, M. Montoya, C. Poel, E. Toth, and N. Rajkumar) for cataloging the variation data. We would also like to thank D. Akey and D. Crawford for critical reading of this manuscript and J. Fay for helpful discussions. This work was supported by a National Science Foundation Postdoctoral Research Fellowship in Interdisciplinary Informatics (JMA) and grants from the National Heart, Lung, and Blood Institute Program for Genomic Applications (HL66682 to DAN and MJR; HL66642 to LK), the National Institute of Mental Health (MH59520 to LK), and the National Institutes of Health Pharmacogenetics Research Network (U01 HL69757 to DAN). LK is a James S. McDonnell Centennial Fellow.
FDR, false discovery rate
SNP, single nucleotide polymorphism