A Genomic Portrait of Haplotype Diversity and Signatures of Selection in Indigenous Southern African Populations

We report a study of genome-wide, dense SNP (∼900K) and copy number polymorphism data of indigenous southern Africans. We demonstrate that the genetic make-up of southern and eastern African populations involved admixture between indigenous San, Niger-Congo-speaking populations and populations of Eurasian ancestry. This finding illustrates the need to account for stratification in genome-wide association studies, and suggests that admixture mapping would likely be a successful approach in these populations. We developed a strategy to detect the signature of selection prior to and following putative admixture events. Several genomic regions show an unusual excess of Niger-Kordofanian ancestry and an unusual deficiency of both San and Eurasian ancestry, which we considered the footprints of selection after population admixture. Several SNPs with strong allele frequency differences were observed, predominantly between the admixed indigenous southern African populations and their ancestral Eurasian populations. Interestingly, many candidate genes identified within the genomic regions showing signals of selection were associated with southern African-specific high-risk, mostly communicable diseases, such as malaria, influenza, tuberculosis, and human immunodeficiency virus (HIV)/AIDS. This observation suggests a potentially important role that these genes might have played in adaptation to the environment. Additionally, our analyses of haplotype structure, linkage disequilibrium, recombination, copy number variation and genome-wide admixture highlight and support the unique position of the San relative to both African and non-African populations. This study contributes to a better understanding of population ancestry and selection in south-eastern African populations, and the data and results obtained will support research into the genetic contributions to infectious as well as non-communicable diseases in the region.


Introduction
The analysis of high-throughput genotype data has revealed global patterns of human haplotype variation, casting light on the pre-history of human populations [1,2,3,4,5]. The International HapMap Consortium [1,5] and the Human Genome Diversity Project (HGDP) [6], among others, have facilitated the analysis of human genome-wide variation and linkage disequilibrium in disease association studies [1,4,5], and have also helped refine estimates of recombination rates [7]. Comparative genome-wide genotype data among humans, Neanderthals and chimpanzees have also shown that selection has played a significant role in human adaptation to the environment [8,9,10,11]. These data have provided additional support for the African origin of modern humans [12,13] and highlight the effects of migration both within Africa and out of Africa. In general, African populations exhibit less linkage disequilibrium between adjacent markers than their non-African counterparts, consistent with a migratory bottleneck in the latter [1,2,5]. Such differences in the extent of linkage disequilibrium have a profound effect on the power of case-control association studies, since these studies depend largely on linkage disequilibrium between disease variants and genotyped single nucleotide polymorphisms (SNPs). Substantially more SNPs are required to capture genomic variation in African populations than in populations of European ancestry [1,5]. In addition, African populations are characterized by higher levels of genetic diversity [13,14,15,16] and considerable population substructure [17,18,19], probably the combined result of several migration events, changes in effective population size, population differentiation through genetic drift, and local selective forces operating in ecologically diverse environments [18].
Hypotheses of migration within Africa based on mitochondrial DNA (mtDNA) suggest that at least three major migration events are plausible that could account for the patterns of mtDNA variation within Africa [17]; (1) the divergence of southern African San and east African populations who share the ancestral mtDNA haplogroup (L0d) and associated lineages in their maternal gene pool from an ancestral parental population circa 200 kya, (2) the establishment of west African maternal haplogroups (L1'5 & L0abf) from an east African source (circa 100 kya), and (3) the Bantu expansion from the Niger-Congo region into central, eastern and southern Africa (< 5 kya). Although a southern African versus east African origin of modern humans cannot be fully evaluated with current data, multiple lines of evidence from mtDNA [16], Y chromosomes [20], Alu insertions [21], and autosomal SNPs [3] place the divergence of the San at the root of modern humans with at least 100 ky of isolation from other non-San African populations [17,22], and relatively recent (< 5 kya) admixture with Bantu-speaking populations [16,23,24,25,26,27], followed by subsequent admixture (< 5 kya) in the region [16,28,29,30]. Given this relative isolation of present-day San in southern Africa, it is expected that many SNPs ascertained in HapMap populations may not necessarily be polymorphic in San, unless the polymorphisms arose well before the divergence of these populations. Southern Africa was occupied exclusively by the San prior to the arrival of Bantu-speaking populations within the past 1,500 years, a consequence of the Bantu-expansion out of west Africa some 5000 years ago [16,23,24,25,26,27,31]. Migrations across equatorial central Africa to the region of the Great Lakes in east Africa, followed by southern African migrations [16,25] established the eastern and southeastern Bantu-speaking groups, respectively. 
Migrations along the west coast of Africa contributed to western and southwestern Bantu-speaking groups, the latter currently extending to Namibia [16,25,26,27,28,29]. According to our findings, the label "Khoe-San" represents populations resulting from the mixture of predominantly San, Eurasian and Bantu-speaking populations. Over hundreds of years, indigenous San and Khoe-San communities have undergone a sharp decline in population size, largely due to warfare and diseases such as smallpox, which arrived with colonialists [29,32]. It is estimated that the 90 percent decline of both San and Khoe-San populations was due to smallpox [31,32]. Recently, Lachance et al. [33] used the whole-genome sequences of five individuals in each of three different hunter-gatherer populations, including Pygmies from Cameroon and Khoe-San-speaking Hadza and Sandawe from Tanzania, and identified several genomic regions with evidence of archaic introgression in the hunter-gatherers. In addition, Lachance et al. [33] demonstrated that the distribution of the time to the most recent common ancestor for these regions was similar to that observed for introgressed regions in Europeans. Ancient and relatively recent contact between immigrants from Europe, Asia and Indonesia and sub-Saharan Africans [24,26,34] has resulted in varying degrees of admixture between these populations. Furthermore, a recent study by Gurdasani et al. [35] presented a broad survey of polymorphisms in a novel array genotyping data set of ∼1,481 individuals from 18 self-identified ethnic/linguistic groups and a low-coverage whole-genome sequencing data set of 320 individuals from 7 self-identified ethnic/linguistic groups in sub-Saharan Africa, and suggested that Eurasian back-migrations to Africa and their contributions to ancestry have had a substantial impact on differentiation among some sub-Saharan African populations.
These mixtures have also contributed to shaping the gene pool of the derived populations in south-eastern Africa [28,35]. Other disciplines, such as archaeology, history and anthropology, have given us clues about the prehistory of African populations. The study by Pickrell et al. [16] convincingly demonstrated waves of two-way admixture between Niger-Congo-speaking African and west Eurasian (European or Middle Eastern) populations to form eastern and southern African (admixed) populations. However, the role of the indigenous San in the south-eastern African region, and the genetic contribution of this population to the southern and eastern African admixed populations, have not been elucidated. The present study makes use of genetic markers to investigate which factors have contributed to shaping the gene pools of extant southern and eastern African populations, and to what extent. More specifically, we used the Affymetrix Genome-Wide Human SNP Array 6.0 to examine ∼900K SNPs and copy number variants in five indigenous populations comprising 25 Ju\'hoansi San from Namibia (KHS), southeastern Bantu-speakers [25 Sotho-Tswana (STS), 36 Xhosa (XHS), 25 Zulu (ZUL)], and 25 Herero (HER), a southwestern Bantu-speaking group from Namibia. These data were used in conjunction with other published data to examine the genetic origins of southern African populations. Importantly, our study demonstrates the admixture of the indigenous San, Niger-Congo-speaking populations and populations of Eurasian ancestry in southern and eastern African populations. We have also developed two complementary approaches to identify signatures of selection prior to and following putative admixture events in the southern African populations.

Sampling and Genotyping
The sample consisted of unrelated individuals belonging to the following five self-identified ethnic/linguistic populations of  [36]. Blood samples were collected with the subjects' informed consent, and the use of DNA samples for population genetics research was approved by both the University of the Witwatersrand and the University of Cape Town. DNA samples were shipped to Affymetrix (http://www.affymetrix.com) for genotyping using the Affymetrix Genome-Wide Human SNP Array 6.0, containing 906,600 SNPs and more than 946,000 probes for the detection of copy number variation. These data were used to examine patterns of migration, genetic ancestry and effects of selection in this study. Other populations included in this study are listed in S1 Table.

Admixture Analysis
The separation of Africans from non-Africans is clearly evident (Fig. 1 (A)); this has also been reported previously with both microsatellite data [37,38] and other SNP data [2,3,5]. From pairwise population genetic distance estimates, we find that there is little genetic difference among Bantu-speaking populations (S2 Table). In addition, Fig. 1 (A) shows a distinct separation of San populations [San (SAN) and Ju\'hoansi (KHS)] and Khoe-San populations [Bushmen (BUS) and ‡Khomani (KHO)], consistent with previous studies [16,26,33,39,40]. This result suggests that the Khoe-San, and both eastern and southern Bantu-speaking populations, have undergone admixture. It is consistent with the 3-population test [39,40] result displayed in S3 Table, which shows clear evidence of admixture between Yoruba (YRI) and KHS in the southern Bantu (ZUL, STS, XHS). Furthermore, the ‡Khomani (KHO) and eastern Bantu-speaking populations also reflect a three-way admixture of Caucasian (CEU), Yoruba (YRI) and KHS.
The results in Fig. 1 (A and D) suggest that the genetic makeup of the southeastern Bantu-speaking groups (ZUL, STS, XHS) includes ancestral contributions from Niger-Congo (26% ± 0.3%) and San populations (74% ± 0.4%). However, consistent with previous findings [40], the data in Fig. 1 (B-C) suggest Niger-Congo ancestry (17% ± 1.2% and 57% ± 1.6%), San ancestry (70% ± 1.3% and 15% ± 0.4%), and notably Eurasian-related ancestry (13% ± 1% and 28% ± 2%) in the genetic make-up of ‡Khomani (KHO) and Sandawe (SAW), respectively. The admixture observed in the Khoe-San (KHO) and in the eastern African populations, particularly the Sandawe (SAW), reflects gene flow from Bantu-speaking agriculturalists and/or eastern African pastoralists within the past 1,200 years, and from sea-borne immigrants from Europe, Asia and Indonesia [33,35,39,40,41]. Our observation of Eurasian ancestry in both eastern (SAW) and southern (KHO) African populations is consistent with archaeological, genetic, climatological and linguistic data [24,25,26,27,28,35]. Furthermore, Pickrell et al. [16] previously demonstrated multiple waves of population mixture in the history of many eastern and southern African populations, and showed that genetic material from Eurasians or related populations entered eastern Africa 2,700-3,300 years ago, and southern Africa 900-1,800 years ago [16,41]. In addition, our study demonstrates the genetic contribution of the San population to the waves of admixture in the ancestry of the southern and eastern African populations.
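The logic of the 3-population test mentioned above can be illustrated with a minimal sketch. It assumes per-SNP allele frequencies are already available as arrays; the function name `f3` and the toy frequencies are hypothetical, and the sketch deliberately omits the finite-sample bias correction and block-jackknife standard errors that a real analysis relies on. A clearly negative value of f3(target; A, B) is the signal that the target is admixed between populations related to A and B.

```python
import numpy as np


def f3(target, source_a, source_b):
    """Naive f3(target; source_a, source_b), averaged over SNPs.

    Each argument is a 1-D array of allele frequencies at the same SNPs.
    No bias correction or standard error is computed here.
    """
    c = np.asarray(target, dtype=float)
    a = np.asarray(source_a, dtype=float)
    b = np.asarray(source_b, dtype=float)
    return float(np.mean((c - a) * (c - b)))


# Toy data (not from the study): two source populations and a mixture.
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.05, 0.95, size=5000)
freq_b = rng.uniform(0.05, 0.95, size=5000)
alpha = 0.3  # hypothetical admixture proportion from source A
freq_admixed = alpha * freq_a + (1.0 - alpha) * freq_b

f3_admixed = f3(freq_admixed, freq_a, freq_b)    # admixed target
f3_unadmixed = f3(freq_a, freq_admixed, freq_b)  # unadmixed target
print(f3_admixed, f3_unadmixed)
```

In the toy example the mixture gives a negative statistic, whereas using an unadmixed source as the target gives a positive one.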

Relationship between Genetic and Geographic Distance
Using the Mantel test with N = 10,000 permutations (Materials and Methods), we found a significant positive correlation between genetic and geographic distance in the southern African populations (Pearson's r = 0.64; p-value = 1.0 × 10^−4; Fig. 2). To analyse the outlier points in Fig. 2 more closely, we calculated the perpendicular distance between each point and the regression line. Based on the concentration of points around the regression line, we defined outliers as points lying more than 0.05 distance units from it. The scatter plot (Fig. 2) contains 10 outlier points, which suggest possible obstacles to migration (S4 Table), assuming that populations used the shortest path during their migrations. To assess patterns of migration and to capture genetic drift in southern African populations, we used a maximum likelihood tree and a Gaussian approximation to the genetic drift model, as implemented in TreeMix [40]. The population tree shows not only a major split between the African and European continents, but also sub-lineages within Africa, and particularly within the southern African populations (S1 Fig.), which is consistent with previous results [16,26,34,39,40]. S1 Fig. (B) shows the inferred graph with three migration events, explaining the model for the relationship of southern and eastern Africans and non-Africans. This provides evidence for a shared origin for San-, Eurasian- and Bantu-related populations in Sandawe (SAW) and ‡Khomani (KHO). The latter possibility would be consistent with known south-east African admixture in the Sandawe (SAW) and ‡Khomani (KHO).
We clearly see four population branches in southern Africa: (i) the first is formed by the southern Bantu-speaking populations, which are very distinct from the Niger-Congo and eastern Bantu-speaking populations; (ii) the second is formed by the eastern Bantu-speaking populations; and (iii) the third and (iv) the fourth are formed by the San (KHS+SAN) and the Khoe-San (BUS+KHO), respectively. Both of the latter are hunter-gatherer groups that are quite distinct and split into two separate groups: San populations [SAN and Ju\'hoansi (KHS)] and Khoe-San populations (BUS and KHO). This is also consistent with the admixture results shown in Fig. 1, reaffirming the concordance between genetic data, the geographic origins of populations, and their linguistic affinities.
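As an illustration of the Mantel test and the perpendicular-distance outlier rule described in this section, the following is a minimal sketch rather than the procedure used in the study. The function names, the one-sided permutation convention, and the toy distance matrices are all assumptions. Under the (count + 1)/(N + 1) convention used here, the smallest attainable p-value with N = 10,000 permutations is about 1.0 × 10^−4.

```python
import numpy as np


def mantel_test(dist_x, dist_y, n_perm=10000, seed=0):
    """One-sided Mantel test: Pearson r between two square distance
    matrices, with a p-value from joint row/column permutations."""
    x = np.asarray(dist_x, dtype=float)
    y = np.asarray(dist_y, dtype=float)
    n = x.shape[0]
    upper = np.triu_indices(n, k=1)  # count each population pair once
    x_vec = x[upper]
    r_obs = np.corrcoef(x_vec, y[upper])[0, 1]

    rng = np.random.default_rng(seed)
    n_ge = 0
    for _ in range(n_perm):
        perm = rng.permutation(n)
        y_perm = y[np.ix_(perm, perm)]  # relabel populations in y only
        if np.corrcoef(x_vec, y_perm[upper])[0, 1] >= r_obs:
            n_ge += 1
    return r_obs, (n_ge + 1) / (n_perm + 1)


def perpendicular_outliers(x, y, threshold=0.05):
    """Flag points whose perpendicular distance to the least-squares
    line exceeds `threshold` (in the units of the plot axes)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    dist = np.abs(slope * x - y + intercept) / np.sqrt(slope ** 2 + 1.0)
    return dist > threshold


# Toy data (not from the study): 8 populations on a plane, with genetic
# distance roughly proportional to geographic distance.
rng = np.random.default_rng(1)
coords = rng.uniform(0.0, 10.0, size=(8, 2))
geo = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
noise = rng.normal(0.0, 0.005, size=geo.shape)
noise = (noise + noise.T) / 2.0  # keep the matrix symmetric
np.fill_diagonal(noise, 0.0)
gen = 0.01 * geo + noise

r, p = mantel_test(geo, gen, n_perm=999)
pairs = np.triu_indices(8, k=1)
outliers = perpendicular_outliers(geo[pairs], gen[pairs])
print(r, p, int(outliers.sum()))
```

Note that a perpendicular distance depends on how the two axes are scaled, so a fixed cut-off such as 0.05 is only meaningful for a given choice of distance units.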

Haplotypes, Fine-Scale Recombination Rates and Imputation Accuracy
Consistent with previous observations [13], the mean haplotype block lengths are substantially shorter in African populations than in non-Africans (Fig. 3 (A) and S5 Table). Mean block lengths are remarkably consistent across the southern African populations in this study and easily distinguishable from the non-African block lengths. Similarly, decay of linkage disequilibrium with physical distance along the genome is rapid in southern Africans when compared with non-Africans (Fig. 3 (B)). Ascertainment biases have been shown to result in faster decay of linkage disequilibrium compared to a sample of non-ascertained markers [42]. We performed coalescent simulations (S1 Text and S2 Text) in order to investigate the effects of ascertainment bias when markers are ascertained in a population divergent from that in which they are genotyped. Consistent with previous reports [42], we found the rate of decay of linkage disequilibrium to be greater with ascertained SNPs (S2 Fig. (A)). Haplotype block lengths are similar, irrespective of whether markers were ascertained in the genotyped population or in a divergent population (S2 Fig. (A)). Frequency spectra, however, differ when SNPs are ascertained in a divergent population (S2 Fig. (A)). Indeed, more monomorphic SNPs, and thus lower overall SNP diversity, are evident when markers are ascertained in a population divergent from that in which they are genotyped. This is further evident in distributions of minor allele frequencies from empirical data, in which the distribution of minor allele frequencies of San more closely resembles the theoretical expectation for a non-ascertained sample (S2 Fig. (B)), mostly due to the abundance of monomorphic SNPs. In addition to differences in demographic processes, such as bottlenecks, differences in the extent and pattern of linkage disequilibrium may be the result of differences in the patterns of fine-scale recombination rate. We assessed the contribution of fine-scale recombination events to differences in linkage disequilibrium patterns using a coalescent-based method [7]. Interestingly, we found that the southern African Bantu-speaking populations share proportionally more recombination hotspots with both Yoruba (YRI) and Europeans (CEU) than with the Ju\'hoansi (KHS) (Fig. 4, S6 Table), where a shared hotspot is identified as a region with greater than five times the background recombination rate within a 10 kb window. The proportion of hotspots shared between southern Africans and both European (CEU) and Yoruba (YRI) samples was generally low (Fig. 4).
Our empirical analyses indicate that few recombination hotspots are shared between southern Africans and the HapMap populations, with San being the most extreme. More results on recombination hotspots, and a test of whether an increased frequency of low-frequency and monomorphic SNPs improves the power to detect recombination hotspots, are detailed in S4 Text and S7 Table. To assess the accuracy with which missing SNPs in southern African populations can be imputed using Yoruba (YRI) or European (CEU) reference populations, we removed SNPs, imputed them and checked the imputed genotypes for correctness (details in S1 Text and S3 Text). Our results show that YRI appears to be useful for imputation, at least for some of the southern Bantu-speaking groups included in the study, namely Sotho-Tswana (STS), Zulu (ZUL), Herero (HER) and Xhosa (XHS), but less so for the San, for whom imputation accuracy is significantly lower than for other African populations (S3 Fig.). Xhosa (XHS) also had lower imputation accuracy compared with other Bantu-speaking groups.
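The hotspot-sharing comparison described above can be sketched as follows. This is a simplified illustration under stated assumptions: recombination rates are taken to be already summarised per 10 kb window on a grid common to both populations, the background rate is approximated by the median across windows (the study's own definition of background is not reproduced here), and the function names and toy rates are hypothetical.

```python
import numpy as np


def hotspot_mask(window_rates, fold=5.0):
    """Mark windows whose rate exceeds `fold` times the background.

    Assumption: the background is the median rate across all windows.
    """
    rates = np.asarray(window_rates, dtype=float)
    background = np.median(rates)
    return rates > fold * background


def shared_hotspot_fraction(rates_a, rates_b, fold=5.0):
    """Fraction of population A's hotspot windows that are also
    hotspot windows in population B (same window grid)."""
    mask_a = hotspot_mask(rates_a, fold)
    mask_b = hotspot_mask(rates_b, fold)
    n_a = int(mask_a.sum())
    if n_a == 0:
        return float("nan")  # no hotspots in A, so the fraction is undefined
    return float((mask_a & mask_b).sum() / n_a)


# Toy per-window rates (not from the study), one value per 10 kb window.
rates_pop1 = np.array([1.0, 1.2, 0.8, 9.0, 1.1, 1.0, 12.0, 0.9])
rates_pop2 = np.array([1.0, 0.9, 1.1, 8.0, 1.0, 7.5, 1.0, 1.2])

shared = shared_hotspot_fraction(rates_pop1, rates_pop2)
print(shared)
```

In the toy data, population 1 has two hotspot windows and one of them is also a hotspot in population 2, so half of its hotspots are shared.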

Unusual Differentiation in Allele Frequencies
We first developed an approach to select polymorphisms that exhibit large allele frequency differences between the ancestral populations of Sandawe (SAW), Xhosa (XHS) and ‡Khomani (KHO) (see Materials and Methods). We constructed three different panels of ancestry informative markers (AIMs), one each for Sandawe (SAW), Xhosa (XHS) and ‡Khomani (KHO), in which the selected SNPs have a certain level of admixture LD with each other and are spaced at least 1 Mb apart on a chromosome (Materials and Methods). This was to avoid linkage disequilibrium (LD) in the ancestral population. Such background LD could contribute noise (or bias) to the estimation of ancestral allele frequencies and locus-specific ancestry [43]. Thinning the SNPs down to a 1 Mb spacing may reduce the power to detect deviations in ancestry or allele frequency differences that result from selection. Consequently, our strategy to detect regions of unusual differentiation between the admixed southern African populations and their source populations, and unusual deviation in local ancestry, is conservative. We evaluated whether there is an excess of common SNPs with large allele frequency differences, expressed as a χ2 (1 d.o.f.) statistic under a model of neutral genetic drift (see Materials and Methods), between the putative ancestral populations of each admixed southern African population [‡Khomani (KHO), Sandawe (SAW) and Xhosa (XHS); Table 1 and S5 Fig.]. An unusual extent of population differentiation can suggest the action of population-specific natural selection. We observed several SNPs within chromosomal regions (Table 1) for which the evidence of unusual population differentiation was genome-wide significant between the Sandawe (SAW) and Caucasian (CEU) populations (S5 Fig.), and a small number of SNPs (on chromosomes 17q25.1 and 12q24.21) showed unusual genome-wide significant differentiation between SAW and its two other putative ancestral populations, Yoruba (YRI) and Ju\'hoansi (KHS) (S5 Fig.).
Chromosome region 3p11 yielded genome-wide significant evidence of unusual differentiation between the Xhosa (XHS) and Ju\'hoansi (KHS) (p = 9.5e-10, lowest p-value), and between the ‡Khomani (KHO) and Ju\'hoansi (KHS) (p = 7.6e-09, lowest p-value). Furthermore, unusual allele frequency differences between the Yoruba (YRI) and Xhosa (XHS) were identified on chromosome 1q41. No significant signal of unusual allele frequency differences between Yoruba (YRI) and ‡Khomani (KHO) was observed, which may be explained by the Niger-Congo contribution to admixture in the Khoe-San groups, in particular the ‡Khomani (KHO), having occurred too recently to have a significant impact on their allele frequencies. All of the identified candidate SNPs with unusual allele frequency differences lie in or near known genes (Table 1). Their biological functions in the GeneCards database [44] are putatively linked with diseases of high prevalence in southern Africa; their detailed annotations are presented in Table 1.
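To make the two steps in this section concrete, the sketch below shows a greedy thinning of SNP positions to a minimum spacing and one common form of a 1 d.o.f. χ2 statistic for an allele frequency difference under neutral drift. Both are assumptions for illustration: the study's own marker selection and drift model are described in its Materials and Methods and are not reproduced here. The function names, the variance form p(1 − p)(2F + 1/n1 + 1/n2), and the example values are hypothetical.

```python
import math


def thin_by_spacing(positions, min_gap=1_000_000):
    """Greedily keep SNP positions (bp, one chromosome) so that adjacent
    kept markers are at least `min_gap` apart."""
    kept = []
    for pos in sorted(positions):
        if not kept or pos - kept[-1] >= min_gap:
            kept.append(pos)
    return kept


def drift_chi2(p1, p2, n1, n2, fst):
    """Chi-square (1 d.o.f.) statistic and p-value for the difference
    between allele frequencies p1 and p2.

    Assumed null variance: p(1 - p) * (2 * fst + 1/n1 + 1/n2), where p is
    the mean frequency, n1 and n2 are numbers of sampled chromosomes, and
    fst measures the neutral drift between the two populations.
    """
    p_bar = (p1 + p2) / 2.0
    variance = p_bar * (1.0 - p_bar) * (2.0 * fst + 1.0 / n1 + 1.0 / n2)
    if variance == 0.0:
        return float("nan"), float("nan")  # monomorphic in both samples
    stat = (p1 - p2) ** 2 / variance
    # Survival function of a chi-square with 1 d.o.f.: erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Toy values (not from the study).
kept = thin_by_spacing([100, 500_000, 1_000_100, 1_900_000, 2_000_100])
stat, p_value = drift_chi2(0.9, 0.1, n1=50, n2=50, fst=0.01)
print(kept, stat, p_value)
```

A larger drift parameter inflates the null variance, so the same frequency difference becomes less surprising between more strongly diverged populations.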

Local Ancestry in XHS, SAW and KHO
We selected the best proxy parental populations of Xhosa (XHS) based on a pool of Click-speaking and Bantu-speaking populations using PROXYANC [45]. Yoruba (YRI) and Ju\'hoansi (KHS) were chosen as best proxy ancestral populations for Xhosa (XHS). Similarly, among the populations in the study, Yoruba (YRI), European (CEU) and Ju\'hoansi (KHS) were chosen as best non-San, European and San proxy ancestral populations for both ‡Khomani (KHO) and Sandawe (SAW) (Materials and Methods). Using AIMs panels, LAMP-LD [46] was employed to estimate the distribution of genetic contributions of ancestry across the genome (Materials and Methods) to provide additional reassurance from our data that we obtain unbiased results in the absence of possible background LD. The average locus-specific Ju\'hoansi (KHS) and Yoruba (YRI) ancestry proportions across the Xhosa (XHS) samples were estimated to be 27% ± 3.1% and 73% ± 3.1% (mean ± SD), respectively. We obtained 12% ± 0.8%, 77% ± 1.1% and 11% ± 0.9% (mean ± SD) locus-specific Yoruba (YRI), Ju\'hoansi (KHS) and Caucasian (CEU) average ancestry contributions, respectively along the genome of the ‡Khomani (KHO). For the Sandawe (SAW), the locus-specific ancestry proportions were 12% ± 0.9%, 70% ± 0.7% and 18% ± 1.0% for Yoruba (YRI), Ju\'hoansi (KHS) and Caucasian (CEU) average ancestry, respectively. The   [43,47,48,49,50,51]. Here, we considered not only the regions of strong deviation from ancestry, but we also implemented an approach that is now incorporated in PROXYANC [45] to test for unusual deficiency or excess ancestry using the inferred locus-specific ancestry across the genomes of admixed populations. The loci showing unusual ancestry patterns, i.e. four standard deviations above (excess ancestry) or below (reduced ancestry) the genome-wide average, were identified as candidates of post-admixture natural selection (Materials and Methods).
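The four-standard-deviation rule described above can be sketched as follows, assuming the locus-specific ancestry estimates for one source population are available as an individuals × loci array. The function name, the use of the sample standard deviation across loci, and the simulated values are assumptions for illustration and are not the PROXYANC implementation.

```python
import numpy as np


def ancestry_outliers(local_ancestry, n_sd=4.0):
    """Return indices of loci with excess or reduced average ancestry.

    local_ancestry: array of shape (individuals, loci) holding the
    proportion of ancestry from one source population at each locus.
    """
    per_locus = np.asarray(local_ancestry, dtype=float).mean(axis=0)
    z = (per_locus - per_locus.mean()) / per_locus.std(ddof=1)
    excess = np.where(z >= n_sd)[0]
    reduced = np.where(z <= -n_sd)[0]
    return excess, reduced, z


# Simulated data (not from the study): 40 individuals, 500 loci, with a
# genome-wide average ancestry near 27% and one depleted locus.
rng = np.random.default_rng(2)
ancestry = np.clip(rng.normal(0.27, 0.03, size=(40, 500)), 0.0, 1.0)
ancestry[:, 100] = 0.05  # hypothetical locus with reduced ancestry

excess, reduced, z = ancestry_outliers(ancestry)
print(excess, reduced)
```

Because the outlier itself inflates the standard deviation, a robust spread estimate could be substituted if many loci deviate; that choice is left open here.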

Identification of Regions of Unusual Excess or Reduced Ancestry in the Xhosa (XHS) population
Examining the genome-wide distribution of ancestry in Xhosa (XHS), we detected signals of natural selection after admixture (Table 2). We identified a region on chromosome 3p11 (chr3; size: 17,184 bp; p = 1.4e-10) with strongly reduced Ju\'hoansi (KHS) ancestry in Xhosa (XHS) (Table 2). This region showed a genome-wide significant deviation in ancestry, suggesting a signal of selection after admixture. The SNP in the 3p11 region with the lowest p-value, rs4858960, is associated with POU1F1, which in turn interacts with five other genes [52]: ETS1, NR3C1, JUN, NR1I3 and MED1. These genes are known to play a role in a metabolic pathway that positively affects growth traits and hormone deficiency [53]. Furthermore, the 3p11 region showed strong differences in allele frequencies between Xhosa (XHS) and Ju\'hoansi (KHS) (p = 9.5e-10) (Table 1). Since San and Khoe-San communities have undergone a sharp population decline in their history, this differentiation suggests an environmental pressure that the San ancestors of the Xhosa (XHS) may have experienced before population admixture, and we speculate that it reflects a possible adaptation of Xhosa (XHS) to the local environment. Mutations in the POU1F1/PIT1 gene, which encodes a pituitary-specific transcription factor, affect the development and function of the anterior pituitary and lead to combined pituitary hormone deficiency [53].
That some genes in these regions are associated with ‡Khomani (KHO)- and Sandawe (SAW)-specific high-risk diseases (such as malaria) [53] suggests a functional role that these disease-related genes (or other genetic elements in these regions) might have played in their migration and particularly local adaptation due to such selective pressure resulting from shared gene-culture co-evolution and cultural practices in Bantu-speaking and Click-speaking populations. Overall, in the results of genome-wide allele frequency differences between Yoruba (YRI) and these two admixed populations (Tables 1, 2 (Tables 2 and 3). Importantly, these two regions (Tables 2 and 3) are also associated with some important diseases such as breast cancer, lung cancer, tumour inflammation, diabetes mellitus, Parkinson's and other diseases [44,53]. Although these regions have been associated with diseases, there is no indication of whether this points to any mechanistic association. However, it is tempting to speculate that factors such as food, pathogens, and lifestyle could also be responsible for such reduction in ancestry and may therefore play a role

Copy Number Variation
Our approach to analyzing copy number variation in southern African populations involved the detection of known copy number polymorphisms (CNPs) using a Gaussian mixture model, and the identification of potential novel copy number variants (CNVs) using a Hidden Markov Model (HMM) (S5 Text). The number of CNPs (S5 Text) in Yoruba (YRI) is greater than that found in the European (CEU) and the southern African populations (Table 4). The former is probably the result of bottlenecks in non-Africans and subsequent loss of CNPs of low frequency [54,55,56], whereas the latter is likely the result of ascertainment bias. Given that CNP probes were ascertained in HapMap populations (including Yoruba (YRI)), lower levels of CNP diversity are expected for populations that are divergent from the ascertained populations. However, southern African populations, which are approximately matched for sample size, show marked differences in the distribution of the number of CNPs, particularly the San [Ju\'hoansi (KHS)], who have fewer CNPs than other southern African populations (Table 4). Distributions of derived allele frequencies of CNPs suggest higher purifying selection on duplications (S7 Fig.). In contrast, however, there appears to be little difference in the degree of purifying selection on duplications and deletions in novel CNVs detected with the HMM (S7 Fig. (A)). We detected a total of 1873 CNVs (Table 5), of which 1231 were deletions. Only 137 of the CNVs were singletons, with 87 deletions and 50 duplications (Table 6). A total of 397 were novel with respect to the Database of Genomic Variants [55,56,57,58]. At least 157 of these were unique CNVs, which occurred in only one population. The number of CNVs per individual is generally similar between populations (S7 Fig. (B)), except for the San, who had significantly fewer deletions than other populations (e.g. Herero (HER) vs Ju\'hoansi (KHS): Student's t-test, t(20) = 22.4, P = 1.3e-15).

Discussion
In this study, we have conducted a systematic population genomics survey and investigated the demographic histories of indigenous southern African populations, making it possible to address questions about the signature of selection prior to and following purported ancient admixture events. Consistent with previous studies [16,26,33,34,35,39,40], we demonstrated stratification among indigenous southern African populations. Both the geographic distribution of genetic variation and the population structure suggest a complex human population history within the African continent in general, and in southern and eastern Africa in particular. Incorporating the data from other Click-speaking populations from previous studies [16,26,33,34,39,40] together with those from our 25 Ju\'hoansi (KHS) subjects, it was possible to investigate the relationship between Click-speaking and southern Bantu-speaking populations thought to represent an early diverging branch of modern humans. The admixture analyses, particularly those of southern African populations, lend support to gene flow between San and Niger-Congo-speaking populations due to their contact following migrations of Bantu-speaking populations across the continent [17,18,26,27,33,34,35]. Consistent with previous studies [16,26,33,34,39,40], our admixture (Fig. 1) and TreeMix analyses (S1 Fig.) suggested a division between south-west (San) and south-east (Khoe-San, mostly admixed) populations. Our findings confirm an ancient link between San and some eastern African populations, including Sandawe, consistent with previous findings [16,26,34,35,39,40].
The Eurasian ancestral components in south-east Khoe-San and some eastern Bantu-speaking populations (such as Sandawe, Hadza) may be a consequence of an early Eurasian genetic contribution into Africa [16,28,35]. Furthermore, the f3 statistic test (S3 Table) confirms that southern Bantu-speaking populations, in particular Xhosa (XHS), are two-way admixed, and that both ‡Khomani (KHO) and Sandawe (SAW) are at least three-way admixed. The San (KHS) exhibit higher levels of homozygosity (S9 Table), increased relatedness (S9 Table) and higher proportions of monomorphic SNPs (S8 Table) than other African populations. However, we have shown that ascertainment of markers in a divergent population results in a reduction of diversity in the genotyped population, probably the result of polymorphisms arising after the divergence of the ascertained and genotyped populations, and the loss of polymorphisms in the genotyped population through fixation. Improved statistical models are therefore needed for the comparison of populations that have varying degrees of divergence from the population in which markers were ascertained.
Our copy number analysis included identification of both known CNPs, which are copy number loci previously identified in HapMap populations [55,56,58], and putatively novel CNVs. CNPs are highly ascertained, since they have been selected to be polymorphic and segregating at allele frequencies > 1% in HapMap populations [56]. CNVs, however, are less ascertained and should have more similar levels of polymorphism in all of the studied populations [55]. In the case of CNVs, deletions are observed more frequently than duplications. This appears to be inconsistent with the proposal that deletions are under stronger purifying selection [58,59,60], which has also been inferred previously from a lower degree of overlap between deletions and both genomic regions [59] and disease-related genes [59]. However, the disparity in the number of deletion and duplication CNVs probably reflects the relative difficulty of detecting the latter, due to a smaller relative change in copy number (3:2 versus 2:1) [59], rather than stronger purifying selection on duplications. In the southern African data, deletion and duplication CNVs have similar distributions of derived allele frequencies, suggesting little difference in the relative degree of purifying selection. The number of deletion CNVs per individual differs markedly between the San (KHS) and other African populations. This may be an effect of sample size; however, the Herero (HER), with a sample size similar to that of the San (KHS) for copy number calling, show no reduction in the number of deletions. In addition, copy number variants called for the Zulu (ZUL) panel, with only 20 samples, were more than 99.9% concordant at normal, and 81.6% concordant at abnormal, copy number regions with those called in conjunction with other Bantu populations. Alternatively, some hybridization probes may have lower intensities in the San (KHS) due to probe-target mismatch mutations.
However, such probe effects are likely to cause increased numbers of deletions in the San (KHS). Finally, population demographic and selective effects may cause differences in the number of deletion CNVs. In summary, copy number results suggest San (KHS) to be unique, although they should ideally be validated using trios, as shown previously [55,56]. Haplotype blocks show very similar patterns of linkage disequilibrium between African populations, with this collective group having substantially shorter haplotype blocks, and less linkage disequilibrium, than non-African populations. For instance, the region surrounding the lactase (LCT) gene, which is associated with lactose tolerance and known to have undergone a selective sweep in Europeans [7], shows strong linkage disequilibrium in Europeans, yet not in southern African populations (S2 Fig. and S4 Fig.). Khoe-San, however, appear to have higher levels of linkage disequilibrium around LCT than the other African populations [particularly the Sotho/Tswana (STS) and Zulu (ZUL); S2 Fig.]. This may be due to a weak selective sweep or the result of gene admixture with the San (KHS), a pastoral group from Namibia known to be lactose tolerant [29].
In addition, it was particularly interesting to examine the signature of selection in the indigenous and admixed southern African populations, including ‡Khomani (KHO), Xhosa (XHS) and Sandawe (SAW), given the historically high mortality of the San population. Following the recommendation of Bhatia et al. [61], we additionally implemented two strategies to detect possible evidence of population-specific natural selection in southern African populations. The first strategy involved evaluating whether there is an excess of common SNPs with large allele frequency differences between admixed southern African populations, including ‡Khomani (KHO), Sandawe (SAW) and Xhosa (XHS), and their purported parental populations. The power of this analysis was based on an approach we developed to select three panels of 502 SNPs with at least 1 Mb spacing between adjacent genetic markers on each individual chromosome. Several SNPs in chromosomal regions for which there is evidence of unusual population differentiation between Sandawe (SAW) and Caucasians (CEU) are displayed in Table 1. Importantly, most of the signals of selection identified through this strategy are linked with specific high-risk diseases such as malaria, influenza, tuberculosis, and HIV/AIDS, which have a high prevalence in southern African populations (e.g. in the Sandawe, ‡Khomani and Xhosa populations) (Table 1). The allele frequency differences between southern African populations (including some putative parental populations) follow the null distribution predicted by neutral drift as a consequence of the recent origin of southern African population structure. This may yield a risk of false positive associations due to population stratification in disease association studies, despite the fact that there are differences between southern African populations [62].
The second strategy to detect possible evidence of population-specific post-admixture selection involved a signal of unusual excess or deficiency of ancestry in the admixed southern African populations [‡Khomani (KHO), Sandawe (SAW) and Xhosa (XHS)]. The recent studies by Bhatia et al. [61,63] showed that loci with significant deviation in local ancestry (from the genome-wide average) may be due to insufficient correction for multiple hypothesis testing and/or to possible systematic errors in local ancestry inference. We employed the minor allele frequencies from the correct proxy ancestral populations of the admixed population to correct for possible systematic errors in the inferred local ancestry that may lead to false positive deviations in local ancestry. Moreover, our study did not rely only on the deviation (more than 4.0 standard deviations) in local ancestry from the genome-wide average; we additionally used the distribution of the difference in locus-specific ancestry along the genome of the admixed population to evaluate the genomic regions showing unusually excessive or reduced ancestry, which are likely to be signatures of natural selection after admixture [43,48,49,50,51].
Several recent studies have detected excessive or reduced ancestry contributions in admixed populations as signals of post-admixture selection, using reference ancestral parental populations [43,48,49,50,51]. Our study used selected best proxy ancestral populations and AIMs panels for our admixed southern African populations, and we extended previous approaches to test for unusually increased or decreased ancestry contribution along the genome. We identified three and four regions showing a significant excess of Yoruba (YRI) ancestry in Sandawe (SAW) and ‡Khomani (KHO), respectively (Tables 2 and 3). Three other regions showed unusually reduced Caucasian (CEU) and San (KHS) ancestry in both ‡Khomani (KHO) and Sandawe (SAW) (Tables 2 and 3). Since some of the genes in these regions are linked with specific high-risk diseases such as malaria in the ‡Khomani (KHO) and Sandawe (SAW), as has also been noted in the recent study by Gurdasani et al. [35], it is plausible that these disease-related genes might have played a role in population adaptation historically. Among the identified genomic regions, the 12q24.1 region was found in both strategies for detecting signals of natural selection, supporting evidence of environmental pressures that the ‡Khomani (KHO) and Sandawe (SAW) experienced. Furthermore, two other candidate regions pointing to natural selection were identified in both ‡Khomani (KHO) and Sandawe (SAW), showing strong deficiency of European and San ancestry components, and also an unusual population differentiation in these regions. These two regions are also linked with some important diseases such as breast cancer, lung cancer, inflammation, diabetes mellitus and Parkinson's disease [53], which are known to occur at a relatively higher prevalence in European populations, when compared to indigenous southern African populations [59].
African, and particularly southern and eastern African populations, face a heavy burden of diseases including HIV/AIDS, tuberculosis and malaria, and a growing burden of non-communicable diseases [17]. Of note, all the reported regions with signals of selection are in admixture LD and show significant deviation in average local ancestry (or an unusual difference in allele frequency). In addition, our constructed AIMs panels for southern and eastern admixed populations may potentially be utilized for further admixture mapping studies in these populations. Nevertheless, further investigations are required to reveal the targets and agents of selection that have played important roles in shaping the admixed gene pool of these southern and eastern African admixed populations. With extensive admixture, both between non-San and San populations, and between African and non-African populations, southern and eastern African populations have great potential for the identification of genes that determine susceptibility to both communicable and non-communicable diseases, and for understanding how African genetic variation relates to variability in response to drugs and treatment.
The southern Bantu and Khoe-San populations are 'admixed', and future genome-wide studies will need to correct for this stratification or may need to use locus-specific ancestry to increase power in association studies. Admixture mapping in African-American and some other three-way admixed populations (such as Latinos and Puerto Ricans) has been successful for some disease traits [43,51]. Since the admixed southern African populations have similar admixture proportions to admixed American populations, we hypothesize that admixture mapping would likely be a successful approach in many southern Bantu and Khoe-San cohorts, and particularly in the Xhosa, ‡Khomani and Sandawe.
A large proportion of the currently active genomic studies being conducted as part of the recently launched H3Africa programme (H3Africa, http://h3africa.org/) and the more recently described African Genome Variation Project [35] involve genome-wide association studies [64]. A significant number of these studies involve large collections of sub-Saharan African subjects, and would benefit from this knowledge.

Genetic Marker Selection: Relationship between Population Differentiation and Admixture Linkage Disequilibrium
Consider a pair of populations k and l from a pool of K ancestral populations of an admixed population, and assume that the minor allele frequencies at SNPs i and j are greater than 0.005. Similar to Glaubitz et al. [67], we defined the admixture linkage disequilibrium in terms of m, the ancestral proportion, and $\delta_i$ and $\delta_j$, the differences in allele frequency between populations k and l at SNPs i and j, respectively. Assuming that for each pair of SNPs i and j there is no linkage disequilibrium in the ancestral populations, Equation (c) follows. At a given pair of SNPs i and j in the admixed population, Equation (c) establishes a relationship between the observed linkage disequilibrium $L_{ij}$ in a recently admixed population and ancestral population differentiation. One can expect the ratio (part 2) in Equation (c) to be closer to 1 when the two reference ancestral populations contributed to the admixture of the related admixed population. Equation (c) is a total ancestry content (AC) at a pair of SNPs i and j. Let $I_{ij}$ denote the ratio in Equation (c). Assuming a uniform ancestral proportion, and summing Equation (c) over all possible pairs of proxy ancestral populations, we obtain the ancestry informativeness $I_{ij}$ of each pair of SNPs i and j. Let M be the total number of SNPs. For $i \in \{1, \ldots, M\}$, let $N_i$ be the total number of SNPs j in pair-wise LD with i, where $j \neq i$, $\forall j \in \{1, \ldots, M\}$. We obtain the ancestry informativeness at SNP i as a weighted sum of $I_{ij}$ over these $N_i$ pairs. We applied this method to construct the AIMs panel for Xhosa, ‡Khomani and Sandawe. This approach of selecting ancestry informative markers (AIMs) is implemented in the PROXYANC program (http://web.cbio.uct.ac.za/proxyanc/).
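The link between ancestral differentiation and admixture linkage disequilibrium can be illustrated with the standard two-population result: when neither ancestral population carries LD between SNPs i and j, mixing them in proportions m and 1 − m creates LD equal to m(1 − m)δ_i δ_j. The sketch below is a minimal illustration of that textbook identity only; the function names are hypothetical, and it is not claimed to reproduce the exact expressions or the informativeness score implemented in PROXYANC.

```python
def admixture_ld(m, p_i_k, p_j_k, p_i_l, p_j_l):
    """LD created by mixing populations k and l, assuming no LD within either.

    m is the ancestral proportion from population k; p_i_k is the allele
    frequency at SNP i in population k, and so on.
    """
    delta_i = p_i_k - p_i_l  # allele frequency difference at SNP i
    delta_j = p_j_k - p_j_l  # allele frequency difference at SNP j
    return m * (1.0 - m) * delta_i * delta_j


def admixture_ld_direct(m, p_i_k, p_j_k, p_i_l, p_j_l):
    """Same quantity computed directly from haplotype frequencies in the mixed pool."""
    # With no LD inside each ancestral population, the two-SNP haplotype
    # frequency there is just the product of the allele frequencies.
    hap = m * p_i_k * p_j_k + (1.0 - m) * p_i_l * p_j_l
    p_i = m * p_i_k + (1.0 - m) * p_i_l
    p_j = m * p_j_k + (1.0 - m) * p_j_l
    return hap - p_i * p_j
```

The two functions agree for any inputs, which is why SNP pairs with large δ_i and δ_j in the true ancestral populations are expected to show strong LD in a recently admixed population.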

Screening for Close Relatives and Admixture Analysis
We estimated the pair-wise genome-wide level of relatedness using a previously described relatedness statistic [67] applied to a random selection of 2500 putatively unlinked SNP markers with minor allele frequencies between 0.3 and 0.5. These SNPs were randomly selected across each chromosome, with a minimum spacing of 1 Mb, to prevent inclusion of SNPs in strong linkage disequilibrium, which would violate the assumption of marker independence. Principal Component Analysis (PCA) was performed, using EIGENSOFT [68], on the combined HapMap3, HGDP, other African data from [26,34,39,40] and southern African genotypes, which included a total of 50K SNPs shared between these different panels. In addition to the PCA, an $F_{ST}$ matrix was generated using the smartpca program. Admixture analysis [68,69] was performed on combined panels based on 900K SNPs using the ADMIXTURE program [69]. To evaluate the genetic relationships among the above populations, we used the TreeMix software [40] to infer the structure of a graph from genome-wide allele frequency data and a Gaussian approximation to genetic drift. Furthermore, to identify some aspects of ancestry not captured by the tree, we also examined the residuals of the model's fit and sequentially added the migration events to the tree. We also used copy number variants as a population marker in an additional population structure analysis, but only for HapMap3 and southern African samples for which the intensity data (CEL files) necessary for copy number calling were publicly available. Copy number variants, detected with a Hidden Markov model that identifies novel copy number variation [55], were preferred over previously described copy number polymorphisms, since these are affected to a lesser extent by ascertainment bias. 
We randomly selected a total of 2869 copy number variable positions, corresponding to 1 marker every 1Mb, across all chromosomes and specified copy number alleles as either a deletion, normal or duplicated state dependent on the copy number state called in the Birdseye algorithm [55]. We only selected simple copy number variants consisting of either a deletion or duplication, but not both.
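The random selection of putatively unlinked markers described above (a minor allele frequency window plus a minimum physical spacing) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the `(chrom, pos, maf)` tuple layout and the greedy random-order strategy are hypothetical choices for the example, not the procedure actually used in the study.

```python
import random


def select_unlinked_snps(snps, n_target, min_spacing=1_000_000,
                         maf_range=(0.3, 0.5), seed=0):
    """Randomly pick up to n_target SNPs that are well separated on each chromosome.

    snps is an iterable of (chrom, pos, maf) tuples (hypothetical layout).
    Any two selected SNPs on the same chromosome are at least
    min_spacing base pairs apart.
    """
    rng = random.Random(seed)
    lo, hi = maf_range
    candidates = [s for s in snps if lo <= s[2] <= hi]
    rng.shuffle(candidates)  # random order, then greedy acceptance

    kept = []
    kept_positions = {}  # chrom -> positions already accepted
    for chrom, pos, maf in candidates:
        positions = kept_positions.setdefault(chrom, [])
        if all(abs(pos - q) >= min_spacing for q in positions):
            positions.append(pos)
            kept.append((chrom, pos, maf))
            if len(kept) == n_target:
                break
    return sorted(kept)
```

Physical spacing is only a proxy for independence; in practice one might also prune on measured LD, which this sketch does not do.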

Relationship between Geographic and Genetic Distance
Here, we used all available southern African population data, including HER, SAN, XHS, XHS, LWK, BUS, ZUL, SAW, a Niger-Congo-speaking population (YRI) and a non-African population (CEU). We used the Haversine formula to compute the geographic distance (in kilometres) between each pair of populations, based on great circle distances routed through way-points between continents. The way-points used are Egypt (29.998392, 30.999751) and Turkey (41.015472, 27.986336). We then computed the correlation between $F_{ST}$ and geographic distance using linear regression, and analysed the scatter plot of the relationship between $F_{ST}$ and geographic distance. To identify outliers, we computed the perpendicular distance between each point and the regression line, and defined outliers as points whose distance to the regression line is greater than or equal to 0.05 units.
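The distance and outlier steps above can be sketched as follows. This is an illustrative sketch, not the study's code: the function names, the Earth radius constant and the assumption that the way-point coordinates are given as (latitude, longitude) in degrees are ours. Note that a perpendicular distance depends on how the two axes are scaled, so a threshold such as 0.05 is only meaningful once the scaling of geographic distance has been fixed.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(a, b):
    """Great circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def path_km(start, end, waypoints=()):
    """Distance from start to end, routed through the way-points in order."""
    stops = [start, *waypoints, end]
    return sum(haversine_km(p, q) for p, q in zip(stops, stops[1:]))


def fit_line(x, y):
    """Ordinary least squares fit y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def perpendicular_distances(x, y, slope, intercept):
    """Perpendicular distance of each (x, y) point to the fitted line."""
    norm = math.hypot(slope, 1.0)
    return [abs(slope * xi - yi + intercept) / norm for xi, yi in zip(x, y)]
```

For an African to non-African pair, the way-points from the text would be passed as `waypoints=[(29.998392, 30.999751), (41.015472, 27.986336)]`.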

Unusual Difference in Allele Frequency
To minimize deviation from the normality assumption, SNPs with minor allele frequencies < 0.05 were excluded. At a given locus i, the difference $(p^k_i - p^l_i)$ between the observed variant allele frequencies of two populations, k and l, can be approximated as a normal distribution under neutral drift with mean 0 and a variance [60] that scales with $p(1 - p)$ and depends on $F_{ST}$, $N_k$ and $N_l$, where $F_{ST}$ is the genetic distance between populations k and l, $N_k$ and $N_l$ are the total variant allele counts in each population, and p is the ancestral allele frequency, commonly approximated as the average of the two observed variant allele frequencies. To avoid overestimating the degree of differentiation at single SNPs due to sample size differences, we used the $F_{ST}$ estimator of Bhatia et al. [63]. Similar to [60], we tested for an unusual difference in allele frequency, $U_{kl}$, between populations k and l. We applied this method to the data from the Xhosa population using Ju/'hoansi and Yoruba as ancestral populations. We also applied this method to KHO and SAW using KHS, CEU and YRI populations. All gene annotations and associated diseases were obtained using both the GeneCards and MalaCards databases [44,53].
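A test of this kind can be sketched as a 1-degree-of-freedom chi-square on the squared frequency difference. The variance used below, p(1 − p)(2F_ST + 1/N_k + 1/N_l), is an assumption: it is a form commonly used for this type of neutral-drift test and is consistent with the symbols defined above, but it is not confirmed to be the exact expression used in this study. The function name is hypothetical.

```python
import math


def allele_freq_difference_test(p_k, p_l, n_k, n_l, fst):
    """Chi-square (1 df) test for an unusually large allele frequency difference.

    p_k, p_l: observed variant allele frequencies in populations k and l.
    n_k, n_l: total allele counts in each population.
    fst:      genome-wide F_ST between the two populations.
    Assumed null: (p_k - p_l) ~ Normal(0, p(1-p)(2*fst + 1/n_k + 1/n_l)).
    """
    p = 0.5 * (p_k + p_l)  # proxy for the ancestral allele frequency
    if not 0.0 < p < 1.0:
        raise ValueError("average allele frequency must lie strictly between 0 and 1")
    variance = p * (1.0 - p) * (2.0 * fst + 1.0 / n_k + 1.0 / n_l)
    stat = (p_k - p_l) ** 2 / variance
    # Upper tail of a chi-square with 1 df: P(Z^2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

Because many SNPs are tested, the resulting p-values would still need correction for multiple testing before any locus is called unusual.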

Locus-specific Ancestry Inference
We used LAMP-LD to infer locus-specific ancestry in admixed populations [46]. The model in LAMP-LD leverages the structure of linkage disequilibrium in the proxy ancestral populations. LAMP-LD achieved the highest accuracy on both simulated and real data in the study of Puerto Rican and Mexican populations [43]. Here, we applied LAMP-LD to infer local ancestry in three putatively admixed southern African populations: KHO, XHS and SAW. Following the population structure results and the proxy ancestry selection approach developed in PROXYANC [45], YRI, KHS and CEU were selected as reference ancestral populations from a pool of Bantu-speaking, Click-speaking and European populations, respectively. We obtained phased haplotype data by running the Beagle software [70] on KHS, CEU and YRI data. To estimate the distribution of genetic contributions of ancestries to XHS across the genome, we used haplotypes of 80 YRI and 80 KHS. In addition, the haplotypes of 80 YRI, 80 CEU and 24 KHS were used to compute the locus-specific genetic contributions to KHO and SAW using the AIMs panel.

Estimating Excess or Deficiency of Ancestry
Admixed populations provide special opportunities for investigating recent selection. Prior to admixing, the ancestral populations have been isolated geographically, and their genomes may have evolved in distinct environments. Migration of previously isolated populations may have brought individuals of the ancestral populations into an unusual environment, and may consequently have introduced life-style changes or changes in the pathogens they are exposed to. This type of selection may differ from that faced by stationary populations, for which local environmental changes may occur gradually, allowing rare advantageous alleles to increase in frequency [43]. Here, we adopted an approach to detect ancestral signatures of selection by looking in an admixed population for genomic regions that exhibit unusually large deviations in ancestry proportions compared with what is typically observed elsewhere in the genome. Given the genome-wide ancestral proportions, $\alpha_k$, from ancestral populations $k \in \{1, \ldots, K\}$ in N samples of an admixed population, let $\varphi^k_{i,m}$ be the estimated locus-specific ancestry of individual i at genetic marker $m \in \{1, \ldots, M\}$ from the k-th ancestral population. We computed the deficiency or excess of ancestry at each SNP using the estimated admixture proportion as a baseline. We thus define the deficiency/excess of ancestry from ancestral population k at marker m, $d^k_m$, relative to this baseline, where $\varphi^k_m$ is the average locus-specific ancestry at SNP m. $d^k_m$ can be approximated as a normal distribution under neutral drift with mean 0 and empirical variance derived from the distribution of $\varphi^k_{i,m}$ values among the N individuals [43,51]. The corresponding test statistic follows a $\chi^2$ distribution with 1 degree of freedom. A large value of the $\chi^2$ statistic indicates deviation from the null model, and a value 4 standard deviations above (excess of ancestry) or below (deficiency of ancestry) the genome-wide average suggests the action of natural selection post-admixture [51]. 
Summing this statistic over all SNPs assigned to a gene, we obtain the deficiency/excess of ancestry at the gene level. This allows us to assess the statistical significance of a deficiency/excess of ancestry at the SNP and gene level. To assess an unusual difference in deficiency/excess of ancestry between a pair of ancestral populations, given SNPs $m \in \{1, \ldots, M\}$ within a gene, we compute a two-sample t-statistic with M − 2 degrees of freedom, assuming equal sample size N. For a pair of populations $k \neq l \in \{1, \ldots, K\}$, we then compute the overall unusual difference in deficiency/excess of ancestry.
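The per-marker deviation described above can be sketched as follows. This is a minimal sketch under explicit assumptions, not the study's implementation: the deviation is taken as the average local ancestry minus the genome-wide proportion, the standard error is taken from the spread of local ancestry among the N individuals, and the 1-degree-of-freedom chi-square is the squared z-score. The function name and the nested-list input layout are hypothetical.

```python
import math
from statistics import mean, variance


def ancestry_deviation(phi, alpha_k, z_cutoff=4.0):
    """Per-marker excess or deficiency of ancestry from one ancestral population.

    phi[i][m] is the estimated local ancestry of individual i at marker m
    (a value between 0 and 1); alpha_k is the genome-wide proportion.
    Needs at least two individuals.
    """
    n = len(phi)
    results = []
    for m in range(len(phi[0])):
        column = [phi[i][m] for i in range(n)]
        d = mean(column) - alpha_k           # deviation from the baseline
        se = math.sqrt(variance(column) / n)  # assumed: standard error of the mean
        if se > 0.0:
            z = d / se
        else:
            z = 0.0 if d == 0.0 else math.copysign(math.inf, d)
        results.append({"marker": m, "d": d, "z": z, "chi2": z * z,
                        "flagged": abs(z) >= z_cutoff})
    return results
```

A positive `d` with `flagged` set corresponds to excess ancestry and a negative one to deficiency; a gene-level score could be formed by summing `chi2` over the markers assigned to a gene, although neighbouring markers are correlated and such a sum would need careful calibration.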

Enrichment Analysis of Scans for Selection
In order to summarize the types of loci and explore the potential adaptive genetic architecture implicated by our genome-wide selection scans, we identified all protein-coding genes within 40 kb downstream or upstream of SNPs showing signatures of selection. To achieve this, we downloaded genomic coordinates for all genes from the NCBI FTP server (ftp://ftp.ncbi.nih.gov/), retaining only entries for the human reference sequence and protein-coding genes. We updated genomic coordinates to the latest assembly using the LiftOver tool on GALAXY (https://main.g2.bx.psu.edu/). We obtained the predicted human genes from the GeneCards database [44], and investigated the roles of genes and cells in disease processes using the MalaCards database [44,53].
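The 40 kb window lookup described above can be sketched as a simple interval test. The function name, the tuple layouts and the gene names in the usage example are hypothetical; the sketch assumes SNP and gene coordinates are already on the same assembly.

```python
def genes_near_snps(snps, genes, window=40_000):
    """Map each SNP to the genes lying within `window` bp up- or downstream.

    snps:  list of (chrom, pos) tuples.
    genes: list of (name, chrom, start, end) tuples with start <= end.
    """
    hits = {}
    for chrom, pos in snps:
        hits[(chrom, pos)] = sorted(
            name
            for name, g_chrom, start, end in genes
            # The SNP is inside the gene or within `window` bp of either end.
            if g_chrom == chrom and start - window <= pos <= end + window
        )
    return hits
```

For example, with the toy input `genes = [("GENE_A", "12", 100_000, 120_000)]`, a SNP at `("12", 155_000)` is assigned to `GENE_A` (35 kb downstream of its end), while a SNP at `("12", 170_000)` is not. For genome-wide scans an interval index would be faster than this quadratic loop.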

Supporting Information
S1 Fig. (A-B) Maximum likelihood tree of indigenous southern Africa populations, including a proxy European ancestral population for southern Africa populations. (C) Residual fit from the maximum likelihood tree is plotted and the standard error of the entries in the covariance matrix is represented ten times on the scale bar.
Table. Populations that were included in population structure analysis of South African Coloureds (SAC). The southern Bantu-speakers in this study are represented by the Sotho-Tswana (STS) inhabiting the central plateau of southern Africa; the Nguni, represented by Zulu (ZUL), Xhosa (XHS) speakers, inhabiting KwaZulu Natal on the east coast and the Eastern Cape, and the Herero (HER) inhabiting northern Namibia, respectively (S1 Table). The eastern Bantu-speakers are mostly populations inhabiting the central lake regions and the east coast of Africa. (DOC)
S2  Table. Power to recover a recombination hotspot after a population bottleneck when SNPs have been ascertained in (a) the genotyped population and (b) a population that has diverged from the genotyped population τ generations before the present. Power is estimated as the proportion of simulated datasets in which a recombination hotspot is inferred with strength 50 times the background recombination rate, and which lies within 25kb of the simulated hotspot. (DOC)
S8 Table. Number of monomorphic SNPs, and proportion of SNPs monomorphic with respect to the total number of SNPs shared (n = 798807) between the HapMap and southern African datasets.