Bonobos Fall within the Genomic Variation of Chimpanzees

To gain insight into the patterns of genetic variation and evolutionary relationships within and between bonobos and chimpanzees, we sequenced 150,000 base pairs of nuclear DNA divided among 15 autosomal regions as well as the complete mitochondrial genomes from 20 bonobos and 58 chimpanzees. Except for western chimpanzees, we found poor genetic separation of chimpanzees based on sample locality. In contrast, bonobos consistently cluster together but fall as a group within the variation of chimpanzees for many of the regions. Thus, while chimpanzees retain genomic variation that predates bonobo-chimpanzee speciation, extensive lineage sorting has occurred within bonobos such that much of their genome traces its ancestry back to a single common ancestor that postdates their origin as a group separate from chimpanzees.


Introduction
In humans, extensive data has been collected for large numbers of individuals (e.g. [1]; HapMap project, www.hapmap.org; NIEHS SNP project, http://egp.gs.washington.edu/; the 1000 genomes project). By contrast, data on DNA sequence variation in the closest non-extinct relatives of humans, chimpanzees (Pan troglodytes) and bonobos (Pan paniscus), are limited to a handful of studies that have collected genetic data across a handful of genomic regions to discern basic population and demographic parameters [2,3,4,5,6,7]. While these studies provide an important first step, their geographic sampling was limited, so they are likely to have captured only a fraction of the species-wide variation that may exist within natural populations. In particular, variation in bonobos and some populations of chimpanzees has been examined in only a few studies of mostly captive-born individuals [6,8]. In chimpanzees, four ''subspecies'' are commonly recognized which correspond to the geographic ranges where these groups are found ( Figure 1): western chimpanzees (Pan troglodytes verus); Nigerian-Cameroonian chimpanzees (P.t. ellioti or formerly P.t. vellerosus); central chimpanzees (P.t. troglodytes); and eastern chimpanzees (P.t. schweinfurthii). Little or no morphological and behavioral differences distinguish the four groups from each other [9,10,11]. In contrast, bonobos (Pan paniscus) have no recognized subspecies but clearly differ from chimpanzees in morphology and behavior [11,12,13,14]. Genetic variation in bonobos has been examined in only a few, usually captive-born individuals [2,3,4,5,6,15,16]. We have collected DNA from 20 wild-born bonobos and 44 wild-born and 14 captive-born chimpanzees representing all four chimpanzee groups (see Materials and Methods) and sequenced 15 noncoding autosomal regions each encompassing about 10,000 basepairs (bp), as well as the complete mitochondrial (mt) DNAs from these individuals.

Genetic diversity
To characterize the patterns of genetic diversity within groups, we computed various summary statistics ( Table 1). As previously shown [6,8,17,18,19], nucleotide diversity and hence effective population sizes are highest in central and eastern chimpanzees and lowest in western chimpanzees, while bonobos have diversity levels similar to western chimpanzees, in agreement with previous studies of zoo [2,6] and wild-caught individuals [20]. Diversity levels in Nigerian-Cameroonian chimpanzees are intermediate to those of central and eastern chimpanzees on the one hand and western chimpanzees and bonobos on the other.
We assessed how well a null model of constant population size and random mating fit each population [21]. Only the central chimpanzee samples deviate from this null model (p,1/10000) due to an excess of rare alleles (Table 1 and Table S1) that suggests population growth as has been previously reported [3,4,22].

Genetic relationship between populations
Chimpanzee populations. Figure 2 shows the phylogeny estimated for each genomic region. While central, eastern and Nigerian-Cameroonian chimpanzees are highly interspersed in these phylogenies and never form monophyletic groups, western chimpanzees are monophyletic in 6 of the 16 trees ( Figure 2; Table 2). As expected from its lower effective population size and as has previously been reported based on smaller numbers of individuals and regions [20], mtDNA shows more clustering than the autosomal DNA sequences. We note that four individuals designated as central chimpanzees fall within the eastern chimpanzee cluster for mtDNA. Three of these individuals were confiscated in the Democratic Republic of Congo, where both central and eastern chimpanzees can be found. Below, we report the results with all individuals included but have checked that the results are not significantly changed when these four individuals are excluded from the analyses wherever necessary.
When we combine all autosomal SNPs, the level of genetic differentiation between pairs of chimpanzee populations (as assessed by F st ) was highest for comparisons involving western chimpanzees (Table 3). When the program Structure was applied to the whole dataset, the highest likelihood was obtained for the model with four populations (Figure 3), the four populations being bonobos, western, central, and eastern chimpanzees. While the picture is clear for western chimpanzees, it is more complex for central and eastern chimpanzees: six eastern chimpanzees are inferred to have more than 20% ancestry from central chimpanzees and among the 16 central chimpanzees that according to the mtDNA are of central African origin, seven have more than 20% ancestry from eastern chimpanzees. In a principal component analysis (PCA) using all individuals (Figure 4), the first PC separates bonobos from chimpanzees while the second PC, that explains 14.3% of the variation, separates western chimpanzees from the other chimpanzees (p,10 28 ) (Figure 4). The third PC, explaining 6.1% of the variation, tend to separate eastern and central chimpanzees. Note that the four central chimpanzees falling with eastern chimpanzees on the right of the graph are the same individuals that group with eastern chimpanzees in the mtDNA phylogeny.
Of particular interest is that none of the autosomal phylogenies support the monophyly of the Nigerian-Cameroonian chimpanzees ( Figure 2; Table 2). In fact, Nigerian-Cameroonian individuals group with individuals from all other three populations ( Figure 2). The F st based on the autosomal sequences show that Nigerian-Cameroonian chimpanzees are more differentiated from western chimpanzees (F st = 0.37) than from central and eastern chimpanzees (F st = 0.16 and 0.21, respectively) ( Table 3) and the Structure analysis fails to suggest that the Nigerian-Cameroonian chimpanzees are a separate population (Figure 3), and instead suggests that they have more than 50% ancestry shared with central chimpanzees. Likewise, in the PCA, the four Nigerian-Cameroonian individuals fall within central and eastern chimpanzee variation, while being separated from western chimpanzees ( Figure 4).
An earlier study of one wild-caught Nigerian-Cameroonian individual showed it to be related to both western and central chimpanzees based on nuclear microsatellites [2]. However, in a recent study [7] of 94 chimpanzees, including 32 designated as Nigerian-Cameroonian, the weight of evidence is supporting three major groups, which are western chimpanzees, Nigerian-Cameroonian chimpanzees and central/eastern chimpanzees. One possible reason for this apparent discrepancy may be the limited number of Nigerian-Cameroonian individuals in our study. However the nuclear DNA sequences studied here are expected to reflect historic events over a much greater time depth than microsatellites. Thus, our data suggest that any genetic differentiation of Nigerian-Cameroonian populations is likely to have occurred relatively recently in the evolution of chimpanzees. Regardless, it is important to note that both nuclear data sets fail to support the close relationship of Nigerian-Cameroonian chimpanzees and western chimpanzees observed for mitochondrial DNA ( Figure 2) [23,24,25].

Chimpanzees and Bonobos
When all SNPs in the nuclear data are combined, more than 50% of the variation is between bonobos and chimpanzees (0.54#F st #0.74; Table 3) and the Structure (Figure 3) as well as the PCA analyses ( Figure 4) support a separation between bonobos and common chimpanzees as does a concatenated phylogenetic analysis of all of autosomal region ( Figure S1). However, the individual autosomal phylogenies reveal contrasting patterns between bonobos and chimpanzees. Bonobos are strongly monophyletic (i.e., all bonobos are more closely related to each other than to any chimpanzee) for all but one genomic region ( Figure 2; Table 2). The one possible exception (Figure 2 h) also shows a high posterior probability for the bonobo clade (P = 0.913). Thus, much of the bonobo genome coalesces to a common ancestor since the population split from chimpanzees. In contrast, monophyly of all chimpanzees is only observed in a small subset of the phylogenies (4 out of 16) and for 5 of the 16 regions bonobos unambiguously fall within the diversity of chimpanzees (i.e., some chimpanzees are more closely related to bonobos than to other chimpanzees) ( Figure 2). This is also reflected in the patterns of allele sharing (Table S2) where there are more sites that unite bonobos to the exclusion of chimpanzees (than vice versa) and more sites for which bonobos fall within the diversity of chimpanzees (than vice versa).
Phylogenetic lineage sorting subsequent to the separation of two species necessarily proceeds through three distinct phases: complete lack of sorting (polyphyly), sorting within one species (paraphyly), and complete sorting (reciprocal monophyly) [26,27,28,29]. Incomplete lineage sorting has been seen for many other closely related organisms (e.g. [30,31,32,33]). In the human genome, about 32% of regions share a more recent ancestry with gorilla than with chimpanzees or fall outside chimpanzees and gorillas [34,35], and for ,1% of the genome, humans may share a more recent ancestry with either chimpanzees or bonobos rather than these with each other [3]. We show that chimpanzees retain ancestral polymorphisms to a much greater degree than bonobos. This is presumably a result of a larger effective population size of

Collection of sequence data
Ethics statement. All animal work was conducted according to relevant national, EU and international guidelines. In all cases, the animals were not subjected to any experimental procedures, and the blood samples used were left-over aliquots collected by veterinarians carrying out routine medical examination. Authorization for use of the samples was obtained from the respective Ministries of Environment as well as by the Ministère de la Recherche Scientifique (DRC) to ''Les Amis des Bonobos du Congo'', the Uganda Wildlife Authority and the Uganda National Council for Science and Technology, and the Ministère de l'Enseignement Supérieur et de la Recherche Scientifique from Republic of Congo. The international transport of samples was approved (CITES numbers: Uganda E-3520/05, Kenya E-1259/ 05, DRC E-0908/07, Republic of Congo E-1274/07). The proposal that in part cover this research (233297, TWOPAN) was reviewed and approved by the European Commission.

Samples
A total of 58 unrelated common chimpanzees and 20 unrelated bonobos were used for this study. Most of these apes were confiscated by various officials from individuals selling these animals for trade or who kept them as pets, and then were brought to the sanctuaries. Where an animal is confiscated is thought to be an imperfect, but probable indication of where the chimpanzee was originally living and the population identity of the chimpanzees are mostly based on this. We thus have a sample of 20 eastern Table 2. For each region, the posterior probability for a tree that supports (1) reciprocal monophyly of chimpanzees and bonobos (2) and monophyletic grouping of bonobos (3) chimpanzees as a whole (3-6) each population of chimpanzee separately.

Region
Reciprocal    Cameroonian chimpanzees (Pan troglodytes ellioti or formerly P.t. vellerosus) from Sweetwaters chimpanzee sanctuary, Kenya and 20 bonobos (Pan paniscus) from Lola ya bonobo sanctuary, Kinshasa, Democratic Republic of Congo. Nigerian-Cameroonian chimpanzees were confiscated in Cairo, Egypt, and originated from Nigeria. An analysis of mtDNA confirmed their designation as Nigerian-Cameroonian chimpanzees. Since the sampling scheme follows the broad population framework established based on geography and analysis of mtDNA, our analysis does also [23,36]. The blood samples from central and eastern chimpanzees and bonobos were collected by Michel Halbwax and Anne Fischer in 2007 and 2008 during regular health checks. The lymphocytes were extracted from blood samples using a Ficoll gradient and frozen.

Genomic regions chosen
For each individual, we targeted the complete mitochondrial genome and 15 regions of approximately 10 kilobases (kb). The 15 nuclear regions are all non-coding, distant from known genes (at least 72 kb) and were chosen to have average recombination rate and GC content in the human genome. Based on these criteria, we used eleven of the 15 nuclear regions that overlapped with locuspairs previously sequenced in humans [21]. Four additional regions were picked at random in the genome, based on the same criteria. The chromosomal location and coordinates are given in Table S3. Based on human recombination estimates, the average recombination rate for all regions was 1.63 cM/Mb, slightly higher than the human genome-wide average of 1.19 cM/Mb. The average GC content for these regions was 37.94%, a bit lower compared to the genome average in humans of 42%. Human was used as an outgroup (genome sequence built hg18).

DNA extraction, amplification and sequencing
DNA was extracted from 50 ml cell cultures (5-50610 6 cells) obtained from lymphocytes transformed with Epstein-Barr virus using the Gentra-puregene kit from QIAGEN and following manufacturer's instructions. The DNA was aliquoted to a concentration of 100 ng/ml for further use.
For the mitochondrial genome, two sets of primers previously designed to amplify mtDNA from a large number of primates were used [37]. Each of the 15 regions was amplified in a single polymerase chain reaction (PCR) using primers that were designed from the reference human and chimpanzee genome sequences. All loci were amplified in 50 ml reactions and PCR was run with an annealing temperature of 64uC according to the manufacturer's instructions. PCR products were cleaned with PEG purification, washed twice with 70% ethanol, and eluted in TE.
All sequencing was performed using the 454 FLX sequencing platform. We used a parallel tagged sequencing protocol [37] to enable multiplexing of regions and individuals.

Assembly
We used an iterative mapping assembler, (MIA, R. E. Green, http://sourceforge.net/projects/mia-assembler/) to assemble all reads. The first round of MIA performs a mapping assembly based on a reference genome, which for the 15 autosomal regions was the human genome sequence and for the mitochondrial genome the published sequences for bonobo and chimpanzee, respectively. MIA then uses the consensus-call of the previous round of alignments as reference for further rounds of alignment until the called consensus sequence is not changed in two consecutive rounds. MIA was run separately for each individual and for each region. However, initially all reads from a given individual were included during the assembly of each region. If a read was aligned to more than one region, the best alignment score was used to assign the read to only one region. All regions were then assembled again using MIA, but using only those reads that matched each region best. To be considered for the assemblies, at least 12 consecutive base-pairs of a read had to match the reference sequence. An average of six percent of the reads did not map to any reference sequence and likely represents unspecific PCR products. To minimize the effect of homopolymer over-and under-calls in 454 data, we used a gap penalty that decreased according to the function 1/L, where L is the homopolymer length. To test whether using the human reference sequence as the initial mapping sequence influences the outcome of the iterative procedure, all autosomal regions from one bonobo were assembled using both the human and chimpanzee sequences as reference. The final consensus sequences were identical except for the length of nine different homopolymers (all of length .6), six cases where simple repeat regions differed in length and lead to misaligned reads, and two cases where there was an insertion in the chimpanzee (114 and 214 base pairs, respectively) not present in the human genome. We therefore masked simple repeat regions from further analyses. Our analyses were not affected by homopolymer over-or undercalls or by insertion-deletions, as these were ignored.

Calling of SNPs
We used perl scripts to identify potential heterozygous sites within a given individual from the MIA output file. Below, we list the filtering steps that were applied to get a set of high-quality SNPs.
1. In order to remove multiple sequences that may be generated from single molecules being amplified in an emulsion droplet with more than one bead (i.e. emulsion PCR duplicates) [38], we retained only the read with the highest quality score for each group of reads with identical strand and start position in the alignment. 2. Following [39], we filtered all reads whose alignment to the consensus sequence contained gaps within 5 base-pairs on each side of the potential SNP position. 3. We allowed at most one mismatch in 5 neighboring bases around the potential SNP. 4. We did not call SNPs if there exists a homopolymer of length longer or equal to 6 within a 20 base pair window around the site, since we observed high rates of sequencing error and misalignments in these regions. 5. Following [39], we also used the quality scores produced by the 454 base calling software (Version 2.0) to apply the Neighbor-Quality Score (NQS) with a cutoff of at least 15 for 5 neighboring positions on each side of the potential SNP position and 20 for the middle base on all reads.
If, after removal of reads due to filters 1 and 2, the potential SNP position was covered by less than 8 reads, or if the potential SNP position failed due to filters 3, 4 and 5 then we were unable to call a SNP in this position. If we were able to call a SNP, then we considered the potential position to be a SNP (i.e. to be heterozygous) if the minor allele frequency was above 0.15 and at least one read from each allele passes the NQS criteria outlined in filter 5. Otherwise the potential position was considered to not contain a SNP (i.e. to be homozygous).

False positive rates in homozygous PCR products
We used data from ten 5 kb X chromosomal regions (Thalmann et al., in preparation) collected in a similar fashion to our data in two male humans to test the false positive rate of our SNP calling protocol. No SNPs were called.

False negative rates by comparing 454 data to previously sequenced data
In order to further test how many heterozygous positions were missed or gained, we compared the newly generated 454 data with data previously generated by Sanger sequencing [6]. Seven samples from western chimpanzees and 9 regions were overlapping between this study and the one from [6], representing a total of ,9 kb of data. This 9 kb contained 62 SNPs previously identified with Sanger sequencing. We found all of them when applying our filtering criteria. We also found one more SNP, which we did not call with Sanger sequencing.

Effect of changing SNP calling algorithm
We varied the filtering criteria and looked at how the number of inferred SNPs changed for the above two situations. The filtering criteria chosen were the ones that gave the least number of false positives and false negatives. Using more permissive values resulted in finding SNPs when there were none (or were not present under Sanger sequencing), while more restrictive values made us lose SNPs which were present under Sanger sequencing.
Furthermore, our nucleotide diversity estimates and the excess of rare alleles in central chimpanzees (see Table 1) all confirm previous findings (e.g. [4,6]). This suggests that we are not excessively missing rare alleles or overcalling SNPs, at least compared to Sanger sequencing.

Homozygous regions/allelic dropout
When analyzing the data from the 20 bonobos, 85 of the 300 products (15 loci * 20 individuals = 300 products) were homozygous across the entire 10 kb of sequence. This raised the concern of allelic dropout in our data. To exclude allelic dropout, we repeated the entire experiment for bonobos with nested primers. One region turned out to be affected by allelic dropout. For all other samples, we thus repeated all the regions that were completely homozygous in one individual with a second pair of primers. We note, however, that we did observe individuals in both bonobos and chimpanzees that seem to be truly homozygous across some 10 kb stretches of DNA.
In the case of no allelic dropout and equal amplification of both alleles in the PCR, we expected both alleles at a frequency close to 50% and the distribution of the minor allele frequency to look like half a normal distribution. A closer look at the data revealed that for some heterozygous individuals the minor allele frequency was skewed. This can be explained by unequal amplification of both alleles, which could be due to a mismatch to one allele in the 39 end of one primer. We therefore plotted the minor allele frequency for all regions for each individual separately and also repeated each region showing a skew in the minor allele frequency with a second pair of primers. A skew was defined as the minor allele frequencies of each SNP in one region are all below 30%. Figure  S2 shows the shift in minor allele frequencies before and after using a new set of primers for one population.
We note that 6 products still showed a skew in minor allele frequency after using different primers and repeating the PCR. We kept them as is in the analysis.

Sequence analyses
The consensus sequences for each region were aligned with Muscle using default parameters [40,41].
Sequences are available under accession numbers JF725992 -JF727238.

Population genetic analyses
Summary statistics were calculated using DNAsp v5 [42], including nucleotide diversity (p and h w ), Tajima's D, Fu and Li's D*. Effective population sizes (N e ) were estimated as N e = h w /4 m [43], where m = (d/2t)g [44], d is a sequence divergence of 1.35% as estimated from the data, t the time since divergence between humans and chimpanzees (6 million years), and g the generation time assumed to be 20 years [4].
Testing the fit of each population to standard neutral model We tested the fit of each population to a standard neutral model based upon the observed allele frequency spectra using the method of [21]. Briefly, we used the program ms [45] to simulate 1,000 15 locus datasets, where for each locus we matched the total number of chromosomes and the average length (,10,000 bp) in a given population. h and r for each locus were chosen from a distribution. Values for h followed a gamma distribution parameterized using the average and variance in the mutation rate across all 15 loci. Mutation rates were estimated based on divergence to human, assuming a generation time of 20 years (Fischer et al. 2004) and a divergence time of 6 million years. Values for r followed a log-normal distribution parameterized using the average human recombination rate and variance in the human recombination rates for all 15 loci [46].
The observed and simulated data were compared using the variance across loci of Tajima's D plus four additional summary statistics whose average value across all loci were computed: the number of segregating sites (S), the mean pairwise difference (p), Fu and Li's D*, and Tajima's D. For the simulated data, these summary statistics were computed using sumstats [47]. Following [21], we computed the probability of observing each summary statistic across the ,1,000 simulated datasets and then calculated the sum, C, of the natural log of the p-values for each of the summary statistics.
We evaluated the fit of a given demographic model by calculating the probability of observing C in the simulated data. Note that our approach differs from [21] in that we do not include the population recombination rate, r, in our list of summary statistics. Independent estimates of local recombination rates, which are often not well conserved between humans and chimpanzees [48,49], are not yet available in chimpanzees and bonobos and our data generally provided a poor fit across populations to recombination estimates derived from human populations (data not shown).

Population structure
To explore genetic structure among populations, we used two approaches. The Structure software [50] was run using the admixture model, so that individuals were allowed to have ancestry from multiple populations. Three independent runs were performed with a model of correlated allele frequencies, a ''burn-in'' of 100,000 Markov Chain Monte Carlo (MCMC) iterations, and 1,000,000 additional MCMC iterations. The number of populations assumed, K, varied from 2 to 7. We averaged the results of the three independent runs for each K value to determine the most likely model, i.e. the one with the highest likelihood.
In addition, the Eigensoft software package [51] was used to perform a principal component analysis (PCA). For pairs of SNPs in high linkage disequilibrium (r 2 .0.5), one position was randomly excluded. Likewise, we removed all positions with a minor allele frequency lower than 5%. The statistical significance of any given principle component (PC) is obtained by a bonferroni-corrected Tracy Widom test [51]. A significant PC was considered indicative of significant population structure, which can be in the form of clusters or gradients along an axis of genetic variation.

Population divergence times
We attempted to use MIMAR [17], a Markov Chain Monte Carlo approach which allows for some recombination to estimate divergence times and migration rates between closely-related populations, on each pair of populations. However, we found that the reasonably high recombination rates within our 10 kb regions proved too computationally demanding and we failed to reach convergence of the Markov chains even after four months.

Phylogenetic analysis
We reconstructed the phylogeny of each region, using Bayesian inference as implemented in MrBayes v3 [52]. Each region was collapsed to unique reconstructed haplotypes and a best-fit model of sequence evolution was selected using decision theory with the program DT-ModSel [53] and PAUP* v4.0d105 [54]. However, only a few models are available in MrBayes. Thus, we chose the closest best fit model that is actually implemented in MrBayes. For mtDNA, the 14 nuclear regions and all regions concatenated the closest best-fit model was the Hasegawa-Kishino-Yano substitution matrix [55] with invariant sites and a gamma distributed correction for rate heterogeneity (HKY+I+G) [55]. For one region the generaltime reversible (GTR+I+G) model was the best model. (No qualitative differences were seen if we used the (HKY+I+G) model for all regions). For each region, we ran four independent runs, each for ten million generations and sampled every 1,000 generations. For each run, we used one cold and three heated Markov Chains. We excluded the first 10% of each run, resulting in a posterior distribution of 36.000 distinct tree topologies. We used the program Tracer v1.4.1 (http://tree.bio.ed.ac.uk/software/tracer/) to verify that convergence was reached by the chosen burn-in. A single human sequence was used as an outgroup for all phylogenetic analyses. We calculated the posterior probability of monophyly by determining the proportion of phylogenies for a given locus that were consistent with monophyly for each population group and/or species. To do this, we used PAUP* to constrain the posterior distribution of phylogenies from the four independent runs of MrBayes (minus the 10% burn-in for each run) to conform to each of the following hypotheses: monophyly of bonobos, western chimpanzees, eastern chimpanzees, central chimpanzees, Nigerian-Cameroonian chimpanzees, all chimpanzees, and reciprocal monophyly between chimpanzees and bonobos. The proportion of phylogenies retained under each constrained model was taken as the posterior support for each hypothesis. We arbitrarily defined a tree as showing support for monophyly if the posterior probability .95% and as showing support for paraphyly if the posterior probability ,5%.
We also estimated a single phylogeny for a concatenated alignment of the 15 nuclear regions. We had difficulty reaching convergence of the Markov Chains in our initial analyses using a Bayesian framework. Therefore, we estimated a single phylogeny with Maximum Likelihood (ML) using the program RAxML (v. 7.2.8; [56]). We used the fast bootstrapping algorithm under the GTR+G model of sequence evolution. Two thousand bootstrap replicates were performed with simultaneous optimization of the ML topology.