^{1}

^{*}

^{1}

^{2}

^{1}

^{3}

^{1}

^{1}

^{4}

^{5}

^{6}

^{1}

^{1}

^{7}

The authors have declared that no competing interests exist.

Conceived and designed the experiments: TM MHS MW JYD KM AH LNA. Performed the experiments: TM AEH. Analyzed the data: TM AEH MHS. Contributed reagents/materials/analysis tools: GL KP AS. Wrote the paper: TM MHS AS.

We present a hidden Markov model (HMM) for inferring gradual isolation between two populations during speciation, modelled as a time interval with restricted gene flow. The HMM describes the history of adjacent nucleotides in two genomic sequences, such that the nucleotides can be separated by recombination, can migrate between populations, or can coalesce at variable time points, all dependent on the parameters of the model, which are the effective population sizes, splitting times, recombination rate, and migration rate. We show by extensive simulations that the HMM can accurately infer all parameters except the recombination rate, which is biased downwards. Inference is robust to variation in the mutation rate and the recombination rate over the sequence and also robust to unknown phase of genomes unless they are very closely related. We provide a test for whether divergence is gradual or instantaneous, and we apply the model to three key divergence processes in great apes: (a) the bonobo and common chimpanzee, (b) the eastern and western gorilla, and (c) the Sumatran and Bornean orang-utan. We find that the bonobo and chimpanzee appear to have undergone a clear split, whereas the divergence processes of the gorilla and orang-utan species occurred over several hundred thousands years with gene flow stopping quite recently. We also apply the model to the

Next-generation sequencing technology has enabled the generation of whole-genome data for many closely related species. For population genetic inference we have sequenced many loci, but only in a few individuals. We present a new method that allows inference of the divergence process based on two closely related genomes, modelled as gradual isolation in an isolation with migration model. This allows estimation of the initial time of restricted gene flow, the cessation of gene flow, as well as the population sizes, migration rates, and recombination rates. We show by simulations that the parameter estimation is accurate with genome-wide data and use the model to disentangle the divergence processes among three sets of closely related great ape species: bonobo/chimpanzee, eastern/western gorillas, and Sumatran/Bornean orang-utans. We find allopatric speciation for bonobo and chimpanzee and non-allopatric speciation for the gorillas and orang-utans. We also consider the split between humans and chimpanzees/bonobos and find evidence for non-allopatric speciation, similar to that within gorillas and orang-utans.

The decreasing cost of genome sequencing has led to the sequencing of many species, including closely related species. With these available genomes, it becomes possible to study the speciation process in more detail. Due to the processes of recombination and coalescence, the complete genomes of two related species contain a large number of partly independent histories, with different regions having different migration histories and times to common ancestry. These histories, if inferred, can inform us about key parameters of the species divergence process. Particularly informative is the distribution of the length of genomic fragments having identical histories, since this depends directly on the time interval over which recombination could have acted and therefore on the distribution of coalescent times. Demographic parameters shape this distribution, so inference of it is informative about demographic history. However the difficulty of modelling coalescence with recombination means that most previous approaches have been unable to exploit this information, with one recent notable exception being inference of population size history from a single diploid genome

Previous isolation with migration (IM) models have been designed to deal with relatively short sequences from several individuals of each species, since this was typical of data sets available until recently. Adding more individuals to a data set is often not as powerful as adding loci, since most coalescence events occur recently in the history of the samples and there are only few ancestors present at deep coalescence times. For the constant population size coalescent, the total branch length in the ancestry of a sample set grows logarithmically with the number of samples, but linearly in the number of loci, and most statistical methods for exploring the evolution of closely related species therefore employ multiple loci with small sample sizes

Using coalescent theory it is reasonably straightforward to compute the coalescence density under many demographic scenarios, including scenarios with and without gene flow. This follows from the Markov property of the process when viewed backwards in time. When recombination is absent, we can often derive simple maximum likelihood algorithms for inferring parameters. When recombination

CoalHMMs permit recombination between any neighbouring pair of nucleotides, and represent the correlation between sites as a Markov model along the alignment

In this paper we extend the coalescent hidden Markov model of Mailund et al. (2011)

Our isolation-with-migration model considers two separated populations (sub-species or species) derived from a shared ancestral population in the recent past. The model assumes that the ancestral population split into two populations in the past, at time

When considering only pairs of genomes, the likelihood of a model will depend only on the divergence time at each locus

In Mailund et al. (2011)

Once the joint coalescence densities are computed however, we can construct a hidden Markov model from following the approach of our previous work. We discretize time in a number of intervals (see

The parameters of the hidden Markov model are the same as those used for the CTMC for computing the coalescence densities. The IM model introduced in this paper is parameterized by

We then assume three different time periods (see

We assume that both the recombination rate

The coalescent HMM employs two important approximations. It assumes that the coalescent process is Markov along the sequences and that coalescent events occur at discrete time points rather than continuously in time. The Markov assumption is very difficult to relax because it enables us to reduce the problem of inference across the genome to that of the history of two adjacent nucleotides. The assumption of discrete coalescent times can be investigated by varying the number of intervals. Therefore, we have used extensive simulation studies to validate that the model can recover true parameters simulated under the coalescent with recombination process, both under ideal circumstances and under circumstances where different aspects of the model are mis-specified (see

For all simulations we used a coalescence rate of

The box-plot shows the distribution of parameter estimates for six different simulation scenarios. In all scenarios the coalescence rate and the recombination rate parameters are kept fixed, while the end of gene flow,

The CoalHMM assumes that both the mutation rate and the recombination rate are constant in the region analysed, which in general will not be true. We explored the sensitivity to variation in rates through simulations.

The figure shows the effect on parameter estimation when the mutation rate is varied along the genome alignment. We split the alignment into segments geometrically distributed with mean length 500 bp and 2 kbp, and the mutation rate is then scaled by a random value chosen uniformly in the range 0.75 to 1.25 or 0.5 to 1.5. The dashed lines show the simulated values. The largest effect on varying the mutation rate is seen in the top-most parameters, the coalescence rate and the mutation rate. Varying the mutation rate increases the variance in coalescence times scaled with mutation rate, which is interpreted by the model as a decreased coalescence rate, while segments with low mutation rates are seen as more recent coalescence rates which the model interprets as evidence for migration. Consequently, variation in mutation rate decreases our estimates of the coalescence rate and increases our estimates of migration rates.

The figure shows the effect on parameter estimation when the recombination rate is varied along the genome alignment. To simulate variation in the recombination rate, we sampled random 10 Mbp segments of the human genome, extracted the DeCODE recombination map for these segments, and scaled the recombination rate in the simulations according to the variation in this map. The dashed lines show the simulated values of the parameters. For most parameters, the effect of varying the recombination rate is seen as an increased variance in the estimates, while they do not appear to be biased. The exception is the recombination rate that becomes even more underestimated than for a constant recombination rate.

The model assumes that we have one haploid genome from each species. From sequencing data, however, we generally only obtain diploid genomes, and inferring the phase to split this into haploid genomes is not immediately possible with only one genome sequenced. If the species are sufficiently diverged, however, most polymorphism in the species will be local to one of the species, and which variant is considered for a heterozygotic site will not matter for the divergence to the other species. However, when species are so closely related that shared polymorphisms are common, we expect that assuming phased chromosome will make us believe that more recombination has occurred and therefore bias the inferred

We simulated the situation where the genotype phase is unknown by simulating two genomes and selecting a random allele for all heterozygotic sites. The plot shows the effect on parameter estimates of not knowing the phase.

Recently-sequenced primate genomes allow us to apply the model on three different closely related species pairs: (a) bonobos and chimpanzees (Prüfer et al. (2012)

The box plot shows the estimated split times using either the isolation model or the isolation-with-migration model for the three great ape comparisons. The box plots on the left shows the split time estimate in the isolation model while the box plots on the right shows both the initial population divergence and the end of gene flow. The variation in estimates is from each 10 Mbp segment of the genome.

The box plots show the estimates of the initial split time and the end of gene flow in the isolation-with-migration model for each 10 Mbp segment for each chromosome.

The box plots show the Akaike Information Criteria (AIC) for the isolation model against the isolation-with-migration model. For each 10 Mbp genomic segment we have plotted the AIC for the model including migration minus the model without. The model with the smallest AIC should be preferred, so values below zero prefers the isolation model while values above zero prefers the migration model.

Among the great ape speciation events, the human and chimpanzee speciation has received the most attention. We applied our new model to this speciation event using both the chimpanzee and bonobo genomes compared to the human genome, see

The histograms show the distribution of parameter estimates for the human/bonobo speciation (blue) and the human/chimpanzee speciation (red) using the isolation model.

The histograms show the distribution of parameter estimates for the human/bonobo speciation (blue) and the human/chimpanzee speciation (red) using the isolation-with-migration model.

The histograms show the distribution of AIC differences for the isolation and isolation-with-migration model for the human/chimpanzee comparison and the human/bonobo comparison. Negative values indicate a preference for the isolation model while positive values indicate a preference for the isolation-with-migration model. The overall result points toward a preference for a prolonged speciation for the Homo/Pan split.

Our study shows that detailed information on the divergence process can be gathered from just two related haploid genomes through joint inference of the length of segments with the same history and their times to coalescence. A period of time with limited migration is detectable as a period in which coalescences occur at a much lower frequency than in a panmictic population, because they are limited by the rate of migration events. Simulations under the full coalescence with recombination process show that the Markov assumption does not significantly bias the estimation of parameters, except for the recombination rate which is consistently underestimated, typically by 20–30%. The cause of this underestimation is the tendency of the real coalescent with recombination process to return to the same ancestor which implies that the average effect of recombination events is smaller than assumed by the Markov model (for an extended discussion, see Dutheil

Estimation of time and migration parameters is robust to typical violations of model assumptions such as mutation rate and recombination rate variation. A practical concern is that the model assumes haploid phased genomes whereas most genome sequences are a random mix of two haplotypes. Using a random phase should have no effect if the genomes are sufficiently diverged that they do not share any polymorphism; in this case the patterns of coalescence time will be the same for either phase at any position along the genome. If the genomes have shared polymorphism, then the phase will affect the estimated coalescence time, and with a high degree of shared polymorphism we expect that a random phase will cause the HMM to jump between states too often.

To investigate the consequences of this, we simulated diploid genomes and constructed haploid genomes by choosing a random phase, and then estimated parameters from this data (see

The demographic parameters inferred by our model are expressed in units of sequence divergence (substitutions per base pair). To translate these into units measured in years and effective population sizes measured in individuals requires both a genomic substitution rate and a generation time. A substitution rate has typically been estimated from fossil dates, with values around

Recent measurements of

The figure shows the inferred split times when scaled with a mutation rate,

Our application of the model to three closely related great ape species pairs revealed different speciation processes between chimpanzees and bonobos on one hand and the gorilla and orang-utan species on the other.

We estimate that the two gorilla species have experienced a long period of time with a very small amount of gene flow. Evidence for recent gene flow between these species was also found by Thalmann et al. (2007)

For the two orang-utan species we also find evidence for an extended period of limited gene flow but with a more ancient cessation of gene flow than is observed for the Gorillas. This is in agreement with the DaDi analysis presented by Locke et al. (2011)

Finally, for the chimpanzees and bonobos we find no evidence against an allopatric separation. This is in agreement with Won and Hey (2005)

Thus, most estimates for the bonobo-chimpanzee separation are largely consistent with our results and there is little evidence against allopatry. If we use recent estimates of present-day human mutation rate of about

The human-chimpanzee speciation has been the focus of considerable attention in previous studies, most of which have assumed a simple (allopatric) speciation model. However, as shown in

Considering the evidence reported here and previously for gene flow between species within the great ape genera

The exploitation of whole genome data in a demographic inference model is made computationally efficient by the Markov assumption underlying CoalHMMs, and should increase the statistical power for parameter estimation and for comparing different demographic scenarios. However the construction of complex demographic models with CoalHMMs is complicated by the mathematics involved in specifying transition probabilities between local genealogies.

The model we have presented in this paper is an initial attempt at building a speciation model using a CoalHMM, but the underlying framework, using a continuous time Markov chain to model transitions between genealogies, generalises straightforwardly to other demographic scenarios (see

Using a hidden Markov model also enables us, via posterior decoding, to investigate variation in coalescent times, recombination and gene exchange along the genome. See

The crux of constructing a coalescent hidden Markov model is deriving transition and emission probability matrices from the coalescence process parameters of interest. For computing the transition probabilities we take the approach from Mailund et al. (2011)

The coalescence process with recombination can be formulated as a CTMC running back in time from a present day sample of genomes back until all nucleotides have found their most recent common ancestor (MRCA)

When constructing our coalescent hidden Markov model, we consider the case for genome segments only two nucleotides long. In this case, we get a system of a relatively small number of states where we can explicitly construct the rate matrix and compute transition probabilities exactly. The process is very similar to the isolation model in Mailund et al. (2011)

In this model, we represent lineages at a single nucleotide as sets. The sets

To assign lineages to species, we pair them again, and let

We define the following transitions of states:

Coalescence:

Recombination:

Migration:

where

On the left is shown an ancestral recombination graph for two genomes with two nucleotides. Lineages, in the notation we use for constructing the CTMCs, are shown in red. On the right is shown the corresponding list of transitions int he CTMC with the type of transitions on the arrows: recombination (R), migration (M) and coalescence (C). The transition from the two separate populations to the ancestral population is a special transition – the projection matrix in the CTMC – shown in red.

As the initial state of the system, we use the state where sequence 1 is in species 1, sequence 2 is in species 2, and both sequences have their left and right nucleotides linked,

When constructing the rate matrix from the state space, we set migration rates to zero for the time period where we do not allow gene flow, from

Let

First, we compute CTMC transition probability matrices for individual time intervals. Let

Let

Second, we compute the CTMC transition probability matrices across several intervals. These can be computed by simple matrix multiplication, since the probability of being in state

If

To reduce the computation time for calculating the sums, we use a dynamic programming algorithm.

Given the joint probabilities of having the left nucleotide coalesce in time interval

For computing the emission probabilities, we use the same approach as in our previous work

For computing the mean coalescence time in each time interval, we can use the truncated exponential density as in our previous models for the intervals further back in time than

To deal with this, we set up a CTMC similar to those for the transition probabilities, but considering only a single nucleotide per genome. In the following,

From a CTMC we can compute the mean time until an absorbing state (i.e. coalescence) using standard CTMC theory (see e.g. Tavaré (1979)

The probability

Let

For the integral, we now use a change of variable,

The density,

Now, let

For computational efficiency, we first compute all

The model was implemented in Python, and we used the numerical optimization functionality from the scipy optimize module, function fmin to find the maximum likelihood parameters and HMMLib

For our simulation experiments, we simulated ancestral recombination graphs from the coalescent with recombination process using the CoaSim tool

Genome sequence alignments between eastern and western gorillas and between Bornean and Sumatran orang-utans were generated as follows. Illumina paired-end reads for Mukisi, an eastern lowland gorilla (

In both cases mapped reads were merged using Picard (

Genome alignments for the analysis of the chimpanzee-bonobo, human-chimpanzee and human-bonobo splits were produced as described in Prüfer

Analysis results for all ape analyses.

(XLSX)

Source code for the CoalHMM.

(GZ)

Report describing simulation experiments and results.

(PDF)

Report describing data analysis results not included in the main manuscript.

(PDF)

We are grateful to the international orang-utan, gorilla, and bonobo genome analysis consortia for making the sequence data available. We are also grateful to the Center for Scientific Computing at Aarhus University for use of their computer resources, without which our extensive simulation studies would not have been possible. Nick Patterson and one anonymous reviewer provided very helpful advice for improving the manuscript.