Analyzed the data: PL. Wrote the paper: PL MAS. Designed the study: PL MAS. Provided programming assistance: AR AJD. Conceived the original idea and developed the software implementation: MAS.

The authors have declared that no competing interests exist.

As a key factor in endemic and epidemic dynamics, the geographical distribution of viruses has been frequently interpreted in the light of their genetic histories. Unfortunately, inference of historical dispersal or migration patterns of viruses has mainly been restricted to model-free heuristic approaches that provide little insight into the temporal setting of the spatial dynamics. The introduction of probabilistic models of evolution, however, offers unique opportunities to engage in this statistical endeavor. Here we introduce a Bayesian framework for inference, visualization and hypothesis testing of phylogeographic history. By implementing character mapping in Bayesian software that samples time-scaled phylogenies, we enable the reconstruction of timed viral dispersal patterns while accommodating phylogenetic uncertainty. Standard Markov model inference is extended with a stochastic search variable selection procedure that identifies parsimonious descriptions of the diffusion process. In addition, we propose priors that can incorporate geographical sampling distributions or characterize alternative hypotheses about the spatial dynamics. To visualize the spatial and temporal information, we summarize inferences using virtual globe software. We describe how Bayesian phylogeography compares with previous parsimony analysis in the investigation of the influenza A H5N1 origin and H5N1 epidemiological linkage among sampling localities. Analysis of rabies in West African dog populations reveals how virus diffusion may enable endemic maintenance through continuous epidemic cycles. From these analyses, we conclude that our phylogeographic framework will be an important asset in molecular epidemiology that can be easily generalized to infer biogeography from genetic data for many organisms.

Spreading in time and space, rapidly evolving viruses can accumulate a considerable amount of genetic variation. As a consequence, viral genomes become valuable resources to reconstruct the spatial and temporal processes that are shaping epidemic or endemic dynamics. In molecular epidemiology, spatial inference is often limited to the interpretation of evolutionary histories with respect to the sampling locations of the pathogens. To test hypotheses about the spatial diffusion patterns of viruses, analytical techniques are required that enable us to reconstruct how viruses migrated in the past. Here, we develop a model to infer diffusion processes among discrete locations in timed evolutionary histories in a statistically efficient fashion. Applications to Avian Influenza A H5N1 and Rabies virus in Central and West African dogs demonstrate several advantages of simultaneously inferring spatial and temporal processes from gene sequences.

Phylogenetic inference from molecular sequences is becoming an increasingly popular tool to trace the patterns of pathogen dispersal. The time-scale of epidemic spread usually provides ample time for rapidly evolving viruses to accumulate informative mutations in their genomes

Phylogeographic analyses are a common approach in molecular ecology, connecting historical processes in evolution with spatial distributions that traditionally scale over millions of years

Probabilistic methods draw on an explicit model of state evolution, permitting the ability to glimpse the complete state history over the entire phylogeny and conveniently draw statistical inferences

While probabilistic methods have been previously presented in a bio- or phylogeographic context, in particular Bayesian methods that integrate over phylogenetic uncertainty and Markov model parameter uncertainty

Advances in evolutionary inference methodology have frequently demonstrated how novel approaches can be appended to a sequence of analyses, in many cases starting from alignment to parameter estimation conditional on tree reconstructions. For example, demographic inference has involved genealogy reconstruction, estimating a time scale for the evolutionary history, and coalescent theory to quantify the demographic impact on this tree shape

Here, we implement ancestral reconstruction of discrete states in a Bayesian statistical framework for evolutionary hypothesis testing that is geared towards rooted, time-measured phylogenies. This allows character mapping in natural time scales, calibrated under a strict or relaxed molecular clock, in combination with several models of population size change. We use this full probabilistic approach to study viral phylogeography and extend the Bayesian implementation to a mixture model in which exchange rates in the Markov model are allowed to be zero with some probability. This Bayesian stochastic search variable selection (BSSVS) enables us to construct a Bayes factor test that identifies the most parsimonious description of the phylogeographic diffusion process. We also demonstrate how the geographical distribution of the sampling locations can be incorporated as prior specifications. Through feature-rich visual summaries of the space-time process, we demonstrate how this approach can offer insights into the spatial epidemic history of Avian influenza A-H5N1 and rabies viruses in Africa.

The highly pathogenic avian influenza A-H5N1 viruses have been present for over a decade in Southern China and spread in multiple waves to different types of poultry in countries across Asia, Africa and Europe

We examine the evolution and spatial dispersion of two viral pathogens, Avian influenza A-H5N1 and rabies, to demonstrate the strengths and limitations of our discretized stochastic model for phylogeography.

To reconstruct the spatial dispersion patterns of Avian influenza A-H5N1, we analyze the hemagglutinin (HA) and neuraminidase (NA) gene datasets previously compiled by

We color branches according to the most probable location state of their descendent nodes. We use the same color coding as

We further annotated the tree nodes with their most probable (modal) location states via color labelings. Although the nucleotide substitution rates are very similar across genes (HA: posterior mean

Despite different time scales for HA and NA, most probable location states agree on Guangdong as the predominant location of these sequences throughout the majority of their evolutionary history. As an indication of the A-H5N1 epidemic origin, we consider the inferred location at the root of the trees (

| Data | Model | KL divergence (root) | KL divergence (GsGD) | Association index |
|---|---|---|---|---|
| HA | C | 1.4464 | 2.1999 | 0.21 (0.17–0.25) |
| NA | C | 0.0184 | 1.6679 | 0.14 (0.09–0.18) |
| HA | C, BSSVS | 1.7895 | 1.4383 | 0.24 (0.19–0.29) |
| NA | C, BSSVS | 0.5660 | 1.1185 | 0.20 (0.14–0.26) |
| HA | DI, BSSVS | 1.7861 | 1.4059 | 0.25 (0.20–0.30) |
| NA | DI, BSSVS | 0.5811 | 1.1889 | 0.23 (0.17–0.29) |
| Shared (HA) | C | 1.4704 | 2.2303 | 0.21 (0.17–0.25) |
| Shared (NA) | C | 0.0321 | 1.7281 | 0.15 (0.10–0.19) |
| Shared (HA) | C, BSSVS | 1.8965 | 1.5844 | 0.25 (0.21–0.30) |
| Shared (NA) | C, BSSVS | 0.7813 | 1.2511 | 0.22 (0.16–0.28) |
| Shared (HA) | DI, BSSVS | 1.8038 | 1.6086 | 0.26 (0.21–0.31) |
| Shared (NA) | DI, BSSVS | 0.7748 | 1.3195 | 0.23 (0.17–0.29) |
| HA (fixed) | C | 1.5030 | 2.5626 | 0.18 |
| HA (fixed) | C, BSSVS | 1.7578 | 1.7026 | 0.18 |
| HA (fixed) | DI, BSSVS | 1.7235 | 1.7364 | 0.18 |

We report the Kullback-Leibler (KL) divergence between the posterior and prior location distributions of the root and of the GsGD most recent common ancestor (MRCA), as well as a phylogeographic association index. We analyze the genes independently, jointly under equal phylogeographic models (Shared), and with the HA phylogeny fixed (HA (fixed)). The phylogeographic models use prior rates that are either proportional to a constant (C) or distance-informed (DI), with or without Bayesian stochastic search variable selection (BSSVS).

Under BSSVS, we assume a truncated Poisson prior that assigns 50% prior probability on the minimal rate configuration, comprising 19 non-zero rates connecting the 20 locations. This model strongly favors reduced parameterizations. A sensitivity analysis with respect to larger Poisson prior means reinforces that the data prefer a minimal number of rates, as increasing the mean leads to lower overall marginal likelihoods (
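The 50% figure can be checked numerically. The sketch below is illustrative only and rests on an assumption that is not spelled out in the text above: the number of non-zero rates is taken to be the minimal K − 1 plus a Poisson-distributed excess with mean log(2), truncated at the K(K − 1)/2 rates of a symmetric rate matrix. The function name `truncated_poisson_prior` is hypothetical.

```python
import math


def truncated_poisson_prior(n_locations, mean=math.log(2)):
    """Prior over the number of non-zero rates among n_locations sites.

    Assumption: (count - (K - 1)) follows a Poisson(mean) distribution,
    truncated so that count never exceeds K(K - 1)/2.
    """
    k = n_locations
    minimum = k - 1                # fewest rates that still connect all sites
    maximum = k * (k - 1) // 2     # every pairwise rate switched on
    weights = {}
    for count in range(minimum, maximum + 1):
        extra = count - minimum
        # Poisson probability mass, computed in log space for stability
        log_w = -mean + extra * math.log(mean) - math.lgamma(extra + 1)
        weights[count] = math.exp(log_w)
    total = sum(weights.values())  # renormalize after truncation
    return {count: w / total for count, w in weights.items()}


prior = truncated_poisson_prior(20)
print(round(prior[19], 3))         # prior mass on the minimal configuration
```

Passing a larger `mean` (for example 5, 10 or 20, as in the sensitivity analysis) shifts prior mass toward denser rate configurations.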

The posterior probabilities are shown for different expectations,

| Prior mean | ML (stdev) | Posterior median (BCIs) | KL divergence (root) | KL divergence (Gs/GD) |
|---|---|---|---|---|
| log(2) | −11339.343 (0.856) | 21 (19–22) | 1.7895 | 1.4383 |
| 1 | −11339.670 (0.636) | 21 (19–23) | 1.7991 | 1.4540 |
| 5 | −11341.197 (0.955) | 25 (22–29) | 1.7804 | 1.4533 |
| 10 | −11342.463 (0.883) | 29 (24–34) | 1.7940 | 1.7940 |
| 20 | −11343.429 (0.957) | 36 (29–43) | 1.7691 | 1.5691 |

We report estimates of

By specifying a prior on the number of non-zero rates, we are able to construct Bayes factor (BF) tests for significance of individual rates (
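As a hedged illustration of such a test, the sketch below computes a Bayes factor as the posterior odds that a rate is non-zero divided by the corresponding prior odds. Two things are assumptions rather than details taken from the text above: the prior inclusion probability is set to the prior expected number of non-zero rates divided by the total number of rates, and the posterior probability is clamped away from 0 and 1 to keep the odds finite. The indicator samples are made up for the example.

```python
import math


def rate_bayes_factor(indicator_samples, prior_inclusion):
    """Bayes factor for one rate from sampled 0/1 inclusion indicators."""
    n = len(indicator_samples)
    p = sum(indicator_samples) / n            # posterior inclusion probability
    # Pragmatic guard (an assumption): avoid infinite odds with finite samples
    p = min(max(p, 0.5 / n), 1.0 - 0.5 / n)
    posterior_odds = p / (1.0 - p)
    prior_odds = prior_inclusion / (1.0 - prior_inclusion)
    return posterior_odds / prior_odds


K = 20                                        # number of sampling locations
n_rates = K * (K - 1) / 2                     # symmetric pairwise rates
# Assumed prior inclusion probability: expected non-zero rates / total rates
prior_q = (math.log(2) + K - 1) / n_rates

samples = [1] * 60 + [0] * 40                 # hypothetical MCMC indicator draws
bf = rate_bayes_factor(samples, prior_q)
print(bf > 3)                                 # does the rate pass the BF > 3 cutoff?
```

Because the prior odds are well below 1 under a sparse prior, a rate that is switched on in only a modest majority of samples can still exceed the cutoff of 3.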

Only rates supported by a BF greater than 3 are indicated. The color and thickness of the line represent the relative strength by which the rates are supported; thin white lines and thick red lines suggest relatively weak and strong support respectively. The maps are based on satellite pictures made available in Google Earth (

The presence of reassortment amongst the gene segments obfuscates phylogenetic inference for concatenated HA/NA sequence data. In this respect, it is interesting to note that previous parsimony reconstructions on a phylogeny for the concatenated HA and NA segments result in fewer significant diffusion rates compared to the separate analyses;

Only rates supported by a BF greater than 3 are indicated. The color and thickness of the line represent the relative strength by which the rates are supported; thin white lines and thick red lines suggest relatively weak and strong support respectively. The maps are based on satellite pictures made available in Google Earth (

A major advantage of the current phylogeography implementation is the ability to infer the migration process in natural time scales. The panels in

We provide snapshots of the dispersal pattern for May 1997, 2001, 2003 and 2005. Lines between locations represent branches in the MCC tree along which the relevant location transition occurs. Location circle diameters are proportional to square root of the number of MCC branches maintaining a particular location state at each time-point. The white-green and yellow-magenta color gradients inform the relative age of the transitions for HA and NA respectively (older-recent). The maps are based on satellite pictures made available in Google Earth (

We investigate the “Africa 2” lineage of rabies transmitted by African dogs. This lineage forms one of the most divergent African rabies virus clades

(A) MCC phylogeny with branches colored according to the most probable posterior location of their child nodes; superimposed under the phylogeny lies the inferred demographic history. (B) Root location posterior probabilities are shown for the standard discrete model (opaque) and for the BSSVS extension with, in addition, distance-informed priors on the infinitesimal migration rates (transparent). The distance-informed priors in the latter had little impact on the results (data not shown). Both the height and width of the cylinders are proportional to root location posterior probability; the same colors as the tree branches in (A) are used. The maps are based on satellite pictures made available in Google Earth (

Although geographic origins remain elusive, we are able to identify locations that are epidemiologically linked using the BF test under BSSVS (

(A) Significantly non-zero migration rate using a Bayes factor test. Line thicknesses and the white-red color gradient relate to relative posterior migration rate expectations. (B) Projection of reconstructed migration events. Link heights indicate the relative durations of the branches upon which the inferred migration occurs, while the yellow-magenta color gradient informs the relative age of the transition (older-recent). The maps are based on satellite pictures made available in Google Earth (

Although the best supported rates mainly form an East-West axis, many transitions along this axis occur in the last three decades; this suggests that the axis is not representative of a relatively slow unidirectional migration wave.

The different panels represent temporal projections of reconstructed migration events every 15 years: A) 1977, B) 1992 and C) 2007. In these projections, each MCC branch is again translated into a geographic link that connects the branch's most probable starting and ending location states. The panels only show migration events or partial migration events that have occurred up to a particular date, assuming that the virus migrates at a constant rate over the inferred time span of the branch. The maps are based on satellite pictures made available in Google Earth (

The Bayesian phylogeographic inference framework we present here incorporates the spatial and temporal dynamics of gene flow. In this study, we focus on pathogen diffusion because viral sequence sampling on a time-scale commensurate with the rate of substitution permits the inference of spatial patterns in real-time units. In addition, elucidating the phylodynamics of viral epidemics has important implications for public health management. We selected the Avian influenza A-H5N1 example to allow a convenient comparison of Bayesian ancestral state inference with the previous parsimony analysis; in contrast, statistical analysis of rabies migration in Africa has remained largely unexplored up to this point. Both zoonoses represent a clear threat to human health. The frequent transmission of A-H5N1 from poultry or wild birds to humans suggests that the virus could emerge as or contribute genetically to the next human flu pandemic. Although the lack of a human-to-human transmission mechanism means that rabies will not emerge as a purely human disease, rabies infection causes a fatal neurological disease and at least 55,000 people die from this disease every year, mainly in the developing world

A Bayesian statistical approach presents many advantages over parsimony inference of ancestral states. First, MCMC offers a computational technique to integrate over an unknown phylogeny and unknown migration process as the former is not directly observable in nature and the latter is poorly understood. Accommodating this lack of knowledge protects against potentially severe bias, but can reduce the power to make inferences; phylogeographic analyses are no exception to this. One can regard this uncertainty itself as a ‘mixed blessing’ because whilst it can hamper drawing definitive conclusions

Bayesian inference also proffers particular benefits within the class of likelihood-based methods, for example, by allowing for straightforward approaches to control model complexity. BSSVS naturally provides a BF test to identify significant non-zero migration rates. Further prior specification easily incorporates geographical detail of the sequence data. Although distance-informed priors appear to have little impact on the phylogeographic analyses presented here, both BSSVS and informed priors furnish new opportunities for hypothesis testing when comparing competing prior scenarios of the diffusion process. Examples include “gravity models”

Our primary motivation for exploiting BSSVS to select among all possible migration graphs is to elucidate the limited number of epidemiological links that appropriately explain the viral diffusion process. This parsimonious set both informs major modes of migration and reduces the high statistical variance that burdens estimation of all pairwise transition rates. Following this argument, less uncertainty on node state reconstructions would be expected when focusing on a parsimonious parameterization of the instantaneous rate matrix. The A-H5N1 analysis indeed indicates lower uncertainty of root state reconstructions. However, for some other internal nodes, we note the opposite behavior. We attribute this to the reversibility assumption in the rate matrix. Selection of reversible rates by BSSVS imposes more balanced transitions in the phylogeny among locations that could have unidirectional links in reality. Therefore, work is in progress to develop non-reversible models that may better fit a spatially expanding epidemic like A-H5N1 or recurring epidemic influenza emergence through source-sink dynamics

Our rabies phylogeographic analysis confirms a longstanding presence of this viral lineage in West Africa

Many questions in evolutionary biology require a biogeographical perspective on the population under investigation. We hope to have demonstrated that the Bayesian phylogeographic framework can contribute significantly to evolutionary hypothesis testing, and, although we have focused on viral phylodynamics, this approach is generally applicable in molecular evolution. Employing geographically-informed priors is a first step toward incorporating GIS information. Future developments like irreversible CTMC processes may offer even more biological realism.

For many spatial scales and problems, geography can naturally be partitioned into a finite number of discrete sites

One illuminating perspective from which to view a CTMC is that of a random walk on a graph
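To make the random-walk picture concrete, the following sketch simulates one lineage moving among discrete locations: it waits an exponentially distributed time at the current site, then jumps along an edge chosen in proportion to its rate. The three-location graph, the rate values and the function name `simulate_location_walk` are all hypothetical, and the code is not the inference machinery used in the paper.

```python
import random


def simulate_location_walk(rates, start, duration, rng):
    """Simulate a continuous-time random walk on a weighted graph.

    rates maps each location to a dict of {neighbour: rate}; an edge with
    zero rate is simply left out of the dict.
    """
    t, location = 0.0, start
    path = [(0.0, start)]
    while True:
        neighbours = rates.get(location, {})
        total = sum(neighbours.values())
        if total == 0:
            break                              # absorbing site: no way out
        t += rng.expovariate(total)            # waiting time until next jump
        if t >= duration:
            break
        location = rng.choices(
            list(neighbours), weights=list(neighbours.values())
        )[0]                                   # jump in proportion to edge rate
        path.append((t, location))
    return path


# Hypothetical symmetric graph; the A-C edge is absent (rate zero)
rates = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 0.5},
    "C": {"B": 0.5},
}
path = simulate_location_walk(rates, "A", 10.0, random.Random(1))
for time, location in path:
    print(f"{time:.3f} {location}")
```

Dropping an edge from `rates` removes that move from every simulated history, which is the graph-level effect of setting a rate to zero.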

When GTR models find use modeling nucleotide substitution, most of the

BSSVS is traditionally applied to model selection problems in a linear regression framework, in which statisticians start with a large number of potential predictors

Recent work in BSSVS

To map BSSVS into the phylogeography setting, we consider selection among the

To specify a prior distribution over

We entertain two prior choices for

To complete the CTMC specification, we assume that all unnormalized rates in

Considerable additional information exists about the sites

The Bayes factor (BF) for a particular rate

A strength of the Bayesian approach we exploit in this paper is the ability to integrate together into a joint model of spatial locations

We approximate the joint posterior (8) and its marginalizations using MCMC implemented in the software package BEAST

An important statistical question asks to what extent the data inform our inference when fitting different phylogeographic models. A model of low statistical power makes poor use of the information in the data, while a successful model exploits this information to generate posterior distributions that are maximally different from prior beliefs. One primary outcome of a Bayesian phylogeographic study is the marginal posterior distribution of the root location
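A small numerical sketch shows how a divergence between posterior and prior can quantify this. It assumes a uniform prior over K = 20 locations, natural logarithms, and the direction KL(posterior || prior); these choices and the example distributions are illustrative assumptions, not values from the analyses.

```python
import math


def kl_divergence(posterior, prior):
    """Kullback-Leibler divergence KL(posterior || prior) in nats."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(posterior, prior)
        if p > 0                      # terms with p = 0 contribute nothing
    )


K = 20
uniform_prior = [1.0 / K] * K         # assumed prior over root locations

uninformed = list(uniform_prior)      # posterior identical to the prior
concentrated = [0.81] + [0.01] * 19   # hypothetical, well-resolved root
point_mass = [1.0] + [0.0] * 19       # root location known with certainty

print(kl_divergence(uninformed, uniform_prior))
print(kl_divergence(concentrated, uniform_prior))
print(kl_divergence(point_mass, uniform_prior), math.log(K))
```

Under these assumptions the divergence is zero when the data add nothing to the prior and is bounded above by log(K), reached when all posterior mass sits on a single location.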

Following existing phylogeographic approaches, we finally score the degree of spatial admixture using a modified association index (AI)

To summarize the posterior distribution of ancestral location states, we annotate nodes in the MCC tree with the modal location state for each node using TreeAnnotator, and visualize this tree using FigTree (available at

KML file for H5N1 diffusion over time as inferred from HA

(2.13 MB XML)

Supplementary information: KML file for H5N1 diffusion over time as inferred from NA

(2.10 MB XML)

We thank the Isaac Newton Institute for Mathematical Sciences, Cambridge, UK, for hosting us during its Phylogenetics Programme from which this research grew. We thank Hervé Bourhy and Shiraz Talbi for providing the rabies data and commenting on the manuscript.