Unifying Viral Genetics and Human Transportation Data to Predict the Global Transmission Dynamics of Human Influenza H3N2

Information on global human movement patterns is central to spatial epidemiological models used to predict the behavior of influenza and other infectious diseases. Yet it remains difficult to test which modes of dispersal drive pathogen spread at various geographic scales using standard epidemiological data alone. Evolutionary analyses of pathogen genome sequences increasingly provide insights into the spatial dynamics of influenza viruses, but to date they have largely neglected the wealth of information on human mobility, mainly because no statistical framework exists within which viral gene sequences and empirical data on host movement can be combined. Here, we address this problem by applying a phylogeographic approach to elucidate the global spread of human influenza subtype H3N2 and assess its ability to predict the spatial spread of human influenza A viruses worldwide. Using a framework that estimates the migration history of human influenza while simultaneously testing and quantifying a range of potential predictive variables of spatial spread, we show that the global dynamics of influenza H3N2 are driven by air passenger flows, whereas at more local scales spread is also determined by processes that correlate with geographic distance. Our analyses further confirm a central role for mainland China and Southeast Asia in maintaining a source population for global influenza diversity. By comparing model output with the known pandemic expansion of H1N1 during 2009, we demonstrate that predictions of influenza spatial spread are most accurate when data on human mobility and viral evolution are integrated. In conclusion, the global dynamics of influenza viruses are best explained by combining human mobility data with the spatial information inherent in sampled viral genomes. The integrated approach introduced here offers great potential for epidemiological surveillance through phylogeographic reconstructions and for improving predictive models of disease control.


Sequence and location data
We compiled 1441 hemagglutinin sequences with known date and location of sampling previously obtained by [1]. These sequences were sampled globally from 2002 to 2007 and are representative of a larger sampling (13,000 isolates) used for antigenic analysis [1]. We explored different spatial and air travel-assisted subdivisions with sub-sampling to examine the impact of discrete sampling allocation and sample numbers per locations on our phylogeographic estimates. We note that sample sizes may strongly impact ancestral reconstruction using phylogenetic diffusion models. Depending on the location-specific diversity, over -and underrepresented locations may be more likely to be inferred as source/origin and sink/destination locations respectively. The bias arising from such over -and underrepresentation may be more pronounced when the overall sampling is sparse because the location specific-diversity may be poorly captured in such cases. Although it may be useful to have more samples available from locations in which a large viral diversity has been established, which may for example help to establish its potential role as a source location, and this would be the case for a random sample from the entire population of infections, it remains difficult to assess to what extent convenient sampling reflects the underlying epidemiology. In an attempt to include all sequence data while keeping the number of samples per location as balanced as possible, we divided all the sequences into 26 geographic regions (Table S1 in Text S1) ( Figure S1 in Text S1).
Since these spatial partitions sometimes required arbitrary subdivisions (e.g. breaking up USA, China, Japan and Australia), we also reduce this spatial partitioning to 15 geographic regions by joining regions from a single country (Table S2 in Text S1: Mideast Japan, Midwest Japan, Northeast Japan and Southwest Japan into Japan; East China, Midwest China, South China and North China into China; Northeast USA, West USA, South USA and Midwest USA into USA; and Northeast Australia, Northwest Australia and South Australia into Australia). Within each sampling year, we randomly down-sampled the five locations with the highest number of samples relative to their population size (USA: from 278 to 150; Australia: from 166 to 30; New Zealand: from 59 to 20; Japan: from 341 to 75; South Korea: from 51 to 30) and analyzed three different sub-sampled data sets. We note that trying to keep the number of samples per location as balanced as possible results in geographic partitions of various sizes, and more comprehensive sampling may enable more appropriate geographic partitioning in the future.
Because of the difficulties associated with geographic partitioning, we also identified discrete air communities in the worldwide air transportation network and applied these as location states to our sequence sample. To increase sequence numbers for under-sampled air communities, we complemented the hemagglutinin gene sequences with publicly available sequences from Africa (n = 21), USA (Hawaii, n = 4), Central America (n = 13), South America (n = 46) and Canada (n = 10). From this data set, we removed six sequences that appeared to be outliers in a root-to-tip divergence versus sampling time regression analysis, resulting in a total of 1529 sequences. Within each sampling year, we randomly downsampled the four locations with the highest number of samples relative to their population size (USA: from 318 to 120; Oceania: from 225 to 50; Japan: from 327 to 75; Southeast Asia: from 175 to 100; Table S3 in Text S1) and analyzed three different sub-sampled data sets discretized according to the 14 air communities. To asses the impact of sample sizes on the diffusion predictor identification (see below), we also perform the analysis on the complete data set on the one hand and the two randomly subsampled data sets on the other hand for which we restrict the number of sequences per location to 25 (Fig. S2). These scenarios represent more imbalanced and more balanced sample sizes respectively with respect to the three sub-sampled data sets used throughout the main text.

Incorporating uncertainty in air community assignment
Using a generalization of the method introduced in [2], we identified highly modular partitions in the global air transportation network. For an ensemble of 1000 modularity subdivisions we quantify the uncertainty by an affinity matrix that, for each pair of locations, summarizes the fraction of partitions in which these locations are in the same community. Based on a partition encompassing a number of air communities (n = 14) that is in size close to the 15-geographic region partition, we subsequently obtain the average affinity for each airport to the communities in this partition. We assign each airport to the community for which it shows the highest average affinity, but we take into account its uncertainty by also considering assignments that yield affinities that are > 2/3 of the highest affinity score. This cut-off resulted in 771 ambiguous airport assignments. Finally, we partitioned the sequence data according to the air community assignment and accommodate 368 (24%) ambiguous sequence locations, i.e. those sequences related to airports with ambiguous community assignments, using ambiguity coding in our phylogeographic approach.

Bayesian statistical analysis of sequence and trait evolution
We integrate genetic, spatial and air transportation data within a single full probabilistic evolutionary model and simultaneously estimate the parameters of phylogeographic diffusion using Markov chain Monte Carlo (MCMC) analysis implemented in BEAST [3]. We introduce a novel phylogenetic diffusion model and associated inference procedures in the subsections below. To model sequence evolution, we partition the hemagglutinin codon positions into first+second and third positions [4] and apply a separate HKY85 [5] CTMC model of nucleotide substitution with discrete gamma-distributed rate variation [6] to both. We assume a flexible Bayesian skyride prior over the unknown phylogeny [7]. Exploratory runs using the data for the 26 locations indicated that a relaxed molecular clock represented an over-parametrization [8]. A strict clock was therefore used in subsequent analyses. Because the exact date of sampling was not known for some additional publicly available sequences, we integrated out their dates over the known sampling time interval [9]. We capitalize on BEAGLE [10] in conjunction with BEAST to improve computational performance on our large data sets. MCMC analyses were run sufficiently long to ensure stationarity as diagnosed using Tracer. We used the TreeAnnotator tool in BEAST to summarize trees in the form of maximum clade credibility (MCC) trees. As part of the supporting information files (Dataset S1), we make available an XML document specifying the data and analysis settings for main analysis of the air communities, and the associated empirical trees required to run the analysis (section 1.3.3). This includes accession numbers for all the sequences as well as their sampling dates, the locations we assigned them to (section 1.1), the different sub-samplings, the (GLM) model settings and the predictors (section 1.3.1 and 1.3.2).

GLM diffusion implementation and predictor support
Bayesian phylogeographic inference models discrete diffusion as a continuous-time Markov chain process parameterized in terms of a K × K infinitesimal rate matrix Λ of discrete location change with K representing the number of location states. The GLM diffusion model extends this by adopting a generalized linear model (GLM) approach that takes an arbitrary number P of predictors X = (x 1 , . . . , x P ), where a single predictor x p is a flattened vector of quantities corresponding to entries in the i to j rate matrix x p = (x 1,2,p , . . . x K−1,K,p ) . The GLM considers every instantaneous movement rate Λ ij for i = j in Λ as a log linear function of the set of predictors X, such that: where β = (β 1 , . . . , β P ) represent the effective sizes for the predictors, quantifying their contribution to Λ, and (δ 1 , . . . , δ P ) are (0,1)-indicator variables that govern the inclusion or exclusion of the P predictors in the model. In constructing our predictors, we shift the covariates to ensure positivity if needed (e.g. using a pseudo count for the passenger flux) and subsequently log-transform them. We note that other transformations, e.g. from the Box-Cox family, may also be explored in our framework. The incorporation of indicator variables allows for Bayesian stochastic search variable selection (BSSVS) [11,12], which involves estimating the posterior probabilities of all 2 P possible linear models that may or may not include the predictors. When an indicator δ p equals 1, then predictor x p is included in the model, demonstrating that it helps to explain the diffusion process in the phylogenetic history with high probability. We complete this GLM specification with variable selection by assigning independent Bernoulli prior probability distributions on δ p . We use a small prior probability on each predictor's inclusion that reflects a 50% prior probability on no predictors being included, but specifying equal prior probability on each predictor's inclusion and exclusion yields highly similar results (Fig. S3). Lemey et al. [13] discuss BSSVS in further detail and analogous to Edo-matas et al. [14], we can use Bayes factors (BFs) [15,16] to express how much the data change our prior opinion about the inclusion of each predictor. These BFs are calculated by dividing the posterior odds for the inclusion of a predictor with the corresponding prior odds, e.g. 0.019:0.981 prior odds for the analysis of the 14 air communities and 15 geographic regions (Fig. 2 in the main manuscript), or 1:1 odds for the same analyses using equal prior probability for each predictor's inclusion and exclusion ( Fig. S3): where pp p is the posterior probability that predictor p is included, in this case the posterior expectation of indicator δ p , and qp p is the prior probability that δ p = 1. The posterior odds follows immediately from the marginal posterior probability that a predictor indicator equals 1, estimated through the posterior expectation of the predictor indicator. We specify that a priori all β p are independent and normally distributed with mean 0 and a relatively large variance of 4, which still ensures adequate mixing. We implement the GLM-diffusion parametrization in the software package BEAST [3] and approximate the joint posterior and its marginalizations using standard Markov chain Monte Carlo (MCMC) transition kernels. Similar to recent advances in phylogeographic inference in continuous space [17], we integrate out discrete location states at internal nodes in the trees but draw stochastic realizations of the node states when logging the trees. An important, novel extension to the standard MCMC machinery in BEAST lies in generating an efficient Metropolis-Hastings proposal distribution for the GLM coefficients β. Given the potential for high correlation between predictors X, attempting to update one coefficient β p at a time while holding the remaining coefficients constant returns high autocorrelations times. Instead, we exploit the fixed correlation structure X X between predictors to generate a multivariate proposal β .
In particular, if we assume β are the current realized values, then we draw where α is an auto-tunable variance scalar. Motivation for this proposal stems from imagining that the marginal posterior distribution of β under our phylogenetic GLM should partially approximate a simple linear regression model involving β, whose posterior variance is proportional to X X. We note that colinearity among explanatory variables is not allowed because this would make the X X singular and not invertible. We capture the numerical exception that our transition kernel for the regression coefficients would throw in this case, which will inform the user about colinearity issues. We consider a 'bit-flip' operator on the Bernoulli rate indicators; this transition kernel is further discussed in [18].

GLM diffusion predictors
Depending on the location state partitioning scheme, we considered several potential predictors of global influenza diffusion in the GLM diffusion model: • Average and minimum distance. To test whether geographical proximity predicts influenza diffusion we considered two different great-circle distance measures: (i) the average distance between two locations based on the pairwise distances between all pairs of airports from the two locations and (ii) the minimum distance amongst those pairwise distances.
• Absolute latitude. Absolute latitudes for each region/community were calculated as the absolute values of the average latitudes of the sequence sampling locations (Sequences from unknown locations within specific countries were assigned to the capital of that country) and are listed in Table S1, S2 and S3 in Text S1.
• Passenger flux. The total number of seats on flights between each pair of locations per day. In addition, we also include a separate origin and destination predictor that summarizes the total air flux within each air community or geographic regions, thus representing the within-location air connectivity. Although the passenger flux data does not account for variations in passenger occupancy, the seat-based predictors are expected to be fairly robust to this because the variation in actual fluxes across the links in the air transportation network are orders of magnitude higher than variations between occupancy and seat numbers. Because passenger flux does not differ in a statistically significant manner from symmetry in the global air transportation network [19], we consider flows that were symmetrized.
• Population size and density. Population size estimates for 2005, or interpolated for 2005 based on data close to this year, were obtained from Geographica [20] or www.citypopulation.de (listed in Table S1, S2 and S3 in Text S1). In addition to general population sizes/densities, we also consider urban population sizes for the air communities/geographic regions based on airport-associated cities. For this purpose, we downloaded population sizes for all cities with a population exceeding 15000 (downloaded from the GeoNames geographical database: www.geonames.org) and associated them with the closest airport from the 1227-largest-airport network (that represents 95% of the passenger flux) within a 75 km radius. We subsequently aggregated all these airport associated population sizes for each air community or geographic area in which the airports are located and also included these as potential predictors in the model. This measure does not appear to be very sensitive to the radius threshold we use to associate cities with airports because a radius of 200 km results in numbers that highly correlate with the urban population sizes we used (r > 0.97). As an urban population density measure that accounts for the number of airports these population sizes are distributed over, we also considered a predictor that divides the airport-associated urban population sizes by the number of airports to which they were assigned in each air community/geographic region. Both the urban population sizes and densities are also listed in Table S1, S2 and S3 in Text S1. We report the GLM estimates for the general population sizes and densities (Fig. 2 in the main manuscript and Fig. S3), but list the estimates for the full set of predictors for each sub-sampled data set in Fig. S1. In addition, we report estimates for the full set of predictors for the complete data set and for two random sub-samples that include a maximum of 25 sequences per location for the air communities (Fig. S2).
To include a measure of population clustering, we also incorporated an 'agglomeration Index (AI)', which is based on population density, population of large urban centers (>50,000 people) and travel time to that urban center [21]. This was included for the analysis of the 14-air community and 15-location geographic partitions (Table S2 and S3 in Text S1), but the level of detail was not available for the 26-location geographic partitions. For air communities or geographic regions for which sequences were available for a limited number of countries, we summarized average (urban) population sizes, densities and AIs based on only these countries. All demographic measures were included as separate origin and destination predictors.
• Economic data -Gross domestic product (GDP). We construct predictors based on GDP data collected for the year 2004 from the World Bank (data.worldbank.org/) If data was not available from this resource, we resorted to GeoHive (http://www.geohive.com) to complement it. Analogous to the demographic measures, we obtain GDP averages from countries for which sequence samples were available in case we only had samples available for only a limited number of countries per air community or geographic region. Analogous to the AIs, the GDP was available for the 14-air community and 15location geographic partitions, but not to the level of detail required for the 26-location geographic partitions.
• Viral surveillance data. To test the predictive power of viral surveillance data, we essentially aimed at capturing the nature and degree of synchronicity of yearly incidence profiles in each air community/geographic region. To this purpose, we extracted the number of influenza viruses A(H3) detected per country from week 1 in 1997 to week 45 in 2010 from FluNet/WHO (www.who.int/flunet) for relevant countries in the 14-air community and 15-region geographic partition schemes. The level of detail retired for the 26-region geographic partition scheme was not available. Taiwanese surveillance data was obtained from [22]. We focused on the influenza A(H3) incidence counts between 2002 and 2007 or as close as possible to this time period when insufficient data was available. Average incidence counts were used when data from multiple countries per region/community was available. We subsequently calculated average incidences per week across multiple years for each region/community, normalized these weekly averages and smoothed them with a Gaussian standard deviation of 2 weeks. Fig. S2 in Text S1 depicts the resulting incidence profiles for the 14 air communities.
We derived several potential predictors from these incidence profiles, including incidence overlap, origin incidence versus destination growth rate, peak time difference, and incidence in the origin location at fixed times prior to peak incidence in the destination location. The incidence overlap summarizes the overlapping area under the origin-destination incidence curves for each pair of locations. The origin incidence versus destination growth rate sums the product of origin incidence and destination growth rate for each week of the year. Peak time difference quantifies the difference in peak incidence for each origin-destination pair. For the latter, we summarized the donor incidences at 4, 8, 12, 16, 20 and 24 weeks prior to peak incidence in the destination location as potential predictors. We report the GLM estimates for the incidence overlap (Fig. 2 in the main manuscript and Fig. S3), which serves as a representative predictor for all incidence-derived measures. The estimates for the full set of predictors are summarized for each of the three sub-sampled data set in Fig. S1 and for the complete data set and two smaller but more balanced sub-samples in Fig. S2.
In addition to including measures of synchronicity in epidemic profiles between pairs of locations, we also attempted to capture the seasonality of each air of the 14 air communities and 15 geographic partitions. In particular, we consider the origin and destination seasonal entropy, which is the degree of entropy in weekly incidence across the year for the origin/destination region. In this case, flatter distributions (like China, Taiwan and Southeast Asia) have larger entropies. As an alternative, we consider a predictor that quantifies the number weeks out of the year that have standardized incidence greater than the median incidence ('origin and destination above-median incidence').
• Antigenic evolution. Because antigenic evolution can provide insights into the seeding dynamics of seasonal H3N2 [1], we sought to include the average antigenic divergence for each region as phylogeographic diffusion predictors. Based on the available antigenic cartography data for the strains in our phylogeographic analyses, we performed a local regression (LOESS) of the principle antigenic component, obtained from a multidimensional scaling analysis of hemagglutination inhibition assay measurements [1], against time. The resulting scatter plot with strains colored according to air community is presented in Fig. S3 in Text S1. Distances from the spline (residuals) were calculated for each antigenic measurement and average residuals were obtained for each region/community, which reflects whether a location is on average antigenically leading or trailing [1]. These average residuals are listed in Table S1, S2 and S3 in Text S1. We considered the exponentiated residual and exponentiated negative residual as a measure of efflux and influx respectively for each location and included these as separate origin and destination predictors.
• Sample sizes. To test the impact of sampling effects, we considered origin and destination sample sizes (number of H3N2 sequences included per discrete location state in the phylogeographic analysis) as separate predictors. Although sampling sizes are expected to have an impact on the number of location transitions, support for other factors in addition to sampling size predictors may suggest that they are robust to potential sampling biases.
In constructing predictors, we log-transform all strictly positive quantities and then standardized all predictors to have a mean of 0 and a variance of 1 before their incorporation in the GLM approach. Standardization facilitates prior specification and BSSVS.

Fitting diffusion models to empirical tree distributions
Although we generally desire to simultaneously reconstruct sequence and discrete/continuous trait evolution using our Bayesian statistical framework, integrating over tree-space becomes a computationally daunting task for a large number of taxa. The main limiting factor in Bayesian analysis of evolutionary history is typically the efficiency with which topology proposals explore phylogenetic tree space [23]. To side-step these limitations, we seek to approximate phylogenetic uncertainty in our phylogeographic estimates in cases where sampling from tree space needs to be performed repeatedly (e.g. when comparing different diffusion models). To this purpose, we follow [24] and implement a transition kernel that randomly draws from an empirical posterior distribution of trees which, in our case, were solely inferred from sequence data. Because the likelihood of a tree topology will largely be dominated by an informative sequence alignment compared to a single discrete (location) site, we expect such an empirical approach will closely approximate the phylogenetic uncertainty in the joint inference approach. We provide an example empirical tree set for one of the subsampled air community data sets in Dataset S1.

Comparing simulated spatial expansion to recorded H1N1 pandemic data
Based on the numbers of pandemic H1N1 isolates detected per week per country during 2009, downloaded from the World Health Organization database FluNet (www.who.int/flunet), we combine time series between countries to estimate regional patterns. To control for sampling intensity, detection counts are adjusted so that the total number of isolates is equal between countries. The resulting counts are further adjusted by scaling in proportion to the population size of each country. The estimates for each region are only marginally affected by these adjustments (moving the initial epidemic peak by at most one week). To control for week-to-week stochastic variation in sampling intensity, we smooth each regional distribution using kernel density estimation with a Gaussian kernel and a 10-day bandwidth.
The resulting time series often show multiple distinct peaks, with many temperate regions showing a smaller summer peak and a later more-severe fall peak (Fig. S4 in Text S1). We are primarily interested in the initial spread of the pandemic and so take the timing of the initial peak as an indication of the rate and pattern of geographic spread. We run simulations for different migration matrices and compare simulated epidemic peaks to observed initial peaks in the FluNet data (Fig. 4 in the main text for simulations using an equal rate matrix, air travel data, standard phylogeographic estimates and GLM estimates respectively, and Fig. S5 for simulations using a migration matrix estimated by BSSVS). We calculate the peak incidence of each regional epidemic across 100 simulations, yielding a mean and 95% range for the timing of each peak in each region, and measure the relative correspondence between the mean peak times and the observed peak times for all locations except Mexico using the Spearman's rank correlation coefficient. The significance test takes the mean peak times across the simulations and the mean observed peak times, and then randomly permutates their values across the regions to assess in what frequency we get more extreme Spearman rank correlations than the original correlation value.
2 Evaluation of the GLM-difussion approach on empirical data As a validation, we test the GLM extension of the discrete diffusion model by simultaneously inferring the phylogeographic history of dog rabies viruses in Morocco and identifying the factors driving their dispersal based on a previously published data set [25]. The original study evaluated several potential predictors of dog rabies dispersal by fixing the parameters of the discrete rate matrix and independently fitting different models to the joint sequence and spatial data. Model-fit, as assessed using a harmonic mean estimator (HME) of the marginal likelihood, indicated that road distances were the best predictors of rabies diffusion in Morocco [25]. Considering the large variance and poor repeatability of the HME (see e.g. [26]) and most importantly the fact that the HME systematically overestimates the marginal likelihood [27], it remains difficult to unambiguously reject competing hypotheses (such as geographic distances, which were less than one ln likelihood unit different from the best fitting model). Here we compare our GLM-diffusion model with different marginal likelihood estimators including the HME and recent BEAST implementations of path sampling (PS), stepping stone sampling (SS) [28,29,30,27,31], as well as the Akaike's information criterion through MCMC (AICM, [28,32]). Table S4 lists the marginal likelihood estimates for rates fixed to equal values, inverse distances, the product of origin and destination population sizes (human population sizes as a proxy for dog population sizes), the product of origin and destination population sizes divided by geographic distance (the so-called gravity model), inverse population surface distances, inverse road distances and accessibility estimates [25]. Two independent analyses were performed to assess the variability of the different estimators. The two runs for PS and SS consistently favored road distance over the other predictors while the HME and AICM preferred either road or geographic distance. Formal evaluation of these model comparison approaches has recently shown that both PS and SS are more reliable and more consistent than the HME and the AICM [28,29,30]. Nevertheless, substituting marginal likelihood estimates between the two runs for some predictors could yield a different PS or SS ranking of the models indicating that unambiguous model selection remains difficult. When applying the GLM-diffusion model, we do not consider equal rates because the model can default to this scenario when no predictors are included. We also do not explicitly construct a product of origin and population sizes nor a gravity model because the GLM approach can invoke these by simultaneously including the individual predictors (origin population size, destination population size and distance). The GLM-diffusion results in Table S 5 indicate that only road distances are supported (BF= 6.64) as a predictor with a mean conditional effect size of about −1.37 on a log scale and credible intervals that do not include zero. In conclusion, the GLM-diffusion model is consistent with state-of-the-art marginal likelihood estimators and direct support for predictors allows more robust model selection when comparing similar models.