Combining Spatial-Temporal and Phylogenetic Analysis Approaches for Improved Understanding on Global H5N1 Transmission

Background Since late 2003, the highly pathogenic influenza A H5N1 had initiated several outbreak waves that swept across the Eurasia and Africa continents. Getting prepared for reassortment or mutation of H5N1 viruses has become a global priority. Although the spreading mechanism of H5N1 has been studied from different perspectives, its main transmission agents and spread route problems remain unsolved. Methodology/Principal Findings Based on a compilation of the time and location of global H5N1 outbreaks from November 2003 to December 2006, we report an interdisciplinary effort that combines the geospatial informatics approach with a bioinformatics approach to form an improved understanding on the transmission mechanisms of H5N1 virus. Through a spherical coordinate based analysis, which is not conventionally done in geographical analyses, we reveal obvious spatial and temporal clusters of global H5N1 cases on different scales, which we consider to be associated with two different transmission modes of H5N1 viruses. Then through an interdisciplinary study of both geographic and phylogenetic analysis, we obtain a H5N1 spreading route map. Our results provide insight on competing hypotheses as to which avian hosts are responsible for the spread of H5N1. Conclusions/Significance We found that although South China and Southeast Asia may be the virus pool of avian flu, East Siberia may be the source of the H5N1 epidemic. The concentration of migratory birds from different places increases the possibility of gene mutation. Special attention should be paid to East Siberia, Middle Siberia and South China for improved surveillance of H5N1 viruses and monitoring of migratory birds.


Introduction
Highly pathogenic avian influenza (HPAI) A subtype H5N1 virus was first detected in Guangdong, China in 1996. It has initiated several outbreak waves that swept across the Eurasia and Africa continents since 2003. The high fatality of over 60% in human infection has sparked a fear that a disastrous pandemic strain may arise by mutation or reassortment with currently circulating seasonal influenza virus [1]. Intervention efforts would be better made if the main transmission agents and their ways of transmission are better understood. However, our knowledge on the global spreading of H5N1 is still limited. There is an ongoing debate regarding how H5N1 is spreading around the world [2][3][4][5].
Long distance migration of wild birds, poultry production and trade are at the center of deliberation. Advancements in transportation, coupled with the global food chain, enable the virus to be spread anywhere. Migratory birds, especially wild waterfowl such as Anseriformes and Charadrillformes, are considered natural carriers of all avian influenza A viruses. However, what exactly the main agent of dispersal and the spreading mechanisms are remain undetermined. Most previous studies with evidences on the cause and spread of H5N1 were based on sporadic specific cases, such as virus detection in smuggling and illegal poultry trading [6], or isolation in wild bird populations [7], using virus isolation, sequencing and phylogenetic analyses. At the regional level, environmental and socio-economic risk factors favoring the reoccurrence of the viruses have been identified through statistical regression analysis [8]. However, a more comprehensive global analysis on the spreading mechanisms of H5N1 at various spatial scales has rarely been done [46].
It is of interest to concentrate on how the virus is introduced into a new region. Once this issue is solved, it will be easier to implement surveillance and control strategies. Most previous studies make use of phylogenetic analysis to explore the emergency and circulation of a virus. Phylogenetic analysis is a powerful bioinformatics tool that can elucidate the phylogenetic relationships of virus isolates, give further insights into the origin of the virus, and provide clues to inferring putative pathways of introduction. But most phylogenetic analyses of H5N1 viruses were done at the regional to continental scales [9,10]. It is insufficient to determine when, where, and how the virus is transmitting at the global scale. Furthermore, most analyses focused on the divergence and evolution of gene sequences, with little consideration on their geographic migration [11]. With a georeferenced global dataset [12], a study on the geographical analysis at the global scale is now possible [45]. Figure 1A shows the distribution of animal outbreak cases from November 2003 to December 2006, including 3,365 poultry and wild bird cases in 50 countries. In this study, we use 2,642 poultry cases ( Figure 1B) and 334 wild bird cases ( Figure 1C), while the remaining cases in the dataset are uncertain or incomplete.
In this paper we have three objectives: first, to determine the transmission mechanisms from the spatial temporal distribution of those global H5N1 outbreak cases; second, to delineate important introduction and spreading routes since 2003; and third, to identify the primary spreading agents for the long term and repeated circulation of the H5N1 virus.

Spatial and Temporal Patterns of H5N1 Outbreak Cases
The global geographic H5N1 dataset registers each outbreak case at the location of the nearest village or township. We estimate most of the positional registration errors in densely populated areas to be within 10 km. A recent study estimates the maximum positioning error to be 46 km [45]. Hence, on the global scale, H5N1 cases can be considered points in space and time. A point pattern analysis of H5N1 cases is used to determine where and when cases are spread as well as which spatial and temporal scales are optimal for disease outbreak clustering. Choosing the right scale is critical in clustering analyses. Here we used exploratory spatial statistical techniques to examine the patterns of H5N1 outbreaks. We analyzed the data from (1) the spatial [13]; (2) the temporal [14]; and (3) the spatial-temporal perspectives [45].
Ripley's K function is an effective approach to analyze point process data. Point process data contain locations and time of events. K function can be used to describe how the expected value of a point process changes over different spatial and/or temporal range of sizes (lag). K function value would be plotted against a series of increasing lags, and a peak value indicates an outbreak cluster emerging at the scale of the corresponding lag. Because the analysis is on a global scale, the spatial lag is in spherical distance (see the Methods for details of the calculation). This is different from the K function analysis based on a flat plane [45]. Avian flu primarily occurs on habitable land, so we excluded from our analysis inhabitable areas such as ocean and permanent ice cover derived from remotely sensed data [15]. Figure 2A shows the K function value calculated from the entire global dataset of H5N1 outbreaks. There are two peaks located at approximately 300 km and 1250 km. We calculated the K function of center points of residential areas, which are extracted from the 2003 Defense Meteorological Satellites Program (DMSP) nightlight data in ''Version 2 DMSP-OLS Nighttime Lights Time Series'' database [16] (see Method for a description of the processing procedure). The major peak is around 300 km. We also calculated the K function of important staging points of wild birds, which are crucial birds' habitats [17]. The peak spatial lag is around 1250-1300 km ( Figure 3B). The spatial pattern of poultry is strikingly similar to that of the global patterns because a majority (88%) of the cases in the global dataset belongs to poultry cases ( Figure 2B).
We suspect that the spatial lag for wild birds is related to migration distances of wild birds. Wild bird migration is considered as a mode of H5N1 transmission, referred to as the ''bird migration'' mode. The 300-km spatial lag corresponds to the average distance between major residential centers. Poultry are produced, transported, and consumed by people. Hence, the residential centers can be deemed as centers of poultry trade. This type of poultry trade represents a unique mode of H5N1 transmission by transportation, referred to as the ''poultry transmission'' mode.
In an epidemiological study, the intensity of disease outbreak cases varies with time. Thus, it is necessary to explore the temporal patterns in the global H5N1 cases using the extension of the K function. As can be seen in Figure 2d, the primary peak is located at 55 days, and a second peak is located at 80 days. The poultry cases exhibit a single peak at 55 days ( Figure 2E), while the wild bird cases exhibit a single peak at 80 days ( Figure 2F).
The 55-day peak suggests that H5N1 cases tend to concentrate during a period of 55 days. The temporal cycle may be explained by the passage of time from transportation, incubation, and the latent period up to the detection period. Under normal circumstances, the transportation time lasts from a few days to 2-3 weeks. The incubation period for the HPAI virus is highly variable, from 2 to 3 and up to 16 days. The persistence time of the virus also changes with the environment: it can survive in poultry houses for up to five weeks [18] and remain viable in feces for up to 32 days [19]. All these periods add up to about 30 to 60 days. An 80-day time lag may fit the migration cycle of migratory birds [20]. Migratory birds often take their spring or autumn migration for approximately three months. In general, only when birds settle at breeding or stopover sites, they could spread their self-carrying viruses when in contact with other birds or poultry.
The spatial-temporal pattern of H5N1 cases is illustrated by spatial-temporal K functions using a contour map (Figure 4). High values in the contour map indicate space-time clustering in the data. Only spatial-temporal patterns can realistically depict the spatial-temporal process of certain events of interest because spatial or temporal analysis alone may bring together irrelevant events that occur too far apart in space or in time to be captured by other means. Here, the most significant space-time peak involves a time period from 30 to 90 days and a spatial lag from 400 to 700 km. Interestingly, we find the lags to be similar to those of the separate spatial and temporal patterns of poultry, both close to the 300-km distance lag and covering the 55-day time lag. This implies that poultry abundance and density are responsible for the intensity of avian flu outbreaks. The two scales form the basis for choice of spatial and temporal scales in a subsequent spatialtemporal cluster analysis of the H5N1 outbreak cases. Poultry served as fuel and wild birds acted as fire seeds.

Spatial-Temporal Cluster Analysis
From the spatial-temporal analysis of the H5N1 outbreaks, we found that they are clustered both in a short time and in a relatively small area (compared with the whole study area). Cluster patterns have an advantage: large quantity of points can be easily tracked in space and time. We used SaTScan statistics software to look for statistically significant spatial-temporal clusters based on a 400-km spatial scale and a 30-day time scale as mentioned above [21]. These two derived scales can help locate cluster centers and exclude those dependent cases around each cluster center. Phylogenetic analysis can then be applied to gene sequences at corresponding spatial-temporal clusters to identify the linkages between them. Possible spreading routes of H5N1 will then be derived.

Phylogenetic Analysis
From GenBank, we downloaded 185 influenza A H5N1 hemagglutinin (HA) gene sequences isolated during 2003 to 2006 with the same outbreak location and year as the clusters identified above. The A/Goose/guangdong/1/96 sequence was also downloaded for gene tree construction (Table S1, Figure 5). One caveat may be that, without detailed geographical and time information in those sequences, the clusters and gene sequences can not be matched exactly. To most sequences, the location and time information was only defined to country and year level. Strains located in large countries, like China, contain province-  level information. We referenced the source articles of all sequences, and checked and implemented their detailed descriptions to the sequence database. With this accuracy improved dataset, it is relatively easier to match them with the geospatial data based on their spatial and temporal attributes. Mismatch during the georeferencing was unavoidable, but we attempted to minimize them to the minimum level. It is desirable that all available publicly sequences are precisely documented with exact location and time.
From the perspective of geographic distribution, most sublineages in this phylogenetic tree comprise sequences from geographic adjacent areas, usually within one country. And they are labeled with the geographic region. Viruses isolated in China in 2004 are consistently grouped with A/Goose/Guangdong/1/ 96 ( Figure 5). Although multiple reassortment events happened onward, and several replacements of predominant genotypes were observed from 1996 [48,49] [22]. Interestingly, the two isolates from South Korea in 2003 also fell in the China05 sublineage instead of the China04 ( Figure 6). Thus, it is suspected that the viruses in the China05 sublineage may be reassorted or recombined from the Korea 2003 virus and the A/Goose/Guangdong/1/96 virus. Another noteworthy evidence is that all the Qinghai Lake isolates can be traced to one isolate from migratory ducks at Poyang Lake, north of JiangXi Province in February 2005. Qinghai Lake locates at northeastern Qinghai Province, and acts as one of the most important birds' breeding areas. All these suggest that migratory birds be responsible for the virus dissemination. Both Korea and South China are wintering regions in the East Asia flyway of  Table S1''. Letters stand for the isolated places, and numbers stand for the isolated year. Multiple sequence alignments were performed using Clustal W algorithm and Maximum Likelihood trees was reconstructed. Approximate Likelihood Ratio test values greater than 50% were marked at the branches. doi:10.1371/journal.pone.0013575.g005 migratory birds while East Siberia is the breeding region. When wintering birds carrying the Gs/GD/1/96 virus from China and those carrying the virus from Korea meet in East Siberia during the breeding season, their interaction could have caused the cross infection leading to a reassorted strain. From late 2004 to early 2005, migratory birds carried the reassorted viruses back to south China and were identified. The outbreak time and location of the virus at Poyang Lake and Qinghai Lake matched well with the pattern of bird migration. Thus, we speculate that the Qinghai Lake virus was a reassortant from the Fujian-like sublineage, possibly along with those from Southeast Asia as well [7].
In Southeast Asia, viruses in Vietnam have evolved over time into geographically distinct groups. Three isolates in North Vietnam in 2005 clustered into China04 sublineage. This suggests that the virus was likely introduced to Vietnam, most probably from Guangxi and Yunnan, China [2,47]. Other isolates in South Vietnam, were clustered with those in Thailand and Laos in 2004. This introduction was likely from Thailand through Laos. All viruses from Indonesia in 2005 form a distinct sublineage suggesting their outbreaks are likely originated from a single introduction that spread throughout the country. Older viruses in the previous year from Yunnan and Guangxi of China, that are genetically clustered in the Indonesia group are suspected as the origin. Both Yunnan and Guangxi are located in southern china, while Guangxi shares its border with Vietnam, and Yunnan borders Myanmar, Laos, and Vietnam.
Besides the sublineages circulating in East and Southeast Asia, there is a dominant one containing isolates from Europe, Middle Asia and Africa (named as EMA sublineage). Many studies suggest that the origin of the EMA sublineage is Qinghai Lake [23][24][25]. The gene tree points to the evidence that the isolates in Southern Russia, such as Kurgan, Novosibirsk, Tula, and Astrakhan, are identical to the A/bar headed goose/Quinghai/1A/2005 strain. West Asia sublineage contains isolates from India, Afghanistan, Pakistan, Russia (Krasnodar), Azerbaijan and Italy. The close relationship between the Indian viruses and those in the Central Asia Flyway of migratory birds suggests that the viruses in West Asia may have been introduced through migratory birds from middle Siberia.
The sublineages of European countries reveal a western spread trend from Mongolia, Russia, Romania, to Northern European countries including Sweden and Denmark, and a southwest spread trend from those north European countries to West Africa. In all European sublineages, except Egypt and Romania, there is no solely contained stain from just one country. This means that the HPAI viruses in European countries have been highly mixed. They are related with each other but had reassorting characteristics. The Sweden, Germany, Italy sublineage contains strains from Romania, Italy, Sweden and Germany, revealing a longitudinal direction of virus spread. This spread route was supposed to be caused by a sudden westward movement of wild birds due to spills of Arctic cold in early 2006 [26]. In the West African sublineages, Sudan, Burkina Faso, Cote d'Ivoire are in a group with isolates from Kurgan and Russia. We speculate this sublineage was originated from Kurgan, then to central Asia (Iraq) and Europe, and finally reached Nigeria through the Black Sea-Meditterranean flyway. Isolations in Egypt are in the same group with those in Turkey and Romania with a 97% bootstrap rate implying that the viruses were introduced from west Russia to Turkey, then to Egypt. All the above suspected spread routes, along with the spatial-temporal cluster locations were drawn in Figure 7.

Discussion
In this paper, we highlighted two transmission modes at the global scale through a spatial-temporal analysis: the poultry transmission mode and bird migration mode. From Figure 7, except for the introduction into North and South Vietnam which may be caused by poultry transmission, all remaining routes are most likely caused by the bird migration mode. This is because the spatial and temporal regularity can only be explained by routes of wild bird migration.
Spatially, the spread route coincides well with the migration flyways. Our study area contains four flyways among the eight main flyways in the world (Figure 7).  Figure 7, outbreak always started in Southeast and East Asia during the spring migration period, and at the end of autumn migration, the virus activity in those regions intensifies. A clear spreading pattern can be observed outside Asia, where poultry management is more controlled and virus circulation history were not as frequent as they are in Southeast Asia. Cases occurred in Qinghai Lake in May and those in Siberia in August match the time of breeding season. After this period, birds began their autumn migration, which lasts from August to November. Different from the spring migration, in order to take care of the juveniles, birds fly slowly and stop more frequently, which favors wider spreading. Birds from Siberia migrate along the Black Sea-Meditterranean flyway. And during October and December, H5N1 viruses already had been spread to the warmer regions in the Black Sea area and the Mediterranean region. From December to the following March, western European and West Africa countries reported wild bird infection. Thus, we conclude that it is the wild birds that are the main agent for H5N1 virus dispersal.
Tremendous efforts have been put into the investigation on the source, the ways of introduction, and the evolution of H5N1 transmission [9][10]27]. However, most analyses were conducted at regional scales and with a relatively short time span. Conclusions from such studies may lead to arbitrary and unilateral decision making. Gs/GD/1/96 sublineage has been circulating in China for almost ten years. After its reassortment into the Fujian-like virus, it was hypothesized to be facilitated by vaccination in China [22]. However, our analysis indicates that this mutation may be caused by cross-infection of wild birds in East Siberia.
We recommend that monitoring and surveillance are enhanced in three hinge sites in the way of H5N1 virus spread: East Siberia, Middle Siberia, as well as South China and the Southeast Asia region. The Russian Far East region is the most important breeding area in the East Asia flyway, with most H5N1 endemic countries falling into the wintering region of this flyway. However, less surveillance work is carried out in the Russian Far East because of its low population density. Thus, more efforts should be put to the virological surveillance, isolation and characterization of dead birds. Middle Siberia possesses the highest density and richest species of migratory birds. Three migration flyways cross-cover this region, but the migration patterns of wild birds are still not well understood there, making it hard to tract the complex migration movements and recognize the relationship among the infected wild birds in European, Africa and South Asia countries. Hence, more work is needed over these regions to study the spatial and temporal migration patterns of wild birds from a variety of avian orders. In order to eradicate the long circulating virus pool in South China and Southeast Asia, and reduce the possibility of cross infection between poultry and poultry, or wild birds and poultry, the governments of China and Southeast Asian countries should regulate their poultry market, especially the live poultry market and backyard poultry raising practices.

Spatial point pattern analysis
K function is originally defined as the following [28]: where E(d) denotes the expectation of the number of events found within certain scale d, and l is the intensity of the events.
Therefore, a suitable estimate of spatial K function can be calculated by: where R is the total area of the study, n is the number of observed events, I d (i,j) is an indicator function that takes value of 1 when the spherical distance between point i, and point j is less than d. W ij is the adjustment factor of the edge effect, which is the ratio between the area of a spherical circle of arc-radius d that excludes ocean and permanent ice cover and the area of the spherical circle. Because our analysis is done on a global scale, we work with spherical distance for any pair of points on the globe: dij~R earth arccos ( sin (lati) sin (latj) z cos (lati) cos (latj) cos (loni{lonj)) To assess the point pattern, one way is to compare K(d) estimated from the observed data with A(d) (A(d)2 )), which is the area of the spherical circle of arc-radius d. In clustering, K(d) would be greater than A(d), and less than A(d) under regularity. We apply a transformation to K(d) to have L(d)~R Ã arccos (1{ K(d) 2pR 2 ){d, and to plot L(d) against d.
We calculated the distance from 1km and the bin size is 2km. In Figure 2 and 3, we only show curves before 2500km. Curves beyond 2500km show direct downward trend without much fluctuation.
The upper and lower bounds are determined by undertaking Monte Carlo simulations 99 times. For each simulation, we generate the same number of random points as the cases, and then calculate their K functions. To each lag, the upper and lower bounds are the minimum and maximum K values from 99 simulations.

Temporal point pattern analysis
In the temporal K function, the time lag is t, and intensity is calculated with respect to time. Consequently, the temporal K function takes the following form: where, T is the time span from the earliest case to the latest case in our dataset. For convenience, we changed the Gregorian date to the Julian date. We apply L(t)~K(t){2t to the temporal K function for visualization purposes. To avoid the time edge effect in the analysis, we calculate from the first nth to the last nth point, which will only introduce the amplitude influence on the result (here we assign ''n'' as 300). Finally L(t) is plotted against t. I t (i,j) is an indicator function that takes value of 1 when the time interval between point i and point j is less than t.

Spatial-temporal point pattern analysis
The spatial-temporal K function, K(d,t) is defined as the expected number of events per unit of area and per unit of time. That is, where, R, T, n and W ij are as used above, and I d,t (i,j) will be 1 if the distance and time intervals of point i and j are within d and t, respectively. If there is no spatial-temporal interaction, then K(d,t) is simply the product of the spatial and temporal K-functions, K(d) and K(t), respectively. Thus, a test for space-time interaction may be based on this equation: L(d,t)~K(d,t){K(d)K(t).

SaTScan and GIS used in exploring H5N1 spatialtemporal spread patterns
SaTScan is a freely available software program that can be used for cluster detection using the theory of scan statistics. To run SaTScan, two datasets are required. One is a coordinate file containing location identifiers (IDs) and x and y coordinates describing the location. The other one is a case file that contains location IDs, number of cases, and the date. The method uses a moving circular window with its center settled on any point of a map to scan the whole region. For each distinct window, a likelihood ratio is calculated. For circle C i,r , the likelihood ratio is where n i,r is the number of cases inside C i,r , N is the total number of cases, and m i,r is the expected number of cases inside C i,r , and i stands for the outbreak event.
The circle with the maximum likelihood is picked out and identified as the most likely cluster. To extend scan statistics to clustering in space and time, we use cylinders instead of circles, and the height of the cylinder represents time. The rest of the process is unchanged. The likelihood ratio is calculated for each cylinder and the scan statistic is the maximum likelihood ratio over all possible cylinders, again corresponding to the most likely cluster.

DMSP night light data for extraction of major residential areas
The cleaned up average of the visible band data of the nighttime DMSP satellite images acquired in 2003 was used as a proxy measure of urban extent in this study. The dataset contains lights from cities, towns, and other sites with persistent lighting. Successful application in delineating the boundary of residential areas with nightlight data had been achieved [29]. Raw nightlight images have booming problem, which exaggerate and shift the extent of urban areas, even union two neighboring areas together. The images must be thresholded and different thresholds must be applied to different parts of the world with different level of economic development [30]. Overthresholding tends to eliminate areas with small extent. Hence, determining appropriate light thresholds for delineating residential areas remains a challenge. The simplest but efficient way to delineate the boundaries of residential areas is to visually compare the 1 kilometer resolution nightlight image with 30 meter resolution TM/ETM+ satellite images. Fourteen TM/ETM+ images acquired over different cities were chosen. Each image represents certain part of the world at different levels economic development. They were downloaded from the ESDI GLCF website [31]. The image acquisition date, mostly in 2000 and 2001, was close to the collection dates of the nightlight dataset. Detected urban areas with different thresholds from the DMSP image were compared with those urban areas delineated from TM/ETM+ images. Finally the most optimal threshold producing the nearest size and number of residential areas was chosen ( Table 1). Boundaries of residential area overlaid on the corresponding TM/ETM+ images are shown in Figure S1. Our main purpose is to extract the center of each residential area, rather than their actual boundaries ( Figure S2). Thus, a medium value, avoiding exaggeration or underestimation of the number of cities, is chosen for each continent.

Phylogenetic analysis
Full length or partial length of hemagglutinin (HA) sequences were obtained from Genbank. Sequences were then aligned by CLUSTAL W algorithm [32] implemented in BioEdit v.7.0 [33]. Phylogenetic relationships of the HA were constructed by Neighbour-joining (NJ), Maximum Parsimony (MP) and Maximum Likelihood (ML) approaches ( Figure S3, S4, 5). The best nucleotide substitution model was selected using the Akaike Information Criterion [34] and a hierarchical likelihood ratio [35] test in jModelTest v. 0.1.1 [36]. For our datasets, the General Time Reversible model [37] assuming a rate variation across sites according to a gamma-shaped distribution [38] with an estimated proportion of invariant sites [39] were selected by the above mentioned criteria. For NJ tree reconstructions, we used composite maximum likelihood algorithms implemented in MEGA v. 4.0 [40] to estimate the transversion/ transition bias and nucleotide substitution patterns. We assumed that substitution rate varies among sites and substitution pattern heterogeneous among lineages. Similarly, alignments were examined and investigated by an MP approach with heuristic search in MEGA. Tree reliabilities were tested with 1000 bootstrap replicates to yield a majority consensus tree. For ML tree reconstruction, we used a web server (http://www.atgc-montpellier.fr/phyml/) to perform PhyML [41]. The heuristic searching strategy for the best topology was started via five random BIONJ trees and those trees were moved by nearest-neighbour interchange (NNI) and subtree pruning and regrafting (SPR) approaches. Clade support for the phylogenetic inferences was estimated by the approximate Likelihood Ratio Test [42] using a Shimodaira-Hasegawa-like procedure in PhyML. Numerous simulations showed that aLRT estimation for the probability of a branch is better than bootstrapping [43]. For all tree-based analysis, bootstrapping and aLRT values of the nodes in the consensus tree were estimated. The final consensus trees were visualized and edited in TreeView [43] and FigTree v.1.2.3 [44]. The aLRT and bootstrap value $50% was regarded as high support. Sublineages with high support in ML, NJ and MP trees are discussed in this paper. Since the phylogenetic organizations of the three trees are identical, we only used the ML tree for illustration. NJ and MP trees can be found in supplementary files.