Host-Specific and Segment-Specific Evolutionary Dynamics of Avian and Human Influenza A Viruses: A Systematic Review

Understanding the evolutionary dynamics of influenza viruses is essential to control both avian and human influenza. Here, we analyze host-specific and segment-specific Tajima’s D trends of influenza A virus through a systematic review using viral sequences registered in the National Center for Biotechnology Information. To avoid bias from viral population subdivision, viral sequences were stratified according to their sampling locations and sampling years. As a result, we obtained a total of 580 datasets each of which consists of nucleotide sequences of influenza A viruses isolated from a single population of hosts at a single sampling site within a single year. By analyzing nucleotide sequences in the datasets, we found that Tajima’s D values of viral sequences were different depending on hosts and gene segments. Tajima’s D values of viruses isolated from chicken and human samples showed negative, suggesting purifying selection or a rapid population growth of the viruses. The negative Tajima’s D values in rapidly growing viral population were also observed in computer simulations. Tajima’s D values of PB2, PB1, PA, NP, and M genes of the viruses circulating in wild mallards were close to zero, suggesting that these genes have undergone neutral selection in constant-sized population. On the other hand, Tajima’s D values of HA and NA genes of these viruses were positive, indicating HA and NA have undergone balancing selection in wild mallards. Taken together, these results indicated the existence of unknown factors that maintain viral subtypes in wild mallards.


Introduction
The influenza A virus is a zoonotic pathogen that infects a wide range of mammalian and avian species [1]. According to the antigenicity of hemagglutinin (HA) and neuraminidase (NA), influenza A viruses are divided into 18 HA subtypes and 11 NA subtypes [2]. The natural hosts of influenza A viruses are aquatic birds, such as ducks, geese, and gulls [3]. Sixteen HA subtypes and 9 NA subtypes of influenza A viruses are circulating among these aquatic bird species. So far, H1N1, H2N2, and H3N2 subtype viruses have caused pandemics in humans [4,5]. H5N1, H5N2, and H7N7 subtype viruses cause highly pathogenic avian influenza to chickens, and they have damaged poultry industry for long time [6,7]. Zoonotic transmissions of viruses from pigs and chickens to humans have been reported frequently [8][9][10].
Since all the influenza A viruses circulating in humans and poultry originated from their natural hosts, understanding the evolutionary dynamics of influenza A viruses in aquatic bird species is important for the control of both avian and human influenza. Kida et al. [11] showed ducks infected with influenza A viruses did not show clinical signs of diseases and they produced only low levels of serum antibodies. These results suggested that influenza A viruses have undergone neutral evolution in their natural host population, but clear evidence has yet to be found.
Tajima's D is a statistic that can be used to test whether or not the population structures of target organisms follow the Wright-Fisher model (WF-model) [12][13][14][15]. The WF-model starts from two assumptions. First, the population of target organisms is selectively neutral. Second, the population is constant in size and not subdivided. Using nucleotide sequence data from surveillance studies, Tajima's D can test whether or not these assumptions hold with the population. Tajima's D is often used to analyze genetic variation maintained in a population of organisms, including bacteria and viruses [16,17].
In this study, we analyze host-specific and segment-specific Tajima's D trends of influenza A viruses. To avoid bias from viral population subdivision, we conducted a systematic review of surveillance studies on influenza A viruses of wild mallards, chickens, and humans using nucleotide sequences registered in the database of National Center for Biotechnology Information (NCBI). To our knowledge, this is the first comprehensive Tajima's D study that uses datasets obtained by stratifying NCBI database sequences according to their isolation hosts, sampling sites, and sampling year. To clarify theoretical detectability of influenza outbreaks by Tajima's D, we also conducted computer simulations of viral evolution with changing viral demography and confirmed a clear relationship between Tajima's D and the viral population changes.

Tajima's D
Tajima's D [13] is the normalized difference between two statistics, Watterson's estimator and Tajima's estimator. Watterson's estimator θ w , that is, the expected number of segregating sites between n sequences, is given by The numerator of Eq (1), S n is the observed number of segregating sites, and the denominator of Eq (1) is the expected total length of genealogy of n samples divided by 2 times total population N. Tajima's estimator θ T , which is the average number of nucleotide differences, is given by Here π ij denotes the pairwise difference between the i th sequence and the j th sequence in the samples, and n(n-1)/2 is the total number of pairs in the samples.
Tajima's D is derived by subtracting Watterson's estimator from Tajima's estimator and by normalizing its numerator as follows; From Eq (3), the sample size for Tajima's D have to be larger than three because the denominator of Tajima's D becomes zero.

Systematic review
We downloaded all the database records of influenza A viruses from the GenBank on November 24th 2015 by using the Taxonomy ID of influenza viruses as a search condition, i.e. "txid = 11320". From the retrieved GenBank records, PubMed IDs were collected. Based on the PubMed ID, articles accompanied with more than 100, 300, and 1,000 GenBank sequence records respectively for mallard, chicken, and human viruses were collected. Influenza virus surveillance studies with wild mallards are conducted at a smaller scale than those for chickens and humans. To collect similar numbers of studies, we used these different thresholds on the minim sequence numbers for mallard, chicken, and human. To avoid bias from population subdivision [12,15], the abstract of articles were reviewed, and nucleotide sequences from surveillance studies conducted at a single sampling site from single host species were collected.

Alignment of Sequences and Calculation of Tajima's D
For each surveillance study selected by above criteria, nucleotide sequences of each gene segment were aligned using MAFFT, a multiple sequence alignment program (version 7) [18]. Sequences with a length less than 90% of complete gene were removed from the alignment. These aligned sequences were stratified according to their sampling years. Since Tajima's D requires at least four sequences for its calculation, the datasets having less than four sequences were removed. For each dataset containing nucleotide sequences of the same gene segment of influenza A viruses isolated from the same sampling site in a single year, Tajima's D was computed by a custom program implemented with Python3 (v.3.3.3).

Outbreak Simulation
Viral sequence evolutions in a rapidly expanding population were simulated using Python3 (python scripts are available in S1 Script and S2 Script). We set the length of nucleotide sequences to 500 and mutation rate in the simulated evolution to 10 −6 per base per generation. For each generation, viruses are randomly selected from the previous generation with replacement, and their nucleotide sequences were copied to the offspring in the current generation with mutations. We used equal mutation rates for all nucleotide bases (JC69 model) [19]. The simulation was started with 1,000 viruses with identical nucleotide sequences. During the first 5,000 generations, the population size was fixed to 1,000. In each generation from the 5,000 th to 5,005 th , the population size was doubled. From the 5,005 th generation, the population size was fixed to 32,000 to the end of the simulation. For every 400 generations, 50 viruses were randomly sampled and Tajima's D was calculated from their nucleotide sequences. Totally, 100 simulations were conducted with the same setting, and averages of Tajima's D values were calculated.

Data Retrieval and Sequence Alignment
Using 401,212 GenBank records retrieved from the NCBI database, we identified 1,635 articles published with nucleotide sequences of the influenza A viruses. Among them 19, 17, and 11 articles satisfied our criteria for mallard, chicken and human, respectively (Fig 1). A total of 42,664 nucleotide sequences accompanied with these 47 surveillance articles were used calculating Tajima's D. Table 1 shows the numbers of datasets for each segment and each host after removing dataset having less than four sequences in the alignment. The accession numbers and their nucleotide sequences used in this study can be found in the supplementary information.  Table 2). Medians of Tajima's D for the internal gene segments (PB2, PB1, PA, NP, and M) across datasets were close to zero, and the differences from zero were not significant (p>0.05, 1-sample Wilcoxon signed rank test) (Fig 2  (A)). The mean Tajima's D of the surface protein genes (HA and NA) and non-structural gene segment (NS) was 1.524, 1.769 and 0.657, respectively ( Table 2). Medians of Tajima's D of these gene segments across datasets were significantly positive (p<0.05, 1-sample Wilcoxon signed rank test) (Fig 2(A)).

Tajima's D in non-natural host species
Chicken. Influenza A viruses that were circulating in chickens had an overall mean Tajima Table 2). Medians of Tajima's D values for HA, NP, NA, and MP gene segments across datasets were significantly negative (p<0.05, 1-sample Wilcoxon signed rank test) (Fig 2(B)). There were significant differences in Tajima's D between the wild mallard and chicken except NS gene segment (p<0.05; Two-sample Kolmogorov-Smirnov test).
Human. Influenza A viruses circulating in humans had a mean Tajima Table 2). Medians of Tajima's D for all gene segments were significantly negative (p<0.05, 1-sample Wilcoxon signed rank test) (Fig 2(C)), and there were significant differences in Tajima's D between the wild mallards and humans for all gene segments (p<0.05; Two-sample Kolmogorov-Smirnov test). Tajima's D values for all the dataset can be found in S1 Table.

Outbreak simulation
At the first duration when the viral population size was constant over viral generations, the mean of Tajima's D of 100 simulations was around zero and was within the range of 95% confidence interval for D = 0 (the error distribution was assumed to be beta distribution [13], which agreed with the theory of Tajima's D. After a sudden increase of the viral population, the mean Tajima's D value decreased to -2.052, which is significantly negative (p<0.05; beta distribution). Consequently the mean Tajima's D value increased gradually and returned within the range of 95% confidence interval for D = 0 (Fig 3(B)).

Discussion
In this study, we analyzed host-specific and segment-specific Tajima To analyze the selection on the external genes of influenza A viruses circulating in wild mallards, we compared Tajima's D of the data containing only one subtype with those containing multiple subtypes using dataset from Bahl et al. [20]. Tajima (Table 3). A similar pattern was observed for NA (Table 4). These results suggested that selection within a subtype was neutral or weak purifying selection as observed in other non-natural hosts, on the other hand, selection across subtypes is balancing selection.
The diversity of influenza A viruses circulating wild mallard is much higher than other hosts. This high diversity is not able to be explained by relatively low pathogenicity or low immune response of wild mallard, which is one of the main reasons why wild mallard is considered to be the natural host of influenza A viruses. These factors can explain neutral selection on the viruses, but they cannot explain balancing selection.
Several studies have analyzed the evolutionary dynamics of avian influenza viruses using their nucleotide sequences. Time to the most recent common ancestor (TMRCA) of HA, NA and NS were much older than that of internal gene segments [21], and the result is consistent with our results. The phylogenetic analyses of HA and NA suggested high inter-subtype diversity and low intra-subtype diversity, which were not seen in internal gene segments [22]. The distinct divergence between two alleles of NS suggested balancing selection on NS [22], and this was consistent with our results. The dN/dS ratio-the ratio of the number of non-synonymous substitutions per site to the number of synonymous substitution per sitehad suggested purifying selection on internal gene segments [20], while Tajima's D in our study supported neutral selection. This discrepancy between results from dN/dS and Tajima's D remains as an open question, and one possible explanation for this is that the discrepancy would be attributed to difference between the selection at the lineage level and the selection at the population level.
The negative Tajima's D values observed in the human and chicken viruses rejected the WF-model for these viral populations. These negative Tajima's D values should be attributed to the population increase due to recent outbreaks, purifying selection due to viral adaptation to new hosts, or combined effects of population change and selection. However, Tajima's D Tajima proposed the use of beta-distribution to reject WF-model and to calculate the 95% confidence interval of Tajima's D under the WF-model [13]. Our computer simulations showed that Tajima's D values fell outside 95% confidence interval right after the sudden increase of viral population. Although Simonsen et al. [23] showed that criteria using beta-distribution was too conservative to reject WF-model when neutrality assumption does not hold, our computer simulations suggested that beta-distribution could be used to reject WF-model when population size is rapidly growing. When we have multiple samples independently collected form the population, an alternative approach to reject WF-model is to use 1-sample Wilcoxon signed rank test, as shown in the previous section.
It would be of particular interest to find connection between the Tajima's D of an infectious agent and the effective reproduction number of infectious disease caused by the agent. The effective reproduction number measures the continuance of an outbreak and the expected number of secondary infections. Recent studies have utilized coalescent theory to estimate the time evolution of population size of the ancestors of sampled sequences. By assuming constant-sized population between two coalescence events, Pybus et al. developed a method to estimate the time evolution of population size from their nucleotide sequences [24]. Mathematical models on population dynamics of infectious diseases have been also proposed to characterize infectious disease outbreaks from nucleotide sequences of infectious agents [25,26].

Conclusions
We calculated host-specific and segment specific Tajima's D values of influenza A viruses through a systematic review using viral sequences registered in the NCBI database. Interestingly, sequences encoding external proteins of influenza A viruses showed positive Tajima's D in wild mallards, suggesting the existence of balancing selection, although zero or negative Tajima's D was expected. This result suggests the existence of missing factors other than low immune response or low pathogenicity to maintain the variation of the subtypes circulating in the natural hosts.
(DOC) S1 Dataset. Accession numbers in fasta format that are used in this study.