Archival Isolates Confirm a Single Topotype of West Nile Virus in Australia

West Nile virus is globally wide-spread and causes significant disease in humans and animals. The evolution of West Nile virus Kunjin subtype in Australia (WNVKUN) was investigated using archival samples collected over a period of 50 years. Based on the pattern of fixed amino acid substitutions and time-stamped molecular clock analyses, a single long-term lineage (or topotype) was inferred. This implies that a bottleneck exists such that regional strains eventually die out and are replaced with strains from a single source. This was consistent with current hypotheses regarding the distribution of WNVKUN, whereby the virus is enzootic in northern Australia and is disseminated to southern states by water-birds or mosquitoes after flooding associated with above average rainfall. In addition, two previous amino acid changes associated with pathogenicity, an N-Y-S glycosylation motif in the envelope protein and a phenylalanine at amino acid 653 in the RNA polymerase, were both detected in all isolates collected since the 1980s. Changes primarily occurred due to stochastic drift. One fixed substitution each in NS3 and NS5, subtly changed the chemical environment of important functional groups, and may be involved in fine-tuning RNA synthesis. Understanding these evolutionary changes will help us to better understand events such as the emergence of the virulent strain in 2011.


Introduction
West Nile virus (Family Flaviviridae; Genus Flavivirus) is a mosquito-borne virus which can cause disease in humans, horses and birds. In humans, the disease can manifest as a fever with rash and, infrequently, neurological disease including meningitis, encephalitis and acute flaccid paralysis [1]. Introduction of West Nile virus into the Americas in 1999 resulted in tens of thousands of cases of human disease causing over 1,500 deaths, and has devastated bird populations [2]. The virus is now globally distributed, with the exception of Antarctica [3]. There are 7 lineages of which two comprise the majority of circulating strains [4]. In Australia, a strain of West Nile virus called Kunjin virus (WNV KUN ) is endemic [5]. It has historically been considered more benign than strains circulating globally, but can cause rare cases of non-fatal encephalitis in humans [6]. WNV KUN has been included in Lineage 1b [3], and a single Australian topotype was previously inferred on the basis of oligonucleotide fingerprinting [7].
In 2011, the largest outbreak of equine encephalitis in over 30 years occurred in south-eastern Australia. Over 1,000 horses were infected, with a case fatality rate of 10-15% [8]. Surprisingly, only one case of human disease was reported, despite widespread virus activity [9]. The major aetiological agent of the outbreak, WNV NSW2011 , was subsequently shown to be neuroinvasive, but was less virulent than the highly pathogenic North American WNV NY99 strain in mouse models [10]. Hence, this unprecedented outbreak was an opportunity to explore the drivers causing an apparently benign strain to become virulent in horses. Sequencing revealed that WNV NSW2011 carried two known virulence markers that were not present in the prototype WNV KUN strain isolated in 1960 [10]. One was a glycosylation site at position 154 of the envelope. A second was a phenylalanine at amino acid 653 in the RNA polymerase protein encoded by the NS5 gene. Further sequencing revealed that all recent strains isolated, with the exception of the prototype strains, carry both these virulence markers [11], and an additional 4 neuroinvasive strains that had been isolated in the previous two decades were identified. These genotypic changes may be fully or partially responsible for the observed increases in virulence. However, the drivers of the 2011 outbreak still remain unclear and are probably a result of the complex interplay between known, and possibly undetermined, viral genetic virulence determinants; increased fitness in the major WNV KUN vector Cx. annulirostris; or a convergence of high mosquito and waterbird populations associated with the widespread La Niña driven rain event [11,12].
Only the original 1960 prototype strains and a handful of coding-complete WNV KUN isolates from the 1980s until the present are currently available in GenBank. As most of the amino acid substitutions of the known determinants of increased neurovirulence of recent WNV KUN strains occurred at least before 1984, there is inadequate sequencing of strains from the early 1960s until that time. In order to better understand the emergence of the neurovirulent strain responsible for the 2011 equine epidemic, and the evolution of the virus, additional sequence data is required. We accessed archival material collected over 5 decades and performed next generation sequencing to generate the data required to discriminate WNV KUN strains at a significantly higher resolution than previous analyses. Changes in the envelope protein were observed, consistent with immune pressure, and also in key replicative proteins which would subtly change the chemical environment of active sites. The analysis revealed a stable genome with a single lineage which was subjected to mostly purifying selection.

Virus collection, isolation and culture
Archival material in the QIMR Berghofer Medical Research Institute collection was obtained over the period from 1960-2011, and was isolated from mosquitoes collected from the Australian states of Queensland, New South Wales, and South Australia (S1 Table). Isolates were stored frozen at -80˚C until required. The isolates were passaged twice in C636 Aedes albopictus cells prior to sequencing.

Viral genome sequencing and assembly
Tissue culture supernatants of cultured isolates were processed (excluding the ultracentrifugation step), RNA was extracted, and nucleic acids pre-amplified using sequence-independent amplification (SIA) as described previously [13]. High throughput sequencing was performed on one of two platforms. Library construction and sequencing on the Ion Torrent Personal Genome Machine (PGM™; Thermo Fisher Scientific) was as described previously [13]. The SIA reaction products were subjected to tagmentation, indexing and library amplification following the manufacturer's recommendations as described in the Nextera XT kit (Illumina). Sequencing was performed on an Illumina HiSeq2500, at the Australian Genome Research Facility. At least 18 million 125 bp paired-end reads (i.e. 9 million pairs) were obtained for each sample. A virus consensus sequence was obtained using Geneious R8 and MRM61C (accession number D00246) as a reference sequence.

Bioinformatic analysis
Nucleotide sequences were aligned using the MAFFT plugin (algorithm, FFT-NS-ix2; scoring matrix, 200 PAM/k = 2; gap open penalty, 1.53; offset value, 0.123) in the Geneious R8 package [14]. For comparison, an amino acid substitution was defined as a change that was different from at least two separate prior isolates. It was considered fixed when it occurred at least two subsequent times, at least once in a subsequent year and different sampling location. The codon adaptation index (CAI) and expected CAI (eCAI) were calculated as previously described using the server available at http://genomes.urv.es/CAIcal/. For selection analysis, single likelihood ancestor counting (SLAC) and internal fixed effects likelihood (IFEL) analyses were performed as previously described with an optimized substitution model, using the server available at http://www.datamonkey.org/. The data were checked first for evidence of recombination using RDP v4.56 [15]. Unless otherwise indicated, statistical tests used were those incorporated into the software.
For phylogenetic analysis, the optimal substitution model was determined using jModelTest [16] for 88 different models. Using the optimized general time reversible model with invariant sites (GTR+I), a phylogenetic tree was constructed using maximum likelihood in MEGA v7.0.14 [17] with 1,000 bootstrap replicates. The tree was rooted by setting WNV NY99 as the out-group. For molecular clock analysis with time-stamped data, the BEAST v1.8.1 package [18] was used with a log-normal relaxed clock. A chain length of 10,000,000 was run, sampling every 1,000. The best supported tree was obtained using the utility TreeAnnotator and a burnin of 1,000. The optimal tree was graphically modified in TreeGraph2 [19].

Analysis of virulence determinants and fixed amino acid substitutions
Data from 31 isolates were assembled and the consensus sequences aligned with sequence data from other strains available in GenBank. The alignment revealed a remarkable degree of conservation for an RNA virus over the 50 year collection period. Interestingly, the alignment revealed that the substitutions that were previously shown to be associated with virulence [11] had been present for many years. The phenylalanine at 653 of the RNA polymerase, which corresponds to the antagonist of type I interferon-mediated JAK-STAT pathway, was present in all strains with the exception of the MRM61 prototype and its derivatives. The N-Y-S glycosylation motif, present at amino acids 154-156 of the envelope protein, which first appeared in the 1960s, was partially fixed in the sampled population by the 1970s, and completely fixed in all isolates by the 1990s. Therefore, this change probably went to fixation sometime in the 1980s. This conclusion was supported by previous sequencing of the envelope protein of isolates from this decade (J.H. Scherret, unpublished data). Hence, the two known changes from the 1960 prototype strain were mostly likely circulating for at least two decades prior to the 2011 equine outbreak.
There were a large number of amino acid substitutions apparent in the alignment that occurred in single isolates, or a small number of isolates that were collected around the same time, but that failed to go to fixation in the population. These substitutions may be biologically significant for individual strains circulating in a region for a certain period of time. However, the substitutions that go to fixation are most relevant to the understanding of the long-term evolution of the virus. Hence, further analysis will focus on such substitutions.
There were 35 sites where fixed amino acid substitutions occurred in the genome of WNV KUN over the course of 5 decades (Fig 1A; S2 Table). Of note, is the fact that once fixed, these substitutions neither reverted nor were they replaced, and were then permanently maintained in the future virus population regardless of their origin in Australia. For example, the I49V substitution in NS5, previously suggested as an evolutionary marker [11], was identified as one such fixed substitution. This characteristic maintenance of fixed substitutions is indicative of a single long-term lineage of WNV KUN in Australia. The only exception to this observation occurred in two sites of the core protein. Firstly, a S11N substitution that occurred in the 1960s reverted back in the 2000s. Secondly, amino acid 114 of the core protein was substituted twice, once in each of the 1970s and 1990s.
The fixed substitutions were distributed across the genome's single large open reading frame (ORF) (Fig 1A). There was a minimum of one and a maximum of 6 fixed substitutions per processed viral protein. Changes were not evenly distributed over time (P < 0.05; χ 2 test), with twice as many fixed substitutions occurring in the 1960s than the other decades. This indicated that the period was a hot-spot for non-synonymous nucleotide change. There was some indication of temporal clustering, with all 6 of 7 fixed substitutions in the surface proteins occurring during the 1960s and 70s, and the two fixed substitutions in NS2B occurring during the 2000s (Fig 1B). The fixed amino acid substitution density was lowest for both NS3 and NS5 in comparison with all the other viral proteins (Fig 1C). Both these proteins are critical for RNA synthesis. Therefore, this may indicate some constraint on the replicative machinery which limits the number of possible changes to the protein structure and function before viability is compromised.

Phylogenetic and molecular clock analyses
A root-to-tip divergence analysis was conducted to determine whether a molecular clock was applicable to the WNV KUN time-stamped sequence data. A strict linear correlation was found between the age of the isolates and the root-to-tip distance (Pearson correlation coefficient r = 0.97, P < 0.001; coefficient of correlation R 2 = 0.93) indicating a molecular clock could be applied (Fig 2). A phylogenetic tree was constructed using BEAST (Fig 2) with an optimized substitution model (GTR+I). This tree morphology was very similar to a tree constructed using the maximum likelihood model with the same data that had not been time-stamped (S1 Fig). Consistent with the conclusions reached from observing the pattern of fixed substitutions, a single lineage was apparent. Multiple clades were present when multiple isolates were collected at a single time point and location. For example, at both Mitchell River Mission in the 1960s and Charleville in the 1970s there were strains that could be grouped in multiple clades. However, after several years, these minor lineages died out to be replaced by the single main lineage. This result, in combination with the analysis of fixed substitutions, strongly suggested a single source or focus from which this lineage is continuously replenished. This may be either a physical and environmentally favourable site, permissive of virus replication; or alternatively, a climatic or seasonal restriction resulting in a bottleneck effect. Finally, from this analysis a rate of 6.924×10 −4 nucleotide substitutions/site/year was calculated which was comparable to values calculated for West Nile virus [3], and other flaviviruses such as dengue [20], over longer genetic distances.
Rates of synonymous (dS) to non-synonymous (dN) change were calculated across the WNV KUN genome to determine selection pressure. Firstly, the sequences were checked for evidence of recombination using RDP software. The isolate SA2011 was determined to be a possible recombinant (P < 0.05 for 6 different models), and was therefore omitted to prevent a problem with the selection analysis. For the other included sequences, the SLAC method detected 9 codons undergoing significant (P < 0.05) negative selection (Fig 3A). An alternative method called IFEL, which detects selection along internal branches, detected 20 codons also undergoing negative selection (Fig 3B). Only one codon (codon 231 of envelope) was determined to be under positive selection (P < 0.05). However, as this was an asparagine to serine substitution that did not go to fixation, its significance is unclear. Hence, with the exception of possibly one codon in the envelope protein, purifying selection is the major selection pressure being exerted on WNV KUN .

Codon adaptation and the life-cycle of WNV KUN
To what extent adaptation to the host's codon usage influences the life-cycle of WNV KUN is unknown. To explore this question, a measure of codon usage referred as the codon adaptation index (CAI) was used [21]. CAI was calculated for WNV KUN using mosquito (Culex tritaeniorhynchus), human (Homo sapiens), horse (Equus caballus) and bird (Phalacrocorax carbo) codon usage data sets ( Table 1). The CAI values between the mosquito and all of the vertebrates (human, horse and bird) were significantly different (P < 0.001). Hence, there was a codon usage bias towards vertebrate hosts in comparison with the mosquito host.
To establish whether the observed usage bias was due to codon preference or simply a function of nucleotide composition, an expected value of CAI (eCAI) was determined [22]. This value was calculated using the nucleotide and amino acid frequencies for 500 random sequences at a pre-determined probability (P < 0.05), and a ratio of CAI:eCAI greater than one indicated a preference for codon usage. All eCAI values were subjected to a Kolmogorov- Smirnov test and were found to follow a normal distribution. Mosquito CAI/eCAI was not significant (0.93), indicating that usage bias was most likely due to nucleotide composition in that case. CAI:eCAI ratios were for the vertebrates were borderline for significance (0.99) and so it was difficult to be conclusive in that case, but may indicate a codon preference with vertebrates. This result compares with a recent study looking at codon preferences in 449 WNV strains with a clear codon preference towards human usage [23]. The inability to find a clear codon preference in our data may be due to sequence differences between global WNV strains, and WNV KUN strains circulating in Australia.

Changes in the envelope protein
To gain a better understanding of how fixed amino acid substitutions affected the function of the virus, these changes were related to known WNV protein structures at atomic resolution, where this information was available. For the envelope protein, there were 3 fixed substitutions that could be placed within its structure. The envelope protein is the main viral surface protein and contains the majority of the B cell epitopes of WNV [24]. A K310R substitution in envelope (Fig 4) was within the lateral ridge of domain III (DIII-lr) which contains the major virus neutralization epitope [24,25]. Another two fixed substitutions that could be placed within the envelope structure (K44R and F156S) were both within domain I, proximal to the interface between domains I and III. The substitution at amino acid 156 is part of the glycosylation motif associated with increased virulence [10]. All of these fixed substitutions were surface accessible, consistent with changes due to selection pressure from humoural immune responses. There was a fourth L483F fixed substitution in the transmembrane domain. However, this is not shown on the figure as it was not included in the structure when this was originally determined. Finally, the positively-selected amino acid at 231 is also shown (Fig 4). It is a relatively conserved residue within the genus Flavivirus, suggesting it may have an important biological function.

Structural and functional changes in the non-structural proteins
The non-structural proteins are responsible for replication of the virus. To investigate their evolution in WNV KUN , the positions of fixed amino acid substitutions were identified in the structures of viral proteins NS1, NS3, and NS5 (S2 and S3 Figs; Figs 5 and 6), as atomic scale information was also available for these proteins. NS3 contains a viral protease and a helicase which facilitates RNA synthesis (Fig 5). There were two substitutions that could be determined within the dimeric structure of NS3: one was K382R and the other was N465S. Amino acid 465 showed a relatively high degree of sequence conservation, suggesting an important functional role. It is immediately adjacent to domain VI of the helicase/NTPase [26]. The substitution was a change from an asparagine to a serine group. Whilst this was a conservative amino acid change, it may have resulted in a subtle shift in the environment surrounding the helicase catalytic groups. The substitution at amino acid 382 was at a variable site, is presumably due to stochastic drift and, therefore, of probably low biological significance. The NS5 protein is crucial for RNA synthesis, and has two important functional groups: a methyltransferase activity (MTase) responsible for transcript capping (Fig 6), and the RNA polymerase (S2 Fig). A V183I fixed substitution, buried within the protein structure, was immediately adjacent to a residue in the methyltransferase catalytic tetrad (K 61 -D 146 -K 182 -E 216 ) [27]. The functional domain of which this residue forms a part is important, as K 182 directly participates in deprotonation of the 2'-OH group in the ribose. A K182A mutant is attenuated in cell culture further demonstrating the domain's importance [27]. This valine to isoleucine fixed substitution in WNV KUN at amino acid 183 was another example of a conservative amino acid change, resulting in a subtle change in the chemical environment of the catalytic site. In addition, there were two surface accessible substitutions: one I49V and the other R101K. Both of these were either variable or only slightly conserved, and so were most likely due to stochastic drift. The changes in NS1 and the polymerase domain of NS5 were not near important functional residues, so their structures are provided as supplemental data only.

Discussion
The most remarkable finding of this work is the stability of the genome over a period greater than 50 years, consistent with other recent observations [28,29]. Over that period, the picture for West Nile virus in Australia is one of overwhelming purifying selection. There have been a maximum total of 6 fixed substitutions, and in most cases less, for individual viral proteins over that time. Of the changes that have occurred, most are in regions of variable sequence, implying stochastic drift; or where they are in possible functional domains, the changes are conservative in nature, resulting in minor changes at most. From these results, it was hard to deduce any change, or changes, that occurred which have led to enhanced virulence in some strains. Two substitutions were identified previously that were linked to potential enhanced virulence [10]. To better understand the emergence of virulent strains in this study, fixed substitutions were related to protein structure where structural information was available. There will be many other regions left unexplored by this analysis, and therefore, there is much potential to explore this issue further. However, this question is complicated by the possibility that virulence may be related to a single change, or a complex interaction involving changes at multiple sites, making it difficult to reach definitive conclusions.
Our analysis indicates that both of the previously mentioned changes linked to virulence may have been present for over two decades. In particular, the phenylalanine at amino acid 653 of the RNA polymerase was present in all strains except the prototype and its derivatives. A second substitution resulted in the N-Y-S glycosylation motif, and a previous study indicated that most recent WNV KUN strains are indeed glycosylated [11]. As discussed above, this may indicate a trend to glycosylation at this site in recent decades. However, it has been suggested that the absence of the glycosylation site in the prototype and related strains are due to sequence changes occurring during growth in cell culture [9]. Alternatively then, the possibility that the lack of a glycosylation motif in the isolates prior to the 1970s may be a consequence of growth in cell culture cannot be eliminated, without directly sequencing the original material. Unfortunately, due to the length of time since the original collections, this material is no longer available.
RNase-based oligonucleotide fingerprinting over 26 years ago indicated a single topotype for WNV KUN across Australia [7]. The high resolution genome sequencing in this study seems to concur with that early genome analysis. Both the nature of accumulation of fixed substitutions and time-stamped molecular clock analysis strongly suggests a single lineage, and seems to be independent of the site of virus collection. However, due to the sporadic and regional nature of collection, sampling occurred only at one site per year. In order to be conclusive about the observation of a single topotype, multiple samples will have to be obtained at relevant sites around Australia, preferably over multiple years. Complicating this picture, mosquito surveillance suggests circulation of strains different from the main lineage at a low frequency in enzootic regions [11].
Given the probability of a single topotype, it is interesting to speculate on the implications for the virus's life-cycle and its dissemination. The maintenance of a single long-term lineage requires disseminated strains to eventually die out, to be replaced by another strain which is introduced from a single source. This could be either a location where virus replication occurs in the vertebrate host reservoir or mosquito vector, or alternatively, a local seasonal or climatic restriction which acts as a bottleneck on growth of the virus. In the longer term, these situations would prohibit the circulation of regional strains, and the seeding by a single source. This conclusion from the analysis is in agreement with current hypotheses regarding the epidemiology of WNV KUN . The virus is thought to be enzootic in a bird-mosquito transmission cycle in northern Australia, and is disseminated to the southern states after periods of above average rainfall [30]. Hence, the life-cycle presents opportunities for bottlenecks due to both location and climate, and for dissemination.
There is one further implication of a single lineage for WNV KUN in Australia. The 2011 equine outbreak was associated with a more virulent strain of the virus [10], which led to concerns that this strain may establish and also cause serious human disease in the future. If in the long-term WNV KUN strains die out and are re-seeded from a single source, then more virulent local strains may only circulate for several years before being replaced. WNV KUN is enzootic in Western Australia (WA), and most active in the Pilbara region [31]. Isolates collected in WA are less virulent than those associated with the 2011 outbreak [11]. Hence, it is possible that the trend in southern states of Australia may be toward less virulence given sufficient time to replace regional strains. showing fixed substitution residues (space-filled residues: blue, amino acid 560; and red, amino acid 719). The structure rotated 90˚toward the viewer is also shown (lower). (B) Space filled model with residues coloured according to flavivirus sequence conservation level with corresponding residues indicated. (TIF) S1