Conservation and Variability of West Nile Virus Proteins

West Nile virus (WNV) has emerged globally as an increasingly important pathogen for humans and domestic animals. Studies of the evolutionary diversity of the virus over its known history will help to elucidate conserved sites, and characterize their correspondence to other pathogens and their relevance to the immune system. We describe a large-scale analysis of the entire WNV proteome, aimed at identifying and characterizing evolutionarily conserved amino acid sequences. This study, which used 2,746 WNV protein sequences collected from the NCBI GenPept database, focused on analysis of peptides of length 9 amino acids or more, which are immunologically relevant as potential T-cell epitopes. Entropy-based analysis of the diversity of WNV sequences revealed the presence of numerous evolutionarily stable nonamer positions across the proteome (entropy value of ≤1). The representation (frequency) of nonamers variant to the predominant peptide at these stable positions was generally low (≤10% of the WNV sequences analyzed). Eighty-eight fragments of length 9–29 amino acids, representing ∼34% of the WNV polyprotein length, were identified to be identical and evolutionarily stable in all analyzed WNV sequences. Of the 88 completely conserved sequences, 67 are also present in other flaviviruses, and several have been associated with the functional and structural properties of viral proteins. Immunoinformatic analysis revealed that the majority (78/88) of conserved sequences are potentially immunogenic, while 44 contained experimentally confirmed human T-cell epitopes. This study identified a comprehensive catalogue of completely conserved WNV sequences, many of which are shared by other flaviviruses, and the majority of which are potential epitopes.
The complete conservation of these immunologically relevant sequences through the entire recorded WNV history suggests they will be valuable as components of peptide-specific vaccines or other therapeutic applications, for sequence-specific diagnosis of a wide range of Flavivirus infections, and for studies of homologous sequences among other flaviviruses.


Introduction
West Nile virus (WNV) is a mosquito-borne pathogen of the family Flaviviridae, genus Flavivirus, closely related to other important human pathogens, such as yellow fever (YFV), Japanese encephalitis (JEV), and dengue (DENV) viruses, among others. The genome is a single-stranded positive-sense RNA encoding a polyprotein precursor of approximately 3,430 amino acids, which is cleaved into three structural (capsid, C; precursor membrane and membrane, prM/M; envelope, E) and seven nonstructural proteins (NS1, 2a, 2b, 3, 4a, 4b and 5) [1,2]. WNV is present predominantly amongst avian hosts, and can infect humans through incidental zoonotic transmission via mosquitoes [3]. The virus is endemic in many parts of Africa, Asia, Europe, and most recently North America [4]. At present, there is no registered human vaccine or specific therapy to prevent or treat WNV infection. Although the majority of infected humans remain asymptomatic, about 20% experience influenza-like symptoms, and approximately 1 in 150 develop severe illness, including meningoencephalitis [5,6].
Like other RNA viruses, WNV exhibits significant genetic diversity, mainly a consequence of high mutation rates during RNA replication and the subsequent selection of mutants adapted to a changing environment [7,8]. Five distinct genotypes have been identified by phylogenetic analyses of the C-prM-E region, differing from each other by 20-25% across the complete genome [9]. However, variability is uneven across the viral genome, since mutations detrimental to viral fitness are restricted. Thus, while certain protein sites permit multiple mutations, sites critical to viral structure-function are evolutionarily robust and highly conserved. The analysis of the evolutionary dynamics and immunogenic properties of these sites has relevance to multiple applications, including the design of diagnostics, drugs and vaccines.
The focus of this study is to identify and characterize WNV protein regions that have exhibited strong conservation throughout the recorded history of the virus, and that are potential targets of T-cell immune responses. T-cell responses have been implicated in the control and clearance of WNV infection [10][11][12][13][14][15]. Their specificities are governed by human leukocyte antigen (HLA) binding to peptides, derived from proteolysis of antigen (Ag) proteins, for presentation to T cells. HLA class I and class II molecules present peptides to CD8+ and CD4+ T cells, respectively, which play a critical role in cytotoxic responses and in the induction and maintenance of Ag-specific memory responses.
A bioinformatics approach is applied herein to (a) examine the large number of WNV sequences available in public databases, (b) analyze the conservation and variability of these sequences, (c) identify sequence fragments of WNV proteins that are completely conserved in all known WNVs (henceforth referred to as pan-WNV sequences), (d) examine the structure-function relationship and distribution in nature of pan-WNV sequences, and (e) assess the immune relevance of pan-WNV sequences as potential T-cell epitopes, correlating immunoinformatic predictions to previously reported human WNV T-cell epitopes and to our current studies in the identification of human WNV T-cell epitopes by use of HLA transgenic mice.

Methodology overview
The overall bioinformatics approach to this study is summarized in Figure 1. The rationale for this approach and methodology is consistent with that of previous studies [16][17][18], and is briefly described below.

Data preparation, selection and alignment
WNV sequence records were retrieved from the NCBI Entrez protein database [19] in June 2007 by searching the NCBI taxonomy browser for WNV (taxonomy ID 11082).
The sequences of the 10 WNV proteins (C, prM, E, NS1, NS2a, NS2b, NS3, NS4a, NS4b and NS5) were extracted from the downloaded records by performing BLAST [20] against the record sequences, using as queries the individual protein sequences obtained from the annotated WNV reference record P06935 (Swiss-Prot/TrEMBL [21]). Multiple sequence alignment of extracted sequences for each viral protein was performed by use of MUSCLE v3.6 [22]. All multiple sequence alignments were manually inspected and corrected wherever necessary. All extracted WNV protein sequences, whether partial or full-length, were included in all analyses, unless otherwise indicated in the sections below.
In large-scale proteomic analyses such as this study, bias may result from the collection of redundant sequences, derived from identical or highly similar WNV isolates sequenced by surveillance programs. We retained the duplicate sequences (2,206) for our analysis because they reflect the incidence of the corresponding WNV isolates in nature and, further, they do not affect the identification of WNV sequences that are completely (100%) conserved. As for the highly similar sequences, which may have been generated from large sequencing projects during single outbreaks, their removal was deemed undesirable, since such arbitrary selection would introduce additional bias.

Amino acid difference between WNV protein sequences
Pair-wise percentage amino acid difference for the full-length unique sequences of each WNV protein was computed by use of ClustalW 1.83 [23] with default parameters. This was done to survey the extent of amino acid variation in the WNV data of 2007.
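As an illustration of what a pair-wise percentage difference measures, the following minimal Python sketch counts mismatched columns between already-aligned sequences. It is not ClustalW's own scoring procedure; the function names, the gap handling and the toy sequences are assumptions made for this example only.

```python
from itertools import combinations


def percent_difference(a: str, b: str, gap: str = "-") -> float:
    """Percent of aligned, gap-free columns at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    # Assumption: columns with a gap in either sequence are not compared.
    compared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not compared:
        return 0.0
    mismatches = sum(1 for x, y in compared if x != y)
    return 100.0 * mismatches / len(compared)


def max_pairwise_difference(seqs) -> float:
    """Largest pair-wise difference among the unique sequences of a set."""
    unique = sorted(set(seqs))  # duplicates are collapsed, as in the survey
    return max(
        (percent_difference(a, b) for a, b in combinations(unique, 2)),
        default=0.0,
    )


# Toy aligned sequences (hypothetical, not WNV data).
toy = ["MKTAY", "MKSAY", "MKTAY", "MRSAY"]
print(max_pairwise_difference(toy))
```

The maximum over all unique pairs corresponds to the "up to N%" style of figure reported per protein.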

Nonamer entropy analysis of WNV sequences
Entropy analysis [24,25] was carried out as described in [17], by use of the Antigenic Variability Analyser tool (AVANA), to study the diversity of WNV protein sequences over the period in which the sequences were collected. Entropy measurements were based on 9 amino acid peptides (nonamers), since this is the typical length of epitopes that bind to HLA class I molecules, and of the binding cores of HLA class II epitopes [26]. At any given position x in the alignment, nonamer peptide entropy H(x) was calculated by

$$H(x) = -\sum_{i=1}^{n(x)} p_{i,x} \log_2 p_{i,x}$$

where p_{i,x} represents the measured probability of a particular nonamer peptide i occurring with its center position at x, and n(x) represents the total number of distinct peptides observed at position x; a larger number of peptides generally results in increased entropy values. Positions with highly dominant (conserved) peptides are characterized by low entropy values, approaching zero. In the case of incomplete sequences, only sequences with a valid nonamer centered at position x were included in the computation, and positions where more than 50% of sequences contained gaps were discarded for their low statistical significance. Since peptide entropy is computed at a nonamer's center position, the first and last four positions in each protein alignment are not assigned peptide entropy values. For sequence sets of finite size, entropy calculations are affected by size bias, which was corrected by a statistical sub-sampling method, as described in [17].
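The nonamer entropy at a single alignment position can be sketched in Python as follows. This is a minimal illustration of the Shannon entropy calculation described above, not the AVANA implementation: the function name, the 0-based center index and the toy peptides are assumptions, and the sub-sampling correction for size bias is not included.

```python
import math
from collections import Counter


def nonamer_entropy(aligned_seqs, center, k=9, gap="-", min_valid_fraction=0.5):
    """Shannon entropy (bits) of the k-mers centered at a 0-based column.

    Returns None when the window runs off the alignment or when fewer than
    min_valid_fraction of the sequences have a gap-free k-mer there.
    """
    half = k // 2
    start, end = center - half, center + half + 1
    peptides = []
    for seq in aligned_seqs:
        if start < 0 or end > len(seq):
            continue  # first/last four positions have no centered nonamer
        peptide = seq[start:end]
        if gap in peptide:
            continue  # incomplete sequence at this position
        peptides.append(peptide)
    if not peptides or len(peptides) < min_valid_fraction * len(aligned_seqs):
        return None
    total = len(peptides)
    counts = Counter(peptides)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Toy alignment (hypothetical peptides): two nonamers, equally frequent.
toy = ["ACDEFGHIK", "ACDEFGHIK", "ACDEFGHIR", "ACDEFGHIR"]
print(nonamer_entropy(toy, center=4))
```

A fully conserved position gives an entropy of zero, and two equally frequent nonamers give exactly one bit, which is the threshold used here to describe low to moderate entropy.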

Nonamer variant analysis of WNV sequences
Variable amino acid sites were further analyzed by computing the representation of nonamer variants. At any given position x, variant nonamers were defined as nonamers that differed by at least one amino acid from the predominant nonamer (the most common peptide) at that position. Further details of this analysis are given in [17].
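The variant representation at one position can be sketched in Python as below. The function name and the toy peptides are hypothetical; the input is assumed to be the list of valid nonamers observed at a single alignment position, one per sequence.

```python
from collections import Counter


def variant_representation(peptides):
    """Return (predominant nonamer, percent of sequences carrying a variant).

    Ties for the predominant peptide are broken by first occurrence,
    which is an arbitrary choice made for this sketch.
    """
    if not peptides:
        raise ValueError("no peptides observed at this position")
    counts = Counter(peptides)
    predominant, n_predominant = counts.most_common(1)[0]
    variant_percent = 100.0 * (len(peptides) - n_predominant) / len(peptides)
    return predominant, variant_percent


# Toy data: nine copies of one nonamer and a single variant.
toy = ["ACDEFGHIK"] * 9 + ["ACDEFGHIR"]
print(variant_representation(toy))
```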

Identification of completely conserved WNV sequences
Completely conserved sequences (pan-WNV sequences) of at least 9 amino acids and fully identical in all the sequences analyzed (100% representation) were identified from the multiple sequence alignment of each protein. Peptides containing unknown residues (X) were ignored.
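One way to picture this step is to scan the alignment for runs of fully conserved columns and report runs of at least 9 residues. The Python sketch below is an assumption-laden illustration, not the procedure actually used: it assumes every row spans the full alignment length (partial sequences would need separate handling), and the function name and toy sequences are hypothetical.

```python
def conserved_fragments(aligned_seqs, min_len=9, gap="-", unknown="X"):
    """Return (start, end, sequence) for runs of identical columns.

    Positions are 1-based and inclusive. Columns containing a gap or an
    unknown residue (X) are treated as not conserved.
    """
    length = len(aligned_seqs[0])
    fragments = []
    run_start = None
    # Iterate one column past the end so a trailing run is flushed.
    for col in range(length + 1):
        conserved = False
        if col < length:
            residues = {seq[col] for seq in aligned_seqs}
            conserved = len(residues) == 1 and not residues & {gap, unknown}
        if conserved:
            if run_start is None:
                run_start = col
        else:
            if run_start is not None and col - run_start >= min_len:
                fragments.append(
                    (run_start + 1, col, aligned_seqs[0][run_start:col])
                )
            run_start = None
    return fragments


# Toy alignment (hypothetical): the two rows differ only at position 12.
toy = ["MKTAYIAKQRQISFVK", "MKTAYIAKQRQLSFVK"]
print(conserved_fragments(toy))
```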

Structure-function analysis of pan-WNV sequences
A literature search was conducted to identify reported and putative functional properties of the pan-WNV sequences, including a search of the Prosite database [27] using ScanProsite [28], and a search of the Pfam database [29]. The conserved sequences were also mapped onto the three-dimensional (3-D) structures of WNV proteins whenever these were available in the protein data bank (PDB) [30]. Only 3-D structures obtained via X-ray diffraction were utilized for mapping, and were visualized by use of the CPK representation in the ICM Browser v3.5 (www.molsoft.com).

Identification of pan-WNV sequences common to other viruses and organisms
Pan-WNV sequences with at least 9 consecutive amino acids in common with other viruses were identified by performing BLAST search against protein sequences of all viruses (txid10239) reported at NCBI (as of August 2007), except WNV (txid11082): parameters included search by Entrez query limited to "txid10239[Organism:exp] NOT txid11082[Organism:exp]", "automatically adjust parameters for short sequences" option disabled, "low-complexity" filter disabled, maximum number of aligned sequences to be displayed set to "20,000", expect threshold set to "200,000", or "20,000", or "2,000" until a valid result was obtained, word size set to "2", matrix set to "PAM30", gap costs set to "Existence: 9, Extension: 1", compositional adjustments set to "no adjustment". Similar BLAST searches were carried out against protein sequences of all organisms, except viruses: parameters were the same as the previous search against all viruses.
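For readers who wish to reproduce a comparable search from the command line, the settings listed above map approximately onto options of the NCBI BLAST+ `blastp` program. The Python sketch below only assembles the argument list and does not run anything. The mapping from the 2007 web-form settings to BLAST+ flags is an assumption of this example and has not been validated against the original searches; the query file name is hypothetical.

```python
# Expect thresholds tried in turn until a valid result is obtained.
EVALUE_LADDER = (200000, 20000, 2000)

# Entrez query restricting hits to viruses other than WNV.
ENTREZ_QUERY = "txid10239[Organism:exp] NOT txid11082[Organism:exp]"


def blastp_args(query_fasta, evalue):
    """Build a blastp argument list approximating the listed settings."""
    return [
        "blastp",
        "-remote", "-db", "nr",            # assumption: remote search of nr
        "-query", query_fasta,
        "-task", "blastp",                 # no short-sequence auto-adjustment
        "-entrez_query", ENTREZ_QUERY,
        "-seg", "no",                      # low-complexity filter disabled
        "-max_target_seqs", "20000",
        "-evalue", str(evalue),
        "-word_size", "2",
        "-matrix", "PAM30",
        "-gapopen", "9", "-gapextend", "1",
        "-comp_based_stats", "0",          # no compositional adjustment
    ]


# "pan_wnv.fasta" is a hypothetical file of pan-WNV sequences.
for evalue in EVALUE_LADDER:
    print(" ".join(blastp_args("pan_wnv.fasta", evalue)))
```

The resulting list could be passed to `subprocess.run`, moving down the expect-threshold ladder until a search returns a valid result.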

Identification of known and predicted WNV HLA supertype binding epitopes
A literature search and a search of the Immune Epitope Database [31] (www.immuneepitope.org) identified previously reported HLA class I and II human T-cell epitopes of WNV that overlapped at least 9 consecutive amino acids of pan-WNV sequences. In addition, four prediction models were used to identify candidate WNV sequences that bind to multiple HLA class I or II supertype alleles. Putative HLA class I supertype-restricted peptides were predicted by use of NetCTL [32], Multipred [33] and ARB [34], and class II supertype-restricted peptides by use of Multipred and TEPITOPE [35], following the specifications described in [17].
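The overlap criterion (at least 9 consecutive amino acids shared between a reported epitope and a pan-WNV sequence) amounts to testing whether two peptides have a nonamer in common. A minimal Python sketch, with a hypothetical function name and made-up peptides rather than real WNV sequences:

```python
def shared_kmers(a, b, k=9):
    """Return the sorted k-mers that occur in both peptide sequences."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[j:j + k] for j in range(len(b) - k + 1)}
    return sorted(kmers_a & kmers_b)


# Made-up sequences that share the single nonamer ACDEFGHIK.
conserved = "GGGACDEFGHIKLLL"
epitope = "TTACDEFGHIKWW"
print(shared_kmers(conserved, epitope))
```

An empty result means the two peptides share no stretch of 9 consecutive residues, including the case where either peptide is shorter than 9 residues.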

WNV protein sequence datasets
A total of 2,746 complete and partial WNV protein sequences were extracted from the NCBI Entrez protein database records as of June 2007 (Table 1; Data Set S1). The large number of sequences analyzed and their wide spatial and temporal distribution (based on information available in annotated NCBI records) enabled a broad survey of WNV protein diversity in nature. The distribution of these sequences varied considerably among the different proteins (from 141 NS4b sequences to 927 E sequences). Comparisons of amino acid variation between the full-length unique sequences of the 10 WNV proteins showed that C had the highest range of amino acid differences across the sequences (up to 23%), while NS4b had the lowest (up to 8%) (Table 1).

Evolutionary stability of WNV
The evolutionary diversity of WNV was studied by computing entropy as described in the Methods. The entropy plot revealed the evolutionary variability of nonamer sequences across the WNV proteome (Figure 2). The vast majority of nonamer positions exhibited low to moderate entropy (≤1.0), indicating a low probability of mutations occurring over time. Many regions had zero entropy, signifying no change throughout the recorded history of the virus. Peak or near peak entropy values (∼2) were observed in the E, NS4a and NS5 proteins. The NS5 protein,

Representation of variant WNV sequences
Completely conserved nonamer sites with no variants were numerous, and the occurrence of variant nonamer sequences across the WNV proteome was generally low, less than 10% of all WNV recorded sequences at most positions (Figure 3). The position with the highest representation of variant nonamer sequences (49%) was found in the nonstructural protein NS4a. Overall, our data suggest a low probability of immune challenges from variant WNV T-cell epitopes, due to a high representation of historically conserved sequences of the WNV proteome in the known virus data.

Completely conserved pan-WNV sequences
A total of 88 completely conserved sequence fragments (pan-WNV sequences) were identified across the whole proteome (Table 2). The length of these fragments ranged from 9 to 29 amino acids, covering a total length of 1,169 amino acids (∼34%) of the complete WNV polyprotein (3,430 aa) (Table 3). The C protein had no completely conserved nonamer fragment, which is consistent with the large amino acid difference (up to 23%) observed for sequences of this protein, compared to other WNV proteins (Table 1). The NS3 and NS5 proteins contained the greatest number of completely conserved fragments, 25 in NS3 (spanning 48% of the protein length) and 30 in NS5 (spanning 51% of the protein length). The other nonstructural proteins NS1, NS2a, NS2b, NS4a and NS4b collectively contained a total of 24 completely conserved sequences, covering 11% to 40% of their respective protein lengths. In contrast, the variability of the structural proteins was much greater: prM had only two completely conserved sequences (14% of the protein length), while E had 7 (18% of the protein length).

Functional and structural analysis of pan-WNV sequences
Sequences conserved throughout the evolutionary history of rapidly mutating RNA viruses are thought to be critical for structure and/or function [36]. A search in the Prosite and Pfam databases [27,29] revealed that 50 of the 88 pan-WNV sequences are known to be associated with putative or known biological functions and/or structure (Table 4); the biological significance of the remaining 38 sequences is still to be determined. In the E protein, two pan-WNV sequences correspond to the fusion loop and dimerisation domain [37], while two correspond to the immunoglobulin-like domain, attributed to putative receptor binding sites [38]. One NS1 sequence corresponds to the putative ATP/GTP binding site p-loop motif, likely to be involved in helicase activity [39]. NS3 contained 4 pan-WNV sequences that correspond to the peptidase family S7 (Flavivirus serine protease) domain [40], and 4 that correspond to the known/putative Flavivirus Asp-Glu-Ala-Asp/His (DEAD/H) domain associated with ATP-dependent helicase activity [41]. NS5 contained 17 sequences that correspond to the RNA-dependent RNA polymerase (RdRp)/catalytic domain [42,43]. Furthermore, 33 of the 50 pan-WNV sequences were predicted to exhibit post-translational modification(s), including N-glycosylation, protein kinase C (PKC), casein kinase II (CKII) and tyrosine kinase (TK) phosphorylation, N-myristoylation and/or amidation.
Amino acid residues exposed and protruding on the surface of viral proteins are generally subject to fewer packing constraints and residue interactions as compared to those buried within protein interiors. Thirty of the 88 pan-WNV sequences could be mapped on available, but incomplete, WNV protein structures obtained from the PDB (E protein, 2HG0; NS3, 2IJO; and NS5, 2HFZ) (Figure S1). Five pan-WNV sequences were mostly buried, while roughly equal numbers were partially exposed (13) or largely exposed (12). These results should be considered preliminary until full-length 3-D structures are available.

Distribution of pan-WNV sequences in nature
Sixty-seven (67) of the 88 pan-WNV sequences (∼76%) overlapped at least 9 amino acid sequences of as many as 68 other viruses of the family Flaviviridae, genus Flavivirus (Figure 4). Each of these 67 sequences matched at least one, and at most 61 Flavivirus species (Figure 5 and Table S1). Murray Valley encephalitis (MVE) virus shared 49 of the 67 pan-WNV sequences. Fifty-eight (58) of the 67 pan-WNV sequences shared by other flaviviruses were from the non-structural proteins. Of the 27 pan-WNV sequences found in NS5, 10 were present in at least 30 Flavivirus species, while of the 16 sequences in NS3, three were found in between 25 and 34 other species. The remaining 15 sequences were contained in non-structural proteins NS1 (7), NS2a (2), NS2b (1), NS4a (3) and NS4b (2). Nine (9) of the 67 pan-WNV sequences shared by flaviviruses originated from the structural proteins E (7) and prM (2); one of the E protein sequences was present in 31 species.
Remarkably, 5 of the 88 pan-WNV sequences (prM 158-167, NS3 408-418, NS4b 208-229, NS5 1-10, and NS5 504-519) shared 9 consecutive amino acids with 7 non-viral species. The nonamer sequence from prM 158-166 is found in the bacterium Acidiphilium cryptum JF-5; NS3 409-417 in the mosquito Aedes albopictus; NS4b 218-226 in the Japanese rice Oryza sativa (japonica cultivar-group); NS5 2-10 in the bacterium Actinomyces odontolyticus; NS5 504-512 in the bacteria Burkholderia ambifaria MC40-6 and Burkholderia cepacia AMMD; and NS5 506-514 in the bacterium Methylobacterium extorquens PA1.

Table 4. Reported biological properties of pan-WNV sequences.

previously reported WNV T-cell epitopes immunogenic in human, having HLA restriction (when known), with both class I (B*07) and II (DR2) specificities (Table 5). Further evaluation of the immune-relevance of pan-WNV sequences included a search for putative promiscuous HLA supertype-restricted T-cell epitopes within these regions by use of NetCTL, Multipred, ARB and TEPITOPE prediction tools. Seventy-eight (78) of the 88 pan-WNV sequences (∼89%) were predicted to contain 271 supertype-restricted binding nonamers (Figure 6 and Table S2). Of these sequences, 62 contained nonamers predicted to bind to multiple HLA supertypes. Clusters of predicted binders, two or more overlapping nonamer peptides with identical HLA supertype restrictions, known as hotspots [33,44], were found in 41 of the 78 sequences. Seven (7) of the 78 sequences had at least 3 sequential nonamers overlapping by 8 amino acids. As these sequences are completely conserved, all of these epitopes are found in all reported WNV strains. In addition, 44 pan-WNV sequences were found to contain sequences of at least 9 amino acids present in 54 CD4+CD8− and/or CD4−CD8+ IFN-γ ELISpot positive peptides (Table 6 and Table S2).
The experimental data revealed that 11 out of 44 pan-WNV sequences, localized in prM, E, NS1, NS3, NS4a, NS4b and NS5, were promiscuous for at least two HLA-DR alleles; the promiscuity of 9 of these 11 pan-WNV sequences was correctly predicted (Table S2). In summary, combined with previously reported data for human WNV T-cell epitopes from the literature and public databases (Table 5), at least 44 of the 88 pan-WNV sequences contained numerous HLA-restricted class I and/or class II epitopes demonstrated by in vivo T-cell response assays.

Discussion
In the 70 years following the discovery of WNV in Africa in 1937 [45], 88 pan-WNV sequences have remained 100% conserved, corresponding collectively to 1,169 aa or ∼34% of the 3,430 aa total composition of the viral proteome. The remaining 66% of the proteome contained one or more amino acid variants within each nonamer segment across the reported WNV sequences. Most of the pan-WNV sequences were found in the non-structural proteins. Quantitatively, 40% (1058/2643 aa) of the amino acids of the non-structural proteins (NS1, NS2a, NS2b, NS3, NS4a, NS4b and NS5) comprised the pan-WNV sequences, compared to only 14% (111/787 aa) of the structural proteins (C, prM and E). This marked difference in the evolutionary conservation/variability of the viral proteins can be attributed to greater demands on the integrity of nonstructural proteins in their viral functional roles, and possibly to the selective advantage of modified structural proteins in the adaptation to host immune responses. This evolutionary history of the conserved protein sequences extends to other members of the Flaviviridae family, with 67 of the 88 pan-WNV sequences shared among at least 68 other flaviviruses. Many critical biological and/or structural properties are associated with the conserved sequences; for example, the E dimerisation domain and fusion loop [37,38], the NS3 peptidase S7 and DEAD/H domains [40,41], and the NS5 RdRp domain [42,43]. Hence, these conserved sequences are unlikely to diverge significantly in newly emerging WNV isolates in the future, and represent attractive targets for the development of diagnostics, specific anti-viral compounds and vaccine candidates. In short, they can be defined as multi-purpose immutable, functional and immunological tags of WNV.
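The coverage percentages quoted above follow directly from the residue counts given in the text, as this short Python check shows. Only the numbers stated in this section are used; the variable and function names are introduced for the example.

```python
# Residue counts as stated in the text: (conserved aa, total aa).
nonstructural = (1058, 2643)  # NS1, NS2a, NS2b, NS3, NS4a, NS4b, NS5
structural = (111, 787)       # C, prM, E


def percent(conserved, total):
    """Percentage of residues that lie in pan-WNV sequences."""
    return 100.0 * conserved / total


total_conserved = nonstructural[0] + structural[0]
total_length = nonstructural[1] + structural[1]

print(f"non-structural: {percent(*nonstructural):.1f}%")
print(f"structural:     {percent(*structural):.1f}%")
print(f"whole proteome: {percent(total_conserved, total_length):.1f}%")
```

The two groups sum to the 1,169 conserved residues and the 3,430-residue polyprotein quoted for the whole proteome.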
It is also noteworthy that 9 consecutive amino acids of 5 of the pan-WNV sequences are also present in non-viral proteomes: the Aedes albopictus mosquito, Oryza sativa Japanese rice and several bacteria. This overlap of pan-WNV sequences with non-viral sequences is possibly coincidental, but is likely to be statistically significant as the probability of randomly matching a nonamer is almost negligible (1/20^9). WNV protein sequences found in the proteomes of bacteria are possibly due to integration of some unknown virus into the bacterial genome [46,47]. Similarly, the NS3 nonamer sequence fragment found in the Asian Tiger mosquito (Aedes albopictus) is possibly due to genetic recombination between phyla [48]. Unexpectedly, a nonamer of WNV NS4b protein was found in a single instance within a plant pathogenesis-related protein from Japanese rice (Oryza sativa), which functions in the plant defense system against pathogens [49].
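The 1/20^9 figure can be made concrete with a short Python calculation. It assumes, as the figure itself does, that the 20 amino acids are equally likely and independent at each of the 9 positions, which is a simplification of real protein composition. The helper for the expected number of chance matches and its `n_windows` parameter are hypothetical additions for illustration, and no database size is assumed here.

```python
ALPHABET_SIZE = 20  # standard amino acids, assumed equiprobable
K = 9               # nonamer length

# Probability that one random nonamer matches a given nonamer exactly.
p_match = 1 / ALPHABET_SIZE ** K


def expected_chance_matches(n_windows: int) -> float:
    """Expected exact matches when n_windows random nonamers are compared."""
    return n_windows * p_match


print(ALPHABET_SIZE ** K)  # number of possible nonamers
print(p_match)             # per-comparison match probability
```

The probability applies to a single comparison; the expected number of chance matches in a search grows in proportion to the number of nonamer windows compared.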
There is evidence that many of the conserved sequences are immunologically relevant in humans. Numerous (44/88) contained at least 9 amino acids overlapping with a total of 54 peptides that have been reported to be immunogenic in humans and/or HLA Tg mice. In addition, putative T-cell epitopes were predicted by computational analysis for 12 major HLA class I supertypes and for the class II DR supertype, with broad application to the immune responses of the human population worldwide. Some of the putative T-cell epitopes were predicted to be promiscuous to multiple HLA supertypes, as has been observed with several viruses [17]. These findings of the limited variability of WNV sequences relevant to cellular immunity point to the probable success in the development of a WNV vaccine, as compared to the history of failure of candidate vaccines against much more variable flaviviruses, such as DENV [17].
A comparison can be made with a similar study of each of the four DENV serotypes [17]. In contrast to WNV, the sequences of the combined serotypes of DENV are highly diverse, with only 44 pan-DENV sequences, representing 15% of the proteome length, that are present in 80% or more of the sequences of each serotype; only two of these 44 sequences were completely conserved in all the four serotypes. However, the conservation and variability of each DENV serotype is comparable to WNV. The individual DENV serotypes and WNV show remarkable stability over the entire recorded history of their sequences, as demonstrated by their low peptide entropies and variant frequencies. The pan-WNV sequences matched sequences of DENV (one or more serotypes) with representations ranging from low (3%) to high (100%); similar observations were made for pan-DENV sequences matching WNV sequences (4 to 100%). The conserved sequences that matched with low representation may pose a potential risk of altered ligands resulting in pathologic immune responses following co-infection or vaccination and secondary infection with a similar virus. Thus, while the consequences of such extensive possible cross-reactive immunity are hypothetical, we propose, for vaccine formulation, that it is prudent to select conserved sequences that are specific to the pathogen, and thus representative of a minimal number of variant sequences.

Figure S1 The localization of pan-WNV sequences (in purple) on the three dimensional structure of the respective WNV proteins (E - 2HG0, NS3 - 2IJO and NS5 - 2HFZ). Abbreviations: (E) major portion exposed, (P) partially exposed, (B) major portion buried.