Genomic Diversity and Evolution of the Lyssaviruses

Lyssaviruses are RNA viruses with single-strand, negative-sense genomes responsible for rabies-like diseases in mammals. To date, genomic and evolutionary studies have most often utilized partial genome sequences, particularly of the nucleoprotein and glycoprotein genes, with little consideration of genome-scale evolution. Herein, we report the first genomic and evolutionary analysis using complete genome sequences of all recognised lyssavirus genotypes, including 14 new complete genomes of field isolates from 6 genotypes and one genotype that is completely sequenced for the first time. In doing so we significantly increase the extent of genome sequence data available for these important viruses. Our analysis of these genome sequence data reveals that all lyssaviruses have the same genomic organization. A phylogenetic analysis reveals strong geographical structuring, with the greatest genetic diversity in Africa, and an independent origin for the two known genotypes that infect European bats. We also suggest that multiple genotypes may exist within the diversity of viruses currently classified as ‘Lagos Bat’. In sum, we show that rigorous phylogenetic techniques based on full length genome sequence provide the best discriminatory power for genotype classification within the lyssaviruses.


Introduction
Lyssaviruses (LYSSAV) are RNA viruses with single-stranded, negative-sense genomes of the family Rhabdoviridae [1] that infect a variety of mammals causing rabies-like diseases.Rabies is an ancient disease that may have been reported in the Old World before 2300 B.C. [2].However, the absence of effective control measures in animal reservoir populations combined with a widespread lack of human access to vaccination means that more than 50,000 people annually die of rabies, particularly in Asia and Africa [3,4].Currently, there are seven recognised genotypes (GT) of LYSSAV defined on the basis of their genetic similarity [5,6]: rabies virus (RABV, GT1) responsible for classical rabies in terrestrial mammals globally and in bats on the American continent, as well as the cause of most rabies-related human deaths worldwide [3]; Lagos bat virus (LBV, GT2); Mokola virus (MOKV, GT3); Duvenhage virus (DUVV, GT4); European bat lyssavirus type 1 (EBLV-1, GT5); European bat lyssavirus type 2 (EBLV-2, GT 6); and Australian bat lyssavirus (ABLV, GT7).All genotypes except MOKV (where the host species is unknown) have bat reservoirs, hinting that lyssaviruses originated in these mammals [7].Additionally, four new lyssavirus genotypes that infect bats in central and southeast Asia have been proposed: Aravan virus, Khujand virus, Irkut virus and West Caucasian Bat virus [8,9].The negative-sense LYSSAV genome encodes five proteins: the nucleoprotein (N), phosphoprotein (P), matrix protein (M), glycoprotein (G) and RNA polymerase (L) in the order 39-N-P-M-G-L-59 [10].
Despite the importance of LYSSAV for human and wildlife populations, the number of complete genome sequences of field isolates of LYSSAV is sparse, with only eight currently available for limited type species [11][12][13][14][15]. Herein, we present the first genomic and evolutionary analysis of the seven known genotypes of LYSSAV, therein significantly increasing the extent of available genome sequence data available for these important mammalian pathogens.

Viruses and RNA isolation
Total RNA (Table 1) was isolated from original specimens or from suckling mice brain after early passage using Tri-Reagent (Euromedex).The only exception was the 8743THA isolate that was adapted on BSR cells (passage 22).For this isolate, total RNA was isolated from infected BSR cells infected at a low multiplicity of infection (0,1).Reverse transcription was performed with random hexamer primer (Roche Boehringer) using Superscript II (Invitrogen) following the manufacturer instructions.

PCR and sequence determination
Long-range PCR products were obtained using ExTaq (Takara) and specific primers (Table S1) using manufacturer recommen-dations.For sequence determination we used a shotgun base approach called LoPPS (Long PCR Product Sequencing) [16,17].39 genomic ends were generated by RACE protocol [14] using a 59 phosphorylated reverse complementary T7 primer.T7 cDNAs were further used for heminested-PCR with ExTaq using T7 and two strain specific primers designed in the N coding region (supplementary Table 1).To determine the 59 sequence of the genomic RNA we used a 59RACE version 2.0 kit from Invitrogen following manufacturer instructions.The PCR products (59 or 39 RACE) were then purified on gel using Qiaquick gel extraction kit (Qiagen) and cloned in PCR 2.1 TOPO T/A (Invitrogen) for sequencing.Each position of the consensus nucleotide sequence was determined from at least three independent sequences.All consensus sequences obtained using Sequencher 4.7 (Gene Codes) software were aligned using ClustalX 1.83.1 [18].The untranslated regions were further aligned manually using the SE-AL program (http://tree.bio.ed.ac.uk/).GenBank accession numbers for the sequences newly acquired here are designated EU293108-EU293121.

Phylogenetic analysis
Phylogenetic analysis of LYSSAV genomes was based on a multiple alignment of concatenated coding region sequences (12105 nt).A maximum likelihood (ML) phylogenetic analysis of these data was undertaken using PAUP* [19] employing the bestfit GTR+I+c 4 model of nucleotide substitution inferred by ModelTest [20].To determine the extent of support for different groupings on the tree a bootstrap resampling analysis was undertaken employing 1000 replicate neighbor-joining trees estimated under the ML substitution model.

Results and Discussion
In total we determined 14 new complete genome sequences of field isolates representing six (GT1, GT2, GT3, GT4, GT5 and GT6) of the seven genotypes of LYSSAV, with complete genome from GT4 obtained for the first time.These genomes were combined with eight genomes described previously (with the exception of one Australian bat lyssavirus for which leader and trailer sequences are unavailable).Eight field isolates of viruses isolated from humans, canids and bats were chosen as representative of the diversity of GT1.Two vaccine strains (SAD-B19 and PV) were included in all sequence comparisons but not in the phylogenetic analysis.Our study also represent the first analysis of the intrinsic genetic diversity of GT2, GT3, GT4, GT5 and GT6 based on full length genomes.
All genomes have the same structural organization although their lengths varied between 11918 nt.(GT7) and 12016 nt.(GT2) (Table 2).The predicted size of the coding regions is similar among genotypes, with the M protein identical in length across all genotypes and the P protein the most variable [14,21,22].As observed in other RNA viruses, all genotypes show a bias toward G+C richness [23], with the lowest G+C content observed in GT2 and the highest in GT1 (Table 2).All genomes have a polycistronic genome organization surrounded by untranslated regions (Table S2) similar to that already described [10,14].The extent of genetic diversity, reflected in percentage identity, varies within and among proteins (Figure 1), in the order N.L.M.G.P (95.2, 94.2, 92.3, 85.8, 81.5% amino acid identity, respectively).A similar pattern was previously observed using more limited data sets [14].This same order was also observed in terms of overall selection pressure, measured as the mean ratio of nonsynonymous (d N ) to synonymous substitutions (d S ) per site (d N /d S ), estimated using the maximum likelihood SLAC (Single Likelihood Ancestor Counting; http://www.datamonkey.org/)method [24]: N = 0.048; L = 0.055; M = 0.078; G = 0.119; P = 0.187.This approximately four-fold difference in mean d N /d S reflects major differences in selective constraint among proteins.This trend was also reflected in previous analyses of full length genomes of vaccine strains [22] and through partial gene comparisons [25,26].
Our study represents the largest analysis of the 39 and 59 UTR of the lyssavirus genomes undertaken to date.The 39 UTR comprises 70 nt and includes the leader regions potentially transcribed into the leader RNA.The 59UTR region comprises 86-145 nt and contains the trailer regions of size 68-69 nt.Both the 39 and 59 UTR have conserved signals that play a role to modulate replication and transcription (Figure S1) [27].Our data also reveals a strict complementarity limited to the 9 terminal nucleotides as well as nucleotide positions 14 and 16 from both ends of the genome [14,28,29].
There have been several attempts to estimate the evolutionary relationships among lyssaviruses, with most utilizing only one or two genes [1,7,21,26,[30][31][32][33][34].We therefore undertook a phylogenetic analysis of 22 genomes representative of the seven genotypes of LYSSAV based on a multiple alignment of concatenated coding sequences.Our phylogenetic analysis reveals the separation of LYSSAV into two major branches previously defined as different 'phylogroups' [7] and 7 component lineages defined as genotypes [5,35].Phylogroup 1 comprised GT1, 4, 5, 6 and 7, while phylogroup 2 contains only GT2 and GT3 (Figure 2).Notably, phylogroup 2 contains viruses of sampled exclusively from Africa -LBV and MOKV -while a third African genotype (DUVV) is found within phylogroup 1 [32,36].Also of note was the observation that although GT5 and GT6 both circulate in European insectivorous bats [32], the former is more closely related to the African GT4 viruses [32].Hence, there has clearly been an independent origin of genotypes 5 and 6 in European bats, as previously documented in analyses of the N and G genes in isolation [32].Finally, that bats appear as the principle host species across such a large phylogeographic range indicates that the association between lyssaviruses and bats is likely to be the ancestral condition (with a secondary loss of bat transmission in GT3), such that the movement of bats is likely to be responsible for the global dissemination of these viruses [7].
Notably, our study represents the first analysis of the genetic diversity of four complete genomes of GT2 and GT4, both of which are African in origin.While little variation is seen within GT4, the degree of divergence among the two GT2 isolates (Lagos bat virus -LBV) is striking (23.7% and 12.1% at the nucleotide and amino acid levels, respectively) and greater than that seen within any other genotype.Hence, although 0406SEN and 8619NGA are related according to the arbitrary classification system based on nucleotide identity between N coding regions (80.3% between 0406SEN and 8619NGA compared to a cut-off of 80%) [6,21], this classification system will likely need to be revised as expanded surveys of LYSSAV in Africa (this study and [37]) and in Eurasia [8,9,21] reveal greater genetic diversity.More fundamentally, if, as we suggest, complete genomes represent the best tools for genotyping, we propose that 0406SEN should constitute a new GT8 different from GT2 (8619NGA) and that the genotype division should be set at 76.4 to 81.6% nucleotide identity at coding sequences for all five viral proteins.Such a cut-off would provide more discriminatory power than systems that utilize the N gene in isolation (Table 3).Finally, we suggest that the phylogenetic methods used herebased on a realistic model of nucleotide substitution, a robust phylogenetic method, and rigorous bootstrap resampling -  Within each window, the similarity of any one position is taken to be the average of all the possible pairwise scores at that position and is calculated using PLOTCON (available at http://bioweb.pasteur.fr/seqanal/interfaces/plotcon.html).doi:10.1371/journal.pone.0002057.g001represent a more powerful method of lyssavirus classification than those based on pairwise genetic diversity alone, particularly as they account for any lineage-specific rate variation that will compromise all distance-based approaches used to date.This method has also been proposed for HIV to try to standardize viral classification [38] confirming the interest of this method for viral classification.

Figure 2 .
Figure 2. Phylogenetic relationships of 22 complete coding regions of LYSSAV genomes representatives of the 7 genotypes.The phylogeny was inferred using an ML procedure, and all horizontal branches are scaled according to the number of substitutions per site.Boot strap values (.95%) are shown for key nodes.The tree is mid-point rooted for purposes of clarity only.doi:10.1371/journal.pone.0002057.g002

Figure 1 .
Figure 1.Schematic representation of lyssavirus genome organization and sequence similarity among 24 aligned genomes.A. The 39 leader, N-, P-, M-, G-and L-coding regions and the 59 trailer region are shown.B. Sequence similarity is calculated by moving a window of 60 nucleotides along the aligned sequences.C. Sequence similarity is calculated by moving a window of 20 amino acids along the aligned sequences.Within each window, the similarity of any one position is taken to be the average of all the possible pairwise scores at that position and is calculated using PLOTCON (available at http://bioweb.pasteur.fr/seqanal/interfaces/plotcon.html).doi:10.1371/journal.pone.0002057.g001

Figure S1
Figure S1 Comparison of the 59 and the reverse complementary 39 genomic termini of the antigenomic (+) sense RNA of

Table 1 .
Isolates of lyssavirus analysed in this study.

Table 2 .
Coding potential, genome size (in nucleotides) and G+C content of 24 genomes representing the 7 genotypes of the lyssavirus genus.

Table 3 .
Minimum intra-genotype and maximum inter-genotype sequence similarities among 24 lyssaviruses.when 0406SEN is considered as the representative isolate of a new GT8, with 8619 the representative isolate of GT2.doi:10.1371/journal.pone.0002057.t003 *