The Genomic Signature of Human Rhinoviruses A, B and C

Human rhinoviruses are single stranded positive sense RNA viruses that are presented in more than 50% of acute upper respiratory tract infections. Despite extensive studies on the genetic diversity of the virus, little is known about the forces driving it. In order to explain this diversity, many research groups have focused on protein sequence requirements for viable, functional and transmissible virus but have missed out an important aspect of viral evolution such as the genomic ontology of the virus. This study presents for the first time the genomic signature of 111 fully sequenced HRV strains from all three groups HRV-A, HRV-B and HRV-C. We observed an HRV genome tendency to eliminate CpG and UpA dinucleotides, coupling with over-representation of UpG and CpA. We propose a specific mechanism which describes how rapid changes in the HRV genomic sequence can take place under the strict control of conservation of the polypeptide backbone. Moreover, the distribution of the observed under- and over-represented dinucleotides along the HRV genome is presented. Distance matrice tables based on CpG and UpA odds ratios were constructed and viewed as heatmaps and distance trees. None of the suppressions can be attributed to codon usage or in RNA secondary structure requirements. Since viral recognition is dependent on RNA motifs rich in CpG and UpA, it is possible that the overall described genome evolution mechanism acts in order to protect the virus from host recognition.


Introduction
Human rhinoviruses (HRVs) are non-enveloped, positive-sense, single stranded RNA viruses (+ssRNA) which belong to the genus Enterovirus in the family Picornaviridae. The HRV 3D-polymerase, necessary for synthesis of new genome, has no proof-reading capability a fact that estimates one to four mutations per replicative cycle for lytic RNA viruses, a number that would predict incredibly high nucleotide replacement rates for an RNA virus during an epidemic [1,2]. However, the actual replacement rates are between 5610 22 and 5610 24 per site per annum [2,3]. This presents an interesting case because in one hand RNA viruses can evolve rapidly, close to the maximum rate, compatible with maintaining genetic information, and with frequent recombination, yet their genomes are remarkably stable when grown under unchanging conditions [4][5][6]. The paradox between the predicted and observed replacement rates could be explained if effective selection processes were in operation during virus replication such as the strict control of functional RNA secondary structures, conservation of the polypeptide backbone and avoidance of ''intolerable'' nucleic acid immunostimulatory motifs and/or structures.
These viral evolutionary constraints can be depicted as anomalies in the occurrence of the sixteen dinucleotides (XpY) when comparing their odds ratios (R XpY ). This dinucleotide bias estimator is usually referred to as the genomic signature or the dinucleotide odds ratio profile of a given sequence or genome of an organism. In order to better understand the evolutionary pressures responsible for the high HRV sequence variation observed amongst all fully sequenced HRV strains and how this is achieved we performed a dinucleotide odds ratio profiling analysis. We propose a novel evolutionary mechanism of their genomic sequences which is independent of codon usage and/or RNA structures but is controlled by the maintenance of a functional amino acid level equilibrium.

Data Acquisition
The complete genome sequences for 111 Human rhinovirus strains were downloaded from the National Center for Biotechnology Information (NCBI) website in GenBank (National Center for Biotechnology Information) format [7]. This reference set includes 75 HRV-A and 25 HRV-B serotypes. The HRV-C species has only recently been recognized, and up to date consists of 11 types whose complete genomes are known [8][9][10][11][12].

Dinucleotide Odds Ratio Calculation
Dinucleotide odds ratio is the quotient of the probability of finding a dinucleotide in a given sequence divided by the product of the probabilities of finding each nucleotide that forms the pair in the same sequence, calculated as shown in Equation 1.
Equation 1: Calculation of dinucleotide odds ratio R XpY for a single stranded sequence Dinucleotides with odds ratio values outside the 0.81-1.19 range were considered as having a low or high relative abundance, respectively, as proposed by Burge et al [14,15].

Sequence and structural alignment
RNA and protein sequence alignments were produced in the CLC RNA Free Workbench 4.4 (CLC bioA/S) and CLC Protein Free workbench 5.5.2 (CLC bio), respectively. Phylogenetic analysis based on sequence alignments were performed in the same platforms using neighbor joining algorithm and 100 bootstraps. Structural alignments were performed in the ''Sequence to Structure (S2S)'' package [16].

Relative Informative Synonymous Codon Usage Calculation
Relative Synonymous Codon Usage Calculation (RSCU) is used to estimate codon bias for all codons which code for an amino acid with degeneracy greater than one. It is defined as the observed frequency of a codon j in a sequence x divided by the frequency expected E if all synonymous codons for the amino acid coded by j were equally frequent, as shown in Equation 2.
Equation 2: Calculation of RSCU Expected values are calculated by counting the total number of synonymous codons for a given amino acid in the sequence divided by the number of existing codons that codes for it. Informative synonymous codons are defined as the trinucleotides containing a dinucleotide which is differentially represented in the odds ratio profiling (CpG, UpG, CpA and UpA) and encode for an amino acid which is also encoded by at least one trinucleotide without the aforementioned dinucleotides. Thus the non-informative codons UAC-UAT (Tyrosine), UGU-UGC (Cysteine), CAU-CAC (Histidine) and CAA-CAG (Glutamine) cannot be used in the analysis. All calculations were generated using the CALcal software [17].

Pairwise distance analyses
Matrices of pairwise distances based on odds ratios of CpG and UpA dinucleotides showing percentage differences between all pairs of the 111 HRV strains were constructed and presented as heatmaps in figures 1 and 2. These were further analysed using the PHYLIP package [18]. DRAWTREE and DRAWGRAM were used to visualize the results as pairwise distance trees (supplementary data, FS1, FS2). Neighbor Unweighted Pair Group Method with arithmetic mean (UPGMA) and Neighbor-Joining (NJ) algorithms were used to generate the best tree, along with the DAMBE software which utilizes the FastME method (used with default parameters) [19]. In all scenarios, CpG/UpA dinucleotide odds ratios of other single-stranded RNA viruses were included in the analysis based on the Rima and McFerran publication [15].

Results
We analyzed the odds ratio R XpY of the 16 dinucleotides for 111 HRV genomes comprising 75 HRV-A, 25 HRV-B and 11 HRV-C strains (Table S1, supplementary data). Mean R XpY values along with minimum and maximum values are shown in table 1. We found three differentially represented dinucleotides (CpG, CpA, UpG) in the odds ratio profiling of all HRV sequences tested with p,0.001 (Dunns Repeated Measures test) along with a small borderline under-representation of UpA (mean R UpA : 0.82) which was consistent in most strains (range 0.69-0.90). The lowest R UpA value was observed in HRV-QCE (0.69). CpA and UpG dinucleotides were highly over-represented in all human rhinovirus strains ( Table 1). The highest R UpG value was observed in HRV-QCE (1.52) which belongs to HRV-C. The highest R CpA value was found in HRV-A95 (1.51). The dinucleotide CpG was massively under-represented in all HRV strains (mean R CpG : 0.28). The highest CpG suppression was observed in HRV-A51, HRV-A81 and HRV-B84 (0.19). We also noticed a small overrepresentation of CpC in the HRV-A (1.22) and HRV-C (1.24) group and an even smaller over-representation of GpG only in HRV-C (1.20). None of the reversed dinucleotides (GpC, ApU, ApC, GpU) of the four differentially represented dinucleotides were over-or under-represented in our sequences indicating that the observed dinucleotide tendencies are not mononucleotide driven (table 1). CpG occurrence was significantly inverse correlated with C+G content (P,0.0001, r = 0.416) and the same applied for UpA/U+A content (P,0.0001, r = 0.362).
In order to understand the relationship of these discrepancies in the dinucleotide odds ratios we constructed a 464 occurrence table (table 1) [15]. In such a table both columns and rows of R XpY values must sum to four since the expected R XpY for all dinucleotides when no evolutionary pressure is present must equall one. If a bias of dinucleotide usage is arising by replacement of one nucleotide by another, a compensatory mechanism must maintain the four sum in both columns and rows. From this table it is obvious that CpG, UpA, UpG and CpA are all involved in a mechanism that acts in order to maintain a balance amongst the 16 dinucleotides. However, we observe that the level of CpG suppression is such that the four sum cannot be reached neither in the CpG containing column or row. From the same table can be seen that UpA even though only mildly under-represented is part of this mechanism. That is why we chose to include it in all of our subsequent analyses.
Next, we evaluated the distribution of the dinucleotide disturbances in the non-coding and coding regions of the human rhinovirus genome. The results are presented in table 2. The under-representation of CpG/UpA is higher in the coding regions coupled to the over-representation of CpA/UpG in all three HRV groups. This led us to investigate more our results. Table 3 In order to investigate whether the observed dinucleotide tendencies are codon driven we compared them with RSCU values for all informative synonymous codons (table 4). In RSCU analysis if a certain amino acid is over-or under-represented in the protein sequence then its mean RSCU value should deviate significantly from the value 1. If this deviation correlates with the observed dinucleotide tendencies then a possible cause of these tendencies would be protein requirements for specific amino acids containing specific dinucleotides. In our RSCU table none of the amino acids has a mean RSCU value lower or higher than 1 suggesting that the observed dinucleotide suppressions and overrepresentations are not codon driven. More specifically the CpG dinucleotide encodes, as part of a coding trinucleotide, five different amino acids: Serine (S), Proline (P), Threonine (T), Alanine (A) and Arginine (A). One would expect that if CpG suppression is driven by codon usage the above amino acids would be represented with low mean RSCU values. Looking at the right side of the RSCU table it can be seen that the first eight codons with the lowest RSCU values contain CpGs. Alanine encoded by GCpG has an RSCU value ranging from 0. The above results suggest that the CpG suppression is not associated with specific amino acid usage since for every CpG-containing codon with a low RSCU value there are other informative synonymous codons with RSCU values well above 1, which compensate for the specific amino acid loss. Even in the case of arginine, which is known to be avoided in The dinucleotide CpA which is over-represented in the odds ratio profiling ''encodes'' serine (UCpA), proline (CCpA), threonine (ACpA), alanine (GCpA), histidine (CpAU-CpAC) and glutamine (CpAA-CpAG). The last two are non-informative synonymous codons. In the RSCU table the CpA-containing codons for A, T, S and P are located on the bottom right side where the highest RSCU values are located and as discussed above are also encoded by CpG-containing trinucleotides. The results suggest that the observed increase in CpA levels have a same impact in the CpA ''encoded'' aminoacids which are also increased. Since the same amino acids can also be encoded by CpG containing codons these data suggest that the high R CpA values can be attributed to the need of the protein sequence to equilibrate the ''loss'' of the CpG-encoded amino acids by a CpG.CpA (G.A) transition.
On the other hand, UpG over-representation was the highest amongst all dinucleotides in all HRVs and we expected to find a similar over-representation of the UpG ''encoded'' amino acids as in the case of CpAs. However, only 25% in HRV-A One possible way for a coding nucleotide sequence to adjust dinucleotide frequencies without affecting (or partly affecting) the encoded amino acids is by incorporating the dinucleotide in codon junctions. Although codon usage fixes the frequency with which a dinucleotide is present in positions 1 and 2 and in positions 2 and 3, it has no effect on the incidence of junctional XpY dinucleotides, in which the mononucleotide X occupies the third position of a codon and Y the first position of the following codon. Table 5 shows the distribution of the 4 differentially represented in the odds ratio profiling dinucleotides in codons and in codon junctions. As expected, in all HRVs more UpGs occupy positions in codon junctions than in codons (57.5% versus 42.5%, respectively) a percentage reaching 60% in HRV A! Since UpG and UpA containing codons encode for the same aminoacids, these data suggest a UpA.UpG (A.G) transition. Furthermore, UpA is the second most abundant dinucleotide in codon junction position (45.5%) indicating a direct relationship with UpG.
Up to this point a CpG.CpA and UpA.UpG transition mechanism has been established which is guided by HRV aminoacid balance but is not driven by codon usage. However when looking at the distribution of the odds ratio values for CpG, UpA, UpG and CpA in the non coding versus the coding region of the genome we observed that in the 59UTR CpG suppression can only be coupled with UpG over-representation suggesting a CpG.UpG (C.U) transition. This has a dual implication for our results: (1) The input in UpG increase originates also from CpG suppression apart from UpA suppression, however because UpGs are mostly located in codon junctions this has a little effect in the encoded UpG-amino acids, (2) UpA suppression is probably masked by CpG.CpA and CpG.UpG which predicts increased numbers of UpAs and not decreased as in our case.
In order to visualize the depth and variation of CpG/UpA suppressions amongst HRV strains, we constructed color-coded matrices of pairwise distances showing percentage differences between all pairs of HRV strains (Figures 1 and 2). We also draw pairwise distance trees based on the odds ratio values of the two under-represented dinucleotides CpG/UpA (FS1 and FS2, supplementary data). These trees do not imply phylogenetic relationships between the strains.

Discussion
Even though the sequence and structure similarity of HRVs have been extensively investigated, this is the first study that presents the genomic signature of Human rhinoviruses and proposes a possible evolutionary mechanism of their genomic sequences. Our analysis revealed a CpG/UpA under-and CpA/ UpG over-representation in all of 111 HRV genomic sequences. CpG/C+G and UpA/U+A are inverse correlated. The consequence of this inverse correlation is that when the expected number of CpG/UpA is low due to a low C+G/U+A content, the observed numbers are further suppressed. The under-representation of CpG/UpA is higher in the coding regions coupling with the over-representation of CpA/UpG in all three HRV groups. RSCU analysis suggests that none of the observed suppression The Genomic Signature of Human Rhinoviruses PLOS ONE | www.plosone.org tendencies are codon driven but the dinucleotide transitions are most probably determined by the HRV amino acid functional balance. We used RSCU analysis to investigate how this suppression/over-representation mechanism acts since the encoded polyprotein has a pivotal role in ''specifying'' the genomic sequence, and no mechanism can act ignoring requirements for the translation of specific amino acids. This fact has been missed in most publications concerning the genomic signature of various ssRNA viruses. Based on our observations we propose a possible mechanism of HRV genome evolution: (1) A CpG suppression by transition of CpG.CpA (G.A) and CpG.UpG (C.U) takes place leading to (2) a subsequent increase in CpA and UpG numbers, with the latest being the highest. Interestingly, it seems that increased CpAcontaining codons act in order to restore balance in the decrease of the amino acids encoded by CpG-containing synonymous codons. This is further supported by the fact that in the non-coding 59UTR where CpG suppression is minimum there is no overrepresentation of CpA, suggesting a possible mechanism where the CpG.CpA transition is ''active'' mostly or only at the coding region of the genome. An interesting finding is that while a UpG increase would ''normally'' be depicted in subsequent increase of the amino acids encoded by UpG-containing codons, this is kept to a minimum by the localization of UpGs in codon junction positions in higher percentage than in codon positions in the coding sequence. The fact that in the 59 UTR where the UpG over-representation can only be coupled with the CpG suppression suggests a CpG.UpG transition, (3) there is an initial UpA suppression also leading to a compensatory increase in UpG by transition of UpA.UpG (A.G). This is masked to a degree by the CpG suppression mechanism. Furthermore, UpG containing codons are synonymous with UpA containing codons. Overall, this highly sophisticated process ensures that effective suppression of CpG/UpA can take place without altering a functional balance in the amino acids encoded by the over-or-under-represented dinucleotide-containing codons. By this way it seems possible that the HRV genomic sequence can change in a ''sub-protein'' level. This is justified by the fact that the protein sequence of the virus which is mainly determined by viral (structural and non-structural proteins) and host (receptor utilization such as ICAM1 and LDL-R) structural requirements needs to be conserved amongst different strains thus no dramatic changes in the amino acid sequence can take place. However, the dinucleotide frequencies can change without affecting the encoded polyprotein. Possible reasons that could lead to CpG/UpA suppression are discussed below.
Karlin et al suggested that a potential reason for CpG suppression was the enhanced free energy for CpGs in doublestranded RNA [20]. Rima et al successfully argued that the substitutions of CpG by CpA/UpG or UpA by UpG are unlikely to affect the overall stability of any complementary or doublestranded intermediate, thus this mechanism cannot be applied in RNA viruses [15]. This is also evidenced in our results. The 59UTR contains the internal ribosome entry site complex which generates the main replication signal of HRV. The IRES contains six stem-loop subdomains and the UA-rich polypyrimidine tract. If Karlin's suggestion applied for Human rhinovirus then we would observe higher suppression of CpGs and UpAs in the 59UTR than in the rest of the genome since the IRES is a region rich in RNA secondary structures. However, we observe the least CpG suppression and no UpA suppression than in the rest of the genome. Furthermore, in our RNA secondary structure alignments we did not observe any CpG under-representation in the specific structures (data not shown). Additionally, it is known that viral RNA genomes that form complex and extensive secondary structures through internal base-pairing are tightly evolutionary constrained since based substitutions in any of the multiple sites that interact to form the structure require matching substitutions elsewhere such that the stem-loops are conserved [21]. On the other hand base-pair mismatches in RNA secondary structures can provide flexibility needed for conformational changes to take place. In vertebrate DNA genomes, methylated cytosines are prone to mutate through spontaneous deamination, generating the TpG with a mismatch pair T/G. This mismatch will in turn cause a mutation in the opposite strand if replication occurs without repair, leading to the appearance of the dinucleotide CpA as well.  UpA dinucleotide suppression could be a mosaic of numerous reasons: (i) UpAs are avoided in genomic sequences since they participate in the trinucleotides that encode stop codons. By reducing UpAs inside the genome, the possibility of generating deleterious for the protein sequence stop codons is further minimized. (ii) Beutler and colleagues suggested that a potential reason for UpA suppression is the susceptibility of UpA to RNase activity. They argued that the suppression of TpA in DNA is caused by the instability of UpA in RNA as the suppression was greater in exons than in introns and non-transcribed regions [22]. This is also evidenced in our results were there is a 0.20-0.30 difference in R UpA between the non-coding and the coding sequence of HRV.
It is known that CpG when unmethylated in a DNA sequence can induce a strong immunostimulatory response on mammalian immune cells [23]. This is triggered by the intracellular pattern recognition receptor (PRR) Toll-like 9 (TLR9) which recognizes CpG-unmethylated DNA and triggers several immune responses [24]. Since the vertebrate immune system relies on non-methylated CpG recognition as a sign of infection and the observed CpG under-representation is present only in vertebrate viruses, it is reasonable to suggest that a similar mechanism with TLR9 could be used in RNA viruses [25]. Greenbaum and colleagues observed an inverse correlation of CpG depletion and C+G content in vertebrate infecting viruses with no such bias found in viruses with high C+G content [26]. Lobo and colleagues noticed a very similar CpG depletion tendency in vertebrate genes as well, suggesting that the genomes of RNA vertebrate viruses are selected to mimic some features of host mRNA to avoid immune system detection with a still unknown anti-viral mechanism [25]. Furthermore, it has been shown that ssRNA with unmethylated CpG motifs can stimulate CD14 + CD11c + monocytes to produce IL-12. These CpG oligoribonucleotides can also stimulate PBMCs to activate NF-kB and p38 MAPK. The activation of cells is not mediated by any known dsRNA, ssRNA or dsRNA receptor, but is abrogated if the 59 position of C becomes methylated, similar to that of CpG DNA [27]. Furthermore, in a recent publication by Jimenez-Baranda, it was shown that CpG motifs in a UA-rich context quantitatively control pDC activation in Influenza virus infections [28]. The above suggest that CpG suppression in the HRV genome can be a counter-defense mechanism to escape RNA CpG motif recognition by the immune system.
Recently, Forsbach and colleagues identified single stranded RNA sequences containing specific sequence motifs that preferentially activate human TLR8-or TLR7/8-mediated signaling.  Supporting Information  Figure S1 Pairwise distance tree based on R CpG values constructed using the FastME algorithm.
(TIF) Figure S2 Pairwise distance tree based on R UpA values values constructed using the FastME algorithm. (TIF)