Extensive C->U transition biases in the genomes of a wide range of mammalian RNA viruses; potential associations with transcriptional mutations, damage- or host-mediated editing of viral RNA

The rapid evolution of RNA viruses has been long considered to result from a combination of high copying error frequencies during RNA replication, short generation times and the consequent extensive fixation of neutral or adaptive changes over short periods. While both the identities and sites of mutations are typically modelled as being random, recent investigations of sequence diversity of SARS coronavirus 2 (SARS-CoV-2) have identified a preponderance of C->U transitions, proposed to be driven by an APOBEC-like RNA editing process. The current study investigated whether this phenomenon could be observed in datasets of other RNA viruses. Using a 5% divergence filter to infer directionality, 18 from 36 datasets of aligned coding region sequences from a diverse range of mammalian RNA viruses (including Picornaviridae, Flaviviridae, Matonaviridae, Caliciviridae and Coronaviridae) showed a >2-fold base composition normalised excess of C->U transitions compared to U->C (range 2.1x–7.5x), with a consistently observed favoured 5’ U upstream context. The presence of genome scale RNA secondary structure (GORS) was the only other genomic or structural parameter significantly associated with C->U/U->C transition asymmetries by multivariable analysis (ANOVA), potentially reflecting RNA structure dependence of sites targeted for C->U mutations. Using the association index metric, C->U changes were specifically over-represented at phylogenetically uninformative sites, potentially paralleling extensive homoplasy of this transition reported in SARS-CoV-2. Although mechanisms remain to be functionally characterised, excess C->U substitutions accounted for 11–14% of standing sequence variability of structured viruses and may therefore represent a potent driver of their sequence diversification and longer-term evolution.


Introduction
The evolution of viruses is typically conceptualised as a combination of adaptive sequence change in response to a range of selection pressures in the environment and a process of random diversification in which neutral or near neutral nucleotide substitutions become fixed in virus populations [1][2][3]. An extensive literature documents adaptive (or Darwinian) evolution in response to antiviral treatments and escape from host immune responses in the form of T cell and antibody driven epitope escape mutation [4][5][6][7]. Viruses may furthermore display a series of changes to encoded viral proteins and even genome rearrangements in the process of jumping hosts and adapting to new internal environments [8,9].
A further source of mutations in viruses arises from effects of several innate antiviral effector mechanisms in vertebrate cells that operate through viral genome editing. Of these, the best characterised are the interferon-inducible isoform of adenosine deaminase acting on RNA type 1 (ADAR1) [10] that targets RNA viruses during replication, and members of the apolipoprotein B mRNA-editing enzyme, catalytic polypeptide-like (APOBEC) family [11]. Although hitherto generally considered as an antiviral pathway primarily active against retroviruses and retroelements, there has been considerable discussion of whether the overrepresentation of C->U transitions observed in genomic sequences of SARS coronavirus 2 (SARS-CoV-2), rubella virus (RUBV) and potentially other mammalian RNA viruses might originate from a similar process for RNA editing by one or more APOBEC-related deaminases [12][13][14][15][16][17][18].
The current study investigated this possibility through construction and analysis of sequences changes in alignments of mammalian RNA virus sequences from a wide range of families. Since APOBEC-mediated RNA editing has been proposed to occur in the context of RNA structure elements, virus datasets were additionally scanned for sequence-dependent internal RNA base-pairing [19,20]. Virus genome structural and compositional factors favouring APOBEC-like C->U editing in these datasets were analysed and the potential contribution of these driven sequence changes to their longer-term evolutionary trajectories estimated.

Sequence datasets
The primary resource for the study were aligned sequences from a wide range of mammalian RNA viruses derived from several vertebrate virus families (Tables 1 and S1). These were These showed levels of intra-population sequence divergence ranging from 5% -19% (Table 1). These included viruses with and without large scale RNA secondary structure in the genomes [19][20][21], with mean folding energy differences (MFEDs) ranging from -0.1% -17.8% (Table 1). The analysis was supplemented by inclusion of previously described datasets of SARS-CoV-2 and other coronaviruses [21].

Detection of mutational asymmetries in RNA virus datasets
The previous analysis of the directionality of sequence changes in SARS-CoV-2 and the detection of an excess number of C->U changes was simplified by the minimal sequence diversity of the assembled post-pandemic sequences [22]. SARS-CoV-2 sequence diversity primarily comprised isolated base changes relative to a consensus sequence shared by all but one or a few sequences in the alignment. However, for the datasets analysed in the current study, population diversity was substantially greater making inference of directionality increasingly arbitrary as site variability increased. To overcome this problem, analyses of relative mutation frequencies were restricted to sites showing low degrees of heterogeneity so that the directionality of mutations can be inferred. Sequence datasets was analysed using the program Sequence Change in the SSE package, which records the occurrences and sites of sequence changes from a majority rule alignment consensus sequence. Collectively, there was a significantly greater number of C->U changes compared to its reverse (U->C) and to other transitions in the RNA virus datasets at sites showing <5% heterogeneity although the degree of over-representation was highly variable between viruses (Fig 1A). Transition frequencies were consistently higher than other mutations (S1 Fig). Transversion frequencies between pairs of bases were comparable, with the exception of a substantially higher frequency of G->U mutations compared to its reverse in coronaviruses but not in other RNA viruses (S1 Fig) To more formally quantify the degree of over-representation of C->U transitions in each virus dataset, the ratio of each transition to its reverse (Columns 13; 14, S2 Table) was normalised for base composition as described in a previous analysis for SARS-CoV-2 [12] (Columns 7, 8, Table 1). Formally, in the absence of mutational pressure (null expectation), the expected ratio of frequencies of a mutation X->Y to Y->X would be proportional to their native base frequencies and [f(X->Y) / f(Y->X)] / [f(X) / f(Y)] should approximate to 1. Applying this to the observational data in the virus datasets, the high frequencies of C->U changes were reflected in strong C->U / U->C transition asymmetries (Fig 1B;  This initial analysis of mutational asymmetry was conducted using a 5% heterogeneity threshold to allow directionality of sequences to be inferred. The relationship between site heterogeneity and transition asymmetry was determined for two example RNA virus datasets showing excess C->U changes at the 5% threshold (Fig 2; HCV-mean transition asymmetry 3.0; FMDV: 2.2; Table 1). At highly variable sites, frequencies of G->A / A->G and C->U / U->C were comparable and close to the null expectation. However, for both viruses, increasing asymmetry was observed at sites with reduced heterogeneity, ruling out the possibility that the observed asymmetries were the result of unrecognised compositional biases in the virus datasets, and validating the use of the 5% threshold to analyse directionality of sequence change.
The estimation of relative transition frequencies was collectively based upon all sequences within each virus alignments. To investigate the degree of heterogeneity in C->U changes between sequences, numbers of this transition were computed individually and compared with those of the reverse mutation (Fig 3) in two of the larger datasets showing high and low normalised transition asymmetries (HCV-1a -3.1 and EV-A71-1.4 respectively; Table 1). For EV-A71, there were means of 3.1 C->U and 2.5 U->C substitutions per sequence at the 5% heterogeneity level, and a distribution of values that approximated to a Poisson distribution, although marginally over-dispersed (Kolmogorov-Smirnov single sample test statistic = 6.8; p < 0.001). Similarly for HCV-1a, both U->C and C->U transition frequencies followed marginally skewed Poisson distributions (test statistics 1.5 [p = 0.03] and 2.5 [p < 0.001] respectively), but with a higher mean number of C->U transitions per sequence (12.1 / sequence) than the reverse (2.8 / sequence). The sequence datasets of HCV and EV-A71 sequences therefore showed no evidence for the occurrence of individual hypermutated sequences as described previously for HIV-1 [23][24][25]; the driver of elevated C->U frequencies in HCV sequences appeared to operate at a similar intensity on all sequences analysed.
To investigate which genome features of RNA genomes were predictive of the C->U/U->C transition asymmetry, a range of compositional attributes (G+C content, representation of CpG and UpA dinucleotides, asymmetry in the number of G bases relative to C, and of A relative to U), mean folding energies (MFEs) of consecutive 300 base fragments and differences of this value from sequence order randomised controls (MFEDs) were computed. The association of each with C->U/U->C and G->A/A->G transition asymmetry values was analysed by multivariate analysis ( Table 2). MFED value was the only variable significantly associated with C->U / U->C asymmetry (p = 0.013); this parameter represents the degree of sequence orderdependent RNA folding in the coding region(s) of the virus ( Table 2). The association between C->U / U->C asymmetry and MFED was further apparent and readily visualised by simple linear regression (Fig 4; p = 0.002). There was no association with MFE, representing the minimum free energy on RNA folding, a property primarily influenced by the G+C content of the sequence, which also showed no association with the C->U / U->C transition asymmetry. There were similarly no associations with extents of CpG or UpA dinucleotide suppression in RNA virus genomes, or base imbalances (U/A, C/G). As expected from the minimal differences from the null expectation, no association between G->A/A->G transition asymmetry values with any compositional or structural sequence attribute was detected.

Sequence contexts for C->U transition asymmetry
C->U transitions in SARS-CoV-2 genomes were influenced by the immediate 5' and 3' base contexts of the mutated site [12]. Other RNA virus datasets showing C->U / U->C transition asymmetries were analysed similarly ( Fig 5). Once normalised to base composition (S3 Table), relative mutation frequencies varied over a substantial range but with a 5'U being consistently associated with greater C->U/U-C transitional asymmetry. Sites with a 5'U showed a mean over-representation for all viruses of 2.0 compared to 0.78, 0.56 and 0.84 in 5' A, C and G contexts respectively (p values of 0.0004, 5 x 10 −11 and 8 x 10 −5 respectively by Mann Whitney U test). Effects of 3' context were far more variable between viruses, but with evidence for favoured C->U transitions upstream of A in FMDV and to lesser extents in other RNA viruses.

Homoplasy of C->U mutations
Previous analyses demonstrated that a proportion of C->U mutations in SARS-CoV-2 failed to become genetically fixed in a population [26]. The distribution of many mutations violated the overall phylogeny of the dataset, appearing convergently and transiently in different parts of the tree [12]. The possibility that excess C->U changes observed in current datasets might be similarly homoplastic was investigated more systematically through measurement of the concordance between nucleotide identities at variable sites in virus alignments and their overall tree topology. This enables segregating substitutions that reflect evolutionary relationships to be distinguished from phylogenetically uninformative or incongruent sites that may arise from host-driven mutational processes. As the method does not require directionality to be inferred, the approach is not restricted to relatively invariant sites (<5% heterogeneity) examined in previous analyses.
The program, Homoplasy Scan in the SSE package was developed to sequentially analyse each variable site in a virus sequence alignment; this recorded the degree of segregation of each base in a global tree constructed from 1200 base genome fragment that incorporates the interrogated bases (Fig 6). Association index (AI) values based on sequences grouped by their component bases were typically low in DENV3, HPeV-3 and EV-A71, indicating that most substitutions co-segregated with overall phylogeny. AI distributions were typically narrower (more informative) for more variable sites (high Shannon entropy values) despite the potential effect of site saturation and convergence on site with only 4 possible character states. The  pattern of base segregation was remarkably different in alignments of HCV (results from genotype 1a are shown but other genotypes were closely similar and HPgV-1 (Fig 7). In these, only a small fraction of sites showed a base distribution that segregated with (and defined) the overall phylogeny of the alignment; the majority, irrespective of their underlying diversity, poorly matched overall phylogeny with AI values approaching the mean of the null distribution (AI = 1.0). Distributions for FMDV, MNV and JEV were intermediate between these extremes. Broadly, the observed differences arose from different genetic structuring of sample populations; phylogenetic trees from alignments with predominantly informative sites showed a marked degree of structured internal branching (S2 Fig), while variants of HCV and HPgV-1 showed deep branching and little ordering of sequences beyond their initial diversification. The differences in tree structure between datasets were quantified using a lineage through time plot generated by the LTT program in the Phylocom package [27] (S3 Fig). This depicts the substantial lineage diversification of HCV, HPgV-1 and to lesser extents of MNV and FMDV-A at the base of the tree, and a contrasting late diversification of JEV, DENV1, HPeV-3 and EV-A71.
Irrespective of the actual distributions of AI values (and associated differences in tree topologies of the different RNA virus datasets), categorisation of sites based on AI value ranges allowed an investigation of whether C->U transition frequencies were specifically over-represented at phylogenetically uninformative sites as would be expected if these mutation were subject to homoplasy (Fig 8). Amongst representative viruses showing C->U/U->C transition asymmetry (HCV-1a, HPgV-1, MNV and FMDV-O), there was a significant excess of C->U transitions compared to other transitions at uninformative sites (AI values > 0.2; Fig 8; upper left graph), in contrast to ratios observed in the unbiased dataset (EV-A71, JEV, DENV1 and HPeV-3; upper right graph). Excess C->U transitions were also more extensively distributed at sites of medium / low Shannon entropy values (lower left graph). Contrastingly, the representation of other transitions (U->C, G->A and A->G) showed no association with either AI score or site variability. The relationship between excess C->U changes at sites with high AI values and low variability is consistent with the hypothesis for extensive homoplasy specifically of this transition.

Contribution of C->U transitions to viral diversity
The extent to which the observed excess of potentially homoplastic and transient C->U mutations in certain viral datasets contributed to overall viral sequence diversity was calculated. The number of sites in a virus alignment showing a majority C->U change subtracted by those showing majority U->C changes (excess C->U) were expressed as proportion of the number of variable sites in the alignment, with totals sub-divided into sites showing different ranges of AI values (Fig 9). There was a substantial over-representation of sites showing excess C->U mutations in HCV, HPgV and other virus datasets showing the C->U / U->C asymmetry, particularly at sites with high AI values. From these, it appears that the asymmetric mutational process contributes a substantial proportion of their standing viral diversity in all four of the viruses analysed (11% -14% of variable sites).

Fig 5. Influence of 5' and 3' bases on C->U mutation frequencies.
Influence of the identities of the immediate 5' base and 3' bases on C->U mutation frequencies in a range of RNA viruses showing C->U/U->C transition asymmetry. Normalised C->U/C->U transition asymmetries in each 5' and 3' context were adjusted to account for 5' or 3' base frequencies. The y-axis shows the over-or under-representation of the asymmetry values in each context relative to the value for all contexts; the null expectation (no effect of 5' or 3' base) was 1.0 (red dotted line  6. Using the association index to determine informativeness of individual sites. Schematic summary of the steps used to investigate site informativeness. Individual alignment positions are sequentially analysed for their concordance to a global phylogeny. Base identity is used to assign groups which are then use for calculation of an association value through group segregation in a neighbour-joining tree of the alignment where non-bootstrap supported branches are collapsed. The AI index is its ratio to the mean association value of 10 sequence label order randomised controls (representing the null expectation of no association). Finally, the Shannon entropy score, representing site heterogeneity is recorded.

Mutational asymmetries in RNA viruses
This study provides evidence for an excess of C->U changes over the reverse (U->C) and complementary (G->A) transitions in a diverse range of RNA viruses, including HCV (all genotypes), FMDV, MNV, HPgV-1 and rubella virus. Fold excesses ranging from 1.9x -3.6x (Table 1 and Fig 3A) approached those reported in previous analyses of SARS-CoV-2 (7.5x), SARS-CoV (4.2x), MERS-CoV (3.0x) and were comparable to those reported in seasonal coronaviruses (1.8x -3.0x) [12]. There has been little systematic investigation of the phenomenon of this type of mutational asymmetry in RNA viruses, although it has been long recognised that an RNA editing enzyme, ADAR1 may play a role in prominent excess of U->C and A->C mutations in measles virus genomes associated with sub-acute sclerosing panencephalitis (SSPE) [28], while APOBECs have been shown to create hypermutated proviral DNA copies of HIV-1 and other retroviral genomes [23,29,30]. A recent study identified a noticeable excess of C->U changes in the genomes of rubella virus persisting in patients after immunisation [14,31] a finding that was replicated using the analytical methods of the current study on that dataset, where analysis of a larger dataset of circulating rubella virus strains (n = 73) recorded an even higher C->U / U->C transition asymmetry (3.6x; Fig 4). Although the authors linked the asymmetry to RUBV persistence [14,31], perhaps by analogy with measles virus and SSPE, the finding of substantial C->U asymmetry in circulating RUBV strains in the current study indicates this phenomenon occurs as part of a natural transmission chain of rubella virus infections.

Mutational mechanisms
The underlying mechanism(s) for the observed elevated frequencies of C->U changes in certain RNA viruses are functionally uncharacterised and conceivably may originate through several mechanisms. In the discussion, we will briefly review the evidence for or against transcriptional, RNA damage-associated and cellular RNA editing mechanisms that may create the excess C->U mutations observed in RNA viruses, taking into account their prominent (+) strand asymmetry, 5' base context effect and their variable distributions in viruses with structured and unstructured genomes. RNA transcription. A major source of mutations in viruses and other organisms are transcription errors by cellular RNA or DNA polymerases that are incorporated during replication. Different types of polymerases may show varied mutational profiles with separate propensities to mis-incorporate particular transitions or transversions. Very limited data exists on mutation frequencies associated with viral RNA dependent RNA polymerases (RdRps). However, biochemical assays of misincorporation kinetics on RNA templates by the poliovirus RdRp [32,33] or retroviral reverse transcriptase [34] did not identify C->U misincorporation to be favoured over other mutations.
Mutations introduced by RNA transcription would be expected to be symmetric in RNA viruses as their genomes derive from an equal number of positive and negative strand copyings by the same RdRp operating in the same cellular compartment. A tendency to misincorporate Table 1). Histograms were sub-divided based on their Shannon entropy range (see key for colour coding; minimally variable sites (Shannon entropy < 0.3) were excluded). Insets show the corresponding tree topologies for each virus analysed, for large datasets (HCV, EV-A71; DENV1), trees based on randomly selected representative sequences are shown for clarity. Phylogenetic trees drawn to scale are provided in S2 Fig. https://doi.org/10.1371/journal.ppat.1009596.g007 a U instead of a C would therefore be reflected in a parallel number of G->A mutations where it occurred on the minus strand. However, as observed previously for SARS-CoV-2 [12][13][14]26], the frequency of G->A mutations was substantially lower than C->U changes, and generally comparable to those of the other transitions, A->G and U->C (Figs 1 and 2).

Fig 7. Distribution of association index values in virus datasets. Frequency distributions of AI values at variable sites in alignments of representative viruses showing unbiased (left) or elevated (right) C->U/U->C transition asymmetries and with comparable overall sequence divergence (MPD values listed in
To counter this, it could be argued that natural selection operates differently on the minusand plus-RNA strands where the limited number of minus-strand templates produces a much larger number of genomic RNAs. Therefore there may be more stringent selection against mutations occurring in the negative-strand as these are invariably copied into the plus-strand, whereas plus-strands are relatively dispensable and mutations occurring during their synthesis will not overly affect the overall replication process. Indeed, full-length transcripts with plusstrand synthesis errors could be readily packaged into virions assembled from viral proteins synthesised from independently transcribed mRNAs. An alternative possibility is that mutations occurring during transcription of the minus strand are more rather than less likely to be fixed because minus strands are in a substantial minority of viral RNA sequences in the infected cell. Mutations arising from specific RdRp misincorporation biases occurring early in infection during synthesis of the minus-strand might be more likely to become fixed in the infected cell through a founder effect than mutations occurring from the 10s or 100s of plus-strands synthesised from that template. In contrast to the selection-based asymmetry hypothesis above, the observed excess of C->U mutations in the genomic (plus-)strand must therefore have originated from elevated frequencies of G->A mutations in the minus-strand rather than C->U.
Neither hypothesis for a transcriptional origin of the C->U strand asymmetry recorded in the current study is supported by the observation of its variable presence in different RNA virus groups and association with genomic RNA secondary structure. While it is possible that RdRps from different RNA viruses may vary in their spectra of misincorporation frequencies, it is notable that C->U asymmetries cut across The association of C->U asymmetry with the presence of genomic RNA secondary structure (Fig  4) therefore does not mechanistically support a role of differentially selected RdRp mutational errors or founder effects as the explanation for the observations. The RNA secondary structure association is particularly problematic for a transcriptional origin of the C->U asymmetry since its effects are only manifest on the single-stranded genomic viral RNAs, and not within the primarily double-stranded RNA replication complex where the proposed RNA transcriptional mutational errors occur. Secondly, the observed preference for a 5'U at C->U mutated sites (Fig 5) is not supported by what is known about potential conditioning effects of transcriptional contexts influencing RdRp error rates [32,33].
Reactive oxidative species (ROS). It has been recently proposed that substitutions occurring in the SARS-CoV-2 genome, notably the G->U transversion may originate from mutational damage to viral RNA during periods of oxidative stress [15,16]. ROSs are widely associated with DNA mutations, particularly those that target guanine [35]. The oxoguanine bases formed are copied as adenine, creating G->T transversions. As ROSs may also target single-stranded RNA, the production of ROSs on viral infections [36] may potentially account for its previously described overrepresentation of G->U tranversions in SARS-CoV-2 sequences [15][16][17]. We found that higher frequencies of G->U compared to reverse (U->G) or complement (C->A) mutations were more widely found in coronaviruses but broadly absent in the other RNA viruses analysed in the study (S1 Fig). The presence of ROS is not clearly associated with damage to other bases, such as modifications of cytidines that would template the observed excess of C->U transitions.
RNA editing. An alternative source of mutations arises from the documented effects of several innate antiviral effector mechanisms in vertebrate cells that operate through a process of genome editing; these may potentially introduce mutations into RNA virus genomes during replication. Of these, the best characterised are the interferon-inducible isoform of adenosine deaminase acting on RNA type 1 (ADAR1) [10] that targets RNA viruses during replication, and members of the apolipoprotein B mRNA-editing enzyme, catalytic polypeptide-like (APOBEC) family [11]. The substrate for ADAR1 is double-stranded (ds) RNA formed as a replicative intermediate; upon binding, it catalyses the deamination of adenine bases to inosine which are subsequently copied as a G by viral RNA polymerases, creating A->G base mutations (or U->C on the opposite strand). Although ADAR1 activity has been widely proposed as a mutational mechanisms in SARS-CoV-2 [13,15,16] and other RNA viruses (reviewed in [37]), the direction of changes induced (U->C, A->G) is opposite to what is predominantly observed in virus datasets analysed in the current study.
Members of the APOBEC family typically target single stranded DNA templates for mutagenesis of cytidine to thymidine during reverse transcription of retroviruses and hepatitis B virus (HBV) and in the genomes of small DNA viruses such as papillomaviruses [11,38,39].
For example, APOBEC3G editing of the single stranded DNA generated by reverse transcription of retroviruses generates a damaged proviral copy unable to direct further retroviral replication [29,30,40]. There has been extensive discussion over whether the observed overrepresentation of C->U transitions in SARS-CoV-2 is driven by one or more of the APOBEC proteins [12][13][14][15][16][17][18]. The current study shows a similar over-representation of C->U changes in the subset of RNA viruses that possessed structured genomes, including other coronaviruses; these findings would therefore predict a strikingly wider breadth of the antiviral activity of this pathway than is currently recognised.
If APOBEC was responsible for the over-representation of C->U changes in RNA virus genomes, there should be an increased C->U/U->C mutational asymmetry downstream of a U, as this is the target preference of most mammalian APOBEC paralogues. Consistently with findings of cellular mRNA editing by A3A [41,42] and proposed for rubella virus [31], C->U transitions with a 5'U context showed a 1.5-3 fold increased representation compared to other upstream bases (Fig 5A), with particularly high values in SARS-CoV-2 as previously described [12][13][14]. However, there was additionally an apparent favoured downstream context (3'A > 3'U > 3'C/G) among the majority of RNA viruses analysed (Fig 5B), similar to what has been described for SARS-CoV-2 mutational preferences but not generally recorded for retrovirus or other APOBEC target sequences [43].
The wide range of G+C contents of the viruses analysed in the current study (Table 1) necessitated the calculation of normalised context representations to enable meaningful comparison of 5' and 3' base preferences between different RNA viruses. While there any many ways to achieve this, we adopted the simple approach of first calculating normalised C->U/U->C ratio in each 5' or 3' context and then calculating their relative representations taking into account their global mononucleotide frequencies (see Materials and Methods). This yielded relatively consistent over-representations of C->U in a 5'U context that was less apparent in non-normalised data (S3 Table). Other approaches might include normalisation based on dinucleotide frequencies (eg. the relative representation of UpC, ApC, GpC and CpC) to normalise for 5' compositional effects. However, the problem with this approach is determining 3' contexts since the CpG dinucleotide is substantially suppressed in mammalian +strand viruses, often to 15-20% of expected frequencies [44][45][46]; attempts to normalise this typically lead to a substantial compensatory and potentially artefactual over-representation of this context that are unlikely to represent the true editing preferences of APOBEC.
The favoured 5' context of APOBEC-mediated mutations and consequent depletion of TpC/UpC dinucleotides has been exploited as a means to identify editing "footprints" in viral genomes in a previous bioinformatic analysis [47]. Depletion of TpC was detected in a wide range of small DNA viruses, particularly polyomaviruses, parvovirus B19V, herpesviruses. Amongst RNA viruses, UpC depletion was only detected in some seasonal coronaviruses that show substantial genome wide enrichment of U and depletion of C previously ascribed to cytidine deamination [48,49]. The relative weakness of the 5'U context effect on C->U transition frequencies in other RNA viruses (1.5-3-fold for most RNA virus datasets; Fig 5) may explain why TpC depletion was not readily detected by this method.
Apart from 5' and 3' base contexts, the only other compositional metric influencing the extent of C->U / U->C transition asymmetry was RNA structure formation (Table 2), a genomic property of a subset of +strand RNA viruses displaying the previously described genomescale ordered RNA structure [19,20,50]. There was a significant association between MFED value and C->U / U->C asymmetry ratio (Table 2 and Fig 4), but no association with the G->A / A->G normalised asymmetry values.
This association with structured RNA virus genomes potentially recapitulates the previous noted restriction of editing of human mRNA sequences by A3A to sites in defined stem-loop contexts [41]. In an analysis of human transcriptome RNA sequences, over half of the identified edited sites were flanked by short palindromic sequences, typically locating the edited base in the terminal unpaired region of a stem-loop. Supporting this structural association with APOBEC editing, an analysis of the contexts of SARS-CoV-2 and rubella C->U edited sites identified preferential C->U mutations in terminal loop compared to stem sequences [14], or in predicted unpaired compared to unpaired regions [21]. The observed association of excess C->U changes in HCV and other structured viruses is therefore consistent with the action of one or more APOBEC isoforms directly editing a wider range of RNA virus genomes.
While these associations are intriguing, the evidence is circumstantial without systematic functional studies to support a role of APOBEC-mediated RNA editing in restriction of RNA virus replication. Indeed, it was shown that antiviral effects of APOBEC occurred in the absence of any evidence for RNA editing of coronavirus HCoV-NL63 genomic sequences [51]. It would be also unclear where and how editing of RNA virus genomes might occur in the infected cell. The strand-specificity and its presumed occurrence on +strand RNA sequences suggests that editing takes place on naked genomic RNA, perhaps during virus entry or during packaging, rather than co-transcriptionally as documented for retroviruses. Whether and in what cytoplasmic location genomic RNA may be exposed to APOBEC is unclear; coronaviruses show marked excesses of C->U changes, but their genomes, like those of-strand RNA viruses, are typically associated with ribonucleoprotein throughout the replication cycle and must substantially limit their exposure to host pattern recognition receptors.
Inferences on mechanism based on bioinformatics analyses therefore must at this stage be indirect; they should additionally acknowledge that the currently recognised repertoire of RNA editing pathways and other mutational mechanisms in mammalian and other vertebrate cells is almost certainly incomplete. Viral RNA editing may indeed originate from an entirely different mechanism outside the current ADAR1 and APOBEC paradigm. As recently suggested, overlapping but nevertheless distinct C->U mutational contexts in SARS-CoV-2 and rubella genomes points towards the operation of more than one mutational mechanism [14]; the heterogeneity in the effects of 3' base contexts (Fig 5) may be further evidence for the existence of more than one pathway. Establishment of effective methods to induce and quantify RNA virus editing in vitro and a targeted gene deletion approach to functionally test editing abilities of individual APOBEC proteins on RNA templates would be important steps in such investigations.

Evolutionary consequences of C->U hypermutation
Irrespective of the underlying mutational mechanisms, the analyses performed in the study provided evidence that a substantial proportion of population variability in HCV and other structured viruses could be attributed to a marked over-representation of C->U transitions (15% -20% of total sequence changes; Fig 9). The observed substitutions correlated poorly with overall phylogeny of viruses in the alignment based on association index calculations (Fig 7A), consistent with previous analyses that documented extensive homoplasy of C->U changes in SARS-CoV-2 [12,26]. The limitations of this type of analysis should however be acknowledged. While the association index calculation represent an established and robust, phylogeny-based method for evaluating group membership with phylogeny [52,53] and performs well compared to other metrics of genetic partitioning [54], analyses in the current study were based upon groupings derived from base identities. These are necessarily limited to between two and four character states defining groups at each alignment position, and this restriction may create mutational saturation effects at highly variable sites and lead to false detection of homoplasy. However, even sites with relatively low Shannon entropy values (0.3-0.5), that would not create any consistent saturation effect showed an excess of C->U changes in HCV and HPgV-1 alignments (Fig 8).
The distribution of C->U changes (and other transitions) at an individual sequence level approximated to a Poisson distribution (Fig 3), albeit with some over-dispersion in the EV-A71 sequence dataset likely arising from its phylogenetic structuring (S3 Fig) compared to HCV. There was no evidence for the occurrence of individual hypermutated sequences in a background population of non-mutated sequences, as previously observed in HIV-1 and hepatitis B virus (HBV) [23][24][25]55,56]. This likely originates from the differences in replication cycles of RNA viruses and retro-transposing viruses-the RNA virus sequences in the current study were derived from consensus sequences and therefore would have to be fully replication competent and evolutionarily fit to be represented in clinical samples. In contrast, hypermutated sequences of HIV-1 are typically derived from integrated proviruses derived from APOBEC-edited reverse transcription. Their survival in memory T cells occurs irrespective of whether they are able to generate infectious virus or not; similarly for HBV [55,56]. Such sequences can therefore accumulate extensive and bizarre mutational damage as they are effectively evolutionary dead-ends. In marked contrast, the excess of C->U changes observed in structured RNA virus genomes may therefore represent the maximum tolerable mutational load compatible with viability and onward transmission.
While tangential to the primary focus of the study, the shape of phylogenetic trees constructed from different RNA viruses differed substantially (Figs 8 and S2) and potentially contributed to observations of homoplasy. These differences in branching density have been previously quantified using the temporal clustering (TC) metric [57]. The bush-like, over-dispersed topology of HCV showed a lower TC value that derived from a neutral evolutionary simulation, a difference attributed to potential rate variation in different lineages of HCV or population subdivision which promotes the co-existence of lineages. The latter model may potentially be equated with distinct patterns of endemic and epidemic partitioning in the different trees associated respectively with persistent and non-persistent virus infections (S2 Fig). However, there is the further possibility that tree shape and the associated occurrence of phylogenetically uninformative sites in structured virus genomes may also be influenced by extensive RNA editing and homoplastic cycles of mutation and reversion as observed in SARS-CoV-2 [12,26]. The development of evolutionary simulation methods where RNA editing is incorporated and parameterised may lead to valuable insights into the nature and trajectory of short-term diversification. It may serve to better characterise the evident differences between RNA viruses in the nature of their divergent evolution. The observation that excess C->U changes accounted for 11%-14% of variable sites of HCV, HPgV-1, FMDV and MNV at any one time (Fig 9) indicates the powerful role of C->U hypermutation in the generation of RNA virus diversity.

Sequence datasets
Alignments of sequences of HCV genotypes 1a, 1b, 2a and 3a, SARS-CoV-2 and other coronaviruses were derived from previous studies [21,50]. Further alignments of other RNA viruses were constructed for the study from GenBank and the VIPR database [58] using all available or randomly selected sequence subsets as described in S1 Table. Coding region sequences were aligned using MUSCLE [59] as implemented in the SSE package version 1.4 (http://www. virus-evolution.org/Downloads/Software/) [60]. Analysis of viruses encoding single polyproteins (ie. picornaviruses, flaviviruses) was based on coding regions only. Regions spanning the start of the first open reading frame (ORF) to the end of the last ORF were used for analysis of viruses with polycistronic genes (coronaviruses, togaviruses, pneumoviruses, filoviruses, hepeviruses and caliciviruses). Alignments are available from the author on request.

Sequence analysis
Calculation of pairwise distances and nucleotide composition was performed using the SSE package version 1.4. Sequence changes were compiled using the program Sequence Changes with a variability threshold typically set at 5% heterogeneity, where heterogeneity was calculated as the cumulative frequency of all non-consensus bases. Multiple thresholds were used to analyse mutation representation at sites showing different levels of variability (Figs 1 and 2).
Normalised ratios (nC-U) of C->U and U->C transitions (and comparably for G->A and A->G) were calculated as:

RNA secondary structure prediction
Computation of MFE and MFED values was carried out using the program Folding energy scan in the SSE package using sequential 300 base sequence fragments incrementing by 30 bases between fragments. The program call the RNAFold.exe program in the RNAFold package, version 2.4.2 [61] with default parameters.

Association index calculations
AI values were calculated using the algorithm originally described by Wang et al. [52] and Cochrane et al. [53] and implemented in the SSE package. An explanation of the method and its underlying algorithm is provided in a Supplementary Methods section (Suppl. Data). The assignment of group labels based on nucleotide identity at sequential sites in an alignment was automated in the program extension, homoplasy scan in the SSE package version 1.5.

Phylogenetic analysis
Neighbour joining trees were constructed from aligned sequences using the program MEGA7 [62]. Lineage against time plot were derived from data generated in the Phylocom package [27].

Statistical analysis
All statistical calculations and histogram constructions used SPSS version 26.
Supporting information S1