Similarity between mutation spectra in hypermutated genomes of rubella virus and in SARS-CoV-2 genomes accumulated during the COVID-19 pandemic

Genomes of tens of thousands of SARS-CoV2 isolates have been sequenced across the world and the total number of changes (predominantly single base substitutions) in these isolates exceeds ten thousand. We compared the mutational spectrum in the new SARS-CoV-2 mutation dataset with the previously published mutation spectrum in hypermutated genomes of rubella - another positive single stranded (ss) RNA virus. Each of the rubella isolates arose by accumulation of hundreds of mutations during propagation in a single subject, while SARS-CoV-2 mutation spectrum represents a collection events in multiple virus isolates from individuals across the world. We found a clear similarity between the spectra of single base substitutions in rubella and in SARS-CoV-2, with C to U as well as A to G and U to C being the most prominent in plus strand genomic RNA of each virus. Of those, U to C changes universally showed preference for loops versus stems in predicted RNA secondary structure. Similarly, to what was previously reported for rubella, C to U changes showed enrichment in the uCn motif, which suggested a subclass of APOBEC cytidine deaminase being a source of these substitutions. We also found enrichment of several other trinucleotide-centered mutation motifs only in SARS-CoV-2 - likely indicative of a mutation process characteristic to this virus. Altogether, the results of this analysis suggest that the mutation mechanisms that lead to hypermutation of the rubella vaccine virus in a rare pathological condition may also operate in the background of the SARS-CoV-2 viruses currently propagating in the human population.


32
ADARs (ADAR1 and ADAR2) are double-strand (ds) RNA-specific enzymes converting adenine 33 to inosine (A to I). Since inosine pairs with cytosine, this will result in A to G changes after the 34 next round of replication. The preference of ADARs for certain deamination motifs -reflecting a 35 combination of immediate nucleotide context and the anticipated dsRNA formed by folding -was 36 assessed for in vitro editing of several RNA substrates. Based on these data, software was 37 developed aimed to assign predictive ADAR deamination scores to any A position in a given 38 RNA molecule [8]. The ADAR editing sites that were deduced in RNAs of cultured stimulated 39 immune cells [9] agreed with the preferences defined in the in vitro study. It remains to be 40 established whether these preferences would hold for a wide variety of RNA substrates in 41 conditions of controlled in vivo expression of either ADAR1 or ADAR2.

43
Unlike ADARs, the structure of APOBEC enzymes allow deamination only in single-strand (ss)

44
RNA or in ssDNA. At least two APOBECs, APOBEC1 and APOBEC3A are capable of 45 deaminating cytosine to uracil in RNA, however an RNA editing capacity of other APOBECs 46 cannot be excluded [10][11][12]. Cytosine deamination in RNA creates the normal RNA base -47 uracil, which can be then accurately copied in subsequent rounds of RNA replication. DNA substitutions in each SARS-CoV-2 and rubella MAFs, dividing a base substitution count by the 150 number of the mutated base in the reference sequence. We then assessed similarity of base 151 substitution densities distributions between rubella and each of SARS-CoV-2 filtered MAF using 152 non-parametric Spearman correlation with the null hypothesis that, there is no positive 153 correlation between spectra in rubella and SARS-CoV-2.

155
Statistical evaluation of mutagenesis in trinucleotide-centered mutation motifs

156
Calculating enrichment and statistical evaluation of mutagenesis in a small number of 157 trinucleotide-centered mutation motifs identified from mechanistic knowledge turned productive 158 in our prior assessments of mutagenesis associated with established mechanisms and known 159 preference to certain trinucleotide motifs [14,19,26,27]. In this study we extended statistical 160 evaluation to all 192 possible trinucleotide centered motifs.

161
Trinucleotide and single-nucleotide frequencies in the genomic background were calculated 162 using two alternative methods:

163
(i) context-based -counts in the 41 nt windows centered around each mutation location;

164
(ii) reference-based -counts in the whole reference genome. The output for each analysis (.out) in dot-bracket notation was input into BBEdit

204
(https://www.barebones.com/products/bbedit/) and all characters in both sequence and notation 205 rows were made space delimited. Each of these rows were pasted into Excel and turned into 206 space delimited cells. Sequence and notation were separately copied and pasted using the 207 "Transform" function into a new Excel spreadsheet. A column with the nucleotide position was 208 added and the file saved as a tab delimited text file *RNAfold.txt. For each resulting file the first 209 column was the nucleotide position, the second column is the nucleotide, and the third column 210 was the annotation of that nucleotide in dot-bracket notation. The *RNAfold.txt files were used to 211 add a stem-loop annotation column "RNAfold" to all MAF files using the vlookup function in

212
Excel and saved as a tab delimited text file.

214
For searching for motifs and trinucleotides, *RNAfold.txt files were used to create a searchable 215 text files as follows. Columns two and three of each file were copied into a two new text files. On 216 command line, the two columns in each file were merged using 217 awk '{$(NF-1)=$(NF-1)""$NF;$NF=""}1' OFS="\t"

218
The output file from this was opened in BBEdit, the line breaks were removed, resulting in a file

246
Design of the analysis

247
The overarching hypothesis of this study was that some of the processes generating RNA 248 mutation load in population of SARS-CoV-2 genomes are similar (but not necessarily identical)

249
to the processes that generated changes in genomic RNAs of hypermutated rubella viruses.

250
For that purpose, we obtained the viral genome FASTA files and processed them to obtain have originated from the already mutated virus spreading the same mutation(s) into genomes of 277 multiple (up to thousands) downstream isolates (see column "times_refPos_mutated_inMAF" in 278 S1A Table). Therefore, we also annotated each mutation in a sample by the number of different 279 samples in which such a mutation was found. Since each genome of an individual isolate 280 contained only few mutations and many of these mutations were identical in multiple isolates,

281
we built our analysis to evaluate the overall spectrum of non-redundant mutation events that  Table) Table) would be the least affected by functional selection and thus, most accurately 298 represent the impact of unconstrained mutational processes operating on the viral genome.

299
While this group is smaller, it still contains a sufficient number of changes (4,740 mutation 300 events) for detecting trends in the mutational patterns. The mutation spectrum of 7,416

301
NoDupsFunc events (S2B Table) was also analyzed. Each of the three SARS-CoV-2 mutation 14 302 spectra was compared to the combined mutation spectrum (993 base substitutions) from six 303 independent isolates originated from the hypermutated rubella-vaccine virus [19] (S3A Table).

304
Unlike many SARS-CoV-2 isolates, where individual mutated event could be carried from one 305 isolate to another, each rubella isolate contained mutations that had occurred independently 306 from the vaccine virus in each subject. Thus, the total of the mutational events in six rubella

307
isolates was, at least in part, representative of the mutation spectrum. However, mutation 308 spectra in each rubella isolate may represent an unknown level of selection. Indeed, a number 309 of mutations was observed in more than one rubella isolate (see column

310
"times_refPos_mutated_inMAF" in S3A Table). Some level of selection was also indicated by 311 analysis of synonymous and nonsynonymous substitutions in each codon [19]. Therefore, we 312 made separate comparisons of the rubella mutation spectra with each of the three SARS-CoV-2 313 non-redundant MAFs (Fig 1): the non-duplicated mutation events (NoDups), and its two subsets

314
-the non-duplicated mutation events with potential of functional significance (NoDupsFunc) and 315 the non-duplicated non-functional mutation events (NoDupsNonFunc).  Table).

366
The only apparent discrepancy between the two viruses was in a high density of the G to U 367 changes in the plus-strand of SARS-CoV-2, while they were nearly absent in rubella. We note 368 that the density of G to U changes in minus-strand (reported as the complementary C to A 369 changes in plus-strand in Fig 2) was similar to other low abundant changes in both viruses. A 370 possible origin of the increased G to U changes in SARS-CoV-2 genomes will be detailed in  Tables and Methods). We then compared mutation counts in loop vs stem for each type of 384 base substitutions (Fig 3 and S5 Table).   407 outstanding feature of the G to U changes showed up in SARS-CoV-2, but not in rubella (see 408 above).

410
Mutational motif preferences in SARS-CoV-2 and rubella genomes suggest APOBEC 411 cytidine deaminases as a source of C to U base substitutions in the plus RNA strand.

412
Since base substitution spectra in SARS-CoV-2 and rubella are correlated, it is likely that they 413 also shared common mechanisms which generated these changes. It is well established that

434
Bars represent minimum estimates of mutation load that can be assigned to motif-specific 435 mutagenic mechanism (MutLoad) as described in Methods. Statistical evaluation of  Table. 445 446 The only revealed similarity between statistically-significant enriched motifs in rubella and in 447 SARS-CoV-2 was for the uCn to uUn changes, consistent with the tCn to tTn ssDNA 448 mutagenesis specificity of a subgroup of APOBEC cytidine deaminases. However, even within 449 the APOBEC-like group of motifs there was a difference between strong enrichment with uCa to 450 uUa motif in rubella and the lack of statistically significant preference for this motif in SARS-

451
CoV-2. There were also three groups of motifs significantly enriched in SARS-CoV-2, but not in 452 rubella (see Tab 1 and Discussion for possible mechanistic assignment of these motifs).

454
We also assessed the potential loop vs stem preference for trinucleotide motifs containing C to 455 U and G to U single base substitutions that showed overall loop vs stem preference. None of   for loops versus stems in the RNA secondary structure (Fig 3 and S5 Table). This phenomenon 506 is similar to the preference of ssDNA and mRNA editing in loops by the APOBEC3A cytidine 507 deaminase [35,38]. Agnostic analysis of enrichments in all 192 possible trinucleotide mutation 508 motifs highlighted statistically significant excess of uCa to uUa motif in rubella, however these 23 509 changes were not prevalent in SARS-CoV-2. Mutations in plus-strands of both viruses showed 510 statistically significant enrichments with uCg to uUg and uCu to uUu motifs (Fig 4A,B). These 511 motifs belong to a group uCn to uUn (tCn to tTn in DNA) which is characteristic of several 512 APOBEC cytidine deaminases ([14] and references therein). We note that although these 513 signatures were enriched in the non-functional mutations (NoDupsNonFunc), they did not pass 514 the 0.05 FDR threshold in filtered datasets that included mutations with potential functional 515 effects (NoDupsFunc). These differences in the mutation signatures between SARS-Cov-2 and 516 rubella may be due to different APOBEC family members performing editing or due to the 517 confounding presence of other sources of C to U mutagenesis, such as spontaneous cytosine 518 deamination that frequently occurs in ssDNA [39] or oxidative mutagenesis capable of 519 generating C to T mutations in ssDNA in vivo [40,41]. In support of the role of oxidative 520 damage in SARS-CoV-2 genomes, is the increased prevalence of G to U substitutions which is 521 consistent with the oxidation of guanines in the RNA plus-strand (Fig 2)

530
There were two more groups of trinucleotide mutation motifs involving C to U (and 531 complementary G to A) substitutions in plus RNA strand specifically enriched for SARS-CoV-2 532 (Fig 4A,B). The aCn to aUn (reverse complement nGu to nAu) group of motifs may represent a 533 preference previously unknown for APOBECs in RNA or just a mutagenic mechanism yet to be 534 defined. The cGn to cAn group of motifs seen in the plus-strand may be in fact due to mutations 24 535 of the reverse complement motif nCg to nUg in the minus-strand. nCg to nTg (CpG to TpG) 536 germline and somatic mutagenesis is universally present in DNA of species with 5-537 methylcytosine and is generated by systems specialized to mutagenesis in methylated CpG shows with high statistical confidence that nCg to nUg (CpG to UpG) mutagenesis in the minus 543 strand is enriched (Fig 4B) supporting the role of nCg-(CpG)-specific cytosine deamination in 544 minus RNA strand in SARS-CoV-2 genomic mutagenesis.

546
In summary, comparison of base substitution spectra and signatures between hypermutated 547 rubella isolates and the SARS-CoV-2 multi-genome dataset demonstrates both similarities and 548 differences in the mutational processes active upon the two plus-strand RNA viruses. It is 549 important to understand the mechanisms that contribute to mutagenesis of viral genomes, since 550 hypermutation of even inactivated rubella vaccine virus was shown to generate reactivated viral 551 particles [19]. We demonstrate here that the APOBEC-specific uCa to uUa changes that are 552 highly enriched in hypermutated rubella, are much less prevalent in SARS-CoV-2. We propose 553 that assessment of uCa to uUa signature in viral genomes can provide insights into the potential 554 hypermutation risk of SARS-CoV-2. Moreover, understanding the genomic mutational patterns 555 is important for predicting virus evolution. Our study has highlighted several distinct features of 556 SARS-CoV-2 mutational spectrum that, after validation with independent dataset(s) can be used 557 to build predictive models for this and related SARS viruses.

560
We thank Dr. Natalya Degtyareva and Mr. Adam Burkholder for critical reading of the 561 manuscript and Mr. Michael Doubintchik for help in the initial evaluation of the on-line databases 562 of SARS-CoV-2 genomes. We gratefully acknowledge the authors, as well as the originating 563 laboratories for submitting the sequences to GISAID's EpiCoV™ Database on which this 564 research is based; complete list of contributors is provided in S9 Table. 565 26 566 567