Mutational signature dynamics indicate SARS-CoV-2’s evolutionary capacity is driven by host antiviral molecules

The COVID-19 pandemic has been characterised by sequential variant-specific waves shaped by viral, individual human and population factors. SARS-CoV-2 variants are defined by their unique combinations of mutations and there has been a clear adaptation to more efficient human infection since the emergence of this new human coronavirus in late 2019. Here, we use machine learning models to identify shared signatures, i.e., common underlying mutational processes and link these to the subset of mutations that define the variants of concern (VOCs). First, we examined the global SARS-CoV-2 genomes and associated metadata to determine how viral properties and public health measures have influenced the magnitude of waves, as measured by the number of infection cases, in different geographic locations using regression models. This analysis showed that, as expected, both public health measures and virus properties were associated with the waves of regional SARS-CoV-2 reported infection numbers and this impact varies geographically. We attribute this to intrinsic differences such as vaccine coverage, testing and sequencing capacity and the effectiveness of government stringency. To assess underlying evolutionary change, we used non-negative matrix factorisation and observed three distinct mutational signatures, unique in their substitution patterns and exposures from the SARS-CoV-2 genomes. Signatures 1, 2 and 3 were biased to C→T, T→C/A→G and G→T point mutations. We hypothesise assignments of these mutational signatures to the host antiviral molecules APOBEC, ADAR and ROS respectively. We observe a shift amidst the pandemic in relative mutational signature activity from predominantly Signature 1 changes to an increasingly high proportion of changes consistent with Signature 2. This could represent changes in how the virus and the host immune response interact and indicates how SARS-CoV-2 may continue to generate variation in the future. Linkage of the detected mutational signatures to the VOC-defining amino acids substitutions indicates the majority of SARS-CoV-2’s evolutionary capacity is likely to be associated with the action of host antiviral molecules rather than virus replication errors.

The regression of daily cases to government stringency index needs to be adjusted for the latter metric to be informative.The continent-level 14-day moving average is an appropriate means of dealing with the noisiness of reported case data, where there are frequent gaps in reporting that appear to be zero daily cases.The inappropriate step in this analysis is comparing the continent-level 14-day moving average of new cases to the average OxCGRT index of the component countries within the continent.The daily case rate will naturally be biased toward the countries with the highest case numbers (who contribute the greatest proportion of cases to the average) while the average OxCGRT index will have equal contributions from each country.In a scenario where only a handful of countries have low stringencies indices and these same countries are responsible for an outsized number of cases, any positive effect of government stringency will be dampened (as the OxCGRT index will effectively be artificially inflated).I highly recommend repeating this analysis but weighing a country's contribution to the continent-level OxCGRT based on its contribution to the continent-level 14-day average daily cases within the same time window.From my assumptions of how the analysis scripts have been written, I don't anticipate this being too onerous of a calculation.However, if this is challenging other approaches could be considered.I recommend this approach rather than weighing by population, as the former will still partially control for differences in population (India will be weighed more heavily than Nepal, for instance) while the latter will have the issue of more populous countries not necessarily having infrastructure for widespread community testing.
Response: Thank you for the excellent suggestion to weigh the stringency index by each country's contribution to the number of cases.We agree that this addresses countries with an outsized number of cases to avoid artificial inflation.In response, we have refined our analysis, now weighting each country's influence on the continent-level OxCGRT index according to its proportionate impact on the continent's 14-day rolling average daily case count (Figure 2B), see additions to page 6.This does not need to be adjusted by the authors, but for the sake of documentation part of the "careful analysis" referenced above would include empirical testing of the assumption "Sequences reported in GISAID were assumed to be representative of the diversity of infections for that continent/country."Performing these analyses is not key to the interpretability of the later results section and for that reason I don't consider it necessary.

Response:
We appreciate this insight regarding the representativeness assumption of sequences from GISAID.Although we haven't validated this assumption in this work, we acknowledge its significance and consider it an important direction for subsequent research or analyses.
The authors make a very astute observation on page 18 that the ROS-like signature appears to have context preference: "The G→T signature itself seems to display a context preference of TpG and ApG nucleotides".This observation can (and should) be tested quantitatively.The same approach as in PMID: 34061905 can be used to calculate normalized context preferences.As one will need to compare to a reference sequence, I would approach this by calculating ratios for each tree-based reference for the pangolin lineages (using the observed substitutions for those references) and then calculating a median ratio from all of the values.This way, you'll be able to robustly determine whether or not any contexts are over-or underrepresented.Dinucleotide context is probably sufficient, but if it is computationally straighforward then trinucleotide context would also be useful and interesting.The reason I view this as key to the manuscript is that the observation of a context preference for ROS would either (a) indicate context and structural preference for ROS, which would be a novel finding or (b) bolster the argument that the G->T phenotype may be due to another as of yet defined nucleic acid editing mechanism.Either of those findings would be highly significant to the overall story and for that reason deserve robust statistical analysis.
Response: Thank you for the excellent suggestion.We have investigated the mutational signatures using normalised counts for each substitution since our current signatures did not account for context biases from the reference.For this normalisation we divided our initial counts by the number of available triplets in the reference to allow for context-agnostic signatures.With regards to the G->T signature we find that the TpG and TpA context preference do appear to be driven more by their abundance in the genome, rather than any specific targeting (see additions to page 18) and revised figure Suppl.Figure S5: In pages 17 and 18, the discussion of potential molecular mechanisms underlying the (lack of) ROS-like signatures in Omicron is excellent.I am very interested in what hypotheses and explanations the authors are able to generate for the preponderance of the ADAR-like signature in Alpha.The data in figure 5 is particularly striking and I think this result deserves more attention in the discussion that it is currently given.
Response: Thanks for highlighting this, we agree this is an interesting result and have expanded our discussion around it on page 19.
On a smaller (but important) note, additional clarification is needed on page 20 in the methods to ensure reproducibility.Please state on which date the daily case data was downloaded from Our World in Data's GitHub repository.If prior to February 20, 2023, the OWID data was actually just piped from the Johns Hopkins University COVID-19 dashboard -there are several publications that can be cited about this work (Dong et al 2020 andDong et al 2022).If after that date, the case data comes via the Reviewer 2

Major Comments
The three signatures that the authors have extracted are novel and interesting, as are the dynamics of these signatures through time.However, the assignment of these signatures to drivers based simply on their dominant substitution type is highly problematic.For example, there are a large number of processes that can cause C-to-T mutations (for example see PMID: 30982602).APOBECs are one such process but it is not sufficient to identify C-to-T mutations and jump straight to APOBECs as the cause.Likewise, there are other factors that can cause G-to-T mutations apart from ROS. Signature 2 consists of A-to-G and T-to-C mutations, which are symmetric so the authors assume ADAR; why would this signature not be consistent with a change in polymerase error profile or another mutagenic factor that causes both substitution types?The assignment of signatures to causes is a major issue with the manuscript currently as it drives a lot of the narrative, including the title.But there is not any support provided for these drivers over any other driver of these mutation types.It is reasonable to speculate that APOBECs, ADAR and ROS may contribute to the mutational burden of SARS-CoV-2 but there is not evidence to show that they are major drivers and the conclusions of the manuscript in this regard are therefore not supported.I'd suggest that the authors refer to the signatures throughout as, for example, 'signature 1' rather than 'APOBEC-like'.

Response:
We take the reviewer's point and have taken a number of actions to address these comments: 1. Throughout the results, signature labels were changed from "APOBEC-Like" to "Signature 1" so that an individual mutational mechanism is not immediately attributed to a proposed extracted signature.The link to mechanism is now only in our Discussion.2. In the discussion, we have added further explanation as to why we consider each mutational process primarily associated with our extracted signatures.We also add an explanation on why we think there are clearly further mutational processes at work (such as the mentioned polymerase error) and why we still are confident in our signature extractions despite this (see additions to pages 17 and 18). 3.While it is very difficult to absolutely attribute a mutational signature to its real process, there is a growing body of evidence to support each of these three mechanisms being important contributors to SARS-CoV-2's ongoing evolution.The repeatable extraction of these signature patterns with contexts and substitution types that are similar to those of the aforementioned mutational processes gives us confidence in these assignments.(Suppl.Figure S9, page 40) 4. Finally, to reflect the hypothetical nature of the signature's link to different host antiviral molecules, we have moderated the title to "Is SARS-CoV-2's evolutionary capacity mostly driven by host antiviral molecules?".
In Figures 4, 5 and 6, the authors assess the proportion of mutations associated with their extracted signatures.According to these plots and the author-assigned drivers of the signatures, we are led to believe that the viral polymerase has not made a single unrepaired replication error during the pandemic as all mutations are suggested to be driven by APOBEC, ADAR and ROS.While the proofreading exonuclease encoded by SARS-CoV-2 can recognise replication errors, it clearly won't repair all mistakes and polymerase errors will contribute to mutational spectra.This again highlights why assigning signatures to APOBEC, ADAR and ROS without further support is a major issue.

Response:
We have added more discussion surrounding replication error and its effects on the extracted signatures.We acknowledge that the mutational signatures we see do not explain all of the observed mutations.Many changes in the dataset are not included in the signatures, and we cannot explain all of the mutations in the dataset, since the exposures do not equate to the full number of mutations at each epidemic week.Our expanded Figure 6 now quantifies the error for the unobserved mutation categories to show the mutations we are missing due to being not summarised by either of the 3 signatures.We are of the opinion this expansion now clearly shows the effect of the observed mutational processes while acknowledging that processes like polymerase error clearly occur but we do not have the power to detect them.

(Pages 11-14)
There is a need for further validation of the patterns of mutational signatures through time.5c-d.

Response:
We have created plots for each variant subset to show the substitution patterns as requested in Suppl.Figure S7.As the signature decreases suggest, we do see a reduction in C->T and a rather high quantity of A->G and T->C changes in the Alpha lineage that support the signature exposure plots.This ratio between C->T changes and A->G/T->C changes in Alpha is much greater than observed in other lineages, which would explain the increased exposure of Signature 2 relative to Signature 1 in Alpha.The same plots for continents can be found in the revised Suppl.Figure S8.
Section 2.5 examining the drivers of particular amino acid changes is highly problematic.Assigning mutations to signatures based on their mutation type is not possible.For example, C-to-T mutations are present in each of the three extracted signatures.The authors cannot therefore identify a C-to-T mutation and state that it was driven by signature 1, as there is a reasonable probability it could have been driven by signature 2 or 3, or by a currently unidentified signature.It's also unclear why the authors would include G-to-A mutations in signature 1 here when G-to-A is a major contributor to signatures 2 and 3.The interpretations of this data are then highly problematic -for example, the authors state "Clearly, this APOBEC-like process was frequently active prior to the Alpha VOCs emergence"; this is based on the branch leading to Alpha having a proportion of C-to-T mutations.To assign these to signature 1 purely because they are C-to-T mutations is not possible.As outlined above, the drivers of these signatures are not supported, so this statement combines an unreliable assignment of mutations to signatures with an unreliable assignment of the driver of this signature and attempts to make a major conclusion about the processes contributing to variant generation.This is not reliable.Another example is "The high density of ADAR RBD mutations in a variant that has emerged as the ADAR signature has increased may suggest that the ADAR mutational process has driven the emergence of the Omicron variant."The authors are making major conclusions here that are not supported by reliable data.These are examples but section 2.5 is largely formed of the same analysis and conclusions.It may be possible to generate a statistical model that can take a set of contextual mutations and provide some estimate of the processes that have contributed to their generation based on mutation types and their surrounding contexts but failing this, section 2.5 does not provide reliable insights.
Response: Thanks for this comment.We agree that using simple substitution as a way to attribute mutations to their process is naive, so instead we have used a maximum likelihood approach to assign mutations to signatures in a more agnostic fashion.A mutation is now attributed to a signature rather than a specific process.The mutations are assigned using the maximum likelihood of a mutation and its context being assigned to one of the three signatures.
Using the trinucleotide context C[C->T]G as an example, the likelihood function is P(C[C->T]G | Signature), which correspond to the probability bars in the extracted signatures.This ensures that the mutation allocation is agnostic to pre-existing expectations and keeps the analysis purely data-driven (see additions to page 11-13).
The methods section is light on detail, but my reading of the methods is that the authors calculated mutational spectra by comparing individual sequences from GISAID against an "ancestral lineage reference sequence".It is not clear from the methods whether this sequence is the ancestor of the same lineage the sequence is assigned to or the parental lineage as the authors use an example of sequences from lineage B.1 being compared against the reference for lineage B. Did the authors use one reference per lineage?Or were ancestral references only included for some lineages?Importantly, how were mutations counted?For example, if we take a cluster of related sequences within lineage B.1 that share 3 mutations that were acquired since the lineage B.1 ancestor, are these 3 shared mutations only counted once?Or are they counted in each of the sequences in the cluster?Each mutation should be counted on each phylogenetic branch on which the mutation occurred.Only counting each mutation once will undercount convergent mutations while counting each mutation in each sequence in which it occurs will result in highly inaccurate spectra.The methods here need to be clarified and, if necessary, updated.

Response:
We have added clarity to the lineage reference section in particular, and have added a supplementary figure S3 showing examples of which sequences use which lineage reference.We have also added a section clarifying the use of unique mutations to extract mutational signatures and discuss its impact on the counting of convergent mutations.We believe this now adequately explains any ambiguity that previously existed with this section.We use one reference per lineage and mutations are counted once within a lineage.We don't cluster sequences within lineages (this was included as a reference to how another paper attempted to solve the same problem).While undercounting convergent mutations in a lineage is possible, convergence will still be captured between lineages, and overcounting inherited mutations will cause, as mentioned, highly inaccurate signature extraction.(See page 23,24,25 and 39 for these changes.) How were the surrounding contexts of each mutation identified?Were these identified from the same "ancestral lineage reference" or from the Wuhan-Hu-1 sequence?
Response: Mutational context was identified using the lineage reference so that acquired mutations would be included in the context as well, which would not be the case if the Wuhan-1 reference was used.
The authors don't appear to have rescaled the calculated spectra by triplet availability in the genome, which is a necessary step to obtain reliable mutational spectra.For example, if the genome contains 1000 CCT contexts but only 100 GCG contexts, a C-to-T mutation would be far more likely to occur in a CCT context than a GCG context.Therefore to obtain a mutational spectrum that represents the actual mutational processes, it is necessary to divide the mutation counts by the number of occurrences of the starting triplet, as has been carried out in previous publications on SARS-CoV-2 spectra (for example PMID 37185044).

Response:
We have now generated the mutational signatures using normalised counts as requested.This now gives 2 sets of signatures, one set showing the patterns of mutations given the reference sequence genomic biases, and one set agnostic of triplet biases.Both are now discussed when referring to mutational process attribution since the normalised signatures give a better indication of process context specificity, while unnormalised show the real probability of change given the viral sequence.Both are of interest clearly, since one can help to attribute the real chance of a mutation editing a context, whereas the other can help identify signature aetiology (see additions to Page 17,18,19,25 and 36).

I'm not sure what the novel takeaways are from section 2.1 as all of the statements in this section (number of lineages, dominance of certain lineages, waves of different variants, infection rates, VOCs increasing infection rates) have been extensively covered in previous publications and public dashboards/resources.
Response: We acknowledge that the topics discussed have been previously covered in other literature and resources.The intention behind including section 2.1 was to provide a consolidated overview and context for readers.It was also aimed at setting the stage for the subsequent sections of our paper where we delve into more specific and potentially novel findings.This helps us provide a narrative for known driving factors as well as other non-variant factors Section 2.2 should include sensitivity analyses.There are significant challenges associated with the analyses in section 2.2 as testing and case reporting differ hugely by country.This may be possible to account for at the level of individual countries where there is good knowledge of the testing and reporting practices but at the continent level this appears to be a major challenge and it's not clear that the data is reliable enough to enable effective modelling.Additionally, the authors only examine four predictor variables and exclude other factors that are very likely to have an effect, for example previous infection burden.These factors may contribute to the inability of the models to completely fit the data in the insets in Figure 1, for example the fit in Oceania appears poor.A sensitivity analysis to assess the impact of poor estimates of cases in some regions and the impact of missing variables would be needed to trust the output from this modelling (or citations to previous papers that have done this if such papers are available).Furthermore, the authors state that "fitted mean response values with predictor values as the input resembled the rise and fall of infection cases".There is not currently a statistical assessment to support this and it is not clear that this is the case from examining the data in Figure 1; a statistical assessment of the model fits should be carried out to support this statement.
Response: Thank you for the thoughtful feedback on Section 2.2 and the concerns raised therein.We have addressed each point raised: 1. Challenges at the continent level: We recognize the disparities in testing and case reporting across countries.To address this, we have refined our continent-level analysis.Specifically, the influence of each country on the continent-level OxCGRT index is now weighted according to its proportionate impact on the continent's 14-day average daily case count.We believe that this is not exhaustive but better captures the unique dynamics at play within each continent and offers a better understanding.
2. Accounting for true burden: While we now employ the Human Development Index (HDI) to adjust reported cases (citing https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9408439/),we acknowledge that this may not fully capture the true infection burden.Several factors underlie the actual burden, and while HDI provides one approach, it is by no means exhaustive.However, it serves as a useful proxy to bridge some of the data gaps.

Inclusion of additional predictor variables:
We agree with this observation regarding the omission of potential influential factors such as previous infection burden.Predictor variables were included based on data completeness over the study period.We have revised our model to integrate previous infection burden which enhances their predictive power.We acknowledge these predictors aren't comprehensive and discuss this limitation 4. Model fit for Oceania: the predictor variables globally in our initial analysis did not appear to have a good fit in Oceania.Reviewing our model (adjusting continent level cases by HDI, including additional predictor variables and weighting the continent-level OxCGRT index by the proportionate impact on daily cases), greatly improved the overall fit.

Uncertainty in predictions:
We include confidence errors in our predictions.We also now report R squared values for every continent and country included in the analysis, which gives an indicator of the goodness of fit.Suppl.Figure S2.country level regression model fitting.
Suppl.Table A3 A2 shows some Pango lineages and some WHO variant names, the "13 Pango lineages" should be updated to reflect this.

Response: This has been addressed
Figure 1 -from the legend, the bars show biweekly proportions of common lineages.Does the whitespace show the proportion of sequences from other lineages?Please specify.
Response: Yes we have updated our text to state this more clearly.
Figure 1 -not all insets can be read, it would be useful to update the figure to make these clearer.

Response:
We have updated our figures, moving the insets into Figure 2b for better legibility Methods section 4.1 -what were the cutoffs for "excessive mutations/deletions"? Please describe.

Response:
The cutoff for filtering out hypermutated sequences was 175 mutations in coding regions or more than 69 different deletions, the cutoffs were manually determined after evaluation of the mutation/deletion distribution and selecting the point where sequence counts were consistently observed in single digits, this resulted in 1,852 sequences being filtered out.Response: We've revised the sentence to better convey our intention.The updated sentence now reads: "We collected continuous dependent variables reported on a daily basis."We believe this modification offers a straightforward understanding of our data collection method.
Methods section 4.2, paragraph 1, lines 7-8 -missing vaccination data was handled as no vaccines were administered.Are these data points at times where it is reasonable to believe that no vaccines were administered (i.e.before vaccine rollouts in the respective countries)?This would be useful to state.

Response:
In response to this query, the instances where vaccination data was missing primarily corresponded to periods before vaccine rollouts in the respective countries.Thus, it was reasonable to treat these missing data points as periods where no vaccines were administered.We have updated the text to reflect this.
Methods section 4.2 -some more details on the estimation of virus fitness would be useful.How were the mutations in each lineage defined?How is the sum of fit mutations performed -is each "fit" mutation observed counted as 1 regardless of frequency in the population?
Response: Mutations were defined against the Wuhan-Hu-1 sequence.Each fit mutation in a sequence was counted as 1 (frequency depended on the number of sequences with that mutation), and this was normalized to the number of sequences generated in a given location.
We have now updated our text to give more detail.
Methods section 4.7 -what was the NMF run on?Was this run on the set of spectra from all epidemic weeks?Please expand this section to make clear.A supplementary figure showing the reasoning for three signatures being chosen as optimum would be useful.

Response:
The NMF was run on 100 bootstrapped epidemic week datasets for each set of signatures, i.e., 2-10.The set of individual samples was bootstrapped and then converted into the epidemic week pseudo-sample datasets which ensured that each set of pseudosamples were created from a different set of sequences drawn from the original data.We have added some text to clarify this further and as requested have created a graphical figure showing the methods process down to the evaluation of the signature N selection in Suppl.Figure S4.Response: Thank you for pointing this out.This has now been addressed Figure 3 -not all mutation types are represented in the signatures (for example A-to-C to T-to-G).These mutation types do occur within SARS-CoV-2 mutations (for example see PMIDs 37185044 and 37039557).This should be stated and included in the discussion around the activities of signatures through time, as it is likely that not all mutations are driven by the extracted signatures.
Response: This is an excellent point, we have added a number of sections to address this and stress that while the signatures cover much of the observed mutations, other mutational processes could still be in play.
The use of mutation nomenclature is not always consistent -for example C->T or CT.Please update.
Response: This has now been updated.

Figure 2 -
Figure 2 -while it's clear which panel is which, panels A and B aren't labelled in the figure currently.
If this were true, we'd expect to see a very high proportion of A-to-G and T-to-C mutationswithin this time period, as these are the dominant mutation types in signature 2. The authors include FigureA3showing counts of unique substitutions per week split by mutation type (although this doesn't seem to be referenced in the text).There doesn't seem to be a large increase in the proportion of mutations that are A-to-G or T-to-C during early 2021, nor a decrease in C-to-T which would be expected if signature 1 is reducing within Alpha.While only a fraction of global sequences at this time are Alpha, I'd still expect to see some impact in this plot.To support the signature patterns, the authors should show similar plots to FigureA3but broken down by variant and continent, to complement the signature plots in Figure For example, the authors suggest that signature 2 causes the majority of mutations in Alpha in early 2021.