The interplay of SARS-CoV-2 evolution and constraints imposed by the structure and functionality of its proteins

The unprecedented pace of the sequencing of the SARS-CoV-2 virus genomes provides us with unique information about the genetic changes in a single pathogen during ongoing pandemic. By the analysis of close to 200,000 genomes we show that the patterns of the SARS-CoV-2 virus mutations along its genome are closely correlated with the structural and functional features of the encoded proteins. Requirements of foldability of proteins’ 3D structures and the conservation of their key functional regions, such as protein-protein interaction interfaces, are the dominant factors driving evolutionary selection in protein-coding genes. At the same time, avoidance of the host immunity leads to the abundance of mutations in other regions, resulting in high variability of the missense mutation rate along the genome. “Unexplained” peaks and valleys in the mutation rate provide hints on function for yet uncharacterized genomic regions and specific protein structural and functional features they code for. Some of these observations have immediate practical implications for the selection of target regions for PCR-based COVID-19 tests and for evaluating the risk of mutations in epitopes targeted by specific antibodies and vaccine design strategies.


Introduction
We live in the middle of the COVID-19 pandemic caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). First identified and characterized in early 2020, the virus is mutating and evolving into separate clades with distinct geographical and time distribution (https://www.gisaid.org) [1]. This is typical for RNA viruses and cannot be automatically interpreted as a sign that the disease it is causing is changing [2], but epidemiological [3] and biochemical [4] data indicate that a viral strain with higher infectivity appeared already in March 2020. Recent pandemic flare ups in the United Kingdom [5] and South Africa [6] suggest that even newer super-transmissible strains are emerging. We can track these events in almost real time thanks to a massive effort in sequencing the SARS-CoV-2 genome variants from all over the world, with most of this information available through resources such as GISAID [1] (https://www.gisaid.org). The genomic data on SARS-CoV-2 provides information on the phylodynamics of the COVID-19 pandemic and is also studied for signals of positive selection in the search of changes that the virus may undergo while adapting to its new host (https://observablehq.com/@spond/revised-sars-cov-2-analytics-page).
There is no doubt that the evolution of the SARS-CoV-2 virus, typical for RNA viruses [7], is mostly driven by a combination of genetic drift and negative (purifying) selection that removes non-viable viruses [8]. Genetic drift and negative or purifying selection typically receive less attention than positive selection since it is considered as being less informative. In this manuscript, we explore the possibility that integrating information about patterns of genetic drift and negative genomic selection with that on protein three-dimensional structures would allow us to gain novel insights about the structurally and functionally important regions of SARS-CoV-2 proteins.
Negative or positive selection is typically measured by a rate of synonymous to non-synonymous mutations and thus can be calculated for individual positions or regions of the genome. Such calculations are carried out for individual variants in the SARS-CoV-2 genome (https:// observablehq.com/@spond/revised-sars-cov-2-analytics-page). Here we want to focus only on larger trends, using averages of mutation rates over entire proteins, individual domains or some specific functional regions in them. Such observations were made from the beginning of structural biology, when Perutz and his team noticed that mutations rarely happen in the protein core [9]. These early observations were later corroborated for different organisms, classes of proteins, and evolutionary timescales [10]. It is, however, not obvious if this trend would hold for a rapidly mutating pathogen tracked in the timescale of one year, where neutral genetic drift is expected to be a dominant factor and how much it could be used, in reverse to the earlier analyses, to learn about protein structure and function from the analysis of the mutation patterns. Assuming that negative selection is a dominant effect, we focus on the under and over mutated regions, interpreting the difference as a signal of the importance of these regions to a broadly defined viral fitness.
The analysis presented here is also enabled by the rapid pace of structural characterization of the SARS-CoV-2 proteome [11] as by the end of December 2020, there was direct or indirect high-quality structural information for over 60% of the total length of SARS-CoV-2 proteins. Our group has recently developed the Coronavirus3D server [12], available at https:// coronavirus3d.org, to integrate information about the three-dimensional structures of SARS-CoV-2 virus proteins from the Protein Data Bank (PDB) [13] resource (http://rcsb.org) with the information on SARS-CoV-2 genomic variations retrieved from GISAID [1]. This integration allows us to track, in almost real-time, the emergence of new trends or patterns in the evolving SARS-CoV-2 genome. The new functionality of variant tracking is now the default first page and the features described in this manuscript are available from the menu on the top of the page as "3D proteome viewer" or directly at https://coronavirus3d.org/#/3dproteomeviewer.
In the first part of the manuscript, we evaluate mutation rate distributions along the genome to gain insights into the types of selection pressure for individual SARS-CoV-2 proteins as well as for their functional domains and sites. In the second part we analyze the mutation pattern of known antibody epitopes and regions used for COVID-19 diagnostic tests, showing that the continuous evolution of the SARS-CoV-2 virus can also affect the medical and public health aspects of the COVID-19 pandemic and that structural information on viral proteins is useful in our efforts to control it.

Results
Coronaviruses have a unique RNA copy-proofing mechanism [14] and, as a result, have a lower mutation rate than other RNA viruses. Despite this, over 70% of the positions (21,124 out of 29,880) along the SARS-CoV-2 genome have been mutated at least once, as can be seen by the analysis of over 192,030 high-coverage genomes sequenced as of Dec 3 rd , 2020 on the GISAID website (https://www.gisaid.org/) (see the Methods section for the details of the protocol used to select these genomes). The distribution of mutations along the SARS-CoV-2 genome has been discussed in many papers [15,16]; here we search for new observations that could be made by mapping this information onto the structures of the proteins encoded by the genome.

Distribution of mutations along SARS-CoV-2 genome and in its proteins
In line with earlier observations [16], the largest proportion of mutations observed in SARS-CoV-2 genomes were missense mutations (61%), followed by synonymous mutations (33%) and a relatively small number of start/stop gains and losses, as well as mutations in untranslated regions (see S1 Fig for more details). When translated to the amino acid sequence, 7811 out of the total of 9926 (79%) amino acids in the SARS-CoV-2 proteome are mutated in at least one genome in the dataset used in this study. We plotted the distribution of missense and synonymous mutations using a moving 100 nt. window along the viral genome ( Fig 1A). A cluster of densely mutated regions near the 3'-terminus of the genome begins at the boundary between Orf1ab (coding for non-structural proteins) and Orf2-Orf10 (coding for structural and accessory proteins). Other minima and maxima of the mutation rate can mostly be mapped to the functional parts of the genome as illustrated in Fig 1A. For instance, the region corresponding to the C-terminal domain of nsp3 (violet line in Fig 1A) was found to be significantly less mutated, likely due to its key role in inducing the formation of double-membrane vesicles [17] and the minima in the spike protein to a RBD and postfusion core regions, both critical for the virus entry into the host cell.
As seen in Fig 1B, the variance of the numbers of missense mutations in the 100 nt. windows along the viral genome is about four times higher (20.96 versus 5.33) than the corresponding variance for synonymous mutations (Levene-test p-value = 5.2E-44) confirming that the number of the synonymous mutations fluctuates much less than that of missense mutations.
At the same time, the rates of missense and synonymous mutations along the length of the genome are marginally correlated (Spearman's rank correlation coefficient Rs = 0.22, p-value = 0.00012) implying that two mechanisms could be coupled-regions under stronger or weaker negative selection on protein level are also slightly more or slightly less frequently mutated overall. A full analysis of this interesting effect is outside the scope of this paper.
Comparing the segment of the genome coding for the non-structural proteins (Orf1ab, corresponding to proteins nsp1-nsp16) to the whole proteome, we see that it is under-mutated in both missense and synonymous mutations (p = 8.01E-75 and 5.72E-6, respectively). We also see that this segment has a lower ratio of missense to synonymous mutations than the segment coding for the structural and accessory proteins (p = 1.85E-12) (see the details of the calculation in the Methods section). This might suggest that negative selection is stronger in the region coding for the essential viral reproduction apparatus, but also that the RNA features of the genome support a lower mutation rate in this region. Therefore, we decided to use the mutation rates in these two regions as two separate background probabilities in the statistical tests applied to individual proteins later in the paper.

Rates of missense mutations along SARS-CoV-2 genome as a measure of evolutionary pressure
Analysis of the rates of missense mutations in the genomes has a long history in the field of cancer genomics, where cancer mutations have been shown to have clearly non-random distribution when mapped on protein sequences and structures [18]. This effect was interpreted as a signal of positive selection and used to identify cancer driver genes and mutations. It is worth noting that this analogy is not perfect, as in cancer the count of mutations (number of samples) can be directly included in the analysis as they are independent evolutionary events. In contrast, mutations observed in viral genomes are not independent events as individual genomes inherit some mutations from their ancestors. Virus counts are also dependent on the sequencing rates in different regions, with orders of magnitude difference between the industrialized and developing countries. Ideally, the mutation counts should include independent recurrence events. However, existing estimates show a very low frequency of recurrent mutations in SARS-CoV-2 genomes [4,19], on the order of 1% of all positions. They also disagree on the positions of such recurrences as they depend on the details of the phylogenetic tree of all the genomes, which at this point is still not definite [20]. Therefore, we decided to use the approximation of counting each mutation once, without taking into account the virus counts (number of observations) nor the level of recurrence on specific positions since recurrence corrected counts for~1% of positions would not significantly influence counts averaged over 100nt window, as used here. With rapidly growing number of available genomes this situation may change, but our in-house and literature [4,19] estimates for the datasets as analyzed here support this assumption.
At the same time, as shown in the previous section, variation in the rate of synonymous mutations along the genome is small as compared to missense mutations. Taking into account these observations, we decided to focus entirely on the missense mutations, as they are manifested at the amino acid level and allow us to directly interpret the mutation rates through their effect on proteins, their three-dimensional structures and potentially their functions.

Some SARS-CoV-2 proteins and domains show significant differences in the rate of mutations
The SARS-CoV-2 genome codes for at least 29 individual proteins, with the product of Orf1ab being further processed into 16 individual non-structural proteins through post-translational processing by the viral proteases 3CLpro and PLpro. The exact count of the proteins coded in the SARS-CoV-2 genome is still disputed, as some of the ORFs code for multiple proteins in alternative reading frames [21]. Many of the SARS-CoV-2 proteins, such as nsp3 or Nucleocapsid phosphoprotein, can be further divided into independently folding regions (domains) with specific functions. In the following analysis, we compared the observed number of missense mutations in a given protein or its domains with their expected number under an appropriate background mutation rate, to identify regions that are significantly over-or undermutated (see Methods). Because domain assignment is not complete for SARS-CoV-2 proteins, we use information on structurally characterized constructs to define boundaries of structural (and functional) domains or regions, in addition to domains identified by in-depth sequence analysis such as Y1 and CoV-Y domains in nsp3 [22]. Regions located between structurally characterized domains, for instance in nsp3, form another group of indirectly defined regions. The complete list of SARS-CoV-2 proteins and experimentally solved structures/ domains within them that are used in the following analysis are listed in S1 and S2 Tables.
While the differences in mutation rates between specific proteins or protein regions could be caused by differences in type of evolutionary pressure between them, they can also be affected by a different "background" mutation rate between different genomic regions. However as discussed earlier the region coding for the non-structural proteins is systematically under-mutated as compared to the region coding for the structural and accessory proteins. Therefore, in this analysis, we used different background frequencies for the different parts of the proteome being analyzed (see Methods). Table 1 presents the results of the significance analysis of the mutation rate for individual proteins (16 non-structural proteins, 4 structural proteins and 6 accessory proteins) and Table 2 presents the results for individual functional

PLOS COMPUTATIONAL BIOLOGY
Structural constraints on SARS-CoV-2 evolution domains as identified from the structural analysis and the literature. The complete results are available in S1 and S2 Tables. As seen from Table 1, many of the SARS-CoV-2 proteins show a statistically significant difference in mutation rate when compared to their corresponding backgrounds, with nine being under-mutated and ten being over-mutated. We see that the majority of under-mutated proteins are non-structural proteins (7 out of 9), which mirrors the trend seen earlier. This is seen despite the fact that individual non-structural proteins were compared to the set of all non-structural proteins as the background, and individual structural and accessory proteins were compared to the set of all structural and accessory proteins, suggesting that these proteins are under strong negative selection. Most of the non-structural proteins play a role in RNA replication/processing and are part of the viral replication and transcription complex (RTC) [23,24]. Their low mutation rate can be explained by the fact that these proteins are crucial for the viral life cycle. One of the exceptions to this trend is nsp1,

PLOS COMPUTATIONAL BIOLOGY
Structural constraints on SARS-CoV-2 evolution with a mutation rate similar to that of orfs2-14. Indeed, its function is somewhat different from the other non-structural proteins, as it interacts with the ribosome to inhibit host protein translation [25,26]. Overall, 6 out of the 10 over-mutated proteins are structural and accessory proteins. Over-mutation generally implies lower negative selection or potentially some positive selection, and again, this can be explained for at least some of these proteins based on their functions, which involve interacting with components of the host cell. For example, Orf8, Orf6 and N protein have been implicated in disrupting the host anti-viral immune response [23,24] so their high mutation levels can contribute to the immune avoidance by the SARS-CoV-2 virus.
In the next step, we looked at individual domains within SARS-CoV-2 proteins. As domains within multidomain proteins often have their independent evolutionary history and identifiable, individual functions, differences in mutation rates between different domains may provide a more detailed picture of their relative importance for the viability of the virus. We have used a similar approach in the eDriver algorithm used to identify the role of individual domains in cancer driver proteins [27] (with all the caveats discussed in the previous section). The expression of multiple constructs from individual proteins allowed researchers to recognize fragments that could fold independently and often can be assigned specific functions. Three-dimensional structures of many of these domains have been determined, so here we use the mapping of the SARS-CoV-2 proteins into the PDB structures/models as a proxy for identification of domain boundaries (Table 2), in addition to those identified through the literature.
We also looked at regions in between the solved structures/models, assuming that these would form important linker regions or domains whose structures remain unsolved (Fig 2 and Table 2). The complete list of domains found in SARS-CoV-2 proteins, and the relative excess or dearth of mutations in them, is provided in S2 Table. We see domains with significantly different mutation frequencies in four proteins: nsp3, nsp4, N (nucleocapsid phosphoprotein) and S (spike/surface glycoprotein) protein. In N protein we see that the structured regions i.e., the RNA binding domain and the dimerization domain (PDB IDs 6m3mA and 6wjiA, respectively), are significantly under-mutated. This effect can probably be explained by the rest of the nucleocapsid protein (including the region between the two domains) being partly disordered, since such regions are typically much more tolerant of mutations (for the detailed discussion of mutation rate in different parts of the N protein-see the section Significantly lower mutation rate in the region of overlapping reading frames). In line with this, we see that the region between the domains is significantly overmutated. In nsp3, three domains are over-mutated (Ubl1, ADRP and a linker domain) and four are under-mutated (SUD, another linker, Y1 and CoV-Y domains). The SUD domain, which is further composed of two macrodomains (Mac2 and Mac3), has been shown to bind to G-quadruplexes. This binding occurs via lysine residues in both macrodomains; however, it was shown through mutational analyses that only the lysine residues in Mac3 are essential for binding [28]. Moreover, this binding appears to be essential for viral replication [29], supporting the low mutation rate of this domain. The Ubl1 and ADRP domains have both been suggested to interfere with the host immune response. However, the connection between their functions and the observed mutation distribution is less clear, particularly since they may perform more than one function [22].

Evidence of protein structure-driven purifying selection in SARS-CoV-2
The proportion of missense mutations in structurally characterized protein residues of SARS-CoV-2 increases with their increasing solvent exposure following a known trend observed in many protein families from different organisms [10]. There is a strong, nearly linear increase in the rate of missense mutations, with synonymous mutations remaining at an approximately constant level, similar to their flat distribution along the genome discussed earlier (Fig 3A). The strong change in missense mutations rates is explained by tightly packed cores presenting strong constraints for amino acid residue choices and many mutations there leading to unfolded protein products. Protein-protein interaction interfaces do not pack as tightly as protein cores, but also have specific amino-acid composition and their mutations may lead to function-affecting changes in protein complex formation. Notably, in cancer, we see the opposite effect, with disproportionately high number of driver mutations found on protein-protein Frequencies of missense (red) and synonymous (blue) mutations for residues buried in the protein core, exposed to the solvent and involved in known protein-protein interfaces. A) The purifying selection decreases with increasing solvent exposure. B) Exposed residues are very over-mutated and buried residues are very under-mutated. For interfaces, the mutation rate falls between that for exposed and buried residues.
https://doi.org/10.1371/journal.pcbi.1009147.g003 interfaces of cancer driver genes. SARS-CoV-2 non-structural proteins are known to form higher order assemblies essential for their function [30], thus, we can expect that in most cases interface residues should be conserved. This hypothesis is confirmed by the results shown in Fig 3B, where the ratio between missense and synonymous mutations for the residues on known protein interaction interfaces falls to a value between those for exposed and buried residues.
In the calculations shown here we only used the information on currently known proteinprotein interfaces in SARS-CoV-2 proteins, based on experimental structures of viral protein complexes. We can expect that "unexplained" conserved patches on the surfaces of SARS-CoV-2 proteins may aid the discovery of some yet unknown interaction interfaces.
Significantly lower mutation rate in the region of overlapping reading frames. Overlapping reading frames are common in viruses, resulting in local protein coding density over 100% [31]. Systematic analyses suggest that combined negative selection on two reading frames results in decreased rate of all mutations as mutations synonymous in one reading frame may be missense (and potentially deleterious) in another reading frame [32]. The N-terminal part of the Nucleocapsid gene of SARS-CoV-2 is translated into two different reading frames resulting in an additional gene coding for a functional protein Orf9b. A similar overlapping reading frame is suggested for the region coding for Orf14. We tested the rate of mutations in the region of the Nucleocapsid protein which is coding for two proteins in two different reading frames and compared the mutation rate in this region to the background rate for the entire gene confirming the expected result (see Fig 4). The largest decrease in the rate of mutations is observed in the region where proteins coded in two reading frames (Orf9b and Nucleocapsid) have well-defined structures. The N-terminal region of the Nucleocapsid gene does not have an experimental structure and is predicted to be structurally disordered and, as

PLOS COMPUTATIONAL BIOLOGY
Structural constraints on SARS-CoV-2 evolution such, is expected to impose less constraints on mutations [33]. Despite the fact that it also overlaps with Orf9b (see Fig 4), the density of mutations there is not decreased.
The protein-coding Orf14 does show only moderate decrease in mutation density, in the region where it overlaps in a different reading frame with the gene coding for Nucleocapsid. This is probably again explained by the fact that it mostly overlaps with the structurally disordered region of the Nucleocapsid protein which does not impose strong constraints on missense mutations.
These observations have important practical implications for the selection of primers and probes for COVID-19 diagnostic tests as mutations in their target genomic regions have detrimental effect on their accuracy. Taking into account constraints on mutation rate imposed by protein structure and function may help in selecting regions which are less likely to accumulate mutations in the future. Unfortunately, in fact, multiple PCR-based diagnostic tests for COVID-19 target the genome region encoding the Nucleocapsid protein (see Fig 4) with some of them mapping to the highly mutated disordered protein regions. We discuss this issue in more detail in a separate section.

Missense mutations in epitopes on the Receptor Binding Domain of the Spike protein
The Spike protein is the main surface antigen of SARS-CoV-2, a preferred target of therapeutic antibodies for COVID-19, and the immunogen used in the currently available vaccines. There are already more than 40 structurally characterized complexes of various types of antibodies with the Spike protein and almost all of them bind to its Receptor Binding Domain (RBD). Therefore, in the following analysis, we only focused on epitopes localized on the RBD.
Substitutions of residues in epitopes are a serious potential problem for both therapeutic antibodies and vaccines. At the same time, many of these epitopes overlap with the part of the RBD surface that binds to human ACE2 -the main entry receptor for SARS-CoV-2. The surfaces mediating interactions between SARS-CoV-2 proteins are under-mutated indicating purifying selection (Fig 3B) and therefore it can be expected that the RBD-ACE2 interface would also be under-mutated.
Indeed, the comparison of missense mutation rates in different groups of residues of the Spike protein trimer shows that the rate of missense mutations in epitopes is close to that of other exposed residues (see Fig 5A). However, exposed residues involved in the RBD-ACE2 interface appear to be under strong purifying selection as they are significantly under-mutated as compared to other exposed residues (p-value = 0.03). This is expected as the RBD-ACE2 interface is essential for the entry of the virus into the host cell and any, even minor, disruption of its binding would most likely diminish the ability of the virus to enter host cells. As a result, the epitope residues that are also involved in the RBD-ACE2 interface are effectively "protected" from mutations. It seems that mutations, especially those which are observed multiple times (higher virus counts), are unlikely to be found on this interface (Figs 5B and 6). In individual epitopes, the positions significantly involved in contact with ACE2 only rarely have mutations and these mutations usually have low viral counts (see Fig 6). While sufficient statistics for these trends are still lacking, they support the idea that antibodies targeting epitopes with large overlap with the ACE2 interaction interface are at a lower risk of immunological escape by the virus.

The rate of mutations in regions targeted by the diagnostic PCR tests
The adverse effects of SARS-CoV-2 genomic mutations on the PCR-based diagnostic test results have already been discussed by others [34,35] (also see the GISAID page on popular Mutations in known epitopes of the RBD of SARS-CoV-2 spike protein: A) Frequencies of missense (red) and synonymous (blue) mutations in (from the left to right): residues buried in the Spike protein core, in its exposed residues, in all residues from (structurally characterized) epitopes in RBD, and in all residues involved in binding to the human ACE2 receptor. B) The virus counts of missense mutations in epitopes as a function of the percent of residue's surface that is involved in the RBD-ACE interface. https://doi.org/10.1371/journal.pcbi.1009147.g005

PLOS COMPUTATIONAL BIOLOGY
Structural constraints on SARS-CoV-2 evolution primers available at https://www.gisaid.org/). The false-negative results of the PCR tests, especially of the TaqMan-qPCR assay are linked to mutations and the high sensitivity of this technique to primer/probe-template mismatches [36,37]. Both missense and synonymous mutations have an impact on the accuracy of PCR tests, but only missense mutations are under structural and functional constraints imposed by proteins. Nevertheless, since missense mutations comprise most (59%) of the mutations in the SARS-CoV-2 genome, the overall mutation rate depends mostly on missense mutations.
Here we investigated the mutation rates of the target regions of the widely used PCR primers and probes in the context of proteins and protein domains encoded by these regions. To this end, we collected the sequences of primers and probes commonly used for COVID-19 diagnostic PCR assays. The coordinates of the genomic target regions of these primers and probes were obtained by mapping them to the reference genome used in this study (GenBank: MN908947.3) and then these genomic coordinates were mapped to SARS-CoV-2 proteins and (where possible) to experimental structures. As expected, the tests targeting the genomic regions encoding highly conserved proteins whose functions are essential to the viral life-cycle, such as RdRP, show the lowest rate of mutations ( Fig 7A). More generally, target regions encoding stable, protein structures have lower mutation rates than those encoding structurally disordered protein regions. Regions coding for structurally disordered proteins are known to be enriched in mutations [33] and this applies to the regions targeted by some widely used diagnostic tests (Fig 7B). The examples of such frequently mutated target sequences are the targets of 2019-nCoV_N1 (also known as RX7038-N1 or CDC N1) primers and probe as shown in Fig 7B. These regions encode the structurally disordered region of the SARS-CoV-2 Nucleocapsid protein. Our predictions of structural disorder obtained using the Disopred program [38] were recently confirmed experimentally as it was shown that the SARS-CoV-2 Nucleocapsid protein is highly dynamic and contains three disordered regions [39]. Such regions are less suitable as targets of PCR-based diagnosis of SARS-CoV-2. At the same time, the region coding for RdRP has few mutations ( Fig 7B) and, thus, is a more reliable target for SARS-CoV-2 diagnostic purposes. The list of diagnostic primers and probes, mutation counts in their target regions, and proteins encoded by these regions are provided in S4 Table.

Discussion
In this manuscript, we have shown that the connection between the distribution of amino acid mutations and structures of the proteins encoded in the genome is clearly evident in the evolution patterns for the SARS-CoV-2 virus.
The rate of missense mutations significantly varies along the SARS-CoV-2 genome, while the rate of synonymous mutations shows much lower variability. This indicates that mutations are significantly impacted by the selection mechanisms on the protein level. A simple analysis of the rate of missense mutations along the genome reveals some strong maxima and minima. Some peaks of mutation rate are correlated with structurally disordered regions where structural constraints on amino-acid substitution are generally lower [33]. At the same time, some deep minima in mutation rate correspond to known essential regions of SARS-CoV-2 proteins whose functions put significant constraints on the possible mutations. At least one deep minimum in the rate of missense mutations corresponds to the structurally uncharacterized C-terminal domain of nsp3, suggesting that it has a well-defined structure whose conservation is essential for the viral life cycle. The analysis of mutation frequencies of individual proteins and domains further corroborated these observations.
Additionally, the mutations in SARS-CoV-2 proteins follow a known trend [9] with positions corresponding to the residues in protein cores mutated less often than those corresponding to solvent-exposed residues. Positions on the protein-protein interfaces present an intermediate case, but it is possible that the existence of some as yet unknown interaction interfaces complicates the analysis. This also opens up a possibility of searching for such interfaces by looking for patches of below-average sequence variability on protein surfaces. Another fascinating example of constraints imposed by protein structures on SARS-CoV-2 mutations are proteins encoded by overlapping reading frames. In agreement with trends observed earlier in bacteria [33], the region of the Nucleocapsid gene that codes for two different protein structures in two reading frames shows a significantly lowered rate of mutations.
The analysis of the mutation pattern in the SARS-CoV-2 virus is interesting from the evolutionary point of view but may also be of practical importance. For instance, it makes it possible to predict which of the currently known epitopes on the surfaces of SARS-CoV-2 proteins are more likely to undergo widespread mutations in the future. Similar predictions can be useful for regions targeted by primers and probes used in PCR-based diagnostic tests for COVID-19, as there is already evidence that the accuracy of some of these tests has been negatively affected by the accumulation of multiple mutations [40]). We show examples from both categories, where structure constrained mutation rates may differentiate between evolutionarily stable and unstable epitopes or probe sites, leading to antibodies less prone to viral escape and more reliable PCR-based diagnostic tests, respectively.
It is noteworthy that another study which addressed mostly positive selection in the regions of the SARS-CoV-2 genome [41] reports high conservation of the central RNA replication machinery (nsp6-nsp13), suggests both strong positive and negative selection for the Spike protein and high conservation of the Orf3a-N region. While our analysis addresses mostly negative selection, the last of these observations is not in agreement with ours as we report most of these genes in the Orf3a-N region to be over-mutated. We believe that the difference between these analyses is partly due to the different scope of the two studies (RNA level analysis [41] and protein level in our analyses) and partly due to the limited number of genomes used by Berrio et al. [41]. Currently, the rate of missense mutations in some parts of the Orf3a-N region approaches 3 mutations per nucleotide suggesting very low conservation for these proteins.
Our simple approach of analyzing the frequency of mutations in genomic regions coding for proteins or domains, originally applied to cancer [18,42], has its limitations. It was most appropriate when the number of known SARS-CoV-2 genomes (and the number of detected mutations) was relatively low. With rapidly accumulating data and increasing rates of recurrence, mutation counts (which, in the beginning, were most likely linked to founders' effects) are expected to be increasingly correlated with fitness. However, at the time of finalizing this manuscript the rate of missense mutations was still a good measure of local negative selection as demonstrated by its expected strong dependence on the residues' solvent exposure and anticipated high value in structurally disordered regions.

Rate of mutations definition
In our analysis we focus on the rates of missense mutations (with the exception of overlapping reading frames where we analyze all mutations). However, in figures we also often show numbers of synonymous mutations as the illustration of the fact that their distribution is mostly flat and (as expected) it does not significantly differ between regions defined by structural and functional features of proteins. In the manuscript we use the term "rate of mutations" for the number of distinct mutations in a given protein region or per 100 nt. This does not include virus counts (numbers of known genomes with a given mutation) so in our approach each mutation is counted only once. We provide a more detailed justification of this approach in the Results section.

Data collection and curation
Sequences and metadata of complete SARS-CoV-2 genomes were retrieved from GISAID (https://www.gisaid.org/). as of December 3 rd , 2020. All low-coverage genomes i.e., genomes containing less than 29,000 nt. and those containing more than 1% of undetermined nucleotides (Ns) based on GISAID cutoff, were removed. Several recent studies have shown that each coronavirus genome has median minority variants (either inter-or intra-host) ranged from 1-38 [43][44][45] and this median range recorded as roughly 1-10 for point mutations (substitutions) per sample [46]. The high rate of mutations could simply arise from sequencing errors as pointed out in the literature [46] (also see https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473). Therefore, to avoid including spurious mutations, we excluded genomes with substitution exceeding a cutoff of 1.5 interquartile range (IQR) above the 3 rd quartile of substitution rates in all genomes.
This filtering procedure resulted in 192,030 genomes and this set was used in all calculations and analyses in this study. One of the early annotated and sequenced complete genomes of SARS-CoV-2 (GenBank: MN908947.3) was retrieved from The National Center for Biotechnology Information (NCBI) and used as a reference for all genomic coordinates and as a query in all alignments.

Alignment, variant calling, and annotation
We calculated a multiple sequence alignment (MSA) of all high-coverage SARS-CoV-2 genomes using MAFFT version 7 (https://mafft.cbrc.jp/alignment/server/) with the default parameters. The MSA file was then processed using SNP-sites [47] and BCFtools version 1.9 [48] for variant calling and variant normalizations, respectively. In all analyses, we only considered single nucleotide substitutions involving unambiguous nucleotides (A, T, C, G). In the text we simply refer to them as "mutations". All variations identified in this study along with the corresponding metadata are accessible via VarCoV application available at http:// immunodb.org/varcov/.
To annotate variants, we used SnpEff (http://snpeff.sourceforge.net/). We used R package "vcfR" to manipulate and visualize variant calling format (VCF) data. The complete genome of SARS-CoV-2 (GenBank: MN908947.3) was used as a reference for genomic coordinates of proteins, protein structures, and models.

Comparison of missense mutations rate in non-structural proteins (Orf1ab) and in the structural and accessory proteins
For each protein (except for the two very short peptides Orf3b and nsp11), we counted the total number of missense mutations (based on genomic positions) that were observed in at least one sample (that is, virus counts were ignored). If two mutations occurred at the same genomic position but resulted in different base substitutions, they were counted independently.
We then used the (two-sided) binomial test to compare: 1. the rate of missense mutations in Orf1ab (except nsp11) against the rate of missense mutations in the complete proteome (excluding Orf3b and nsp11).
2. The rate of missense mutations in the set of all structural and accessory proteins (except Orf3b) against the complete proteome.
3. The rate of synonymous mutations in Orf1ab to the full proteome.
4. The rate of synonymous mutations in the set of all structural and accessory proteins to the full proteome.
(The binomial tests were performed in a manner similar to the analysis of mutation rates in individual proteins, described below). We found that Orf1ab is significantly under-mutated, while the structural and accessory proteins are significantly over-mutated in both missense (p = 8.01x10 -75 ) and synonymous mutations (p = 5.72x10 -6 ). We further used Pearson's Chisquared test with Yates' continuity correction (as implemented in R) to assess the significance of the difference between the proportion of missense and synonymous mutations in these two regions (p = 1.85x10 -12 ). Because of these significant differences we decided to use separate backgrounds of Orf1ab and structural and accessory proteins to evaluate over-and undermutation of individual proteins.
The full list of proteins used in this analysis is provided in the S1 Table.

Assessing differences in rate of missense mutations in individual SARS-CoV-2 proteins and domains
For the assessment of differences in mutation frequencies of individual proteins and domains we compared the rate of missense mutations in these regions to the rate of missense mutations in some larger background region encompassing the protein/domain of interest.
We used the binomial test to identify individual proteins and domains that have significantly different mutation frequencies, when compared to an appropriate background mutation rate. This approach was used previously by our group in the eDriver algorithm [27] to evaluate the significance of differences in mutation rates between domains of cancer driver proteins.
The arguments for the binomial test, which are the number of successes, the number of trials, and the expected probability of success, were set as follows: 1. The number of successes was the observed number of missense mutations in the protein/ domain being analyzed. This was counted as the total number of distinct missense mutations in that protein/domain observed in at least one sample. Therefore, virus counts (the number of samples where the mutation was observed) were ignored, since we assumed that, in most cases, these would not represent independent mutation events (see discussion of recurrence in the Results section). However, missense mutations that occurred at the same genomic position, but resulted in different base substitutions were counted independently.
2. The number of trials was the number of missense mutations in the background region used for comparison.
3. The expected probability of success (under the null hypothesis) was equal to the length of the protein/domain divided by the length of the background region.
All lengths were calculated in terms of genomic positions (i.e., the length of the genomic region coding for the protein/domain being analyzed). Missense mutations were also counted at the level of genomic positions.
The following approaches and background regions were used in the analyses of individual proteins and domains: 1. We used the set of all non-structural proteins (Orf1ab) (see note below) as the background for analysis of individual non-structural proteins, and the set of all structural and accessory proteins (see note below) as the background for the analysis of individual structural and accessory proteins.
The full list of proteins analyzed can be found in S1 Table. Note: Two very short peptides Orf3b and nsp11 coded in alternative reading frames (containing 9 bases and 38 bases respectively) were excluded from these analyses.
1. Domains were identified based on protein structures or models and through the literature.
(For that purpose, only structures/models representing segments of the protein and not the full protein were considered.) We also considered regions in between known structures/models to represent domains as well. The full list of domains can be found in S2 Table. For each domain analyzed, the encompassing full protein was used as the reference background region.

Structural coverage of the SARS-CoV-2 proteome and derived structural characteristics
The structural data for biological assemblies of SARS-CoV-2 was downloaded from Corona-virus3D server developed recently by our group [12]. The Coronavirus3D server provides links to experimental structures of SARS-CoV-2 proteins stored in PDB [13] and models of protein regions of SARS-CoV-2 for which direct structural characterization is still lacking. Models were calculated with Modeller [49] based on FFAS [50] alignments. For the purpose of this study, we prepared a non-redundant list of structures which included non-overlapping structures and models providing only one structural characterization for each residue where possible (with the exception of structures coded in two different reading frames). The list of structures and models used in this study is provided in S3 Table. The collected experimental and modeled biological assemblies of SARS-CoV-2 proteins were used to calculate solvent exposure with the GetArea program [51]. Solvent exposure was calculated separately for biological assemblies and for isolated chains. The buried residues were defined as those with less than 20% of their surface exposed to the solvent according to GetArea. The remaining residues were classified as exposed. Interfaces were defined as a subset of residues whose solvent exposure decreased by at least 20% of their total area in biological assembly as compared to an isolated chain.

Assessment of mutation frequencies as function of solvent exposure
The list of single nucleotide mutations in SARS-CoV-2 genomes (prepared as described in the section Collection and curation of SARS-CoV-2 variation data) was merged with the solvent exposure data prepared for residues of SARS-CoV-2 proteins (as described in the previous section). The total numbers of synonymous and non-synonymous mutations were then calculated for codons of protein residues for different ranges and categories of solvent exposure. The significance of differences in rate of missense mutations between buried, exposed, and interface residues was again assessed using binomial tests as described in previous sections with the entire proteome of SARS-CoV-2 used as the background.

Rate of mutations in epitopes on RBD of Spike and in Spike-ACE2 interface
For these calculations we used the following definitions. Epitopes include residues whose solvent exposure decreases by at least 20% of their maximal solvent exposed area in the RBD-antibody complex as compared to RBD alone. Similarly, ACE2-contact area for any residue from RBD is the % of its solvent exposure lost when RBD is bound to ACE2. Antibody binding and ACE2 areas were derived from separate PDB entries. For the purpose of comparison all epitopes are shown in the same structural context of the RBD-ACE2 complex (PDB id 6m0j) rather than in the context of the antibody RBD complexes. However, RBD may undergo some conformational changes in complexes with antibodies.

Rate of mutations in overlapping reading frames
In the overlapping reading frames, we tested differences in rate of all mutations rather than only missense mutations as mutations which are missense in one frame may be synonymous in other and vice versa. The significance of the changes in rate of all mutations in different regions of Orf9b and Nucleocapsid proteins was calculated using binomial tests in a way analogous to that used for individual proteins (see the previous section). For example, the number of all mutations in the region of the overlap of two structures coded in different reading frames (positions 28415-28574), the total number of mutations in Nucleocapsid and the ratio of the length of the overlap to the total length of Nucleocapsid were used as number of successes, number of trials and background probability in binomial tests, respectively. All lengths were calculated in terms of nucleotides.  Table. The list of widely used PCR tests for diagnosis of SARS-CoV-2 along with mutation counts in their target regions and proteins coded by these regions. (DOCX)