Fig 1.
The mutational space of spike is specific for each SARS-CoV-2 lineage.
(A) Black dots indicate sites of mutation that define VOC sublineages and emerged up to April 8th, 2022. Cells are shaded by the inferred selective pressures applied on the sites in the baseline of each VOC, calculated using the SLAC method. (B) Phylogenetic tree based on 40,350 unique spike sequences from the indicated lineages. The SARS-CoV-2 baseline (BL) group is composed of sequences within 0.0015 nucleotide substitutions per site from the SARS-CoV-2 ancestral strain. Branches are colored by the residue at position 501. (C) Schematic of the approach to calculate volatility for each position of spike. (D) Volatility at RBD positions in the indicated VOCs or the BL group. (E) Lineages were partitioned into groups of 500 sequences and the absence or presence of volatility at all spike positions in each group was determined. Each group was thus assigned a 1273-bit string that describes the volatility profile of spike. Strings were compared using the UPGMA clustering method. (F) Relationships between the 1273-bit strings. Data points represent the strings of all 500-sequence clusters, which are labeled by lineage. To visualize these relationships, the distance matrix between all vectors was used as input for multidimensional scaling. Lineage-specificity of the profiles was determined by a permutation test. P-values: *, P<0.05; **, P<0.005; ***, P<0.0005. (G) Volatility was calculated for each lineage at all positions of the NTD (20–286), RBD (333–527) and S2 (686–1213). The Spearman correlation coefficient between volatility values in any two lineages was determined. Coefficients are compared with the mean nucleotide distance that separates any two lineages. rS, Spearman coefficient. P-values, two-tailed test.
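The bit-string encoding in panel E can be illustrated with a minimal sketch. The caption does not state how volatility is scored within a group or which distance feeds the clustering, so two assumptions are made here: a position is flagged as volatile when a group contains more than one residue at that column, and profiles are compared by Hamming distance. The group names and five-residue sequences are hypothetical toy data, not spike sequences.

```python
from itertools import combinations


def volatility_bits(group):
    """Return one 0/1 flag per alignment column for a group of sequences.

    Assumption (not from the caption): a column is "volatile" when the
    group contains more than one distinct residue at that column.
    """
    length = len(group[0])
    if any(len(seq) != length for seq in group):
        raise ValueError("sequences must be aligned to equal length")
    return tuple(int(len({seq[i] for seq in group}) > 1) for i in range(length))


def hamming(a, b):
    """Number of positions at which two bit-strings differ."""
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(profiles):
    """Pairwise Hamming distances between named profiles (symmetric dict)."""
    names = list(profiles)
    dist = {(n, n): 0 for n in names}
    for p, q in combinations(names, 2):
        dist[(p, q)] = dist[(q, p)] = hamming(profiles[p], profiles[q])
    return dist


# Hypothetical toy groups; the real groups hold 500 sequences of 1273 residues.
groups = {
    "lineage_X_group1": ["MFVFL", "MFVFL", "MFAFL"],
    "lineage_X_group2": ["MFVFL", "MFLFL", "MFVFL"],
    "lineage_Y_group1": ["MFVFL", "MSVFL", "MFVFI"],
}
profiles = {name: volatility_bits(seqs) for name, seqs in groups.items()}
dist = distance_matrix(profiles)
```

A distance matrix of this kind is the usual input for both UPGMA clustering (panel E) and multidimensional scaling (panel F); neither step is reproduced in this sketch.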
Fig 2.
Spike positions with high volatility in the baseline group emerge as sites of sub/lineage-founder mutations.
(A) Phylogenetic tree based on 16,808 unique spike sequences. Terminal groups are colored and labeled, with their WHO variant designations in parentheses. (B) Volatility values for all positions of spike subunit S1 calculated using the 114 baseline clusters (see values for subunit S2 in S4B Fig). (C) Comparison of volatility values between spike positions that emerged with lineage-founder mutations (LFMs), sublineage-founder mutations (sLFMs) or no such mutations. P-values in an unpaired T test: ***, P<0.0005; ****, P<0.00005; ns, not significant. (D) Number of positions that appeared with LFMs and sLFMs when volatility in the BL group was zero or larger than zero. The number of positions in each subset is indicated in parentheses.
Fig 3.
Spike positions in a volatile environment emerge as sites of sub/lineage-founder mutations.
(A) Volatility values for positions 1–527 were calculated for the SARS-CoV-2 BL group (black bars). Solvent exposure of the residues (based on spike structure 6ZGI) is expressed as a percent of the total solvent-accessible surface area of each residue (red bars). Positions unresolved on the structure (1–13 and 71–75) are assigned exposure values of 0. (B) Autocorrelation of volatility values for the NTD and RBD. Significance of the autocorrelation was calculated using the Spearman rank-order test: *, P<0.05; **, P<0.005; ***, P<0.0005; ****, P<0.00005. (C) Mapping of volatility values calculated for the BL group onto the spike trimer structure. (D) Results of a permutation test to identify positions with high volatility at their 10 closest neighbors on the trimer structure. Low P-values indicate sites with a high-volatility environment. Such sites located outside the NTD are labeled. (E) Fisher’s exact test was used to calculate the co-occurrence of a volatile state at any two positions of spike in the 114 clusters of the SARS-CoV-2 BL group. The distance between the closest atoms of any two residues was also calculated. Bars indicate the position pairs with significant co-variability (P-value < 0.01) as a percent of all position pairs separated by the same distance range. The number of position pairs in each bin is indicated by the dotted green line. (F) The variable D describes for each position the total distance-weighted volatility at spike positions within 6 Å on the trimer structure. D values are compared between positions with LFMs, sLFMs or no such mutations. (G) The number of positions that emerged with LFMs and sLFMs when the D value was zero or larger than zero. The number of positions in each subset is indicated in parentheses.
Fig 4.
High volatility at co-variable sites is associated with emergence of LFMs and sLFMs.
(A) Schematic of the approach to generate the co-variability network of spike. For all positions, the absence (0) or presence (1) of amino acid variability was determined in each cluster of 50 sequences. The co-occurrence of variability at all position pairs was calculated using Fisher’s test, and the P-values were used to construct the network. (B) The co-variability network around position 614 as the root node. Edges were assigned if P-values were smaller than 0.05. First- and second-degree nodes are shown. Node size corresponds to the triangle count of each position. (C) Network robustness. Networks were constructed using P-value thresholds of <0.01, <0.05 or <0.1. For each network, we randomly deleted 10%, 20% or 30% of edges and examined the effect on network stability. The degree distribution (i.e., the number of nodes associated with each position) is shown for the intact and depleted networks. (D) R values describe for each position the total weighted volatility at network-associated positions. R values for spike positions that emerged with LFMs, sLFMs or with no such mutations are shown. (E) Number of LFMs and sLFMs that emerged at spike positions when R in the BL group was equal to zero or greater than zero. (F-H) Correlations between volatility, D and R values. rS, Spearman coefficient. P-value, two-tailed test. (I) Classification metrics for evaluating performance of volatility, D and R values to predict presence of s/LFMs using univariate logistic regression. Error bars, standard errors of the means for five-fold cross-validation. Bal. Acc., Balanced Accuracy.
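The network construction in panels A and B can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the caption does not say whether the Fisher's test was one- or two-sided, so a one-sided test for positive co-occurrence is assumed, and the per-cluster variability vectors and their position labels are hypothetical toy data.

```python
from itertools import combinations
from math import comb


def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact P-value for the 2x2 table [[a, b], [c, d]].

    Returns the hypergeometric probability of seeing at least `a`
    co-occurrences given the table margins.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return tail / comb(n, col1)


def covariability_edges(variability, alpha=0.05):
    """Return {(pos_a, pos_b): P} for position pairs with P < alpha.

    `variability` maps a position to its 0/1 variability flags, one per
    cluster. The alpha of 0.05 follows the threshold named in panel B.
    """
    edges = {}
    for pos_a, pos_b in combinations(sorted(variability), 2):
        x, y = variability[pos_a], variability[pos_b]
        a = sum(1 for i, j in zip(x, y) if i and j)        # both variable
        b = sum(1 for i, j in zip(x, y) if i and not j)    # only pos_a
        c = sum(1 for i, j in zip(x, y) if not i and j)    # only pos_b
        d = len(x) - a - b - c                             # neither
        p_value = fisher_greater(a, b, c, d)
        if p_value < alpha:
            edges[(pos_a, pos_b)] = p_value
    return edges


# Hypothetical flags across eight clusters (values invented for illustration).
variability = {
    614: [1, 1, 1, 1, 0, 0, 0, 0],
    477: [1, 1, 1, 1, 0, 0, 0, 0],
    222: [1, 0, 1, 0, 1, 0, 1, 0],
}
edges = covariability_edges(variability)
```

In this toy input only the pair (477, 614) passes the threshold, because the two positions are variable in exactly the same clusters; position 222 varies independently of both and receives no edge.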
Fig 5.
Volatility patterns among early-pandemic isolates predict emergence of mutations during the lineage-emerging phase.
(A) Timeline for emergence of SARS-CoV-2 lineages until July 2021. Lineage emergence time is determined by the date on which 26 sequences that contain all the lineage-defining mutations were identified. Lineages with WHO variant designations are indicated by their symbols (see S2 Table) and the number of mutations in each is shown by dots. (B) Volatility, D and R values were calculated for all spike positions using the early phase sequences and applied to a logistic regression model to predict emergence of the lineage-defining mutations. Datapoints describe probabilities assigned to all spike positions and are grouped by the lineage in which they emerged. Values are compared between the mutation sites in the indicated VOCs (or minor lineages, labeled “Other Lin.”) and the no-mutation sites (“No mut”) using an unpaired T test. (C-E) Volatility, D or R values for all spike positions were analyzed using a univariate logistic regression model. Probability values of all sites are compared between lineages, as described above. (F) Volatility, D and R values and the combined probability were calculated using different amounts of time-indexed sequences from the early phase (at 50-sequence increments). AUC values are shown for predicting emergence of the 67 lineage-defining mutations in the lineage-emerging phase. (G) The 67 sites of mutation were grouped by the emergence time of the first lineage that contains them. Mutation probabilities assigned to the sites by sequences collected until April 1st 2020 are compared with the probabilities assigned to the no-mutation sites. (H) Probabilities assigned by the April 1st 2020 dataset are shown for mutation sites that appeared in one or more lineages. Values are compared between all groups using an unpaired T test.
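The emergence-time rule in panel A (the date on which 26 sequences carrying all lineage-defining mutations had been identified) can be expressed as a short sketch. The record layout, the function name `emergence_date` and the example mutations are assumptions made for illustration; the example uses a threshold of 3 instead of 26 to keep the toy data short.

```python
from datetime import date


def emergence_date(records, defining_mutations, threshold=26):
    """Return the collection date of the `threshold`-th qualifying sequence.

    `records` is an iterable of (collection_date, mutations) pairs. A
    sequence qualifies when it carries every lineage-defining mutation.
    Returns None if fewer than `threshold` sequences qualify.
    """
    required = set(defining_mutations)
    dates = sorted(d for d, muts in records if required <= set(muts))
    if len(dates) < threshold:
        return None
    return dates[threshold - 1]


# Hypothetical records; dates and mutation sets are invented for illustration.
records = [
    (date(2020, 10, 5), {"N501Y", "P681H"}),
    (date(2020, 9, 20), {"N501Y", "P681H", "A570D"}),
    (date(2020, 10, 1), {"N501Y"}),                      # lacks P681H
    (date(2020, 10, 12), {"N501Y", "P681H"}),
    (date(2020, 11, 2), {"N501Y", "P681H"}),
]
emerged = emergence_date(records, {"N501Y", "P681H"}, threshold=3)
```

Sorting by date before counting makes the result independent of the order in which sequences were deposited.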
Fig 6.
The mutational profiles of new SARS-CoV-2 sublineages are predicted well by volatility patterns in their parental lineages.
(A) Sequences from the baselines of the indicated VOCs were used to calculate the combined probabilities for mutations at all spike positions. These values were compared with the absence or presence of a mutation that defines a Pango sublineage at the sites. The number of sublineage founder mutations (n) in each VOC and the AUC values are shown. (B) Mutation probabilities calculated using the sequences of each lineage (input) are compared with the sublineage mutational outcomes observed in all VOCs (outcome). The highest AUC for each lineage outcome is bolded. (C) Weblogos of the minority variants at sublineage mutation sites. Frequencies are expressed as a fraction of all sequences with a non-lineage ancestral residue. The residue change from the lineage ancestor is shown below the axis, and the emergent residue is also highlighted in red font. (D) Relationship between sampling of residues in the parental lineage and their emergence as the sublineage-defining mutations. The frequencies of all possible residues (excluding the VOC ancestral form) at the sites shown in panel C were calculated as a fraction of all minority variants identified in each VOC. The values were partitioned into the indicated bins. For example, position 138 in lineage B.1.1.7 contained the minority variants His, Tyr, Asn, Ala and Gly at 62.8, 32.7, 1.9, 1.3 and 1.3 percent, respectively, and no representation of any other residue. All 21 residue options (including N-linked glycosylation motifs and deletion events but excluding the majority variant) were distributed into their corresponding frequency bins. For each bin, we calculated the number of residues that emerged as the new sublineage-defining mutation (indicated in red font) as a percent of all instances in that bin (in black font).
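The AUC values in panels A and B summarize how well per-position mutation probabilities separate sites that acquired a sublineage-defining mutation from sites that did not. A minimal sketch of that calculation is given below, using the rank-based definition of the AUC (the fraction of mutated/non-mutated site pairs in which the mutated site has the higher score, with ties counted as one half). The scores and labels are hypothetical and the caption does not state which software the authors used.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from pairwise comparisons.

    `labels` holds 1 for sites with a mutation and 0 for sites without.
    Each (mutated, non-mutated) pair contributes 1 when the mutated site
    scores higher, 0.5 on a tie and 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical mutation probabilities for five sites and their outcomes.
site_probabilities = [0.9, 0.8, 0.3, 0.2, 0.1]
site_mutated = [1, 0, 1, 0, 0]
auc = roc_auc(site_probabilities, site_mutated)
```

The pairwise form is quadratic in the number of sites, which is acceptable for the 1273 positions of spike; a sort-based version would be preferable for much larger inputs.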
Fig 7.
Mutations that conferred resistance to Bebtelovimab in subvariants BQ.1/BQ.1.1 are predicted well by volatility patterns in the parental BA.5 lineage.
(A) Mutation probabilities were calculated for all positions of spike using the baseline sequences of BA.5. Residues on the structure of the RBD in complex with Bebtelovimab (PDB ID 7MMO) are colored by their probability ranks as indicated. Ranking is relative to all 1273 positions of spike whereby lower numbers indicate higher probabilities for mutations. The heavy (H) and light (L) chains are shown. (B) Probability ranks at the contact sites of Bebtelovimab, as calculated using the baseline sequences of the indicated lineages. (C) Probability values calculated for all positions of spike using the BA.5 baseline sequences. Positions with a probability value of 0.99 or greater are highlighted in red. (D) Sequences from the BA.5 baseline were indexed by their date of collection, divided into 50-sequence clusters, and mutation probability values were calculated for groups of increasing cluster numbers. Values are shown for the Bebtelovimab contact sites. A vertical dashed line indicates the collection date of the first BQ.1 sequence. (E) Changes in values of the volatility-based variables for the three sites of mutation that define subvariants BQ.1/BQ.1.1. (F) Frequency of minority variants at position 444, expressed as a percent of all sequences in the baseline groups of the indicated lineages.
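The probability ranks used in panels A and B (rank relative to all 1273 positions, lower numbers indicating higher mutation probabilities) can be sketched as below. The caption does not say how ties are handled; this sketch assumes tied positions share the best rank of the tie. The probability values are invented for illustration.

```python
def probability_ranks(probabilities):
    """Map each position to its rank, where rank 1 is the highest probability.

    Assumption: tied positions share the lowest (best) rank of the tie,
    and the next distinct value skips the ranks the tie consumed.
    """
    ordered = sorted(probabilities.values(), reverse=True)
    first_index = {}
    for i, p in enumerate(ordered, start=1):
        first_index.setdefault(p, i)  # keep the first (best) rank per value
    return {pos: first_index[p] for pos, p in probabilities.items()}


# Hypothetical probabilities for four positions.
example = {444: 0.99, 452: 0.80, 460: 0.99, 501: 0.10}
ranks = probability_ranks(example)
```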
Fig 8.
The mutation in subvariant BA.4.6 that conferred resistance to Cilgavimab is predicted well by volatility patterns in the parental BA.4 lineage.
(A) Mutation probabilities were calculated for all positions of spike using the baseline sequences of BA.4. Residues on the structure of the RBD in complex with Cilgavimab (PDB ID 7L7E) are colored by their probability ranks as indicated. (B) Probability ranks for contact sites of Cilgavimab, as calculated using the baseline sequences of the indicated lineages. (C) Sequences from the BA.4 baseline were indexed by their date of collection, divided into 50-sequence clusters, and mutation probability values were calculated for groups of increasing cluster numbers. Values are shown for all Cilgavimab contact sites. A vertical dashed line indicates the collection date of the first BA.4.6 sequence. (D) Frequency of minority variants at position 346, expressed as a percent of all sequences in the baseline groups of the indicated lineages. Del, deletion event.