Fig 1.
Graphical representation of hypothesis and experimental design.
(A) Schematic of sequence space and the impact of flexibility on sequence tolerance. S1 and S2 represent two unique conformations of the same residue length separated by some RMSD that populate two local energy minima. Black lines with end caps represent unique sequences that are energetically most favorable for a single conformation. The dark shaded area encircles sequences that are energetically favorable for both conformations. Here we illustrate that by using multiple conformations during protein design, we identify sequences that are energetically suitable for conformational flexibility, yet are not necessarily the most stable sequence for any given conformation. Additionally, the requirement to adopt multiple conformations constrains the number of suitable sequences. (B) Flow chart of benchmark design.
Table 1.
Proteins used in conformation-dependent sequence tolerance benchmark.
Fig 2.
Metrics used to quantify conformational flexibility.
(A) Illustration of maximum RMSD100, the metric used to quantify large-scale, or global, conformational flexibility. For simplicity, we only represent RMSD on a two-dimensional plane, where the x and y axes represent the difference in distance in Cartesian space if two conformations were superimposed onto the same coordinate system. Each protein conformation of identical sequence is represented as a circle, and is separated by some distance vector evaluated as the RMSD100 of two conformations. The maximum RMSD100 describes the greatest pairwise RMSD100 within an ensemble. (B) Illustration of dihedral angle ϕ and ψ variation used to calculate dihedral angle RMSD (RMSDda). Orientation of atoms is color-coded and corresponds to the diagram drawn at the bottom of the panel. RMSDda is illustrated as the range of dotted lines, corresponding to the deviation in relative orientation of the third and fourth atoms. (C) Explanation of contact proximity deviation. Two conformations of the same protein are depicted on the left, with two residues, outlined in cyan or orange, shown in their respective positions. These two residues are magnified (top right) in their local side chain environment, with Conformation A on the top and Conformation B on the bottom. Contact residues in Conformation A are colored yellow. If the same contacts are maintained in Conformation B, contact residues remain colored yellow in the bottom two boxes. If new contacts are made, contact residues are colored in purple. Even though the cyan residue changes slightly in its relative orientation between conformations, the same contacts are maintained, so its degree of conformational flexibility is relatively low in comparison to the heptad trimer refolding; this residue would therefore have a low contact proximity deviation score.
In contrast, the orange residue completely rearranges its local side chain contacts between conformations as a result of the large conformational rearrangement, and would have a high contact proximity deviation score. (D) Explanation of contact proximity deviation. We assigned a score to each Cβ–Cβ distance by applying a soft-bounded, continuously differentiable function that accounts for the proximity of two side chains and approximates the likelihood of two side chains forming a contact, illustrated in the top left of Panel D. We then calculated the deviation of each Cβ–Cβ distance across an ensemble as shown in the matrix, with low deviation scores in white and high scores in black. The contact proximity deviation score represents the sum of all Cβ–Cβ proximity deviations a single residue undergoes within an ensemble, as shown in the bottom row separated from the matrix.
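The contact proximity deviation described in Panels C and D can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the logistic form of `proximity`, its `midpoint` and `steepness` values, and the use of the population standard deviation as the "deviation" are all hypothetical choices made only so that the example runs.

```python
import math
from statistics import pstdev


def proximity(distance, midpoint=8.0, steepness=1.0):
    """Assumed soft-bounded, smooth score in (0, 1).

    Close to 1 for short Cb-Cb distances and close to 0 for long ones.
    The logistic form and both parameters are illustrative assumptions.
    """
    return 1.0 / (1.0 + math.exp(steepness * (distance - midpoint)))


def contact_proximity_deviation(ensemble):
    """Per-residue contact proximity deviation for an ensemble.

    ensemble: list of conformations, each a list of (x, y, z) Cb
    coordinates with residues in the same order.
    For each residue pair, the proximity score is evaluated in every
    conformation and its deviation across the ensemble is taken (here,
    the population standard deviation, an assumption). Each residue's
    score is the sum of its pairwise deviations.
    """
    n_residues = len(ensemble[0])
    scores = [0.0] * n_residues
    for i in range(n_residues):
        for j in range(n_residues):
            if i == j:
                continue
            values = [proximity(math.dist(conf[i], conf[j])) for conf in ensemble]
            scores[i] += pstdev(values)
    return scores
```

A residue that keeps the same neighbours in every conformation accumulates a score near zero, whereas a residue whose neighbours change, like the orange residue in Panel C, accumulates a high score.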
Fig 3.
Design native sequence recovery and mutation profile variability comparisons to PSI-BLAST profiles.
(A) Comparison of total native sequence recovery of relaxed and unminimized RECON MSD and SSD designs to PSI-BLAST sequence profiles generated using the native sequence. For this figure and all subsequent boxplots, shaded regions of each box plot denote values within the first and third quartiles (interquartile range, or IQR), with the median indicated as a solid line and whiskers representing values ± 1.5 × IQR. Outliers are represented as dots. Asterisks indicate the significance of difference of means of each design in comparison to the PSI-BLAST profile, with a z-test p-value < 0.01 represented by one asterisk, and a p-value < 0.00001 by three asterisks. The p-values provided in this figure and all subsequent figures are two-sided and evaluated at a 95% confidence level. (B) Mutation frequency root mean square deviations of designs in comparison to a PSI-BLAST profile. The y-axis values (y) represent the root mean square deviation (RMSD) of mutation profiles for each designed residue in relation to a PSI-BLAST profile. In this calculation, aaj is the frequency of amino acid j (for each of all twenty amino acids) observed at position i, and y is summed over the differences at all positions i within a protein of length n residues. A y-value of 0 would indicate that the design profile is identical to the PSI-BLAST profile, and an increasing y-value indicates that the design profile is increasingly dissimilar to the PSI-BLAST profile.
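To make the profile comparison in Panel B concrete, the following Python sketch computes a per-position RMSD between a design profile and a reference profile. It is an illustration only: the normalization (mean of the squared differences over the twenty amino acids) is an assumption, and the function names and dictionary-based profile format are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_rmsd(design_freqs, reference_freqs):
    """RMSD between two amino acid frequency dicts at one position.

    Missing amino acids are treated as frequency 0. Dividing by the
    twenty amino acid types is an assumed normalization.
    """
    squared = [
        (design_freqs.get(aa, 0.0) - reference_freqs.get(aa, 0.0)) ** 2
        for aa in AMINO_ACIDS
    ]
    return math.sqrt(sum(squared) / len(AMINO_ACIDS))


def profile_rmsd(design_profile, reference_profile):
    """Per-position RMSD values for two profiles of equal length n."""
    if len(design_profile) != len(reference_profile):
        raise ValueError("profiles must cover the same positions")
    return [position_rmsd(d, r) for d, r in zip(design_profile, reference_profile)]
```

Under this convention, identical profiles give 0 at every position, and values grow as the two frequency distributions diverge.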
Fig 4.
Comparison of exchangeability rates.
(A) Average amino acid exchangeability of PSI-BLAST, RECON MSD, and SSD sequence profiles. Single-letter amino acid codes were used for both x and y axes, with the x axis representing the original amino acid and the y axis representing the average mutation frequency of the original amino acid to the indicated mutation. (B) Comparison of exchangeability rates between profiles, excluding rates of native sequence conservation. The y axis represents the mean frequency at which a native amino acid is replaced with a specific, non-native amino acid, which we term amino acid exchangeability. (C) Difference of mean amino acid-specific exchangeability observed in a PSI-BLAST profile compared to a design profile. The x axis represents each type of amino acid present in the native sequence. The y axis represents the difference in average exchangeability frequency of each amino acid type, that is, the average frequency at which a native amino acid type is replaced with any other non-native amino acid. A positive value indicates the native amino acid is less likely to be exchanged for a non-native amino acid during design, whereas a negative value indicates the native amino acid is more likely to be exchanged, as compared to a PSI-BLAST profile.
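The amino acid-specific exchangeability in Panel C can be illustrated with a short Python sketch. It assumes a profile is a list of per-position frequency dictionaries aligned to the native sequence; the function name and data layout are hypothetical and are not taken from the authors' code.

```python
from collections import defaultdict


def exchangeability_by_native_type(native_sequence, profile):
    """Mean frequency at which each native amino acid type is replaced.

    native_sequence: string of single-letter codes.
    profile: list of dicts (one per position) mapping amino acid to
    observed frequency at that position.
    Returns a dict mapping each native amino acid type to the average
    summed frequency of all non-native amino acids at its positions.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for native_aa, freqs in zip(native_sequence, profile):
        replaced = sum(f for aa, f in freqs.items() if aa != native_aa)
        totals[native_aa] += replaced
        counts[native_aa] += 1
    return {aa: totals[aa] / counts[aa] for aa in totals}
```

Subtracting the values obtained for a design profile from those obtained for a PSI-BLAST profile would give a per-type difference of the kind plotted on the y axis of Panel C, under these assumptions.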
Fig 5.
Comparison of mutation profiles predicted by protein design to mutation profiles observed within calmodulin and influenza type A HA2 multiple sequence alignments.
(A) Comparison of root mean square deviation of mutation frequencies derived from calmodulin natural homologues to mutation profiles predicted by RECON MSD or SSD. Calmodulin natural homologue mutation preferences were derived from the multiple sequence alignment of calmodulin homologues. The root mean square deviation (RMSD) here represents the mean standard deviation of an individual residue’s mutation profile, calculated as the mean sum of squared differences of all twenty amino acid frequencies as determined by the multiple sequence alignment of calmodulin homologue sequences in relation to either the RECON MSD or SSD residue profile. (B) Residue profile standard deviations between calmodulin multiple sequence alignment profiles and design profiles mapped onto the unbound conformation of calmodulin (PDB ID 1CLL). Here, RMSD represents the mean sum of squared differences of all twenty amino acid frequencies of each residue between homologue and design profiles. Residues whose design profiles were predicted to be identical to the profile at the corresponding position of the multiple sequence alignment are colored in white. The greater the dissimilarity between the homologue mutation profile and design profile, the greater the saturation in red, with complete saturation indicating an RMSD of 1.0. Residues within all four of the conserved EF-hand motifs are labeled, with the bidentate ligand at position 12 critical for Ca²⁺ binding labeled in boldface. (C) Comparison of root mean square deviation of mutation frequencies derived from influenza type A sequence alignments to mutation profiles predicted by RECON MSD or SSD. RMSD is calculated in a similar fashion as in Panel A. (D) Residue profile standard deviations between HA2 multiple sequence alignment profiles and design profiles mapped onto the pre-fusion conformation of the HA2 trimer (PDB ID 2HMG).
RMSD is calculated and labeled the same as in Panel B, but only one HA2 monomer is labeled with RMSD values of the influenza A IVR residue profiles in relation to RECON MSD or SSD profiles. The N- and C-terminal residues of loop regions that undergo large local conformational rearrangements in the post-fusion form are labeled. This includes the B loop that rearranges into an alpha helix and the S5 domain, which stabilizes the alpha helical form of the B loop. Residues within the CR8020 broadly neutralizing epitope [32], including N146 and E150, are also labeled.
Fig 6.
Root mean square deviations of residue mutation frequencies of influenza A subtypes and HA2 profiles predicted by RECON MSD and SSD.
(A) Dendrogram of root mean square deviations (RMSD) of influenza A subtype HA2 profiles sorted by pairwise RMSD. The mutation frequencies derived from the multiple sequence alignment profile of each influenza A subtype were compared to those of all other subtypes by calculating the mean standard deviation of each aligned position’s mean sum of squared differences of all twenty amino acid frequencies with respect to each other subtype profile. Pairwise RMSD values were sorted to form clades, with the height along the y axis indicating the pairwise RMSD between each clade. (B) RMSD of each IVR subtype multiple sequence alignment (MSA) profile with respect to RECON MSD and SSD. The x axis represents each IVR subtype profile sorted as in Panel A. The y axis represents the RMSD, calculated in the same fashion as in Panel A, of each subtype profile in relation to either RECON MSD or SSD.
Fig 7.
Relationship of conformational flexibility and native sequence recovery by sequence profiles.
The x axis is binned into three groups containing equal numbers of data points to show the distribution of native sequence recovery between groups of low, middle, and high values for each metric. A Kendall τβ rank correlation test was performed on each profile to measure the strength of dependence of native sequence recovery on each metric, indicated in each plot along with its associated p-value. (A) Comparison of native sequence recovery dependence on maximum RMSD100 between sequence profiles. (B) Comparison of native sequence recovery dependence on RMSDda between sequence profiles. RMSDda values of each protein were not equally distributed, nor were they of similar range. Therefore, a z-score was used to normalize the RMSDda values of each protein so that dihedral angle deviation scores, shown along the x axis, could be compared. A similar approach was implemented to normalize contact map deviation scores. (C) Comparison of native sequence recovery dependence on contact deviation scores.
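The three analysis steps named in this caption (z-score normalization within a protein, binning into three equally populated groups, and a Kendall rank correlation) can be sketched with the Python standard library. This is an illustration under assumptions, not the authors' pipeline: the function names are hypothetical, the population standard deviation is assumed for the z-score, the correlation shown is the common tau-b variant with tie correction, and no p-value is computed.

```python
import math
from statistics import mean, pstdev


def z_scores(values):
    """Normalize one protein's metric values to zero mean, unit spread."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def equal_count_bins(values, n_bins=3):
    """Label each value 0..n_bins-1 so bins hold (nearly) equal counts."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * n_bins // len(values)
    return labels


def kendall_tau_b(x, y):
    """Kendall tau-b rank correlation (tie-corrected), O(n^2) version."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: contributes to neither term
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    denom = math.sqrt((pairs + ties_x) * (pairs + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")
```

In this sketch, the z-scored metric would go on the x axis, `equal_count_bins` would assign the low, middle, and high groups, and `kendall_tau_b` would be evaluated on the unbinned metric against native sequence recovery.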
Fig 8.
Average per-residue total energy score of the lowest ten percent scoring models for RECON MSD, SSD, and starting relaxed (Native) models.
One hundred simulations were performed for each group, and the ten models with the lowest total energy scores were used for the comparison. Each total score was divided by the number of residues within the model to obtain a mean per-residue score. For RECON MSD models, the total score was additionally divided by the number of states within each model. The violin plot width indicates the normalized energy score density of each group.
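The normalization described here amounts to simple arithmetic, sketched below in Python. The function names are hypothetical, and the sketch assumes that `n_residues` is the residue count of a single state, so that a multi-state total is divided by both the residue count and the number of states.

```python
def per_residue_score(total_score, n_residues, n_states=1):
    """Mean per-residue score; n_states > 1 applies to multi-state models."""
    return total_score / (n_residues * n_states)


def mean_of_lowest(scores, fraction=0.10):
    """Average of the lowest-scoring fraction of models (at least one)."""
    k = max(1, round(len(scores) * fraction))
    lowest = sorted(scores)[:k]
    return sum(lowest) / k
```

For example, with one hundred models per group, `mean_of_lowest` averages the ten lowest normalized scores, matching the selection described in the caption.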
Fig 9.
Comparison of conformational diversity and per-residue total scores.
All panels are binned into low, medium, and high x values, with an equal number of data points in each bin. A Kendall τβ rank correlation test was performed on each profile to measure the strength of dependence of native sequence recovery on the x axis value, indicated in each plot along with its associated p-value. (A) Comparison of maximum RMSD100 and mean total energy score, normalized by the number of residues. (B) Comparison of normalized RMSDda z-score and mean total energy score of each residue. (C) Comparison of normalized contact proximity deviation z-score and mean total energy score of each residue.
Table 2.
Number of sequences within each influenza type A HA sequence dataset.