Conceived and designed the experiments: ONB AH RU. Analyzed the data: ONB AH RU. Wrote the paper: ONB AH RU.

The authors have declared that no competing interests exist.

Two different strategies for stabilizing proteins are (i) positive design in which the native state is stabilized and (ii) negative design in which competing non-native conformations are destabilized. Here, the circumstances under which one strategy might be favored over the other are explored in the case of lattice models of proteins and then generalized and discussed with regard to real proteins. The balance between positive and negative design of proteins is found to be determined by their average “contact-frequency”, a property that corresponds to the fraction of states in the conformational ensemble of the sequence in which a pair of residues is in contact. Lattice model proteins with a high average contact-frequency are found to use negative design more than model proteins with a low average contact-frequency. A mathematical derivation of this result indicates that it is general and likely to hold also for real proteins. Comparison of the results of correlated mutation analysis for real proteins with typical contact-frequencies to those of proteins likely to have high contact-frequencies (such as disordered proteins and proteins that are dependent on chaperonins for their folding) indicates that the latter tend to have stronger interactions between residues that are not in contact in their native conformation. Hence, our work indicates that negative design is employed when insufficient stabilization is achieved via positive design owing to high contact-frequencies.

Most proteins are functional only in their native states. The stability of the native state of proteins is, therefore, of paramount importance both

Protein stabilization can be achieved via two different strategies: (i) ‘positive design’ in which the native state is stabilized; and (ii) ‘negative design’ in which non-native states are destabilized

The stability, dynamics and function of proteins are determined by both short- and long-range pairwise interactions. Long-range interactions are manifested, for example, in the energetic coupling between distant ligand-binding sites in allosteric proteins owing to conformational changes that are propagated from one site to another. The strength of both direct (short-range) and indirect (long-range) pairwise interactions can be analysed experimentally using the double-mutant cycle (DMC) method

Negative design in the lattice models is also found to be associated with a higher incidence, on average, of correlated mutations, i.e. mutations at one site that tend to be accompanied by other mutations at a second site. Correlated mutations are assumed to be due to selective pressure to maintain protein structure or function and have, therefore, been used for prediction of 3D protein structure

This paper is organized as follows. First, we show that the effects on stability of positive and negative design in lattice models are both linearly dependent, but with opposite sign, on the contact-frequency and that there is a strong trade-off between them. We then provide a general (not lattice-specific) mathematical derivation supporting these claims. Next, an analysis of correlated mutations in sequences selected for stability of a lattice fold shows that the density of correlated mutations increases with increasing contact-frequency. Finally, we show that a similar trend is likely to exist in real proteins by analyzing correlated mutations in proteins that fold with difficulty and are suspected to have higher contact-frequencies.

Sets of 25-residue-long sequences that share a particular native state were generated with and without selection for native-state stability. The native states of the sets that were formed (termed SBSS) correspond to each of the 1081 compact folds on a 5×5 lattice. The average perturbation energy, ΔΔG_{per}, was then calculated for each pair of positions i and j in an alignment, and the difference, D^{(i,j)}, in the average perturbation energies for that pair of positions in the alignments with and without selection was determined. The average value of D^{(i,j)} was then calculated for all pairs of positions in contact in a particular native conformation, <D^{(i,j)}>_{short}, and for all pairs that form long-range interactions in that conformation, <D^{(i,j)}>_{long}. Two positions are defined as forming a long-range interaction in a particular conformation if there is no path formed by residues in contact in that conformation that connects them [...] ΔΔG_{per} equals zero by definition. The values of <D^{(i,j)}>_{short} of different folds were found to be correlated with their respective average contact-frequencies: <D^{(i,j)}>_{short} decreases when the corresponding value of the average contact-frequency increases. Lower values of <D^{(i,j)}>_{short} reflect a smaller contribution of pairs in contact to the gain in stability upon selection. Surprisingly, we discovered that some native states have zero or even negative <D^{(i,j)}>_{short} values. Such values are found when the value of

The values of a measure of the effect of positive design of stability, <D^{(i,j)}>_{short}, for the 1081 different folds of 25 residue-long sequences on a 5×5 lattice are plotted against their respective average contact-frequencies,

We also examined whether a correlation exists between the contribution of negative design to stability and the average contact-frequency. The correlation between <D^{(i,j)}>_{long} and [...] <D^{(i,j)}>_{short} for each fold is plotted against the corresponding value of <D^{(i,j)}>_{long}. The correlation observed is almost perfect (r = −0.96,

The values of a measure of the effect of negative design of stability, <D^{(i,j)}>_{long}, for the 1081 different folds of 25 residue-long sequences on a 5×5 lattice are plotted against their respective average contact-frequencies,

The values of a measure of the effect of negative design of stability, <D^{(i,j)}>_{long}, for the 1081 different folds of 25 residue-long sequences on a 5×5 lattice are plotted against their respective values of a measure of the effect of positive design of stability, <D^{(i,j)}>_{short}. A linear correlation is observed with

The results shown in [...] _{c} is the energy of a contact that was removed (see [...]). The difference, D^{(i,j)}, in the average perturbation energies with and without selection was determined for every relevant pair of positions in the alignments. Inspection of Eq. (3) shows that D^{(i,j)} for positions i and j in the alignment is equal to: [...]

The average of D^{(i,j)} over all the pairs of positions i and j that form direct short-range native-state contacts, <D^{(i,j)}>_{short}, can therefore be written using Eq. (3), as follows: [...] _{c}> over all the pairs of positions i and j that form direct short-range native-state contacts and [...] <D^{(i,j)}>_{short}, which is a measure of the impact of positive design on stability, and [...] _{c} = 0 as it is for the case of long-range interactions.

Given that the sum of all the contact-frequencies is equal to some constant, α, we can write: [...] <D^{(i,j)}>_{long}, which is a measure of the impact of negative design on stability, and

Data set | Number of alignments | Average density of correlated mutations | S.D. |
‘Control’ proteins | 432 | 0.0018 | 0.007 |
GroEL-dependent substrates (class I) | 35 | 0.0027 | 0.007 |
GroEL-dependent substrates (class II) | 110 | 0.0040 | 0.006 |
GroEL-dependent substrates (class III) | 77 | 0.0051 | 0.009 |
Intrinsically unstructured proteins | 72 | 0.0103 | 0.023 |

Each data set comprises sequence alignments generated using a reference sequence belonging to one of the five groups listed.

The different SBSS corresponding to the 1081 different 5×5 lattice folds were subjected to correlated mutation analysis in order to determine whether correlated mutations are connected to the stabilization strategy. The analysis identified all 16 pairs of positions that are in contact in each of the 1081 folds, except for rare cases in which one or two contacts were not detected. In the case of the long-range interactions, the strength of the correlated mutations signal for a given fold was found to depend on the average contact frequency of its contacts. The different folds were divided into three equal-sized classes corresponding to different ranges of values of

The 1081 different folds of 25 residue-long sequences on a 5×5 lattice were ordered according to their average contact frequency, (

The apparent connection between the use of negative design and the prevalence of correlated mutations at positions involved in long-range interactions enables us to extend our analysis to real protein data. Calculating the contact-frequency parameter for a large number of real proteins is impractical owing to the huge size of their conformational spaces. We therefore analysed two sets of proteins for which there is good reason to assume that the average contact-frequency is high. The first set contains intrinsically unstructured proteins (IUP) that populate many conformations and are, therefore, likely to have relatively high values of contact-frequency since individual contacts probably stabilize many different conformations. The second set is based on the GroEL-interacting proteins found by Hartl and co-workers

The distributions of correlated mutation densities calculated using the tree-based method

The densities of correlated mutations were calculated for the sets of control proteins (A), classes I (B), II (C) and III (D) of the GroEL-interacting proteins and the intrinsically unstructured proteins (E). The density of correlated mutations of these sets increases with the increasing likelihood that their average ‘contact-frequency’ is high.

A key observation in this study (

The analysis in this paper is based on the premise that stabilization of short-range contacts reflects positive design whereas stabilization of long-range interactions reflects negative design. In lattice models, this assumption is correct since the energy of any native state is determined only by its contacts and, therefore, any stabilization due to long-range interactions must stem from destabilization of non-native states (i.e. negative design). In the case of real proteins, however, this assumption is not necessarily correct since long-range (e.g. electrostatic) interactions can also stabilize the native state. However, the correlated mutation results that we obtained for both the lattice models and real proteins showed the same trend and, therefore, we assume that the correlated mutations that are mostly between distant positions reflect negative design.

It is interesting that two different mechanisms for thermostabilization have also been revealed by comparing mesophilic proteins with their thermophilic homologs

In conclusion, in this study we subjected lattice model proteins to selection for stability and showed that the balance between positive and negative design strategies differs for each fold and depends on the average ‘contact-frequency’ of that fold. The use of negative design is found to increase with increasing values of the average ‘contact-frequency’ of the respective fold. Our results, therefore, indicate that each fold has its own stabilization potential that limits its ability to adapt to extreme conditions. We also showed that negative design in lattice models can be identified by correlated mutation analysis and is reflected in higher values of correlated mutation densities. This trend was also found in correlated mutation analysis of real proteins when comparing intrinsically unstructured proteins and chaperonin-dependent protein substrates to control proteins. Thus, we conclude that stabilization of real proteins with high values of average contact-frequency tends to rely more on negative design and is reflected in higher densities of correlated mutations.

A 2D lattice model similar to the one described before _{ij}, between neighboring lattice points (excluding consecutive residues in the sequence which are always neighbors), as follows:_{ij}) are as before

The free energy of folding, ΔG, of the native conformation of a sequence was calculated using [...] where P_{N} is the probability that the chain is in its native state N. This probability is given by:
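As a minimal sketch of this calculation, the snippet below assumes the standard two-state form ΔG = −kT ln[P_{N}/(1−P_{N})], with P_{N} taken as the Boltzmann weight of the native conformation among all enumerated conformations. The function names, the list of conformation energies and the choice kT = 1 are illustrative assumptions and are not taken from the model parameters used here.

```python
import math

def native_probability(energies, native_index, kT=1.0):
    """Boltzmann probability of the conformation at native_index.

    energies is a hypothetical list holding one energy per conformation.
    """
    e_min = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    return weights[native_index] / sum(weights)

def folding_free_energy(energies, native_index, kT=1.0):
    """Assumed two-state form: dG = -kT * ln(P_N / (1 - P_N))."""
    p_n = native_probability(energies, native_index, kT)
    return -kT * math.log(p_n / (1.0 - p_n))
```

With this sign convention a negative ΔG corresponds to a native state that is populated more than half of the time, which is consistent with the use of a threshold of zero for selecting stable sequences.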

Sets of sequences that have the same native conformation were generated. Two kinds of sets were generated for each of the 1081 lattice conformations. In the first set, the only requirement was that all the sequences comprising the set have the same particular native conformation. In the second set, we also required that the free energy of folding to the native state of the selected sequences be lower than some threshold value. Sets of the first type can be generated easily by classifying random sequences into different SBSS according to their native conformation. Sets of the second type could not be generated rapidly using this simple procedure and, therefore, we used a Monte Carlo (MC) maximization process of the following function: [...] where N_{c} and N_{non} are the total number of contacts and non-contacts in the specific conformation, respectively. In each step of the MC process, two residues in the sequence were randomly swapped and the swap was accepted if the Metropolis criterion [...] ΔG_{threshold} was found. We will refer to the SBSS that were generated by the first and second procedures as the sets without and with selection, respectively. Each set contained between 50 and 64 sequences. The average ΔG of folding for all the sequences in the sets without selection is approximately 2.6±0.6 and, therefore, the threshold, ΔG_{threshold}, for selecting sequences with stable native folds was set to zero.
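The swap-based MC search can be sketched as follows. This is an illustration under stated assumptions and not the procedure as implemented for this work: the `score` callable stands in for the maximized function, and `temperature`, `steps` and `target` are hypothetical parameters. Because only swaps are used, the amino-acid composition of the starting sequence is preserved throughout.

```python
import math
import random

def metropolis_swap_search(seq, score, temperature=1.0, steps=1000,
                           target=None, rng=None):
    """Maximize score(seq) by random residue swaps with Metropolis acceptance.

    score, temperature, steps and target are hypothetical stand-ins; the
    function maximized in the text is not reproduced here.
    """
    rng = rng or random.Random()
    current = list(seq)
    current_score = score(current)
    best, best_score = list(current), current_score
    for _ in range(steps):
        if target is not None and best_score >= target:
            break  # stop once a sufficiently good sequence is found
        i, j = rng.sample(range(len(current)), 2)  # two distinct positions
        current[i], current[j] = current[j], current[i]
        new_score = score(current)
        delta = new_score - current_score
        # Metropolis criterion for maximization: always accept improvements,
        # accept a worse sequence with probability exp(delta / temperature).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current_score = new_score
            if new_score > best_score:
                best, best_score = list(current), new_score
        else:
            current[i], current[j] = current[j], current[i]  # undo the swap
    return "".join(best), best_score
```

Returning the best sequence seen, rather than the last one accepted, is a design choice of this sketch; it guarantees that the returned score is never lower than that of the starting sequence.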

We calculated a perturbation energy, ΔΔG_{per} = ΔG_{wt}−ΔG_{m}, for every possible pair of positions in each sequence, where ΔG_{wt} and ΔG_{m} are the respective free energies of folding of the wild-type sequence before and after a particular short- or long-range pairwise interaction is ‘turned off’ without affecting any other interaction. Under ideal circumstances [...] ΔΔG_{per} for each pair of positions i and j was determined for each sequence in a SBSS and the average value,
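The perturbation energy of a single pair can be illustrated with the hedged sketch below. It assumes that each conformation is represented by its set of contacting position pairs, that `pair_energy` is a hypothetical dictionary giving the contact energy of each pair for one sequence, and that ΔG has the two-state form −kT ln[P_{N}/(1−P_{N})]. ‘Turning off’ a pair is modeled by omitting its energy from every conformation in which the two positions are in contact.

```python
import math

def folding_free_energy(energies, native_index, kT=1.0):
    """Assumed two-state form: dG = -kT * ln(P_N / (1 - P_N))."""
    e_min = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    p_n = weights[native_index] / sum(weights)
    return -kT * math.log(p_n / (1.0 - p_n))

def conformation_energy(contacts, pair_energy, off_pair=None):
    """Sum the pair energies over contacts, skipping the 'turned off' pair."""
    return sum(pair_energy[c] for c in contacts if c != off_pair)

def perturbation_energy(conformations, pair_energy, native_index, pair, kT=1.0):
    """ddG_per = dG_wt - dG_m for one pair of positions (i, j) with i < j."""
    wt = [conformation_energy(c, pair_energy) for c in conformations]
    mut = [conformation_energy(c, pair_energy, off_pair=pair)
           for c in conformations]
    return (folding_free_energy(wt, native_index, kT)
            - folding_free_energy(mut, native_index, kT))
```

In this toy representation, turning off a pair that is in contact only in the native conformation returns exactly that contact's energy, whereas turning off a stabilizing pair that is in contact only in non-native conformations returns a positive value because the competing states lose stability.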

The fraction of conformations in the ensemble in which residues at two positions in a sequence are in contact is termed the ‘contact frequency’. The contact frequency is sequence-independent and is a function only of the length of the protein, the positions of the two residues in the sequence and the lattice dimensions
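A contact frequency of this kind can be obtained by exhaustive enumeration on a small lattice. The sketch below is illustrative and rests on assumptions: the ensemble is taken to be all compact self-avoiding walks on an n×n lattice, rotations and reflections of the same fold are counted separately (which leaves the fractions unchanged), and the function names are hypothetical.

```python
from itertools import combinations

def compact_conformations(n):
    """All directed compact self-avoiding walks that fill an n x n lattice."""
    paths = []

    def extend(path, visited):
        if len(path) == n * n:
            paths.append(list(path))
            return
        x, y = path[-1]
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nxt[0] < n and 0 <= nxt[1] < n and nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                extend(path, visited)
                path.pop()
                visited.remove(nxt)

    for sx in range(n):
        for sy in range(n):
            extend([(sx, sy)], {(sx, sy)})
    return paths

def contact_frequencies(paths):
    """Fraction of conformations in which positions i and j are in contact."""
    length = len(paths[0])
    counts = {pair: 0 for pair in combinations(range(length), 2)}
    for path in paths:
        for i, j in counts:
            if j - i > 1:  # consecutive residues are excluded
                (xi, yi), (xj, yj) = path[i], path[j]
                if abs(xi - xj) + abs(yi - yj) == 1:
                    counts[(i, j)] += 1
    return {pair: c / len(paths) for pair, c in counts.items()}
```

On a 3×3 lattice every compact conformation has four contacts between non-consecutive residues, so the contact frequencies returned by `contact_frequencies(compact_conformations(3))` sum to 4. This is a small-scale instance of the sum of all contact-frequencies being a constant.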

Three data sets of real protein sequence alignments were generated: (i) the IUP set; (ii) the GroEL-interacting proteins set; and (iii) a control set of alignments of proteins that does not include any members of the first two sets or their homologs. The IUP data set was generated by downloading the DisProt database version 4.8 (

Multiple sequence alignments (MSA) corresponding to the above three sets were generated by searching the UniProt database

Correlated mutation analysis was carried out for both real protein sequences and lattice model sequences. In the case of the real proteins, our tree-based method

Scheme of a lattice model showing examples for (i) short-range interactions between residues in contact and (ii) long-range interactions between residues that are not in contact either directly or indirectly. Examples for pairs of residues involved in short-range interactions (e.g. 17 and 24) are indicated by the red line that connects the two residues in contact. Residues 8 and 24, for example, are in indirect contact since there is a path formed by residues in contact that connects them (8-19-14-17-24). By contrast, residues 2 and 13, for example, that are connected by the dashed arrow are defined as being involved in a long-range interaction since there is no path formed by residues in contact that connects them.
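The classification of pairs described in this scheme can be sketched as a connectivity test on the graph of native contacts. The code below is a minimal illustration: it assumes that a conformation is given as a list of lattice coordinates in chain order, that covalent bonds between consecutive residues do not count as contacts, and the function names and return labels are hypothetical.

```python
from collections import deque

def native_contacts(path):
    """Pairs of non-consecutive positions that sit on adjacent lattice points."""
    contacts = set()
    for i in range(len(path)):
        for j in range(i + 2, len(path)):
            (xi, yi), (xj, yj) = path[i], path[j]
            if abs(xi - xj) + abs(yi - yj) == 1:
                contacts.add((i, j))
    return contacts

def classify_pair(path, i, j):
    """Return 'short-range', 'indirect' or 'long-range' for positions i and j.

    Intended for non-consecutive positions only.
    """
    contacts = native_contacts(path)
    if (min(i, j), max(i, j)) in contacts:
        return "short-range"
    neighbours = {k: set() for k in range(len(path))}
    for a, b in contacts:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, queue = {i}, deque([i])
    while queue:  # breadth-first search over the contact graph
        for nxt in neighbours[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return "indirect" if j in seen else "long-range"
```

A pair is reported as long-range only when no chain of contacts links the two positions, which mirrors the distinction drawn above between residues in indirect contact and residues involved in a long-range interaction.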


We thank Dr. Yanay Ofran for useful comments on an earlier draft of this paper.