Long-Range Epistasis Mediated by Structural Change in a Model of Ligand Binding Proteins

Recent analyses of amino acid mutations in proteins reveal that mutations at many pairs of sites are epistatic—i.e., their effects on fitness are non—additive—the combined effect of two mutations being significantly larger or smaller than the sum of their effects considered independently. Interestingly, epistatic sites are not necessarily near each other in the folded structure of a protein, and may even be located on opposite sides of a molecule. However, the mechanistic reasons for long–range epistasis remain obscure. Here, we study long–range epistasis in proteins using a previously developed model in which off–lattice polymers are evolved under ligand binding constraints. Epistatic effects in the model are qualitatively similar to those recently reported for small proteins, and many are long–range. We find that a major reason for long–range epistasis is conformational change—a recurrent theme in both positive and negative epistasis being the transfer, or exchange of material between the ordered nucleus, which supports the binding site, and the liquid–like surface of a folded molecule. These local transitions in phase and folded structure are largely responsible for long–range epistasis in our model.


Introduction
Epistasis in protein biophysics refers to the non-additive effects of amino acid mutations on protein folding and function [1]. An epistatic interaction is said to occur between two mutations when their combined effect on a trait is either larger or smaller than the sum of their effects considered independently. For example, a mutation that, by itself, damages the biochemical function of a protein may have a neutral or beneficial effect when considered in the presence of another mutation-a form of positive epistasis. Conversely, mutations that have a neutral, or nearly neutral effect on function individually, may produce a damaging effect in combination-a form of negative epistasis. Recent analyses of amino acid mutations in proteins reveal that mutations at many pairs of sites are epistatic, and suggest that epistasis plays a significant role in protein evolution [1][2][3][4][5][6][7][8][9][10][11][12][13]. Interestingly, epistatic sites are not necessarily in contact with each other in the folded state of a protein, and may even be located on opposite sides of a molecule. However, because the partially folded and mis-folded ensembles resulting from epistatic mutations are difficult to study, the mechanistic reasons for long-range epistasis remain somewhat obscure.
In this work, we investigate long-range epistasis using a previously developed model in which polymers were evolved to imitate the behavior of small ligand binding proteins [14]. The model re-capitulates basic properties of evolved proteins, such as folding to an ordered, soluble native structure, maintenance of amino acid sequence complexity [15], linear rates of amino acid change as a function of solvent exposure [16], packing density [17], and distance from the binding site [18], and linear rates of structure divergence as a function of the number of accepted mutations [19]. Below, we sample epistatic effects in evolved polymers by random selection of pair mutations, and we study the folded and mis-folded ensembles of single and double mutants in instances of significant epistasis-epistasis being measured in terms of the probability of folding a structure in which the binding site is correctly formed. Epistatic effects in the model resemble those reported by Olson et. al [2] for the small IgG-binding domain of protein G (protein GB1), and many are long-range. We find that a major reason for longrange epistasis is conformational change: In positive epistasis, either or both mutations disrupt the binding site, however, the double mutant folds to a locally re-configured native state in which the binding site is maintained. In negative epistasis, a single mutation leads to a neutral, or slightly deleterious change in native structure; This change conflicts with a second (formerly neutral, or slightly deleterious) mutation, leading to more frequent mis-folding of the binding site. A recurrent theme in both positive and negative epistasis is the transfer, or exchange of material between the ordered nucleus, which supports the binding site, and the disordered surface of a molecule, reminiscent of the theory of allostery in protein domains [20]. These local transitions in phase and structure are largely responsible for long-range epistasis in our model. Alternatively, neutral, or slightly deleterious mutations that preserve the native structure can conspire to frustrate formation of the binding site during folding. In this case, a mutation that increases the energy of the native fold relative to mis-folded states, or provides for greater conformational freedom during folding, amplifies the negative effect of a second mutation, presumably as a result of proximity to a thermodynamic phase transition [1,2]. We find, however, that this process does not necessarily result in a complete change of phase.
In the following, we describe the distribution of epistatic effects in our model, and then explore the mechanisms of positive and negative epistasis in specific examples. First, we briefly outline our model; A more detailed description can be found in the Methods section.

Model
The polymer model is a chain of point monomers that interact as low resolution amino acids via spherically symmetric pair potentials. Polymers evolve kinetically by Langevin dynamics. The fitness of a sequence is determined by folding N ¼ 127 polymer replicas on a parallel computer and analyzing the resulting ensemble of structures: Each replica is initiated from a random coil state below the folding temperature of a typical viable sequence. The amount of time allowed for folding is determined by the number of amino acids in a sequence, N, according to an estimate provided by Lin and Zewail [21]. The temperature is then reduced substantially, the replicas are equilibrated for a short period, and a final ensemble of folded and quenched structures, Γ, is recovered.
To obtain the sequences studied in this work, polymers were first evolved to recover an ordered (but otherwise un-restricted) folding domain, as determined by the Lindemann melting criterion [22]. To apply the Lindemann criterion, we select a structure x ? and an ensemble DG ? of 3N =4 replicas to represent the dominant energy basin recovered by the folding procedure. Here, x ? plays a role analogous to the equilibrium (lattice) positions of atoms in a crystal.
The structure x ? and the ensemble DG ? are selected to minimize the structurally aligned distance between x ? and replicas x μ included in DG ? . To determine distance, the structures x μ are aligned to x ? by rotation, translation, and reflection through the closest 2N/3 pairs of monomers using methods described in reference [14]. Finally, sequences are selected to recover at least 15 ordered monomers, where order is measured by fluctuations of monomer positions x m j in structures x m 2 DG ? against their positions x ? j in the reference structure x ? . Under this condition, polymers spontaneously evolve ordered surface cavities (putative binding sites), resembling the active sites of small enzymes [23]. These "enzyme-like" sequences are subsequently evolved to recover an ordered binding site compatible with a model ligand, with the requirement of folding an ordered domain relaxed. To enforce the binding condition, the ligand is optimally docked onto the binding sites of replicas folded by a given sequence (see Methods). If the distances between amino acids in the binding site of a replica (including the ligand) are each within 1 Angstrom of their corresponding distances in a pre-defined target state, the replica is considered "active". The fitness of a sequence is then defined by the fraction of active replicas, P ¼ N ? =N , recovered by the folding and docking procedures.
In earlier work, we found that ligand binding, as defined above is, by itself, sufficient to maintain an ordered folding domain during evolution [14]: In order to maintain a given binding site structure against thermal fluctuations, it is necessary to maintain an ordered nucleus as a scaffolding to support the binding site. As a result, both binding affinity and thermodynamic stability are implicated in the fitness parameter P.

Results
Below, we select two evolved sequences from these simulations to explore epistasis in the model (S1 and S2 Figs). In each sequence, pairs of sites are selected at random and subjected to random mutations roughly consistent with the genetic code (see S3 Fig and ref. [24]). For each pair of mutations, we compute the change in fitness, ΔP ν = P ν − P 0 where P ν is the fitness of a (single or double) mutant sequence, and P 0 is the fitness of the initial evolved sequence. Epistasis is measured as in obvious notation. To estimate the uncertainty, or error in a single fitness measurement (i.e., against the value that would be obtained in the limit N ! 1), we computed the width, σ(P), of the fitness distribution, ω(P), for a number of different sequences. The distribution of fitness values for an evolved sequence is shown in Fig 2. From these considerations, we find that the typical error, δP, in a measurement of P (i.e., the typical width of a distribution, h σ(P)i) is about δP ' 0.037. The value of P 0 used in Fig 1 is based on 10 3 measurements, and consequently, the errors in ΔP ν are essentially the same as the errors in P ν . If errors in P ν are considered independent, the error in ΔP 1 + ΔP 2 can be estimated as ffiffi ffi 2 p dP, and the error in can be estimated as d ' ffiffi ffi 3 p dP. In Fig 3, we plot the distribution of values, ω(), for points in Fig 1 that satisfy either ΔP 12 ! λ, or ΔP 1 ! λ and ΔP 2 ! λ, with λ = −0.2 (i.e., for positive epistasis, the double mutant is neutral, or slightly deleterious, and likewise for single mutants in negative epistasis. This cut through the distribution is somewhat arbitrary, and is simply intended to include mutants that have a chance of becoming fixed in evolution). The width of the distribution, σ(), is about 3 times the error in a measurement of . This result is maintained for essentially all values of λ (Fig 4). As a result, about 30 percent of the data points exhibit statistically significant epistasis (jj> $ 3d) in rough agreement with the results of Olson et. al [2] for protein GB1 (see also ref. [1]). Finally, in Fig 5 we plot the distribution of values for points in Fig 1 as a function of the distance, R, between mutations in the initial fold. Amino acids begin to interact directly when R < 1.5l, where l = 3.8 Angstroms is the length of a polymer bond (see Methods). As is evident by inspection of Fig 5, many of the samples exhibit pronounced long-range epistasis, with R > 1.5l and ) δ. The variation in values decreases with R, similar to the results for protein GB1, however, positive epistasis is more prevalent than negative epistasis in the model. Similar results to those above are obtained for the sequence in S2 Fig. To conclude our analysis, we examined the folded ensembles of 12 samples exhibiting various levels of positive and negative epistasis. For each sample, fitness values, P ν , were re-computed by averaging 10 2 measurements to minimize the error in ΔP ν . Samples were selected essentially at random to include a range of values of .
In all 6 instances of positive epistasis (in particular, even when epistasis is weak) the double mutant is found to adopt a locally re-configured native structure that preserves the binding In this example, both single mutants recover disordered ensembles in which the binding site is disrupted: The mutation R1 ! G1 destabilizes the charged part of the ordered scaffolding (red and orange) that supports the binding site, removing a charged amino acid; The mutation T12 ! I12, while adding a hydrophobic amino acid, leads to competing nuclear structures. In the double mutant, the destabilizing effect of the first mutation is compensated by the second through local changes in the structure of chain segments containing the mutated A similar example is shown in Fig 7. Here, the mutation W4 ! S4 removes a hydrophobic monomer from the nuclear scaffolding leading to large fluctuations in the binding site; The mutation G10 ! V10, while adding a hydrophobic amino acid to the nucleus, leads to competing nuclear structures during folding (originally, G10 is disordered). In the double mutant, V10 replaces W4 in the nucleus, and the mutated amino acid S4 is disordered. As in Fig 6, the double mutation leads to the re-configuration of loops containing the mutated sites, exchanging material between ordered and disordered phase regions of the folded ensembles. A slightly different mechanism of long-range positive epistasis is shown in Fig 8. Here, the mutation A23 ! E23 has a stabilizing effect once the charged amino acid E6 is removed from the nuclear scaffolding by the mutation E6 ! G6. Both A23 and E23 are ordered in their respective ensembles. In the double mutant, material is transferred from ordered to disordered phase regions of the folded ensembles by the mutation E6 ! G6.
To study negative epistasis, we selected samples in which single mutations are neutral or slightly deleterious. Fig 9 provides an example of long-range negative epistasis in which the structural change caused by one mutation is frustrated by a second, formerly neutral mutation. Panel (A) describes the folded structure of the initial sequence, while panels (B) and (C) Long-Range Epistasis Mediated by Structural Change in a Model of Ligand Binding Proteins describe the folded structures of the single mutants T15 ! R15 and W4 ! C4 respectively. The mutation T15 ! R15 adds a charged amino acid to the nuclear scaffolding, which brings the mutant amino acid R15 in contact with W4; The interaction between R15 and W4 is attractive, and the amino acid T15 is disordered in the initial ensemble. The second mutation, W4 ! C4, is neutral, and leaves the native structure essentially intact. However, the interaction between C4 and R15 is repulsive, which frustrates the transfer of R15 to the nucleus in the double mutant, disrupting the binding site. This pattern is repeated in 2 of the examples we studied. Fig 10 describes an example of weak, but very long-range negative epistasis in which an exchange of material between ordered and disordered phases is frustrated. Again, panel (A) describes the folded structure of the initial sequence, while panels (B) and (C) describe the folded structures of the single mutants Q25 ! E25 and E6 ! G6 respectively. The mutation E6 ! G6 removes a charged amino acid from the nuclear scaffolding. In the mutant ensemble, the amino acid G6 is disordered, however, this has no effect on fitness. The mutation Q25 ! E25 adds an ordered charged monomer; However, in the mutant ensemble, the interactions between E25 and its neighbors are almost all repulsive. In the double mutant, the removal of E6 allows E25 to form favorable contacts with other charged monomers during folding, which leads to more frequent mis-folding of the binding site.
The remaining samples of negative epistasis conform to the process described earlier above, in which a neutral, or slightly deleterious mutation that preserves the native structure amplifies the negative effect of another. An example of this effect exhibiting strong epistasis is provided in S9 Fig. In this sample, the mutation, D19 ! Y19, reduces the specificity of interactions with its neighbors, allowing for greater conformational freedom of the binding site, making it more susceptible to the negative effect of the mutation, N13 ! D13; The mutated amino acids Y19 and D13 form favorable contacts in mis-folded states of the replica ensemble.

Discussion
To summarize, many instances of long-range epistasis in our study involve local structural changes that transfer or exchange material between ordered and disordered phase regions of the initial and double mutant ensembles. In positive epistasis, the loss of an amino acid on the ordered scaffolding which supports the binding site creates a "vacancy" that is compensated by a second amino acid in the double mutant. The second amino acid either occupies a site close to the vacancy (Fig 6) or fills the vacancy with a similar amino acid (Fig 7). In negative epistasis, the filling of an existing vacancy on the ordered scaffolding is frustrated by a second mutation (Fig 9). In these examples, epistatic mutations connect local environments in a folded molecule that are dis-connected in the initial evolved structure. This may explain why long-range epistatic interactions are difficult to predict [25]: If the environments of mutated residues change significantly in epistasis, the information needed to predict epistatic interactions will depend on the  environments of residues formed in the mutant structures, which will often be excluded from the evolutionary record by purifying selection. In addition, because computer modeling techniques are not sufficiently advanced to predict these changes, it is actually necessary to solve the folded structures of the mutant proteins to reveal the causes of epistasis. As a result, these effects, if they are detected, are very difficult to interpret, and may be more prevalent than expected.  Of course, our model is rather small, and neglects many of the constraints present in folded proteins, such as secondary structure. While this is not an obstacle for modeling short proteins (which are commonly devoid of significant secondary structure), secondary structure provides additional stiffness in larger proteins, which can lead to longer range collective effects such as allostery [26], and more exotic mechanisms for epistasis. Accordingly, the structural changes exhibited by our model are probably limited to more flexible regions in larger proteins.

Methods
The polymer model is a chain of point monomers that interact as low resolution amino acids via spherically symmetric potentials. Interactions along the chain are described by potentials of the form, where r is the distance between monomers, l is the equilibrium length of a link, and κ is a constant (see below).
Interactions between non-adjacent monomers along the chain are constructed from the unit Morse potential, The attractive minimum of the Morse potential occurs at r = l. Let and U 0 !0 ðrÞ ¼ m r l ðrÞ þ Wðl À rÞ þ 0 exp ðÀ aðr À lÞÞ ð7Þ respectively (Fig 11). Each potential consists of an excluded volume part, μ r l (r) + ϑ(l − r), modulated by the parameter , and a sequence dependent part, modulated by the parameter 0 ; The parameter 0 takes on different values, depending on the amino acid types involved in an interaction, where E μν is the energy of a contact between amino acids μ and ν defined by the empirical parameters in reference [27], and E o = hjE μν!μ ji is the average strength of an interaction (the empirical parameters are obtained Long-Range Epistasis Mediated by Structural Change in a Model of Ligand Binding Proteins by re-scaling the Miyazawa-Jernigan parameters [28] using threonine as a reference solvent [27]). The potentials for unit strength attractive and repulsive interactions are plotted in Fig 11. To describe polymer kinetics, we integrate the Langevin equation using the method of van Gunsteren and Berendsen [29], with monomer mass m = 1.66Á10 −22 g, friction coefficient γ = 10 ps −1 , and integration time step, Δt = 0.01 ps. The parameters used to define the potentials are l = 3.8 Angstroms, κ = 11 k B T 0 , α = 2.1 Angstroms −1 , and = 2 k B T 0 , where k B is Boltzmann's constant and T 0 = 302.15 Kelvin.
Folding is initiated from a random coil state below the folding transition temperature of a typical evolved sequence, which we estimate as T f * 1.25T 0 from specific heat data. The time allowed for folding is determined by the length of the polymer according to the estimate of Lin and Zewail [21], where Δt f = 10 ps roughly describes the timescale for positional exchanges among monomers on the surfaces of polymer nuclei. Following this step, the replicas are equilibrated for a short time t q = t f /3 at temperatures T 1 = 218.2 Kelvin and T 2 = 134.3 Kelvin.
To determine the number of active replicas in the folded ensemble, it is necessary to dock the model ligand onto the binding site structures recovered by replicas in the folding procedure. To accomplish this, the folded structure of a replica is enclosed in a spherical shell consisting of *10 4 evenly distributed points [30]. We then measure, and record the energy of the target ligand (in this work, a single monomer) at each point on this shell. In this procedure, interactions with monomers in the binding site group are considered attractive, and are described by unit Morse potential, μ(r), while interactions with monomers not included in the binding site group are described by the repulsive core of the Morse potential, μ r l (r). The radius of the shell is reduced, and the energies are re-computed at each point, iteratively, until the shell lies inside the folded structure of the replica. The structure of the binding site complex (i.e. binding site plus ligand) is determined from this sweep as the configuration with minimal energy. A docked replica is considered active when the distances between amino acids in the binding complex are each within 1 Angstrom of their corresponding distances in a predefined target state, determined by averaging the states of replicas with properly formed binding sites recovered by a selected initial (evolved) sequence. The structures used to represent the dominant energy basin recovered by a sequence are obtained using the structural alignment procedure described in ref. [14].
Supporting Information S1 Fig. (A) Folded structure x ? , and (B) sample of the folded ensemble DG ? recovered by a selected sequence evolved under ligand binding conditions. Amino acids (monomers) are colored blue, light blue, blue-green, green, yellow, orange, and red, in order of increasing affinity to solvent. The binding site monomers and the target ligand (here, a single monomer) are colored black. The ensemble DG ? is aligned to x ? using methods described in ref. [14].  Fig. Amino acid exchange probabilities, p(μ, ν) = A μν / ∑ ν A μν , for evolved sequences in ref. [14], where A μν is the number of transitions recorded between amino acids μ and ν.
Model values are indicated by filled red circles. Empirical values obtained from the data of Dayhoff et. al [24] are indicated by open blue circles. The value of p(μ, ν) is indicated by the radius of the corresponding circle. In random sampling of pair mutations, amino acid transitions are allowed when p(μ, ν)>0. (TIFF) S4 Fig. Folded ensembles of initial and mutated sequences corresponding to Fig 6. Panel (U) corresponds to the initial, un-mutated sequence. Panels (1) and (2) correspond to the single mutants R1 (orange) ! G1 (green) and T12 (yellow) ! I12 (light blue), respectively. Panel (12) corresponds to the double mutant. Dotted spheres indicate the positions of mutated amino acids. Each ensemble DG ? is aligned to its corresponding reference fold, x ? , using methods described in ref. [14] For clarity, each figure (1) and (2) correspond to the single mutants W4 (blue) ! S4 (yellow) and G10 (green) ! V10 (blue), respectively. Panel (12) corresponds to the double mutant. Ensembles are arranged as described in S5  (1) and (2) correspond to the single mutants Q25 (orange) ! E25 (red) and E6 (red) ! G6 (green), respectively. Panel (12) corresponds to the double mutant. Ensembles are arranged as described in S5 Individual mutations N13 (yellow) ! D13 (red) and D19 (red) ! Y19 (blue-green) in panels (B) and (C) are nearly neutral, with DP 1 ' À 0:07 and DP 2 ' À 0:1 respectively (DP 12 ' À 0:43). Both N13 and D19 are ordered in the initial ensemble, and both D13 and Y19 are ordered in the single mutant ensembles. Single mutations do not significantly alter the reference structure of the initial sequence. In the double mutant, the mutation, D19 ! Y19, reduces the specificity of interactions with its neighbors, allowing for greater conformational freedom of the binding site, making it more susceptible to the negative effect of the mutation, N13 ! D13; The mutated amino acids Y19 and D13 form favorable contacts during folding, which disrupts the binding site in the quenched ensemble. The distance between mutated positions in panel (A) is R ' 7:3 Angstroms. (TIFF) S1 File. Text files of the data points and structures shown in Figs 1-10 and S1-S9 Figs. Structural ensembles are provided in Protein Data Bank (PDB) format. In each ensemble file, structures (models) are aligned to the reference fold (first model in a given file), and are arranged in order of decreasing alignment quality. Structures are aligned through the closest 2N/3 monomers as described in the Text. (GZ)

Author Contributions
Conceptualization: EN NG.