On topology and knotty entanglement in protein folding

We investigate aspects of topology in protein folding. For this we numerically simulate the temperature driven folding and unfolding of the slipknotted archaeal virus protein AFV3-109. Due to knottiness the (un)folding is a topological process, it engages the entire backbone in a collective fashion. Accordingly we introduce a topological approach to model the process. Our simulations reveal that the (un)folding of AFV3-109 slipknot proceeds through a folding intermediate that has the topology of a trefoil knot. We observe that the final slipknot causes a slight swelling of the folded AFV3-109 structure. We disclose the relative stability of the strands and helices during both the folding and unfolding processes. We confirm results from previous studies that pointed out that it can be very demanding to simulate the formation of knotty self-entanglement, and we explain how the problems are circumvented: The slipknotted AFV3-109 protein is a very slow folder with a topologically demanding pathway, which needs to be properly accounted for in a simulation description. When we either increase the relative stiffness of bending, or when we decrease the speed of ambient cooling, the rate of slipknot formation rapidly increases.


Introduction
Topological techniques are often very powerful. They are used with great success to analyze numerous problems in Physics, from theories of fundamental interactions to models of condensed matter. Here we propose to introduce topological techniques to study protein folding and dynamics, where thus far these techniques have been used only sparsely. As an example we analyze the formation of knots and self-entanglement in protein folding. A knot along a protein backbone is a delocalized structure, with a definite topological character. A knot can not be removed by any small amplitude local motion such as twisting, bending or crumpling of the protein backbone. A knotty self-entanglement persists as long as there is no backbone chain crossing, and provided the C and N terminals of the protein are not circumnavigated.
Topology is a global, collective property of a system. Thus, in the analysis of topological structures, such as knotty entanglements in a protein, global techniques that are based on topological concepts should be preferable. However, the prevailing ambition in the protein modeling community is to simulate the entire folding process as atomistically as possible, within the limitations of available computer resources. The commonly available techniques aim to describe protein folding primarily as a local process, with little if any regard to global aspects such as the backbone topology. Instead, global aspects of protein structure and dynamics are presumed to only reflect the local character of native contacts, and other local details. For example, in all-atom simulations the interaction range between individual atoms is commonly cut off beyond 10-12 Å, and in Gō-type models the range of interactions can be even shorter. Since knottiness is not a local, but a global characteristic of a protein backbone, we suspect that such a preference to localize all interactions is a reason why it appears to be very hard to simulate the folding of a knotty protein with a high success rate. At least unless one introduces some kind of bias such as funneling, steering or other method of (global) augmentation [5,6,11,12,19], to cross the topological barrier.
Here we introduce and develop global techniques, as a complement to the already existing local ones, to account for topological aspects of protein folding dynamics. For this we scrutinize the folding and unfolding processes of a protein with a knotty self-entanglement. Our approach builds on an effective theory description. We employ a mechanical free energy that describes the entire protein tertiary structure as a single topological object, in lieu of the individual atoms or other highly localized interaction centers. The protein is modeled as a topological multi-soliton solution of a nonlinear difference equation that determines the critical points of a mechanical free energy; the equation that identifies the critical points resembles the discrete nonlinear Schrödinger (DNLS) equation [20,21].
Soliton solutions of non-linear difference and differential equations are commonly encountered when searching for principles of structural self-organization in physical scenarios. The nonlinear Schrödinger equation is the paradigm equation for describing topological solitons. In our approach an individual topological soliton models a single super-secondary protein structure such as a helix-loop-helix or a strand-loop-strand motif. We use the DNLS equation to combine together several mutually interacting individual soliton profiles into a multisoliton that models the entire Cα backbone as a single entity. From the point of view of an energy landscape description [22], the multi-soliton is a stable attractor in the landscape of all conceivable protein structures. The mechanical free energy of the DNLS equation funnels unfolded protein structures to progress towards its minimum energy configuration, described by the multi-soliton.
Our multi-soliton description is computationally highly effective: When we start from a random chain, the folding simulation proceeds several orders of magnitude faster than in any other computational approach to protein folding that we are aware of. Thus we can perform numerical simulations to analyze the mechanism of protein self-entanglement and the dynamics of knot formation, with a very high level of computational efficiency using a simple laptop computer.
By its design, our mechanical free energy with its multi-soliton critical points models the protein backbone at very low temperature, with no consideration to thermal fluctuations. In order to study the folding and unfolding patterns, we subject the multi-soliton to a series of repeated heating and cooling simulations. For this we use Monte Carlo methodology and the Glauber algorithm that we adapt to proteins [23], to describe their non-equilibrium thermodynamics in the presence of variable ambient simulation temperature. In our simulation we follow how a knotted multi-soliton unfolds into a random chain when the ambient temperature increases. We then continue and follow how the random chain folds and self-entangles back into the knotty multi-soliton configuration, as the ambient temperature decreases. By varying the rate of temperature change we identify and investigate the different folding and unfolding patterns and pathways.
We start and construct a multi-soliton that models the three-dimensional Cα backbone of the folded AFV3-109 protein as a solution of the generalized DNLS equation. We use the crystallographic Protein Data Bank (PDB) structure with PDB code 2J6B [24] as a decoy. The AFV3-109 is an α/β protein that, when folded, supports a self-entangled structure akin the shoelace slipknot. Even though the slipknot is technically not a proper knot, it has nevertheless a topological character, with a large degree of structural stability under local backbone deformations; the topological character of the slipknotted AFV3-109 is similar to that encountered in other, more elaborately self-entangled knotty proteins [6,[17][18][19]. Moreover, our simulation results reveal that in the case of the AFV3-109 protein, the formation of the slipknot entails a trefoil knot as a folding intermediate. The AFV3-109 protein has been studied previously as a prototype example of a slipknotted protein [6,24], but these previous simulations have not reported of an intermediate trefoil structure. On the contrary, it is presumed that a slipknot itself occurs as an intermediate, along the folding pathway of more complex knot formation, including a trefoil [6,[17][18][19]].

Methods
We model protein structures in terms of topological solitons [20,21]. These are extended objects that emerge as a non-local solution to a set of partial difference equations, and are topologically stable against decay to a "trivial" solution. The pertinent partial difference equations describe critical points of an energy function, in an effective theory approach that builds on the (virtual) Cα protein backbone geometry. For this we use the formalism of discrete Frenet framing [25]: At the location of the i th Cα atom, a discrete Frenet frame comprises the mutually orthonormal backbone tangent (t i ), binormal (b i ) and normal (n i ) vectors, as follows where r i (i = 1, . . ., n) are the Cα coordinates. These vectors are subject to the discrete Frenet equation [25] n iþ1 Here the θ are the virtual backbone bond angles, they are computed as follows, The virtual backbone torsion angles ϕ are computed as follows, The T 2 and T 3 are 3 × 3 matrices that generate three dimensional rotations with (T i ) jk = � ijk , the permutation symbol. By substituting (1) in (3), (4) we can compute (θ i , ϕ i ) in terms of the Cα coordinates r i . Conversely, when we know (θ i , ϕ i ) and the distances between neighboring Cα atoms, we can reconstruct the Cα coordinates r i by solving (2). Here we assume that the distances between neighboring Cα atoms coincides with the average PDB value Once the Cα position have been determined, the remaining heavy atom structure can be fully reconstructed from the knowledge of the (θ, ϕ) coordinates [26][27][28][29][30][31], even in the case of a dynamical protein [32]. Thus, in the sequel we do not explicitely include the effects of side chains.
The set of all possible (θ, ϕ i ) values governs the entire conformational space of the Cα backbone. Various effective theory conformational free energy functions have been previously constructed in terms of these coordinates. Examples include the fully flexible chain model and its extensions [33][34][35], that are used widely in studies of biological macromolecules and other filamental objects. Thus, we also employ the Cα backbone angles (θ i , ϕ i ) of each and every Cα atom, as the dynamical variables to introduce a refined extension of these earlier models. Ours is the following mechanical free energy [23, 36- We refer to S1 File for additional discussion including background and motivation in terms of integrable models and soliton theory. For the present purposes the following is sufficient: The mechanical free energy (5) is designed to model the geometry of the natively folded Cα backbone as its minimal energy critical point: • The first sum over index i (that labels the n residues) coincides with the energy function of the discretized non-linear Schrödinger (DNLS) equation in the standard Hasimoto representation [20].
• The second sum over the index i extends the DNLS energy function in a manner appropriate for modeling a folded protein: The first term in this sum, with parameter c k , is a Proca mass term. In combination with the second term of the first sum, it constitutes the Kirchhoff energy of an elastic rod [34]. The second term, with parameter b k , is the conserved momentum in the integrable DNLS hierarchy. The third term, with parameter a k , is the conserved helicity in the DNLS hierarchy. These two terms break the parity symmetry, and make the backbone right-handed chiral (for positive a k , b k ).
• The last term in the free energy, a sum over two-body interactions V(r i − r j ), models long distance interactions along the chain. However, a topological soliton is also highly non-local, it is an extended topological object. Thus the topological multi-soliton that we construct to model the Cα chain also describes non-local effects due to long range interactions. Thus, to avoid double counting we include in this last term only a hard-core Pauli repulsion that prevents the chain from self-crossing. For this we use a step-wise profile that keeps any two Cα atoms at least 3.8 Å apart. We refer to [23] for detailed analysis of different choices V(r i − r j ).
• Finally, the sum with index k2solitons extends over all the individual super-secondary loop segments, such as a helix-loop-helix or a strand-loop-strand motif, that characterize the local Cα geometry. The parameters (λ k , m k , a k , b k , c k , d k ) have constant values in any individual motif, but these parameter values are in general different for different motifs, with different amino acid structures.
Note that all the contributions in (5) are functionals of the various tangent vectors t i : For the (θ i , ϕ i ) dependent terms this follows from (1), (3) and (4) and for the last term we use Accordingly, the free energy (5) can be interpreted as a leading order contribution in a systematic expansion of a full, complete free energy that depends on the various two-body interactions, through the distances r i − r j between all point-like interaction centers, expressed in terms of the tangent vectors t i .
To determine the various parameter values, we demand that the minimum energy critical point of (5) describes the natively folded Cα geometry, with a prescribed precision. The critical points of (5) are topological soliton solutions of the following generalized DNLS equation [40,41], An individual topological soliton models a segment in a super-secondary structure such as a helix-loop-helix or a strand-loop-strand motif with constant i.e. fixed-k parameters. For example, a right-handed α-helix has a À helix : and for the β-strand b À strand : A single soliton is characteristically a loop structure that interpolates between different regular structures such as α-helices and β-strands. But a long loop can also accommodate more than one single soliton. Along an individual single soliton structure the values of (θ i , ϕ i ) are variable while the parameters λ k , m k , a k , . . . have constant values, over the entire single soliton. The multi-soliton solution of (6, 7) combines the individual solitons into a single structure that models the entire protein tertiary structure, with different sets of k-dependent parameters that correspond to the different individual soliton segments. The number of individual soliton profiles that are introduced, depend on the desired precision of the multi-soliton: An increase in the number of individual solitons increases the number of parameters, and improves the precision. Since an individual soliton generically extends over several amino acids, the number of parameters in the multi-soliton is usually comparable to, in fact often even smaller than, the number of amino acids in the protein that it describes. A multi-soliton profile that describes a crystallographic protein structure with its experimental resolution, has usually far fewer parameters, and accordingly also much more predictive power, than e.g. Gō-like structure based models. In particular, the free energy function (5) does not engage any network of native contacts between individual atoms or other localized interaction centers. We aim to describe (un)folding experiments where the protein folding and unfolding processes are controlled by temperature variations, rather than by e.g. pH variations, or denaturants such as inorganic salts and organic solvents. This also sets our approach apart from Gō models [5,6] that fold the protein at a constant temperature.
There are many techniques to simulate temperature variations, and here we use a Markovian Monte Carlo algorithm. For the Monte Carlo update there are different alternatives, and we refer to [23] for a comprehensive algorithm comparison in the present model. We adopt the following variant: At each Monte Carlo step n we independently change either the torsion angle ϕ i or the bond angle θ i at a randomly chosen site i and compute the ensuing change ΔF in the free energy (5) DF ¼ Fðy nþ1 ; � nþ1 Þ À Fðy n ; � n Þ We then use a version of the Glauber algorithm [23] to determine the transition probability P between the two states In our simulations we adiabatically decrease and increase the value of the temperature factor β, to simulate heating and cooling. Note that the inverse Monte Carlo temperature factor β −1 can not be directly identified with the physical temperature factor k B T where k B is the Boltzmann constant and T is the ambient temperature measured in Kelvin. But for an equilibrium distribution the two can be related by methods of renormalization, as described in [42]. When the number of updates increases and the value of β is kept fixed, the Glauber algorithm is known to approach the Gibbsian equilibrium distribution at an exponential rate. Moreover, the Glauber algorithm models pure relaxation dynamics that for simple systems reproduces Arrhenius' law. At the same time, small proteins are known to fold according to Arrhenius' law [47].
In the case of the torsion angles ϕ i the change in their values at each Monte Carlo step is calculated according to where R is a random number with normal distribution centered at R = 0 and with dispersion ΔR = 1. The backbone bond angles θ i are known to be much stiffer than the torsion angles ϕ i . For this reason we calculate the change in their values using the heat bath algorithm described in [23]. Accordingly y nþ1 i is determined in terms of θ n using a random number with probability distribution in the interval [0, π) where F θ is computed from all the θ-dependent terms in (5). We relate the parameter β hb to the inverse Monte Carlo temperature factor β and in our production simulations we use β hb = 10 13 β.
The simulation algorithm consists of three stages; heating, thermalization and cooling. After each step we change the value of β in (8), by multiplying it with a constant, and the value of the constant determines the rate of heating/cooling. During the high temperature thermalization stage the number of Monte Carlo steps is always fixed to 7×10 6 steps and in this stage the value of β remains constant. During heating and cooling the number of steps can be variable and in the simulation described here we have 480×10 6 steps.
A full simulation ensemble consists of 1.000 independent heating and cooling cycles at the given temperature factor value. At the end of each simulation cycle we screen the final conformation for slipknotted structure using a variety of distance measures. These include the overall RMS distance between the final simulated structure and its crystallographic target, in combination with distances between individual key residues. In borderline cases we resort to visual inspection of the final structure, in order to detect/confirm the presence (or absence) of a knotted configuration.
In the case of the archaeal virus protein AFV3-109 studied here, the Protein Data Bank structure 2J6B that we use as a decoy for constructing the multi-soliton solution has 109 amino acids. The core of this α/β protein consists of a β-sheet with five β-strands that are connected by helices and loops. The residues 78-106 make up a looped segment that is threaded into a slipknot through a knotting loop that consists of residues 8-77: The Fig 1 shows the slipknot structure.
For identifying the soliton profiles we use the software GaugeIT and for determining the parameter values we use the software Propro. Both are accessible through https://protoin.ru/ propro/index.php.

Results
We have constructed the multi-soliton that describes the Cα backbone of 2J6B as a minimum energy critical point of the mechanical free energy (5). According to [24] there are 3 helical and 5 strand-like regular segments in the crystallographic structure. Thus, including the C and N terminals, there are at least 9 individual loops along the relatively short backbone This suggests us to start and introduce a division of the Cα backbone into 9 individual solitons. However, a scrutiny of the crystallographic 2J6B structure reveals that it has a more complex geometry. There are relatively long loops between residues 6-18, 57-74 and 81-91, and some of the β-strands are slightly bent. In order to describe the backbone with a precision that is comparable to the resolution of the crystallographic structure we then introduce additional soliton structures: We identify a total of 20 individual soliton profiles, with the long loops comprising more than a single soliton each and the β-strands supporting a soliton that accounts for their bending. In the S1 File we present the parameter values of the free energy function (5) that supports the multi-soliton solution as its energy minimum. The Fig 2 Panels a)-c) summarizes the multi-soliton model we have constructed. Fig 2a) shows an overlay comparison between the Cα trace of the PDB structure 2J6B and its multi-soliton model. The root-meansquare-distance (RMSD) between the two is 1,23 Å, comparable with the reported resolution 1.3 Å of the experimental structure. Fig 2b) shows a comparison of the Cα trace bond angles (3) and torsion angles (4) between the PDB structure and its multi-soliton model. Fig 2c) identifies the secondary structures (according to PDB) and shows how we divide them into 20 individual solitons.  6b) we identify this simulation with an arrow, among all the displayed simulations).
The heating always starts from the multi-soliton that models 2J6B; the folding events we observe are fully reversible. The solid lines in the Fig 3 denote the evolution of the mean value over all structures in the simulated ensemble. The spread around each solid line shows the extent of one standard deviation around the mean. We note that in the high temperature stage we have random structures that reside in the phase of a self-avoiding random walk, with no regular structural details and in particular no structural resemblance to the folded conformation.
The lower horizontal axis in the Fig 3 gives the logarithmic value of the temperature factor β during the simulation. The upper horizontal axis gives a corresponding Celsius value, that we deduce from a comparison with myoglobin simulations in [43]; the Celsius value should not be taken literally, it is intended to be suggestive, and for an accurate comparison between the simulation temperature and the corresponding Celsius value additional folding and unfolding experiments need to be performed over an extended range of temperature values. For a detailed analysis how to derive a relation between the temperature factor β and the physiological temperature value measured in Celsius, we refer to [42].
The Fig 3a) and 3b) show temperature dependence of the Cα root-mean-square distance (RMSD) between the 2J6B crystallographic structure and the multi-soliton, during the heating a) and cooling b) phases of the simulation. The final average RMSD value shown in the Fig 3b) at the end of the cooling has *1.3 Å deviation from the initial multi-soliton, with *2.5 Å one-sigma spread around the average. This deviation is comparable to the resolution in the experimental data; notably the final ensemble of our simulation includes all the final structures, including the small portion of configurations that do not support a slipknot. In Fig 3c) and 3d) we show the corresponding results for the radius of gyration value R g .
In the Fig 3e) we show the evolution of R g at low temperatures, during the very early stages of the heating simulation. We observe a very slight but systematic initial decrease in the R g value. This means that when the chain starts unknotting, its effective volume initially shrinks, before the chain then expands and unfolds: The initial slipknotted multi-soliton structure appears to be (slightly) swelled in comparison to an unknotted but collapsed structure, even though the free energy of the slipknot has a lower value. As shown in Fig 3f) we observe a corresponding, systematic but very small increase of R g i.e. swelling at the end of the cooling period, before the energy minimum is reached.
Even though the swelling is very small in terms of the radius of gyration value R g , at the visual level of the structure the effect is much more clear. The Fig 4 shows a structure at the minimal value of R g together with the fully folded conformation; the configuration with minimal R g value appears more compact.   3. Panel a) shows the evolution of radius of gyration R g of the multi-soliton during the heating stage, and Panel b) shows the R g evolution during the cooling stage. The solid line shows the mean value of the simulation ensemble that consist of 1.000 independent cycles, and the spread denotes one standard deviation. Panel c) and d) show the corresponding results for the root-mean-square distance (RMSD) to the multi-soliton solution. Panels e) and f) are close looks to the R g values, at early/late stages of heating/cooling. There is a clear albeit very slight shrinking at the beginning of the unfolding process, and a similar swelling just before the final slipknot forms. The unfolding always starts from the multi-soliton. But for folding statistics the entire simulation ensemble is included, including those that do not form a slipknot. https://doi.org/10.1371/journal.pone.0244547.g003 A swelling that is caused by knottiness has been previously reported in an analysis of knotted crystallographic PDB structures [44].
We have analyzed the folding and unfolding transitions by following the temperature dependent fluctuations Δϕ in the values of the torsion angle (4). In terms of Δϕ the initial unfolding process appears to start at the location of the β-bridge that is centered at the proline with PDB residue number 61. In Fig 5a) we show how initially, at very low temperature values, the amplitude of the thermal fluctuations Δϕ(61) slowly increases. This increase coincides with the decrease in the radius of gyration R g shown in Fig 3e) and at the same time we observe that the slipknot starts opening. When the temperature factor reaches a value log 10 β � 13.0 we observe a sharp order-disorder transition in the fluctuation amplitude Δϕ(61). At this β-value the slipknot opens, and becomes converted into a trefoil knot shown in Fig 5b); see also S1 Movie. When the temperature factor further increases, the trefoil unknots and the chain starts unfolding, with a rapid, sharp increase in the radius of gyration value R g as shown in Fig 3e).
The folding transition proceeds similarly but in the opposite direction, the slipknot forms through an intermediate trefoil knot.
The Fig 6a) shows the simulated temperature dependence in Δϕ along the entire chain. The first entire regular structure that melts is the β-strand that is located between PDB residues 74-80. This strand forms the edge of the looped segment that is threaded into a slipknot. In general, the β-strands appear to unfold at lower temperatures than the α-helices, and the third αhelix between residues 93-99 appears to be the last to unfold as temperature increases.
Previous authors have investigated the folding of knotted proteins, mostly using different variants of Cα structure based Gō models. These authors do not report on trefoil knot folding intermediate, but they report significant difficulties in reaching the natively folded knotty conformation. Apparently, a knotted fold either does not appear at all [5] without the help of funneling or other kind of augmentation during the folding process, or then it occurs only very rarely, in about 1-2% of folding simulations [6] (2.8% in the case of AFV3-109 [6]). We have performed a series of folding and unfolding simulations under different conditions and Fig 6b) summarizes our results: When cooling proceeds very quickly the relative number of slipknots is small. However, when the cooling process becomes progressively slower, the number of slipknotted final conformations increases rapidly and as shown in the Fig 6b) we reach https://doi.org/10.1371/journal.pone.0244547.g004 around 95 per cent success rate in the long and slow cooling simulations that we have performed. Thus we conclude that the slipknotted AFV3-109 is a slow folder [45,46], proceeding through a trefoil knot folding intermediate. Moreover, since we observe both a trefoil knot folding intermediate and a light swelling at the final stages, the folding of AFV3-109 is not a pure relaxation (Arrhenius) process [47].  Finally, when we increase the relative stiffness of the bond angles, the rate of the slipknotted final conformations increases: We always observe both the trefoil folding intermediate and the slipknot final fold, at the end of our long and slow simulations in the limit where the bond angles have constant values during the entire heating and cooling process, when only torsion angles are truly mobile. In actual proteins the virtual Cα backbone bond angles are known to be very stiff and the torsion angles are known to be quite flexible. Our simulation results show that when one properly accounts for the relative stiffness/flexibility of the bond and torsion angles, the Cα backbone of AFV3-109 practically always folds into a slipknot; misfolded conformations are indeed very rare.

Conclusions
Topology and in particular self-entanglement play an important role in protein folding and dynamics. But topological effects are difficult to investigate. Moreover, conventional simulation approaches aim to describe a protein and its folding as a local process, at atomic level precision. Due to limitations in available computational resources it then becomes very difficult to detect large scale collective motions and global, topological phenomena in the conventional simulation approaches. Nevertheless, knotty and other kind of self-entanglements are often important to protein stability, and presumably also important for the correct biological function. Thus there is value to develop global, topological approaches to protein folding and dynamics, as a complement to local, atomic level scrutiny based approaches.
We have developed a global technique that is rooted on topological concepts, to analyze and describe the formation of topological structures in proteins, in particular aspects of knottiness and self-entanglement. For this, we have modeled the entire Cα backbone of a protein in terms of a single topological multi-soliton entity; the multi-soliton describes the minimum of a mechanical free energy. As a case study, we have investigated the folding and unfolding of the slipknotted AFV3-109 protein, instigated by variable ambient temperature, using powerful state-of-art Monte Carlo techniques of non-equilibrium thermodynamics. We have found that the multi-soliton describes the formation of the slipknot very accurately, and we are able to describe the folding pathways and make predictions on the physical origin of knot formation. In paricular, we have been able to observe a trefoil knot as a folding intermediate. Our results demonstrate the value of developing global approaches to protein folding and dynamics; global approaches are highly accurate, and even though they may lack in atomic level details they appear to correctly capture the global, topological aspects of self-entanglement during protein folding and dynamics.
Supporting information S1 File. Theoretical/Technical background and data file. (PDF) S1 Movie. Unfolding of slipknotted structure, with increasing temperature, as described in the text.