Polymer Uncrossing and Knotting in Protein Folding, and Their Role in Minimal Folding Pathways

We introduce a method for calculating the extent to which chain non-crossing is important in the most efficient, optimal trajectories or pathways for a protein to fold. This involves recording all unphysical crossing events of a ghost chain, and calculating the minimal uncrossing cost that would have been required to avoid such events. A depth-first tree search algorithm is applied to find minimal transformations to fold , , , and knotted proteins. In all cases, the extra uncrossing/non-crossing distance is a small fraction of the total distance travelled by a ghost chain. Different structural classes may be distinguished by the amount of extra uncrossing distance, and the effectiveness of such discrimination is compared with other order parameters. It was seen that non-crossing distance over chain length provided the best discrimination between structural and kinetic classes. The scaling of non-crossing distance with chain length implies an inevitable crossover to entanglement-dominated folding mechanisms for sufficiently long chains. We further quantify the minimal folding pathways by collecting the sequence of uncrossing moves, which generally involve leg, loop, and elbow-like uncrossing moves, and rendering the collection of these moves over the unfolded ensemble as a multiple-transformation “alignment”. The consensus minimal pathway is constructed and shown schematically for representative cases of an , , and knotted protein. An overlap parameter is defined between pathways; we find that proteins have minimal overlap indicating diverse folding pathways, knotted proteins are highly constrained to follow a dominant pathway, and proteins are somewhere in between. Thus we have shown how topological chain constraints can induce dominant pathway mechanisms in protein folding.

One conceptual refinement to arise from theoretical and simulation studies is the study of ''good'' reaction coordinates that correlate with commitment probability to complete the protein folding reaction [36,[46][47][48][49][50][51]. Reaction coordinates must generally take into account the energy surface on which the molecule of interest is undergoing conformational diffusion [52][53][54], and the Markovian or non-Markovian nature of the diffusion [55,56]. In a system with many degrees of freedom on a complex energy landscape and obeying nontrivial steric restrictions, finding a best reaction coordinate or even a good reaction coordinate is a difficult task. Finding reaction paths between metastable minima is an old problem, in which many approaches have been developed to account for the underlying complex, multi-dimensional potential energy surface [57][58][59][60][61][62][63].
An alternate approach, in the spirit of defining order parameters in statistical and condensed matter physics, is to consider the geometry of the product and reactant in defining a reaction coordinate without reference to the underlying potential energy landscape. The overlap function q of a spin-glass is an example of a geometrically-defined order parameter [64], for which the underlying Hamiltonian determines behavior such as the temperature-dependence. We pursue such a geometric approach in this paper.
A transformation connecting unfolded states with the native folded state can be considered as a reaction coordinate. A transformation can also be used as a starting point for refinement, by examining commitment probability or other reaction coordinate formalism.
In this paper we consider transformations between polymer conformation pairs that would not be viable by a conjugategradient type or direct minimization approach, in that dead-ends would inevitably be encountered. We focus specifically on how one might find geometrically optimal transformations that account for polymer non-crossing constraints, which would apply to knotted proteins for example.
By a geometrically optimal transformation, we mean a transformation in which every monomer in a polymer, as represented by the a-carbon backbone of a protein for example, would travel the least distance in 3-dimensional space in moving from conformation A to conformation B. This is a variational problem, and the equations of motion, along with the minimal transformation and the Euclidean distance covered, have been worked out previously [72][73][74][75]. Although minimal transformations have been found for the backbones of secondary structures, and the non-crossing problem has been treated [74], minimal transformations between unfolded and folded states for full protein chain lengths have not been treated before.
The minimal transformation inevitably involves curvilinear motion if bond, angle, or stereochemical constraints are involved [72,76]. Such curvilinear transformations as a result of bond constraints were developed in [72][73][74][75]. If such constraints are neglected, the minimal distance corresponding to the minimal transformation reduces to the mean of the root squared distance (MRSD), or the mean of the straight line distances between pairs of atoms or monomers. This is not the conventional RMSD. For any typical pair of conformations, the MRSD is always less than the RMSD [73]. Used as an alignment cost function, aligned configurations using MRSD are globally different than those using RMSD [75]. The RMSD can be thought of as a least squares fit between the coordinates defining the two structures. Alternatively, it may also be thought of as the straight-line Euclidean distance between two structures in a high-dimensional space of dimension 3N, where N is the number of atoms in the protein, or C a atoms if the protein is coarse-grained. Fast algorithms have been constructed to align structures using RMSD [77][78][79][80][81][82].
If several intermediate states are known along the pathway of a transformation between a pair of structures, then the RMSD may be calculated consecutively for each successive pair. This notion of RMSD as an order parameter goes back to reaction dynamics papers from the early 1980's [57][58][59][60], however in these approaches the potential energy governs the most likely reactive trajectories taken by the system, and RMSD is simply accumulated through the transition states.
In the absence of a potential surface except for that corresponding to steric constraints, the incremental RMSD may be treated as a cost function and the corresponding transformation between two structures found algorithmically [71]. However, the minimal transformation using RMSD (or 3ND Euclidean distance) as a cost function is different than the minimal transformation using 3D Euclidean distance (MRSD) as a cost function, and the RMSD-derived transformation does not correspond to the most straight-line trajectories. The RMSD is not equivalent to the total amount of motion a protein or polymer must undergo in transforming between structures, even in the absence of steric constraints enforcing deviations from straight-line motion. Conversely, the transformation corresponding to the MRSD will be curvilinear in the 3N-dimensional space.
In what follows, we develop a computational scheme for describing how difficult it might be for different proteins to reach their folded configuration. The essence is a calculation of how much ''effort'' the protein chain must expend to avoid having to cross through itself as it tries to realize its folded state. This involves finding the different ways a polymer can uncross or ''untangle'' itself, and then calculating the corresponding distance for each of the untangling transformations. Since there are typically several avoided crossings during a minimal folding transformation, finding the optimal untangling strategy corresponds to finding the optimal combination of uncrossing operations with minimal total distance cost.
After quantifying such a procedure, we apply this to full length protein backbone chains for several structural classes, including ahelical proteins, b-sheet proteins, a-b proteins, 2-state and 3-state folders, and knotted proteins. We generate unfolded ensembles for each of the proteins investigated, and calculate minimal distance transformations for each member of the unfolded ensemble to fold. From this calculation, we obtain the mean minimal distance to fold from the unfolded ensemble, for a given structural class. We look for differences in the mean minimal distance between structural and kinetic classes, and compare these to differences in other order parameters between the respective classes. The extra non-crossing distance per residue D nx =N turns out to be the most consistent discriminator between different structural and kinetic classes of proteins. We find the extra distance covered to avoid chain crossing is generally a small fraction (*1=10) of the total motion. We also investigate how the various order parameters either correlate or are independent from each other.
We then select three proteins, an a-helical, a b-sheet, and a knotted protein, to further dissect the taxonomy of their minimal folding transformations. We construct what might be called ''multiple transformation alignments'' that describe the various different ways each protein can fold from an ensemble of unfolded conformations. We find that noncrossing motions of an N-or Cterminal leg are generally obligatory for a knotted protein, and only incidental for an a protein. A consensus minimal folding transformation is constructed for each of the above-mentioned native folds, and rendered schematically. By investigating a ''pathway overlap'' order parameter, we find that non-crossing constraints, as are prevalent in b proteins and pervasive in knotted proteins, explicitly induce a pathway ''mechanism'' in protein folding, as defined by a common sequence of events independent of the initial unfolded conformation. We finally discuss our results and conclude.

Calculation of the transformation distance
The value of the uncrossing or non-crossing distance, D nx , is calculated as follows: The chain transforms from conformation A to conformation B as a ghost chain, so the chain is allowed to pass through itself. The beads of the chain follow straight trajectories from initial to final positions. This is an approximation to the actual Euclidean distance D of the transformation, where straight line transformations of the beads are generally preceded or proceeded by non-extensive local rotations to preserve the link length connecting the beads as a rigid constraint [72,73]. The instances of self-crossing along with their times are recorded. The associated cost for these crossings is computed retroactively, for example the distance cost for one arm of the chain to circumnavigate another obstructing part is then added to the ''ghost'' distance to compute the total distance.
The method for calculating the non-crossing distance D nx has three major components, evolution of the chain, crossing detection, and crossing cost calculation. Each are described in the subsections below.
Evolution of the chain. As mentioned above, the condition of constant link length between residues along the chain is relaxed, so that the non-extensive rotations that would generally contribute to the distance traveled are neglected here. This approximation becomes progressively more accurate for longer chains. Thus ideal transformations only involves pure straight-line motion. The approximate transformation is carried out in a way to minimize deviations from the true transformation (D), such that link lengths are kept as constant as possible, given that all beads must follow straight-line motion. We thus only allow deviations from constant link length when rotations would be necessary to preserve it; this only occurs for a small fraction of the total trajectory, typically either at the beginning or the end of the transformation [72,73].
A specific example. As an example of the amount of distance neglected by this approximation, consider the pair of configurations in Figure 1, where a chain of 10 residues that is initially horizontal transforms to a vertical orientation as shown in the figure. The distance neglecting rotations (our approximation) is 77.78, in reduced units of the link length, while the exact calculation including rotations [72,73] gives a distance of 78.56.
A few intermediate conformations are shown in the figure. In particular note the link length change (and hence violation of constant link length condition) in the fourth link for the gray conformation (conformation F), resulting from our approximation. If the link length is preserved, the transformation consists of local rotations at the boundary points.
Also note that when transforming from cyan to magenta the first bead moves less than d, because it reaches its final destination and ''sticks'' to the final point, and will not be moved subsequently. A movie of this transformation is provided as Movie S1 in the Supporting Information.
General method. The algorithm to evolve the chain is as follows. Straight-line paths from the positions of the beads in the initial chain configuration to the corresponding positions of the beads in the final configuration are constructed. The bead furthest away from the destination, i.e. the bead whose path is the longest line, is chosen. Let this bead be denoted by index b where 0ƒbƒN. In the context of Figure 1, this bead corresponds to bead number 9 (b 9 ). The bead is then moved toward its destination by a small pre-determined amount d, and the new position of bead b is recorded. In this way the transformation is divided into say M steps: M~d max =d, where d max is the maximal distance. Let i be the step index 0ƒiƒM. If initially the chain configuration was at step i (e.g. i~0), the spatial position of bead b at step i before the transformation d is denoted by r b,i , and after the transformation by r b,iz1 . The upper bound d to capture the essence of the transformation dynamics differs according to the complexity of the problem. To capture all of the instances of self-crossing, a step size d of two percent of the link length sufficed for all cases.
The neighboring beads (bz1 and b{1) should also follow paths on their corresponding straight-line trajectories. Their new position on their paths (r bz1,iz1 and r b{1,iz1 ) are then calculated based on the constant link length constraints. This new position corresponds to moving the beads by d bz1,i , d b{1,i respectively. Once r bz1,iz1 and r b{1,iz1 are calculated, we proceed and calculate r bz2,iz1 and r b{2,iz1 until we reach the end points of the chain. As an example consider Figure 1, going from the conformation B (Green) to the conformation C (Yellow). First, bead number 9, which is the bead farthest away from its final -G proceeding along the color sequence red, green, yellow, cyan, magenta, gray, and blue) are shown. The step-size delta is shown. Note the step in which the first bead of the chain (b 0 ) is ''snapped'' into the final conformation because its distance to the destination is less than d (going from D to E). In the intermediate conformation F (Gray), beads 0 to 3 have reached their final locations and no longer move. Note also the link length violation of link 4 in conformation F, due to the approximation that ignores end point rotations, for this intermediate figure. A milder violation is observed when going from D (cyan) to E (magenta), since bead 1 through N all assume a step size of d while bead 0 moved a step size vd. destination, is moved by d, then taking constant link length constraints and straight line trajectories into account, the new position of bead 8 is calculated and so on, until all the new bead positions which correspond to the yellow conformation are calculated.
If somewhere during the propagation to the endpoints, a solution cannot be constructed or no continuous solution exists, i.e. lim d?0 (r bzm,iz1 {r bzm,i )=0, then we set r bzm,iz1~rbzm,i . That is, the bead will remain stationary for a period of time [83]. Consequently r bzn,iz1~rbzn,i for all beads with nwm that have not yet reached their final destination. This is because the new position of each bead is calculated by the position of the bead next to it for any particular step i. The same recipe is applied when propagating incremental motions d b,iz1 along the other direction of the chain (going from b{n to b{n{1) as well. When a given bead that has been held stationary becomes the furthest bead away from its final position, it is then moved again. I.e. stationary beads can move again at a later time during the transformation if they become the furthest beads away from the final conformational state. Such a scenario does not occur in the context of the simple example of Figure 1, however in Movie S2 in the Supporting Information, a transformation is given for a full protein that involves such a process. During the course of such a transformation the viewer will notice that several beads on the chain (in the upper right in the movie) remain stationary for a part of the transformation. For these beads no continuous solution for the motion exists, i.e. as d?0 the beads in question cannot move without violating the constant link length constraint. At a later time during the transformation, when the beads in the given segment are farthest from the final folded conformation, the beads resume motion.
Once the positions of all the beads in step iz1 are calculated, the same procedure is repeated for step iz2 and so on, until the chain reaches the final configuration. If the position of a given bead b at step i is such that Dr b,i {R b Dvd, where R b is the spatial position of bead b in the final conformation, then r b,iz1 is set to R b . In other words we discretely snap the bead to the final position if it is closer than the step size d. In the context of Figure 1, this corresponds to going from conformation D (Cyan) to conformation E (Magenta). Bead 0 (b 0 ) is snapped to the final conformation. Once a bead reaches its destination it locks there and will never move again. See conformation F (gray) in Figure 1. Figure 2 shows a histogram of the mean link length over the course of a transformation, for 200 transformations between random initial structures generated by self avoiding random walks (SAW), and one pre-specified SAW. The length of the random chains was 9 links. The chains were aligned by minimizing MRSD before the transformation took place [73][74][75], where MRSD stands for the mean root squared distance and is defined by Deviations from the full unperturbed link length are modest: the ensembleaveraged mean link length is 96% of the initial link length.
Crossing Detection. As stated earlier, during the transformation the chain is initially treated as a ghost chain, and so is allowed to cross itself. To keep track of the crossing instances of the chain, a crossing matrix X is updated at all time steps during the transformation. If the chain has N beads and N{1 links, we can define an (N{1)|(N{1) matrix X that contains the crossing properties of a 2D projection of the strand, in analogy with topological analysis of knots. The element X ij is nonzero if link i is crossing link j in the 2D projection at that instant. Without loss of generality we can assume that the projection is onto the XY plane, as in Figure 3. We illustrate the independence of our method on projection plane explicitly for a crossing event in cold-shock protein (1CSP) in File S1 (see figure SA). We use the XY plane projection throughout this paper [84].
We parametrize the chain uniformly and continuously in the direction of ascending link number by a parameter s with range 0ƒsƒN. So for example the middle of the second link is specified by s~3=2. If the projection of link i is crossing the projection of link j, then DX ij D is the value of s at the crossing point of link i and DX ji D is the value of s at the crossing point of link j. If link i is over j (i.e. the corresponding point of the cross on link i has a higher z value than the corresponding point of the cross on link j) then X ij w0, otherwise X ij v0. Thus after the sign operation, sign(X) is an anti-symmetric matrix.
A simple illustrative example of the value of X for the 3-link chain in Figures 3a and 3b is The fact that X 13 is negative at time t o indicates that at that instant, link 1 is under link 3 in 3D space, above the corresponding point on the plane on which the projections of the links have crossed (green circle in Figure 3). At each step during the transformation of the chain, the matrix X is updated. A true crossing event is detected by looking at X for two consecutive conformations. A crossing event occurs when any non-zero element in the matrix X discontinuously changes sign Polymer Uncrossing and Unknotting in Folding PLOS ONE | www.plosone.org without passing through zero. Once X ij changes sign, X ji must change sign as well. If the chain navigates through a series of conformations that changes the crossing sense and thus the sign of X ij , but does not pass through itself in the process, the matrix elements X ij will not change sign discontinuously but will have values of zero at intermediate times before changing sign.
Movie S3 in the Supporting Information shows the result of applying crossing detection. In the movie of the transformation, whenever an instance of self-crossing is detected, the transformation is halted and the image is rotated to make the location of the crossings easier to visualize.
Crossing Cost calculation. Even in the simplest case of crossing, there are multiple ways for the real chain to have avoided crossing itself. The extra distance that the chain must have traveled during the transformation to respect the fact that the chain cannot pass through itself is called the ''non-crossing'' distance D nx . If the chain were a ghost chain which could pass through itself, the corresponding distance for the whole transformation would be the MRSD, along with relatively small modifications that account for the presence of a conserved link length. Accounting for non-crossing always introduces extra distance to be traveled.
As the chain is transforming from conformation A to conformation B as a ghost chain according to the procedure discussed above, a number of self-crossing incidents occur. Figure 4 shows a continuous but topologically equivalent version of the crossing event shown in Figure 3 (b). Even for this simple case, there are multiple ways for the transformation to have avoided the crossing event, each with a different cost.
Furthermore, later crossings can determine the best course of action for the previous crossings. Figure 5 illustrates how noncrossing distances are non-additive, so that one must look at the whole collection of crossing events. Therefore to find the optimum way to ''untangle'' the chain (reverse the sense of the crossings), one must look at all possible uncrossing transformations, in retrospect. The recipe we follow is to evolve the chain as a ghost chain and write down all the incidents of self-crossings that happen during the transformation. Then looking at the global transfor-mation, we find the best untangling movement that the chain could have taken.
To compute the extra cost introduced by non-crossing constraints we proceed as follows: We construct a matrix that we call the cumulative crossing matrix Y. Y ij is non-zero if link i has truly (in 3D) crossed link j, at any time during the transformation. This matrix is thus conceptually different than the matrix X, which holds only for one instant (one conformation) and which can have crossings in the 2D projection which are not true crossings during the transformation. The values of the elements of Y are calculated in the same way that the values are calculated for X. The sign also depends on whether the link was crossed from over to under or from under to over, so that a given projection plane is still assumed. The order in which the crossing have happened are kept track of in another matrix Y O . The coordinates of all the beads at the instant of a given crossing are also recorded. For example, if during the transformation of a chain, two crossing have happened, then two sets of coordinates for intermediate states are also stored. We describe a simple concrete example to illustrate the general method next.
A Concrete Example. Figure 6 shows a simple transformation of a 7-link chain. During the transformation the chain crosses itself in two instances. The first instance of self-crossing is between link 5 and link 7. The second instance is when link 2 crosses link 4. The location of the crossing along the chain is also recorded: i.e. if we assume that the chain is parametrized by s~0 to N, then at the instant of the first crossing (link 5 and link 7) s~4:4 (link 5) and s~6:9 (link 7). The second crossing occurs at s~1:3 (link 2) and s~3:8 (link 4). The full coordinates of all beads are also known: we separately record the full coordinates of all beads at each instant of crossing. The information that indicates which links have crossed and their over-under structure can be aggregated into the cumulative crossing matrix Y. For the example in Figure 6, the cumulative matrix (up to a minus sign indicating what plane the crossing events have been projected on) is The blue chain and the red chain have the exact same vertical projection, however their corresponding X matrices are different in sign, as given in Eq. 2. This indicates that the overunder sense has changed for the links whose projections are crossing. This in turn indicates that a true crossing has occurred when going from the red conformation to the blue conformation, as opposed to a series of conformations where the chain has navigated to conformations having the opposite crossing sense without passing through itself. doi:10.1371/journal.pone.0053642.g003 : Y tells us, during the whole process of transformation, which links have truly crossed one another and what the relative over-under structure has been at the time of crossing. For example, by glancing at the matrix we can see that two links 5,7 and 2,4 have crossed one another. We also know from the sign of the elements in Y that both links 2 and 7 were underneath links 4 and 5 just prior to their respective crossings in the reference frame of the projection. Two links will cross each other at most once during a transformation. If one link, e.g. link i, crosses several others during the transformation, elements i,j, i,k etc… along with their transposes will be nonzero. The order of crossings can be represented in a similar fashion as a sparse matrix. : Analyzing the structure of the crossings is similar to analyzing the structure of a knot, wherein one studies a knot's 2D projections, noting the crossings and their over/under nature based on a given directional parameterization of the curve [89][90][91]. One difference here is that we are not dealing with true closed-curve knots (in the mathematical sense), as a knot is a representation of S1 in S3.
Here we treat open curves.
Crossing substructures. By studying the crossing structure of open-ended pseudo-knots in the most general sense, one can identify a number of sub-structures that recur in crossing transformations. Any act of reversing the nature of all the crossings of the polymer can be cast within the framework of some ordered combination of reversing the crossings of these substructures.
We identify three sub-structures: Leg, Loop, and Elbow. Leg. Given any self-crossing point of a chain, a leg is defined from that crossing point to the end of the chain. Therefore for each self-crossing point two legs can identified as the shortest distance along the chain from that crossing point to each end-see Figure 7. A single leg structure is shown in Figure 8(a).
Loop. As stated earlier, when traveling along the polymer one arrives at each crossing twice. If the two instances of a single crossing are encountered consecutively while traveling along the polymer, and no intermediate crossing occurs, then the substructure that was traced in between is a loop. See Fig. 8 Elbow. If two consecutive crossings have same over-under sense, then they form an elbow; see Fig. 8(c). Note that the same two consecutive crossing instances will occur in reverse order on the second visit of the crossings: these form a dual of the elbow. By convention the segment with longer arc-length between the two consecutive crossings is defined as the elbow. This would be the horseshoe shaped strand in Figure 8(c).
Reversing the crossing nature. The goal of this formalism is to assist in finding a series of movements that will result in  reversing the over-under nature of all the crossings, with the least amount of movement required by the polymer. So at this point we introduce basic movements that that will reverse the nature of the crossings for the above substructures.
Using leg movement. A transformation that reverses the over/under nature of a leg involves the motion of all the beads constituting the leg. Each bead must move to the location of the crossing (the ''root'' of the leg), and then move back to its original location [74]. The canonical leg movement is shown schematically in Figure 9.
We can reverse the nature of all the crossings that have occurred on a leg, if more than one crossing occurs, through a leg movement (see Figure 10). The move is topologically equivalent to the movement of the free end of the leg along the leg up to the desired crossing, and then moving all the way back to the original position while reversing the nature of the crossing on the way back.
Loop twist and loop collapse. Reversing the crossing of a loop substructure can be achieved by a move that is topologically equivalent to a twist, see Figure 11 (a). This type of move is called a Reidemeister type I move in knot theory. However the optimal motion is generally not a twist or rotation in 3-dimensional space (3D). Figure 11(b) shows a move which is topologically equivalent to a twist in 3D, but costs a smaller distance, by simply moving the residues inside the loop in straight lines to their final positions, resulting in a ''pinching'' motion to close the loop and re-open it. From now on we refer to the optimal motion simply as loop twist, because it is topologically equivalent, but we keep in mind that the actual optimal physical move, and the distance calculated from it, is different.
Elbow moves. Reversing the crossings of an elbow substructure can be done by moving the elbow segment in the motion depicted in Figure 12: Each segment moves in a straight line to its corresponding closest point on the obstruction chain, and then it moves in a straight line to its final position.
Operator Notation. The transformations for leg movement, elbow, and twist can be expressed very naturally in terms of operator notation, where in order to untangle the chain the various operators are applied on the chain until the nature of all the self-crossing are reversed.
If we uniquely identify each instance of self-crossing by a number, then a topological loop twist at crossing i can be represented by the operator R(i) (R for Reidemeister). An elbow move, for the elbow defined by crossings i and iz1, can be represented as E(i,iz1). As discussed above, for each selfcrossing, two legs can be identified corresponding to the two termini of the chain. This was exemplified in Figure 7, by the red and blue legs. Since we choose a direction of parametrization for the chain, we refer to the two leg movements as the ''start leg''  movement and the ''end leg'' movement, and for a generic crossing i we denote them as L N (i) and L C (i) respectively.
The operators that we defined above are left acting (similar to matrix multiplication). So a loop twist at crossing i followed by an elbow move at crossings j and jz1 is represented by Example. Figure 13 shows sample configurations before and after untangling. The direction of parametrization is from the red terminus to the cyan terminus. It can be seen that there are several ways to untangle the chain. One example would be R(3)L C (2)R(1), which consists of a twist of the green loop, followed by the cyan leg movement, followed by a twist of the blue loop. Another path of untangling would be E(2,3)L N (1), which is movement of the red leg followed by the magenta elbow move.
For the two above transformations, the order of operations can be swapped, i.e. they are commutative, and the resulting distance for each of the transformations will be the same. That is Other transformation moves are not commutative in the algorithm, for example in Figure 13, L N (1)R(3)R(2) is not allowed, since R(2) will only act on loops defined by two instances of a crossing that are encountered consecutively in traversing the polymer, i.e. no intermediate crossings can occur. Therefore even if crossing 2 happens kinetically before crossing 3 during the ghost transformation, only transformation L N (1)R(2)R(3) is allowed in the algorithm.
Minimal uncrossing cost. For each operator in the above formalism, a transformation distance/cost can be calculated. Hence the optimal untangling strategy is finding the optimal set of operator applications with minimal total cost. This solution amounts to a search in the tree of all possible transformations, as illustrated in Figure 14. The optimal application of operators can be computed by applying a version of the depth-first tree search algorithm.
According to the algorithm, from any given conformation there are several moves that can be performed, each having a cost associated with the move. The pseudo-code for the search algorithm can be written as follows: The values to the right side of the equality sign in the arguments of the procedure are the default values that the procedure starts with. The procedure is called recursively, and returns both the set of optimal uncrossing moves (for a given crossing matrix   corresponding to a starting and final conformation), and the distance corresponding to that set of optimal uncrossing moves.
The algorithm visits all branches of the tree of possible uncrossing operations until it reaches the end. However it is smart enough to terminate the search along the branch if the cost of operations exceeds that of a solution already found. See Figure 14 for an illustration of the depth-first search tree algorithm. The above procedure was implemented using both the GNU Octave programming language and C++. To optimize speed by eliminating redundant moves, only one permutation was considered when operators commuted.

Generating unfolded ensembles
To generate transformations between unfolded and folded conformations, we adopt an off-lattice coarse grained C a model [92,93], and generated an unfolded structural ensemble from the native structure as follows. For a native structure with N links, we define three data sets: N The set of C a residue indices i, for which i~1 Á Á Á N N The set of native link angles h j between three consecutive C a atoms, for which j~2 Á Á Á N{1 N The set of native dihedral angles w k between four consecutive C a atoms, i.e. the angle between the planes defined C a atoms (k{1, k, and kz1), and (k, kz1, and kz2). The index k runs from k~2 Á Á Á N{2.
The distribution of C a -C a distances in PDB structures is sharply peaked around 3.76 Å (s~0:09Å ). In practice we took the first C a -C a distance from the N-terminus as representative, and used that number for the equilibrium link length for all C a -C a distances in the protein.
To generate an unfolded ensemble, we start by selecting at random a C a atom n (2ƒnƒN{1) in the native conformation, and we then perform rotations that change the angle centered at that randomly chosen residue n, h n , and that change the dihedral defined by rotations about the bond n-(n+1), w n . If n~N{1 only the angle is changed. The new angle and dihedral are selected at random from the Boltzmann distributions as described below. After each rotation, h n ?h new n and w n ?w new n . Changing these angles rotates the entire rest of the chain, i.e. all the beads i with iwn are rotated to a new position. This recipe corresponds to an extension of the pivot algorithm [94,95].
However, we additionally require that the values of each angle and dihedral that are present in the native structure, h Nat n and w Nat n , are more likely to be observed. We implement this criterion in the following way. The new angle h n is chosen from a probability where we have set k h~2 0. Similarly for the dihedral w n , the probability distribution function is proportional to where k w1~1 , and k w3~0 :5. The fact that the k w s are much smaller than k h means that for a given temperature, dihedral angles are more uniformly distributed than bond angles. If all k h and k w are set to zero, then all states are equally accessible and the algorithm reduces to the pivot algorithm, i.e. a generator for unbiased, self-avoiding random walks. If all k h and k w are set to ?, then chain behaves as a rigid object and does not deviate from its native state. Each pivot operation results in a new structure that must be checked so that it has no steric overlap with itself, i.e. the chain must be self-avoiding. If the new chain conformation has steric overlap, then the attempted move is discarded, and a new residue is selected at random for a pivot operation.   A chain with several self-crossing points before and after untangling. Various topological substructures that are discussed in the text are color coded. For the case of the legs (red and cyan) note that various other legs can be identified, for example a leg that starts at crossing 2 and ends at the red terminus. Here we color only the shortest legs from crossing 1 to the terminus as red, and crossing 2 to the opposite terminus as cyan. doi:10.1371/journal.pone.0053642.g013 In practice, we defined steric overlap by first finding an approximate contact or cut-off distance for the coarse-grained model. The contact distance was taken to be the smaller of either the minimum C a -C a distance between those residues in native contact (where two residues are defined to be in native contact if any of their heavy atoms are within 4.9 Å ), or the C a -C a distance between the first two consecutive residues. For SH3 for example the minimum C a distance in native contacts is 4:21Å and the first link length is 3:77Å , so for SH3 all non-neighbor beads must be further than 3:77Å for a pivot move to be accepted. Future refinements of the acceptance criteria can involve the use of either the mean C a -C a distance or other criteria more accurately representing the steric excluded volume of residue side chains.
In our recipe, to generate a single unfolded structure we start with the native structure and implement N successful pivot moves, where N is related to the number of residues N by N~ln(0:01)=ln½0:99(N{2)=(N{1).
For the next unfolded structure we start again from the native structure and pivot N successful times, following the above recipe. Note that N successful pivots does not generally affect all beads of the chain. In the most likely scenario some beads are chosen several times and some beads are not chosen at all, according to a Poisson distribution. This particular choice of N means that for polymers with Nv101 where N{2 N{1 v0:99, the chance that any given link is not pivoted at all during the N pivot operations is 0:01. On the other hand for longer polymers where N{2 N{1 w0:99, the probability that any particular segment of the protein with the length 0:01 of the total length, has 0:01 chance of not having any of its beads pivoted. For any N however, the shear number of pivot moves generally ensures a large RMSD between the native and generated unfolded structures.
Each unfolded structure generally retains small amounts of native-like secondary and tertiary structure, due to the native biases in angle and dihedral distributions. For example, for SH3 the number of successful pivot moves was 162 and the mean fraction of native contacts in the generated unfolded ensemble was 0:06.

Protein dataset
The 45 proteins used in this study are given in Table 1. When divided into kinetic classes, they consist of 25 2-state folders, 13 non-knotted 3-state folders, and 7 knotted proteins not used in the kinetic analysis. Structurally there are 11 all a-helix proteins, 14 all b-sheet proteins, 13 a-b proteins, and 5 knotted proteins. These proteins were selected randomly from the datasets in references [96,97], where kinetic rate data was available to categorize the proteins into 2-state or 3-state folders. Our dataset contains 27 out of the 52 proteins in [96], and 38 out of the 72 proteins in [97]. The datasets in [96,97] do not include knotted proteins however; the Knotted proteins were taken from several additional sources, including references [98] (1NS5) [99], (1MXI) [100], (3MLG) [101], (2K0A) [102], (2EFV), and the protein knot server KNOTS [103] (1O6D, 2HA8). Aside from the Stevadore knot in [102] we did not consider pseudo-knots more complex than the 3 1 trefoil.
Several of these proteins (a-amylase inhibitor 2AIT and MerP mercuric ion binding protein 2HQI) have disulfide bonds present in the native structure. These constraints are not used in the current analysis. The folding pathways we obtain may be thought of as relevant to the initial folding event before disulfide bonds are formed, or for a protein of equivalent topology but sequence lacking the disulfide bond. Lack of preservation of disulfide bonds is a shortcoming of the present algorithm; development of more accurate computational algorithms for unfolded ensemble generation are a topic of future work.
Several of the proteins also have ligands present in the crystal or NMR structures. These include 1A6N and 1HRC (heme ligands), 1RA9 (Nicotinamide adenine), 1GXT (sulfate), 1MXI (iodide ion), 2K0A (3 Zn ions), 2EFV (phosphate ion). Since we have removed energetics in general from our analysis of geometrical pathways, these ligands and any effect they may have on the folding pathway due to protein-ligand interactions are not included here. In the folding kinetics analysis of references [96,97], they are generally not present either, e.g. the folding rate for 1A6N is actually that for apomyoglobin [104].
Structural alignment properties of our protein dataset. To categorize proteins as two-or three-state, we have chosen proteins with folding rate data available. This dataset has somewhat different structural alignment statistics than that for a non-redundant (NR) database, e.g. [105]. The TM-score based alignment of Zhang and Skolnick [106] can be used to obtain structural alignment statistics. Their method resolves the problems of outlier and length-dependent artifacts of RMSD-based alignments. Distributions of TM-score for both the above NR database, our dataset, and the datasets in references [96,97], which nonknotted proteins in our dataset were taken, are given in figure SB of File S1, along with statistical analysis of the distributions. The bulk of our proteins (98%) have TM-scores consistent with the NR database of Thiruv et. al (see figure SB in File S1), however our dataset and those of [96,97] contain a small number of structural homologs not present in the NR dataset, which are tabulated in table SB of File S1. We do not suspect that this small number of homologs will significantly modify the conclusions derived from statistical analysis of our dataset, however expansion and  refinement to find the most relevant dataset is a topic for future work.

Calculating distance metrics for the unfolded ensemble
To obtain minimal transformations between unfolded and native structures for a given protein, the C a backbone was extracted from the PDB native structure, and 200 coarse-grained unfolded structures were generated using the methods described above. The unfolded structures were then aligned using RMSD and the average (residual) RMSD was calculated. The unfolded structures were then aligned by minimizing MRSD, and the residual MRSD was calculated. Then conformations were further coarse-grained (smoothed) by sampling every other bead, hence reducing the total number of beads. By the above further-coarse graining which is in the spirit of the initial steps of Koniaris-Muthukumar-Taylor reduction [107][108][109][110], we eliminate all instances of potential self-crossing in which the loop size or elbow size is smaller than three links. Each structure was then transformed to the folded state by the algorithm discussed earlier in Methods. The self-crossing instances, along with the coordinates of all the beads, were recorded as well. Appropriate data structures were formed and relevant crossing substructures (leg, elbow, and loop) were detected. With topological data structures at hand, the minimal uncrossing cost was found, through the depth-first search in the tree of possible uncrossing operations that was described above. Finally, the minimal uncrossing cost, D nx , and the total distance, D are calculated for each unfolded conformation. These differ from one unfolded conformation to the other; the ensemble average is recorded and used below. The ensemble average of MRSD and RMSD are also calculated from the 200 unfolded structures that were generated.
Importance of non-crossing. We define the importance of non-crossing (INX) as the ratio of the extra untangling movement caused by non-crossing constraints, divided by the distance when no such constraints exists, i.e. if the chain behaved as a ghost chain.
Following [86], we define Long-range Order (LRO) as: where i and j are the sequence indices for two residues for which the C a {C a distance is ƒ8 Å in the native structure. Likewise we define Relative Contact Order (RCO) following [85]: where N is the total number of contacts between non-hydrogen atoms in the protein that are within 6 Å in the native structure, L is the number of residues, and DL ij is the sequence separation between contacts in units of the number of residues.
Similarly, Absolute Contact order (ACO) [85] is defined to be:

Results
Proteins were classified by several criteria: Order paramaters discriminate protein classes In Table 2, we compare the unfolded ensemble-average of several metrics between different classes of proteins, and perform a p-value analysis based on the Welch t-test. The null hypothesis states that the two samples being compared come from normal distributions that have the same means but possibly different variances. Metrics compared in Table 2  The most obvious check of the general method outlined in the present paper is to compare the non-crossing distance D nx between knotted and unknotted proteins. Here we see that knotted proteins traverse about 3:5| the distance as unknotted proteins in avoiding crossings, so that the two classes of proteins are different by this metric. The same conclusion holds for knotted vs. unknotted proteins if we use D nx =N, D, D=N, or INX. Of all metrics, the statistical significance is highest when comparing D=N, which is important because the knotted proteins considered here tend to be significantly longer than the unknotted proteins, so that chain length N distinguishes the two classes. Dividing by N partially normalizes the chain-length dependence of D, however D=N still correlates remarkably strongly with N when compared for all proteins (r~0:824 see table SJ in File S1).
It was somewhat unusual that MRSD and RMSD distinguished knotted proteins from unknotted proteins better than D (or D nx ), which accounts for non-crossing. All other quantities, including INX, ACO, and RCO distinguish knotted from unknotted proteins. The only quantity that fails is LRO.
The importance of noncrossing INX , measuring the ratio of the uncrossing distance D nx to the ghost-chain distance N|MRSD, was largest for knotted proteins, followed by b proteins, with a proteins having the smallest INX . Mixed proteins had an average INX value in between that for a and b proteins.
In distinguishing all-a and all-b proteins, we find that LRO and RCO are by far the best discriminants. Interestingly, INX and D nx =N also discriminate these two classes comparably or better than ACO does. D nx is marginal, while all other metrics fail.
All metrics except for N and D are able to discriminate a from mixed a-b proteins, with LRO performing the best by far.
Interestingly, none of the above metrics can distinguish b proteins from mixed a-b proteins.
It is sensible that energetic considerations would be the dominant distinguishing mechanism between two-and three state folders. Intermediates are typically stabilized energetically. We can nevertheless investigate whether any geometrical quantities discriminates the two classes. Indeed LRO and RCO fail, as does INX. This supports the notion that intermediates are not governed by ''topological traps'' that are undone by uncrossing motion, but rather are energetically driven. ACO performs marginally. Threestate folders tend to be longer than 2-state folders, so that N distinguishes them and in fact provides the strongest discriminant, consistent with previous results [111]. Interestingly RMSD, MRSD, and D perform comparably to N. However these measures also correlate strongly with N (see table SJ in File S1).
D=N, D nx and D nx =N also perform well, but still correlate with N, albeit more weakly than the above metrics. Figure 15A shows a scatter plot of all proteins as a function of D nx =N vs. and LRO. Knotted and unknotted proteins are indicated, as are a, b, and mixed a-b proteins. Two and three state proteins are indicated as triangles and squares respectively. From the figure, it is easy to visualize how LRO provides a successful discriminant between a=b and a=(mixed) proteins, but is unsuccessful in discriminating b=(mixed), knotted and unknotted, and two and three state folders. It is also clear from the figure how D nx =N discriminates knotted from unknotted proteins. One can also see distribution overlap, but nevertheless successful discrimination between a and b and a and mixed proteins. Figure 15B shows a scatter plot of all proteins as a function of D nx vs. N, using the same rendering scheme for protein classes as in Figure 15A. From the figure, one can see how the metrics Order parameters for various classifications of proteins. The data set of 2-and 3-state folders is the same as the data set for a-helical b-sheet and mixed proteins, and is given in Table 1. This is also the same data set as the unknotted proteins. Knotted proteins are separately classified, and not included as either 2-state or 3-state proteins. A discrimination is deemed statistically significant if the probability of the null hypothesis is less than 5%. doi:10.1371/journal.pone.0053642.t002 correlate with each other, and how they both discriminate knotted from unknotted proteins and 2-state from 3-state proteins. Moreover one can see how despite the significant correlation between D nx and N, D nx can discriminate a proteins from either b proteins or mixed a/b proteins, while N cannot. As a control study for the above metrics, we took random selections of half of the proteins, to see if random partitioning of the proteins into two classes resulted in any of the metrics distinguishing the two sets with statistical significance. No metric in this study had significance: the p-values ranged from about 0.32 to 0.94. Figure 16 shows a plot of the statistical significance for all the metrics in Table 2 to distinguish various pairs of proteins classes: 2state from 3-state proteins, a from b, a from mixed a=b, b from mixed, and knotted from unknotted. We can define the most consistent discriminator between protein classes as that metric that is statistically significant for the most classes, and for those classes has the highest statistical significance. By this criterion D nx =N is the most consistent discriminator between the general structural and kinetic classes considered here.
Interestingly, in all cases, the extra distance introduced by noncrossing constraints is a very small fraction (less than 13%) of the MRSD, which represents the ghost distance neglecting noncrossing. This was not an obvious result, but it was encouraging evidence for the reason simple order-parameters that neglect an explicit accounting of non-crossing have been so successful historically [49,85,[112][113][114][115][116].

Scaling laws for pathway distances across domains and whole proteins
Larger proteins will typically have larger MRSD. A protein of twice the chain length need not have twice the MRSD however; we plot the unfolded ensemble averaged MRSD of the proteins in our dataset as a function of N in Figure 17A. The plot shows subextensive scaling for the straight-line path distance per residue: MRSD*N 0:65 . On the other hand, the non-crossing distance per residue, D nx =N, shows superextensive scaling: D nx =N*N 1:33 , indicating that non-crossing induced entanglement becomes progressively more important even on a per-residue basis for longer proteins, and likely polymers in general. In fact, the steeper slope of D nx =N indicates a crossover such that when N is larger than about 3600, chain non-crossing dominates the motion of the minimal folding pathway. It is noteworthy that the scatter in the  log-log scaling plot of Figure 17A is much larger for D nx =N than for MRSD, illustrating the larger dispersion of D nx =N for proteins of the same length but different native topology.
The above analysis can be applied to domains within a single protein, to test how autonomous their folding mechanisms are as compared to separate proteins. Run on our dataset, the program DDomain [117] only finds multiple domains in methyltransferase domain of human tar (HIV-1) RNA binding protein (PDB 2HA8) (Wu, H. et. al. unpublished data), between residues 20-88 and 89-178 (residue 20 is the first resolved residue in the crystal structure). The domain finding program DHcL [118] also finds domains in this protein between residues 20-83 and 84-178. DHcL also finds domains in several other proteins, some generally accepted as single domain, however one of these proteins is clearly a repeat protein containing a 36 residue helix-turn-helix motif: tumor suppressor P16INK4A (PDB 2A5E) [119]. For this protein, DHcL finds domains between the 1st and 2nd, and 2nd and 3rd repeating units. We manually added a domain boundary between the 3rd and 4th repeating units to yield 4 domains containing residues 1-36, 37-72, 73-108, and 109-144. The domains of 2HA8 and 2A5E are illustrated in Figure 17C.
Using the above domain structures for 2HA8 and 2A5E, we analyze the scaling of MRSD with chain length N in Figure 17B. In these plots the individual domains are considered as separate proteins, then combined together if the domains are contiguous, e.g. for 2A5E proteins consisting of domains 1, domains 1 and 2 together, 1, 2, and 3 together, all domains together, and all contiguous combinations therein are examined. This yields the same scaling law for both proteins: MRSD*N 0:76 , which has a larger power law than the scaling between proteins above. Chain connectivity constraints apparently induce cross-talk between domains even for MRSD. Likewise, the scaling law for noncrossing distance per residue is D nx =N*N 2:51 , indicating significant polymer chain interference between domain folding. The individual domains of multidomain proteins apparently show less severe chain constraints than single domain proteins of the same size.

Quantifying minimal folding pathways
The minimum folding pathway gives the most direct way that an unfolded protein conformation can transform by reconfiguration to the native structure. However, different configurations in the unfolded ensemble transform by different sequences of events, for example one unfolded conformation may require a leg uncrossing move, followed by a Reidemeister move elsewhere on the chain, followed by an uncrossing move of the opposite leg, while another unfolded conformation may require only a single leg uncrossing move.
The sequence of moves can be represented as a color-coded bar plot, which, for the 3 proteins rendered in Figure 18, is shown in Figures 19,20,21. In these figures, the sequence of moves is taken from right to left, and the width of the bar indicates the noncrossing distance undertaken by that move. A scale bar is given underneath each figure indicating a distance of 100 in units of the link length. Red bars indicate moves corresponding to the Nterminal leg (L N ) of the protein, while green bars indicate moves Non-crossing distance per residue D nx =N (red circles) vs. N shows much larger scatter across native topologies, but follows an approximate scaling law D nx =N*N 1:33 which is superextensive, indicating an increasing importance of chain non-crossing per residue as system size is increased. At system sizes larger than N&3600, even minimal motion is dominated by entanglement. (B) Same quantities as in panel (A), but for the domains in proteins 2A5E and 2HA8. The scaling laws are different than in panel (A), and show stronger chain-length dependence. For 2HA8, domains 1, 2 and 1-2 together (the full protein) are considered; for 2A5E domains 1,2,3,4, 1-2, 2-3, 3-4, 1-2-3, 2-3-4, and 1-2-3-4 (the full protein) are considered. Based on these scaling laws found by building up proteins from subdomains, at system sizes larger than N&400, minimal pathways become entanglement-dominated. (C) Schematic renderings of the domains, color-coded in 2HA8 (left) and 2A5E (right). doi:10.1371/journal.pone.0053642.g017 corresponding to the C-terminal leg (L C ). Blue bars indicate Reidemeister ''pinch and twist'' moves, while cyan bars indicate elbow uncrossing moves.
The typical sequence of moves varies depending on the protein. Figure 19 shows the uncrossing transformations of the all-a protein acyl-coenzyme A binding protein (PDB id 2ABD [120], see Figure 18A). Panels A and B depict the same set of transformations, but in A they are sorted from largest to smallest values of L N uncrossing, and in B they are sorted from largest to smallest values of L C uncrossing. The leg moves in each panel are aligned so that the left end of the bars corresponding to the moves being sorted are all lined up. Some transformations partway down in panel A do not require an L N move; these are then ordered from largest to smallest L C move. The converse is applied in panel B. Some moves do not require either leg move; these are sorted in decreasing order of the total distance of Reidemeister loop twist moves. Finally, some transformations require only elbow moves; these are sorted from largest to smallest total uncrossing distance. Figure 20 shows the noncrossing transformations for the Src homology 3 (SH3) domain of phosphatidylinositol 3-kinase (PI3K), a largely-b protein (about 23% helix, including 3 short 3 10 helical turns; PDB id 1PKS [121], see Figure 18B), sorted analogously to Figure 19. Figure 21 shows the uncrossing transformations involved in the minimal folding of the designed knotted protein 2ouf-knot (PDB id 3MLG [100], Figure 18C).
Interestingly, for the all-a protein 2ABD, &12% of the 172 transformations considered did not require any uncrossing moves, and proceed directly from the unfolded to the folded conformation. These transformations are not shown in Figure 19. For the b protein and knotted protein, every transformation that we considered (195 for 1PKS and 90 for 3MLG) required at least one uncrossing move.
As a specific example, the top-most move in Figure 21 panel B consists of a C-leg move (green) covering &90% of the noncrossing distance, followed by N-leg move (red) covering &7% of the distance, then a short elbow move (cyan), a short Reidemeister loop move (blue), another short elbow move (cyan), and finally a short Reidemeister move (blue). In some cases the elbow and loop moves commute if they involve different parts of the chain, but generally they do not. For this reason we have not made any attempt to cluster loop and elbow moves, rather we have just represented them in the order they occur. On the other hand, consecutive leg moves commute and can be taken in either order.
In Figures 19, 20,21, one can see that significantly more motion is involved in the leg uncrossing moves than for other types of move. The total distance covered by leg moves is 82% for 3MLG, 69% for 1PKS, and 49% for 2ABD. For 3MLG, the total leg move distance is comprised of 44% L N moves, and 38% L C moves. For 1PKS, leg move distance is comprised of 18% L N moves, and 51% L C moves. For 2ABD, distance for the leg moves is roughly symmetric with 26% L N and 23% L C .
One difference that can be seen for the all-a protein compared to the b and knotted proteins is in the persistence of the leg motion. For 2ABD, only 24% of the transformations require L N moves and only 30% of the transformations require L C moves. On the other hand the persistence of leg moves is greater in the b protein and greatest in the knotted protein. For 1PKS, L N and L C moves persist in 74% and 66% of the transformations respectively. In 3MLG, L N and L C moves persist in 92% and 41% of the transformations respectively.
Inspection of the transformations for the b protein 1PKS in panels A and B of Figure 20 reveals that uncrossing moves generally cover larger distance than in the a protein 2ABD (the mean uncrossing distance for is 136 for 1PKS vs. 77.5 for 2ABD). We also notice that in contrast to the leg uncrossing moves in 2ABD, both L N and L C moves are often required (44% of the transformations require both L N and L C moves, compared to 5% for 2ABD). The asymmetry of the protein is manifested in the asymmetry of the leg move distance: the L N moves are generally shorter than the L C moves, covering about 1/4 of the total leg move distance. As mentioned above, L C moves comprise about 51% of the total distance for the 195 transformations in 20, while L N moves only comprise about 18% of the distance on average. Both L N and L C moves are persistent as mentioned above. A leg move of either type is present in 95% of the transformations.
Inspection of the transformations in Figure 21 reveals that every transformations requires either an L N or L C move. This is sensible for a knotted protein, and is in contrast to the transformations for the a protein 2ABD, where many moves do not require any leg uncrossing at all and consist of only short Reidemeister loop and elbow moves. In this sense the diversity of folding routes [40,41] for the knotted protein 3MLG is the smallest of the proteins considered here, and illustrates the concept that topological constraints induce a pathway-like aspect to the folding mechanism. The N-terminal L N leg move is the most persistently required uncrossing move, present in about 92% of the transformations. This is generally the terminal end of the protein that we found was involved in forming the pseudo-trefoil knot. Sometimes however, the C-terminal end is involved in forming the knot, though this move is less persistent and is present in only 41% of the transformations. However when an L C move is undertaken, the distance traversed is significantly greater, as shown in Panel B of Figure 21. This asymmetry is a consequence of the asymmetry already present in the native structure of the protein.   Consensus minimal folding pathways From the transformations described in Figures 19, 20, 21, we see that there are a multitude of different transformations that can fold each protein. The pathways for the a protein 2ABD are more diverse than those for the b or knotted proteins. From the ensemble of transformations for each protein, we can average the amount of motion for each uncrossing move to obtain a quantity representing the consensus or most representative minimal folding pathway for that protein. This takes the form of the histograms in Figure 22, with the x-axes representing the order of uncrossing/ untangling events, right to left, and the y-axes representing the average amount of motion in each type of move.
The ensemble of untangling transformations can be divided into three different classes: transformations in which leg L N is the largest move, transformations in which leg L C is the largest move, and transformations in which an elbow E or loop R (for Reidemeister type I) are the largest moves. Moreover, if L N and L C moves occur consecutively they can be commuted, so without loss of generality we take the L N move as occurring before the L C move in the x-axes of Figure 22. The leg moves, if they occur first, are then followed by either elbow (E) and/or loop (R) moves, of which there may be several. In general, the leg moves may both occur before the collection of loop and elbow moves, after them, or may bracket the elbow and loop moves (e.g. 2nd bar in Figure 21b). By the construction of our approximate algorithm, if two L N moves were encountered during a trajectory (they were encountered only a few times during the course of our studies), they would be aggregated into one L N move involving the larger of the two motions, in order to remove any possible redundancy of motion. Hence no more than one L N or L C move is obtained for all transformations. We found that three pairs of elbow and loop moves was sufficient to describe about 93% of all transformations (see the x-axes of Figure 22). In summary, the sequence L N , L C , R, E, R, E, R, E, L N , L C (read from left to right) characterized almost all transformations, and so was adopted as a general scheme. Any exceptions simply had more small elbow and loop moves that were of minor consequence; for these transformations we simply accumulated the extra elbow and loop moves into the most appropriate R or E move. The general recipe for rendering loops R in Figure 22 is as follows: if one R move is encountered (regardless of where), each half is placed first and last (third) in the general scheme. If two R moves are encountered, they are placed first and last, and if three R moves are encountered, they are simply partitioned in the order they occurred. For four or more R moves, the middle N{2 are accumulated into the middle slot in the general scheme. The same recipe is applied to elbow moves E. As a specific example, the first bar in Figure 21b consists of L C , L N , E 1 , R 1 , E 2 , R 2 , which after permutation of the first two leg moves falls into the general scheme above as L N , L C , R 1 , E 1 , 0, 0, R 2 , E 2 , 0, 0. The bottom-most transformation in Figure 21B consists Figure 22 shows histograms of the minimal folding mechanisms, obtained from the above-described procedure. Note again there are 3 classes of transformation, one where L N is the largest move, one where L C is the largest move, and one where either loop R or elbow E is the largest move. Each uncrossing element of the transformation, C-leg, N-leg, Reidemeister loop, or elbow, contributes to the height of the corresponding bar, which represents the average over transformations in that class. The percentage of transformations that fall into each class is given in the legend to panels A-C of Figure 22.
Most of the transformations (71:5%) for the a-protein 2ABD fall into the class with a dominant loop or elbow move, which itself tends to cover less uncrossing distance than either C-or N-leg uncrossing (ordinates of Panels A-C Figure 22). This is a signature of a diverse range of folding pathways: minimal folding pathways need not involve obligatory leg uncrossing constraints. In this sense, the b protein 1PKS is more has a more constrained folding mechanism than the a protein; there is a significantly larger percentage of transformations for which a leg transformation L C or L N dominates, though the mean distances undertaken when a leg move does dominate are comparable for L C and even larger for the a protein for L N .
The knotted protein 3MLG has the most constrained minimal folding pathway. A leg move from either end dominates for 91% of the cases. Even for the transformations where loop or elbow moves dominate, there is still relatively significant L N motion. The dominant pathways for knotting 3MLG involve leg crossing from either N or C terminus. When the C terminus is involved in the minimal transformation, the motion can be significant ( Figure 22B).
Among all transformations of a given protein, a transformation can be found that is closest to the average transformation for one of the three classes in Figure 22. This consensus transformation has a sequence of moves that when mapped to the scheme in Figure 22, has minimal deviations from the averages shown there. Further, we can find the transformation that has minimal deviation to any of the three classes in Figure 22. For the knotted protein 3MLG, the best fit transformation is to the class with L Ndominated moves, for the a protein 2ABD, the best fit transformation is to the class with miscellaneous-dominated moves, and the b protein 1PKS, the best fit transformation is the class with L C -dominated moves. For the a, b, and knotted proteins, these are the transformations denoted by a short arrows to the left of the transformation in panels A and B of Figures 19,20, and 21 respectively. For the a, b, and knotted proteins, the transformations are illustrated schematically in Figures 23, 24, and 25 respectively.
Inspection of the most representative transformation for the alla protein 2ABD shown in Figure 23 indicates that the transformation requires remarkably little motion: it contains a negligible leg motion followed by a loop uncrossing of modest distance, followed by a short elbow move that is also inconsequential: in shorthand E½9R½20L N ½1, where the numbers in brackets indicate the cost of the moves in units where the link length is unity. In constructing a schematic of the representative transformation in Figure 23, we ignore the smaller leg and elbow moves and illustrate the loop move roughly to scale. Although additional crossing points appear from the perspective of the figure, the remainder of the transformation involves simple straight-line motion.   Figure 25 shows the most representative folding transformation for the knotted protein 3MLG. The sequence of events constructed from the minimal transformation, R½21R½18L N ½125 in the above notation, consists of a dominant leg move depicted in steps 4 and 5 of the transformation, and two relatively short loop moves that are neglected in the schematic as inconsequential. Loops appear from the perspective of the figure, and the crossing points appear to shift in position, however the remainder of the transformation involves simple straight-line motion.

Topological constraints induce folding pathways
From Figures 19,20,21, one can see that topological noncrossing constraints can induce pathway-like folding mechanisms, particularly for knotted proteins, and in part for b-sheet proteins as well. The locality of interactions in conjunction with simple tertiary arrangement of helices in the a-helical protein profoundly affects the nature of the transformations that fold the protein, such that the distribution of minimal folding pathways is diverse. Conversely, the knotted protein, although largely helical, has nontrivial tertiary arrangement, which is manifested in the persistence of a leg crossing move in the minimal folding pathway. In this way, a folding ''mechanism'' is induced by the geometry of the native structure.
We can quantify this notion by calculating the similarity between minimal folding pathways. To this end we note that, for example, the transformation that is 6 bars from the bottom in Figure 21b, which contains an L N move followed by 2 short loops and an elbow, should not fundamentally be very different than the transformation 10 from the bottom in that figure, which contains a loop and 2 short elbows followed by a larger L N move. In general we treat the commonality of the moves as relevant to the overlap  rather than the specific number of residues involved, or the order of the moves that arises from the depth-first tree search algorithm.
Thus for each transformation pair we define two sequence overlap vectors in the following way. Overlaying the residues involved in moves for each transformation along the primary sequence on top of each other as in Figure 26, we count as unity those moves of the same type that overlap in sequence for both transformations, otherwise a given move is assigned a value of zero. So for example in Figure 26 the result is two vectors of binary numbers, one with 4 elements for transformation a and one with 5 elements for transformation b, based on the overlap of moves of the same type. That is, the first vector isD D a~( 1,1,0,1) and the 2nd vector isD D b~( 1,0,1,0,1). To find the pathway overlap, we also record the noncrossing distances of the various transformations which here would be two vectors of the form . Square matrices D are constructed for a and b, where each row is identical and equal to the vectorD D. This matrix then operates oñ D D to make a new vector that has distances for the elements that are nonzero inD D, and is the same length for both a and b. In the above example shown in Figure 26, . These vectors are then multiplied through the inner product, and divided by the norms ofD D a and D D b to obtain the overlap Q ab . In the above example, . In general, the formula for the overlap is given by When a~b, Q ab~1 . In the above example, Q ab v1 even if all loops were aligned, because there is no elbow move in transformation a. If two transformations have an identical set of moves, Q ab~1 if all the moves have at least partial overlap with a move of the same type in primary sequence. If a loop move in transformation b overlaps two loop moves in transformation a, it is assigned to the loop with larger overlap in primary sequence. For the first two transformations in Figure 21A, Q ab~0 :988, and for the first two transformations in Figure 21B, Q ab~0 :999. On the other hand for the first and last transformations in Figure 21B, Q ab~0 :033. Figure 27 shows the distributions of overlaps Q ab between all pairs of transformations indicated in Figures 19, 20, 21, for the three proteins shown in Figure 18. The distributions show a transition from multiple diverse minimal folding pathways for the a protein, to the emergence of a dominant minimal folding pathway for the knotted protein. The mean overlap Q between transformations can be obtained by averaging Q ab in Equation (8) over all pairs of transformations: Þ . Mean overlaps for each protein are given in the caption to Figure 27. This illustrates that topological constraints induce mechanistic pathways in protein folding. We elaborate on this in the Discussion section.

Discussion
The Euclidean distance between points can be generalized mathematically to find the distance between polymer curves; this can be used to find the minimal folding transformation of a protein. Here, we have developed a method for calculating approximately minimal transformations between unfolded and folded states that account for polymer non-crossing constraints. The extra motion due to non-crossing constraints was calculated retroactively for all crossing events of a ghost chain transformation involving straight line motion of all beads on a coarse-grained model chain containing every other Ca atom, from an ensemble of unfolded conformations, to the folded structure as defined by the coordinates in the protein databank archive. The distances undertaken by the uncrossing events correspond to straight-line motions of all the beads from the conformation before the crossing event, over and around the constraining polymer, and back to the essentially identical polymer conformation immediately after the crossing event. Given a set of chain crossing events, the various ways of undoing the crossings are explored using a depth-first tree search algorithm, and the transformation of least distance is recorded as the minimal transformation.
We found that knotted proteins quite sensibly must undergo more noncrossing motion to fold than unknotted proteins. We also find a similar conclusion for transformations between all-b and alla proteins; all-a proteins generally undergo very little uncrossing motion during folding. In fact the uncrossing distance, D nx , averaged over the unfolded ensemble, can be used as a discrimination measure between various structural and kinetic classes of proteins. Comparing several metrics arising from this work with several common metrics in the literature such as RMSD, absolute contact order ACO, and long range order LRO, we found that the most reliable discriminator between structural classes, as well as between two-and three-state proteins, was D nx per residue. (later paragraph moved here:) Knotted proteins, as compared to unknotted proteins, are the most distinguishable class of those we investigated, in that all metrics we investigated except for LRO significantly differentiated the knotted from unknotted proteins. The differentiation between structural or kinetic classes of proteins as studied here is a separate issue from the question of which order parameters may best correlate with folding rates [28,86,96,122,123]; this latter question is an interesting topic of future research. Differentiating native-structure based order parameters that provide good correlates of folding kinetics is a complicated issue, in that different structural classes may correlate better or worse with a given order parameter [123].
Non-crossing distance per residue D nx =N increases more rapidly with chain length than the mean straight line distance between residue pairs (MRSD). Considering proteins as separate domains indicates a crossover at long chain length, about N~3600. Considering proteins built up by adding successive domains, specifically for two representative multi-domain proteins in our dataset (2HA8 and 2A5E), indicates a crossover to entanglement-dominated folding mechanisms at shorter lengthsabout N~400. This crossover point may indicate a regime where energetics begins to play a role in order to fold domains independently, and avoid progressively more significant polymer disentanglement in order to fold.
Even for knotted proteins, the motion involved in avoiding noncrossing constraints is only about 13% of the total ghost chain motion undertaken had the noncrossing constraints been neglected. This was not an obvious result, to these authors at least. In contrast to melts of long polymers, chain non-crossing and the resultant entanglement does not appear to be a significant factor in protein folding, at least for the structures and ensembles we have studied here. It is tempting to conclude from this that chain noncrossing constraints play a minor role in determining folding mechanisms. It is nevertheless an empirical fact that knotted proteins fold significantly slower than unknotted proteins [100,124]. As well, raw percentages of total motion do not take into account the difficulty in certain types of special polymer movement, in particular when the entropy of folding routes is tightly constrained [12,40,41,[125][126][127]. The small percentage of non-crossing motion may offer some explanation however, as to why simple order parameters, such as absolute contact order, that do not explicitly account for noncrossing in characterizing folding mechanisms have historically been so successful in predicting kinetics.
The non-crossing distance was calculated here for a chain of zero thickness, so that non-crossing is decoupled from steric constraints. Finite volume steric effects would likely enhance the importance of non-crossing constraints, since the volume of phase space where chains are non-overlapping is reduced, and thus chain motions must be further altered to respect these additional constraints [128]. Steric constraints may significantly alter the shape of reactive trajectories, and slow kinetics by enforcing entropic bottlenecks. Such constraints may become particularly important for collapsed or semi-collapsed proteins, and knotted proteins where they restrict stereochemically-allowed folding pathways. These effects may in principle be treated by extending the present formalism to include non-zero chain thickness, and by extending the minimal folding pathway to the partition function of pathways, with each pathway having weight proportional to the exponent of the distance [72]. Such a treatment is an interesting and important topic of future work.
One potential issue in the construction of the algorithm used here is that the approximated minimal transformation is generally not equivalent to a kinetically realizable transformation. In the depth-first tree search algorithm illustrated in Figure 14, the set of crossing points defines a set of uncrossing moves that may be permuted, or combined for example through a compound leg movement as in Figure 10. However the kinetic sequence of crossing events, in particular those significantly separated in ''time'' along the minimal transformation, may not be permutable or combinable physically, at least not without modifying the distance travelled [129] Hence the transformations are treated here as approximations to the true minimal transformations that respect non-crossing.
The algorithm as described above may underrepresent the amount of motion involved in noncrossing by allowing kinetically separated moves to be commutable. On the other hand, the motion assumed in the algorithm to be undertaken by a crossing event contains abrupt changes in the direction of the velocity (corners) at the time of the uncrossing event, and so is larger than the true minimal distance. These errors cancel at least in part. It is an interesting topic of future research to develop an improved algorithm that computes minimal transformations, perhaps using these approximate transformations as a starting point for further optimization or modification.
The mathematical construction of minimal folding transformations can elucidate folding pathways. To this end we have dissected the morphology of protein structure formation for several different native structures. We found that the folding transformations of knotted proteins, and to a lesser extent b proteins, are dominated by persistent leg uncrossing moves, while a proteins have diverse folding pathways dominated simply by loop uncrossing.
A pathway overlap function can then be defined, the structure of which is fundamentally different for a proteins than for knotted proteins. While the overlap function supports the notion of a diverse collection of folding pathways for the a protein, the overlap function for the knotted protein indicates that topological polymer constraints can induce JmechanismJ into how a protein folds, i.e. these constraints induce a dominant sequence of events in the folding pathway. This effect is observed to some extent in the b protein we investigated, but is most pronounced for knotted proteins.
Other approaches have been made previously to quantify topological frustration, and construct folding pathways that minimize such frustration. Norcross and Yeates [126] have extended the earlier analysis of Connolly et. al [130], to show that edges between consecutive C a atoms in the coarse-grained primary sequence can be surrounded by a ring of other C a atoms consisting of the vertices of tetrahedra from Delauney tesselation. They then find the folding pathways that minimize the number of times a ring forms before its thread is formed within a singlesequence approximation: these indicate topologically-frustrated pathways. As an interesting example in [126], strand IV of superoxide dismutase (SOD1) is highly buried by parts of the Znbinding loop, electrostatic loop, and neighboring strands V and VII. In vitro folding studies [131,132] show however that this problem is resolved by Zn-binding after folding of the b-barrel, which is coupled with structural formation of the Zn-binding and electrostatic loops (loops IV and VII). The apo state is an energetically stressed, metastable intermediate [133]. In general, folding coupled to ligand binding could remove topological frustration by inducing unfrustrated pathways in the folding mechanism.
Similar schematic ''average'' folding mechanisms as in Figures 23, 24, 25, based on minimal folding pathways, were proposed for the complex Stevedore knotted protein a-haloacid dehalogenase by Bölinger et al [102], based on folding simulation statistics of G o o models.
Coarse-grained simulation studies of the reversible folder YibK [99] showed that non-native interactions between the C-terminal end and residues towards the middle of the sequence were a prerequisite for reliable folding to the trefoil knotted native conformation [134], the evolutionary origins of which were supported by hydrophobicity and b-sheet propensity profiles of the SpoU methyltransferase family. This suggests a new aspect of evolutionary ''design'' involving selective non-native interactions, beyond the generic role that non-native interactions may play in accelerating folding rate [135,136]. Low kinetic success rates *1{2% in purely structure-based Gō simulations are also seen in coarse-grained simulation studies of YibK [137] and all-atom simulation studies of the small a=b knotted protein MJ0366 [138]. In these studies by Onuchic and colleagues, a ''slip-knotting'' mechanism driven by native contacts is proposed, rather than the ''plug'' mechanism in [134], which is driven by non-native contacts. Both slip-knotting and plug mechanisms were described by Mohazab and Plotkin as optimal un-crossing motions of protein chains in [74]. Such mechanisms may be facilitated by flexibility in the protein backbone: highly conserved glycines in the hinge regions of both knotted and slipknotted [139] proteins modulate the knotted state of the corresponding subchain of the protein [140]. Further bioinformatic studies that investigate evolutionary selection by strengthening critical native or non-native interactions in knotted proteins are an interesting topic of current and future research. There is certainly a precedent of selection for native interactions that penalize on-pathway intermediates in some proteins such as ribosomal protein S6 [40,41,141]. Structural analysis of the deeply buried trefoil knot in acetohydroxy acid isomeroreductase indicates swapping of secondary structural elements across replicated domains likely arising from gene duplication [109], which argues in favor of knot formation driven by native interactions, through a mechanism apparently distinct from slipknotting.
Lua and Grosberg have found that, due to enhanced return probabilities originating from finite globule size along with secondary structural preferences, protein chains have smaller degree of interpenetration than collapsed random walks, and thus fewer knots than would be expected for such collapsed random walks [142], in spite of the fact that collapse dramatically enhances the likelihood of knot formation [110], an effect foreshadowed by the dramatic decrease in characteristic length for knot formation as solvent quality changes from good to ideal (theta) [107,108]. It is still not definitively answered whether this statistical selection against knots in the protein universe is a cause or consequence of the above size and structural preferences. Similarly, Mansfield [143,144] has suggested that the polar nature of the N-and Ctermini of the protein chain energetically penalize processes that would result in the formation of knots.
Conversely, some functional roles may benefit from the presence of knotted topologies. Virnau and colleagues [145] have suggested that the presence of complex knots in proteins involved in regulation of ubiquitination and proteolysis serve a protective role against incidental proteasome degradation, and as well, they observe evidence for the modulation of function by alteration of an enzymatic binding site through either the presence or absence of a knot in homologues of transcarbamylase. Phylogenetic analysis indicates that the presence of a knot is most likely mediated by a single evolutionary event involving insertions of short segments in the primary sequence [146].
The interplay between sequence-determined energetics and chain connectivity in the folding of proteins with complex or knotted topologies is a topic of much current interest, despite the fact that the number of proteins exhibiting knots or slipknots in their native structures is relatively small. It will be interesting to see how evolution has optimized sequence or facilitated proteinchaperone interactions to enable folding for these ''problem children'' of the proteome.

Supporting Information
Movie S1 Approximate solution to minimal distance transformation from a vertical line to a horizontal one.

(GIF)
Movie S2 Approximate solution to minimal distance transformation from an unfolded conformation of protein 1CSP to the folded conformation, where the chains are ghost chains.

(GIF)
Movie S3 Approximate solution to minimal distance transformation from an unfolded conformation of protein 1CSP to the folded conformation, where the chains are ghost chains; instances of self-crossing are emphasized.

(GIF)
File S1 Includes: Description and analysis showing that crossing detection and uncrossing distance are independent of the projection plane, including Figure SA; Structural alignment statistics of our protein dataset, including Table SA, Table SB