DBFOLD: An efficient algorithm for computing folding pathways of complex proteins

Atomistic simulations can provide valuable, experimentally-verifiable insights into protein folding mechanisms, but existing ab initio simulation methods are restricted to only the smallest proteins due to severe computational speed limits. The folding of larger proteins has been studied using native-centric potential functions, but such models omit the potentially crucial role of non-native interactions.Here, we present an algorithm, entitled DBFOLD, which can predict folding pathways for a wide range of proteins while accounting for the effects of non-native contacts. In addition, DBFOLD can predict the relative rates of different transitions within a protein’s folding pathway. To accomplish this, rather than directly simulating folding, our method combines equilibrium Monte-Carlo simulations, which deploy enhanced sampling, with unfolding simulations at high temperatures. We show that under certain conditions, trajectories from these two types of simulations can be jointly analyzed to compute unknown folding rates from detailed balance. This requires inferring free energies from the equilibrium simulations, and extrapolating transition rates from the unfolding simulations to lower, physiologically-reasonable temperatures at which the native state is marginally stable. As a proof of principle, we show that our method can accurately predict folding pathways and Monte-Carlo rates for the well-characterized Streptococcal protein G. We then show that our method significantly reduces the amount of computation time required to compute the folding pathways of large, misfolding-prone proteins that lie beyond the reach of existing direct simulation methods. Our algorithm, which is available online, can generate detailed atomistic models of protein folding mechanisms while shedding light on the role of non-native intermediates which may crucially affect organismal fitness and are frequently implicated in disease. Author summary Many proteins must adopt a specific structure in order to function. Computational simulations have been used to shed light on the mechanisms of protein folding, but unfortunately, realistic simulations can typically only be run for small proteins, due to severe limits in computational speed. Here, we present a method to solve this problem, whereby instead of directly simulating folding from an unfolded state, we run simulations that allow for computation of equilibrium folding free energies, alongside high temperature simulations to compute unfolding rates. From these quantities, folding rates can be computed using detailed balance. Importantly, our method can account for the effects of nonnative contacts which transiently form during folding and must be broken prior to adoption of the native state. Such contacts, which are often excluded from simple models of folding, may crucially affect real protein folding pathways and are often observed in folding intermediates implicated in disease.


Introduction
Many proteins suffer from very slow or inefficient folding from a denatured state owing 14 to a tendency to misfold into non-native intermediates. Such intermediates can be 15 detrimental in vivo, where they may be degraded, form toxic oligomers, or aggregate, 16 potentially leading to loss of fitness and/or disease [1][2][3][4][5][6][7]. Organisms deploy various 17 cellular mechanisms to mitigate protein misfolding including chaperones [4,[8][9][10] and 18 co-translational folding on the ribosome [5, [11][12][13][14][15][16][17], which may be enhanced by slowly 19 translating codons located at nascent chain lengths that show optimal folding 20 properties [15][16][17]. Despite the fact that non-native folding intermediates exert 21 widespread and significant consequences, we have yet to develop a detailed atomstic 22 understanding of how they slow folding, and how cellular mechanisms reduce their 23 formation and detrimental effects. All-atom simulation methods such as Molecular 24 Dynamics (MD) and Monte-Carlo (MC) simulations have the potential to generate 25 detailed models of folding, but unfortunately their use has thus far been restricted to 26 small proteins which typically engage in few nonnatives interactions (<≈ 100 amino 27 acids), due to severe limits in computational speed [18]. The folding of larger proteins, 28 which comprise the majority of the proteome, can be simulated using native-centric Go 29 models [18][19][20], but such models lack non-native interactions which may crucially affect 30 real folding pathways. 31 To address these difficulties, various enhanced sampling techniques have been 32 developed that allow the folding of complex proteins to be investigated without 33 requiring ab initio simulation. For instance, replica exchange or parallel-tempering [21] 34 whereby multiple simulations are run in parallel under different conditions and 35 information is periodically exchanged between cores, can assist a protein in sampling 36 folding intermediates that are separated from the initial structure (often the 37 equilibrated native state) by large kinetic barriers. Such simulations can then be 38 analyzed using methods such as WHAM [22] or MBAR [23] to infer the free energies of 39 intermediates. However, the implementation of replica exchange comes at the expense of 40 realistic state transition kinetics, and replica exchange is unlikely to promote sampling 41 saddle points in the free energy landscape, thus hindering barrier-height computation, simulations can be extrapolated to physiologically-reasonable temperatures at which the 56 native state is stable, and combined with intermediate free energies inferred from the 57 former simulations to compute unknown folding rates from detailed balance. In what 58 follows, we develop theory to elucidate the conditions under which our method can be 59 applied. As a proof of principle, we then apply this technique to investigate the folding 60 pathways, including the role of nonnative states, for the well-characterized Streptococcal 61 protein G. We show that we can accurately predict transition rates between protein G 62 folding intermediates, which for this small protein, can be verified via direct folding 63 simulation Finally, we discuss how this method can be applied to larger, more complex 64 proteins whose folding pathways have not been well studied. Our implementation of this 65 algorithm, DBFOLD, includes both the latest version of MCPU-an all-atom 66 Monte-Carlo simulation platform that we use in this work-as well as a user-friendly 67 Python package that analyzes simulations to compute folding rates using the techniques 68 described here. In its current implementation, DBFOLD computes rates in Monte-Carlo 69 (MC) units. This allows for meaningful comparison of the relative rates of different 70 steps in a given protein's folding pathway. Additionally, MC folding rates can be 71 meaningfully compared across truncated forms of a given protein in order to elucidate 72 how vectorial synthesis affects co-translational folding [17].

86
Developing a coarse-grained folding landscape 87 In order to apply this technique, we must first coarse-grain a protein's folding landscape 88 into a set of meaningful intermediates. It is crucial that this be done carefully such that 89 detailed balance can be used to compute folding rates between the resulting 90 intermediates-this may not be the case, for instance, if intermediates are defined in such 91 a way that transitions are non-Markovian. To proceed, we deploy an approach similar 92 to the one described in [20], where we generate a native contact map for an equilibrated 93 protein and identify islands of contiguous native contacts, referred to as substructures 94 (See Fig. 2A for examples, and Methods for details). We expect that during each 95 on-pathway transition in the folding/unfolding process, one such substructure 96 forms/breaks cooperatively, and that these transitions are accompanied by a high free 97 energy barrier [20]. This is justified because making the first set of contacts in a 98 substructure typically entails the formation of a loop, which carries a large entropic loss 99 that is not compensated by an enthalpic benefit until subsequent contacts within the 100 substructure form. We next define a topological configuration as a possible subset of 101 native substructures that can be formed during the folding process. In Fig. 2B, we show 102 sample structures of E. Coli DHFR assigned to various topological configurations, for example abcdef g (all native substructures formed), cd (only substructures c and d 104 formed), and ∅ (no substructures formed).

105
In the simple case where only native contacts can form during folding, then we 106 expect that transitions between topological configurations will show Markovian 107 dwell-time distributions, owing to the high free energy barriers associated with the 108 transitions. The resulting network of topological configurations thus resembles a Markov 109 state model [30] in which states are defined according to structural similarity, rather 110 than based on kinetic data. But in reality, a protein may also form nonnative contacts 111 at any stage in the folding process which may impede the formation of additional native 112 substructure(s), and must be broken before productive folding can proceed. We define a 113 coarse state S i = {s n i } as the collection of all microstates containing nonnative contacts 114 that are topologically consistent with a given topological configuration, indexed by i, as 115 well as microstates with topological configuration i that lack nonnative contacts. The 116 presence of nonnative microstates may lead to non-Markovian behavior for transitions 117 between S i and some other coarse state S j with topological configuration indexed by j 118 (which differs from the one indexed by i by the formation/breaking of one substructure), 119 owing to complex internal dynamics involving these microstates. Nonetheless, it turns 120 out that, so long as certain conditions are satisfied, then detailed balance can still be 121 used to compute unknown transition rates between coarse states. Two such 122 non-mutually exclusive conditions are briefly described below, and detailed in the SI 123 section with heading "Details on conditions for applicability of method". DHFR (right). Substructures are labeled alphabetically. Native contacts not assigned to 128 a substructure are shown in gray on the contact maps. We note that, for brevity and 129 ease of visualization, we omit contacts involving residues less than 8 amino acids apart 130 in sequence (and thus intra-helical contacts are excluded). (B) Sample topological 131 configurations, alongside representative simulation snapshots for DHFR are shown. A 132 snapshot is assigned to a given topological configuration if it contains a certain subset of 133 folded native substructures (indicated by the labels above, see Methods for details 134 regarding the assignment process), and configuration ∅ includes snapshots with no folded 135 native substructures. (C-D) Schematic illustrations of conditions I (C) and II (D) under 136 which a detailed-balance like relationship can be used to compute folding rates between 137 topological configurations given equilibrium probabilities (from enhanced sampling) and 138 unfolding rates (from high-temperature simulations Where δ H(i,j),1 , the Kronecker delta function, has value 1 if i and j differ by exaclty one 154 substructure and 0 otherwise, while k i→j and k j→i refer to the rates of transition from 155 S i to S j and vice versa, respectively. These rates are generally temperature-dependent 156 as discussed later, but the temperature-dependence is omitted from the notation for 157 brevity. Importantly, these transition rates satisfy detailed balance. That is, letting 158 P eq (S i ) and P eq (S j ) denote the equilibrium Boltzmann probabilities and G Si and G Sj 159 the free energies of states S i and S j , respectively, we have: using equation (2), we can solve for the unknown folding rate k i→j . 168 We further note that the folding landscape can be further coarse-grained by 169 grouping together sets of coarse states into a cluster C k , so long as the slowest timescale 170 of exchange between states within C k is significantly faster than the fastest timescale of 171 exchange between C k and any other cluster C l . If so, then these clusters will themselves 172 behave as Markovian states whose occupancy probabilities satisfy equations (1) and (2). 173 Such clustering may be desirable so as to reduce the number of parameters in the model, 174 as explored for protein G below.

175
Condition II. In case condition I is not satisfied for a coarse state S i , then the 176 dwell time distribution associate with the transition to any other coarse state S j will 177 show multi-exponential behavior, and thus we cannot meaningfully define a single rate 178 k i→j for the transition. Nonetheless, it turns out a detailed-balance like relationship can 179 still be used to compute the mean first passage time (MFPT) to state S j so long as 180 certain conditions are satisfied. Namely, we consider a subset S h i ⊂ S i , termed a hub 181 state, which contains the only microstates belonging to S i from which transitions to S j 182 can occur. We require that 1.) All s n i ∈ S h i equilibrate with one another rapidly relative 183 to transitions to S j , such that the hub itself satisfies condition I and shows Markovian 184 behavior, and 2.) Upon first reaching S i , the system must start in the hub with 185 probability 1. Under these conditions, we can compute the MFPT to reach S j , 186 conditioned on the facts that the protein has just transitioned into S i (and thus, by transition times"), we obtain: Where we have assumed that the reverse transition from S j into S i also satisfies either 192 condition I or II. We thus find that, even though the S i to S j transition does not show 193 Markovian behavior, the inverse MFPTs nonetheless satisfies a relationship akin to 194 equation (2). Thus, so long as we can extrapolate the reverse unfolding timescale 195 < τ > j→i from high temperatures (justifiable under conditions described below), then 196 we can use equation (3) to compute the folding the MFPT as a characteristic timescale 197 for the folding transition between coarse states S i and S j . As in the previous section, 198 we can cluster S i with any other coarse states with which it rapidly exchanges, so long 199 as these coarse states satisfy either conditions I or II. If so, then the resulting cluster's 200 hub state retains the properties above and thus the cluster still satisfies condition II.

201
Finally we note that whereas condition I applies to a coarse state/cluster as a whole, 202 condition II is specific to a transition, and may not apply to an entire state/cluster. For 203 example, suppose the subset of microstates via which S i can transition to S j does not 204 fully overlap with the subset that permits transition to S k . If we further suppose that 205 only the subset that allows transition to S j rapidly equilibrates internally, then the S i 206 to S j transition will satisfy condition II, but the S i to S k transition will not.

207
Requirements for Extrapolation Under certain conditions, unfolding rates 208 obtained from high temperature simulations can be extrapolated using a an appropriate 209 model. In this work, we use the Arrhenius equation, given by Where k j→i (T ) is the transition rates from clusters C j to C i as in the previous section 211 (with the temperature dependence now explicitly considered), k 0 j→i is an intrinsic,  most of the enthalpy that stabilizes s will have been lost, but the entropy 226 associated with disrupting s will not yet have been gained, as the residual 227 contacts will severely restrict the conformational freedom of residues involved in 228 this substructure. However, the precise position at which this saddle occurs may 229 change over a large temperature range.

230
Moreover, in some cases unfolding of s will be preceded by the breaking of some set 231 of nonnative contacts n which are observed with high probability in cluster C j , but not 232 in cluster C i . This may occur, for instance, if n and substructure s are energetically 233 coupled. If so, then we additionally require that 234 3. Cluster C j must satisfy conditions I or II. In case condition II, but not I is 235 satisfied, we cannot define a single unfolding rate, and we instead extrapolate the 236 inverse mean-first passage time (MFPT) to unfolding. If condition I is satisfied, 237 then this inverse MFPT is equivalent to the unfolding rate. Thus for generality we 238 deal with with the inverse MFPT throughout rest of this text 239 4. In case C j is composed of multiple nonnative microstates s n j (all of which contain 240 nonnative contacts that must be broken), then one such microstate s m j * must at all temperatures of interest. The presence of multiple minima with comparable 243 but non-identical equilibrium probabilities will produce nonlinearities in the 244 dependence of log(k j→i ) on inverse temperature (See SI section with heading 245 "Justification of Arrhenius kinetics") as different minima may be favored at 246 different temperatures. 247 We further note that, when utilizing this method in practice, we assume that the 248 protein will fold via the opposite sequence of substructures as that through which it 249 unfolds at high temperatures, as we can only obtain rates of folding for transitions for 250 which we have extrapolated the reverse unfolding time. This is generally true because a 251 protein will transition in both directions via whichever sequence of topological 252 configurations involves the lowest rate-limiting barrier, but this optimal sequence may 253 change over a wide temperature range.

254
Computing equilibrium folding properties for protein G

255
As a proof of principle, we now apply our method on Streptocaccal Protein G, a model 256 protein whose folding has been extensively studied using both computational and 257 experimental methods [20,[31][32][33][34][35][36][37][38]. We begin by running equilibrium simulations with 258 replica exchange and umbrella sampling using native contacts as the reaction coordinate 259 along which we bias [39] (See Materials and methods). These simulations were run for a 260 total of 1.  Computing unfolding rates for protein G

295
To compute rates of transition between states, we run unfolding simulations at a range 296 of high temperatures above T M . By extrapolating unfolding times to physiological 297 temperatures, we can then compute folding times using detailed balance provided our 298 coarse states satisfy Conditions I or II from the previous section, which we verify later. 299 As discussed previously, we expect that the unfolding pathways will correspond to the 300 reverse of the folding pathways. We find that the protein unfolds via one of two parallel 301 pathways ( Fig. 4A) in which either substructure d or a can unfold first, followed by the 302 other, and finally c unfolds last. We now attempt to simplify our model of folding by 303 clustering together topological configurations that exchange rapidly (See Methods). We 304 find that the central helix (substructure b) folds and unfolds very rapidly compared to 305 the timescale with which the beta-sheet substructures (a, c, and d) form/break. Thus, 306 we construct kinetic clusters, which we refer to as a(b)cd, a(b)c, (b)cd, (b)c, and b/∅, in 307 which every observed combination of beta substructures occurs alongside helix b in 308 either its folded or unfolded state. So long as these clusters satisfy the requirements for 309 extrapolation in the previous section, we expect to be able to extrapolate unfolding 310 rates for each transition between clusters to low temperatures, and compute the reverse 311 folding rates. Indeed, we find that the inverse mean-first passage times (MFPTs) for 312 each unfolding step as a function of inverse temperature are well fit by the Arrhenius 313 equation (Fig. 4B). This suggests that unfolding times can be appropriately 314 extrapolated down to physiologically-relevant temperatures. We note that all clusters 315 show clear single-exponential survival probability curves as a function of MC step, with 316 the exception of (b)cd whose survival appears to show multi-exponential decay (Figs.

317
4C and S3). This is because, as we show later, this latter cluster does not satisfy Characterizing off-pathway nonnative states 339 Next, we characterize the off-pathway nonnative states the protein can adopt at each 340 stage in the folding process. To this end, we generate nonnative contact maps for 341 snapshots assigned to each cluster at physiologically-reasonable temperatures, then 342 group these maps based on similarity to identify recurrent nonnative states (Fig. 5A, 343 Fig. S4, and Methods). We find that when the protein is fully unfolded (cluster ∅/b), it 344 can form nonnative contacts, but these do impede the formation of the C-terminal beta 345 hairipin, which is the first productive folding step (Fig. S4). In contrast, once that 346 hairpin is formed (cluster (b)c), two recurrent nonnative states are observed, both of 347 which contain contacts that must be broken prior to subsequent folding steps (Fig. 5A, 348 left). Namely, the N-terminal beta strand (residues 5-10) forms nonnative contacts with 349 the C-terminal hairipin by docking with either residues 40 − 50 (nonnative state 1) or Moreover, these nonnative contacts seem to rapidly preform while the protein is still in 374 the previous cluster (b)c (Fig. 5A). Thus, this cluster also violates condition II, which 375 requires that the system start in the hub state from which productive transitions to the 376 next cluster, a(b)cd can occur. Given that both conditions are violated, we cannot 377 accurately predict the a(b)c to a(b)cd transition time in the context of an ab initio 378 folding trajectory (Fig. S5A). However, if we instead initialize simulations from a(b)c 379 snapshots in which no nonnatives are present, we artificially ensure that this cluster 380 satisfies condition II, and can now accurately predict the folding time (Fig. S5B).

381
Finally we consider cluster (b)cd. Although the nonnative contacts observed in this 382 cluster form slowly (Fig. 5B, (3)) to compute the inverse mean-first passage 406 time (MFPT) for each folding transition (Fig. 6). This quantity is equivalent to the 407 exponential folding rate (equation (2))for clusters that satisfy condition I. We then 408 compare these predictions with observed transition times obtained from serial refolding 409 simulations (See methods). We find that, for all transitions which satisfy condition I or 410 II, the predicted and observed inverse MFPTs closely agree at physiological from refolding trajectories for rare folding events whose inverse MFPT is much less than 416 10 −9 MC steps. 3.) Imperfect convergence of equilibrium simulations, which may bias 417 our free energy estimations. This latter issue particularly affects the calculation of the 418 ∅/b− > b(c) (C-terminal hairpin) folding rate at temperatures below T ≈ 0.9T M , as the 419 unfolded cluster ∅/b is highly unstable at these temperatures and is thus rarely 420 populated, leading to significant error. Despite this uncertainty, we nevertheless observe 421 qualitatively clear anti-Arrhenius behavior with temperature for this transition, which 422 has been previously described theoretically and experimentally for the folding of a 423 simple hairpin without interference from non-native contacts [40][41][42]. For all other 424 transitions, we observe that as temperature is decreased below the melting temperature, 425 folding rates decrease owing to increased stabilization of non-native contacts. In the 426 case of the (b)cd → a(b)cd transition, this decrease is more modest, owing to the fact 427 that nonnative traps stabilizing cluster (b)cd are relatively shallow (Fig. 5A, right). transition is never observed in our serial folding simulations, suggesting it is likely very 436 slow compared to all other transitions (Fig. S5). We note that, although we are able to 437 predict inverse MFPTs from (b)cd to a(b)cd (due to condition II being satisfied), this 438 transition will technically show multi-exponential behavior, which we neglect when we 439 treat it as a singe-exponential process in the master equation. Solving the equation, we 440 observe that following rapid folding of the C-terminal beta hairpin (cluster (b)c), the 441 majority of the flux rapidly proceeds into pathway 1, entering cluster a(b)c. But this 442 pathway represents a trap, owing to the extremely slow (approximated as 0) rate of 443 transition from a(b)c to the fully folded state. Thus, any flux that enters this pathway 444 must backtrack to (b)c before folding can proceed through the productive pathway 2.

445
This requires the separation of the N and C-terminal beta strands, which tend to pair 446 up in a nonnative, anti-parallel registration in clusters (b)c and a(b)c. Once equilibrium 447 is reached (around 4 * 10 10 MC steps), roughly 95% of the population is fully folded at 448 this simulation temperature of T = 0.91 T M , although qualitatively similar behavior is 449 observed at other temperatures (Fig. S6). 450 We further note that we have coarse-grained out the folding of the central helix In the previous sections, we have verified that DBFOLD can accurately predict folding 474 pathways and Monte-Carlo rates for protein G, a small protein for which these 475 predictions can readily be verified via direct folding simulation. But for larger, more 476 complex proteins, DBFOLD can compute folding pathways at a significantly reduced 477 computational cost relative to direct simulation, which would require unreasonably long 478 simulation times. We quantify this computational speedup for various proteins 479 simulated here and in reference [17] by tallying the total time that was required to run 480 equilibrium simulations to convergence, and to collect sufficient high-temperature 481 unfolding statistics to allow for extrapolation prior to use of DBFOLD (Fig. 7,   482 rightmost column). These times range from 2 to 5 · 10 9 MC sweeps, which corresponds 483 to wall times on the order of 10 CPU days on a 4x AMD Opteron 6376 2.3 GHz 484 processor. For each protein, we compared this time to the the average time required for 485 a direct simulation to undergo the rate-limiting folding step in the context of the 486 MCPU potential and moveset (Fig. 7, second column to right). These times, which 487 were estimated directly from the predicted rates obtained from DBFOLD, range from order-of-magnitude range in predicted folding times is comparable to the experimentally 491 measured range from sub-microseconds to thousands of seconds [40].

492
From these values, we see that DBFOLD significantly reduces the computational 493 costs required to compute folding properties for the proteins MarR, FabG, and 494 CMK-the latter two of which are larger than 200 AAs in size. In contrast, for the 159 495 AA protein DHFR, the cost to use DBFOLD is comparable to the direct simulation 496 cost, whereas for the sub-100 AA HEMK N-terminal domain and protein G, DBFOLD 497 is significantly less efficient than direct folding simulation. This is because these later 498 two proteins' fast folding rates are readily accessible to direct simulation. Indeed in 499 reference [17], we confirm that HEMK NTD folding events are observable within the 500 timescale of a short unfolding/refolding trajectory, as is the case here for protein G.

501
Interestingly, we note that the three proteins for which DBFOLD confers a substantial 502 computational advantage are the same ones that were predicted in reference [17]to 503 benefit from co-translational folding. Namely, these proteins were found to fold very 504 slowly due to nonnative contacts involving C-terminal residues, which can be the total duration of simulations (time per processor multiplied by number of processors) 516 that were run prior to using the DBFOLD algorithm, including both the time for 517 equilibrium simluations to reach convergence and the total duration of high-temperature 518 unfolding simulations. Simulations for Protein G were run in this work (Figs. 3-6) while 519 simulations for the other indicated proteins were run in reference [17]. We note that for 520 MarR, simulations were run for the monomeric species, which is relatively unstable in 521 the absence of a dimerization partner, hence a lower simulation temperature was used. 522  [43,44]. However, these simulations by themselves cannot be used to obtain 532 rates of folding, nor can they elucidate the role of nonnative contacts, which may 533 stabilize folding intermediates at physiologically-relevant temperatures but be absent at 534 high temperatures at which unfolding simulations are run.

535
Our work overcomes this limitation by making use of replica-exchange simulations to 536 model intermediate states under physiological conditions. We note, however, that our 537 approach for simulating kinetically-complex proteins differs from previous efforts to 538 extract kinetic information from replica-exchange simulations [45][46][47][48]. While these 539 previous methods can accurately compute transition rates for small proteins that 540 rapidly transition between conformers, larger proteins rarely undergo such transitions 541 within simulation timescales in the absence of a biasing potential. Although methods 542 have been developed to obtain rates from biased simulations such as our equilibrium 543 simulations [49], these techniques require knowledge of a functional form for how bias 544 affects transition rates, which may not be known for complex systems. Our method is 545 the first, to our knowledge, which combines biased replica-exchange simulations with 546 high-temperature unfolding simulations to overcome the difficulty in accurately 547 computing transition rates for large proteins. Moreover, while we have chosen to bias 548 our equilibrium simulations along one specific order parameter-namely native 549 contacts-our method does not require us to have chosen an optimal order parameter 550 along which slow transitions occur, as is the case with Metadynamics-based 551 methods [50,51]. Rather, the purpose of our biasing potential is to promote sampling of 552 distinct minima along the folding landscape, and not necessarily the transition saddle 553 points. The free energy barriers associated with these transitions are instead obtained 554 via extrapolation from high-temperature unfolding simulations. Thus, we expect that 555 biasing along alternative order parameters, such as root-mean squared deviation 556 (RMSD) and, radius of gyration (R G ), will yield similar results, and believe it would be 557 interesting to compare their efficiencies in future work.

558
Validation on protein G 559 To validate our method, we demonstrate that it can be used to predict folding pathways 560 and Monte-Carlo rates for the well-studied Streptococcal protein G. We find that, for all 561 folding transitions which satisfy certain requirements, the predicted rates agree closely 562 with observed rates, which for this relatively small protein, can be directly computed 563 via ab initio simulation. Moreover, these predicted rates paint a picture of complex 564 folding kinetics involving multiple intermediates and off-pathway nonnative states (Fig. 565  6E), consistent with experimental findings that the protein cannot be described as a 566 simple two or three-state folder [36]. Our model further reproduces a number of 567 atomistic-level findings regarding protein G's folding pathway. For example, we observe 568 that the C-terminal hairpin is the first structural element to fold, consistent with a 569 number of previous experimental and computational works [20,31,32]. Furthermore we 570 find that pathway 2, in which the N and C-termini dock together immediately after the 571 August 26, 2020 13/26 C-terminal hairpin forms, represents the dominant pathway through which productive 572 folding occurs. This is consistent with previous work using a native-centric 573 potential [20].

574
Use of method to investigate folding of complex proteins 575 In addition to cross-validating DBFOLD's predictions against direct folding simulations 576 of the fast-folding protein G, we have shown that this method significantly reduces the 577 computational effort required to predict folding pathways and Monte-Carlo rates for 578 larger proteins whose folding is significantly slowed by nonnative contacts. The effects 579 of nonnative contacts may have been significantly underestimated by previous 580 computational studies of protein folding, many of which rely on native-centric (Gö) 581 potentials. These simplified potentials allow complete folding trajectories to be 582 simulated in a reasonable amount of computation time, and may provide a valid 583 description of folding pathways for proteins that are minimally frustrated [52]. However, 584 many proteins, especially larger ones, may suffer from significant nonnative trapping, 585 and their folding thus cannot be accurately described using native-centric potential 586 functions. In contrast to many existing techniques, DBFOLD can generate detailed 587 atomistic predictions of nonnative states and account for their effect on folding times.

588
Thus, the method may be used to shed light on myriad cellular processes where these 589 states play a crucial role including co-translational folding [17], the role of chaperones, 590 and non-native oligomerization or aggregation.

591
A number of considerations may be relevant when DBFOLD is applied to larger 592 proteins. On the one hand, for large proteins, we expect that nonnative contacts will 593 often form rapidly compared to native contacts, owing to the plethora of nonnative 594 states that are possible alongside the fact that nonnative contacts are often 595 short-range [53,54] Use of DBFOLD to generate experimentally-testable predictions 630 The atomistic description of folding provided by DBFOLD can be used to generate novel 631 experimentally-testable predictions, including the role of nonnative contacts. In the case 632 of protein G, we observe that the N-terminal beta strand (residues 5 − 10) frequently 633 docks with the C-terminal hairpin in a non-native, anti-parallel fashion early during 634 folding (Fig. 4A). These nonnative contacts persist for a significant amount of time, 635 sometimes causing the protein to enter the non-productive pathway 1 whereby both 636 hairpins close while misaligned. This leads to an off-pathway kinetic trap. Although 637 previous work has also identified nonnative states involving these hairpins docked in an 638 anti-parallel orientation [38], our work provides a quantitative estimate of the effect of 639 these nonnative contacts on folding kinetics. Namely, our finding that a significant 640 fraction of the population adopts long-lived states in which the hairpins misalign in an 641 antiparallel fashion (Fig. 6E) suggests these nonnative contacts should be observable by 642 FRET. Our method can also be applied to predict how sequences changes affect a 643 protein's folding pathway. For instance, previous work suggests that protein G's close 644 structure homolog, protein L, folds via its N-terminal hairpin first [31,32], in contrast to 645 protein G whereby folding begins with the C-terminal hairpin. Consistent with this, 646 DBFOLD predicts increased stability for intermediates along this N-terminal pathway 647 in protein L as compared to protein G (Fig. S11). Moreover, we find that the ratio of 648 the N-to-C terminal hairpin folding rates is higher for protein L than for protein G (Fig. 649  S12), although the simulations do not predict a complete shift in the folding flux 650 towards the N-terminal pathway. Nevertheless, we observe the expected trend between 651 these two proteins, and we note that DBFOLD can readily be applied to simulations 652 under alternative energy functions which may capture this particular mutational effect 653 more thoroughly. When used with our MCPU potential, we expect that DBFOLD may 654 accurately predict sequence effects on the folding of proteins for which the potential has 655 successfully predicted mutational changes before, including E. Coli DHFR [55] and 656 human γD-crystallin [56]. Finally, our method has been previously applied to generate 657 atomistic-level predictions of how complex, misfolding-prone proteins may begin folding 658 co-translationally [17]. These predictions can be readily tested by purifying and 659 biophysically characterizing protein fragments of different lengths.

660
When DBFOLD is used alongside Monte-Carlo simulation algorithms such as 661 MCPU, it can predict relative rates of different folding transitions for a given protein, 662 but not absolute rates in measurable units, as the predicted rates are all in Monte-Carlo 663 units. Knowledge of these relative rates is nevertheless useful-for instance, it allows for 664 rate-limiting folding step(s) to be identified. Likewise, kinetic models can be generated 665 that predict the relative populations of different folding intermediates over time (e.g. 666 Fig. 6E), even if the absolute timescale over which this evolution occurs is not known. 667 In order to convert Monte-Carlo rates to experimentally-measurable units, it will be 668 necessary to benchmark the MC simulations against experimentally measured folding 669 times. Alternatively, DBFOLD could in principle be utilized with molecular dynamics 670 (MD) simulations, which have the obvious advantage of predicting folding rates in 671 absolute, measurable units. We expect the method to work well with MD so long as the 672 requirements presented here are satisfied. In particular, it will be important to verify 673 that unfolding rates can still be well-fit to the Arrhenius equation. The presence of 674 explicit solvent may introduce a temperature-dependence to the unfolding-rate 675 prefactors and activation energies, in which case an alternative model should be used for 676 rate extrapolation [57]. However, it is worth noting that the small simulation timesteps 677 used in MD render these simulations much slower to run than MC simulations. Even 678 with the use of replica exchange and enhanced sampling, the timescales required to 679 achieve convergence in MD simulations may significantly exceed accessible computation 680 times, particularly for large proteins [24,58,59]. In principle, our method can be applied in conjunction with any protein molecular 684 dynamics or Monte Carlo simulation software so long as detailed balance is obeyed.

685
Here, we utilize an atomistic Monte Carlo (MC) simulation package, MCPU, described 686 in previous works [17,[60][61][62]. This package, whose latest version is available online as 687 part of DBFOLD , uses a knowledge-based potential to rapidly compute energies while 688 accounting for both native and non-native interactions. We model all backbone and 689 sidechain atoms with the exception of hydrogen, and assign to each configuration an 690 energy contains terms accounting for contacts between atoms, hydrogen bonding, 691 relative orientation of aromatic residues, as well as local backbone and side chain torsion 692 angle strain. Our MC moveset allows for both sidechain and backbone rotations, as well 693 as "local moves" which modify the dihedral angles of only three consecutive residues 694 while keeping the rest of the backbone intact. In order to ensure detailed balance is 695 satisfied, a proposed move from a configuration with atomic coordinates x n to one with 696 coordinates x m is accepted with probability given by the Metropolis-Hastings Criterion: 697 Where E(x n ) and E(x m ) are the energies of the respective configurations while 698 J(x n ) and J(x m ) are Jacobian determinants that account for changes in the size of 699 phase space owing to local moves-for details see [63,64] 700 To compute a protein's thermodynamic properties, we deploy enhanced sampling 701 techniques in order to aid convergence to equilibrium. First, for each trajectory, we add 702 to the energy function a harmonic biasing term that encourages the simulation to 703 explore with a number of native contacts in the vicinity of some setpoint S. Namely, a 704 configuration with unmodified energy E(x n ) 0 (computed as described above) and 705 number of native contacts N (x n ) will be assigned a modified energy given by Equilibrium simulations are run at a range of temperatures and setpoints as In our simulations, we implement exchanges every 500,000 MC steps, at which 75 715 pairs of cores with adjacent setpoints or temperatures are randomly chosen to attempt 716 an exchange.

717
All simulations except refolding simulations are initialized from a relaxed structure, 718 which is generated from a starting crystal structure (PDB ID 1igd for protein G) by 719 first running a simulated annealing protocol whereby the protein is subjected to 720 gradually decreasing temperatures from T = 0.45 to T = 0.1, and allowed to equilbrate 721 for 2 million MC steps at each temperature without umbrella biasing. Next we run an 722 initial set of replica exchange simulations starting from the annealed structure in the 723 previous steps, with multiple umbrella setpoints and temperatures as low as T=0.2.

724
Together, these two steps increase the likelihood that the protein will undergo small 725 conformational changes needed to reach the energy minimum in this potential. The 726 lowest energy structure after the second step is defined as the equilibrated native 727 structure, which is used to initialize the subsequent simulations:  The number of native contacts is computed by counting the number of alpha 733 carbon (CA) pairs whose distance within the equilibrated native structure is less 734 than some cutoff (typically 6 Angstroms for predominantly beta-sheet proteins 735 such as protein G and 8 Angstroms for helical proteins where residues often 736 interact via their side-chains, thus resulting in larger CA distances) and whose 737 separation in sequence is at least 8 residues (to include only tertiary contacts). A 738 value of k bias = 0.02 is used in simulation energy units. Simulations are run until 739 convergence is reached, as assessed using the methods described in the next 740 subsection. 741 2. High temperature unfolding simulations that implement neither umbrella biasing 742 (i.e. with k bias = 0) nor replica exchange are run at a range of temperatures above 743 the melting temperature for 100 million MC steps. 744 We additionally run refolding simulations, which are initialized from various 745 structural intermediates, as described in the main text. These simulations again deploy 746 neither umbrella bias nor replica exchange.

748
To assess the convergence of our equilibrium simulations, we compute the simulation 749 energy, averaged over a sliding window, as a function of MC step for each umbrella at 750 various temperatures (Fig S2A-C). For protein G, we find that by 1 billion Monte Carlo 751 steps, these average energies stop changing, suggesting convergence. This is further 752 supported by examining a more global metric of convergence-namely for each trajectory 753 with setpoint s, we compute the deviation between s and the number of native contacts 754 at each MC step. We then compute the root-mean-square of this deviation, averaged 755 over all trajectories with setpoint s across all temperatures and over a sliding window of 756 50 million MC steps. (Fig. S2D). Based on these metrics, we compute equilibrium 757 properties using the last 150 million MC steps , but our results do not significantly 758 change if we slightly vary the MC steps used (Fig S2E-H compute equilibrium probability P T eq (X) of observing some observable value X at 796 temperature T (in the absence of umbrella biasing). X may correspond, for instance, to 797 some topological configuration or some number/fraction of native contacts. Given these 798 equilibrium probabilities, we can define a dimensionless potential of mean force (PMF) 799 at temperature T as a function of X as: In Figure 4C, we show PMF values for each coarse state computed from the equilibrium 801 probabilities as above. From these probabilities, we can also compute the ensemble 802 average of X over all observed values, X i , as 803 X = i X i P T eq (X i )) (9) Figure 4B shows the mean number of native contacts as a function of temperature, 804 computed in this way.

805
Computing and extrapolating unfolding rates 806 To obtain transition rates from high temperature unfolding simulations, we first assign 807 all observed snapshots to their respective topological configurations. For the larger 808 proteins simulated in reference [17], we applied an additional filtering step to eliminate 809 configurations that are rarely observed, thereby reducing the number of total 810 configurations and parameters in the model. Namely, any snapshot assigned to a 811 topological configuration that encompasses less than some fraction s (typically s = 1%) 812 of all unfolding simulation snapshots is reassigned to either the configuration that came 813 before or after it in the trajectory, depending on which it is most topologically similar 814 to. In the main text, this step was not applied for protein G, where the total number of 815 configurations occupied was small. However, we show in Fig. S9  transition between them in either direction, given that the system is in one of the two 828 states.We then define two configurations as adjacent if their kinetic distance is less than 829 some threshold, and define clusters as connected components of the resulting adjacency 830 matrix. The value of the adjacency threshold T A , set to 100 million MC steps for 831 protein G, is chosen to lie between any highly separated timescales that are observed, 832 such that configurations within a cluster exchange much faster than configurations in 833 different clusters (Fig. S7). This criterion ensures that clusters obey the criteria 834 outlined in Conditions I and II in the main text.

835
Having assigned trajectory snapshots into clusters, we now estimate the survival 836 probability, P S (C j , t, T ) for cluster C j , namely the probability that a trajectory which 837 transitioned into cluster C j from some other cluster at t=0 will not yet have made any 838 excursions out of C j after time t has elapsed at temperature T (as in Fig. 5C). We can 839 likewise estimate conditional probability, given that the protein is initially in cluster C j 840 at temperature T, of transitioning to some other cluster C i during the inter-snapshot 841 time interval ∆t (which is typically 500,000 MC steps) Where Z T Cj →Ci is the total number of observed transitions between clusters C j and C i 843 at temperature T, and Z T Cj is the total number of snapshots at temperature T assigned 844 to cluster C j . We then convert this to a rate 845 k j→i (T ) = 1 ∆t log 1 1 − l =j P (C l |C j , T, ∆t) P (C i |C j , T, ∆t) l =j P (C l |C j , T, ∆t) (11) This assumes a continuoius-time Markov process, but in case the transition from C i to 846 C j is non-Markovian (e.g. if condition II is satisfied, but not condition I), then this 847 equation nonetheless provides a good approximation to the inverse mean-first passage 848 time to transition, so long as l =j P (C l |C j , T, ∆t) is small-i.e. transitions during a 849 single MC step are unlikely. For every pair of clusters C j and C i , we compute k j→i (T ) 850 at all temperatures at which transitions are observed, and the dependence on inverse 851 temperature is fit to the Arrhenius equation (eq. (4)). To obtain an error distribution 852 on the extrapolated unfolding rates due to finite sampling, we perform a bootstrap 853 analysis whereby, at each temperature, N trajectories (where N is the original total 854 number of trajectories at that temperature) from the original set are randomly sampled 855 with replacement. The log unfolding rates for these resampled trajectories are again fit 856 to the Arrhenius equation, and re-extrapolated to lower temperatures using the resulting 857 parameters. This process is repeated 1000 times. We note that in the main text ( Fig.   858 3), all unfolding temperatures are included in Arrhenius plots, but during folding rate 859 computation, for each transition we only perform Arrhenius analysis with temperatures 860 that show five or more transition events. Applying this threshold does not significantly 861 change results for protein G, but is a recommended practice in general to reduce noise. 862 Identifying nonnative states 863 We generate nonnative contact maps for each cluster by pooling all snapshots from 864 equilibrium simulations assigned to topological configurations within that cluster.

865
Nonnative contacts are defined as contacts whose Manhattan distance on the contact 866 map is at least 2 from any contact present in the equilibrated native contact map (so as 867 to exclude native contacts that have been slightly register-shifted), with the exception of 868 cluster (b)c where only native contacts are excluded. We then subdivide these contact 869 maps as in reference [17]-briefly, we identify the connected components of the adjacency 870 matrix between different snapshots' nonnative contact maps, where two maps are 871 defined as adjacent if their hamming distance is less than 10. For cluster (b)c, two 872 structurally-distinct classes of nonnative states were identified using this method, 873 whereas only one such class was identified for all other clusters. For each class of 874 nonnative states, we then compute averaged nonnative contact maps among all 875 snapshots assigned to that class (Fig. 5). We then identify nonnative substructures 876 among all nonnative contacts that are present at least 10% of the time within a given 877 nonnative class using the methods described in the subsection Substructure analysis 878 with values of C = 7, d c = 6.5 Angstroms, and h = 3. For each cluster C i , we then run 879 refolding simulations, initialized from snapshots assigned to C i with no more than 2 880 nonnative contact drawn from equilibrium simulations at T ≈ 1.14T m , and compute 881 the probability, as a function of MC step, that the protein forms at least one nonnative 882 substructure obtain from the nonnative contact maps generated as above for cluster C i 883 or the probability that the protein has transitioned to a subsequent cluster in the 884 folding pathway (Fig. 4B). 885 Computing predicted and observed folding rates 886 To compute the predicted folding rate k T i→j between clusters C i and C j at temperature 887 T , we incorporate equilibrium probabilities of occupying the respective clusters, as well 888 as the unfolding rate k T j→i (computed and extrapolated as per the previous section) into 889 equation (2)). We obtain 890 k T i→j = l 1 S l ∈Cj P T eq (S l ) l 1 S l ∈Ci P T eq (S l ) k j→i (12) Where the sum is over all observed coarse states, 1 S l ∈Cj has value 1 if coarse state S l is 891 assigned to cluster C j and 0 otherwise, and P T eq (S l ) is the equilibrium probability of 892 occupying coarse state S l computed as per the subsection "Computation of 893 thermodynamic properties from equilibrium simulations" (and likewise for cluster C i ). 894 As discussed previously, this rate corresponds to an inverse mean-first passage time for 895 transitions that do not show Markovian behavior. We obtain an error distribution on

901
We note that error bars appear symmetric on a log scale because the boostrap analysis 902 is performed entirely in log space. We further note that this error does not account for 903 uncertainty in the PMF calculation.

904
In order to directly compute refolding rates, we run a set of refolding simulations in 905 which protein G is initialized from snapshots assigned the fully unfolded configuration ∅ 906 at high temperatures (under the restriction that no more than two nonnative contacts 907 are initially present). At each of five physiolgoically-reasonable temperatures, 120 908 refolding trajectories are run, starting from snapshots drawn randomly (with 909 replacement) from among which satisfy the required criteria. These trajectories are run 910 for 200 million MC steps, at which point snapshots that successfully transitioned into 911 cluster (b)c are used to initialize 120 new simulations. This process is repeated serially 912 for each step in the two pathways shown in Fig. 5. To compute observed folding rates 913 from these forward folding simulations, we use the HMM method described in the 914 subsection Computing and extrapolating unfolding rates to reduce the frequency with 915 which simulation snapshots are misassigned to the wrong topological configuration. We 916 then assign each snapshot to whichever cluster contains the topological configuration to 917 which that snapshot was assigned (where clusters are determined from the unfolding 918 simulations). Finally, equations (10) and (11) are used to compute the observed folding 919 rate from cluster C i to C j at each temperature at which folding simulations are run. As 920 with the predicted folding rates, a bootstrap analysis, in which N refolding trajectories 921 at each temperature are resampled 1000 times with replacement, is used to obtain an 922 error distribution on these observed refolding rates. Supporting figures S1-S10 933