Functional and Early Folding Residues are separated in proteins to increase evolvability and robustness

The three-dimensional structure of proteins captures evolutionary ancestry and serves as a starting point to understand the origin of diseases. Proteins adopt their structure autonomously by the process of protein folding. Over the last decades, the folding process of several proteins has been studied with temporal and spatial resolution, which allowed the identification of so-called Early Folding Residues (EFR). These structurally relevant residues become affected early in the folding process; they initiate the formation of secondary structure elements and guide their assembly. Using a dataset of 30 proteins and 3,377 residues provided by the Start2Fold database, discriminative features of EFR were identified by a systematic characterization. To this end, proteins were represented as graphs in order to analyze topological descriptors of EFR. They constitute crucial connectors of protein regions which are distant at sequence level. In particular, these residues exhibit a high number of non-covalent contacts such as hydrogen bonds and hydrophobic interactions. This tendency also manifests as energetically stable local regions in a knowledge-based potential. These features are not only characteristic of EFR but also differ significantly from those of functional residues. This unveils a split between structurally and functionally relevant residues in proteins which can drastically improve their evolvability and robustness. The characteristics of EFR cannot be attributed to trivial features such as the accessible surface area. Thus, the presented features are novel descriptors for EFR of the folding process. Potentially, these features can be used to design classifiers to predict EFR from structure or to implement structure quality assessment programs.
The shown division of labor between functional residues and EFR has implications for the prediction of mutation effects as well as protein design and can provide insights into the evolution of proteins. Finally, EFR can further the understanding of the protein folding process due to their pivotal role.

Author summary

Proteins are chains of amino acids which adopt a three-dimensional structure and are then able to catalyze chemical reactions or propagate signals in organisms. Without external influence, most proteins fold into their correct structure, and a small number of Early Folding Residues (EFR) have been shown to become affected at the very start of the process. We demonstrated that these residues are located in energetically stable local conformations. EFR are in contact with many other residues of a protein and act as hubs between sequentially distant regions of a protein. These distinct characteristics can give insights into what causes certain residues to initiate and guide the folding process. Furthermore, they can help our understanding of diseases such as Alzheimer's or amyotrophic lateral sclerosis which are the result of protein folding gone wrong. We further found that the structurally relevant EFR are almost exclusively non-functional. Proteins separate structure and function, which increases evolvability and robustness and gives guidance for the artificial design of proteins.

Most proteins adopt their three-dimensional conformation autonomously during the process of protein folding [1,2], which is strongly connected to protein design as well as the quality assessment of structures and in silico models [1,3]. Various diseases are caused by misfolding or aggregation of proteins [4][5][6][7]. During the protein folding process, the denatured chain of amino acids passes an energetic barrier, called the transition state, to form a compact and functional native structure [2].

How proteins fold is an open question [1]. There is a lack of experimental data describing which events or residues guide the folding process [8][9][10]. The protein sequence represents the starting point and the three-dimensional structure captures the result of the protein folding process for a wide range of proteins, yet how they connect via the transition state is unclear. The unstable nature of the transition state hinders its experimental determination [11,12]. Another hindrance for the understanding of the sequence-structure relation is that some proteins depend on chaperones to fold correctly [7].

The defined pathways model

Whether general folding patterns exist [13] and whether folding is stochastic or deterministic [14] remains to be answered; even within some protein families the process differs [15]. The defined pathways model proposes that small fragments fold first and then guide a step-wise assembly of further parts of the protein until the native structure is formed [14,16,17]. The process is believed to be deterministic, and fragments folding first do so autonomously from other parts of the protein: no other region directly supports or hinders their formation [14,17]. Which parts of the protein initiate the formation of local, ordered structures, e.g. secondary structure elements, is encoded in their sequence [18][19][20][21][22][23].
Consequently, these regions decrease in energy as well as entropy and stabilize the protein during the folding process [23,24]. This also supports the observation that proteins fold cotranslationally as they are being synthesized by a ribosome, while stabilizing long-range contacts cannot be formed yet [25]. These local structures form long-range contacts and assemble the global structure [14,18,22,26,27]. The formation of a native structure causes a further decrease in free energy [3,17,28]. Long-range contacts are especially important for the stability of the hydrophobic core of the native structure [29].

Identifying Early Folding Residues during protein folding

In recent years, various experimental strategies [30][31][32][33] were established which can identify residues crucial for the folding process. Probably the most elegant approach to track the protein folding process with spatial and temporal resolution is pulse labeling hydrogen-deuterium exchange (HDX) [14,29,34-39]. The state of a protein can be controlled e.g. by denaturants or temperature [35]. Starting from a denatured protein, folding conditions are gradually established until the protein has refolded completely. The resulting folding trajectory can be studied by HDX. Depending on the state of the folding process, individual amino acids will be susceptible to or protected from an exchange of the hydrogen atom of their amide group. Residues become protected when their amide group is isolated from the solvent by other residues surrounding them. When the folding process affects a residue, its spatial neighborhood is altered. Where and when these exchanges occur is tracked by downstream mass spectrometry or nuclear magnetic resonance spectroscopy. Residues which are protected from the exchange at the earliest stages [14,37-39] are called Early Folding Residues (EFR).
Residues which become protected at later stages or not at all are referred to as Late Folding Residues (LFR). EFR were shown to initiate the folding process and the formation of secondary structure elements [39] or even larger autonomously folding units [14]. They tend to be conserved but non-functional residues [40]. In contrast, LFR may be relevant during later stages of the folding process, implement protein function, or be mere spacers between protein regions.

The data obtained by HDX experiments are still difficult to interpret [41], and results of other experiments or techniques are tedious to compare [29,39]. The Start2Fold database [39] provides an invaluable, standardized annotation of EFR, i.e. residues which became protected early [29]. In a previous study [38], EFR have been shown to exhibit lower disorder scores and higher backbone rigidity. Regions with relatively high backbone rigidity are likely to constitute ordered secondary structure elements, and this tendency is manifested in local sequence fragments [19,20,38,39,42]. In particular, aromatic and hydrophobic amino acids were linked to ordered regions of proteins [38]. Subsequently, it was shown that EFR are likely buried according to their relative accessible surface area (RASA), and it was proposed that they are also the residues which form the greatest number of contacts in a structure [39]. EFoldMine [10] is a classifier that predicts EFR from sequence. Due to the nature of the trained models [10,38], it is still unclear what the relation between sequence and structure is and whether EFR cause their surroundings to fold first or vice versa [23].

Representing proteins by Energy Profiling and graphs

A protein's native structure exhibits minimal free energy [14]. Thus, knowledge-based potentials are a potent tool to describe the process of protein folding [28] and have been previously employed for the quality assessment of protein structures [3].
In an approach called Energy Profiling, the surroundings of each residue are expressed as an energy value. Low energy values occur for hydrophobic amino acids which are stabilized by many contacts. Thus, this approach provides a valuable feature to assess the stability of individual residues as well as their interactions with their spatial neighborhood.

Individual residues can also be characterized in the context of protein structures by topological features derived from network analysis. Protein structures are represented as graphs: amino acids constitute the nodes and contacts between residues are represented as edges [12,43-49]. There is a plethora of contact definitions, and most are based on distance cutoffs between certain atoms of amino acids [50]. Graph representations of proteins were previously employed to describe residue flexibility [51] as well as residue fluctuation [43], protein folding [12,46], structural motifs [52], and evolvability [49]. Furthermore, protein graphs were shown to exhibit the character of small-world networks [12,43-46], whereby a small number of residues has high connectivity and the average path length in the graph is small. Hydrophilic and aromatic amino acids were found to be crucial connectors in the graph, so-called hubs, which underlines their importance in the context of protein folding [53].

Graph representations of proteins also allow assessing whether proteins feature a modular design [54,55]. Reinvention is rarely observed in nature and, whenever possible, existing, established, and safe strategies are reused [56]. This can explain why the conceivable sequence and structure space is explored so little: by evolving established sequences, misfolding sequences or those prone to aggregation are avoided [56,57].
This behavior is likely the result of a separation of residues relevant for folding and those relevant for protein function [40] such as ligand binding sites or active sites. Functional residues were also shown to exhibit distinct topological features [45]. A division of labor between fold and function increases robustness and evolvability of protein sequences [40,49,54,55,58] because functional residues can be changed without any

The Start2Fold database [39] constitutes a dataset of EFR [10,14,17,23]. Previous

It is unknown which sequence features cause particular residues to fold early and how these residues contribute to the formation of the native structure (Fig 1A). EFR are strongly connected to the defined pathways model and provide an opportunity to understand the driving forces behind the assembly of stabilizing local structures as well as the formation of tertiary contacts [14,23].

[2]. (B) Protein structures are represented as graphs to derive topological descriptors of residues. Amino acids constitute nodes, whereas residue contacts are represented as edges. EFR are structurally relevant residues which participate early in the folding process by forming local contacts to other residues. They are separated from functional residues, which are primarily ligand binding sites and active sites as derived from UniProt [59]. EFR show a great number of long-range contacts which furnish the spatial arrangement of protein parts that are far apart on sequence level.
In this study, several novel structural features are employed for the characterization of EFR. In particular, the Energy Profiling approach, topological descriptors of protein graphs, and the explicit consideration of non-covalent contact types provide a new level of information to describe the folding process. EFR exhibit lower, more stable energy values in their Energy Profile [3,28]. A network analysis reveals that EFR are more connected to other residues and that they are located at crucial positions in the protein graph (Fig 1B). This distinct wiring to the rest of the protein is furnished especially by hydrophobic interactions. EFR are likely structurally relevant for the correct protein fold [10]. This information is used to demonstrate that proteins separate structurally relevant residues from functional residues (Fig 1B).

A previously described dataset [23] of 30 proteins and 3,377 residues is the basis of this study and is summarized in S1 Table. Of these residues, 482 (14.3%) are labeled as EFR; the remaining residues are considered LFR.

To characterize EFR in more detail, various features were defined and compared to the values of LFR. EFR form a significantly greater number of contacts than their LFR counterparts (Fig 2A). The loop fraction is defined as the fraction of unordered secondary structure elements in a window centered on a particular residue [60]. Fewer unordered secondary structure elements can be found around EFR (Fig 2B), whereas LFR exhibit a higher propensity to occur in coil regions. EFR are on average closer to the centroid of a protein structure and are likely embedded in the hydrophobic core (Fig 2C). Analogously, they also tend to be more distant from the N- or C-terminus of the sequence than other residues and are likely buried regarding their RASA as per S2 Table.

Fig 2. General properties discriminative between EFR and LFR. (A) EFR form more contacts to their surroundings than LFR. (B) The loop fraction [60] is the fraction of unordered secondary structure elements which are observed in a window of nine amino acids around a residue. EFR are more commonly surrounded by ordered secondary structure elements. (C) EFR are located significantly closer to the centroid of the protein than LFR.
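The loop fraction described above can be illustrated with a minimal Python sketch. It assumes that the secondary structure is available as a string of DSSP one-letter codes, that the codes T, S, "-", and blank count as unordered, and that the window of nine residues is simply truncated at the sequence termini. The function name loop_fraction and these three choices are illustrative assumptions, not the implementation used in this study.

```python
# Assumption: these DSSP codes (turn, bend, and unassigned) count as unordered.
UNORDERED = {"T", "S", "-", " "}


def loop_fraction(secondary_structure, index, window=9):
    """Fraction of unordered positions in a window centered on `index`.

    `secondary_structure` is a string with one DSSP code per residue.
    Assumption: near the termini the window is truncated, not padded.
    """
    half = window // 2
    start = max(0, index - half)
    end = min(len(secondary_structure), index + half + 1)
    segment = secondary_structure[start:end]
    unordered = sum(1 for code in segment if code in UNORDERED)
    return unordered / len(segment)


# Toy example: a nine-residue helix followed by four coil positions.
toy = "HHHHHHHHH----"
helix_center = loop_fraction(toy, 4)    # window lies entirely in the helix
coil_end = loop_fraction(toy, 12)       # truncated window at the C-terminus
```

Under these assumptions, a residue in the middle of a long helix receives a loop fraction of zero, whereas residues in coil regions approach one.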
The propensity of EFR to participate in more contacts and to occur in the core of a protein is in agreement with previous studies [14,23,38,46]. The shift in loop fraction can also be attributed to these findings and is further substantiated by the fact that long ordered secondary structure elements tend to contain more EFR [23]. It has been reported that buried residues are more likely to be EFR [23,29], which also explains why they are closer to the spatial centroid of a protein and more separated from sequence termini (S2 Table). Yet, all these factors cannot explain why some residues become EFR and others do not.

remarkable. This trend also manifests in sequence; thus, energy values predicted from sequence using the eGOR method [28] are also lower for these residues (see S2 Table). Regarding the average absolute contact frequencies, an EFR participates in 3.87 hydrogen bonds and forms 1.30 hydrophobic interactions with other residues. This constitutes a significant increase compared to LFR (see S2 Table).

[3,28] was used to characterize the surrounding of each residue. Hydrophobic and aromatic amino acids have a high tendency to be located in the buried core of a protein. Hydrophilic and polar amino acids prefer to be exposed to the polar solvent. This tendency is reflected by low and high average energy values, respectively. The distribution of energy values of EFR always exhibits a lower median than that of LFR. Significance in change is indicated by asterisks (*). EFR observations of serine and threonine exhibit relatively low energy values. The side chains of both amino acids can form hydrogen bonds. The decrease in energy is insignificant for aspartic acid and isoleucine. No annotation of EFR is available for proline.

EFR exhibit significantly lower values in computed Energy Profiles as well as in those predicted from sequence. This indicates that they occur in parts of proteins which are more stable and contain an increased number of hydrophobic amino acids in their spatial surroundings. In particular, amino acids such as serine or threonine, which can form hydrogen bonds via their side chains, feature relatively low energy values even though they have an overall tendency to be exposed to the solvent due to their hydrophilic nature. The energy contribution of hydrogen bonds has been shown to be context-specific [61], but also crucial for the correct formation of protein structure [53]. Especially amino acids capable of forming side chain hydrogen bonds contribute to protein stability [1,61]. Hydrophilic and aromatic amino acids like arginine, histidine, and methionine are considered strong hubs in protein structures, which is substantiated by a significant change in computed energy values for EFR. Hydrophobic amino acids occur in the core of a protein and are stabilized by an increased number of hydrophobic interactions. Thus, they have an intrinsic propensity to form stable, low-energy conformations, which is also reflected by the computed energy values. EFR might be the mediators between the formation of local structure elements and their assembly in the context of the three-dimensional structure. Secondary structure elements such as helices interact e.g. by hydrophobic interactions [62]; however, it seems that single contacts are neither strong nor specific enough to guide their assembly [17,63,64]. A future,

The way residues interact with their spatial surrounding was assessed by network analysis based on protein graphs. Regarding the topological properties of residues derived from network analysis, EFR show a higher interconnectedness than LFR.
They exhibit higher betweenness (Fig 4A) and closeness (Fig 4B) values. High betweenness values are observed for well-connected nodes which are passed by many of the shortest paths in a graph. High closeness values occur for nodes which can be reached by relatively short paths from arbitrary nodes. The distinct neighborhood count expresses to how many sequentially separated regions of a protein a residue is connected. Again, a significant increase can be observed for EFR (Fig 4C). Residues are considered separated when they are more than five sequence positions apart. This threshold was also used to distinguish local contacts (i.e. less than six residues apart) and long-range contacts. Interestingly, the clustering coefficient features a significant decrease when EFR are considered. The clustering coefficient of a node is the number of edges between its adjacent nodes divided by the theoretical maximum of edges these nodes could form. However, EFR are biased to be in the core of the protein [39]; thus, it was assessed whether this change is also significant when only buried [65] residues are considered. The differences are insignificant in that case (see S2 Table).

implying that shortest paths in the graph tend to pass through these nodes more often. (B) They also exhibit higher closeness values because their average path length to other nodes is lower. (C) The distinct neighborhood count of a residue describes to how many separated regions it is connected. Residues are considered separated when their separation on sequence level is greater than five. EFR connect significantly more regions of a protein than LFR.
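Three of the topological descriptors discussed above can be sketched in plain Python. The sketch assumes an undirected, unweighted protein graph stored as a dictionary that maps each residue index to the set of residue indices it is in contact with. The function names are hypothetical. The grouping rule used for the distinct neighborhood count (neighbors sorted by sequence position, with a new region starting whenever the gap to the previous neighbor exceeds five) is one possible reading of the definition and should be treated as an assumption.

```python
from collections import deque


def closeness(adj, node):
    """Inverse of the average shortest-path length from `node` to all
    nodes reachable from it (breadth-first search, unweighted edges)."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbor in adj[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


def clustering_coefficient(adj, node):
    """Edges among the n_k neighbors of `node`, divided by the
    maximum possible number 0.5 * n_k * (n_k - 1)."""
    neighbors = list(adj[node])
    n_k = len(neighbors)
    if n_k < 2:
        return 0.0
    edges = sum(
        1
        for i in range(n_k)
        for j in range(i + 1, n_k)
        if neighbors[j] in adj[neighbors[i]]
    )
    return edges / (0.5 * n_k * (n_k - 1))


def distinct_neighborhood_count(adj, node, max_local_separation=5):
    """Number of sequentially separated regions among the neighbors.

    Assumption: consecutive neighbors (sorted by sequence index) belong
    to different regions when they are more than five positions apart.
    """
    count = 0
    previous = None
    for position in sorted(adj[node]):
        if previous is None or position - previous > max_local_separation:
            count += 1
        previous = position
    return count


# Toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
hub_clustering = clustering_coefficient(toy, 2)  # only 1 of 3 neighbor pairs is linked
```

A hub that bridges otherwise unconnected regions has many neighbors but few edges among them, which is why a high distinct neighborhood count and a low clustering coefficient can occur together.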
In topological terms, EFR are more connected to the rest of the protein, as expressed by betweenness, closeness, and the distinct neighborhood count. The betweenness property is closely related to the small-world characteristics of networks and can be observed in this case due to the ratio of protein surface and volume [46]. Residues relevant for the folding process have been shown to exhibit high betweenness values in the transition state and to be crucial for the formation of the folding nucleus [46]. Interestingly, the clustering coefficient shows no difference between EFR and LFR when only buried residues are considered. Also, the value is higher for LFR, which is probably an effect of EFR being hubs which connect several separated regions of a protein (as shown by the distinct neighborhood count). These regions themselves are not well-connected, which results in a lower clustering coefficient for EFR. The performed network analysis aids the understanding of the idiosyncratic properties of EFR in the context of the whole protein and is in agreement with previous studies [11,46,53]. EFR are hubs between sequentially distant protein regions, which underlines their importance for the correct assembly of the tertiary structure of a protein. The distinction between local and long-range contacts provides new insights into the structural relation of residues with their respective neighborhood. Nevertheless, the increased number of local and long-range contacts of EFR points to their importance for the whole protein folding process as described by the defined pathways model [14,17]. The existence of disordered proteins [7,38], chaperones [7], cotranslational folding [25], and the peculiarities of membrane proteins [62] conceal important properties, and EFR may be a welcome simplification to advance the understanding of the protein folding process.

Early Folding Residues are non-functional residues

Division of labor is one of the most successful strategies of evolution [40,54,55,66-69]. The separation of residues crucial for folding and those furnishing function may allow reuse of established protein folds [32,40,54-56,58]. The sequence and structure space ascertained over the course of evolution seems small for a truly random exploration.

Reusing established folds could also avoid slow-folding sequences or those prone to aggregation [31,56,70,71]. There seems to be a delicate balance in proteins between robustness and evolvability [55,58]. Thus, functional residues [72] can be mutated and new functions can be adopted without compromising the fold of the protein [32]. In consequence, a clear division should be observable between EFR, which initiate and guide the folding process, and the functional residues implementing protein function.

To address this question, residues in the dataset were labeled as either EFR or LFR as well as being either functional or non-functional. Active sites and ligand binding regions were considered to be the functional parts of proteins. The distribution of both binary variables (Table 1) shows that the majority of residues in the dataset are neither EFR (87.2%) nor functional (95.4%) residues. Only 0.5% share both classes, resulting in a Cramér's V of 0.01. The distribution of both variables separated by individual proteins is presented in S1 Table. For most proteins, no residues are both EFR and functional (Fig 5A). Furthermore, EFR tend to be located in the core of proteins, whereas functional residues are exposed towards the solvent in order to realize their respective function (Fig 5). Acyl-coenzyme A binding protein (STF0001) [33,73,74] features five residues which are both EFR and functional (Fig 5B).

Out of 2807 observations, 0.5% are EFR and functional at the same time. Cramér's V amounts to 0.01; this minimal association between both categories implies that EFR are not functional and vice versa.

For the majority of the dataset, a clear separation of EFR and functional residues can be observed. The acyl-coenzyme A binding protein may exhibit five residues which are both EFR and functional because it is a rather small protein of 86 residues which binds ligands with large aliphatic regions. Intuitively, the residues furnishing the bowl-like shape of the protein are also those which participate in the function of ligand binding [33,73,74]. For acyl-coenzyme A binding protein, roughly half of its residues are marked as EFR, which further accentuates why the division of labor is less strict in this case. Another case in which nature avoids limitations imposed by a defined structural fold can be found in aminoacyl tRNA synthetases [68,69,75].
Ancient enzymes may have existed as functional molten globules [76,77] in their earliest implementations [68,69] in order to not restrict evolution prematurely by ensuring integrity of the protein's fold [78]. Disordered proteins are another example of proteins without structural integrity which achieve a high robustness of function [49]. In structural biology, structure is commonly considered to be equal to function [49,79].

Ultimately, however, it is most important that proteins are functional [79,80]. This potential unimportance of a particular fold underlines that structurally and functionally relevant residues are detached entities in proteins and that their separation is advantageous for evolvability. Another interpretation with respect to the defined pathways model [14] is that EFR initiate and guide the folding process. By assigning this responsibility to a small number of residues, the remaining residues are available to carry other responsibilities such as constituting active sites.
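The association between the two binary labels (EFR versus LFR, functional versus non-functional) is quantified above by Cramér's V. The following Python sketch shows how this statistic follows from the chi-squared statistic of a contingency table. The function name cramers_v is hypothetical and the counts in the example are invented for illustration; they are not the counts of Table 1. The sketch also assumes that every row and column total is non-zero.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows.

    V = sqrt(chi2 / (n * (min(rows, columns) - 1))), where chi2 is the
    Pearson chi-squared statistic without continuity correction.
    Assumption: all row and column totals are greater than zero.
    """
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * column_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    degrees = min(len(row_totals), len(column_totals)) - 1
    return math.sqrt(chi2 / (n * degrees))


# Hypothetical counts, rows: EFR / LFR, columns: functional / non-functional.
illustrative_table = [[5, 95], [45, 855]]
association = cramers_v(illustrative_table)  # proportions are identical in both rows
```

Values close to zero indicate that knowing whether a residue is an EFR says almost nothing about whether it is functional, whereas a value of one would indicate that one label fully determines the other.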

Early Folding and functional residues exhibit distinct features

The previously described features were employed to substantiate the identified separation of structure and function on residue level (S3 Table). EFR show significantly lower computed energy values when compared to LFR or functional residues (Fig 6A). Functional residues exhibit higher energy values than their non-functional counterparts. Most residues form only a small number of hydrophobic interactions; however, the number for EFR is significantly increased (Fig 6B)

the change between the hydrogen bond count of EFR and functional residues in a buried state is insignificant (S3 Table). The clustering coefficient of a node captures how many edges can be observed between the adjacent nodes and, thus, describes how well-connected the direct surroundings of a node are. Functional residues show an insignificant change regarding this property (S3 Table). In contrast, the clustering coefficient significantly decreases when EFR are compared to LFR or functional residues (Fig 6C). In summary, EFR exhibit distinct properties compared to functional residues. Their surrounding secondary structure elements, values in Energy Profiles, and the

Characteristics of EFR and functional residues. EFR and LFR are compared to functional and non-functional residues. (A) EFR show lower energy values as they are in contact with many residues and tend to be embedded in the hydrophobic core. In contrast, functional residues are exposed to the solvent in order to constitute e.g. binding sites. (B) Hydrophobic interactions occur especially in the core of a protein. Therefore, most residues do not form any. EFR, however, show a significant increase compared to LFR. (C) The clustering coefficient of a node describes how well-connected its adjacent nodes are. EFR connect regions of a protein which are separated on sequence level and, thus, not well-connected on their own. Functional residues exhibit higher values.
Due to their purpose, EFR are located in the hydrophobic core and functional residues are primarily exposed to the solvent. These distinct requirements manifest in the computed energy values. Furthermore, protein function can commonly be broken down to amino acids which feature hydrophilic, chemically functional groups [72]. Hydroxyl groups are prominent examples of functional groups contributing to catalysis [72].

process. It was shown that functional residues have special requirements on how they are wired to the rest of a protein [45]: surrounding residues ensure the correct placement of functional residues [45,82,83], modulate their chemical properties such as pKa values [45,72,84], or propagate signals to other parts of a protein [45].

Modularity in proteins is also present in domains [54], secondary structure elements, and autonomous folding units of the defined pathways model [17,27]. Detailed knowledge of EFR may improve synthetic biology and could allow the design of proteins which combine existing functional domains without these influencing one another negatively [2,54,55,85]. Furthermore, understanding the differences between structurally relevant residues and those implementing function could help in predicting mutation effects and provide a new level of detail by revealing whether a mutation disrupts the protein's fold or its function [86,87].

A dataset of EFR for the protein folding process was studied. They were found to be highly connected nodes in protein graphs and were observed to be located in energetically favorable conformations as pointed out by the approach of Energy Profiling [3,28]. These structurally relevant residues have distinct properties, e.g. regarding the number of hydrophobic interactions, compared to functional residues.

Future HDX data can substantiate the presented trends regarding the nature of EFR. Potentially, the arsenal of experimental techniques to study the folding process of proteins will expand and become more refined and standardized, so that the underlying dataset of studies like this one will become more robust. EFR are an excellent tool to gain insights into the folding process with spatial and temporal resolution. Future studies may link them to characteristics on sequence level to understand the sequence composition which causes particular regions of a protein to initiate the folding process. Features presented in this study were shown to be highly discriminative for EFR.

Insights into topological properties of residues can also improve structure quality assessment programs [3]. Classifiers for EFR based on sequence [23] or structure may annotate residues crucial for protein folding. Trained classifiers can also report and visualize the most discriminative features [88,89], which may further delineate EFR. This information is also invaluable for mutation studies, φ-value analysis, or protein design and can serve as a basis for the prediction of mutation effects [86]. Understanding the protein folding problem may also give insights into the cause of diseases such as amyotrophic lateral sclerosis [4,5], Alzheimer's [7], and Parkinson's disease [7]. The same is true for the observed division of structurally relevant and functional residues in proteins. Understanding these topological differences provides insights into the way they interact with the rest of the protein and to what degree they tolerate or compensate for manipulation. For decades, scientists have longed for a glimpse into the folding process [8][9][10], and the dataset of EFR [39] provides just that. It is surprising that not more studies focus on this resource.

Dataset creation

Folding characteristics of residues were obtained from the Start2Fold database [39]. Therein, the authors adopted the definition of EFR from Li et al. [29] and presented a refined dataset which ignores possible back-unfolding and aggregation events [90].

This procedure resulted in a dataset for EFR characteristics encompassing 30 proteins and 3,377 residues: 482 of the EFR class and 2,895 of the LFR class. Due to the nature of the HDX experiments, no data can be obtained for proline residues [37], rendering them LFR in any case. Annotation of functional residues was performed using the SIFTS [91] and UniProt [59] resources. For 23 proteins an annotation of binding sites or regions existed, totaling 2,807 residues: 130 classified as functional and 2,677 as non-functional. A detailed summary of the dataset is provided in S1 Table. Information used from the Start2Fold database can be found in S1 File. Residues  [47].

In this study, amino acids constitute the nodes of a graph, whereas covalent bonds and residue contacts are represented as edges. Residues were considered in contact when their Cβ atoms were less than 8 Å apart; if no Cβ atom was present, the Cα position was used as fallback. Furthermore, contacts were labeled as either local (i.e. the separation in sequence is less than six) or long-range (i.e. sequence separation greater than five) [92]. This distinguishes contacts stabilizing secondary structure elements from those which represent contacts between secondary structure elements. The set of distinct neighborhoods of a node is defined as all adjacent nodes which do not share any local edge to any element of the set. Betweenness is defined as the number of shortest paths on the graph passing through a specific node, normalized by the number of node pairs [46,93]. Closeness of a node is defined as the inverse of the average path length to any other node [45]. The clustering coefficient of a node is the number of edges between its n_k adjacent nodes divided by the maximal number of edges between n_k nodes: 0.5 · n_k · (n_k − 1) [46].

Feature computation

Energy Profiles were calculated from structure and predicted from sequence according to the methodology used in the eQuant web server [3,28]. Energy Profiles represent a protein's complex three-dimensional structure as a one-dimensional vector of energy values. Thereby, the surroundings of each residue are characterized by one energy value. RASA values were computed by the algorithm of Shrake and Rupley [94]. Buried residues are defined as those with RASA values less than 0.16 [65]. Non-covalent residue-residue contacts were detected by PLIP [95]. Secondary structure elements were annotated using DSSP [96].
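The graph definitions given above can be illustrated with a minimal Python sketch. It derives contact edges from one representative coordinate per residue (Cβ, or Cα as fallback), labels each contact as local or long-range, and evaluates the clustering coefficient of a node. The function names, the input format, and the toy coordinates are assumptions made for illustration only and do not reproduce the implementation used in this study; covalent edges are not modeled separately in this sketch.

```python
import math
from itertools import combinations

CONTACT_CUTOFF = 8.0   # distance threshold in Å, as stated in the text
LOCAL_SEPARATION = 6   # sequence separation below six counts as local


def build_contacts(coords):
    """Return {(i, j): label} for all residue pairs in contact.

    coords: one (x, y, z) tuple per residue in sequence order
    (hypothetical input format; Cβ position, or Cα as fallback).
    """
    contacts = {}
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) < CONTACT_CUTOFF:
            label = "local" if j - i < LOCAL_SEPARATION else "long-range"
            contacts[(i, j)] = label
    return contacts


def clustering_coefficient(node, contacts):
    """Edges among the n_k neighbors divided by 0.5 * n_k * (n_k - 1)."""
    neighbors = {j if i == node else i
                 for (i, j) in contacts if node in (i, j)}
    n_k = len(neighbors)
    if n_k < 2:
        return 0.0  # no neighbor pair exists, coefficient set to zero
    # Keys in contacts are ordered (i < j), so sort before pairing.
    links = sum(1 for a, b in combinations(sorted(neighbors), 2)
                if (a, b) in contacts)
    return links / (0.5 * n_k * (n_k - 1))


if __name__ == "__main__":
    # Toy chain: residues 0..6 are spaced 10 Å apart, residue 7 folds
    # back next to residue 0, forming a single long-range contact.
    toy = [(i * 10.0, 0.0, 0.0) for i in range(7)] + [(0.0, 5.0, 0.0)]
    print(build_contacts(toy))
```

Betweenness and closeness as defined above could be obtained analogously from the same edge set with a shortest-path routine or a graph library.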
The loop fraction is defined as the fraction of unordered secondary structure in a window of nine residues around the evaluated amino acid [60]. This yields a fraction where high values are tied to regions of high disorder, whereas amino acids embedded in α-helices or β-sheets result in scores close to zero. The centroid distance of a residue is the spatial distance of its centroid to that of all atoms. The terminus distance is the lower of the sequence separations to either terminus divided by the number of residues.

Supporting information

S1 Table. EFR dataset summary. Summarizes identifiers [23] of each entry as well as the number of residues in the corresponding protein chain, the number of EFR and functional residues as well as the cardinality of the intersection of both sets. Proteins not containing any functional residues according to UniProt [59] are marked with dashes.
S2 Table. Statistical characterization of EFR. For each presented feature the mean (µ) and standard deviation (σ) of both the EFR and LFR category are reported. p_buried refers to the p-value of the test restricted to residues that are buried according to their RASA value. This restriction was applied because EFR tend to be located in the core of a protein, and without filtering all differences are significant. Features and employed tests are described in the Methods section.
S3 Table. Comparison of EFR and functional residues. For each presented feature the distribution of values is compared between functional and non-functional residues as well as EFR and functional residues. The corresponding p-values and significance level are stated for buried residues. Mean values are shown for EFR (µ_early) and functional residues (µ_func). Features and employed tests are described in the Methods section.