BCL::Score—Knowledge Based Energy Potentials for Ranking Protein Models Represented by Idealized Secondary Structure Elements

The topology of most experimentally determined protein domains is defined by the relative arrangement of secondary structure elements, i.e. α-helices and β-strands, which make up 50–70% of the sequence. Pairing of β-strands defines the topology of β-sheets. The packing of side chains between α-helices and β-sheets defines the majority of the protein core. Often, limited experimental datasets restrain the position of secondary structure elements while lacking detail with respect to loop or side chain conformation. At the same time the regular structure and reduced flexibility of secondary structure elements make these interactions more predictable when compared to flexible loops and side chains. To determine the topology of the protein in such settings, we introduce a tailored knowledge-based energy function that evaluates arrangement of secondary structure elements only. Based on the amino acid Cβ atom coordinates within secondary structure elements, potentials for amino acid pair distance, amino acid environment, secondary structure element packing, β-strand pairing, loop length, radius of gyration, contact order and secondary structure prediction agreement are defined. Separate penalty functions exclude conformations with clashes between amino acids or secondary structure elements and loops that cannot be closed. Each individual term discriminates for native-like protein structures. The composite potential significantly enriches for native-like models in three different databases of 10,000–12,000 protein models in 80–94% of the cases. The corresponding application, “BCL::ScoreProtein,” is available at www.meilerlab.org.


Introduction
Many protein structures have been determined using experimental techniques such as X-ray crystallography and Nuclear Magnetic Resonance (NMR) spectroscopy. Of the approximately 69,000 protein structures deposited in the Protein Data Bank (PDB) as of August 2011, X-ray crystallography [1] contributed 88% and nuclear magnetic resonance (NMR) [2] contributed almost all of the remaining 12% [3]. Although the number of experimentally determined protein structures grows, challenges still exist. Membrane proteins are hard to express, crystallize and are usually too large to be studied by NMR [4]. Some proteins evade atomic detail structure determination in isolation and adopt their biologically relevant structure only in the context of a complete biomolecular assembly, e.g. a virus or macromolecular machine [5].
The biological importance of these proteins justifies large efforts to collect limited experimental datasets that describe their structure. Often these data restrain the topology of the protein, i.e. the relative placement of secondary structure elements (SSEs). For example, electron density maps of medium resolution (4-10 Å ) obtained by X-ray crystallography or cryo-Electron Microscopy (cryo-EM) [6][7][8][9] display the location of secondary structure elements but omit loop regions and side chains. Small-Angle Xray Scattering (SAXS) and Small-Angle Neutron Scattering (SANS) display the overall shape of the protein topology [5,10]. NMR spectroscopy of large and/or membrane proteins often yields distance and orientation restraints for atoms in the backbone of SSEs which are easier to label, assign, and interpret. Site-Directed Spin Labeling Electron Paramagnetic Resonance (SDSL-EPR) spectroscopy is applied to interrogate the relative positioning of SSEs relating the information from the tip of the non-natural and flexible spin label back onto the protein backbone [11,12]. Lastly, cross-linking experiments interpreted with mass spectrometry yield typically distance restraints that again focus on the relative position of SSEs [13]. To facilitate construction and evaluation of protein structural models from such limited datasets a tailored energy function that only evaluates the relative positioning of SSEs in topologies would be of great value. Ideally, this energy function should predict the free energy of all states an amino acid sequence can access, and the lowest free energy should be associated with the native structure [14]. In principle, the free energy of a protein structure and its native conformation can then be derived with sufficient sampling of the potential energy surface using molecular mechanics force fields (e.g. CHARMM [15] or AMBER [16]). This approach is often computationally prohibitive and sometimes suffers from inaccuracies in the potential energy function. It has been shown that these potentials do not always distinguish native-like from incorrect structures [17].
An alternative approach constructs scoring functions whose global energy minimum coincides with the native conformation for a database of experimentally determined protein structures of different sequence. Early versions of such knowledge-based or statistical potentials were based on contact frequencies [18] and likely exposure states of amino acid types [19]. Since then, a large variety of such potentials have been developed (for a review see [20]), and their applicability to fold recognition (threading) [19] and protein folding [21] was demonstrated. The underlying assumption that the knowledge based distribution of features is a BOLTZMANN-like distribution can be challenged, e.g. for amino acid pair distances [22]. This is particularly true in protein structure prediction, where the reference state is dependent on the type and density of sampling used [23].
Knowledge based energy functions employ probability theory, and in particular BAYES' theorem, to circumvent the assumption of a Boltzmann distribution [22]. Shen and Sali derive a Discrete Optimized Protein Energy (DOPE) from a sample of native structures based entirely on probability theory [20]. The potential achieves enrichments between 3 and 9 for the identification of native structures in a set of models [24]. Protein structure prediction with ROSETTA uses a low resolution knowledge-based scoring function consisting of an amino acid environment term defined by the burial of an amino acid and an amino acid pair interaction potential defined by all amino acid pair distances [21]. It further includes a secondary structure packing potential for ahelix packing and b-strand pairing in b-sheets. A dot product captures hydrogen bonding in b-strand pairing. This potential uses the loop length connecting two SSEs as an additional dependent variable [25].
The energy function developed herein works off the hypothesis that interactions between SSEs define the core of the protein structure and are the major contributor to the stability of the protein fold, at least for a large fraction of folded proteins. In turn, the majority of stabilizing interactions in the protein structure is present in SSE-only models. Further, it is hypothesized that these stabilizing interactions can be more accurately predicted as flexibility is reduced in the backbone of SSEs when compared to loop regions or amino acid side chains. The expected higher accuracy in placing the SSEs will result in a higher accuracy of the energetic evaluation. As a result, a smoothened energy landscape is expected that can be searched more readily as it is devoid of noise introduced by inaccurately placed loop regions and side chains. The advantages of reduced conformational search space and smoothened energy landscape pair nicely with the abovementioned settings with limited experimental data, since most experimental restraints relate to SSEs and can thus still be employed in protein folding. It is expected that models constructed and evaluated with this energy function can be readily completed through established protocols for the construction of loops and side chains. For example, loops can be modeled using fragment replacement, cyclic coordinate descent [26,27], or kinematic loop closure [28]. Side chains are added using dead end elimination or Monte Carlo sampling of rotamer libraries as implemented for example in SCWRL [29] or ROSETTA [30].
The present manuscript introduces a comprehensive knowledge-based energy potential for proteins which is based on a simplified representation of the protein including only SSEs, i.e. ahelices and b-strands. The hypothesis is that for the majority of well-structured domains the assembly of the SSEs in threedimensional space defines the domain topology, i.e. fold. Based on the amino acid C b atom coordinates within the SSEs (H a2 atom for Glycine) we define an amino acid pair potential, an amino acid environment potential, a secondary structure element packing potential, a b-strand pairing potential, a loop length potential, a radius of gyration potential, a contact order potential, and a secondary structure formation potential. Separate penalty functions forbid amino acid clashes, SSE clashes and loop distances that cannot be bridged. The overall energy potential is a linearly weighted consensus scoring function. These weights balance the individual terms to evaluate the native-likeliness of the SSE arrangement and the three dimensional placement of the amino acids in the context of the fold. While the scoring function is specialized to evaluate the loop-less protein topology as defined by the SSEs, it can be applied to full chain protein models as well.

Results and Discussion
Bayes' Theorem is Applied to Derive a Comprehensive Knowledge-based Potential In deriving the present knowledge-based potential we use BAYES' theorem to estimate the probability of a structure given the sequence. This strategy follows previously described approaches [21,25] in expanding this probability into a series of terms that desribe certain aspect of the protein structure. This strategy avoids the requirement of a BOLTZMANN-like distribution of states in the databank: where seq is the amino acid sequence and struct the protein's three dimensional structure. This approach separates the probability for a given sequence to fold into a certain structure into two terms. The probability of the structure, P struct ð Þ, describes the relative arrangement of SSEs in space independent of their sequence. The probability of the sequence given this SSE arrangement, P seqDstruct ð Þ , evaluates placement of specific amino acids into these SSEs. For the protein folding problem the probability of the sequence P seq ð Þ is a constant. The terms P struct ð Þand P seqDstruct ð Þwill each be expressed as a product of multiple contributing terms P i X ð Þ.

The Inverse Boltzmann Relation Converts Probabilities into an Approximation of Energy
The collected probabilities P i X ð Þ are converted into a free energy approximation using: Where E i X ð Þ is the energy function for X -being the feature observed, R -the gas constant, T -temperature, p i,observed X ð Þthe probability with which that feature was observed and p i,background X ð Þ -the probability to observe that feature by chance. The normalization with p i,background X ð Þ ensures that favorable states receive a negative energy, unfavorable states a positive energy. The energy unit RT is arbitrarily defined as 1 BCL energy unit (BCLEU).
The most direct approach computes the total energy as sum of all individual contributions. One disadvantage of this strategy is that double-counting of contributions through several energy terms is difficult to entirely prevent. Other features of protein folds i.e. side chain hydrogen bonding or backbone interactions of loop residues will be ignored as they are not or only incompletely captured by the geometric features observed. To account for part of these inaccuracies, each energy term is scaled by an individual weight. This weight will be optimized to distinguish native-like from non-native models for a database of proteins.
Another disadvantage of knowledge based potentials is the difficulty to assign an energy penalty to states not observed in protein structures. Typically small pseudo-counts are added which result in a positive energy. However, if a state is not observed at all, the energy assigned through a pseudo-count is arbitrary. To address this shortcoming, penalties for forbidden geometries are split into separate energy terms. Thereby the weight optimization procedure can assign a weight for these penalties independent from other contributions to the energy function.
While this approach is inherently imperfect it proved effective in the past. The resolution of protein models evaluated with the present energy function is too low to unambiguously distinguish native-like from non-native models based on energy alone. The objective of the energy function is to enrich for native-like topologies which can be done effectively in the presence of its inherent inaccuracies.

Ensure Continuous Differentiability of All Geometric Parameters and Energy Potentials
Traditionally some geometric parameters observed contain step functions. An example is the number of neighbors within a given distance cutoff which is often used as a measure of solvent exposure [25,31]. To avoid discontinuities at the cutoff, a continuously differentiable transition function is often introduced into the definition of a feature: In Figure  1A an example of trans { x 0 ,x 1 ,x ð Þtrans { 4,11:4,x ð Þ is shown, which is used to smooth the neighbor count (described below). The difference between trans { and trans z is that the first is a step-up, the latter is a step-down as a function of x. We demonstrated in the past that such a transition function allows for a neighbor count measure that is not only continuously differentiable but also more accurately approximates solvent accessible surface area [31].

Amino Acid Environment Potential
This energy potential captures the preference of an amino acid to be buried and engage in a hydrophobic interaction in the protein core or to be exposed and interact with the solvent.
In order to measure burial, a function that counts the neighbors of an amino acid was used ( Figure 1A): trans { r low ,r high ,r ij À Á Weighing the actual neighbor count between r low and r high smoothens the potential and enables gradient based minimizations. The thresholds have been optimized for a high correlation of the neighbor count value with the MSMS solvent accessible surface area (SASA) approximation implemented in the molecular visualization package VMD [32]. The lower threshold is set to 4.0 Å , the upper threshold to 11.4 Å [31]. A minimal sequence separation of three residues reduces the bias introduced by sequence proximity. This step is particularly necessary to accurately determine exposure at the end of SSEs. In SSE-only protein models amino acids at the end of SSEs would otherwise have an artificially low neighbor count. The background probability distribution is the normalized sum of all normalized amino acid exposure distributions. Neighbor count bins that were empty or had one raw count were assigned a constant repulsive energy value of 18 BCLEU ( Figure 1B).

P seqDstruct
ð Þis proportional to the amino acid pairs observed for a given distance.
In order to define the interactions, statistics for the C b -atom distance between pairs of amino acids aa i ,aa j À Á have been collected. For Glycine, the H a2 hydrogen position was used ( Figure 2A). Distances have been collected between 0 and 20 Å in bins of size 1 Å . Amino acid pairs have been considered if they had a sequence separation of at least 12 residues iz12vj ð Þin order to reduce the bias introduced by sequence proximity. For each bin the energy was approximated using the inverse BOLTZMANN relation. The expected background probability is estimated through the frequency of seeing aa i or aa j with any other amino acid at distance r ij . Distance bins that had fewer than five raw counts were assigned a constant repulsive energy value of 18 BCLEU (Figure 2). Note that a separate penalty will forbid very close distances not observed in protein structures. These would result in clashes of side chain atoms if implicitly present.

Loop Length Potential
SSEs are connected by loop or coil regions whose coordinates are not explicitly considered in the present approach to score protein folds. However, there are preferences for loops of a certain length d s to bridge a certain EUCLIDEAN distance d E ( Figure 3A). This is a sequence-independent score contributing to P struct ð Þ. Note that the requirement that two SSEs can be physically linked with a fully extended loop is controlled by a separate loop closure penalty (read below).
Sequence distance between last residue of SSE i and first residue of SSE j d E SSE i ,SSE j À Á EUCLIDEAN distance between end of main axis of last fragment of SSE i and beginning of main axis first fragment of SSE j .
The background probability is set to P d E SSE i ,SSE j À Á À Á %d 2 E ( Figure 3B). For short sequence distances it is favorable that the EUCLIDEAN distance is short. Long EUCLIDEAN distances are forbidden by a constantly increasing positive energy which is a result of the pseudo count divided by the square of the EUCLIDEAN distance. EUCLIDEAN distances below 4 Å are generally possible but are only preferred for loops of length 0 and 1 which occur in the database for bent and kinked SSEs. There is a nearly linear dependency between the sequence separation and the EUCLIDEAN distance for up to 7 residues in the loop. The maximally possible EUCLIDEAN distance increases linearly to a distance of approximately 32 Å at 10 residues. EUCLIDEAN distances longer than 32 Å are rarely observed in this database of globular proteins. As loops  get longer, the range of EUCLIDEAN distance they bridge becomes wider.

b-Strand Pairing Potential
This potential evaluates the pairing of two b-strand SSEs to form a b-sheet contact.
To compute pair SSE i ,SSE j À Á both strands are decomposed into overlapping fragments of three amino acids ( Figure 4E). A bsheet contact then is defined as a series of l pairs of aligned fragments. The distance d and torsion angle H between each pair of fragments is evaluated ( Figure 4A). Further, a weight w bb{pair is used to distinguish a planar arrangement of two b-strands (bstrand pairing) from an opposing arrangement (b-sheet packing, Figure 4D, for details see Methods). l is limited to the number of fragments in the shorter SSE.: torsion angle at shortest, orthogonal distance in fragment pair k w k,bb{pair weight that decreases as the arrangement deviates from planar b-strand pairing The potential represents the likelihood of observing a given distance between the center of two b-strand fragments and a given twist of two b-strand fragments ( Figure 5A) with respect to each other. Note that the potential omits explicit evaluation of backbone hydrogen bonds to keep the energy landscape smooth. The background probability is assumed to be proportional to d since the chance to find a second b-strand by chance in a parallel arrangements grows approximately linearly with the distance of the object, similar to the girth of a circle.

Secondary Structure Element Packing Potential
While b-strand pairing is defined by backbone hydrogen bonds, SSE packing is driven through side chain interaction. As a result, distance and torsion angles are less tightly controlled which is why we treat both potentials separately. Aside from this separation, SSE packing potentials have been derived in a fashion similar to the b-strand pairing potential.
To compute pack SSE i ,SSE j À Á both SSEs are decomposed into overlapping fragments of three amino acids (b-strands) and five amino acids (a-helices, Figure 4E). A contact then is defined as a series of l pairs of aligned fragments. The distance d and torsion angle H between each pair of fragments is evaluated ( Figure 4, Figure 5B-D).
torsion angle at shortest, orthogonal distance in fragment pair k w k,pack weight that decreases if b-sheets in the packing interact via their edge The term w k,pack is dependent on the types of SSEs in the packing. For the helix-helix interaction, w aa{pack~1 . For helixstrand interactions, w ab{pack decreases from 1 if the face of the bstrand points away from the a-helix. For b-sheet packing, w bb{pack~1 decreases from 1 if the b-strands don't face each other ( Figure 4D, details in Methods). The background probability is assumed to be proportional to d. The resulting potentials plot energy with respect to distance and twist angle.

Contact Order Score
Using the assembly of SSEs to describe the topology of a protein enables the optimization protocol to sample topologies with many non-local contacts. One measure for the complexity of the topology is the contact order. Contact order CO is defined as the average sequence separation of all amino acids in contact, conventionally identified by the closest heavy atom distance between two amino acids, = 8 Å [33]. In this score, the C b -C b distance is used. A larger contact order constitutes a more complex topology. The contact order score is added to restrain the models constructed to a likely contact order range. To ensure comparability we normalize the square of the contact order with the sequence length to compute NCO~CO 2 =length. For native proteins, NCO is largely independent of sequence length being in the range of 0.25 to 0.60 ( Figure S1). An energy term ( Figure 6A) was added based on the hypothesis:

Radius of Gyration Potential
The square of the radius of gyration is proportional to en energy term that describes the compactness of the fold [21]. It is computed as the mean square distance of all C b atom coordinates (H a2 for Glycine) to their mean position: The term e {R 2 gyr can directly be used to estimate P struct ð Þ if sequence length is constant [34]. To enable the energy function to compare proteins of variable length e.g. during the assembly from SSEs, we introduce a normalized radius of gyration NR gyr~R 2 gyr =length. For native proteins, NR gyr is largely independent of sequence length being in the range of 0.8 to 2.0 ( Figure S2). An energy term ( Figure 6B) was added based on the hypothesis: Extended a-helical coil-coiled structures as well as protomers that form obligate oligomers were removed prior to obtaining this statistic. . SSE Fragment packing. SSE fragments are shown with their geometric packing descriptors. A a 1 and a 2 are orthogonal, if the shortest connection between the main axes is orthogonal. B connection is not orthogonal, since the minimal interface length m cannot be achieved. C h is the twist angle around the shortest connection -which is equivalent to the dihedral angle between main axis 1 -shortest connection -main axis 2. D v is the offset from the optimal expected position for a helix-strand interaction, if it is 0u, the helix is on top of the strand, if it is 90u, the helix would interact with the backbone of the strand. v 1 and v 2 are the offsets for a strand-strand packing -for omegas close to 90u, it is a strand backbone pairing interaction dominated by hydrogen bond interaction within a sheet, if they are close to 0u, it is dominated by side chain interactions like seen in sheet-sandwiches. E every SSE is represented as multiple fragments and the SSE interaction is described by the list of all fragment interactions, leaving out additional fragments of the longer SSE with suboptimal packing (bottom grey helix fragment). doi:10.1371/journal.pone.0049242.g004

Secondary Structure Prediction Agreement
Given an amino acid sequence, JUFO [35] and PSIPRED [36] calculate probabilities for each amino acid to be part of an ahelical, b-strand or a coil SSE. Those prediction methods average a per-residue accuracy of up to 80%. This fact can be used to evaluate the per-residue assigned secondary structure for a given protein model.  SS i secondary structure of amino acid i in the structure Due to the inaccuracies in the secondary structure predictions, a mean probability and standard deviation for the probability for actual secondary structures are derived, and the error function of the standard score is defined as the potential used: p SS,i probability of the assigned secondary structure in the model m SS mean probability for accurately predicted secondary structure s SS standard deviation for accurately predicted secondary structure The use of the standard score makes it possible to use different secondary structure prediction methods, of different sensitivity and dynamic range of probabilities. The error function projects the standard score in a less sensitive range if probabilities strongly disagree with the average. Parameters have been derived for JUFO and PSIPRED (Table 1).

Amino Acid Clash, SSE Clash and Loop Closure Constraint
A difficulty with knowledge based potentials is that a BOLTZMANN-like distribution is assumed for the dataset used from which the potentials are derived. Although all potentials described above are based on probabilistic theory, they are ambiguous to geometries absent in native structures. Since no counts are observed for these geometries the associated energies would be infinitely high. However, while the energy will be elevated it will not be infinite. The precise penalty for such non-native features remains difficult to evaluate. Often one pseudo count for every observation is added (according to the rule of succession, ''LAPLACE rule'') giving all non-observed events an equally high penalty. The precise penalty for such non-native features is difficult to determine. To enable fine-tuning of the energy penalties in regions of non-observed events separate energy components are introduced. This procedure allows an independent choice of a weight changing the penalty amplitude in ''structurally forbidden'' regions. The procedure has a second advantage: vdW (van der Waals) repulsion is affiliated with steeply rising energies over a small change in distance. A separate potential allows for a finer binning of these penalty potentials when compared to the attractive counter-parts.
Amino acid pair clash. For the amino acid pair distance potentials, all occurring amino acid pair distances within protein structures have been calculated. They were binned with a resolution of 0.05 Å for each amino acid type pair. The first bin with counts.1, when iterating from shorter distances to larger distance, was determined to be the minimum permitted distance. Using this threshold, a ''penalty'' function is defined: Shortest allowed distance for amino acid type pair r ij Distance between amino acid pair This term is complementary to the amino acid pair distance potential. If the distance between two amino acids is below the allowed distance for this pair of amino acid types, a positive energy penalty is applied, with a maximum at 1 Å below the allowed distance. A matrix of minimal distances for all amino acids types is depicted in the Figure S3.
SSE clash. Although the amino acid clash potential suffices in ''detecting'' clashes of side chains in the packing of SSE, it does not penalize special cases of overlapping SSEs. An example for these kinds of topologies is when one b-strand is positioned on top of another b-strand but offset by one amino acid. C b atoms point in opposite directions avoiding any clash while backbone atoms are not explicitly modeled. To prevent such situations a clash term that is based on the packing SSE fragments was derived. From unoccupied bins in the SSE packing and pairing potentials ( Figure 5) minimal distances between two SSE fragments have been defined as a-helix/a-helix 4 Å , a-helix/b-strand 4 Å , bstrand/b-strand 3 Å : Minimal allowed distance for aligned fragment pair k and l of SSEs i and j d i,j,k,l Length of shortest connection between the two SSE fragments This term is complementary to the SSE packing and b-strand pairing potential. If the distance between two SSE fragments is smaller than m SSE i,k ,SSE j,l À Á , a positive energy is the result. The full positive energy is reached if the distance is 1 Å below the allowed distance for that pair of SSE types.
Loop closure constraint. In order to guarantee the possibility to close loops it proved necessary to add a steep penalty if the EUCLIDEAN distance between ends of SSEs becomes too long. In contrast to the loop length potential, the loop closure constraint only considers SSEs adjacent in sequence. The EUCLIDEAN distance between the terminal C atom and the starting N atom of the following SSEs d CN is evaluated. s d CN is generally shorter than d lim CN~2 :11Az2:56| loop length. This relation was obtained by selecting the EUCLIDEAN distance for a loop length, which is the 5 th percentile of the longest distances. For lengths between one and twenty amino acids in the databank, a linear regression was fitted ( Figure S4). We evaluate therefore Dd CN~dCN {d lim CN : This potential is complementary to the loop length potential. It forbids loops that cannot be closed because of too large EUCLIDEAN distances. Additionally, it measures the distance between the two atoms that are the anchors for the loop, while the loop length potential is using a more crude estimation for the ends of the SSEs using only the tips of the fragment main axes.

Protein Model Sets have been Generated Using Rosetta, a BCL Perturbation Protocol and a BCL Folding Protocol
In order to benchmark the performance of the knowledge-based energy potentials, 53 diverse proteins have been selected and structural models were generated computationally using three methods: (1) Using ROSETTA de novo protein structure prediction. (2) Removing loops from native structures and applying systematic perturbations to the structures. The sets of perturbations were chosen to generate models with preserved native-like topologies.
(3) Re-assembling the SSEs present in the native structures leading to protein models of various arrangements and topologies. Details on the protocols are described in the Methods section.
The rationale for usage of three separate sets of protein models was to maximize diversity in the models thereby maximizing generalizability of the scoring function. The identification of native-like structures was based on three measures: (1) RMSD100,8 Å (C a root mean square deviation normalized to a protein length of 100 residues, see [37]), (2) CR12.20% (contact recovery over 12 residues, see accompanying manuscript) and (3) GDT_TS.25% [38]. The percentage of native-like models varies between 0 and 99.5% for the protein model sets. Only model sets with percentage of native-like models between 1% and 99% have been used for the analysis in a ten-fold cross validation calculation of enrichments. The cross validation subsets have been generated by randomly removing models so that each subset contained 10% correctly folded models and 90% incorrect models.
Enrichment is a Good Measure to Evaluate the Performance of an Energy Potential Figure S5 shows a representative RMSD100-energy plot of a set of protein models that was prepared to contain 10% of native-like models below an 8 Å RMSD100 cutoff. The 8 Å cutoff is based on the observation that two protein models typically share the same topology below that measure. The horizontal line denotes the best 10% of the models with respect to the scoring function used. Models that are below the RMSD100 cutoff are positives P ð Þ, and if they are below the energy of the best 10% by energy, they are considered as true positives TP ð Þ. If the model has a high energy despite being correct by the RMSD100, it is considered a false positive FP ð Þ. FN -false negative and TN -true negative are defined similarly. The optimal result would be to have empty FN and FP quadrants, because this would indicate that the energy function would be completely accurate in identifying native-like models by RMSD100. The enrichment is now defined by the ratio of true positives within the 10% native-like models TPzFN ð Þ divided by the initial ratio of native-like models (defined by the RMSD100 cutoff) to the total number of models TPzFNzFPzTN ð Þ .

enrichment~T P TPzFP
Ã PzN P For the following benchmark PzN ð Þ=P~10 is set, limiting the maximal enrichment to 10. An enrichment of 1 corresponds to no improvement. Enrichment values smaller than 1 suggest that the score deselects native-like SSE arrangements. Table 2 contains enrichments for the 53 protein sets from three different methods each, and the various scores. Note that the number of proteins considered can be smaller than 53 if the number of native-like models was insufficient to confidently determine enrichment (read above). Statistical significance was established by computing the average enrichment over 10 crossvalidations, subtracting the expected mean of 1.0 (for a nondiscriminating potential), and dividing the result by the standard deviation of the enrichment over of the 10 cross validation sets (Zscore). The percent of model sets that are enriched by a statistically significant factor are reported (Z-score.1.0, Table 2). In comparison to the balanced performance of the consensus scoring with an optimized weight set (Table 3), individual components of the scoring function generally discriminate well against random models for the BCL folded and perturbed structures but do perform worse for ROSETTA folded models. This observation is attributed to the fact that ROSETTA folded models will generally score well in the BCL::Score energy function due to the similarity of the two scoring schemes. The amino acid pair distance, amino acid neighbor count and the SSE packing potentials achieve enrichments greater than 1.0 for nearly all the protein sets. The secondary structure prediction scores using PSIPRED secondary structure probabilities enrich ROSETTA and perturbation model sets, which have varying SSE content. BCL folded models cannot be discriminated, since the secondary structure is constant. The consensus scoring function enriches significantly (67% of ROSETTA, 77% of perturbation model sets for RMSD,8 Å ). No statistically significant improvement for BCL folded models is observed. We attribute this to the fact that these models were subject to BCL::Score energy evaluation during folding creating a circular dependence. Considering the performance with respect to GDT_TS.25%, for the three different models sets, 80%, 94% and 83% have a significant enriched model set for ROSETTA as well as BCL perturbed and folded model sets.

BCL::Score Potentials Recapitulate Expected Amino Acid Interaction Preferences
The scoring function was developed for protein models consisting of disconnected, idealized SSEs. The absence of atomic-detail in the SSE-only protein models inherently prevents Table 2. Enrichment of sets of protein models.  Table 2. Cont. unambiguous identification of the native conformation in a set of models. Nevertheless, the amino acid pair potential and the amino acid environment potential both select for native-like arrangements of amino acids. The environment potential follows the expected trend preferring around three neighbors for the negatively charged Glutamate residue but around eleven neighbors for the apolar Valine. For Glycine two minima are observedvery few and very many neighbors. This is somewhat counterintuitive as Glycine prefers exposed positions in loop regions. However, the potential P aa i De i ð Þ reflects the probability of encountering a specific amino acid type given a certain exposure value, rather than the most probable exposure for a given amino acid type. In densely packed positions with an extremely high number of neighbors only Glycine will fit giving it the high probability for such positions. Positions with neighbor counts above twelve are rare in folded proteins and should therefore be disfavored when predicting protein structures. However, this fact will be represented by P struct ð Þ and is correctly omitted in P seqDstruct ð Þ . Leucine and Isoleucine are expected to interact favorably in the pair potential due to vdW attraction, which is reflected by the negative energies for short distances ( Figure 2B). Arginine and Lysine with positively charged side chains are expected to experience COULOMB repulsion when approaching each other which is reflected by the positive energy for short C batom distances. Tryptophan pairs may engage in p-stacking interactions, which are evidenced by a preferred C b -atom distance around 4 Å (b-strand pairing) and 8 Å (SSE packing). Arginine and Lysine are both positively charged and repel each other at close proximity as reflected by the positive energies for C b -atom distances smaller 10 Å . These findings imply that for reduced SSE-only protein models a C b -atom side chain representation (H a2 for Glycine) is sufficient to estimate P seqDstruct ð Þ .

Secondary Structure Element Arrangement Determines the Domain Topology
The preferential arrangement of SSEs in a protein domain results from the sum of many atom-atom interactions. In the absence of atomic-detail in SSE-only protein models, BCL::Score knowledge-based potentials derived from P struct ð Þ discriminate native-like SSE arrangements. An optimal b-strand distance between 4.25 and 5.00 Å is observed. The optimal twist angle is around 215u (parallel b-strand contact) and 165u (anti-parallel bstrand contact). A twist angle of 165u is more pronounced as antiparallel b-strand contacts are slightly overrepresented in the database. Two a-helices pack in a preferred angle of 245u. The anti-parallel packing is slightly less common at around 135u. Further, weak minima around 15u and 2165u are observed. Both cases of packing have a preferred distance of 9-12 Å (Figure 5 Figure 5B). For a-helix-b-sheet packing, the anti-parallel case with angles between 150u and 180u is most common as seen in the TIM-barrel fold or other ''ROSSMAN-Folds'' [32] (Figure 5D). As for the a-helix-a-helix packing, the optimal distance is around 9-12 Å . b-sandwiches pack with a distance of 9-12 Å and twist angles of 230u or 150u ( Figure 5C). Twist angles lead in general to an improved packing as the interacting side chains can reach into gaps left by the side chains of the opposite SSE [33]. Ridges and grooves are formed on the surface of helices. These ridges are formed by residues usually separated by four in sequence. This model explains the predominant packing angle of around 50u. The maximum enrichment for any of the scores for any set of models is never above 5 (Table S1). We attribute this finding to two limitations of BCL::Score: Firstly, the protein models used are incomplete. Contributions of loop and coil regions to the overall energy are neglected resulting in inherent inaccuracies. Secondly, amino acids are represented by their C b -atom only. This procedure introduces additional inaccuracies in the energetic evaluation. As discussed in the introduction, these inaccuracies are taken into account to enable a more rapid sampling of domain topology specifically in a limited experimental data setting. Subsequently protein models can be completed and refined using higher accuracy all-atom energy functions. Nevertheless, BCL::Score knowledge-based potentials enrich a diverse set of decoys with enrichments up to 7 for individual proteins with respect to the weighted consensus score (Table S1, last column). This is a respectable achievement in particular when keeping in mind that some of the protein models are created using an energy function that necessarily covers some or even most aspects of the BCL::Score knowledge-based potential (model sets 1 and 3). Additionally, the other models start from experimental protein structures (model set 2). Accordingly the model sets contain many native-like features that are expected to score well with BCL::Score.

C b Atom is Sufficient to Approximate Side Chain Position
The amino acid pair potential and the amino acid environment potential are both successful in discriminating for native-like protein structures. This implies that a C b atom side chain representation (H a2 for Glycine) is sufficient not only for describing possible interactions with other amino acids as a pair potential but also as an environment potential. For different model sets from ''decoys'r'us'' [39], the rank of the native structure, using different energy potentials, was determined. Ranks for DFIRE through DOPE were copied from the ''DOPE'' publication [20]. For each different model set, the number of sets for which the native was ranked 1 st was counted and reported. In brackets ranks among the top 10 were counted as correct. doi:10.1371/journal.pone.0049242.t004 Enrichment was Achieved for a Diverse set of Protein Models Regardless of the Sampling Algorithm We tested BCL::Score potentials in conjunction with ROSETTAgenerated models (model set 1) to assess the general applicability of the scoring approach. ROSETTA models have a complete and defined backbone conformation. All BCL::Score potentials except for the loop length and contact order score can enrich ROSETTA models for native like conformations. It is expected that the loop length potential will not enrich ROSETTA models as they have a continuous amino acid chain. The loop length potential enriches BCL perturbed and folded structures with a discontinued amino acid chain. Due to the unrestrained sampling of the secondary structure elements, loops are violated and the potential is penalizing this arrangement. The contact order score prevents low and highly complex folds if several SSEs are swapped or not in close proximity. This is the case for BCL folded and perturbed structures, where the potential helps regardless of size and SSE composition, but unlikely in ROSETTA models which are biased towards lower contact orders. As expected, the b-strand pairing score contributes only for b-strand containing proteins. The radius of gyration score performs well for proteins,150 residues, but seems to degrade for larger proteins. It can be observed that for GDT_TS and RMSD100 classification, the percentage drops under 50% for the BCL perturbed structures. This is expected as this model set was created to preserve protein size and relative positioning of SSEs that is native-like but create non-native topologies. We observe the best consensus function discrimination for native like models for this model set. The weighted sum of individual terms performs comparably over all benchmark sets and shows that a linear combination can overcome some weaknesses of the individual terms.
BCL::Score Ranking and Enrichment Performance in Comparison to Other Energy Potentials Table 4 shows the rank of the native structure for different small decoy sets (,500 models) of the ''decoys'r'us'' protein model sets [39]. The ranks for the energy function in the comparison are extracted from experiments for the DOPE potential [20]. Although BCL::Score energy potentials were not designed for full atom protein models represented in the protein model sets, it can rank the native first for 52% of the sets. Of the six tested energy potentials success rates vary (24%, 40%, 48%, 52%, 84%, 84%) placing BCL::Score somewhere in the middle. Keeping in mind, that the other scoring functions leverage additional detail, some even atomic detail, this is a respectable performance. BCL::Score filters reliably models of unlikely overall topology but has difficulty ranking models with native-like topologies. This notion is reinforced when ranking the native structure among the top ten models is counted as success. The BCL::Score success rate increases to 72% compared to 48%, 56%, 60%, 68%, 84% and 84% seen for the other energy functions.  Table 5 summarizes enrichments for the ''moulder'' decoy set [20] for different energy potentials. BCL::Score is able to enrich all of the model sets by a factor between 2.44 and 4.57. This performance is clearly reduced when compared to full atom potentials which achieve enrichments between 2.67 and 8. This decoy set was created by threading different sequences on similar homology models resulting in protein models with native-like overall topologies. As a result, the strength of the BCL::Score functions in discriminating non-native topologies is not assessed in this experiment.

Conclusions
A knowledge-based scoring function is presented optimized for SSE-only models. It enriches native-like topologies in diverse sets of protein models. We expect this scoring to be beneficial for certain settings in de novo protein structure determination: (1) When folding large proteins with complex topology, where simultaneous sampling of SSE arrangements and loop conformations would create a size limit for de novo protein structure determination. The BCL::Score potential for SSE-only models allows sampling of SSE arrangement independent of and prior to the sampling of loop conformations. This approach has the potential to increase the size limit in de novo protein structure determination. (2) Limited experimental datasets often restrain the position of SSEs, for example density maps obtained form cryo-Electron Microscopy [40] or EPR distance restraints [41]. We expect that the present potential can be applied to assemble the topology of large proteins from such datasets. In fact, an early version of BCL::Score has been successfully applied to medium resolution density maps form cryo-Electron Microscopy [9].

Divergent Databank of High Resolution Crystal Structures
Statistics have been derived from a divergent high resolution subset of the protein data bank (PDB) which was generated using the protein sequence culling server ''PISCES'' [42]. With a sequence identity limit of 25%, resolutions up to 2.0 Å , a maximum R-value of 0.3, sequence lengths of 40 residues minimum only X-ray structures have been culled from the PDB. This guarantees that similar sequences are not over represented, introducing a bias to proteins that are amenable to crystallography or are of higher interest in the scientific fields. All membrane proteins have been excluded. The resulting databank has 4,379 chains in 3,409 PDB entries. This approach to create the representative protein database might leave multiple members of the more popular fold groups thereby over-representing certain secondary structure packing motifs. An alternative approach would be a non-redundant fold databank created from SCOP [43] or CATH [44] classifications. Our rational for the first approach is that a non-redundant fold database would not cover the diversity of amino acid environments and interactions that are found within similar folds of diverse sequence worsening the statistics of the amino acid centric potentials. Further we argue that secondary structure packing motifs are conserved beyond the boundaries of individual folds. The statistics describing these packing interactions should therefore not be biased by occasional repetition of one fold group.

Secondary Structure Element Packing
In order to develop statistics for the packing between two SSEs, SSE pairs were collected from protein models in the databank. ahelices with a length,7 residues and b-strands,5 residues have been ignored, and a-helices or b-strands have been described as overlapping sets of fragments of the length of 5 residues for ahelices and 3 residues for b-strands ( Figure 4A). An ideal SSE fragment was superimposed with the backbone coordinates of the SSE fragment from the PDB to determine the orientation (translation and rotation in Euclidean space) of this fragment. The main axes have been considered to be line segments; a minimal interface length between the two SSE fragments of 4 Å was achieved by subtracting 2 Å from each end of each SSE's main axis ( Figure 4B). The packing between two fragments was described by the analytically shortest connection between those two line segments. If this connection was orthogonal, it was considered to be a full contact. If the connection was not orthogonal, a contact weight was defined as a function of the angle between the main axes and the shortest connection. This angle between 90u and 0u was then used to determine a weight between 0 and 1 using half of a cosine function and for both angles those weights are multiplied. The twist between the SSE fragments is defined by the dihedral angle h between the SSE main axes ( Figure 4C). The relative offset, which is important when strand backbone hydrogen interactions could play a role, is defined by the offset angle v between 0u and 90u ( Figure 4D). For a strand-helix packing, only one offset angle can be defined, where an v close to 90u is not favorable, a packing with an offset of 0u is desired, since it is dominated by amino acid side chain interactions. The weight is defined: w O~c os 2vz1 2 If two strands are involved in the interaction, it is necessary to distinguish a strand-strand backbone hydrogen bond mediated packing and a sheet-sheet (sandwich-like) amino acid side chain mediated interaction. For omega values near 90u it has a strandstrand interaction character; if the omega values are close to 0u, it is considered to be a sheet-sandwich interaction. Two weights can be defined: The packing between two SSEs is represented as a list of fragment interactions ( Figure 4E), with distance and dihedral angle. For each fragment of the shorter SSE, the interaction weight to every fragment of the longer SSE (for identical sizes, the SSE that comes first in sequence is the ''shorter'' one) is calculated and the fragment pair with the highest interaction weight w I is added to the list of packing interactions. Since this is done for every fragment in the shorter SSE, the list will have as many entries as the number of fragments in the shorter SSE. Every packing interaction within this list is then considered for the statistics using the weight as the count. During scoring, all interactions in the list are scored, multiplied with the respective interaction weight and summed.

Generation of Benchmark Sets
The benchmark sets of protein models were generated using three different methods. 53 sequences of length between ,70 up to ,300 residues have been selected to represent diversity regarding a-helical and b-strand content as well as sequence length : Three benchmark sets were created: a) Using ROSETTA [45] 10,000 models have been folded de novo for each sequence. Since ROSETTA does not assign secondary structure, DSSP [46] was used to add definitions to the models. b) 10,000 models each have been folded using the BCL::Fold program. For these simulations a scoring function with weights set to 1 was used. Further details on the folding simulations can be gleaned from a companion manuscript ''De novo prediction of complex and large protein topologies by assembly of secondary structure elements'' in the same issue of this journal. c) Additionally, 12,000 perturbed structures have been generated using the BCL::Fold program by starting with the native SSE arrangement and applying randomly the following perturbations to the starting structure: (1) SSE rotation and translation; (2) SSE flip; (3) swapping two SSEs; and (4) SSE removal.
Native-like models or positives were defined using three quality metrics: RMSD100 cutoff of 8 Å to the native, a GDT_TS cutoff of 25% and a contact recovery of 20% (see accompanying manuscript). The remaining models in each set were considered negatives or non-native-like. If there were less than 1% or more than 99% native-like models, that set was ignored for further analysis, since it indicates that the sampling algorithm is not suitable for that protein's structure, either creating too many or too few native-like models. The ratio native-like/non-native-like is dependent on the performance of each protocol. The maximum enrichment a score can achieve is dependent upon this ratio. In order to facilitate comparison of the enrichment values, random sub-sets of models were created that contained 10% native-like models. For this, overall ratio in the complete sets of models had to be adjusted. The class of models over-represented with respect to the desired 1:9 ratio (native or non-native like) was split into ten equally large subsets. From the under-represented class random models were added until the desired 1:9 ration of native to nonnative like models was achieved. This procedure uses all generated models without re-using models from the over-represented class.
The enrichment values reported are the average over the ten experiments (Table S1).

Weight Optimization
An optimized weight for the consensus scoring function was determined to calculate the sum of the scores ( Table 3). The objective for optimizing the weight set was to maximize the sum of the square root of enrichments. This sum is calculated over a ROSETTA model set of 53 proteins. Each protein model set is divided into 10 sets, of 10 % native-like models (RMSD100,8 Å ) and 90% random structures (RMSD100 $ 8 Å ), while the actual composition is randomly chosen from a set of 10,000 model structures.
The start weighting set is set to the inverse standard deviation of the score over the 53 * 10,000 models, so that the dynamic ranges of the scores are scaled to the same range. For every step two randomly chosen weights are modified randomly by adding or subtracting 10% of the starting weight, limiting the weights to a minimal value of 0. The minimization follows a Monte-Carlo/ Metropolis simulated annealing protocol [47,48] with 10,000 iterations maximum, terminating after 250 steps without improvement of the objective.

BCL::Score Availability
All components of BCL::Score, including scoring, sampling, and pdb parsing methods are implemented as part of the BioChemical Library (BCL) that is currently being developed in the Meiler laboratory (www.meilerlab.org). BCL::Score is freely available for academic use along with several other components of the BCL library. Details and sample command lines can be found in Appendix S1.  Appendix S1 BCL::Score availability and its usage. (DOCX)