BCL::Fold - De Novo Prediction of Complex and Large Protein Topologies by Assembly of Secondary Structure Elements

Computational de novo protein structure prediction is limited to small proteins of simple topology. The present work explores an approach to extend beyond the current limitations through assembling protein topologies from idealized α-helices and β-strands. The algorithm performs a Monte Carlo Metropolis simulated annealing folding simulation. It optimizes a knowledge-based potential that analyzes radius of gyration, β-strand pairing, secondary structure element (SSE) packing, amino acid pair distance, amino acid environment, contact order, secondary structure prediction agreement and loop closure. Discontinuation of the protein chain favors sampling of non-local contacts and thereby creation of complex protein topologies. The folding simulation is accelerated through exclusion of flexible loop regions further reducing the size of the conformational search space. The algorithm is benchmarked on 66 proteins with lengths between 83 and 293 amino acids. For 61 out of these proteins, the best SSE-only models obtained have an RMSD100 below 8.0 Å and recover more than 20% of the native contacts. The algorithm assembles protein topologies with up to 215 residues and a relative contact order of 0.46. The method is tailored to be used in conjunction with low-resolution or sparse experimental data sets which often provide restraints for regions of defined secondary structure.


Introduction
Understanding of protein function and mechanics is facilitated by and often depends on the availability of structural information. The Protein Data Bank (PDB), as of April 2011, holds 66,726 protein structure entries, 87% determined by X-Ray crystallography and 12% determined by Nuclear Magnetic Resonance (NMR) spectroscopy, and the remaining 1% determined by Electron microscopy and hybrid methods [1,2]. The millions of protein sequences revealed by genome projects necessitate utilization of computational methods for construction of protein structural models. Comparative modeling utilizes structural information from one or more template proteins with high sequence similarity to the protein of interest to construct a model. As the PDB grows and the number of proteins with an existing suitable template of known structure increases, this method is unarguably most important [3].
However, despite impressive advancements in the combination of experimental protein structure determination techniques [4,5] with comparative modeling [6], entire classes of proteins remain underrepresented in the PDB as they evade crystallization or are unsuitable for NMR studies; e.g. membrane proteins [7] and proteins that only fold as part of a large macromolecular assembly [8,9]. Such proteins more frequently adopt topologies not yet represented in the PDB such that the current structural knowledge fails to encapsulate necessary information to represent all protein families and folds expected to be found in nature [10]. In such situations de novo methods for prediction of protein structure from the primary sequence alone can be applied.

De Novo Protein Fold Determination is Possible for Smaller Proteins of Simple Topology
De novo protein structure prediction typically starts with predicting secondary structure [11,12,13,14] and other properties of a given sequence such as b-hairpins [15], disorder [16,17], nonlocal contacts [18], domain boundaries [19,20,21], and domain interactions [22,23]. System-learning approaches such as artificial neural networks (ANN), hidden Markov models (HMM), and support vector machines (SVM) are most commonly used in this field [24,25]. This preparatory step is followed by the actual folding simulation. Rosetta, one of the best performing de novo methods, follows a fragment assembly approach [26,27,28]. For all overlapping nine-and three-amino acid peptides of the sequence of interest, conformations are selected from the PDB by agreement in sequence and predicted secondary structure. Rosetta is capable of correctly folding about 50% of all sequences with less than 150 amino acids [29].
Chunk-Tasser is another fragment assembly method for de novo structure prediction that was one of the best performing methods in the CASP8 experiment [30]. This method generates chunks, three consecutive secondary structure elements (SSEs) connected by two loops, using nine-and three-residue fragments.
The final models are built by using these chunks as the starting point coupled with a minimization process that also utilizes threading and distance restraint predictions [31].

For Small Proteins with Less than 80 Amino Acids Models can Sometimes be Refined to Atomic-detail Accuracy
During the folding simulation, most de novo methods use a reduced protein representation that excludes side chain degrees of freedom to simplify the conformational search space and potential. The fastest and most accurate algorithms to add side chains in order to build atomic detail models rely on sampling likely conformations of amino acid side chains, so-called rotamers [32,33,34]. At this stage, the backbone of flexible loop regions can be further refined, in Rosetta by a combination of fragment insertions, side chain repacking, and gradient minimization. In the CASP6 experiment, Rosetta was able predict de novo the structure of a small a-helical protein to a resolution of 1.59Å [26]. Following this success, Bradley and co-workers showed comprehensively that high resolution backbone structure prediction facilitates the correct placement of side chains and thus de novo high resolution structure elucidation for small proteins [35]. Note that the refinement of backbone conformations and construction of side chain coordinates aligns with most comparative modeling protocols [36,37] (Figure 1). These algorithms model gaps and insertions using loop closure algorithms that use analytical geometry [38], molecular mechanics [39], or loop libraries from the PDB [40] before entering the refinement process. Thereby both approaches -de novo structure prediction and comparative modeling -share the decoupling of the construction of backbone and side chain coordinates. This procedure relies on the hypothesis that accurately placed backbone coordinates define the side chain conformations [33] (Figure 1).

Progress is Stalled by Inefficient Sampling of Large and Complex Topologies
De novo methods perform well only for small proteins, because the conformational search space increases rapidly as the protein gets larger. Despite simplified representation of proteins that omit side chain degrees of freedom, sampling the correct topology remains the major bottleneck for folding large proteins. Sampling is complicated for large proteins not only by size, but also by a larger number of non-local contacts, i.e. interactions between amino acids that are far apart in sequence. More of these interactions contribute to protein stability and are therefore important to sample in order to find the correct topology. At the same time, when folding a continuous protein chain each of these contacts complicates the search as conformational changes between the two amino acids in contact require coordinated adjustment of multiple phi, psi angles to not disrupt the contact. To quantify the number of such non-local contacts the relative contact order (RCO) of a protein was defined which is the average sequence separation of residues ''in contact'', i.e. having their C b atoms (H 2a for Glycine) within 8Å [41,42] normalized by sequence length. As the RCO increases above 0.25 (i.e. the average separation of two residues that are in contact is 25% of the sequence length), the success rate of de novo prediction drops drastically [43]. Also, the geometry of non-local interactions and b-strand pairings in particular is often inaccurate as relative placement of the SSEs cannot be optimized independently from the connecting amino acid chain. This limitation must be overcome for de novo methods to be successfully applied to larger proteins. Interestingly, contact order correlates also with protein folding rates [44] suggesting that the sampling of non-local contacts is the rate-limiting step in protein folding.

De novo Protein Structure Prediction Optimally Leverages Limited Experimental Datasets for Proteins of Unknown Topology
Experimental structural data that becomes available for proteins of unknown topology are often limited, i.e. sparse or low in resolution. Typically, these limited data sets focus on and are more readily available for backbone atoms in ordered secondary structure elements. For example, cryo-Electron Microscopy and X-Ray crystallography yield medium resolution density maps of 5- Figure 1. Comparison of comparative modeling, BCL::Fold and Rosetta. In comparative modeling the protein backbone is constructed partially from a target-template alignment, followed by loop construction and side chain building. On the other hand, de novo methods, such as Rosetta, only take advantage of the decoupling of backbone placement and the side chain building. BCL::Fold also decouples the construction of loops from assembly of secondary structure elements, similar to comparative modeling. Although disconnecting these steps makes computation more feasible by splitting the total search space into manageable portions, they are not absolute and in order to address these issues SSE placement has to be refined before loop building and side chain construction. While side chain conformations are not explicitly added to the model in the early stages of de novo structure prediction, they are implicitly represented throughout the process through knowledge-based potentials. doi:10.1371/journal.pone.0049240.g001 10 Å where secondary structure can be identified but loop regions and amino acid side chains remain invisible [45,46,47,48,49]. NMR and EPR spectroscopy yield sparse datasets due to technological or resource limitations [49,50,51,52,53,54,55]. Chemical cross linking coupled with mass spectrometry has also been shown to be applicable for protein structure determination at these low resolutions [56,57,58].
While de novo protein structure prediction is typically insufficient in accuracy and confidence to be applied to determine the structure of a protein without the help of experimental data, a series of manuscripts was published that demonstrated the power of such technologies to predict protein structures accurately at atomic-detail when combined with limited experimental data sets of different origin. Qian et al. previously demonstrated use of de novo structure prediction to overcome the crystallographic phase problem [59]. De novo methods have also been applied for rapid fold determination from unassigned NMR data [60] and structure determination for larger proteins from NMR restraints [61]. In addition, de novo structure prediction has also been coupled with EPR restraints [62,63,64] as well as cryo-EM [49]. Kahlkof et. al studied de novo structure prediction of laminin using distance restraints from natural cross-links revealed a structural similarity to galactose binding proteins [57], which was later confirmed when the structure was experimentally determined by X-Ray crystallography [65]. Numerous other studies have also harnessed the power of de novo structure prediction with experimental restraints [66,67,68].
The objective of the present work is to introduce an algorithm for protein folding with a novel approach of assembly of SSEs in three-dimensional space. This approach seeks to overcome size and complexity limits of previous approaches by discontinuing the amino acid chain in the folding simulation thereby facilitating the sampling of non-local contacts. Exclusion of loop regions focuses the sampling to the relative arrangement of rather rigid SSEs, limiting the overall search space. The approach can be readily combined with limited datasets which tend to restrain the location of backbone atoms in SSEs. It leverages established protocols for construction of loop regions and side chains to yield complete protein models ( Figure 1). The decoupling of the placement of SSEs from the construction of loop regions relies on the hypothesis that accurate placement of SSEs will allow for construction of loop regions and subsequent placement of side chain coordinates, a hypothesis tested excessively in comparative modeling. This approach assumes further that the majority of the thermodynamic stabilization achieved through formation of the core of the protein is defined by interactions between SSEs and can therefore be approximated with an energy function that relies exclusively on scoring SSEs. This hypothesis has been tested in a companion manuscript ''BCL::Score'' (BCL -BioChemicalLibrary) in the same issue of this journal. Briefly, the scoring terms cover amino acid environment and pair-wise interaction; clash penalties prevent overlap; secondary structure is evaluated relative to the predicted probabilities; chain breaks are tested for being closable by a loop; the topology is evaluated by the radius of gyration, and SSE pairing and packing is represented by distance and dihedral angle. All potentials are derived using empirics from known protein structures.
Although the algorithm is in principle applicable to membrane protein structure prediction, changes in the sampling steps presented in the BCL::Fold algorithm as well as modifications of the energy terms in the composite energy functions of BCL::Score are required and are presently under investigation.

Results and Discussion
In fragment assembly based approaches to de novo protein structure prediction, local contacts are sampled more efficiently than the non-local ones due to inherent restrictions imposed by the connectivity of the amino acid sequence. This restriction leads to one of the major challenge in de novo protein structure predictionthe sampling of complex topologies as defined by the abundance of non-local contacts and thus higher relative contact order (RCO) values [43]. Further, fragment based approaches spend a large fraction of time sampling the conformational space of flexible loop regions that contribute little to the stability of the fold. Therefore the accuracies of the methods deteriorate as the conformational search space gets larger, typically for proteins with more than 150 residues. In particular, b-strand interactions are often sampled insufficiently densely to arrive at the correct pairings with good geometries. As a result, regular secondary structure cannot be detected in the models giving them the well-known ''spaghetti''look. The score deteriorates hampering detection of the correct topology in a large ensemble of models.
BCL::Fold is Designed to Overcome Size and Complexity Limitations in De Novo Protein Structure Prediction BCL::Fold assembles secondary structure elements (SSEs), namely a-helices and b-strands while not explicitly modeling loop conformations ( Figure 2). Individual residues are represented by their backbone and C b atoms only, (H a2 for Glycine). A pool of predicted SSEs is collected using a consensus of secondary structure prediction methods. A Monte Carlo Metropolis minimization with simulated annealing is used where models are altered by SSE-based moves (Table S1 and Table S2) and evaluated by knowledge-based energy potentials (Table S3). The reduced representation of proteins in BCL::Fold decreases the conformational search space that has to be sampled. Moving discontinued SSEs independently of each other accelerates sampling of nonlocal contacts. The knowledge-based scoring function employed by BCL::Fold is described in a companion manuscript in the same issue of this journal.
BCL::Fold was evaluated using a benchmark set of proteins collected using PISCES culling server. The set includes 66 proteins of lengths ranging from 83 to 293 residues with,30% sequence similarity. The set contains different topologies including 31 all ahelical, 16 all b-strand, and 19 mixed ab folds ( Table 1). The selected proteins have RCOs in the range of 0.12 to 0.47 with an average of 0.3060.07. It should be noted that as proteins get larger, RCO values start decreasing (compare Figure S2). Therefore we introduced a normalized contact order measure NCO which is defined as the square of the contact order divided by sequence length and is largely independent of protein size.

Consensus Prediction of SSEs from Sequence to Create Comprehensive Pool for Assembly
The secondary structure prediction programs JUFO [13,69] and PSIPRED [70] were used to create a comprehensive pool of predicted SSEs. Two methods are used to avoid deterioration of BCL::Fold performance if one of the methods fails. To further avoid dependence on potentially incorrect predicted secondary structure, we implement two strategies: a) the initial pool of SSEs contains multiple copies of one SSE having different length. In extreme cases of ambiguity this could be an a-helix predicted by one method and a b-strand predicted by the other or one long ahelix that overlaps with two short a-helices that span the same region. b) The length of SSEs is dynamically adjusted during the folding simulation in order to allow simultaneous optimization of protein secondary and tertiary structure [13]. Both strategies require a scoring metric that analyzes the agreement of a given set of SSEs with the predicted secondary structure. Before the actual folding simulation is started, a pool of likely SSEs is created using a rapid Monte Carlo Metropolis simulation. The scoring scheme and the pool generation are described in more detail in the methods section. SSEs predicted by this method are only added to the secondary structure pool if they satisfy the minimum length restrictions; five residues for a-helices and three residues for bstrands. Rationale for removal of very short SSEs is two-fold: a) the reduced accuracy of secondary structure prediction techniques for such short SSEs [71] and b) the limited contribution to fold stability expected from short SSEs (compare companion manuscript ''BCL::Score -Knowledge based energy potentials for ranking protein models represented by idealized secondary structure elements''). Table 2 depicts Q3 accuracies (a measure of the accuracy for predicting per residue secondary structure [72]), as well as the percentage of native secondary structures correctly predicted and the average shifts for the SSE pools of the 66 benchmark proteins The secondary structure prediction methods, PSIPRED and JUFO, have been combined to achieve a consensus three state secondary structure prediction. For a given amino acid sequence, stretches of sequence with consecutive a-helix or b-strand predictions above a given probability threshold are identified as a-helical and b-strand SSEs and added to the pool of SSEs to be used in the assembly protocol. (B) Assembly of SSEs. The initial model only has a randomly picked SSE from the SSE pool. At each single iteration step, a move is picked randomly and applied to produce a new model. The details regarding utilized moves are given in the next panel. (C) Energy evaluation using knowledge based potentials. After each change, the model is evaluated using knowledge based potentials. These include loop, loop closure, amino acid environment, amino acid pair distance, amino acid clash, SSE packing, strand pairing, SSE clash, contact order and radius of gyration. (D) Monte Carlo Metropolis minimization. Based on the energy evaluation, models with lower energies than the previous model are accepted, while models with higher energy can be either accepted or rejected based on Metropolis criteria. The accepted models are further optimized, in case of rejected models, the minimization continues with the last accepted model. The minimization is terminated after either a specified total number of steps or a specified number of rejected steps in a row. The protocol consists of two such minimizations, one for assembly and one for refinement. doi:10.1371/journal.pone.0049240.g002

Two-stage Assembly and Refinement Protocol Separates Moves by type and Amplitude
BCL::Fold samples the conformational search space by a variety of SSE-based moves. These moves coupled with exclusion of loop residues, provide a significant advantage in fast sampling of different topologies. The minimization process is divided into two stages. The ''assembly'' stage consists of large amplitude translation or rotations and moves that add or remove SSEs (Movies in Table S5: row ''Assembly). Other moves central to this phase shuffle b-strands within b-sheets or break large b-sheets to create b-sandwiches. The ''refinement'' stage focuses on small amplitude moves that maintain the current topology but optimize interactions between SSEs and introduce bends into SSEs (Movies in Table S5: row ''Refinement). Currently both stages utilize the same energy function (compare companion BCL::Score manuscript).
Once the SSE pool is input, the algorithm initializes the energy functions and move sets with corresponding weight sets for assembly and refinement stages. A starting model for the minimization is created by inserting a randomly selected SSE from the pool into an empty model. The starting model is passed to the minimizer which executes assembly and refinement minimization. The assembly stage terminates after 5000 steps in total or after 1000 consecutive steps that did not improve the score. The refinement stage terminates after 2000 steps in total or 400 consecutive steps that did not improve the score. In general, a move can result in one of four outcomes ( Figure S1): ''improved'' in score, ''accepted'' through Metropolis criterion, ''rejected'' as score worsened, or ''skipped'' as SSE elements required for the move are not present in the model. The temperature is adjusted dynamically based on the ratio of accepted steps (see Methods). The Metropolis outcome ''skipped'' is introduced in the algorithm to deal with the non-applicability of certain steps to the model in the optimization. Although particular moves could be disabled before selecting them randomly, they are selected and counted as ''skipped''.
A comprehensive list of all moves used in BCL::Fold is given in Table S1 (assembly stage) and Table S2 (refinement) along with brief descriptions. The moves are categorized into six main categories; (1) adding SSEs, (2) removing SSEs, (3) swapping SSEs, (4) single SSE moves, (5) SSE-pair moves, and (6) moving domains, i.e. larger sets of SSEs. Representations for a selection of moves used in BCL::Fold are illustrated in Figure 3. SSE, SSE-pair and domain moves are further categorized into specific versions for ahelices, and b-strands or a-helix domains, and b-sheets, resulting in a total of nine individual categories. The relative probability or weight for each move category is initialized at the beginning of the minimization and depends on the SSE content of the pool. For  example, b-sheet moves are excluded if the given pool contains only a-helices. This procedure limits the number of move trials that are unsuccessful or ''skipped'' because the needed SSEs are not in the model. As mentioned in the previous section, depending on the amplitude, moves are categorized to be used in either the assembly stage or the refinement stage. Out of 107 moves, 72 are used exclusively in assembly and 33 are used exclusively in refinement. Resizing SSEs (''sse_resize_nterm'' and ''sse_resize_cterm'') are the only moves used in both stages. Table S1 and Table  S2 also provides statistics of how frequently each move leads to an improved, accepted, rejected, or skipped status as well as the average improvement in the score observed for all the improved steps based on statistics collected on the 66 benchmark proteins. Assembly moves have an average score improvement of 2100678 BCLEU (Table S1) while the refinement moves have an average score change of 215610 BCLEU (Table S2). The five individual moves with largest score improvements either add SSEs or manipulate b-strands, including ''add_strand_-next_to_sheet'', ''sheet_pair_strands'', ''add_sse_short_loop'' and ''add_sse_next_to_sse''. At the same time, these moves also lead to improved models with a relatively high percentage, ranging from 10% to 30% of the cases where the move is not skipped. On the other hand, these moves, especially ones including adding SSEs, also lead to a high percentage of skipped steps. This is due to the fact that the weight for these moves is currently not dynamically adjusted depending on how many SSEs are already added to the model. On the contrary, moves with small average score improvements are less frequently skipped but also less frequently accepted.
It is somewhat misleading to analyze the moves in isolation as rearranging or refining the topology often requires a series of different moves and success of one move relies on the availability of suitable companion moves. For further investigation of which types of moves are more complementary to each other, trajectories from runs for all benchmark proteins were analyzed. For each pair of move type M1 and M2, a non-symmetric correlation measure C M1,M2 ð Þ was calculated using the equation below which measures how likely it is that a move M1 leads to a step where M2 is applied to improve the current model.
where P M2 IDM1 AI ð Þis the fraction of M1 improved moves preceded by a M1 accepted or improved move in the previous 50 steps; P M2 I ð Þ is the fraction of M2 moves that lead to an improved step; and P M1 AI ð Þis the fraction of M1 moves that lead to an improved or accepted step. If any of the probabilities are 0, the correlation value is assigned as -3. A higher correlation value C M1,M2 ð Þindicates that move M1 is more likely to be observed as an improved or rejected step in close proximity before a M2 move is observed in an improved step. Figure 4 depicts heat maps for pairwise correlation values for all assembly moves ( Figure 4A) and refinement moves ( Figure 4B). As observed in both panels, certain move pairs have an increased chance of producing energetically favorable models. As seen in the heat map for assembly ( Figure 4A), add moves (#1-3) often precede successful move combinations since whenever a new SSE is added, it is expected that is not at the energetically optimal The table depicts pool Q3 score, %found (percent of native SSEs identified by predictions) and average shifts for the pools generated using secondary structure prediction methods PSIPRED and JUFO for all of the 66 proteins in the benchmark set. The last two rows show the average and the standard deviation for pool agreement score and Q3 measure. The statistics is repeated for the combined pool of PSIPRED and JUFO. doi:10.1371/journal.pone.0049240.t002  The refinement moves ( Figure 4B) have a less pronounced pattern compared to assembly moves since they introduce smaller changes. Less often a second move is needed to compensate for an accepted first move. a-helix pair moves (#26-27) preferably precede a-helix domain moves (#28-30), while the ''strand_-translate_z_small'' move (#21) is followed by b-sheet moves (#31-35).
BCL::Fold Samples Native-like Topologies for 92% of Benchmark Proteins 10,000 structural models were generated for each protein in the benchmark set using BCL::Fold. Two separate runs were performed with BCL::Fold, one using a SSE pool composed of native SSE definitions as computed from the experimental structures using DSSP [73]. A second run was performed using a BCL::SSE predicted pool. To facilitate the analysis of models, loops were constructed using a rapid CCD based method [38] (see Methods and Movies in Table S5: rows ''Loop grow'', ''Loop close'' and ''Loop force close''). However, in the present analysis we focus on placement of SSEs to form the topology and evaluate models using two qualities measures; RMSD100 (C a root mean square deviation normalized to a protein length of 100 residues [74]) and Contact Recovery (CR). CR measures percentage of native contacts recovered, where a contact is defined as the presence of two amino acids of at least 12 residues sequence separation and,8Å Cb distance. The average and standard deviations of RMSD100 and CR values of the best models generated by these runs can be found in Table 3. Figures 5 and 6 illustrate the best RMSD100 SSE-only and complete structural models generated by BCL using predicted SSE pools for a selection of benchmark proteins. BCL::Fold using the correct secondary structure achieved RMSD100-values of 5.561.6Å (SSE only models) and 6.861.7Å (complete models). For simulations with predicted SSEs, RMSD100 values of 6.061.6Å (SSE only models) and 7.261.7Å (complete models) were obtained. For comparison, Rosetta [28] generated models with RMSD100-values of 6.462.1Å . BCL::Fold improved the RMSD100 when compared with Rosetta in 24 cases (36%) with correct SSE definitions and in 19 cases (29%) using a predicted SSE pool. When CR values are considered, BCL::Fold using the correct secondary structure achieved 44  ison, Rosetta generated models with CR-values of 39.4617.5. BCL::Fold improved the recovery of native contacts when compared with Rosetta in 47 cases (71%) with correct SSE definitions and in 40 cases (60%) using a predicted SSE pool.
When best models by RMSD100 are considered, BCL::Fold was able to predict the correct topology in 61 cases (92%) independent of usage of correct or predicted SSE pools. Models with correct topology, or native-like models, were defined as having an RMSD100 value of less than 8.0Å . After loop construction, native-like models are obtained for 50 cases (75%) using correct SSE predictions and 41 cases (62%) using a predicted SSE pool. In comparison, Rosetta constructed native-like models for 45 cases (68%). When a CR value of.20% is taken as cutoff, success rates change to 64 cases (97%) and 62 cases (94%), respectively, for BCL::Fold and to 60 cases (91%) for Rosetta. We attribute the deterioration of BCL::Fold models after loop construction mostly to limited sampling performed at this stage of the protocol as the present work focuses on topology assembly.
For further analysis, the best-scoring 100 models (1%) for each protein and each method were kept. For these subsets the percentage of targets where the best model by RMSD100 was below 8.0Å were calculated. BCL::Fold using correct SSEs was able to generate a,8.0Å RMSD100 model in top 1% by score for 56% of targets (SSE only models) and 39% (complete models). These values for BCL::Fold simulations with predicted pools were 44% (SSE only models) and 22% (complete models), while being 43% for Rosetta. This was followed by a similar analysis where CR measure was used instead of RMSD100. BCL::Fold using correct SSEs was able to generate.20% CR models for 74% of targets (SSE only models) and 79% (complete models). For simulations with predicted pools, CR values were 73% (SSE only models) and 76% (complete models), while being 74% for Rosetta. Figure 7 provides the comparison of best RMSD100 and CR values achieved for all benchmark proteins between Rosetta and BCL. When RMSD100 values are considered, SSE-only models for BCL runs with predicted SSE pools ( Figure 7A left panel) provide a better performance than Rosetta. As explained earlier, SSE-only models are given an advantage due to the smaller number of atoms over which RMSD100 values are calculated for these models owing to the lack of flexible loop regions. When complete models are compared with Rosetta ( Figure 7A right panel); it is observed that Rosetta produces lower RMSD100 models for more targets although performance correlates very well. Figure 7B displays CR values giving a slight advantage to the BCL in recovering native-like SSE contacts. These results are promising for BCL::Fold especially given the fact that BCL::Fold was designed with a focus on getting the SSE topology correct.
BCL::Fold performance varies between different targets, as observed in the plots mentioned above. We wanted to investigate whether there is a correlation of performance with sequence  length, fold complexity, secondary structure content, or accuracy of the secondary structure prediction. For this purpose, for each benchmark protein, the sequence length is plotted against NCO values ( Figure 8A left panel) and each point is colored according to the highest CR value achieved for the complete models generated by BCL::Fold runs using predicted SSE pools for that protein. As seen in the plots, the best performing proteins (.60%), are limited to,150 residue proteins. On the other hand, 40% to 60% CR values were achieved for proteins up to 200 residues, and 20%-40% CR values were attainable for proteins up to 275 residues. Similar numbers were observed for Rosetta models ( Figure 8A right panel).

Accurate Secondary Structure Improves Quality of BCL::Fold Models only Slightly
Comparison of BCL::Fold runs with either predicted or correct SSEs (Table 3) reveals that using native SSE definitions provides an average improvement of 0.64Å in RMSD100 for SSE only models and 0.40 Å RMSD100 for complete models after loop construction. Although the effect of secondary structure prediction accuracy on average of best RMSD100 models is modest, this effect is not directly related to Q3 values, but rather due to the nature of the BCL::Fold assembly protocol. One interesting example is 1LKIA, a 180 residue protein with Q3 values of 75.9 (PSIPRED) and 44.1 (JUFO). Although this protein has a midrange PSIPRED Q3 value, it exhibits the largest deterioration in both RMSD100 and CR, which is more likely to be explained by the high average shift values; 10.3 residues (PSIPRED) and 16.8 residues (JUFO). Another such example is 1LMIA, which has low Q3 value of 53.7 and 42.5 for PSIPRED and JUFO respectively, accompanied by a low rate correct SSE identification of 67%. On the other hand, if the secondary structure prediction is extremely accurate as in the case of 1TQGA, 1J27A, 3FH2A, 2BK8A (all with PSIPRED Q3.94.0), RMSD100 values deteriorate less than 0.3Å when moving from perfect to predicted secondary structure. Although accurate secondary structure prediction improves the overall accuracy of BCL::Fold, the results indicate that it is not a requirement. As described in Table S1 and Table S2, BCL::Fold utilizes a set of moves to dynamically resize and split SSEs during the minimization to compensate for the inaccuracies in secondary structure prediction.
The SSE content (percentage of residues in a sequence that reside in an SSE as opposed to coil) versus maximum Q3 value of the pool generated (using the highest Q3 value of the PSIPRED and JUFO predictions) for each benchmark protein is plotted in Figure 8B. Each point is colored according to the best CR value achieved for that target in complete models generated in BCL::Fold runs using predicted SSE pools (Figure7B left panel). Nearly all targets with highest CR values (.60% colored purple) have ,80% or higher Q3 values, although the SSE content for these targets can range from as little as 40% to as high as 85%. Similar plots are also provided for Rosetta models for comparison ( Figure 8B right panel). When models for the proteins with,75% maximum Q3 values are considered, a clear decline in the best CR values could be observed for Rosetta models in comparison to BCL models. For these proteins with relatively low confidence secondary structure predictions, the Rosetta generated best CR model was,20% for 4 cases, and between 20% and 40% for 11 cases, while these numbers were 3 and 6 respectively for BCL::Fold. The table lists for all proteins, the best RMSD100 and best CR observed for models generated by BCL and Rosetta. BCL results are presented in 4 columns: SSE-only models using native SSE definitions (BCL N-SSE ), complete models using native SSE definitions (BCL N ), SSE-only models using predicted SSE definitions (BCL P-SSE ), complete models using predicted SSE definitions (BCL P ). The 5th columns under RMSD100 and CR are for Rosetta models. doi:10.1371/journal.pone.0049240.t003 BCL::Fold BETA was Evaluated in CASP9 Experiment All techniques for protein structure prediction are evaluated every two years via the Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiment [75,76,77]. An early version of BCL::Fold (BCL::Fold BETA) participated in CASP9 and predictions were submitted for 58 of 63 targets given in the human predictor category. For each target 50,000 models were generated, the top 10,000 by BCL score were selected for clustering analysis. The five best scoring models as well as the best scoring models in each of the large clusters (,20) underwent loop construction and side chain packing using Rosetta. The five models for submission were selected from these full atom models as the largest cluster centers. In cases were a template was readily available, the fifth model for submission was the BCL::Fold model with the smallest RMSD to the comparative model built by MODELLER [78]. This approach was chosen to test the BCL::Fold sampling independent from BCL::Score (compare companion BCL::Score manuscript).
Targets in CASP9 were biased towards proteins of known fold. In fact, only 14 out of the 60 human targets had no sequence detectable templates [79]. However, BCL::Fold treated all targets ''free modeling (FM)'' to maximally leverage the blind CASP experiment to test the algorithm. In cases where a template was available we would not expect to perform better than templatebased methods. The remaining few cases represent a too small sample size to comprehensively compare BCL::Fold with other de novo protein structure prediction methods, also because of the BETA stage of that version. Therefore we present anecdotal examples where the potential of this early version of the algorithm became apparent. A more detailed evaluation will be performed during CASP10 in summer 2012.
For FM target T0608_1, the first submission by BCL::Fold had an RMSD of 4.3Å and ranked 9 th out of 132 groups (Figure 9). BCL::Fold was also able to produce native-like models and pick them for submission for the following targets; T0580 (105 residues 4.4Å RMSD), T0619 (111 residues 5.9Å RMSD), T0602 (123 Figure 7. Comparison of best RMSD100 and CR values for BCL and Rosetta. Scatter plot comparing (A) best RMSD100 or (B) best CR SSEonly (left) and complete (right) BCL models vs. Rosetta models. The BCL models considered are from BCL::Fold runs using predicted SSE pools. (B) Scatter plot comparing best CR SSE-only (left) and complete (right) BCL models vs. Rosetta models. The BCL models considered are from BCL::Fold runs using predicted SSE pools. doi:10.1371/journal.pone.0049240.g007 residues 7.7Å RMSD), T0630 (132 residues 8.4Å RMSD), T0627 (261 residues 8.9Å RMSD).

Conclusion
In conclusion, we demonstrate that assembly of SSEs is a viable approach to predict the topology of a protein of unknown fold. BCL::Fold assembles the correct topology for about 3 out of 5 proteins with sequence lengths ranging from 88 residues to 293 residues and 4 to 15 SSEs. BCL::Fold assembly runs range from 1 minute for the smallest protein to 10 minutes for the largest protein with a linear scaling ( Figure S3). A detailed run-time comparison to other prediction methods was not conducted; however Rosetta and BCL::Fold per-model runtimes are currently at the same magnitude (Data for Rosetta not shown). With the maturation of the program code and the method, we expect that runtimes will decrease significantly, and due to the rapid sampling of topologies typically fewer models need to be generated as for this study.
The impact of predicted versus correct secondary structure is small, demonstrating that BCL::Fold can efficiently compensate for inaccuracies in secondary structure prediction. As mentioned above, BCL::Fold currently focuses on topological sampling of SSEs neglecting backbone flexibility within individual SSEs. This leads to increased RMSD100 values especially in b-sheet proteins where despite correct topology, the curvature of b-sheet is not correctly reproduced. With development of more efficient SSE backbone bending strategies BCL::Fold can overcome this limitation. Individual plots are presented for models from BCL::Fold runs using predicted SSE pools (left panels) and Rosetta models (right panels). Points in both (A) and (B) are colored according to the best CR value achieved for that benchmark protein in BCL runs using predicted SSE pools and complete models;,20% (red), 20% to 40% (orange), 40% to 60% (green) and.60% (dark green). doi:10.1371/journal.pone.0049240.g008 As expected, BCL::Fold's overall performance, in terms of both RMSD100 and CR, is more robust for smaller proteins. There is a linear dependency, more clearly seen with decreasing CR values, the larger the protein, the larger the conformational space to be sampled. Out of 31 a-helical proteins tested, BCL::Fold was able generate,8.0Å RMSD100 models for 28 cases (SSE only models) and 15 cases (complete models). Out of 16 b-proteins, this was true for all 16 cases (SSE only models) and 11 cases (complete models). For the remaining 18 ab proteins, native-like models were generated 16 (SSE only models) and 15 cases (complete models). One of the major reasons of the difficulty experienced with a subset of these targets, as in the case of a-helical proteins 1LKIA, 1Z3XA and 2R0SA, and b-sheet proteins 1LMIA, 2QZQA and 1XAKA, can be attributed to inaccurate secondary structure predictions in terms of Q3 as well as being unable to identify one or more native SSEs.
As discussed in the introduction, BCL::Fold was designed for combination with limited experimental datasets. A version of BCL::Fold which integrates low resolution restraints from cryo-EM was previously shown to predict the correct topology for ahelical proteins [49]. Incorporation of limited experimental data from NMR and EPR experiments, folding of membrane proteins, and better reproduction of strongly bent SSEs are future directions of our research.
During review of the present manuscript, two related studies have been published. Lange et al. use the recombination of structural features that occur during the initial states of the Rosetta sampling protocol to more efficiently sample b-sheet topologies [80]. While similar in spirit, the method is restricted to,beta.-sheets and structural features are gathered from models initially sampled possibly limiting the search space. This is different to our method, where improved sampling includes a-helices and bstrands and is not biased by an initial set of models. Simoncini et al. present ''A Probabilistic Fragment-Based Protein Structure Prediction Algorithm'' [81]. Their method ''EdaFold'' increases the probability for picking fragment conformations from models with low energy. This enables focused sampling of low energy conformations, assuming that they are more likely to correspond to the native protein topology. It requires that sampling trajectories communicate with each other to adjust the fragment probabilities. The amino acid sequences are still ''folded'' without breaks and the benchmark proteins used have only up to 128 residues, significantly less than what is used in the present study.

BCL::Fold Protocol and Benchmark Analysis
The flowchart of the BCL::Fold protocol is shown in Figure 2. The amino acid sequence and associated secondary structure predictions are utilized to generate a pool of SSEs (Figure 2A). The SSE pool is likely to have multiple copies for the one SSE with varying start and end points. The algorithm then selects one SSE at random from the pool and places it in the origin to start the simulation. The minimization protocol is composed of a Monte Carlo sampling algorithm ( Figure 2B) coupled with knowledgebased energy potentials ( Figure 2C). Once a specified number of maximum iterations are reached the minimization is ended and the model with the best energy is returned as the final model ( Figure 2D). For each of the benchmark proteins, two BCL::Fold runs with 10,000 models each were completed, one using secondary structure definitions provided in the PDB files and one using the secondary structure predictions.

Preparation of Benchmark Set
The benchmark protein set was collected using the PISCES culling server and includes 66 proteins of lengths ranging from 83 to 293 residues with,30% sequence similarity and X-ray determined structures with a resolution,2.0 Å . The set contains 66 different topologies including 31 all a-helical, 16 all b-strand, and 19 mixed ab folds ( Table 1). The primary sequence and experimental structure of the selected proteins were downloaded from the PDB [82]. The secondary structures were determined using DSSP [73], since the PDB definitions were inconsistent in some places.

Secondary Structure Prediction and Preparation of SSE Pool
JUFO [13,69] and PSIPRED [70] were obtained from the authors of the methods and installed locally. In addition the sequence alignment tool BLAST [83,84] was installed locally to create the position specific scoring matrices for input to JUFO and PSIPRED. These are provided as input to the BCL::SSE application which generates a pool of likely SSEs given secondary structure prediction and BLAST profile. The prediction methods assign probabilities for each residue to be in one of three stateshelix, strand or coil. BCL::SSE first generates an initial pool by taking the highest probability for each residue and assigning it the corresponding secondary structure type. A threshold of 0.5 is applied for a-helices and b-strands: if the probability is below the threshold, then the residue is assigned as a coil even if the highest probability corresponds to a-helix and b-strand. This initial pool is then refined using a Monte-Carlo based minimization composed of 1000 steps. The minimization employs moves that alter the secondary structure assignment of a single residue or divide a SSE, while the energy function used evaluates the correspondence of the secondary structure predictions to the secondary structure assignments generated (see companion BCL::Score paper for more details). For both the initial pool as well as the final pools generated by BCL::SSEs, a-helices shorter than 5 residues and bstrands shorter than 3 residues are excluded.

SSE Pool Evaluation
Q3 is the most commonly used method for evaluating secondary structure assignments [72]. Q3 evaluates the percentage of residues with correct secondary structure assignments. However, since the actual identification of an SSE is more important than individual secondary structure assignments for BCL::Fold, we introduced additional measures: the percentage of native SSEs that were correctly identified as well as the shift, which is sum of deviations in the beginning and ends of predicted SSEs compared to native SSEs.
Monte Carlo-based Sampling Algorithm and Temperature Control BCL::Fold starts the minimizations with a structural model that contains a single SSE picked randomly from the pool. At each iteration, a move is selected randomly from the move set and applied to the model to produce a new structural model. The resultant model is evaluated by a consensus energy function, and whether to accept or reject this model is determined by the Metropolis criterion [85], where E c is the energy of the current model, E b is the energy of the best model observed so far, k is a constant and T is the temperature of the system at that point. The temperature is set to 500 initially and adjusted every 10 th step to approach an overall cumulative move acceptance ratio for the trajectory. The target ratio for move acceptance is 0.5 in the beginning and decreases linearly to 0.2 at the end. The evaluation of the Metropolis criterion can lead to three different results; (1) improved, if the energy of the current model is better than best the energy, (2) accepted and (3) rejected if energy of current model is worse than best energy and Metropolis criterion is used for evaluation. If this step is an ''improved'' state, the current model replaces the best model and minimization is continued with this model. If this step is a ''rejected'', the minimization is continued with the best model observed so far. If this step is an ''accepted'' state, the minimization is continued on this model however the best model is not updated. An additional non-standard state ''skipped'' is defined if the mutate was not able to produce a new model, such as when trying to add a new SSE to a model that is already complete. In that case the energy evaluation is skipped.

Sampling of Conformational Space
BCL::Fold explores the conformational space using a variety of moves. Each move is assigned a probability and one of them is randomly picked at each step based on these probabilities. The list of all moves utilized, their associated probabilities and descriptions can be found in Table S1 (assembly stage) and Table S2 (refinement stage). The moves can be divided into the following six categories; (1) adds, (2) removes, (3) swaps, (4) single SSE moves, (5) SSE-pair moves, (6) domain moves. For SSE, SSE-pair and domain moves, these are further categorized into specific ahelix, mixed-SSE domain, or b-sheet moves.

Loop Building
Missing loop residues were built on to the model predicted by BCL::Fold using an in-house CCD based loop building protocol [38]. The protocol first removes a single residue from each side of all the SSEs in the model to increase the chance of being able to close the loop. Then, missing loop residues are added to the model with phi/psi angles biased by Ramachandran distribution for given amino acid type (Movies in Table S5: row ''Loop grow''). The initial conformations of the residues are optimized using BCL scoring functions including amino acid clash and amino acid environment and a bias to close the chain breaks. This step ensures that initial positions can be found for all residues without causing any clashes. In the next stage, a CCD-based minimization is applied to ensure all loops are closed (Movies in Table S5: row ''Loop close'' and ''Loop force close'').

Composite Knowledge-based Energy Function
The composite energy function is described in detail in the companion BCL::Score manuscript. Briefly, the energy functions consists of twelve individual terms for (1) amino acid pair distance clash, (2) amino acid pair distance, (3) amino acid solvation, (4) SSE pair clash, (5) SSE pair packing, (6) b-strand pairing, (7) loop length, (8) strictly enforcing loop closure, (9) radius of gyration, (10) SSE prediction for JUFO (11) SSE prediction for PSIPRED and lastly (12) contact order. The scores for amino acid solvation and SSE predictions are also computed for the unfolded part of the protein which evaluates all residues not represented in the model, using the corresponding potentials. All scoring functions are implemented within the BCL.
All knowledge based potentials have been derived from a databank that contained 3,409 high resolution x-ray crystallography protein structures compiled using the PISCES server [86]. The collected statistical representations are converted into a free energy using the inverse Boltzmann relation and applying the appropriate normalizations. The weights for individual energy functions were optimized using a benchmark of models composed of de novo folded models by Rosetta [28], BCL::Fold as well as perturbed models of native structures generated by perturbation protocol within BCL. The finalized weights for energy functions used can be found in Table S3.

Benchmark Analysis
For each BCL::Fold run of 10,000 models for each of the 66 proteins in the benchmark set, an initial filtering is done to remove any incomplete models, which was less than 3% of the models for all sets. The models produced by BCL::Fold benchmarks are evaluated by looking at the following quality measures: root-meansquare-deviation (RMSD), RMSD100 and CR. These measures are calculated over Ca atoms of all the residues in a-helices and bstrands in the models. In addition, contact order [43] values were calculated by computing the average sequence separation of contacts defined as having C b (H a2 for Glycine) atoms within 8Å distance. Relative contact order (RCO) values were calculated by normalizing contact order values by the length of the sequence. Normalized contact orders (NCOs) were calculated by dividing the square of the contact order by the length of the sequence. An additional quality measure was developed, named contact recovery (CR), which evaluates the percentage of native contacts with a minimal sequence separation of 12 residues that are recovered in the models.
BCL::Fold models have variable SSE content, as a pool of overlapping SSEs is used. These SSEs have variable length and not each SSE is considered in each model. In result, BCL::Fold models for one and the same protein have different amino acids present. Table S4 summarizes the relative coverage of modeled residues in BCL::Fold SSE-only models. The number of modeled residues is close to the number of residues in native SSEs for all benchmark proteins. Due to over-prediction of secondary structure, SSEs that were filtered from the benchmark proteins can appear in the pool and consequently in BCL::Fold pool-SSE models. This causes the pool-coverage to be usually higher than 100%.
The variable SSE content of BCL::Fold models poses a challenge when comparing these models to the native structure and also when comparing these models to Rosetta-generated models. We considered just using residues in SSEs for both, BCL::Fold and Rosetta. However, using Rosetta secondary structure definitions puts Rosetta at a disadvantage as in particular for large proteins with b-strands secondary structure is often incompletely identified in Rosetta models due to geometric imperfections. We settled on RMSD100 and contact recovery calculations for BCL models with only the SSEs that are present in the BCL models ("incomplete") and usage of "complete" models for comparison to Rosetta.

Protein Structure Prediction using Rosetta
The Rosetta [28] protein structure prediction program was used to predict 10,000 models for each of the benchmark proteins in order to provide a comparison for analysis of BCL::Fold. The models were produced using the de novo mode of Rosetta, and fragment files provided as input to Rosetta were pre-filtered to remove any fragments for homologous proteins. This was done by using the non-homolog flag of the ''make fragments'' Rosetta script. The resulting fragments were searched for proteins selected in more than 30% of the (overlapping) regions of the query sequence. These proteins were excluded for up to four iterations of ''make fragments''. The resulting Rosetta models underwent the same analysis as the models produced by BCL::Fold. Secondary structures in Rosetta models were determined using DSSP [73] and the quality calculations were completed considering C a atoms from identified a-helices and b-strands where applicable.

BCL::Fold Availability
All components of BCL::Fold, including scoring, sampling, and clustering methods are implemented as part of the BioChemical Library (BCL) that is currently being developed in the Meiler laboratory (www.meilerlab.org). BCL::Fold is freely available for academic use along with several other components of the BCL library. Details on its usage can be found in Appendix S1.    Appendix S1 BCL::Fold command line usage and file formats.

(DOCX)
Movie S1 1TQGA assembly of best model by RMSD100. Movie S10 2RB8A loop force close of best model by RMSD100.