Critical Features of Fragment Libraries for Protein Structure Prediction

The use of fragment libraries is a popular approach among protein structure prediction methods and has proven to substantially improve the quality of predicted structures. However, some vital aspects of a fragment library that influence the accuracy of modeling a native structure remain to be determined. This study investigates some of these features. In particular, we analyze the effect of using secondary structure prediction to guide fragment selection, of different fragment sizes, and of structural clustering of fragments within libraries. To have a clearer view of how these factors affect protein structure prediction, we isolated the process of model building by fragment assembly from some common limitations associated with prediction methods, e.g., imprecise energy functions and optimization algorithms, by employing an exact structure-based objective function under a greedy algorithm. Our results indicate that shorter fragments reproduce the native structure more accurately than longer ones. Libraries composed of multiple fragment lengths generate even better structures, with longer fragments proving more useful at the beginning of the simulations. The use of many different fragment sizes shows little improvement when compared to predictions carried out with libraries that comprise only three different fragment sizes. Models obtained from libraries built using only sequence similarity are, on average, better than those built with a secondary structure prediction bias. However, we found that the use of secondary structure prediction allows greater reduction of the search space, which is invaluable for prediction methods. The results of this study can serve as guidelines for the use of fragment libraries in protein structure prediction.


Introduction
It is estimated that less than 30% of human protein structures have been empirically determined, and the proportion for other species is significantly lower [1]. This is primarily due to the difficulties and costs associated with experimental techniques, such as NMR and X-ray diffraction. Meanwhile, genomes are sequenced at relatively low cost, widening the gap between known sequences and known structures and motivating the development of computational tools PLOS ONE | DOI: 10 optimal number of fragments per position, secondary structure prediction and sequence similarity. Among these, the most relevant were shown to be secondary structure prediction and sequence similarity. The author concluded that fragments of 10 residues lead to the best results and that the library should have at least 100 fragments per position. In search of the optimal fragment length, Handl and collaborators report that larger fragments are surprisingly useful, and it has been proposed that fragments of smaller sizes should be derived from them [25]. The work also supports the importance of smaller fragments in refining the structure, in particular β-sheets, and recommends further investigation of which attributes result in a better fragment library. From a set of 1261 non-redundant proteins, Baeten and collaborators [19] derived the Brix database, composed of fragments 4 to 14 residues long. Brix contains 1000-2000 conformations for each fragment length. Using only the backbone Root Mean Square Deviation (RMSD) as a scoring function and without any fragment selection criterion, i.e., using the whole fragment database, close-to-native structures were built. They showed that loop regions could be better reconstructed using smaller fragments, whereas regular secondary structures were best approximated by larger fragments. Later, it was reported that the variability of loops connecting helices is greater than that of loops connecting β-strands [26].
To further integrate structural constraints, it has been proposed that better fragment selection could be achieved if protein class information were taken into account [27]. The authors integrated class annotations predicted from sequence alone to fragment libraries and used the Rosetta ab initio protocol to model native structures, reporting up to 7% improvement in model quality.
To sample as many feasible conformations as possible, a library must contain as many structurally diverse fragments as possible, avoid redundancies and, if possible, include fragments that, at least locally, strongly correlate with the target sequence. Usually, the score from a substitution matrix (e.g., BLOSUM) is used to select which fragments will be contained in a library, and several PSP methods use secondary structure predictors as an additional criterion to select fragments. One of the most commonly used secondary structure predictors is PSIPRED [8], which has been especially successful since CASP4, when it reached an 80.6% success rate for 40 protein domains [28]. However, to our knowledge, it has never been fully evaluated how or to what extent secondary structure prediction increases the predictive potential of fragment libraries.
Fragment libraries have been shown to work well in practice by greatly improving the quality of the predicted structures, but considering how heavily current PSP methods rely on fragment libraries, it is important to obtain guidelines for their use and construction. Despite being commonly used, some essential aspects of fragment libraries that influence the correct modeling of native structures remain to be determined. This study investigates some crucial features of a fragment library in the reconstruction of the native structure, isolated from the limitations of ab initio methods, such as imprecise energy functions and optimization algorithms. Concretely, we addressed the extent to which the following features affect the assembly of tertiary structures: (1) secondary structure prediction guiding fragment selection, (2) different fragment sizes and (3) the structural clustering of fragments within libraries. All the simulations were performed with fragments excised from structures that are non-homologous to the targets, therefore emulating a free-modeling prediction.

Results and Discussion
For each protein in the test set, 30 models were built from libraries comprising fragments 3 to 20 residues long, selected by their sequence similarity with the target protein (given by the BLOSUM62 score) combined with their secondary structure prediction (given by PSIPRED confidence). These libraries were built using the Profrager server [29] (https://www.lncc.br/sinapad/Profrager). We adopted the RMSD of the main chain atoms (N, Cα, C) against the structure deposited in the PDB as the measure to evaluate the accuracy of the models. For small proteins, a model is often regarded as good if the RMSD is below 3 Å, acceptable when below 5 Å and uninformative when above 5 Å [30].

Single-size fragment libraries
The distribution of RMSDs of the models created by each individual library shows that smaller fragments have the potential to generate more accurate results (Fig 1).
Our data show that none of the models built with 12-mers or larger, and few of the models built with 10-mers to 12-mers, meet the RMSD criterion for acceptable models. Even though longer fragments may bring more empirical information into the model, thus further constraining the search space and speeding the convergence towards the native state, the models built purely from longer fragments were of lower quality than those built from smaller fragments.
There are two possible reasons for the inferior quality of the models assembled with longer fragments. The first is that they might be insufficiently represented in databases, since structural diversity grows with the number of residues [19]. Fragments from α-helical regions are of higher quality than those from β-sheet regions, which probably makes longer fragments unsuitable for correctly modeling β-sheets [31]. Ergo, longer fragments are of lower quality because known proteins lack a suitable structure for the target sequence.
The second reason is that the insertion of a longer fragment when the conformation is close to native may bring changes that, though locally favorable, could be too drastic for the rest of the structure. Taken together, the lack of diversity and the stronger perturbation may account for the less accurate models obtained using longer fragments. Fragments with 9 residues or fewer generated satisfactory models for at least half of the proteins studied (Fig 1). Fragments larger than 9 have been supported elsewhere, but the tests always involved a combination of large fragments of many different sizes, as well as smaller fragments [14,19,24]. Our data suggest that the use of fragments that large may have been overrated in the literature, as they show limited predictive capacity when used in isolation.
Given that fragments longer than 9-mers may be disregarded in fragment libraries, there is also the question of whether it is possible to remove other sizes without negatively impacting the overall quality of the final models. There is a significant difference between the results of 3-mers and 9-mers (p < 0.01), which means that larger fragments cannot reproduce the native structure with the same accuracy as smaller fragments. Nonetheless, 9-mers have been shown to improve model accuracy when combined with fragments of other sizes. Rosetta, for example, is a well-established method for protein structure prediction that traditionally uses 9-mers and 3-mers in its ab initio predictions. However, there are no significant differences between the RMSDs of the models generated by 4-mers to 8-mers (p = 0.011), suggesting a certain degree of structural redundancy in these fragments. These results could prove valuable for simulations with mixed libraries, i.e., libraries that combine several fragment sizes.

Mixed libraries
Several studies demonstrate that the use of fragments of different sizes generates closer-to-native models [14,32]. In this section, we focus on the analysis of the mixed-size libraries to obtain some insights about how to construct them, based on the results of the tests with the single-size libraries.
Since 3-mers and 9-mers are statistically distinct from each other, we selected the 9-mer to represent the "large fragment" and the 3-mer to represent the "short" one. Because 3-mers and 9-mers generate models that are significantly different, we found it prudent to bridge the gap between them with a fragment of intermediate size. Given that fragments with 4 to 8 residues do not produce models that are significantly different, any of these sizes is equally fit to be considered "medium". We selected the 6-mer as the medium-sized fragment.
To verify if a mixed library exclusively composed of 3,6,9-mers is adequate to reproduce the native structure, we generated a new set of models and compared them with those produced by mixed libraries with 3-mers to 20-mers. The comparison of the RMSDs of the best models created with 3,6,9-mers shows no significant difference to those created with 3-20-mers (Fig 2, p = 0.14), which indicates there is no need to add different sizes and that three different fragment sizes are sufficiently diverse to satisfactorily rebuild the native structure.
This indicates that only a few fragment sizes are enough to model the native structure. All other sizes seem only to add structural redundancy without bringing useful information that would justify a costly extension of the search space. When the Ramachandran plots of the 3-mer to 20-mer libraries and of the 3,6,9-mer libraries are superposed, the 3,6,9-mers cover most of the Ramachandran space occupied by the libraries with sizes 3-20 (Fig 3).
In this case, we have shown that 3,6,9-mers sufficiently cover the associated fragment conformational space, but this is by no means restricted to this specific combination of fragments. Other authors may find a different combination of sizes more suitable for a particular PSP methodology. It is also worth mentioning that, as more data are deposited in the PDB, the largest useful fragment size is expected to surpass 9. For all evaluations in the following sections, we discarded all fragments other than 3,6,9-mers.
The fact that the majority of the best models are obtained with mixed libraries indicates that mixed libraries combine desirable properties of longer and shorter fragments [19,20,23,33,34]. Despite the disadvantages of 9-mers when compared to 3-mers and 6-mers, our findings suggest they have desirable, seemingly indispensable properties for better model accuracy when used in combination with different fragment sizes. Considering that it is theoretically possible to obtain a 9-mer by stringing together three 3-mers, it seems counterintuitive that models built exclusively from 3-mers do not have the same quality as those built from mixed libraries. This is because simply adding 3-mers together does not necessarily lead to a naturally existing, clash-free 9-mer. Moreover, if there are 200 fragments per position and three positions, searching for the correct 9-mer from a set of 3-mers requires 200^3 (8,000,000) evaluations instead of 200 evaluations.
In previous work, Handl and collaborators [25] showed that longer fragments worked well for most proteins, which led to the suggestion that small fragments such as 3-mers should be derived from the longer ones. However, extracting shorter fragments from longer ones could greatly reduce structural diversity, thus undermining the usefulness of short fragments. Our simulations with the mixed libraries corroborate that there is no ideal fragment size. Rather, the best approach seems to be the combination of different sizes, as it allows shorter fragments to compensate for the shortcomings of longer ones, and vice versa. The superiority of the mixed libraries is confirmed by the tests for which single-length libraries were incapable of generating proper protein models (Fig 4). Whereas models of up to 50 residues built from 3-mers have approximately the same quality as those built from mixed libraries, the RMSDs for models built exclusively from 3-mers increase greatly with the target length (Fig 4). Such significant loss of quality is not observed for the mixed libraries.
As mixed libraries contain fragments of all three sizes, it is important to describe how these fragments are used to build the best models. We observed that the best models constructed with mixed libraries were obtained by assembling 9-mers, followed by 6-mers and finally 3-mers (Fig 5, S1 Fig). In the course of the simulations with mixed libraries, 9-mers are inserted mainly during the first stage of the optimization, because they bring more empirical information into the model, enabling the optimization algorithm to bias the structure towards the global minimum basin. However, 9-mers are expected to become less useful as the model approaches the native conformation, possibly because they cause structural changes that are too large for the later stages of the algorithm, when more subtle changes are required.
As the simulation progresses, smaller fragments begin to play a major role, and in contrast, 9-mers start losing their importance. Because the global optimization is implemented to take into consideration the whole structure, the fragments used in this stage are somewhat likely to avoid radical conformational changes. Thus, smaller fragments become more suitable for PSP methods. As 6-mers are the most used fragments during global optimization, they apparently represent the best trade-off between the amount of information brought by larger fragments while still allowing for the minimal perturbation that is expected of smaller fragments.
As previously stated, the further into the simulation, the smaller the fragment size used, so the final stage was expected to be characterized by the predominance of the smallest of all sizes, the 3-mer. In fact, the algorithm inserted 3-mers almost exclusively, indicating they are ideal for local structural refinement (Fig 5, S1 Fig).

Use of secondary structure prediction
To verify how useful secondary structure prediction is for fragment libraries, we compared the models created by assembling fragments ranked according to sequence similarity alone (BLO) and fragments that combine sequence similarity and secondary structure prediction (PSI).
Secondary structure prediction is used by nearly all successful PSP methods, as it has been found to improve model quality. It therefore came as a surprise that the majority of the models built from our PSI libraries are less accurate than the BLO models (Figs 6 and 7, S1 Table). Notably, this holds regardless of the quality of the PSIPRED prediction, even when its success rate is close to 100% (Fig 8). PSI models are certainly expected to have lower quality when the PSIPRED predictions are inaccurate, as is the case with 1AMB, where only 14% of the PSIPRED prediction is correct, but PSIPRED had a success rate of over 50% for all other tested proteins.
The quality loss of the PSI models compared to BLO models despite the high rate of success in PSIPRED predictions strongly indicates the exclusion of essential fragments and raises questions over the ubiquitous use of secondary structure prediction in fragment selection.
Are there any advantages in using secondary structure prediction in a PSP method? The search space of the problem of assembling a native structure from overlapping fragments consists of the many combinations of the dihedral angles found in the library. When fragment selection is biased by PSIPRED, a library with less structural diversity is created (Fig 9), so it spans a narrower search space. Despite the loss of desirable fragments, a more limited search space allows optimization algorithms to perform better.
The fragments were clustered based on their structural similarity to evaluate to what extent the use of secondary structure prediction allows the optimization algorithm to constrain the search space and how it influences protein modeling. Structural clustering consists of grouping fragments whose distance, according to a distance function (e.g., RMSD, DME), is within a cutoff. The expectation is to reduce the search space with a negligible loss in accuracy. The libraries were clustered by structural similarity, and new models were built using only the fragment with the highest score from each cluster (referred to as the leader).
Since PSI libraries are structurally less diverse, clustered PSI libraries resulted in approximately half the number of leaders of the BLO libraries (Fig 10).
There is a minor loss (at most 1.39 Å, Table 1) in accuracy when models are generated using only leaders, but such loss is compensated by the notable reduction of the search space from 200 fragments per residue to approximately 20 (Fig 10).
While most state-of-the-art protein structure prediction software uses one or more secondary structure prediction methods, our previous results for unclustered PSI and BLO libraries (Fig 6) show that the use of secondary structure prediction does not help to select better fragments. However, clustered PSI libraries resulted in fewer clusters while maintaining similar accuracy when compared to BLO models. In a template-free prediction scenario, the use of PSIPRED allows the optimization algorithm to explore the conformational landscape more efficiently due to the resulting smaller number of clustered fragments.
By comparing the results of clustered and unclustered libraries, we suggest that the use of secondary structure prediction does not rank fragments in a way that selects closest-to-native local conformations. Instead, we suggest the use of secondary structure prediction is relevant to PSP methods because it effectively reduces the search space by sparing only the fragments with a similar secondary structure. The fact that BLO unclustered libraries are superior to PSI unclustered libraries indicates that occasionally the use of PSIPRED sacrifices potentially relevant fragments. However, regarding cost-effectiveness, using PSIPRED for fragment selection is a beneficial strategy due to the relatively smaller number of clusters obtained from PSI libraries.

Validating libraries with Rosetta
As Profrager outputs files that are compatible with Rosetta, Rosetta was used to simulate a real ab initio scenario with BLO, PSI (9,3-mers, as used by Rosetta) and Robetta fragments. Robetta libraries were built using the Robetta server (http://robetta.bakerlab.org/fragmentsubmit.jsp). Table 2 compares the GDT-TS values of the best models created using the three different fragment libraries.
The comparison of the GDT-TS of the best structures created by Rosetta shows no statistical difference between PSI and Robetta (p = 0.37). This means PSI libraries, generated by Profrager, are a suitable replacement for Robetta libraries, as they allowed Rosetta to model the native structure with similar accuracy. Moreover, the rate of correct secondary structures (given by STRIDE [35]) is indistinguishable in practice for these libraries (Fig 11).
Although BLO fragments were shown to potentially lead to better results (Fig 6), the comparison of the GDT-TS of Robetta and BLO (ANOVA, p = 0.0007) shows a significant difference. These results confirm that, in a real ab initio scenario, fragment selection is better when biased by secondary structure prediction due to the limitations of the search algorithm. This suggests that the full potential of BLO libraries could not be exploited by the protocol used by Rosetta and corroborates the view that secondary structure prediction is of special importance in current prediction methods.
Despite the inferior results of the BLO fragments, our analysis of the accuracy of secondary structures demonstrates that the advantages brought by ranking fragments with the aid of PSIPRED are limited to structured regions (Fig 11). Both PSI and Robetta use secondary structure prediction to assemble their fragment libraries, and both had better results for helices and extended conformations (strands) when compared to BLO. In contrast, BLO libraries are more accurate in coil regions. Better results in structured regions are obtained when the fragment selection relies on secondary structure prediction (e.g., PSI, Robetta), but such libraries are prone to exhibit noticeably smaller structural diversity, to the point where some positions have only one type of secondary structure (Fig 9). Such limited fragment diversity apparently interfered with their ability to model coils correctly, making BLO the more suitable library for modeling regions that are rich in loops, such as some binding sites.
A useful fragment library should contain all possible conformational states for a local segment of the target protein, albeit without making the search space unfeasible. Local conformational diversity is a desirable trait, as it increases the chance of sampling the correct structure, but too many fragments may cause the search space to become too large for the search algorithm. In short, every position should have as few fragments as possible, but not fewer. Thence, the ideal fragment library combines both BLO and PSI fragments but uses PSI fragments to rebuild regions rich in helices and sheets, while leaving loops to be modeled by BLO fragments.

Methods
One of the best-known servers for fragment library creation is the Robetta server [36,37] (http://robetta.bakerlab.org/fragmentsubmit.jsp), which has proven useful for protein structure prediction and protein structure design. However, Robetta does not allow the level of customization required in this work, such as varying scoring functions, different fragment sizes and different fragment selection methods. Therefore, we developed a highly customizable method for fragment library generation to be used by prediction methods. Different fragment libraries can be generated for a target sequence depending on the customization options, which include (i) fragment selection criterion, (ii) fragment length and (iii) number of fragments per library.

Fragment library creation
PISCES (Protein Sequence Culling Server [38]) was initially used to extract a non-redundant, non-homologous subset of 4365 structures (X-ray or NMR-determined, 2.0 Å or better resolution) from the RCSB PDB data bank. We used a strict criterion of at most 20% sequence similarity to construct the database. Thus, structural diversity is guaranteed by avoiding the selection of several fragments from homologous proteins. Fragment libraries were created using the software Profrager (www.lncc.br/sinapad/Profrager/). Each residue is represented by three dihedral angles (ϕ, ψ and ω) and three bond angles [23] (N-Cα-C, Cα-C-N, and C-N-Cα) extracted from the non-redundant database. Fragments were excised exclusively from structures non-homologous to the target sequence (PSI-BLAST E-value > 0.05).
Fragment libraries are expected to comprise fragments from non-homologous structures that correctly model a sequential, local segment. Profrager searches the best local structures by aligning overlapping fragment sequences with the query sequence and estimates the likelihood of the fragment via an amino acid substitution matrix. Although Profrager allows the user free choice of the substitution matrix, in this work, we used the BLOSUM62 matrix as it is the de facto matrix for protein alignment.
Profrager creates a list starting at every residue of the target sequence, resulting in n-f+1 lists for a target sequence, where n is the number of residues and f is the fragment length. Each list contains 200 fragments that better fit the local structure according to the scoring function. For the sake of comparison, two different scoring functions were used to select the fragments. The first function estimates the likelihood of local fit according to the sequence similarity between the target and the fragment sequences. This similarity is given by the BLOSUM62 matrix entries for each pair of residues.
where B is the value of BLOSUM62 for the residue pair, t_i is the ith residue of the target sequence, f_i is the ith residue of the fragment, and n is the number of residues of the fragment.
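The equation referred to as Eq 1 is not reproduced in this text. Based only on the symbol definitions given here, one plausible form is the sum of substitution-matrix entries over the aligned residue pairs. The name S_seq and the absence of any normalization are assumptions made for illustration, not details confirmed by the source.

```latex
S_{\mathrm{seq}} \;=\; \sum_{i=1}^{n} B\left(t_i, f_i\right)
```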
The second function tested combines the sequence similarity score with the agreement between the secondary structure of the fragment and the secondary structure predicted for the target sequence, using the PSIPRED prediction confidence (Eq 2). Here, the secondary structure observed experimentally for the ith residue of the fragment (given by STRIDE) takes one of three states (H, E, C), sst_i (H, E, C) is the secondary structure predicted for the ith residue of the target sequence, c_i is the PSIPRED confidence, and t_i is the ith residue of the target sequence.
The final score is calculated according to Eq (3). The weights of B() and P() were determined via trial runs, in which a subset of proteins was selected and 10 runs of the algorithm were performed. The best RMSDs were obtained when the weights were 1:1 (S2 Table).
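As a rough illustration of how a sequence term and a secondary structure term could be combined with 1:1 weights, the sketch below scores one fragment against one target window. It is not Profrager's implementation: the function names, the toy substitution function (a stand-in for a BLOSUM62 lookup) and the specific form of the secondary structure term (summing the confidence wherever the observed and predicted states agree) are all assumptions.

```python
from typing import Callable, Sequence


def toy_substitution(a: str, b: str) -> int:
    """Hypothetical stand-in for a BLOSUM62 lookup: +4 for identity, -1 otherwise."""
    return 4 if a == b else -1


def sequence_score(target: str, fragment: str,
                   sub: Callable[[str, str], int] = toy_substitution) -> float:
    """Sum of substitution scores over aligned residue pairs (assumed form of B)."""
    return sum(sub(t, f) for t, f in zip(target, fragment))


def ss_score(frag_ss: str, pred_ss: str, confidence: Sequence[float]) -> float:
    """Assumed form of P: add the prediction confidence at every position where
    the fragment's observed state (H, E or C) matches the predicted state."""
    return sum(c for f, p, c in zip(frag_ss, pred_ss, confidence) if f == p)


def combined_score(target: str, fragment: str, frag_ss: str, pred_ss: str,
                   confidence: Sequence[float],
                   w_seq: float = 1.0, w_ss: float = 1.0) -> float:
    """Weighted sum of both terms; the 1:1 default mirrors the weights in the text."""
    return (w_seq * sequence_score(target, fragment)
            + w_ss * ss_score(frag_ss, pred_ss, confidence))


# Toy example: two identities and one mismatch (4 + 4 - 1 = 7), plus
# confidences 0.9 and 0.8 at the two positions where the states agree.
example = combined_score("ACD", "ACE", "HHC", "HHH", [0.9, 0.8, 0.5])
print(round(example, 3))  # 8.7
```

With a real substitution matrix, only `toy_substitution` would need to be replaced; the ranking logic stays the same.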
Robetta creates two different sets of fragments for each overlapping position of 3 and 9 residues long, respectively. As Profrager creates fragments of any size, we introduced a fragment of intermediate size (6-mers) because the distance between 3 and 9 is too broad.
For each fragment length, two different ranks were used: the first uses only the sequence similarity score given by Eq (1) and the second uses the score given by Eq (3). As it is a common practice by many PSP algorithms to combine fragments of different lengths, we also investigated the effects of using mixed libraries, containing the three fragment lengths.

Model construction by fragments assembly
A greedy algorithm was used to assemble the native structure solely from overlapping fragments to evaluate the most fundamental aspects of better fragment libraries. A cost-function based on the protein structure was used as it allows fast convergence towards the native structure.
Cost function. The Distance Matrix Error (DME, Eq 4) is used as the cost-function.
where N is the number of backbone atoms (N, Cα and C) and p_ij (respectively, q_ij) denotes the distance between atoms i and j on the predicted (respectively, experimental) structure. Two versions of DME are used in different stages of the algorithm. The local DME refers to the DME of a section of the protein, for example, a fragment of 9 residues at some position, while the global DME is the DME of the whole structure.
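Since the expression for Eq 4 is not reproduced in this text, the following sketch uses one common definition of the distance matrix error: the root mean square difference between corresponding interatomic distances, averaged over the N(N-1)/2 atom pairs. The normalization is an assumption and may differ from the one used by the authors; the function names are illustrative.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def pairwise_distances(coords: Sequence[Coord]) -> List[float]:
    """Distances for all atom pairs (i < j), in a fixed order."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def dme(predicted: Sequence[Coord], experimental: Sequence[Coord]) -> float:
    """Distance matrix error between two structures with the same atoms.

    Assumed normalization: sqrt(sum_{i<j} (p_ij - q_ij)^2 / (N (N - 1) / 2)).
    """
    if len(predicted) != len(experimental):
        raise ValueError("structures must contain the same number of atoms")
    n = len(predicted)
    if n < 2:
        return 0.0
    p = pairwise_distances(predicted)
    q = pairwise_distances(experimental)
    n_pairs = n * (n - 1) // 2
    return math.sqrt(sum((pij - qij) ** 2 for pij, qij in zip(p, q)) / n_pairs)


# One atom pair whose distance is 1 in the model and 3 in the reference.
print(dme([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (3, 0, 0)]))  # 2.0
```

Because only internal distances are compared, no superposition is needed. A local DME can be obtained by passing only the backbone atoms of the segment of interest, and a global DME by passing the whole chain.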
Search algorithm. For an n-residue protein, there are 200^(n-f+1) models that may result from all possible combinations of the f-residue fragments in a library, making an exhaustive approach unfeasible even for the smallest proteins. We developed a greedy algorithm to assemble the fragments into a close-to-native model, composed of three stages: (i) local optimization, (ii) global optimization and (iii) refinement.
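To give a sense of scale for the size of this search space, the short sketch below counts the insertion positions and the number of fragment combinations for a given protein and fragment length. The function names are illustrative; the value of 200 fragments per position is taken from the text.

```python
import math


def n_positions(n_residues: int, fragment_length: int) -> int:
    """Number of overlapping fragment start positions, n - f + 1."""
    return n_residues - fragment_length + 1


def log10_search_space(n_residues: int, fragment_length: int,
                       fragments_per_position: int = 200) -> float:
    """Base-10 logarithm of fragments_per_position ** (n - f + 1)."""
    return n_positions(n_residues, fragment_length) * math.log10(fragments_per_position)


# A 20-residue protein assembled from 9-mers has 12 insertion positions.
print(n_positions(20, 9))                   # 12
print(200 ** n_positions(20, 9))            # 4096 followed by 24 zeros
print(round(log10_search_space(20, 9), 2))  # 27.61
```

Even for this small case the count is on the order of 10^27 combinations, which is why a greedy strategy is used instead of enumeration.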
The protein starts from a stretched conformation, that is, the backbone dihedral angles are set to 180°. One position on the target sequence is randomly selected and the fragment that minimizes the local DME is inserted. Fragment insertion consists of replacing the backbone dihedral and bond angles of the model with those of the fragment. Another position is then randomly selected among the remaining positions, and its fragments are tested for local fit.
start
  until stop criterion
    while j < n residues
      select random position among n-j
      while i < n fragments
        evaluate fragment i
        if DME_local,current < DME_local,best
          insert fragment
        end if
      end while
    end while
end
For the global optimization, the algorithm was modified to allow the structure to escape local minima and sample other possible conformations. This is done by testing fragments for two different positions at the same time, and inserting both if the global DME is improved. The global optimization can be summarized as follows: two positions are randomly selected and one fragment is randomly chosen for each position. For each pair of selected positions, 200 fragment pairs are tested for global DME and the pair that minimizes the global DME is inserted. These positions are marked so they are not sampled again, and two new positions are randomly selected. After the entire sequence has been sampled, the global optimization restarts using the model built as a starting point. The stop criterion is fulfilled when the DME is not updated after an iteration, i.e., if the algorithm runs through the entire target sequence and does not insert any pair.
start
  until stop criterion
    while j < n residues
      select two random positions (a, b) among n-j
      while i < 200
        randomly select fragment a and fragment b
        evaluate fragments
        if DME_global,current < DME_global,best
          insert fragments
        end if
      end while
    end while
end
Even though the protein is already folded in a close-to-native conformation after the global optimization, the model can usually be further refined. We introduced a refinement stage that is similar to the local optimization, but the cost function used is the global DME instead of the local DME. The algorithm starts with the conformation left by the global optimization and randomly selects a position of the target sequence. All 200 fragments are tested for global fit, and the best is inserted. The selected position is marked so that it is not repeated, and a new position is randomly selected. After all positions have been selected, the algorithm restarts with the conformation left by the previous iteration and repeats the procedure. The stop criterion is fulfilled when the global DME is not updated after an iteration.
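The refinement loop described here can be sketched on a deliberately simplified toy model. In this sketch, each residue is reduced to a single number (a stand-in for its set of dihedral and bond angles), a fragment is a tuple of such numbers, and the cost function is supplied by the caller as a stand-in for the global DME. These simplifications and all names are assumptions for illustration; this is not the authors' code.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Fragment = Sequence[float]


def insert_fragment(model: List[float], position: int, fragment: Fragment) -> List[float]:
    """Return a copy of the model with the fragment's values written at position."""
    trial = list(model)
    trial[position:position + len(fragment)] = fragment
    return trial


def refine(model: List[float], library: Dict[int, List[Fragment]],
           cost: Callable[[List[float]], float],
           rng: random.Random) -> Tuple[List[float], float]:
    """Sweep all positions in random order, keeping at each position the
    fragment that lowers the cost the most; stop after a sweep with no update."""
    best_cost = cost(model)
    improved = True
    while improved:
        improved = False
        positions = list(library)
        rng.shuffle(positions)  # each position is visited once per sweep
        for pos in positions:
            for frag in library[pos]:
                trial = insert_fragment(model, pos, frag)
                trial_cost = cost(trial)
                if trial_cost < best_cost:  # accept only strict improvements
                    model, best_cost, improved = trial, trial_cost, True
    return model, best_cost


# Toy target and a squared-error cost standing in for the global DME.
native = [1.0, 2.0, 3.0, 4.0]
toy_cost = lambda m: sum((a - b) ** 2 for a, b in zip(m, native))
toy_library = {0: [(9.0, 9.0), (1.0, 2.0)],
               1: [(5.0, 5.0), (2.0, 3.0)],
               2: [(3.0, 4.0), (0.0, 0.0)]}
final_model, final_cost = refine([180.0] * 4, toy_library, toy_cost, random.Random(0))
print(final_model, final_cost)  # [1.0, 2.0, 3.0, 4.0] 0.0
```

Because every accepted insertion strictly lowers the cost and the number of fragment combinations is finite, the loop always terminates. The local optimization stage would follow the same pattern with a cost restricted to the inserted segment.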
When mixed libraries are used, few modifications are introduced on the standard algorithm. A fragment length is randomly selected at the beginning of each stage. The optimization is then performed using only the fragments of the selected length. A new fragment length is then randomly chosen among the options left and the optimization is performed again. For each optimization stage, each fragment library is used only once, in random order.
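The ordering rule for mixed libraries can be sketched as a small scheduling helper: for every stage, each fragment length is used exactly once, in a freshly shuffled order. The stage labels and the function name are illustrative assumptions.

```python
import random
from typing import Dict, List, Sequence


def stage_schedule(lengths: Sequence[int], stages: Sequence[str],
                   rng: random.Random) -> Dict[str, List[int]]:
    """For each stage, return the fragment lengths in an independent random order."""
    schedule = {}
    for stage in stages:
        order = list(lengths)
        rng.shuffle(order)  # every length appears once per stage
        schedule[stage] = order
    return schedule


plan = stage_schedule([3, 6, 9], ["local", "global", "refinement"], random.Random(1))
for stage, order in plan.items():
    print(stage, order)
```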
For each library, 30 runs of the optimization algorithm were performed.

Fragment clustering
To reduce the search space, we used an algorithm to cluster fragments and verified the effect of rebuilding the native structure solely from the fragment with the highest score in each cluster (called the cluster leader).
A DME cutoff distance is defined and the fragments, taken in descending score order, are compared to the existing cluster leaders. If the DME between a fragment and every existing leader exceeds the cutoff, the fragment becomes the leader of a new cluster; if the fragment falls within the cutoff of any existing leader, it is removed from the library and is represented by that leader. This procedure is carried out for each target sequence position. We adopted a DME cutoff of 0.05, 0.7 and 1.5 Å for the 3-mers, 6-mers and 9-mers, respectively.
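A minimal Python sketch of this leader-based clustering follows. It assumes the usual definition of the DME as the root mean square difference between corresponding interatomic distances, one coordinate triple per residue, and a higher-is-better fragment score; the function names `dme` and `cluster_leaders` are illustrative and do not come from Profrager.

```python
import math
from itertools import combinations


def dme(a, b):
    """Distance matrix error between two equal-length lists of (x, y, z) points.

    Assumes at least two points per fragment.
    """
    pairs = list(combinations(range(len(a)), 2))
    sq = sum((math.dist(a[i], a[j]) - math.dist(b[i], b[j])) ** 2 for i, j in pairs)
    return math.sqrt(sq / len(pairs))


def cluster_leaders(fragments, cutoff):
    """Return the cluster leaders for one target position.

    fragments -- list of (score, coords); higher score is better (assumed)
    cutoff    -- DME cutoff in the same units as the coordinates
    """
    leaders = []
    # Visit fragments from best to worst score.
    for score, coords in sorted(fragments, key=lambda f: f[0], reverse=True):
        # A fragment founds a new cluster only if it is beyond the cutoff
        # from every existing leader; otherwise it is dropped.
        if all(dme(coords, lead) > cutoff for _, lead in leaders):
            leaders.append((score, coords))
    return leaders
```

Because fragments are visited in descending score order, the first member of each cluster is always its best-scoring one, so no leader ever needs to be replaced.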
Test set. Tests were carried out on a set of 20 proteins divided among the mainly-α, mainly-β and α+β classes, ranging from 20 to 129 residues (Table 3).

Profrager fragments with Rosetta
To show that Profrager is an adequate replacement for the fragments generated by the Robetta server, Profrager fragments were tested with the Rosetta suite in an ab initio simulation and compared to models created using Robetta fragments. It is only possible to test Profrager 9-mers and 3-mers as these are the only fragment sizes used by Robetta.
A set of 48 proteins ranging from 54 to 148 residues was selected from the CASP9 experiment (Table 2), and for each we generated 1000 models using the ab initio-relax protocol (Rosetta v3.4). The quality of the structures was evaluated by GDT-TS (Global Distance Test Total Score [39,40], as computed by TM-score [41,42]), and only the best model (highest GDT-TS) was considered in the comparisons.
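For orientation, GDT-TS averages the percentages of residues that lie within 1, 2, 4 and 8 Å of their native positions. The Python sketch below computes that average from per-residue distances for a single, already fixed superposition. This is a simplifying assumption: the actual GDT procedure searches over many superpositions for each cutoff, so the sketch generally gives a lower bound and does not replace the TM-score program. The function names are hypothetical.

```python
def gdt_ts_fixed_superposition(distances):
    """Approximate GDT-TS (0-100) from per-residue model-to-native distances in Å.

    Assumes the model is already superposed on the native structure; the real
    GDT algorithm optimizes the superposition separately for each cutoff.
    """
    n = len(distances)
    within = sum(sum(1 for d in distances if d <= c) for c in (1.0, 2.0, 4.0, 8.0))
    return 100.0 * within / (4 * n)


def best_model(distance_sets):
    """Index of the model with the highest score, mirroring the best-of-N selection."""
    scores = [gdt_ts_fixed_superposition(d) for d in distance_sets]
    return max(range(len(scores)), key=scores.__getitem__)
```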

Conclusion
Although fragment libraries are commonly used in protein structure prediction, some of their underlying features have never been fully investigated. To obtain a clearer view of how distinct factors associated with fragment generation and use affect protein structure prediction, we isolated the process of model building by fragment assembly from some common limitations associated with prediction methods, e.g., imprecise energy functions and optimization algorithms, by employing an exact structure-based objective function under a greedy algorithm. First, we assessed the effect of varying fragment sizes in libraries and found that, when considered separately, shorter fragments produce better models than longer fragments. Nevertheless, combining fragments of all sizes into a mixed library proved to be the best strategy. When mixed libraries are used, longer fragments are invaluable at the beginning of the simulation because they bring more empirical information into the model. These findings suggest that the best approach is to use a library with different fragment sizes and to decrease the fragment size as the simulation advances, so that each size is fully exploited.
A library composed of 3-, 6- and 9-mers has the potential to reproduce the native structure with the same degree of accuracy as a library with fragments of sizes 1 to 20 residues. There is a high degree of structural redundancy among fragments of similar sizes, which makes it possible to remove the redundant sizes from the library and, as a consequence, to explore the search space more thoroughly.
Second, the fragment libraries were evaluated with respect to the fragment selection criteria. Fragments could be selected exclusively by sequence similarity given by a BLOSUM62 score (BLO libraries) or by combining the BLOSUM62 score with the secondary structure prediction confidence provided by PSIPRED (PSI libraries). The former approach yielded the models closest to the native structure. Despite the common reliance on PSIPRED predictions, our results strongly indicate that its use in fragment ranking can exclude useful fragments, which raises reservations about the widespread use of secondary structure prediction in fragment library construction.
Libraries assembled with the aid of PSIPRED were found to be less diverse, and while they may exclude some useful fragments, they allow better structural clustering and a much greater reduction of the search space. The relatively smaller number of clusters in PSI libraries shows that using secondary structure prediction for fragment selection is better suited to optimization problems in which conformational sampling is the main limiting aspect of the model construction process, as is the case in ab initio prediction.
Finally, although they lack conformational diversity, PSI libraries can lead to better results in structured regions (helices and sheets). Altogether, these results suggest that fragment libraries should combine PSI and BLO fragments to model structured regions and loops, respectively.
Profrager PSI libraries proved to be a suitable replacement for Robetta libraries, as they allow Rosetta to recreate the native structure with similar accuracy.
The results shown here will guide the development of strategies for the generation and use of fragment libraries in our protein structure prediction methods [43]. Furthermore, these results prompt the development of novel strategies to build fragment libraries by mixing the best features of sequence-similarity-based and secondary-structure-based libraries. One such strategy could be the construction of a library that increases the weight of the sequence similarity score in regions where the predicted probability of coil is high. Another possible approach could be the use of a multi-objective strategy to select the fragments, such as a Pareto strategy.
Supporting Information
S1 Table. BLO: fragment libraries generated using only sequence similarity calculated with BLOSUM62. PSI: libraries generated using sequence similarity calculated with BLOSUM62 and PSIPRED secondary structure prediction of the target sequence. (PDF)
S2 Table. Trial runs: best RMSDs (min) and average RMSDs (avg) for 10 runs with different weights for the PSIPRED score (P) relative to the BLOSUM62 score (B = 1). (PDF)
(Table 3). (TIF)