## Figures

## Abstract

Loops in proteins are flexible regions connecting regular secondary structures. They are often involved in protein functions through interacting with other molecules. The irregularity and flexibility of loops make their structures difficult to determine experimentally and challenging to model computationally. Conformation sampling and energy evaluation are the two key components in loop modeling. We have developed a new method for loop conformation sampling and prediction based on a chain growth sequential Monte Carlo sampling strategy, called Distance-guided Sequential chain-Growth Monte Carlo (DiSGro). With an energy function designed specifically for loops, our method can efficiently generate high quality loop conformations with low energy that are enriched with near-native loop structures. The average minimum global backbone RMSD for 1,000 conformations of 12-residue loops is Å, with a lowest energy RMSD of Å, and an average ensemble RMSD of Å. A novel geometric criterion is applied to speed up calculations. The computational cost of generating 1,000 conformations for each of the x loops in a benchmark dataset is only about cpu minutes for 12-residue loops, compared to *ca* cpu minutes using the FALCm method. Test results on benchmark datasets show that DiSGro performs comparably or better than previous successful methods, while requiring far less computing time. DiSGro is especially effective in modeling longer loops (– residues).

## Author Summary

Loops in proteins are flexible regions connecting regular secondary structures. They are often involved in protein functions through interacting with other molecules. The irregularity and flexibility of loops make their structures difficult to determine experimentally and challenging to model computationally. Despite significant progress made in the past in loop modeling, current methods still cannot generate near-native loop conformations rapidly. In this study, we develop a fast chain-growth method for loop modeling, called Distance-guided Sequential chain-Growth Monte Carlo (DiSGro), to efficiently generate high quality near-native loop conformations. The generated loops can be used directly for downstream applications or as candidates for further refinement.

**Citation: **Tang K, Zhang J, Liang J (2014) Fast Protein Loop Sampling and Structure Prediction Using Distance-Guided Sequential Chain-Growth Monte Carlo Method. PLoS Comput Biol 10(4):
e1003539.
doi:10.1371/journal.pcbi.1003539

**Editor: **Roland L. Dunbrack, Fox Chase Cancer Center, United States of America

**Received: **August 29, 2013; **Accepted: **February 1, 2014; **Published: ** April 24, 2014

**Copyright: ** © 2014 Tang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Funding: **This work was supported by NSF DBI 1062328 and DMS- 0800257, http://www.nsf.gov/, and by NIH 1R21GM101552, NIH GM079804 and GM086145, http://www.nih.gov/. This work was also funded by the Chicago Biomedical Consortium with support from the Searle Funds at The Chicago Community Trust, http://chicagobiomedicalconsortium.org/. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests: ** The authors have declared that no competing interests exist.

## Introduction

Protein loops connect regular secondary structures and are flexible regions on protein surface. They often play important functional roles in recognition and binding of small molecules or other proteins [1]–[3]. The flexibility and irregularity of loops make their structures difficult to resolve experimentally [4]. They are also challenging to model computationally [5], [6]. Prediction of loop conformations is an important problem and has received considerable attention [5]–[27].

Among existing methods for loop prediction, template-free methods build loop structures *de novo* through conformational search [5]–[7], [9], [10], [13], [14], [17], [18], [21], [23], [28]. Template-based methods build loops by using loop fragments extracted from known protein structures in the Protein Data Bank [11], [19], [27]. Recent advances in template-free loop modeling have enabled prediction of structures of long loops with impressive accuracy when crystal contacts or protein family specific information such as that of GPCR family is taken into account [14], [23], [25].

Loop modeling can be considered as a miniaturized protein folding problem. However, several factors make it much more challenging than folding small peptides. First, a loop conformation needs to connect two fixed ends with desired bond lengths and angles [8], [12]. Generating quality loop conformations satisfying this geometric constraint is nontrivial. Second, the complex interactions between atoms in a loop and those in its surrounding make the energy landscape around near-native loop conformations quite rugged. Water molecules, which are often implicitly modeled in most loop sampling methods, may contribute significantly to the energetics of loops. Hydrogen bonding networks around loops are usually more complex and difficult to model than those in regular secondary structures. Third, since loops are located on the surface of proteins, conformational entropy may also play more prominent roles in the stability of near-native loop conformations [29], [30]. Approaches based on energy optimization, which ignore backbone and/or side chain conformational entropies, may be biased toward those overly compact non-native structures. Despite extensive studies in the past and significant progress made in recent years, both conformational sampling and energy evaluation remain challenging problems, especially for long loops (*e.g.*, ).

In this paper, we propose a novel method for loop sampling, called Distance-guided Sequential chain-Growth Monte Carlo (DiSGro). Based on the principle of chain growth [15], [31], [32], [34], [35], the strategy of sampling through sequentially growing protein chains allows efficient exploration of conformational space [15], [34]–[37]. For example, the Fragment Regrowth via Energy-guided Sequential Sampling (FRESS) method outperformed previous methods on folding benchmark HP sequences [15], [33]. In addition to HP model [15], sequential chain-growth sampling has been used to study protein packing and void formation [35], side chain entropy [29], [38], near-native protein structure sampling [30], conformation sampling from contact maps [39], reconstruction of transition state ensemble of protein folding [40], RNA loop entropy calculation [37], and structure prediction of pseudo-knotted RNA molecules [41].

In this study, we first derive empirical distributions of end-to-end distances of loops of different lengths, as well as empirical distributions of backbone dihedral angles of different residue types from a loop database constructed from known protein structures. An empirical distance guidance function is then employed to bias the growth of loop fragments towards the -terminal end of the loop. The backbone dihedral angle distributions are used to sample energetically favorable dihedral angles, which lead to improved exploration of low energy loop conformations. Computational cost is reduced by excluding atoms from energy calculation using REsidue-residue Distance Cutoff and ELLipsoid criterion, called Redcell. Sampled loop conformations, all free of steric clashes, can be scored and ranked efficiently using an atom-based distance-dependent empirical potential function specifically designed for loops.

Our paper is organized as follows. We first present results for structure prediction using five different test data sets. We show that DiSGro has significant advantages in generating native-like loops. Accurate loops can be constructed by using DiSGro combined with a specifically designed atom-based distance-dependent empirical potential function. Our method is also computationally more efficient compared to previous methods [8], [9], [18], [22], [42]. We describe our model and the DISGRO sampling method in detail at the end.

## Results

### Test set

We use five data sets as our test sets. Test Set 1 contains loops at lengths four, eight, and twelve, for a total of loops from PDB structures, which were described in Table 2 of zRef. [8]. Test Set 2 consists of eight, eleven, and twelve-residue loops from Table C1 of Ref. [42]. Several loop structures were removed as they were nine-residue loops but mislabeled as eight-residue loops: (1awd, 55–63; 1byb, 246–254; and 1ptf, 10–18). Altogether, there are eight-residue loops. Test Set 3 is a subset of that of [5], which was used in the RAPPER and FALCm studies [10], [22]. Details of this set can be found in the “Fiser Benchmark Set” section of Ref. [10]. Test Set 4 is taken from Table A1–A6 of Ref. [42]. Test Set 5 contains fourteen, fifteen, sixteen and seventeen-residue loops from Table 3 of Ref. [23]. Test Set 1 and 2 are used for testing the capability of DiSGro and other methods in generating native-like loops. Test Set 3, 4, and 5 are used for assessing the accuracy of predicted loops based on selection from energy evaluation using our atom-based distance-dependent empirical potential function. Our results are reported as global backbone RMSD, calculated using the N, , C and O atoms of the backbone.

### Loop sampling

To evaluate our method for producing native-like loop conformations, we use Test Set 1 and 2.

We generate loops for each of the loop structures in Test Set 1 at length , , and residues, respectively. We compare our results with those obtained by CCD [8], CSJD [12], SOS [18], and FALCm [22]. The minimum RMSD among sampled loops generated by DiSGro are listed in Table 1, along with results from the four other methods.

Accurate loops of longer length are more difficult to generate. For loops with residues, DiSGro generates more accurate loops than other methods. Our method has a mean of Å for the minimum RMSD, compared to Å for FALCm, the next best method in the group [22]. The minimum RMSD of nine of the ten -residue loops have RMSD Å, while five loops of the ten generated by FALCm have RMSD Å. Compared to the CCD, CSJD, and SOS methods, our loops have significantly smaller minimum RMSD ( Å *vs* , , and Å, respectively, Table 1). The average minimum global backbone RMSD for -residue loops can be further improved when we increase the sample size of generated loop conformations. The minimum global RMSD is improved to Å, Å, and Å when the sample size is increased to 20,000, 100,000, and 1,000,000, respectively. Further improvement would likely require flexible bond lengths and angles.

For loops with residues, DiSGro has an average minimum RMSD value smaller than the CCD, CSJD, and SOS methods ( Å *vs* Å, Å, and Å, respectively, Table 1). In eight of the ten 8-residue loops, DiSGro achieves sub-angstrom accuracy (RMSD Å), although the mean of minimum RMSD of -residue loops is slightly larger than that from FALCm ( Å *vs* Å).

For loops with -residue, the mean of the minimum RMSD ( Å) by DiSGro is significantly smaller than those by the CSJD and the CCD methods ( Å and Å, respectively), and is similar to those by the SOS and FALCm methods( Å and Å, respectively). Noticeably, three of the ten loops have RMSD Å, indicating our sampling method has good accuracy for short loop modeling.

These loops can be generated rapidly. The computing time per conformation averaged over 5,000 conformations for , , and -residues is , , and using a single AMD Opteron processor of . In addition to improved average minimum RMSD, DiSGro seems to take less time than CCD (, , and on an AMD 1800+ MP processor for the , , and -residue loops), and is as efficient as SOS (, , and for the , , and -residue loops on an AMD 1800+ MP processor).

Reducing the number of trial states in DiSGro can further reduce the computing time, with some trade-off in sampling accuracy. For example, when we take , the computing time per conformation averaged over 5,000 conformations for , , and -residues is only , , and , respectively, with the average minimum RMSDs comparable to those from SOS's ( Å *vs* Å, Å *vs* Å, and Å *vs* Å for the , , and -residue loops, respectively). Although the CSJD loop closure method has faster computing time (, , and on AMD 1800+ MP processor), the speed of DiSGro is adequate in practical applications.

We compare DiSGro in generating near-native loops with Wriggling [43], Random Tweak [44], Direct Tweak [42], [45], [45], and PLOP-build [13] using Test Set 2. The minimum RMSD among loops generated by DiSGro are listed in Table 2, along with results from the other methods obtained from Table 2 in Ref. [42]. Direct Tweak and from the LoopBuilder method and our DiSGro have better accuracy in sampling than Wriggling, Random Tweak, and PLOP-build methods. For loops with 11 and 12-residues, these three methods are the only ones that can generate near-native loop structures with minimal RMSD values below Å. Among these, DiSGro outperforms in generating loops at all three lengths: the average minimal RMSD () is Å *vs.* Å for length , Å *vs.* Å for length , and Å *vs.* Å for length , respectively. Compared to the Direct Tweak sampling method, DiSGro has improved for -residue loops ( Å *vs* Å), slightly improved for -residue loops ( Å *vs* Å) and inferior for -residue loops ( Å *vs* Å). Overall, these results show that DiSGro are very effective in sampling near-native loop conformations, especially when modeling longer loops of length 11 and 12.

Our DiSGro method can generate accurate loops and has significant advantages for longer loops compared to previous methods. Using RMSD values calculated from three backbone atoms N, , and C for all loop lengths lead to the same conclusion.

### Loop structure prediction and energy evaluation

To assess the accuracy of loops selected by our specifically designed atom-based distance-dependent empirical potential function, we test DiSGro using Test Set 3 and follow the approach of reference [22] for ease of comparison. Because of the high content of secondary structures, these loops are very challenging to model. In the study of [22], backbone conformations with the best scores evaluated by DFIRE potential function [46] were retained after screening generated backbone conformations for each loop. Loop closure and steric clash removal were not enforced to the conformations. We follow the same procedure, except the DFIRE potential function is replaced by our atom-based distance-dependent empirical potential function. The ensemble of the selected backbone conformations are then subjected to the procedure of side-chain construction as described in the Section “Side-chain modeling and steric clash removal”. The loop conformations with full side-chains are then scored and ranked by the atom-based distance-dependent empirical potential function. Our results are summarized in Table 3.

We measure the average minimum backbone RMSD , the average ensemble RMSD , and the average RMSD of the lowest energy conformations of the 1,000 loop ensemble with the same length. Overall, DiSGro performs significantly better than FALCm and RAPPER in , and for all loop lengths. Compared to FALCm, DiSGro shows significant advantages in on sampling long loops of – residues. Our method has of Å compared to Å for -residue loops, Å compared to Å for -residue loops, and Å compared to Å for -residue loops, respectively. For example, as can be seen in Figure 1, the lowest energy loop (red) of a 12-residue loop in the protein 1scs (residues 199–210) has a Å RMSD to the native structure (white). The generated top five lowest energy loops are all very close to the native loop, yet are diverse among themselves.

The lowest energy loop after side-chain construction is colored in red, and the native structure is in white.

DiSGro also generates loops with smaller compared to FALCm in loops with length ranging from to , indicating DiSGro can generate ensemble of loop conformations with enriched near native conformations. Furthermore DiSGro achieves better modeling accuracy using the atom-based distance-dependent empirical potential function. Compared to FALCm, DiSGro has a of Å *vs* Å for -residue loops, Å *vs* Å for -residue loops, Å *vs* Å for -residue loops, Å *vs* Å for -residue loops, and Å *vs* Å for -residue loops, respectively.

DiSGro is also much faster than other methods. The reported typical computational cost of FALCm is cpu minutes for – residue loops on a Linux server of a 2-core Intel Xeon processor [47]. The computation cost for DiSGro method is only and 10 cpu minutes for and 12–residue loops on a single AMD Opteron processor, respectively. In addition, FALCm has a size restriction, and it only works with proteins with residues. In contrast, the overall protein size has no effect on the computational efficiency of DiSGro since the numbers of atoms for energy calculation that are retained by the ellipsoid criterion are bounded.

The LOOPER method is an accurate and efficient loop modeling method using a minimal conformational sampling method combined with energy minimization [17]. The test set used in the LOOPER study is the original Fiser data set without removal of any loops. Therefore, it is different from Test Set 3 used in the RAPPER and FALCm studies [10], [22]. For ease of comparison, we compare DiSGro to the LOOPER using the test set with –-residue loops from [17]. Our results are summarized in Table 4.

We denote and as the mean and median of backbone RMSD of the lowest energy conformations with the same loop length. Similarly, we use , and to denote the mean and median RMSD values of all-heavy atoms. DiSGro shows improved prediction accuracy compared to LOOPER in both backbone and all-heavy atom RMSD. For the loops of length 12, is Å compared to Å, while the median is Å compared to Å. It also has better all-heavy atom RMSD of Å/ Å (mean/median), compared to Å/ Å for -residue loops, Å/ Å compared to Å/ Å for -residue loops, and Å/ Å compared to Å/ Å for -residue loops.

It is worth noting that DiSGro outperforms LOOPER in speed as well. For a loop with residues, the time cost of DiSGro is minutes using a CPU versus cpu minutes using a processor according to Figure 7 in the LOOPER paper [17].

Prior publications also allowed us to compare results in loop structure predictions based on energy discrimination using Test Set 4 with results obtained using the LoopBuilder method [42]. Following [42], we generated closed loop conformations for eight-residue loops, for nine-residue loops, for ten, eleven, and twelve-residue loops, and for thirteen-residue loops. Energy calculations are carried out using our atom-based distance-dependent empirical potential function. The average RMSD of the lowest energy conformations, , are then compared between these two methods. The results are summarized in Table 5.

Compared to LoopBuilder, DiSGro has better : Å *vs* Å for -residue loops, Å *vs* Å for -residue loops, Å *vs* Å for -residue loops, Å *vs* Å for -residue loops, and Å *vs* Å for -residue loops, respectively. DiSGro has inferior performance in selecting for -residue loops ( Å *vs* Å). The average time using LoopBuilder for twelve-residue loops was around 4.5 hours or 270 minutes, while the computational time using DiSGro is around 10 minutes. Overall, DiSGro has equal or slightly better performance than LoopBuilder in average prediction accuracy of loop structures with far less computing time.

To test the feasibility of DiSGro in modeling longer loops with length , we use the Fiser -residue loops data set to generate and select low energy loop conformations. conformations with low energy are obtained. The mean of minimum backbone RMSD of loops with -residue is Å, and the median is Å. The mean/median of the backbone RMSD , and all heavy atom RMSD of the lowest energy conformations are Å/ Å and Å/ Å, respectively (Table 6).

With extensive conformational sampling using molecular mechanics force field, the Protein Local Optimization Program (PLOP) can predict highly accurate loops [13], [14], [23]. We tested DiSGro using Test Set 5 consisting of loops with length – and compared results with those using PLOP. Here the sampling and scoring processes were similar to those used in Test Set 3, except 100,000 backbone conformations were generated. We measured the average minimum backbone RMSD and the average RMSD of the lowest energy conformations . Our results are summarized in Table 7.

Loops predicted by the PLOP method have smaller compared to DiSGro [23], although DiSGro samples well and gives small of Å for -residue loops, Å for -residue loops, Å for -residue loops, and Å for -residue loops. For loops of length , the of Å is less than the reported Å using PLOP, although it is unclear whether the of loops generated by PLOP is less than Å. Overall, DiSGro is capable of successfully generating high quality near-native long loops, up to length 17. The accuracy of of loops generated by DiSGro may be further improved by using a more effective scoring function.

We also compared the computational costs of the two methods. The average computing time for DiSGro is , , , and hours for loops of lengths , , , and using a single core AMD Opteron processor , respectively, which is more than two orders of magnitude less than the time required for the PLOP method (, , , and hours for loops of length , , , and residues, respectively).

### Improvement in computational efficiency

We used a REsidue-residue Distance Cutoff and ELLipsoid criterion (Redcell) to improve the computational efficiency. To assess the effectiveness of this approach, we carry out a test using a set of 140 proteins (see discussion of the tuning set in Materials and Methods). We compared the time cost of energy calculation of generating a single loop, with and without this procedure. When the procedure is applied, we only calculate the pairwise atom-atom distance energy between atoms in loop residues and other atoms within the ellipsoid. When the procedure is not applied, we calculate energy function between atoms in loop residues and all other atoms in the rest of the protein. The computational cost of energy calculations for sampling single loops with and -residues are shown in Figure 2A and Figure 2B, respectively.

(A) The plot of computing time versus protein size show a large time saving of “Redcell-On” (red solid curve) compared to “Redcell-Off” (black dashed curve) for 12-residue loops, and (B) The plot of 6-residue loops. (C) Plot of computing time versus protein size show “Redcell-On” (red solid curve) has significantly improved computational time cost compared to “Ellipsoid-Only” (black dashed curve) and “Cutoff-Only” (green solid curve).

From Figure 1, we can see that significant improvement in computational cost is achieved. The average time cost using our procedure is reduced from to for sampling -residue loops, and to for -residue loops. In addition, this approach makes the time cost of energy calculations independent of the protein size (Figure 2A and Figure 2B), whereas the computing time without applying this procedure increases linearly with the protein size. The improvement is especially significant for large proteins. For example, to generate a -residue loop in a protein with residues, the computing time is improved from to , which is more than -fold speed-up. Detailed examination indicates that both distance cutoff and the ellipsoid criterion contribute to the computational efficiency. Furthermore, the full Redcell procedure has improved efficiency over using either “Ellipsoid Criterion Only” or “Cutoff Criterion Only”. The computing time for generating a -residue loops is when the full Redcell procedure is applied, compared to , and , when only the ellipsoid criterion and only the distance-threshold are used, respectively (Figure 2C). Furthermore, there is no loss of accuracy in energy evaluation. Overall, Redcell improves the computational cost by excluding many atoms from collision detections and energy calculations, with significant reduction in computation time, especially for large proteins.

## Discussion

In this study, we presented a novel method Distance-guided Sequential chain-Growth Monte Carlo (DiSGro) for generating protein loop conformations and predicting loop structures. Ensembles of near-native loop conformations can be efficiently generated using the DiSGro method. DiSGro has better average minimum backbone RMSD, , compared to other loop sampling methods. For example, is Å for 12-residue loops when using DiSGro, while the corresponding values are Å, Å, Å, and Å when using the CCD, CSJD, SOS, and the FALCm method.

DiSGro also performs well in identifying native-like conformations using atom-based distance-dependent empirical potential function. In comparison with other similar loop modeling methods, DiSGro demonstrated improved modeling accuracy, in terms of an average RMSD of the lowest energy conformations for the more challenging task of sampling longer loops of – residues. For example, DiSGro outperforms FALCm [22] ( Å *vs* Å) and LOOPER [17] ( Å *vs* Å) in predicting -residue loops, while taking less computing time ( minutes *vs* minutes for FALCm and minutes for LOOPER. Compared to LoopBuilder [42], DiSGro also has better : For -residue loops, the is Å using DiSGro, but is Å when using the Loop Builder. The average computing time is also faster when using DiSGro: it takes about minutes to predict structures of -residue loops and minutes for -residue loops. DiSGro also works well for short loops, although this may be largely a reflection of the underlying analytical closure method [12].

There are a number of directions for further improvement. DiSGro can be further improved by adding fragments of peptides when growing loops instead of adding individual residues. Fragment-based approach has been widely used in protein structure prediction [48]–[51] and specifically in loop structure prediction [21]. It is straightforward to apply the strategy described in this study for fragment-based growth, and it will likely lead to improved sampling efficiency further and enable longer loops to be modeled. Furthermore, the energy function employed here can be further improved by optimization such as those obtained by training with challenging decoy loops using nonlinear kernel [52], and/or using rapid iterations through a physical convergence function [53], [54]. In addition, DiSGro is compatible with different loop closure methods [8], [12], [22], and experimenting with other closure strategy may also lead to further improvement.

An efficient loop sampling method such as DiSGro can help to improve overall modeling of loop structures. Currently, the hierarchical approach of the Protein Local Optimization Program (PLOP) [13], [14], [23] gives excellent accuracy in protein loop modeling, but requires significant computational time. The average time cost of modeling a -residue loop is about 4–5 days [23]. Kinematic closure (KIC) method can also make very accurate predictions of -residue loops [21]. However, KIC also requires substantial computation, with about CPU hours on a single Opteron processor for predicting 12-residue loops [21]. As suggested earlier by Spassov et al [17], an efficient loop modeling method combined with energy minimization may overcome the obstacle of high computational cost. By generating high quality initial structures using DiSGro, near native conformations of loops can be used as candidates for further refinement.

## Materials and Methods

### Protein structures representation

All heavy atoms in the backbone and side chain of a protein loop are explicitly modeled. The bond lengths and angles are taken from standard values specific to residue and atom type [55]. The backbone dihedral angles and side chain dihedral angles constitute all the degrees of freedom (DOFs) in our model.

### Distance-guided Sequential chain-Growth Monte Carlo (DiSGro)

In order to efficiently generate adequate number of native-like loop conformations, we have developed a Distance-guided Sequential chain-Growth Monte Carlo (DiSGro) method.

Let the loop to be modeled begins at residue and ends at residue . The sequence of the positions of backbone heavy atoms from atom of residue to () atom of residue are unknown and need to be generated. We assume that the backbone atoms before and after this fragment are known. Coordinates of side chain atoms are also unknown and need to be generated if the coordinates of the atoms they are attached to are unknown.

At each step of the chain growth process, we generate three consecutive backbone atoms continuing from the backbone atom sampled at the previous step. At the -th growth step (), the three backbone atoms are atom of residue , atom of residue , and atom of residue (Figure 3). The coordinates of the three atoms, , and , are denoted as , , and , respectively. The dihedral angles that determine the coordinate of atoms are sampled from a normal distribution with mean and standard deviation . In the next section, we describe in detail in sampling of the dihedral angles , which determine the coordinates of the and the atoms.

Atom has to be on the circle . The position of the atom of residue is determined by , which is based on known distance and the conditional distribution of . Once is sampled, can be placed on two positions with equal probabilities. Here is the selected position of . (yellow ball) is placed at the position alternative to . Similarly, the atom has to be on the circle and its position is determined by in a similar fashion.

#### Sampling backbone angles.

Without loss of generality, we describe the sampling procedure for and atoms at the -th growth step. is generated first, followed by . Denote the distance between and as , and the distance between and as . Since the bond angle formed by the and bonds is fixed, and the bond length is also fixed, will be located on a circle (Figure 3):(1)Given a fixed , can be placed on two positions and on circle (Figure 3, and are labeled as and , respectively.) As the probability for placing on either position is about equal based on our analysis, we randomly select one position to place atom .

In principle, sampling from the empirical distributions of and mapping back to should encourage the growth of loops to connect to the terminal atom. Further analysis of the empirical distribution of given shows that can be very informative for sampling in some cases. This lead us to design the sampling of based on the conditional distribution of . See below for details.

Generating atom is similar to generating , only instead of is placed on a circle :(2)where is the bond length between atom and atom , and the distance between and is . Similarly, atom is placed by sampling condition on from the empirical conditional density . We repeat this process times to generate trial positions of , , and .

#### Sampling and from conditional distributions.

We sample from the conditional distribution to obtain the location of atom. We first construct the empirical joint distribution by collecting pairs over all loops in a loop database derived from the CulledPDB database (version 11118, at 30% identity, 2.0 Å resolution, and with ) [56]. From the 6,521 protein structures in the CulledPDB, we remove PDB structures which appear in our test data set. For the rest of protein structures, loop regions were identified using the secondary structure information either directly from the PDB records or from classification provided by the DSSP software [57]. All random coil regions, including -helices and -strands with length amino acids, are included in our database. In total, we have loop structures.

For each set of loops with the same residue separation , are Winsorised at level [58]. Specifically, the extreme values above are replaced by the values at the percentile. We then use a nonparametric two-dimensional Gaussian kernel density estimator to construct a smooth bivariate distribution based on collected data. To estimate the probability density at a point , we use the observed pairs of data from the database to derive the density function , which takes the form of:(3)where is the symmetric and positive definite bandwidth matrix, is a bivariate gaussian kernel function:(4)

To construct the bandwidth matrix , we calculate the standard deviation of the pairs of . The corresponding entry in the bandwidth matrix is set as . Similarly, is set as . The bandwidth matrix is then assembled as [59]:(5)We partition the domain of into a grid with 32 grid points in each direction. are estimated at the grid points, and interpolated by a bilinear function elsewhere. Conditional distribution is constructed from the joint distribution when is fixed. is sampled from . We follow the same procedure to construct , which is used to sample .

#### Backbone dihedral angle distributions from the loop database.

Although the empirical conditional distributions can efficiently guide chain growth to generate properly connected loop conformations, the dihedral angles of the loops are often not energetically favorable. As a result, conditional distributions described above alone are not sufficient in generating near native loop conformations.

The problem can be alleviated by an additional step of selecting a subset of loops with low-energy dihedral angles from generated samples. We use empirical distributions of the loop dihedral angles obtained from the loop database. Specifically, for the sampled positions of the current residue of type with dihedral angles , we select samples following an empirically derived backbone dihedral angle distribution . Here is derived from the same protein loop structure database for conditional distance distributions and constructed by counting the frequencies of pairs for each residue type.

#### Determining the number of trial states at each growth step for backbone torsion angles.

It is important to determine the appropriate size of trial states and for generating backbone conformations, as small and values may lead to insufficient sampling, resulting in inaccurate loop conformations. On the other hand, very large and values will require significantly more computational time, without significant gain in accuracy.

We use a data set, denoted as *tuning-set* to determine the optimal values of parameters and for sampling backbone conformations. Part of this data set comes from that of Soto *et al* [42]. The rest are randomly selected from pre-compiled CulledPDB (with sequence identity, Å resolution, and ). It contains a total of loops, with loops of length 6, of length 8, of length 10, and of length 12.

The optimal values of and are determined as according to the test result on tuning-set (Figure 4).

We generated samples for each loop. The mean value of the minimum RMSD of the loops (-axis) is plotted against the size of trial samples (-axis) for different choices of . For control, results obtained without sampling torsion angles (, control) are also plotted. The backbone (N, , C and O atoms) RMSD in this paper is calculated by fixing the rest of the protein body.

#### Placement of backbone atoms.

From the sampled dihedral angle pairs , we can calculate the coordinates of atom and for all of the trials. atoms are sampled by generating random dihedral angles from a normal distribution with mean and standard deviation of . Calculating the coordinates of backbone atoms using standard bond length and angle values is straightforward.

The coordinates of backbone atoms of the samples at this particular growth step can be denoted as . For simplicity, we denote the coordinates of the four atoms at residue as and the -th sample as . We sample one of them using an energy criterion. The probability for is defined bywhere is the effective temperature, and is the interaction energy of the four atoms defined by with the remaining part of the protein, including those loop atoms sampled in previous steps. The energy function is an atomic distance-dependent empirical potential function constructed from the loop database, which is effective in detecting steric clashes and efficient to compute. Fragments with steric clashes are rarely drawn because of their high energy values. In summary, the coordinates of the four backbone atoms, , is drawn from the following joint distribution at this step:(6)Altogether, () backbone dihedral angle combinations need to be sampled. When the growing end is three residues away from the -terminal anchor atom of the loop, , we apply the CSJD analytical closure method to generate coordinates of the remaining backbone atoms [12]. Small fluctuations of bond lengths, angles, and dihedral angles are introduced to the analytical closure method to increase the success rate of loop closure.

### Improving computational efficiency

To reduce computational cost of calculating atom-atom distances in energy evaluation, we use a procedure, REsidue-residue Distance Cutoff and ELLipsoid criterion (Redcell) to reduce computational time.

#### Residue-residue distance cutoff.

The residue-residue distance cutoff is used to exclude residues far from the loop energy calculation. Instead of a universal cutoff value, such as the Å distance used in reference [51], we use a residue-dependent distance cutoff value. The residue-residue distance cutoff is assigned to be , where and are the effective radii of residue and , respectively. For one residue type, effective radii is the distance between residue geometrical center and the heavy atom which is farthest away from the residue geometrical center. is a constant set to Å. For a residue in the loop region and residue in the non-loop region, we calculate the residue-residue distance , where and are the geometric centers of residue and , respectively. If , all of the atoms in residue are excluded from energy calculation. This residue-dependent cutoff is more accurate and ensures close residues are included.

#### Ellipsoid criterion.

The basic idea of ellipsoid criterion is to construct a symmetric ellipsoid such that all atoms that need to be considered for energy calculation during loop sampling are enclosed in the ellipsoid. Atoms that are outside of the ellipsoid can then be safely excluded. The starting and ending residues of a loop naturally serve as the two focal points of the ellipsoid. Intuitively, all backbone atoms of a loop must be within an ellipsoid. Formally, we define a set of points , the sum of whose distances to the two foci is less than , defined as the sum of the backbone bond lengths of the loop of length :where and are the two focal points of the ellipsoid. The symmetric ellipsoid () can be written as:(7)where and correspond to the semi-major axis and semi-minor axis of the symmetric ellipsoid, respectively. To incorporate the effects of side chain atoms, we enlarge the ellipsoid by the amount of the maximum side-chain length . Furthermore, we assume that any atom can interact with a loop atom if it is within a distance cut-off of . As a result, the overall enlargement of the ellipsoid is . The final definition of the enlarged ellipsoid for detecting possible atom-atom interactions is given by Eqn (7), with(8)and(9)where is determined by the equation , and by (see Figure 5B).

(A) Three dimensional view of a point locating on the ellipsoid constructed from the total loop length and the two foci and . (B) Two dimensional view along through the -axis of the ellipsoid, with and (dark gray). is along -axis, not shown. The maximum side-chain length is denoted as and the distance cut-off of interaction is . The enlarged ellipsoid, which has updated and , is also shown (light gray).

For any atom in the protein, if the sum of its distances to the two foci points is greater than , this atom is permanently excluded from energy calculations. The computational cost to enforce this criterion depends only on the loop length and is independent of the size the protein, once the rest of the residues have been examined using the ellipsoid criterion. This improves our computing efficiency significantly, especially for large proteins. This criterion also helps to prune chain growth by terminating a growth attempt if the placed atoms are outside the ellipsoid.

### Side-chain modeling and steric clash removal

Side chains are built upon completion of backbone sampling of a loop. For the -th residue of type , we denote the degrees of freedom (DOFs) for its side chain as . DOFs of side chain residues depend on the residue types, *e.g.* Arg has four dihedral angles (), with (). Val only has one dihedral angle (), with (). Each DOFs is discretized into bins of , and only bins with non-zero entries for all loop residues in the loop database are retained.

We sample trial states of side chains from the empirical distribution obtained from the loop database. One of trials is then chosen according to the probability calculated by the empirical potential. Denote the side chain fragment for the -th residue as , we select following the probability distribution:where is the interaction energy of the newly added side chain fragment with the remaining part of the protein, and is the effective temperature.

When there are steric clashes between side chains, we rotate the side-chain atoms along the axis for all residue types except Pro. For Pro, we use the axis for rotation. We consider two atoms to be in steric clash if the ratio of their distance to the sum of their van der Waals radii is less than [13].

### Potential function

To evaluate the energy of loops, we develop a simple atom-based distance-dependent empirical potential function, following well-established practices [46], [52], [60]–[66]. Empirical energy functions developed from databases have been shown to be very effective in protein structure prediction, decoy discrimination, and protein-ligand interactions [54], [63], [64], [67]–[71]. As our interest is modeling the loop regions, the atomic distance-dependent empirical potential is built from loop structures collected in the PDB [72].

Instead of using detailed atom types associated with the amino acids, we group all heavy atoms into groups, similar to the approach used in Rosetta [50]. The side-chain atom types comprise six carbon types, six nitrogen types, three oxygen types, and one sulfur type. The backbone types are N, , C, and O. This simplified scheme helps to alleviate the problem of sparsity of observed data for certain parameter values. For an atom in the loop region of atom type and an atom of atom type , regardless whether is in the loop region, the distance-dependent interaction energy is calculated as :(10)where denotes the interaction energy between a specific atom pair at distance , and are the observed probability of this distance-dependent interaction from the loop database and the expected probability from a random model, respectively.

The observed probability is calculated as: (11)where is the observed count of pairs found in the loop structures with the distance falling in the predefined bins. We use a total of bins for , ranging from Å to Å, with the bin width set to Å. ranging from Å to Å is treated as one bin. Here , where is the number of loops in our loop database, is the observed number of pairs at the distance of in the -th loop. is the observed total number of all atom pairs in the loop database regardless of the atom types and distance, namely, .

The expected random distance-dependent probability of this pair is calculated based on sampled loop conformations, called decoys. It is calculated as: (12)where is the expected number of () pairs averaged over all decoy loop conformations of all target loops in the loop database. Here is the number of pairs at distance in the -th generated loop conformations for the -th loop. is the number of decoys generated for a loop, which is set to . is the number of loops in our loop database. is the total number of all atom pairs in the reference state, .

## Acknowledgments

We thank Drs. Youfang Cao, Joe Dundas, David Jimenez Morales, Hammad Naveed, Hsiao-Mei Lu, and Gamze Gursoy, Meishan Lin, Yun Xu, Jieling Zhao for helpful discussions.

## Author Contributions

Conceived and designed the experiments: KT JZ JL. Performed the experiments: KT JZ. Analyzed the data: KT JZ JL. Wrote the paper: KT JZ JL.

## References

- 1. Bajorath J, Sheriff S (1996) Comparison of an antibody model with an x-ray structure: The variable fragment of BR96. Proteins: Structure, Function, and Bioinformatics 24: 152–157.
- 2. Streaker E, Beckett D (1999) Ligand-linked structural changes in the escherichia coli biotin repressor: The significance of surface loops for binding and allostery. Journal of molecular biology 292: 619–632.
- 3. Myllykoski M, Raasakka A, Han H, Kursula P (2012) Myelin 2′, 3′-cyclic nucleotide 3′-phosphodiesterase: active-site ligand binding and molecular conformation. PloS one 7: e32336.
- 4.
Lotan I, Van Den Bedem H, Deacon A, Latombe J (2004) Computing protein structures from electron density maps: The missing loop problem. In: Workshop on the Algorithmic Foundations of Robotics (WAFR). pp. 153–68.
- 5. Fiser A, Do R, Šali A (2000) Modeling of loops in protein structures. Protein science 9: 1753–1773.
- 6. Sellers B, Zhu K, Zhao S, Friesner R, Jacobson M (2008) Toward better refinement of comparative models: predicting loops in inexact environments. Proteins: Structure, Function, and Bioinformatics 72: 959–971.
- 7. van Vlijmen H, Karplus M (1997) PDB-based protein loop prediction: parameters for selection and methods for optimization1. Journal of molecular biology 267: 975–1001.
- 8. Canutescu A, Dunbrack R Jr (2003) Cyclic coordinate descent: A robotics algorithm for protein loop closure. Protein Science 12: 963–972.
- 9. de Bakker P, DePristo M, Burke D, Blundell T (2003) Ab initio construction of polypeptide fragments: Accuracy of loop decoy discrimination by an all-atom statistical potential and the amber force field with the generalized born solvation model. Proteins: Structure, Function, and Bioinformatics 51: 21–40.
- 10. DePristo M, de Bakker P, Lovell S, Blundell T (2003) Ab initio construction of polypeptide fragments: efficient generation of accurate, representative ensembles. Proteins: Structure, Function, and Bioinformatics 51: 41–55.
- 11. Michalsky E, Goede A, Preissner R (2003) Loops In Proteins (LIP)–a comprehensive loop database for homology modelling. Protein engineering 16: 979–985 michalsky2003.
- 12. Coutsias E, Seok C, Jacobson M, Dill K (2004) A kinematic view of loop closure. Journal of computational chemistry 25: 510–528.
- 13. Jacobson M, Pincus D, Rapp C, Day T, Honig B, et al. (2004) A hierarchical approach to all-atom protein loop prediction. Proteins: Structure, Function, and Bioinformatics 55: 351–367.
- 14. Zhu K, Pincus D, Zhao S, Friesner R (2006) Long loop prediction using the protein local optimization program. Proteins: Structure, Function, and Bioinformatics 65: 438–452.
- 15. Zhang J, Kou S, Liu J (2007) Biopolymer structure simulation and optimization via fragment regrowth monte carlo. The Journal of chemical physics 126: 225101.
- 16. Cui M, Mezei M, Osman R (2008) Prediction of protein loop structures using a local move monte carlo approach and a grid-based force field. Protein Engineering Design and Selection 21: 729–735.
- 17. Spassov V, Flook P, Yan L (2008) LOOPER: a molecular mechanics-based algorithm for protein loop prediction. Protein Engineering Design and Selection 21: 91–100.
- 18. Liu P, Zhu F, Rassokhin D, Agrafiotis D (2009) A self-organizing algorithm for modeling protein loops. PLoS computational biology 5: e1000478.
- 19. Hildebrand P, Goede A, Bauer R, Gruening B, Ismer J, et al. (2009) Superlooper–a prediction server for the modeling of loops in globular and membrane proteins. Nucleic acids research 37: W571–W574.
- 20. Karmali A, Blundell T, Furnham N (2009) Model-building strategies for low-resolution x-ray crystallographic data. Acta Crystallographica Section D: Biological Crystallography 65: 121–127.
- 21. Mandell D, Coutsias E, Kortemme T (2009) Sub-angstrom accuracy in protein loop reconstruction by robotics-inspired conformational sampling. Nature methods 6: 551–552.
- 22. Lee J, Lee D, Park H, Coutsias E, Seok C (2010) Protein loop modeling by using fragment assembly and analytical loop closure. Proteins: Structure, Function, and Bioinformatics 78: 3428–3436.
- 23. Zhao S, Zhu K, Li J, Friesner R (2011) Progress in super long loop prediction. Proteins 79 (10) 2920–35.
- 24. Arnautova Y, Abagyan R, Totrov M (2011) Development of a new physics-based internal coordinate mechanics force field and its application to protein loop modeling. Proteins: Structure, Function, and Bioinformatics 79: 477–498.
- 25. Goldfeld D, Zhu K, Beuming T, Friesner R (2011) Successful prediction of the intra-and extracellular loops of four g-protein-coupled receptors. Proceedings of the National Academy of Sciences 108: 8275–8280.
- 26. Subramani A, Floudas C (2012) Structure prediction of loops with fixed and flexible stems. The Journal of Physical Chemistry B 116: 6670–6682.
- 27. Fernandez-Fuentes N, Fiser A (2013) A modular perspective of protein structures: application to fragment based loop modeling. Methods in molecular biology (Clifton, NJ) 932: 141.
- 28. Bruccoleri R, Karplus M (1987) Prediction of the folding of short polypeptide segments by uniform conformational sampling. Biopolymers 26: 137–168.
- 29. Zhang J, Liu J (2006) On side-chain conformational entropy of proteins. PLoS computational biology 2: e168.
- 30. Zhang J, Lin M, Chen R, Liang J, Liu J (2007) Monte carlo sampling of near-native structures of proteins with applications. PROTEINS: Structure, Function, and Bioinformatics 66: 61–68.
- 31. Rosenbluth M, Rosenbluth A (1955) Monte carlo calculation of the average extension of molecular chains. The Journal of Chemical Physics 23: 356.
- 32.
Grassberger P (1997) Pruned-enriched rosenbluth method: Simulations of
*θ*polymers of chain length up to 1 000 000. Physical Review E 56: 3682. - 33.
Wong SWK (2013) Statistical computation for problems in dynamic systems and protein folding. PhD dissertation, Harvard University.
- 34. Liu J, Chen R (1998) Sequential Monte Carlo methods for dynamic systems. Journal of the American statistical association 1032–1044.
- 35. Liang J, Zhang J, Chen R (2002) Statistical geometry of packing defects of lattice chain polymer from enumeration and sequential monte carlo method. The Journal of chemical physics 117: 3511.
- 36.
Liu J (2008) Monte Carlo strategies in scientific computing. Springer Verlag.
- 37. Zhang J, Lin M, Chen R, Wang W, Liang J (2008) Discrete state model and accurate estimation of loop entropy of RNA secondary structures. The Journal of chemical physics 128: 125107.
- 38. Zhang J, Chen Y, Chen R, Liang J (2004) Importance of chirality and reduced flexibility of protein side chains: A study with square and tetrahedral lattice models. The Journal of chemical physics 121: 592.
- 39. Lin M, Lu H, Chen R, Liang J (2008) Generating properly weighted ensemble of conformations of proteins from sparse or indirect distance constraints. The Journal of chemical physics 129: 094101.
- 40. Lin M, Zhang J, Lu H, Chen R, Liang J (2011) Constrained proper sampling of conformations of transition state ensemble of protein folding. Journal of Chemical Physics 134: 75103.
- 41. Zhang J, Dundas J, Lin M, Chen R, Wang W, et al. (2009) Prediction of geometrically feasible three-dimensional structures of pseudoknotted RNA through free energy estimation. RNA 15: 2248–2263.
- 42. Soto C, Fasnacht M, Zhu J, Forrest L, Honig B (2008) Loop modeling: Sampling, filtering, and scoring. Proteins: Structure, Function, and Bioinformatics 70: 834–843.
- 43. Cahill S, Cahill M, Cahill K (2003) On the kinematics of protein folding. Journal of computational chemistry 24: 1364–1370.
- 44. Shenkin P, Yarmush D, Fine R, Wang H, Levinthal C (1987) Predicting antibody hypervariable loop conformation. i. ensembles of random conformations for ringlike structures. Biopolymers 26: 2053–2085.
- 45. Xiang Z, Soto C, Honig B (2002) Evaluating conformational free energies: the colony energy and its application to the problem of loop prediction. Proceedings of the National Academy of Sciences 99: 7432–7437.
- 46. Zhou H, Zhou Y (2002) Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Science 11: 2714–2726.
- 47. Ko J, Lee D, Park H, Coutsias E, Lee J, et al. (2011) The FALC-loop web server for protein loop modeling. Nucleic acids research 39: W210–W214.
- 48. Simons K, Kooperberg C, Huang E, Baker D, et al. (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and bayesian scoring functions. Journal of molecular biology 268: 209–225.
- 49. Rohl C, Strauss C, Misura K, Baker D, et al. (2004) Protein structure prediction using rosetta. Methods in enzymology 383: 66.
- 50. Sheffler W, Baker D (2010) Rosettaholes2: A volumetric packing measure for protein structure refinement and validation. Protein Science 19: 1991–1995.
- 51. Leaver-Fay A, Tyka M, Lewis S, Lange O, Thompson J, et al. (2011) Rosetta3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol 487: 545–574.
- 52. Hu C, Li X, Liang J (2004) Developing optimal non-linear scoring function for protein design. Bioinformatics 20: 3080–3098.
- 53. Thomas P, Dill K (1996) An iterative method for extracting energy-like quantities from protein structures. Proceedings of the National Academy of Sciences 93: 11628–11633.
- 54. Huang S, Zou X (2011) Statistical mechanics-based method to extract atomic distance-dependent potentials from protein structures. Proteins: Structure, Function, and Bioinformatics 79: 2648–2661.
- 55. Engh R, Huber R (1991) Accurate bond and angle parameters for x-ray protein structure refinement. Acta Crystallographica Section A: Foundations of Crystallography 47: 392–400.
- 56. Wang G, Dunbrack R (2003) Pisces: a protein sequence culling server. Bioinformatics 19: 1589–1591.
- 57. Kabsch W, Sander C (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22: 2577–2637.
- 58. Lewis D (2008) Winsorisation for estimates of change. SURVEY METHODOLOGY BULLETIN-OFFICE FOR NATIONAL STATISTICS- 62: 49.
- 59.
Bowman A, Azzalini A (1997) Applied smoothing techniques for data analysis: the kernel approach with S-Plus illustrations, volume 18. Oxford University Press, USA.
- 60. Sippl M (1990) Calculation of conformational ensembles from potentials of mena force. Journal of molecular biology 213: 859–883.
- 61. Miyazawa S, Jernigan R, et al. (1996) Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. Journal of molecular biology 256: 623–644.
- 62. Lu H, Skolnick J (2001) A distance-dependent atomic knowledge-based potential for improved protein structure selection. Proteins: Structure, Function, and Bioinformatics 44: 223–232.
- 63. Li X, Hu C, Liang J (2003) Simplicial edge representation of protein structures and alpha contact potential with confidence measure. Proteins: Structure, Function, and Bioinformatics 53: 792–805.
- 64. Zhang J, Chen R, Liang J (2005) Empirical potential function for simplified protein models: Combining contact and local sequence–structure descriptors. Proteins: Structure, Function, and Bioinformatics 63: 949–960.
- 65. Shen M, Sali A (2006) Statistical potential for assessment and prediction of protein structures. Protein Science 15: 2507–2524.
- 66.
Li X, Liang J (2007) Knowledge-based energy functions for computational studies of proteins. In: Computational methods for protein structure prediction and modeling, Springer. pp. 71–123.
- 67. Samudrala R, Moult J (1998) An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. Journal of molecular biology 275: 895–916.
- 68.
Zhang J, Chen R, Liang J (2004) Potential function of simplified protein models for discriminating native proteins from decoys: Combining contact interaction and local sequence-dependent geometry. In: Engineering in Medicine and Biology Society, 2004. IEMBS'04. 26th Annual International Conference of the IEEE. IEEE, volume 2, pp. 2976–2979.
- 69. Zhang C, Liu S, Zhou Y (2004) Accurate and efficient loop selections by the DFIRE-based all-atom statistical potential. Protein science 13: 391–399.
- 70. Huang S, Zou X (2006) An iterative knowledge-based scoring function to predict protein–ligand interactions: I. derivation of interaction potentials. Journal of computational chemistry 27: 1866–1875.
- 71. Zimmermann M, Leelananda S, Gniewek P, Feng Y, Jernigan R, et al. (2011) Free energies for coarse-grained proteins by integrating multibody statistical contact potentials with entropies from elastic network models. Journal of structural and functional genomics 12: 137–147.
- 72. Bernstein F, Koetzle T, Williams G, Meyer E Jr, Brice M, et al. (1977) The protein data bank: a computer-based archival file for macromolecular structures. Journal of molecular biology 112: 535–542.