Combining Evolutionary Information and an Iterative Sampling Strategy for Accurate Protein Structure Prediction

Recent work has shown that the accuracy of ab initio structure prediction can be significantly improved by integrating evolutionary information in the form of intra-protein residue-residue contacts. Following this seminal result, much effort has gone into improving contact predictions. However, there is also a substantial need to develop structure prediction protocols tailored to the type of restraints gained from contact predictions. Here, we present a structure prediction protocol that combines evolutionary information with the resolution-adapted structural recombination approach of Rosetta, called RASREC. Compared to the classic Rosetta ab initio protocol, RASREC achieves improved sampling, better convergence and higher robustness against incorrect distance restraints, making it the ideal sampling strategy for the stated problem. To demonstrate the accuracy of our protocol, we tested the approach on a diverse set of 28 globular proteins. Our method converges for 26 out of the 28 targets and improves the average TM-score of the entire benchmark set from 0.55 to 0.72 when compared to the top-ranked models obtained by the EVFold web server using identical contact predictions. Using a smaller benchmark, we furthermore show that the prediction accuracy of our method is only slightly reduced when the contact prediction accuracy is comparatively low. This observation is of special interest for protein sequences that have only a limited number of homologs.


1) Contact Prediction and Restraint File Generation
The contact predictions used in the manuscript were generated with the EVFold web server (available at http://evfold.org/evfold-web/newprediction.do) using standard parameters. The results can be downloaded as a compressed folder, which is subdivided into several subdirectories. The all-by-all residue pairing scores are stored in {jobname}_{scoringmethod}.txt in the ev_couplings folder. For the standard scoring method (PLM), the file is named {jobname}_PLM.txt. An example prediction for 1wvn is provided in inputs/ev_couplings/1wvn_PLM.txt. From this score file, the l top-ranked residue pairing scores having a minimum distance of 5 residues are extracted and translated into Rosetta-specific distance restraints. This can be done with the following two steps:

2) Structure Prediction with the RASREC protocol
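The contact selection described in Section 1 (top-ranked pairs with a minimum sequence separation, translated into distance restraints) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the script used in the manuscript: it assumes each line of the score file holds whitespace-separated columns `i aa_i j aa_j <unused> score`, that "a minimum distance of 5 residues" means |i - j| >= 5 in sequence, and that restraints are written as Rosetta AtomPair lines. The atom name and the BOUNDED parameters are placeholders, and all function names are hypothetical.

```python
def parse_couplings(lines):
    """Parse score lines, assumed to be 'i aa_i j aa_j <unused> score'."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        i, j, score = int(fields[0]), int(fields[2]), float(fields[5])
        pairs.append((i, j, score))
    return pairs


def select_top_contacts(pairs, n_top, min_separation=5):
    """Return the n_top highest-scoring pairs with |i - j| >= min_separation."""
    eligible = [p for p in pairs if abs(p[0] - p[1]) >= min_separation]
    eligible.sort(key=lambda p: p[2], reverse=True)
    return eligible[:n_top]


def to_restraint_lines(contacts, atom="CA", lower=1.5, upper=8.0, sd=1.0):
    """Format contacts as AtomPair lines; atom and bounds are placeholders."""
    return [
        f"AtomPair {atom} {i} {atom} {j} BOUNDED {lower} {upper} {sd} NOE"
        for i, j, _ in contacts
    ]


# Toy input in the assumed column layout (not real EVFold output).
example = [
    "1 M 2 K 0 0.90",   # too close in sequence, filtered out
    "1 M 8 L 0 0.75",
    "3 A 20 V 0 0.40",
    "5 G 30 I 0 0.60",
]
top = select_top_contacts(parse_couplings(example), n_top=2)
for restraint in to_restraint_lines(top):
    print(restraint)
```

For a real target, `n_top` would be set to the number of contacts l used in the protocol, and the restraint function would be replaced by the one the protocol actually applies.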

2.1) Requirements
The Rosetta software package version 3.6 has to be obtained from www.rosettacommons.org. Rosetta applications are denoted with the extension <.ext>, which should be replaced with the system- and compiler-dependent extension. For instance, for gcc-compiled Rosetta on a Linux system use .linuxgccrelease.
Note: The RASREC protocol requires MPI with a minimum of 4 compute cores (higher numbers are highly recommended). Rosetta can be compiled in MPI mode with the following commands:
    cd /path/to/Rosetta/main/source
    ./scons.py -j <number of processors> bin mode=release extras=mpi
RASREC requires substantial computer resources. The required time depends on several factors, including protein size, fold complexity, and the number and information content of the restraints. For instance, target 1r9h requires 26 hours on 96 compute cores (2.6 GHz AMD Opteron processors). The run time can be reduced by decreasing the standard pool size of 500; however, this is not recommended, as it directly affects the final prediction accuracy. To set up a RASREC run easily, we suggest using the CS-Rosetta Toolbox, which is now available at http://csrosetta.chemistry.ucsc.edu/.

2.2) Fragment Selection
We ran the fragment picker for all our targets with the following command:
    make_fragments.pl -nohoms
The flag -nohoms excludes fragments from homologous proteins. It should be omitted when the protocol is not used for benchmarking. Alternatively, fragments can be generated using the Robetta web server (available at http://www.robetta.org/). For the tutorial, this step can be skipped, as fragments are already provided in $PROTOCOL/inputs/fragments.

2.3) Starting a RASREC run
All runs in our manuscript were set up with the CSRosetta Toolbox (now available at http://csrosetta.chemistry.ucsc.edu/). In case the Toolbox is not available, a folder containing all necessary flag files is provided. Both ways of setting up a RASREC run (with and without the CSRosetta Toolbox) are described below.

2.3.2) Without the use of the CSRosetta Toolbox
If the CSRosetta Toolbox is not installed on your system, you can set up the RASREC run manually. All flag files are provided in the folder rosetta_flags. These flags are ready to run for this tutorial. All modifications necessary for different targets are explained below.
The following commands create a run folder containing all necessary flags and files for a RASREC run:
    cd $PROTOCOL
    mkdir -p RASREC_runs/1wvn_standard/run/logs
    cp rosetta_flags/standard/* RASREC_runs/1wvn_standard/run
    cd RASREC_runs/1wvn_standard/run
The final RASREC run can either be started with one of the run scripts (for different cluster settings) or by executing the following commands:

2.4) Successful Termination and Analysis
A RASREC run is finished once the folder fullatom_pool_stage8 has been generated. The following error message occurs after a successful RASREC run and can be ignored:
    ERROR: quick exit from job-distributor due to flag jd2::mpi_nowait_for_remaining_jobs ---this is not an error
The final RASREC models are stored in fullatom_pool/decoys.out.
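The completion criteria above can be checked from a script. The following Python sketch assumes that fullatom_pool_stage8 and fullatom_pool are located directly inside the run directory; the function names are hypothetical and not part of Rosetta or the CSRosetta Toolbox.

```python
from pathlib import Path

# Flag name quoted in the harmless exit message of a successful run.
BENIGN_MARKER = "jd2::mpi_nowait_for_remaining_jobs"


def rasrec_finished(run_dir):
    """True once the fullatom_pool_stage8 folder exists in run_dir."""
    return (Path(run_dir) / "fullatom_pool_stage8").is_dir()


def final_models(run_dir):
    """Path to fullatom_pool/decoys.out, or None if the run is not finished."""
    decoys = Path(run_dir) / "fullatom_pool" / "decoys.out"
    if rasrec_finished(run_dir) and decoys.is_file():
        return decoys
    return None


def is_benign_error(log_line):
    """True for the 'quick exit' message that can be ignored."""
    return log_line.startswith("ERROR") and BENIGN_MARKER in log_line
```

A call such as `final_models("RASREC_runs/1wvn_standard/run")` then returns the path to the final models only after the run has terminated.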

2.4.3) Analyze convergence of ensemble
The following command shows the residues that have converged to within 2 Å:

2.5) Refinement with RASREC
If the convergence of the initial RASREC run is insufficient (< 90%), a second RASREC run can be carried out. This run uses restraints derived from both the predicted contact map and the result of the previous run.
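The decision rule above can be sketched as follows. The sketch assumes that convergence is measured as the fraction of residues whose CA positions deviate by at most 2 Å (root mean square) from their mean position across the superimposed low-energy models. This definition, the pre-superimposed input, and all function names are assumptions for illustration, not the analysis tool used in the protocol.

```python
import math


def per_residue_spread(models):
    """RMS deviation from the mean position for every residue.

    models: list of models, each a list of (x, y, z) CA coordinates,
    assumed to be superimposed and of equal length.
    """
    n_models = len(models)
    spreads = []
    for r in range(len(models[0])):
        center = [sum(m[r][k] for m in models) / n_models for k in range(3)]
        msd = sum(
            sum((m[r][k] - center[k]) ** 2 for k in range(3)) for m in models
        ) / n_models
        spreads.append(math.sqrt(msd))
    return spreads


def converged_fraction(models, cutoff=2.0):
    """Fraction of residues whose spread is within the cutoff (in Å)."""
    spreads = per_residue_spread(models)
    return sum(s <= cutoff for s in spreads) / len(spreads)


def needs_refinement(models, cutoff=2.0, min_fraction=0.9):
    """True if fewer than min_fraction of the residues have converged."""
    return converged_fraction(models, cutoff) < min_fraction
```

With this definition, an ensemble in which only half of the residues stay within the cutoff would trigger a refinement run.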

2.5.1) Repick Restraints
The following command generates two restraint files given a model ensemble and a contact map.
    cd $PROTOCOL
    $PROTOCOL/scripts/repick_restraints_final.py -c 1wvn_PLM.cmp \
        -s RASREC_runs/1wvn_standard/run/fullatom_pool/low_30.pdb \
        -p 1 \
        -o restraints
As output, the following files are generated:
    # converged distances in low_30.pdb translated to strict bounded
    # potentials around the average distance
    restraints_converged_distances.cst
    # additional restraints from the contactmap that do not completely
    # disagree with the previous results. Here a more widely bounded
    # potential is used.
    restraints_filtered_contactmaps.cst
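The idea behind the two output files can be illustrated with a short Python sketch. It is not the logic of repick_restraints_final.py, whose internals are not shown here. It assumes that a distance counts as converged when its standard deviation across the models is small, and that a predicted contact "completely disagrees" when no model brings the pair within a generous cutoff. The thresholds `max_spread`, `tolerance` and `max_distance` and the function names are hypothetical.

```python
from statistics import mean, pstdev


def converged_restraints(distances, max_spread=1.0, tolerance=1.0):
    """Strict bounds around the average distance of consistent pairs.

    distances: {(i, j): [distance in model 1, model 2, ...]} in Å.
    Returns {(i, j): (lower, upper)}.
    """
    bounds = {}
    for pair, values in distances.items():
        if pstdev(values) <= max_spread:  # assumed convergence criterion
            avg = mean(values)
            bounds[pair] = (max(0.0, avg - tolerance), avg + tolerance)
    return bounds


def filter_contacts(contacts, distances, max_distance=12.0):
    """Keep predicted contacts that at least one model roughly satisfies."""
    return [
        pair for pair in contacts
        if pair in distances and min(distances[pair]) <= max_distance
    ]


# Toy ensemble distances (illustrative values only).
distances = {
    (1, 8): [6.0, 6.0, 6.0],      # converged and close
    (5, 30): [5.0, 15.0, 25.0],   # not converged, but sometimes close
    (2, 40): [30.0, 31.0, 30.5],  # converged, contradicts a contact
}
strict = converged_restraints(distances)
kept = filter_contacts([(1, 8), (5, 30), (2, 40)], distances)
```

In this toy case the pair (2, 40) receives a strict restraint at its observed distance, while the predicted contact between the two residues is dropped from the more widely bounded set.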

2.5.2) Setup RASREC run
The flags and patches used for the refinement RASREC run are identical to the ones listed in Section 2.3. The two RASREC runs differ only in the restraints used for structural guiding. The restraint files are added to a RASREC run in the broker file.