Comparison of Algorithms for Prediction of Protein Structural Features from Evolutionary Data

Proteins have many functions and predicting these is still one of the major challenges in theoretical biophysics and bioinformatics. Foremost amongst these functions is the need to fold correctly thereby allowing the other genetically dictated tasks that the protein has to carry out to proceed efficiently. In this work, some earlier algorithms for predicting protein domain folds are revisited and they are compared with more recently developed methods. In dealing with intractable problems such as fold prediction, when different algorithms show convergence onto the same result there is every reason to take all algorithms into account such that a consensus result can be arrived at. In this work it is shown that the application of different algorithms in protein structure prediction leads to results that do not converge as such but rather they collude in a striking and useful way that has never been considered before.


Introduction
Although protein structure determination by biophysical techniques such as X-ray crystallography, cryoelectron microscopy and NMR has become highly automated, there will, for several reasons, continue to be interest in pursuing theoretical predictions of protein structure. Despite the high productivity of the mentioned experimental methods, the rate at which genomics and proteomics data are generated still outstrips the rate at which structures can be determined experimentally. Performing mutant studies for planning protein engineering experiments or screening for proteomic therapeutics (for example: immunotherapeutics) are most rapidly done in silico. Further, it is not simply the case that any given gene has a (singular) function. The protein prescribed by its gene sequence has many functions [1,2]. This implies, in turn, that these are also encoded in the gene. Somewhere, but where and how? As earlier shown [1,2] this is done in a disjoint fashion. So the problem becomes an issue of how to partition the protein sequence information and map these subsets of the entire gene sequence onto this set of functions (the inherent assumption in all this, which dealt with in more detail elsewhere, is that the mapping of sequence loci into function space is both surjective and injective). While many of these issues have been addressed in recent [1] and earlier [2] publications the focus here is on folding and contact prediction and not on any of the other functions that any given protein most certainly has.
Theoretical/computational protein folding studies have undergone a steady series of developments in recent years. These have included significant accomplishments in protein dynamics [3][4][5], methods based on a collage of overlapping peptide fragments [6], and a variety bioinformatics approaches [7][8][9][10][11][12]. The latter have usually involved finding patterns of coevolution within multiple sequence alignments. The so-called correlated mutation analysis (CMA) approach identifies residue positions that show a common pattern of conservation and are deemed to signify the maintenance of some key structural feature, a "contact". Typically, what one had in mind in these studies was protein folding, the need for the protein to fold into domains with a compact (predominantly "hydrophobic") core. A similar argument was used to propose that protein-protein interactions could likewise be predicted [13,14].
As an extension of the CMA idea, studies of patterns of sequence variability (VAR) and Shannon entropy (ENT) [15,16] allowed a distinction to be made between sites in the protein core, or surrounding ligand-binding sites, for example. The first steps towards unravelling the multifunctional nature of proteins [1,2] had been taken. This was recently supplemented by an alternative approach based on Kolmogorov complexity (KOL) [1] which represents a new way to partition protein sequence information into its constituent functionalities.
In this paper, the focus will be restricted to protein folding, or more specifically, the folding of individual domains. The extent to which KOL, and its antecedents [15,16] VAR/ENT (here considered jointly and called VRN), can be used to predict these domain structures will be considered, as well as alternative methods. Foremost among these are methods which have been based on frequency of contacts between amino acid residue sidechains [17,18]. In the present work distinction is made between an earlier method [17] in which a PDB-derived likelihood matrix was used to predict intradomain contacts (referred to herein as the SVB method) and a later development based on pair-to-pair contacts [18] (P2P). The P2PConPred program [18] calculates correlations between sites based on a predefined P2P matrix which in turn is based on the Blocks database [19]. The P2P website states: "The P2P is currently designed to reflect probabilities of pair to pair substitutions at two positions with physical contact. The ultimate goal is to detect residue-residue contact solely based on the evolutionary information stored in multiple sequence alignment.". The present paper includes results from the use of the P2P program but proceeds towards the same ultimate goal in ways that P2P probably did not envisage.

Methods
Before the VRN and KOL measurements can be made it is important to decide the range of values that give results that are relevant to the type of contact being studied. This question has been studied earlier for VRN [15,16] and KOL [1]. For VRN, the values obtained in the original work [15,16] and used elsewhere [1] were used. In the case of KOL the results of making these investigations were not published before so this is done here. Reference is made to Fig 1 which shows how the MCC values for KOL calculated at two different distance cutoffs, 6Å and 10Å (these turn out to be good choices as the later results show, but other values could have been chosen) vary as a function of the range of KOL values is varied, in the range 0.2 to 0.5, the width of each slice of that range is 0.05. As is shown in Fig 1, the optimal MCC value is the same for 6Å and 10Å (and intervening values, data not shown).
The data to produce • VRN and KOL: were determined using previously published methods [1] for dealing with a specially designed nonredundant database of protein domains. Briefly, for each protein a   multiple sequence alignment was produced using the PredictProtein program [20]. This program generates VAR and ENT data, although these need to be parsed and extracted into a usable form. The KOL data are not provided directly but can be calculated based on the complexity of the alignments at each position in the consensus sequence. These methods are all described in detail in [1].
• P2P: http://ignmtest.ccbb.pitt.edu/p2pdocs/p2p_doc.html • CMA: http://gremlin.bakerlab.org/sub.php For all of the above methods, comparisons were made between "hits" identified by the method and those in the 2D contact map for each target protein (listed in Table 2 Table 1) for each method at each cutoff value. The MCCs are plotted along the abscissae and the cutoffs form the ordinates of the plots in Fig 2 and S1 Fig. The results of applying these methods to the target proteins in this work are shown in Fig 2  and S1 Fig. Table 1 records the data for a single member of the set of proteins PDB i.d.:1a4v.
(Corresponding data for all the others is available from the author) and Table 2. The members of the studied protein set (  Table 2). Structural data including rotatable figures are also reachable through the same links.
The question of noise and random effects in all this data has not been ignored; quite the converse. For each of the above metrics, the behaviour of a set of predictions based on Table 2. Summary of results for all 10 protein families (parent protein identified in columns 1 and 2).

Protein domain (4-letter code)
Protein type Figure  completely random inputs was calculated and used to correct the metrics (subtraction of

Results and Discussion
The data arising out of this work can be used to construct 2D contact maps from which 3D structures can, in principle, always be derived using distance geometry [22][23][24][25][26]. There are ample precedents for presenting folding predictions in this way [1,10,11,17]. But now there is a new and better definition of "contact" because now we can define contacts in terms of preferred rather than just assumed (see below) distances, depending on the method used (Table 1). In addition to this enhanced contact data there is a wealth of other data which have earlier been enlisted in these endeavours such as predictions of secondary structure [1,10] and accessibility [1] which confer additional credence to the results of any attempts to compute the 3D structure.
Once these juxtaposed results have been presented, the next step is to decide how best to combine the 2D contact predictions from SVB, P2P, VRN, KOL and CMA in such a way as to provide the best consensus set, "best" being here defined as leading to rapid convergence towards a structure for the protein that corresponds to the crystal structure for this protein.
As stated, there has been widespread interest in trying to predict intra-protein (as well as inter-protein) contacts. But very little is ever said about the nature of those contacts. The definition of "contact" can be very vague or ambiguous, often referring to "hydrogen bond" or "Van der Waals" contacts. Neither of these include the possibility of and maybe even need for long range effects that are not contacts as such. A standard (http://www.ccp4.ac.uk/html/ contact.html) range for short contacts is 2.0Å-3.66Å while a considerably wider range, 6Å-12Å, is considered significant in order to cater for all contact types (http://en.wikipedia.org/ wiki/Protein_contact_map). Given that there is such a wide spread of distances which are involved in defining a "contact" it now becomes interesting and, as it turns out, important to ask the question: in each of the algorithms for contact prediction listed above and summarised in Table 1, what is the characteristic contact distance for each of these algorithms? The answer is provided by the cutoff distance for each case where MCC is a maximum. When this is done it becomes apparent that the various algorithms predict contacts having different characteristic distances. A clear conclusion from this work is that there is no "one-size-fitsall" algorithm for inter-residue contact prediction. One would clearly not choose the SVB or P2P alternatives since their behaviour is somewhat erratic and often confined to predicting rather long spacings (>10Å). There might be correlated behaviour over such long distances, but they can hardly be considered a "contact". But it obviously makes sense to use CMA, VRN and KOL.
One may legitimately ask why, given that CMA got off to such a good start in fold prediction, is there any need to consider other methods? Do the apparent correlations in CMA really correspond to events in coevolution? There have been many discussions on this question [12,27,28] and more remains to be discovered. In particular, there are clear indications [28] that CMA "hits" may reflect the rate of coevolution in relation to preserving arenaceous (i.e. low resolution) structural features such as the protein "core" rather than acting as a predictor of specific pairs of contacting residues as such. But insofar that CMA can with appropriate noise filters be used to predict contacts the Gremlin approach [29] is most useful and it produces results of very high fidelity. This paper has its main focus on protein folding, or rather, domain folding. Several different methods for predicting domain folds were compared and it was found that these methods work in subtly different ways in that they predict contacts with different values. There is every reason therefore to use more than one, but not all, of these methods. Together they provide a more robust and information-rich prediction model and, while they do not "converge" as such, they "collude" in a way that could to lead to a more reliable result (at least as far as VRN and KOL are concerned).
From Fig 3 it would appear that KOL, VRN and CMA are controlled by similar underlying factors and all three correlate in an almost antiparallel fashion with cutoff. Of course, there is no linear correlation as is made abundantly clear in Fig 2 and S1 Fig. P2P and SVB are almost orthogonal to the cutoff indicating little or no dependency in that sense.
One of the missing items in much published work is a clear definition of what is meant by "contact". A mention of this has been made [28] which amounts to a general assumption throughout the CMA debate that, for example a "hydrophobic-hydrophobic" contact can be replaced by a hydrogen bond or a salt-bridge or an "aromatic-aromatic" contact. As if these were freely interchangeable. But they are not interchangeable in such a simple way [30]. These interactions are based on entirely different mechanisms and replacement of one by the other is not to be regarded as a "compensatory mutation" [30]. Another missing item in previously published work is that there has been such a focus on "contacts", however these are defined and/or measured, that other most important protein functions seem to have been forgotten. Exceptions to this is a precursor to the present paper [1] and a most important earlier paper [31] that sets out to consider the ability to disentangle direct and indirect correlations and to facilitate computational predictions of alternative protein conformations, protein complex formation, and even the de novo prediction of protein domain structures. Together with the efforts of the present author, this seems to be a valid and useful way forward. To this end, future extensions of this work will give further consideration to these other protein functions [1] that are encoded in the gene (the ability to fold into two (or more) conformational states, to be able to reach one state from the other, arriving at the correct locus inside or outside the cell, or within the cell membrane, recognition/binding to other proteins, recognition/binding of small ligands (orthosteric and/or allosteric agents)). Indeed, much of the difficulty surrounding the use of these contact prediction methods arises out of the fact that so many different functions are encoded in the gene and attempts to partition them lead to the kind of results that have been revealed in this work. Thus the use of the verb "disentangle" [31] is highly appropriate in this context.

Conclusions
This paper has dealt with the question of which inputs to use when conducting ab initio predictions of domain folds. Five methods were compared and it was found that they all make predictions in different ways. Different in respect to which interatomic distance or displacement that they best predict. CMA, VRN and KOL emerge as being the most useful methods for predicting "contacts". The first two are already well established, while the Kolmogorov approach [1,32] represents a novel and promising addition to the arsenal of techniques.
As for the interatomic contacts themselves, no account has been taken here of the nature of the atom types involved (but it already is one of the ongoing extensions from this work). Here, a standard "CA-CA" proximity metric is assumed as a definition for all "contacts". But the issue is an important one. Depending on the chemical nature of the participating atom types the problem (generally) of finding matching pairs amounts is a case of the mathematically well defined "marriage problem" (Gale-Shapley algorithm). This is applicable to "+/type" interactions or wherever there is a duality or asymmetry in the interaction. But there are also interactions of a more neutral or symmetric kind such as "hydrophobic" interactions typical of the way that aliphatic, and to some extent aromatic, sidechains interact. These have more the character of the "stable-roommate problem" (Irving algorithm). It is intended that these distinctions will form the basis of yet another extension of this work.
Supporting Information S1 Fig. This file contains figures A, B, C, D, E, F, G, H and I. The identities of the proteins is stated in columns 1 and 2 of Table 2. Software developed for this paper is all available free of charge to interested users. The source code for the fortran program (predcon.f) which calculates the MCC scores from the input data is provided and users are strongly advised to peruse this file before use making note of the comments and the default names of the input files. Examples of the input files are provided for the purposes of getting the correct format. The program should be run once for each cutoff value and it is advisable to rename the output files using a filename that incorporates that value A shellscript (renumber) is provided to help with that and an awkscript (awkward) is provided to enable the different measurements (CMA, VRN, etc.) to be corrected for the random (RND) scores. The correct command for compilation of predcon.f under Linux is: gfortran -Wall -O3 -ffixed-line-length-132 -o predcon.exe predcon.f (GZ) S1 Software. The software is available in S1 Software. It goes without saying that no commercial use may be made of these programs or scripts in whole or in part without express permission of the author. (GZ)