Conceived and designed the experiments: OCR CAO. Performed the experiments: OCR TJD. Analyzed the data: OCR BHD. Contributed reagents/materials/analysis tools: OCR IS. Wrote the paper: OCR BHD CAO.
The authors have declared that no competing interests exist.
Predicting protein function from structure remains an active area of interest, particularly for the structural genomics initiatives where a substantial number of structures are initially solved with little or no functional characterisation. Although global structure comparison methods can be used to transfer functional annotations, the relationship between fold and function is complex, particularly in functionally diverse superfamilies that have evolved through different secondary structure embellishments to a common structural core. The majority of prediction algorithms employ local templates built on known or predicted functional residues. Here, we present a novel method (FLORA) that automatically generates structural motifs associated with different functional sub-families (FSGs) within functionally diverse domain superfamilies. Templates are created purely on the basis of their specificity for a given FSG, and the method makes no prior prediction of functional sites, nor assumes specific physico-chemical properties of residues. FLORA is able to accurately discriminate between homologous domains with different functions and substantially outperforms (a 2–3 fold increase in coverage at low error rates) popular structure comparison methods and a leading function prediction method. We benchmark FLORA on a large data set of enzyme superfamilies from all three major protein classes (α, β, αβ) and demonstrate the functional relevance of the motifs it identifies. We also provide novel predictions of enzymatic activity for a large number of structures solved by the Protein Structure Initiative. Overall, we show that FLORA is able to effectively detect functionally similar protein domain structures by purely using patterns of structural conservation of all residues.
Understanding how the three-dimensional (3D) molecular structure of proteins influences their function can provide insights into the workings of biological systems. Structural Genomics Initiatives have been set up to investigate these structures on a large scale and make the data available to the wider biological research community. However, in a significant number of cases, there is little known about the functions of the structures that are solved. To address this, computational methods can be used as a predictive tool to guide future experimental investigations. One such approach is to exploit global structural comparison to assign the protein in question to an evolutionary family, which has already been functionally characterised. However, this is problematic in some large evolutionary families, which contain a number of different functional sub-families. We have developed a new method (FLORA) which is able to calculate 3D “motifs” which are specific to each of these sub-families. Any new protein structure can then be compared against these motifs to make a more accurate prediction of its function. Our paper shows that FLORA substantially outperforms other standard approaches for predicting function from structure. We use our method to make confident functional predictions for a set of proteins solved by the structural genomics projects, which could not have been assigned reliably by global structure comparison.
The prediction of protein function from structure has become of increasing interest as a significant proportion
To address the problem of generating templates for all protein structures, there are a number of methods that aim to do this automatically. For example, the reverse template method
One inherent complexity of using PDB structures to transfer annotations between enzymes is the binding state in which the protein is crystallised — for example, structures crystallised with non-cognate ligands, substrate analogs, transition states or apo-enzymes
Despite the many template methods present in the literature, very few are publicly available to the general user. Hence, the first step in assigning function by structure is often to use global structure comparison methods (e.g. CE
Analyses of CATH
The FLORA algorithm presented here was designed to derive structural templates for functional sub-groups (FSGs) within diverse CATH superfamilies. FLORA first performs global structure alignment across the superfamily to recognise the distinctive structural patterns associated with each FSG and builds templates based on these patterns. New functional homologues are then detected by using the global structural alignments to relatives in each FSG again, but only scoring the similarity over positions identified by the FLORA motif. This approach performs very well in discriminating between different enzymatic functions, compared to global methods and another motif-based approach. Although we benchmark here on enzyme superfamilies, the method is applicable to superfamilies containing non-enzymatic relatives. To test FLORA, we have automatically generated a large data set of domains from 29 diverse superfamilies (containing multiple FSGs). Our data set allows us to look at the variation of FLORA results between superfamilies and to stress the importance of using a large test data set for benchmarking methods. We have benchmarked FLORA against CE
In order to benchmark FLORA as a protein function prediction method, it was important to generate a relatively large and unbiased data set. We focussed on functionally diverse superfamilies (≥3 functions at the third E.C.
All protein chains from PDB structures classified in CATH v3.1 were annotated with an E.C. number using PDBSprotEC
To simplify the benchmark data set, all domains from enzymes assigned more than one E.C. (i.e. multifunctional enzymes) were removed. This exclusion criterion removed less than 8% of enzymatic chains in the PDB. In addition, any domains with an incomplete E.C. number (e.g. 2.7.-.-) were also excluded.
All annotated domains in CATH were clustered at 60% sequence identity and a representative taken from each cluster (S60Rep). This threshold was applied as 60% has been found to be an appropriate sequence cut-off for functional similarity
S60Reps were then grouped within each superfamily if they shared at least the first three levels of the E.C. number, creating what we will subsequently refer to as a functional sub-group (FSG). A CATH superfamily was then included in the data set if it contained at least 3 FSGs, where each enzyme family contained at least 4 S60Reps. These criteria were chosen to create a sufficiently diverse data set, which could be effectively assessed using leave-one-out benchmarking.
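The filtering and grouping rules described above can be illustrated with a minimal sketch. The function names and the input layout (tuples of domain identifier, superfamily and list of E.C. numbers) are hypothetical and are not part of the published pipeline. The sketch also assumes one reading of the inclusion criterion, namely that a superfamily qualifies when at least 3 of its FSGs each hold at least 4 S60Reps.

```python
from collections import defaultdict


def fsg_key(ec):
    """Return the first three E.C. levels, or None for an incomplete number."""
    fields = ec.split(".")
    if len(fields) != 4 or any(f == "-" for f in fields):
        return None  # e.g. 2.7.-.- is excluded
    return ".".join(fields[:3])


def select_superfamilies(reps, min_fsgs=3, min_reps=4):
    """Group S60Reps into FSGs and keep sufficiently diverse superfamilies.

    reps: iterable of (domain_id, superfamily, list_of_ec_numbers).
    Returns {superfamily: {fsg_key: [domain_id, ...]}}.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for domain_id, superfamily, ecs in reps:
        if len(ecs) != 1:
            continue  # drop multifunctional (or unannotated) domains
        key = fsg_key(ecs[0])
        if key is None:
            continue  # drop incomplete E.C. numbers
        groups[superfamily][key].append(domain_id)

    selected = {}
    for superfamily, fsgs in groups.items():
        # Assumption: only FSGs with enough representatives are counted.
        large = {k: v for k, v in fsgs.items() if len(v) >= min_reps}
        if len(large) >= min_fsgs:
            selected[superfamily] = large
    return selected
```

The thresholds are exposed as parameters so that the effect of stricter or looser criteria on the size of the data set can be explored.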
The final domain data set (
An outline of the FLORAMake algorithm is shown in
Methods which attempt to create structural templates of residues associated with a given function rely on a range of methods
The first step in our protocol was therefore to generate structural alignments using CATHEDRAL between
All pairs of structure-structure alignments between domains in a given FSG were analysed to identify aligned residues. A set of residues for each domain was then generated from the pairwise alignments to include only those residues that were aligned to residues in at least 75% of other domains in the FSG (to account for sub-optimal alignments). A cut-off of 75% was chosen after exploring a range of cut-offs (0–100%) and gave the fastest performance without affecting the precision/recall of FLORA. These were designated rescons positions.
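As an illustration of the 75% rule, the following sketch selects rescons positions for one domain from its pairwise alignments. The data layout (a dictionary of residue mappings keyed by domain pair) and the function name are assumptions made for this example. They do not reflect the CATHEDRAL output format.

```python
def rescons_positions(domain, pairwise, others, threshold=0.75):
    """Residues of `domain` aligned in at least `threshold` of the other domains.

    pairwise: {(domain, other): {residue_in_domain: residue_in_other}}
    others:   identifiers of the other domains in the same FSG
    """
    counts = {}
    for other in others:
        # Each aligned residue counts once per partner domain.
        for residue in pairwise.get((domain, other), {}):
            counts[residue] = counts.get(residue, 0) + 1
    needed = threshold * len(others)
    return sorted(r for r, c in counts.items() if c >= needed)
```

With four partner domains, a residue must be aligned in at least three of them to be retained.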
For each domain, vectors were calculated between all rescons positions. To allow vectors to be appropriately compared between domains, a vector was calculated between the Cβ atoms of residues A and B and then multiplied by a co-ordinate frame calculated from the tetrahedral geometry of the bonds of the Cα of residue A as described in
A given vector from a domain in the FSG was compared to equivalent vectors in domains across the whole superfamily. Equivalent vectors were obtained from the CATHEDRAL structural alignment of the domains being compared. For example, residues 93 and 105 in CATH domain 1vl2A01 are equivalent to residues 92 and 108 in 1k92A01 according to the structural alignment. Hence, the vectors 93→105 (v1) and 92→108 (v2) were scored for similarity using the formula below (which is identical to the vector score developed for the SSAP
The next step in the algorithm is to determine vectors for a given domain which are more similar to equivalent vectors in other domains in the same FSG than to those of relatives in the superfamily with different functions (i.e. in different FSGs). The aim was to eliminate vectors that are conserved mainly to preserve the common fold of the superfamily. Two distributions were calculated for each vector: a) scores to domains in the same FSG (DIST-F) and b) scores of domains in different FSGs (DIST-S). The means of DIST-F and DIST-S were then calculated and the vector was initially determined to be FSG-specific if it satisfied the following inequality:
We experimented with various statistical tests (e.g. Wilcoxon rank sum, calculating an empirical p-value), but found that the set of selected vectors could be best reduced by jack-knifing the data set and repeating the calculation above. That is, each domain in the training set was removed in turn and FLORA only selects a vector if the inequality is always satisfied.
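The jack-knife filter can be sketched as follows. The selection inequality itself is not reproduced in this text, so the sketch takes the criterion as a parameter. The default, `is_fsg_specific`, is only a stand-in (mean of DIST-F greater than mean of DIST-S) and should not be read as the published inequality. The data layout is likewise hypothetical.

```python
from statistics import mean


def is_fsg_specific(dist_f, dist_s):
    """Stand-in criterion (assumption): same-FSG scores are higher on average."""
    return mean(dist_f) > mean(dist_s)


def jackknife_select(scores_f, scores_s, criterion=is_fsg_specific):
    """Keep vectors whose criterion holds with every training domain left out.

    scores_f: {vector_id: {domain_id: score}} for domains in the same FSG
    scores_s: {vector_id: {domain_id: score}} for domains in different FSGs
    """
    selected = []
    for vec, f in scores_f.items():
        s = scores_s.get(vec, {})
        if not f or not s or not criterion(list(f.values()), list(s.values())):
            continue  # fails on the full training set
        robust = True
        for left_out in set(f) | set(s):
            dist_f = [v for d, v in f.items() if d != left_out]
            dist_s = [v for d, v in s.items() if d != left_out]
            if not dist_f or not dist_s or not criterion(dist_f, dist_s):
                robust = False
                break
        if robust:
            selected.append(vec)
    return selected
```

A vector whose apparent specificity depends on a single high-scoring relative is rejected, because removing that relative reverses the comparison.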
We also explored incorporating measures of sequence similarity when scoring vectors, but in our hands this degraded the performance of FLORA. This could be due to the fact that the benchmark data set contained very diverse relatives and hence exploring the sequence signal requires a more sophisticated approach.
At this point, each domain in the FSG is associated with a set of FSG-specific vectors, which we termed the “FSG-domain template set”.
Scoring a query domain against the template for a given domain in a given FSG relies again on the global structural alignment by CATHEDRAL. Hence, the first step is to align the query domain against an FSG domain but then only score the similarity across the subset of template vectors. Essentially, we are calculating a local score over the FLORA template from the correspondence determined by a global structural alignment. Each vector in the template set associated with the FSG domain is scored against the equivalent vector in the query domain (using equation 1), based on the aligned residues from the global alignment. Any vectors that are not aligned (i.e. gapped positions) are given a score of zero. The total similarity of the query domain against the enzyme domain (the florascore) is simply the sum of these similarities, normalised by the total number of vectors in the template (Equation 2).
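A minimal sketch of this normalised sum is given below. The per-vector similarity (equation 1) is not reproduced in this text, so it is passed in as a callable. The function `toy_vector_score` is a purely hypothetical placeholder used to make the example runnable, and the dictionary layouts are assumptions.

```python
def toy_vector_score(v1, v2):
    """Hypothetical placeholder for equation 1: 1 for identical vectors, lower otherwise."""
    squared = sum((a - b) ** 2 for a, b in zip(v1, v2))
    return 1.0 / (1.0 + squared)


def florascore(template_vectors, alignment, query_vectors, vector_score):
    """Sum of per-vector similarities, normalised by the template size.

    template_vectors: {(res_a, res_b): vector} for the FSG domain
    alignment:        {template_residue: query_residue} from the global alignment
    query_vectors:    {(res_a, res_b): vector} for the query domain
    vector_score:     callable(v1, v2) -> float, standing in for equation 1
    """
    if not template_vectors:
        raise ValueError("template contains no vectors")
    total = 0.0
    for (a, b), template_vec in template_vectors.items():
        qa, qb = alignment.get(a), alignment.get(b)
        if qa is None or qb is None:
            continue  # gapped position: contributes zero
        query_vec = query_vectors.get((qa, qb))
        if query_vec is None:
            continue  # no equivalent vector available: contributes zero
        total += vector_score(template_vec, query_vec)
    return total / len(template_vectors)
```

Because unaligned vectors still count in the denominator, a query that aligns to only part of the template is penalised in proportion to the missing vectors.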
We hypothesised that the extent to which the structure of a domain can change before its enzymatic function changes might be specific to the homologous superfamily. For each FLORA domain-function template, a distribution of all scores is calculated against all domains in different FSGs. The florascore between a given pair of query and enzyme domains is then transformed into a Z-score.
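The transformation can be sketched as a standard Z-score against the template's background distribution. Whether the population or sample standard deviation was used is not stated in the text. The population form is assumed here, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def flora_zscore(raw_score, background_scores):
    """Z-score of a florascore against scores to domains in different FSGs.

    background_scores: florascores of the same template against all
    domains in other FSGs of the superfamily.
    """
    mu = mean(background_scores)
    sigma = pstdev(background_scores)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("background distribution has zero variance")
    return (raw_score - mu) / sigma
```

Because each template has its own background, the same raw florascore can be significant in a structurally conserved superfamily and unremarkable in a more variable one.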
As FLORA is essentially a pattern discovery method, it was vital to assess its performance in an unbiased fashion. We took a standard leave-one-out (or jack-knifing) approach as is often used to test machine learning methods. For each superfamily, one test domain was removed, while training on the remaining domains. The test domain was then scored against all the resulting templates. The aim of this process was to accurately reproduce a situation where a novel domain is classified into a CATH superfamily and then needs to be assigned to a functional group.
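The leave-one-out protocol can be sketched generically as below. The `train` and `score` callables are hypothetical stand-ins for template generation and template scanning, and taking the top-scoring FSG as the prediction is an assumption made for this illustration.

```python
def leave_one_out(domains, train, score):
    """Hold out each domain in turn, retrain, and score the held-out domain.

    domains: {domain_id: fsg_label} for one superfamily
    train:   callable(training_domains) -> templates
    score:   callable(templates, test_domain_id) -> [(fsg_label, score), ...]
    Returns a list of (test_domain, true_fsg, predicted_fsg, best_score).
    """
    results = []
    for test_id, true_fsg in domains.items():
        # Templates are rebuilt without the test domain to avoid bias.
        training = {d: f for d, f in domains.items() if d != test_id}
        templates = train(training)
        hits = score(templates, test_id)
        predicted_fsg, best = max(hits, key=lambda hit: hit[1])
        results.append((test_id, true_fsg, predicted_fsg, best))
    return results
```

Retraining inside the loop is the essential point: a template derived from a set that includes the test domain would overestimate performance.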
The performance of FLORA, CATHEDRAL
In order to examine where residues identified by FLORA overlapped with known functional residues, we compared the location of FLORA positions to those in the Catalytic Site Atlas
For each functional sub-group (FSG), we selected the domain that had the highest mean global structural similarity (measured by CATHEDRAL) to all other members of the FSG as a representative. All residues, from each relative within an FSG, identified by FLORA and CSA annotations were then mapped onto this representative using the CATHEDRAL structural alignment. Consequently, for each FSG we had a representative structure where all residues were annotated as FLORA positions, catalytic residues, or neither. The CSA provided annotations for 61 out of 82 FSGs (74%). We then calculated the average distance between the FLORA residues to the catalytic residues and the average distance between non-FLORA and the catalytic residues.
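The distance comparison can be illustrated with the sketch below. The text does not state how the average was taken, so the mean over all residue-to-catalytic-residue pairs is assumed here. One representative coordinate per residue (for example the Cα or Cβ atom) is also assumed, and the function names are hypothetical.

```python
from math import dist
from statistics import mean


def mean_distance_to_catalytic(residues, catalytic, coords):
    """Mean distance over all pairs of (residue, catalytic residue)."""
    return mean(
        dist(coords[r], coords[c])
        for r in residues
        for c in catalytic
        if r != c
    )


def compare_to_catalytic(flora, catalytic, coords):
    """Return (mean distance for FLORA residues, mean distance for other residues).

    flora, catalytic: sets of residue identifiers on the FSG representative
    coords:           {residue_id: (x, y, z)} for every residue
    """
    flora_only = [r for r in flora if r not in catalytic]
    others = [r for r in coords if r not in flora and r not in catalytic]
    return (
        mean_distance_to_catalytic(flora_only, catalytic, coords),
        mean_distance_to_catalytic(others, catalytic, coords),
    )
```

A first value smaller than the second indicates that, for that representative, FLORA positions lie closer to the catalytic residues than the remainder of the domain does.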
FLORA produces a set of inter-residue vectors for each domain in a given FSG that are considered to be specific to its enzymatic function, in the context of its evolutionary superfamily. In order to visualise where these vectors lay, we took each set of domain templates for a given enzyme family and mapped them onto the most representative structure — i.e. the structure with the greatest cumulative global structural similarity to all other domains in the family. A given residue was then coloured if it was involved in the top 30% of FLORA template vectors. Residues that are conserved across the whole superfamily (in 75% of relatives) were also identified and those which overlapped with FLORA residues were coloured gold.
Despite targeting proteins with no significant sequence similarity to existing structures in the PDB, Protein Structure Initiative (PSI) structures can often be classified into one of the large, diverse superfamilies in CATH by structure comparison methods once their structure has been solved. However, these superfamilies contain a significant number of relatives with different functions, and therefore the ability to further assign these proteins to a specific functional sub-group is of great use for guiding future functional studies. We took all PSI structures solved up to January 2008 that had been newly classified in v3.2 of the CATH database and selected the 276 domains which fell into one of the superfamilies in our data set. These 276 were further clustered at 60% sequence identity to produce a non-redundant test set of 104 domains, which was then scanned against the FLORA templates for each FSG in order to predict their function. To exclude hits that could have been fairly confidently assigned using global structure comparison, we removed any structures that matched a CATH domain in the v3.1 library with a SIMAX score<1.5
FLORA was designed as a generic method to create structural motifs that can discriminate between different functional sub-groups (FSGs) within diverse domain superfamilies, purely using patterns of structural conservation — FLORA makes no assumptions as to the physico-chemical properties of functionally important residues and uses a purely structure-based conservation score (i.e. sequence similarity is not used to select or score equivalent motif vectors, see
We tested the performance of FLORA against global structure comparison methods (CE
To fairly benchmark any function prediction algorithm, it is important to compare against current methods. Unfortunately, the vast majority of function prediction methods are not publicly available. Here we compare against CE, as this method has been used as a benchmark for other structure-based function prediction methods (e.g.
Initially, we investigated to what extent global structure comparison could be used to reliably assign function. The graph of sensitivity versus precision (
FLORAMake and FLORAScan were applied to the domain data set and the performance was assessed using a leave-one-out approach (described in the
FLORA was benchmarked on 29 functionally diverse enzyme superfamilies and the performance quoted thus far refers to an average calculated over the entire data set.
The superfamilies were ranked according to the AUC, with the worst performing listed first.
The performance of RT (which works at the whole chain level) is shown for comparison.
At this point, it can be seen that, simply by focussing on the domain level, FLORA is able to very effectively improve the recognition of structures in the same FSG. This is interesting given that the majority of structure-based function prediction methods tend to use the whole protein chain. A possible explanation of the power of FLORA could be that the domains in our data set form a core part of the enzymatically active region of the whole protein. Alternatively, it could be that the selected vectors for each template also contain residues that interact with other enzymatic domains within the chain, and it is these interaction sites that FLORA is detecting.
To see whether any improvement could be achieved by using the whole protein chain, we used CATHEDRAL to re-align the corresponding PDB chain for each of the domains in the data set and performed an identical benchmark as before.
The benchmarking analysis presented above shows that FLORA is indeed able to correctly discriminate between homologous domains from different FSGs better than global structure comparison, despite using global alignments to determine residue correspondence. This suggests that although a global alignment may not be perfect, especially between very distant relatives, it still aligns enough residues that are important for maintaining different functions. To examine where these function-specific residues lay, we chose a representative structure for each enzyme family and visualised the conserved FLORA residues (see
We have analysed these motifs further in domains from the HUP superfamily (CATH 3.40.50.620
The first FSG consists of the catalytic domain of class I aminoacyl-tRNA synthetases (EC 6.1.1.-). These enzymes are essential for protein translation as they catalyse the ligation of amino-acids to their cognate tRNAs in a two-step mechanism that involves ATP. The HUP domains of aminoacyl-tRNA synthetases are found in many different multi-domain contexts in CATH, which appear to partially depend on the amino-acid substrate (data not shown). In representatives from this group, (
A 1f7u, B 1od6, C 1k92. FLORA residues are shown in green.
The next FSG in the HUP superfamily is a group of metabolic enzymes called nucleotidyltransferases (EC 2.7.7.-), which transfer nucleotidyl groups from nucleotide tri-phosphates to other compounds. The nucleotidyltransferase we have analysed further (
Finally, the third FSG consists exclusively of argininosuccinate synthases (EC 6.3.4.5), which catalyse the ATP-dependent synthesis of argininosuccinate from citrulline and aspartate. These enzymes are homo-tetramers in which each subunit is comprised of a nucleotide-binding HUP domain and an additional domain involved in multimerisation and catalysis. Three motifs are identified by FLORA in
Analyses of residues identified by FLORA in these domains and others in this superfamily (data not shown) suggest that FLORA is generally able to target motifs known to be involved in different aspects of molecular function, like binding interfaces or catalytic sites. This behaviour is somewhat expected from FLORA, which was specifically designed to detect such function-related signatures in homologous domains. By mapping catalytic residues from the CSA onto each FSG representative (see
Examining similar representatives from the Class I aldolase superfamily (3.20.20.70) reveals that FLORA template residues (
Our analysis thus far has shown that FLORA is able to substantially improve on the performance of global structure comparison for reliably assigning domains to functional sub-groups. We therefore sought to use it to make novel predictions for structural genomics targets from the PSI. As a data set, we took structures that had been assigned to superfamilies in the latest version of CATH (v3.2) and scanned these against the FLORA templates. Using the benchmark curve from the leave-one-out benchmark, we took a score cut-off corresponding to a precision of 95% (Z-score>3.4) to ensure high confidence in our assignments. All hits above this cut-off were collated, rather than simply taking the top hit so that we could account for bi-functional enzymes and observe any conflicting predictions (i.e. those structures which hit more than one FSG template). A complete table of results is shown in
104 domains from our v3.2 PSI set correspond to 94 PDB structures. Of these 94, we were able to make predictions for 66 (70.4%) with FLORA. To assess the added value of using FLORA over global structure comparison, we took out any PSI structures that matched a domain in CATH with a SIMAX score<1.5 (see
FLORA positions are coloured as in previous figures and catalytic residues are shown in light blue. It can be seen that there is reasonable agreement in the region of the active site.
FLORA predicted NESG structure 2bdt with the E.C. number 2.7.1.-, which is a group including enzymes such as fructose 1,6-bisphosphate. When this structure was published, it was assigned as a putative gluconate kinase but currently has no official E.C. annotation.
PDB 1vm8 from the JCSG consortium was functionally characterised when the structure was solved as UDP-N-acetylglucosamine pyrophosphorylase and given the E.C. number 2.7.7.23. Again, FLORA correctly predicts the E.C. number as 2.7.7.-, despite low global structural similarity to any domains in the template data set.
1ylo is a hypothetical protein solved by the MCSG consortium in 2005. FLORA predicted the E.C. number 3.4.11.-, which comprises a group of amino-acid specific peptidases, with significant hits (Z-score>4) to three domain templates in our data set. A BLAST search indeed reveals significant hits (>99% sequence identity) to annotated aminopeptidases, as the protein has been functionally characterised since its structure was solved. Again, these trivial hits were not in the data set we used, which demonstrates the power of FLORA to find functional homologues even after significant evolutionary divergence.
FLORA is a novel algorithm which exploits patterns of structural conservation to derive templates for different functional sub-groups (FSGs) within diverse domain superfamilies. Unlike many other methods which focus on generating templates based on known or predicted functional residues
By generating a superfamily-specific Z-score, we found that the performance of FLORA increases significantly. This suggests that the degree of structural variation that confers a change in function is specific to each superfamily and the absolute structural similarity must be compared to a background distribution. Therefore, as has also been identified at the sequence level
Another important novelty in our approach was to create a large data set comprising 29 superfamilies (which is made publicly available). Although FLORA performed well across the majority of superfamilies, this was not universally true, which suggests that function prediction methods should be benchmarked across as diverse a data set as possible. We have also shown that CATHEDRAL outperforms CE, probably due to producing superior alignments outside of the conserved structural core. Although global structure comparison is not always able to reliably find distant functional relatives, we feel it is appropriate for benchmarking new methods to give a guide of the value they add to structure-based function prediction.
As detailed in the
One of the major ways in which FLORA differs from other methods is by focussing on the domain, rather than on the whole chain or protein complex. Simply because a domain is present in a given enzyme does not necessarily mean it contributes to or confers catalytic activity. Indeed, it might be responsible for protein-protein interactions or other aspects of function, such as locating the protein in a given part of the cell. We have shown that, except in the case where there has been a domain duplication (superfamily 3.30.830.10), deriving structural motifs at the domain level performs as well as aligning whole multi-domain chains. Our hypothesis is that where FLORA does not locate conserved positions around the active site, it is able to find parts of the domain that interact with other catalytic domains. We intend to undertake more detailed analysis of other CATH superfamilies to confirm this.
FLORA makes no assumptions about the physico-chemical (e.g. solvent accessibility or polarity) or sequence conservation properties of residues in the templates it derives, only that they show high structural conservation within a given functional sub-group. As a consequence, we observed residues both around the enzymatic active sites and in other locations in the protein. In two of the example superfamilies presented here, we have shown that FLORA template vectors co-locate around the active site. This is possibly due to structural changes in the protein that allow for different relatives to bind different ligands. However, this trend is not observed across the whole data set, where only 59% of FLORA template vectors are on average closer to the active site than other residues in the protein. This suggests that it is not only the enzymatic site that is important for discriminating between different FSGs, but other locations in the structure related to domain-domain or protein-protein interfaces.
The substantial improvement in performance of FLORA over global structure comparison has allowed us to assign 70% of the structural genomics targets classified in superfamilies in our data set to functional sub-groups, in this case predicting the type of catalytic reaction they perform. Of our FLORA predictions, 78% could not have been reliably made by standard structure comparison techniques, as we were able to transfer annotation from far more distant relatives (RMSD>4 Å). Although some of the predictions we made are supported by experimental work that occurred after the structure was solved, the accuracy of the rest remains to be established by future functional characterisation work.
Taken in the context of our previous analysis of functional divergence across large domain superfamilies in the CATH database
Benchmark data set for FLORA.
(0.08 MB XLS)
Supporting Information.
(0.31 MB DOC)