CATHEDRAL: A Fast and Effective Algorithm to Predict Folds and Domain Boundaries from Multidomain Protein Structures

We present CATHEDRAL, an iterative protocol for determining the location of previously observed protein folds in novel multidomain protein structures. CATHEDRAL builds on the features of a fast secondary-structure–based method (using graph theory) to locate known folds within a multidomain context and a residue-based, double-dynamic programming algorithm, which is used to align members of the target fold groups against the query protein structure to identify the closest relative and assign domain boundaries. To increase the fidelity of the assignments, a support vector machine is used to provide an optimal scoring scheme. Once a domain is verified, it is excised, and the search protocol is repeated in an iterative fashion until all recognisable domains have been identified. We have performed an initial benchmark of CATHEDRAL against other publicly available structure comparison methods using a consensus dataset of domains derived from the CATH and SCOP domain classifications. CATHEDRAL shows superior performance in fold recognition and alignment accuracy when compared with many equivalent methods. If a novel multidomain structure contains a known fold, CATHEDRAL will locate it in 90% of cases, with <1% false positives. For nearly 80% of assigned domains in a manually validated test set, the boundaries were correctly delineated within a tolerance of ten residues. For the remaining cases, previously classified domains were very remotely related to the query chain so that embellishments to the core of the fold caused significant differences in domain sizes and manual refinement of the boundaries was necessary. To put this performance in context, a well-established sequence method based on hidden Markov models was only able to detect 65% of domains, with 33% of the subsequent boundaries assigned within ten residues. 
Since, on average, 50% of newly determined protein structures contain more than one domain unit, and typically 90% or more of these domains are already classified in CATH, CATHEDRAL will considerably facilitate the automation of protein structure classification.

describes the shape of a protein, given by the orientations of the secondary structures in 3D regardless of their connectivity, e.g. whether the protein resembles a barrel or a sandwich. The T level divides these architectures into their distinct Topologies (folds). The topology of a protein details the specific connectivity amongst the secondary structures. At the H (Homology) level, folds are further separated into superfamilies that have a high structural and sequence or functional similarity, indicating evolution from a common ancestor. The S level divides the homologous structures into Sequence families. Members of the individual S families have ≥35% sequence identity to at least one other relative and have very similar structures and functions.

Description of Structure Comparison Algorithms Employed in CATHEDRAL
Graph Theoretical Approach for Comparing Protein Structures
A graph is a mathematical description of a system, representing both the layout of a system and how the individual components interact. In graph theory terminology, the components of the system are called "nodes" and the interactions are termed "edges". The fold of a protein is readily described as a graph, particularly because secondary structures can be abstracted as vectors [3]. The nodes in such a graph are associated with each secondary structure vector and are denoted as either helix or strand. The edges within the graph are labelled by the geometric relationships between each pair of secondary structures: more specifically, the distance of closest approach, the dot-product angle and the dihedral angle.
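The graph description above can be illustrated with a minimal Python sketch. It is not the CATHEDRAL implementation: the function names, the input format and the toy coordinates are assumptions made for illustration. For brevity the edge label uses the distance between vector midpoints as a stand-in for the distance of closest approach, and the dihedral angle is omitted.

```python
import math
from itertools import combinations

def vec(start, end):
    """Direction vector of a secondary structure from its start to its end point."""
    return tuple(e - s for s, e in zip(start, end))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def angle_deg(u, v):
    """Dot-product angle between two vectors, in degrees."""
    cos = dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def midpoint(start, end):
    return tuple((s + e) / 2 for s, e in zip(start, end))

def build_sse_graph(sses):
    """Build a labelled graph from secondary structure vectors.

    sses: list of (kind, start_xyz, end_xyz), where kind is 'H' (helix)
    or 'E' (strand). This input format is a hypothetical choice.
    """
    nodes = [kind for kind, _, _ in sses]
    edges = {}
    for i, j in combinations(range(len(sses)), 2):
        _, s1, e1 = sses[i]
        _, s2, e2 = sses[j]
        edges[(i, j)] = {
            # Simplification: midpoint distance, not true closest approach.
            "distance": math.dist(midpoint(s1, e1), midpoint(s2, e2)),
            "angle": angle_deg(vec(s1, e1), vec(s2, e2)),
        }
    return nodes, edges

# Toy example: one helix and two strands with made-up coordinates.
sses = [
    ("H", (0.0, 0.0, 0.0), (10.0, 0.0, 0.0)),
    ("E", (0.0, 5.0, 0.0), (0.0, 5.0, 8.0)),
    ("E", (0.0, 10.0, 0.0), (10.0, 10.0, 0.0)),
]
nodes, edges = build_sse_graph(sses)
```

Every pair of secondary structures receives an edge, so the graph is fully connected and all of the structural information sits in the edge labels.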
The transformation of structures into graphs allows the amount of overlap in the geometrical descriptions of two proteins to be determined by constructing a correspondence graph. A clique is a part of this graph in which every node has an edge connecting it to all the other nodes. In CATHEDRAL, a standard Bron-Kerbosch search algorithm is used to detect the largest clique in the correspondence graph, which corresponds to a matching structural motif between the two proteins. Further details of this methodology can be found in [4].
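As a rough illustration of the clique search, the sketch below builds a correspondence graph from two small labelled graphs and runs the basic Bron-Kerbosch recursion (without pivoting). The tolerances, the `(distance, angle)` edge labels and the toy data are assumptions for this example and are not taken from CATHEDRAL or [4].

```python
def edge(edges, a, b):
    """Look up an undirected edge label stored under its sorted node pair."""
    return edges[(min(a, b), max(a, b))]

def correspondence_graph(nodes_a, edges_a, nodes_b, edges_b,
                         dist_tol=2.0, angle_tol=30.0):
    """Nodes are pairs (i, j) of same-type secondary structures from A and B.

    Two pairs are adjacent when the geometry between the A members
    resembles the geometry between the B members (hypothetical tolerances).
    """
    pairs = [(i, j) for i, ka in enumerate(nodes_a)
             for j, kb in enumerate(nodes_b) if ka == kb]
    adj = {p: set() for p in pairs}
    for (i, j) in pairs:
        for (k, l) in pairs:
            if i == k or j == l:
                continue
            da, aa = edge(edges_a, i, k)
            db, ab = edge(edges_b, j, l)
            if abs(da - db) <= dist_tol and abs(aa - ab) <= angle_tol:
                adj[(i, j)].add((k, l))
    return adj

def bron_kerbosch(r, p, x, adj, cliques):
    """Basic Bron-Kerbosch: collect every maximal clique into `cliques`."""
    if not p and not x:
        cliques.append(r)
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, cliques)
        p = p - {v}
        x = x | {v}

# Toy data: edge labels are (distance, angle) tuples with made-up values.
nodes_a = ["H", "E", "E"]
edges_a = {(0, 1): (10.0, 90.0), (0, 2): (12.0, 45.0), (1, 2): (5.0, 170.0)}
nodes_b = ["E", "H", "E", "E"]
edges_b = {(0, 1): (25.0, 10.0), (0, 2): (30.0, 20.0), (0, 3): (28.0, 15.0),
           (1, 2): (10.5, 85.0), (1, 3): (11.5, 50.0), (2, 3): (4.8, 175.0)}

adj = correspondence_graph(nodes_a, edges_a, nodes_b, edges_b)
cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
best = max(cliques, key=len)
```

In this toy case the largest clique pairs the three secondary structures of A with elements 1, 2 and 3 of B, leaving the extra strand 0 of B unmatched.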
A simple scoring function is utilised to measure the similarity between two structures [4]. This is based on the sizes of the two comparison proteins, the size of the clique and the percentage of equivalent residues within the clique (residue overlap). Distributions of scores returned from database scans using this approach exhibit an extreme value distribution in the tail. This allows numerical analysis to calculate the frequency with which any particular score could be obtained by chance. The resulting E-value has been shown to provide a consistent statistical description of comparisons across fold space and has no obvious biases towards certain folds, architectures or classes.
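The step from a raw score to an E-value can be sketched as follows. The text states only that the tail of the score distribution follows an extreme value distribution. The Gumbel form, the method-of-moments fit over the whole sample and the made-up background scores below are assumptions for illustration, not the published procedure.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329

def fit_gumbel(scores):
    """Method-of-moments estimate of Gumbel location (mu) and scale (beta)."""
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta

def e_value(score, mu, beta, n_comparisons):
    """Expected number of chance hits scoring at least `score`.

    P(S >= score) = 1 - exp(-exp(-(score - mu) / beta)) for a Gumbel
    distribution, and the E-value is that probability multiplied by
    the number of comparisons in the database scan.
    """
    z = (score - mu) / beta
    p = -math.expm1(-math.exp(-z))  # numerically stable 1 - exp(-exp(-z))
    return n_comparisons * p

# Hypothetical background scores from comparisons of unrelated structures.
background = [42.0, 45.5, 47.0, 50.0, 51.5, 53.0, 55.0, 58.5, 61.0, 70.0]
mu, beta = fit_gumbel(background)
weak = e_value(60.0, mu, beta, n_comparisons=1000)
strong = e_value(120.0, mu, beta, n_comparisons=1000)
```

A higher raw score gives a smaller E-value, so `strong` is far below `weak` for the same database size.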

Overview of Double Dynamic Programming for Structural Alignment
Double dynamic programming was first employed in the SSAP program developed by Taylor and Orengo in 1989. It uses the popular Needleman and Wunsch global dynamic programming algorithm on two levels of matrices. A single upper level matrix is used to accumulate possible alignment paths from the lower level matrices, which compare the structural environments of putative equivalent residue pairs. A structural environment is described by the set of vectors from a given residue to all other residues in the same structure. Vectors are calculated between Cβ atoms and then transformed to a common co-ordinate frame defined by the tetrahedral geometry of the Cα atom. Dynamic programming is then used to align the vector sets for a pair of residues in each protein and, if the cumulative score is sufficiently high, the alignment path through the score matrix is added to the upper level summary matrix. The top 20 highest scoring pairs are selected from this matrix, which is then reset to 0. The top 20 pairs are then re-compared and these paths are added to the summary matrix. Finally, dynamic programming is used to determine the best alignment path through the summary matrix, giving the final similarity score between the proteins.
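The final step, finding the best path through the summary matrix, can be sketched with a plain Needleman and Wunsch recursion. This shows a single dynamic programming level only. The linear gap penalty, the function name and the toy matrix are assumptions for illustration and do not reproduce SSAP's scoring.

```python
def needleman_wunsch(score, gap=-1.0):
    """Global alignment over a precomputed score matrix.

    score[i][j] is the reward for pairing residue i of protein A with
    residue j of protein B (here, a toy stand-in for the summary matrix).
    Returns the best total score and the list of aligned (i, j) pairs.
    """
    n, m = len(score), len(score[0])
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score[i - 1][j - 1],  # pair i with j
                dp[i - 1][j] + gap,                      # gap in B
                dp[i][j - 1] + gap,                      # gap in A
            )
    # Trace back from the bottom-right corner to recover the path.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif dp[i][j] == dp[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs

# Toy summary matrix: 3 residues in A, 4 residues in B.
summary = [
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 0.0],
    [0.0, 0.0, 0.0, 6.0],
]
total, pairs = needleman_wunsch(summary)
```

In the double dynamic programming scheme the same recursion would also be run at the lower level, once per candidate residue pair, on matrices that compare the two residues' structural environments.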

Description of Other Structure Comparison Algorithms Used in Assessing the Performance of CATHEDRAL
CE [5] identifies matching octapeptide fragments between structures, which share similar local geometry. These are described as aligned fragment pairs (AFPs) and are concatenated in succession to extend the alignment, with gaps permitted provided their length does not exceed 30 residues, to maintain the speed of the algorithm. CE then seeks the alignment with the best RMSD using dynamic programming and this is returned as a Z-score.
DALI [6] also uses a small fragment approach to construct its alignments. Six-residue peptides are compared using contact maps and potentially equivalent pairs are identified by searching for similar patterns of distances between residues. A Monte Carlo optimisation method is employed to search for equivalent sets of similar hexapeptide pairs to be concatenated into an alignment. DALI uses many initial alignments and searches for the best one based on the RMSD. Output includes a raw score, summed over all aligned residue pairs, and a normalised Z-score.
STRUCTAL [7] identifies an initial alignment between the structures and uses this to superimpose the structures by rigid body transformation to obtain a minimal RMSD. Subsequently, an optimal alignment is obtained through dynamic programming. Initial alignments are obtained in various ways, for example by considering the sequence similarity of the proteins or torsional angle similarity. An iterative approach is employed whereby alignments are refined by dynamic programming and this is followed by further superposition until a local optimum is converged upon. STRUCTAL provides a statistical measure of the significance of the final alignment in the form of a p-value.
LSQMAN [8] also adopts an iterative approach based on rigid body superposition. The first residue of each secondary structure element in the two structures is optimally superposed to give an initial transformation. Subsequently, the method seeks long alignments, of at least 4 residues, in which matching residues are within 6 Å separation. These alignments guide a new superposition and the process is repeated in an iterative fashion, with the distance threshold being increased for each iteration. LSQMAN outputs a Z-score to give a statistical interpretation of the alignment's significance.