Conceived and designed the experiments: AS JZ. Performed the experiments: WL. Analyzed the data: WL. Contributed reagents/materials/analysis tools: WL. Wrote the paper: AS JZ.

The authors have declared that no competing interests exist.

Comparison of protein structures is important for revealing the evolutionary relationship among proteins, predicting protein functions and predicting protein structures. Many methods have been developed in the past to align two or multiple protein structures. Despite the importance of this problem, rigorous mathematical or statistical frameworks have seldom been pursued for general protein structure comparison. One notable issue in this field is that with many different distances used to measure the similarity between protein structures, none of them are proper distances when protein structures of different sequences are compared. Statistical approaches based on those non-proper distances or similarity scores as random variables are thus not mathematically rigorous. In this work, we develop a mathematical framework for protein structure comparison by treating protein structures as three-dimensional curves. Using an elastic Riemannian metric on spaces of curves, geodesic distance, a proper distance on spaces of curves, can be computed for any two protein structures. In this framework, protein structures can be treated as random variables on the shape manifold, and means and covariance can be computed for populations of protein structures. Furthermore, these moments can be used to build Gaussian-type probability distributions of protein structures for use in hypothesis testing. The covariance of a population of protein structures can reveal the population-specific variations and be helpful in improving structure classification. With curves representing protein structures, the matching is performed using elastic shape analysis of curves, which can effectively model conformational changes and insertions/deletions. We show that our method performs comparably with commonly used methods in protein structure classification on a large manually annotated data set.

Protein structure comparison is important for understanding the evolutionary relationships among proteins, predicting protein functions, and predicting protein structures. Despite its importance, there have been no rigorous mathematical or statistical frameworks for protein structure comparison. One notable issue in this field is that, although many different similarity measures are used to compare protein structures, none of them is a proper distance when protein structures of different sequences are compared. In this study, we develop a mathematical framework for protein structure comparison by treating protein structures as three-dimensional curves. A formal distance, the geodesic distance, can be computed for any two protein structures. In this framework, population-specific variations within protein families can be characterized by building probability distributions for the structures of protein families. The mean and covariance computed from groups of protein structures can also help to improve the classification of protein structures. With curves representing protein structures, the matching is performed using elastic shape analysis of curves, which can effectively model conformational changes and insertions/deletions.

Comparison of protein structures (or structure alignment) is an important tool for understanding the evolutionary relationships between proteins, predicting protein structures and predicting protein functions.

A previous study aiming to develop a statistical framework for structure alignment

In this study, we develop a mathematical framework for protein structure comparison using a formal distance, a geodesic distance based on a particular Riemannian metric. Geodesic distances from elastic shape analysis have been widely used for shape analysis in computer vision.

The rest of this paper is organized as follows. We first describe the mathematical framework behind our approach to protein structure comparison. We then use some examples to illustrate the method in pairwise structure alignment and in computing the mean and covariance of a group of protein structures. We further demonstrate the performance of our method using a large-scale classification of proteins in the SCOP database and compare it with some commonly used methods. Finally, we conclude the paper with a discussion.

We treat the backbone structure of a protein as a parameterized curve in $\mathbb{R}^3$. Given any two such parameterized curves, we desire a framework that can quantify the differences in the shapes of these two curves. Since the comparisons involve shapes of proteins, the resulting quantifications should not depend on the rigid motions and parameterizations of these curves. We use a Riemannian framework for this task, and the basic idea is the following. We represent each parameterized curve by a special function called the square-root velocity function (SRVF).

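To derive a curve from a protein structure, we take the sequence of 3D coordinates of the backbone atoms N, CA and C from the PDB file. The $i$-th atom, at parameter value $t_i$, gives the point $\beta(t_i) = (\beta_1(t_i), \beta_2(t_i), \beta_3(t_i))$ on the curve $\beta$.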

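Let the parameterized curve in $\mathbb{R}^3$ derived from the backbone structure of a protein be denoted as $\beta: [0,1] \to \mathbb{R}^3$. In order to analyze its shape, we represent $\beta$ by its square-root velocity function (SRVF) $q: [0,1] \to \mathbb{R}^3$, where $q(t) = \dot{\beta}(t)/\sqrt{\|\dot{\beta}(t)\|} \in \mathbb{R}^3$. The SRVF is unchanged by translation of $\beta$, and $\int_0^1 \|q(t)\|^2\,dt$ equals the length of $\beta$, so rescaling every curve to unit length restricts attention to the set of SRVFs of curves in $\mathbb{R}^3$ of length one. It is actually a unit sphere in the Hilbert space $L^2([0,1], \mathbb{R}^3)$.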
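As a minimal sketch of this representation, the following Python code computes a discrete SRVF from a list of backbone coordinates. It assumes the standard definition $q(t) = \dot{\beta}(t)/\sqrt{\|\dot{\beta}(t)\|}$, a uniform parameter spacing on $[0,1]$, and a curve whose points are not all identical; the function names `srvf` and `l2_inner` are illustrative and not taken from the original implementation.

```python
import math

def srvf(points):
    """Discrete SRVF of a polyline given as a list of (x, y, z) tuples.

    Assumes uniform parameter spacing on [0, 1]. Returns one SRVF sample
    per segment, rescaled so that the underlying curve has unit length.
    """
    n = len(points) - 1                 # number of segments
    dt = 1.0 / n
    q = []
    for a, b in zip(points, points[1:]):
        # finite-difference velocity on this segment
        v = [(bj - aj) / dt for aj, bj in zip(a, b)]
        speed = math.sqrt(sum(c * c for c in v))
        # q = v / sqrt(|v|); a zero-velocity segment maps to the zero vector
        q.append([c / math.sqrt(speed) if speed > 0 else 0.0 for c in v])
    # The squared L2 norm of q equals the curve length, so dividing by the
    # norm rescales the curve to length one.
    norm = math.sqrt(sum(c * c for qi in q for c in qi) * dt)
    return [[c / norm for c in qi] for qi in q]

def l2_inner(q1, q2):
    """Discrete L2 inner product of two SRVFs on the same uniform grid."""
    dt = 1.0 / len(q1)
    return dt * sum(a * b for u, v in zip(q1, q2) for a, b in zip(u, v))
```

Because only coordinate differences enter the computation and the result is normalized, translating or uniformly scaling the input points leaves the output unchanged up to round-off, which is the sense in which translation and scale are removed from the representation.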

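We have mentioned four shape-preserving transformations: translation, scale, rotation, and re-parameterization. Of these, we have already eliminated the first two from the representations, but the other two remain. Curves that are within a rotation and/or a re-parameterization of each other result in different elements of this sphere even though they have the same shape. We therefore group all such elements into a single equivalence class, and it is these equivalence classes that we compare.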

When we deform one curve into another we are actually generating a continuous sequence of curves, or a path in the curve space, and a natural question is how long that path is. The length of this path quantifies the amount of deformation in going from one curve to the other. The question then becomes: what metric should be used to measure this path length? An elastic metric is a metric that measures the amount of bending and stretching between successive curves along the path and adds them up over the full path. Mio et al.

Let

The so-called preshape space

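Once we have a Riemannian manifold, we can compute distances between points in that manifold. For any two points, the distance between them is given by the length of the shortest path (called a geodesic) connecting them. Consider two curves $\beta_1$ and $\beta_2$, represented by their SRVFs $q_1$ and $q_2$. In order to compute geodesics between their equivalence classes $[q_1]$ and $[q_2]$, we fix $q_1$ and find the optimal rotation and re-parameterization of $q_2$ to solve:

$$(O^*, \gamma^*) = \arg\min_{O \in SO(3),\ \gamma \in \Gamma} \cos^{-1}\left( \left\langle q_1,\ O\,(q_2 \circ \gamma)\sqrt{\dot{\gamma}} \right\rangle \right),$$

where $\Gamma$ is the set of re-parameterizations of $[0,1]$ and $\langle \cdot, \cdot \rangle$ is the $L^2$ inner product. Writing $q_2^* = O^*\,(q_2 \circ \gamma^*)\sqrt{\dot{\gamma}^*}$, the geodesic between $q_1$ and $q_2^*$ is the great-circle arc $\alpha(\tau) = \frac{1}{\sin\theta}\left[\sin((1-\tau)\theta)\,q_1 + \sin(\tau\theta)\,q_2^*\right]$, which starts in $[q_1]$ at $\tau = 0$ and ends in $[q_2]$ at $\tau = 1$. The geodesic distance is $d([q_1], [q_2]) = \theta = \cos^{-1}\left(\langle q_1, q_2^* \rangle\right)$.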
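The last step of this computation, the great-circle distance and path between two unit-norm SRVFs, can be sketched as follows. This is only an illustration under stated assumptions: the two SRVFs are sampled on the same uniform grid, they are not antipodal, and the optimization over rotations and re-parameterizations is assumed to have been carried out already (it is not implemented here). The function names are hypothetical.

```python
import math

def inner(q1, q2):
    """Discrete L2 inner product of two SRVFs sampled on the same uniform grid."""
    dt = 1.0 / len(q1)
    return dt * sum(a * b for u, v in zip(q1, q2) for a, b in zip(u, v))

def sphere_distance(q1, q2):
    """Arc length between two unit-norm SRVFs on the unit sphere."""
    c = max(-1.0, min(1.0, inner(q1, q2)))   # clamp against round-off
    return math.acos(c)

def geodesic_point(q1, q2, tau):
    """Point at fraction tau in [0, 1] along the great circle from q1 to q2.

    Assumes q1 and q2 are not antipodal (sin(theta) must be nonzero).
    """
    theta = sphere_distance(q1, q2)
    if theta < 1e-12:                        # the two points coincide
        return [list(u) for u in q1]
    w1 = math.sin((1.0 - tau) * theta) / math.sin(theta)
    w2 = math.sin(tau * theta) / math.sin(theta)
    return [[w1 * a + w2 * b for a, b in zip(u, v)] for u, v in zip(q1, q2)]
```

Evaluating `geodesic_point` on a grid of `tau` values gives a discrete deformation path; each intermediate SRVF stays on the unit sphere, so it still corresponds to a curve of length one.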

a) Elastic matching between protein 1MP6 and 1G1J. b) Elastic matching between protein 1MP6 and 2K98. c) Matching between protein 1MP6 and 2EOW.

In

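Proteins are flexible molecules and conformational dynamics is important for protein functions. Given a set of structures $\beta_1, \beta_2, \ldots, \beta_n$ with SRVFs $q_1, q_2, \ldots, q_n$, we define their mean shape as the Karcher mean, $[\mu] = \arg\min_{[q]} \sum_{i=1}^{n} d([q], [q_i])^2$, that is, the shape that minimizes the sum of squared geodesic distances to the given structures.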
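A common way to compute a mean on a sphere is the iterative Karcher (Fréchet) mean. The sketch below illustrates that iteration for unit vectors, assuming each structure has already been flattened to a unit vector and aligned in rotation and parameterization; in a full elastic framework the alignment to the current mean would be repeated inside the loop. The step size, iteration cap, and function names are illustrative choices, not values from the original study.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_map(mu, q):
    """Tangent vector at mu pointing toward q, with length equal to the arc distance."""
    c = max(-1.0, min(1.0, dot(mu, q)))
    theta = math.acos(c)
    if theta < 1e-12:
        return [0.0] * len(mu)
    return [theta / math.sin(theta) * (b - c * a) for a, b in zip(mu, q)]

def exp_map(mu, v):
    """Point reached by following the great circle from mu along tangent vector v."""
    norm = math.sqrt(dot(v, v))
    if norm < 1e-12:
        return list(mu)
    return [math.cos(norm) * a + math.sin(norm) / norm * b for a, b in zip(mu, v)]

def karcher_mean(points, step=0.5, iters=100, tol=1e-10):
    """Gradient-descent estimate of the Karcher mean of unit vectors."""
    mu = list(points[0])                      # initialize at the first sample
    for _ in range(iters):
        logs = [log_map(mu, q) for q in points]
        mean_v = [sum(col) / len(points) for col in zip(*logs)]
        if math.sqrt(dot(mean_v, mean_v)) < tol:
            break                             # average tangent vector vanishes
        mu = exp_map(mu, [step * c for c in mean_v])
    return mu
```

The iteration stops when the average of the tangent vectors is numerically zero, which is the first-order condition for minimizing the sum of squared geodesic distances.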

a) 2JVD NMR structures. b) The mean structure of multiple 2JVD NMR structures. c) Samples from the probability distribution. d) Samples on the largest variation direction.

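One unique feature of the framework is its ability to calculate covariances for populations of protein structures. Covariances can reveal the population-specific variation among a group of protein structures and can be used in the classification of protein structures. From the covariances we can identify the directions with the largest variation within a group of protein structures. To define the sample covariance we first approximate the shape space by its tangent space $T_\mu$ at the mean shape $\mu$. Each structure $q_i$ is mapped to a tangent vector $v_i \in T_\mu$, and the sample covariance is $K = \frac{1}{n-1}\sum_{i=1}^{n} v_i v_i^t$. The singular value decomposition $K = U \Sigma U^t$ gives the variances $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \sigma_3, \ldots)$ and the principal directions $U = (u_1, u_2, \ldots, u_k)$.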

If we set _{i}_{1}. To display the rigid superposition of the structures we translate them so that their centres of mass coincide with that of the mean structure. The rotations were obtained by optimally matching the SRVFs of these structures to the mean structure. The variation can be decomposed to the residue level, so that the flexibility (structure variation) of each residue can be analyzed. To obtain the variance for each residue, we randomly sampled 10 structures and aligned them with the mean shape. The distance between each residue of a sampled structure and the corresponding residue of the mean shape can then be calculated and used to compute the variance of that particular residue. In

The x-axis shows the residue indices and the y-axis shows the variation of each residue.
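One plausible reading of the per-residue computation described above is sketched below: given sampled structures that are already superposed on the mean structure, the variation of a residue is taken as its mean squared distance to the corresponding residue of the mean. The exact variance definition used in the study is not stated in the text, so this choice and the function name are assumptions.

```python
def per_residue_variation(aligned, mean):
    """Mean squared distance of each residue to the mean structure.

    aligned: list of structures, each a list of (x, y, z) tuples that has
             already been superposed on `mean` (alignment is not done here).
    mean:    list of (x, y, z) tuples with one point per residue.
    """
    n = len(aligned)
    variation = []
    for r, m in enumerate(mean):
        # squared distance of residue r in every sampled structure to the mean
        sq = [sum((a - b) ** 2 for a, b in zip(s[r], m)) for s in aligned]
        variation.append(sum(sq) / n)
    return variation
```

Plotting the returned list against the residue index gives a profile of the kind described in the caption above.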

To illustrate how we can use the covariance for structure classification, we sampled two random structures from the distribution built using multiple NMR structures of protein 2JVD with the same geodesic distance to the mean structure, but along two different directions. The probabilities of the two structures under the calculated distribution are 0.0429 and 0.0048, respectively. Although they have the same distance to the mean structure, their probabilities are quite different. This is so because structure 1 lies in the direction with largest variability in the population and structure 2 lies in the direction with much smaller variability, as shown in

a) Illustration of two directions with different variations. b) Two structures sampled on the two directions of different variability but with similar distance to the mean shape.
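The effect described above, equal distance to the mean but different probability, can be illustrated with a Gaussian whose covariance is diagonal in the principal directions. The variances and the distance in the usage example below are hypothetical numbers chosen for illustration; they are not fitted to the 2JVD data and do not reproduce the values 0.0429 and 0.0048.

```python
import math

def gaussian_density(coords, variances):
    """Density of a zero-mean Gaussian with diagonal covariance.

    coords:    coordinates of a tangent vector in the principal directions
    variances: variance along each principal direction
    """
    log_p = 0.0
    for z, var in zip(coords, variances):
        log_p += -0.5 * z * z / var - 0.5 * math.log(2.0 * math.pi * var)
    return math.exp(log_p)

# Hypothetical population: large variance along direction 1, small along direction 2
variances = [0.04, 0.0025]
r = 0.1                                          # same distance to the mean in both cases
p1 = gaussian_density([r, 0.0], variances)       # along the high-variance direction
p2 = gaussian_density([0.0, r], variances)       # along the low-variance direction
```

Here `p1` exceeds `p2` even though both points lie at distance `r` from the mean, which is why a distance alone cannot separate the two cases while the covariance-based density can.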

In this section, we present the performance of our method on a large scale protein structure classification using structures from SCOP database and compare our results with CE

| Class | ESA TP/TN/FP/FN | ESA RI | CE TP/TN/FP/FN | CE RI | Matt TP/TN/FP/FN | Matt RI |
|---|---|---|---|---|---|---|
| A 229/9 | 2417/16880/6308/501 | 0.7392 | 2778/18721/4467/140 | 0.8235 | 2728/5643/17545/190 | 0.3207 |
| B 516/13 | 10786/84675/35621/1788 | 0.7185 | 12193/91395/28901/381 | 0.7796 | 12137/97311/22985/437 | 0.8237 |
| C 516/17 | 6624/98811/24907/2528 | 0.7935 | 8075/82681/41037/1077 | 0.6830 | 8827/106865/16853/325 | 0.8707 |
| D 292/8 | 5728/34756/1324/678 | 0.9529 | 6361/30684/5396/45 | 0.8719 | 6406/36080/0/0 | 1 |
| Total 1579/48 | 24409/1117980/96476/6966 | 0.9170 | 29431/1125707/88749/1944 | 0.9272 | 30042/1069181/145275/1333 | 0.8823 |
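The RI values in the table are consistent with the Rand index, the fraction of protein pairs on which a method agrees with the reference classification: RI = (TP + TN) / (TP + TN + FP + FN). A short check, with the counts copied from the table and a function name chosen for illustration:

```python
def rand_index(tp, tn, fp, fn):
    """Fraction of pairs classified consistently with the reference."""
    return (tp + tn) / (tp + tn + fp + fn)

# Class A, ESA column of the table
ri_esa_a = rand_index(2417, 16880, 6308, 501)

# Total row, ESA column of the table
ri_esa_total = rand_index(24409, 1117980, 96476, 6966)
```

In each row the four counts sum to n(n-1)/2 for the first number listed with the class (for example, 229 structures in class A give 26106 pairs), so every pair of structures is counted exactly once.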

An example that illustrates the strength of our method is protein pair 1ycc and 1gu2, which have a small geodesic distance (0.84) and are correctly classified into the same family by our method. For these two proteins, CE gives a small z-score (2.6) and classifies them into different classes. DaliLite and Mammoth give z-scores of 3.2 and 1.6, respectively (small scores imply large distances). Matt, a method for flexible protein alignment, gives a

a) Rigid superposition of 1ycc and 1gu2. b) Matching of points along 1gu2 and 1ycc by elastic shape analysis. The red regions label helices, green regions label strands, and blue regions label coils.

Finally we compared the running time of our method with several other methods.

| Method | ∼100 residues | ∼200 residues | ∼300 residues |
|---|---|---|---|
| CE | 1.3 | 3.0 | 5.1 |
| Matt | 0.16 | 2.30 | 2.1 |
| Mustang | 1.2 | 2.6 | 15 |
| ESA | 0.74 | 1.04 | 1.54 |

In summary, we have developed a mathematical framework for protein structure comparison based on elastic shape analysis, a method originally developed in the field of computer vision and image analysis. Under this framework, protein structures are compared as three-dimensional elastic curves and can be treated as random variables for statistical analysis. The mean and covariance of a group of protein structures can be computed. Probability distributions can be built for a population of protein structures, and hypothesis testing can be conducted for a protein structure against a known protein family/class. Although protein structures have been studied for many years and many computational methods have been developed for protein structure comparison, as far as we know, this is the first rigorous mathematical framework that can address the above computations.

It is worth mentioning that although we consider protein structures as three-dimensional curves in this study and ignore the sequence and local structure features (such as secondary structures), the framework can readily incorporate amino acid sequence or secondary structure information. Such additional information can be very helpful in achieving better alignment. For example, secondary structure information has been used by many structure alignment methods, since secondary structure type is the major feature used in manual structure classification. To incorporate such auxiliary information we can construct continuous auxiliary functions along the curves, derived from the additional information. The matching and deformations can then be performed using the higher-dimensional composite curves that are formed by concatenating the geometric and the auxiliary coordinates. The distances obtained are still proper distances on the higher-dimensional space. In this matching, one needs to adjust the relative magnitude (weight) of the geometric and auxiliary coordinates, which can be problem dependent. With secondary structure type as the auxiliary function, we can force protein fragments with the same secondary structure type to match each other by giving a larger weight to the auxiliary secondary structure information, which may further improve the accuracy of structure classification. When using sequence as auxiliary information, one can perform alignment in both structure and sequence space by using an amino acid substitution matrix (for instance, the BLOSUM62 matrix) as the distance measure for amino acid residues along the chains. One can also force all corresponding residues to match each other when comparing two protein structures with the same sequence. The geodesic path (deformation from one structure to another) generated under such a constraint may then have a more natural physical interpretation.
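The composite-curve construction can be sketched in a few lines. The one-hot encoding of secondary structure (helix, strand, coil) and the function name below are assumptions made for illustration; a continuous auxiliary function, as described above, would additionally require smoothing or interpolating these values along the chain.

```python
def composite_curve(coords, aux, weight):
    """Concatenate 3D coordinates with weighted auxiliary values.

    coords: list of (x, y, z) tuples, one per sample point
    aux:    list of auxiliary feature tuples, one per sample point
    weight: relative magnitude of the auxiliary coordinates
    Returns a list of points in R^(3 + k), where k is the number of features.
    """
    if len(coords) != len(aux):
        raise ValueError("coords and aux must have the same number of points")
    return [tuple(p) + tuple(weight * a for a in f) for p, f in zip(coords, aux)]

# Hypothetical one-hot secondary structure labels: (helix, strand, coil)
HELIX, STRAND, COIL = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.5, 0.0)]
curve6d = composite_curve(coords, [HELIX, HELIX, COIL], weight=2.0)
```

A larger `weight` makes mismatched secondary structure types more costly relative to geometric differences, which is the tuning knob referred to in the text.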

In this study we focused mainly on pairwise protein structure comparison and on the basic properties of a population of structures, such as means and covariances. The framework can also be applied to multiple structure comparison (multiple structure alignment) and can provide an alignment of multiple structures if desired. To do so, we can calculate the mean structure of the multiple structures and align each structure to the mean structure. The mathematical framework also provides principled ways to deal with more complex situations. For example, in the troublesome case where one or more structures are very different from the rest of the structures to be aligned, outliers can be detected based on the mean and covariance structure of the population.

In constructing a probability distribution from a group of structures, we chose the tangent space of our shape space and assumed a Gaussian distribution on this space. The shape spaces (ours and most other formally defined shape spaces) are highly nonlinear manifolds and it is difficult to build distributions on them directly. On the other hand, it is very common practice to impose probability distributions on tangent spaces, since they are linear (vector) spaces. The mapping between a tangent space and the manifold can be made a bijection by putting some appropriate constraints on the tangent space. As for the choice of the Gaussian distribution, we have not validated it on the tangent spaces of our shape space. Our goal in this study is to demonstrate the computation of the second moment for observed shapes and to suggest the simplest probability model that captures the first two moments, i.e. a multivariate normal. One can easily extend this framework to include mixtures of Gaussian models.

Since we represent protein structures as curves, our method mainly deals with the type of structure comparison where the sequence order of amino acid residues is relevant to the distances between structures (sequential structure alignment). In general, our method is not good at detecting related proteins whose differences are caused by changes such as domain swapping or domain insertion/deletion. However, the method can be readily modified to compare circularly permuted proteins.
