BES conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, and wrote the paper.

The author has declared that no competing interests exist.

Understanding the relationship between protein structure and function is one of the foremost challenges in post-genomic biology. The higher conservation of structure could, in principle, allow researchers to extend annotation beyond its current limits. However, despite significant research in the area, a precise and quantitative relationship between biochemical function and protein structure has been elusive. Attempts to draw an unambiguous link have often been complicated by pleiotropy, variable transcriptional control, and adaptations to genomic context, all of which confound simple definitions of function. In this paper, I report that integrating genomic information can clarify the link between protein structure and function. First, I present a novel measure of functional proximity between protein structures (F-score). Then, using F-score and other entirely automatic methods measuring structure and phylogenetic similarity, I present a three-dimensional landscape describing their inter-relationship. The result is a “well-shaped” landscape that demonstrates the added value of considering genomic context in inferring function from structural homology. A generalization of the methodology presented in this paper can be used to improve the precision of annotation of genes in current and newly sequenced genomes.

Since the advent of biological data storage in digital format, researchers have struggled to define quantitative measures of comparison for sequence [

Nevertheless, quantitatively relating structural homology to function has been complicated by a dearth of functional distance measures and numerous examples of folds performing many unrelated functions. This many-to-many relationship between structure and function has been linked to fundamental biological processes and characteristics such as adaptation, specialization, pleiotropy, or differential regulation [

I consider the protein domain universe as the set of all structurally characterized domains [

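First, I define a simple but quantitative measure of functional comparison: F-score. F-score is defined as normalized Euclidean distance between GO [

$$F_{A,B} = \left(\sum_{i \in \{\text{functions}\}} \left(p_{A,i} - p_{B,i}\right)^{2}\right)^{1/2}$$

where $F_{A,B}$ is the F-score between domains $A$ and $B$, and $p_{[A|B],i}$ is the probability of function $i$ in domain $A$ or $B$.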

Next, I set out to correlate F-score and structural similarity (Z-score calculated using DALI [

(A) The correlation between structural comparison Z-score and functional distance F-score (Pearson's r = 0.96, slope = 0.007). Each bin contains at least 200 observations. Notably, the average functional distance (F-score) falls from 0.48 to 0.30, that is, by only about a third, over two decades of structural similarity [

(B) The correspondence between phylogenetic profile distances calculated using mutual information and F-score. Slope of the linear fit is 0.36, with Pearson's r = 0.96. The correlation is averaged, i.e., each data point represents a bin containing 150–200 domains, and the functional distances are averaged inside the bin [

(C) The landscape of functional distance with respect to Z and P scores. An average F-score is calculated for each of the 36 bins; each bin contains 100–200 observations. Since F-score is a distance metric, hotter colors represent domains that are farther away and cooler colors represent those that are closer.
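The binning behind a landscape of this kind can be sketched as follows. This is a minimal illustration, not the procedure actually used for the figure: the function name `f_score_landscape`, the bin boundaries, and the toy data are all hypothetical, and the only assumption taken from the caption is that an average F-score is computed per (Z, P) bin.

```python
from bisect import bisect_right

def f_score_landscape(observations, z_edges, p_edges):
    """Average F-score in each (Z, P) bin.

    observations: iterable of (z_score, p_score, f_score) for domain pairs.
    z_edges, p_edges: sorted interior bin boundaries (hypothetical values);
    k boundaries define k + 1 bins, so 5 + 5 boundaries give 36 bins.
    Returns grid[z_bin][p_bin] of mean F-scores, with None for empty bins.
    """
    n_z, n_p = len(z_edges) + 1, len(p_edges) + 1
    sums = [[0.0] * n_p for _ in range(n_z)]
    counts = [[0] * n_p for _ in range(n_z)]
    for z, p, f in observations:
        i = bisect_right(z_edges, z)  # index of the Z bin
        j = bisect_right(p_edges, p)  # index of the P bin
        sums[i][j] += f
        counts[i][j] += 1
    return [
        [sums[i][j] / counts[i][j] if counts[i][j] else None for j in range(n_p)]
        for i in range(n_z)
    ]

# Toy data, invented purely for illustration.
pairs = [(2.5, 0.1, 0.45), (3.0, 0.2, 0.47), (12.0, 0.8, 0.28)]
grid = f_score_landscape(pairs, z_edges=[5.0], p_edges=[0.5])
```

Returning `None` for empty bins keeps sparsely populated regions of the landscape visibly distinct from bins whose average distance happens to be zero.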

From an evolutionary perspective, the environment is often important in defining the precise function of the sequence. Consequently, sequences appearing in the same set of genomes have been shown to perform similar functions [

Finally, quantitative definitions of structure, function, and phylogenetic similarity allowed me to calculate the landscape of F-scores for all pairs of domains with respect to their Z and P scores (

The findings presented here suggest that both our understanding of the structure–function relationship and the precision of functional annotation can be greatly improved by considering structural homology in phylogenetic context. I am currently working to improve on my naïve measure of functional similarity and to assess the robustness of these results to arbitrary cutoff parameters. Furthermore, using these results it may be possible to outline a novel, optimal strategy for functional annotation in the ongoing structural genomics projects.

I employed a Z-score measure of structural proximity as weight for the edges to create a protein domains universe graph (PDUG [

After defining the PDUG, I populated it with sequences, so as to correlate the structures with the set of sequences that fold into those structures. I used a non-redundant database of sequences, NRDB [ $10^{-10}$ threshold. Since structures from DALI are themselves devoid of sequence homologs, at most one structure is found for every non-redundant sequence from NRDB. Since each sequence is annotated with the function that it performs, this yields a mapping not only of non-redundant sequences but also of their respective functions to nodes on PDUG. The distribution of sequences from NRDB that are homologous to DALI structures is given in
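The mapping of sequences to structure nodes can be illustrated with a short sketch. It assumes that homology search results are already available as (sequence, structure, E-value) tuples, and it reads the threshold mentioned above as an E-value cutoff of 1e-10; both are assumptions. The function name `assign_sequences` is hypothetical, and keeping only the best hit per sequence is a simplification that enforces the "at most one structure per sequence" property rather than deriving it from the non-redundancy of the structure set.

```python
def assign_sequences(hits, max_evalue=1e-10):
    """Group sequences under the structure node they match best.

    hits: iterable of (sequence_id, structure_id, evalue) tuples
          (hypothetical format; any tabular search output could be adapted).
    max_evalue: cutoff, assumed here to be an E-value of 1e-10.
    Returns a dict mapping structure_id to a list of sequence_ids.
    """
    best = {}  # sequence_id -> (structure_id, evalue) of its best hit
    for seq_id, struct_id, evalue in hits:
        if evalue > max_evalue:
            continue  # hit too weak to count as homologous
        if seq_id not in best or evalue < best[seq_id][1]:
            best[seq_id] = (struct_id, evalue)

    by_structure = {}
    for seq_id, (struct_id, _) in best.items():
        by_structure.setdefault(struct_id, []).append(seq_id)
    return by_structure

# Toy hits, invented purely for illustration.
toy_hits = [
    ("s1", "d1", 1e-20),
    ("s1", "d2", 1e-12),  # weaker second hit for s1, discarded
    ("s2", "d1", 1e-5),   # above the cutoff, discarded
    ("s3", "d2", 1e-11),
]
nodes = assign_sequences(toy_hits)
```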

Since I was interested in the most general description of functionality of protein domains, I defined the function of each domain as the weighted set of functions performed by all the sequences that align to it. Thus, the functionality of the domain is represented by a probabilistic GO [

Each node on PDUG now had the representative structure, the set of sequences that fold into that structure, and the set of functions performed by those sequences in the form of a probabilistic, hierarchical GO [

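Thus, in order to compare the GO trees, I calculated the Euclidean distance between the nodes on each level of the GO hierarchy by using

$$F_{A,B} = \left(\sum_{i \in \{\text{functions}\}} \left(p_{A,i} - p_{B,i}\right)^{2}\right)^{1/2}$$

Here $F_{A,B}$ is the F-score between domains $A$ and $B$, and $p_{A,i}$ is the probability of function $i$ in domain $A$.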
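The Euclidean comparison of two domains' functional annotations can be sketched as below. This is a simplified illustration under stated assumptions: each domain's function is represented as a flat dictionary of function labels with probabilities (a single level of the hierarchy rather than the full GO tree), the probabilities are obtained by normalizing per-function sequence counts, and the names `normalize` and `f_score` as well as the function labels are hypothetical.

```python
import math

def normalize(counts):
    """Turn per-function sequence counts into probabilities that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("domain has no annotated sequences")
    return {term: count / total for term, count in counts.items()}

def f_score(p_a, p_b):
    """Euclidean distance between two function-probability profiles.

    p_a, p_b: dicts mapping a function label to its probability in the
    domain. A function absent from one domain counts as probability 0.
    """
    terms = set(p_a) | set(p_b)
    return math.sqrt(
        sum((p_a.get(t, 0.0) - p_b.get(t, 0.0)) ** 2 for t in terms)
    )

# Toy annotations with placeholder function labels, for illustration only.
domain_a = normalize({"function_x": 3, "function_y": 1})
domain_b = normalize({"function_x": 1, "function_y": 3})
distance = f_score(domain_a, domain_b)
```

With probability profiles as input, the distance is 0 for identically annotated domains and reaches its maximum of √2 when two domains each perform a single, different function.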

Phylogenetic context (the subset of genomes where the domain is found) can have a profound effect on the function and overall evolution of that domain. With this in mind, I added another dimension to PDUG in which each node is annotated with the genomes where it is present. This was done by simply BLASTing [

The calculation of distance in genome space is non-trivial and is subject to all kinds of qualifications, such as relative distance between genomes on the tree [

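where $p_{ij}$ is the joint probability of observing states $i$ and $j$ in the two phylogenetic profiles, and $p_{i}$ and $p_{j}$ are the corresponding marginal probabilities.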
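Phylogenetic profile distances in this paper are calculated using mutual information. As a minimal sketch, assuming each domain's profile is a binary presence/absence vector over a fixed, ordered list of genomes, the standard mutual information between two profiles can be computed as follows. The function name `mutual_information` is hypothetical, and any normalization or conversion of this quantity into the P-score used here is not reproduced.

```python
import math
from collections import Counter

def mutual_information(profile_a, profile_b):
    """Mutual information (in bits) between two phylogenetic profiles.

    profile_a, profile_b: equal-length sequences of 0/1 values, where
    position k records absence/presence of the domain in genome k.
    """
    if len(profile_a) != len(profile_b) or len(profile_a) == 0:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(profile_a)
    joint = Counter(zip(profile_a, profile_b))  # counts of (state_a, state_b)
    marginal_a = Counter(profile_a)
    marginal_b = Counter(profile_b)

    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        p_a = marginal_a[a] / n
        p_b = marginal_b[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

# Toy profiles over four genomes, invented purely for illustration.
same = mutual_information([1, 0, 1, 0], [1, 0, 1, 0])
unrelated = mutual_information([1, 1, 0, 0], [1, 0, 1, 0])
```

Only state combinations that actually occur enter the sum, so the logarithm is never evaluated at zero.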

Finally, I correlated all three dimensions of PDUG, by observing the F-score between two nodes with respect to both the structural proximity and the phylogenetic distance (

To evaluate the robustness of the results reported in

The data are available online from


The author would like to acknowledge Eugene Shakhnovich, John Max Harvey, and support from NIH.

GO, Gene Ontology

PDUG, protein domains universe graph