Figure 1.
The horizontal line gives the percent identity between query and subject sequences, and the boxes give the resources and tools that can be used for functional inference.
Figure 2.
Ten-step procedure for comparative analysis of protein structures and sequences to infer biological function.
Table 1.
URLs used for this tutorial
Figure 3.
PSI-BLAST input panel (top) and PSI-BLAST output iteration (bottom).
(Top) Default parameters are used. The FASTA sequence of the query protein with UniProt accession O67940 from Aquifex aeolicus is searched against NCBI's nr database. (Bottom) The query protein O67940_AQUAE hits several structures (tagged with S in a red box). Only two of the non-redundant structures, PDB IDs 2Q6O and 1RQP (marked by a pink box), are functionally characterized; their E-values are 3e-20 and 3e-17 and their percent identities are 32% and 26%, respectively. (The Expect value (E), or E-value, is a parameter that describes the number of hits one can “expect” to see by chance when searching a database of a particular size. It decreases exponentially as the Score (S) of the match increases.)
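The exponential dependence of the E-value on the score can be illustrated with the Karlin-Altschul relation E = K·m·n·e^(−λS), where m is the query length and n is the database length. The sketch below is a minimal illustration only: the function name `expected_hits`, the default values of λ and K (typical of gapped BLOSUM62 searches), and the query and database sizes are assumptions and are not taken from the search shown in Figure 3.

```python
import math


def expected_hits(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul estimate of the number of chance hits.

    lam and k are illustrative defaults (assumed, not read from the
    PSI-BLAST run in Figure 3); real values depend on the scoring
    matrix and gap penalties used.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Hypothetical query and database sizes, chosen only for illustration.
for score in (50, 100, 150):
    print(score, expected_hits(score, query_len=250, db_len=1_000_000_000))
```

Each fixed increase in the raw score multiplies the E-value by the same factor, which is why small E-values such as 3e-20 correspond to matches that are very unlikely to arise by chance.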
Figure 4.
Pairwise alignment between query sequence O67940_AQUAE and 2Q6O (top) and 1RQP (bottom).
(Top) The query aligns end to end without any long gaps, with a sequence identity of 32%. (Bottom) The query aligns end to end but with three gapped regions, the most significant being a 23-residue region in 1RQP residues 92–116. The sequence identity of the query with 1RQP is 26%.
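For readers who want to reproduce a percent identity from a pairwise alignment, the following sketch counts identical columns in two gapped, equal-length strings. It assumes the convention of dividing by the full alignment length, gaps included; other tools divide by the shorter sequence or by the ungapped columns, so the result may differ slightly from a reported value. The function name `percent_identity` and the toy sequences are hypothetical and are not taken from Figure 4.

```python
def percent_identity(aligned_query, aligned_subject, gap="-"):
    """Percent of alignment columns holding the same residue in both rows.

    Assumed convention: the denominator is the alignment length,
    including gapped columns.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != gap
    )
    return 100.0 * matches / len(aligned_query)


# Toy alignment (hypothetical sequences): 4 identical columns out of 6.
print(round(percent_identity("MKV-LA", "MRVQLA"), 1))
```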
Figure 5.
PIRSF (A,B), COG (C,D), and Pfam (E,F) input and results.
(A) The FASTA sequence of the query protein with UniProt accession O67940 from Aquifex aeolicus is scanned against PIR's curated family database. (The query is searched against the full-length and domain hidden Markov models for manually curated PIRSFs. If a match is found, the matched regions and statistics are displayed.) (B) The query hits the PIRSF family PIRSF006779. The output provides family details, statistical data for full-length proteins and composite domains, and a pairwise alignment of the query with the consensus sequence of the PIRSF. (C) The FASTA sequence of the query protein with UniProt accession O67940 from Aquifex aeolicus is scanned against the database of clusters of orthologous groups (COGs). COG compares protein sequences encoded in complete genomes, representing major phylogenetic lineages. Each COG consists of orthologous/co-orthologous proteins from at least three lineages. (D) The query hits COG1912. The output provides the family details: statistical score, reciprocal best hits, and members of the family. (E) The FASTA sequence of the query protein with UniProt accession O67940 from Aquifex aeolicus is scanned against the Pfam domain database. The Pfam database is a large collection of domain families, each represented by multiple sequence alignments and hidden Markov models (HMMs). (F) The query hits Pfam family PF01887.
Figure 6.
1RQP is used because the query protein O67940 from Aquifex aeolicus does not have a solved structure. The results indicate that the N-terminal and C-terminal domains of 1RQP belong to two SCOP superfamilies. (The SCOP database provides a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known.)
Figure 7.
Since our query protein O67940 from Aquifex aeolicus does not have a solved structure, 1RQP is used as a query. The only non-redundant structural neighbor that provides functional annotation is 2Q6O, indicated by a pink box.
Figure 8.
Structure-guided alignment constructed with homologous sequences using Cn3D (top) and neighbor-joining tree based on the score of aligned residues from homologous sequences using CDTree (bottom).
Figure 9.
SAM-binding residues. Dashed green lines indicate hydrogen bonds, and the half-moons indicate van der Waals interactions. (Ligplot is a program for automatically plotting protein–ligand interactions; it is provided as part of PDBsum, a Web-based database of summaries and analyses of all PDB structures.)
Table 2.
Alignment of functional residues