Real-time structural motif searching in proteins using an inverted index strategy

doi:10.1371/journal.pcbi.1008502

Fig 1.

Structural motif case studies.

(A) Active sites of serine proteases can be made up of multiple, distinct polypeptide chains [2]. (B) Active site sidechains in leucine aminopeptidases [3] coordinate two adjacent ions. (C) Zinc finger DNA-binding domains [5] are stabilized by zinc ions (N.B.: Cys:F-212 was not used to define the search query). (D) Position-specific exchanges (additional label_comp_id) can be used to identify enolase superfamily members accurately [6]. (E) RNA G-tetrads can be formed between one, two, or four nucleic acid strands [7]. Residues are identified by label_comp_id, auth_asym_id, and auth_seq_id. Rendered with Mol* [8].

Fig 2.

Structural motif search workflow.

(A) Fragmentation into residue pairs. (B) Computation of geometric descriptors. (C) Inverted index lookup: all similar occurrences are retrieved for each descriptor. (D) Correspondence check to ensure that each candidate resembles the query motif. (E) Structures that do not fulfill the requirements are ignored; only relevant residues are loaded. (F) R.M.S.D. values quantify structural similarity.
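
Steps (C) and (E) of this workflow can be illustrated with a minimal sketch. It is not the authors' implementation: the bin widths, the tolerance of one neighboring bin, the class and function names, and the structure identifiers are all assumptions made for illustration. Each residue pair is keyed by its residue types plus binned descriptor values, and a structure is kept as a candidate only if it contains a similar occurrence for every pair of the query.

```python
from collections import defaultdict
from itertools import product


def bin_descriptor(d_b, d_s, theta, dist_step=1.0, angle_step=20.0):
    """Discretize a (d_b, d_s, theta) descriptor; the step sizes are assumed values."""
    return (int(d_b // dist_step), int(d_s // dist_step), int(theta // angle_step))


class InvertedIndex:
    """Maps (residue type pair, binned descriptor) to the structures containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, structure_id, res_a, res_b, descriptor):
        # Sort the residue types so that (HIS, ASP) and (ASP, HIS) share one key.
        key = (tuple(sorted((res_a, res_b))), bin_descriptor(*descriptor))
        self._postings[key].add(structure_id)

    def lookup(self, res_a, res_b, descriptor, tolerance=1):
        """Return structures with a similar pair, also searching neighboring bins."""
        pair = tuple(sorted((res_a, res_b)))
        center = bin_descriptor(*descriptor)
        hits = set()
        for offsets in product(range(-tolerance, tolerance + 1), repeat=3):
            key = (pair, tuple(c + o for c, o in zip(center, offsets)))
            hits |= self._postings.get(key, set())
        return hits


def candidate_structures(index, query_pairs):
    """Keep only structures that contain every residue pair of the query."""
    result = None
    for res_a, res_b, descriptor in query_pairs:
        hits = index.lookup(res_a, res_b, descriptor)
        result = hits if result is None else result & hits
    return result or set()


# Hypothetical toy data: descriptors are (d_b, d_s, theta in degrees).
index = InvertedIndex()
index.add("1abc", "HIS", "ASP", (6.2, 7.1, 85.0))
index.add("1abc", "HIS", "SER", (8.4, 6.0, 40.0))
index.add("2xyz", "HIS", "ASP", (6.0, 7.0, 80.0))

query = [("ASP", "HIS", (6.5, 7.4, 95.0)), ("HIS", "SER", (8.0, 6.3, 45.0))]
candidates = candidate_structures(index, query)
```

In this toy example only "1abc" survives, because "2xyz" has no HIS-SER pair. Candidates found this way would still need the correspondence check (D) and R.M.S.D. scoring (F), which the sketch omits.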

Fig 3.

Sensitivity of geometric descriptors.

The ground truth for the catalytic triad query was determined by an exhaustive search routine [27]. Most low-R.M.S.D. hits are found by our approach (blue points). Biologically relevant hits exhibit geometric descriptors (area shaded in blue) similar to those of the query motif (black horizontal line).

Table 1.

Runtime and sensitivity analyses.

Table 2.

Ten most abundant residue pairs in the inverted index.

Fig 4.

Representation of residue pairs.

Residue pairs are represented by three transformation-invariant descriptors: backbone distance d_b, sidechain distance d_s, and angle θ. This compact representation enables quick retrieval of similar residue pairs.
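
A minimal sketch of how such a descriptor could be computed is given here. The caption does not state which atoms are used, so the choices below are assumptions: one backbone atom (for example Cα) and one sidechain atom (for example Cβ) represent each residue, and θ is taken as the angle between the two backbone-to-sidechain vectors. The function name and coordinates are hypothetical.

```python
import math


def pair_descriptor(bb1, sc1, bb2, sc2):
    """Return (d_b, d_s, theta) for two residues.

    bb1, bb2: backbone atom coordinates (assumed: one atom per residue, e.g. C-alpha)
    sc1, sc2: sidechain atom coordinates (assumed: one atom per residue, e.g. C-beta)
    theta is returned in degrees.
    """
    d_b = math.dist(bb1, bb2)  # backbone distance
    d_s = math.dist(sc1, sc2)  # sidechain distance

    # Vectors pointing from the backbone atom to the sidechain atom of each residue.
    v1 = [s - b for s, b in zip(sc1, bb1)]
    v2 = [s - b for s, b in zip(sc2, bb2)]
    dot = sum(a * b for a, b in zip(v1, v2))
    cos_theta = dot / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, cos_theta))
    theta = math.degrees(math.acos(cos_theta))
    return d_b, d_s, theta


# Hypothetical coordinates in angstroms.
original = pair_descriptor((0, 0, 0), (1, 0, 0), (0, 5, 0), (0, 6, 0))

# The same residues after translation by (1, 2, 3) and a 90 degree rotation about z.
def move(p):
    x, y, z = p
    return (-y + 1, x + 2, z + 3)

moved = pair_descriptor(move((0, 0, 0)), move((1, 0, 0)), move((0, 5, 0)), move((0, 6, 0)))
```

Because distances and angles between vectors do not change under rotation or translation, `original` and `moved` agree up to floating-point error, which is the invariance property the caption refers to.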
