Adaptive GDDA-BLAST: Fast and Efficient Algorithm for Protein Sequence Embedding

doi:10.1371/journal.pone.0013596

Figure 1.

The Concept of GDDA-BLAST and Adaptive GDDA-BLAST.

This schematic depicts the workflow of GDDA-BLAST and Adaptive GDDA-BLAST. (i–ii) The algorithm begins by modifying the query amino acid sequence through the insertion of a “seed” sequence from the profile of interest. These seeds are obtained from the profile consensus sequences in NCBI's Conserved Domain Database (CDD). GDDA-BLAST inserts a seed at every query amino acid position; in contrast, Adaptive GDDA-BLAST inserts a seed only at positions where the seed is likely to be extended into an alignment. (iii–iv) Signals are collected from the optimal alignments between the “embedded” sequences and the profiles, using rps-BLAST or Adaptive GDDA-BLAST, and are incorporated as composite scores into an N by M data matrix. (v–ix) This data space can be analyzed to generate phylograms based on the Euclidean distance, and dendrograms based on the Pearson correlation, between the alignment profiles of query proteins.
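The seed embedding and the two distance measures named in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice to allow insertion after the last residue, and the toy handling of zero-variance rows are all assumptions made for illustration. Each row passed to `euclidean` or `pearson` stands for one query's row of the N by M score matrix.

```python
from math import sqrt


def embed_seed(query: str, seed: str, position: int) -> str:
    """Insert `seed` into `query` before index `position` (0-based)."""
    return query[:position] + seed + query[position:]


def all_chimeras(query: str, seed: str) -> list[str]:
    """Exhaustive embedding: one chimera per insertion point.

    Assumption: insertion after the last residue is also allowed,
    giving len(query) + 1 chimeras.
    """
    return [embed_seed(query, seed, i) for i in range(len(query) + 1)]


def euclidean(row_a: list[float], row_b: list[float]) -> float:
    """Euclidean distance between two rows of the score matrix."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(row_a, row_b)))


def pearson(row_a: list[float], row_b: list[float]) -> float:
    """Pearson correlation between two rows of the score matrix.

    Assumption: a row with zero variance yields 0.0 instead of an error.
    """
    n = len(row_a)
    mean_a = sum(row_a) / n
    mean_b = sum(row_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(row_a, row_b))
    var_a = sum((a - mean_a) ** 2 for a in row_a)
    var_b = sum((b - mean_b) ** 2 for b in row_b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)
```

An adaptive variant would call `embed_seed` only for a selected subset of positions rather than iterating over all of them as `all_chimeras` does.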


Figure 2.

A Performance Comparison of GDDA-BLAST and Adaptive GDDA-BLAST.

(a) Per-query running time of GDDA-BLAST and Adaptive GDDA-BLAST when running 620 query sequences against 51 target sequences. The number in each box indicates how many times faster Adaptive GDDA-BLAST is than GDDA-BLAST. (b) Fold recognition performance of GDDA-BLAST, Adaptive GDDA-BLAST, PSI-BLAST and SAM-T2K on the SABmark Twilight zone set, shown as ROC curves. The set comprises 534 sequences from 61 SCOP fold groups of the SABmark Twilight zone benchmark set. To calculate the sensitivity at different false positive rates, the top-k sequences with the highest similarity to each of the 534 queries are considered, with k increasing from 1.
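The top-k procedure described in (b) can be sketched as follows. This is an illustrative reading of the caption, not the evaluation code used for the figure: the data layout (`rankings`, `fold_of`) and the pooling of true and false positives across all queries are assumptions.

```python
def roc_points(rankings, fold_of, max_k):
    """Return (false positive rate, sensitivity) pairs for k = 1..max_k.

    rankings: dict mapping each query to its hits, sorted by decreasing
              similarity (hypothetical layout).
    fold_of:  dict mapping each sequence to its fold label.
    A hit is a true positive when it shares the query's fold.
    """
    total_pos = 0
    total_neg = 0
    for query, hits in rankings.items():
        for hit in hits:
            if fold_of[hit] == fold_of[query]:
                total_pos += 1
            else:
                total_neg += 1

    points = []
    for k in range(1, max_k + 1):
        tp = fp = 0
        for query, hits in rankings.items():
            for hit in hits[:k]:  # only the k most similar sequences
                if fold_of[hit] == fold_of[query]:
                    tp += 1
                else:
                    fp += 1
        fpr = fp / total_neg if total_neg else 0.0
        sensitivity = tp / total_pos if total_pos else 0.0
        points.append((fpr, sensitivity))
    return points
```

Plotting sensitivity against false positive rate for the returned points gives one ROC curve per method.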


Figure 3.

The Characterization of Membrane Spanning Regions.

This graph shows the performance of Hidden Markov Models (TMHMM), rps-BLAST and Adaptive GDDA-BLAST in determining the membrane-spanning domains of Bovine Rhodopsin, as determined by X-ray Crystallography (Teal = Beta pleated sheets, Green = helices, loops not shown). This protein was analyzed with an expanded set of PSSMs representing a large variety of transmembrane domains (∼20K PSSMs). Compared with rps-BLAST, Adaptive GDDA-BLAST is more refined with respect to the annotation of alpha-helices. Moreover, these data demonstrate that less statistically valid alignments (e.g., e-value 0.01 vs. 10¹⁰) are still informative for detecting the domain boundaries and outperform lower thresholds. The full-length structure of Rhodopsin is shown (dimer), as well as an inset of the C-terminus, which is composed of three small helices, the last of which folds parallel to the membrane (it is not transmembrane itself).


Figure 4.

The Characterization of Ankyrin-repeat Protein Structure.

This graph shows the performance of Adaptive GDDA-BLAST in determining the Ankyrin-repeat domains of Human Ankyrin-R, as determined by X-ray Crystallography (Green = helices, loops not shown). This protein was analyzed with an expanded set of PSSMs representing Ankyrin-repeat domains (449 PSSMs). Adaptive GDDA-BLAST annotates 12 Ankyrin-repeat domains as well as their alpha-helices. Compared to rps-BLAST, Adaptive GDDA-BLAST shows the structure of 1N11 at much finer resolution (orange: Fourier-transform point = 7, cyan: Fourier-transform point = 8).


Table 1.

Residues of a chimera sequence.


Figure 5.

Adaptive GDDA-BLAST alignment details regarding seed embedding.

(a) Limited region of interest with the seed embedding position. The diagonal line represents the alignment with the seed in different locations. The examples illustrate the region of interest of the N-terminal seeds. Similarly, for the C-terminal seed, the region of interest is the upper-left corner of the seed. (b) The corresponding hits of a query and a chimera sequence. This example illustrates that the hits between the target sequence (X) and the query sequence (Y) can be reused for aligning a chimera sequence (C) against the target sequence (X). (c) The seed positions selected given a partial alignment. Ranges on the top and bottom represent the seed embedding positions of N-terminal and C-terminal seeds, respectively.
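One way to picture the hit reuse in (b) is as a coordinate shift: a hit found on the query keeps its coordinates if it lies before the inserted seed and moves by the seed length if it lies after it. The Python sketch below is an assumption about how such a remapping could look, not the published procedure; the hit tuple layout and the decision to discard hits split by the seed are hypothetical.

```python
def remap_hit(hit, insert_at, seed_len):
    """Map a query hit onto chimera coordinates.

    hit: (query_start, query_end, target_start), 0-based with an
         exclusive end (hypothetical layout).
    insert_at: query index before which the seed is inserted.
    Returns the remapped hit, or None when the seed would fall strictly
    inside the hit and break its contiguity.
    """
    query_start, query_end, target_start = hit
    if query_start < insert_at < query_end:
        return None  # the seed splits this hit, so it cannot be reused as-is
    # Hits at or after the insertion point shift right by the seed length.
    shift = seed_len if query_start >= insert_at else 0
    return (query_start + shift, query_end + shift, target_start)
```

For example, with a seed of length 4 inserted before query index 6, a hit covering query positions 6 to 8 corresponds to chimera positions 10 to 12, while a hit covering positions 2 to 4 is unchanged.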


Figure 6.

Example of embedded alignments.

The seeded alignments for three consecutive chimera sequences. The query and the target sequences are general transcription factor II, i isoform from Homo sapiens (NP_001509.2) and ML (MD-2-related lipid-recognition) domain (cd00912), respectively.


Figure 7.

Four basic steps of Adaptive GDDA-BLAST.

(i) Step 1: Find multiple non-overlapping local alignments. (ii) Step 2: Select seed embedding positions in query sequence. (iii) Step 3: Generate final alignments with seed. (iv) Step 4: Filter out non-significant alignments using coverage and pairwise identity of the alignment.
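The filtering in Step 4 can be sketched in Python as below. The definitions of coverage and pairwise identity used here, and the default thresholds, are illustrative assumptions; the caption does not state the actual definitions or cutoffs.

```python
def pairwise_identity(aligned_query: str, aligned_target: str) -> float:
    """Fraction of identical residues over gap-free alignment columns."""
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (q, t)
        for q, t in zip(aligned_query, aligned_target)
        if q != "-" and t != "-"
    ]
    if not columns:
        return 0.0
    return sum(q == t for q, t in columns) / len(columns)


def coverage(aligned_target: str, target_length: int) -> float:
    """Fraction of the full target covered by aligned target residues."""
    return sum(c != "-" for c in aligned_target) / target_length


def is_significant(aligned_query, aligned_target, target_length,
                   min_coverage=0.5, min_identity=0.2):
    """Keep an alignment only if it passes both thresholds.

    The default thresholds are arbitrary placeholders, not values
    taken from the source.
    """
    return (
        coverage(aligned_target, target_length) >= min_coverage
        and pairwise_identity(aligned_query, aligned_target) >= min_identity
    )
```

Applying `is_significant` to every alignment produced in Step 3 and discarding those that return `False` corresponds to the filtering stage.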
