PSimScan: Algorithm and Utility for Fast Protein Similarity Search

doi:10.1371/journal.pone.0058505

Figure 1.

Block-scheme for PSimScan algorithm.

Similarity detection in PSimScan is organized as a pipeline of hit accumulators and filters connected in order of increasing computational complexity.


Figure 2.

Block-scheme for dictionary construction.

The location of every character tuple of size K in the query sequences is recorded in a directly addressable Lookup Table, in which the binary-converted tuple itself serves as the index. Each entry in the table is a pointer to an array of locations.
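The dictionary construction described in this caption can be sketched as follows. This is a minimal illustration, not PSimScan's actual code: the 5-bit residue encoding, the 20-letter alphabet, and the names `tuple_index`, `build_lookup_table` and `lookup` are all assumptions made for the example.

```python
# Minimal sketch of a directly addressable k-tuple lookup table.
# Assumptions (not from the paper): 5 bits per residue, the 20 standard
# amino acids, and locations stored as (query_id, position) pairs.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BITS = 5  # 20 letters fit into 5 bits
CODE = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def tuple_index(kmer):
    """Pack a k-tuple into an integer index; None if a residue is non-standard."""
    idx = 0
    for ch in kmer:
        code = CODE.get(ch)
        if code is None:
            return None
        idx = (idx << BITS) | code
    return idx


def build_lookup_table(queries, k):
    """Record the location of every k-tuple of every query sequence."""
    table = [None] * (1 << (BITS * k))  # one slot per possible packed tuple
    for query_id, seq in enumerate(queries):
        for pos in range(len(seq) - k + 1):
            idx = tuple_index(seq[pos:pos + k])
            if idx is None:
                continue  # skip tuples containing non-standard residues
            if table[idx] is None:
                table[idx] = []
            table[idx].append((query_id, pos))
    return table


def lookup(table, kmer):
    """Return the recorded locations of a k-tuple (empty list if none)."""
    idx = tuple_index(kmer)
    if idx is None or table[idx] is None:
        return []
    return table[idx]


table = build_lookup_table(["ACDAC", "CDA"], k=2)
print(lookup(table, "AC"))
```

Because the packed tuple is the index, a lookup costs one array access with no hashing or comparison, at the price of a table whose size grows exponentially with K.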


Figure 3.

Main elements of the Search Space.

The ‘Similarity Matrix’ is a rectangular table with columns corresponding to positions in the query sequences and rows corresponding to positions in the ‘current’ subject sequence. The subject sequence database is processed record by record: k-tuples starting from every position are looked up in the Lookup Table, and the locations of tuple matches (primary hits) are recorded. Neighboring primary hits form ‘Similarity Zones’, rectangular areas of the Similarity Matrix oriented along the diagonals. Each new hit either joins an existing Similarity Zone that lies close by or forms a new one. If a hit lies close to more than one existing zone, those zones are merged. Similarity Zones are represented in PSimScan as BAND structures. Once a Similarity Zone’s score passes the detection threshold, the zone is listed in the ‘Hits Array’.
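The join-or-merge behaviour of Similarity Zones can be illustrated with a small sketch. The `Band` fields, the proximity test and the `max_shift` and `max_gap` parameters below are hypothetical; the caption states only that zones are stored as BAND structures, not how those structures are laid out.

```python
from dataclasses import dataclass


@dataclass
class Band:
    """Hypothetical zone record: a diagonal range, a subject range and a score."""
    d_lo: int
    d_hi: int
    s_lo: int
    s_hi: int
    score: int

    def is_near(self, diag, s_pos, max_shift, max_gap):
        # A hit is 'close' if it lies within max_shift diagonals and
        # max_gap subject positions of the zone (assumed criterion).
        return (self.d_lo - max_shift <= diag <= self.d_hi + max_shift
                and self.s_lo - max_gap <= s_pos <= self.s_hi + max_gap)

    def absorb(self, other):
        # Grow this zone to cover the other one and add up the scores.
        self.d_lo = min(self.d_lo, other.d_lo)
        self.d_hi = max(self.d_hi, other.d_hi)
        self.s_lo = min(self.s_lo, other.s_lo)
        self.s_hi = max(self.s_hi, other.s_hi)
        self.score += other.score


def add_hit(bands, diag, s_pos, score, max_shift=1, max_gap=4):
    """Join the hit to every nearby zone, merging them, or start a new zone."""
    merged = Band(diag, diag, s_pos, s_pos, score)
    kept = []
    for band in bands:
        if band.is_near(diag, s_pos, max_shift, max_gap):
            merged.absorb(band)  # the hit bridges this zone
        else:
            kept.append(band)
    kept.append(merged)
    bands[:] = kept
    return merged


def detected_zones(bands, threshold):
    """Zones whose score has reached the detection threshold."""
    return [b for b in bands if b.score >= threshold]


bands = []
add_hit(bands, 0, 0, 3)
add_hit(bands, 0, 2, 3)   # joins the first zone
add_hit(bands, 2, 0, 3)   # too far in diagonal: new zone
add_hit(bands, 1, 1, 3)   # close to both zones: they are merged
print(bands)
```

The linear scan over all zones keeps the sketch short; a real implementation would presumably index zones by diagonal so that only nearby candidates are examined.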


Figure 4.

Block-scheme for the inner search loop.

For each subject sequence, k-tuples starting from every position are looked up, and the match locations are recorded. The offset of the diagonal on which the match appears is computed, and the associated diagonal control structure is updated with the location and the score of the match. At the accumulation stage, multiple neighboring primary hits are aggregated into Similarity Zones, reducing the number of items to process. Such zones, in turn, may merge into a smaller number of larger aggregates.
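A rough sketch of such an inner loop is given below. It is an illustration under stated assumptions rather than the PSimScan implementation: a plain dictionary stands in for the Lookup Table, the per-diagonal control structure is a list of `[s_start, s_end, score]` runs, the score is simply the number of subject residues covered, and `max_gap` is a made-up parameter.

```python
from collections import defaultdict


def index_query(query, k):
    """Map every k-tuple of the query to its start positions (dict stand-in)."""
    table = defaultdict(list)
    for pos in range(len(query) - k + 1):
        table[query[pos:pos + k]].append(pos)
    return table


def scan_subject(table, subject, k, max_gap=2):
    """Scan one subject sequence and accumulate primary hits per diagonal.

    Returns {diagonal: [[s_start, s_end, score], ...]}, where the diagonal
    offset is q_pos - s_pos and score counts covered subject residues.
    """
    diagonals = defaultdict(list)
    for s_pos in range(len(subject) - k + 1):
        for q_pos in table.get(subject[s_pos:s_pos + k], ()):
            diag = q_pos - s_pos          # offset of the diagonal of this match
            runs = diagonals[diag]
            new_end = s_pos + k
            if runs and s_pos - runs[-1][1] <= max_gap:
                run = runs[-1]            # extend the latest run on this diagonal
                run[2] += new_end - max(run[1], s_pos)  # only newly covered residues
                run[1] = new_end
            else:
                runs.append([s_pos, new_end, k])  # start a new run
    return dict(diagonals)


table = index_query("ACDEFG", k=3)
print(scan_subject(table, "ACDEFG", k=3))
```

Updating only the most recent run on each diagonal is enough here because subject positions are visited in increasing order, so hits on a given diagonal also arrive in order.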


Figure 5.

Selectivity and Sensitivity of PSimScan at different parameters versus other similarity search tools.

All of the proteins in the PDB90 database were compared with each other using PSimScan, SSEARCH, BLAST, USEARCH, RAPSearch and BLAT. PSimScan was tested at different combinations of the kthresh (similarity zone detection threshold) and approx (tuple diversification level) parameters. For SSEARCH, BLAST, USEARCH, RAPSearch and BLAT, the Coverage-vs.-Error graphs were plotted as described by Brenner et al. [47]. Similarities between proteins of the same SCOP fold were treated as true positives, and similarities between proteins of different folds as false positives (errors). Coverage is the ratio of the number of detected true positives to the total number of protein pairs in which both members belong to the same fold. EPQ (errors per query) is the ratio of the number of detected false positives to the number of queries. The Coverage-vs.-Error graph contains points in the Coverage/EPQ plane that correspond to the sets of similarities with E-values below a given cut-off (some dots on the graphs are labeled with E-values). To obtain comparable graphs for the different tools, we re-computed the E-values of all detected similarities with SSEARCH and used those E-values to construct the graphs. We ran PSimScan at all combinations of 6 different kthresh values (shown in the legend) and 7 different approx values. For each run, the total Coverage and EPQ were computed and plotted. On each curve corresponding to a particular kthresh, the triangles mark the following approx values, left to right: 1.0, 0.95, 0.9, 0.85, 0.8, 0.76, 0.72.
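The Coverage and EPQ definitions above translate into a short computation. The sketch below assumes an all-against-all comparison in which ordered (query, subject) pairs are counted, self-hits are ignored, and every protein acts as a query; the function name, the input layout and these counting conventions are assumptions of the example, not details taken from the paper.

```python
from collections import Counter


def coverage_and_epq(hits, fold_of, evalue_cutoff):
    """Compute (Coverage, EPQ) for hits with E-value at or below the cut-off.

    hits    : iterable of (query_id, subject_id, evalue)
    fold_of : dict mapping protein id to its SCOP fold label
    """
    # Total number of ordered same-fold pairs (excluding self-pairs).
    fold_sizes = Counter(fold_of.values())
    total_same_fold = sum(n * (n - 1) for n in fold_sizes.values())

    seen = set()
    true_pos = false_pos = 0
    for query, subject, evalue in hits:
        if query == subject or evalue > evalue_cutoff or (query, subject) in seen:
            continue
        seen.add((query, subject))      # count each ordered pair once
        if fold_of[query] == fold_of[subject]:
            true_pos += 1
        else:
            false_pos += 1

    coverage = true_pos / total_same_fold if total_same_fold else 0.0
    epq = false_pos / len(fold_of)      # errors per query
    return coverage, epq


folds = {"a": "F1", "b": "F1", "c": "F2", "d": "F2", "e": "F3"}
hits = [("a", "b", 1e-5), ("b", "a", 1e-3), ("a", "c", 0.5), ("c", "d", 2.0)]
print(coverage_and_epq(hits, folds, evalue_cutoff=1.0))
```

Sweeping `evalue_cutoff` over a range of values and plotting the resulting pairs would trace one Coverage-vs.-Error curve of the kind shown in the figure.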


Figure 6.

Selectivity and Sensitivity of PSimScan at different maximum diagonal shift values.

See Fig. 5 for an explanation of the coordinate axes and the method used. The dependence of Coverage and EPQ on the maximum diagonal shift is shown for 4 different combinations of approx/kthresh values. The dots on the graphs are labeled with the values of the mxshift parameter.


Table 1.

Performance testing against NCBI nr database.


Figure 7.

Processing speed for different quick protein similarity search tools.

All measurements were taken at default parameters, except for the “PSimScan2” series (approx: 0.79, kthresh: 14). The Streptococcus pneumoniae R6 proteome was used as the query set and the SwissProt/UniProt database as the subject set.


Figure 8.

PSimScan processing time dependence on tuple diversification and similarity zone detection threshold.

The Streptococcus pneumoniae R6 proteome was used as the query set and the SwissProt/UniProt database as the subject set. Measurements were taken at mxshift = 4.


Figure 9.

PSimScan processing time dependence on maximum diagonal shift.

The Streptococcus pneumoniae R6 proteome was used as the query set and the SwissProt/UniProt database as the subject set. Measurements were taken at kthresh = 15 and at approx = 0.75 and 0.85.
