Fig 1.
(A) BLAST search when new sequences are added to the database. At time t, the database is Dt. In the next δt interval, new sequences Dt+δt − Dt are added, and the database becomes Dt+δt. With the traditional approach, the prior search result at time t cannot be reused, and a full BLAST search must be run against the entire Dt+δt database. (B) BLAST search when several taxon-specific databases are present and a result against the combined database is needed. For three taxa, A, B, and C, we can perform individual BLAST searches against the databases DA, DB, and DC, respectively. To obtain a search result against the combined database DA∪B∪C, we need to merge the individual search results so that their e-values reflect the combined database size.
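The merging step described in both panels can be sketched as follows. This is a minimal first-order illustration, not the iBLAST implementation: it assumes that an e-value scales linearly with the size of the database searched, and it ignores the effective-length adjustments that BLAST applies when computing the search space. The names `Hit`, `rescale_evalue`, and `merge_hits` are hypothetical and introduced only for this example.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List


@dataclass(frozen=True)
class Hit:
    subject_id: str
    evalue: float
    db_size: int  # size of the database this hit was found in (assumed known)


def rescale_evalue(evalue: float, searched_size: int, combined_size: int) -> float:
    """Rescale an e-value to a different database size.

    Assumption: the e-value is proportional to database size, so the
    correction is a simple ratio of the two sizes.
    """
    if searched_size <= 0 or combined_size <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * combined_size / searched_size


def merge_hits(result_sets: Iterable[List[Hit]], combined_size: int) -> List[Hit]:
    """Rescale every hit to the combined database size, then rank by e-value."""
    merged = [
        replace(
            hit,
            evalue=rescale_evalue(hit.evalue, hit.db_size, combined_size),
            db_size=combined_size,
        )
        for hits in result_sets
        for hit in hits
    ]
    return sorted(merged, key=lambda hit: hit.evalue)


# Toy example: two partial searches merged against a combined size of 4000.
hits_a = [Hit("a1", 1e-10, 1000)]
hits_b = [Hit("b1", 1e-12, 3000)]
merged = merge_hits([hits_a, hits_b], combined_size=4000)
```

For panel (A), the two result sets would be the stored hits against Dt and the new hits against the added sequences, with `combined_size` set to the size of Dt+δt. For panel (B), there would be one result set per taxon-specific database, with `combined_size` set to the size of DA∪B∪C.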
Table 1.
Comparison of three different BLAST tools that explicitly deal with e-value statistics correction.
iBLAST supports e-value correction across time and space without requiring prior knowledge of the entire database, whereas the other tools can perform e-value correction only in limited scenarios.
Fig 2.
The user initiates a search through the user interface. The search parameters are then passed to the “Incremental logic” module. After performing an incremental search, this module’s back-end corrects the e-value statistics and merges the results. The “Incremental logic” module consults an external lightweight database module, called the “Record database”, to decide whether and how to perform the incremental search. For the actual search and delta database creation, we use NCBI BLAST tools such as blastdbcmd, blastdbalias, blastp, and blastn.
Fig 3.
Experimental design of three case studies.
(A) Case study I: Incremental addition of sequences in the nt database over three time periods. (B) Case study II: Incremental addition of sequences in the nr database over two time periods. (C) Case study III: Incremental search of taxon-specific databases.
Table 2.
Case study I: Fidelity of iBLAST in three consecutive time periods.
blastn searches were performed on the nucleotide sequence database (nt). At any time instance, the Past database size is the size of the database at the previous time instance, and the Present database size is its size at the current time instance. Delta is the growth of the database from the previous time instance to the current one. NCBI BLAST must search the entire Present database, while iBLAST needs to search only Delta.
Fig 4.
Performance comparison between NCBI BLAST and iBLAST for case study I.
(A) Performance comparison between regular blastn and incremental blastn at three time periods as the nt database grows, using 100 nucleotide queries. For increases of 40.8% and 34.0% in database size, iBLAST is 2.93 and 3.03 times faster, respectively. (B) Performance comparison between regular blastp and incremental blastp at three time periods as the nr database grows, using 100 protein queries. For increases of 34.1% and 26.3% in database size, iBLAST is 4.33 and 4.98 times faster, respectively.
Table 3.
Potential for taxon-guided searches enabled by iBLAST.
Comparing merged results from multiple individual BLAST searches with a separate BLAST search against the complete nr database shows that biologically relevant taxa can be added incrementally to obtain results similar to those from nr while searching a much smaller database.