Fig 1.
Word2Vec architecture: The figure shows two variants of the word2vec architecture, CBOW and Skip-gram [26], for a sample.
Fig 2.
SuperVec model: NN1 and NN2 are shallow neural networks based on the CBOW and the Skip-gram variants of the Word2Vec model, respectively.
The context of a word comprises its nearby words. The context of a sequence si comprises the sequences that have the same label as si.
Table 1.
Context-word pairs for NN1 and NN2.
Fig 3.
SuperVecX: A supervised method for generating sequence embeddings [30].
Fig 4.
An example binary tree for H-SuperVec(X): A hierarchical structure obtained by partitioning the class labels of each parent node into equal-size subsets.
The root node is assigned p class labels (L1, L2, …, Lp) and their corresponding sequences. In this example we assume that p is an even number; from the labels assigned to a parent node, the even-indexed labels go to its right child and the odd-indexed labels go to its left child.
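The partitioning rule in this caption can be illustrated with a minimal sketch. The names `Node`, `build_label_tree` and the `max_depth` stopping rule are hypothetical and are not taken from the paper; the sketch only assumes that labels are split by their 1-based position within the parent's list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One node of the label tree (hypothetical structure for illustration)."""
    labels: List[str]
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_label_tree(labels: List[str], max_depth: int = 1) -> Node:
    """Recursively split labels into odd- and even-indexed halves.

    Positions are 1-based within the parent's list, so L1, L3, ... go to
    the left child and L2, L4, ... go to the right child. The stopping
    rule (max_depth) is an assumption made for this sketch.
    """
    node = Node(labels=list(labels))
    if max_depth <= 0 or len(labels) < 2:
        return node
    odd_indexed = labels[0::2]   # 1-based positions 1, 3, 5, ...
    even_indexed = labels[1::2]  # 1-based positions 2, 4, 6, ...
    node.left = build_label_tree(odd_indexed, max_depth - 1)
    node.right = build_label_tree(even_indexed, max_depth - 1)
    return node


root = build_label_tree(["L1", "L2", "L3", "L4"], max_depth=1)
print(root.left.labels, root.right.labels)
```

With `max_depth=1` the tree has three nodes (the root and its two children); a separate model would then be trained on the labels held by each node.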
Fig 5.
SuperVec(X) 1/2/3 are the SuperVec(X) models trained for nodes 1, 2 and 3, respectively.
qi and si are the embeddings of a query (q) and a subject (s), di is the distance of the query-subject pair computed at the ith node, and d is the similarity score for q and s.
Table 2.
Hyper-parameters for SuperVec, Seq2Vec and SuperVecX.
Table 3.
Retrieval results for 100 random pairs: Average interpolated precision values at ten recall levels computed for 100 random pairs of classes.
These pairs differ in the number of database and query sequences. The precision value shown at each recall level is averaged over the 100 chosen pairs.
Fig 6.
t-SNE plots: The mapping of database and query embeddings generated through BioVec, Seq2Vec, SuperVec and SuperVecX approaches.
DB1 and DB2 denote the database sequences and Q1 and Q2 denote the query sequences from class1 and class2 of dataset1.
Fig 7.
Supervised vs unsupervised: Average interpolated precision values at 11 recall levels for dataset1 and dataset2.
For dataset1 the results are averaged for ∼60k queries, the database size is ∼90k (200 classes); for dataset2 the database size is ∼58k sequences (100 classes) and the results are averaged over ∼38k queries.
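As an illustration of the metric named in this caption, the following is a minimal sketch of interpolated precision at 11 recall levels for a single query, using the standard information-retrieval definition (the interpolated precision at recall level r is the highest precision observed at any recall of at least r). The function name and the toy ranking are assumptions for illustration, not the authors' evaluation code; per-query values would then be averaged over all queries.

```python
from typing import List, Sequence

# Recall levels 0.0, 0.1, ..., 1.0
ELEVEN_LEVELS = [i / 10 for i in range(11)]


def interpolated_precision(
    relevance: Sequence[bool],
    n_relevant: int,
    levels: Sequence[float] = ELEVEN_LEVELS,
) -> List[float]:
    """Interpolated precision of one ranked result list.

    relevance[k] is True when the item at rank k+1 shares the query's
    class; n_relevant is the number of relevant items in the database.
    """
    hits = 0
    points = []  # (recall, precision) recorded at each relevant rank
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / n_relevant, hits / rank))
    eps = 1e-12  # tolerance for floating-point recall comparisons
    return [
        max((p for r, p in points if r >= level - eps), default=0.0)
        for level in levels
    ]


# Toy ranking: relevant items at ranks 1 and 3, two relevant items in total.
curve = interpolated_precision([True, False, True, False], n_relevant=2)
print(curve)
```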
Fig 8.
100 classes experiment: Comparison of interpolated average precision values for the retrieval task performed on the database of the 100 largest classes of dataset1, following setup 2.
Fig 9.
H-SuperVec(X) vs SuperVec(X): Retrieval performance comparison of the hierarchical and vanilla embedding approaches for dataset1 and dataset2.
Table 4.
Querying time: Overall querying time for 1108 queries when processed together for the database of size 90k (200 classes) from dataset1.
Table 5.
Querying time: Overall querying time for 1108 queries when processed serially for the database of size 90k (200 classes) from dataset1.
Fig 10.
Querying time histogram: Distribution of querying time for different methods.
Fig 11.
Hybrid approach: Step 1 uses H-SuperVec(X) to prune the original database (DB) and gives a reduced database (DBr).
In step 2, BLAST re-ranks DBr based on the alignment-based similarity between its sequences and the given query, q, and finally provides the list of retrieved sequences, DBo.
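The two-step pipeline in this caption can be sketched as a generic prune-then-re-rank routine. Everything named here is hypothetical: `embed_distance` stands in for the H-SuperVec(X) distance, `alignment_score` stands in for a BLAST-style similarity, and `keep` is an assumed pruning size; the sketch does not call BLAST or reproduce the authors' implementation.

```python
from typing import Callable, Dict, List


def hybrid_search(
    query: str,
    database: Dict[str, str],
    embed_distance: Callable[[str, str], float],
    alignment_score: Callable[[str, str], float],
    keep: int,
) -> List[str]:
    """Prune with an embedding distance, then re-rank by alignment score.

    database maps sequence ids to sequences. Both scoring callables are
    placeholders supplied by the caller (assumptions for this sketch).
    """
    # Step 1: keep the `keep` sequences closest to the query (DBr).
    by_distance = sorted(
        database, key=lambda sid: embed_distance(query, database[sid])
    )
    reduced = by_distance[:keep]
    # Step 2: re-rank DBr by alignment similarity, highest first (DBo).
    return sorted(
        reduced,
        key=lambda sid: alignment_score(query, database[sid]),
        reverse=True,
    )
```

The design choice being illustrated is that the expensive alignment-based score is computed only for the `keep` sequences that survive the cheap embedding-based filter, rather than for the whole database.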
Fig 12.
Retrieval results for hybrid approaches: In (a), the results are averaged over ∼60k queries on the database of ∼90k sequences and 200 classes. The AUPR values for the methods shown are as follows: HSuperVec: 0.451, SuperVecX: 0.703, H-SuperVecX: 0.742, BLAST: 0.776, HSuperVec+B: 0.701, MMseqs2: 0.535. In (b), the results are averaged over ∼38k queries on the database of ∼58k sequences and 100 classes. The AUPR values for the methods shown are as follows: HSuperVec: 0.343, SuperVecX: 0.381, HSuperVecX: 0.43, BLAST: 0.593, HSuperVec+B: 0.569, HSuVecX+B: 0.597, MMseqs2: 0.311.
Fig 13.
Retrieval performance comparison of the hybrid approach and BLAST for a database of ∼650,000 sequences (1886 classes) from dataset1.
The results are averaged over 800 queries randomly chosen from the two largest classes.
Fig 14.
Sequence searching sensitivity assessment: Cumulative distribution of area under the curve (AUC) sensitivity for ∼38k queries on the database of ∼58k sequences and 100 classes of dataset2.
Higher curves signify higher sensitivity.
Table 6.
Classification results: Comparing SuperVec, SuperVecX and other embeddings on various classification tasks.