Learning supervised embeddings for large scale sequence comparisons

doi:10.1371/journal.pone.0216636

Fig 1.

Word2Vec architecture: The figure shows two variants of the Word2Vec architecture, CBOW and Skip-gram [26], for a sample.

Fig 2.

SuperVec model: NN1 and NN2 are shallow neural networks based on the CBOW and Skip-gram variants of the Word2Vec model, respectively.

The context of a word comprises its nearby words. The context of a sequence s_i comprises the sequences that have the same label as s_i.
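
The two notions of context described in this caption can be illustrated with a minimal sketch. It assumes that sequences are split into overlapping k-mers that act as words; the function names, the k-mer length and the window size are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers (assumed tokenization)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def word_context_pairs(words, window=1):
    """Pair each word with the words at most `window` positions away."""
    pairs = []
    for i, word in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        pairs.extend((word, words[j]) for j in range(lo, hi) if j != i)
    return pairs


def sequence_context(labels):
    """Map each sequence index to the other indices that share its label."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    return {idx: [j for j in by_label[label] if j != idx]
            for idx, label in enumerate(labels)}
```

Here `word_context_pairs` corresponds to the word-level context and `sequence_context` to the label-based context of each s_i; how the two networks consume these pairs is not specified in the caption.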

Table 1.

Context-word pairs for NN1 and NN2.

Fig 3.

SuperVecX: A supervised method for generating sequence embeddings [30].

Fig 4.

An example binary tree for H-SuperVec(X): A hierarchical structure obtained by partitioning the class labels of each parent node into equal-size subsets.

The root node is assigned p class labels (L₁, L₂, …, L_p) and their corresponding sequences. In this example we assume that p is an even number; the right child of a parent node is assigned the even-indexed labels and the left child the odd-indexed labels, selected from those assigned to the parent node.
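
The odd/even partitioning described in this caption can be sketched as a short recursive function. The dictionary layout, the function name and the stopping rule (`min_labels`) are assumptions made for illustration; the caption does not state when the splitting stops.

```python
def build_label_tree(labels, min_labels=2):
    """Recursively split class labels into a binary tree.

    With 1-based indexing as in the caption, odd-indexed labels
    (L1, L3, ...) go to the left child and even-indexed labels
    (L2, L4, ...) go to the right child.
    """
    node = {"labels": list(labels), "left": None, "right": None}
    # Assumed stopping rule: do not split nodes with few labels.
    if len(labels) > min_labels:
        node["left"] = build_label_tree(labels[0::2], min_labels)
        node["right"] = build_label_tree(labels[1::2], min_labels)
    return node
```

For example, `build_label_tree(["L1", "L2", "L3", "L4"])` yields a root whose left child holds L1 and L3 and whose right child holds L2 and L4.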

Fig 5.

SuperVec(X) 1/2/3 are the SuperVec(X) models trained for nodes 1, 2 and 3, respectively.

q_i and s_i are the embeddings of a query (q) and a subject (s), d_i is the distance of the query-subject pair computed at the i-th node, and d is the similarity score for q and s.

Table 2.

Hyper-parameters for SuperVec, Seq2Vec and SuperVecX.

Table 3.

Retrieval results for 100 random pairs: Average interpolated precision values at ten recall levels computed for 100 random pairs of classes.

All of these pairs differ in the number of database and query sequences. The precision value shown at each recall level is averaged over the 100 chosen pairs.
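
For readers unfamiliar with the metric, interpolated precision can be sketched as follows. This uses the standard information-retrieval definition (the maximum precision observed at any recall at or above a given level); the function names are hypothetical, and the specific recall levels used in the table are not given in the caption, so they are left as a parameter.

```python
def interpolated_precision(ranked_relevance, n_relevant, recall_levels):
    """Interpolated precision of one ranked result list.

    ranked_relevance: booleans, True where the retrieved item is relevant.
    n_relevant: total number of relevant items for the query.
    """
    hits = 0
    points = []  # (recall, precision) at each relevant hit
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            points.append((hits / n_relevant, hits / rank))
    # Highest precision at any recall >= level; 0.0 if the level is never reached.
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in recall_levels]


def average_over_queries(per_query):
    """Average the precision at each recall level across queries."""
    n = len(per_query)
    return [sum(column) / n for column in zip(*per_query)]
```

Calling `interpolated_precision` once per query and passing the resulting lists to `average_over_queries` gives one averaged precision value per recall level.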

Fig 6.

t-SNE plots: The mapping of database and query embeddings generated through BioVec, Seq2Vec, SuperVec and SuperVecX approaches.

DB1 and DB2 denote the database sequences and Q1 and Q2 denote the query sequences from class 1 and class 2 of dataset1.

Fig 7.

Supervised vs unsupervised: Average interpolated precision values at 11 recall levels for dataset1 and dataset2.

For dataset1, the results are averaged over ∼60k queries on a database of ∼90k sequences (200 classes); for dataset2, the results are averaged over ∼38k queries on a database of ∼58k sequences (100 classes).

Fig 8.

100 classes experiment: Comparison of average interpolated precision values for the retrieval task performed on the database of the 100 largest classes of dataset1, following setup 2.

Fig 9.

H-SuperVec(X) vs SuperVec(X): Retrieval performance comparison of the hierarchical and vanilla embedding approaches for dataset1 and dataset2.

Table 4.

Querying time: Overall querying time for 1108 queries processed together against the database of 90k sequences (200 classes) from dataset1.

Table 5.

Querying time: Overall querying time for 1108 queries processed serially against the database of 90k sequences (200 classes) from dataset1.

Fig 10.

Querying time histogram: Distribution of querying times for the different methods.

Fig 11.

Hybrid approach: Step 1 uses H-SuperVec(X) to prune the original database (DB) and gives a reduced database (DB_r).

In step 2, BLAST re-ranks DB_r by alignment-based similarity between its sequences and the given query q, and finally provides the list of retrieved sequences, DB_o.
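
The two-step flow in this caption can be sketched under stated assumptions. Cosine distance stands in for whatever embedding distance the pruning step uses, `rerank_score` is a hypothetical callable standing in for an alignment score (the sketch does not invoke BLAST), and the `keep` cut-off is an illustrative parameter. Embeddings are assumed to be non-zero vectors.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def hybrid_search(query_vec, db_vecs, rerank_score, keep=100):
    """Prune by embedding distance, then re-rank the survivors.

    db_vecs: dict mapping sequence id -> embedding (DB).
    rerank_score: callable, sequence id -> alignment-style score
                  (higher is better); hypothetical stand-in for BLAST.
    """
    # Step 1: keep the `keep` nearest embeddings (DB_r).
    by_distance = sorted(
        db_vecs, key=lambda sid: cosine_distance(query_vec, db_vecs[sid]))
    db_r = by_distance[:keep]
    # Step 2: re-rank the reduced database by the second score (DB_o).
    db_o = sorted(db_r, key=rerank_score, reverse=True)
    return db_o
```

The design point is that the expensive scoring function is only evaluated on the `keep` candidates that survive the embedding-based pruning, not on the whole database.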

Fig 12.

Retrieval results for hybrid approaches: (a) The results are averaged over ∼60k queries on the database of ∼90k sequences and 200 classes. The AUPR values for the methods shown are as follows. H-SuperVec: 0.451, SuperVecX: 0.703, H-SuperVecX: 0.742, BLAST: 0.776, H-SuperVec+B: 0.701, MMseqs2: 0.535. (b) The results are averaged over ∼38k queries on the database of ∼58k sequences and 100 classes. The AUPR values for the methods shown are as follows. H-SuperVec: 0.343, SuperVecX: 0.381, H-SuperVecX: 0.43, BLAST: 0.593, H-SuperVec+B: 0.569, H-SuperVecX+B: 0.597, MMseqs2: 0.311.

Fig 13.

Retrieval performance comparison of the hybrid approach and BLAST for a database of ∼650,000 sequences (1886 classes) from dataset1.

The results are averaged over 800 queries randomly chosen from the two largest classes.

Fig 14.

Sequence searching sensitivity assessment: Cumulative distribution of area under the curve (AUC) sensitivity for ∼38k queries on the database of ∼58k sequences and 100 classes of dataset2.

Higher curves signify higher sensitivity.

Table 6.

Classification results: Comparing SuperVec, SuperVecX and other embeddings on various classification tasks.
