An index-based algorithm for fast on-line query processing of latent semantic analysis

Latent Semantic Analysis (LSA) is widely used for finding documents whose semantics are similar to a query of keywords. Although LSA yields promising results, existing LSA algorithms involve many unnecessary operations in similarity computation and candidate checking during on-line query processing, which is expensive in time and cannot respond to query requests efficiently, especially when the dataset becomes large. In this paper, we study the efficiency problem of on-line query processing for LSA, towards efficiently searching for the documents similar to a given query. We rewrite the similarity equation of LSA using an intermediate value called the partial similarity, which is stored in a designed index called the partial index. To reduce the search space, we give an approximate form of the similarity equation and then develop an efficient algorithm for building the partial index, which skips partial similarities lower than a given threshold θ. Based on the partial index, we develop an efficient algorithm called ILSA for fast on-line query processing. A given query is transformed into a pseudo-document vector, and the similarities between the query and the candidate documents are computed by accumulating the partial similarities obtained from the index nodes corresponding to non-zero entries of the pseudo-document vector. Compared to the LSA algorithm, ILSA reduces the time cost of on-line query processing by pruning unpromising candidate documents and skipping operations that contribute little to the similarity scores. Extensive experiments comparing against LSA demonstrate the efficiency and effectiveness of our proposed algorithm.


Introduction
Many real data sets can be grouped as documents, including web pages, literature, and product profiles. As such data sets become massive and diverse, there is a need for algorithmic tools and applications that discover the underlying relationships in the data. Consider document search in a dataset: even though a document is on precisely the same topic as an input query of keywords, it may not be found when the terms it contains differ from the input keywords. In previous work, several approaches speed up the SVD computation that semantic search methods rely on, without re-computing the full SVD result. [45] proposed an algorithm for incrementally computing the left singular vectors of the SVD by exploiting the relationship between the QR decomposition and the SVD. QUIC-SVD [46] produces an approximation of the whole-matrix SVD based on a sampling mechanism called the cosine tree, and provides speedups of several orders of magnitude over exact SVD. [47,48] proposed an algorithm for accurate computation of the SVD that inherits the high-accuracy properties of the Jacobi algorithm [49]. [50] introduced a bi-iteration type subspace tracker for updating the SVD approximation of a cross-correlation matrix of dimension N × M. [51] designed secure, correct, and efficient protocols for outsourcing the SVD to a malicious cloud. [52] proposed an algorithm for extremely fast dimensionality reduction that employs a Gaussian-based random projection and a Hadamard-based random projection. However, the above approaches mainly focus on improving the efficiency of the pre-computation stage; few of them pay attention to the efficiency of on-line query processing.
In this paper, we study the efficiency problem of on-line query processing for LSA, towards efficiently searching for similar documents in a large dataset. We rewrite the similarity equation of LSA using an intermediate value called the partial similarity, and divide the similarity computation into two steps: the first step computes the partial similarities, and the second step computes the similarities between the query and the candidate documents based on the partial similarities. The partial similarities are computed in the off-line stage and stored in a designed index called the partial index. To reduce the search space during query processing, we give an approximate form of the similarity equation, and then develop an efficient algorithm for building the partial index, which skips partial similarities lower than a given threshold θ. The similarities between the query and the candidate documents are computed in the on-line stage, and an efficient algorithm called ILSA is developed to support fast on-line query processing by searching for similar documents in the partial index. For a given query of keywords, we first transform it into a pseudo-document vector and then compute the similarities between the query and the candidate documents by accumulating the partial similarities obtained from the partial index. ILSA accesses only the partial index nodes corresponding to non-zero entries of the pseudo-document vector, which prunes unpromising candidate documents and avoids unnecessary similarity-computation operations that contribute little to the similarity scores. Through mathematical analysis, we give the maximal upper bound of the difference between ILSA and naive LSA under threshold θ. Extensive experiments comparing against LSA demonstrate the efficiency and effectiveness of our proposed algorithm.

Preliminaries
Before discussing LSA further, we first give the definition of the term-document correlation matrix for the subsequent discussions.
Definition 1 (Correlation Matrix of Term-Document). A correlation matrix of term-document is formalized as an M × N matrix C, where M is the size of the term set T and N is the size of the document set D. The entry C(t_i, d_j) represents the correlation between term t_i and document d_j, and is initialized as the number of times that term t_i occurs in document d_j.
LSA maps each document into an M-dimensional vector and forms a term-document correlation matrix C. Unlike precise matching methods, the matrix C is decomposed by SVD, which compresses C into a new low-dimensional space to remove the noisy terms. SVD can not only reduce the scale of the data, but also uncover the underlying relationships between terms. During on-line query processing, the input terms are first transformed into a pseudo-document query vector, and then LSA uses the cosine coefficient to compute the similarity between the query vector and the low-dimensional vector corresponding to each document over the decomposition result of matrix C. The candidate documents are sorted according to their similarities and returned to the current user. Besides cosine, other measures such as the Jaccard coefficient and the dot product can also be used for computing similarity; without loss of generality, we choose cosine. Specifically, the procedure of LSA can be summarized as follows.
1. Build the term-document correlation matrix C by analyzing the document set D. Each document d_i ∈ D is represented as a vector V_i = (v_1, v_2, ..., v_M), where v_j refers to C(t_j, d_i) as described in Definition 1 and is computed by counting the number of times that term t_j occurs in document d_i. Precisely, v_j is usually defined by the normalized TF*IDF (term frequency * inverse document frequency) model [53,54], which is widely used for measuring term weights in a document set [55,56]. Specifically, the entry C(t_i, d_j) is assigned the TF*IDF of term t_i occurring in document d_j. After normalizing the vector V_i for each document d_i ∈ D, the term-document correlation matrix is represented as C = [V_1, V_2, ..., V_N].
2. Singular value decomposition (SVD) of the term-document correlation matrix C. For a term-document correlation matrix C, there exists a decomposition C = U S V^T, where U is an M × M matrix whose columns are the orthogonal eigenvectors of CC^T, and C^T is the transpose of C; S is an M × N matrix with S(i, i) = sqrt(λ_i), where λ_i is the i-th biggest eigenvalue of CC^T; and V is an N × N matrix whose columns are the orthogonal eigenvectors of C^T C, and V^T is the transpose of V.
3. Get the low-rank approximation of matrix C. The rank-r approximation of C can be described as C_r = U_r S_r V_r^T, where U_r and V_r are obtained by discarding the columns of U and V from r + 1 on, S_r is obtained by discarding both columns and rows from r + 1 on, and r ≪ M. Noisy terms can be removed by choosing r, but some informative terms are ignored when r is set too small; on the other hand, when r is set too big, some noisy terms are retained.
4. On-line query processing for input keywords. Given a query Q of keywords, the procedure of on-line query processing is as follows. First, view Q as the vector of a mini document and transform it into a pseudo-document vector Q̂ in the low-dimensional space according to the SVD result, described as Q̂ = S_r^{-1} U_r^T Q, where S_r^{-1} is the inverse of S_r. Second, compute the similarity between Q and each document d_i ∈ D as the cosine value between Q̂ and the column vector V_r^T(:, i), described as sim(Q, d_i) = cos(Q̂, V_r^T(:, i)) (Eq (5)). Finally, find the top-k most similar documents from the document set, such that sim(Q, d_i) ≥ sim(Q, d_x) for every d_i in the returned list and every d_x not in it, and sort them by similarity in descending order.
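The four steps above can be sketched in a few lines of NumPy on a toy corpus. The counts matrix, the rank r, and the query terms below are made-up illustrative values, not data from the paper:

```python
import numpy as np

# Toy term-document counts: 6 terms x 4 documents (hypothetical data).
counts = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 3, 1, 0],
    [0, 1, 0, 2],
    [0, 0, 2, 1],
    [1, 0, 0, 1],
], dtype=float)

# Step 1: TF*IDF weighting, then L2-normalize each document column.
tf = counts / counts.sum(axis=0)
idf = np.log(counts.shape[1] / np.count_nonzero(counts, axis=1))
C = tf * idf[:, None]
C = C / np.linalg.norm(C, axis=0)

# Steps 2-3: SVD and rank-r truncation C_r = U_r S_r V_r^T.
U, sv, Vt = np.linalg.svd(C, full_matrices=False)
r = 2
U_r, S_r, Vt_r = U[:, :r], np.diag(sv[:r]), Vt[:r, :]

# Step 4: fold a keyword query into the r-dim space: Q_hat = S_r^{-1} U_r^T Q.
q = np.zeros(6); q[[0, 2]] = 1.0            # query contains terms 0 and 2
q_hat = np.linalg.inv(S_r) @ U_r.T @ q

def cos(a, b):
    """Cosine coefficient between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sims = [cos(q_hat, Vt_r[:, i]) for i in range(4)]
ranking = sorted(range(4), key=lambda i: -sims[i])  # documents, most similar first
```

The ranking step at the end corresponds to returning the top-k list with similarities in descending order.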

Rewrite LSA similarity equation
During on-line query processing of LSA, two factors increase the computational cost. First, the more candidates to check, the more time the algorithm takes; second, when computing the similarity between the query and each candidate, the more terms related to the candidate, the more time it takes. Therefore, the intuition for speeding up the search is to prune unpromising candidates and to reduce unnecessary operations that contribute little to the similarity scores. To optimize on-line query processing, we rewrite the similarity equation of LSA equivalently based on Eq (5). Since Q̂ = S_r^{-1} U_r^T Q, the cosine similarity can be expanded over the query terms as

sim(Q, d_i) = (1 / |Q̂|) Σ_{t_j: Q(t_j) ≠ 0} Q(t_j) · PartialSim(d_i, t_j),   (6)

where PartialSim(d_i, t_j) is defined as

PartialSim(d_i, t_j) = (U_r S_r^{-1} V_r^T(:, i))(t_j) / |V_r^T(:, i)|,   (7)

which is called the partial similarity between document d_i and term t_j; the factor |Q̂| is constant for a fixed query and therefore does not affect the ranking. Based on this equation, the LSA similarity computation can be divided into two steps: the first step computes the partial similarities between documents and terms, and the second step computes the similarity scores based on the partial similarities.
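The rewrite can be checked numerically. The sketch below assumes the query-independent partial-similarity form P[t, i] = (U_r S_r^{-1} V_r^T)[t, i] / |V_r^T(:, i)| described above, and verifies that accumulating these values over the query terms reproduces the direct cosine similarity of Eq (5); the matrix sizes and query are arbitrary toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, r = 8, 5, 3
C = rng.random((M, N))
C /= np.linalg.norm(C, axis=0)               # normalized term-document matrix

U, sv, Vt = np.linalg.svd(C, full_matrices=False)
U_r, S_inv, Vt_r = U[:, :r], np.diag(1.0 / sv[:r]), Vt[:r, :]

# Query-independent partial similarities, one per (term, document) pair:
#   P[t, i] = (U_r S_r^{-1} V_r^T(:, i))_t / |V_r^T(:, i)|
P = (U_r @ S_inv @ Vt_r) / np.linalg.norm(Vt_r, axis=0)   # M x N table

Q = np.zeros(M); Q[[1, 4]] = 1.0             # query with two keyword terms
q_hat = S_inv @ U_r.T @ Q                    # pseudo-document vector

# Direct LSA cosine vs. accumulation of partial similarities (Eq (6)).
direct = Vt_r.T @ q_hat / (np.linalg.norm(q_hat) * np.linalg.norm(Vt_r, axis=0))
accum = (Q @ P) / np.linalg.norm(q_hat)
assert np.allclose(direct, accum)            # the two forms agree
```

Because the table P does not depend on the query, it can be fully precomputed in the off-line stage, which is exactly what the partial index stores.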

Partial index
We next introduce an index, called the partial index, for reducing the search space of LSA. The partial index stores the partial similarity scores in order to reduce the candidate size and optimize the similarity computation. The spirit of the partial index is similar to the pruning index proposed in our previous work [57,58]. An example of a partial index is shown in Fig 1, where TermID denotes the term ID, DocID denotes the document ID, PartialSim denotes the partial similarity, and the 2-tuple ⟨DocID, PartialSim⟩ states that the partial similarity between document DocID and term TermID is PartialSim. For example, in the set of term "3276", ⟨7181, 0.003⟩ states that the partial similarity between document "7181" and term "3276" is 0.003; and in the set of term "7801", ⟨3058, 0.013⟩ states that the partial similarity between document "3058" and term "7801" is 0.013. Formally, the partial index is represented by a set I = {I(t_j)}, where I(t_j) is the node set of term t_j and each node of I(t_j) is a 2-tuple of the ⟨DocID, PartialSim⟩ form. Specifically, d_i is the document corresponding to DocID, and PartialSim(d_i, t_j) is the partial similarity between document d_i and term t_j corresponding to PartialSim.
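In code, the partial index is essentially an inverted list from TermID to ⟨DocID, PartialSim⟩ nodes. A minimal sketch using the two values quoted from Fig 1 (the remaining entries are made up for illustration):

```python
from collections import defaultdict

# TermID -> list of <DocID, PartialSim> nodes, as in Fig 1.
index = defaultdict(list)
for term_id, doc_id, psim in [(3276, 7181, 0.003), (3276, 152, 0.021),
                              (7801, 3058, 0.013), (7801, 990, 0.002)]:
    index[term_id].append((doc_id, psim))

# All partial similarities of a term are retrieved with one lookup.
assert (7181, 0.003) in index[3276]
assert (3058, 0.013) in index[7801]
```

One lookup per query term replaces a scan over all documents, which is where the savings during on-line query processing come from.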

Approximate form of partial index
In fact, not all terms are informative for representing the documents. For example, "SimRank: A Measure of Structural-Context Similarity" is a paper on the topic of structural-based similarity measures, so it is usually highly relevant to terms such as "SimRank", "link", "LinkClus", and "similarity", and lowly or not relevant to terms such as "physical" and "astronomy". During on-line query processing, the lowly relevant or irrelevant terms decrease the efficiency of query processing and can even affect the quality of the returned rankings.
To remove the items corresponding to less informative terms from candidate checking and similarity computation, we give an approximate form of the ILSA similarity equation, defined as

sim_θ(Q, d_i) = (1 / |Q̂|) Σ_{t_j: Q(t_j) ≠ 0} Q(t_j) · PartialSim_θ(d_i, t_j),   (8)

where PartialSim_θ(d_i, t_j) is the partial similarity under threshold θ between document d_i and term t_j, defined as

PartialSim_θ(d_i, t_j) = PartialSim(d_i, t_j) if PartialSim(d_i, t_j) > θ, and 0 otherwise.   (9)

Under the threshold θ, we consider removing the items corresponding to less informative terms from the partial index. Specifically, for a 2-tuple ⟨d_i, PartialSim(d_i, t_j)⟩ in the partial index, we remove it if the partial similarity PartialSim_θ(d_i, t_j) is lower than θ. The partial index under threshold θ is denoted by the set I_θ = {I_θ(t_j)}; only the 2-tuples of non-zero partial similarities are contained in I_θ(t_j). Each node ⟨d_i, PartialSim_θ(d_i, t_j)⟩ of I_θ(t_j) is a 2-tuple of the ⟨DocID, PartialSim⟩ form: d_i is the document corresponding to DocID, and PartialSim_θ(d_i, t_j) is the partial similarity under threshold θ between document d_i and term t_j corresponding to PartialSim.
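The thresholded form of Eq (9) amounts to pruning index nodes whose score does not exceed θ. A small sketch over a toy index (the values are made up, in the ⟨DocID, PartialSim⟩ form of Fig 1):

```python
def prune_index(index, theta):
    """Keep only nodes whose partial similarity is bigger than theta (Eq (9))."""
    pruned = {}
    for term_id, nodes in index.items():
        kept = [(doc_id, psim) for doc_id, psim in nodes if psim > theta]
        if kept:                          # drop empty node sets entirely
            pruned[term_id] = kept
    return pruned

index = {3276: [(7181, 0.003), (152, 0.021)],
         7801: [(3058, 0.013), (990, 0.002)]}
index_theta = prune_index(index, theta=0.005)
# Nodes with scores 0.003 and 0.002 are removed; 0.021 and 0.013 survive.
```

A larger θ prunes more nodes and hence shrinks both the index and the candidate set examined at query time.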

Index building algorithm
The procedure for building the partial index is shown in Algorithm 1. The input of this algorithm is the matrix V_r, the document set D, and the threshold θ; the output is the partial index I_θ. In the initialization step, the partial index I_θ is set as ∅. For each term t_j ∈ {t_j | |V_r^T(t_j, :)| ≠ 0}, we create a node set I_θ(t_j) initialized as ∅ in the partial index I_θ. Then, for each document d_i ∈ D, we compute PartialSim(d_i, t_j) and insert the 2-tuple ⟨d_i, PartialSim_θ(d_i, t_j)⟩ into I_θ(t_j) if PartialSim(d_i, t_j) > θ. Algorithm 1 Algorithm for building partial index.

Input:
Matrix V_r, document set D, threshold θ;
Output:
Partial index I_θ;
1: Initialize I_θ as ∅;
2: for t_j ∈ {t_j | |V_r^T(t_j, :)| ≠ 0} do
3:   Initialize I_θ(t_j) as ∅;
4:   for d_i ∈ D do
5:     Compute PartialSim(d_i, t_j);
6:     if PartialSim(d_i, t_j) > θ then
7:       Create node ⟨d_i, PartialSim_θ(d_i, t_j)⟩;
8:       Insert the node into I_θ(t_j);
9:     end if
10:   end for
11: end for
12: return I_θ;
Next we analyze the time complexity of this algorithm. In the initialization stage, the time cost for creating an empty set I_θ is O(1). For each term t_j, the time cost for computing the partial similarities between t_j and all d_i ∈ D is O(N). Since only the partial similarities bigger than θ are considered for creating index nodes, the total time cost for creating ⟨d_i, PartialSim_θ(d_i, t_j)⟩ in I_θ(t_j) over all d_i ∈ D is O(ε_{t_j} N), where ε_{t_j} is the ratio of the partial similarities between term t_j and the documents in D that are bigger than θ. The total time cost for computing the partial similarities and creating the index nodes of one term is therefore O((1 + ε_{t_j})N). Finally, the time cost of this algorithm is O(1 + (1 + ε̄)rN), where ε̄ is the average of ε_{t_j} over all t_j ∈ {t_j | |V_r^T(t_j, :)| ≠ 0}. The time cost of this algorithm is determined by the size of matrix V_r^T and the threshold θ. Usually, a higher threshold θ reduces the search space of ILSA and subsequently leads to a lower time cost of on-line query processing, since the index nodes corresponding to lower partial similarities are skipped when building the partial index.
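Algorithm 1 can be sketched directly in NumPy. The sketch assumes the query-independent partial-similarity form used earlier, P[t, i] = (U_r S_r^{-1} V_r^T)[t, i] / |V_r^T(:, i)|, and uses made-up toy matrices just to exercise the builder:

```python
import numpy as np

def build_partial_index(U_r, s_r, Vt_r, theta):
    """Algorithm 1 (sketch): index only partial similarities bigger than theta."""
    # Partial-similarity table: one score per (term, document) pair.
    P = (U_r @ np.diag(1.0 / s_r) @ Vt_r) / np.linalg.norm(Vt_r, axis=0)
    index = {}
    for t in range(P.shape[0]):                       # one node set per term
        nodes = [(i, P[t, i]) for i in range(P.shape[1]) if P[t, i] > theta]
        if nodes:
            index[t] = nodes
    return index

# Toy matrices (made-up sizes) to exercise the builder.
rng = np.random.default_rng(0)
C = rng.random((8, 5)); C /= np.linalg.norm(C, axis=0)
U, sv, Vt = np.linalg.svd(C, full_matrices=False)
r = 3
idx = build_partial_index(U[:, :r], sv[:r], Vt[:r, :], theta=0.05)
```

Raising θ can only shrink the node sets, matching the O((1 + ε̄)rN) analysis: fewer nodes are created, so both the build and the later query-time scans get cheaper.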

Index-based LSA (ILSA)
The on-line query processing procedure of the index-based LSA (ILSA) is shown in Algorithm 2. For a given query Q, we transform it into a pseudo-document vector Q̂ and initialize ⟨C, S⟩ by setting both C and S as ∅, where C is the set of candidate documents, S is the set of similarities between the query and the candidates, and the element S(d_i) in S is the similarity between query Q and document d_i. Then, we search the candidate documents and compute the similarities between the query and the candidate documents by accumulating the partial similarities obtained from the partial index nodes corresponding to non-zero entries of Q̂. Specifically, for each term t_j ∈ {t_j | Q̂(t_j) ≠ 0}, we visit each node ⟨d_i, PartialSim_θ(d_i, t_j)⟩ ∈ I_θ(t_j), add d_i into C if it is not yet contained, and accumulate the obtained partial similarity into S(d_i). GetSortedCenter(k, Q) is the function used for obtaining the k most similar documents according to ⟨C, S⟩: it first gets the k most similar documents from C according to their corresponding similarities in S, then sorts and returns them. Algorithm 2 ILSA algorithm.

Input:
Matrices U and X = S_r^{-1} U_r^T, index I_θ, query Q and parameter k;
Output:
Top-k most similar sorted documents;
1: Initialize ⟨C, S⟩ by setting C and S as ∅;
2: Q̂ ← XQ;
3: for t_j ∈ {t_j | Q̂(t_j) ≠ 0} do
4:   for ⟨d_i, PartialSim_θ(d_i, t_j)⟩ ∈ I_θ(t_j) do
5:     if d_i ∉ C then add d_i into C and initialize S(d_i) as 0;
6:     S(d_i) ← S(d_i) + Q̂(t_j) · PartialSim_θ(d_i, t_j);
7:   end for
8: end for
9: return GetSortedCenter(k, Q);
The time cost of this algorithm is affected by the following three aspects. First is the time cost for transforming the given query into a pseudo document, which is O(rN). Second is the time cost for choosing the top-k most similar documents, which is O(Σ_{t_j: Q̂(t_j) ≠ 0} |I_θ(t_j)| + k|C|). Third is the time cost for sorting these k documents, denoted by O(Γ(k)), which depends on the sort algorithm; we use selection sort in our research. So the total time cost of this algorithm is O(rN + Σ_{t_j: Q̂(t_j) ≠ 0} |I_θ(t_j)| + k|C| + Γ(k)). In the ILSA algorithm, we first get the non-zero entries of the vector Q̂, and then check the candidates in the partial index corresponding to those non-zero entries. Therefore, the candidate set is C = ∪_{t_j: Q̂(t_j) ≠ 0} C(t_j), where C(t_j) is the sub-candidate set corresponding to term t_j. We access only the 2-tuples ⟨d_i, PartialSim_θ(d_i, t_j)⟩ ∈ I_θ(t_j) of the partial index I_θ during on-line query processing, so the sub-candidate set is C(t_j) = {d_i | ⟨d_i, PartialSim_θ(d_i, t_j)⟩ ∈ I_θ(t_j)}, and subsequently the candidate set is C = ∪_{t_j: Q̂(t_j) ≠ 0} {d_i | ⟨d_i, PartialSim_θ(d_i, t_j)⟩ ∈ I_θ(t_j)}. When a higher threshold θ is given, the accumulation operations for computing similarities are reduced, which consequently reduces the time cost. In this case, the size of C(t_j) becomes smaller, and hence the size of C shows a downward trend, so the time cost for choosing the top-k documents from C becomes lower as well. Note that the size of C(t_j) is equal to the size of I_θ(t_j).
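The accumulation loop of Algorithm 2 can be sketched compactly. The sketch assumes the index maps each term to its surviving ⟨DocID, PartialSim⟩ nodes; `q_norm` stands for |Q̂|, which is constant for a fixed query and therefore only rescales the scores without changing the ranking:

```python
def ilsa_query(index, q_terms, q_norm, k):
    """Algorithm 2 (sketch): accumulate partial similarities from the index.

    q_terms maps term -> non-zero query weight; index maps term -> list of
    (doc_id, partial_sim) nodes that survived the threshold theta.
    """
    scores = {}
    for t, w in q_terms.items():                 # only non-zero query entries
        for doc_id, psim in index.get(t, []):    # only surviving index nodes
            scores[doc_id] = scores.get(doc_id, 0.0) + w * psim
    top = sorted(scores.items(), key=lambda kv: -kv[1])[:k]
    return [(d, v / q_norm) for d, v in top]     # plays the GetSortedCenter role

# Toy index and query (made-up values).
index = {0: [(10, 0.5)], 1: [(10, 0.2), (11, 0.4)]}
result = ilsa_query(index, {0: 1.0, 1: 1.0}, q_norm=1.0, k=2)
# Document 10 accumulates 0.5 + 0.2 = 0.7 and ranks first.
```

Documents absent from every visited node set are never touched, which is exactly the pruning effect described above.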

Lemma 1 For a given document d_i ∈ D, term t_j ∈ T and threshold θ, we have PartialSim(d_i, t_j) − PartialSim_θ(d_i, t_j) ≤ θ.
Proof. By Eqs (7) and (9), we have PartialSim(d_i, t_j) = PartialSim_θ(d_i, t_j) when PartialSim(d_i, t_j) > θ, which gives PartialSim(d_i, t_j) − PartialSim_θ(d_i, t_j) = 0; and when PartialSim(d_i, t_j) ≤ θ, we have PartialSim_θ(d_i, t_j) = 0, which gives PartialSim(d_i, t_j) − PartialSim_θ(d_i, t_j) = PartialSim(d_i, t_j) ≤ θ. Therefore the lemma holds.

Theorem 1 For a given query Q, document d_i ∈ D and threshold θ, we have sim(Q, d_i) − sim_θ(Q, d_i) ≤ (θ / |Q̂|) Σ_{t_j: Q(t_j) ≠ 0} Q(t_j).
Proof. For a given query Q, document d_i ∈ D and threshold θ, by Eqs (6) and (8), we have sim(Q, d_i) − sim_θ(Q, d_i) = (1 / |Q̂|) Σ_{t_j: Q(t_j) ≠ 0} Q(t_j) · (PartialSim(d_i, t_j) − PartialSim_θ(d_i, t_j)), and by Lemma 1 each difference is at most θ, which gives sim(Q, d_i) − sim_θ(Q, d_i) ≤ (θ / |Q̂|) Σ_{t_j: Q(t_j) ≠ 0} Q(t_j). Theorem 1 gives the maximal upper bound of the difference between LSA and ILSA, which is kept under control by tuning the threshold θ.
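The bound can be checked empirically. The sketch below uses made-up non-negative partial similarities and query weights (the setting Lemma 1 implicitly assumes), and verifies that the per-document similarity gap never exceeds θ times the sum of the query weights, up to the constant |Q̂|:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.05
psims = rng.random((20, 6)) * 0.2            # non-negative partial sims (toy)
psims_theta = np.where(psims > theta, psims, 0.0)   # Eq (9) cutoff

Q = np.zeros(20); Q[[2, 7, 11]] = 1.0        # non-negative query weights
diff = Q @ psims - Q @ psims_theta           # per-document similarity gap

# Theorem 1 (up to the constant |Q_hat|): gap <= theta * sum of query weights.
assert np.all(diff >= 0)
assert np.all(diff <= theta * Q.sum() + 1e-12)
```

Each thresholded entry loses at most θ, and only the non-zero query terms contribute, which is why the gap shrinks linearly as θ is lowered.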

Results
In this section, preliminary experimental results on real datasets are reported. Experiments were done on a 2.90 GHz Intel(R) Core i7-3520M CPU with 8 GB main memory, running Windows 7 SP1. All algorithms were implemented in C++ and compiled using Visual C++ .NET 2010.

Datasets and evaluation
The dataset used in our experiments is a set of papers selected from DBLP (http://dblp.uni-trier.de/). We only keep entries of the snapshot that correspond to papers published before March 10th, 2013. The titles of the papers published in the SIGMOD, VLDB, SIGIR, CIKM, ICDE and EDBT conferences from 2004 to 2013 are selected. From this dataset, we choose the titles of 8,884 papers to test our algorithm and the comparisons, which contain 8,572 terms after removing the stop words; the values of the entries in the term-document matrix are assigned by the TF*IDF model [53,54]. We use NDCG (Normalized Discounted Cumulative Gain) [59] to evaluate the effectiveness of the returned ranking list. The NDCG@k (NDCG value at the k-th position) of a ranking result is computed against the exact LSA scores. Formally, NDCG@k is defined as the DCG@k of the returned ranking divided by the DCG@k of the ideal ranking, where DCG@k (Discounted Cumulative Gain at k) accumulates the discounted relevance scores of the first k positions; here i denotes the position of v_i in the returned list, and REL(v, v_i) denotes the similarity score of the naive LSA between v and v_i. The efficiency comparison includes the running time for building the index and the execution time of on-line query processing. In [60], extensive experiments were done on large datasets to test the performance of LSA. The results suggest that a value r ≈ 400 provides the best performance, and that there is something of an "island of stability" in the r = 300 to 500 range. According to this conclusion, we set the parameter r = 400 to test both LSA and ILSA in our experiments. Other parameter settings of the comparison method strictly follow the literature. We input 10 queries that consist of two keywords to test the NDCG value and the time cost of on-line query processing. In order to accurately measure the execution time of query processing, we process each query with 10 runs and average the total time cost.
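For reference, a common form of DCG@k and NDCG@k can be sketched as follows; since the paper's exact discount formula is not reproduced above, the standard 1/log2(i + 1) discount is assumed here:

```python
import math

def dcg_at_k(rels, k):
    """DCG@k with the 1/log2(i+1) discount over positions i = 1..k."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k):
    """NDCG@k: DCG of the returned order over DCG of the ideal order."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# A perfectly ordered list scores 1.0; any inversion scores below 1.0.
assert ndcg_at_k([3.0, 2.0, 1.0], 3) == 1.0
assert ndcg_at_k([1.0, 2.0, 3.0], 3) < 1.0
```

In the experiments the relevance scores `rels` would be the exact LSA similarities REL(v, v_i) of the documents in the returned order.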

Effectiveness
In this section, we observe the effectiveness of ILSA by testing the NDCG value under different thresholds θ, and then choose different k and r to observe the NDCG value for a fixed θ. Fig 3 shows the NDCG values on varying threshold θ with an interval of 0.001, where k is set as 100. From θ = 0 to 0.010, we observe that the NDCG value decreases as θ increases; this is because a higher θ leads to more accuracy loss, which is consistent with our previous discussion of Theorem 1. We also observe that the accuracy loss of ILSA before θ = 0.01 is not too much, which suggests a good ranking quality of our approach. Fig 4 shows the NDCG values on varying position k, where θ is set as 0.001, 0.005 and 0.010, respectively. With k increasing, we find that the curve of ILSA(0.001) is nearly horizontal, since the accuracy loss is very minor; the NDCG values of ILSA(0.005) and ILSA(0.010) generally show an upward trend as k increases, because some similar documents are lost when setting a higher threshold θ, and these similar documents are obtained again as k increases, which decreases the accuracy loss. At each position k, the NDCG value of ILSA(0.001) is always close to 1, and ILSA(0.005) is lower than ILSA(0.001) and higher than ILSA(0.010), since a higher θ leads to more accuracy loss, which is consistent with the result in Fig 3. The NDCG of both LSA and ILSA(0) is always 1 at each position k, which is not repeatedly shown in our experiment. Fig 5 shows the NDCG change of ILSA on varying rank r, where k = 100 and θ = 0.001. From this result, we observe that the NDCG increases rapidly from r = 100 to 350; this is because more informative terms are included in the similarity computation when increasing r, which consequently increases the effectiveness of the returned rankings. From r = 350 to 450, the NDCG scores are relatively high and stable, since the number of informative terms is suitable and the noisy terms are not too many.
After r = 450, the NDCG value shows a downward trend, since the number of noisy terms increases when r is set too big, which also affects the returned rankings. This result demonstrates that the returned rankings of ILSA are evidently affected by r, and the effectiveness decreases when r is set too big or too small. Fig 6 shows the execution time of on-line query processing on varying θ, where k = 100. From this result, we observe that the time cost decreases as θ increases; this is because the index nodes corresponding to the partial similarities lower than threshold θ are skipped when building the partial index, and subsequently the search space of on-line query processing is reduced. Fig 7 shows the time cost of on-line query processing on varying rank r, where k = 100 and θ = 0.001. We observe that the execution time of on-line query processing increases with r, because more operations for transforming the query into a pseudo document are involved during on-line query processing. The incremental time gradually becomes smaller as r increases, since the size of the document set corresponding to each term in the partial index is reduced, which reduces the operations for checking candidates during on-line query processing. The change of the execution time is very minor as k increases; this is because the time cost of on-line query processing is mainly affected by the similarity computation between the query and the candidates and by the transformation of the query into the pseudo-document vector, and both of these two steps account for a large proportion of the time cost during on-line query processing. ILSA(0.010) is the most efficient method, because the search space during on-line query processing is reduced when setting a higher θ, which is consistent with the result in Fig 6.
Generally, our proposed ILSA is more efficient than LSA at each position k, which demonstrates the efficiency improvement of our proposed ILSA. Fig 9 shows the time cost for building the partial index on varying threshold θ. We observe that the time cost for building the partial index decreases as θ increases; this is because the operations for creating index nodes are saved by skipping the partial similarities lower than θ, which is consistent with the previous discussion of the complexity analysis. This result demonstrates that the additional time cost for building the partial index in the preprocessing stage is very low, which would benefit research on semantic analysis in real applications. Fig 10 shows the time cost for building the partial index on varying rank r, where k = 100 and θ = 0.001. From this figure, we find that the time cost of index building increases linearly with r, since a bigger r increases the size of the SVD matrices and consequently increases the access operations to these matrices, which is consistent with the analysis of the time complexity of Algorithm 1. Fig 11 shows the execution time of on-line query processing on different document scales N, where θ = 0, 0.001, 0.005 and 0.010, respectively. We observe that the query processing time increases as the document scale grows large, since the incremental documents increase the search space over the partial index during on-line query processing. We also observe that the execution time of ILSA(0) is higher than the others, and ILSA(0.010) is the most efficient one, since the operations for computing similarities and checking candidates are reduced by setting a higher threshold θ. Fig 12 shows the index building time on different document scales N, where θ = 0, 0.001, 0.005 and 0.010, respectively.
From this figure, we observe that the time cost for building the index increases linearly with N, since more access operations on the SVD matrices are involved in the index building process. In practice, although the index building time is significantly higher than the query processing time at each document scale, it is acceptable in real applications since the index is built in the off-line stage. We also test the NDCG value on different document scales N, where θ = 0.001, 0.005 and 0.010, respectively. We find that the NDCG value of ILSA(0.001) is always close to 1 on varying N and the change is minor, which shows good performance in searching for similar documents. The NDCG value of ILSA(0.005) shows a minor downward trend as N increases, since the candidate set grows when increasing the document scale, which subsequently increases the number of documents similar to the given query, but some similar documents that should be returned are lost when setting a bigger θ. The downward trend of ILSA(0.010) is more evident than those of ILSA(0.001) and ILSA(0.005) as N increases, since θ = 0.010 leads to more accuracy loss compared to θ = 0.001 and 0.005. We also find that the curve of ILSA(0.010) is evidently lower than both ILSA(0.001) and ILSA(0.005); this is because the effectiveness of the returned rankings decreases when setting a higher threshold θ, which is consistent with the previous results.

Discussion
This paper introduced an index-based query processing algorithm, ILSA, for efficiently finding similar documents in large document datasets. Compared to the LSA algorithm, ILSA searches the documents over a designed partial index derived from the SVD of the term-document matrix, and the search space can be reduced by skipping the partial similarities lower than a given threshold. ILSA reduces the time cost of on-line query processing by pruning unpromising candidate documents and skipping operations that contribute little to the similarity scores; it shows better performance than LSA, and the accuracy loss is kept under control by tuning the threshold. Empirical studies on DBLP through comparison with LSA demonstrate the effectiveness and efficiency of our approach.
There are several directions for future work. First, ILSA works on static datasets, and dynamic datasets are not considered. Accordingly, we will study how to build a dynamic partial index for dynamic term and document sets by integrating existing incremental LSA algorithms [61,62] and incremental SVD algorithms [63-65]. Second, our approach does not pay attention to the transformation from the query into the pseudo document, which involves many unnecessary operations on entries of lower values and increases the execution time of on-line query processing. To further reduce the time cost of on-line query processing, we plan to optimize the transformation by skipping the entries of lower values in the SVD matrices, and to further optimize the similarity computation between the query and the candidate documents by skipping the entries of lower values in the pseudo-document vector.

Acknowledgments

This work was supported by Natural Science Foundation of Shanghai grant 16ZR14228, http://www.stcsm.gov.cn/; Innovation Program of Shanghai Municipal Education Commission grants 15ZZ073 and 15ZZ074, http://www.shmec.gov.cn/; and Training Project of University of Shanghai for Science and Technology grant 16HJPY-QN04, http://www.usst.edu.cn/. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.