A Multiple Relevance Feedback Strategy with Positive and Negative Models

Feedback techniques are commonly used to improve search accuracy. Most existing work on feedback relies on positive information, which has been studied extensively in information retrieval. However, when a query topic is difficult and the results of the first-pass retrieval are very poor, it is impossible to extract enough useful terms from the few positive documents, so positive feedback alone cannot improve retrieval in this situation. In contrast, a relatively large number of negative documents appear at the top of the result list, and several recent studies have confirmed that negative feedback is an important and useful strategy in this scenario. In this paper, we consider the scenario in which the search results are so poor that there are at most three relevant documents among the top twenty. We then conduct a novel study of multiple strategies for relevance feedback that use both positive and negative examples from the first-pass retrieval to improve retrieval accuracy for such difficult queries. Experimental results on two TREC collections show that the proposed language-model-based multiple model feedback method is generally more effective than both the baseline method and the methods that use only the positive or only the negative model.


Introduction
Because of the inherent limitations of current retrieval models, it is nearly impossible for any retrieval model to return satisfactory results for every query. Indeed, a query might be so simple or ambiguous that a large number of the top-ranked documents are non-relevant; we call such a query a difficult query. In such a case, the user has to either reformulate the query or go far down the ranked list to examine more documents, and both options decrease user satisfaction. Improving the search results for such difficult queries would therefore raise user satisfaction, which is the ultimate goal of search engines.
The language modeling approach to text retrieval was first introduced by Ponte and Croft [1] and later explored in [2][3][4]. The relative simplicity and effectiveness of the approach, together with the fact that it leverages statistical methods developed in speech recognition and other areas, make it an attractive framework in which to develop new text retrieval methods. Although the language modeling approach has performed well empirically, a significant part of the performance gain is often due to feedback [1,2,5]. When a user is unable to submit an effective query (which often happens with informational queries because of insufficient knowledge about the relevant documents), feedback can be quite beneficial. The basic idea is to extract useful terms or features from relevant (or pseudo-relevant) documents and use them to expand the original query or update the query model. Feedback techniques help not only text retrieval but also multimedia retrieval [6], such as image search [7] and landmark search [8]. Although several kinds of feedback techniques, including relevance feedback [9][10][11], pseudo-relevance feedback [12][13][14] and implicit feedback [15], have been studied extensively in information retrieval, most existing work relies on positive information, i.e., on exploiting relevant documents or documents that are assumed to be relevant. Moreover, explicit feedback and pseudo-relevance feedback each have their own limitations. An explicit feedback operation harms the user's experience, and the hypothesis of pseudo-relevance feedback, that the top k documents of the first-round retrieval are all relevant to the query, is often invalid [16], which can hurt retrieval performance.
In this paper, we focus on a realistic setting in which a user submits a query to a search engine and then clicks several hyperlinks in the returned list to view them. These click operations can be recorded in search engine query logs. We therefore assume that all clicked documents are highly relevant and that the other documents ranked above the lowest-ranked clicked document are irrelevant. Under this assumption, we use the positive information in the relevant documents to derive a new positive model that expands the original query. We choose the Relevance-Based Language Model (RM) [17], a typical language-model-based Query Expansion (QE) approach [18], as the basis of our positive model estimation.
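The click assumption above can be illustrated with a minimal Python sketch. The function name `judgments_from_clicks` and the document identifiers are hypothetical and only serve the illustration; documents ranked below the lowest-ranked click are left unjudged.

```python
from typing import List, Set, Tuple


def judgments_from_clicks(ranked: List[str], clicked: Set[str]) -> Tuple[List[str], List[str]]:
    """Split a ranked list into (relevant, non_relevant) documents from click data.

    Clicked documents are treated as relevant. Unclicked documents ranked
    above the lowest-ranked clicked document are treated as non-relevant.
    """
    click_positions = [i for i, doc in enumerate(ranked) if doc in clicked]
    if not click_positions:
        return [], []
    lowest = max(click_positions)
    relevant = [ranked[i] for i in click_positions]
    non_relevant = [doc for doc in ranked[:lowest] if doc not in clicked]
    return relevant, non_relevant


relevant, non_relevant = judgments_from_clicks(["d1", "d2", "d3", "d4", "d5"], {"d2", "d4"})
```

In this toy call, `d2` and `d4` become positive examples, `d1` and `d3` become negative examples, and `d5` stays unjudged.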
Some previous studies concluded that when positive documents are available, they are generally more useful than negative documents [19], so positive feedback has been studied extensively. As a result, the question of how to exploit negative documents for feedback has been largely under-addressed, and negative feedback has attracted attention only recently. In [16,20], the authors studied different methods for negative feedback that use only irrelevant information and neglect all relevant information. Intuitively, if we can learn from both positive and negative information, raising the rank of relevant documents while pruning non-relevant documents from the original ranked list, we can improve performance further. In this paper, we tackle this challenge and estimate a negative feedback model by considering not only the negative information but also the positive model, which we obtain with an improved RM approach. Finally, we use the multiple relevance feedback strategy, formed by fusing the positive and negative relevance models, to rerank the unseen list. To evaluate the effectiveness of the proposed method, we construct a test collection containing only appropriate queries from TREC collections. Experimental results show that the proposed multiple relevance feedback strategy is effective at improving ranking accuracy and that it outperforms strategies using only positive or only negative feedback.
The rest of the paper is organized as follows. In the next section, we review related work. Section 3 describes our feedback framework for language models. In Section 4, we present our positive and negative model estimation approaches in detail. Section 5 contains the experimental results and a discussion of those results, and the last section concludes the paper.

Relevance Feedback
Relevance feedback has been shown to be effective with different kinds of retrieval models [14,15,21,22]. In the vector space model, feedback is usually done with the Rocchio algorithm, which forms a new query vector by maximizing its similarity to relevant documents and minimizing its similarity to non-relevant documents [10]. The feedback method in classical probabilistic models selects expansion terms primarily based on the Robertson/Sparck-Jones weight [9]. Unfortunately, neither of them can be implemented naturally in the language modeling approach [16]. In the language modeling approach, relevance feedback can be implemented by estimating a query language model [22] or a relevance model [17] from a set of feedback documents.
Recently, several query expansion techniques have been developed in the language modeling framework, including the mixture-model feedback method [22] and the relevance model [17]. The basic idea is to use feedback documents to estimate a better query language model. Both the mixture model and the relevance model have been shown to be very effective, but the relevance model appears to be more robust [23]. In mixture-model feedback, the words in feedback documents are assumed to be drawn from two models: (1) a background model and (2) a topic model. Mixture-model feedback finds the topic model that best describes the feedback documents by separating it from the background model. The topic model is then interpolated with the original query model to form the expanded query. Much like mixture-model feedback, the relevance model also estimates an improved query language model. Given a query $q$, a relevance model is a multinomial distribution $p(v \mid \theta_q)$ that encodes the likelihood of each term $v$ given the query as evidence. To estimate the relevance model, the authors first compute the joint probability of observing a word together with the query words in each feedback document and then aggregate the evidence by summing over all the documents. It essentially uses the query likelihood $p(q \mid \theta_d)$ as the weight for a document $d$ and takes a weighted average of the probability of word $v$ given by each document language model. When no real relevance judgments are available, pseudo-relevance feedback [12][13][14] may be performed instead; it simply assumes that a small number of top-ranked documents in the initial retrieval results are relevant and then applies relevance (positive) feedback. Both of the feedback approaches above are based on this assumption in the language modeling framework, and our work differs from them in that we use real relevance judgments instead.
Much previous work focuses on explicit feedback, which collects users' judgements and leads to good retrieval performance; unfortunately, very few users will put up with an additional interactive operation. Although we also use users' judgements in this work, the main difference from the explicit feedback approach is that no additional operation is required: we only use information extracted from search engine query logs. This idea is similar to the measures proposed in [15], known as implicit feedback, but our work considers negative information as well as positive information, a research direction that has attracted attention only recently.

Negative Feedback
There have been some attempts to exploit non-relevant documents. Query zone [24] appears to be the only major heuristic proposed to effectively exploit non-relevant information for document routing tasks. It shows that using non-relevant documents that are close to the original query is more effective than using all non-relevant documents in the collection. The work in [25] exploits high-scoring documents outside the top K documents (called pseudo-irrelevant documents) to improve the performance of pseudo-relevance feedback. The work in [16] and its later extension [20] exploit the top non-relevant documents to improve the ranking of documents; they are the earliest studies of negative relevance feedback in the language modeling framework. The latter defines an important concept called the generalization of a language model, and its authors propose an optimization framework based on this concept. Our feedback strategy builds on the same concept, but we argue that the positive feedback model should be taken into account when optimizing the negative model, in order to prune non-relevant documents more aggressively (but carefully), which leads to a more effective multiple relevance feedback method.

Problem Formulation
Given a query $q$ and a document collection $C$, a retrieval system returns a ranked list of documents $L$, where $l_i$ is the $i$-th ranked document in $L$. We assume that the query is difficult enough that there are only a handful of relevant documents $R$ among the top $K$ ranked documents (those seen so far by the user), $L' = \{l_1, \ldots, l_K\}$, and that most documents in $L'$ are non-relevant. The first goal of our study is to use these positive examples $l_i \in R$ to build a positive language model $\theta_R$ that describes the information need more accurately, so that the remaining unseen relevant documents are assigned higher relevance scores when reranking.
The second part of our feedback model is a set of negative models. We use all the negative feedback examples, i.e., $l_i \in N$, to build a set of negative language models, each corresponding to one negative example. Every negative language model is then optimized by taking into account the original query model and the positive language model, so that the improved negative language models are better able to describe other unseen non-relevant documents and improve the ranking of relevant documents by pushing non-relevant documents down the ranked list.
More formally, given a specific query $q$, a ranked list $L$ and a set of relevance judgements consisting of relevant (positive) documents $R$ and non-relevant (negative) documents $N$ for this query, our goal is to estimate a positive language model $\theta_R$ (which is then combined with the original query language model $\theta_q$ to form a relevance topic language model $\theta^*_R$) and a set of improved negative language models $G_N = \{\theta^*_{G_1}, \ldots, \theta^*_{G_{|N|}}\}$, where $\theta = \{v_1 : p_1, \ldots, v_{|V|} : p_{|V|}\}$, i.e., each language model consists of words along with their probabilities. All the models above can then be plugged into the final feedback strategy to improve feedback performance.

The KL-Divergence Function
In this paper, we focus only on the positive and negative feedback problem in the language modeling framework, so we use the Language Model (LM) as the basic retrieval model throughout. There are two main scoring functions in LM. The original and most basic one is the Query-Likelihood (QL) function [26], in which we construct a language model $\theta_d$ from each document $d$ in the collection. The goal is to rank documents by $p(\theta_d \mid q)$, where the probability of a document is interpreted as the likelihood that it is relevant to the query. Using Bayes' rule we have:

$$p(\theta_d \mid q) = \frac{p(q \mid \theta_d)\, p(\theta_d)}{p(q)} \qquad (1)$$

where $p(q)$ is the same for all documents and so can be ignored. The prior probability of a document, $p(\theta_d)$, is often treated as uniform across all $d$, so it can also be ignored. Results are therefore ranked simply by $p(q \mid \theta_d)$, the probability of the query $q$ under the language model derived from $d$. The language modeling approach thus attempts to model the query generation process: documents are ranked by the probability that the query would be observed as a random sample from the respective document model. The other scoring function is the KL-divergence function [27], one of the most effective scoring functions in the language modeling framework [23]. It is a generalization of the query-likelihood function and scores a document $d$ with respect to a query $q$ by the negative Kullback-Leibler divergence between the query language model $\theta_q$ and the document language model $\theta_d$:

$$S_{KL}(q,d) = -D(\theta_q \,\|\, \theta_d) = -\sum_{v \in V} p(v \mid \theta_q) \log \frac{p(v \mid \theta_q)}{p(v \mid \theta_d)} \qquad (2)$$

where $V$ is the set of words in the vocabulary. Clearly, the two main tasks are to estimate the query language model $\theta_q$ and the document language model $\theta_d$. The document language model $\theta_d$ is usually smoothed with Dirichlet prior smoothing, which is an effective smoothing method [28].
The query model intuitively captures what the user is interested in and thus affects retrieval accuracy significantly. When there is no feedback, the query language model $\theta_q$ is often estimated as:

$$p(v \mid \theta_q) = \frac{c(v,q)}{|q|} \qquad (3)$$

where $c(v,q)$ is the count of word $v$ in query $q$ and $|q|$ is the total number of words in the query. Such a model is not very discriminative because a query is typically extremely short. When feedback information is available, it is used to improve the estimate of the query language model $\theta_q$. Following previous work, we use the language model with the KL-divergence scoring function as the basic retrieval model throughout this paper.
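As a minimal sketch of the KL-divergence scoring described above, the following Python code scores a document against a maximum-likelihood query model using a Dirichlet-smoothed document model. The function names, the toy collection model `coll_prob` and the default `mu` are assumptions made for illustration; `coll_prob` is assumed to give a non-zero collection probability for every query term.

```python
import math
from collections import Counter
from typing import Dict, List


def query_model(query_terms: List[str]) -> Dict[str, float]:
    """Maximum-likelihood query model: count of each word divided by query length."""
    counts = Counter(query_terms)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}


def dirichlet_prob(v: str, doc_counts: Counter, doc_len: int,
                   coll_prob: Dict[str, float], mu: float) -> float:
    """Dirichlet-smoothed probability of word v under the document model."""
    return (doc_counts.get(v, 0) + mu * coll_prob[v]) / (doc_len + mu)


def kl_score(q_model: Dict[str, float], doc_terms: List[str],
             coll_prob: Dict[str, float], mu: float = 2000.0) -> float:
    """Negative KL divergence between the query model and the document model.

    Words with zero query probability contribute nothing to the sum,
    so iterating over the query terms is sufficient.
    """
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for v, p_q in q_model.items():
        p_d = dirichlet_prob(v, doc_counts, doc_len, coll_prob, mu)
        score -= p_q * math.log(p_q / p_d)
    return score
```

A document that contains the query terms receives a score closer to zero (higher) than one that does not, which is the ranking behavior the scoring function is meant to produce.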

The Positive Feedback Model
The Relevance-Based Language Model (RM). To describe the user's information need more effectively, we first have to estimate a positive feedback language model. We choose the Relevance-Based Language Model (RM), a typical pseudo-relevance feedback (PRF) approach in the language modeling framework, as the basis of our positive model estimation.
In the RM estimation function, apart from the PRF document models, the document weight consists of two components: a document relevance score and a document prior. The former represents the initial document relevance probability, while the latter is the prior probability of selecting the corresponding document. More formally, for each given query $q = (q_1, q_2, \ldots)$, based on the corresponding PRF document set $R_{PRF}$, the RM estimates an expanded query model: where $p(v, q \mid \theta_{R_{PRF}})$ is the estimated relevance model. The terms with the highest probabilities in $p(v, q \mid \theta_{R_{PRF}})$ are used to estimate the QE model (i.e., the expanded query model).
In Equation 4, $p(v \mid \theta_d)$ is the probability of a term $v$ in the language model $\theta_d$ for a document $d$, $p(\theta_d)$ is $d$'s prior probability, and $p(q \mid \theta_d)$ is the query likelihood: In RM, the weighting function is: To apply revised weighting functions under the RM framework, we reformulate the RM as: where $f'(d,q)$ denotes any revised document-weighting function that satisfies $\sum_{d \in R} f'(d,q) = 1$; different choices of $f'(d,q)$ yield different RM implementations.
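The reformulated RM, with a pluggable document-weighting function, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the document models are plain dictionaries, the caller supplies positive raw weights (from any weighting function), and the function name `relevance_model` and the truncation to `top_terms` terms followed by renormalization are illustrative choices rather than the exact procedure of the paper.

```python
from typing import Dict


def relevance_model(doc_models: Dict[str, Dict[str, float]],
                    weights: Dict[str, float],
                    top_terms: int = 30) -> Dict[str, float]:
    """Weighted average of document language models.

    The raw document weights are normalized to sum to one over the
    feedback set before being used to mix the document models.
    """
    total = sum(weights.values())
    norm = {d: w / total for d, w in weights.items()}
    rm: Dict[str, float] = {}
    for d, model in doc_models.items():
        for v, p in model.items():
            rm[v] = rm.get(v, 0.0) + norm[d] * p
    # Keep only the highest-probability terms and renormalize them.
    top = sorted(rm.items(), key=lambda kv: kv[1], reverse=True)[:top_terms]
    z = sum(p for _, p in top)
    return {v: p / z for v, p in top}
```

Swapping the `weights` argument is all that is needed to move between weighting schemes, which mirrors the role of the document-weighting function in the reformulation above.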

Adaptation of KL-Divergence as the Document Weight. According to Equation 6, $f(d,q)$ in RM is a normalized query-likelihood score (see Equation 1) with the constant $p(q)$ eliminated. Since the document prior $p(\theta_d)$ is assumed to be uniform, the weighting function reduces to the normalized query-likelihood probability: The normalized query-likelihood document weights $f_{QL}(d,q)$ are called QL weights in the following text. From Equations 2 and 8, the QL weights cannot be computed efficiently, because they require an additional calculation step. Moreover, the KL-divergence is a more effective function in information retrieval tasks. Thus, we adopt the KL-divergence function as the document weight in the original RM function: When the first ranked list is returned, all the necessary KL scores $S_{KL}(q,d')$ are already available, so the normalized KL weights can be computed quickly.
Nevertheless, RM is a typical pseudo-relevance feedback (PRF) approach whose basic assumption is that a small number of top-ranked documents in the initial retrieval results are relevant. Under that assumption it is reasonable to weight documents in descending order of their relevance scores. In our work, however, all relevant documents are taken from true judgements given by user feedback. We therefore consider that a document with a lower relevance score in the first retrieval may deserve more attention and a higher weight in feedback processing, because the new query needs to better describe the documents that were not described well by the original query. Thus, we modify the KL document weight as follows: and the final RM function we use as the positive feedback model is: The experimental results in the evaluation section show that our positive feedback model (named RM-KL) is more effective than RM.

The Negative Feedback Model
The basic idea of relevance feedback is to extract useful information from positive documents and use it to update the original query language model, as we have done above. When a query is difficult, it is often impossible to obtain enough positive documents for feedback. Therefore, the best way forward is to exploit the negative documents and perform negative feedback [16]. The idea of negative feedback is to identify distracting non-relevant documents and penalize unseen documents that contain such irrelevant information.
The two negative feedback methods proposed in [20] are SingleNeg and MultiNeg, which we briefly describe below.
SingleNeg. This method adjusts the original relevance score of a document with a single negative model. Let $\theta_q$ and $\theta_d$ be the estimated query model and document model, respectively, and let $\theta_N$ be a negative language model estimated from the negative feedback documents $d_N$. The new scoring function according to this model is: To estimate $\theta_N$, it is assumed that all non-relevant documents are generated from a mixture of a unigram language model $\theta_N$ and a background language model (which generates common words). The log-likelihood of the $N$ sample documents is: where $\lambda$ is a mixture parameter that controls the weight of the background model. A standard EM algorithm is used to estimate the parameters $p(v \mid \theta_N)$.
MultiNeg. This method adjusts the original relevance score with multiple negative topic models. A document $d$ is scored with respect to a query $q$ as follows: where $d_{N_i}$ is a negative document representation and $\beta$ is a parameter that controls the influence of negative feedback. The EM algorithm is used to estimate a negative model $\theta_{N_i}$ for each individual negative document $d_{N_i}$ in $N$. In this way $|N|$ negative models are obtained and combined with the above formula for reranking. According to the experimental results and conclusions in [20], the MultiNeg strategy performs better than SingleNeg, so we choose MultiNeg as our basic negative feedback modeling strategy. Specifically, based on the KL-divergence scoring function, the MultiNeg formula expands to the following form: and we use it in our experiments.
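The EM estimation of a negative model can be sketched in Python as below. This is a minimal illustration of the standard two-component mixture (topic model plus fixed background model) described above, not the authors' implementation. The function name, the default `lam` and the iteration count are assumptions, and `coll_prob` is assumed to assign a non-zero background probability to every term in the input.

```python
from collections import Counter
from typing import Dict, List


def estimate_negative_model(doc_terms: List[str], coll_prob: Dict[str, float],
                            lam: float = 0.9, iters: int = 50) -> Dict[str, float]:
    """Estimate a negative topic model from negative feedback text with EM.

    Each word occurrence is assumed to come from the topic model with
    probability (1 - lam) and from the background model with probability lam.
    """
    counts = Counter(doc_terms)
    total = sum(counts.values())
    p_n = {v: c / total for v, c in counts.items()}  # start from the ML estimate
    for _ in range(iters):
        # E-step: probability that an occurrence of v came from the topic model.
        t = {}
        for v in counts:
            topic = (1.0 - lam) * p_n[v]
            t[v] = topic / (topic + lam * coll_prob[v])
        # M-step: re-estimate the topic model from the expected topic counts.
        z = sum(counts[v] * t[v] for v in counts)
        p_n = {v: counts[v] * t[v] / z for v in counts}
    return p_n
```

Words that are more frequent in the negative text than the background model predicts gain probability mass, while common words are discounted. Passing the concatenated text of all negative documents gives a single model, and calling the function once per negative document gives one model per document.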
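One common way to write a KL-based score with multiple negative models is sketched below in Python. It is an illustration under explicit assumptions: the base score is the negative KL divergence between the query model and the document model, the penalty is the largest negative-KL score of the document against any negative model (the `max` aggregation is an assumption here), and the document model is assumed to be smoothed so that it gives non-zero probability to every word of the query and negative models. The function names are hypothetical.

```python
import math
from typing import Dict, List


def neg_kl(model_a: Dict[str, float], model_b: Dict[str, float]) -> float:
    """Negative KL divergence -D(a || b), summed over the support of a."""
    return -sum(p * math.log(p / model_b[v]) for v, p in model_a.items() if p > 0)


def multineg_score(query_model: Dict[str, float],
                   neg_models: List[Dict[str, float]],
                   doc_model: Dict[str, float],
                   beta: float = 0.5) -> float:
    """Base relevance score minus beta times the closest negative-model score."""
    base = neg_kl(query_model, doc_model)
    penalty = max(neg_kl(m, doc_model) for m in neg_models)
    return base - beta * penalty
```

With this form, two documents that match the query equally well are separated by how closely they resemble a negative example: the one farther from every negative model ends up ranked higher, and setting `beta` to zero recovers the original ranking.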

The Goal of Optimization
A main goal of our study is to improve the estimates of the positive and negative document language models. An effective positive language model can be combined with the original query language model to improve the ranking of relevant documents by boosting their relevance scores directly, and it can be optimized through the EM algorithm. An effective negative document language model can be used to exploit the top non-relevant documents to improve the ranking of documents, and it can be obtained by generalizing a basic negative document language model within an optimization framework. Three criteria have to be considered in the optimization process: (1) closeness to the original negative language model (to ensure accuracy), (2) closeness to the relevance (positive) language model (if the model is far from the information need, its pruning power is weak), and (3) a generalization constraint. The importance of these three components is illustrated in Figure 1, where (a) shows that the general negative language model is safe and effective because it is both close to the original negative language model (which ensures that the pruned documents are non-relevant) and reasonably close to the relevance language model (so that pruning can make a difference in the top-ranked results).
In the next section, we present an optimization framework for improving the estimate of both positive and negative document language models.

The Framework of Optimization
In order to build a more general negative language model, we need an optimization framework that, given $\theta_{N_i}$, searches the space of all language models, finds a set of more general negative language models $\theta_{G_i}$, and finally picks the best one, $\theta^*_{G_i}$. We therefore follow the existing objective function definition and extend it with the positive feedback model, which is an important part of our work, as follows: where $\delta$ and $\delta'$ are divergence functions and $\gamma$ is a tradeoff between closeness to the relevant topic model and closeness to the original negative model. We also keep the restriction that avoids over-generalization: It requires that the general negative language model deviate by at most $\epsilon$ from the original negative language model. The generality $\Omega(\theta)$ is defined as: where $df(v)$ is the number of documents in collection $C$ that contain word $v$ (its document frequency) and $p(v \mid \theta)$ is the probability of word $v$ given language model $\theta$.
Next, we describe the divergence functions $\delta$ and $\delta'$ used in the optimization framework.

Divergence Functions
We define both divergences $\delta$ and $\delta'$ in Equation 16 based on the KL-divergence. First, the divergence from the general negative model to the relevant topic model is exactly the KL value: The KL-divergence is also called relative entropy; the first argument of $D(X_a \,\|\, X_b)$ is regarded as the true distribution and the second as the test distribution. However, it is unreasonable to regard either $\theta_{N_i}$ or $\theta_{G_i}$ as the true distribution, so we keep using the symmetric version of the KL-divergence [27] for the divergence between the general negative model and the original negative model.
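The two kinds of divergence discussed above can be sketched in Python as follows. This is an illustration under assumptions: language models are dictionaries from words to probabilities, the symmetric version is taken to be the sum of the two directed divergences, and the small floor `eps` for words missing from the second model is an added safeguard that is not part of the original description.

```python
import math
from typing import Dict


def kl(p: Dict[str, float], q: Dict[str, float], eps: float = 1e-12) -> float:
    """Directed KL divergence D(p || q); p plays the role of the true distribution."""
    return sum(pv * math.log(pv / max(q.get(v, 0.0), eps))
               for v, pv in p.items() if pv > 0)


def symmetric_kl(p: Dict[str, float], q: Dict[str, float], eps: float = 1e-12) -> float:
    """Symmetric divergence: sum of the two directed KL divergences."""
    return kl(p, q, eps) + kl(q, p, eps)
```

The directed form depends on which model is treated as the true distribution, whereas the symmetric form gives the same value regardless of argument order, which is why it suits a comparison in which neither model is privileged.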
With these instantiations, the objective function is completely defined.

Shrinkage of Searching Space
In the objective function (Equation 16), the search space is infinite. To find an optimal solution efficiently, we make the problem tractable by searching a finite space of feasible solutions, $M$. We propose two steps for shrinking the search space and describe them in detail here.
Conflict Removing. As explained in Section 4.1, the goal of optimizing the general negative language model is to stay (1) close to the original negative language model $\theta_{N_i}$ (the first part of Equation 16) and (2) close to the relevance topic language model $\theta^*_R$ (the second part). Closeness to $\theta^*_R$ ensures pruning power, but the original negative language model conflicts with the relevance topic model because they share terms with high observed frequency, which is the main reason these negative documents are returned at the top of the ranked list. We therefore remove from the original negative language model $\theta_{N_i}$ the terms that have a high probability in the relevance topic model $\theta^*_R$. Specifically, the top $k$ ($k > 0$) terms of $\theta^*_R$ are removed in our experiments.
Term Elimination. Similar to the perturbation step in [29], for each $\theta_{N_i}$ in the original set of negative language models, we build a more general negative language model $\theta_{G_i}$ by removing appropriate terms $v$. In our work, however, we iteratively remove the terms that satisfy $p(v \mid \theta_{G_i}) \times df(v) < Y$, increasing $Y$ by an increment $y$ at each iteration, until the objective function is minimized; the revised negative language model remains more general than $\theta_{N_i}$. Table 1 shows the iterations of term elimination.
Note that after each term removal, the probabilities are renormalized to ensure that they remain comparable.
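The two shrinking steps can be sketched in Python as below. This is a hedged illustration, not the procedure of Table 1: the function names are hypothetical, the objective is passed in as an arbitrary callable, the threshold starts at zero, and the stopping rule (keep the best model seen while the threshold grows until no term survives) is an assumption about how "until the objective function is minimized" could be realized.

```python
from typing import Callable, Dict


def renormalize(model: Dict[str, float]) -> Dict[str, float]:
    """Rescale probabilities to sum to one; an empty model stays empty."""
    z = sum(model.values())
    if z == 0:
        return {}
    return {v: p / z for v, p in model.items()}


def remove_conflicts(neg_model: Dict[str, float], rel_model: Dict[str, float],
                     k: int) -> Dict[str, float]:
    """Drop the top-k terms of the relevance topic model from a negative model."""
    top = set(sorted(rel_model, key=rel_model.get, reverse=True)[:k])
    return renormalize({v: p for v, p in neg_model.items() if v not in top})


def eliminate_terms(neg_model: Dict[str, float], df: Dict[str, int],
                    objective: Callable[[Dict[str, float]], float],
                    y_step: float, max_rounds: int = 100) -> Dict[str, float]:
    """Iteratively remove terms whose probability times document frequency
    falls below a growing threshold, keeping the model with the lowest objective."""
    best, best_val = dict(neg_model), objective(neg_model)
    current, threshold = dict(neg_model), 0.0
    for _ in range(max_rounds):
        threshold += y_step
        kept = {v: p for v, p in current.items() if p * df[v] >= threshold}
        if not kept:
            break
        current = renormalize(kept)
        val = objective(current)
        if val < best_val:
            best, best_val = current, val
    return best
```

Renormalizing after every removal keeps each intermediate model a proper distribution, so objective values from different iterations can be compared directly.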

Experimental Data Set
The evaluation uses two standard TREC (Text REtrieval Conference -http://trec.nist.gov/) collections, Robust04 and GOV2, which are representative of heterogeneous and homogeneous data sets, respectively. Details are given in Table 2.
Our first data set is from the Robust Track of TREC 2004 and contains 528,155 news articles. We use 150 queries from this set in our experiments. The Robust Track is a standard ad hoc retrieval task with an emphasis on the overall reliability of IR systems; it contains difficult queries and is a heterogeneous data set. This data set is called ''Robust04'' in the following text.
The second data set is a TREC test collection for use in the Terabyte Track which is a homogeneous data set. It contains 25,205,179 documents crawled from the ''.gov'' domain sites in 2004, and there are 150 queries in this set. The data set is called ''GOV2'' below.
For both data sets, preprocessing of documents and queries involves only stemming with Porter stemmer and removing stopwords by a minimum English stopwords list in Lucene (Apache Lucene -http://lucene.apache.org/).
Since our goal is to study positive and negative feedback in the language modeling framework, we construct a simulated query set that simulates users' behavior on a search engine. Because no real feedback information is available, we treat the relevance judgements published by TREC as the feedback of several truthful users. The multiple relevance feedback hypothesis requires relevant and non-relevant documents to appear together in the feedback judgements, so we filter both query sets with the following constraint: the baseline method (the language model with the KL-divergence scoring function and Dirichlet prior smoothing; see Section 5.2) must return at least one relevant document in the top 20 (clicked by the user) and at least one non-relevant document (skipped by the user) ranked above the lowest-ranked relevant document in the top 20. After filtering, 112 and 129 queries, respectively, remain for our experiments; more details are given in Table 2. We treat all topic titles as queries and ignore their description fields.
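The query filtering constraint can be expressed as a short Python check. The function name `is_usable_query` and the toy document identifiers are hypothetical; `ranked` stands for the baseline ranking and `relevant` for the set of documents judged relevant for the query.

```python
from typing import List, Set


def is_usable_query(ranked: List[str], relevant: Set[str], k: int = 20) -> bool:
    """Keep a query only if the top k contains at least one relevant document
    and at least one non-relevant document ranked above the lowest relevant one."""
    top = ranked[:k]
    rel_positions = [i for i, doc in enumerate(top) if doc in relevant]
    if not rel_positions:
        return False
    lowest = max(rel_positions)
    return any(doc not in relevant for doc in top[:lowest])
```

A query whose only relevant document is ranked first is rejected, because no non-relevant document precedes it and there would be no negative example to learn from.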

Baseline
In order to evaluate the effectiveness of our method, we use three methods as the baselines for comparison.
(1) The Language Model is implemented with the Indri toolkit (Indri Toolkit -http://www.lemurproject.org/indri.php), with the Dirichlet smoothing prior $\mu$ set empirically to 2000 for Robust04 and 1500 for GOV2 [26]; this method is denoted by LM-Dir. (2) The Relevance-Based Language Model is also implemented with Indri. It is one of the PRF expansion approaches, uses only the positive feedback model, and is based on the query-likelihood method of [17]; it is denoted by RM-QL. (3) The MultiNeg feedback method, which we implement following the description in [20], considers only the negative feedback information; we denote it by MultiNeg (details in Subsection 3.4). All the parameters are set to their empirical values ($Y = 5$, $\alpha = 0.5$ and $\epsilon = 50$) [29].

Experiment Procedure
The multiple relevance feedback strategy proposed in this paper takes both positive and negative feedback information into account. Our experiments therefore simulate a scenario in which a user has viewed the top-K ranked documents (the first page), has clicked a few hyperlinks for further reading, and is about to view the rest of the search results (by clicking the next-page button). At this point, we can naturally apply the feedback information to rerank all the unseen documents. As described in Section 5.1, we set $K = 20$, which simulates the case in which relevant and non-relevant documents have been found on the first page of search results and the user is about to view the next page.
To set the parameters of our method, i.e., $\beta$ (described in Section 3.4, a parameter that controls the influence of negative feedback) and $\gamma$ (described in Section 4.2, a tradeoff between the two divergence functions), we perform 5-fold cross validation as follows: we fix the number of positive feedback terms $k$ (described in Section 4.4) to 30 and $\alpha$ (described in [17], a parameter that controls the influence of positive feedback) to 0.5, and then learn the two parameters from the training data. The other parameters are set to the same values as for MultiNeg (described in Section 5.1).
When the user clicks the next-page button, the top-20 ranked documents have already been browsed, so they should not be returned again on the following pages. To simulate this scenario and reflect the performance directly, we remove the top-20 documents of the original ranked list from the reranked results.
Specifically, we denote the positive feedback strategy method (described in Section 3.3.2) by RM-KL, and the final multiple relevance feedback strategy method we proposed by Multi-FB.

Results
Before the detailed tests, Figure 2 shows the results of assigning different values of $\alpha$ on the two TREC collections when $k$ (the number of positive feedback terms) is set empirically to 30. As can be seen in Figure 2, the RM-KL method performs well when $\alpha$ is set to 0.3 for Robust04 and 0.6 for GOV2. We therefore set the parameter to these optimal values for our RM-KL and Multi-FB methods.
With the setup described above, the top-ranked 1000 unseen documents of all runs are compared in terms of two performance measures: Mean Average Precision (MAP) and Precision at 20 (P@20), which reflects utility from the perspective of users who will not browse more than two pages. MAP is the main measure, but we report results on all measures for the sake of completeness.
Finally, to assess the effectiveness of the proposed strategies, we compare them with the three baseline methods in ad hoc retrieval tests on the two standard TREC collections. Table 3 shows the cross validation results in terms of MAP and Table 4 shows the cross validation results in terms of P@20, for both Robust04 and GOV2. The tables also show the results of assigning different values to $k$ for each collection. The MAP of our Multi-FB method is 8.71% and 3.70% higher than that of the Relevance-based Model (RM-QL) based on pseudo-relevance feedback, and its P@20 is 2.01% and 5.58% higher, for Robust04 and GOV2 respectively.
According to these results, the Multi-FB method outperforms RM-QL and MultiNeg in most cases, which shows that taking positive and negative feedback information into account together leads to a more effective feedback language model than using either of them alone. We also find that the proposed RM-KL method performs better than the RM-QL method, which confirms the effectiveness of the KL-divergence in information retrieval.
Table 4. P@20 scores of various methods.
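The two measures can be computed with a few lines of Python. This sketch assumes binary relevance judgements and evaluates on the unseen part of the ranking only; the helper `residual`, which removes already seen documents from both the ranking and the relevant set, reflects one reasonable reading of the setup and is an assumption, as are all function names.

```python
from typing import List, Set, Tuple


def precision_at_k(ranked: List[str], relevant: Set[str], k: int = 20) -> float:
    """Fraction of the top k documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average of the precision values at the ranks of the relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def residual(ranked: List[str], relevant: Set[str],
             seen: Set[str]) -> Tuple[List[str], Set[str]]:
    """Remove documents the user has already seen from the ranking and judgements."""
    return [d for d in ranked if d not in seen], relevant - seen


def mean_average_precision(runs: List[Tuple[List[str], Set[str]]]) -> float:
    """Mean of the average precision over a list of (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Evaluating on the residual ranking avoids rewarding a method for documents the user has already judged on the first page.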