Relevance popularity: A term event model based feature selection scheme for text classification

Feature selection is a practical approach for improving the performance of text classification methods by optimizing the feature subsets input to classifiers. Traditional feature selection methods such as information gain and chi-square often use the number of documents that contain a particular term (i.e. the document frequency). However, the frequency with which a given term appears in each document has not been fully investigated, even though it is a promising feature for producing accurate classifications. In this paper, we propose a new feature selection scheme based on a term event Multinomial naive Bayes probabilistic model. According to the model assumptions, the matching score function, which is based on the prediction probability ratio, can be factorized. We then derive a feature selection measurement for each term after replacing the inner parameters by their estimators. On a benchmark English text dataset (20 Newsgroups) and a Chinese text dataset (MPH-20), numerical experiments with two widely used text classifiers (naive Bayes and support vector machine) demonstrate that our method outperforms the representative feature selection methods.


Introduction
Text classification has been applied in many contexts, ranging from document indexing based on a controlled vocabulary, to document filtering, automated metadata generation, word sense disambiguation, hierarchical cataloguing of web resources, and in general any application requiring document organization or selective and adaptive document dispatching [1]. Many classification algorithms have been proposed for text classification, such as the naive Bayes (NB) classifier, k-nearest neighbors, and support vector machine (SVM) [2].
To classify documents, the first step is to represent the content of textual documents mathematically, after which these documents can be recognized and classified by a computer. The vector space model is commonly employed, in which a document is represented as a vector in term space [3]. Because of the flexibility and complexity of natural language, the vocabulary expands rapidly as the amount of text increases. Vocabularies composed of tens of thousands of terms are very common in a natural corpus. Each dimension corresponds to a separate term, and the dimensions of the learning space are called features in the general machine learning context. That is, each document is represented by a sparse and ultra-high dimensional vector, in which each element represents the term frequency within the document.
To reduce the dimension and improve classification performance, feature selection chooses a subset of the features based on a training set. Representative feature selection methods such as chi-square (CHI) and information gain (IG), which use statistics and information theory to investigate the relationship between the class label of a document and the absence or presence of a term within the document, have been shown to perform well [4][5][6][7]. Recently, Bayesian feature selection methods were proposed in [8][9][10]. Qian and Shu [11] developed an efficient mutual information-based feature selection algorithm for incomplete data, which integrates information theory and rough sets. Lin et al. [12] presented a novel framework with an optimization function to deal with multi-label feature selection with streaming labels. Zou et al. proposed a Max-Relevance-Max-Distance feature ranking method to find the optimized feature subset, which balances the accuracy and stability of the feature ranking and prediction task [13]. The method and software tool performed well on several bioinformatics problems [14][15][16]. Zhou's lab (Health Informatics Laboratory) described a feature selection algorithm, McTwo, that selects features which are associated with phenotypes and independent of each other, and which achieves high classification performance [17]. In contrast, unsupervised methods select features when the document class labels are absent [18][19][20].
However, these methods treat two features in a document equally even when their term frequencies are very different (such as 1 and 10). As such, they miss the importance of the more frequent terms within the document, and lose information that could potentially enhance the feature selection performance.
Feature weighting, which measures the contribution of each feature, is another important process for improving the classification performance of text classifiers such as SVM and kNN. Term frequency information has gained much more attention in term weighting processes [21][22][23][24][25]. To assign feature weights accurately, Liu et al. [26] proposed a novel constraint based weight evaluation using constrained data-pairs. These methods often contain a local weight factor and a global weight factor. Although the term frequency information within the documents is commonly employed in the local weighting factor, it is rarely employed in the global weighting factor. Erenel and Altınçay confirmed that using term frequency in the global weight factor is beneficial for tasks which do not involve highly repeated terms [23].
Our motivation is to provide a good feature selection scheme for text classification by using the term frequency information within the documents. To this end, we investigated a widely used term event probabilistic model to capture term frequency information, borrowed the idea of relevance weighting [21,27], and derived a novel feature selection measurement named relevance popularity. As a result, term frequency based intra-class association and term frequency based inter-class discrimination are integrated naturally in our feature selection scheme.
The paper is organized as follows. The background of feature selection for text classification is given in Section 2. Section 3 describes the term event probabilistic model with NB assumption. In Section 4, we explain the newly proposed feature selection methods. Section 5 shows experiments and results. We conclude the paper with a brief discussion in Section 6.

Related works
In this section, we briefly describe some related works, including the state-of-the-art feature selection methods used for text classification. To this end, we first introduce the bag-of-words model. A toy example is given in Example 1.

Example 1
We have two documents:
d_1: What do you do at work?
d_2: I answer telephones and do some typing.
Ignoring the term order, each document can be represented by a term frequency vector using the bag-of-words model, namely, the number of times each term appears in the text [3]. For the example above, we can construct two lists that record the term frequencies of all the distinct words (Table 1).
The number of features increases rapidly as the number of documents increases, and many of them do not provide information for text classification. Feature selection is therefore an essential step to improve the classification performance. Feature selection methods can be grouped into two main categories: document frequency (DF) based methods and term frequency (TF) based methods.
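The bag-of-words representation of Example 1 can be sketched in a few lines of Python. This is only an illustration: the tokenizer (lowercasing and splitting on non-letters) and the names `term_frequencies`, `vocabulary` and `vectors` are assumptions made here, not part of the original study.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text, split it on non-letters, and count each term."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


docs = {
    "d1": "What do you do at work?",
    "d2": "I answer telephones and do some typing.",
}

# One Counter per document, then a shared, alphabetically sorted vocabulary.
counts = {name: term_frequencies(text) for name, text in docs.items()}
vocabulary = sorted(set().union(*counts.values()))

# Term frequency vector of each document over the shared vocabulary.
vectors = {name: [c[t] for t in vocabulary] for name, c in counts.items()}

for name, vector in vectors.items():
    print(name, dict(zip(vocabulary, vector)))
```

Note that "do" receives the value 2 in d1, which is exactly the within-document frequency information that a binary (presence or absence) representation discards.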

DF based feature selection methods
Feature selection methods based on DF ignore the term frequency within each document and instead use a binary representation, $(B_1, B_2, \ldots, B_p)$, where $B_u$ is a binary variable that indicates whether the document contains the term $t_u$. The label of the document is denoted by $C$.
For simplicity and without loss of generality, we denote the feature (variable) $B_u$ as $B$ and consider the 2-class classification problem. $N$ is the number of documents in the training set, and the other notation is introduced in Table 2. DF based measures such as IG, CHI, and the odds ratio are computed from these document counts.
IG is a synonym for Kullback-Leibler divergence in information theory and machine learning, and is used to measure the ability of a feature to distinguish the sample data. IG is given by Eq (1).
The CHI statistic, which is widely used in text classification as well as in other machine learning applications, measures the dependence between the random variables $B$ and $C$, and is given by Eq (2).
Li et al. proposed a supervised feature selection method, named CHIR, which is based on the $\chi^2$ statistic and new statistical data that can measure the positive term-category dependency [26]. These feature selection methods have been shown to perform well in text classification [4], although they ignore the term frequency information within the documents.
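As an illustration of how DF based measures use only document counts, the sketch below computes the textbook forms of IG (the mutual information between $B$ and $C$) and of the $\chi^2$ statistic from a 2x2 contingency table. The argument names `n11`, `n10`, `n01`, `n00` (documents with the term present or absent, in the positive or negative class) are assumptions made here and need not match the notation of Table 2.

```python
import math


def information_gain(n11, n10, n01, n00):
    """Mutual information between term presence B and class label C.

    n11: term present, positive class    n10: term present, negative class
    n01: term absent, positive class     n00: term absent, negative class
    """
    n = n11 + n10 + n01 + n00
    cells = (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    )
    ig = 0.0
    for joint, row_total, col_total in cells:
        if joint > 0:  # an empty cell contributes nothing
            ig += (joint / n) * math.log(joint * n / (row_total * col_total))
    return ig


def chi_square(n11, n10, n01, n00):
    """Pearson chi-square statistic of the same 2x2 table."""
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return numerator / denominator if denominator else 0.0
```

Both functions return 0 when the term is distributed identically in the two classes, and neither can distinguish a term that occurs once per document from one that occurs ten times.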

TF based feature selection methods
Recently, term frequency has gained more attention, not only in feature weighting [21,23] but also in feature selection [28][29][30]. Among the TF based feature selection methods, Singh et al. defined a probabilistic popularity of a term as in Eq (3), where $\Pr(t_u \mid C = k)$ is the conditional probability of term $t_u$ given a class label $k$ [31]. To analyze how a feature is distributed over the different classes, they suggested using the Gini coefficient of inequality to obtain the final feature selection measure, which they named the within class popularity (WCP).
After removing the normalization factor in Eq (3), only a term frequency based intra-class association factor is left. An additional inter-class discrimination factor may improve the performance of feature selection.

Methods
Due to the good performance of WCP, we will revisit the probabilistic popularity of the terms and try to look for a model based scheme to measure the term information in this section.

Term event model
In statistical language modelling, a document is often regarded as a sequence of terms (words). The individual term occurrences are the "events" and the document is the collection of term events [32]. This model captures term frequency information in documents, and has been widely used for speech recognition and text classification. Mathematically, a document is represented by $(T_1, T_2, \ldots, T_L)$, where $L$ is the length of the document and each $T_l$, $l = 1, 2, \ldots, L$, is drawn from the vocabulary $V = \{t_1, t_2, \ldots, t_p\}$. In text classification, the order of events is often ignored. The NB assumption is that $T_1, T_2, \ldots, T_L$ are independent given the document label variable $C$ [33], which can be illustrated by the graphical model in Fig 1. Now we can obtain a p-dimensional vector $X = (X_1, X_2, \ldots, X_p)$ by counting the occurrences of each term,

$$X_u = \sum_{l=1}^{L} I(T_l = t_u), \quad u = 1, 2, \ldots, p,$$

where $I(\cdot)$ is the indicator function. Then, for a document $x = (x_1, x_2, \ldots, x_p)$, the conditional probability function is

$$\Pr(x \mid C = k) = \frac{L!}{\prod_{u=1}^{p} x_u!} \prod_{u=1}^{p} \Pr(t_u \mid C = k)^{x_u}. \qquad (4)$$

Matching score functions. From the view of the Multinomial distribution in Eq (4), it is difficult to deal with the feature selection problem because of the internal dependencies among the features. In this section we look for a new way, borrowing the matching score idea from information retrieval [34,35]. We first investigate the Multinomial NB classifier, and then derive a probabilistic feature selection scheme.
Without loss of generality, we consider the binary text classification case, since multi-class classification problems can be transformed into several two-class ones. For a new document $x = (x_1, x_2, \ldots, x_p)$ with class label $C$, let $C = 1$ denote that the document is from the positive class and $C = 0$ that it is from the negative class. Classification can be performed by calculating the posterior probability of the label given the document. By applying Bayes' rule, we get

$$\Pr(C = 1 \mid x) = \frac{\Pr(x \mid C = 1)\,\Pr(C = 1)}{\Pr(x)}.$$

To avoid further expansion of $\Pr(x)$, we use the probability ratio rather than the probability, which is sufficient for the classification task:

$$\frac{\Pr(C = 1 \mid x)}{\Pr(C = 0 \mid x)} = \frac{\Pr(x \mid C = 1)}{\Pr(x \mid C = 0)} \cdot \frac{\Pr(C = 1)}{\Pr(C = 0)}.$$

Ignoring the prior class probability ratio, the classification task can be achieved by the matching score function [35],

$$ms(x) = \log \frac{\Pr(x \mid C = 1)}{\Pr(x \mid C = 0)} = \sum_{u=1}^{p} x_u \log \frac{\Pr(t_u \mid C = 1)}{\Pr(t_u \mid C = 0)}, \qquad (5)$$

where $\sum_{u=1}^{p} \Pr(t_u \mid C = 1) = \sum_{u=1}^{p} \Pr(t_u \mid C = 0) = 1$. The second equality in Eq (5) holds because of Eq (4). Hence, the matching score can be factorized into the local factors of each term.

Relevance popularity. Now, let us turn to the factor $x_u$ in Eq (5). As $x_u$ is the number of occurrences of $t_u$ in a new document, an appropriate substitute is the term occurrence probability, which removes the influence of the document length. To describe the information that a term provides for identifying the positive class documents, we define a matching score for each term as

$$ms_u = \Pr(t_u \mid C = 1) \left| \log \frac{\Pr(t_u \mid C = 1)}{\Pr(t_u \mid C = 0)} \right|.$$

After replacing the probabilities by their Bayesian estimators based on the training data, we have a new measure, relevance popularity (RP),

$$rp_u = \frac{N_{u,1} + 1}{N_1 + p} \left| \log \frac{(N_{u,1} + 1)/(N_1 + p)}{(N_{u,0} + 1)/(N_0 + p)} \right|,$$

where $N_{u,1}$ and $N_{u,0}$ are the term frequencies of $t_u$ in the positive class and negative class, respectively, and $N_1$ and $N_0$ are the total term frequencies in the positive class and negative class, respectively. We used shrinkage estimators, known as Laplace smoothing, to allow the assignment of nonzero probabilities to terms which do not occur in a class [36].

Remark. RP has the following characteristics:
• The first part is the part of WCP, provided by Singh and Gonsalves [31], that remains after removing the normalization factor. A high value represents a high association between a class and a term, i.e. the term occurs more frequently in documents of the class.
• The second part (inside the absolute-value sign) can be regarded as an adjustment factor, and is used to assign larger values to the discriminating terms.
Hence, RP captures not only informative terms but also discriminating ones. A block diagram of our approach is shown in Fig 2. Our main idea may be summed up as follows: the larger the popularity difference of a high-popularity term between the positive category and the negative category, the more it contributes to separating the positive samples from the negative ones.
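The two-class relevance popularity can be sketched as follows. The function assumes that RP is the Laplace-smoothed popularity of the term in the positive class multiplied by the absolute log ratio of the smoothed popularities in the two classes, as described in the Remark; the function and argument names are illustrative choices made here rather than the authors' implementation.

```python
import math


def relevance_popularity(n_u1, n_u0, n_1, n_0, p):
    """Two-class RP of one term (assumed form, see the lead-in).

    n_u1, n_u0: term frequency of the term in the positive / negative class
    n_1, n_0:   total term frequency of the positive / negative class
    p:          vocabulary size, used for Laplace smoothing
    """
    popularity_pos = (n_u1 + 1) / (n_1 + p)
    popularity_neg = (n_u0 + 1) / (n_0 + p)
    # Intra-class association times the inter-class discrimination factor.
    return popularity_pos * abs(math.log(popularity_pos / popularity_neg))


# A frequent, class-specific term compared with a rare term.
frequent = relevance_popularity(n_u1=50, n_u0=1, n_1=100, n_0=100, p=50)
rare = relevance_popularity(n_u1=2, n_u0=1, n_1=100, n_0=100, p=50)
print(frequent > rare)
```

A term with the same smoothed popularity in both classes scores zero regardless of how frequent it is, which is the role of the adjustment factor.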
For a K-class classification problem, we first consider K two-class problems. For class $k$, we have

$$rp_{u,k} = \frac{N_{u,k} + 1}{N_k + p} \left| \log \frac{(N_{u,k} + 1)/(N_k + p)}{(N_{u,\bar{k}} + 1)/(N_{\bar{k}} + p)} \right|,$$

where $N_{u,k}$ and $N_{u,\bar{k}}$ are the term frequencies of $t_u$ in the positive class (i.e. class $k$) and the negative class (made up of the non-$k$ classes), and $N_k$ and $N_{\bar{k}}$ are the corresponding total term frequencies.

Feature selection measure across the classes. Feature selection aims to identify the features that discriminate between the classes. A good feature should have a skewed information distribution across the classes. The Gini coefficient of inequality, which is a popular mechanism for estimating the distribution of income over a population, can be employed in our approach. After sorting $rp_{u,1}, rp_{u,2}, \ldots, rp_{u,K}$ in increasing order and denoting them by $rp_{u,(1)}, rp_{u,(2)}, \ldots, rp_{u,(K)}$, we obtain the Gini coefficient estimator of [31,37], which is computed from these ordered values and their mean $\overline{rp}_u = \frac{1}{K} \sum_{k=1}^{K} rp_{u,k}$.
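The one-versus-rest scores and their Gini coefficient can be sketched as below. Two assumptions are made here and should be checked against [31,37]: the per-class score uses the Laplace-smoothed form with the remaining classes pooled as the negative class, and the Gini coefficient uses the common sample estimator that divides by $K^2$ times the mean (other normalizations, such as $K(K-1)$, also appear in the literature). All names are hypothetical.

```python
import math


def rp_per_class(term_counts, class_totals, p):
    """One-versus-rest RP of a single term for each of the K classes.

    term_counts[k]:  term frequency of the term in class k
    class_totals[k]: total term frequency of class k
    p:               vocabulary size (Laplace smoothing)
    """
    term_total = sum(term_counts)
    grand_total = sum(class_totals)
    scores = []
    for n_uk, n_k in zip(term_counts, class_totals):
        pos = (n_uk + 1) / (n_k + p)
        neg = (term_total - n_uk + 1) / (grand_total - n_k + p)
        scores.append(pos * abs(math.log(pos / neg)))
    return scores


def gini(values):
    """Sample Gini coefficient (assumed normalization: K^2 times the mean)."""
    k = len(values)
    mean = sum(values) / k
    if mean == 0:
        return 0.0
    ordered = sorted(values)
    weighted = sum((2 * i - k - 1) * v for i, v in enumerate(ordered, start=1))
    return weighted / (k * k * mean)


# A term concentrated in one class versus a term spread evenly over four classes.
skewed = gini(rp_per_class([90, 1, 1, 1], [1000] * 4, p=100))
even = gini(rp_per_class([20, 20, 20, 20], [1000] * 4, p=100))
print(skewed > even)
```

Under these assumptions, a term whose relevance popularity is concentrated in a single class receives a Gini value well above that of an evenly spread term, so ranking terms by the Gini value favors class-specific features.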

Experiments
In this study, we conducted two series of experiments under various experimental circumstances to evaluate the performance of the feature selection methods. To accomplish this, we compared three TF based feature selection methods (including our RP) and two DF based methods on a Chinese corpus and a popular benchmark English corpus. In the first series of experiments, we looked for performance differences between the TF based and the DF based feature selection methods by examining the features they selected, using the available Chinese dictionary. The second series of experiments was performed to compare the feature selection methods by their classification effectiveness using two state-of-the-art text classifiers: the Multinomial NB classifier and the SVM classifier.

Feature selection methods
Feature selection methods, CHI and IG, were selected in our study due to their reported performance and typical representation in text classification [4]. To consider the term frequency information within the documents, the WCP [31] and T-test [30] methods were also included. Table 3 shows the summary of these methods.

Classifiers
Feature selection methods can be evaluated by further classification using the selected features. Two state-of-the-art text classifiers were chosen in our study, i.e. the Multinomial NB classifier and SVM. All algorithms were run using Matlab R2014b. For SVM, we employed LIBSVM-3.21, which is an integrated SVM software package [38].

Multinomial NB. Multinomial NB, which is based on the term event model, is one of the most widely used and effective classifiers in text classification [33]. For a new document $x = (x_1, x_2, \ldots, x_p)$, we have

$$\Pr(C = k \mid x) \propto \Pr(C = k) \prod_{t_u \in S} \Pr(t_u \mid C = k)^{x_u},$$

where $\Pr(C = k)$ and $\Pr(t_u \mid C = k)$ can be estimated from the training data and $S$ denotes the selected feature set. A document is assigned the class label with the maximum value of $\Pr(C = k \mid x)$. Hence, the feature selection scheme has a direct bearing on the classification results, and feature selection methods can be evaluated by those results.

Support vector machine. SVM is another widely used method, and seems to perform better than other methods in text classification. In our study, we adopted the linear SVM rather than the nonlinear SVM, as suggested in [21], because the linear SVM is simple and fast and performs better than the nonlinear models.
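To show how the selected feature set enters the Multinomial NB decision rule, a minimal pure-Python sketch follows (the experiments themselves were run in Matlab). The representation of a document as a `{term: count}` dictionary, the Laplace smoothing over the selected vocabulary, and the function names are assumptions made here for illustration.

```python
import math


def train_multinomial_nb(docs, labels, selected):
    """Estimate class priors and smoothed term probabilities on the selected terms.

    docs: list of {term: count} dictionaries; labels: list of class ids;
    selected: set of terms kept by feature selection.
    """
    classes = sorted(set(labels))
    prior = {k: labels.count(k) / len(labels) for k in classes}
    counts = {k: {t: 0 for t in selected} for k in classes}
    for doc, k in zip(docs, labels):
        for term, count in doc.items():
            if term in selected:
                counts[k][term] += count
    cond = {}
    for k in classes:
        total = sum(counts[k].values())
        # Laplace smoothing over the selected vocabulary (an assumption).
        cond[k] = {t: (counts[k][t] + 1) / (total + len(selected)) for t in selected}
    return prior, cond


def predict(doc, prior, cond):
    """Return the class with the largest log posterior, ignoring unselected terms."""
    def log_score(k):
        return math.log(prior[k]) + sum(
            count * math.log(cond[k][term])
            for term, count in doc.items() if term in cond[k]
        )
    return max(prior, key=log_score)


train_docs = [{"bus": 3, "road": 1}, {"bus": 2}, {"exam": 4, "road": 1}, {"exam": 1}]
train_labels = [0, 0, 1, 1]
prior, cond = train_multinomial_nb(train_docs, train_labels, {"bus", "exam", "road"})
print(predict({"bus": 2, "noise": 5}, prior, cond))
```

Terms outside the selected set (such as "noise" above) simply drop out of the product, which is why the quality of the selected features directly affects the classification result.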

Text data collections
A Chinese text collection and a widely used English text collection were used in our experiment. The Chinese text collection was MPH-20, which is a subset of the appeal call text records from the Mayor's public hotline project in 2015 in the City of Changchun, China. After selecting the 20 most frequent functional departments (categories) and randomly sampling 1,000 documents from each class, we obtained the MPH-20 text data set with 20,000 documents and 24,772 distinct terms, see S1 File. Table 4 shows the selected 20 categories of the appeal call text records.
The benchmark English collection was 20 Newsgroups (freely available from http://qwone.com/~jason/20Newsgroups/), which is a collection of approximately 20,000 news documents evenly divided among 20 groups. A total of 18,774 entries remained in this collection after removing duplicate, empty, single-word, and multi-labelled documents, and 61,188 terms occurred in the corpus. Table 5 shows some statistical information on these datasets, where $D$ is the number of documents, $p$ is the size of the vocabulary, $\bar{L}$ is the average length of a document, St.Dev is the standard deviation of the document lengths, $D_{train}$ is the size of the training set, and $D_{test}$ is the size of the testing set.

Experimental results
Feature selection results. We used the available dictionary of MPH-20 and obtained the rank of the terms under each feature selection method, see S2 File. Table 6 shows the top 20 Chinese terms selected by each method. According to these results, TT (based on the t-test) did not select any new terms compared with IG and CHI. WCP found "driver", "Jiutai" and "switch on", which were not in the results of IG and CHI. Our proposed RP obtained quite different results: none of its top 20 terms were selected by the compared methods. RP selected terms with a specific meaning and a high frequency within the documents, such as "take an exam", "chauffeured car" and "Boshuo road". Terms that usually occurred no more than once within a document, such as "Yushu city", "Shuangyang district", "gas" and "citizen", were not included in its top 20 terms.
Classification performance results. In this section, we further compare the performance of the feature selection methods using the Multinomial NB and linear SVM classifiers. In particular, we trained the classification models incrementally, using 20%, 60%, and 100% of the training set. Figs 3-6 show the classification results obtained with the Multinomial NB and SVM text classifiers on the MPH-20 and 20 Newsgroups datasets. From left to right, 20%, 60%, and 100% of the training set were used. Each curve in these figures represents a different feature selection method.
Fig 3 depicts the classification accuracy of the five feature selection methods (i.e., RP, WCP, TT, IG, CHI) on MPH-20 when using the Multinomial NB text classifier. All methods obtained their best values when 10% of the features were included. RP outperformed all the compared methods, with a best accuracy of 0.8636, whereas WCP, TT, IG, and CHI obtained 0.8114, 0.7848, 0.7821, and 0.8195, respectively, when the entire training set was used. All of the methods obtained better results as the size of the training set increased. In all training cases, the accuracy trended downward as more features were included.

Feature selection number determination. In this section, we determine the number of features to select. We suggest using cross-validation on the training set to choose the best feature selection percentage. For each method, we employed 5-fold cross-validation and tried the following percentages in our experiment: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.
Table 7 shows the classification accuracy values and the numbers of included features of the feature selection methods on MPH-20. When using the Multinomial NB classifier, RP achieved the best accuracy, 0.8636. CHI achieved the second best accuracy, 0.8195, which is substantially lower. WCP achieved 0.8114, and TT and IG performed less well. All methods selected 1,915 features.
When using the SVM classifier, RP achieved the best accuracy, 0.8872, and included 17,242 features. WCP achieved the second best accuracy, 0.8851, and included 3,831 features. CHI achieved 0.8818, and TT and IG performed less well. All methods except WCP selected more than 13,000 features.
Table 8 shows the classification accuracy values and the numbers of included features of the feature selection methods on 20 Newsgroups. When using the Multinomial NB classifier, RP achieved the best accuracy, 0.8604, and included 32,326 features. WCP achieved the second best accuracy, 0.8517, and included 21,550 features. TT achieved 0.8459 and included 26,938 features. IG and CHI performed less well. When using the SVM classifier, RP achieved the best accuracy, 0.7753, and included 26,938 features. WCP achieved the second best accuracy, 0.7547, and included 10,775 features. TT, IG and CHI performed less well. RP and WCP selected far fewer features than the other methods.
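The cross-validation procedure for choosing the feature selection percentage can be sketched as follows. This is a simplified outline under stated assumptions: the feature scores are computed once on the whole training set, the folds are contiguous, and `fold_accuracy` is a hypothetical callback that trains a classifier on the training indices with the selected terms and returns its accuracy on the validation indices.

```python
def top_fraction(scores, fraction):
    """Keep the given fraction of terms, highest score first (at least one term)."""
    keep = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:keep])


def k_folds(n_docs, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    start = 0
    for i in range(k):
        size = n_docs // k + (1 if i < n_docs % k else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_docs))
        yield train, validation
        start += size


def choose_fraction(scores, n_docs, fold_accuracy,
                    fractions=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9), k=5):
    """Return (best_fraction, mean_accuracy) over k-fold cross-validation."""
    best = None
    for fraction in fractions:
        selected = top_fraction(scores, fraction)
        accuracies = [fold_accuracy(train, validation, selected)
                      for train, validation in k_folds(n_docs, k)]
        mean_accuracy = sum(accuracies) / len(accuracies)
        if best is None or mean_accuracy > best[1]:
            best = (fraction, mean_accuracy)
    return best
```

In practice `scores` would map each term to its feature selection score (for example the Gini value of its relevance popularity), and `fold_accuracy` would wrap the Multinomial NB or SVM classifier. A stricter protocol would recompute the scores inside each fold so that validation documents never influence the ranking.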
Discussion. The feature selection results of the TF based methods (RP, WCP and TT) and the two DF based methods (IG and CHI) on MPH-20 demonstrate that, by using the term frequency, our method selects terms that are more specific and more important (i.e. that have a high frequency within the documents).
Furthermore, the classification results obtained with both the NB and SVM classifiers and with different training set sizes on the MPH-20 and 20 Newsgroups datasets illustrate the superiority of RP over the state-of-the-art feature selection methods.

Conclusions and future work
We proposed a novel feature selection scheme via a widely used probabilistic text classification model. We captured the term frequency information within the documents via a term event Multinomial model. To remove complex factors, we employed the logarithmic ratio of the positive class posterior probability to the negative one (i.e. the matching score idea). Then, we obtained a sub-score of each feature, named relevance popularity, under the well known NB assumption. Finally, we obtained a global feature selection score by using the Gini coefficient estimator [31,37]. Experiments on the MPH-20 and 20 Newsgroups datasets with both the NB and SVM classifiers verified that the proposed feature selection scheme benefits from the term event model and provides better scores than existing methods for text classification problems.

Table 7. MPH-20: The classification accuracy values (A) and the numbers of included features (N) of the five feature selection methods. The largest accuracy value and the smallest feature number are highlighted in bold for each classifier.

The proposed relevance popularity coupled with the Gini coefficient has an appreciable advantage for text classification problems. Future work may consider the optimal choice of the global goodness function for relevance popularity and derive theoretical results for it.