The authors have declared that no competing interests exist.
‡ These authors also contributed equally to this work.
Feature selection is a practical approach for improving the performance of text classification methods by optimizing the feature subsets input to classifiers. Traditional feature selection methods such as information gain and chi-square typically rely on the number of documents that contain a particular term (i.e., the document frequency). However, the frequency with which a given term appears in each document has not been fully exploited, even though it is a promising signal for producing accurate classifications. In this paper, we propose a new feature selection scheme based on a term event Multinomial naive Bayes probabilistic model. Under the model assumptions, the matching score function, which is based on the prediction probability ratio, can be factorized. We then derive a feature selection measure for each term after replacing the inner parameters with their estimators. On a benchmark English text dataset (20 Newsgroups) and a Chinese text dataset (MPH-20), numerical experiments with two widely used text classifiers (naive Bayes and support vector machine) demonstrate that our method outperforms representative feature selection methods.
Text classification has been applied in many contexts, ranging from document indexing based on a controlled vocabulary to document filtering, automated metadata generation, word sense disambiguation, hierarchical cataloguing of web resources, and, in general, any application requiring document organization or selective and adaptive document dispatching [
To classify documents, the first step is to represent the content of textual documents mathematically so that they can be recognized and classified by a computer. The vector space model is commonly employed, in which a document is represented as a vector in term space [
To reduce the dimension and improve classification performance, feature selection selects a subset of features based on a training set. Representative feature selection methods such as chi-square (CHI) and information gain (IG), which use statistics and information theory to investigate the relationship between the class label of a document and the absence or presence of a term within it, have been shown to perform well [
However, these methods treat two features in a document equally even when their term frequencies differ greatly (e.g., 1 versus 10). They therefore miss the importance of the more frequent terms within the document, losing information that could enhance feature selection performance.
Feature weighting measures each feature's contribution and is another important process for improving the classification performance of text classifiers such as SVM and kNN. Term frequency information has received considerable attention in term weighting [
Our motivation is to provide a good feature selection scheme by exploiting the term frequency information within documents in text classification. To this end, we investigated a widely used term event probabilistic model to capture term frequency information, borrowing from the idea of relevance weighting [
The paper is organized as follows. The background of feature selection for text classification is given in Section 2. Section 3 describes the term event probabilistic model with NB assumption. In Section 4, we explain the newly proposed feature selection methods. Section 5 shows experiments and results. We conclude the paper with a brief discussion in Section 6.
In this section, we briefly describe related work, including the state-of-the-art feature selection methods used for text classification. To this end, we first introduce the bag-of-words model. A toy example is given in
Ignoring term order, each document can be represented by a term frequency vector using the bag-of-words model, i.e., by the number of times each term appears in the text (see the sketch after the table below) [
| Term | What | do | you | at | work | I | answer | telephones | and | some | typing |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Document 1 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Document 2 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
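As a concrete illustration, the short Python sketch below builds such term frequency vectors; the two example sentences are illustrative stand-ins for the documents behind the toy table above, not taken from the corpora used later.

```python
import re
from collections import Counter

# Two illustrative documents (stand-ins for the toy example above).
docs = ["What do you do at work",
        "I answer telephones and do some typing"]

# Tokenize, build a shared vocabulary, and count term occurrences.
tokenized = [re.findall(r"[A-Za-z]+", d) for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
vectors = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

print(vocab)
for v in vectors:
    print(v)  # e.g. the count of "do" is 2 in the first document
```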
The number of features will increase rapidly as the number of documents increases, and many of them do not provide information for text classification. Feature selection is an essential step to improve the classification performance. Feature selection methods can be grouped into two main categories: document frequency (DF) based methods and term frequency (TF) based methods.
Feature selection methods based on DF ignore the term frequency within each document and instead use a binary representation (
For simplicity and without loss of generality, we denote the feature (variable)
| | Class | |
|---|---|---|
| | positive | negative |
| occur | A | B |
| not occur | C | D |

Here A, B, C, and D denote the document counts in each cell; for example, A is the number of positive-class documents in which the term occurs.
IG corresponds to the Kullback–Leibler divergence in information theory and machine learning, and measures the ability of a feature to distinguish the sample data. IG is given by
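A form commonly used in the text classification literature, with $m$ classes and $\bar{t}$ denoting the absence of term $t$, is:

$$\mathrm{IG}(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})$$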
The CHI statistic is widely used in text classification as well as in other machine learning applications; it measures the lack of independence between the random variable
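In the two-class case, writing $A$, $B$, $C$, and $D$ for the four document counts of the contingency table above and $N = A + B + C + D$, the statistic takes the familiar form:

$$\chi^2(t) = \frac{N\,(AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)}$$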
These feature selection methods have been shown to perform well in text classification [
Recently, term frequency has gained more attention, not only in feature weighting [
After removing the normalization factor in
Given the good performance of WCP, in this section we revisit the probabilistic view of term popularity and look for a model-based scheme to measure the information carried by each term.
In statistical language modelling, a document is often regarded as a sequence of terms (words). The individual term occurrences are the “events” and the document is the collection of term events [
Now we can obtain a
Then, for a document
From the view of the Multinomial distribution in
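Concretely, with vocabulary $t_1, \dots, t_V$ and $f_k$ denoting the frequency of term $t_k$ in document $d$, the term event model gives the standard Multinomial document likelihood:

$$P(d \mid c) = \frac{\left(\sum_{k=1}^{V} f_k\right)!}{\prod_{k=1}^{V} f_k!} \prod_{k=1}^{V} P(t_k \mid c)^{f_k}$$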
Without loss of generality, we consider the binary text classification case; multi-class classification problems can be decomposed into several two-class ones. For a new document,
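Under this model, the multinomial coefficient cancels in the ratio, so the log ratio of class posteriors (the matching score) factorizes into per-term contributions, which is what makes a per-term selection measure possible:

$$\log\frac{P(c_+ \mid d)}{P(c_- \mid d)} = \log\frac{P(c_+)}{P(c_-)} + \sum_{k=1}^{V} f_k \log\frac{P(t_k \mid c_+)}{P(t_k \mid c_-)}$$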
Now, let us turn to the part of
After replacing the probabilities with their Bayesian estimators based on the training data, we obtain a new measure
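A typical choice for such estimators is the Laplace-smoothed (add-one) multinomial estimate shown below, where $f_{k,d}$ is the frequency of $t_k$ in document $d$; the exact prior used in the paper may differ:

$$\hat{P}(t_k \mid c) = \frac{1 + \sum_{d \in c} f_{k,d}}{V + \sum_{j=1}^{V}\sum_{d \in c} f_{j,d}}$$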
The first part is the core of the WCP measure provided by Singh and Gonsalves [
The second part (the factor inside the absolute-value sign) can be regarded as an adjustment factor, used to assign larger values to the discriminating terms.
Hence, RP can capture not only informative terms but also discriminating ones. A block diagram of our approach is shown in
For a
Feature selection aims to identify features that discriminate between the classes; a good feature should have a skewed information distribution across the classes. The Gini coefficient of inequality, a popular measure of how income is distributed over a population, can be employed in our approach. After sorting
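A minimal sketch of the Gini computation is given below, assuming `scores` holds a term's non-negative per-class values (the exact per-class quantity fed in follows the RP derivation above):

```python
import numpy as np

def gini(scores):
    """Gini coefficient of a non-negative score vector.

    0 means the scores are spread evenly across classes; values near 1
    mean the mass is concentrated in a few classes (a skewed, and hence
    discriminating, distribution).
    """
    v = np.sort(np.asarray(scores, dtype=float))  # ascending, as in the text
    n = v.size
    total = v.sum()
    if n == 0 or total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    # Classical rank-based formula for the Gini coefficient
    return 2.0 * np.dot(ranks, v) / (n * total) - (n + 1) / n
```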
In this study, we conducted two series of experiments under various experimental conditions to evaluate the performance of the feature selection methods. To accomplish this, we compared three TF-based feature selection methods (including our RP) and two DF-based methods on a Chinese corpus and a popular English benchmark corpus. In the first series of experiments, we examined the differences between the TF-based and DF-based methods in terms of the features they select, using the available Chinese dictionary. The second series of experiments explored the relative merits of the feature selection methods through classification effectiveness with two state-of-the-art text classifiers: the Multinomial NB classifier and the SVM classifier.
The feature selection methods CHI and IG were selected in our study due to their reported performance and their representativeness in text classification [
CHI and IG are based on DF; the others are based on TF.
| Method | Description |
|---|---|
| CHI | measures the dependence between a term and the document label |
| IG | the number of bits of information obtained for label prediction given a feature |
| RP | our newly proposed scheme based on the term event model and the Gini coefficient |
| WCP | the Gini coefficient of the within-class probability |
| TT | the diversity of the distributions of a term between a specific class and the entire corpus, based on the T-test |
Feature selection methods can be evaluated by classification using the selected features. Two state-of-the-art text classifiers were chosen in our study: the Multinomial NB classifier and SVM. All algorithms were run using Matlab R2014b. For SVM, we employed LIBSVM-3.21, an integrated SVM library [
Multinomial NB is one of the most widely used and effective classifiers in text classification [
SVM is another widely used method and tends to outperform other methods in text classification. In our study, we adopted the linear SVM rather than the nonlinear SVM, as suggested in [
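The evaluation protocol can be sketched with scikit-learn stand-ins as below. The paper's own experiments used Matlab and LIBSVM; here the built-in chi-square selector is only a placeholder for whichever feature scoring method is under comparison (RP is not a library routine), `LinearSVC` stands in for LIBSVM's linear SVM, and `k` is an illustrative feature budget:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def evaluate(train_texts, y_train, test_texts, y_test, k=5000):
    """Fit both classifiers on the top-k selected features and report accuracy.

    chi2 stands in for the feature scoring method under comparison.
    """
    for clf in (MultinomialNB(), LinearSVC()):
        pipe = make_pipeline(CountVectorizer(),   # bag-of-words counts
                             SelectKBest(chi2, k=k),
                             clf)
        pipe.fit(train_texts, y_train)
        print(type(clf).__name__, pipe.score(test_texts, y_test))
```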
A Chinese text collection and a widely used English text collection were used in our experiments. The Chinese collection, MPH-20, is a subset of the appeal call text records from the Mayor's public hotline project conducted in 2015 in the City of Changchun, China. After selecting the 20 functional departments (categories) with the highest call frequencies and randomly sampling 1,000 documents from each class, we obtained the MPH-20 text data set with 20,000 documents and 24,772 distinct terms; see
| | | |
|---|---|---|
| Chaoyang District Government | Dehui Government | City Development and Reform Commission |
| Nanguan District Government | Jiutai District Government | Municipal Public Security Bureau |
| Kuancheng District Government | Nongan Government | Municipal Environmental Protection Bureau |
| Erdao District Government | Jingyue Development Zone | City Water Group |
| Shuangyang District Government | Economic Development Zone | Changchun Gas |
| Lvyuan District Government | Hi-tech Development Zone | City Transit Administration Bureau |
| Yushu Government | Automobile Development Zone | |
The benchmark English collection was 20 Newsgroups (freely downloadable from
| Corpus | # Documents | # Distinct terms | Avg. document length | SD of document length | # Training | # Test |
|---|---|---|---|---|---|---|
| MPH-20 | 20,000 | 24,772 | 43.46 | 32.51 | 10,095 | 9,905 |
| 20 Newsgroups | 18,774 | 61,188 | 243.01 | 489.38 | 9,511 | 9,263 |
We used the available dictionary of MPH-20 and obtained the rank of terms under each feature selection method; see
RP | WCP | TT | IG | CHI |
---|---|---|---|---
Take an exam | Yushu city | Shuangyang district | Shuangyang district | Shuangyang district |
Chauffeured car | Shuangyang district | Yushu city | Yushu city | Nongan county |
Boshuo road | Dehui city | Nongan county | Nongan county | Yushu city |
Heilin town | Jiutai city | Dehui city | Dehui city | Dehui city |
Daqing | Nongan county | Jingyue development zone | Erdao district | Jiutai city |
Shuangde township | Gas corporation | Automobile development zone | Kuancheng district | Automobile development zone |
Suitcase | Automobile development zone | Nanguan district | Nanguan district | Jingyue development zone |
Operate | Gas | Chaoyang district | Chaoyang district | Gas |
Yunshan | Jingyue development zone | Erdao district | Jingyue development zone | Erdao district |
Cremation | High-tech development zone | Jiutai city | Lvyuan district | Economic development zone |
Wanjinta township | Economic development zone | Kuancheng district | Automobile development zone | High-tech development Zone |
Kaoshan town | Water group | Lvyuan district | Jiutai city | Lvyuan district |
Gongpeng town | Driver | Economic development zone | Economic development zone | Nanguan district |
Gong | Erdao district | High-tech development zone | Gas | Kuancheng district |
Yuxi street | Taxi | Village | High-tech development zone | Chaoyang district |
Rename | Jiutai | Gas | Villager | Gas corporation |
Longjia town | Switch on | Villager | Citizen | Water group |
Shanghewan | Chaoyang district | Citizen | Village | Water pause |
Gaming machine | Nanguan district | Water pause | Water pause | Charge |
Festival | Lvyuan district | Water group | Gas corporation | Taxi |
In this section, we further compare the performance of the feature selection methods using the Multinomial NB and linear SVM classifiers. In particular, we built the classification models by incremental training with 20%, 60%, and 100% of the training set. Figs
In this section, we determine the number of features to select. We suggest using cross-validation to choose the best feature selection percentage on the training set. For each method, we employed 5-fold cross-validation and tried the following percentages in our experiment: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.
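A hedged sketch of this selection step follows, assuming `X` is the training term frequency matrix, `scores` holds the per-feature selection scores (e.g., RP values), and `clf` is one of the classifiers above; the function name and arguments are illustrative:

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def best_percentage(X, y, scores, clf, percentages=range(10, 100, 10)):
    """Choose the feature percentage with the best 5-fold CV accuracy."""
    order = np.argsort(scores)[::-1]              # features, best first
    best_p, best_acc = None, -1.0
    for p in percentages:
        k = max(1, int(X.shape[1] * p / 100))     # keep the top p% of features
        acc = cross_val_score(clf, X[:, order[:k]], y, cv=5).mean()
        if acc > best_acc:
            best_p, best_acc = p, acc
    return best_p, best_acc
```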
The largest accuracy value and the smallest feature numbers are highlighted in bold for each classifier.
| Classifier | RP # features | RP accuracy | WCP # features | WCP accuracy | TT # features | TT accuracy | IG # features | IG accuracy | CHI # features | CHI accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| Multinomial NB | | 0.8114 | | 0.7848 | | 0.7821 | | 0.8195 | | |
| SVM | 17,242 | 0.8851 | | 0.8794 | 13,410 | 0.8796 | 15,326 | 0.8818 | 13,410 | |
The largest accuracy value and the smallest feature numbers are highlighted in bold for each classifier.
| Classifier | RP # features | RP accuracy | WCP # features | WCP accuracy | TT # features | TT accuracy | IG # features | IG accuracy | CHI # features | CHI accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| Multinomial NB | 32,326 | 0.8517 | | 0.8459 | 26,938 | 0.8451 | | 0.8376 | | |
| SVM | 26,938 | 0.7547 | | 0.7014 | 48,489 | 0.7016 | 48,489 | 0.7007 | 48,489 | |
The feature selection results of the TF-based methods (RP, WCP, and TT) and the two DF-based methods (IG and CHI) on MPH-20 demonstrate that our method benefits from using term frequency to select terms carrying more detailed and important (high within-document frequency) information.
Furthermore, the classification results when using both the NB and SVM classifiers and different training set sizes on the MPH-20 and 20 Newsgroups datasets illustrate the superiority of RP compared with the state-of-the-art feature selection methods.
We proposed a novel feature selection scheme based on a widely used probabilistic text classification model. We captured the term frequency information within documents via a term event Multinomial model. To remove complicating factors, we employed the logarithmic ratio of the positive-class posterior probability to the negative one (i.e., the matching score idea). We then obtained a sub-score named
Experiments on the MPH-20 and 20 Newsgroups datasets using both the NB and SVM classifiers verified that the proposed feature selection scheme benefits from the term event model, providing better scores than existing methods for text classification problems.
The proposed
The Chinese text data set used in the experiment.
(MAT)
The feature selection score ranks on MPH-20.
(XLSX)
Fruitful discussions with Shaoting Li, Zhigeng Gao, Xu Zhang, Wei Cai, and other members of the Key Laboratory for Applied Statistics of MOE at Northeast Normal University are gratefully acknowledged.