Active Semi-Supervised Learning Method with Hybrid Deep Belief Networks

In this paper, we develop a novel semi-supervised learning algorithm called active hybrid deep belief networks (AHD) to address the semi-supervised sentiment classification problem with deep learning. First, we construct the first several hidden layers using restricted Boltzmann machines (RBM), which reduce the dimension and abstract the information of the reviews quickly. Second, we construct the following hidden layers using convolutional restricted Boltzmann machines (CRBM), which abstract the information of the reviews effectively. Third, the constructed deep architecture is fine-tuned by gradient-descent based supervised learning with an exponential loss function. Finally, an active learning method is combined with the proposed deep architecture. We conduct several experiments on five sentiment classification datasets and show that AHD is competitive with previous semi-supervised learning algorithms. Experiments are also conducted to verify the effectiveness of the proposed method with different numbers of labeled and unlabeled reviews.


Introduction
Recently, more and more people write reviews and share opinions on the World Wide Web, which presents a wealth of information on products and services [1]. These reviews not only help other users make better judgements but are also useful resources for manufacturers to keep track of and manage customer opinions [2]. However, because there are large numbers of reviews for every topic, it is difficult for a user to manually learn the opinions on a topic of interest. Sentiment classification, which aims to classify a text according to the expressed sentimental polarity of opinions such as 'positive' or 'negative', 'thumb up' or 'thumb down', 'favorable' or 'unfavorable' [3], can facilitate the investigation of corresponding products or services.
In order to learn a good text classifier, a large number of labeled reviews is often needed for training [4]. However, labeling reviews is often difficult, expensive, or time consuming [5]. On the other hand, it is much easier to obtain a large number of unlabeled reviews, given the growing availability and popularity of online review sites and personal blogs [6]. In recent years, a new approach called semi-supervised learning, which uses a large amount of unlabeled data together with labeled data to build better learners [7], has been developed in the machine learning community.
Several works have addressed semi-supervised learning for sentiment classification and have achieved competitive performance [3,8,9,10]. However, most of the existing semi-supervised learning methods are still far from satisfactory. As shown by several researchers [11,12], a deep architecture, which is composed of multiple levels of non-linear operations, is expected to perform well in semi-supervised learning because of its capability of modeling hard artificial intelligence tasks. Deep belief networks (DBN) are a representative deep learning algorithm that has achieved notable success in text classification; a DBN is a directed belief net with many hidden layers constructed by restricted Boltzmann machines (RBM) and refined by gradient-descent based supervised learning [12]. Ranzato and Szummer [13] propose an algorithm to learn text document representations based on semi-supervised auto-encoders that are combined to form a deep network. Zhou et al. [10] propose a novel semi-supervised learning algorithm to address the semi-supervised sentiment classification problem with active learning. Socher et al. [14] introduce a novel machine learning framework based on recursive autoencoders for sentence-level prediction of sentiment label distributions. Socher et al. [15] introduce the recursive neural tensor network for semantic compositionality over a sentiment treebank. The key issue of traditional DBN is the efficiency of RBM training. Convolutional neural networks (CNN), which are specifically designed to deal with the variability of two-dimensional shapes, have had great success in machine learning tasks and represent one of the early successes of deep learning [16]. Desjardins and Bengio [17] adapt RBM to operate in a convolutional manner and show that convolutional RBMs (CRBM) are more efficient than standard RBMs.
CRBM has been applied successfully to a wide range of visual and audio recognition tasks [18,19]. Despite the success of CRBM in addressing two-dimensional problems, there is still no published research on the use of CRBM in textual information processing. In this paper, we propose a novel semi-supervised learning algorithm called active hybrid deep belief networks (AHD) to address the semi-supervised sentiment classification problem with deep learning. AHD is an active learning method based on a deep architecture in which the bottom layers are constructed by RBM and the upper layers are constructed by CRBM; the whole constructed deep architecture is then fine-tuned by gradient-descent based supervised learning with an exponential loss function.

Problem formulation
A sentiment classification dataset is composed of many review documents, and each review document is composed of a bag of words. To classify these review documents using corpus-based approaches, we need to preprocess them in advance. The preprocessing method for these reviews is similar to that of [9,10]. We tokenize and downcase each review and represent it as a vector of unigrams, using a binary weight equal to 1 for terms present in the review. Moreover, punctuation, numbers, and words of length one are removed from the vector. Finally, we combine all the words in the dataset, sort the vocabulary by document frequency, and remove the top 1.5%, because many of these high document frequency words are stopwords or domain-specific general-purpose words.
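The preprocessing steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the original implementation: the function names (`tokenize`, `build_vocabulary`, `binarize`) are hypothetical, the tokenizer is a simple letters-only split, and rounding the 1.5% cut-off up to a whole number of words is an assumption.

```python
import re
from collections import Counter

import numpy as np


def tokenize(review):
    """Lowercase a review and keep alphabetic tokens longer than one character."""
    # Splitting on non-letters drops punctuation and numbers in one step.
    return [tok for tok in re.split(r"[^a-z]+", review.lower()) if len(tok) > 1]


def build_vocabulary(reviews, drop_fraction=0.015):
    """Sort words by document frequency and drop the most frequent fraction."""
    doc_freq = Counter()
    for review in reviews:
        doc_freq.update(set(tokenize(review)))
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(doc_freq, key=lambda w: (-doc_freq[w], w))
    # Assumption: the cut-off is rounded up to a whole number of words.
    n_drop = int(np.ceil(drop_fraction * len(ranked)))
    return ranked[n_drop:]


def binarize(reviews, vocabulary):
    """Return a D x n matrix whose columns are binary presence vectors."""
    index = {word: j for j, word in enumerate(vocabulary)}
    X = np.zeros((len(vocabulary), len(reviews)), dtype=np.int8)
    for i, review in enumerate(reviews):
        for tok in set(tokenize(review)):
            if tok in index:
                X[index[tok], i] = 1
    return X


reviews = [
    "The movie was great, great fun!",
    "The movie was a bore.",
    "The plot: 2 stars.",
]
vocabulary = build_vocabulary(reviews)
X = binarize(reviews, vocabulary)
```

Each column of the resulting matrix is one review, which matches the column-per-sample layout used in the rest of this section.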
After preprocessing, each review can be represented as a vector of binary weights $x^i$. If the $j$th word of the vocabulary is in the $i$th review, $x^i_j = 1$; otherwise, $x^i_j = 0$. Then the dataset can be represented as a matrix:
$$X = [x^1, x^2, \ldots, x^{R+T}] \in \{0,1\}^{D \times (R+T)}$$
where $R$ is the number of training reviews, $T$ is the number of test reviews, and $D$ is the number of feature words in the dataset. Every column of $X$ corresponds to a sample $x$, which is a representation of a review. A sample that has all features is viewed as a vector in $\mathbb{R}^D$, where the $i$th coordinate corresponds to the $i$th feature. The $L$ labeled reviews are chosen randomly from the $R$ training reviews, or chosen actively by active learning, which can be written as:
$$X^L = X^R(S), \quad S = [s_1, \ldots, s_L], \quad 1 \le s_i \le R$$
where $S$ is the index of the selected training reviews to be labeled manually.
The $L$ labels corresponding to the $L$ labeled training reviews are denoted as:
$$Y^L = [y^1, y^2, \ldots, y^L] \in \mathbb{R}^{C \times L}$$
where $C$ is the number of classes. Every column of $Y$ is a vector in $\mathbb{R}^C$, where the $j$th coordinate corresponds to the $j$th class.
For example, if a review $x^i$ is positive, $y^i = [1, -1]'$; otherwise, $y^i = [-1, 1]'$. We intend to seek the mapping function $X \to Y$ using the $L$ labeled data and all unlabeled data. After training, we can determine $y$ using the mapping function when a new sample $x$ arrives.

Architecture of HDBN
In this part, we propose a novel semi-supervised learning method, HDBN, to address the sentiment classification problem. The sentiment datasets have high dimension (about 10,000), and the computational complexity of convolution is relatively high, so we first use RBM to reduce the dimension of the reviews with ordinary (non-convolutional) computation. Fig. 1 shows the deep architecture of HDBN, a fully interconnected directed belief net with one input layer $h^0$, $N$ hidden layers $h^1, h^2, \ldots, h^N$, and one label layer at the top. The input layer $h^0$ has $D$ units, equal to the number of features of a sample review $x$. The hidden layers consist of $M$ layers constructed by RBM and $N-M$ layers constructed by CRBM. The label layer has $C$ units, equal to the number of classes of the label vector $y$. The number of hidden layers and the number of units in each hidden layer are currently pre-defined according to experience or intuition. The search for the mapping function $X \to Y$ is thus transformed into the problem of finding the parameter space $W = \{w^1, w^2, \ldots, w^N\}$ for the deep architecture.
The training of the HDBN can be divided into two stages:
1. HDBN is constructed by greedy layer-wise unsupervised learning using RBMs and CRBMs as building blocks. The $L$ labeled data and all unlabeled data are utilized to find the parameter space $W$ with $N$ layers.
2. HDBN is trained according to the exponential loss function using gradient-descent based supervised learning. The parameter space $W$ is refined using the $L$ labeled data.

Unsupervised learning
As shown in Fig. 1, we construct HDBN layer by layer using RBMs and CRBMs. The details of RBM can be found in [12]; CRBM is introduced below.
The architecture of CRBM can be seen in Fig. 2. It is similar to RBM, a two-layer recurrent neural network in which stochastic binary input groups are connected to stochastic binary output groups using symmetrically weighted connections. The top layer represents a vector of stochastic binary hidden features $h^k$ and the bottom layer represents a vector of binary visible data $h^{k-1}$, $k = M+1, \ldots, N$. The $k$th layer consists of $G_k$ groups, where each group consists of $D_k$ units, resulting in $G_k \times D_k$ hidden units. The layer $h^M$ consists of 1 group and $D_M$ units. $w^k$ is the symmetric interaction term connecting corresponding groups between data $h^{k-1}$ and feature $h^k$. However, compared with RBM, the weights of CRBM between the hidden and visible groups are shared among all locations [18], and the calculation is performed in a convolutional manner [17].
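We define the energy of the state $(h^{k-1}, h^k)$ as:
$$E(h^{k-1}, h^k; \theta) = -\sum_{s=1}^{G_{k-1}} \sum_{t=1}^{G_k} \left(\tilde{w}^k_{st} * h^{k-1}_s\right) \bullet h^k_t - \sum_{s=1}^{G_{k-1}} b^{k-1}_s \sum_{u=1}^{D_{k-1}} h^{k-1}_{s,u} - \sum_{t=1}^{G_k} c^k_t \sum_{v=1}^{D_k} h^k_{t,v}$$
where $\theta = (w, b, c)$ are the model parameters: $w^k_{st}$ is a filter between unit $s$ in the layer $h^{k-1}$ and unit $t$ in the layer $h^k$, $k = M+1, \ldots, N$; $b^{k-1}_s$ is the $s$th bias of layer $h^{k-1}$ and $c^k_t$ is the $t$th bias of layer $h^k$. A tilde above an array ($\tilde{w}$) denotes flipping the array, $*$ denotes valid convolution, and $\bullet$ denotes element-wise product followed by summation, i.e., $A \bullet B = \mathrm{tr}(A^T B)$ [18].
A Gibbs sampler can be run based on the following conditional distributions.
The probability of turning on unit $v$ in group $t$ is a logistic function of the states of $h^{k-1}$ and $w^k_{st}$:
$$P(h^k_{t,v} = 1 \mid h^{k-1}) = \sigma\Big(c^k_t + \sum_{s} \left(\tilde{w}^k_{st} * h^{k-1}_s\right)_v\Big)$$
The probability of turning on unit $u$ in group $s$ is a logistic function of the states of $h^k$ and $w^k_{st}$:
$$P(h^{k-1}_{s,u} = 1 \mid h^k) = \sigma\Big(b^{k-1}_s + \sum_{t} \left(w^k_{st} \star h^k_t\right)_u\Big)$$
where the logistic function is:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
A star $\star$ denotes full convolution. The convolution computation can extract the information of text effectively based on the deep architecture, although it needs more computation time.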
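One alternating Gibbs step of a one-dimensional CRBM can be sketched with NumPy as follows. The sketch assumes the standard CRBM conditionals of [18] (valid convolution with the flipped filter going up, full convolution with the unflipped filter going down). The filter length `F`, the array shapes, and all function names are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def hidden_probs(v, w, c):
    """P(hidden unit on | visible). v: (S, Dv), w: (S, T, F), c: (T,)."""
    S, T, F = w.shape
    act = np.zeros((T, v.shape[1] - F + 1))
    for t in range(T):
        for s in range(S):
            # Valid convolution of the visible group with the flipped filter.
            act[t] += np.convolve(v[s], w[s, t, ::-1], mode="valid")
        act[t] += c[t]
    return sigmoid(act)


def visible_probs(h, w, b):
    """P(visible unit on | hidden). h: (T, Dh), w: (S, T, F), b: (S,)."""
    S, T, F = w.shape
    act = np.zeros((S, h.shape[1] + F - 1))
    for s in range(S):
        for t in range(T):
            # Full convolution of the hidden group with the unflipped filter.
            act[s] += np.convolve(h[t], w[s, t], mode="full")
        act[s] += b[s]
    return sigmoid(act)


def gibbs_step(v, w, b, c, rng):
    """Sample the hidden layer given the visible layer, then the reverse."""
    p_h = hidden_probs(v, w, c)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = visible_probs(h, w, b)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return h, v_new


rng = np.random.default_rng(0)
v = (rng.random((1, 10)) < 0.5).astype(float)   # 1 visible group, 10 units
w = rng.normal(0.0, 0.1, size=(1, 4, 3))        # 4 hidden groups, filter length 3
b = np.zeros(1)
c = np.zeros(4)
h, v_new = gibbs_step(v, w, b, c, rng)
```

With one visible group of 10 units and filters of length 3, each of the 4 hidden groups has 10 - 3 + 1 = 8 units, and the full convolution on the way down restores the visible length of 10.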

Supervised learning
In HDBN, we construct the deep architecture using all labeled and unlabeled reviews by inputting them one by one from layer $h^0$. The deep architecture is constructed layer by layer from bottom to top, and each time, the parameter space $w^k$ is trained by the calculated data in the $(k-1)$th layer.
According to the $w^k$ calculated by RBM and CRBM, the layer $h^k$, $k = 1, \ldots, M$, can be computed as follows when a sample $x$ is input from layer $h^0$:
$$h^k_t(x) = \sigma\Big(c^k_t + \sum_{s} w^k_{st}\, h^{k-1}_s(x)\Big), \quad k = 1, \ldots, M$$
When $k = M+1, \ldots, N-1$, the layer $h^k$ can be represented as:
$$h^k_t(x) = \sigma\Big(c^k_t + \sum_{s} \tilde{w}^k_{st} * h^{k-1}_s(x)\Big), \quad k = M+1, \ldots, N-1$$
The parameter space $w^N$ is initialized randomly, just as in the backpropagation algorithm.
After greedy layer-wise unsupervised learning, $h^N(x)$ is the representation of $x$. Then we use the $L$ labeled reviews to refine the parameter space $W$ for better discriminative ability. This task can be formulated as an optimization problem:
$$\arg\min_{W} f\big(h^N(X^L), Y^L\big), \quad f\big(h^N(X^L), Y^L\big) = \sum_{i=1}^{L} \sum_{j=1}^{C} T\big(h^N_j(x^i)\, y^i_j\big)$$
and the loss function is defined as
$$T(r) = \exp(-r)$$
We use gradient descent through the whole HDBN to refine the weight space. In the supervised learning stage, the stochastic activities are replaced by deterministic, real-valued probabilities.
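A minimal sketch of the exponential loss and its gradient with respect to the top-layer outputs is given below. It assumes the loss is the sum of $\exp(-h^N_j(x^i)\, y^i_j)$ over all classes $j$ and labeled reviews $i$, with labels encoded as $+1$ and $-1$. The function names are hypothetical, and the paper's line-search optimizer is replaced here by a single plain gradient step for illustration.

```python
import numpy as np


def exponential_loss(outputs, labels):
    """Sum of exp(-h_j(x_i) * y_ij) over classes j and labeled reviews i.

    outputs, labels: arrays of shape (C, L); labels hold +1 / -1.
    """
    return float(np.sum(np.exp(-outputs * labels)))


def exponential_loss_grad(outputs, labels):
    """Gradient of the loss with respect to the top-layer outputs."""
    return -labels * np.exp(-outputs * labels)


# Three labeled reviews, two classes: positive, negative, positive.
labels = np.array([[1.0, -1.0, 1.0],
                   [-1.0, 1.0, -1.0]])
outputs = np.zeros((2, 3))

initial_loss = exponential_loss(outputs, labels)
# One plain gradient-descent step on the outputs (step size 0.1 is arbitrary).
stepped = outputs - 0.1 * exponential_loss_grad(outputs, labels)
stepped_loss = exponential_loss(stepped, labels)
```

In the full network this gradient would be propagated back through all layers; the sketch stops at the output layer to keep the loss itself in focus.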

Classification using HDBN
The training procedure of HDBN is given in Table 1. For the training of the HDBN architecture, the parameters are randomly initialized from a normal distribution. All the reviews in the dataset are used to train the HDBN with unsupervised learning. After training, we can determine the label of new data through:
$$y_j = \begin{cases} 1 & \text{if } h^N_j(x) = \max\big(h^N(x)\big) \\ -1 & \text{otherwise} \end{cases} \tag{15}$$

Active Hybrid Deep Belief Networks Method

AHD description
Given an unlabeled pool $X^R$ and an initial labeled data set $X^L$ (one positive, one negative), the AHD architecture $h^N(x)$ decides which instance in $X^R$ to query next. Then the parameters of $h^N(x)$ are adjusted after new reviews are labeled and inserted into the labeled data set $X^L$. We choose the reviews that are near the separating hyperplane as the labeled training data.

Table 2. Training procedure of AHD.
for each of the I iterations
    Train HDBN with the labeled data set $X^L$ and all unlabeled data in $X$.
    Choose the U reviews nearest the separating line from the training set $X^R$ through Eq. 17.
    Add the U reviews to the labeled data set $X^L$.
end for
Train HDBN with the labeled data set $X^L$ and all unlabeled data in $X$.
The selected training reviews to be labeled manually are given by:
$$s = \{\, j : d_j = \min(d) \,\}, \quad d_j = \frac{\left| h^N_1(x^j) - h^N_2(x^j) \right|}{\sqrt{2}} \tag{17}$$
where $d_j$ is the distance of the mapping result $h^N(x^j)$ from the separating line $h^N_1 = h^N_2$.

Classification using AHD
The training procedure of AHD is given in Table 2. The training set $X^R$ can be seen as an unlabeled pool. We randomly select one positive and one negative review from the pool as the initial labeled dataset $X^L$ used for supervised learning. The number of iterations $I$ and the number of actively chosen reviews $U$ per iteration can be set manually based on the number of labeled reviews in the experiment.
In each iteration, the HDBN architecture is first trained on all the unlabeled reviews and the currently labeled reviews with unsupervised and supervised learning. Then $U$ reviews are chosen from the unlabeled pool based on the distance of their mapping results from the separating line. Finally, these $U$ reviews are labeled manually and added to the labeled dataset $X^L$. In the next iteration, the HDBN architecture is retrained on all reviews with unsupervised learning and on the enlarged labeled dataset $X^L$ with supervised learning. After the last iteration, the HDBN architecture is retrained once more on all the reviews with unsupervised learning and on the existing labeled reviews with supervised learning.
After active training, we can use Eq. 15 to determine the label of new data. The purpose of active learning is to choose more useful data to label for training the deep architecture, so that a better classifier can be trained with fewer labeled data.
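The labeling rule and the query-selection step can be sketched as follows. The sketch assumes two classes, a top layer with one output per class, and a separating line where the two outputs are equal, so that the distance to the line is proportional to the absolute difference of the outputs. The function names and the `pool_indices` argument are hypothetical.

```python
import numpy as np


def predict_labels(outputs):
    """Assign +1 to the class with the largest output and -1 to the others.

    outputs: array of shape (C, n), one column per review.
    """
    labels = -np.ones_like(outputs)
    labels[np.argmax(outputs, axis=0), np.arange(outputs.shape[1])] = 1.0
    return labels


def select_queries(outputs, pool_indices, n_queries):
    """Pick the pool reviews whose two outputs are closest to each other.

    outputs: array of shape (2, n) for the unlabeled pool.
    pool_indices: identifiers of the n pool reviews.
    """
    # Distance from the line h_1 = h_2 in the two-dimensional output space.
    distance = np.abs(outputs[0] - outputs[1]) / np.sqrt(2.0)
    nearest = np.argsort(distance, kind="stable")[:n_queries]
    return [pool_indices[i] for i in nearest]


# Top-layer outputs for four pool reviews (values are made up for illustration).
outputs = np.array([[0.9, 0.1, 0.50, -0.2],
                    [0.1, 0.2, 0.45, 0.7]])
predicted = predict_labels(outputs)
queries = select_queries(outputs, pool_indices=[10, 11, 12, 13], n_queries=2)
```

Here reviews 12 and 11 are returned for manual labeling because their outputs differ by only 0.05 and 0.1, while the other two reviews are classified with a wide margin.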

Experimental setup
We evaluate the performance of the proposed HDBN and AHD methods using five sentiment classification datasets. The first dataset is MOV [20], which is a classical movie review dataset. The other four datasets contain product reviews from the multi-domain sentiment classification corpus, including books (BOO), DVDs (DVD), electronics (ELE), and kitchen appliances (KIT) [21]. Each dataset contains 1,000 positive and 1,000 negative reviews.
The experimental setup is the same as in [9] and [10]. We divide the 2,000 reviews into ten equal-sized folds randomly, maintaining balanced class distributions in each fold. Half of the reviews in each fold are randomly selected as training data and the remaining reviews are used for testing. Only the reviews in the training data set are used for the selection of labeled reviews by active learning. All the algorithms are tested with cross-validation.
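The split described above can be sketched as follows. This is an illustrative reading of the setup rather than the code used in [9] or [10]: the function name `make_folds` and the fixed seed are assumptions, and the half-and-half split inside each fold is drawn at random without further class balancing because the text does not specify it.

```python
import numpy as np


def make_folds(labels, n_folds=10, seed=0):
    """Split review indices into class-balanced folds, then halve each fold."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    pos = rng.permutation(np.flatnonzero(labels == 1))
    neg = rng.permutation(np.flatnonzero(labels == -1))
    folds = []
    for pos_part, neg_part in zip(np.array_split(pos, n_folds),
                                  np.array_split(neg, n_folds)):
        fold = rng.permutation(np.concatenate([pos_part, neg_part]))
        half = len(fold) // 2
        # First half is the training pool, second half is held out for testing.
        folds.append((fold[:half], fold[half:]))
    return folds


# 1,000 positive and 1,000 negative reviews, as in each dataset.
labels = np.array([1] * 1000 + [-1] * 1000)
folds = make_folds(labels)
```

Each of the ten folds then holds 200 reviews (100 positive, 100 negative), of which 100 form the training pool and 100 are used for testing.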
We also compare the classification performance of AHD with three representative active semi-supervised learning methods, i.e., active learning (Active) [24], mine the easy classify the hard (MECH) [9], and active deep networks (ADN) [10]. Active learning [24] is a baseline active learning method for sentiment classification. MECH [9] and ADN [10] are two recently proposed active learning methods for sentiment classification.

Performance of HDBN
The HDBN architecture used in all our experiments has 2 normal hidden layers and 1 convolutional hidden layer; each hidden layer has a different number of units for different sentiment datasets. The deep structures used in our experiments for the different datasets can be seen in Table 3. For example, the HDBN structure used in the MOV dataset experiment is 100-100-4-2, which means that the 2 normal hidden layers have 100 units each, the convolutional hidden layer has 4 groups, and the output layer has 2 units. The number of units in the input layer is the same as the dimension of each dataset. For greedy layer-wise unsupervised learning, we train the weights of each layer independently for a fixed 30 epochs with the learning rate set to 0.1. The initial momentum is 0.5, and after 5 epochs the momentum is set to 0.9. For supervised learning, we run 30 epochs, and three line searches are performed in each epoch.
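The unsupervised training schedule (30 epochs, learning rate 0.1, momentum 0.5 switching to 0.9) can be sketched as a generic momentum update. Two points are assumptions: the first five epochs are taken to be epochs 0 to 4, and `gradient_fn` is a hypothetical stand-in for whatever learning signal the layer uses (for an RBM, typically a contrastive-divergence estimate).

```python
import numpy as np

LEARNING_RATE = 0.1
N_EPOCHS = 30


def momentum_for_epoch(epoch):
    """Momentum 0.5 for the first five epochs (0-4), 0.9 afterwards."""
    return 0.5 if epoch < 5 else 0.9


def train_layer(weights, gradient_fn, n_epochs=N_EPOCHS):
    """Run momentum updates; gradient_fn(weights) returns an ascent direction."""
    velocity = np.zeros_like(weights)
    for epoch in range(n_epochs):
        velocity = (momentum_for_epoch(epoch) * velocity
                    + LEARNING_RATE * gradient_fn(weights))
        weights = weights + velocity
    return weights


# Toy usage: a constant unit gradient makes the schedule easy to follow by hand.
constant_gradient = lambda w: np.ones_like(w)
after_two = train_layer(np.zeros(3), constant_gradient, n_epochs=2)
after_six = train_layer(np.zeros(3), constant_gradient, n_epochs=6)
```

With a constant unit gradient the velocity grows as 0.1, 0.15, 0.175, and so on, and the jump to momentum 0.9 at the sixth epoch is visible as a larger step.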
The test accuracies in cross validation for five datasets and five methods with semi-supervised learning are shown in Table 4. The results of the two previous methods are reported in [9]. The results of the DBN method are reported in [10]. Li et al. [3] reported the results of the PIV method. The result of PIV on the MOV dataset is empty because [3] did not report it. HDBN is the proposed method.
From Table 4, we can see that HDBN achieves the best results on all datasets except KIT, where it is only slightly worse than the PIV method. However, the preprocessing of the PIV method is much more complicated than that of HDBN, and the PIV results on the other datasets are much worse than those of HDBN. HDBN is adapted from DBN, and its results on all five datasets are better than those of DBN. This could be attributed to the convolutional computation in the HDBN structure, and it supports the effectiveness of our proposed method.

Performance of AHD
To evaluate the performance of AHD, we compare its results with several previous active learning methods for sentiment classification. The architectures used in this experiment can be seen in Table 3. We perform active learning for 5 iterations. In each iteration, we select and label 20 of the most uncertain reviews, and then retrain the deep architecture on all of the unlabeled reviews and the labeled reviews annotated so far. After 5 iterations, 100 labeled reviews are used for training.
The test accuracies in cross validation for five datasets and four methods with active semi-supervised learning are shown in Table 5. The results of the two previous methods are reported in [9]. The results of the ADN method are reported in [10]. AHD is the active learning method proposed in this paper. From Table 5, we can see that the results of AHD are better than those of the Active and MECH methods and competitive with those of the ADN method. Because ADN and AHD are both deep learning methods, these results indicate that a deep architecture is well suited to sentiment classification.

Performance with variance of unlabeled data
To verify the contribution of unlabeled reviews to our proposed method, we conduct several experiments with fewer unlabeled reviews and 100 labeled reviews. We use the HDBN method in this part: because AHD chooses the reviews to be labeled from an unlabeled pool, it would be unfair to compare the performance of AHD when the size of the unlabeled pool differs.
The test accuracies of HDBN with different numbers of unlabeled reviews and 100 labeled reviews on five datasets are shown in Fig. 3. The architectures for HDBN used in this experiment can be seen in Table 3. We can see that the performance of HDBN is much worse when using just 400 unlabeled reviews. However, when using more than 1200 unlabeled reviews, the performance of HDBN improves markedly. For most of the review datasets, the accuracy of HDBN with 1200 unlabeled reviews is close to the accuracy with 1600 and 2000 unlabeled reviews. This shows that HDBN can achieve competitive performance with just a few labeled reviews and an appropriate number of unlabeled reviews. Considering the longer training time required with more unlabeled reviews and the small gain in accuracy, we suggest using an appropriate number of unlabeled reviews in real applications.

Performance with variance of labeled data
To verify the contribution of labeled reviews to our proposed method, we conduct several experiments with different numbers of labeled reviews on five datasets. To compare the active learning performance with ADN [10], we use the AHD method in this experiment, and all the experimental settings are the same as for ADN. The architectures for AHD used in this experiment can be seen in Table 3.
The test accuracies of ADN and AHD with different numbers of labeled reviews on five datasets are shown in Fig. 4. We can see that the performance of AHD is better than that of ADN for most of the experimental settings, although both are based on the DBN method. This indicates that the convolutional computation performs better than the normal computation in the deep architecture for sentiment classification. We can also see that both ADN and AHD achieve high accuracy even with just 20 labeled reviews for training. This demonstrates the effectiveness of deep learning methods for semi-supervised learning with very few labeled reviews.

Conclusions
In this paper, we propose a novel semi-supervised learning method, AHD, to address the sentiment classification problem with a small number of labeled reviews. AHD seamlessly incorporates convolutional computation into the DBN architecture and uses CRBM to abstract the review information effectively. One promising property of AHD is that it can effectively use the distribution of a large amount of unlabeled data together with a small amount of label information in a unified framework. In particular, AHD can greatly reduce the dimension of reviews through RBM and abstract the information of reviews through the cooperation of RBM and CRBM. An exponential loss function is then used to refine the constructed deep architecture with the limited label information. Moreover, AHD can actively choose the reviews to be labeled, which effectively improves the performance of the deep architecture.
Experiments conducted on five sentiment datasets demonstrate that AHD outperforms most previous methods and is competitive with the DBN-based method, which demonstrates the performance of deep architectures for sentiment classification. Experiments are also conducted to verify the effectiveness of the AHD method with different numbers of labeled reviews; the results show that AHD can reach very competitive performance with few labeled reviews and a large number of unlabeled reviews. This provides sound support for the effectiveness of AHD in real applications, where collecting enough unlabeled data is relatively easy while obtaining enough labeled data is hard.