Sentiment analysis using averaged weighted word vector features

People use the World Wide Web heavily to share their experiences with entities such as products, services, or travel destinations. Texts that provide online feedback through reviews and comments are essential for consumer decisions. These comments create a valuable source that may be used to measure satisfaction with products or services. Sentiment analysis is the task of identifying opinions expressed in such text fragments. In this work, we develop two methods that combine different types of word vectors to learn and estimate the polarity of reviews. We create average review vectors from word vectors and add weights to these review vectors using word frequencies in positive and negative sentiment-tagged reviews. We apply the methods to several datasets from different domains that are used as standard sentiment analysis benchmarks. We ensemble the techniques with each other and with existing methods, and we compare them with the approaches in the literature. The results show that our approaches outperform the state-of-the-art success rates.


Introduction
Sentiment analysis is the process of computationally identifying and categorizing opinions expressed in a text, especially in order to determine whether the writer's attitude toward a particular topic, product, etc., is positive, negative, or neutral. A basic task in sentiment analysis is classifying the polarity of a given text at the document or sentence level.
As mentioned in Task 5 of Semeval 2016 [1], we use the Web to share our experiences about products, services, or travel destinations [2]. Texts that provide online feedback, such as reviews and comments, are important for consumer decision-making [3], and customer comments create valuable sources that can help companies measure satisfaction and improve their products or services.
Besides these, some other datasets exist, such as Stanford IMDB Reviews [20] or the Yelp dataset [21], and some sentiment analysis (SA) studies have been conducted using these datasets. In this work, we focus on a comprehensive analysis of SA studies, and our results are very close to the state of the art ([22], [23]).
In this work, we make a comprehensive analysis of some existing methods that produce semantic polarities from reviews in different domains, and we propose two new approaches. Our approach uses weighted word vectors to create feature sets. For this purpose, we used the word2vec [24] and GloVe [25] methods. We ensembled and compared our approaches with existing approaches and analyzed experimental results on IMDB reviews, the Semeval 2016 Task 5 dataset, and Yelp restaurant reviews. With our approaches, we come very close to state-of-the-art accuracies.
We provide the source code and prepared datasets for the model as well as trained word vectors at https://github.com/alierkan/Sentiment-Analysis.
The rest of this paper is organized as follows: Section 2 presents previous work related to sentiment analysis. In Section 3, we describe our model. In Section 4, we describe the datasets used during experimentation. The results are then presented in Section 5. Finally, Section 6 summarizes the conclusions and potential future lines of this work.

Related Works
In the literature, there are several datasets for evaluating models, such as Stanford IMDB Movie Reviews, the SEMEVAL restaurant and laptop reviews, and the Yelp dataset. We review previous studies that used these datasets. Although there are some rule-based studies in the literature, we list only studies that used learning algorithms, since our study also focuses on machine learning algorithms.
Wang and Manning [23] used Multinomial Naive Bayes (MNB), support vector machines (SVM), and SVM with NB features (NBSVM) to determine the polarities of reviews. They split at spaces for unigrams and filtered out anything that is not [A-Za-z] for bigrams. Their approach computes a log-ratio vector between the average word counts extracted from positive documents and the average word counts extracted from negative documents. NBSVM obtained 91.2% accuracy on the IMDB dataset.
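The log-ratio vector described above can be illustrated with a minimal sketch. It assumes tokenized documents and an add-alpha smoothing constant; the function name `log_count_ratio`, the value of `alpha`, and the toy documents are hypothetical and are not taken from [23].

```python
import math
from collections import Counter

def log_count_ratio(pos_docs, neg_docs, alpha=1.0):
    """Return {token: log-ratio} from tokenized positive and negative documents."""
    vocab = {tok for doc in pos_docs + neg_docs for tok in doc}
    pos_counts = Counter(tok for doc in pos_docs for tok in doc)
    neg_counts = Counter(tok for doc in neg_docs for tok in doc)
    # Smoothed count vectors; alpha is an assumed smoothing constant.
    p = {tok: alpha + pos_counts[tok] for tok in vocab}
    q = {tok: alpha + neg_counts[tok] for tok in vocab}
    p_norm = sum(p.values())
    q_norm = sum(q.values())
    # Positive values mark tokens that are relatively more frequent in positive documents.
    return {tok: math.log((p[tok] / p_norm) / (q[tok] / q_norm)) for tok in vocab}

# Toy data, for illustration only.
pos = [["great", "movie"], ["great", "acting"]]
neg = [["boring", "movie"], ["bad", "acting"]]
ratios = log_count_ratio(pos, neg)
```

In this toy example, "great" receives a positive ratio, "boring" a negative one, and "movie", which is equally frequent in both classes, a ratio of zero.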
Mesnil et al. [22] used three approaches to discriminate positive and negative sentiment in the IMDB reviews, and then combined these approaches to reach better accuracy. Their first approach computes the probability of a test document belonging to the positive and negative classes via Bayes' rule by using n-grams and Recurrent Neural Networks (RNNs) [26]. As a second approach, they used a supervised reweighting of the counts, as in the Naive Bayes Support Vector Machine (NB-SVM) mentioned in the previous paragraph [23]. Finally, they used a sentence vector method, proposed in [27], to learn distributed representations of words and paragraphs. The sentence vector was created by using the word2vec algorithm proposed in [24]. To create review vectors, they first added a unique ID at the beginning of every review, so that this ID became a word that represents the review. Then, they ran the word2vec algorithm on these modified reviews and prevented the algorithm from removing the review IDs. The result was a matrix that included the word vectors of the words in the reviews and of the review IDs, where every row is the word vector of the corresponding word. They then extracted the sub-matrix that contains only the word vectors of the review IDs. This method has one major issue: review vectors must be generated from both training and test reviews together. So, to find the polarity of a new review, all steps, including creating the review vectors, have to be repeated. Therefore, it is not a practical method. They combined the results of the three approaches and reached 92.57% accuracy, surpassing the score of Wang and Manning [23].
In Semeval 2016 Task 5 Subtask 2 [1], a set of customer reviews about a target entity (e.g., a laptop or a restaurant) was given; the goal is to identify a set of (aspect, polarity) tuples that summarize the opinions expressed in each review. Khalil et al. (NileTMRG Team) [28] incorporated domain and aspect information in one ensemble classifier consisting of three CNNs [29] trained using the whole training data provided in both domains and initialized with word vectors that were fine-tuned using training examples collected in a semi-supervised way by the same CNN architecture. Each of the three classifiers is similar to this CNN, with a slight variation that results from incorporating domain and aspect knowledge into the model. This incorporation was done by introducing new binary features to the hidden layer of the CNN. The new features indicate the presence or absence of a certain aspect or domain in a given sentence. They mostly employed Static-CNN, where the initialized input vectors are kept as is, and Dynamic-CNN, where the input vectors are updated while optimizing the network. The Dynamic-CNN was run on sentences tokenized from the restaurant reviews of the Yelp academic dataset. They ensembled their results: the ensemble model counts the votes of the three classifiers and predicts the class that has the maximum number of votes among the three classes, namely positive, negative, and neutral. They obtained 85.448% accuracy on the English restaurant dataset of Semeval 2016 Task 5 [1].
[30] used lexical acquisition and supervised classification with a Support Vector Machine (SVM) at Semeval 2016 Task 5. They used lexical expansion to induce sentiment words based on the distributional hypothesis. Because of rare words, unseen instances, and the limited coverage of available lexicons, they considered distributional expansion a useful back-off technique (Govind et al. [31]). They constructed a polarity lexicon for all languages using an external corpus and a seed sentiment lexicon. Finally, they computed normalized positive, negative, and neutral scores for each word, similar to Kumar et al. [32]. Their main assumption is that words with the same sentiment are semantically more similar. Hence, words that appear more often in positive (negative/neutral) reviews have a higher positive (negative/neutral) sentiment score. They obtained 86.729% accuracy on the English restaurant dataset of Semeval 2016 Task 5 [1].
[33] apply a term-centric method for feature extraction. For a term, the features are the lexical-semantic categories (food, service, etc.) associated with the term by a semantic parser, the bigrams and trigrams involving the term, and all syntactic dependencies (subject, object, modifier, attribute, etc.) involving the term. First, aspects are extracted using a conditional random field (CRF) model. Then, aspect categories are found and added as features for polarity classification. The features are also delexicalized by replacing a term with its generic aspect category (e.g., "staff" is replaced by "service", and "sushi" is replaced by "food"). They obtained 88.13% accuracy on the English restaurant dataset of Semeval 2016 Task 5 [1].
[34] employ the Logistic Regression algorithm with the default parameters implemented in the LIBLINEAR tools to build the classifiers. 5-fold cross-validation is adopted for system development. They used linguistic features (word n-grams, lemmatized word n-grams, POS, etc.), sentiment lexicon features (mainly ratios between positive, negative, and potential words related to a given aspect), topic model features (the document distribution among predefined topics, where the topic probability of each word indicates its significance in the corresponding topic), and word2vec features for learning.

Method
To use machine learning techniques to learn the polarity of a review, we must represent it with some features. As mentioned before, these may be a bag of words, the number of words in a sentence, or other features. Nowadays, most studies of this type use word vectors.
Word vector models (WVMs) represent words in a continuous vector space where semantically similar words are mapped to nearby points. WVMs have a long, rich history in NLP, but all methods depend in some way or another on the Distributional Hypothesis, which states that words that appear in the same contexts share semantic meaning. The different approaches that leverage this principle can be divided into two categories: count-based methods (e.g., Latent Semantic Analysis) and predictive methods (e.g., neural probabilistic language models). Count-based methods compute the statistics of how often a word co-occurs with its neighbor words in a large text corpus and then map these count statistics down to a small, dense vector for each word. Predictive models directly try to predict a word from its neighbors in terms of learned small, dense embedding vectors (considered parameters of the model). Two prominent word vector algorithms are word2vec and GloVe.
Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space [24]. We used the word2vec algorithm with the Continuous Skip-gram Model. It tries to maximize the classification of a word based on another word in the same sentence. More precisely, each current word is used as an input to a log-linear classifier with a continuous projection layer and predicts words within a certain range before and after the current word. Increasing the range improves the quality of the resulting word vectors, but it also increases the computational complexity.
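To make the role of the range (window) concrete, the following minimal sketch enumerates the (current word, context word) training pairs that a skip-gram model would see for one sentence. The function name `skipgram_pairs` and the example sentence are hypothetical; the sketch shows only pair generation, not the training of the log-linear classifier.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Toy sentence, for illustration only.
pairs = skipgram_pairs(["the", "food", "was", "great"], window=1)
```

A larger window produces more pairs per word, which is consistent with the higher computational cost noted above.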
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. The GloVe model is trained on the non-zero entries of a global word-word co-occurrence matrix, which tabulates how frequently words co-occur with one another in a given corpus. Populating this matrix requires a single pass through the entire corpus to collect the statistics [25].
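The single pass that populates the co-occurrence matrix can be sketched as follows. This is an illustrative simplification that stores only non-zero entries and uses plain symmetric counts within a fixed window; the function name `cooccurrence_counts`, the window size, and the toy corpus are assumptions.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count symmetric word-word co-occurrences in one pass over the corpus."""
    counts = defaultdict(int)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look back over the previous `window` words and count both directions.
            for j in range(max(0, i - window), i):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return dict(counts)

# Toy corpus, for illustration only.
corpus = [["good", "food", "good", "service"], ["bad", "food"]]
counts = cooccurrence_counts(corpus, window=1)
```

Only pairs that actually co-occur appear as keys, which corresponds to the non-zero entries on which the model is trained.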
Using the word2vec and GloVe algorithms, we produced word vectors separately for the words in the reviews. That is, we have two word-vector sets for every review dataset. We obtained the best results with our methods when the lengths of the word2vec and GloVe vectors were 300. We eliminated stop-words and did not use their word vectors. We used these vectors within our weighted averaged review vector (WARV) method and the convolutional neural network (CNN) architecture described in Kim [29]. In the following sections, we explain the methods in detail.
Since word vectors represent semantic similarity between words, if we can produce from them a vector that represents a review or sentence, then it becomes easy to learn the semantic polarity of that review or sentence. If we take the mean of the word vectors of a review, we obtain a new vector that represents the review and is semantically similar to the words in it. Therefore, we created review vectors by averaging the normalized word vectors of the words in the review. The input of our method is the set of word vectors of the words in a review, and the output is a review vector that has the same dimension as the word vectors. We computed such an averaged vector for every review.
We produced word vectors by using the word2vec and GloVe algorithms. Then we created averaged review vectors (ARV) from the word2vec and GloVe vectors separately, and we concatenated them. That is, we produced a combined vector from the word2vec vectors of size M and the GloVe vectors of size M, so the dimension of the combined vector is 2M. For any review that contains N words, we average N word vectors of size M: $[wv_1, wv_2, \dots, wv_N]$. Mathematically, we found the normalized word vectors by

$$\hat{wv}_i = \frac{wv_i}{\lVert wv_i \rVert}$$

where

$$\lVert wv_i \rVert = \sqrt{\sum_{j=1}^{M} wv_{ij}^2}$$

Then, we found the weighted average of the normalized word vectors of a review from the word2vec vectors and the GloVe vectors separately and concatenated the two results. For that purpose, we used the following equation, which finds the weighted average of the N word vectors:

$$rv = \frac{1}{N} \sum_{i=1}^{N} b_i \, \hat{wv}_i$$

where $b_i$ represents the weight of word vector $i$. Note that for $b_i = 1$ for every $i$, we obtain the averaged review vectors.
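The computation of a review vector can be sketched as follows, under stated assumptions: the word vectors are normalized to unit Euclidean length, the weighted sum is divided by the number of words N, and the word2vec and GloVe parts are concatenated. The function names, the toy two-dimensional vectors, and the example weights are hypothetical and do not reproduce the weights used in our experiments.

```python
import math

def normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def weighted_average(vectors, weights=None):
    """Compute (1/N) * sum_i b_i * normalized(wv_i); b_i = 1 gives the plain average."""
    n = len(vectors)
    if weights is None:
        weights = [1.0] * n
    total = [0.0] * len(vectors[0])
    for vec, b in zip(vectors, weights):
        for k, x in enumerate(normalize(vec)):
            total[k] += b * x
    return [t / n for t in total]

def review_vector(w2v_vectors, glove_vectors, weights=None):
    """Concatenate the word2vec and GloVe averages into one vector of size 2M."""
    return weighted_average(w2v_vectors, weights) + weighted_average(glove_vectors, weights)

# Toy review with N = 2 words and M = 2 dimensions, for illustration only.
w2v = [[3.0, 4.0], [0.0, 2.0]]
glove = [[1.0, 0.0], [0.0, 5.0]]
arv = review_vector(w2v, glove)                         # unweighted (ARV)
warv = review_vector(w2v, glove, weights=[2.0, 0.0])    # hypothetical weights
```

With M = 300 in place of the toy dimension, the same procedure yields the 600 features per review used in the experiments.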
We tried different weights, and we obtained the best results on the datasets with a weight $b_i$ defined from the probability of positive polarity of word $i$, denoted $P(w_i \geq 0)$, and the probability of its negative polarity, denoted $P(w_i \leq 0)$. In this way, we obtain the averaged vectors shown in Table 1 for every review by using the word2vec and GloVe vectors of the reviews shown in Table 2. Hence, for every review, we have 2M features to learn. Using feed-forward neural networks, we learned the sentiment of the reviews from these weighted averaged review vectors. In the next section, we share our results for different datasets. We used the Keras framework on Tensorflow to run our feed-forward neural network model, which is shown in Fig 1. We ran our model with different numbers of hidden layers and nodes and with different activation functions, optimizers, and loss functions; we obtained the best results with the following parameters. Our model includes 3 hidden layers. The first two layers have 2M nodes. We used the rectified linear unit (ReLU), which outputs the maximum of zero and its input, as the activation function. At the last layer, we used the sigmoid function. Our optimizer is Adadelta, and the loss function is binary cross-entropy.
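The forward pass of such a network can be illustrated in plain Python. This is a minimal sketch and not the Keras model itself: it assumes three hidden layers of equal width, small random initial weights, and a toy input size that stands in for 2M; all function names are hypothetical, and training with Adadelta is not shown.

```python
import math
import random

def relu(values):
    """Rectified linear unit: max(0, x) applied element-wise."""
    return [max(0.0, v) for v in values]

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def dense(inputs, weights, biases):
    """One fully connected layer; `weights` holds one row per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an assumed initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(review_vec, hidden_layers, output_layer):
    """ReLU hidden layers followed by a single sigmoid output node."""
    activation = review_vec
    for weights, biases in hidden_layers:
        activation = relu(dense(activation, weights, biases))
    out_w, out_b = output_layer
    return sigmoid(dense(activation, out_w, out_b)[0])

rng = random.Random(0)
two_m = 6  # toy stand-in for 2M = 600
hidden = [init_layer(two_m, two_m, rng) for _ in range(3)]
output = init_layer(two_m, 1, rng)
prob_positive = forward([0.1] * two_m, hidden, output)
```

The output is a value between 0 and 1 that can be read as the estimated probability that the review is positive.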

Convolutional Neural Network (CNN)
We used Convolutional Neural Networks (Kim [29]) to learn the sentiment of the reviews by using word2vec vectors and GloVe vectors. We used the word vectors as features directly. Our learning system is based on the Deep Convolutional Neural Network (CNN) architecture described in Kim [29]. The architecture we use is shown in Fig 2. A review matrix is built for each input review, where each row is a vector representation of a word in the review. The review length is fixed to the maximum review length of the dataset so that all review matrices have the same dimensions. (Shorter reviews are padded with row vectors of 0s accordingly.) Each row vector of the review matrix is made up of columns corresponding to the word2vec and GloVe vectors concatenated together.
Let $x_i \in \mathbb{R}^k$ be the k-dimensional word vector corresponding to the i-th word in the review. A review of length n (padded where necessary) is represented as

$$x_{1:n} = x_1 \oplus x_2 \oplus \dots \oplus x_n$$

where $\oplus$ is the concatenation operator. In general, let $x_{i:i+j}$ refer to the concatenation of words $x_i, x_{i+1}, \dots, x_{i+j}$. A convolution operation involves a filter $w \in \mathbb{R}^{hk}$, which is applied to a window of h words to produce a new feature. For example, a feature $c_i$ is generated from a window of words $x_{i:i+h-1}$ by

$$c_i = f(w \cdot x_{i:i+h-1} + b)$$

Here $b \in \mathbb{R}$ is a bias term and f is a non-linear function such as the hyperbolic tangent. This filter is applied to each possible window of words in the review, $\{x_{1:h}, x_{2:h+1}, \dots, x_{n-h+1:n}\}$, to produce a feature map

$$c = [c_1, c_2, \dots, c_{n-h+1}]$$

with $c \in \mathbb{R}^{n-h+1}$. We then apply a max-over-time pooling operation (Collobert et al. [45]) over the feature map and take the maximum value $\hat{c} = \max\{c\}$ as the feature corresponding to this particular filter. The idea is to capture the most important feature, the one with the highest value, for each feature map. This pooling scheme naturally deals with variable review lengths. One feature is extracted from one filter. The model uses multiple filters (with varying window sizes) to obtain multiple features. These features form the penultimate layer and are passed to a fully connected softmax layer whose output is the probability distribution over labels.
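The convolution and max-over-time pooling steps for a single filter can be illustrated with a small sketch. It assumes the hyperbolic tangent as the non-linear function f and uses a toy review with n = 4 words of dimension k = 2 and a filter of window size h = 2; the function names and numeric values are hypothetical.

```python
import math

def conv_feature_map(review, filt, bias, h):
    """Apply one filter of h*k weights to every window of h consecutive word vectors."""
    n = len(review)
    feature_map = []
    for i in range(n - h + 1):
        # Concatenate the h word vectors of the current window.
        window = [x for vec in review[i:i + h] for x in vec]
        feature_map.append(math.tanh(sum(w * x for w, x in zip(filt, window)) + bias))
    return feature_map

def max_over_time(feature_map):
    """Keep only the largest value of the feature map."""
    return max(feature_map)

review = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # n = 4, k = 2
filt = [0.5, -0.5, 0.5, 0.5]                               # h = 2, so h*k = 4 weights
c = conv_feature_map(review, filt, bias=0.0, h=2)
c_hat = max_over_time(c)
```

The feature map has n - h + 1 = 3 entries regardless of which filter is used, and pooling reduces it to one feature per filter.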
For every review, we produced a review matrix whose rows are word vectors obtained by concatenating the word2vec and GloVe vectors, as shown in Table 2. If a word does not exist in the word2vec or GloVe vocabulary, we fill the cells of the corresponding word2vec or GloVe columns with zeros. Also, the number of rows of the review matrices is fixed to the number of words in the longest review (N). For reviews with fewer than N words, the empty rows are filled with zeros. The length of our word2vec and GloVe vectors is 300; hence, each row of the input has length 600 (= 300 + 300).
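The construction of a review matrix can be sketched as follows, assuming the word vectors are stored in dictionaries keyed by word. The function name `build_review_matrix`, the toy vocabularies, and the toy dimension of 2 (instead of 300) are hypothetical.

```python
def build_review_matrix(tokens, w2v, glove, max_len, dim):
    """Build a max_len x (2 * dim) matrix of concatenated word2vec and GloVe rows."""
    zero = [0.0] * dim
    rows = []
    for tok in tokens[:max_len]:
        # A word missing from one vocabulary gets zeros in that half of the row.
        rows.append(w2v.get(tok, zero) + glove.get(tok, zero))
    while len(rows) < max_len:
        rows.append([0.0] * (2 * dim))  # zero padding for short reviews
    return rows

# Toy vocabularies, for illustration only.
w2v = {"good": [0.1, 0.2], "food": [0.3, 0.4]}
glove = {"good": [0.5, 0.6]}
matrix = build_review_matrix(["good", "food"], w2v, glove, max_len=3, dim=2)
```

Here "food" is missing from the toy GloVe vocabulary, so the second half of its row is zero, and the third row is padding.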
We used the Keras/Tensorflow [35] framework to run the CNN. In this framework, by default, the filters W are initialized randomly using the Glorot uniform method, which draws values from a uniform distribution with the positive and negative bounds given in Eq (9):

$$W \sim U\left[-\sqrt{\frac{6}{n_{in} + n_{out}}},\ \sqrt{\frac{6}{n_{in} + n_{out}}}\right] \tag{9}$$

where $n_{in}$ is the number of units that feed into this unit, and $n_{out}$ is the number of units this result is fed to. We used 600 filters, which is equal to the dimension of the input.
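For illustration, the Glorot uniform bound and the sampling step can be written out in plain Python. This sketch uses the standard bound sqrt(6 / (n_in + n_out)) and does not call Keras; the function names and the toy layer sizes are hypothetical.

```python
import math
import random

def glorot_limit(n_in, n_out):
    """Bound of the uniform distribution used by Glorot uniform initialization."""
    return math.sqrt(6.0 / (n_in + n_out))

def glorot_uniform(n_in, n_out, rng):
    """Sample an n_in x n_out weight matrix uniformly from [-limit, limit]."""
    limit = glorot_limit(n_in, n_out)
    return [[rng.uniform(-limit, limit) for _ in range(n_out)] for _ in range(n_in)]

rng = random.Random(42)
limit = glorot_limit(4, 2)           # toy sizes: sqrt(6 / 6) = 1
weights = glorot_uniform(4, 2, rng)
```

Larger layers give a smaller bound, so the initial weights shrink as the number of incoming and outgoing units grows.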
These filters are applied at each layer of the network. That is, a discrete convolution is performed for each filter on each input, and the results of these convolutions are fed to the next convolutional layer or to a fully connected layer.
During training, the values in the filters are optimized with backpropagation by using a loss function. We used binary cross-entropy for positive/negative sentiments (IMDB dataset [20]) and categorical cross-entropy for positive/negative/neutral sentiments (Semeval 2016 dataset [1]).
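The two loss functions can be illustrated for a single example. This is a minimal sketch of the standard definitions; the clipping constant `eps` and the example probabilities are assumptions made for numerical safety and illustration.

```python
import math

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Loss for one example with label y_true in {0, 1} and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Loss for one example with a one-hot label and a predicted class distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))

# Positive review predicted with probability 0.9.
bce = binary_cross_entropy(1, 0.9)
# Three classes (positive, neutral, negative); the true class is the second one.
cce = categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])
```

In both cases the loss is the negative log of the probability assigned to the true class, so confident correct predictions give a loss close to zero.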

Ensemble
After learning by using training data, we combine the results of different learning algorithms by using validation datasets. We used two different ensemble approaches. As the first (Ensemble-1), we used the log probability scores approach of Mesnil et al. [22].
As the second (Ensemble-2), shown in Fig 3, we used a neural network over the validation sets. In this case, our learning algorithms produce results on the validation datasets. These results contain probabilities for each class, and we used them as features of the ensemble learning algorithm. We tried different learning algorithms to ensemble the methods, and we obtained the best accuracies with logistic regression and neural networks. Our neural network model contains three hidden layers whose number of nodes is equal to the input length. We used the Rectified Linear Unit as the activation function of the layers, the sigmoid function to estimate the class or label, AdaDelta [36] as the optimizer, and cross-entropy as the loss function. Then we tested our ensemble methods on the test datasets. Our ensemble method produced better results than every single learning algorithm.
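The feature construction for Ensemble-2 can be sketched as follows: the class probabilities that each base method outputs for a review are concatenated into one feature row for the ensemble learner. The function name `stack_features`, the three-class setting, and the probability values are hypothetical and serve only as an illustration.

```python
def stack_features(*model_outputs):
    """Concatenate per-review class probabilities from several models into feature rows."""
    stacked = []
    for per_review in zip(*model_outputs):
        row = []
        for probs in per_review:
            row.extend(probs)
        stacked.append(row)
    return stacked

# Toy outputs of two base methods for two reviews and three classes.
warv_probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
cnn_probs = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
features = stack_features(warv_probs, cnn_probs)
```

Each row then serves as the input of the ensemble classifier, whose input length equals the number of base methods times the number of classes.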

Experiments
In this work, we test our models and other existing models on different datasets. We used Stanford IMDB Reviews [20], the SEMEVAL 2016 Task 5 dataset [1], and the Yelp dataset [21]. In this section, we explain the experiments and results.

IMDB Reviews
The Stanford IMDB dataset contains 100,000 movie reviews in English. 25,000 of these reviews are labeled as positive, another 25,000 are labeled as negative, and the remaining 50,000 are unlabeled. In fact, IMDB reviews are rated with scores between 1 and 10. However, this dataset treats reviews rated 1 to 4 as negative and reviews rated 7 to 10 as positive. Therefore, there are only two labels: positive and negative.
For IMDB reviews, we produced word2vec and GloVe vectors from the IMDB reviews. The length of the vectors is 300, and the window size is 5. We ran our methods using word2vec vectors (_wv) and GloVe vectors (_gl) separately and combined (_wvgl). Therefore, the number of features of our Averaged Review Vectors (ARV) and Weighted Averaged Review Vectors (WARV) is 300 for separate runs and 600 for combined runs. In the same manner, for our CNN model, the number of features is 300 (separate run) and 600 (combined run). As seen in Table 3, with 94.286% accuracy, our WARV method is more accurate than N-Gram, RNNLM, and Mesnil's paragraph vector. Also, our WARV method is computationally more efficient than Mesnil's paragraph vectors (PV) (Mesnil et al. [22]).
When our weighted averaged review vectors (WARV) are ensembled with NBSVM, they reach an accuracy of 95.02%, as shown in Table 4. When we ensemble three methods, the best accuracy (95.032%) is obtained by the ensemble of WARV, NBSVM, and CNN, as shown in Table 4. Both results are better than Mesnil's results.
When we ensemble four or five methods, the best accuracy is again 95.032%, as shown in Table 4. When we ensemble all six methods, the accuracy becomes 95.028%. These results are close to those obtained with Bidirectional Encoder Representations from Transformers (BERT) embeddings [37] in some studies [38][39][40][41], which reached 96% accuracy.
Note that we did not share the ensemble results of the combinations whose accuracies are below 94%, except for the ensembles in Mesnil's study [22].

Semeval 2016 Task 5 dataset
In English, two domain-specific datasets for consumer electronics (laptops) and restaurants, consisting of over 1000 review texts (approx. 6K sentences) with fine-grained human annotations (opinion target expressions, aspect categories, and polarities), were provided for training/development. In particular, the SE-ABSA15 train and test datasets for restaurants and laptops (with some corrections) were made available as training data. They consist of 800 review texts (4500 sentences) annotated with approx. 15000 unique label assignments (Entity, Aspect, polarity). The laptop dataset consists of 450 review texts (2500 sentences) annotated with 2923 (Entity#Aspect, polarity) tuples. The restaurant dataset consists of 350 review texts (2000 sentences) annotated with 2499 (Entity#Aspect, polarity) tuples. All datasets were enriched with text-level annotations. Datasets also exist for languages other than English: Arabic, Chinese, Dutch, French, Russian, Spanish, and Turkish [1].
We applied our two methods to the English restaurant dataset. We produced word2vec and GloVe vectors from the Yelp dataset. For that purpose, we first selected only the restaurant-related reviews of the Yelp dataset and then created the word vectors. After that, we applied our two methods (ARV and CNN) to the Semeval 2016 Task 5 dataset to learn polarity. Our ARV method in particular produced high accuracy (87.7%). It is very close to the best result of Semeval 2016 Task 5, as shown in Table 5, and to the study [43] (87.8%), which uses BERT embeddings [37].
We ensembled our two results by using a neural network. For this purpose, we used the probability outputs of ARV and CNN. Since we have three classes (positive, neutral, and negative), every method has 3 probability values. Therefore, we used the 6 probability values of the two methods as our learning features. Since the Semeval 2016 dataset is very small, we had no opportunity to set aside part of the dataset for validation. Instead of probability values on validation data, we used probabilities on the training dataset. That is, our ensemble feed-forward neural network used probabilities that were produced over the training dataset of Semeval 2016 Task 5 by our ARV and CNN methods. Then, we used the test dataset of Semeval 2016 Task 5 to evaluate our ensemble model. As shown in Table 5, our ensemble produced results that are very close to the state-of-the-art value (88.242%) for the Semeval 2016 Task 5 dataset. We used the Keras framework to run our ensemble neural network model. Our ensemble learning neural network includes 3 hidden layers with 600 nodes, which is equal to the number of features. As an activation function, we used scaled exponential linear units (SELUs), which induce self-normalizing properties [42]. At the last layer, we used the softmax function. Our optimizer is Adadelta, and the loss function is binary cross-entropy. Note that, for this dataset, we did not share the results of the RNNLM, NBSVM, and PV methods, which were compared with our methods in the previous section, since they suffered from the small dataset size. They could not learn accurately, and their results were very low.
Pretrained transformer-based models have recently been used in sentiment analysis [47]. To compare our results with transformer-based models, we also used averaged vectors from the pretrained transformer-based BERT model [46]. For that purpose, we took the vectors of the tokens in the last layer of the BERT model and calculated their average for each review, which is the same method as in our other models. Then, we used these vectors instead of our word2vec and GloVe vectors to learn the sentiment of the reviews. For the Semeval 2016 dataset, we obtained 86.39% accuracy, which is lower than our ensemble results. For the IMDB movie review dataset, we obtained 90.43% accuracy, which is also lower than our ensemble results.

Table 2: IMDB Reviews with ids

Table 3: Accuracies of Methods Before Ensemble for IMDB dataset

Table 4: Accuracies of Methods After Ensemble for IMDB dataset