Emotion computing using Word Mover’s Distance features based on Ren_CECps

In this paper, we propose an emotion separated method (SeTF·IDF) to assign the emotion labels of sentences different values, which gives a better visual effect than the values produced by TF·IDF in the visualization of the multi-label Chinese emotional corpus Ren_CECps. Inspired by the enormous improvement of the visualization map brought by the changed distances among the sentences, we are the first group to utilize the Word Mover's Distance (WMD) algorithm as a feature representation in Chinese text emotion classification. Our experiments on Ren_CECps show that, in both the 80% training / 20% testing and the 50% training / 50% testing settings, the WMD features get the best F1-scores and a large improvement over feature vectors of the same dimension obtained by dimension-reduced TF·IDF. Comparative experiments on an English corpus also show the efficiency of WMD features in the cross-language setting.


Introduction
As the pace of modern life becomes faster and faster, people work and live under high stress. According to a report published by the WHO, one in four people in the world will be affected by mental or neurological disorders at some point in their lives [1]. Thus, it is momentous to make emotion computable for psychotherapy, health prediction and other fields.
Emotions play an important role in successful and effective human-human communication [2]. There is also significant evidence that rational learning in humans depends on emotions [3]. When the Google AI program AlphaGo beat Ke Jie in a three-game match at the 2017 Future of Go Summit, artificial intelligence drew the attention of the globe once again and will continue to stand on top of the tide. This recalls the famous words of Marvin Minsky about the future of emotion computing: the question is not whether intelligent machines can have any emotions, but whether machines can be intelligent without any emotions [4].
In this paper, we propose an emotion separated method (SeTF·IDF) to assign the emotion labels of sentences different values, which gives a better visual effect than the values produced by TF·IDF in the visualization of the multi-label Chinese emotional corpus Ren_CECps. The separated method shows an excellent ability to distinguish sentences with multiple emotion labels, moving the data points apart and avoiding overlap. The moved multi-emotion points inspire us to be the first group to utilize the Word Mover's Distance (WMD) algorithm as a feature representation in Chinese text emotion classification. Our sentence-level experiments on Ren_CECps show that, in both the 80%/20% and 50%/50% train/test settings, the WMD features get the best F1-scores of 0.318 and 0.31, where the TF·IDF baselines are 0.196 and 0.204 and the enhanced SeTF·IDF baselines are 0.293 and 0.283 respectively. Compared with feature vectors of the same dimension obtained by dimension-reduced TF·IDF, whose F1-scores are 0.115 and 0.116, the WMD features achieve about a threefold improvement. To speed up the calculation of WMD, we modify the WMD algorithm, which reduces the time consumption about 16,000-fold. For a better comparison, experiments on the 20 newsgroups data set are also conducted; these English-corpus experiments show an almost tenfold F1-score improvement of the WMD features (0.646) over the dimension-reduced TF·IDF method (0.076). All of the above shows the efficiency of WMD features in classifying cross-language data sets and their strong ability in multi-emotion classification.
The remainder of this paper is organized as follows: Section 2 presents related works. Section 3 describes SeTF·IDF and the visualization of the Chinese emotional corpus Ren_CECps. Section 4 gives a comprehensive explanation of the Word Mover's Distance and the feature representation method. Section 5 illustrates the experimental configurations on the two language data sets and presents the results in tables and graphs. Section 6 gives some discussion, and Section 7 presents the conclusions and future work.

Related works
In 1997, "Affective Computing" was proposed by Picard [3], which is of great importance and launched a new era of human emotion recognition and opinion mining. With the blossoming of the World Wide Web, it became much easier to obtain text data to train a classifier. To show the abundant features of data, some interactive visualization methods were presented, such as the widely used parallel coordinates [5] and scatter-plot matrix [6] for attribute-driven data visualization. For the uncertainty of data labels, the measurement can be obtained in terms of probabilities [7], which is useful in unTangle Map [8] for multi-label data visualization. As machine learning algorithms were introduced into NLP, many annotated corpora without specific attribute values could be visualized by dimension scaling [9,10], SVD [11], or t-SNE [12]. With better visualization, classification models can also be enhanced by integrating visual features with text features [13][14][15]. In large graph visualization, avoiding node overlap is another hot research topic. The principal method for this problem is elongating the distance between points, for example by force transfer [16] or by changing the distribution of categories [17]. This is exactly what we do in this paper.
For similarity computing with a metric between two distributions, the Earth Mover's Distance (EMD) [18] is one of the most well-studied algorithms. By calculating the minimum cost of transforming one distribution of color and texture into the other, EMD achieves good results in content-based image retrieval [19] and can even detect phishing web pages by visual similarity [20].
The most commonly used algorithms to represent documents for similarity computing are statistics-based algorithms like TF·IDF [21] and LDA [22], or vectors trained with deep neural networks [23,24]. In [25], Wan successfully applied EMD to document similarity measurement by decomposing documents into sets of subtopics and using EMD to evaluate the many-to-many matching between the subtopics. Limited by the NLP and machine learning algorithms of the time, the pioneering studies in emotion computing were based on lexicons [26][27][28]. After years of development, several annotated multi-emotion corpora were published [29][30][31]. Based on such emotion annotated corpora, a derived lexicon with multi-emotion tags can get higher F1-scores than traditional lexicon-based features [32]. Relying on those emotional corpora, sentiment analysis gains a sub-field of emotion computing, and many machine learning algorithms have been explored: SVM, Naive Bayes and Maximum Entropy are among the most commonly used [33][34][35]. Some research using HMMs has also achieved good results [36,37].
Emotion computing in Chinese has attracted many researchers due to the development of microblogging and tweets. Some studies in sentiment analysis of Chinese documents [38] turn to emoticon-based sentiment analysis [39]. Studies of the hidden sentiment association in contents [40,41] remain limited. Work based on [49] and CNN [50] for sentiment analysis has been done; works based on sentiment embeddings also get excellent results [51] and will attract more and more attention.
For estimating the emotion of words not registered in a lexicon, EMD can be applied to word vectors and achieves higher accuracy than using only word importance values [52]. Because of the high computational cost of EMD, EMD-based methods were usually limited to keywords or topics rather than all words. After fast specialized solvers for EMD [53] were published, experiments transporting all words could be carried out [54]. Those achievements facilitate the experiments in this paper.

Visualization of Ren_CECps
For a multi-class corpus, we can make a 2D or 3D scatter plot to get a good view of the data distribution. But for multi-class data with multiple labels, visualizing with the same set of differently colored points makes the 2D or 3D graph unreadable. Thus, how to map the multi-label information into a 2D or 3D graph is the key problem to be solved. In this paper, to visualize the multi-label emotional corpus Ren_CECps, we propose an emotion separated TF·IDF method (SeTF·IDF) that represents each emotional category independently with different values. To make a better 2D visualization, we use one of the state-of-the-art dimension reduction algorithms, t-SNE [12], to compute the low dimensional distribution of Ren_CECps.
In this paper, the sentences without emotional labels are regarded as the 'neutral' category. Table 1 shows the number of sentences with different label counts; the 'neutral' sentences are counted as having one label.

t-SNE
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a technique for dimension reduction that is particularly well suited to the visualization of high-dimensional datasets, proposed by L.J.P. van der Maaten and G.E. Hinton [12]. Compared with PCA, t-SNE computes the distribution of the nodes in the dataset and rebuilds that distribution in two- or three-dimensional space. To get the best approximation, t-SNE uses the KL divergence to measure the distance between the two distributions. In this paper, the t-SNE tool is the TSNE class of sklearn [55], and the program follows a tutorial written by Alexander Fabisch at http://nbviewer.jupyter.org/urls/gist.githubusercontent.com/AlexanderFabisch/1a0c648de22eff4a2a3e/raw/59d5bc5ed8f8bfd9ff1f7faa749d1b095aa97d5a/t-SNE.ipynb.
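As a minimal sketch of this step (the matrix sizes and variable names are our own illustration, not the corpus pipeline), the sklearn TSNE class reduces a matrix of high-dimensional feature vectors to 2D coordinates ready for a scatter plot:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for a high-dimensional feature matrix: 60 sentences x 50 features.
rng = np.random.RandomState(0)
X = rng.randn(60, 50)

# Perplexity must stay below the number of samples; internally t-SNE
# minimizes the KL divergence between the high- and low-dimensional
# neighbor distributions.
tsne = TSNE(n_components=2, perplexity=10, random_state=0, init="random")
M2 = tsne.fit_transform(X)
print(M2.shape)  # (60, 2): one 2D point per sentence
```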

Emotion separated representation
As mentioned above, considering the 'neutral' label as one emotion category, the total number of emotional categories to be calculated is nine. The keyword word_i of every sentence represented by TF·IDF is calculated through Formula (1):

tfidf_i = (tf_i / Σ_i tf_i) · log(N / df_i)  (1)

where tfidf_i is the TF·IDF result of word_i, tf_i is the term frequency of the calculated word, Σ_i tf_i is the frequency of all words, N is the total number of sentences, and df_i is the number of sentences that contain word_i. From Formula (1) we can conclude that, no matter which words a sentence contains, the feature vector calculated through Formula (1) is identical for every annotated emotion label of that sentence, with no distinction between them.
The good news is that in Ren_CECps the emotion keywords of each sentence are annotated. Thus, for an annotated emotion keyword, we calculate its tfidf if and only if the keyword carries the given emotion category for a specific emotion label. In this way, we can generate a distinctive feature vector for each emotion label of a sentence. We name this method emotion separated TF·IDF (SeTF·IDF), described by Formula (2):

tfidf_{e_j} = (tf_i^{e_j} / Σ_i tf_i^{e_j}) · log(N / df_{e_j})  (2)

where e_j ∈ {joy, hate, love, sorrow, anxiety, surprise, anger, expect, neutral}, tfidf_{e_j} is the TF·IDF result of emotion keyword word_i in emotion category e_j, tf_i^{e_j} and Σ_i tf_i^{e_j} are the term frequency of emotion keyword word_i in emotion category e_j and the total keyword frequency of that category, and df_{e_j} is the number of sentences that contain the emotion keyword word_i.
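A minimal sketch of our reading of Formula (2) follows; the function and the toy data are ours, while the real implementation iterates over the Ren_CECps annotations:

```python
import math
from collections import Counter

def setfidf(sentences, keyword_emotions):
    """Per-emotion TF.IDF for annotated emotion keywords.

    sentences: list of token lists.
    keyword_emotions: annotated map word -> set of emotion labels.
    Returns {(word, emotion): score}: the keyword's frequency inside one
    emotion category, normalized by that category's total keyword
    frequency, times the usual log(N / df) factor.
    """
    N = len(sentences)
    tf, df, total = Counter(), Counter(), Counter()
    for sent in sentences:
        seen = set()
        for w in sent:
            for e in keyword_emotions.get(w, ()):
                tf[(w, e)] += 1
                total[e] += 1
                seen.add((w, e))
        for key in seen:  # document frequency counts each sentence once
            df[key] += 1
    return {(w, e): tf[(w, e)] / total[e] * math.log(N / df[(w, e)])
            for (w, e) in tf}

scores = setfidf([["开心", "天"], ["难过", "天"]],
                 {"开心": {"joy"}, "难过": {"sorrow"}})
print(scores[("开心", "joy")])  # (1/1) * log(2/1) ≈ 0.693
```

Because the keyword counts are kept per emotion category, the same sentence now yields a different vector for each of its emotion labels, which is the whole point of the separation.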

The result of visualization
Following the procedure below, the words without annotated emotional labels are weighted by Formula (1), and the words with annotated emotional labels are weighted by Formula (2). Using the 2D data M_2 and the corresponding label list l, we can draw the 2D graph of Ren_CECps, with one color assigned to each label. We can see that the overlapping points of TF·IDF have been separated in SeTF·IDF. Some conclusions can be drawn:
• The elongated distances between the sentence categories and the changed distributions in SeTF·IDF indeed give a better visual result than TF·IDF.
• "Love" points have a distribution similar to "Anxiety" points.
• Most of the "Sorrow" points form groups with no other emotional points embedded in them.
• Sentences may carry completely opposite emotion categories: some clusters contain pairs like "Sorrow and Joy" or "Hate and Love".
• Fuzzy emotion categories such as "Expect" appear together with other emotion categories most frequently.
Based on those features, we can get a clearer view of Ren_CECps. For a better view, we also provide a 3D visualization of the emotional corpus at http://a1-www.is.tokushima-u.ac.jp/data_all/.

Word mover's distance feature representation
The visualization graphs in Fig 1 show the remarkable progress derived from the changed distances among emotional points. It looks as if many words walk away from each other, like a heavy fog over the sentences being blown away into circulating air. This inspires the idea that, if we let the words "walk" inside the algorithm, we may get better classification results. Following this idea, we focused on transportation problems in NLP and found the word mover's distance algorithm.
Word Mover's Distance
The word mover's distance (WMD) [54] is a distance measure derived from the earth mover's distance (EMD) [18]. The EMD problem can be solved as a transportation problem. In WMD, the distance between two text documents A and B is the minimum cumulative distance that words from document A need to travel to match exactly the point cloud of document B [54]. The transportation problem between documents A and B can be described by Formula (3), where {word_i} and {word'_j} represent the words in documents A and B respectively, {d_i} and {d'_j} are the term frequencies of the corresponding words, and ω_{j,i} is the distance between word'_j and word_i; note that this distance is undirected. To measure the distance between two words, every word is represented as a vector from the trained word2vec embedding matrix V, and the distance is the Euclidean distance of Formula (4):

ω_{j,i} = ||V(word'_j) − V(word_i)||_2  (4)

Let T_{ij}, where i ∈ [1, n] and j ∈ [1, m], be the amount of word_i in document A that is transported into word'_j of document B. Then Σ_{j=1}^{m} Σ_{i=1}^{n} T_{ij} denotes the total amount of words in document A transported into the words of document B; conversely, Σ_{i=1}^{n} Σ_{j=1}^{m} T_{ji} denotes the transportation in the reverse direction.
Thus, the WMD document distance can be described as the optimization problem of Formula (5), whose minimum is the distance between the two documents:

min_{T ≥ 0} Σ_{i=1}^{n} Σ_{j=1}^{m} T_{ij} ω_{j,i}  subject to Σ_{j=1}^{m} T_{ij} = d_i for all i, Σ_{i=1}^{n} T_{ij} = d'_j for all j  (5)
Here is an example of calculating the similarities of two target sentences S1 and S3 against a standard sentence S2 with WMD and TF·IDF; the sentences are all tokenized and separated by blank spaces:
S1: "风和日丽" (English: Sunny days.)
S2: "天气 很 好" (English: It's a good day.)
S3: "今天 下 雨" (English: It's raining today.)
We use the cosine function on the TF·IDF vectors as the similarity measurement, and the WMD values are calculated following "word mover's distance in python" at http://vene.ro/blog/word-movers-distance-in-python.html published by vene & Matt Kusner [54]. Writing sim() for the similarity between two sentences, the results are:
WMD: sim(S1, S2) = 0.75, sim(S3, S2) = 0.82.
TF·IDF: sim(S1, S2) = 1.0, sim(S3, S2) = 1.0.
With TF·IDF, S1 and S3 get the same similarity result. With WMD, S1 gets a lower value than S3, which means S3 is farther away from S2 than S1 is. In other words, S2 is more similar to S1 than to S3, and this matches the ground truth.
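The TF·IDF half of this example can be reproduced with sklearn (a sketch under our own tokenization; the exact values depend on the similarity convention, but the tie between S1 and S3 is the point):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# S1, S2, S3 from the example, already whitespace-tokenized.
docs = ["风和日丽", "天气 很 好", "今天 下 雨"]
tfidf = TfidfVectorizer(analyzer=str.split).fit_transform(docs)
sims = cosine_similarity(tfidf)

# None of the three sentences share a token, so TF.IDF cannot tell
# S1 and S3 apart with respect to S2 -- the two similarities are equal.
print(sims[0, 1], sims[2, 1])
```

WMD breaks the tie because it compares word embeddings, not surface tokens, so semantically close words still transport at low cost.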
As the example shows, WMD has an ability to measure the semantic difference between sentences. Thus we can use several selected sentences as a core dataset, the samples in the entire corpus can be represented by its similarities with all of the sentences in the core dataset. And we will verify this feature representation method in the next section.

Experiments and results
We evaluate the WMD features with an SVM [56] model on Ren_CECps, and regard TF·IDF [57] and SeTF·IDF as the baseline and the enhanced baseline respectively. To be comprehensive, two low dimensional feature representation methods are also evaluated. For a better comparison, an English corpus based experiment is further added.

Dataset and setup
Ren_CECps. The corpus is divided into nine single-label data sets; sentences with multiple labels are replicated in every matching category. We randomly select 200 sentences from each of the nine emotion categories as the seed corpus, so the dimension of the WMD features is naturally 1800. Based on the divided corpus, two ways of selecting subsets are executed: one uses 50% of the data for training and the remaining 50% for testing; the other uses 80% for training and the remaining 20% for testing. The selection is random.
20 newsgroups data set. We utilized the split "train" and "test" data sets [58] provided by the sklearn tools at http://scikit-learn.org/stable/datasets/twenty_newsgroups.html. All "headers", "footers" and "quotes" are removed. The number of seed documents is 100 per news category, also selected randomly from the "train" subset.
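The replication-and-split step can be sketched as follows (the toy data and function name are ours):

```python
import random

def replicate_and_split(samples, train_frac=0.8, seed=0):
    """samples: list of (sentence, labels) pairs, labels possibly multiple.

    A multi-label sentence is replicated once per label so that each copy
    lives in a single-label data set; the pooled copies are then split at
    random into training and testing subsets.
    """
    singles = [(sent, lab) for sent, labs in samples for lab in labs]
    rng = random.Random(seed)
    rng.shuffle(singles)
    cut = int(len(singles) * train_frac)
    return singles[:cut], singles[cut:]

data = [("s1", ["joy"]), ("s2", ["joy", "sorrow"]), ("s3", ["neutral"])]
train, test = replicate_and_split(data, train_frac=0.5)
print(len(train) + len(test))  # 4: "s2" was replicated into both its categories
```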
Word embeddings. The word embeddings used in this paper differ by language. For Chinese, we merged two additional Chinese data sets (sougouCA at https://www.sogou.com/labs/resource/ca.php, a Chinese news corpus published by Sogou Lab [59], and a People's Daily data set of 11,355 days of news collected from 1980.01.01 to 2016.02.14 through the Internet) into Ren_CECps to train 200-dimension word embeddings using gensim [60], a free Python library containing the approach of [23]. For English, the pre-trained embeddings at https://code.google.com/archive/p/word2vec/ are used, which contain 300-dimension vectors for 3 million words and phrases. In both the Chinese and English experiments, words not present in the embeddings are treated as zero vectors.
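The zero-vector fallback for out-of-embedding words is straightforward; a sketch (the function name is ours; `dim` would be 200 for the Chinese embeddings or 300 for the Google News ones):

```python
import numpy as np

def embed(word, embeddings, dim):
    """Look up a word vector; unknown words become zero vectors."""
    vec = embeddings.get(word)
    return np.asarray(vec, dtype=float) if vec is not None else np.zeros(dim)

emb = {"天气": [0.1, 0.2, 0.3]}
known = embed("天气", emb, 3)    # the stored vector
unknown = embed("未登录词", emb, 3)  # [0. 0. 0.]
print(known, unknown)
```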

Fast computing
In this paper, the main calculation process follows the program used in the example computation of Section 4. EMD has a best average time complexity of O(N³ log N) [53], where N denotes the vocabulary size. This means that the smaller the vocabulary, the faster the computation. Continuing to use V as the word embedding matrix, the pseudo-code for calculating the WMD of documents used in [54] is shown in Algorithm 2, with the pseudo-code of our fast WMD presented side by side in Algorithm 3.
As shown below, the matrix TD needed for WMD in Algorithm 2 is built from the whole document collection; it is a full-vocabulary matrix with a dimension of tens of thousands in Chinese or hundreds of thousands in English. In Algorithm 3, we build the matrix TD′ only from the two documents being compared. This makes the dimension of TD′ far smaller than that of TD, restricting it to about one hundred. Although the T-D matrix TD′ and the distance matrix M′ both have to be recomputed in every loop, the fast WMD is still far cheaper than running EMD over the full vocabulary in step 8 of Algorithm 2.
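A sketch of the fast WMD idea under our own naming: the vocabulary (and hence the distance matrix M′) is built only from the two documents being compared, and the transport problem of Formula (5) is solved as a small linear program (here with scipy's LP solver; the paper's implementation uses an EMD solver instead):

```python
import numpy as np
from collections import Counter
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def fast_wmd(doc_a, doc_b, embeddings):
    # TD'-style vocabulary: only words from the two documents.
    vocab = sorted(set(doc_a) | set(doc_b))
    n = len(vocab)

    def nbow(doc):  # normalized bag-of-words (the d_i / d'_j weights)
        c = Counter(doc)
        v = np.array([c[w] for w in vocab], dtype=float)
        return v / v.sum()

    d_a, d_b = nbow(doc_a), nbow(doc_b)
    V = np.array([embeddings[w] for w in vocab], dtype=float)
    M = cdist(V, V)  # Euclidean word distances, Formula (4)

    # min sum_ij T_ij * M_ij  s.t.  row sums = d_a, column sums = d_b, T >= 0
    A_rows = np.kron(np.eye(n), np.ones(n))  # selects T[i, :]
    A_cols = np.kron(np.ones(n), np.eye(n))  # selects T[:, j]
    res = linprog(M.ravel(),
                  A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([d_a, d_b]),
                  bounds=(0, None), method="highs")
    return res.fun

emb = {"a": [0.0], "b": [1.0], "c": [3.0]}
print(fast_wmd(["a"], ["b"], emb))        # 1.0: all mass moves distance 1
print(fast_wmd(["a", "b"], ["a", "b"], emb))  # 0.0: identical documents
```

Because the LP size depends only on the two documents' combined vocabulary (tens of words), not the corpus vocabulary, each pairwise distance stays cheap even for large corpora.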

Algorithm 2 WMD
Algorithm 3 fast WMD
To verify the improvement, we conduct three groups of experiments. Each group consists of ten runs over 10 pairs of sentences selected randomly from Ren_CECps. The comparison of time consumption is shown in Table 2.
Parallelization. Though the fast WMD algorithm is about 16,000 times more efficient than the original WMD algorithm, it is still too slow for our experiments, so we parallelize the fast WMD model using 10-12 processes on 8 servers. To be consistent, unless otherwise noted, "WMD" in the rest of this paper always refers to the fast WMD algorithm.

Evaluation measures
In this paper, the evaluation is measured by the F1-score:

F1 = 2 · precision · recall / (precision + recall)

where precision = tp / (tp + fp) and recall = tp / (tp + fn), in which tp is the number of true positives, fp the number of false positives, and fn the number of false negatives. Both precision and recall are calculated in 'macro' mode using the metrics package of sklearn at http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.
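The macro-averaged score can be sketched in a few lines (pure Python for clarity; sklearn's `average='macro'` mode gives the same numbers):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2PR/(P+R), with P = tp/(tp+fp)
    and R = tp/(tp+fn); a class with no true or predicted samples for
    a component scores zero on that component."""
    f1s = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]))  # (2/3 + 4/5) / 2 ≈ 0.733
```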

Results
The experiments are based on the 50,321 sentences (split by category) of Ren_CECps and the 18,846 documents of 20 newsgroups. We use the linear support vector machine library at http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html for our classification experiments. All SVM programs run with the default configuration.
We run classification experiments with TF·IDF, SeTF·IDF, WMD, and a sentence embedding method trained by sent2vec [61], one of the state-of-the-art methods. To make a full comparison across feature dimensions, some experiments based on low dimensional feature representations of TF·IDF and on the selected seed sub-corpora of the two language corpora are also carried out. To check whether the word embedding has a more important influence than WMD itself, we add another experiment combining TF·IDF and word embeddings. The following abbreviations are used throughout this section:
• 20 news: the experiments based on the training and testing data of the 20 newsgroups data set;
• TF·IDF_1800: a low dimensional feature representation method, using SVD at http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html to reduce the TF·IDF feature vectors to 1800 dimensions;
• TF·IDF_2000: the 2000-dimension feature representation reduced from the TF·IDF feature vectors;
• Seed_TF·IDF: using the cosine function to calculate the similarity between the target data and the seed sub-corpus; the similarities form the feature vectors of the training and testing data. In these experiments, all data are initialized by TF·IDF. Writing (v_1, …, v_i, …, v_n) for the TF·IDF represented sub-corpus, where v_i is a TF·IDF vector, and t_j for the TF·IDF vector of a training or testing sample, the final feature vector of t_j is (cos(v_1, t_j), …, cos(v_i, t_j), …, cos(v_n, t_j));
• TF·IDF_word2vec: an enhanced TF·IDF method that uses the word embeddings trained by word2vec as weights for the corresponding words. The feature vectors are the product of the TF·IDF matrix and the embedding matrix.
• sent2vec: in this experiment, we use sentence embeddings trained by the sent2vec tool at https://github.com/epfml/sent2vec as feature vectors. Every document of the 20 newsgroups data set is converted into a one-line file. The output dimension of the sentence embeddings is 700.
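The Seed_TF·IDF construction above can be sketched like this (toy documents; the real seeds are the 1800 or 2000 selected sentences):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def seed_features(seed_docs, target_docs):
    """Represent each target by its cosine similarity to every seed:
    the feature vector (cos(v_1, t_j), ..., cos(v_n, t_j))."""
    vec = TfidfVectorizer(analyzer=str.split).fit(seed_docs + target_docs)
    seeds = vec.transform(seed_docs)
    targets = vec.transform(target_docs)
    return cosine_similarity(targets, seeds)  # shape: (n_targets, n_seeds)

F = seed_features(["好 天气", "下 雨"], ["好 天气", "刮 风"])
print(F.shape)  # (2, 2); the first target matches the first seed exactly
```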
Results pre-processing. The results computed by WMD contain "NaN" values. In order to fit the data into SVM, we convert all "NaN" values to the integer zero; the reason is explained in the discussion section. Table 3 shows the results of the classification experiments based on the feature representation methods mentioned above; the best and worst results of the three experiments are marked in bold. According to the results, we drew the histogram graphs below. In Fig 2 we can see that in both the 1v1 and 4v1 experiments the WMD algorithm gets the best classification results, about three percentage points higher even than the manually emotion separated SeTF·IDF. Compared with features of the same dimension, WMD shows strong representation capability: it scores 20% higher than the low dimension TF·IDF_1800 and 5% higher than the similarity representation method Seed_TF·IDF. In Fig 3, however, the WMD method is beaten in the English news experiments. The 20 newsgroups experiments get the best results with TF·IDF, and WMD gets nearly the same F1-score as Seed_TF·IDF; both methods are 5% lower than the TF·IDF model. One encouraging point is that WMD is still higher than the low dimension TF·IDF_2000, with an almost tenfold improvement in F1-score.
The TF·IDF_word2vec method gets F1-scores of 0.2209, 0.235 and 0.582 respectively in Figs 2 and 3, all lower than the WMD results of 0.3105, 0.3182 and 0.6461. The sent2vec experiments do not beat the other methods; their F1-scores are only better than TF·IDF_2000 and TF·IDF_1800.
In Figs 2 and 3, all of the low dimension feature representations reduced from the TF·IDF model get the worst results. The seed-corpus similarity representation, however, gets higher results than the TF·IDF model in the 1v1 and 4v1 experiments on the Chinese corpus, but lower results in the 20 newsgroups experiments on the English corpus. Digging into the feature dimensions of these methods, we found the TF·IDF vectors have roughly 30,000 dimensions for the Chinese corpus and 130,000 for the English one, while the WMD features have 1800 and 2000 dimensions respectively, as mentioned before. The dimension reduction rate of WMD is therefore about 17:1 for the Chinese corpus and 67:1 for the English corpus, which may explain why WMD performs better on the Chinese corpus than on the English one. The same pattern appears between TF·IDF and its reduced versions TF·IDF_1800 and TF·IDF_2000: in the 1v1 and 4v1 experiments, the F1-scores of TF·IDF_1800 are about half those of TF·IDF (0.22 to 0.11, 0.196 to 0.115), while in 20 newsgroups the F1-score of TF·IDF_2000 drops almost tenfold compared with TF·IDF (0.68 to 0.07).

Discussion
Difficulty in SeTF·IDF. Though SeTF·IDF can map sentences into different emotion dimensions, the method is based on prior knowledge annotated in the corpus. This means we cannot use SeTF·IDF to map a new sentence or document into multi-emotion dimensions, because new text has no manually annotated emotional keywords. That is why we use SeTF·IDF only as an enhanced baseline: its results are idealized. What matters is that this visualization algorithm gives us a clearer visual result and changes our way of thinking about training multi-label data.
The opposite results on the Chinese and English data sets. In the 1v1 and 4v1 experiments, the WMD method gets the best results. On the contrary, in 20 news, TF·IDF gets the best result, Seed_TF·IDF the second, and WMD the third. One reason is that the dimension reduction rates of the two language corpora, as mentioned above, are different. Another reason may be that the English word embeddings used in the experiments have more missing words than the Chinese embeddings; this could explain why TF·IDF_word2vec gets higher results than TF·IDF in the Chinese corpus, as indeed it should, but ranks only fourth in the English corpus, almost 10% below TF·IDF.