A performance comparison of supervised machine learning models for Covid-19 tweets sentiment analysis

The spread of Covid-19 has resulted in worldwide health concerns. Social media is increasingly used to share news and opinions about it. A realistic assessment of the situation is necessary to utilize resources optimally and appropriately. In this research, we perform Covid-19 tweets sentiment analysis using a supervised machine learning approach. Identification of Covid-19 sentiments from tweets would allow informed decisions for better handling the current pandemic situation. The used dataset is extracted from Twitter using IDs as provided by the IEEE data port. Tweets are extracted by an in-house built crawler that uses the Tweepy library. The dataset is cleaned using the preprocessing techniques and sentiments are extracted using the TextBlob library. The contribution of this work is the performance evaluation of various machine learning classifiers using our proposed feature set. This set is formed by concatenating the bag-of-words and the term frequency-inverse document frequency. Tweets are classified as positive, neutral, or negative. Performance of classifiers is evaluated on the accuracy, precision, recall, and F1 score. For completeness, further investigation is made on the dataset using the Long Short-Term Memory (LSTM) architecture of the deep learning model. The results show that Extra Trees Classifiers outperform all other models by achieving a 0.93 accuracy score using our proposed concatenated features set. The LSTM achieves low accuracy as compared to machine learning classifiers. To demonstrate the effectiveness of our proposed feature set, the results are compared with the Vader sentiment analysis technique based on the GloVe feature extraction approach.


Introduction
Outbreak of Covid-19 has a socio-economic impact [1]. The World Health Organization declared it an epidemic on 30 January 2020 [2]. Since then, it has spread exponentially, inflicting serious health issues including painful deaths. As of May 31, 2020, the death toll has a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 reached 636,633 [3]. The virus will be kept spreading in the upcoming days [4]. During the lock-down, traffic on social networking has increased tremendously [5]. Twitter has outclassed other competitors in the timely spreading of Covid-19 news [6]. Overwhelming part of these news is subjective due to the involvement of personal opinions and biasness, hence giving rise to (un)intentional fake information, uncertainty, and negativity in human social circles [7]. At the same time, this situation is catching the attention of researchers to perform computable analysis for creating a wholesome picture. This research focuses on sentiment analysis on Twitter dataset regarding Covid-19 using supervised machine learning algorithms. To this end, we undertake the following research questions: RQ 1: How is the performance comparison of machine learning models for Covid-19 sentiment analysis on tweets?
RQ 2: Can we improve the performance of machine learning models by feature engineering?
To address these questions, the dataset is obtained from the IEEE data port. It contains the tweets IDs and sentiment scores. Corresponding to these tweets IDs, we extract tweets using an in-house built tweets crawler. The tweets are cleaned using preprocessing techniques, which include removal of non-supported information and extraction of meaningful text. Next sentiment scores are found using the TextBlob toolkit. These scores are classified as positive, negative and neutral. The dataset is split into a training set and a testing set with a ratio of 80:20 respectively. For feature extraction, techniques such as term frequency-inverse document frequency (TF-IDF), bag-of-words (BoW) and GloVe are used. We propose a feature technique by concatenating TF-IDF and BoW features. Finally, five machine learning models, random forest (RF), XGBoost classifier, support vector classifier (SVC), extra trees classifier (ETC), and decision tree (DT) are trained to test their performance on the test data. For the sake of completeness, one deep learning model, long-short term memory (LSTM) is also trained and tested. Performance is evaluated on the accuracy, precision, recall, and F 1 score. We also compare our sentiment scores with those provided by the IEEE data port. The contributions of this study are: • Using the standard feature techniques TF-IDF and BoW, performance analysis of five supervised machine learning models, RF, XGBoost, SVC, ETC, and DT for Covid-19 sentiment analysis of tweets. Additionally, we also determine the performance of the LSTM. The results are also compared with sentiment scores as provided by the IEEE data port.
• Propose a feature extraction technique based on BoW and TF-IDF and evaluate its performance.
The rest of the paper organized as follows. Section 2 contains the related work to this study. Section 3 describes the dataset and Section 4 contain methods, techniques and proposed methodology. Section 5 presents results and discussion. Section 6 gives the conclusion.

Related work
The affliction due to Covid-19 is large [8]. News about this event has dominantly surpassed other news on social media. These news also include those fake, unverified and subject to people's biasness. Thus the call to methodically determine negativity in Covid-19 news is timely and justified.
Tweeter stays on top for gathering health-related news [9]. Sentiment analysis on Covid-19 news is done for India using a set of 24000 tweets [10]. Another study focuses on the psychological effect of Covid-19 on human behavior [11]. It reports that people are tense and their depression level increased due to Covid-19 news. Another study reports about industry crisis and new emerging opportunities due to Covid-19 [12]. There is sentiment analysis for data collected from various social networking platforms. It includes Twitter, Instagram, Reddit, YouTube, and Gab. The results report flaws in the information collected [13]. Different classifiers are tested in short text and long text information. For short text, Naive Bayes and Logistics Regression give average results of 91% and 74%. Both models do not perform well on the long text [14].
Another study about Covid-19 is done by collecting 4 million tweets from March 1, 2020, to April 21, 2020, using 25 different hashtags [15]. Thirteen topics are identified, out of which five classes are made. Latent Dirichlet Allocation (LDA) is used to recognize uni-gram, bigram, silent topics, themes, and sentiments in these tweets. The results are promising in the context of health-related emergency situations. Another approach detects emotions using 2500 short text ad 2500 long text messages [16]. Dominant emotion found is depression, which is understandably due to prolonged stay at home, joblessness, and fear due to virus [17]. There is also a BERT model to study emotions, which are assigned single labels and multi labels [18].
The key point of this model is to consider emojis, which is an effective way to express feelings. Another work studies Frequent pattern based sentiment analysis using the FP-growth algorithm [19]. The reported results are better than most of the other relevant works.
Lots of work is also done in the sentiment analysis domain using the deep learning approach in history [20,21], as a study [22] proposed an approach for sentiments classification using a deep learning model. It uses NLP for topic modeling to find the core issues related to Covid-19 as expressed on social media. The classification is done using LSTM Recurrent Neural Networks (LSTM RNN) model. Another variant of the deep learning model performs sentiment analysis on Covid-19 tweets using the sentiemnt140 dataset [23]. The results show an improvement over the existing values. Another study mines a database for over 1 million tweets of five months in 2020 to assess public attitude regarding the preventative measure of mask usage during the COVID-19 pandemic [24]. It is also based on NLP and determines the frequency increase of mask-related positive tweets. Sentiment analysis is also carried out on a set of 26000 tweets in two time intervals: 1st Jan 2019 to 23rd March 2020 and December 2019 to May 2020 [25]. The set includes re-tweets. Tweets within the first interval have mostly neutral and negative polarity, while those within the second interval has neutral and positive polarity. The used classifier is based on deep learning that achieved 81% accuracy.
A model, namely BERT, uses TF-IDF for sentiment analysis and topic modeling for negative posts [26]. Statistical analysis to detect the presence of pandemic is carried out on tweets of January 2020 [27]. It used word-frequency to understand the trend and psychology of tweet users, and sentiment analysis to understand their general attitude.
In relation to all the presented works, our study focus on performance comparison of Covid-19 sentiment analysis using various machine learning algorithms. We introduce a feature set with the aim to increase accuracy. The summary of related works present in Table 1.

Dataset description
The dataset we used in this study is obtained from the IEEE data port on May 31, 2020 [30]. It contains the tweet IDs and sentiment scores of 7528 tweets. A sample data is shown in Table 2. Tweets IDs are extracted using the following filters: language "en" and keywords "corona", "coronavirus", "covid", "pandemic" and variants such as "sarscov2", "nCov", "covid-19", "ncov2019", "2019ncov" and their hashtags. IEEE data port does not provide tweet text, resorting us to develop an in-house crawler that could extract them from Tweeter based on the tweet IDs. Since IEEE data port provides tweet IDs the next day, our dataset corresponds to the tweets of May 30, 2020. After preprocessing the data, we find sentiment scores using the Text-Blob. The results are shown in Table 3.
In Table 3, SS1 and SS2 represent sentiment scores that is provided by IEEE data port and that is found by us after preprocessing the data. The sentiment scores are classified as positive (greater than 0), neutral (equal to 0) and negative (less than 0). Table 4 shows the counts of tweets present in these classes.

Methods and proposed approach
The Lexicon based machine learning models are defined next.  Table 3. A comparison of sentiment scores between those provided by IEEE data port (SS1) and those determined by our approach after preprocessing (SS2).

Supervised machine learning method
We use five supervised machine learning models, their parameters settings are given in in Table 5. 4.1.1 RF. RF is used for both classification and regression problems [31]. RF is an ensemble model that uses bagging techniques. It generates several trees and performs voting between them to make a majority decision. Prediction accuracy increases with the number of tree. RF reduces the problem of over-fitting by using a bootstrap sampling technique [32]. RF can be defined as the mode of the multiple prediction trees, mode(t1, t2, t3,. . ., tn). Here t1, t2, t3 and tn are the predicted values against a tweet, whose Mode is taken. RF parameter settings are given in Table 5. By setting n_estimator = 300, we generate 300 decision trees. By setting max_depth = 300, the tree is restricted to grow to a maximum depth of 300 levels, thus effectively reducing the complexity of the decision tree.

XGBoost.
XGBoost (eXtreme Gradient Boosting) works the same way as Gradient Boosting classifier but with an extra feature of assigning weight to each sample as in Adaboost classifier [33,34]. XGBoost is a tree-based model which gained lots of popularity in recent times. It fits a number of weak learners (decision trees) parallelly unlike gradient boosting which does this sequentially. Due to this, XGBoost gives a speed boost. XGBoost has regularization techniques to control over-fittings such as L1 and L2 but these techniques are not available in Gradient Boosting and Adaboost classifiers. Another key feature of XGBoost is scalability, which means that it also can perform better on a distributed system and can process We set four parameters of XGBoost as given in Table 5. The n_estimator = 300 implies XGBoost uses 300 decision trees as a base learner, which will take part in the prediction process. The parameter max_depth = 300 restricts the growth of trees to a maximum 300. The learning_rate = 0.2 helps to control the over-fitting in the model [34]. The random_state = 27 controls the random seed given to each Tree estimator at each boosting iteration. In addition, it controls the random permutation of the features at each split.

SVC.
SVC is a linear model and used for sentiment analysis in many research work [36][37][38]. SVC maps each item as a data point in an n-dimensional space, where n is the number of features. It performs classification by finding the "best fit" hyper-plane that can differentiate between the classes. We use SVC with a sigmoid kernel and use another parameter C = 3.0 as a regularization value, as given in Table 5.
4.1.4 ETC. ETC implements the meta-estimator, which trains/fits the number of weak learners (randomized decision trees) on various samples of the dataset and boosts the prediction accuracy [39,40]. It is also an ensemble learning model used for classification purposes like RF, thus it is considered similar to RF. The only difference between ETC and RF is the way trees are constructed in the forest. ETC generates decision trees on the original training sample, while RF constructs decision trees on the bootstrap samples drawn from the original dataset. At each test node, each tree is provided with a random sample of k features drawn from the feature-set. Each decision tree must select the best feature to split the data based on some mathematical criteria (typically Gini Index). This random sample of features leads to the creation of multiple de-correlated decision trees. We use ETC with two main parameters, n_estimators = 300 and max_depth = 300, as given in Table 5.

DT.
DT acquires information in the form of a tree, which can also be rewritten as a set of discrete rules [41,42]. The key advantage of DT is the use of decision rules and feature subsets that appear at various classification stages. A DT consists of different types of nodes such as a leaf node and a number of internal nodes with branches. Each leaf node represents a class corresponding to an example while each internal node represents features and branches representing the conjunction of features that lead towards classification. The performance of DT depends on how well it is constructed on the training set. We use the parameter max_depth = 300 to restrict DT to grow a maximum depth of 300 as given in Table 5.

TextBlob
TextBlob is a Python library that is used in natural language processing (NLP) tasks such as part-of-speech tagging, sentiment analysis, noun phrase extraction, translation, classification and many more [43,44]. We use TextBlob to find the sentiments from Covid-19 tweets. Text-Blob sentiment function returns two properties against each tweet, polarity score in the range [-1,1] and subjectivity score in the range [0, 1]. Negative, zero and positive polarity scores represent negative, neutral and positive statements, respectively. Subjectivity refers to the expression of opinions, evaluations, feelings and speculations [45].

Feature extraction techniques
In order to use the machine learing models, we need to extract features of tweets. Two feature extraction techniques, BoW and TF-IDF are used.

TF-IDF.
TF-IDF is a feature extraction technique used for text analysis tasks [28][29][30][31][32][33][34][35][36][37][38]. It gives weighted features for performance boosting [31,39,46]. TF-IDF finds the weight of each feature in a document using the product of term frequency (TF) and inverse document frequency (IDF). TF is the frequency of a feature in document and depends on the length of the document. It can be defined as where count ( t, d) is the number of term t in the document d and totalcount d is the total number of all terms in the document d. IDF measures the extent of a term t being informative in a document for model training. It can be computed as where N is the number of documents in the corpus and Df t is the number of documents that contain the term t. IDF measures the weight of a term t low when term t occurs frequently in many documents. For instance, stopwords have low IDF value. Finally, TF-IDF can be defined as

BoW.
The BoW is the simplest feature extraction technique used in information retrieval and NLP tasks [31,47]. BoW is mostly used in text classification tasks where the occurrence of each word in a document is used for model training. BoW generates the vocabulary all the uniques words and their occurrence frequencies in the all documents for the training of learning models.

Concatenation of BoW and TF-IDF.
To boost the performance of machine learning models, we propose a concatenation of BoW and TF-IDF features, as shown in Fig 3. Such concatenation is helpful for learning models to boost their accuracies.

Evaluation parameters
To evaluate the performance of the machine learning models, we have used four evaluation parameters, accuracy score, precision score, recall score and F 1 score. The accuracy score is the fraction of correct predictions. The maximum accuracy score can be 1 and the minimum accuracy score can be 0. It is given as Four basic notations are explained as follows.
Recall indicates about the completeness of a classifier. It lies in [0, 1] and calculated as F 1 score is a harmonic mean of precision and recall scores. It lies in [0, 1] and calculated as

Proposed methodology
A thematic of the our proposed methodology is shown in Fig 4. After extracting tweets using our in-house built crawler, data is preprocessed. Preprocessing is an important step, which affects accuracy of learning models. We remove stopwords, usernames, link punctuation's and numeric values from tweets and apply stemming techniques. A sample data is given in Table 6. After preprocessing, the results are given in Tables 7-10.

Removal of usernames and links.
Usernames (e.g. @username) and links in tweets do not contribute to sentiment analysis, hence they need to be removed. We also remove special stopwords 'RT' from tweets, which are not important for topic analysis [48]. The results are given in Table 7.

Removal of punctuation marks and conversion to lower case.
Punctuation characters hinder machine understanding, hence they are removed, so do hashtag markers '#'. Similarly all terms are converted to lower case for uniformity. For instance, "COVID", "Covid", "covid", and #covid are all converted to "covid" because machine learning models are case sensitive. The results are given in Table 8.

Removal of stopwords and numeric values.
We remove commonly used stopwords such as a, an, as and etc. to reduce the noise. We also remove the numeric values such as 123, which are not valuable for text analysis. Alpha-numeric words such as covid19 are not removed because most of them are technical terms. The results are given in Table 9.

Stemming.
Stemming is also an important technique in text preprocessing to reduce the terms to their root form [49]. For instance, machine learning algorithms consider "go", "goes" and "going" as different features even if they contain the same information. Reaching the stem reduces the complexity in the text features [50]. The results are given in Table 10.
After preprocessing, we apply the TextBlob technique on the dataset to find sentiment scores. Due to cleaning noise, the TextBlob performs better than the original dataset. A performance comparison of the TextBlob sentiment scores between the original data and the preprocessed data is given in Tables 11 and 12. Please note the change in labels.
After finding sentiment scores, the dataset is split into a training set and a testing set with a ratio of 80:20 as shown in Table 13. Features TF-IDF, BoW, and concatenation of TF-IDF and BoW are used for the training of learning models. For demonstration purposes, we apply these feature engineering techniques on two sample tweets 'trump call covid chines viru' and 'covid could kill'. The results are given in Tables 14-16.   After training the learning models, the performance is evaluated on accuracy score, precision score, recall score and F 1 score (see Section 4.4).

Results
This section presents the Covid-19 sentiment analysis results of the machine learning models RF, XGBoost, SVC, ETC, and DT using features BoW, TF-IDF, and concatenation of TF-IDF and BoW. We compare the model performance between SS1 and SS2. In results tables and confusion matrices, 0, 1, 2 represent neutral, positive, and negative sentiments respectively.

Results with TF-IDF
The TF-IDF results of used models are given Tables 17 and 18. ETC model performs better for both SS1 and SS2 using TF-IDF. It has a higher accuracy score 0.92 for SS2. It is due to the fact that TextBlob performs better when given a cleaned data. Both SS1 and SS2, Tree-based models give better performance for both SS1 and SS2.
The confusion matrix given in Fig 5 shows the performance of the ETC model. Based on our chosen ratio of 80:20 for training data and testing data, there are 7528 tweets and 1506 tweets, respectively. ETC correctly predicts 1390 tweets for SS2, while it makes 116 incorrect predictions. For SS1, ETC makes 250 incorrect predictions.

Results with bag-of-words
The BoW results of used models are given Tables 19 and 21. BoW results are comparable to that of TF-IDF. We observe that tree-based models RF, XGBoost, and DT perform slightly better for BoW than TF-IDF, while SVC perform better for TF-IDF. ETC performance for SS2 is the same for both BoW and TF-IDF.
Referring to results given in Tables 19 and 20, we observe that the tree-based models RF, XGBoost, and DT perform well. ETC achieved high accuracy levels of 0.92 and 0.88 for SS2 and SS1, respectively. Tree-based models also achieved good accuracy of 0.91. The

PLOS ONE
performance ETC is also elaborated by the confusion matrices, shown in Fig 6. ETC makes more correct predictions when we train the model with the clean data. Out of 1506 predictions, it makes 1390 correct predictions for SS2, which are only 1326 for SS1.

TF-IDF & BoW concatenation
We concatenated TF-IDF and BoW features with the aim to achieve high accuracy of machine learning models (see Section 4.3.3). The results are given in Tables 21 and 22. Overall the performance is boosted with the combination. ETC accuracy improves to 0.93 when we train it using clean data. XGBoost and RF also outperform their previous results. The performance of ETC and XGBoost is also improved for SS1. The confusion matrix of the best performer, ETC is shown in Fig 7 for the concatenated features. ETC makes 1412 correct predictions out of 1506, and only 94 incorrect, which is the lowest ratio of incorrect predictions in this study. The improvement in results is due to the concatenated features. By concatenation, the feature set size increases so the model has more features to learn and to improve its accuracy. As is observed in the SS2 case, ETC gives 94 incorrect predictions. Most of these incorrect predictions are made for the positive class, which is 39 out of 94. The reason behind this observation is that the positive class has fewer examples in SS2 as compared to other classes. Thus the model has a low positive class example ratio for training as compare to the other classes.

Comparison of our proposed approach with other techniques
In this section, we compare the results of concatenated features with Vader [51] and GloVe [52] to show the significance of this study. Vader (Valence Aware Dictionary and sEntiment

PLOS ONE
Reasoner) is a lexicon and rule-based technique to find the sentiment score from the social media platforms [51]. It also works well on text from other domains. It assigns intensity to each word in the tweets and then sum-up all the intensities to obtain the sentiment score. The selection of Vader for comparison is due to the fact that many research studies have shown the better performance of Vader on social media type text [53,54]. Results given in Table 23 show that SS2 accuracy outperforms Vader in all cases, thus the significance of SS2. Fig 8 shows the performance comparison between the models for SS1, SS2 and Vader using concatenated features. SS2 accuracy is highest as compared to SS1 and Vader.
Similarly, results are compared with GloVe (Global Vector), which is an unsupervised learning algorithm to obtain vector representations of any word. It creates a feature matrix on the basis of feature-feature co-occurrence [52]. Selecting GloVe to compare with our approach to existing approaches is due to the fact that it is mostly used for feature extraction [55,56]. We also selected two deep learning approaches to extract the features that are used to train our machine learning models: a combination of CNN and LSTM (CNN-LSTM) and Deep Neural Networks (DNN). The resultant performance is not prominent, as given in Table 24.
Our concatenation feature engineering approach outperforms the GloVe, the CNN-LSTM and the DNN features for all models in accuracy [57]. This is due to the fact that GloVe constructs the feature set on the basis of co-occurrence of features, which as evidence is not concrete and only probabilistic. Thus, the training of machine learning models remains incomplete and insufficient. This fact is pointed out [58]. The recommendation is to train the

PLOS ONE
deep learning approaches with extensive data so that resultant co-occurrence distributions converge. The Fig 9 shows detailed results of the five used models.

Results using Long Short-Term Memory
We also compared the performance of LSTM with that of the machine learning models. The LSTM uses an architecture of the embedding layer between the input layer and the LSTM layer, which creates a vector of input text features for the LSTM layer. A dense layer is created with 256 neurons activation function 'relu' (Rectified Linear Unit). Next, a dropout layer with a 0.5 dropout rate is used to drop the neurons during training, thus reducing the chances of over-fitting the model. The output layer with the sigmoid function is used to find the probability of each class. Our LSTM model shows poor performance on the selected dataset by attaining an accuracy score of 0.577. This poor performance is due to the fact that the dataset is too small for a deep learning model. This is in accordance with the study that reports the poor performance of deep learning models on small datasets [39,58]. We also used the BiLSTM [59] and CNN-LSTM [57] and achieved the accuracy score of 0.579 and 0.61 reflectively. BiLSTM and CNN-LSTM take more training time as compared to LSTM but does not give much improvement. There is a possibility of the unsuitability of the used architecture for LSTM and other models for the selected dataset. Hence, the results cannot be conclusive without further investigation of LSTM on other datasets or with more tuned architecture.  techniques to train the machine learning model using the 80% data (6022 Tweets) and evaluate its performance using the remaining 20% data (1506 Tweets). The study concludes that ETC is the best performer with our features concatenation engineering approach. RQ 1 posed investigation of the comparative performance of five supervised machine learning models, RF, XGBoost, SVC, ETC, and DT for Covid-19 sentiments. We addressed this question by acquiring a tweets dataset, cleaning it, and finding its sentiment scores using Text-Blob. TF-IDF and BoW are used for feature extraction. Out of five trained and tested models,

PLOS ONE
ETC outperforms others. This is due to the fact that linearly separable class planes are not possible for textual data such as tweets. Thus models working on discrete states tend to lose accuracy, which is indirectly covered by implicit randomization during ETC. For the sake of completeness, we also trained and tested one deep learning model, LSTM. It exhibits the lowest performance. It is understandable due to the small dataset, which does not allow enough learning paths for an overall stable system that can handle test points. RQ 2 posed investigation of effects of an engineered feature set. This question is addressed by proposing a feature set by concatenation of TF-IDF and BoW. Again our proposed feature set outperforms the two standard techniques, TF-IDF and BoW. This is due to the fact that an extended feature set allows more training points, thus increasing the chances of the test point to lie within closer proximity of one of those training points.
We compared the performance of models using sentiment scores generated using TextBlob and Vader. Of all three feature extraction techniques used, TF-IDF, BoW, and concatenation, our approach gave better results. This is mostly due to the fact that Covid-19 tweets do not have much spread in story-lines. Of those few features extracted, there is a trend of normalized distribution, hence keeping Vader at a disadvantage. Lastly, we also compared the performance of models built on top of our winner feature technique, the concatenated set, and GloVe. Again, our proposed technique performs better. This can be associated with the complexity that emerges due to all possible co-occurrence of features. This complexity actually affects the correct decidability of most of the used supervised learning models.
In future work we plan to direct our work towards the deep learning approaches, specifically to improve their performance on small datasets.