Twitter mining using semi-supervised classification for relevance filtering in syndromic surveillance

We investigate the use of Twitter data to deliver signals for syndromic surveillance in order to assess its ability to augment existing syndromic surveillance efforts and give a better understanding of symptomatic people who do not seek healthcare advice directly. We focus on a specific syndrome—asthma/difficulty breathing. We outline data collection using the Twitter streaming API as well as analysis and pre-processing of the collected data. Even with keyword-based data collection, many of the tweets collected are not be relevant because they represent chatter, or talk of awareness instead of an individual suffering a particular condition. In light of this, we set out to identify relevant tweets to collect a strong and reliable signal. For this, we investigate text classification techniques, and in particular we focus on semi-supervised classification techniques since they enable us to use more of the Twitter data collected while only doing very minimal labelling. In this paper, we propose a semi-supervised approach to symptomatic tweet classification and relevance filtering. We also propose alternative techniques to popular deep learning approaches. Additionally, we highlight the use of emojis and other special features capturing the tweet’s tone to improve the classification performance. Our results show that negative emojis and those that denote laughter provide the best classification performance in conjunction with a simple word-level n-gram approach. We obtain good performance in classifying symptomatic tweets with both supervised and semi-supervised algorithms and found that the proposed semi-supervised algorithms preserve more of the relevant tweets and may be advantageous in the context of a weak signal. Finally, we found some correlation (r = 0.414, p = 0.0004) between the Twitter signal generated with the semi-supervised system and data from consultations for related health conditions.


Abstract
We investigate the use of Twitter data to deliver signals for syndromic surveillance in order to assess its ability to augment existing syndromic surveillance efforts and give a better understanding of symptomatic people who do not seek healthcare advice directly. We focus on a specific syndrome-asthma/difficulty breathing. We outline data collection using the Twitter streaming API as well as analysis and pre-processing of the collected data. Even with keyword-based data collection, many of the tweets collected are not be relevant because they represent chatter, or talk of awareness instead of an individual suffering a particular condition. In light of this, we set out to identify relevant tweets to collect a strong and reliable signal. For this, we investigate text classification techniques, and in particular we focus on semi-supervised classification techniques since they enable us to use more of the Twitter data collected while only doing very minimal labelling. In this paper, we propose a semi-supervised approach to symptomatic tweet classification and relevance filtering. We also propose alternative techniques to popular deep learning approaches. Additionally, we highlight the use of emojis and other special features capturing the tweet's tone to improve the classification performance. Our results show that negative emojis and those that denote laughter provide the best classification performance in conjunction with a simple word-level n-gram approach. We obtain good performance in classifying symptomatic tweets with both supervised and semi-supervised algorithms and found that the proposed semi-supervised algorithms preserve more of the relevant tweets and may be advantageous in the context of a weak signal. Finally, we found some correlation (r = 0.414, p = 0.0004) between the Twitter signal generated with the semi-supervised system and data from consultations for related health conditions. PLOS

Introduction
Surveillance, described by the World Health Organisation (WHO) as "the cornerstone of public health security" [1], is aimed at the detection of elevated disease and death rates, implementation of control measures and reporting to the WHO of any event that may constitute a public health emergency or international concern. Disease surveillance systems often rely on laboratory reports. More recently some countries such as the UK and USA have implemented a novel approach called "syndromic surveillance", which uses pre-diagnosis data and statistical algorithms to detect health events earlier than traditional surveillance [2]. Syndromic surveillance can be described as the real-time (or near real-time) collection, analysis, interpretation, and dissemination of health-related data, to enable the early identification of the impact (or absence of impact) of potential human or veterinary public health threats that require effective public health action [3]. For example, they use emergency department attendances or general practitioner (GP, family doctor) consultations to track specific syndromes such as influenzalike illnesses (ILI). The digital world is generating data at an astonishing rate. This generated big data has unlocked novel solutions to the area of public health [4,5]. The expansion in digital technology and increasing access to online user-generated content, like Twitter, has provided another potential source of health data for syndromic surveillance purposes. Expanding access to communications and technology makes it increasingly feasible to implement syndromic surveillance systems in low and middle income countries (LMIC) too and some early examples in Indonesia and Peru have indicated reason for optimism [6]. The use of data from microblogging sites, such as Twitter, for disease surveillance has been gaining momentum (e.g. [7][8][9][10][11]). This may not only complement existing surveillance systems, but may also support the precise monitoring of disease activity in sub-groups of the population that do not routinely seek medical help via existing healthcare services. The realtime streaming nature of Twitter data could provide a time advantage for syndromic surveillance activities aimed at early detection of disease outbreaks. In addition to this, the low cost of utilisation of this data means that in LMIC where access to medical services may be restricted but where the of use digital technology and social media is becoming more common, such data may support the development of cost-effective and sustainable disease surveillance systems.
It is in light of this that we develop our work. Our ultimate aim is to establish the utility of social media data, and specifically, Twitter data for syndromic surveillance.
Our first objective is to extract a reliable signal from the Twitter stream for different syndromes and health conditions of interest. To achieve this, we must be able to effectively identify and extract tweets expressing discomfort or concern related to a syndrome of interest and reflecting current events. Such symptomatic tweets are considered "relevant" for our purpose of syndromic surveillance. In this paper, we look at asthma/difficulty breathing as our syndrome of interest, which has received less attention in studies using social media data than other syndromes (e.g. ILI). Keywords such as "asthma" or other asthma-related keywords can capture tweets such as "oh I used to have asthma but I managed to control it with will power" or "Does your asthma get worse when you exercise?" which we consider as not relevant. On the other hand, tweets such as "having an asthma attack atm" or "why is my asthma so bad today?" express a person currently affected, which we would like to consider as relevant. Hence, the problem becomes one of text classification. Other authors [11,12] have already identified that much of the data captured on Twitter represents chatter, concern, or awareness instead of actual infection or suffering from symptoms of a disease, talk of past events or a reflection on news content. Such data is therefore irrelevant as a signal for syndromic surveillance. This irrelevant content may greatly magnify the signal and lead to incorrect results and over-estimation [13]. Once the signal has been extracted, we then compare it to real world syndromic surveillance data to understand how well Twitter works for monitoring our syndromes.
When extracting the Twitter signal, we focus on two novel aspects of tweet classification. Firstly, we investigate emojis in tweet classification, and show their worth in a syndromic surveillance context. While there is published literature making use of emoticons in text classification [11], there are few studies describing the use of emojis. Secondly,we compare both supervised and semi-supervised approaches to text classification. We consider semi-supervised methods because they enable us to use a small amount of labelled data, thereby reducing the initial labelling effort required to build a classifier. Finally, we compare the signal we extracted using our methods, to syndromic surveillance data from Public Health England (PHE) to investigate the utility of Twitter for the syndromic surveillance of asthma/difficulty breathing.

Related work
In a survey carried out in 2015, Charles-Smith et al. [8] identified 33 articles that reported on the integration of social media into disease surveillance with varying degrees of success. However, they reported that there is still a lack of application in practice, despite the potential identified by various studies. Many studies are retrospective as it is relatively easy to predict a disease post-outbreak, but practical application would need to be prospective. Uses of social media data vary from global models of disease [14] to the prediction of an individual's health and when they may fall ill [15].
The most commonly studied disease is influenza (or ILI) [16]. Ginsberg et al. [17] put forward an approach for estimating influenza trends using the relative frequency of certain Google search terms as an indicator for physician visits related to influenza-like symptoms. They found that there was a correlation between the volume of specific Google searches related to ILI and the recorded ILI physician visits reported by CDC [17]. De Quincey and Kostkova [9] introduced the potential of Twitter in detecting influenza outbreaks. They posited that the amount of real-time information present on Twitter, either with regards to users reporting their own illness, the illness of others or reporting confirmed outbreaks from the media, is both rich and highly accessible. Achrekar et al. [18] also investigated the use of Twitter for detecting and predicting seasonal influenza outbreaks, and observed that Twitter data is highly correlated with the ILI rates across different regions within USA. They concluded that Twitter data can act as supplementary indicator to gauge influenza within a population and could be useful in discovering influenza trends ahead of CDC.
In this study, our objective is to collect relevant tweets for our given syndrome. We notice that a majority of tweets are not relevant as they do not express the required sentiment (i.e. a person suffering from the particular ailment at the current time). We view this as a text (or tweet) classification problem, and build models to filter relevant tweets. Several papers have looked at the tweet classification problem using supervised learning for different applications. Sriram et al. [19] classified tweets to a predefined set of generic classes, such as news, events, opinions, deals, and private messages, based on information on the tweets' authors and domain-specific features extracted from tweets. Dilrukshi et al. [20] applied a Support Vector Machine (SVM) to classify tweets to different news categories. Some other papers have also investigated tweet classification in the context of syndromic surveillance to varying degrees. As early as 2013, Lamb et al. [11] used Twitter data to investigate influenza surveillance. They argued that for accurate social media surveillance, it is essential to be able to distinguish between tweets that report infection and those that express concern or awareness. To accomplish this, they made use of a log-linear model with word-level n-gram features extracted from the tweets. Over the years, more sophisticated approaches to Twitter surveillance have arisen.
Most recently, the rise in popularity of deep learning has seen it applied to social media data and health research, yielding positive results. Hu et al. [21] used a Convolutional Neural Network (CNN) to classify drug abuse behaviour in tweets. Lee et al. [22] applied a CNN to the classification of tweets as being related to adverse drug effects or not. Dai et al. [23] made use of deep learned distributed representations of words [24] to characterise tweets which were then separated into two clusters-related or unrelated to a topic (e.g. influenza). Researchers in Australia made use of Recurrent Neural Networks (RNNs) for the classification of chief complaints to the Australian Emergency Department [25]. Chinese researchers applied RNNs to spatio-temporal influenza data from the Shenzen Centre for Disease Control and Prevention to predict flu trends [26]. SENTINEL [10] is a deep learning powered syndromic surveillance system which utilises RSS news feeds in addition to Twitter, and makes use of supervised deep learning classifiers, comparing them to Naive Bayes and SVM. It is set in the US and looks at Influenza-like illnesses, similar to the majority of work on syndromic surveillance. We have also previously carried out some work on the syndromic surveillance of asthma/difficulty breathing using deep learning, making use of a deep Long Short Term Memory (LSTM) RNN after implementing other deep architectures and comparing them [27].
One problem with the above approaches is that they rely on having a set of labelled data for learning, i.e. a sufficient set of tweets must first be labelled as, say, relevant/irrelevant for the learning to take place. Such labelling can be very time consuming, so it often means that researchers do not use all of the data available, but instead use a subset of labelled data to develop their classifiers. Since the syndromes/events we wish to study may not be mentioned frequently in a Twitter feed, we wish to use as many tweets as possible to build our models. Semi-supervised classification approaches try to produce models using a small set of labelled data whilst also taking into account the larger set of unlabelled data. As such, we investigate them next. A number of papers have looked at using semi-supervised learning for sentiment analysis, and in particular, self-training [28,29]. Baugh [30] proposed a hierarchical classification system with self-training incorporated, with the goal of classifying tweets as positive, negative or neutral. Liu et al. [31] proposed a semi-supervised framework for sentiment classification of tweets, based on co-training. They converted tweets into two kinds of distinct features: textual and non-textual. The deep learning approaches highlighted above can also be considered as semi-supervised, as they use unlabelled data to build vector feature representations which are subsequently used for classification. However, the classification process is still entirely supervised. In fact, Oliver et al. [32] argue that while deep neural network tasks have proven successful on standard benchmark tasks, these tasks are not comparable to the scarce data scenarios for which widely-used semi-supervised classification techniques are implemented. Lim et al. [33] proposed a completely unsupervised model for identifying latent infectious diseases on social media. They made use of the unsupervised sentiment analysis system SentiStrength [34] to identify positive and negative tweets. They then tried to check if the negative tweets contained mention of a relevant body part. The fully unsupervised system was found to perform well without any human labelling at all and achieved an F1-score of 0.724.
In this paper, we build classification models for tweets based on the relevance in the context of a specific syndrome/event which perform well in an environment with little labelled data. In parallel work, we have been looking at deep learning methods [27]. There is an advantage in developing methods with minimal labelled sets because they simplify the establishing of Twitter data classification for new syndromes. As part of our investigation, we look into feature representation and feature selection for text, which is an important part of text classification. We experiment with different types of features, taking into consideration suggestions from previous work. We consider the addition of emojis for tweet classification, and show their worth for improving classification in a syndromic surveillance context. We compare both supervised and semi-supervised approaches to text classification in order to understand if, and how, we can utilise more of the data that we collect.

Methods
We discuss the data collection, pre-processing and analysis of tweets in order to extract a relevant signal for a given syndrome. We narrow our efforts to asthma/air pollution incidents in this paper.

Data collection and pre-processing
We collected tweets in different periods to account for the seasonality of the syndromes under study and to have a better chance of observing a significant episode, which is unpredictable. This collection strategy also enables us to observe periods with no significant episodes, as a form of control. Different periods also enable us to monitor changes in the use of Twitter as well as in the language used on Twitter over time. We started with an Autumn period (September 2015 to November 2015), followed by a summer period (June 2016 to August 2016) and a winter through to mid-summer period (January 2017 to July 2017).
Tweets were collected using the official Twitter streaming Application Programmer's Interface (API). The Twitter streaming API provides a subset of the Twitter stream free of charge. The whole stream can be accessed on a commercial basis. Studies have estimated that using the Twitter streaming API, users can expect to receive anywhere from 1% of the tweets to 40% of tweets in near real-time [35]. The streaming API has a number of parameters that can be used to restrict the tweets obtained. We extracted tweets in the English language with specific terms that may be relevant to a particular syndrome. For this, in conjunction with experts from Public Health England (PHE), we created a set of terms that may be connected to the specific syndrome under scrutiny, in this case asthma/difficulty breathing. We then expanded on this initial list using various synonyms from regular thesauri as well as from the urban dictionary (https://www.urbandictionary.com) as those may capture some of the more colloquial language used on Twitter. Examples of our keywords are "asthma", "wheezing", "couldn't breathe" etc. A full list of keywords used is provided in S1 List.
The restriction on specific terms can be implemented in the Twitter API by by using the parameter "track" followed by a comma-separated list of phrases which will be used to determine which tweets will be delivered from the stream. A phrase may be one or more terms separated by spaces, and a phrase will match if all of the terms in the phrase are present in the tweet, regardless of order and ignoring case. Hence in this model, commas act as logical ORs and spaces are equivalent to logical ANDs. The tracked terms are matched against a number of attributes of the tweet including the text attribute of the tweet, expanded_url and display_url for links and media and screen_name for user.
We collected 10 million tweets obtained over the three collection periods. The general characteristics of the collected tweets are reported in Table 1. The anatomy of a tweet is presented in the Status Map in Fig 1. There are several attributes associated with a tweet that are available to our analysis. We did not consider all the available tweet attributes to be useful for our experiments, so we collected those that could help us in our task. More specifically, we collected "tweet_Id", "text", "created_at", "user_id", "source" as well as information that may help us establish location such as "coordinates", "time_zone" and "place.country". We stored the collected tweets using MongoDB, which is an open source no-SQL database whose associative document-store architecture is well suited to the easy storage of the JSON Twitter responses.
3.1.1 Location filtering. Because the aim of this project is to assess the utility of Twitter data for syndromic surveillance systems in England, we would like to exclude tweets originating from outside England. Doing this will give a realistic signal, however, inferring the location of Twitter users is notoriously difficult. Fewer than 14% of Twitter users disclose city-level information for their accounts and even then, some of those may be false or fictitious locations [36]. Less than 0.5% turn on the location function which would give accurate GPS coordinate information from mobile devices. time_zone, coordinates and place attributes, which we collected, can help in the geolocation of a tweet but are not always present or even correct as is shown in Table 2. The most reliable source of location information at the time of tweeting, coordinates, is only present in a very small percentage of tweets.
For building a relevance classifier, accurate location is of relative importance. In this work, we are not overly concerned with accurate location filtering. For the purpose of symptomatic tweet classification for relevance filtering, location is of no importance. We collect tweets from the whole of the UK. We employ all three geolocation fields, filtering out tweets that do not have a UK timezone, a place in the UK or coordinates in the UK. We acknowledge that the location filtering is not entirely accurate and may have a disruptive effect when we compare our signal with public health data collected within England. However, we leave the task of improving on location filtering for future work where we will extend our signal comparisons to include longer periods of time and other syndromes.
3.1.2 Cleaning the data. The Twitter dataset contained retweets (sometimes abbreviated to RT) which are the re-posting of a tweet; other duplicate tweets not marked as RT but containing exactly the same text with different URLs appended; and users tweeting multiple times on the same day. We removed all duplication from the final dataset in order to minimise the detection of false signals.
In addition, we removed URLs, which are often associated with news items and blogs, and replaced them with the token "<URL>". This helped with identification of duplication but also identification of "bot" posting and news items. A "bot" is the term used when a computer program interacts with web services appearing as a human user. Tweets from bots, news and web blogs are not relevant to syndromic surveillance so we developed algorithms to identify them and remove them. An overview of the data after cleaning, showing a considerable reduction in volume, is shown in Table 3.
3.1.3 Labelling. 3,500 tweets from the first data collection period were manually labelled as "relevant" or "not relevant". A tweet was labelled as relevant if it announced or hinted at an individual displaying symptoms pertaining to the syndrome of choice. The labelling was done by three volunteers. A first person initially labelled the tweets. This took approximately 1 hour per 1,000 tweets. A second person checked the labels and flagged up any tweets with labels that they did not agree with. These flagged tweets were then sent to a third person who made the decision on which label to use. 23% of the labelled tweets were labelled as "relevant" while 77% were labelled as "irrelevant". A second set of 2,000 tweets, selected at random, were later labelled following the same procedure from the last data collection period. 32% of these tweets were labelled as relevant and 68% were labelled as irrelevant. From the second sample of 2000 tweets, the Inter-Rater Agreement was computed using Fleiss' Kappa [37] which is given by the following equation: where � P is the mean agreement between raters and P e is the probability of agreement by chance calculated from the observed data using the probabilities of each rater randomly labelling a tweet as relevant or irrelevant. The Fleiss' kappa was chosen over other kappas due to the fact that it is intended to be used when assessing the agreement between three or more raters which is appropriate for this scenario. A value of 1 suggests complete agreement while a value of 0 suggests complete disagreement. We obtained a value of 0.906 for κ.

Basic text classification features.
We acknowledge that deep learned word vectors are an effective avenue for text feature representation. However, as was described in section 2, in this work, our focus is not on deep learning, but semi-supervised approaches that can be used in an environment where labelled data is scarce. In addition, the training and deployment of high-performant industry-grade deep learning systems with hundreds of layers can be intensive and require considerable hardware resources such as dedicated GPUs and TPUs [38]. A system such as ours will require no such special hardware or resources, so it will be easier for low and middle income countries (LMIC) to incorporate such systems at whatever scale.
Classification of tweets may be challenging as they are very short and in our scenario, target classes may share common vocabularies. That is, both relevant and irrelevant tweets could contain the same words. Twitter has specific language and styles of communication that people use. In particular, we found that emojis and emoticons are promising additional tokens that we could exploit for classification: • An emoticon is a pictorial representation of a facial expression using punctuation marks, numbers and letters, usually written to express a person's feelings or mood. :-) is an example of an emoticon.
• Emojis on the other hand are miniature graphics of various objects and concepts including facial expressions. is an example of an emoji. Emoticons have been used successfully as features to improve tweet classification performance in sentiment analysis as well as syndromic surveillance [11]. Practically speaking, emojis can be used for the same purposes as emoticons. However, emojis have seen a recent surge in popularity, presumably due to the fact that emojis provide colourful graphical representations as well as a richer selection of symbols [39]. In fact, as Table 1 shows, there were a large number of emojis in our corpus. Because of this, we extend existing emoticon feature techniques with the inclusion of emojis. A further advantage is that while emoticon use can differ around the world, emoji features are less variant. Take for example the crying emoticon. In Western countries, it can be represented by:'( or:'-(. In Asian countries, "kaomoji", which refers to emoticons depicting faces and made from a combination of Western punctuation characters and CJK (Chinese-Japanese-Korean) characters, are more popular than regular emoticons [40]. An example of a popular kaomoji is "�\_(ッ)_/�". Using the earlier example of the crying face, we could now also expect to see (T_T) for the same concept. Emojis on the other hand are a predefined set of unicode characters. Even though they may be rendered differently on different devices, the underlying mapping between a concept and an emoji remains the same. In this sense, emojis may transcend language barriers.
We believe that emoticons and emojis can help with assessing the tone of a tweet. Tweets we are interested in will most likely have a negative tone as they reflect people expressing that they are unwell or suffer some symptoms. This means they may contain one or more emojis/ emoticons denoting sadness, anger or tiredness, for example. On the other hand, the presence of emojis/emoticons denoting happiness and laughter in a tweet may be an indication that the tweet is not relevant to our context of syndromic surveillance. We also investigate more complex features derived from our words or additional tokens.

Word classes
We extend our n-gram feature set with further syntactical features in order to make up for the shortcomings words may present when applied to Twitter data. Word classes are labels that Lamb et al. [11] found useful in the context of analysing tweets to categorise them as related to infection or awareness. The idea is that many words can behave similarly with regard to a class label. A list of words is created for different categories such as "possessive words" or "infection words". Word classes function similarly to bag of word features in that the presence of a word from a word class in a tweet triggers a count based feature. We manually curated a list of words and classes which are shown in Table 4. As we applied lemmatisation, we did not include multiple inflections of the words in our word classes.

Positive and negative word counts:
We constructed two dictionaries of positive and negative words respectively. These dictionaries are shown in S2 and S3 Lists. This feature computes for every tweet, the number of positive words and negative words it contains. Words that do not appear in either of our dictionaries are not counted. The classifier should then infer a matching between ratios of positive to negative counts and tweet relevance.

Denotes laughter:
This is a simple binary feature which measures the presence of a token (emoji and/or emoticon) that might suggest laughter or positivity. We manually curated and saved a list of positive emojis/emoticons for this. The usefulness of this feature was augmented by also checking for the presence of a small list of more established and popular internet and slang for laughter or humour such as "lol" or "lmao" which stand for "Laughing Out Loud" and "Laughing My Ass Off" respectively. Table 5 shows this feature's distribution over the data.

Negative emojis/emoticons:
This is similar to the Denotes Laughter feature but this time looking at the presence of an emoji or emoticon that can be associated with an illness or the symptoms that it may bring., i.e. negative emotions. We decided to include this feature because we discovered the ubiquity of emojis on Twitter and wanted to investigate their potential. Table 5 shows this feature's distribution over the data. We find that this feature may be the most discriminative of the two emoji-based features. Of the instances with a positive value, a high percentage belong to the "relevant" class and of the instances with a negative value, a high percentage belong to the "not relevant" class.
We experimented with two other features-Contains Asthma-Verb Conjugate and Indicates Personal Asthma Report but found that they underperformed compared to the other features so we do not report on them. We also constructed features from the tweets collected in the latest time period in order to see how the features generalised across time periods. The distributions of the non-continuous features from the latest time period are shown in Table 6.
For each tweet, we appended all of the above features together to form one feature vector. Each tweet T i is therefore represented by an f dimensional vector, where f is a sum of the number of terms, n, in the constructed vocabulary, and the dimensionality of our custom features C (Word Classes, Positive and Negative Word Counts, Denotes Laughter and Negative Emojis/ Emoticons). This gives us where t ij represents the weight of the j-th vocabulary term in the i-th tweet and C k i represents the value of the k-th custom feature in the i-th tweet. The feature vectors are represented in code by dictionary (or hashmap) objects which allows them to contain different types of values (i.e.. binary, continuous and categorical). Semi-supervised classification for relevance filtering in syndromic surveillance

Text classification
A classification algorithm for text can be used to automatically classify tweets, in this case, to the categories of relevant/not relevant. We first applied a variety of popular and powerful supervised classification algorithms to the data namely-Naive Bayes, Decision Trees, Logistic Regression, Support Vector Machines (SVMs) and Multilayer Perceptron (MLP) neural networks. We used the Python implementations found in the Natural Language ToolKit (NLTK) and Sci-Kit Learn [41]. As per our objective, we also implemented a semi-supervised approach which is suited to small to medium sized datasets [32]. Semi-supervised learning attempts to make use of the combined information from labelled and unlabelled data to exceed the classification performance that would be obtained either by discarding the unlabelled data and applying supervised learning or by discarding the labels and applying unsupervised learning. Our intention is to extend the labelling in a semi-supervised fashion. We make use of the heuristic approach to semi-supervised learning and employ a self-training iterative labelling algorithm. We then extend this work by using a form of co-training.
3.2.1 Self-training model. We adopted an Iterative Labelling Algorithm for semi-supervised learning [42]. Iterative labelling algorithms are closely related to and are essentially extensions of the Expectation-Maximization (EM) algorithm put forward by Dempster et al. [43]. The iterative labelling algorithm is a sort of meta-algorithm which uses a data set S of labelled instances L, unlabelled instances U, and a supervised learning algorithm A with For our choice of supervised learning algorithm, we selected the MLP classifier after experimenting with different supervised models and finding it to perform best. We used the trained MLP classifier's predictions to label unlabelled instances in the Assign-Labels function. We set our stopping condition such that the iteration stops when either all the unlabelled data is exhausted or there begins to be a continued deterioration in performance as more data is labelled. Along with the class of an applied instance, we also compute the model's confidence in its classification. Our algorithm, inspired by Truncated Expectation-Maximization (EM) [44], then grows L based on the confidence of our model's classification. When an instance from R is classified, if the confidence of the classification is greater than some set threshold θ, the instance is labelled. Considering this, our algorithm falls within the confidence-based category of iterative labelling or self-training algorithms because it selects instances for which the trained classifier has a high confidence in its predictions.
Confidence-based iterative labelling algorithms can tend toward excessively conservative updates to the hypothesis, since training on high-confidence examples that the current hypothesis already agrees with will have relatively little effect [44]. In addition, it has been proven that in certain situations, many semi-supervised learning algorithms can significantly degrade the performance relative to strictly supervised learning [45,46].
3.2.2 Co-training model. To address the problems of self-training, we take some ideas from co-training [47] to try to improve our algorithm. Co-training requires different views of the data so that multiple classifiers can be maintained for the purpose of labelling new instances. Recall that each tweet can be represented as a feature vector T i with various features. We now distinguish two representations. The first is a concatenation of our n-grams, Word Classes, Denotes Laughter and Negative Emojis/Emoticons features. We represent this feature space as X 1 . The second kind of feature vector is a concatenation of our n-grams, Positive and Negative Word Counts, Denotes Laughter and Negative Emojis/Emoticons features. We represent this feature space as X 2 . We can think of X 1 as the taxonomical feature space as is characterised by its inclusion of the Word Classes feature while X 2 can be the sentimental feature space and this is characterised by its inclusion of the Positive and Negative Word Counts feature. As such, X 1 and X 2 offer different, though overlapping, views of the dataset. Each tweet is then represented as a feature vector from each of these spaces.
We now maintain two separate classifiers trained on different views of the data. During the iterative labelling process, we only label instances for which at least one of the classifiers has a high confidence in its prediction and take the result of that classification as the label. Similar to self-training, at the end of each iteration, the newly labelled data is incorporated into each of the classifiers to update their hypotheses. Once the iterative labelling process is completed, the prior training examples for both classifiers as well as the newly labelled examples are joined together and used to train a new classifier using all the features which will then be applied in practice. The benefit of co-training is that the examples labelled by one classifier are also presented to the other classifier to update the hypothesis on the complementary view. Thus, the examples, as represented in each view, receive at least some of their labels from a source other than the classifier that will be updated with them [42].

Correcting the class imbalance.
We started with an initial set of manually labelled data contained 3,500 tweets. This consisted of 23.7% tweets that were labelled as relevant and 76.3% labelled as irrelevant. Imbalanced data causes well known problems to classification models [48]. We initially tried both oversampling and undersampling techniques to create a balanced training dataset as well as just using the unbalanced data. We found no major difference between the balancing approaches, but they gave some advantage over the unbalanced data, so we opted for over sampling. The class distribution over the balanced training set had 47% of tweets as relevant and 53% as irrelevant. The test set was not balanced.

Performance metrics.
Another important aspect of imbalanced data and of classification in general is having the right performance metric for assessment of classification model [49]. Overall accuracy is a misleading measure [50] as it may only be reflecting the prevalence of the majority class. This is called the accuracy paradox, i.e. we could get high accuracy by classifying all tweets as irrelevant. That would, however, not improve our signal. The aim of our endeavour is to identify tweets which might suggest an increase of cases for a particular syndrome (asthma/difficulty breathing) for the purpose of syndromic surveillance. Our signal for some syndromes is quite weak as not many cases may occur at a national level and even less may be talked about on Twitter. Because of this, we are very concerned with identifying and keeping instances of the positive class (relevant tweets). We would like to reduce the number of irrelevant tweets but not at the expense of losing the relevant tweets. This means that for our classifier, errors are not of equal cost. Relevant tweets that are classified as irrelevant, also known as False Negative (FN) errors, should have a higher cost and hence be minimised; we can have more tolerance of irrelevant tweets classified as relevant, also known as False Positive (FP) errors. Those subtleties are well captured by alternative measures of model performance such as recall, the probability that a relevant tweet is identified by the model and precision, the probability that a tweet predicted as relevant is actually relevant [51]. Precision and recall are often trading quantities. A measure that combines precision and recall is the F-measure or Fscore [52]. We primarily make use of the F 2 measure which weighs recall higher than precision and may be more suited to our purpose.

Assessment of features and key words.
We also assessed the discriminative ability of each of our features by performing feature ablation experiments [53]. We evaluated the performance of a given classifier when using all our features, and then again after removing each one of these features. The difference in the performance is used as a measure of the importance of the feature. We chose to use the difference in F 1 metric over F 2 in this analysis because we wanted to convey how the features performed in the general task of tweet classification.
We also performed some analysis on the word (i.e. n-gram) features to learn which words in our vocabulary were the best indicators of relevant tweets. We analysed the n-gram component of our compound feature vectors in order to calculate the informativeness, or information gain of each word unigram. The information gain of each feature pair is based on the prior probability of the feature pair occurring for each class label. A higher information gain (hence, a more informative feature,) is a feature which occurs primarily in one class and not in the other. Similarly, less informative features are features which occur evenly in both classes. The information gain idea is pivotal to the decision tree algorithm but generalises to others and was adapted in the NLTK package for use in a broader sense. In NLTK, informativeness of a word w was calculated as the highest value of P(w = feature_value|class) for any class, divided by the lowest value of P(w = feature_value|class) [41]. This informativeness I, is summarised below: where C is the set of all classes and c is a possible class. Recall that to collect tweets, we made use of Twitter's streaming API which allowed us to specify keywords to restrict the data collection to tweets containing those specific terms. We measured the usefulness of the keywords we selected. To do this, we assessed their information retrieval performance. Specifically, we used the precision-recall metric. In an information retrieval context, precision and recall are defined in terms of a set of retrieved documents and their relevance. We use our original set of labelled tweets for this assessment (i.e. the set of 3500 tweets). In our scenario, the labelled tweets make up the set of retrieved documents and the tweets labelled as belonging to the "relevant" class make up the set of relevant documents. In this context, recall measures the fraction of relevant tweets that are successfully retrieved while precision measures the fraction of retrieved tweets that are relevant to the query.

Classifiers
The results of our fully-supervised and semi-supervised classification are presented in Table 7. The original data was divided into a 70:30 training and test split through random sampling and the results presented are measures obtained from the test data. Of the fully-supervised classifiers, Logistic Regression, SVM and MLP are very sensitive to hyper-parameters. The values for these hyper-parameters were found using grid-search with a hold-out validation setup. 90% of the training data (70% of the labelled set) was used to build the classifiers and the remaining 10% was used to optimise the hyper-parameters. In the following evaluation, we use the discovered optimal hyper-parameters according to the grid-search. For Logistic regression, we used L2 regularisation with a regularisation strength C of 0.00001. For the SVM, we used a Radial Basis Function kernel and C of 0.01. For the MLP, we used 2 hidden layers, each with 128 neurons, a learning rate of 0.001, a regularisation α of 0.0001, a batch size of 200 and trained for 100 epochs. The Adam optimiser [54] was used in minimising the loss function. For the iterative labelling experiments, we varied and tuned the confidence thresholds until we found the best results and reported those. Below, we also discuss in more detail how the confidence threshold affected the iterative labelling performance as it is a key aspect of the algorithms. The best fully-supervised approach according to a combination of the F 1 and F 2 scores was the MLP, which achieved an F 2 score of 0.888 on the test data. This equated to an overall prediction accuracy of 95.5%. The best semi-supervised approach, which was the co-training algorithm (using the best fully-supervised classifier-MLP as its base), achieved an F 2 score of 0.929 on the test data, also with a predictive accuracy of 95.5%. Overall, the semi-supervised approach achieves higher F scores. To confirm what we concluded from the results, we applied a paired t-test to test the difference in F 2 scores between the fully-supervised MLP algorithm and the co-training algorithm. Before carrying out this test, we confirmed that the data satisfied the assumptions necessary for the paired t-test to be relevant-continuous, independent, normally distributed data without outliers. This resulted in a t-statistic of 7.7 and a p-value of 1.7 × 10 − 13 which suggests that the difference between the F 2 scores of the two algorithms was not due to chance.
We also computed the precision, recall and F-score for the best fully-supervised approach and best semi-supervised approach on the minority class, i.e. the relevant tweets. The results of this experiment are presented in Table 8. The semi-supervised approach produces both a stronger F1 and F2 score on the minority class. To give a better understanding of how the different measures manage to balance the number of FP and FN, we also present the confusion Table 7. Results of relevance classification on the test data. Naive Bayes (NB), Decision Tree (DT), Logistic Regression (LR), Support Vector Machine (SVM) and Multilayer Perceptron (MLP) algorithms are reported together with the self-training and co-training iterative labelling algorithms. matrices for both the best performing fully-supervised and semi-supervised methods on the test data. These confusion matrices are shown in Tables 9 and 10 respectively. From the confusion matrices, we see that the semi-supervised approach performs better for the purpose of syndromic surveillance as it yields only 17 false negatives even though it also yields 37 false positives. Considering that our aim is to develop a filtering system to identify the few relevant tweets in order to register a signal for syndromic surveillance, it is critical to have high recall, hopefully accompanied by high precision, and therefore high accuracy. The semi-supervised method is able to identify and retain relevant tweets more often, while also being able to identify irrelevant tweets to a reasonable degree. Hence, even with a shortage of labelled data, the semi-supervised algorithms can be used to filter and retain relevant tweets effectively. Fig 2 shows how the performances of the semi-supervised systems change as the confidence threshold changes. The confidence threshold controls how conservatively the semi-supervised system assimilates unlabelled instances as it represents how confident the semi-supervised system needs to be in its classification before assimilating the instance to inform future decisions. We observed co-training with MLP to perform best. We also observed that for lower confidence thresholds between 0.1 and 0.5, self-training performance is usually lower and does not change much between thresholds. Co-training on the other hand, appears to be less sensitive to this parameter. Fig 2 also reiterates what we learned from Table 7 that the MLP is our strongest fully-supervised model. In addition, while the logistic regression classifier does not perform as well as the MLP, it appears to be robust to different confidence thresholds when used in an iterative labelling context. We hypothesise that this advantage arises because the logistic regression classifier has considerably less hyper-parameters to optimise. This means that if a set of hyper-parameters, which is impactful on performance, is not optimal for a certain threshold, such a set would be less of a hinderance to the logistic regression model.

Supervised Algorithms Precision
The main issue with iterative labelling algorithms is that, because the classifiers are not perfect and do not have 100% accuracy, we cannot be sure that the unlabelled instances that they label for assimilation are always correct. This means that their initial performance before any labelling iterations is vital. Consider a classifier, initially of poor performance (with an accuracy of 0.2 for example). When classifying unlabelled instance with which to train itself, 80% of its classifications will be wrong, so it will assimilate false hypotheses, which will in turn make its performance in the next iteration even worse and so on. Conversely, if the initial accuracy is high, it is more likely to correctly classify unlabelled instance and be less resistant to the drop in performance from assimilating false hypotheses. We conducted an experiment to measure the quality of the automatically labelled instances assimilated by our  semi-supervised classifiers. For this exercise, we used the second set of labelled tweets from a different time period as the "unlabelled" set with to which the iterative labelling is applied to. The same training set as in our other experiments was used for the initial training stage. The self-training and co-training processes were initiated, applying these classifiers to the alternative set of labelled data (around 2000 instances) in steps of 200. Fig 3 shows a plot of the proportion of correctly classified instances that the iterative labelling process assimilated. The cotraining approach had a higher rate of being correct when making new additions. This was in fact the aim of adopting co-training with its multiple different views of the same data. The proportion of correct assimilations of both the self-training and co-training methods rises as more data is assimilated, due to the fact that the systems are getting more intelligent. Although we could not test beyond 2000 instances (because of our limited labelled data), we believe that the proportion of correct assimilations will increase until a certain point, after which it will plateau. We expect this plateau due to the fact that at a certain point, the iterative learning classifiers will have nothing new to learn from new data after having been exposed to so much. As with the features constructed, we tested how the classifiers would perform for new data collected at a different time period to assess if shifts in language and colloquialisms could have an impact on performance. Our classifiers were built on data from the first collection period (see Table 2). For a simple assessment, we applied our trained model to tweets collected in the most recent collection period, which had a time gap of two years from the original data. Our semi-supervised approach based on co-training achieved a precision of 0.418 and a recall of 0.587 on the 2,000 labelled tweets from the most recent collection period. This means an F 1 score of 0.488 and more importantly, an F 2 score of 0.543. For comparison purposes, we also applied the fully-supervised MLP algorithm to the data from this new time period. This yielded a precision of 0.420 and a recall of 0.410. This meant an F 1 score of 0.415 and an F 2 score of 0.412. In both cases, we observe a deterioration in performance when introduced to tweets from a different time period. This poses an important issue to consider about how language online changes moving forward. Although it changes very gradually, after a period of one or two years, the changes are substantial enough to render the natural language-based models less effective. Table 11 shows the results of the feature ablation experiments. Our central feature was the ngram. Without it, we see the performance of our systems drop by around 0.2. We also found that our supporting features yield some additional improvements in performance on top of the n-gram features. For each of these supporting features, their omission results in a drop in performance of around 0.1. Of our additional features, we found that Negative Emojis/Emoticons were the most discriminative, followed by the Denotes Laughter feature in the supervised approach, which also captures emojis as well as colloquialisms, and Positive/Negative Word Count in the semi-supervised approach. All three of these features capture the mood of a tweet. Table 12 shows the words found to be most informative. For example, the table shows that, of the tweets containing the word chest, 96% are relevant and only 4% are irrelevant. The training data is used for this calculation. A surprising negative predictor was the word health. When health appeared in a tweet, the tweet was irrelevant 94% of the time. The word pollution shows a similar trend. This suggests that when Twitter users are expressing health issues, they may not use precise or formal terms, opting for simple symptomatic and emotional words such as chest, cold or wow. The more formal terms may be more often associated with news items or general chat or discussion. Using this information, we could include some of the more relevant but perhaps unexpected keywords as keywords when collecting streaming tweets from Twitter in order to better target and collect relevant tweets.

Feature analysis
We also investigated which emojis were most prevalent in our data set as well as how often each emoji appeared in tweets for each class. Fig 4 shows the frequency with which each emoji occurred in the labelled tweets. It shows that only a few emojis appear very frequently in tweets collected in our context. This means that only a few important emojis were needed for determining tweet relevancy as opposed to monitoring for the full emoji dictionary. Fig 5 shows a list of some emojis and the distribution of classes that tweets belonged to whenever they contained said emoji. Overall, it can be seen that each of these emojis tends to lean heavily toward one class. This suggests that they could be quite discriminative and useful indicators of class membership and hence, helpful features. Table 13 shows the results of the assessment of keywords used in tweet collection from an information retrieval point of view. We found that asthma, pollution and air pollution were the keywords that yielded the most results at 1313, 757 and 509 out of a total of 3500. Wheezing, fumes and inhaler were next with 219, 132, 121 tweets respectively. The remaining keywords return very few results (below 40) or no results. In an information retrieval scenario, precision refers to the fraction of retrieved results that are relevant to the query while recall refers to the fraction of relevant results that are successfully retrieved. Asthma had the highest recall but not very high precision so most of its results were irrelevant. Wheezing, inhaler, wheeze, cannot breathe, can't breathe, difficulty breathing and short of breath have good precision although their recall is not that high. Some of those keywords express direct symptoms of the syndrome under investigation, hence, we expect good precision. Tight chest and pea souper have very high precision but only appeared in two tweets each. Of the keywords used, wheezing was the most useful in that it brought in a lot of results, most of which were relevant. We included a common misspelling of asthma, the keyword with the highest recall power-asma. We found that asma only appeared in 4 tweets. We hypothesise that this is due to the fact that most users of Twitter post from devices capable of autocorrect hence it may not be necessary to worry about misspelling of keywords. The informativeness, I, was calculated when the keywords were also features in the classifiers and is presented in Table 14. Most of the keywords were not informative from a feature selection point of view, with an information gain ratio of 1:1 for relevant:irrelevant tweets so Semi-supervised classification for relevance filtering in syndromic surveillance they are not included. We found that some of the keywords used for collecting tweets were also informative in the model, although not always associated with the relevant class. For example, pollution, which was a keyword, appeared in the ranking of top 15 most informative words associated with the irrelevant class. Similarly fumes also associates with the irrelevant class more. On the other hand wheezing had good information retrieval performance and associated with relevant tweets.

Syndromic surveillance comparisons
As we discussed earlier, the purpose of our relevance filtering is a syndromic surveillance application. After constructing and evaluating our semi-supervised filtering systems, we assessed their utility for syndromic surveillance purposes by retrospectively applying them to Twitter data and comparing the results against data for existing syndromic indicators from the Public Health England (PHE) Real-time Syndromic Surveillance Service. For this experiment, Semi-supervised classification for relevance filtering in syndromic surveillance  we made use of unlabelled tweets from our second collection period, June to August 2016. Syndromic surveillance data on the proportion of daily GP Out-of-Hours (GPOOH) calls for asthma, wheezing or difficulty breathing and tele-health calls (NHS 111) for difficulty breathing for the period June to August 2016 were compared to the signal detected from our Twitter data using our semi-supervised and fully-supervised filtering systems. As a sense check, we also compared our detected Twitter signal time series against non-respiratory syndrome data in the form of NHS 111 calls for diarrhoea. This was to provide some form of control which should not correlate with the Twitter signal. The resulting time series shows the daily proportion of relevant symptomatic tweets and consultations/calls as observed on Twitter and recorded by PHE (Figs 6 and 7). The signals were smoothed using a 7-day moving average to remove the fluctuations in daily activity for GPOOH data as that service receives more usage over the weekends. We also included a time series showing the Twitter signal without any filtering for further perspective. We see that the time series plots of the self-training and co-training filtering follow a similar trend to the GP data time series. Also, the time series for the Twitter data without any filtering has lots of spurious peaks in relation to the ground truth data (i.e. the syndromic surveillance data). Both of these observations together suggest that Twitter data might mirror the health activity of a population and that relevance filtering is useful in reducing noise and obtaining a clearer picture of such activity. Additionally, we see that while the unfiltered Twitter signal does not match well with the asthma/wheeze/difficulty breathing or difficulty breathing signal, it still seems to match better than that of the diarrhoea signal. To further evaluate the quality of our detected signal, we calculated the Pearson correlation coefficient to determine the strength and direction of any monotonic relationship between the indicators (Table 15). We observed a weak but statistically significant correlation between the Twitter signals and the asthma and difficulty breathing syndromic surveillance data. The signal produced by the co-training method achieved a stronger correlation than that the signal produced by the best fully-supervised method-the MLP. For the diarrhoea syndrome, there was no statistically significant correlation with the Twitter time series. To corroborate this, it can be seen from the time series that the plots for our detected Twitter signals and GP data follow a similar downward trend but the diarrhoea signal does not. This further suggests that our semisupervised system can detect activity similar to that detected by traditional syndromic surveillance systems recording GP visits and NHS 111 calls. This suggests that they could potentially be further explored as an additional form of syndromic surveillance, even in an environment with scarce labelled data.

Discussion
Twitter is a noisy data source for syndromic surveillance but through data processing and tweet classification, we have been able to identify relevant tweets among the noisy data for a specific syndrome or incident (asthma/difficulty breathing). This in turn allowed us to extract a signal of potential use for syndromic surveillance that correlated positively with real-world public health data. Using a semi-supervised method of classification for filtering tweets, we achieve an accuracy of 95.5% and F 1 and F 2 scores of 0.910 and 0.929 respectively. We argued that recall is very important for us because we want to keep all the relevant tweets so that we can have some signal, even if amplified by some misclassified irrelevant tweets. The best recall, obtained by the semi-supervised algorithm equated to retaining over 90% of the relevant tweets after classification. Also, the semi-supervised approach allowed us to use 8000 previously unlabelled tweets before it started to see a deterioration in performance. This allowed us to make more use of the data collected. Tweet classification using supervised learning has received a lot of attention [19,20,[55][56][57] and gave us good results with F 1 and F 2 scores of 0.910 and 0.919 respectively for the MLP neural network. Tweet labelling, required for supervised classification, is time consuming, however, so often researchers do not use all of the data available to build the model. Semisupervised methods for tweet classification have been used for sentiment analysis [58,59]. They can enable more of the collected data to be used for training the classifier bypassing some of the labelling effort. Johnson el al. [60] used a method called label propagation and reported accuracy of 78%. Baugh [30] proposed a hierarchical classification system with self-training and reported accuracy of 61% and an F 1 score of 0.54. We have implemented an iterative labelling semi-supervised approach which seems to have competitive performance and also enables us to use more of the training data without the effort of labelling. Furthermore, we get an improvement on recall over the supervised method, which is important given that the signal we are trying to preserve for syndromic surveillance may be weak. We compare our semisupervised system to others above but we acknowledge that applications in different domains might weaken the comparison. Baugh [30] also applied semi-supervised systems to tweet classification but not for syndromic surveillance so this comparison might be of more value.
We have also identified strong and novel features in the context of tweet classification: emojis. We have hinted at the growing use of emojis [39] and their importance in establishing the tone of at tweet which in turn is important to relevance classification. Emojis cross language boundaries and are often used by people expressing conditions of interest to syndromic surveillance. Our custom features constructed based on Twitter colloquialisms including emojis proved effective in improving classification performance. Of all our custom features, the one that stood out most was the Negative Emojis/Emoticons feature. Emoticons have been used previously [11]. Emojis work even better than emoticons and their uniformity is a real advantage. A smile emoticon could be illustrated in the form ":-D" or ":D". However, because emojis are actually unicode encoded pictographs with a set standard [61], there exist no variants of the same emoji. In a learning scenario, this reduces fragmentation or duplication of features making them more ideal as features than emoticons.
In terms of geolocation of tweets, we have found that most of the obvious location indicators are not well populated, and those that are, may not be accurate. Hence, future work must tackle geolocation as a real part of the problem for establishing a proper signal from Twitter. After comparing our extracted Twitter signal to real world syndromic surveillance data, we found a positive, albeit weak correlation. This suggests that there is a relationship between asthma related Twitter activity and syndromic surveillance data for asthma and breathingrelated incidents. While the actual correlation value indicates a weak relationship, it still suggests that we can detect relevant activity on Twitter which is similar or complementary to that which is collected by traditional means. The strength of the correlation might be affected by the weak location filtering that we have been able to perform. As we discussed, the syndromic surveillance data relates to England but the Twitter data has only been located (not accurately) to the UK. As future work, we plan to assess the full detection capability of Twitter by repeating this analysis prospectively over a longer time period, and for different syndromes, allowing us to determine whether Twitter can detect activity that is of potential benefit to syndromic surveillance.
We also found that "what to collect" is problematic as the data collection of tweets by keywords requires a carefully chosen list of keywords. Furthermore, our experimentation with different type of features like emojis also tell us that the vocabulary used in Twitter is different to expression in other settings (e.g. as part of a medical consultation). Hence we may need to widen our data collection terms to include emojis, emoticons and other types of informal expressions. We may also need to develop adaptive systems in which the set of data collection keywords is dynamically updated to collect truly relevant tweets. So an idea for future research is to begin with a set of keywords, collect tweets, perform relevance analysis and then update the keyword/token list to reflect those that associate with the most relevant tweets, eliminating any keywords/tokens that are not performing adequately.
We saw also that vocabulary and use of tokens change over time. Negative emojis/emoticons appeared more often in the second time period, up from 5.5% to 14.4% of labelled tweets containing them. This could suggest that over the past two years, the use of emojis as a form of expression has grown. However their prevalence in each class also changed, which may explain the classification performance showing some marked deterioration in precision. We performed our research on data collected within a two year period, but further data collection and experimentation would be beneficial to understand the temporality of models generated as Twitter conversations change over time.