Using Twitter To Generate Signals For The Enhancement Of Syndromic Surveillance Systems: Semi-Supervised Classification For Relevance Filtering in Syndromic Surveillance

We investigate the use of Twitter data to deliver signals for syndromic surveillance in order to assess its ability to augment existing syndromic surveillance efforts and give a better understanding of symptomatic people who do not seek health care advice directly. We focus on a specific syndrome: asthma/difficulty breathing. We outline data collection using the Twitter streaming API as well as analysis and pre-processing of the collected data. Even with keyword-based data collection, many of the tweets collected are not relevant because they represent chatter or awareness of a condition rather than a person currently suffering from it. In light of this, we set out to identify relevant tweets to collect a strong and reliable signal. For this, we investigate text classification techniques, and in particular we focus on semi-supervised classification techniques since they enable us to use more of the collected Twitter data without needing to label it all. In this paper, we propose a semi-supervised approach to symptomatic tweet classification and relevance filtering. We also propose the use of emojis and other special features capturing the tweet's tone to improve classification performance. Our results show that negative emojis and those that denote laughter provide the best classification performance in conjunction with a simple bag-of-words approach. We obtain good performance on classifying symptomatic tweets with both supervised and semi-supervised algorithms, and find that the proposed semi-supervised algorithms preserve more of the relevant tweets and may be advantageous in the context of a weak signal. Finally, we found some correlation (r = 0.414, p = 0.0004) between the Twitter signal generated with the semi-supervised system and data from consultations for related health conditions.

PLOS

1 Introduction

Surveillance, described by the World Health Organisation (WHO) as "the cornerstone of public health security" [1], is aimed at the detection of elevated disease and death rates, the implementation of control measures and the reporting to the WHO of any event that may constitute a public health emergency of international concern. Disease surveillance systems often rely on laboratory reports. More recently, some countries such as the UK and USA have implemented a novel approach called "syndromic surveillance", which uses pre-diagnosis data and statistical algorithms to detect health events earlier than traditional surveillance [2]. Syndromic surveillance can be described as the real-time (or near real-time) collection, analysis, interpretation and dissemination of health-related data to enable the early identification of the impact (or absence of impact) of potential human or veterinary public health threats that require effective public health action [3]. For example, such systems use emergency department attendances or general practitioner (GP, family doctor) consultations to track specific syndromes like influenza-like illness (ILI).

The expansion of digital technology and increasing access to online user-generated content like Twitter have provided another potential source of health data for syndromic surveillance purposes. Expanding access to communications technology makes it increasingly feasible to implement syndromic surveillance systems in low and middle income countries (LMIC) too, and some early examples in Indonesia and Peru have given reasons for optimism [2].

The use of data from microblogging sites such as Twitter for disease surveillance has been gaining momentum (e.g. [4][5][6][7][8]). This may not only complement existing surveillance systems but may also support more accurate monitoring of disease activity in sub-groups of the population that do not routinely seek medical help via existing healthcare services. The real-time and streaming nature of Twitter data could provide a time advantage for syndromic surveillance activities aimed at the early detection of disease outbreaks.
In addition, the low cost of utilising this data means that in LMIC, where access to medical services may be restricted but where the use of digital technology and social media is becoming more common, such data may support the development of cost-effective and sustainable disease surveillance systems.

It is in this light that we develop our work. Our aim is to establish the utility of social media data, and specifically Twitter data, for syndromic surveillance. Our first objective is to extract a reliable signal from the Twitter stream for different syndromes and health conditions of interest. To achieve this, we must be able to effectively identify and extract tweets expressing discomfort or concern related to a syndrome of interest and reflecting current events. Such symptomatic tweets are considered "relevant" for our purpose of syndromic surveillance. In this paper, we look at asthma/difficulty breathing as our syndrome of interest, which has received less attention.

Secondly, we compare both supervised and semi-supervised approaches to text classification. We consider semi-supervised methods because they enable us to use unlabelled data, thereby reducing the initial labelling effort required to build a classifier. Finally, we compare the signal we extracted using our methods to syndromic surveillance data from Public Health England (PHE) to investigate the utility of Twitter for the syndromic surveillance of asthma/difficulty breathing.

2 Related Work

In a survey carried out in 2015, Charles-Smith et al. [5] identified 33 articles that reported on the integration of social media into disease surveillance with varying degrees of success. However, they reported that there is still a lack of application in practice despite the potential identified by various studies.
Many studies are retrospective, as it is relatively easy to predict a disease post outbreak, but practical application would need to be prospective. Uses of social media data vary from global models of disease [11] to the prediction of an individual's health and when they may fall ill [12].

The most commonly studied disease is influenza or ILI [13]. Ginsberg et al. [7] put forward an approach for estimating influenza trends using the relative frequency of certain Google search terms as an indicator of physician visits related to influenza-like symptoms. They found a correlation between the volume of specific Google searches related to ILI and the recorded ILI physician visits reported by the CDC [7]. De Quincey and Kostkova [6] introduced the potential of Twitter in detecting influenza outbreaks. They posited that the amount of real-time information present on Twitter, whether from users reporting their own illness, the illness of others or confirmed outbreaks reported by the media, is both rich and highly accessible. Achrekar et al. [4] also investigated the use of Twitter for detecting and predicting seasonal influenza outbreaks and observed that Twitter data is highly correlated with ILI rates across different regions within the USA. They concluded that Twitter data can act as a supplementary indicator to gauge influenza within a population and could be useful in discovering influenza trends ahead of the CDC.

In this study, our objective is to collect relevant tweets for our given syndrome. We proceed to the initial data collection by using a set of possibly related keywords. However, we notice that a majority of tweets are not relevant, as they do not express the required sentiment (i.e. a person suffering from the particular ailment at the current time). We then view this as a text (or tweet) classification problem and build models to filter relevant tweets. Several papers have looked at the tweet classification problem using supervised learning for different applications. Sriram et al. [14] classified tweets into a predefined set of generic classes such as news, events, opinions, deals and private messages, based on information about the tweets' authors and domain-specific features extracted from tweets, such as the presence of abbreviated words. Dilrukshi et al. [15] applied a Support Vector Machine (SVM) to classify tweets into different news categories. The most relevant work in the context of tweet classification is that of Dredze and his colleagues [8,9], as they used Twitter data to investigate influenza surveillance. They argue that for accurate social media surveillance it is essential to be able to distinguish between tweets that report infection and those that express concern or awareness. One problem with these approaches is that they rely on having a set of labelled data for learning, i.e. a sufficient set of tweets must first be labelled as, say, relevant/irrelevant for the learning to take place. Such labelling can be very time-consuming, so researchers often do not use all of the data available but instead use a subset of labelled data to develop their classifiers. Since the syndromes/events we wish to study may not be mentioned frequently in a Twitter feed, we wish to use as many tweets as possible to build our models.
To this effect, semi-supervised classification approaches try to produce models using a small set of labelled data while also taking into account the larger set of unlabelled data, so we investigate them next.

Zhang et al. [16] investigated the semi-supervised classification of tweets for organisation name disambiguation, a problem previously tackled with a supervised approach by Yerva et al. [17]. Zhang et al. compared Label Propagation and Transductive Support Vector Machines (TSVMs): both methods utilise unlabelled data in the classifier. A number of papers have looked at using semi-supervised learning for sentiment analysis, and in particular self-training [18,19]. Baugh [20] proposed a hierarchical classification system with self-training incorporated into it, where his goal was to classify tweets as positive, negative or neutral. Liu et al. [21] proposed a semi-supervised framework for sentiment classification in tweets that was based on co-training. They converted tweets into two kinds of distinct features: textual and non-textual. Two Random Forest (RF) classifiers were trained with the same labelled data, but one with textual features and the other with non-textual features. Johnson et al. [22] proposed a general semi-supervised framework for document classification using Convolutional Neural Networks. Lee et al. [23] applied this framework to the classification of tweets as being related to adverse drug effects or not.

In this paper, we build classification models for tweets based on their relevance in the context of a specific syndrome/event. As part of our investigation into feature representation and feature selection for text, which is an important part of text classification, we experiment with different types of features, taking into consideration suggestions from previous work. We also consider emojis in tweet classification, and show their worth for tweet classification in a syndromic surveillance context. We compare both supervised and semi-supervised approaches to text classification in order to understand if and how we can utilise more of the data that we collect.

We discuss the data collection, pre-processing and analysis of tweets in order to extract a relevant signal for a given syndrome. We narrow our efforts to asthma and air pollution incidents in this paper.

Tweets were collected over multiple periods to account for seasonality in asthma activity and to have a higher chance of an air pollution event being observed. Different periods also enable us to monitor changes in the use of Twitter, as well as in the language used on Twitter, over time. We started with an autumn period (September 2015 to November 2015), followed by a summer period (June 2016 to August 2016) and a winter through to mid-summer period (January 2017 to July 2017).

Tweets were collected using the official Twitter streaming Application Programming Interface (API). The Twitter streaming API provides a subset of the Twitter stream free of charge; the whole stream can be accessed on a commercial basis. Studies have estimated that using the Twitter streaming API, users can expect to receive anywhere from 1% to 40% of tweets in near real-time [24]. The streaming API has a number of parameters that can be used to restrict the Tweets obtained. We extracted Tweets in the English language containing specific terms that may be relevant to a particular syndrome. For this, in conjunction with experts from Public Health England (PHE), we created a set of terms that may be connected to the specific syndrome under scrutiny, in this case asthma and difficulty breathing. We then expanded on this initial list using various synonyms from regular thesauri as well as from the Urban Dictionary (https://www.urbandictionary.com), as that may capture some of the more colloquial language used on Twitter. Examples of our keywords are "asthma", "wheezing", "couldn't breathe", etc. A full list of terms used is provided in the appendix.

We collected 10 million tweets over the three collection periods. The general characteristics of the collected tweets are reported in Table 2, and the structure of a Tweet is presented in the Status Map in Fig 1. There are a number of attributes associated with a Tweet that would be available to our analysis. We did not consider all the available Tweet attributes useful for our experiments, so we collected those that could help us in our task. More specifically, we collected "Tweet_Id", "text", "created_at", "user_id" and "source", as well as information that may help us establish location, such as "coordinates", "time_zone" and "place.country".
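The keyword-based restriction of the stream can be sketched as a simple matching function (a minimal illustration; the keyword list below is a small, illustrative subset rather than the full list from the appendix):

```python
# Illustrative subset of tracked terms (the full list is in the appendix).
KEYWORDS = ["asthma", "wheezing", "couldn't breathe", "inhaler"]

def matches_keywords(text, keywords=KEYWORDS):
    """Return True if the tweet text contains any tracked term (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in keywords)
```

The streaming API applies this kind of matching server-side via its track parameter; a local re-check like this is still useful when auditing what the stream returned.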
We stored the collected Tweets using MongoDB, an open source NoSQL database.

Because the aim of this project is to assess the utility of Twitter data for syndromic surveillance systems in England, we would like to exclude tweets originating from outside England. Doing this will give a more realistic signal; however, inferring the location of Twitter users is notoriously difficult. Fewer than 14% of Twitter users disclose city-level information for their accounts, and some of those may be false or fictitious locations [15]. Less than 0.5% turn on the location function, which would give accurate coordinate information from mobile devices. The time_zone, coordinates and place attributes, which we collected, can help in the geolocation of a tweet but are not always present or even correct.

For building a relevance classifier, accurate location is of limited importance. In this work, we are not overly concerned with accurate location filtering; for the purpose of symptomatic tweet classification for relevance filtering, location is of no importance. We collect tweets from the whole of the UK. We employ all three geolocation fields, filtering out tweets that do not have a UK timezone, a place in the UK or coordinates in the UK. We acknowledge that the location filtering is not entirely accurate and may have a disruptive effect when we compare our signal with public health data collected within England. However, we leave the task of improving on location filtering for future work, where we will extend our signal comparisons to include longer periods of time and other syndromes.

Table 3. Information on the data corpus collected after cleaning

In addition, we removed URLs, which are often associated with news items and blogs, and replaced them with the token "<URL>".
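The three-field UK filter described above can be sketched as follows (a hedged illustration: the timezone names, country string and bounding box are assumptions made for the sketch, not the exact values used in the study):

```python
UK_TIMEZONES = {"London", "Edinburgh"}  # illustrative timezone names

def likely_uk(tweet):
    """Pass a tweet if any of the three geolocation fields points to the UK.
    None of the fields is guaranteed to be present or correct."""
    if tweet.get("time_zone") in UK_TIMEZONES:
        return True
    if tweet.get("place.country") == "United Kingdom":
        return True
    coords = tweet.get("coordinates")
    if coords:
        lon, lat = coords
        # Rough UK bounding box (an approximation, not exact borders).
        return -8.6 <= lon <= 1.8 and 49.8 <= lat <= 60.9
    return False
```

Because the fields are checked independently, a tweet passes if any one of them looks British, which matches the permissive filtering described above.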
This helped with the identification of duplication, but also the identification of "bot" postings and news items. A "bot" is the term used when a computer program interacts with web services while appearing as a user. Tweets from bots, news outlets and web blogs are not relevant to syndromic surveillance, so we developed algorithms to identify and remove them. An overview of the data after cleaning, showing a considerable reduction in volume, is shown in Table 3.

Labelling took approximately 1 hour per 1,000 tweets. A second person checked the labels and flagged any tweets with labels that they did not agree with. These flagged tweets were then sent to a third person, who made the decision on which label to use. 23% of the labelled tweets were labelled as "relevant" while 77% were labelled as "irrelevant". A second set of 2,000 tweets from the last data collection period, selected at random, were later labelled following the same procedure. 32% of these tweets were labelled as relevant and 68% as irrelevant.

Although it is possible to use any sequence of letters or language tokens to represent text, words have been used successfully in language modelling and speech recognition [25]. Words are identified after a process of tokenisation and can then be used to represent a document by their presence or absence, without trying to retain any information on the ordering of words, their frequency or their relationship to one another. That approach is called "bag of words" and, despite its relative simplicity, can work well in many text mining scenarios. It is also possible, with the bag-of-words model, to use weighting schemes such as tf-idf (Term Frequency-Inverse Document Frequency) [26], and they may perform better than a boolean representation. However, some authors [8,27] have argued that more complex features can dramatically decrease the feature space while leading to better classification performance.
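The cleaning and tokenisation steps described above can be sketched as follows (a minimal sketch; the regular expression and whitespace tokeniser are simplified stand-ins for the actual pipeline):

```python
import re

URL_RE = re.compile(r"https?://\S+")

def clean_and_tokenise(text):
    """Replace URLs with the <URL> token (which aids duplicate and bot/news
    detection) and split the remainder into lowercase word tokens."""
    text = URL_RE.sub("<URL>", text)
    return [tok if tok == "<URL>" else tok.lower() for tok in text.split()]
```

The resulting token list is the input to the bag-of-words representation discussed above.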
We acknowledge that deep-learned word vectors are an effective avenue for text feature representation. However, the training and deployment of deep learning systems can be intensive and require considerable hardware resources. Instead of going down the deep learning route, we look towards building the cheapest and simplest system possible with little or no compromise on effectiveness. We believe this will make it easier for low and middle income countries (LMIC) to incorporate such systems at whatever scale.
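In that spirit, a cheap boolean bag-of-words representation can be built with scikit-learn alone (a sketch, not the study's exact configuration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# binary=True records presence/absence rather than counts, matching the
# boolean bag-of-words representation discussed above.
vectorizer = CountVectorizer(binary=True, lowercase=True)
X = vectorizer.fit_transform([
    "cant breathe today",
    "asthma awareness day",
])
```

Swapping in scikit-learn's TfidfVectorizer gives the tf-idf weighting mentioned earlier with a one-line change.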

Classification of tweets may be challenging, as tweets are very short and, in our scenario, the classes may share common vocabularies. That is, both relevant and irrelevant tweets could contain the same words. Twitter has specific language and styles of communication that people use. In particular, we found that emojis and emoticons are promising additional tokens that we could exploit in classification:

• An emoticon is a pictorial representation of a facial expression using punctuation marks, numbers and letters, usually written to express a person's feelings or mood. :-) is an example of an emoticon.

• Emojis, on the other hand, are miniature graphics of various objects and concepts, including facial expressions (a smiling-face graphic, for example).

Emoticons and emojis can be used for the same purpose. However, emojis have seen a recent surge in popularity, presumably because they provide colourful graphical representations as well as a richer selection of symbols. In fact, as Table 1 shows, there were a large number of emojis in our corpus. A further advantage is that they may transcend language barriers.

We believe that emoticons and emojis can help with assessing the tone of a tweet.

Tweets we are interested in will most likely have a negative tone, as they reflect people expressing that they are unwell or suffering symptoms. This means they may contain one or more emojis/emoticons denoting sadness, anger or tiredness, for example. On the other hand, the presence of emojis/emoticons denoting happiness and laughter in a tweet may be an indication that the tweet is not relevant to our context of syndromic surveillance.

We also investigate more complex features derived from our words or additional tokens, addressing shortcomings that words may present when applied to Twitter data. Word classes are labels that Lamb et al. [8] found useful in the context of analysing tweets to categorise them as related to infection or awareness. The idea is that many words can behave similarly with regard to a class label. A list of words is created for different categories such as "possessive words" or "infection words". Word classes function similarly to bag-of-words features in that the presence of a word from a word class in a tweet triggers a count-based feature. We manually curated a list of words and classes, which are shown in Table 4. As we applied lemmatisation and stemming, we did not include multiple inflections of the words in our word classes.

The usefulness of the Denotes Laughter feature was augmented by also checking for the presence of a small list of established and popular internet slang terms for laughter or humour, such as "lol" or "lmao", which stand for "Laughing Out Loud" and "Laughing My Ass Off" respectively. Table 5 shows this feature's distribution over the data.
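The laughter and emoji-based tone features can be sketched as simple token counts (the token and emoji lists here are small illustrative stand-ins, not the curated lists used in the study):

```python
LAUGHTER_TOKENS = {"lol", "lmao", "haha"}        # illustrative slang list
LAUGHTER_EMOJIS = {"\U0001F602"}                  # face with tears of joy
NEGATIVE_EMOJIS = {"\U0001F622", "\U0001F62B"}    # crying face, tired face

def tone_features(tokens):
    """Count-based features capturing the tone of a tokenised tweet."""
    return {
        "denotes_laughter": sum(t in LAUGHTER_TOKENS or t in LAUGHTER_EMOJIS
                                for t in tokens),
        "negative_emojis": sum(t in NEGATIVE_EMOJIS for t in tokens),
    }
```

A high laughter count suggests an irrelevant tweet, while negative emojis point towards genuine suffering, in line with the reasoning above.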

We decided to include this feature because of the ubiquity of emojis on Twitter and because we wanted to investigate their potential. Table 5 shows this feature's distribution over the data. We find that this feature may be the most discriminative of the three: of the instances with a positive value, a high percentage belong to the "relevant" class, and of the instances with a negative value, a high percentage belong to the "not relevant" class. We experimented with two other features, Contains Asthma-Verb Conjugate and Indicates Personal Asthma Report, but found that they underperformed compared to the other features, so we do not report on them. We also constructed features from the

A classification algorithm for text can be used to automatically classify tweets, in this case into the categories relevant/not relevant. We first applied a variety of popular and powerful supervised classification algorithms to the data, namely Naive Bayes,

Decision Trees, Logistic Regression and Support Vector Machines. We used the Python implementations found in the Natural Language Toolkit (NLTK) and scikit-learn [28].

Due to the relatively limited number of labelled instances in our data set, we decided to take a semi-supervised approach to learning. We implemented a semi-supervised approach which is suited to small to medium sized datasets [29]. Semi-supervised learning attempts to make use of the combined information from labelled and unlabelled data to exceed the classification performance that would be obtained either by discarding the unlabelled data and applying supervised learning, or by discarding the labels and applying unsupervised learning. Our intention is to extend the labelling in a semi-supervised fashion. We make use of the heuristic approach to semi-supervised learning and employ a self-training iterative labelling algorithm. We then extend this work by using a form of co-training.

3.2.1 Self-training model

We adopted an Iterative Labelling Algorithm for semi-supervised learning [30]. Iterative labelling algorithms are closely related to, and are essentially extensions of, the Expectation-Maximization (EM) algorithm put forward by Dempster et al. [31].

For our choice of supervised learning algorithm, we selected the Logistic Regression classifier after experimenting with different supervised models and finding it to perform best. We used the trained Logistic Regression classifier's predictions to label unlabelled instances in the Assign-Labels function. We set our stopping condition such that the iteration stops either when all the unlabelled data is exhausted or when there begins to be a continued deterioration in performance as more data is labelled. Along with the predicted class of an instance, we also compute the model's confidence in its classification.
Our algorithm, inspired by Truncated Expectation-Maximization (EM) [32], then grows the labelled set L based on the confidence of our model's classification. When an instance from the remaining unlabelled set R is classified, if the confidence of the classification is greater than some set threshold θ, the instance is labelled. Considering this, our algorithm falls within the confidence-based category of iterative labelling or self-training algorithms, because it selects instances for which the trained classifier has high confidence in its predictions.
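The confidence-based self-training loop can be sketched as follows (a minimal sketch using scikit-learn's LogisticRegression; the stopping criterion is simplified relative to the deterioration check described above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, theta=0.9, max_iter=10):
    """Confidence-based iterative labelling: repeatedly train a classifier,
    move unlabelled instances whose predicted-class probability exceeds the
    threshold theta into the labelled set, and retrain."""
    X_lab, y_lab = np.asarray(X_lab, float), np.asarray(y_lab)
    X_unlab = np.asarray(X_unlab, float)
    clf = LogisticRegression().fit(X_lab, y_lab)
    for _ in range(max_iter):
        if len(X_unlab) == 0:
            break
        proba = clf.predict_proba(X_unlab)
        conf = proba.max(axis=1)
        confident = conf >= theta
        if not confident.any():
            break  # nothing confident enough to assimilate
        new_y = clf.classes_[proba.argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, new_y[confident]])
        X_unlab = X_unlab[~confident]
        clf = LogisticRegression().fit(X_lab, y_lab)
    return clf
```

Raising theta makes the updates more conservative; only instances the current model is already sure about are assimilated.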

Confidence-based iterative labelling algorithms can tend toward excessively conservative updates to the hypothesis, since training on high-confidence examples that the current hypothesis already agrees with will have relatively little effect [32]. In addition, it has been proven that in certain situations many semi-supervised learning algorithms can significantly degrade performance relative to strictly supervised learning [33,34].

To address the problems of self-training, we take some ideas from co-training [35] to try to improve our algorithm. Co-training requires different views of the data so that multiple classifiers can be maintained for the purpose of labelling new instances. Recall that each tweet can be represented as a feature vector T_i with various features. We now distinguish two representations. The first is a concatenation of our Bag-of-Words, Word Classes, Denotes Laughter and Negative Emojis/Emoticons features. We represent this feature space as X1. The second is a concatenation of our Positive and Negative Word Counts and Emojis/Emoticons features. We represent this feature space as X2. We can think of X1 as the taxonomical feature space, as it is characterised by its inclusion of the Word Classes feature, while X2 can be seen as the sentimental feature space, characterised by its inclusion of the Positive and Negative Word Counts feature. As such, X1 and X2 offer different, though overlapping, views of the dataset. Each tweet is then represented as a feature vector from each of these spaces.

We now maintain two separate classifiers trained on different views of the data. A key principle of co-training is that new labels come from a source other than the classifier that will be updated with them [30].

We started with an initial set of manually labelled data containing 3,500 tweets.
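One exchange step of this co-training scheme can be sketched as follows (a sketch: in the full algorithm the exchanged labels are folded back into each classifier's training set and both classifiers are retrained):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train_step(clf1, clf2, X1_unlab, X2_unlab, theta=0.9):
    """One co-training exchange: each view's classifier proposes labels for the
    unlabelled instances it is confident about; crucially, the labels each
    classifier receives come from the *other* view's classifier."""
    p1 = clf1.predict_proba(X1_unlab)
    p2 = clf2.predict_proba(X2_unlab)
    conf1 = p1.max(axis=1) >= theta                      # clf1's confident picks
    conf2 = p2.max(axis=1) >= theta                      # clf2's confident picks
    for_clf2 = (X2_unlab[conf1], clf1.classes_[p1.argmax(axis=1)][conf1])
    for_clf1 = (X1_unlab[conf2], clf2.classes_[p2.argmax(axis=1)][conf2])
    return for_clf1, for_clf2
```

Because each classifier is updated with labels produced by the other view, errors that one view's hypothesis would reinforce in pure self-training can be corrected by the other.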
Another important aspect of imbalanced data, and of classification in general, is having the right performance metric for the assessment of a classification model [37]. Overall accuracy is a misleading measure [38], as it may only reflect the prevalence of the majority class. This is called the accuracy paradox: we could get high accuracy by classifying all tweets as irrelevant. That would, however, not improve our signal. The aim of our endeavour is to identify tweets which might suggest an increase in cases for a particular syndrome (asthma/difficulty breathing) for the purpose of syndromic surveillance. We therefore report Recall, the probability that a relevant tweet is predicted as relevant,

Recall = TP / (TP + FN),

and Precision, the probability that a tweet predicted as relevant is actually relevant,

Precision = TP / (TP + FP),
where TP, FP and FN stand for True Positives, False Positives and False Negatives respectively.

Precision and recall are often competing quantities. A measure that combines precision and recall is the F-measure or F-score [40]. The generic version of this is:

Fβ = (1 + β²) · (Precision · Recall) / (β² · Precision + Recall)
With β = 1, this becomes the traditional or balanced F1-measure. With β = 2, the F2-measure weighs recall higher than precision and so may be more suited to our purpose.

We also assessed the discriminative ability of each of our features by performing feature ablation experiments [41]. We evaluated the performance of a given classifier when using all our features, and then again after removing each one of these features in turn. The difference in performance is used as a measure of the importance of the feature. We chose to use the difference in the F1 metric rather than F2 in this analysis because we wanted to convey how the features performed in the general task of tweet classification.

We also performed some analysis on the word features to learn which words in our vocabulary were the best indicators of relevant tweets. We analysed the bag-of-words features with a per-class score, where C is the set of all classes and c is a possible class.
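The generic F-measure above can be written directly as a small helper (the conventional formula, with β = 2 weighting recall over precision):

```python
def f_beta(precision, recall, beta=2.0):
    """Generic F-measure. beta > 1 weighs recall more heavily than precision,
    which suits relevance filtering where missing a relevant tweet is costly."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For a fixed precision, increasing beta moves the score towards the recall value, which is why F2 is reported alongside F1 throughout the evaluation.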

Recall that to collect tweets, we made use of Twitter's streaming API, which allowed us to specify keywords to restrict the data collection to tweets containing those specific terms. We measured the usefulness of the keywords we selected by assessing their information retrieval performance, specifically using the precision-recall metric. In an information retrieval context, precision and recall are defined in terms of a set of retrieved documents and their relevance. We use our original set of labelled tweets for this assessment (i.e. the set of 3,500 tweets). In our scenario, the labelled tweets make up the set of retrieved documents, and the tweets labelled as belonging to the "relevant" class make up the set of relevant documents. In this context, recall measures the fraction of relevant tweets that are successfully retrieved, while precision measures the fraction of retrieved tweets that are relevant to the query.

For the iterative labelling experiments, we varied and tuned the confidence thresholds until we found the best results and reported those. Below, we also discuss in more detail how the confidence threshold affected the iterative labelling performance, as it is a key aspect of the algorithms. The best fully supervised approach, according to a combination of the F1 and F2 scores, was the Logistic Regression classifier, which achieved an F2 score of 0.764 on the test data. This equated to an overall prediction accuracy of 91.5%. The best semi-supervised approach, which was the co-training algorithm, achieved an F2 score of 0.903 on the test data, with an overall predictive accuracy of 92.3%. Overall, the semi-supervised approach is more accurate and achieves higher F scores. To confirm what we concluded from the results, we applied a paired t-test to the difference in F2 scores between the fully supervised logistic regression algorithm and the co-training algorithm.
Before carrying out this test, we confirmed that the data satisfied the assumptions necessary for the paired t-test to be applicable: continuous, independent, normally distributed data without outliers. This resulted in a t-statistic of 18 and a p-value of 9.6 × 10⁻¹⁴, which suggests that the difference between the F2 scores of the two algorithms was not due to chance.

Table 7. Results of relevance classification on the test data. Naive Bayes (NB), Decision Tree (DT), Logistic Regression (LR) and Support Vector Machine (SVM) algorithms are reported together with the self-training and co-training iterative labelling algorithms.
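The test statistic for a paired t-test can be computed from matched per-fold scores (a stdlib-only sketch; the score lists below are made-up illustrations, not the study's actual per-fold results):

```python
import math

def paired_t_statistic(a, b):
    """t statistic for a paired t-test on two matched lists of scores."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

# Hypothetical paired F2 scores per evaluation run (illustrative only).
t = paired_t_statistic([0.90, 0.91, 0.89, 0.92, 0.90],
                       [0.75, 0.77, 0.76, 0.78, 0.76])
```

A large positive t, well above any conventional critical value, corresponds to a very small p-value, as with the t-statistic of 18 reported above.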
To give a better understanding of how the different measures balance the number of FP and FN, we also present the confusion matrices for both the best performing fully supervised and semi-supervised methods on the test data. These confusion matrices are shown in Tables 8 and 9 respectively. We see that the supervised approach shown in Table 8 will retain 243 tweets in total (predicted as positive). Of those, 215 are correctly retained positive tweets; however, 76 positive tweets will be discarded. In contrast, the semi-supervised approach shown in Table 9 will retain 347 tweets, of which 273 will be positive, with only 18 positive tweets being discarded. From the confusion matrices, we see that the semi-supervised approach performs better for the purpose of syndromic surveillance, as it yields only 18 false negatives even though it also yields 74 false positives. Considering that our aim is to develop a filtering system to identify the few relevant tweets in order to register a signal for syndromic surveillance, it is critical to have high recall, hopefully accompanied by high precision, and therefore high accuracy. The semi-supervised method is able to identify and retain relevant tweets more often, while also being able to identify irrelevant tweets to a reasonable degree. Hence, even with a shortage of labelled data, the semi-supervised algorithms can be used to filter and retain relevant tweets effectively.

Table 9. Confusion matrix for the co-training semi-supervised algorithm on the test data

A key parameter is how confident the semi-supervised system needs to be in its classification before assimilating an instance to inform future decisions. We observed co-training with logistic regression to perform best.
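Recall can be read straight off these confusion matrices, using the counts reported above:

```python
def recall_from_confusion(tp, fn):
    """Recall = TP / (TP + FN), computed from confusion-matrix counts."""
    return tp / (tp + fn)

supervised_recall = recall_from_confusion(215, 76)       # ~0.74
semi_supervised_recall = recall_from_confusion(273, 18)  # ~0.94
```

The gap between the two recall values is what makes the semi-supervised filter preferable when the relevant tweets are scarce.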
We also observed that for lower confidence thresholds between 0.1 and […]

The main issue with iterative labelling algorithms is that, because the classifiers are not perfect and do not have 100% accuracy, we cannot be sure that the unlabelled instances they label for assimilation are always correct. This means that how well they initially perform before starting any iterations is vital. Consider a classifier that initially performs poorly (with an accuracy of 0.2, for example). When classifying unlabelled instances with which to train itself, 80% of its classifications will be wrong, so it will assimilate false hypotheses, which will in turn make its performance in the next iteration even worse, and so on. Conversely, if the initial accuracy is high, the classifier is more likely to correctly classify unlabelled instances and will be more resistant to the drop in performance caused by assimilating false hypotheses. We conducted an experiment to measure the quality of the automatically labelled instances assimilated by our semi-supervised classifiers. For this exercise, we used the second set of labelled tweets from a different time period as the "unlabelled" set to which the iterative labelling is applied. The same training set as in our other experiments was used for the initial training stage. The self-training and co-training processes were initiated, applying these classifiers to the alternative set of labelled data (around 2000 instances) in steps of 200. Fig 3 shows a plot of the proportion of correctly classified instances that the iterative labelling process assimilated. The co-training approach had a higher rate of being correct when making new additions. This was in fact the aim of adopting co-training with its multiple different views of the same data.
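The iterative labelling procedure can be sketched as follows. This is a deliberately simplified illustration: the toy one-dimensional classifier and its crude confidence proxy stand in for our real bag-of-words tweet classifiers purely to make the loop runnable, and `batch` plays the role of the 200-instance steps above.

```python
class MidpointClassifier:
    """Toy 1-D classifier: splits at the midpoint of the two class means.

    A stand-in for the real tweet classifiers, used only for illustration.
    """
    def fit(self, labelled):
        pos = [x for x, y in labelled if y == 1]
        neg = [x for x, y in labelled if y == 0]
        self.boundary = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

    def predict(self, x):
        label = 1 if x > self.boundary else 0
        confidence = min(1.0, abs(x - self.boundary) / 5.0)  # crude proxy
        return label, confidence

def self_train(clf, labelled, unlabelled, threshold=0.9, batch=200):
    """Assimilate unlabelled instances the classifier labels confidently.

    Instances below the confidence threshold are simply discarded in this
    sketch; variants may instead return them to the pool for later rounds.
    """
    labelled, pool = list(labelled), list(unlabelled)
    while pool:
        clf.fit(labelled)                       # retrain on current labels
        chunk, pool = pool[:batch], pool[batch:]
        for x in chunk:
            label, confidence = clf.predict(x)
            if confidence >= threshold:         # only assimilate confident calls
                labelled.append((x, label))
    clf.fit(labelled)
    return clf

clf = self_train(MidpointClassifier(),
                 labelled=[(1, 0), (2, 0), (8, 1), (9, 1)],
                 unlabelled=[0.5, 1.5, 5.2, 8.5, 9.5])
```

The key risk discussed above is visible in the loop: a mistaken but confident label enters `labelled` and shapes every subsequent `fit`, which is why the initial accuracy matters so much.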
The proportion of correct assimilations of both the self-training and co-training methods rises as more data is assimilated, as the classifiers improve with each batch. Although we could not test beyond 2000 instances (because of our limited labelled data), we believe that the proportion of correct assimilations will increase up to a certain point, after which it will plateau. We expect this plateau because, at some point, the iterative learning classifiers will have nothing new to learn from additional data after having been exposed to so much.

As with the features constructed, we tested how the classifiers would perform on new data collected at a different time period, to assess whether shifts in language and colloquialisms could have an impact on performance. Our classifiers were built on data from the first collection period (see Table 2). For a simple assessment, we applied our trained model to tweets collected in the most recent collection period, which had a time gap of two years from the original data. Our semi-supervised approach based on co-training achieved a precision of 0.400 and a recall of 0.628 on the 2,000 labelled tweets from the most recent collection period. This means an F1 score of 0.488 and, more importantly, an F2 score of 0.564. For comparison purposes, we also applied the fully supervised logistic regression algorithm to the data from this new time period.

This yielded a precision of 0.510 and a recall of 0.419, meaning an F1 score of 0.460 and an F2 score of 0.434. In both cases, we observe a deterioration in performance when the classifiers are introduced to tweets from a different time period. This raises an important issue to consider about how language online changes over time. Although it changes very gradually, after a period of one or two years the changes are substantial enough to render the natural language-based models less effective.

Table 10 shows the results of the feature ablation experiments. We found that negative emojis/emoticons were the most discriminative of our features, followed by the Denotes Laughter feature in the supervised approach, which also captures emojis as well as colloquialisms, and Positive/Negative Word Count in the semi-supervised approach. All three of these features capture the mood of a tweet. We also found that our additional […]

Table 11 shows the words found to be most informative. For example, the table shows that, of the tweets containing the word chest, 96% are relevant and only 4% are irrelevant. The training data is used for this calculation. A surprising negative predictor was the word health: when health appeared in a tweet, the tweet was irrelevant 94% of the time. The word pollution shows a similar trend. This suggests that when Twitter users are expressing health issues, they may not use precise or formal terms, opting instead for simple symptomatic and emotional words such as chest, cold or wow. The more formal terms may be more often associated with news items or general chat and discussion.
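The per-word class proportions behind Table 11 can be computed as below. The four tweets are made-up toy data; the figures reported in the paper come from our actual training set.

```python
from collections import Counter

def relevant_share(tweets):
    """For each token, the share of tweets containing it that are relevant.

    `tweets` is a list of (text, label) pairs with label 1 = relevant.
    """
    containing, relevant = Counter(), Counter()
    for text, label in tweets:
        for token in set(text.lower().split()):  # count each token once per tweet
            containing[token] += 1
            relevant[token] += label
    return {t: relevant[t] / containing[t] for t in containing}

# Toy labelled tweets for illustration only.
toy = [
    ("my chest hurts cannot breathe", 1),
    ("chest feels so tight today", 1),
    ("air pollution report on the news", 0),
    ("new health awareness campaign", 0),
]
shares = relevant_share(toy)
```

On this toy data, `chest` appears only in relevant tweets and `health` only in irrelevant ones, mirroring the pattern the table shows at scale.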

Using this information, we could include some of the more relevant but perhaps unexpected words as keywords when collecting streaming tweets from Twitter, in order to better target and collect relevant tweets.

Table 11. Most informative words measured by their informativeness and their relevant to irrelevant prior probabilities.

We also investigated which emojis were most prevalent in our data set, as well as how often each emoji showed up in a tweet of each class. Fig 4 shows the frequency with which each emoji occurred in the labelled tweets. It shows that only a few emojis appear very frequently in tweets collected in our context. This means that only a few important emojis were needed for determining tweet relevancy, as opposed to monitoring for the full emoji dictionary. Table 12 shows a list of some emojis and the distribution of classes that tweets belonged to whenever they contained said emoji. Overall, it can be seen that each of these emojis tends to lean heavily towards one class. This shows that they could be quite discriminative and useful indicators of class membership, and hence helpful features.

Table 13 shows the results of the assessment of keywords used in tweet collection. We found that asthma, pollution and air pollution were the keywords that yielded the most results, at 1313, 757 and 509 out of a total of 3500. Wheezing, fumes and inhaler were next with 219, 132 and 121 tweets respectively. The remaining keywords returned very few results (below 40) or no results. Asthma had the highest recall but not very high precision, so most of its results were irrelevant. Wheezing, inhaler, wheeze, cannot breathe, can't breathe, difficulty breathing and short of breath have good precision, although their recall is not that high. Some of those keywords express direct symptoms of the syndrome under investigation; hence, we expect good precision.
Tight chest and pea souper have very high precision but only appeared in two tweets each. Of the keywords used, wheezing was the most useful in that it brought in a lot of results, most of which were relevant. We included a common misspelling, asma, of the keyword with the highest recall power, and found that it only appeared in 4 tweets. We hypothesize that this is because most users of Twitter post from devices capable of autocorrect, hence it may not be necessary to worry about misspellings of keywords.

The informativeness, I, was calculated when the keywords were also features in the classifiers and is presented in Table 14. Most of the keywords were not informative as features, with an informativeness ratio of 1:1 for relevant:irrelevant tweets, so they are not included. We found some overlap where streaming keywords were informative in the relevance model, though not always associated with the relevant class. For example, pollution, which was a keyword, appeared in the ranking of top 15 […]

The resulting time series shows the daily proportion of relevant symptomatic tweets and consultations/calls as observed on Twitter and recorded by PHE (Fig 5 and Fig 6). The signals were smoothed using a 7-day moving average to remove the fluctuations in daily activity for GPOOH data, as that service receives more usage over the weekends. We also included a time series showing the Twitter signal without any filtering for further perspective. We see that the time series plots of the self-training and co-training filtering follow a similar trend to the GP data time series. Also, the time series for the Twitter data without any filtering has many spurious peaks in relation to the ground truth data (i.e. the syndromic surveillance data).
Both of these observations together suggest that Twitter data might mirror the health activity of a population and that relevance filtering is useful in reducing noise and obtaining a clearer picture of such activity. Additionally, we see that while the unfiltered Twitter signal does not match well with the asthma/wheeze/difficulty breathing signal, it still seems to match better than the diarrhoea signal.

Semi-supervised approaches to classification have been used for sentiment analysis [44,45]. They can enable more of the collected data to be used for training the classifier, bypassing some of the labelling effort. Johnson et al. [46] used a method called label propagation and reported an accuracy of 78%. Baugh [20] proposed a hierarchical classification system with self-training and reported an accuracy of 61% and an F1 score of 0.54. We have implemented an iterative labelling semi-supervised approach which seems to have competitive performance and also enables us to use more of the training data without the effort of labelling.

Furthermore, we obtain an improvement in recall over the supervised method, which is important given that the signal we are trying to preserve for syndromic surveillance may be weak. We compare our semi-supervised system to others above, but we acknowledge that applications in different domains might weaken the comparison. Baugh [20] also applied semi-supervised systems to tweet classification, though not for syndromic surveillance, so this comparison might be of more value.

We have also identified strong and novel features in the context of tweet classification: emojis. We have hinted at the growing use of emojis [47] and their importance in establishing the tone of a tweet, which in turn is important to relevance classification. Emojis cross language boundaries and are often used by people expressing conditions of interest to syndromic surveillance. Our custom features constructed from Twitter colloquialisms, including emojis, proved effective in improving classification performance. Of all our custom features, the one that stood out most was the Negative Emojis/Emoticons feature. Emoticons have been used previously [8]; emojis work even better than emoticons, and their uniformity is a real advantage. A smile emoticon could be written in the form ":-D" or ":D". However, because emojis are Unicode-encoded pictographs with a set standard [48], there exist no variants of the same emoji. In a learning scenario, this reduces fragmentation and duplication of features, making them better suited as features than emoticons.
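A sketch of how this uniformity plays out in code: because each basic emoji is a single Unicode codepoint, a simple range check is enough to extract them, with no variant spellings to normalise. The ranges below cover the common emoji blocks but are not exhaustive, and multi-codepoint sequences (skin tones, ZWJ combinations) are ignored for simplicity.

```python
# Codepoint ranges covering the common emoji blocks (not exhaustive).
EMOJI_RANGES = [
    (0x1F300, 0x1F5FF),  # Miscellaneous Symbols and Pictographs
    (0x1F600, 0x1F64F),  # Emoticons (laughter, crying, etc.)
    (0x1F680, 0x1F6FF),  # Transport and Map Symbols
    (0x2600, 0x27BF),    # Miscellaneous Symbols and Dingbats
]

def extract_emojis(text):
    """Return the emoji characters appearing in a tweet."""
    return [ch for ch in text
            if any(lo <= ord(ch) <= hi for lo, hi in EMOJI_RANGES)]
```

By contrast, an emoticon extractor would need a pattern per variant (":-D", ":D", and so on), which fragments the feature space.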

In terms of geolocation of tweets, we have found that most of the obvious location indicators are not well populated, and those that are may not be accurate. Hence, future work must tackle geolocation as a real part of the problem of establishing a proper signal from Twitter. After comparing our extracted Twitter signal to real world syndromic surveillance data, we found a positive, albeit weak, correlation. This suggests that there is a relationship between asthma-related Twitter activity and syndromic surveillance data for asthma and breathing-related incidents. While the actual correlation value indicates a weak relationship, it still suggests that we can detect relevant activity on Twitter which is similar or complementary to that which is collected by traditional means. The strength of the correlation might be affected by the weak location filtering that we have been able to perform. As we discussed, the syndromic surveillance data relates to England, but the Twitter data has only been located (not accurately) to the UK. As future work, we plan to assess the full detection capability of Twitter by repeating this analysis prospectively over a longer time period, and for different syndromes, allowing us to determine whether Twitter can detect activity that is of potential benefit to syndromic surveillance.

We also found that "what to collect" is problematic, as the data collection of tweets by keywords requires a carefully chosen list of keywords. Furthermore, our experimentation with different types of features, such as emojis, also tells us that the vocabulary used on Twitter differs from expression in other settings (e.g. as part of a medical consultation). Hence we may need to widen our data collection terms to include emojis, emoticons and other types of informal expressions.
We may also need to develop adaptive systems in which the set of data collection keywords is dynamically updated to collect truly relevant tweets. An idea for future research is therefore to begin with a set of keywords, collect tweets, perform relevance analysis, and then update the keyword/token list to reflect those that associate with the most relevant tweets, eliminating any keywords/tokens that are not performing adequately.

We also saw that vocabulary and the use of tokens change over time. Negative emojis/emoticons appeared more often in the second time period, with the proportion of labelled tweets containing them up from 5.5% to 14.4%. This could suggest that over the past two years, the use of emojis as a form of expression has grown. However, their prevalence in each class also changed, which may explain the marked deterioration in precision observed in the classification performance. We performed our research on data collected within a two-year period, but further data collection and experimentation would be beneficial to understand the temporality of the models generated as Twitter conversations change over time.
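The signal-comparison step described earlier (7-day moving-average smoothing of the daily series, followed by correlation against the syndromic surveillance series) can be sketched with the standard library. The two series below are toy placeholders; the weak positive correlation we report comes from the real Twitter and PHE data.

```python
import math

def moving_average(series, window=7):
    """Trailing moving average, used to smooth weekend effects in daily counts."""
    return [sum(series[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(series))]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy daily series standing in for tweet proportions and GP consultations.
tweets = [3, 4, 2, 5, 6, 4, 3, 7, 8, 6, 5, 9, 7, 6]
gp     = [2, 3, 2, 4, 5, 4, 3, 6, 7, 5, 5, 8, 6, 6]
r = pearson_r(moving_average(tweets), moving_average(gp))
```

Smoothing before correlating removes the day-of-week structure that would otherwise dominate the comparison between the two services.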