Unsupervised extraction of epidemic syndromes from participatory influenza surveillance self-reported symptoms

Seasonal influenza surveillance is usually carried out by sentinel general practitioners (GPs) who compile weekly reports based on the number of influenza-like illness (ILI) clinical cases observed among visited patients. This traditional practice for surveillance generally presents several issues, such as a delay of one week or more in releasing reports, population biases in the health-seeking behaviour, and the lack of a common definition of ILI case. On the other hand, the availability of novel data streams has recently led to the emergence of non-traditional approaches for disease surveillance that can alleviate these issues. In Europe, a participatory web-based surveillance system called Influenzanet represents a powerful tool for monitoring seasonal influenza epidemics thanks to aid of self-selected volunteers from the general population who monitor and report their health status through Internet-based surveys, thus allowing a real-time estimate of the level of influenza circulating in the population. In this work, we propose an unsupervised probabilistic framework that combines time series analysis of self-reported symptoms collected by the Influenzanet platforms and performs an algorithmic detection of groups of symptoms, called syndromes. The aim of this study is to show that participatory web-based surveillance systems are capable of detecting the temporal trends of influenza-like illness even without relying on a specific case definition. The methodology was applied to data collected by Influenzanet platforms over the course of six influenza seasons, from 2011-2012 to 2016-2017, with an average of 34,000 participants per season. Results show that our framework is capable of selecting temporal trends of syndromes that closely follow the ILI incidence rates reported by the traditional surveillance systems in the various countries (Pearson correlations ranging from 0.69 for Italy to 0.88 for the Netherlands, with the sole exception of Ireland with a correlation of 0.38). The proposed framework was able to forecast quite accurately the ILI trend of the forthcoming influenza season (2016-2017) based only on the available information of the previous years (2011-2016). Furthermore, to broaden the scope of our approach, we applied it both in a forecasting fashion to predict the ILI trend of the 2016-2017 influenza season (Pearson correlations ranging from 0.60 for Ireland and UK, and 0.85 for the Netherlands) and also to detect gastrointestinal syndrome in France (Pearson correlation of 0.66). The final result is a near-real-time flexible surveillance framework not constrained by any specific case definition and capable of capturing the heterogeneity in symptoms circulation during influenza epidemics in the various European countries.

Seasonal influenza is an acute contagious respiratory illness caused by viruses that can 2 be easily transmitted from person to person. Influenza viruses circulate worldwide 3 causing annual epidemics with the highest activity during winter seasons in temperate 4 regions and produce an estimated annual attack rate of 3 to 5 million cases of severe 5 illness and about 250 to 500 thousand deaths around the world [1]. National surveillance 6 systems monitor the influenza activity through a network of general practitioners (GPs) 7 who report the weekly number of influenza-like illness (ILI) cases among the overall 8 patients [2]. These traditional surveillance systems for seasonal influenza are usually the 9 primary source of information for healthcare officials and policymakers for monitoring 10 influenza epidemics. However, classification of ILI cases in GPs reports is usually based 11 on common clinical symptoms observed among patients and, as with any 12 syndromic-based disease surveillance, case definitions of "influenza-like illness" can 13 vary [3][4][5][6][7]. They typically include fever, cough, sore throat, headache, muscle aches, 14 nasal congestion and weakness. There are also some works from hospital-based 15 studies [8,9], age-specific antiviral trials [4,10,11] and national surveillance 16 activities [12] aimed at exploring suitable ILI case symptomatic descriptions but, so far, 17 no unique definition has been widely adopted by the various national surveillance 18 systems worldwide. For this reason, seasonal influenza surveillance in European 19 countries remains rather fragmented. Only in recent years, some state members have 20 adopted the case definition provided by the European Center for Disease Control and 21 Prevention (ECDC) which defines an ILI case as the sudden onset of symptoms with 22 one or more systemic symptoms (fever or feverishness, malaise, headache, myalgia) plus 23 one or more respiratory symptoms (cough, sore throat, shortness of breath) [13]. 24 Nevertheless, a significant fraction of European countries still adopts their own clinical 25 case definition to compile seasonal influenza surveillance weekly reports. 26 In recent years the availability of novel data streams has given rise to a variety of 27 non-traditional approaches for monitoring seasonal influenza epidemics [14][15][16]. Such 28 new digital data sources can be exploited to capture additional surveillance signals that 29 can be used to complement GPs surveillance data [17][18][19][20]. In this context, some 30 so-called participatory surveillance systems have emerged in several countries around 31 the world with the aim of monitoring influenza circulation through Internet reporting of 32

PLOS
2/24 self-selected participants [21][22][23]. One of these systems, the Influenzanet project [21], 33 has been established in Europe since 2011 and it is now present in ten European 34 countries. The system relies on the voluntary participation of the general population 35 through a dedicated national website in each country involved in the project. Data are 36 obtained on a weekly basis through an online survey [24] where participants are invited 37 to report whether they experienced or not any of the following symptoms since their 38 last survey: fever, chills, runny or blocked nose, sneezing, sore throat, cough, shortness 39 of breath, headache, muscle/joint pain, chest pain, feeling tired or exhausted, loss of 40 appetite, coloured sputum/phlegm, watery/bloodshot eyes, nausea, vomiting, diarrhoea, 41 stomach ache, or other symptoms. Differently from most traditional surveillance 42 systems, this participatory form of online surveillance allows the collection of symptoms 43 in real time and directly from the general population, including those individuals who 44 do not seek health care assistance. The list of proposed symptoms has been chosen to 45 include the various ILI definitions adopted by national surveillance systems in Europe 46 and, at the same time, to get a comprehensive list of symptoms that could be clearly circulating flu-related illnesses. Even though participatory systems generally suffer from 49 self-selection bias, causing the sample to be non-representative of the general 50 population [25], previous works have shown that web-based surveillance data can 51 provide relevant information to estimate age-specific influenza attack rates [26,27], 52 influenza vaccine effectiveness [28], risk factors for ILI [29,30], and to assess health care 53 seeking behavior [31]. Moreover, it has been largely demonstrated that weekly ILI 54 incidence rates that can be extracted from web-based surveillance data correlate well 55 with ILI incidence as reported by GPs surveillance [27,32,33] (in such approaches, the 56 number of ILI cases among the web platform participants was determined by applying 57 the ECDC case definition [13] to the set of symptoms reported by participants).

58
An additional advantage of collecting symptoms directly from individuals among the 59 general population in the various Influenzanet countries is that it is straightforward to 60 compare the prevalence and the temporal dynamics of specific symptoms and/or groups 61 of symptoms from one country to the other. In this work, we propose an approach 62 which focuses on studying the temporal trends of groups of symptoms, hereafter called 63 syndromes, as collected through the Influenzanet platforms. The goal is to develop a 64 mathematical framework able to extract, in an unsupervised fashion, the groups of 65 symptoms that are in good correlation with the ILI incidence as detected by traditional 66 surveillance systems without imposing a priori a specific ILI case definition. By using 67 the daily occurrence of symptoms (represented as boolean variables, with value 1 if a 68 symptom is present and value 0 otherwise) as reported by the Influenzanet participants, 69 it is possible to build a matrix whose rows are the weekdays during an influenza season 70 and the columns are occurrences of symptoms as reported by the participants. Each 71 matrix element thus corresponds to the number of times a specific symptom has been 72 reported during a specific day of the influenza season. The result is a high-dimensional 73 sparse data set from which meaningful features can be automatically extracted with the 74 use of mathematical tools such as Non-negative Matrix Factorization (NMF) [34]. In 75 particular, we are interested in extracting latent 1 features of the matrix, corresponding 76 to linear combinations of groups of symptoms, that are deemed as relevant by the NMF 77 algorithm. By assuming that a specific combination of reported symptoms is the 78 symptomatic expression of one or more illnesses experienced by the participants, i.e. of 79 the syndromes affecting the individual, we can select those syndromes which better 80 correlate with the traditional surveillance ILI incidence and adopt such linear 81 1 Throughout this study we employ the term latent as used in computer science, i.e. referring to variables that are hidden, not directly observed, but rather inferred through a mathematical model. There is no reference to the medical use of the term that usually indicates an asymptomatic infection. combination of symptoms as the best approximation for the actual influenza-like illness 82 circulating among the general population.

83
For this study, we employed data collected by Influenzanet platforms in several   Table 1) and asked 140 whether since the last time they visited the platform they experienced any symptoms 141 among those listed. In this study, we analyzed the data collected by the various national 142 platforms from November 2011 to April 2017.

143
Seasonal influenza is traditionally monitored by national networks of general 144 practitioners (GPs) who report the weekly number of visited patients with influenza-like 145 illness symptoms according to the national ILI case definition. Despite some practical 146 limitations, mainly due to an heterogeneous population coverage and to considerable 147 delay in disseminating data, such traditional surveillance data are generally considered 148 as ground truth. In this study, we collected the weekly ILI incidence data for 6 influenza 149 seasons, from 2011-2012 to 2016-2017, from the ECDC dedicated web page [36] for all 150 countries, expect France, for which, instead, we obtained the weekly data on the ILI 151 incidence and gastrointestinal infections directly from the national network, called 152 Réseau Sentinelles [37]. All reports were accessed and downloaded in March 2017.

154
In general, inclusion criteria of participants in the data analysis vary depending on the 155 specific aim of the study [25,33,38,39]. In our case, we included only those individuals 156 who are registered on the national platforms and have filled in at least one survey per 157 season. Specifically, we consider only one survey per participant for each week, keeping 158 the last one if multiple surveys were submitted during the same week. This choice 159 corresponds to a loss of approximately 5% of the total number of surveys and will allow 160 assuming that symptoms reported in a week are proportional to the number of 161 participants in that week. Moreover, to reduce the noise due to low participation rates 162 at the beginning of the influenza season, we included in the analysis only those weeks in 163 which the number of surveys corresponded at least to 5% of the total number of the 164 surveys filled during the week with the highest participation for that season. In S1 165 In this section, we describe the methodology employed to extract the latent features from the self-reported symptoms collected by the various Influenzanet platforms of the participating countries. Our approach relies on the assumption that a specific group of self-reported symptoms corresponds to the symptomatic expression of one or more illnesses, hereafter called syndromes, circulating among the population sample participating in the study. In our study we consider a total of 19 symptoms, corresponding to the 18 symptoms presented in the weekly Symptoms Questionnaire plus an additional symptom, called "Sudden onset", referring to the sudden appearance of symptoms, typically over the course of the previous 24 hours (see Table 1). Symptoms are treated as binary boolean variables having value 1 if the symptom is present and 0 if the symptom is absent.; then, we aggregated across all participants to build a matrix X = [x ij ], whose elements contain the occurrences of each symptom j ∈ {1, .., J} during each day i ∈ {1, .., I}. In other words, each element of the matrix corresponds to the number of times each symptom has been reported on each day of the influenza seasons under study. The result is a high-dimensional sparse matrix that can be linearly decomposed through a Non-negative Matrix Factorizations (NMF) technique [34]. We opted for NMF since its non-negativity constraint offers the advantage of a straightforward interpretation of the results as positive quantities that can then be associated with the initial symptoms. This approach is equivalent to a "blind source separation" problem [40] in which neither the sources nor the mixing procedure is known, but only the resulting mixed signals are measured. In our case, the time series corresponding to the daily symptoms counts are measured by the Influenzanet platforms and can be considered as the result of a linear mixing process driven by unknown sources, i.e. the latent syndromes. In the following we will use interchangeably the terms syndrome, source or component. According to this consideration, each element x ij of the matrix X can be expressed as follows: where the coefficients h kj describe the set of the unknown K sources, the factor w ik 173 represents the time-dependent mixing coefficients, and the terms e ij correspond to the 174 approximation error. The mixing equations Eq. (1) can be equivalently expressed in 175 matrix notation as: where: It is worth stressing that in this representation the matrix X is known, while the 177 matrices W and H are unknown and determined by the NMF algorithm. In particular, 178 we used a variation of the NMF algorithm that minimizes the Kullback-Leibler 179 divergence loss function [41] defined as follows: where: To minimize this function, we adopted the multiplicative update rules described 182 in [42]. Singular Value Decomposition [43], that is based on a probabilistic approach equivalent 186 to the probabilistic latent semantic analysis (pLSA) [44], employed in the context of 187 semantic analysis of text corpora. Since the two approaches of NMF and pLSA are 188 equivalent (see [45] for more details), the results of our matrix decomposition can be 189 probabilistically interpreted as a mixture of conditionally independent multinomials, 190 that we call p(i, j). We can then write: where: and N is the total number of symptoms counts. can be expressed as follows: At this point, Eq. (8) allows to determine the probability p(i, k) that, rescaled on the total number of symptoms counts N , yields the desired decomposition procedure, y ik , which represents the contribution of a specific component k in a day i, given by the following expression: Thus, the final step in our approach is to determine the optimal number of 198 components k min to be used for the decomposition. A natural upper bound for k would 199 be the total number of symptoms, i.e. 19. We need to determine the number of 200 components with the best tradeoff between a model that best approximates the original 201 matrix X and at the same time does not overfit the data. Each time we minimize the 202 loss function Eq. (4) for a specific number of components k, we obtain a candidate 203 decomposition.

204
To determine the best decomposition, we use an approximated model selection 205 criterion, known as the Akaike Information Criterion [46]. In particular, we employ the 206 corrected version of the Akaike Information Criterion (AIC) proposed in [47], valid for 207 finite sample sizes. For each of the candidate decompositions generated by the various 208 values of k, we estimate the AIC criterion, AIC k , expressed as: where, L(k) is the log-likelihood of the model with k components, defined in [45] as: P , is the number of parameters of the model with: where K is the upper bound for the number of components, I is the total number of 212 days and J is the total number of symptoms. The best candidate decomposition is the 213 one that minimizes Eq. (10)  To assess the generalisation potential of our framework in identifying syndromes not 249 related to ILI (e.g. respiratory versus gastrointestinal), we used it to identify the  2 We focused on the case of France due to the immediate data availability from the official surveillance. The Réseau Sentinelles in fact comprises a unique program of data collection about gastrointestinal illness episodes [48] GP incidence: ILI/100,000  surveillance for the season 2016-2017 (GP). The correlation between the two time series 305 is good, ranging from 0.60 to 0.85, for all the countries. We also report, in   For each country, we generated a series of models by increasing the number of 368 syndromes with k ∈ [1,6]. For each of the models, we estimated the AIC criterion,

369
AIC c (k) (Eq. 10). The best model was the one to minimize the Eq. 10, denoted as 370 AIC c (k min ), consisting of k min syndromes. Fig. S1 depicts, for each country, the 371 relative likelihood (AIC c (k) − AIC c (k min )) of each candidate model with k syndromes 372 (AIC c (k)), as compared to the model with the minimum AIC c score (AIC c (k min )).  Figure S1. The best model is the one that minimizes Eq. 10, denoted as AIC c (k min ), and consist of K syndromes. For each country we depict the relative likelihood of each candidate model (AIC c (k) − AIC c (k min )), where the AIC c (k) scores for each candidate model are compared against the AIC score of the best model AIC c (k min ). We depict only models with k up to 6 and not 19 for easier visual inspection. The best model per country, with optimal number of syndromes is: