Towards a COVID-19 symptom triad: The importance of symptom constellations in the SARS-CoV-2 pandemic

Pandemic scenarios like SARS-Cov-2 require rapid information aggregation. In the age of eHealth and data-driven medicine, publicly available symptom tracking tools offer efficient and scalable means of collecting and analyzing large amounts of data. As a result, information gains can be communicated to front-line providers. We have developed such an application in less than a month and reached more than 500 thousand users within 48 hours. The dataset contains information on basic epidemiological parameters, symptoms, risk factors and details on previous exposure to a COVID-19 patient. Exploratory Data Analysis revealed different symptoms reported by users with confirmed contacts vs. no confirmed contacts. The symptom combination of anosmia, cough and fatigue was the most important feature to differentiate the groups, while single symptoms such as anosmia, cough or fatigue alone were not sufficient. A linear regression model from the literature using the same symptom combination as features was applied on all data. Predictions matched the regional distribution of confirmed cases closely across Germany, while also indicating that the number of cases in northern federal states might be higher than officially reported. In conclusion, we report that symptom combinations anosmia, fatigue and cough are most likely to indicate an acute SARS-CoV-2 infection.


Introduction
In December 2019 cases of pneumonia of unknown etiology were reported in Wuhan, China [1]. The pathogenic agent, which was later identified as the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) rapidly spread over the whole country and the rest of the world. On January 31, the World Health Organization held a press conference and declared the outbreak a Public Health Emergency of International Concern (PHEIC) [2]. On March 11, 2020, the coronavirus outbreak was declared a global pandemic by the WHO [2]. Lockdown a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 symptoms and that health care workers who had previously been healthy usually manifested only mild symptoms, concluding that screening should be extended to all workers who have had contact, regardless of whether they are symptomatic or not.

Informed consent
The study was accepted by the medical ethics committee of the Philipps-University Marburg (reference number: 59/20). All research was performed in accordance with current guidelines and regulations.

Technology and study participants
The symptom checking application COVID-Online was developed by the Institute of Artificial Intelligence (Philipps University Marburg) and is written in the Go programming language. The web app was accessible via a publicly accessible website covid-online.de. The questionnaire contained a total of 38 items, whereby 10 items appear only in relation to previous items (i.e. adaptive items).
Informed consent for scientific evaluation was provided by accepting the privacy policy (explicitly asked before submitting the questionnaire, see point 3.1 in the privacy policy). Users also had to confirm that they are at least 18 years old. Furthermore, the paragraph "I agree that COVID-Online may process and evaluate the data I have entered in a pseudonymised form" had to be explicitly agreed to as an opt-in checkbox.
The graphical user interface to answer the questions contained single choice, multiple choice and free text input fields. The questionnaire was divided into three segments, which were presented one after the other: first, basic epidemiological data such as gender, approximate age, approximate height, and approximate weight were collected. Then current symptoms (fever, body aches, cough, sniff, diarrhea, nausea, vomiting, throat pain, headache, loss of smell or taste, dyspnea, fatigue) were queried that could indicate COVID-19. Finally, users were prompted for individual risk factors such as smoking status and comorbidities (lung disease, diabetes, cardiovascular disease, stroke, tumor disease, chronic inflammatory disease, kidney disease, allergies), vaccination status (flu, measles) and their postal code. The complete survey items are available in the (S1 Table).
The web app was launched on April 3rd. At this time, the new infections reached its peak in Germany and the population was in a phase of great uncertainty. Our main goal developing COVID-Online was patient care navigation by guiding the participant through the next steps in case of increased risk, and by this serving as a guidance system for regional patient management to enable an efficient allocation of resources in case of emergency.

Data analysis
Data preparation and data management. All data was obtained from a database query (SQL) in the comma-separated values format (CSV) from the application servers. Program libraries Pandas, SciPy, Matplotlib, Seaborn and Jupyter were used for exploratory data analysis, data post-processing and plotting. We dropped users with missing or invalid postal codes.
Statistical evaluation. Basic characteristics were analysed using descriptive statistics. Groups were formed based on the independent variables age, gender, comorbidities and confirmed contact to a COVID-19 patient. Chi-squared test and Cramer's V correlation were used for data analysis. For categorical variables, we report absolute numbers and percentages. For continuous variables, we report averages ± standard deviation, unless otherwise stated. No imputation was made for missing data. All statistical tests were carried out with a significance level of α = 5%.
Hypothesis testing. Previous work by Drew et al. has shown that symptom combinations are more suitable for distinguishing between COVID-19 positive and negative cases than single symptoms alone [36]. We investigated if the same applies to our dataset. Exploratory data analysis revealed that groups can be divided based on the question "I had confirmed contact to a COVID-19 case". The reason for a different reporting of symptoms of subjects with confirmed contact could be that these participants have a higher sensitivity to symptom inquiry due to fear of infection or that contact has led to a manifestation of the disease with corresponding presentation of symptoms. In the latter case, a higher overall level of infection is to be expected within the formed group. Lastly, we applied the generated linear model by Drew et al. including age, sex, loss of smell, fatigue, cough and loss of appetite to our data: where the prediction was calculated as exp(x) / (1 + exp(x) with a threshold of 0�5 as suggested by Drew et al. [36]. As loss of appetite was not queried by our questionnaire, it was set to true whenever a subject reported anosmia. Clinical experience has shown that loss of appetite rarely occurs isolated and is rather a consequence of the loss of smell. To confirm our findings, we compared the frequencies of predicted and confirmed infections in a seven-day window geographically by districts and contrasted this with the distribution of the individual predictive symptoms fever, anosmia and cough.
Exploratory Data Analysis revealed that the answer to the question whether if a previous contact with a confirmed case of a COVID patient occurred, changed the symptom frequency distribution between these groups. 3564 users with confirmed contact did not affirm any of the symptoms (18.63%), and 190197 users without confirmed contact stated that they had no symptoms (29.08%).
A chi-square test of independence was performed to examine the relation between previous confirmed contact and individual symptoms sniff, cough, fatigue, body aches, headache, diarrhea, sore throat, nausea, dyspnea at rest, anosmia and fever. The relation between these variables was significant (Table 2).
Therefore, patients were divided bases on the dichotomous question "I had contact to a confirmed case of COVID 19" into two groups.

Fig 1. Comparison of symptom distribution between patients with and without confirmed contact. A) Single symptoms:
Color coded Cramer's V correlation of symptoms with the confirmed contact variable (dark red tones) & symptom frequency count in percent of positive statements broken down by groups with and without confirmed contact. Anosmia seems to be the strongest predictor followed by fever and dyspnea at rest. On the contrary, the least single important symptoms are sniff, fatigue and cough by itself. B) Complex symptoms: Symptom frequency count (total) with combinations. The symptom combination fatigue, anosmia and cough has been highlighted in red to illustrate the shift of importance between the two groups. The age distribution of the two groups is depicted in each case above graph B. A random sample of 19128 was taken from the population without confirmed contact for comparison. The percentage of positive cases in the total number of participants without contact was 6�24% whereas the percentage of positive cases in the group with confirmed contact was 23�21%. https://doi.org/10.1371/journal.pone.0258649.g001

PLOS ONE
The importance of symptom constellations in the SARS-CoV-2 pandemic symptoms can achieve sufficient differentiation between the groups on the other hand (Chart B). Interestingly, the frequencies of single symptoms dominate in both groups, with fatigue, cough and diarrhea being the most common. However, in the group with confirmed contact with a COVID-19 patient, not only is anosmia more common as a single symptom, but also the frequency of triple combinations of symptoms in comparison. In the group without confirmed contact these triple combinations are much less frequent. Also, the frequencies of symptoms drop much faster in the group without confirmed contact than the frequencies of symptom combinations within the confirmed contact group. This confirms the impression that single symptoms and symptom combinations less or equal than two might be less suitable for the (geographical) identification of COVID-19 patients than the combination of more symptoms, whereby the combination of fatigue, cough and anosmia is particularly discriminative. This particular symptom combination, which combines the more frequent symptoms cough and fatigue with the less frequent symptom of anosmia, was also the most predictive in the work by Drew  The comparison between the map with the frequency of all users per 100 thousand inhabitants per district with the map with the frequency of predicted infections reveals that the frequencies of predicted infections are largely independent of the number of users in the district. This is an important indicator for the specificity of the applied scoring model. Furthermore, the frequencies of individual symptoms (maps D-F) cannot be clearly assigned to the confirmed infections in comparison to the map of predicted infections using complex symptoms, again confirming that only combinations of symptoms are sufficiently discriminatory.
When comparing the predicted infections with the confirmed infections (Maps B and C), it can first of all be noted that the most overlaps are found in the most severely affected states of Bavaria and Baden-Wuerttemberg. There are also isolated overlaps in North Rhine-Westphalia and Lower Saxony. According to the predicted infections, the northern part of Germany, and in particular Schleswig-Holstein, are much more severely affected than the number of confirmed cases suggest. This may be due to insufficient specificity of the score, increased multiple participation, incorrect reporting in northern Germany and/or insufficiently performed tests in northern Germany. Another reason for differences between the distribution of confirmed infections and predicted infections is that the duration of symptoms of confirmed positive persons can last 2-3 weeks even in non-hospitalized patients [37]. Anosmia regresses significantly later in some patients. Tenforde et al found that it takes from onset to clinical recovery about 2 weeks for mild cases and 3-6 weeks for patients with severe or critical disease; the symptoms to least likely have resolved within 14-21 days after the test date included and fatigue [37]. This means that people who tested positive 2-3 weeks ago do not show up in the statistics of  2020. A, B, D, E, F: The district "Marburg-Biedenkopf" has been excluded from these charts as it contained too many records from internal tests carried out by associated personnel of COVID-Online and was also influenced by regional media reports. Unfortunately, due to the great time pressure in times of crisis, no test or staging instance could be installed for such purposes. https://doi.org/10.1371/journal.pone.0258649.g002

PLOS ONE
The importance of symptom constellations in the SARS-CoV-2 pandemic the last 7 days, but they still report their symptoms in a symptom checker and appear in the statistics of predicted infections. On the contrary, confirmed cases do not necessarily need to show symptoms, as symptoms could have been resolved before the test date or started after the date of testing.
Unfortunately, there is not enough data on the number of total tests, because in Germany, according to the Infection Protection Act, only positive cases are reported to the health authorities. This makes it difficult to determine the actual test density per federal state and even more so per district. However, laboratories are invited to send the number of tests to the RKI. Since participation is voluntary, no statement can be made about the completeness of the tests (see Tables 3 and 4).
From these tables it can be seen that Bavaria and North Rhine-Westphalia have tested the most, both in terms of population and in absolute terms. Lower Saxony, Baden-Wuerttemberg, Schleswig-Holstein and Mecklenburg-Western Pomerania, on the other hand, tested least according to these data. However, it must be emphasised that this may also be due to a lower participation rate in data transmissions. If one assumes, however, that the participation density between the federal states would be approximately the same, more cases would probably have been measured in these states.
In contrast to other countries, preventive testing is not carried out in Germany to date, but only as an indication test on symptomatic patients and contact persons of such patients. Furthermore, the RKI recommends extending the indication for testing to nursing or care The percentage of the total tests to the number of inhabitants of a federal state was displayed in green from a value of 0�5 and otherwise in red. Number of conducted tests greatly varied between federal states. Populous federal states in the south and west of Germany have performed significantly more tests than the northern and eastern regions. https://doi.org/10.1371/journal.pone.0258649.t004

PLOS ONE
The importance of symptom constellations in the SARS-CoV-2 pandemic facilities and medical personnel. In the case of local outbreaks, mass testing is also carried out within a few days to contain the outbreak.

Discussion
The study investigated whether symptoms or symptom constellations are indicative of COVID-19. Moreover, it was examined to what extent a model, that accept such symptom constellations as input features, can accurately estimate the number of positive cases. Previous studies have shown that symptom combinations of equal to or greater than three are more indicative of COVID-19 while predictions based on single symptoms such as anosmia, coughing or fatigue alone are less specific [24,36]. This corresponds with the results of this work. The triple combination of anosmia, fever and cough was most important in distinguishing between groups of participants with and without active contact. The model published by Drew et al. generated 23�21% positive cases in the group with confirmed contact whereas the percentage of positive cases in the group without confirmed contact was 6�24%. This supports the assumption that there must be more COVID-positive patients in the group with confirmed contact. It is moreover remarkable that the same combination of symptoms of anosmia, fever and cough could be determined as the most important feature in this work al by Drew et al. The analysis of the distribution of the predicted positive cases among the districts showed that the distribution of the confirmed positive cases largely overlapped, especially in the federal states of Bavaria, Baden-Wuerttemberg and North Rhine-Westphalia. If the density of testing in the federal states is also considered, it can be assumed that the number of positive cases in the northern part of Germany is likely to be higher at the time of observation. In these cases, more test resources could be mobilized to initiate effective isolation and containment strategies for the spread in the future.
The near real-time analysis of symptom tracking applications might help to more accurately understand the spread and infectivity of infectious diseases such as SARS-CoV-2. In addition, risk factors can be collected and quantified in addition to symptoms. This is a decisive added value compared to the sole laboratory testing for an infectious disease, since persons at risk can be identified on the basis of the data and can be more quickly assigned to a potential therapy. The user numbers from Germany and England indicate a broad acceptance of such tools in pandemic scenarios [24,36]. Symptom-tracking tools may thus play an important role in the control and containment of infectious diseases, alongside Sars-CoV-2, and provide significant added value for public health matters. Future implications are based on these tests and their accuracies, that the combination of both, adequate laboratory tests plus preventive screening such as COVID-Online, may predict future outbreaks and/or hint towards hotspot developments in clinically actionable windows.
It must be emphasized, though, that symptom screening and thereby symptom-based scoring systems will not likely reach a sufficient sensitivity for COVID-19 as asymptomatic carriage remains a problem. For instance, in the Diamond Princess cruise ship study, it was estimated that a proportion of 17�9% among all infected cases were asymptomatic [38]. Asymptomatic patients further complicate the screening problem by the high risk of silent transmission. This implicates that the value of symptom inquiry for individual screening scenarios might be limited, but nevertheless offers crucial insights on a public health level when it comes to understanding the dynamics of spread, containment strategies and hot spot identification.

Limitations of the study
Our work has numerous limitations. One major limitation of our study is that we do not have data on confirmatory laboratory test results of the study participants. Because of this, we

PLOS ONE
The importance of symptom constellations in the SARS-CoV-2 pandemic cannot extend and/or adopt the proposed model by Drew et al. by means of, for instance machine learning. Notably though, the same symptom combination was the most discriminative between the groups with and without confirmed contact, which suggests that the proposed combination might indeed achieve the highest specifity. In addition, we recognize that the data collected originates from a modern web application. Although a balanced gender and age distribution was achieved in the study population, it can be assumed that less technology-affine individuals are underrepresented in the study collective, which means that a contributor bias cannot be excluded. Thus, it cannot be ruled out that certain symptom constellations are more likely to be significant than others, which may be more common in underrepresented user groups. Furthermore, the nature of the collected data is self-reported. Data validity is therefore not checked and may contain false statements. Another limitation is that since users cannot create an account for reasons of data security and privacy, only snapshots of symptoms can be collected and cannot be tracked over time (i.e. longitudinal data). Also, our data and calculations are based on submissions of the questionnaire and do not strictly correlate with the number of individuals, as users can access the web app and submit the questionnaire multiple times. Because of this, we are only able to identify potential COVID-19 patients at the time of complete manifestation. In addition, it must be highlighted that the questionnaire used did not ask specifically whether contact with a confirmed positive person had taken place in a protected or unprotected manner. This is of decisive importance, since many occupational groups, such as medical personnel, have regular contact, but at the same time, with appropriate protective clothing, have a significantly lower risk of infection. Therefore, the division of the study sample based on the item "I had contact to a confirmed case of COVID 19" is only appropriate in the case of the general population.

Conclusions
With the enormous advancements in technology we now have valuable digital tools that not only can keep up with the speed of the spread but also offer unique and actionable key insights for health care professionals. Our work confirms that the symptom combination of anosmia, fatigue and cough are indeed more specific for COVID-19 than single symptoms alone. Besides identifying potential hot spots that require more test resources, data on pre-existing conditions and risk factors such as smoking are obtained, too. By identifying high risk patients with a high probability of a SARS-CoV-2 infection, this opens up the opportunity to take timely preventive measures. While asymptomatic carriage remains an issue, our work supports the recommendation that patients who are suffering from anosmia, fever and cough should consider self-quarantine. Moreover, our approach demonstrates that crowd-sourced data represents an effective and scalable means for collecting population-based data for a data-driven public health response to infectious disease outbreaks in light of the COVID-19 pandemic. We could show that sophisticated and well-accepted real-time symptom reporting analytical platforms can be developed rapidly, even with limited personnel resources. We also demonstrate the potential of crowdsourced data to complement traditional public health surveillance methods, primarily through providing fast, on-demand insights and increased information coverage. In the future, to obtain more robust evidence from online crowdsourcing platforms such as these, the assessment of diagnostic test status (such as RT-PCR on pharyngeal swab) should be considered.