The authors have declared that no competing interests exist.
‡ These authors also contributed equally to this work.
Tests are scarce resources, especially in low- and middle-income countries, and optimizing testing programs during a pandemic is critical for effective disease control. Hence, we aim to use the combination of symptoms to build a predictive model that serves as a screening tool to identify people and areas with a higher risk of SARS-CoV-2 infection, so that they can be prioritized for testing.
We performed a retrospective analysis of individuals registered in "
From April 28 to July 16, 2020, 337,435 individuals registered their symptoms through the app. Of these, 49,721 participants were tested for SARS-CoV-2 infection, of whom 5,888 (11.8%) tested positive. Among self-reported symptoms, loss of smell (OR[95%CI]: 4.6 [4.4–4.9]), fever (2.6 [2.5–2.8]), and shortness of breath (2.1 [1.6–2.7]) were independently associated with SARS-CoV-2 infection. Our final model achieved competitive performance: only 7% of the users predicted as negative were false negatives (NPV = 0.93). The model was incorporated by the "
Our results showed that the combination of symptoms might predict SARS-CoV-2 infection and, therefore, can be used as a tool by decision-makers to refine testing and disease control strategies.
The current COVID-19 pandemic, caused by SARS-CoV-2, requires extensive testing programs to understand transmission and to diagnose and isolate positive cases. Given the high mortality and the absence of a specific treatment or a reliable vaccine, large testing programs are an essential part of epidemic control. The frequency of testing, however, is very heterogeneous among countries. Brazil currently has the second-highest number of COVID-19 cases, even with lower test rates (120,548 tests per one million inhabitants, as of December 02, 2020) [
Some screening tools have already been introduced, aiming to predict the epidemic trend of COVID-19. Zhu et al. [
Thus, our study aims to use the combination of symptoms and machine learning techniques to develop a predictive model that identifies people and areas with a higher risk of SARS-CoV-2 infection. We used data from an app-based symptom tracker known as "
This study is a retrospective analysis of prospectively collected data from individuals registered in the "
The free smartphone application was launched in Brazil on April 28, 2020. Through a short survey, it collects geo-referenced data from subscribed users, their demographic and occupational characteristics, self-reported symptoms, as well as whether the participant is a health professional and has been in contact with a SARS-CoV-2 infected person. The app then combines the surveyed information and selects individuals for testing through selection criteria (see
We included participants registered through the smartphone app from its launch date until July 16, 2020. To train the model, we selected participants who responded to the questionnaire, took the Wondfo COVID-19 IgM/IgG antibody test in a location designated by the app within the city of Rio de Janeiro, and obtained a result (positive or negative). To identify risk areas, we also included participants who had not been tested, applying the model to estimate their test results.
Our primary outcome was the test result (positive or negative) at the user level. Our goal was to identify clinical manifestations and individual factors associated with positive testing. Hence, we collected and assessed participant demographics (age, gender), nine symptoms (loss of smell or anosmia, fever, myalgia, cough, nausea, shortness of breath, diarrhea, coryza, and sore throat), and whether the user lives together with someone with a confirmed SARS-CoV-2 infection.
We described the characteristics and symptoms of participants who tested positive and negative, displaying the median and interquartile range for continuous variables and the frequency for categorical variables. We then analyzed the individual association between each feature and the test result using logistic regression models adjusted for age and gender. That is, we fitted 11 logistic regression models (one for each feature), where the response variable was the test result and the explanatory variables were age, gender, and the feature in question. The intention was to remove the confounding effects of age and gender when analyzing the symptoms, obtaining an odds ratio with less interference. We report the corresponding Odds Ratio (OR) with a 95% confidence interval.
We aimed to identify a combination of symptoms to build a prediction model that identifies participants with SARS-CoV-2 infection. To that end, we compared five machine learning techniques: stepwise Logistic Regression (LR), Naïve Bayes (NB), Random Forest (RF), Decision Tree using C5.0 (DT), and eXtreme Gradient Boosting (XGBoost). To address the imbalanced response variable (only 11.8% of tests were positive) during model training, we also evaluated four data balancing techniques: Downsampling, Upsampling, Synthetic Minority Oversampling Technique (SMOTE) [
We divided the data into a training set (80%) and a testing set (20%), keeping the same proportion of majority and minority classes in both subsamples. The training set was used to fit the predictive models, and the testing set to validate the proposed model. During model training, for each combination of machine learning technique and balancing strategy, we applied grid-search hyperparameter optimization with 5-fold cross-validation, using the Area Under the ROC Curve (AUC) as the target metric, since it is independent of a specific cut-off value [
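The stratified split and the upsampling strategy described above can be sketched in a few lines. This is only an illustration in Python, not the R code used in the study: the function names `stratified_split` and `upsample`, the fixed seeds, and the toy labels are assumptions made for the example.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

def upsample(indices, labels, seed=0):
    """Resample smaller classes with replacement up to the majority size."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(idx) for idx in by_class.values())
    balanced = []
    for idx in by_class.values():
        balanced.extend(idx)
        # Draw the missing observations at random from the same class.
        balanced.extend(rng.choices(idx, k=target - len(idx)))
    return balanced

# Toy labels (hypothetical) with the 11.8% positive rate reported in the study.
labels = [1] * 118 + [0] * 882
train_idx, test_idx = stratified_split(labels)
balanced_train = upsample(train_idx, labels)
```

In this sketch only the training indices are balanced, while the testing set keeps the original class proportions, so that the evaluation reflects the prevalence seen in practice.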
After obtaining the best hyperparameters for each model, we used the Matthews Correlation Coefficient (MCC) to evaluate the results on the test set, since it is a balanced measure of True Positives, True Negatives, False Positives, and False Negatives, which are based on a preset score threshold [
Finally, we evaluated the distribution of SARS-CoV-2 infection risks over the geographic area of Rio de Janeiro, modeled as a grid map (each grid cell is a 400 m × 400 m square). In addition to the participants with confirmed test results, we applied the chosen model to the participants who were still untested during the study period to obtain their estimated test results. We then calculated the proportion of estimated SARS-CoV-2 infections for each grid according to
To avoid misinterpreting proportions in grids with scarce data, we considered grids with at least 10 participants (~94% of all observations). Then, we evaluated the distribution of the grid risks among all grids and classified them into five risk groups using the mean ± 0.5 and 1.5 standard deviations (SD) as thresholds: "very low" (< mean-1.5*SD), "low" (from mean-1.5*SD to mean-0.5*SD), "medium" (from mean-0.5*SD to mean+0.5*SD), "high" (from mean+0.5*SD to mean+1.5*SD), and "very high risk" (>mean+1.5*SD). Using this classification, we built a risk map for Rio de Janeiro.
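The classification rule above can be illustrated with a short Python sketch. Several details are assumptions made for the example rather than the study's implementation: the grid risk is taken as estimated positives divided by participants in the grid, the sample standard deviation is used, the handling of values that fall exactly on a threshold is arbitrary, and the grid identifiers and counts are hypothetical.

```python
from statistics import mean, stdev

MIN_PARTICIPANTS = 10  # minimum grid size stated in the text

def grid_risks(grid_counts):
    """grid_counts maps grid id -> (estimated positives, participants)."""
    return {
        grid: positives / participants
        for grid, (positives, participants) in grid_counts.items()
        if participants >= MIN_PARTICIPANTS
    }

def classify(risk, mu, sd):
    """Assign one of the five risk groups from mean and SD thresholds."""
    if risk < mu - 1.5 * sd:
        return "very low"
    if risk < mu - 0.5 * sd:
        return "low"
    if risk <= mu + 0.5 * sd:
        return "medium"
    if risk <= mu + 1.5 * sd:
        return "high"
    return "very high"

def risk_map(grid_counts):
    """Classify every sufficiently populated grid into a risk group."""
    risks = grid_risks(grid_counts)
    mu, sd = mean(risks.values()), stdev(risks.values())
    return {grid: classify(risk, mu, sd) for grid, risk in risks.items()}

# Hypothetical grids: (estimated positives, participants).
counts = {
    "A": (1, 50), "B": (5, 50), "C": (6, 50),
    "D": (7, 50), "E": (20, 50), "F": (3, 5),
}
groups = risk_map(counts)
```

Grid "F" has fewer than 10 participants and is therefore left out before the mean and standard deviation are computed, which keeps sparsely populated grids from distorting the thresholds.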
All analyses were performed in R 3.6.3, using ’
Our final model was incorporated into the app on July 17, 2020. To verify the gains using this proposed model, we performed a validation using Rio de Janeiro’s data. We compared the proportion of positive results before the model was implemented in the app (using data from June 15, 2020, to July 16, 2020) and after its implementation (using data from August 01, 2020, to September 01, 2020). The two-week interval between incorporating the model into the app and the validation was necessary as there were still tests scheduled according to the previous prioritization policy.
We used the unpaired two-sample Wilcoxon test to assess whether the difference between the proportions of positive results before and after the model implementation was statistically significant, at a confidence level of 0.95. Specifically, we tested whether the proportion of positive results before implementing the model was lower than the proportion after its implementation. Because the data are not normally distributed, we chose this non-parametric test.
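As an illustration of the test described above, the following Python sketch implements a one-sided unpaired Wilcoxon rank-sum (Mann-Whitney) test with a normal approximation, tie correction, and continuity correction. It is not the R routine used in the study. The function name `rank_sum_test_less` is hypothetical, and the daily positivity proportions are invented for the example, since the text does not state the unit over which proportions were computed.

```python
import math

def rank_sum_test_less(before, after):
    """One-sided rank-sum test of H1: values in `before` tend to be lower.

    Returns the U statistic for `before` and the approximate p-value.
    """
    n1, n2 = len(before), len(after)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in before] + [(v, 1) for v in after])
    rank_sum_before = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        t = j - i                  # size of this group of tied values
        tie_term += t ** 3 - t
        in_before = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_before += midrank * in_before
        i = j
    u = rank_sum_before - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - mu + 0.5) / math.sqrt(var)  # continuity-corrected lower tail
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return u, p_value

# Hypothetical daily proportions of positive tests (not study data).
before = [0.14, 0.15, 0.13, 0.16, 0.15, 0.14, 0.16, 0.15]
after = [0.18, 0.17, 0.19, 0.18, 0.20, 0.17, 0.18, 0.19]
u_stat, p_value = rank_sum_test_less(before, after)
```

A small p-value supports the alternative that the positivity before the model was lower than after it. For small samples an exact test would be preferable to the normal approximation used here.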
The study is retrospective. All data acquired were anonymized, and the "
From April 28, 2020, to July 16, 2020, 337,435 individuals registered their symptoms through the smartphone app. Of these, 49,721 users were then tested, of whom 5,888 (11.8%) received a positive result for SARS-CoV-2 infection.
According to the self-reported information (
The Odds Ratio (OR) with 95% confidence intervals using logistic regression models for each feature was adjusted by age and gender.
Characteristic | Total | Positive test | Negative test |
---|---|---|---|
Participants, n (%) | 49,721 | 5,888 (11.8) | 43,833 (88.2) |
Female, n (%) | 30,769 (61.9) | 3,641 (61.8) | 27,128 (61.9) |
Age (years), median [IQR] | 41 [33–51] | 43 [34–53] | 40 [33–51] |
Cohabitation—lives with a SARS-CoV-2 infected person, n (%) | 20,944 (42.1) | 3,398 (57.7) | 17,546 (40.0) |
Health professional, n (%) | 27,737 (55.8) | 3,099 (52.6) | 24,638 (56.2) |
Coryza | 25,973 (52.2) | 3,315 (56.3) | 22,658 (51.7) |
Cough | 23,430 (47.1) | 3,507 (59.6) | 19,923 (45.5) |
Myalgia | 20,858 (42.0) | 3,380 (57.4) | 17,478 (39.9) |
Sore throat | 20,794 (41.8) | 2,459 (41.8) | 18,335 (41.8) |
Fever | 13,042 (26.2) | 2,640 (44.8) | 10,402 (23.7) |
Diarrhea | 12,573 (25.3) | 1,778 (30.2) | 10,795 (24.6) |
Loss of smell | 11,835 (23.8) | 3,112 (52.9) | 8,723 (19.9) |
Nausea | 6,461 (13.0) | 1,025 (17.4) | 5,436 (12.4) |
Shortness of breath | 354 (0.7) | 74 (1.3) | 280 (0.6) |
No symptoms above | 10,865 (21.9) | 844 (14.3) | 10,021 (22.9) |
Results are displayed in median (interquartile range, IQR) for continuous variables and percentage values for categorical variables.
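The counts in the table above allow a quick unadjusted check of the associations. The Python sketch below computes a crude odds ratio with a Wald 95% confidence interval from a 2×2 table. This differs from the age- and gender-adjusted logistic regression used in the study, so the values need not match the reported ORs exactly. The function name `odds_ratio_ci` is an assumption made for the example.

```python
import math

def odds_ratio_ci(pos_with, pos_total, neg_with, neg_total, z=1.96):
    """Crude odds ratio and Wald confidence interval from 2x2 counts."""
    a, b = pos_with, pos_total - pos_with  # positives with / without symptom
    c, d = neg_with, neg_total - neg_with  # negatives with / without symptom
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Loss of smell: 3,112 of 5,888 positives and 8,723 of 43,833 negatives.
crude_or, lower, upper = odds_ratio_ci(3112, 5888, 8723, 43833)
```

For loss of smell, the crude estimate comes out at about 4.5, close to the adjusted OR of 4.6 reported by the authors, which suggests that age and gender confound this association only slightly.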
To develop a model to predict positive participants based on the available dataset, we ran 25 different combinations of machine learning techniques and sampling strategies. We compared the performance of the models on the test set according to Sensitivity, Specificity, Positive Predictive Value (PPV), Negative Predictive Value (NPV), F1-Score, and MCC. The logistic regression, gradient boosting, and random forest techniques presented the best median MCCs, followed by the decision tree and naïve Bayes, as shown in
Boxplots represent the distribution of MCC values for each model and balancing technique combination. The higher the MCC value, the better the model.
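For reference, all the metrics used in this comparison can be derived from a single confusion matrix. The Python sketch below shows the standard definitions. The confusion-matrix counts are hypothetical and do not come from the study, and the helper names `safe_ratio` and `classification_metrics` are assumptions made for the example.

```python
import math

def safe_ratio(numerator, denominator):
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")

def classification_metrics(tp, tn, fp, fn):
    """Threshold-based metrics computed from confusion-matrix counts."""
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": safe_ratio(tp, tp + fn),
        "specificity": safe_ratio(tn, tn + fp),
        "ppv": safe_ratio(tp, tp + fp),
        "npv": safe_ratio(tn, tn + fn),
        "f1": safe_ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": safe_ratio(tp * tn - fp * fn, mcc_denominator),
    }

# Hypothetical counts for an imbalanced test set (not study data).
metrics = classification_metrics(tp=70, tn=660, fp=220, fn=50)
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction). Because it uses all four cells of the confusion matrix, it is less easily inflated by the majority class than accuracy is.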
According to
Our final model resulted from the logistic regression method combined with the upsampling balancing strategy (
The probability of an individual being a positive case can be calculated by
Regarding the classification metrics (
The characteristics of the false-negative and false-positive cases predicted by our model can be seen in
Characteristic | False-negative | False-positive |
---|---|---|
Participants, n | 471 | 2,164 |
Female, n (%) | 258 (54.8) | 1,356 (62.7) |
Age (years), median [IQR] | 42 [34–51] | 42 [32–52] |
Cohabitation (lives with a SARS-CoV-2 infected person), n (%) | 179 (38.0) | 1,464 (67.7) |
Loss of smell | 4 (0.8) | 1,694 (78.3) |
Fever | 88 (18.7) | 1,288 (59.5) |
Myalgia | 156 (33.1) | 1,473 (68.1) |
Cough | 197 (41.8) | 1,511 (69.8) |
Nausea | 43 (9.1) | 429 (19.8) |
Sore throat | 173 (36.7) | 1,062 (49.1) |
Coryza | 200 (42.5) | 1,321 (61.0) |
Diarrhea | 88 (18.7) | 731 (33.8) |
Shortness of breath | 2 (0.4) | 32 (1.5) |
We observed that most of the false-positive cases present the top-four predictors with the highest positive coefficients. Simultaneously, only four false negatives reported the loss of smell—the strongest predictor of a positive test. The probability density function and frequency of the model’s predicted values using a testing set, compared to the real (observed) values, can be seen in
To optimize the testing strategy, we applied the predictive model (
The "Dados do Bem" app incorporated our final model on July 17, 2020, using it to prioritize users for testing in some Brazilian states. The external validation using data from Rio de Janeiro comprised 57,762 tests from August 01 to September 01, of which 18.1% were positive (10,466/57,762). In the data from June 15 to July 16 (before model implementation), positivity was only 14.9% (5,296/35,626), indicating that the incorporated model increased the proportion of positive tests. The hypothesis test showed a statistically significant difference between the proportions of positive results before and after the model implementation (p-value < 0.001 with a 95% confidence level).
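The aggregate counts reported above can also be checked with a pooled two-proportion z-test. This is a different technique from the Wilcoxon test used in the study, shown here only as a complementary sketch in Python. The function name `two_proportion_z` is an assumption made for the example.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled z-test of H1: proportion 2 is greater than proportion 1."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability
    return p1, p2, z, p_one_sided

# Counts reported in the text: before (June 15 to July 16) and
# after (August 01 to September 01) the model was implemented.
p_before, p_after, z_score, p_value = two_proportion_z(
    5296, 35626, 10466, 57762
)
```

With samples this large, the difference of roughly three percentage points corresponds to a z-score well above conventional critical values, in line with the p-value < 0.001 reported by the authors. Like the original comparison, this check does not account for changes in epidemic dynamics between the two periods.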
Extensive testing programs for SARS-CoV-2 are, in general, not available in low- and middle-income countries, which makes the under-reporting of confirmed cases a problem. A previous study estimated that only 9.2% of Brazilian cases are being notified [
Since it is impossible to test all individuals, some studies suggest that the combination of symptoms could be used as a screening tool to identify people with potential SARS-CoV-2 infection who could be selected for testing [
Some works criticize the use of symptom-based screening strategies to quantify an individual's likelihood of having COVID-19 due to the non-specific nature of some symptoms and the existence of co-infections with other respiratory viruses [
Our model was incorporated into the app and used to select patients for testing. We chose the city of Rio de Janeiro to evaluate the benefit of using our model. Out of the 57,762 users selected according to the model, 18.1% tested positive. This positivity rate is significantly higher than the one observed without the model (14.9%). It indicates that our model helped improve the testing strategy and select the users most likely to be positive in the current scenario. Hallal et al. [
In addition to estimating each user’s likelihood of being infected, our model also assesses these participants’ geographical distribution, serving as a source of information to build a risk map for Rio de Janeiro, as shown in
The risk map analysis developed in this work is exemplified in
Map created by
Regarding our results of the reported symptoms, loss of smell (anosmia) was the strongest indicator of SARS-CoV-2, followed by fever, shortness of breath, myalgia, cough, nausea, diarrhea, and coryza. The significant influence of loss of smell and cough is in line with previous studies carried out in high-income countries such as the US and UK [
Menni and colleagues [
Sebo and colleagues [
This study has some limitations. First, the symptoms are self-reported. Hence, the participant may report apparent manifestations of the disease, which may not be as precise as a physician’s evaluation. Second, we could not know when a symptom appeared, and therefore could not determine the disease’s stage at the time of testing. Third, a non-negligible number of false negatives may be present, considering the serological test’s sensitivity. However, identifying potential clusters and optimizing testing resources using a combination of self-reported symptoms is a viable strategy for many countries. A similar combination of symptoms can explain SARS-CoV-2 infections in developed countries, such as the United Kingdom and the United States, and in LMIC, such as Brazil. Fourth, we do not guarantee that the dataset represents the Brazilian population, since our objective was not to perform an epidemiological study. Instead, we aimed to analyze the combination of self-reported symptoms from all users who registered in the city of Rio de Janeiro and obtained a test result (either positive or negative) until July 16, 2020.
Our work used data regarding individual symptoms and demographics obtained from an app-based system to predict individuals with a higher probability of being infected by SARS-CoV-2. We developed a screening model and incorporated it into the app, aiming to prioritize users for testing. After applying the model, out of the 57,762 users selected, 18.1% tested positive. This positivity rate was significantly higher than the one observed without the model (14.9%), which indicates that our model helped improve the testing strategy and select the users most likely to be positive in the current scenario. Moreover, we developed a risk map derived from the model, which may help decision-makers locate regions with a higher risk of positive tests, allowing better testing and disease control policies.
The black vertical line corresponds to the cut-off of 0.5, and the colored dashed vertical lines correspond to the expected average probability for the negative (red) and positive (blue) groups.
Statistical analysis and machine learning methods. Predictive modeling. Definitions about the Balancing approaches.
We want to thank all collaborators from "