SARS-CoV-2 infection and cardiovascular or pulmonary complications in ambulatory care: A risk assessment based on routine data

Background Risk factors of severe COVID-19 have mainly been investigated in the hospital setting. We investigated pre-defined risk factors for testing positive for SARS-CoV-2 infection and for cardiovascular or pulmonary complications in the outpatient setting. Methods The present cohort study makes use of ambulatory claims data of statutory health insurance physicians in Bavaria, Germany, covering patients with SARS-CoV-2 infection confirmed or excluded by polymerase chain reaction (PCR) test in the first three quarters of 2020. Statistical modelling and machine learning were used for effect estimation and hypothesis testing of risk factors, and for prognostic modelling of cardiovascular or pulmonary complications. Results A cohort of 99 811 participants with PCR test was identified. In a fully adjusted multivariable regression model, dementia (odds ratio (OR) = 1.36), type 2 diabetes (OR = 1.14) and obesity (OR = 1.08) were identified as significantly associated with a positive PCR test result. Significant risk factors for cardiovascular or pulmonary complications were coronary heart disease (CHD) (OR = 2.58), hypertension (OR = 1.65), tobacco consumption (OR = 1.56), chronic obstructive pulmonary disease (COPD) (OR = 1.53), previous pneumonia (OR = 1.53), chronic kidney disease (CKD) (OR = 1.25) and type 2 diabetes (OR = 1.23). Three simple decision rules derived from prognostic modelling based on age, hypertension, CKD, COPD and CHD were able to identify high-risk patients with a sensitivity of 74.8% and a specificity of 80.0%. Conclusions The decision rules achieved a high prognostic accuracy non-inferior to complex machine learning methods. They might help to identify patients at risk, who should receive special attention and intensified protection in ambulatory care.


Introduction
The danger posed by COVID-19 in a population results from the interplay of the high infectivity of the SARS-CoV-2 virus and the mortality risk to infected persons. Several internal and external factors, such as age, a person's pre-existing health condition, social behaviour or containment measures taken by governments, have been discussed as affecting these risks.
With regard to the risk of infection, governments around the world have imposed containment measures, such as the wearing of masks, social distancing and special hygiene measures. These measures have been the subject of controversial discussion concerning their effectiveness and their potential to aggravate other health conditions due to a delay or failure of treatment [1]. Some evidence has also related the risk of infection to living and social conditions, such as residential care for the elderly, assisted living and mobility [2]. Further exploration identified health conditions such as diabetes, kidney disease, dementia and obesity as relevant risk factors [3]. By contrast, a recent investigation did not reveal any health conditions to be associated with the risk of infection [4].
The research objectives of the present cohort study and nested case-control study were to explore the risk of testing positive for SARS-CoV-2 infection and the risk of cardiovascular or pulmonary complications depending on pre-existing health conditions. In a hypothesis-driven approach, the International Classification of Diseases 10th Revision (ICD-10) diagnoses of pre-defined diseases and health conditions were used to examine whether risk factors known from the hospital setting are also valid in the outpatient setting. Another important research objective was to support ambulatory care decision-making by developing a corresponding prognostic model with the use of regression modelling and machine learning techniques.

Methods
The analysis is based on large routine data of two cohorts with polymerase chain reaction (PCR) test confirmed or excluded SARS-CoV-2 infection, respectively. The anonymous ambulatory claims data was provided by the Bavarian Association of Statutory Health Insurance Physicians (BASHIP) and covers all 11.2 million statutorily insured persons in Bavaria, approximately 85% of the population [20]. During the evaluation period from February to the end of September 2020 (i.e., first to third quarter 2020), patients suspected of having COVID-19 received nasopharyngeal swabs for PCR testing in general practice.
According to the national testing strategy, participants without symptoms could also be tested in general practice, for example travelers from risk areas, staff in health care or other vulnerable sectors, and contacts of infected persons. However, these cases were to be billed separately by the Ministry and were thus not documented as claims data. Individual ambulatory claims data was provided for quarterly billing periods from the first quarter of 2015 to the last quarter of 2020. Consent from participants was not required as the analyses are based on secondary billing data and conducted according to the German guideline "Good Practice of Secondary Data Analysis" [21]. We used the German modification of the International Classification of Diseases 10th Revision (ICD-10-GM) [22] to define diseases and health conditions. Codes that have changed between 2015 and 2020 were updated to the coding valid in 2020 according to the official documentation of the ICD-10-GM, which is released by the German Institute of Medical Documentation and Information [22]. For the definition of pre-existing health conditions we considered only those ICD-10 codes marked as secure diagnoses.
According to the "test-negative design" approach [23], cohorts of individuals with a secured U07.1 diagnosis or a U07.1 diagnosis of exclusion, which code a positive or negative PCR test result for SARS-CoV-2 infection, respectively, were defined for analysis. These cases and controls are henceforth called "test-positives" and "test-negatives". We further excluded individuals from the test-negatives who had an additional secured U07.2 diagnosis, coding a clinically-epidemiologically diagnosed COVID-19 according to the case definition of the World Health Organization [24]. Briefly, the U07.2 diagnosis is used for individuals with COVID-19 symptoms who have been in contact with a confirmed case or live in a facility with a suspected outbreak but have not received a PCR test.
The observation period was defined to include the five years preceding and the quarter following the index quarter of PCR test. Accordingly, determination of an index quarter was restricted to the first three quarters in 2020. Individuals who were not residing in Bavaria within the five years preceding PCR test and had no data record before this observational period were excluded from analysis in an attempt to ensure complete observation periods. We additionally removed individuals with implausible ICD-10 coding, i.e. patients with a death diagnosis (R96, R99 and I46.9) within the observation period (Fig 1).
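The diagnosis-based part of the cohort definition can be illustrated with a short sketch. This is not the authors' implementation: the record layout (`Diagnosis`), the certainty labels `"secured"` and `"excluded"`, the prefix matching of death codes and the precedence of a positive over a negative test code are all assumptions made for illustration. The residence and data-record criteria are omitted.

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical record layout; the actual claims data format is not described in the text.
@dataclass(frozen=True)
class Diagnosis:
    code: str        # ICD-10-GM code, e.g. "U07.1"
    certainty: str   # assumed labels: "secured" or "excluded"


# Death diagnoses treated as implausible coding (R96, R99 and I46.9).
# Matching by prefix is an assumption so that subcodes such as "R96.0" are caught too.
DEATH_CODE_PREFIXES = ("R96", "R99", "I46.9")


def classify(diagnoses: List[Diagnosis]) -> Optional[str]:
    """Return "test-positive", "test-negative", or None if the person is excluded."""

    def has(code: str, certainty: str) -> bool:
        return any(d.code == code and d.certainty == certainty for d in diagnoses)

    # Exclude implausible coding: a death diagnosis within the observation period.
    if any(d.code.startswith(DEATH_CODE_PREFIXES) for d in diagnoses):
        return None
    # Assumption: a secured U07.1 takes precedence over a U07.1 diagnosis of exclusion.
    if has("U07.1", "secured"):
        return "test-positive"
    # Test-negatives must not carry a secured U07.2 (clinical-epidemiological COVID-19).
    if has("U07.1", "excluded") and not has("U07.2", "secured"):
        return "test-negative"
    return None
```

For example, a person with only an excluded U07.1 code is classified as test-negative, whereas the same person with an additional secured U07.2 code is dropped from the analysis.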
Further information about age, sex, urbanization and nursing home living was used to adjust for potential confounding. To adjust for different settlement and health care supply densities we included a measure of urbanization, categorized by four levels as defined by the German Federal Institute for Research on Building, Urban Affairs and Spatial Development: 'large cities' with at least 100 000 residents, and 'urban areas', 'rural areas' or 'sparsely populated rural areas' with population densities of >150, ≤150 or ≤100 inhabitants per km², respectively [25]. Seven recorded fee codes were used to identify individuals living in nursing homes up to three quarters before or during the quarter of PCR test.
The underlying data for this study are pseudonymized and the study was approved by the Ethics Commission of the Technical University of Munich (Ethikkommission der Technischen Universität München) (approval No 673/20 S-EB).

Statistical analysis
Two multivariable binary regression models were used to investigate the risk of positive PCR test result and the risk of cardiovascular or pulmonary complications separately. Known risk factors for severe COVID-19 in the hospital setting were pre-defined for inclusion into the models as independent predictor variables. These were hypertension, diabetes, CKD, dementia, obesity, CHD, COPD, pneumonia, asthma, tobacco consumption, cancer, liver disease and depression. Additional potential risk factors of interest included in the analysis were anxiety disorder, vitamin D deficiency, immunodeficiency and flu. Respective ICD-10 codes are listed in S1 Table. Potential confounding by age, sex, urbanization and nursing home living was addressed by including respective factor variables and an interaction between age and sex in the models. Age was categorized into the intervals 0-20, 21-30, 31-40, ..., 80+ and urbanization into the levels 'large cities', 'urban areas', 'rural areas' and 'sparsely populated rural areas'.
PLOS ONE

In the absence of hospital data we decided a priori to use cardiovascular or pulmonary complications occurring in the first quarter after PCR-confirmed SARS-CoV-2 infection as a proxy for a severe course of COVID-19 in the outpatient setting. Diagnoses included in this outcome were acute respiratory distress syndrome (ARDS), hypoxia, stroke, angina pectoris, heart attack, cardiac arrest, pulmonary embolism and apnea (S1 Table). In addition, to elaborate whether a COVID-19 specific risk model is meaningful beyond a general risk model, we investigated COVID-19 as an independent predictor of defined complications. We therefore fitted the multivariable regression model to both groups simultaneously, i.e. to test-positives and test-negatives, in two steps. First, we included the result of PCR test as an additional risk factor, and second, all interaction effects between the PCR test result and the investigated risk factors were sequentially added and screened for possible inclusion in the multivariable regression model with forward stepwise variable selection based on Akaike's information criterion (AIC). Goodness-of-fit of these nested models was compared by a descriptive likelihood-ratio test without formal adjustment for AIC-based model selection.
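The AIC-based forward step and the likelihood-ratio comparison of nested models can be illustrated with a small sketch. It assumes that each candidate model has already been fitted and is summarised by its maximised log-likelihood and its number of estimated parameters. The term names and all numbers are invented for illustration and are not results from the study.

```python
from typing import Dict, Optional, Tuple

# A fitted model summarised as (maximised log-likelihood, number of estimated parameters).
Fit = Tuple[float, int]


def aic(fit: Fit) -> float:
    """Akaike's information criterion: 2k - 2 * log-likelihood (lower is better)."""
    loglik, k = fit
    return 2 * k - 2 * loglik


def lr_statistic(small: Fit, large: Fit) -> Tuple[float, int]:
    """Likelihood-ratio statistic and degrees of freedom for two nested models."""
    return 2 * (large[0] - small[0]), large[1] - small[1]


def forward_step(current: Fit, candidates: Dict[str, Fit]) -> Optional[str]:
    """Return the candidate term giving the lowest AIC, or None if no term improves it."""
    best_name, best_aic = None, aic(current)
    for name, fit in candidates.items():
        if aic(fit) < best_aic:
            best_name, best_aic = name, aic(fit)
    return best_name


# Made-up numbers for illustration only.
base = (-1000.0, 10)                      # model with the PCR result as main effect
candidates = {
    "pcr:hypertension": (-995.0, 11),     # hypothetical interaction terms
    "pcr:asthma": (-999.8, 11),
}
chosen = forward_step(base, candidates)
```

In this toy example the first interaction lowers the AIC and would be added, while the second would not. Repeating the step until `forward_step` returns `None` gives forward stepwise selection.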
All hypothesis tests were performed at local and global 5% levels of significance, i.e. without and with adjustment for multiple testing. For the latter, P values were additionally adjusted using the joint distribution of the regression coefficients of the multivariable models [26].
With the intention of assisting decision-making in ambulatory care, we additionally developed a prognostic model for the risk of cardiovascular or pulmonary complications in the outpatient setting with the use of regression modelling and machine learning. Selected algorithms were random forest, conditional inference tree, least absolute shrinkage and selection operator (LASSO), ridge regression, elastic net and binary logistic regression with and without stepwise variable selection based on AIC. We randomly selected 75% of the participants (derivation set) to develop the models and internally validated the performance of the models on the remaining participants (validation set) using the area under the receiver operating characteristic curve (AUC) as a measure of discriminatory ability. To improve performance of the prognostic models we tuned the parameters of the machine learning algorithms using three-fold cross-validation within the derivation set. With the aim of visualizing the results of the best performing model in a comprehensible manner, recursive segmentation and recursive partitioning were applied to the best performing model's predictions [27] to identify subgroups of different risks [28]. This resulted in decision rules enabling a specific characterization of patients with an increased risk of defined complications.
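The random 75/25 split and the AUC used for internal validation can be sketched as follows. This is a minimal illustration using only the Python standard library, not the R code used in the study. The function names, the seed and the rounding of the split size are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def split_indices(n: int, derivation_fraction: float = 0.75,
                  seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly split indices 0..n-1 into a derivation set and a validation set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * derivation_fraction)  # assumed rounding rule
    return indices[:cut], indices[cut:]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random case scores higher than a random
    non-case, with ties counted as one half (pairwise comparison, O(n^2))."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    non_cases = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not non_cases:
        raise ValueError("AUC needs at least one case and one non-case")
    wins = sum((c > n) + 0.5 * (c == n) for c in cases for n in non_cases)
    return wins / (len(cases) * len(non_cases))


derivation, validation = split_indices(46071)
toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Applied to 46 071 participants, this rounding yields 34 553 derivation and 11 518 validation indices. In practice a model would be fitted on the derivation indices only, and `auc` would be evaluated on its predictions for the validation indices.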
All statistical analyses were performed in R, version 4.0.3 (The R Foundation for Statistical Computing, Vienna, Austria).

Risk of positive PCR test result
A total of 99 811 participants were included in the analysis of the risk of positive PCR test result for SARS-CoV-2 infection. Of these participants, 58 336 (58.4%) were female and 79 236 (79.4%) were younger than 60 years (mean age ± SD = 44.3 ± 20.8 years). Among the participants we identified 53 904 (54.0%) test-positives. Overall characteristics of test-positives and test-negatives were similar (Table 1). A flow chart of the participant selection process is given in  Table 2).

Risk of cardiovascular or pulmonary complications
For the analysis of defined complications the cohort of test-positives was reduced to 46 071 participants with available data records in the first quarter after the index quarter of PCR test. Complications could be identified in 1904 (4.1%) individuals, including ARDS (55 (2.9%)),
We additionally analysed the specific relevance of our results of the cohort of test-positives in comparison with test-negatives. When including the result of PCR test as an additional risk factor to a multivariable regression model that is fit to both groups, we found increased risk

Table 2. Odds ratio (OR) and 95% confidence intervals (CI) for the risk of positive PCR test result for SARS-CoV-2 infection and the risk of cardiovascular or pulmonary complications. Multivariable binary regression models adjusted for age, sex, urbanisation, nursing home living and diseases shown.

Prognostic model
For the development of a prognostic model for the risk of cardiovascular or pulmonary complications, 46 071 test-positive participants were randomly allocated to a derivation set (34 553 participants) and to a validation set (11 518 participants). In the derivation and validation sets 1448 (4.2%) and 456 (4.0%) participants had complications, respectively (S3 Table).
Internal validation of the prognostic models showed AUC values ranging from 0.83 to 0.85, indicating excellent [29] and similar discriminatory ability of all models (Table 3). A random forest achieved the best performance, with the disadvantage of lacking interpretability. Therefore, based on its predictions we performed recursive segmentation and partitioning to characterize subgroups with an increased risk of defined complications, enabling better assistance of decision-making in ambulatory care. From the resulting tree's structure (Fig 2) we derived the following decision rules defining patients of increased risk: 1) age >70 years, 2) diagnosis of CHD, 3) diagnosis of CKD or COPD with an additional diagnosis of hypertension or age >60 years. A validation of these decision rules resulted in a sensitivity and specificity of 74.8% and 80.0%, respectively. The positive and negative predictive values (PPV, NPV) were respectively 13.3% and 98.7%. The performance of these simple decision rules was not inferior to the performance of the more complex prognostic models (cf. Fig 3).
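The three decision rules can be written down as a short function. The sketch below is an illustration only: the `Patient` record is hypothetical, and rule 3 is read as "(CKD or COPD) and (hypertension or age >60 years)", which is one possible interpretation of the wording above.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


# Hypothetical patient record; field names are chosen for illustration.
@dataclass(frozen=True)
class Patient:
    age: int
    chd: bool = False
    ckd: bool = False
    copd: bool = False
    hypertension: bool = False


def high_risk(p: Patient) -> bool:
    """Apply the three decision rules; any rule that fires marks the patient as high risk."""
    rule1 = p.age > 70
    rule2 = p.chd
    # Assumed reading of rule 3: (CKD or COPD) and (hypertension or age > 60).
    rule3 = (p.ckd or p.copd) and (p.hypertension or p.age > 60)
    return rule1 or rule2 or rule3


def sensitivity_specificity(patients: Sequence[Patient],
                            outcomes: Sequence[bool]) -> Tuple[float, float]:
    """Sensitivity and specificity of the rules against observed complications."""
    tp = fn = tn = fp = 0
    for p, had_complication in zip(patients, outcomes):
        flagged = high_risk(p)
        if had_complication and flagged:
            tp += 1
        elif had_complication:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)
```

For instance, a 65-year-old patient with COPD is flagged by rule 3, whereas a 50-year-old patient with CKD but without hypertension is not flagged.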

Discussion
The analysis of ambulatory claims data with regression modelling and machine learning techniques revealed that patients with tobacco consumption, previous flu and cancer were less likely to test positive for SARS-CoV-2 infection, whereas patients with dementia, type 2 diabetes and obesity showed an increased risk of a positive PCR test result. CHD, CKD, COPD, hypertension and increased age were identified as predictors for unfavourable complications after PCR-confirmed infection.

Table 3. Area under the receiver operating characteristic curve (AUC) and 95% confidence intervals (CI) of machine learning methods for cardiovascular or pulmonary complications. Internal validation.

Consistent with previous findings [3], the present study provides evidence for a strong association of the diagnosis of dementia with a positive PCR test result for SARS-CoV-2 infection. Additionally, our results indicate that type 2 diabetes and obesity are positively associated with testing positive, while some health conditions previously found to increase the risk of severe COVID-19, including tobacco consumption and cancer, were surprisingly associated with a lower risk of testing positive. The latter might be explained by behavioural adjustment in patients belonging to the respective vulnerable subgroups. However, this explanation is weakened by the fact that the cohort of tested participants may also include asymptomatic participants. Contrary to previous assumptions [30], our analysis did not suggest a significant association of hypertension or type 1 diabetes with the risk of SARS-CoV-2 infection. This might be explained by increased efforts of the potentially vulnerable population to protect themselves from infection. In line with this, dementia showed a comparatively strong association with a positive PCR test result, as it is known to substantially affect daily functioning [31].
With regard to the risk of cardiovascular or pulmonary complications, we found various health conditions including CHD, hypertension, tobacco consumption, COPD, CKD and type 2 diabetes to be associated with an increased risk, which is consistent with findings of previous studies [5-9, 11, 13]. Additionally, we observed associations with previous pneumonia, depression, asthma and obesity. In contrast to the findings of [8,32], our results do not suggest a significant association with cancer. It is important to note, however, that we did not restrict our analyses to a recent cancer diagnosis but investigated cancer history in the whole observation period of five years. An additional sensitivity analysis showed that consideration of a recent cancer diagnosis may yield an increased risk for complications. We also did not find significant associations for some risk factors found in previous studies [6-8], including dementia, liver disease and type 1 diabetes. This might be explained by the process of data collection in the outpatient setting. There are probably fewer complications in these patients than in hospitalized patients, which could lead to weaker or absent associations with unfavourable complications.
Our prognostic model can be applied in ambulatory care for the identification of patients with an increased risk of cardiovascular or pulmonary complications, based on information which is readily accessible for treating physicians. Our best performing model, a random forest, achieved excellent prognostic accuracy. Based on its predictions we identified subgroups of interest. In this regard, we derived decision rules for patients with an increased risk of complications, which achieved high prognostic accuracy. The resulting PPV = 13.3% indicates that patients with a risk constellation according to our prognostic model should receive special attention and intensified protection in ambulatory care. At first sight, this PPV seems to be low although the sensitivity and specificity values are high at 74.8% and 80.0%, respectively. However, this PPV is acceptable given the low prevalence of 4.1%; higher PPV values would result from a higher prevalence. On the other hand, the resulting NPV = 98.7% helps to rule out cardiovascular or pulmonary complications in patients without these pre-existing conditions.
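The dependence of the predictive values on prevalence follows from Bayes' theorem and can be checked with a few lines of code. The sketch below uses the reported sensitivity and specificity. The prevalence values other than roughly 4% are hypothetical and serve only to show the trend.

```python
from typing import Tuple


def predictive_values(sensitivity: float, specificity: float,
                      prevalence: float) -> Tuple[float, float]:
    """Positive and negative predictive value for a given prevalence (Bayes' theorem)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


# Sensitivity 74.8% and specificity 80.0% as reported in the text;
# the prevalences of 10% and 30% are hypothetical.
for prevalence in (0.04, 0.10, 0.30):
    ppv, npv = predictive_values(0.748, 0.800, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

At a prevalence of about 4% this gives a PPV of roughly 13% and an NPV of roughly 99%, close to the reported values. The PPV rises as the prevalence increases, while the NPV falls.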

Strengths and limitations
A strength of our study is the high representativeness due to the large sample size covering the majority of the Bavarian population. Our study also accounted for the main potential confounders, i.e. age, sex, urbanization and nursing home living. Investigations were hypothesis-driven, focusing on pre-defined diseases and health conditions. Finally, in the derivation of the prognostic model, we used a separate part of the data for internal validation. However, the generalizability of the prognostic model still requires assessment by external validation.
Our study has some further limitations. Because hospital data were not available, our outcome was a proxy for severe COVID-19, defined a priori as cardiovascular or pulmonary complications occurring in the quarter following SARS-CoV-2 infection. This definition may have led to obvious relations with pre-existing diseases and health conditions known to be associated with cardiovascular or pulmonary complications. This limitation was addressed by an additional investigation of risks that are specific to test-positives compared to test-negatives. Despite adjustment for potential confounders, some unobserved confounding may remain, e.g. vulnerable subgroups might be more alert to symptoms and therefore more likely to test positive for COVID-19 than the other participants, leading to bias in our study. Another possible limitation is that the data and inherent diagnoses are not audited and reflect the coding and clinical practices of treating physicians. The determination of the cohort of test-negatives was based on the coding of a negative PCR test result, which was optional for physicians. This might introduce bias to the study, as the willingness of physicians to code the negative PCR test result may vary between different participant groups. Diagnoses could also only be made through physician contact, which may result in incomplete data. Potential incorrect coding was addressed by careful quality checks as described in the methods section. Beyond that, the possibility of asymptomatic participants in a cohort can presumably introduce bias. However, tests of asymptomatic patients were to be billed separately by the government and were consequently not documented as claims data for the BASHIP. Therefore, the association between risk factors and the outcome, i.e. the estimated odds ratios, should be unbiased in this respect.

Conclusion
The prediction rule based on presence or absence of CHD, CKD, COPD, hypertension and increased age might help to rule in and rule out, respectively, unfavourable complications in ambulatory care. The risk of infection in itself might be reduced in patients with tobacco consumption, previous flu and cancer due to behavioural adjustment in terms of increased self-protection and contact reduction.
Supporting information
S1 Table. International classification of diseases 10th revision (ICD-10) diagnoses included in the multivariable analyses. (PDF)
S2