Time-series cardiovascular risk factors and receipt of screening for breast, cervical, and colon cancer: The Guideline Advantage

Background
Cancer is the second leading cause of death in the United States. Cancer screenings can detect precancerous cells and allow for earlier diagnosis and treatment. Our purpose was to better understand factors associated with receipt of cancer screenings and to assess changes in cardiovascular health (CVH) measures before and after cancer screenings among patients.
Methods
We used The Guideline Advantage (TGA), an American Heart Association ambulatory quality clinical data registry of electronic health record data (n = 362,533 patients), to investigate associations between time-series CVH measures and receipt of breast, cervical, and colon cancer screenings. Long short-term memory (LSTM) neural networks were employed to predict receipt of cancer screenings. We also compared the distributions of CVH factors between patients who received cancer screenings and those who did not. Finally, we examined and quantified changes in CVH measures among the screened and non-screened groups.
Results
Model performance was evaluated by the area under the receiver operating characteristic curve (AUROC): the average AUROC of 10 curves was 0.63 for breast, 0.70 for cervical, and 0.61 for colon cancer screening. Distribution comparison found that screened patients had a higher prevalence of poor CVH categories. CVH submetrics improved for patients after cancer screenings.
Conclusion
Deep learning algorithms could be used to investigate the associations between time-series CVH measures and cancer screenings in an ambulatory population. Patients with more adverse CVH profiles tend to be screened for cancers, and cancer screening may also prompt favorable changes in CVH. Cancer screenings may improve patient CVH, thus potentially decreasing the burden of disease (e.g., cardiovascular diseases and cancers) and costs for the health system.


Introduction
Cancer is the second leading cause of death for both men and women in the United States (US) [1]: breast cancer is the second leading cause of cancer death among women [2]; colorectal cancer ranks second among men and third among women [3]; and cervical cancer is a major cause of cancer death among women [4]. Regular screenings for breast, cervical, and colorectal cancers can help to diagnose cancers early and reduce cancer deaths [5]. For example, in the past 40 years, the number of deaths caused by cervical cancer has decreased significantly thanks to Pap tests, which can find abnormal cervical cells before they turn into cancer [6]. Similarly, colonoscopy allows non-cancerous colon polyps to be removed before they become malignant, and regular mammography screening can identify breast cancer at an earlier, more treatable stage. Thus, breast cancer screening (BCS), cervical cancer screening (CECS), and colorectal cancer screening (COCS) are very important for early detection and treatment.
Factors associated with cancer screenings include: demographic factors, health insurance coverage, education level, smoking status, obesity, and cholesterol testing. For example, receipt of mammography is associated with modifiable factors such as weight, smoking, and other lifestyle factors [7][8][9][10][11]. Receipt of CECS is associated with healthier weight [12], lower cardiovascular disease occurrence [13], and lower cholesterol [14]. Some studies suggest that smoking, sedentary lifestyle, high body mass index, and high comorbidity are associated with a higher percentage of COCS participation [15][16][17]. Traditionally, data for such studies originate from questionnaires, claims data, and telephone surveys, and statistical analysis methods such as logistic regression models are applied to examine the associations between the risk factors and cancer screenings. Electronic health records (EHR) contain longitudinal healthcare information and data including diagnoses, medications, procedures, lab tests, and images [18] and therefore can be used to discover new patterns and relationships from the rich data. Deep learning algorithms have been widely and successfully used in bioinformatics and healthcare fields as they can effectively capture features and patterns in longitudinal data [19,20].
In this study, we investigated associations between longitudinal CVH risk factors and the receipt of cancer screenings by applying a long short-term memory (LSTM) model [21] to EHR data. We then compared the distributions of CVH factors between patients who did and did not receive cancer screenings to further investigate the associations. Finally, we compared measures of CVH longitudinally within those who did and did not receive screening to better understand the effect of cancer screenings on CVH measures.

Ethics statement
All the data were fully anonymized before we accessed them. Our study was approved by the Institutional Review Board at the Washington University School of Medicine in St. Louis. We obtained a written acknowledgement of proprietary rights and non-disclosure and data use agreement from the American Heart Association (The Washington University_NDA_DUA_-CONTRACTID 158065_2019.04.26_K).

Data source and study population
The Guideline Advantage (TGA) is a clinical data registry established in 2011 by the American Cancer Society, the American Diabetes Association, and the American Heart Association (AHA) [22]. EHR data has been collected from over 70 clinics across the US by the TGA to track and monitor disease management and outpatient preventative care [23]. We used longitudinal TGA data to predict three types of cancer screenings among 362,533 unique patients.
We used a 6-year range (2010-2015) to identify 777 female patients in the 40-69 year old age group who received BCS; 617 female patients in the 21-64 year old age group who received CECS; and 264 patients in the 50-75 year old age group who received COCS. If patients received multiple types of cancer screening, we only considered the first. Using the same criteria for gender and age, we randomly selected a comparison group of patients who did not receive cancer screenings: 8000 for BCS, 6000 for CECS, and 3000 for COCS.
We utilized the following CVH measures defined by the AHA: smoking status, body mass index (BMI), blood pressure (BP), hemoglobin A1c (A1C), and cholesterol (Low-Density Lipoprotein (LDL) in our dataset). We then classified them into three categories: ideal, intermediate, or poor, according to Table 1. We utilized the Multum drug database [24] as a template to convert the drug names in our dataset to their corresponding drug classes. The Levenshtein distance algorithm [25] was employed for the conversion by comparing the drug names in our dataset to the Multum drug database template. The conversion was considered successful and medications were considered as treatments for BP, A1C, or LDL (Table 1) if the distance between the two compared strings was less than five. All CVH measurements prior to the date of cancer screening were considered in the analysis for those who received screening, and all CVH measurements in the data set were considered in the analysis for those who did not receive screening.
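The drug-name conversion described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the lowercasing step, and the three-entry template are assumptions made for the example, and the template entries are not taken from the Multum database. Only the rule itself (accept a match when the Levenshtein distance is less than five) comes from the text.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def match_treatment(name, template, max_distance=5):
    """Return the treatment target of the closest template name,
    or None when no template name is closer than max_distance."""
    name = name.lower()
    best = min(template, key=lambda t: levenshtein(name, t.lower()))
    if levenshtein(name, best.lower()) < max_distance:
        return template[best]
    return None


# Hypothetical template: drug name -> CVH measure it treats.
TEMPLATE = {"lisinopril": "BP", "metformin": "A1C", "atorvastatin": "LDL"}

print(match_treatment("Metformin HCl", TEMPLATE))  # distance 4 to "metformin"
print(match_treatment("xyz", TEMPLATE))            # no name within distance 5
```

A fixed distance threshold is simple but treats short and long names alike; that trade-off is worth keeping in mind when reading the conversion rule.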
For the primary analysis, we selected patients who had at least one measure of CVH: 725 for BCS, 565 for CECS, and 240 for COCS. In the comparison groups, there were available data for 8,000 BCS; 3,548 CECS; and 3,000 COCS.

Statistical analysis
We first studied the LSTM prediction of cancer screening from time-series CVH factors. We divided each CVH factor into its submetric of "ideal", "intermediate", or "poor" according to Table 1. For example, if a patient had a measure of "ideal" blood pressure, then that feature was called blood pressure ideal. All features were then embedded into a 32-dimensional vector space by word2vec [27] for each type of cancer screening. The Python Gensim Word2Vec model used the following hyperparameters: size (embedding dimension) was 32, window (the maximum distance between a target word and the words around it) was 5, min_count (the minimum frequency a word must have to be included in training) was 1, and sg (the training algorithm) was set to CBOW (continuous bag of words). Time information for each measure was added, calculated as the difference in days between each visit date and the most recent visit date. Thus, each feature was associated with its own time point in units of days. The resulting embedded vectors and associated time points were fed to the LSTM model. Because the comparison group was much larger than the group of patients with cancer screening, we randomly selected 800 comparison patients for BCS, 600 for CECS, and 300 for COCS, and repeated this process 10 times to account for the imbalance between screened and unscreened groups. Each time, the data set for each type of cancer screening was split into a training data set (80%) and a test data set (20%). We trained the LSTM model on the training data and tested the trained model on the test data. We used the average area under the receiver operating characteristic curve (AUROC) across the 10 repetitions to evaluate the performance of our LSTM model for each type of cancer screening.
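The evaluation protocol in this paragraph (time offsets in days, repeated random selection of comparison patients, an 80/20 split, and AUROC) can be sketched with the standard library alone. All function names, the seed, and the toy inputs are assumptions for illustration; the LSTM itself is omitted, and the AUROC is computed with the rank-based definition rather than any particular library.

```python
import random
from datetime import date


def days_before_latest(visit_dates):
    """Days between each visit and the patient's most recent visit."""
    latest = max(visit_dates)
    return [(latest - d).days for d in visit_dates]


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_splits(screened, unscreened, n_controls,
                    repeats=10, test_frac=0.2, seed=0):
    """Yield (train, test) lists of (patient, label) pairs, drawing a
    fresh random subset of unscreened patients on every repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        controls = rng.sample(unscreened, n_controls)
        data = [(p, 1) for p in screened] + [(p, 0) for p in controls]
        rng.shuffle(data)
        cut = len(data) - round(len(data) * test_frac)
        yield data[:cut], data[cut:]


# Toy usage with made-up patient identifiers.
splits = list(balanced_splits(list(range(10)), list(range(100, 200)), 10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
print(days_before_latest([date(2015, 1, 1), date(2015, 1, 31)]))
```

In a full pipeline, a model would be fitted on each training split, scored on the matching test split with `auroc`, and the 10 resulting values averaged.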
Our LSTM model comprised an input layer, one hidden layer (with 100 dimensions), and an output layer. The hyperparameters used in the model were as follows: a sigmoid function was used as the activation function in the output layer, binary cross-entropy was used as the loss function, and the Adam optimizer [28] was used to optimize the model with a mini-batch size of 64 samples.
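For reference, the output activation and loss named above can be written out explicitly. The notation is an assumption introduced here, not the paper's: $h_i \in \mathbb{R}^{100}$ denotes the hidden-layer representation for patient $i$ (assumed to be the final LSTM hidden state), $w$ and $b$ are the output-layer weights and bias, $y_i \in \{0, 1\}$ indicates receipt of screening, and $B = 64$ is the mini-batch size.

```latex
\hat{y}_i = \sigma\!\left(w^{\top} h_i + b\right)
          = \frac{1}{1 + e^{-\left(w^{\top} h_i + b\right)}},
\qquad
\mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B}
  \Big[\, y_i \log \hat{y}_i + (1 - y_i) \log\!\left(1 - \hat{y}_i\right) \Big]
```

Adam then updates the network parameters using the gradient of $\mathcal{L}$ computed on each mini-batch.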
We then used a Chi-squared test to investigate whether the distributions of CVH (counts and percentages for each submetric) differed between patients who did and did not receive cancer screenings. Finally, we studied changes in CVH factors within each group, separately for the patients who received screening and for those who did not. Within the screened group, we compared CVH measures from before and on the day of the screening to the CVH measures collected after the screening. For the patients who did not receive screening, we compared CVH measures before and after the mid-point of the visit dates. Patients with only a single visit were not included in the before and after analysis. Analyses were conducted in 2019 using the Scikit-learn, SciPy, and Matplotlib libraries with Python version 3.6.5.
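As a rough illustration of the distribution comparison, the sketch below computes a Pearson Chi-squared statistic for a 2 x 3 table (screened vs. not screened, by ideal, intermediate, and poor). The counts are invented for the example and are not from Table 3, and the function names are hypothetical. For a table of this shape there are 2 degrees of freedom, in which case the p-value has the closed form exp(-x/2).

```python
import math


def chi_squared(table):
    """Pearson Chi-squared statistic and degrees of freedom
    for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def p_value_dof2(stat):
    """Upper-tail probability of a Chi-squared variable with 2 degrees
    of freedom (valid only for dof == 2)."""
    return math.exp(-stat / 2)


# Hypothetical counts: rows = screened, not screened;
# columns = ideal, intermediate, poor.
counts = [[20, 30, 50],
          [30, 30, 40]]
stat, dof = chi_squared(counts)
print(round(stat, 3), dof, round(p_value_dof2(stat), 3))
```

For tables with other shapes, a library routine such as SciPy's contingency-table test would be the more general choice.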

Results
The majority of our study population was white, with a mean age of approximately 55 years for BCS, 50 years for CECS, and 60 years for COCS (Table 2). The non-white study population was predominantly African-American. The average number of measures (Avg #) among patients who received screening was higher than that of patients who were not screened. For example, the average number of BP measurements for patients with BCS was 11 (15 for CECS and 13 for COCS) compared to 8 for BCS (7 for CECS and 8 for COCS) for patients who were not screened. Fig 1 displays the performance of LSTM cancer screening predictions in terms of 10 repeated AUROCs for each type of screening. The average AUROC of 10 curves was 0.63 for BCS, 0.70 for CECS, and 0.61 for COCS. Table 3 lists the numbers and proportions of patients in the ideal, intermediate, and poor categories for each submetric, comparing patients who received cancer screening with those who did not. We applied a Chi-squared test [29] to check whether the frequencies (here percentages) between screening groups were significantly different from one another within each CVH submetric. As shown in Table 3, patients who received cancer screening had a higher prevalence of poor A1C (62% for BCS, 58% for CECS, and 72% for COCS) compared to patients who did not receive screening (53% for BCS, 53% for CECS, and 51% for COCS).

PLOS ONE
From the first column of Fig 2, we can see that the prevalence of "poor" submetrics decreased after cancer screenings. For example, all five submetrics improved after BCS (Fig 2(A)), while BP and A1C improved after CECS (Fig 2(B)), and BP, A1C, and smoking improved after COCS (Fig 2(C)). Notably, the prevalence of poor A1C decreased for all patients who received cancer screenings: 7% in BCS, 14% in CECS, and 17% in COCS. On the other hand, from the second column of Fig 2, we can see that the prevalence of "poor" A1C increased for all comparison patients.

Discussion
In this study, we demonstrated associations between time-series CVH risk factor measures and receipt of three types of cancer screenings (breast, cervical, and colon) by using a nationally representative dataset, the TGA data. The TGA data enabled us to examine multiple sites, CVH submetrics, and types of cancer screenings using advanced deep learning models. An advantage of our study was that all 5 CVH submetrics were investigated simultaneously for an association with 3 different cancer screenings in this large dataset, which contains longitudinal CVH measurements and cancer screening patterns from more than 70 different clinics in the US.
The comparison of CVH measure distributions between patients who received cancer screenings and those who did not showed that patients with poorer CVH, especially poor A1C, were more likely to receive cancer screenings. Some recent studies have shown that individuals with diabetes had a 30% higher incidence of certain cancers and were also more likely to be diagnosed with advanced-stage tumors [30][31][32][33]. Thus, providers might be more likely to recommend cancer screenings to patients with diabetes so that cancers can be prevented or detected early, which may lead more individuals with diabetes to participate in cancer screenings.
Moreover, we investigated the effects of cancer screenings on the changes of CVH measures of the patients to better understand if the screenings had potential associations with the improvement of CVH measures. Our results indicated that patients who received cancer screenings appeared to have better control of CVH factors, especially A1C, than patients who did not receive cancer screenings. Specifically, A1C levels were improved after patients received any type of screening, while A1C levels worsened among patients who did not receive cancer screening. A similar trend could be observed for BMI: it became better after patients received any type of screening, while BMI became worse among patients without BCS or COCS. Levels of BP were improved after patients received BCS or COCS screenings and worsened among patients without BCS or COCS. Poor levels of LDL decreased among patients after receipt of BCS and among those without BCS. However, LDL improvements were much greater among patients after receipt of BCS (34% decrease in LDL) than those without BCS (10% decrease in LDL). After receipt of BCS and COCS, current smoking declined compared to the increase observed among those without the screenings.
In summary, our analyses showed that patients with poor CVH measures were more likely to receive cancer screenings. Patients who received cancer screenings appeared to have improved CVH measures after the screening compared to before. One possible reason is that patients might receive more attention and thorough care from providers to detect and manage CVH by virtue of reviewing cancer screening and other risk factor data. At the population level, better CVH is associated with a lower risk of cardiovascular disease (CVD) and cancers [34,35]. Thus, by improving patient CVH, cancer screenings may indirectly decrease the burden of disease (e.g., CVD and cancers) and its cost to the health system.

Limitations
There were some limitations in our analyses. We used values of AUROC to evaluate associations between time-series CVH measurements and receipt of cancer screenings. Higher AUROC values indicated stronger associations between predictors and the binary outcomes [36]. However, our observed AUROC values were relatively low and thus have limited clinical utility at this time. Cancer screenings are potentially affected by CVH and other factors. We acknowledge that we had relatively few patients with receipt of cancer screening. Specifically, there were relatively few patients who received cancer screenings compared to patients who did not within the same age and gender groups. This limitation likely affected the accuracy of our prediction models. The prediction accuracy of our models could be improved if more patients in our data set had received cancer screening.
Fig 2. The plots of percentages for poor CVH factors for the same patients before and after time points of cancer screening for patients with screenings (A)-(C) and before and after middle time points for patients without cancer screenings (D)-(F). The first row is for BCS, the second row is for CECS, and the third is for COCS.
https://doi.org/10.1371/journal.pone.0236836.g002

Conclusions
We demonstrated that deep learning LSTM models can effectively predict the associations between time-series CVH measures and receipt of cancer screening. Poor CVH, especially poor A1C, may prompt providers to recommend cancer screening for their patients. And patients who received cancer screening may also receive better care for and/or have improved self-management of CVH, especially A1C. Overall, these findings suggest that unhealthier patients are screened for cancers, and that cancer screening may also prompt favorable changes in CVH.