Mining co-occurrence and sequence patterns from cancer diagnoses in New York State

The goal of this study is to discover disease co-occurrence and sequence patterns from large-scale cancer diagnosis histories in New York State. In particular, we want to identify disparities among different patient groups. Our study will provide essential knowledge for clinical researchers to further investigate comorbidities and disease progression for improving the management of multiple diseases. We used inpatient discharge and outpatient visit records from the New York State Statewide Planning and Research Cooperative System (SPARCS) from 2011-2015. We grouped each patient’s visit history to generate diagnosis sequences for the seven most common cancer types. We performed frequent disease co-occurrence mining using the Apriori algorithm, and frequent disease sequence pattern discovery using the cSPADE algorithm. Different types of cancer demonstrated distinct patterns. Disparities in both disease co-occurrence and sequence patterns were observed among patients in different age groups. There were also considerable disparities in disease co-occurrence patterns with respect to different claim types (i.e., inpatient, outpatient, emergency department and ambulatory surgery). Gender disparities were mostly found where the cancer types were gender specific. Supports of most patterns were usually higher for males than for females. Compared with secondary diagnosis codes, primary diagnosis codes conveyed more stable results. Two disease sequences consisting of the same diagnoses in different orders usually had different supports. Our results suggest that the methods adopted can generate potentially interesting and clinically meaningful disease co-occurrence and sequence patterns, and identify disparities among various patient groups. These patterns could imply comorbidities and disease progressions.


Background and significance
Patient-level longitudinal data mining and pattern discovery is a common approach in public health studies. For example, disease co-occurrence patterns and disease sequence patterns from a large number of patients' diagnosis histories could help to discover comorbidity or disease progression.

Objective
This is a retrospective cohort study that aims to analyze frequent disease co-occurrence and sequence patterns of cancer diagnoses in New York State. We studied disparities among these frequent patterns with respect to age, gender and claim types (i.e., inpatient, outpatient, emergency department and ambulatory surgery) for hospital visits or stays. Since cancer ranks second among the leading causes of death in the United States [19], we believe that the results will provide essential data and knowledge for clinical researchers to further investigate comorbidities and disease progression for improving the management of cancers.

Materials and methods
This study was approved by the Stony Brook University IRB (CORIHS B). We took advantage of the cancer-related diagnosis information available in SPARCS data, i.e., the ninth and tenth revisions of the International Classification of Diseases (ICD-9 and ICD-10) diagnosis codes, and converted them to single-level Clinical Classifications Software (CCS) diagnosis categories [20] to discover disease co-occurrence and sequence patterns from patients' full diagnosis histories within a five-year time frame.

Data sources
While our SPARCS data for most claim types are available from as early as 2003, outpatient records are only available from 2011 onward. To provide a comprehensive history of patient visits, we chose discharge records in SPARCS during 2011-2015, for which all claim types are available. Descriptive statistics of cancer patients based on discharge records in SPARCS during 2011-2015 are presented in Table 1. Each discharge record contains one or more ICD-9 or ICD-10 diagnosis codes. The first diagnosis code is the primary diagnosis code, which represents the main reason for that hospital visit. The rest are secondary diagnosis codes, which represent the conditions coexisting during the same hospital stay or visit. To reduce the dimensionality of the data, we mapped ICD diagnosis codes to single-level CCS diagnosis categories and used CCS categories in our analyses. In this paper, both CCS diagnosis category descriptions and labels are used to represent various diagnoses. Since procedure codes were only available in a very small portion of records, we kept using ICD-9 or ICD-10 procedure codes without mapping. This study focuses on discovering patterns from seven types of cancer with high incidence rates in New York State: rectum and anus cancer (15), liver and intrahepatic bile duct cancer (16), pancreas cancer (17), lung and bronchus cancer (19), breast cancer (24), prostate cancer (29) and Non-Hodgkin's lymphoma (38) [21]. There are 8,645,995 discharge records from 742,487 patients used in our work.
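The mapping step described above can be illustrated with a minimal Python sketch. The lookup table below is a four-entry toy excerpt written for illustration, not the CCS tool file used in the study; the function name `to_ccs` and the choices to drop duplicate categories and skip unmapped codes are assumptions of this sketch.

```python
# Toy excerpt of an ICD-9-CM -> single-level CCS lookup (illustrative only;
# a real analysis would load the full mapping from the CCS tool files).
ICD9_TO_CCS = {
    "1629": 19,  # malignant neoplasm of bronchus and lung, unspecified
    "1749": 24,  # malignant neoplasm of female breast, unspecified
    "185": 29,   # malignant neoplasm of prostate
    "4019": 98,  # essential hypertension, unspecified
}


def to_ccs(icd_codes, lookup=ICD9_TO_CCS):
    """Map one record's ICD codes to CCS categories, preserving order.

    Order is kept because the first code is the primary diagnosis.
    Assumptions of this sketch: repeated CCS categories are dropped and
    codes missing from the lookup are skipped.
    """
    seen, out = set(), []
    for code in icd_codes:
        ccs = lookup.get(code.replace(".", "").strip())
        if ccs is not None and ccs not in seen:
            seen.add(ccs)
            out.append(ccs)
    return out


# Two ICD codes for hypertension collapse into the single CCS category 98.
record_ccs = to_ccs(["162.9", "401.9", "4019"])
```

Collapsing several ICD codes into one category is what reduces the dimensionality: the itemsets mined later range over CCS categories rather than over individual ICD codes.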

Data preparation
Each patient's discharge records were grouped together using an encrypted unique patient identifier in SPARCS and ordered by the corresponding admission dates. Discharge records containing AIDS/HIV or abortion diagnoses were removed from our analyses, because their admission dates and patient identifiers were redacted to comply with the Health Insurance Portability and Accountability Act (HIPAA) [22]. Patient-level demographic information (i.e., age, gender, race and ethnicity) was collected from the first record of each patient. Patients were classified into cohorts by type of cancer. A patient who had any cancer diagnosis was selected into the corresponding cohort. One patient could be in multiple cohorts because a person could have been diagnosed with different types of cancer. Table 1 shows the patient characteristics in our analyses. For each type of cancer, we studied the disparities of the top 20 frequent co-occurrence and sequence patterns among different age groups (<=34, 35-54, 55-74 and >=75 years old) [23] and gender groups (male and female). We also analyzed disparities of co-occurrence patterns using discharge records from different claim types. Disparities among different race and ethnicity groups are not discussed in this paper, but we provide relevant results generated using discharge records from patients who had Non-Hodgkin's lymphoma (38) in S7 Table.

Patients' diagnosis sequences
For each patient, since discharge records were strictly ordered by admission dates, diagnosis information was also strictly ordered by the corresponding admission dates. Thus, each patient's admission dates and diagnosis information constituted a diagnosis sequence. Each patient's diagnosis sequence was assigned a unique sequence ID, which also served as the ID for this patient. Fig 1 shows a randomly selected example consisting of three diagnosis sequences from patients having lung and bronchus cancer (19), which is the targeted cancer in this example. Diagnoses are listed after the discharge record ID; the corresponding CCS category is given in parentheses after each description. The primary diagnosis in each discharge record is emphasized using grey shading. The CCS category that represents the targeted cancer (i.e., lung and bronchus cancer) is highlighted in bold. In this example, pattern 5 ("{19}→{98, 19}") means that diagnosis "{cancer of bronchus; lung (19)}" happens before diagnoses "{essential hypertension (98), cancer of bronchus; lung (19)}". The single diagnosis in the first itemset and the two diagnoses in the second itemset occur in different discharge records on different admission dates, while the two diagnoses in the second itemset occur on the same day.
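As a rough illustration of how such diagnosis sequences can be assembled, the following Python sketch groups records by patient and orders them by admission date. The record layout, the patient identifiers and the function name `build_sequences` are hypothetical, and merging records that share an admission date into a single itemset is an assumption of this sketch.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (patient_id, admission_date, [CCS categories]).
records = [
    ("p1", date(2012, 3, 1), [19]),
    ("p1", date(2012, 5, 10), [98, 19]),
    ("p2", date(2013, 1, 5), [19, 127]),
    ("p1", date(2012, 3, 1), [122]),
]


def build_sequences(records):
    """Return {patient_id: [(admission_date, itemset), ...]} ordered by date.

    Records with the same patient and admission date are merged into one
    itemset (an assumption of this sketch).
    """
    by_patient = defaultdict(lambda: defaultdict(set))
    for patient_id, admitted, ccs in records:
        by_patient[patient_id][admitted].update(ccs)
    return {
        patient_id: [
            (admitted, frozenset(items))
            for admitted, items in sorted(visits.items())
        ]
        for patient_id, visits in by_patient.items()
    }


sequences = build_sequences(records)
```

Here patient "p1" ends up with two itemsets, {19, 122} followed by {98, 19}, which is the shape of input that sequence mining expects.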

Analysis methods
Apriori algorithm: Identifying frequent disease co-occurrence patterns. We adopted the Apriori algorithm [24] to discover frequent disease co-occurrence patterns and frequent procedure codes. The Apriori algorithm works by making multiple passes over the entire dataset and generating frequent co-occurrence itemsets (i.e., CCS categories on the same admission date) by comparing their supports with a user-specified minimum support threshold. If the minimum support threshold is satisfied, the co-occurrence itemset is kept in the search results; otherwise it is discarded. In the first pass, the algorithm simply counts occurrences (i.e., support) of each CCS category and procedure code and determines which of them are large (i.e., satisfy the minimum support threshold). Each subsequent pass has two phases. First, the algorithm starts with the large itemsets found in the previous pass and generates new potentially large itemsets, called candidate itemsets, by joining new CCS categories or procedure codes. Next, the dataset is scanned to calculate the support of each candidate itemset, and the large itemsets of the current pass are determined by comparing these supports with the minimum support threshold. Eventually, all co-occurrence itemsets that contain disease codes or procedure codes and satisfy the minimum support threshold are generated.
Both primary and secondary CCS categories were used in the analysis of disease co-occurrence patterns. Only records containing targeted cancer CCS categories were selected. For instance, in the sequences illustrated in Fig 1, discharge records containing "Cancer of bronchus; lung (19)" were used, and diagnoses in the same discharge record would be included in a large itemset if they satisfied the minimum support. We discovered co-occurrence relationships not only between different diagnoses, but also between diagnoses and procedures. As for associations between cancers and procedures, since not all records contained valid procedure codes, we only chose records containing both targeted cancer CCS categories and valid procedure codes in this analysis. For each type of cancer, we selected the top 20 potentially meaningful co-occurrence itemsets that contained targeted cancer diagnoses as frequent co-occurrence patterns. We used apyori 1.1.1 [25], a Python package for the Apriori algorithm, to discover frequent co-occurrence patterns.
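The passes described above can be sketched in a few lines of plain Python. This is a minimal stand-in written for illustration and is not the apyori call used in the study; the function name `apriori`, the toy transactions and the 0.5 threshold are assumptions, and support is taken here as the fraction of records containing an itemset.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def keep_large(candidates):
        # Scan the data once per pass and keep candidates that are "large".
        scored = {c: sum(c <= t for t in transactions) / n for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Pass 1: single CCS categories.
    current = keep_large({frozenset([i]) for t in transactions for i in t})
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Phase 1: join large (k-1)-itemsets into size-k candidates and
        # prune any candidate that has a (k-1)-subset that is not large.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Phase 2: count supports and keep the large candidates.
        current = keep_large(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Toy records that all contain the targeted cancer category 19.
toy_records = [[19, 98], [19, 98, 127], [19, 127], [19, 98, 53]]
patterns = apriori(toy_records, min_support=0.5)
```

In this toy input, {19, 98} is kept with support 0.75, while {98, 127} occurs in only one of four records and is discarded, so the three-item candidate {19, 98, 127} is pruned without being counted.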
cSPADE algorithm: Discovering frequent disease sequence patterns. We used cSPADE, a frequent sequence mining algorithm [26], to discover frequent disease sequence patterns for different types of cancer. The cSPADE algorithm generates frequent sequences iteratively based on the subsequence relation: if a sequence is frequent, then all subsequences of this sequence are also frequent [27]. In each iteration, the cSPADE algorithm compares the supports of candidate sequences with the minimum support threshold. It starts from single-item sequences and proceeds to sequences of maximal length by joining subsequences obtained in the previous iteration. When computing the support of a subsequence, multiple occurrences of this subsequence in the same sequence are counted only once. Fig 1 shows an example of diagnosis sequences containing both primary and secondary CCS categories, where primary CCS categories are in grey shading. The length of a sequence pattern is the total number of itemsets in this sequence. For example, pattern 5, which means "{19}" happens before "{98, 19}", is a length-2 sequence pattern because there are two itemsets "{19}" and "{98, 19}" in this sequence. We set the minimum interval between two itemsets to one and the maximum interval to 180, such that the duration between the admission dates of two consecutive itemsets in a sequence pattern must be within 1-180 days. That is, this algorithm can discover the association of two diagnoses that happen within 180 days of each other. Previous studies found that revisit intervals usually range from one month to over one year, and typical intervals are two, three or six months [28]. Thus, 180 days is an interval long enough to cover significant revisit diagnoses. For each type of cancer, we also kept the first 20 potentially meaningful subsequences that contained targeted cancer diagnoses as frequent sequence patterns. All frequent sequence patterns were mined using arulesSequences [29], an R package for the cSPADE algorithm.
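To make the interval constraint and the count-once rule concrete, the following Python sketch computes the support of a single length-2 pattern by brute force. It is not cSPADE itself and not the arulesSequences interface; the function names, the toy sequences and the use of sequence fractions as support are assumptions of this sketch.

```python
from datetime import date


def contains_pattern(sequence, first, second, min_gap=1, max_gap=180):
    """True if itemset `first` is followed by itemset `second` within the gap.

    `sequence` is a date-ordered list of (admission_date, set_of_CCS) pairs;
    the gap is the number of days between the two admission dates.
    """
    first, second = set(first), set(second)
    for i, (d1, items1) in enumerate(sequence):
        if not first <= set(items1):
            continue
        for d2, items2 in sequence[i + 1:]:
            gap = (d2 - d1).days
            if min_gap <= gap <= max_gap and second <= set(items2):
                return True
    return False


def pattern_support(sequences, first, second, **gaps):
    """Fraction of sequences containing the pattern; each counts once."""
    hits = sum(contains_pattern(s, first, second, **gaps) for s in sequences)
    return hits / len(sequences)


# Hypothetical diagnosis sequences for three patients.
toy_sequences = [
    [(date(2012, 3, 1), {19}), (date(2012, 5, 10), {98, 19})],
    [(date(2012, 1, 1), {19}), (date(2013, 1, 1), {98})],
    [(date(2012, 1, 1), {98}), (date(2012, 2, 1), {19}),
     (date(2012, 3, 1), {98}), (date(2012, 4, 1), {98})],
]

forward = pattern_support(toy_sequences, {19}, {98})
reverse = pattern_support(toy_sequences, {98}, {19})
```

In this toy input the second patient does not support {19}→{98} because the two visits are more than 180 days apart, and the third patient is counted once even though the pattern occurs twice. The forward and reverse supports also differ (two of three sequences versus one of three), which mirrors the order dependence of sequence patterns.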
Statistical analyses. We selected the top 20 co-occurrence patterns and the top 20 sequence patterns to run statistical analyses for each type of cancer. Percentages of co-occurrence and sequence patterns were calculated and compared across age, gender, race and ethnicity groups and claim types to evaluate disparities among these patient groups. For each top co-occurrence pattern, treated as a dichotomous outcome, a generalized linear mixed-effect model was fit to the repeated-measures data. Age, gender, race, ethnicity and claim type were all included in the same model as covariates. The within-subject dependence over repeated visits was adjusted using an unstructured covariance matrix. p-values based on F-tests were used to evaluate the overall significance of those covariates. For sequence patterns, a multiple logistic regression model was fit for each sequence pattern as the dichotomous outcome. As in the analyses of co-occurrences, age, gender, race and ethnicity were treated as covariates and p-values based on F-tests were used to assess their overall significance. Bonferroni's method was used to adjust p-values for multiple tests. All statistical analyses were performed using SAS v9.4 (the SAS Institute, Cary, NC) [30].
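The Bonferroni adjustment mentioned above amounts to multiplying each raw p-value by the number of tests and capping the result at 1. The short Python sketch below illustrates this arithmetic only; the study itself used SAS, and the raw p-values shown are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


raw = [0.0004, 0.02, 0.3]  # hypothetical raw p-values, not from the study
adjusted = bonferroni(raw)  # each value is tripled because there are 3 tests
```

With three tests, a raw p-value of 0.02 becomes 0.06 after adjustment and would no longer be significant at the 0.05 level.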

Results
We performed analyses on patients' diagnosis histories and focused on seven types of cancer with high incidence rates in New York State. Meaningless results, such as patterns consisting of identical CCS categories, CCS categories that represent unspecific disease groups or serve administrative purposes, length-1 patterns and patterns irrelevant to targeted cancers, were excluded from our analyses.
In this section, we mainly discuss patterns containing diagnoses closely related to the targeted cancers and present results of a few cancer types where significant pattern disparities are found. Other results are available in the supporting information.

Diagnosis co-occurrence patterns
We analyzed disparities of co-occurrence patterns among different age groups, gender groups and records from different claim types. In addition, we discovered correlations between cancer diagnoses and procedures to see which procedures were frequently adopted to treat different types of cancer.
Frequent disease co-occurrence patterns in different age groups. Figs 2 and 3 show supports and p-values of the top 20 frequent diagnoses that co-occurred with liver and intrahepatic bile duct cancer (16) and Non-Hodgkin's lymphoma (38), respectively. The p-values revealed that almost all of these diagnoses were significant with respect to age. In addition, supports of many top frequent diagnoses were highest in the oldest age group (>=75 years old) and lowest in the youngest age group (<=34 years old). Fig 2 presents results of frequent co-occurrence patterns regarding liver and intrahepatic bile duct cancer (16). Among patients aged 34 or younger, deficiency and other anemia (59) and hepatitis (6) were the two most frequent diagnosis co-occurrences (7.71% and 7.02%, respectively). Patients between 35-74 years old were also more likely to have hepatitis (6), while it was less common among patients aged 75 or older (35-54 years old: 25.51%, 55-74 years old: 25.57%, >=75 years old: 12.05%, p-value<0.0001). Essential hypertension (98) was another diagnosis co-occurrence that usually occurred among patients aged 55 or older (55-74 years old: 25.70%, >=75 years old: 35.78%, p-value<0.0001). Results for Non-Hodgkin's lymphoma (38) were usually more representative and less noisy (Fig 3). For instance, leukemias (39) and diseases of white blood cells (63) were more frequent among patients aged 34 or younger (9.21% and 6.22%, respectively). Essential hypertension (98) was also the most frequent diagnosis co-occurrence among patients aged 55 or older (55-74 years old: 20.22%, >=75 years old: 30.02%).
Patients who had rectum and anus cancer (15) (S1 Table) were more frequently diagnosed with cancer of colon (14). However, this diagnosis was not significant with respect to age (p-value = 0.99). Biliary tract disease (149) and diabetes mellitus without complication (49) were significant diagnoses (p-value<0.0001) that co-occurred with pancreas cancer (17) (S3 Table). As for lung and bronchus cancer (19) (S4 Table), the frequent diagnosis most relevant to this cancer was chronic obstructive pulmonary disease and bronchiectasis (127), which ranked high on the list of the most frequent co-occurrence patterns for patients aged 55 or older (55-74 years old: 19.30%, >=75 years old: 26.01%, p-value<0.0001). Pneumonia (122) was another frequent diagnosis that was significant with respect to age (p-value<0.0001). Nonmalignant breast conditions (167) co-occurred more frequently with breast cancer (24) (S5 Table) among people aged 54 or younger (<=34 years old: 7.06%, 35-54 years old: 8.40%). For co-occurrence patterns among patients who had prostate cancer (29) (S6 Table), genitourinary symptoms and ill-defined conditions (163) was a prominent co-occurring diagnosis among patients aged 34 or younger (12.27%).
Frequent disease co-occurrence patterns in different gender groups. The most frequent diagnosis co-occurrences for cancer types that are less gender specific, such as rectum and anus cancer (15) (S1 Table), liver and intrahepatic bile duct cancer (16) (S2 Table), pancreas cancer (17) (S3 Table) and lung and bronchus cancer (19) (S4 Table), demonstrated similar trends in males and females with only very few disparities. For example, among patients with liver and intrahepatic bile duct cancer (16), females were usually at a higher risk of having deficiency and other anemia (59) than males (male: 10.32%, female: 12.14%, p-value<0.0001). Males were more likely to have hepatitis (6) compared with females (male: 25.69%, female: 16.83%, p-value<0.0001). Thyroid disorders (48) was more frequent among females who had Non-Hodgkin's lymphoma (38) (S7 Table) and breast cancer (24) (Fig 4), but was not high on the list of the top frequent disease co-occurrences for males (p-values<0.0001). It can also be observed that heart diseases such as coronary atherosclerosis (101) affected males more than females across all seven types of cancer (p-values<0.0001).
Frequent co-occurrence patterns from different claim types. The distribution of frequent diagnosis co-occurrence patterns across claim types differed for all seven types of cancer (p-values<0.0001). For instance, colon cancer (14) was the most frequent diagnosis only in ambulatory surgery visits from patients having rectum and anus cancer (15) (Fig 5), and biliary tract disease (149) was comparatively frequent in ambulatory surgery visits and inpatient hospital stays with respect to pancreas cancer (17) (S3 Table). However, some common patterns can still be identified. Most of the discharge records for lung and bronchus cancer (19) (S4 Table), prostate cancer (29) (S6 Table) and Non-Hodgkin's lymphoma (38) (S7 Table) came from ambulatory surgery visits, and the fewest came from inpatient care. Most discharge records for rectum and anus cancer (15) (Fig 5) and liver and intrahepatic bile duct cancer (16) (S2 Table) were collected from emergency department visits.
Frequent co-occurrences of cancer diagnoses and procedure codes. Fig 6 illustrates the procedure codes that co-occurred most frequently with different cancer diagnoses. We used discharge records that contained both targeted cancer diagnoses and valid procedure codes in this study. We selected the top 20 most frequent procedure codes for each type of cancer, but present only the top five here. Transfusion of packed cells (9904), injection of antibiotic (9921) and injection or infusion of other therapeutic or prophylactic substance (9929) appeared in the top five most frequent procedure codes of all seven types of cancer and were usually in the first three places. Moreover, transfusion of packed cells (9904) always ranked first for each type of cancer. Pattern disparities among different types of cancer could also be identified.
Insertion of intercostal catheter for drainage (3404) (7.78%) and computerized axial tomography of thorax (8741) (8.97%) were common for lung and bronchus cancer (19). Other anterior resection of rectum (4863) (11.76%) was more common in treating rectum and anus cancer (15). Computerized axial tomography of abdomen (8801) (12.89%) and endoscopic insertion of stent (tube) into bile duct (5187) (9.54%) were more frequent in results for pancreas cancer (17). Percutaneous abdominal drainage (5491) (16.15%) was common in treating liver and intrahepatic bile duct cancer (16). Injection or infusion of electrolytes (9918) (9.34%) was more frequently used to treat Non-Hodgkin's lymphoma (38). Laparoscopic robotic assisted procedure (1742) (8.62%) was only found in the top five frequent patterns for prostate cancer (29). Injection or infusion of cancer chemotherapeutic substance (9925) was most frequently used to treat liver and intrahepatic bile duct cancer (16) (11.80%) and Non-Hodgkin's lymphoma (38) (23.65%).

Fig 5. Supports and p-values of the most frequent diagnoses that co-occurred with rectum and anus cancer (15) in discharge records for different claim types. Age, gender, race, ethnicity and claim type were treated as covariates and p-values based on F-tests were used to assess their overall significance.
https://doi.org/10.1371/journal.pone.0194407.g005

Diagnosis sequence patterns
In our analyses, we searched full patient diagnosis sequences using primary CCS categories only, because primary diagnoses could help detect more clinically meaningful patterns [31]. We also ran analyses using both primary and secondary diagnoses on sequences from patients who had Non-Hodgkin's lymphoma (38). Results are available in S7 Table. By comparing results presented in Figs 3 and 7 and S7 Table, we found that although a combination of primary and secondary CCS categories contained richer diagnosis information, the information could be redundant and noisy.
We only present length-2 diagnosis sequence patterns in this section, as longer sequences in our analyses usually consisted of repeated CCS categories representing follow-up visits rather than disease progression. Moreover, since only primary diagnoses were used in mining sequence patterns and patients having targeted diagnoses usually did not have these CCS categories as the primary diagnoses, supports of patterns presented in this section are comparatively lower than those of patterns generated using both primary and secondary diagnosis information (S7 Table).
Frequent sequence patterns in different age groups. Diagnoses in sequence patterns were usually more closely correlated with targeted cancers than those in co-occurrence patterns.

Discussion
Although our work focused on discovering patterns and disparities in different patient groups, many common diagnoses appeared in the results of almost all seven types of cancer, especially in diagnosis co-occurrence patterns. In co-occurrence patterns, patients aged 75 or older had a much higher risk of cardiovascular diseases such as coronary atherosclerosis (101) and cardiac dysrhythmias (106). Essential hypertension (98) was another frequent diagnosis among older patients and it was high on the list of the top 20 frequent co-occurrence patterns of every cancer. In addition, disorders of lipid metabolism (53), fluid and electrolyte disorders (55), diabetes mellitus without complications (49) and deficiency and other anemia (59) were also common diagnoses that co-occurred frequently with different cancer diagnoses.
In our study, we used both primary and secondary diagnoses to discover disease co-occurrence patterns and only primary diagnoses to identify sequence patterns. As mentioned above, the primary diagnosis code represents the main reason for a hospital visit, and secondary diagnosis codes represent conditions that co-exist during the same hospital visit or stay. The sequence patterns usually contained diagnoses that were highly correlated with each targeted cancer, but many of these diagnoses did not appear in the most frequent co-occurrence patterns. Thus, primary diagnosis information was more accurate and precise, and secondary diagnosis codes provided richer but redundant and noisy information. Moreover, sequence patterns conveyed information that was time dependent. For example, the support of a sequence was usually different from the support of its reversed sequence. This phenomenon might weigh significantly in studying disease progression.

Conclusions
Open data initiatives make large-scale healthcare data available and provide a unique opportunity for discovering patterns using data mining methods. We adopted the Apriori algorithm and the cSPADE algorithm to discover frequent disease co-occurrence and sequence patterns among cancer patients in New York State using SPARCS data. We studied seven types of cancer with high incidence rates in New York State and focused on disparities of diagnosis co-occurrence patterns and diagnosis sequence patterns from patients' diagnosis histories with respect to age, gender and claim types. Our results suggest that the methods can generate potentially interesting and clinically meaningful disease co-occurrence and sequence patterns, which can be used to study comorbidities and disease progression for improving the management of multiple diseases of cancer patients.
Supporting information S1