Constraints on Biological Mechanism from Disease Comorbidity Using Electronic Medical Records and Database of Genetic Variants

Patterns of disease co-occurrence that deviate from statistical independence may represent important constraints on biological mechanism, which sometimes can be explained by shared genetics. In this work we study the relationship between disease co-occurrence and commonly shared genetic architecture of disease. Records of pairs of diseases were combined from two different electronic medical systems (Columbia, Stanford), and compared to a large database of published disease-associated genetic variants (VARIMED); data on 35 disorders were available across all three sources, which include medical records for over 1.2 million patients and variants from over 17,000 publications. Based on the sources in which they appeared, disease pairs were categorized as having predominant clinical, genetic, or both kinds of manifestations. Confounding effects of age on disease incidence were controlled for by only comparing diseases when they fall in the same cluster of similarly shaped incidence patterns. We find that disease pairs that are overrepresented in both electronic medical record systems and in VARIMED come from two main disease classes, autoimmune and neuropsychiatric. We furthermore identify specific genes that are shared within these disease groups.


Introduction
When two diseases occur together in the same individuals more or less often than would be expected by chance, this may signal the operation of important biological processes. Pairs of diseases occurring more than expected are called synergistic; such interactions are familiar from clinical practice when the occurrence of a disease may raise the risk of a second disease. Pairs occurring less than expected are called protective; these interactions, sometimes called "inverse comorbidities," are less common, but intriguing. Disease pairs which consistently diverge from independence in either direction may provide clues towards identifying core genetic, pathway, physiological, or environmental constraints that alter disease risk and represent an important starting point for elaborating a mechanistic understanding of disease and for locating possible drug targets. Because discovery of disease patterns has been haphazard, it is attractive to systematically search for these patterns across a wide range of diseases, without adhering to prior conceptions of disease class, associated features, or expected comorbidities. In this work, we integrate clinical and genomic data across diseases to systematically assess their co-occurrence.
Consistent co-occurrence and conditional dependence in disease phenotypes arises from multiple, non-exclusive, factors: (1) shared genetics, including causal effects of single genes and effects of neighboring genes in linkage disequilibrium, (2) shared environmental exposures, (3) complex interactions in which a phenotype enhances or moderates the risk of another, (4) ascertainment, selection, or referral bias, (5) artifacts of the diagnostic system, where two putatively separate diseases are linked via large overlap of shared features, and (6) random variation. Untangling these factors requires use and integration of both phenotypic and genetic data.
Historically, non-independent phenotype associations are noticed in an opportunistic way when the effect size is large, and otherwise they are detected more accurately through observational studies and meta-analyses [1], or via comprehensive epidemiologic surveys. However, such studies and surveys are expensive to conduct and therefore often do not methodically examine disease combinations. In contrast, electronic medical records (EMR) represent a source of coded medical data that is typically large and, because these records are routinely collected to support clinical and administrative operations, the marginal cost to researchers is small; EMR data may therefore facilitate systematic comparison of disease co-occurrence [2].
Complementary information about disease relationships can be drawn from genomic studies. In particular, VARiants Informing MEDicine, or VARIMED [3], is a hand-curated database of published disease-associated (primarily common) genetic variants. Although it is limited to known genetic variants, it is large and provides an opportunity for detecting the overlapping and shared genetic bases of diseases.
We combine EMR data with genetic data to compare and contrast disease co-occurrence patterns, systematically comparing statistically significant disease comorbidity patterns in EMR data with disease pairs having statistically significant genetic overlap in VARIMED, and characterize the pairs by the predominant influence as (1) clinical and genetic if they both cooccur in the clinical data and share a significant genetic component, (2) clinical without genetic if they co-occur only in the clinical records, or (3) genetic without clinically observable effect if we find only a significant genetic overlap without a corresponding EMR result.
There are several important assumptions to consider here including the penetrance and causality of the genetic relationships that we examine as well as interactions between the genetics and the environment. Furthermore, EMR data are prone to selection and ascertainment bias, and errors from inaccuracies in chart coding. The lifetime of the EMR induces an observation window on the patients represented there, underrecording data from patients for disease pairs with widely separated ages of disease onset, and generating false inverse comorbidities. In order to avoid the confounding effect of age on the pair occurrence counts, we introduce a method for clustering diseases through similarity of their incidence pattern by age.
Other researchers have explored similar ideas. Patterns have been detected using linked administrative and clinical databases. Goldacre and colleagues [4] used data from the Oxford Record Linkage Study to find disease associations, such as an expected association between schizophrenia and lung cancer, and a protective association between schizophrenia and rheumatoid arthritis. A later study using similar data [5] found inverse associations between Parkinson's disease and several kinds of cancer. Rzhetsky et al. [2] developed a mathematical model of ICD9-coded data from a single EMR to infer genetic overlap. Using genomics data from early GWAS studies, Sirota et al. [6] used summary data to define a signed genetic variation score and cluster autoimmune disorders. Jung et al. [7] applied a similar method to studying autoimmune disorders when paired with autism. Li et al. [8] used data from several EMRs and from VARIMED to identify the genetic architecture of novel risk factor-disease associations. Ibáñez et al. [9] compared gene expression profiles for previously identified inversely comorbid neuropsychiatric/cancer disease pairs, and found corresponding up-and down-regulation patterns. Melamed et al. [10] used data from a large database of insurance claims in combination with known genetic associations for Mendelian disorders to identify cancer driver genes. Glicksberg et al. [11] compared the overlap in disease pairs using EMR data and a database of genetic variants, retaining those pairs where both diseases appeared together in PubMed articles.
In this paper, we present a framework for integrating clinical and molecular data to study disease co-occurrence. Because disease risk varies with patient age, and because the co-occurrence of disease is therefore confounded by age, we introduce a method to define age-specific disease clusters and carry out pairwise comparisons of disease co-occurrence. We explicitly model disease pair under and overrepresentation. To reduce bias, we conduct the analysis in two independent clinical databases, and require statistically significant deviation from independence in both. We identify a highly significant group of autoimmune disorders, a set of diseases with known environmental triggers, and some results which question the clinical manifestation of previously described disease associations.

Results
Data on disease pairs were drawn from Columbia and Stanford electronic medical record systems, and compared to data on disease pairs with genetic overlap from the VARIMED database. The overall information flow is shown in Fig 1. Disease comorbidities were assessed for significance using a conventional 2 × 2 table recording the presence or absence of each disease; a patient contributes a count to one of the four cells (Table 1.) In aggregate, there are three possibilities: (1) independence, where the value of d follows proportionally from the marginal sums, (2) synergistic interactions, where d is larger than predicted from independence, and the pairs are overrepresented, or (3) protective interactions, where d is smaller than predicted, and the pairs are underrepresented.
Disease pairs were collated by the sources in which they were found to be significant: (1) significant in both EMRs and significant in VARIMED, or "clinical and genetic,", (2) in both EMRs but not in VARIMED, or "clinical without genetic," and (3) in VARIMED but not in both EMRs, or "genetic without observed clinical effect." (See Table 2.) Appearance in a clinical database represents the interaction between genetic predispositions, environmental exposures, and socioeconomic and phenotypic factors that lead to presentation for evaluation and treatment.  We start with 161 disorders in the available EMR data, of which 35 disorders appear in both EMRs and also in VARIMED; these are listed in Table 3, along with disease counts and frequencies for each EMR, and the gene counts from VARIMED.
To avoid the confounding effects of age on disease incidence, we form age-incidence clusters, where cluster members have similar age-incidence patterns, and only analyze disease pairs where both members of the pair fall in the same cluster. We use a data-driven method to compute cluster size, finding a locally optimum size of five. For visualization, each cluster is processed by forming the average of the incidence vectors in that cluster; these averages, along with a loess smoother, are shown in Fig 2. Plots of all the data points for all examined cluster sizes appear in the Appendix. Conveniently, four of the clusters correspond to different life-stages (neonate, youth, adulthood, and aged) and were assigned those names by hand for ease of reference and to aid interpretation; the names are also listed in the final column of Table 3. The fifth cluster contains data from predominantly younger patients, but is noisier and less consistent than the other clusters; it is labelled "other." No significant results were found for diseases in the "neonate" cluster, so that cluster does not appear in the tables below. A complete list of all EMR disorders appearing in each cluster appears in the Supplement.
Significant disease pairs are presented in overview here and described in detail in the following sections. For each of the two EMRs, a significant disease pair is either under or overrepresented. We consider only those results that show concordant results, either both underrepresented, or both overrepresented, in the two EMRs. The structure of the overrepresented disease pairs is seen in the network diagram in Fig 3; this figure uses data from the larger EMR (Columbia) to set the node size from disease frequency. There are three large components in the network, which have been coded by the color of the age-incidence cluster that forms each component; all have a compact, densely connected structure with only a few sparse ties, in spite of coming from an arbitrarily chosen list of common and rare diseases. The strongest effect is shown with thick link, marking the connection between lipid metabolism disorders and type 2 diabetes. The small light-green cluster (middle of lower row in figure) highlights the connections between autism, pervasive developmental disorder, attention deficit, and cerebral palsy.
The set relationship of disease-pairs from the EMR data compared to the genetic variants is shown in the Venn diagram (Fig 4). For the underrepresented disease pairs, four are shared between Columbia and Stanford, but none appear in the intersection with VARIMED. For the overrepresented disease pairs, 186 are shared between Columbia and Stanford, and five of those remain when intersected with VARIMED.

"Clinical and genetic" results
In this section we report the results that are significant in both EMRs (Columbia and Stanford) and in VARIMED. We refer to these as "clinical and genetic." For disease pairs significant in Table 3. Counts and frequencies (as percent) for diseases that occur in both EMR data sets and in VARIMED. Number of genes is from VARIMED. Cluster names were assigned by hand to facilitate comprehension, as described in the text. Prior work has found considerable genetic sharing between many autoimmune diseases [6], [12]; specific results include, ankylosing spondylitis and psoriasis [13]. The association between ankylosing spondylitis and lupus has been reported, but is extremely rare [14]. Rheumatoid arthritis and secondary Sjogren's syndrome have a well-known association. The other two results are previously identified associations between neuropsychiatric disorders; bipolar disorder and schizophrenia [15], and bipolar disorder and depression, although there may also be diagnostic overlap, as depression and bipolar disorder can be confused clinically.
We furthermore identify specific genes which are common to these two groups (Table 5). In the autoimmune subgroup those include well known associations in the HLA region such as HLA-DRA, HLA-E [16], interleukin receptors (IL13, IL23R and IL2RA) [17,18], [19], BTNL2 [20] and MICA [21]. Interleukins are any of a class of glycoproteins produced by leukocytes for regulating immune responses. While these genes have been previously associated with autoimmune diseases, they provide an interesting opportunity to explore shared therapeutic targets and diagnostic markers across these phenotypes.
In the neuropsychiatric subgroup some genes that are of interest include ANK3, CACNA1C, CDH13, ITIH4 and PDE7B. Ankyrins are a family of proteins that are believed to link the integral membrane proteins and play key roles in activities such as cell motility, activation, proliferation, contact, and the maintenance of specialized membrane domains. Ankyrin 3 is an immunologically distinct gene product from ankyrins 1 and 2, and was originally found at the axonal initial segment and nodes of Ranvier of neurons in the central and peripheral nervous systems. CACNA1C is a voltage-dependent calcium channel and has been previously linked to several neurodegenerative diseases [22], [23]. CACNA1C is also an associated gene of the one of the most highly significant SNPs for both bipolar disorder and schizophrenia in a cross-disorder genome wide analysis [15]. Cadhedrin, CDH13, is a known ADHD-susceptibility gene that has been investigated in other neuropsychiatric disorders [24], [25], [26]. Some of the  Network structure of the significant disease pairs that occur in both EMRs. Each node represents a disease, with the node size scaled to the other shared genes such as ITIH4, inter-alpha-trypsin inhibitor heavy chain family, member 4, and PDE7B, phosphodiesterase 7B do not have clearly known links to the neuropsychiatric phenotypes and might be interesting to explore further.

"Clinical without observed genetic effect" results
In this section we report disease pairs that are significant in both EMRs, but not significant in VARIMED ("clinical without observed genetic effect"). One protective interaction was found: alcoholism and goiter. This pair in the Columbia dataset has an observed/expected ratio of  Table 6 shows the 23 overrepresented interactions.
As expected, disorders with clear environmental triggers are apparent in both lists: alcohol, injection drug use (HIV, hepatitis B and C), and diet (diabetes type 2, and gout). Of the protective pairs, alcoholism and goiter have been previously noted to be underrepresented [27].
Most of the detected synergistic interactions are well-known. These include: alcoholism and bipolar disorder, alcoholism and depression, alcoholism and schizophrenia, depression and schizophrenia, and the alcoholism and injection drug-associated pairs. Of those less familiar, references are provided here: depression and migraine [28], migraine and lupus [29], cardiomyopathy and diabetes [30], aortic aneurysm and cardiomyopathy [31], diabetes type 2 and

"Genetic without observed clinical effect" results
In this section we report disease pairs that have significant overlap of genetic variants in VAR-IMED, but are not significant in both EMRs. There are 17 such pairs, shown in Table 7; when the disease pair was significant in one EMR, that was recorded in the "EMR" column. The group of "genetic without observed clinical effect" are those which have significant genetic overlap in VARIMED, but not in both EMRs. Nearly all are autoimmune disorders, and may represent pairs with sharing detected at the level of genes that do not produce pathway interactions leading to disease phenotypes, or rare interactions that do not achieve statistical significance. Table 6. Results for overrepresented disease pairs that are significant in Columbia and Stanford EMRs but not in VARIMED. comparing diseases when they fall in the same cluster of similarly-shaped incidence patterns. The method is fast, easy to interpret, and can be extended in a straightforward manner to other EMRs, data from national health systems, and large insurance databases. Our primary aim is to identify disease pairs which might share a common mechanism or treatment option for further exploration and research. We link disease pairs that are under or overrepresented in EMR data to statistically significant overlapping genes sets for the same pairs. The genetic variants are known to have phenotypic effects, while EMRs capture a broad collection of diseases states that are severe enough to require diagnosis and treatment, and represent a constellation of genetic predispositions, environmental influences, and social and economic factors that affect when diseases are detected. Many of the predisposing factors in EMRs are not measured, but we can find pairs that have known genetic associations and also find pairs that do not. As always for candidate generation or prioritization methods, the question arises of how to validate novel results, given that validating experiments have not yet been conducted. By contrasting results in two EMRs and in a database of genetic variants, we have reduced the chance that the same biases are operating across all data sources. There are several limitations of our approach which should be recognized. Our method only compares diseases when they fall in the same cluster. This is a simple, but conservative, match on age patterns, and should enrich results for true positives at the expense of missing other true positives that would only be found in cross-cluster comparisons. For example, because autism and Alzheimer's disease would fall in different age-incidence clusters and would not be compared, possible interactions between those disorders would not be detected. VARIMED, while large, contains only published results, reflecting investigators' choices of important areas of study, including, as we found, autoimmune disorders and neuropsychiatric disorders. Also, VARIMED focuses primarily on common variation as most genetic association has been based on genotype-based GWAS. VARIMED (and other databases) are not randomly sampled from the space of biological phenomena, and the absence of a genetic variant may only mean that such have not yet been investigated. It is likely that our method will fail in such circumstances to identify comorbid pairs using the conjunction of data from EMRs and from VARIMED, which is a source of bias. In addition, we link to EMR records on the basis of a straightforward, but necessarily imprecise, mapping through a disease name. We restrict the genetic analysis to the genes and do not consider the allele-specific relationships (risk-enhancing or risk-moderating). Although both gender and ethnicity are known to be important covariates for the prevalence of disease, because the available Columbia data were not stratified by gender or ethnicity, neither were used in this study. This would be particularly important for autoimmune disorders with their known gender dependence; combining the genders for analysis, as we had to do, may have diluted statistical signal, and would explain the appearance of results in Table 7. Finally, the set of diseases examined was restricted to the 161 in the original Rzhetsky study, and further restricted by the limited overlap with VARIMED; although drawing from both common and rare diseases, the set is small compared to the full range coded by ICD9. Prior studies have found multiple protective interactions between CNS disorders and cancers [34] but, unfortunately, few cancers were in the list of disorders analyzed here. In spite of these limitations, we hope this study can serve as a proof of principle for integrating EMR and genetics data to uncover relationships between diseases. Using larger data sets, and incorporating important covariates and the direction of allele-specific risk would important validating extensions of the current work. Furthermore, text mining of EMR clinical notes and other databases of environmental exposures could represent an opportunity for identifying non-genetic causes of diseases.

Discussion
In conclusion, we have presented a method integrating clinical EMR and genetics data in order to elucidate disease comorbidity. We identify a set of disease pairs which deviate from the independence assumption in their co-occurrence in two different EMR systems. By integrating the clinical observations with genetics, we are further able to categorize which of the disease pairs might be explained by the shared genetics and which might have more of an environmental component.

Ethics statement
Our validation data set used patient records from Stanford's electronic medical record system, STRIDE (Stanford Translational Research Integrated Database Environment). The data request was judged to be exempt from human subject concerns by the Stanford Institutional Review Board, and was also approved by its Data Privacy Office. The Stanford data were retrieved June 6, 2013. Encounter records contained a masked patient identifier, current age, gender, ethnicity, icd9, and age at visit. (Gender was not used because gender information was not generally available for the Columbia data. Ethnicity was not used for the same reason.) Because of small numbers of very old patients, ages were censored at 90 years for privacy reasons by STRIDE staff prior to our use.

Clinical data analysis
For the electronic medical record data, the discovery data set comes from the composite data for the Columbia EMR, published as an online appendix of [2], which lists counts of diseases and disease pairs for a total of 161 disorders. As described in the original article, "We selected disorders that represent a broad spectrum of maladies, from common to rare, affecting diverse physiological systems, yet we also placed special emphasis on neurological phenotypes." The total number of patients was 1,478,976, however because these records include data on healthy hospital employees, the total was lowered by 500,000 as described in their Appendix 2, p 19. Disease count data were extracted from their Appendix 3, and disease-pair count data were extracted from Supplemental Information Data Set 1. Separately, the mapping from ICD9 codes to disease names was taken from their Appendix 3; a small number of mapping errors were corrected by hand. For validation, patient-level data were retrieved from STRIDE (Stanford Translational Research Integrated Database Environment) [35] for the same 161 diseases to allow for direct comparison with the Columbia data. The raw data contained 1,057,132 records for 397,474 patients. We focus our analysis on data starting in the year 2008, which was the year of comprehensive EMR rollout. When there were fewer than 50 patients with a disease, that disease was judged too rare to contribute to the incidence frequencies in a meaningful way, and was removed. This left data for 277,290 patients. Also, disease pairs were removed if any of the cells in the 2 × 2 table had observed or expected values less than 5.
The EMR records were aggregrated and processed to retain the earliest occurrence of each ICD9 code for each patient, which were then consolidated using the ICD9-to-disease mapping from [2] to produce a table of patient counts for each disease name. A similar procedure was used to count disease pairs.

Correcting for age bias through clustering
Biases arise from EMR data not being a random sample of diseases in the population. For example, autism and Alzheimer's disease have very different incidence patterns. (See Fig 5.) It would be unlikely for a patient to have this disease pair in their records, even if they were ultimately afflicted by both disorders, because young patients at risk for autism would not also be at risk for Alzheimer's until many years in the future, and those at risk for Alzheimer's would have been at risk for autism in an era when the EMR did not exist, even if autism had then been a clearly defined syndrome. This will lead to systematic undercounting of these and similar disorder pairs.
To control for the confounding effects of age, disease pairs were analyzed only when both diseases could be put in the same age-incidence cluster. Clusters were formed so as to be as large as possible (to maximize the number of subsequent disease comparisons), while simultaneously imposing within-cluster homogeneity, so that each cluster had similar age-incidence patterns. The age-incidence clustering used from patient-level data for Stanford; corresponding details for the Columbia data were not available.
Each disease was represented as a 91-dimensional vector containing counts of the number of patients whose earliest onset of that disease occurred for each of the ages 0 years through 90 years. For normalization, each vector was divided by its length to produce unit vectors. Hierarchical clustering with Ward's method for linkage was chosen to produce clusters that were compact and of similar size.
Cluster size was determined in a data-driven manner by systemically searching through possible clustering methods and cluster scoring measures. The methods and measures were taken from those provided by the R package COMMUNAL and are listed in the Supplemental Material. The cluster measures (also known as cluster indices) provide scores for each method and each cluster size. The measures were combined into a composite score by standardizing each measure (zero mean, unit variance) and then averaging. All measures were converted to have the same sense, so that larger values were associated with more desirable clusters. Any measure with a monotonic function (either increasing or decreasing) of cluster size for all methods and measures was removed because such a measure would be minimized or maximized at the extremes of the search range for cluster size, and thus not be responsive to patterns in the data.
Pairs of diseases that showed significant comorbidity pairs were identified in the Columbia data and verified in the Stanford data, so all pairs reported here were statistically significant in both. In addition, only pairs that were underrepresented in both EMRs or overrepresented in both EMRs were retained, ensuring consistent directionality; discordant pairs were not analyzed. Statistical significance was computed using the Fisher exact test. Bonferroni correction was applied using the number of diseases in each cluster. The conventional level of significance, 0.05, was used for all tests. Incidence-by-age graphs for autism and Alzheimer's disease. Because of the gross disparities in these patterns, patients at risk for one disorder would be a low risk for the second disorder at any given age, reducing the observed comorbidity.

Genetic analysis
Genetic associations came from VARIMED, a hand-curated database of published phenotypeassociated genetic variants [3]. As of May, 2015, this database contained variants from 17,088 publications, with 466,890 SNPs associated with 2,992 diseases or traits. SNPs were mapped to genes using the dbSNP annotation database. Sometimes, a variant maps to more than one gene. In such cases, we use colons to separate the genes in a single group; this notation is used in Table 5. Phenotype descriptions were mapped by hand to the set of 161 disease names used for the Columbia and Stanford data. There were 35 diseases that appeared in Columbia, Stanford, and in VARIMED (Table 3). Because VARIMED is proprietary, the relevant subset of 35 diseases, with associated genes, chromosome number, and PubMed ID of the source of each association were extracted and used for the analysis we report here. This dataset is included in the Supplement.
Genetic variants that were significantly associated with each phenotype of interest were obtained from VARIMED and mapped to gene names. In this study, we used significant disease-SNP associations (p < 10 −6 ) with known risk alleles and published odds ratios. The number of genes associated with each of the 35 diseases of interest are shown in Table 3. We furthermore focus our analysis on the gene level, specifically calculating enrichment of the number of overlapping genes between two phenotypes of interest. We report the number of genes shared by the disease pair if the overlap was determined as significant by the Fisher exact test using Bonferroni correction for the number of tests.
A network diagram (Fig 3) showing the structure of the disease pairs and their clusterings was created using the Cytoscape software tool. In the network diagram, a node represents a disease. Two nodes are connected if that disease pair is statistically significant in the EMR data and appears in the same age-incidence cluster. The size of a node represents the frequency of that disease in the larger EMR (Columbia). The edge width represents the effect size for that pair (observed number divided by expected number). The node color indicates cluster membership, using the same colors as in Fig 2. Supporting Information S1 Appendix. Contains information about the Columbia and Stanford EMR data sets, and details on the age-incidence clustering method. (PDF) S1 File. Subset of VARIMED used in our analysis, with disease name, gene name, chromosome, and PubMed ID.