Multi-ancestry genome- and phenome-wide association studies of diverticular disease in electronic health records with natural language processing enriched phenotyping algorithm

  • Yoonjung Yoonie Joo,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, United States of America

  • Jennifer A. Pacheco,

    Roles Conceptualization, Data curation, Investigation, Methodology, Resources, Writing – original draft, Writing – review & editing

    Affiliation Center for Genetic Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, United States of America

  • William K. Thompson,

    Roles Data curation, Formal analysis, Methodology, Software

    Affiliation Center for Health Information Partnerships, Northwestern University Feinberg School of Medicine, Chicago, IL, United States of America

  • Laura J. Rasmussen-Torvik,

    Roles Data curation, Resources, Validation, Writing – review & editing

    Affiliation Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, United States of America

  • Luke V. Rasmussen,

    Roles Data curation, Resources, Validation, Writing – review & editing

    Affiliation Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, United States of America

  • Frederick T. J. Lin,

    Roles Formal analysis, Investigation, Writing – review & editing

    Affiliation Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, United States of America

  • Mariza de Andrade,

    Roles Data curation, Writing – review & editing

    Affiliation College of Medicine, Mayo Clinic, Rochester, MN, United States of America

  • Kenneth M. Borthwick,

    Roles Data curation, Writing – review & editing

    Affiliation Geisinger, Danville, PA, United States of America

  • Erwin Bottinger,

    Roles Data curation, Writing – review & editing

    Affiliation Icahn School of Medicine at Mount Sinai, New York, NY, United States of America

  • Andrew Cagan,

    Roles Writing – review & editing

    Affiliation Partners Healthcare, Charlestown, MA, United States of America

  • David S. Carrell,

    Roles Data curation, Writing – review & editing

    Affiliation Kaiser Permanente Washington Health Research Institute, Seattle, Washington, United States of America

  • Joshua C. Denny,

    Roles Data curation, Methodology, Writing – review & editing

    Affiliation Departments of Biomedical Informatics and Medicine, Vanderbilt University, Nashville, TN, United States of America

  • Stephen B. Ellis,

    Roles Data curation

    Affiliation The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, United States of America

  • Omri Gottesman,

    Roles Data curation

    Affiliation The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, United States of America

  • James G. Linneman,

    Roles Data curation

    Affiliation Office of Research Computing and Analytics, Marshfield Clinic Research Institute, Marshfield, WI, United States of America

  • Jyotishman Pathak,

    Roles Data curation

    Affiliation Department of Healthcare Policy and Research, Weill Cornell Medical College, New York, NY, United States of America

  • Peggy L. Peissig,

    Roles Data curation

    Affiliation Center for Precision Medicine Research, Marshfield Clinic Research Institute, Marshfield, WI, United States of America

  • Ning Shang,

    Roles Data curation

    Affiliation Department of Biomedical Informatics, Columbia University, New York, NY, United States of America

  • Gerard Tromp,

    Roles Data curation

    Affiliation Division of Molecular Biology and Human Genetics, Department of Biomedical Sciences, Faculty of Medicine and Health Sciences, Stellenbosch University, Stellenbosch, South Africa

  • Annapoorani Veerappan,

    Roles Investigation, Validation

    Affiliation Department of Medicine, Gastroenterology, Duke University, Durham, NC, United States of America

  • Maureen E. Smith,

    Roles Conceptualization, Funding acquisition, Project administration, Resources, Writing – review & editing

    Affiliation Center for Genetic Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, United States of America

  • Rex L. Chisholm,

    Roles Funding acquisition, Project administration, Resources, Writing – review & editing

    Affiliation Center for Genetic Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, United States of America

  • Andrew J. Gawron,

    Roles Investigation, Supervision, Writing – review & editing

    Affiliation Division of Gastroenterology, Hepatology & Nutrition, University of Utah, Salt Lake City, UT, United States of America

  • M. Geoffrey Hayes,

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Supervision, Writing – original draft, Writing – review & editing

    ghayes@northwestern.edu (MGH); a-kho@northwestern.edu (ANK)

    ‡ MGH and ANK jointly supervised this work.

    Affiliations Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, United States of America, Center for Genetic Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, United States of America, Department of Anthropology, Northwestern University, Evanston, IL, United States of America

  • Abel N. Kho

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Resources, Supervision, Writing – original draft, Writing – review & editing

    Affiliations Center for Health Information Partnerships, Northwestern University Feinberg School of Medicine, Chicago, IL, United States of America, Division of General Internal Medicine and Geriatrics, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, United States of America


Abstract

Objective

Diverticular disease (DD) is one of the most prevalent conditions encountered by gastroenterologists, affecting ~50% of Americans before the age of 60. Our aim was to identify genetic risk variants and clinical phenotypes associated with DD, leveraging multiple electronic health record (EHR) data sources of 91,166 multi-ancestry participants with a Natural Language Processing (NLP) technique.

Materials and methods

We developed an NLP-enriched phenotyping algorithm that incorporated colonoscopy or abdominal imaging reports to identify patients with diverticulosis and diverticulitis from multicenter EHRs. We performed genome-wide association studies (GWAS) of DD in European, African and multi-ancestry participants, followed by phenome-wide association studies (PheWAS) of the risk variants to identify their potential comorbid/pleiotropic effects in clinical phenotypes.

Results

Our algorithm significantly improved patient classification performance for DD analysis (algorithm PPVs ≥ 0.94), identifying up to 3.5-fold more patients than the traditional method. Ancestry-stratified analyses of diverticulosis and diverticulitis in the identified subjects replicated the well-established associations of ARHGAP15 loci with DD, showing overall stronger GWAS signals in diverticulitis patients than in diverticulosis patients. Our PheWAS analyses identified significant associations between the DD GWAS variants and circulatory system, genitourinary, and neoplastic EHR phenotypes.

Discussion

As the first multi-ancestry GWAS-PheWAS study, this work shows that heterogeneous EHR data can be mapped through an integrative analytical pipeline to reveal significant genotype-phenotype associations with clinical interpretation.

Conclusion

A systematic framework to process unstructured EHR data with NLP could advance deep and scalable phenotyping for better patient identification and facilitate etiological investigation of a disease with multilayered data.

Introduction

Diverticular disease (DD) is the most common morphological defect of the intestinal tract and the fifth most important gastrointestinal (GI) disorder in terms of medical cost, which exceeds $5.4 billion in the United States [1–3]. DD usually indicates asymptomatic diverticulosis (the mere presence of diverticula, pouch-like protrusions in the colonic wall), but also includes diverticulitis (acute or chronic inflammation of diverticula) and its clinical complications [4]. Diverticulitis occurs in approximately 4% to 15% of patients with diverticula and has a high recurrence rate; it is associated with fever, abdominal pain, leukocytosis, and potentially life-threatening peritonitis [4–7].

DD has long been regarded as a disease of Western countries [8]; North America has the highest prevalence of DD, which affects approximately one-third of the population older than 45 and up to 67% of those over 65 [6, 9]. However, in recent decades, virtually all countries worldwide have observed an increasing burden of DD irrespective of their economic development or demographic variability [10–13]. Low dietary fiber intake and consumption of processed foods and red meats have been implicated as potential causes of DD [8, 14], but this idea is still controversial [15, 16].

As with most medical conditions, current evidence supports a complex interplay of both environmental and genetic factors in the pathophysiology of DD. Twin studies estimate the genetic heritability of DD to be up to 53% (95% CI, 45–61%) [5]. To date, three GWASs have identified 52 genetic susceptibility loci associated with DD [17–19].

A significant challenge to the etiologic investigation of DD is that approximately 75% to 90% of diverticulosis patients remain asymptomatic until presenting with diverticulitis [20], making it difficult to self-identify or detect the disorder in a clinical setting. In acute cases, computed tomography (CT) imaging of the abdomen is most often used in the evaluation of diverticulitis, but it may not be completely diagnostic in cases of early or mild diverticulitis [21]. Currently, the definitive ascertainment of the presence or absence of DD depends on colonoscopy results [21–23], but this requirement suffers from incomplete patient compliance given current screening guidelines [24].

To address these challenges, we developed an automated phenotyping algorithm that incorporated a natural language processing (NLP) technique to efficiently identify the presence or absence of diverticulosis or diverticulitis, utilizing both structured and unstructured data from the electronic health record (EHR). By integrating heterogeneous EHR data sources, we aim to present a scalable framework to perform EHR-powered GWAS and phenome-wide association studies (PheWAS) to systematically investigate the genetic epidemiology of DD.

Methods

NLP-enriched phenotyping algorithm for DD

Genome-wide genotype data of 38,827 individuals were collected from 9 EHR-linked biobanks, and phenotype data of 99,185 individuals, including demographics, clinical diagnoses, and colonoscopy reports, were collected from 12 EHR-linked biobanks in the electronic Medical Records and Genomics (eMERGE) network [25, 26]. The details of genotyping, imputation, and quality control processes are available in S1 File.

We developed two different phenotyping algorithms to account for data availability at each implementing site. For patients with physician reports in the EHR, the first, NLP-driven algorithm scanned the unstructured text of colonoscopy or abdominal imaging reports to identify DD. The algorithm considered any subject with a positively asserted mention of “diverticul*” in those reports to have diverticulosis, and any subject with a positively asserted mention of “diverticulitis” to have diverticulosis with diverticulitis. We used the ConText algorithm [27], an updated NegEx tool, to detect negated mentions of either diverticulosis or diverticulitis, and supplemented the results with diagnostic and procedure codes (Fig 1A).
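The mention-matching step described above can be illustrated with a minimal Python sketch. It is not the ConText algorithm and not the authors' implementation: the function name `classify_report`, the sentence splitting, and the four-word negation cue list are assumptions made for illustration, and the sketch only handles cues that precede a mention within the same sentence.

```python
import re

# Any word beginning with "diverticul" (diverticula, diverticulosis, diverticulitis, ...).
DD_MENTION = re.compile(r"\bdiverticul\w*")
# Hypothetical, deliberately tiny cue list; ConText uses a much richer lexicon
# and also handles scope termination and cues that follow the mention.
NEGATION_CUE = re.compile(r"\b(?:no|not|without|negative for)\b")


def classify_report(text: str) -> str:
    """Return 'diverticulitis', 'diverticulosis', or 'none' for one report."""
    found_any = False
    found_itis = False
    # Split into rough sentences so a negation cue only covers its own sentence.
    for sentence in re.split(r"[.;\n]+", text.lower()):
        for match in DD_MENTION.finditer(sentence):
            preceding = sentence[:match.start()]
            if NEGATION_CUE.search(preceding):
                continue  # treat as a negated mention and ignore it
            found_any = True
            if match.group().startswith("diverticulitis"):
                found_itis = True
    if found_itis:
        # Per the text, a diverticulitis mention implies diverticulosis with diverticulitis.
        return "diverticulitis"
    return "diverticulosis" if found_any else "none"
```

For example, `classify_report("No evidence of diverticulitis. Sigmoid diverticulosis noted.")` yields `"diverticulosis"`, because the first mention is negated and the second is positively asserted.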

Fig 1. Natural language processing (NLP)-enriched phenotyping algorithms for diverticular disease (DD) cases and controls.

(a) The NLP-driven phenotyping algorithm was used in five medical institutions in the eMERGE network (NU, VU, Geisinger, KPWA/UW, Mayo Clinic). (b) The structured data-driven phenotyping algorithm was used in two eMERGE sites (Marshfield, Mount Sinai).

https://doi.org/10.1371/journal.pone.0283553.g001

For the sites where only a limited subset of these imaging reports was available, the algorithm instead used International Classification of Diseases, 9th Revision (ICD-9) codes starting with 562 (‘Diverticulosis and diverticulitis’ category), assigned within 7 days after a colonoscopy or abdominal imaging, to select diverticulosis cases. The result was supplemented with NLP components when physician reports were available (Fig 1B). Additional criteria to define ‘diverticulosis’ and ‘diverticulitis’ are detailed in S1 File.
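The structured-data rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the deployed algorithm: the function name and input shapes are hypothetical, and treating a code assigned on the same day as the procedure (day 0) as inside the window is an assumption, since the text only says "within 7 days after".

```python
from datetime import date


def has_dd_code_after_procedure(procedure_dates, icd9_events, window_days=7):
    """Return True if any ICD-9 code starting with '562' was assigned
    0..window_days days after a colonoscopy or abdominal imaging date.

    procedure_dates: iterable of datetime.date
    icd9_events: iterable of (code_string, datetime.date) pairs
    """
    procedure_dates = list(procedure_dates)
    for code, code_date in icd9_events:
        if not code.startswith("562"):
            continue  # not in the 'Diverticulosis and diverticulitis' category
        for proc_date in procedure_dates:
            delta = (code_date - proc_date).days
            if 0 <= delta <= window_days:  # day 0 inclusion is an assumption
                return True
    return False
```

A code such as 562.10 dated three days after a colonoscopy would qualify; the same code dated three weeks later, or two days before the procedure, would not.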

We validated the algorithm performance by a standardized review of randomly selected patients’ charts. Trained clinicians and chart reviewers reviewed the records of 364 individuals from four data collection sites, using established guidelines [28], to assess the positive predictive value (PPV) of our algorithms.
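For reference, PPV is the fraction of algorithm-flagged records that chart review confirms. The short Python sketch below shows the calculation; the counts in the usage note are hypothetical and are not taken from Table 2.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = confirmed / (confirmed + not confirmed) among algorithm-flagged records."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("PPV is undefined when no records were flagged")
    return true_positives / flagged
```

With hypothetical counts of 96 confirmed and 4 unconfirmed charts, `positive_predictive_value(96, 4)` gives 0.96.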

Genome-wide association tests

Multi-ancestral (MA) GWAS was conducted on the identified subjects from the 9 sites that implemented our phenotyping algorithms (Table 1). We used logistic regression (PLINK v.1.9 [29]), adjusting for sex, age at colonoscopy, study site, and the first 10 principal components of ancestry. To test for associations with diverticulosis, we compared the patients with diverticulosis, either with or without diverticulitis, to the healthy control patients without any evidence of diverticulosis or diverticulitis. To test for associations with diverticulitis, we excluded any diverticulosis patients without diverticulitis records, and compared the patients with diverticulitis (presenting both diverticulosis and diverticulitis) to the healthy control patients. Similar GWASs were repeated in European ancestry (EA) and African ancestry (AA) participants separately, which are the two largest ancestral groups available.
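The two case-control definitions above can be written as a small labeling function. The sketch below is illustrative only; the function name, the boolean inputs, and the use of `None` for excluded participants are assumptions, and the regression itself (run in PLINK in the study) is not reproduced here.

```python
from typing import Optional


def gwas_label(has_diverticulosis: bool, has_diverticulitis: bool,
               analysis: str) -> Optional[int]:
    """Return 1 (case), 0 (control), or None (excluded) for one participant.

    analysis is either "diverticulosis" or "diverticulitis".
    """
    if analysis not in ("diverticulosis", "diverticulitis"):
        raise ValueError(f"unknown analysis: {analysis}")
    # Controls have no evidence of either condition in both analyses.
    if not has_diverticulosis and not has_diverticulitis:
        return 0
    if analysis == "diverticulosis":
        # Cases: diverticulosis with or without diverticulitis.
        return 1
    # Diverticulitis analysis: diverticulosis-only patients are excluded.
    return 1 if has_diverticulitis else None
```

A participant with diverticulosis but no diverticulitis record is therefore a case in the diverticulosis GWAS and excluded from the diverticulitis GWAS.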

Table 1. Demographic characteristics of the patients by each eMERGE site.

Patients with diverticulitis are a subset of the people with diverticulosis.

https://doi.org/10.1371/journal.pone.0283553.t001

We annotated the significant GWAS loci with eQTL, deleteriousness score (CADD score [30]), and potential regulatory functions (RegulomeDB score [31]) using the GTEx v7 database. A subsequent conditional analysis was performed within a window of ±1Mb of the genome-wide significant GWAS variants using genome-wide complex trait analysis (GCTA) v.1.26 [32].

Evaluation of our NLP-enriched phenotyping algorithm for DD

We compared the results of our NLP-enriched phenotyping algorithm against those of an ICD-based phenotyping method that has been commonly implemented in previous GWASs of DD [17–19]. Using the phecode map v1.2 [33] for DD (ICD-9 562), we compared the numbers of DD patients identified by each algorithm within our multicenter EHR data. We excluded patients with any related gastrointestinal manifestations such as ‘ulcerative enterocolitis’ (ICD-9 556), ‘regional enteritis’ (ICD-9 558), and ‘volvulus of intestine’ (ICD-9 560.2) to avoid classification bias (S1 Table).

PheWAS

We conducted PheWAS of independent SNPs that met a suggestive significance threshold in our GWAS (GWAS p-value < 1E-06 and LD r² < 0.1), grouped by ancestry [34]. We retrieved the diagnoses of the 91,166 MA participants, including ICD-9 and ICD-10 codes, whichever were available at the time of analysis. With a minimum of 30 cases per phenotype [34], logistic regression between the GWAS SNPs and each phecode was performed through the PheWAS R package [34], adjusting for the first 10 PCs and participation site. A false discovery rate (FDR) < 0.05 was used for reporting significance.
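The filtering and multiple-testing steps can be sketched as follows. This Python example assumes the Benjamini-Hochberg procedure for FDR control, which the text does not specify, and it does not call the PheWAS R package; the function names and the `(phecode, n_cases, p_value)` tuple layout are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant_phecodes(results, min_cases=30, fdr=0.05):
    """Keep phecodes with enough cases, then report those passing the FDR cutoff.

    results: list of (phecode, n_cases, p_value) tuples for one SNP.
    Returns a list of (phecode, adjusted_p) tuples.
    """
    eligible = [r for r in results if r[1] >= min_cases]
    adjusted = benjamini_hochberg([r[2] for r in eligible])
    return [(r[0], q) for r, q in zip(eligible, adjusted) if q < fdr]
```

Applying the case-count filter before adjustment means phenotypes with fewer than 30 cases do not count toward the number of tests, which is one reasonable reading of the procedure described above.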

We also conducted PheWAS of the 52 reported GWAS susceptibility loci from the three existing GWASs of DD [17–19]. The genomic positions of the 52 loci were converted to GRCh37/hg19 (40 loci from Maguire et al. [17], 12 loci from Schafmayer et al. [18]), including three proxy variants (r² > 0.5) available in our genotype data (S2 Table).

Results

Performance of NLP-enriched phenotyping algorithm

Compared to a gold standard of manual clinical chart review, the overall PPV of our phenotyping algorithm for diverticulosis cases (with/without diverticulitis) was 0.96, and 0.94 for controls without diverticulosis or diverticulitis (Table 2). We identified 21,777 study participants using the developed algorithm without missing covariate data. Of these, we identified 12,577 diverticulosis cases with or without diverticulitis, of which 1,265 were diverticulitis cases, and 9,200 controls without diverticulosis or diverticulitis in the entire MA discovery cohort (Table 1).

Table 2. Phenotyping algorithm validation and comparison of two phenotyping algorithms for diverticular diseases by site, out of 21,777 subjects with colonoscopy reports.

https://doi.org/10.1371/journal.pone.0283553.t002

Evaluation of NLP-enriched phenotyping vs. ICD-based phenotyping

We identified more cases and controls using ICD-based phenotyping than with NLP-enriched phenotyping, due to the lower availability of report data: 3,313 diverticulitis cases and 45,111 healthy controls were found with ICD-based phenotyping. However, out of the 21,777 subjects with imaging report data, ICD-based phenotyping identified only 3,591 as diverticulosis cases, whereas our NLP-enriched algorithm identified 12,577. For diverticulitis, our NLP-enriched algorithm identified 1,265 patients and ICD-based phenotyping identified 1,201 patients (Table 2), and only 87.0% (n = 1,101) of case patients overlapped between the two phenotyping algorithms. Even though the PPV of the DD ICD-10 code has been reported to be as high as 0.98 [35], we found considerable phenotyping heterogeneity when supporting procedure reports were not used.
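The fold increase and overlap percentage follow directly from the counts in this paragraph, as the short Python check below shows. The choice of denominator for the overlap (the 1,265 NLP-identified diverticulitis patients) is an inference from the reported 87.0%, not something the text states explicitly.

```python
# Counts quoted in the paragraph above (subjects with imaging reports).
nlp_diverticulosis = 12_577
icd_diverticulosis = 3_591
nlp_diverticulitis = 1_265
overlap_diverticulitis = 1_101

# Ratio of diverticulosis cases found by the two phenotyping methods.
fold_increase = nlp_diverticulosis / icd_diverticulosis
# Share of NLP-identified diverticulitis cases also found by ICD codes.
overlap_share = overlap_diverticulitis / nlp_diverticulitis

print(f"{fold_increase:.2f}-fold")  # 3.50-fold
print(f"{overlap_share:.1%}")       # 87.0%
```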

Genetic associations with DD

The GWAS of DD in the MA population identified one genome-wide significant locus (Fig 2) at 2q22.3 within the ARHGAP15 gene. The association patterns of the two conditions were largely similar, although the diverticulitis GWAS generally showed more significant associations and larger ORs than the diverticulosis GWAS (Table 3). In the MA GWAS for diverticulosis, rs2835676 (DSCR9 gene) showed strong eQTL association with both transverse and sigmoid colon tissues within the PIGP and TTC3 genes (FDR<3.90E-13).

Fig 2. Manhattan plots of genome-wide associations with diverticular disease (DD) in (a) multi-ancestry (MA) participants (n = 21,777), (b) European ancestry (EA) participants (n = 19,211), and (c) African ancestry (AA) participants (n = 2,322).

In each panel, the upper graph presents GWAS results of diverticulosis, and the bottom graph shows GWAS results of diverticulitis. The red horizontal line indicates genome-wide significance of p<5.0E-08 for each analysis.

https://doi.org/10.1371/journal.pone.0283553.g002

Table 3. Genetic variants that reach suggestive genome-wide significance (P < 1E-06) with diverticulosis or diverticulitis in MA (multi-ancestry), EA (European ancestry) and AA (African ancestry) participants.

https://doi.org/10.1371/journal.pone.0283553.t003

The genetic signals found in the EA-specific analysis and the MA analysis were largely analogous, possibly because approximately 85.0% of the discovery population was EA (Fig 2, Tables 1 and 3). Even though ARHGAP15 loci showed non-significant p-values of 0.24–0.99 in the AA GWAS, the effect directions of ARHGAP15 loci were mostly positive and substantial, with ORs ranging from 1.111 to 1.464 in the AA diverticulitis GWAS, except for a few loci that showed ORs below 1 (S3 Table).

We performed additional GWASs in the ICD-phenotyped diverticulitis cohort and replicated an ARHGAP15 locus on chromosome 2 (rs6717024) as genome-wide significant. One of the nearly significant associations was rs11843418 (FAM115A), which was previously identified [17, 19] but did not reach significance in our NLP-enriched GWAS, possibly due to limited statistical power or the varying genetic composition of the study cohorts (S4 Table).

PheWAS

(1) DD susceptibility variants identified in our MA, EA, AA GWAS (p<1E-06) tested in the medical phenome of MA, EA, AA participants.

We observed FDR-significant PheWAS associations (FDR < 0.05) between DD phecodes (562, 562.1, and 562.2) and several independent (LD r² < 0.1) ARHGAP15 loci in the MA and EA PheWAS (Table 4). Other than diverticular EHR phenotypes, rs9565028 (NBEA gene) showed FDR-significant associations with genitourinary manifestations including ‘functional disorders of bladder’ and ‘other disorders of bladder’ in the MA and EA phenome. No significant associations were identified in the AA PheWAS.

Table 4. Significant genotype-EHR phenotype associations (suggestive threshold P<1E-04) from ancestry-stratified PheWAS of the discovered diverticular disease susceptibility SNPs from our GWAS.

https://doi.org/10.1371/journal.pone.0283553.t004

(2) DD susceptibility variants identified in previous GWAS (p < 5E-08) tested in the medical phenome of MA, EA, AA participants.

In the MA PheWAS, 55 genotype-EHR phenotype associations were significant (S5 Table). Among them, 18 involved endocrine/metabolic phenotypes, 17 involved digestive phenotypes, and 10 involved circulatory system phenotypes. DD accounted for the largest number of significant EHR phenotype associations: 7 with ‘diverticulosis and diverticulitis’, 7 with ‘diverticulosis’, and 1 with ‘diverticulitis’. Other than the ARHGAP15 loci, rs4333882 (SLC35F3 gene) and rs10472291 (WDR70 gene) showed significant clinical associations with DD. SNP rs9272785 (HLA-DQA1 gene, proxy variant for rs7990) showed the most significant association in the MA PheWAS, with ‘rheumatoid arthritis’. The SNP was also strongly associated with several diabetes manifestations, including ‘type 1 diabetes’, ‘type 1 diabetes with ophthalmic’, ‘type 1 diabetes with ketoacidosis’, and ‘type 2 diabetes’.

In the EA PheWAS, 49 genotype-EHR phenotype associations were identified with FDR significance (S5 Table). Among them, 17 were digestive phenotypes, 15 were endocrine/metabolic phenotypes, and 6 were related to the circulatory system. SNP rs9272785 (HLA-DQA1 gene) also showed the most significant association in the EA PheWAS, with ‘rheumatoid arthritis’. The variant also revealed additional associations in the EA phenome, including ‘developmental delays and disorders’, ‘multiple sclerosis’, ‘ulcerative colitis’ and ‘chronic lymphocytic thyroiditis’.

In AA PheWAS, two genotype-EHR phenotype associations met FDR significance: rs9272785 (HLA-DQA1 gene) showed the most significant SNP-phenotype association as it did in MA and EA PheWAS. The variant also showed strong associations with ‘type 1 diabetes with ketoacidosis’ and ‘type 1 diabetes’ in the AA phenome.

Discussion

To date, patient identification in the EHR has been limited in part because mostly inpatient medical coding was used, which might result in under-diagnosis of case patients and/or misclassification of controls who possibly have DD. In the most recent GWAS of DD [18], the review of replication cohorts and the input of physicians/technicians were manual; however, manual review lacks scalability and therefore has limited application to larger population-based datasets. Our NLP-enriched phenotyping approach showed a significant improvement in performance (algorithm PPVs ≥ 0.94, 3.5-fold increase in diverticulosis patient identification) compared with the use of ICD codes alone (Table 2) and supports the importance of leveraging the full breadth of data captured in the EHR [36, 37].

Our multi-ancestry GWAS of DD confirmed the strong genome-wide association of ARHGAP15 with both diverticulosis and diverticulitis (Table 3). ARHGAP15 is known to strongly and negatively regulate the GTPase binding property of the Rac protein family in leukocytes, which modulates important antimicrobial functions [38]. This mechanism of ARHGAP15 possibly impacts the inflammatory environment of the intestine, promoting the development or progression of diverticula due to bacterial growth along the colonic wall. In the ancestry-stratified GWAS analyses, the often-replicated association of ARHGAP15 with DD was detected in the EA cohort; in the AA cohort, effect sizes were similarly positive but little to no association was observed (S3 Table). Notably, the sample size of the AA cohort is less than 1/10th that of the EA cohort, and risk allele frequencies differ between ancestries. Our additional power calculation showed that at least 15,000 participants are needed to perform GWAS on the ARHGAP15 loci (EAF 0.18, disease prevalence 0.10, OR 1.20) with 80% statistical power (S1 Fig). Further investigation is needed to confirm the universal susceptibility effect of ARHGAP15 on DD in patients of non-European ancestry.

Our PheWAS of the independent ARHGAP15 loci (rs6736741, rs10928187, rs386651361) confirmed their significant associations with DD phenotypes in MA and EA participants, with paralytic ileus as the second most significant association (Table 4). The genitourinary phenotypes of functional bladder disorders found in MA and EA participants are noteworthy because muscular motility or neuromuscular dysfunction of internal organs could affect both the colonic wall, contributing to diverticulosis, and the bladder muscle, contributing to urinary disorders.

In the PheWAS of the established diverticular variants, we identified several circulatory system related EHR phenotypes associated with DD variants, including phlebitis and thrombophlebitis, pulmonary heart disease, and deep vein thrombosis. Notably, recent studies have suggested a possible epidemiologic association between DD and acute coronary syndromes and thromboembolic events [39, 40]. We also confirmed the associations of rs9272785 (HLA-DQA1 gene) with type 1 diabetes and its manifestations with FDR significance across ancestries. The HLA class II region, where rs9272785 is located, is associated not only with risk of type 1 diabetes but also with increased susceptibility to juvenile rheumatoid arthritis and other autoimmune diseases [41, 42].

Compared to previous GWASs of DD, our summary statistics generally show larger effect sizes, possibly owing to the improved patient identification by the NLP-enriched phenotyping algorithm. For example, rs6734367, the strongest ARHGAP15 locus reported in Maguire et al. [17], showed an OR of 1.010 in the original study, whereas it shows ORs as high as 1.177 (diverticulosis) and 1.280 (diverticulitis) in our EA GWAS with the same allelic direction (S6 Table). For the rest of the GWAS-significant SNPs, the ORs in our results generally showed increased effect sizes despite a cohort 1/20th the size of that of Maguire et al. (Fig 3). Among the 52 tested variants, 5 loci were significantly replicated in our EA GWAS of diverticulosis (p-value < 0.05/52). As cohort sizes grow and patients with diverse genetic backgrounds are included, our results suggest improved analytical power for future genomic research with the integration of different layers of EHR data.
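The replication criterion quoted above is a Bonferroni-corrected threshold, which works out to roughly 9.6E-04 per variant. The Python sketch below shows the arithmetic; the function name `is_replicated` is hypothetical.

```python
ALPHA = 0.05
N_TESTED_VARIANTS = 52

# Bonferroni-corrected per-variant threshold, about 9.6E-04.
replication_threshold = ALPHA / N_TESTED_VARIANTS


def is_replicated(p_value: float) -> bool:
    """True if a previously reported variant passes the corrected threshold."""
    return p_value < replication_threshold
```

Under this threshold, a variant with p = 5E-04 would count as replicated, whereas one with p = 0.01 would not, even though it is nominally significant.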

Fig 3. Comparison of effect size (OR) between our GWAS with NLP-enriched phenotyping and previous GWAS with ICD-based phenotyping from Maguire et al.

The dashed y = x line indicates equal ORs in both studies.

https://doi.org/10.1371/journal.pone.0283553.g003

There are several caveats in our study. We did not separately validate our phenotyping algorithms’ performance for diverticulitis vs. diverticulosis, which should be included in future research. Our GWAS did not identify any novel association and only confirmed an existing DD locus, albeit with larger effect sizes across the analyses. Also, our MA analysis was composed of 85% EA participants, so the signals are largely driven by EA-centric results. The AA cohort is considerably smaller than the EA or MA cohorts, which elevates the risk of false positive findings.

Our approach has highlighted the richness and potential of heterogeneous EHR data in patient classification with NLP, and the feasibility of an integrative analytical pipeline, from GWAS to post-GWAS analysis such as PheWAS, to facilitate etiological investigation of a disease in a clinical setting.

Supporting information

S1 Fig. Results of power calculation for our DD GWAS analyses.

https://doi.org/10.1371/journal.pone.0283553.s001

(TIF)

S2 Fig. QQ plots of our DD GWAS results.

https://doi.org/10.1371/journal.pone.0283553.s002

(TIF)

S1 Table. PheWAS association results (p-value < 1E-04) of 52 susceptibility SNPs for diverticular diseases in MA, EA, AA participants.

https://doi.org/10.1371/journal.pone.0283553.s003

(XLSX)

S2 Table. List of exclusion ICD codes for phecode mapping: Not classified as control or case.

https://doi.org/10.1371/journal.pone.0283553.s004

(XLSX)

S3 Table. GWAS results of ARHGAP15 loci among participants of African ancestry.

https://doi.org/10.1371/journal.pone.0283553.s005

(XLSX)

S4 Table. GWAS results of ICD-based identified patients with diverticulitis in eMERGE cohort.

https://doi.org/10.1371/journal.pone.0283553.s006

(XLSX)

S5 Table. Information of 52 reported susceptibility variants from three Eurocentric GWAS of diverticular diseases: Sid et al. (2017), Maguire et al. (2018), and Sch et al. (2019).

https://doi.org/10.1371/journal.pone.0283553.s007

(XLSX)

S6 Table. OR comparison between our European-specific GWASs and the previous GWAS results from Maguire et al.

https://doi.org/10.1371/journal.pone.0283553.s008

(XLSX)

S1 File. Details of genotyping, imputation, and quality control processes.

https://doi.org/10.1371/journal.pone.0283553.s009

(DOCX)

References

  1. Sandler RS, Everhart JE, Donowitz M, Adams E, Cronin K, Goodman C, et al. The burden of selected digestive diseases in the United States. Gastroenterology. 2002;122(5):1500–11. Epub 2002/05/02. pmid:11984534.
  2. Peery AF, Crockett SD, Barritt AS, Dellon ES, Eluri S, Gangarosa LM, et al. Burden of Gastrointestinal, Liver, and Pancreatic Diseases in the United States. Gastroenterology. 2015;149(7):1731–41 e3. Epub 2015/09/04. pmid:26327134; PubMed Central PMCID: PMC4663148.
  3. Peery AF, Crockett SD, Murphy CC, Lund JL, Dellon ES, Williams JL, et al. Burden and Cost of Gastrointestinal, Liver, and Pancreatic Diseases in the United States: Update 2018. Gastroenterology. 2019;156(1):254–72 e11. Epub 2018/10/14. pmid:30315778; PubMed Central PMCID: PMC6689327.
  4. Strate LL, Morris AM. Epidemiology, Pathophysiology, and Treatment of Diverticulitis. Gastroenterology. 2019;156(5):1282–98 e1. Epub 2019/01/21. pmid:30660732; PubMed Central PMCID: PMC6716971.
  5. Reichert MC, Lammert F. The genetic epidemiology of diverticulosis and diverticular disease: Emerging evidence. United European Gastroenterol J. 2015;3(5):409–18. Epub 2015/11/05. pmid:26535118; PubMed Central PMCID: PMC4625748.
  6. Colcock BP. Diverticular disease of the colon. Major Probl Clin Surg. 1971;11:1–135. Epub 1971/01/01. pmid:4949985.
  7. Shahedi K, Fuller G, Bolus R, Cohen E, Vu M, Shah R, et al. Long-term risk of acute diverticulitis among patients with incidental diverticulosis found during colonoscopy. Clin Gastroenterol Hepatol. 2013;11(12):1609–13. Epub 2013/07/17. pmid:23856358; PubMed Central PMCID: PMC5731451.
  8. Painter NS, Burkitt DP. Diverticular disease of the colon: a deficiency disease of Western civilization. Br Med J. 1971;2(5759):450–4. Epub 1971/05/22. pmid:4930390; PubMed Central PMCID: PMC1796198.
  9. Painter NS, Burkitt DP. Diverticular disease of the colon, a 20th century problem. Clin Gastroenterol. 1975;4(1):3–21. Epub 1975/01/01. pmid:1109818.
  10. Makela J, Kiviniemi H, Laitinen S. Prevalence of perforated sigmoid diverticulitis is increasing. Dis Colon Rectum. 2002;45(7):955–61. Epub 2002/07/20. pmid:12130886.
  11. Nagata N, Niikura R, Aoki T, Shimbo T, Itoh T, Goda Y, et al. Increase in colonic diverticulosis and diverticular hemorrhage in an aging society: lessons from a 9-year colonoscopic study of 28,192 patients in Japan. Int J Colorectal Dis. 2014;29(3):379–85. Epub 2013/12/10. pmid:24317937.
  12. Warner E, Crighton EJ, Moineddin R, Mamdani M, Upshur R. Fourteen-year study of hospital admissions for diverticular disease in Ontario. Can J Gastroenterol. 2007;21(2):97–9. Epub 2007/02/15. pmid:17299613; PubMed Central PMCID: PMC2657668.
  13. Ogunbiyi OA. Diverticular disease of the colon in Ibadan, Nigeria. Afr J Med Med Sci. 1989;18(4):241–4. Epub 1989/12/01. pmid:2558553.
  14. Aldoori WH. The protective role of dietary fiber in diverticular disease. Adv Exp Med Biol. 1997;427:291–308. Epub 1997/01/01. pmid:9361853.
  15. Peery AF, Barrett PR, Park D, Rogers AJ, Galanko JA, Martin CF, et al. A high-fiber diet does not protect against asymptomatic diverticulosis. Gastroenterology. 2012;142(2):266–72 e1. Epub 2011/11/09. pmid:22062360; PubMed Central PMCID: PMC3724216.
  16. 16. Peery AF, Sandler RS, Ahnen DJ, Galanko JA, Holm AN, Shaukat A, et al. Constipation and a low-fiber diet are not associated with diverticulosis. Clin Gastroenterol Hepatol. 2013;11(12):1622–7. Epub 2013/07/31. pmid:23891924; PubMed Central PMCID: PMC3840096.
  17. 17. Maguire LH, Handelman SK, Du X, Chen Y, Pers TH, Speliotes EK. Genome-wide association analyses identify 39 new susceptibility loci for diverticular disease. Nat Genet. 2018;50(10):1359–65. Epub 2018/09/05. pmid:30177863; PubMed Central PMCID: PMC6168378.
  18. 18. Schafmayer C, Harrison JW, Buch S, Lange C, Reichert MC, Hofer P, et al. Genome-wide association analysis of diverticular disease points towards neuromuscular, connective tissue and epithelial pathomechanisms. Gut. 2019;68(5):854–65. Epub 2019/01/21. pmid:30661054.
  19. 19. Sigurdsson S, Alexandersson KF, Sulem P, Feenstra B, Gudmundsdottir S, Halldorsson GH, et al. Sequence variants in ARHGAP15, COLQ and FAM155A associate with diverticular disease and diverticulitis. Nat Commun. 2017;8:15789. pmid:28585551; PubMed Central PMCID: PMC5467205.
  20. 20. Matrana MR, Margolin DA. Epidemiology and pathophysiology of diverticular disease. Clin Colon Rectal Surg. 2009;22(3):141–6. pmid:20676256; PubMed Central PMCID: PMC2780269.
  21. 21. Destigter KK, Keating DP. Imaging update: acute colonic diverticulitis. Clin Colon Rectal Surg. 2009;22(3):147–55. Epub 2010/08/03. pmid:20676257; PubMed Central PMCID: PMC2780264.
  22. 22. Feingold D, Steele SR, Lee S, Kaiser A, Boushey R, Buie WD, et al. Practice parameters for the treatment of sigmoid diverticulitis. Dis Colon Rectum. 2014;57(3):284–94. Epub 2014/02/11. pmid:24509449.
  23. 23. Diverticulosis [Internet]. 2019. Available from: https://www.ncbi.nlm.nih.gov/books/NBK430771/.
  24. 24. Joseph DA KJ, Richards TB, Thomas CC, Richardson LC. Use of colorectal cancer screening tests by state. Preventing Chronic Disease. 2018;15:170535. pmid:29908051
  25. 25. Stanaway IB, Hall TO, Rosenthal EA, Palmer M, Naranbhai V, Knevel R, et al. The eMERGE genotype set of 83,717 subjects imputed to ~40 million variants genome wide and association with the herpes zoster medical record phenotype. Genet Epidemiol. 2019;43(1):63–81. Epub 2018/10/10. pmid:30298529; PubMed Central PMCID: PMC6375696.
  26. 26. Lessons learned from the eMERGE Network: balancing genomics in discovery and practice. Human Genetics and Genomics Advances. 2021;2(1):100018. pmid:35047833
  27. 27. Harkema H, Dowling JN, Thornblade T, Chapman WW. ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports. J Biomed Inform. 2009;42(5):839–51. pmid:19435614; PubMed Central PMCID: PMC2757457.
  28. 28. Newton KM, Peissig PL, Kho AN, Bielinski SJ, Berg RL, Choudhary V, et al. Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. J Am Med Inform Assoc. 2013;20(e1):e147–54. pmid:23531748; PubMed Central PMCID: PMC3715338.
  29. 29. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7. Epub 2015/02/28. pmid:25722852; PubMed Central PMCID: PMC4342193.
  30. 30. Kircher M, Witten DM, Jain P, O’Roak BJ, Cooper GM, Shendure J. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet. 2014;46(3):310–5. Epub 2014/02/04. pmid:24487276; PubMed Central PMCID: PMC3992975.
  31. 31. Boyle AP, Hong EL, Hariharan M, Cheng Y, Schaub MA, Kasowski M, et al. Annotation of functional variation in personal genomes using RegulomeDB. Genome Res. 2012;22(9):1790–7. Epub 2012/09/08. pmid:22955989; PubMed Central PMCID: PMC3431494.
  32. 32. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. American journal of human genetics. 2011;88(1):76–82. Epub 2010/12/21. pmid:21167468; PubMed Central PMCID: PMC3014363.
  33. 33. Denny JC, Bastarache L, Ritchie MD, Carroll RJ, Zink R, Mosley JD, et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat Biotechnol. 2013;31(12):1102–10. Epub 2013/11/26. pmid:24270849; PubMed Central PMCID: PMC3969265.
  34. 34. Carroll RJ, Bastarache L, Denny JC. R PheWAS: data analysis and plotting tools for phenome-wide association studies in the R environment. Bioinformatics. 2014;30(16):2375–6. pmid:24733291; PubMed Central PMCID: PMC4133579.
  35. 35. Erichsen R, Strate L, Sorensen HT, Baron JA. Positive predictive values of the International Classification of Disease, 10th edition diagnoses codes for diverticular disease in the Danish National Registry of Patients. Clin Exp Gastroenterol. 2010;3:139–42. Epub 2010/01/01. pmid:21694857; PubMed Central PMCID: PMC3108666.
  36. 36. Wei WQ, Denny JC. Extracting research-quality phenotypes from electronic health records to support precision medicine. Genome Med. 2015;7(1):41. Epub 2015/05/06. pmid:25937834; PubMed Central PMCID: PMC4416392.
  37. 37. Peissig PL, Rasmussen LV, Berg RL, Linneman JG, McCarty CA, Waudby C, et al. Importance of multi-modal approaches to effectively identify cataract cases from electronic health records. J Am Med Inform Assoc. 2012;19(2):225–34. Epub 2012/02/10. pmid:22319176; PubMed Central PMCID: PMC3277618.
  38. 38. Costa C, Germena G, Martin-Conte EL, Molineris I, Bosco E, Marengo S, et al. The RacGAP ArhGAP15 is a master negative regulator of neutrophil functions. Blood. 2011;118(4):1099–108. Epub 2011/05/10. pmid:21551229.
  39. 39. Pers TH, Karjalainen JM, Chan Y, Westra HJ, Wood AR, Yang J, et al. Biological interpretation of genome-wide association studies using predicted gene functions. Nat Commun. 2015;6:5890. pmid:25597830; PubMed Central PMCID: PMC4420238.
  40. 40. Strate LL, Erichsen R, Horvath-Puho E, Pedersen L, Baron JA, Sorensen HT. Diverticular disease is associated with increased risk of subsequent arterial and venous thromboembolic events. Clin Gastroenterol Hepatol. 2014;12(10):1695–701 e1. Epub 2013/12/10. pmid:24316104.
  41. 41. Begovich AB, Bugawan TL, Nepom BS, Klitz W, Nepom GT, Erlich HA. A specific HLA-DP beta allele is associated with pauciarticular juvenile rheumatoid arthritis but not adult rheumatoid arthritis. Proc Natl Acad Sci U S A. 1989;86(23):9489–93. Epub 1989/12/01. pmid:2512583; PubMed Central PMCID: PMC298522.
  42. 42. Noble JA, Valdes AM. Genetics of the HLA region in the prediction of type 1 diabetes. Curr Diab Rep. 2011;11(6):533–42. Epub 2011/09/14. pmid:21912932; PubMed Central PMCID: PMC3233362.