Multi-ancestry genome- and phenome-wide association studies of diverticular disease in electronic health records with natural language processing enriched phenotyping algorithm

Objective Diverticular disease (DD) is one of the most prevalent conditions encountered by gastroenterologists, affecting ~50% of Americans before the age of 60. Our aim was to identify genetic risk variants and clinical phenotypes associated with DD, leveraging multiple electronic health record (EHR) data sources of 91,166 multi-ancestry participants with a Natural Language Processing (NLP) technique.
Materials and methods We developed an NLP-enriched phenotyping algorithm that incorporated colonoscopy or abdominal imaging reports to identify patients with diverticulosis and diverticulitis from multicenter EHRs. We performed genome-wide association studies (GWAS) of DD in European, African and multi-ancestry participants, followed by phenome-wide association studies (PheWAS) of the risk variants to identify their potential comorbid/pleiotropic effects in clinical phenotypes.
Results Our algorithm significantly improved patient classification for DD analysis (algorithm PPVs ≥ 0.94) and identified up to 3.5-fold more patients than the traditional method. Ancestry-stratified analyses of diverticulosis and diverticulitis in the identified subjects replicated the well-established associations between ARHGAP15 loci and DD, with overall stronger GWAS signals in diverticulitis patients than in diverticulosis patients. Our PheWAS analyses identified significant associations between the DD GWAS variants and circulatory system, genitourinary, and neoplastic EHR phenotypes.
Discussion As the first multi-ancestry GWAS-PheWAS study, this work shows that heterogeneous EHR data can be mapped through an integrative analytical pipeline to reveal significant genotype-phenotype associations with clinical interpretation.
Conclusion A systematic framework to process unstructured EHR data with NLP could advance deep and scalable phenotyping for better patient identification and facilitate etiological investigation of a disease with multilayered data.


Introduction
Diverticular disease (DD) is the most common morphological defect of the intestinal tract and the fifth most important gastrointestinal (GI) disorder in terms of medical cost, which exceeds $5.4 billion in the United States [1][2][3]. DD usually indicates asymptomatic diverticulosis (the mere presence of diverticula, pouch-like protrusions of the colonic wall), but also includes diverticulitis (acute or chronic inflammation of diverticula) and its clinical complications [4]. Diverticulitis occurs in approximately 4% to 15% of patients with diverticula and has a high recurrence rate; it is associated with fever, abdominal pain, leukocytosis, and potentially life-threatening peritonitis [4][5][6][7].
DD has long been regarded as a disease of Western countries [8]; North America has the highest prevalence of DD, affecting approximately one-third of the population older than 45, and up to 67% over 65 [6,9]. However, in recent decades, virtually all countries worldwide have observed an increasing burden of DD irrespective of their economic development or demographic variability [10][11][12][13]. Diets low in fiber and high in processed foods and red meats have been implicated as potential causes of DD [8,14], but this idea is still controversial [15,16].
As with most medical conditions, current evidence supports a complex interplay of both environmental and genetic factors in the pathophysiology of DD. Twin studies estimate the genetic heritability of DD to be up to 53% (95% CI, 45-61%) [5]. To date, three GWAS have identified 52 genetic susceptibility loci associated with DD [17][18][19].
A significant challenge to its etiologic investigation is that approximately 75% to 90% of diverticulosis patients remain asymptomatic until presenting with diverticulitis [20], making it difficult to self-identify or detect the disorder in a clinical setting. In acute cases, computed tomography (CT) imaging of the abdomen is most often used in the evaluation of diverticulitis, but it may not be completely diagnostic in cases of early or mild diverticulitis [21]. Currently, the definitive ascertainment of the presence or absence of DD depends on colonoscopy results [21][22][23], but this requirement suffers from incomplete patient compliance with current screening guidelines [24].
To address these challenges, we developed an automated phenotyping algorithm that incorporated natural language processing (NLP) to efficiently identify the presence or absence of diverticulosis or diverticulitis using both structured and unstructured data from the electronic health records (EHR). By integrating heterogeneous EHR data sources, we aim to present a scalable framework to perform EHR-powered GWAS and phenome-wide association studies (PheWAS) to systematically investigate the genetic epidemiology of DD.

NLP-enriched phenotyping algorithm for DD
We collected genome-wide genotype data for 38,827 individuals from 9 EHR-linked biobanks, and phenotype data (demographics, clinical diagnoses, and colonoscopy reports) for 99,185 individuals from 12 EHR-linked biobanks, in the electronic Medical Records and Genomics (eMERGE) network [25,26]. The details of genotyping, imputation, and quality control processes are available in S1 File.
We developed two different phenotyping algorithms while accounting for data availability at each implementing site. For patients with physician reports in the EHR, the first, NLP-driven algorithm scanned the unstructured text of colonoscopy or abdominal imaging reports to identify DD. The algorithm considered any subject with a positively asserted mention of "diverticul*" in those reports to have diverticulosis, and any subject with a positively asserted mention of "diverticulitis" to have diverticulosis with diverticulitis. We used the ConText algorithm [27], an extension of the NegEx tool, to detect negated mentions of either diverticulosis or diverticulitis and supplemented the results with diagnostic and procedure codes (Fig 1A).
For the sites where only a limited subset of these imaging reports was available, the algorithm alternatively used International Classification of Disease 9th revision (ICD-9) codes that started with 562 ('Diverticulosis and diverticulitis' category), assigned within 7 days after a colonoscopy or abdominal imaging, to select diverticulosis cases. The result was supplemented with NLP components when physician reports were available (Fig 1B). Additional criteria to define 'diverticulosis' and 'diverticulitis' are detailed in S1 File.
We validated the algorithm performance by a standardized chart review of randomly selected patients' charts. Trained clinicians and chart reviewers reviewed the records of 364 individuals from four data collection sites, following established guidelines [28], to assess the positive predictive value (PPV) of our developed algorithms.
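The report-scanning rule described above can be illustrated with a minimal Python sketch. It is not the ConText implementation used in the study: the negation cue list, the sentence splitting, and the function name `classify_report` are simplifying assumptions made here for illustration only.

```python
import re

# Hypothetical, heavily reduced cue list. The actual ConText algorithm uses a
# much larger trigger lexicon and explicit scope rules.
NEGATION_CUE = re.compile(r"\b(no|not|without|negative for|absence of|denies)\b")
TERM = re.compile(r"\bdiverticul\w*")


def classify_report(text: str) -> str:
    """Return 'diverticulitis', 'diverticulosis', or 'none' for one report."""
    has_diverticulosis = False
    has_diverticulitis = False
    # Naive sentence split; a negation cue only affects its own sentence.
    for sentence in re.split(r"[.;\n]+", text.lower()):
        for match in TERM.finditer(sentence):
            preceding = sentence[:match.start()]
            if NEGATION_CUE.search(preceding):
                continue  # negated mention, not positively asserted
            has_diverticulosis = True
            if match.group().startswith("diverticulitis"):
                has_diverticulitis = True
    if has_diverticulitis:
        return "diverticulitis"  # i.e., diverticulosis with diverticulitis
    return "diverticulosis" if has_diverticulosis else "none"


print(classify_report("Scattered diverticula in the sigmoid colon. "
                      "No evidence of diverticulitis."))
```

In this sketch a negation cue anywhere earlier in the same sentence suppresses the mention, which is cruder than the scope handling in ConText.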
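The structured-data rule can be sketched in Python as follows. The function name, the input layout, and the choice to count a code assigned on the day of the procedure as inside the window are assumptions for illustration; the full criteria are in S1 File.

```python
from datetime import date, timedelta


def is_code_based_case(icd9_codes, procedure_dates, window_days=7):
    """Flag a patient as a diverticulosis case from structured data.

    icd9_codes: iterable of (code, date_assigned) tuples.
    procedure_dates: iterable of colonoscopy or abdominal imaging dates.
    Assumption: day 0 (same day as the procedure) counts as within the window.
    """
    window = timedelta(days=window_days)
    for code, assigned in icd9_codes:
        if not code.startswith("562"):
            continue  # not in the 'Diverticulosis and diverticulitis' category
        for procedure in procedure_dates:
            if timedelta(0) <= assigned - procedure <= window:
                return True
    return False


codes = [("562.10", date(2015, 1, 5))]
print(is_code_based_case(codes, [date(2015, 1, 1)]))
```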
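For reference, the PPV assessed in the chart review is the fraction of algorithm-identified subjects whose status is confirmed by the reviewers. The short Python sketch below uses hypothetical counts, not the study's validation data.

```python
def positive_predictive_value(confirmed: int, not_confirmed: int) -> float:
    """PPV = confirmed / (confirmed + not confirmed) among flagged subjects."""
    total = confirmed + not_confirmed
    if total == 0:
        raise ValueError("no reviewed subjects")
    return confirmed / total


# Hypothetical example: 48 of 50 algorithm-identified cases confirmed on review.
print(positive_predictive_value(48, 2))
```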

Genome-wide association tests
Multi-ancestral (MA) GWAS was conducted on the identified subjects from the 9 sites that implemented our phenotyping algorithms (Table 1). We used logistic regression (PLINK v.1.9 [29]), adjusting for sex, age at colonoscopy, study site, and the first 10 principal components of ancestry. To test for associations with diverticulosis, we compared the patients with diverticulosis, either with or without diverticulitis, to the healthy control patients without any evidence of diverticulosis or diverticulitis. To test for associations with diverticulitis, we excluded any diverticulosis patients without diverticulitis records, and compared the patients with diverticulitis (presenting both diverticulosis and diverticulitis) to the healthy control patients. Similar GWASs were repeated in European ancestry (EA) and African ancestry (AA) participants separately, which are the two largest ancestral groups available.
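The two case-control definitions can be summarized in a small Python sketch. The function name and the 1/0/None encoding are illustrative assumptions and are not taken from the study's pipeline.

```python
def gwas_label(has_diverticulosis: bool, has_diverticulitis: bool, analysis: str):
    """Return 1 (case), 0 (control), or None (excluded) for one subject.

    analysis is either 'diverticulosis' or 'diverticulitis'.
    """
    if analysis not in ("diverticulosis", "diverticulitis"):
        raise ValueError(f"unknown analysis: {analysis}")
    if not has_diverticulosis and not has_diverticulitis:
        return 0  # healthy control in both analyses
    if analysis == "diverticulosis":
        return 1  # diverticulosis with or without diverticulitis
    # Diverticulitis analysis: diverticulosis-only subjects are excluded.
    return 1 if has_diverticulitis else None


print(gwas_label(True, False, "diverticulosis"), gwas_label(True, False, "diverticulitis"))
```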

PheWAS
We conducted PheWAS of independent SNPs that met a suggestive GWAS threshold (GWAS p-value < 1E-06 and LD r² < 0.1), grouped by ancestry [34]. We retrieved the diagnoses of the 91,166 MA participants, including ICD-9 and ICD-10 codes, whichever were available at the time of analysis. With a minimum of 30 cases per phenotype [34], logistic regression between the GWAS SNPs and each phecode was performed, adjusting for the first 10 principal components (PCs) and participating site, using the PheWAS R package [34]. A false discovery rate (FDR) < 0.05 was used for reporting significance.
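The FDR criterion can be illustrated with a short Python sketch of the Benjamini-Hochberg adjustment. That the reported FDR corresponds to this particular procedure is an assumption here, and the p-values in the example are invented for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical SNP-phecode p-values.
p = [0.01, 0.04, 0.03, 0.005]
q = benjamini_hochberg(p)
significant = [qi < 0.05 for qi in q]
print(q, significant)
```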

Performance of NLP-enriched phenotyping algorithm
Compared to a gold standard of manual clinical chart review, the overall PPV of our phenotyping algorithm for diverticulosis cases (with/without diverticulitis) was 0.96, and 0.94 for controls without diverticulosis or diverticulitis (Table 2). We identified 21,777 study participants using the developed algorithm without missing covariate data. Of these, we identified 12,577 diverticulosis cases with or without diverticulitis, of which 1,265 were diverticulitis cases, and 9,200 controls without diverticulosis or diverticulitis in the entire MA discovery cohort (Table 1).

Evaluation of NLP-enriched phenotyping vs. ICD-based phenotyping
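We identified more cases and controls using ICD-based phenotyping than with NLP-enriched phenotyping, due to the lower availability of report data: 3,313 diverticulitis cases and 45,111 healthy controls were found with ICD-based phenotyping. However, out of 21,777 subjects with imaging reports data, ICD-based phenotyping identified only 3,591 of them as diverticulosis cases whereas our NLP-enriched algorithm identified 12,577 diverticulosis cases. For diverticulitis, our NLP-enriched algorithm identified 1,265 patients and ICD-based phenotyping identified 1,201 patients (Table 2), and only 87.0% (n = 1,101) of case patients overlapped between these two phenotyping algorithms. Even though the PPV of the DD ICD-10 code was reported to be as high as 0.98 [35], we found that considerable phenotyping heterogeneity existed without the supporting procedure reports.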
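The headline comparison figures can be checked from the reported counts among the 21,777 subjects with report data. The Python sketch below assumes that the 87.0% overlap is computed against the 1,265 diverticulitis patients identified by the NLP-enriched algorithm, which is the denominator that reproduces the reported percentage.

```python
# Counts reported for the 21,777 subjects with report data.
nlp_diverticulosis = 12_577
icd_diverticulosis = 3_591
nlp_diverticulitis = 1_265
overlap_diverticulitis = 1_101

# Fold increase in diverticulosis case identification (NLP vs. ICD only).
fold_increase = nlp_diverticulosis / icd_diverticulosis

# Assumed denominator: NLP-identified diverticulitis patients.
overlap_fraction = overlap_diverticulitis / nlp_diverticulitis

print(f"fold increase: {fold_increase:.1f}")
print(f"overlap: {overlap_fraction:.1%}")
```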

Genetic associations with DD
The GWAS of DD in the MA population identified one genome-wide significant locus (Fig 2) at 2q22.3 within the ARHGAP15 gene. The association patterns between the two conditions are largely similar; the diverticulitis GWAS showed more significant associations and larger ORs than the diverticulosis GWASs in general (Table 3). In the MA GWAS for diverticulosis, rs2835676 (DSCR9 gene) showed strong eQTL association with both transverse and sigmoid colon tissues within the PIGP and TTC3 genes (FDR < 3.90E-13).
The genetic signals found from EA-specific analysis and MA analysis were largely analogous, possibly because approximately 85.0% of the discovery population was EA (Fig 2, Tables 1 and 3). Even though ARHGAP15 loci showed non-significant p-values of 0.24-0.99 in the AA GWAS, the effect directions of ARHGAP15 loci were mostly positive and substantial, with ORs ranging from 1.111 to 1.464 in the AA diverticulitis GWAS, except for a few loci that showed negative effect directions (S3 Table).
We performed additional GWASs in the ICD-phenotyped diverticulitis cohort and replicated an ARHGAP15 locus on chromosome 2 (rs6717024) as genome-wide significant. One of the nearly significant associations was rs11843418 (FAM115A), which was previously identified [17,19] but not significantly detected in our NLP-enriched GWAS, possibly due to statistical power or varying genetic composition of study cohorts (S4 Table).
(2) DD susceptibility variants identified in previous GWAS (p < 5E-08) tested in the medical phenome of MA, EA, AA participants. In the MA PheWAS, 55 genotype-EHR phenotype associations were significant (S5 Table). Among them, 18 significant genotype-EHR associations were endocrine/metabolic phenotypes, 17 were digestive phenotypes and 10 were circulatory system related phenotypes. The largest number of significant EHR phenotype associations involved DD: 7 'diverticulosis and diverticulitis', 7 'diverticulosis' and 1 'diverticulitis' were identified as significant. Other than the ARHGAP15 loci, rs4333882 (SLC35F3 gene) and rs10472291 (WDR70 gene) showed significant clinical associations with DD. SNP rs9272785 (HLA-DQA1 gene, proxy variant for rs7990) generated the most significant association in the MA PheWAS, with 'rheumatoid arthritis'. The SNP was also strongly associated with several diabetes manifestations, including 'type 1 diabetes', 'type 1 diabetes with ophthalmic', 'type 1 diabetes with ketoacidosis', 'type 2 diabetes', etc.
In the EA PheWAS, 49 genotype-EHR phenotype associations were identified with FDR significance (S5 Table). Among them, 17 EHR phenotypes were classified as digestive phenotypes, 15 were endocrine/metabolic-related phenotypes and 6 were related to the circulatory system. Rs9272785 (HLA-DQA1 gene) also marked the most significant association in the EA PheWAS, with 'rheumatoid arthritis'. The variant also revealed additional associations in the EA phenome, including 'developmental delays and disorders', 'multiple sclerosis', 'ulcerative colitis' and 'chronic lymphocytic thyroiditis'.
In the AA PheWAS, two genotype-EHR phenotype associations met FDR significance: rs9272785 (HLA-DQA1 gene) showed the most significant SNP-phenotype association, as it did in the MA and EA PheWAS. The variant also showed strong associations with 'type 1 diabetes with ketoacidosis' and 'type 1 diabetes' in the AA phenome.

Discussion
To date, patient identification in the EHR has been limited in that mostly inpatient medical coding was used, which might result in under-diagnosis of case patients and/or misclassification of controls who possibly have DD. In the most recent GWAS of DD [18], the review of replication cohorts and input of physicians/technicians were manual; however, manual review has limited application to larger population-based datasets because it does not scale. Our NLP-enriched phenotyping approach showed a significant improvement in performance (algorithm PPVs ≥ 0.94, 3.5-fold increase in diverticulosis patient identification) compared with the use of only ICD codes (Table 2), and supports the importance of leveraging the full breadth of data captured in the EHR [36,37].
Our multi-ancestry GWAS of DD confirmed the strong genome-wide association of ARHGAP15 with both diverticulosis and diverticulitis (Table 3). ARHGAP15 is known to strongly and negatively regulate the GTPase binding property of the Rac protein family in leukocytes, which modulates important antimicrobial functions [38]. This mechanism of ARHGAP15 possibly impacts the inflammatory environment of the intestine, promoting the development of diverticula or progression of diverticula due to bacterial growth along the colonic wall. In the ancestry-stratified GWAS analyses, the often-replicated association between ARHGAP15 and DD was detected in the EA cohorts, whereas the AA cohorts showed similarly positive effect sizes but little to no statistical association (S3 Table). Notably, the AA cohort is less than one-tenth the size of the EA cohort, and risk allele frequencies differ between ancestries. Our additional power calculation showed that at least 15,000 participants are needed to perform GWAS on the ARHGAP15 loci (EAF 0.18, disease prevalence 0.10, OR 1.20) with 80% statistical power (S1 Fig). Further investigation is needed to confirm the universal susceptibility effect of ARHGAP15 on DD in patients of non-European ancestry.
Our PheWAS of the independent ARHGAP15 loci (rs6736741, rs10928187, rs386651361) confirmed their significant phenotypic association with DD in MA and EA, with paralytic ileus as the second most significant association (Table 4). The genitourinary phenotypes of functional bladder disorders found in MA and EA are noteworthy, because impaired muscular motility or neuromuscular dysfunction of internal organs may affect both the colonic wall in diverticulosis and the bladder muscle in urinary disorders.
In the PheWAS of the established diverticular variants, we identified several circulatory system related EHR phenotypes associated with DD variants, including phlebitis and thrombophlebitis, pulmonary heart disease, and deep vein thrombosis. Notably, recent studies have suggested a possible epidemiologic association between DD and acute coronary syndromes and thromboembolic events [39,40]. We also confirmed the associations of rs9272785 (HLA-DQA1 gene) with type 1 diabetes and its manifestations with FDR significance across ancestries. The HLA class 2 region, where rs9272785 is located, is associated not only with risk of type 1 diabetes but also with increased susceptibility to juvenile rheumatoid arthritis and other autoimmune diseases [41,42].
Compared to previous GWASs of DD, our summary statistics generally show larger effect sizes, possibly driven by the improved patient identification of the NLP-enriched phenotyping algorithm. For example, rs6734367, the strongest ARHGAP15 locus reported in Maguire et al. [17], showed a positive OR of 1.010 in the original study, whereas it presents an OR as high as 1.177 (diverticulosis) and 1.280 (diverticulitis) in our EA GWAS with the same allelic direction (S6 Table). For the rest of the GWAS-significant SNPs, the ORs in our results generally showed increased effect sizes despite a cohort one-twentieth the size of that of Maguire et al. (Fig 3). Among the 52 tested variants, 5 loci were significantly replicated in our EA GWAS of diverticulosis (p-value < 0.05/52). As the cohort size gets larger, and patients with diverse genetic backgrounds are included, our results suggest improved analytical power for future genomic research with the integration of different layers of EHR data.
There are several caveats in our study. We did not separately validate our phenotyping algorithms' performance for diverticulitis vs. diverticulosis, which should be included in future research. Our GWAS did not identify any novel association and only confirmed an existing locus with DD, albeit with larger effect sizes across the analyses. Also, our MA analysis was composed of 85% EA participants, so the signals are largely driven by EA-centric results. The cohort size of AA is considerably smaller than the EA or MA cohorts, which elevates the risk of false positive findings.
Our approach has highlighted the richness and potential of heterogeneous EHR data in patient classification with NLP, and the feasibility of an integrative analytical pipeline, from GWAS to post-GWAS analysis such as PheWAS, to facilitate etiological investigation of a disease in a clinical setting.
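The replication threshold quoted above is a Bonferroni correction over the 52 previously reported variants. A minimal Python sketch of the arithmetic follows; the function name is illustrative.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


threshold = bonferroni_threshold(0.05, 52)
print(f"{threshold:.2e}")  # a variant replicates if its p-value is below this
```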

Fig 1. Natural language processing (NLP)-enriched phenotyping algorithms for diverticular disease (DD) cases and controls. (a) The NLP-driven phenotyping algorithm used in five medical institutions in the eMERGE network (NU, VU, Geisinger, KPWA/UW, Mayo Clinic). (b) The structured data-driven phenotyping algorithm used in two eMERGE sites (Marshfield, Mount Sinai). https://doi.org/10.1371/journal.pone.0283553.g001

Table 1. Demographic characteristics of the patients by each eMERGE site. Patients with diverticulitis are a subset of the people with diverticulosis.

Table 2. Phenotyping algorithm validation and comparison of two phenotyping algorithms for diverticular diseases by site, out of 21,777 subjects with colonoscopy reports.
*This is for comparison purposes. Our main analysis did not utilize the samples identified by this ICD-based phenotyping algorithm. **PPV = positive predictive value of the phenotyping algorithms overall, and by site, where cases are patients with diverticulosis (either with or without diverticulitis) and controls are patients without diverticulosis (nor diverticulitis), identified by the phenotyping algorithms. https://doi.org/10.1371/journal.pone.0283553.t002