Abstract
Objective
Diverticular disease (DD) is one of the most prevalent conditions encountered by gastroenterologists, affecting ~50% of Americans before the age of 60. Our aim was to identify genetic risk variants and clinical phenotypes associated with DD, leveraging multiple electronic health record (EHR) data sources of 91,166 multi-ancestry participants with a Natural Language Processing (NLP) technique.
Materials and methods
We developed an NLP-enriched phenotyping algorithm that incorporated colonoscopy or abdominal imaging reports to identify patients with diverticulosis and diverticulitis from multicenter EHRs. We performed genome-wide association studies (GWAS) of DD in European, African and multi-ancestry participants, followed by phenome-wide association studies (PheWAS) of the risk variants to identify their potential comorbid/pleiotropic effects in clinical phenotypes.
Results
Our algorithm showed a significant improvement in patient classification performance for DD analysis (algorithm PPVs ≥ 0.94), identifying up to 3.5-fold more patients than the traditional method. Ancestry-stratified analyses of diverticulosis and diverticulitis in the identified subjects replicated the well-established associations between ARHGAP15 loci and DD, with overall stronger GWAS signals in diverticulitis patients than in diverticulosis patients. Our PheWAS analyses identified significant associations between the DD GWAS variants and circulatory system, genitourinary, and neoplastic EHR phenotypes.
Citation: Joo YY, Pacheco JA, Thompson WK, Rasmussen-Torvik LJ, Rasmussen LV, Lin FTJ, et al. (2023) Multi-ancestry genome- and phenome-wide association studies of diverticular disease in electronic health records with natural language processing enriched phenotyping algorithm. PLoS ONE 18(5): e0283553. https://doi.org/10.1371/journal.pone.0283553
Editor: Heming Wang, Brigham and Women’s Hospital and Harvard Medical School, UNITED STATES
Received: April 23, 2022; Accepted: March 9, 2023; Published: May 17, 2023
Copyright: © 2023 Joo et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The datasets used for the analyses described in this manuscript are available from dbGaP at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession number phs001584.v2.p2. Access is limited to General Research Use Data Use Limitation, as allowed by the eMERGE subjects' informed consents and institutional certifications from each of the eMERGE sites.
Funding: The eMERGE Network was initiated and funded by NHGRI through the following grants: U01HG006828 (Cincinnati Children’s Hospital Medical Center/Boston Children’s Hospital); U01HG006830 (Children’s Hospital of Philadelphia); U01HG006389 (Essentia Institute of Rural Health, Marshfield Clinic Research Foundation and Pennsylvania State University); U01HG006382 (Geisinger Clinic); U01HG006375 (Group Health Cooperative/University of Washington); U01HG006379 (Mayo Clinic); U01HG006380 (Icahn School of Medicine at Mount Sinai); U01HG006388 (Northwestern University); U01HG006378 (Vanderbilt University Medical Center); U01HG006385 (Vanderbilt University Medical Center serving as the Coordinating Center); U01HG004438 (CIDR) and U01HG004424 (the Broad Institute) serving as genotyping centers. Due to the structure of the U01 cooperative agreement, the recruitment sites, coordinating center, and funding agency collaborated on the study design, collection of data, and management of the research, but not in the analysis, interpretation of the data or the preparation of the manuscript.
Competing interests: The authors have no conflicts to declare.
Introduction
Diverticular disease (DD) is the most common morphological defect of the intestinal tract and the fifth most important gastrointestinal (GI) disorder in terms of medical cost, at more than $5.4 billion in the United States [1–3]. DD usually indicates asymptomatic diverticulosis (the mere presence of diverticula, pouch-like protrusions in the colonic wall), but also includes diverticulitis (acute or chronic inflammation of diverticula) and its clinical complications [4]. Diverticulitis occurs in approximately 4% to 15% of patients with diverticula and has a high recurrence rate; it is associated with fever, abdominal pain, leukocytosis, and potentially life-threatening peritonitis [4–7].
DD has long been regarded as a disease of Western countries [8]; North America has the highest prevalence of DD, affecting approximately one-third of the population older than 45, and up to 67% over 65 [6, 9]. However, in recent decades, virtually all countries worldwide have observed an increasing burden of DD irrespective of their economic development or demographic composition [10–13]. Low dietary fiber intake and consumption of processed foods and red meats have been implicated as potential causes of DD [8, 14], but this idea is still controversial [15, 16].
As with most medical conditions, current evidence supports a complex interplay of both environmental and genetic factors in the pathophysiology of DD. Twin studies reveal that the genetic heritability of DD is estimated to be up to 53% (95% CI, 45–61%) [5]. To date, three GWAS have identified 52 genetic susceptibility loci associated with DD [17–19].
A significant challenge to its etiologic investigation is that approximately 75% to 90% of diverticulosis patients remain asymptomatic until presenting with diverticulitis [20], making it difficult to self-identify or detect the disorder in a clinical setting. In acute cases, computed tomography (CT) imaging of the abdomen is most often used in the evaluation of diverticulitis, but it may not be completely diagnostic in cases of early or mild diverticulitis [21]. Currently, the definitive ascertainment of the presence or absence of DD depends on colonoscopy results [21–23], but this requirement suffers from incomplete patient compliance given current screening guidelines [24].
To address these challenges, we developed an automated phenotyping algorithm that incorporated a natural language processing (NLP) technique to efficiently identify the presence or absence of diverticulosis or diverticulitis, utilizing both structured and unstructured data from the electronic health record (EHR). By integrating heterogeneous EHR data sources, we aim to present a scalable framework to perform EHR-powered GWAS and phenome-wide association studies (PheWAS) to systematically investigate the genetic epidemiology of DD.
Methods
NLP-enriched phenotyping algorithm for DD
Genome-wide genotype data for 38,827 individuals from 9 EHR-linked biobanks, and phenotype data (demographics, clinical diagnoses, and colonoscopy reports) for 99,185 individuals from 12 EHR-linked biobanks, were collected from the electronic Medical Records and Genomics (eMERGE) network [25, 26]. The details of genotyping, imputation, and quality control processes are available in S1 File.
We developed two different phenotyping algorithms while accounting for data availability at each implementing site. For patients with physician reports in the EHR, the first NLP-driven algorithm scanned the unstructured text of colonoscopy or abdominal imaging reports to identify DD. The algorithm considered any subject with a positively asserted mention of “diverticul*” in those reports to have diverticulosis, and any subject with a positively asserted mention of “diverticulitis” to have diverticulosis with diverticulitis. We used the ConText algorithm [27], an updated NegEx tool, to detect negated mentions of either diverticulosis or diverticulitis, and supplemented the results with diagnostic and procedure codes (Fig 1A).
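The mention-detection step described above can be illustrated with a minimal sketch. It is not the ConText algorithm itself: the negation cue list, the sentence splitting, and the function name classify_report are simplifying assumptions made for illustration, and a real implementation would also handle experiencer and temporal status.

```python
import re

# Hypothetical, heavily simplified negation cues. The actual ConText
# algorithm uses a much richer trigger lexicon and scoping rules.
NEGATION_CUES = re.compile(
    r"\b(no|not|without|negative for|denies|absence of)\b", re.IGNORECASE
)
# Matches "diverticula", "diverticulosis", "diverticulitis", etc.
MENTION = re.compile(r"\bdiverticul\w*", re.IGNORECASE)


def classify_report(text: str) -> str:
    """Return 'diverticulitis', 'diverticulosis', or 'none' for one report.

    'diverticulitis' here stands for diverticulosis with diverticulitis.
    """
    has_diverticulosis = False
    has_diverticulitis = False
    # Handle each sentence separately so that a negation cue only
    # affects mentions within the same sentence.
    for sentence in re.split(r"[.;\n]+", text):
        for match in MENTION.finditer(sentence):
            # Assumption: a mention is negated if any cue precedes it
            # in the same sentence.
            if NEGATION_CUES.search(sentence[: match.start()]):
                continue
            has_diverticulosis = True
            if match.group(0).lower().startswith("diverticulitis"):
                has_diverticulitis = True
    if has_diverticulitis:
        return "diverticulitis"
    if has_diverticulosis:
        return "diverticulosis"
    return "none"
```

For example, a report reading "Scattered diverticula in the sigmoid colon. No evidence of diverticulitis." would be labeled as diverticulosis only, because the second mention is preceded by a negation cue.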
Fig 1. (a) The NLP-driven phenotyping algorithm used in five medical institutions in the eMERGE network (NU, VU, Geisinger, KPWA/UW, Mayo Clinic). (b) The structured data-driven phenotyping algorithm used in two eMERGE sites (Marshfield, Mount Sinai).
For the sites where only a limited subset of these imaging reports was available, the algorithm alternatively used International Classification of Diseases, 9th Revision (ICD-9) codes that started with 562 (‘Diverticulosis and diverticulitis’ category), assigned within 7 days after a colonoscopy or abdominal imaging, to select diverticulosis cases. The result was supplemented with NLP components when physician reports were available (Fig 1B). Additional criteria to define ‘diverticulosis’ and ‘diverticulitis’ are detailed in S1 File.
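As a rough illustration of the structured-data rule, the sketch below flags a patient when an ICD-9 code beginning with 562 is assigned within 7 days after a qualifying procedure. The function name is_icd_case, the input layout, and the choice to count a code assigned on the day of the procedure are assumptions; the additional criteria in S1 File are not modeled.

```python
from datetime import date, timedelta


def is_icd_case(procedure_dates, coded_diagnoses, window_days=7):
    """Return True if any ICD-9 562.x code falls within the window.

    procedure_dates: iterable of dates of colonoscopy or abdominal imaging.
    coded_diagnoses: iterable of (icd9_code, date_assigned) pairs.
    Assumption: a code assigned on the procedure day itself counts.
    """
    window = timedelta(days=window_days)
    procedures = list(procedure_dates)
    for code, assigned in coded_diagnoses:
        if not code.startswith("562"):
            continue
        for performed in procedures:
            if timedelta(0) <= assigned - performed <= window:
                return True
    return False
```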
We validated the algorithm performance by a standardized chart review of randomly selected patients’ charts. Trained clinicians and chart reviewers reviewed a total of 364 individuals’ records from four data collection sites, using established guidelines [28], to assess the positive predictive value (PPV) of our developed algorithms.
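For reference, PPV is the fraction of algorithm-assigned labels that chart review confirms. A minimal sketch follows; the counts in the usage note are hypothetical and are not taken from the study.

```python
def positive_predictive_value(confirmed: int, not_confirmed: int) -> float:
    """PPV = confirmed / (confirmed + not_confirmed).

    confirmed: algorithm-labeled records that chart review agreed with.
    not_confirmed: algorithm-labeled records that chart review rejected.
    """
    total = confirmed + not_confirmed
    if total == 0:
        raise ValueError("no reviewed records")
    return confirmed / total
```

With hypothetical counts of 48 confirmed and 2 rejected records, positive_predictive_value(48, 2) gives 0.96.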
Genome-wide association tests
Multi-ancestral (MA) GWAS was conducted on the identified subjects from the 9 sites that implemented our phenotyping algorithms (Table 1). We used logistic regression (PLINK v.1.9 [29]), adjusting for sex, age at colonoscopy, study site, and the first 10 principal components of ancestry. To test for associations with diverticulosis, we compared the patients with diverticulosis, either with or without diverticulitis, to the healthy control patients without any evidence of diverticulosis or diverticulitis. To test for associations with diverticulitis, we excluded any diverticulosis patients without diverticulitis records, and compared the patients with diverticulitis (presenting both diverticulosis and diverticulitis) to the healthy control patients. Similar GWASs were repeated in European ancestry (EA) and African ancestry (AA) participants separately, which are the two largest ancestral groups available.
Patients with diverticulitis are a subset of the people with diverticulosis.
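The case and control definitions for the two GWAS analyses can be summarized in a short sketch. The Participant class and the gwas_label function are hypothetical names introduced for illustration; only the inclusion and exclusion logic follows the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Participant:
    has_diverticulosis: bool
    has_diverticulitis: bool


def gwas_label(p: Participant, analysis: str) -> Optional[int]:
    """Return 1 for case, 0 for control, None if excluded."""
    if analysis == "diverticulosis":
        # Cases: diverticulosis with or without diverticulitis.
        return 1 if (p.has_diverticulosis or p.has_diverticulitis) else 0
    if analysis == "diverticulitis":
        if p.has_diverticulitis:
            return 1
        if p.has_diverticulosis:
            # Diverticulosis without diverticulitis is excluded.
            return None
        return 0
    raise ValueError(f"unknown analysis: {analysis}")
```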
We annotated the significant GWAS loci with eQTL, deleteriousness score (CADD score [30]), and potential regulatory functions (RegulomeDB score [31]) using the GTEx v7 database. A subsequent conditional analysis was performed within a window of ±1Mb of the genome-wide significant GWAS variants using genome-wide complex trait analysis (GCTA) v.1.26 [32].
Evaluation of our NLP-enriched phenotyping algorithm for DD
We compared our NLP-enriched phenotyping algorithm results against the results of an ICD-based phenotyping method that has been commonly implemented in previous GWASs of DD [17–19]. Using the phecode map v1.2 [33] for DD (ICD-9 562), we compared the numbers of DD patients identified by each algorithm within our multicenter EHR data. We excluded patients with any related gastrointestinal manifestations such as ‘ulcerative enterocolitis’ (ICD-9 556), ‘regional enteritis’ (ICD-9 558), ‘volvulus of intestine’ (ICD-9 560.2), etc. to avoid classification bias (S1 Table).
PheWAS
We conducted PheWAS of independent SNPs that met a suggestive GWAS threshold (GWAS p-value < 1E-06 and LD r2 < 0.1), grouped by ancestry [34]. We retrieved the diagnoses of the 91,166 MA participants, including ICD-9 and ICD-10 codes, whichever were available at the time of analysis. With a minimum of 30 cases per phenotype [34], logistic regression between the GWAS SNPs and each phecode was performed, adjusting for the first 10 PCs and participation site, using the PheWAS R package [34]. A false discovery rate (FDR) < 0.05 was used for reporting significance.
We also conducted PheWAS of the 52 reported GWAS susceptibility loci from the three existing GWASs of DD [17–19]. The genomic positions of the 52 loci were converted to GRCh37/hg19 (40 loci from Maguire et al. [17], 12 loci from Schafmayer et al. [18]), including three proxy variants (R2 > 0.5) available in our genotype data (S2 Table).
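The text does not state which FDR procedure was applied. As an illustration only, the sketch below assumes the Benjamini-Hochberg procedure, a common choice for PheWAS, and computes adjusted p-values that can be compared with the 0.05 cutoff. The function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

An association would then be reported when its adjusted value is below 0.05.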
Results
Performance of NLP-enriched phenotyping algorithm
Compared to a gold standard of manual clinical chart review, the overall PPV of our phenotyping algorithm for diverticulosis cases (with/without diverticulitis) was 0.96, and 0.94 for controls without diverticulosis or diverticulitis (Table 2). We identified 21,777 study participants using the developed algorithm without missing covariate data. Of these, we identified 12,577 diverticulosis cases with or without diverticulitis, of which 1,265 were diverticulitis cases, and 9,200 controls without diverticulosis or diverticulitis in the entire MA discovery cohort (Table 1).
Evaluation of NLP-enriched phenotyping vs. ICD-based phenotyping
Overall, ICD-based phenotyping identified more cases and controls than NLP-enriched phenotyping because report data were available for fewer participants: 3,313 diverticulitis cases and 45,111 healthy controls were found with ICD-based phenotyping. However, out of 21,777 subjects with imaging report data, ICD-based phenotyping identified only 3,591 as diverticulosis cases, whereas our NLP-enriched algorithm identified 12,577 diverticulosis cases. For diverticulitis, our NLP-enriched algorithm identified 1,265 patients and ICD-based phenotyping identified 1,201 patients (Table 2), and only 87.0% (n = 1,101) of case patients overlapped between the two phenotyping algorithms. Even though the PPV of the DD ICD-10 code was reported to be as high as 0.98 [35], we found that considerable phenotyping heterogeneity existed without the supporting procedure reports.
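The ratios quoted above can be rechecked from the reported counts. In this short sketch the variable names are ours, and the 87.0% figure appears to use the NLP-identified diverticulitis count as its denominator.

```python
# Counts as reported in the text for the 21,777 subjects with report data.
nlp_diverticulosis = 12577
icd_diverticulosis = 3591
nlp_diverticulitis = 1265
overlap_diverticulitis = 1101

fold_increase = nlp_diverticulosis / icd_diverticulosis
overlap_share = overlap_diverticulitis / nlp_diverticulitis

print(f"{fold_increase:.2f}-fold")  # 3.50-fold
print(f"{overlap_share:.1%}")       # 87.0%
```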
Genetic associations with DD
The GWAS of DD in the MA population identified one genome-wide significant locus (Fig 2) at 2q22.3 within the ARHGAP15 gene. The association patterns of the two conditions were largely similar; in general, the diverticulitis GWAS showed more significant associations and larger ORs than the diverticulosis GWAS (Table 3). In the MA GWAS for diverticulosis, rs2835676 (DSCR9 gene) showed strong eQTL association with both transverse and sigmoid colon tissues within the PIGP and TTC3 genes (FDR<3.90E-13).
Fig 2. Manhattan plots of genome-wide associations with diverticular disease (DD) in (a) Multiancestry (MA) participants (n = 21,777), (b) European Ancestry (EA) participants (n = 19,211), and (c) African Ancestry (AA) participants (n = 2,322). In each panel, the upper graph presents GWAS results of diverticulosis, and the bottom graph shows GWAS results of diverticulitis. The red horizontal line indicates genome-wide significance of p<5.0E-08 for each analysis.
The genetic signals found from EA-specific analysis and MA analysis were largely analogous, possibly because approximately 85.0% of the discovery population was EA (Fig 2, Tables 1 and 3). Even though ARHGAP15 loci showed non-significant p-values of 0.24–0.99 in the AA GWAS, the effects of ARHGAP15 loci were mostly positive and substantial, with ORs ranging from 1.111 to 1.464 in the AA diverticulitis GWAS, except for a few loci whose effects were in the opposite direction (OR < 1) (S3 Table).
We performed additional GWASs in the ICD-phenotyped diverticulitis cohort and replicated an ARHGAP15 locus on chromosome 2 (rs6717024) as genome-wide significant. One of the nearly significant associations included rs11843418 (FAM155A), which was previously identified [17, 19] but not significantly detected in our NLP-enriched GWAS, possibly due to statistical power or varying genetic composition of study cohorts (S4 Table).
PheWAS
(1) DD susceptibility variants identified in our MA, EA, AA GWAS (p<1E-06) tested in the medical phenome of MA, EA, AA participants.
We observed FDR-significant PheWAS associations (FDR < 0.05) between DD phecodes (562, 562.1, and 562.2) and several independent (LD r2<0.1) ARHGAP15 loci in MA and EA PheWAS (Table 4). Other than diverticular EHR phenotypes, rs9565028 (NBEA gene) showed FDR-significant associations with genitourinary manifestations including ‘functional disorders of bladder’ and ‘other disorders of bladder’ in the MA and EA phenome. No significant associations were identified in AA PheWAS.
(2) DD susceptibility variants identified in previous GWAS (p < 5E-08) tested in the medical phenome of MA, EA, AA participants.
In the MA PheWAS, 55 genotype-EHR phenotype associations were significant (S5 Table). Among them, 18 were endocrine/metabolic phenotypes, 17 were digestive phenotypes and 10 were circulatory system related phenotypes. DD phenotypes accounted for the largest number of significant associations: 7 for ‘diverticulosis and diverticulitis’, 7 for ‘diverticulosis’ and 1 for ‘diverticulitis’. Other than the ARHGAP15 loci, rs4333882 (SLC35F3 gene) and rs10472291 (WDR70 gene) showed significant clinical associations with DD. SNP rs9272785 (HLA-DQA1 gene, proxy variant for rs7990) generated the most significant association in the MA PheWAS, with ‘rheumatoid arthritis’. The SNP was also strongly associated with several diabetes manifestations, including ‘type 1 diabetes’, ‘type 1 diabetes with ophthalmic’, ‘type 1 diabetes with ketoacidosis’, ‘type 2 diabetes’, etc.
In the EA PheWAS, 49 genotype-EHR phenotype associations were identified with FDR significance (S5 Table). Among them, 17 EHR phenotypes were classified as digestive phenotypes, 15 were endocrine/metabolic-related phenotypes and 6 were related to the circulatory system. The variant rs9272785 (HLA-DQA1 gene) also marked the most significant association in the EA PheWAS, with ‘rheumatoid arthritis’. The variant also revealed additional associations in the EA phenome, including ‘developmental delays and disorders’, ‘multiple sclerosis’, ‘ulcerative colitis’ and ‘chronic lymphocytic thyroiditis’.
In AA PheWAS, two genotype-EHR phenotype associations met FDR significance: rs9272785 (HLA-DQA1 gene) showed the most significant SNP-phenotype association as it did in MA and EA PheWAS. The variant also showed strong associations with ‘type 1 diabetes with ketoacidosis’ and ‘type 1 diabetes’ in the AA phenome.
Discussion
To date, patient identification in the EHR has been partially limited in that mostly inpatient medical coding was used, which might result in under-diagnosis of case patients and/or misclassification of controls who possibly have DD. In the most recent GWAS of DD [18], the review of replication cohorts and input of physicians/technicians were manual; however, manual review has limited application to larger population-based datasets because it does not scale. Our NLP-enriched phenotyping approach showed a significant improvement in performance (algorithm PPVs≥0.94, 3.5-fold increase in diverticulosis patient identification) compared with the use of only ICD codes (Table 2), and supports the importance of leveraging the full breadth of data captured in the EHR [36, 37].
Our multi-ancestry GWAS of DD confirmed the strong genome-wide association of ARHGAP15 with both diverticulosis and diverticulitis (Table 3). ARHGAP15 is known to strongly and negatively regulate the GTPase binding property of the Rac protein family in leukocytes, which modulates important antimicrobial functions [38]. This mechanism of ARHGAP15 possibly impacts the inflammatory environment of the intestine, promoting the development of diverticula or progression of diverticula due to bacterial growth along the colonic wall. In the ancestry-stratified GWAS analyses, the often-replicated association between ARHGAP15 and DD was detected in the EA cohort; in the AA cohort, effect sizes were similarly positive but little to no association was observed (S3 Table). Notably, the sample size of the AA cohort is less than 1/10th that of the EA cohort, and risk allele frequencies differ between ancestries. Our additional power calculation showed that at least 15,000 participants are needed to perform GWAS on the ARHGAP15 loci (EAF 0.18, disease prevalence 0.10, OR 1.20) with 80% statistical power (S1 Fig). Further investigation is needed to confirm the universal susceptibility effect of ARHGAP15 to DD in patients of non-European ancestry.
Our PheWAS of the independent ARHGAP15 loci (rs6736741, rs10928187, rs386651361) confirmed their significant associations with DD phenotypes in MA and EA participants, with paralytic ileus as the second most significant association (Table 4). The genitourinary phenotypes of functional bladder disorders found in MA and EA participants are noteworthy because altered muscular motility or neuromuscular dysfunction of internal organs could affect both the colonic wall, contributing to diverticulosis, and the bladder muscle, contributing to urinary disorders.
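To show how sample size drives power at a genome-wide threshold, the following is a minimal sketch using a two-proportion z-test on allele counts. It is not the method behind S1 Fig, which is not described here: the test choice, the use of the EAF as the control allele frequency, and the function name are all assumptions, and disease prevalence is not modeled, so the results need not match the reported figure.

```python
from math import sqrt
from statistics import NormalDist


def allelic_test_power(n_cases, n_controls, eaf=0.18, odds_ratio=1.20,
                       alpha=5e-8):
    """Approximate power of a two-sided allelic z-test.

    Assumptions: eaf is the effect allele frequency in controls,
    alleles are independent (Hardy-Weinberg equilibrium), and the
    normal approximation holds.
    """
    norm = NormalDist()
    p_ctrl = eaf
    # Allele frequency in cases implied by the allelic odds ratio.
    odds_case = odds_ratio * p_ctrl / (1 - p_ctrl)
    p_case = odds_case / (1 + odds_case)
    # Each participant contributes two alleles.
    se = sqrt(p_case * (1 - p_case) / (2 * n_cases)
              + p_ctrl * (1 - p_ctrl) / (2 * n_controls))
    z = (p_case - p_ctrl) / se
    z_crit = norm.inv_cdf(1 - alpha / 2)
    return norm.cdf(z - z_crit) + norm.cdf(-z - z_crit)
```

Under these assumptions, power depends strongly on the case-to-control ratio as well as on the total sample size, so a fixed total of participants can yield very different power depending on how many of them are cases.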
In the PheWAS of the established diverticular variants, we identified several circulatory system related EHR phenotypes associated with DD variants, including phlebitis and thrombophlebitis, pulmonary heart disease, and deep vein thrombosis. Notably, recent studies have suggested a possible epidemiologic association between DD and acute coronary syndromes and thromboembolic events [39, 40]. We also confirmed the associations of rs9272785 (HLA-DQA1 gene) with type 1 diabetes and its manifestations with FDR significance across ancestries. The HLA class 2 region, where rs9272785 is located, is not only associated with risk of type 1 diabetes but also increased susceptibility to juvenile rheumatoid arthritis and other autoimmune diseases [41, 42].
Compared to previous GWASs of DD, our summary statistics generally show larger effect sizes, possibly owing to the improved patient identification by the NLP-enriched phenotyping algorithm. For example, rs6734367, the strongest ARHGAP15 locus reported in Maguire et al. [17], showed an OR of 1.010 in the original study, whereas its OR was as high as 1.177 (diverticulosis) and 1.280 (diverticulitis) in our EA GWAS with the same allelic direction (S6 Table). For the rest of the GWAS-significant SNPs, the ORs in our results were generally larger despite a cohort 1/20th the size of that of Maguire et al. (Fig 3). Among the 52 tested variants, 5 loci were significantly replicated in our EA GWAS of diverticulosis (p-value < 0.05/52). As cohort sizes grow and patients with diverse genetic backgrounds are included, our results suggest that integrating different layers of EHR data can improve analytical power for future genomic research.
Fig 3. The dashed y = x line indicates equal ORs in both studies.
There are several caveats in our study. We did not separately validate our phenotyping algorithms’ performance for diverticulitis vs. diverticulosis, which should be included in future research. Our GWAS did not identify any novel association and only confirmed an existing locus with DD, albeit with larger effect sizes across the analyses. Also, our MA analysis was composed of 85% EA participants, so the signals are largely driven by EA-centric results. The cohort size of AA is considerably smaller than the EA or MA cohorts, which elevates the risk for false positive findings.
Our approach has highlighted the richness and potential of heterogeneous EHR data in patient classification with NLP, and the feasibility of an integrative analytical pipeline, from GWAS to post-GWAS analysis such as PheWAS, to facilitate etiological investigation of a disease in a clinical setting.
Supporting information
S1 Fig. Results of power calculation for our DD GWAS analyses.
https://doi.org/10.1371/journal.pone.0283553.s001
(TIF)
S1 Table. PheWAS association results (p-value < 1E-04) of 52 susceptibility SNPs for diverticular diseases in MA, EA, AA participants.
https://doi.org/10.1371/journal.pone.0283553.s003
(XLSX)
S2 Table. List of exclusion ICD codes for phecode mapping: Not classified as control or case.
https://doi.org/10.1371/journal.pone.0283553.s004
(XLSX)
S3 Table. GWAS results of ARHGAP15 loci among participants of African ancestry.
https://doi.org/10.1371/journal.pone.0283553.s005
(XLSX)
S4 Table. GWAS results of ICD-based identified patients with diverticulitis in eMERGE cohort.
https://doi.org/10.1371/journal.pone.0283553.s006
(XLSX)
S5 Table. Information of 52 reported susceptibility variants from three Eurocentric GWAS of diverticular diseases: Sid et al. (2017), Maguire et al. (2018), and Sch et al. (2019).
https://doi.org/10.1371/journal.pone.0283553.s007
(XLSX)
S6 Table. OR comparison between our European-specific GWASs and the previous GWAS results from Maguire et al.
https://doi.org/10.1371/journal.pone.0283553.s008
(XLSX)
S1 File. Details of genotyping, imputation, and quality control processes.
https://doi.org/10.1371/journal.pone.0283553.s009
(DOCX)
References
- 1. Sandler RS, Everhart JE, Donowitz M, Adams E, Cronin K, Goodman C, et al. The burden of selected digestive diseases in the United States. Gastroenterology. 2002;122(5):1500–11. Epub 2002/05/02. pmid:11984534.
- 2. Peery AF, Crockett SD, Barritt AS, Dellon ES, Eluri S, Gangarosa LM, et al. Burden of Gastrointestinal, Liver, and Pancreatic Diseases in the United States. Gastroenterology. 2015;149(7):1731–41 e3. Epub 2015/09/04. pmid:26327134; PubMed Central PMCID: PMC4663148.
- 3. Peery AF, Crockett SD, Murphy CC, Lund JL, Dellon ES, Williams JL, et al. Burden and Cost of Gastrointestinal, Liver, and Pancreatic Diseases in the United States: Update 2018. Gastroenterology. 2019;156(1):254–72 e11. Epub 2018/10/14. pmid:30315778; PubMed Central PMCID: PMC6689327.
- 4. Strate LL, Morris AM. Epidemiology, Pathophysiology, and Treatment of Diverticulitis. Gastroenterology. 2019;156(5):1282–98 e1. Epub 2019/01/21. pmid:30660732; PubMed Central PMCID: PMC6716971.
- 5. Reichert MC, Lammert F. The genetic epidemiology of diverticulosis and diverticular disease: Emerging evidence. United European Gastroenterol J. 2015;3(5):409–18. Epub 2015/11/05. pmid:26535118; PubMed Central PMCID: PMC4625748.
- 6. Colcock BP. Diverticular disease of the colon. Major Probl Clin Surg. 1971;11:1–135. Epub 1971/01/01. pmid:4949985.
- 7. Shahedi K, Fuller G, Bolus R, Cohen E, Vu M, Shah R, et al. Long-term risk of acute diverticulitis among patients with incidental diverticulosis found during colonoscopy. Clin Gastroenterol Hepatol. 2013;11(12):1609–13. Epub 2013/07/17. pmid:23856358; PubMed Central PMCID: PMC5731451.
- 8. Painter NS, Burkitt DP. Diverticular disease of the colon: a deficiency disease of Western civilization. Br Med J. 1971;2(5759):450–4. Epub 1971/05/22. pmid:4930390; PubMed Central PMCID: PMC1796198.
- 9. Painter NS, Burkitt DP. Diverticular disease of the colon, a 20th century problem. Clin Gastroenterol. 1975;4(1):3–21. Epub 1975/01/01. pmid:1109818.
- 10. Makela J, Kiviniemi H, Laitinen S. Prevalence of perforated sigmoid diverticulitis is increasing. Dis Colon Rectum. 2002;45(7):955–61. Epub 2002/07/20. pmid:12130886.
- 11. Nagata N, Niikura R, Aoki T, Shimbo T, Itoh T, Goda Y, et al. Increase in colonic diverticulosis and diverticular hemorrhage in an aging society: lessons from a 9-year colonoscopic study of 28,192 patients in Japan. Int J Colorectal Dis. 2014;29(3):379–85. Epub 2013/12/10. pmid:24317937.
- 12. Warner E, Crighton EJ, Moineddin R, Mamdani M, Upshur R. Fourteen-year study of hospital admissions for diverticular disease in Ontario. Can J Gastroenterol. 2007;21(2):97–9. Epub 2007/02/15. pmid:17299613; PubMed Central PMCID: PMC2657668.
- 13. Ogunbiyi OA. Diverticular disease of the colon in Ibadan, Nigeria. Afr J Med Med Sci. 1989;18(4):241–4. Epub 1989/12/01. pmid:2558553.
- 14. Aldoori WH. The protective role of dietary fiber in diverticular disease. Adv Exp Med Biol. 1997;427:291–308. Epub 1997/01/01. pmid:9361853.
- 15. Peery AF, Barrett PR, Park D, Rogers AJ, Galanko JA, Martin CF, et al. A high-fiber diet does not protect against asymptomatic diverticulosis. Gastroenterology. 2012;142(2):266–72 e1. Epub 2011/11/09. pmid:22062360; PubMed Central PMCID: PMC3724216.
- 16. Peery AF, Sandler RS, Ahnen DJ, Galanko JA, Holm AN, Shaukat A, et al. Constipation and a low-fiber diet are not associated with diverticulosis. Clin Gastroenterol Hepatol. 2013;11(12):1622–7. Epub 2013/07/31. pmid:23891924; PubMed Central PMCID: PMC3840096.
- 17. Maguire LH, Handelman SK, Du X, Chen Y, Pers TH, Speliotes EK. Genome-wide association analyses identify 39 new susceptibility loci for diverticular disease. Nat Genet. 2018;50(10):1359–65. Epub 2018/09/05. pmid:30177863; PubMed Central PMCID: PMC6168378.
- 18. Schafmayer C, Harrison JW, Buch S, Lange C, Reichert MC, Hofer P, et al. Genome-wide association analysis of diverticular disease points towards neuromuscular, connective tissue and epithelial pathomechanisms. Gut. 2019;68(5):854–65. Epub 2019/01/21. pmid:30661054.
- 19. Sigurdsson S, Alexandersson KF, Sulem P, Feenstra B, Gudmundsdottir S, Halldorsson GH, et al. Sequence variants in ARHGAP15, COLQ and FAM155A associate with diverticular disease and diverticulitis. Nat Commun. 2017;8:15789. pmid:28585551; PubMed Central PMCID: PMC5467205.
- 20. Matrana MR, Margolin DA. Epidemiology and pathophysiology of diverticular disease. Clin Colon Rectal Surg. 2009;22(3):141–6. pmid:20676256; PubMed Central PMCID: PMC2780269.
- 21. Destigter KK, Keating DP. Imaging update: acute colonic diverticulitis. Clin Colon Rectal Surg. 2009;22(3):147–55. Epub 2010/08/03. pmid:20676257; PubMed Central PMCID: PMC2780264.
- 22. Feingold D, Steele SR, Lee S, Kaiser A, Boushey R, Buie WD, et al. Practice parameters for the treatment of sigmoid diverticulitis. Dis Colon Rectum. 2014;57(3):284–94. Epub 2014/02/11. pmid:24509449.
- 23. Diverticulosis [Internet]. 2019. Available from: https://www.ncbi.nlm.nih.gov/books/NBK430771/.
- 24. Joseph DA KJ, Richards TB, Thomas CC, Richardson LC. Use of colorectal cancer screening tests by state. Preventing Chronic Disease. 2018;15:170535. pmid:29908051
- 25. Stanaway IB, Hall TO, Rosenthal EA, Palmer M, Naranbhai V, Knevel R, et al. The eMERGE genotype set of 83,717 subjects imputed to ~40 million variants genome wide and association with the herpes zoster medical record phenotype. Genet Epidemiol. 2019;43(1):63–81. Epub 2018/10/10. pmid:30298529; PubMed Central PMCID: PMC6375696.
- 26. Lessons learned from the eMERGE Network: balancing genomics in discovery and practice. Human Genetics and Genomics Advances. 2021;2(1):100018. pmid:35047833
- 27. Harkema H, Dowling JN, Thornblade T, Chapman WW. ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports. J Biomed Inform. 2009;42(5):839–51. pmid:19435614; PubMed Central PMCID: PMC2757457.
- 28. Newton KM, Peissig PL, Kho AN, Bielinski SJ, Berg RL, Choudhary V, et al. Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. J Am Med Inform Assoc. 2013;20(e1):e147–54. pmid:23531748; PubMed Central PMCID: PMC3715338.
- 29. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7. Epub 2015/02/28. pmid:25722852; PubMed Central PMCID: PMC4342193.
- 30. Kircher M, Witten DM, Jain P, O’Roak BJ, Cooper GM, Shendure J. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet. 2014;46(3):310–5. Epub 2014/02/04. pmid:24487276; PubMed Central PMCID: PMC3992975.
- 31. Boyle AP, Hong EL, Hariharan M, Cheng Y, Schaub MA, Kasowski M, et al. Annotation of functional variation in personal genomes using RegulomeDB. Genome Res. 2012;22(9):1790–7. Epub 2012/09/08. pmid:22955989; PubMed Central PMCID: PMC3431494.
- 32. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. American journal of human genetics. 2011;88(1):76–82. Epub 2010/12/21. pmid:21167468; PubMed Central PMCID: PMC3014363.
- 33. Denny JC, Bastarache L, Ritchie MD, Carroll RJ, Zink R, Mosley JD, et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat Biotechnol. 2013;31(12):1102–10. Epub 2013/11/26. pmid:24270849; PubMed Central PMCID: PMC3969265.
- 34. Carroll RJ, Bastarache L, Denny JC. R PheWAS: data analysis and plotting tools for phenome-wide association studies in the R environment. Bioinformatics. 2014;30(16):2375–6. pmid:24733291; PubMed Central PMCID: PMC4133579.
- 35. Erichsen R, Strate L, Sorensen HT, Baron JA. Positive predictive values of the International Classification of Disease, 10th edition diagnoses codes for diverticular disease in the Danish National Registry of Patients. Clin Exp Gastroenterol. 2010;3:139–42. Epub 2010/01/01. pmid:21694857; PubMed Central PMCID: PMC3108666.
- 36. Wei WQ, Denny JC. Extracting research-quality phenotypes from electronic health records to support precision medicine. Genome Med. 2015;7(1):41. Epub 2015/05/06. pmid:25937834; PubMed Central PMCID: PMC4416392.
- 37. Peissig PL, Rasmussen LV, Berg RL, Linneman JG, McCarty CA, Waudby C, et al. Importance of multi-modal approaches to effectively identify cataract cases from electronic health records. J Am Med Inform Assoc. 2012;19(2):225–34. Epub 2012/02/10. pmid:22319176; PubMed Central PMCID: PMC3277618.
- 38. Costa C, Germena G, Martin-Conte EL, Molineris I, Bosco E, Marengo S, et al. The RacGAP ArhGAP15 is a master negative regulator of neutrophil functions. Blood. 2011;118(4):1099–108. Epub 2011/05/10. pmid:21551229.
- 39. Pers TH, Karjalainen JM, Chan Y, Westra HJ, Wood AR, Yang J, et al. Biological interpretation of genome-wide association studies using predicted gene functions. Nat Commun. 2015;6:5890. pmid:25597830; PubMed Central PMCID: PMC4420238.
- 40. Strate LL, Erichsen R, Horvath-Puho E, Pedersen L, Baron JA, Sorensen HT. Diverticular disease is associated with increased risk of subsequent arterial and venous thromboembolic events. Clin Gastroenterol Hepatol. 2014;12(10):1695–701 e1. Epub 2013/12/10. pmid:24316104.
- 41. Begovich AB, Bugawan TL, Nepom BS, Klitz W, Nepom GT, Erlich HA. A specific HLA-DP beta allele is associated with pauciarticular juvenile rheumatoid arthritis but not adult rheumatoid arthritis. Proc Natl Acad Sci U S A. 1989;86(23):9489–93. Epub 1989/12/01. pmid:2512583; PubMed Central PMCID: PMC298522.
- 42. Noble JA, Valdes AM. Genetics of the HLA region in the prediction of type 1 diabetes. Curr Diab Rep. 2011;11(6):533–42. Epub 2011/09/14. pmid:21912932; PubMed Central PMCID: PMC3233362.