Phenome-Wide Association Studies (PheWAS) investigate whether genetic polymorphisms associated with a phenotype are also associated with other diagnoses. In this study, we have developed new methods to perform a PheWAS based on ICD-10 codes and biological test results, and to use a quantitative trait as the selection criterion. We tested our approach on thiopurine S-methyltransferase (TPMT) activity in patients treated by thiopurine drugs. We developed 2 aggregation methods for the ICD-10 codes: an ICD-10 hierarchy and a mapping to existing ICD-9-CM based PheWAS codes. Eleven biological test results were also analyzed using discretization algorithms. We applied these methods in patients having a TPMT activity assessment from the clinical data warehouse of a French academic hospital between January 2000 and July 2013. Data after initiation of thiopurine treatment were analyzed and patient groups were compared according to their TPMT activity level. A total of 442 patient records were analyzed representing 10,252 ICD-10 codes and 72,711 biological test results. The results from the ICD-9-CM based PheWAS codes and ICD-10 hierarchy codes were concordant. Cross-validation with the biological test results allowed us to validate the ICD phenotypes. Iron-deficiency anemia and diabetes mellitus were associated with a very high TPMT activity (p = 0.0004 and p = 0.0015, respectively). We describe here an original method to perform PheWAS on a quantitative trait using both ICD-10 diagnosis codes and biological test results to identify associated phenotypes. In the field of pharmacogenomics, PheWAS allow for the identification of new subgroups of patients who require personalized clinical and therapeutic management.
The use of underlying molecular mechanisms and other factors to describe and classify diseases is a major challenge for future treatment strategies. New methods are needed to achieve this goal. The phenome-wide association study (PheWAS) methodology was initially developed to unveil unknown associations between a specific genetic status and phenotypic features (e.g., diagnoses from electronic health records). We propose to extend this method to the assessment of relationships between the levels of a quantitative trait and diagnosis codes. We also assess the relationships between this quantitative trait and biological test results. We tested this method using the enzymatic activity of thiopurine S-methyltransferase (TPMT), which is involved in the metabolism of thiopurine drugs used, for example, in inflammatory bowel diseases. We discovered an association between very high TPMT activity and both nutritional anemia and diabetes. These results could be used to describe a new subgroup of patients in order to optimize drug treatments.
Citation: Neuraz A, Chouchana L, Malamut G, Le Beller C, Roche D, Beaune P, et al. (2013) Phenome-Wide Association Studies on a Quantitative Trait: Application to TPMT Enzyme Activity and Thiopurine Therapy in Pharmacogenomics. PLoS Comput Biol 9(12): e1003405. https://doi.org/10.1371/journal.pcbi.1003405
Editor: Donna K. Slonim, Tufts University, United States of America
Received: May 31, 2013; Accepted: November 8, 2013; Published: December 26, 2013
Copyright: © 2013 Neuraz et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This study was partly funded by the BioIntelligence collaborative program for the Institut National de la Recherche Médicale (INSERM). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
The US National Research Council report “Toward Precision Medicine” proposed the redefinition of diseases using the underlying molecular causes and other factors in addition to traditional signs and symptoms. To establish the relationships between molecular characterization and clinical features, different methods have been proposed. Genome-Wide Association Studies (GWAS) have allowed the identification of Single Nucleotide Polymorphisms (SNPs) associated with a given phenotype. (Figure S1 Panel A) Between 2005 and June 2012, 1,350 GWAS were published. In 2010, Denny et al. described another method called Phenome-Wide Association Study (PheWAS). PheWAS investigates whether the SNPs associated with a phenotype are also associated with other diagnoses (Figure S1 Panel B). Therefore, for a selected SNP, two groups are formed: one with a specific allele and a control group with other alleles. Thereafter, to search for new associations, all of the phenotypic data (for example, all International Classification of Diseases (ICD) codes) available in the medical records of the patients having the specific allele are screened and compared to those of the control group. Denny et al. genotyped 6,000 patients in the BioVU data bank at five SNPs with previously reported disease associations and ran a PheWAS on each SNP, based on the ICD-9-CM codes. They replicated four out of seven known molecular-clinical associations and discovered 19 new potential associations.
Following this example, further PheWAS were performed on the SNPs associated with hypothyroidism (FOXE1), rheumatoid arthritis, and on HLA-DRB1*1501, which has been linked to several autoimmune diseases. Most PheWAS were performed with data collected through the Electronic Medical Records and Genomics (eMERGE) network, including the Marshfield Clinic's Personalized Medicine cohort. With the aim of analyzing the genetic architecture of complex traits and identifying new pleiotropic relationships, Pendergrass et al. conducted a PheWAS on 70,061 study participants representing four major racial/ethnic groups in the Population Architecture using Genomics and Epidemiology (PAGE) network.
Analyses combining GWAS and PheWAS have been reported: whereas GWAS allows researchers to identify a genomic region of interest or one SNP associated with a clinical condition, PheWAS identifies all the diagnoses potentially associated with these markers. For example, Denny et al. performed a GWAS for primary hypothyroidism and, afterwards, a PheWAS on 13,617 patient records, based on the locus that was previously identified. This analysis highlighted genetic associations with thyroiditis and thyrotoxicosis, but with neither Graves' disease nor thyroid cancer. More recently, Ritchie et al. performed a genome- and phenome-wide analysis on cardiac conduction, which resulted in the identification of new markers for atrial fibrillation and arrhythmia.
To perform a PheWAS, a large amount of data must be included to infer potential patterns and discover new possible associations. The criterion for data selection includes the presence of a particular genotype. A cohort containing all types of diagnoses is necessary to discover new potential associations. Clinical Data Warehouses (CDWs) have been developed to allow the integration of Electronic Health Record (EHR) data and their use for research; they can also be used as a data source for such studies. When linked to DNA repositories, CDWs are a source of patient data to analyze the relationship between genetic variations and human traits.
Instead of directly using genomic data as the inclusion criterion, it is possible to use a quantitative trait (e.g., biological test results). This approach presents three advantages: (i) quantitative traits are usually recorded as part of the clinical data; (ii) a quantitative trait, which reflects both genetic variations and non-genetic factors, can describe a clinical feature more accurately than genetic mutations alone; (iii) quantitative traits can be highly correlated with a genomic status. This is the case for thiopurine S-methyltransferase (TPMT), a key enzyme involved in thiopurine metabolism, as TPMT activity is highly correlated with the genotypes of individuals.
Thiopurine drugs (azathioprine, 6-thioguanine and 6-mercaptopurine) are frequently prescribed in autoimmune disorders, such as inflammatory bowel disease (IBD), or in blood cancers, such as acute lymphoblastic leukemia. Severe adverse effects occur in 15% to 28% of the treated patients, and up to 40% of IBD patients are resistant to thiopurines. The production of active metabolites, such as the 6-thioguanine nucleotides (6-TGN), is largely regulated by TPMT. Genetic polymorphisms of TPMT result in a trimodal distribution of TPMT activity (TPMTa). Whereas a large majority, approximately 89%, of the population show normal activity (nTPMTa), approximately 11% have a partially deficient activity level, and 0.3% have a completely deficient activity level. Moreover, among patients with nTPMTa, approximately 15% show a very high TPMTa (vhTPMTa).
In treated patients, TPMTa is negatively correlated with 6-TGN intra-erythrocyte concentrations: partially or completely deficient TPMTa is associated with high 6-TGN concentrations, resulting in severe hematological toxicities or even lethal bone marrow suppression. Conversely, patients with vhTPMTa are more prone to low 6-TGN intra-erythrocyte concentrations and pharmacological resistance to thiopurines. Therefore, to detect patients at high risk of severe hematological toxicities, the US Food and Drug Administration (FDA) and the Clinical Pharmacogenetics Implementation Consortium (CPIC) strongly recommend that TPMT status be determined either by genotyping or phenotyping prior to initiation of thiopurine therapy. Based on these observations, TPMTa levels can be used as a starting point for a PheWAS.
We aimed to develop methods to perform a PheWAS based on the ICD-10 codes and biological test results, while using a quantitative trait as a selection criterion. We then tested our approach on a specific quantitative trait, TPMTa, in order to identify new subgroups of patients with different characteristics.
Materials and Methods
Study and clinical data warehouse
We performed an in silico retrospective case-control study using data from an academic hospital, Hôpital Européen Georges Pompidou (HEGP) in Paris, France. We extracted data from the HEGP CDW, an i2b2 CDW containing data on more than 606,524 unique patients, collected between 2000 and 2012. This CDW contains routine care data divided into nine categories (208,955,369 items): demographics (age, sex, and hospital vital status), vital signs (e.g., temperature, blood pressure, weight), diagnoses (ICD-10), procedures (French CCAM classification), clinical data (structured questionnaires from EHR), free text reports, pathology codes (French ADICAP classification), biological test results, and Computerized Provider Order Entry (CPOE) drug prescriptions.
Definition of ICD-10 PheWAS codes: Two aggregation levels
ICD codes could not be directly used for analysis because of their fine granularity. Therefore, we developed two different aggregation methods.
ICD-9-CM mapping PheWAS codes.
The first aggregation scale relies on a mapping between ICD-9-CM and ICD-10. We extracted the ICD-10 classification from the Unified Medical Language System (UMLS). Then, we used a mapping file developed by the New Zealand Ministry of Health to map the ICD-10 codes to the ICD-9-CM codes (Figure 1). After format adaptations, 99.5% of the codes were mapped successfully. The 57 remaining codes were mapped manually. This allowed us to use the ICD-9-CM PheWAS codes from Denny et al. These ICD-9-CM PheWAS codes comprised 829 different codes, of which 771 were used for analysis. Codes that were not a proper diagnosis were excluded (e.g., “Effects of air pressure caused by explosion”).
PheWAS: Phenome-wide association study; ICD: International classification of diseases; ICD-9-CM: International classification of diseases, 9th revision, clinically modified; ICD-9-CM-A: Australian version of the ICD-9-CM, with custom codes added; ICD-10-AM: Australian version of the ICD-10, with custom codes added. 1: Mapping file from the New Zealand Ministry of Health was used to project ICD-10 codes on ICD-9-CM. 2: Mapping of the previous projection with existing ICD-9-CM PheWAS codes. 3: File with correspondence between ICD-10 codes and ICD-9-CM PheWAS codes.
ICD-10 PheWAS codes.
The other grouping method was based on the ICD-10 hierarchy. Given the size of our sample population, a coarser level of granularity was more relevant. We used the superclasses of the three-digit codes, leading to 257 groups. ICD-10 PheWAS codes and ICD-10 to ICD-9-CM PheWAS code mapping files are available for download here: http://umrs872eq22.com/TPMT_PLOS/Phewas_codes_ICD10_ICD9.zip
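The hierarchy-based grouping can be illustrated with a minimal Python sketch. It is not the authors' implementation (the study was carried out in R); the function names and the four-block excerpt of the ICD-10 hierarchy are assumptions chosen for illustration, whereas the full grouping described above contains 257 groups.

```python
# Illustrative excerpt of ICD-10 blocks (superclasses of three-character codes).
# The real hierarchy is much larger; this short list is an assumption for the example.
ICD10_BLOCKS = [
    ("C15", "C26", "Malignant neoplasms of digestive organs"),
    ("D50", "D53", "Nutritional anaemias"),
    ("E10", "E14", "Diabetes mellitus"),
    ("K50", "K52", "Noninfective enteritis and colitis"),
]


def three_char_category(icd10_code):
    """Reduce an ICD-10 code to its three-character category, e.g. 'D50.9' -> 'D50'."""
    return icd10_code.strip().upper().replace(".", "")[:3]


def aggregate_icd10(icd10_code, blocks=ICD10_BLOCKS):
    """Return the block label 'START-END' containing the code, or None if no block matches."""
    category = three_char_category(icd10_code)
    for start, end, _label in blocks:
        # Categories are one letter followed by two digits, so plain string
        # comparison respects the ICD-10 ordering.
        if start <= category <= end:
            return f"{start}-{end}"
    return None


grouped = [aggregate_icd10(code) for code in ["D50.9", "E11", "K50.0", "Z99.9"]]
```

Codes that fall outside the listed blocks return `None` here; a complete implementation would load the full block table instead.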
ICD codes analyses
Groups of patients were divided according to the quantitative trait studied. (Figure 2) Then, as described by Denny et al., for each PheWAS code, a case-control comparison was performed: (i) the case group was generated with patients having an ICD code in the range of this PheWAS code; (ii) the control group was composed of patients without any ICD code in this range; and (iii) patients with ICD codes that were too close to those of the current PheWAS code were excluded from this specific comparison. For each PheWAS code, its siblings were used as exclusion ranges. For example: for grouped codes under C15–26 (Malignant neoplasms, digestive organs), the exclusion range was from C00 to D48 (Neoplasms). We successively used the two methods of ICD code aggregation to compare the distribution of cases between the groups.
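The case, control and exclusion rules can be sketched as follows. This is a hedged illustration under the assumption that ranges are expressed as pairs of three-character ICD-10 categories; the helper names and the toy patient records are hypothetical.

```python
def in_range(icd10_code, start, end):
    """True if the three-character category of the code lies within [start, end]."""
    category = icd10_code.strip().upper().replace(".", "")[:3]
    return start <= category <= end


def assign_status(patient_codes, case_range, exclusion_range):
    """Classify one patient for one PheWAS code as 'case', 'excluded' or 'control'."""
    if any(in_range(code, *case_range) for code in patient_codes):
        return "case"
    # Patients with a related code (inside the exclusion range but outside the
    # case range) are removed from this specific comparison.
    if any(in_range(code, *exclusion_range) for code in patient_codes):
        return "excluded"
    return "control"


# Example from the text: C15-C26 (malignant neoplasms, digestive organs),
# with C00-D48 (neoplasms) as the exclusion range.
CASE_RANGE = ("C15", "C26")
EXCLUSION_RANGE = ("C00", "D48")

# Hypothetical patient records, for illustration only.
patients = {
    "p1": ["C18.9", "K50.0"],  # has a code in the case range
    "p2": ["D37.1"],           # other neoplasm: excluded from this comparison
    "p3": ["K50.0", "E11.9"],  # no neoplasm code: control
}
status = {pid: assign_status(codes, CASE_RANGE, EXCLUSION_RANGE)
          for pid, codes in patients.items()}
```

The resulting case and control counts per TPMTa group form the 2x2 table that is then tested for each PheWAS code.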
PheWAS: Phenome-wide association study; ICD: International classification of diseases; TPMT: thiopurine S-methyltransferase. Patients are assigned to a group depending on the level of a quantitative trait (e.g. TPMT activity). ICD codes and biological test results are screened to find systematic differences between the groups.
Biological test result analyses
For each biological test, we used thresholds to define low-value cases and high-value cases, according to the normal value range (Table 1). Because patients had more than one occurrence of each biological test, two algorithms of analyses were applied. The first was a “global approach” in which a high-value case (resp. a low-value case) was defined as the presence of at least one test result above the high threshold (resp. below the low threshold). (Figure S2) In addition, for hyperglycemia, we required two occurrences above the high threshold. (Table S1) The proportions of cases among patient groups were then compared, similar to the ICD analysis. The second method was a “frequency based approach” in which a case was defined by one encounter with at least one result either below (low-value cases) or above (high-value cases) the thresholds. (Figure S2) The proportions of encounter cases (“episodes”) per patient for a test were compared, similar to the ICD analysis. The results of the “global approach” and of the “frequency-based approach” were analyzed in view of the ICD-based findings.
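The two discretization approaches can be sketched in Python as follows, shown for high-value cases only. The data layout (a list of `(encounter_id, value)` pairs per patient and test), the function names and the threshold value are assumptions made for this example; the thresholds actually used are those of Table S1.

```python
from collections import defaultdict


def is_high_case_global(results, high_threshold, min_occurrences=1):
    """Global approach: the patient is a high-value case if at least
    `min_occurrences` results exceed the high threshold."""
    n_above = sum(1 for _encounter, value in results if value > high_threshold)
    return n_above >= min_occurrences


def high_encounter_frequency(results, high_threshold):
    """Frequency-based approach: share of encounters containing at least one
    result above the high threshold."""
    by_encounter = defaultdict(list)
    for encounter_id, value in results:
        by_encounter[encounter_id].append(value)
    if not by_encounter:
        return 0.0
    n_high = sum(1 for values in by_encounter.values()
                 if any(v > high_threshold for v in values))
    return n_high / len(by_encounter)


# Hypothetical glycemia results for one patient: (encounter id, value).
HIGH_THRESHOLD = 7.0  # placeholder value, not taken from Table S1
glycemia = [("e1", 5.2), ("e1", 7.8), ("e2", 5.0), ("e3", 8.1), ("e4", 4.9)]

# For hyperglycemia the text requires two occurrences above the threshold.
hyperglycemia_case = is_high_case_global(glycemia, HIGH_THRESHOLD, min_occurrences=2)
hyperglycemia_frequency = high_encounter_frequency(glycemia, HIGH_THRESHOLD)
```

Low-value cases follow the same pattern with the comparison reversed against the low threshold.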
For biological tests significantly associated with a TPMTa group, we performed an event-free Kaplan-Meier survival analysis (i.e. low-value event or high-value event) after initiation of thiopurine therapy, excluding events occurring within the first week of treatment. Follow-up was censored at 360 days after initiation of thiopurine therapy.
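A minimal sketch of the event-free survival computation is given below. It assumes that event dates are expressed in days since initiation of therapy and that "within the first week" means day 7 or earlier; both are assumptions, as are the function names. Only the Kaplan-Meier estimate is shown, not the log-rank comparison between groups.

```python
def time_to_event(event_days, followup_days, skip_days=7, horizon=360):
    """Return (time, observed) for one patient.

    Events on or before `skip_days` are ignored, and follow-up is censored at
    `horizon` days (or at the end of the patient's follow-up, if earlier).
    """
    limit = min(followup_days, horizon)
    eligible = [day for day in event_days if skip_days < day <= limit]
    if eligible:
        return min(eligible), True
    return limit, False


def kaplan_meier(samples):
    """Kaplan-Meier estimate from (time, observed) pairs.

    Returns a list of (event_time, survival_probability) in time order.
    """
    survival = 1.0
    curve = []
    for t in sorted({time for time, observed in samples if observed}):
        at_risk = sum(1 for time, _observed in samples if time >= t)
        events = sum(1 for time, observed in samples if observed and time == t)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical patients: (days of anemia events, days of follow-up).
raw = [([3, 30], 400), ([], 200), ([100], 500), ([5], 400)]
samples = [time_to_event(events, followup) for events, followup in raw]
curve = kaplan_meier(samples)
```

In this toy example the first patient's day-3 event is skipped and the day-30 event is retained, while the last patient, whose only event falls in the first week, is censored at 360 days.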
Application to TPMT enzyme activity
We selected all the patients who underwent a TPMTa assay and had at least one ICD-10 code or one biological test result between January 2000 and July 2013 (the TPMT cohort). For the PheWAS analysis, we included the patients with a mention of thiopurine treatment in their EHR and kept the ICD codes and biological test results dated after the start of thiopurine treatment. There were no exclusion criteria. We will refer to this group as the study population.
We first compared the characteristics of the TPMT cohort to a hospital control group composed of randomly selected patients among the HEGP CDW who did not undergo a TPMTa assessment and were matched for year of birth and sex (3 for each patient in the TPMT cohort).
Then, we split the initial TPMT cohort into three groups according to TPMTa level: (i) low TPMTa (lowTPMTa), combining both partially and completely deficient TPMTa patients, with an activity below 8.5 nmol/h/mL red blood cells (RBC); (ii) nTPMTa; and (iii) vhTPMTa, with an activity above 15.0 nmol/h/mL RBC. We verified that TPMTa is stable over time using the patients (n = 51) who underwent more than one TPMTa assay (Table S2). For these patients, only the first measurement was used in the analyses.
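The group assignment can be expressed in a few lines of Python. The thresholds come from the text; the function names and the handling of values exactly equal to a threshold (assigned to nTPMTa, since the text says "below" and "above") are assumptions of this sketch.

```python
from datetime import date

LOW_THRESHOLD = 8.5         # nmol/h/mL RBC, from the text
VERY_HIGH_THRESHOLD = 15.0  # nmol/h/mL RBC, from the text


def first_measurement(assays):
    """Return the activity of the earliest assay from (date, activity) pairs."""
    return min(assays, key=lambda assay: assay[0])[1]


def tpmt_group(activity):
    """Assign a TPMTa group from one activity value (nmol/h/mL RBC)."""
    if activity < LOW_THRESHOLD:
        return "lowTPMTa"
    if activity > VERY_HIGH_THRESHOLD:
        return "vhTPMTa"
    return "nTPMTa"


# Hypothetical patient with two assays: only the first one is used.
assays = [(date(2010, 5, 3), 14.2), (date(2009, 11, 20), 16.4)]
group = tpmt_group(first_measurement(assays))
```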
An Open Database Connectivity (ODBC) connection linking an Oracle database (11g Enterprise Edition Release 18.104.22.168.0) of the i2b2 CDW (version 1.3) to R software (version 2.15.3) was set up. The dataset containing data from the TPMT cohort (demographics, diagnoses, free text reports, structured questionnaires, biological test results and drug prescriptions) was imported into R. All further analyses were carried out in R, using the RODBC 1.3-6 and the ggplot2 0.9.3.1 packages.
The information concerning the treatment was found in the drug prescriptions, in free text reports or in clinical data from structured questionnaires. We extracted prescriptions from the CPOE drug prescriptions with starting dates or directly from free text reports using the brand name and the generic name (IMUREL, AZATHIOPRINE, IMURAN, MERCAPTOPURINE, PURINETHOL) and using the date of report as the starting date.
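A possible way to detect thiopurine mentions in free-text reports is sketched below. The drug names are those listed in the text; the case-insensitive whole-word matching, the choice of the earliest matching report as the starting date, and the function name are assumptions of this example, and the sample reports are invented.

```python
import re
from datetime import date

# Brand and generic names listed in the text.
THIOPURINE_NAMES = ["IMUREL", "AZATHIOPRINE", "IMURAN", "MERCAPTOPURINE", "PURINETHOL"]
THIOPURINE_PATTERN = re.compile(
    r"\b(" + "|".join(THIOPURINE_NAMES) + r")\b", re.IGNORECASE
)


def earliest_thiopurine_mention(reports):
    """Return the date of the earliest report mentioning a thiopurine, or None.

    `reports` is a list of (report_date, text) pairs.
    """
    dates = [report_date for report_date, text in reports
             if THIOPURINE_PATTERN.search(text)]
    return min(dates) if dates else None


# Invented reports, for illustration only.
reports = [
    (date(2011, 3, 14), "Follow-up visit. Patient stable on Imurel 100 mg/day."),
    (date(2010, 9, 2), "Crohn's disease flare. Start 6-mercaptopurine."),
    (date(2010, 6, 1), "Colonoscopy report. No treatment change."),
]
start_date = earliest_thiopurine_mention(reports)
```

Simple keyword matching of this kind does not handle negation or historical mentions, so the dates it yields would need manual review in practice.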
ICD codes analysis.
We compared the proportions of cases and controls in the TPMTa groups: (i) vhTPMTa versus other TPMTa and (ii) lowTPMTa versus other TPMTa. We selected the PheWAS codes with at least 5 occurrences for analysis.
Biological test result analyses.
Among the biological tests, we focused on 11 routine blood tests widely prescribed during the monitoring of thiopurine treatment: leukocyte count (WBC), neutrophil count, RBC count, hemoglobin, platelet count, mean corpuscular volume (MCV), glycemia, alkaline phosphatase (ALP), alanine aminotransferase (ALT), aspartate aminotransferase (AST), and gamma glutamyl-transpeptidase (GGT) (Table S1).
Thiopurine efficacy analysis on free-text reports.
From the study population, we selected the patients having at least two free-text reports with a reference to thiopurine therapy in their EHR. We excluded the patients with a reported adverse effect or intolerance to azathioprine or 6-mercaptopurine, and the patients whose treatment was interrupted within the first month.
Thiopurine failure was defined as at least one reference to inefficiency/failure of azathioprine/6-mercaptopurine therapy, or as a sustained dependency to steroids, reported by physicians in free-text reports. Of note, if the treatment was initially reported as effective, a secondary failure was not considered in our analysis. Proportions of thiopurine therapy failure were compared between vhTPMTa patients and other TPMTa patients.
Fisher's exact test and unadjusted logistic regression were used to compare discrete variables. Continuous variables were compared using Student's t-test. The log-rank test was used to compare survival curves. We calculated the odds ratios (OR) and 95% confidence intervals (95%CI). Q-Q plots were drawn to evaluate the distribution of p-values. The significance threshold for p-values was set at 0.05. We used the False Discovery Rate (FDR) method to account for multiple testing, with a threshold of 0.2.
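The per-code testing and the multiple-testing correction can be sketched in plain Python. The sketch assumes that the FDR method is the Benjamini-Hochberg step-up procedure, which the text does not state explicitly; the function names and the example p-values are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed
    # one (small tolerance for floating-point ties).
    p_value = sum(p for p in map(prob, range(lowest, highest + 1))
                  if p <= p_observed * (1 + 1e-7))
    return min(1.0, p_value)


def benjamini_hochberg(p_values, q=0.2):
    """Return one boolean per p-value: True if significant at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            max_rank = rank
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        significant[i] = rank <= max_rank
    return significant


# Hypothetical p-values, one per PheWAS code, for illustration only.
p_values = [0.001, 0.2, 0.03, 0.5, 0.9]
significant = benjamini_hochberg(p_values, q=0.2)
```

With a threshold of q = 0.2, the procedure compares the k-th smallest of m p-values to k × 0.2 / m and declares significant every code up to the largest k that passes.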
A total of 554 patients (TPMT cohort) underwent a TPMTa assessment. Of these patients, 460 had ICD-10 codes and at least one biological test result, and a total of 442 patients (the study population) also had a mention of thiopurine treatment in their EHR. (Figure 3, Figure S6) These 442 EHRs included 10,252 ICD-10 occurrences and 72,711 results of the selected biological tests (Table 1). Of these patients, 324, representing 6,183 free-text reports, were included in the thiopurine efficacy validation analysis, after exclusion of the patients having fewer than two reports mentioning thiopurine therapy and patients with an adverse effect or intolerance to thiopurines. (Figure 3) Known indications for thiopurine therapy, e.g., Crohn's disease (OR, 699.6; 95%CI, 343.7–1,600; p = 1.73E-263) or ulcerative colitis (OR, 583.1; 95%CI, 237.9–1,843; p = 1.5E-144), and their consequences were significantly associated with the TPMT cohort versus the hospital population (Table S3). No patient with leukemia or an associated pathology was found in the analysis, as there is no hematology department at HEGP.
HEGP CDW: Clinical data warehouse from Hôpital Européen Georges Pompidou, France. TPMT Cohort: patients with a thiopurine S-methyltransferase (TPMT) activity assessment in HEGP between January 2000 and July 2013. ICD: International Statistical Classification of Diseases and Related Health Problems. PheWAS: phenome-wide association study.
Using our ICD-10 based aggregation, the 1,016 distinct ICD-10 codes occurring in the study population EHRs resulted in 156 distinct aggregated codes, including 83 codes with at least 5 occurrences. (Table S4) ICD-9-CM mapping aggregation led to 289 distinct aggregated codes, including 94 codes with at least 5 occurrences. (Table S5) These 156 and 289 aggregated codes represent respectively 59% and 37% of the aggregated classifications.
In the vhTPMTa versus other TPMTa analysis, two significant codes for ICD-10 based aggregation were found: diabetes mellitus (p = 0.0009) and nutritional anemia (p = 0.0005). These results agreed with the ICD-9-CM mapping codes (p = 0.0004 and p = 0.0015, respectively). (Figures 4, 5, Tables S6, S7) These results remained significant after FDR correction for multiple testing for the two aggregation methods. (Tables S6, S7) The distribution of p-values did not show any systematic bias according to Q-Q plots. (Figure S3)
ICD-9-CM: International classification of diseases 9 clinically modified; TPMT: thiopurine S-methyltransferase. The dotted line represents a P-value of 0.05 and the dashed line represents the FDR corrected level of significance for q = 0.2.
ICD-10: International classification of diseases 10; TPMT: thiopurine S-methyltransferase. The dotted line represents a P-value of 0.05 and the dashed line represents the FDR corrected level of significance for q = 0.2.
Biological test results.
With the “global approach”, the proportion of patients with at least one episode of moderate to severe biological anemia was higher in the vhTPMTa group than in the other TPMTa group: 40.8% versus 26.1% (OR, 1.9; 95%CI, 1.2–3.3; p = 0.01). (Table 2, Figure 6) Analyzing the same groupings, we also found that 13.6% of vhTPMTa patients had an episode of hyperglycemia versus 5.9% in the other TPMTa group (OR, 2.48; 95%CI, 1–6.1; p = 0.046) (Table 3). The “frequency-based approach” confirmed that the mean frequency of moderate to severe biological anemia episodes was higher in the vhTPMTa group than in the other TPMTa group: 18% versus 9% of encounters (p = 0.01). (Table 2) On the other hand, there was no statistically significant difference in the frequency of encounters with hyperglycemia between the two groups. (Table 3) With respect to neutropenia, there was no difference between the two groups using the global approach. However, the “frequency-based approach” identified a lower rate of neutropenia in the vhTPMTa group than in the other TPMTa groups (Table 2).
Using the global approach, a high-value case, resp. low-value case, is defined as at least one occurrence of a biological test result above, resp. below, the high or low threshold. Low-value case analyses have not been performed on alanine aminotransferase, aspartate aminotransferase and gamma glutamyl-transpeptidase test results, as a low threshold is not relevant for these tests. The dotted line represents a P-value of 0.05. Grey triangles represent the results above the high threshold and black triangles represent the results below the low threshold.
There were no differences between groups when comparing the lowTPMTa group with the other TPMTa group using the “global approach”. However, the “frequency-based approach” showed a lower frequency of leucopenia (3.7% versus 10%, p = 0.02) and neutropenia (0.9% versus 2.7%, p = 0.01) in the lowTPMTa group compared to the other TPMTa group. (Tables S10, S11)
Event-free survival analysis
Event-free survival was evaluated for anemia and hyperglycemia. Patients with vhTPMTa experienced anemia episodes significantly earlier than the other patients (p = 0.04). (Figure 7) Regarding the development of hyperglycemia, there was no difference between the groups.
TPMT: thiopurine S-methyltransferase. Analysis based on biological test results. Anemia was censored for hemoglobin test results below 9 g/100 mL. All events occurring within the first week after starting thiopurine therapy were excluded from the analysis. Follow-up was censored after 360 days. A log-rank test was used for this analysis.
Thiopurine efficacy analysis
The efficacy analysis, based on free-text reports, showed thiopurine therapy failure in 30.6% (15/49) of the vhTPMTa group versus 13.1% (36/275) of the other TPMTa group (OR, 2.91; 95%CI, 1.33–6.17; p = 0.0045). After adjustment for sex and age in a logistic regression model, we found an adjusted OR of 3.11 (95%CI, 1.61–6.04; p = 0.0007).
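As a worked illustration, the counts above can be arranged in a 2x2 table and an odds ratio computed directly. The sketch uses the sample (cross-product) odds ratio with a Wald confidence interval, which is an assumption of this example and not necessarily the estimator behind the reported values, so the output need not match the reported OR and interval exactly. The function name is hypothetical.

```python
from math import exp, log, sqrt


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Sample odds ratio and Wald 95% confidence interval for [[a, b], [c, d]].

    a, b: failures and non-failures in the first group;
    c, d: failures and non-failures in the second group.
    """
    odds_ratio = (a * d) / (b * c)
    standard_error = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * standard_error)
    upper = exp(log(odds_ratio) + z * standard_error)
    return odds_ratio, lower, upper


# Counts from the text: 15 failures among 49 vhTPMTa patients,
# 36 failures among 275 other TPMTa patients.
or_estimate, ci_lower, ci_upper = odds_ratio_wald(15, 49 - 15, 36, 275 - 36)
```

Exact methods, such as the conditional estimate associated with Fisher's exact test, typically give slightly different point estimates and wider intervals than this approximation.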
This study demonstrates the feasibility and benefits of performing a PheWAS on a quantitative trait. Two independent approaches, based on (i) ICD codes and (ii) biological test results, were used to discover pathophysiological features potentially associated with this quantitative trait. In this manner, findings can be cross-validated: the phenotypes extracted from diagnosis codes were confirmed by the biological test results. In this way, using a quantitative trait in the context of pharmacogenomics, we discovered new potential associations between TPMTa related to thiopurine treatment and clinical data.
To our knowledge, this is the first PheWAS performed using data encoded with the ICD-10 classification, as previously published PheWAS were based on ICD-9-CM. The consistency of the results between the two aggregation methods (the ICD-10-based method and the mapping between ICD-9-CM and ICD-10) demonstrates the feasibility of PheWAS using ICD-10. In our study population and using the ICD code distribution described above, the ICD-9-CM based PheWAS codes resulted in more informative phenotypes than the ICD-10 based codes. Thus, it appears that the ICD code aggregation level, i.e. the number of code groups, needs to be optimized according to the size of the population. For example, in a larger population, it may be more appropriate to use a fine-grained aggregation based on the 3 digit codes of ICD-10, resulting in more accurate phenotypes.
The use of a CDW gives the opportunity to combine data from six heterogeneous sources: demographic data from administrative records, diagnosis codes from the billing system, biological test results, drug prescriptions from the CPOE system, free text reports, and clinical data from structured questionnaires. The clinical interpretation of the patient's condition by the physician, encoded with ICD codes, was compared with the biological test results extracted from the laboratory result server. Drug prescriptions were extracted from the structured data issued by CPOE, from structured questionnaires and from free-text reports. The close relationship between thiopurine drug prescriptions and TPMTa assays for therapeutic management was taken into account by incorporating temporal data for this study. Therefore, we restricted our analyses to the events following the initiation of thiopurine therapy.
In addition to patient selection based on TPMTa, biological test results were employed to validate the phenotypes obtained from the ICD code analysis. Thus, we assessed the feasibility of expanding PheWAS to another type of data from the CDW. To that end, classification algorithms were developed to transform continuous test results into discrete classes using value and frequency thresholds. Such algorithms could benefit from semantic web technologies, because description logic includes reasoning capabilities. First, the patient's history was considered globally to compare the proportion of patients with an occurrence of an abnormal biological test result between groups. In a second step, we analyzed the number of episodes for a specific biological abnormality, allowing us to compare event frequencies between TPMTa groups.
From a clinical point of view, the analyses using ICD-9-CM- or ICD-10-based groupings and biological test results are consistent, showing more frequent anemia in vhTPMTa patients than in other patients. In IBD, anemia is frequently observed and has a multifactorial etiology, including chronic inflammation and iron deficiency caused by enteric bleeding. In addition, myelosuppressive drugs such as thiopurines can cause anemia. In our study, the strong association between iron-deficiency anemia (observed through ICD codes and hemoglobin test results) and vhTPMTa could reflect more active disease in these patients. Moreover, evaluation of the anemia-free duration showed earlier episodes of anemia in the vhTPMTa group compared to other patients. Finally, thiopurine efficacy analysis on free-text reports showed approximately threefold higher odds of therapy failure in the vhTPMTa group, in relation with anemia episodes and an active disease. Besides, an over-representation of diabetes mellitus, identified by both the ICD-9-CM mapping and the ICD-10 analyses, has been observed in patients with vhTPMTa. This result has been confirmed by the glycemia test result analyses, with more patients having hyperglycemia. Onset of type 2 diabetes or glucose intolerance could result from a sustained steroid therapy secondary to thiopurine resistance and active disease in vhTPMTa patients. This finding is strengthened by the weak association with secondary hypertension, which is also a known adverse effect of steroids. Finally, the higher risk of thiopurine therapy failure in vhTPMTa patients, highlighted by free-text report analysis, is in agreement with sustained steroid therapy, according to IBD therapeutic management. Altogether, these findings suggest that patients with vhTPMTa could have more active disease than the others, leading to more frequent anemia episodes despite thiopurine therapy.
These patients may benefit from more intensive thiopurine therapy to maintain remission, spare steroids and lessen common adverse effects.
As a limitation of our PheWAS study, the study design does not distinguish the effect of vhTPMTa itself from a drug effect. A possible approach to address this point would be to perform a PheWAS on patients with a TPMTa assessment but without thiopurine therapy. However, because TPMTa testing is indicated before starting thiopurine therapy to screen for TPMT-deficient patients, the HEGP CDW did not contain the data needed for such an analysis. Systematic TPMTa determination for inpatients, in the context of large-scale DNA biobanking, could be valuable for analyzing the impact of vhTPMTa on clinical phenotypes.
The number of patients in our study (n = 442) is relatively small. Previously published PheWAS were mainly based on pooled data or large population-based cohorts. However, despite the size of our study, we obtained statistically significant results that were supported by a clinical/biological cross-validation. This cross-validation was followed by a manual in-depth analysis of free-text reports to evaluate the validity of our initial conclusions. Regarding multiple testing issues, Denny et al. used a Bonferroni correction but estimated that it might be too restrictive. We decided to use FDR because of its tolerance towards auto-correlated tests. Given the cross-validation process based on the biological test results, (i) we did not exclude PheWAS codes with small numbers of cases from our analysis as in previous studies, and (ii) we considered the patients who had at least one occurrence of the ICD code p as having the phenotype p, whereas previous studies considered patients as cases when the ICD code was present more than once in the patient record (a minimum of two or even four occurrences of the same code).
To be used as a selection criterion, a quantitative trait should be stable over the period of phenotype analysis. Like all enzymes, TPMT can be influenced by physiological factors (e.g., pregnancy) or co-treatments. In our study, TPMTa was stable over the analysis period. To extend this method to other quantitative traits, this stability over time must be checked.
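A stability check of this kind can be sketched as follows. The cut-offs are those given in the figure and table legends (low TPMTa: <8.5 nmol/h/mL red blood cells; very high TPMTa: ≥15.0 nmol/h/mL red blood cells), and the group is taken from the first assessment, as described for patients with multiple assays. The function names are hypothetical, and treating a patient as unstable when repeated assays fall into different groups is our reading of the stability criterion, not a documented rule.

```python
LOW_CUTOFF = 8.5         # nmol/h/mL red blood cells, from the legends
VERY_HIGH_CUTOFF = 15.0  # nmol/h/mL red blood cells, from the legends


def tpmta_group(activity):
    """Map one TPMT activity value to its group."""
    if activity < LOW_CUTOFF:
        return "low"
    if activity >= VERY_HIGH_CUTOFF:
        return "very high"
    return "normal"


def assign_group(assays):
    """Assign a patient to a group from chronologically ordered assays.

    Returns (group, stable): the group of the first assay, and whether
    all assays fall into that same group (assumed stability criterion).
    """
    groups = [tpmta_group(a) for a in assays]
    return groups[0], len(set(groups)) == 1
```

For example, a patient with hypothetical assays of 14.2 and then 15.6 would be assigned to the normal group and flagged as unstable, so that the record can be reviewed manually.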
Regarding our ICD and biological test result analyses, it could be valuable to extend them to other retrospective cohorts or CDWs. Finally, a large prospective study including patients treated with thiopurines according to their TPMTa could help to confirm our findings regarding vhTPMTa and thiopurine therapy failure associated with steroid side effects, and to develop further research.
We described here an original method to perform a PheWAS analysis on a quantitative trait, TPMTa, using both ICD-10 diagnosis codes and biological test results to identify associated phenotypes. This study highlighted a potential association between very high TPMT activity and signs that could be associated with a failure of thiopurine therapy and sustained steroid requirements in IBD patients. In the field of pharmacogenomics, PheWAS may allow the description of new subgroups of patients who need personalized clinical and therapeutic management.
Comparison between Genome Wide Association Studies (GWAS) and Phenome Wide Association Studies (PheWAS). SNP: single nucleotide polymorphism. A. GWAS: a group of patients with a selected phenotype (i.e. disease) is compared to a control group. All the genomic data available are screened to find systematic genomic differences between the groups. B. PheWAS: a group of patients with a selected allele or SNP on a particular gene is compared to a control group with different alleles on the same gene. All the phenotypic data available are screened to find systematic phenotypic differences between the groups.
Schematic representation of the discretization of quantitative biological test results for one single patient. A. Global approach: a patient is considered as a high-value case if he has at least one occurrence of a biological test result above the high threshold. B. Frequency-based approach: the frequency of high-value encounters is defined as the number of encounters with at least one occurrence above the high threshold divided by the number of encounters.
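The two discretization approaches can be sketched in a few lines. This is a minimal illustration, assuming that a patient's results are stored as one list of values per encounter; the function names, the `min_occurrences` parameter and the example values are hypothetical. The `min_occurrences` parameter reflects the special rule stated in the thresholds table, where hyperglycemia requires two occurrences above the high threshold.

```python
def is_high_value_case(encounters, high_threshold, min_occurrences=1):
    """Global approach: the patient is a high-value case if at least
    `min_occurrences` results, over all encounters, exceed the threshold.

    `encounters` is a list of encounters, each a list of test results.
    """
    n_above = sum(
        1 for results in encounters for value in results
        if value > high_threshold
    )
    return n_above >= min_occurrences


def high_value_frequency(encounters, high_threshold):
    """Frequency-based approach: share of encounters with at least one
    result above the threshold. Returns None if there are no encounters."""
    if not encounters:
        return None
    n_high = sum(
        1 for results in encounters
        if any(value > high_threshold for value in results)
    )
    return n_high / len(encounters)
```

For a hypothetical patient with four encounters, `[[5.0, 7.5], [4.8], [8.1, 6.0], [5.2]]`, and an illustrative high threshold of 7.0, the global approach classifies the patient as a high-value case, and the frequency-based approach gives 2/4 = 0.5. The low-value variants follow by reversing the comparison.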
q-q plots of p-values from phenome-wide association study. Left: q-q plot of p-values from the analysis of ICD codes with the ICD-10 based aggregation. Right: q-q plot of p-values from the analysis of ICD codes with the ICD-9-CM mapping based aggregation. The red line represents the normal distribution.
Manhattan plot of Phenome-wide association study (PheWAS) between low TPMT activity patients and other TPMT activity patients for the ICD-9-CM mapping aggregation. Groups of ICD codes are represented by dots. Results of association tests (logistic regressions) are represented vertically (−log10(p-value)). The dotted line indicates p = 0.05. The dashed line indicates the FDR corrected level of significance for q = 0.2. When the p-value is under 0.05, the dot size represents the level of the odds-ratio. TPMTa: thiopurine S-methyltransferase activity. Low TPMTa: <8.5 nmol/h/mL red blood cells; Very high TPMTa: ≥15.0 nmol/h/mL red blood cells; Normal TPMTa: in between.
Manhattan plot of Phenome-wide association study (PheWAS) between low TPMT activity patients and other TPMT activity patients for the ICD-10 based aggregation. Groups of ICD codes are represented by dots. Results of association tests (logistic regressions) are represented vertically (−log10(p-value)). The dotted line indicates p = 0.05. The dashed line indicates the FDR corrected level of significance for q = 0.2. When the p-value is under 0.05, the dot size represents the level of the odds-ratio. TPMTa: thiopurine S-methyltransferase activity. Low TPMTa: <8.5 nmol/h/mL red blood cells; Very high TPMTa: ≥15.0 nmol/h/mL red blood cells; Normal TPMTa: in between.
Distribution of TPMT activity in the study population (n = 442). TPMT: thiopurine S-methyltransferase; RBC: red blood cells.
Thresholds for biological test result analyses. Thresholds have been defined according to the normal value ranges of the hospital laboratory. One test result occurrence below the low or above the high threshold defines a low-value case or a high-value case, respectively. * One neutrophil count below the low threshold of 1.0 G/L defines neutropenia. ** One hemoglobin test result below the low threshold of 9.0 g/100 mL defines moderate to severe biological anemia. *** Specifically for glycemia, a high-value case (hyperglycemia) is defined by two test result occurrences above the high threshold.
Thiopurine S-methyltransferase activity (TPMTa) for patients with multiple assays. RBC: red blood cells. TPMTa: TPMT activity. Of the 51 patients who underwent more than one TPMTa assay, only one had results that could induce a change in groups. He was assigned to the normal TPMTa group (the group from his first TPMTa assessment). For all the other patients, we considered that the TPMTa was stable over time.
Results of the preliminary phenome-wide association study on patients from TPMT cohort versus randomly selected patients from hospital clinical data warehouse. The ICD codes aggregation used was based on the 3-digit ICD-10 codes (2040 groups). Only the statistically significant results are reported here.
Distribution of PheWAS Codes from ICD-10 based aggregation.
Distribution of PheWAS Codes from ICD-9-CM mapping aggregation.
Results of the Phenome-wide association study (PheWAS) between very high TPMT activity patients and other TPMT activity patients for the ICD-9-CM mapping aggregation. The ICD-9-CM mapping aggregation corresponds to 771 groups of codes. Associations are assessed using logistic regression. Only PheWAS codes with a p-value<0.05 are reported here. The q value for false discovery rate (FDR) was q = 0.2. The p-value must be under the calculated FDR threshold to be considered as significant. TPMTa: thiopurine S-methyltransferase activity. Low TPMTa: <8.5 nmol/h/mL red blood cells; Very high TPMTa: ≥15.0 nmol/h/mL red blood cells; Normal TPMTa: in between.
Results of the Phenome-wide association study (PheWAS) between very high TPMT activity patients and other TPMT activity patients for the ICD-10 based aggregation. The ICD-10 based aggregation corresponds to 256 groups of codes. Only PheWAS codes with a p-value<0.05 are reported here. Associations are assessed using logistic regression. The q value for false discovery rate (FDR) was q = 0.2. The p-value must be under the calculated FDR threshold to be considered as significant. TPMTa: thiopurine S-methyltransferase activity. Low TPMTa: <8.5 nmol/h/mL red blood cells; Very high TPMTa: ≥15.0 nmol/h/mL red blood cells; Normal TPMTa: in between.
Results of the Phenome-wide association study (PheWAS) between low TPMT activity patients and other TPMT activity patients for the ICD-9-CM mapping aggregation. The ICD-9-CM mapping aggregation corresponds to 771 groups of codes. Only PheWAS codes with a p-value<0.05 are reported here. Associations are assessed using logistic regression. The q value for false discovery rate (FDR) was q = 0.2. The p-value must be under the calculated FDR threshold to be considered as significant. TPMTa: thiopurine S-methyltransferase activity. Low TPMTa: <8.5 nmol/h/mL red blood cells; Very high TPMTa: ≥15.0 nmol/h/mL red blood cells; Normal TPMTa: in between.
Results of the Phenome-wide association study (PheWAS) between low TPMT activity patients and other TPMT activity patients for the ICD-10 based aggregation. The ICD-10 based aggregation corresponds to 256 groups of codes. Only PheWAS codes with a p-value<0.05 are reported here. Associations are assessed using logistic regression. The q value for false discovery rate (FDR) was q = 0.2. The p-value must be under the calculated FDR threshold to be considered as significant.
Results of the low-value case biological test analyses between low TPMT activity patients and other patients with normal and very high TPMT activity. Global approach: a low-value case is defined as at least one occurrence, over the study period, of biological test result below the low threshold defined in Table 1. Frequency-based approach: for a given patient, the frequency of low-value encounters is defined as the number of encounters with at least one occurrence below the low threshold divided by the number of encounters (mean low-value encounter frequencies are reported). Low-value case analyses have not been performed on alanine aminotransferase, aspartate aminotransferase and gamma glutamyl-transpeptidase test results, as a low threshold is not relevant for these tests.
Results of the high-value-case biological test analyses between low TPMT activity (lowTPMTa) patients and other patients. Global approach: a high-value case is defined as at least one occurrence, over the study period, of a biological test result above the high threshold defined in Table 1. Frequency-based approach: for a given patient, the frequency of high-value encounters is defined as the number of encounters with at least one occurrence above the high threshold divided by the number of encounters (mean high-value encounter frequencies are reported). TPMTa: thiopurine S-methyltransferase activity. lowTPMTa: low TPMTa (<8.5 nmol/h/mL red blood cells); vhTPMTa: very high TPMTa (≥15.0 nmol/h/mL red blood cells); nTPMTa: normal TPMTa (in between).
Conceived and designed the experiments: AN LC AB MAL PA. Performed the experiments: AN LC AB MAL PA. Analyzed the data: AN LC AB MAL PA. Contributed reagents/materials/analysis tools: GM CLB DR PB PD. Wrote the paper: AN LC AB MAL PA GM CLB DR PB PD. Obtained IRB approval: AN PA.
- 1. (2011) Toward Precision Medicine: Building a Knowledge Network for Biomedical Research and a New Taxonomy of Disease. Washington (D.C.): National Academies Press (US).
- 2. Feero WG, Guttmacher AE, Collins FS (2010) Genomic medicine–an updated primer. N Engl J Med 362: 2001–2011
- 3. Klein RJ, Zeiss C, Chew EY, Tsai J-Y, Sackler RS, et al. (2005) Complement Factor H Polymorphism in Age-Related Macular Degeneration. Science 308: 385–389
- 4. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, et al. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences of the United States of America 106: 9362–9367
- 5. Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, et al. (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851–861
- 6. Hindorff LA, MacArthur J (European Bioinformatics Institute), Morales J (European Bioinformatics Institute), Junkins HA, Hall PN, et al. (n.d.) A Catalog of Published Genome-Wide Association Studies. Available: http://www.genome.gov/gwastudies/. Accessed 9 April 2013.
- 7. Denny JC, Ritchie MD, Basford MA, Pulley JM, Bastarache L, et al. (2010) PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics 26: 1205–1210
- 8. Denny JC, Crawford DC, Ritchie MD, Bielinski SJ, Basford MA, et al. (2011) Variants near FOXE1 are associated with hypothyroidism and other thyroid conditions: using electronic medical records for genome- and phenome-wide studies. American journal of human genetics 89: 529–542
- 9. WHO (2010) WHO | International Classification of Diseases (ICD). Available: http://www.who.int/classifications/icd/en/. Accessed 5 January 2013.
- 10. Roden DM, Pulley JM, Basford MA, Bernard GR, Clayton EW, et al. (2008) Development of a large-scale de-identified DNA biobank to enable personalized medicine. Clin Pharmacol Ther 84: 362–369
- 11. Liao KP, Kurreeman F, Li G, Duclos G, Murphy S, et al. (2012) Autoantibodies, autoimmune risk alleles and clinical associations in rheumatoid arthritis cases and non-RA controls in the electronic medical records. Arthritis Rheum
- 12. Hebbring SJ, Schrodi SJ, Ye Z, Zhou Z, Page D, et al. (2013) A PheWAS approach in studying HLA-DRB1*1501. Genes Immun
- 13. Kho AN, Pacheco JA, Peissig PL, Rasmussen L, Newton KM, et al. (2011) Electronic medical records for genetic research: results of the eMERGE consortium. Sci Transl Med 3: 79re1
- 14. Pendergrass SA, Brown-Gentry K, Dudek SM, Torstenson ES, Ambite JL, et al. (2011) The use of phenome-wide association studies (PheWAS) for exploration of novel genotype-phenotype relationships and pleiotropy discovery. Genetic epidemiology 35: 410–422
- 15. Pendergrass SA, Brown-Gentry K, Dudek S, Frase A, Torstenson ES, et al. (2013) Phenome-Wide Association Study (PheWAS) for Detection of Pleiotropy within the Population Architecture using Genomics and Epidemiology (PAGE) Network. PLoS Genet 9: e1003087
- 16. Ritchie MD, Denny JC, Zuvich RL, Crawford DC, Schildcrout JS, et al. (2013) Genome- and Phenome-Wide Analysis of Cardiac Conduction Identifies Markers of Arrhythmia Risk. Circulation. Available: http://circ.ahajournals.org/content/early/2013/03/05/CIRCULATIONAHA.112.000604. Accessed 20 March 2013.
- 17. Shah NH, Tenenbaum JD (2012) The coming age of data-driven medicine: translational bioinformatics' next frontier. J Am Med Inform Assoc 19: e2–4
- 18. Halevy A, Norvig P, Pereira F (2009) The Unreasonable Effectiveness of Data. IEEE Intelligent Systems 24: 8–12.
- 19. Lowe HJ, Ferris TA, Hernandez PM, Weber SC (2009) STRIDE–An integrated standards-based translational research informatics platform. AMIA Annu Symp Proc 2009: 391–395.
- 20. Murphy SN, Weber G, Mendis M, Gainer V, Chueh HC, et al. (2010) Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). Journal of the American Medical Informatics Association: JAMIA 17: 124–130
- 21. Szalma S, Koka V, Khasanova T, Perakslis ED (2010) Effective knowledge management in translational medicine. J Transl Med 8: 68
- 22. Kohane IS, Drazen JM, Campion EW (2012) A glimpse of the next 100 years in medicine. N Engl J Med 367: 2538–2539
- 23. Altman RB (2012) Translational Bioinformatics: Linking the Molecular World to the Clinical World. Clinical Pharmacology & Therapeutics 91: 994–1000
- 24. Denny JC (2012) Chapter 13: Mining Electronic Health Records in the Genomics Era. PLoS Comput Biol 8: e1002823
- 25. Roque FS, Jensen PB, Schmock H, Dalgaard M, Andreatta M, et al. (2011) Using Electronic Patient Records to Discover Disease Correlations and Stratify Patient Cohorts. PLoS Comput Biol 7: e1002141
- 26. Bellazzi R, Masseroli M, Murphy S, Shabo A, Romano P (2012) Clinical Bioinformatics: challenges and opportunities. BMC Bioinformatics 13 Suppl 14: S1
- 27. Chouchana L, Narjoz C, Loriot M-A (2012) TPMT status determination: the simplest is the most effective? J Crohns Colitis 6: 807 author reply 808. doi:https://doi.org/10.1016/j.crohns.2012.04.003.
- 28. Schaeffeler E, Fischer C, Brockmeier D, Wernet D, Moerike K, et al. (2004) Comprehensive analysis of thiopurine S-methyltransferase phenotype-genotype correlation in a large population of German-Caucasians and identification of novel TPMT variants. Pharmacogenetics 14: 407–417.
- 29. Chouchana L, Narjoz C, Beaune P, Loriot MA, Roblin X (2012) Review article: the benefits of pharmacogenetics for improving thiopurine therapy in inflammatory bowel disease. Alimentary pharmacology & therapeutics 35: 15–36
- 30. Relling MV, Gardner EE, Sandborn WJ, Schmiegelow K, Pui CH, et al. (2011) Clinical Pharmacogenetics Implementation Consortium guidelines for thiopurine methyltransferase genotype and thiopurine dosing. Clinical pharmacology and therapeutics 89: 387–391
- 31. Stocco G, Martelossi S, Barabino A, Fontana M, Lionetti P, et al. (2005) TPMT genotype and the use of thiopurines in paediatric inflammatory bowel disease. Dig Liver Dis 37: 940–945
- 32. Fraser AG, Orchard TR, Jewell DP (2002) The efficacy of azathioprine for the treatment of inflammatory bowel disease: a 30 year review. Gut 50: 485–489.
- 33. Dubinsky MC, Yang H, Hassard PV, Seidman EG, Kam LY, et al. (2002) 6-MP metabolite profiles provide a biochemical explanation for 6-MP resistance in patients with inflammatory bowel disease. Gastroenterology 122: 904–915.
- 34. Lennard L, Van Loon JA, Lilleyman JS, Weinshilboum RM (1987) Thiopurine pharmacogenetics in leukemia: correlation of erythrocyte thiopurine methyltransferase activity and 6-thioguanine nucleotide concentrations. Clin Pharmacol Ther 41: 18–25.
- 35. Weinshilboum RM, Sladek SL (1980) Mercaptopurine pharmacogenetics: monogenic inheritance of erythrocyte thiopurine methyltransferase activity. Am J Hum Genet 32: 651–662.
- 36. Appell ML, Berg J, Duley J, Evans WE, Kennedy MA, et al. (2013) Nomenclature for alleles of the thiopurine methyltransferase gene. Pharmacogenet Genomics 23: 242–248
- 37. Ansari A, Hassan C, Duley J, Marinaki A, Shobowale-Bakre E-M, et al. (2002) Thiopurine methyltransferase activity and the use of azathioprine in inflammatory bowel disease. Aliment Pharmacol Ther 16: 1743–1750.
- 38. Chouchana L, Roche D, Jian R, Beaune P, Loriot MA (2013) Poor response to thiopurine in inflammatory bowel disease: how to overcome therapeutic resistance? Clin Chem 59: 1023–6.
- 39. Zapletal E, Rodon N, Grabar N, Degoulet P (2010) Methodology of integration of a clinical data warehouse with a clinical information system: the HEGP case. Studies in health technology and informatics 160: 193–197.
- 40. Mapping between ICD-10 and ICD-9 (2000). Ministry of Health of New Zealand. Available: http://www.health.govt.nz/nz-health-statistics/data-references/mapping-tools/mapping-between-icd-10-and-icd-9. Accessed 8 February 2013.
- 41. Bodenreider O, Nelson SJ, Hole WT, Chang HF (1998) Beyond synonymy: exploiting the UMLS semantics in mapping vocabularies. Proc AMIA Symp 815–819.
- 42. Denny J (2013) ICD9 to PheWAS: Code translation map. Vanderbilt University. Available: http://knowledgemap2.mc.vanderbilt.edu/research/sites/default/files/code_translation.txt. Accessed 8 January 2013.
- 43. Anglicheau D, Sanquer S, Loriot MA, Beaune P, Thervet E (2002) Thiopurine methyltransferase activity: new conditions for reversed-phase high-performance liquid chromatographic assay without extraction and genotypic-phenotypic correlation. J Chromatogr B Analyt Technol Biomed Life Sci 773: 119–127.
- 44. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B (Methodological) 289–300.
- 45. Pathak J, Kiefer RC, Bielinski SJ, Chute CG (2012) Applying semantic web technologies for phenome-wide scan using an electronic health record linked Biobank. J Biomed Semantics 3: 10
- 46. Bergamaschi G, Di Sabatino A, Albertini R, Ardizzone S, Biancheri P, et al. (2010) Prevalence and pathogenesis of anemia in inflammatory bowel disease. Influence of anti-tumor necrosis factor-alpha treatment. Haematologica 95: 199–205
- 47. Evans WE, Horner M, Chu YQ, Kalwinsky D, Roberts WM (1991) Altered mercaptopurine metabolism, toxic effects, and dosage requirement in a thiopurine methyltransferase-deficient child with acute lymphocytic leukemia. J Pediatr 119: 985–989.
- 48. Dunnett CW (1955) A Multiple Comparison Procedure for Comparing Several Treatments with a Control. Journal of the American Statistical Association 50: 1096–1121
- 49. Reiner A, Yekutieli D, Benjamini Y (2003) Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 19: 368–375.
- 50. Jharap B, de Boer NKH, Stokkers P, Hommes DW, Oldenburg B, et al. (2013) Intrauterine exposure and pharmacology of conventional thiopurine therapy in pregnant patients with inflammatory bowel disease. Gut
- 51. Hsieh MM, Everhart JE, Byrd-Holt DD, Tisdale JF, Rodgers GP (2007) Prevalence of Neutropenia in the U.S. Population: Age, Sex, Smoking Status, and Ethnic Differences. Ann Intern Med 146: 486–492
- 52. WHO (2011) WHO|Haemoglobin concentrations for the diagnosis of anaemia and assessment of severity. Available: http://www.who.int/vmnis/indicators/haemoglobin/en/index.html. Accessed 29 January 2013.
- 53. WHO (2006) WHO|Definition and diagnosis of diabetes mellitus and intermediate hyperglycaemia. Available: http://www.who.int/diabetes/publications/diagnosis_diabetes2006/en/index.html. Accessed 28 January 2013.