Phenome-Wide Association Studies on a Quantitative Trait: Application to TPMT Enzyme Activity and Thiopurine Therapy in Pharmacogenomics

Phenome-Wide Association Studies (PheWAS) investigate whether genetic polymorphisms associated with a phenotype are also associated with other diagnoses. In this study, we have developed new methods to perform a PheWAS based on ICD-10 codes and biological test results, and to use a quantitative trait as the selection criterion. We tested our approach on thiopurine S-methyltransferase (TPMT) activity in patients treated with thiopurine drugs. We developed two aggregation methods for the ICD-10 codes: an ICD-10 hierarchy and a mapping to existing ICD-9-CM based PheWAS codes. Eleven biological test results were also analyzed using discretization algorithms. We applied these methods to patients having a TPMT activity assessment from the clinical data warehouse of a French academic hospital between January 2000 and July 2013. Data after initiation of thiopurine treatment were analyzed and patient groups were compared according to their TPMT activity level. A total of 442 patient records were analyzed, representing 10,252 ICD-10 codes and 72,711 biological test results. The results from the ICD-9-CM based PheWAS codes and ICD-10 hierarchy codes were concordant. Cross-validation with the biological test results allowed us to validate the ICD phenotypes. Iron-deficiency anemia and diabetes mellitus were associated with a very high TPMT activity (p = 0.0004 and p = 0.0015, respectively). We describe here an original method to perform PheWAS on a quantitative trait using both ICD-10 diagnosis codes and biological test results to identify associated phenotypes. In the field of pharmacogenomics, PheWAS allows for the identification of new subgroups of patients who require personalized clinical and therapeutic management.


Introduction
The US National Research Council report "Toward Precision Medicine" proposed the redefinition of diseases using the underlying molecular causes and other factors in addition to traditional signs and symptoms [1]. To establish the relationships between molecular characterization and clinical features, different methods have been proposed [2]. Genome Wide Association Studies (GWAS) have allowed the identification of Single Nucleotide Polymorphisms (SNPs) associated with a given phenotype (Figure S1 Panel A) [3][4][5]. Between 2005 and June 2012, 1,350 GWAS were published [6]. In 2010, Denny et al. described another method called Phenome-Wide Association Study (PheWAS) [7]. PheWAS investigates whether the SNPs associated with a phenotype are also associated with other diagnoses (Figure S1 Panel B) [7,8]. For a selected SNP, two groups are formed: one with a specific allele and a control group with other alleles. Then, to search for new associations, all of the phenotypic data (for example, all International Classification of Diseases (ICD) codes) available in the medical records of the patients having the specific allele are screened and compared to those of the control group [9]. Denny et al. genotyped 6,000 patients in the BioVU data bank at five SNPs with previously reported disease associations and ran a PheWAS on each SNP, based on the ICD-9-CM codes [7,10]. They replicated four out of seven known molecular-clinical associations and discovered 19 new potential associations.
Following this example, further PheWAS were performed on the SNPs associated with hypothyroidism (FOXE1) [8], rheumatoid arthritis [11], and on HLA-DRB1*1501, which has been linked to several autoimmune diseases [12]. Most PheWAS were performed with data collected through the Electronic Medical Records and Genomics (eMERGE) network, including the Marshfield Clinic's Personalized Medicine cohort [13]. With the aim of analyzing the genetic architecture of complex traits and identifying new pleiotropic relationships, Pendergrass et al. conducted a PheWAS on 70,061 study participants representing four major racial/ethnic groups in the Population Architecture using Genomics and Epidemiology (PAGE) network [14,15].
Analyses combining GWAS and PheWAS have been reported: whereas GWAS allows researchers to identify a genomic region of interest or one SNP associated with a clinical condition, PheWAS identifies all the diagnoses potentially associated with these markers. For example, Denny et al. performed a GWAS for primary hypothyroidism and, afterwards, a PheWAS on 13,617 patient records, based on the locus that was previously identified. Genetic associations with thyroiditis and thyrotoxicosis, but with neither Graves' disease nor thyroid cancer, were thereby highlighted [8]. More recently, Ritchie et al. performed a genome- and phenome-wide analysis on cardiac conduction, which resulted in the identification of new markers for atrial fibrillation and arrhythmia [16].
To perform a PheWAS, a large amount of data must be included to infer potential patterns and discover new possible associations [17,18]. The criterion for data selection includes the presence of a particular genotype. A cohort containing all types of diagnoses is necessary to discover new potential associations. Clinical Data Warehouses (CDWs) have been developed to allow the integration of Electronic Health Records (EHRs) data and their use for research; they can also be used as a data source for such studies [19][20][21][22]. When linked to DNA repositories, CDWs are a source of patient data to analyze the relationship between genetic variations and human traits [23][24][25][26].
Instead of directly using genomic data as the inclusion criterion, it is possible to use a quantitative trait (e.g., biological test results) [27]. This approach presents three advantages: (i) quantitative traits are usually recorded as part of the clinical data; (ii) a quantitative trait, which reflects both genetic variations and nongenetic factors, can describe a clinical feature more accurately than genetic mutations alone; (iii) quantitative traits can be highly correlated with a genomic status. This is the case for thiopurine S-methyltransferase (TPMT), a key enzyme involved in thiopurine metabolism, as TPMT activity is highly correlated with the genotypes of individuals [28][29][30].
In treated patients, TPMT activity (TPMTa) is negatively correlated with 6-thioguanine nucleotide (6-TGN) intraerythrocyte concentrations: partially or completely deficient TPMTa is associated with high 6-TGN concentrations, resulting in severe hematological toxicities or even lethal bone marrow suppression [34]. Conversely, patients with very high TPMTa (vhTPMTa) are more prone to low 6-TGN intraerythrocyte concentrations and pharmacological resistance to thiopurines [38]. Therefore, to detect patients at high risk of severe hematological toxicities, the US Food and Drug Administration (FDA) and the Clinical Pharmacogenetics Implementation Consortium (CPIC) strongly recommend that TPMT status be determined either by genotyping or phenotyping prior to initiation of thiopurine therapy [30]. Based on these observations, TPMTa levels can be used as a starting point for a PheWAS.

Objectives
We aimed to develop methods to perform a PheWAS based on the ICD-10 codes and biological test results, while using a quantitative trait as a selection criterion. We then tested our approach on a specific quantitative trait, TPMTa, in order to identify new subgroups of patients with different characteristics.

Study and clinical data warehouse
We performed an in silico retrospective case-control study using data from an academic hospital, Hôpital Européen Georges Pompidou (HEGP) in Paris, France. We extracted data from the HEGP CDW, an i2b2 CDW containing data on more than 606,524 individual patients, collected between 2000 and 2012 [20,39]. This CDW contains routine care data divided into nine categories (208,955,369 items): demographics (age, sex, and hospital vital status), vital signs (e.g., temperature, blood pressure, weight), diagnoses (ICD-10), procedures (French CCAM classification), clinical data (structured questionnaires from EHR), free text reports, pathology codes (French ADICAP classification), biological test results, and Computerized Provider Order Entry (CPOE) drug prescriptions.

Definition of ICD-10 PheWAS codes: two aggregation levels
ICD codes could not be directly used for analysis because of their fine granularity. Therefore, we developed two different aggregation methods.
ICD-9-CM mapping PheWAS codes. The first aggregation scale relies on mapping between ICD-9-CM and ICD-10 [40]. We extracted the ICD-10 classification from the Unified Medical Language System (UMLS) [41]. Then, we used a mapping file developed by the New Zealand Ministry of Health to map the ICD-10 codes to the ICD-9-CM codes (Figure 1) [40]. After format adaptations, 99.5% of the codes were mapped successfully. The 57 remaining codes were mapped manually. This allowed us to use the ICD-9-CM PheWAS codes from Denny et al. [42]. These ICD-9-CM PheWAS codes contained 829 different codes, including 771 used for analysis. Codes that were not a proper diagnosis were excluded (e.g. "Effects of air pressure caused by explosion").

Author Summary
The use of underlying molecular mechanisms and other factors to describe and classify diseases is a major challenge for future treatment strategies. New methods are needed to achieve this goal. The phenome wide association study (PheWAS) methodology was initially developed to unveil unknown associations between a specific genetic status and phenotypic features (e.g. diagnoses from electronic health records). We propose to extend this method to the assessment of the relationships between the levels of a quantitative trait and diagnosis codes. We also assess the relationships between this quantitative trait and the biological test results. We tested this method using the levels of enzymatic activity of thiopurine S-methyltransferase (TPMT), which is involved in the metabolism of thiopurine drugs used, for example, in inflammatory bowel diseases. We discovered an association between a very high TPMT activity and nutritional anemia and diabetes. These results could be used to describe a new subgroup of patients in order to optimize drug treatments.
ICD-10 PheWAS codes. The other grouping method was based on the ICD-10 hierarchy. Given the size of our sample population, a lower level of granularity was more relevant. We used the superclasses of the three digit codes, leading to 257 groups. The ICD-10 PheWAS codes and the ICD-10 to ICD-9-CM PheWAS codes mapping files are available for download here: http://umrs872eq22.com/TPMT_PLOS/Phewas_codes_ICD10_ICD9.zip
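As an illustration of the hierarchy-based grouping, the following minimal Python sketch assigns an ICD-10 code to the block (superclass) that contains its three-character category. The block table is a small hand-picked subset of standard ICD-10 blocks, and the names `ICD10_BLOCKS` and `icd10_block` are hypothetical; this is not the authors' implementation, which was carried out in R with the downloadable mapping files.

```python
# Hypothetical, partial table of ICD-10 blocks (superclasses of the
# three-character categories). A full table would cover every block.
ICD10_BLOCKS = [
    ("C15", "C26", "Malignant neoplasms of digestive organs"),
    ("D50", "D53", "Nutritional anaemias"),
    ("E10", "E14", "Diabetes mellitus"),
    ("K50", "K52", "Noninfective enteritis and colitis"),
]


def icd10_block(code):
    """Return the 'START-END' label of the block containing an ICD-10 code.

    Returns None when the code falls outside the (partial) block table.
    """
    # Reduce e.g. "D50.9" to its three-character category "D50".
    category = code.replace(".", "").upper()[:3]
    for start, end, _title in ICD10_BLOCKS:
        # Three-character categories (letter + two digits) sort lexicographically.
        if start <= category <= end:
            return f"{start}-{end}"
    return None


if __name__ == "__main__":
    for example in ("D50.9", "E11", "K50.0", "Z00"):
        print(example, "->", icd10_block(example))
```

Grouping at the block level trades diagnostic precision for larger case counts per group, which is the motivation given above for a small sample.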

ICD codes analyses
Groups of patients were divided according to the quantitative trait studied (Figure 2). Then, as described by Denny et al., for each PheWAS code, a case-control comparison was performed: (i) the case group was generated with patients having an ICD code in the range of this PheWAS code; (ii) the control group was composed of patients without any ICD code in this range; and (iii) patients with ICD codes that were too close to those of the current PheWAS code were excluded from this specific comparison. For each PheWAS code, its siblings were used as exclusion ranges. For example, for grouped codes under C15-C26 (Malignant neoplasms, digestive organs), the exclusion range was from C00 to D48 (Neoplasms). We successively used the two methods of ICD code aggregation to compare the distribution of cases between the groups.
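The case, control and exclusion rules can be sketched as follows. This is a minimal Python illustration under stated assumptions: ranges are compared on the three-character ICD-10 category, and the function name `assign_status` is hypothetical rather than taken from the study code.

```python
def assign_status(patient_codes, case_range, exclusion_range):
    """Classify one patient for one PheWAS code.

    patient_codes   : iterable of ICD-10 codes found in the patient record
    case_range      : (start, end) categories of the PheWAS code, e.g. ("C15", "C26")
    exclusion_range : (start, end) categories of the sibling codes, e.g. ("C00", "D48")
    Returns "case", "excluded" or "control".
    """
    def in_range(code, bounds):
        # Compare on the three-character category, e.g. "C18.9" -> "C18".
        category = code.replace(".", "").upper()[:3]
        return bounds[0] <= category <= bounds[1]

    codes = list(patient_codes)
    if any(in_range(c, case_range) for c in codes):
        return "case"
    # Not a case, but carries a closely related code: left out of this comparison.
    if any(in_range(c, exclusion_range) for c in codes):
        return "excluded"
    return "control"


if __name__ == "__main__":
    digestive_cancer = ("C15", "C26")
    neoplasms = ("C00", "D48")
    print(assign_status(["C18.9", "K50.0"], digestive_cancer, neoplasms))  # case
    print(assign_status(["C50.1", "K50.0"], digestive_cancer, neoplasms))  # excluded
    print(assign_status(["K50.0"], digestive_cancer, neoplasms))           # control
```

Checking the case range before the exclusion range matters, because the case range is nested inside the exclusion range.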

Biological test result analyses
For each biological test, we used thresholds to define low-value cases and high-value cases, according to the normal value range (Table 1). Because patients had more than one occurrence of each biological test, two analysis algorithms were applied. The first was a "global approach" in which a high-value case (resp. a low-value case) was defined as the presence of at least one test result above the high threshold (resp. below the low threshold) (Figure S2). In addition, for hyperglycemia, we required two occurrences above the high threshold (Table S1). The proportions of cases among patient groups were then compared, as in the ICD analysis. The second method was a "frequency-based approach" in which a case was defined by one encounter with at least one result either below (low-value cases) or above (high-value cases) the thresholds (Figure S2). The proportions of encounter cases ("episodes") per patient for a test were compared, as in the ICD analysis. The results of the "global approach" and of the "frequency-based approach" were analyzed in view of the ICD-based findings.
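The two discretization approaches can be illustrated with a short Python sketch. It assumes that a patient's results for one test are grouped by encounter (a list of lists of numeric values); the function names and the glycemia threshold used in the example are hypothetical, while the hemoglobin threshold of 9.0 g/100 mL is the one reported for moderate to severe anemia.

```python
def global_case(encounters, is_abnormal, min_occurrences=1):
    """Global approach: the patient is a case if at least `min_occurrences`
    results over the whole follow-up are abnormal."""
    n_abnormal = sum(1 for enc in encounters for value in enc if is_abnormal(value))
    return n_abnormal >= min_occurrences


def episode_frequency(encounters, is_abnormal):
    """Frequency-based approach: share of encounters ("episodes") with at
    least one abnormal result."""
    if not encounters:
        return 0.0
    episodes = sum(1 for enc in encounters if any(is_abnormal(v) for v in enc))
    return episodes / len(encounters)


if __name__ == "__main__":
    # Hemoglobin results (g/100 mL) for one patient, grouped by encounter.
    hemoglobin = [[12.1, 11.8], [8.7, 9.4], [10.2], [8.9]]

    def is_anemia(value):
        return value < 9.0  # low threshold for moderate to severe anemia

    print(global_case(hemoglobin, is_anemia))        # True
    print(episode_frequency(hemoglobin, is_anemia))  # 0.5

    # Hyperglycemia requires two occurrences above the high threshold.
    # The 7.0 cut-off is a placeholder, not the laboratory's value.
    glycemia = [[5.0, 7.5], [6.0]]
    print(global_case(glycemia, lambda v: v > 7.0, min_occurrences=2))  # False
```

The global approach yields one binary label per patient, whereas the frequency-based approach yields a proportion per patient that can then be averaged within each group.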
For biological tests significantly associated with a TPMTa group, we performed an event-free Kaplan-Meier survival analysis (the event being a low-value or a high-value result) after initiation of thiopurine therapy, excluding events occurring within the first week of treatment. The analysis was censored at 360 days after initiation of thiopurine therapy.

Application to TPMT enzyme activity
Population. We selected all the patients who underwent a TPMTa assay and had at least one ICD-10 code or one biological test result between January 2000 and July 2013 (the TPMT cohort). For the PheWAS analysis, we included the patients with a mention of thiopurine treatment in their EHR and kept ICD codes and biological test results dated after the start of thiopurine treatment. There were no exclusion criteria. We will refer to this group as the study population.
We first compared the characteristics of the TPMT cohort to those of a hospital control group composed of patients randomly selected from the HEGP CDW who did not undergo a TPMTa assessment and were matched for year of birth and sex (three for each patient in the TPMT cohort).
Then, we split the initial TPMT cohort into three groups according to TPMTa level: (i) low TPMTa (lowTPMTa), combining both partially and completely deficient TPMTa patients, with an activity below 8.5 nmol/h/mL red blood cells (RBC); (ii) normal TPMTa (nTPMTa); and (iii) vhTPMTa, with an activity above 15.0 nmol/h/mL RBC [43]. Using the patients (n = 51) who underwent more than one TPMTa assay, we verified that TPMTa was stable over time (Table S2). For these patients, only the first measurement was used in the analyses.
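The grouping rule can be written as a small Python function. This is a sketch only: the constant and function names are hypothetical, and the handling of values exactly equal to a threshold (8.5 counted as normal, 15.0 counted as very high) is an assumption, since the text states "below 8.5" and "above 15.0" without specifying the boundaries.

```python
LOW_THRESHOLD = 8.5         # nmol/h/mL RBC
VERY_HIGH_THRESHOLD = 15.0  # nmol/h/mL RBC


def tpmta_group(activity):
    """Map a TPMT activity (nmol/h/mL RBC) to lowTPMTa, nTPMTa or vhTPMTa."""
    if activity < LOW_THRESHOLD:
        return "lowTPMTa"
    # Assumption: a value of exactly 15.0 is assigned to the very high group.
    if activity >= VERY_HIGH_THRESHOLD:
        return "vhTPMTa"
    return "nTPMTa"


def first_measurement(assays):
    """Keep only the earliest assay for a patient.

    `assays` is a list of (date, activity) tuples; ISO date strings
    ("YYYY-MM-DD") sort chronologically.
    """
    return min(assays, key=lambda assay: assay[0])[1]


if __name__ == "__main__":
    repeated = [("2011-06-02", 16.1), ("2009-03-15", 15.4)]
    print(tpmta_group(first_measurement(repeated)))  # vhTPMTa
    print(tpmta_group(7.9))                          # lowTPMTa
    print(tpmta_group(12.0))                         # nTPMTa
```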
Data management. An open database connection (ODBC) linking an Oracle database (11g Enterprise Edition Release 11.2.0.1.0) of the i2b2 CDW (version 1.3) to R software (version 2.15.3) was set up. The dataset containing data from the TPMT cohort (demographics, diagnoses, free text reports, structured questionnaires, biological test results and drug prescriptions) was imported into R. All further analyses were carried out in R, using the RODBC 1.3-6 and the ggplot2 0.9.3.1 packages.
Time restrictions. The information concerning the treatment was found in the drug prescriptions, in free text reports or in clinical data from structured questionnaires. We extracted prescriptions from the CPOE drug prescriptions with starting dates or directly from free text reports using the brand name and the generic name (IMUREL, AZATHIOPRINE, IMURAN, MERCAPTOPURINE, PURINETHOL) and using the date of report as the starting date.
ICD codes analysis. We compared the proportions of cases and controls in the TPMTa groups: (i) vhTPMTa versus other TPMTa and (ii) lowTPMTa versus other TPMTa. We selected the PheWAS codes with at least 5 occurrences for analysis.
Thiopurine efficacy analysis on free-text reports. From the study population, we selected the patients having at least two free-text reports with a reference to thiopurine therapy in their EHR. We excluded the patients with a reported adverse effect or intolerance to azathioprine or 6-mercaptopurine, and the patients whose treatment was interrupted within the first month.
Thiopurine failure was defined as at least one reference to inefficacy or failure of azathioprine/6-mercaptopurine therapy, or as a sustained dependency on steroids, reported by physicians in free-text reports. Of note, if the treatment was initially reported as effective, a secondary failure was not considered in our analysis. Proportions of thiopurine therapy failure were compared between vhTPMTa patients and other TPMTa patients.

Statistical analyses
Fisher's exact test and unadjusted logistic regression were used to compare discrete variables. Continuous variables were compared using Student's t-test. The log-rank test was used to evaluate survival curves. We calculated the odds ratios (OR) and 95% confidence intervals (95%CI). Q-Q plots were generated to evaluate the distribution of p-values. The significance threshold was set at p = 0.05. We used the False Discovery Rate (FDR) method to manage multiple testing, with a threshold of 0.2 [44].
Table 1. Description of electronic health records (EHRs) of the thiopurine S-methyltransferase activity tested patients (TPMT cohort) and the control patients.
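To make the testing pipeline concrete, here is a minimal, standard-library-only Python sketch of a two-sided Fisher's exact test on a 2x2 table, the sample odds ratio, and an FDR selection at q = 0.2. It assumes the Benjamini-Hochberg step-up procedure, which the text does not name explicitly, and the sample odds ratio it computes can differ from the conditional estimate reported by some statistical packages. All function names and the example p-values are illustrative, not taken from the study.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio of the 2x2 table [[a, b], [c, d]]."""
    if b * c == 0:
        return float("inf")
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are at most as probable as the observed one.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    total = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )
    return min(1.0, total)


def benjamini_hochberg(p_values, q=0.2):
    """Return a list of booleans: True where the p-value passes the FDR threshold."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * q:
            cutoff_rank = rank  # largest rank meeting its threshold
    significant = [False] * m
    for rank, index in enumerate(order, start=1):
        significant[index] = rank <= cutoff_rank
    return significant


if __name__ == "__main__":
    # Rows: exposed / unexposed group; columns: cases / controls (toy counts).
    print(odds_ratio(3, 1, 1, 3))
    print(fisher_exact_two_sided(3, 1, 1, 3))
    print(benjamini_hochberg([0.001, 0.01, 0.03, 0.2, 0.6], q=0.2))
```

With a threshold of q = 0.2, up to one in five of the retained associations is expected to be a false discovery, which is why the cross-validation against biological test results matters.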

Participants
A total of 554 patients (TPMT cohort) underwent a TPMTa assessment. Of these patients, 460 had ICD-10 codes and at least one biological test result, and a total of 442 patients (the study population) also had a mention of thiopurine treatment in their EHR (Figure 3, Figure S6). These 442 EHRs included 10,252 ICD-10 occurrences and 72,711 results of the selected biological tests (Table 1). Of these patients, 324, representing 6,183 free-text reports, were included in the thiopurine efficacy validation analysis, after exclusion of the patients having fewer than two reports with a mention of thiopurine therapy and patients with an adverse effect or intolerance to thiopurines (Figure 3). Known indications for thiopurine therapy, e.g., Crohn's disease (OR, 699.6; 95%CI, 343.7-1,600; p = 1.73E-263) or ulcerative colitis (OR, 583.1; 95%CI, 237.9-1,843; p = 1.5E-144), and their consequences were significantly associated with the TPMT cohort versus the hospital population (Table S3). No patient with leukemia or an associated pathology was found in the analysis, as there is no hematology department at HEGP.

PheWAS analysis
ICD groupings. Using our ICD-10 based aggregation, the 1,016 distinct ICD-10 codes occurring in the study population EHRs resulted in 156 distinct aggregated codes, including 83 codes with at least 5 occurrences (Table S4). ICD-9-CM mapping aggregation led to 289 distinct aggregated codes, including 94 codes with at least 5 occurrences (Table S5). These 156 and 289 aggregated codes represent 59% and 37% of the aggregated classifications, respectively.
In the vhTPMTa versus other TPMTa analysis, two significant codes for ICD-10 based aggregation were found: diabetes mellitus (p = 0.0009) and nutritional anemia (p = 0.0005). These results agreed with the ICD-9-CM mapping codes (p = 0.0004 and p = 0.0015, respectively) (Figures 4, 5, Tables S6, S7). These results remained significant after FDR multiple-testing correction for the two aggregation methods (Tables S6, S7). The distribution of p-values did not show any systematic bias according to Q-Q plots (Figure S3). In the lowTPMTa versus other TPMTa analysis, no grouping showed statistically significant results after FDR correction (Tables S8, S9, Figures S4, S5).
Biological test results. With the "global approach", the proportion of patients with at least one episode of moderate to severe biological anemia was higher in the vhTPMTa group than in the other TPMTa group: 40.8% versus 26.1% (OR, 1.9; 95%CI, 1.2-3.3; p = 0.01) (Table 2, Figure 6). Analyzing the same groupings, we also found that 13.6% of vhTPMTa patients had an episode of hyperglycemia versus 5.9% in the other TPMTa group (OR, 2.48; 95%CI, 1-6.1; p = 0.046) (Table 3). The "frequency-based approach" confirmed that the mean frequency of moderate to severe biological anemia episodes was higher in the vhTPMTa group than in the other TPMTa group: 18% versus 9% of encounters (p = 0.01) (Table 2). On the other hand, there was no statistically significant difference in the frequency of encounters with hyperglycemia between the two groups (Table 3). With respect to neutropenia, there was no difference between the two groups using the global approach. However, the "frequency-based approach" identified a lower rate of neutropenia in the vhTPMTa group than in the other TPMTa groups (Table 2).

Event-free survival analysis
Event-free survival was evaluated for anemia and hyperglycemia. Patients with vhTPMTa had significantly earlier anemia episodes than the other patients (p = 0.04) (Figure 7). Regarding the development of hyperglycemia, there was no difference between the groups.

Discussion
This study demonstrates the feasibility and benefits of performing a PheWAS on a quantitative trait. Two independent approaches, based on (i) ICD codes and (ii) biological test results, were used to discover pathophysiological features potentially associated with this quantitative trait. In this manner, findings can be cross-validated: the phenotypes extracted from diagnosis codes were confirmed by the biological test results. In this way, using a quantitative trait in the context of pharmacogenomics, we discovered new potential associations between TPMTa related to thiopurine treatment and clinical data.
To our knowledge, this is the first PheWAS performed using data encoded with the ICD-10 classification, as previously published PheWAS were based on ICD-9-CM. The consistency of the results between the two aggregation methods (the ICD-10-based method and the mapping between ICD-9-CM and ICD-10) demonstrates the feasibility of PheWAS using ICD-10. In our study population and using the ICD code distribution described above, ICD-9-CM based PheWAS codes resulted in more informative phenotypes than the ICD-10 based codes. Thus, it appears that the ICD code aggregation level, i.e. the number of code groups, needs to be optimized according to the size of the population. For example, in a larger population, it may be more appropriate to use a fine-grained aggregation based on the three digit codes of ICD-10, resulting in more accurate phenotypes.
The use of a CDW gives the opportunity to combine data from six heterogeneous sources: demographic data from administrative records, diagnosis codes from the billing system, biological test results, drug prescriptions from the CPOE system, free text reports, and clinical data from structured questionnaires. The clinical interpretation of the patient's condition by the physician, encoded with ICD codes, was compared with the biological test results extracted from the laboratory result server. Drug prescriptions were extracted directly from the structured data issued by the CPOE, from structured questionnaires and from free-text reports. The close relationship between thiopurine drug prescriptions and TPMTa assays for therapeutic management was taken into account by incorporating temporal data for this study. Therefore, we restricted our analyses to the events following the initiation of thiopurine therapy.
In addition to patient selection based on TPMTa, biological test results were employed to validate the phenotypes obtained from the ICD code analysis. Thus, we assessed the feasibility of expanding PheWAS to another type of data from the CDW. To that end, classification algorithms were developed to transform continuous test results into discrete classes using value and frequency thresholds. Such algorithms could benefit from semantic web technologies [45], because description logic includes reasoning capabilities. First, the patient's history was considered globally to compare the proportion of patients with an occurrence of an abnormal biological test result between groups. In a second step, we analyzed the number of episodes for a specific biological abnormality, allowing us to compare event frequencies between TPMTa groups.
From a clinical point of view, the analyses using ICD-9-CM- or ICD-10-based groupings and biological test results are consistent, showing more frequent anemia in vhTPMTa patients than in other patients. In inflammatory bowel disease (IBD), anemia is frequently observed and has a multifactorial etiology such as chronic inflammation or iron deficiency caused by enteric bleeding [46]. In addition, myelosuppressive drugs such as thiopurines can cause anemia [47]. In our study, the strong association between iron-deficiency anemia (observed through ICD codes and hemoglobin test results) and vhTPMTa could reflect more active disease in these patients. Moreover, evaluation of the anemia-free duration showed earlier episodes of anemia in the vhTPMTa group compared to other patients. Finally, thiopurine efficacy analysis on free-text reports showed three times as many therapy failure occurrences in the vhTPMTa group, in relation to anemia episodes and an active disease. In addition, an over-representation of diabetes mellitus, identified by ICD-9-CM and ICD-10 mapping analyses, was observed in patients with vhTPMTa. This result was confirmed by glycemia test result analyses, with more patients having hyperglycemia. Onset of type 2 diabetes or glucose intolerance could result from a sustained steroid therapy secondary to thiopurine resistance and active disease in vhTPMTa patients. This finding is strengthened by the weak association with secondary hypertension, also known as a steroid adverse effect. Finally, the higher risk of thiopurine therapy failure in vhTPMTa patients, highlighted by free-text report analysis, is in agreement with sustained steroid therapy, according to IBD therapeutic management. Altogether, these findings suggest that patients with vhTPMTa could have more active disease than the others, leading to more frequent anemia episodes despite thiopurine therapy. These patients may benefit from more intensive thiopurine therapy to maintain remission, spare steroids and lessen common adverse effects.
A limitation of our PheWAS study is that the study design does not distinguish the effect of vhTPMTa itself from a drug effect. A possible approach to assess this point would be to perform a PheWAS on patients with a TPMTa assessment but without thiopurine therapy. However, because TPMTa testing is indicated before starting a thiopurine therapy, to screen TPMT-deficient patients, the HEGP CDW did not contain data to carry out such an analysis. Systematic TPMTa determination for inpatients, in the context of large-scale DNA biobanking, could be valuable for analyzing the impact of vhTPMTa on clinical phenotypes.
The number of patients in our study (n = 442) is relatively small. Previously published PheWAS were mainly based on pooled data or large population based cohorts [7,13,14]. However, despite the size of our study, we obtained statistically significant results, and these were supported by a clinical/biological cross-validation. This cross-validation was followed by a manual in-depth analysis of free-text reports to evaluate the validity of our initial conclusions. Regarding multiple testing issues, Denny et al. used a Bonferroni correction but estimated that it might be too restrictive [7,42,48]. We decided to use FDR because of its tolerance towards autocorrelated tests [49].
Table 2. Results of the low-value case biological test analyses between very high TPMT activity patients and other patients with normal and low TPMT activity. Global approach: a low-value case is defined as at least one occurrence, over the study period, of a biological test result below the low threshold defined in Table 1. Frequency-based approach: for a given patient, the frequency of low-value encounters is defined as the number of encounters with at least one occurrence below the low threshold divided by the number of encounters (mean low-value encounter frequencies are reported). Low-value case analyses have not been performed on alanine aminotransferase, aspartate aminotransferase and gamma glutamyl-transpeptidase test results, as a low threshold is not relevant for these tests. TPMTa: thiopurine S-methyltransferase activity. lowTPMTa: low TPMTa (< 8.5 nmol/h/mL red blood cells); vhTPMTa: very high TPMTa (≥ 15.0 nmol/h/mL red blood cells); nTPMTa: normal TPMTa (in between). doi:10.1371/journal.pcbi.1003405.t002
Given the cross-validation process based on the biological test results: (i) we did not exclude PheWAS codes with small numbers of cases from our analysis as in previous studies; and (ii) we considered the patients who had at least one occurrence of the ICD code p as having the phenotype p, whereas previous studies considered patients as cases when the ICD code was present more than once in the patient record (a minimum of two or even four occurrences of the same code) [8,11,12,16].
To be used as a selection criterion, a quantitative trait should be stable over the period of phenotype analysis. Like all enzymes, TPMT can be influenced by physiological factors (e.g., pregnancy) or co-treatments [29,50]. In our study, TPMTa was stable over the analysis period. To extend this method to other quantitative traits, this stability over time must be checked.
Regarding our ICD and biological test result analyses, it could be valuable to extend it to other retrospective cohorts or CDW. Finally, the implementation of a large prospective study, including patients treated by thiopurine according to their TPMTa, could Figure 6. Pseudo-Manhattan plot of 2log10 (P-values) for the 11 biological tests between very high TPMT activity patients and other TPMT activity patients. Using the global approach, a high-value case, resp. low-value case, is defined as at least one occurrence of a biological test result above, resp. below, the high or low threshold. Low-value case analyses have not been performed on alanine aminotransferase, aspartate aminotransferase and gamma glutamyl-transpeptidase test results, as a low threshold is not relevant for these tests. The dotted line represents a P-value of 0.05. Grey triangles represent the results above the high threshold and black triangles represent the results below the low threshold. doi:10.1371/journal.pcbi.1003405.g006 Table 3. Results of the high-value case biological test analyses between very high TPMT activity patients and other patients with normal and low TPMT activity. Global approach: a low-value case is defined as at least one occurrence, over the study period, of biological test result below the low threshold defined in Table 1. Frequency-based approach: for a given patient, the frequency of low-value encounters is defined as the number of encounters with at least one occurrence below the low threshold divided by the number of encounters (mean low-value encounter frequencies are reported help to confirm our findings regarding vhTPMTa and thiopurine therapy failure associated with steroid side effects, and to develop further research. We described here an original method to perform a PheWAS analysis on a quantitative trait, TPMTa, using both ICD-10 diagnosis codes and biological test results to identify associated phenotypes. 
This study highlighted a potential association between very high TPMT activity and signs that could be associated with a failure of thiopurine therapy and sustained steroid requirements in IBD patients. In the field of pharmacogenomics, PheWAS may allow the description of new subgroups of patients who need personalized clinical and therapeutic management.

Figure S1 Comparison between Genome Wide Association Studies (GWAS) and Phenome Wide Association Studies (PheWAS). SNP: single nucleotide polymorphism. A. GWAS: a group of patients with a selected phenotype (i.e. disease) is compared to a control group. All the genomic data available are screened to find systematic genomic differences between the groups. B. PheWAS: a group of patients with a selected allele or SNP on a particular gene is compared to a control group with different alleles on the same gene. All the phenotypic data available are screened to find systematic phenotypic differences between the groups.

Table S1 Thresholds for biological test result analyses. Thresholds have been defined according to the normal value ranges of the hospital laboratory. One test result occurrence below the low or above the high threshold defines a low-value case or a high-value case, respectively. * One neutrophil count below the low threshold of 1.0 G/L defines a neutropenia [51]. ** One hemoglobin test result below the low threshold of 9.0 g/100 mL defines a moderate to severe biological anemia [52]. *** Specifically for glycemia, a high-value case (hyperglycemia) is defined by two test result occurrences above the high threshold [53]. (DOCX)

Results of the Phenome-wide association study (PheWAS) between very high TPMT activity patients and other TPMT activity patients for the ICD-9-CM mapping aggregation. The ICD-9-CM mapping aggregation corresponds to 771 groups of codes. Associations are assessed using logistic regression. Only PheWAS codes with a p-value < 0.05 are reported here. The q value for false discovery rate (FDR) was q = 0.2. The p-value must be under the calculated FDR threshold to be considered as significant. TPMTa: thiopurine S-methyltransferase activity. Low TPMTa: <8.5 nmol/h/mL red blood cells; Very high TPMTa: ≥15.0 nmol/h/mL red blood cells; Normal TPMTa: in between.

(DOCX)
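The FDR criterion used in these tables can be illustrated with a short sketch. The correction procedure is not named in this section, so the code below assumes the Benjamini-Hochberg step-up rule; the function name `bh_significant` and the example p-values are hypothetical, and only the level q = 0.2 is taken from the text.

```python
def bh_significant(p_values, q=0.2):
    """Flag p-values that pass an assumed Benjamini-Hochberg FDR rule at level q.

    Returns a list of booleans in the same order as p_values.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value is at or below (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank

    # Every p-value ranked at or before k_max is declared significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            significant[idx] = True
    return significant


# Hypothetical p-values for four code groups (not results from the study).
example = [0.001, 0.04, 0.03, 0.5]
flags = bh_significant(example, q=0.2)
```

With a step-up rule, a code group can be retained even when its own p-value exceeds its rank-specific threshold, provided a larger p-value further down the ranking passes its threshold.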
Table S7 Results of the Phenome-wide association study (PheWAS) between very high TPMT activity patients and other TPMT activity patients for the ICD-10 based aggregation. The ICD-10 based aggregation corresponds to 256 groups of codes. Only PheWAS codes with a p-value < 0.05 are reported here. Associations are assessed using logistic regression. The q value for false discovery rate (FDR) was q = 0.2. The p-value must be under the calculated FDR threshold to be considered as significant. TPMTa: thiopurine S-methyltransferase activity. Low TPMTa: <8.5 nmol/h/mL red blood cells; Very high TPMTa: ≥15.0 nmol/h/mL red blood cells; Normal TPMTa: in between. (DOCX)

Table S8 Results of the Phenome-wide association study (PheWAS) between low TPMT activity patients and other TPMT activity patients for the ICD-9-CM mapping aggregation. The ICD-9-CM mapping aggregation corresponds to 771 groups of codes. Only PheWAS codes with a p-value < 0.05 are reported here. Associations are assessed using logistic regression. The q value for false discovery rate (FDR) was q = 0.2. The p-value must be under the calculated FDR threshold to be considered as significant. TPMTa: thiopurine S-methyltransferase activity. Low TPMTa: <8.5 nmol/h/mL red blood cells; Very high TPMTa: ≥15.0 nmol/h/mL red blood cells; Normal TPMTa: in between. (DOCX)

Table S9 Results of the Phenome-wide association study (PheWAS) between low TPMT activity patients and other TPMT activity patients for the ICD-10 based aggregation. The ICD-10 based aggregation corresponds to 256 groups of codes. Only PheWAS codes with a p-value < 0.05 are reported here. Associations are assessed using logistic regression. The q value for false discovery rate (FDR) was q = 0.2. The p-value must be under the calculated FDR threshold to be considered as significant.

(DOCX)
Table S10 Results of the low-value case biological test analyses between low TPMT activity patients and other patients with normal and very high TPMT activity. Global approach: a low-value case is defined as at least one occurrence, over the study period, of biological test result below the low threshold defined in Table 1. Frequency-based approach: for a given patient, the frequency of low-value encounters is defined as the number of encounters with at least one occurrence below the low threshold divided by the number of encounters (mean low-value encounter frequencies are reported). Low-value case analyses have not been performed on alanine aminotransferase, aspartate aminotransferase and gamma glutamyl-transpeptidase test results, as a low threshold is not relevant for these tests.

(DOCX)
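The global and frequency-based approaches defined in this legend can be sketched as follows. The data layout (one list of test results per encounter), the function name `low_value_summary` and the example values are assumptions made for illustration; the 9.0 g/100 mL hemoglobin threshold is the one quoted for Table S1.

```python
def low_value_summary(encounters, low_threshold):
    """Summarize one patient's results for one biological test.

    encounters: list of encounters, each a list of numeric test results.
    Returns (is_low_value_case, low_value_encounter_frequency).
    The frequency is None when the patient has no encounter.
    """
    # An encounter is flagged when at least one result is below the threshold.
    flagged = [any(value < low_threshold for value in results)
               for results in encounters]

    # Global approach: at least one occurrence over the whole study period.
    is_case = any(flagged)

    # Frequency-based approach: flagged encounters / all encounters.
    frequency = sum(flagged) / len(encounters) if encounters else None
    return is_case, frequency


# Hypothetical hemoglobin results (g/100 mL) over four encounters.
hemoglobin = [[12.1, 8.7], [11.0], [8.2], [10.5]]
case, freq = low_value_summary(hemoglobin, low_threshold=9.0)
```

The global approach collapses the whole follow-up into a single yes/no flag per patient, whereas the frequency keeps information on how often the abnormality recurs across encounters.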
Table S11 Results of the high-value-case biological test analyses between low TPMT activity (lowTPMTa) patients and other patients. Global approach: a high-value case is defined as at least one occurrence, over the study period, of biological test result above the high threshold defined in Table 1. Frequency-based approach: for a given patient, the frequency of low-value encounters is defined as the number of encounters with at least one occurrence below the low threshold divided by the number of encounters (mean low-value encounter frequencies are reported). TPMTa: thiopurine S-methyltransferase activity. low-TPMTa: low TPMTa (<8.5 nmol/h/mL red blood cells); vhTPMTa: very high TPMTa (≥15.0 nmol/h/mL red blood cells); nTPMTa: normal TPMTa (in between). (DOCX)
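The grouping of patients on the quantitative trait used throughout these legends can be written as a small helper. This is a minimal sketch that reads the cut-offs as below 8.5 and at or above 15.0 nmol/h/mL red blood cells; the constant names and the function `tpmta_group` are hypothetical.

```python
# Cut-offs in nmol/h/mL red blood cells, as read from the table legends.
LOW_CUTOFF = 8.5
VERY_HIGH_CUTOFF = 15.0


def tpmta_group(activity):
    """Assign a TPMT activity value to the low, normal or very high group."""
    if activity < LOW_CUTOFF:
        return "low"
    if activity >= VERY_HIGH_CUTOFF:
        return "very high"
    return "normal"


# Hypothetical activity values, one per patient.
groups = [tpmta_group(a) for a in (6.2, 8.5, 12.0, 15.0)]
```

Each PheWAS comparison then contrasts one group (for example, very high TPMTa) against all remaining patients.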