A Genome-Wide Association Study of Red Blood Cell Traits Using the Electronic Medical Record

Background: The Electronic Medical Record (EMR) is a potential source for high throughput phenotyping to conduct genome-wide association studies (GWAS), including those of medically relevant quantitative traits. We describe use of the Mayo Clinic EMR to conduct a GWAS of red blood cell (RBC) traits in a cohort of patients with peripheral arterial disease (PAD) and controls without PAD.
Methodology and Principal Findings: Results for hemoglobin level, hematocrit, RBC count, mean corpuscular volume, mean corpuscular hemoglobin, and mean corpuscular hemoglobin concentration were extracted from the EMR from January 1994 to September 2009. Out of 35,159 RBC trait values in 3,411 patients, we excluded 12,864 values in 1,165 patients that had been measured during hospitalization or in the setting of hematological disease, malignancy, or use of drugs that affect RBC traits, leaving a final genotyped sample of 3,012, 80% of whom had ≥2 measurements. The median of each RBC trait was used in the genetic analyses, which were conducted using an additive model that adjusted for age, sex, and PAD status. We identified four genomic loci that were associated (P < 5×10⁻⁸) with one or more of the RBC traits (HBS1L/MYB on 6q23.3, TMPRSS6 on 22q12.3, HFE on 6p22.1, and SLC17A1 on 6p22.2). Three of these loci (HBS1L/MYB, TMPRSS6, and HFE) had been identified in recent GWAS, and the allele frequencies, effect sizes, and directions of effect of the replicated SNPs were similar to those in the prior studies.
Conclusions: Our results demonstrate the feasibility of using the EMR to conduct high throughput genomic studies of medically relevant quantitative traits.


Introduction
As costs of genotyping continue to drop, accurate phenotyping is emerging as the rate-limiting step for conducting genomic studies. Consequently, there is considerable interest in leveraging the electronic medical record (EMR) for high-throughput phenotyping of diseases and medically relevant traits. Repositories of DNA from patients seen in the clinical setting can be matched with the EMR and genotyping/sequencing conducted to identify genetic variants associated with human diseases as well as related quantitative traits. Such an approach may reduce the time, effort, and cost involved in conducting genomic studies to identify disease susceptibility loci.
In 2007, the National Human Genome Research Institute (NHGRI) funded the Electronic Medical Records and Genomics (eMERGE) consortium to develop and implement approaches for leveraging biorepositories with EMR systems for large-scale genomic research, including but not limited to genome-wide association studies (GWAS), sequencing, and structural variation [1]. The five participating sites are Group Health Cooperative/University of Washington, Marshfield Clinic, Mayo Clinic, Northwestern University, and Vanderbilt University. Each site chose to conduct a GWAS of a primary and a supplementary phenotype. The Mayo Clinic proposal aims to identify genetic loci associated with peripheral arterial disease (PAD) and red blood cell (RBC) traits including hemoglobin, hematocrit, RBC count, mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), and mean corpuscular hemoglobin concentration (MCHC).
Disorders involving RBCs, including anemia and polycythemia, have been associated with adverse cardiovascular outcomes as well as hypertension and heart failure [2,3,4,5]. Prior studies indicate that RBC traits have a substantial genetic component with heritabilities of 0.56, 0.52, and 0.52 reported for RBC count, MCV, and MCH, respectively [6]. A genome-wide linkage scan in the Framingham Heart Study noted a significant linkage signal for RBC count (chromosomes 12p13 and 19p13), MCV (chromosome 11p15), and MCH (chromosome 11p15) [6]. Recently, the results of several GWAS for RBC traits in populations of European ancestry were reported [7,8,9,10], with over 20 quantitative trait loci (QTL) identified. The objective of the present study was to assess the feasibility of leveraging the EMR to conduct a GWAS of quantitative traits, using RBC traits as an example. We investigated whether the QTL identified in recent GWAS of RBC traits [7,8,9,10] could be replicated using trait values derived from the EMR. We first developed and validated an algorithm based on billing codes and natural language processing (NLP) of unstructured clinical notes, to exclude RBC trait values that may have been affected by comorbidities, marrow/immune suppressing medications, or major surgery. We then undertook a GWAS for RBC traits extracted from the Mayo Clinic EMR [11].

Characteristics of participants
A total of 3,487 patients (PAD cases and controls) were recruited through 09/30/2009 for the Mayo Clinic eMERGE study. Figure S1A illustrates the process of extraction of RBC traits from the EMR. In total, 10 fields were extracted for each individual (Figure S1B). After selecting records by the unique test code for each RBC trait and excluding RBC values obtained during hospitalization, 3,411 patients remained. Since the RBC traits are measured together as part of the complete blood count, the numbers of participants and laboratory tests were similar for the six RBC traits, and multiple measurements for each RBC trait were available in most individuals (Figure S1C).

Assessment of comorbidities and medications that can affect RBC traits
We excluded 12,864 records and 200 individuals based on the algorithm shown in Figure 1 and described in detail in Tables S1-S5. As a result, 3,012 genotyped patients with 20,650 values were included in the association analyses. We selected 50 sets of RBC trait values and performed a manual review of the EMR to assess whether any of the exclusion criteria were present at the time of the blood draw for these values. No exclusionary criteria were present at the time of the blood draws, thereby validating the algorithm. Characteristics for 3,012 individuals grouped by PAD status are summarized in Table 1.

GWAS of RBC traits
The distribution of the number of measurements for each RBC trait is shown in Figure S2; ~20.6% of individuals had only one laboratory test and >95% had ≤20 laboratory tests. For individuals with multiple measurements, the median value was used in the analyses, which were performed in PLINK [12] under an additive model that adjusted for sex, age, and PAD status. We identified 11 significant SNPs (ie, P < 5×10⁻⁸) within four genomic regions that were associated with four RBC traits. Quantile-quantile plots for the QTL for six RBC traits are shown in Figure 2, and Manhattan plots for the QTL are shown in Figure 3. Table 2 summarizes the chromosomal location, minor allele (minor allele frequency), effect size by the minor allele, variance explained by the associated loci, and P value for these SNPs. The variance of RBC traits explained by the associated SNPs ranged from 0.7% to 2.2%.
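The per-individual summarization described above (one median per trait per person) can be sketched as follows. This is a minimal illustration, not the study's actual pipeline; the function name and the `(patient_id, value)` record layout are assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def median_trait_per_patient(records):
    """Collapse repeated measurements of one trait to a single median per patient.

    `records` is an iterable of (patient_id, value) pairs; this layout is
    hypothetical and chosen only for illustration.
    """
    by_patient = defaultdict(list)
    for patient_id, value in records:
        by_patient[patient_id].append(value)
    # A patient with a single measurement simply keeps that value.
    return {pid: median(vals) for pid, vals in by_patient.items()}
```

With an even number of measurements, `statistics.median` returns the mean of the two middle values, which is one reasonable convention; the source does not say which convention was used.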

Replication of significant loci identified in prior GWAS for RBC traits
We compared our results with recently reported GWAS of RBC traits in subjects of European ancestry [7,8,9,10]. We were able to replicate three loci identified in these studies (Table 3). The minor allele frequencies in our study were similar to those in the HapMap CEU population. The direction of allele effects was consistent across the studies. Although the effect sizes (ie, regression coefficients) varied across different studies, the effect sizes in our study were similar to effect sizes in at least one of the prior studies. In order to compare the results among different studies, we plotted the distribution of P values and patterns of linkage disequilibrium (LD) along these genomic regions (Figure 4). The SNP rs4895441 within HBS1L/MYB (chromosome 6q23.3) has been found to be associated with MCV [7], and the SNP rs9402686 [in high LD with rs4895441 (HapMap CEU r² = 0.953)] identified by Soranzo et al. [8] was also associated with MCV. The SNP rs9483788 (r² = 0.602 with rs4895441) within this genomic region was associated with RBC count [7]. These SNPs seem to be located within an LD block (Figure 4A), close to HBS1L. In addition, we found this locus to be associated with MCH (P = 3.1×10⁻¹⁴), a finding not observed in previous studies.

Discussion
The EMR contains diverse and rich phenotypic information, and DNA repositories linked to the EMR allow rapid assembly of patient sets for genomic studies. However, the utility of EMR-based approaches for discovery or validation of genotype-phenotype associations remains unproven. In the present study, we demonstrate that a biorepository matched to the EMR can be leveraged to conduct a GWAS of RBC traits. We extracted RBC trait values over a span of 15 years from the EMR, and used a billing code and NLP-based algorithm to exclude values that may have been affected by comorbidity, medication use, or major surgery. We identified 11 unique significant SNPs (P < 5×10⁻⁸) within four genomic loci associated with four RBC traits. Of these, three genomic loci (ie, HBS1L/MYB, TMPRSS6, and HFE) recently identified as being associated with RBC traits were replicated, highlighting that phenotypes extracted from the EMR can be used for GWAS of quantitative traits. The fourth genomic locus, SLC17A1, a gene involved in the sodium-phosphate cotransport system in the kidney, is a novel locus that we found to be associated with MCH.
Application of the GWAS approach to quantitative traits obtained from the EMR presents several challenges [1,13]. Data integration from the EMR often requires querying across different data sources using different information extraction procedures [14]. In the present study, we used several separate data sources across the Mayo EMR (Figure S1A) to ensure the accuracy and completeness of the RBC trait values, making it feasible to conduct the GWAS. An additional challenge in using the EMR for genomic studies is assessment of comorbidities and medications that can affect the trait of interest. We used an algorithm that combined billing codes to identify comorbidities, procedure codes to identify surgeries associated with blood loss, and NLP to identify relevant medications, while retaining a sufficiently large sample size (Figure 1).
A remarkable aspect of our study is that we were able to identify 11 SNPs in 4 loci influencing RBC traits at a genome-wide significance level using EMR-derived phenotypic data in only 3,012 patients. In spite of comorbidities such as chronic kidney disease and chronic obstructive lung disease that can affect RBC traits in PAD patients, we were able to replicate loci associated with RBC traits in prior cohort studies. Three of the four loci had been recently identified in GWAS that included much larger numbers of participants. Although we did not replicate all genomic loci from these prior studies, the loci we detected are the only ones that were found in at least two previous studies. Our findings are encouraging from the viewpoint of using the EMR for genomic studies. Compared with the previous studies of RBC traits [7,8,9,10], the directions of the allele effects in our study were the same and the effect sizes were comparable (Table 3). The variance explained by the associated loci ranged from ~1% to 2%, similar to the prior studies.
The molecular functions of the four genomic regions that were associated with RBC traits are summarized in Table 4. In addition to regulating fetal globin expression [15], HBS1L/MYB may have additional roles in erythropoiesis [16]. TMPRSS6 is a type II membrane-anchored serine protease that is involved in matrix remodeling processes in the liver [17], and is essential for normal iron homeostasis [18]. HFE and transferrin directly compete for binding to the transferrin receptor, thereby lowering its affinity for iron-containing transferrin and down-regulating uptake of iron by cells [19]. SLC17A1 plays an important role in phosphate homeostasis in animals and humans; how variants in this gene might influence MCH needs further investigation [20]. Of note, an intronic SNP rs17270561 (HapMap CEU r² = 0.51 with rs17342717) within SLC17A1 was found to be associated with transferrin saturation (P = 5×10⁻⁸) by Benyamin et al. [21].

Limitations
A limitation of the use of the EMR in genomic studies is the potential for selection and referral bias. Considerable effort may be needed to develop and validate phenotyping algorithms. The present study required a combined approach of NLP to identify prescribed medications and billing codes to exclude RBC values that might have been affected by chronic disease or medication use, while retaining a sufficiently large sample size. How well the genetic architecture of quantitative traits can be delineated from EMR-based genomic studies may vary with the trait of interest and will be influenced by trait heritability, variance in trait values, and how comorbidities affect trait values. In the present study, replication may have been made easier by the fact that measurement of RBC traits is relatively precise in the clinical setting, that trait values are stable over time and may be relatively less affected by the acute phase response, and that the traits have relatively high heritability. Additional GWAS of several quantitative traits are currently in progress within the eMERGE consortium, and will provide further insights in this regard.

Future directions
The present study lays the groundwork for a GWAS of RBC traits across the five eMERGE sites (n ≈ 17,000). We anticipate detection of additional novel genetic loci influencing RBC traits in the consortium-wide analyses. Although the availability of multiple measurements of a trait within the EMR may provide a more precise estimate of the trait value as well as change in trait value over time, it is not clear how to deal with multiple measurements in GWAS analyses. We are investigating the statistical power of different regression methods in dealing with multiple measurements. Finally, consistent with the goals of the eMERGE network, we are developing phenotyping algorithms to enable EMR-based genomic studies of other medically relevant quantitative traits and assessing the extent to which the algorithms are portable across EMR systems.
In conclusion, we demonstrate the use of the EMR to replicate genetic loci associated with inter-individual variation in RBC traits in prior cohort studies. As genotyping costs continue to decrease, phenotyping is emerging as the major bottleneck for identifying genetic loci influencing disease susceptibility or variation in medically relevant quantitative traits. Mining of the EMR is a high throughput, relatively inexpensive method to facilitate genetic studies of quantitative traits. Increasing use of the EMR affords an opportunity to expedite the investigation of the genetic architecture of common and rare diseases as well as quantitative traits of medical importance.

Study participants
In October 2006, a biorepository of plasma and DNA samples was initiated by recruiting patients referred for lower extremity arterial evaluation to the Mayo Clinic's non-invasive vascular laboratory and individuals referred to the stress ECG laboratory to screen for coronary artery disease. Between October 2006 and May 2009, 3,527 patients were recruited. We used the following criteria to define presence of PAD: 1) an ankle brachial index (ABI) ≤0.9 at rest or 1 min after exercise; or 2) presence of poorly compressible arteries; or 3) normal ABI but prior history of revascularization for PAD [22]. All participants gave their written informed consent for participation in the study and the use of their data for future research. The study protocol was approved by the Institutional Review Board of the Mayo Clinic.
The Mayo EMR began accumulating data in the early 1990s [23] and now includes all inpatient and outpatient billing codes, laboratory values, reports, and clinical documentation, almost all in electronic formats available for searching [11]. It currently contains over 120 million documents on ~2 million patients. Patient-level data elements in the Mayo EMR included demographics, outpatient visits and hospitalizations, providers, diagnosis and procedure codes, and RBC trait values. Birth date, race, sex, and ethnicity were obtained from the demographic database; the categories for race were 'White,' 'Black or African American,' 'Hispanic,' 'Asian/Pacific Islander,' 'American Indian/Alaskan Native,' 'Others,' 'Unknown,' and 'Choose not to disclose.'

RBC traits
The complete blood count is a commonly performed laboratory test [24] and includes the following RBC traits: (1) hemoglobin level, the concentration of hemoglobin within whole blood; (2) hematocrit, the percentage of whole blood comprising cellular erythrocyte elements; (3) RBC count, the number of red blood cells per volume of blood; (4) mean corpuscular volume (MCV), the average erythrocyte volume; (5) mean corpuscular hemoglobin (MCH), the average mass of hemoglobin per RBC in a sample of blood; and (6) mean corpuscular hemoglobin concentration (MCHC), the concentration of hemoglobin in a given volume of packed RBCs.

Data integration from the EMR
To extract data for RBC traits, we used separate relational databases as well as semi-structured data sources in the Mayo EMR. A schematic depicting extraction of RBC traits from the EMR is shown in Figure S1A. The data extracted for the period 01/01/1994 to 09/30/2009 included the test code and description, date and time of sample, units of results, associated reference range and indicators for low/high results, lab accession number, and results of the test in both character and numeric format (Figure S1B). Any RBC trait values obtained during an inpatient hospitalization (admit date ≤ sample date ≤ discharge date) were excluded unless these were the only tests available for a patient.
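The inpatient exclusion rule above (admit date ≤ sample date ≤ discharge date, with a fallback when a patient has no other tests) can be sketched as below. This is only an illustration of the stated rule; the function name and the tuple-based data layout are assumptions, not the structure of the Mayo EMR extract.

```python
from datetime import date


def drop_inpatient_values(values, stays):
    """Remove lab values drawn during an inpatient stay.

    values: list of (sample_date, result) tuples for one patient and one trait.
    stays:  list of (admit_date, discharge_date) tuples for the same patient.
    Both layouts are hypothetical and used only for illustration.
    """
    def is_inpatient(sample_date):
        # Inclusive on both ends, matching admit <= sample <= discharge.
        return any(admit <= sample_date <= discharge for admit, discharge in stays)

    outpatient = [v for v in values if not is_inpatient(v[0])]
    # Fallback stated in the text: keep inpatient values when they are
    # the only tests available for the patient.
    return outpatient if outpatient else list(values)
```

For example, a patient with one outpatient value and one value drawn mid-admission keeps only the outpatient value, whereas a patient whose only value was drawn during an admission keeps it.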

Assessment of comorbidities and medications that can affect RBC traits
Since RBC traits are affected by a wide array of medical conditions, we developed an EMR-based algorithm that includes billing codes and NLP of unstructured clinical notes to exclude values affected by comorbidities, medications, or blood loss (Figure 1, Tables S1-S5, and Methods S1). We compiled the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM), procedural ICD-9, and Current Procedural Terminology (CPT-4) codes indicative of clinical conditions that may affect RBC traits. The medical conditions included hematologic and solid-organ malignancies, bone marrow and solid-organ transplantation, cirrhosis, hereditary anemias, and malabsorption disorders. The medications included chemotherapeutic and immunosuppressive drugs. The algorithm is described in detail in the supplementary materials. Out of 35,159 RBC trait values in 3,411 patients, we excluded 12,864 values (in 1,165 patients) that had been measured during hospitalization or in the setting of hematological disease, malignancy, or use of drugs that affect RBC traits. As a result, 200 patients were excluded from the analyses.
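To illustrate the billing-code part of such an algorithm, the sketch below flags diagnosis codes that fall under a set of exclusionary ICD-9-CM categories. The three category prefixes are illustrative examples only (hereditary hemolytic anemias, chronic liver disease and cirrhosis, intestinal malabsorption); they are not the code lists in Tables S1-S5, and the time windows, procedure codes, and NLP medication step of the actual algorithm are not modeled here.

```python
# Illustrative ICD-9-CM category prefixes; NOT the study's code lists.
ILLUSTRATIVE_PREFIXES = (
    "282",  # hereditary hemolytic anemias
    "571",  # chronic liver disease and cirrhosis
    "579",  # intestinal malabsorption
)


def exclusionary_codes(patient_codes, prefixes=ILLUSTRATIVE_PREFIXES):
    """Return the patient's billing codes that match any exclusion prefix.

    `patient_codes` is a list of ICD-9-CM code strings such as "282.5".
    An empty result means no exclusionary diagnosis was found.
    """
    # str.startswith accepts a tuple of prefixes.
    return [code for code in patient_codes if code.startswith(prefixes)]
```

Returning the matched codes rather than a bare boolean makes it easier to audit why a value was flagged, which is useful for the kind of manual chart review used to validate the algorithm.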

Association analyses
We used the median of a trait value when multiple results were available. Genotyping was performed at the Center for Genotyping and Analysis at the Broad Institute, using the Illumina Human660W-Quadv1_A genotyping platform, consisting of 561,490 SNPs and 95,876 intensity-only probes. Data were cleaned using the quality control (QC) pipeline developed by the eMERGE Genomics Working Group. This process includes evaluation of sample and marker call rate, gender mismatch and anomalies, duplicate and HapMap concordance, batch effects, Hardy-Weinberg equilibrium, sample relatedness, and population stratification. A total of 489,421 SNPs were used for analysis based on the following QC criteria: SNP call rate >98%, sample call rate >98%, minor allele frequency >0.05, Hardy-Weinberg equilibrium P > 0.001, 99.99% concordance rate in duplicates, and unrelated samples only. We excluded 11 samples with labeling errors. The data from all the patients, in addition to the HapMap III populations, were evaluated for population structure/substructure using EIGENSTRAT software [25], and those who were not in the European cluster were excluded (n = 42). After QC steps, 3,012 samples with phenotype and genotype data were available for association analyses (Figure S3).
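The per-SNP thresholds listed above (call rate >98%, minor allele frequency >0.05, Hardy-Weinberg P > 0.001) can be expressed as a simple filter. The sketch below is an illustration under stated assumptions rather than the eMERGE QC pipeline: the function name and genotype-count inputs are hypothetical, and the Hardy-Weinberg P value is computed with a 1-df chi-square goodness-of-fit test, whereas the source does not say which test was used.

```python
import math


def snp_passes_qc(n_aa, n_ab, n_bb, n_missing,
                  min_call_rate=0.98, min_maf=0.05, min_hwe_p=0.001):
    """Apply call-rate, MAF, and Hardy-Weinberg filters to one SNP.

    Inputs are genotype counts (AA, AB, BB) and the number of missing calls.
    """
    n_called = n_aa + n_ab + n_bb
    if n_called == 0:
        return False
    call_rate = n_called / (n_called + n_missing)

    freq_a = (2 * n_aa + n_ab) / (2 * n_called)
    maf = min(freq_a, 1.0 - freq_a)
    if call_rate <= min_call_rate or maf <= min_maf:
        return False  # also guards against zero expected counts below

    # Expected genotype counts under Hardy-Weinberg equilibrium.
    expected = (n_called * freq_a ** 2,
                2 * n_called * freq_a * (1.0 - freq_a),
                n_called * (1.0 - freq_a) ** 2)
    chi2 = sum((obs - exp) ** 2 / exp
               for obs, exp in zip((n_aa, n_ab, n_bb), expected))
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    hwe_p = math.erfc(math.sqrt(chi2 / 2.0))
    return hwe_p > min_hwe_p
```

The sample-level filters (sample call rate, relatedness, ancestry clustering) operate across SNPs and are outside the scope of this per-SNP sketch.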
Single-locus tests of association were performed in PLINK using linear regression analysis that assumed an additive genetic model and incorporated age, sex, and PAD case-control status as covariates [12]. To assess population structure, we examined the genomic control inflation factor (λGC) for the six RBC traits, and found these values to be below 1.020 without systematic inflation: 1.014 (hemoglobin), 1.017 (hematocrit), 1.007 (RBC count), 1.007 (MCV), 1.004 (MCH), and 1.016 (MCHC). After correcting for population structure using λGC, the significant loci identified in the present study remained at P < 5×10⁻⁸. The power of our study was ~85% to detect a QTL that explains 1.5% of the variance in an RBC trait, given a sample size of 3,000, a minor allele frequency of 0.05, and a significance level of 5×10⁻⁸. The data for the consortium-wide analyses of RBC indices will be uploaded to dbGaP (www.ncbi.nlm.nih.gov/gap).

Figure S1. (A) Schematic diagram of extracting data on RBC parameters from the EMR. (B) The structure of the data extracted from the EMR. The test description covers the six RBC traits. (C) Summary of the RBC traits in the extracted data. MCV, mean corpuscular volume; MCH, mean corpuscular hemoglobin; MCHC, mean corpuscular hemoglobin concentration.

Author Contributions