Expression Quantitative Trait Loci (eQTL) Mapping in Puerto Rican Children

Background Expression quantitative trait loci (eQTL) have been identified using tissue or cell samples from diverse human populations, thus enhancing our understanding of regulation of gene expression. However, few studies have attempted to identify eQTL in racially admixed populations such as Hispanics. Methods We performed a systematic eQTL study to identify regulatory variants of gene expression in whole blood from 121 Puerto Rican children with (n = 63) and without (n = 58) asthma. Genome-wide genotyping was conducted using the Illumina Omni2.5M Bead Chip, and gene expression was assessed using the Illumina HT-12 microarray. After completing quality control, we performed a pair-wise genome analysis of ~15 K transcripts and ~1.3 M SNPs for both local and distal effects. This analysis was conducted under a regression framework adjusting for age, gender and principal components derived from both genotypic and mRNA data. We used a false discovery rate (FDR) approach to identify significant eQTL signals, which were next compared to top eQTL signals from existing eQTL databases. We then performed a pathway analysis for our top genes. Results We identified 36,720 local pairs in 3,391 unique genes and 1,851 distal pairs in 446 unique genes at FDR <0.05, corresponding to unadjusted P values lower than 1.5x10-4 and 4.5x10-9, respectively. A significant proportion of genes identified in our study overlapped with those identified in previous studies. We also found an enrichment of disease-related genes in our eQTL list. Conclusions We present results from the first eQTL study in Puerto Rican children, who are members of a unique Hispanic cohort disproportionately affected with asthma, prematurity, obesity and other common diseases. Our study confirmed eQTL signals identified in other ethnic groups, while also detecting additional eQTLs unique to our study population. The identified eQTLs will help prioritize findings from future genome-wide association studies in Puerto Ricans.


Introduction
Identification of sources of variability in gene expression is a critical step in our understanding of biological mechanisms of common diseases such as asthma. In humans, transcript abundance is largely regulated by genetic variants.
Expression quantitative trait loci (eQTL) are defined as genomic loci that regulate expression of mRNA. eQTLs provide insight into gene networks and disease pathogenesis, and help enhance and interpret results from genome-wide association studies (GWASs). Concurrent data on genome-wide (GW) genotypes and mRNA allow us to identify eQTLs, ultimately enhancing our interpretation of GWASs of complex diseases by helping identify functional variants [1][2][3]. On the basis of the physical distance between a SNP and a transcript, there are two types of eQTL effects: local and distal, which are alternatively called cis and trans. A local effect is identified when a regulatory SNP is within certain genomic distance (e.g. 1Mb) of the transcript starting site (TSS), and implies a direct regulatory effect of the SNP. In contrast, a distal effect is identified with a regulatory SNP that is much further from a TSS, or on a different chromosome, which implies that the eQTL acts through an intermediary transcription factor. To date, few eQTL analyses have been conducted in Hispanic subgroups, and no such study has been reported in Puerto Ricans.
Puerto Ricans are a racially admixed population, with one study reporting average European, African and Native American ancestries of 64%, 24%, and 12%, respectively [4]. For reasons that are unclear but likely to include genetic susceptibility, Puerto Ricans are disproportionately affected with diseases such as obesity, prematurity and asthma [5,6]. Understanding eQTL patterns in Puerto Ricans should help us elucidate both regulation of gene expression and disease causation in racially admixed populations in general, and in Hispanic subgroups in particular.
In this report, we present a comprehensive survey of genome-wide eQTL mapping using both GW genotypic data and GW whole-blood mRNA in 121 Puerto Rican children who participated in a study of asthma. We thus deliver the first eQTL database in Puerto Ricans, while also reporting findings suggestive of disease-enriched eQTL.

Subject recruitment and study protocol
From March of 2009 to June of 2010, children in San Juan were chosen from randomly selected households, as previously described [4]. In brief, households in the Standard Metropolitan Area of San Juan were selected by a multistage probability sample design. Primary sampling units (PSUs) were randomly selected neighborhood clusters based on the 2000 U.S. census, and secondary sampling units were randomly selected households within each individual PSU. A household was eligible if !1 resident was a child 6-14 years old. A total of 6,401 households were contacted: 1,111 had !1 child who met inclusion criteria other than age (four Puerto Rican grandparents and residence in the same household for !1 year). Of these 1,111 households, 438 had !1 eligible child with asthma (defined as physician-diagnosed asthma and wheeze in the prior year). From these 438 households, one child with asthma was selected (at random if there was more than one such child). Similarly, only one child without asthma was randomly selected from the remaining 673 households. In order to reach a target sample size of 700 children, we attempted to enroll 783 of the 1,111 eligible children. Parents of 105 of these 783 children refused to participate or could not be reached, leaving 678 study participants (351 children with asthma and 327 control subjects); DNA was extracted from blood samples collected in 592 children, of whom 583 had sufficient DNA for genome-wide genotyping. Total RNA was extracted from additional whole-blood samples collected in PaxGene tubes for the last 142 study participants (71 children with asthma and 71 control subjects).
Written parental consent was obtained for participating children, from whom written assent was also obtained. The study was approved by the Institutional Review Boards of the University of Puerto Rico (San Juan, Puerto Rico), Brigham and Women's Hospital (Boston, MA), and the University of Pittsburgh (Pittsburgh, PA).

Genome-wide genotyping
Subjects were genotyped at~2.5 million single nucleotide polymorphisms (SNPs) using the HumanOmni2.5 BeadChip (Illumina, Inc., San Diego, CA), as previously described [4]. Subjects with a call rate <95% were removed from the analysis. SNPs were removed if they were not in Hardy-Weinberg equilibrium (P<10 -6 ) in control subjects, had minor allele frequency lower than 5% or a failure rate greater than 5%. After all subject and marker quality control steps were completed, we had data on~1.3 million SNPs from 560 subjects.
Genome-wide expression analysis in whole blood RNA was extracted from whole blood using PAXgene blood miRNA kits (Qiagen Inc). Globin transcripts were then depleted using GLOBINclear kit (Life technologies). RNA quality and concentration were determined using Agilent RNA 6000 Nano kit (Agilent technologies). Of the 141 children with RNA samples, 121 also had GW genotypes and are included in this analysis. GW gene expression was thus measured in 121 whole-blood globin-cleaned RNA samples at the University of Pittsburgh Genomics and Proteomics Core Laboratories, using the HumanHT-12 v4 Expression BeadChip (Illumina Inc). Background subtraction and quantile normalization were performed using the Lumi package implemented in R (version 3.0.1). Probes with > 70% absent points among all 121 samples were retained in the downstream analysis. A total of~15,000 probes and~1.3 million SNPs were included in the final eQTL analysis.

Principal component analysis (PCA)
Principal component analysis was performed using SNP data and mRNA expression data separately. PCs derived using SNPs and expression data can capture different sources of variation due to global ancestry, unknown environmental exposures, or technical factors. The first two eigenvectors derived from genotypic data were included as covariates in our primary association analysis. The first ten eigenvectors derived from expression data were additionally included as covariates in our secondary association analysis (see below).

Association analysis
A multivariate linear regression analysis was performed for each pair of mRNA expression y i and SNP x j for the i th transcript probe and the j th SNP. This analysis was conducted separately in subjects with and without asthma, as well as in all subjects. All multivariate models were adjusted for age, gender, and the first two principal components derived from whole-genome SNP data (to control for potential population stratification); models for the analysis including all subjects were additionally adjusted for asthma status. To maximize our statistical power, we repeated the analysis after additional adjustment for the first ten principal components derived from mRNA expression data, as follows: The eQTL analysis was performed using R package MatrixeQTL [7]. Benjamini-Hochberg False Discovery Rate (FDR) was controlled at 0.01, 0.05, and 0.1. The proportion of expression variability explained (R 2 ) by each tested SNP is calculated in the linear regression for each local pair. A model without z was applied similarly in the group of asthmatics and nonasthmatics, respectively.
Annotation and relation to other diseases eQTL loci identified in our association analysis were then compared with three existing eQTL databases. The first dataset is from an eQTL study in primary peripheral blood CD4+ lymphocytes [8]. The second database was derived from a systematic study of diverse human populations from HapMap3 samples [9]. The third database includes two datasets (MRCE and MRCA) derived from two studies of childhood asthma in Europe using a similar Illumina platform [10]. Next, we performed a pathway analysis using Ingenuity Pathway Analysis (IPA) and compared our top eQTL genes with a GWAS catalog database [11].

eQTL mapping
We systematically examined how SNPs regulate RNA expression using 1,351,737 SNPs and 15,075 transcripts from 121 participants. We separated the mapping results by local effects and distal effects. Of a total of 20,377,435,275 tests, 12,442,690 were for local pairs and 20,364,992,585 were for distal pairs (Fig. 1). Table 1 shows the number of eQTL signals identified before and after adjustment for PCs from RNA data in all subjects, as well as the results of separate analyses in children with asthma and control subjects (both adjusted for PCs from DNA data). The majority of eQTLs were identified from local effect pairs, and we had greater statistical power to detect eQTLs after additional adjustment for PCs derived from RNA data. Adjustment for PCs resulted in the detection of an additional 22,863 local pairs under a FDR < 0.05. This finding is consistent with those from previous studies [10].
We focused our analysis on cis-effect mapping, since these effects are the most biologically plausible and interesting. Fig. 2 is the histogram of proportion of expression variability explained for significant local effect under a FDR <0.1. The median proportion of expression variability explained was 0.11 (inter-quartile range = 0.07 to 0.16). Table 2 shows a summary of the twenty genes with the most significant associations (lowest P values). Of note, significant local eQTLs were found in a wide frequency range (S1 Fig.). There are more highly heritable local eQTLs for SNPs with higher frequency. The listed SNPs can explain between 61% and 87% of the expression variability for each gene. A complete list can be found in the supporting information (S1 Table). Fig. 3 shows a plot of the significant of each association test against to the distance between each tested SNP and its corresponding transcript. In general, SNPs tend to affect RNA expression more significantly when they are closer to the starting site for a transcript.
Even though a systematic comparison between asthmatics and non-asthmatics is limited by the sample size in each group, we list local eQTL pairs from the subgroup analysis in S2 and S3 Tables, and list case-specific and control-specific eQTLs in S4 and S5 Tables (under FDR < 0.1).

Comparisons with existing eQTL databases
Studies reporting identification of eQTLs differ with regard to the tissues analyzed, the platforms utilized for genotyping and generation of RNA microarrays, analytical methods, and the race or ethnicity of subjects. We did focus on the identification of findings that overlap with those from previous studies. For simplicity, we focused on official gene symbols, which were  then compared with those reported in other studies. It is worth noting that our limited statistical power may have led to false negative results.
In the first eQTL study we used for comparison [8], a genome-wide scan was conducted in 200 subjects for SNP genotyping and RNA expression assessment (in peripheral blood CD4+ lymphocytes). Interestingly, a few genes in Table 2 for this study (SCGB3A1, IPO8, CHURC1, and FAM118A) were among the top 34 genes with more than 50% expression variability explained in that earlier study. The second study identified eQTLs with local effects using GW genotypic data and GW expression data from Epstein-Barr virus (EBV)-transformed lymphoblastoid cell lines in 726 subjects in eight populations (109 subjects of European ancestry, 80 Han Chinese, 82 Gujarati Indians, 82 Japanese, 82 Luhya from Kenya, 45 subjects of Mexican ancestry 138 Maasai from Kenya, and Yoruba from Nigeria) in the HapMap 3 project [9]. In the third and most recent study, Liang et al. conducted eQTL mapping using GW genotypes (directly genotyped and imputed) and GW expression data from lymphoblastoid cell lines in two cohorts of European nuclear families ascertained through a child with asthma (the MRCA panel) or eczema (the MRCE panel) [10]. As shown in Table 3, a majority of the genes identified in the current study overlap with those identified in the studies conducted using HapMap3 or the MRCA/MRCE panels. The average percentage for such overlap is approximately 85% and 69% under FDR thresholds of 0.05 and 0.01, respectively. In the comparison with MRCA and MRCE datasets, we observed a fairly consistent pattern for a longer gene list. In fact, the numbers shown above are likely underestimated due to limited power [12].

Functional analysis and disease association
Next, we investigated if eQTL mapped genes are enriched for disease associations. We downloaded the latest version (released on April 1, 2014) of the GWAS Catalog from the NIH website. We used the "Reported Gene(s)" and "Mapped_gene" columns in the catalog file to map the genes in our study. We matched the IDs with the corresponding Disease/Trait description from the catalog, and then summarized this information in a combined table (S6 Table). It is worth noting that well-known susceptibility genes for asthma were identified in our list (S7 Table). For example, SNP rs8067378 explained nearly 24% of the variability in expression of ORMDL3 (adjusted p-value < 10 -6 ) in the current study, and has been associated with asthma in several studies [13,14].
To identify diseases associated with our top eQTL-enriched genes, we conducted an Ingenuity Pathway Analysis on the top 500 genes. Endocrine, gastrointestinal, immunological, metabolic, and dermatological diseases appeared to be ranked at the top of the results for this analysis. Similar results were obtained when selecting a different number of top genes (e.g. 100, 300). For each gene, the most significant SNP was reported. Beta is the increase (+) or decrease (-) unit of transcript per unit increase of reference allele (A1).
Reported p-value is adjusted by FDR. R 2 is the proportion of expression variability explained by the reported SNP.

Discussion
We have presented and made available our findings from an eQTL study of a well-characterized Hispanic cohort of Puerto Rican children. Our results contribute to the understanding of gene expression regulation in Puerto Ricans, while also complementing existing eQTL databases in other ethnic groups. eQTL loci identified in the current analysis will help prioritize follow up of findings from other studies of complex diseases, such as GWASs.
We recognize several limitations of our study. Firstly, we had limited statistical power because of our sample size. Thus, we were unable to detect rare or low impact eQTLs and likely underestimated the number of genes overlapping with those reported in other studies. Moreover, we cannot confidently identify eQTLs differing between Puerto Ricans and other racial/ ethnic groups and between asthmatics and non-asthmatics. Subject recruitment in ongoing projects will further increase our sample size and thus our statistical power to detect additional eQTL, particularly those with modest effects. Secondly, we assessed GW expression in wholeblood, which contains a heterogeneous population of cells (e.g. subpopulations of white blood eQTL in Puerto Rican Children cells and platelets). However, our findings are comparable to those reported using GW expression data from CD4+ lymphocytes or lymphoblastoid cell lines. Thirdly, we limited our comparisons with previous studies to official gene symbols, largely because of complex differences across studies. An exact comparison of SNP-transcript pairs would have provided more accurate biological interpretations. Fourthly, our small sample size precluded incorporation of racial ancestry information into the analysis, which would have facilitated identification of population-or ethnic-specific eQTLs. Detection of such eQTLs is important and would contribute to gene mapping for complex diseases in racially admixed populations, such as Puerto Ricans. We plan future studies integrating genetic, epigenetic and mRNA expression data to further enhance our knowledge of the biological mechanisms underlying complex diseases in Puerto Ricans and other racially admixed populations.  Table. A list of top identified eQTL genes overlapping with genes associated with any trait in the GWAS catalog. (CSV) S3 Table. A list of top identified eQTL genes overlapping with genes associated with asthma-related traits in the GWAS catalog. (CSV) S4 Table. A list of significant local-effect eQTL pairs in cases under FDR < 0.1. (CSV) S5 Table. A list of significant local-effect eQTL pairs in controls under FDR < 0.1. The column of "Listed Gene" is the number of genes reported in the original study. The column of "Overlapped Gene" is the number of overlapping genes between our eQTL study and compared study.  Table. A list of significant local-effect eQTL pairs in cases but not in controls (case specific) under FDR < 0.1.

(CSV)
S7 Table. A list of significant local-effect eQTL pairs in controls but not in cases (control specific) under FDR < 0.1. (CSV)