Large-Scale East-Asian eQTL Mapping Reveals Novel Candidate Genes for LD Mapping and the Genomic Landscape of Transcriptional Effects of Sequence Variants

Profiles of sequence variants that influence gene transcription are important for understanding the mechanisms that affect phenotypic variation and disease susceptibility. Using genotypes at 1.4 million SNPs and a comprehensive transcriptional profile of 15,454 coding genes and 6,113 lincRNA genes obtained from peripheral blood cells of 298 Japanese individuals, we mapped expression quantitative trait loci (eQTLs). We identified 3,804 cis-eQTLs (within 500 kb of target genes) and 165 trans-eQTLs (>500 kb away or on different chromosomes). Cis-eQTLs were often located in transcribed or adjacent regions of genes; among these regions, 5′ untranslated regions and 5′ flanking regions had the largest effects. Epigenetic evidence for regulatory potential accumulated in public databases explained the magnitude of the effects of our eQTLs. Cis-eQTLs were often located near their respective target genes, if not within them. Large effect sizes were observed for eQTLs near target genes, and effect sizes were clearly attenuated as the distance between the eQTL and the gene increased. Using a very stringent significance threshold, we identified 165 large-effect trans-eQTLs. We used our eQTL map to assess 8,069 disease-associated SNPs identified in 1,436 genome-wide association studies (GWAS). For 148 of the GWAS-identified SNPs, we identified genes that might be truly causative but that the GWAS might have failed to identify; for example, TUFM (P = 3.3E-48) was identified for inflammatory bowel disease (early onset); ZFP90 (P = 4.4E-34) for ulcerative colitis; and IDUA (P = 2.2E-11) for Parkinson's disease. We identified four genes (P < 2.0E-14) that might be related to three diseases and two hematological traits; the expression of each is regulated by trans-eQTLs located on a different chromosome from the gene.

Supplementary note

β and R² as indices for effect magnitudes
β, or |β|, is the coefficient (or the absolute value of the coefficient) of genotypes for an eQTL, and represents the effect size of a minor allele on the mean; specifically, how much the mean expression level changes per minor allele on a log2 scale (i.e., β = 1 means that the expression level doubles per minor allele). R² is the ratio of the regression sum of squares to the total sum of squares, and represents the proportion of phenotypic variance explained by genotype: i.e., R² represents how well the genotypes of a SNP explain the variance in an expression phenotype. Because we did not scale expression phenotypes by the standard deviation, β can be correlated with the variability of the phenotype, whereas R² is not influenced by variability. We showed results for both measures because β and R² are two different measures of the effects of predictor variables.
When two eQTLs have the same R² values but different β values, the proportion of variance explained by genotypes is the same, but the difference in means between two genotypes, say genotypes AA and Aa, is different. When two eQTLs have the same β values but different R² values, the difference in means between two genotypes, say genotypes AA and Aa, is the same, but the proportion of explained variance is different. β is important because it represents effect size, not statistical significance. R² is closely related to P values; with the same sample size, comparing R² is equivalent to comparing P values. R² is also important because it represents the narrow-sense heritability, h², when the SNP is the only genetic factor for the phenotype. If there is more than one independent genetic factor, h² is given by the sum of the R² values of all the genetic factors.
In our study, many results showed substantial discordance between β and R². For example, genic and intergenic cis-eQTLs differed in R² but not obviously in β (Figure 3A and 3B); the association with RegulomeDB was significant for R² but not for β (Figure 4); and the relationship with eQTL-gene distance also differed between R² and β (Figure 5C and 5F). Given these differences, we consider R² the better index of eQTL effects, because R² was more consistent with known biological evidence and because β is influenced by the variability of the phenotype.
However, as mentioned above, β indicates how much an eQTL can change the mean expression level, which might be of greater interest than the statistical significance represented by R². Therefore, we showed results for both β and R².
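The distinction between β and R² drawn in this note can be illustrated with a minimal sketch. It assumes a simple one-SNP least-squares fit of log2 expression on minor-allele count; the function name `beta_and_r2`, the simulated genotypes, and the noise levels are hypothetical and are not taken from the study's data or pipeline.

```python
import numpy as np

def beta_and_r2(genotype, expression):
    """Least-squares fit of expression = a + beta * genotype.

    Returns (beta, R^2), where R^2 is the regression sum of squares
    divided by the total sum of squares.
    """
    g = np.asarray(genotype, dtype=float)
    y = np.asarray(expression, dtype=float)
    g_c = g - g.mean()
    y_c = y - y.mean()
    beta = (g_c @ y_c) / (g_c @ g_c)
    ss_total = y_c @ y_c
    ss_regression = beta**2 * (g_c @ g_c)
    return beta, ss_regression / ss_total

# Simulated data (an assumption for illustration): 298 individuals,
# minor-allele counts 0/1/2, true effect of 0.5 log2 units per allele.
rng = np.random.default_rng(0)
genotype = rng.binomial(2, 0.3, size=298)
noise = rng.normal(size=298)

# Same per-allele effect, different residual variability of the phenotype.
beta_low, r2_low = beta_and_r2(genotype, 0.5 * genotype + 0.2 * noise)
beta_high, r2_high = beta_and_r2(genotype, 0.5 * genotype + 1.0 * noise)

print(f"low variability:  beta={beta_low:.2f}, R2={r2_low:.2f}")
print(f"high variability: beta={beta_high:.2f}, R2={r2_high:.2f}")
```

In this sketch the two phenotypes share a similar β, while the noisier phenotype has a much smaller R², which is the situation described above as "same β values but different R² values".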

Definitions of Case 1-4 for GWAS results
We defined the following four cases for GWAS results in which cis or trans effects were found for

Surrogate variable correction and distribution of P values for all distant SNPs
Surrogate variable analysis (SVA) identifies unmodeled latent factors that cause heterogeneity in expression data [1]. We identified two significant surrogate variables (SVs), and we corrected each expression phenotype for age, gender, and the two SVs. SVA has been shown to improve eQTL reproducibility [2]. In our data, we identified more trans-eQTLs with SV correction than without it (we did not check whether more cis-eQTLs are found with SV correction). Therefore, we consider that SVA improves eQTL identification. We note, however, that adding SV correction to the age and gender adjustment changed the distribution of P values of all distant SNPs. As shown in Figure SN1, with SV correction, the distribution of P values became more conservative than the distribution expected under the complete null hypothesis, apart from the enrichment of small P values.
nonsynonymous SNV, a single nucleotide change that causes an amino acid change; synonymous SNV, a single nucleotide change that does not cause an amino acid change. Functional changes caused by indels were not shown here because no SNP was assigned to these categories.
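The covariate correction described in this section can be sketched as follows. This is only an illustration under stated assumptions: it treats the correction as regressing each expression phenotype on the covariates and keeping the residuals, and the surrogate variables are random placeholders rather than the output of an actual SVA run. The function name `residualize` and all simulated values are hypothetical.

```python
import numpy as np

def residualize(expression, covariates):
    """Regress expression on an intercept plus covariates; return residuals."""
    y = np.asarray(expression, dtype=float)
    design = np.column_stack([np.ones(len(y)), np.asarray(covariates, dtype=float)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

# Simulated covariates (assumptions for illustration only).
rng = np.random.default_rng(1)
n = 298
age = rng.uniform(20, 70, size=n)
gender = rng.integers(0, 2, size=n)
sv1 = rng.normal(size=n)  # placeholder for surrogate variable 1
sv2 = rng.normal(size=n)  # placeholder for surrogate variable 2
covariates = np.column_stack([age, gender, sv1, sv2])

# A phenotype that depends on age and the first surrogate variable.
expression = 0.02 * age + 0.8 * sv1 + rng.normal(scale=0.3, size=n)
adjusted = residualize(expression, covariates)
```

The adjusted phenotype is uncorrelated with every covariate by construction, so a subsequent genotype regression is not driven by those factors.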

Exclusion of possible false eQTLs caused by outliers or violation of normality assumption
To exclude possible false discoveries caused by outliers or by violation of the normality assumption made for linear regression, non-parametric tests or inverse normal transformation are commonly used. We considered applying one of these methods to ensure that our eQTLs are not false discoveries of this kind. To adopt the more stringent method for our data, we evaluated two methods: 1) the Kruskal-Wallis test [3], and 2) linear regression following rank-based inverse normal transformation [4] (INT+LR). We tested pairs of the most significant local SNP and transcript with each method and compared the results with those of a linear regression (LR) performed as described in Methods (Figure SN2). The significance threshold for LR was determined by the permutation FDR (as described in Methods), and those for the Kruskal-Wallis test and INT+LR were determined by a receiver operating characteristic (ROC) curve analysis [5] (the point closest to the upper-left corner), using significance by LR as the gold standard. We identified 200 and 155 possible false positives with the Kruskal-Wallis test (P < 0.00015) and INT+LR (P < 2E-05), respectively (those in the lower-right region of Figure SN2). Therefore, we employed the Kruskal-Wallis test, which gave the more stringent criterion.
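The two robustness checks and the ROC-based threshold choice described above can be sketched as follows, using SciPy. This is a sketch under assumptions, not the study's pipeline: the rank offset of 3/8 in the inverse normal transformation is one common convention and may differ from the one used in [4], and the function names and simulated data are hypothetical.

```python
import numpy as np
from scipy import stats

def kruskal_wallis_p(genotype, expression):
    """Kruskal-Wallis P value across genotype groups (0, 1, 2 minor alleles)."""
    g = np.asarray(genotype)
    y = np.asarray(expression, dtype=float)
    groups = [y[g == k] for k in np.unique(g)]
    return stats.kruskal(*groups).pvalue

def inverse_normal_transform(values, offset=0.375):
    """Rank-based INT; the offset of 3/8 is an assumed convention."""
    y = np.asarray(values, dtype=float)
    ranks = stats.rankdata(y)  # ties receive average ranks
    return stats.norm.ppf((ranks - offset) / (len(y) - 2 * offset + 1))

def int_lr_p(genotype, expression):
    """P value of a linear regression on the INT-transformed phenotype."""
    z = inverse_normal_transform(expression)
    return stats.linregress(np.asarray(genotype, dtype=float), z).pvalue

def closest_to_corner_threshold(p_values, significant_by_lr):
    """P-value cutoff whose ROC point is closest to the upper-left corner.

    Requires at least one significant and one non-significant pair.
    """
    p = np.asarray(p_values, dtype=float)
    truth = np.asarray(significant_by_lr, dtype=bool)
    best_cutoff, best_distance = None, np.inf
    for cutoff in np.unique(p):
        called = p <= cutoff
        tpr = (called & truth).sum() / truth.sum()
        fpr = (called & ~truth).sum() / (~truth).sum()
        distance = np.hypot(fpr, 1.0 - tpr)
        if distance < best_distance:
            best_cutoff, best_distance = cutoff, distance
    return best_cutoff

# Simulated SNP-transcript pair with a strong additive effect (illustration only).
rng = np.random.default_rng(2)
genotype = np.repeat([0, 1, 2], [150, 120, 28])
expression = 1.0 * genotype + rng.normal(scale=0.5, size=genotype.size)
p_kw = kruskal_wallis_p(genotype, expression)
p_int = int_lr_p(genotype, expression)
```

A pair that is significant by LR but not by the chosen check would fall in the lower-right region of a plot such as Figure SN2 and would be treated as a possible false positive.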