Leveraging the resolution of RNA-Seq markedly increases the number of 1 causal eQTLs and candidate genes in human autoimmune disease 2 Mapping eQTLs in autoimmune disease using RNA-Seq 3 4

Abstract Genome-wide association studies have identified hundreds of risk loci for autoimmune disease, yet only a minority (~25%) share a single genetic effect with changes to gene expression (eQTLs) in primary immune cell types. RNA-Seq based quantification at whole-gene resolution, where abundance is estimated by culminating expression of all transcripts or exons of the same gene, is likely to account for this observed lack of colocalisation as subtle isoform switches and expression variation in independent exons are concealed. We perform integrative cis-eQTL analysis using association data from twenty autoimmune diseases (846 SNPs; 584 independent loci), with RNA-Seq expression from the GEUVADIS cohort profiled at gene-, isoform-, exon-, junction-, and intron-level resolution. After testing for a shared causal variant, we found exon-, and junction-level analyses produced the greatest frequency of candidate-causal cis-eQTLs; many of which were concealed at whole-gene resolution. In fact, only 9% of autoimmune loci shared a disease-relevant eQTL effect at gene-level. Expression profiling at all resolutions however was necessary to capture the full array of eQTL associations, and by doing so, we found 45% of loci were candidate-causal cis-eQTLs. Our findings are provided as a web resource for the functional annotation of autoimmune disease association studies (www.insidegen.com). As an example, we dissect the genetic associations of Ankylosing Spondylitis as only a handful of loci have documented causative relationships with gene expression. We classified fourteen of the thirty-one associated SNPs as candidate-causal cis-eQTLs. Many of the newly implicated genes had direct relevance to inflammation through regulation of TNF signalling (for example NFATC2IP, PDE4A, and RUSC1), and were supported by integration of functional genomic data from epigenetic and chromatin interaction studies. We have provided a deeper mechanistic understanding of the genetic regulation of gene expression in autoimmune disease by profiling the transcriptome at multiple resolutions. Author Summary It is now well acknowledged that non-coding genetic variants contribute to susceptibility of autoimmune disease through alteration of gene expression levels (eQTLs). Identifying the variants that are causal to both disease risk and changes to expression levels has not been easy and we believe this is in part due to how expression is quantified using RNA-Sequencing (RNA-Seq). Whole-gene expression, where abundance is estimated by culminating expression of all transcripts or exons of the same gene, is conventionally used in eQTL analysis. This low resolution may conceal subtle isoform switches and expression variation in independent exons. Using isoform-, exon-, and junction-level quantification can not only point to the candidate genes involved, but also the specific transcripts implicated. We make use of existing RNA-Seq expression data profiled at gene-, isoform-, exon-, junction-, and intron-level, and perform eQTL analysis using association data from twenty autoimmune diseases. We find exon-, and junction-level thoroughly outperform gene-level analysis, and by leveraging all five quantification types, we find 45% of autoimmune loci share a single genetic effect with gene expression. We highlight that existing and new eQTL cohorts using RNA-Seq should profile expression at multiple resolutions to maximise the ability to detect causal eQTLs and candidate-genes.


21
expression from the GEUVADIS cohort profiled at gene-, isoform-, exon-, junction-, and intron-level 22 resolution. After testing for a shared causal variant, we found exon-, and junction-level analyses 23 produced the greatest frequency of candidate-causal cis-eQTLs; many of which were concealed at 24 whole-gene resolution. In fact, only 9% of autoimmune loci shared a disease-relevant eQTL effect at 25 gene-level. Expression profiling at all resolutions however was necessary to capture the full array of 26 eQTL associations, and by doing so, we found 45% of loci were candidate-causal cis-eQTLs. Our 27 findings are provided as a web resource for the functional annotation of autoimmune disease 28 association studies (www.insidegen.com). As an example, we dissect the genetic associations of 29 Ankylosing Spondylitis as only a handful of loci have documented causative relationships with gene 30 expression. We classified fourteen of the thirty-one associated SNPs as candidate-causal cis-eQTLs.

31
Many of the newly implicated genes had direct relevance to inflammation through regulation of TNF 32 signalling (for example NFATC2IP, PDE4A, and RUSC1), and were supported by integration of Introduction 55 The autoimmune diseases are a family of heritable, often debilitating, complex disorders whereby 56 immune system dysfunction leads to loss of tolerance to self-antigens and chronic inflammation [1].

57
Genome-wide association studies (GWAS) have now detected hundreds of susceptibility loci 58 contributing to risk of autoimmunity [2] yet their biological interpretation still remains challenging 59 [3]. Mapping single nucleotide polymorphisms (SNPs) that influence gene expression (eQTLs) can 60 provide crucial insight into the potential candidate genes and etiological pathways connected to 61 discrete disease phenotypes [4]. Expression profiling in appropriate cell types and physiological conditions is necessary to capture the 67 pathologically relevant regulatory changes driving disease risk [8]. Lack of such expression data is 68 thought to explain the observed disparity of shared genetic architecture between disease association 69 and gene expression at certain autoimmune loci [9]. A much overlooked cause of this disconnect 70 however, is not only the use of microarrays to profile gene expression, but also the resolution to 71 which expression is quantified using RNA-Sequencing (RNA-Seq) [10]. Expression estimates of 72 whole-genes, individual isoforms and exons, splice-junctions, and introns are obtainable with RNA-73 Seq [11][12][13][14][15][16][17][18]. The SNPs that affect these discrete units of expression vary strikingly in their proximity 74 to the target gene, localisation to specific epigenetic marks, and effect on translated isoforms [18]. For 75 example, in over 57% of genes with both an eQTL influencing overall gene expression and a 76 transcript ratio QTL (trQTL) affecting the ratio of each transcript to the gene total, the causal variants 77 for each effect are independent and reside in distinct regulatory elements of the genome [18]. 78 79 RNA-Seq based eQTL investigations that solely rely on whole-gene expression estimates are likely to 80 mask the allelic effects on independent exons and alternatively-spliced isoforms [16][17][18][19]. This is in 81 5 part due to subtle isoform switches and expression variation in exons that cannot be captured at gene-82 level [20]. Recent evidence also suggests that exon-level based strategies are more sensitive than 83 conventional gene-level approaches, and allow for detection of moderate but systematic changes in 84 gene expression that are not necessarily derived from alternative-splicing events [15,21]. Furthermore, 85 gene-level summary counts can be biased in the direction of extreme exon outliers [21]. Use of 86 isoform-, exon-, and junction-level quantification in eQTL analysis also support the potential to not 87 only point to the candidate genes involved, but also the specific transcripts or functional domains 88 affected [10,18]. This of course facilitates the design of targeted functional studies and better 89 illuminates the causative relationship between regulatory genetic variation and disease. Lastly, though 90 intron-level quantification is not often used in conventional eQTL analysis, it can still provide 91 valuable insight into the role of unannotated exons in reference gene annotations, retained introns, and 92 even intronic enhancers [22,23].

94
Low-resolution expression profiling with RNA-Seq will impede the subsequent identification of 95 causal eQTLs when applying genetic and epigenetic fine-mapping approaches [24]. In this 96 investigation, we aim to increase our knowledge of the regulatory mechanisms and candidate genes of 97 human autoimmune disease through integration of GWAS and RNA-Seq expression data profiled at 98 gene-, isoform-, exon-, junction-, and intron-level in lymphoblastoid cell lines (LCLs). Our findings 99 are provided as a web resource to interrogate the functional effects of autoimmune associated SNPs 100 (www.insidegen.com), and will serve as the basis for targeted follow-up investigations.

104
Detection of cis-eQTLs and candidate-genes of autoimmune disease using RNA-Seq 105 Using association data from twenty human autoimmune diseases, we performed integrative cis-eQTL 106 analysis in lymphoblastoid cell lines (LCLs) with RNA-Seq expression data profiled at five 107 resolutions: gene-, isoform-, exon-, junction-, and intron-level. We tested for a shared causal variant 108 between disease and expression at each association. The 846 autoimmune-associated SNPs taken 109 forward for analysis are documented in S1 Table and

114
We found that cis-eQTL association analysis using exon-, junction-, and intron-level quantification 115 yielded the greatest frequency of significant (q < 0.05) cis-eQTLs and eGenes ( Fig 2B). These 116 findings persisted after testing whether each statistically significant cis-eQTL showed strong evidence 117 for colocalisation with the genetic variant underlying the autoimmune disease association (q < 0.05 118 and RTC > 0.95). For clarity, we define such eQTLs as candidate causal cis-eQTLs and we define 119 their targets as eGenes ( Fig 2C). Exon-level analysis detected the most candidate-causal cis-eQTLs 120 (235) and eGenes (233) out of all quantification types, followed by junction-and intron-level 121 quantification. Isoform-and gene-level analysis were thoroughly outperformed, with the latter 122 detecting only 70 candidate-causal cis-eQTLs and 65 eGenes. In fact, we observed gene-level 123 quantification presented the greatest dropout of significant cis-eQTLs that were candidate causal (Fig   124   2D). Only 23.8% of significant cis-eQTLs were candidate-causal at gene-level compared to 49.8% at 125 exon-level; suggesting that in the autoimmune susceptibility loci tested more strongly associated cis-  Profiling at all resolutions is necessary to capture the full array associated cis-eQTLs 131 We pruned the 846 autoimmune associated SNPs using an r 2 cut-off of 0.8 and 100kb limit to create a 132 subset of 584 independent susceptibility loci. By combining all five resolutions of RNA-Seq, we 133 found 267 loci (45.7%) presented a shared genetic effect between disease association and gene 134 expression ( Fig 3A). Strikingly, only 9.3% of associated loci shared an underlying causal variant at 135 gene-level, in contrast to the 29.1% classified at exon-level. We mapped the candidate-causal cis-136 eQTLs detected by RNA-Seq back to the diseases to which they are associated ( Fig 3B). On average, 137 47% of associated SNPs per disease were classified as candidate-causal cis-eQTLs using all five 138 RNA-Seq quantification types. Interestingly, we observed the diseases that fell most below this

145
We further broke down our results per disease by RNA-Seq quantification type ( Fig 3C) and in almost 146 all cases, the greatest frequency of candidate-causal cis-eQTLs and eGenes were captured by exon-147 and junction-level analyses.

149
By separating candidate-causal cis-eQTL associations out by quantification type, we found over half 150 were detected by either exon-or junction-level, and considerable overlap of cis-eQTL associations 151 existed between both types ( Fig 3D). The greatest correlation of effect sizes (r 2 : 0.88) of candidate-152 causal cis-eQTLs between exon-and junction-level (S1 Fig). Strong correlation also existed between 153 the effect sizes of gene-and isoform-level candidate-causal cis-eQTLs as expected (r 2 : 0.83); yet 154 gene-level analysis detected only 19% of all candidate-causal associations. Gene-and isoform-level 155 analysis did however capture six and eighteen candidate-causal cis-eQTLs unique to their 156 8 quantification type respectively. Thus, our data suggest that although exon-and junction-level, and to 157 a lesser extent intron-level analysis, capture the majority of candidate-causal cis-eQTL associations, it 158 is necessary to prolife gene-expression at all quantification types to avoid misinterpretation of the 159 functional impact of disease associated SNPs.

161
Web resource for functional interpretation of association studies of autoimmune disease 162 We provide our data as a web resource (www.insidegen.com) for researchers to lookup candidate-163 causal cis-eQTLs and eGenes of autoimmune diseases detected across the five RNA-Seq 164 quantification types. Data are sub-settable and exportable by SNP ID, gene, RNA-Seq resolution, 165 genomic position, and association to specific autoimmune diseases. Full data are also made available 166 in S2 Table. 167 168

Functional dissection of Ankylosing Spondylitis genetic associations using RNA-Seq 169
We decided to apply the results of our integrative cis-eQTL analysis to functionally dissect the

223
The remaining AS associated variant rs1128905 is a candidate-causal cis-eQTL for both CARD9 and 224 SNAPC4 (Fig 4A). The candidate-gene at this AS locus is thought be to CARD9

228
Accordingly, the exon 18-19 boundry was also significantly decreased, captured by junction-level 229 quantifiation (β = -0.26; P = 2.74 x 10 -04 ). As rs1128905 lies over 39kb away from the transcription 230 start site of SNAPC4, we used existing promoter capture Hi-C data in lymphoblastoid cell lines to 231 assess whether rs1128905 and associated SNPs may act distally upon SNAPC4 to influence its 232 expression [32]. We found the bait region encompassing rs1128905 interacts with five targets with 233 great confidence CHiCAGO score > 12 (Fig 4C) [46]. Four of these are located within the SNAPC4 suggest that associated SNPs rs10870201 and rs10870202 may perturb the enhancer-promoter 239 interaction with SNAPC4 affecting expression. In fact, rs10870201 was the best cis-eQTL in the 1Mb 240 region for exons 18 and 19 of SNAPC4. Interestingly, although no autoimmune phenotype has been 241 documented with SNAPC4, an uncorrelated SNP rs10781500 (r 2 with rs1128905 < 0.5), associated 242 with Crohn's Disease, inflammatory bowel disease, and ulcerative colitis, has also been classified as a 243 candidate-causal cis-eQTL for SNAPC4 but not CARD9 in ex vivo human B-lymphocytes (the risk 244 allele is also correlated with reduced expression of SNAPC4) [47]. This effect holds true in our 245 analysis -rs10781500 is an eQTL for SNAPC4 but not CARD9.

246
Our data point to candidate genes and molecular mechanisms but further functional characerization is 247 of course necessary to determine the true causative gene(s) at this locus.

Detection of autoimmune associated trans-eQTLs using RNA-Seq 250
We extended our RNA-Seq based eQTL investigation to include expression targets > 5Mb away from 251 each of the 846 lead autoimmune GWAS variants (S3 Table). Though we were relatively 252 underpowered for a trans-eQTL analysis, we were able to detect 26 trans-eQTLs at isoform-level, 253 eight at exon-level, six at gene-level, three at junction-level ( Fig 5A). Many of the trans-eQTLs 254 detcted were only associated with one eGene, and no trans-eQTLs were detected at intron-level. With 255 exon-level quantification however, we were able to identify an interesting effect of trans-eQTL 256 rs7726414 -associated with Systemic Lupus Erythematosus (SLE) [7]. We found rs7726414, was a 257 trans-eQTL for eight eGenes (Fig 5B). These comprise SIPA1L2, PDPK1, IVNS1ABP, HES2, JAZF1, 258 ULK4, RP11-51F16.8, and PPM1M. We found the risk allele rs7726414 [T] was associated with 259 increased expression of each of these eight genes (Fig 5C). We highlight, PDPK1, a key regulator of 260 IRF4 and inducer of apoptosis [48], and JAZF1 which is genetically associated with many 261 autoimmune diseases including SLE itself [7]. The serine/threonine-protein kinase ULK4 is also of 262 interest as its family member, ULK3, is also an SLE suseptability gene [7]. Though we did not 263 classify rs7726414 as a candidate-causal cis-eQTL in our dataset, it has been documented as junction-, and intron-level, and tested for a shared genetic effect at each significant association. We 289 found exon-and junction-level quantification led to the greatest frequency of candidate-causal cis-290 eQTL and eGenes, and thoroughly outperformed gene-level analysis (Fig 2C). We argue however that 291 it is necessary to profile expression at all possible resolutions to diminish the likelihood of 292 overlooking potentially causal cis-eQTLs (Fig 3D). In fact, by combining our results across all 293 resolutions, we found 45% of autoimmune loci were candidate-causal cis-eQTLs for at least one 294 14 eGene. Our findings can be used as a resource to lookup causal eQTLs and candidate genes of 295 autoimmune disease (www.insidegen.com).

297
Gene-level expression estimates can generally be obtained in two ways -union-exon based 298 approaches [14,17] and transcript-based approaches [11,12]. In the former, all overlapping exons of 299 the same gene are merged into union exons, and intersecting exon and junction reads (including split-300 reads) are counted to these pseudo-gene boundaries. Using this counting-based approach, it is also 301 possible to quantify meta-exons and junctions easily and with high confidence by preparing the 302 reference annotation appropriately [13,15,54]. Introns can be quantified in a similar manner by 303 inverting the reference annotation between exons and introns [18]. Conversely, transcript-based 304 approaches make use of statistical models and expectation maximization algorithms to distribute reads 305 among gene isoforms -resulting in isoform expression estimates [11,12]. These estimates can then be 306 summed to obtain the entire expression estimate of the gene. Greater biological insight is gained from 307 isoform-level analysis; however, disambiguation of specific transcripts is not trivial due to substantial 308 sequence commonality of exons and junctions. In fact, we found only 15% of autoimmune loci shared 309 a causal variant at transcript-level (Fig 3A). The different approaches used to estimate expression can 310 also lead to significant differences in the reported counts. Union-based approaches, whilst 311 computationally less expensive, can underestimate expression levels relative to transcript-based, and 312 this difference becomes more pronounced when the number of isoforms of a gene increases, and when 313 expression is primarily derived from shorter isoforms [20]. The GEUVADIS study implemented a 314 transcript-based approach to obtain whole-gene expression estimates. A gold standard of eQTL 315 mapping using RNA-Seq is essential therefore for comparative analysis across datasets.

317
Our findings support recent evidence that suggests exon-level based strategies are more sensitive and 318 specific than conventional gene-level approaches [21]. Subtle isoform variation and expression of less 319 abundant isoforms are likely to be masked by gene-level analysis. Exon-level allows for detection of 320 moderate but systematic changes in gene expression that are not captured at gene-level, and also, 321 gene-level summary counts can be shifted in the direction of extreme exon outliers [21]. It is therefore 322 15 important to note that a positive exon-level eQTL association does not necessarily mean a differential 323 exon-usage or splicing mechanism is involved; rather a systematic expression effect across the whole 324 gene may exist that is only captured by the increased sensitivity. Additionally, by combining exon-325 level with other RNA-Seq quantification types, inferences can be made on the particular isoforms and

329
We found intron-level quantification also generated more candidate-causal cis-eQTLs than gene-330 level. As the library was synthesised from poly-A selection, these associations are unlikely due to 331 differences in pre-mRNA abundance. Rather, they are likely derived from either true retained introns 332 in the mature RNA or from coding exons that are not documented in the reference annotation used.

333
We observed multiple instances where a candidate-causal cis-eQTL at intron-level was detected, yet a 334 previous investigation had detected an exonic effect using a different reference annotation. For 335 example, an intronic-effect was detected for SLE candidate eGenes IKZF2 and WDFY4 in this 336 analysis (which used the GENCODE v12 basic reference annotation). Using the comprehensive 337 reference annotation of GENCODE v12, we found these effects were in fact driven by transcribed 338 exons located within the intronic region of the basic annotation -and were validated in vitro by qPCR 339 [10]. The choice of reference annotation therefore has a profound effect on expression estimates [55]; 340 and so again, a gold standard is necessary prevent misinterpretation and increase consistency of eQTL 341 associations.

343
Lastly, we show how our findings can be leveraged to comprehensively dissect GWAS results of 344 autoimmune diseases. We found 14 of the 31 SNPs associated with Ankylosing Spondylitis (AS) 345 were candidate-causal cis-eQTLs for at least on eGene (Fig 4). The majority of these eQTLs analysis, we took the lead SNPs for each disease -defined as a genome-wide significant SNP with the 371 lowest reported P-value (S1 Table) SNPs within the defined hotspot interval against these residuals. The RTC score was then calculated 416 as (N SNPs -Rank GWAS SNP / N SNPs . Where N SNPs is the total number of SNPs in the recombination hotspot 417 interval, and Rank GWAS SNP is the rank of the GWAS SNP association P-value against all other SNPs in 418 the interval from the liner-association against the residuals of the best cis-eQTL. Disease associated 419 SNPs with statistically significant association to gene expression (q < 0.05) and an RTC score > 0.95 420 were classified as 'candidate-causal eQTLs'. Genes whose expression is modulated by the eQTL were

595
The 846 autoimmune disease associated SNPs per disease are documented in S1 Table and were LD   596 pruned to 584 independent loci (see Methods). Genotypes of 1000Genomes individuals were quality 597 controlled and subset to regions of recombination hotspots. If the lead GWAS SNP was found 598 between a recombination hotspot, then all SNPs were between the recombination hotspot intervals