Genome-wide association study identifies 16 genomic regions associated with circulating cytokines at birth

Circulating inflammatory markers are essential to human health and disease, and they are often dysregulated or malfunctioning in cancers as well as in cardiovascular, metabolic, immunologic and neuropsychiatric disorders. However, the genetic contribution to the physiological variation of levels of circulating inflammatory markers is largely unknown. Here we report the results of a genome-wide genetic study of blood concentration of ten cytokines, including the hitherto unexplored calcium-binding protein (S100B). The study leverages a unique sample of neonatal blood spots from 9,459 Danish subjects from the iPSYCH initiative. We estimate the SNP-heritability of marker levels as ranging from essentially zero for Erythropoietin (EPO) up to 73% for S100B. We identify and replicate 16 associated genomic regions (p < 5 x 10−9), of which four are novel. We show that the associated variants map to enhancer elements, suggesting a possible transcriptional effect of genomic variants on the cytokine levels. The identification of the genetic architecture underlying the basic levels of cytokines is likely to prompt studies investigating the relationship between cytokines and complex disease. Our results also suggest that the genetic architecture of cytokines is stable from neonatal to adult life.


Introduction
Circulating inflammatory markers are essential to human health and disease [1]. An important group of small circulating proteins are cytokines. These have important roles in cell signaling in general and in modulating immune function in particular, including inducing and reducing inflammation [2] Circulating inflammatory cytokines have been implicated in many classes of diseases, including cancers [3], cardiovascular diseases [4], metabolic diseases [5], autoimmune diseases [6] and neuropsychiatric disorders [7]. Their utility goes beyond their explanatory power in disease mechanism; since measuring their blood levels is a simple procedure, they can be useful in their diagnostic and predictive power. For example, they can be used as indicators for obesity and early cancer risk factors [8]. In addition to their involvement in disease, cytokines are also involved in both physiological function e.g. pain [9] and mental, cognitive, or brain function [10]. The latter point might be extremely important given emerging evidence for the links between immune function and psychiatric disorders [11][12][13].
Despite their relevance to disease mechanism and diagnostic power, only few studies have examined the genetic architecture of circulating inflammatory markers [14][15][16][17]. Furthermore, previous studies have mainly used adult samples; thus, it is unclear whether the genetic control of inflammatory markers varies across age groups. Here, we estimate the genetic contribution to variation in the circulating cytokine marker levels at birth for: interleukin 8 and 18 (IL8 and IL18), monocytes chemoattractant protein (MCP1 aka CCL2), thymus and reactivation regulated chemokine (TARC, also known as CCL17), erythropoietin (EPO), immunoglobulin A (IgA), C-reactive protein (CRP), brain-derived neurotrophic factor (BDNF), vascular endothelial growth factors (VEGFA) and S100 calcium-binding protein (S100B). Of note, the genetics of S100B has not been previously studied.

Results
We use data from 12,000 neonatal blood spots as part of the Danish iPSYCH Initiative [18], in which the concentrations of ten cytokines were measured using a two-step design with a discovery sample (N = 10,000) and a replication sample (N = 2,000). Five of the ten markers, i. e. BDNF, IL8, IL18, MCP1 and S100B, were measured in both samples. Both discovery and replication samples included subjects tested at birth who later in life had at least one inpatient or outpatient hospital discharge code involving one or more of six psychiatric disorders: schizophrenia, bipolar disorder, depression, autism, attention-deficit/hyperactivity disorder and anorexia (S1 Table) [18], as well as a random population sample.
Genome-wide genotyping of DNA extracted from neonatal blood spots was accomplished using the Infinium PsychChip v1.0 array in 23 waves (for detailed protocol see Pedersen et al. [18]) and used to impute~9 million 1000 Genomes Project Phase 3 SNPs. We performed two rounds of strict quality control to remove possible technical artifacts within each wave and across waves, respectively (Materials and Methods). We inferred the ancestry of each subject using both national birth register data and genomic principal component (PC) analysis. Non-Danish subjects were subsequently removed before the genetic association analyses. In total, 8,318 and 1,141 subjects were used in the discovery and replication analyses, respectively (Materials and Methods). Marker levels were log-transformed and age-residualized using a generalized additive model with 5 degrees of freedom (hereafter normalized, Materials and Methods). As expected, we observed both positive and negative correlations of the measured marker levels; correlation coefficients range from -0.06 to 0.43, but positive correlation is observed in the majority of the cases (S33 Fig). This suggests a complex regulation mechanism for immune responses.
We performed a genome-wide association study for each cytokine using a multiple linear regression model, including the first 6 principal components (PCs), diagnosis of any of the six disorders, genotyping wave indicators and sex as covariates (Materials and Methods). The same model was separately applied to both discovery and replication samples. We did not observe inflation in the resulting association statistics (lambda: min = 0.99, max = 1.03; S1-S10 Figs). Except for the cases of BDNF and IL8, we observed a high number genetic variants significantly associated (P<5x10 -9 ) with the cytokine markers, ranging from 131 for EPO to 3,941 for S100B. Extreme p values (P < 10 −100 ) are especially common for IL18, S100B and VEGFA (Fig 2A) in line with analyses showing strong cis-regulatory mechanisms for these markers ( Fig 1B). As shown in Fig 2B, common variants (minor allele frequency: MAF>20%) make up 56% of all significant variants.
We constructed PGRSs for: BDNF, IL18, IL8, MCP1 and S100B, measured in both samples, for the replication analysis using the effect estimates from discovery association. Fig 3B shows Pearson's correlations between the PGRSs and the corresponding normalized marker levels stratified by different "discovery association strength". The PGRSs based on SNPs with P<10 −6 (S1) were correlated most strongly across all markers except IL8. In contrast, PGRS constructed with all SNPs (S8), i.e. P�<1.0, show no significant correlation except for with S100B. The PGRSs constructed with SNPs with P<10 −6 accounts for 21% of S100B variation in the replication sample (Fig 3C), and the correlation between PGRS and S100B levels is~0.5. For comparison, an analysis based on a previously-studied discovery sample [17] is shown in S31 and S32 Figs. The observed low correlations between PGRS and IL8 levels (Figs 1C, 3B, S31 and S32 Figs) can be partially explained by the low SNP heritability for IL8 estimated in our sample (Fig 1A). On the other hand, the most significant correlations for the other cytokines was achieved by the PGRS with the lowest p-value threshold indicate that the genetic architecture of cytokines may be less polygenic than other human complex traits.
The associated loci contain a large number of genome-wide significant SNPs (P<5x10 -9 , Fig 2A), making it challenging to infer the causal variants for follow-up experimental studies. We performed Bayesian statistical fine-mapping on each associated region [24] (Materials and Methods). For each associated region, we inferred the most probable causal configuration (causal set) assuming at most 3 causal variants per region (Materials and Methods). As shown in Table 2, eleven causal sets include their corresponding leading SNPs, among which 3 are one-variant sets. Nonetheless, 9 causal sets do not contain their corresponding leading SNPs, indicating that the top association signals may be driven by the allelic combination of SNPs in the causal sets (Table 2). Re-analysis assuming at most six causal variants per region did not change the results (S9 Table).
The FINEMAP program was applied to each associated region (500kb left and right of the leading SNP) in Table 1, assuming each region contains at most three causal variants. Abbreviations used: log 10 (BFc): common logarithm of Bayesian factor for the inferred most probable causal configuration; log 10 (BF): common logarithm of Bayesian factor for the SNP being in the causal set; PIP: posterior inclusion probability in the causal set; R 2 : LD r-square of the SNP with the leading SNP in the corresponding region; P: association p value in the discovery sample; P repl: association p value in the replication sample or previous studies; Gene: closest gene to the corresponding SNP; Enh gene: inferred genes regulated by the corresponding enhancer.
Most of the identified genetic variants are located outside of protein-coding regions. We integrated associated loci with public epigenomic datasets [25][26][27][28] to infer plausible regulatory mechanisms. Eighteen of the 50 identified leading SNPs implicated by both association and fine-mapping analyses are located in enhancers from GeneHancer, the GeneCards Suite [29] database of human enhancers and their associated genes (Materials and Methods)( Table 2 and S2-S8 Tables). We also tested whether cytokine associated SNPs were enriched in DNAse hypertensive sites, histone modification sites and chromatin states. However, after correcting multiple testing no significant enrichment was observed (S34-S36 Figs). In Fig 4, we demonstrate the annotation by the 21q22.3 region, indexed by rs62224256, associated with S100B level. The SNPs rs11910707 (P = 1.26x10 -205 , replication P = 1.66x10 -27 , log 10 BF = 13.25, 12kb upstream of PRMT2) and rs2839314 (P = 188x10 -240 , replication P = 4.30x10 -19 , log 10 BF = 4.1, 22kb upstream of DIP2A) are the most probable causal variants. The rs11910707 SNP overlaps with the elite enhancer (Materials and Methods) GH21G046620, and rs2839314 -with GH21G046541. Both enhancers modulate the transcription of the S100B gene (the former through a double-elite association). Moreover, among the genes regulated by at least one of these enhancers are PRMT2, DIP2A, and SPATC1L (Table 2). Thus, the most highly associated signal with rs62224256 is highly likely to be a proxy of the two causal SNPs. As such, the closest gene, PCNT, may or may not play roles in the regulation of circulating S100B.

Discussion
In this study, we investigate the genetic architecture of ten cytokines in whole blood at birth, in a sample of 12,000 individuals, the largest study so far. Our results highlight an important role for regulatory elements in determining levels of circulatory inflammatory markers. Importantly, we robustly replicate our findings in an in-house replication sample and by using data from other studies [16,17]. The latter studies, in contrast to the current study, were based on adult samples, and, therefore, our results suggest that the genetic architecture of cytokines is stable from neonatal to adult life.
Inflammation and conditions associated with it, such as infections and autoimmune diseases, have been implicated in a number of disorders and medical conditions [1], including mental disorders [7,11,13]. In the context of the latter type of disorders, studies such as ours could be of great utility; while it has been known for a long time that mental disorders have strong genetic etiologies [30], when it comes to reliable accounts of disease mechanism, our current understanding is very limited compared to not only monogenic disorders, but also other complex disorders such as autoimmune disorders [31,32]. This is not necessarily due to lack of significant genetic associations, e.g. for schizophrenia [33], but rather it could also stem from the difficulty in defining the psychiatric traits. In this respect, leveraging the results of studies such as ours could be useful for both diagnosis and as a future avenue for research; given the links between inflammation, immunity and mental illness, and the properties of some of the inflammatory makers studied here, it could be envisaged that the latter could be used in a way similar to how endophenotypes could be used in psychiatry [34,35]. Moreover, the intricate genetic architecture identified in this study, which highlights gene regulation, could be informative to molecular studies of psychiatric diseases and other types of diseases. For example, it is likely to prompt studies using e.g. Mendelian randomization [36] to investigate the relationship between inflammatory markers and complex disease. a. from Ahola-Olli et al [17] b. from Suhre et al [16] c. from choi et al [20]. https://doi.org/10.1371/journal.pgen.1009163.t002 The main strengths of our study are the large number of markers included, the large sample size, and the replication sample (S37 Fig). The postnatally sampling on days 5-7 day renders our findings relatively independent of the child's behavior and natural environment, which could be considered a major strength. However, it should be noted that the marker levels may, in some cases, be influenced by perinatal complications, diseases and medication administered to the child, as well as by the smoking habits, alcohol consumption, diet, weight and other general life conditions of the mother. Certain peptides, e.g. antibodies, cross the placenta, and neonatal levels in the child therefore reflect those of the mothers at birth, thus reducing the power of the study and accounting for the zero heritability of IL8 and BDNF. A possible source of noise in the levels of inflammatory markers is that measurements come from dried, whole blood samples that may not precisely correspond to concentrations measures in plasma or serum in practice. However, our replication of findings from adult samples suggests that these putative biases do not present a serious limitation to the study.
In conclusion, our study sheds some light on the complex genetic architecture of inflammatory markers and highlights the important role of regulatory elements therein. We also show that the mechanisms involved are relatively stable throughout life, by comparing our results to those of studies which used adult samples. We hope that these results will prompt future studies looking into the links between inflammation and complex diseases and, in particular, that they will contribute to investigations into the mechanisms of mental illness, which have proven difficult to explain from a molecular perspective.

Sample
The sample was based on complete and consecutive birth cohorts of all singletons born in Denmark between May 1, 1981 andDecember 31, 2005. Only individuals who were residents in Denmark on their first birthday and who have a known mother (N = 1,536,309) were included. From this group, 78,000 subjects were genotyped in 23 waves by the Broad Institute using the PsychChip version 1. For the discovery sample, 10,000 subjects were randomly selected from the 23 waves of the iPSYCH initiative [18]. For the replication sample 2,000 subjects were chosen from the second wave, excluding the discovery sample (for detailed description of samples see S1 Table).

Cytokine level measurements
The 2000 samples for replication analysis were measured using Luminex technology as described by Skogstrand et al. [37,38]. The second 10 000 samples used for discovery study were measured using Meso-Scale technology as described in Skogstrand et al. [39]. Briefly, dried blood spot sample were punched as 3.2mm disks into PCR-plates (Sarstedt, 72.1981.202). 130 μl extraction buffer (PBS containing 1% BSA and 0,5%Tween-20) were added to each well, and the samples were extracted in 1 hour at room temperature on a microwell shaker set at (900rpm). The extracts were manually moved to sterile Matrix 2D tubes (Thermo Scientific, 3232) and frozen at -80˚C. One (Luminex) or two (Meso-Scale) years later, samples were thawed and analyzed using either Luminex technology in-house assays or Meso-Scale plates printed customized for the project. Analyte concentrations were calculated from the calibrator curves on each plate using 5PL (Luminex) or 4PL (Meso-Scale) logistic regression. Analytes falling below the lowest concentration within the working range were assigned to that value.
The measured levels were first inspected for potential outliers by scatter plots. Then, each marker level was logarithm transformed and age-residualized using a generalized additive model with 5 degrees of freedom, using the R function 'gam'. The resultant data was further checked for normality and outliers.
After imputation, we identified SNPs with high imputation quality (INFO � 0.1) and minor allele frequency (MAF > 0.01). Imputed dataset across 22 waves were merged and further quality control measures were applied (min INFO � 0.1 and MAF � 0.01). The bestguess genotypes were called using parameters: INFO � 0.9 and MAF > = 0.05. The set of SNPs after linkage disequilibrium pruning (r2 � 0.02) was used for relatedness testing and population structure analysis. PLINK [42] was used for relatedness testing. One random member of a pair of subjects with pi-hat � 0.2 were removed. Principal component analysis was performed using EIGENSOFT [43] with the same collection of autosomal SNPs. After quality control, 8,318 subjects remained for discovery and 1,141 subjects for replication sample. In total, about 9 million SNPs were used in the association study.

SNP heritability, h 2 SNP
The merged genotypes for discovery sample were quality controlled using the same parameter as above. Before estimating the heritability, SNPs were thinned by the PLINK 38 using the command:-indep-pairwise 100 50 0.2. The first 6 PCs (see next section), genotyping wave indicators and sex were used as covariates in the restricted maximum likelihood-based program BOL-T-REML [19]. To estimate per-chromosome SNP heritability, SNPs located in the focal chromosome was removed and the estimated h 2 SNP was subtracted from the whole genome estimates.

Genome-wide association
Genome-wide association study of SNPs with inflammation marker levels were performed separately for the discovery and replication sample using a multiple linear regression model implemented in PLINK [42]. Principal components were computed separately for discovery and replication, and the first 6 principal components were used as covariates, along covariates for sex and wave indicator variables. We employed the first 6 PCs following regression analyses testing each PC and each cytokine until we reached a PC which was not associated (P>0.05) with any of the 10 cytokines. Manhattan plots in S1-S8 Figs presented the association results. The genomic inflation factors were estimated and shown in the quantile-quantile plots in S1-S10 Figs. The regional association results were constructed using LocusZoom [44](S11-S30 Figs). The phenotypic variance explained by a SNP was estimated by the ðb � ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi 2 � p � ð1 À pÞ p Þ 2 , where β is the estimated effect and p the allele frequency in the discovery sample.

Associated regions and genes
Association results were 'clumped' using PLINK based on the linkage disequilibrium structure of the 1000 Genomes projects phase 3 EUR dataset, with parameters-clump-p1 5e-9 -clump-2 1e-6 -clump-r2 0.1. Five hundred kilo-base (kb) were used as inter-region distance threshold. Genes whose genomic coordinates located within the boundaries of each region were assigned to the corresponding region. SNPs with the smallest association p values were taken as the leading SNP for the corresponding region. The associated SNPs were annotated to the closest genes by genomic position the Ensembl tool VEP [45] (S2-S8 Tables).

Fine mapping
Association regions were fine-mapped using the FINEMAP [24] program. Regions were defined as genomic segments 500kb on both sides of the most significant SNP in an associated region (P < 5x10 -9 ). Linkage disequilibrium data from the 1000 Genomes Project phase 3 European sample were used in fine mapping. We performed two analyses: the first set the maximum number of causal variants to 3 and the other to 6. S2-S8 Tables listed all SNPs with posterior inclusion probability (PIP) > 0.1 for 3-causal variants analysis. S9 Table listed all inferred causal SNPs in each region for 6-causal variants analysis. The log 10 Bayesian factors for the causal set (log10BFc) and for each SNP are shown in the tables along with association statistics.

Enhancer annotation
The associated SNPs were mapped onto genomic enhancer regions from the GeneHancer database (v4.5) [25] using a specially-prepared annotated dataset. The GeneHancer database contains enhancers that were integrated from five enhancer sources (Ensembl [46], ENCODE [47], VISTA [48], dbSUPER [49] and FANTOM [50]) and enhancer-gene connections that are based on five methods (eQTLs [51], eRNAs [50], TF-gene expression correlations, capture-HiC [52], and genomic distance from TSS). Double-elite associations are considered to be more confident annotations and are defined as enhancer-gene connections for which both the enhancer itself and the connection to the gene are supported by at least two sources or methods, respectively.

Polygenic risk scoring
We computed the polygenic risk scores (PGRS) for both discovery and replication samples. To compute the effect size: for discovery sample, we used the association results from the previous study [17]; and, for the replication sample, we used both the association results from discovery sample and the same previous study. The association summary statistics were first carefully filtered by removing SNPs with: MAF < 0.05 or INFO < 0.8 or having a multi-character allele. We, then, clumped the resultant data based on the 1000 Genomes Project 3 EUR linkage disequilibrium structure using the program PLINK [42] with parameters:-clump-p1 1.0,clump-p2 1.0,-clump-r2 0.1 and-clump-kb 500. The same program was used for scoring each subject in our sample. The correlations between normalized marker levels and PGRS were computed using the R program with the cor.test for the Pearson' correlation. The proportions of the variance explained for each marker by each PGRS was computed as the square of the Pearson's correlation coefficients.
Supporting information S1 Text. Additional analyses performed. (PDF) S1  Table. FINEMAP results for regions associated with S100B assuming six causal variants.