Use of Systems Biology Approaches to Analysis of Genome-Wide Association Studies of Myocardial Infarction and Blood Cholesterol in the Nurses' Health Study and Health Professionals’ Follow-Up Study

With the advance of genome-wide association studies and newly identified SNP (single-nucleotide polymorphism) associations with complex disease, important discoveries have emerged focusing not only on individual genes but on disease-associated pathways and gene sets. The authors used prospective myocardial infarction case-control studies nested in the Nurses’ Health and Health Professionals Follow-Up Studies to investigate genetic variants associated with myocardial infarction or LDL, HDL, triglycerides, adiponectin and apolipoprotein B (apoB). Using these case-control studies to illustrate an integrative systems biology approach, the authors applied SNP set enrichment analysis to identify gene sets where expression SNPs representing genes from these sets show enrichment in their association with endpoints of interest. The authors also explored an aggregate score approach. While power limited one’s ability to detect significance for association of individual loci with myocardial infarction, the authors found significance for loci associated with LDL, HDL, apoB and triglycerides, replicating previous observations. Applying SNP set enrichment analysis and risk score methods, the authors also found significance for three gene sets and for aggregate scores associated with myocardial infarction as well as for loci-related to cardiovascular risk factors, supporting the use of these methods in practice.


Introduction
The traditional risk factors for myocardial infarction (MI) include age, plasma lipid concentrations, blood pressure, use of tobacco, and presence of type 2 diabetes mellitus. Family history has also been established as an important risk factor for MI, and is included in some risk scores [1]. While lifestyle clearly plays a role in these risk factors, a large proportion of the inter-individual variability in plasma lipid concentrations is due to inherited factors [2][3][4][5][6]. Premature MI appears to have a particularly strong genetic component independent of established risk factors, and individuals having at least one parent with premature cardiovascular disease (age of onset <55 years in men, <65 years in women) have significantly increased risk of a cardiovascular event [7].
These observations have motivated investigators to undertake a variety of studies to identify genes responsible for the heritability of MI and cardiovascular risk factors. For several decades, linkage analyses and candidate gene studies investigated genes responsible for the heritability of cardiovascular risk and successfully identified a handful of loci unequivocally linked to MI, as well as several additional loci with a weaker level of evidence.
More recently, the emergence of genome-wide association (GWA) arrays have made it possible to perform genome-scale screens for common DNA sequence variants associated with phenotypes of interest including coronary artery disease and MI [8][9][10]. Genome-wide association studies (GWAS) involving up to 100,000 individuals of European ancestry have identified >90 genetic loci contributing to inter-individual variation in plasma lipid concentrations [11]. Many of these loci harbored genes previously known to influence plasma lipid concentrations. Knowledge of the new sequence variants that confer risk has the potential to illuminate causal biologic pathways in humans and help appropriately target diagnosis and treatment [12].
Whereas most prior GWA studies of coronary heart disease (CHD) have been cross-sectional, we performed GWAS using the established prospective MI case-control studies of the Nurses' Health Study (NHS) and Health Professionals Follow-Up Study (HPFS) cohort to identify common genetic variants that associate with MI, cardiovascular disease biomarkers and lipids. The use of prospective studies avoids survival bias which can confound cross-sectional studies. With integrative analysis approaches, we used expression SNPs (eSNPs), transcriptional networks and expression profiling signatures to explore explanatory relationships between genetic variants identified and disease phenotype.

Study Design and Samples
Details of the NHS and HPFS cohorts have been described previously [13,14]. Briefly, the NHS was established in 1976 when 121,700 female registered nurses aged 30-55 years and residing in 11 US states completed a mailed questionnaire on their medical history and lifestyle. The lifestyle factors, including smoking, menopausal status and postmenopausal hormone therapy and body weight, have been updated by validated questionnaires every 2 years. A total of 32,826 women provided blood samples between 1989 and 1990. The HPFS is a prospective cohort study of 51,529 US male health professionals aged 40-75 years at study initiation in 1986. Information about health and disease is assessed biennially by a self-administered questionnaire. Between 1993 and 1999, 18,159 men provided blood samples. Participants from both cohorts underwent local phlebotomy and returned samples to our laboratory via overnight courier. Upon arrival, whole blood samples were centrifuged and stored in cryotubes as plasma, buffy coat, and red blood cells in the vapor phase of liquid nitrogen freezers. DNA was extracted from the buffy coat fraction of centrifuged blood with the QLAmp Blood Kit (Qiagen, Chatsworth, California). NHS and HPFS participants for the current study were those with a blood sample in the nested case-control studies of incident CHD, defined as non-fatal MI or fatal CHD. In the NHS, 474 women free of cardiovascular disease or cancer in 1990 and 454 men in HPFS free of cardiovascular disease in 1994 developed incident CHD prior to June 2004. For each incident case of CHD, 2 men or women who were free of cardiovascular disease were randomly selected matching on age (in 5-year increments), smoking (in 5 categories), and month of blood return using risk-set sampling [15]. In the NHS, matching criteria also included fasting status and reported problems with blood sampling.

Ethics Statement
The present study was approved by the institutional review board of the Brigham and Women's Hospital and the Human Subjects Committee Review Board of Harvard School of Public Health. All documents sent to participants have been approved by the ethics committee.
We wrote to participants who reported incident CHD on the follow-up questionnaires to confirm the report and request permission to review medical records. Participants, or the next of kin in the case of deceased participants, provided implied written consent by the return of the mailed questionnaires. However, further consent to review medical records was through signed written consent. To document the process, upon reporting an incident MI, participants were recontacted by mail to confirm the report and to obtain written consent to review medical records. We also sought medical records for deceased participants, whose deaths were identified by families and postal officials and through the National Death Index. Physicians blinded to the participant's questionnaire reports reviewed all medical records. Cases were identified primarily through review of medical records, as previously described [14,16].

Genotyping and Quality Control
Genotyping was done using the Affymetrix SNP 6.0 array and the Birdseed calling algorithm [17]. Genotypic data for a total of 1330 HPFS samples (98%) passed laboratory technical quality control criteria. Likewise, 96% of the NHS samples were successfully genotyped. A subset of 312 NHS samples were not genotyped together with the remaining CHD case-control set as they overlapped with previous GWA studies of breast cancer (Illumina 550) and type 2 diabetes (Affymetrix 6.0). These samples were quality controlled as part of the earlier GWAS (leaving n=272 samples with available data) and SNPs, also present on the Affymetrix 6.0 platform, were subsequently merged with the CHD data, after removing samples with the following criteria: 1) accidental duplicates as identified by pairwise identity-by descent; 2) samples identified as siblings; 3) discordant genotypic and phenotypic sex; and 4) sample call rate <98%. Population structure was investigated by principal component analysis. We used a set of 12,021 SNPs selected to have very low levels of linkage disequilibrium and minor allele frequency >0.05 in Caucasians [18]. Study subjects passing quality control were analyzed together with a set of 209 HapMap II founders (59 CEU, 60 YRI, 45 JPT and 45 CHB). Subjects within the means of the first and second principal components [mean (SD) = 3] among self-described Caucasians were classified as having primarily European ancestry. Due to very few samples with substantial evidence of non-European genetic ancestry, these samples were excluded from subsequent analysis (n=24). SNPs that were monomorphic, had a missing call rate ≥2%, a Hardy-Weinberg Equilibrium p-value <1×10-4, or a minor allele frequency <0.02 were excluded, leaving a total of 724,881 SNPs that passed quality control in HPFS and 721,316 in NHS for analysis of called genotypes. Imputation of ~2.5 million SNPs was performed using MACH software (v1.0.16) with HapMap CEU phased II data (Release 22) as the reference panel.

Statistical Analysis
We performed unconditional logistic regression to analyze the association between each assayed or imputed SNP and incident CHD risk using an additive genetic model. Unconditional regression was used because some of the original matches were broken when informed consent had to be obtained a second time. NHS and HPFS were examined separately. The genomic inflation factor was adjusted by genomic control methods. To control for potential confounding by population stratification, we performed further analyses by including the top principal components of genetic variation chosen for each study in the models. Adjusting for the top three and four principal components for NHS and HPFS, respectively, made no material difference to the GWA results.
We used two meta-analytic methods to summarize the statistical evidence for each SNP. We combined odds ratios for a given reference allele on a logarithmic scale weighted by the inverse of their variances using a fixed-effects model. We also combined evidence for association solely on the basis of signed p-values weighted by sample size. In brief, we converted the two-sided p-value to a z-statistic and assigned a sign to reflect the direction of the association given the reference allele. Each z-score was then weighted with the squared weights summing to 1, with sample-specific weights being proportional to the square root of the effective number of individuals in the sample.
All analyses were implemented using R (version 2.12).

SSEA
A SNP set enrichment (SSEA) test [19,20] was applied in an attempt to identify underlying biological signals within the many suggestive associations observed for MI in this dataset. Briefly, more than 100 gene sets derived from mRNA profiling signatures, gene expression networks and other sources relevant to dyslipidemia and cardiovascular disease were assembled (Table S1 and Table S2). Each gene set was sequentially tested against the associations identified for the phenotype of interest ( Figure 1). For each gene set, multiple eSNPs representing the genes in that gene set were tested as a group for association to the trait of interest using a Kolmogorov-Smirnov test. Gene sets corresponding to Kegg pathways were also tested for the phenotype of MI.

Testing for Gene Environment Interaction
Taking advantage of the rich environmental data captured in this cohort, we examined gene-environment interactions for HDL, apoB, LDL, smoking, and alcohol in relation to incident MI using both linear and non-linear terms. Given the additional tests, although not independent, a conservative Bonferroni threshold of p<1E-9 was considered for genome wide significance We also constructed a two degree of freedom test for main and interaction effects in the context of nested models. The test is essentially equivalent to that proposed by Kraft et al. [21]. By comparing a model with both main effects and the interaction between the gene and the environmental factor to one with only the main effects, a likelihood ratio test was formed to screen (1) SNPs which may have significant main or interaction effects, and (2) SNPs may only show significant associations when interaction is considered. Both of these models were fit using the continuous linear and non-linear variable for the trait of interest (alcohol, HDL, total cholesterol, apoB). We constructed four sets of logit models, and tested the SNP by environmental interactions by comparing the full model vs the null model in each set after adjusting for age, smoking and principal components. The four models differed by whether the environmental factor was fit using a polynomial spline function with three degrees of freedom or a continuous linear function (methods S1). Finally, we reported significant interactions in any of those models.

Predicting MI Risk using SNPs Underlying Lipids and Other Risk Factors
A genotype risk score similar to that of Purcell et al. [22] was applied to evaluate whether common genetic variants have an important role in the prediction of MI risk. Using increasing thresholds for the number of SNPs included (based on ascending p-value), we defined large sets of 'score alleles' from the cardiovascular biomarkers [LDL, HDL, apoB, adiponectin, c-reactive protein (CRP), triglycerides, total cholesterol] and the MI associated loci to generate aggregate risk scores. In brief (1), we selected the top N SNPs from each GWA study on a given lipid trait, avoiding SNPs in strong linkage disequilibrium (LD) by forcing the SNPs to be at least 100 kb apart (2). For each subject we compiled a risk score based on his/her genotypes of the selected SNPs. For a SNP associated with LDL, total cholesterol, apoB or triglycerides, if the subject was homozygous for a SNP linked to higher lipid level, we added "+1" to his/her risk score; if homozygous for a SNP linked to lower lipid level, we added "-1"; a heterozygous genotype was assigned "0". Signs were reversed for HDL and adiponectin in the risk score (3). The association of each subject's risk score to MI was tested in a logistic regression model and significant results indicated prediction power to MI.

Genome-wide Association for MI and Blood Markers
Cases were matched to controls on age but cases but had a greater prevalence of hypertension, diabetes and smoking (Table 1). No loci reached genome-wide thresholds for statistical significance for MI. Several SNP loci reached

SNP Set Analysis for MI and apoB
In SSEA analysis, the top gene sets associated with MI (meta-analysis) included published mRNA profiling signatures from infarcted tissue (p=2.9E-4), implicated causal genes from the Global Lipid Genetics GWAS Consortia (p=2.5E-3) (Figure  2), and a gene set formed from the intersection of many gut profiling signatures in the Nextbio database (p=5.0E-3). Kegg pathway gene sets significantly associated with MI included small cell lung cancer (p=1.1E-3), porphyrin and chlorophyll metabolism (p=1E-2) and glycosaminoglycan degradation (p=1.6E-2). Finally, for the 29 MI loci previously identified and replicated by the Myocardial Infarction Genetics, Cardiogram or the C4D Consortia [9,10,23] no statistically significant individual SNPs or gene sets were observed.

Gene Environment Interactions Examined for Myocardial Infarction
SNP interactions that reached or approached genome-wide levels of significance included rs11147396 with alcohol for MI; rs3806850 and HDL for MI; rs17542348 with cholesterol for MI in HPFS only; and rs10794792 with apoB for MI ( Table 4). None of these associations showed a significant association to MI when examined for main effect in the meta-analysis or in HPFS only. SNP rs17542348 was marginally associated with MI risk (p value=3.7E-6) in NHS, where T was the risk allele.

Genotype Risk Score of Cardiovascular Risk Factors and the Prediction of MI
Not surprisingly, aggregate risk scores derived from biomarker meta-analysis associations (LDL, HDL, apoB, adiponectin, CRP) associated strongly with MI in both the

Discussion
MI is a leading cause of death and disability worldwide, and the inherited basis for myocardial infarction remains incompletely understood, although substantial progress has been made recently through the application of GWAS [9,10,[23][24][25]. Applying GWAS to our prospective case-control studies of MI, we found no loci reaching genome wide significance levels, and no association signals at 9p21.3 which has previously been associated with early onset myocardial infarction or CAD [9,24,25]. Restricting the age of MI to younger individuals did not elucidate an association signal around 9p21, although power was even more limited. Similarly, no strong associations were observed at up to twenty nine other MI loci previously identified and replicated by the Myocardial Infarction Genetics, Cardiogram or the C4D Consortia [9,10,23]. We believe the absence of apparent association between genetic variants and MI is likely due to limited sample size. After this study was initiated, it became apparent the current cohort was substantially underpowered for the primary endpoint of MI. It is also possible that the prospective study design influenced the likelihood of detecting loci discovered predominantly in cross sectional studies. However larger prospective cohorts will be required to evaluate this possibility.  Significant SNP associations with blood cholesterol phenotypes LDL, HDL, apoB and triglycerides were observed, successfully replicating numerous previous observations [11,26]. The identification of the APOB locus as the one most significantly associated with circulating apoB levels confirms that meaningful information can be derived from GWAS with quantitative traits using sample sizes in this range. Similarly, SNPs in close proximity to the gene encoding the C-reactive protein exhibited the strong associations with CRP levels.
Notable suggestive loci observed for apoB have been previously associated with adiposity [27] and hypertension [28]. The MSRA locus contains eSNPs for MSRA (MGH subq), SOX7 (MGH Liver, deLiver) and others [29]. The suggestive locus around SCN3A is also supported by an expressional SNP for SCN3A in subcutaneous adipose tissue. FXR (Nr1h4) is proximal to the TMEM16D locus identified. One of the primary functions of FXR activation is the suppression of cholesterol 7 alpha-hydroxylase (CYP7A1), the rate-limiting enzyme in bile acid synthesis from cholesterol. The N4BP2 locus also contains a (MGH) Liver eSNP to N4BP2 but no evident biological rationale.
In the context of complex human conditions, the effect size for a given causal SNP is often very small. Therefore very large sample size (e.g. N>10,000) would be required to appropriately power detection of such SNPs. An alternative approach is to use the SNP Set Enrichment Analysis (SSEA) to determine whether a group of SNPs, identified by association with gene sets, is enriched for small p-values in GWAS results by examining the effect of many SNPs, as a group, simultaneously. SSEA gains statistical power by (1) reducing the number of tests and (2) pooling genetic effects from multiple weak potentially causal SNPs. Many Kegg and biologically relevant gene sets were assembled and tested independently, and while the individual SNP p-values were moderate, as a set, we observed globally significant SSEA pvalues. The top gene sets associated with MI (meta-analysis) included published mRNA profiling signatures from infarcted tissue [30], implicated causal genes from the Global Lipid Genetics Consortia GWAS [11], and a gene set formed from the intersection of many gut mRNA profiling signatures in the Nextbio database (www.nextbio.com). The gene signatures from infarcted tissue were derived in a mouse model of MI and it is striking that the expression quantitative trait loci for such a gene set would show a significant association with MI when examined together. Similarly, expression quantitative trait loci for genes underlying the Global Lipid Genetics Consortia GWAS loci show a significant association with MI. This is perhaps not that surprising given the established relationship in MI prediction for many of the LDL loci in particular. The gut mRNA profiling signatures were chosen predominantly on their relevance to cholesterol related experiments so this may be reflected in the association identified. Three Kegg pathway gene sets were significantly associated with MI, but no evident biological rationale exists for these Kegg pathways. Previously reported MI GWAS loci only achieve moderate p-values in our cohorts (partially due to sample size and statistical power).
A number of SNP-environment interactions were observed that reached or approached genome wide levels of significance although none of these SNPs were significantly associated with MI as main effects. The interaction of a locus on 13q12 with alcohol has not been previously reported in association with MI. No biological rationale is evident however, for the genes Ubl3 or SLC7A1 that lie in the region of the SNP association. The gene for Cadherin 6 underlies the interaction observed at 5p13 for HDL in association with MI. Cadherins mediate cell-cell binding and contribute to the sorting of heterogeneous cell types and the maintenance of orderly structures in tissue, but no role has yet been described in lipoprotein metabolism. An interaction for total cholesterol and a SNP at 7p21 was observed for MI in the HPFS cohort only. This SNP is near Phf14, a nuclear phosphoprotein that may function as a colon cancer tumor suppressor. Finally, an interaction between apoB and rs10794792 for MI was identified at 10p15 near the gene location for ADARB2. ADARB2 encodes Adenosine deaminase RNA-specific B2, which binds to single-stranded and doublestranded RNAs, and may play a role in regulation of substratespecific RNA editing. Hepatic apoB mRNA editing determines the proportion of VLDL that contains full length (apoB100) or truncated (apoB48) RNA; however, ADARB2 is not known to play a role in this process. Additional replication studies are warranted for these interactions in independent cohorts.

Conclusions
We used prospective case-control studies of MI to illustrate application of novel SSEA methods and a risk score approach as a means of assessing groups of loci that may be predictive of MI, despite marginal individual loci effects. While power limited our ability to detect significance for individual loci with MI, we found significance for loci associated with LDL, HDL, apoB and triglycerides, replicating previous observations. Applying SSEA and risk score methods, we found significance for three gene sets and for aggregate scores associated with MI as well as for loci related to cardiovascular risk factors, supporting the utility of these methods in practice.