Systematic Pathway Enrichment Analysis of a Genome-Wide Association Study on Breast Cancer Survival Reveals an Influence of Genes Involved in Cell Adhesion and Calcium Signaling on the Patients’ Clinical Outcome

Genome-wide association studies (GWASs) may help to understand the effects of genetic polymorphisms on breast cancer (BC) progression and survival. However, they give only a focused view, which cannot capture the tremendous complexity of this disease. Therefore, we investigated data from a previously conducted GWAS on BC survival for enriched pathways by different enrichment analysis tools using the two main annotation databases Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG). The goal was to identify the functional categories (GO terms and KEGG pathways) that are consistently overrepresented in a statistically significant way in the list of genes generated from the single nucleotide polymorphism (SNP) data. The SNPs with allelic p-value cut-offs 0.005 and 0.01 were annotated to the genes by excluding or including a 20 kb up-and down-stream sequence of the genes and analyzed by six different tools. We identified eleven consistently enriched categories, the most significant ones relating to cell adhesion and calcium ion binding. Moreover, we investigated the similarity between our GWAS and the enrichment analyses of twelve published gene expression signatures for breast cancer prognosis. Five of them were commonly used and commercially available, five were based on different aspects of metastasis formation and two were developed from meta-analyses of published prognostic signatures. This comparison revealed similarities between our GWAS data and the general and the specific brain metastasis gene signatures as well as the Oncotype DX signature. As metastasis formation is a strong indicator of a patient’s prognosis, this result reflects the survival aspect of the conducted GWAS and supports cell adhesion and calcium signaling as important pathways in cancer progression.


Introduction
Worldwide, breast cancer (BC) is the most common cancer among women, comprising 23% of all female cancer. Each year, about 1.4 million new cases are diagnosed and about 460,000 women die of this disease [1]. It has been shown that survival of BC is in part heritable which can possibly be explained by yet unknown genetic factors [2]. Further knowledge about the effects of inherited genetic variation on BC survival can help to predict the patient's individual risk for disease progression and survival probabilities and to develop new and better therapies and prevention strategies. A genome-wide association study (GWAS) is a powerful tool to search for a genetic influence on complex traits. Within the last six years 34 GWASs on breast cancer have been performed identifying 194 new susceptibility loci (http:// www.genome.gov/gwastudies). Also three GWASs on breast cancer survival have been conducted leading only to three prognostic loci [3][4][5]. Therefore, a more global view on GWAS data can reveal new insights in cancer formation and progression and give new clues for further investigations.
A good tool to set high-throughput data into a global context is a pathway enrichment analysis [6]. The gene-group-based approach increases the likelihood to identify the biological processes which are overrepresented in the high-throughput data and have a high impact on the studied disease. The most commonly used gene annotation databases are Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG), in which the biological knowledge about genes and their associated processes and pathways are collected. This knowledge can be used by pathway enrichment tools, which map the genes of the investigated list to the associated biological annotation terms of the database. Then, the customized enrichment result is compared to the control background and an enrichment p-value is calculated and corrected for multiple testing. Currently, a huge variety of different pathway enrichment tools are available.
Some of the tools input lists of genes or proteins and output enriched pathways. Others take the locations of single nucleotide polymorphisms (SNPs) into consideration, and thus gene lists can be derived from GWAS data. The aim of our study was to submit a GWAS on BC survival to a pathway enrichment analysis. In the GWAS, the genotype data of women of Western European origin with long and short time survival after the diagnosis of BC were compared. Pathway enrichment analysis was conducted using six different enrichment tools on four final gene lists. The gene lists based on our SNP data with allelic p-value cut-offs 0.005 and 0.01 and with a gene annotation by excluding or including a 20 kb upand down-stream sequence of the gene. Only those categories which were enriched in all four lists and more than one tool were considered to be consistently enriched. We were also interested whether our results are supported by gene signatures on breast cancer prognosis derived from gene expression profiling studies. Therefore, we performed pathway enrichment analyses with several commonly used prognostic gene signatures and compared the results with our GWAS data.

Ethics Statement
All participants in the GWAS gave written informed consent to the use of their samples for research purpose. The study was approved by the ethical committee of each participating institute.

GWAS
The GWAS on BC survival was a population based case-only study, in which the BC patients were divided in two groups based on their survival time. We considered as cases 369 women with short-time survival (less than 6 years after breast cancer diagnosis) which were compared with a group consisting of 369 women with long-time survival ($11 years after breast cancer diagnosis) as controls. Details of the characteristics of the study population are found in the table S1. The cases and controls were selected from four cohorts and matched for age at diagnosis (,40, 40-49, 50-59 and $60 years), gender, diagnosis period (1985-1989, 1990-1994 and 1995-) and cohort (table S2). Blood samples were prospectively collected in each cohort. The cases and controls were identified from the cohorts by record linkage to the regional cancer registries. Follow-up was performed until December, 31st, 2007 and the data were available for every patient. The Vä sterbotten intervention project (VIP), the mammary screening project (MSP) and the Department of Oncology, Norrlands University Hospital, Umeå, Sweden, contributed with 96 cases and 96 controls [7].  [10].
A genome-wide scan of , 300,000 tagging SNPs was conducted using the Illumina HumanCytoSNP-12 v1 according to the manufacturer's protocols. Before analysis, markers with one or more of the following criteria were excluded: ,90% genotype call rate, minor allele frequency ,5% or Hardy-Weinberg equilibrium exact p-value ,10 25 . Genotype calling was done using Illumina GenomeStudio 2010. The GWAS was conducted by PLINK v1.06, with the option of ''model'' to perform a Cochran-Armitage and a full-model case-control association test.

Enrichment Analysis
The GWAS data were investigated for SNPs which were annotated to a gene and located within the 59UTR, 39UTR, introns and exons of the gene, alternatively within a genomic region including up to 20 kb up-and downstream of a gene locus. Different allelic p-value cut-offs (0.05, 0.01, 0.005, 0.001 and 0.0001) were set to generate gene lists for both scenarios. If there was more than one SNP per gene meeting the selection criteria, the SNP with the lowest p-value was taken into account. Finally, four gene lists, two per scenario, with the allelic p-value cut offs of 0.01 and 0.005 were selected as input for six enrichment analysis tools (ConsensusPathDB, DAVID, FatiGO, GATHER, GeneCodis and WebGestalt) using the two main annotation databases Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) as basis. These pathway enrichment tools were selected based on our previous experience on pathway enrichment Table 2. Number of GO annotations and KEGG pathways enriched by six pathway enrichment tools for gene lists with allelic SNP p-value cut-offs 0.005 and 0.01. analyses [11]. The selection criteria included free availability, userfriendly handling, the usage of gene names as input and the GO and the KEGG database as basis, variations in stringency, test statistics and multiple comparison adjustment methods. The four gene lists were selected because they provided the enrichment tools with an applicable number of genes to run the enrichment analyses successfully. With a too stringent allelic SNP p-value cutoff, too few genes serve as input resulting in no significantly enriched categories. A too tolerant allelic p-value cut-off increases the background noise and may result either in too many unspecific or no enriched categories [12,13]. For all tools the same conditions were applied, which were a significance threshold of 0.05 for the adjusted enrichment p-value, at least two genes from the input list in the enriched category and the whole genome as reference background. The goal was to identify the functional categories (GO terms and KEGG pathways) that are consistently overrepresented in a statistically significant way in the list of SNPs inferred from the GWAS on BC prognosis. The used tools and their characteristics can be seen in table S3.

Consistently Enriched Categories
For DAVID, FatiGO and GATHER, the tool's default p-value cut-off of 0.05 generated a list of 20-30 enriched categories for the comparison. However, for Consensus PathDB, GeneCodis and WebGestalt, this p-value cut-off generated lists of up to 176, 278 and 77 enriched GO terms and we used the more stringent enrichment p-value cut-offs of 1610 26 , 1610 24 and 0.01, respectively. The enriched categories of each allelic p-value cutoff gene list based on the SNPs within a gene region were compared to those of the gene list taking also the SNPs within the 620 kb spanning region into account (0.01 list vs. 0.01620 kb list; 0.005 list vs. 0.005620 kb list). This was done for every tool separately. Then, the overlaps of the two different allelic p-value cut-offs were compared to each other. Finally, we compared the results of all tools to each other. Only categories enriched in all four gene lists and by more than one tool were considered consistently enriched (figure 1).

Prognostic Gene Expression Signatures
Literature was searched for commonly used prognostic gene signatures derived from breast cancer expression data. Twelve gene expression signatures were selected for further pathway enrichment analysis conducted by MetaCore GeneGo pathway enrichment analysis (false discovery rate (FDR) cut-off 0.05) because this tool enables the pathway enrichment analysis of two gene lists simultaneously and compares the results to each other. Also the 0.01 gene list of our GWAS was analyzed again with this tool to make the results comparable to the ones of the gene expression signatures.

Systematic Enrichment Analyses of the GWAS Data
The consecutive steps of the pathway enrichment analysis are summarized in figure 1. The GWAS data were filtered for SNPs located within a gene (59UTR, 39UTR, intron and exon), as well as for SNPs located in a genomic region 20 kb up-and downstream from a gene locus to take also genetic effects in regulatory regions into account. Five different p-value cut-offs for both scenarios were set to generate gene lists (table 1). The gene lists based on SNPs created by the p-value cut-offs 0.01 and 0.005, consisting of 737 and 1143 genes and 402 and 638 genes, respectively, provided the enrichment tools with an applicable number of genes to run the enrichment analyses successfully.
These four gene lists served as input in six different pathway enrichment analysis tools under the same setting. The number of enriched GO terms/KEGG pathways differed enormously between the enrichment analysis tools due to the individual tool features although the same analysis conditions were assigned (table 2). The ConsensusPathDB and GeneCodis tool reported in general much more enriched GO terms than the other used tools. For example, they generated 176 and 278 overrepresented GO annotations, respectively, when the 0.01620 kb gene list was analyzed. As comparison, DAVID and FatiGO reported only 4 enriched categories each.
To reduce the number of enriched categories, the results of the two gene lists with an allelic p-value cut-off of 0.005 were compared with each other. This was also done for the two gene  Table 4. Gene overlap of the consistently enriched categories for all pathway genes. lists with a p-value cut-off of 0.01 and the resulting overlaps were compared with each other. This was done separately for every tool. To define consistently enriched categories, the categories had to be overrepresented by at least two different tools. After this comparison eleven categories remained: two GO terms, which were ''calcium ion binding'' and ''cell adhesion'' and nine KEGG pathways named ''adherens junction'', ''arrythmogenic right ventricular cardiomyopathy'', ''axon guidance'', ''calcium signaling'', ''dilated cardiomyopathy'', ''ECM-receptor interaction'', ''focal adhesion'', ''O-glycan-biosynthesis'' and ''small cell lung cancer'' (table 3). Most categories were reported by three or four tools. We compared the genes of every category to each other to detect overlaps of the pathways to define the consistently enriched categories (table 4). The cross-tabulation revealed a strong association of ''cell adhesion'' genes with all pathways except for the genes in ''calcium signaling'' and ''O-glycan biosynthesis''. Moreover, we investigated the overlap of our GWAS genes in the pathways, resulting in a similar outcome (table 5). Based on this analysis most categories were summarized in two overarching categories: Additionally, the gene overlap of 20-30% between the two GO terms ''calcium ion binding'' and ''cell adhesion'' supports a connection of these two annotations.
Enrichment Analysis of the 0.01 Gene List with MetaCore As we wanted to compare our GWAS pathways with prognostic expression signatures, the longer 0.01 gene list was further analyzed by GeneGo, the pathway enrichment analysis tool of MetaCore, which allows simultaneous analysis and comparison of two gene lists. Fifteen pathways passed the significance level defined by a FDR of 0.05 and the analysis confirmed the importance of cell adhesion, axon guidance and calcium signaling (figure 2) in the GWAS survival signature. Although the GeneGo pathway enrichment analysis uses its own pathway terms, they are similar to the GO terms or KEGG pathways. Also the O-Glycan biosynthesis was found in the top 5 enriched pathways. The five most common terms, cell adhesion, cytoskeleton remodeling, development, muscle contraction and neurophysiological process, constituted 56% of the top 50 pathways (table 6, table S4).  Table 6. Distribution of the seven generic terms among the 50 top pathways in the enrichment analyses of the 13 gene lists.

Enrichment Analyses of the Gene Expression Signatures
Literature was searched for commonly used prognostic gene signatures derived from breast cancer expression data. We selected twelve signatures for further pathway enrichment analysis (table 7). Mammaprint [14], Oncotype DX [15], MapQuant [16], Gene Search [17] and the fibroblast core serum response (CSR) signature, commonly known as wound response signature [18], are well established, often cited in literature and commercially available. We also included five gene signatures based on expression data of metastatic breast cancer or metastatic adenocarcinomas of diverse origin [19][20][21][22][23], because metastasis formation has a profound impact on patients' survival. Last, we added two prognostic gene signatures based on meta-analyses of published gene expression signatures and microarray data sets of breast tumors [24,25] to evaluate how a combination of several prognostic gene signatures influences the enrichment analysis and if this result is comparable to the one obtained by the GWAS.
The signatures could be divided in four subgroups. The commercially available gene signatures Mammaprint, Oncotype DX, MapQuant and Gene Search were dominated by the terms cell cycle and development (table 6). Also the two meta-analyses showed enrichment of genes involved in cell cycle (34% of the 50 top pathways each), as did the general metastasis signature (16%). The specific lung, brain and bone metastasis signatures showed a strong connection to the generic terms immune response and development, which represented about 30% and 20% of the enriched pathways, respectively, and they were lacking pathways associated with cell cycle. The wound response and invasiveness signature did not show any specific pattern.

Comparison of the GWAS and the Gene Expression Signatures
In order to evaluate the similarities between the GWAS and the gene expression signatures, we analyzed the GWAS gene signature and every prognostic gene expression signature simultaneously with the MetaCore GeneGo pathway enrichment analysis tool to get a detailed view on their common pathways (figure 3, Figures  S1-S11). In this analysis the two gene lists were investigated for  ''Airway smooth muscle contraction in asthma'' (figure 4) is almost identical to the top pathway in this analysis, ''Muscle contraction_GPCRs in the regulation of smooth muscle tone'' (figure S12), showing a clear connection to calcium ion binding, with the Ca 2+ -ions containing endoplasmatic reticulum and the associated proteins as one central part of these pathways. Several proteins of these pathways can also be found in the GeneGo pathway ''Cytoskeleton remodeling_Cytoskeleton remodeling'' (figure 5). This pathway combines several sub-pathways, many of them involved in cell adhesion. These include the pathways ''ECM-receptor interaction'', ''focal adhesion'' and ''adherens junction''. Also links to the well-known cancer pathways ''TGF-b signaling'' and ''Wnt signaling'' are observed. ''Cytoskeleton remodeling_Cytoskeleton remodeling'' was also significantly enriched by the brain metastasis gene signature (P brain metastasis signature = 2.5610 23 ) and placed at rank 15 ( figure S7).
The third pathway significantly overrepresented by the 0.01 gene list and a gene expression signature in the simultaneous analyses was ''Neurophysiological process_Receptor-mediated axon growth repulsion'' (figure 6), which was significantly enriched in the analysis together with the Oncotype DX signature and placed at rank 4 (P 0.01 gene list = 9.1610 25 ; P Oncotype DX = 6.4610 23 ) (figure S2). Also this pathway has a connection to the calcium signaling and cell adhesion pathway.

Discussion
The aim of our study was to set data derived from a GWAS on breast cancer survival into a global context by using a systematic pathway enrichment analysis with the two independent databases GO and KEGG as basis. In this process, the GO database was searched for overrepresented terms on a higher level of abstraction. A more detailed and focused view was achieved by using the data of the KEGG. By this way we gained eleven consistently enriched categories, two more general GO terms and nine specific KEGG pathways, which may have an influence on BC survival. A gene overlap of up to 87% between six of these categories revealed a strong connection to cell adhesion and included the KEGG pathways ''adherens junction'', ''axon guidance'' ''ECM-receptor interaction'', ''focal adhesion'', and ''small cell lung cancer'' and the GO term ''cell adhesion''. The second category with a high proportion of overlapping genes involved in calcium ion binding included the GO term ''calcium ion binding'' and the KEGG pathways ''arrythmogenic right ventricular cardiomyopathy'', ''calcium signaling'', ''dilated cardiomyopathy'' and ''O-glycan biosynthesis''. There was also an overlap of 20-30% between the two overarching categories, which emphasizes the interplay of cellular adhesion processes and calcium signaling as an important process in breast cancer survival.
In the second part of our study we compared the pathway enrichment results of our GWAS data to those of twelve prognostic gene expression signatures. Simultaneously conducted pathway enrichment analyses with each of the expression signatures revealed that the ''Airway smooth muscle contraction in asthma'', the ''Cytoskeleton remodeling'' and the ''Neurophysiological process_Receptor-mediated axon growth repulsion'' pathways were the only ones which were significantly overrepresented by both the GWAS and a gene expression signature. The gene expression signatures involved were the general metastasis signature with two simultaneously enriched pathways and the specific brain metastasis signature and Oncotype DX each with one simultaneously enriched pathway, respectively. The general metastasis signature was derived from a comparison of gene expression data of adenocarcinomas of diverse origin (lung, breast, prostate, colorectal, uterus, ovary) with the corresponding metastases leading to 128 genes that distinguished best between primary tumors and metastases [23]. The brain metastasis signature is based on a genome-wide expression analysis of two BC cell lines and their highly brain metastatic cell derivates [20]. The comparison of the gene expression profiles led to 243 differentially expressed genes, which were used as a brain metastasis signature in our study. The Oncotype DX signature was generated by a hypothesis driven search of the literature and databases for candidate cancer genes, which were tested for their correlation with disease recurrence in three independent breast cancer studies [15]. The sixteen best performing genes and five reference genes were used to calculate a recurrence score. These three gene expression signatures are based on genes which are already known for their involvement in cancer (Oncotype DX) or which are associated with the metastasis forming process (general and brain metastasis signature). Together, the three simultaneously enriched pathways picture well a possible interaction of the cell adhesion pathways with the calcium signaling pathway in the metastatic process and the patients' survival probabilities (figures [4][5][6]. Calcium signaling and cell adhesion interact in various ways with each other and play an important role in metastasis, which involves detachment from the solid primary tumor, migration and invasion in a foreign tissue [26]. For example, E-cadherin as a key cell-to-cell adhesion molecule, essentially requires Ca 2+ -ions to form homophilic interactions between two neighboring cells in adherens junction [27]. Its down-regulation or inactivation in carcinomas has been reported to result in reduced cell adhesion [28,29] making it as a major suppressor of metastasis. Also focal adhesions, as the main linkage point between the cells and the extra cellular matrix (ECM), are influenced by calcium. Focal adhesion turnover, which determines the efficiency of cell migration, is regulated by calcium signaling. An important component in this process is focal adhesion kinase (FAK), which is a contact point for diverse extracellular stimuli, including Ca 2+concentration. FAK coordinates signals between integrins, the attachment molecules to the ECM, and growth factor receptors and promotes cell migration [30]. These examples point to the regulation of the metastasis formation either directly through mutations in the involved adhesion molecules or indirectly through impaired ''calcium signaling''.
Metastases are the leading cause of death of cancer patients and therefore strongly connected to patients' survival. This was also reflected in our study population: short-time survivors tended to have tumors with higher stage than long-time survivors. As our data is based on a GWAS on BC survival comparing women with short-time survival to those with long-time survival, the results of our pathway enrichment analyses reflect the impact of the invasive tumor phenotype on the survival of a patient. Moreover, the comparison analyses with the pathway enrichment results of commonly used prognostic gene expression signatures support our conclusion.
Although pathway enrichment tools are able to put the GWAS data into a global context, there are some points which need to be considered [12,31]. Large genes with more SNPs are more likely to contain associated SNPs by chance alone than small genes. To avoid this bias, we annotated the SNPs to a gene both by excluding and including a 20 kb up-and downstream sequence of the gene. Only the best SNP (i.e. the one with the lowest p-value) per gene was included in the analysis. The 20 kb limit was applied because the average length of haplotype blocks in the CEU population ranges between 5.9 kb (calculation method based on the four gamete test) and 16.3 kb (calculation method based on a composite of local D9 values) [32]. The pathway enrichment tools themselves also suffer from some limitations, which we experienced in our study. Even though the conditions were identical in all analyses, the different tools showed large variability in the number of overrepresented categories and their corresponding pvalues [33]. The reasons for this variation include the source and the version of the annotation files, the annotation level used by the tool, the statistical model applied for the enrichment analysis, the correction for multiple testing, and the background gene set, which is used to calculate the p-values for the overrepresented pathways [34]. One way to avoid the problem of inconsistent results obtained by different tools is to use several tools and to compare the results with each other. In our study, we analyzed four gene lists derived from the GWAS on BC survival with six tools and compared the results to detect true, consistently enriched categories.
In conclusion, our pathway enrichment analysis of the highthroughput data from a GWAS on BC survival revealed an influence of cell adhesion and calcium signaling on BC patients' survival. This was also confirmed by our comparison to the enrichment analyses of twelve prognostic gene expression signatures. The known high impact of metastasis on a patients' survival is supported by our genetic data, which also highlights the influence of changes in cell adhesion and calcium signaling in the metastatic process. Therefore, a further investigation of the identified pathways and the defined mechanisms of metastasis is a promising target to get classifiers for the patients' survival.