Use of Genome-Wide Association Studies for Cancer Research and Drug Repositioning

Although genome-wide association studies have identified many risk loci associated with colorectal cancer, the molecular basis of these associations are still unclear. We aimed to infer biological insights and highlight candidate genes of interest within GWAS risk loci. We used an in silico pipeline based on functional annotation, quantitative trait loci mapping of cis-acting gene, PubMed text-mining, protein-protein interaction studies, genetic overlaps with cancer somatic mutations and knockout mouse phenotypes, and functional enrichment analysis to prioritize the candidate genes at the colorectal cancer risk loci. Based on these analyses, we observed that these genes were the targets of approved therapies for colorectal cancer, and suggested that drugs approved for other indications may be repurposed for the treatment of colorectal cancer. This study highlights the use of publicly available data as a cost effective solution to derive biological insights, and provides an empirical evidence that the molecular basis of colorectal cancer can provide important leads for the discovery of new drugs.


Introduction
Since the advent of high-density single nucleotide polymorphism (SNP) genotyping arrays, researchers have used genome-wide association studies (GWAS) to identify innumerable loci associated with a multitude of diseases. The vast majority of SNPs identified by GWAS are within the intergenic or intronic regions (approximately 88%) [1,2]. GWAS has also enabled the discovery of many genetic variations of colorectal cancers (CRC). The next step was to identify the genes that were affected by causal variants, which would enable us to translate the risk SNPs to meaningful insights on pathogenesis.
Most reports have simply implicated the nearest gene to a GWAS hit as a target of the functional variant without any evidence [1]. The identification of expression quantitative trait loci (eQTL) has been proposed as a promising method to find the candidate genes associated with a disease risk [3] [4]. It should be noted that identifying an eQTL provides only an indirect evidence of a link between genotype and gene transcription [1].
As far as we know, there's simply no good way to identify these target genes, which is key to understanding the mechanism by which GWAS variants act. So we proposed a bioinformatics pipeline to prioritize the most likely candidate genes by using several biological data sets. Seven criteria were adopted to prioritize candidate genes. The widely used eQTL criterion mentioned above is only one of seven criteria in the pipeline.
One way of accelerating the translation of data from GWAS into clinical benefits, is to use the results to identify new indications for treatment with existing molecules. GWAS may be used to construct drug-related networks, aiding drug repositioning. Although GWAS do not directly identify most of the existing drug targets, there are several reasons to expect that new targets will nevertheless be discovered using these data [5,6]. Initial results on drug repurposing studies using network analysis are encouraging and suggest directions for future development [7]. By integrating rheumatoid arthritis genetic findings with the catalog of approved drugs for rheumatoid arthritis and other diseases, Okada Y et al provided an empirical data to indicating that genetic approaches may be useful for supporting genetics-driven genomic drug discovery efforts in complex human traits [8].
In the present study, we used the in silico pipeline to systematically integrate data on risk loci for CRC biology and drug discovery from a variety of databases.

Materials and Methods
An overview of the study design is illustrated in Fig. 1. Biological candidate genes were obtained from GWAS-identified CRC risk loci. Next, the genetic data were integrated with the results of statistical analyses, computational approaches, and publicly available large data sets to prioritize the obtained genes, and propose new targets for drug treatments.

CRC risk loci from GWAS
We downloaded CRC risk SNPs from the National Human Genome Research Institute (NHGRI) GWAS catalogue database on January 31, 2014 [2].

Biological candidate genes from CRC risk loci
It is a well-known fact that the risk SNPs indicates haplotypes on which the functional variants reside; therefore, the next step was to identify their target genes. By adopting multi-annotations between risk SNPs and their surrounding genes, the snp2gene allowed conventional annotation due to their proximity, as well as linkage disequilibrium [9].
For each of the GWAS SNPs involved, we used the snp2gene to identify the candidate genes. For each gene in the risk loci, we evaluated if the gene was the nearest gene to the CRC risk SNP within the risk locus.

Prioritization of candidate genes
By using several biological data sets, we devised a bioinformatics pipeline to prioritize the most likely candidate genes.
Firstly, functional annotations for CRC risk SNPs were identified by ANNOVAR [10]. Traitassociated variants were enriched within chromatin marks, particularly in H3K4me3 [11]. H3K4me3 data of 34 cell-types could provide the fine mapping of associated SNPs to identify causal variation in the previous studies [12]. So we evaluated whether the CRC risk SNPs and SNPs in the linkage disequilibrium (r 2 > 0.80) were overlapping with H3K4me3 peaks of 34 cell types. The H3K4me3 data were obtained from the National Institutes of Health Roadmap Epigenomics Mapping Consortium, by a permutation procedure with 10 5 iterations [12]. We identified genes for CRC risk SNPs or SNPs from linkage disequilibrium (r 2 > 0.80) that were annotated as missense variants.
Secondly, we assessed the cis-expression quantitative trait loci (cis-eQTL) effects using the data of 5,311 European subjects from the study on peripheral blood mononuclear cells (PBMCs) [13]. Westra HJ et al. had made a browser available for all significant cis-eQTLs, detected at a false-discovery rate of 0.50. (http://genenetwork.nl/bloodeqtlbrowser/) In their study, eQTLs were deemed cis-eQTLs when the distance between the SNP chromosomal position and the probe midpoint was less than 250 kb. The eQTLs were mapped using Spearman's rank correlation on imputed genotype dosages. Resultant correlations were then converted to P values, and their respective z scores were weighted by the square root of sample size.
To evaluate cis-eQTL genes of risk SNPs, it was only needed to provide all risk SNP. When the CRC risk SNP was not available in eQTL data sets, we alternatively used the results of best proxy SNPs in linkage disequilibrium with the highest r 2 value (r 2 > 0.80).
Thirdly, by using the Gene Relationships among Implicated Loci (GRAIL), we evaluated the degree of relatedness among the genes within disease regions. GRAIL is a tool to examine the relationships between genes in different disease associated loci. Given many genomic regions or SNPs associated with CRC, GRAIL searches for similarities in the published scientific text among the associated genes [14]. A p value of <0.05 was considered significant. To avoid publications that reported on or were influenced by the disease regions discovered in the recent scans, we use only those PubMed abstracts published prior to December 2006, before the recent An overview of the study design. One hundred and forty-seven candidate genes were obtained from 50 CRC risk loci. A bioinformatics pipeline was developed for the prioritization of these candidate genes. Seven criteria were used to score the genes: (1) CRC risk missense variant; (2) cis-eQTL; (3) PubMed text mining; (4) PPI; (5) cancer somatic mutation; (6) knockout mouse phenotype; and (7) functional enrichment. Extent of overlap with target genes for approved CRC drugs was also assessed. onslaught of GWA papers identifying novel associations, avoid any strong bias towards the genes closest to the associated SNPs. This approach effectively avoids this problem. [14].
Next, we used the Disease Association Protein-Protein Link Evaluator (DAPPLE) for assessing the presence of significant physical connectivity among proteins encoded by candidate genes by protein-protein interaction (PPI), reported in the literature. The DAPPLE takes a list of seed SNPs that converts them into genes based on the overlap. The hypothesis behind DAP-PLE is that a genetic variation affects a limited set of underlying mechanisms that are detectable by PPI [15]. A p value of <0.05 was considered significant.
Nest, we obtained cancer somatic mutation genes from the Catalogue of Somatic Mutations in Cancer (COSMIC) database [16], and downloaded knockout mouse phenotype labels and gene information from the Mouse Genome Informatics (MGI) database [17] on April 8, 2014. We defined all CRC risk genes included in the CRC risk loci, and evaluated the overlap with cancer phenotypes with registered somatic mutations, and phenotype labels of knockout mouse genes with human orthologous. Hypergeometric distribution test was used for overlap statistical analyses with significance at a p value of <0.05.
Finally, we performed function enrichment analysis to investigate if genes affected by SNPs were enriched for specific functional categories or pathways. The DAVID Bioinformatics Resources that included Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) and Online Mendelian Inheritance in Man (OMIM) were used for analyses. [18] The obtained results were considered significant at a p value of <0.05.
We scored each of the genes by using the following selection criteria, and calculated the number of the satisfied criteria: (1) genes with missense variants; (2) cis-eQTL genes of risk SNPs; (3) genes prioritized by PubMed text mining; (4) genes prioritized by PPI network; (5) cancer somatic mutation genes; (6) genes prioritized by associated knockout mouse phenotypes; and (7) genes prioritized by functional enrichment analysis.
Correlations of candidate gene prioritization criteria were evaluated by the Pearson correlation analysis. Each gene was scored based on the number of criteria that were met (scores ranged from 0-7 for each gene) in case of weak correlations. Genes with a score of ⩾2 were defined as 'biological risk genes'.

Drug validation and discovery
If human genetics can validate drug targets, then it can be used to identify whether the approved drugs currently used for treating other indications can be used for the treatment of CRC. We present here an analysis of the potential application of GWAS data, drug repositioning.
We obtained drug target genes and corresponding drug information from Drug Bank [19] and Therapeutic Targets Database (TTD) [20] on Oct 18, 2013. We selected drug target genes that had pharmacological activities and were effective in human orthologous models, and those which annotated with any of the approved, clinical trial or experimental drugs.
The drug target genes annotated to CRC drugs were manually extracted by professional oncologists. To decrease the rates of the false positive, only the drugs under current clinical use were involved in the study, which could be found in the NCCN clinical practice guidelines in colorectal cancers. [21] We extracted genes from direct PPI with biological CRC risk genes by using protein interaction network analysis-2 (PINA2), which integrates six well-known manually curated PPI databases [22].
We evaluated the possibility of exploring protein products from the identified biological risk genes, or any genes from a direct PPI network as targets of approved CRC drugs or drugs for other indications. Let x be the set of the biological CRC risk genes and genes in direct PPI with them (n x genes), y be the set of genes with protein products that are the direct target of approved CRC drugs (n y genes), and z be the set of genes with protein products that are the direct target of all approved drugs (n z genes). We defined n x\y and n x\z as the numbers of genes overlapping between x and y and between x and z, respectively. Hypergeometric distribution tests were used for overlap statistical analyses, and a p value of <0.05 was considered statistically significant.

Results
In the present study, 50 CRC-associated SNPs were obtained from NHGRI (Table 1), and 140 genes were obtained based on proximity and linkage disequilibrium using the snp2gene (S1 Table).

Functional annotations of CRC risk SNPs
Most SNPs (62%) were located in the intergenic regions (S2 Table). Two SNPs were identified in linkage disequilibrium with missense SNPs (r 2 > 0.80; S3 Table). Next, we assessed 50 CRC risk loci for enriched epigenetic chromatin marks [12]. Of the 34 cell types investigated, we observed a significant enrichment of CRC risk alleles with H3K4me3 peaks in rectal mucosa cells (p = 0.00014 and 0.00023, respectively) (S4 Table).

PubMed text-mining
Twenty-four genes were prioritized based on data obtained by PubMed text mining using GRAIL with gene-based p < 0.05 [14] (S6 Table).

Cancer somatic mutation
Among the 522 genes with registered somatic mutations obtained from the COSMIC database [16], a significant overlap was observed in genes associated in non-hematological cancers (5/6, P = 2.41E -05) (S8 Table).

Knockout mouse phenotype
We evaluated overlap with genes implicated in knockout mouse phenotypes [17]. Among the 30 categories of phenotypes, we observed nine categories significantly enriched with CRC risk genes (p < 0.05), led by craniofacial phenotype (S9 Table).
Because these criteria showed weak correlations with each other (R 2 < 0.48; S13 Table), each gene was scored based on the number of criteria that were met (scores ranged from 0-7 for each gene).
To provide empirical evidence of the pipeline, we analyzed the gene scores. Genes with higher biological scores were more likely to be nearest to the risk SNP (62.8% for gene score 2, 24% for gene score < 2; p < 0.001). Meanwhile, rectal mucosa cells demonstrated significant overlapping proportions with H3K4me3 peaks compared with other cell types.
Finally, we evaluated the potential role of genetics in relation to drug discovery for the treatment of CRC. Hypergeometric distribution tests were used for overlap statistical analyses. We obtained 11303 genes pairs from curated PPI databases. We obtained 871 drug target genes corresponding to approved, in clinical trials or experimental drugs for human diseases (S14 Table). For the sake of calculation reliability, only CRC drugs in the first line therapy were involved in the study. Eight target genes of approved CRC drugs were included (S15 Table).
Thirty-one biological CRC risk genes overlapped with 533 genes from the expanded PPI network (S2 Fig. and S3 Fig.). We found an overlap of 5/8 drug target genes of approved CRC drugs (5/8 vs 70/781, 12.09-fold enrichment, P = 0.00013). All 871 drug target genes (regardless of disease indication) overlapped with 70 genes from the PPI network, which suggested that the enrichment was 1.55-fold higher than that expected by chance alone (p = 0.00012), but less by 7.78-fold when compared with currently approved CRC drugs (p = 1.78 × 10 -5 ). Examples of approved CRC therapies identified by this analysis included irinotecan, regorafenib, and cetuximab (Fig. 2). Correlation of approved drugs for other diseases with biological CRC risk gene was also assessed. An example of drug repositioning (Fig. 3) is the use of crizotinib, an approved drug for non-small cell lung cancer for the treatment of CRC [16]. Arsenic trioxide vrinostat, dasatinib, estramustine, and tamibarotene are all promising drugs for the treatment of CRC (Fig. 4).

Discussion
GWAS have identified innumerable disease-associated genetic variants. However, significant obstacles have hampered our ability to identify genes affected by causal variants and in elucidating the mechanism by which genotype influences phenotype. Most reports have simply implicated the nearest gene to a GWAS hit without substantial evidence [1]. This study prioritized the most likely target genes. A total of 31 biological CRC risk genes were identified. Although biological CRC risk genes are more likely to be causal genes,  this still needs confirmation by basic molecular studies using advanced technologies. Edwards et al provided a pipeline for follow-up studies, which includes fine mapping of risk SNPs, prioritization of putative functional SNPs, and in vitro and in vivo experimental verification of predicted molecular mechanisms for identifying the targeted genes [1].
GWAS is criticized for their lack of clinical translation because of the size of the effect. However, individual small effect sizes do not necessarily preclude clinical utility. Sanseau et al proposed the use of GWAS for drug repositioning, which is regarded as a promising strategy in translational medicine. In a study investigating 3-hydroxy-3-methyl-glutaryl coenzyme A, a well-known cholesterol-lowering medication, SNPs within this gene were unambiguously associated with low-density lipopolysaccharide cholesterol levels in the GWAS data. [6] Their study included all GWAS-associated genes that were selected from the GWAS catalog, without achievement about CRC drugs.
In the present study, we focused on repurposing drugs for CRC based on prioritization of candidate genes in the GWAS-identified loci. For example, crizotinib, arsenic trioxide, vrinostat, dasatinib, estramustine, and tamibarotene are also promising repurposed drugs for CRC. Although further investigations are necessary to confirm the results of this study, we opine that these target drugs selected could be promising drug candidates in the treatment of CRC. GWAS data is useful in providing insights into the biology of diseases, but may also translate these leads into profitable opportunities in drug development. However, GWAS data does not provide detailed pathophysiological information; hence, the newly identified uses of old drugs may possibly be side effects [23]. Successful repurposing of a drug entails the combination of results from published literature, and clinical research.
Although there were a number of positive aspects from this study, there were some limitations as well. Firstly, data of the PBMC study was used for cis-eQTL analysis. Although eQTLs identified from one tissue type may be a useful surrogate to study the genetics of gene expression in another tissue [24], the use of tissue-specific eQTLs is probably more useful in understanding the pathogenesis of CRC [25]. Secondly, of the 34 cell types investigated, we only observed a significant enrichment of risk SNP with H3K4me3 peaks in rectal mucosa cells. Nevertheless, the enrichment was not significant for the colon mucosa cells.
In this study, we integrated genetic data and statistical analysis, computational approaches, and publicly available large data sets to prioritize candidate genes, and propose new targets for CRC drug treatments. We believe that target genes and drugs selected by this approach could be promising leads in the development of candidate drugs for the treatment of CRC, although, further investigations are warranted for confirmation of these results.