A Bioinformatics Filtering Strategy for Identifying Radiation Response Biomarker Candidates

The number of biomarker candidates is often much larger than the number of clinical patient data points available, which motivates the use of a rational candidate variable filtering methodology. The goal of this paper is to apply such a bioinformatics filtering process to isolate a modest number (<10) of key interacting genes and their associated single nucleotide polymorphisms involved in radiation response, and to ultimately serve as a basis for using clinical datasets to identify new biomarkers. In step 1, we surveyed the literature on genetic and protein correlates to radiation response, in vivo or in vitro, across cellular, animal, and human studies. In step 2, we analyzed two publicly available microarray datasets and identified genes in which mRNA expression changed in response to radiation. Combining results from Step 1 and Step 2, we identified 20 genes that were common to all three sources. As a final step, a curated database of protein interactions was used to generate the most statistically reliable protein interaction network among any subset of the 20 genes resulting from Steps 1 and 2, resulting in identification of a small, tightly interacting network with 7 out of 20 input genes. We further ranked the genes in terms of likely importance, based on their location within the network using a graph-based scoring function. The resulting core interacting network provides an attractive set of genes likely to be important to radiation response.


Introduction
In the 'omics' era, the number of biomarker candidates potentially available for statistical testing is often much larger than the number of patient data points. This presents a fundamental problem in biomarker research: the number of candidate genetic or epigenetic markers often overwhelms the inherent statistical power available in a clinical dataset, which usually has tens or hundreds of patient cases available rather than thousands. This statistical mismatch is typically becoming worse as more of the intracellular complexity of molecular machinery is identified. At one extreme, a genome-wide association study (GWAS) examining the correlations of millions of tag singlenucleotide polymorphisms (SNPs) to cancer treatment outcome may require a very high, and biologically unlikely, odds ratio given the number of multiple comparisons, to reach statistical significance. At the other extreme, it is clear that investigators cannot a priori identify the most important biomarker genes or SNPs for testing. These unsatisfying extreme cases motivated our search for a middle strategy that would objectively identify a modest number of promising SNPs/proteins, etc. as a cohort for testing against a given dataset. Because clinical datasets for a given endpoint are commonly of modest size (tens or hundreds, not thousands, of patients), we searched for key protein interaction networks that result in less than approximately a hundred candidate SNPs. Our methodology, of course, could be adopted to throw a wider net if much larger datasets become available. Our endpoint of interest is late toxicity following radiation therapy for cancer. Many cancer patients who receive radiation therapy suffer from acute or late side effects; the risk for experiencing these side effects is expected to have a genetic component [1]. Numerous genes participate in a cascade of events in response to radiation and the resulting DNA damage in a complex signal transduction network [2].
Recently, many studies have focused on finding radio-responsive genes at the whole genome level with gene expression microarrays. Rieger and Chu used oligonucleotide microarrays to develop a genome-wide portrait of transcriptional response to ionizing radiation (IR) and ultraviolet (UV) radiation in cell lines collected from 15 healthy individuals [3]. In another study [1] using samples extracted from cancer patients with acute radiation toxicity, Rieger et al. showed that toxicity after radiation therapy (radiotherapy) could be associated with abnormal transcriptional responses to DNA. Jen and Cheung [2] assessed transcriptional levels of genes in lymphoblastoid cells at various time points with 3 Gy and 10 Gy of ex vivo IR exposure. Following 10 Gy of IR exposure, more genes were induced, suggesting that a higher radiation dose causes a more complex response. A high percentage of significant genes were involved in cell cycle, cell death, DNA repair, DNA metabolism, and RNA processing. Eschrich et al. [4] analyzed microarray gene expression data derived from 48 human cancer cell lines and generated an interaction network using MetaCore software (GeneGo, Encinitas, CA) with the top 500 genes identified by linear regression analysis. Subsequently, based on 10 hub genes obtained from the network, they modeled radiosensitivity (survival fraction at 2 Gy) using a linear regression method.
Normal tissue toxicity after radiotherapy may partially be attributable to specific genetic mutations. In an effort to identify candidate polymorphisms at the SNP level involved in the cellular response to irradiation in breast and prostate cancers, Popanda et al. [5] surveyed many published studies that show associations of SNPs in candidate genes with acute or late side effects of radiotherapy. Andreassen and Alsner [6] summarized studies published on genetic variation in normal tissue toxicity and proposed a model of allelic architecture that illustrates relative risk for genetic variants associated with normal tissue radiosensitivity.
In this study, we attempted to define an objective method for identifying key radiosensitivity genes likely to have a significant impact on clinical outcome following radiotherapy. We elected to construct a staged filter. The first step was a comprehensive literature review of radiosensitivity-related genes. These genes were then further delimited to genes responding to IR in an analysis of publicly available microarray gene expression datasets. We further focused the search on interacting networks, based on the hypothesis that good biomarkers are likely to be embedded in important pathways or networks involving multiple genes known to be important to the endpoint in question [7]. This last step may potentially add new, previously unreported targets, based on curated pathway libraries.

Materials and Methods
In summary, we used a multi-component filtering process: (1) genes associated with radiation response in the literature and (2) genes associated with radiation response in two microarray mRNA datasets. Overlapping genes from these three sources were fed into a curated protein interaction network system (MetaCore) to identify key interacting networks. The most important network was taken as our target set.

Literature Review of Radiosensitivity-related Genes
We attempted a complete literature review of all genes implicated in radiation response. Published papers were searched by using PubMed and Scopus search engines in 2010 and by following citations within the identified papers. The search strategy was based on a combination of the following search keywords: ''SNPs, polymorphisms, or microsatellites'' and ''irradiation, radiation, or radiotherapy'' and ''morbidity, radiosensitivity, normal tissue, toxicity, or complications'' and ''siRNA, knockdown, or knockout''. Papers referred to in the original search returns, or referring to the original papers at a later date were also reviewed. This resulted in an in-depth review of around 200 published papers, and a list of 221 genes implicated in radiation response.

Microarray Gene Expression Datasets
To identify significant radio-responsive genes based on microarray gene expression profiling, we searched for all relevant, publically available microarray datasets, resulting in locating two datasets. We analyzed GSE1977 and GSE23393, downloaded from the publicly available Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/). In GSE1977, lymphoblastoid cell lines obtained from 15 healthy individuals were established by immortalization of peripheral blood B-lymphocytes [3]. The response of numerous genes was measured by mock treatment, UV, and X-ray exposures. Cells were exposed to 5 Gy radiation doses and harvested for RNA 4 hours later. In our work, the differential between mock and X-ray cases was used. In contrast, in GSE23393 [8], blood was gathered from eight radiotherapy patients (at our institution): eight samples were collected immediately before irradiation and another eight samples were collected at 4 hours after total body irradiation with 1.25 Gy X-rays.

Preprocessing for Identification of Significant Genes
Before the microarray datasets were analyzed, gene expression values were log-base-2 transformed, followed by quantile normalization across all samples [9]. Microarray gene expression values from two different conditions (before and after exposure) were compared using a two-tailed t-test to identify differentially expressed genes (radio-responsive genes). To estimate the likelihood of identifying significant genes by chance, we computed permutation-based p-values using 10,000 permutations. Then, using Storey's method, the false discovery rate (FDR) and q-value for each gene were calculated [10]. Significance Analysis of Microarrays (SAM) and t-test are widely used for indentifying differentially expressed genes in the analysis of microarray data [11]. We chose a permutation t-test with an assumption that the permutation t-test and SAM could yield a set of similar significant genes, as recommended by Chen et al. [11]. In this analysis, we did not use a fold change cutoff in order to avoid losing some important genes, a problem described by Larsson et al. [12].

Pathway and Process Analysis
Significant genes were identified both in the literature review and the analysis of two microarray datasets (GSE1977 and GSE23393). These genes were then entered into a manually curated pathway analysis database (MetaCore TM , GeneGo, Inc., Carlsbad, California). The commercial pathway analysis system, MetaCore, computes p-values for overrepresented pathways and processes. MetaCore is based on a comprehensive manually curated attempt to capture protein interactions as networks. We used MetaCore to attempt to find the most probable interaction pathways among a set of genes uploaded by the user. Several algorithms are available to do this; we used the ''Analyze network'' option. If necessary, MetaCore adds appropriate genes to complete a network.

Gene Ontology Analysis
A further analysis of the resulting significant genes was performed using the Gene Ontology (GO) database (www. geneontology.org), in which genes are annotated with known molecular functions, biological processes, and cellular component locations.

Gene Ranking
In our previous work to identify blood-based protein biomarkers to predict radiation-induced pneumonitis [7], we proposed a graph-based scoring function to rank proteins in a protein-protein interaction network. The network consisted of candidate proteins we identified in mass spectrometry analysis and four previously identified ('regularization') biomarker proteins. Using the proposed method, we attempted to measure a 'functional distance' between each candidate protein and the four regularization proteins, based on the hypothesis that some proteins relevant to a specific disease exist in close proximity, in a network sense. In the current study, we modified that algorithm such that within a given protein-protein interaction network for a biological process, we ALAD 210 [19,20] ANXA1 CD68 968 [20] CD69 969 [16] CD70    TNFSF10 8743 v [156] estimate the functional distance between each protein and all the remaining proteins in the network, since all the proteins in the network are more likely to be related to one another and act together in the biological process.
To rank biomarkers, we modelled each protein-protein interaction network as a directed graph, G = (V, E), where V consists of a set of nodes (proteins) and E is the set of possible edges (protein-protein interactions) between pairs of nodes. Let A and B be two proteins in a network. We assume that there are two concepts of distance between A and B: a geometrical distance that is defined in terms of the number of nodes in the shortest path between A and B, as well as a virtual distance that is defined in terms of the number of publications that verify the interactions along the shortest path. Intuitively, as the number of intermediate nodes between A and B increases, the geometrical distance increases and the two proteins are less likely to be correlated. In contrast, considering virtual distance, we expect that as the number of references demonstrating a relationship between two proteins increases, they are more likely to be related. In other words, the number of references is proportional to relatedness while the number of nodes is inversely proportional.
Using a power law, we calculate two scores from A to B: a reference score (rs) and a node score (ns) as follows: where r and n are the total number of references and nodes in the shortest path from A to B. We suppose that the influence of the number of nodes is greater than that of the number of references. Therefore, as the number of intermediate nodes between any two given nodes increases, the relationship between the two nodes becomes much less likely. The score capturing the path from A to B is defined as the summation of two different scores: We suppose that the final score of a protein is computed by the summation of all scores between the protein and all the remaining proteins in the network. Hence, the final score of a protein A is defined by: To estimate the number of references and nodes, we employed two methods. For the number of references, we used a function in the MetaCore software that provides the number of references between two connected proteins in a network. For the number of nodes, we used the Floyd-Warshall algorithm that was originally designed to find the shortest paths between all pairs of nodes based on dynamic programming [13]. To apply this algorithm to our problem of estimating the number of nodes, we modified the original Floyd-Warshall algorithm such that an equal weight of 1 was assigned to all connected edges in a network. As a result, the modified algorithm generated a matrix that represents the number of nodes on the all-pairs shortest-paths in a given protein-protein interaction network.

Identification of Significant Biomarkers via Literature Review
Based on the literature review, several types of biomarkers, including genes, proteins, kinases, ligands, and protein complexes were identified. To unify the biomarker terms differently used across studies, we converted all the biomarkers into their corresponding gene symbols. As a result, 221 unique genes and 4 protein complexes (DNA-PK, HSP70, MRN(95), RAS) were identified from around 200 papers that studied radiation response-related biomarkers [4,. Table 1 displays the 221 unique genes and their corresponding GO processes, including DNA repair, cell proliferation/cycle, apoptosis, RNA processing, and response to stress. It is well known that ionizing radiation causes DNA damage that activates the p53 pathway through ATM [186]. Genes that are involved in cell cycle, such as CDKN1A, GADD45A, MDM2, and CCNG1, are known to be dependent on p53 [2]. Also, other cell cycle-related genes including CCNB1 and CDC20 were identified. Among cell cycle or proliferation genes, TOB1, BTG2, and CDKN1A are anti-proliferative/checkpoint related [3]. Several genes (XPC, DDB2, PCNA, ERCC4, and NBN) are involved in DNA repair. Two major pathways to repair IR-induced DNA double-strand breaks are homologous recombination (HR; genes include XRCC2, XRCC3, MRE11A, RAD50, NBN, BRCA1, and BRCA2) and non-homologous end joining (NHEJ; genes include LIG4, XRCC4, XRCC5, XRCC6, and DNA-PK) [3]. Some genes, including FAS, BBC3, and TNF, are involved in apoptosis [187]. BCL2 and DDR1 are antiapoptotic.
For biological process and pathway analysis, the 221 unique genes were uploaded into the MetaCore. Figure 1 illustrates a direct interaction network generated with these genes. As shown, numerous genes are strongly connected to one another, suggesting   that interacting genes are more likely to play related roles. Table 2 shows the top ten GeneGo pathways, GeneGo processes, and GO processes. As can be seen in the table, the most highly ranked pathways and processes are associated with DNA damage and repair, cell cycle, and apoptosis.

Identification of Significant Genes via Microarray Dataset Analysis
To identify significant changes in gene expression values between the two groups (before and after irradiation) in two microarray datasets, a t-test with 10,000 permutations was performed. To estimate p-values, we counted the number of permutations for each gene whose t-scores are greater than or equal to the t-score calculated with observed values. Then, the number of permutations passed the criterion was divided by the total number of permutations [188]. With an FDR of 20%, 631 probes (corresponding to 550 unique genes) were significantly identified for GSE1977. Figure 2 shows a normal quantile plot of t-scores for GSE1977. Data points of genes that are farther away from the black diagonal line are considered to be differentially expressed. Figure 3 displays a volcano plot that depicts the -log10 of q-values against log2 of fold changes for all genes. The majority of genes with an FDR of 20% changed 1.2fold or higher. For GSE23393, with an FDR of 20%, 224 probes (corresponding to 184 unique genes) were identified ( Figure S1 and Figure S2).

Overlapping Genes
To delimit our potential biomarker set, we investigated which genes are commonly or uniquely found among the set of genes identified by our literature review and two sets of genes identified in the analysis of the two gene microarray datasets, as summarized in Figure 4. The rationale is that those are genes likely to be key to an active response, but unlikely to be false positives due to the literature review. Twenty genes were commonly identified among the three different analyses (literature review and two microarray datasets), as shown in Table 3. We further analyzed pathways and biological processes associated with the 20 genes. Table 4 shows the top ten GeneGo pathways generated by the MetaCore software (Table S1). Not surprisingly, even with the 20 genes, DNA damage/repair and apoptosis-related pathways were highly ranked.    Gene Ranking and Identification of a Core Radioresponse Network Figure 5 shows the most probable/robust single interaction network when the 20 overlapping genes were entered into the MetaCore software. Of the 20 input genes, seven genes appeared in this core radio-response network. We applied our graph-based scoring function to this network and the results are summarized in Table 5. MYC was ranked first with a score of 113.74, which had a high p-value in GSE23393 and a statistically significant p-value, yet still relatively high compared to other genes, in GSE1977. As a hub gene, MYC had the highest number of edges (n = 12) that seem to contribute to the score. Overall p-values in GSE23393 (in situ IR) are higher than those of GSE1977 (ex vivo IR). Intuitively, as the number of edges increases, the score seems to increase. However, it should be noted that although GADD45A has 9 edges, it obtained a higher score than PPM1D, which has 11 edges. This is attributed to the fact that when we calculate the score for a gene, our scoring function takes into account all network interactions and the number of references on the interactions in the network. Interestingly, CDKN1A obtained a relatively high score of 100.13, considering only 3 edges and substantially low p-values (0.00027 in GSE1977 and 0.00367 in GSE23393).

Discussion
We have demonstrated an unbiased bioinformatics filtering methodology to objectively identify a core network of key interacting genes that are important to radiation response. We hypothesized that, by combining several different types of datasets, we are increasingly likely to identify interacting genes that are particularly important to radiation response. We also hypothesize that these genes are therefore attractive candidates for biomarker testing. For example, the 7 key genes contain 89 relevant SNPs in our radiation therapy cancer dataset and we are in the process of testing late toxicity with the dataset. We make no claim that the network shown in Figure 5 dominates radiation response and do not expect that to be the case. Nevertheless, this network seems to be highly relevant to radiation response: among the 7 genes, 5 and 4 genes are involved in cell cycle control and apoptosis, respectively. More detailed information is shown in Table S2. Five of these genes, including MYC, BBC3, GADD45A, CDKN1A, and XPC belong to a list of 34 radio-responsive genes observed by Tusher et al. [187]. Moreover, this network is consistent with (though slightly different from) the programmed cell death network reported by Moussay et al. [189]. Figure 4 shows the number of genes commonly or uniquely identified among three different studies (literature review and analysis of two microarray datasets). Interestingly, relatively few genes overlapped among the three analyses. Literature coverage is expected to be incomplete regarding coverage of radiosensitivity genes. Microarray analysis is subject to high false positive and false negative rates [190]. Another possible reason for the small number of overlapped genes is the widely differing irradiation conditions and doses. Despite this, the biological processes and pathways generated from the 20 overlapping genes were similar to those generated from the whole literature review.
We further analyzed the 20 genes, uploading these genes into the MetaCore software. In the network of the most probable biological process shown in Figure 5, only seven out of 20 genes appeared in the network. Additional genes were automatically added to the network by MetaCore, including AKT1, RELA, BCL2L1, PTEN, CDK1, and XIAP. Note, however, that these genes were also members of the list generated by our radiation response literature review, suggesting some consistency between these sources. This also suggests a potential ability to find novel biomarker candidates through the network mapping/ranking process, though that did not occur in this case.
The graph-based scoring function proposed in our previous study [7] was modified and applied to the network shown in Figure 5. In some studies, researchers tend to regard genes with high degrees of connectivity (hub genes) as significant in an interaction network, while neglecting others [4]. While this is rational, finding hub genes based on edge connectivity considers only direct interactions between genes whereas our proposed approach takes into account all interactions in a network (that is, the entire graph structure) and the number of published references on the interactions. To measure the closeness between two proteins (say A and B), we employed two scores; a node score and a reference score. In a protein interaction network, it is obvious that as the number of internal nodes between A and B increases, these two proteins are less likely to be related with each other. In contrast, the reference score is a score calculated using the number of papers that studied on an interaction between two proteins, which can be important evidence that there is an actual relationship between the two proteins. As can be seen in Table S3, MYC was first ranked using a total score. However, BBC3 and PPM1D were first ranked using a reference score and a node score, respectively. CDKN1A, PLK3, and XPC obtained somewhat high scores considering their connectivity, suggesting that they could play important roles in this core network. We believe that the use of both scores could be more effective for ranking proteins in a protein interaction network.
Future work will test SNPs identified in this network against toxicity resulting from radiation therapy. As the number of patients available for SNP analyses increases, it may be rational to expand the number of candidate SNPs to several hundreds or more. The general methodology may be applied in many genetic/protein biomarker studies with limited patient data. Figure S1 A normal quantile plot of t-scores for GSE23393 after 10,000 permutations. (TIF) Figure S2 Significant gene detection. A volcano plot that depicts the -log10 of q-values against log2 of fold changes for all genes in GSE23393. (TIF)

Table S1
The top ten GeneGo pathways/processes and GO processes generated by the MetaCore software when 20 overlapped genes were used. (DOC) Table S2 Biological processes for the seven genes shown in Table 5.

Identification of Radiation Response Biomarkers
PLoS ONE | www.plosone.org