Application of a Combination of a Knowledge-Based Algorithm and 2-Stage Screening to Hypothesis-Free Genomic Data on Irinotecan-Treated Patients for Identification of a Candidate Single Nucleotide Polymorphism Related to an Adverse Effect

Interindividual variation in a drug response among patients is known to cause serious problems in medicine. Genomic information has been proposed as the basis for “personalized” health care. The genome-wide association study (GWAS) is a powerful technique for examining single nucleotide polymorphisms (SNPs) and their relationship with drug response variation; however, when using only GWAS, it often happens that no useful SNPs are identified due to multiple testing problems. Therefore, in a previous study, we proposed a combined method consisting of a knowledge-based algorithm, 2 stages of screening, and a permutation test for identifying SNPs. In the present study, we applied this method to a pharmacogenomics study where 109,365 SNPs were genotyped using Illumina Human-1 BeadChip in 168 cancer patients treated with irinotecan chemotherapy. We identified the SNP rs9351963 in potassium voltage-gated channel subfamily KQT member 5 (KCNQ5) as a candidate factor related to incidence of irinotecan-induced diarrhea. The p value for rs9351963 was 3.31×10−5 in Fisher's exact test and 0.0289 in the permutation test (when multiple testing problems were corrected). Additionally, rs9351963 was clearly superior to the clinical parameters and the model involving rs9351963 showed sensitivity of 77.8% and specificity of 57.6% in the evaluation by means of logistic regression. Recent studies showed that KCNQ4 and KCNQ5 genes encode members of the M channel expressed in gastrointestinal smooth muscle and suggested that these genes are associated with irritable bowel syndrome and similar peristalsis diseases. These results suggest that rs9351963 in KCNQ5 is a possible predictive factor of incidence of diarrhea in cancer patients treated with irinotecan chemotherapy and for selecting chemotherapy regimens, such as irinotecan alone or a combination of irinotecan with a KCNQ5 opener. Nonetheless, clinical importance of rs9351963 should be further elucidated.


Introduction
Genomic information has been proposed to be utilized as the basis for ''personalized'' health care. Interindividual variation in a drug response among patients has been well documented to cause serious problems in pharmacotherapy. This variation may be due to multiple factors such as disease phenotypes, genetic and clinical parameters (or environmental factors), and variability in the drug target or allergic response; all of these factors may affect both main and side effects [1,2]. Although some biomarkers [3][4][5][6][7][8][9] have been proposed, it is still difficult to determine which group of patients will respond positively, which patients are nonresponders, and which may experience adverse reactions in cases where patients are administered the same medication dose. For effectiveness of personalized medicine in cancer chemotherapy, it is critically important to observe interindividual differences in a drug response and the role of genetic polymorphisms relevant to the drug metabolic pathways and drug response biology in pharmacogenomics [10].
Diarrhea induced by irinotecan is classified into early-and delayed-onset diarrhea, occurring within 24 hr or #24 hr after irinotecan administration, respectively [36]. Irinotecan induces early-onset diarrhea as one of adverse cholinergic effects (acetylcholinelike effects) by inhibiting acetylcholinesterase (AChE) and binding to muscarinic acetylcholine receptors (mAChR) [37,38]. These inhibitory actions are induced by irinotecan, which has an amino group at the C-10 position [37]. Other than that, irinotecan induces delayed-onset diarrhea via rapid deconjugation of SN38G and adsorption of released free SN-38 by bglucuronidase of intestinal flora [39][40][41], as shown in Fig. 1. In the present study, we focused on polymorphisms of genes with transporter activity to identify predictive factors of diarrhea induced by irinotecan because there are many genes related to transporter activity in both pathways.
A genome-wide association study (GWAS), also known as a whole-genome association study (WGA study, or WGAS), is an examination of many common genetic variants in different individuals to determine whether a particular variant is associated with a trait. GWAS using hypothesis-free genomic data is a powerful technique for identifying interindividual variation among patients. On the other hand, multiple testing problems are a limitation of this approach. To address this issue, we recently proposed a combined method consisting of a knowledge-based algorithm, 2 stages of screening, and a permutation test for identifying single nucleotide polymorphisms (SNPs) [7]. In general, the objective of a statistical or bioinformatic analysis is the enrichment of important information in a large dataset [42][43][44][45][46][47]. The use of a knowledge-based algorithm is not a novel concept, but is both practical and useful [48][49][50][51][52][53][54][55][56][57][58][59][60][61][62][63][64][65][66]. In the previous study, we found that rs2293347 in the gene of human epidermal growth factor receptor (EGFR) is a candidate SNP related to the chemotherapeutic response; we achieved this result by applying our combined method to gastric cancer patients who were treated with fluoropyrimidine [7]. However, our combined method was applied to only 1 dataset. Therefore, the usability of our combined method as a novel approach was still unclear.
We used the combined method in an actual genome-wide pharmacogenomics study of antitumor drugs, particularly irinotecan. We found that rs9351963 in the gene of potassium voltagegated channel subfamily KQT member 5 (KCNQ5) is a candidate SNP related to the adverse response. Rs9351963 may be a potential predictive factor of incidence of diarrhea in cancer patients treated with the cancer prodrug irinotecan.

Ethics statement
The study was conducted according to the principles expressed in the Declaration of Helsinki, and the ethics committees of the National Cancer Center and National Institute of Health Sciences, Japan, approved the study protocol. All patients provided written informed consent to participate.

Preparation of hypothesis-free genomic data for cancer patients treated with irinotecan
This study was performed within the framework of the Millennium Genome Project in Japan, and 4 antitumor drugs were chosen as project targets: gemcitabine, paclitaxel, fluoropyrimidine, and irinotecan. These drugs (alone or in some combination) were administered to approximately 1,000 cancer patients at the National Cancer Center, Japan. Additionally, approximately 1,000 DNA samples were extracted from peripheral blood mononuclear cells and 109,365 SNPs were genotyped using the Illumina Human-1 BeadChip. In this study, we focused on pharmacogenomic properties of irinotecan. Participants included 177 Japanese irinotecan-naïve cancer patients (56 cancer patients treated with irinotecan monotherapy and 121 cancer patients treated with irinotecan combination therapy) at the National Cancer Center Hospital and National Cancer Center Hospital East. A summary of the characteristics of the 176 patients is listed in Table S1. We excluded 1 patient who refused grading of adverse reactions. Furthermore, we excluded 8 patients who did not have genotyping data. Therefore, we analyzed the remaining 168 patients (53 cancer patients treated with irinotecan monotherapy and 115 cancer patients treated with irinotecan combination therapy) in the present study. We defined the 53 patients treated with irinotecan monotherapy as the first dataset and the 168 patients treated with irinotecan chemotherapy (consisting of irinotecan monotherapy and combination therapy) as the second dataset for 2 stages of screening.

Monitoring and adverse effects
A complete medical history and data on physical examination were recorded before the irinotecan therapy. Complete blood cell counts with differentials and platelet counts, as well as blood biochemical variables, were measured once a week during the first 2 months of irinotecan treatment. Adverse events were graded according to the National Cancer Institute -Common Toxicity Criteria (NCI-CTC Version 2.0). Only the highest grade of adverse events was recorded during the first 2 months of irinotecan treatment for each patient and adverse event.

Patient characteristics and clinical parameters
A summary of the patients' characteristics in the two datasets for diarrhea is shown in Table 1. The association of genetic or clinical parameters with incidence of grade $2 diarrhea was examined on the basis of Spearman's rank correlation coefficient. ''UGT1A1*6 or *28'' is an effective genetic predictive factor of irinotecaninduced neutropenia and pharmacokinetics in cancer patients [5]. This factor was constructed from 2 polymorphisms: UGT1A1*6 and *28.

Fisher's exact test
This statistical test is usually used to determine nonrandom associations between 2 categorical variables [67]. Fisher's exact test is similar to the chi-squared test. If a sample size is large, then the chi-squared test is suitable. Nevertheless, significance values from the chi-squared test are only approximated. Fisher's exact test is used in to analyze contingency tables when the sample sizes are small [67]. We used Fisher's exact test in the present study. The odds ratio (OR) is defined as a6d/(b6c), where a is the number of patients that had adverse events with a minor allele, b is the number of patients that did not have adverse events with a minor allele, c is the number of patients that had adverse events with a major allele, and d is the number of patients that did not have adverse events with a major allele. The null hypothesis for Fisher's exact test is OR = 1.  Table 1. Irinotecan-treated cancer patients with SNP information, genetic factor, and clinical parameters for incidence of diarrhea.   The permutation test The permutation test theory evolved from the works of Fisher and Pitman in the 1930s [68]. In this study, p values of multiplecomparison analyses were adjusted by applying the permutation test to 2 stages of screening. The case-control (or phenotype) labels were randomly shuffled for the 2 screening stages, and p values were calculated using Fisher's exact test. The lowest p value was selected for the randomized data. This procedure was repeated 100,000 times. Exact p values for the permutation test were calculated based on the distribution of the lowest p values.

Multiple testing correction
The Bonferroni correction is a method used to address the problem of multiple comparisons (also known as the multiple testing problem). It is considered the simplest and most conservative method for control of the family-wise error rate (FWER). In addition, false discovery rate (FDR) controlling procedures, such as the Benjamini-Hochberg (BH) method [69], are more powerful (i.e., less conservative) than the FWER procedures, such as the Bonferroni correction, at the cost of increasing the likelihood of false positives within the rejected hypothesis. In the present study, the BH method was used to calculate the q value. The q value is defined as an FDR analog of the p value.

The Akaike information criterion (AIC)
The AIC is a measure of the relative goodness of fit of a statistical model [70]. A smaller AIC indicates a better fit when comparing fitted objects. The AIC is defined according to the formula -26 (log likelihood) + (26n par ), where n par represents the number of parameters in the fitted model, and the log-likelihood value [71] is obtained from the logistic regression model.

The receiver operating characteristic (ROC)
ROC analysis is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is built by plotting sensitivity (the number of true positive results divided by the number of true positive samples) against (1 minus specificity) at various threshold settings. (Specificity is the number of true negative results divided by the number of true negative samples.) The area under the curve (AUC) of a ROC curve is an indicator of expected performance of the test. A higher AUC is more desirable, with a value of 1.00 denoting perfect performance (sensitivity and specificity are both 100%), while a value of 0.50 indicates random performance.
Gene set based on gene ontology GO terms GO has been developed to provide scientists with a controlled terminology system for labeling gene functions in a precise, reliable, computer-readable manner. Data for annotated genes and associated GO terms were obtained from the GO website (http://www.geneontology.org). We compiled a GO term list to select polymorphisms in genes with transporter activity (GO:0005215) and related activities, as shown in Table S2. The numbers of GO terms obtained was 943. GO data were obtained on July 1, 2010.

Association analysis of adverse affects and clinical parameters (or a genetic factor)
The association between clinical parameters (or a genetic factor) and incidence of grade $2 diarrhea was examined on the basis of Spearman's rank correlation coefficient, as shown in Table 1. This table shows that no parameter was associated with the adverse response to chemotherapy (incidence of grade $2 diarrhea) in the first dataset (patients treated with irinotecan monotherapy). Nonetheless, Amrubicin (p = 0.00625) was significantly associated with the response in the second dataset (patients treated with any irinotecan chemotherapy: a combination or monotherapy). Mitomycin C (MMC; p = 0.0740) was weakly associated with the response. These clinical factors should be evaluated when constructing diagnostic models involving multiple factors.
Extraction of candidate SNPs using the combined method consisting of the knowledge-based algorithm, 2 stages of screening, and the permutation test In this study, we applied the combined method to hypothesisfree genomic data on cancer patients treated with irinotecan chemotherapy as shown in Fig. 2. Figure 2A shows an outline of the knowledge-based algorithm for identifying SNPs (KB-SNP). In the previous study, we extracted rs numbers (SNP IDs) related to cancer using a combination of National Center for Biotechnology Information (NCBI) dbSNP and NCBI PubMed [7]. In the present study, we extracted rs numbers from genes linked to specific GO terms instead of the combination of NCBI dbSNP and PubMed. In this analysis, we defined specific GO terms as the terms related to transporter activity.
A total of 6,506 SNPs related to transporter activity were extracted from 109,365 SNPs using KB-SNP (Fig. 2B). Furthermore, we excluded SNPs with a p value ,0.2 in the Hardy-Weinberg equilibrium (HWE) or the minor allele frequency (MAF) ,0.05. Then the extracted 5,242 SNPs were used in the association study.
We analyzed 53 patients treated with irinotecan monotherapy as the first dataset for first-stage screening in the association study (Fig. 2C). Each p value was calculated using Fisher's exact test for the allele model. A total of 24 SNPs with p,0.005 were extracted. In the second stage of screening, 168 patients treated with irinotecan chemotherapy (including 53 patients treated with irinotecan monotherapy) were analyzed to validate these 24 SNPs. Adjustment of a calculated p value for the second stage of screening was conducted using the permutation test for these 2 stages of screening (Fig. 2D). Only rs9351963 in KCNQ5 showed a statistically significant p value (0.0289), which was determined using the permutation test. The rs9351963 is a common variant (MAF = 0.328). Furthermore, we conducted Fisher's exact test and used the Benjamini-Hochberg method [69] to calculate p and q values for the second dataset only. Seven SNPs had a q value ,1, as shown in Table 2. Six SNPs (rs11022922, rs3918305, rs3813627, rs768172, rs3813628, and rs10815019) had q = 0.802 as shown in Table 2. This result indicates that 5 out of 6 SNPs were false positive; however, we assessed performance of only rs9351963 in the process of model construction.

Comparison of models based on rs9351963 in KCNQ5
We analyzed not only an allele model but also dominant and recessive models of rs9351963 in KCNQ5 in relation to the first dataset (irinotecan monotherapy), the second dataset (any irinotecan chemotherapy), and the dataset of irinotecan combination chemotherapy (excluding irinotecan monotherapy), as shown in Figure 3. Figure 3A shows that the p value of the allele model was the lowest (p = 8.86610 25 , OR = 6.3), and the p value (p = 1.29610 24 , OR = 24) of the dominant model was lower than the p value (p = 0.0358, OR = 7.0) of the recessive model in the first dataset. In addition, Figure 3B shows that the p value of the allele model was the lowest (p = 3.31610 25 , OR = 3.1), and the p value (p = 1.28610 24 , OR = 6.7) of the recessive model was lower than the p value (p = 4.44610 23 , OR = 3.3) of the dominant model in the second dataset. Therefore, we evaluated the 3 models using the dataset of irinotecan combination chemotherapy (excluding irinotecan monotherapy; Fig. 3C). Figure 3C shows that the p value (p = 1.44610 23 , OR = 6.9) of the recessive model meant strong statistical significance and the OR was almost equal to OR ( = 7.0) in the first dataset, as shown in Figure 3A. Although ORs of the recessive models seemed to have high homogeneity among all 3 datasets, there was no statistical evidence. Therefore, the proportional odds model was used to construct multiple logistic regression models.

Selection of the model of rs9351963 in KCNQ5 and construction of multiple regression models
We compared the AICs and AUCs using the second dataset in the 8 models: NULL (without parameter), ''UGT1A1*6 or *28'' (an integrated predictive factor based on polymorphisms related to neutropenia), and rs9351963 (genotype of rs9351963 in KCNQ5), Amrubicin, MMC, rs9351963+Amrubicin, rs9351963+MMC, and rs9351963+Amrubicin+MMC (Fig. 4A). Figure 4A shows that performance of all models except UGT1A1 *6 or *28 is better than the performance of the NULL model. Although the Amrubicin+MMC (combination of Amrubicin and MMC) model was better than Amrubicin alone or MMC, the rs9351963 models were clearly better than the Amrubicin+MMC model, as shown in Figures 4A and 4B. Performance of rs9351963+Amrubicin and rs9351963+MMC models was better than performance of the rs9351963 model. Furthermore, performance of the rs9351963+ Amrubicin+MMC model was better than that of rs9351963+ Amrubicin and rs9351963+MMC models. Therefore, we selected the rs9351963+Amrubicin+MMC model as the best one on the basis of AIC. AUC, sensitivity, and specificity of this model were 0.744, 77.8%, and 57.6% in in the ROC curve, respectively, as shown in Figures 4A and 4B.

Discussion
In the present study, we used 2 stages of screening: the method that is based on the concept of joint analysis. Joint analysis is more efficient than replication-based analysis [72]. The first dataset is a part of the second dataset in joint analysis (the latter includes the former). In contrast, the 2 datasets must be independent in a replication-based analysis (which we did not use here). Our 2 stages of screening derived from the joint analysis were used to increase statistical detection power. KB-SNP was performed prior to 2 stages of screening. KB-SNP reduced the number of Calculation of the p value in the permutation test based on the 2 stages of screening. The second dataset (includes first dataset) was permutated. The permutated first dataset was extracted from the permutated second dataset. By using Fisher's exact test, SNPs with p,0.005 for the first dataset were selected from among 5,242 SNPs. Among the selected SNPs, those with the lowest p value in Fisher's exact test for the second dataset were selected. This procedure was repeated 100,000 times and empirical null distribution was constructed. Using the distribution, the actual p value obtained from the second stage of screening was converted to the adjusted p value (based on correction of multiple testing problems). At these screening steps, allele models were used for each SNP. doi:10.1371/journal.pone.0105160.g002 candidate SNPs to 6,506 from 109,365. Approximately 80,000 SNPs can be extracted without knowledge-based reduction of the SNP number. Thus, statistically significant SNPs cannot be extracted from the present data. We could find the statistically significant rs9351963 in KCNQ5 by applying the combined method to hypothesis-free genomic data.
The KCNQ/K(v)7 potassium channel family consists of 5 members of neural muscarine channel (M channel; from KCNQ1 to KCNQ5) which have a distinct expression pattern and a functional role. Although KCNQ1 is prevalently expressed in the cardiac muscle, KCNQ2, KCNQ3, KCNQ4, and KCNQ5 are expressed in neural tissue [73][74][75]. On the other hand, a recent study revealed that KCNQ4 and KCNQ5 are the most abundantly expressed KCNQ channels in smooth muscle throughout the gastrointestinal tract [76]. Furthermore, Jepps et al. opined that drugs that selectively block KCNQ4/KCNQ5 might be promising as therapeutics for the treatment of motility disorders such as constipation associated with irritable bowel syndrome [76]. In other words, drugs that selectively open KCNQ4/KCNQ5 might be effective against diarrhea. The KCNQ family gene products assemble as homomeric or heteromeric tetramers to form functional channels that mediate the M-current [77], a current that is suppressed by mAChR activation [78,79]. Irinotecan induces adverse cholinergic effects (acetylcholinelike actions) by inhibiting AChE and binding to mAChR [37,38,80]. Therefore, polymorphisms of KCNQ5 genes possibly effect incidence of diarrhea as interindividual variation in the drug response among cancer patients treated with irinotecan chemotherapy.
In the present study, only the highest grade of adverse events is recorded during the first 2 months of irinotecan treatment for each patient and each adverse effect. Therefore, incidence of grade $2 diarrhea possibly includes cases caused partially by enterohepatic circulation of APC and NPC, but genotype of rs9351963 in KCNQ5 correlates with the start date of treatment with antidiarrheal agents (Spearman's rank correlation coefficient r = 20.198, p = 0.00995). In other words, genotype of rs9351963 may correlates with the diagnosis (or presentiment) of irinotecan induced early-onset diarrhea (diagnosis is made by trained chemotherapists).
The rs9351963 A.C polymorphism is located in an intron, which does not change the amino acid sequence of the protein and may not influence the biological function of the protein itself. Nonetheless, some intronic polymorphisms are effective markers: For example, rs2237892 in intron 15 of KCNQ1 is associated with susceptibility to type 2 diabetes mellitus in Japanese individuals [81], and the CA simple sequence repeat in intron 1 (CA-SSR1) of the gene of epidermal growth factor receptor (EGFR) is associated with the clinical outcome in gefitinib-treated Japanese patients with non-small cell lung cancer [82]. Furthermore, variations related to intronic or synonymous SNPs possibly affect mRNA stability, translational kinetics, and splicing, resulting in alterations at the protein level, e.g., changes of structure or function [83][84][85][86][87][88][89]. Although rs9351963 does not have a known function, this SNP is a possible predictive factor of adverse effects of irinotecan-based chemotherapy and is possibly linked to some functional polymorphisms in KCNQ5. Their clinical importance needs to be further elucidated.
In the present study, we extracted rs9351963, which showed a p value (0.0289) obtained using a combination of 2 stages of screening and a permutation test from SNPs selected by KB-SNP. In the second dataset, the p value of Fisher's exact test was 3.31610 25 , and the q value was 0.173 calculated by correction of Benjamini-Hochberg method, as shown in Table 2. This value is Table 2. Extracted 7 SNPs with q,1 for the second dataset. statistically insignificant. Therefore, during the 2 stages of screening, it is statistically sufficient to extract rs9351963. The calculation of probability of occurrence in Bernoulli trials is suitable to for estimation of validity of the repetition number in the permutation process. In this trial, occurrence probability is defined as n C k 6 (p B ) k 6(1 -p B ) (n-k) , where k is the occurrence number, n is the repetition number, and p B represents probability. If the repetition number is 100,000 for rs9351963 (p = 0.02891 [2891/ 100000]) and the significance level of the test (a) is 0.05, the occurrence probability is 100000 C 2891 6 (0.05) 2891 6 (1-0.05) (100000-2891) = 4.89610 2241 . In statistics, the 99% (or 95%) confidence interval should be considered. The significance level of a = 0.05 does not exist in the 99% confidence interval of the p value for rs9351963, because the occurrence probability 4.89610 2241 is clearly lower than 0.01. Similarly, if the repetition number is 10,000, the occurrence probability is 3.41610 226 . This way, the occurrence probability is sufficiently low for 10,000 permutations. Nevertheless, we conducted 100,000 permutations to estimate p values more accurately for the permutation test.
Using our combined method involving KB-SNP, we identified rs9351963 as a potential predictive factor of diarrhea in cancer patients treated with irinotecan chemotherapy; however, the comprehensiveness of KB-SNP was limited. Therefore, statistical information regarding the adverse effects of cancer patients treated with irinotecan chemotherapy is shown in Table S3 for incidence of diarrhea (p,0.05) and in Table S4 for incidence of neutropenia (p,0.05). The relevant data are also provided on the website Genome Medicine Database of Japan (GeMDBJ) [90] (https:// gemdbj.nibio.go.jp/). These data will be useful for replication studies or meta-analyses in the future.
In conclusion, in the present study, we applied the combined method to hypothesis-free genomic data on cancer patients treated with irinotecan chemotherapy. By means of this method, rs9351963 in KCNQ5 was extracted as a candidate SNP related to the incidence of diarrhea. For example, the association of rs9351963 with irinotecan-related diarrhea (OR of 3.14) showed a p value of 3.31610 25 in Fisher's exact test (allele model). Even if this p value were adjusted by means of the permutation test for the effects of multiple testing problems, the adjusted p value would still indicate statistical significance (adjusted p value of 0.0289,0.05). Additionally, we evaluated the performance of rs9351963 using multiple regression models. rs9351963 was clearly superior to clinical parameters (or environmental factors) and showed a sensitivity of 77.8% and specificity of 57.6% in the multiple regression model, including rs9351963. Recent studies showed that the KCNQ4 and KCNQ5 genes encode components of the M channel expressed in gastrointestinal smooth muscles and suggested that these genes are associated with irritable bowel syndrome and similar peristalsis diseases. These results suggest that rs9351963 may be a predictive factor of diarrhea in cancer patients treated with irinotecan chemotherapy. This SNP may also be useful for selection of chemotherapy regimens, such as irinotecan monotherapy or a combination of irinotecan chemo- therapy with KCNQ5 opener. Furthermore, the result of the present analysis supports usability of our combined method.