An Improved PSO Algorithm for Generating Protective SNP Barcodes in Breast Cancer

  • Li-Yeh Chuang,

    Affiliation Department of Chemical Engineering and Institute of Biotechnology and Chemical Engineering, I-Shou University, Kaohsiung, Taiwan

  • Yu-Da Lin,

    Affiliation Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan

  • Hsueh-Wei Chang ,

    changhw@kmu.edu.tw (HWC); chyang@cc.kuas.edu.tw (CHY)

    Affiliation Department of Biomedical Science and Environmental Biology, Center of Excellence for Environmental Medicine, Cancer Center, Kaohsiung Medical University Hospital, Kaohsiung Medical University, Kaohsiung, Taiwan

  • Cheng-Hong Yang

    Affiliation Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan

Abstract

Background

Possible single nucleotide polymorphism (SNP) interactions in breast cancer are usually not investigated in genome-wide association studies. Previously, we proposed a particle swarm optimization (PSO) method to compute these kinds of SNP interactions. However, this PSO is not guaranteed to find the best result in every run, especially when high-dimensional data are investigated for SNP–SNP interactions.

Methodology/Principal Findings

In this study, we propose an improved PSO (IPSO) algorithm to improve the reliability of PSO for the identification of the best protective SNP barcodes (SNP combinations and genotypes with maximum difference between cases and controls) associated with breast cancer. SNP barcodes containing different numbers of SNPs were computed. The top five SNP barcode results are retained for computing the next SNP barcode, with one SNP added at each processing step. The performance of the proposed IPSO algorithm is evaluated on simulated data for 23 SNPs of six steroid hormone metabolism and signalling-related genes. Among the 23 SNPs, 13 SNPs displayed significant odds ratio (OR) values (1.268 to 0.848; p<0.05) for breast cancer. With the IPSO algorithm, the joint effect in terms of SNP barcodes with two to seven SNPs shows significantly decreasing OR values (0.84 to 0.57; p<0.05 to 0.001). With the PSO algorithm, barcodes with two to four SNPs show significantly decreasing OR values (0.84 to 0.77; p<0.05 to 0.001). Based on the results of 20 simulations, the medians of the maximum differences for each SNP barcode generated by IPSO are higher than those generated by PSO. The interquartile ranges of the boxplot, as well as the upper and lower hinges for each n-SNP barcode (n = 3 to 10), are narrower in IPSO than in PSO, suggesting that IPSO is highly reliable for SNP barcode identification.

Conclusions/Significance

Overall, the proposed IPSO algorithm robustly identifies the best protective SNP barcodes for breast cancer.

Introduction

Genome-wide association studies (GWAS) can identify several highly robust and statistically significant single nucleotide polymorphisms (SNPs) associated with breast cancer susceptibility [1]–[6]. The associations for genotype frequencies of case and control data have significant impacts on disease susceptibility. Although GWAS provide representative SNPs from the entire genome, many SNPs with low or marginal significance are frequently excluded so that highly significant and representative tagSNPs can be retrieved effectively.

Steroid hormone metabolism and signalling-related genes are implicated in the pathogenesis of breast cancer [7]–[12]. Several SNP association studies have involved these genes, including estrogen receptor 1 (ESR1), steroid sulfatase (microsomal), isozyme S (STS), cytochrome P450, family 19, subfamily A, polypeptide 1 (CYP19A1), progesterone receptor (PGR), catechol-O-methyltransferase (COMT), and sex hormone-binding globulin (SHBG) [13]–[15].

Many studies hypothesize that cancer or disease risk is associated with the co-occurrence of SNPs displaying a joint effect [16]–[22]. In recent breast cancer association studies, further evidence for SNP-SNP interactions has been identified, such as the SNP-SNP interactions of genes related to DNA repair [23], [24], chemokine ligand-receptor interactions [25], and estrogen-response genes [6]. However, the possible SNP-SNP interactions between these hormone metabolism and signalling-related genes have hardly been addressed. This is in part due to the computationally challenging nature of association studies with multiple SNP candidates.

Currently, the analysis of SNP-SNP interactions remains a challenge because of the complex combinations that arise in data with huge numbers of SNPs. Many possible combinations of genotypes are generated when multiple SNPs are evaluated simultaneously. Mathematically, the number of possible combinations of SNP interactions between cases and controls is estimated to be $C(N, M) \times 3^M = \frac{N!}{M!\,(N-M)!} \times 3^M$, where N is the number of SNPs or factors, and M is the selected prediction number of SNPs. Many artificial intelligence methods have been proposed to compute the association of genotype frequencies of case and control data. Methods such as multifactor dimensionality reduction (MDR) [26], [27], polymorphism interaction analysis (PIA) [28], support vector machine (SVM) [29], particle swarm optimization (PSO) [30], and genetic algorithm (GA) [31] were demonstrated to be effective in reducing the number of search items among a large number of SNP combinations. In general, MDR provides many useful features but tends to yield false positive and false negative errors when the case/control ratio in a combination of genotypes is similar to the ratio in the entire data set [32]. The PSO and GA methods can generate relevant SNP combinations in high-dimensional data; however, these methods do not guarantee that every run yields a relevant solution when the dimensionality is very high. This is because the PSO and GA algorithms use randomly generated initial values and a set number of iterations. Accordingly, improved algorithms for solving this complex interaction problem are essential.
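To give a feel for how quickly this search space grows, the following minimal sketch evaluates the count C(N, M) × 3^M for the 23 SNPs considered in this study. The function name `n_barcodes` is hypothetical and is introduced only for illustration.

```python
from math import comb


def n_barcodes(n_snps: int, m: int) -> int:
    """Number of candidate m-SNP barcodes: C(N, M) * 3**M.

    C(N, M) counts the ways to choose M of the N SNPs, and 3**M counts
    the genotype assignments (three genotypes per selected SNP).
    """
    return comb(n_snps, m) * 3 ** m


if __name__ == "__main__":
    # Search-space size for barcodes of two to ten SNPs out of 23.
    for m in range(2, 11):
        print(m, n_barcodes(23, m))
```

For N = 23 and M = 2 the count is 253 × 9 = 2277, which is small enough for an exhaustive search; the count grows rapidly as M increases.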

Here, we develop an improved PSO algorithm called IPSO that improves the reliability of traditional PSO. The improvement is made in the population initialization step of the PSO process: good solutions are kept and the best solution is continually improved during the process. This conservation of superior results yields better solutions for high-order SNP-SNP interactions. We systematically evaluated the joint effects of combinations of 23 SNPs from six steroid hormone metabolism and signalling genes involved in breast carcinogenesis. The SNP barcodes generated by the IPSO algorithm were statistically evaluated by the odds ratio and risk ratio to predict breast cancer susceptibility. The results demonstrate that the proposed IPSO method can identify more relevant SNP barcodes for high-dimensional data sets and improves the reliability of the results in the 20 test runs we conducted.

Figure 1. Population initialization using conservation of the best 5 results.

http://dx.doi.org/10.1371/journal.pone.0037018.g001

Methods

Particle Swarm Optimization

PSO is an efficient evolutionary computation learning algorithm developed by Kennedy and Eberhart [33]. It was originally developed to graphically mimic the unpredictable movement of birds in a flock. The concept of PSO was designed to simulate social behavior based on information exchange, and was intended for practical applications. Within the problem space, each potential solution can be seen as a particle in a swarm. Every particle has a velocity and can adjust its direction according to its own flight experience and that of its companions. This strategy effectively mines the optimal regions of complex search spaces through the interaction of individuals in a population of particles. The basic elements of PSO are listed below:

  1. Population: A swarm (population) consisting of N particles. Each particle can be regarded as a problem solution in this study.
  2. Particle position, $x_i$: Each candidate solution can be represented by a D-dimensional vector; the ith particle can be described as $x_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$, where $x_{iD}$ is the position of the ith particle with respect to the Dth dimension. Each dimension of the particle position is defined by the selected SNPs and the corresponding genotypes for those SNPs.
  3. Particle velocity, $v_i$: The velocity of the ith particle is represented by $v_i = (v_{i1}, v_{i2}, \ldots, v_{iD})$, where $v_{iD}$ is the velocity of the ith particle with respect to the Dth dimension. The new locations of particles are chosen by adding $v_i$ to the coordinates of the particle position; PSO operates this process by adjusting $v_i$. In addition, the velocity of a particle is limited within $[V_{min}, V_{max}]^D$.
  4. Inertia weight, w: The inertia weight is used to control the impact of the previous velocity of a particle on its current velocity. This control parameter affects the trade-off between the exploration and exploitation abilities of the particle.
  5. Individual best value, pbesti: pbesti is the position of the ith particle with the highest fitness value found up to a given iteration. It can be regarded as the best SNP barcode solution found so far by the ith particle.
  6. Global best value, gbest: The best position among all pbest particles is called the global best. It can be regarded as the best SNP barcode solution found so far among all particles.
  7. Termination criteria: The process is stopped after the maximum allowed number of iterations is reached.

The PSO algorithm can be divided into four steps within a process period. First, particles are initialized as a population of random solutions. Then each particle finds its own pbesti by comparing its current fitness to the fitness of its previous position. In a third step, the gbest of all the particles in the population is determined. Finally, the PSO algorithm executes a search for optimal solutions by updating the generations. In each generation, the position and velocity of the ith particle are updated with pbesti and gbest of the swarm population. The update equations can be formulated as:

$$v_{id}^{new} = w \times v_{id}^{old} + c_1 \times r_1 \times (pbest_{id} - x_{id}^{old}) + c_2 \times r_2 \times (gbest_{d} - x_{id}^{old}) \quad (1)$$

$$x_{id}^{new} = x_{id}^{old} + v_{id}^{new} \quad (2)$$

where w is the inertia weight. This inertia weight is a positive linear function of time that changes with the generations; r1 and r2 are random numbers between (0, 1), and c1 and c2 are acceleration constants that control how far a particle moves in a single generation. Velocities $v_{id}^{new}$ and $v_{id}^{old}$, respectively, denote the velocities of the new and old particles; $x_{id}^{old}$ is the current particle position, and $x_{id}^{new}$ is the updated particle position. The velocity expresses how much a particle's position should change at a particular moment so that it approaches the global best position, i.e., it is the velocity of the particle flying toward the best position. To obtain a search solution, the particles' velocities in each dimension are limited within $[V_{min}, V_{max}]^D$, and the particles' positions are limited within $[X_{min}, X_{max}]^D$, thus determining the size of the steps the particle is allowed to take through the solution space.
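The velocity and position updates with their bounds can be sketched as follows. This is a minimal illustration of the standard PSO update on a real-valued vector, not the authors' implementation; the function name `update_particle` and the default bound values are assumptions chosen only for the example.

```python
import random


def update_particle(x, v, pbest, gbest, w, c1=2.0, c2=2.0,
                    v_min=-1.0, v_max=1.0, x_min=0.0, x_max=1.0,
                    rng=random):
    """Apply one standard PSO velocity/position update to a particle.

    x, v, pbest and gbest are equal-length lists of floats. The bound
    values are illustrative placeholders, not settings from the paper.
    """
    new_x, new_v = [], []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()
        # Inertia term plus attraction toward pbest and gbest.
        vd = (w * v[d]
              + c1 * r1 * (pbest[d] - x[d])
              + c2 * r2 * (gbest[d] - x[d]))
        # Clamp the velocity, then clamp the updated position.
        vd = max(v_min, min(v_max, vd))
        xd = max(x_min, min(x_max, x[d] + vd))
        new_v.append(vd)
        new_x.append(xd)
    return new_x, new_v
```

A particle that already sits on both its personal best and the global best with zero velocity does not move, which is a quick sanity check on the update rule.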

Improved Particle Swarm Optimization

This study proposes a new idea to improve the stability of results obtained with particle swarm optimization. We conserve the best results of each SNP barcode prediction, which allows us to offer better results for high-order SNP-SNP interactions. Retaining the best results in PSO is very simple and can be done without increasing the computational complexity of the process. The difference between IPSO and PSO is that the proposed idea is applied in the population initialization step of the PSO process. The IPSO proceeds as follows: the initial population is generated by our strategy, and then the fitness values of all individuals in the population are calculated by a fitness function. The particles are repositioned according to their own pbest and gbest solutions. The procedure is repeated in each successive iteration until the termination conditions are reached.

Encoding schemes.

In IPSO, every particle in a population is associated with a solution group. We define a particle based on the number of selected SNPs and the genotypes associated with the corresponding SNPs; a SNP cannot be selected more than once. The particle encoding can thus be represented by:

$$P_i = (SNP_{i,1}, SNP_{i,2}, \ldots, SNP_{i,n}, Genotype_{i,1}, Genotype_{i,2}, \ldots, Genotype_{i,n}), \quad i = 1, 2, \ldots, m$$

where SNPi,j represents the selected SNP, Genotypei,j represents one of the three possible genotypes once SNPi,j is selected, m represents the size of the population, and n represents the number of SNPs selected. Initial particles are randomly generated in this study. For example, let P = (SNP3,4,8, Genotype2,1,3). In this representation of the particle, SNP3,4,8 represents the chosen SNPs (3, 4, 8) and Genotype2,1,3 represents the chosen genotypes (2, 1, 3). In this case, the selected SNPs with their corresponding genotypes are represented as (3, 2), (4, 1), and (8, 3), respectively.
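The encoding can be illustrated with a short sketch. It assumes SNPs are numbered from 1 and genotypes are coded 1 to 3, as in the example above; the helper names `random_particle` and `as_pairs` are hypothetical.

```python
import random


def random_particle(n_total_snps, n_selected, rng=random):
    """Draw a random particle as (snps, genotypes).

    SNP indices are sampled without replacement, so no SNP is selected
    twice; each selected SNP gets one of the genotype codes 1, 2 or 3.
    """
    snps = sorted(rng.sample(range(1, n_total_snps + 1), n_selected))
    genotypes = [rng.randint(1, 3) for _ in snps]
    return snps, genotypes


def as_pairs(particle):
    """Pair each selected SNP with its genotype, e.g. (3, 2), (4, 1)."""
    snps, genotypes = particle
    return list(zip(snps, genotypes))
```

For the particle P = (SNP3,4,8, Genotype2,1,3), `as_pairs(([3, 4, 8], [2, 1, 3]))` gives the pairs (3, 2), (4, 1), and (8, 3).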

Population initialization using conservation of the top five results.

The top five results for the 2-SNP barcode and for the n-SNP barcodes (n ≥ 3) are generated differently. For the 2-SNP barcode, we simply apply an exhaustive search to compute and check all possible 2-SNP combinations and obtain the best five 2-SNP barcodes. For the n-SNP barcodes (n ≥ 3), the steps for population initialization are illustrated in Figure 1. The top five results from the previous step (initially, the 2-SNP combinations) are used to seed the initial population for the next number of SNPs. For example, (SNP1,3, Genotype1,2) is one of the top five 2-SNP barcodes (step 1-1). Subsequently, this 2-SNP barcode is used to search for the best 3-SNP barcode with a maximum difference value between the case and control data (step 2-2a); in this example the search result has the form (SNP1,3,i, Genotype1,2,j), with i = {2, 4, 5, 6 … n | n representing the number of SNPs} and j = {1, 2, 3}. Then, the exhaustive search algorithm is applied to compute and check all possible 3-SNP combinations of this form to find the top results for the 3-SNP barcodes (step 3-3a). If the exhaustive search finds the answer to be i = 6 and j = 1 (the newly added third SNP and its genotype are 6 and 1, respectively), the best 2-SNP barcode (SNP1,3, Genotype1,2) generates its best 3-SNP barcode (SNP1,3,6, Genotype1,2,1). The other four of the top five 3-SNP barcodes are generated in the same way.

Table 3. Estimated effect (odds ratio and 95% CI) from individual SNPs of 23 steroid hormone metabolisms and signalling-related genes on the occurrence of breast cancer in patients.

http://dx.doi.org/10.1371/journal.pone.0037018.t003

Meanwhile, the remaining 3-SNP barcodes are generated randomly (step 2-2b) and then sorted by their fitness values (step 3-3b). The result from step 3-3a is used to replace the worst of the top five SNP barcodes (step 3-4). Finally, the updated 3-SNP barcode population is ready for the PSO computation, in which the top five of the 3-SNP barcodes (step 3-5) are determined. These top five SNP barcodes can then be used to start the generation of the next higher-order SNP barcodes. The steps are described in detail by the annotated IPSO pseudo-code in the next two sections.
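The seeding strategy described above can be sketched as follows. This is one possible reading of Figure 1 rather than the authors' code: it assumes that each conserved barcode is extended by the single best extra SNP and genotype, and that the extended barcodes replace the worst members of the random population. The names `extend_barcode`, `seed_population`, and `fitness_fn` are hypothetical.

```python
def extend_barcode(seed, n_total_snps, fitness_fn):
    """Exhaustively add one (SNP, genotype) pair to a conserved barcode.

    seed is (snps, genotypes); fitness_fn maps a barcode to a number.
    Returns the best extended barcode and its fitness.
    """
    snps, genotypes = seed
    best, best_fit = None, float("-inf")
    for s in range(1, n_total_snps + 1):
        if s in snps:
            continue  # a SNP cannot be selected twice
        for g in (1, 2, 3):
            # Keep the pairs ordered by SNP index.
            pairs = sorted(zip(list(snps) + [s], list(genotypes) + [g]))
            cand = ([p[0] for p in pairs], [p[1] for p in pairs])
            fit = fitness_fn(cand)
            if fit > best_fit:
                best, best_fit = cand, fit
    return best, best_fit


def seed_population(random_population, seeds, n_total_snps, fitness_fn):
    """Replace the worst random particles with extended top barcodes.

    Assumption: one extended barcode per seed replaces one of the
    lowest-fitness random particles, keeping the population size fixed.
    """
    extended = [extend_barcode(seed, n_total_snps, fitness_fn)[0]
                for seed in seeds]
    ranked = sorted(random_population, key=fitness_fn, reverse=True)
    keep = ranked[:max(0, len(ranked) - len(extended))]
    return keep + extended
```

With the seed (SNP1,3, Genotype1,2) and a fitness function that favors SNP 6 with genotype 1, `extend_barcode` returns (SNP1,3,6, Genotype1,2,1), matching the worked example in the text.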

Table 4. The best estimated protective SNP combinations on the occurrence of breast cancer as determined by IPSO.

http://dx.doi.org/10.1371/journal.pone.0037018.t004

Fitness function.

In this study, the fitness value measures the difference between the case and control data for the selected SNP combination. The goal is to find the SNP combination with the highest fitness value, i.e., the maximum difference between cases and controls. The computation uses set intersection: the intersection of two sets contains exactly those elements that belong to both sets. A high fitness value indicates a good combination of SNPs and genotypes. The relevant equation is shown below:

$$Fitness(P_i) = n(C \cap P_i) - n(N \cap P_i) \quad (3)$$

where n represents the total number of elements in a set, C represents the SNP interactions in the case group, N represents the SNP interactions in the control group, and Pi represents the ith particle. The fitness value is computed in three steps. First, the total number of intersections of the case data set and the ith particle is calculated as $n(C \cap P_i)$. Second, the total number of intersections of the control data set and the ith particle is calculated as $n(N \cap P_i)$. Finally, Eq. (3) is used to calculate the fitness value, which is the difference between the intersection of the cases with the particle and the intersection of the controls with the particle. For example, P = (SNP1,2, Genotype2,1) is used to count the samples matching these SNPs and genotypes among the cases and controls in the breast cancer data. First, the number of cases matching both SNP1 with genotype 2 and SNP2 with genotype 1 is calculated; it was 76 in the breast cancer data set. Second, the number of controls matching both SNP1 with genotype 2 and SNP2 with genotype 1 is calculated as 141. According to Eq. (3), the fitness value is determined by subtracting 141 from 76, giving −65. If the fitness is negative, the absolute value is taken to obtain a fitness value of 65.
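A minimal sketch of this fitness computation is given below. It assumes a hypothetical data layout in which each sample is a list of genotype codes (1 to 3) indexed by SNP number minus one; the function names `count_matches` and `fitness` are introduced only for illustration.

```python
def count_matches(samples, particle):
    """Count samples that carry every (SNP, genotype) pair of a particle.

    samples: list of genotype lists; sample[s - 1] is the genotype code
    of SNP s. particle: (snps, genotypes) with 1-based SNP indices.
    """
    snps, genotypes = particle
    return sum(
        all(sample[s - 1] == g for s, g in zip(snps, genotypes))
        for sample in samples
    )


def fitness(cases, controls, particle):
    """Absolute difference between matching cases and matching controls."""
    return abs(count_matches(cases, particle)
               - count_matches(controls, particle))
```

With 76 matching cases and 141 matching controls, as in the example above, this function would return |76 − 141| = 65.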

Identification of pbest and gbest.

Each particle finds its personal best position (pbest) and the global best position (gbest) when moving. If the fitness value of a particle Pi in the current iteration is better than the fitness value of pbest in the previous iteration, pbest is updated to that of Pi. If the fitness value of particle Pi in the current iteration is better than gbest in the previous iteration and is the best one in the current iteration, gbest is updated to that of Pi. Each particle then adjusts its direction based on pbest and gbest in the following iteration.
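The update rule for pbest and gbest can be sketched as below, assuming that higher fitness is better. The function name `update_bests` and the list-based bookkeeping are illustrative assumptions, not part of the published pseudo-code.

```python
def update_bests(positions, fitnesses, pbest, pbest_fit, gbest, gbest_fit):
    """Update personal bests in place and return the new global best.

    positions/fitnesses: current position and fitness of each particle.
    pbest/pbest_fit: per-particle best position and fitness so far
    (mutated in place). gbest/gbest_fit: best over the whole swarm.
    """
    for i, (pos, fit) in enumerate(zip(positions, fitnesses)):
        if fit > pbest_fit[i]:
            # The particle improved on its own best position.
            pbest[i], pbest_fit[i] = pos, fit
        if fit > gbest_fit:
            # The particle also improved on the swarm-wide best.
            gbest, gbest_fit = pos, fit
    return gbest, gbest_fit
```

Because gbest only changes when a strictly better fitness appears, the best barcode found so far is never lost between iterations.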

Table 1 gives the pseudo-code for the IPSO algorithm, which combines the data with the procedure described above to generate the best SNP barcode for breast cancer prediction.

Figure 2. The maximum difference between cases and controls for PSO and IPSO on the best barcodes containing two to ten SNPs.

http://dx.doi.org/10.1371/journal.pone.0037018.g002

Parameter settings.

The population size parameter was set to 50 (Figure 1, step 2-2b). The termination condition of the PSO is reached at a prespecified number of iterations (in our case, the number of iterations is 100) (Figure 1, step 3-5). The other parameters used in the PSO were c1 = c2 = 2. Vmax was equal to (Xmax − Xmin) and Vmin was equal to −(Xmax − Xmin). These parameters have been optimized by Kennedy and Eberhart [33].

Performance measurement and statistical analysis.

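We used five commonly used criteria to determine the performance [28]:

$$Correctness = \frac{TP + TN}{TP + TN + FP + FN} \quad (4)$$

$$Sensitivity + Specificity = \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \quad (5)$$

$$PPV + NPV = \frac{TP}{TP + FP} + \frac{TN}{TN + FN} \quad (6)$$

$$Risk\ ratio = \frac{TP / (TP + FP)}{FN / (FN + TN)} \quad (7)$$

$$Odds\ ratio = \frac{TP \times TN}{FP \times FN} \quad (8)$$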

TP, TN, FN, and FP represent the number of true positives, true negatives, false negatives, and false positives, respectively. For the statistical analysis, performed with SPSS 13.0, the risk ratio (RR) and odds ratio (OR) are used to determine the best SNP barcode and to quantitatively measure the breast cancer risk. The boxplots were produced with SigmaPlot 9.0 (Systat Software, Inc.).
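The five scores can be computed from the four counts as in the sketch below. It uses the common textbook definitions of these measures; treating them as exactly the forms of Eqs. (4) to (8) is an assumption, and the function name `prediction_scores` is hypothetical.

```python
def prediction_scores(tp, tn, fp, fn):
    """Five prediction scores from a 2x2 confusion table.

    Uses the usual definitions; all denominators are assumed non-zero.
    """
    return {
        "correctness": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity+specificity": tp / (tp + fn) + tn / (tn + fp),
        "PPV+NPV": tp / (tp + fp) + tn / (tn + fn),
        # Risk among predicted positives relative to predicted negatives.
        "RR": (tp / (tp + fp)) / (fn / (fn + tn)),
        "OR": (tp * tn) / (fp * fn),
    }
```

A barcode with an OR or RR below 1 occurs less often in cases than in controls, which is how the protective barcodes in this study are characterized.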

Results

Data Set Preparation

The data set for the steroid hormones and their signalling and metabolic pathways (96 SNPs for 8 genes) was obtained from the breast cancer association study in [14]. This data set only provides the genotype frequencies, without the original raw genotype data for each SNP. In our study, we simulated the genotype data based on the original frequencies of the data set. Using the simulated genotype data, susceptibility to breast cancer in terms of complex SNP-SNP interactions can be considered. However, the simulated data do not reflect the true distribution of those SNPs in cases and controls, and therefore the results are not real. In addition, the original data involve different numbers of genotypes for each SNP, and hence we had to perform a normalization that makes each genotype size the same in order to allow further analysis. Our simulated data were randomly generated and obey the original genotype frequencies in the entire data set; the simulated data are available at http://bioinfo.kmu.edu.tw/brca-steroid-96SNP.xlsx.

The normalization procedure is provided in the “pseudo-code for randomly generated data” shown in Table 2. For example, we set the range size to a maximum of 5000, and then calculate the amounts of the three genotypes for each SNP. SNP4 (rs3020314), for instance, includes 4551 genotypes in the original data: 2132 for CC, 1970 for CT, and 449 for TT (pseudo-code step 04). For each SNP, the percentage of each genotype is calculated; in this instance, 2132/4551 (46.85%) for CC, 1970/4551 (43.29%) for CT, and 449/4551 (9.86%) for TT. Based on these percentages, the modified data for SNP4 are obtained by multiplying each percentage by the size of the entire data set, i.e., 46.85%×5000 = 2343 for CC, 43.29%×5000 = 2164 for CT, and 9.86%×5000 = 493 for TT (pseudo-code steps 05 to 11). The simulated data for SNP4 have thus been normalized to 5000 (2343+2164+493 = 5000). All original data are normalized to the same number in this manner.
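The normalization step can be sketched as follows. The rounding rule is an assumption: each scaled count is rounded down and any leftover is assigned to the first genotype, which happens to reproduce the SNP4 numbers quoted above; the pseudo-code in Table 2 may handle rounding differently. The function name `normalize_counts` is hypothetical.

```python
def normalize_counts(counts, total=5000):
    """Rescale genotype counts so that they sum exactly to total.

    counts: observed genotype counts for one SNP, e.g. [CC, CT, TT].
    Each count is scaled by total / sum(counts) and rounded down; the
    leftover from rounding is added to the first genotype (assumption).
    """
    observed = sum(counts)
    # Integer arithmetic avoids floating-point rounding surprises.
    scaled = [c * total // observed for c in counts]
    scaled[0] += total - sum(scaled)
    return scaled


if __name__ == "__main__":
    # SNP4 (rs3020314): 2132 CC, 1970 CT, 449 TT out of 4551.
    print(normalize_counts([2132, 1970, 449]))
```

For SNP4 this yields 2343, 2164, and 493, which sum to 5000 as stated in the text.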

Figure 3. Boxplots displaying the extremes, the upper and lower quartiles, and the median of the maximum difference between cases and controls for (A) the IPSO algorithm and (B) the PSO algorithm on three to ten combined SNPs over 20 runs.

The boundary of the box closest to zero indicates the 25th percentile, a line within the box marks the median, and the boundary of the box farthest from zero indicates the 75th percentile. Error bars above and below the boxes indicate the 90th and 10th percentiles, respectively. The triangle symbols indicate the 95th and 5th percentiles.

http://dx.doi.org/10.1371/journal.pone.0037018.g003

Evaluation of Breast Cancer Susceptibility in 23 Separate SNPs from 6 Steroid Hormone Metabolisms and Signalling-related Genes

Based on our simulated data, Table 3 shows the performance (OR and 95% CI) for each SNP from 6 steroid hormone metabolisms and signalling-related genes (COMT, CYP19A1, ESR1, PGR, SHBG, and STS). Some SNPs (such as SNPs 4, 10, 12–15, 17, and 19–23 listed in Table 3) with certain genotypes display a statistically significant OR (p<0.05) for breast cancer; their OR values range from 1.268 to 0.846. The other SNPs show no statistically significant OR for breast cancer.

Table 5. The best estimated protective SNP combinations on the occurrence of breast cancer as determined by PSO.

http://dx.doi.org/10.1371/journal.pone.0037018.t005

Identification of the SNP-SNP Interactions with Maximum Differences between Cases and Controls Using IPSO

Using the IPSO algorithm, the best SNP-SNP interaction is evaluated by the difference between cases and controls for all the SNP barcodes. After computation, the top five 2-SNP barcodes can be listed in order of the difference between cases and controls: SNPs (4-19)-genotype (1-1), SNPs (4-23)-genotype (1-2), SNPs (4-9)-genotype (1-2), SNPs (19-23)-genotype (1-2), and SNPs (9-23)-genotype (2-2). The differences in the number of cases and controls for these SNP barcodes are 174, 168, 158, 150, and 146, respectively (data not shown). In this study, as shown in Table 4, we only select the 2-SNP barcode with the maximum difference, i.e., the best of the 2-SNP barcodes. Similarly, the n-SNP barcodes (n = 3 to 10) with maximum differences, i.e., the best for each n-SNP barcode, are also selected (left side of Table 4).

With the conservation of the top five results, we found that the best n-SNP barcode contains the corresponding best (n-1)-SNP barcode. For example, the 3-SNP barcode contains the 2-SNP barcode, i.e., SNPs (4-19-23)-genotypes (1-1-2) vs. SNPs (4-19)-genotypes (1-1), where SNP 23 is the newly selected SNP. The 4-SNP barcode contains the 3-SNP barcode, i.e., SNPs (4-9-19-23)-genotypes (1-2-1-2) vs. SNPs (4-19-23)-genotypes (1-1-2), where SNP 9 is the newly selected SNP.

Prediction Scores of the Best IPSO-generated SNP Barcodes in Breast Cancer

The best n-SNP barcodes (n = 3 to 10) calculated by the IPSO algorithm are listed in Table 4 together with their five prediction scores, i.e., the correctness, sensitivity+specificity, PPV+NPV, RR, and OR, which are used to evaluate breast cancer susceptibility based on the IPSO-generated SNP barcodes. The sensitivity and specificity values of the respective best SNP barcodes are all higher than 0.96, suggesting that IPSO can identify the best SNP barcodes associated with breast cancer. The correctness and PPV+NPV values of the respective best SNP barcodes range from 0.48 to 0.50 and 0.64 to 0.96, respectively, and the RR and OR of the best SNP barcodes range from 0.88 to 0.17 and 0.84 to 0.17, respectively. The SNP barcodes involving two to seven SNPs show significantly decreasing OR values (p<0.05 to 0.001). Since the control numbers are greater than the case numbers for the SNP barcodes listed in Table 4, these SNP barcodes are regarded as protective against breast cancer.

Comparison between the Best IPSO-generated and PSO-generated SNP Barcodes in Breast Cancer

We compare IPSO with PSO for the reliability and the ability to identify SNP barcodes to support the advantage of the top-five strategy. The performances of the PSO and IPSO algorithms from 20 simulation runs (see supplement Table S1 and Table S2 for details) are compared by means of the best maximum difference between cases and controls as shown in Figure 2. To examine the performance in terms of the statistical differences between both algorithms, we performed the Wilcoxon Signed-Rank test and found that there were significant differences between cases and controls in n-SNP barcodes (n = 2 to 10) (Table S3).

The maximum differences for each SNP barcode generated by IPSO are higher than those of PSO, suggesting that the selection of the best protective SNP barcodes is more reliable in IPSO than in PSO. As shown in Figure 3, the median values suggest that IPSO is more suitable for selecting the best SNP barcodes for breast cancer protection. Moreover, the interquartile ranges (25th to 75th percentiles) of the boxplot, as well as the 5th, 10th, 90th and 95th percentiles for each n-SNP barcode (n = 3 to 10), are narrower for the IPSO algorithm (Figure 3A) than for the PSO algorithm (Figure 3B). These data suggest that the results of the PSO algorithm are less stable. In contrast, the IPSO algorithm provides exact identification of the best SNP barcodes for breast cancer protection. In fact, the data in Figure 3A (IPSO) are identical across runs for each n-SNP barcode (n = 3 to 10), i.e., 128, 87, 55, 35, 21, 12, 8, and 5 (Table 4). The best PSO-generated n-SNP barcodes with maximum differences between cases and controls are listed in Table 5. In the PSO algorithm the top five results are not conserved. Accordingly, the PSO-generated SNP barcodes conserve the selected SNPs to a lesser degree (Table 3). For example, only one SNP in the 2-SNP barcode shows up in the 3-SNP barcode, i.e., SNP 4 (rs3020314; Table 3), and only one SNP in the 3-SNP barcode shows up in the 4-SNP barcode, i.e., SNP 23 (rs2017591). Therefore, an order of influence on breast cancer is very difficult to establish from the SNPs in Table 3.

Discussion

Many association studies of cancer focused on the analysis of risk genetic factors that influence common complex traits in terms of commonly occurring SNPs. However, the possible protective effects are also important for the prediction of cancer morbidity by SNPs. Here, we analyzed the contribution of 23 SNPs from six breast cancer related genes to generate the protective SNP barcodes in a case-control study of 5000 cases and 5000 controls with genotype data simulation.

The maximum difference information calculated by the IPSO algorithm can predict the relative strength of the impact of an SNP on breast cancer protection. For example, the difference between controls and cases for SNP barcode [SNPs (4-19)-genotype (1-1)] is higher than that of [SNPs (4-19-23)-genotype (1-1-2)], suggesting that SNP 4 and SNP 19 are more associated with breast cancer protection than SNP 23. Accordingly, an order of impact on breast cancer for the SNPs listed in Table 3 can be arranged: SNPs 4/19> SNP 23> SNP 9> SNP 3> SNP 13> SNP 20> SNP 12> SNP 14> SNP 21. In this simulated breast cancer association study, the IPSO-generated SNP barcodes involving two to seven SNPs show significantly decreasing OR values ranging from 0.84 to 0.57 (Table 4), whereas the PSO-generated SNP barcodes do so only for two to four SNPs, with OR values ranging from 0.84 to 0.77 (Table 5). In contrast, some individual SNPs with certain genotypes display statistically significant OR values ranging from 1.268 to 0.846 (Table 3).

Some SNPs may have different impacts on breast cancer protection when considered individually and when considered in combination. For example, some individual SNPs such as SNPs 3 and 9 are not significantly associated with breast cancer (Table 3), but the 4- to 10-SNP combinations that include SNPs 3 and 9 show a significant association with breast cancer (Table 4). These data suggest that associations with breast cancer may be overlooked when SNP interactions are not considered.

A key issue in detecting SNP-SNP interactions in genome-wide case-control studies is computational efficiency. The computational complexity of the IPSO algorithm is estimated from the objective function computation. If there are M iterations and N solutions (particles) in the population, then the objective function computation has O(MN) computational complexity. The effective feature of the top-five strategy is that only the top five solutions are stored in each iteration. If there are K solutions in the archive, storing the solutions in the archive has O(M+K) computational complexity. If the archive and the iteration have the same numbers, the overall complexity of IPSO is O(MN+K).

Although the optimal parameters of PSO were demonstrated by Kennedy and Eberhart [33], we found that parameter adjustments may promote better results even for large numbers of SNPs. Firstly, the population size and the number of iterations can be adjusted according to the data size; we suggest a population size from 50 to 100 and a number of iterations from 100 to 1000. Larger values allow the search to explore better SNP barcodes with larger differences between cases and controls, but the computational cost of IPSO also increases. Secondly, c1 and c2 are acceleration constants that control how far a particle moves in a single generation, and they respectively control the exploitation and exploration ability in each search. In order to balance exploitation and exploration, we suggest setting both c1 and c2 to 2.

Although we explored the benefit of the IPSO algorithm for SNP interactions based on the simulated breast cancer study, the IPSO algorithm is not limited to breast cancer data and can be applied to other real data sets. After running these algorithms on a real data set for another disease [18], e.g., osteoporosis, we found that the IPSO algorithm again showed better performance than the PSO algorithm for selecting SNP barcodes in SNP-SNP interaction studies (data not shown).

IPSO can overcome the computational time limitations of analyzing complex SNP interactions in GWAS because it has the following advantages: 1) IPSO allows robust analysis of high-order SNP combinations in GWAS and generates the best SNP barcodes; 2) IPSO is an improved evolutionary algorithm that does not require an exhaustive search; 3) IPSO needs only two parameters for computation, without complex settings; and 4) its computational complexity is unaffected by the size of the data set.

In conclusion, we propose an improved PSO algorithm to perform a powerful breast cancer association analysis in terms of SNP-SNP interactions with 23 SNPs. Our strategy improves on the reliability of traditional PSO and yields combinations of more statistically significant SNPs associated with breast cancer protection. With the help of the IPSO algorithm, the SNP barcodes with the best fitness, i.e., the maximum difference between cases and controls, can be identified. The algorithm can potentially be applied to identify complex SNP-SNP (gene-gene) interactions for different diseases, even when a large number of SNPs is involved in genome-wide association studies.

Supporting Information

Table S1.

The estimated protective SNP combinations on the occurrence of breast cancer as determined by IPSO.

doi:10.1371/journal.pone.0037018.s001

(PDF)

Table S2.

The estimated protective SNP combinations on the occurrence of breast cancer as determined by PSO.

doi:10.1371/journal.pone.0037018.s002

(PDF)

Table S3.

Wilcoxon Signed-Rank test for IPSO and PSO.

doi:10.1371/journal.pone.0037018.s003

(PDF)

Author Contributions

Conceived and designed the experiments: HWC CHY. Performed the experiments: YDL CHY. Analyzed the data: YDL LYC. Contributed reagents/materials/analysis tools: YDL. Wrote the paper: LYC HWC.

References

  1. Li J, Humphreys K, Darabi H, Rosin G, Hannelius U, et al. (2010) A genome-wide association scan on estrogen receptor-negative breast cancer. Breast Cancer Res 12: R93.
  2. Kraft P, Haiman CA (2010) GWAS identifies a common breast cancer risk allele among BRCA1 carriers. Nat Genet 42: 819–820.
  3. Thomas G, Jacobs KB, Kraft P, Yeager M, Wacholder S, et al. (2009) A multistage genome-wide association study in breast cancer identifies two new risk alleles at 1p11.2 and 14q24.1 (RAD51L1). Nat Genet 41: 579–584.
  4. Meindl A (2009) Identification of novel susceptibility genes for breast cancer - Genome-wide association studies or evaluation of candidate genes? Breast Care (Basel) 4: 93–99.
  5. Fanale D, Amodeo V, Corsini LR, Rizzo S, Bazan V, et al. (2012) Breast cancer genome-wide association studies: there is strength in numbers. Oncogene 31: 2121–2128.
  6. Yu JC, Hsiung CN, Hsu HM, Bao BY, Chen ST, et al. (2011) Genetic variation in the genome-wide predicted estrogen response element-related sequences is associated with breast cancer development. Breast Cancer Res 13: R13.
  7. Soto AM, Sonnenschein C (2001) The two faces of janus: sex steroids as mediators of both cell proliferation and cell death. J Natl Cancer Inst 93: 1673–1675.
  8. Auricchio F, Migliaccio A, Castoria G (2008) Sex-steroid hormones and EGF signalling in breast and prostate cancer cells: targeting the association of Src with steroid receptors. Steroids 73: 880–884.
  9. Ando S, De Amicis F, Rago V, Carpino A, Maggiolini M, et al. (2002) Breast cancer: from estrogen to androgen receptor. Mol Cell Endocrinol 193: 121–128.
  10. Giovannelli P, Di Donato M, Giraldi T, Migliaccio A, Castoria G, et al. (2011) Targeting rapid action of sex steroid receptors in breast and prostate cancers. Front Biosci 17: 2224–2232.
  11. LaPensee EW, Ben-Jonathan N (2010) Novel roles of prolactin and estrogens in breast cancer: resistance to chemotherapy. Endocr Relat Cancer 17: R91–107.
  12. Fortunati N, Catalano MG, Boccuzzi G, Frairia R (2010) Sex Hormone-Binding Globulin (SHBG), estradiol and breast cancer. Mol Cell Endocrinol 316: 86–92.
  13. Udler MS, Azzato EM, Healey CS, Ahmed S, Pooley KA, et al. (2009) Common germline polymorphisms in COMT, CYP19A1, ESR1, PGR, SULT1E1 and STS and survival after a diagnosis of breast cancer. Int J Cancer 125: 2687–2696.
  14. Pharoah PD, Tyrer J, Dunning AM, Easton DF, Ponder BA (2007) Association between common variation in 120 candidate genes and breast cancer risk. PLoS Genet 3: e42.
  15. Low YL, Taylor JI, Grace PB, Mulligan AA, Welch AA, et al. (2006) Phytoestrogen exposure, polymorphisms in COMT, CYP19, ESR1, and SHBG genes, and their associations with prostate cancer risk. Nutr Cancer 56: 31–39.
  16. Zheng SL, Sun J, Wiklund F, Smith S, Stattin P, et al. (2008) Cumulative association of five genetic variants with prostate cancer. N Engl J Med 358: 910–919.
  17. Yen CY, Liu SY, Chen CH, Tseng HF, Chuang LY, et al. (2008) Combinational polymorphisms of four DNA repair genes XRCC1, XRCC2, XRCC3, and XRCC4 and their association with oral cancer in Taiwan. J Oral Pathol Med 37: 271–277.
  18. Lin GT, Tseng HF, Chang CK, Chuang LY, Liu CS, et al. (2008) SNP combinations in chromosome-wide genes are associated with bone mineral density in Taiwanese women. Chinese Journal of Physiology 91: 1–10.
  19. Cauchi S, Meyre D, Durand E, Proenca C, Marre M, et al. (2008) Post genome-wide association studies of novel genes associated with type 2 diabetes show gene-gene interaction and high predictive value. PLoS ONE 3: e2031.
  20. Ricceri F, Guarrera S, Sacerdote C, Polidoro S, Allione A, et al. (2010) ERCC1 haplotypes modify bladder cancer risk: a case-control study. DNA Repair (Amst) 9: 191–200.
  21. Yin J, Lu K, Lin J, Wu L, Hildebrandt MA, et al. (2011) Genetic variants in TGF-beta pathway are associated with ovarian cancer risk. PLoS ONE 6: e25559.
  22. Chen L, Li W, Zhang L, Wang H, He W, et al. (2011) Disease gene interaction pathways: a potential framework for how disease genes associate by disease-risk modules. PLoS ONE 6: e24495.
  23. Han W, Kim KY, Yang SJ, Noh DY, Kang D, et al. (2011) SNP-SNP interactions between DNA repair genes were associated with breast cancer risk in a Korean population. Cancer 118(3): 594–602.
  24. Conde J, Silva SN, Azevedo AP, Teixeira V, Pina JE, et al. (2009) Association of common variants in mismatch repair genes and breast cancer susceptibility: a multigene study. BMC Cancer 9: 344.
  25. Lin GT, Tseng HF, Yang CH, Hou MF, Chuang LY, et al. (2009) Combinational polymorphisms of seven CXCL12-related genes are protective against breast cancer in Taiwan. OMICS 13: 165–172.
  26. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, et al. (2001) Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet 69: 138–147.
  27. Chung Y, Lee SY, Elston RC, Park T (2007) Odds ratio based multifactor-dimensionality reduction method for detecting gene-gene interactions. Bioinformatics 23: 71–76.
  28. Mechanic LE, Luke BT, Goodman JE, Chanock SJ, Harris CC (2008) Polymorphism Interaction Analysis (PIA): a method for investigating complex gene-gene interactions. BMC Bioinformatics 9: 146.
  29. Chen SH, Sun J, Dimitrov L, Turner AR, Adams TS, et al. (2008) A support vector machine approach for detecting gene-gene interaction. Genet Epidemiol 32: 152–167.
  30. Yang CH, Chang HW, Cheng YH, Chuang LY (2009) Novel generating protective single nucleotide polymorphism barcode for breast cancer using particle swarm optimization. Cancer Epidemiol 33: 147–154.
  31. Yang CH, Chuang LY, Chen YJ, Tseng HF, Chang HW (2011) Computational analysis of simulated SNP interactions between 26 growth factor-related genes in a breast cancer association study. OMICS 15: 399–407.
  32. Chung Y, Lee SY, Elston RC, Park T (2006) Odds ratio based multifactor-dimensionality reduction method for detecting gene–gene interactions. Bioinformatics 23: 71.
  33. Kennedy J, Eberhart RC (1995) Particle swarm optimization. Proceedings of IEEE International Conference on Neural Networks. IV. pp. 1942–1948. doi:10.1109/ICNN.1995.488968.