An Improved PSO Algorithm for Generating Protective SNP Barcodes in Breast Cancer

Background: Possible single nucleotide polymorphism (SNP) interactions in breast cancer are usually not investigated in genome-wide association studies. Previously, we proposed a particle swarm optimization (PSO) method to compute these kinds of SNP interactions. However, this PSO is not guaranteed to find the best result in every run, especially when high-dimensional data are investigated for SNP–SNP interactions.

Methodology/Principal Findings: In this study, we propose an improved PSO (IPSO) algorithm to improve the reliability of PSO for the identification of the best protective SNP barcodes (SNP combinations and genotypes with maximum difference between cases and controls) associated with breast cancer. SNP barcodes containing different numbers of SNPs were computed. The top five SNP barcode results are retained for computing the next SNP barcode, with one SNP added at each processing step. The performance of the proposed IPSO algorithm is evaluated on simulated data for 23 SNPs of six steroid hormone metabolisms and signalling-related genes. Among the 23 SNPs, 13 SNPs displayed significant odds ratio (OR) values (1.268 to 0.848; p<0.05) for breast cancer. With the IPSO algorithm, the joint effect in terms of SNP barcodes with two to seven SNPs shows significantly decreasing OR values (0.84 to 0.57; p<0.05 to 0.001). With the PSO algorithm, SNP barcodes with two to four SNPs show significantly decreasing OR values (0.84 to 0.77; p<0.05 to 0.001). Based on the results of 20 simulations, the medians of the maximum differences for each SNP barcode generated by IPSO are higher than those generated by PSO. The interquartile ranges of the boxplot, as well as the upper and lower hinges for each n-SNP barcode (n = 3∼10), are narrower in IPSO than in PSO, suggesting that IPSO is highly reliable for SNP barcode identification.

Conclusions/Significance: Overall, the proposed IPSO algorithm provides robust and exact identification of the best protective SNP barcodes for breast cancer.


Introduction
Genome-wide association studies (GWAS) can identify several highly robust and statistically significant single nucleotide polymorphisms (SNPs) associated with breast cancer susceptibility [1][2][3][4][5][6]. The associations for genotype frequencies of case and control data have significant impacts on the disease susceptibility. Although GWASs provide representative SNPs from the entire genome, many SNPs with a low or marginal significance are frequently excluded to effectively retrieve highly significant and representative tagSNPs.
Many studies hypothesize that the cancer or disease risk is associated with the co-occurrence of SNPs displaying a joint effect [16][17][18][19][20][21][22]. In recent breast cancer association studies, further evidence for SNP-SNP interactions has been identified, such as the SNP-SNP interactions of genes related to DNA repair [23,24], chemokine ligand-receptor interactions [25], and estrogen-response gene [6]. However, the possible SNP-SNP interactions between these hormone metabolisms and signalling-related genes have hardly been addressed. This is in part due to the computationally challenging nature of association studies with multiple SNP candidates.
Currently, the analysis of SNP-SNP interactions remains a challenge because of the complex combinations that arise in data with a huge number of SNPs. Many possible combinations of alleles in SNP-SNP interactions are generated when multiple SNPs are evaluated simultaneously. Mathematically, the number of possible combinations of SNP interactions between cases and controls is estimated to be $C(N,M) \times 3^{M} = \frac{N!}{M!\,(N-M)!} \times 3^{M}$, where N is the number of SNPs or factors, and M is the selected prediction number of SNPs. Many artificial intelligence methods have been proposed to compute the association of genotype frequencies of case and control data. They were demonstrated to be effective in reducing the number of search items among a greater number of SNP combinations, such as multifactor dimensionality reduction (MDR) [26,27], polymorphism interaction analysis (PIA) [28], support vector machine (SVM) [29], particle swarm optimization (PSO) [30], and genetic algorithm (GA) [31]. In general, MDR provides many useful features but tends to yield false positive and negative errors when the case/control ratio in a combination of genotypes is similar to the ratio in the entire data set [32]. The PSO and GA methods have the ability to generate relevant SNP combinations in high-dimensional data; however, these methods do not guarantee that every run yields a relevant solution when the dimensionality is very high. This is because the PSO and GA algorithms use randomly generated initial values and a fixed number of iterations. Accordingly, improved algorithms for solving this complex interaction problem are essential.
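To give a feel for how quickly this search space grows, the count C(N,M) × 3^M can be evaluated directly. The following is a minimal sketch; the function name `barcode_search_space` is an illustrative choice and not part of the original method.

```python
from math import comb


def barcode_search_space(n_snps: int, m_selected: int, genotypes_per_snp: int = 3) -> int:
    """Number of candidate SNP barcodes: C(N, M) * 3**M.

    n_snps is N (available SNPs), m_selected is M (SNPs in the barcode);
    each selected SNP can take one of three genotypes.
    """
    return comb(n_snps, m_selected) * genotypes_per_snp ** m_selected


if __name__ == "__main__":
    # Search-space size for the 23-SNP data set at several barcode orders.
    for m in (2, 3, 5, 10):
        print(m, barcode_search_space(23, m))
```

For N = 23 the 2-SNP space contains only 2277 candidates, which is why an exhaustive search is feasible at that order, whereas higher orders grow by several orders of magnitude.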
Here, we develop an improved PSO algorithm called IPSO that improves the reliability of traditional PSO. The improvement lies in the population initialization step of the PSO process: good solutions are retained and the current best solution is continually refined during the process. This conservation of superior results yields better solutions for high-order SNP-SNP interactions. We systematically evaluated the joint effects of combinations of 23 SNPs from six steroid hormone metabolisms and signalling genes involved in breast carcinogenesis. The SNP barcodes generated by the IPSO algorithm were statistically evaluated by the odds ratio and risk ratio to predict breast cancer susceptibility. The results demonstrate that the proposed IPSO method can identify more relevant SNP barcodes for high-dimensional data sets and improves the reliability of the results in the 20 test runs we conducted.

Particle Swarm Optimization
PSO is an efficient evolutionary computation learning algorithm developed by Kennedy and Eberhart [33]. It was originally developed to graphically mimic the unpredictable movement of birds in a flock. The concept of PSO was designed to simulate social behavior based on information exchange, and was designed for practical applications. Within the problem space, each potential solution can be seen as a particle in a swarm. Every particle with a certain velocity can adjust its direction path according to its own flight experience and that of its companions. This strategy effectively mines the optimal regions of complex search spaces through the interaction of individuals in a population of particles. The basic elements of PSO are listed below:

1) Population: A swarm (population) consisting of N particles. Each particle can be regarded as a problem solution in this study.

2) Particle position, $x_i$: Each candidate solution can be represented by a D-dimensional vector; the i-th particle can be described as $x_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$, where $x_{iD}$ is the position of the i-th particle with respect to the D-th dimension. Each dimensional vector in the particle position is defined by the number of selected SNPs and the corresponding genotypes for the associated SNPs.

3) Particle velocity, $v_i$: The velocity of the i-th particle is represented by $v_i = (v_{i1}, v_{i2}, \ldots, v_{iD})$, where $v_{iD}$ is the velocity of the i-th particle with respect to the D-th dimension. The new locations of particles are chosen by adding $v_i$ to the coordinates of the particle position $x_i$; PSO operates this process by adjusting $v_i$. In addition, the velocity of a particle is limited within $[V_{min}, V_{max}]^D$.

4) Inertia weight, w: The inertia weight is used to control the impact of the previous velocity of a particle on its current velocity. This control parameter affects the trade-off between the exploration and exploitation abilities of the particle.

5) Individual best value, $pbest_i$: $pbest_i$ is the position of the i-th particle with the highest fitness value at a given iteration. It can be regarded as the best SNP barcode found so far by the i-th particle.

6) Global best value, gbest: The best position among all pbest particles is called the global best. It can be regarded as the best SNP barcode found so far by all particles.

7) Termination criteria: The process is stopped after the maximum allowed number of iterations is reached.
The PSO algorithm can be divided into four steps within a process period. First, the particles are initialized as a population of random solutions. Then each particle finds its own $pbest_i$ by comparing its current fitness to the fitness of its previous position. In a third step, the gbest of all the particles in the population is determined. Finally, the PSO algorithm executes a search for optimal solutions by updating the generations. In each generation, the position and velocity of the i-th particle are updated with $pbest_i$ and the gbest of the swarm population. The update equations can be formulated as:

$$v_{id}^{new} = w \times v_{id}^{old} + c_1 \times r_1 \times (pbest_{id} - x_{id}^{old}) + c_2 \times r_2 \times (gbest_{d} - x_{id}^{old}) \quad (1)$$

$$x_{id}^{new} = x_{id}^{old} + v_{id}^{new} \quad (2)$$

where w is the inertia weight. This inertia weight is a positive linear function of time that changes with the generations; $r_1$ and $r_2$ are random numbers in (0, 1), and $c_1$ and $c_2$ are acceleration constants that control how far a particle moves in a single generation. The velocities $v_{id}^{new}$ and $v_{id}^{old}$ denote the new and old velocities of the particle, respectively; $x_{id}^{old}$ is the current particle position, and $x_{id}^{new}$ is the updated particle position. The velocity indicates how much a particle's position should change at a particular moment so that it moves toward the global best position, i.e., it is the velocity of the particle flying toward the best position. To obtain a search solution, the particles' velocities in each dimension are limited within $[V_{min}, V_{max}]^D$, and the particles' positions are limited within $[X_{min}, X_{max}]^D$, thus determining the size of the steps the particle is allowed to take through the solution space.
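The velocity and position update described above can be sketched in a few lines. This is a generic continuous PSO step under stated assumptions: the function name `update_particle`, the list-based representation, and the default bounds are illustrative, and the mapping from continuous positions to discrete SNP and genotype indices is not specified in the text and is therefore left out.

```python
import random


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def update_particle(x_old, v_old, pbest, gbest, w, c1=2.0, c2=2.0,
                    x_min=0.0, x_max=1.0, rng=random):
    """One standard PSO update for a single particle.

    x_old, v_old, pbest, gbest are equal-length lists (one entry per
    dimension). Velocities are limited to [-V_max, V_max] with
    V_max = X_max - X_min, and positions to [X_min, X_max].
    """
    v_max = x_max - x_min
    x_new, v_new = [], []
    for d in range(len(x_old)):
        r1, r2 = rng.random(), rng.random()  # fresh random numbers per dimension
        v = (w * v_old[d]
             + c1 * r1 * (pbest[d] - x_old[d])   # pull toward the particle's own best
             + c2 * r2 * (gbest[d] - x_old[d]))  # pull toward the swarm's best
        v = clamp(v, -v_max, v_max)
        x = clamp(x_old[d] + v, x_min, x_max)
        v_new.append(v)
        x_new.append(x)
    return x_new, v_new
```

A particle that already sits on both pbest and gbest with zero velocity stays where it is, which is a quick sanity check on the update.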

Improved Particle Swarm Optimization
This study proposes a new idea to improve the stability of results obtained with particle swarm optimization. We conserve the best results in each SNP barcode prediction, which allows us to offer better results for high-order SNP-SNP interactions. The retention of the best results in PSO is very simple and can be done without increasing the computational complexity of the process. The difference between IPSO and PSO is that the proposed idea is applied in the population initialization step of the PSO process. The IPSO proceeds as follows: the initial population is generated by our strategy, and then the fitness values of all individuals in the population are calculated by a fitness function.
The particles are repositioned according to their own pbest and gbest solutions. The procedure is repeated in each successive iteration until the termination conditions are reached.
Encoding schemes. In IPSO, every particle in a population is associated with a solution group. We define a particle based on the number of selected SNPs and the genotypes associated with the corresponding SNPs; the same SNP cannot be selected more than once. The particle encoding can thus be represented by:

$$P_i = (\mathrm{SNP}_{i,j}, \mathrm{Genotype}_{i,j}), \quad i = 1, 2, \ldots, m, \quad j = 1, 2, \ldots, n$$

where $\mathrm{SNP}_{i,j}$ represents the selected SNP, $\mathrm{Genotype}_{i,j}$ represents one of the three possible genotypes once $\mathrm{SNP}_{i,j}$ is selected, m represents the size of the population, and n represents the number of SNPs selected. Initial particles are randomly generated in this study. For example, let $P = (\mathrm{SNP}_{3,4,8}, \mathrm{Genotype}_{2,1,3})$. In this representation of the particle, $\mathrm{SNP}_{3,4,8}$ represents the chosen SNPs (3, 4, 8) and $\mathrm{Genotype}_{2,1,3}$ represents the chosen genotypes (2, 1, 3). In this case, the selected SNPs with their corresponding genotypes are represented as (3, 2), (4, 1), and (8, 3), respectively.
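The encoding above maps naturally onto a pair of tuples. The sketch below is one possible representation, assuming SNPs are numbered from 1 and genotypes are coded 1 to 3; the helper names `random_particle` and `as_pairs` are hypothetical.

```python
import random


def random_particle(n_total_snps, n_selected, rng=random):
    """Draw a random particle: n_selected distinct SNPs, one genotype (1-3) each."""
    snps = sorted(rng.sample(range(1, n_total_snps + 1), n_selected))  # no repeated SNPs
    genotypes = [rng.randint(1, 3) for _ in snps]
    return tuple(snps), tuple(genotypes)


def as_pairs(particle):
    """Expand a particle into (SNP, genotype) pairs."""
    snps, genotypes = particle
    return list(zip(snps, genotypes))


if __name__ == "__main__":
    example = ((3, 4, 8), (2, 1, 3))  # the particle P from the text
    print(as_pairs(example))          # [(3, 2), (4, 1), (8, 3)]
    print(random_particle(23, 3))
```

Sampling without replacement enforces the constraint that a SNP cannot be selected twice within one particle.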
Population initialization using conservation of the top five results. The top five results in the 2-SNP barcode and in the n-SNP barcode (n ≥ 3) are generated differently. For the 2-SNP barcode, we simply apply an exhaustive search to compute and check all possible 2-SNP combinations, which gives the best five results among all 2-SNP barcodes. To generate the n-SNP barcode (n ≥ 3), the steps for population initialization are illustrated in Figure 1. The top five results from the previous step (starting with the 2-SNP combinations) are used to seed the initial population for the next higher number of SNPs. For example, ($\mathrm{SNP}_{1,3}$, $\mathrm{Genotype}_{1,2}$) is one of the top five 2-SNP barcodes (step 1-1). Subsequently, this 2-SNP barcode is used to search for the best 3-SNP barcode with a maximum difference value between the case and control data (step 2-2a); in this example the search result is ($\mathrm{SNP}_{1,3,i}$, $\mathrm{Genotype}_{1,2,j}$), with i = {2, 4, 5, 6 … n | n representing the number of SNPs} and j = {1, 2, 3}. Then, the exhaustive search algorithm is applied to compute and check all possible 3-SNP combinations to find the top five results for all 3-SNP barcodes (step 3-3a). If the exhaustive search finds the answer to be i = 6 and j = 1 (the newly added third SNP and its genotype are 6 and 1, respectively), the best 2-SNP barcode ($\mathrm{SNP}_{1,3}$, $\mathrm{Genotype}_{1,2}$) generates its best 3-SNP barcode ($\mathrm{SNP}_{1,3,6}$, $\mathrm{Genotype}_{1,2,1}$). The other four of the top five 3-SNP barcodes are generated in the same way.
Meanwhile, a population of 3-SNP barcodes is generated randomly (step 2-2b) and then sorted by fitness value (step 3-3b). The results from step 3-3a are used to replace the worst SNP barcodes in this population (step 3-4). Finally, the updated 3-SNP barcode population is ready for the PSO computation, in which the top five 3-SNP barcodes (step 3-5) are determined. These top five SNP barcodes can then be used to start the generation of the next higher-order SNP barcodes. The steps are described in detail by the annotated IPSO pseudo-code in the next two sections.
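The seeding procedure can be sketched as follows. This is one reading of the steps above rather than the authors' code: each retained barcode is extended by one SNP through an exhaustive scan, a random population is generated and sorted by fitness, and the worst random particles are overwritten by the extended barcodes. The data layout (each sample is a dict mapping SNP number to genotype) and all function names are assumptions made for illustration.

```python
import random


def count_matches(samples, snps, genotypes):
    """Count samples carrying every (SNP, genotype) pair of the barcode."""
    return sum(all(s[snp] == g for snp, g in zip(snps, genotypes)) for s in samples)


def fitness(cases, controls, snps, genotypes):
    """Absolute difference between matching cases and matching controls."""
    return abs(count_matches(cases, snps, genotypes)
               - count_matches(controls, snps, genotypes))


def extend_barcode(barcode, n_total_snps, cases, controls):
    """Exhaustively try every unused SNP and genotype as the added position."""
    snps, genotypes = barcode
    best_fit, best = -1, None
    for snp in range(1, n_total_snps + 1):
        if snp in snps:
            continue
        for g in (1, 2, 3):
            cand = (snps + (snp,), genotypes + (g,))
            f = fitness(cases, controls, *cand)
            if f > best_fit:
                best_fit, best = f, cand
    return best


def init_population(top_barcodes, n_total_snps, cases, controls,
                    pop_size=50, rng=random):
    """Random population whose worst members are replaced by extended barcodes."""
    n_selected = len(top_barcodes[0][0]) + 1
    seeded = [extend_barcode(b, n_total_snps, cases, controls) for b in top_barcodes]
    population = []
    for _ in range(pop_size):
        snps = tuple(rng.sample(range(1, n_total_snps + 1), n_selected))
        genotypes = tuple(rng.randint(1, 3) for _ in snps)
        population.append((snps, genotypes))
    # Sort best-first, then overwrite the tail (the worst particles).
    population.sort(key=lambda p: fitness(cases, controls, *p), reverse=True)
    population[-len(seeded):] = seeded
    return population
```

Because the extended barcodes enter the population before the PSO iterations start, the search at order n can never lose the best solutions inherited from order n-1.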
Fitness function. In this study, the fitness value measures the difference between the case and control data for the selected SNP combination. The aim is to find the specific SNP combinations with the highest fitness value, i.e., the maximum difference between cases and controls. The concept uses the intersection from set theory to compute this difference. The intersection of two sets is the set that contains all elements of one of these sets that also belong to the other set, but no other elements. A high fitness value indicates a better combination of SNPs and genotypes. The relevant equation is shown below:

$$Fitness(P_i) = n(C \cap P_i) - n(N \cap P_i) \quad (3)$$

where n represents the total number of elements in a set, C represents the total number of SNP interactions in the case group, and N represents the total number of SNP interactions in the control group. $P_i$ represents the i-th particle. The fitness value definition can be divided into three steps. First, the total number of intersections of the case data set and the i-th particle is calculated as $n(C \cap P_i)$. Second, the total number of intersections of the control data set and the i-th particle is calculated as $n(N \cap P_i)$. Finally, Eq. (3) is used to calculate the fitness value, which is the difference between the intersection of the cases with the particle and the intersection of the controls with the particle. For example, the particle $P = (\mathrm{SNP}_{1,2}, \mathrm{Genotype}_{2,1})$ is used to count the cases and controls in the breast cancer data that match its SNPs and genotypes. First, the number of cases matching SNP 1 with genotype 2 and SNP 2 with genotype 1 is calculated; it was 76 in the breast cancer data set. Second, the number of controls matching SNP 1 with genotype 2 and SNP 2 with genotype 1 is calculated as 141. According to Eq. (3), the fitness value is determined by subtracting 141 from 76, giving -65. If the fitness is negative, the absolute value is taken to obtain a fitness value of 65.
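The worked example can be reproduced with a small sketch. The toy case and control lists below are constructed only so that 76 cases and 141 controls match the barcode, as in the text; they are not the study data, and the dict-per-sample layout and function names are assumptions.

```python
def count_matches(samples, particle):
    """n(S ∩ P): number of samples that carry every (SNP, genotype) pair of the particle."""
    snps, genotypes = particle
    return sum(all(sample[s] == g for s, g in zip(snps, genotypes))
               for sample in samples)


def fitness(cases, controls, particle):
    """Difference between matching cases and matching controls; absolute value if negative."""
    diff = count_matches(cases, particle) - count_matches(controls, particle)
    return abs(diff)


if __name__ == "__main__":
    particle = ((1, 2), (2, 1))  # SNP 1 with genotype 2, SNP 2 with genotype 1
    # Toy data: 76 matching cases and 141 matching controls, plus non-matching fillers.
    cases = [{1: 2, 2: 1}] * 76 + [{1: 1, 2: 1}] * 24
    controls = [{1: 2, 2: 1}] * 141 + [{1: 3, 2: 2}] * 9
    print(count_matches(cases, particle))     # 76
    print(count_matches(controls, particle))  # 141
    print(fitness(cases, controls, particle))  # 65
```

Note that a sample counts as a match only when all positions of the barcode agree, which is the set-intersection reading of the fitness definition.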
Identification of pbest and gbest. Each particle finds its personal best position (pbest) and the global best position (gbest) when moving. If the fitness value of a particle $P_i$ in the current iteration is better than the fitness value of pbest in the previous iteration, pbest is updated to that of $P_i$. If the fitness value of particle $P_i$ in the current iteration is better than gbest in the previous iteration and is the best one in the current iteration, gbest is updated to that of $P_i$. Each particle then adjusts its direction based on pbest and gbest in the following iteration.
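A minimal sketch of this bookkeeping, assuming larger fitness is better and that fitness values have already been computed for the current iteration; the function name `update_bests` and the parallel-list layout are illustrative choices.

```python
def update_bests(particles, fitnesses, pbest, pbest_fit, gbest, gbest_fit):
    """Update personal bests in place and return the new global best.

    particles/fitnesses: current positions and their fitness values.
    pbest/pbest_fit: per-particle best positions and fitness so far (mutated).
    gbest/gbest_fit: best position and fitness over the whole swarm so far.
    """
    for i, (p, f) in enumerate(zip(particles, fitnesses)):
        if f > pbest_fit[i]:      # particle improved on its own record
            pbest[i], pbest_fit[i] = p, f
        if f > gbest_fit:         # particle improved on the swarm record
            gbest, gbest_fit = p, f
    return gbest, gbest_fit
```

After the loop, gbest holds the best particle of the current iteration only if it also beats the previous global best, which matches the two conditions stated above.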
Table 1 gives the pseudo-code for the IPSO algorithm, which combines the data with the adaptation procedure described above and generates the best SNP barcode for breast cancer prediction.
Parameter settings. The population size parameter was set to 50 (Figure 1, step 2-2b). The termination condition of the PSO is reached at a prespecified number of iterations (in our case, 100 iterations) (Figure 1, step 3-5). The other parameters used in the PSO were $c_1 = c_2 = 2$. $V_{max}$ was equal to $(X_{max} - X_{min})$ and $V_{min}$ was equal to $-(X_{max} - X_{min})$. These parameters have been optimized by Kennedy and Eberhart [33].
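Performance measurement and statistical analysis. We used five commonly used criteria to determine the performance [28]. The first three are computed from the confusion matrix:

$$\mathrm{Correctness} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Sensitivity} + \mathrm{Specificity} = \frac{TP}{TP + FN} + \frac{TN}{TN + FP}$$

$$\mathrm{Positive\ Predictive\ Value\ (PPV)} + \mathrm{Negative\ Predictive\ Value\ (NPV)} = \frac{TP}{TP + FP} + \frac{TN}{TN + FN}$$

TP, TN, FN, and FP represent the number of true positives, true negatives, false negatives, and false positives, respectively. For statistical analysis with SPSS 13.0, the risk ratio (RR) and odds ratio (OR) are used to determine the best SNP barcode and quantitatively measure the breast cancer risk. The boxplots were analysed with SigmaPlot 9.0 (Systat Software, Inc.).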
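The confusion-matrix scores can be computed as below. The formulas are the standard textbook definitions of these quantities; how TP, TN, FP, and FN are assigned from barcode presence and case status is not spelled out in the text, so the counts in the example are purely hypothetical.

```python
def prediction_scores(tp, tn, fp, fn):
    """Standard confusion-matrix scores from true/false positive/negative counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return {
        "correctness": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity+specificity": sensitivity + specificity,
        "PPV+NPV": ppv + npv,
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for name, value in prediction_scores(tp=40, tn=30, fp=20, fn=10).items():
        print(f"{name}: {value:.3f}")
```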

Data Set Preparation
The data set for the steroid hormones and their signalling and metabolic pathways (23 SNPs for 6 genes) was obtained from the breast cancer association study in [14]. This data set only provides the genotype frequencies, without the original raw genotype data for each SNP. In our study, we simulated the genotype data based on the original frequencies of the data set. Using the simulated genotype data, susceptibility to breast cancer in terms of complex SNP-SNP interactions can be considered. However, the simulated data do not reflect the true joint distribution of those SNPs in cases and controls, and the results therefore should not be read as real findings. In addition, the original data involve a different number of genotyped samples for each SNP, and hence we had to perform normalization to make each genotype size the same in order to allow further analysis. Our simulated data were randomly generated and obey the original genotype frequencies in the entire data set; the simulated data are available at http://bioinfo.kmu.edu.tw/brca-steroid-96SNP.xlsx.
The normalization procedure is provided in the "pseudo-code for randomly generated data" shown in Table 2. For example, we set the range size to a maximum of 5000, and then calculate the number of samples for each of the three genotypes in each SNP. SNP 4 (rs3020314) includes 4551 genotypes in the original data, of which 2132 are CC, 1970 are CT, and 449 are TT (pseudo-code step 04). For each SNP, the percentage of each genotype is calculated; in this instance, 2132/4551 (46.85%) for CC, 1970/4551 (43.29%) for CT, and 449/4551 (9.86%) for TT. Based on these percentages, the modified data for SNP 4 are obtained by multiplying the percentage by the size of the entire data set, i.e., 46.85% × 5000 = 2343 for CC, 43.28% × 5000 = 2164 for CT, and 9.86% × 5000 = 493 for TT (pseudo-code steps 05 to 11). The simulated data for SNP 4 have thus been normalized to 5000 (2343 + 2164 + 493 = 5000). All original data are normalized to the same number in this manner. Based on our simulated data, Table 3 shows the performance (OR and 95% CI) for each SNP from 6 steroid hormone metabolisms and signalling-related genes (COMT, CYP19A1, ESR1, PGR, SHBG, and STS). Some SNPs (such as SNPs 4, 10, 12-15, 17, and 19-23 listed in Table 3) with certain genotypes display a statistically significant OR (p < 0.05) for breast cancer; their OR values range from 1.268 to 0.846. The other SNPs show no statistically significant OR for breast cancer.
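The normalization and random generation steps might look as follows. How rounding remainders are handled is not stated in the text; this sketch floors each scaled count and assigns the leftover to the most frequent genotype, an assumption that happens to reproduce the SNP 4 numbers. The function names are hypothetical.

```python
import random


def normalise_counts(counts, target=5000):
    """Scale genotype counts so they sum to target while keeping their proportions.

    Assumption: leftover samples from flooring go to the most frequent genotype.
    """
    total = sum(counts.values())
    scaled = {g: c * target // total for g, c in counts.items()}
    leftover = target - sum(scaled.values())
    most_common = max(counts, key=counts.get)
    scaled[most_common] += leftover
    return scaled


def simulate_genotypes(scaled, rng=random):
    """Randomly ordered genotype column that obeys the normalized counts."""
    column = [g for g, n in scaled.items() for _ in range(n)]
    rng.shuffle(column)
    return column


if __name__ == "__main__":
    snp4 = {"CC": 2132, "CT": 1970, "TT": 449}  # original counts for SNP 4
    scaled = normalise_counts(snp4)
    print(scaled)  # {'CC': 2343, 'CT': 2164, 'TT': 493}
    print(len(simulate_genotypes(scaled)))  # 5000
```

Shuffling each SNP column independently preserves the marginal genotype frequencies only, which is consistent with the caveat that the simulated data do not reflect the true joint distribution.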

Identification of the SNP-SNP Interactions with Maximum Differences between Cases and Controls Using IPSO
Using the IPSO algorithm, the best SNP-SNP interaction is evaluated by the difference between cases and controls for all the SNP barcodes. After computation, the top five 2-SNP barcodes can be listed in order of the difference between cases and controls: SNPs (4-19)-genotype (1-1), SNPs (4-23)-genotype (1-2), SNPs (4-9)-genotype (1-2), SNPs (19-23)-genotype (1-2), and SNPs (9-23)-genotype (2-2). The differences in the number of cases and controls for these SNP barcodes are 174, 168, 158, 150, and 146, respectively (data not shown). In this study, as shown in Table 4, we only select the 2-SNP barcode with the maximum difference, i.e., the best of the 2-SNP barcodes. Similarly, the n-SNP barcodes (n = 3 to 10) with maximum differences, i.e., the best for each n-SNP barcode, are also selected (left side of Table 4).
With the conservation of the top five results, we found that the best n-SNP barcode contains the corresponding best (n-1)-SNP barcode. For example, the 3-SNP barcode contains the 2-SNP barcode, i.

Prediction Scores of the Best IPSO-generated SNP Barcodes in Breast Cancer
The best n-SNP barcodes (n = 3 to 10) identified by the IPSO algorithm are listed in Table 4 together with their five prediction scores, i.e., the correctness, sensitivity+specificity, PPV+NPV, RR, and OR, in order to evaluate the breast cancer susceptibility based on the IPSO-generated SNP barcodes. The sensitivity and specificity values of the respective best SNP barcodes are all higher than 0.96, suggesting that IPSO can identify the best SNP barcodes associated with breast cancer. The correctness and PPV+NPV values of the respective best SNP barcodes range from 0.48 to 0.50 and 0.64 to 0.96, respectively, and the RR and OR of the best SNP barcodes range from 0.88 to 0.17 and 0.84 to 0.17, respectively. The SNP barcodes involving two to seven SNPs show significantly decreasing OR values (p < 0.05 to 0.001). Since the SNP barcodes listed in Table 4 show that the control numbers are greater than the case numbers, the SNP barcodes are regarded as protective SNP barcodes against breast cancer.
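For readers who want to recompute the OR and RR of a barcode from a 2×2 table, a sketch follows. The study used SPSS 13.0; the Woolf (log-based) 95% confidence interval used here is an assumption about a common default rather than a statement of what SPSS computed, and the example counts are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Woolf (log-based) confidence interval.

    a: cases with the barcode      b: controls with the barcode
    c: cases without the barcode   d: controls without the barcode
    """
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of ln(OR)
    lower = math.exp(math.log(or_) - z * se)
    upper = math.exp(math.log(or_) + z * se)
    return or_, lower, upper


def risk_ratio(a, b, c, d):
    """Proportion of cases among barcode carriers divided by that among non-carriers."""
    return (a / (a + b)) / (c / (c + d))


if __name__ == "__main__":
    # Hypothetical 2x2 table, for illustration only.
    or_, lower, upper = odds_ratio_ci(a=10, b=20, c=90, d=80)
    print(f"OR = {or_:.3f} (95% CI {lower:.3f} to {upper:.3f})")
    print(f"RR = {risk_ratio(10, 20, 90, 80):.3f}")
```

An OR below 1 with a confidence interval that excludes 1 corresponds to the protective barcodes discussed above, where carriers are more frequent among controls than among cases.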

Comparison between the Best IPSO-generated and PSO-generated SNP Barcodes in Breast Cancer
We compare IPSO with PSO for the reliability and the ability to identify SNP barcodes to support the advantage of the top-five strategy.The performances of the PSO and IPSO algorithms from 20 simulation runs (see supplement Table S1 and Table S2 for details) are compared by means of the best maximum difference between cases and controls as shown in Figure 2. To examine the performance in terms of the statistical differences between both algorithms, we performed the Wilcoxon Signed-Rank test and found that there were significant differences between cases and controls in n-SNP barcodes (n = 2 to 10) (Table S3).
The maximum differences for each SNP barcode generated by IPSO are higher than those of PSO, suggesting that the selection of the best protective SNP barcodes is more reliable in IPSO than in PSO. As shown in Figure 3, the median values suggest that IPSO is more suitable for selecting the best SNP barcodes for breast cancer protection. Moreover, the interquartile ranges (25th to 75th percentiles) of the boxplot, as well as the 5th, 10th, 90th and 95th percentiles for each n-SNP barcode (n = 3 to 10), are narrower in the IPSO algorithm (Figure 3A) than in the PSO algorithm (Figure 3B). These data suggest that the results of the PSO algorithm are less stable. In contrast, the IPSO algorithm provides exact identification of the best SNP barcodes for breast cancer protection. In fact, the values in Figure 3A (IPSO) are identical across runs for each n-SNP barcode (n = 3 to 10), i.e., 128, 87, 55, 35, 21, 12, 8, and 5 (Table 4). The best PSO-generated n-SNP barcodes with maximum differences between cases and controls are listed in Table 5. In the PSO algorithm the top five results are not conserved. Accordingly, the PSO-generated SNP barcodes conserve the selected SNPs to a lesser degree (Table 5). For example, only one SNP in the 2-SNP barcode shows up in the 3-SNP barcode, i.e., SNP 4 (rs3020314; Table 3), and only one SNP in the 3-SNP barcode shows up in the 4-SNP barcode, i.e., SNP 23 (rs2017591). Therefore, an order of influence on breast cancer is very difficult to establish from the SNPs in Table 3.
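The run-to-run stability comparison amounts to summarising the best fitness of each run by its median and interquartile range. In the sketch below, the IPSO list uses the constant value 128 reported for the 3-SNP barcode, while the PSO list is hypothetical and only illustrates what an unstable optimiser looks like; the quartile convention is that of Python's `statistics.quantiles` and may differ slightly from SigmaPlot's.

```python
from statistics import median, quantiles


def spread_summary(values):
    """Median and interquartile range of the best fitness over repeated runs."""
    q1, _q2, q3 = quantiles(values, n=4)
    return {"median": median(values), "iqr": q3 - q1}


if __name__ == "__main__":
    ipso_runs = [128] * 20  # identical result in all 20 runs
    # Hypothetical PSO results over 20 runs, for illustration only.
    pso_runs = [128, 120, 111, 128, 97, 105, 128, 118, 90, 128,
                101, 122, 128, 99, 116, 108, 128, 113, 95, 124]
    print("IPSO", spread_summary(ipso_runs))
    print("PSO ", spread_summary(pso_runs))
```

An interquartile range of zero means every run converged to the same barcode, which is the sense in which the text calls the IPSO identification exact.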

Discussion
Many association studies of cancer have focused on the analysis of genetic risk factors that influence common complex traits in terms of commonly occurring SNPs. However, possible protective effects are also important for the prediction of cancer morbidity by SNPs. Here, we analyzed the contribution of 23 SNPs from six breast cancer related genes to generate the protective SNP barcodes in a case-control study of 5000 cases and 5000 controls with genotype data simulation.
The maximum difference information calculated by the IPSO algorithm can predict the relative strength of the impact of an SNP on breast cancer protection. For example, the difference between controls and cases for SNP barcode [SNPs (4-19)-genotype (1-1)] is higher than that of [SNPs (4-19-23)-genotype (1-1-2)], suggesting that SNP 4 and SNP 19 are more strongly associated with cancer protection than SNP 23. Accordingly, an order of impact on breast cancer for the SNPs listed in Table 3 can be arranged: SNPs 4/19 > SNP 23 > SNP 9 > SNP 3 > SNP 13 > SNP 20 > SNP 12 > SNP 14 > SNP 21. In this simulated breast cancer association study, the IPSO-generated SNP barcodes involving two to seven SNPs show significantly decreasing OR values ranging from 0.84 to 0.57 (Table 4), and the PSO-generated SNP barcodes involving two to four SNPs show significantly decreasing OR values ranging from 0.84 to 0.77 (Table 5). In contrast, some individual SNPs with certain genotypes display statistically significant OR values ranging from 1.268 to 0.846 (Table 3). Some SNPs may have a different impact on breast cancer protection when considered individually than when considered in combination. For example, some individual SNPs such as SNPs 3 and 9 are not significantly associated with breast cancer (Table 3), but the 4- to 10-SNP combinations that include SNPs 3 and 9 show a significant association with breast cancer (Table 4). These data suggest that associations with breast cancer may be overlooked when SNP interactions are not considered.
A key issue in detecting SNP-SNP interactions in genome-wide case-control studies is computational efficiency. Although the optimal parameters of PSO were demonstrated by Kennedy and Eberhart [33], we found that parameter adjustments may promote better results even for large numbers of SNPs. First, the population size and the number of iterations can be adjusted according to the data size; we suggest a population size of 50 to 100 and 100 to 1000 iterations. Larger values allow the search to reach SNP barcodes with larger differences between cases and controls, but the computational cost of IPSO also increases. Second, $c_1$ and $c_2$ are acceleration constants that control how far a particle moves in a single generation, and they respectively control the exploitation and exploration ability in each search. To balance exploitation and exploration, we suggest setting both $c_1$ and $c_2$ to 2.
Although we explored the benefit of the IPSO algorithm for SNP interactions based on the simulated breast cancer study, the IPSO algorithm is not limited to breast cancer data and can be applied to other real data sets. After running these algorithms on a real data set for another disease [18], e.g., osteoporosis, we found that the IPSO algorithm again showed better performance than the PSO algorithm for selecting SNP barcodes in SNP-SNP interaction studies (data not shown).
IPSO can overcome the limitations imposed on computational time for complex SNP interactions in GWAS because it has the following advantages: 1) IPSO allows robust analysis of high-order SNP combinations for GWAS and generates the best SNP barcodes; 2) IPSO is an improved evolutionary algorithm without exhaustive search; 3) IPSO only needs two parameters for computation without complex settings; and 4) its computational complexity is unaffected by the size of data sets.
In conclusion, we propose an improved PSO algorithm to perform a powerful breast cancer association analysis in terms of SNP-SNP interactions with 23 SNPs. Our strategy successfully improves on the performance of traditional PSO in terms of reliability, with a combination of more statistically significant SNPs associated with breast cancer protection. With the help of the IPSO algorithm, the best fitness of cases and controls can be identified. The algorithm can potentially be applied to identify complex SNP-SNP (gene-gene) interactions for different diseases, even in cases where a large number of SNPs is involved in genome-wide association studies.

Figure 3. Boxplots displaying the extremes, the upper and lower quartiles, and the median of the maximum difference between cases and controls for (A) the IPSO algorithm and (B) the PSO algorithm on three to ten combined SNPs over 20 runs. The boundary of the box closest to zero indicates the 25th percentile, a line within the box marks the median, and the boundary of the box farthest from zero indicates the 75th percentile. Error bars above and below the boxes indicate the 90th and 10th percentiles, respectively. The triangle symbols indicate the 95th and 5th percentiles. doi:10.1371/journal.pone.0037018.g003

The computational complexity of the IPSO algorithm is estimated by the objective function computation. If there are M iterations and N solutions (particles) in the population, then the objective function computation has O(MN) computational complexity. The effective feature of the top 5 strategy is that it only stores the top 5 solutions in each iteration. If there are K solutions in the archive, storing the solutions in the archive has O(M+K) computational complexity. If the archive and the iteration have the same numbers, the overall complexity of IPSO is O(MN+K).

j ∈ [1..d]
08: evaluate P_i by Eq. 5, i ∈ [1..n]
09: find best X_g in N-SNP combinations
10: the worst five P are replaced with X_g
11: repeat PSO:
12: for each swarm P_i, i ∈ [1..n]
13: f_i ← evaluate P_i by Eq. 3
14: if pbest_i < f_i then
15: pbest_i ← f_i; pbestX_i ← P_i

Table 2. Pseudo-code for randomly generated data.

Table 3. Estimated effect (odds ratio and 95% CI) from individual SNPs of 23 steroid hormone metabolisms and signalling-related genes on the occurrence of breast cancer in patients.
a Data collected from literature [14]. b Data highlighted in bold text are statistically significant results.

Table 4. The best estimated protective SNP combinations on the occurrence of breast cancer as determined by IPSO.

Table 5. The best estimated protective SNP combinations on the occurrence of breast cancer as determined by PSO. The SNP combinations on the occurrence of breast cancer are significantly different (p value < 0.05). The meanings of the SNP and genotype numbers are provided in Table 3. doi:10.1371/journal.pone.0037018.t005