A Hybrid Approach for CpG Island Detection in the Human Genome

Background CpG islands have been demonstrated to influence local chromatin structures and simplify the regulation of gene activity. However, the accurate and rapid determination of CpG islands for whole DNA sequences remains experimentally and computationally challenging. Methodology/Principal Findings A novel procedure is proposed to detect CpG islands by combining clustering technology with the sliding-window method (PSO-based). Clustering technology is used to detect the locations of all possible CpG islands and process the data, thus effectively obviating the need for the extensive and unnecessary processing of DNA fragments, and thus improving the efficiency of sliding-window based particle swarm optimization (PSO) search. This proposed approach, named ClusterPSO, provides versatile and highly-sensitive detection of CpG islands in the human genome. In addition, the detection efficiency of ClusterPSO is compared with eight CpG island detection methods in the human genome. Comparison of the detection efficiency for the CpG islands in human genome, including sensitivity, specificity, accuracy, performance coefficient (PC), and correlation coefficient (CC), ClusterPSO revealed superior detection ability among all of the test methods. Moreover, the combination of clustering technology and PSO method can successfully overcome their respective drawbacks while maintaining their advantages. Thus, clustering technology could be hybridized with the optimization algorithm method to optimize CpG island detection. Conclusion/Significance The prediction accuracy of ClusterPSO was quite high, indicating the combination of CpGcluster and PSO has several advantages over CpGcluster and PSO alone. In addition, ClusterPSO significantly reduced implementation time.


Introduction
This study combines clustering technology (CpGcluster) and PSO to develop a simple and accurate method (ClusterPSO) for detecting CpG islands. CpGcluster is first introduced in the pre-treatment strategies for detecting all CpG island candidates; these CpG islands are evaluated according to the distances between CpGs and their p-values (P 0.01). PSO is then used to predict CpG islands from all CpG island candidates obtained from the CpGcluster results. The GGF criteria are used as a basis to compare results with other methods based on five evaluation criteria including sensitivity, specificity, accuracy, performance coefficient (PC), and correlation coefficient (CC). Results showed that ClusterPSO provides high efficiency for CpG island detection with improved sensitivity, accuracy, and correlation coefficient, while reducing search time in the human genome.

CpGcluster
CpGcluster was first proposed by Hackenberg et al. [11]. The basic theory of CpGcluster is based on the physical distance between neighboring CpGs to directly detect CpG clusters without considering subjective CpG criteria. The CpGcluster procedure has two parts: (i) a distance-based algorithm searches for CpG clusters in the genome, and (ii) the p-value criteria selects the statistically significant CpG clusters. The detailed procedure is explained in the Clus-terPSO description below.

Particle swarm optimization
In PSO, each particle represents a practicable solution x i = {x i1 , x i2 , . . ., x id } in the d-dimensional problem space, and it has a memory function to accumulate its own experience defined as the pbest i = {pbest i1 , pbest i2 , . . ., pbest id } [14]. All particles share a common memory used to identify the current best experience gbest = {gbest 1 , gbest 2 , . . ., gbest d }. Based on these two experiences, a particle moves to find a better solution. Since each particle adopts a directional trajectory in the search process, it can effectively find a better solution by continually updating its location. The PSO procedure includes five steps: (i) initialization, (ii) fitness evaluation, (iii) updating the pbest i and gbest, (iv) updating the positions and velocities of particles, and (v) confirmation that the stop criterion is met. All steps are explained in detail in the ClusterPSO section.

ClusterPSO
The proposed CpGcluster is used to detect all possible CpG island locations (CpG island candidates) and optimize data processing, thus increasing PSO effectiveness by removing huge amounts of unnecessary DNA fragments from the calculations. Since the known CpG island minimum length detected by CpGcluster is 8 bps [11], all CpG island candidates are extended both forward and backward by 200 bps to conform with the GGF criteria, where 200 bps is the known minimum CpG island length in the GGF criteria. This region of CpG island candidates allows the PSO to predict optimal CpG islands using the GGF criteria. Fig 1 shows the ClusterPSO flowchart divided into two stages: 1. CpGcluster detects CpG island candidates in the DNA sequence. This stage has two sub-steps: (i) Clustering technology and statistical search methods are used to identify CpG island candidates in the chromosome sequence. The distance threshold is used to combine adjacent and sufficiently close CpG dinucleotides as clusters. (ii) The p-value is used to verify CpG island candidates. 2. PSO predicts CpG islands among the CpG island candidates based on the GGF relaxed CpG island criteria. Here we provide an example to illustrate how the ClusterPSO works, with details provided in the Additional file. Stage 1. Clustering detects CpG candidates in DNA sequence. CpG clustering within the CpG island region is determined by a distance threshold, with closely adjacent CpGs belonging to the same cluster, and longer distances differentiating different clusters. Thus, the intensive CpGs are set into a cluster to select CpG island candidates. All steps are described in detail as follows: Step 1. All CpG positions are scanned from 5' to 3' in a DNA sequence, and the CpG positions are collected into a set C = {c 1 , c 2 , . . ., c n }, where n is the total number of CpGs.
Step 2. The distances of all adjacent CpGs are calculated, in which a physical distance between two adjacent CpGs is computed by d i = X i +1 -X i − 1. The shortest distance of adjacent CpGs is 1, i.e., CGCG.
Step 3. A threshold value (d t ) is defined according to the distribution of distances of all adjacent CpGs in a DNA sequence, and used to determine whether the adjacent CpGs belong to a same cluster or not. The set C is sorted according to the distance between adjacent CpGs, and the threshold d t is defined at the 65 percentile in the set C.
Step 4. CpGcluster uses a threshold value to start extending downstream (! 3') from the first CpG of set C. When the adjacent distance between CpGs is smaller than the threshold value, the two adjacent CpGs are classified into a single cluster; otherwise, the position of the upstream G nucleotide of the adjacent CpGs is defined as the closing cluster position. Thus, step 4 continues to search for new clusters until it meets the last CpG.
p-value criteria selects the statistically significant CpG clusters. Since all CpG island clusters are determined by distance, the p-value of a cluster is then computed to calculate the probability of a CpG cluster appearing in a random sequence. The negative binomial distribution estimates the probability to reduce the requirements of the CpG clusters. The distribution fails if the number of successes is fixed in advance. Thus, the successes represent the CpGs and the failures represent non-CpGs. The above-mentioned probability is calculated by the cumulative density function at point n f of the CpG cluster, and is taken as the p-value: where N is the number of CpGs in the cluster, n f is the number of independent non-CpGs in the cluster, L is the number of nucleotides in the cluster, and p is the probability of success finding a CpG. N s and N is are the number of CpGs and the number of independent dinucleotides in the DNA sequence, respectively. This step searches statistically significant CpG clusters and assumes that all CpG islands are included in these CpG clusters. When the p-value of a cluster is smaller than the threshold value (set as 10 −2 in this study), the clusters are discarded; otherwise, the cluster is retained.
Stage 2. PSO predicts CpG islands from CpG candidates. Since the CpGcluster step collects CpG island candidates without the CpG island criteria, the CpG island candidates may be shorter than 200 bp. The lengths of these CpG island candidates must be extended for the application of PSO prediction. All CpG island candidates are extended forward and backward by 200 bps, and a separate PSO procedure is used for each specific CpG island candidate, which can be divided into the following five steps: Step 1. Initialization defines the particle as (population size = 300) a possible CpG island region, and the particle coordinates constitute a two-dimensional problem space, i.e., d = 2. Eq 4 is the particle encoding formula.
where P i is the i th particle. Fs i and Fl i are respectively the start position and length of a predicted CpG island in a CpG island candidate. At initialization, each particle P i is randomly generated the parameters Fs i and Fl i .
Step 2. Define the particle fitness by GGF criteria. The fitness function for evaluating the particles is based on the length, GC content, and O/E ratio of CpG island properties. A higher fitness value represents a better CpG island prediction result. In addition, the recorded particle positions must conform to the GGF definition for CpG island properties, such as the range of particle length ! 200 bps, GC content ! 50%, and O/E ratio ! 0.6. Then follow an overlap process to predict all CpG island locations. Eq 5 is the normalization function to calculate the CpG length. Eqs 6 and 7 respectively calculate the GC content and O/E ratio. Eq 8 is the fitness function.
where #A, #T, #C, and #G are respectively the numbers of nucleotide adenine (A), thymine (T), cytosine (C), and guanine (G) in the predicted CpG island region at P i . p min is the starting position of the cluster minus 200, and p max is the starting position of the cluster plus 200. #CpG represents the number of CpG in the predicted CpG island region at P i .
Step 3. Update the pbest i and gbest for all particles. Each particle finds its personal best position (pbest) and the global best position (gbest) when moving. If the fitness value of a particle P i in the current generation is higher than the fitness value of pbest, the fitness value and position of pbest is replaced by P i . If the fitness value of particle P i in the current generation is higher than gbest, the fitness value and position of gbest is replaced by P i .
Step 4. Update the positions and velocities of particles in each generation. The position and velocity of the P i is updated by the pbest i and gbest. Eqs 9 and 10 respectively calculate the velocity and position of the updating formulas at particle i .
where r 1 and r 2 are random variables between zero and one, and both c 1 and c 2 are constants. c 1 and r 1 account for the subjective experience of particle i in the search process, while The inertia weight w is used to improve the particle search path. Using Eq 11, the inertia weight can be linearly decreased from 0.9 to 0.4 according to the generation number. Poli et al. [15] demonstrated how the inertia weight improved the balance between local search and global search to facilitate particle search.
where w max and w min are respectively set at 0.9 and 0.4, and move max , move i are respectively the maximum and current generations. Thus, each particle can adjust its direction based on a better pbest and gbest in the next generation.
Step 5. Confirm whether the stop criterion (maximum generation = 100) is met or not. Steps 2 to 5 are repeated until the maximum generation is reached. When all CpG island candidates are predicted by PSO, all the identified CpG islands represent the CpG island detection results.
Performance measurement. The measurement of association between binary variables, 2 × 2 contingency, is used to evaluate the accuracy of CpG island detection [16]. As shown in the contingency table, true positivity (TP) represents that the number of bases detected in the CpG island region are true CpG islands. True negative (TN) represents that the number of bases detected in the non-CpG island region is correct. False negative (FN) represents that the number of bases detected in the true CpG island region is incorrect, and false positive (FP) represents that some of the bases detected in the CpG island region are non-CpG islands. Five common criteria are used to evaluate the results, including accuracy, sensitivity, specificity, the performance coefficient (PC), and the correlation coefficient (CC). Eq 12 is used to evaluate prediction accuracy by comparing the detection results with the true CpG islands. Sensitivity is calculated by Eq 13 to compute the rate of correct CpG island detection. Eq 14 is used to evaluate the specificity for computing the proportion of correct non-CpG island detection. Eq 15 summarizes sensitivity and specificity as a measure of global accuracy. Eq 16 shows the correlation coefficient, which is a single scalar value for correct non-CpG island detection.
Parameter settings. In the CpGcluster step, the p-value and distance threshold (percentile) parameters are respectively set as 0.01 and 65 th . In the PSO step, the parameters of population size, maximum generation, and both c 1 and c 2 are respectively set as 300 [17], 100, and 2 [15]. GGF [1] is used to define CpG islands to allow for comparisons with other GGF-based methods.

Data set
In previous study [10], six regular contig sequences, including NT_113953.1, NT_113954.1, NT_113955.2, NT_113958.2 and NT_113952.1 in chromosome 21, and NT_028395.3 in chromosome 22, are selected to evaluate the performance of ClusterPSO and other methods. All contig sequences and CpG islands were verified based on sequence analysis and bisulphite sequencing (BS-seq) and were obtained from NCBI (http://www.ncbi.nlm.nih.gov), along with the entire human genome (NCBI.36). Eight CpG island detection methods, including CpGPlot, CpGcluster, CpGProD, CpGIS, and PSO-based methods, were selected for comparison with the proposed ClusterPSO method. All data sets and ClusterPSO source code were published on github (url: https://github.com/jackmel030/ClusterPSO). Table 1 shows the results of six contig sequences detected by the nine test methods. Both Clus-terPSO and CPSORL show a higher sensitivity, PC, and CC values in the six contig sequences. CpGPlot shows excellent SP results in all the contig sequences. CPSORL performs better than the methods of CpGPlot, CpGcluster, CpGProD, CpGIS, PSO, PSORL, and CPSO, but Clus-terPSO outperforms CPSORL for sensitivity, accuracy, PC, and CC values in the six contig sequences. A non-parameter statistical hypothesis test, Wilcoxon Signed-Rank test, is used on pairs of result groups to gauge their validity. The p-value of the Wilcoxon Signed-Rank test can determine whether the difference between the two methods is significant or not. A pvalue < 0.001 indicates the ClusterPSO is statistically significant and exhibits improved CpG island detection. All p-values for ClusterPSO vs. the other eight methods on the six contig sequence results result in p < 0.0001. Although some methods use different CpG island definitions, the 2 × 2 contingency seems to provide a higher accuracy for CpG island detection based on the known positions of CpG islands in the NCBI. The results of the nine methods are taken from the relevant literature. Fig 2 provides a visual illustration of the performance of the six PSO-based methods in contig NT_113952.1. The true CpG island indicates the known CpG island in the NCBI. The true positive indicates the CpG length detected by other PSO-based methods that equal the true CpG island, the false positive indicates the CpG length detected by other PSO-based methods that does not equal the to the true CpG island length. In addition, the false negative indicates the CpG length is not detected by other PSO-based methods but is covered by the true CpG island length, and the true negative indicates the CpG length is not recognized by other PSObased methods or the true CpG island. According to the experimental results (Fig 2), clus-terPSO showed a higher degree of sensitivity and specificity compared to other methods.    Table 2 shows the CpG island detection results for the whole chromosome sequences using CpGPlot, CpGcluster, CpGProD, CpGIS, PSO, PSORL, CPSO, CPSORL, and ClusterPSO. In chromosome 21, the total length of CpG islands is 1,719,555 bps and the known CpG island coverage is 3.66% in NCBI. ClusterPSO finds the CpG island length is 1,728,357 bps, and the coverage of these CpG islands is 3.68%. In chromosome 22, the total length of CpG islands is 3,114,716 bps and the known CpG island coverage is 6.27% in NCBI. ClusterPSO finds a CpG island length of 3,090,231 bps and coverage of 6.22%. The overall CpG island length and coverage values with known values in NCBI indicate that ClusterPSO outperforms the other methods in terms of detection accuracy. Table 2 provides detailed detection results, including the average length, the minimum length, the maximum length, the GC content, and the O/E ratio for the detected CpG islands. All methods can be applied to detect the GGF-defined CpG islands, but the detection results obtained by CpGcluster for chromosomes 21 and 22 showed a minimum length of 8 bps.

CpG island detection of human chromosomes 21 and 22
CpG island detection of entire human genome Table 3 summarizes the CpG island detection results using CpGcluster, CpGIS, CPSORL, and ClusterPSO in the entire human genome, and shows the statistical values for their detection abilities. ClusterPSO detects 254,783 CpG islands, which is 1.28 times that detected by CpGcluster (198,702), 6.75 times CpGIS (37,729), and 1.22 times CpGIS (208,536). Using Clus-terPSO, the average island length is longer than that of CpGcluster (494 bps vs. 273 bps). The length distribution of the results of the CpG islands is shown in Figure A in S1 File. The minimum lengths of the various methods are as follows: CpGcluster (8 bps), CpGIS (500 bps), CPSORL (100 bps), and ClusterPSO (200 bps). ClusterPSO determined the greatest CpG island length in a range of 200~749 bps. Figure B in S1 File shows the distributions of the results of GC content and O/E ratios for the CpG islands in 24 chromosomes using ClusterPSO. Most CpG islands have a GC content between 0.5 and 0.7 and an O/E ratio between 0.6 and 1.0, indicating the CpG islands detected using ClusterPSO conform to the GGF criteria. We examined the promoter and transcription start site (TSS) overlap in the CpG island region. A promoter region is defined as −1,500 to +500 bps around the TSS. The TSS number of CPSORL is higher (below 9.33%) than that of CpGcluster and ClusterPSO; the promoter region numbers are also higher (below 11.73%). The TSS numbers and promoter region numbers of CpGcluster are similar to those of ClusterPSO. Table B in S1 File compares the performance of CPSORL and ClusterPSO in 24 chromosomes of the human genome. ClusterPSO is found to outperform CPSORL in terms of the average values for sensitivity, SP, accuracy, PC, and CC, which clearly indicates that ClusterPSO provides significantly improved accuracy in CpG island detection. For the Wilcoxon Signed-Rank test, the p-values of sensitivity, specificity, accuracy, PC, and CC are 1.82E-05, 0.807, 1.81E-05, 1.82E-05, and 1.82E-05 in 24 chromosomes, respectively. The statistical analysis indicates that specificity values are similarly high for both CPSORL and ClusterPSO (98.97 vs.   Fig 3B and 3F show that CpGcluster can handle long sequences quickly, where time spent is indicated by the arrow in the figures. Although sequence scanning may proceed more slowly at first than in other methods, it provides a strong time advantage for complete genome searches.

ClusterPSO method
We propose a procedure (ClusterPSO) for CpG island detection to overcome the drawbacks of CpGcluster and reduce the search difficulty of PSO. Although gene analysis targets a whole In the Hackenberg et al. study, three thresholds (25 th , 50 th and 75 th ) are used to test CpGcluster for CpG island detections. Hackenberg et al. suggested using the region between 50 th and 75 th to detect CpG islands. Raising the distance threshold to the 75 th percentile obtains longer islands, and can thus increase the sensitivity by more than 20% while only minimally improving overall accuracy [11]. Lowering the p-value threshold beyond 10 −5 slightly increased SP but also clearly decreased SN, thus lowering overall global accuracy. In this study, ClusterPSO selected the 65 th to balance the SN and SP. The PSO step can filter CpG islands again, indicating the number of non-CpG island regions can be reduced. When the p-value is small, the computational time of Clus-terPSO can be increased due to the large number of detected CpG island candidates. Therefore, we suggest the ClusterPSO use the 65 th to detect CpG islands, and the region of p-value thresholds is suggested between the 50 th and 75 th to detect CpG islands.
Comparison of the true positive and false positive rates for CpGcluster, PSO-based methods, and ClusterPSO Our previous study proposed PSO-based methods to detect the CpG islands with high true positive rates (TPR) and low false positive rates (FPR) [10]. These PSO-based methods reveal slightly higher FPR than CpGcluster, but also show substantially higher TPR than CpGcluster and other methods. Overall, ClusterPSO is obviously superior to all other PSO-based methods. Figure C in S1 File shows the XY charts for comparing the TPR and FPR of CpGcluster, PSObased methods, and ClusterPSO. The upper-left location of the chart indicates the better CpG island detection results, and the figure indicates that ClusterPSO outperforms the other five methods in terms of TPR, but has a slight disadvantage in terms of FPR. In addition, this figure indicates that the disadvantages of both CpGcluster and PSO can be greatly improved by combining the two methods, making our approach highly effective for CpG island detection.

Search stability comparison for all of the PSO-based methods
Search stability is an important performance criterion for PSO-based methods. Different random seeds have an impact on CpG island detection. Figure D in S1 File shows a box plot which compares the relative stabilities of sensitivity, specificity, accuracy, PC, and CC for PSO-based, and ClusterPSO through 1,000 test iterations. Given an identical random seed, all PSO-based methods have the same random value in each generation, and some seed values may result in search failure, thus reducing the accuracy in CpG island detection. However, the ClusterPSO still detects CpG islands with a high degree of accuracy, even with unfavourable seed values (the error bars near the left boxes indicate 10 th percentiles). In the box plot, a small box indicates better stability in each performance measurement. In terms of stability, Clus-terPSO outperforms the other PSO-based methods.
ClusterPSO performance for the entire human genome In entire human genome analysis, ClusterPSO performs better than CPSORL in terms of sensitivity, specificity, accuracy, PC, and CC values ( Table B in S1 File), and CpG island prediction accuracy and coverage rate. Table C in S1 File shows that ClusterPSO outperforms PSO-based methods and CpGcluster in terms of detecting overlapping CpG islands, indicating that Clus-terPSO can correctly detect CpG islands.
The short lengths of CpG islands may result in high O/E ratio and high GC content levels, but also results in low sensitivity [13]. However, ClusterPSO is based on CpGcluster and uses the CpG criteria to detect CpG islands, thus the O/E ratio and GC content of ClusterPSO are smaller than those of CpGcluster, but with substantially improved sensitivity. Figure  Hackenberg et al. compared CpGcluster against the sliding window method. They found that CpGcluster co-localized more specifically to TSSs, and many of the small CpG islands detected by CpGcluster may be functional, given the overlap with conserved elements or promoter regions [12]. The results for TSSs and the promoter regions (Table 3) show that Clus-terPSO inherits these advantages from CpGcluster because ClusterPSO is based on CpGcluster detection.
The Hackenberg et al. study found that small CpG clusters (length < 200bps) may have important biological implications [11]. In our study, ClusterPSO used the GGF to define the CpG islands (length > 200 bps, GC content > 50%, and O/E > 0.6), in which the minimum length defined in the particle (see Stage 2) may limit the predictive potential of ClusterPSO. This minimum length limitation can be eliminated for the detection of additional CpG islands with important biological meanings but the prediction results may reduce the prediction sensitivity due to the great number of small CpG islands.
This study combines CpGcluster and PSO methods to design a simple and accurate Clus-terPSO method to detect CpG islands in human DNA sequences. This combination has several advantages over CpGcluster and PSO alone: (1) the implementation time and search stability of PSO can be significantly improved by pre-treatment with CpGcluster; (2) the short CpG island length of CpGcluster may not meet CpG island criteria, but it can be improved by an accurate PSO prediction; (3) ClusterPSO only requires six parameters which is easy to implement; and (4) future improvements to CpGcluster and PSO will greatly enhance the accuracy of ClusterPSO detection.  Figure A). Distribution of the results of CpG islands in the human genome ( Figure B). XY charts comparing the true positive and false positive rates amongst the six methods for six contig sequences ( Figure C). Box plot comparing the stability of five methods in six contig sequences ( Figure D). Box plot of the O/E ratio for each interval length in the human genome ( Figure E). Number of CpG islands located in gene regions identified with CPSORL and ClusterPSO (Table A). Performance measurement of ClusterPSO and CPSORL for all chromosomes in the human genome (Table B). Number of detection CpG islands overlapping on true CpG islands for CpGcluster, CPSORL and ClusterPSO for all chromosomes in the human genome (Table C).

Supporting Information
(DOC)