The authors have declared that no competing interests exist.
Conceived and designed the experiments: MZ YZ. Analyzed the data: YZ. Wrote the paper: MZ. Designed the novel algorithm: MZ. Designed the experiments: YS GXL CGZ.
The gravitation field algorithm (GFA) is a new optimization algorithm based on an imitation of natural phenomena. GFA performs well in searching for both the global minimum and multiple minima in computational biology. However, GFA needs to be improved to increase its efficiency, and modified so that it can be applied to some discrete-data problems in systems biology.
An improved GFA called IGFA is proposed in this paper. Two parts were improved in IGFA. The first is the rule of random division, which is a reasonable strategy and shortens the running time. The other is the rotation factor, which can improve the accuracy of IGFA. To apply IGFA to hierarchical clustering, the initialization part and the movement operator were modified.
Two kinds of experiments were used to test IGFA, and IGFA was also applied to hierarchical clustering. The global minimum experiment was run with IGFA, GFA, GA (genetic algorithm) and SA (simulated annealing). The multi-minima experiment was run with IGFA and GFA. The results of the two experiments were compared and demonstrated the efficiency of IGFA: IGFA is better than GFA in both accuracy and running time. For hierarchical clustering, IGFA was used to optimize the smallest distance between gene pairs, and the results were compared with GA, SA, single-linkage clustering and UPGMA. The efficiency of IGFA was demonstrated.
Large-scale gene sequencing technologies
Two challenging tasks for optimization algorithms are to search for the global optimum and to find all local optima in the solution space of hierarchical gene clustering from available experimental data, especially from large-scale gene expression data. Selecting or creating a proper optimization algorithm is important work for many systems biologists. Many optimization algorithms have been proposed and applied in many fields of biology. Among these studies, research on heuristic search algorithms is the fastest-growing field. These algorithms include genetic algorithm (GA)
Among these algorithms, GFA, a novel heuristic search algorithm proposed in 2010, has been shown to be efficient for many functions and problems. GFA has several advantages. First, GFA can deal not only with global optimum problems but also with multi-extremum problems, which traditional heuristic search algorithms cannot handle. Second, GFA converges in the global solution space with probability 1 under three conditions for objective functions of a one-dimensional independent variable. This convergence has been proved mathematically
However, GFA is not yet mature as a novel algorithm, especially in two respects. The first is the theory of GFA: some theoretical problems remain unresolved, especially the strategy for dividing the solution space. The other is the accuracy of GFA: some algorithm steps should be improved, and the rotation factor is proposed in this paper to increase the efficiency of GFA. In our previous research, no effective method was available for division; only one- or two-dimensional variables could be divided properly in the solution space. When the dimensionality was greater than two, only one selected dimension was used as the criterion for division, and the other dimensions were not considered. A better division strategy is therefore needed. In addition, the movement operator of GFA is not good enough for most problems, although its convergence has been proved. When the number of dusts is not large enough, GFA will not converge with probability 1. A new mechanism should be used in this step of GFA.
In this paper, an improved GFA, called IGFA, is proposed with two improvements: random division and the rotation factor. The random division strategy is used for multidimensional problems. It allocates every dust randomly to one of the groups for a given solution space. No group is dismissed until the optimum in IGFA has remained unchanged for a long time. When all dusts in one group have assembled together, the assembled aggregate is treated as a new dust in the next random division. The other improvement is the rotation factor, which is used to improve the efficiency of IGFA and to avoid local convergence. When the small surrounding dusts approach the bigger centre dust, they can be pushed away from the centre dust in some direction with a certain probability. The whole modified procedure of IGFA is described in this paper.
In some areas of computational biology, such as the reconstruction of gene regulatory networks and the simulation of gene expression
Two kinds of experiments on a suite of benchmark functions were used to test the efficiency of IGFA: global minimum searching and multi-minima searching. We also compared the performance of IGFA with that of GFA, GA and SA over 500 minimization runs. The results showed that IGFA performs better than GFA in many cases, in both accuracy and running time.
For the application to hierarchical clustering, yeast (Saccharomyces cerevisiae) gene expression data were used, and the results were compared with those of the GA and SA methods. The results demonstrate the efficiency of IGFA.
GFA is derived from the hypothesis of the Solar Nebular Disk Model (SNDM)
IGFA starts with dust initialization, which simulates the dark nebula in SNDM. For continuous data, the task is to generate N dusts
Anydimensional value of dust
It will be meaningless for the value of
The random division strategy is the core part of IGFA and the most important improvement in this paper. The task of division is to divide the solution space into G groups. In each group, the dust with the biggest mass value is called the centre dust; the others are called surrounding dusts. A proper division operator improves the efficiency and reduces the running time.
The division strategy in two-dimensional solution space is a smart method which is called the greatest common divisor method described in
In this method, all areas in the solution space are the same size. The number of groups is decomposed as
Every subfigure in Fig. 2 is divided into 7 groups. Subfigure (a) shows the random division method, in which the length of each group is random. Subfigure (b) shows the average division method, in which every group has the same length.
Although this strategy is feasible for a solution space of any dimensionality, it is not reasonable to use just one dimension as the division criterion, especially in a high-dimensional solution space. A generic strategy is therefore needed in IGFA. Random Division Decomposition (RDD), proposed in this paper, is described as follows.
G is defined as the number of groups in RDD. In one epoch, every dust is allocated randomly to some group i. After the membership of all dusts in all groups is determined, the movement operator and the corresponding absorption operator start in this iteration of RDD. The RDD operator can be used for a solution space of any dimensionality, including one-dimensional and higher-dimensional solution spaces. An example of one-dimensional RDD is shown as
There are 3 groups in this one-dimensional RDD example: G1, G2 and G3. Every dust is assigned randomly to one of these 3 groups. In this example, the 1st, 3rd and 4th dusts belong to G1, the 2nd dust belongs to G2, and the 5th and 6th dusts belong to G3.
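The random allocation step of RDD can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `random_division` and the use of dust indices instead of dust objects are assumptions made for the example.

```python
import random


def random_division(num_dusts, num_groups, rng):
    """Assign every dust index to one of num_groups groups uniformly at random.

    Assumption: a group may end up empty; how the original algorithm treats
    empty groups is not stated in the text.
    """
    groups = [[] for _ in range(num_groups)]
    for dust in range(num_dusts):
        groups[rng.randrange(num_groups)].append(dust)
    return groups


# Six dusts and three groups, as in the one-dimensional example.
rng = random.Random(0)
groups = random_division(6, 3, rng)
for index, members in enumerate(groups, start=1):
    print(f"G{index}: {members}")
```

Because the allocation ignores coordinates entirely, the same function works for a solution space of any dimensionality.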
Before all movement and absorption epochs finish in this iteration, two RDD strategies were designed to change the membership of the dusts. The first is called the Regular Update Strategy (RUS). In one epoch of RDD, if the optimal value in the whole IGFA procedure changes within a certain number of movement and absorption epochs, the membership of the dusts does not need to be updated. But if the optimal value does not change at all within that number of epochs, all groups are dismissed and every undeleted dust in the solution space is divided randomly again. The rationale of RUS is that, in that case, the optimal value lies neither among the dusts nor along the paths of the moving surrounding dusts. In practice, however, the optimal value may already be in some group. So another strategy is used in RDD, called the Never Update Strategy (NUS). In one epoch, no matter whether the optimal value in IGFA changes or not, no group is dismissed or updated until this iteration finishes.
When the movement and absorption operators finish in one or more groups, only one dust remains in each of those groups. All groups in IGFA are then dismissed and this RDD operation ends. The number of groups G is then updated, the membership of all dusts is determined again, and RDD goes into the next epoch.
The choice between RUS and NUS is very important: with a wrong choice, the algorithm may not converge or the running time may be too long. A better way is to mix RUS and NUS. RUS is used in the earlier period of IGFA, and NUS is used in the later period. It is hard to decide the boundary between these two periods, so the number of iterations is used as the criterion. An iteration count of 10 was used as the boundary between RUS and NUS in the experiments of IGFA.
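The mixed RUS/NUS rule amounts to a small decision function. The sketch below is illustrative only: the names `should_redivide`, `stagnant_epochs` and `patience` are hypothetical, and whether iteration 10 itself falls under RUS or NUS is an assumption, since the text gives only the boundary value.

```python
def should_redivide(iteration, stagnant_epochs, patience, switch_iteration=10):
    """Return True if all groups should be dismissed and the dusts re-divided.

    iteration        -- index of the current division iteration
    stagnant_epochs  -- consecutive epochs without a change in the best value
    patience         -- hypothetical number of stagnant epochs RUS tolerates
    switch_iteration -- boundary between the RUS and NUS periods (10 in the text)
    """
    if iteration >= switch_iteration:
        # Later period (NUS): never re-divide before the iteration finishes.
        return False
    # Earlier period (RUS): re-divide once the best value has stalled.
    return stagnant_epochs >= patience


print(should_redivide(iteration=3, stagnant_epochs=5, patience=5))
print(should_redivide(iteration=12, stagnant_epochs=50, patience=5))
```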
The movement operator is another kernel part of IGFA. Its task is to search for the optimal value in each group together with the absorption operator. The convergence of IGFA is therefore related to the movement operator, especially when only one group remains in the late period of IGFA. The main idea of this operator is described in the previous research
There are 3 groups in this two-dimensional RDD example: G1, G2 and G3. Every dust is assigned randomly to one of these 3 groups. In this example, circle dusts belong to G1, rectangle dusts belong to G2, and triangle dusts belong to G3.
Movement is an iterative procedure. In each epoch, the centre dust, whose mass function value is the maximum or minimum in its own group, is selected first. The other dusts, called surrounding dusts, lie in the gravitation field of the centre dust and move towards it. The rule of movement of surrounding dusts for any-dimensional solution space is mentioned in
In this paper, the distance between two dusts is defined as the difference between two vector variables, which reduces the running time of IGFA. For example, the distance between the centre dust (1, 1, 1) and one surrounding dust (3, −1, 7) is (−2, 2, −6), which is also a vector. This method is called the Direct Minus Method (DMM). When the proportion of distance is used for the pace for the movement rules in
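The DMM example can be reproduced directly. The sketch below computes the vector difference for the centre dust (1, 1, 1) and the surrounding dust (3, −1, 7), then moves the surrounding dust by a proportion of that difference. The function names and the `pace` value of 0.5 are assumptions chosen for illustration; the paper's own movement rule is not reproduced here.

```python
def dmm_distance(centre, dust):
    """Direct Minus Method: the 'distance' is the component-wise difference."""
    return tuple(c - d for c, d in zip(centre, dust))


def move_towards(centre, dust, pace):
    """Move dust towards centre by the proportion `pace` of the DMM distance."""
    diff = dmm_distance(centre, dust)
    return tuple(d + pace * delta for d, delta in zip(dust, diff))


centre = (1, 1, 1)
dust = (3, -1, 7)
diff = dmm_distance(centre, dust)
moved = move_towards(centre, dust, 0.5)
print(diff)   # (-2, 2, -6)
print(moved)  # (2.0, 0.0, 4.0)
```

No square root is needed at any point, which is where the saving in running time comes from.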
Although the convergence of GFA is proved in three conditions
The lower circle is a surrounding dust before the movement operator is run. The upper circle is the same surrounding dust after the movement operator has run. The movement is clearly improper because the pace is too big: the optimal value is missed and GFA would not converge.
Just as the rotation of planets throws out some dusts, the rotation factor (RF) proposed in this paper can be used to prevent local convergence. The rotation operator is applied when a movement epoch is completed. Its task is to push surrounding dusts back, away from the centre dust. The backward direction is not along the original forward direction but is chosen randomly from all possible directions. To avoid too large a pace, the maximum backward pace is defined as Eq. (1):
In Eq. (1), the max backward pace
RF is the probability with which the rotation operator runs. RF is inversely proportional to the current distance between the surrounding dust and the centre dust, so its value changes throughout the IGFA process. A single RF for all dusts is clearly not appropriate in IGFA; it is better to set a separate RF for each dust. The RF value is defined as Eq. (2):
In Eq. (2), factor (i+1) is the RF after
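The rotation operator can be illustrated with a rough sketch. Eq. (1) and Eq. (2) are not reproduced here, so everything numeric below is an assumption: the probability is taken as `scale / distance` capped at 1 (one simple inverse-proportional form), `max_pace` is a free parameter standing in for the maximum backward pace, and the random direction is drawn from a normalized Gaussian vector.

```python
import math
import random


def rotation_probability(distance, scale=1.0):
    """Hypothetical per-dust RF: inversely proportional to distance, capped at 1."""
    if distance == 0:
        return 1.0
    return min(1.0, scale / distance)


def rotate(dust, centre, max_pace, rng, scale=1.0):
    """With probability RF, push dust a random step of at most max_pace."""
    distance = math.dist(dust, centre)
    if rng.random() >= rotation_probability(distance, scale):
        return tuple(dust)  # rotation operator does not fire for this dust
    # Random direction: normalize a Gaussian vector (uniform on the sphere).
    direction = [rng.gauss(0.0, 1.0) for _ in dust]
    norm = math.sqrt(sum(x * x for x in direction)) or 1.0
    pace = rng.uniform(0.0, max_pace)
    return tuple(d + pace * x / norm for d, x in zip(dust, direction))


rng = random.Random(1)
near = rotate((0.0, 0.0), (0.1, 0.0), max_pace=0.5, rng=rng)
print(near)
```

A dust close to the centre is rotated almost always, while a distant dust is rarely disturbed, which matches the stated purpose of preventing premature local convergence.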
The basic rules of the movement operator are described in detail above, but a serious problem can arise in the movement operator. When a surrounding dust moves towards the centre dust, its location changes to a new value. In most cases this new value is still inside the solution space, but in some cases it falls outside. Boundary verification is therefore used to ensure that the new value is legal. If the new value is in the solution space, the algorithm goes on; otherwise a new random dust replaces the illegal dust.
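Boundary verification can be sketched as follows, assuming a box-shaped solution space with the same bounds `low` and `high` in every dimension (as in the [−2, 2] domain used in the experiments). The function names are hypothetical.

```python
import random


def in_bounds(point, low, high):
    """True if every coordinate lies inside [low, high]."""
    return all(low <= x <= high for x in point)


def verify_or_replace(point, low, high, rng):
    """Keep a legal dust; replace an illegal one with a new random dust."""
    if in_bounds(point, low, high):
        return tuple(point)
    return tuple(rng.uniform(low, high) for _ in point)


rng = random.Random(0)
print(verify_or_replace((0.5, -1.0), -2.0, 2.0, rng))  # unchanged
print(verify_or_replace((3.0, 0.0), -2.0, 2.0, rng))   # replaced by a random dust
```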
The absorption strategy is simple but efficient. A surrounding dust is deleted when its distance to the centre dust is small enough. When the number of surrounding dusts in a group falls below the threshold in IGFA, all dusts in the group except the centre dust are deleted. Then all groups are dismissed and a new iteration of division begins, until the algorithm ends.
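A minimal sketch of the absorption rule is given below. It assumes Euclidean distance for the closeness test and uses two hypothetical parameters: `threshold` for the distance below which a dust is absorbed, and `min_surrounding` for the count below which the remaining surrounding dusts are all deleted.

```python
import math


def absorb(centre, surrounding, threshold, min_surrounding):
    """Return the surrounding dusts that survive absorption in one group."""
    # Dusts closer to the centre than `threshold` are absorbed (deleted).
    survivors = [d for d in surrounding if math.dist(d, centre) >= threshold]
    # Too few survivors: delete them all, leaving only the centre dust.
    if len(survivors) < min_surrounding:
        return []
    return survivors


centre = (0.0, 0.0)
surrounding = [(0.01, 0.0), (1.0, 0.0), (0.0, 2.0)]
print(absorb(centre, surrounding, threshold=0.1, min_surrounding=1))
print(absorb(centre, surrounding, threshold=0.1, min_surrounding=3))
```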
The complete pseudocode of IGFA is presented in Algorithm 1. In Algorithm 1, N is the number of dusts and G is the number of groups. Both values decrease continuously in IGFA, which keeps the running time small. ‘GetCentre’ is a method that gets the centre dust for the current dust. ‘GetMax’ is a method that gets the dust with the biggest mass value. ‘threshold’ is the smallest allowed distance between the centre dust and a surrounding dust: if the distance is smaller than this value, the corresponding dust is deleted. centre[] holds the final results, which can be one value or many optima.
1:
2:
3: dusts[
4:
5:
6: centre[
7:
8:
9: dusts[
10:
11: dusts[
12:
13:
14:
15:
16: delete dusts[
17:
18:
19:
20:
21: dusts[
22:
23:
24:
25: goto
26:
27: goto [29]
28:
29:
30: return centre[]
31:
32: update N,G
33: goto
34:
In this part, DMM is verified, and a new theorem is proposed and proved.
When the movement operator is applied, if the pace is a proportion of the total distance between a dust and the centre dust, the distance of DMM is equivalent to the Euclidean distance.
To prove Theorem 1 for an M-dimensional solution space, two parts must be proved. The first is that Theorem 1 is correct when M = 1. The other is that if Theorem 1 is correct when the solution space is M-dimensional, then it is also correct when the solution space is (M+1)-dimensional. When these two parts are proved, Theorem 1 holds for any dimensionality, such as M = 1, 2,
Theorem 1 is obviously correct when M = 1, since the distance of DMM is the Euclidean distance itself.
AC is the distance between points A and C in the (M+1)-dimensional solution space. AE is the projection of AC in the M-dimensional solution space. CE is the projection of AC in
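The statement of Theorem 1 can also be checked directly, without induction. The short derivation below is a sketch under assumed notation that does not appear in the original text: $\mathbf{c}$ is the centre dust, $\mathbf{x}$ a surrounding dust, $\mathbf{x}'$ its position after one move, and $p \in (0, 1]$ the pace expressed as a proportion of the distance.

```latex
\begin{align}
\mathbf{d} &= \mathbf{c} - \mathbf{x}
  && \text{(DMM distance, a vector)} \\
\mathbf{x}' &= \mathbf{x} + p\,\mathbf{d}
  && \text{(movement by proportion } p\text{)} \\
\lVert \mathbf{x}' - \mathbf{x} \rVert_2
  &= p\,\lVert \mathbf{d} \rVert_2
   = p\,\lVert \mathbf{c} - \mathbf{x} \rVert_2 \\
\lVert \mathbf{c} - \mathbf{x}' \rVert_2
  &= (1 - p)\,\lVert \mathbf{c} - \mathbf{x} \rVert_2
\end{align}
```

The dust therefore moves along the straight line towards the centre and covers exactly the proportion $p$ of the Euclidean distance, which is the behaviour Theorem 1 describes, and no square root has to be evaluated during the move.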
In the theoretical framework, IGFA requires a mass function defined on a continuous solution space. But some applications in systems biology, especially hierarchical clustering, use discrete data. IGFA must therefore be modified again for discrete data.
First of all, the initialization part must be modified. Because all solutions are discrete, a new method is needed for initializing the dusts. The distance function of each gene pair is used as the mass function in IGFA. Serial numbers are used to identify the dusts, and two numbers are used as parameters of the distance function. In the hierarchical clustering, two integers are selected randomly from 1 to
The other modified part is the movement operator in IGFA. Because both i and j are integers in [1, N], a rule of progressive decrease or increase is used. The modified movement operator is described in detail as follows:
(1) step = 1.
(2) If the serial number of centre dust
(3) If the serial number of centre dust
(4) If the i j pair of the surrounding dust is in the state ‘used’, then step = step + 1 and go to (2) or (3) again. If the i j pair is ‘not used’, then step = 1 and go to (5).
(5) After (4), the i j pair must be marked as ‘used’.
The state ‘used’ in (4) and (5) ensures that unexpected results do not appear in the discrete IGFA described above; it also reduces the running time of IGFA in the hierarchical clustering application.
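The role of the ‘used’ state can be illustrated with a small sketch. Because the conditions in steps (2) and (3) are incomplete in this text, the direction rule below is an assumption: each index of the surrounding pair steps towards the corresponding index of the centre pair and stops once it reaches it. The names `step_towards` and `discrete_move` are hypothetical.

```python
def step_towards(value, target, step):
    """Move an integer index towards target by at most `step` (assumed rule)."""
    if value < target:
        return min(value + step, target)
    if value > target:
        return max(value - step, target)
    return value


def discrete_move(pair, centre, used, n):
    """Return the first unused (i, j) pair reached by growing the step, else None."""
    step = 1
    while step <= n:
        candidate = (step_towards(pair[0], centre[0], step),
                     step_towards(pair[1], centre[1], step))
        if candidate not in used:
            used.add(candidate)  # mark as 'used' so it is never revisited
            return candidate
        step += 1  # pair already used: enlarge the step and try again
    return None  # every reachable pair has been used


used = {(5, 5), (2, 3)}
print(discrete_move((1, 2), (5, 5), used, n=10))  # (3, 4): (2, 3) was already used
```

Tracking used pairs in a set means that no gene pair is evaluated twice, which is how the ‘used’ state saves running time.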
Apart from these two parts, no other parts need to be modified for this application.
To test the efficiency of IGFA, a suite of five functions, Eq. (4)–(8), was used to assess the algorithm's performance. The test results of IGFA are summarized and compared with those of GFA, GA and SA. For each method and each benchmark function, 500 different runs were performed and compared.
Sphere function
Rosenbrock function
Rastrigin function
Griewank function
Ackley function
In Eq. (4)–(8), D is the dimensionality, and D = 50 was used in these experiments. Eq. (4) is a single-minimum function; the others can be used as both single-minimum and multi-minima functions.
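For readers who want to experiment, the five benchmark functions can be written compactly in Python. The forms below are the commonly used textbook definitions and are an assumption: the paper's own Eq. (4)–(8) are not reproduced in this text and may differ in constants or scaling.

```python
import math


def sphere(x):
    return sum(v * v for v in x)


def rosenbrock(x):
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))


def rastrigin(x):
    return 10.0 * len(x) + sum(v * v - 10.0 * math.cos(2.0 * math.pi * v)
                               for v in x)


def griewank(x):
    total = sum(v * v for v in x) / 4000.0
    product = math.prod(math.cos(v / math.sqrt(i))
                        for i, v in enumerate(x, start=1))
    return 1.0 + total - product


def ackley(x):
    d = len(x)
    mean_sq = sum(v * v for v in x) / d
    mean_cos = sum(math.cos(2.0 * math.pi * v) for v in x) / d
    return (-20.0 * math.exp(-0.2 * math.sqrt(mean_sq))
            - math.exp(mean_cos) + 20.0 + math.e)


D = 50  # dimensionality used in the experiments
origin = [0.0] * D
print(sphere(origin), rastrigin(origin), griewank(origin))
```

In these standard forms the global minimum value is 0, attained at the origin for all functions except Rosenbrock, whose minimum lies at (1, …, 1).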
Algorithm parameters  IGFA  GFA  GA  SA 
Max. number of iterations  1000  1000  1000  1000 
Population size  50  50  50  – 
Number of populations  200  200  200  – 
Initial temperature  –  –  –  0–5.0 
various rate  0.3  –  0.3  – 
The error functions used to determine the algorithm efficiency were Mean squared error (MSE)
In these three functions, n is the number of runs,
Sphere  Rosenbrock  Rastrigin  Griewank  Ackley  


MSE 

0.0143 



STD 




12.4189 
MGE 

0.0454 





MSE  0.3254  0.0155  151.5743  5.3587e007  16.4500 
STD  0.2931  0.0155  6.5848  5.1176e004  16.4500 
MGE  0.4769  0.1179  0.9587  5.9689e004  1 


MSE  7.2747 

7054.2  6.7709e007  14.6001 
STD  0.7409  0.0556  16.2621  6.8194e004 

MGE  0.9927  0.0546  1  5.2149e004  1 


MSE  1619.2  0.0069  745.7810  0.0030  26.9123 
STD  7.1102  0.0827  12.5530  0.0385  26.9123 
MGE  0.9967 

0.9952  0.0439  1 
The common settings of parameters are shown in
To compare the performance of these four algorithms, 500 different runs of these five functions were used, and the value range in each dimension was [−2, 2]. The MSE, STD and MGE results are summarized in
Sphere  Rosenbrock  Rastrigin  Griewank  Ackley  


mean number of epochs 
33  51  36  57  113 
number of failures  0  0  0  0  0 


mean number of epochs 




98 
number of failures  0  0  0  0  0 


mean number of epochs  51  57  51  51 

number of failures  0  0  0  0  0 


mean number of epochs  816  46024  6349  21634  1001 
number of failures  108  500  314  126  234 
Sphere  Rosenbrock  Rastrigin  Griewank  Ackley  

191.2 


68.75 


201.4  66.84  114.87  67.93  82.58 


63.29  157.33 

101.30 

14211.08  11463.29  52914.75  15536.02  6406.68 
From the comparison between IGFA, GFA, GA and SA, we can see that for some simple functions, such as the Sphere, Rastrigin and Griewank functions, the MSE, STD and MGE of IGFA were lower than those of GA and SA. But for some complex functions, such as the Rosenbrock and Ackley functions, IGFA was not better than GA and SA.
Rosenbrock  Rastrigin  Griewank  Ackley  


1 



3.9940 
2  0.1007 



3 




4 




5 

0.9587 

3.6948 


1  1.0026  1.0469  0.0117 

2 

1.0943  0.0094  3.5237 
3  0.0029  0.0685  0.0009  0.0030 
4  0.1043  1.0327  0.0081  3.5797 
5  1.0031 

0.0097 

From
All the yeast genes are in the hierarchical binary tree, so all the relationships between these genes can be seen. The green blocks indicate low expression and the red blocks indicate high expression. There are 16 blocks in every row, representing 16 samples, and 7663 rows, representing 7663 genes. The grey blocks indicate missing data.
The Sphere, Rastrigin and Griewank functions were optimized better by IGFA. The Rosenbrock and Ackley functions were optimized partly better by IGFA than by the other algorithms. The lowest criteria were STD and MGE for IGFA. All functions were optimized better by IGFA than by GFA.
As for the running time of IGFA, the rotation factor increases the number of epochs, but the rule of random division shortens the running time of each epoch: IGFA needs less time to divide the dusts in the solution space.
The time complexity of the four algorithms should be judged from both the number of epochs and the total running time. The results are summarized in
It seems that IGFA outperforms the others on four of the five benchmark functions; only on the Ackley function is GA better than IGFA. However, the time complexity of an algorithm is also determined by the running time of one epoch. From
Comparing IGFA and GFA over all functions, the mean number of epochs of IGFA is larger than that of GFA, but the total running time of IGFA is smaller. That is, IGFA is more efficient than GFA.
Although the task in most problems is to search for the global minimum, some problems also need multiple minima, such as Bayesian network inferring
Because multi-minima searching is beyond the ability of GA and SA, the results of IGFA are compared with those of GFA using four benchmark functions, Eq. (5)–(8). The Sphere function is not suitable for multi-minima searching. In this experiment, the domain in every dimension for all four functions was [−2, 2]. The maximum number of iterations was 1000. The number of initial dusts was 30000 for all functions, and the number of initial groups was 200. 500 different runs were performed, and 5 minima found by the two algorithms were calculated and are summarized in
The error functions were not used. Instead, the direct result values of the multi-valley searching with IGFA and GFA were calculated and are shown in
From
algorithms  Distance  running time 

27.34  15.78 

21.32  19.84 

11.18  38.74 

23.93  78.56 

21.18  95.78 
The data used in this paper for hierarchical clustering is GDS38 in GEO
In traditional hierarchical clustering,
After clustering by IGFA, the direct result is shown as
It is impossible to objectively evaluate how good a specific gene expression clustering is without referring to what the gene clusters will be used for. However, once an application has been identified, it may be possible to evaluate objectively the quality of the gene clusters for that particular application. In our work, biological applications are the goal. To provide more meaningful biological information, all 300 genes involved in mitosis of the yeast cell in Spellman’s experiment
The efficiency of the IGFA in hierarchical clustering could be seen from
In
Overall, IGFA achieved the longest distance and the lowest running time in the experiment. By comparison, the accuracy of the traditional clustering algorithms was of the same order of magnitude as that of GA, but their running time was too long to be tolerable for clustering.
In this paper we improved the generic search-optimization algorithm GFA; the improved algorithm is called IGFA. IGFA contains two improvements. One is the rule of random division, which determines the group membership of every dust. The other is the rotation factor, which can be used to prevent local convergence. In addition, for the application to hierarchical clustering, IGFA was modified again to handle discrete data. The modification covers two parts: the initialization part and the movement operator.
Three kinds of experiments were used in this paper. The first is global minimum searching, in which the results of IGFA were compared with those of GFA, GA and SA on five benchmark functions. The second is multi-minima searching, in which the results of IGFA were compared with those of GFA. The third is the application to hierarchical clustering, in which the results were compared with those of GA and SA. The efficiency of IGFA was demonstrated by these three kinds of experiments.