An improved ant colony optimization algorithm with fault tolerance for job scheduling in grid computing systems | PLOS One

Fig 1.

Proposed fault tolerant architecture.
The fault index value indicates how prone a resource is to failure: the lower the fault index, the lower the failure rate of the resource, and the higher the fault index, the higher the failure rate. When a resource fails, the checkpoint handler queries the checkpoint repository to obtain the latest checkpoint files of the jobs executed on the failed resource and reschedules those jobs along with their last checkpoint status (see Algorithm 7). On successful completion of a job, the checkpoint handler receives the job completion message from the Grid resource and notifies the fault index handler to increment the success rate of the resource. The fault index handler maintains a fault index history of the Grid resources, which indicates the failure rate of each resource. To update and maintain the fault index of a Grid resource, the fault index handler uses Algorithm 6.
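The interaction between the two handlers described above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 6 or Algorithm 7: the class and method names are hypothetical, and the update rule (add one on failure, subtract one on success, never below zero) is an assumption chosen only to match the stated behaviour that a lower fault index means a lower failure rate.

```python
from dataclasses import dataclass, field


@dataclass
class FaultIndexHandler:
    """Keeps one fault index per resource; higher means more failure-prone."""
    fault_index: dict = field(default_factory=dict)

    def record_failure(self, resource_id):
        # Assumed rule: each observed failure raises the index by one.
        self.fault_index[resource_id] = self.fault_index.get(resource_id, 0) + 1

    def record_success(self, resource_id):
        # Assumed rule: each success lowers the index, floored at zero.
        current = self.fault_index.get(resource_id, 0)
        self.fault_index[resource_id] = max(0, current - 1)


@dataclass
class CheckpointHandler:
    """Stores the latest checkpoint per job and reacts to resource events."""
    fault_handler: FaultIndexHandler
    repository: dict = field(default_factory=dict)  # job_id -> last saved progress

    def save_checkpoint(self, job_id, progress):
        self.repository[job_id] = progress

    def on_resource_failure(self, resource_id, job_ids):
        # Report the failure, then return (job, last checkpoint) pairs
        # so a scheduler can resubmit each job from its saved state.
        self.fault_handler.record_failure(resource_id)
        return [(job_id, self.repository.get(job_id, 0.0)) for job_id in job_ids]

    def on_job_completed(self, resource_id, job_id):
        # A finished job no longer needs its checkpoint.
        self.repository.pop(job_id, None)
        self.fault_handler.record_success(resource_id)
```

A job with no stored checkpoint is returned with progress 0.0, meaning it would restart from the beginning; that default is also an assumption of this sketch.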

Fig 2.

An illustration of the recovery analysis.

Table 1.

Grid resource characteristics.

Table 2.

Gridlet characteristics.

Table 3.

Parameterization of the ACOwFT and ACO.

Fig 3.

Average makespan for varied Gridlets.

Fig 4.

Average throughput for varied Gridlets.

Fig 5.

Average turnaround time for varied number of Gridlets.
Similar experiments were carried out by keeping the number of Gridlets constant and varying the number of resources. Figs 6, 7 and 8 give the results obtained when the number of Gridlets is kept constant and the number of resources is varied. In this experiment, 3,000 jobs are sent to the Grid while the number of resources is varied from 50 to 3,050. As can be seen, the execution time decreases roughly exponentially as the number of resources increases. The proposed algorithms perform better when there is a small number of resources and a large number of jobs.
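For readers who want to reproduce the three metrics compared in these experiments, the sketch below computes makespan, throughput, and average turnaround time from per-job timestamps. The definitions are the common textbook ones and are assumed here, since this excerpt does not state the exact formulas used; `JobRecord` and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    submit_time: float  # when the job entered the Grid
    finish_time: float  # when the job completed


def makespan(jobs):
    # Assumed definition: time from the first submission to the last completion.
    return max(j.finish_time for j in jobs) - min(j.submit_time for j in jobs)


def throughput(jobs):
    # Assumed definition: completed jobs per unit of time over the makespan.
    return len(jobs) / makespan(jobs)


def average_turnaround(jobs):
    # Assumed definition: mean of (finish - submit) over all jobs.
    return sum(j.finish_time - j.submit_time for j in jobs) / len(jobs)
```

For example, three jobs with (submit, finish) times of (0, 10), (0, 20) and (5, 25) give a makespan of 25 and a throughput of 3/25 jobs per time unit under these definitions.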

Fig 6.

Average makespan time for varied number of resources.

Fig 7.

Average throughput time for varied number of resources.

Fig 8.

Average turnaround time for varied number of resources.
Another important factor worth mentioning is the robustness of the proposed algorithm in comparison with the existing ACO or AntZ algorithm. Robustness here means the capability of an algorithm to deal with resource failure when it occurs in the system and to recover from such failure automatically. Since the main goal of the proposed work is to model a fault tolerant algorithm, a further simulation test is carried out to verify our claim that the proposed ACOwFT algorithm is more robust than the existing ACO algorithm. For this experiment, we considered the case where 3,000 jobs are sent to the Grid for execution and different percentages of faults are deliberately introduced into the system. In the work presented in [24], the injected fault percentages range from 10% to 50%; in this paper, the fault percentages introduced into the system range from 10% to 70%. The purpose of introducing a very high fault percentage is to thoroughly evaluate the robustness of the proposed scheduling system under heavily faulty conditions.
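A fault injection sweep of this kind could be set up along the following lines. This is only a sketch under a stated assumption: the text does not say whether the percentage applies to resources or to jobs, so the code assumes it is the share of resources that are marked to fail, in steps of 10%. The function name `inject_faults` and the step size are hypothetical.

```python
import random

# Fault levels from 10% to 70%; the 10-point step is an assumption.
FAULT_LEVELS = list(range(10, 80, 10))


def inject_faults(resource_ids, fault_percentage, seed=None):
    """Return the set of resources chosen to fail at the given percentage."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    pool = list(resource_ids)
    n_faulty = round(len(pool) * fault_percentage / 100)
    return set(rng.sample(pool, n_faulty))
```

Fixing the seed keeps the set of failed resources identical across the algorithms being compared, so any difference in the metrics comes from the scheduler rather than from the random draw.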

Table 4.

Average makespan table for varied Gridlets.

Table 5.

Average throughput table for varied Gridlets.

Table 6.

Average turnaround time for varied Gridlets.

Table 7.

Average makespan time for varied resources.

Table 8.

Average throughput time for varied resources.

Table 9.

Average turnaround time for varied resources.

Fig 9.

Average makespan time for varied number of faults (total number of Gridlets = 3,000).

Fig 10.

Average throughput time for varied number of faults (total number of Gridlets = 3,000).

Fig 11.

Average turnaround time for varied number of faults (total number of Gridlets = 3,000).
To conclude the overall evaluation of the results, and with the aim of making a deeper analysis, Friedman's non-parametric test is carried out to check whether there is any statistically significant difference between the two algorithms in terms of the makespan, throughput, and turnaround time results reported for each of them. For makespan, throughput, and average turnaround time, the resulting Friedman statistic is 7.00. At the 99% confidence level, the critical value of the χ² distribution with 1 degree of freedom is 6.635. Since 7 > 6.635 (p-value = 0.008), it can be concluded that there is a statistically significant difference between the results reported by ACOwFT and ACO on all three metrics, χ²(1) = 7, with ACOwFT being the one with the lowest rank.
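The arithmetic behind these figures can be checked with a short sketch. With two algorithms, a Friedman statistic of exactly 7 arises when one algorithm ranks first in every one of seven paired measurements. That block count is an assumption made here for illustration, and the sample values below are made up, not taken from the paper. The sketch uses only the standard library and does not handle tied values.

```python
import math


def friedman_statistic(blocks):
    """blocks: one tuple per measurement, one value per algorithm (lower is better).

    Returns the Friedman chi-square statistic and the rank sum per algorithm.
    Ties are not averaged in this simplified version.
    """
    n = len(blocks)
    k = len(blocks[0])
    rank_sums = [0.0] * k
    for block in blocks:
        order = sorted(range(k), key=lambda j: block[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    stat = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return stat, rank_sums


def chi2_sf_1df(x):
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(x / 2.0))


# Made-up (first algorithm, second algorithm) pairs; the first is lower every time.
sample = [(10, 14), (12, 19), (15, 21), (18, 30), (22, 35), (25, 41), (29, 50)]
statistic, rank_sums = friedman_statistic(sample)
p_value = chi2_sf_1df(statistic)
```

Under these assumptions the rank sums are 7 and 14, the statistic evaluates to 7 (up to floating-point rounding), and the p-value is about 0.008, which is consistent with the values quoted above.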
