Master-Leader-Slave Cuckoo Search with Parameter Control for ANN Optimization and Its Real-World Application to Water Quality Prediction

Artificial neural networks (ANNs) have been employed to solve a broad variety of tasks. The selection of an ANN model with appropriate weights is important in achieving accurate results. This paper presents an optimization strategy for ANN model selection based on the cuckoo search (CS) algorithm, which is rooted in the obligate brood parasitic actions of some cuckoo species. In order to enhance the convergence ability of basic CS, some modifications are proposed. The fraction Pa of the n nests replaced by new nests is a fixed parameter in basic CS. As the selection of Pa is a challenging issue and has a direct effect on exploration and therefore on convergence ability, in this work the Pa is set to a maximum value at initialization to achieve more exploration in early iterations and it is decreased during the search to achieve more exploitation in later iterations until it reaches the minimum value in the final iteration. In addition, a novel master-leader-slave multi-population strategy is used where the slaves employ the best fitness function among all slaves, which is selected by the leader under a certain condition. This fitness function is used for subsequent Lévy flights. In each iteration a copy of the best solution of each slave is migrated to the master and then the best solution is found by the master. The method is tested on benchmark classification and time series prediction problems and the statistical analysis proves the ability of the method. This method is also applied to a real-world water quality prediction problem with promising results.


Introduction
Computational intelligence is defined as a set of nature-inspired computational approaches for dealing with complex real-world problems. It is directly linked to computing concepts such as fuzzy logic, decision making, artificial neural networks (ANNs), and metaheuristic algorithms used as optimization techniques. Artificial neural networks are a family of learning models that are inspired by biological neural networks and are employed to estimate functions that are generally unknown. A number of researchers have used optimization algorithms to train neural network models [1][2][3][4][5][6][7][8]. The multi-layer self-organizing ANN has been studied in the literature and metaheuristic algorithms have been used to optimize the structure of the ANN [9][10][11][12]. A few methods attempt to optimize both the weights and the structure of ANNs [8][13][14][15]. Other proposals include the use of an adaptive merging and growing algorithm in the design of ANNs [16] and the adoption of a Taguchi-based parameter for the genetic algorithm used in ANN training [17]. The authors of [4] used a pruned probabilistic neural network with a genetic algorithm to optimize the structure of an ANN, while the authors of [6] applied particle swarm optimization (PSO) to optimize an ANN. Other PSO-based approaches can be found in [18,19]. In another study [20], the authors used a combination of self-organizing networks and an artificial immune system to minimize the number of neurons in an ANN.
Finding the global optimum solution is the major and common aim of optimization algorithms. A good balance between diverse exploration and intensive exploitation results in good convergence in any algorithm. There is no set method for achieving this balance, and different methods have been proposed for different algorithms to achieve the trade-off. Based on a review of the relevant literature, a common recommendation is to focus on diverse exploration in early iterations and on intensive exploitation in later iterations, so that the global optimum is more likely to be discovered.
Many research studies have found that the use of a multi-population in the algorithm results in more exploration and enables the algorithm to move toward the global optimum. For instance, the authors in [21] proposed a multi-population cooperative method for PSO, called CPSO-S, in which the solution vector is split into smaller sub-vectors. Each of these sub-vectors is optimized using a separate swarm. A complete solution vector is built by using the best solution found by each swarm. Another study [22] proposed a master-slave multi-population for PSO. Other works based on PSO include Yen and Daneshyari [23], Zhan and Zhang [24], Zhang, Cui and Zou [25], and El-Abd and Kamel [26]. A multi-population cooperative method for bee swarm optimization has also been studied [27]. The authors of [28] proposed a multi-population cultural algorithm, in which the competitive multi-population genetic algorithm is embedded into the population of the cultural algorithm. Recently, a multi-population cooperative bat-inspired algorithm for optimization of the ANN model has also been proposed [29].
In this paper, we propose an optimization methodology based on the cuckoo search (CS) algorithm. Cuckoo search is rooted in the obligate brood parasitic actions of some cuckoo species. The key advantage of the CS algorithm is its simplicity. Unlike other population-based algorithms there is only one parameter, P a , in CS, which makes it easy to implement. A further advantage of CS is that its search uses Lévy flights instead of standard random walks; the step lengths of Lévy flights are drawn from a distribution with infinite mean and variance, so they can explore the search space more efficiently than standard random walks. The CS algorithm was first proposed in [30] and its superior ability was quickly established in many areas of optimization [31][32][33][34][35][36][37].
To enhance the performance of the CS algorithm and its convergence ability, we make the following modifications. First, the fraction P a of the n nests replaced by new nests is a parameter that is fixed in basic CS. As the selection of P a has a direct effect on exploration in the search space, in the proposed method P a is set to a maximum value at initialization to achieve more exploration or diversification in early iterations and it is decreased during the search to achieve more exploitation or intensification in later iterations until it reaches the minimum value in the final iteration. This first modification leaves fewer parameters to tune and also makes the algorithm self-adaptive. Furthermore, the modification enhances the algorithm's convergence ability by improving the balance between exploration and exploitation. The second modification involves the application of a novel master-leader-slave multi-population strategy. If, after a certain number of iterations, there is no improvement in the quality of the best solution, the leader unit selects the best fitness among all slaves and the slaves employ it for the next Lévy flight. In each iteration a copy of the best solution of each slave is migrated to the master and then the best solution among these is found by applying CS to the master sub-population. In this strategy, exploration is provided by the slaves and exploitation is achieved by the master.
The rest of this paper is organized as follows. First, we provide a brief description of cuckoo behavior and then introduce the mechanism of the CS algorithm and Lévy flights, followed by an explanation of the proposed modifications of the basic CS algorithm. After that, the experimental results of applying the proposed approach to benchmark datasets and to real-world water quality data are given. The conclusion of this work is provided in the last section.

Reproductive Behaviour of Cuckoos
Some cuckoo species lay their eggs in the nest of another bird species alongside the host bird's eggs, while others remove the host eggs and then lay their own in the nest to increase the hatching probability of their own eggs [30]. Female parasitic cuckoos are specialized in mimicking the pattern and color of the eggs of a few selected host species. This decreases the likelihood of the eggs being abandoned and so improves the reproductive outcome for the cuckoo. However, if a host bird discovers that the eggs are not its own, then it either throws them away or just abandons its nest and makes a new nest elsewhere. The parasitic cuckoo often selects a nest where the host bird has just laid its eggs. Generally, the cuckoo eggs hatch a little before the host bird's eggs. When the first cuckoo chick is hatched, its first natural action is to throw out the host eggs by blindly pushing them out of the nest. This behavior has the effect of increasing the cuckoo chick's share of the food provided by the host bird. A cuckoo chick can also mimic the call of the host chicks to increase its chances of being fed.

Cuckoo Search Algorithm
The first version of the CS algorithm was introduced by [30]. The authors made three idealized assumptions to ensure the simplicity of the CS algorithm. In this simplified form:
• Each cuckoo lays one egg at a time in a random nest;
• The best nests containing high-quality eggs are kept for the next iteration;
• The number of available host nests is fixed and the probability of the egg laid by the cuckoo being discovered by the host bird is P a ∈ [0, 1].
If the cuckoo egg is found, the host bird can either discard the egg or leave the nest and build a new nest in a new place. This last assumption is simulated by a fraction P a of the n nests being replaced by new nests (holding new random solutions).
In this simple form of CS each egg in a nest stands for one solution, and a cuckoo egg corresponds to a new solution. In this algorithm the aim is to replace a poor-quality solution in the nest with new and possibly better solutions.
In the CS algorithm a new solution is generated by using Lévy flights. A Lévy flight is a kind of random walk in which the random step lengths follow a heavy-tailed probability distribution. The reason for using Lévy flights in basic CS is that they explore the search space more efficiently than a standard random walk. This efficiency is due to the occasional long steps in Lévy flights. In the CS algorithm the Lévy flights are performed as shown by Eq 1:

$$x_i^{(t+1)} = x_i^{(t)} + \beta \oplus \text{Lévy}(\lambda) \qquad (1)$$

where $x_i^{(t)}$ is the solution of cuckoo i at iteration t and β > 0 is denoted as the step size. The step size depends on the scale of the problem. For most problems the value of β can be set to 1 [30]. The symbol ⊕ denotes entrywise multiplication. The Lévy flights provide a random walk with a random step length, which is derived from a Lévy distribution that has infinite variance and infinite mean, as expressed by Eq 2:

$$\text{Lévy} \sim u = t^{-\lambda}, \qquad 1 < \lambda \le 3 \qquad (2)$$

The idea of using Lévy flights originated in the cuckoo's reproductive behavior; if a cuckoo's egg appears similar to a host's eggs, the probability of the cuckoo's egg being discovered by the host bird is lower. Therefore, in the simulation of the CS algorithm, new solutions are generated by differentiating the existing solutions through Lévy flights.
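The Lévy-flight update described above can be illustrated with a minimal sketch. The text does not say how the heavy-tailed step is sampled; Mantegna's algorithm is assumed here because it is a common choice in CS implementations. The function names `levy_step` and `levy_flight` and the parameter `stability` are hypothetical, and only the step size β comes from the text.

```python
import math
import random


def levy_step(stability=1.5, rng=random):
    """Draw one heavy-tailed step with Mantegna's algorithm (assumed sampler).

    `stability` is the stable-law index in (0, 2]; a tail of the form
    t**(-lambda) corresponds to stability = lambda - 1.
    """
    s = stability
    sigma_u = (
        math.gamma(1 + s) * math.sin(math.pi * s / 2)
        / (math.gamma((1 + s) / 2) * s * 2 ** ((s - 1) / 2))
    ) ** (1 / s)
    u = rng.gauss(0.0, sigma_u)
    v = rng.gauss(0.0, 1.0) or 1e-12  # guard against an exact zero
    return u / abs(v) ** (1 / s)


def levy_flight(x, beta=1.0, stability=1.5, rng=random):
    """Return a new solution: each entry of x plus beta times a Levy step."""
    return [xi + beta * levy_step(stability, rng) for xi in x]


rng = random.Random(0)
new_solution = levy_flight([0.0, 1.0, 2.0], beta=1.0, rng=rng)
```

Most draws are small moves around the current solution, with an occasional long jump, which is the property the text credits for efficient exploration.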
Similar to other population-based algorithms, the CS starts with an initial random population. In the main loop of this algorithm, while the stopping criterion is not met, the algorithm gets a cuckoo (new solution) by Lévy flights, and if its fitness is better than that of a randomly chosen solution in the population then that solution is replaced by the new solution. A fraction P a of the worst solutions is then abandoned and replaced by new random solutions. In the final step of each iteration the solutions are ranked and the best one is updated. The pseudocode of the basic CS, formulated for maximization, is taken from [30].
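The main loop described above can be sketched in a few lines. This is an illustrative reading of the text, not the pseudocode of [30]: the Mantegna sampler, the bounds `lo` and `hi`, the clipping to those bounds, the default values, and all function names are assumptions made for the example.

```python
import math
import random


def levy_step(rng, s=1.5):
    """Heavy-tailed step via Mantegna's algorithm (assumed sampler)."""
    sigma = (math.gamma(1 + s) * math.sin(math.pi * s / 2)
             / (math.gamma((1 + s) / 2) * s * 2 ** ((s - 1) / 2))) ** (1 / s)
    v = rng.gauss(0, 1) or 1e-12
    return rng.gauss(0, sigma) / abs(v) ** (1 / s)


def cuckoo_search(fitness, dim, n=15, pa=0.25, beta=1.0, iters=200,
                  lo=-1.0, hi=1.0, seed=0):
    """Basic CS for maximization; returns (best_solution, best_fitness)."""
    rng = random.Random(seed)

    def random_nest():
        return [rng.uniform(lo, hi) for _ in range(dim)]

    nests = [random_nest() for _ in range(n)]
    fit = [fitness(x) for x in nests]
    top = max(range(n), key=lambda k: fit[k])
    best_x, best_f = list(nests[top]), fit[top]
    for _ in range(iters):
        # Get a cuckoo by a Levy flight from a randomly chosen nest.
        i = rng.randrange(n)
        cuckoo = [min(hi, max(lo, xi + beta * levy_step(rng))) for xi in nests[i]]
        f_new = fitness(cuckoo)
        # Compare with a random nest and replace it if the cuckoo is better.
        j = rng.randrange(n)
        if f_new > fit[j]:
            nests[j], fit[j] = cuckoo, f_new
        # Abandon a fraction pa of the worst nests and build new random ones.
        worst_first = sorted(range(n), key=lambda k: fit[k])
        for k in worst_first[:int(pa * n)]:
            nests[k] = random_nest()
            fit[k] = fitness(nests[k])
        # Rank the nests and update the best solution found so far.
        top = max(range(n), key=lambda k: fit[k])
        if fit[top] > best_f:
            best_x, best_f = list(nests[top]), fit[top]
    return best_x, best_f


def sphere(x):
    """Toy objective (negated so that larger is better)."""
    return -sum(v * v for v in x)


best_x, best_f = cuckoo_search(sphere, dim=3)
```

For ANN training the fitness would be the negated training error of the network encoded by the solution vector; the toy objective only stands in for it.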

Modified Cuckoo Search Algorithm
We propose two modifications of CS to improve the performance of the basic CS. The first modification aims to enhance the overall performance of CS by controlling the P a parameter. Controlling this parameter improves the balance between exploration and exploitation of the search space and can therefore increase the likelihood of fast convergence to the global optimum. The second modification involves maintaining the diversity of the solutions and directing the search process toward the best solution by using a novel master-leader-slave multi-population strategy. The details of these modifications are discussed in the following subsections.

P a parameter control
The main idea of the CS is based on decreasing the probability of the cuckoo's eggs being detected by the host bird so the cuckoo's eggs have more opportunity to survive and become mature cuckoos. Therefore, in this algorithm if more eggs survive it means that the algorithm is going to optimize and converge toward the best solution. The P a parameter is a very important parameter in adjusting the convergence rate of the algorithm. In the basic CS algorithm, the fixed value of the P a parameter is set in the initialization step and cannot be altered during the search process. The main challenge encountered in the basic CS algorithm is how to tune this parameter to find the global optimum solution. If the value of P a is small the probability of the host bird detecting the cuckoo's egg is low. In this case there is an insufficient diversity of solutions and therefore there is not enough exploration of the search space. Inadequate exploration might decrease the performance and result in finding a poor solution. If the value of P a is large the probability of the host bird detecting the cuckoo's egg is high. So there is high exploration but there is not enough exploitation to converge the algorithm toward the optimum solution.
We propose a simple yet effective way to fine-tune the P a parameter to overcome this challenging issue. In order to improve the performance of the algorithm, the value of P a must be large enough in the early iterations to force the algorithm to maintain the diversity of the solutions and increase exploration of the search space. However, the value of P a should be decreased in later iterations to speed up the convergence of the algorithm. In the proposed modified CS algorithm, the value of the P a parameter is changed dynamically with the number of iterations. Since P a ∈ [0, 1] in basic CS, the maximum value in this range, which is equal to 1, is used for P a in the initial setting and then it is decreased in each iteration at a rate that is calculated by Eq 3 until it reaches the minimum value of 0 in the range of [0, 1]. In Eq 3, the decreasing rate of P a in the i-th iteration is denoted as $DR_{P_a}^{i}$, the number of the current iteration is indicated as $iteration_i$, and numOfIte stands for the total number of iterations.
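As a rough illustration of this parameter control, the sketch below assumes the simplest schedule that matches the description: a linear decrease from 1 at initialization to 0 in the final iteration. The exact form of Eq 3 may differ; the function name `pa_schedule` and the arguments `pa_max` and `pa_min` are hypothetical, while `num_of_ite` mirrors numOfIte in the text.

```python
def pa_schedule(iteration, num_of_ite, pa_max=1.0, pa_min=0.0):
    """Assumed linear decrease of P a from pa_max (iteration 0) to pa_min (last)."""
    if not 0 <= iteration <= num_of_ite:
        raise ValueError("iteration must lie in [0, num_of_ite]")
    fraction_done = iteration / num_of_ite
    return pa_max - (pa_max - pa_min) * fraction_done


# P a over a 100-iteration run: 1.0 at the start, 0.5 halfway, 0.0 at the end.
values = [pa_schedule(i, 100) for i in (0, 50, 100)]
```

With this schedule almost all nests are re-drawn in the first iterations (exploration) and almost none near the end (exploitation), and no value of P a has to be chosen by hand.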
This parameter control strategy reduces the number of parameters that must be tuned, as the P a parameter is changed automatically during the search process based on the number of the current iteration and the total number of iterations. The schematic flowchart of the modified CS algorithm with P a control is illustrated in Fig 2.

Master-leader-slave multi-population
In multi-population cooperative PSO [38], a master-slave approach is used that employs one master swarm and a number of slave swarms. The slave swarms apply the PSO algorithm in parallel and then each transfers its best solution to the master. Then the master updates the best solution from among all the slaves' best solutions. A modified master-slave multi-population, which is a combination of the ring and master-slave for the bat algorithm was proposed in our earlier work [29]. In that work the slaves exchange information on their best solutions in a ring-like manner if, after a certain number of iterations, there is no improvement on the best solution. Further cooperation is achieved by sending the best solution of each slave to the master in each iteration. The master involves the application of an optimization algorithm on the population of best solutions collected from all the slaves. The advantage of this work lies in its ability to maintain the diversity of the solutions, where exploration of the population is conducted by the slaves and exploitation is performed by the master.
We take this approach one step further to propose a novel multi-population cooperative strategy called master-leader-slave for CS, which is based on the master-slave strategy with the addition of another unit called the leader. The leader unit does not apply any optimization procedure; instead, it receives the best solutions found by the slaves if, after a certain number of iterations, there is no improvement in the quality of the best solution. After receiving the best solutions from the slaves, the leader selects the best one among them. Then, the leader sends the information on the selected solution to all the slaves to guide them to follow this selected solution in the next Lévy flight. The master receives a copy of the best solutions found by the slaves in each iteration and applies the CS optimization algorithm to the population of best solutions from the slaves. Then the overall best solution is updated and the master is emptied. The cooperation between slaves, leader and master is shown schematically in Fig 3. In Fig 3, the arrows labeled Sol best show that a copy of the best solution is transferred to the master, while the arrows labeled f (best) denote that the information (quality) about the best solution is sent to the leader. The arrows labeled f (b) illustrate that the information on the best solution among all the best solutions from the slaves is sent to the slaves. The cooperation between slaves and leader gives the algorithm a powerful exploration capability and provides a high diversity of solutions in the population, while the support given by the master to the slaves improves the ability of the algorithm to achieve fast convergence. This proposed multi-population
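The cooperation between slaves, leader and master can be sketched as follows. This is only one possible reading of the description. Several details are assumptions rather than the authors' method: the way a slave "follows" the leader's choice (the step is scaled by the distance to the guide), the linear P a decrease, the `patience` threshold, the search bounds of [-1, 1], and every function name.

```python
import math
import random


def levy(rng, s=1.5):
    """Heavy-tailed step via Mantegna's algorithm (assumed sampler)."""
    sigma = (math.gamma(1 + s) * math.sin(math.pi * s / 2)
             / (math.gamma((1 + s) / 2) * s * 2 ** ((s - 1) / 2))) ** (1 / s)
    v = rng.gauss(0, 1) or 1e-12
    return rng.gauss(0, sigma) / abs(v) ** (1 / s)


def cs_step(pop, fitness, rng, pa, guide=None, beta=0.1):
    """One CS iteration on pop (maximization); returns (best_fitness, best_solution)."""
    n = len(pop)
    i, j = rng.randrange(n), rng.randrange(n)
    if guide is None:
        cuckoo = [x + beta * levy(rng) for x in pop[i]]
    else:  # assumed rule: fly relative to the solution chosen by the leader
        cuckoo = [x + beta * levy(rng) * (g - x) for x, g in zip(pop[i], guide)]
    if fitness(cuckoo) > fitness(pop[j]):
        pop[j] = cuckoo
    pop.sort(key=fitness)  # worst first
    for k in range(min(int(pa * n), n - 1)):  # always keep the best nest
        pop[k] = [rng.uniform(-1, 1) for _ in pop[k]]
    return max((fitness(x), x) for x in pop)


def master_leader_slave(fitness, dim, n_slaves=3, n=10, iters=50, patience=5, seed=0):
    rng = random.Random(seed)
    slaves = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
              for _ in range(n_slaves)]
    best_f, best_x, guide, stall = -math.inf, None, None, 0
    for it in range(1, iters + 1):
        pa = 1.0 - it / iters  # assumed linear P a control
        # Slaves explore; a copy of each slave's best solution goes to the master.
        bests = [cs_step(s, fitness, rng, pa, guide) for s in slaves]
        master = [list(x) for _, x in bests]
        f, x = cs_step(master, fitness, rng, pa)  # master exploits the collected bests
        guide = None
        if f > best_f:
            best_f, best_x, stall = f, list(x), 0
        else:
            stall += 1
        if stall >= patience:
            # Leader: pick the best among the slaves' bests and broadcast it.
            guide, stall = list(max(bests)[1]), 0
        # The master is emptied here; it is rebuilt from fresh copies next iteration.
    return best_x, best_f


def sphere(x):
    return -sum(v * v for v in x)


best_x, best_f = master_leader_slave(sphere, dim=3, iters=30)
```

The leader performs no search of its own in this sketch; it only compares the qualities reported by the slaves, which matches its role in the description.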

Experimental Results
For the ANN used in the present study, two hidden layers with two nodes in each hidden layer were selected, as this structure is commonly used and has been reported to give accurate results [8,29]. The activation function used in this experiment was the hyperbolic tangent as it has been reported to perform better than other activation functions [39]. A one-dimensional vector was used for the solution representation, where each cell of the vector holds one weight or bias of the ANN. The length of the vector is equal to the number of weights plus the number of biases of the ANN.
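The solution representation can be made concrete with a small sketch. The text fixes only the vector length (weights plus biases) and the tanh activation; the ordering of the cells (each neuron's incoming weights followed by its bias), the use of tanh on the output layer, and the names `vector_length` and `forward` are assumptions for illustration. The layer sizes `[4, 2, 2, 3]` correspond to a dataset with four inputs and three classes, such as Iris.

```python
import math


def vector_length(layers):
    """Weights plus biases of a fully connected net with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))


def forward(vector, layers, inputs):
    """Decode the flat solution vector layer by layer and run a tanh forward pass."""
    if len(vector) != vector_length(layers):
        raise ValueError("vector length does not match the network structure")
    pos, activations = 0, list(inputs)
    for n_in, n_out in zip(layers, layers[1:]):
        next_activations = []
        for _ in range(n_out):
            weights = vector[pos:pos + n_in]   # incoming weights of this neuron
            bias = vector[pos + n_in]          # followed by its bias
            pos += n_in + 1
            total = sum(w * a for w, a in zip(weights, activations)) + bias
            next_activations.append(math.tanh(total))
        activations = next_activations
    return activations


layers = [4, 2, 2, 3]              # 4 inputs, two hidden layers of 2 nodes, 3 outputs
length = vector_length(layers)     # (4*2 + 2) + (2*2 + 2) + (2*3 + 3)
outputs = forward([0.0] * length, layers, [5.1, 3.5, 1.4, 0.2])
```

An optimizer such as CS then only sees a list of `length` real numbers, and the fitness of a solution is obtained by decoding it with `forward` and measuring the prediction error.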

Benchmark classification and time series prediction problems
In this section, we examine the performance of the basic CS algorithm (CS), the proposed modified CS algorithm with P a control (P a Ctrl-CS) and the proposed modified CS algorithm with parameter control and multi-population (Multipop-P a Ctrl-CS) by applying them to six classification and two standard time series prediction problems. The classification problems are Iris, Diabetes diagnoses, Thyroid dysfunction, Breast cancer, Credit card, and Glass identification. The time series prediction problems are Mackey-Glass and Gas Furnace; the former is a univariate dataset, whereas the latter is a multivariate dataset. The classification problems are taken from the UCI machine learning repository [40]. The Gas Furnace dataset is available from http://datasets.connectmv.com/datasets, while Mackey-Glass was produced from the equation given in the literature [29]. The initial parameters are shown in Table 1. The values of α and P a in the basic CS algorithm are adopted from [30], while the value of P a in the modified CS algorithms is changed within the range of [0, 1]. This range is based on the suggestion in [30]. The ANN consists of two hidden layers with two nodes in each hidden layer as this structure has been used in previous related research [8,29]. The activation function for this experiment is the hyperbolic tangent as it has been reported to have superior performance compared to other activation functions [39]. The solutions are represented as a one-dimensional vector, where each cell holds one weight or bias of the ANN. The length of the vector is equal to the number of weights plus the number of biases of the ANN.
We considered the output of x(t+6) with the input variables of x(t), x(t-6), x(t-12) and x(t-18) for the Mackey-Glass dataset. For the Gas Furnace problem, the input variables were u(t-3), u(t-2), u(t-1), y(t-3), y(t-2), y(t-1) and the output variable was y(t), as used in earlier works. We used 30 twofold iterations [8] to evaluate the performance of the model. The data were randomly separated into two parts for each run. One half was used for the training set and the other half was employed as the testing set to test the model. The examples in the datasets were normalized into the range of [-1, 1]. We compare the results in the following two subsections. The first investigates the performance of the proposed algorithms in comparison with each other and the second presents a comparison of the best proposed method with the approaches in the literature.
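The data preparation described above can be sketched as follows. The lags, the prediction horizon, the normalization range and the random half-and-half split come from the text; the function names, the seed handling and the toy series are assumptions used only for illustration.

```python
import random


def make_windows(series, lags=(0, 6, 12, 18), horizon=6):
    """Pairs ([x(t), x(t-6), x(t-12), x(t-18)], x(t+6)) for every valid t."""
    start, stop = max(lags), len(series) - horizon
    return [([series[t - lag] for lag in lags], series[t + horizon])
            for t in range(start, stop)]


def normalize(values, lo=-1.0, hi=1.0):
    """Linearly rescale values into [lo, hi]."""
    vmin, vmax = min(values), max(values)
    return [lo + (hi - lo) * (v - vmin) / (vmax - vmin) for v in values]


def random_halves(examples, seed):
    """Randomly separate the examples into a training half and a testing half."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


series = normalize([float(t) for t in range(30)])   # toy stand-in for Mackey-Glass
pairs = make_windows(series)
train, test = random_halves(pairs, seed=0)
```

Repeating `random_halves` with 30 different seeds would give the 30 runs mentioned in the text, each with its own random training and testing halves.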
Results of comparison of proposed methods. In this section, we evaluate the performance of the basic CS and the two proposed methods based on the percentage of the error. A summary of the results obtained by the three versions of the algorithm is shown in Table 2. In the case of the classification datasets (the first six datasets in the table) the training and testing errors are represented by the classification error. In the case of the Mackey-Glass time series dataset the training and testing error is represented by the root mean squared error (RMSE), whereas for the Gas Furnace time series dataset it is denoted by the mean square error (MSE). From Table 2 it can be seen that Multipop-P a Ctrl-CS has fairly superior performance compared to the other methods. To confirm the above finding we carried out an average ranking test to discover the first-ranked algorithm. The results are shown in Table 3, from which it can be seen that Multipop-P a Ctrl-CS is ranked first in two cases for training error and testing error and P a Ctrl-CS and CS are ranked second and third, respectively.
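For reference, the three error measures named above can be computed as in the following sketch. The function names are hypothetical and the sample values are made up solely to show the calculation; they are not results from Table 2.

```python
import math


def classification_error(y_true, y_pred):
    """Percentage of examples whose predicted class differs from the true class."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return 100.0 * wrong / len(y_true)


def mse(y_true, y_pred):
    """Mean square error, as used for the Gas Furnace series."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, as used for the Mackey-Glass series."""
    return math.sqrt(mse(y_true, y_pred))


cls_err = classification_error([0, 1, 2, 1], [0, 1, 1, 1])   # one of four is wrong
series_rmse = rmse([0.0, 0.0], [3.0, 3.0])
```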
In order to further investigate the performance of proposed methods, a comparison of the optimization progress of CS, P a Ctrl-CS and Multipop-P a Ctrl-CS was conducted, the results of which are provided in Fig 5. This figure shows the optimization progress of the proposed methods in 100 iterations for all tested datasets. In all cases Multipop-P a Ctrl-CS starts with a better solution and also converges to a better solution compared to CS and P a Ctrl-CS. This is because the multi-population has superior exploration and guides the search toward the global optimum by using the leader and master actions of the master-leader-slave strategy. From Fig  5 it is also evident that P a Ctrl-CS performs better than CS in most cases. This is because the method is designed to control the P a parameter during the search process to improve exploration in early iterations and to achieve more exploitation in later iterations until final convergence is reached. To measure whether Multipop-P a Ctrl-CS is statistically different from the other proposed methods, we computed the p-values of the three algorithms for all datasets, where the critical value α is equal to 0.05. Critical values for a test of hypothesis depend upon a test statistic, which is specific to the type of test, and the significance level, α, which defines the sensitivity of the test. A value of α = 0.05 implies that the null hypothesis is rejected 5% of the time when it is in fact true. The most commonly used significance level is α = 0.05.
This evaluation was carried out for the training error and testing error. The results are shown in Table 4. Values lower than the critical level (highlighted in bold) confirm the superior ability of Multipop-P a Ctrl-CS. Only in the case of four training errors where we compare P a Ctrl-CS and Multipop-P a Ctrl-CS is the p-value higher than the critical value. This supports the superior performance of Multipop-P a Ctrl-CS compared to P a Ctrl-CS and CS.
Results of comparison of the best proposed method with other methods in the literature. There are plentiful studies on the ANN, particularly in relation to its application to classification problems. Therefore, we compared Multipop-P a Ctrl-CS, as the best method among those examined in the previous section, with the most recent methods in the literature that are closely related to our proposed method and that employed the same datasets. Comparisons of the testing errors for classification and time series prediction are reported in Tables 5 and 6, respectively. The best results in these tables are shown in bold. Table 5 illustrates the results of a comparison of Multipop-P a Ctrl-CS with seven methods, namely, the multi-population of the bat algorithm (BatRM-S) [29], simulated annealing (SA), tabu search (TS), genetic algorithm (GA), combination of TS and SA (TSa), and integration of TS, SA, GA and backpropagation (GaTSa+BP) [8], and standard particle swarm optimization (PSO), which was re-implemented to be compared with the proposed method for the classification problem. Table 5 shows that Multipop-P a Ctrl-CS exhibits higher performance than the other methods in the literature. The Multipop-P a Ctrl-CS algorithm outperforms these methods in four out of the six tested datasets.
Note: The percentage of error has been calculated for the results in the literature. "*" is a part of the method's name as presented in the original paper.
Table 6 provides the results of a comparison of Multipop-P a Ctrl-CS with several algorithms investigated in [10,11,29] and a re-implemented PSO for the time series prediction problem. For both the Mackey-Glass and Gas Furnace time series datasets Multipop-P a Ctrl-CS achieved the best results compared to the other approaches in the literature. The superiority of Multipop-P a Ctrl-CS is attributed to its ability to control the P a parameter to provide better exploration and enhanced exploitation during the search process as well as to its use of the master-leader-slave multi-population strategy.
In order to further validate the results, we carried out a Friedman test and a Nemenyi test on Multipop-P a Ctrl-CS. These tests were used to determine whether there are significant differences between the achievement of the method and the other methods in the literature in terms of classification error and prediction error.
In the case of classification problems, the Friedman test result was 28.1625, which is greater than 13.45 (critical value) for the testing error of classification problems. The critical value 13.45 was found in table of critical values for the Chi-Square test where the degree of freedom is equal to K-1. K is the number of methods, which in our experiment is equal to eight. Since the value of the Friedman test was greater than the critical value, the null hypothesis was rejected. This evaluation showed that there is a significant difference in performance between the algorithms in terms of classification error.
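The Friedman statistic used above can be computed from a table of errors as in the following sketch, which uses the standard chi-square form of the test with N datasets and K methods. The function names are hypothetical and the error table is a toy example, not data from Table 5.

```python
def average_ranks(row):
    """Rank errors in ascending order (1 = lowest error); ties share the mean rank."""
    order = sorted(range(len(row)), key=lambda m: row[m])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(errors):
    """errors[d][m] is the error of method m on dataset d; returns chi-square."""
    n, k = len(errors), len(errors[0])
    rank_sums = [0.0] * k
    for row in errors:
        for m, rank in enumerate(average_ranks(row)):
            rank_sums[m] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# Toy table: 4 datasets (rows) and 3 methods (columns), ranked identically everywhere.
toy_errors = [[1.0, 2.0, 3.0], [0.5, 0.7, 0.9], [10.0, 11.0, 12.0], [2.0, 4.0, 6.0]]
statistic = friedman_statistic(toy_errors)
```

The resulting value is compared with the chi-square critical value for K-1 degrees of freedom; the null hypothesis of equal performance is rejected when the statistic is larger.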
The Nemenyi test was also carried out as a post-hoc test to discover the group of methods that differ from the other methods. The standard error (SE) was calculated and then the minimum significant difference (MSD) was computed. The MSD is used to identify where the differences between averages are higher than the MSD. The MSD in our experiment was equal to 7.579328. The result of the Nemenyi test is highlighted in bold in Table 7. This table shows that Multipop-P a Ctrl-CS has a statistically significant difference in six cases.
The Friedman test result for the time series prediction was equal to 9.3374. This value is larger than the critical level (9.143), so we rejected the null hypothesis. We also performed a Nemenyi post-hoc test for time series prediction and the results are shown in Table 8. The MSD is equal to 4.209448 and from the highlighted values in Table 8 it can be seen that Multipop-P a Ctrl-CS performed better in four cases.

Real-world application of proposed methods to water quality prediction
In the last part of our study, the proposed methods were applied to real-world water quality data. The data were collected from a weather station near Kajang in the Selangor area of Malaysia. The data comprise monthly water quality records from the years 2004 through 2013. They are multivariate time series and have been used as a prediction problem. The data  All the features were employed as input for the ANN and the last six features were considered as output of the ANN as they are the most critical features for water quality prediction. The data were divided into two parts; 70% of the data was used as a training set and 30% was used as a testing set. The data were normalized into the range of (0, 1) using the Min-Max normalization technique. A 10-fold cross-validation was used to validate the results. One-step-ahead prediction was performed. The averages of 30 runs for the prediction are shown in Table 9. For ease of reference and assessment, the average ranking of the training and testing errors of the proposed algorithms is provided in Table 10. As shown in this table, Multipop-P a Ctrl-CS is ranked first for both training and testing errors. To determine whether there is a significant difference between the results of Multipop-P a Ctrl-CS, P a Ctrl-CS and basic CS, the p-values were calculated and compared. The results are shown in Table 11. From the table it can be seen that all the p-values are much lower than the critical level of 0.05, which supports the higher performance of Multipop-P a Ctrl-CS compared to P a Ctrl-CS and CS.
A comparison of the actual and predicted values in both the training and testing parts for CS, P a Ctrl-CS, and Multipop-P a Ctrl-CS when tested on real-world water quality data is shown in Fig 6. The first column in this figure (Fig 6(a)) provides the results for CS, the second column (Fig 6(b)) shows the results for P a Ctrl-CS, and the last column (Fig 6(c)) illustrates the results for Multipop-P a Ctrl-CS. Although the data have an irregular pattern and high fluctuation, the ability of Multipop-P a Ctrl-CS to predict the features is better than that of the other two methods, and all three are acceptable for application to real-world data prediction problems.
The predicted parameters can be used for the Water Quality Index (WQI) calculation and to find the degree of water quality. This means that the WQI can be derived using NH3-NL, PH, SS, COD, BOD, and DO, which are predicted by the proposed method. These parameters represent significant chemical, physical and biological parameters of water quality conditions. Using the WQI, a numerical value defining a certain level of water quality can be presented. Therefore, the WQI summarizes water quality data in a reliable way into a simple concept (like a grade), such as excellent in the range of (90, 100), good in (70, 89), medium in (50, 69), bad in (25, 49) and very bad in the range of (0, 24) [41].
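The grading described above amounts to a simple lookup, sketched below. The class boundaries are those quoted in the text; treating each lower bound as a threshold (so that non-integer values such as 89.5 fall into the lower class) and the function name `wqi_grade` are assumptions. The sketch does not compute the WQI itself, only the grade for a given index value.

```python
def wqi_grade(wqi):
    """Map a WQI value in [0, 100] to the grade quoted in the text."""
    if not 0 <= wqi <= 100:
        raise ValueError("WQI must lie between 0 and 100")
    if wqi >= 90:
        return "excellent"
    if wqi >= 70:
        return "good"
    if wqi >= 50:
        return "medium"
    if wqi >= 25:
        return "bad"
    return "very bad"


grades = [wqi_grade(v) for v in (95, 70, 69, 25, 24)]
```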

Conclusion
This paper examined the capability of the cuckoo search algorithm and its modifications to contribute to a more accurate ANN model. To this end, the basic cuckoo search algorithm was first applied to optimize the ANN model and then two modifications (P a Ctrl-CS and Multipop-P a Ctrl-CS) of the cuckoo search were proposed. These modifications were designed to improve the exploration and exploitation of the algorithm and its ability to achieve better convergence. Control of the P a parameter in P a Ctrl-CS was achieved by setting this parameter to a maximum value in the initial stage to gain more exploration and by decreasing this parameter during the search process to gain more exploitation until the algorithm reached final convergence with the minimum value of P a . Furthermore, a master-leader-slave multi-population strategy was embedded in Multipop-P a Ctrl-CS to improve the convergence ability of the algorithm. In this strategy, the slaves, with the aid of leader guidance, provided good exploration and the master had the role of providing the algorithm with more exploitation. Based on extensive evaluations it is concluded that the Multipop-P a Ctrl-CS algorithm has the ability to outperform other recent algorithms in the literature in five out of six classification problems. The algorithm also demonstrated better performance on the two tested time series prediction problems. We believe that the superiority of the results is due to the fine balance between exploration and exploitation in Multipop-P a Ctrl-CS provided by P a control and the master-leader-slave multi-population. Finally, the proposed methods were applied to real-world data for water quality prediction. The promising results for both benchmark and real-world data motivate us to improve this method in future work.