Determination of the Optimal Training Principle and Input Variables in Artificial Neural Network Model for the Biweekly Chlorophyll-a Prediction: A Case Study of the Yuqiao Reservoir, China

Predicting the levels of chlorophyll-a (Chl-a) is a vital component of water quality management, which ensures that urban drinking water is safe from harmful algal blooms. This study developed a model to predict biweekly Chl-a levels in the Yuqiao Reservoir (Tianjin, China) using water quality and meteorological data from 1999-2012. First, six artificial neural networks (ANNs) and two non-ANN methods (principal component analysis and the support vector regression model) were compared to determine the appropriate training principle. Subsequently, three predictors with different input variables were developed to examine the feasibility of incorporating meteorological factors into Chl-a prediction, which usually only uses water quality data. Finally, a sensitivity analysis was performed to examine how the Chl-a predictor reacts to changes in input variables. The results were as follows: first, ANN is a powerful predictive alternative to the traditional modeling techniques used for Chl-a prediction. The back propagation (BP) model yields slightly better results than all other ANNs, with the normalized mean square error (NMSE), the correlation coefficient (Corr), and the Nash-Sutcliffe coefficient of efficiency (NSE) at 0.003 mg/l, 0.880 and 0.754, respectively, in the testing period. Second, the incorporation of meteorological data greatly improved Chl-a prediction compared to models solely using water quality factors or meteorological data; the correlation coefficient increased from 0.574-0.686 to 0.880 when meteorological data were included. Finally, the Chl-a predictor is more sensitive to air pressure and pH compared to other water quality and meteorological variables.


Introduction
Chlorophyll-a (Chl-a) is commonly used as an indicator of phytoplankton abundance and primary productivity in the lakes and reservoirs that provide most of the drinking water for dozens of large and medium cities in China. Predicting the levels of Chl-a is a vital part of water quality management to ensure that urban drinking water is safe from harmful algal blooms.
Chl-a levels in lakes and reservoirs have been modeled for over 40 years [1], [2], and several statistical and process-based physical models have been developed using analysis of phytoplankton. Two of the most commonly used statistical predictors are linear regression models [3], [4] and principal component analysis [5], [6], [7]. These methods are simple but often do not yield reliable results, and sometimes even produce significant errors due to poor statistical stability and the use of linear equations. With improved understanding of aquatic ecosystem processes and advanced computing capabilities, physical models are now used to address water quality problems [8], [9], [10]. Although these models can describe variations in Chl-a levels based on the mechanism, they are not well suited for most Chinese lakes and reservoirs because they require a significant amount of field data.
Artificial neural networks (ANNs), which imitate the basic characteristics of the human brain such as self-adaptability, self-organization and error tolerance, are able to map non-linear relationships among the variables that are typical of aquatic ecosystems [11]. Since their first application for the prediction of algal blooms from water quality databases of the Saidenbach Reservoir in Germany [12], ANNs have been widely applied to study Chl-a. Some examples of their application include prediction of algal blooms in Lake Kasumigaura in Japan [13], forecasting the incidence of cyanobacteria in the Murray River in Australia [14], estimation of the Chl-a levels in three water bodies in Turkey [15], analysis of algal bloom dynamics in the coastal waters of Hong Kong [16], elucidation of phytoplankton dynamics in the Nakdong River in Korea [17], prediction of the Chl-a levels in the Nanzui water area of Dongting Lake in China [18], and modeling of Chl-a levels during spring algal blooms in the Xiangxi Bay of the Three Gorges Reservoir in China [19]. These studies revealed that ANNs outperform traditional statistical models in modeling non-linear behavior and are more flexible than physical models because they require less detailed knowledge of the aquatic ecosystem. However, none of these studies encountered the difficulties specific to modeling the Yuqiao Reservoir, which has extensive submerged aquatic plants in addition to problems common to most reservoirs, such as abundant blue algae, limited data, highly variable water levels, and complex physical and chemical processes. Shallow water and suitable nutrient conditions in the Yuqiao Reservoir have led to extensive growth of submerged aquatic plants.
Furthermore, although it is important to select the proper training method to improve prediction, few studies have systematically analyzed the performance of different ANNs in predicting Chl-a levels. Finally, almost all these studies used only water quality data as inputs, whereas meteorological factors that greatly affect the growth and accumulation of algae were rarely considered. Therefore, this study developed an accurate biweekly Chl-a predictor for the Yuqiao Reservoir by selecting appropriate training methods based on comparison of several ANN and non-ANN methods and by determining the appropriate model inputs including meteorological factors.
largest city in China with a population of 2.92×10⁷ in 2010. The reservoir was built in 1959 and used as a regulating reservoir during the diversion project from Luanhe to Tianjin in 1983. The reservoir surface area is 86.8 km², and its volume and average depth at normal water level are 0.42×10⁹ m³ and 4.6 m, respectively. The mean annual precipitation and air temperature of the basin are 750 mm and 11.5°C, respectively.
The ecosystem of the Yuqiao Reservoir has undergone significant changes over the past few decades because of the natural evolution of biological species, changes in water diversion patterns, accelerated eutrophication of water quality, and substantial reduction of runoff resulting from climate change and human activities. The dominant species of submerged vegetation has changed from Potamogeton maackianus to Potamogeton crispus, and the biomass of Potamogeton crispus in late May increased from 4.8×10⁷ kg in 1988 to 1.19×10⁸ kg in 2009, whereas the distribution area of this species increased from 34.85% to 60.84%, according to remote-sensing estimates made using the Huanjing-1A/B satellite. The safety of the water supply of the Yuqiao Reservoir is now threatened by excessive growth of Potamogeton crispus in spring and algal outbreaks in summer. Potamogeton crispus is a submerged aquatic plant that purifies water by absorbing excess nutrients and competing for resources with cyanobacteria during its high growth period from April to mid-May. However, Potamogeton crispus also promotes the growth of cyanobacteria by accelerating nutrient release during its explosive death and decay during late May and early June.
(2) Features and variables. Feature extraction and determination of variables are important for any pattern recognition task, especially for Chl-a prediction, which involves complicated processes and variations. Excessive inputs may result in inefficient Chl-a prediction, whereas limited inputs may fail to describe the relationship between the influential variables and Chl-a levels.
To prepare features and variables for model inputs, we first interpolated the field water quality data into biweekly sets using a linear method and then processed the meteorological data to match the water quality data. The predicted day was set as Day 0, and the current day was set as Day 15. Therefore, the water quality and meteorological data of the preceding days 15-165 were averaged over biweekly intervals. Second, because no field data are available for days 0-15 relative to the predicted day, we supplemented these data with the 10-year (2000-2009) average water quality and meteorological variables of the corresponding period. Therefore, a total of 372 variables ((11 meteorological variables + 20 water quality variables) × 12 biweekly intervals) were prepared (Table 1).
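The lagged-variable construction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' processing code: the function name, the dictionary layout of the daily series, and the choice to average each lag over the 15 days starting at that lag are all hypothetical. Only the 15-day step, the 12 lags (0 to 165 days), and the use of a long-term average for lag 0 are taken from the text.

```python
from statistics import mean

STEP = 15      # biweekly interval in days (from the text)
N_LAGS = 12    # lags 0, 15, ..., 165 (from the text)

def lagged_features(daily, target_day, climatology_lag0):
    """Build one value per lag for a single variable.

    daily: dict mapping day index -> value (hypothetical layout).
    climatology_lag0: long-term average used for lag 0, because no field
    data exist yet for the 15 days before the predicted day.
    """
    features = {0: climatology_lag0}
    for lag in range(STEP, STEP * N_LAGS, STEP):
        # Assumed window: the 15 days starting `lag` days before the target.
        window = [daily[target_day - d] for d in range(lag, lag + STEP)]
        features[lag] = mean(window)
    return features

# Toy series in which the value equals the day index.
daily = {day: float(day) for day in range(400)}
feats = lagged_features(daily, target_day=300, climatology_lag0=1.5)
total_variables = (11 + 20) * N_LAGS  # 31 features x 12 lags
```

Applying the same function to each of the 31 water quality and meteorological features gives the 372 candidate variables mentioned in the text.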
To reduce the dimensionality of the input data and to determine the appropriate model inputs, a threshold was applied to the correlation coefficient (Table 1). Variables whose correlation coefficient with Chl-a was over 0.5 were considered relatively important and selected as inputs. Therefore, a total of 27 variables of 6 water quality features were selected.
Based on field experience, meteorological variables such as WS, SD and R were added as inputs because of their close relationship to Chl-a despite their low correlation coefficients (<0.5), which result from the typical non-linear relationship between variables and Chl-a. Meteorological variables whose correlation coefficient with Chl-a was over 0.3 were also selected as inputs. Therefore, 14 meteorological variables (5 WS variables: WS₀, WS₁₅, WS₃₀, WS₄₅, and WS₆₀; 5 SD variables: SD₀, SD₁₅, SD₃₀, SD₄₅, and SD₆₀; 4 R variables: R₃₀, R₄₅, R₆₀, and R₇₅) were selected.
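The threshold-based screening can be illustrated with a short Python sketch. It assumes that the threshold is applied to the absolute value of the Pearson correlation, which the text does not state explicitly. The function names, the toy series, and the "water"/"met" group labels are hypothetical; the 0.5 and 0.3 thresholds and the idea of retaining some variables on the basis of field experience come from the text.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def select_inputs(candidates, chl_a, thresholds, forced=()):
    """Keep variables above their group's threshold, plus forced ones."""
    selected = []
    for name, (group, series) in candidates.items():
        r = abs(pearson(series, chl_a))  # assumption: absolute correlation
        if name in forced or r > thresholds[group]:
            selected.append(name)
    return selected

# Toy data only; these numbers are not from the Yuqiao Reservoir.
chl_a = [1, 2, 3, 4, 5]
candidates = {
    "TP_15": ("water", [2, 4, 6, 8, 10]),  # r = 1.0, kept
    "DO_30": ("water", [2, 1, 3, 1, 3]),   # r ~ 0.32, below 0.5, dropped
    "SD_30": ("met", [2, 1, 3, 1, 3]),     # r ~ 0.32, above 0.3, kept
    "pH_15": ("water", [5, 1, 4, 2, 3]),   # r = -0.3, kept by expert choice
}
thresholds = {"water": 0.5, "met": 0.3}
selected = select_inputs(candidates, chl_a, thresholds, forced=("pH_15",))
```

In this toy case the same series is rejected as a water quality variable but accepted as a meteorological variable, which mirrors the two thresholds used in the text.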
pH was also selected as an input despite a low correlation coefficient, similar to the WS meteorological variables.
Considering the similarity of air pressure variables, 3 air pressure features (P, Pmax and Pmin) were reduced to one (P), and 8 air pressure variables were excluded (4 Pmax and 4 Pmin variables).
Precipitation was not considered in this study despite relatively good correlation coefficients (0.38-0.48) because precipitation events were rare and their values varied greatly, which might cause significant uncertainty in the prediction model.
Chla₀ was excluded because by definition it was the predicted Chl-a, i.e., the 10-year-average Chl-a. The use of Chla₀ might significantly influence the annual average Chl-a prediction model, making it less flexible to variations in water quality and meteorological conditions. Therefore, a total of 49 variables of 12 features (27 variables of 7 water quality features and 22 variables of 5 meteorological features) were selected for this study (Table 2).
(3) Configuration. Based on the above parameters, the Chl-a predictor was designed as shown in Fig. 2. The predictor comprises three parts: an input layer, an output layer and several hidden layers. Each layer contains several neurons. Each neuron receives inputs from neurons in the previous layers or from external sources and then converts the inputs either to an output signal or to another input signal for neurons in the next layer. The connections between neurons in successive layers were assigned weighted values, which represent the importance of that connection in the network.

Methodology
This section introduces the strategy used to develop a Chl-a predictor, which considers factors including choice of an appropriate training method, determination of adequate model inputs, and identification of suitable network architecture and parameters.

Training method
To identify which model is best suited for the Chl-a predictor, the following six widely used ANNs were compared: Back Propagation (BP), Probabilistic Neural Network (PNN), Modular Neural Networks (MNN), Jordan-Elman network, Self-Organizing Map (SOM), and Co-Active Neuro-Fuzzy Inference System (CANFIS). BP is probably the most widely used ANN and comprises a feed-forward multi-layer neural network in which connections can jump over one or more layers, and errors are propagated back to connections stemming from the input units. PNNs are nonlinear hybrid networks typically containing a single hidden layer of processing elements and use Gaussian transfer functions; all weights can be calculated analytically in these networks. MNNs combine the results from several parallel multilayer perceptrons. SOMs transform arbitrary dimensional inputs into a one- or two-dimensional discrete map considering topological constraints. CANFIS integrates adaptable fuzzy inputs with a modular neural network to rapidly and accurately approximate complex functions. These ANNs are described in detail in Liu et al. [20].
To benchmark the ANNs, they were compared to two typical traditional non-ANN methods: principal component analysis (PCA) and support vector machine (SVM). PCA is a widely used statistical method that identifies relatively few "features" or components that as a whole represent the full object state. SVM geometrically separates the training set using a hyperplane or more complex surfaces if necessary; it is a relatively new mathematical method that is widely used in modeling ecosystems.
The ANN predictors were implemented using the NeuroSolutions 6.31 software (www.neurosolutions.com) for the MATLAB neural network toolbox.

Model inputs
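To examine the feasibility of including meteorological variables in the Chl-a predictor, which usually uses only water quality data, three models were constructed and analyzed: one using only water quality variables (WQ), one using only meteorological factors (MF), and one combining water quality and meteorological variables (WF).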

Evaluation indices
The performance of the Chl-a predictor was measured first by computational cost and then by precision. The first evaluation index was based on the training time required, and the second index was based on the normalized mean square error (NMSE), the correlation coefficient (Corr), and the Nash-Sutcliffe coefficient of efficiency (NSE). These evaluation indices are described below.
In these indices, $y_{pi}$ is the predicted Chl-a value at moment $i$, and $y_i$ is the observed value; $N$ is the number of days sampled at intervals of 15 days; $\bar{y}$ is the average Chl-a value observed at all moments.
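For reference, two of the three indices can be computed with the short Python sketch below. It assumes the standard textbook definitions: Corr as the Pearson correlation between observed and predicted values, and NSE as one minus the ratio of the squared prediction error to the squared deviation of the observations from their mean. The function names and the toy values are hypothetical. NMSE is not included because the normalization used for it cannot be recovered from the text.

```python
from math import sqrt

def corr(obs, pred):
    """Pearson correlation between observed and predicted values."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov / sqrt(var_o * var_p)

def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the mean."""
    mean_obs = sum(obs) / len(obs)
    err = sum((o - p) ** 2 for o, p in zip(obs, pred))
    spread = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - err / spread

# Toy values in mg/l, not taken from the study.
observed = [0.1, 0.2, 0.3, 0.4]
shifted = [0.15, 0.25, 0.35, 0.45]   # right shape, constant bias
mean_only = [0.25, 0.25, 0.25, 0.25]  # always predicts the observed mean
```

The shifted series shows why both indices are reported: its Corr is 1 because the pattern is reproduced exactly, whereas its NSE is lower because NSE also penalizes the constant bias.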

Model parameters
Parameters such as the number of hidden layers, the number of neurons, and the learning rules were selected mainly on the basis of NMSE, Corr, and NSE performance, drawing on the experience of the researchers and several trial runs. The Chl-a predictor was trained for a maximum of 10000 supervised epochs, with an average MSE of less than 0.01 as the termination condition. The learning momentum of the 6 ANN models was set to 0.7. A hyperbolic tangent function was used as the transfer function for axons (TanhAxons) as follows: $f(x_i, w_i) = \tanh(x_i^{\mathrm{lin}})$, where $x_i^{\mathrm{lin}} = \beta x_i$ is the scaled and offset activity inherited from the linear axon. The same learning momentum and TanhAxons were used in the output layers for SVM and PCA.
Other parameters of the 8 Chl-a models with the same inputs are shown in Table 3. For the three models using different inputs, the parameters were identical to those of the chosen model shown in Table 3 except for the number of neurons in the input and hidden layers. The numbers of neurons in the input layer for the models WQ, MF and WF were 27, 22, and 49, respectively, whereas the numbers of neurons in the first hidden layer were 30, 20 and 40, respectively, and those in the second hidden layer were 25, 20 and 30, respectively.
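The training settings above can be illustrated on a toy problem. The Python sketch below fits a single tanh unit using gradient descent with momentum and the two stopping conditions from the text (at most 10000 epochs, average MSE below 0.01). It is not the NeuroSolutions implementation: the step size, the one-weight toy problem, and the function names are assumptions made for illustration only.

```python
import math

MAX_EPOCHS = 10000   # maximum supervised epochs (from the text)
MSE_GOAL = 0.01      # stop once the average MSE falls below this (from the text)
MOMENTUM = 0.7       # learning momentum (from the text)
STEP_SIZE = 0.1      # assumed; the step size is not reported in this section

def tanh_axon(x, beta=1.0):
    """Hyperbolic tangent transfer function applied to the scaled input."""
    return math.tanh(beta * x)

def momentum_step(weight, gradient, velocity):
    """One gradient-descent update with momentum."""
    velocity = MOMENTUM * velocity - STEP_SIZE * gradient
    return weight + velocity, velocity

# Toy problem (hypothetical): recover w = 0.5 in y = tanh(w * x).
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [tanh_axon(0.5 * x) for x in xs]

w, v = 0.0, 0.0
for epoch in range(MAX_EPOCHS):
    preds = [tanh_axon(w * x) for x in xs]
    mse = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    if mse < MSE_GOAL:
        break  # termination condition on the average MSE
    # d(MSE)/dw, using d tanh(u)/du = 1 - tanh(u)^2
    grad = sum(2 * (p - y) * (1 - p ** 2) * x
               for p, y, x in zip(preds, ys, xs)) / len(xs)
    w, v = momentum_step(w, grad, v)
```

The momentum term reuses a fraction of the previous update, which speeds up progress along consistent gradient directions; in a full BP network the same update is applied to every weighted connection.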

Training and validation
Since 1983, when the Yuqiao Reservoir became the only source of drinking water for Tianjin city, the greatest changes in water quality, weather conditions and ecosystem in the reservoir occurred during 1999-2012. These changes occurred because of increased nutrient input, significant reduction of runoff, change in water diversion patterns, and natural evolution of ecosystems, which were closely related to climate change, urban water consumption, and newly built water conservancy projects in the upper reaches. The Chl-a level varied from 0.00-0.35 mg/l in 1999-2009 and from 0.00-0.28 mg/l in 2010-2012. The factors influencing the aquatic ecosystems were similar in 2010-2012 and 1999-2009, and there were no extreme weather conditions or changes in water utility patterns. Therefore, the prediction model was developed using data from 1999-2009 and tested using data from 2010-2012, because approximately 80% of the samples are generally used for training and the rest for testing when developing ecological models.
Among the development data, seventy percent were randomly selected to train the model, and the remaining data were used for cross-validation. To avoid over-fitting the network, training was stopped if there was no improvement from the cross-validation process after 100 iterations. Weighted connection values were adjusted to minimize the RMSE between the desired and predicted outputs.
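The data partition and the stopping rule described above can be sketched in Python as follows. The year boundaries, the 70% training share, and the patience of 100 iterations are taken from the text; the record layout, the figure of 24 biweekly samples per year, the random seed, and the class and function names are assumptions made for illustration.

```python
import random

def split_samples(samples, last_dev_year=2009, train_fraction=0.7, seed=0):
    """Chronological test split, then a random train/cross-validation split."""
    dev = [s for s in samples if s["year"] <= last_dev_year]
    test = [s for s in samples if s["year"] > last_dev_year]
    shuffled = dev[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:], test

class EarlyStopping:
    """Signal a stop when the cross-validation error stops improving."""

    def __init__(self, patience=100):
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def update(self, cv_error):
        """Record one cross-validation error; return True to stop training."""
        if cv_error < self.best:
            self.best = cv_error
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience

# Hypothetical records: 24 biweekly samples per year for 1999-2012.
samples = [{"year": y, "index": i} for y in range(1999, 2013) for i in range(24)]
train, cross_val, test = split_samples(samples)

# Small patience so the behaviour is visible on a short error sequence.
stopper = EarlyStopping(patience=3)
decisions = [stopper.update(e) for e in [1.0, 0.9, 0.95, 0.95, 0.95]]
```

Keeping the test years entirely outside the shuffled development set means the 2010-2012 results measure performance on data the network never saw during training or cross-validation.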
Because the training data spanned most cases of extreme conditions in the Yuqiao Reservoir since 1983, and the validation data were appropriate to test the performance of the proposed model, the Chl-a predictor should illustrate variations in Chl-a levels corresponding to changes in the ecosystem, weather conditions, and water diversion plans. Furthermore, the proposed model used a greater number of appropriate water quality factors and incorporated meteorological factors as inputs, whereas traditional predictors only use a limited number of water quality factors; therefore, compared to most traditional Chl-a predictors, the proposed model should adapt better to variations in weather conditions and water diversion patterns. However, the performance of the model under new water diversion patterns and extreme weather conditions is unclear. The scope of this study did not include cases with limited water quality data and extreme conditions, which are low-probability events and occur randomly.

Sensitivity
To examine how the trained Chl-a predictor reacted to changes in each input, a sensitivity analysis was performed. Each input to the model was altered by 5%, 10% and 20%, and the corresponding change in output was calculated. For an input indicator to be considered sensitive, the corresponding output variation had to be greater than the input variation. A maximum input alteration of 20% was selected because some parameters such as pH and air pressure are relatively stable and vary by less than 20%.
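A perturbation analysis of this kind can be sketched in Python as follows. The 5%, 10% and 20% alterations and the criterion that the relative output change must exceed the relative input change come from the text. The toy model, the baseline values, the function names, the use of increases only, and the rule that one qualifying alteration is enough to flag an input are assumptions; the toy model is not the trained Chl-a predictor.

```python
def sensitivity(model, baseline, fractions=(0.05, 0.10, 0.20)):
    """Ratio of relative output change to relative input change per input."""
    base_out = model(baseline)
    report = {}
    for name in baseline:
        ratios = []
        for frac in fractions:
            perturbed = dict(baseline)
            perturbed[name] = baseline[name] * (1 + frac)  # increase only
            rel_change = abs(model(perturbed) - base_out) / abs(base_out)
            ratios.append(rel_change / frac)
        report[name] = ratios
    return report

def is_sensitive(ratios):
    """Flag an input if any alteration changes the output more than the input."""
    return any(r > 1 for r in ratios)

def toy_model(x):
    """Stand-in response: quadratic in pH, square-root in Ta (hypothetical)."""
    return x["pH"] ** 2 * x["Ta"] ** 0.5

baseline = {"pH": 8.0, "Ta": 25.0}
report = sensitivity(toy_model, baseline)
flags = {name: is_sensitive(ratios) for name, ratios in report.items()}
```

A ratio above 1 means the output moves proportionally more than the input, so in this toy case the quadratic input is flagged as sensitive and the square-root input is not.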

Results and Discussion
The performance of the Chl-a predictor was examined in three ways. First, 6 ANNs and 2 non-ANN predictors were compared to identify the appropriate model. Second, three ANN models with different input variables were developed to determine the feasibility of incorporating meteorological variables. Third, a sensitivity analysis was performed to examine how the trained network reacted to changes in each input.
Table 4 shows the results of the training, validation and testing of the eight Chl-a predictors. Except for the PNN method, all methods required less than 30 seconds of training. Training time was therefore not a limiting factor for most ANNs.

Comparison of ANN and non-ANN predictors
Except for the SVM method, the Corr between the observed and predicted Chl-a values of the ANNs was 0.524-0.880 in the testing period. This level of precision was consistent with similar studies on other water bodies, such as a correlation coefficient of 0.5-0.7 in Putrajaya Lake, Malaysia [21] and 0.77 in the Nakdong River Basin, South Korea [17]. The performance of the ANN predictors was largely satisfactory considering the difficulties encountered in modeling the Yuqiao Reservoir, which contains extensive submerged aquatic plants in addition to the complex physical, chemical, and biological processes observed in other water bodies. Furthermore, the long-term series training data extending over 11 years
In this study, all 6 ANNs outperformed non-ANN methods. For example, the NSE of ANNs during testing was 0.604-0.754, whereas the NSE of the PCA and SVM methods was 0.540 and 0.491, respectively. The failure of the PCA and SVM methods may result from the complex nonlinear nature of the Yuqiao Reservoir ecosystem.
Among the ANNs, the BP method best predicted Chl-a levels with NMSE, Corr and NSE values of 0.003 mg/l, 0.880, and 0.754, respectively, during testing. However, there was no clear advantage of one ANN over others because all 6 ANN models yielded acceptable results.
The SVM method is not suitable for Chl-a prediction because Corr was < 0.1 and NSE was < 0.50 during the training, validation and testing periods; this is potentially because the SVM method treats multi-category problems as a series of binary problems and may thus fail to capture the high variability of the aquatic system in the Yuqiao Reservoir.
In conclusion, the performance of the eight predictors indicated that ANNs, especially when trained by the BP method, are a powerful alternative to traditional modeling techniques for Chl-a prediction.
Table 5 shows the results of the three ANN models with different inputs of water quality and meteorological variables. The model with only meteorological factors (MF) as inputs always overestimated the concentration of Chl-a, whereas the model with only water quality variables (WQ) as inputs underestimated Chl-a, which was evident during the training period. Combining the water quality and meteorological variables (WF) greatly improved the performance of the Chl-a predictor by accurately detecting peak timing and magnitude. For example, the Corr of the WF model was 0.880, whereas the Corr of the WQ and MF models was only 0.574 and 0.686, respectively. The NSE of the WF model was 0.754, whereas the NSE of the WQ and MF models was 0.225 and 0.662, respectively.

Sensitivity
The sensitivity of the Chl-a predictor to water quality variables is shown in Fig. 4. The sensitivity decreased in the following order: pH > DO, Tw, PI > NO₃-N, TP and the prior Chl-a.
Tw has a short-term positive effect on the Chl-a concentration but a negative impact over longer durations. For example, the concentration of Chl-a increases with increasing water temperature in the preceding 30-75 days but reduces with increasing water temperature in the preceding 90-105 days. This may be because warm water promotes the growth of algae in the summer and Potamogeton crispus in the spring; excessive growth of Potamogeton crispus can inhibit the growth of algae by competing for nutrients and light.
Chl-a is very sensitive to pH variations, and the Chl-a concentration increases at twice the rate of pH change. This is potentially because a slight decrease in pH may significantly promote algal photosynthesis by increasing the dissolution of CO₂ in water.
A higher Chl-a value generally implies a higher level of DO. To some extent, the level of DO can indicate how much oxygen is produced by phytoplankton.
PI and TP have similar effects on Chl-a, and Chl-a is relatively more sensitive to the permanganate index than to TP. This is because PI can indicate the abundance of phytoplankton, whereas TP influences Chl-a indirectly by promoting the growth of phytoplankton.
Chl-a has similar sensitivity to NO₃-N and water temperature: Chl-a first increases then decreases with increasing NO₃-N.
The Chl-a concentration is closely related to the Chl-a level during the preceding 15-60 days. This indicates that algal seeds significantly influence growth and Chl-a levels in the subsequent two months.
The sensitivity of the Chl-a predictor to meteorological variables is shown in Fig. 5. The sensitivity decreased as follows: P > WS, SD, and R > Ta.
The Chl-a concentration increases rapidly as P decreases because low air pressure promotes floating and accumulation of algae on the water surface.
Chl-a is almost completely insensitive to changes in Ta; the Chl-a level varied by less than 5% when Ta was altered by 20%. This is because air temperature influences the aquatic system indirectly with water as a medium. WS has a short-term negative effect and a long-term positive effect on Chl-a levels. This is because strong wind promotes the release of nutrients from the sediment, which promotes Chl-a increase in a relatively slow manner; however, strong wind rapidly inhibits the growth and accumulation of algal particles.
Longer SD and greater R result in increased Chl-a because they increase energy input to the aquatic ecosystem, which promotes photosynthesis.
Sensitivity to meteorological variables is meaningful for short-term forecasts and for long-term prevention of algal blooms. For example, consecutive days with low air pressure, low wind speed and increasing SD and R in summer indicate a higher probability of algal bloom, which can help water quality management departments to implement advance countermeasures.

Conclusion
To develop an appropriate biweekly Chl-a predictor for the Yuqiao Reservoir, this study first compared several Chl-a predictors trained using different methods and then examined the feasibility of incorporating meteorological factors for prediction. In addition, a sensitivity analysis was performed to examine how the Chl-a predictor reacted to changes in each input. The following observations were made:
(1) ANN is a powerful predictive alternative to traditional modeling techniques for Chl-a prediction with Corr values of 0.524-0.880 in the testing period. The BP model yields better results compared to other ANN models.
(2) Combining the water quality and meteorological data greatly improves the performance of the Chl-a predictor compared to models using water quality or meteorological data alone as inputs; the Corr values increased from 0.574-0.686 to 0.880 when both inputs were combined.
(3) Among the meteorological variables, Chl-a is most sensitive to air pressure, followed by wind velocity, sunshine duration, total radiation, and air temperature. Chl-a is more sensitive to changes in pH compared to other water quality variables such as DO, water temperature, NO₃-N, TP and prior Chl-a values.