A novel cross-validation strategy for artificial neural networks using distributed-lag environmental factors

In recent years, machine learning methods have been applied to various prediction scenarios in time-series data. However, some processing procedures such as cross-validation (CV) that rearrange the order of the longitudinal data might ruin the seriality and lead to a potentially biased outcome. Regarding this issue, a recent study investigated how different types of CV methods influence the predictive errors in conventional time-series data. Here, we examine a more complex distributed lag nonlinear model (DLNM), which has been widely used to assess the cumulative impacts of past exposures on the current health outcome. This research extends the DLNM into an artificial neural network (ANN) and investigates how the ANN model reacts to various CV schemes that result in different predictive biases. We also propose a newly designed permutation ratio to evaluate the performance of the CV in the ANN. This ratio mimics the concept of the R-square in conventional statistical regression models. The results show that as the complexity of the ANN increases, the predicted outcome becomes more stable, and the bias shows a decreasing trend. Among the different settings of hyperparameters, the novel strategy, Leave One Block Out Cross-Validation (LOBO-CV), demonstrated much better results, and the lowest mean square error was observed. The hyperparameters of the ANN trained by the LOBO-CV yielded the minimum number of prediction errors. The newly proposed permutation ratio indicates that LOBO-CV can contribute up to 34% of the prediction accuracy.


Introduction
Numerous studies from different countries have identified environmental factors as key contributors to human mortality [1][2][3][4]. Extreme climate events now occur more frequently because of global warming, which has encouraged more research on the impact of temperature variations on health outcomes [5,6]. In addition to temperature, air pollutants such as CO, O3, CO2, and particulate matter (PM2.5 and PM10) also play an important role [7,8]. An air quality report in 2016 indicated that heart disease is a major cause of death in young adults and that 80% of the mortality is attributable to air pollution [9]. Previously, the Distributed-Lag Non-Linear Model (DLNM) [10] was the ideal strategy for dealing with environmental factors that have lag effects, such as temperature or air pollution, since the cumulative impact could be fitted into the same complex statistical model. The DLNM has been used to uncover the lag impact of temperature on mortality [11] as well as the delayed effects of air pollution [8].
Nowadays, machine learning and artificial neural networks (ANNs) [12] demonstrate superior prediction abilities compared to conventional logistic regression. ANNs have been applied in many research fields. In particular, an improved fuzzy neural network that predicts traffic speed has drawn great attention [13]. Recently, Tan et al. [14] also comprehensively examined both statistical and machine learning methods for incident clearance time prediction. Parameter tuning is crucial in machine learning, and cross-validation (CV) is the primary step for finding the optimal setting of hyperparameters without overfitting, where the tenfold CV is the most popular procedure [15,16]. The CV error is defined as $$\mathrm{CV} = \frac{1}{K}\sum_{k=1}^{K}\mathrm{MSE}_k,$$ where $K$ is the number of folds and $\mathrm{MSE}_k$ is the mean square error (MSE) for the k-th-fold dataset; the smallest CV error suggests the optimal setting of the hyperparameters. However, CV with time-series data raises a serious issue if each data point is randomly selected and then shuffled without keeping the time sequence. In this case, later data are used to predict earlier outcomes, which violates the serial pattern assumption. Therefore, h-block cross-validation was proposed [17]. However, this method causes information loss. Later, four strategies for CV were examined by Bergmeir et al. [18]: the fivefold CV, leave-one-out CV, h-block fivefold CV, and out-of-sample evaluation. Among these, the fivefold CV demonstrated the most satisfying results. Following this approach, recent research has focused on error bias estimates using the generalized linear model (GLM) and random forest (RF) methods [19].
To date, no machine learning method or artificial neural network, arguably the most popular artificial intelligence (AI) model, has been designed to deal with the DLNM. If lag environmental factors such as temperature or air pollution can be properly incorporated by an ANN, then the predictive model and its accuracy are expected to be substantially improved. Therefore, this study aims to develop a novel CV procedure for an ANN that incorporates the complex lag exposure based on the DLNM structure. When CV is conducted in machine learning or ANNs, one usually randomly splits the entire data into 10 unrelated sets. Here, we propose the opposite approach, which preserves the correlation inherent in time-series data with predictors and lag effects. We anticipate that the new strategy will outperform the fivefold CV [18].

Materials and methods
All-cause daily mortality in Taipei City from January 1, 2012, to December 31, 2016, was obtained from the Cause of Death Database published by the Ministry of Health and Welfare. The Institutional Review Board (IRB) of the National Yang-Ming University approved the use of the anonymous mortality data, and the study satisfied ethics guidelines. The approved IRB number was YM107045E. Daily mean temperature records were downloaded from the Taipei Weather Station. These freely available data are provided through the Central Weather Bureau (CWB) Observation Data Inquiry System website [20]. Air pollution data, including the daily mean ozone concentration and daily mean PM2.5 concentration, were downloaded from the Taipei Air Quality Monitoring System, which is maintained by the Environmental Protection Administration Executive Yuan website [21]. Some air pollutant measurements were missing; these observations were omitted because the missing rate was low and its impact on the analyses was negligible. Descriptive statistics are listed in Table 1. The DLNM models the current mortality, which is defined as follows:

PLOS ONE
The independent (predictive) variable ($x_t$) is the daily mean temperature, and the pollutant variables ($\mathrm{O}_{3t}$ and $\mathrm{PM2.5}_t$) are treated as potential confounders. The dependent (outcome) variable ($\mu_t$) is the all-cause daily mortality. The DLNM was fitted through a cross-basis function $s(x_t, l, \beta)$ that simultaneously describes the effect of the daily mean temperature $x_t$ and its lag structure, with maximum lag $l$, on the expected daily mortality. The daily mean ozone concentration $\mathrm{O}_{3t}$ and the daily mean PM2.5 concentration $\mathrm{PM2.5}_t$ were treated as fixed effects. A natural cubic spline $f(z_t^i; \theta)$ with eight degrees of freedom for each year was used to adjust for the seasonal effect. In general, the maximum exposure lag $l$ was 30. The cross-basis consists of a quadratic B-spline for temperature with knots placed at the 10th, 75th, and 90th percentiles and a natural cubic spline for the lag with 5 degrees of freedom, which indicates three internal knots equally spaced on the log scale.
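For orientation, the components described above fit the general DLNM template sketched below. This is a hedged reconstruction, not necessarily the exact fitted model: the link function $g$ (commonly a log link for daily death counts), the intercept $\alpha$, and the coefficients $\gamma_1$ and $\gamma_2$ are notational assumptions that do not appear in the text.

```latex
% Sketch only: g, \alpha, \gamma_1 and \gamma_2 are assumed notation.
g(\mu_t) = \alpha + s(x_t, l, \beta)
         + \gamma_1 \,\mathrm{O}_{3t} + \gamma_2 \,\mathrm{PM2.5}_t
         + f(z_t^i; \theta)
```

Here $s(x_t, l, \beta)$ is the temperature cross-basis with maximum lag $l$, the two pollutant terms enter linearly as fixed effects, and $f(z_t^i; \theta)$ is the seasonal spline.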
Instead of the DLNM, we aim to implement an ANN to accommodate such complex structures with a large number of correlated predictors by treating all predictors in the DLNM as input neurons. Previous (lagged) mortalities were also included as additional input neurons in the ANN. In this way, the ANN could assess whether variations in previous mortality records affect the mortality outcome on the current day. The ANN thus treats a large set of predictors as different neurons in the input layer, and its training process can capture nonlinear associations and provide a satisfactory prediction of the mortality outcome. By contrast, the DLNM ignores lag mortality records. Since the compiled dataset spans 2012 to 2016, the year variable is coded as five indicator variables, and the month variable is coded as 12 indicator variables. Weekday, weekend, and holiday are three further indicator variables. Up to 30 lags of the temperature variable and of mortality were considered by the ANN.
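The lagged inputs described above can be assembled along the following lines. This is a minimal sketch, not the authors' code: the helper names `lag_matrix` and `build_inputs` are hypothetical, lags are assumed to run from 1 to 30 days, and the calendar indicators and pollutant variables are omitted for brevity.

```python
import numpy as np

def lag_matrix(series, max_lag):
    """Return an (n - max_lag, max_lag) array whose column j-1 holds
    the value observed j days before each retained day."""
    series = np.asarray(series, dtype=float)
    n = series.shape[0]
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the series length")
    # For lag j, align series[t - j] with every retained day t >= max_lag.
    cols = [series[max_lag - j : n - j] for j in range(1, max_lag + 1)]
    return np.column_stack(cols)

def build_inputs(temperature, mortality, max_lag=30):
    """Stack lagged temperature and lagged mortality as input features.
    Assumption: only these two lag sets are built here; other predictors
    would be appended as extra columns."""
    X = np.hstack([lag_matrix(temperature, max_lag),
                   lag_matrix(mortality, max_lag)])
    # The first max_lag days have incomplete lag history and are dropped.
    y = np.asarray(mortality, dtype=float)[max_lag:]
    return X, y
```

Each row of `X` then corresponds to one day, with the first 30 columns holding past temperatures and the next 30 holding past mortality counts.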
Regarding the number of hidden layers, the ANN was fitted with two or three hidden layers since one layer may not be suitable for such a complex distributed-lag time series. For an ANN with two hidden layers, the number of neurons in each layer was 6, 12, 24, 36, 48, or 60, which resulted in 36 combinations. For an ANN with three hidden layers, owing to the limitation of running time, only three scenarios were considered: (6, 6, 6), (36, 36, 36), and (60, 60, 60). As the number of hidden layers and the number of neurons increase, the number of parameters to be estimated in the ANN model increases dramatically (Fig 1). The smallest number of parameters is 2,007, for two hidden layers with 6 neurons in each layer. The largest number of parameters is 11,121, when three hidden layers with 60 neurons in each layer are trained. Therefore, four or more hidden layers are not practical, and these scenarios were not considered.
According to Bergmeir et al. [18], the fivefold CV was the best performer. Hence, it is the only existing strategy compared with the new methods. The first novel algorithm is Leave-One-Block-Out Cross-Validation (LOBO-CV), and the second is Temporal-Block Cross-Validation (TB-CV). The concept of the three CV schemes is displayed in Fig 2. LOBO-CV utilizes all distributed-lag time series since part of the training set occurred before the testing set. LOBO-CV has a concept similar to that of leave-one-out cross-validation, but each iteration leaves out a group of consecutive time points, which results in fewer operations. TB-CV is based on LOBO-CV but avoids unreasonable predictions that use later data. Hence, the sample size of TB-CV is significantly reduced compared to LOBO-CV, especially in the first sequential block.
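The hyperparameter grid described above (36 two-hidden-layer combinations plus three three-hidden-layer scenarios) can be enumerated as in the following sketch. The variable names are illustrative only.

```python
from itertools import product

# Candidate numbers of neurons per hidden layer, as listed in the text.
NEURONS = (6, 12, 24, 36, 48, 60)

# Two hidden layers: every ordered pair of candidate sizes.
two_layer = list(product(NEURONS, repeat=2))

# Three hidden layers: only the three scenarios considered in the study.
three_layer = [(6, 6, 6), (36, 36, 36), (60, 60, 60)]

grid = two_layer + three_layer
print(len(two_layer), len(grid))  # 36 39
```

Ordered pairs are used because (24, 48) and (48, 24) are different architectures.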
The first 70% of the time-series data forms the training set, and the remaining 30% is used as the testing set. Within the training set, fivefold CV, LOBO-CV, and TB-CV are implemented, and the CV error is the average of the five validation errors.
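The three CV schemes and the averaged CV error can be sketched as index splits. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, five equal contiguous blocks are assumed, and because the handling of the first block in TB-CV is not specified, this sketch simply skips it (yielding four folds).

```python
import numpy as np

def chronological_split(n, train_frac=0.7):
    """First 70% of time points for training, the rest for testing."""
    cut = int(round(n * train_frac))
    return np.arange(cut), np.arange(cut, n)

def random_kfold(n, k=5, seed=0):
    """Conventional K-fold: indices are shuffled, ignoring time order."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    return [(np.sort(np.concatenate(folds[:i] + folds[i + 1:])),
             np.sort(folds[i])) for i in range(k)]

def lobo_cv(n, k=5):
    """LOBO-CV: hold out one contiguous block, train on all other blocks."""
    blocks = np.array_split(np.arange(n), k)
    return [(np.concatenate(blocks[:i] + blocks[i + 1:]), blocks[i])
            for i in range(k)]

def tb_cv(n, k=5):
    """TB-CV: train only on blocks that precede the validation block.
    Assumption: the first block has no earlier data and is skipped."""
    blocks = np.array_split(np.arange(n), k)
    return [(np.concatenate(blocks[:i]), blocks[i]) for i in range(1, k)]

def cv_error(splits, fit_predict, X, y):
    """Average the validation MSE over all folds."""
    fold_mse = []
    for train_idx, val_idx in splits:
        pred = fit_predict(X[train_idx], y[train_idx], X[val_idx])
        fold_mse.append(np.mean((y[val_idx] - pred) ** 2))
    return float(np.mean(fold_mse))
```

Here `fit_predict` is a placeholder for any routine that trains a model on the training fold and returns predictions for the validation fold, for example a wrapper around an ANN.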
The validity of the three CV methods was further evaluated by a permutation study. We define a new statistic, the permutation ratio (PR), to assess the change in CV error between the null hypothesis of no association between mortality and the set of complex correlated predictors and the alternative hypothesis, which is the observed situation. The CV error obtained after permutation is compared with that obtained without permutation, that is, from the original training data.
The PR is defined as $$\mathrm{PR} = \frac{\text{CV error of the original data}}{\text{CV error of the permuted data}}.$$ The rationale is that if the ANN does not contribute to the prediction accuracy, then the PR would be close to 1. The higher the PR, the lower the contribution of the ANN to the prediction accuracy. If the PR is close to 0, then the results suggest that the training process of the ANN based on such a CV scheme has the highest impact on prediction accuracy, and all variations are explained by the model. After the best hyperparameters and the optimal CV strategy are determined by the minimum CV error, the ANN can be applied to the testing dataset to obtain a fair and robust estimate of prediction accuracy.
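The PR calculation can be illustrated with a short sketch. It assumes that the PR is the CV error on the original data divided by the CV error on permuted data, which is consistent with the interpretation given above, and that the null hypothesis is simulated by shuffling the outcome; the function names are hypothetical.

```python
import numpy as np

def permute_outcome(y, seed=0):
    """Shuffle the outcome to break its association with the predictors.
    Assumption: the permutation is applied to the outcome vector."""
    rng = np.random.default_rng(seed)
    return rng.permutation(np.asarray(y))

def permutation_ratio(cv_error_original, cv_error_permuted):
    """PR near 1: the model adds little; PR near 0: the model explains most."""
    return cv_error_original / cv_error_permuted

def contribution(pr):
    """Share of prediction accuracy attributed to the model, 1 - PR."""
    return 1.0 - pr
```

For example, a PR of 0.656 corresponds to a contribution of 1 - 0.656 = 0.344, that is, 34.4%.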

Results
In the ANN with two hidden layers, we combined the numbers of neural nodes in the two layers from 6, 12, 24, 36, 48, and 60 and observed how the results changed across combinations of hyperparameters. Tables 2, 3 and 4 present the results of the two-hidden-layer architecture under the three types of CV with different hyperparameters. Comparing the results of the three types of CV, we found that as the complexity of the model increases, the predicted results improve with a lower MSE. Regarding the three strategies, fivefold CV, LOBO-CV, and TB-CV, the best performing
In the case of two hidden layers, the permutation-based CV errors are listed in Tables 5, 6 and 7, where the null hypothesis of no association is simulated. The comparative performance of the three schemes is similar, but the MSE increases to between 80 and 82 for most sets of hyperparameters. The average MSEs of the fivefold CV, LOBO-CV, and TB-CV are 82.03, 80.016, and 83.279, respectively. The results match our expectation that under the null hypothesis, the three schemes should yield similar errors, and the errors should be higher than the CV errors of the original data.
After obtaining both the CV error and the permutation-based CV error, the PR is easily calculated, and a lower value represents a better outcome (Table 8). Across the different sets of hyperparameters, the fivefold CV and LOBO-CV demonstrated better results in most cases. Among all 36 sets of hyperparameters, the fivefold CV was the best performer in 11 sets, LOBO-CV had the best PR in 22 sets, and TB-CV was best in three sets. The best hyperparameters for the fivefold CV, LOBO-CV, and TB-CV are (24, 48), (36, 24), and (36, 6), respectively, and the corresponding best PRs are 0.697, 0.656, and 0.705. The difference between 1 and the PR indicates the contribution of the ANN, which means that the trained neural networks based on the three CV strategies contribute 30.3%, 34.4%, and 29.5% of the prediction accuracy, respectively.
In the ANN with three hidden layers, we used (6, 6, 6), (36, 36, 36), and (60, 60, 60) as three sets of hyperparameters to observe the changing patterns (Table 9). The results suggest that the set (60, 60, 60) performs best, which is consistent with the conclusion that more complex ANN models have better predictive results. The lowest MSEs of the fivefold CV, LOBO-CV, and TB-CV were 59.504, 54.762, and 65.626, respectively. It is worth noting that LOBO-CV is the best performer under all circumstances. Under permutation, all MSEs are between 81 and 83, which is similar to the results for two hidden layers but with a narrower range. For the permutation error as well, the three methods consistently give the best result with the set (60, 60, 60). The best PRs for the fivefold CV, LOBO-CV, and TB-CV are 0.79, 0.85, and 0.67, respectively. In summary, the ANN model contributes approximately 21%, 15%, and 33% of the prediction accuracy, respectively. In the simulations for three hidden layers, there is no case in which the PR is greater than 1, unlike the two-hidden-layer cases, where such values were observed. This means that the three-hidden-layer model is more robust for such a complex data structure.
After the ANN was trained with the optimal hyperparameters, we evaluated the three CV strategies with the two-hidden-layer ANN since its set of hyperparameters is more detailed than that of the three-hidden-layer ANN. The hyperparameters selected for fivefold CV, LOBO-CV, and TB-CV are (60, 12), (24, 60), and (36, 6), respectively. The MSEs were 116.36, 109.77, and 112.73, respectively. LOBO-CV remains the best performer in the testing dataset, which is consistent with the results of the training dataset.

Discussion
This research aims to explore the stability and accuracy of the ANN under different CV methods to address lag effects. This study proposed two novel CV strategies and compared their performances with the strategy recommended by Bergmeir et al. [18], who stated that in pure time-series data, the randomly selected K-fold CV is the best. The performance comparisons were completed by computer simulation, and a novel PR value, which can be viewed as the R-square of the ANN, was proposed to evaluate the comparisons. The three methods represent different trade-offs between forward-looking bias and the integrity of the data. Therefore, the simulation results are expected to be a reference for the future use of ANNs to predict time-series data with or without lag effects.
In the simulation results, we can see that as the complexity of the model increases, a better performance of the ANN is observed. This trend is consistent in both the two-hidden-layer and three-hidden-layer ANNs.
However, it is worth noting that the best hyperparameters of the model in each layer are often not the points with the highest model complexity in each layer. We believe that this is because we limited the training of the neural network to 50,000 iterations, which may not be enough for the model to reach its best solution. That is, training stopped on reaching the preset number of iterations. Therefore, the model results may be affected by different starting points, resulting in a jittery decrease in the final predicted MSE rather than a strict decrease as the model complexity increases.
In addition, we found that as the model complexity increases from two to three hidden layers, the model performs better in terms of stability and prediction accuracy. Taking LOBO-CV as an example, the hyperparameters of two hidden layers are (12, 12), (36, 36), and (60, 60), and the MSEs are 74.006, 66.154, and 59.774, respectively. For three hidden layers at (12, 12, 12), (36, 36, 36), and (60, 60, 60), the MSE becomes 64.918, 69.642, and 54.762, respectively. Although the performance for (36, 36, 36) is poor, the overall stability and prediction rate are better than those of the two-hidden-layer model. We believe that this result is due to the complexity of the DLNM. After all, there are 54 variables in the ANN input neurons, with 24 independent variables and 30 lag temperature variables. When the amount of data is sufficient and training is long enough, a more complex model should be more capable of estimating such complex data. Hence, the three-hidden-layer model exhibits better performance.
Comparing the three CV methods with each other, we found that when the model complexity is low, the fivefold CV and LOBO-CV methods each have their own advantages and disadvantages. However, as the complexity of the model increases, the performance of LOBO-CV becomes significantly better than that of the fivefold CV. Across the final 39 groups of hyperparameters, LOBO-CV has the lowest MSE in 25 groups, demonstrating that under the DLNM structure, LOBO-CV is a better strategy for the ANN model. Finally, TB-CV showed poor predictive accuracy in most of the parameter settings. We speculate that this may be due to the loss of important information because the data points after the validation block are eliminated to preserve the temporal structure.
In this study, a total of 39 different hyperparameter scenarios were simulated for CV performance, plus permutation simulations of 39 hyperparameters. Because the ANN needs both forward and backward propagations to enhance accuracy, the time required for model training will be lengthier than that of other common machine learning models such as random forest [22] or Support Vector Machine (SVM) [23].
Taking the parameters (6, 6) of the two-hidden-layer neural network with the lowest model complexity as an example, it takes 34,502 s to complete the three types of CV, which is approximately 10 h. Therefore, when conducting similar studies, researchers may need to consider the time spent by simulations in advance.

Conclusions
This study makes three contributions: 1) a new CV scheme that generates the minimum error for an ANN model, 2) a new permutation ratio with which one can interpret the share of errors reduced by the ANN model, and 3) the first attempt to extend the DLNM into the ANN structure by treating all predictors, including the lag temperatures, as input neurons. In the extended ANN model, lag mortalities can be included as additional input neurons. In this manner, the ANN can readily utilize lag mortalities, in contrast to the DLNM, which only assesses the variabilities in the current mortality.
The limitations of this research are as follows. Owing to the tremendous amount of computer running time required, the scenarios for the three-hidden-layer ANN were more limited than those for the two-hidden-layer ANN. Lag temperatures were used in the ANN model, but the lag effects of air pollution and other factors were not considered. Although these limitations did not affect the superiority of the new LOBO-CV over previous CV schemes, more complicated input neurons in the ANN would still be informative.
In future studies, a more detailed grid search for the optimal hyperparameters is desirable. In addition, this study used all-cause mortality because disease-specific deaths could not be obtained. Studies with health outcomes related to temperature or air pollution could contribute more significantly to the clinical applications of this novel LOBO-CV for ANNs.