Statistical and Machine Learning forecasting methods: Concerns and ways forward

Machine Learning (ML) methods have been proposed in the academic literature as alternatives to statistical ones for time series forecasting. Yet, scant evidence is available about their relative performance in terms of accuracy and computational requirements. The purpose of this paper is to evaluate such performance across multiple forecasting horizons using a large subset of 1045 monthly time series used in the M3 Competition. After comparing the post-sample accuracy of popular ML methods with that of eight traditional statistical ones, we found that the former are dominated across both accuracy measures used and for all forecasting horizons examined. Moreover, we observed that their computational requirements are considerably greater than those of statistical methods. The paper discusses the results, explains why the accuracy of ML models is below that of statistical ones and proposes some possible ways forward. The empirical results found in our research stress the need for objective and unbiased ways to test the performance of forecasting methods that can be achieved through sizable and open competitions allowing meaningful comparisons and definite conclusions.


Introduction
Artificial Intelligence (AI) has gained considerable prominence over the last decade fueled by a number of high profile applications in Autonomous Vehicles (AV), intelligent robots, image and speech recognition, automatic translations, medical and law usage as well as beating champions in games like chess, Jeopardy, GO and poker [1]. The successes of AI are based on the utilization of algorithms capable of learning by trial and error and improving their performance over time, not just by step-by-step coding instructions based on logic, if-then rules and decision trees, which is the sphere of traditional programming.
In light of the above, AI found applications in the field of forecasting and a considerable amount of research has been conducted on how a special class of it, utilizing Machine Learning methods (ML) and especially Neural Networks (NNs), can be exploited to improve time series predictions. Literally hundreds of papers propose new ML algorithms, suggesting methodological advances and accuracy improvements [2][3][4][5][6][7][8]. Yet, limited objective evidence is available a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 required information. We consequently, emailed this author but we never got a reply. Not being able to replicate the result of [18] and not finding research studies comparing ML methods with alternative ones we decided to start the research leading to this paper.

The accuracy of ML methods: A brief review and discussion
The first application of NNs (as ML methods were called at that time and also sometimes today) in forecasting dates back to 1964 but did not achieve much follow-up until the technique of backpropagation was introduced almost 20 years later [19]. Since then, there have been numerous studies utilizing NNs and some of them comparing their accuracy to traditional, statistical ones. A good number of these studies, going back to 1995, are summarized in the work of [15] who concluded: "The outcome of all of these studies has been somewhat mixed". A similar conclusion was reached by [9] who evaluated 48 NN studies and stated that their accuracy in comparison to statistical methods provided mixed results. What characterized all these studies, however, was the limited number of series employed in the comparisons.
The first large scale study, using 3003 time series, dates back to the M3 Competition published in 2000 by [16] that included an Automated Artificial NN (AANN) method which, accuracy-wise, did average in comparison to the traditional statistical ones included in the Competition and below the most accurate ones (see Table 1). Eleven years later, Crone, Hibon and Nikolopoulos (C-H-N) published the results of a specialized NN competition, using a subset of the M3 monthly data [12]. In this competition they compared 22 NN and CI (Computational Intelligence) methods, in addition to 11 statistical ones. Their conclusion was that no ML method outperformed the Theta method [20], the most accurate one in the M3 Competition, and that only one [21] was more accurate than Damped trend exponential smoothing [22] when the symmetric Mean Absolute Percentage Error (sMAPE) for the average of all 18 forecasting horizons was used. However, four NNs did better than the AANN of the M3 Competition denoting improvements in the accuracy of newer ML methods. Overall, however, the accuracy of the NNs was not exceptional, vis-à-vis those of the M3 Competition, or the 11 statistical methods that were included in the (C-H-N) study (see Table 2).
ML methods have been gaining prominence over time as interest in AI has been rising. They are used to predict financial series [18,23], the direction of the stock market [24], macroeconomic variables [25], accounting balance sheet information [26] and a good number of other applications, covering a wide range of areas [27]. A major purpose of this study is to determine, empirically, if their performance exceeds that of statistical methods and how their advantages could be exploited to improve forecasting accuracy. What seems certain is that Chatfield's prediction of NNs becoming a "breakthrough or passing fad" will not be realized  [10]. Their performance cannot be classified yet as a breakthrough but at the same time they are still used while there are indications that such usage will increase over time as newer ML methods are introduced and more ways are being devised to improve their accuracy [15,28] and computational efficiency.

The major contribution of this paper: Comparing the performance of ML forecasting methods with traditional statistical ones
As highlighted in the previous section, [15]  (for more information regarding these models see the work of [15,29,30] as well as our own descriptions in section 3.3 below). In their paper, Ahmed and co-authors used a subset of 1045 series (the same ones being used in our study), selected from the monthly ones of the M3 Competition so that they have a length of between 81 and 126 months. However, before computing the 18 forecasts, they preprocessed the series in order to achieve stationarity in their mean and variance. This was done using the log transformation, then deseasonalization and finally scaling, while first differences were also considered for removing the component of trend. Consequently, they calculated one-step-ahead forecasts for each one of the 1045 series. The sMAPE and ranking of the eight ML methods can be seen in Table 3 (for details of how the preprocessing was done, how the forecasts were produced and how the accuracy measures were computed, see the paper by [15]). As seen, the most accurate ML method is the MLP, the next one is the BNN and the third the GP. The sMAPE of the remaining methods is at a double digit indicating a distinct difference in their accuracy. What would be of considerable research value is to investigate the reasons for the differences in accuracy among the eight ML methods and come up with guidelines of selecting the most appropriate one for new types of forecasting applications. The major contribution of this paper is to extend Ahmed and co-authors' study in six directions: First, we included eight statistical methods in order to compare the accuracy of the ML methods with that of the statistical ones. Second, we included two additional ML methods that have become popular during the last years [17]. These methods are the Recurrent Neural Network (RNN) and the Long Short Term Memory network (LSTM). Third, we used an additional accuracy measure introduced in recently, the MASE [31], to ensure that the same conclusions apply for such an alternative. Fourth, we used three different approaches for obtaining 18-step-ahead forecasts, not just the one-step-ahead ones computed by Ahmed and co-authors, to test the forecasting performance of ML methods for longer horizons. Fifth, we recorded a measure of computational complexity to determine the amount of computer time required by each method to develop the model and obtain the forecasts. Finally, we estimated a Model Fit measure to determine how well each method was fitted to the n − 18 historical data that was used for training and parameter estimation. The purpose of such a measure was to determine the relationship between the accuracy of model fitting and that of the post sample forecasting accuracy in order to figure out possible over-fitting that may decrease the accuracy of the post-sample predictions.

Accuracy measures
Two accuracy measures are used in this paper: The symmetric Mean Absolute Percentage Error (sMAPE) and the Mean Absolute Scaled Error (MASE). The first measure, which was originally used in the M3 Competition for evaluating the participating methods, is defined as follows: where k is the forecasting horizon, Y t are the actual observations andŶ t the forecasts produced by the model at point t. Since sMAPE penalizes large positive errors more than negative ones  [32], the MASE was also introduced to complement the former [31]. This is defined as follows: where n is the number of the available historical observations and m is the frequency of the time series. MASE, among its other characteristics, is independent of the scale of the data, its value being less than one if the forecast is more accurate than the average model fitting prediction of the seasonal Naive benchmark, and greater than one if it is less accurate.
We note that, in order to estimate the average forecasting accuracy more indicatively,Ŷ t were computed ten times and the average of the errors produced was used to avoid problems caused by the selection of the initial values chosen for the parameterization of the ML methods [33]. In this paper the first n-18 observations were used for training/validating the models, and the last 18 for testing their forecasting accuracy (following the same procedure as that of the M Competitions).

Computational complexity and model fitting
Computational Complexity (CC) is used to determine the time needed to train a given model and use it for extrapolation. Thus, CC can be simply defined as the mean computational time required by the model to predict a time series, divided by the corresponding time needed by the Naive method to achieve the same task. In this regard, we end up with a relative metric indicating the additional, proportional time required for obtaining the forecasts from the more complex methods. Computational time was estimated using a system with the following characteristics: Intel 1 Core™ i7-4790 CPU @ 3.60GHz, 8.00 GB RAM, x64 based processor.
Finally, the accuracy of how well a model fits (MF) the historical data is defined as Expression (4) is actually the Mean Squared Error (MSE) of the n-k model fit forecasts, normalized by the mean value of the time series being examined.

Statistical and ML methods utilized
To compare the performance of ML methods to traditional statistical ones, we included the six most accurate methods of the M3 Competition plus a naive benchmark (the first method), Naive 2, which is actually a random walk model adjusted for seasonality. The second method is Simple Exponential Smoothing (SES) [34], aimed at predicting series without a trend. The third and fourth, Holt and Damped exponential smoothing [22], are most appropriate for time series with trend. The fifth is a combination (average) of the three exponential smoothing methods just described: SES, Holt and Damped (Comb), aimed at achieving the possible benefits of averaging the errors of multiple forecasts [35]. The sixth model is the Theta method [20], which achieved the best overall sMAPE in the original M3 Competition. Finally, the seventh and eighth models are automatic model selection algorithms for ARIMA [36] and exponential smoothing (ETS) [37] models which recent studies have shown to be of considerable accuracy. Moreover, the former serves as a good benchmark to compare ML models, being the linear form of the most popular neural network, the Multi-Layer Perceptron [38]. For an analytical description of the statistical methods used in this study see [39].
A brief description of the same eight ML methods used by Ahmed and co-authors as well as by this study is provided next. Additionally, RNN [40] and LSTM [41] that have recently attracted a lot of interest in the forecasting field, are also included in this study and their accuracy is compared with those of the other methods.
3.3.1 Multi-Layer Perceptron (MLP). First, a single hidden layer NN is constructed. Then, the best number of input nodes N = [1, 2, . . ., 5] is defined by using a 10-fold validation process, with the inputs being observations Y t−5 , Y t−4 , Y t−3 , Y t−2 , and Y t−1 for predicting the time series at point t, and doing so for all the n − 18 data. Third, the number of the hidden nodes is set to 2N + 1 following the practical guidelines suggested by [42] aimed at decreasing the computational time needed for constructing the NN model (the number of the hidden layers used is typically of secondary importance [2]). The Scaled Conjugate Gradient method [43] is then used instead of Standard Backpropagation for estimating the optimal weights. The method, which is an alternative of the famous Levenberg-Marquardt algorithm, has been found to perform better in many applications and is considered more appropriate for weight optimization. The learning rate is selected between 0.1 and 1, using random initial weights for starting the training process with a maximum of 500 iterations. Finally, to maximize the flexibility of the method, although the activation function of the hidden layer is a logistic one, a linear function is used for the output nodes. This is crucial since, if a logistic output activation function is used for optimizing trended time series, it is bounded and "doomed to fail" [28]. In addition, due to the nonlinear activation functions, the data is scaled between 0 to 1 to avoid computational problems, meet algorithm requirement and facilitate faster network learning [2]. The linear transformation is as follows: Once all predictions are made, the forecasts are then rescaled back to the original scale.
Having defined the architecture of the optimal neural network, 100 MLP models were additionally trained and used to extrapolate the series. The mean, median and mode of the individual forecasts were then used as the final forecasts. This is done to evaluate the possible benefits of forecast combination, extensively reported in the literature, especially for the case of neural networks which are characterized by great variations when different initial parameters are used [44]. Yet, given that the gains in accuracy for combining multiple forecasts are negligible and the complexity is almost double, we do not present all the results as doing so is unnecessary.
The MLP method is constructed using the mlp function of the RSNNS R statistical package [45].

Bayesian Neural Network (BNN).
The BNN is similar to the MLP method but optimizes the network parameters according to the Bayesian concept, meaning that the weights are estimated assuming some a priori distributions of errors. The method is constructed according to the suggestions provided by [46] and [47]. The Nguyen and Widrow algorithm [48] is used to assign initial weights and the Gauss-Newton algorithm to perform the optimization. Similar to the MLP method, the best number of input nodes N = [1, 2, . . ., 5] is defined using a 10-fold validation process and the number of the hidden nodes is set to 2N + 1. A total number of 500 iterations are considered and the data are linearly scaled.
The BNN method is constructed exploiting the brnn function of the brnn R statistical package [49].

Radial Basis Functions (RBF).
RBF is a feed-forward network with one hidden layer and is similar to the MLP method. Yet, instead of using a sigmoid activation function, it performs a linear combination of n basis functions that are radially symmetric around a center. Thus, information is represented locally in the network, which allows the method to be more interpretable and faster to compute. Like the previous approaches, the best number of input nodes N = [1, 2, . . ., 5] is defined using a 10-fold validation process and the number of the hidden nodes is automatically set to 2N + 1. A total number of 500 iterations are considered and the data are linearly scaled. The output activation function is the linear one.
The RBF method is constructed exploiting the rbf function of the RSNNS R statistical package [45].

Generalized Regression Neural Networks (GRNN).
The GRNN method, also called the Nadaraya-Watson estimator or the kernel regression estimator, is implemented by the algorithm proposed by [50]. In contrast to the previous methods, GRNN is nonparametric and the predictions are found by averaging the target outputs of the training data points according to their distance from the observation provided each time. The sigma parameter, which determines the smoothness of fit, is selected together with the number of the inputs N using the 10-fold validation process. The inputs, linearly scaled, vary from 1 to 5 and the sigma from 0.05 to 1, with a step of 0.05.
The GRNN method is constructed exploiting the guess, learn and smooth functions of the grnn R statistical package [51].

K-Nearest Neighbor regression (KNN).
KNN is a nonparametric regression method basing its forecasts on a similarity measure, the Euclidean distance between the points used for training and testing the method. Thus, given the N inputs, the method picks the closest K training data points and sets the prediction as the average of the target output values for these points. The K parameter, which determines the smoothness of fit, is once again optimized together with the number of inputs using the 10-fold validation process. The inputs, which are linearly scaled, may vary from 1 to 5 and the K from 2 to 10.
The KNN method is constructed exploiting the KNN function of the class R statistical package [52].

CART regression trees (CART).
CART is a regression method based on tree-like recursive partitioning of the input space [53]. The space specified by the training sample is divided into regions, called the terminal leaves. Then, a sequence of tests is introduced and applied to decision nodes in order to define in which leave node an object should be classified based on the input provided. The tests are applied serially from the root node to the leaves, until a final decision is made. Like the previous approaches, the total number of input nodes N = [1, 2, . . ., 5] is defined using a 10-fold validation process and are then linearly scaled.
The CART method is constructed exploiting the rpart function of the rpart R statistical package [54].

Support Vector Regression (SVR)
. SVR is the regression process performed by a Support Vector Machine which tries to identify the hyperplane that maximizes the margin between two classes and minimize the total error under tolerance [55]. In order for an efficient SVM to be constructed, a penalty of complexity is also introduced, balancing forecasting accuracy and computational performance. Since in the present study accuracy is far more important than complexity, forecasts were produced using an -regression SVM which maximizes the borders of the margin under suitable conditions to avoid outlier inclusion, allowing the SVM decide the number of the support vectors needed. The kernel used in training and predicting is the radial basic one, mainly due to its good general performance and the few parameters it requires. Following the suggestions of [15], is set equal to the noise level of the training sample, while the cost of constraints violation C is fixed to the maximum of the target output values, which is 1. Then, the γ parameter is optimized together with the total number of inputs N set for the method, using a 10-fold validation process. The inputs are linearly scaled as in the previous methods described.
The SVR method is constructed exploiting the SVM function of the e1071 R statistical package [56].

Gaussian Processes (GP).
According to GP, every target variable can be associated with one or more normally distributed random variables which form a multivariate normal distribution, emerging by combining the individual distributions of the independent ones [57]. In this respect, Gaussian processes can serve as a nonparametric regression method which assumes an a priori distribution for the input variables provided during training, and then combines them appropriately using a measure of similarity between points (the kernel function) to predict the future value of the variable of interest. In our case, the input variables are the past observations of the time series, linearly scaled, while their total number N = [1, 2, . . ., 5] is defined using a 10-fold validation process. The kernel function used is the radial basis one, while the initial noise variance and the tolerance of termination was set to 0.001 given that, as suggested by [15], it would be computationally prohibitive to use a three-dimensional 10-fold validation approach to define them.
The GP method is constructed exploiting the gausspr function of the kernlab R statistical package [58].

Recurrent Neural Network (RNN).
Simple RNN, also known as Elman network [59], has a similar structure to the MLP, but contains feedback connections in order to take into account previous states along with the current input before producing the final output(s). This is done by saving a copy of the previous values of the layer containing the recurrent nodes and using them as an additional input for the next step. In this respect, the network is allowed to exhibit dynamic temporal behavior for a time sequence.
In this study, the model used to implement the RNN is the sequential one. It is comprised of two layers, a hidden one containing recurrent nodes and an output one containing one or more linear nodes. Due to high computational requirements, we did not use k-fold validation for choosing the optimal network architecture per series but rather three input nodes and six recurrent units, forming the hidden layer, for all the time series of the dataset. The selection was made based on the results of a random sample of series for which this parameterization displayed the best performance. Regarding the rest of the hyper-parameters, a number of 500 epochs was chosen and the learning ratio was set to 0.001, with the linear activation function being used in all nodes.

Long Short Term Memory neural network (LSTM).
The LSTM network is similar to the RNN discussed above and was proposed by [41] to avoid the long-term dependency problem present for the case of the latter. The advantage of LSTM units have over regular RNN units is their ability to keep information over longer periods of time due to their complex architecture which consists of several gates with the power to remove or add information to the unit's state.
Similar to RNN, the model used to implement the LSTM network is the sequential one comprised of a hidden and an output layer. Similarly, due to high computational time, the architecture of the model consists of three input nodes, six LSTM units forming the hidden layer and a single linear node in the output layer. The linear activation function is used before the output of all units and the hard sigmoid one for the recurrent step. Regarding the rest of the hyper-parameters, the rmsprop optimizer was used, a number of 500 epochs was chosen and the learning ratio was set to 0.001.

Data preprocessing
In contrast to sophisticated time series forecasting methods, where achieving stationarity in both the mean and variance is considered essential, the literature of ML is divided with some studies claiming that ML methods are capable of effectively modelling any type of data pattern and can therefore be applied to the original data [62]. Other studies however, have concluded the opposite, claiming that without appropriate preprocessing, ML methods may become unstable and yield suboptimal results [28].
Preprocessing can be applied in three forms: Seasonal adjustments, log or power transformations, and removing the trend. For instance, [11] found that MLP cannot capture seasonality adequately, while [63] claim exactly the opposite. There is greater agreement concerning the trend, given the bouncing of MLP's activation function that become more stable once the series has been detrended [64]. Yet, more empirical results are needed to support the conclusions related to preprocessing, including the most appropriate way to eliminate the trend in the data [65]. In this study, the following indicative alternatives are used: • Original data: No pre-processing is applied.
• Transforming the data: The log or the Box-Cox [66] power transformation is applied to the original data in order to achieve stationarity in the variance.
• Deseasonalizing the data: The data is considered seasonal if a significant autocorrelation coefficient at lag 12 exists. In such case the data is deseasonalized using the classical, multiplicative decomposition approach [39]. The training of the ML weights, or the optimization of statistical methods, is subsequently done on the seasonally adjusted data. The forecasts obtained are then reseasonalized to determine the final predictions. This is not done in the case of ETS and ARIMA methods since they include seasonal models, selected using relative tests and information criteria that take care of seasonality and model complexity directly.
• Detrending the data: A Cox-Stuart test [67] is performed to establish whether a deterministic linear trend should be used, or alternatively first differencing, to eliminate the trend from the data and achieve stationarity in the mean.
• Combination of the above three: The benefits of individual preprocessing techniques are applied simultaneously to adjust the original data.
In order to speed up computations, we first determine the best preprocessing alternative for improving the post-sample one-step-ahead forecasting performance of the MLP method (the most popular among ML ones) and then apply that one to the remaining ML models. Table 4 summarizes the results of the various preprocessing possibilities as well as their combinations when predicting the 18 unknown observations of the 1045 time series, as applied in the study of [15]. The best combination according to sMAPE is number 7 (Box-Cox transformation, deseasonalization) while the best one according to MASE is number 10 (Box-Cox transformation, deseasonalization and detrending). Some interesting observations from studying the results of Table 4 are the following: • Transforming the data with the Box-Cox method is a slightly better option than the alternative of using logs in terms of the sMAPE. But both alternatives do not improve its accuracy much and do not provide any improvement in MASE.
• Seasonal adjustments provide significantly better results in both sMAPE and MASE.
• Removing the trend from the data using a linear function provides more accurate results (in both the sMAPE and MASE) than the alternative of using the first difference.
• There are important differences in sMAPE and MASE with each type of preprocessing.
More work on the most appropriate preprocessing approach is definitely required, first to confirm the results found using a subset of the series of the M3 Competition and second to determine if the conclusions made using the MLP method will be similar to those of other ML ones.
Transforming the data also seems beneficial when dealing with traditional forecasting methods. This can be seen by examining Tables 5 and 6, that display the forecasting performance for the eight statistical methods included in this work according to sMAPE and MASE. The sMAPE accuracies show a consistent improvement, while those of MASE are about the same. Moreover, after transformations, the differences between the various methods become smaller, meaning that simpler methods, such as Damped, can now be used instead of ETS, which may be more accurate but also the most time intensive.  Table 7 compares the one-step-ahead forecasts of the ML methods used by Ahmed and colleagues and by our own study once the most appropriate preprocessing has been applied. This includes the Box-Cox transformation, deseasonalization and detrending since evaluating forecasting performance through MASE instead of sMAPE is considered to be, as mentioned in section 3.1, a more reliable choice. There are some important similarities in the overall sMAPE (see column 2) between the results of [15] and our own indicating that the forecasts of ML methods are consistent over the period of seven years. There are also some important differences that can be justified given the length of time that has passed between the two studies and the computational advancements in utilizing the ML algorithms. At the same time, the reasons for the huge difference in RBF (9.57% vs 15.79%) as well as the smaller ones in CART and SVR need to be investigated. Undoubtedly, the use of slightly different parameterisation for applying the individual methods, as well as the exploitation of different functions for their implementation (R instead of Matlab) might explain part of the variations. It should be noted that RNN is among the less accurate ML methods, demonstrating that research progress does not necessarily guarantee improvements in forecasting performance. This conclusion also applies in the performance of LSTM, another popular and more advanced ML method, which does not enhance forecasting accuracy too.
The results in Table 7 show that MLP and BNN outperform the remaining ML methods. Thus, these two are the only ones to be further investigated by comparing their forecasting accuracy beyond one-step-ahead predictions to multiple horizons, useful for those interested in predicting beyond one horizon.

Multiple-horizon forecasts
There are three alternative ways for producing multiple period forecasts with ML models. Given that it is not possible to claim that one is superior for all forecasting situations [3], all of them are considered.
3.5.1 Iterative forecasting. The first alternative is found in the exact same way as the onestep-ahead approach used by Ahmed et al. and by our own study. It is possible, however, to obtain forecasts additionally to one-step-ahead using the first forecast produced by the model, instead of the actual value, to get a forecast for horizon two and then use the first two forecasts to estimate the one for horizon 3 and so on, until predictions for all 18 horizons have been found (this is also the approach used by the ARIMA models in the M Competitions). This means that we obtain 18 forecasts using the first n-18 data points only. As the forecasting horizon increases, the new forecasts depend on the accuracy of the previous ones, meaning that longer term ones may deteriorate. The advantage of this approach is its simplicity and computational ease.

Direct forecasting.
The direct approach produces multiple-step-ahead forecasts, instead of one-step-ahead ones, by training and exploiting an 18-output-node neural network able of producing 18 forecasts simultaneously, one for each forecasting horizon. The first output node is responsible for forecasting horizon h = 1, output node 2 for forecasting horizon h = 2, ending with output mode 18 for forecasting horizon 18. It will be interesting to determine if this multi-forecast approach will be more accurate than the iterative one, knowing well that the Direct approach is more complex and computationally more demanding, while fewer observations are available for training.

Multi-neural network forecasting.
In this approach, single output node NNs are trained for producing forecasts. Yet, instead of training one NN for each horizon simultaneously, 18 separate NNs are trained, each one for predicting a single h-step-ahead forecast. In this respect, if we wish to forecast the value of the time series for one horizon-ahead we use the first NN trained using the n-18 data, for two the second NN again trained in the n-18 data, and so on for eighteen times in total. The three options for multi-horizon forecasts can be visualized in Fig 1. The results of each of the three approaches for predicting 18 months ahead are displayed in Tables 8 and 9 for both the ML methods as well as the eight statistical ones using sMAPE and MASE. To simplify the presentation, the results are grouped into three forecasting horizons: Short-term (1 to 6 months ahead), Medium-term (7 to 12 months ahead) and, finally, Longterm (13 to 18 months ahead), while the accuracies for all horizons can be found in Tables A1 and A2 in the Appendix at the end of this paper. Tables 8 and 9 as well as A1 and A2 allow us to evaluate the accuracy achieved by each method across multiple horizons and decide on their appropriateness for various applications. We also note that for the case of the BNN, the direct approach was excluded since it is not supported by the brrn function exploited for obtaining the forecasts in our study.
In brief, statistical models seem to generally outperform ML methods across all forecasting horizons, with Theta, Comb and ARIMA being the dominant ones among the competitors according to both error metrics examined. It is also notable that good forecasting accuracy  comes with great efficiency, meaning that CC is not significantly increased for the best performing methods. It is also worth mentioning that more complex approaches of extrapolation through ML methods, such as the direct and Multi one, display less accurate results indicating that complex is not always better and that ML methods fail to learn how to best predict for each individual forecasting horizon.  forecasts. As seen, the six most accurate methods are statistical, confirming their dominance over the ML ones. Even Naive 2 (a seasonal Random Walk (RW) benchmark) is more accurate than half of the ML methods. The most interesting question and greatest challenge is to find the reasons for their poor performance with the objective of improving their accuracy and exploiting their huge potential. AI learning algorithms have revolutionized a wide range of applications in diverse fields and there is no reason that the same cannot be achieved with the ML methods in forecasting. Thus, we must find how to be applied to improve their ability to forecast more accurately. ML models are nonlinear functions connecting the inputs and outputs of neurons. The goal of the network is to "learn" by solving an optimization problem in order to choose a set of parameters, or weights, that minimize an error function, typically the sum of square errors. However, the same type of optimization is done in ARIMA (or regression) models. There is no obvious reason, therefore, to justify the more than 1.24% higher sMAPE of MLP, one of the best ML methods, in comparison to that of ARIMA, or that the sMAPE of this MLP is only 0.19% more accurate than Naive 2, the seasonally adjusted random walk model. Respectively, one would expect RNN and LSTM, which are more advanced types of NNs, to be far more accurate than the ARIMA and the rest of the statistical methods utilized. Clearly, if there was any form of learning, the accuracy of ML methods should have exceeded that of ARIMA and greatly outperform the Naive 2. Thus, it is imperative to investigate the reasons that this is not happening, e.g. by comparing the accuracy of ML and statistical methods series by series, explaining the differences observed and identifying the reasons involved.

The accuracy, the goodness of fit and the computational complexity of the ML methods
The more serious issue, simply put, is how ML methods can be made to learn about the unknown future rather how well a model fits past data. For this to be done, the ML methods must have access to information about the future and their objective must be to minimize future errors rather than those of fitting a model to available data. Until a later time when more advanced ML methods become available and in order to simplify things, we suggest that the data is deseasonalized before some ML model is utilized, as research [11] has shown little to no differences between the post-sample accuracy of models applied to original and seasonally adjusted data.
A practical way to allow learning about the unknown future errors is by dividing the n − 18 data into two parts, with the first one containing the 1/3 of the n − 18 data and the second the remaining 2/3. If the data is first deseasonalized, a much simpler model can be developed using the first (n − 18)/3 data and then trained to learn how to best predict the next 18 observations. Then, the first (n − 18)/3 + 1 data can be used to let the method learn how to best predict the next 18 observations and continue using the first (n − 18)/3 + 2, the first (n − 18)/3 + 3 and so on, until having used all the observations available. Clearly, such a sliding simulation, attempting to predict future values based on post-sample accuracy optimization, will probably be a step in the right direction even though its performance needs to be empirically tested.
Another possibility is to provide ML methods with alternative forecasts (e.g. the ones produced by the best statistical methods) and ask them to learn to select the most accurate one (or their combination) for each forecasting horizon and series in such a way as to minimize postsample errors. This may require clustering the data into various categories (micro, macro, demographic etc.) or types of series (seasonal/non-seasonal, trended/non-trended, of high, medium or low randomness etc.) and develop different models for each category/type. In Table 6 of [15], for instance, accuracy varies significantly depending on the category of the series with the best one being in demographic and macro data, the worst in micro and industry time series, and finance in between. This may indicate that ML methods could under-perform among others, due to the fact that they are confused when attempting to optimize specific or heterogeneous data patterns.
An additional concern could be the extent of randomness in the series and the ability of ML models to distinguish the patterns from the noise of the data, avoiding over-fitting. This can be a challenging problem since, in contrast to linear statistical methods, where over-fitting can be directly controlled by some information criteria (e.g., the AIC [68]) taking into account the number of parameters utilized, ML methods are nonlinear and training is performed dynamically, meaning that different forecasts may arise according e.g. to the maximum iterations considered, even if the complexity of the network's architecture is identical. Since the importance of possible over-fitting by ML methods is critical, the topic will be covered in detail on its own in section 4.1 below.
A final concern with ML methods could be the need for preprocessing that requires individual attention to select the most appropriate transformation, possible deseasonalization, as well as trend removal. Effective ML methods must, however, be able to learn and decide on their own the most appropriate preprocessing as there are few possibilities available. If, for example, the Box-Cox criterion can be used to determine the most appropriate transformation for statistical methods, it makes no sense that something similar cannot be applied by ML methods to automate preprocessing, simplify the modeling process and probably improve accuracy by doing so.

Over-fitting
Tables 4, 5 and 6 report, among others, the goodness of fit, indicating how well the trained model fitted the n-18 observations available for each series. Yet, model fit is not a good predictor of post-sample forecasting accuracy, meaning that methods with low fitting errors might result in higher post-sample ones and vice versa. One would expect for instance that the MLP method, displaying a model fitting error of 2.11%, would forecast more accurately than the ARIMA whose corresponding error is higher (2.59%). However, this is not the case as the post-sample sMAPE of the two methods are 8.39% and 7.19%, respectively. Moreover, RBF, GRNN and CART, which have the best model fitting, are some of the worst performing methods. A possible reason for the improved accuracy of the ARIMA models is that their parameterization is done through the minimization of the AIC criterion, which avoids over-fitting by considering both goodness of fit and model complexity. In contrast, the MLP method specifies its complexity (input nodes) through cross-validation, but no additional criteria are applied for mitigating over-fitting e.g. by specifying when training should stop. The maximum number of iterations defined serves that purpose, yet there is no global optima: in some time series, over-fit might occur after a few iterations, while in others after many hundreds.  [69] would be required to indicate to ML methods when to stop the optimization process and avoid considering as pattern a part of the noise of the data. In our view, considerable improvements can result by such an action.

Computational complexity
As forecasting methods are used in various applications, the computational time required to forecast becomes critical. It would be impractical for example to utilize the ML GRNN method (the most computationally demanding) to predict the demand for hundreds of thousands of inventory items, even though computers are becoming faster and cheaper. Memory and CPU usage optimization might serve in that direction but again, computational intensity remains an important issue. For instance, despite exploiting such optimization processes in our study, reducing the computational time of the ML methods by more than 30%, the complexity reported is still much greater compared to the statistical ones. For this reason, the information provided in Fig 4 is of value, as it confirms the low computational requirements of statistical methods, lying in the lower left part of the Fig, and additionally shows that superior accuracy can be achieved with less computational effort. In particular, the five inside the square  Statistical and ML forecasting methods box (Damped, Comb, Theta, SES and Holt) are not only some of the most accurate but alsoapart from ETS-the least computationally demanding.
For practical reasons, if ML methods are to be applied by business and non-profit organizations, their computational requirements must be reduced considerably. This can be done by deseasonalizing the data first, utilizing simpler models, limiting the number of training iterations or choosing the initial values of the weights, not in a completely arbitrary manner, but through a guided search that would provide values not too far from the optimal ones. Alternatively, the speed of moving towards the optimal can increase in order to reduce the computational time to reach such an optimal. These improvements would require testing to determine the trade-offs between lesser accuracy, resulting from the reduction in computational time, versus the savings from such a reduction.

Conclusions: The state of the art and ways forward
The field of statistical forecasting has progressed a great deal since the early dates when [70] used exponential smoothing, in the late 1940s, for predicting the inventory demand for many thousands of items in navy shipyards. The introduction of the Box-Jenkins methodology to ARIMA models [71] brought academic respectability to a field dominated until then by practitioners, while the extensive use of regression and econometric models [72] further enlarged the field. Finally, multivariate GARCH models were also made available [73,74] broadening the coverage of the field (for an excellent survey of the latest developments see Special Issue on "Simple Versus Complex Forecasting" [75]).
A major innovation that has distinguished forecasting from other fields has been the good number of empirical studies aimed at both the academic community as well as the practitioners interested in utilizing the most accurate methods for their various applications and reducing cost or maximizing benefits by doing so. These studies contributed to establishing two major changes in the attitudes towards forecasting: First, it was established that methods or models, that best fitted available data, did not necessarily result in more accurate post sample predictions (a common belief until then). Second, the post-sample predictions of simple statistical methods were found to be at least as accurate as the sophisticated ones. This finding was furiously objected to by theoretical statisticians [76], who claimed that a simple method being a special case of e.g. ARIMA models, could not be more accurate than the ARIMA one, refusing to accept the empirical evidence proving the opposite. These two findings have fundamentally changed the field of forecasting and are also evident in this paper both in Fig 3, showing post-sample versus in-sample accuracy, as well as in Fig 2, displaying the accuracy level of various statistical and ML methods, with the latter being much more sophisticated and computationally demanding than the former.
Knowing that a certain sophisticated method is not as accurate as a much simpler one is upsetting from a scientific point of view as the former requires a great deal of academic expertise and ample computer time to be applied. At the same time, understanding the reasons of their underperformance is the only way to improve them. This has certainly been the case with ARIMA models whose accuracy with monthly data (not the same as those used in this study) in the 1982 M Competition was 17.9% and has decreased to 11.28% in the present study, tying with the accuracy of the damped exponential smoothing, one of the most accurate methods of the M Competitions. ARIMA's improved performance is mainly due to the utilization of the AIC criterion and other optimization processes, enabling effective automatic model selection and parameterization, while avoiding or minimizing over-fitting. Another interesting example could be the case of LSTM that compared to simpler NNs like RNN and MLP, report better model fitting but worse forecasting accuracy.
ML theorists working on forecasting applications need to do something to improve the accuracy of their methods. For instance, the only thing exponential smoothing methods do is smoothen the most recent errors exponentially and then extrapolate the latest pattern in order to forecast. Given their ability to learn, ML methods should do better than simple benchmarks, like exponential smoothing. Accepting the problem is the first step in devising workable solutions and we hope that those in the field of AI and ML will accept the empirical findings and work to improve the forecasting accuracy of their methods.
A problem with the academic ML forecasting literature is that the majority of published studies provide forecasts and claim satisfactory accuracies without comparing them with simple statistical methods or even naive benchmarks. Doing so raises expectations that ML methods provide accurate predictions, but without any empirical proof that this is the case. In our view, this situation is the same with what was happening in statistical literature in the late 1970s and 1980s. At that time, it was thought that forecasting methods were of superior accuracy simply because of their sophistication and their mathematical elegance. Now it is obvious that their value must be empirically proven in an objective, indisputable manner through large scale competitions. Thus, when it comes to papers proposing new ML methods, or effective ways to use them, academic journals must demand comparisons with alternative methods or at least benchmarks and require that the data of the articles being published be made available for those who want to replicate the results. In our experience, this has not been the case at present, making replications practically impossible and allowing conclusions that may not hold.
In addition to empirical testing, research work is needed to help users understand how the forecasts of ML methods are generated (this is the same problem with all AI models whose output cannot be explained). Obtaining numbers from a black box is not acceptable to practitioners who need to know how forecasts arise and how they can be influenced or adjusted to arrive at workable predictions.
A final, equally important concern is that in addition to point forecasts, ML methods must also be capable of specifying the uncertainty around them, or alternatively providing confidence intervals. At present, the issue of uncertainty has not been included in the research agenda of the ML field, leaving a huge vacuum that must be filled as estimating the uncertainty in future predictions is as important as the forecasts themselves. To overcome this issue, many researchers propose simulating the intervals by iteratively generating multiple future sample paths. Yet, even in that case, the forecast distribution of the methods is empirically and not analytically derived, raising many doubts about its quality.
To summarize, according the results of this study, ML methods need to become more accurate, requiring less computer time, and be less of a black box. A major contribution of this paper is in showing that traditional statistical methods are more accurate than ML ones and pointing out the need to discover the reasons involved, as well as devising ways to reverse the situation. However, in the comparisons of the statistical and ML methods reported in this paper, it must be made clear that the results may be related to the specific data set being used. The 3003 time series of M3 come mainly from the business and economic world that seem to be represented truthfully by this data [77], characterized by considerable seasonality, some trend and a fair amount of randomness [78]. The frequency of close to half of the series is monthly, followed by quarterly and yearly ones of about the same percentage. The length of the series varies from 14 for yearly data to 126 for monthly ones, with the majority being in the Micro and Macro domain. The characteristics of the series as well as their length may be a critical factor determining the accuracy of the various methods reported in this paper, requiring additional research, using other data sets, to verify that similar results will hold true if different sets of data are used and, most importantly, the series are of much longer length so the ML methods can train their weights more optimally.
For instance, in [78], the authors use a set of six features to analyze the M3 data, visualizing them in a 2-dimensional space and examining the strengths and weaknesses of different forecasting methods using the new classification. Their results show that the particularities of the dataset may affect the conclusions drawn, indicating that different ones could have emerged if another sample of time series had been selected instead. The relation between forecasting accuracy and time series characteristics is also reported by [79] who claim that there are indeed "Horses for Courses" in demand forecasting. In this regard, even though M3 might be representative of the reality when it comes to business applications, the findings may be different if nonlinear components are present, or if the data is being dominated by other factors. In such cases, the highly flexible ML methods could offer significant advantage over statistical ones. Furthermore, the length of business series, which is relatively limited compared to those of other applications that ML methods are typically utilized (e.g., energy forecasting), may also affect their performance as proper training may be difficult or even impossible when short series are involved. Thus, the conclusions of future studies would be necessary to come up with definite conclusions.
At this point, the following suggestions/speculations, that must be empirically verified, can be made about the way forward regarding the ML methods, while these can be enriched by future research topics proposed in relative surveys [80]: • Obtain more information about the unknown future values of the data rather than their past ones and base the optimization/learning on such future values as much as possible.
• Deseasonalize the data before using ML methods. This will result to a simpler one, reducing the computational time required to arrive at optimal weights and, therefore, learn faster.
• Use a sliding simulation approach to gain as much information as possible about future values and the resulting uncertainty and learn more effectively how to minimize them.
• Cluster the series into various homogeneous categories and/or types of data and developing ML methods that optimally extrapolate them.
• Avoid over-fitting as it is not clear if ML models can correctly distinguish the noise from the pattern of the data.
• Automate preprocessing and avoid the extra decisions required from the part of the user.
• Allow the estimation of uncertainty for the point forecasts and provide information for the construction of confidence intervals around such forecasts.
Although the conclusion of our paper that the forecasting accuracy of ML models is lower to that of statistical methods may seem disappointing, we are extremely positive about the great potential of ML ones for forecasting applications. Clearly, more work is needed to improve such methods but the same has been the case with all new techniques, including the complex forecasting methods that have improved their accuracy considerably over time. Who could have believed even ten years ago that we will have AVs, personal assistance on our mobile phones understanding and speaking in natural languages, automatic translations in Skype, AlphaGo beating the world GO champion or facial expression recognition algorithms [81]. There is no reason that the same type of breakthroughs cannot be achieved with ML methods applied to forecasting. Even though, we must realize that applying AI to forecasting is quite different than doing so in games or in image and speech recognition and may require different, specialized algorithms to be successful. In contrast to other applications, the future is never identical to the past and training of AI methods cannot exclusively depend on it. Table 10 is our attempt to show that not all applications can be modeled equally well using AI algorithms. Games are the easiest as the rules are known and do not change, the environment is also known and stable, the predictions cannot influence the future and there is no uncertainty. The exact opposite is true for forecasting applications where not only the rules are not known but can also change, there are structural instabilities in the data, while there is plenty of uncertainty and noise, that can confuse the search for the optimal weights. Moreover, in certain applications, the forecasts themselves can influence, or even change the future creating self-fulfilling or self-defeating prophesies, expanding the level of noise and increasing the level of uncertainty. It may be necessary, therefore, to adapt the algorithms to these conditions and make sure that there is no over-fitting. Judging from the results of this paper, it may be necessary that ML algorithms applied to forecasting may require additional research to experiment with innovative ideas and come up with adjustments in order to achieve more accurate predictions.
As the first step towards providing a more diverse, large and representative dataset for evaluating the performance of forecasting methods, establishing reliable benchmarks and promoting future research, we have started the M4-Competition (see https://www.m4.unic.ac.cy/) that seeks to identify the most accurate forecasting method(s) for different types of predictions. It aims to compare all major time series methods and identify the most appropriate methods for each case. M4 utilizes 100,000 real-life series and has attracted great interest by both academic researchers and practitioners, providing objective evidence of the most appropriate way of forecasting various variables of interest. The new M4 Competition will extend and replicate the results of the previous three ones, while also avoiding the possible problems of M3. Furthermore, it will increase the number of series to 100,000, include additional frequencies, while also augmenting their length considerably. Given the great number of series, it will be possible to utilize advanced data analytics and related technologies to determine the influence of the various factors on forecasting accuracy, as well as to determine the most appropriate methods for different forecasting applications.
Supporting information S1 Appendix. Containing tables A1 and A2, presenting the analytical results of the forecasting models used in the present study. The accuracy is evaluated per forecasting horizon first according to sMAPE, and then to MASE. (PDF)