A deep learning framework for financial time series using stacked autoencoders and long-short term memory

The application of deep learning approaches to finance has received a great deal of attention from both investors and researchers. This study presents a novel deep learning framework where wavelet transforms (WT), stacked autoencoders (SAEs) and long-short term memory (LSTM) are combined for stock price forecasting. SAEs, which extract deep features hierarchically, are introduced into stock price forecasting for the first time. The deep learning framework comprises three stages. First, the stock price time series is decomposed by WT to eliminate noise. Second, SAEs are applied to generate deep high-level features for predicting the stock price. Third, the high-level denoised features are fed into LSTM to forecast the next day's closing price. Six market indices and their corresponding index futures are chosen to examine the performance of the proposed model. Results show that the proposed model outperforms other similar models in both predictive accuracy and profitability.


Introduction
Stock market prediction is usually considered one of the most challenging problems in time series prediction [1] because of its noisy and volatile nature. How to accurately predict stock movement is still an open question with respect to the economic and social organization of modern society. During the past decades, machine learning models, such as Artificial Neural Networks (ANNs) [2] and Support Vector Regression (SVR) [3], have been widely used to predict financial time series and have achieved high predictive accuracy [4][5][6][7][8]. In the literature, however, a recent trend in the machine learning and pattern recognition communities holds that a deep nonlinear topology should be applied to time series prediction. As an improvement over traditional machine learning models, deep learning can successfully model complex real-world data by extracting robust features that capture the relevant information [9] and can achieve even better performance than before [10]. Considering the complexity of financial time series, combining deep learning with financial market prediction is regarded as one of the most attractive topics [11]. However, this field still remains relatively unexplored.
Generally speaking, there are three main deep learning approaches widely used in studies: convolutional neural networks [12], deep belief networks [13] and stacked autoencoders [14]. The relevant work on deep learning applied to finance has introduced the former two approaches into the research. For example, Ding et al. [15] combine the neural tensor network and the deep convolutional neural network to predict the short-term and long-term influences of events on stock price movements. Also, certain works use deep belief networks in financial market prediction, for example, Yoshihara et al. [16], Shen et al. [17] and Kuremoto et al. [18]. However, few efforts have been made to investigate whether the stacked autoencoders method could be applied to financial market prediction. Therefore, this paper contributes to this area and provides a novel model based on the stacked autoencoders approach to predict the stock market.
The proposed model in this paper consists of three parts: wavelet transforms (WT), stacked autoencoders (SAEs) and long-short term memory (LSTM). SAEs are the main part of the model and are used to learn the deep features of financial time series in an unsupervised manner. Specifically, an SAE is a neural network consisting of multiple single layer autoencoders in which the output feature of each layer is wired to the inputs of the successive layer. The unsupervised training of SAEs is done one AE at a time by minimizing the error between the output data and the input data. As a result, the SAEs model can successfully learn invariant and abstract features [19].
The other two methods are incorporated to help increase predictive accuracy. LSTM is a type of recurrent neural network (RNN), with feedback links attached to some layers of the network. Unlike the conventional RNN, it is well-suited to learn from experience to predict time series when there are time steps with arbitrary size. In addition, it can solve the problem of a vanishing gradient by having the memory unit retain the time related information for an arbitrary amount of time [20]. Evidence has shown that it is more effective than the conventional RNN [21,22]. Thus, we decide to use this model to predict the stock trends. WT is used to address the noise in financial time series. It is a widely used technique for filtering and mining single-dimensional signals [23][24][25]. We use it to denoise the input financial time series and then feed them into the deep learning framework. In summary, the model we introduce in this paper is a combination of the three methods, and we refer to this novel model as WSAEs-LSTM hereafter.
We select six stock indices to test the prediction ability of the proposed model. Those indices include the CSI 300 index in the A-share market from mainland China, the Nifty 50 index representing the Indian stock market, the Hang Seng index trading in the Hong Kong market, the Nikkei 225 index in Tokyo, and the S&P500 index and DJIA index in the New York stock exchange. Technically, we apply WSAEs-LSTM to forecast the movements of each stock index and check how well our model predicts stock moving trends.
It is noted that we test the performance of WSAEs-LSTM in several financial markets instead of only one market. This is due to the concern for obtaining robust results. According to the efficient market hypothesis (EMH), the efficiency of a market affects the predictability of its assets. In other words, even if the predictive performance in one market is satisfactory, it is still difficult to attribute it to the role of the proposed model. Testing the model under varying market conditions gives us the opportunity to address this problem and shows how robust the predictability of our model is. The chosen markets meet the goal described above. They represent three development stages of financial markets. For example, the stock markets in mainland China and India are commonly perceived as developing markets. Though both of them have experienced rapid growth during the past decades, much of their regulation is immature. By contrast, the New York stock market is recognized as the most developed market. It is also by far the largest stock exchange and the most efficient market in the world. Besides those above, the stock markets in Hong Kong and Tokyo occupy a middle ground between the most developed and the developing state. Thus, our sample setting can help us to examine the validity of our proposed model in different states of the market.
For each stock index, three types of variables are used as model inputs. The first set is historical stock trading data, such as the Open, High, Low and Close price (OHLC) [26][27][28], and the second is the technical indicators of stock trading. These are commonly used inputs in previous studies [29]. Apart from these, we also introduce macroeconomic variables as the third type of inputs. As the macro economy can hugely influence stock markets and the advantage of our deep learning model is the ability to extract abstract and invariant features from input variables [30,31], we believe the addition of macroeconomic variables could improve the model performance.
Regarding the prediction approach, a subsection predictive method described in Chan et al. [32] is applied to get the predicted outcomes of each stock index. Then, we evaluate the model's performance from two dimensions: predictive accuracy and profitability. The predictive accuracy is evaluated by using three measurements: Mean absolute percentage error (MAPE), correlation coefficient (R) and Theil's inequality coefficient (Theil U). All of them are widely used indicators to measure whether the predicted value is similar to the actual value [2,23,33,34]. To check the profitability, we establish a buy-and-sell trading strategy [35]. The strategy is applied to obtain the trading returns based on the predicted outcomes from the model. As a benchmark, we also compute the returns of a buy-and-hold strategy for each stock index [32,36]. The basic idea is to check whether the trading returns based on WSAEs-LSTM can outperform the returns of this simple trading strategy, which provides further evidence for the model's profitability.
To better capture the performance of WSAEs-LSTM, we also introduce three other models and evaluate their predictive accuracy and profitability in forecasting each stock index as comparisons against our proposed model. The three models include WLSTM (i.e., a combination of WT and LSTM), LSTM and also the conventional RNN. The former two models are used to check the usefulness of the SAEs method in improving the prediction performance. The last model, RNN, is used as the performance benchmark. As it has been successfully applied to predicting financial time series in previous literature [23,37,38], it helps us to learn how much our proposed model improves performance compared with the conventional neural network.
All the sample data of this study are collected from the WIND database provided by Shanghai Wind Information Co., Ltd, the CSMAR database provided by Shenzhen GTA Education Tech. Ltd and a global financial portal: Investing.com. The sample consists of around 8 years of data from Jul. 2008 to Sep. 2016. Our results show that WSAEs-LSTM outperforms the other three models not only in predictability but also in profitability.
Our work is rooted in a growing research field regarding the application of deep learning methods to improve efficiency. For example, deep learning-based methods have dramatically improved the state-of-the-art in image recognition [12,39-41], speech recognition [42][43][44], language translation [45,46] and many other areas such as drug discovery [47] and genomics [48,49]. The main contribution of this work is that it is the first attempt to apply stacked autoencoders to generate the deep features of the OHLC, technical indicators and macroeconomic conditions as a multivariate signal in order to feed them to an LSTM to forecast future stock prices. The proposed deep learning framework, WSAEs-LSTM, can extract more abstract and invariant features compared with the traditional long-short term memory and recurrent neural networks (RNN) approaches.
The rest of this paper is organized into five sections. Section 2 presents the proposed hybrid models with an introduction to multivariate denoising using wavelet, SAEs and LSTM. Section 3 is a description of the inputs and data resource. Section 4 presents the details regarding our experiment design. Section 5 summarizes the observed results and the final section concludes our study.

Methodology
To generate the deep and invariant features for one-step-ahead stock price prediction, this work presents a deep learning-based forecasting scheme for financial time series that integrates the architecture of stacked autoencoders and long-short term memory. Fig 1 shows the flow chart of this framework. The framework involves three stages: (1) data preprocessing using the wavelet transform, which is applied to decompose the stock price time series to eliminate noise; (2) application of the stacked autoencoders, which have a deep architecture trained in an unsupervised manner; and (3) the use of long-short term memory with delays to generate the one-step-ahead output. Each stage is detailed as follows.

Wavelet transform
Wavelet transform is applied for data denoising in this study since it has the ability to handle non-stationary financial time series data [50]. The key property of the wavelet transform is that, unlike the Fourier transform, it can analyze the frequency components of financial time series together with their location in time. Consequently, wavelet is useful in handling highly irregular financial time series [51].
This study applies the Haar function as the wavelet basis function because it can not only decompose the financial time series into time and frequency domain but also reduce the processing time significantly [23]. The wavelet transform with the Haar function as a basis has a time complexity of O(n) with n denoting the size of the time series [52].
For the continuous wavelet transform (CWT), the wavelet function can be defined by:

$$\phi_{a,\tau}(t) = \frac{1}{\sqrt{a}}\,\phi\!\left(\frac{t-\tau}{a}\right)$$

where a and τ are the scale factor and translation factor, respectively. ϕ(t) is the basis wavelet, which obeys a rule named the wavelet admissibility condition [53]:

$$C_{\phi} = \int_{0}^{\infty} \frac{|\Phi(\omega)|^{2}}{\omega}\,d\omega < \infty$$

where Φ(ω) is a function of frequency ω and also the Fourier transform of ϕ(t). Let x(t) denote a square-integrable function (x(t) ∈ L²(R)); then the CWT with the wavelet ϕ can be defined as:

$$CWT_{x}(a,\tau) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} x(t)\,\phi^{*}\!\left(\frac{t-\tau}{a}\right) dt$$

where ϕ*(t) denotes its complex conjugate function. The inverse transform of the continuous wavelet transform can be denoted as:

$$x(t) = \frac{1}{C_{\phi}} \int_{0}^{+\infty} \frac{da}{a^{2}} \int_{-\infty}^{+\infty} CWT_{x}(a,\tau)\,\phi_{a,\tau}(t)\,d\tau$$

The coefficients of the continuous wavelet transform have a significant amount of redundant information. Therefore, it is reasonable to sample the coefficients in order to reduce redundancy. Decomposing time series into an orthogonal set of components results in the discrete wavelet transform (DWT). Mallat [54] proposed filtering the time series using a pair of high-pass and low-pass filters as an implementation of the discrete wavelet transform. There are two types of wavelets, father wavelets φ(t) and mother wavelets ψ(t), in the Mallat algorithm. Father wavelets φ(t) and mother wavelets ψ(t) integrate to 1 and 0, respectively, which can be formulated as:

$$\int \varphi(t)\,dt = 1, \qquad \int \psi(t)\,dt = 0$$

The mother wavelets describe high-frequency parts, while the father wavelets describe low-frequency components of a time series. The mother wavelets and the father wavelets in the j-level can be formulated as [55]:

$$\varphi_{j,k}(t) = 2^{-j/2}\,\varphi\!\left(2^{-j}t - k\right), \qquad \psi_{j,k}(t) = 2^{-j/2}\,\psi\!\left(2^{-j}t - k\right)$$

Financial time series can be reconstructed by a series of projections on the mother and father wavelets with multilevel analysis indexed by k ∈ {0,1,2,...} and by j ∈ {0,1,2,...,J}, where J denotes the number of multi-resolution scales. The orthogonal wavelet series approximation to a time series x(t) is formulated by:

$$x(t) = \sum_{k} s_{J,k}\,\varphi_{J,k}(t) + \sum_{k} d_{J,k}\,\psi_{J,k}(t) + \sum_{k} d_{J-1,k}\,\psi_{J-1,k}(t) + \cdots + \sum_{k} d_{1,k}\,\psi_{1,k}(t)$$

where the expansion coefficients $s_{J,k}$ and $d_{j,k}$ are given by the projections

$$s_{J,k} = \int \varphi_{J,k}(t)\,x(t)\,dt, \qquad d_{j,k} = \int \psi_{j,k}(t)\,x(t)\,dt, \quad j = 1,2,\ldots,J$$

The multi-scale approximation of time series x(t) is given as:

$$S_{J}(t) = \sum_{k} s_{J,k}\,\varphi_{J,k}(t), \qquad D_{j}(t) = \sum_{k} d_{j,k}\,\psi_{j,k}(t), \quad j = 1,2,\ldots,J$$

Then, the brief form of the orthogonal wavelet series approximation can be denoted by:

$$x(t) = S_{J}(t) + D_{J}(t) + D_{J-1}(t) + \cdots + D_{1}(t)$$

where $S_J(t)$ is the coarsest approximation of the input time series x(t). The multi-resolution decomposition of x(t) is the sequence {$S_J(t), D_J(t), D_{J-1}(t), \ldots, D_1(t)$}. When the financial time series is very rough, the discrete wavelet transformation can be applied repeatedly, by which the risk of overfitting can be reduced. As a result, the two-level wavelet is applied twice in this study for data preprocessing as suggested in [23].
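The decomposition and reconstruction described above can be illustrated with a minimal pure-Python sketch of a two-level Haar transform. The soft-thresholding rule and the threshold value `thr` are assumptions made here for illustration; the text does not state how the detail coefficients are treated. The function names are hypothetical, and the series length is assumed to be divisible by 2 raised to the number of levels.

```python
import math

def haar_dwt(signal):
    """One level of the orthonormal Haar DWT (len(signal) must be even)."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuild the finer-scale series."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out

def soft_threshold(coeffs, thr):
    """Shrink each coefficient toward zero by thr (assumed denoising rule)."""
    return [math.copysign(max(abs(c) - thr, 0.0), c) for c in coeffs]

def haar_denoise(signal, levels=2, thr=0.5):
    """Decompose into S_J and D_J..D_1, shrink the details, reconstruct."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, d = haar_dwt(approx)
        details.append(soft_threshold(d, thr))
    for d in reversed(details):
        approx = haar_idwt(approx, d)
    return approx
```

With `thr=0` the input is recovered exactly, which mirrors the reconstruction x(t) = S_J(t) + D_J(t) + ... + D_1(t); with a very large threshold only the coarsest approximation S_J(t) survives. Each level touches every sample once, in line with the O(n) cost noted for the Haar basis.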

Stacked autoencoders
Deep learning is a series of models that have the ability to extract deep features from input data with deep neural network architecture. Deep learning models usually have more than three layers. The deep network is typically initialized by unsupervised layer-wise training and then tuned by supervised training with labels, which can progressively generate more abstract and high-level features layer by layer [56]. According to recent studies [57,58], deep learning models can approximate nonlinear functions better than models with a shallow structure. Several deep neural network architectures have been proposed in recent studies, including deep Boltzmann machines (DBMs) [59], deep belief networks (DBNs) [13] and stacked autoencoders (SAEs) [14]. Restricted Boltzmann machines (RBMs) [60], convolutional neural networks (CNNs) [61], and autoencoders [14] are the frequently used layer-wise training models. In this paper, autoencoders are applied for layer-wise training for the OHLC variables and technical indicators, while SAEs are adopted as the corresponding deep neural network architecture.
A single layer AE is a three-layer neural network; it is illustrated in Fig 2. The first layer and the third layer are the input layer and the reconstruction layer with k units, respectively. The second layer is the hidden layer with n units, which is designed to generate the deep feature for this single layer AE. The aim of training the single layer AE is to minimize the error between the input vector and the reconstruction vector. The first step of the forward propagation of the single layer AE is mapping the input vector to the hidden layer, which is illustrated in the boxed area of Fig 2, while the second step is to reconstruct the input vector by mapping the hidden vector to the reconstruction layer. The two steps can be formulated as:

$$a(x) = f(W_{1}x + b_{1})$$

$$x' = f(W_{2}\,a(x) + b_{2})$$

where x ∈ R^k and x' ∈ R^k are the input vector and the reconstructed vector, respectively. a(x) is the hidden vector generated by the single layer AE. $W_1$ and $W_2$ are the weights of the hidden layer and the reconstruction layer, respectively. $b_1$ and $b_2$ are the biases of the hidden layer and the reconstruction layer, respectively. f is the activation function, which has many alternatives such as the sigmoid function, rectified linear unit (ReLU) and hyperbolic tangent. In this paper, f is set to be a sigmoid function as in Chen et al. [19]. The optimization function for minimizing the error between the input vector and the reconstruction vector can be formulated as

$$\underset{W_{1},b_{1},W_{2},b_{2}}{\arg\min}\;\left[\,J + J_{wd} + J_{sp}\,\right], \qquad J = \sum_{i=1}^{m} \lVert x_{i} - x'_{i} \rVert^{2}$$

where J is the squared reconstruction error of the single layer AE. $x_i$ and $x'_i$ are the ith input vector and its corresponding reconstruction vector. m is the size of the training dataset, which is the number of trading days in the training stage in this paper. $J_{wd}$ and $J_{sp}$ are the weight decay term and the sparse penalty term, which can be formulated as:

$$J_{wd} = \frac{\lambda}{2}\left(\lVert W_{1} \rVert_{F}^{2} + \lVert W_{2} \rVert_{F}^{2}\right)$$

$$J_{sp} = \beta \sum_{t=1}^{n} KL\!\left(\rho \,\Vert\, \hat{\rho}_{t}\right)$$

where ‖·‖_F is the Frobenius norm. λ and β control the weight decay term and the sparse penalty term. KL(·) denotes the Kullback-Leibler divergence. ρ is the sparsity parameter, and only a few of the hidden units can be larger than the sparsity parameter. $\hat{\rho}_t$ is the average activation of the tth hidden unit over the training dataset, which can be formulated as:

$$\hat{\rho}_{t} = \frac{1}{m} \sum_{i=1}^{m} a_{t}(x_{i})$$

where $a_t(x_i)$ denotes the activation of the tth hidden unit for the ith sample of the training dataset. The gradient descent algorithm is widely used for solving the optimization problem in SAEs [19,31]. As a result, the gradient descent algorithm is applied to complete parameter optimization as suggested in Yin et al. [62].
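To make the objective of the single layer AE concrete, the following sketch evaluates the reconstruction error, the weight decay term and the sparse penalty term for a small batch. It assumes the standard Bernoulli form of the Kullback-Leibler divergence and an unnormalized sum of squared errors; the default values of `lam`, `beta` and `rho` are placeholders and are not taken from the paper. All function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def forward(x, W1, b1, W2, b2):
    """Hidden vector a(x) and reconstruction x' with a sigmoid activation."""
    a = [sigmoid(z) for z in affine(W1, b1, x)]
    x_rec = [sigmoid(z) for z in affine(W2, b2, a)]
    return a, x_rec

def kl(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat) (assumed form)."""
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))

def objective(X, W1, b1, W2, b2, lam=1e-3, beta=0.1, rho=0.05):
    """Reconstruction error + weight decay + sparsity penalty."""
    m = len(X)
    hidden, recon_err = [], 0.0
    for x in X:
        a, x_rec = forward(x, W1, b1, W2, b2)
        hidden.append(a)
        recon_err += sum((xi - ri) ** 2 for xi, ri in zip(x, x_rec))
    # Squared Frobenius norms of both weight matrices.
    frob = sum(w * w for W in (W1, W2) for row in W for w in row)
    j_wd = 0.5 * lam * frob
    # Average activation of each hidden unit over the training samples.
    n_hidden = len(W1)
    rho_hat = [sum(a[t] for a in hidden) / m for t in range(n_hidden)]
    j_sp = beta * sum(kl(rho, r) for r in rho_hat)
    return recon_err + j_wd + j_sp
```

Minimizing this value with gradient descent over W1, b1, W2 and b2 is the training step; the gradient computation is omitted here to keep the sketch short.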
Stacked autoencoders are constructed by stacking a sequence of single-layer AEs layer by layer [14]. Fig 3 illustrates an instance of an SAE with 5 layers that consists of 4 single-layer autoencoders. The first single-layer autoencoder maps the input daily variables into the first hidden vector. After training the first single-layer autoencoder, its reconstruction layer is removed, and its hidden layer is retained as the input layer of the second single-layer autoencoder. Generally speaking, the input layer of the subsequent AE is the hidden layer of the previous AE. Each layer is trained using the same gradient descent algorithm as a single-layer AE by solving the optimization function as formulated in Eq (16) and feeds the hidden vector into the subsequent AE. It is noteworthy that the weights and biases of the reconstruction layer are discarded after the training of each single-layer AE is finished. In this work, the number of input daily variables for each dataset ranges from 18 to 25; the size of the hidden layer is set to 10 by trial and error. Depth plays an important role in an SAE because it determines qualities like invariance and abstraction of the extracted features. In this work, the depth of the SAE is set to 5 as recommended in Chen et al. [19].
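The stacking procedure above can be sketched structurally as follows. This is only an illustration of how the encoders chain together: the weights are randomly initialized and the per-layer training step is deliberately omitted, so `build_sae` and `deep_features` are hypothetical helpers rather than the authors' implementation. The sizes follow the text: 4 single-layer autoencoders with 10 hidden units each, here with 18 input variables.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_in, n_out, rng):
    """Random encoder weights and zero biases (placeholder for trained values)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def encode(x, W, b):
    """Hidden vector of one single-layer AE."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def build_sae(n_inputs, hidden_size=10, n_autoencoders=4, seed=0):
    """Chain the encoders: each AE takes the previous hidden layer as input."""
    rng = random.Random(seed)
    encoders, n_in = [], n_inputs
    for _ in range(n_autoencoders):
        # In a full implementation, the AE would be trained here on the
        # features produced by the encoders built so far; only the encoder
        # half (W, b) is kept and the reconstruction layer is discarded.
        encoders.append(init_layer(n_in, hidden_size, rng))
        n_in = hidden_size
    return encoders

def deep_features(x, encoders):
    """Pass one day's input variables through all stacked encoders."""
    for W, b in encoders:
        x = encode(x, W, b)
    return x
```

For a day with 18 input variables, `deep_features(x, build_sae(18))` returns a 10-dimensional feature vector, which is the kind of input the LSTM stage would receive.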

Long-short term memory
Long short-term memory is one of the many variations of recurrent neural network (RNN) architecture [20]. In this section, the model of RNN and its LSTM architecture for forecasting the closing price is introduced. We start with the basic recurrent neural network model and then proceed to the LSTM model.
The RNN is a type of deep neural network architecture [43,63] that has a deep structure in the temporal dimension. It has been widely used in time series modelling [21,22,64-69]. The assumption of a traditional neural network is that all units of the input vectors are independent of each other. As a result, the traditional neural network cannot make use of the sequential information. In contrast, the RNN model adds a hidden state that is generated by the sequential information of a time series, with the output dependent on the hidden state. The notation of the RNN model is as follows:
1. $x_t$ is the input vector at time t.
2. $s_t$ is the hidden state at time t; it is calculated based on the input vector and the previous hidden state. $s_t$ is calculated by:

$$s_{t} = f(U x_{t} + W s_{t-1})$$

where f is the activation function, which has many alternatives such as the sigmoid function and ReLU. The initial hidden state $s_0$ for calculating the first hidden state $s_1$ is typically initialized to zero.
3. $o_t$ is the output at time t, which is computed from the hidden state $s_t$ and the weights V of the output layer.
4. U and V are the weights of the hidden layer and the output layer, respectively. W are the transition weights of the hidden state.
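The recurrence in the list above can be sketched as a short forward pass. Two choices here are assumptions rather than details from the text: tanh is used for the activation f, and the output is taken as a plain linear map V s_t with no output activation. The function names are hypothetical.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_forward(xs, U, W, V):
    """Run a simple RNN over the sequence xs and return all outputs."""
    s = [0.0] * len(W)            # initial hidden state s_0 = 0
    outputs = []
    for x in xs:
        # s_t = f(U x_t + W s_{t-1}), with f = tanh (assumed)
        pre = [u + w for u, w in zip(matvec(U, x), matvec(W, s))]
        s = [math.tanh(p) for p in pre]
        # o_t = V s_t (linear output, assumed)
        outputs.append(matvec(V, s))
    return outputs, s
```

Because the same U, W and V are reused at every time step, the hidden state carries information forward from earlier inputs, which is what distinguishes the RNN from a feed-forward network.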
Although the RNN models time series well, it is hard for it to learn long-term dependencies because of the vanishing gradient problem [22]. LSTM is an effective solution for combating vanishing gradients by using memory cells [70]. A memory cell is composed of four units: an input gate, an output gate, a forget gate and a self-recurrent neuron, which is illustrated in Fig 5. The gates control the interactions between neighboring memory cells and the memory cell itself. Whether the input signal can alter the state of the memory cell is controlled by the input gate. On the other hand, the output gate controls whether the state of the memory cell can alter the state of other memory cells. In addition, the forget gate can choose to remember or forget its previous state.
1. $x_t$ is the input vector to the memory cell at time t.
5. $i_t$ and $\tilde{C}_t$ are the values of the input gate and the candidate state of the memory cell at time t, respectively.
6. $f_t$ and $C_t$ are the values of the forget gate and the state of the memory cell at time t, respectively.
7. $o_t$ and $h_t$ are the values of the output gate and the value of the memory cell at time t, respectively.

The architecture of an LSTM network includes the number of hidden layers and the number of delays, that is, the number of past observations used for training and testing. Currently, there is no rule of thumb to select the number of delays and hidden layers [21,22]. In this work, the number of hidden layers and delays are set to 5 and 4 by trial and error. The financial time series is divided into three subsets: training set, validation set, and testing set, with a proportion of 80% training, 10% validation, and 10% testing. The back-propagation algorithm is used to train the WSAEs-LSTM model as well as the models in the experimental control group including WLSTM, LSTM and RNN. The learning rate, batch size and number of epochs are 0.05, 60 and 5000, respectively. The speed of convergence is controlled by the learning rate, which is a decreasing function of time. Setting the number of epochs and the learning rate to 5000 and 0.05 can achieve the convergence of the training. Once convergence is achieved, the experimental result remains stable even when the combinations of parameters are varied [30].
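As an illustration of how the gates of the memory cell interact, the sketch below implements one step of a standard LSTM cell. The gate equations are not reproduced in this text, so the widely used textbook formulation is assumed; the exact equations in the paper may differ (for example, by including a term that feeds the cell state into the output gate). The parameter names in the dictionary `p` (`Wi`, `Ui`, `bi`, and so on) are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gate(Wx, Uh, b, x, h_prev, act):
    """act(Wx x_t + Uh h_{t-1} + b), applied element-wise."""
    return [act(a + c + bi)
            for a, c, bi in zip(matvec(Wx, x), matvec(Uh, h_prev), b)]

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a standard LSTM memory cell (assumed formulation)."""
    i = gate(p["Wi"], p["Ui"], p["bi"], x, h_prev, sigmoid)          # input gate
    f = gate(p["Wf"], p["Uf"], p["bf"], x, h_prev, sigmoid)          # forget gate
    o = gate(p["Wo"], p["Uo"], p["bo"], x, h_prev, sigmoid)          # output gate
    c_tilde = gate(p["Wc"], p["Uc"], p["bc"], x, h_prev, math.tanh)  # candidate state
    # New cell state: keep part of the old state, add part of the candidate.
    c = [it * ct + ft * cp for it, ct, ft, cp in zip(i, c_tilde, f, c_prev)]
    # Cell output: the output gate decides how much of the state is exposed.
    h = [ot * math.tanh(ci) for ot, ci in zip(o, c)]
    return h, c
```

The additive update of the cell state is what lets information and gradients persist over many time steps. With 4 delays, `lstm_step` would be applied to the 4 most recent feature vectors in order before the final `h` is mapped to the next day's closing price.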

Data descriptions
In this part, we present details regarding our sample selection and the input variables we choose for model prediction. Also, the data resources are provided in this section.

Sample selection and input variables
The six stock indices we choose are the CSI 300, Nifty 50, Hang Seng index, Nikkei 225, S&P500 and DJIA index. As we noted before, market state may potentially impact the validity of the neural network. Samples from different market conditions can be helpful in solving this problem. The S&P500 and DJIA index are traded in the New York stock exchange, which is commonly considered the most advanced financial market in the world. Therefore, they represent markets at the highest level of development. By contrast, financial markets in both mainland China and India are often classified as new markets. In fact, most of their market institutions are still far from complete. Thus, we choose the CSI 300 and Nifty 50 to represent developing markets. In addition to the markets described above, the Hang Seng index in Hong Kong and the Nikkei 225 index in Tokyo represent a market condition that falls between the developed and developing market. Admittedly, financial markets in Hong Kong and Tokyo are usually considered developed markets in most scenarios. However, in this paper, we treat these two markets as less mature than the US stock market. Therefore, those six stock indices give us a natural setting to test the robustness of model performance under different market conditions.
We select three sets of variables as the inputs. Table 1 describes the details. The first set of variables in Panel A is the historical trading data of each index. Following the previous literature, the data includes Open, High, Low, and Close price (OHLC variables) as well as the trading volume. These variables present the basic trading information of each index. Another set of inputs is 12 widely used technical indicators of each index. Panel B gives the details.
The final set of inputs is the macroeconomic variables. Without a doubt, the macroeconomic conditions across regions also play critical roles in influencing the performance of the stock market. Zhao et al. [71] conclude that the fluctuation of the RMB exchange rate can influence the trend of A-share markets in mainland China. Therefore, the addition of macroeconomic variables can be helpful in introducing more information into neural network prediction. We select two kinds of macro variables: the exchange rate and the interest rate. Both rates may affect the money flow in the stock market and then finally impact the performance of stocks. Specifically, we choose the US dollar index as the proxy for the exchange rate. It is acknowledged that ...
Specifically, it is noted that we do not present an a priori time series analysis in this paper. Indeed, the application of classic time series models, such as Auto Regressive Integrated Moving Average (ARIMA), usually requires strict assumptions regarding the distributions and stationarity of time series. As financial time series are usually known to be very complex, nonstationary and very noisy, it is necessary for one to know the properties of the time series before the application of classic time series models [72,73]. Otherwise, the forecasting effort would be ineffective. However, when artificial neural networks are used, an a priori analysis of the time series is not indispensable. First, ANNs do not require prior knowledge of the time series structure because of their black-box properties [74]. Also, the impact of the stationarity of time series on the prediction power of ANNs is quite small. Related evidence has shown that it is feasible to relax the stationarity condition to non-stationary time series when applying ANNs to predictions [75]. Therefore, we simplify the process of a priori data analysis and directly put the data into the model.

Experiment design
We present the details regarding how we obtain the predicted value and evaluate the performance of each model.

Prediction approach
The prediction procedure follows the subsection prediction method described in Chan et al. [32]. In particular, this procedure consists of three parts. The first part is the training part, which is used to train the model and update model parameters. The second part is the validating part. We use it to tune hyper-parameters and get an optimal model setting. The last one is the test part, where we use the optimal model to predict data. Because the data are limited, our time frame for each part differs from that in Chan et al. [32]. In the training part, we use the past two years' worth of data to train the models. The following period of three months (a calendar quarter) is used for the validating part. In the test part, in line with popular portfolio management practice, we predict the quarterly performance of each model. This process continues for six years on each quarter from Oct. 2010 to Sep. 2016. Finally, for each stock index, there are 24 quarterly and 6 yearly predicted results. The prediction procedure is illustrated in Fig 7. To simplify the demonstration of results, we report the yearly performance instead of the quarterly performance in this paper. Thus, the predictive accuracy and trading returns of the models are presented for six yearly periods. Details regarding the interval of the six years can be found in Table 2.
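The rolling schedule described above can be laid out with a few lines of month arithmetic. The sketch assumes that the whole window slides forward by one quarter after each test period, which is consistent with the sample starting in Jul. 2008 and the first test quarter starting in Oct. 2010; `rolling_windows` and its parameter names are hypothetical.

```python
def add_months(year, month, n):
    """Shift a (year, month) pair by n months (n may be negative)."""
    idx = year * 12 + (month - 1) + n
    return idx // 12, idx % 12 + 1

def rolling_windows(first_test=(2010, 10), n_quarters=24,
                    train_months=24, val_months=3, test_months=3):
    """Start months of the train/validation/test parts for each test quarter."""
    windows = []
    for q in range(n_quarters):
        test_start = add_months(first_test[0], first_test[1], q * test_months)
        val_start = add_months(test_start[0], test_start[1], -val_months)
        train_start = add_months(val_start[0], val_start[1], -train_months)
        test_end = add_months(test_start[0], test_start[1], test_months - 1)
        windows.append({
            "train_start": train_start,   # two years of training data
            "val_start": val_start,       # one quarter of validation data
            "test_start": test_start,     # one quarter of out-of-sample data
            "test_end": test_end,
        })
    return windows
```

Under these assumptions the first window trains on Jul. 2008 to Jun. 2010, validates on Jul. to Sep. 2010 and tests on Oct. to Dec. 2010, and the last of the 24 test quarters ends in Sep. 2016.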

Performance measurement
We discuss performance measurements in this part. We first present the accuracy measurements selected to judge the predictive performance. Next, we explain how we test the profitability performance of each model.
Predictive accuracy performance. Previous papers select several indicators to measure how well the model predicts the trend of financial markets [2,23,33,34]. In this paper, we follow their method and choose three classical indicators (i.e., MAPE, R and Theil U) to measure the predictive accuracy of each model. The definitions of these indicators are as follows:

$$MAPE = \frac{1}{N} \sum_{t=1}^{N} \left| \frac{y_{t} - y^{*}_{t}}{y_{t}} \right|$$

$$R = \frac{\sum_{t=1}^{N} (y_{t} - \bar{y})(y^{*}_{t} - \bar{y}^{*})}{\sqrt{\sum_{t=1}^{N} (y_{t} - \bar{y})^{2} \sum_{t=1}^{N} (y^{*}_{t} - \bar{y}^{*})^{2}}}$$

$$Theil\ U = \frac{\sqrt{\frac{1}{N} \sum_{t=1}^{N} (y_{t} - y^{*}_{t})^{2}}}{\sqrt{\frac{1}{N} \sum_{t=1}^{N} y_{t}^{2}} + \sqrt{\frac{1}{N} \sum_{t=1}^{N} (y^{*}_{t})^{2}}}$$

In these equations, $y_t$ is the actual value and $y^{*}_{t}$ is the predicted value; $\bar{y}$ and $\bar{y}^{*}$ are their respective means. N represents the prediction period. MAPE measures the size of the error. It is calculated as the relative average of the error. R is a measure of the linear correlation between two variables. Theil U is a relative measure of the difference between two variables. It squares the deviations to give more weight to large errors. A larger R means that the predicted value is more similar to the actual value, while smaller MAPE and Theil U also indicate that the predicted value is close to the actual value [23,76].
Profitability performance. A buy-and-sell trading strategy is created based on the predicted results of each model. The idea is that, under the same trading strategy, we want to find the most valuable model, that is, the one that could earn the highest profits for investors. The buy-and-sell trading strategy is widely used for evaluating profitability performance [35].
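The three accuracy indicators named above (MAPE, R and Theil U) can be computed with the short sketch below. It assumes the common definitions: MAPE as the mean absolute relative error, R as the Pearson correlation coefficient, and Theil U in the bounded form that divides the root mean squared error by the sum of the root mean squares of the two series. The function names are hypothetical.

```python
import math

def mape(actual, pred):
    """Mean absolute percentage error, as a fraction (0.1 means 10%)."""
    return sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)

def pearson_r(actual, pred):
    """Linear correlation coefficient between actual and predicted values."""
    n = len(actual)
    mean_a, mean_p = sum(actual) / n, sum(pred) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, pred))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov / math.sqrt(var_a * var_p)

def theil_u(actual, pred):
    """Theil's inequality coefficient (0 = perfect fit, assumed bounded form)."""
    n = len(actual)
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / n)
    rms_a = math.sqrt(sum(a * a for a in actual) / n)
    rms_p = math.sqrt(sum(p * p for p in pred) / n)
    return rmse / (rms_a + rms_p)
```

A forecast that overshoots every value by 10% still has R equal to 1, because R only measures co-movement, while MAPE and Theil U register the systematic error. This is one reason to report the three indicators together.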
The strategy recommends that investors buy when the predicted value of the next period is higher than the current actual value. Conversely, it recommends that investors sell when the predicted value is smaller than the current actual value. Specifically, the strategy can be described by the following rules:

$$\text{Buy signal}: \; y^{*}_{t+1} > y_{t}$$

$$\text{Sell signal}: \; y^{*}_{t+1} < y_{t}$$

where $y_t$ denotes the current actual value, and $y^{*}_{t+1}$ is the predicted value for the following time period. Strategy earnings are measured by the strategy return R, where b and s denote the total number of days for buying and selling, respectively, and B and S are the transaction costs for buying and selling, respectively. Due to the difficulty in executing the short sale of a basket of stocks in spot markets and the huge transaction costs it produces, we execute this strategy by trading the corresponding index future contracts instead of using stock indices. However, a main concern before this execution is whether the index futures move closely with their underlying stock indices. In fact, evidence from both the theoretical and empirical literature supports the close connection between stock indices and their corresponding index futures [77][78][79][80][81]. Moreover, to get stronger evidence, we further test the long-term relationships between the six stock indices and their corresponding index futures. Results from the Spearman correlation and cointegration tests show that all of our indices have a stable long-term relationship with their corresponding index futures (S1 Table). Therefore, we believe our predictive results from spot markets can be successfully applied to their corresponding index future markets.
Based on the above trading rule, we sell short the index futures contracts when the predicted price is below the current price and buy the contracts when the predicted price is higher than the current one. We notice that some markets have more than one futures product. For example, both the Hang Seng and the S&P 500 index have two types of futures products: the standard futures contract and the mini futures contract. However, unlike these two markets, China only has the standard CSI 300 index futures contract. Thus, for the purpose of consistency among markets, we select the standard futures product to execute the trading strategy.
To make the results more realistic, we consider the influence of transaction costs on profit. As the cost rates differ among the markets and are occasionally adjusted for regulatory purposes, we unify the cost rates across our sample markets into one rate within our sample period in order to simplify the calculation procedure. The chosen cost rate of unilateral trading is 0.01%.
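The trading rule and the unified cost rate described above can be sketched as follows. This is a minimal illustration and not the paper's earnings definition: the function names `trading_signals` and `strategy_return`, the one-day holding period, and the choice to charge the cost rate on both the opening and the closing price are all assumptions made for the example.

```python
def trading_signals(actual, predicted_next):
    """Return 'buy', 'sell' or 'hold' for each day t.

    actual[t] is the current actual value y_t and predicted_next[t] is the
    model's prediction for day t+1 (y*_{t+1}).
    """
    signals = []
    for y_t, y_pred in zip(actual, predicted_next):
        if y_pred > y_t:
            signals.append("buy")    # predicted rise: go long
        elif y_pred < y_t:
            signals.append("sell")   # predicted fall: go short
        else:
            signals.append("hold")
    return signals


def strategy_return(actual, predicted_next, cost_rate=0.0001):
    """Total percentage return of one-day long/short trades.

    Assumption: each signal opens a position at y_t and closes it at
    y_{t+1}, paying cost_rate (0.01% by default) on both prices.
    """
    total = 0.0
    # The last day has no realised next-day price, so it is not traded.
    signals = trading_signals(actual[:-1], predicted_next[:-1])
    for t, signal in enumerate(signals):
        y_t, y_next = actual[t], actual[t + 1]
        cost = cost_rate * (y_t + y_next)
        if signal == "buy":
            total += (y_next - y_t - cost) / y_t
        elif signal == "sell":
            total += (y_t - y_next - cost) / y_t
    return 100.0 * total
```

Passing `cost_rate=0.0` gives the gross return, which makes it easy to see how much of the profit the transaction costs absorb.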
In addition to the buy-and-sell trading strategy, we also incorporate the buy-and-hold trading strategy, which provides a passive threshold for testing the profitability of the proposed models, in line with previous literature [32,36]. The trading returns of each model are compared against the returns of the buy-and-hold strategy. Because holding a futures contract for a long time would be subject to great risk in reality, we execute the buy-and-hold strategy by trading in the spot stock market instead of the index futures market. The computation of transaction costs in the spot stock market follows the rule described above. The unified cost in the spot market is 0.25% for buying and 0.45% for selling.
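The passive benchmark can be illustrated in the same spirit. The sketch below assumes that the spot-market costs are proportional to the traded price and that the position is opened at the first price and closed at the last one; the function name `buy_and_hold_return` is hypothetical.

```python
def buy_and_hold_return(prices, buy_cost=0.0025, sell_cost=0.0045):
    """Percentage return of buying at prices[0] and selling at prices[-1].

    Assumption: costs are proportional to the traded price, using the
    unified spot-market rates of 0.25% (buy) and 0.45% (sell) as defaults.
    """
    entry = prices[0] * (1.0 + buy_cost)     # price paid including cost
    exit_ = prices[-1] * (1.0 - sell_cost)   # proceeds after cost
    return 100.0 * (exit_ - entry) / entry
```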

Results
For each stock index, we show the yearly predicted data from the four models and the corresponding actual data in the graph.

Predictive accuracy test
The results of the predictive accuracy test for each model are reported in Tables 3 to 5. Each table includes the testing results for two stock indices traded under similar market conditions. Within each table, each panel shows predictive performance measured by one of our three accuracy indicators. For each stock index, we report the six yearly results together with the average value over the six years.
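To make the three accuracy indicators concrete, the sketch below computes them in their commonly used textbook forms. These forms are assumptions rather than a transcription of the paper's equations: MAPE is taken as the mean absolute relative error, R as the Pearson correlation coefficient, and Theil U as the bounded variant that divides the root mean squared error by the sum of the root mean squares of the two series. The function names are hypothetical.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error: average of |(y_t - y*_t) / y_t|."""
    n = len(actual)
    return sum(abs((y - p) / y) for y, p in zip(actual, predicted)) / n


def pearson_r(actual, predicted):
    """Pearson linear correlation between actual and predicted values."""
    n = len(actual)
    mean_y = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((y - mean_y) * (p - mean_p) for y, p in zip(actual, predicted))
    var_y = sum((y - mean_y) ** 2 for y in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_y * var_p)


def theil_u(actual, predicted):
    """Theil U (assumed bounded form): RMSE scaled by the two series' RMS."""
    n = len(actual)
    rmse = math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)
    rms_y = math.sqrt(sum(y * y for y in actual) / n)
    rms_p = math.sqrt(sum(p * p for p in predicted) / n)
    return rmse / (rms_y + rms_p)
```

With these definitions a perfect forecast gives MAPE and Theil U of 0 and R of 1, which matches the reading that a larger R and smaller MAPE and Theil U indicate better predictions.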
Table 3 records the model performance in forecasting CSI 300 and Nifty 50. It can be seen from the table that WSAEs-LSTM performs much better than the other three models in predicting both stock indices. For example, in predicting the CSI 300 index, the average values of MAPE and Theil U of WSAEs-LSTM reach 0.019 and 0.013, respectively, which are much lower than those of the other three models. In addition, the indicator R has an average value of 0.944, which is the highest among the four models. In fact, WSAEs-LSTM outperforms the other three not only on average but also in each year. To confirm the robustness of our findings, we examine the statistical significance of the differences between WSAEs-LSTM and the other three models. Specifically, we compare the 24 quarterly results of WSAEs-LSTM with those of the three models for each accuracy indicator, using a t-test. The statistical evidence shows that the differences between WSAEs-LSTM and the other three models are all statistically significant at the 5% level for both stock indices.
Tables 4 and 5 present the models' performance for the remaining four stock indices: Table 4 shows model performance in the Hong Kong and Tokyo markets, while Table 5 reports the results for the S&P 500 and DJIA indices. Similar to what we found in Table 3, WSAEs-LSTM still has the lowest MAPE and Theil U and the highest R among the four models, both in terms of average values and in terms of yearly results. Again, the differences between our proposed model and the other three models pass the statistical test at the 5% significance level. We conclude that WSAEs-LSTM can stably obtain lower prediction errors and higher predictive accuracy than the other three models regardless of market conditions.
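The significance check on the 24 quarterly results can be sketched as below. The text only states that a t-test is used, so the paired form shown here is an assumption, as are the function names. The constant 2.069 is the two-sided 5% critical value of Student's t distribution with 23 degrees of freedom, which applies when 24 quarterly pairs are compared.

```python
import math
import statistics

# Two-sided 5% critical value of Student's t with 23 degrees of freedom
# (24 quarterly pairs minus one).
T_CRIT_DF23 = 2.069


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i] (assumed paired test)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


def significant_at_5pct(a, b, critical_value=T_CRIT_DF23):
    """True if the paired difference is significant at the 5% level."""
    return abs(paired_t_statistic(a, b)) > critical_value
```

For a different number of quarters the critical value has to be replaced by the one matching the new degrees of freedom.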
Besides the findings described above, we also discover an interesting pattern in our data: the difference in the predictability of one specific model across two stock indices is quite small if the two indices are traded in markets at a similar stage of development, while the difference increases if the two indices are traded in markets at different stages of development. For example, the MAPE of WSAEs-LSTM in predicting S&P 500 and DJIA is 0.011, while it increases to 0.019 when predicting CSI 300 and Nifty 50. Similar patterns also exist for the other three models. Even though all models exhibit this pattern, the extent of the impact of market conditions differs among models. For example, market conditions seem quite influential on RNN. Its average MAPE value ranges from 0.018 to 0.066 among the six stock indices. That means the worst predictability of RNN is only around one-fourth of its best predictability. Both WLSTM and LSTM exhibit patterns similar to RNN. By contrast, the performance of WSAEs-LSTM is quite stable across markets compared with these three models. This could be because SAEs are more powerful in processing noisy data than the other three. The implication of these findings is that our model could be more valuable than the others in predicting systems that are less mature and have higher volatility.

Profitability test
The results of the profitability test are shown in Table 6. As before, we report both yearly returns and the average returns over the six years. Each panel describes the trading returns gained by the models in a specific market condition. In particular, the last row in each panel reports the returns of the buy-and-hold strategy in trading a specific stock index. Panel A shows the profitability performance of each model in developing markets. The left part is the trading performance based on predicted data for CSI 300, while the right part is the trading performance based on predicted data for Nifty 50. The results suggest that WSAEs-LSTM earns substantially more profit than the other three models. For example, the average annual earnings of the proposed model reach 63.026% in mainland China and 45.418% in the Indian market, while the annual earnings of the other three models are nearly all below 40%. Regarding yearly returns, WSAEs-LSTM also outperforms the other models. It earns more than 40% in almost every year, which is very difficult for the other three models.
Panels B and C report the trading returns in relatively developed and developed markets, respectively. Similar to the findings in Panel A, WSAEs-LSTM acquires stable earnings in every year, while the other models face larger variance in trading earnings. In addition, in terms of average earnings within our sample period, our proposed model still earns the highest profits according to the results in Panels B and C. Furthermore, to reach a robust conclusion, we also test whether the return differences between WSAEs-LSTM and the remaining three models are statistically significant. Again, we compare the 24 quarterly returns among the models. The t-test results show that the return differences between WSAEs-LSTM and the other three models all pass the significance test at the 5% level. Therefore, our findings support the conclusion that WSAEs-LSTM has the best predictability among the four models.

Conclusion
This paper builds a novel forecasting framework to predict the one-step-ahead closing price of six popular stock indices traded in different financial markets. The procedure for building this forecasting framework is as follows: first, the denoised time series is generated via a discrete wavelet transform using the Haar wavelet; second, the deep daily features are extracted via SAEs in an unsupervised manner; third, long-short term memory is used to generate the one-step-ahead output in a supervised manner. Our input variables include the daily OHLC variables, technical indicators and macroeconomic variables. The main contribution of this work to the community is that it is the first attempt to introduce the SAEs method to extract deep invariant daily features of financial time series. In addition, the deep learning framework is proposed with a complete set of modules for denoising, deep feature extraction (instead of feature selection) and financial time series fitting. Within this framework, the forecasting model can be developed further by replacing each module with a state-of-the-art method in the areas of denoising, deep feature extraction or time series fitting.
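The first stage of this procedure, Haar wavelet denoising, can be illustrated with a single-level transform. The sketch below is a simplified illustration under stated assumptions: it uses one decomposition level, an even-length input and soft thresholding of the detail coefficients, none of which is specified in this section, and the function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(x):
    """One-level Haar DWT of an even-length series: (approximation, detail)."""
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuild the series from both coefficient sets."""
    x = []
    for a, d in zip(approx, detail):
        x.append((a + d) / SQRT2)
        x.append((a - d) / SQRT2)
    return x


def soft_threshold(coeffs, threshold):
    """Shrink each coefficient towards zero by `threshold` (assumed rule)."""
    return [math.copysign(max(abs(c) - threshold, 0.0), c) for c in coeffs]


def haar_denoise(x, threshold):
    """Denoise by thresholding the detail coefficients and reconstructing."""
    approx, detail = haar_dwt(x)
    return haar_idwt(approx, soft_threshold(detail, threshold))
```

With a threshold of zero the series is reconstructed unchanged, while a very large threshold removes all detail and replaces each pair of observations by its mean.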
We test the predictive accuracy and profitability of our proposed model against the other three models. The results provide evidence that it outperforms the other three in both predictive accuracy and profitability regardless of which stock index is chosen for examination. Although the proposed integrated system has a satisfactory predictive performance, it still has some limitations. For example, a more advanced hyper-parameter selection scheme might be embedded in the system to further optimize the proposed deep learning framework. In addition, deep learning methods are time-consuming, and more attention needs to be paid to GPU-based and heterogeneous computing-based deep learning methods. All of these could be addressed by future studies.

Fig 1. The flowchart of the proposed deep learning framework for financial time series. D(j) is the detailed signal at the j-level. S(J) is the coarsest signal at level J. I(t) and O(t) denote the denoised feature and the one-step-ahead output at time step t, respectively. N is the number of delays of LSTM. https://doi.org/10.1371/journal.pone.0180944.g001

Fig 2. The flowchart of the single layer autoencoder. The model learns a hidden feature a(x) from input x by reconstructing it on x'. Here, $W_1$ and $W_2$ are the weights of the hidden layer and the reconstruction layer, respectively. $b_1$ and $b_2$ are the biases of the hidden layer and the reconstruction layer, respectively. https://doi.org/10.1371/journal.pone.0180944.g002
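The caption above can be read as the following pair of mappings. This is a sketch in the usual textbook form rather than a transcription of the paper's equations: the activation function $f$ and the squared reconstruction loss $L$ are assumptions, while $a(x)$, $x'$, $W_1$, $W_2$, $b_1$ and $b_2$ are the symbols named in the caption.

```latex
\begin{align}
a(x) &= f(W_1 x + b_1) \\
x'   &= f(W_2\, a(x) + b_2) \\
L(x, x') &= \lVert x - x' \rVert^2
\end{align}
```

Training adjusts $W_1$, $W_2$, $b_1$ and $b_2$ so that $x'$ stays close to $x$, and the hidden feature $a(x)$ is then passed on as the input of the next autoencoder in the stack.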

Fig 3. An instance of stacked autoencoders with 5 layers that is trained by 4 autoencoders. https://doi.org/10.1371/journal.pone.0180944.g003
Fig 4 shows an RNN model being unfolded into a full network. The mathematical symbols in Fig 4 are as follows:

Fig 4. A recurrent neural network and the unfolding architecture. U, V and W are the weights of the hidden layer, the output layer and the hidden state, respectively. $x_t$ and $o_t$ are the input vector and output result at time t, respectively. https://doi.org/10.1371/journal.pone.0180944.g004
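The unfolding described in this caption can be sketched in a few lines. The example assumes the common formulation in which the hidden state is updated as $s_t = \tanh(U x_t + W s_{t-1})$ and the output is the linear map $o_t = V s_t$; the tanh activation, the linear output, the state symbol $s_t$ and the helper names are assumptions, while U, V, W, $x_t$ and $o_t$ follow the caption.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_forward(inputs, U, W, V, s0):
    """Unfold a simple RNN over a sequence of input vectors.

    U maps the input to the hidden layer, W carries the previous hidden
    state forward, and V maps the hidden state to the output.
    """
    state = s0
    outputs = []
    for x_t in inputs:
        pre = [u + w for u, w in zip(matvec(U, x_t), matvec(W, state))]
        state = [math.tanh(p) for p in pre]   # s_t = tanh(U x_t + W s_{t-1})
        outputs.append(matvec(V, state))      # o_t = V s_t
    return outputs, state
```

The same three weight matrices are reused at every time step, which is what the unfolded diagram expresses.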

Fig 6 shows an LSTM model being unrolled into a full network, which describes how the value of each gate is updated. The mathematical symbols in Fig 6 are as follows:
2. ..., $U_o$ and $V_o$ are weight matrices.
3. $b_i$, $b_f$, $b_c$ and $b_o$ are bias vectors.
4. $h_t$ is the value of the memory cell at time t.
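For orientation, a widely used LSTM formulation that is consistent with the symbols listed above is given below. It is a sketch under assumptions rather than a transcription of the paper's equations: the input $x_t$, the gate values $i_t$, $f_t$ and $o_t$, the candidate state $\tilde{C}_t$, the cell state $C_t$, and the weight matrices other than $U_o$ and $V_o$ are not named in the surviving text and are introduced here for illustration.

```latex
\begin{align}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
\tilde{C}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
C_t &= i_t \odot \tilde{C}_t + f_t \odot C_{t-1} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + V_o C_t + b_o) \\
h_t &= o_t \odot \tanh(C_t)
\end{align}
```

Here $\sigma$ is the logistic sigmoid and $\odot$ denotes element-wise multiplication. In this form the output gate is the only gate that also sees the current cell state, through $V_o$.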

Fig 7. Continuous dataset arrangement for training, validating and testing during the whole sample period. https://doi.org/10.1371/journal.pone.0180944.g007
Fig 8 illustrates the Year 1 results, and the remaining figures for Year 2 to Year 6 can be found in S1-S5 Figs. According to Fig 8 and S1-S5 Figs, LSTM and RNN have larger variations and distances to the actual data than WSAEs-LSTM and WLSTM. Furthermore, comparing WSAEs-LSTM with WLSTM, the former outperforms the latter: WSAEs-LSTM has less volatility and is closer to the actual trading data than WLSTM. Specifically, the advantage of WSAEs-LSTM in predicting is more obvious in less developed markets than in developed markets.

Table 1. Description of the input variables.
CCI: Commodity channel index: helps to find the start and the end of a trend.
https://doi.org/10.1371/journal.pone.0180944.t001
US dollar plays the most important role in the monetary market. Therefore, it alone could be enough to capture the impact from the monetary market to the stock market. Regarding the interest rate, we select the interbank offered rate in each market as the proxy, namely, Shanghai Interbank Offered Rate (SHIBOR), Mumbai Interbank Offered Rate (MIBOR), Hong Kong Interbank Offered Rate (HIBOR), Tokyo Interbank Offered Rate (TIBOR) and Federal funds rate in US.