Stock index trend prediction based on TabNet feature selection and long short-term memory

  • Xiaolu Wei,

    Roles Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Software, Writing – original draft

    20200034@hubu.edu.cn

    Affiliation Business School, Hubei University, Wuhan, Hubei, China

  • Hongbing Ouyang,

    Roles Data curation, Funding acquisition, Supervision, Validation

    Affiliation Department of Economics, Huazhong University of Science and Technology, Wuhan, Hubei, China

  • Muyan Liu

    Roles Software, Writing – original draft, Writing – review & editing

    Affiliation Business School, Sichuan University, Chengdu, Sichuan, China

Abstract

In this study, we propose TabLSTM, a predictive model that combines TabNet and a Long Short-Term Memory neural network (LSTM) with a complete factor library for stock index trend prediction. Our motivation is that the stock market is driven by numerous interrelated factors, and the factors that affect each stock differ. Therefore, a complete factor library and an efficient feature selection technique are necessary to predict a stock index. We first build a factor database that includes macro, micro and technical indicators. Subsequently, we compute the importance of each factor through TabNet and rank the factors. Based on a prespecified threshold, the optimal factor set includes only the highest-ranked factors. Finally, using the optimal factor set as input, an LSTM is employed to predict the future trend of four stock indices. Empirical validation shows that the combination of TabNet for factor selection and LSTM outperforms existing methods. Moreover, constructing a factor database is necessary for stock index prediction. Our method not only shows that it is feasible to predict stock indices across different financial markets, but also provides a complete factor database and a comprehensive architecture for stock index trend prediction, which may serve as a reference for stock forecasting and quantitative investment.

Introduction

Stock indices play an essential role in the financial market since they reflect the macroeconomic health and the microeconomic issues of a particular market. Existing studies have suggested that, although the market is not efficient, stock indices can to some extent be characterized and predicted, allowing investors to identify and use arbitrage opportunities to make an excess profit [1–3]. The stock index is influenced by the economic cycle, market sentiment and corporate expectations, which makes it difficult to develop suitable computational tools that capture the stock index dynamics [4]. Additional difficulties arise from the intrinsically stochastic nature of the stock index, which shows non-linear, non-Gaussian and non-stationary noise [5–7]. To address these prediction challenges, researchers have employed machine learning algorithms combined with feature selection, which consistently outperform traditional econometric methods in terms of predictive power for the stock index, while sacrificing their inferential aspects [8, 9].

In this paper, we present TabLSTM, a novel deep learning model that not only allows researchers and practitioners in the financial field to predict the trend of a stock index with a low error rate, but also extracts, through a feature selection procedure, a set of factors that appear to be decisive for characterizing the stock index. In particular, our modelling approach for predicting the stock index consists of three stages: 1) construction of a factor database at the micro, macro and technical levels; 2) feature selection on the factor database using TabNet, where the selected features (the optimal feature set) determine the driving factors for predictive purposes; 3) construction of a Long Short-Term Memory neural network (LSTM) model for stock index trend prediction.

To assess the performance of the model, four stock indices are considered, namely the S&P 500 Index, the Dow Jones Composite Index, the China Shenzhen Composite Index and the China Shanghai Composite Index. Feature selection through TabNet is applied to each index and leads to four distinct optimal feature sets, each containing the factors likely to be important for the dynamics of the corresponding stock index. Finally, each optimal feature set is used as input for the LSTM model, which predicts whether the stock index will exhibit an up- or down-trend. An out-of-sample evaluation shows that our model outperforms other existing methods. Moreover, the construction of a comprehensive factor database is necessary for stock index prediction.

The contributions of this paper are as follows:

  • Build a comprehensive factor library from three perspectives: macro, micro and technical. This factor library includes both traditional factors such as price-to-earnings ratio (P/E) and emerging factors such as Google Attention Index.
  • Study two distinct financial markets, including financial markets in developed countries and those in emerging countries.
  • Use the original data without data preprocessing such as normalization, which ensures that important information is not lost.
  • Study the current importance of each factor (local interpretability) and the impact of each factor on the stock index trend (global interpretability) through the TabNet method.

The rest of this paper is organized as follows. In Section 2, fundamentals and concepts of market factors, feature selection and machine learning algorithms for trend prediction are discussed. In Section 3, details of the TabNet and LSTM models and their application to stock index prediction are presented. In Section 4, the data and evaluation metrics are described. In Section 5, experimental results of the proposed model are shown and discussed, followed by a conclusion.

Literature review

1. Information and price predictability of stock markets

The idea of efficient financial markets was first proposed by the French mathematician Bachelier [10] in the early 20th century. By analyzing price changes in the stock market, he found that stock prices reflect almost all relevant information. Then, in the 1960s, Fama et al. proposed the efficient market hypothesis (EMH) [11–13]. The efficient market hypothesis suggests that stock prices at any given time are a fair reflection of all currently available information, and that emerging information is quickly reflected in stock prices. Therefore, analysis of historical and current data cannot help investors predict the future or earn excess profits. Depending on how prices respond to information, efficient markets can be classified as weak-form, semi-strong-form and strong-form efficient markets. The weak-form efficient market is consistent with the random walk hypothesis, which states that stock prices fluctuate randomly and price changes are independent of each other. Moreover, the weak form implies that the current stock price contains all available historical information and that it is impossible for investors to earn excess profits through technical analysis. The semi-strong form assumes that stock prices adjust quickly to public information in the market and that it is impossible to earn excess returns based on fundamental analysis. The strong form assumes that stock prices reflect all available information in the market, including historical information, public information and relevant private information, so that no investor can earn excess profits by any means.

However, with further research, researchers have found that the two main assumptions of the efficient market hypothesis (no information costs, and "rational people" who seek to maximize their own interests) do not correspond to reality. At the same time, more and more anomalies in financial markets were discovered, and the efficient market hypothesis was unable to provide an effective and rational explanation for these anomalies.

To explain the anomalies in financial markets, Tversky and Kahneman first applied psychological assumptions to financial markets and proposed "prospect theory" in 1979 [14]. In this theory, they argued that investors have cognitive biases that conflict with the assumptions of the efficient market hypothesis, owing to risk attitudes, mental accounting and overconfidence. In 1989, Bhushan introduced the "herding effect" by analyzing the relationship between securities analysts and investors [15]. In 1993, Jegadeesh and Titman conducted an empirical study based on average stock returns over the past six months and proposed the phenomenon of "return inertia", also known as momentum [16]. Latif et al. (2011) discussed different financial market anomalies and their causes [17]. Based on their empirical analysis, they suggested that investors could use calendar anomalies (weekend effect, monthly effect, annual effect and January effect), fundamental anomalies (value anomalies, low book price stocks, low yield price stocks, neglected stocks, high dividend yield stocks), technical analysis (moving averages) and insider trading to earn anomalous profits.

By reviewing the literature on information and price predictability in stock markets, this paper finds strong theoretical and empirical support for stock index prediction. Although the efficient market hypothesis denied the feasibility of stock index prediction, the emergence of financial market anomalies and the later introduction of behavioral finance support the predictability of the stock market. Therefore, it is theoretically and practically feasible to construct a factor database and predict stock indices in this paper.

2. Stock prediction methods

Predictive models for stock indices can be broadly divided into two domains, namely econometric models and machine learning models.

Classic econometric models such as ARIMA, VAR and GARCH are widely used in the financial field, and they are mathematically and conceptually well understood. While the econometric framework offers straightforward interpretation of the results and the possibility of performing hypothesis tests, underlying assumptions such as linearity, Gaussianity of the errors and stationarity do not hold for stock index data [18–21].

On the other hand, machine learning methods such as Support Vector Machine (SVM) and Gradient Boosted Decision Trees (GBDT) have become popular in stock index prediction due to their accuracy and their less restrictive underlying assumptions [20, 22, 23]. Within the machine learning domain, deep learning techniques have recently drawn the attention of researchers, principally because of the high accuracy that deep learning models can reach. Deep learning models such as LSTM, Gated Recurrent Unit (GRU) and Deep Neural Network (DNN) have been employed to predict the trend of several stock indices [24–28]. For example, Hoseinzade et al. proposed a Convolutional Neural Network (CNN) model to predict stock index trends. The authors validated the CNN-based model on financial data drawn from different sources, obtaining good prediction accuracy in all cases [29]. Subsequently, Eapen et al. combined CNN first with Bi-LSTM and then with GRU, reaching very high prediction accuracy on S&P 500 data in both cases [30, 31]. Mehtab et al. proposed four LSTM-based models, each with a different architecture, and employed them to predict NIFTY 50 index values [32]. Nabipour et al. compared Recurrent Neural Network (RNN) and LSTM to nine standard machine learning methods, showing that the two deep learning models systematically outperformed the other machine learning methods on continuous data [33]. Livieris et al. built and trained a Weighted-Constrained DNN (WCDNN) to predict the trend of three stock indices, showing not only the good predictive power of the model but also its numerical efficiency [34].

Most of the deep learning literature on the prediction of stock indices focuses on improving the accuracy of the prediction models, while less attention is devoted to identifying and selecting the driving factors of stock index dynamics. Within the financial literature, several factors have been proposed and identified, yet few of them appear not to overlap with already identified factors, and even fewer appear to play a role in the description of asset prices [35]. To tackle the multicollinearity of the existing factors that affect the trend of a stock index, efficient feature selection procedures need to be employed. In detail, feature selection techniques aim to reduce the redundancy of the information within a dataset by selecting only a few of the many available factors. As a result, feature selection techniques remove factors containing redundant information, while improving the efficiency, accuracy and interpretability of the prediction model. In other words, an efficient feature selection procedure reduces the training time of the model by removing redundant factors from the input dataset, while avoiding overfitting of the financial data [28, 36]. Despite the several advantages of feature selection procedures, few studies have applied them in combination with predictive models for stock index trends. Through a careful literature review, we identified three main issues that may hold researchers back from applying feature selection procedures: 1) existing techniques require the original data to be normalized, a procedure that alters the initial structure of the data with a consequent loss of information [37]; 2) selection techniques return a ranked list of selected factors but fail to explain how the features contribute to the stock index dynamics and how they have been selected [38–40]; 3) several studies apply selection methods only to technical indicators, while neglecting micro and macro factors, which may play a pivotal role in a stock index trend [41–43].

To tackle these three issues, we first constructed a manually curated factor database, which contains not only technical features but also micro and macro indicators. The factor database contains traditional factors such as the price-to-earnings ratio (P/E) as well as emerging factors such as the Google Attention Index. Subsequently, we selected the relevant factors through TabNet, which can unveil the logic behind the ranking of the features. So as not to obscure the natural variability of the information, and to preserve local and global interpretability, the TabNet procedure was performed on non-preprocessed data. Finally, we used the resulting important factors to train our predictive LSTM model.

Stock index trend prediction based on TabNet and LSTM

1. Mathematical model on stock index trend prediction

While deep learning models are often regarded as black boxes with good predictive power, in our framework we identify the factors that characterize the stock index trends while preserving the high accuracy of the model. This goal is reached by combining TabNet, a feature selection process that selects a minimal number of relevant micro, macro and technical factors, with LSTM, a deep learning model with strong predictive power. In more detail, the TabNet feature selection procedure selects the most relevant factors from a manually curated database of economic and financial indicators. Selecting the features reduces the complexity of the model training phase while increasing the interpretability of the results. Once the selection procedure is complete, the LSTM model is trained on the selected factors and subsequently employed to predict the trends of a stock index. The proposed three-step procedure increases the interpretability of the predictions while preserving the accuracy of the model.

As a first step, the factor database is constructed manually through careful data mining across different financial platforms. The mined factors are collected in a database and represented as a set of n-dimensional vectors, where each dimension represents a factor. The factor database consists of a collection of micro, macro and technical indicators.

Secondly, the TabNet procedure is applied to the database to compute the importance score of each factor. The less relevant factors are then filtered out based on a prespecified threshold. The resulting set contains, for each time t, the vector of the n′ selected factors, with n′ ≪ n.

Thirdly, the LSTM architecture is designed and trained on the selected factor vectors paired with the labels yt, where yt denotes the stock index trend at time t. The model is validated on unseen stock index data.

2. TabNet on stock index feature selection

Although feature selection procedures are widely used to predict stock trends, their applications appear to neglect two potential issues that can arise. The first issue regards the preprocessing of the data. Several studies apply the feature selection method after the application of preprocessing strategies on the data, which change the relation across the variables with consequential loss of information. The second issue that arises concerns how the ranking of the importance of the features is performed. Previous studies focused only on the final ranking of features while ignoring how the features have been selected and ranked. As a result, potential confounders may be filtered out from the selection procedure.

To address both issues, Arik and Pfister proposed TabNet, a feature selection method published in 2020 and accepted by the Association for the Advancement of Artificial Intelligence (AAAI). Compared to other feature selection techniques, TabNet does not require any data preprocessing. Additionally, TabNet not only unveils the importance of each feature at each step of the selection procedure, but also shows how important the selected variable is overall. In other words, TabNet provides information on each feature's local and global importance within the selection procedure. Additional computational experiments have shown that TabNet outperforms several existing feature selection procedures on cross-sectional, time-series or panel data [44].

In detail, TabNet relies on a sequential multi-step deep neural network architecture, and it returns interpretable, informative features that can be employed to train the predictive model. TabNet consists of two different architectures, one for encoding and the other for decoding. The TabNet encoder consists of four modules: feature transformer, attentive transformer, feature masking, and split block. Its overall architecture is summarized in Fig 1. As mentioned above, the TabNet encoder uses the results of each module to determine the local importance of each feature and retains the most relevant features for the subsequent modules. The final module produces the list of the most relevant features and their global importance scores. The TabNet decoder comprises a block of feature transformers.

This research applies TabNet feature selection on the manually curated factor database. Only the TabNet encoder architecture is used for the factor selection process.

Specifically, given a factor matrix $f \in \mathbb{R}^{B \times D}$, where $B$ is the batch size and $D$ is the number of factors, the steps for selecting factors of the financial market index through the TabNet encoder are as follows:

  1. Preliminary feature transformation and data split. The factors are transformed by applying the feature transformer. Subsequently, the split block divides the transformed data into batches and extracts the information required by the attentive transformer. Mathematically, this step is given by Eq (1), where $f_i$ denotes the transformed batch at the $i$-th decision step. At the initial decision step, $a[0]$ and $f_0$ denote the required information and the transformed batch, respectively.
  2. Preliminary attentive transformation. The attentive transformer processes the factor information obtained from the previous step. The attentive transformation can be described as $M[i] = \mathrm{sparsemax}(P[i-1] \cdot h_i(a[i-1]))$ (2), where sparsemax is a normalization method that encourages sparsity by mapping the Euclidean projection onto the probability simplex, $M[i] \in \mathbb{R}^{B \times D}$ is a learnable mask for factor selection, and $h_i$ is the trainable function at the $i$-th decision step. The quantity $P[i]$ is a scale term that indicates how much the corresponding factor has already contributed. It is computed as $P[i] = \prod_{j=1}^{i} (\gamma - M[j])$ (3), where $\gamma$ denotes a relaxation parameter. When $\gamma = 1$, each factor can contribute to the prediction of the stock index trend at only one decision step.
  3. Preliminary feature masking. The importance of each factor is computed through the multiplication $M[0] \cdot f$. The element $M_{b,j}[0]$ represents the contribution of factor $f_{b,j}$ to explaining the stock index trend. When $M_{b,j}[0] = 0$, the $j$-th factor in the $b$-th sample does not capture any stock index trend and is consequently filtered out.
  4. Feature transformation and data split. A further feature transformer transforms the remaining factors obtained in the previous step. In the split block, the transformed data are divided into $d[i]$, the values required for the actual prediction, and $a[i]$, the data that undergo a further attentive transformation. Mathematically, the step can be written as $[d[i], a[i]] = f_i(M[i] \cdot f)$ (4), where $d[i] \in \mathbb{R}^{B \times N_d}$ and $a[i] \in \mathbb{R}^{B \times N_a}$.
  5. Attentive transformation. At this stage, to compute $M[i]$, the information $a[i-1]$ obtained in the preceding decision step is transformed as $M[i] = \mathrm{sparsemax}(P[i-1] \cdot h_i(a[i-1]))$ (5), where $\sum_{j=1}^{D} M_{b,j}[i] = 1$.
  6. Feature masking. As in the preliminary feature masking step, the multiplication $M[i] \cdot f$ is performed to calculate each factor's contribution to predicting the stock index trend. The value $M_{b,j}[i]$ defines the importance of the factor $f_{b,j}$ at the $i$-th decision step. When the $j$-th factor in the $b$-th sample does not contribute to predicting the trend, $M_{b,j}[i] = 0$.
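  7. Calculation of the global contribution. After looping steps (4)-(6) for 1000 epochs, the global contribution of the $j$-th factor in the $b$-th sample to the prediction of the stock index trend is denoted as $M_{agg-b,j}$ and calculated as $M_{agg-b,j} = \sum_{i=1}^{N_{step}} \eta_b[i] M_{b,j}[i] \big/ \sum_{j=1}^{D} \sum_{i=1}^{N_{step}} \eta_b[i] M_{b,j}[i]$ (6), where $\eta_b[i] = \sum_{c=1}^{N_d} \mathrm{ReLU}(d_{b,c}[i])$ represents the factor contribution at the $i$-th decision step in the $b$-th sample. When $d_{b,c}[i] < 0$ for every $c$, the contribution of all factors at the $i$-th decision step is zero. In addition, $\sum_{j=1}^{D} M_{agg-b,j} = 1$.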
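
The masking and aggregation steps above can be illustrated with a minimal single-sample sketch in plain Python. It assumes the formulation of the original TabNet method: the mask is the sparsemax of the prior scale times the attentive-transformer output, the prior scale is updated multiplicatively by (γ − M), and the aggregate importance weights each step's mask by the sum of the positive parts of that step's decision output. The function names and the toy `scores` vector, which stands in for the learned output of the attentive transformer, are hypothetical and only for illustration.

```python
def sparsemax(z):
    """Euclidean projection of the score vector z onto the probability simplex."""
    z_sorted = sorted(z, reverse=True)
    cumsum, tau = 0.0, 0.0
    for idx, value in enumerate(z_sorted, start=1):
        cumsum += value
        # The support keeps growing while this condition holds.
        if 1.0 + idx * value > cumsum:
            tau = (cumsum - 1.0) / idx
    return [max(v - tau, 0.0) for v in z]


def attentive_mask(prior, scores):
    """Mask for one sample: sparsemax of prior scale times attentive output."""
    return sparsemax([p * s for p, s in zip(prior, scores)])


def update_prior(prior, mask, gamma):
    """Prior scale update: factors already used are down-weighted by (gamma - mask)."""
    return [p * (gamma - m) for p, m in zip(prior, mask)]


def aggregate_importance(masks, decisions):
    """Normalized global importance from per-step masks and decision outputs."""
    # eta for each step: sum of the positive parts of the decision output.
    etas = [sum(max(c, 0.0) for c in d) for d in decisions]
    n_factors = len(masks[0])
    raw = [sum(eta * mask[j] for eta, mask in zip(etas, masks))
           for j in range(n_factors)]
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else raw


# Toy walk-through with three factors and two decision steps (illustrative values).
prior = [1.0, 1.0, 1.0]
scores = [2.0, 1.0, 0.1]            # stand-in for the attentive transformer output
mask_1 = attentive_mask(prior, scores)
prior = update_prior(prior, mask_1, gamma=1.3)
mask_2 = attentive_mask(prior, scores)
importance = aggregate_importance([mask_1, mask_2], [[1.0, -0.5], [0.5, 0.5]])
```

In the toy run the first step puts all its weight on the first factor, and the prior update then shifts most of the second step's weight to the second factor. This is the behaviour the relaxation parameter γ is meant to control.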

3. LSTM on stock index trend prediction

In deep learning, the Recurrent Neural Network (RNN) is a neural network model known for reliable predictions on classification problems when the input data are time series. In particular, the chain structure of the RNN performs well even when the time series are sparse. On the other hand, an RNN can only retain short sequences because of well-known drawbacks such as vanishing and exploding gradients [45].

In light of these pros and cons, a recurrent architecture is employed here to predict future trends of the stock index from the historical factor data. Specifically, we use the Long Short-Term Memory (LSTM) model, which can store longer sequences of information than the classic RNN. In particular, the introduction of a storage unit gives the LSTM a higher memory capacity. The storage unit consists of four parts: 1) an input gate; 2) a forget gate; 3) an output gate; 4) a self-circulating neuron. The overall structure of the storage unit is presented in Fig 2.

In the storage unit, gates control the interaction among adjacent units. The input gate selects the information that needs to be processed or retained. The selected data pass through a sigmoid (S-shaped) layer and a tanh layer, and from the input gate the model decides whether the inputs can change the state of the storage unit. Subsequently, the forget gate chooses whether to keep or discard information. The forget gate returns a continuous value in the unit interval, where "0" means that the previous state is "completely ignored" and "1" means that the previous state is "completely retained". The output gate computes the final result from the outputs of each storage unit. In other words, the output is determined by the unit state, the filtered data and the newly added information. The storage unit in Fig 3 represents the architecture of the expanded LSTM model and shows how each gate contributes to the prediction generated by the model [46].

Fig 3. LSTM’s expanded network form of the storage unit.

https://doi.org/10.1371/journal.pone.0269195.g003

Mathematically, the LSTM architecture can be described as below.

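$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$ (7)

$\tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$ (8)

$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$ (9)

$C_t = i_t \odot \tilde{C}_t + f_t \odot C_{t-1}$ (10)

$o_t = \sigma(W_o x_t + U_o h_{t-1} + V_o C_t + b_o)$ (11)

$h_t = o_t \odot \tanh(C_t)$ (12)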

Here, $W_i, W_c, W_f, W_o, U_i, U_c, U_f, U_o, V_o$ are the weight matrices, while $b_i, b_c, b_f, b_o$ are the bias vectors. The variables $x_t$ and $h_t$ are respectively the input and output vectors of the storage unit at time $t$. Furthermore, $i_t$, $f_t$ and $o_t$ are the activations of the input, forget and output gates, while $\tilde{C}_t$ and $C_t$ represent the candidate state and the state of the storage unit at time $t$.
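
To make the role of each gate concrete, the following is a minimal sketch of a single LSTM time step in plain Python. It assumes one plausible reading of Eqs (7)–(12): the standard LSTM formulation with an extra term $V_o C_t$ in the output gate, which is consistent with the weight matrices and bias vectors listed above. The helper names (`lstm_step`, `zero_params`) and the dictionary layout of the parameters are hypothetical.

```python
import math


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def tanh(v):
    return [math.tanh(x) for x in v]


def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def add(*vectors):
    return [sum(vals) for vals in zip(*vectors)]


def mul(a, b):
    """Element-wise product."""
    return [x * y for x, y in zip(a, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One time step of the storage unit; p maps names such as 'W_i' to weights."""
    i_t = sigmoid(add(matvec(p["W_i"], x_t), matvec(p["U_i"], h_prev), p["b_i"]))
    c_tilde = tanh(add(matvec(p["W_c"], x_t), matvec(p["U_c"], h_prev), p["b_c"]))
    f_t = sigmoid(add(matvec(p["W_f"], x_t), matvec(p["U_f"], h_prev), p["b_f"]))
    # New state: gated candidate plus gated previous state.
    c_t = add(mul(i_t, c_tilde), mul(f_t, c_prev))
    o_t = sigmoid(add(matvec(p["W_o"], x_t), matvec(p["U_o"], h_prev),
                      matvec(p["V_o"], c_t), p["b_o"]))
    h_t = mul(o_t, tanh(c_t))
    return h_t, c_t


def zero_params(n_in, n_hidden):
    """All-zero parameters, useful only for checking shapes and gate behaviour."""
    def zeros(rows, cols):
        return [[0.0] * cols for _ in range(rows)]
    p = {}
    for g in "icfo":
        p[f"W_{g}"] = zeros(n_hidden, n_in)
        p[f"U_{g}"] = zeros(n_hidden, n_hidden)
        p[f"b_{g}"] = [0.0] * n_hidden
    p["V_o"] = zeros(n_hidden, n_hidden)
    return p


# With zero weights every gate outputs 0.5, so half of the previous state is kept.
h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], zero_params(3, 2))
```

With all-zero weights the forget gate equals 0.5, so the new state is exactly half of the previous state. This shows the "partially retained" behaviour that lies between the two extremes of the forget gate.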

The TabNet procedure combined with the LSTM model is employed to predict stock market trends using the time series of different factors. The factors selected through TabNet from an initial manually curated database can unveil the driving dynamics of the stock index trend and increase the predictive power of the LSTM model.

Data

1. Data set and data preprocessing

This study evaluates the performance of TabLSTM by applying it to four different stock indices, namely the Dow Jones Composite Index, the S&P 500, the China Shanghai Composite Index (000001), and the China Shenzhen Composite Index (399106). Each index is considered from 01/02/2008 to 11/19/2020. The Dow Jones Composite Index and the S&P 500 are two stock indices of the US financial market, a benchmark for developed countries. The China Shanghai Composite Index and the China Shenzhen Composite Index, on the other hand, are two representative stock indices of the Chinese financial market, a potential benchmark for developing countries. Specifically, the China Shanghai Composite Index reflects the price movement of stocks listed on the Shanghai Stock Exchange, while the China Shenzhen Composite Index reflects the combined price movement of all stocks listed on the Shenzhen Stock Exchange. The data were collected from different certified sources, such as the Wind Financial Database, and contain 3133 daily trading records.

Observations with small variability are filtered out using a lower and an upper threshold. The lower threshold is set to -0.50% and the upper threshold to 0.55%. In other words, only observations whose variation lies outside these two thresholds are retained. Two additional filters are applied: one removes observations with null trading volume, and the other removes factors with more than 50% missing values. After the three filters are applied, the clean data contain 2524 daily observations. The clean data are then split into three sets, namely a training set (80% of the observations), a validation set (10% of the observations) and a test set (10% of the observations). The training set is used for feature selection and for training the prediction model, the validation set is used for hyperparameter tuning, and the test set is used to evaluate the predictive power of the model.
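
As a rough illustration of the 80/10/10 split, the sketch below partitions a list of observations by position. It assumes that the observations are ordered in time and that the split is chronological, which is common for time series but is not stated explicitly in the text. The function name `chronological_split` is hypothetical.

```python
def chronological_split(observations, train_frac=0.8, val_frac=0.1):
    """Split time-ordered observations into train, validation and test parts."""
    n = len(observations)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    # Whatever remains after the first two parts becomes the test set.
    return (observations[:train_end],
            observations[train_end:val_end],
            observations[val_end:])


# Indices 0..2523 stand in for the 2524 clean daily observations.
train, val, test = chronological_split(list(range(2524)))
```

Under these assumptions the 2524 clean observations yield 2019 training, 252 validation and 253 test observations.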

The manually curated factor database used as input for TabNet contains thirty-five macroeconomic factors [47–49], seven microeconomic factors [50–52] and seventy-two technical factors [53–56]. The factors were identified through a careful review of the financial and economic literature (see S1–S3 Tables). The prediction of the stock index trend is cast as a classification problem. The trend variable, also known as binary movement, is based on the thirty-day difference between two closing prices. If the difference is a loss, the binary movement is set to "0"; if the difference is a profit, it is set to "1". The binary movement is denoted as $y_t = I(P_{t+30} > P_t)$, where $I(\cdot)$ is the indicator function.
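
The labelling rule can be sketched in a few lines of Python. The sketch assumes that the thirty-day horizon is counted in observations of the closing-price series. The function name `binary_movement` and the toy prices are hypothetical.

```python
def binary_movement(close_prices, horizon=30):
    """Label 1 if the close `horizon` observations ahead is higher, else 0."""
    return [1 if close_prices[t + horizon] > close_prices[t] else 0
            for t in range(len(close_prices) - horizon)]


# Toy example with a horizon of 2 instead of 30 to keep the series short.
labels = binary_movement([10.0, 11.0, 9.0, 12.0, 8.0], horizon=2)
```

The last `horizon` observations receive no label because their future closing price is not yet known.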

2. Evaluation metrics

The prediction performance of the TabLSTM model is evaluated using three different indicators, namely Area Under the ROC Curve (AUC), Balanced Accuracy, and Error Rate. The first two metrics measure the prediction accuracy of a model, while the last metric measures its prediction error. The mathematical description of each metric is given in Table 1. The prediction performance of the proposed model is compared to that of six models, namely XGBoost [57], LightGBM [58], GBDT [59], CatBoost [60], Logistic Regression (LR) [61], and K-Nearest Neighbor (KNN) [62]. In particular, XGBoost, LightGBM, GBDT, and CatBoost are state-of-the-art classification models with a feature selection function, while LR and KNN are traditional classification models without one.

In Table 1, $ins_i$ is the $i$-th sample and $rank_{ins_i}$ is the rank position of the $i$-th sample in ascending order. $M$ and $N$ are the numbers of positive and negative samples, respectively. The variables $y$ and $y'$ are respectively the actual and predicted binary movements of the stock index. The index $k$ represents the number of categories (in this study $k = 2$), while $m_i$ and $n$ are the respective numbers of samples in each category.
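
Since Table 1 is not reproduced here, the following sketch uses the standard textbook definitions of the three metrics, which are assumed to match it: the rank-based AUC formula with $M$ positive and $N$ negative samples, balanced accuracy as the mean per-class recall, and error rate as the share of misclassified samples. All function names are hypothetical.

```python
def auc_by_rank(y_true, scores):
    """AUC from ascending ranks of the scores (ties receive the average rank)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2.0 + 1.0      # 1-based average rank of the tie group
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    m = sum(1 for y in y_true if y == 1)        # number of positive samples
    n = len(y_true) - m                         # number of negative samples
    rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum - m * (m + 1) / 2.0) / (m * n)


def balanced_accuracy(y_true, y_pred, classes=(0, 1)):
    """Mean recall over the classes; every class must occur in y_true."""
    recalls = []
    for c in classes:
        total = sum(1 for y in y_true if y == c)
        hits = sum(1 for y, p in zip(y_true, y_pred) if y == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def error_rate(y_true, y_pred):
    """Share of samples whose predicted movement differs from the actual one."""
    return sum(1 for y, p in zip(y_true, y_pred) if y != p) / len(y_true)
```

For example, with actual movements `[0, 0, 1, 1]` and predictions `[0, 1, 1, 1]`, one of the two down-movements is misclassified, so the balanced accuracy is 0.75 and the error rate is 0.25.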

Results and discussion

The prediction of stock index trends based on a factor database, TabNet feature selection and the LSTM method is applied to the four indices: the Dow Jones Composite Index, the S&P 500, the China Shenzhen Composite Index, and the China Shanghai Composite Index. For the TabNet feature selection, the dimensions of $a[i]$ and $d[i]$, denoted as $N_a$ and $N_d$, are both set to 8. The remaining settings are: number of decision steps $N_{step} = 3$, sparsity coefficient $\lambda_{sparse} = 0.001$, relaxation parameter in the preliminary attentive transformation step $\gamma = 1.3$, a small constant for numerical stability in the sparsity regularization $\varepsilon = 1 \times 10^{-15}$, batch size $B = 128$, learning rate 0.02 and 1000 iterations. For the LSTM, the numbers of neurons in the hidden layer and output layer are arbitrarily chosen to be 128 and 1, respectively, with dropout 0.2 and 100 iterations. The parameters are optimized with the Adam optimizer. The code is implemented in Python 3.
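
For reproducibility, the reported settings can be gathered into plain configuration dictionaries, as in the sketch below. The key names are illustrative and not tied to any particular library. Only the values come from the text.

```python
# Hyperparameters reported for the TabNet feature selection stage.
TABNET_CONFIG = {
    "n_a": 8,                # dimension of a[i]
    "n_d": 8,                # dimension of d[i]
    "n_steps": 3,            # number of decision steps
    "lambda_sparse": 1e-3,   # sparsity coefficient
    "gamma": 1.3,            # relaxation parameter
    "epsilon": 1e-15,        # numerical stability constant
    "batch_size": 128,
    "learning_rate": 0.02,
    "iterations": 1000,
}

# Hyperparameters reported for the LSTM prediction stage.
LSTM_CONFIG = {
    "hidden_units": 128,
    "output_units": 1,
    "dropout": 0.2,
    "iterations": 100,
    "optimizer": "adam",
}
```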

1. Stock index feature selection

Applying the TabNet encoder identifies several relevant factors. Fig 4 shows the feature importance masks mask[i], which indicate the features selected at the (i + 1)-th decision step, and the aggregate feature importance Magg, which indicates the outcome of global feature selection. In each subfigure of Fig 4, the x-axis and y-axis represent the factor number and the decision step, respectively, and brighter colors highlight the factors with high importance. For example, in the first decision step, denoted as mask[0], the factors numbered between 0 and 100 show higher feature importance than the factors numbered above 100. In the second decision step, denoted as mask[1], factors numbered around 200 are more important than the factors numbered between 0 and 100. In the third decision step, denoted as mask[2], factors numbered around 100 show higher feature importance than other factors. In the aggregate feature selection step, denoted as mask agg, the aggregate masks for factors numbered between 100 and 200 are almost all zero, indicating that these factors are irrelevant for predicting the Dow Jones Composite Index. Moreover, factors numbered between 0 and 100 and factors numbered around 200 play a role in predicting the Dow Jones Composite Index. The outcomes of the aggregate mask are consistent with the previous decision steps. Furthermore, these results show that TabNet can focus only on the relevant factors and produce accurate feature selection results.

Fig 4. Factor importance ranking.

(a) Dow Jones Composite Index. (b) S&P 500 Index. (c) China Shanghai Composite Index. (d) China Shenzhen Composite Index.

https://doi.org/10.1371/journal.pone.0269195.g004
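
To make the relation between the per-step masks and the aggregate mask concrete, the sketch below combines step masks into a normalized aggregate for a single sample, following the weighting scheme described for TabNet (each step's mask is weighted by that step's contribution to the decision, then normalized across features). The function name, the mask values, and the step weights are illustrative assumptions, not values taken from Fig 4.

```python
def aggregate_mask(masks, step_weights):
    """Combine per-step feature masks into one normalized importance vector.

    masks: list of per-step masks, each a list of per-feature weights.
    step_weights: one non-negative weight per decision step.
    """
    n_features = len(masks[0])
    weighted = [
        sum(w * mask[j] for w, mask in zip(step_weights, masks))
        for j in range(n_features)
    ]
    total = sum(weighted)
    if total == 0:
        # No step contributed anything: return an all-zero importance vector.
        return [0.0] * n_features
    return [value / total for value in weighted]


# Illustrative values only (3 decision steps, 4 features).
masks = [
    [0.7, 0.3, 0.0, 0.0],   # mask[0]
    [0.0, 0.0, 0.0, 1.0],   # mask[1]
    [0.5, 0.0, 0.0, 0.5],   # mask[2]
]
step_weights = [1.0, 2.0, 1.0]

m_agg = aggregate_mask(masks, step_weights)
print(m_agg)
```

A feature that receives zero weight in every step, such as the third feature here, also receives zero aggregate importance, which is the pattern described for the factors numbered between 100 and 200.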

Moreover, Table 2 presents the ten factors with the highest importance scores for each index. As Table 2 shows, the most influential factors overlap considerably across indices. Ignoring the specific ranking, nine factors, namely P/E, P/CF, P/D, B/M, D/E, LEVERAGE, MOMENTUM 1 WEEK, MOMENTUM 2 WEEK, and MOMENTUM 3 WEEK, appear among the top ten factors of all four indices. In addition, Table 2 shows that the factors with the greatest impact on all indices belong to the micro and technical categories.
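
The overlap described above is a set intersection over the per-index top lists. A minimal sketch follows; the function name is hypothetical, and the lists are shortened placeholders that do not reproduce Table 2.

```python
def common_factors(top_factors_by_index):
    """Return the factors that appear in every index's top list."""
    factor_sets = [set(factors) for factors in top_factors_by_index.values()]
    if not factor_sets:
        return set()
    return set.intersection(*factor_sets)


# Placeholder lists for illustration only; they do NOT reproduce Table 2.
top_factors_by_index = {
    "index_a": ["P/E", "B/M", "MOMENTUM 1 WEEK", "PLACEHOLDER_X"],
    "index_b": ["B/M", "P/E", "PLACEHOLDER_Y", "MOMENTUM 1 WEEK"],
    "index_c": ["MOMENTUM 1 WEEK", "P/E", "B/M", "PLACEHOLDER_Z"],
}

shared = common_factors(top_factors_by_index)
print(sorted(shared))
```

Because sets discard order, the comparison ignores the specific ranking within each list, as in the text.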

2. Stock index trend prediction

After obtaining the feature importance for each stock index, the LSTM model is trained on the highest-ranked factors, and the resulting model is used to predict stock market trends. Because the ten highest-ranked factors for each stock index already explain nearly 60% of the index movement, this paper uses these ten factors for trend prediction. The prediction performance and convergence time of all models, including TabLSTM, are shown in Tables 3 and 4, respectively.
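
One way to read the selection rule above is as a cumulative share of importance captured by the top-ranked factors. The sketch below ranks factors by score, keeps the top k, and reports their share of the total. This interpretation, the function name, and the scores are assumptions made for illustration; the paper's own threshold procedure may differ.

```python
def top_k_with_share(importance, k):
    """Return the k highest-scoring factor names and their share of total importance."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    selected = ranked[:k]
    total = sum(importance.values())
    share = sum(score for _, score in selected) / total if total else 0.0
    return [name for name, _ in selected], share


# Hypothetical importance scores, not taken from the paper.
importance = {"f1": 0.30, "f2": 0.20, "f3": 0.10, "f4": 0.25, "f5": 0.15}

names, share = top_k_with_share(importance, k=2)
print(names, round(share, 2))
```

In the paper's setting k would be 10 and the input would be the aggregate importance of every factor in the database.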

Table 3 summarizes the predictive performance of all methods, including XGBoost, LightGBM, GBDT, CatBoost, Logistic Regression, and KNN. The best results are highlighted in bold in Table 3. TabLSTM accounts for 10 of the bold-faced results and the other methods for 2, which indicates the relative predictive power of TabLSTM.

Generally, a forecasting accuracy of 56% is considered satisfactory in stock trend prediction (Haq et al., 2021). Table 3 shows that the TabLSTM model combined with the factor database achieves satisfactory results on the Dow Jones Composite Index, S&P 500 Index, China Shanghai Composite Index, and China Shenzhen Composite Index. Specifically, on the Dow Jones Composite Index, TabLSTM achieves an AUC of 92.79%, a balanced accuracy of 86.35%, and an error rate of 12.25%. On the S&P 500 Index, it achieves an AUC of 90.08%, a balanced accuracy of 87.78%, and an error rate of 11.47%. On the China Shanghai Composite Index, it achieves an AUC of 93.43%, a balanced accuracy of 87.78%, and an error rate of 12.28%. On the China Shenzhen Composite Index, it achieves an AUC of 91.32%, a balanced accuracy of 84.28%, and an error rate of 16.49%. These results indicate good prediction performance of the proposed method.
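
The three metrics reported above can be computed as follows for a binary up/down label. This is a minimal sketch using standard definitions (AUC as the fraction of correctly ordered positive-negative pairs, balanced accuracy as the mean of the true positive and true negative rates, error rate as the fraction of misclassified samples). The labels, scores, and the 0.5 decision threshold are hypothetical, and the paper's own evaluation code may differ.

```python
def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive rate and the true negative rate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


def error_rate(y_true, y_pred):
    """Fraction of samples whose predicted label differs from the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Hypothetical labels (1 = up, 0 = down) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(auc(y_true, scores))
print(balanced_accuracy(y_true, y_pred))
print(error_rate(y_true, y_pred))
```

Balanced accuracy is useful here because up and down days are rarely equally frequent, so plain accuracy can be inflated by always predicting the majority class.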

In addition, the TabLSTM model performs much better than the other trend prediction models, especially the traditional ones. Taking the Dow Jones Composite Index as an example, TabLSTM outperforms the four state-of-the-art classification models (XGBoost, LightGBM, GBDT, CatBoost) by 6.25%, 7.69%, 9.13%, and 5.38% in AUC; by 6.90%, 9.00%, 14.69%, and 4.35% in balanced accuracy; and by 42.19%, 44.84%, 56.22%, and 36.20% in error rate, respectively. Moreover, TabLSTM significantly outperforms the two traditional classification models (Logistic Regression and KNN) by 66.41% and 34.50% in AUC, by 20.33% and 25.82% in balanced accuracy, and by 73.39% and 65.03% in error rate, which further supports the superior performance of the proposed method.
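
The text does not state how the improvement percentages are computed. A common convention, assumed here, is the relative change with respect to the baseline, with the sign flipped for metrics where lower is better (such as error rate). The function name and the input values below are hypothetical and are not taken from Table 3.

```python
def relative_gain(model_value, baseline_value, lower_is_better=False):
    """Relative improvement of the model over a baseline, as a fraction."""
    if lower_is_better:
        return (baseline_value - model_value) / baseline_value
    return (model_value - baseline_value) / baseline_value


# Hypothetical metric values for illustration only.
auc_gain = relative_gain(0.90, 0.80)                        # higher is better
err_gain = relative_gain(0.10, 0.20, lower_is_better=True)  # lower is better

print(f"AUC gain: {auc_gain:.2%}, error-rate gain: {err_gain:.2%}")
```

Under this convention, a large percentage gain in error rate can coexist with a modest gain in AUC, because the error rate starts from a much smaller base.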

Table 3 also shows the prediction results of the TabLSTM method based on the proposed factor database and on technical indicators only. Performance is clearly better with the comprehensive factor database than with a pool of technical indicators. Taking the Dow Jones Composite Index as an example, TabLSTM with the factor database outperforms TabLSTM with technical indicators by 59.68%, 53.48%, and 67.11% in AUC, balanced accuracy, and error rate, respectively, which supports the need to construct a comprehensive factor database.

Table 4 summarizes the convergence time of all methods in seconds. The shortest convergence times among the state-of-the-art classification models (XGBoost, LightGBM, GBDT, CatBoost) and among the traditional classification models (Logistic Regression, KNN) are highlighted in bold. Generally, TabLSTM and the four state-of-the-art classification models, which predict better, require more time to converge than the two traditional classification models. Moreover, TabLSTM, despite having the best prediction performance, does not always have the shortest convergence time, which indicates that TabLSTM requires further optimization for practical applications.
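
Timings such as those in Table 4 are typically obtained by measuring wall-clock time around the training call. The sketch below shows one way to do this with the standard library; the helper names are hypothetical, the training function is a stand-in, and how the paper defines and measures convergence is not specified in the text.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def dummy_training(n_iterations):
    """Stand-in for a real training loop; it only accumulates a number."""
    total = 0.0
    for i in range(n_iterations):
        total += i * 0.5
    return total


result, seconds = timed(dummy_training, 100_000)
print(f"finished in {seconds:.4f} s")
```

Using `time.perf_counter` rather than `time.time` gives a monotonic, high-resolution clock, which matters when comparing fast models such as Logistic Regression.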

Conclusion

In this paper, we propose a novel hybrid model, TabLSTM, which combines LSTM with a feature selection technique, TabNet, to predict the stock index trend. The hybrid model has three components: constructing the factor library, selecting relevant features, and predicting the stock index trend. In the first phase, we summarize the potential factors from the existing literature and construct a factor library containing micro, macro and technical factors. In the second phase, the TabNet encoder is trained to identify the driving factors of the stock index trend, which also reveals the local and global importance of each feature. In the last phase, we use the highest-ranked factors as the inputs of an LSTM model to predict the 30-day-ahead stock index movements.

In the empirical study, we conduct an experiment to test the prediction performance of the TabLSTM model. Taking four stock indices as research objects, namely the Dow Jones Composite Index, S&P 500, Shanghai Composite Index, and Shenzhen Composite Index, we compare the performance of TabLSTM with six other models: XGBoost, LightGBM, GBDT, CatBoost, Logistic Regression, and K-Nearest Neighbor (KNN). The results show that our hybrid prediction model, with a higher AUC, a higher balanced accuracy and a lower error rate, is superior to the traditional classification models and to the other state-of-the-art classification models proposed in the previous literature. Moreover, our hybrid prediction model with a comprehensive factor library outperforms the prediction model with only technical factors on all evaluation metrics. We conclude that combining a comprehensive factor library with the TabNet feature selection technique and the LSTM model yields better prediction performance for stock index trends.

In summary, the contributions of the proposed model, TabLSTM, lie in three aspects. First, the TabLSTM model, which combines LSTM with the feature selection model TabNet, provides a comprehensive view of stock index trend prediction at both the feature selection level and the trend prediction level. Second, the factor library construction stage provides potential factors at the macro, micro and technical levels to support further analysis of stock index trend prediction. Third, the TabNet encoder complements the TabLSTM model by analyzing the local and global feature importance of each factor in the original data. It is our hope that the process of building the TabLSTM model provides an example of predicting stock index trends in a more comprehensive manner.

There remain four possible extensions of this study. First, the research could include data from more stock markets for experiments and robustness tests. Second, the model could be extended to a broader range of financial products, such as commodities, bonds, and digital currencies. Third, the manually curated financial factor library could be extended to include more factors that may influence the stock index trend, thereby improving the prediction performance of the proposed method. Lastly, TabNet could be optimized to shorten its convergence time.

Supporting information

Acknowledgments

We thank the Wind platform for providing the data used in this study. We also thank Qiang Xu and John Gianni for their support in the writing process.

References

  1. Chen Y, Hao Y. A feature weighted support vector machine and K-nearest neighbor algorithm for stock market indices prediction. Expert Systems with Applications. 2017; 80: 340–355.
  2. Jiang M, Liu J, Zhang L, Liu C. An improved Stacking framework for stock index prediction by leveraging tree-based ensemble models and deep learning algorithms. Physica A: Statistical Mechanics and its Applications. 2020; 541: 122272.
  3. Wang Y, Wang L, Yang F, Di W, Chang Q. Advantages of direct input-to-output connections in neural networks: The Elman network for stock index forecasting. Information Sciences. 2021; 547: 1066–1079.
  4. Haq AU, Zeb A, Lei Z, Zhang D. Forecasting daily stock trend using multi-filter feature selection and deep learning. Expert Systems with Applications. 2021; 168: 114444.
  5. Zhou F, Zhang Q, Sornette D, Jiang L. Cascading logistic regression onto gradient boosted decision trees for forecasting and trading stock indices. Applied Soft Computing. 2019; 84: 105747.
  6. Yang F, Chen Z, Li J, Tang L. A novel hybrid stock selection method with stock prediction. Applied Soft Computing. 2019; 80: 820–831.
  7. Zhang X, Zhang Y, Wang S, Yao Y, Fang B, Philip SY. Improving stock market prediction via heterogeneous information fusion. Knowledge-Based Systems. 2018; 143: 236–247.
  8. Long W, Lu Z, Cui L. Deep learning-based feature engineering for stock price movement prediction. Knowledge-Based Systems. 2019; 164: 163–173.
  9. Sezer OB, Gudelek MU, Ozbayoglu AM. Financial time series forecasting with deep learning: A systematic literature review: 2005–2019. Applied Soft Computing. 2020; 90: 106181.
  10. Bachelier L. Theory of speculation in: Cootner P., ed., 1964, The random character of stock market prices. 1900.
  11. Fama EF, Fisher L, Jensen M, Roll R. The adjustment of stock prices to new information. International economic review. 1969; 10(1).
  12. Fama EF. Efficient capital markets: reply. The Journal of Finance. 1976; 31(1): 143–145.
  13. Fama EF. Random walks in stock market prices. Financial analysts journal. 1995; 51(1): 75–80.
  14. Kahneman D, Tversky A. Prospect theory: An analysis of decision under risk. In Handbook of the fundamentals of financial decision making: Part I. 2013; 99–127.
  15. Bhushan R. Collection of information about publicly traded firms: Theory and evidence. Journal of Accounting and Economics. 1989; 11(2–3): 183–206.
  16. Jegadeesh N, Titman S. Returns to buying winners and selling losers: Implications for stock market efficiency. The Journal of finance. 1993; 48(1): 65–91.
  17. Latif M, Arshad S, Fatima M, Farooq S. Market efficiency, market anomalies, causes, evidences, and some behavioral aspects of market anomalies. Research journal of finance and accounting. 2011; 2(9): 1–13.
  18. Doroudyan MH, Niaki STA. Pattern recognition in financial surveillance with the ARMA-GARCH time series model using support vector machine. Expert Systems with Applications. 2021; 115334.
  19. Ribeiro PP, Cermeño R, Curto JD. Sovereign bond markets and financial volatility dynamics: Panel-GARCH evidence for six euro area countries. Finance Research Letters. 2017; 21: 107–114.
  20. Wang GJ, Xie C, He K, Stanley HE. Extreme risk spillover network: application to financial institutions. Quantitative Finance. 2017; 17(9): 1417–1433.
  21. Yu H, Ming LJ, Sumei R, Shuping Z. A hybrid model for financial time series forecasting—integration of EWT, ARIMA with the improved ABC optimized ELM. IEEE Access. 2020; 8: 84501–84518.
  22. Xiao C, Xia W, Jiang J. Stock price forecast based on combined model of ARI-MA-LS-SVM. Neural Computing and Applications. 2020; 32(10): 5379–5388.
  23. Ouyang H, Wei X, Wu Q. Discovery and prediction of stock index pattern via three-stage architecture of TICC, TPA-LSTM and multivariate LSTM-FCNs. IEEE Access. 2020; 8: 123683–123700.
  24. Gu Q, Chang Y, Xiong N, Chen L. Forecasting Nickel futures price based on the empirical wavelet transform and gradient boosting decision trees. Applied Soft Computing. 2021; 109: 107472.
  25. Baek Y, Kim HY. ModAugNet: A new forecasting framework for stock market index value with an overfitting prevention LSTM module and a prediction LSTM module. Expert Systems with Applications. 2018; 113: 457–480.
  26. Li X, Tang P. Stock index prediction based on wavelet transform and FCD‐MLGRU. Journal of Forecasting. 2020; 39(8): 1229–1237.
  27. Yu P, Yan X. Stock price prediction based on deep neural networks. Neural Computing and Applications. 2020; 32(6): 1609–1628.
  28. Cai J, Luo J, Wang S, Yang S. Feature selection in machine learning: A new perspective. Neurocomputing. 2018; 300: 70–79.
  29. Hoseinzade E, Haratizadeh S. CNNpred: CNN-based stock market prediction using a diverse set of variables. Expert Systems with Applications. 2019; 129: 273–285.
  30. Eapen J, Bein D, Verma A. Novel deep learning model with CNN and bi-directional LSTM for improved stock market index prediction. In 2019 IEEE 9th annual computing and communication workshop and conference (CCWC). 2019; 0264–0270.
  31. Eapen J, Verma A, Bein D. Improved big data stock index prediction using deep learning with CNN and GRU. International Journal of Big Data Intelligence. 2020; 7(4): 202–210.
  32. Mehtab S, Sen J, Dutta A. Stock price prediction using machine learning and LSTM-based deep learning models. In Symposium on Machine Learning and Metaheuristics Algorithms, and Applications. 2020; 88–106.
  33. Nabipour M, Nayyeri P, Jabani H, Shahab S, Mosavi A. Predicting stock market trends using machine learning and deep learning algorithms via continuous and binary data: a comparative analysis. IEEE Access. 2020; 8: 150199–150212.
  34. Livieris IE, Kotsilieris T, Stavroyiannis S, Pintelas P. Forecasting stock price index movement using a constrained deep neural network training algorithm. Intelligent Decision Technologies. 2020; 1–11.
  35. Feng G, Giglio S, Xiu D. Taming the factor zoo: A test of new factors. The Journal of Finance. 2020; 75(3): 1327–1370.
  36. Siddiqi UF, Sait SM, Kaynak O. Genetic algorithm for the mutual information-based feature selection in univariate time series data. IEEE Access. 2020; 8: 9597–9609.
  37. Yuan X, Yuan J, Jiang T, Ain QU. Integrated long-term stock selection models based on feature selection and machine learning algorithms for China stock market. IEEE Access. 2020; 8: 22672–22685.
  38. Alotaibi SS. Ensemble Technique With Optimal Feature Selection for Saudi Stock Market Prediction: A Novel Hybrid Red Deer-Grey Algorithm. IEEE Access. 2021; 9: 64929–64944.
  39. Baek S, Mohanty SK, Glambosky M. COVID-19 and stock market volatility: An industry level analysis. Finance Research Letters. 2020; 37: 101748. pmid:32895607
  40. Hsu TY. Machine learning applied to stock index performance enhancement. Journal of Banking and Financial Technology. 2021; 1–13.
  41. Chen S, Zhou C. Stock prediction based on genetic algorithm feature selection and long short-term memory neural network. IEEE Access. 2020; 9: 9066–9072.
  42. Niu T, Wang J, Lu H, Yang W, Du P. Developing a deep learning framework with two-stage feature selection for multivariate financial time series forecasting. Expert Systems with Applications. 2020; 148: 113237.
  43. Alsubaie Y, El Hindi K, Alsalman H. Cost-sensitive prediction of stock price direction: Selection of technical indicators. IEEE Access. 2019; 7: 146876–146892.
  44. Arık SO, Pfister T. Tabnet: Attentive interpretable tabular learning. arXiv: 1908.07442 2020.
  45. Abdel-Nasser M, Mahmoud K. Accurate photovoltaic power forecasting models using deep LSTM-RNN. Neural Computing and Applications. 2019; 31(7): 2727–2740.
  46. Olah C. Understanding lstm networks. 2015.
  47. Brogaard J, Dai L, Ngo PT, Zhang B. Global political uncertainty and asset prices. The Review of Financial Studies. 2020; 33(4): 1737–1780.
  48. Chen W, Xu H, Jia L, Gao Y. Machine learning model for Bitcoin exchange rate prediction using economic and technology determinants. International Journal of Forecasting. 2021; 37(1): 28–43.
  49. Moskowitz TJ, Ooi YH, Pedersen LH. Cross-asset signals and time series momentum. Journal of Financial Economics. 2020; 136.
  50. Basu S. Investment performance of common stocks in relation to their price-earnings ratios: A test of the efficient market hypothesis. The Journal of Finance. 1977; 32(3): 663–682.
  51. Lakonishok J, Shleifer A, Vishny RW. Contrarian investment, extrapolation, and risk. The Journal of Finance. 1994; 49(5): 1541–1578.
  52. Roll R. A simple implicit measure of the effective bid-ask spread in an efficient market. Journal of Finance. 1984; 39: 1127–39.
  53. Carhart MM. On persistence in mutual fund performance. The Journal of Finance. 1997; 52(1): 57–82.
  54. Lehmann BN. Fads, Martingales, and Market Efficiency. The Quarterly Journal of Economics. 1990; 1–28.
  55. Llorente G, Michaely R, Saar G, Wang J. Dynamic volume-return relation of individual stocks. The Review of financial studies. 2002; 15(4): 1005–1047.
  56. Wang Y, Liu L, Wu C. Forecasting commodity prices out-of-sample: Can technical indicators help? International Journal of Forecasting. 2020; 36(2): 666–683.
  57. Chen T, Guestrin C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. 2016; 785–794.
  58. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems. 2017; 30: 3146–3154.
  59. Rao H, Shi X, Rodrigue AK, Feng J, Xia Y, Elhoseny M, et al. Feature selection based on artificial bee colony and gradient boosting decision tree. Applied Soft Computing. 2019; 74: 634–642.
  60. Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: unbiased boosting with categorical features. arXiv preprint 2017.
  61. Martin D. Early warning of bank failure: A logit regression approach. Journal of banking & finance. 1977; 1(3): 249–276.
  62. Cappellari L, Jenkins SP. Multivariate probit regression using simulated maximum likelihood. The STATA journal. 2003; 3(3): 278–294.