
Forecasting leading industry stock prices based on a hybrid time-series forecast model

  • Ming-Chi Tsai ,

    Contributed equally to this work with: Ming-Chi Tsai, Meei-Ing Tsai, Huei-Yuan Shiu

    Roles Methodology, Project administration, Writing – review & editing

    Affiliation Department of Business Administration, I-Shou University, Dashu District, Kaohsiung City, Taiwan

  • Ching-Hsue Cheng ,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Supervision

    chcheng@yuntech.edu.tw

    Affiliation Department of Information Management, National Yunlin University of Science and Technology, Douliou, Yunlin, Taiwan

  • Meei-Ing Tsai ,

    Contributed equally to this work with: Ming-Chi Tsai, Meei-Ing Tsai, Huei-Yuan Shiu

    Roles Resources, Validation

    Affiliation Department of Information Management, National Yunlin University of Science and Technology, Douliou, Yunlin, Taiwan

  • Huei-Yuan Shiu

    Contributed equally to this work with: Ming-Chi Tsai, Meei-Ing Tsai, Huei-Yuan Shiu

    Roles Data curation, Software

    Affiliation Department of Information Management, National Yunlin University of Science and Technology, Douliou, Yunlin, Taiwan

Abstract

Many time-series methods have been widely used to forecast stock prices in pursuit of profit. However, previous time-series models still suffer from several problems. To overcome them, this paper proposes a hybrid time-series model based on feature selection for forecasting leading industry stock prices. In the proposed model, stepwise regression is first adopted, and multivariate adaptive regression splines and kernel ridge regression are then used to select the key features. Second, this study constructs the forecasting model by using a genetic algorithm to optimize the parameters of support vector regression. To evaluate the forecasting performance of the proposed models, this study collects datasets for five leading enterprises in different industries from 2003 to 2012. The collected stock prices are employed to verify the proposed model in terms of accuracy. The results show that the proposed model is more accurate than the other listed models and provides persuasive investment guidance to investors.

Introduction

Forecasting stock prices is a key issue for investors in the stock market, because stock price trends are nonlinear and nonstationary time-series data, which makes forecasting stock prices a challenging and difficult task in the financial market. Conventional time-series models have been used to forecast stock prices, and many researchers are still devoted to the development and improvement of time-series forecasting models. The most well-known conventional time-series forecasting approach is the autoregressive integrated moving average (ARIMA) [1], which is employed when the time-series data are linear and there are no missing values [2]. Statistical methods, such as traditional time-series models, usually address linear forecasting models, and the variables must obey a normal distribution [3]. Therefore, conventional time-series methods are not suitable for forecasting stock prices, because stock price fluctuations are usually nonlinear and nonstationary.

Further, most conventional time-series models utilize only one variable (the previous day's stock price) [4], when there are actually many influential factors, such as market indexes, technical indicators, economics, political environments, investor psychology, and the fundamental financial analysis of companies, that can influence forecasting performance [5]. In practice, researchers use many technical indicators as independent variables for forecasting stock prices, and selecting the key variables from these numerous indicators is a critical step in the forecasting process. Investors usually prefer to select technical indicators based on their experience or intuition, even though this behavior is highly risky, and choosing unrepresentative indicators may cause investors to lose profits. Therefore, selecting relevant indicators to forecast stock prices is an important issue for investors, and financial researchers must identify the technical indicators that are most relevant to the stock price through indicator selection. Accordingly, forecasting models must incorporate indicator selection into the stock forecasting process to enhance accuracy.

Recently, many new forecasting techniques have been used to construct efficient and precise machine learning models, but forecasting stock prices remains a hot topic [6, 7, 8]. To overcome the shortcomings of traditional time-series models, nonlinear approaches have been proposed, such as fuzzy neural networks [9, 10, 11, 12] and support vector regression (SVR) [13, 14, 15, 16]. SVR applies the structural risk minimization principle to estimate a function by minimizing the upper bound of the generalization error [17, 18], which yields better generalization from datasets of limited size [19]. Further, SVR has a global optimum and exhibits better prediction accuracy because structural risk minimization considers both the training error and the capacity of the regression model [15, 20]. Although SVR has produced a great number of experimental results in many applications, such as economic and financial prediction, the main problem of SVR is the determination of its parameters, which requires practitioner experience [21]. In the literature, genetic algorithms (GA) have been successfully used in a wide range of problems, including machine learning, multiobjective optimization, and multimodal function optimization [22]. The GA is a search algorithm inspired by evolution and is usually used to solve optimization problems. Therefore, the proposed model utilizes a GA to optimize the parameters of SVR and obtain better forecasting performance.

From the related work mentioned above, previous studies exhibit several drawbacks:

(1) Many studies select key technical indicators based on experience and intuition [23]; (2) most statistical methods impose assumptions on the datasets and require particular statistical distributions [3]; (3) most previous time-series models consider only one feature to forecast stock indexes [23]; and (4) the parameters of SVR are difficult to determine [24, 25, 26].

This paper proposes a novel GA-SVR time-series model based on indicator selection to overcome these problems, and the proposed model contributes the following: (1) for feature selection, this study applies multivariate adaptive regression splines (MARS), stepwise regression (SR), and kernel ridge regression (KRR) to obtain the key technical indicators for investors; (2) the proposed model optimizes the parameters of SVR with a genetic algorithm (GA) to increase forecast accuracy; and (3) the results can provide persuasive investment guidelines for investors.

The remainder of this paper is organized as follows. Section 2 describes the related methodology, which includes technical indicators, MARS, the genetic algorithm, SVR, and stepwise regression. Section 3 presents the proposed algorithm. Section 4 provides the experimental results and comparisons. The conclusions of this paper are presented in Section 5.

Related work

This section introduces the related work, covering technical indicators, multivariate adaptive regression splines, the genetic algorithm, support vector regression, and, briefly, stepwise regression.

Technical indicator

A technical indicator (TI) provides investment guidance for investors by evaluating the profitability of securities from the trading data of market activities, such as past prices and volumes [5]. Stock market data are highly nonlinear, and many studies have focused on technical indicators to increase investment returns [27, 28]. A technical indicator is a formula that transforms trading data (open price, lowest price, highest price, average price, closing price, and volume) into a derived quantity and tries to forecast future prices by analyzing past patterns of stock prices [29, 30]. Technical analysis utilizes basic market data and assumes that the relevant factors are reflected in the stock exchange information [31]. Based on a literature review, this paper collected the technical indicators listed in Table 1. To consider more features that affect the stock price and its volatility, this paper also incorporates microeconomic features that affect the stock price; these collected factors are listed in Table 2.

Multivariate adaptive regression splines

Friedman [46] proposed multivariate adaptive regression splines (MARS), a simple nonparametric regression algorithm. The main advantages of MARS are its capacity to capture the complicated mappings and patterns of high-dimensional data, its production of simpler, easily interpreted models, and its ability to analyze the relative importance of features. Conceptually, MARS integrates piecewise linear regressions into a flexible model for solving nonlinear and complex problems. MARS establishes the final model in a two-stage procedure. First, in the forward stage, many spline basis functions are built; the features can be continuous, ordinal, or categorical. Second, the backward stage removes redundant spline basis functions, using the generalized cross-validation (GCV) criterion [46] to evaluate candidate model subsets and obtain the best subset; a lower GCV value is better. The GCV is defined as Eq (1):

GCV = \frac{\frac{1}{N}\sum_{i=1}^{N}\left[y_i - f_M(x_i)\right]^2}{\left[1 - \frac{C(M)}{N}\right]^2} (1)

where N is the number of data records, C(M) denotes the penalty cost of a model containing M basis functions, the numerator is the lack of fit of the M-basis-function model f_M(x_i), the denominator penalizes the model complexity C(M), and y_i denotes the target outputs.
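As a concrete check, the GCV of Eq (1) can be computed directly. The sketch below assumes the common penalty form C(M) = M + d·(M − 1)/2 with smoothing parameter d from Friedman's paper; this form is an assumption, not a detail given in this article:

```python
import numpy as np

def gcv(y, y_hat, M, N, d=3):
    """GCV of Eq (1): mean squared lack of fit divided by the squared
    complexity penalty (1 - C(M)/N)^2, assuming C(M) = M + d*(M-1)/2."""
    c_m = M + d * (M - 1) / 2.0  # assumed penalty cost of M basis functions
    lack_of_fit = np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)
    return lack_of_fit / (1.0 - c_m / N) ** 2
```

A lower value favors models that fit well without spending many basis functions, which is exactly how the backward stage ranks candidate subsets.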

Genetic algorithm

The genetic algorithm (GA) [47] searches for the global optimum using operators inspired by natural evolution. The GA has four operators (inheritance, mutation, selection, and crossover) that are applied repeatedly to evolve toward the optimal solution, and it has been applied successfully in the economic and financial domains [27, 48]. For a specific problem, the GA encodes potential solutions into simple chromosome-like data structures and applies recombination operators to preserve critical information [3]. This paper follows the GA steps of Goldberg [49], reorganized as follows:

  1. Step 1: Generate an initial population randomly.
  2. Step 2: Evaluate the fitness of each chromosome.
  3. Step 3: Check the stopping criterion.
  4. Step 4: Select suitable chromosomes from the parent population.
  5. Step 5: Apply crossover to search for new solutions by swapping corresponding segments of the parents' string representations.
  6. Step 6: Apply mutation to randomly change some of the chosen chromosomes.
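The steps above can be sketched as a minimal GA that minimizes a one-dimensional function. The binary encoding, population size, and search bounds here are illustrative assumptions, not the paper's settings:

```python
import random

def decode(bits, lo, hi):
    """Map a binary chromosome to a real value in [lo, hi] (Step 1 encoding)."""
    x = int("".join(map(str, bits)), 2)
    return lo + (hi - lo) * x / (2 ** len(bits) - 1)

def genetic_search(fitness, n_bits=16, pop_size=20, generations=100,
                   p_cross=0.8, p_mut=0.08, lo=0.0, hi=10.0, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best, best_f = None, float("inf")
    for _ in range(generations):
        scores = [fitness(decode(c, lo, hi)) for c in pop]  # Step 2: evaluate
        for c, f in zip(pop, scores):
            if f < best_f:
                best, best_f = c[:], f
        if best_f < 1e-5:                                   # Step 3: stop rule
            break
        # Step 4: roulette-wheel selection on inverted scores (minimization)
        weights = [1.0 / (1e-12 + s) for s in scores]
        parents = rng.choices(pop, weights=weights, k=pop_size)
        nxt = []
        for i in range(0, pop_size, 2):
            a, b = parents[i][:], parents[i + 1][:]
            if rng.random() < p_cross:                      # Step 5: crossover
                cut = rng.randint(1, n_bits - 1)
                a[cut:], b[cut:] = b[cut:], a[cut:]
            for c in (a, b):                                # Step 6: mutation
                for j in range(n_bits):
                    if rng.random() < p_mut:
                        c[j] ^= 1
            nxt += [a, b]
        pop = nxt
    return decode(best, lo, hi), best_f

x_best, f_best = genetic_search(lambda x: (x - 3.0) ** 2)
```

The same loop later reappears with an SVR-based fitness function in the proposed model.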

Support vector regression

The SVR algorithm is a nonlinear kernel-based regression that tries to find a regression hyperplane with minimal risk in a high-dimensional space [16]. In contrast to the traditional regression model, which estimates its coefficients by minimizing the squared loss, SVR uses the ε-insensitive loss function to obtain its parameters, which can be expressed as:

L_\varepsilon(t, f(x)) = \begin{cases} 0, & |t - f(x)| \le \varepsilon \\ |t - f(x)| - \varepsilon, & \text{otherwise} \end{cases} (2)

where t is the desired (target) output and ε defines the region of ε-insensitivity: when the predicted value falls inside the band, the loss is zero; conversely, when the predicted value falls outside the band, the loss equals the difference between the predicted value and the margin.

Considering both the empirical risk and the structural risk, the SVR model uses slack variables to construct a minimal quadratic programming problem:

\min_{w, b, \xi, \xi^*} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}(\xi_i + \xi_i^*) \quad \text{s.t.} \quad q_i - f(x_i) \le \varepsilon + \xi_i, \;\; f(x_i) - q_i \le \varepsilon + \xi_i^*, \;\; \xi_i, \xi_i^* \ge 0 (3)

The symbols \xi_i and \xi_i^* are two positive slack variables that measure the error (q_i − f(x_i)) beyond the boundaries of the ε-insensitivity zone. The term \sum_{i=1}^{n}(\xi_i + \xi_i^*) denotes the empirical risk, \frac{1}{2}\|w\|^2 is the structural risk that prevents over-learning and a lack of generality, and C denotes the regularization constant that specifies the trade-off between the empirical risk and the regularization term.

By sequentially adjusting the coefficient C, the band-area width ε, and the kernel function K, the optimal parameters can be solved with the Lagrange method [19]. Following Vapnik [18], this study utilizes the SVR-based regression function, defined as

f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) K(x, x_i) + b (4)

where \alpha_i and \alpha_i^* are the Lagrangian multipliers that satisfy the equality \sum_{i=1}^{n}(\alpha_i - \alpha_i^*) = 0, with \alpha_i, \alpha_i^* \in [0, C]. K(x, x_i) is the kernel function, which represents the inner product 〈φ(x_i), φ(x)〉. The radial basis function (RBF) has been widely used as the kernel function, and this study utilizes the RBF because of its capabilities and simple implementation [50]:

K(x, x_i) = \exp\left(-\frac{\|x - x_i\|^2}{2\gamma^2}\right) (5)

where γ is the RBF width.
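A minimal sketch of ε-insensitive SVR with an RBF kernel, using scikit-learn; the synthetic data and the parameter values (C, ε, γ) are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical noisy nonlinear series, for illustration only.
rng = np.random.default_rng(0)
X = np.linspace(0, 4 * np.pi, 200).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

# C trades empirical risk against structural risk (Eq 3); epsilon is the
# width of the insensitivity band (Eq 2); gamma parameterizes the RBF (Eq 5).
model = SVR(kernel="rbf", C=10.0, epsilon=0.05, gamma=0.5).fit(X, y)
pred = model.predict(X)
rmse = float(np.sqrt(np.mean((y - pred) ** 2)))
```

Note that scikit-learn's `gamma` multiplies the squared distance directly, i.e. K(x, x_i) = exp(−γ‖x − x_i‖²), a reparameterization of the width form in Eq (5).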

Stepwise regression

SR is a simple multiple regression method that establishes a model by adding or removing features based on F-test statistics; that is, SR uses forward and backward procedures to add or remove features according to the F statistics. SR adds a feature to the model if its p-value is less than the given significance level (p < .05) and removes a variable from the model if its p-value is greater than the given significance level [51].

Proposed model

Based on previous studies, this study has identified some drawbacks in time-series forecasting: (1) important technical indicators are selected based on subjective experience and opinions [23]; (2) previous methods impose assumptions on the datasets and require particular statistical distributions [3]; (3) previous time-series models consider only one feature to forecast the stock index [23]; and (4) the best SVR parameters are difficult to determine [24, 25, 26]. To overcome these problems, this study extends the methods of our conference paper [52] to solve these forecasting problems. That is, this paper proposes a GA-SVR time-series model based on feature selection to forecast leading industry stock prices. The proposed model contributes the following: (1) this study utilizes MARS, SR, and KRR to choose the key technical indicators for investors; (2) a GA is used to optimize the SVR parameters to enhance forecast accuracy; and (3) the results can provide investment guidance to investors.

The proposed model includes three blocks, as shown in Fig 1, which can be briefly described as follows:

  1. Data preprocessing: the proposed model transforms daily basic stock data (open price, lowest price, highest price, average price, closing price, and volume) into technical indicators, and then utilizes the MARS, SR, and KRR methods to select the key indicators.
  2. Modeling: build a forecast model using SVR and employ a GA to optimize the parameters of SVR.
  3. Forecasting: the optimized GA-SVR forecast model is utilized to forecast the stock price, and the proposed models are compared with the listed models in terms of accuracy.

For ease of computation, this section proposes an algorithm with six steps, each of which is described in detail as follows:

Step 1: Transform trading data into technical indicators

This step collects daily stock trading data (open, close, highest, and lowest prices and volume) and transforms these data into technical indicators [31], such as MA, PSY, RSI, BIAS, and WMS%R. In addition, this paper incorporates other indicators, such as the NT-dollar-to-US-dollar exchange rate and momentum. The indicators used are listed in Tables 1 and 2, respectively.
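Step 1 can be sketched as follows for a few of the named indicators (MA, BIAS, RSI, Williams %R). The formulas follow standard textbook definitions rather than the paper's exact Table 1, and the column names and window length are hypothetical:

```python
import numpy as np
import pandas as pd

def make_indicators(df, n=10):
    """Transform OHLC trading data into a few common technical indicators."""
    out = pd.DataFrame(index=df.index)
    out["MA"] = df["close"].rolling(n).mean()                 # moving average
    out["BIAS"] = 100 * (df["close"] - out["MA"]) / out["MA"]  # deviation from MA
    delta = df["close"].diff()
    gain = delta.clip(lower=0).rolling(n).mean()
    loss = (-delta.clip(upper=0)).rolling(n).mean()
    out["RSI"] = 100 * gain / (gain + loss)                   # relative strength
    hh = df["high"].rolling(n).max()
    ll = df["low"].rolling(n).min()
    out["WMSR"] = 100 * (hh - df["close"]) / (hh - ll)        # Williams %R
    return out.dropna()

# Hypothetical OHLC frame (a strictly rising price), for illustration only.
df = pd.DataFrame({"close": np.arange(1.0, 51),
                   "high": np.arange(2.0, 52),
                   "low": np.arange(0.0, 50)})
ind = make_indicators(df)
```

On a strictly rising series the RSI saturates at 100, as expected for an uninterrupted uptrend.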

Step 2: Select key features (MARS, SR, and KRR)

From Step 1, the collected data have been transformed into technical indicators; this step utilizes MARS, SR, and KRR to choose the key indicators. To remove collinearity, this step also runs an SR multicollinearity check to eliminate indicators with high multicollinearity. To compare the three feature selection methods fairly, this study keeps the numbers of selected features as similar as possible.
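The multicollinearity check can be sketched with variance inflation factors; the VIF > 10 cutoff is the threshold used later in the experiments, while the iterative drop-highest strategy and the toy data are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X, threshold=10.0):
    """Iteratively drop the feature with the largest VIF above the threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns)
        if vifs.max() <= threshold:
            break
        X = X.drop(columns=vifs.idxmax())  # remove the most collinear feature
    return X

# Hypothetical indicators: "a" and "b" are nearly identical, "c" is independent.
rng = np.random.default_rng(0)
a = rng.standard_normal(100)
ind = pd.DataFrame({"a": a, "b": a + 0.01 * rng.standard_normal(100),
                    "c": rng.standard_normal(100)})
kept = drop_high_vif(ind)
```

One of the two near-duplicate columns is discarded, while the independent column survives.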

Step 3: Construct the SVR forecast model

To build the SVR forecast model, this step uses the selected features as input features, with the RBF function as the kernel function because it can handle nonlinear and high-dimensional data. Three parameters must be set to build the forecasting model: the loss function width ε, the regularization constant C, and the RBF width σ. To obtain a better forecast model, this step utilizes a genetic algorithm to optimize these parameters.

Step 4: Optimize the SVR parameters by GA at minimal RMSE

To achieve better forecasting accuracy, this step employs a genetic algorithm to optimize the SVR parameters C and σ by minimizing the RMSE (Eq 6) on the training dataset. The RMSE is defined as:

RMSE = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2} (6)

where y_t is the real stock index, \hat{y}_t is the forecasted stock price, and n is the number of records.
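Eq (6) is the fitness function the GA minimizes; a direct implementation:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error of Eq (6), used as the GA fitness."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```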

Step 4 comprises the following sub-steps, which describe the operation of the GA process:

Step 4.1.1: Initialize the parameters for the GA.

The initial population is set to 80 individual solutions, randomly generated in this sub-step; the SVR parameters are encoded into a chromosome as a binary string. In addition, the maximal number of generations, the crossover probability, and the mutation probability are set to 2000, 0.8, and 0.08 [53], respectively.

Step 4.1.2: Evaluate fitness.

This sub-step uses a pre-defined fitness function (the RMSE) to evaluate the fitness of each chromosome and determine the goodness of fit of each solution.

Step 4.1.3: Check the stopping criterion.

This sub-step sets the stopping rule: if either of the following two conditions is met, the GA process stops:

  1. The maximal number of generations (2000) is reached.
  2. The RMSE of the best solution is smaller than the given minimum, which is set to 10−5.

If neither criterion is met, a new iteration (Steps 4.1.2 to 4.1.5) is run.

Step 4.1.4: Select the parents by the fitness function.

Selection screens out fit chromosomes to be copied, increasing their share of the offspring, and eliminates poorer chromosomes, decreasing their share. Roulette-wheel selection is employed to select the chromosomes for reproduction.

Step 4.1.5: Perform crossover and mutation.

The parents are recombined to produce and mutate offspring; in this step, one-point crossover is used. Then, a member of the population is selected at random, and one randomly chosen bit in its bit-string representation is flipped [3].
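Putting Steps 4.1.1–4.1.5 together, a compact sketch of GA-optimized SVR. The toy series, lag features, chromosome length, population size, and parameter ranges are illustrative assumptions rather than the paper's settings (which use 80 chromosomes and up to 2000 generations):

```python
import random
import numpy as np
from sklearn.svm import SVR

# Hypothetical training series: predict the next value from the two
# previous values (illustrative lag features, not the paper's indicators).
rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20, 300)) + 0.05 * rng.standard_normal(300)
X = np.column_stack([series[:-2], series[1:-1]])
y = series[2:]

def fitness(C, gamma):
    """Training RMSE of SVR(C, gamma): the Step 4 fitness (Eq 6)."""
    pred = SVR(kernel="rbf", C=C, gamma=gamma, epsilon=0.01).fit(X, y).predict(X)
    return float(np.sqrt(np.mean((y - pred) ** 2)))

def decode(bits):
    """First half of the chromosome encodes C, second half encodes gamma."""
    h = len(bits) // 2
    C = 1 + int("".join(map(str, bits[:h])), 2)                 # C in [1, 64]
    gamma = (1 + int("".join(map(str, bits[h:])), 2)) / 2 ** h  # gamma in (0, 1]
    return C, gamma

ga = random.Random(1)
pop = [[ga.randint(0, 1) for _ in range(12)] for _ in range(10)]  # Step 4.1.1
best, best_f = None, float("inf")
for _ in range(20):                                  # small budget for a sketch
    scores = [fitness(*decode(ch)) for ch in pop]    # Step 4.1.2: evaluate
    for ch, s in zip(pop, scores):
        if s < best_f:
            best, best_f = decode(ch), s
    weights = [1 / (1e-12 + s) for s in scores]      # Step 4.1.4: roulette wheel
    parents = ga.choices(pop, weights=weights, k=len(pop))
    nxt = []
    for i in range(0, len(pop), 2):                  # Step 4.1.5: crossover/mutation
        a, b = parents[i][:], parents[i + 1][:]
        if ga.random() < 0.8:                        # one-point crossover
            cut = ga.randint(1, len(a) - 1)
            a[cut:], b[cut:] = b[cut:], a[cut:]
        for ch in (a, b):
            for j in range(len(ch)):
                if ga.random() < 0.08:               # bit-flip mutation
                    ch[j] ^= 1
        nxt += [a, b]
    pop = nxt
```

After the loop, `best` holds the best (C, gamma) pair found and `best_f` its training RMSE; a full implementation would also check the 10−5 stopping rule of Step 4.1.3.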

Step 5: Forecast the stock price using the optimized models

With the SVR parameters determined in Step 4, the testing data are applied to predict the next day's stock prices using the optimized forecast model.

Step 6: Compare the performance

To evaluate forecast accuracy, the proposed model is compared with the listed models under the RMSE criterion. The seven comparison models are as follows: (1) integrated KRR and SR (KRR-SR model), (2) integrated KRR and MARS (KRR-MARS model), (3) integrated KRR and GA-SVR (KRR-GA-SVR model), (4) integrated SR and MARS (SR-MARS model), (5) integrated SR and KRR (SR-KRR model), (6) integrated MARS and SR (MARS-SR model), and (7) integrated MARS and KRR (MARS-KRR model). The KRR-SR model denotes using SR to build the forecast model after selecting the features with KRR; the other combined models are named analogously, and the detailed abbreviations are presented in Table 3.

Experiment and comparisons

This study employs Taiwan stocks as the experimental datasets; the selected companies are leaders of different industries according to "Business Today" (www.businesstoday.com.tw), which publishes the 1000 largest companies from Mainland China, Taiwan, and Hong Kong. The experimental datasets, comprising Chunghwa Telecom (CHT), China Steel, Hon Hai, Cathay Financial Holdings, and Taiwan Semiconductor Manufacturing Company (TSMC), were collected from 2003 to 2012. To compare the accuracy of long and short test-period forecasts, this study implements two experiments for each dataset; the two experimental designs are listed in Table 4.

Table 4. The experiment of the long and short test period.

https://doi.org/10.1371/journal.pone.0209922.t004

First, this study conducts an initial experiment to explore the performance of the GA-SVR model. The forecasting performance of GA-SVR is compared with that of SR, KRR, and MARS, and the results are shown in Table 5. From Table 5, we can see that the GA-SVR model generates the smallest RMSE on the CHT, China Steel, and Hon Hai datasets. Therefore, this study combines feature selection with the GA-SVR model as the proposed forecasting model. Second, this study sets the parameters of the different forecasting models for the following experiment. For the MARS model, the training data are utilized to build the model, the maximal number of basis functions (BFs) is set to 2000, and the other parameters are set to their defaults [54]. For the KRR model, the lambda parameter for Tikhonov regularization is set to 0.001 to build the forecasting model. For the SR model, the training data are employed to build the forecasting model, and features with variance inflation factors (VIF) higher than 10 are removed first.

Table 5. The initial performance comparisons for three companies in RMSE.

https://doi.org/10.1371/journal.pone.0209922.t005

Based on the initial experimental result (the GA-SVR model performs better than the other models), this paper combines the GA-SVR model with a feature selection method as the proposed model. This study then proposes models A and B based on different feature selection methods: model A uses MARS to select features, and model B utilizes stepwise regression as the feature selection method.

For comparison, this study selects the same number of features for the different feature selection methods. The features selected by MARS, SR, and KRR are listed in Table 6. After finding the key features, this study constructs the forecast model with SVR and optimizes the parameters of the MARS-GA-SVR and SR-GA-SVR models by GA; the optimized parameters are listed in Table 7.

Table 6. Selected features by MARS, SR and KRR for five companies.

https://doi.org/10.1371/journal.pone.0209922.t006

Table 7. The optimal parameters of GA searching for five companies.

https://doi.org/10.1371/journal.pone.0209922.t007

Experimental results

In this section, this study verifies the performance of the proposed model using five different industry datasets: the Chunghwa Telecom (CHT), China Steel, Hon Hai, Cathay Financial Holdings, and Taiwan Semiconductor Manufacturing Company datasets. The CHT datasets are employed in the first experiment; the computational process follows the proposed algorithm in Section 3. The predictions of the MARS-GA-SVR and SR-GA-SVR models are shown in Fig 2. Fig 2 shows that for model A (MARS-GA-SVR), the long test-period results exhibit more overlap between the forecast line and the real closing price line than the short test-period results. For the proposed model B (SR-GA-SVR), there is more overlap between the forecast line and the real closing price line in the short test-period results (Fig 2). The RMSEs of the listed models and the proposed models are shown in Table 8, and the results show that the proposed models perform better than the other models. In the short test period, the proposed model B (SR-GA-SVR) generates the smallest RMSE among the listed models (Table 8). Moreover, the proposed model A (MARS-GA-SVR) generates the smallest RMSE in the long test period (Table 8).

Fig 2. Results of forecasting short and long test period for Chunghwa Telecom datasets.

https://doi.org/10.1371/journal.pone.0209922.g002

Table 9 and Fig 3 illustrate the numerical results for the China Steel datasets. In Fig 3, the results of model A show that the real closing price line and the forecast line overlap more in the long test period than in the short test period. Similarly, compared with the long test period, the results generated by model B show more overlap between the closing price line and the forecast line in the short test period. From Table 9, the proposed models generate the smallest RMSEs among the listed models: the proposed model B generates the smallest RMSE in the short test period, and the proposed model A generates the smallest RMSE in the long period.

Fig 3. Results of forecasting short and long test period for China Steel datasets.

https://doi.org/10.1371/journal.pone.0209922.g003

Experiments on the Hon Hai datasets are presented in Table 10 and Fig 4. Fig 4 shows the excellent performance of the proposed models (the forecast line almost completely overlaps the closing price line). From Table 10, model A and model B generate the smallest RMSEs in the long test period and the short test period, respectively.

Fig 4. Results of forecasting short and long test period for Hon Hai datasets.

https://doi.org/10.1371/journal.pone.0209922.g004

Table 11 and Fig 5 show the experiments on the Cathay Financial Holdings datasets. In Fig 5, the numerical results clearly show that the forecast line deviates from the closing price line. From Table 11, the KRR-GA-SVR model generates the smallest RMSE in the short test period, and the proposed model B generates a smaller RMSE than the other models in the long test period.

Table 11. Performance comparisons for Cathay Financial Holdings.

https://doi.org/10.1371/journal.pone.0209922.t011

Fig 5. Results of forecasting short and long test period for Cathay Financial datasets.

https://doi.org/10.1371/journal.pone.0209922.g005

Next, the experiments on the TSMC datasets are illustrated in Table 12 and Fig 6. The forecast results in Fig 6 show that the forecast line obviously deviates from the closing price line. From Table 12, the proposed model B generates the smallest RMSE in both the short and long test periods.

Fig 6. Results of forecasting short and long test period for TSMC datasets.

https://doi.org/10.1371/journal.pone.0209922.g006

Significance test

To test whether the proposed models are superior to the KRR-MARS, KRR-SR, KRR-GA-SVR, MARS-SR, MARS-KRR, SR-MARS, and SR-KRR models in stock price forecasting, this study applies the Wilcoxon signed-rank test, using the RMSE to test the significance of the differences between the proposed models and the listed models. Tables 13 and 14 present the Z statistics of the two-tailed Wilcoxon signed-rank tests between the proposed models and the listed models.

Table 13. Wilcoxon sign test for different models comparison in short period.

https://doi.org/10.1371/journal.pone.0209922.t013

Table 14. Wilcoxon sign test for different models comparison in long period.

https://doi.org/10.1371/journal.pone.0209922.t014

From Table 13, the proposed models show a significant difference (p < 0.05) from the other models in the short test period, except for KRR-GA-SVR, at the 0.05 significance level. Therefore, we can conclude that the proposed models are significantly better than the KRR-MARS, KRR-SR, MARS-SR, MARS-KRR, SR-MARS, and SR-KRR models. However, Table 13 also shows that there is no significant difference between proposed models A and B in the short testing period. In the long testing period, the proposed models show significant differences from the other models at the 0.05 significance level, as shown in Table 14. Therefore, we can conclude that the proposed models are significantly better than the listed models.

Findings

Based on the experimental results, this study can summarize the findings as follows.

(1) Datasets quality.

From Table 15, we find that among the five datasets with different fluctuations, the Hon Hai stock price has the highest fluctuation range. Despite this, the proposed models generate smaller RMSEs than the listed models in both the short and long test periods, as shown in Tables 16 and 17. Further, the China Steel stock price has the smallest fluctuation among the five datasets, and the proposed models still achieve better performance in both the short and long test periods, as shown in Tables 16 and 17. Finally, the results show that the two proposed models are well suited to forecasting stock prices for investors.

Table 16. The RMSE of all experiments for the short testing period.

https://doi.org/10.1371/journal.pone.0209922.t016

Table 17. The RMSE of all experiments for the long testing period.

https://doi.org/10.1371/journal.pone.0209922.t017

(2) Short and Long test period.

The experimental results of the forecasting models in the short and long test periods are listed in Tables 16 and 17, and we find that the accuracy of the proposed models in the short test period is better than that in the long test period. Fig 7 shows that the stock indexes change dramatically in the long test periods, and the proposed models perform better under larger price fluctuations. In the short test period, the proposed model B generates the smallest RMSE on the TSMC, Hon Hai, China Steel, and CHT datasets, but not on the Cathay dataset (Table 16), because the fluctuation of the Cathay price data in the short period is smaller than that of the other datasets (Fig 7). Therefore, we conclude that the proposed model B (SR-GA-SVR) performs better than the listed models, especially under larger price fluctuations; i.e., we can confirm that the features selected by SR can effectively enhance accuracy in the short testing period.

Fig 7. The closing prices of five companies from 2003 to 2012.

https://doi.org/10.1371/journal.pone.0209922.g007

Similarly, in the long test period, Table 17 shows that the proposed model A (MARS-GA-SVR) has the smallest RMSE on the Hon Hai, China Steel, and CHT datasets. Therefore, we conclude that MARS can also select better features for the proposed model when the stock price range changes dramatically.

(3) Selected feature.

For the MARS-selected features shown in Table 18, the feature "The Final Best Bid Quote" was chosen four times across the five datasets, and the forecasting results of the proposed model A are better than those of the proposed model B in the long test period. For these reasons, we can confirm that the feature "The Final Best Bid Quote" influences stock price forecasting in the long testing period.

For the SR-selected features shown in Table 18, the features CDP, MO1, and MO2 were selected three times across the five datasets. In addition, the proposed SR-GA-SVR shows precise accuracy in the short testing period. Therefore, we find that CDP, MO1, and MO2 have a great impact on forecasting stock prices in the short test period.

(4) Investor suggestion.

After verifying the proposed models, this study can offer investors the following suggestions as references:

  • From Tables 16 and 17, short test-period forecasting is recommended, because it is more accurate than long test-period forecasting for stock investment.
  • For short-period forecasting, we suggest using the proposed model B because it is more accurate than the proposed model A (see Table 16). Regarding the key features shown in Table 18, we suggest that investors consider the three key features CDP, MO1, and MO2.
  • For long-period forecasting, from Table 17, the proposed model A is recommended because it is more accurate than the proposed model B in the long testing period. From Table 18, the feature "The Final Best Bid Quote" should be considered as an input variable in long-period forecasting.

Conclusion

This study has proposed a new time-series model that incorporates multiple factors and reasonably selected key features into the GA-SVR model. The results show that the proposed models improve forecasting accuracy. Furthermore, the proposed models outperform the listed models in RMSE on the Chunghwa Telecom, China Steel, Hon Hai, Cathay Financial Holdings and Taiwan Semiconductor Manufacturing Company datasets. In addition, from the findings and discussion, the proposed SR-GA-SVR outperforms the listed models in the short test period, except on Cathay Financial Holdings. Moreover, in the long test period, the MARS-GA-SVR also performs better. We find that the proposed model B outperforms the listed models in almost all cases, especially under larger price fluctuations; that is, the proposed model better fits datasets with larger price fluctuations. Finally, the research results can provide some suggestions to investors as references.

In future work, several issues from this study can be extended as follows:

  1. Consider other features to train the model, such as company news or government policies.
  2. Apply the model to other application fields, such as electric load and environmental pollution forecasting.
  3. Employ other methods to improve the proposed model, such as feature lags.

Supporting information

References

  1. Box GEP, Jenkins GM. Time series analysis: forecasting and control. San Francisco: Holden-Day; 1970.
  2. Ediger VS, Akar S. ARIMA forecasting of primary energy demand by fuel in Turkey. Energy Policy. 2007; 35: 1701–1708.
  3. Cheng CH, Chen TL, Wei LY. A hybrid model based on rough sets theory and genetic algorithms for stock price forecasting. Information Sciences. 2010; 180: 1610–1629.
  4. Yu THK, Huarng KH. A bivariate fuzzy time series model to forecast the TAIEX. Expert Systems with Applications. 2008; 34: 2945–2952.
  5. Tsai CF, Lin YC, Yen DC, Chen YM. Predicting stock returns by classifier ensembles. Applied Soft Computing. 2011; 11: 2452–2459.
  6. Pan Y, Xiao Z, Wang X, Yang D. A multiple support vector machine approach to stock index forecasting with mixed frequency sampling. Knowledge-Based Systems. 2017; 122: 90–102.
  7. Su CH, Cheng CH. A hybrid fuzzy time series model based on ANFIS and integrated nonlinear feature selection method for forecasting stock. Neurocomputing. 2016; 205: 264–273.
  8. Wang J, Hou R, Wang C, Shen L. Improved v-support vector regression model based on variable selection and brain storm optimization for stock price forecasting. Applied Soft Computing. 2016; 49: 164–178.
  9. Chang PC, Liu CH. A TSK type fuzzy rule based system for stock price prediction. Expert Systems with Applications. 2008; 34: 135–144.
  10. Liu CF, Yeh CY, Lee SJ. Application of type-2 neuro-fuzzy modeling in stock price prediction. Applied Soft Computing. 2012; 12: 1348–1358.
  11. Oh SK, Pedrycz W, Park HS. Genetically optimized fuzzy polynomial neural networks. IEEE Transactions on Fuzzy Systems. 2006; 14: 125–144.
  12. Zarandi MHF, Rezaee B, Turksen IB, Neshat E. A type-2 fuzzy rule-based expert system model for stock price analysis. Expert Systems with Applications. 2009; 36: 139–154.
  13. Cao LJ, Tay FEH. Support vector machine with adaptive parameters in financial time series forecasting. IEEE Transactions on Neural Networks. 2003; 14: 1506–1518. pmid:18244595
  14. Gestel TV, Suykens JAK, Baestaens DE, Lambrechts A, Lanckriet G, Vandaele B, et al. Financial time series prediction using least squares support vector machines within the evidence framework. IEEE Transactions on Neural Networks. 2001; 12: 809–821. pmid:18249915
  15. Gavrishchaka VV, Banerjee S. Support vector machine as an efficient framework for stock market volatility forecasting. Computational Management Science. 2006; 3: 147–160.
  16. Yeh CY, Huang CW, Lee SJ. A multiple-kernel support vector regression approach for stock market price forecasting. Expert Systems with Applications. 2011; 38: 2177–2186.
  17. Vapnik VN. An overview of statistical learning theory. IEEE Transactions on Neural Networks. 1999; 10: 988–999. pmid:18252602
  18. Vapnik VN. The nature of statistical learning theory. New York: Springer; 2000.
  19. Kao LJ, Chiu CC, Lu CJ, Yang JL. Integration of nonlinear independent component analysis and support vector regression for stock price forecasting. Neurocomputing. 2013; 99: 534–542.
  20. Cristianini N, Shawe-Taylor J. An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press; 2000.
  21. Kazem A, Sharifi E, Hussain FK, Saberi M, Hussain OK. Support vector regression with chaos-based firefly algorithm for stock market price forecasting. Applied Soft Computing. 2013; 13: 947–958.
  22. Kabir MMJ, Xu S, Kang BH, Zhao Z. A new multiple seeds based genetic algorithm for discovering a set of interesting Boolean association rules. Expert Systems with Applications. 2017; 74: 55–69.
  23. Su CH, Cheng CH, Tsai WL. Fuzzy time series model based on fitting function for forecasting TAIEX index. International Journal of Hybrid Information Technology. 2013; 6: 111–122.
  24. Chapelle O, Vapnik V, Bousquet O, Mukherjee S. Choosing multiple parameters for support vector machines. Machine Learning. 2002; 46: 131–159.
  25. Duan K, Keerthi SS, Poo AN. Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing. 2003; 51: 41–59.
  26. Kwok JTY. The evidence framework applied to support vector machines. IEEE Transactions on Neural Networks. 2000; 11: 1162–1173. pmid:18249842
  27. Allen F, Karjalainen R. Using genetic algorithms to find technical trading rules. Journal of Financial Economics. 1999; 51: 245–271.
  28. Leigh W, Purvis R, Ragusa JM. Forecasting the NYSE composite index with technical analysis, pattern recognizer, neural network, and genetic algorithm: a case study in romantic decision support. Decision Support Systems. 2002; 32: 361–377.
  29. Gorgulho A, Neves R, Horta N. Applying a GA kernel on optimizing technical analysis rules for stock picking and portfolio composition. Expert Systems with Applications. 2011; 38: 14072–14085.
  30. Pring MJ. Technical Analysis Explained. New York: McGraw-Hill; 1991.
  31. Chang PC, Liao TW, Lin JJ, Fan CY. A dynamic threshold decision system for stock trading signal detection. Applied Soft Computing. 2011; 11: 3998–4010.
  32. Park JI, Lee DJ, Song CK, Chun MG. TAIFEX and KOSPI 200 forecasting based on two-factors high-order fuzzy time series and particle swarm optimization. Expert Systems with Applications. 2010; 37: 959–967.
  33. Tanaka-Yamawaki M, Tokuoka S. Adaptive use of technical indicators for the prediction of intra-day stock prices. Physica A. 2007; 383: 125–133.
  34. Lin TN. Using AdaBoost for Taiwan stock index future intra-day trading system. M.Sc. Thesis, National Taiwan University. 2008. https://www.csie.ntu.edu.tw/~lyuu/theses/thesis_r95944016.pdf
  35. Grebenkov DS, Serror J. Following a trend with an exponential moving average: Analytical results for a Gaussian model. Physica A. 2014; 394: 288–303.
  36. Hassapis C, Kalyvitis S. Investigating the links between growth and real stock price changes with empirical evidence from the G-7 economies. The Quarterly Review of Economics and Finance. 2002; 42: 543–575.
  37. Basher SA, Haug AA, Sadorsky P. Oil prices, exchange rates and emerging stock markets. Energy Economics. 2012; 34: 227–240.
  38. Wikipedia. Interest rate. 2017. http://en.wikipedia.org/wiki/Interest_rate.
  39. Bessembinder H. Quote-based competition and trade execution costs in NYSE-listed stocks. Journal of Financial Economics. 2003; 70: 385–422.
  40. Bagella M, Becchetti L, Adriani F. Observed and “fundamental” price-earning ratios: A comparative analysis of high-tech stock evaluation in the US and in Europe. Journal of International Money and Finance. 2005; 24: 549–581.
  41. Pontiff J, Schall LD. Book-to-market ratios as predictors of market returns. Journal of Financial Economics. 1998; 49: 141–160.
  42. Chen S. The predictability of aggregate Japanese stock returns: Implications of dividend yield. International Review of Economics and Finance. 2012; 22: 284–304.
  43. Pan D, Wiersma G, Williams L, Fong YS. More than a number: unexpected benefits of return on investment analysis. The Journal of Academic Librarianship. 2013; 39: 566–572.
  44. Politi M, Millot N, Chakraborti A. The near-extreme density of intraday log-returns. Physica A. 2012; 391: 147–155.
  45. Business Next. Don't be afraid of low interest rates! 8 strategies to preserve capital. 2002. http://www.bnext.com.tw/article/view/id/7121.
  46. Friedman JH. Multivariate adaptive regression splines. The Annals of Statistics. 1991; 19: 1–67.
  47. Holland JH. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press; 1975.
  48. Kim MJ, Min SH, Han I. An evolutionary approach to the combination of multiple classifiers to predict a stock price index. Expert Systems with Applications. 2006; 31: 241–247.
  49. Goldberg DE. Genetic Algorithms in Search Optimization and Machine Learning. Reading: Addison-Wesley; 1989.
  50. Huang SC, Chuang PJ, Wu CF, Lai HJ. Chaos-based support vector regressions for exchange rate forecasting. Expert Systems with Applications. 2010; 37: 8590–8598.
  51. Chang PC, Liu CH, Fan CY. Data clustering and fuzzy neural network for sales forecasting: A case study in printed circuit board industry. Knowledge-Based Systems. 2009; 22: 334–355.
  52. Cheng CH, Shiu HY. A novel GA-SVR time series model based on selected indicators method for forecasting stock price. IEEE Proceedings-ISEEE 2014. Sapporo, Japan. 2014.
  53. Asadi S, Hadavandi E, Mehmanpazir F, Nakhostin MM. Hybridization of evolutionary Levenberg–Marquardt neural networks and data pre-processing for stock market prediction. Knowledge-Based Systems. 2012; 35: 245–258.
  54. Adoko AC, Jiao YY, Wu L, Wang H, Wang ZH. Predicting tunnel convergence using Multivariate Adaptive Regression Spline and Artificial Neural Network. Tunnelling and Underground Space Technology. 2013; 38: 368–376.