Forecasting leading industry stock prices based on a hybrid time-series forecast model

Many time-series methods have been widely used to forecast stock prices for profit. However, previous time-series models still have some problems. To overcome them, this paper proposes a hybrid time-series model based on a feature selection method for forecasting the stock prices of leading industries. In the proposed model, stepwise regression is adopted first, and multivariate adaptive regression splines and kernel ridge regression are then used to select the key features. Second, this study constructs the forecasting model by using a genetic algorithm to optimize the parameters of support vector regression. To evaluate the forecasting performance of the proposed models, this study collects datasets of five leading enterprises in different industries from 2003 to 2012. The collected stock prices are employed to verify the accuracy of the proposed model. The results show that the proposed model achieves better accuracy than the other listed models and provides persuasive investment guidance to investors.


Introduction
Forecasting stock prices is a key issue for investors in the stock market. Because stock prices are nonlinear and nonstationary time-series data, forecasting them is a challenging task in the financial market. Conventional time-series models have been used to forecast stock prices, and many researchers are still devoted to developing and improving time-series forecasting models. The best-known conventional time-series forecasting approach is the autoregressive integrated moving average (ARIMA) [1], which is employed when the time-series data are linear and there are no missing values [2]. Statistical methods, such as traditional time-series models, usually address linear forecasting models, and the variables must follow a normal distribution [3]. Therefore, conventional time-series methods are not suitable for forecasting stock prices, because stock price fluctuation is usually nonlinear and nonstationary.
Further, most conventional time-series models use only one variable (the previous day's stock price) [4], when there are actually many influential factors, such as market indexes, technical indicators, economics, political environments, investor psychology, and the fundamental financial analysis of companies, that can influence forecasting performance [5]. In practice, researchers use many technical indicators as independent variables for forecasting stock prices. Selecting the key variables from numerous technical indicators is a critical step in the forecasting process. Investors usually select technical indicators based on their experience or intuition, even though this behavior is highly risky, and choosing unrepresentative indicators may result in lost profits. Therefore, selecting the relevant indicators to forecast stock prices is an important issue for investors. Financial researchers must identify the key technical indicators that have higher relevance to the stock price through indicator selection. Therefore, the proposed models incorporate indicator selection in the stock forecasting process to enhance forecasting accuracy.
Recently, many new forecasting techniques have been used to construct efficient and precise machine learning models, and forecasting stock prices remains a hot topic [6,7,8]. To overcome the shortcomings of traditional time-series models, nonlinear approaches have been proposed, such as fuzzy neural networks [9,10,11,12] and support vector regression (SVR) [13,14,15,16]. SVR applies the structural risk minimization principle to estimate a function by minimizing the upper bound of the generalization error [17,18]. This principle can achieve better generalization from datasets of limited size [19]. Further, SVR has a global optimum and exhibits better prediction accuracy due to its implementation of the structural risk minimization principle, which considers both the training error and the capacity of the regression model [15,20]. Although SVR is supported by many experimental results in applications such as economic and financial prediction, its main problem is the determination of its parameters, which requires practitioner experience [21]. In the literature, genetic algorithms (GA) have been successfully applied to a wide range of problems, including machine learning, multiobjective optimization, and multimodal function optimization [22]. GA is a search algorithm inspired by evolution and is usually used to solve optimization problems. Therefore, the proposed models use GA to optimize the parameters of SVR and obtain better forecasting performance.
The related work mentioned above shows that previous studies have some drawbacks: (1) many studies select key technical indicators based on experience and opinions [23]; (2) most statistical methods rely on assumptions about the datasets and require the data to obey statistical distributions [3]; (3) most previous time-series models consider only one feature to forecast stock indexes [23]; and (4) the parameters of SVR are difficult to determine [24,25,26]. This paper proposes a novel GA-SVR time-series model based on indicator selection to overcome these problems, and the proposed model contributes the following: (1) In feature selection, this study applies multivariate adaptive regression splines (MARS), stepwise regression (SR), and kernel ridge regression (KRR) to obtain the key technical indicators for investors. (2) The proposed model optimizes the parameters of SVR by a genetic algorithm (GA) to increase the forecast accuracy. (3) The results could provide persuasive investment guidelines for investors.
The remainder of this paper is organized as follows. Section 2 describes the related methodology, which covers technical indicators, MARS, the genetic algorithm, SVR, and stepwise regression. Section 3 presents the proposed algorithm. Section 4 provides the experimental results and comparisons. The conclusions of this paper are given in Section 5.

Technical indicator
A technical indicator (TI) provides investment guidance for investors by evaluating the profitability of securities from an analysis of trading data generated by market activity, such as past prices and volumes [5]. Stock market data are highly nonlinear, and many studies have focused on technical indicators to increase investment returns [27,28]. A technical indicator is a formula that transforms trading data (open price, lowest price, highest price, average price, closing price, and volume) into an indicator value and tries to forecast future prices by analyzing the past pattern of stock prices [29,30]. Technical analysis uses basic market data and assumes that the relevant factors are reflected in the stock exchange information [31]. Based on a literature review, this paper collected the technical indicators listed in Table 1. To consider more features that affect the stock price and its volatility, this paper also incorporates microeconomic features, which are listed in Table 2.

Table 1. Technical indicators.
…, i = 9, and $p_c$ is the closing price of the trading day [32].
5BIAS: The difference between the closing price and MA5, which uses the tendency of the stock price to return to its average price to analyze stock trends [31].
10BIAS: The difference between the closing price and MA10, which uses the tendency of the stock price to return to its average price to analyze stock trends [31].
RSI: Measures the magnitude of recent gains relative to recent losses in an attempt to determine overbought and oversold conditions of an asset [31].
12PSY: PSY12 (12-day psychological line) = ($D_{up12}$ / 12) × 100, where $D_{up12}$ is the number of days on which the price rises within 12 days [23].
10WMS%R: Williams %R is usually drawn by using negative values. For analysis and discussion, the negative signs are ignored. It is best to wait until the security's price changes direction before placing a trade [31].
MACD: Presents the difference between a fast and a slow exponential moving average (EMA) of closing prices. Fast is a short-period average, and slow is a long-period one [31].
MO1: MO1(t) = price(t) − price(t − n), n = 1 [33].
MO2: MO2(t) = price(t) − price(t − n), n = 2 [33].
Transaction volume: A basic yet very important element of market timing strategy. Volume gives clues about the intensity of a given price move [34].
CDP value: Divides the previous price movement into five values and bases the intraday trading decision on these five values [31].
Exponential Moving Average (EMA): EMA is defined as a linear transformation of a time series into a smoother time series by $\tilde{x}_t = \lambda \sum_{k=0}^{\infty} (1 - \lambda)^k x_{t-k}$, where $0 < \lambda \le 1$ is the timescale. When $\lambda = 1$, the EMA is the identity transformation, $\tilde{x}_t = x_t$; in contrast, many terms $x_{t-k}$ effectively contribute to $\tilde{x}_t$ when $\lambda < 1$ [35].
Company-Daily price change: $P = \frac{P_c - P_o}{P_o} \times 100\%$, where $P_c$ is the close price of today and $P_o$ is the open price of today [36].
TAIEX-Daily index change: $P = \frac{P_{cT} - P_{oT}}{P_{oT}} \times 100\%$, where $P_{cT}$ is the close index of today and $P_{oT}$ is the open index of today [36].

Multivariate adaptive regression splines
Friedman [46] proposed multivariate adaptive regression splines (MARS), a simple nonparametric regression algorithm. The main advantages of MARS are its capacity to capture the complicated data mappings and patterns of high-dimensional data, to produce simpler and more easily interpreted models, and to analyze the relative importance of features. Conceptually, MARS integrates piecewise linear regressions into a flexible model for solving nonlinear and complex problems. MARS establishes the final model in a two-stage procedure. First, in the forward stage, many spline basis functions are built; the features can be continuous, ordinal, or categorical. Second, the backward stage removes the redundant spline basis functions; it uses the generalized cross-validation (GCV) criterion [46] to evaluate the performance of model subsets and obtain the best subset, where a lower GCV value is better. The GCV is defined as Eq (1).
$$\mathrm{GCV}(M) = \frac{\frac{1}{N}\sum_{i=1}^{N}\left[y_i - f_M(x_i)\right]^2}{\left[1 - \frac{C(M)}{N}\right]^2} \tag{1}$$
where N is the number of data records, C(M) denotes the penalty cost of a model containing M basis functions, the numerator is the lack of fit of the M-basis-function model $f_M(x_i)$, the denominator is the penalty for model complexity C(M), and $y_i$ denotes the target outputs.
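As a minimal sketch of the GCV criterion, the following Python function computes the value from the target outputs and the model outputs. It assumes the standard form of Friedman's criterion (mean squared lack of fit divided by a squared complexity penalty); the function name and the choice to pass C(M) in as a ready-made number are illustrative assumptions, not part of the original study.

```python
def gcv(y_true, y_pred, penalty_cost):
    """Generalized cross-validation value for a fitted model.

    y_true: target outputs y_i
    y_pred: model outputs f_M(x_i)
    penalty_cost: C(M), supplied by the caller (assumed 0 <= C(M) < N)
    """
    n = len(y_true)
    if n == 0 or len(y_pred) != n:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    if not 0 <= penalty_cost < n:
        raise ValueError("penalty_cost must satisfy 0 <= C(M) < N")
    # Numerator: average squared lack of fit of the model.
    lack_of_fit = sum((y - f) ** 2 for y, f in zip(y_true, y_pred)) / n
    # Denominator: complexity penalty, which shrinks as C(M) grows.
    penalty = (1.0 - penalty_cost / n) ** 2
    return lack_of_fit / penalty
```

For a fixed lack of fit, a larger C(M) inflates the GCV value, which is how the backward stage penalizes subsets with many basis functions.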

Table 2. Collected features that affect the stock price.
TAIEX index: This study considers a market index related to the macroeconomy; the TAIEX index is an indicator for fundamental analysis.
Exchange Rate: Conversion rate of US to NT [37].
Prime / Base rate: The prime rate is an interest rate paid by a borrower (debtor) for the use of money borrowed from a lender [38]. It is related to the macroeconomy and indirectly affects the stock market.
The Final Best Ask Quote: For each transaction, the system discloses the quote for the lowest offer price [39]; the last transaction of each day is collected as the indicator.
The Final Best Bid Quote: For each transaction, the system discloses the quote for the highest bid price [39]; the last transaction of each day is collected as the indicator.
Price earnings ratio: Defined as the market price per share divided by the annual earnings per share [4,40].
PBR: Compares a company's current market price to its book value [41].
…, where D is the most recent full-year dividend and P is the current share price [42].
Return of Investment (ROI): … × 100%, where $V_f$ is the final value of an investment and $V_i$ is the initial value of an investment [43].
…, where $V_f$ is the final value of an investment and $V_i$ is the initial value of an investment [44].
Sale Month: This study considers that the monthly sales, the aggregate sales growth rate, the sales growth rate, and the rate of monthly sales compared with the previous month affect the company's stock price.
Aggregate Sales Growth Rate
Sales Growth Rate: $R_{month} = ([R_y / R_{y-1}] - 1) \times 100(\%)$, where $R_y$ is the monthly revenue in year y.
Rate compared sale monthly
Demand Savings Deposits: This study considers that the rate of demand savings deposits might be a factor in investment. When the rate is low, investors may be willing to take on investment risk [45].
https://doi.org/10.1371/journal.pone.0209922.t002

Genetic algorithm
The genetic algorithm (GA) [47] searches for the global optimum through a process inspired by natural evolution. GA has four operators (inheritance, mutation, selection, and crossover), which are applied repeatedly to evolve toward the optimal solution. GA has been applied successfully in the economic and financial domains [27,48]. For a specific problem, the GA encodes potential solutions into simple chromosome-like data structures and applies recombination operators to preserve critical information [3]. This paper follows the GA steps of Goldberg [49], reorganized as follows:
Step1: Generate an initial population randomly.
Step2: Evaluate the fitness of each chromosome.
Step3: Check the stop criterion.
Step4: Select suitable chromosomes from the parent population.
Step5: Apply crossover to search for a new solution by swapping corresponding segments of the parents' string representations.
Step6: Apply mutation to randomly change some of the chosen chromosomes.
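The six steps above can be sketched as a short Python loop. This is an illustrative toy under stated assumptions rather than the implementation used in the study: the function name `run_ga`, the default settings, the fitness-proportional selection, the fixed generation count used as the stop criterion, and the requirement that the fitness be non-negative are all choices made for the example.

```python
import random

def run_ga(fitness, n_bits, pop_size=20, generations=50,
           p_crossover=0.8, p_mutation=0.05, seed=0):
    """Maximize a non-negative fitness over bit strings of length n_bits."""
    if pop_size % 2 or n_bits < 2:
        raise ValueError("pop_size must be even and n_bits at least 2")
    rng = random.Random(seed)
    # Step 1: generate an initial population randomly.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    # Step 3: the stop criterion here is simply a fixed number of generations.
    for _ in range(generations):
        # Step 2: evaluate the fitness of each chromosome.
        scores = [fitness(chromosome) for chromosome in population]
        # Step 4: select parents with probability proportional to fitness.
        weights = [score + 1e-9 for score in scores]
        parents = rng.choices(population, weights=weights, k=pop_size)
        offspring = []
        for a, b in zip(parents[0::2], parents[1::2]):
            a, b = a[:], b[:]  # copy so the parents are left untouched
            # Step 5: one-point crossover swaps the tails of the two parents.
            if rng.random() < p_crossover:
                cut = rng.randint(1, n_bits - 1)
                a[cut:], b[cut:] = b[cut:], a[cut:]
            offspring.extend([a, b])
        # Step 6: mutation flips each bit with a small probability.
        for child in offspring:
            for i in range(n_bits):
                if rng.random() < p_mutation:
                    child[i] = 1 - child[i]
        population = offspring
        # Keep the best chromosome seen so far.
        best = max(population + [best], key=fitness)
    return best
```

For example, `run_ga(sum, n_bits=12)` searches for a bit string with as many ones as possible; in the forecasting setting, the fitness would instead be derived from the forecast error of the encoded SVR parameters.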

Support vector regression
SVR is a nonlinear kernel-based regression algorithm that tries to find a regression hyperplane with minimized risk in a high-dimensional space [16]. Whereas a traditional regression model estimates its coefficients by minimizing the squared loss, SVR uses the ε-insensitive loss function to obtain its parameters. It can be expressed as:
$$L_\varepsilon(t, f(x)) = \begin{cases} 0, & |t - f(x)| \le \varepsilon \\ |t - f(x)| - \varepsilon, & \text{otherwise} \end{cases} \tag{2}$$
where t is the desired (target) output, f(x) is the predicted value, and ε defines the region of ε-insensitivity. When the predicted value falls inside the band area, the loss is zero. Conversely, if the predicted value falls outside the band area, the loss is equal to the difference between the predicted value and the margin.
Considering empirical risk and structural risk, the SVR model can use slack variables to construct a quadratic minimization problem:
$$\min \; \frac{1}{2}\|v\|^2 + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^*\right) \tag{3}$$
$$\text{subject to } q_i - f(x_i) \le \varepsilon + \xi_i, \quad f(x_i) - q_i \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0$$
The symbols $\xi_i$ and $\xi_i^*$ are two positive slack variables that measure the error $(q_i - f(x_i))$ beyond the boundaries of the ε-insensitivity zone. $\sum(\xi_i + \xi_i^*)$ denotes the empirical risk, $\frac{1}{2}\|v\|^2$ is the structural risk, which prevents over-learning and a lack of generality, and C denotes the regularization constant that specifies the trade-off between the empirical risk and the regularization term. By sequentially adjusting the coefficient C, the band width ε, and the kernel function K, the optimal parameters can be solved by the Lagrange method [19]. Following Vapnik [18], this study uses the SVR-based regression function, defined as
$$f(x, \alpha_i, \alpha_i^*) = \sum_{i=1}^{n}\left(\alpha_i - \alpha_i^*\right) K(x, x_i) + b \tag{4}$$
where $\alpha_i$ and $\alpha_i^*$ are the Lagrangian multipliers that satisfy $\alpha_i \alpha_i^* = 0$ and $\alpha_i, \alpha_i^* \ge 0$, and b is the bias term. $K(x, x_i)$ is the kernel function, which represents the inner product $\langle \varphi(x_i), \varphi(x) \rangle$. The radial basis function (RBF) has been widely used as the kernel function, and this study uses the RBF because of its capabilities and simple implementation [50].
$$K(x, x_i) = \exp\left(-\gamma \|x - x_i\|^2\right) \tag{5}$$
where γ is the RBF width.
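The ε-insensitive loss, the RBF kernel, and the kernel expansion of the regression function described in this section can be sketched in Python as follows. The sketch assumes the kernel form exp(−γ‖x − x_i‖²); the function names and the arguments `dual_coefs` (standing for the differences of the Lagrangian multipliers) and `bias` are hypothetical inputs that would come from an already trained model, since no training step is shown here.

```python
import math

def epsilon_insensitive_loss(target, prediction, epsilon):
    """Zero inside the epsilon band, otherwise the distance to the band edge."""
    return max(0.0, abs(target - prediction) - epsilon)

def rbf_kernel(x, x_i, gamma):
    """RBF kernel exp(-gamma * ||x - x_i||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_i))
    return math.exp(-gamma * squared_distance)

def svr_predict(x, support_vectors, dual_coefs, bias, gamma):
    """Kernel expansion: bias plus the weighted sum of kernel values.

    dual_coefs[i] plays the role of (alpha_i - alpha_i^*) for support_vectors[i].
    """
    return bias + sum(coef * rbf_kernel(x, sv, gamma)
                      for sv, coef in zip(support_vectors, dual_coefs))
```

A forecast that misses the target by less than ε therefore costs nothing, which is why only the points on or outside the band end up with non-zero coefficients in the expansion.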

Stepwise regression
SR is a simple multiple regression method that builds a model by adding or removing features based on F-test statistics; that is, SR uses forward and backward procedures to add or remove features based on F statistics. SR adds a feature to the model if its p-value is less than the given significance level (p < .05) and removes a variable from the model if its p-value is greater than the given significance level [51].

Proposed model
Based on previous studies, this study has found some drawbacks in time-series forecasting: (1) Important technical indicators are selected based on subjective experience and opinions [23]. (2) Previous methods need to follow certain assumptions for different datasets and obey statistical distributions [3]. (3) Previous time-series models consider only one feature to forecast the stock index [23]. (4) The best SVR parameters are difficult to determine [24,25,26]. To overcome these problems, this study extends the methods proposed in our conference paper [52]. That is, this paper proposes a GA-SVR time-series model based on feature selection to forecast the stock prices of leading industries. Hence, the proposed model contributes the following. (1) […] 3. Forecasting: The optimized GA-SVR forecast model is used to forecast the stock price, and the proposed models are compared with the listed models in terms of accuracy.
To simplify computation, this section proposes an algorithm with six steps, described in detail as follows:
Step 1: Transform trading data into technical indicators
This step collects daily stock trading data (open, close, highest, and lowest prices, and volume) and transforms these data into technical indicators [31], such as MA, PSY, RSI, BIAS, and WMS%R. In addition, this paper also incorporates other indicators, such as the exchange rate (NT dollars to US dollars) and the momentum. The indicators used are listed in Tables 1 and 2.
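As a rough illustration of Step 1, the following Python functions turn a list of daily closing prices into a few of the indicators defined in Table 1 (MA, BIAS, PSY, momentum, and EMA). The function names are hypothetical. BIAS follows the wording of Table 1 (a plain difference from the moving average), and the EMA uses the recursive form seeded with the first observation, which is an assumption made here because the infinite sum in the definition has to be truncated for a finite series.

```python
def moving_average(closes, n):
    """Mean of the last n closing prices."""
    if len(closes) < n:
        raise ValueError("need at least n closing prices")
    return sum(closes[-n:]) / n

def bias(closes, n):
    """Difference between the latest close and its n-day moving average."""
    return closes[-1] - moving_average(closes, n)

def psy(closes, n=12):
    """Psychological line: share of rising days within the last n days, in percent."""
    if len(closes) < n + 1:
        raise ValueError("need n + 1 closing prices to count n daily changes")
    window = closes[-(n + 1):]
    days_up = sum(1 for prev, cur in zip(window, window[1:]) if cur > prev)
    return days_up / n * 100

def momentum(closes, n):
    """price(t) - price(t - n) for the latest day t."""
    if len(closes) < n + 1:
        raise ValueError("need at least n + 1 closing prices")
    return closes[-1] - closes[-1 - n]

def ema(closes, lam):
    """Recursive exponential moving average with 0 < lam <= 1."""
    if not 0 < lam <= 1:
        raise ValueError("lam must satisfy 0 < lam <= 1")
    smoothed = closes[0]  # seed with the first observation (assumption)
    for price in closes[1:]:
        smoothed = lam * price + (1 - lam) * smoothed
    return smoothed
```

With `lam = 1` the EMA returns the latest close unchanged, matching the identity case noted in Table 1; smaller values give more weight to older prices.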
Step 2: Select key features (MARS, SR, and KRR)
In Step 1, the collected data were transformed into technical indicators; this step uses MARS, SR, and KRR to choose the key indicators. To remove collinearity, this step also runs the SR multi-collinearity check to eliminate indicators with high multi-collinearity. To compare the three feature selection methods, this study keeps the number of selected features as similar as possible.
Step3: Construct the SVR forecast model
To build the SVR forecast model, this step uses the selected features as input features and the RBF as the kernel function, because it can handle nonlinear and high-dimensional data. Three parameters should be set to build the forecasting model: the loss-function parameter ε, the regularization constant C, and the RBF width σ. To obtain a better forecast model, this step uses a genetic algorithm to optimize these parameters.
Step4: Optimize the SVR parameters by GA at minimal RMSE
To obtain better forecasting accuracy, this step employs the genetic algorithm to optimize the SVR parameters C and σ under minimal RMSE (Eq 6) for the training dataset. The RMSE is defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2} \tag{6}$$
where $y_t$ is the real stock price, $\hat{y}_t$ is the forecasted stock price, and n is the number of records.
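The RMSE criterion can be written as a short Python function; the name `rmse` and the input checks are illustrative choices for this sketch.

```python
import math

def rmse(actual, forecast):
    """Root mean squared error between real and forecasted prices."""
    if not actual or len(actual) != len(forecast):
        raise ValueError("inputs must be non-empty and equal in length")
    squared_errors = ((y - f) ** 2 for y, f in zip(actual, forecast))
    return math.sqrt(sum(squared_errors) / len(actual))
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than several small ones, so the GA is pushed toward parameters that avoid large forecast errors.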
Step 4 has six sub-steps, which describe the operation of the GA process as follows:
Step4.1.1: Initialize the parameters for GA
The initial population is set to 80 individual solutions, which are randomly generated in this sub-step; the SVR parameters are encoded into a chromosome as a binary string. In addition, the maximal number of generations, the crossover probability, and the mutation probability are set to 2000, 0.8, and 0.08 [53], respectively.
Step4.1.2: Evaluate the fitness
This sub-step uses a pre-defined fitness function (RMSE) to evaluate the fitness of each chromosome and determine the goodness of fit of each solution.
Step4.1.3: Check the stopping criterion
This sub-step sets the stopping rule: the GA process stops if either of the following two conditions is met:
1. The maximal number of generations (2000) is reached.
2. The RMSE of the best solution is smaller than the given minimal RMSE, which is set to $10^{-5}$.
If the criterion is not met, a new iteration is run (Step 4.1.2 to 4.1.5).

Step4.1.4: Select the parents by the fitness function
Selection copies the fitter chromosomes so that they contribute more offspring and eliminates the poorer chromosomes so that they contribute fewer. Roulette wheel selection is employed to select the chromosomes for reproduction.

Step4.1.5: Perform crossover and mutation
The parents are recombined to produce offspring, which are then mutated; one-point crossover is used in this step. Mutation then randomly selects a member of the population and changes one randomly selected bit in its bit-string representation [3].
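The problem-specific parts of these sub-steps can be sketched in Python as follows. The GA settings are those stated in Step 4.1.1 and Step 4.1.3; everything else is an assumption made for illustration: the function names, the search ranges for C and σ, the split of the chromosome into two equal halves, and the use of 1/RMSE as the roulette-wheel weight (needed because a lower RMSE must receive a larger slice of the wheel).

```python
import random

POPULATION_SIZE = 80
MAX_GENERATIONS = 2000
CROSSOVER_PROBABILITY = 0.8
MUTATION_PROBABILITY = 0.08
MIN_RMSE = 1e-5

def decode(bits, low, high):
    """Map a list of 0/1 bits linearly onto the interval [low, high]."""
    as_int = int("".join(str(bit) for bit in bits), 2)
    return low + (high - low) * as_int / (2 ** len(bits) - 1)

def decode_chromosome(chromosome, c_range=(0.1, 1000.0), sigma_range=(0.001, 10.0)):
    """First half encodes C, second half encodes sigma (hypothetical ranges)."""
    half = len(chromosome) // 2
    return (decode(chromosome[:half], *c_range),
            decode(chromosome[half:], *sigma_range))

def should_stop(generation, best_rmse):
    """Stop at the generation limit or once the best RMSE is small enough."""
    return generation >= MAX_GENERATIONS or best_rmse < MIN_RMSE

def roulette_weights(rmse_values):
    """Lower RMSE gets a larger slice of the wheel (assumed 1/RMSE weighting)."""
    return [1.0 / (value + 1e-12) for value in rmse_values]

def select_parents(population, rmse_values, rng=None):
    """Roulette wheel selection; returns copies so later mutation is safe."""
    rng = rng or random.Random()
    chosen = rng.choices(population, weights=roulette_weights(rmse_values),
                         k=len(population))
    return [member[:] for member in chosen]

def mutate_one_bit(population, rng=None):
    """Flip one random bit of one randomly chosen member, in place."""
    rng = rng or random.Random()
    member = rng.choice(population)
    position = rng.randrange(len(member))
    member[position] = 1 - member[position]
```

In a full run, each chromosome would be decoded, an SVR would be trained with the decoded parameters, and its training RMSE would be passed to `select_parents` and `should_stop`.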
Step5: Forecast the stock price by using the optimized models
From Step 4, the SVR parameters were determined. The testing data are then applied to predict the next day's stock prices by using the optimized forecast model.
Step6: Compare the performance
To evaluate the forecast accuracy, the proposed model is compared with the listed models under the RMSE criterion. The seven comparison models are as follows: (1) […] Table 3.

Experiment and comparisons
This study uses Taiwanese stocks as experimental datasets; the selected companies are leaders in different industries, taken from the list of the 1000 largest companies in Mainland China, Taiwan, and Hong Kong published by Business Today (www.businesstoday.com.tw). The experimental datasets, which include Chunghwa Telecom (CHT), China Steel, Hon Hai, Cathay Financial Holdings, and Taiwan Semiconductor Manufacturing Company (TSMC), were collected from 2003 to 2012. To compare the accuracy of forecasts over a long test period and a short test period, this study implements two experiments for each dataset; the two experimental designs are listed in Table 4. First, this study conducts an initial experiment to explore the performance of the GA-SVR model. The forecasting performance of GA-SVR is compared with that of SR, KRR, and MARS, and the results are shown in Table 5. From Table 5, we can see that the GA-SVR model generates the smallest RMSE for the CHT, China Steel, and Hon Hai datasets. Therefore, this study combines feature selection with the GA-SVR model as the proposed forecasting model. Second, this study sets the parameters of the different forecasting models for the following experiment. For the MARS model, the training data are used to build the model, the maximal number of basis functions (BFs) is set to 2000, and the other parameters are left at their defaults [54]. For the KRR model, the parameter lambda for the Tikhonov regularization of kernel ridge regression is set to 0.001 to build the forecasting model. For the SR model, the training data are employed to build the forecasting model, and features whose variance inflation factor (VIF) is higher than 10 are removed first.
Based on the initial experimental result (the GA-SVR model performs better than the other models), this paper combines the GA-SVR model with a feature selection method as the proposed model. This study then proposes model A and model B, which are based on different feature selection methods. Model A uses MARS to select features, and model B uses stepwise regression as the feature selection method.
For comparison, this study selects the same number of features for the different feature selection methods. The features selected by MARS, SR, and KRR are listed in Table 6. After finding the key features, this study constructs the forecast model by SVR and optimizes the parameters of the MARS-GA-SVR and SR-GA-SVR models by GA; the optimized parameters are listed in Table 7.

Experimental results
In this section, this study verifies the performance of the proposed model by using five different industry datasets, including the Chunghwa Telecom (CHT) datasets, China Steel datasets, Hon […] Fig 2). The RMSE values for the listed models and the proposed models are shown in Table 8, and the results show that the proposed models perform better than the other models. In the short test period, the proposed model B (SR-GA-SVR) generates the smallest RMSE among the listed models (Table 8). Moreover, the proposed model A (MARS-GA-SVR) generates the smallest RMSE in the long test period (Table 8). […] From Table 9, the proposed models generate the smallest RMSE among the listed models. In the short test period, the proposed model B generates the smallest RMSE. In addition, the proposed model A generates the smallest RMSE in the long test period.
Experiments on the Hon Hai datasets are presented in Table 10 and Fig 4. Fig 4 shows the excellent performance of the proposed model (the forecast line almost completely overlaps the closing price line). From Table 10, model A and model B generate the smallest RMSE in the long test period and the short test period, respectively. Table 11 and Fig 5 show the experiments for the Cathay Financial Holdings datasets. In Fig 5, the numerical results clearly show that the forecast line deviates from the closing price line. From Table 11, the KRR-GA-SVR model generates the smallest RMSE in the short test period, and the proposed model B generates a smaller RMSE than the other models in the long test period.
Next, experiments on the TSMC datasets are illustrated in Table 12 and Fig 6. The forecast results in Fig 6 show

Significance test
To test whether the proposed model is superior to the KRR-MARS, KRR-SR, KRR-GA-SVR, MARS-SR, MARS-KRR, SR-MARS, and SR-KRR models in stock price forecasting, this study applies the Wilcoxon signed-rank test. We use RMSE to test the significance of the difference between the proposed models and the listed models. Tables 13 and 14 present the Z statistics of the two-tailed Wilcoxon signed-rank test between the proposed models and the listed models.
From Table 13, the proposed models differ significantly (p < 0.05) from the other models in the short test period, except for KRR-GA-SVR. Therefore, we can conclude that the proposed models are significantly better than the KRR-MARS, KRR-SR, MARS-SR, MARS-KRR, SR-MARS, and SR-KRR models. However, Table 13 shows no significant difference between proposed models A and B in the short test period. In the long test period, the proposed models differ significantly from the other models at the 0.05 significance level, as shown in Table 14. Therefore, we can conclude that the proposed models are significantly better than the listed models.
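For illustration, the Z statistic of a two-tailed Wilcoxon signed-rank test on paired error values can be computed with the normal approximation as sketched below. The function name is hypothetical, and the handling of zeros and ties (zero differences dropped, tied absolute differences given average ranks, no tie or continuity correction) is an assumption; the source does not state which software or corrections were used, and the approximation is rough for very small samples.

```python
import math

def wilcoxon_signed_rank_z(errors_a, errors_b):
    """Return (z, two-tailed p) for paired samples via the normal approximation."""
    diffs = [a - b for a, b in zip(errors_a, errors_b) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    # Sum of the ranks that belong to positive differences.
    w_plus = sum(rank for rank, diff in zip(ranks, diffs) if diff > 0)
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-tailed
    return z, p_value
```

A negative Z here means the first model tends to have the smaller errors; the difference is called significant at the 0.05 level when the returned p-value is below 0.05.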

Findings
Based on the experimental results, this study can summarize the findings as follows.
(1) Datasets quality. From Table 15, we find that, of the five datasets with different fluctuations, the Hon Hai stock price has the highest fluctuation range. Despite this, the proposed models generate smaller RMSE than the listed models in both the short and long test periods, as shown in Tables 16 and 17. Further, the China Steel stock price has the smallest fluctuation among the five datasets, and the proposed models still achieve better performance in both the short and long test periods, as shown in Tables […] proposed models in the short test period is better than in the long test periods. From Fig 7, the stock indexes change dramatically in the long test periods, and the proposed models perform better under larger price fluctuations. In the short test period, the results show that the proposed model B generates the smallest RMSE for the TSMC, Hon Hai, China Steel, and CHT datasets, but not for the Cathay dataset (Table 16), because the fluctuation of the Cathay price dataset in the short period is smaller than that of the other datasets (Fig 7). Therefore, we conclude that the proposed model B (SR-GA-SVR) performs better than the listed models, especially under larger price fluctuations; that is, the features selected by SR can effectively enhance the accuracy in the short test period. Similarly, in the long test period, Table 17 shows that the proposed model A (MARS-GA-SVR) has the smallest RMSE for the Hon Hai, China Steel, and CHT datasets. Therefore, we conclude that MARS can also select better features for the proposed model when the stock price range changes dramatically.
(3) Selected feature. For the features selected by MARS (Table 18), the feature "The Final Best Bid Quote" was chosen four times in five datasets, and the forecasting results of proposed model A are better than those of proposed model B in the long test period. For these reasons, we can confirm that the feature "The Final Best Bid Quote" influences stock price forecasting in the long test period. For the features selected by SR (Table 18), the features CDP, MO1, and MO2 are selected three times in five datasets. In addition, the proposed SR-GA-SVR shows precise accuracy in the short test period. Therefore, we find that CDP, MO1, and MO2 have a great impact on forecasting stock prices in the short test period.
(4) Investor suggestion. After verifying the proposed models, this study can provide the following suggestions to investors as references:
• From Tables 16 and 17, short test period forecasting is recommended for stock investment, because it is more accurate than long test period forecasting.
• For short period forecasting, we suggest using the proposed model B because it is more accurate than proposed model A (see Table 16). Regarding the key features shown in Table 18, we suggest that investors consider the three key features CDP, MO1, and MO2.
• For long period forecasting, Table 17 shows that the proposed model A is recommended because it is more accurate than proposed model B in the long test period. From Table 18, the feature "The Final Best Bid Quote" should be considered as an input variable in long period forecasting.

Conclusion
This study has proposed a new time-series model, which incorporates multiple factors and reasonably selected key features into the GA-SVR model. The results show that the proposed models can improve forecasting accuracy. Furthermore, the proposed models outperform the listed models in RMSE for the Chunghwa Telecom, China Steel, Hon Hai, Cathay Financial Holdings, and Taiwan Semiconductor Manufacturing Company datasets. In addition, from the findings and discussions, the proposed SR-GA-SVR outperforms the listed models in the short test period, except for Cathay Financial Holdings. Moreover, in the long test period, the MARS-GA-SVR also performs better. We find that the proposed model B performs better than the listed models in almost all cases, especially under larger price fluctuations; that is, the proposed model fits datasets with larger price fluctuations better. Finally, the research results can provide some suggestions to investors as references.
In future work, several issues from this study can be extended as follows:
1. Consider other features to train the model, such as company news or government policies.
2. Apply the model to different application fields, such as electric loads and environmental pollution forecasting.
3. Employ other methods to improve the proposed model, such as feature lags.