Action-specialized expert ensemble trading system with extended discrete action space using deep reinforcement learning

Despite active research on trading systems based on reinforcement learning, the development and performance of research methods require improvements. This study proposes a new action-specialized expert ensemble method consisting of action-specialized expert models designed specifically for each reinforcement learning action: buy, hold, and sell. Models are constructed by examining and defining different reward values that correlate with each action under specific conditions, and investment behavior is reflected with each expert model. To verify the performance of this technique, profits of the proposed system are compared to those of single trading and common ensemble systems. To verify robustness and account for the extension of discrete action space, we compared and analyzed changes in profits of the three actions to our model’s results. Furthermore, we checked for sensitivity with three different reward functions: profit, Sharpe ratio, and Sortino ratio. All experiments were conducted with S&P500, Hang Seng Index, and Eurostoxx50 data. The model was 39.1% and 21.6% more efficient than single and common ensemble models, respectively. Considering the extended discrete action space, the 3-action space was extended to 11- and 21-action spaces, and the cumulative returns increased by 427.2% and 856.7%, respectively. Results on reward functions indicated that our models are well trained; results of the Sharpe and Sortino ratios were better than the implementation of profit only, as in the single-model cases. The Sortino ratio was slightly better than the Sharpe ratio.


Introduction
Recently, trading systems based on machine learning have been actively studied in the financial field [1][2][3][4][5][6][7]. With sufficient data, a machine can efficiently learn patterns, including previously unknown ones [8]. This ability can be exploited for trading systems and is consequently an active topic of machine learning research in finance. Machine learning can process vast amounts of data quickly, and an objective, data-driven judgment can support important financial transaction decisions. Machine learning is largely divided into supervised, unsupervised, and reinforcement learning [9]. In the financial field, the supervised learning method extracts important features from labeled data and uses them in classification and prediction models. Many studies have been conducted, ranging from those based on statistical learning theory to those using state-of-the-art machine learning algorithms such as Support Vector Machine (SVM), Random Forest (RF), and Deep Neural Network (DNN) [10][11][12][13]. The unsupervised learning method, which uses unlabeled data, is mainly used for clustering and for finding patterns in the data with dimension-reduction techniques such as the auto-encoder. One representative study is the Deep Portfolio Theory [14]. In reinforcement learning (RL), which is mainly used in trading systems research, a model-free method takes market conditions as the state and is trained using a reward function. A representative study by Moody and Saffell [15], which led to numerous subsequent studies, examined the optimal portfolio, asset allocation, and trading system using Recurrent Reinforcement Learning (RRL).
Although trading systems research based on RL is actively conducted, there are many challenges, such as difficulties in analyzing and training, which arise from insufficient data or excessive noise [4]. Additionally, RL itself is difficult to train. To improve performance, the ensemble method is one of the most widely used machine learning methods [16]. However, because applying the ensemble method to RL is more difficult than the general machine learning algorithm, it is yet to be applied in automated financial trading systems research. Therefore, we posit that if an ensemble technique specialized for RL is applied to a trading system, the performance of the trading system will improve.
We propose an action-specialized expert ensemble trading system, a novel ensemble method designed specifically for RL that can reflect investment propensity. This ensemble system consists of action-specialized expert models, each specialized for one of the actions used in RL trading systems by applying different reward values under specific conditions. Actions of trading systems typically include buying, holding, and selling; we designed an expert single model corresponding to each action to reflect real investment behavior [2,7,15]. To create an expert single model, the reward values for the expert action are controlled. In the common ensemble method, single models are trained on the same data set with different models or on different data sets with the same model; in other words, the diversity needed for an ensemble is obtained by varying the model or the data. Unlike the common ensemble method, this study creates action-specialized expert models developed specifically for buying, holding, and selling actions, and then combines these expert models in an ensemble. Our proposed ensemble method is expected to improve performance and reflect the characteristics of RL for trading systems. As the ensemble method, we used soft voting with the softmax function, which is more effective than hard voting [16].
To verify our proposed method and check its robustness, we extend the action space by discretization, so that the model itself determines how many shares of a stock to buy or sell. Previous studies have either considered only three actions or moved to a continuous action space [1][2][3][4][5][6][7][15]. It is well known that as the number of outputs of the model network increases, learning becomes more difficult [17]. In a previous study, however, discretizing action spaces yielded better performance than applying continuous action spaces [18]. Thus, in this study, we extend the number of actions from 3 to 11 and 21, which raises the maximum quantity per buy or sell action to 5 and 10 shares, respectively. Moreover, we expect the network to be able to recognize market risk and control the quantity by itself. One of the purposes of our research is to create a more profitable automated trading system that invests more when data-driven patterns are clearer, much as real investors invest more boldly when the information they receive from the market is clearer. Therefore, our model is designed to learn various patterns from data and vary its actions to increase profit according to the magnitude of the reward value we designed.
Existing 3-action models cannot represent the diversity of actions that the reward values learned from the data would support. For example, we cannot determine whether a buy signal in the 3-action model is strong or weak. In contrast, our proposed system with more discrete actions is significantly more profitable than the 3-action system because it can buy or sell more, depending on the market situation. More specifically, the trading model with 21 actions can ideally increase profits by up to 10 times, since it can trade larger quantities (up to 10 times) for stronger signals than for weaker ones.
If the extended action space model can capture how pronounced the patterns in the dynamic market data are, it can decide the investment quantity by itself, depending on the level of information captured. Because extending the action space lets the reward values give the model a graded signal, we expect that our model can analyze more detailed market information, which includes both the direction and the magnitude of market movements. If the proposed model is confident in the market condition, it will invest more in the market. If the model is less confident, it will adjust the quantity to take a relatively small risk, resulting in a small loss or a small profit. Our experimental results support this. Most of the RL-based trading system studies we surveyed use 3-action spaces, and our research is meaningful as a pioneering study that attempts this extension. As expected, our results indicate that Deep Reinforcement Learning (DRL) can learn not only three actions, but also various other actions, depending on the strength of the network signals.
Further, we used three types of reward functions: profit, and the Sharpe and Sortino ratios, to examine the sensitivity of our proposed ensemble model. Generally, profit is a frequently used reward function in RL for trading systems research [2,7,15]. Since the Sharpe and Sortino ratios are calculated using profits and volatilities, they are suitable reward functions to train networks for RL [3,4]. Thus, to compare the performance of reward functions, we consider not only profit, but also volatility.
Our experiment employed three extensively used data sets (S&P500, Hang Seng Index (HSI), and Eurostoxx50) that exhibit different price movements over the period from January 1987 to December 2017 [2,6,7,15]. We divided each data set into a training period of 20 years and a test period of 11 years. Our basic model is based on the Deep Q-Network (DQN), and we employ online learning on the test data set. The DQN is well known for combining the Q-learning algorithm with a DNN [19], and online learning is a method in which the model is updated as data become available in sequential order, here on the test data set [20].
The remainder of this paper is organized as follows. Section 2 describes the related research. Section 3 discusses the related methodologies of DQN, reward functions, and our proposed method. Section 4 analyzes our data sets in various ways. Section 5 describes the experiments of our proposed model and explains our methodology. Section 6 reveals experimental results and conducts detailed analyses. Section 7 concludes and suggests future applications.

Related work
In the financial field, there are many recent studies that employ machine learning for forecasting, classification, dimension reduction, and trading. In this section, we describe the relevant literature.

Supervised learning in the financial field
Most financial studies using supervised learning attempted to predict price fluctuations or trends. Trafalis and Ince [10] predicted stock prices using SVM based on statistical learning theory and compared it to Radial Basis Function (RBF). Huang, Nakamori, and Wang [21] also performed Nikkei225 index prediction using SVM, and compared Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Elman Backpropagation Neural Networks for performance evaluation. Tsai and Wang [22] studied a stock price prediction model that combines an Artificial Neural Network (ANN) and a decision tree to improve on the performance of a single model. Patel, Shah, Thakkar, and Kotecha [11] conducted two studies. First, they used four forecasting models (ANN, SVM, RF, and Naïve Bayes) to predict stock market index prices and trends. Second, they proposed a prediction model combining ANN, RF, and Support Vector Regression (SVR) and reported better performance than the single model. Recent research has proposed a model that forecasts financial market crises using DNN and the boosting method [23], as well as ModAugNet, which adds two modules to prevent overfitting and improve prediction performance [13].

Unsupervised learning in the financial field
Most financial research that employed unsupervised learning was conducted in the direction of dimension reduction using an auto-encoder. As a representative study, a deep portfolio theory composed of four steps (auto-encoder, calibrating, validating, and verifying) was developed by Heaton, Polson, and Witte [14]. Chong, Han, and Park [24] compared reconstruction error, stock price fluctuation, and prediction using Principal Component Analysis (PCA), auto-encoder, and Restricted Boltzmann Machine (RBM). Bao, Yue, and Rao [25] conducted price prediction using a model combining Wavelet Transform, Stacked Auto-Encoder (SAE), and Long Short-Term Memory (LSTM). First, the wavelet transform is applied to the time series data to remove noise, and high-level features of the data are extracted by SAE. Subsequently, the processed and transformed data are used for stock price prediction using LSTM.

Reinforcement learning in the financial field
Finance-related research using RL has been conducted mainly to improve the performance of trading algorithms. Moody and Saffell [15] conducted a study on the optimal portfolio, asset allocation, and trading system using RRL, which became the basis for a significant amount of research. They compared various methods using profit, Differential Sharpe Ratio (DSR), and Downside Deviation Ratio (DDR) as reward functions and indicated that RRL is better than Q-learning. Since then, many researchers have used RRL and DSR: Almahdi and Yang [3] proposed an optimal variable weight portfolio allocation model using RRL and Expected Maximum Drawdown (EMD) to solve the dynamic asset allocation problem; Deng, Bao, Kong, Ren, and Dai [4] improved performance by combining RRL with a fuzzy DNN model that analyzes the market; Huang [5] used a small replay memory, added a feedback signal, and sampled long sequences to improve on existing research with the Deep Recurrent Q-Network (DRQN). Wang et al. [2] studied a trading system using DQN and compared its performance to that of the RRL strategy. This study became the basis of our research to compare and improve performance. In addition, another study applied the existing DQN and added three ideas that extend and modify the network. First, a DNN determines the number of shares to trade. Second, the decision is postponed when the market situation is confusing. Lastly, transfer learning is used to compensate for the lack of data [7]. We summarize trading system studies using RL in Table 1 and compare our experiments with those of other papers.
Booth, Gerding, and McGroarty [27] proposed a trading system based on weighted ensembles of RF that is specialized in seasonality effects and improves profitability and prediction accuracy. Giacomel, Galante, and Pereira [28] proposed an ensemble network that approached stock price forecasting as a rising or falling classification problem and simulated it by applying it to North American and Brazilian stock markets. Yang, Gong, and Yang [29] proposed a DNN ensemble model that predicts the Shanghai index and the Shenzhen Stock Exchange index using a bagging method. Weng, Lu, Wang, Megahed, and Martinez [30] proposed a model that predicts short-term stock prices using four ensemble methods: neural network regression ensemble, SVR ensemble, boost regression ensemble, and RF regression.

Continuous and discrete action space in reinforcement learning
There are also studies on continuous and discrete action spaces in RL. However, most of these are related to games or robotics, with only a few related to finance. In general, the action space of RL in most environments is continuous; therefore, applying a discrete action space is considered inappropriate [7,31,32]. This is reflected in research by Google DeepMind [33] and the OpenAI team [34]. However, in some real case studies, discretizing action spaces has been shown to be more effective than applying continuous action spaces [35]; that is, discretization of actions can improve performance. According to recent studies by OpenAI, this may be because a discrete probability distribution is more expressive than a multivariate Gaussian or because discretization of actions makes the learning of a favorable advantage function potentially easier [18]. Based on this evidence, we believe that extending the discrete action space in this study could be a more efficient approach for the asset allocation problem than treating it as a continuous action space problem. In addition, this study can be extended to solve the asset allocation problem in a continuous action space with transfer learning by first learning it as a discrete action space problem.

DQN (Deep Q-network)
Unlike supervised and unsupervised learning from static environment data, RL is a methodology wherein an agent directly explores environment data, confirms correlated rewards, and establishes policies for optimal action. The objective of RL is to find an optimal policy that maximizes the expected sum of discounted future rewards [36]. Finding the optimal policy starts with choosing the optimal value for each action, which is called the optimal Q-value. RL generally solves problems that can be defined as a Markov Decision Process (MDP). The elements of RL are represented by $(S, A, P, R, \gamma)$, where $S$ is a finite set of states, $A$ is a finite set of actions, $P$ is a state transition probability matrix, $R$ is a reward function, and $\gamma$ is a discount factor. The process of RL is described in Fig 1: the agent observes the state $s_t$ from the environment at time $t$ and selects action $a_t$ [19]. Subsequently, as a result of this action, the agent receives reward $r_t$ from the environment and obtains the next state $s_{t+1}$, which is changed by the action $a_t$. If the reward is determined by both the state and action, we can define the action value function $Q^{\pi}(s_t, a_t)$ as follows: From this action value function, we can represent the optimal action-value function $Q^{*}(s_t, a_t)$, maximizing the future reward amount as indicated in Eq (1). The optimal action $a^{*}(s_t)$ can be obtained from Eq (3) as follows.
Finally, the optimal action value function can be represented by the Bellman equation in Eq (4) [37].
\[ Q^{*}(s_t, a_t) = \mathbb{E}\left[ r_t + \gamma \max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1}) \,\middle|\, s_t = s,\ a_t = a \right] \tag{4} \]
The basic idea of the RL algorithm is to estimate the action value function by repeatedly calculating and updating the aforementioned Bellman equation. If the iteration is infinite, the action value function converges, and the result is the optimal action value function. The Q-network is composed of a neural network to find the Q-value. The DQN is a deeply structured network that learns by minimizing the loss function, shown by Eq (5).
where $w_t$ is the weight of the network. This is a neural network that approximates the Q-function, and it is trained in the supervised approach. Hence, it needs a label, and RL uses the target Q-value as the label, i.e., $y_t$ in Eq (6). To obtain the target Q-value, we need the target Q-network, whose weights are held fixed and refreshed only every few steps. Thus, the DQN trains to minimize the loss until the next refresh, repeating this process until convergence. In this study, we attempt to construct a trading system in three index environments by the DQN method and establish an extended discrete action space and action-specialized expert ensemble method.
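For reference, the quantities named in this section are commonly written as below in the standard DQN formulation. This is a sketch using the usual textbook definitions, not a reproduction of the numbered equations referred to in the text: the target-network weights $w^{-}$ and the exact form of each expression are assumptions and may differ in detail from Eqs (1) to (3), (5), and (6).

```latex
\begin{align*}
% Action value function under policy \pi (expected discounted return)
Q^{\pi}(s_t, a_t) &= \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\Big|\, s_t, a_t, \pi\Big] \\
% Optimal action value function
Q^{*}(s_t, a_t) &= \max_{\pi} Q^{\pi}(s_t, a_t) \\
% Optimal (greedy) action
a^{*}(s_t) &= \arg\max_{a_t} Q^{*}(s_t, a_t) \\
% DQN loss with network weights w_t
L_t(w_t) &= \mathbb{E}\Big[\big(y_t - Q(s_t, a_t; w_t)\big)^{2}\Big] \\
% Target Q-value computed with the periodically fixed target weights w^{-}
y_t &= r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; w^{-})
\end{align*}
```

In this form, the loss compares the current estimate $Q(s_t, a_t; w_t)$ with a label $y_t$ that is built from the fixed target network, which matches the supervised-style training described above.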

Reward functions
The reward function is a guide for model-free RL. The network of RL updates through the value of the reward function. We describe our reward functions as follows.

Profit function
The most common reward function is the profit function, which has been used in previous studies [2]. This function is outlined as follows:
\[ r_{t\_profit} = \left(1 + a_t \, \frac{p_t - p_{t-1}}{p_{t-1}}\right) \frac{p_{t-1}}{p_{t-n}} \tag{7} \]
where $r_{t\_profit}$ is the profit reward function at time $t$, $a_t$ is the action selected at time $t$ by the agent, and $p_t$ is the closing price at time $t$. Eq (7) is an appropriate function for RL because it represents long-term returns over $n$ periods and is less volatile than daily returns. This equation consists of one-day and long-term gross returns. Therefore, when $a_t = 1$, $r_{t\_profit}$ equals $1 + r_n = (1 + r)\,\frac{p_{t-1}}{p_{t-n}}$, where $r$ is the one-day return and $r_n$ is the $n$-day return. In this equation, we assume that the action for sale is -1, the action for hold is 0, and the action for purchase is 1. In the experiment, however, we assume that the action for sale is 0, the action for hold is 1, and the action for purchase is 2 because the network outputs only non-negative action indices. Using long-term returns is useful because they can be considered as long-term stock trading or investments.
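The profit reward described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names `profit_reward` and `index_to_action` are hypothetical, the reward is taken to be the product of the one-day gross return (signed by the action) and the gross return from day t-n to day t-1, and the price list is made up.

```python
def profit_reward(prices, t, action, n):
    """Profit reward at time t for action -1 (sell), 0 (hold) or 1 (buy).

    Assumed form: (1 + action * one-day return) * p[t-1] / p[t-n].
    """
    if action not in (-1, 0, 1):
        raise ValueError("action must be -1 (sell), 0 (hold) or 1 (buy)")
    if n < 1 or t - n < 0 or t >= len(prices):
        raise ValueError("need 1 <= n <= t < len(prices)")
    one_day_return = (prices[t] - prices[t - 1]) / prices[t - 1]
    long_term_gross = prices[t - 1] / prices[t - n]
    return (1.0 + action * one_day_return) * long_term_gross


def index_to_action(index):
    """Map the network output index (0 sell, 1 hold, 2 buy) to -1, 0, 1."""
    return index - 1


# Hypothetical closing prices, for illustration only.
prices = [100.0, 102.0, 101.0, 104.0, 108.0]
reward_buy = profit_reward(prices, t=4, action=index_to_action(2), n=3)
reward_hold = profit_reward(prices, t=4, action=index_to_action(1), n=3)
reward_sell = profit_reward(prices, t=4, action=index_to_action(0), n=3)
```

With a buy action the two factors telescope to p[t] / p[t-n], the gross return over the whole window, which is the identity stated in the text for $a_t = 1$.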

Sharpe and Sortino ratios
Another representative reward function is the Sharpe ratio [38], which can reflect profit and volatility. The Sortino ratio is similar to the Sharpe ratio, and their equations are as follows:
\[ r_{t\_sharpe} = \frac{\mathrm{Average}\left(\sum_{i=1}^{I} R_i\right)}{\mathrm{StandardDeviation}\left(\sum_{i=1}^{I} R_i\right)} \tag{8} \]
\[ r_{t\_sortino} = \frac{\mathrm{Average}\left(\sum_{i=1}^{I} R_i\right)}{\mathrm{StandardDeviation}_{below}\left(\sum_{i=1}^{I} R_i\right)} \tag{9} \]
where $r_{t\_sharpe}$ is the Sharpe ratio reward function at time $t$, $R_i$ is the daily return scaled by the number of shares traded under action $a_t$, $\mathrm{Average}\left(\sum_{i=1}^{I} R_i\right)$ is the average return over period $I$, and $\mathrm{StandardDeviation}\left(\sum_{i=1}^{I} R_i\right)$ is the standard deviation of daily returns over period $I$. $I$ is the window size for calculating the average and standard deviation of returns. $r_{t\_sortino}$ is the Sortino ratio reward function at time $t$, and $\mathrm{StandardDeviation}_{below}\left(\sum_{i=1}^{I} R_i\right)$ is the standard deviation of the daily returns below zero over period $I$. The Sortino ratio considers only the volatility of losses because volatility under profit conditions is not important. In modern portfolio theory, high Sharpe and Sortino ratios indicate high profit without large fluctuations. Because of these characteristics, the two ratios are appropriate reward functions for RL. We assume that the risk-free rate is 0 (i.e., $r_f = 0$), and this assumption makes the Sharpe and Sortino ratios invariant to leverage; hence, the ratios remain the same regardless of the amount of investment. For instance, consider two profits of 5% and 3%. The average is 4% and the standard deviation is about 1.4%. Therefore, the Sharpe ratio is approximately 2.83. To illustrate leverage, that is, scaling up the investment, suppose five shares are traded, so that the profits are 25% and 15%, respectively. The average is 20%, and the standard deviation is approximately 7.1%. The Sharpe ratio is again approximately 2.83, the same as before, indicating that the assumption makes the Sharpe ratio invariant to leverage.
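The leverage example above can be checked numerically. The sketch below assumes a zero risk-free rate as in the text and uses the sample standard deviation (which reproduces the 1.4% figure); the function names and the second return series are hypothetical, and the Sortino variant simply takes the standard deviation of the negative returns, which is one possible reading of "standard deviation of the daily return below zero".

```python
import math
import statistics


def sharpe_ratio(returns):
    """Average return divided by the sample standard deviation (r_f = 0)."""
    return statistics.mean(returns) / statistics.stdev(returns)


def sortino_ratio(returns):
    """Average return divided by the sample standard deviation of losses."""
    downside = [r for r in returns if r < 0]
    if len(downside) < 2:
        raise ValueError("need at least two negative returns")
    return statistics.mean(returns) / statistics.stdev(downside)


def leverage(returns, shares):
    """Scale every return by the number of shares traded."""
    return [shares * r for r in returns]


# The two-profit example from the text: 5% and 3%, then five shares.
base = [0.05, 0.03]
sharpe_base = sharpe_ratio(base)
sharpe_levered = sharpe_ratio(leverage(base, 5))

# Hypothetical series with losses, to exercise the Sortino ratio.
mixed = [0.02, -0.01, 0.03, -0.03, 0.01]
sortino_base = sortino_ratio(mixed)
sortino_levered = sortino_ratio(leverage(mixed, 5))
```

Both ratios are unchanged by the scaling because the multiplier appears in the numerator and the denominator and cancels, which holds only while the risk-free rate is zero.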

Proposed single model-action-specialized expert model
The action-specialized expert models are created by adjusting the reward function values under specific conditions. The concept of our proposed single model is to develop an expert model of each action that reflects investors' behavior. For instance, if someone is inclined to buy to generate profit, then we can reflect this behavioral tendency in an expert model specialized for an aggressive investor. In this way, we can create various action-specialized expert models with investment strategies that are effective for buying, selling, and holding actions. In other words, the expert model for buying yields a larger reward value when profit is high, and the model works well in periods of increasing prices. Similarly, the expert model for selling yields a larger reward value when the absolute value of profit is high, and the model performs well in periods of falling prices. Further, the expert model for holding yields a large reward when it holds while profit is in the range from -0.3% to 0.3%.
The reward function of the expert model is expressed by the following Eq (10).
\[ r^{expert}_t = \begin{cases} m \cdot r_t, & \text{if } a_t \text{ is an expert action in the range of profit according to Table 2} \\ r_t, & \text{otherwise} \end{cases} \tag{10} \]
where $r_t$ is the reward value and $m$ is the predetermined positive constant with $m \geq 1$. To set $m$, we constructed ranges of profit based on the profit distribution and divided them among buy, hold, and sell actions using a threshold. We set the threshold at 0.3% because it is used as a general transaction cost, and it is possible to prevent a loss by choosing a holding strategy if the trade does not generate more than 0.3% profit. In Eq (10), $m$ is applied stepwise, depending on the importance of the profit and its frequency. As frequency varies according to the profit interval, the absolute value of profit is important. Table 2 indicates the design of predetermined positive constants of the expert model by profit interval. With this conditional reward function $r^{expert}_t$, we can control the reward, and we used the adjusted reward value to develop the proposed single expert model.
By controlling the reward value with m, we can create a model enhanced for a specific action. In detail, when the model makes a correct decision, we modify the reward function by multiplying it by m to train the action-specialized expert model. For example, according to Table 2, the buy-specialized expert model obtains an enhanced reward value that is m times larger than the common reward value when its decision is correct in the range of profit. The enhanced compensation is only applied when the decision is correct. If the decision is wrong, the reward value is small or below zero, but no enhanced penalty is applied. In other words, the model obtains larger reward values when it performs its specific action well, and so becomes the expert model specialized for that action. Thus, each action-specialized expert model of buy, hold, and sell can be created by controlling the enhanced reward function with m. In addition, we apply the extended discrete action space, which makes the reward value larger than in the 3-action space. The extended action space helps the model determine whether the action is strong or weak. Specifically, the 3-action model has only one buy action, whereas the 11-action model has five buy actions, which correspond to buying 1 to 5 shares. Buying 1 share is a weak buy action, whereas buying 5 shares is a strong buy action. In addition, since the reward function of the action-specialized expert model with extended action space is defined as the product of the reward value, m, and the extended action (the number of shares), the action-specialized expert model can obtain a wider range of reward values. Thus, if the model can detect from the input state how pronounced the patterns are, in terms of the direction and magnitude of dynamic market movements, then it can determine how many shares of a stock to buy or sell by choosing the appropriate extended action.
Fig 2 indicates the process of the common ensemble model and our proposed model. The reward of the common model is the raw value of profit or the Sortino ratio, and the common ensemble consists of these models. The reward of our proposed model, on the other hand, is controlled by an additional value that compensates the expert action under specific conditions. In this way, the proposed ensemble consists of three different action-specialized expert models based on DRL. In Fig 2, the colored boxes represent the enhanced expert action of each expert single model. In the common ensemble method, performance substantially improves because averaging the outputs of multiple networks reduces the deviation of the resulting network. Unlike the common ensemble method that combines similar models, our proposed ensemble method combines buy-, hold-, and sell-specialized single expert models to improve performance. Our proposed ensemble model functions similarly to three experts from different fields who make a decision cooperatively by pooling their opinions. Each expert model yields a different inference or decision for the same input, and combining them in the ensemble improves performance. To combine them, we use the soft voting ensemble method, which can avoid loss of information [16]. The soft voting method equation is as follows.

Table 2. Range of profit and predetermined positive constant (m).
\[ Output_{tj} = \frac{\exp(a_t[j])}{\sum_{k=1}^{J} \exp(a_t[k])} \]
where $Output_{tj}$ is obtained by applying the softmax function to $a_t[j]$ at time $t$, $a_t$ is the vector of action values output by the model, and $J$ is the number of outputs. The softmax function normalizes each action value, which is the Q-value of the DQN in the expert model, to lie between 0 and 1. Since the sum of all the action values after applying the softmax function becomes 1, each value of the output layer of the DQN in the expert model can be interpreted as the probability of each action. For the final decision of the expert ensemble model, the softmax function is applied to the Q-values of the DQN in each expert model. The averages of the resulting values of the buy-, hold-, and sell-specialized expert models then become the final outputs of the expert ensemble model, that is, the Q-values of the expert ensemble model. Thus, the action with the highest Q-value of the expert ensemble model is selected as the final decision. To describe our proposed method in detail, the DQN algorithm for our model is provided in Algorithm 1 below.
To prevent overfitting of our proposed method and to train the network better, we used experience replay and epsilon-greedy in our DRL experiments. Regarding experience replay, all the experiences are saved in the replay memory in the form $\langle s_t, a_t, r_t, s_{t+1} \rangle$ during the training of the DQN. Then, the replay memory is uniformly sampled to make a mini-batch of random samples so that the mini-batch is not sequential. This eliminates the time dependency between consecutive training samples. In addition, an observed experience is reused for training whenever it is sampled again, which improves data usage efficiency. Thus, it helps to avoid local minima and prevent overfitting.
Next, the epsilon-greedy method is used to address the exploration-exploitation dilemma in DRL. The epsilon-greedy method chooses an action randomly with probability $\varepsilon$ and the action with the maximum Q-value with probability $(1 - \varepsilon)$. The value of $\varepsilon$ is decreased from 1 to 0.1 over an episode. This results in completely random moves that explore the state space maximally at the start of training, and it settles down to the fixed exploration rate of 0.1 at the end of training. Therefore, the epsilon-greedy method helps to prevent overfitting or underfitting.
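The ensemble decision rule and the two training aids described above can be sketched together as follows. This is an illustrative sketch, not the authors' implementation: all names are hypothetical, the Q-values are made up, the action order (sell, hold, buy) follows the index convention used in the experiments, and the linear shape of the epsilon decay is an assumption, since the text only gives the start and end values.

```python
import math
import random
from collections import deque


def softmax(q_values):
    """Normalize Q-values to probabilities (max subtracted for stability)."""
    peak = max(q_values)
    exps = [math.exp(q - peak) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(expert_q_values):
    """Average the experts' softmax outputs and return (best action, averages)."""
    probs = [softmax(q) for q in expert_q_values]
    n_actions = len(probs[0])
    averaged = [sum(p[j] for p in probs) / len(probs) for j in range(n_actions)]
    best = max(range(n_actions), key=lambda j: averaged[j])
    return best, averaged


def epsilon_at(step, total_steps, start=1.0, end=0.1):
    """Decay epsilon from start to end (linear schedule is an assumption)."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * fraction


def epsilon_greedy(q_values, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda j: q_values[j])


class ReplayMemory:
    """Stores <s_t, a_t, r_t, s_{t+1}> tuples and samples them uniformly."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), batch_size)


# Hypothetical Q-values of the buy, hold and sell experts (sell, hold, buy).
buy_q = [0.1, 0.2, 1.5]
hold_q = [0.3, 0.9, 0.4]
sell_q = [0.8, 0.2, 0.6]
action, ensemble_probs = soft_vote([buy_q, hold_q, sell_q])

memory = ReplayMemory(capacity=100)
for step in range(10):
    memory.push(step, 1, 0.0, step + 1)
batch = memory.sample(4)
```

Averaging probabilities rather than counting each expert's top choice keeps the information about how strongly each expert prefers an action, which is the property the text attributes to soft voting.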

Data design
In this study, we use the data of three indices (S&P500, HSI, and Eurostoxx50) to verify our proposed method. We obtained these data from the Yahoo Finance website and used the same period for each data set. Specifically, the training period spans from January 1987 to December 2006, and the test period from January 2007 to December 2017. By establishing the same time periods, we were able to compare how the RL model freely learns and yields different results over the same period in different environments. The state space consists of 200 days of closing prices as the input, and the action space as the output relates to buy, hold, and sell, with 3, 11, or 21 actions depending on the experiment. The data set periods are described in Table 3. In order to discuss the trade-off between training costs and performance in more detail, we prepared three more S&P500 training data sets with periods of 5, 10, and 15 years, each with the same 11-year test period. Based on these experimental settings, we could discuss the trade-off between the length of the training period and performance, and the trade-off between training time and performance. Unlike the S&P500 and Eurostoxx50 movements, however, HSI displays different moves toward the end of 1990. During the test period, the three indices indicate different movements. In this period, S&P500 moved upward except during the global financial crisis in 2008; however, HSI and Eurostoxx50 exhibited large fluctuations even after the financial crisis. HSI recovered slightly after the drop; however, Eurostoxx50 failed to recover after the decrease. Against the backdrop of these differences, we can compare the effectiveness of RL in terms of training and results. The movement during the test period can be thought of as a Buy and Hold strategy [39]. In the test period, analyzing the data with the Buy and Hold strategy indicates that S&P500 increased by 89% and HSI by 47.3%, while Eurostoxx50 decreased by 16.3%.
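The data preparation described above can be sketched as follows. This is a minimal illustration with hypothetical helper names and synthetic prices: whether each 200-day window includes the current day is an assumption, and loading the actual index data is left out.

```python
from datetime import date


def split_by_date(dates, values, first_test_date):
    """Split a series into training and test parts at first_test_date."""
    train = [v for d, v in zip(dates, values) if d < first_test_date]
    test = [v for d, v in zip(dates, values) if d >= first_test_date]
    return train, test


def sliding_states(close_prices, window=200):
    """Build one state per day from the most recent `window` closing prices."""
    return [close_prices[i - window:i]
            for i in range(window, len(close_prices) + 1)]


def buy_and_hold_return(close_prices):
    """Return of buying at the first close and holding to the last close."""
    return close_prices[-1] / close_prices[0] - 1.0


# Synthetic example: 205 closes produce 6 states of 200 prices each.
synthetic_prices = [100.0 + 0.1 * i for i in range(205)]
states = sliding_states(synthetic_prices, window=200)

# Hypothetical dates around the train/test boundary (January 2007).
dates = [date(2006, 12, 28), date(2006, 12, 29), date(2007, 1, 3)]
train_part, test_part = split_by_date(dates, [1.0, 2.0, 3.0], date(2007, 1, 1))
```

The buy-and-hold helper is the benchmark referred to in the text: a series that ends 89% above its first value yields a return of 0.89.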

Profit distribution
We analyzed our data set; Fig 4 shows the profit distribution and data balance during the training period. Only the training data were analyzed; after this period, the model is updated with test data through online learning. First, we divided the profit distribution into bins of 0.5%, and all three indices appear roughly bell-shaped. Based on this analysis, the reward function was adjusted so that each action-specialized expert model could learn adaptively according to the profit. For example, in the expert model for the buy action, the existing reward is multiplied by 3 for profits in the frequently occurring interval from 0.3% to 1%, by 5 for the interval from 1% to 2%, by 6 for the interval from 2% to 3%, by 7 for the interval from 3% to 5%, and by 10 for profits exceeding 5%. Conversely, the expert model for the sell action is adjusted in the same manner, and the expert model for the hold action multiplies the reward by 7 in the -0.3% to 0.3% range. We use 0.3% as the threshold because many studies assume a transaction cost of 0.3% [40][41][42][43][44]. Therefore, our model learns the hold action in the interval below 0.3%. The pie charts for data balance also show the ratio of buy, hold, and sell data based on the 0.3% threshold, and the data for the three indices appear to be balanced.
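The interval-based reward adjustment described above can be sketched as a small lookup, shown here only as an illustration. The function name `reward_multiplier`, the half-open interval boundaries, the mirroring of the buy bands onto losses for the sell expert, and the factor of 1 outside the listed ranges are all assumptions; how the factor is combined with the chosen action is defined in Table 2 and is not modeled here.

```python
# Multipliers for the buy expert as (lower bound, upper bound, factor),
# with profit expressed in percent. Boundary handling is an assumption.
BUY_BANDS = [
    (0.3, 1.0, 3.0),
    (1.0, 2.0, 5.0),
    (2.0, 3.0, 6.0),
    (3.0, 5.0, 7.0),
    (5.0, float("inf"), 10.0),
]
HOLD_FACTOR = 7.0
COST = 0.3  # transaction-cost threshold in percent


def reward_multiplier(expert: str, profit_pct: float) -> float:
    """Return the factor applied to the base reward for one expert model."""
    if expert == "hold":
        return HOLD_FACTOR if -COST <= profit_pct <= COST else 1.0
    if expert == "sell":
        # Assumption: the sell expert mirrors the buy bands on the loss side.
        profit_pct = -profit_pct
    elif expert != "buy":
        raise ValueError(f"unknown expert: {expert}")
    for low, high, factor in BUY_BANDS:
        if low < profit_pct <= high:
            return factor
    return 1.0  # outside the emphasized ranges the reward is left unchanged
```

Under these assumptions, a 0.5% gain triples the reward of the buy expert, while the same gain leaves the rewards of the sell and hold experts unchanged.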

Normality test
We interpret the three index data sets by referring to their statistical properties in Table 4, which shows the basic descriptive statistics of each index data set. Stock returns are known to be characterized by negative skewness and high kurtosis. Negative skewness describes a distribution in which small gains are frequent but extreme losses occur, whereas positive skewness describes a distribution in which small losses are frequent but extreme gains occur. High kurtosis is characteristic of daily return distributions and is lower for long-term return distributions. First, comparing the training and test data sets of the three indices in Table 4, the training data set of S&P500 shows a negative skewness of -1.48 and the test data set shows a relatively weak negative skewness of -0.10. The training data set of HSI shows a negative skewness of -1.94, while the test data set shows a positive skewness of 0.29. The training data set of Eurostoxx50 has a negative skewness of -0.17, which is weaker than that of the other indices, and the test data set has a positive skewness of 0.12. Considered together, all three indices show more negative skewness in the training data set than in the test data set, and an upward trend from 1987 to 2007. In addition, S&P500, with negative skewness in the test data set from 2007 to 2017, shows an upward trend, while HSI, with positive skewness, also shows an upward trend but is more volatile and rises less than S&P500 (S&P500: 87% increase, HSI: 47.3% increase). Eurostoxx50 declines by 16.3%, which is consistent with the characteristics of positive skewness, where losses are frequent. Kurtosis indicates the sharpness of the peak and the heaviness of the tails of the distribution.
The kurtosis of the training data sets of S&P500 and HSI is 31.71 and 45.69, respectively, indicating sharply peaked distributions with long tails. The corresponding test data sets have kurtosis of 11.20 and 9.42, respectively, and are more evenly distributed than the training data sets. The kurtosis of Eurostoxx50 is 5.35 in the training data set and 5.86 in the test data set, which is more evenly distributed than S&P500 and HSI. Table 4 also reports the volatility of each index, which is highest for HSI, followed by Eurostoxx50 and S&P500. Fig 5 shows the quantile-quantile (Q-Q) plots for each index. A Q-Q plot is a graphical technique for determining whether two data sets come from populations with a common distribution; here, the plots do not follow a linear pattern, which indicates that the data are not normally distributed. From Figs 4 and 5, we establish that the data follow a leptokurtic distribution.
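As an illustration of the descriptive statistics discussed above, the following sketch computes skewness and kurtosis from a price series using only the standard moment definitions. It is not the procedure used to produce Table 4: the use of simple (rather than log) daily returns, population moments, and raw (rather than excess) kurtosis are assumptions, and the function names are hypothetical.

```python
def daily_returns(prices):
    """Simple returns between consecutive closing prices."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]


def _central_moment(xs, order):
    """Population central moment of the given order."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** order for x in xs) / len(xs)


def skewness(xs):
    """Third standardized moment: negative values mean a heavier left tail."""
    return _central_moment(xs, 3) / _central_moment(xs, 2) ** 1.5


def kurtosis(xs):
    """Fourth standardized moment: a normal distribution gives 3."""
    return _central_moment(xs, 4) / _central_moment(xs, 2) ** 2
```

A leptokurtic return series, such as the ones described above, yields a `kurtosis` value well above 3 under this definition.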

Experimental setup
The purpose of our proposed method is to create a more profitable automated trading system. To achieve this, we develop the action-specialized expert ensemble method with DRL. Many studies aim to improve prediction accuracy in the financial sector, but higher prediction accuracy does not necessarily mean higher profit. Thus, without predicting returns or prices, we develop a profitable trading system based on DRL, using action-specific controlled reward functions to build an ensemble of action-specialized expert models that differs from the existing common ensemble method. The DQN requires defining the state and action spaces of the problem, as well as a reward function.

State space & action space
The state space, as the input, uses the 200-day price data of each index. The agent analyzes the pattern over 200 days and learns to take an action. These experimental environments are similar to those of previous studies [2,7]. In this study, the single and action-specialized expert models have the same state space. We apply the action space in three ways and examine how the results differ from those of existing experiments when the available quantity increases 5 and 10 times. Further, we want to verify our proposed method under various experimental situations. Therefore, the first experiment was conducted with the same 3-action method (buy, hold, and sell), in which the quantity of shares was limited to one. In the other experiments, we discretely increased the buy and sell actions. The 11-action space has 5 buy actions, 5 sell actions, and 1 hold action, and the 21-action space has 10 buy actions, 10 sell actions, and 1 hold action. Each buy or sell action corresponds to a number of shares ranging from 1 to 5 or from 1 to 10, respectively. As with the state space, the action spaces of the single and expert models are the same.
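The state and action spaces described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the ordering of output indices (most negative quantity first) is an assumption.

```python
def build_action_space(max_shares: int) -> list[int]:
    """Signed share quantities: negative = sell, 0 = hold, positive = buy."""
    return list(range(-max_shares, max_shares + 1))


def action_to_shares(action_index: int, max_shares: int) -> int:
    """Map a network output index to a signed number of shares."""
    return build_action_space(max_shares)[action_index]


def state_windows(prices: list[float], window: int = 200) -> list[list[float]]:
    """All consecutive `window`-day slices of the closing-price series."""
    return [prices[i:i + window] for i in range(len(prices) - window + 1)]


# The three action spaces used in the experiments: 3, 11, and 21 actions.
for max_shares in (1, 5, 10):
    print(max_shares, len(build_action_space(max_shares)))
```

With `max_shares` set to 1, 5, and 10, the action space contains 3, 11, and 21 entries, matching the three experimental settings.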

Reward function
In the single model, we applied three types of reward functions: profit, the Sharpe ratio, and the Sortino ratio. We compared the trading system where only profit is used with the trading systems where both profit and volatility are used. We employed three types of action spaces: the 3-action, 11-action, and 21-action spaces. Thus, $a_t$ in Eqs (7), (8), and (9) is defined as the number of shares: $\{-1, 0, 1\}$, $\{-5, \dots, -1, 0, 1, \dots, 5\}$, and $\{-10, \dots, -1, 0, 1, \dots, 10\}$, respectively. To be exact, since we cannot trade these indices in the real market, $a_t$ indicates the number of multiple shares of a stock. In Eq (7), $n$ is set to 100, so the network is structured to maximize the profit over 100 days by observing 200 days. To verify the sensitivity of our proposed model, we first use the profit function, which only considers profit. Second, we use the Sharpe ratio, which accounts for both profit and volatility. Third, we use the Sortino ratio, which considers volatility only in losses.
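To make the difference between the two risk-adjusted rewards concrete, the sketch below computes textbook forms of the Sharpe and Sortino ratios over a window of returns. The exact expressions in Eqs (8) and (9) may differ; a zero risk-free rate, a zero target return, population deviations, and the fallback value of 0 when the denominator vanishes are assumptions made for this illustration.

```python
import math


def sharpe_ratio(returns):
    """Mean return divided by the standard deviation of all returns."""
    n = len(returns)
    mean = sum(returns) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in returns) / n)
    return mean / std if std > 0 else 0.0


def sortino_ratio(returns, target=0.0):
    """Mean excess return divided by the downside deviation below `target`."""
    n = len(returns)
    mean = sum(returns) / n
    # Only returns below the target contribute to the downside deviation.
    downside = math.sqrt(sum(min(r - target, 0.0) ** 2 for r in returns) / n)
    return (mean - target) / downside if downside > 0 else 0.0
```

Because the Sortino ratio ignores upside fluctuations, a series with large gains and small losses scores higher under it than under the Sharpe ratio.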

Action-specialized expert ensemble model
Our proposed model consists of three action-specialized expert models for buy, hold, and sell. Fig 6 shows the training, test, and ensemble phases of our proposed method. In the training phase, each action-specialized expert model for buy, hold, and sell is trained following the training steps in Algorithm 1. As mentioned in Table 2, the action-specialized expert models, unlike the DQN models, apply specific profit ranges to control the reward value. After the training phase, we performed online learning so that the RL approach reflects the dynamic financial market. We input the first mini-batch of test data to obtain the first outputs of each action-specialized expert model. Thereafter, using a sliding window with the same mini-batch size, we trained each action-specialized expert model again, including the earlier test data, and input the next mini-batch of test data to obtain the next outputs. The online learning continues until the end of the test data. Further, at inference time we combined the three expert models at each time t by soft voting, described in section 3.4. To further explain the distinction between the common and expert ensemble models: the common ensemble model requires three models that are trained identically using the unenhanced reward function. The expert ensemble model, on the other hand, uses enhanced reward functions to create action-specialized expert models for buy, hold, and sell, and then ensembles these models.
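The ensemble step can be illustrated with a generic soft-voting sketch. The precise rule is given in section 3.4; here we assume, purely for illustration, that each expert's Q-values are converted to probabilities with a softmax and that the three probability vectors are averaged with equal weights. The function names are hypothetical.

```python
import math


def softmax(q_values):
    """Turn one model's Q-values into a probability vector."""
    peak = max(q_values)  # subtract the maximum for numerical stability
    exps = [math.exp(q - peak) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(q_values_per_model):
    """Average per-model probabilities; return (action index, averaged probabilities)."""
    probs = [softmax(q) for q in q_values_per_model]
    n_models = len(probs)
    avg = [sum(p[i] for p in probs) / n_models for i in range(len(probs[0]))]
    best = max(range(len(avg)), key=avg.__getitem__)
    return best, avg


# Hypothetical Q-values of the buy, hold, and sell experts for one state.
action, averaged = soft_vote([[0.1, 0.2, 2.0], [0.0, 1.0, 0.5], [0.3, 0.1, 0.9]])
```

In this example two of the three experts favor the last action, and the averaged probabilities select it as well.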
Despite many advances in RL, DRL models are still unstable in learning, and hence it is difficult to reproduce state-of-the-art performance. Therefore, Henderson et al. [45] suggest that presenting the mean and standard error of five results represents a model's performance better than presenting only the top result. Accordingly, most recent DRL studies [45][46][47][48][49] present the mean and standard error (mean ± standard error) of five models. We therefore trained the models of each approach 10 times and selected the top five to present the mean and standard error. For the common ensemble, we combined three of the top five single models and reselected the top five models from the ensemble results. For the expert ensemble, we combined the top two models of each of buy, hold, and sell, and again selected the top five models from the expert ensemble results. Fig 7 outlines the experimental steps. First, to compare the profit reward function with the Sharpe and Sortino ratios, we examined them with single models and noticed that the two ratios were better than the profit reward function in two-thirds of these experiments. Second, since the Sortino ratio is slightly better than the Sharpe ratio in the single-model experiments, we exclude the Sharpe ratio as a reward function. We then compare the profit reward function with the Sortino ratio, which jointly considers profit and volatility. We also collect the top five models of each reward function of the single model. Third, we train the action-specialized expert networks with the profit and Sortino ratio reward functions and then, as Step 4, collect the top two models of each action-specialized expert model. Step 5 is the ensemble step for single models: we ensemble three models out of the five single models and extract the top five of the ten resulting ensemble models. The next step is the development of the action-specialized expert ensemble models: we build each expert ensemble from one of the two models for each action and choose the top five of the eight resulting expert ensemble models. Lastly, we repeat Steps 1 to 6 with three types of extended discrete action spaces and three types of index data sets.
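The counts of ten common ensembles and eight expert ensembles follow directly from the selection procedure, as the short enumeration below shows. The model labels are placeholders introduced for this illustration.

```python
from itertools import combinations, product

# Common ensemble: choose 3 of the top five single models.
single_models = [f"single_{i}" for i in range(1, 6)]
common_ensembles = list(combinations(single_models, 3))

# Expert ensemble: one of the top two experts for each action.
buy_experts = ["buy_1", "buy_2"]
hold_experts = ["hold_1", "hold_2"]
sell_experts = ["sell_1", "sell_2"]
expert_ensembles = list(product(buy_experts, hold_experts, sell_experts))

print(len(common_ensembles))  # 10 candidate common ensembles
print(len(expert_ensembles))  # 8 candidate expert ensembles
```

The top five of each candidate set are then retained for reporting the mean and standard error.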
As for the detailed architecture of our Q-network, we use three fully connected hidden layers with 200, 100, and 50 neurons, respectively. We use the ReLU activation function and the following hyperparameters: the replay memory (M) is 10000, the step of training (R) is 20, the episode of target Q-network (C) is 256, the learning rate is 0.0001, the discount factor gamma (γ) is 0.85, and the mini-batch size is 64. We also use Adam optimization.
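To give a sense of the size of this network, the sketch below collects the stated hyperparameters and counts the trainable parameters of the fully connected layers. The dictionary keys are our own naming, and the input size of 200 (one closing price per day), the presence of bias terms, and an output layer with one unit per action are assumptions.

```python
def count_parameters(input_size, hidden_sizes, n_actions):
    """Weights plus biases of a fully connected network."""
    sizes = [input_size, *hidden_sizes, n_actions]
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))


HYPERPARAMETERS = {
    "hidden_sizes": (200, 100, 50),
    "activation": "relu",
    "replay_memory": 10_000,   # M
    "train_step": 20,          # R
    "target_update": 256,      # C
    "learning_rate": 1e-4,
    "gamma": 0.85,
    "batch_size": 64,
    "optimizer": "adam",
}

for n_actions in (3, 11, 21):
    print(n_actions, count_parameters(200, HYPERPARAMETERS["hidden_sizes"], n_actions))
# 3 65503
# 11 65911
# 21 66421
```

Under these assumptions, extending the action space from 3 to 21 actions adds fewer than a thousand parameters, since only the output layer grows.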

Single model with three reward functions
We investigated reward functions that consider both profit and volatility in the single model experiment. Table 5 summarizes the top five models of each single model result using the Sharpe and Sortino ratios as reward functions. In addition, the results of each ratio correlate with the profit reward function. Since the average or volatility of the return is different depending on the window size of time series data, we conducted various window size tests. However, we could not compare the two ratios even for each window size; consequently, we used a cross window average to compare them. Specifically, it was not possible to compare the experimental results to determine which of the two ratios is a better reward function depending on window size, action, or index. Since the Sortino ratio was slightly better than the Sharpe ratio as a result of cross-averaging, we only utilized the Sortino ratio when creating action-specialized expert models. We observe that the results of the two ratios are better than the result of profit in two-thirds of this experiment. We compared the reward functions using the profit and Sortino ratio in the following experimental outline. Table 6 indicates results of the top two of each action-specialized expert models to be used in the final expert ensemble, and each single model as compared with an expert single model. When comparing results of the expert models according to each specific action, it is evident that profit results of a few actions are lower than others in each expert model. For example, the profit yielded by expert models for buy and hold is similar, while the profit yielded by the expert model for sell is relatively low. In addition, the expert model for buy is better than results of expert models for hold and sell. From these results, we noticed that the Sortino ratio results were better than profit results in two-thirds of the single experiment. 
However, it is unclear whether expert models can be meaningfully compared to single models, because the experiments yield different results by index, number of actions, specific action, and reward function. To show this comparison, we include the results in Table 6.

Results of the action-specialized expert ensemble model
We investigated the results of the proposed action-specialized expert ensemble system. We compared the ensemble results of expert models, which are specialized for each action, to the ensemble results of single models, which are well-balanced learning methods. The experiments in Tables 7-9 were conducted using two reward functions (profit and Sortino ratio). First, these tables are analyzed from the perspective of the expert ensemble, common ensemble, and single models. Table 10 summarizes the top-five averages for comparing the expert ensemble method to the common ensemble and single models. Applying the common ensemble (PE, SE) to the single models (Profit, Sortino) increased profit by 7.6-26.9%, an average increase of 14.6%. On the other hand, applying the action-specialized expert ensemble (EPE, ESE) increased profit by 16.6-82.8%, or 39.1% on average. As a result, our proposed expert ensemble method was 21.6% more effective than the common ensemble method. The ensemble method is widely used in general machine learning and deep learning; however, it has been applied in only a few RL cases. Since the experimental results of the ensemble method with the newly attempted expert models appear to be effective, we expect that our proposed method can be extended to further financial applications and various other fields in the future.
Figs 8-10 show the average of the top five models, drawn as a thick colored line, and their standard error. Our proposed action-specialized expert ensemble model is the most effective for both the profit and Sortino ratio reward functions. In certain cases, the blue line, which represents the expert ensemble with profit, is higher than the red line, which represents the ensemble model with the Sortino ratio. Regardless, our experiments show consistent results across the single, common ensemble, and expert ensemble models.
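The mean ± standard error reported for the top five models can be computed as in the following sketch. The helper names are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption about how the standard error was obtained.

```python
import math
import statistics


def top_k(results, k=5):
    """Keep the k best results, for example the top five of ten runs."""
    return sorted(results, reverse=True)[:k]


def mean_and_standard_error(values):
    """Mean and standard error (sample standard deviation / sqrt(n))."""
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem
```

For example, `mean_and_standard_error(top_k(runs))` gives the two numbers plotted as the thick line and its band for a list `runs` of ten cumulative returns.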

Experimental results of the extended discrete action space
We examine robustness of the proposed ensemble system with the extension of the discrete action space. Above all, if we analyze Table 11, the increase in rate of the 3-action to the 11-action space is 302.4-505.5%, and the average increase is 427.2%. When increased from the 3-action to the 21-action space, the rate of the increase ranged from 668.1% to 985.8%, with an average of 856.7%. We calculate these numbers only for increasing rate from the base amount of investment 1.0 because of the different number of shares. For example, "302.4%" is calculated by the profit rate of PE of 11-action divide by profit rate of PE of 3-action (e.g., (9.923 −1.0)/(3.351−1.0) = 3.024). In simple comparison, the number of multiple shares of a stock increased 5-fold when increasing from the existing 3-action to the 11-action space. In addition, upon the increase from the 3-action to the 21-action space, the quantity increases by 10 times. However, as the number of actions increases, it is possible to select under 5 or 10 actions and experimentally obtain a mean value smaller than the maximum expected value. Unlike selecting the quantity for minimum 0 or maximum 1 in the 3-action space in network learning, the model will choose the quantity in a flexible way, like 1 to 5 or 1 to 10 in the 11-action and 21-action space, respectively. Further, we perform an extra test to compare the extended discrete actions in the 11-action and 21-action spaces with multiple shares of a stock in the 3-action space. Fig 11 displays average performance of extended discrete actions and multi shares on each index. First, to explain the x-axis, the 3-action space is for buy, hold, and sell with 1 share. Next, the 3-action space with 5 shares is for buy, hold, and sell with 5 shares. The following 11-action space is for 5 actions for buy from 1 to 5 shares, 5 actions for sell from 1 to 5 shares, and 1 hold action. The following 3-action space with 10 shares is for buy, hold, and sell with 10 shares. 
The next 21-action space consists of 10 buy actions from 1 to 10 shares, 10 sell actions from 1 to 10 shares, and 1 hold action. The two graphs at the top of Fig 11 show the overall performance of our experimental results with the two reward functions on S&P500. The two graphs in the middle are for HSI, and the last two graphs are for Eurostoxx50. These graphs show that multiple shares yield more profit and that the discrete action space model performs almost 29.3% better on average than the 3-action space model with multiple shares in all cases. As a result, the extended discrete action space outperforms multiple shares of a stock in the 3-action space. Figs 12-14 show the detailed actions of the best result of the action-specialized expert ensemble model on each index in this study. We display only the case of the profit reward function because the case of the Sortino reward function is similar. Each action space is shown with two graphs. The first graph is the movement of each index during the actual test period, with each action marked on it, so that we can compare the actual price trend with the decisions of the model. The second graph spreads out the action markers so that the different actions can be distinguished. As seen in Figs 12-14, the action decisions of our proposed model closely resemble the real price movement. We can also see the spread of actions, and it is evident that the network applies various actions according to the market situation and the extended discrete action space of the experiment. Looking at each index in detail, our model on S&P500 learns the upward trend and continuously chooses the buy action. The price movements of the other two indices are more volatile than that of S&P500, and the results show various action decisions depending on the strength of these signals.
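The rate of increase between action spaces discussed earlier in this section compares net profits measured from the base investment of 1.0. A minimal sketch of that calculation is given below; the function name is hypothetical and the input values are illustrative only, not taken from Table 11.

```python
def relative_gain(extended_value, base_value, initial=1.0):
    """Ratio of net profits, each measured from the initial investment."""
    return (extended_value - initial) / (base_value - initial)


# Hypothetical final values: 7.0 in an extended action space, 3.0 in the 3-action space.
ratio = relative_gain(7.0, 3.0)
print(f"{ratio:.1%}")  # 300.0%
```

Subtracting the initial investment before dividing ensures that only the profit, not the invested capital, is compared across action spaces.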

Analytic results of our whole experiments
Analyzing our results in connection with Table 4, the profits for the Eurostoxx50 test period, during which the index actually decreased by 16%, were generally higher than those for the other two indices. We think this is because the distribution characteristics of the training and test data sets are similar: the kurtosis of the two Eurostoxx50 data sets is almost the same, and the gap in skewness is small compared to S&P500 and HSI. For this reason, the model seems to learn better in this environment than in the other two index environments. In addition, S&P500 shows relatively low volatility and an upward trend during the training and test periods. This index pattern appears to be too simple for the model to learn various patterns. As a result, in contrast to the environments of the other two indices, where a variety of information can be learned, the profit of the trained model for S&P500 was the lowest, even though the real index rose the most in the same test period. Lastly, the HSI environment shows good results because the training and test data set movements are varied enough to learn various patterns.

Computational complexity
In the DRL approach, the computational complexity of the model is important for understanding the burden of the architecture. Thus, we analyze it in two ways: time and space complexity, and the trade-off between training costs and performance.

Time & space complexity
In reinforcement learning, the time complexity is sublinear in the length of the state period, and the space complexity is sublinear in the number of states, the number of actions, and the steps per episode. In big O notation, the time complexity is $O(n_T)$, where $n_T = n_e n_h$ is the total number of steps, $n_e$ is the number of episodes, and $n_h$ is the number of steps per episode. The space complexity is $O(n_s n_a n_h)$, where $n_s$ is the number of states and $n_a$ is the number of actions [50]. In addition, the computational complexity of DRQN can be calculated based on the complexity of reinforcement learning and LSTM. The time and space complexity of an LSTM per time step is estimated as $O(n_w)$, where $n_w$ is the number of weights of the network [51]. Thus, the time complexity of DRQN is $O(n_w n_T)$ and the space complexity is $O(n_w n_s n_a n_h)$. Since the common ensemble method combines three single DQN models, its time complexity is linear in the number of DQNs; however, the ensemble does not affect the space complexity. Therefore, the time complexity of the ensemble method is estimated as $O(n_m n_T)$, where $n_m$ is the number of base models, and the space complexity of the ensemble method is estimated as $O(n_s n_a n_h)$, which is the same as that of DQN. Lastly, as the design of our proposed approach is the same as that of the common ensemble method, our proposed method has the same time and space complexity as the common ensemble method. We summarize the comparison of complexities in Table 12.
In our experimental environment, the inference time of the ensemble method is almost 1.5 ms longer than that of a single model (experimental server specifications: CPU: Xeon E5-2620, RAM: 64GB, GPU: GTX 1080 8-ways). Moreover, expert single models take almost 70 s longer to train than common models. The longer duration results from judging the profit range and calculating the reward values in the action-specialized expert model. Thus, in training, our proposed expert ensemble model takes about 3.5 times longer than a common single model and also longer than the common ensemble model; however, its performance is better than that of the single and common ensemble models. Because we focused on performance when testing our proposed models, we did not train them simultaneously on an advanced parallel system. Thus, running our proposed method on a parallel or distributed system could further reduce the training time. The computational load is also a challenge to be solved in reinforcement learning tasks, and there are a number of studies on synchronous parallel systems, asynchronous parallel learning, and distributed reinforcement learning systems [52][53][54][55][56].

Trade-off between training costs and performance
We analyzed the trade-off between training costs and performance. First, we compared the performance of our proposed model on S&P500 while reducing the training data to various lengths, to discuss the trade-off between the length of the training period and performance. In more detail, the training data set of the original experiment covers 20 years (Jan 2, 1987-Dec 29, 2006), as seen in Table 3. In the boxplot, the red line is the mean and the green line is the median. As seen in Fig 15, a longer training data set yields better performance for all three discrete action spaces. In more detail, as seen in Fig 16(A), the performance with 10, 15, and 20 years of training data is 1.6, 2.1, and 2.7 times better than the performance with 5 years.
In addition, we investigate the trade-off between training time cost and performance by measuring the training time of each different training data set and different action. We averaged the top five cumulative profits for each category and displayed it in Fig 16. The left scatter plot depicts the relative performance between the length of training data set. The training time and performance of five-year training data are based on 1, and results of remaining data sets are expressed as a ratio. The right scatter plot depicts the relative performance between different discrete action spaces. The training time and performance of the three-actions are based

Student's T-test of our proposed model on other methods
We ran DQN, DRQN, and a common ensemble of DQN to compare with previous studies, and compared these three results with our action-specialized expert ensemble model. The experimental environment uses only the 3-action space, and the data period is set to 20 years for training and 11 years for testing on the three indices. Fig 17 shows the mean and standard error of five results from each model; our method performs well on all three indices. We conducted a Student's t-test (a two-sample t-test) between our model and each of the DQN, DRQN, and common ensemble models to see whether this result is statistically significant. According to the t-test results in Table 13, all p-values are less than 0.05, so the results of our proposed model are statistically different from those of the other models. Overall, the performance comparison and the Student's t-test support the reliability of our experimental results.
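For readers who wish to reproduce this kind of comparison, a two-sample Student's t-test on five results per method can be sketched as follows. The sample values are hypothetical and are not the results in Table 13. The pooled-variance (equal-variance) form is an assumption, since the text does not state which variant was used, and significance is checked here against the two-sided 5% critical value for 8 degrees of freedom rather than by computing a p-value.

```python
import math
import statistics


def students_t(sample_a, sample_b):
    """Two-sample Student's t statistic with pooled variance, and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = (
        (n_a - 1) * statistics.variance(sample_a)
        + (n_b - 1) * statistics.variance(sample_b)
    ) / df
    t = (statistics.fmean(sample_a) - statistics.fmean(sample_b)) / math.sqrt(
        pooled_var * (1 / n_a + 1 / n_b)
    )
    return t, df


# Hypothetical cumulative returns of five runs per method (not the paper's data).
proposed = [3.9, 4.1, 4.0, 4.2, 3.8]
baseline = [3.0, 3.2, 3.1, 2.9, 3.3]

t_stat, df = students_t(proposed, baseline)
T_CRITICAL_DF8 = 2.306  # two-sided 5% critical value of Student's t for df = 8
significant = abs(t_stat) > T_CRITICAL_DF8
```

With five runs per method, the test has 8 degrees of freedom, so a difference is significant at the 5% level when the absolute t statistic exceeds 2.306.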

Conclusion
In this study, a new ensemble approach was proposed for automated trading systems using reinforcement learning, specifically an action-specialized expert ensemble trading system, to improve performance. This ensemble model consists of expert models specialized in the buy, hold, and sell actions. Since we developed each specialized model individually, our proposed method can reflect investment behavior differently in each model and obtain various distribution effects. We verified our approach experimentally with three different stock indices: S&P500, HSI, and Eurostoxx50.
First, our proposed method performs better than the common ensemble and single models: it is 21.6% and 39.1% more effective than the common ensemble and the single models, respectively. Second, we compared the profits of our proposed model to those of the common ensemble and single models to check the effect of extending the discrete action space. Briefly, the results indicate increases of 427.2% and 856.7% for the 11-action and 21-action models, respectively. Further, our extra experiments indicate that the extended action space is more efficient than multiple shares of a stock in the 3-action space. As the action space is extended, the training of each network becomes increasingly difficult. However, these results imply that our proposed method is well trained with an extended discrete action space. Third, we analyzed the results of our proposed model with various reward functions: profit, Sharpe ratio, and Sortino ratio. The two ratios, which jointly consider profit and volatility, demonstrate a 9.6% better performance than the use of profit only in two-thirds of our experiments. We believe that both profit and volatility information is helpful in training the network.
In this study, we apply our proposed method to a trading system. Since our action-specialized expert model is built around actions by controlling the reward function in DRL, it can be applied to other DRL problems in other fields. For example, it is applicable to games. In a fighting game, it is possible to create an attack-specialized expert model, a defense-specialized expert model, and an evasion-specialized expert model. Additionally, in a soccer video game, an ensemble model can be generated from an attack-specialized expert model and a defense-specialized expert model. As a further example, because autonomous driving must be realistic in many ways, action-specialized expert models can be created for recognizing moving vehicles, avoiding parked vehicles, driving well, and cornering or braking. In addition, in robotics, we can develop an expert model for balancing, an expert model for walking, an expert model for moving angles, and so on. Another direction is to extend this study by first training the network with a discrete action space and then moving to a continuous action space using transfer learning. Therefore, we believe that it is possible to expand this study to various fields and further develop its application in financial fields in the future.

Author Contributions
Conceptualization: Ha Young Kim.