Predicting Market Impact Costs Using Nonparametric Machine Learning Models

Market impact cost is the most significant portion of implicit transaction costs that can reduce the overall transaction cost, although it cannot be measured directly. In this paper, we employed the state-of-the-art nonparametric machine learning models: neural networks, Bayesian neural network, Gaussian process, and support vector regression, to predict market impact cost accurately and to provide the predictive model that is versatile in the number of variables. We collected a large amount of real single transaction data of US stock market from Bloomberg Terminal and generated three independent input variables. As a result, most nonparametric machine learning models outperformed a-state-of-the-art benchmark parametric model such as I-star model in four error measures. Although these models encounter certain difficulties in separating the permanent and temporary cost directly, nonparametric machine learning models can be good alternatives in reducing transaction costs by considerably improving in prediction performance.


Introduction
Transaction cost is one of the important factors that affect the investment performance and is usually classified into two major categories: explicit costs and implicit costs. Explicit costs, also called direct costs, are transaction costs that can be explicitly stated and measured. These costs include commissions, transaction fees, and taxes. Implicit costs, or indirect costs, are costs that cannot be measured directly but can be improvable by an appropriate trading strategy. They include bid-ask spreads, time risk costs, and market impact costs.
Market impact cost, one of the implicit transaction costs, is the cost caused by the difference between the price before the transaction and the actual price that the transaction is executed actually. During the last decade, many studies have been focused on analyzing market impact costs by not only the academic researchers but also the practitioners because it is one of the main reducible parts of the transaction cost. [1] and [2] fitted the impacts of single transactions to a concave power-law function of the volume of the transaction. [3] used a logarithm function of the transaction volume to estimate market impact costs. [4] exploited the hyperbolic tangent function for the same task. [5] and [6] used a stochastic process of the asset price which includes a function of the transaction size to explain market impacts. [7] estimated impact cost by using a linear regression with quantized transaction sizes. [8] analyzed the market impacts of the large institutional orders in the US equity market and found that the permanent impact function has a concave form with respect to the transaction size, in contrast to the previous results [5,9] in which the permanent impact function had a linear form. The I-star model, a state-of-the-art parametric model, described in [10] and [11] is a log-linear regression model that uses three inputs, transaction size, volatility, and underlying trading rate. These inputs affected the estimated market impact costs independently. [12] and [13] used more than 40 independent variables to fit the market impact cost to simple linear regression function. However, those existing parametric approaches showed the limitation in estimation and prediction performance because they assumed the fixed parametric or simple linear regression form of the market impact model. In addition, most of them cannot employ the variables that are not originally included in the model thus a new model is always required to predict the market impact with the different set of variables.
In this paper, we introduce nonparametric machine learning approaches to estimate and predict market impact costs more accurately than the existing parametric approaches. To our knowledge, this is the first approach that applies several nonparametric learning models to analyze the market impact cost. The proposed nonparametric approach has two main advantages. First, the nonparametric approaches usually fit the data better than does the parametric case. Second, the nonparametric approaches are versatile in the number of input variables so the general procedure does not change, whereas the number or kinds of input variables change while the parametric approaches require the new parametric models in those cases. By simulation, we analyzed the market impact costs of transactions of small-cap, mid-cap, and large-cap stocks in US equity market both altogether and separately by selecting the same types of input variables with I-star model [10,11] and compared the results.
The remainder of the paper is organized as follows. In the next section, we briefly explain the I-star model which is used as a parametric benchmark with three input variables. Then, we describe how to use nonparametric regression models to construct market impact cost functions. Data description and experimental procedures with the experimental results are presented in the following sections. Finally, we provide the summarized results, limitations, and some directions for the future work in the discussion section.

Review of I-star model
In this section, we first briefly review I-star model [10,11] which is a state-of-the-art benchmark parametric model. I-star model, which uses three input variables to describe the market impact cost, is composed of two separated equations calculating I Ã , a theoretical instantaneous cost, and MI, the market impact cost appeared in the real market, respectively. The equations calculate them are given as follows: where Size, Vol, and POV are input variables and a 1 , a 2 , a 3 , a 4 , and b 1 are parameters to be determined. The first input variable of Eqs (1) and (2) is Size, the normalized order size. Based on [11], this variable is represented as Size = Q/ADV, where Q is the imbalance, the absolute value of difference between buy order and sell order, and ADV is 30-day average daily volume. Thus Size implies the magnitude of pressure from this order relative to the average daily volume. The second input variable, Vol, is the volatility of the equity return and 30-day averaged volatility was used in [11]. The last input variable, POV, is an acronym for percentage of volume and it reflects the market liquidity condition. [11] simply expressed POV = Q/(Q+V) where V is the expected volume traded for the period of time that the imbalance order Q is executed. If the market is liquid or the imbalance trade order Q is executed slowly, V becomes large and thus POV becomes small. Small POV results in small MI value so the market impact cost will be small when the market is liquid.
The market impact cost in Eq (2) comprises two components, temporary impact cost and permanent impact cost which are the first and the last term in the right hand side of Eq (2) respectively. Considering that Size and Vol are used to calculate the value of I Ã , they affect both the temporary and permanent part of the market impact. However, the other input variable POV only appears in the temporary impact part. This result implies that the smaller POV incurs the smaller market impact cost when the other input variables are invariant. However, this effect is temporarily and the permanent impact on the market is independent of the market liquidity condition.
Several parameters should be estimated. These parameters can be determined with data sets, including input variable values and market impact costs observed in the market, by general parameter estimation techniques such as nonlinear optimization and grid search.

Nonparametric regression models
In this research, four state-of-the-art nonparametric machine learning models are used to estimate market impacts. As a preliminary, brief explanations of them are given as follows.

Neural networks
Neural networks [28] are nonparametric nonlinear regression models which can be fit to highly nonlinear data distribution. Mimicking a human brain, a neural network model consists of layers that contains several nodes, conducting the same role with neurons in the human brain. Each node in the layer takes output values of all nodes in the previous layer as input values, calculates the output value, and provides the output value to all nodes in the next layer as one of their input values. The most common way of output value calculation in the node is as follows: where y is the output value of the node, x i represents input value from node i in the previous layer, g(Á) is an input transform function, w i represents weights for input values, and f(Á) is an activation function that makes the regression model nonlinear. The sigmoid functions, such as logistic function, probit function, and hyperbolic tangent function, and a liner rectified function, for example, f(x) = max{0, x}, are usual selections for the activation function. Finding optimal weights, w i , in Eq (3) is the main task of constructing the neural network model. The most extensively used method for this optimization back-propagation algorithm [29]. In back-propagation algorithm, the weights are modified, or the gradients are calculated, backward from the last output layer to the first input layer by minimizing the sum of squared errors as usual. Similar to other nonparametric regression methods, the neural network efectively finds the complex data distribution after optimizing weights. However, the relationship between input values and output values is difficult to determine.

Bayesian neural networks
Bayesian neural network model, first proposed in [30], is a variant of the neural network model, whose weights have prior distribution similar to other Bayesian models. Maximizing the likelihood of this model is equivalent to minimizing the regularized error function, E reg ðw; X; yÞ ¼ EðX; yÞ þ l k w k k k , where w is the weight vector, {X, y} are data inputs and outputs, E(Á) is the error function, and k Á k k is a k-norm function. If the prior distribution has a Laplace function or a Gaussian function, k has the value of 1 or 2, respectively. [31] proposed the iterative process of optimizing the Bayesian features including Bayesian neural networks by using Gauss-Newton approximation to compute the Hessian matrix of the objective error function E reg .

Gaussian processes
Gaussian process regression [32,33], a collection of random variables such that the distribution of any finite selection of them follows the joint Gaussian distribution, can be fully determined by the mean function and the covariance function which can be represented as follows: where f(x) is the Gaussian process regression function and m(x) and k(x,x 0 ) are its mean and covariance function respectively. Suppose that the data set D ¼ fðx i ; y i Þg N i¼1 is given where the variance of the noise of the output values is denoted by σ 2 . Then, because the any finite joint distribution follows the multivariate Gaussian distribution described by Eqs (4) and (5), the likelihood of the Gaussian process model can be calculated as follows: where y = [y 1 , . . ., y N ] T and K is an N×N matrix whose ij'th component is k(x i ,x j ). Training Gaussian process means finding the hyperparameters in the mean and covariance functions and the output noise variance σ 2 ; these values maximize the likelihood function in Eq (6). After finding those hyperparameters, the prediction for the new input x Ã can be estimated as where For more details for Gaussian processes, see [34].
Assuming the regression form as f(x, w) = < w, ϕ(x) > +b, the support vector regression problem is defined to minimize the sum of errors with the regularization which minimizes kwk 2 to reduce the complexity of the model as follows: with the constraints for all i = 1, . . ., N. Applying Karush-Kuhn-Tucker conditions to the minimization problem above results in the following dual problem: with the constraints 0 a þ i ; a À i C for all i = 1, . . ., N. Then, the solutions for the primal problem are becomes for any k = 1, . . ., N. After solving the dual problem by using a quadratic programming solver, the predictive value for a new input x Ã becomes which can be represented without the basis function ϕ(x) itself but only with its inner product kernel function k(x, x 0 ).

Data description and procedures
In this section, we describe the proposed procedure to calculate the market impact cost by using nonparametric machine learning models with an example of single transaction data of representative US stocks.

General procedures
First, we suggest the general procedure to find market impact costs by using nonparametric regression models before the descriptions of the simulation conducted in the current paper. The whole procedure is classified into three stages: data collection, data preprocessing, and cost analysis. Fig 1 represents the summary of the whole procedure.
The main task at the first stage is data collection, which aims to gather necessary data. Collecting non-traditional data outside the market such as news, reports, opinions, and any other variables than may affect price or liquidity can also be useful as well as the traditional market variables because the nonparametric models do not require any restriction on the data and the general procedure of analyzing market impact costs using them will not be changed.
The gathered data from the first stage are preprocessed to make input variables that are used for learning process in the second stage. First, data cleaning processes such as outlier elimination and missing value imputation are conducted. Then, the input variables that will be used for the nonparametric models are derived from these cleaned data.
At the final stage, nonparametric models to estimate and predict market impact costs are constructed using the input variables created in the previous stage. In addition, any other analyses using the constructed model including testing statistical significance can be conducted at this stage.

Data description
For simulating the proposed nonparametric approach, we gathered a single transaction data of the stocks of US equity markets from Bloomberg Terminal for the period from 2014/06/02 to 2014/06/26. We selected 17 representative firms that have large market capitals among each of S&P 500, S&P MidCap 400, and S&Ps SmallCap 600 indices. These indices are composed of large cap, mid cap, and small cap firms respectively. The tickers of the selected firms are presented in Table 1.
The collected transactions data are classified into three data sets, large cap, mid cap, and small cap by their capitals; another data set all cap includes all transactions regardless of the market capital. For each size of capitals, the number of collected transactions are approximately 15 million, 2 million and 1 million for large cap, mid cap, and small cap, respectively. Thus the all cap data set has approxiamtely 18 million transactions in total. The procedures in the following sections will be applied commonly to all of those data sets.

Creating and bucketing input variables
We made three key input variables, Size, Vol, and POV, which are also used in I-star model [10,11] and one output variable, the market impact cost. Considering that the I-star was originally applied to the daily-aggregated transactions, we slightly modified the input variables suitable for single high-frequent transactions. First, we define the market impact cost, denoted by cost, as where side is 1 if a trade is a buy-initiated trade and −1 if a trade is a sell-initiated trade, p 0 is a mid-price just before the trade, and p t is an executed price of the trade. Given that cost is multiplied by 10 4 , the unit of cost becomes basis point (bp). The first input variable Size is the normalized trade size as follows: where ATV is the average trade volume of the previous day. In the original I-star model, the imbalanced trade size is normalized by 30-day average daily volume because the trade size itself is daily = aggregated. In our research, to apply the single transactions, we divide each trade size by the one-day average of the single trades of the previous day. The second input variable Vol is defined as the 30-day volatility, and this value is the same with the original I-star model because volatility is the characteristic of each stock, unrelated to trade size or frequency. POV, the percentage of volume, in [11] is defined as Q/V where Q is the daily imbalanced size and V is the total trade volume of that day. A single transaction may be affected by market liquidity more locally rather than the liquidity of the whole day. Thus we define POV for single transactions as where V t (−τ, τ) is the total traded volume from τ minutes before the trade to τ minutes after the trade. Based on the previous study [7], we expected that the single transaction affects and is affected by the market within approximately 15 minutes and thus we decided that τ equals 15.
After creating input variables, we made three dimensional bins of input variables and bucketed the transactions into them. Finally, we selected bins containing enough number of transactions. The criterion number will be different for data sets. We selected the bins containing more than 20, 30, 60, 100 transactions; the number of survived bins are 2931, 3356, 5706, 5119 for small cap, mid cap, large cap, and all cap, respectively.

Analyzing market impact costs
With respect to the bins of transactions to be used for the nonparametric machine learning models, we set 70% of survived bins as the training set and the remaining 30% as the test set for each data set. To find appropriate parameter sets of nonparametric models, we used 10-fold cross validation for the training set. After finding the parameter set, each model was retrained for the entire training set with the chosen parameter set and applied to the test set. As a parametric benchmark, we used I-star model with the same data sets. As described in Section, I-star model also requires finding certain parameters. We found the parameters for I-star model by grid search and 10-fold cross validation of the training set and applied it to the test set as the same with the nonparametric models. Finally, we applied Wilcoxon signed-rank test between each nonparametric model and the parametric benchmark for each capital size group to find whether the difference in performance between a nonparametric model and the benchmark is significant.

Predicting market impact costs
First, we applied the nonparametric machine learning models and the benchmark parametric model, I-star model, to the selected bins of each data set. To estimate the errors of the model, we used four different measures, mean absolute error (MAE), relative MAE (RMAE), root mean squared error (RMS), and relative RMS (RRMS). The summarized results are shown in Tables 2-5. NN, BNN, SVR, GP, and I-star refer to neural network, Bayesian neural network, support vector regression, Gaussian process, and I-star model, respectively.
From Tables 2-5, all the nonparametric machine learning approaches fit the data distribution better than does the parametric benchmark with the same input features and instances, as expected. Secondly, the compared nonparametric machine learning models indicated different performances. For example, Bayesian neural networks reduced the errors from 7.27% to 43.00% relative to I-star model but support vector regression reduced the errors just from -0.005% to 15.03%. This phenomenon is more clarified by Fig 2 which represented the errors in the tables above.
In summary, we find that the three nonparametric models, neural network, Bayesian neural network, and Gaussian process, shows much better performances than the parametric   benchmark while support vector regression model performs slightly better than the benchmark and worse than the other nonparametric models in general. In some measures such as RMS, support vector regression performs slightly worse than the benchmark model for small cap data set.
To validate the proposed approach, in addition, we applied the Wilcoxon signed-rank test to each pair of the instance-wise test errors of one nonparametric model and the benchmark parametric model for each error measure. The p-values obtained by normal approximation of the Wilcoxon signed-rank test and all of them were smaller than 0.05, which is usually considered as a critical value of the statistical significance. Even though the averaged RMS of SVR Predicting Market Impact Costs Using Nonparametric Machine Learning Models prediction is larger than the benchmark, the predicted performance of SVR was judged significantly better than the parametric benchmark by the Wilcoxon signed-rank test.

Discussion
In this study, we introduced the nonparametric approaches to estimate and predict market impact costs and applied them to US stock markets with the three input variables used in I-star model, the parametric benchmark. This study has several features. First, four state-of-the-art nonparametric machine learning models including NN, BNN, GP, and SVR have been applied to the task of analyzing market impact cost, whereas the previous studies were focused on only parametric models. The nonparametric machine learning approaches have advantages in both prediction performance and versatility in the number of input variables. Second, the data set used in this study was highly extensive. A total 17 firms were selected from each indices of large, mid, and small cap firms, while the previous studies mostly focused on the large cap firms. The total amount of transactions used in this study exceeded 18 million. Finally, the market impact prediction in this paper used independent variables from single transactions. Thus, this prediction has the advantages to be applied to technological high-frequency trades compared with previous studies that analyzed only large trades.
As a result of the experiments performed in this study, the performances of nonparametric machine learning models mostly overwhelmed the benchmark model with the same input variables for all kinds of firm sizes and all error measures. In particular, BNN, NN, and GP showed noticeably better performances, whereas SVR sometimes performed worse than or comparably to the benchmark model. The statistical significance of the predictive powers of nonparametric approaches was also validated by applying the Wilcoxon signed rank test to the test error.
However, one of the limitations of the nonparametric machine learning approaches is that they cannot directly separate the market impact cost into permanent and temporary cost, whereas some parametric models can. Considering that the total, or instantaneous impact primarily affects the price at which the transaction occurs, an indirect way to analyze the permanent or temporary portion of it using nonparametric models can be helpful in analyzing market characteristics. In addition, though the nonparametric models usually performed better than did the parametric benchmark with the same input variables, the magnitude of performance difference can be changed if the period or the selected firms vary.
This study implies possibilities to be extended on some points. First, the nonparametric machine learning models has the advantage over parametric models in that the input variables can be added freely without any limitations. Thus, studies related to the nonmarket variables affecting the market impact can be easily incorporated into nonparametric models rather than parametric ones which are formed with a priori fixed input variables. Next, several parametric models explain the market impact cost. However, they are difficult to compare because their input variables are varied, as is their number. In such cases, a nonparametric machine learning model with the same inputs as the parametric models can provide a performance benchmark. Finally, developing hybrid models of nonparametric and parametric ones that comprise the permanent and temporary portions of the market impact cost as well as that maintain the extendability and the performance level remains a future research topic related to this study.