A mutual information based R-vine copula strategy to estimate VaR in high frequency stock market data

In this paper, we explore mutual information based stock networks to build a regular vine (R-vine) copula structure on high frequency log returns of stocks and use it to estimate the Value at Risk (VaR) of a portfolio of stocks. Our model is data driven: it learns from high frequency time series of log returns of the top 50 stocks listed on the National Stock Exchange (NSE) in India for the year 2014. The Ljung-Box test revealed the presence of autocorrelation as well as heteroscedasticity in the underlying time series. Comparing the goodness of fit of a number of variants of the GARCH model on each working day of the year 2014 (229 days in all), ARMA(1,1)-EGARCH(1,1) demonstrated the best fit. The joint probability distribution of the portfolio is computed by constructing an R-vine copula structure on the data, with a mutual information guided minimum spanning tree as the key building block. The joint PDF is then fed into a Monte-Carlo simulation procedure to compute the VaR. When the mutual information is replaced by Kendall's Tau in the construction of the R-vine copula structure, the resulting VaR estimates turn out to be inferior, suggesting the presence of non-linear relationships among stock returns.


1) Introduction
Developing multivariate models and estimating joint density functions is an area of key interest amongst researchers not only in finance but also in various other fields [1][2][3]. In finance, researchers have long discarded the multivariate Gaussian distribution for log returns of stocks, and hence methods to estimate the joint distribution of stock returns have always attracted a lot of interest [4]. In this paper we use copula functions to achieve the important goal of estimating the joint probability distribution of the portfolio. Sklar's theorem [5] expresses a multivariate cumulative distribution function in terms of the univariate cumulative distribution functions and a copula function. So we need to overcome two challenges: firstly, to identify the probability distributions of the individual stocks and secondly, to devise a computationally efficient method of combining these marginal distributions with an appropriate copula to obtain the joint distribution of the portfolio. The Kolmogorov-Smirnov test [6] suggested the Student's t-distribution as a good choice for the probability distribution functions of the individual stocks. The second step was handled using the R-vine copula structure, which was originally introduced in [7][8][9] by extending the concept of Markov trees. Two special subclasses of R-vine copulas, namely the D-vine and C-vine copulas, were studied in [10] and have since become immensely popular in the analysis of financial data owing to their simple structure [11][12][13][14]. Working with a general R-vine structure is computationally challenging, especially in higher dimensions. The sequential algorithm due to Dißmann et al. [15] is a breakthrough in this direction and enables an efficient construction of general R-vine copula structures in higher dimensions. In [15], a joint distribution function of 16 variables is computed; in this paper we go as high as 50 variables via this algorithm.
It is relevant to point out that the construction of the R-vine structure in [15] made use of Kendall's Tau, a non-parametric measure that captures the ordinal association between two random variables and can also indicate a non-linear (monotone) relationship between them. In [16][17][18], this approach has been applied successfully to a number of financial markets. Some very recent works [19][20][21][22][23] reveal the growing popularity of mutual information between two random variables as a quantifier of a linear or non-linear relationship. Mutual information (MI) between two random variables is defined to be the relative entropy between the joint distribution and the product of the marginal distributions. A direct consequence of this definition is that the MI of independent random variables is zero. MI captures the reduction in the uncertainty of one random variable given knowledge of another random variable. In particular, Sharma and Habib [23], in the context of high frequency data, have demonstrated that mutual information based methods capture the non-linear relationship between log returns better than Spearman correlation based methods. This observation motivates a key part of the present paper, which deals with the computation of the joint density function of log returns of stocks using a mutual information based R-vine copula structure. Our analysis begins with the removal of autocorrelation and heteroscedasticity from the time series data of log returns using GARCH models. A number of popular GARCH models were fitted and the best of the lot turned out to be ARMA(1,1)-EGARCH(1,1). Next, the R-vine copula structure was constructed using the error (residual) terms of the ARMA(1,1)-EGARCH(1,1) model. This approach is similar to the one taken in [18], in which data of daily returns of 96 stocks listed on the S&P index was analysed.
The above R-vine structure is then used to estimate the Value at Risk (VaR) of portfolios through Monte-Carlo simulation. Recently, multivariate copula based models for the estimation of VaR have been proposed in [24], and a Kendall's Tau based vine copula model for estimating VaR is presented in [18]. For earlier models focused on VaR estimation, the reader is referred to [25,26]. However, none of these models employ mutual information.
We have considered 5% and 10% VaRs for portfolios consisting of 5, 10, 25 and 50 stocks in our analysis.
The remaining part of the paper is divided into 4 sections. In section 2, we give a brief description of the data used in our analysis. In section 3, we give an overview of the methods and methodology used. In section 4, we compare the effectiveness of VaR estimation based on the Kendall's Tau and MI methods. In the last section, we summarize our observations and findings.
2) Data description
Our data consists of high frequency trade records of stocks listed on the National Stock Exchange (NSE) for each working day of the year 2014. The working hours of the NSE are from 9:00AM till 4:00PM. We divide this duration into time intervals of length 30 seconds and call each such interval a tick. The interval length of 30 seconds ensures sufficiently many data points for the fitted models to have small bias. We chose to ignore the data from the first and the last half an hour (that is, 9-9:30AM and 3:30-4PM) due to some ambiguity and incompleteness in the recorded data. Thus the total number of ticks considered for each working day is 720. In general, corresponding to any tick t, that is, in the t-th 30-second interval, there will be several transactions for various stocks. Let v_{t,i;k} be the volume of stock k traded at an instant i (within the duration corresponding to the tick t) and S_{t,i;k} be the price of stock k at the instant i. We now define the volume weighted average price S_VWAP(t,k) for the tick t by

S_VWAP(t,k) = ( Σ_i v_{t,i;k} S_{t,i;k} ) / ( Σ_i v_{t,i;k} ),    (1)

where the summation runs over all instants within the 30-second duration corresponding to the tick t. The log return of each stock k at tick t is given by

R_{t,k} = ln( S_VWAP(t,k) / S_VWAP(t-1,k) ).    (2)

In our data we encountered 30-second intervals in which no trade was recorded, which would make formula (1) indeterminate for those ticks. To overcome this issue, the most recent value of S_VWAP for each stock k was carried forward for these ticks. We include in our analysis only the 50 stocks that had either no gap intervals or very few gap intervals; in other words, these stocks are highly traded in the market.
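The VWAP and log-return computation described above can be sketched as follows (a minimal Python sketch; the function names and the list-of-arrays input layout are illustrative choices, not from the paper):

```python
import numpy as np

def vwap_series(prices, volumes):
    """Volume weighted average price per tick, as in Eq (1).

    prices, volumes: lists with one array per 30-second tick, possibly
    empty when no trade occurred in that tick. Empty ticks are forward
    filled with the most recent VWAP, as described in the text.
    """
    vwap = []
    last = None
    for p, v in zip(prices, volumes):
        p, v = np.asarray(p, float), np.asarray(v, float)
        if v.sum() > 0:
            last = float((p * v).sum() / v.sum())
        # if no trade occurred in this tick, reuse the most recent VWAP
        vwap.append(last)
    return np.array(vwap, float)

def log_returns(vwap):
    """R_{t,k} = ln(S_VWAP(t,k) / S_VWAP(t-1,k)), as in Eq (2)."""
    return np.diff(np.log(vwap))
```

For example, a tick with trades at prices 10 and 11 and volumes 1 and 3 yields a VWAP of 10.75, and a subsequent empty tick carries that value forward.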
Also, 2014 was the year when General Elections were held in India and a change in government was seen after 10 years. One may expect high volatility during election times. We wanted to study how our model is impacted during such times. Thus our discrete time series data was studied under three periods: pre-election, election and post-election.

3) Methods and methodology

Pair copula construction
Before we explain the construction of the R-vine structure, it is important to have a clear understanding of a copula, so we first recall some preliminaries. For any natural number n, let I^n denote the unit cube in the extended n-dimensional space R̄^n. The elements of R̄^n are n-tuples of extended real numbers: a = (a_1,...,a_n). For any a, b ∈ R̄^n, we shall write a ≤ b whenever a_i ≤ b_i for all i. Now for any a ≤ b, the Cartesian product of closed intervals, B = [a_1,b_1] × ... × [a_n,b_n], is called an n-box and will be denoted by [a,b]. The set of vertices, V, of B is the collection of all n-tuples (c_1,...,c_n) for which each c_i = a_i or b_i. Let H be a real valued function with domain of the form S_1 × ... × S_n, where each S_i is a subset of the extended real numbers R̄.
The H-volume of B is defined to be the sum

V_H(B) = Σ_{c ∈ V} sgn(c) H(c),

where sgn(c) = +1 if c_i = a_i for an even number of indices i, and sgn(c) = −1 otherwise. Note that the above summation is finite since the total number of vertices is finite. The reader is referred to [27] for other equivalent forms of V_H(B). An n-dimensional copula is a function C: I^n → I satisfying the following axioms: (i) C(u) = 0 whenever u_i = 0 for some i; (ii) C(u) = u_i whenever u_j = 1 for all j ≠ i; and (iii) V_C(B) ≥ 0 for every n-box B contained in I^n. A real valued function F defined on R̄^n is called an n-dimensional distribution function if (i) V_F(B) ≥ 0 for all n-boxes B with vertices in R̄^n; and (ii) F(u) = 0 whenever u_i = −∞ for some i.
It has been established in [27] that an n-dimensional distribution function F has one-dimensional marginal distribution functions F_1, F_2,...,F_n.
The famous Sklar's theorem guarantees that for every n-dimensional distribution function F with marginals F_1,...,F_n there exists an n-dimensional copula C such that

F(x_1,...,x_n) = C(F_1(x_1),...,F_n(x_n)).    (3)

However, we are more interested in the converse, which states that for a given n-dimensional copula C and univariate distribution functions F_1,...,F_n, the function F defined by Eq (3) is an n-dimensional distribution function with marginals F_1,...,F_n, and the corresponding joint density is

f(x_1,...,x_n) = c(F_1(x_1),...,F_n(x_n)) · f_1(x_1) ··· f_n(x_n),

where c is the nth order mixed partial derivative of C. Thus, if we wish to study the joint behaviour of n random variables, we can first fit the marginal distribution function of each random variable separately and then combine them through an appropriate multivariate copula.
The process of constructing multivariate copula that we adopt is the Pair-wise Copula Construction (PCC) which relies on Vine copulas (or pair copulas) introduced in [7]. At the heart of this process lies the fact that a joint copula function is broken down as product of bivariate copula functions that can be estimated independently. Thus, bivariate copulas are building blocks for the PCC method.
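As a simple illustration of Sklar's converse before moving to vines, the sketch below draws samples whose dependence comes from a single Gaussian copula and whose marginals are Student's t, mirroring the t marginals fitted in the paper. The function name, the correlation matrix, and the degrees of freedom are all illustrative assumptions:

```python
import numpy as np
from scipy import stats

def sample_gauss_copula_t_margins(n, corr, df, seed=0):
    """Draw n samples whose dependence is a Gaussian copula with
    correlation matrix `corr` and whose marginals are Student's t
    with `df` degrees of freedom."""
    rng = np.random.default_rng(seed)
    d = corr.shape[0]
    z = rng.multivariate_normal(np.zeros(d), corr, size=n)  # latent Gaussian
    u = stats.norm.cdf(z)       # copula sample in [0,1]^d (probability scale)
    return stats.t.ppf(u, df)   # apply inverse t marginals (Sklar's converse)
```

The transformation to uniforms and back through the inverse marginal CDFs is exactly the mechanism by which a copula separates dependence from marginal behaviour.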
An R-vine on n variables, as introduced by Bedford and Cooke [9], is a finite sequence of trees T_j = (N_j, E_j), j = 1,2,...,n−1, with node sets N_j and edge sets E_j satisfying the conditions:
(i) The tree T_1 has nodes N_1 = {1,2,...,n}.
(ii) For j = 2,...,n−1, the tree T_j is connected with node set N_j = E_{j−1}, so that the cardinality of N_j is n−j+1.
(iii) Let a = {a_1,a_2} and b = {b_1,b_2} be two elements of N_j (2 ≤ j ≤ n−1); then {a,b} ∈ E_j only if the cardinality of a ∩ b is exactly one.
The last axiom says that we join two nodes by an edge only when these nodes, interpreted as edges of the preceding tree, have exactly one node of the preceding tree in common.
Bedford and Cooke [9] follow a convenient way of enumerating the nodes of trees in an R-vine structure in terms of conditioned and conditioning sets. For further details and illustrative examples the reader may refer to [9,15].
We make use of the same enumeration strategy to write down the probability density function corresponding to the distribution realized by the R-vine copula structure for the portfolio of stocks.
In order to construct an R-vine structure on stocks, we start with a tree T_1 with n nodes (N_1), one for each stock, and edge set E_1. In our analysis, T_1 is taken to be the minimum spanning tree of the stock network based on either the mutual information metric (Eq 10) or the Kendall's Tau based metric (Eq 11). Each edge e in E_1 is associated with a bivariate copula C_{s(e),t(e)}, where s(e) and t(e) are the nodes connected by e. We then move on to the next tree, T_2, whose node set N_2 is the edge set E_1. Each node in T_2 is thus represented by a copula C_{s(e),t(e)}, and each edge in E_2 is associated with a conditional copula C_{s(e),t(e)|D(e)}, where D(e) is the common node. Similarly we keep on building the trees T_3, T_4,...,T_{n−1}.
Once we have constructed an R-vine structure on n stocks with random variables X_1, X_2,...,X_n, their joint density function with marginal density functions f_1, f_2,...,f_n is given by

f(x_1,...,x_n) = Π_{k=1}^{n} f_k(x_k) · Π_{j=1}^{n−1} Π_{e ∈ E_j} c_{s(e),t(e)|D(e)}( F_{s(e)|D(e)}, F_{t(e)|D(e)} ),    (4)

where F_{s(e)|D(e)} is the distribution function of the conditional random variable X_{s(e)}|X_{D(e)} and c_{s(e),t(e)|D(e)} is the second order mixed partial derivative of the copula connecting X_{s(e)}|X_{D(e)} and X_{t(e)}|X_{D(e)}. For example, the joint density function f of three random variables can be decomposed as

f(x_1,x_2,x_3) = f_1(x_1) · f_{2|1}(x_2|x_1) · f_{3|1,2}(x_3|x_1,x_2),

where f_{2|1} and f_{3|1,2} denote conditional density functions. We can further decompose the conditional density f_{2|1} as

f_{2|1}(x_2|x_1) = f_{12}(x_1,x_2)/f_1(x_1) = c_{12}(F_1(x_1),F_2(x_2)) · f_2(x_2),    (5)

where f_{12} is the joint density of variables 1 and 2, and c_{12} is the second order mixed partial derivative of the copula C_{12} connecting variables 1 and 2. Similarly, we have

f_{3|1,2}(x_3|x_1,x_2) = c_{13|2}(F_{1|2}(x_1|x_2),F_{3|2}(x_3|x_2)) · c_{23}(F_2(x_2),F_3(x_3)) · f_3(x_3).    (6)

Thus, using Eqs (5) and (6), the joint density function of the 3 variables can be decomposed as

f(x_1,x_2,x_3) = f_1(x_1) f_2(x_2) f_3(x_3) · c_{12}(F_1,F_2) · c_{23}(F_2,F_3) · c_{13|2}(F_{1|2},F_{3|2}).

The analogous R-vine copula is given in Fig 1. For fast execution of statistical methods such as maximum likelihood estimation, Morales-Nápoles et al. [28] proposed an efficient scheme of storing an R-vine on n variables as an n × n lower triangular matrix M = (m_{ij}). The matrix M has interesting properties: each column has distinct elements, and deleting the first row and first column of M yields an (n−1)-dimensional R-vine matrix.
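The three-variable decomposition above can be checked numerically. The sketch below implements f = f1 · f2 · f3 · c12 · c23 · c13|2 with Gaussian pair copulas (an illustrative family choice; the paper selects pair-copula families from data), using the Gaussian h-function for the conditional distributions F_{1|2} and F_{3|2}:

```python
import numpy as np
from scipy import stats

def gauss_pair_density(u, v, rho):
    """Density of the bivariate Gaussian copula c(u, v; rho)."""
    x, y = stats.norm.ppf(u), stats.norm.ppf(v)
    return np.exp((2*rho*x*y - rho**2*(x**2 + y**2)) / (2*(1 - rho**2))) \
        / np.sqrt(1 - rho**2)

def gauss_h(u, v, rho):
    """h-function: conditional distribution F(u | v) under a Gaussian copula."""
    x, y = stats.norm.ppf(u), stats.norm.ppf(v)
    return stats.norm.cdf((x - rho*y) / np.sqrt(1 - rho**2))

def dvine3_density(x, f, F, r12, r23, r13_2):
    """Pair-copula density of 3 variables on the D-vine 1-2-3:
    f1 f2 f3 * c12 * c23 * c13|2, with Gaussian pair copulas."""
    u = [F[i](x[i]) for i in range(3)]
    dens = f[0](x[0]) * f[1](x[1]) * f[2](x[2])
    dens *= gauss_pair_density(u[0], u[1], r12)   # c12(F1, F2)
    dens *= gauss_pair_density(u[1], u[2], r23)   # c23(F2, F3)
    u1_2 = gauss_h(u[0], u[1], r12)               # F(x1 | x2)
    u3_2 = gauss_h(u[2], u[1], r23)               # F(x3 | x2)
    dens *= gauss_pair_density(u1_2, u3_2, r13_2) # c13|2
    return dens
```

With standard normal marginals and r13_2 interpreted as a partial correlation, this density coincides with the trivariate Gaussian density, which gives a convenient correctness check.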
The decomposition in Eq (4) can now be expressed in terms of the R-vine matrix:

f(x_1,...,x_n) = Π_{k=1}^{n} f_k(x_k) · Π_{k=n−1}^{1} Π_{i=n}^{k+1} c_{m_{k,k},m_{i,k}|m_{i+1,k},...,m_{n,k}}( F(x_{m_{k,k}}|x_{m_{i+1,k}},...,x_{m_{n,k}}), F(x_{m_{i,k}}|x_{m_{i+1,k}},...,x_{m_{n,k}}) ).    (7)

Note that the above expression is entirely in terms of bivariate copula functions. An efficient algorithm for computing the conditional distributions appearing as arguments of these copula functions has been proposed in [15].

Mutual information and Kendall's Tau based metrics
Mutual information (MI) between two random variables captures the mutual dependence between them and is zero if and only if they are independent. It can be expressed as the difference between the sum of the individual entropies of the random variables and their joint entropy. The mutual information of discrete random variables X and Y is defined as

I(X;Y) = Σ_x Σ_y p(x,y) log( p(x,y) / (p(x)p(y)) ).    (8)

Its generalization to the continuous case is

I(X;Y) = ∫∫ f(x,y) log( f(x,y) / (f(x)f(y)) ) dx dy.    (9)

Based on mutual information, the normalized distance [23] between two random variables X and Y is defined as

d(X,Y) = 1 − I(X;Y)/H(X,Y),    (10)

where I is the mutual information and H is the joint entropy. Using this metric, we can construct a minimum spanning tree (MST) network on n stocks. There are two well-known methods to construct a minimum spanning tree: Kruskal's algorithm and Prim's algorithm. We used Prim's algorithm for the construction of the stock networks, since stock networks are dense and in such cases Prim's algorithm works well. We also considered building stock networks based on the Kendall's Tau quantifier. The metric used is

d_τ(X,Y) = 1 − |τ_{X,Y}|,    (11)

where τ_{X,Y} is the Kendall's Tau coefficient between X and Y. Sharma and Habib [23] studied MI based stock networks and showed the existence of nonlinearity in stock returns data at the high frequency level.
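A minimal sketch of the MI based distance (Eq 10) and of Prim's algorithm follows. The histogram plug-in MI estimator and the naive dense-matrix Prim loop are illustrative implementation choices, not the paper's code:

```python
import numpy as np

def mi_distance(x, y, bins=16):
    """Normalized MI distance d(X,Y) = 1 - I(X;Y)/H(X,Y), estimated from
    a 2-D histogram (a simple plug-in estimator; `bins` is a tuning choice)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    H_xy = -(pxy[nz] * np.log(pxy[nz])).sum()        # joint entropy H(X,Y)
    H_x = -(px[px > 0] * np.log(px[px > 0])).sum()   # marginal entropy H(X)
    H_y = -(py[py > 0] * np.log(py[py > 0])).sum()   # marginal entropy H(Y)
    I = H_x + H_y - H_xy                             # mutual information
    return 1.0 - I / H_xy

def prim_mst(dist):
    """Prim's algorithm on a dense distance matrix; returns MST edge list."""
    n = dist.shape[0]
    in_tree = [0]
    edges = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and (best is None or dist[i, j] < best[2]):
                    best = (i, j, dist[i, j])
        edges.append(best[:2])
        in_tree.append(best[1])
    return edges
```

The distance of a variable to itself is (up to estimation error) zero, and the distance between independent samples is close to one, which is the behaviour Eq (10) is designed to have.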

Fitting univariate models to log returns of stocks
A stochastic process R_1, R_2,...,R_t is a white noise process with mean μ and variance σ² if E(R_t) = μ for all t, Var(R_t) = σ² for all t, and Cov(R_t,R_s) = 0 for all t ≠ s. In order to check whether the log returns of stocks exhibit the properties of white noise, we carried out the Ljung-Box test [29] for autocorrelation and heteroscedasticity at the 1% level of significance. Presence of autocorrelation and heteroscedasticity can be seen at a lag of 1. Thus, GARCH methods are applied to our data with the aim of removing the autocorrelation and heteroscedasticity in the time series. We tested the GARCH(1,1), ARMA(1,1)-GARCH(1,1) and ARMA(1,1)-EGARCH(1,1) models with the error terms modelled by the Student's t-distribution [30,31]. We also tried fitting the Normal Inverse Gaussian (NIG) distribution on the error terms, and used the Kolmogorov-Smirnov test to check the goodness of fit of the univariate distributions on the errors. Both the NIG and Student's t distributions turn out to be better choices than the normal distribution. Owing to its computational simplicity, we used the Student's t distribution in our model. In Fig 3, we summarize the p-values corresponding to the test applied to the error terms obtained after fitting the ARMA(1,1)-EGARCH(1,1) model for each of the 50 stocks, computed daily.
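The Ljung-Box statistic used here is Q = n(n+2) Σ_{k=1}^{h} ρ̂_k²/(n−k), compared against a χ²(h) distribution; applying it to the squared series tests for heteroscedasticity (ARCH effects). A self-contained sketch (illustrative, not the paper's code):

```python
import numpy as np
from scipy import stats

def ljung_box(x, lags=1):
    """Ljung-Box Q statistic and p-value for lags 1..`lags`.
    Under H0 (white noise), Q is approximately chi-squared with `lags` df.
    Applying it to x**2 tests for heteroscedasticity (ARCH effects)."""
    x = np.asarray(x, float) - np.mean(x)
    n = len(x)
    denom = np.dot(x, x)
    q = 0.0
    for k in range(1, lags + 1):
        rho_k = np.dot(x[:-k], x[k:]) / denom   # lag-k sample autocorrelation
        q += rho_k**2 / (n - k)
    q *= n * (n + 2)
    return q, stats.chi2.sf(q, lags)
```

On a strongly autocorrelated series (e.g. an AR(1) with coefficient 0.8) the p-value is essentially zero, whereas on white noise the statistic stays small.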
In all the equations given below, R_{t,k} is as defined in Eq (2). The GARCH(1,1) model [31] for the kth stock is given by

R_{t,k} = μ_k + ε_{t,k},   ε_{t,k} = σ_{t,k} z_{t,k},   σ²_{t,k} = ω_k + a_k ε²_{t−1,k} + b_k σ²_{t−1,k},

where we fit a Student's t-distribution to the noise ε_{t,k}. The ARMA(1,1)-GARCH(1,1) model adds an ARMA(1,1) mean equation, R_{t,k} = μ_k + φ_k R_{t−1,k} + θ_k ε_{t−1,k} + ε_{t,k}, with the same variance equation, while the ARMA(1,1)-EGARCH(1,1) model keeps this mean equation and replaces the variance equation by

ln σ²_{t,k} = ω_k + a_k( |z_{t−1,k}| − E|z_{t−1,k}| ) + γ_k z_{t−1,k} + b_k ln σ²_{t−1,k}.
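To make the EGARCH recursion concrete, the sketch below simulates an ARMA(1,1)-EGARCH(1,1) path. The parameter values are illustrative, and Gaussian innovations are used in place of the paper's Student's t for simplicity:

```python
import numpy as np

def simulate_arma_egarch(n, mu=0.0, phi=0.5, theta=-0.2,
                         omega=-0.1, alpha=0.15, gamma=-0.05, beta=0.95,
                         seed=0):
    """Simulate an ARMA(1,1)-EGARCH(1,1) path with Gaussian innovations.
    Returns (returns r, residuals eps, conditional variances sigma^2)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    Ez = np.sqrt(2 / np.pi)               # E|z| for a standard normal
    r = np.zeros(n); eps = np.zeros(n); log_s2 = np.zeros(n)
    log_s2[0] = omega / (1 - beta)        # start at the unconditional level
    for t in range(n):
        if t > 0:
            # EGARCH(1,1): log-variance recursion with leverage term gamma
            log_s2[t] = omega + alpha * (abs(z[t-1]) - Ez) \
                        + gamma * z[t-1] + beta * log_s2[t-1]
        s = np.exp(0.5 * log_s2[t])
        eps[t] = s * z[t]
        # ARMA(1,1) mean equation
        r[t] = mu + (phi * r[t-1] if t > 0 else 0.0) \
               + (theta * eps[t-1] if t > 0 else 0.0) + eps[t]
    return r, eps, np.exp(log_s2)
```

Because the recursion is on the log-variance, positivity of σ² holds without parameter constraints, and the γ term lets negative shocks move volatility differently from positive ones, which is the usual motivation for EGARCH on financial returns.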
For all three models, we tested whether the noise term ε_{t,k} exhibits the properties of white noise by again running Ljung-Box tests at the 1% level of significance. Figs 4 and 5 correspond to the results obtained from running the Ljung-Box test on ε_{t,k} and ε²_{t,k} respectively. Clearly, ARMA(1,1)-EGARCH(1,1) proves to be the better fitted model in comparison to the other models. We also use AIC values to compare the three methods: 94.84% of the time, ARMA(1,1)-EGARCH(1,1) was seen to have the lowest AIC value, so it again emerged as a better fit in comparison to the other two methods. We used the adjusted Pearson chi-squared goodness of fit test [32] to check the effectiveness of the univariate model for each stock on each working day at the 5% and 1% levels of significance. Fig 6 shows whether the null hypothesis, H_0: ARMA(1,1)-EGARCH(1,1) is a good fit, was rejected (black) or not rejected (white) for each stock on each working day. 32 stocks out of 50 were seen to pass the test more than 90% of the time, i.e. the null hypothesis was not rejected at the 1% level of significance more than 90% of the time. Moreover, every stock showed a good fit more than 72% of the time. Thus, we conclude that ARMA(1,1)-EGARCH(1,1) is a good fit.

Value at risk (VaR) prediction
Value at risk (VaR) of a portfolio is a measure of the risk associated with it. For example, if a portfolio has a one-tick 5% VaR of amount x, then there is a 5% chance that the portfolio loses value by an amount x over the duration of one tick in the absence of trading. It is well known that the α% VaR of a portfolio is given by the α-percentile of the log returns of the portfolio [25]. Once the joint distribution function of the n stocks is known, we can use a Monte-Carlo simulation to estimate the VaR of the underlying portfolio. In this paper we have drawn inferences by calculating the VaR for equally weighted portfolios of 5, 10, 25, and 50 stocks. Consider a portfolio consisting of n stocks, where S_VWAP(t,k) and R_{t+1,k} are as defined in Eqs (1) and (2). Let w_k be the weight associated with the kth stock in the portfolio and S_{t,P} be the value of the portfolio corresponding to the tick t. Then the log return of the portfolio R_{t+1,P} in the time interval [t,t+1] is given by

R_{t+1,P} = ln( S_{t+1,P} / S_{t,P} ) = ln( Σ_k w_k S_VWAP(t+1,k) / Σ_k w_k S_VWAP(t,k) ).

Using the identities e^x ≈ 1+x and ln(1+x) ≈ x for small x in the above equation, we get

R_{t+1,P} ≈ Σ_{k=1}^{n} w_k R_{t+1,k}.    (16)

We first use ARMA(1,1)-EGARCH(1,1) to model the univariate log returns of each stock and then use the R-vine copula construction on the error terms ε_{t,k} to estimate the joint copula of the error terms. We fit the model on the first 4 hours of each day and use it to predict the VaR for the next 2 hours. We summarize the algorithm below:

PLOS ONE
1. Consider log returns of each stock for the first 4 hours i.e. 9:30AM to 1:30PM (this gives 480 terms in each time series) on each day.
2. Fit an ARMA(1,1)-EGARCH(1,1) model to the log returns of each stock obtained in step 1, with a univariate Student's t-distribution assumed on the error term ε_{t,k} of each stock k. So if there are n stocks in the portfolio, the data generated at this step can be written conveniently as (ε_{t,k})_{480×n}.
3. Fit an R-vine copula structure to the random variables ε_{t,1}, ε_{t,2},...,ε_{t,n} (sampled at the 480 ticks in step 2) to obtain the joint distribution of the error terms. In the R-vine algorithm we choose the first tree T_1 as the minimum spanning tree based on the Kendall's Tau metric (Eq 11) and also the MI based metric (Eq 10). In this paper we fitted the R-vine structure on n = 50 stocks.
4. Simulate N samples of the error terms from the fitted R-vine copula and pass them through the fitted ARMA(1,1)-EGARCH(1,1) models to obtain N simulated values of the portfolio log return R_481,P.
5. The α% VaR for the 481st instant, VaR_481,P, is now calculated by finding the α-percentile of the N simulated values of R_481,P. Here P is a portfolio whose size is chosen to be 5, 10, 25, or 50 stocks respectively. In this paper we have considered α = 5% and 10% respectively.
6. We then compare the actual R_481,P with the estimated VaR_481,P.
7. Once the actual R_481,k of each stock k is known, we can use Eq (16) to compute the actual portfolio log return and compare it with the estimated VaR.
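Steps 5-7 amount to a percentile computation on the simulated portfolio returns together with the weighted-sum approximation of the portfolio log return; a minimal sketch (function names and weights are illustrative):

```python
import numpy as np

def var_from_simulations(sim_returns, alpha):
    """alpha% VaR as the alpha-percentile of the N simulated portfolio log
    returns (step 5 of the algorithm), reported as a loss (positive number)."""
    return -np.percentile(sim_returns, alpha)

def portfolio_log_return(weights, stock_log_returns):
    """Linear approximation R_P ≈ sum_k w_k R_k, valid for small per-tick
    returns via e^x ≈ 1+x and ln(1+x) ≈ x."""
    return np.dot(stock_log_returns, weights)
```

For instance, for simulated returns drawn from N(0, 0.01), the 5% VaR is close to 1.645 × 0.01, the familiar Gaussian quantile.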

4) Discussion
Data for each day was divided into 2 subsets: training data from 9:30AM to 1:30PM and testing data from 1:31PM to 3:30PM. We fit both the Kendall's Tau based and the MI based vine copula structures on the training data as discussed in the previous section. We then estimated the VaRs corresponding to equally weighted portfolios for each time tick of the testing data. In our analysis we have considered portfolios consisting of all 50 stocks, as well as randomly picked sets of 25, 10 or 5 stocks, and we have considered 5% and 10% VaRs in all the cases. To check the effectiveness of our model we carried out the unconditional (UC) and conditional (CC) coverage tests formulated by Christoffersen [33]. There are 32, 39 and 113 days in the pre-election, election and post-election periods, respectively, for which our proposed model was a good fit. We carried out the hypothesis testing for each day and calculated the percentage of times the null hypothesis was not rejected; we refer to this percentage as the success rate of the model. Tables 1-3 summarize the results obtained in the pre-election, election and post-election periods. It was observed that the VaR predictions were more accurate for portfolios consisting of a small number of stocks (5 or 10) than for portfolios consisting of a large number of stocks (25 or 50). Also, the success rate of the MI based model was better than that of the Kendall's Tau based model 41 out of 96 times (42.71%), whereas the Kendall's Tau based model beat the MI based model only 6 out of 96 times (6.25%); 49 out of 96 times (51.04%), the success rates of the two methods were at par. One can also observe that even during the election period, which is full of uncertainties, the success rate of the model was quite high.
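Christoffersen's unconditional coverage test is a likelihood-ratio test on the series of VaR violation indicators; a sketch under the standard formulation (an illustrative implementation, not the paper's code):

```python
import numpy as np
from scipy import stats

def uc_test(violations, p):
    """Christoffersen's unconditional coverage (UC) test.
    violations: boolean array, True where the realized loss exceeded the VaR;
    p: nominal level (e.g. 0.05). Returns (LR statistic, p-value). The null
    hypothesis is that the violation rate equals p; LR ~ chi2(1) under H0."""
    v = np.asarray(violations, bool)
    n = v.size
    n1 = int(v.sum())          # number of violations
    n0 = n - n1                # number of non-violations
    pi = n1 / n                # observed violation rate
    # log-likelihoods under the nominal rate p and the observed rate pi
    ll0 = n0 * np.log(1 - p) + n1 * np.log(p)
    ll1 = (n0 * np.log(1 - pi) if n0 else 0.0) \
        + (n1 * np.log(pi) if n1 else 0.0)
    lr = -2 * (ll0 - ll1)
    return lr, stats.chi2.sf(lr, df=1)
```

When the observed violation rate matches the nominal level exactly (e.g. 12 violations in 240 ticks at p = 0.05), the LR statistic is zero and the test does not reject; a grossly excessive violation rate yields a vanishing p-value.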

5) Conclusion
This paper demonstrates the power of incorporating mutual information based metrics into the construction of R-vine copula structures for learning the joint distribution of a large portfolio of stocks. Among the univariate models considered, the AIC values of ARMA(1,1)-EGARCH(1,1) were found to be the lowest in comparison to the other methods (a lower AIC indicates a better fit). The joint distribution of the respective error terms of the ARMA(1,1)-EGARCH(1,1) model applied to each stock is then computed by learning R-vine copula structures in 2 ways: first, by starting with the minimum spanning tree computed on the basis of the mutual information metric; and second, by starting with the minimum spanning tree computed on the basis of the Kendall's Tau based metric. Next, the VaR of the underlying 50 stock portfolio is computed through Monte-Carlo simulations in both cases. Christoffersen's UC and CC tests show that the VaR predictions in the mutual information case outperform those in the Kendall's Tau case: the success rate obtained from the MI based method is higher than that of the Kendall's Tau based method on 42.71% of occasions, and on 51.04% of occasions the success rates of the two methods are at par. The predictions were quite good even during the election period, when there is a lot of anticipation amongst the buyers. We conclude that the MI based R-vine copula model is able to capture the joint distribution well and thus leads to better VaR predictions in a high frequency scenario.
Supporting information S1

Acknowledgments
The authors express their gratitude to Professor Amber Habib for his encouragement and valuable comments. The authors also thank the reviewers for their valuable suggestions, which have significantly enhanced the clarity of the paper.