## Abstract

In this paper, we explore mutual information based stock networks to build a regular vine (R-vine) copula structure on high frequency log returns of stocks and use it to estimate the Value at Risk (VaR) of a portfolio of stocks. Our model is data driven: it learns from high frequency time series of log returns of the top 50 stocks listed on the National Stock Exchange (NSE) of India for the year 2014. The Ljung-Box test revealed the presence of autocorrelation as well as heteroscedasticity in the underlying time series. Analysing the goodness of fit of a number of variants of the GARCH model on each working day of the year 2014, that is, 229 days in all, we observed that ARMA(1,1)-EGARCH(1,1) gave the best fit. The joint probability distribution of the portfolio is computed by constructing an R-vine copula structure on the data with the mutual information guided minimum spanning tree as the key building block. The joint PDF is then fed into a Monte-Carlo simulation procedure to compute the VaR. When mutual information is replaced by Kendall’s Tau in the construction of the R-vine copula structure, the resulting VaR estimates were found to be inferior, suggesting the presence of non-linear relationships among stock returns.

**Citation:** Sharma C, Sahni N (2021) A mutual information based R-vine copula strategy to estimate VaR in high frequency stock market data. PLoS ONE 16(6): e0253307. https://doi.org/10.1371/journal.pone.0253307

**Editor:** Alessandro Barbiero, Universita degli Studi di Milano, ITALY

**Received:** July 25, 2020; **Accepted:** June 3, 2021; **Published:** June 17, 2021

**Copyright:** © 2021 Sharma, Sahni. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability:** Raw data cannot be shared publicly because of a copyright agreement with NSE Data & Analytics, formerly known as DotEx International Ltd. (third party). Data are available from NSE Data & Analytics (contact via dotex_kraops@nse.co.in) for researchers who meet the criteria for access to confidential data. The terms of purchase prohibit us from redistributing the Historical Data or any component of it. However, the raw data can be purchased from NSE Data & Analytics at: https://www.nseindia.com/supra_global/content/dotex/data_products.htm We confirm that we did not have any special access privileges that others would not have. We also confirm that all the processed data corresponding to each figure and each table is uploaded as a supporting file named “S1 Table”.

**Funding:** The author(s) received no specific funding for this work.

**Competing interests:** No authors have competing interests.

## 1) Introduction

Developing multivariate models and estimating joint density functions is an area of key interest amongst researchers not only in finance but also in various other fields [1–3]. In finance, researchers have already discarded multivariate Gaussian distributions for log returns of stocks, and hence methods to estimate the joint distribution of stock returns have always attracted a lot of interest [4]. In this paper we use Copula functions to estimate the joint probability distribution of the portfolio. Sklar’s theorem [5] expresses a multivariate cumulative distribution function in terms of the univariate cumulative distribution functions and a Copula function. So we need to overcome two challenges: first, to identify the probability distributions of the individual stocks; and second, to devise a computationally efficient method of combining these marginal distributions with an appropriate Copula to obtain the joint distribution of the portfolio. The Kolmogorov-Smirnov test [6] suggested the Student’s t-distribution as a good choice for the probability distribution functions of the individual stocks. The second step was handled using the R-Vine Copula structure, which was originally introduced in [7–9] by extending the concept of Markov Trees. Two special subclasses of R-vine copulas, namely the D-vine and C-vine copulas, were studied in [10] and have since become immensely popular in the analysis of financial data owing to their simple structure [11–14]. Working with a general R-vine structure is computationally challenging, especially in higher dimensions. The sequential algorithm due to Dißmann et al. [15] is a breakthrough in this direction and enables an efficient construction of general R-vine copula structures in higher dimensions. In [15], the joint distribution function of 16 variables is computed; in this paper we go as high as 50 variables via this algorithm.
It is relevant to point out that the construction of the R-vine structure in [15] made use of Kendall’s Tau, a non-parametric measure of the ordinal (rank) association between two random variables, which can also pick up monotonic non-linear relationships between them. In [16–18], this approach has been applied successfully to a number of financial markets. Some very recent works [19–23] reveal the growing popularity of mutual information between two random variables as a quantifier of a linear or non-linear relationship. Mutual information (MI) between two random variables is defined to be the relative entropy between the joint distribution and the product of the marginal distributions. A direct consequence of this definition is that the MI of independent random variables is zero. MI captures the reduction in the uncertainty of one random variable given knowledge of another random variable. In particular, Sharma and Habib [23], in the context of high frequency data, have demonstrated that mutual information based methods capture the non-linear relationship between log returns better than Spearman correlation based methods. This observation motivates a key part of the present paper, which deals with the computation of the joint density function of log returns of stocks using a mutual information based R-vine copula structure.

Our analysis begins with the removal of autocorrelation and heteroscedasticity from the time series of log returns using GARCH models. A number of popular GARCH models were fitted and the best of the lot turned out to be ARMA(1,1)-EGARCH(1,1). Next, the R-Vine copula structure was constructed using the error (residual) terms of the ARMA(1,1)-EGARCH(1,1) model. This approach is similar to the one taken in [18], in which data of daily returns of 96 stocks listed on the S&P was analysed.

The above R-Vine structure is then used to estimate the Value at Risk (VaR) of portfolios through Monte-Carlo simulation. Recently, multivariate copula based models for the estimation of VaR have been proposed in [24], and a Kendall’s Tau based vine copula model for estimating VaR is presented in [18]. For earlier models focused on VaR estimation, the reader is referred to [25, 26]. However, none of these models has employed mutual information.

We have considered 5% and 10% VaRs for portfolios consisting of 5, 10, 25 and 50 stocks in our analysis.

The remaining part of the paper is divided into 4 sections. In section 2, we give a brief description of the data used in our analysis. In section 3, we give an overview of the methods and methodology used. In section 4, we compare the effectiveness of VaR estimation based on Kendall’s Tau method and MI method. In the last section, we summarize our observations and findings.

## 2) Data description

The high frequency data analysed in the present paper is an instant-by-instant record of the prices and volumes of all the stocks listed on the CNX100 index of the National Stock Exchange (NSE) for each working day of the year 2014. The working hours of the NSE are from 9:00AM till 4:00PM. We divide this duration into time intervals of length 30 seconds and call each such interval a tick. The interval length of 30 seconds ensures sufficiently many data points for the fitted models to have small bias. We chose to ignore the data for the first and the last half hour (that is, 9–9:30AM and 3:30–4PM) due to some ambiguity and incompleteness in the recorded data. Thus the total number of ticks considered for each working day is 720. In general, corresponding to any tick *t*, that is, in the *t*^{th} 30-second interval, there will be several transactions for various stocks. Let *V*_{i}(*t*,*k*) be the volume of the stock *k* traded at an instant *i* (within the duration corresponding to the tick *t*) and *P*_{i}(*t*,*k*) be the price of stock *k* at the instant *i*. We now define the volume weighted average price *S*_{VWAP}(*t*,*k*) for the tick *t* by

$$S_{VWAP}(t,k) = \frac{\sum_{i} V_{i}(t,k)\, P_{i}(t,k)}{\sum_{i} V_{i}(t,k)} \tag{1}$$

Here the summation runs over all instants *i* within the 30-second duration corresponding to the tick *t*. The log return of each stock *k* at tick *t* is given by

$$R_{t,k} = \ln\left(\frac{S_{VWAP}(t,k)}{S_{VWAP}(t-1,k)}\right) \tag{2}$$

In our data we encountered 30-second intervals in which no trade was recorded. The denominator in formula (1) is zero for those ticks, which makes the formula indeterminate. To overcome this issue, the most recent value of *S*_{VWAP} for the stock *k* was used for these ticks.
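As an illustration of Eqs (1) and (2), the following minimal Python sketch bins trades into 30-second ticks, computes the volume weighted average price, carries the previous value forward over empty ticks, and takes log returns. The function names, the `(seconds, price, volume)` record layout and the toy trades are assumptions made for this example; they are not taken from the NSE data format.

```python
import math

def vwap_per_tick(trades, n_ticks, tick_seconds=30):
    """trades: iterable of (seconds_since_start, price, volume) records."""
    value = [0.0] * n_ticks   # sum of price * volume within each tick
    volume = [0.0] * n_ticks  # sum of volume within each tick
    for seconds, price, vol in trades:
        t = int(seconds // tick_seconds)
        if 0 <= t < n_ticks:
            value[t] += price * vol
            volume[t] += vol
    vwap, last = [], None
    for t in range(n_ticks):
        if volume[t] > 0:
            last = value[t] / volume[t]
        vwap.append(last)  # an empty tick reuses the most recent VWAP
    return vwap

def log_returns(vwap):
    # Assumes the first tick contains at least one trade, so no entry is None.
    return [math.log(b / a) for a, b in zip(vwap, vwap[1:])]

# Toy example: two trades in tick 0, none in tick 1, one in tick 2.
trades = [(3, 100.0, 10), (20, 102.0, 30), (65, 101.0, 5)]
vwap = vwap_per_tick(trades, n_ticks=3)
returns = log_returns(vwap)
```

In this toy example the empty middle tick inherits the VWAP of the first tick, so its log return is zero.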

We include in our analysis only the 50 stocks that had either no gap interval or very few gap intervals. In other words, these stocks are highly traded in the market.

Also, 2014 was the year when General Elections were held in India and a change in government was seen after 10 years. One may expect high volatility around election time. We wanted to study how our model is affected during the pre-election, election and post-election periods. Thus our discrete time series data was studied under three periods: (a) pre-election period: Jan-Feb 2014, (b) election period: Mar-May 2014, (c) post-election period: Jun-Dec 2014.

## 3) Methods and methodology

### 3.1 Pair copula construction

Before we explain the construction of the R-vine structure, it is important to have a clear understanding of a Copula. So we first recall some preliminaries. For any natural number *n*, let *I*^{n} denote the unit cube in the extended *n*–dimensional space $\overline{\mathbb{R}}^{n}$. The elements of $\overline{\mathbb{R}}^{n}$ are *n*–tuples of extended real numbers *a*_{i}: *a* = (*a*_{1}, …,*a*_{n}). For any *a*, *b* in $\overline{\mathbb{R}}^{n}$, we shall write *a* ≤ *b* whenever *a*_{i} ≤ *b*_{i} for all *i*. Now for any *a* ≤ *b*, the Cartesian product of closed intervals, *B* = [*a*_{1},*b*_{1}]×…× [*a*_{n},*b*_{n}], is called an *n*–box and will be denoted by [*a*,*b*]. The set of vertices, *V*, of *B* is the collection of all *n*–tuples (*c*_{1},…,*c*_{n}) for which each *c*_{i} = *a*_{i} or *b*_{i}. Let *H* be a real valued function with domain of the form *S*_{1} ×…×*S*_{n}, where each *S*_{i} is a subset of the extended real numbers $\overline{\mathbb{R}}$. The *H*–volume of *B* is defined to be the sum

$$V_{H}(B) = \sum_{c \in V} \operatorname{sgn}(c)\, H(c) \tag{3}$$

Here *sgn*(*c*) is +1 if *c*_{i} = *a*_{i} for an even number of indices *i*, and -1 otherwise. Note that the above summation is finite since the total number of vertices is finite. The reader is referred to [27] for other equivalent forms of *V*_{H} (*B*).
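The signed sum over vertices in Eq (3) can be sketched in a few lines of Python. The function names below are illustrative assumptions, and the independence copula *H*(*u*) = *u*_{1}…*u*_{n} is used only as a convenient test function, since its *H*–volume of a box is simply the product of the side lengths.

```python
from itertools import product

def h_volume(H, a, b):
    """Signed sum of H over the 2^n vertices of the n-box [a, b]."""
    total = 0.0
    for choice in product((0, 1), repeat=len(a)):
        # choice[i] == 0 picks the lower corner a[i], 1 picks the upper corner b[i].
        vertex = [b[i] if pick else a[i] for i, pick in enumerate(choice)]
        lower_count = len(choice) - sum(choice)
        sign = 1 if lower_count % 2 == 0 else -1
        total += sign * H(vertex)
    return total

def independence_copula(u):
    out = 1.0
    for ui in u:
        out *= ui
    return out

# For the product function, the volume is (0.5 - 0.2) * (0.7 - 0.1).
vol = h_volume(independence_copula, [0.2, 0.1], [0.5, 0.7])
```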

An *n*–dimensional Copula is a function *C*:*I*^{n} → *I* satisfying the following axioms:

- *C*(*u*) = 0 if there exists an *i* such that *u*_{i} = 0.
- *C*(*u*) = *u*_{k} if *u*_{i} = 1 for all *i* ≠ *k*.
- *V*_{C}([*a*,*b*]) ≥ 0 for any *n*–box [*a*,*b*] with *a*, *b* ∈ *I*^{n}.

A real valued function *F* defined on $\overline{\mathbb{R}}^{n}$ is called an *n*–dimensional distribution function if

- *V*_{F}(*B*) ≥ 0 for all *n*–boxes *B* with vertices in $\overline{\mathbb{R}}^{n}$; and
- *F*(*u*) = 0 whenever *u*_{i} = −∞ for some *i*.
- *F*(*u*) = 1 whenever *u*_{i} = ∞ for all *i*.

It has been established in [27] that the *n*–dimensional distribution function *F* has one dimensional marginal distribution functions *F*_{1}, *F*_{2}, …,*F*_{n}.

The famous Sklar’s theorem guarantees that there exists an *n*–dimensional Copula function *C* such that *F*(*x*_{1}, *x*_{2}, …,*x*_{n}) = *C*(*F*_{1}(*x*_{1}), *F*_{2}(*x*_{2}), …,*F*_{n}(*x*_{n})). However, we are more interested in the converse, which states that for a given *n*–dimensional Copula function *C* and univariate distribution functions *F*_{1}, *F*_{2}, …,*F*_{n}, the formula *F*(*x*_{1}, *x*_{2}, …,*x*_{n}) = *C*(*F*_{1}(*x*_{1}), *F*_{2}(*x*_{2}), …,*F*_{n}(*x*_{n})) defines an *n*–dimensional distribution function whose marginals are *F*_{1}, *F*_{2}, …,*F*_{n}. Equivalently, the joint density function is *f*(*x*_{1},*x*_{2},…,*x*_{n}) = *f*_{1}(*x*_{1})*f*_{2}(*x*_{2})…*f*_{n}(*x*_{n})*c*(*F*_{1}(*x*_{1}), *F*_{2}(*x*_{2}), …,*F*_{n}(*x*_{n})), where *c*, the copula density, is the *n*th order mixed partial derivative of *C* with respect to all of its arguments. Thus, if we wish to study the joint behaviour of *n* random variables, we can first fit the marginal distribution function of each random variable separately and then combine them through an appropriate multivariate copula.
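The converse direction of Sklar’s theorem can be illustrated with a small Python sketch: a bivariate copula and two univariate distribution functions are combined into a joint distribution function, and sending one argument to infinity recovers the other marginal. The Clayton copula, the exponential marginals and all parameter values here are assumptions chosen for illustration only; they are not the families fitted in this paper.

```python
import math

def clayton_copula(u, v, theta=2.0):
    """Bivariate Clayton copula, used here purely as an example copula."""
    if u == 0.0 or v == 0.0:
        return 0.0
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

def exp_cdf(x, rate):
    """Distribution function of an exponential random variable."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

def joint_cdf(x, y):
    # F(x, y) = C(F1(x), F2(y)) with exponential marginals of rate 1.0 and 0.5.
    return clayton_copula(exp_cdf(x, 1.0), exp_cdf(y, 0.5))

# Letting y go to infinity should return the first marginal F1(x).
marginal_check = joint_cdf(1.3, math.inf)
```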

The process of constructing a multivariate copula that we adopt is the Pair-wise Copula Construction (PCC), which relies on Vine copulas (or pair copulas) introduced in [7]. At the heart of this process lies the fact that a joint copula density can be broken down into a product of bivariate copula densities that can be estimated independently. Thus, bivariate copulas are the building blocks of the PCC method.

An R-vine on *n* variables as introduced by Bedford and Cooke [9] is a finite sequence of trees *T*_{j} = (*N*_{j}, *E*_{j}), *j* = 1,2,…,*n*−1, with nodes *N*_{j} and edges *E*_{j} satisfying the conditions:

- The tree *T*_{1} has nodes *N*_{1} = {1,2,…,*n*}.
- The trees *T*_{j} are connected, with nodes *N*_{j} = *E*_{j-1} for *j* ≥ 2, and the cardinality of *N*_{j} is *n* – *j* + 1 for each *j* = 1,2,…,*n*−1.
- Let *a* = {*a*_{1},*a*_{2}} and *b* = {*b*_{1},*b*_{2}} be two elements of *N*_{j} (2 ≤ *j* ≤ *n*-1); then {*a*,*b*} ∈ *E*_{j} only if the cardinality of *a ∩ b* is exactly one.

The last axiom says that we join two nodes by an edge only when these nodes, interpreted as edges of the preceding tree, have exactly one node of the preceding tree in common.
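The last axiom (often called the proximity condition) is easy to check mechanically. The following Python sketch, with hypothetical names and a four-variable path chosen as a toy first tree, lists which pairs of edges of *T*_{1} are allowed to be joined in *T*_{2}.

```python
def satisfies_proximity(edge_a, edge_b):
    """Two edges of the previous tree may be joined only if they share exactly one node."""
    return len(set(edge_a) & set(edge_b)) == 1

# Toy first tree T1 on four variables: the path 1 - 2 - 3 - 4.
t1_edges = [frozenset({1, 2}), frozenset({2, 3}), frozenset({3, 4})]

# Every unordered pair of T1 edges is a candidate edge of T2.
candidates = [(a, b) for i, a in enumerate(t1_edges) for b in t1_edges[i + 1:]]
allowed = [(a, b) for a, b in candidates if satisfies_proximity(a, b)]
```

Here the pair ({1,2}, {3,4}) is excluded because the two edges have no node in common, so only two candidate edges remain for *T*_{2}.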

Bedford and Cooke [9] follow a convenient way of enumerating the nodes of trees in an R-vine structure in terms of conditioned and conditioning sets. For further details and illustrative examples the reader may refer to [9, 15].

We make use of the same enumeration strategy to write down the probability density function corresponding to the distribution realized by the R-vine copula structure for the portfolio of stocks.

In order to construct an R-vine structure of stocks, we start with a tree *T*_{1} whose *n* nodes (*N*_{1}) represent the individual stocks and whose edge set is *E*_{1}. In our analysis, we took *T*_{1} to be the minimum spanning tree network of stocks based on the mutual information metric (Eq 10) and, for comparison, on the Kendall’s Tau based metric (Eq 11). Each edge *e* in *E*_{1} is represented by a bivariate copula C_{s(e),t(e)}, where s(e) and t(e) are the nodes connected by the edge *e*. We then move on to the next tree, *T*_{2}, whose node set *N*_{2} is the same as the edge set *E*_{1}. Each node in *T*_{2} is thus represented by C_{s(e),t(e)}, and each edge in *E*_{2} is represented by a conditional copula C_{s(e),t(e)/D(e)}, where D(e) is the common node. Similarly we keep on building the trees *T*_{3}, *T*_{4}, …,*T*_{n−1}.

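Once we have constructed an R-vine structure on *n* stocks with random variables *X*_{1},*X*_{2},…,*X*_{n}, their joint density function with marginal density functions *f*_{1},*f*_{2},…,*f*_{n} is given by

$$f(x_{1},\dots,x_{n}) = \prod_{k=1}^{n} f_{k}(x_{k}) \prod_{j=1}^{n-1} \prod_{e \in E_{j}} c_{s(e),t(e)/D(e)}\Big(F_{s(e)/D(e)}\big(x_{s(e)} \mid x_{D(e)}\big),\; F_{t(e)/D(e)}\big(x_{t(e)} \mid x_{D(e)}\big)\Big) \tag{4}$$

where F_{s(e)/D(e)} is the distribution function of the conditional random variable X_{s(e)/D(e)} and c_{s(e),t(e)/D(e)} is the copula density, that is, the second order mixed partial derivative of the copula C_{s(e),t(e)/D(e)} connecting X_{s(e)/D(e)} and X_{t(e)/D(e)}. For example, the joint density function *f* of three random variables can be decomposed as *f* = *f*_{1}*f*_{2/1}*f*_{3/12}, where *f*_{2/1} and *f*_{3/12} denote conditional density functions. We can further decompose the conditional density function *f*_{2/1} as

$$f_{2/1} = \frac{f_{12}}{f_{1}} = \frac{c_{12}\, f_{1} f_{2}}{f_{1}} = c_{12}\, f_{2} \tag{5}$$

where *f*_{12} is the joint density of variables 1 and 2, and *c*_{12} is the 2^{nd} order mixed partial derivative of the copula *C*_{12} connecting variables 1 and 2. Similarly, we have

$$f_{3/12} = \frac{f_{23/1}}{f_{2/1}} = \frac{c_{23/1}\, f_{2/1}\, f_{3/1}}{f_{2/1}} = c_{23/1}\, f_{3/1} \tag{6}$$

Thus, using Eqs (5) and (6), the joint density function of 3 variables can be decomposed as *f*_{1} *f*_{2}*c*_{12}*f*_{3/1}*c*_{23/1} = *f*_{1} *f*_{2}*f*_{3}*c*_{12}*c*_{13}*c*_{23/1}, where the last step uses *f*_{3/1} = *c*_{13}*f*_{3}, obtained in the same way as Eq (5). The analogous R-vine copula is given in Fig 1.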

**Fig 1.** *T*_{1}, *T*_{2}, *T*_{3} correspond to trees 1, 2 and 3 respectively.

For fast execution of statistical methods such as maximum likelihood estimation, Morales-Nápoles et al. [28] proposed an efficient scheme of storing an R-vine on *n* variables as an *n* × *n* lower triangular matrix *M* = (*m*_{i,j}). The matrix *M* has interesting properties: each column has distinct elements, and deleting the first row and first column of *M* yields an (*n* –1)–dimensional R-vine matrix.

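The decomposition in Eq (4) can now be expressed in terms of the R-vine matrix:

$$f(x_{1},\dots,x_{n}) = \prod_{k=1}^{n} f_{k}(x_{k}) \prod_{j=n-1}^{1} \prod_{i=n}^{j+1} c_{m_{j,j},\, m_{i,j} / m_{i+1,j},\dots,m_{n,j}}\Big(F_{m_{j,j} / m_{i+1,j},\dots,m_{n,j}},\; F_{m_{i,j} / m_{i+1,j},\dots,m_{n,j}}\Big) \tag{7}$$

Note that the above equation involves only bivariate copula densities. An efficient algorithm for computing the conditional distributions appearing as arguments of these copula densities has been proposed in [15].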

### 3.2 Mutual information and Kendall’s Tau based metrics

Mutual Information (MI) between two random variables captures the mutual dependence between them and is zero if and only if they are independent. MI between two random variables is defined to be the difference between the sum of their respective entropies and their joint entropy.

The mutual information of discrete random variables *X* and *Y* is defined as

$$I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log\frac{p(x,y)}{p(x)\,p(y)} \tag{8}$$

where *p*(*x*,*y*) is the joint probability mass function and *p*(*x*), *p*(*y*) are the marginal probability mass functions. A generalization to the continuous case, with densities in place of probability mass functions, is

$$I(X;Y) = \int_{Y}\int_{X} p(x,y) \log\frac{p(x,y)}{p(x)\,p(y)}\, dx\, dy \tag{9}$$

Based on mutual information, the normalized distance [23] between two random variables *X* and *Y* is defined as

$$d(X,Y) = 1 - \frac{I(X;Y)}{H(X,Y)} \tag{10}$$

where *I* is the mutual information and *H* is the joint entropy. Based on this metric, we can construct a minimum spanning tree (MST) network between *n* stocks. There are two well-known methods to construct a minimum spanning tree: Kruskal’s algorithm and Prim’s algorithm. We used Prim’s algorithm for the construction of the stock networks, since stock networks are dense and Prim’s algorithm works well in such cases.
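A minimal Python sketch of this pipeline is given below: a plug-in estimate of the normalized distance 1 − *I*/*H* from already discretized (binned) samples, followed by a simple version of Prim’s algorithm on a distance matrix. The plug-in estimator, the binning and the function names are assumptions made for illustration; the estimator actually used for high frequency returns may differ.

```python
import math
from collections import Counter

def entropy(counts, n):
    """Plug-in Shannon entropy (natural log) from a Counter of outcomes."""
    return -sum(c / n * math.log(c / n) for c in counts.values())

def mi_distance(xs, ys):
    """Normalized distance 1 - I(X;Y)/H(X,Y) for two binned, non-constant samples."""
    n = len(xs)
    h_x = entropy(Counter(xs), n)
    h_y = entropy(Counter(ys), n)
    h_xy = entropy(Counter(zip(xs, ys)), n)
    mutual_info = h_x + h_y - h_xy  # I = H(X) + H(Y) - H(X,Y)
    return 1.0 - mutual_info / h_xy

def prim_mst(dist):
    """Prim's algorithm on a symmetric distance matrix; returns a list of edges (i, j)."""
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Pick the cheapest edge leaving the current tree.
        i, j = min(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: dist[e[0]][e[1]],
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges

same = mi_distance([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2])  # identical series
indep = mi_distance([0, 0, 1, 1], [0, 1, 0, 1])             # no shared information
tree = prim_mst([[0, 1, 4], [1, 0, 2], [4, 2, 0]])
```

Identical series are at distance 0 and series sharing no information are at distance 1, so strongly dependent stocks end up adjacent in the tree.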

We also considered building stock networks based on Kendall’s Tau quantifier. The metric used is
(11)
where *τ*_{X,Y} is Kendall’s Tau coefficient between *X* and *Y*. Sharma and Habib [23] studied MI based stock networks and showed the existence of nonlinearity in stock returns data at the high frequency level.
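For reference, Kendall’s Tau itself can be computed by counting concordant and discordant pairs, as in the short Python sketch below (the tau-a variant, an assumption made here for simplicity; only the coefficient is computed, not the metric of Eq 11). The second example shows why a rank based measure can miss dependence that is not monotonic.

```python
def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs; ties count as neither."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)

# A monotonic but non-linear relationship is fully captured.
tau_monotone = kendall_tau([1, 2, 3, 4], [1, 4, 9, 16])
# y = x^2 on a symmetric range: y is a function of x, yet tau is zero.
tau_parabola = kendall_tau([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
```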

### 3.3 Fitting univariate models to log returns of stocks

A stochastic process *R*_{1}, *R*_{2}, …,*R*_{t} is a white noise process with mean μ and variance *σ*^{2} if *E*(*R*_{t}) = *μ* for all *t*, *Var*(*R*_{t}) = *σ*^{2} for all *t*, and *Cov*(*R*_{t},*R*_{s}) = 0 for all *t* ≠ *s*. In order to check whether the log returns of stocks exhibit the properties of white noise, we carried out the Ljung-Box test [29] for autocorrelation and heteroscedasticity at the 1% level of significance.
We carried out hypothesis testing for each day and each stock on the log returns and on the squares of the log returns. Fig 2A corresponds to the log returns and Fig 2B corresponds to the squares of the log returns. The presence of autocorrelation and heteroscedasticity can be seen at a lag of 1. Thus, GARCH methods are applied to our data with the aim of removing the autocorrelation and heteroscedasticity in the time series. We tested the GARCH(1,1), ARMA(1,1)-GARCH(1,1) and ARMA(1,1)-EGARCH(1,1) models with the errors modelled by a Student’s t-distribution. A process *R*_{t} is called an ARCH(p) process if *R*_{t} = *μ* + *σ*_{t}*ε*_{t}, where *ε*_{t} is a white noise and *σ*_{t} is the conditional standard deviation of *R*_{t} given the past values *R*_{t-1},…,*R*_{t-p}. It is to be noted that an ARCH(p) process has constant mean and constant unconditional variance, but its conditional variance is not constant. The GARCH(p,q) model, on the other hand, tries to address some of the deficiencies of the ARCH(p) model by expressing *σ*_{t} in terms of the past values of the standard deviation *σ*_{t-1},…,*σ*_{t-q} in addition to the past values *R*_{t-1},…,*R*_{t-p}. Specifically, in the standard formulation with *a*_{t} = *R*_{t} − *μ*, we have $\sigma_{t}^{2} = \omega + \sum_{i=1}^{p} \alpha_{i} a_{t-i}^{2} + \sum_{j=1}^{q} \beta_{j} \sigma_{t-j}^{2}$. For further details, the reader can refer to [30, 31].
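The Ljung-Box statistic is simple enough to compute by hand. The Python sketch below evaluates Q = n(n+2) Σ ρ_k²/(n−k) and, for the lag 1 case discussed above, converts it to a p-value using the closed form of the chi-square survival function with one degree of freedom. The function names and the artificial alternating series are assumptions for illustration; a statistics package would normally be used in practice.

```python
import math

def ljung_box(series, max_lag=1):
    """Ljung-Box Q statistic using sample autocorrelations up to max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    q = 0.0
    for k in range(1, max_lag + 1):
        rho_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        q += rho_k ** 2 / (n - k)
    return n * (n + 2) * q

def chi2_sf_1df(x):
    """Survival function of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# A strongly (negatively) autocorrelated toy series of 100 observations.
returns = [1.0, -1.0] * 50
q = ljung_box(returns, max_lag=1)
p_value = chi2_sf_1df(q)
reject_independence = p_value < 0.01
```

Applying the same function to the squared series gives the corresponding check for heteroscedasticity, provided the squared series is not constant.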

**Fig 2.** The horizontal axis lists the 50 stocks and the vertical axis lists the working days of the year 2014. Black and white indicate that the null hypothesis (*H*_{0}: *data is independent*, *H*_{A}: *data exhibit serial correlation*) is rejected or accepted respectively. (a) Test applied to log returns. (b) Test applied to squares of log returns.

We also tried fitting the Normal Inverse Gaussian (NIG) distribution to the error terms. We used the Kolmogorov-Smirnov test to check the goodness of fit of the univariate distributions on the errors. Both the NIG and the Student’s t distributions turned out to be better choices than the normal distribution. For computational simplicity, we used the Student’s t distribution in our model. In Fig 3, we summarize the p-values of the test applied to the error terms obtained after fitting the ARMA(1,1)-EGARCH(1,1) model to each of the 50 stocks on each day.

**Fig 3.** The horizontal axis lists the 50 stocks and the vertical axis lists the working days of the year 2014. Black and white indicate that the null hypothesis (*H*_{0}: *data follows t-distribution*) is rejected or accepted respectively. (a) Test at the 5% level of significance. (b) Test at the 1% level of significance.

In all the equations given below, *R*_{t,k} is as defined in Eq (2). The GARCH(1,1) model [31] for the *k*th stock is given by
(12)
(13)
where we fit a Student’s t-distribution to the noise *ϵ*_{t,k}.

The ARMA(1,1)-GARCH(1,1) model [31] for the *k*th stock is given by
(14)
(15)
where we fit a Student’s t-distribution to the noise *ϵ*_{t,k}.

The ARMA(1,1)-EGARCH(1,1) model [31] for the *k*th stock is given by
(16)
(17)
where we fit a Student’s t-distribution to the noise *ϵ*_{t,k}.
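For orientation, the three model families named above are commonly written as follows. This is a textbook-style sketch, not a transcription of Eqs (12) to (17): the coefficient symbols (*μ*_{k}, *φ*_{k}, *θ*_{k}, *ω*_{k}, *α*_{k}, *β*_{k}, *γ*_{k}), the auxiliary innovation *a*_{t,k} and the particular EGARCH parametrization are assumptions, and the paper’s own equations may use different symbols or an equivalent form.

$$\text{GARCH(1,1):}\qquad R_{t,k} = \mu_{k} + a_{t,k}, \qquad a_{t,k} = \sigma_{t,k}\,\epsilon_{t,k}, \qquad \sigma_{t,k}^{2} = \omega_{k} + \alpha_{k}\, a_{t-1,k}^{2} + \beta_{k}\, \sigma_{t-1,k}^{2}$$

$$\text{ARMA(1,1)-GARCH(1,1):}\qquad R_{t,k} = \mu_{k} + \varphi_{k}\,(R_{t-1,k} - \mu_{k}) + \theta_{k}\, a_{t-1,k} + a_{t,k}, \qquad \sigma_{t,k}^{2} = \omega_{k} + \alpha_{k}\, a_{t-1,k}^{2} + \beta_{k}\, \sigma_{t-1,k}^{2}$$

$$\text{ARMA(1,1)-EGARCH(1,1):}\qquad R_{t,k} = \mu_{k} + \varphi_{k}\,(R_{t-1,k} - \mu_{k}) + \theta_{k}\, a_{t-1,k} + a_{t,k}, \qquad \ln \sigma_{t,k}^{2} = \omega_{k} + \alpha_{k}\, \epsilon_{t-1,k} + \gamma_{k}\big(|\epsilon_{t-1,k}| - E|\epsilon_{t-1,k}|\big) + \beta_{k} \ln \sigma_{t-1,k}^{2}$$

In the EGARCH form the logarithm keeps the variance positive without constraints on the coefficients, and the term in *α*_{k} lets negative and positive shocks affect volatility differently.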

In all three models, we tested whether the noise term *ϵ*_{t,k} exhibits the properties of a white noise by again running Ljung-Box tests at the 1% level of significance. Figs 4 and 5 correspond to the results obtained from running the Ljung-Box test on *ϵ*_{t,k} and *ϵ*_{t,k}^{2} respectively. Clearly, ARMA(1,1)-EGARCH(1,1) proves to be a better fitted model than the other two. We also used AIC values to compare the three models. In 94.84% of the cases ARMA(1,1)-EGARCH(1,1) had the lowest AIC value, so it again emerged as a better fit than the other two models. We used the adjusted Pearson chi-squared goodness of fit test [32] to check the effectiveness of the univariate model for each stock on each working day at the 5% and 1% levels of significance. Fig 6 shows whether the null hypothesis, *H*_{0}: *ARMA*(1,1)−*EGARCH*(1,1) *is a good fit*, was rejected (black colour) or accepted (white colour) for each stock on each working day. 32 stocks out of 50 passed the test on more than 90% of the days, i.e. the null hypothesis was not rejected at the 1% level of significance on more than 90% of the days. Also, for every stock the null hypothesis of a good fit was not rejected on more than 72% of the days. Thus, we conclude that ARMA(1,1)-EGARCH(1,1) is a good fit.

**Fig 4.** The horizontal axis lists the 50 stocks and the vertical axis lists the working days of the year 2014. Black and white indicate that the null hypothesis (*H*_{0}: *data is independent*, *H*_{A}: *data exhibit serial correlation*) is rejected or accepted respectively. (a), (b) and (c) correspond to the test applied to the error terms (*ϵ*_{t,k}) in the GARCH(1,1), ARMA(1,1)-GARCH(1,1) and ARMA(1,1)-EGARCH(1,1) models respectively.

**Fig 5.** The horizontal axis lists the 50 stocks and the vertical axis lists the working days of the year 2014. Black and white indicate that the null hypothesis (*H*_{0}: *data is independent*, *H*_{A}: *data exhibit serial correlation*) is rejected or accepted respectively. (a), (b) and (c) correspond to the test applied to the squares of the error terms (*ϵ*_{t,k}^{2}) in the GARCH(1,1), ARMA(1,1)-GARCH(1,1) and ARMA(1,1)-EGARCH(1,1) models respectively.

**Fig 6.** The horizontal axis lists the 50 stocks and the vertical axis lists the working days of the year 2014. Black and white indicate that the null hypothesis (*H*_{0}: *ARMA*(1,1)−*EGARCH*(1,1) is a good fit) is rejected or accepted respectively. (a) Test at the 5% level of significance. (b) Test at the 1% level of significance.

### 3.4 Value at risk (VaR) prediction

The Value at Risk (VaR) of a portfolio is a measure of the risk associated with it. For example, if a portfolio has a one-tick 5% VaR of amount *x*, then there is a 5% chance that the portfolio loses at least *x* in value over the duration of one tick in the absence of trading. It is well known that the *α*% VaR of the portfolio is given by the *α*–percentile of the log returns of the portfolio [25]. Once the joint distribution function of *n* stocks is known, we can use a Monte-Carlo simulation to estimate the VaR of the underlying portfolio. In this paper we have drawn inferences by calculating the VaR for equally weighted portfolios of 5, 10, 25, and 50 stocks.
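Reading the VaR off as a percentile of simulated log returns can be sketched in Python as follows. The function name, the nearest-rank convention for the empirical quantile and the evenly spaced toy sample are assumptions made for this example.

```python
import math

def value_at_risk(returns, alpha):
    """Empirical alpha-quantile (alpha in (0, 1)) of a sample of log returns."""
    ordered = sorted(returns)
    # Nearest-rank rule: the smallest value with at least alpha * N observations at or below it.
    index = max(math.ceil(alpha * len(ordered)) - 1, 0)
    return ordered[index]

# Toy sample of 100 returns from -0.50 to 0.49 in steps of 0.01.
sample = [i / 100 for i in range(-50, 50)]
var_5 = value_at_risk(sample, 0.05)
var_10 = value_at_risk(sample, 0.10)
```

A realized portfolio return below the estimated VaR counts as a breach; under a well calibrated 5% VaR this should happen on roughly 5% of ticks.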

Consider a portfolio consisting of *n* stocks, where the random variables *S*_{VWAP}(*t*,*k*) and *R*_{t+1,k} are as defined in Eqs (1) and (2). Let *w*_{k} be the weight associated with the *k*th stock in the portfolio, with the weights summing to one, and let *S*_{t,P} be the value of the portfolio corresponding to the tick *t*. Then the log return of the portfolio *R*_{t+1,P} in the time interval [*t*,*t* + 1] is given by

$$R_{t+1,P} = \ln\left(\frac{S_{t+1,P}}{S_{t,P}}\right) = \ln\left(\sum_{k=1}^{n} w_{k}\, e^{R_{t+1,k}}\right) \tag{18}$$

Using the approximations *e*^{x} ≈ 1 + *x* and ln (1 + *x*) ≈ *x* for small *x* in the above equation, we get

$$R_{t+1,P} \approx \sum_{k=1}^{n} w_{k}\, R_{t+1,k} \tag{19}$$
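The quality of the linear approximation can be checked numerically. The Python sketch below compares the weighted sum of stock log returns with the exact logarithm of the weighted sum of gross returns, for an equally weighted portfolio and hypothetical 30-second returns of a few basis points (all numbers are made up for illustration).

```python
import math

def portfolio_log_return_exact(weights, returns):
    """ln of the weighted sum of gross returns exp(R_k)."""
    return math.log(sum(w * math.exp(r) for w, r in zip(weights, returns)))

def portfolio_log_return_approx(weights, returns):
    """First-order approximation: weighted sum of the stock log returns."""
    return sum(w * r for w, r in zip(weights, returns))

weights = [0.25, 0.25, 0.25, 0.25]
returns = [0.0004, -0.0007, 0.0002, -0.0001]
exact = portfolio_log_return_exact(weights, returns)
approx = portfolio_log_return_approx(weights, returns)
gap = abs(exact - approx)
```

For returns of this size the two values differ only at second order in the returns, which is negligible at the 30-second scale.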
We first use ARMA(1,1)-EGARCH(1,1) to model the univariate log returns of each stock and then use the R-Vine copula construction on the error terms *ϵ*_{t,k} to estimate the joint copula of the error terms. We fit the model on the first 4 hours of each day and use it to predict the VaR for the next 2 hours. We summarize the algorithm below:

1. Consider the log returns of each stock for the first 4 hours, i.e. 9:30AM to 1:30PM (this gives 480 terms in each time series), on each day.
2. Fit an ARMA(1,1)-EGARCH(1,1) model to the log returns of each stock obtained in step 1, with a univariate Student’s t-distribution assumed on the error term *ϵ*_{t,k} of each stock *k*. So if there are *n* stocks in the portfolio, then the data generated at this step can be written conveniently as (*ϵ*_{t,k})_{480×n}.
3. Fit an R-vine copula structure to the random variables *ϵ*_{t,1}, *ϵ*_{t,2}, …, *ϵ*_{t,n} (sampled at 480 ticks in step 2) to obtain the joint distribution of the error terms. In the R-vine algorithm we choose the first tree *T*_{1} as the minimum spanning tree based on the Kendall’s Tau metric (Eq 11) and, separately, on the MI based metric (Eq 10). In this paper we fitted the R-vine structure on n = 50 stocks.
4. Using the joint distribution obtained in step 3, employ Monte-Carlo simulation to generate a large number of values (say N = 5000) of (*ϵ*_{481,1}, *ϵ*_{481,2}, …, *ϵ*_{481,n}) simultaneously and substitute these in Eqs 16 and 17 to obtain the corresponding values of (*R*_{481,1}, *R*_{481,2}, …, *R*_{481,n}). For each of the N tuples (*R*_{481,1}, *R*_{481,2}, …, *R*_{481,n}) obtained, compute the portfolio log return *R*_{481,P} using Eq 19. In our analysis, we have worked with equally weighted portfolios of 5, 10, 25 and 50 stocks respectively.
5. The *α*% VaR for the 481st instant, *VaR*_{481,P}, is now calculated by finding the *α* percentile of the N simulated values of *R*_{481,P}. Here P is a portfolio of 5, 10, 25 or 50 stocks. In this paper we have considered *α* = 5% and 10%.
6. Compare the actual *R*_{481,P} with the estimated *VaR*_{481,P}.
7. Once the actual *R*_{481,k} is known, use Eq (16) to calculate the actual *ϵ*_{481,k}. Next, use Eq (17) to calculate *σ*_{482,k}. Then repeat steps 4, 5 and 6 to predict (*R*_{482,1}, *R*_{482,2}, …, *R*_{482,n}) and compare the actual *R*_{482,P} with the estimated *VaR*_{482,P}. In this way we calculate the estimated *VaR*_{i,P}, where *i* = 483,…,719, and compare these values with the respective actual *R*_{i,P}. Note that the model was fitted only once a day.
8. Repeat steps 1 to 7 for all working days in the year 2014.
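Steps 4 and 5 above can be sketched in Python under strong simplifications, all of which are assumptions made for illustration: independent Student’s t draws stand in for a sample from the fitted R-vine copula, the mean equation is reduced to *R* = *μ* + *σϵ* instead of the ARMA(1,1)-EGARCH(1,1) recursion, and the function names, degrees of freedom and parameter values are hypothetical.

```python
import math
import random

def draw_student_t(rng, df):
    """One Student's t draw, built from a normal and a chi-square variate."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return z / math.sqrt(chi2 / df)

def one_step_var(mu, sigma, weights, alpha, n_sims=5000, df=5, seed=0):
    """Monte-Carlo estimate of the one-tick-ahead alpha VaR of a weighted portfolio."""
    rng = random.Random(seed)
    simulated = []
    for _ in range(n_sims):
        # Stand-in for a draw from the fitted R-vine copula: independent t noise.
        eps = [draw_student_t(rng, df) for _ in weights]
        # Simplified mean equation R = mu + sigma * eps for each stock.
        stock_returns = [m + s * e for m, s, e in zip(mu, sigma, eps)]
        # Portfolio log return via the linear approximation of Eq 19.
        simulated.append(sum(w * r for w, r in zip(weights, stock_returns)))
    simulated.sort()
    return simulated[max(math.ceil(alpha * n_sims) - 1, 0)]

n = 5
var_5 = one_step_var([0.0] * n, [0.001] * n, [1.0 / n] * n, alpha=0.05)
var_10 = one_step_var([0.0] * n, [0.001] * n, [1.0 / n] * n, alpha=0.10)
```

In the full procedure the conditional mean and standard deviation would be updated from the realized return after every tick, and the draws would respect the dependence structure encoded by the vine.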

## 4) Discussion

Data for each day was divided into 2 subsets: training data from 9:30AM to 1:30PM and testing data from 1:31PM to 3:30PM. We fitted both the Kendall’s Tau based and the MI based vine copula structures on the training data as discussed in the previous section. We then estimated VaRs corresponding to equally weighted portfolios for each time tick of the testing data. In our analysis we have considered portfolios consisting of all 50 stocks, or of 25, 10 or 5 randomly picked stocks. Also, we have considered 5% and 10% VaRs in all the cases. To check the effectiveness of our model we carried out the unconditional (UC) and conditional (CC) coverage tests formulated by Christoffersen [33]. There are 32, 39 and 113 days in the pre-election, election and post-election periods, respectively, for which our proposed model was a good fit. We carried out the hypothesis testing for each day and calculated the percentage of days on which the null hypothesis was not rejected. We refer to this percentage as the success rate of the model. Tables 1–3 summarize the results obtained in the pre-election, election and post-election periods.
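As an illustration of the unconditional coverage idea, the Python sketch below computes the likelihood ratio statistic that compares the observed breach frequency with the nominal level *α*, and converts it to a p-value using the chi-square distribution with one degree of freedom. The function name and the breach sequences are hypothetical, and the conditional coverage test, which additionally checks the independence of breaches, is not shown.

```python
import math

def unconditional_coverage_test(breaches, alpha):
    """breaches: list of 0/1 flags, 1 when the realized return fell below the VaR."""
    n = len(breaches)
    n1 = sum(breaches)
    n0 = n - n1
    pi_hat = n1 / n  # observed breach frequency

    def loglik(p):
        # Bernoulli log-likelihood with the convention 0 * log(0) = 0.
        out = 0.0
        if n0:
            out += n0 * math.log(1.0 - p)
        if n1:
            out += n1 * math.log(p)
        return out

    lr = max(-2.0 * (loglik(alpha) - loglik(pi_hat)), 0.0)
    p_value = math.erfc(math.sqrt(lr / 2.0))  # chi-square survival function, 1 df
    return lr, p_value

# 240 test ticks: 12 breaches match a 5% VaR exactly, 40 breaches do not.
lr_ok, p_ok = unconditional_coverage_test([1] * 12 + [0] * 228, alpha=0.05)
lr_bad, p_bad = unconditional_coverage_test([1] * 40 + [0] * 200, alpha=0.05)
```

A day counts towards the success rate when the null hypothesis of correct coverage is not rejected, that is, when the p-value exceeds the chosen significance level.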

It was observed that the VaR predictions were more accurate for portfolios consisting of a small number of stocks (5 or 10) than for portfolios consisting of a large number of stocks (25 or 50). Also, the success rate of the MI based model was higher than that of the Kendall’s Tau based model in 41 out of 96 cases (42.71%), whereas the Kendall’s Tau based model had the higher success rate in only 6 out of 96 cases (6.25%). In the remaining 49 out of 96 cases (51.04%), the success rates of the two methods were at par. One can also observe that even during the election period, which is full of uncertainties, the success rate of the model was quite high.

## 5) Conclusion

This paper demonstrates the power of incorporating mutual information based metrics into the construction of R-Vine copula structures for learning the joint distribution of a large number of stocks from high frequency market data. The data considered in the present analysis are an instant-by-instant record of transactions of 89 stocks listed on the National Stock Exchange (NSE) of India in the year 2014. To give the data a time series interpretation, we divide each working day into 720 “ticks”, where each tick represents a 30 second interval. Of the 89 stocks, we consider only the top 50 traded stocks. On the basis of the Ljung-Box test, we conclude that ARMA(1,1)-EGARCH(1,1) captures the autocorrelation and heteroscedasticity of the log return series of these 50 stocks significantly better than the widely used GARCH(1,1) and ARMA(1,1)-GARCH(1,1) models. In fact, on 94.84% of the occasions the AIC values obtained after fitting ARMA(1,1)-EGARCH(1,1) were the lowest among the models compared (a lower AIC, as reported by the R software package, indicates a better model).

The joint distribution of the error terms of the ARMA(1,1)-EGARCH(1,1) models fitted to each stock is then computed by learning R-Vine copula structures in two ways: first, starting from the minimal spanning tree computed with the mutual information based metric; and second, starting from the minimal spanning tree computed with the Kendall’s Tau based metric. In both cases, the VaR of the underlying 50 stock portfolio is then computed through Monte-Carlo simulations. Christoffersen’s UC and CC tests show that the VaR predictions in the mutual information case outperform those in the Kendall’s Tau case. The success rate of the MI based method was higher than that of the Kendall’s Tau based method on 42.71% of the occasions, and on 51.04% of the occasions the success rates of the two methods were at par. The predictions were quite good even during the election period, when there is a lot of anticipation amongst buyers.
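To illustrate the first building block mentioned above, the following is a minimal sketch of a mutual information guided minimum spanning tree on synthetic data. It is not the authors' implementation. The histogram estimator, the bin count, the normalised distance d = 1 − I(X;Y)/H(X,Y), and all function and variable names are assumptions made for illustration. Only the first tree is built; the higher trees of the R-Vine and the pair-copula fitting are omitted.

```python
import math
import random
from collections import Counter


def mi_distance(x, y, bins=8):
    """Histogram estimate of d = 1 - I(X;Y) / H(X,Y), clamped to [0, 1] (assumed metric)."""
    def discretise(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # guard against a constant series
        return [min(int((v - lo) / width), bins - 1) for v in values]

    dx, dy = discretise(x), discretise(y)
    n = len(dx)
    joint, px, py = Counter(zip(dx, dy)), Counter(dx), Counter(dy)
    # Plug-in mutual information and joint entropy, both in nats.
    mi = sum(c / n * math.log(c * n / (px[i] * py[j])) for (i, j), c in joint.items())
    h_joint = -sum(c / n * math.log(c / n) for c in joint.values())
    if h_joint <= 0:
        return 0.0
    return min(1.0, max(0.0, 1.0 - mi / h_joint))


def minimum_spanning_tree(dist):
    """Prim's algorithm on a dense symmetric distance matrix; returns n - 1 edges."""
    n = len(dist)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        i, j = min(
            ((a, b) for a in in_tree for b in range(n) if b not in in_tree),
            key=lambda e: dist[e[0]][e[1]],
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges


# Synthetic stand-ins for standardised residuals (hypothetical data).
random.seed(0)
n_obs = 2000
s0 = [random.gauss(0, 1) for _ in range(n_obs)]
s1 = [a + 0.5 * random.gauss(0, 1) for a in s0]      # linear link to s0
s2 = [a * a + 0.1 * random.gauss(0, 1) for a in s0]  # non-linear link to s0
s3 = [random.gauss(0, 1) for _ in range(n_obs)]      # independent of the rest
series = [s0, s1, s2, s3]

k = len(series)
dist = [[0.0 if i == j else mi_distance(series[i], series[j]) for j in range(k)]
        for i in range(k)]
tree = minimum_spanning_tree(dist)
print(tree)
```

In this toy setting the series `s2` depends on `s0` only through its square, a relationship that a rank correlation such as Kendall’s Tau would largely miss, while the mutual information based distance still places the pair closer together than an independent pair.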

We conclude that the MI based R-Vine copula model captures the joint distribution well and therefore leads to better VaR predictions in a high frequency setting.
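For readers who wish to backtest VaR forecasts of this kind, the unconditional coverage (UC) likelihood ratio test can be sketched as follows. This is a generic illustration and not the code used in this study. The function name, the tail probability of 0.01, and the violation counts are hypothetical, and the independence component of the CC test is not included.

```python
import math


def unconditional_coverage_test(violations, p):
    """LR test of H0: violation probability equals p.

    violations: sequence of 0/1 flags (1 = realised loss exceeded the VaR forecast).
    p: nominal tail probability of the VaR, strictly between 0 and 1.
    Returns the LR statistic and its asymptotic chi-square(1) p-value.
    """
    t = len(violations)
    x = sum(violations)

    def loglik(q):
        # Bernoulli log-likelihood, treating 0 * log(0) as 0.
        ll = 0.0
        if x > 0:
            ll += x * math.log(q)
        if t - x > 0:
            ll += (t - x) * math.log(1.0 - q)
        return ll

    lr = max(0.0, -2.0 * (loglik(p) - loglik(x / t)))
    p_value = math.erfc(math.sqrt(lr / 2.0))  # chi-square(1) survival function
    return lr, p_value


# Hypothetical backtests over 250 periods at a 1% tail probability.
ok_flags = [1] * 3 + [0] * 247    # 3 violations, close to the expected 2.5
bad_flags = [1] * 12 + [0] * 238  # 12 violations, far too many

lr_ok, p_ok = unconditional_coverage_test(ok_flags, 0.01)
lr_bad, p_bad = unconditional_coverage_test(bad_flags, 0.01)
print(f"3 violations:  LR = {lr_ok:.3f}, p-value = {p_ok:.3f}")
print(f"12 violations: LR = {lr_bad:.3f}, p-value = {p_bad:.6f}")
```

A forecast passes the test at the 5% level when the LR statistic stays below the chi-square(1) critical value of about 3.84, as in the first hypothetical case above.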

## Acknowledgments

We thank Shiv Nadar University for providing the computational facilities and the infrastructure needed to carry out the present research. Both authors extend special gratitude to Professor Amber Habib for his encouragement and valuable comments. The authors also thank the reviewers for their valuable suggestions, which have significantly enhanced the clarity of the paper.

## References

- 1. Schölzel C, Friederichs P, Multivariate non-normally distributed random variables in climate research – introduction to the copula approach, Nonlinear Processes in Geophysics 15, 761–772, 2008; https://doi.org/10.5194/npg-15-761-2008
- 2. Salvadori G, De Michele C, Kottegoda N T, Rosso R, Extremes in Nature: An Approach Using Copulas, Springer, Dordrecht, 2007; ISBN 978-1-4020-4415-1
- 3. Kazianka H, Pilz J, Bayesian spatial modeling and interpolation using copulas, Computers & Geosciences 37, 310–319, 2011; https://doi.org/10.1016/j.cageo.2010.06.005
- 4. Cherubini U, Luciano E, Vecchiato W, Copula Methods in Finance, Wiley, Chichester, 2004; ISBN 0-470-86344-7
- 5. Sklar A, Fonctions de répartition à n dimensions et leurs marges, Publications de l’Institut de Statistique de l’Université de Paris 8, 229–231, 1959
- 6. Massey FJ Jr, The Kolmogorov-Smirnov Test for Goodness of Fit, Journal of the American Statistical Association 46, 1951
- 7. Joe H, Families of m-variate distributions with given margins and m(m-1)/2 bivariate dependence parameters, Distributions with Fixed Marginals and Related Topics, Institute of Mathematical Statistics, Hayward, CA, USA, Volume 28, 1996; https://doi.org/10.1214/lnms/1215452614
- 8. Bedford T J, Cooke R M, Probability density decomposition for conditionally dependent random variables modeled by vines, Annals of Mathematics and Artificial Intelligence, 2001; https://doi.org/10.1023/A:1016725902970
- 9. Bedford T J, Cooke R M, Vines – a new graphical model for dependent random variables, Annals of Statistics, 2002; https://doi.org/10.1214/aos/1031689016
- 10. Aas K, Czado C, Frigessi A, Bakken H, Pair-copula constructions of multiple dependence, Insurance: Mathematics and Economics 44 (2), 182–198, 2009
- 11. Min A, Czado C, Bayesian inference for multivariate copulas using pair-copula constructions, Journal of Financial Econometrics, 2010
- 12. Min A, Czado C, Bayesian model selection for D-vine pair-copula constructions, Canadian Journal of Statistics, 2011; https://doi.org/10.1002/cjs.10098
- 13. Mendes B, Semeraro M M, Leal R P C, Pair-copulas modeling in finance, Financial Markets and Portfolio Management, 2010
- 14. Czado C, Schepsmeier U, Min A, Maximum likelihood estimation of mixed C-vines with application to exchange rates, Statistical Modelling, 2010; https://doi.org/10.1177/1471082X1101200302
- 15. Dißmann J, Brechmann E C, Czado C, Kurowicka D, Selecting and estimating regular vine copulae and application to financial returns, Computational Statistics and Data Analysis, 2012
- 16. Czado C, Jeske S, Hofmann M, Selection strategies for regular vine copulae, J. Soc. Franç. Stat. 154, 174–191, 2013
- 17. Gruber L, Czado C, Sequential Bayesian model selection of regular vine copulas, Bayesian Anal. 10, 937–963, 2015
- 18. Nagler T, Bumann C, Czado C, Model selection in sparse high-dimensional vine copula models with application to portfolio risk, Journal of Multivariate Analysis, 2018
- 19. Villaverde AF, Ross J, Moran F, Banga JR, MIDER: Network Inference with Mutual Information Distance and Entropy Reduction, PLOS One, 2014; pmid:24806471
- 20. Fiedor P, Networks in financial markets based on the mutual information rate, Physical Review E 89, 2014; pmid:25353838
- 21. Tao Y, Fiedor P, Holda A, Network analysis of the Shanghai stock exchange based on partial mutual information, Journal of Risk and Financial Management, 2015
- 22. Guo X, Zhang H, Tian T, Development of stock correlation networks using mutual information and financial big data, PLOS One, 2018; pmid:29668715
- 23. Sharma C, Habib A, Mutual Information based stock networks and portfolio selection for intraday traders using high frequency data: an Indian market case study, PLOS One, 2019; pmid:31465507
- 24. Sampid MG, Hasim HM, Estimating value-at-risk using multivariate copula-based volatility model: Evidence from European banks, International Economics, 2018
- 25. Holton GA, Value-at-Risk Theory and Practice, Second Edition, e-book, 2014; https://www.value-at-risk.net/
- 26. Aas K, Pair-Copula Constructions for Financial Applications: A Review, Econometrics (MDPI), 2016
- 27. Nelsen RB, An Introduction to Copulas, second edition, Springer, New York, 2006; ISBN 978-0-387-28678-5
- 28. Morales-Nápoles O, Cooke R, Kurowicka D, About the number of vines and regular vines on n nodes, 2010
- 29. Ljung GM, Box GEP, On a measure of lack of fit in time series models, Biometrika, 1978; https://doi.org/10.1093/biomet/65.2.297
- 30. Engle RF, Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation, Econometrica, 1982; pmid:12264762
- 31. Ruppert D, Matteson DS, Statistics and Data Analysis for Financial Engineering with R Examples, second edition, Springer; ISBN 978-1-4939-2614-5
- 32. Vlaar PJG, Palm FC, The Message in Weekly Exchange Rates in the European Monetary System: Mean Reversion, Conditional Heteroscedasticity, and Jumps, Journal of Business and Economic Statistics, 1993
- 33. Christoffersen PF, Evaluating Interval Forecasts, International Economic Review, 1998