The probabilities of type I and II error of null of cointegration tests: A Monte Carlo comparison

This paper evaluates the performance of eight tests with the null hypothesis of cointegration on the basis of the probabilities of type I and II errors using Monte Carlo simulations. The study covers 132 different data generating processes, spanning three specifications of the deterministic part and four sample sizes. The three specifications of the deterministic part are: absence of both intercept and linear time trend, presence of the intercept only, and presence of both intercept and linear time trend. It is found that, when asymptotic critical values are used, all of the tests have probabilities of type I error either larger or smaller than the nominal level, i.e. they suffer from either over-rejection or under-rejection. Using simulated critical values instead keeps the probability of type I error controlled. The use of asymptotic critical values should therefore be avoided, and the use of simulated critical values is highly recommended. It is also found that the simple LM test based on the KPSS statistic performs better than the rest for all specifications of the deterministic part and all sample sizes.


Introduction
The concept of cointegration was first proposed by [1]. If two or more variables, each integrated of order one, possess a long run relationship, they are said to be cointegrated. For two I(1) variables X and Y, if some linear combination aX + bY is integrated of order zero, then X and Y share a long run relationship and are cointegrated. Note that cointegration analysis rests on all variables being I(1), which may itself depend on how structural breaks are handled (see, e.g., [2]). Soon after the development of the concept of "cointegration", a wide variety of tests were proposed, e.g. [3][4][5] and many more. Most of these tests assess the null of no cointegration, and they have been widely and frequently used in economics and finance to assess long run relationships among sets of time series; examples include, but are not limited to, [6][7][8][9][10]. In their pioneering paper, [11] proposed the first test of the null of cointegration. Subsequently, further tests of the null of cointegration were developed under different assumed data generating processes ([12,13] and many others). Because these tests rest on different underlying assumptions about the cointegrated system and its data generating process, they can reach different conclusions about the existence of cointegration for the same empirical problem ([14,15]). There was therefore a need to assess their performance and to identify the test or tests best suited to a given empirical problem. To fill this vacuum in the literature, numerous comparative studies have been published, most of them using Monte Carlo Simulations (MCS); real-data-based comparative studies of cointegration tests are fewer ([16,17]).
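The definition can be illustrated with a short simulation (a Python sketch for illustration only; all variable names are ours): a random walk W, a variable Z built as a linear combination of W plus stationary noise, and a check that the combination Z − 2W stays bounded while W itself wanders.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000

# W is a pure random walk, hence I(1)
w = np.cumsum(rng.normal(size=T))

# Z = 2*W + stationary noise, so Z is also I(1) ...
noise = rng.normal(size=T)
z = 2.0 * w + noise

# ... but the linear combination Z - 2W = noise is I(0)
combo = z - 2.0 * w

# The random walk's spread grows with T; the combination's does not.
print(np.std(w), np.std(combo))
```

Here the cointegrating combination is known by construction; in practice it must be estimated, which is exactly what the tests below do.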
MCS have been used frequently for such comparisons ([18][19][20][21][22][23]). These studies ([14,15,[24][25][26] and many more) evaluated the tests on two properties: the size, "the probability of rejecting the null hypothesis when it is actually true", and the power, "the probability of rejecting the null hypothesis when it is actually false". A test is considered better than another if its size is controlled around the nominal size and its power is relatively higher. Most of these comparative studies considered only a limited number of point alternative hypotheses (2 to 4), although the alternative space contains infinitely many. The two most extensive studies, [14] and [15], considered several data generating processes and more than 8 point alternative hypotheses in an attempt to cover the whole alternative space. Both concluded that there is no significant evidence that any one test is superior to the others: a test that performs best on one subset of the alternative space performs worst on another, so a general conclusion is very difficult to draw.
Except for the study of [27], all comparative studies of cointegration tests have addressed either a selected set of null-of-no-cointegration tests or a mixture of tests with opposite null hypotheses. [27] compared 6 tests of the null of cointegration on the basis of the same size and power properties. Although [27] considered several data generating processes and estimated power curves, the conclusions are the same as those of [14] and [15]. We were unable to find any other comparative study of null of cointegration tests.
Because these tests rest on different underlying assumptions, and for a particular real data set one does not know whether those assumptions hold, the choice of test is a critical decision. This study therefore compares the tests on a very basic class of data generating processes, on the premise that a test performing poorly here will also perform poorly on more elaborate ones; it cannot, however, be guaranteed that the best performer here remains best for other data generations. Moreover, the majority of comparative studies of cointegration tests in the literature did not use size-adjusted powers: tests were compared on size and power, but any size distortions were left uncontrolled in the power comparison, and only a few alternative hypotheses and data generating processes were considered. The aim of this paper is twofold: to compare the tests on the probability of Type I error (the size) using the asymptotic critical values or distributions developed by the respective authors; and to compare them on the probability of Type II error when the probability of Type I error is controlled around the nominal level through the use of simulated critical values. These probabilities are defined as α = P(reject H0 | H0 true), also known as the size of the test, and β = P(fail to reject H0 | H0 false), the power of the test being (1 − β). The conclusions and recommendations of this study will be useful to a wide audience of practitioners, statisticians, data scientists and applied researchers, as they provide guidelines on which tests perform best and worst, judged on α and β, and on the choice of type of critical values.
This study is structured as follows: the next section, "Methodology", elaborates the details of the tests and the framework followed to assess their performance. It is followed by the "Results and Discussion" section, which discusses the results in detail, and then by the final section, "Conclusions and Recommendations". The "References" are listed at the end.

Methodology
In this study, eight tests belonging to the class of null of cointegration tests are compared; their details are laid out in "Tests to be compared". The next section, "Artificial Data Generation", lists the set of equations used to generate a cointegrated system, and the procedure for estimating α (for both asymptotic and simulated critical values) is detailed in "Estimation of Empirical α". "Estimation of Simulated Critical Values" then details the steps followed to obtain simulated critical values with α fixed, and "Estimation of β" specifies the steps followed to estimate β, the probability of Type II error. MATLAB has been used to carry out the analysis.

Tests to be compared
Eight tests are compared in the current study. They were selected on the basis of their frequent use in the economics/econometrics literature; moreover, they are the pioneering null of cointegration tests, and they have been chosen from previous studies such as [14,15,27] on the basis of their relative performance. The eight tests compared in this study are detailed as follows:

LM test based on KPSS statistic (LM).
Consider two kinds of variables, Z_t and W_it, i = 1, 2, …, k, both integrated of order one, i.e. I(1). [11] proposed first estimating the Ordinary Least Squares (OLS) regression

Z_t = ψ_t + θ'W_t + u_t, (1)

where Z_t and W_t are the dependent and independent variables respectively and ψ_t represents the deterministic part. [11] proposed that the LM type test introduced by [28] for testing stationarity of a time series can be used for testing the null of cointegration, i.e.

LM = T^(−2) Σ_{t=1}^{T} S_t² / σ̂_u², with S_t = Σ_{s=1}^{t} û_s,

where σ̂_u² is the contemporaneous variance of the residuals and the û_t are the residuals estimated from Eq (1). The LM test does not follow any standard distribution, therefore the current study uses the critical values of the LM test as provided in [29].
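The statistic can be sketched as follows (a Python illustration with our own names, using the standard KPSS form: partial sums of the OLS residuals scaled by the contemporaneous residual variance):

```python
import numpy as np

def lm_stat(z, W):
    """KPSS-type LM statistic on the residuals of an OLS regression
    of z on W (W should already contain the deterministic part)."""
    T = len(z)
    beta, *_ = np.linalg.lstsq(W, z, rcond=None)
    u = z - W @ beta                # residuals from Eq (1)
    S = np.cumsum(u)                # partial sums S_t
    sigma2 = np.mean(u ** 2)        # contemporaneous variance
    return np.sum(S ** 2) / (T ** 2 * sigma2)

# Cointegrated example: z = intercept + w + stationary error
rng = np.random.default_rng(1)
T = 500
w = np.cumsum(rng.normal(size=T))
z = 1.0 + w + rng.normal(size=T)
W = np.column_stack([np.ones(T), w])
stat = lm_stat(z, W)
```

Under the null of cointegration the residuals are stationary, so the partial sums stay small and the statistic is small; under no cointegration the partial sums diverge and the statistic blows up.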

Leybourne and McCabe's test (LBI).
[11] also proposed that the same LM test with a non-parametric modification can be used for the same purpose. In essence, they proposed replacing the contemporaneous variance with a long run variance, i.e.

s²(l) = T^(−1) Σ_{t=1}^{T} û_t² + 2T^(−1) Σ_{s=1}^{l} w(s, l) Σ_{t=s+1}^{T} û_t û_{t−s},

where again the û_t are the residuals estimated from Eq (1) and w(s, l) is a weighting function. The lag truncation parameter l is vital and plays a crucial role in empirical studies: according to [15], the size and power of the test depend on it. The current study uses l = 4, as recommended by [15]; at this value the size of the test is controlled around the nominal size and the test also has reasonable power. Like the LM test, the LBI test does not follow any standard statistical distribution, therefore this study uses the critical values for the LBI test as provided in [29].

Shin's C test (Sc).
[12] proposed two modifications to the same LM type test: first, to use the Dynamic OLS (DOLS) regression and, second, to use a weighting kernel in estimating the long run variance. The DOLS regression is

Z_t = ψ_t + γ'W_t + Σ_{j=−r}^{r} π_j' ΔW_{t+j} + u_t, (2)

where Z_t and W_t are the dependent and independent variable matrices of order T × 1 and T × m respectively. The same LM type test, in this case named C, is computed with the kernel-weighted long run variance in place of the contemporaneous variance, where, as for LM and LBI, the û_t are the residuals estimated from Eq (2) using OLS. The selection of the lag truncation parameter is again vital for the performance of the test, as already discussed for the LBI test; the current study uses a truncation lag of 4, as proposed by [15]. Similarly, [12] recommended r = 5, so the current study uses r = 5. The critical values for the test are taken from [12].
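The long run variance with lag truncation can be sketched as follows (a Python illustration; the Bartlett weighting w(s, l) = 1 − s/(l + 1) is one common choice and is our assumption here):

```python
import numpy as np

def long_run_var(u, l):
    """Long run variance of residuals u with lag truncation l,
    using Bartlett weights w(s, l) = 1 - s/(l + 1) (our assumption)."""
    T = len(u)
    lrv = np.sum(u ** 2) / T                     # contemporaneous term
    for s in range(1, l + 1):
        w = 1.0 - s / (l + 1.0)                  # Bartlett weight
        lrv += 2.0 * w * np.sum(u[s:] * u[:-s]) / T
    return lrv

rng = np.random.default_rng(2)
u = rng.normal(size=1000)

# With l = 0 the estimator collapses to the contemporaneous variance.
v0 = long_run_var(u, 0)
v4 = long_run_var(u, 4)   # l = 4 as used in the paper
```

For serially uncorrelated residuals the two estimates are close; the long run version matters when the residuals are autocorrelated.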

McCabe-Leybourne-Shin test (Ls).
For the same LM type test, a different estimation methodology was proposed by [29], who suggested the use of Maximum Likelihood Estimation (MLE) in place of OLS. They proposed first estimating the residuals û_t from the DOLS regression and then fitting an autoregressive model of order p to them by MLE, with p selected by the minimum Akaike Information Criterion (AIC). The modified LM test statistic is named Ls. The current study considers 4 as the maximum value of p, and the critical values are taken from [29].

Hausman H1 and H2 tests (H1 and H2).
A test statistic comparing two estimators was proposed by [13]. According to [13], under the null of cointegration both estimators are consistent, whereas under the alternative of no cointegration one of them is inconsistent. The first estimator of γ, say γ̂_L, is obtained when Eq (2) is estimated by DOLS; the second estimator, say γ̂_D, is obtained from an auxiliary regression built from the OLS estimates of Eq (2). The two estimators γ̂_L and γ̂_D are used to form both Hausman type test statistics, where V̂_L and V̂_D are the estimated variance-covariance matrices of γ̂_L and γ̂_D respectively. The critical values of the two tests are taken from [13] in the current study.

Hansen's Lc test (Lc).
The fully modified estimation method of [30] was used by [31] to propose a test of the null of cointegration. The method involves estimating Eq (1) by OLS and obtaining the OLS residuals û_t. Define the difference vector ΔW_it = μ_it for i = 1, 2, …, k and the stacked vector z_t = (υ_t, μ_t')'. A VAR(1) model is estimated for z_t, z_t = F z_{t−1} + υ_t, and its estimated residuals are used to form the long run covariance estimate under an appropriate weighting scheme; the current study uses the quadratic spectral kernel. The automatic bandwidth estimator is based on r̂_a and ŝ_a², the "a-th" endogenous variable's estimated AR coefficient and residual variance. The estimates L̂_n and Ω̂_n are recolored to obtain Ω̂ = (I − F̂)^(−1) Ω̂_n (I − F̂')^(−1). The Fully Modified OLS (FMOLS) estimator is then computed and the FMOLS residuals obtained, from which the Lc statistic proposed by [31] is constructed. The critical values of the Lc test are taken from [26] for the current study.
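The quadratic spectral kernel used here has a standard closed form, sketched below in Python (the function name is ours):

```python
import math

def qs_kernel(x):
    """Quadratic spectral kernel k(x); k(0) = 1 by continuity."""
    if x == 0.0:
        return 1.0
    a = 6.0 * math.pi * x / 5.0
    return 25.0 / (12.0 * math.pi ** 2 * x ** 2) * (math.sin(a) / a - math.cos(a))

k0 = qs_kernel(0.0)
k_half = qs_kernel(0.5)
```

Unlike the truncated Bartlett weighting, the quadratic spectral kernel assigns a (decaying) weight to every lag, which is why it is paired with an automatic bandwidth rather than a hard lag truncation.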

Xiao Fluctuation test of Cointegration (F).
On the basis of fluctuations of the estimated FMOLS residuals û_t, obtained as in Eq (7), [32] proposed a fluctuation test statistic F_T using FMOLS estimation; the long run covariance Ω_υ·μ is obtained as in Eq (6). The critical values of F_T are taken from [32] in the current study. The abbreviations used in the current study for all eight tests are listed in Table 1.

Artificial Data Generation (ADG)
The set of equations used to generate a cointegrated system under the null of cointegration and the alternative of no cointegration is a modified version of the model used by [33]; the model has been modified to include a deterministic component. For time series z_t and w_t of length T, the system is generated with innovations μ_t ~ N(0, I), I being an identity matrix of the same order as μ_t. The null hypothesis of cointegration and the alternative of no cointegration are H_0: γ = 1 vs H_A: 0 ≤ γ < 1. The current study takes into account ten point alternative hypotheses, γ ∈ {0, 0.1, 0.2, …, 0.9}, and one point null hypothesis. ψ_t is the deterministic component included in the ADG; it consists of two deterministic parts, the intercept and the linear time trend. Moreover, the performance of the tests is assessed at four time lengths (T = 30, 60, 120 and 240), covering small to large samples, and three plausible cases of the deterministic part are explored. In all, 132 different ADG processes [(10 + 1) hypotheses × 4 sample sizes × 3 deterministic cases] have been explored to evaluate the performance of the tests.
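The paper's generating equations do not survive in this rendering, so the sketch below uses one plausible specification consistent with H_0: γ = 1 vs H_A: 0 ≤ γ < 1 (an assumption on our part, not the paper's exact model): w_t is a random walk, z_t = ψ_t + w_t + u_t, and u_t = u_{t−1} + ε_t − γ ε_{t−1}, so that γ = 1 makes u_t stationary (cointegration) while γ < 1 leaves it I(1) (no cointegration).

```python
import numpy as np

def generate(T, gamma, intercept=0.0, trend=0.0, seed=0):
    """Generate (z, w) under the ASSUMED DGP described above.
    gamma = 1 -> cointegration; 0 <= gamma < 1 -> no cointegration."""
    rng = np.random.default_rng(seed)
    w = np.cumsum(rng.normal(size=T))       # w_t: random walk, I(1)
    eps = rng.normal(size=T)
    u = np.zeros(T)
    for t in range(1, T):
        # MA-in-differences error: gamma = 1 cancels the unit root
        u[t] = u[t - 1] + eps[t] - gamma * eps[t - 1]
    psi = intercept + trend * np.arange(T)  # deterministic part
    z = psi + w + u
    return z, w

T = 2000
z1, w1 = generate(T, gamma=1.0, seed=3)  # null: cointegration
z0, w0 = generate(T, gamma=0.0, seed=3)  # alternative: no cointegration
```

With γ = 1 the error telescopes to u_t = ε_t − ε_0, so z − w is stationary; with γ = 0 the error is itself a random walk and z − w wanders.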

PLOS ONE
The probabilities of type I and II error

Estimation of empirical α
The current study assesses the performance of the tests on the basis of two α's: one using the asymptotic critical values and the other using the simulated critical values. In both cases the estimation procedure is the same:

i. Generation of the time series following the ADG for the null hypothesis.
ii. Estimation of the test statistic of each test.
iii. Decision on whether or not to reject the null (based on the asymptotic or simulated critical value).
iv. Repetition of steps i, ii and iii M times, counting the number of rejections of the null.
v. Estimation of α as the proportion of rejections of the null out of the M repetitions.
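The procedure is generic; the sketch below illustrates it with a one-sample t-test in place of the cointegration tests (our substitution, purely to keep the example short), estimating the empirical α from the rejection proportion at the asymptotic two-sided 5% critical value of 1.96:

```python
import numpy as np

rng = np.random.default_rng(4)
M, T = 5000, 100
crit = 1.96                     # asymptotic two-sided 5% critical value

rejections = 0
for _ in range(M):
    # i. generate data under the null (here: mean zero)
    x = rng.normal(size=T)
    # ii. compute the test statistic
    t_stat = np.sqrt(T) * x.mean() / x.std(ddof=1)
    # iii. decide on rejection
    if abs(t_stat) > crit:
        rejections += 1

# v. empirical alpha = rejection proportion over the M repetitions
alpha_hat = rejections / M
```

For the cointegration tests of this paper, the data generation in step i follows the ADG above and the statistic in step ii is LM, LBI, Sc, and so on.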

Estimation of simulated critical value
The current study uses the conventional significance level of 0.05 for the whole of its empirical analysis. Estimation of the simulated critical values has been carried out by following the steps below:

i. Generation of data under the point null hypothesis.
ii. Calculation of each test's test statistic.
iii. Repetition of the steps i and ii for fixed number of times say M.
iv. Recording of all of M test statistics in an array say S.
v. Calculation of the simulated critical value as the 2.5th and 97.5th percentiles of S for two-tailed tests, the 95th percentile of S for right-tailed tests, and the 5th percentile of S for left-tailed tests.
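Using the same illustrative t-statistic as before (our substitution for the cointegration tests), the simulated critical values are simply percentiles of the array S of null-simulated statistics:

```python
import numpy as np

rng = np.random.default_rng(5)
M, T = 5000, 100

# i.-iv. simulate M test statistics under the point null and store them in S
S = np.empty(M)
for m in range(M):
    x = rng.normal(size=T)
    S[m] = np.sqrt(T) * x.mean() / x.std(ddof=1)

# v. simulated critical values
lower, upper = np.percentile(S, [2.5, 97.5])   # two-tailed test
right = np.percentile(S, 95)                   # right-tailed test
left = np.percentile(S, 5)                     # left-tailed test
```

By construction, rejecting beyond these percentiles gives an empirical α of (approximately) 0.05 regardless of how far the statistic's finite-sample distribution is from its asymptotic one.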

Estimation of β
Using the same significance level of 0.05, the steps below have been followed to estimate the probability of Type II error, β.
i. Generation of data following a point alternative hypothesis.
ii. Estimation of the test statistic.
iii. Deciding about the rejection of null or not (based on simulated critical value).
iv. Repetition of steps i, ii and iii a fixed number of times, M, counting the number of rejections.
v. Estimation of β as the proportion of non-rejections of the null out of the M repetitions.
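Continuing the illustrative t-test example (our substitution), β is the non-rejection proportion under a point alternative, using the simulated critical value so that α stays at 0.05:

```python
import numpy as np

rng = np.random.default_rng(6)
M, T = 3000, 100

def t_stat(x):
    return np.sqrt(len(x)) * x.mean() / x.std(ddof=1)

# Simulated 95% critical value of |t| under the null
S = np.array([abs(t_stat(rng.normal(size=T))) for _ in range(M)])
crit = np.percentile(S, 95)

def beta_hat(mu):
    """Estimate beta: non-rejection proportion under alternative mean mu."""
    non_rejections = 0
    for _ in range(M):
        x = rng.normal(loc=mu, size=T)     # i. data under the alternative
        if abs(t_stat(x)) <= crit:         # iii. decision with simulated CV
            non_rejections += 1
    return non_rejections / M              # beta estimate

beta_near = beta_hat(0.2)   # alternative close to the null
beta_far = beta_hat(0.5)    # alternative far from the null
```

As the alternative moves away from the null, β falls, which is exactly the pattern the paper reports as γ moves from 0.9 towards 0.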

Results and discussion
The empirical probability of Type I error, α, is obtained using M = 30,000 for each test except the Ls test, which relies on numerical optimization algorithms to find the maximum likelihood estimates and is therefore very time consuming; for Ls, M = 5,000 is considered suitable. These α's are given in Table 2. It is obvious from Table 2 that not a single test has empirical α around the considered significance level of 0.05 for all three specifications of the deterministic part and all four sample sizes. For instance, LM has α around 0.05 for all four sample sizes only when the deterministic part is D1LT0. The values of empirical α around the specified significance level of 0.05 are marked in BOLD in Table 2. Out of 96 cases in total, the empirical α is around 0.05 in only 15. This distortion is due to the use of asymptotic critical values: since the sample sizes are finite, nearly all of the tests exhibit the over-rejection problem. The same was also stated by [34].
To control empirical α around 0.05, simulated critical values for all tests, at all four time lengths and under all three cases of the deterministic component, have been found; using these simulated critical values, empirical α has been calculated again and is displayed in Table 3. It is evident from Table 3 that all tests now have empirical α around 0.05. Controlling empirical α around the nominal level is essential in statistical hypothesis testing because α trades off against β, the probability of Type II error: relative to the nominal significance level, a decrease in empirical α inflates β and an increase in empirical α deflates β, either of which renders the conclusion meaningless.
As the use of simulated critical values controlled empirical α around 0.05, the significance level considered in this paper, these simulated critical values are used to estimate the probabilities of Type II error, β. The β's for the three cases of the deterministic component (D0LT0, D1LT0 and D1LT1) are displayed in Tables 4-6 respectively. The test with the minimum β is marked "A" and declared the best performer in this study; similarly, the tests with the 2nd and 3rd smallest β are marked "B" and "C" and nominated as better performer and good performer respectively. These nominations are made for each alternative hypothesis, i.e. each γ. It is evident from Table 4 that for the first case of the deterministic part, D0LT0, the LM test is the best performer at all ten γ's considered, for all four sample sizes. At T = 30, Sc is the better performer for some γ's and H2 for others; at the same sample size, three tests, H1, H2 and F, are good performers at different γ's. However, a closer inspection of Table 4 reveals that, excluding LM, the remaining seven tests have β well above 0.5, and even the LM test has β above 0.5 for γ from 0.9 down to 0.7. At the moderate sample size of 60, the β's generally improve, being smaller than those for T = 30. Now, for the majority of γ's, H2 is the better and H1 the good performer, though these two tests still have β above 0.5 for γ from 0.9 to 0.4. The LM test has β above 0.5 only for γ = 0.9; for the remaining γ's its β decreases very sharply. At the moderately large sample size of 120, the β's are generally smaller than those for T = 30 and T = 60. Sc is now the better performer for the majority of γ's and H2 for some; Lc, Sc, H2 and H1 are good performers at various γ's. LM improves its performance enormously, with all β's below 0.1 except at γ = 0.9 and 0.8. At the large time length of 240, all tests improve relative to the previous three time lengths, T = 30, 60 and 120. At T = 240, LM is the best performer, with β below 0.05 for all γ's except γ = 0.9; Sc is the better performer, and Lc and H2 are good performers.
For the second case of the deterministic component, D1LT0, LM again has the smallest β at all γ's for all T. Table 5 shows that at T = 30, the smallest sample size considered in the study, LM is the best performer, Sc and H2 are better performers, and Sc, H1 and Ls are good performers; however, all these tests have β well above 0.5, except LM at some γ's. At T = 60 the pattern of T = 30 repeats, but all tests now have smaller β than at T = 30; in particular, LM has β below 0.5 for all γ except 0.9. At the moderately large T = 120, all tests improve further relative to T = 30 and 60. LM remains the best performer, Sc becomes the better performer, and LBI and H2 become good performers. At T = 120, LM has β below 0.05 for all γ's except γ = 0.9, 0.8 and 0.7, and Sc has β below 0.5 for all γ's except γ = 0.9. At the largest T = 240, the performance at T = 120 is largely repeated, with the single addition that LBI is now the sole good performer; at T = 240, the three tests LM, Sc and LBI all have β below 0.4.
For the final case of the deterministic part considered in this study, D1LT1, LM is again the sole best performer at all γ's for all T. However, a detailed inspection of Table 6 reveals that at the smallest T = 30, all eight tests have β well above 0.5, except LM at some γ's; H2 is the better and H1 the good performer at T = 30. At T = 60 the situation is similar, with all tests above 0.5 except LM at γ = 0.9 and 0.8; now, however, Sc is the better and Ls the good performer. At the moderately large T = 120, all tests, and especially LM, Sc and Ls, improve, with β below 0.5 for most γ's; LM has β below 0.1 for all γ's except γ = 0.9 and 0.8, and Sc and Ls are the better and good performers respectively. At the largest sample size of T = 240, LM, Sc and LBI are the best, better and good performers respectively; LM now has β below 0.05 for all γ's except γ = 0.9, and Sc has β below 0.05 for all γ's except γ = 0.9, 0.8 and 0.7.

Conclusions and recommendations
This study aimed to evaluate the performance of null of cointegration tests on the basis of the probabilities of Type I and II errors using Monte Carlo simulations. In light of the discussion in the Results and Discussion section, it is concluded that the use of asymptotic critical values for the eight tests considered leads to distortion of the probability of Type I error: at almost all specifications of the deterministic part and sample sizes, the eight tests had probabilities of Type I error well above or well below the nominal significance level, i.e. they faced problems of over-rejection as well as under-rejection. In statistical and econometric analysis, both problems have serious implications, leading to useless conclusions and recommendations. As stated by [34], claims about the robustness of type I error probabilities are false when asymptotic critical values are used.
To solve these problems of over-rejection and under-rejection, simulated critical values were estimated and the probabilities of Type I error were calculated again. The use of simulated critical values controlled the probability of Type I error around the nominal significance level. For the further evaluation on the basis of the probability of Type II error, these simulated critical values were used, ensuring that the probability of Type I error is around the nominal significance level of 0.05 for all eight tests.
The LM test was the sole best performer at all alternative hypotheses, for all specifications of the deterministic part and all sample sizes, on the basis of the probability of Type II error; its probabilities of Type II error were well below those of its competitors. However, at the small time lengths of 30 and 60, even though it was the best performer, its probabilities of Type II error exceeded 0.5. The LM test performs better because it uses the contemporaneous variance estimator, whereas the rest use different forms of long run variance; the performance of the other tests therefore also depends on the choice of long run variance and weighting function, decisions from which the LM test is free. Moreover, its underlying assumptions and theory match the data generating process used in the current study. Of the remaining tests, Sc and H2 showed better and good performance overall. At the large sample sizes of 120 and 240, the LM test had extraordinarily small probabilities of Type II error, even below 0.05. The remaining five tests, excluding LM, Sc and H2, performed badly, with the single exception that LBI performed well for D1LT0 and D1LT1 at the time length of 240.
In light of these conclusions, it is recommended that the use of asymptotic critical values be avoided and that simulated critical values be used instead; with fast computers now widely available, obtaining simulated critical values is straightforward. When testing for possible cointegration with null of cointegration tests, the LM test should be given priority over the others. For large time lengths of 240 or more, when only an intercept is assumed, the LBI test may also be used.
In conducting the current study, a few gaps in the literature were observed that can be pursued in future work, starting with the development of a new test of the null of cointegration based on a vector autoregressive model, as no such test is available in the literature. It could also be assessed how the current tests perform when the data are generated from a multivariate system with more than one cointegrating vector.