Omnibus test for normality based on the Edgeworth expansion

Statistical inference in the form of hypothesis tests and confidence intervals often assumes that the underlying distribution is normal. Similarly, many signal processing techniques rely on the assumption that a stationary time series is normal. As a result, a number of tests have been proposed in the literature for detecting departures from normality. In this article we develop a novel approach to the problem of testing normality by constructing a statistical test based on the Edgeworth expansion, which approximates a probability distribution in terms of its cumulants. By modifying one term of the expansion, we define a test statistic which includes information on the first four moments. We compare the proposed test with existing tests for normality by analyzing different platykurtic and leptokurtic distributions, including the generalized Gaussian, mixed Gaussian, α-stable and Student's t distributions. We show that, for the considered sample sizes, the proposed test is superior in terms of power for the platykurtic distributions, whereas for the leptokurtic ones it is close to the best tests, such as those of D'Agostino-Pearson, Jarque-Bera and Shapiro-Wilk. Finally, we study two real data examples which illustrate the efficacy of the proposed test.


Introduction
Testing the hypothesis of normality is one of the fundamental procedures of statistical analysis, and a large number of normality tests exist. Some of them, such as the χ² goodness-of-fit test [1] with its variants, the Kolmogorov-Smirnov (KS) one-sample cumulative probability test [2], the Shapiro-Wilk (SW) test [3], the D'Agostino-Pearson (DP) test [4] and the Jarque-Bera (JB) test [5], are nowadays considered classical. These tests are based on comparing the distribution of the observed data to the expected distribution (χ²), on measuring the distance between the empirical and analytical distribution function (KS), on taking into account some transformations of moments of the data, like skewness and kurtosis (DP and JB), or on calculating some function of order statistics (SW). Other, less widespread, tests based on the empirical distribution function include the Kuiper test [6], the Watson test [7], the Cramer-von Mises (CvM) test [8], and the Anderson-Darling (AD) test [9]. Among other testing techniques, let us mention ideas based on the empirical characteristic function [10], on the dependence between moments that characterizes normal distributions [11], or on Noughabi's entropy estimator [12]. The Edgeworth series can be used to expand an arbitrary probability distribution in terms of its cumulants. So far, the Edgeworth expansion has been utilized to design a score test for normality of errors in a regression model [13], to design a normality test for the probit model [14], and to design a normality test against a specific alternative, such as the logistic distribution [15].
There have also been attempts in the literature to provide a one-sample statistical test of normality for data in a broader setting, such as a general Hilbert space [16]. Despite this variety, there have been continuing efforts to develop tests for the departure of a random sample from normality that could be considered omnibus [17,18,19,20,21,22], that is, able to reject the null hypothesis of normality with high power for a wide range of alternatives.
Many of the normality tests consider evaluating the third and fourth order moments and, hence, the power of such tests depends on whether a symmetric or skewed alternative is being considered. In general, it is expected that symmetric alternatives, or those with a small amount of skew, are more difficult to differentiate from the null hypothesis of normality than alternatives characterized by a large skew [20]. The fourth order moment, most commonly used in the form of kurtosis, has a less obvious effect on the performance of a normality test, but here it is also important whether the distribution is leptokurtic or platykurtic [23,24], that is, whether its kurtosis is larger or smaller than that of the normal distribution, respectively.
There are many signal processing applications in which the underlying distribution of the data is leptokurtic [25,26], while phenomena that can be modeled with a platykurtic distribution, with the exception of the uniform distribution, are less common [27]. Consequently, normality tests dedicated to platykurtic alternatives are scarce. Nevertheless, some effort has been made to improve the performance of normality tests across the range of symmetric platykurtic alternatives [28].
The aim of this work is to develop a novel test for normality of omnibus character that could outperform the classical tests for the case of platykurtic symmetric alternatives.
The paper is structured as follows. First, we derive a test statistic based on the second term of the Edgeworth expansion which incorporates information on both the skewness and kurtosis. In the next part we establish the main results. We formally construct a statistical test of normality and provide information on the critical values of the test. Next, we analyze the power of the test by Monte Carlo simulations. We take into account four symmetric distributions which can be very close to the normal law, namely the generalized Gaussian, mixed Gaussian, α-stable and Student's t distributions. The former two serve as examples of platykurtic distributions, whereas the latter two are classical leptokurtic probability laws. We compare the results with the power of the classical normality tests. Finally, we study two real data examples taken from a collection of over 1300 datasets that were originally distributed alongside the statistical software environment R and some of its add-on packages. The examples illustrate the efficacy of the proposed test. The findings of the paper are summarized in the last section.

Derivation of the new test statistic based on Edgeworth expansion
Let X_1, X_2, ..., X_N be a random sample from a distribution with finite mean μ = E(X_1). We define the arithmetic mean X̄_n = (1/n) Σ_{i=1}^{n} X_i, n = 1, 2, ..., N, and the standardized mean

T_n = √n (X̄_n − μ) / (Var X_1)^{1/2}.    (1)

The Edgeworth expansion is a series that approximates a probability distribution in terms of its cumulants [29]. For random variables X_i, i = 1, 2, ..., N, with finite kth moment it has the following form

F(y) ≈ Φ(y) + Σ_{i=1}^{k−2} H_i(y),    (2)

where

H_i(y) = n^{−i/2} P_i(y) φ(y).    (3)

In Eqs (2) and (3), Φ(y) and φ(y) stand for the cumulative distribution function (CDF) and the probability density function (PDF) of the standard normal distribution, respectively, and P_i(y) is an appropriate Hermite-type polynomial of degree 3i − 1 [29]. The coefficients of P_i(·) are expressed in terms of appropriate moments of the random variable X_1. For instance, the first two polynomials have the following form

P_1(y) = −(τ/6)(y² − 1),
P_2(y) = −y [(κ/24)(y² − 3) + (τ²/72)(y⁴ − 10y² + 15)],

where τ = E(X_1 − μ)³ (Var X_1)^{−3/2} and κ = E(X_1 − μ)⁴ (Var X_1)^{−2} − 3 are the skewness and excess kurtosis (in the figures of this paper also called, in short, kurtosis) of the random variable X_1.
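A minimal numerical sketch of the correction terms H_i(y) is given below, assuming the standard textbook forms of the first two Edgeworth polynomials (the paper modifies one term of the expansion, which is not reflected in this baseline sketch):

```python
import math

def phi(y):
    """PDF of the standard normal distribution."""
    return math.exp(-y * y / 2) / math.sqrt(2 * math.pi)

def P1(y, tau):
    """First Edgeworth polynomial (degree 2), standard form."""
    return -(tau / 6.0) * (y * y - 1.0)

def P2(y, tau, kappa):
    """Second Edgeworth polynomial (degree 5), standard form."""
    return -y * ((kappa / 24.0) * (y * y - 3.0)
                 + (tau ** 2 / 72.0) * (y ** 4 - 10.0 * y * y + 15.0))

def H(i, y, tau, kappa, n):
    """Correction term H_i(y) = n^(-i/2) * P_i(y) * phi(y), as in Eq (3)."""
    p = P1(y, tau) if i == 1 else P2(y, tau, kappa)
    return n ** (-i / 2.0) * p * phi(y)
```

For a normal sample both τ and κ vanish, so every correction term is identically zero, which is exactly the property the test statistic exploits.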
Let us now concentrate on the statistic T_n for n = 2 (see Eq (1)), which depends only on two random variables, X_1 and X_2. Such a choice of n allows one to obtain many realizations of the statistic from a single sufficiently long trajectory by dividing the data into blocks of length 2. The choice n = 2 appears to be optimal; to support this claim, in the simulation study we present a comparison between the results obtained for n = 2 and n = 3. By Eq (2), the CDF of the T_2 statistic can be approximated by the expansion with n = 2. Also, (n + 1)/n T_2 has a Student's t-distribution with 1 degree of freedom when X_1 follows a normal distribution. For practical reasons we assume k = 2. The functions H_i(y), i = 1, 2, carry the information about the deviation of the T_2 distribution from the standard normal law. If the distribution underlying the random sample is close to normal, we expect these deviations (hence the functions) to be smaller than for non-normal distributions. They can also be called corrections to the normal distribution, since the CDF of T_2 is approximated by the CDF of the standard normal distribution corrected by these functions.

Let us now define two statistics, H_1^max and H_2^max, as the maxima of the functions H_1(y) and H_2(y), respectively, calculated over the y values at which the empirical CDF of T_2 changes, that is, over the observed T_2 values. This is similar to calculating the Kolmogorov-Smirnov statistic. For j = 0, 1, ..., ⌊N/2⌋ − 1 let

X̄_2^(2j+1) = (X_{2j+1} + X_{2j+2}) / 2   and   T_2^(2j+1) = √2 (X̄_2^(2j+1) − μ) / (Var X_1)^{1/2}.

This arrangement of the random sample into blocks of length two is arbitrary and other arrangements are permitted. Then

H_i^max = max_{y ∈ Y} H_i(y),   where Y = {T_2^(2j+1) : j = 0, 1, ..., ⌊N/2⌋ − 1}.    (9)

To ascertain whether the test statistic H_i^max, i = 1, 2, is sensitive to deviations from normality, we perform Monte Carlo simulations for two non-normal distributions described in more detail in the Appendix.
They are the generalized Gaussian (GG) distribution with parameters μ = 1, β = 0.2, and ρ = 2.2, corresponding to the case of a platykurtic distribution, and the Student's t-distribution with ν = 16 degrees of freedom, corresponding to the case of a leptokurtic distribution. For M simulated samples x_1, x_2, ..., x_N of size N we calculate ⌊N/2⌋ values of the T_2 statistic and evaluate the maximum of H_i(y), i = 1, 2, over all values of the standardized means T_2 obtained for a given sample, according to Eq (9). As a result, we obtain M realizations of H_1^max and H_2^max for the considered distributions, which we compare with the corresponding empirical PDFs obtained for the standard normal distribution. The empirical PDFs were constructed as kernel density estimators [30]. In the simulations we considered N = 1000 and M = 5000.
We can observe that H_1^max is less sensitive to deviations from normality than the H_2^max statistic. This effect is visible for both analyzed non-normal distributions. Therefore, we propose a test for normality based on the statistic

H_2^max = max_{y ∈ Y} H_2(y),   where Y = {T_2^(2j+1) : j = 0, 1, ..., ⌊N/2⌋ − 1}.    (10)

Construction of the test
In our statistical test the null hypothesis (H_0) is that the data come from a normal distribution with unknown mean and variance. The alternative hypothesis (H_1) is that the data set does not come from such a distribution. The test statistic is given by formula (10). As explained in the previous section, for sample data x_1, x_2, ..., x_N we first calculate the averages of consecutive non-overlapping pairs and the corresponding values of the T_2 statistic; in these formulas we replace the coefficients μ, τ and κ by the sample mean, skewness and excess kurtosis, respectively, calculated for the whole series. We reject the H_0 hypothesis if the test statistic is extreme, that is, either larger than an upper critical value or smaller than a lower critical value at a given significance level c. The testing procedure is summarized in Fig 2.
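The whole procedure — plug-in estimation of the moments, Monte Carlo critical values under H_0, and the rejection rule — might be sketched as follows. The helper names `h2max_hat`, `mc_critical_values` and `reject_normality` are hypothetical, and the standard (unmodified) form of the second Edgeworth polynomial is assumed:

```python
import math
import random

def h2max_hat(x):
    """H2_max statistic with mu, sigma, tau and kappa replaced by sample
    estimates computed from the whole series (plug-in sketch)."""
    n = len(x)
    mu = sum(x) / n
    sig = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
    tau = sum((v - mu) ** 3 for v in x) / n / sig ** 3
    kap = sum((v - mu) ** 4 for v in x) / n / sig ** 4 - 3.0

    def H2(y):
        # 0.5 = n^(-i/2) for n = 2, i = 2; standard P2 polynomial assumed
        p2 = -y * ((kap / 24.0) * (y * y - 3.0)
                   + (tau ** 2 / 72.0) * (y ** 4 - 10.0 * y * y + 15.0))
        return 0.5 * p2 * math.exp(-y * y / 2) / math.sqrt(2 * math.pi)

    # standardized means of consecutive non-overlapping pairs
    t2 = [math.sqrt(2) * ((x[j] + x[j + 1]) / 2 - mu) / sig
          for j in range(0, n - 1, 2)]
    return max(H2(y) for y in t2)

def mc_critical_values(N, level=0.05, M=1000, seed=0):
    """Estimate lower/upper critical values Q1, Q2 under H0 by simulating
    M standard-normal samples of size N (Monte Carlo sketch)."""
    rng = random.Random(seed)
    s = sorted(h2max_hat([rng.gauss(0, 1) for _ in range(N)])
               for _ in range(M))
    return s[int(M * level / 2)], s[int(M * (1 - level / 2)) - 1]

def reject_normality(x, q1, q2):
    """Reject H0 if the statistic falls outside [Q1, Q2]."""
    s = h2max_hat(x)
    return s < q1 or s > q2
```

In practice one would tabulate Q1 and Q2 once per sample size and significance level, as done in Table 1 of the paper.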

Power simulation study
The power of a test is the probability of rejecting the null hypothesis H_0 when the alternative H_1 is true, and it is an important characteristic of any statistical test. In our case the power of the test is the probability, under H_1, that the test statistic falls below the lower critical value Q1 or above the upper critical value Q2. In our study, for all considered cases we calculate the power of the test by Monte Carlo simulations. More precisely, for a given sample size N we simulate M independent trajectories from the considered distribution.
For each trajectory the value of the H_2^max test statistic is calculated and we check whether it falls into the critical region constructed for a given significance level c. The power of the test is evaluated as the fraction of trajectories for which the value of H_2^max is larger than the upper critical value Q2 or smaller than the lower critical value Q1.
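This power-estimation loop can be sketched generically; `statistic` below is a stand-in for the paper's H_2^max (any callable mapping a sample to a float works), and `sampler` draws one sample from the alternative distribution:

```python
import random

def mc_power(statistic, sampler, q1, q2, M=1000, seed=1):
    """Monte Carlo power estimate: the fraction of M simulated samples
    whose test statistic falls below q1 or above q2."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(M):
        s = statistic(sampler(rng))
        if s < q1 or s > q2:
            hits += 1
    return hits / M
```

For example, feeding a sampler that draws Student's t trajectories and the critical values obtained under H_0 reproduces one cell of the power tables.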
In the following, we perform a simulation study for four selected distributions: two belong to the platykurtic class and two to the leptokurtic one. In the first group we choose the generalized Gaussian and the mixed Gaussian distributions, for which the JB test was found to perform poorly [24]. In the second group we take the α-stable and Student's t distributions. The detailed results are presented in Tables 2-21 and Figs 3-6.

In Fig 7 we study the power of the proposed test for normality for the generalized Gaussian distribution. For this distribution we assume μ = 1, β = 0.2 and analyze the power by changing the ρ parameter. As described in the Appendix, in this case the ρ parameter controls whether the generalized Gaussian distribution belongs to the leptokurtic or the platykurtic class. Since there is a one-to-one correspondence between the ρ parameter and the excess kurtosis (see formula (14)), we present the power of the test with respect to the excess kurtosis. It is compared with the power of the most common tests for normality, namely the JB, DP, SW, Kuiper, Watson, CvM, AD and χ² tests and the test of Zoubir and Arnold [22], based on the empirical characteristic function (CF), which was shown to perform well for smaller sample sizes. We can see that the proposed test is clearly superior to the other tests for N ≥ 200 and that for N = 100 it shares this superiority with the DP test, which, on the other hand, performs best among the considered tests for sample size N = 50 and very small kurtosis values. We can also observe that the least performing tests in this study are the KS test, the χ² test and (surprisingly) the JB test, especially for short samples.
The second considered distribution is the mixed Gaussian, which is also a member of the platykurtic class. Here we assume μ_1 = 1, σ_1 = 1 and σ_2 = 1 and analyze the power of the test by changing the μ_2 parameter. As explained in the Appendix, there is a one-to-one correspondence between the μ_2 parameter and the excess kurtosis of the mixed Gaussian distribution (see formula (16)). Therefore, in Fig 8 we present the power of the test with respect to the excess kurtosis and compare it with the power of the common tests for normality. As before, we can see that the proposed test is clearly superior to the other tests for N > 100, while its performance diminishes for sample size N = 50. Again, the least performing tests in this study are the KS test, the χ² test and the JB test, especially for short samples.
The next two considered distributions, namely the α-stable and Student's t-distributions, belong to the class of leptokurtic distributions. We note that the α-stable distribution is difficult to differentiate from the normal distribution when α is close to two, and the same is true for the Student's t distribution when the number of degrees of freedom is large [31,32,17].
In Fig 9 we consider the symmetric α-stable distribution with σ = 1. Since for the α-stable distribution the excess kurtosis is infinite, in this study we analyze the power of the test with respect to the stability index α and compare the test performance with the results of the classical tests. We can see that the situation for the leptokurtic distributions is different from that for the platykurtic ones. We can clearly identify two groups of tests. The first group, which consists of the introduced test and the JB, SW and DP tests, performs much better than the second group containing the rest of the tests (only for N = 50 does the introduced test visibly fall behind the top three tests, while still remaining superior to the others). It seems that the SW test has the best overall performance; the proposed test is the last in the first group for small and moderate samples, but for N = 1000 it behaves very similarly to the SW test.
The last considered distribution is the Student's t. In Fig 10 the power of the proposed test for normality is presented as a function of the number of degrees of freedom, since for the Student's t-distribution the excess kurtosis is finite only if the number of degrees of freedom exceeds 4. As in the previous cases, the power results are compared with those of the common tests for normality. We can see that the situation is very similar to that observed for the α-stable distribution. We can again observe two groups of tests. The introduced test belongs to the select group of tests that perform much better than the rest (again, only for N = 50 does it visibly fall behind the leaders). In this case, it seems that the JB test is superior in most of the cases. The proposed test is the last in the first group for small and moderate samples, but for N = 1000 it becomes even better than the JB test.

Table 2. Comparison of the powers of the tests for normality for the GG distribution and N = 20. The following tests are taken under consideration: the new test proposed in this paper, the Jarque-Bera (JB) test [5], D'Agostino-Pearson (DP) test [4], Shapiro-Wilk (SW) test [3], the test based on the empirical characteristic function (CF) [22], Kolmogorov-Smirnov (KS) test [2], Kuiper test [6], Watson test [7], Cramer-von Mises (CvM) test [8], Anderson-Darling (AD) test [9] and the χ² goodness-of-fit test [1].

PLOS ONE
In the Appendix, in Tables 2-21, we present the powers of the considered normality tests, with the best results highlighted, also for N = 20. The situation for N = 20 is similar to the case N = 50, with two striking differences: the χ² test fails to reject simulated samples from the generalized Gaussian, mixed Gaussian and α-stable distributions, and the CF test significantly improves its performance, being the clear winner for the generalized Gaussian and mixed Gaussian distributions.

In order to analyze the influence of the n parameter on the effectiveness of the proposed test, in the Appendix we present a comparison of the power of the test for n = 2 and n = 3 for all considered distributions and sample sizes. In Tables 22-25 we report the power of the test and in each considered case highlight the best results. We observe that for the platykurtic distributions the test with n = 2 generally outperforms the test with n = 3, especially for larger sample sizes and excess kurtosis much smaller than zero. For the leptokurtic distributions the power of the test is comparable in both cases, n = 2 and n = 3.

Application to real time series
In order to demonstrate how the proposed methodology can be applied to real data, in this section we consider two illustrative datasets from a collection of over 1300 datasets that were originally distributed alongside the R environment [33]. The inclusion criteria for a dataset to be considered an illustrative example in our study were: sufficient sample size, lack of obvious trend, lack of obvious correlation, and platykurtosis. When necessary, the data were differenced to arrive at weak stationarity. The first dataset (Data 1) is related to oil investments; in the collection the data appear under the name "Oil Investment". For the analysis we took the variable "waterd", which describes the depth of the sea in metres, and we examine the first 50 available observations. A detailed description of the data can be found in Ref. [34]. The second dataset (Data 2) corresponds to the non-experimental "control" group used in various studies of the effect of a labor training program; the time series is titled "Labour Training Evaluation Data". For the analysis we took the first 200 observations of the differenced time series "re78", which describes real earnings in the year 1978 [35,36,37,38].
The two considered datasets are presented in Fig 11. For both time series we apply the proposed test to ascertain whether the hypothesis of normality can be rejected. The testing procedure is illustrated in Fig 2. We check whether the value of the test statistic falls between the lower and upper critical values constructed for samples of size N at a given significance level, see Table 1. We reject the hypothesis of normal distribution if the value of the test statistic is smaller than the lower critical value Q1 or higher than the upper critical value Q2. The results of testing the hypothesis of normality for the two real datasets with the proposed test, as well as with the other normality tests considered here, are presented in Table 26 for significance levels of 1% and 5%. For Data 1 we can observe that the new test rejects the hypothesis of normality at both significance levels, 1% and 5%. At the significance level of 5% the same result is obtained for the SW, KS, Kuiper, Watson, CvM, AD and χ² tests. The JB, DP and CF tests do not reject the hypothesis of normality, although the empirical kurtosis is smaller than zero (see Table 26). At the significance level of 1% the introduced test and the χ² test are the only ones that lead to the rejection of the null hypothesis. Interestingly, for Data 2, only the new test rejects the hypothesis of normality at the significance level of 5%.

Conclusions
Developing an omnibus test for normality of a random sample is a challenging and important task in signal processing, particularly difficult for symmetric alternatives and for those that are close to the normal distribution. We examined the behavior of the Edgeworth expansion when the distribution of a random sample deviates from the normal assumption. Then, by appropriately utilizing the second term of this expansion, we designed a novel test for normality that can be treated as omnibus.
Evaluated via Monte Carlo simulations, the proposed test showed performance superior to that of other statistical tests for normality, particularly for platykurtic distributions and sample sizes greater than or equal to 100. For these distributions the proposed test had the highest power in almost all cases.
For the leptokurtic distributions the situation was different, but the proposed test was still among the best. For this class of distributions we were able to identify two groups of tests. The proposed test was shown to belong to the group of powerful tests that includes the well-known D'Agostino-Pearson, Shapiro-Wilk and Jarque-Bera tests. For the largest sample size it even surpassed these top competitors.
We also showed the efficacy of the introduced test by studying two datasets from the open R language data repository. We compared the results of the proposed test with those of the other normality tests considered here. For the first dataset the new test was among the few which led to the rejection of the hypothesis of normality at the significance level of 5%; at the level of 1%, only the proposed test and the χ² test rejected normality. For the second dataset the new test was the only one which rejected the hypothesis of normality.
Finally, we note that the test is relatively easy to use. The introduced statistic is computationally simple. To conduct the test one needs to calculate critical values for a given significance level. In the paper we presented a simple algorithm to calculate these values and provided the critical values for typical sample sizes and significance levels.

Table 26. Results of the new test and of the other tests for normality considered here for two real-world datasets at significance levels of c = 0.01 and c = 0.05. Tests taken into consideration: the new test proposed in this paper, the Jarque-Bera (JB) test [5], D'Agostino-Pearson (DP) test [4], Shapiro-Wilk (SW) test [3], the test based on the empirical characteristic function (CF) [22], Kolmogorov-Smirnov (KS) test [2], Kuiper test [6], Watson test [7], Cramer-von Mises (CvM) test [8], Anderson-Darling (AD) test [9] and the χ² goodness-of-fit test [1]. "0" means that normality is not rejected, "1" that it is rejected. In parentheses the values of the test statistic for the new test are presented.

Student's t-distribution. The excess kurtosis of the Student's t-distribution is finite only for ν > 4 and takes the form κ = 6/(ν − 4). In Fig 14 we show the theoretical excess kurtosis for the Student's t-distribution with respect to the ν parameter.

α-stable distribution.
The α-stable random variable with parameters α, σ, β and μ is defined through its characteristic function in the following way [45]:

F(k) = exp{−σ^α |k|^α [1 − iβ sgn(k) tan(πα/2)] + iμk}   if α ≠ 1,
F(k) = exp{−σ |k| [1 − iβ (2/π) sgn(k) ln|k|] + iμk}   if α = 1,

where 0 < α ≤ 2 is the stability index, σ > 0 is the scale parameter, −1 ≤ β ≤ 1 is the skewness parameter and μ ∈ ℝ is the location parameter. The explicit formula for the PDF of an α-stable random variable is not available in elementary form, except for three cases: the normal, Cauchy and Lévy distributions. In this work, distributions with the stability index α close to two (i.e., close to the normal distribution) are studied in detail.
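The characteristic function above can be evaluated directly; the sketch below follows the signs as quoted in the text (note that for α = 2 and β = 0 it reduces to the Gaussian characteristic function exp(−σ²k²)):

```python
import cmath
import math

def stable_cf(k, alpha, sigma=1.0, beta=0.0, mu=0.0):
    """Characteristic function of the alpha-stable law in the
    parameterization quoted above (sketch)."""
    if k == 0:
        return complex(1.0, 0.0)
    sgn = math.copysign(1.0, k)
    if alpha != 1.0:
        exponent = (-sigma ** alpha * abs(k) ** alpha
                    * (1 - 1j * beta * sgn * math.tan(math.pi * alpha / 2))
                    + 1j * mu * k)
    else:
        exponent = (-sigma * abs(k)
                    * (1 - 1j * beta * (2 / math.pi) * sgn * math.log(abs(k)))
                    + 1j * mu * k)
    return cmath.exp(exponent)
```

Such an evaluation is what underlies the characteristic-function-based (CF) test considered in the power study.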
Comparison of the power of the test for n = 2 and n = 3. In this part we present a comparison of the power of the test for two cases: n = 2 and n = 3. In Tables 22-25 we highlight the best results for the considered cases.
Comparison of the computational times. In this part we present a comparison of the computational time of the proposed normality test with the computational times of the other considered tests. We present the running times only for one exemplary distribution, namely the Gaussian, and one sample size, namely N = 1000. The power of each test was calculated on the basis of 5000 simulations. In Table 27 we report the mean computational times, calculated as the time needed to evaluate the power of a test divided by the number of Monte Carlo simulations (in our case 5000). We note that the tests based on the empirical distribution function (EDF tests) seem to be unusually slow. This is due to the fact that we calculated the powers by evaluating p-values for all samples by means of Monte Carlo simulations, as advocated by Ross [46] for goodness-of-fit testing in the case of unspecified parameters.

Critical values of the test. In this part we present critical values of the introduced test for the five considered sample sizes (N = 20, 50, 100, 200, 1000) and two significance levels c: 1% and 5%, see Table 1.
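The Monte Carlo p-value procedure used for the EDF tests can be sketched as follows, assuming a normal null with mean and variance re-estimated from the data; `stat` is any goodness-of-fit statistic (larger meaning more extreme), not a specific one from the paper:

```python
import random

def mc_pvalue(stat, sample, M=500, seed=0):
    """Monte Carlo p-value for a composite normal null: simulate M normal
    samples with the estimated parameters and report the fraction of
    simulated statistics at least as extreme as the observed one."""
    n = len(sample)
    mu = sum(sample) / n
    sd = (sum((v - mu) ** 2 for v in sample) / (n - 1)) ** 0.5
    rng = random.Random(seed)
    observed = stat(sample)
    count = sum(stat([rng.gauss(mu, sd) for _ in range(n)]) >= observed
                for _ in range(M))
    # +1 corrections keep the p-value strictly positive
    return (count + 1) / (M + 1)
```

Because each p-value requires M extra simulations per sample, computing a power estimate this way multiplies the running time accordingly, which explains the slow EDF timings in Table 27.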
Comparison of the powers of the tests for normality. In this part we present a comparison of the powers of the normality tests considered in this paper for four distributions and for five sample sizes (N = 20, 50, 100, 200, 1000). The results are presented in Tables 2-21, where we have highlighted the best results.