Min-max approach for comparison of univariate normality tests

Comparisons of normality tests based on absolute or average powers are bound to give ambiguous results, since these statistics critically depend upon the alternative distribution, which cannot be specified. A test which is optimal against a certain type of alternative may perform poorly against other alternative distributions. Thus, an invariant benchmark has been proposed in the recent normality literature by computing Neyman-Pearson tests against each alternative distribution. However, the computational cost of this benchmark is high; therefore, this study proposes an alternative approach for computing the benchmark. The proposed min-max approach reduces the cost of computing and estimating the Neyman-Pearson tests against each alternative distribution. An extensive simulation study is conducted to evaluate the selected normality tests using the proposed methodology. The proposed min-max method produces results similar to those of the benchmark based on Neyman-Pearson tests, but at a low computational cost.


Introduction
Normality of the data is the underlying distributional assumption of a multitude of statistical procedures and estimation techniques. In both cross-sectional and time series data, assuming normality without testing may affect the accuracy of econometric inference [1]. Statistical inference from regression models applied to time series [2], categorical [3] and count data [4] depends crucially on the assumption of normal errors. The experimental data sets generated in clinical chemistry for the construction of population reference ranges require the assumption of normality [5]. In short, the normality assumption is key to validating the inferences made from regression models and other statistical procedures. Diagnostic tests for normality are important because Blanca et al. [6] find that only 5.5 percent of 693 real data distributions are close to normality when skewness and kurtosis are considered together.
Given the importance of the subject, the literature has produced a plethora of goodness-of-fit tests to detect departures from normality [7][8][9][10][11][12][13]. With the development of several normality tests over the decades, power comparisons of these statistics have received due consideration in the search for the best test, thus helping researchers choose a suitable normality test [14][15][16][17][18][19]. Different characteristics of the normal distribution are exploited in developing normality statistics; consequently, the power of normality tests varies depending upon the nature of the non-normality [19]. Thus, one normality statistic may perform well for one alternative distribution and another for a different non-normal alternative [18]. Comparisons of normality tests via simulations are bound to give ambiguous results, since these statistics critically depend upon the alternative distribution, which cannot be specified. This study rests on the finding that one normality test is optimal against one alternative and another against another alternative distribution [20]. The best test's performance against each alternative distribution provides the benchmark for comparing normality tests under the min-max criterion: the maximum deviation of each selected test from the benchmark is computed, and the test with the minimum such deviation is ranked as best. This method reduces the calculation burden of computing and estimating the Neyman-Pearson test against each alternative distribution, as required by the benchmark proposed in [18].
Another problem is that the alternative space is infinite dimensional. Since we plan to use numerical methods, we must narrow this space down to something sufficiently small to permit exploration by numerical methods. At the same time, the space should be large enough to provide a good approximation to the full space of alternatives; failing that, it should be large enough to approximate the distributions conventionally used in simulation studies. Since first and second order departures from normality depend on the skewness and kurtosis of the distribution, we have used 72 alternatives with wide ranges of these parameters. This alternative space includes mixtures of uniform distributions, mixtures of t-distributions and the distributions used in the literature [14,16,17,18,21].

Normality tests
This section deals with the background and technical details of the selected normality tests. The tests are drawn from different classes of normality tests, e.g., ECDF-based, moment-based, and regression- and correlation-based tests.
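In the following literature review, we consider $x_1, x_2, \ldots, x_n$ as a random sample of size $n$. Then $\bar{x}$, $s^2$, $\sqrt{b_1}$ and $b_2$ are the sample mean, variance, skewness and kurtosis, respectively, defined as

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad \sqrt{b_1} = \frac{m_3}{m_2^{3/2}}, \qquad b_2 = \frac{m_4}{m_2^{2}},$$

where the $r$th central sample moment is defined as

$$m_r = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^r.$$

Let $Z_1(\sqrt{b_1})$ and $Z_2(b_2)$ denote the approximate standardized normal variables obtained by transforming $\sqrt{b_1}$ and $b_2$. These can be computed by the following algorithm provided in [24].

Computational algorithm for $Z_1(\sqrt{b_1})$:
1. Compute $\sqrt{b_1}$ from the sample data.
2. Compute

$$Y = \sqrt{b_1}\,\sqrt{\frac{(n+1)(n+3)}{6(n-2)}}.$$

Let $k_4$ be the fourth central moment of $\sqrt{b_1}$; then

$$k_4 = \frac{3(n^2 + 27n - 70)(n+1)(n+3)}{(n-2)(n+5)(n+7)(n+9)}, \qquad w^2 = -1 + \sqrt{2(k_4 - 1)}, \qquad \delta = \frac{1}{\sqrt{\ln w}}, \qquad \alpha = \sqrt{\frac{2}{w^2 - 1}},$$

where $w$, $\delta$ and $\alpha$ are constants.
3. Compute

$$Z_1(\sqrt{b_1}) = \delta \ln\left(\frac{Y}{\alpha} + \sqrt{\left(\frac{Y}{\alpha}\right)^2 + 1}\right).$$

Computational algorithm for $Z_2(b_2)$:
1. Compute $b_2$ from the sample data.
2. Compute the mean and variance of $b_2$:

$$E(b_2) = \frac{3(n-1)}{n+1}, \qquad \mathrm{var}(b_2) = \frac{24n(n-2)(n-3)}{(n+1)^2(n+3)(n+5)}.$$

3. Compute the standardized value of $b_2$ and its third standardized moment:

$$u = \frac{b_2 - E(b_2)}{\sqrt{\mathrm{var}(b_2)}}, \qquad \sqrt{\beta_1(b_2)} = \frac{6(n^2 - 5n + 2)}{(n+7)(n+9)}\sqrt{\frac{6(n+3)(n+5)}{n(n-2)(n-3)}}.$$

4. Compute

$$A = 6 + \frac{8}{\sqrt{\beta_1(b_2)}}\left[\frac{2}{\sqrt{\beta_1(b_2)}} + \sqrt{1 + \frac{4}{\beta_1(b_2)}}\right],$$

$$Z_2(b_2) = \frac{\left(1 - \dfrac{2}{9A}\right) - \left[\dfrac{1 - 2/A}{1 + u\sqrt{2/(A-4)}}\right]^{1/3}}{\sqrt{2/(9A)}}.$$

D'Agostino and Pearson [8] proposed a test statistic for testing normality that combines $\sqrt{b_1}$ and $b_2$ in the following way:

$$K^2 = Z_1^2(\sqrt{b_1}) + Z_2^2(b_2),$$

where $K^2$ is distributed as chi-square with two degrees of freedom. The normality hypothesis is rejected for large values of the test statistic.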
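The transformation steps for the skewness and kurtosis statistics can be sketched in a few lines of Python. This is a minimal illustration, assuming the D'Agostino-type formulas of [24] as they are commonly stated; the function names (`central_moment`, `z_skewness`, `z_kurtosis`, `k2_statistic`) are hypothetical and not taken from the paper, and the approximations are usually recommended only for moderately sized samples (roughly n of 20 or more for the kurtosis transform).

```python
import math

def central_moment(x, r):
    """r-th central sample moment m_r (divisor n)."""
    n = len(x)
    mean = sum(x) / n
    return sum((v - mean) ** r for v in x) / n

def z_skewness(x):
    """Approximate standard normal transform of the sample skewness sqrt(b1)."""
    n = len(x)
    sqrt_b1 = central_moment(x, 3) / central_moment(x, 2) ** 1.5
    y = sqrt_b1 * math.sqrt((n + 1) * (n + 3) / (6.0 * (n - 2)))
    k4 = (3.0 * (n * n + 27 * n - 70) * (n + 1) * (n + 3)
          / ((n - 2) * (n + 5) * (n + 7) * (n + 9)))
    w2 = -1.0 + math.sqrt(2.0 * (k4 - 1.0))
    delta = 1.0 / math.sqrt(math.log(math.sqrt(w2)))
    alpha = math.sqrt(2.0 / (w2 - 1.0))
    # ln(t + sqrt(t^2 + 1)) is the inverse hyperbolic sine of t.
    return delta * math.asinh(y / alpha)

def z_kurtosis(x):
    """Approximate standard normal transform of the sample kurtosis b2."""
    n = len(x)
    b2 = central_moment(x, 4) / central_moment(x, 2) ** 2
    mean_b2 = 3.0 * (n - 1) / (n + 1)
    var_b2 = 24.0 * n * (n - 2) * (n - 3) / ((n + 1) ** 2 * (n + 3) * (n + 5))
    u = (b2 - mean_b2) / math.sqrt(var_b2)
    sqrt_beta1 = (6.0 * (n * n - 5 * n + 2) / ((n + 7) * (n + 9))
                  * math.sqrt(6.0 * (n + 3) * (n + 5) / (n * (n - 2) * (n - 3))))
    a = 6.0 + (8.0 / sqrt_beta1) * (2.0 / sqrt_beta1
                                    + math.sqrt(1.0 + 4.0 / sqrt_beta1 ** 2))
    ratio = (1.0 - 2.0 / a) / (1.0 + u * math.sqrt(2.0 / (a - 4.0)))
    # Real cube root that also works when the ratio is negative.
    cube_root = math.copysign(abs(ratio) ** (1.0 / 3.0), ratio)
    return ((1.0 - 2.0 / (9.0 * a)) - cube_root) / math.sqrt(2.0 / (9.0 * a))

def k2_statistic(x):
    """Omnibus K^2: sum of the two squared transforms."""
    return z_skewness(x) ** 2 + z_kurtosis(x) ** 2
```

For a perfectly symmetric sample the skewness transform is zero, so the omnibus statistic is driven entirely by the kurtosis component.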

The Jarque-Bera test.
In the field of economics, the most widely used statistic for normality testing was introduced by Jarque and Bera [10,11]. It is based on the standardized third and fourth moments:

$$JB = \frac{n}{6}\left[\left(\frac{m_3}{m_2^{3/2}}\right)^2 + \frac{1}{4}\left(\frac{m_4}{m_2^{2}} - 3\right)^2\right],$$

where $n$ is the number of observations and $m_i$ is the $i$th central moment of the observations (i.e., $m_i = \frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})^i$). Asymptotically, the JB statistic is distributed as chi-square with two degrees of freedom. The hypothesis of normality is rejected for large values of the test statistic.
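As a small illustration of the moment-based construction, the JB statistic can be computed directly from the second, third and fourth central sample moments. The sketch below assumes the usual form of the statistic, n/6 times the squared skewness plus one quarter of the squared excess kurtosis; the function name `jarque_bera` is a hypothetical choice.

```python
def jarque_bera(x):
    """Jarque-Bera statistic from central sample moments (divisor n)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
```

In practice the asymptotic chi-square approximation is slow to take hold, so simulated critical values are often preferred for the sample sizes considered in this study.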

The Robust Jarque-Bera test.
Gel and Gastwirth [13] introduced robust measures of sample skewness and kurtosis by utilizing a robust measure of dispersion that is less sensitive to outliers, the average absolute deviation from the sample median. This leads to the following robust JB statistic:

$$RJB = \frac{n}{6}\left(\frac{m_3}{J_n^{3}}\right)^2 + \frac{n}{64}\left(\frac{m_4}{J_n^{4}} - 3\right)^2, \qquad J_n = \frac{\sqrt{\pi/2}}{n}\sum_{i=1}^{n}\lvert x_i - M\rvert,$$

where $M$ is the sample median. The RJB statistic asymptotically follows the chi-square distribution with two degrees of freedom.
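The robust variant differs from the classical statistic only in the scale estimate used to standardize the third and fourth moments. A minimal sketch follows, assuming the commonly stated form of the statistic with the scale J_n equal to sqrt(pi/2) times the mean absolute deviation from the median; the function name `robust_jarque_bera` is hypothetical.

```python
import math
import statistics

def robust_jarque_bera(x):
    """Robust JB statistic: moments standardized by the robust scale J_n."""
    n = len(x)
    mean = sum(x) / n
    median = statistics.median(x)
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    # J_n: scaled average absolute deviation from the sample median.
    j_n = math.sqrt(math.pi / 2.0) * sum(abs(v - median) for v in x) / n
    return n / 6.0 * (m3 / j_n ** 3) ** 2 + n / 64.0 * (m4 / j_n ** 4 - 3.0) ** 2
```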

The Bonett-Seier test.
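An alternative measure of kurtosis (G-kurtosis) based on Geary's [25] test for normality is defined by Bonett and Seier [12] as

$$\omega = 13.29\,(\ln\sigma - \ln\tau),$$

where $\sigma$ and $\tau$ are the population standard deviation and mean absolute deviation, respectively. The factor of 13.29 scales $\omega$ so that it equals 3 under normality, matching the standard measure of kurtosis. The test statistic is

$$Z_w = \frac{\sqrt{n+2}\,(\hat{\omega} - 3)}{3.54}, \qquad \hat{\omega} = 13.29\left[\ln\sqrt{m_2} - \ln\left(\frac{1}{n}\sum_{i=1}^{n}\lvert x_i - \bar{x}\rvert\right)\right].$$

The above $Z_w$-statistic is approximately distributed as standard normal.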
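The G-kurtosis statistic is simple to compute once the sample standard deviation and the mean absolute deviation are available. The sketch below assumes the standardization usually attributed to Bonett and Seier, sqrt(n + 2)(omega_hat - 3)/3.54, with the standard deviation estimated by sqrt(m_2); the function name `bonett_seier` is hypothetical.

```python
import math

def bonett_seier(x):
    """Z_w statistic based on Geary-type kurtosis (assumed standard form)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n        # second central moment
    tau = sum(abs(v - mean) for v in x) / n         # mean absolute deviation
    omega = 13.29 * (math.log(math.sqrt(m2)) - math.log(tau))
    return math.sqrt(n + 2) * (omega - 3.0) / 3.54
```

Negative values point towards lighter tails than the normal distribution and positive values towards heavier tails, so the statistic can be used one- or two-sided.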

Distance/ECDF tests
This class of tests deals with the comparison of the empirical cumulative distribution function (ECDF), $F_n(x_{(i)}) = i/n$, which is estimated from the data, with the cumulative distribution function of the normal distribution, $Z_i = \Phi\left((x_{(i)} - \bar{x})/s\right)$. Stephens [26] provided versions of the ECDF tests with unknown mean and variance. ECDF tests can be further classified into those involving either the supremum or the square of the discrepancies, $F_n(x_{(i)}) - Z_i$.
ECDF tests involving the square of the discrepancies are known as those from the Cramér-von Mises family.

The Anderson-Darling $A^2$-test.
The Anderson-Darling test is, in fact, a modified form of the Cramér-von Mises test. It gives more weight to the tails of the distribution than does the Cramér-von Mises test. The computational form of the Anderson-Darling statistic is:

$$A^2 = -\frac{1}{n}\sum_{i=1}^{n}\left[(2i - 1)\left[\ln(Z_i) + \ln(1 - Z_{n+1-i})\right]\right] - n.$$

The $A^2$-test is the most familiar among all ECDF tests. The asymptotic distribution is known, and it was found that the critical values for finite samples quickly converge to their asymptotic values for $n \geq 5$.
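The computational form of the statistic can be sketched as follows. This is an illustrative implementation, assuming that the fitted normal CDF values are obtained by standardizing the ordered data with the sample mean and the (n - 1)-divisor standard deviation; the function name `anderson_darling` is hypothetical, and no small-sample correction factor is applied.

```python
import math
from statistics import NormalDist, mean, stdev

def anderson_darling(x):
    """A^2 statistic for normality with estimated mean and variance."""
    n = len(x)
    xs = sorted(x)
    mu, sd = mean(xs), stdev(xs)          # stdev uses the n - 1 divisor
    z = [NormalDist().cdf((v - mu) / sd) for v in xs]
    # z[i - 1] is Z_i and z[n - i] is Z_{n+1-i} in one-based notation.
    total = sum((2 * i - 1) * (math.log(z[i - 1]) + math.log(1.0 - z[n - i]))
                for i in range(1, n + 1))
    return -n - total / n
```

Because the data are standardized before the CDF is evaluated, the statistic is unchanged by shifting or rescaling the sample.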

The Kolmogorov-Smirnov test.
In the ECDF class of tests involving the supremum, a well-known statistic is the Kolmogorov-Smirnov test:

$$KS = \max_{1 \leq i \leq n}\left\{\frac{i}{n} - Z_i,\; Z_i - \frac{i-1}{n}\right\}.$$

The null hypothesis of normality is rejected for large values of the test statistic.
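A supremum-type statistic of this kind can be sketched as the larger of the two one-sided discrepancies between the ECDF and the fitted normal CDF. The code below is illustrative only: it assumes the mean and standard deviation are estimated from the sample (the Lilliefors-type setting), and the function name `ks_normal` is hypothetical.

```python
from statistics import NormalDist, mean, stdev

def ks_normal(x):
    """Kolmogorov-Smirnov distance to a normal CDF with estimated parameters."""
    n = len(x)
    xs = sorted(x)
    mu, sd = mean(xs), stdev(xs)
    z = [NormalDist().cdf((v - mu) / sd) for v in xs]
    d_plus = max((i + 1) / n - z[i] for i in range(n))   # ECDF above the CDF
    d_minus = max(z[i] - i / n for i in range(n))        # CDF above the ECDF
    return max(d_plus, d_minus)
```

With estimated parameters the classical Kolmogorov critical values no longer apply, which is one reason simulated critical values are used in power studies.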

The $Z_a$ and $Z_c$ tests.
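Zhang & Wu [15] added two likelihood-ratio statistics for normality testing to the class of EDF tests. The proposed statistics can be defined as follows. Let $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ be the order statistics from a continuous random variable $X$ with distribution function $F(x)$, to be used in the following hypothesis testing setup:

$$H_0: F(x) = F_0(x) \text{ for all } x \qquad \text{versus} \qquad H_1: F(x) \neq F_0(x) \text{ for some } x,$$

$$Z_a = -\sum_{i=1}^{n}\left[\frac{\ln F_0(X_{(i)})}{n - i + \tfrac{1}{2}} + \frac{\ln\left(1 - F_0(X_{(i)})\right)}{i - \tfrac{1}{2}}\right], \qquad Z_c = \sum_{i=1}^{n}\left[\ln\left(\frac{F_0(X_{(i)})^{-1} - 1}{\left(n - \tfrac{1}{2}\right)/\left(i - \tfrac{3}{4}\right) - 1}\right)\right]^2,$$

where $F_0(x) = \Phi(x)$ is the cumulative distribution function of the normal distribution.
The null hypothesis is rejected for large values of the test statistics.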
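Both likelihood-ratio-type statistics are sums over the ordered sample and can be sketched together. The code assumes the forms of the two statistics as they are commonly stated for Zhang and Wu's proposal, with the null CDF evaluated at the data standardized by the sample mean and standard deviation; the function name `zhang_wu` is hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def zhang_wu(x):
    """Return (Z_a, Z_c) for normality with estimated mean and variance."""
    n = len(x)
    xs = sorted(x)
    mu, sd = mean(xs), stdev(xs)
    f0 = [NormalDist().cdf((v - mu) / sd) for v in xs]
    z_a = -sum(math.log(f0[i - 1]) / (n - i + 0.5)
               + math.log(1.0 - f0[i - 1]) / (i - 0.5)
               for i in range(1, n + 1))
    z_c = sum(math.log((1.0 / f0[i - 1] - 1.0)
                       / ((n - 0.5) / (i - 0.75) - 1.0)) ** 2
              for i in range(1, n + 1))
    return z_a, z_c
```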

Regression/correlation tests

The Shapiro-Wilk and Shapiro-Francia tests.
Normal probability plotting refers to graphically assessing the linearity between the ordered observations $x_{(i)}$ and the expected values of the standard normal order statistics, $m_i$. This is the main idea behind these tests. Formally, regression or correlation techniques are used to assess the linearity, hence the name of this group of tests.
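The Shapiro and Wilk [27] $W$ statistic is defined as the ratio of two estimates of the variance of a normal distribution and can be calculated by

$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$

The vector of weights can be computed by

$$a = (a_1, \ldots, a_n) = \frac{m^{\top}V^{-1}}{\left(m^{\top}V^{-1}V^{-1}m\right)^{1/2}},$$

where $m$ and $V$ are the mean vector and covariance matrix of the order statistics of the standard normal distribution [28]. If the distribution of $x_i$ is normal, the $W$ statistic is close to unity; otherwise it is less than unity. The critical values of $W$ are tabulated up to sample sizes of 50. However, Shapiro and Francia [29] noted that as the sample size increases, the ordered observations tend to be independent (i.e., $v_{ij} = 0$ for $i \neq j$). Treating $V$ as an identity matrix, $W$ can be extended for $n$ larger than 50 by

$$W' = \frac{\left(\sum_{i=1}^{n} b_i x_{(i)}\right)^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad b = (b_1, \ldots, b_n) = \frac{m^{\top}}{\left(m^{\top}m\right)^{1/2}}.$$

Values of $\{m_i\}$ are available in [30] up to sample sizes of 400. However, Weisberg and Bingham [31] suggested the following approximation to compute the values of $\{m_i\}$:

$$m_i \approx \Phi^{-1}\left(\frac{i - 3/8}{n + 1/4}\right).$$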
It was shown that the approximation works even for small samples, as there is no significant difference between the null distributions of the $W$ and $W'$ statistics. This simplifies the computation of the test statistics.

The Chen-Shapiro test.
Chen and Shapiro [32] proposed another competitor of the Shapiro-Wilk test based upon normalized spacings, which can be defined as

$$CS = \frac{1}{(n-1)s}\sum_{i=1}^{n-1}\frac{x_{(i+1)} - x_{(i)}}{M_{i+1} - M_i},$$

where $M_i = \Phi^{-1}\left(\frac{i - 0.375}{n + 0.25}\right)$ and $\Phi^{-1}$ is the inverse of the standard normal distribution function. Since the authors have shown a close relationship between the Chen-Shapiro (CS) and the Shapiro-Wilk (W) tests, the performance of the CS test is expected to be comparable with that of the W test. The normality hypothesis is rejected for small values of the test statistic.
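Both the Shapiro-Francia statistic and the Chen-Shapiro statistic can be built from the same approximate normal scores. The sketch below assumes the approximation Phi^{-1}((i - 0.375)/(n + 0.25)) for the scores and the usual forms of the two statistics; the function names `normal_scores`, `shapiro_francia` and `chen_shapiro` are hypothetical.

```python
from statistics import NormalDist, stdev

def normal_scores(n):
    """Approximate expected standard normal order statistics."""
    return [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

def shapiro_francia(x):
    """W': squared correlation between ordered data and the normal scores."""
    n = len(x)
    xs = sorted(x)
    m = normal_scores(n)
    mean = sum(xs) / n
    numerator = sum(mi * xi for mi, xi in zip(m, xs)) ** 2
    denominator = sum(mi * mi for mi in m) * sum((xi - mean) ** 2 for xi in xs)
    return numerator / denominator

def chen_shapiro(x):
    """CS: average normalized spacing divided by the sample standard deviation."""
    n = len(x)
    xs = sorted(x)
    m = normal_scores(n)
    spacings = sum((xs[i + 1] - xs[i]) / (m[i + 1] - m[i]) for i in range(n - 1))
    return spacings / ((n - 1) * stdev(xs))
```

A sample that is an exact linear function of the scores gives a Shapiro-Francia value of one, which is a convenient sanity check for an implementation.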

The COIN test.
Coin [33] proposed a normality test, based on polynomial regression, aimed especially at symmetric non-normal alternatives. Let $x_{(i)}$ be a vector of ordered observations drawn from a normal population with unknown mean $\mu$ and variance $\sigma^2$; then it is possible to write

$$x_{(i)} = \mu + \sigma\alpha_i + \varepsilon_i,$$

where $\mu$ and $\sigma$ are the parameters of the best-fit line of a normal Q-Q plot and $\varepsilon$ is a vector of errors which are assumed to be homoscedastic. These two parameters may be estimated by the least squares method. Instead of using the above model, Coin proposed the following polynomial model, in which the vector of ordered observations $x_{(i)}$ is replaced by $z_{(i)}$, the vector of standardized ordered observations:

$$z_{(i)} = \beta_1\alpha_i + \beta_3\alpha_i^3 + \varepsilon_i,$$

where $\beta_i$ ($i = 1, 3$) are the fitting parameters and $\alpha_i$ represents the expected values of the standard normal order statistics. An estimated value of $\beta_3$ significantly different from zero implies that the sample is drawn from a symmetric non-normal distribution. Coin suggests the use of $\hat{\beta}_3^2$ as a statistic for testing the null hypothesis of normality. The hypothesis of normality is rejected for large values of the test statistic.

The BCMR test.
Del Barrio, Cuesta-Albertos, Matrán and Rodríguez-Rodríguez [34] proposed the following test statistic for testing normality based on the $L_2$-Wasserstein distance between a sample distribution and the set of normal distributions. Let $x_1, x_2, \ldots, x_n$ be a random sample drawn from a distribution with distribution function $F$. Let $F_n$ denote the empirical distribution function, $\Phi$ the distribution function of the standard normal law and $s^2$ the sample variance. Then

$$BCMR = 1 - \frac{\left(\int_0^1 F_n^{-1}(t)\,\Phi^{-1}(t)\,dt\right)^2}{s^2}.$$

This statistic is asymptotically equivalent to the Shapiro-Wilk and Shapiro-Francia statistics [34]. The normality hypothesis is rejected for large values of the statistic.

Other tests

The Gel-Miao-Gastwirth test.
Recently, Gel and Gastwirth [13] have contributed to the literature on directed tests of normality by proposing a statistic which focuses on detecting heavy tails and outliers of symmetric distributions. The test statistic is simply the ratio of the standard deviation to the robust measure of dispersion, which should tend to unity under normality of the data:

$$R = \frac{s}{J_n}, \qquad J_n = \frac{\sqrt{\pi/2}}{n}\sum_{i=1}^{n}\lvert x_i - M\rvert,$$

where $M$ is the median of the sample data. The normality hypothesis is rejected for large values of the statistic. Moreover, the statistic $\sqrt{n}(R - 1)$ is asymptotically normally distributed with zero mean and standard deviation equal to $\sqrt{\pi/2 - 1.5}$. The application of this test can be extended to light-tailed distributions as well by using a two-sided test for rejecting the null hypothesis of normality.
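The ratio statistic and its asymptotic standardization can be sketched as follows. The code assumes that the numerator is the (n - 1)-divisor sample standard deviation, that J_n is sqrt(pi/2) times the mean absolute deviation from the median, and that sqrt(n)(R - 1) has asymptotic variance pi/2 - 1.5; the function name `gmg_ratio` is hypothetical.

```python
import math
from statistics import median, stdev

def gmg_ratio(x):
    """Return (R, z): the ratio s / J_n and its asymptotic z-score."""
    n = len(x)
    med = median(x)
    j_n = math.sqrt(math.pi / 2.0) * sum(abs(v - med) for v in x) / n
    r = stdev(x) / j_n
    # Standardize with the assumed asymptotic standard deviation.
    z = math.sqrt(n) * (r - 1.0) / math.sqrt(math.pi / 2.0 - 1.5)
    return r, z
```

A one-sided test uses large values of the z-score only, whereas the two-sided version mentioned in the text would also reject for large negative values.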

Alternative distributions
As already stated, to permit exploration by numerical methods we must narrow down the infinite dimensional alternative space to a space large enough to approximate the distributions conventionally used in simulation studies. We have used 72 alternative distributions with wide ranges of skewness and kurtosis, since first and second order departures from normality depend on these parameters. The simulation study considers the distributions used in the literature (Table 1), mixtures of uniform distributions, and mixtures of t-distributions (Table 2). The alternative space includes all kinds of symmetric and asymmetric, short- and long-tailed distributions.
The mixtures of t- and uniform distributions are generated by the following rules.

Simulation study
An extensive simulation study is conducted to estimate the size and power of the selected normality tests. As stated earlier, no normality test can be uniformly most powerful against all alternative distributions: one test is optimal for one alternative and another is optimal for another alternative. The trajectory of maximum power obtained by any test against each alternative distribution provides the benchmark against which all tests can be compared. Deviations for each test are computed with reference to the benchmark.
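The size and power estimates that feed such a comparison are typically obtained by Monte Carlo simulation. The following sketch is only an illustration of that idea and not the study's actual simulation design: it assumes an upper-tailed test, uses the Jarque-Bera statistic as an example, estimates the critical value from simulated normal samples, and takes an exponential distribution as a hypothetical alternative. All function names, the seed and the replication count are assumptions.

```python
import random

def jb_stat(x):
    """Jarque-Bera statistic, used here only as an example test statistic."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return n / 6.0 * (m3 ** 2 / m2 ** 3 + (m4 / m2 ** 2 - 3.0) ** 2 / 4.0)

def simulate(stat, draw, n, reps, rng):
    """Evaluate `stat` on `reps` samples of size n produced by draw(rng)."""
    return [stat([draw(rng) for _ in range(n)]) for _ in range(reps)]

def estimate_size_and_power(stat, draw_alt, n=25, reps=2000, alpha=0.05, seed=1):
    rng = random.Random(seed)
    draw_null = lambda r: r.gauss(0.0, 1.0)
    # Critical value: empirical (1 - alpha) quantile of the statistic under normality.
    null_stats = sorted(simulate(stat, draw_null, n, reps, rng))
    crit = null_stats[min(reps - 1, int((1.0 - alpha) * reps))]
    # Size is checked on fresh normal samples; power on samples from the alternative.
    size = sum(v > crit for v in simulate(stat, draw_null, n, reps, rng)) / reps
    power = sum(v > crit for v in simulate(stat, draw_alt, n, reps, rng)) / reps
    return size, power

if __name__ == "__main__":
    size, power = estimate_size_and_power(jb_stat, lambda r: r.expovariate(1.0))
    print(f"estimated size {size:.3f}, estimated power {power:.3f}")
```

Repeating the power estimate for every test and every alternative yields the table of powers from which the benchmark and the deviations are derived.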
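Any function, $T(x)$, which takes values in $\{0, 1\}$ is called a hypothesis test. The size of the test is defined as

$$\text{Size of } T = \max_{\varphi \in \Phi_0} P\left(T(x) = 1 \mid \varphi\right),$$

where $\varphi$ belongs to the null space, $\Phi_0$. The power of the test is the probability of not committing a type-II error, i.e., for $\varphi$ in the alternative space,

$$\pi(T, \varphi) = P\left(T(x) = 1 \mid \varphi\right).$$

For any test, the maximum achievable power for a given alternative is defined as

$$\pi^{*}(\varphi) = \max_{T}\,\pi(T, \varphi).$$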

PLOS ONE
For different values of $\varphi$, we get different optimal test statistics. The locus of the powers of these statistics provides the benchmark. The following loss function is computed to evaluate each normality test in terms of its deviation from the benchmark:

$$\text{Deviation} = \max_{T'}\,\pi(T', \varphi) - \pi(T, \varphi).$$

A test with minimum loss or deviation is defined as the best test. The most stringent test will have zero percent loss or deviation from the benchmark. This allows us to rank the normality tests in a unique manner.
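The ranking rule can be made concrete with a short sketch. It assumes that the estimated powers are stored per test as a list with one entry per alternative distribution; the test labels and power values in the usage example are made-up illustrative numbers, not results from this study, and the function name `minmax_rank` is hypothetical.

```python
def minmax_rank(power):
    """Rank tests by their maximum deviation from the best power per alternative.

    power: dict mapping a test name to a list of powers, one per alternative.
    Returns (benchmark, max_deviation, ranking).
    """
    tests = list(power)
    k = len(power[tests[0]])
    # Benchmark: best power achieved by any of the tests against each alternative.
    benchmark = [max(power[t][j] for t in tests) for j in range(k)]
    # Loss of a test: its largest shortfall from the benchmark over all alternatives.
    max_deviation = {t: max(benchmark[j] - power[t][j] for j in range(k))
                     for t in tests}
    # Most stringent test first (smallest maximum deviation).
    ranking = sorted(tests, key=lambda t: max_deviation[t])
    return benchmark, max_deviation, ranking

# Hypothetical powers of three tests against three alternatives.
example = {
    "T1": [0.90, 0.50, 0.70],
    "T2": [0.80, 0.60, 0.50],
    "T3": [0.30, 0.65, 0.75],
}
benchmark, max_deviation, ranking = minmax_rank(example)
```

In this made-up example T3 is the best test against two of the three alternatives, yet it is ranked last because of its large shortfall against the first alternative; this is the sense in which the criterion rewards tests that are never far from the best.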

Results & discussion
Normality tests are evaluated against 72 alternative distributions, including mixtures of uniform distributions (18 distributions), mixtures of t-distributions (20 distributions) and the distributions used in the literature (34 distributions), with wide ranges of skewness and kurtosis. The alternative space includes all kinds of symmetric and asymmetric, short- and long-tailed distributions. The most stringent test is the one with the minimum deviation from the benchmark.
In terms of losses or deviations against the selected alternative space, the CS test outperforms the remaining tests for small (n = 25) and medium (n = 50) sample sizes at the 5 percent level of significance, with deviations of 12.3 and 17.3 percent from the benchmark, respectively (Table 3). These results corroborate the findings in [17,18]. Shapiro-Wilk's W-test is the first-ranked statistic for the large sample size (n = 75) with a 27.4 percent deviation, closely followed by the CS, $Z_c$ and $Z_a$ statistics. For the third rank, this study recommends the BCMR, $A^2$ and $Z_a$ tests for small samples and the BCMR test for medium and large sample sizes. The JB and RJB tests perform poorly, with losses of more than 90 percent for all sample sizes, which is in line with the findings in [18].
These results clearly indicate that the min-max strategy adopted in this study produces results similar to those achieved with the Neyman-Pearson benchmark in Islam (2017). Furthermore, the computational cost is reduced significantly.
It is interesting to note that symmetric short- and long-tailed alternative distributions are the worst alternatives for both top-ranked statistics, CS and W, in terms of maximum deviations from the benchmark for all sample sizes (Table 4). The W-test also outperforms the $A^2$-test with significantly smaller deviations from the benchmark, which is in line with the findings in [35]. The CS and W statistics, along with the $W'$, BCMR & COIN tests, belong to the 'regression & correlation' class of normality tests. The worst alternatives for the rest of the members of this class are test dependent; e.g., the worst alternatives for the COIN test are skewed and near-normal distributions. On balance, when considering the performance of the regression and correlation-based group of normality statistics, CS is the best test (rank 1) for small and medium sample sizes, closely followed by the W test at rank two. For large samples, the W test outperforms the CS statistic by a margin of 3.2 percent and occupies the first rank. These two statistics are closely followed by the BCMR test, which is placed at rank three for all sample sizes. The Shapiro-Francia test ($W'$) shows consistent performance by occupying the fifth position for all sample sizes, with maximum deviations ranging from 33 to 50 percent.
The moment-based JB & RJB tests perform poorly against the short-tailed symmetric and slightly skewed alternatives (Figs 1 and 2). It is pertinent to mention that the performance of JB & RJB is the same at the medium sample size (n = 50). The other moment-based tests under consideration in this study are $K^2$ & $Z_w$. The $K^2$ statistic is ranked 6th for small and medium and 8th for large sample sizes, with deviations ranging from 60 to 96 percent. The $Z_w$ test is ranked 7th for small and large and 8th for medium sample sizes, with maximum deviations from the benchmark ranging from 78 to 88 percent.
Among the normality tests based on the empirical cumulative distribution function (ECDF), $A^2$, $Z_a$ and $Z_c$ occupy the third and fourth ranks, respectively, for small samples. The normality test (R) of [13] for symmetric distributions occupies rank 7 for small and medium sample sizes and rank 6 for large sample sizes. The range of its deviation from the benchmark is 76-85 percent when evaluated against the entire class of alternatives. The worst distributions for the R test belong to the asymmetric alternative space, since the test is designed for symmetric distributions. Interestingly, the R test occupies the sixth rank for small and the fifth rank for medium to large sample sizes when evaluated against the symmetric alternatives (Table 5). Within the symmetric class, the worst distributions for the R test belong to the symmetric short-tailed alternative space (Fig 3) for all sample sizes. Therefore, the R test is not recommended for symmetric short-tailed alternatives. The COIN test performs considerably better than the R test, which is in line with the findings in [17,33].
When considering the symmetric alternative space, CS & COIN are the best options for testing normality for small to medium sample sizes, while the Shapiro-Wilk W test is recommended for large sample sizes (Table 5). The W-test occupies the second rank for small and medium sample sizes. The moment-based JB & RJB tests performed poorly against the symmetric class of alternatives as well. The worst distributions for these statistics belong to the symmetric and short-tailed class of alternatives. These results corroborate the findings in [16,35,36]. The $Z_w$ test performs relatively well among the moment-based normality tests and occupies the fourth rank for all sample sizes, with maximum deviation from the benchmark ranging between 23 and 40 percent. The $K^2$ test is ranked 7th for small and 8th for medium to large samples, with losses ranging from 44 to 96 percent.
When considering the regression and correlation-based group of normality tests, CS, COIN, W & BCMR are the best options against the symmetric alternatives and occupy the top three ranks in the table. Romão et al. [17] recommend the CS & W statistics for the asymmetric group of alternatives on the basis of absolute powers. However, when these statistics are evaluated against a benchmark instead of absolute powers, they turn out to be best for symmetric alternative distributions as well. The Shapiro-Francia ($W'$) test does not perform well against symmetric alternatives and occupies ranks 5, 7 & 6 for small, medium and large sample sizes, respectively. Among the ECDF class of normality tests, $A^2$ & $Z_a$ occupy the third rank for small samples with 21.5 & 22.0 percent deviations from the benchmark. $Z_c$ & $Z_a$ are recommended for medium and large sample sizes, as the Anderson-Darling statistic ($A^2$) occupies the sixth and fourth positions for medium and large sample sizes, respectively. In terms of maximum deviations, $Z_c$ has a slight edge over the $Z_a$ test for medium and large sample sizes, which does not corroborate the findings in [15]. The KS test does not perform well against the symmetric alternatives, with more than 85 percent losses for all sample sizes.
When the selected normality tests are evaluated against the asymmetric class of alternatives, the W, $Z_c$ & CS tests occupy the rank one position for small, CS for medium, and CS, W, $Z_c$ & $Z_a$ for large sample sizes (Table 6). On balance, the CS and W tests from the regression and correlation-based group of normality tests are recommended for all sample sizes, whereas the COIN test did not perform well, with a very high range of deviations from the benchmark, when the alternative distribution is drawn from the asymmetric distributional space; this is expected, since the COIN test is designed for symmetric alternatives. These findings corroborate the findings in [17,18,35].
Among the ECDF class of tests, $Z_c$ is ranked as the number one statistic for small & large sample sizes and number two for the medium sample size against the selected asymmetric distributional space, closely followed by the $Z_a$ test. Maximum deviations of these tests range from 5 to 12 percent. The Anderson-Darling test, $A^2$, is placed at the fourth, fifth, and third positions for small, medium, and large sample sizes, respectively, with maximum deviations from the benchmark in the range of 21-31 percent. Moment-based tests did not perform well, with more than 50 percent maximum deviations from the benchmark for all sample sizes against the asymmetric distributional space.
There is no significant difference between the performances of the BCMR and W tests of normality in terms of discriminating the long-tailed distributions ($\beta_2 > 3$). Both statistics share the first rank when evaluated against the selected class of heavy-tailed distributions (Table 7), closely followed by the CS test. The Shapiro-Francia $W'$ test performed well for small samples and occupies the third rank with a power loss of 15.1 percent; however, from the regression and correlation class, the COIN test performs poorly and occupies the last rank, with more than 85 percent power losses at all sample sizes. On balance, in terms of maximum deviations from the benchmark, moment-based normality tests do not perform well (Table 7).

However, these statistics performed well when evaluated against the symmetric long-tailed distributions, with clear dominance of the RJB test (Table 8). These results are in line with the findings in [18,33,36]. Overall, the JB & RJB tests perform well against the long-tailed distributional space except for the alternatives listed in Table 9. Among the ECDF class of normality tests, all the statistics except for KS performed well and are listed among the top four tests for all sample sizes. The R test does not perform well when evaluated against the thick-tailed alternatives, with power deviations varying from 66 to 76 percent (Table 7).
For distributions from the short-tailed alternative space ($\beta_2 < 3$), we recommend CS & $Z_c$ for small, CS for medium and the W test for large sample sizes (Table 10). Romão et al. [17] also recommend the use of the CS & W tests for small and large sample sizes. The W-test is also ranked second for small and medium sample sizes, with respective maximum deviations of 15.5 & 20.5 percent from the benchmark. The performance of the W test is much better than that of the KS test irrespective of whether the alternative belongs to the short- or long-tailed distributional space, which corroborates the findings in [18,35]. Both the $Z_a$ & $Z_c$ statistics are among the top three positions, with $Z_c$ having a slight edge over $Z_a$ against the short-tailed alternatives, which is in line with the findings in [18]. The Anderson-Darling test statistic ($A^2$) also performs well and occupies the third rank for small and the fourth rank for medium & large samples. Based on the maximum deviations from the benchmark, the BCMR test is placed at rank three with respective power losses of 20.6, 24.3, & 34.0 percent. Among the correlation and regression-based normality tests, the COIN test could not perform well when evaluated separately for short- and long-tailed alternatives. The performance of JB, RJB, $K^2$, & $Z_w$ is not up to the mark, with very high power deviations, which corroborates the findings in [36].
Table 11 presents the top five damaging distributions for each normality test at samples of size 25, 50, & 75. It is evident from the results that ECDF-based normality tests suffer more against the symmetric short-tailed and symmetric long-tailed distributions with significant outliers. Symmetric short-tailed and skewed distributions affect the performance of the normality tests belonging to the regression and correlation class. However, the most damaging distributions for the moment-based normality tests are specific to the individual tests in this class. For example, the JB & RJB tests suffer greater power loss against the long-tailed alternatives at small sample sizes and against negatively skewed alternatives at medium and large sample sizes.

Conclusion
Comparison of normality tests without an invariant benchmark has not proved fruitful in the normality literature. This study proposes an alternative way to compute the benchmark instead of the Neyman-Pearson test-based benchmark proposed in the literature. The proposed benchmark is based on the min-max approach, which reduces the cost of computing and estimating the Neyman-Pearson tests against each alternative from the selected distributional space. The min-max approach is based on the finding that one test is best against one alternative and another against another alternative [20]. Thus, against each alternative distribution, we get different optimal normality tests. The locus of the powers of these statistics provides the benchmark. Maximum deviations from the benchmark are computed for the selected normality statistics. A test with minimum loss or deviation is defined as the most stringent test. An extensive simulation study is conducted to rank the selected normality tests against a vast distributional space consisting of mixtures of uniform distributions, mixtures of t-distributions, and distributions used in the literature. General recommendations derived from the analysis of maximum deviations from the benchmark indicate that, against the entire alternative space, the most stringent normality test is CS for small (n = 25) and medium (n = 50) sample sizes, and Shapiro-Wilk's W-test for the large sample size (n = 75), closely followed by the CS, $Z_c$, & $Z_a$ statistics. When considering the symmetric alternative space, CS & COIN are the best options for testing normality for small to medium sample sizes, and the Shapiro-Wilk W test for large sample sizes. When the selected normality tests are evaluated against the asymmetric class of alternatives, the W, $Z_c$, & CS tests occupy the rank one position for small, CS for medium, and CS, W, $Z_c$ & $Z_a$ for large sample sizes.
There is no significant difference between the performances of the BCMR and W tests of normality in terms of discriminating the long-tailed distributions ($\beta_2 > 3$). Both statistics share the first rank when evaluated against the selected class of heavy-tailed distributions, closely followed by the CS test. For distributions from the short-tailed alternative space ($\beta_2 < 3$), we recommend CS & $Z_c$ for small, CS for medium and the W test for large sample sizes.