# Cauchy combination omnibus test for normality

• Zhen Meng,

Roles Funding acquisition, Methodology, Software, Writing – original draft

Affiliation School of Statistics, Capital University of Economics and Business, Beijing, China

• Zhenzhen Jiang

Roles Software

jiangzhenzhen20@mails.ucas.ac.cn

Affiliations Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China, School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China

## Abstract

Testing whether data come from a normal distribution is a classical problem of great importance in data analysis. Normality is the premise of many statistical methods, such as the t-test, Hotelling's T² test and ANOVA. There are numerous tests in the literature, and the commonly used ones are the Anderson-Darling test, the Shapiro-Wilk test and the Jarque-Bera test. Each test has its own strengths, since each was developed for specific patterns, and no method performs optimally in all situations. Since the data distributions of practical problems can be complex and diverse, we propose a Cauchy Combination Omnibus Test (CCOT) that is robust and valid in most data cases. We also give some theoretical results to analyze the good properties of CCOT. Two clear advantages of CCOT are that it has an explicit expression for calculating statistical significance, and that extensive simulation results show its robustness regardless of the shape of the distribution the data come from. Applications to South African Heart Disease and Neonatal Hearing Impairment data further illustrate its practicability.

## Introduction

The normal distribution is often used to describe the pattern of data in biomedical research. Usually, a normality test is performed before conducting statistical analyses such as the t-test, Hotelling's T² test and ANOVA. For example, in clinical trials, testing the effect of a new treatment is a basic problem, where two groups of subjects taking the treatment and the placebo separately are enrolled and some features showing health benefits are screened. The t-test for univariate analysis and Hotelling's T² test for multivariate analysis are commonly employed to identify the mean differences between the two groups. However, the basic assumption of both tests is that the data follow a normal distribution. If the normality assumption is violated, nonparametric tests will be considered instead. This means that the result of the normality test may affect the subsequent statistical inference. Another example is the linear regression model, where the least squares estimate may not be the best unbiased estimate when the error term does not satisfy the normality assumption. Furthermore, inference on the regression coefficients based on the F-test may be incorrect if normality is violated.

For testing normality, the Quantile-Quantile (Q-Q) plot is a commonly used graphical method in which the sample quantiles are compared with the expected ones from a normal distribution. Generally, if the quantile points lie close to the diagonal line of the first and third quadrants, one can think of the data as being drawn from a normal distribution. A Percent-Percent plot, based on two empirical cumulative distribution functions, can also be used. Although both graphical methods are easy to operate, the decision depends on a rule of thumb. Thus goodness-of-fit normality tests were proposed. Among many others, Shapiro and Wilk developed a test (hereafter referred to as SW) based on the best linear unbiased estimate obtained by the generalized least-squares method in the linear regression between the sample order statistics and those of a standard normal distribution. SW can also be regarded as the squared Pearson correlation coefficient between the ordered observations and a set of weights. Since the original SW is limited to sample sizes 3 ≤ n ≤ 50, Royston extended it to n ≤ 2000 and Rahman and Govindarajulu extended it to n ≤ 5000. Based on the sample skewness and kurtosis, Jarque and Bera proposed a test (hereafter referred to as JB). Tests based on the empirical distribution function $F_n$ were also proposed, such as the Kolmogorov-Smirnov test (hereafter referred to as KS), the Cramér-von Mises test (hereafter referred to as CVM) and the Anderson-Darling test (hereafter referred to as AD). KS is a supremum-type test that detects the difference between $F_n$ and a specified normal distribution $F_0$ with known mean and variance; Lilliefors improved KS by replacing both unknown parameters with the sample mean and sample standard deviation, and this version is also commonly referred to as KS. CVM is a quadratic-type test and AD is the weighted version of CVM. In addition, the literature contains a wide variety of normality tests corresponding to different types of alternative distributions. Dumonceaux et al. considered the likelihood ratio test for some specific alternative distributions, such as the Cauchy, exponential and double exponential. Spiegelhalter gave a location and scale invariant test for uniform and double exponential distributions. Arshad et al. proposed a modified AD for the generalized Pareto distribution.

Since there is no uniformly most powerful test under the general alternative hypothesis that the data do not follow a normal distribution, we propose a Cauchy Combination Omnibus Test (CCOT). The CCOT aims to borrow the strength of AD, SW and JB, and it is more robust than the other tests under a wide range of settings. Another notable merit of the CCOT is that its statistical significance has an approximate closed-form expression, which is easier to calculate than that of other combination methods regardless of the correlation structure between the p-values.

The rest of the paper is structured as follows. In Section 2, we review some frequently-used methods, put forward the new test CCOT and give some theoretical results to discuss its effectiveness. Numerous simulation studies are shown in Section 3 and real applications are presented in Section 4. Some discussions are given in the last Section.

## Methods

Assume that $x_1, x_2, \cdots, x_n$ are i.i.d. from $F(\mu, \sigma)$, where $\mu$ is the location parameter and $\sigma$ is the scale parameter. The null hypothesis of the testing problem is

$$H_0: F(\mu, \sigma) = N(\mu, \sigma^2)$$

and the alternative hypothesis is

$$H_1: F(\mu, \sigma) \neq N(\mu, \sigma^2).$$

To test $H_0$, Thadewald and Büning set up a number of simulations to compare many tests and found that JB had higher power than the others for symmetric distributions with medium to long tails or slightly skewed distributions with long tails, and that AD performed best for distributions with two peaks. Nornadiah and Yap recommended SW for symmetric distributions with short tails and for many skewed distributions. The advantages of AD, JB and SW cover a wide range of data distribution types, with different tail lengths, different degrees of skewness and different numbers of peaks, so we mainly focus on these three tests here.

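Considering the difference between the empirical distribution function and the normal distribution, Anderson and Darling proposed the test

$$AD = n \int_{-\infty}^{\infty} \frac{[F_n(x) - F_0(x)]^2}{F_0(x)[1 - F_0(x)]} \, dF_0(x),$$

where $F_0(\cdot)$ is the cumulative distribution function of $F(\cdot)$ under $H_0$ and $F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I(x_i \le x)$ is the empirical distribution function. Given a nominal significance level $\alpha$, $H_0$ is rejected if the value of AD is larger than a threshold associated with $\alpha$ and the sample size $n$. To make the threshold independent of the sample size $n$, Stephens gave the modified version $AD \times (1 + 4/n - 25/n^2)$.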

Besides AD, there are two other similar tests: KS and CVM. AD is the weighted version of CVM, obtained by adding the weight $[F_0(x)(1 - F_0(x))]^{-1}$. Many simulation studies have shown that AD appears to perform better than KS and CVM [14, 15, 17].
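As an illustration of how the AD statistic can be evaluated in practice, the sketch below uses the standard computational form of the integral, $AD = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)[\ln F_0(x_{(i)}) + \ln(1 - F_0(x_{(n+1-i)}))]$. It is not the authors' code: the function name `anderson_darling_normal` is hypothetical, and fitting $F_0$ with the sample mean and sample standard deviation is an assumption made here.

```python
import math
from statistics import NormalDist, mean, stdev


def anderson_darling_normal(x):
    """Return (AD, modified AD) for a sample x, with F0 a fitted normal.

    Assumption: mu and sigma are replaced by the sample mean and the
    sample standard deviation (n - 1 denominator).
    """
    n = len(x)
    fitted = NormalDist(mean(x), stdev(x))
    # F0 evaluated at the ordered observations x_(1) <= ... <= x_(n)
    u = [fitted.cdf(v) for v in sorted(x)]
    # Computational form of the integral; index i runs from 0, so the
    # weight 2i - 1 of the 1-based formula becomes 2i + 1 here.
    s = sum(
        (2 * i + 1) * (math.log(u[i]) + math.log(1.0 - u[n - 1 - i]))
        for i in range(n)
    )
    ad = -n - s / n
    # Small-sample modification quoted in the text
    ad_modified = ad * (1 + 4 / n - 25 / n ** 2)
    return ad, ad_modified


# Near-normal scores versus a strongly skewed transformation of them
normal_scores = [NormalDist().inv_cdf((i + 0.5) / 20) for i in range(20)]
skewed = [math.exp(v) for v in normal_scores]
print(anderson_darling_normal(normal_scores))
print(anderson_darling_normal(skewed))
```

Extreme outliers can drive $F_0(x_{(i)})$ to exactly 0 or 1 in floating point, in which case the logarithm fails; a production implementation would need to guard against that.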

Shapiro and Wilk proposed a test using the ratio of two estimates of $\sigma$. We sort $x_1, x_2, \cdots, x_n$ in ascending order and denote them by $x_{(1)}, x_{(2)}, \cdots, x_{(n)}$ with $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$. Let $x = (x_{(1)}, x_{(2)}, \ldots, x_{(n)})^\top$ and let $z_1, z_2, \cdots, z_n$ be a random sample from a standard normal distribution $N(0, 1)$ with ordered values $z_{(1)}, z_{(2)}, \cdots, z_{(n)}$, where $z_{(1)} \le z_{(2)} \le \cdots \le z_{(n)}$. Denote $z = (z_{(1)}, z_{(2)}, \ldots, z_{(n)})^\top$, and let $u = E(z)$ and $V = (v_{ij})_{n \times n} = (\mathrm{cov}(z_{(i)}, z_{(j)}))_{n \times n}$ be the mean vector and covariance matrix of the order statistics, respectively. The best linear unbiased estimate of $\sigma$ is $(u^\top V^{-1} x)/(u^\top V^{-1} u)$. On the other hand, $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ is the unbiased estimate of $\sigma^2$. The SW statistic was constructed as

$$SW = \frac{(w^\top x)^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\left(\sum_{i=1}^{n} w_i x_{(i)}\right)^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2},$$

where $w = (w_1, w_2, \ldots, w_n)^\top = V^{-1}u/(u^\top V^{-1}V^{-1}u)^{1/2}$ and $w^\top w = 1$. Since the maximum value of SW is 1, SW will be close to 1 if $x_1, x_2, \cdots, x_n$ are from a normal distribution. We reject the null hypothesis when SW is small. Since $w$ becomes harder to calculate as the sample size $n$ increases, the original SW is limited to sample sizes 3 ≤ n ≤ 50. Royston extended SW to 4 ≤ n ≤ 2000 and Rahman and Govindarajulu extended SW to n ≤ 5000. Moreover, Shapiro and Francia gave another correlation-type test (denoted by SF) based on a new approximation to the $w$'s.

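Denote the sample skewness and kurtosis by

$$\mathrm{sk} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^{3/2}}, \qquad \mathrm{ku} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^{2}}.$$

Since the skewness and kurtosis of a normal distribution are equal to 0 and 3, respectively, Jarque and Bera proposed a test that standardizes the sample skewness and kurtosis:

$$JB = n\left[\frac{\mathrm{sk}^2}{6} + \frac{(\mathrm{ku} - 3)^2}{24}\right].$$

Bowman and Shenton showed that JB is the sum of squares of two asymptotically independent standard normal variables, thus JB asymptotically follows $\chi^2_2$ under $H_0$ as $n \to \infty$. Given a nominal significance level $\alpha$, $H_0$ is rejected if $JB > \chi^2_{2, 1-\alpha}$, where $\chi^2_{2, 1-\alpha}$ is the $1 - \alpha$ quantile of the $\chi^2$ distribution with 2 degrees of freedom.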
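The JB statistic is simple enough to compute directly from the sample moments. The following is a minimal sketch, not the authors' implementation; the function name `jarque_bera` is hypothetical. It relies on the fact that the survival function of a $\chi^2$ distribution with 2 degrees of freedom is $\exp(-x/2)$, so the asymptotic p-value needs no special functions.

```python
import math


def jarque_bera(x):
    """Return (skewness, kurtosis, JB statistic, asymptotic p-value)."""
    n = len(x)
    m = sum(x) / n
    # Central moments with 1/n normalisation
    m2 = sum((v - m) ** 2 for v in x) / n
    m3 = sum((v - m) ** 3 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    sk = m3 / m2 ** 1.5
    ku = m4 / m2 ** 2
    jb = n * (sk ** 2 / 6 + (ku - 3) ** 2 / 24)
    # P(chi-square with 2 df > jb) = exp(-jb / 2)
    p_value = math.exp(-jb / 2)
    return sk, ku, jb, p_value


sk, ku, jb, p_value = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])
print(sk, ku, jb, p_value)
```

For this tiny symmetric sample the skewness is 0 and the kurtosis is 1.7, so only the kurtosis term contributes to JB. As the text notes, the chi-square approximation is inaccurate for small samples, so the p-value here is illustrative only.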

### Cauchy combination omnibus test

Each of the above three tests has its own strengths: AD has outstanding performance for multimodal distributions; SW performs well for symmetric, platykurtic, short-tailed distributions and for skewed distributions; and JB is powerful for symmetric and slightly skewed distributions with long tails but poor for short-tailed and bimodal distributions. Our goal is to construct a powerful and robust normality test that has relatively good performance over a wide range of situations. To borrow strength from the above tests, we propose a Cauchy combination omnibus test (CCOT) as

$$CCOT = \frac{1}{3}\sum_{k=1}^{3} \tan\{(0.5 - p_k)\pi\},$$

where $p_1$, $p_2$ and $p_3$ are the p-values of AD, SW and JB, respectively.

The reason we chose these three tests for the combination here is that their advantages cover most distributions, such as short tail to long tail distributions, symmetric to skewed distributions, unimodal or bimodal distributions.

With the available $p_k$, $k = 1, 2, 3$, the statistical significance can be calculated based on the tail probability approximation via a standard Cauchy distribution. Suppose that the observed value of CCOT is $ccot$; its approximate p-value is

$$p_{CCOT} \approx \frac{1}{2} - \frac{\arctan(ccot)}{\pi}.$$

Remark 1: Under $H_0$, $p_k \sim U(0, 1)$ for $k = 1, 2, 3$, so $\tan\{(0.5 - p_k)\pi\} \sim Cauchy(0, 1)$, where $Cauchy(0, 1)$ is the standard Cauchy distribution with cumulative distribution function $F(x) = \frac{1}{2} + \frac{\arctan(x)}{\pi}$. If $p_1, p_2, p_3$ are independent of each other, $CCOT \sim Cauchy(0, 1)$, and the p-value of CCOT is exactly $P(CCOT > ccot) = \frac{1}{2} - \frac{\arctan(ccot)}{\pi}$. On the other hand, if $p_1, p_2, p_3$ are not independent of each other, the p-value can still be approximated by the above formula, since the tail of the distribution of CCOT is similar to the tail of $Cauchy(0, 1)$. The numerical simulations reported later, in which CCOT controls the empirical type I error rate, support this p-value approximation.
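The combination step itself takes only a few lines. The sketch below is an illustrative Python version with equal weights, not the R code from S1 Appendix; the function name `cauchy_combination` and the example p-values are hypothetical.

```python
import math


def cauchy_combination(pvalues):
    """Combine p-values with equal weights via the Cauchy transform.

    Returns the combined statistic and its approximate p-value
    0.5 - arctan(statistic) / pi.
    """
    k = len(pvalues)
    statistic = sum(math.tan((0.5 - p) * math.pi) for p in pvalues) / k
    p_combined = 0.5 - math.atan(statistic) / math.pi
    return statistic, p_combined


# Hypothetical p-values standing in for AD, SW and JB on one sample
example = [0.01, 0.03, 0.20]
statistic, p_combined = cauchy_combination(example)
print(statistic, p_combined)
```

Because the tangent grows very quickly as a p-value approaches 0, one very small p-value dominates the statistic, which is what lets the combined test track whichever component test is most powerful for the data at hand.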

The p-value of CCOT lies between the minimum and maximum of the three p-values $p_1, p_2, p_3$. Thus, it is significant when $p_1, p_2, p_3$ are all significant, and it is not significant when the smallest one is greater than the significance level. For the corresponding result, see Chen. However, in normality testing we often encounter situations where the results given by different methods are inconsistent, that is, some p-values in the combination are not significant. To explore how CCOT behaves in this situation, we consider the combination of two p-values, where $p_1$ is significant and $p_2$ is not, and give the following Theorem 1.
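The bound between the smallest and largest p-value follows from monotonicity alone. A short derivation, written with the auxiliary notation $g$, $p_{\min}$ and $p_{\max}$ (introduced here for illustration and not taken from the original text), is:

```latex
% g maps a p-value to its Cauchy-transformed score
g(p) = \tan\{(0.5 - p)\pi\}, \qquad
g^{-1}(t) = \tfrac{1}{2} - \tfrac{\arctan(t)}{\pi},
\qquad g \text{ and } g^{-1} \text{ strictly decreasing.}

% An average lies between the smallest and largest summand
g(p_{\max}) \;\le\; \frac{1}{3}\sum_{k=1}^{3} g(p_k) \;\le\; g(p_{\min})

% Applying the decreasing map g^{-1} reverses the inequalities
p_{\min} \;\le\; \tfrac{1}{2} - \tfrac{\arctan(CCOT)}{\pi} \;\le\; p_{\max}
```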

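Theorem 1. Denote $T = \frac{1}{2}\left[\tan\{(0.5 - p_1)\pi\} + \tan\{(0.5 - p_2)\pi\}\right]$ and $p_T = \frac{1}{2} - \frac{\arctan(T)}{\pi}$, where $p_1$, $p_2$ are two p-values with $p_1 \le \alpha$ and $p_2 > \alpha$, and $\alpha$ is a given level of significance. Then we have:

(1) if $\alpha < p_2 < 0.5$, then $p_T < \alpha$ when $p_2 \in (\alpha, 2\alpha - p_1]$;

(2) if $0.5 < p_2 < 1$, then

(i) $p_T < \alpha$ when $p_2 \in (0.5, 0.5 + b^*)$, where $b^* = \frac{1}{\pi}\arctan\left[\tan\{(0.5 - p_1)\pi\} - 2\tan\{(0.5 - \alpha)\pi\}\right]$;

(ii) $p_T > \alpha$ when $p_2 \ge 1 - p_1$.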

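Remark 2: Conclusion (1) indicates that the p-value of $T$ is significant as long as $p_1 + p_2 \le 2\alpha$. For example, with $p_1 = 0.01$ and $\alpha = 0.05$, $p_T$ is significant when $p_2 \le 0.09$. This result can also be generalized to combinations of more than two p-values, such as $T = \frac{1}{m}\sum_{k=1}^{m}\tan\{(0.5 - p_k)\pi\}$ with every $p_k < 0.5$; then $p_T$ is significant as long as $\sum_{k=1}^{m} p_k \le m\alpha$.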

Remark 3: Conclusion (i) of (2) indicates that the p-value of $T$ is significant if $p_1 < \alpha$ and $p_2 \in (0.5, 0.5 + b^*)$. Here $b^*$ must be greater than 0, so $p_1$ needs to satisfy $\tan\{(0.5 - p_1)\pi\} > 2\tan\{(0.5 - \alpha)\pi\}$. For example, if $\alpha = 0.05$, then $p_1 \le 0.025155168$. In particular, if $p_1 = 0.01$, then $b^* = 0.48343$ and $p_T$ is significant as long as $0.5 < p_2 < 0.98343$. This also shows that when $p_1$ is very small, $p_2$ can take a wider range of values while the p-value of $T$ remains significant. In addition, conclusion (ii) of (2) means that the p-value of $T$ is not significant if $p_1 < \alpha$ and $p_2 \ge 1 - p_1$. For example, if $p_1 = 0.01$, $p_T$ is not significant when $p_2 \ge 0.99$. However, in normality testing it is almost impossible for one method to give a very small p-value while another gives a p-value close to 1.
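The numerical values quoted in Remarks 2 and 3 can be checked with a few lines of arithmetic. The sketch below assumes the equal-weight two-p-value statistic $T = \frac{1}{2}[\tan\{(0.5 - p_1)\pi\} + \tan\{(0.5 - p_2)\pi\}]$ and solves the boundary condition $T = \tan\{(0.5 - \alpha)\pi\}$ for $b^*$; the helper names are hypothetical.

```python
import math


def score(p):
    """Cauchy-transformed score tan((0.5 - p) * pi) of a p-value."""
    return math.tan((0.5 - p) * math.pi)


def combined_p(p1, p2):
    """p-value of the equal-weight combination of two p-values."""
    t = 0.5 * (score(p1) + score(p2))
    return 0.5 - math.atan(t) / math.pi


def b_star(p1, alpha):
    """Largest b such that p2 = 0.5 + b still gives combined_p < alpha."""
    return math.atan(score(p1) - 2.0 * score(alpha)) / math.pi


alpha, p1 = 0.05, 0.01
print(round(b_star(p1, alpha), 5))    # about 0.48343, as in Remark 3

# Largest p1 for which b_star is still positive at this alpha
p1_max = math.atan(1.0 / (2.0 * score(alpha))) / math.pi
print(round(p1_max, 6))               # about 0.025155, as in Remark 3

print(combined_p(p1, 0.09) < alpha)   # inside (alpha, 2*alpha - p1]
print(combined_p(p1, 0.98) < alpha)   # inside (0.5, 0.5 + b_star)
print(combined_p(p1, 0.985) < alpha)  # just beyond 0.5 + b_star
```

The theorem gives sufficient conditions only: for $p_2$ between 0.09 and 0.5 the combined p-value can still fall below $\alpha$ even though case (1) does not guarantee it.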

The detailed proof of Theorem 1 is as follows.

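Proof. (1) If $p_1 \le \alpha$ and $\alpha < p_2 < 0.5$, we have $0 < (0.5 - p_k)\pi < \pi/2$ for $k = 1, 2$. Since $\tan(\cdot)$ is a strictly convex function on the interval $(0, \pi/2)$ and $p_1 \ne p_2$, Jensen's inequality gives

$$T = \frac{1}{2}\left[\tan\{(0.5 - p_1)\pi\} + \tan\{(0.5 - p_2)\pi\}\right] > \tan\left\{\left(0.5 - \frac{p_1 + p_2}{2}\right)\pi\right\}.$$

Then $T > \tan\{(0.5 - \alpha)\pi\}$, and hence $p_T < \alpha$, when $p_2 \le 2\alpha - p_1$.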

(2) If $p_1 \le \alpha$ and $p_2 > 0.5$, without loss of generality, let $p_2 = 0.5 + b$, where $b \in (0, 0.5)$.

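Firstly, we prove the result of (ii). Since $(0.5 - p_1) > 0$, $(0.5 - p_2) < 0$ and $p_2 \ge 1 - p_1$, we have $b = (p_2 - 0.5) \ge (0.5 - p_1)$ and therefore

$$T = \frac{1}{2}\left[\tan\{(0.5 - p_1)\pi\} - \tan(b\pi)\right] \le 0.$$

Thus $p_T = \frac{1}{2} - \frac{\arctan(T)}{\pi} \ge 0.5 > \alpha$.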

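Secondly, we prove the result of (i). Based on the conclusion of (ii), $p_T$ is not significant when $p_2 \ge 1 - p_1$. Therefore, we consider $0.5 < p_2 < 1 - p_1$, so that $0 < b = (p_2 - 0.5) < (0.5 - p_1) < 0.5$. If $\tan\{(0.5 - p_1)\pi\} > 2\tan\{(0.5 - \alpha)\pi\}$, then $b^* > 0$ and, for $b \in (0, b^*)$, we have $\tan(b\pi) < \tan(b^*\pi) = \tan\{(0.5 - p_1)\pi\} - 2\tan\{(0.5 - \alpha)\pi\}$, so that

$$T = \frac{1}{2}\left[\tan\{(0.5 - p_1)\pi\} - \tan(b\pi)\right] > \tan\{(0.5 - \alpha)\pi\}.$$

Thus $p_T < \alpha$.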

Overall, the result of Theorem 1 shows that if α = 0.05 and p1 = 0.01, the p-value of the statistic based on the Cauchy combination is guaranteed to be significant when p2 belongs to (0.05, 0.09] or (0.5, 0.98343). This also demonstrates that CCOT is a robust and effective method in most cases, and the subsequent simulation results further support its good properties.

## Simulation studies

There is substantial simulation work in the literature to compare the power of normality tests by generating data from different distributions. Thadewald and Büning  generated data from the bimodal normal distribution (BN), which included symmetric distributions from short tail to long tail, asymmetric distributions and bimodal distributions. Nornadiah and Yap  carried out power comparisons of SW, KS and AD based on some symmetric or asymmetric familiar distributions. Following all of the above simulation settings, we conduct simulation studies by comparing the proposed CCOT with KS, CVM, AD, JB, SW, SF, and the traditional chi-square goodness-of-fit test (denoted by CS) that was given by Karl Pearson.

### The type I error rate

Assume that data are generated from N(0, 1) with different sample sizes n (see Table 1). The number of replicates for calculating the empirical type I error rates is 10,000. Table 1 presents the empirical type I error rates of the eight tests. From Table 1, the type I error rates of CCOT and of all other tests except JB are close to 0.05. JB is somewhat conservative because its chi-square approximation is inaccurate for small samples, owing to the slow convergence of the kurtosis, but this improves as the sample size grows. Taking n = 30, 50, 200, 500, 1500 as examples, the empirical type I error rates of CCOT are 0.0456, 0.0526, 0.0501, 0.0553, and 0.0549, respectively, while JB gives 0.0313, 0.0357, 0.0444, 0.0472, and 0.0519, respectively.
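The conservativeness of JB at small n is easy to reproduce with a small Monte Carlo experiment. The sketch below is not the authors' simulation code: it uses 2,000 replications rather than 10,000, an arbitrary seed, and the asymptotic $\chi^2_2$ p-value $\exp(-JB/2)$; the function names are hypothetical.

```python
import math
import random


def jb_pvalue(x):
    """Asymptotic Jarque-Bera p-value, exp(-JB / 2)."""
    n = len(x)
    m = sum(x) / n
    d = [v - m for v in x]
    m2 = sum(t * t for t in d) / n
    m3 = sum(t ** 3 for t in d) / n
    m4 = sum(t ** 4 for t in d) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n * (skew ** 2 / 6 + (kurt - 3) ** 2 / 24)
    return math.exp(-jb / 2)


def empirical_type1(n, reps, alpha=0.05, seed=1):
    """Share of N(0, 1) samples of size n rejected at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if jb_pvalue(sample) < alpha:
            rejections += 1
    return rejections / reps


rate = empirical_type1(n=30, reps=2000)
print(rate)  # expected to fall below the nominal 0.05
```

The same loop, with `jb_pvalue` replaced by any other test's p-value function, gives the empirical type I error rate of that test.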

Table 1. The empirical type I error rates of eight tests for different sample sizes under nominal significance level 0.05 based on 10,000 replications.

https://doi.org/10.1371/journal.pone.0289498.t001

### Bimodal normal distribution

Assume that data are generated from the bimodal normal distribution (BN):

$$BN(\mu_1, \sigma_1, \mu_2, \sigma_2, a) = a\,N(\mu_1, \sigma_1^2) + (1 - a)\,N(\mu_2, \sigma_2^2),$$

where $a \in [0, 1]$ means that data come from $N(\mu_1, \sigma_1^2)$ with probability $a$, and from $N(\mu_2, \sigma_2^2)$ with probability $1 - a$. For convenience, we fix $\mu_1 = 0$, and choose the nominal significance level $\alpha = 0.05$. The number of replicates for calculating the empirical power is 10,000. We choose $\mu_2 \in \{0, 1, 2, 3, 4\}$, $\sigma_2 \in \{1, 2, 3, 4, 6\}$ and $a \in \{0.01, 0.05, 0.20, 0.35, 0.50, 0.65, 0.80, 0.90, 0.95, 0.99\}$. The combination of these parameters allows for a variety of distributions, such as symmetric and asymmetric cases, and unimodal and bimodal cases. We set five scenarios: (i) $\mu_2 = 0$, $a = 0.1$, $n = 100$ and (ii) $\mu_2 = 0$, $a = 0.5$, $n = 100$ for different $\sigma_2$; (iii) $\sigma_2 = 4$, $a = 0.05$, $n = 50$ and (iv) $\sigma_2 = 1$, $a = 0.5$, $n = 50$ for different $\mu_2$; and (v) $\mu_2 = 0$, $\sigma_2 = 4$, $n = 50$ for different $a$. Scenario (i) corresponds to symmetric distributions with moderate to long tails; Scenario (ii) corresponds to symmetric distributions in which data come from $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ with equal probability; Scenario (iii) corresponds to slightly skewed distributions with long tails; Scenario (iv) corresponds to distributions with two peaks; and Scenario (v) focuses on the effect of different mixing probabilities $a$. Their density function curves, skewness and kurtosis are presented in S1 Fig.
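Sampling from the BN model amounts to choosing a component with probability $a$ and then drawing from the chosen normal. The sketch below illustrates this for a Scenario (iv) style setting. The function name `sample_bn` is hypothetical, and the value $\sigma_1 = 1$ is an assumption made here for illustration, since the text above states only $\mu_1 = 0$.

```python
import random


def sample_bn(n, a, mu1, sigma1, mu2, sigma2, rng):
    """Draw n values from a * N(mu1, sigma1^2) + (1 - a) * N(mu2, sigma2^2)."""
    draws = []
    for _ in range(n):
        if rng.random() < a:
            draws.append(rng.gauss(mu1, sigma1))  # first component
        else:
            draws.append(rng.gauss(mu2, sigma2))  # second component
    return draws


rng = random.Random(0)
# Two equally likely peaks at 0 and 4; sigma1 = 1 is an assumed value
xs = sample_bn(20000, a=0.5, mu1=0.0, sigma1=1.0, mu2=4.0, sigma2=1.0, rng=rng)
sample_mean = sum(xs) / len(xs)
print(sample_mean)  # should be near a * mu1 + (1 - a) * mu2 = 2
```

Feeding such samples to each normality test over many replications, and recording the share of rejections, yields the empirical power curves of the kind summarized in Fig 1.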

Fig 1 illustrates the empirical power of CCOT and the other tests for Scenarios (i) to (v). We can see from Fig 1 that under Scenario (i), CCOT, JB and SF perform better than SW, AD, CVM, KS, and CS. Taking σ2 = 3 in Scenario (i) as an example, the powers of CCOT, JB, SF, SW, AD, CVM, KS, CS are 0.8343, 0.8458, 0.8387, 0.7998, 0.6782, 0.6044, 0.4850, 0.2874, respectively. Under Scenario (ii), the empirical power of AD and CVM goes up while that of JB goes down; CCOT performs well, closely following AD and CVM. Under Scenario (iii), CCOT, JB and SF are superior to the other methods. Taking μ2 = 2 in Scenario (iii) as an example, the powers of CCOT, JB, SF, KS, CVM, AD, SW, CS are 0.6327, 0.6385, 0.6381, 0.4198, 0.4829, 0.5315, 0.6134, 0.3065, respectively. Under Scenario (iv), CCOT, CVM, AD and SW have higher power than the other tests. Taking μ2 = 4 in Scenario (iv) as an example, the powers of CCOT, CVM, AD, SW, KS, JB, SF, CS are 0.9032, 0.9430, 0.9425, 0.9017, 0.8241, 0.0018, 0.7884, 0.7340, respectively. Under Scenario (v), as the probability a increases, JB performs well only at high kurtosis and poorly at low kurtosis, AD and CVM perform better when 0.35 ≤ a ≤ 0.8, while the proposed CCOT is robust for all values of a. Taking a = 0.35 as an example, the powers of CCOT, CVM, AD, JB, SW, CS are 0.9257, 0.9224, 0.9317, 0.8283, 0.9034, 0.6666, respectively.

Fig 1. The empirical power for scenarios (i)—(v) under nominal significance level 0.05 based on 10,000 replications.

https://doi.org/10.1371/journal.pone.0289498.g001

### Other distributions

Let n ∈ {10, 15, 20, 25, 30, 40, 50, 100, 200, 300, 400, 500, 1000, 1500, 2000} and choose the nominal significance level α = 0.05. To compare power, we generate data from 12 common non-normal distributions, consisting of 6 symmetric distributions and 6 asymmetric distributions. The density function curves, skewness and kurtosis of these 12 non-normal distributions are displayed in S2 Fig. The calculation of empirical power is based on 10,000 replications. For convenience, we present some representative results in Fig 2, and the remaining scenarios are presented in S3 Fig. Beta(2, 2) corresponds to symmetric platykurtic distributions with short tails, t7 and Laplace(0, 1) correspond to symmetric leptokurtic distributions with medium to long tails, and Gamma(1, 5) corresponds to an asymmetric distribution.

Fig 2. The empirical power for different sample sizes under nominal significance level 0.05 based on 10,000 replications.

https://doi.org/10.1371/journal.pone.0289498.g002

Fig 2 gives the empirical power for 4 common non-normal distributions. As depicted in Fig 2, for Beta(2, 2), CCOT and SW are the two most powerful tests. Taking n = 200 as an example, the powers of CCOT, SW, KS, CVM, AD, JB, SF, CS are 0.8535, 0.9267, 0.3364, 0.5521, 0.7163, 0.6267, 0.765, 0.2307, respectively. For t7 and Laplace(0, 1), CCOT, JB and SF have higher power than the other tests. Taking n = 300 for t7 as an example, the powers of CCOT, JB, SF, KS, CVM, AD, SW, CS are 0.7879, 0.7995, 0.7843, 0.3907, 0.5414, 0.6043, 0.7317, 0.1770, respectively. For Gamma(1, 5), the powers of CCOT, SW and SF are higher than those of the other tests. For example, when n = 20, CCOT, SW and SF give powers of 0.7814, 0.8356 and 0.7995, while KS, CVM, AD, JB, and CS give powers of 0.5765, 0.7288, 0.7796, 0.4808, and 0.6589, respectively. Overall, CCOT is a powerful and robust test when data come from different distributions. The other results in S3 Fig also show the robustness and efficiency of CCOT in different scenarios.

## Applications

### Application to South African heart disease data

The Coronary Risk-Factor Study (CORIS) was jointly initiated in 1978 by the South African Medical Research Council, the Department of Health and Welfare, and the Human Sciences Research Council. Its aim is to determine the prevalence and intensity of risk factors in an Afrikaner community and to assess the effectiveness of interventions to reduce risk factors. Rousseauw et al. used the CORIS data from three rural communities of the southwestern Cape Province to identify and establish the intensity of ischaemic heart disease (IHD) risk factors. A subset of the data of Rousseauw et al. was analyzed by Hastie and Tibshirani to study the risk factors for myocardial infarction, consisting of 462 white males between ages 15 and 64, of whom 160 are patients and 302 are healthy controls. The subset includes quantitative indicators of 9 risk factors: systolic blood pressure, cumulative tobacco, low density lipoprotein cholesterol (LDL), adiposity, family history of heart disease, type-A behavior (TYPE-A), obesity, current alcohol consumption, and age at onset (AGE). Our analysis below is based on this subset, which is available as the ‘SAheart’ data set in the R package “ElemStatLearn”. For convenience, we select three risk factors (LDL, TYPE-A, AGE) and test whether the corresponding patient group and healthy control group come from normal distributions. We apply CCOT and other representative normality tests to the data set and obtain the p-values shown in Table 2.

Table 2. The p-values of normality tests for South African heart disease data.

https://doi.org/10.1371/journal.pone.0289498.t002

We can see from Table 2 that, except for TYPE-A in the patient group, all scenarios show that the null hypothesis should be rejected at the nominal significance level of 0.05. For risk factor LDL, the proposed CCOT and JB have smaller p-values than AD and SW. Taking LDL in the patient group as an example, the p-values of CCOT and JB are $2.2 \times 10^{-11}$ and $7.2 \times 10^{-12}$, while KS, CVM, AD, SW, and SF give $5.2 \times 10^{-4}$, $6.6 \times 10^{-6}$, $3.2 \times 10^{-7}$, $3.9 \times 10^{-7}$ and $1.8 \times 10^{-6}$, respectively. For risk factor TYPE-A in the patient group, none of the tests rejects normality, and CCOT has a larger p-value than JB, SW and SF: the p-values of CCOT and AD are 0.76 and 0.89, while JB, SW and SF give 0.58, 0.57 and 0.46, respectively. For TYPE-A in the control group, the p-values of all tests but CVM are less than 0.05, and CCOT and JB have smaller p-values than the other tests. For example, the p-values of CCOT and JB are $5.8 \times 10^{-4}$ and $2.1 \times 10^{-4}$, while KS, CVM, AD, SW, and SF give $3.4 \times 10^{-2}$, $5.5 \times 10^{-2}$, $2.3 \times 10^{-2}$, $2.9 \times 10^{-3}$, and $2.4 \times 10^{-3}$, respectively. For risk factor AGE, the p-values of CCOT and AD are the smallest in the patient group: they are $3.6 \times 10^{-9}$ and $1.2 \times 10^{-9}$, while KS, CVM, JB, SW, and SF give $3.2 \times 10^{-7}$, $1.4 \times 10^{-7}$, $1.1 \times 10^{-4}$, $1.5 \times 10^{-7}$, and $1.3 \times 10^{-6}$, respectively.

### Application to neonatal hearing impairment data

Norton et al. considered three neonatal hearing screening tools for identifying neonatal hearing impairment: distortion product otoacoustic emissions (DPOAE), transient evoked otoacoustic emissions (TEOAE) and auditory brain stem responses (ABR). The three tools are usually regarded as three biomarkers in the biomedical sciences. Taking the female data as an example, it includes 2,234 samples collected from one ear or both ears (left and right) of infants, of which 64 are hearing impaired samples (patient group) and 2,170 are normal samples (control group). Statistical analysis often focuses on whether the two groups differ in the various biomarkers. The t-test for the means of two groups is a simple and common parametric method for this problem, but its premise is that the levels of the biomarker come from normal distributions.

Choosing the two biomarkers TEOAE and ABR as examples, the Q-Q plots of the patient group and the control group are presented in Fig 3. The deviation from the straight lines in the Q-Q plots of Fig 3 indicates that neither biomarker seems to follow a normal distribution in either group. We apply CCOT and other normality tests to these data and calculate the p-values shown in Table 3. All methods reject the null hypothesis at the nominal significance level of 0.05. For biomarker TEOAE, CCOT, AD and CVM have the lowest p-values in the patient group. For example, the p-values of CCOT, AD and CVM for the patient group are $6.0 \times 10^{-4}$, $2.5 \times 10^{-4}$ and $3.3 \times 10^{-4}$, while KS, JB, SW and SF give $1.1 \times 10^{-2}$, $1.1 \times 10^{-2}$, $1.1 \times 10^{-3}$, and $2.2 \times 10^{-3}$, respectively. For biomarker ABR, the p-values of CCOT and JB are smaller than those of the other tests in the patient group. For example, the p-values of CCOT and JB are $8.7 \times 10^{-13}$ and $2.9 \times 10^{-13}$, while KS, CVM, AD, SW, and SF give $1.1 \times 10^{-3}$, $4.5 \times 10^{-5}$, $3.4 \times 10^{-5}$, $2.8 \times 10^{-5}$, and $4.9 \times 10^{-5}$, respectively. For the control group, which has a larger sample size, all normality tests have very small p-values and CCOT follows this trend. Taking ABR in the control group as an example, both CCOT and JB have p-values of 0, and KS, CVM, AD, SW, and SF give p-values of $3.4 \times 10^{-209}$, $7.4 \times 10^{-10}$, $3.7 \times 10^{-24}$, $1.3 \times 10^{-46}$, and $1.6 \times 10^{-42}$, respectively. These results agree with the Q-Q plots in showing that the data deviate from normal distributions. Thus the t-test is not appropriate for detecting differences between the two groups, and nonparametric methods such as the Wilcoxon rank-sum test should be considered instead.

Fig 3. The Q-Q plots of patient group and health group for biomarkers TEOAE and ABR.

https://doi.org/10.1371/journal.pone.0289498.g003

Table 3. The p-values of normality tests for neonatal hearing impairment female data.

https://doi.org/10.1371/journal.pone.0289498.t003

## Discussion

Prior to applying a normality-based statistical inference procedure, it is critical to check whether the data come from a normal distribution. The South African Heart Disease and Neonatal Hearing Impairment data both involve testing whether there are differences between patient and healthy groups. This is usually addressed with the t-test or Hotelling's T² test, and the precondition of both tests is that the data are normally or approximately normally distributed. Thus, performing a normality test before data analysis helps to understand the data structure and to verify the validity of further statistical analysis.

In this article, we review some representative and commonly used normality tests such as AD, SW and JB. Other normality tests with their own merits could also be taken into account, but for simplicity we have chosen only the three most representative ones. Since AD, SW and JB are constructed from the distribution shape, the second moment, and the third and fourth moments, respectively, they are widely used in applications. Based on numerical results, we find that each of them has its own strengths: AD has outstanding performance for multimodal distributions; SW performs well for symmetric, platykurtic, short-tailed distributions and for skewed distributions; and JB is powerful for symmetric and slightly skewed distributions with long tails but poor for short-tailed and bimodal distributions. However, data in real problems may be complex and diverse, and their distribution may show different degrees of skewness and kurtosis, or several peaks. Thus, a new test called CCOT is proposed by integrating the strengths of the above three tests, so that it can be applied to different data distributions. We take the combination of two p-values as an example in Theorem 1 to discuss the conditions under which CCOT remains significant when one p-value in the combination is significant but the other is not. For example, if α = 0.05 and p1 = 0.01, the p-value of the statistic based on the Cauchy combination is guaranteed to be significant when p2 ∈ (0.05, 0.09] or (0.5, 0.98343). Besides, the p-value of T is not significant if p1 = 0.01 and p2 ≥ 0.99, but it is very rare for the p-values of two methods to differ so greatly in normality testing.

There are a variety of strategies for combining p-values in the literature, based on different research purposes and data characteristics, such as the minimum p-value method, Fisher’s combination test, the higher criticism method, the adaptive rank truncated product method, the Cauchy combination test, and so on. However, no method is uniformly most powerful in all cases [31, 32]. In this paper, the CCOT has two advantages: (i) its performance is always close to that of the best method in the numerical simulations, and (ii) its statistical significance has an approximate closed-form expression, which is easier to calculate than that of other combination methods regardless of the correlation structure between the p-values. We also give the R source code to compute its p-value, which is attached in S1 Appendix. The case analyses show that CCOT is robust and effective, because it can detect departures from the normal distribution and thereby inform the subsequent selection of appropriate statistical inference methods. Thus CCOT is expected to have wide application in the future.

In this work, we focus on univariate normality tests. There is some work that directly extends the procedures of univariate normality tests to multivariate cases. Mardia gave sample estimates of multivariate skewness and kurtosis, which provided a reference for multivariate normality tests based on sample moments. Kim proposed a multivariate version of the Jarque-Bera test using orthogonalization or empirical standardization of the data, which was powerful for symmetric marginal distributions with long tails. The Shapiro-Wilk test has been generalized to multivariate cases [35, 36]. For multivariate normality tests based on the empirical distribution function, a common idea is to reduce the multivariate data to univariate uniformity. It is worth noting that our Cauchy combination omnibus test can be naturally extended to multivariate normality testing because it needs only the p-values.

## Supporting information

### S1 Fig. The density function curves for Scenarios (i)—(v) under BN model.

The solid and dashed lines represent the density function curves of standardized BN samples and standard normally distributed samples, respectively. The values of sk and ku below these subfigures are the skewness and kurtosis.

https://doi.org/10.1371/journal.pone.0289498.s001

(PDF)

### S2 Fig. The density function curves, skewness and kurtosis of 12 common non-normal distributions.

The solid and dashed lines represent the density curves of 12 common non-normal distributed data and standard normally distributed data, respectively.

https://doi.org/10.1371/journal.pone.0289498.s002

(PDF)

### S3 Fig. The empirical power for different sample sizes under remaining 8 common non-normal distributions at nominal significance level of 0.05 based on 10,000 replications.

https://doi.org/10.1371/journal.pone.0289498.s003

(PDF)

### S1 Appendix. The source R code to compute p-value of CCOT.

https://doi.org/10.1371/journal.pone.0289498.s004

(PDF)

## References

1. Smith TJ, Temple AR, Reading JC. Cadmium, lead, and copper blood levels in normal children. Clin Toxicol. 1976; 9(1):75–87. pmid:1277771
2. Fredholm B, Gunnarson K, Jensen G, Müntzing J. Mammary tumor inhibition and subacute toxicity in rats of prednimustine and of its molecular components chlorambucil and prednisolone. Acta Pharmacologica et Toxicologica 1978; 42(3):159–163. pmid:580343
3. Griffin MJ, Whitham EM. Individual variability and its effect on subjective and biodynamic response to whole-body vibration. J Sound Vib. 1978; 58(2):239–250.
4. D’Agostino RB, Stephens MA. Goodness-of-fit techniques. Marcel Dekker, New York, 1986.
5. Shapiro SS, Wilk MB. An analysis of variance test for normality. Biometrika 1965; 52:591–611.
6. Royston JP. Approximating the Shapiro-Wilk W-test for non-normality. Stat Comput. 1992; 2(3):117–119.
7. Rahman MM, Govindarajulu Z. A modification of the test of Shapiro and Wilk for normality. J Appl Stat. 1997; 24(2):219–236.
8. Jarque CM, Bera AK. A test for normality of observations and regression residuals. International Statistical Review/Revue Internationale de Statistique 1987; 55(2):163–172.
9. Anderson TW, Darling DA. A test of goodness of fit. J Am Stat Assoc. 1954; 49(268):765–769.
10. Lilliefors HW. On the Kolmogorov-Smirnov test for normality with mean and variance unknown. J Am Stat Assoc. 1967; 62(318):399–402.
11. Dumonceaux R, Antle CE, Haas G. Likelihood ratio test for discrimination between two models with unknown location and scale parameters. Technometrics 1973; 15(1):19–31.
12. Spiegelhalter DJ. A test for normality against symmetric alternatives. Biometrika 1977; 64(2):415–418.
13. Arshad M, Rasool MT, Ahmad MI. Anderson Darling and modified Anderson Darling tests for generalized Pareto distribution. Pakistan Journal of Applied Sciences 2003; 3(2):85–88.
14. Thadewald T, Büning H. Jarque-Bera test and its competitors for testing normality-a power comparison. J Appl Stat. 2007; 34(1):87–105.
15. Nornadiah MR, Yap BW. Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests. Journal of Statistical Modeling and Analytics 2011; 2(1):21–33.
16. Stephens MA. Use of the Kolmogorov-Smirnov, Cramer-Von Mises and related statistics without extensive tables. J R Stat Soc B. 1970; 32(1):115–122.
17. Seier E. Comparison of tests for univariate normality. InterStat Statistical Journal 2002; 1:1–17.
18. Shapiro SS, Francia RS. An approximate analysis of variance test for normality. J Am Stat Assoc. 1972; 67(337):215–216.
19. Bowman KO, Shenton LR. Omnibus contours for departures from normality based on √b1 and b2. Biometrika 1975; 62(2):243–250.
20. Bu DL, Yang QL, Meng Z, Li Q. Truncated tests for combining evidence of summary statistics. Genet Epidemiol. 2020; 44:687–701. pmid:32583530
21. Liu Y, Xie J. Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. J Am Stat Assoc. 2020; 115(529):393–402. pmid:33012899
22. Long M, Li Z, Zhang W, Li Q. The Cauchy combination test under arbitrary dependence structures. 2022; https://doi.org/10.48550/arXiv.2107.06040.
23. Chen Z. Robust tests for combining p-values under arbitrary dependency structures. Sci Rep. 2022; 12:3158. pmid:35210502
24. Rousseauw J, du Plessis J, Benade A, Jordaan P, Kotze J, Jooste P, et al. Coronary risk factor screening in three rural communities. S Afr Med J. 1983; 64:430–436.
25. Hastie T, Tibshirani R. Nonparametric logistic and proportional odds regression. Appl Stat. 1987; 36:260–276.
26. Norton SJ, Gorga MP, Widen JE, Folsom R, Sininger Y, Cone-Wesson B, et al. Identification of neonatal hearing impairment: evaluation of transient evoked otoacoustic emission, distortion product otoacoustic emission, and auditory brain stem response test performance. Ear Hearing 2000; 21(5):508–528. pmid:11059707
27. Tippett LHC. The methods of statistics. London: Williams and Norgate, 1931.
28. Fisher RA. Statistical methods for research workers. Edinburgh: Oliver and Boyd, 1934.
29. Donoho D, Jin J. Higher criticism for detecting sparse heterogeneous mixtures. Ann Stat 2004; 32:962–994.
30. Yu K, Li Q, Bergen AW, Pfeiffer RM, Rosenberg PS, Caporaso N, et al. Pathway analysis by adaptive combination of p-values. Genet Epidemiol. 2009; 33(8):700–709. pmid:19333968
31. Birnbaum A. Combining independent tests of significance. J Am Stat Assoc. 1954; 49(267):559–574.
32. Won S, Morris N, Lu Q, Elston RC. Choosing an optimal method to combine p-values. Stat Med 2009; 28(11):1537–1553. pmid:19266501
33. Mardia KV. Measures of multivariate skewness and kurtosis with applications. Biometrika 1970; 57(3):519–530.
34. Kim N. A robustified Jarque-Bera test for multivariate normality. Econ Lett. 2016; 140:48–52.
35. Royston JP. Some techniques for assessing multivariate normality based on the Shapiro-Wilk W. J R Stat Soc C. 1983; 32(2):121–133.
36. Villasenor Alva JA, Estrada EG. A generalization of Shapiro-Wilk’s test for multivariate normality. Commun Stat-Theor M. 2009; 38(11):1870–1883.
37. Paulson AS, Roohan P, Sullo P. Some empirical distribution function tests for multivariate normality. J Stat Comput Sim. 1987; 28(1):15–30.