Abstract
In this study, tests of fit for the power function lognormal distribution are considered. We provide the probability plot, the probability plot correlation coefficient, and three goodness-of-fit tests: the Kolmogorov–Smirnov (KS), Cramér–von Mises (CvM), and Anderson–Darling (AD) tests. Tables of critical values are obtained by simulation, and power comparisons show that the AD test outperforms the KS and CvM tests. Finally, to illustrate these test procedures, we fit this distribution to the survival times of 121 breast cancer patients from one hospital.
Citation: Wang C, Zhu H (2024) Tests of fit for the power function lognormal distribution. PLoS ONE 19(2): e0298309. https://doi.org/10.1371/journal.pone.0298309
Editor: Umair Khalil, Abdul Wali Khan University Mardan, PAKISTAN
Received: February 1, 2023; Accepted: January 18, 2024; Published: February 22, 2024
Copyright: © 2024 Wang, Zhu. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting information files.
Funding: This research was funded by National Social Science Fund of China (Nos.19BTJ051, awarded to CW; 19CTJ004, awarded to HZ) and the Fundamental Research Funds for the Provincial Universities of Zhejiang (No.XT202304, awarded to HZ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
Statistical distributions are important for modeling and predicting real-world situations. For example, the empirical analysis of the distribution of income amounts is a major topic of research in development economics. One of the primary purposes of this analysis is simply to describe the distribution of income and derive descriptive and summative inequality measures, such as the Gini coefficient. A large number of income distributions have been proposed in the statistical literature, including the lognormal, gamma, Pareto, Weibull, Dagum, Singh–Maddala, and generalized beta–2 distributions [1].
Another type of income distribution, the power function lognormal composite (PFLC) distribution, has recently been proposed [2]. This flexible distribution can have positive or negative skewness and can be either leptokurtic or platykurtic, depending on the parameters. Statistical inference methods for it have also been presented [2], together with applications to modeling household income and automobile insurance claims. Moreover, it has been demonstrated that inequality measures, including the Gini coefficient, generalized entropy index, Theil’s entropy index, and the Atkinson, Bonferroni, and Zenga indexes, can be obtained using numerical methods based on the PFLC distribution [3].
The modeling and analysis of lifetime data are important aspects of statistical work in areas such as engineering, medicine, and the biological sciences. Lifetime data are often skewed or far from normal. Therefore, models such as the exponential, Weibull, lognormal, and gamma often occupy a central position because of their demonstrated usefulness in a wide range of situations. Over the years, many models have been developed and applied to lifetime data, including the exponentiated generalized linear exponential distribution [4], generalized transmuted-G family [5], modified beta transmuted exponential distribution [6], extended Gumbel distribution [7], and generalized Marshall-Olkin exponentiated exponential distribution [8]. We will demonstrate that the PFLC distribution can also be applied to the modeling of lifetime data.
Goodness-of-fit (GoF) usually refers to whether a dataset is consistent with sampling from a model for a distribution. Many GoF tests exist, and they can generally be classified as either graphical techniques or statistical methods. Statistical methods are usually preferred because of their objectivity. One frequently used graphical method is the probability plot [9], which generally uses special scales on which the cumulative distribution function (CDF) of a particular distribution plots as a straight line. A normal probability plot is defined as a plot of the ith-order statistic versus some measure of location of the ith-order statistics from a standard normal distribution. The probability plot correlation coefficient, the product moment correlation coefficient that measures the degree of linear association between these two random variables, can be used as an appropriate test statistic [10]. The Kolmogorov–Smirnov (KS), Cramér–von Mises (CvM), and Anderson–Darling (AD) tests are but a few of the traditional statistical tests available to determine GoF. These are all based on the empirical distribution function (EDF). They test the null hypothesis by measuring the distance between the EDF estimated from observed data and the CDF of the fitted models [11].
We introduce and implement the probability plot, the probability plot correlation coefficient, and GoF tests for the PFLC distribution. The first two methods are based on a transformation of the cumulative distributions. The classical KS, CvM, and AD tests are considered. A power study was conducted to investigate their performance, and the results show that the AD test outperforms the KS and CvM tests, and that the KS test is the weakest, requiring a much larger sample size to achieve power comparable to that of the other tests. We also show that the PFLC distribution can provide a good fit for a survival dataset.
The rest of this paper is organized as follows. The setup and statistical properties of the PFLC distribution are discussed in Section 2. The probability plot method and probability plot correlation coefficient are investigated in Sections 3 and 4, respectively. The derivations of computing formulae, critical values, and power comparisons for three PFLC GoF test statistics follow in Section 5. An application to a survival-time dataset is presented in Section 6. Some conclusions are drawn in Section 7.
2. Power function lognormal composite distribution
The probability density function (PDF) of the PFLC distribution can be written as [2]
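$$ f(x) = \begin{cases} \dfrac{w\alpha}{\theta}\left(\dfrac{x}{\theta}\right)^{\alpha-1}, & 0 < x \le \theta, \\[1ex] \dfrac{1-w}{\Phi(\alpha\sigma)} \cdot \dfrac{1}{x\sigma}\,\phi\!\left(\dfrac{\ln x - \ln\theta - \alpha\sigma^{2}}{\sigma}\right), & x > \theta, \end{cases} \tag{1} $$
where w is a mixing weight, defined as
$$ w = \frac{\phi(\alpha\sigma)}{\alpha\sigma\,\Phi(\alpha\sigma) + \phi(\alpha\sigma)}, \tag{2} $$
and Φ(·) and φ(·) are the CDF and PDF of the standard normal distribution, respectively.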
PFLC is a distribution in three unknown parameters—α > 0, θ > 0, and σ > 0—and X~PFLC(α,θ,σ) indicates that X follows this distribution.
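The corresponding CDF F(x) and the quantile function X(p) are given by
$$ F(x) = \begin{cases} w\left(\dfrac{x}{\theta}\right)^{\alpha}, & 0 < x \le \theta, \\[1ex] 1 - \dfrac{1-w}{\Phi(\alpha\sigma)}\left[1 - \Phi\!\left(\dfrac{\ln x - \ln\theta - \alpha\sigma^{2}}{\sigma}\right)\right], & x > \theta, \end{cases} \tag{3} $$
and
$$ X(p) = \begin{cases} \theta\left(\dfrac{p}{w}\right)^{1/\alpha}, & 0 < p \le w, \\[1ex] \theta\exp\left\{\sigma\,\Phi^{-1}\!\left[1 - \dfrac{(1-p)\,\Phi(\alpha\sigma)}{1-w}\right] + \alpha\sigma^{2}\right\}, & w < p < 1. \end{cases} \tag{4} $$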
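As a rough illustration of the quantities defined above, the following Python sketch evaluates the mixing weight, the CDF, and the quantile function. It assumes that (2) takes the form w = φ(ασ)/[ασΦ(ασ) + φ(ασ)], which makes the density continuous at θ and reproduces the values w ≈ 0.057 for PFLC(4,5,0.42) and w ≈ 0.5841 for PFLC(5,10,0.08) quoted later in the paper. The function names are hypothetical, and this is not the authors' MATLAB code.

```python
from math import exp, log
from statistics import NormalDist

_N = NormalDist()  # standard normal: cdf = Phi, pdf = phi, inv_cdf = Phi^{-1}


def mixing_weight(alpha, sigma):
    """Assumed form of (2): w = phi(t) / (t * Phi(t) + phi(t)), t = alpha * sigma."""
    t = alpha * sigma
    return _N.pdf(t) / (t * _N.cdf(t) + _N.pdf(t))


def pflc_cdf(x, alpha, theta, sigma):
    """Assumed form of (3): power function up to theta, lognormal-type tail above."""
    if x <= 0.0:
        return 0.0
    w = mixing_weight(alpha, sigma)
    if x <= theta:
        return w * (x / theta) ** alpha
    z = (log(x) - log(theta) - alpha * sigma ** 2) / sigma
    return 1.0 - (1.0 - w) * (1.0 - _N.cdf(z)) / _N.cdf(alpha * sigma)


def pflc_quantile(p, alpha, theta, sigma):
    """Quantile function (4) for 0 < p < 1, using the two branches quoted in Section 3."""
    w = mixing_weight(alpha, sigma)
    if p <= w:
        return theta * (p / w) ** (1.0 / alpha)
    u = 1.0 - (1.0 - p) * _N.cdf(alpha * sigma) / (1.0 - w)
    return theta * exp(sigma * _N.inv_cdf(u) + alpha * sigma ** 2)
```

Under these assumptions the two branches meet at x = θ, where the CDF equals w, so `pflc_quantile(mixing_weight(alpha, sigma), alpha, theta, sigma)` returns θ.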
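Let X~PFLC(α,θ,σ). Then Y = w(X/θ)^α has the density function
$$ f(y) = \begin{cases} 1, & 0 < y \le w, \\[1ex] \dfrac{1-w}{\Phi(\alpha\sigma)} \cdot \dfrac{1}{\alpha\sigma\,y}\,\phi\!\left(\dfrac{\ln y - \ln w - \alpha^{2}\sigma^{2}}{\alpha\sigma}\right), & y > w. \end{cases} \tag{5} $$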
From (2), one can easily verify that w is a decreasing function of ασ. Table 1 presents the values of ασ for a grid of w values [0.01(0.01)0.99], which are accurate to about six significant digits. Thus f(y) in (5) can also be seen as a function of y when w is given.
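The CDF F(y) and quantile function Y(p) are given by
$$ F(y) = \begin{cases} y, & 0 < y \le w, \\[1ex] 1 - \dfrac{1-w}{\Phi(\alpha\sigma)}\left[1 - \Phi\!\left(\dfrac{\ln y - \ln w - \alpha^{2}\sigma^{2}}{\alpha\sigma}\right)\right], & y > w, \end{cases} \tag{6} $$
and
$$ Y(p) = \begin{cases} p, & 0 < p \le w, \\[1ex] w\exp\left\{\alpha\sigma\,\Phi^{-1}\!\left[1 - \dfrac{(1-p)\,\Phi(\alpha\sigma)}{1-w}\right] + \alpha^{2}\sigma^{2}\right\}, & w < p < 1. \end{cases} \tag{7} $$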
We will refer to distribution (5) as the standard power function lognormal composite distribution, denoted by SPFLC(w).
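Because the standard form depends on the parameters only through w, one needs the value of ασ that corresponds to a given w (the role of Table 1). The sketch below recovers ασ by bisection and draws SPFLC(w) samples by inverse transform. It assumes the same form of (2) as w = φ(ασ)/[ασΦ(ασ) + φ(ασ)] and a quantile function equal to p below w with a lognormal-type branch above w; all function names are hypothetical.

```python
import random
from math import exp
from statistics import NormalDist

_N = NormalDist()


def weight_from_tau(tau):
    """Assumed form of (2), written in terms of tau = alpha * sigma."""
    return _N.pdf(tau) / (tau * _N.cdf(tau) + _N.pdf(tau))


def tau_from_weight(w, lo=1e-12, hi=40.0, tol=1e-12):
    """Solve weight_from_tau(tau) = w by bisection; w decreases as tau grows."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if weight_from_tau(mid) > w:
            lo = mid  # weight still too large, so tau must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)


def spflc_quantile(p, w, tau=None):
    """Assumed form of (7) for 0 < p < 1."""
    if p <= w:
        return p
    if tau is None:
        tau = tau_from_weight(w)
    u = 1.0 - (1.0 - p) * _N.cdf(tau) / (1.0 - w)
    return w * exp(tau * _N.inv_cdf(u) + tau ** 2)


def spflc_sample(n, w, rng):
    """Ordered inverse-transform sample of size n from SPFLC(w)."""
    tau = tau_from_weight(w)
    return sorted(spflc_quantile(rng.random(), w, tau) for _ in range(n))
```

For example, `spflc_sample(20, 0.1, random.Random(1234))` returns 20 ordered values, roughly 10% of which are expected to fall below w = 0.1.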
3. Probability plot
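Taking the logarithm of (4), we obtain
$$ \ln x = \begin{cases} \ln\theta + \dfrac{1}{\alpha}\ln\dfrac{p}{w}, & 0 < p \le w, \\[1ex] \ln\theta + \sigma\,\Phi^{-1}\!\left[1 - \dfrac{(1-p)\,\Phi(\alpha\sigma)}{1-w}\right] + \alpha\sigma^{2}, & w < p < 1. \end{cases} \tag{8} $$
Note that when p = w, both branches equal ln θ: ln θ + (1/α) ln(p/w) = ln θ + (1/α) ln(p/p) = ln θ, and ln θ + σΦ⁻¹[1 − (1 − p)Φ(ασ)/(1 − w)] + ασ² = ln θ + σΦ⁻¹[Φ(−ασ)] + ασ² = ln θ.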
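Eq (8) can be written as
$$ q = \ln x - \ln\theta, \tag{9} $$
where q is defined as
$$ q = \begin{cases} \dfrac{1}{\alpha}\left(\ln p - \ln w\right), & 0 < p \le w, \\[1ex] \sigma\,\Phi^{-1}\!\left[1 - \dfrac{(1-p)\,\Phi(\alpha\sigma)}{1-w}\right] + \alpha\sigma^{2}, & w < p < 1. \end{cases} \tag{10} $$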
Thus, (9) represents a linear relationship between q and lnx, with an intercept of −lnθ and a slope of 1.
The probability plot of 100 sets of simulated data from PFLC(5,10,0.08) with random seed 1234 is shown in Fig 1, where the solid line AB is a probability plot drawn according to (8). The starting point A is the intersection of lnx = ln(xmin) and AB, where ln(xmin) is the natural logarithm of the minimum value of the simulated data. The ending point B is the intersection of lnx = ln(xmax) and AB, where ln(xmax) is the natural logarithm of the maximum value of the simulated data. Point E(lnθ, 0) is the intersection of lnx = lnθ and q = 0, which corresponds exactly to p = w. In this example, ln(xmin) = 1.7780, ln(xmax) = 2.4514, and
.
The line q = 0 divides AB into two parts. For the upper part BE, q = σΦ⁻¹[1 − (1 − p)Φ(ασ)/(1 − w)] + ασ², and for the lower part AE, q = (1/α)[ln p − ln w]. Equivalently, because E corresponds to p = w, the line p = 0.5841 divides AB into the same two parts: p < 0.5841 (AE) and p > 0.5841 (BE).
Many authors have discussed methods for choosing the values of p for a given n for use in such plots [9]. We use the Hazen formula,
$$ p_i = \frac{i - 0.5}{n}, \quad i = 1, 2, \ldots, n. \tag{11} $$
We determine whether the PFLC distribution can be used to fit a dataset using a probability plot as follows:
- (i) Order the sample values to obtain x1 ≤ x2 ≤ ⋯ ≤ xn.
- (ii) Obtain the maximum likelihood estimates (α̂, θ̂, σ̂) of (α, θ, σ), and then compute ŵ from (2).
- (iii) Compute qi from (10), where pi is given by (11).
- (iv) Plot qi versus ln xi; the PFLC distribution can be used to model the dataset if the plot is approximately a straight line.
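The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the parameter estimates are taken as given (the maximum likelihood step is not shown), the mixing weight is assumed to be w = φ(ασ)/[ασΦ(ασ) + φ(ασ)], q follows the two branches quoted in Section 3, and the Hazen positions are p_i = (i − 0.5)/n. The function names are hypothetical.

```python
from math import log
from statistics import NormalDist

_N = NormalDist()


def hazen_positions(n):
    """Hazen plotting positions p_i = (i - 0.5) / n for i = 1..n."""
    return [(i - 0.5) / n for i in range(1, n + 1)]


def plot_coordinates(data, alpha, theta, sigma):
    """Return (ln x_(i), q_i) pairs for a PFLC probability plot.

    alpha, theta, sigma are assumed to be supplied by the caller,
    for example as maximum likelihood estimates. theta is not needed
    to compute q; it only fixes the intercept -ln(theta) of the line.
    """
    t = alpha * sigma
    w = _N.pdf(t) / (t * _N.cdf(t) + _N.pdf(t))  # assumed form of (2)
    xs = sorted(data)
    coords = []
    for x, p in zip(xs, hazen_positions(len(xs))):
        if p <= w:
            q = (log(p) - log(w)) / alpha
        else:
            u = 1.0 - (1.0 - p) * _N.cdf(t) / (1.0 - w)
            q = sigma * _N.inv_cdf(u) + alpha * sigma ** 2
        coords.append((log(x), q))
    return coords
```

If the data follow the fitted PFLC distribution, the returned points should lie close to the line q = ln x − ln θ, so plotting them with any plotting tool gives the diagnostic described in step (iv).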
Fig 2 illustrates the probability plots of simulated data from PFLC(5,4,0.096), PFLC(6,6,0.08), and PFLC(8,8,0.06). For each case, 1000 sets of data are generated with random seed 1234. Fig 2 shows that the plots are approximately linear, which is consistent with the underlying distributions being PFLC.
4. The probability plot correlation coefficient
The probability plot correlation (PPC) coefficient was introduced as a test statistic for normality [10]. The PPC test measures the linearity of a probability plot; if the hypothesized distribution is correct, the probability plot is expected to be almost linear and the correlation coefficient will be close to one. We now derive the PPC test for the PFLC distribution.
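Taking the logarithm of (7), we obtain
$$ \ln Y(p) = \begin{cases} \ln p, & 0 < p \le w, \\[1ex] \ln w + \alpha\sigma\,\Phi^{-1}\!\left[1 - \dfrac{(1-p)\,\Phi(\alpha\sigma)}{1-w}\right] + \alpha^{2}\sigma^{2}, & w < p < 1. \end{cases} \tag{12} $$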
Letting Zi = lnYi, then T1i, T2i are defined as
(13)
Then, the correlation coefficient rQ between Zi and Ti can be defined as
(14)
Because rQ only depends on w, we can obtain the critical values of this statistic for a given w in practice. Monte Carlo studies have determined the percentage points of the statistics for sample sizes n = 5(5)90, 90(10)100, 100(50)500, and 500(100)1000. For each case, the procedure was repeated 20 000 times to produce an empirical distribution of the test statistic, from which sample quantiles approximating the critical values were obtained. For w = 0.1, the algorithm to obtain the 5% critical values is as follows.
- (i) Set w = 0.1, generate n random numbers from (7), and order the sample values to obtain y1 ≤ y2 ≤ ⋯ ≤ yn. Then, obtain ln(y1) ≤ ln(y2) ≤ ⋯ ≤ ln(yn).
- (ii) Compute m, which is the number of yi values less than w, and compute percentile point pi from (11).
- (iii) Compute ασ from (2), and then compute T1i and T2i from (13).
- (iv) Compute the correlation coefficient rQ between Zi and Ti from (14).
- (v) Repeat steps (i)-(iv) 20000 times to obtain the 5% sample quantiles of rQ as the 5% critical values.
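The Monte Carlo procedure above can be sketched in Python. This is only an approximation of the algorithm under stated assumptions: the reference values T_i are taken to be the theoretical log-quantiles of SPFLC(w) at the Hazen positions (split by whether p_i ≤ w, rather than by the count m of step (ii)), the mixing weight is assumed to be w = φ(ασ)/[ασΦ(ασ) + φ(ασ)], and far fewer replications than 20 000 are used by default. Function names are hypothetical, and the output is not expected to match Table 2 exactly.

```python
import random
from math import log, sqrt
from statistics import NormalDist

_N = NormalDist()


def tau_from_weight(w, lo=1e-12, hi=40.0, tol=1e-12):
    """Bisection for tau = alpha * sigma such that the assumed form of (2) equals w."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if _N.pdf(mid) / (mid * _N.cdf(mid) + _N.pdf(mid)) > w:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def log_quantile(p, w, tau):
    """ln Y(p) for SPFLC(w): ln p below w, lognormal-type branch above w."""
    if p <= w:
        return log(p)
    u = 1.0 - (1.0 - p) * _N.cdf(tau) / (1.0 - w)
    return log(w) + tau * _N.inv_cdf(u) + tau ** 2


def pearson(a, b):
    """Product moment correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / sqrt(saa * sbb)


def uniform_open(rng):
    """Uniform draw on the open interval (0, 1)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def ppcc_critical_value(w, n, reps=2000, level=0.05, seed=1234):
    """Lower-tail sample quantile of r_Q under SPFLC(w)."""
    rng = random.Random(seed)
    tau = tau_from_weight(w)
    t = [log_quantile((i - 0.5) / n, w, tau) for i in range(1, n + 1)]
    stats = []
    for _ in range(reps):
        z = sorted(log_quantile(uniform_open(rng), w, tau) for _ in range(n))
        stats.append(pearson(z, t))
    stats.sort()
    return stats[int(level * reps)]
```

A call such as `ppcc_critical_value(0.1, 10)` returns an estimate of the 5% lower critical value for n = 10; increasing `reps` reduces the Monte Carlo error at the cost of run time.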
Table 2 presents the 5% critical values of the distribution of rQ for selected sample sizes when w equals 0.1(0.2)0.9. For example, the critical value of rQ for n = 10 is 0.910701 when w is 0.1; this means that in 5% of random samples of size 10 from SPFLC(0.1), the correlation coefficient will be at most 0.910701.
We can determine whether the PFLC distribution can be used to fit a dataset by the correlation coefficient at the 5% significance level as follows:
- (i) Order the sample values to obtain x1 ≤ x2 ≤ ⋯ ≤ xn.
- (ii) Obtain the maximum likelihood estimates (α̂, θ̂, σ̂) of (α, θ, σ), and calculate ŵ from (2).
- (iii) Calculate Zi = ln yi, where yi = ŵ(xi/θ̂)^α̂.
- (iv) Calculate rQ from (14).
- (v) Reject H0 (the sample is from a PFLC distribution) at the 5% significance level if rQ is less than the 5% critical value.
5. Goodness-of-fit tests
We discuss three GoF tests for the PFLC distribution [11]. As described in Section 2, the test for X~PFLC(α,θ,σ) is equivalent to that for Y~SPFLC(w). Therefore, we test whether the underlying probability distribution is SPFLC(w) for a given random sample Y1, Y2, ⋯, Yn.
5.1. Basic GoF test statistics
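Let y1 ≤ y2 ≤ ⋯ ≤ yn denote the data in ascending order, and let F̂ denote the fitted CDF. The test statistic for the KS test is
$$ D = \max\left(D^{+}, D^{-}\right), \tag{15} $$
where
$$ D^{+} = \max_{1 \le i \le n}\left\{\frac{i}{n} - \hat{F}(y_i)\right\} $$
and
$$ D^{-} = \max_{1 \le i \le n}\left\{\hat{F}(y_i) - \frac{i-1}{n}\right\}. $$
The CvM test statistic is
$$ W^{2} = \sum_{i=1}^{n}\left[\hat{F}(y_i) - \frac{2i-1}{2n}\right]^{2} + \frac{1}{12n}, \tag{16} $$
and the AD test statistic is
$$ A^{2} = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left[\ln\hat{F}(y_i) + \ln\left(1 - \hat{F}(y_{n+1-i})\right)\right]. \tag{17} $$
Note that the estimate F̂ of F is obtained by substituting the estimated parameter ŵ for w in (6).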
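For illustration, the three EDF statistics can be computed from the fitted CDF values alone. The sketch below uses the standard computing formulae for D, W², and A² as given in EDF-test references such as [11]; it assumes the input values are the fitted CDF evaluated at the data, each strictly between 0 and 1. The function name is hypothetical.

```python
from math import log


def edf_statistics(u):
    """Return (D, W2, A2) from fitted CDF values u_i = F(y_i).

    The values are sorted internally; each must lie strictly in (0, 1)
    so that the logarithms in the AD statistic are defined.
    """
    u = sorted(u)
    n = len(u)
    # Kolmogorov-Smirnov: largest gap between the EDF and the fitted CDF
    d_plus = max((i + 1) / n - ui for i, ui in enumerate(u))
    d_minus = max(ui - i / n for i, ui in enumerate(u))
    ks = max(d_plus, d_minus)
    # Cramer-von Mises
    cvm = sum((ui - (2 * i + 1) / (2 * n)) ** 2 for i, ui in enumerate(u))
    cvm += 1 / (12 * n)
    # Anderson-Darling (weights the tails more heavily)
    s = sum((2 * i + 1) * (log(u[i]) + log(1 - u[n - 1 - i])) for i in range(n))
    ad = -n - s / n
    return ks, cvm, ad
```

Because the AD statistic involves ln F and ln(1 − F), it reacts more strongly to discrepancies in the tails than the other two statistics, which is the usual explanation for its higher power in comparisons of this kind.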
5.2. Critical values
When all three parameters are unknown, the problem reduces to testing whether the y values have distribution (5). Because the distribution of y depends only on w, we can obtain the critical values of the three GoF tests for a given w in practice [11]. Critical values for the GoF test statistics are obtained in the same way as those for the PPC test. For w = 0.057, the algorithm to obtain the 5% critical values for the GoF test statistics is as follows:
- (i) Set w = 0.057, generate n random numbers from (7), and order the sample values to obtain y1 ≤ y2 ≤ ⋯ ≤ yn.
- (ii) Obtain the estimate ŵ of w by maximum likelihood.
- (iii) Calculate the GoF test statistics D, W², and A² using (15)–(17), respectively.
- (iv) Repeat steps (i)–(iii) 20 000 times. Then, take the upper 5% sample quantiles (95th percentiles) of D, W², and A² as the 5% critical values.
Table 3 presents the critical values of the distribution of D for selected sample sizes at five significance levels when w equals 0.057. For example, at a 5% significance level, the critical value of D for n = 10 is 0.408884; this means that in 5% of random samples of size 10, the maximum absolute deviation between the sample and population cumulative distributions will be at least 0.408884. Tables 4 and 5 present the respective critical values of W² and A².
For the GoF tests, the null hypothesis is that the sample comes from a PFLC distribution, and the alternative hypothesis is that it does not. We can determine whether the PFLC distribution can be used to fit a dataset by GoF tests at the 5% significance level as follows:
- (i) Order the sample values to obtain x1 ≤ x2 ≤ ⋯ ≤ xn.
- (ii) Obtain the maximum likelihood estimates (α̂, θ̂, σ̂) of (α, θ, σ), and calculate ŵ from (2).
- (iii) Calculate yi = ŵ(xi/θ̂)^α̂.
- (iv) Calculate the GoF test statistics D, W², and A² using (15)–(17), respectively.
- (v) Reject H0 (the sample is from a PFLC distribution) at the 5% significance level if a statistic exceeds its 5% critical value.
5.3. Power comparison
The power of a test is the probability that it will reject the null hypothesis if the alternative hypothesis is true (hence, power is the complement of the probability of a Type II error). Therefore, the power depends on the alternative distribution. A Monte Carlo study was performed to evaluate the power of the three GoF tests using 100 000 samples of different sizes from four alternative distributions. For these tests, the null hypothesis was that the generated observations were drawn from the PFLC(4,5,0.42) distribution. Note that the mixing weight w is 0.057 for PFLC(4,5,0.42), and the 5% critical values are from Tables 3–5. Simulations were carried out in MATLAB, and all the code used can be found in the Supporting information.
Four alternative distributions were considered: Gamma(3.5, 2.7), χ²(10), LN(2.3, 0.5), and Weibull(10, 2), whose PDFs are as follows:
The plots of the five distributions are shown in Fig 3, from which it can be seen that PFLC(4,5,0.42) has the largest mode. To the left of the mode of PFLC(4,5,0.42), the distributions differ slightly, while to the right, their PDFs are difficult to distinguish. Thus, these five distributions exhibit similar overall shapes. We note that if the shapes of these five distributions differ significantly, then the results will be highly unreliable.
Table 6 summarizes the simulated powers of the PPC test for the four selected distributions at the 5% significance level; for each alternative distribution, the power increases with the sample size.
Table 7 summarizes the simulated power for four selected distributions at the 5% significance level. From Table 7, the following can be seen:
- (i) The powers of the tests increase with the sample sizes.
- (ii) The AD test outperforms the KS and CvM tests across different sizes and alternative distributions.
- (iii) The KS test is the weakest test, and it requires a much larger sample to achieve comparable power to the other two tests.
- (iv) The powers of the tests are the smallest when the alternative distribution is LN(2.3, 0.5), which is the most similar to PFLC(4,5,0.42), as can be seen from Fig 3.
Hence, we recommend the AD test, followed by the CvM and KS tests.
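The Monte Carlo power estimation used in this section can be sketched as follows. To keep the example short, it makes simplifying assumptions that differ from the study: only the KS statistic is used, the null PFLC(4,5,0.42) is treated as fully specified (no parameters are estimated, so the critical value is simulated from uniform variates rather than taken from Table 3), the mixing weight is assumed to be w = φ(ασ)/[ασΦ(ασ) + φ(ασ)], and LN(2.3, 0.5) is read as a lognormal with log-mean 2.3 and log-standard deviation 0.5. All function names are hypothetical.

```python
import random
from math import log
from statistics import NormalDist

_N = NormalDist()
ALPHA, THETA, SIGMA = 4.0, 5.0, 0.42  # null model PFLC(4,5,0.42)
TAU = ALPHA * SIGMA
W = _N.pdf(TAU) / (TAU * _N.cdf(TAU) + _N.pdf(TAU))  # assumed form of (2)


def pflc_cdf(x):
    """CDF of the null model (assumed piecewise form)."""
    if x <= 0.0:
        return 0.0
    if x <= THETA:
        return W * (x / THETA) ** ALPHA
    z = (log(x / THETA) - ALPHA * SIGMA ** 2) / SIGMA
    return 1.0 - (1.0 - W) * (1.0 - _N.cdf(z)) / _N.cdf(TAU)


def ks_statistic(u):
    """KS statistic D from CDF values u_i."""
    u = sorted(u)
    n = len(u)
    return max(max((i + 1) / n - ui, ui - i / n) for i, ui in enumerate(u))


def null_critical_value(n, reps, level, rng):
    """Upper-tail quantile of D under H0; F(X) is Uniform(0,1) when H0 holds."""
    stats = sorted(
        ks_statistic([rng.random() for _ in range(n)]) for _ in range(reps)
    )
    return stats[int((1.0 - level) * reps)]


def estimate_power(sampler, n, crit, reps, rng):
    """Fraction of samples from the alternative for which D exceeds crit."""
    rejections = 0
    for _ in range(reps):
        u = [pflc_cdf(sampler(rng)) for _ in range(n)]
        if ks_statistic(u) > crit:
            rejections += 1
    return rejections / reps
```

For example, with `rng = random.Random(1234)` and `crit = null_critical_value(50, 4000, 0.05, rng)`, the call `estimate_power(lambda r: r.lognormvariate(2.3, 0.5), 50, crit, 2000, rng)` gives a rough power estimate against the lognormal alternative under these simplified assumptions.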
6. Real data analysis
We consider an example to illustrate the methods discussed in Sections 3–5. The real dataset (BREAST) represents the survival times of 121 patients with breast cancer, as obtained from a large hospital from 1929 to 1938 [6], as follows:
- 0.3, 0.3, 4.0, 5.0, 5.6, 6.2, 6.3, 6.6, 6.8, 7.4, 7.5, 8.4, 8.4, 10.3, 11.0, 11.8, 12.2, 12.3, 13.5, 14.4, 14.4, 14.8, 15.5, 15.7, 16.2, 16.3, 16.5, 16.8, 17.2, 17.3, 17.5, 17.9, 19.8, 20.4, 20.9, 21.0, 21.0, 21.1, 23.0, 23.4, 23.6, 24.0, 24.0, 27.9, 28.2, 29.1, 30.0, 31.0, 31.0, 32.0, 35.0, 35.0, 37.0, 37.0, 37.0, 38.0, 38.0, 38.0, 39.0, 39.0, 40.0, 40.0, 40.0, 41.0, 41.0, 41.0, 42.0, 43.0, 43.0, 43.0, 44.0, 45.0, 45.0, 46.0, 46.0, 47.0, 48.0, 49.0, 51.0, 51.0, 51.0, 52.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 60.0, 60.0, 61.0, 62.0, 65.0, 65.0, 67.0, 67.0, 68.0, 69.0, 78.0, 80.0, 83.0, 88.0, 89.0, 90.0, 93.0, 96.0, 103.0, 105.0, 109.0, 109.0, 111.0, 115.0, 117.0, 125.0, 126.0, 127.0, 129.0, 129.0, 139.0, 154.0.
Table 8 shows the summary statistics for the BREAST data. The mean is 46.3289, and the median is 40.00. Notice that the mean is greater than the median, indicating that the distribution of data is likely skewed to the right.
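The sample size, mean, and median quoted above can be checked directly from the listed values. The short Python sketch below does this with the standard library only; the variable name BREAST is just a label for the data as transcribed in this section.

```python
from statistics import mean, median

# Survival times of the 121 breast cancer patients, as listed above
BREAST = [
    0.3, 0.3, 4.0, 5.0, 5.6, 6.2, 6.3, 6.6, 6.8, 7.4, 7.5, 8.4, 8.4,
    10.3, 11.0, 11.8, 12.2, 12.3, 13.5, 14.4, 14.4, 14.8, 15.5, 15.7,
    16.2, 16.3, 16.5, 16.8, 17.2, 17.3, 17.5, 17.9, 19.8, 20.4, 20.9,
    21.0, 21.0, 21.1, 23.0, 23.4, 23.6, 24.0, 24.0, 27.9, 28.2, 29.1,
    30.0, 31.0, 31.0, 32.0, 35.0, 35.0, 37.0, 37.0, 37.0, 38.0, 38.0,
    38.0, 39.0, 39.0, 40.0, 40.0, 40.0, 41.0, 41.0, 41.0, 42.0, 43.0,
    43.0, 43.0, 44.0, 45.0, 45.0, 46.0, 46.0, 47.0, 48.0, 49.0, 51.0,
    51.0, 51.0, 52.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 60.0,
    60.0, 61.0, 62.0, 65.0, 65.0, 67.0, 67.0, 68.0, 69.0, 78.0, 80.0,
    83.0, 88.0, 89.0, 90.0, 93.0, 96.0, 103.0, 105.0, 109.0, 109.0,
    111.0, 115.0, 117.0, 125.0, 126.0, 127.0, 129.0, 129.0, 139.0, 154.0,
]

n = len(BREAST)
sample_mean = mean(BREAST)
sample_median = median(BREAST)

# A mean above the median is a simple indication of right skewness
right_skew_hint = sample_mean > sample_median

print(n, round(sample_mean, 4), sample_median, right_skew_hint)
```

The comparison of mean and median is only a quick indication of skewness; the histogram in Fig 4 gives the fuller picture.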
Fig 4 shows a histogram of these data, whose left side seems “chopped off” compared to the right side, which we describe as being skewed to the right.
In this study, the PFLC maximum likelihood estimates are , and
. Fig 5 presents the probability plot of the BREAST data, whose points approximate a straight line, which means that the PFLC distribution can provide a good fit to these data.
Fig 6 compares the fitted and empirical PDFs of the BREAST data, and Fig 7 compares the fitted and empirical CDFs of the BREAST data. It can be seen that the PFLC CDF exhibits a good match to the empirical CDF, but there is a slight discrepancy between the empirical and PFLC PDFs.
Table 9 reports the test statistics and p-values (in parentheses) for the PPC test and the three GoF tests. It indicates that the PFLC distribution provides a reasonable fit to the BREAST data.
7. Conclusions
Goodness-of-fit testing is a key procedure for selecting the statistical distribution that best fits observed data. We performed tests of fit for the PFLC distribution, including the probability plot, probability plot correlation coefficient, KS test, CvM test, and AD test. Among these methods, the probability plot is a graphical method, while the others are statistical methods. We described the probability plot and considered the procedures, algorithms, and critical values for the other methods. We found that the AD test outperformed the KS and CvM tests based on power comparisons. Moreover, the PFLC distribution was used, for the first time, to model the survival times of breast cancer patients, and the tests developed in this paper indicated that it fits these data well. The work in this paper can be extended in several ways, such as to: (i) calculate lifetime statistics, such as the hazard rate and the mean residual life, based on the PFLC distribution; (ii) develop other GoF test methods, such as chi-squared tests and likelihood ratio tests; (iii) compare the performance of the PFLC distribution with other competitive alternatives; and (iv) investigate its applications in other disciplines.
References
- 1. Kleiber C, Kotz S. Statistical Size Distributions in Economics and Actuarial Sciences. New York: John Wiley; 2003.
- 2. Wang C. Parameter estimation for power function lognormal composite distribution. Communications in Statistics—Theory and Methods. 2023; 52(9):2966–2982.
- 3. Wang C. Inequality measures based on log power function lognormal composite distribution. Applied Economics Letters. 2022; 1–6.
- 4. Sarhan AM, Ahmad AA, Alasbahi IA. Exponentiated generalized linear exponential distribution. Applied Mathematical Modelling. 2013; 37(5):2838–49.
- 5. Nofal ZM, Afify AZ, Yousof HM, Cordeiro GM. The generalized transmuted-G family of distributions. Communications in Statistics—Theory and Methods. 2017; 46(8):4119–4136.
- 6. Awodutire PO, Balogun OS, Olapade AK, Nduka EC. The modified beta transmuted family of distributions with applications using the exponential distribution. PLoS ONE. 2021; 16(11): e0258512. pmid:34793462
- 7. Fayomi A, Khan S, Tahir MH, Algarni A, Jamal F, Abu-Shanab R. A new extended gumbel distribution: Properties and application. PLoS ONE. 2022; 17(5): e0267142. pmid:35622822
- 8. Ozkan E, Golbasi Simsek G. Generalized Marshall-Olkin exponentiated exponential distribution: Properties and applications. PLoS ONE. 2023; 18(1): e0280349. pmid:36652462
- 9. Kimball BF. On the choice of plotting positions on probability paper. Journal of the American Statistical Association. 1960; 55(291):546–560.
- 10. Filliben JJ. The Probability Plot Correlation Coefficient Test for Normality. Technometrics. 1975; 17(1):111–117.
- 11. D’Agostino R, Stephens MA. Goodness-of-Fit Techniques. New York: Taylor & Francis; 1986.