WLreg: A new re-parametrization of the Weighted Lindley distribution and its regression model

  • Emrah Altun,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    emrahaltun@bartin.edu.tr

    Affiliation Department of Mathematics, Bartin University, Bartin, Turkey

  • Christophe Chesneau,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Mathematics, University of Caen-Normandie, Caen, France

  • Hana N. Alqifari

    Roles Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Statistics and Operations Research, College of Science, Qassim University, Buraydah, Saudi Arabia

Abstract

A novel re-parametrization of the weighted Lindley distribution is introduced to develop a regression model suitable for skewed dependent variables defined on the positive real line. This new model is called the WL2 regression model. It is shown to outperform existing models such as the gamma, extended gamma, and Maxwell-Boltzmann-exponential regression models. Parameters are estimated by maximum likelihood, and the performance of the resulting estimates is assessed through a simulation study. An application to a house price data set illustrates the usefulness of the WL2 regression model. In addition, we provide the WLreg software, accessible via https://bartinuni.shinyapps.io/WLreg, to facilitate the application of the new regression model for practitioners in the field.

1 Introduction

Literature review

The gamma regression model is often used in the analysis of right-skewed response variables, as discussed by [12]. Under the assumption that the response variable follows a gamma distribution, [9] developed the gamma regression model, characterized by the simultaneous modeling of the mean and shape parameters. Furthermore, [10] developed a diagnostic tool for the analysis of residuals in the gamma regression model. Several alternatives to the gamma regression model have also been proposed: the weighted exponential (WE) regression by [3], the new extended gamma (NEG) regression by [6], the Maxwell-Boltzmann exponential (MBE) regression by [5], and the Lomax regression by [4].

These types of regression models have important applications in various scientific fields. [18] employed the gamma and quantile regression models to evaluate the factors that affect the medical costs of gastric cancer patients. [13] used the gamma regression model to find the most important factors affecting patient satisfaction. [20] investigated the relationship between environmental factors and the distribution of pelagic fish using a gamma error distribution. [23] estimated the conflict-crash relationship using the re-parametrized Lomax distribution proposed by [4]. For further applications of skewed regression models, see [1].

Contribution

[14] introduced the weighted Lindley (WL) distribution as a generalization of the Lindley distribution proposed by [15]. Bayesian parameter estimation for the WL distribution was discussed by [2], and [17] compared several estimation methods for its parameters. [7] obtained a new discrete distribution based on the WL distribution, and [22] introduced the inverse WL distribution. [16] proposed an alternative re-parametrization of the WL distribution in which the parameters are orthogonal. [21] introduced the generalized WL distribution using a mixture of two generalized gamma random variables. [19] introduced the WL regression model using the re-parametrization of [17] and discussed its properties, including a residual analysis and parameter estimation procedures.

The literature review shows that the WL distribution has been widely investigated. In this study, we use a different re-parametrization of the WL distribution to propose a new WL regression model. To avoid confusion between the two regression models, the model proposed in this study is referred to as the WL2 regression model, and the model proposed by [19] is referred to as the WL1 regression model. The main motivation of the study is to propose a model that is more flexible and efficient than the existing models for skewed dependent variables defined on the positive real line. The contributions can be summarized as follows:

  1. The WL2 regression model is defined using a novel transformation of the random variable that follows the WL distribution.
  2. The parameters of the WL2 regression model are estimated by the maximum likelihood (ML) method, which yields the ML estimates (MLEs). A simulation study is also carried out to assess the performance of the MLEs of the model parameters.
  3. To assess the adequacy of the fitted model, a residual analysis is performed using the Cox-Snell residuals.
  4. A cloud-based software called WLreg has been developed in the R Shiny environment to allow efficient and widespread use of the WL2 regression model. The WLreg software allows users to easily obtain the results of this model using their own data.

Organization

This paper is divided into several sections. Sect 2 discusses the re-parametrized WL2 distribution and the associated regression model. It also provides an analysis of the residuals and parameter estimation for the WL2 regression model, complemented by an extensive simulation study. Sect 3 is dedicated to the presentation of empirical results derived from the research. Sect 4 provides detailed information on the use of the WLreg software. The conclusion of the study is summarized in Sect 5.

2 WL distribution and regression model

2.1 Presentation

We begin with the mathematical background of the WL distribution. First, it is defined by its probability density function (pdf), which is given by

(1)

where , with and being the scale and shape parameters, respectively, and is the standard gamma function. The WL distribution can be described as a generalization of the famous Lindley distribution. In particular, it is reduced to the Lindley distribution for . The WL distribution is also a mixture distribution of two independent gamma distributions with the mixing proportion . The mean and variance associated with the WL distribution are given by

(2)

and

(3)
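The density, mean and variance above can be checked numerically. The following is a minimal sketch, assuming the standard two-parameter WL form of [14], written here with a rate-type parameter `theta` and a shape parameter `alpha`; these symbol names, the helper names and the parameter values are assumptions made for illustration only.

```python
import math

def wl_pdf(y, theta, alpha):
    """Weighted Lindley density at y (zero for y <= 0), assumed form of [14]:
    theta^(alpha+1) / ((theta+alpha) * Gamma(alpha)) * y^(alpha-1) * (1+y) * exp(-theta*y)."""
    if y <= 0.0:
        return 0.0
    log_f = ((alpha + 1.0) * math.log(theta) - math.log(theta + alpha)
             - math.lgamma(alpha) + (alpha - 1.0) * math.log(y)
             + math.log1p(y) - theta * y)
    return math.exp(log_f)

def wl_mean(theta, alpha):
    """Mean: alpha * (theta + alpha + 1) / (theta * (theta + alpha))."""
    return alpha * (theta + alpha + 1.0) / (theta * (theta + alpha))

def wl_var(theta, alpha):
    """Variance: ((alpha + 1) * (theta + alpha)^2 - theta^2) / (theta^2 * (theta + alpha)^2)."""
    s = theta + alpha
    return ((alpha + 1.0) * s * s - theta * theta) / (theta * theta * s * s)

def midpoint_integral(func, lower, upper, n=20000):
    """Composite midpoint rule, accurate enough for a sanity check."""
    h = (upper - lower) / n
    return h * sum(func(lower + (i + 0.5) * h) for i in range(n))

theta, alpha = 2.0, 1.5  # illustrative values only
total_mass = midpoint_integral(lambda y: wl_pdf(y, theta, alpha), 0.0, 40.0)
numeric_mean = midpoint_integral(lambda y: y * wl_pdf(y, theta, alpha), 0.0, 40.0)
print(total_mass, numeric_mean, wl_mean(theta, alpha), wl_var(theta, alpha))
```

With `alpha = 1` the density reduces to the Lindley density `theta^2 / (theta + 1) * (1 + y) * exp(-theta * y)`, which gives a quick consistency check of the implementation.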

We are now in a position to introduce the mean-parametrized WL distribution. From Eq 2, we can express as a function of and , as follows:

(4)

The notation is intended to indicate the dependence of and in the expression, which will be a crucial point in the proposed regression model. Based on this re-parametrization, the pdf of the WL distribution becomes

(5)

To indicate the re-parametrization done, the distribution associated with this pdf is called the WL2 distribution. We now consider a random variable Y with this pdf, which we refer to using the following stochastic notation: . In particular, its mean is .

The plots of the WL2 distribution are shown in Figs 1 and 2. The analysis of these plots shows that the distribution is significantly right-skewed, and this skewness becomes more pronounced as the parameter is reduced, assuming that the values of are held constant.
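A mean re-parametrization of this kind amounts to solving the mean equation for one of the two original parameters. The sketch below assumes the standard WL form of [14] with a rate-type parameter `theta` and a shape parameter `alpha`, and further assumes that the shape is the parameter eliminated in favour of the mean `mu`. The symbol names and this choice are assumptions for illustration; eliminating `theta` instead leads to an analogous quadratic.

```python
import math

def wl_mean(theta, alpha):
    """Mean of the WL distribution in the assumed (theta, alpha) notation."""
    return alpha * (theta + alpha + 1.0) / (theta * (theta + alpha))

def alpha_from_mean(mu, theta):
    """Shape implied by a target mean mu and a given theta.

    Setting wl_mean(theta, alpha) = mu and rearranging gives
        alpha^2 + (theta + 1 - mu * theta) * alpha - mu * theta^2 = 0.
    The product of the roots is negative, so exactly one root is positive.
    """
    b = theta + 1.0 - mu * theta
    return 0.5 * (-b + math.sqrt(b * b + 4.0 * mu * theta * theta))

# Round trip on illustrative values: the recovered shape reproduces the mean.
for mu in (0.5, 2.5, 10.0):
    for theta in (0.3, 1.2, 4.0):
        a = alpha_from_mean(mu, theta)
        print(mu, theta, a, wl_mean(theta, a))
```

Because the positive root always exists for `mu > 0` and `theta > 0`, any positive mean can be attached to a linear predictor through a suitable link function without constraining the remaining parameter.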

2.2 WL2 regression model

On the mathematical basis of the WL2 distribution, we now discuss the construction of a new WL2 regression model. To do this, we assume that we have a random sample, , from , where, for any , . We recall that is the mean, assuming that the parameters and are unknown. The introduction of the WL2 model is achieved by using the corresponding link function, which is given by

(6)

where is the vector of the regression parameters, k is the number of the independent variables, is the vector representing the values of the covariates, and is a link function. The selection of the correct link function is therefore determined by the characteristics of the random variable Y.

Using the pdf of the WL2 distribution, given in Eq 5, the log-likelihood function of the WL2 regression model is obtained as

(7)

where and . The parameter vector, , is estimated using the ML approach. The resulting vector is denoted by . For this, we use the Nelder-Mead algorithm defined in the R software. The asymptotic standard errors are obtained using the observed information matrix.
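The log-likelihood of a mean-parametrized WL regression can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes the standard WL density of [14] with a rate-type parameter `theta` and a shape `alpha`, assumes that the shape is determined by the mean, and assumes a log link. The function names and the toy data are hypothetical.

```python
import math

def alpha_from_mean(mu, theta):
    """Positive root of alpha^2 + (theta + 1 - mu*theta)*alpha - mu*theta^2 = 0."""
    b = theta + 1.0 - mu * theta
    return 0.5 * (-b + math.sqrt(b * b + 4.0 * mu * theta * theta))

def wl2_logpdf(y, mu, theta):
    """Log-density at y > 0 for mean mu and parameter theta (assumed notation)."""
    alpha = alpha_from_mean(mu, theta)
    return ((alpha + 1.0) * math.log(theta) - math.log(theta + alpha)
            - math.lgamma(alpha) + (alpha - 1.0) * math.log(y)
            + math.log1p(y) - theta * y)

def wl2_loglik(beta, theta, X, y):
    """Sum of log-densities with mu_i = exp(x_i . beta) (log link, an assumption)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, x_i))  # linear predictor
        total += wl2_logpdf(y_i, math.exp(eta), theta)
    return total

# Hypothetical toy data: an intercept column plus two covariates.
X = [(1.0, 0.2, 0.7), (1.0, 0.9, 0.1), (1.0, 0.5, 0.5), (1.0, 0.3, 0.8)]
y = [1.4, 2.3, 1.9, 1.1]
ll = wl2_loglik((0.3, 0.5, -0.2), 1.5, X, y)
print(ll)
```

The negative of `wl2_loglik` is the objective that a derivative-free optimizer such as Nelder-Mead (for example through R's `optim`) would minimize over the regression coefficients and `theta`, with `theta` typically optimized on the log scale to keep it positive.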

The adequacy of the fitted model is checked using the Cox-Snell (CS) residuals [11]. The CS residual of the ith observation is defined as

(8)

Here, the cumulative distribution function (cdf) of the WL2 distribution is evaluated at yi, with the parameters replaced by their estimates. If the fitted model accurately represents the data, the CS residuals behave like a sample from the standard exponential distribution.
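Computing a Cox-Snell residual requires the cdf of the fitted distribution. The sketch below assumes the standard WL form of [14] with a rate-type parameter `theta` and a shape `alpha` (assumed symbol names) and uses the gamma-mixture representation mentioned in Sect 2.1, so that the cdf is a weighted sum of two regularized incomplete gamma functions. The function names are hypothetical, and the series is not tuned for the far right tail.

```python
import math

def reg_lower_gamma(a, x, tol=1e-14, max_terms=10000):
    """Regularized lower incomplete gamma P(a, x) from its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    k = a
    for _ in range(max_terms):
        k += 1.0
        term *= x / k
        total += term
        if term < tol * total:
            break
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))

def wl_cdf(y, theta, alpha):
    """WL cdf as a mixture of Gamma(alpha, theta) and Gamma(alpha + 1, theta)
    with weights theta / (theta + alpha) and alpha / (theta + alpha)."""
    w = theta / (theta + alpha)
    return (w * reg_lower_gamma(alpha, theta * y)
            + (1.0 - w) * reg_lower_gamma(alpha + 1.0, theta * y))

def cox_snell_residual(y, theta, alpha):
    """e = -log(1 - F(y)), evaluated at the fitted parameters of one observation."""
    return -math.log(1.0 - wl_cdf(y, theta, alpha))

print(cox_snell_residual(1.0, 2.0, 1.0))
```

In a regression setting, each observation has its own fitted parameter pair, so the function would be called once per observation with the values implied by the fitted mean.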

2.3 Simulation

This section looks at the effectiveness of the ML approach in estimating the parameters of the WL2 regression model. The simulation is configured with 1,000 replications. The analysis includes four different sample sizes: 100, 300, 500, and 1,000. The mean vector is defined as , where x1 and x2 are drawn from a uniform distribution . For the purposes of this analysis, the regression and scale parameters are assigned values of , and . The dependent variable yi is generated based on and through the inverse transform method.
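A data-generating step of this kind can be sketched in a few lines. The paper generates the responses by the inverse transform method; the sketch below swaps in the gamma-mixture representation mentioned in Sect 2.1, which needs only the standard library. It assumes the standard WL form of [14] with a rate-type parameter `theta` and a shape `alpha`, a shape determined by the mean, a log link, and covariates drawn from a uniform distribution on (0, 1). All symbol names, parameter values and the seed are assumptions for illustration.

```python
import math
import random

def alpha_from_mean(mu, theta):
    """Positive root of alpha^2 + (theta + 1 - mu*theta)*alpha - mu*theta^2 = 0."""
    b = theta + 1.0 - mu * theta
    return 0.5 * (-b + math.sqrt(b * b + 4.0 * mu * theta * theta))

def rwl(theta, alpha, rng):
    """One WL draw: Gamma(alpha) with probability theta / (theta + alpha),
    otherwise Gamma(alpha + 1); both have rate theta, i.e. scale 1 / theta."""
    if rng.random() < theta / (theta + alpha):
        return rng.gammavariate(alpha, 1.0 / theta)
    return rng.gammavariate(alpha + 1.0, 1.0 / theta)

def simulate_wl2(n, beta, theta, rng):
    """Draw (x1, x2, y) triples with mean exp(beta0 + beta1*x1 + beta2*x2)."""
    data = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        mu = math.exp(beta[0] + beta[1] * x1 + beta[2] * x2)
        data.append((x1, x2, rwl(theta, alpha_from_mean(mu, theta), rng)))
    return data

rng = random.Random(2024)
sample = [rwl(2.0, 1.5, rng) for _ in range(20000)]
sample_mean = sum(sample) / len(sample)
theoretical_mean = 1.5 * (2.0 + 1.5 + 1.0) / (2.0 * (2.0 + 1.5))
print(sample_mean, theoretical_mean)
```

The mixture sampler avoids numerically inverting the cdf, at the cost of one extra uniform draw per observation.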

The outcomes of the simulation are summarized in Table 1. They are evaluated on the basis of estimated biases, average estimates (AEs), and mean squared errors (MSEs). It is expected that larger sample sizes will give biases and MSEs that are close to zero, while AEs should be close to the true parameter values. An examination of the results in Table 1 shows that the biases and MSEs are close to zero. In addition, the AEs show stability and remain consistently close to the true parameter values across all sample sizes. These results support the ML approach as an appropriate way to estimate the parameters of the WL2 model.
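The three summary measures can be computed from the replicated estimates of a single parameter as in the sketch below. The function name and the numbers are hypothetical and serve only to show the arithmetic.

```python
def summarize_estimates(estimates, true_value):
    """Return (average estimate, bias, mean squared error) over the replications."""
    n = len(estimates)
    average = sum(estimates) / n
    bias = average - true_value
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return average, bias, mse

# Hypothetical estimates of a parameter whose true value is 1.0.
ae, bias, mse = summarize_estimates([0.9, 1.1, 1.0, 1.2], 1.0)
print(ae, bias, mse)
```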

3 Application

The data set contains 414 observations on real estate valuation, collected from New Taipei City, Taiwan. The aim is to predict the house price using the house age (xi1) and the number of convenience stores (xi2). The WL2 regression model is compared with the gamma, NEG, MBE and WL1 regression models introduced in Sect 1.

The gamma regression model is fitted with the gammareg package developed by [8]. The resulting parameter estimates are then used as the initial parameter vector for the estimation of the WL1, WL2, NEG and MBE regression models. The model in Eq 9 is fitted to this data set using the gamma, NEG, MBE, WL1 and WL2 regression models:

(9)

The MLEs of the parameters and their standard errors are given in Table 2. As all p-values are less than 0.05, the regression parameters are statistically significant. The results of the analysis indicate that an increase in the age of the house is associated with a decrease in the house price, while a greater number of convenience stores is associated with an increase in the house price.

Table 2. The estimated coefficients and their respective standard errors.

https://doi.org/10.1371/journal.pone.0324005.t002

Table 3 shows the Akaike information criterion (AIC) and Bayesian information criterion (BIC) values of the regression models. The model with the lowest AIC and BIC values is taken as the best model. As can be seen in this table, the model with the lowest AIC and BIC values is the WL2 regression model. Therefore, it was selected as the best model for the data used.
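Both criteria follow directly from the maximized log-likelihood. In the sketch below, the log-likelihood values and parameter counts are hypothetical placeholders and are not the values behind Table 3; only the sample size of 414 is taken from the text.

```python
import math

def aic(loglik, k):
    """Akaike information criterion: 2k - 2 * loglik."""
    return 2.0 * k - 2.0 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion: k * log(n) - 2 * loglik."""
    return k * math.log(n) - 2.0 * loglik

# Hypothetical maximized log-likelihoods for two models with four parameters each.
n = 414
candidates = {"model_A": (-1510.0, 4), "model_B": (-1525.0, 4)}
for name, (loglik, k) in candidates.items():
    print(name, aic(loglik, k), bic(loglik, k, n))
```

Because BIC penalizes each parameter by log(n) instead of 2, it favours smaller models than AIC whenever n exceeds about 7.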

To check the adequacy of the fitted regression models, we compute the CS residuals. The probability-probability (PP) plots of the CS residuals are shown in Figs 3, 4, 5, 6, and 7. The Kolmogorov-Smirnov (KS) test is also applied to the CS residuals to check whether they follow the exponential distribution. The results of the KS test are given in Table 4. These results show that the residuals of the NEG and MBE regression models do not follow the exponential distribution, so these models do not fit the data satisfactorily. In contrast, the residuals of the gamma, WL1 and WL2 regression models satisfy the exponential assumption. In the PP plots, the points for the WL2 model lie closer to the diagonal line than those of the other regression models.
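The KS check of the CS residuals against the standard exponential distribution can be sketched as follows. The p-value uses the asymptotic Kolmogorov series, which is an approximation chosen here for simplicity and may differ from the routine used for Table 4; the function names and the example residuals are hypothetical.

```python
import math

def ks_statistic_exponential(residuals):
    """Largest gap between the empirical cdf and the Exp(1) cdf 1 - exp(-r)."""
    ordered = sorted(residuals)
    n = len(ordered)
    d = 0.0
    for i, r in enumerate(ordered, start=1):
        f = 1.0 - math.exp(-r)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def ks_pvalue_asymptotic(d, n, terms=100):
    """Asymptotic p-value 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 * k^2 * n * d^2)."""
    lam = math.sqrt(n) * d
    if lam < 0.05:  # the p-value is numerically 1 here and the series converges slowly
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2.0 * s))

# Hypothetical residuals placed at the Exp(1) quartiles.
example = [-math.log(1.0 - u) for u in (0.25, 0.5, 0.75)]
d_example = ks_statistic_exponential(example)
print(d_example, ks_pvalue_asymptotic(d_example, len(example)))
```

The pairs `(i / n, 1 - exp(-r_(i)))` computed inside the loop are also the coordinates that a PP plot displays against the diagonal.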

The comparison of the models based on the AIC and BIC values may not be sufficient to establish the superiority of the WL2 model over the others. Therefore, we use the Vuong test for non-nested models, proposed by [24], to compare the WL2 model with the other models. The test statistic is

(10)

where , , , f1 and f2 are the pdfs of the two models being compared, and sm is the standard deviation of . The test statistic in Eq 10 is calculated for all competing models and the results are summarized in Table 5. The null hypothesis is that there is no difference between the models. The alternative hypothesis is that Model I is better than Model II. As reported in Table 5, all p-values are less than 0.05 and the null hypotheses are rejected in favour of the WL2 model in all cases. This evidence supports the conclusion that the WL2 model outperforms the other four competing models.
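A Vuong-type comparison only needs the pointwise log-densities of the two fitted models. The sketch below assumes the usual form of the statistic, the square root of n times the mean log-density difference divided by its standard deviation (computed with divisor n), and a one-sided standard normal p-value; the function names and inputs are hypothetical.

```python
import math

def vuong_statistic(logpdf_model1, logpdf_model2):
    """sqrt(n) * mean(m) / sd(m), with m_i = log f1(y_i) - log f2(y_i).
    Assumes the differences are not all identical (sd > 0)."""
    m = [a - b for a, b in zip(logpdf_model1, logpdf_model2)]
    n = len(m)
    m_bar = sum(m) / n
    s_m = math.sqrt(sum((v - m_bar) ** 2 for v in m) / n)
    return math.sqrt(n) * m_bar / s_m

def upper_tail_normal(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Hypothetical pointwise log-densities of two fitted models.
v = vuong_statistic([0.1, 0.2, 0.3, 0.4], [0.0, 0.0, 0.0, 0.0])
print(v, upper_tail_normal(v))
```

A large positive statistic favours the first model and a large negative one favours the second, so the order of the arguments determines which alternative is being tested.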

In addition, the F test is used to investigate whether the WL2 model is statistically significant. In the framework of the generalized linear model, the F statistic is calculated by

(11)

where M1 is the deviance of the null model containing only the intercept term, M2 is the deviance of the fitted model, and p1 and p2 are the numbers of estimated parameters of the two models, respectively. The null hypothesis is that the model is not significant. The F statistic for the WL2 model is 15.009 and the corresponding p-value is 5.126 × 10^−7, which is less than 0.05. Therefore, the null hypothesis is rejected and the WL2 model is statistically significant.
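An F statistic of this type can be sketched as follows. The expression used is the usual analysis-of-deviance ratio, which is assumed rather than confirmed to coincide with Eq 11, and the denominator degrees of freedom are assumed to be n minus the number of parameters of the fitted model. The tail probability shown is the closed form that holds when the numerator has exactly two degrees of freedom, as when two covariates are added to an intercept-only model. All inputs are hypothetical.

```python
def f_statistic(dev_null, dev_fit, p_null, p_fit, n):
    """((M1 - M2) / (p2 - p1)) / (M2 / (n - p2)) with M1 = dev_null, M2 = dev_fit."""
    numerator = (dev_null - dev_fit) / (p_fit - p_null)
    denominator = dev_fit / (n - p_fit)
    return numerator / denominator

def f_upper_tail_df1_2(f, d2):
    """P(F > f) for an F(2, d2) variable: (1 + 2 * f / d2) ** (-d2 / 2)."""
    return (1.0 + 2.0 * f / d2) ** (-d2 / 2.0)

# Hypothetical deviances: null model with 1 parameter, fitted model with 3.
f_value = f_statistic(120.0, 100.0, 1, 3, 103)
print(f_value, f_upper_tail_df1_2(f_value, 103 - 3))
```

For other numerator degrees of freedom, the tail probability requires the regularized incomplete beta function and is best taken from a statistics library.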

4 WLreg: Shiny web-tool

To make the WL2 regression model easily accessible to researchers, WLreg, a cloud-based software, has been developed. The software is available at https://bartinuni.shinyapps.io/WLreg. WLreg consists of four sections: upload data, model summary, goodness of fit, and residuals (see Fig 8).

Fig 9 shows the upload data section of the WLreg software. The results of the application in Sect 3 are reproduced here for the WL2 regression model using the WLreg software.

Figs 10 and 11 show the summary section of the model. This section shows the estimated parameters, the standard errors and the Hessian matrix.

Fig 12 shows the information criteria obtained from the WL2 regression model. These results are important for selecting the best model.

Fig 13 shows the PP plot for CS residuals. The PP plot can be downloaded as a png file by the user. The results of the KS test are also included in this PP plot.

5 Concluding remarks and future work

In this paper, the WL2 distribution is obtained by a new re-parametrization of the WL distribution, and a new regression model is developed from it. The performance of this regression model is compared with that of other recently proposed models. For the house price data, the WL2 regression model has the lowest AIC and BIC values and is preferred by the Vuong test over the gamma, NEG, MBE and WL1 regression models. In addition, the WLreg software is developed to make the proposed model easy to use.

In future work, we will introduce a WL2 model with varying dispersion and compare it with its counterparts. The assumption that the dispersion is constant for all observations may not be valid for highly skewed data sets, especially for insurance claims. The mean of the response variable is related to the dispersion parameter via the variance equation. Therefore, as with the mean component, a linear predictor can be used to model the dispersion parameter. The varying-dispersion WL2 model may be more effective in modeling insurance data sets.

References

  1. Akram MN, Amin M, Qasim M. A new biased estimator for the gamma regression model: Some applications in medical sciences. Commun Statist-Theory Methods. 2023;52(11):3612–32.
  2. Ali S. On the Bayesian estimation of the weighted Lindley distribution. J Statist Comput Simulat. 2015;85(5):855–80.
  3. Altun E. Weighted-exponential regression model: an alternative to the gamma regression model. Int J Model Simulat Sci Comput. 2019;2019:1950035.
  4. Altun E. The Lomax regression model with residual analysis: an application to insurance data. J Appl Stat. 2020;48(13–15):2515–24. pmid:35707103
  5. Altun E, Altun G. The Maxwell-Boltzmann-Exponential distribution with regression model. Mathematica Slovaca. 2024;74(4):1011–22.
  6. Altun E, Korkmaz MC, El-Morshedy M, Eliwa MS. The extended gamma distribution with regression model and applications. AIMS Mathematics. 2021;6(3):2418–39.
  7. Bodhisuwan W, Sangpoom S. The discrete Weighted Lindley distribution. In: 2016 12th International Conference on Mathematics, Statistics, and Their Applications (ICMSA). 2016. p. 99–103.
  8. Bossio MC, Cuervo EC. Gamma regression models with the Gammareg R package. Comunicaciones en estadistica. 2015;8(2):211–23.
  9. Cepeda-Cuervo EC. Modelagem da variabilidade em modelos lineares generalizados. Rio de Janeiro, RJ, Brasil: IM-UFRJ. 2001.
  10. Cepeda-Cuervo E, Corrales M, Cifuentes MV, Zarate H. On gamma regression residuals. JIRSS. 2016;15(1):29–44.
  11. Cox DR, Snell EJ. A general definition of residuals. J Roy Statist Soc Ser B (Methodol). 1968. p. 248–75.
  12. De Jong P, Heller GZ. Generalized linear models for insurance data. Cambridge Books. 2008.
  13. Fang J, Liu L, Fang P. What is the most important factor affecting patient satisfaction - a study based on gamma coefficient. Patient Prefer Adherence. 2019;13:515–25. pmid:31114168
  14. Ghitany ME, Alqallaf F, Al-Mutairi DK, Husain HA. A two-parameter weighted Lindley distribution and its applications to survival data. Math Comput Simulat. 2011;81(6):1190–201.
  15. Lindley DV. Fiducial distributions and Bayes’ theorem. J Roy Statist Soc Ser B (Methodol). 1958:102–7.
  16. Mazucheli J, Coelho-Barros EA, Achcar JA. An alternative reparametrization for the Weighted Lindley distribution. Pesquisa Operacional. 2016;36:345–53.
  17. Mazucheli J, Louzada F, Ghitany ME. Comparison of estimation methods for the parameters of the Weighted Lindley distribution. Appl Math Comput. 2013;220:463–71.
  18. Mohammadpour S, Niknam N, Javan-Noughabi J, Yousefi M, Ebrahimipour H, Haghighi H, Sharifi T. The factors associated with direct medical costs in patients with gastric cancer: quantile regression approach compared with gamma regression. Value Health Regional Issues. 2020;21:127–32.
  19. Mota AL, Santos-Neto M, Neto MM, Leao J, Tomazella VL, Louzada F. Weighted Lindley regression model with varying precision: estimation, modeling and its diagnostics. Commun Statist-Simulat Comput. 2024;53(4):1690–710.
  20. Murase H, Nagashima H, Yonezaki S, Matsukura R, Kitakado T. Application of a generalized additive model (GAM) to reveal relationships between environmental factors and distributions of pelagic fish and krill: a case study in Sendai Bay, Japan. ICES J Marine Sci. 2009;66(6):1417–24.
  21. Ramos PL, Louzada F. The generalized weighted Lindley distribution: properties, estimation, and applications. Cogent Math. 2016;3(1):1256022.
  22. Ramos PL, Louzada F, Shimizu TK, Luiz AO. The inverse weighted Lindley distribution: Properties, estimation and an application on a failure time data. Commun Statist-Theory Methods. 2019;48(10):2372–89.
  23. Tarko AP. Maximum likelihood method of estimating the conflict-crash relationship. Accid Anal Prev. 2023;179:106875. pmid:36345112
  24. Vuong QH. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica. 1989:307–33.