A new regression model for bounded response variable: An alternative to the beta and unit-Lindley regression models

A new distribution defined on (0,1) interval is introduced. Its probability density and cumulative distribution functions have simple forms. Thanks to its simple forms, the moments, incomplete moments and quantile function of the proposed distribution are derived and obtained in explicit forms. Four parameter estimation methods are used to estimate the unknown parameter of the distribution. Besides, simulation study is implemented to compare the efficiencies of these parameter estimation methods. More importantly, owing to the proposed distribution, we provide an alternative regression model for the bounded response variable. The proposed regression model is compared with the beta and unit-Lindley regression models based on two real data sets.


Introduction
In the last decade, modeling of the bounded data sets is increased its popularity. These kinds of data sets appear in many fields such as finance, actuarial and medical sciences. The statistics literature has very limited distributions defined on (0,1). The best known distributions defined on (0,1) are beta, Topp-Leone by Topp and Leone [1] and Kumaraswamy by Kumaraswamy [2] distributions. To increase the modeling accuracy of the data sets on (0,1), several distributions have been proposed by researchers. For instance, the unit-Lindley by Mazucheli et al. [3], unit-inverse Gaussian by Ghitany et al. [4], unit-Birnbaum-Saunders by Mazucheli et al. [5], exponentiated Topp-Leone by Pourdarvish et al. [6], transmuted Kumaraswamy by Khan et al. [7], log-xgamma by Altun and Hamedani [8], log-weighted exponential by Altun [9] and unitimproved second-degree Lindley by Altun and Cordeiro [10].
Although the beta distribution is widely used to model data sets on bounded interval, it has deficiency to model extremely left-skewed and leptokurtic data sets. The moments of the Topp-Leone distribution are not in explicit forms which is important to make appropriate parametrization on the density function for regression modeling. Additionally, even if the moments of the Kumaraswamy distribution are in explicit forms, they contains gamma function which destroys the re-parametrization of the density function. We aim to introduce a new a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 distribution on (0,1) interval to remove the deficiencies of the existing distributions for modeling the extremely skewed data sets. The Bilal distribution introduced by Abd-Elrahman [11] is used to generate a new distribution employing the appropriate transformation. The resulting distribution is called as log-Bilal distribution since we use Y = exp(−X) transformation. After obtaining the log-Bilal distribution, we obtain its statistical properties such as moments, incomplete moments and quantile function. The important question is that do we need this distribution? To answer this question, we summarize the importance of the log-Bilal distribution: (i) the log-Bilal distribution has simple and closed-form expressions for its statistical functions (ii) the properties of the log-Bilal distribution are derived in explicit forms without any special mathematical functions, (iii) the proposed distribution provides more flexibility than existing distributions for the shapes of hazard rate function, (iv) thanks to its simple mathematical functions, we introduce a new regression model based on the log-Bilal density to model the extremely skewed dependent variables with associated covariates.
We summarize the concepts of the remaining sections: the moments, incomplete moments, quantile function, and exponential family property of the log-Bilal distribution are obtained in the next section. Section 3 is devoted to the parameter estimation methods. The efficiencies of these methods are compared in Section 4. The log-Bilal regression model is introduced in Section 5. Section 6 contains the results of the data analysis. The paper is ended with concluding remarks in Section 7.

The log-Bilal distribution
Let random variable (rv) X represents the Bilal distribution which has the following probability density function (pdf) where θ > 0 is the scale parameter. The cumulative distribution function (cdf) of X is Following the idea of Altun and Hamedani [8] and Altun [9] and using the Y = exp(−X) transformation on the Bilal distribution, the pdf of the log-Bilal distribution is where θ > 0. Here, the parameter θ behaves like a shape parameter by contrast with the Bilal distribution. From now on, the rv Y having density (3) is stated as Y * log-Bilal(θ). The cdf of Y (for 0 � y � 1) is The survival function (sf) and hazard rate function (hrf) of Y are, respectively, h y ð Þ ¼ 6y 2=yÀ 1 ð1 À y 1=y Þ yð1 À 3y 2=y þ 2y 3=y Þ : ð6Þ The quantile function of Y is given by ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi where 0 < u < 1. Using (7), we have the following algorithm to generate random variables from the log-Bilal distribution. Algorithm 1 Generating random variables from log-Bilal(θ) distribution ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi Repeat steps 2 and 3 n times.

Moments
The kth raw moment of Y is Using (8), the first and second raw moments of Y are given, respectively, by The variance of Y is obtained from the its first and second raw moments as It is easy to conclude that the mean and variance of the log-Bilal distribution decreases when the parameter θ increases.

Incomplete moments
The rth incomplete moment of Y is The incomplete moments of random variables are important tools to measure the inequalities like Gini measure (see, Butler and McDonald [12] for details).

Exponential family
The pdf of any distribution should be expressed in the following form to be a member of exponential family.

Estimation
We use four estimation methods to discuss the parameter estimation process of the log-Bilal distributions. These estimation methods are maximum likelihood estimation (MLE), method of moments (MM), least squares estimation (LSE) and weighted least squares estimation (WLSE). Detailed pieces of information on these estimation methods are given in the rest of this section.

Maximum likelihood
Let y 1 , . . ., y n be a random sample from the log-Bilal distribution. The log-likelihood function of the log-Bilal distribution is By differentiating (10) with respect to θ gives The MLE of θ, say,ŷ, is the solution of (11) for zero. There is no explicit form solution for (11). Therefore, it should be solved iteratively or direct maximization of (10) can be viewed as the other choice. Here, the direct maximization of (10) is preferred by using the optim function of R software.

Method of moments
The MM estimation method is a popular method when the raw moments of the distribution have simple forms. The MM estimator of θ can be easily obtained by equating the first theoretical moment of the log-Bilal distribution to the sample mean, which giveŝ

Least squares
Assume that the y (1) , . . ., y (n) be ordered sample of y 1 , . . ., y n following the log-Bilal distribution. The LSE of θ is obtained by minimizing where F(y (i) ;θ) is in (4). Then, we have

Weighted least squares
The minimization of the below function gives the WLSE of the parameter θ.

Simulation
We compare the efficiencies of the MLE, MM, LSE and WLSE methods in estimating the parameter of the log-Bilal distribution. The algorithm given in Section 2 is used to generate random variables from the log-Bilal distribution. The simulation results are interpreted based on the following quantities.
Bias ¼ These kind of statistical measures such as means square erros (MSEs) and mean relative errors (MREs) are used to compare the different approaches deciding the best model under pre-determined scenarios (see, Zeng et al., [13,14]). The statistical software R is used to obtain numerical results for the simulation study. We choose the parameter value θ = 1.7, the simulation replication is N = 10, 000 and the sample size is n = 20, 25, 30, . . ., 300. If the estimation methods yield an asymptotically unbiased estimation of θ, we expect to see that MSEs and biases approach the zero. On the other hand, MREs should be near the one. The simulation results are displayed in Fig 3. As seen from these figures, MLE method approaches the

PLOS ONE
desired values of biases, MSEs and MREs faster than other estimation methods. Therefore, MLE method is more appropriate than other methods for estimating the parameter of the log-Bilal distribution.

The log-Bilal regression model
Now, we introduce a new regression model for bounded response variable as an alternative to the beta and unit-Lindley regression models. Let θ = 2 −1 ({μ/(μ + 24)} −1/2 − 5), then the pdf of log-Bilal distribution takes the form f y; m ð Þ ¼ 12 where 0 < y < 1, 0 < μ < 1 and E(Y|μ) = μ. The logit link function is used to link the covariates to the mean of response variable, as follows, . . . ; x ip Þ is the vector of covariates and β = (β 0 , β 1 , β 2 , . . ., β k ) T is the vector of unknown regression coefficients. Substituting μ i in (13) with (14), the log-likelihood function of the log-Bilal regression model is where μ i is given by (14). The unknown vector of regression parameters, β, is estimated by minimizing the negative value of (15) which is equivalent to the maximization of (15). The standard errors of the estimated parameters are obtained by means of observed information matrix whose elements can be calculated numerically with fdHess function of R software.

Residuals analysis
To check the model accuracy of the fitted log-Bilal regression model, the randomized quantile residuals introduced by Dunn and Smyth [15] is used. The randomized quantile residuals are given byr whereû i ¼ Fðy i ;βÞ and F −1 (z) is the inverse of the standard normal cdf. When the fitted model is valid for the used data set, r i is normally distributed with zero mean and unit variance.

Empirical studies
In this section, the log-Bilal distribution and log-Bilal regression model are compared with existing models. Two real data set are analyzed to prove the usefulness of proposed distribution in modeling the real data sets.

Dwellings without basic facilities
Better Life Index (BLI) is calculated for the OECD countries as well as Brazil, Russia and South Africa to compare the countries based on 12 indicators which effect the quality of the life.
Here, we use one of the variable of BLI measured in the year of 2017, dwellings without basic facilities which is defined as a percentage of the population living in a dwelling without indoor flushing toilet. The data set is available at https://stats.oecd.org/index.aspx?DataSetCode=BLI. This data set is used to compare the real data modeling performance of the log-Bilal distribution with the following competitive models: beta, Kumaraswamy, Topp-Leone and unit-Lindley.
The competitive distributions as well as the log-Bilal distribution are fitted to the data used by means of R software. After fitting the distribution to data, the MLEs of the parameters of the fitted distributions with their standard errors (SEs) are obtained. Besides, the formal goodness-of-fit tests such as Kolmogorov-Smirnov (K-S), Cramér-von Mises (W � ) and Anderson-Darling (A � ) are applied to decide the suitability of the distributions on the data used. Akaike Information Criteria (AIC) and Bayesian Information Criteria (BIC) are widely used criteria to choose the best statistical model. These statistics are used for comparison of the fitted models and selection of the best model (see, Chen et al., [16,17]). Table 1 shows the MLEs of the parameters for the fitted models to the dwellings without basic facilities data, corresponding SEs, and goodness-of-fit statistics as well as AIC and BIC values. As seen from the results of K-S tests with corresponding p-values, the all fitted distributions, except the unit-Lindley, provide adequate fits. However, the log-Bilal distribution has the lowest values of the AIC, BIC, A � and W � statistics which indicate that the proposed distribution is the best choice for the data used.

Education attainment
Here, the performance of the log-Bilal regression model is compared with the beta and unit-Lindley regression models. The used data set comes from the BLI of OECD countries, measured in the year of 2017. The data source is https://stats.oecd.org/index.aspx?DataSetCode= BLI.
The educational attainment values of the OECD countries (y) is considered as response (dependent) variable The goal is to explore the effects of following covariates on the

PLOS ONE
conditional mean of the response variable: homicide rate (HR), dwellings without basic facilities (DWBF), and labor market insecurity (LMI). The logit link function which ensures that the estimated mean lies between 0 and 1, is used for all fitted regression models. The fitted regression model is Table 2 lists the MLEs, SEs, and corresponding p-values, AIC and BIC for the beta, unit-Lindley, and log-Bilal regression models. The parameter φ represents the dispersion parameter of the beta regression model. Based on the figures in Table 2, all estimated regression parameters are found statistically significant for beta and log-Bilal regression models. Based on the estimated regression parameters of the log-Bilal regression model, it is concluded that when the homicide rate and labor market insecurity increase, the educational attainment decreases in the OECD countries. On the other hand, when the dwellings without basic facilities increases, the educational attainment increases in the OECD countries.
The information criteria, AIC and BIC statistics, are used to select the best model for the data used. Since the lowest values of the AIC and BIC statistics are belong to the log-Bilal regression model, we conclude that it is best by comparison with the beta and unit-Lindley regression models. Additionally, the residual analysis is done to evaluate the suitability of the fitted models for the data used. Fig 5 displays the quantile-quantile plots of the randomized quantile residuals. As seen from these figures, all fitted regression models provide adequate fits, but, the plotted points for the log-Bilal regression model are more closer the diagonal line than the beta and unit-Lindley regression models.

Conclusion
For the first time, a new one-parameter unit distribution is introduced for modeling the extremely left-skewed data sets measured in unit-interval. The new model provides a reasonably better fit than the other one and two-parameter unit distributions such as Topp-Leone, unit-Lindley, Kumaraswamy, and beta distributions when the data sets are extremely skewed to left (right). The newly defined regression model is compared with the famous beta regression model as well as the recently proposed unit-Lindley regression model. The results of the data analysis show that the proposed models work better than other existing models. As a future work of the presented study, we plan to introduce the quantile regression model based on the log-Bilal distribution. Additionally, we extend our model for modeling the longitudinal data sets as an alternative to the longitudinal beta regression model.