Abstract
In many practical situations, there is interest in modeling bounded random variables on the interval (0, 1), such as rates, proportions, and indexes. It is important to provide new continuous models to deal with the uncertainty involved in variables of this type. This paper proposes a new quantile regression model based on an alternative parameterization of the unit Burr XII (UBXII) distribution. For the UBXII distribution and its associated regression, we obtain the score functions and observed information matrices. We use the maximum likelihood method to estimate the parameters of the regression model and conduct a Monte Carlo study to evaluate the performance of its estimates in finite samples. Furthermore, we present general diagnostic analysis and model selection techniques for the regression model. We empirically show its importance and flexibility through an application to a real data set, in which the dropout proportion of Brazilian undergraduate animal sciences courses is analyzed. We use a statistical learning method to compare the proposed model with the beta, Kumaraswamy, and unit-Weibull regressions. The results show that the UBXII regression provides the best fit and the most accurate predictions. Therefore, it is a valuable and competitive alternative to the well-known regressions for modeling double-bounded variables on the unit interval.
Citation: Ribeiro TF, Peña-Ramírez FA, Guerra RR, Cordeiro GM (2022) Another unit Burr XII quantile regression model based on the different reparameterization applied to dropout in Brazilian undergraduate courses. PLoS ONE 17(11): e0276695. https://doi.org/10.1371/journal.pone.0276695
Editor: Muhammad Amin, University of Sargodha, PAKISTAN
Received: March 13, 2022; Accepted: October 11, 2022; Published: November 3, 2022
Copyright: © 2022 Ribeiro et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript and its Supporting information files.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
University dropout is a problem with academic, social, and economic implications due to the high cost it inflicts on students, their families, universities, and the country's growth [1]. Thus, it is necessary to extract relevant information to enable higher education institutions (HEIs) to understand this phenomenon and minimize the dropout proportion of their courses. With this in mind, several authors have studied how aspects of the organizational structure of universities affect student outcomes; see [2, 3], for instance. However, it is essential to look at appropriate classes of regressions to model the dropout proportion, such as those based on distributions supported on the standard unit interval.
The beta [4] and Kumaraswamy [5, 6] regressions are the most widely used for modeling unit outcomes. The beta regression is useful for understanding the influence of covariates on the response's mean. The Kumaraswamy is the classical alternative to the beta and allows modeling a quantile of a response in the unit interval. However, the search for alternative unit regressions has attracted many researchers' attention, especially approaches based on quantiles. For example, [7] introduced the unit-Weibull quantile regression, and [8, 9] proposed the unit Burr XII and reflected unit Burr XII regressions, respectively. Other quantile regressions were introduced by [10, 11], and [12]. One may also see [13–16] for unit regressions applied to educational measurements. These authors focus on comparing indicators from different countries, including educational attainment percentage and school living conditions. However, to the authors' knowledge, there is still a lack of information concerning the phenomenon of student dropout.
In view of the above, the goal of this paper is to propose a new alternative unit quantile regression applied to the dropout proportion of undergraduate courses. We use an approach based on the unit Burr XII (UBXII) distribution, which was pioneered by [8] by applying the transformation method to a Burr XII (BXII) random variable. Their choice was based on the versatility of the baseline, which has been applied in reliability analysis [17], regression modeling [18], generalized distributions [19, 20], and several other disciplines. Let Y be a unit random variable having the UBXII distribution. The cumulative distribution function (cdf) and probability density function (pdf) of Y are
FY(y; c, d) = [1 + (log(1/y))^c]^{−d},  0 < y < 1,  (1)
and
fY(y; c, d) = (cd/y) [log(1/y)]^{c−1} [1 + (log(1/y))^c]^{−(d+1)},  (2)
respectively, where c > 0 and d > 0 are shape parameters. The quantile function (qf) of Y follows by inverting Eq (1), namely
QY(u; c, d) = exp[−(u^{−1/d} − 1)^{1/c}],  u ∈ (0, 1).  (3)
Henceforth, if Y is a random variable with pdf (2), we write Y∼ UBXII (c, d). For c = 1, the UBXII distribution reduces to the unit Lomax distribution [21]. By taking d = 1, it is a special case of the unit log-logistic distribution [22]. Those models recently appeared in the literature, and the unit Lomax has not been studied in a regression context.
Our proposal is based on a new reparameterization of Y obtained by inverting its quantile function. We provide at least four motivations for this work. First, we propose a new reparameterization of Y and derive some useful statistical quantities that were not explored by [8]. Our investigation includes the computation of the score function and observed information matrix for the distribution and also for the regression. Second, we consider a regression structure for the new quantile parameter by assuming that it can be expressed as a function of covariates, and hence a more general class of regressions is obtained. The third motivation is the use of a statistical learning tool for comparing the prediction performance of non-nested models and selecting the most suitable one for the data at hand. The fourth motivation refers to the usefulness of the new regression for modeling the dropout proportion of undergraduate courses. The motivating data set concerns Brazilian undergraduate animal sciences courses. These courses have received attention in the literature; see, for instance, [23], who sought to identify demographic variables, their relation to students' performance and areas of interest, and factors associated with enrollment in an introductory animal sciences course.
The rest of the paper is outlined as follows. Section 2 proposes an alternative quantile parameterization for the UBXII distribution and investigates some of its mathematical and statistical properties. We obtain the maximum likelihood estimates of the parameters in Section 3. We provide a simulation study in Section 4 to evaluate the performance of the estimators. In Section 5, we define a quantile regression model based on the new parameterization of the UBXII distribution. In addition, we discuss the estimation of the parameters, present some diagnostic analysis methods and regression selection criteria, and conduct simulation studies. In particular, we present a statistical learning tool (the cross-validation approach) to compare non-nested regressions. In Section 6, we present an application of the new regression to dropout in Brazilian undergraduate animal sciences courses. We offer some conclusions in Section 7. Finally, we provide the observed information matrices for the new distribution and its regression, and information about the data extraction used in the application; see S1 Appendix and Supporting information, respectively.
2 A new UBXII parametrization
Distributions with directly interpretable parameters are desirable in empirical applications, and for this purpose, several authors have adopted reparameterizations of well-known distributions; see [4, 7, 24, 25]. These reparameterizations generally seek to allow modeling of the random variable's mean, as in the proposals of [4] and [25]. However, the mean is an outlier-sensitive measure, and the UBXII distribution does not have a closed-form expression for it. Thus, modeling the quantiles is an interesting approach for asymmetric data because quantiles can be outlier-resistant measures [24], besides being a convenient alternative since the qf of Y in (3) has closed form, so any quantile can be computed explicitly. Further, one of the parameters of the UBXII distribution (under a quantile parameterization) can be interpreted as the τth quantile of Y. Thus, we shall reparameterize Eq (1) in terms of the τth quantile q = QY(τ). By inverting (3) and solving for d, we have
d = log(1/τ) / log[1 + (log(1/q))^c].  (4)
By replacing (4) in Eqs (1) and (2), the cdf and pdf of the UBXII distribution (under this parametrization) have the forms
FY(y; c, q) = τ^{ log[1 + (log(1/y))^c] / log[1 + (log(1/q))^c] },  0 < y < 1,  (5)
and
fY(y; c, q) = (c/y) {log(1/τ) / log[1 + (log(1/q))^c]} [log(1/y)]^{c−1} [1 + (log(1/y))^c]^{−1} τ^{ log[1 + (log(1/y))^c] / log[1 + (log(1/q))^c] },  (6)
respectively. Henceforth, we denote a random variable with density (6) by Y∼ UBXII (c, q).
Some UBXII densities (for τ = 0.5) are displayed in Fig 1, revealing different shapes such as decreasing, increasing, reverse J-shaped, U-shaped, reverse tilde-shaped (decreasing-increasing-decreasing), non-skewed, and left-skewed. It is noteworthy that the UBXII density can accommodate several left-skewed shapes and a reverse tilde shape, which is not exhibited by classical unit distributions.
The qf of Y on the new parameterization has the form
QY(u; c, q) = exp( −{ [1 + (log(1/q))^c]^{log u / log τ} − 1 }^{1/c} ),  u ∈ (0, 1).  (7)
So, the UBXII quantiles can be obtained from (7) by setting values of u. Further, we can generate occurrences from this distribution using (7) by the inversion method.
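The inversion method just described can be sketched in a few lines of Python (names are ours; the qf is the form obtained by inverting the cdf, and the τth quantile of the generated distribution equals q by construction):

```python
import math
import random

def ubxii_quantile(u, c, q, tau=0.5):
    # Q_Y(u) = exp(-{ [1 + (log(1/q))^c]^(log u / log tau) - 1 }^(1/c))
    t_q = 1.0 + (math.log(1.0 / q)) ** c
    return math.exp(-((t_q ** (math.log(u) / math.log(tau)) - 1.0) ** (1.0 / c)))

def ubxii_sample(n, c, q, tau=0.5, seed=1):
    # inversion method: evaluate the qf at u ~ U(0, 1)
    rng = random.Random(seed)
    return [ubxii_quantile(rng.random(), c, q, tau) for _ in range(n)]
```

For example, with τ = 0.5 the sample median of a large generated sample should be close to q.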
Alternatively, the flexibility of the new distribution can be displayed through the Bowley skewness and Moors kurtosis,

B = [QY(3/4) − 2QY(1/2) + QY(1/4)] / [QY(3/4) − QY(1/4)]

and

M = [QY(7/8) − QY(5/8) + QY(3/8) − QY(1/8)] / [QY(6/8) − QY(2/8)],

respectively, where QY(⋅) is the qf given by (7). These measures provide a simple way to assess the skewness and tail shapes of the distribution. Fig 2 displays plots of both measures B and M, which show that they are sensitive to variations in c and q for fixed τ = 0.5.
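Since B and M depend only on the qf, they can be computed directly from (7). A minimal Python sketch (function names are ours; the qf is the form obtained by inverting the cdf):

```python
import math

def ubxii_quantile(u, c, q, tau=0.5):
    # qf under the (c, q) parameterization
    t_q = 1.0 + (math.log(1.0 / q)) ** c
    return math.exp(-((t_q ** (math.log(u) / math.log(tau)) - 1.0) ** (1.0 / c)))

def bowley_skewness(c, q, tau=0.5):
    # B = [Q(3/4) - 2Q(1/2) + Q(1/4)] / [Q(3/4) - Q(1/4)], always in [-1, 1]
    Q = lambda u: ubxii_quantile(u, c, q, tau)
    return (Q(0.75) - 2.0 * Q(0.5) + Q(0.25)) / (Q(0.75) - Q(0.25))

def moors_kurtosis(c, q, tau=0.5):
    # M = [Q(7/8) - Q(5/8) + Q(3/8) - Q(1/8)] / [Q(6/8) - Q(2/8)], always positive
    Q = lambda u: ubxii_quantile(u, c, q, tau)
    return (Q(7 / 8) - Q(5 / 8) + Q(3 / 8) - Q(1 / 8)) / (Q(6 / 8) - Q(2 / 8))
```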
3 Estimation
Various methods can be used to estimate the parameters of a distribution. The maximum likelihood (ML) method is the most commonly used. In what follows, we shall use this method for estimating the parameters of the UBXII distribution.
Let y1, …, yn be a random sample of size n from the UBXII distribution, the parameter vector θ = (c, q)⊤, and a known τ ∈ (0, 1) specified. Based on this sample, the log-likelihood function for θ, ℓ(θ;y) ≡ ℓ(θ), has the form
ℓ(θ) = n[log c + log log(1/τ) − log log t(q)] + ∑_{i=1}^{n} [(c − 1) log log(1/y_i) − log y_i − log t(y_i)] + [log τ / log t(q)] ∑_{i=1}^{n} log t(y_i),  (8)

where t(x) = 1 + [log(1/x)]^c.
Eq (8) can be maximized either directly by using well-known platforms such as R (optim function), SAS (PROC NLMIXED), or Ox (MaxBFGS subroutine), or by solving the nonlinear likelihood equations obtained by differentiating ℓ(θ). By maximizing (8), we obtain the MLE of θ.
Graphically, it is possible to verify that the log-likelihood function is unimodal and to locate its maximum. Plots that illustrate this are constructed in four steps. First, we simulate data from UBXII(c, d) with c = 1.5, d = 3.4, and n = 100. Second, we evaluate the log-likelihood function obtained from the pdf in Eq (2) over a range covering the respective ML estimate, that is, c ∈ (0, 9) with d fixed at 3.4. Then, the same is done for d ∈ (0, 9) by fixing c at 1.5. Finally, we plot the log-likelihood function against the range of values of c and d. Fig 3 displays the plots obtained. As expected, both log-likelihood functions are unimodal, and their maximum points (the ML estimates) are attained close to the true values of c and d, respectively.
The components of the score vector from Eq (8) are U(θ) = [Uc(θ), Uq(θ)]⊤, where Uc(θ) = ∂ℓ(θ)/∂c and Uq(θ) = ∂ℓ(θ)/∂q. Setting these components to zero and solving the resulting equations simultaneously gives the MLE θ̂ = (ĉ, q̂)⊤. The score components are

Uc(θ) = n/c + ∑_{i=1}^{n} log log(1/y_i) − [1 − log τ/log t(q)] ∑_{i=1}^{n} w(y_i) − {n + [log τ/log t(q)] ∑_{i=1}^{n} log t(y_i)} w(q)/log t(q)

and

Uq(θ) = (c/q) [log(1/q)]^{c−1} {n + [log τ/log t(q)] ∑_{i=1}^{n} log t(y_i)} / [t(q) log t(q)],

where w(x) = [log(1/x)]^c log log(1/x)/t(x). The MLE of θ cannot be expressed in closed form by setting U(θ) = 0. However, for fixed c, a semi-closed form of the MLE of q follows from Uq(θ) = 0. Hence, it is the solution of n + [log τ/log t(q)] ∑_{i=1}^{n} log t(y_i) = 0, which yields

q̂(c) = exp{ −[(1/τ)^{S(c)/n} − 1]^{1/c} },  where S(c) = ∑_{i=1}^{n} log t(y_i).
By replacing q with q̂(c) in Eq (8), we obtain the profile log-likelihood function

ℓ_p(c) = ℓ(c, q̂(c)).  (9)

The score function for c can then be computed by differentiating (9) with respect to c.
However, it is necessary to use a nonlinear optimization method to maximize numerically the profile log-likelihood function (9). Typically for the numerical computation of the MLEs, the quasi-Newton algorithm such as Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is adopted.
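The profile-likelihood strategy can be sketched in Python (the paper uses R's optim with BFGS; here, as an assumption of this illustration, a golden-section search replaces BFGS, relying on the unimodality suggested by Fig 3, and the closed-form update of d given c follows from setting ∂ℓ/∂d = 0 in the (c, d) parameterization):

```python
import math

def t_fun(x, c):
    # t(x) = 1 + [log(1/x)]^c
    return 1.0 + (math.log(1.0 / x)) ** c

def profile_loglik(c, y):
    # for fixed c, the MLE of d has the closed form d_hat = n / sum(log t(y_i))
    n = len(y)
    s = sum(math.log(t_fun(yi, c)) for yi in y)
    d_hat = n / s
    return (n * math.log(c * d_hat)
            + (c - 1.0) * sum(math.log(math.log(1.0 / yi)) for yi in y)
            - sum(math.log(yi) for yi in y)
            - (d_hat + 1.0) * s)

def fit_ubxii(y, tau=0.5, lo=0.05, hi=20.0, iters=120):
    # golden-section search for the c that maximizes the profile log-likelihood
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c1, c2 = b - g * (b - a), a + g * (b - a)
        if profile_loglik(c1, y) < profile_loglik(c2, y):
            a = c1
        else:
            b = c2
    c_hat = 0.5 * (a + b)
    d_hat = len(y) / sum(math.log(t_fun(yi, c_hat)) for yi in y)
    # map (c, d) back to the quantile parameter q = Q_Y(tau)
    q_hat = math.exp(-((tau ** (-1.0 / d_hat) - 1.0) ** (1.0 / c_hat)))
    return c_hat, q_hat
```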
Approximate confidence intervals and hypothesis tests for θ can be constructed from the asymptotic distribution of the MLEs. For large samples, assuming that the standard regularity conditions (SRCs) hold, θ̂ is approximately distributed as N_2(θ, I(θ)^{−1}), where I(θ) is the expected information matrix defined by I(θ) = E[J(θ)]. The computation of I(θ) may be cumbersome. Nevertheless, when the SRCs are valid, I(θ) can be approximated by the observed information matrix J(θ) = −∂²ℓ(θ)/∂θ∂θ⊤. For the UBXII distribution, we can write J(θ) as

J(θ) = − ( Ucc(θ)  Ucq(θ) ; Uqc(θ)  Uqq(θ) ),
where Ucc(θ) = ∂2 ℓ(θ)/∂c2, Uqq(θ) = ∂2 ℓ(θ)/∂q2, and Ucq(θ) = ∂2 ℓ(θ)/(∂c∂q) = Uqc(θ). The elements of the matrix J(θ) are given in S1 Appendix.
[26] proved that the estimated observed information matrix is a consistent estimator of I(θ) when the sample size is large. It is then possible to obtain the standard errors (SEs) of the MLEs by computing the square roots of the diagonal elements of J(θ̂)^{−1}. For instance, we can perform large-sample inference by building asymptotic confidence intervals with 100(1 − α)% nominal coverage for θ_j as θ̂_j ± z_{1−α/2} SE(θ̂_j), where z_{1−α/2} is the 1 − α/2 standard normal quantile.
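The SEs and Wald intervals can be approximated with a numerically computed observed information matrix. A generic Python sketch for a two-parameter model (finite-difference Hessian; function names and the quadratic test function are ours, not from the paper):

```python
import math

def num_hessian(f, x, h=1e-4):
    # central finite-difference Hessian of a scalar function f at point x
    k = len(x)
    H = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            xpp, xpm, xmp, xmm = (list(x) for _ in range(4))
            xpp[i] += h; xpp[j] += h
            xpm[i] += h; xpm[j] -= h
            xmp[i] -= h; xmp[j] += h
            xmm[i] -= h; xmm[j] -= h
            H[i][j] = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4.0 * h * h)
    return H

def wald_ci_2p(loglik, theta_hat, z=1.959963984540054):
    # SEs from the diagonal of J(theta_hat)^{-1}, where J = -Hessian (2x2 case)
    H = num_hessian(loglik, theta_hat)
    J = [[-H[0][0], -H[0][1]], [-H[1][0], -H[1][1]]]
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    variances = [J[1][1] / det, J[0][0] / det]
    return [(t - z * math.sqrt(v), t + z * math.sqrt(v))
            for t, v in zip(theta_hat, variances)]
```

In practice, `loglik` would be the UBXII log-likelihood (8) and `theta_hat` the MLE; analytical second derivatives (S1 Appendix) are preferable when available.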
4 Simulation study
A Monte Carlo simulation study is carried out in the R programming language to evaluate the performance of the MLEs of the parameters that index the UBXII distribution. The optim routine (with the BFGS quasi-Newton nonlinear optimization algorithm and analytical derivatives) is used for maximizing (9). The profile log-likelihood function involves a more straightforward numerical maximization than using (8), since it depends only on the parameter c. We start the root-finding algorithm at c = 1 for the shape parameter.
Different values of the parameter vector θ are considered according to those presented in Fig 1. Therefore, various combinations of skewness and kurtosis coefficients and density shapes are contemplated. A total of six scenarios are considered for the sample sizes n ∈ {25, 75, 150, 300}. The inversion method is employed for generating observations, i.e., the qf (7) is evaluated at u ∼ U(0, 1), yielding QY(u) = y, and hence a sample of size n from Y ∼ UBXII (c, q) is generated. Each sample size is replicated R = 10,000 times. We compute quantities such as the percentage relative bias (RB%) and the root mean squared error (RMSE) of the MLEs.
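The RB% and RMSE summaries used throughout the simulations can be computed as follows (a trivial Python sketch; names are ours):

```python
def rb_percent(estimates, true_value):
    # percentage relative bias: 100 * (mean of estimates - true) / true
    mean_est = sum(estimates) / len(estimates)
    return 100.0 * (mean_est - true_value) / true_value

def rmse(estimates, true_value):
    # root mean squared error of the Monte Carlo estimates
    return (sum((e - true_value) ** 2 for e in estimates) / len(estimates)) ** 0.5
```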
Table 1 reports the results of the simulation schemes. As expected, the consistency property of the MLEs holds, i.e., the RMSEs tend to decrease when the sample size increases. Also, the RB%s are smaller for larger sample sizes, indicating that the overall performance of the MLEs is appropriate and that they become more accurate and less biased as n increases. Notice that the biggest RB%s for ĉ and q̂ are less than 7.38 and 1.62, respectively, even with n = 25. In general, the estimate q̂ is more accurate than ĉ. In scenarios two to six, all the RB%s of q̂ are below 0.84 in absolute value.
Fig 4 displays boxplots of the first 100 Monte Carlo replications (to favor easy viewing) for the current scenarios. We note that, in most cases, outlying estimates overestimate the parameters for small sample sizes. However, this effect is attenuated when n increases. Besides, the dispersion of the estimates decreases, and precision is achieved for larger sample sizes.
Fig 5 contains plots of the total absolute RB% and total RMSE versus sample size for all scenarios. These quantities are obtained by summing the RB% and RMSE of both parameters for each sample size and scenario. Note that those measures decay toward zero as n increases in the six scenarios. This shows that the asymptotic properties of the MLEs (such as unbiasedness and consistency) hold.
5 The UBXII regression
Let Y1, …, Yn be n independent random variables, where Yi ∼ UBXII (qi, c) for i = 1, …, n, with shape parameter c and quantile parameter qi (both unknown), for 0 < τ < 1 assumed known. We propose the UBXII regression by imposing that the quantile qi of Yi satisfies the functional relation
g(qi) = xi⊤β = ηi,  i = 1, …, n,  (10)

where η = (η1, …, ηn)⊤ is the n-dimensional vector of linear predictors, q = (q1, …, qn)⊤ is the vector of quantiles with qi ∈ (0, 1), β = (β1, …, βk)⊤ is a k-dimensional vector of unknown regression coefficients (k < n), X = (x1, …, xn)⊤ is the n × k full column rank matrix, xi = (xi1, …, xik)⊤ denotes the ith observation on the k covariates, which are assumed known, and xi1 = 1, ∀i. Finally, we shall assume that g(⋅) is a strictly monotonic and twice differentiable link function which maps (0, 1) onto ℝ. By inverting each component of (10), we can write qi = g^{−1}(xi⊤β).
There are various possible choices for the link function g(⋅) such as
- logit: g(qi) = log[qi/(1−qi)];
- probit: g(qi) = Φ−1(qi), where Φ−1(⋅) is the qf of the standard normal random variable;
- complementary log-log: g(qi) = log[−log(1−qi)];
- log-log: g(qi) = −log[−log(qi)];
- Cauchy: g(qi) = tan[π(qi−1/2)].
The logit link function is the most common choice among practitioners since the interpretation of the regression parameters becomes quite appealing. Consider increasing the jth regressor by one unit, while the others are kept constant. Let q* be the quantile of Y under the new value of xj, whereas q denotes the quantile of Y under the original value of this regressor. It can be shown that, with the logit link function, βj = log{[q*/(1 − q*)]/[q/(1 − q)]}, i.e., βj is the log odds ratio [4]. In this context, we consider the logit link function for g(⋅) in the UBXII regression. Then, the τth quantile of Yi is qi = exp(xi⊤β)/[1 + exp(xi⊤β)].
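The log-odds-ratio interpretation of βj under the logit link can be verified numerically; a short Python sketch (names are ours):

```python
import math

def logit(p):
    # g(q) = log[q / (1 - q)]
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    # q = exp(eta) / (1 + exp(eta)), written in a numerically stable form
    return 1.0 / (1.0 + math.exp(-eta))
```

Increasing the jth regressor by one unit adds βj to the linear predictor η, so the odds q/(1 − q) are multiplied by exp(βj); equivalently, the log odds ratio equals βj.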
5.1 Estimation
Parameter estimation in the UBXII regression can also be performed by the ML method. Let θ = (β⊤, c)⊤ be the vector of k + 1 unknown parameters to be estimated. The log-likelihood function based on a sample of n independent observations having the UBXII distribution, i.e., Yi ∼ UBXII (qi, c), can be expressed as
ℓ(θ) = ∑_{i=1}^{n} ℓi(qi, c),  (11)

where ℓi(qi, c) is the logarithm of fY(yi; qi, c) given in Eq (6). Hence,

ℓi(qi, c) = log c − log yi + log log(1/τ) − log log t(qi) + (c − 1) log log(1/yi) − log t(yi) + [log τ/log t(qi)] log t(yi),

where t(x) = 1 + [log(1/x)]^c.
The score vector, obtained by differentiating the log-likelihood function (11) with respect to the unknown parameters βj, j = 1, …, k, and c, is expressed as U = [Uβ(β, c)⊤, Uc(β, c)]⊤. The components of U can be written in matrix notation. To do this, let ai = ∂ℓi(qi, c)/∂qi and bi = ∂ℓi(qi, c)/∂c, for i = 1, …, n. Then, we have

Uβ(β, c) = X⊤ D a  (12)

and

Uc(β, c) = ∑_{i=1}^{n} bi,  (13)

where X is the n × k matrix whose ith row is xi⊤, D = diag{1/g′(q1), …, 1/g′(qn)}, and a = (a1, …, an)⊤. We provide the calculations of the score components in S1 Appendix.
Again, the nonlinear equations Uβ(β, c) = 0 and Uc(β, c) = 0 cannot be solved in closed form. Hence, a nonlinear optimization method must be used to maximize the function (11) and determine the MLEs β̂ and ĉ. We also provide the observed information matrix for (β⊤, c)⊤.
To simplify the notation of its components, some additional quantities are needed. The observed information matrix can be expressed as (see S1 Appendix)

J(β, c) = − ( Jββ  Jβc ; Jcβ  Jcc ),

where Jββ ≡ ∂²ℓ(β, c)/∂β∂β⊤, Jβc ≡ ∂²ℓ(β, c)/∂β∂c = Jcβ⊤, and Jcc ≡ ∂²ℓ(β, c)/∂c². These blocks can be written in matrix notation in terms of M = diag{m1, …, mn}, P = diag{p1, …, pn}, T = diag{g″(q1), …, g″(qn)}, r = (r1, …, rn)⊤, s = (s1, …, sn)⊤, and u = (u1, …, un)⊤; the corresponding expressions and the elements of these quantities are given in S1 Appendix.
As mentioned in Section 3, the matrix J is quite useful for interval estimation and hypothesis testing. Assuming that the SRCs hold and the sample size is large, (β̂⊤, ĉ)⊤ is approximately distributed as N_{k+1}((β⊤, c)⊤, I^{−1}), where I^{−1} is the inverse of the expected information matrix I. It can be consistently estimated by J(β̂, ĉ)^{−1}, which is computed by replacing the unknown parameters (β⊤, c)⊤ with the corresponding MLEs.
5.2 Diagnostic measures and model selection
In order to check the goodness of fit and validate the UBXII regression assumptions, we adopt some well-known diagnostic tools, which are now discussed. Initially, we use quantile residuals. These residuals allow verifying whether the model assumptions are satisfied and identifying when the parameter estimates are considerably affected by atypical observations in the response. If the model is correctly specified, the quantile residuals are approximately standard normally distributed. For the UBXII regression, they are given by

r̂i = Φ^{−1}(FY(yi; q̂i, ĉ)),  i = 1, …, n,

where Φ^{−1}(⋅) is the standard normal qf and FY(⋅) is the UBXII cdf given in Eq (5).
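A Python sketch of the quantile residuals (using the cdf form implied by Eq (5); `statistics.NormalDist.inv_cdf` plays the role of Φ^{−1}, and the names are ours):

```python
import math
from statistics import NormalDist

def ubxii_cdf(y, c, q, tau=0.5):
    # cdf under the quantile parameterization, as in Eq (5)
    t = lambda x: 1.0 + (math.log(1.0 / x)) ** c
    return tau ** (math.log(t(y)) / math.log(t(q)))

def quantile_residuals(y, q_fitted, c_hat, tau=0.5):
    # r_i = Phi^{-1}(F_Y(y_i; q_i, c)); approx. N(0, 1) under correct specification
    nd = NormalDist()
    return [nd.inv_cdf(ubxii_cdf(yi, c_hat, qi, tau)) for yi, qi in zip(y, q_fitted)]
```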
An incorrect specification of the regression's functional form and the omission of covariates can be identified through the RESET test. This test was initially introduced as a general misspecification test for the normal linear regression. Afterward, variants of the RESET test for more general classes of regressions were proposed by [27]. Thus, to determine whether a UBXII regression is misspecified, we propose a RESET-like misspecification test. Next, we explain how this test can be performed.
The RESET-like test is carried out in two steps. Let η̂ = (η̂1, …, η̂n)⊤ be the vector of fitted linear predictors obtained after fitting a UBXII regression. First, we build the matrix of testing variables T = [η̂^{(2)}, η̂^{(3)}], where the vectors η̂^{(2)} and η̂^{(3)} are formed by the squared and cubed components of η̂, respectively. We define the augmented regression

g(qi) = xi⊤β + ti⊤δ,  i = 1, …, n,  (17)
where T is the n × 2 matrix of testing variables with ith row ti⊤, and δ is a 2 × 1 vector of parameters. Second, we estimate Eq (17) and test the null hypothesis H0: δ = 0 against the alternative hypothesis H1: δ ≠ 0 by using the likelihood ratio (LR) statistic. We compute the LR statistic as

ω = 2[ℓ(θ̂) − ℓ(θ̃)],

where ℓ(⋅) is the log-likelihood function, θ̂ is the unrestricted MLE of θ, and θ̃ is the restricted MLE of θ under the null hypothesis. Under H0 and the SRCs, ω converges in distribution to a chi-square distribution with ν degrees of freedom, where ν is the number of testing covariates added to the regression (ν = 2 in this case). Non-rejection of the null hypothesis suggests that the regression is correctly specified.
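For ν = 2, the chi-square survival function has the closed form exp(−ω/2), so the RESET-like decision can be sketched in two small Python helpers (names are ours; the log-likelihood values below are illustrative, not from the paper):

```python
import math

def lr_statistic(loglik_unrestricted, loglik_restricted):
    # omega = 2 * [l(theta_hat) - l(theta_tilde)]
    return 2.0 * (loglik_unrestricted - loglik_restricted)

def reset_pvalue(omega):
    # chi-square survival function with 2 degrees of freedom: P(X > omega) = exp(-omega/2)
    return math.exp(-omega / 2.0)
```

A p-value below the chosen significance level leads to rejecting H0, suggesting misspecification.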
The proportion of the response variable's variability explained by a fitted UBXII regression can be assessed using the generalized (pseudo) R-squared (R²_G) defined by [28] as

R²_G = 1 − exp{(2/n)[ℓ(θ̂0) − ℓ(θ̂)]},

where ℓ(θ̂0) is the log-likelihood of the null regression, i.e., obtained by modeling the response in the absence of covariates, and ℓ(θ̂) is the log-likelihood of the full regression. A regression with a higher value of R²_G has greater power to explain the variation of the response variable.
To select the most suitable among several nested models, information criteria such as the Akaike information criterion (AIC) and the Schwarz information criterion (BIC) can be considered. Both criteria are widely used in practical applications, and they are defined by AIC = −2ℓ(θ̂) + 2p and BIC = −2ℓ(θ̂) + p log(n), where p is the number of estimated parameters.
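Both criteria are one-liners; a Python sketch (names are ours):

```python
import math

def aic(loglik, p):
    # Akaike information criterion: -2 * loglik + 2 * p
    return -2.0 * loglik + 2.0 * p

def bic(loglik, p, n):
    # Schwarz (Bayesian) information criterion: -2 * loglik + p * log(n)
    return -2.0 * loglik + p * math.log(n)
```

Smaller values of either criterion indicate a preferable model among the candidates.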
A way of selecting the best among different non-nested regressions is to assess their performance in predicting the response through statistical learning tools such as the cross-validation approach. Let y = (y1, …, yn)⊤ be the vector of n observations of the response variable and X the covariate matrix as in (10). In statistical learning methods, a training data set is the set of observations on which a model is initially fitted. An accuracy measure is the test error, which results from applying the fitted model to test observations that were not used in training. For example, if we use (y, X) as training observations, the test error for a new observation (y0, x0) is L(y0, ŷ0), where L(⋅, ⋅) is a loss function and ŷ0 is the value predicted by the model fitted from (y, X) evaluated at the predictors x0 (which do not belong to X). To estimate the test error with quadratic loss, we consider the mean squared error (MSE) defined as

MSE = (1/n) ∑_{i=1}^{n} (yi − ŷi)²,

where ŷi is the value predicted by the regression for the ith observation. This statistical measure is small if the predictions are very close to the true responses, and it is large if, for some of the observations, the predicted and true responses differ substantially [29].
As the cross-validation method, we propose the leave-one-out cross-validation (LOOCV). In this approach, we separate the ith observation (the ith row of a data set in which the response and covariates are disposed by columns) from the other n − 1 observations, which form the training set, whereas row i is the validation set. For each removed observation, we use the model fitted on the training set to predict the ith observation of the validation set. Then, we estimate the test error by computing MSEi. Repeating this procedure n times, we obtain MSE1, …, MSEn. The final estimate of the test error is computed as the average of those n statistics [29]:

CV(n) = (1/n) ∑_{i=1}^{n} MSEi.

Hence, we select the regression that provides the smallest value of CV(n).
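A generic LOOCV sketch in Python (the `fit` and `predict` callables are placeholders for any of the regressions considered; names are ours):

```python
def loocv(y, X, fit, predict, loss=lambda yi, pred: (yi - pred) ** 2):
    # leave-one-out cross-validation: CV(n) = average of the n test errors
    n = len(y)
    errors = []
    for i in range(n):
        y_train = y[:i] + y[i + 1:]      # training set: all rows except i
        X_train = X[:i] + X[i + 1:]
        model = fit(y_train, X_train)    # refit on the n - 1 observations
        errors.append(loss(y[i], predict(model, X[i])))
    return sum(errors) / n
```

With quadratic loss, each term is the squared prediction error for the held-out observation, so the returned value estimates the test MSE.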
Finally, we perform an influence analysis to detect possible influential points, such as outliers. For this, the generalized Cook distance (GD) is considered. It is a measure of global influence based on eliminating the ith observation (i = 1, …, n) to study its effect. The GD is computed as

GDi = (θ̂(i) − θ̂)⊤ J(θ̂) (θ̂(i) − θ̂),  (18)

where θ̂(i) is the MLE obtained when the ith observation is deleted, and J(θ̂) is the observed information matrix evaluated at the MLEs. We consider a general rule of thumb as a threshold for flagging highly influential points: if GDi > 4/n, the ith observation is considered influential.
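The GD in (18) is a quadratic form in the case-deletion parameter change; a Python sketch (names are ours):

```python
def gen_cook_distance(theta_hat, theta_del, J):
    # GD_i = (theta_(i) - theta_hat)^T J(theta_hat) (theta_(i) - theta_hat)
    d = [a - b for a, b in zip(theta_hat, theta_del)]
    k = len(d)
    return sum(d[r] * J[r][s] * d[s] for r in range(k) for s in range(k))

def is_influential(gd_i, n):
    # rule of thumb used in the paper: flag observation i if GD_i > 4/n
    return gd_i > 4.0 / n
```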
5.3 Simulation study
In this section, a Monte Carlo simulation study is conducted to numerically evaluate the finite-sample behavior of the MLEs of the UBXII regression parameters. The Monte Carlo experiments are performed using the R programming language [30]. Maximization of the log-likelihood function in (11) is carried out using the BFGS quasi-Newton nonlinear optimization algorithm implemented in the optim function available in R. We consider the ordinary least squares estimates (OLSEs) as an initial guess for β, obtained from a linear regression of the transformed responses z = [g(y1), …, g(yn)]⊤, i.e., the initial point estimate of β is (X⊤X)^{−1} X⊤ z. For the shape parameter c, we take the same initial guess as in Section 4.
The simulations are based on the UBXII regression

logit(qi) = β1 + β2 xi2,  i = 1, …, n.  (19)

The covariate x2 is randomly generated from a standard normal distribution. We combine various values of the parameter vector θ = (β1, β2, c)⊤ in six different scenarios. The number of Monte Carlo replications and the sample sizes considered are the same as in Section 4. In each Monte Carlo replication, the inversion method is used to generate n occurrences of the random variables Yi ∼ UBXII (qi, c). By assuming the regression structure defined in Eq (19), it follows that

qi = exp(β1 + β2 xi2) / [1 + exp(β1 + β2 xi2)],

i.e., qi equals the logistic cdf evaluated at (β1 + β2 xi2). The statistical quantities computed are also the same as in Section 4.
Table 2 presents the results of the Monte Carlo simulations. In general, the RB%s are smaller for larger sample sizes. We note that the largest RB% equals 10.02, in scenario four for the smallest sample size, and it refers to the estimate of c. For the estimates of the parameters β1 and β2, all RB%s are below 6.25. In addition, even for n = 25, the RMSE values are quite low in every scheme.
Fig 6 displays plots of the total RB% and total RMSE versus sample size. They reveal that the MLEs are consistent and that their biases quickly tend to zero when the sample size grows. Further, the largest total RB% is about 20, but it decays to less than 5 by n = 75. Thus, as expected, the ML asymptotic properties hold.
We also investigate the behavior of the proposed model in competition with the Kumaraswamy (Kw), unit-Weibull (UW) [7], and beta [4] regressions, which are well known in the analysis of limited data. We aim to compare the performance of the maximum likelihood estimators and to investigate their behavior under misspecification of the distribution. We also evaluate the behavior of the AIC and BIC as criteria for selecting among models from different distributions.
Let Y be a random variable Kw distributed under a median-dispersion parameterization [5], say Y∼ Kw(ω, dp). The pdf of Y is
where 0 < ω < 1 is the median of Y and dp > 0 is a dispersion parameter.
The UW quantile regression was recently introduced by [7]. Let Y∼ UW(q, γ) be a random variable having the UW distribution under the parameterization given in [7]. For y ∈ (0, 1), the random variable Y has density
where 0 < q < 1 is the τth quantile, γ > 0 is a shape parameter, and τ ∈ (0, 1) is assumed known. Here, it will be considered that τ = 0.5 in order to model the median of Y.
[4] pioneered the beta regression. Different parameterizations can be considered for the beta distribution. We consider the mean-precision based parameterization. Let Y be a random variable that follows a beta distribution, say Y∼ Beta(μ, ϕ). For y ∈ (0, 1), the Y density is
where 0 < μ < 1 is the mean of Y, ϕ > 0 is a precision parameter and
is the complete gamma function. Under this parameterization the variance of Y is V(μ)/(1+ ϕ), with V(μ) = μ(1 − μ).
The regression structure for the Kw, UW, and beta distributions is analogous to (19). The main differences lie in the assumptions on the random components and the modeled location parameters. To get the Kw regression, q must be replaced by the median (ω) in Eq (19), and it is supposed that Yi ∼ Kw(ωi, dp). The UW regression is obtained by considering the structure (19) and assuming that Yi ∼ UW(qi, γ). In the beta regression, the location parameter is the mean (μ). Hence, in Eq (19), q must be replaced by μ, and it is supposed that Yi ∼ Beta(μi, ϕ). Thus, considering these regression structures, the simulation was performed in the following steps.
- We generate a sample with n = 100 observations for each regression. The parameter values were selected from Scenario 1 in Table 2, replacing c for the respective shape parameter.
- We fit the true regression and the UBXII regression for each generating scheme. When the UBXII is the true model, we fit all the competitors.
- We compute the MSE, AIC, and BIC for all fitted models.
- For each scenario considered, 5,000 replications were performed.
- For the MSE, we compute the average of all replications. For the AIC and BIC, the frequencies (%) of correct model selection are computed.
Table 3 displays the performance of the UBXII model compared with the existing ones; the MSE estimates for the UBXII, Kw, UW, and beta regressions are reported there. We observe that the estimates obtained with the beta and Kw regressions differ from those of the UBXII and UW distributions, while the latter two present estimates close to each other. This shows that the traditional beta and Kw densities are not suitable for describing data generated from the UBXII. The UW is the most competitive model but still presents a worse performance when fitting UBXII random variables. Concerning the model selection approaches, all measures were able to select the correct model. The results for AIC and BIC were very similar, with success rates exceeding 92% for all generating schemes. Thus, the information criteria are reliable for model selection among the considered competing models.
6 Application
In this section, we assess the UBXII regression performance on real data. The analysis is carried out using the R statistical computing environment [30]. We fit the UBXII regression and compare it with the Kw, UW [7], and beta [4] regressions, which are well known in the analysis of limited data and were also considered in the simulation experiment. The R codes of the simulation studies and application are available at https://github.com/tatianefribeiro/UBXII_regression. We obtained the data from the higher education census conducted yearly by the Brazilian National Institute for Educational Studies and Research "Anísio Teixeira". We are interested in the dropout proportion for animal sciences courses and factors associated with their enrollment and organizational structure. However, the response variable is not directly available in the original data set, and we use data mining techniques to obtain it from other reported variables. After preprocessing and cleaning steps, we selected 40 covariates as possible predictors. A detailed description of the data mining tools employed and the final data set are available in Supporting information.
The UBXII, Kw, UW, and beta regressions are also used as data mining tools to select a subset of predictors that properly fits the dropout proportion. We test several combinations of predictors, using the measures described in Section 5.2 to define the final regression in each class. In what follows, we describe the response variable and the predictive covariates used in our regression analysis.
The response variable is the dropout proportion from 2009 until 2017 of 77 Brazilian undergraduate animal sciences courses. For each course i (i = 1, …, 77), we consider three covariates as follows: i) quantity of vacancies offered in the morning shift, denoted by xi2; ii) a dummy variable that equals one if the course guarantees conditions of accessibility for people with disabilities, and zero otherwise, denoted by xi3; and iii) a dummy variable, denoted by xi4, that equals one if the course works on the night shift, and zero otherwise.
Let y = (y1, …, y77)⊤ be the vector of the response variable and X = (x1, …, x4) the covariate matrix, where x1 is a column vector of 77 ones and xj = (x1j, …, x77j)⊤ for j = 2, …, 4. Table 4 provides a descriptive summary of the response variable (y) and the quantitative covariate (x2), revealing that y has a negatively skewed distribution and lighter tails than a normal distribution. Further, its mean is close to the median, the standard deviation (SD) is low, and the range of values is sizeable, since the minimum and maximum are 0.1077 and 0.9714, respectively. The covariate x2 presents different degrees of variability, skewness, and kurtosis.
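Descriptive measures of this kind follow from the sample moments. The sketch below (plain Python; the `describe` helper is a hypothetical illustration, not the paper's code) returns the mean, median, SD, skewness, and excess kurtosis, where negative skewness and negative excess kurtosis correspond to the left-skewed, lighter-than-normal-tails pattern reported for y.

```python
import math

def describe(x):
    # mean, median, SD, skewness, and excess kurtosis of a sample
    n = len(x)
    mean = sum(x) / n
    xs = sorted(x)
    median = xs[n // 2] if n % 2 else (xs[n // 2 - 1] + xs[n // 2]) / 2
    m2 = sum((v - mean) ** 2 for v in x) / n   # central moments
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    sd = math.sqrt(n * m2 / (n - 1))           # unbiased-variance SD
    return {
        "mean": mean, "median": median, "sd": sd,
        "skewness": m3 / m2 ** 1.5,            # < 0: negatively skewed
        "excess_kurtosis": m4 / m2 ** 2 - 3.0, # < 0: lighter tails than normal
    }
```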
To study the covariates’ effects on the median dropout proportion, we set τ = 0.5 and specify the UBXII regression as logit(qi(0.5)) = β1 + β2 xi2 + β3 xi3 + β4 xi4, for i = 1, …, 77, where logit(q) = log[q/(1 − q)].
For comparison purposes, we also fit the Kw, UW, and beta regressions with the same combination of covariates and the same link function.
Table 5 reports some goodness-of-fit measures, namely the AIC, the BIC, the pseudo-R² [28], the p-values of the Anderson-Darling (AD) test [31] for the null hypothesis that the errors are normally distributed, the p-values of the RESET-like test (RES), and the statistic obtained from the LOOCV approach (CV(77)), which assesses the prediction performance of the fitted regressions. We consider α = 0.05 as the significance level for all hypothesis tests. According to the RESET-like tests, all models are correctly specified. Similarly, the p-values of the Anderson-Darling tests indicate that it is reasonable to assume normality of the errors for each class. Most of the goodness-of-fit measures suggest that the UBXII regression is more suitable than the other considered regressions for fitting the dropout proportion of Brazilian animal sciences courses between 2009 and 2017. Moreover, the CV(77) estimate for the fitted UBXII regression is the smallest among all fitted regressions, meaning that the proposed regression yields better predictions than the classical regressions used for responses restricted to the unit interval. Indeed, Fig 7 shows that the UBXII regression provides the best fit for this data set, since about 97% of the points lie under the red line in the QQ-plot of the fitted UBXII regression’s residuals.
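The CV(77) statistic is a leave-one-out cross-validation (LOOCV) measure: the model is refitted 77 times, each time holding out one observation and predicting it, and the squared prediction errors are averaged. A minimal sketch of the procedure (Python, with ordinary least squares as a hypothetical stand-in for the fitted regression; not the paper's code):

```python
import numpy as np

def loocv_mse(X, y, fit, predict):
    """CV(n): refit the model n times, each time holding out one
    observation, and average the squared prediction errors."""
    n = len(y)
    errors = []
    for i in range(n):
        mask = np.arange(n) != i          # drop observation i
        params = fit(X[mask], y[mask])
        errors.append((y[i] - predict(X[i], params)) ** 2)
    return float(np.mean(errors))

# ordinary least squares as a hypothetical stand-in for a fitted regression
def ols_fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def ols_predict(x, beta):
    return float(x @ beta)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(77), rng.normal(size=77)])   # n = 77, as in the application
y = X @ np.array([0.5, 0.2]) + rng.normal(scale=0.05, size=77)
cv77 = loocv_mse(X, y, ols_fit, ols_predict)              # smaller is better
```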
In Table 6, we provide the parameter estimates, standard errors, t statistics, and p-values for the UBXII regression. Results for the other fitted regressions are given in Supporting information; see Table 2. The effect of the three considered covariates on the response’s median is positive. Further, according to the estimate of β4, the covariate xi4 has the greatest impact on the median; that is, the odds ratio increases substantially if the course works on the night shift. This result may be related to the fact that many night students need to work during the day, making it challenging to persist [32]. However, the offer of night courses results from gains achieved through popular pressure to meet the needs of a population consisting mainly of workers [33]. Thus, our finding raises the discussion on the need to provide a better service to this public; for example, the low offer of extracurricular activities for evening students is one of the problems reported by [33].
Fig 8 plots the GD for the UBXII regression. Only observation 32 stands out from the others. It corresponds to the Faculdade de Estudos Superiores de Minas Gerais, whose dropout proportion of 0.8163 lies above the third quartile of the data set. However, it is not potentially influential, since its associated GD is smaller than 4/n. Fig 9 assesses the impact of different τ values on the parameter estimates. We compute the 95% confidence intervals and point estimates for the UBXII regression for τ ∈ {0.1, 0.2, …, 0.9}. We observe that the intercept estimates increase with τ, while the other regression coefficients are negatively related to the quantiles, indicating that the covariates have a stronger impact in explaining smaller quantiles of the dropout proportion. Finally, the shape parameter estimate does not seem to be affected by variations of τ. It is worth noting that similar behavior is reported by [7] for the shape parameter of the UW regression.
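To illustrate the 4/n rule of thumb used above, the following sketch computes Cook's distance for an OLS fit and flags observations exceeding 4/n (Python; classical OLS Cook's distance is a hypothetical stand-in for the GD of the UBXII regression, not the paper's code):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for OLS, a classical analogue of the GD:
    D_i = r_i^2 * h_ii / (p * s^2 * (1 - h_ii)^2)."""
    n, p = X.shape
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages h_ii
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)                    # residual variance
    return resid ** 2 * h / (p * s2 * (1 - h) ** 2)

rng = np.random.default_rng(42)
X = np.column_stack([np.ones(77), rng.normal(size=77)])
y = 0.5 + 0.2 * X[:, 1] + rng.normal(scale=0.1, size=77)
d = cooks_distance(X, y)
flagged = np.flatnonzero(d > 4 / len(y))            # 4/n rule of thumb
```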
7 Conclusions
We define a new unit quantile regression based on an alternative parameterization of the unit Burr XII (UBXII) distribution pioneered by [8]. A highlight of the proposed parameterization is that one of its parameters, q(τ), represents the τth quantile of the random variable. The researcher chooses the value of τ and assumes a regression structure on q(τ). We investigated some statistical quantities additional to those explored by [8], namely the score functions and the observed information matrix. The maximum likelihood method is used for parameter estimation, and Monte Carlo simulations indicate that the estimators have good finite-sample properties. We also adapt several diagnostic and model selection techniques that can be employed to check the goodness-of-fit of the estimated model.
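To fix ideas, the quantile reparameterization can be sketched as follows. Using the UBXII cdf F(y; c, k) = [1 + (−log y)^c]^(−k) on (0, 1) from [8], setting F(q(τ)) = τ gives k = −log τ / log[1 + (−log q(τ))^c], so the pair (q(τ), c) replaces (k, c). The Python sketch below (hypothetical function names, not the paper's code) implements the cdf, the quantile function, and this reparameterization, and draws samples by inverse transform:

```python
import math
import random

def ubxii_cdf(y, c, k):
    # cdf of the unit Burr XII on (0, 1): F(y) = [1 + (-log y)^c]^(-k)
    return (1.0 + (-math.log(y)) ** c) ** (-k)

def ubxii_quantile(u, c, k):
    # inverse cdf, obtained by solving F(y) = u for y
    return math.exp(-((u ** (-1.0 / k) - 1.0) ** (1.0 / c)))

def k_from_quantile(q, c, tau):
    # reparameterization: the k that makes q the tau-th quantile
    return -math.log(tau) / math.log(1.0 + (-math.log(q)) ** c)

c, tau, q = 2.5, 0.5, 0.7
k = k_from_quantile(q, c, tau)      # now F(0.7) = 0.5 by construction

# inverse-transform sampling from the reparameterized distribution
random.seed(1)
sample = [ubxii_quantile(random.random(), c, k) for _ in range(20000)]
```

With k chosen this way, the empirical median of the sample concentrates around q(0.5) = 0.7, which is what allows a regression structure to be placed directly on the quantile.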
The utility of the proposed regression is illustrated with an application aimed at explaining the relation between the dropout proportion of Brazilian undergraduate animal sciences courses and some factors associated with their enrollment and organizational structure. An essential aspect of quantile-based regression is the possibility of separately analyzing the covariates’ marginal effects on each quantile of the response. That allowed us to find that the effects of some factors, such as the number of vacancies, accessibility, and night shift, are weaker on courses with fewer dropouts (those belonging to the lower quantiles). Another notable result is the positive association between night-shift courses and the dropout rate. This phenomenon is explained by the work carried out by the students during the day, which makes persistence difficult. This situation must be considered by educational policy makers, since the opening of vacancies in the night shift must also be complemented by student attendance policies.
Additionally, we fit the data set using other well-known regression models: the Kumaraswamy, unit-Weibull, and beta regressions. The fit of the UBXII regression is superior to all of them, since it provides better prediction performance. Thus, the UBXII regression is a competitive alternative for modeling data restricted to the unit interval and can be applied when the classical regressions are unsuitable. This ability to capture the nature of double-bounded variables gives the new model a wide range of applications; in the educational area, for example, it can be helpful for modeling indicators such as graduation and persistence proportions of undergraduate and postgraduate courses. It may also be an alternative for educational measurements from different countries, such as in the applications provided by [13–16].
We end with some comments on possible future work. Conventional regression modeling assumes that the errors are serially uncorrelated. In that sense, one extension of our proposal is the development of models that combine exogenous covariates in the median response with an autoregressive moving average structure to handle serial dependence. The UBXII regression can also be extended to neutrosophic statistical analysis, which is applied when data, or part of them, are indeterminate, that is, when the data contain uncertain observations. Recently, some studies have been conducted in this context: [34] introduced the neutrosophic analysis of variance to test teaching methods using data collected from university students; [35] proposed a new Z-test for uncertain events under neutrosophic statistics, applied to Covid-19 data; and, in the regression context, [36] concluded that neutrosophic multiple regression is preferable to classical regression models for forecasting uncertain observational data.
Supporting information
S1 File. Supplement to “The unit Burr XII regression: Properties, Simulation and application”.
It provides a detailed description of the data mining tools employed to obtain the final data set used in the application study in Section 6, and results from the Kw, UW, and beta regressions fitted to the dropout proportion of Brazilian animal sciences courses.
https://doi.org/10.1371/journal.pone.0276695.s001
(PDF)
S1 Data. Dropout proportion of Brazilian animal sciences courses data set.
Data set used in the application study in Section 6.
https://doi.org/10.1371/journal.pone.0276695.s002
(ODS)
References
- 1. Rodríguez-Muñiz LJ, Bernardo AB, Esteban M, Díaz I. Dropout and transfer paths: What are the risky profiles when analyzing university persistence with machine learning techniques? PLoS ONE. 2019;14(6):218–796. pmid:31226158
- 2. Sneyers E, De Witte K. The interaction between dropout, graduation rates and quality ratings in universities. Journal of the Operational Research Society. 2017;68(4):416–430.
- 3. Srairi S. An Analysis of Factors Affecting Student Dropout: The Case of Tunisian Universities. International Journal of Educational Reform. 2022;31(2):168–186.
- 4. Ferrari S, Cribari-Neto F. Beta regression for modelling rates and proportions. Journal of Applied Statistics. 2004;31(7):799–815.
- 5. Mitnik PA, Baek S. The Kumaraswamy distribution: median-dispersion re-parameterizations for regression modeling and simulation-based estimation. Statistical Papers. 2013;54(1):177–192.
- 6. Bayes CL, Bazán JL, De Castro M. A quantile parametric mixed regression model for bounded response variables. Statistics and its interface. 2017;10(3):483–493.
- 7. Mazucheli J, Menezes AFB, Fernandes LB, de Oliveira RP, Ghitany ME. The unit-Weibull distribution as an alternative to the Kumaraswamy distribution for the modeling of quantiles conditional on covariates. Journal of Applied Statistics. 2020;47(6):954–974. pmid:35706917
- 8. Korkmaz MÇ, Chesneau C. On the unit Burr-XII distribution with the quantile regression modeling and applications. Computational and Applied Mathematics. 2021;40(1):1–26.
- 9. Ribeiro TF, Cordeiro GM, Peña-Ramírez FA, Guerra RR. A new quantile regression for the COVID-19 mortality rates in the United States. Computational and Applied Mathematics. 2021;40(7):1–16.
- 10. Korkmaz MÇ, Altun E, Alizadeh M, El-Morshedy M. The Log Exponential-Power Distribution: Properties, Estimations and Quantile Regression Model. Mathematics. 2021;9(21):2634.
- 11. Korkmaz MÇ, Chesneau C, Korkmaz ZS. On the arcsecant hyperbolic normal distribution. Properties, quantile regression modeling and applications. Symmetry. 2021;13(1):117.
- 12. Mazucheli J, Alves B, Korkmaz MÇ, Leiva V. Vasicek Quantile and Mean Regression Models for Bounded Data: New Formulation, Mathematical Derivations, and Numerical Applications. Mathematics. 2022;10(9):1389.
- 13. Korkmaz M, Chesneau C, Korkmaz ZS. Transmuted unit Rayleigh quantile regression model: Alternative to beta and Kumaraswamy quantile regression models. Univ Politeh Buchar Sci Bull Ser Appl Math Phys. 2021;83:149–158.
- 14. Korkmaz MÇ, Chesneau C, Korkmaz ZS. A new alternative quantile regression model for the bounded response with educational measurements applications of OECD countries. Journal of Applied Statistics. 2021; p. 1–24.
- 15. Korkmaz MÇ, Chesneau C, Korkmaz ZS. The Unit Folded Normal Distribution: A New Unit Probability Distribution with the Estimation Procedures, Quantile Regression Modeling and Educational Attainment Applications. Journal of Reliability and Statistical Studies. 2022; p. 261–298.
- 16. Korkmaz MÇ, Korkmaz ZS. The unit log–log distribution: a new unit distribution with alternative quantile regression modeling and educational measurements applications. Journal of Applied Statistics. 2021; p. 1–20.
- 17. Saini S, Tomer S, Garg R. On the reliability estimation of multicomponent stress–strength model for Burr XII distribution using progressively first-failure censored samples. Journal of Statistical Computation and Simulation. 2022;92(4):667–704.
- 18. Araújo FJMd, Guerra RR, Peña-Ramírez FA. The Burr XII quantile regression for salary-performance models with applications in the sports economy. Computational and Applied Mathematics; Accepted.
- 19. Guerra RR, Peña-Ramírez FA, Cordeiro GM. The Weibull Burr XII distribution in lifetime and income analysis. Anais da Academia Brasileira de Ciências. 2021;93. pmid:34105700
- 20. Bhatti FA, Hamedani GG, Korkmaz MÇ, Sheng W, Ali A. On the Burr XII-moment exponential distribution. PLoS ONE. 2021;16(2):e0246935. pmid:33617564
- 21. Guerra RR, Peña-Ramírez FA, Bourguignon M. The unit extended Weibull families of distributions and its applications. Journal of Applied Statistics. 2021;48(16):3174–3192. pmid:35707261
- 22. Ribeiro-Reis LD. Unit Log-Logistic Distribution and Unit Log-Logistic Regression Model. Journal of the Indian Society for Probability and Statistics. 2021;22(2):375–388.
- 23. Peffer PAL. Demographics of an Undergraduate Animal Sciences Course and the Influence of Gender and Major on Course Performance. NACTA Journal. 2011;55(1):26–31.
- 24. Lemonte AJ, Bazán JL. New class of Johnson distributions and its associated regression model for rates and proportions. Biometrical Journal. 2016;58(4):727–746. pmid:26659998
- 25. Mousa A, El-Sheikh A, Abdel-Fattah M. A gamma regression for bounded continuous variables. Advances and Applications in Statistics. 2016;49(4):305–326.
- 26. Lindsay BG, Li B. On second-order optimality of the observed Fisher information. The Annals of Statistics. 1997;25(5):2172–2199.
- 27. Pereira TL, Cribari-Neto F. Detecting model misspecification in inflated beta regressions. Communications in Statistics—Simulation and Computation. 2014;43(3):631–656.
- 28. Nagelkerke NJ, et al. A note on a general definition of the coefficient of determination. Biometrika. 1991;78(3):691–692.
- 29. James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning. vol. 112. Springer; 2013.
- 30. R Core Team. R: A Language and Environment for Statistical Computing; 2020. Available from: https://www.R-project.org/.
- 31. Stephens MA. EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association. 1974;69(347):730–737.
- 32. Costa FJd, Bispo MdS, Pereira RdCdF. Dropout and retention of undergraduate students in management: a study at a Brazilian Federal University. RAUSP Management Journal. 2018;53:74–85.
- 33. do Nascimento MdC, de Ribeiro Vieira PM, de Carvalho FMT, da Figueira MAS, Godoy GP. Perception of graduates about the quality of the night course in dentistry at a public institution in northeastern Brazil. Revista da ABENO. 2021;21(1):1044–1044.
- 34. Aslam M. Neutrosophic analysis of variance: application to university students. Complex & intelligent systems. 2019;5(4):403–407.
- 35. Aslam M. Design of a new Z-test for the uncertainty of Covid-19 events under Neutrosophic statistics. BMC Medical Research Methodology. 2022;22(1):1–6. pmid:35387604
- 36. Nagarajan D, Broumi S, Smarandache F, Kavikumar J. Analysis of neutrosophic multiple regression. Neutrosophic Sets and Systems. 2021;43:44–53.