
Bootstrap-based inferential improvements to the simplex nonlinear regression model

  • Alisson de Oliveira Silva,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Instituto Federal de Educação, Ciência e Tecnologia da Paraíba, João Pessoa, Brazil

  • Jonas Weverson de Ararújo Silva,

    Roles Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – review & editing

    Affiliation Centro de Ciências Agrárias, Departamento de Ciências Fundamentais e Sociais, Areia, Paraíba, Brazil

  • Patrícia L. Espinheira

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Supervision, Validation, Visualization, Writing – review & editing

    patespipa@de.ufpe.br

    Affiliation Departamento de Estatística, Universidade Federal de Pernambuco, Recife, Pernambuco, Brazil

Abstract

In this paper we evaluate the performance of point and interval estimators based on the maximum likelihood (ML) method for the nonlinear simplex regression model. Inferences based on traditional maximum likelihood estimation have good asymptotic properties, but their performance in small samples may not be satisfactory. At the outset we consider maximum likelihood estimation of the parameters of the nonlinear simplex regression model, and then introduce a bootstrap-based bias correction for these estimators. We also develop percentile and bootstrap-t confidence intervals for those parameters as competitors to the traditional approximate confidence interval based on the asymptotic normality of the maximum likelihood estimators (MLEs). We then numerically evaluate the performance of these different methods for estimating the simplex regression model. The numerical evidence favors inference based on the bootstrap method, in particular the bootstrap-t interval, which proved decisive in an application to real data.

1 Introduction

Normal linear regression models are widely used in the most diverse areas of knowledge. Recently, several regression models have been proposed for doubly-constrained response variables, which assume continuous values in (a, b), where a and b are known and −∞ < a < b < ∞; such a support can easily be transformed to the unit interval.

In this context, where y ∈ (0, 1) (or y ∈ (a, b)), the normal linear model is inadequate: besides the possibility of fitted values smaller than 0 (or a) or larger than 1 (or b), the data generally present asymmetry and heteroscedasticity, violating the usual assumptions of that model. Thus, it seems more appropriate to consider models based on distributions naturally supported on (0, 1), such as the simplex regression model proposed by [1].

The simplex distribution was developed from the generalized inverse Gaussian distribution and is part of the class of dispersion models defined by [2], which extend the generalized linear models of [3]. Several studies have used this distribution. For example, [4] used it to analyze longitudinal data with a constant dispersion parameter, via generalized estimating equations. [5] modified this approach by allowing the dispersion parameter to vary across observations. Based on dispersion models, and using a Bayesian approach with Monte Carlo simulations, [6] evaluated the estimators of the parameters of the simplex model with variable dispersion.

Other approaches for modeling limited data are the beta [7], Kumaraswamy [8], Johnson SB [9], and unit gamma [10] regression models. Recently published papers point to possible advantages of the latter distribution over the beta distribution [11, 12]. Recently, [13] proposed the class of nonlinear simplex regression models, estimating the model parameters by maximum likelihood and deriving local influence quantities. The authors showed that, when the data are concentrated at the extremes of the standard unit interval, the maximum likelihood estimation process of the simplex model is more stable than that of the beta regression model. [14] presented the zero-and-one-inflated simplex distribution for modeling proportion data, introduced a new algorithm to compute maximum likelihood estimates of the parameters of the simplex distribution without covariates, and developed likelihood-based inference methods for the regression model based on this new distribution.

The study of the behavior of maximum likelihood estimators in small samples is an important area of research. These estimators can be biased when the sample size is small or even moderate. The bias is, in fact, a measure of average risk: the average risk incurred in replacing the true value of the parameter with a plausible estimated value. Bias can also be seen as how far the mean of an estimator lies from the true value of the parameter. Thus, it is desirable to obtain estimators with reduced bias in finite samples; as the sample size grows, the bias tends to zero. Several approaches to obtaining less biased estimators in small samples exist in the literature. Here, we adopt a bias correction obtained from the bootstrap method [15].

In statistical inference it is of fundamental importance to attach reliability to the point estimates of the model, and one way to do this is by constructing interval estimators of the parameters together with the probability that the intervals contain the true values of these parameters. Confidence intervals can be obtained under the assumption that the asymptotic distribution of the maximum likelihood estimators is normal, which may require large samples to ensure the validity of this approximation. In small samples, an alternative for constructing confidence intervals with good performance, with respect to both the coverage rate and the length of the interval, is the bootstrap method [15]. Specifically, we adopt two bootstrap-based confidence intervals, namely the percentile and bootstrap-t intervals. These two schemes typically have empirical coverage rates very close to the nominal ones [16].

Regarding the modeling of limited continuous data, several authors have already developed improvements to inference based on the maximum likelihood estimation method. [17] propose both the nonlinear beta regression model and improvements for the maximum likelihood estimators. [18] present corrections to the generalized likelihood ratio statistic (LR) based on [19] for the class of beta regression models, whereas [11] used the same strategy for the unit gamma distribution. [20] evaluate the impact of model misspecification on the empirical coverage of three bootstrap prediction intervals. [21] discuss test inference in small samples in the class of beta regression models; the authors consider the LR test and its bootstrap versions, and show that the standard LR test tends to be quite liberal in small samples and that bootstrap-based tests provide more reliable inference even when the sample size is very small.

In this paper our aim is twofold. At the outset, we develop bootstrap-based inferential improvements for the parameters that index the class of nonlinear simplex regression models proposed by [13], in which both the mean of the response variable and the dispersion parameter are related to covariates through nonlinear predictors. We then jointly evaluate the performance of the competing estimators, namely the MLEs and the bootstrap-based estimators introduced here.

We evaluate several aspects of interval estimation via Monte Carlo simulations. The bootstrap method proves to be an important estimation tool for nonlinear simplex regression, since it allows us to circumvent several finite-sample inferential problems of the MLEs. Finally, we present an application to real data from the Chemistry Department of the National University of Colombia.

2 Nonlinear simplex regression model

In the literature there are several discrete and continuous distributions that belong to the class of dispersion models, among them the normal, inverse Gaussian, gamma, von Mises, Poisson, binomial, and negative binomial distributions. In particular, a random variable y follows the simplex distribution, denoted S(μ, σ²), with parameters 0 < μ < 1 and σ² > 0, if its density takes the form

p(y; μ, σ²) = {2πσ²[y(1 − y)]³}^(−1/2) exp{−d(y; μ)/(2σ²)}, 0 < y < 1, (1)

where the deviance component d(y; μ) is given by

d(y; μ) = (y − μ)² / [y(1 − y)μ²(1 − μ)²].

The variance function for the simplex distribution is expressed as V(μ) = μ³(1 − μ)³. The mean and variance of this distribution are given, respectively, by E(y) = μ and

Var(y) = μ(1 − μ) − √(1/(2σ²)) exp{1/(2σ²μ²(1 − μ)²)} Γ(1/2, 1/(2σ²μ²(1 − μ)²)),

where Γ(a, b) corresponds to the incomplete gamma function, defined by Γ(a, b) = ∫_b^∞ t^(a−1) e^(−t) dt. For more details on these properties, see [2]. The simplex distribution is quite flexible for modeling data in the continuous interval (0, 1), showing different shapes according to the values of the parameters that index the distribution: for example, the J shape for S(0.9, 36), the U shape for S(0.5, 121) and the inverse J shape for S(0.1, 36), in addition to the common left-asymmetric, right-asymmetric and symmetric shapes. Also, unlike the beta distribution, the simplex model is very useful for accommodating data with bimodal distributions, for example S(0.5, 20).
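The density above is easy to evaluate numerically. The following Python fragment (illustrative only; the function names are ours and Python is used merely for exposition) implements the simplex density from (1) and checks by a Riemann sum that it integrates to one:

```python
import math

def simplex_deviance(y, mu):
    """Deviance component d(y; mu) = (y - mu)^2 / [y(1 - y) mu^2 (1 - mu)^2]."""
    return (y - mu) ** 2 / (y * (1 - y) * mu ** 2 * (1 - mu) ** 2)

def simplex_pdf(y, mu, sigma2):
    """Density of S(mu, sigma2) on (0, 1), following Eq. (1)."""
    norm = math.sqrt(2 * math.pi * sigma2 * (y * (1 - y)) ** 3)
    return math.exp(-simplex_deviance(y, mu) / (2 * sigma2)) / norm

# Sanity check: the density should integrate to 1 over (0, 1) (Riemann sum).
n_grid = 200_000
h = 1.0 / (n_grid + 1)
area = sum(simplex_pdf((i + 1) * h, 0.5, 4.0) for i in range(n_grid)) * h
```

Near the endpoints the exponential term dominates the (y(1 − y))^(−3/2) factor, so the density vanishes smoothly and the numerical integral is stable.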

Let y1, …, yn be independent random variables, where each yt, t = 1, …, n, follows a simplex distribution whose probability density function is given by (1), with mean μt and dispersion parameter σ²t. The nonlinear simplex regression model proposed by [13] is defined by (1) and the systematic components

g(μt) = f1(xt; β) = ηt and h(σ²t) = f2(zt; γ) = ζt, (2)

where β = (β1, …, βk) and γ = (γ1, …, γq) are unknown regression parameter vectors, with k + q < n, η = (η1, …, ηn) and ζ = (ζ1, …, ζn) are nonlinear predictors, and xt and zt are, respectively, vectors of k1 and q1 observations of known covariates, which may coincide fully or partially, such that k1 ≤ k and q1 ≤ q.

In linear models k1 = k and q1 = q, and therefore xt and zt are, respectively, the t-th rows of the matrices X and Z, for t = 1, …, n; linear models are thus a particular case of nonlinear models. When there is nonlinearity in the parameters, at least one of the derivatives ∂f1(⋅; β)/∂βj, j = 1, …, k, depends on β = (β1, …, βk), and at least one of the derivatives ∂f2(⋅; γ)/∂γl, l = 1, …, q, depends on γ = (γ1, …, γq). For linear simplex models, these derivatives depend only on the covariates x1, …, xk and z1, …, zq, respectively, and the derivative matrices reduce to X and Z.

Moreover, in (2) the link functions g(⋅) and h(⋅) are strictly monotone and at least twice differentiable. Different link functions can be chosen for g and h. For example, for μ we can use the logit function g(μ) = log{μ/(1 − μ)}, the probit function g(μ) = Φ−1(μ), where Φ(⋅) denotes the standard normal distribution function, the log-log function g(μ) = −log{−log(μ)} and the complementary log-log function g(μ) = log{−log(1 − μ)}, among others. Since σ² > 0, we can use the log function h(σ²) = log(σ²) or the identity function h(σ²) = σ². However, one should check whether the estimates resulting from the likelihood maximization take positive values. If the identity link function is indeed appropriate, negative estimates of σ²t, t = 1, …, n, shall not occur, and the diagnostic analysis shall corroborate the model's goodness of fit to the data under such a link function. For more details, see [3, 22]. Finally, in (2) we have that g(μt) = ηt and h(σ²t) = ζt, t = 1, …, n, are the mean and dispersion submodels, respectively.
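As a concrete illustration of these choices, here is a minimal Python sketch of the link functions mentioned above (the names are ours; Python is used for illustration only):

```python
import math
from statistics import NormalDist

# Mean-submodel links g(mu) = eta, mapping (0, 1) onto the real line
logit = lambda mu: math.log(mu / (1 - mu))
probit = lambda mu: NormalDist().inv_cdf(mu)
cloglog = lambda mu: math.log(-math.log(1 - mu))   # complementary log-log

# The inverse logit maps a real-valued predictor back into (0, 1)
inv_logit = lambda eta: 1 / (1 + math.exp(-eta))

# Dispersion-submodel link h(sigma2) = zeta; the log link keeps sigma2 > 0
log_link = math.log
```

The log link for the dispersion submodel guarantees positive fitted dispersions automatically, which is why it is the safer default compared with the identity link discussed above.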

To provide the quantities related to estimation by the maximum likelihood procedure we consider the general case, with nonlinearity in the parameters. We emphasize that f1(⋅) and f2(⋅) are differentiable functions with well-defined Jacobian matrices. Based on (1), the logarithm of the likelihood function is given by ℓ(β, γ) = Σ_{t=1}^n ℓt(μt, σ²t), in which

ℓt(μt, σ²t) = −(1/2) log(2π) − (1/2) log σ²t − (3/2) log{yt(1 − yt)} − d(yt; μt)/(2σ²t).

The components of the score vector (Uβ(β, γ), Uγ(β, γ)) are expressed in terms of the derivative matrices of the nonlinear predictors, of dimensions n × k and n × q, respectively; y = (y1, …, yn), μ = (μ1, …, μn) and a = (a1, …, an) are n × 1 vectors, and U = diag{u1, …, un} is a diagonal matrix whose t-th component is defined below. Moreover, (3)

To obtain the Fisher information matrix for the parameter vectors β and γ, we use results from [4, 5, 23], among them Var[d(y; μ)] = 2(σ²)². The Fisher information matrix for the parameter vector θ = (β, γ), here denoted K(β, γ), is a block-diagonal matrix with two blocks of submatrices, Kββ and Kγγ, where W = diag{w1, …, wn} and D = diag{d1, …, dn}. Since K(β, γ) is block diagonal, the vectors β and γ are globally orthogonal [24], so that their MLEs are asymptotically independent. For large samples and under regularity conditions, the approximate distribution of the MLEs is given by (4). To measure the degree of non-constant dispersion, we define λ = max{σ²t}/min{σ²t}, t = 1, …, n. Note that the greater λ is, the further the varying-dispersion simplex regression model is from the model with fixed dispersion, since under constant dispersion σ²1 = ⋯ = σ²n and hence λ = 1. Furthermore, λ measures how the growth of the response variances affects the estimation process of the model: for the dispersion to vary appreciably across observations, the largest dispersions must increase, so the larger λ is, the larger the response variances are in real problems. In our simulations we control the value of the maximum variance, because we want to evaluate the properties of the estimators when the variance does not explode but only grows slightly. When working with real data, large values of the estimated λ, i.e., λ ≫ 1, are common (in particular when n is large).
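The λ measure is straightforward to compute once the fitted dispersions are available. A minimal sketch, assuming a log-link dispersion submodel (the γ values match one of the paper's simulation settings, but the code and the grid of covariate values are ours):

```python
import math

def lambda_measure(sigma2):
    """Degree of non-constant dispersion: lambda = max(sigma2_t) / min(sigma2_t)."""
    return max(sigma2) / min(sigma2)

# Dispersion submodel with log link: sigma2_t = exp(gamma1 + gamma2 * z_t).
gamma1, gamma2 = -1.3, -1.6
z = [i / 10 for i in range(11)]                    # z_t on a grid over [0, 1]
sigma2 = [math.exp(gamma1 + gamma2 * zt) for zt in z]
lam = lambda_measure(sigma2)
# Under the log link, lambda = exp(|gamma2| * (max z - min z)) = exp(1.6) here.
```

Note that under a log link λ depends only on γ2 and the range of the covariate, not on the intercept γ1; a constant-dispersion model (γ2 = 0) gives λ = 1.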

We still need to discuss the variances of the responses further. The first part of the expression in (4) implies that the estimator vector is asymptotically unbiased. Thus, as the sample size increases, it becomes approximately unbiased and its bias approaches zero. In theory this holds only as n approaches infinity, that is, asymptotically. In practice, the better the approximation in (4), which depends on the distribution, the faster the bias goes to zero; this can already occur for sample sizes such as n = 40 or 50.

However, this behavior holds mostly for the mean-submodel estimator, owing to its relationship with the mean estimator, which is theoretically unbiased (exactly, and not just approximately, in typical cases). The dispersion-submodel estimator, on the other hand, is related to the variance estimator, which is theoretically biased under most distributions. Thus, its bias is expected to converge to zero slowly, requiring large sample sizes.

This discussion reveals that we should pay closer attention to how the corrections act on the dispersion-submodel estimates. Note that a biased estimator of the dispersion parameters induces biased estimates of the response variances. As a consequence, hypothesis tests and confidence intervals may perform poorly and lead to misleading conclusions about the model, such as the exclusion of important covariates.

3 Point estimation of the model parameters

The maximum likelihood estimators of β and γ are obtained by Fisher's iterative scoring process, with the initial guess proposed in [13], and are usually biased when the sample size is small or even moderate, particularly those of the dispersion submodel. Nevertheless, the estimator's bias can be corrected, and one possibility is to use resampling methods, i.e., schemes that use repeated sampling within the same sample to compute estimates. The bootstrap method is one of the most widely used resampling methods and gives very satisfactory results for model estimation. In this paper we adopt the parametric bootstrap, which, in the regression context, assumes that the probability distribution of the response variable is known and indexed by unknown parameters [15]. The steps for performing this method, both for bias correction of the MLEs and for obtaining confidence intervals, are described in Algorithms 1, 2 and 3.

Algorithm 1: Parametric bootstrap method

1: Suppose that y = (y1, …, yn) is a random sample such that each yt, t = 1, …, n, follows a distribution F supposedly known and indexed by parameter vector θ;

2: From the original sample, obtain the estimate of θ;

3: Generate B bootstrap samples of size n, namely y*b, b = 1, …, B, from the distribution F indexed by the estimate obtained in step 2;

4: For each bootstrap sample compute ;

5: Repeat steps 3 and 4 a large number B of times, thus obtaining the B bootstrap estimates;

6: Use the estimates, b = 1, …, B, to compute the desired quantities regarding the distribution of y, for instance: mean, variance, confidence intervals, etc.

Once an estimate of the estimator's bias is obtained, we can construct bias-corrected point estimators. Using the steps of the bootstrap method presented in Algorithm 1, a bootstrap estimate of the bias can be obtained as B̂ = θ̄*(⋅) − θ̂, where θ̄*(⋅) = (1/B) Σ_{b=1}^B θ̂*b; i.e., the expected value of the estimator is approximated by the arithmetic mean of the bootstrap estimates of θ. Thus, we obtain an estimator corrected up to second order by bootstrap [15, 25]: θ̃ = θ̂ − B̂ = 2θ̂ − θ̄*(⋅). This estimator has the same asymptotic properties as the usual MLE and presents lower bias in small samples [16]. A detailed discussion of the bootstrap second-order bias correction and its relation to the analytic correction can be found in [26].
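The correction θ̃ = 2θ̂ − θ̄*(⋅) can be sketched on a toy case outside the simplex model: the ML estimator of a normal variance, which is biased by the factor (n − 1)/n. The code below (Python, illustrative only; names and settings are ours) applies Algorithm 1 with a parametric resampling step:

```python
import math
import random

def mle_variance(x):
    """ML estimator of a normal variance: biased by the factor (n - 1) / n."""
    n = len(x)
    xbar = sum(x) / n
    return sum((xi - xbar) ** 2 for xi in x) / n

def bootstrap_bias_corrected(x, B=2000, seed=42):
    """Parametric bootstrap bias correction: theta_tilde = 2*theta_hat - mean(theta*_b).

    Returns the corrected estimate and the bootstrap bias estimate.
    """
    rng = random.Random(seed)
    n = len(x)
    theta_hat = mle_variance(x)
    mu_hat = sum(x) / n
    sd_hat = math.sqrt(theta_hat)
    # Steps 3-5 of Algorithm 1: resample from the fitted model, re-estimate B times.
    boot = [mle_variance([rng.gauss(mu_hat, sd_hat) for _ in range(n)])
            for _ in range(B)]
    boot_mean = sum(boot) / B
    bias_hat = boot_mean - theta_hat
    return 2 * theta_hat - boot_mean, bias_hat

rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(20)]
corrected, bias_hat = bootstrap_bias_corrected(sample)
# bias_hat is negative here, so the corrected estimate exceeds the MLE.
```

For this toy case the expected bootstrap bias is approximately −θ̂/n, so the corrected estimate is close to θ̂(n + 1)/n, largely removing the downward bias of the MLE.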

4 Interval estimation of the model parameters

A set constructed from a point estimator, together with a probability that the set contains the true value of the parameter, defines an interval estimator. The general form of an approximate confidence interval (CI) for θ is [l1, l2], where l1 and l2 (l1 < l2) are the lower and upper bounds of the confidence interval, respectively, and 1 − α is the confidence level, which converges to the coverage probability. We emphasize that l1 and l2 are quantiles of a distribution indexed by the parameter θ. If we assume that this distribution is known, it is possible to construct exact confidence intervals. However, deriving the exact analytical distribution of a random variable is typically highly challenging.

Fortunately, there are diverse approaches to building approximate confidence intervals. The most widely used is the asymptotic confidence interval, which assumes asymptotic normality of the MLEs. According to (4), for the simplex model in large samples the distribution of the MLE is approximately normal with mean θ = (β, γ) and variance-covariance matrix given by (4). More precisely, the inverse of Kββ evaluated at the estimates is the k × k variance-covariance matrix of the estimator of β, and the inverse of Kγγ evaluated at the estimates is the q × q variance-covariance matrix of the estimator of γ.

Consider βi and γj, with i = 1, …, k and j = 1, …, q, the i-th and j-th components of the vectors β and γ, respectively, and take the i-th and j-th elements of the main diagonals of the matrices above as the corresponding estimated variances. It then follows that the intervals [estimate ± quantile × standard error] have confidence level approximately equal to 1 − α for βi and γj, respectively, where the quantile is that of the standard normal distribution at level 1 − α/2. These MLE-based intervals may require large samples for the coverage to be close to the nominal one; in small samples, they can have large coverage errors [15, 25].
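A minimal sketch of this asymptotic (Wald-type) interval, assuming a point estimate and its standard error are available (the numeric values are hypothetical, chosen only for illustration):

```python
from statistics import NormalDist

def wald_ci(theta_hat, se, alpha=0.05):
    """Asymptotic (ML-Ia) interval: theta_hat +/- z_{1 - alpha/2} * se."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return theta_hat - z * se, theta_hat + z * se

# Hypothetical point estimate and standard error, for illustration only.
lo, hi = wald_ci(1.2, 0.3)
```

With α = 0.05 the normal quantile is z ≈ 1.96, so the interval is the familiar estimate ± 1.96 standard errors.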

A workaround for improving confidence intervals in small samples, without analytical complexities, is the bootstrap method. This approach typically provides confidence intervals whose coverage levels are close to the true coverage probability. Here, we discuss two bootstrap-based strategies, namely the percentile and bootstrap-t intervals.

The percentile bootstrap confidence interval, which we denote by ‘Bootp’ [16], is built from a finite number B of bootstrap replicates of the estimators of the parameters of interest. Furthermore, it enjoys the property of invariance under monotonic transformations. Let F(θ) be the distribution of the response variable, assumed known and indexed by the parameter vector θ, and let F̂* be the empirical distribution function obtained from the B bootstrap replicas. We can construct the percentile confidence interval, with approximate coverage level 1 − α, by computing the α/2 and 1 − α/2 quantiles of F̂*. The expressions of the percentile bootstrap confidence intervals for the parameters of the nonlinear simplex regression model are given in (5), for i = 1, …, k and j = 1, …, q, the i-th and j-th components of the vectors β and γ. The percentile interval is not necessarily symmetric about the point estimates. Its construction ensures that improper values of the parameter of interest are not included in the confidence interval. The steps for its construction are described in Algorithm 2.

Algorithm 2: Bootstrap confidence interval—Percentile

1: Generate B bootstrap samples based on , for b = 1, …, B;

2: Let y = (y1, …, yn) be the original sample. Then the respective bootstrap estimate of θ is computed for each bootstrap sample, b = 1, …, B;

3: The B bootstrap replicas must be ordered.

4: The lower and upper limits of the percentile interval are provided by the ordered replicas of positions B × (α/2) and B × (1 − α/2), respectively, assuming that B × (α/2) and B × (1 − α/2) are integers and 0 < α < 1.

 4.1: If B × (α/2) and B × (1 − α/2) are not integers, we can use the following procedure:

 4.1.1: Assuming 0 < α < 1, let p = [(B + 1)α/2] be the largest integer less than or equal to the number (B + 1)α/2; then, we define the lower and upper bounds of the percentile interval by the p-th and (B + 1 − p)-th ordered elements of the B bootstrap replicas of , respectively.
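Algorithm 2 can be sketched as follows (Python, illustrative only; the replicas below are stand-ins for actual bootstrap estimates):

```python
def percentile_ci(boot, alpha=0.05):
    """Percentile bootstrap interval via the (B + 1) rule of Algorithm 2."""
    B = len(boot)
    s = sorted(boot)
    p = int((B + 1) * alpha / 2)   # largest integer <= (B + 1) * alpha / 2
    return s[p - 1], s[B - p]      # p-th and (B + 1 - p)-th ordered replicas

# With B = 999 and alpha = 0.05: p = 25, so the bounds are the 25th and
# 975th ordered bootstrap estimates.
boot = list(range(1, 1000))        # stand-in for the bootstrap replicas
lo, hi = percentile_ci(boot)
```

Because the bounds are order statistics of the replicates themselves, the interval automatically stays inside the parameter space, which is the "improper values" guarantee mentioned above.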

The bootstrap-t confidence interval, here called ‘Boott’ [16], is a pivot-based method for constructing confidence intervals that parallels the traditional Student's t interval. It is based on the bootstrap estimate of the distribution of T = (θ̂ − θ)/se(θ̂), where se(θ̂) is the standard error of θ̂. The construction of the bootstrap-t confidence interval is given by Algorithm 3.

Algorithm 3: Bootstrap confidence interval—Bootstrap-t

1: Generate B bootstrap samples from ;

2: For each bootstrap sample, compute T*b, b = 1, 2, …, B, standardizing the difference between the bootstrap estimate of θ for the sample y*b and the estimate θ̂ from the original sample y by the standard error computed on the bootstrap sample y*b. Note that the standard error is a known function of the parameter estimates;

3: The α/2 and 1 − α/2 percentiles of the T*b are estimated as follows.

Thus, the bootstrap-t confidence interval is obtained from these estimated percentiles. The required quantiles can be computed as follows:

1. Sort the B bootstrap replicas T*b;

2. The quantiles and are, respectively, the replicas corresponding to the integer parts of B × (α/2) and B × (1 − α/2);

2.1. If B × (α/2) and B × (1 − α/2) are not integers, we can use the following procedure:

Assuming 0 < α < 1, let k = [(B + 1)α/2] be the largest integer less than or equal to (B + 1)α/2. Then the bootstrap quantiles are given, respectively, by the k-th and (B + 1 − k)-th ordered elements of T*b. Therefore, the bootstrap-t intervals for the parameters of the nonlinear simplex regression model are given by the resulting expressions, with i = 1, …, k and j = 1, …, q, the i-th and j-th components of the vectors β and γ.
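The bootstrap-t construction can be sketched as below (Python, illustrative only; `boot_pairs` is a hypothetical container of bootstrap estimates and their standard errors):

```python
def boott_ci(theta_hat, se_hat, boot_pairs, alpha=0.05):
    """Bootstrap-t interval in the spirit of Algorithm 3.

    boot_pairs: list of (bootstrap estimate, bootstrap standard error).
    """
    B = len(boot_pairs)
    t_stats = sorted((tb - theta_hat) / sb for tb, sb in boot_pairs)
    k = int((B + 1) * alpha / 2)              # same (B + 1) rule as Algorithm 2
    t_lo, t_hi = t_stats[k - 1], t_stats[B - k]
    # Note the inversion: the upper t-quantile gives the lower bound.
    return theta_hat - t_hi * se_hat, theta_hat - t_lo * se_hat
```

The inversion of the quantiles is the defining feature of the bootstrap-t (pivotal) interval: it lets the interval adapt to asymmetry in the sampling distribution of the studentized statistic.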

According to [16], bootstrap-t intervals outperform the asymptotic interval, displaying empirical coverages closer to the nominal levels, but they can be erratic in actual practice. Percentile intervals are more stable, but display less satisfactory coverage performance. An outstanding discussion of bootstrap-based confidence intervals can be found in [27]. In what follows we evaluate the finite-sample performance of the confidence intervals introduced in this section.

5 Numerical results on point estimation

In this section we present the results of Monte Carlo simulations carried out to evaluate the performance of the maximum likelihood estimators of the nonlinear simplex regression model and of their bootstrap versions in small samples. In what follows we assume the nonlinear simplex regression model: (6) where g(⋅) and h(⋅) are the logit and log link functions, respectively. The covariate values were generated from the uniform distribution and were kept fixed over the Monte Carlo replications. Three different scenarios were considered for the mean response, namely: μt ∈ (0.02, 0.32) with β = (−2.4, 1.2, −1.5, −1.7); μt ∈ (0.19, 0.86) with β = (−1.7, −1.8, 1.2, −1.3); and μt ∈ (0.78, 0.98) with β = (2.1, −1.5, −1.6, −1.2). Furthermore, concerning the degree of non-constant dispersion, we report results for λ ≈ 12 with γ = (−1.3, −1.6); λ ≈ 45 with γ = (−1.3, −2.1); and λ ≈ 128 with γ = (−1.3, −2.4). The sample sizes were n = 40, 80 and 120. For the last two cases we initially generated n = 40 covariate observations and replicated them twice and three times, respectively, to obtain the sample sizes n = 80 and n = 120. This ensures that the intensity of non-constant dispersion is the same for all sample sizes. The numbers of Monte Carlo and bootstrap replications were R = 10000 and B = 500, respectively. The parameter estimates in (6) were obtained by maximizing the log-likelihood function using Fisher's scoring method.
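The covariate-replication device used to keep λ fixed across sample sizes can be sketched as follows (Python, illustrative only; the uniform draws and γ values mirror the simulation design, but the code is ours):

```python
import math
import random

rng = random.Random(2022)
z40 = [rng.uniform(0.0, 1.0) for _ in range(40)]   # base covariate draws, n = 40
z80 = z40 * 2                                      # replicated twice,  n = 80
z120 = z40 * 3                                     # replicated thrice, n = 120

def lam(z, gamma1=-1.3, gamma2=-1.6):
    """Degree of non-constant dispersion lambda = max(sigma2_t) / min(sigma2_t),
    under the log link sigma2_t = exp(gamma1 + gamma2 * z_t)."""
    s2 = [math.exp(gamma1 + gamma2 * zt) for zt in z]
    return max(s2) / min(s2)

# Replication leaves max(z) and min(z) unchanged, hence lambda is unchanged.
```

Since λ depends only on the extremes of the covariate values, tiling the same 40 draws leaves the degree of non-constant dispersion identical for n = 40, 80 and 120, as the design requires.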

For each Monte Carlo replicate, after computing the maximum likelihood estimates of the model parameters, B bootstrap replicate estimates were generated. At the end of the bootstrap loop, the quantities of interest are computed: the bias-corrected bootstrap estimates and the percentile and bootstrap-t confidence intervals. Finally, outside the bootstrap loop, the asymptotic intervals of the parameters are computed based on the quantiles of the standard normal distribution.

To evaluate the performance of the point estimators, the relative bias and the square root of the mean squared error were calculated for each sample size. Additionally, we introduce a measure suggested during the review of the article, which we call the Unified Quadratic Bias (UQB). In Tables 1–3 we consider, respectively, the scenarios μt ∈ (0.02, 0.32) (μt ≈ 0), μt ∈ (0.19, 0.86) (μt ≈ 0.5) and μt ∈ (0.78, 0.98) (μt ≈ 1), t = 1, …, n. These tables report the relative biases and the square roots of the mean squared errors (RMSEs) of the parameter estimators for n = 40, 80 and 120 and λ ≈ 12, 45 and 128. We observe that, in absolute value, the relative biases of the bootstrap-corrected estimators are smaller than those of the maximum likelihood estimators, evidencing the efficacy of the bootstrap bias correction.

Table 1. Relative biases and root mean square errors of the Maximum Likelihood Estimators (MLEs-asymptotic) and bootstrap corrected MLEs of the model parameters: and , t = 1, …, n, β = (−2.4, 1.2, −1.5, −1.7), μt ∈ (0.02, 0.32), t = 1, …, n.

https://doi.org/10.1371/journal.pone.0272512.t001

Table 2. Relative biases and root mean square errors of the Maximum Likelihood Estimators (MLEs) and bootstrap corrected MLEs of the model parameters: and , β = (−1.7, −1.8, 1.2, −1.3), μt ∈ (0.19, 0.86), t = 1, …, n.

https://doi.org/10.1371/journal.pone.0272512.t002

Table 3. Relative biases and root mean square errors of the Maximum Likelihood Estimators (MLEs-asymptotic) and bootstrap corrected MLEs of the model parameters: and , β = (2.1, −1.5, −1.6, −1.2), μt ∈ (0.78, 0.98), t = 1, …, n.

https://doi.org/10.1371/journal.pone.0272512.t003

For example, the relative bias estimate of the bootstrap-corrected estimator (BOOT) is equal to 0.0003, while that of the MLE (MLE-asymptotic) is 0.001. For μt ∈ (0.19, 0.86), n = 120 and λ ≈ 12, the estimated bias is equal to 0.001 for one estimator and < 0.0001 for its corrected version. In fact, the high performance of the bootstrap correction when μt ∈ (0.19, 0.86) is noteworthy, since its estimators exhibit lower biases than the asymptotic MLEs for all model parameters, for the different levels of non-constant dispersion and all sample sizes. In all scenarios considered, the RMSEs of the estimators decrease as the sample size increases.

As expected, the asymptotic MLEs of the parameters of the dispersion submodel tend to be more biased than those of the mean submodel, especially regarding γ1. For instance, for μt ∈ (0.02, 0.32), n = 120 and λ ≈ 45, the relative bias estimate of the dispersion-submodel estimator is equal to 0.043, while that of its corrected version is < 0.0001. More striking are the biases that drop from (0.136, −0.010) to (−0.001, 0.001) after the bootstrap correction, respectively, when n = 40 and λ ≈ 45.

For the dispersion-submodel parameters in particular, the bootstrap correction provides a substantial reduction of the estimated bias. This is important since correct estimation of the dispersion submodel parameters directly affects the estimates of the response variances, which, once corrected, produce Z-tests that lead to more reliable decisions. Even so, the corrections were also effective for the mean-submodel estimators, i = 1, 2, 3, 4: the goal is for the bias values to be as close to zero as possible, and for these parameters the estimated bias became at times < 0.0001 after correction, i.e., the goal was achieved.

It is important to note that the estimated biases of the usual and corrected maximum likelihood estimators are notably smaller when the mean response is close to the upper limit of the unit interval than in the two other scenarios considered (Tables 1 and 3). The Unified Quadratic Bias measure makes the effectiveness of the proposed bias correction even more evident. Considering μt ∈ (0.78, 0.98) and n = 40, for λ ≈ 12, 45 and 128, the values of the UQB for the original MLEs are equal to 0.132, 0.137 and 0.138, whereas for the corrected versions these values become 0.005, 0.001 and 0.002, respectively (Table 3).

6 Numerical results on confidence intervals

Concerning interval estimation, we compute the empirical coverage of the intervals (%), obtained from the relative frequency with which the intervals contained the true value of the parameter. The lower and upper bounds were also estimated (by averaging at the end of the Monte Carlo process), which allowed us to estimate the average length of the intervals and the left and right non-coverage rates. The left rate is incremented whenever the upper limit of the interval is less than the true value of the parameter, and the right rate whenever the lower limit is greater than the true value.

In what follows we report the results of the Monte Carlo simulations on interval estimation. We consider the nominal levels 0.90 and 0.95, corresponding to Tables 4 and 5, respectively. These tables display the coverage rates of the competing interval estimators: the asymptotic ML interval approximation (ML-Ia), bootstrap-t (Boott), and percentile (Bootp), for the model parameters in (6).

Table 4. Coverage rates of the interval estimators: ML-Ia, Boott and Bootp for θ, the model parameters: and , t = 1, …, n, 1 − α = 0.90.

https://doi.org/10.1371/journal.pone.0272512.t004

Table 5. Coverage rates of the interval estimators: ML-Ia, Boott and Bootp for θ, the model parameters: and , t = 1, …, n, 1 − α = 0.95.

https://doi.org/10.1371/journal.pone.0272512.t005

Regarding coverage rates, the interval that performs best is the bootstrapt, with empirical coverage substantially closer to the nominal levels for all model parameters. The asymptotic confidence interval displayed considerable undercoverage, and the percentile confidence interval Bootp overall outperforms the ML-Ia; only for γ1 does the Bootp display a poor performance. For instance, for n = 40, μt ∈ (0.19, 0.86), 1 − α = 0.90 and all values of λ, the Bootp coverage rates for this parameter are approximately equal to 0.66, whereas those of the ML-Ia and Boott are approximately equal to 0.80 and 0.90, respectively. This behavior is similar in all scenarios for the mean response. The conclusions about coverage rates are quite similar whatever the nominal level. To exemplify, for 1 − α = 0.95 and the same settings as for 1 − α = 0.90, the ML-Ia, Boott and Bootp coverage rates are around 0.87, 0.95 and 0.74, respectively (Table 5). Now consider n = 120, 1 − α = 0.95, λ = 45, μt ∈ (0.19, 0.86): for β3, those values are equal to 0.933, 0.942 and 0.934 (Table 8). That is, even when the sample size increases, the empirical coverage of the Boott interval remains the closest to the nominal level.

Our interest hereafter lies in evaluating some interval properties only for the nonlinearity parameters of the mean and dispersion submodels, namely β2 and γ2. Tables 6 through 9 present the mean lower (Lower) and upper (Upper) bounds, mean lengths (Size), the empirical coverage probability (Coverage), as well as the left and right non-coverage rates (%) of the interval estimators of these parameters. The last two quantities evaluate the balance of the interval; perfect balance occurs when the two percentages are identical.

Table 6. Lower and upper bounds, size, empirical coverage (Coverage) and percentages of Lower (%Left) and upper (%Right) non-coverage of the ML-Ia, Boott and Bootp intervals for β2, in the model: and , t = 1, …, n, μt ∈ (0.02, 0.32), β2 = 1.2, n = 40.

https://doi.org/10.1371/journal.pone.0272512.t006

Tables 6 through 8 present the results for β2 with n = 40, n = 80 and n = 120, only for μt ∈ (0.02, 0.32), t = 1, …, n, where β2 = 1.2. Note that if β2 = 1 the mean submodel becomes linear. For n = 40, λ ≈ 12 and λ ≈ 45, only the Boott interval, and only at 1 − α = 0.99, includes this possibility.

Table 7. Lower and upper bounds, size, empirical coverage (Coverage) and percentages of Lower (%Left) and upper (%Right) non-coverage of the ML-Ia, Boott and Bootp intervals for β2, in the model: and , t = 1, …, n, μt ∈ (0.02, 0.32), β2 = 1.2, n = 80.

https://doi.org/10.1371/journal.pone.0272512.t007

Table 8. Lower and upper bounds, size, empirical coverage (Coverage) and percentages of lower (%Left) and upper (%Right) non-coverage of the ML-Ia, Boott and Bootp intervals for β2, in the model: and , t = 1, …, n, μt ∈ (0.02, 0.32), β2 = 1.2, n = 120.

https://doi.org/10.1371/journal.pone.0272512.t008

Its empirical coverage probability is the closest to the true one; however, its mean interval length is longer than those of the other two intervals, which allows the inclusion of β2 = 1.0, i.e., linearity (Table 6). This behavior of the Boott interval persists for n = 80 (Table 7). When λ ≈ 128, all intervals include the possibility of linearity (β2 = 1.0) when n = 40 (Table 6), for all nominal confidence levels. This result is interesting, as it shows how the intensity of non-constant dispersion negatively affects the performance of the three confidence intervals considered.

As the sample size increases the problem is attenuated. For instance, when n = 80, 1 − α = 0.90 and λ ≈ 128, only the bootstrapt interval considers the possibility of linearity, namely: ML-Ia = (1.006, 1.401), Boott = (0.993, 1.410) and Bootp = (1.010, 1.406). Nevertheless, if we rounded to a single decimal place those intervals would become ML-Ia = (1.0, 1.4), Boott = (1.0, 1.4) and Bootp = (1.0, 1.4), and would therefore be effectively equivalent. We should also point out that the average lengths (Size) of all intervals decrease as the sample size increases.

We now highlight how considerably the Boott interval outperforms its competitors in accuracy and balance for β2. We fix 1 − α = 0.95 and consider n = 40 and the three values of λ. For each estimator we report the set consisting of the empirical coverage and the left and right non-coverage rates, expressed as {(⋅), [⋅%][⋅%]}. For λ ≈ 12, the sets for the ML-Ia, Boott and Bootp estimators are {(0.919), [3.61%][4.49%]}, {(0.950), [2.40%][2.61%]} and {(0.923), [3.20%][4.47%]}. For λ ≈ 45 those sets become {(0.919), [2.65%][3.21%]}, {(0.950), [2.55%][2.67%]} and {(0.922), [2.97%][3.41%]}. Finally, when λ ≈ 128 the respective sets are {(0.917), [3.94%][4.40%]}, {(0.949), [2.66%][2.39%]} and {(0.916), [3.98%][4.48%]}.

Table 9 presents the simulation results for β2 (μt ≈ 0.5 and μt ≈ 1) when n = 40, λ ≈ 128 and β2 equal to −1.8 and −1.5, respectively. We note that for the three nominal levels and the different scenarios, the asymptotic confidence interval has the shortest average length. We also note that the bootstrapt confidence interval presents the best empirical coverage and balance properties, followed by the Bootp interval, whose values were very similar to those of the ML-Ia interval.

Table 9. Lower and upper bounds, size, empirical coverage (Coverage) and percentages of lower (%Left) and upper (%Right) non-coverage of the ML-Ia, Boott and Bootp intervals. For β2 in the model: and , t = 1, …, n, n = 40, λ ≈ 128.

https://doi.org/10.1371/journal.pone.0272512.t009

Figs 1 and 2 contain histograms constructed from the 10000 maximum likelihood estimates of the parameters β2 and γ2, respectively, for n = 40, λ ≈ 128 and the different scenarios for μt, t = 1, …, n. The distinct lines represent the different confidence intervals under evaluation, and their lengths correspond to the respective average lengths. The values below and above the vertical lines are the non-coverage rates, i.e., the percentages of replicates in which the true value of the parameter was smaller than the lower limit of the interval (below) or larger than the upper limit of the interval (above).

Fig 1. Interval estimation for β2, n = 40, λ ≈ 128.

(a) β2 = 1.2, (b) β2 = −1.8 and (c) β2 = −1.5.

https://doi.org/10.1371/journal.pone.0272512.g001

Fig 2. Interval estimation for γ2, n = 40. γ2 = −2.4.

https://doi.org/10.1371/journal.pone.0272512.g002

These graphics were designed according to [28]. They show that, for the different μt scenarios, the analyzed intervals are approximately symmetric around the true value of β2. We further note that for μt ∈ (0.02, 0.32) the intervals were better balanced than in the scenarios where μt ∈ (0.19, 0.86) and μt ∈ (0.78, 0.98). Overall, the bootstrapt confidence interval stands out as the best balanced. Fig 2 shows that only the asymptotic confidence interval is approximately symmetric around the true value of γ2.

The bootstrapt confidence interval is slightly asymmetric around γ2, while the bootstrap percentile confidence interval exhibits very strong asymmetry, especially at the nominal 99% level. We also observe that the asymptotic confidence interval exhibits strong imbalance, as the right non-coverage rates (%Right) are markedly higher than the left ones (%Left). The bootstrapt and percentile confidence intervals, however, are approximately balanced for all nominal levels and scenarios. Therefore, based on the results presented, we suggest using the bootstrapt confidence interval, which showed the best coverage and balance performances.

7 Application: Fluid Catalytic Cracking Data (FCC)

In this application the data come from the Chemistry Department of the National University of Colombia [29] and concern a process involving the volume and quality of gasoline produced in a refinery. Fluid catalytic cracking (FCC) is used to convert high molecular weight hydrocarbons into small molecules of higher commercial value by contacting them with a catalyst. This process is often described as the heart of the refinery, as it allows production to be tailored toward high-demand and especially high-profit products [29]. The process catalyst consists of fine, easily fluidizable particles of 10 to 150 microns, with zeolite Y as the main component [29]. Another important substance in the catalysis process is vanadium. This chemical element is known to participate in catalyst destruction, reducing the active surface, selectivity and crystallinity of zeolite Y, especially in the presence of steam. Every 1000 ppm of vanadium in the catalyst is known to reduce gasoline yield by about 2.3%. The process also depends on the temperature, which must be close to 720 °C [29]. The data set consists of 28 observations.

Aiming to fit a model to these data, [13] chose the following candidate covariates: steam (x2), temperature (x3) and vanadium concentration (x4). Moreover, the authors defined a linear predictor relating these covariates to unknown parameters. However, the residual analysis highlighted the possibility that the predictor is nonlinear in some of the parameters. To build the nonlinear model, the authors followed several steps that are carefully detailed in their article. The chosen model uses probit and logarithmic link functions for the mean and dispersion submodels, respectively, and was defined as follows: , with t = 1, …, 28. Hereafter we shall refer to this model as ‘Model-I’. We emphasize that this simplex model outperformed a competing beta model [13]. Table 10 displays the maximum likelihood estimates, the bootstrap bias-corrected estimates, their respective standard errors (SE), and the p-values associated with the Z-tests for the significance of the model parameters. The ML estimates and their bootstrap-corrected versions are quite similar with regard to the parameters of the mean submodel, whereas the corrected estimates have lower standard errors than the ML estimates, with the sole exception of the parameter β1. Concerning the parameters of the dispersion submodel, we notice that the maximum likelihood estimates and their corrected versions present slightly different values, while their respective standard errors are quite similar.

Table 10. Maximum likelihood estimates , bootstrap bias-corrected estimates (), standard errors (SE) and p-values associated with Z-tests for the parameters of the [13].

Fluid Catalytic Cracking Data (FCC).

https://doi.org/10.1371/journal.pone.0272512.t010

Table 11 reports the interval estimates of the Model-I parameters at nominal levels of 90%, 95% and 99%. Considering the three estimation methods, ML-Ia, Boott and Bootp, we notice that the interval estimates for β1, β2, β4 and β5 are quite similar, whereas for the β3 parameter the bootstrapt estimates display lengths substantially longer than those of its competitors, at all nominal levels. For example, for 1 − α = 0.95, the ML-Ia, Boott and Bootp interval estimates are, respectively, (−35.445; −20.240), (−40.929; −9.544) and (−34.745; −20.474). Another feature of the bootstrapt interval estimator is that some of its estimates include the value zero for the parameters. This occurs for β2 at the 99% level, (−0.466; 0.142), and for β4 at the 95% and 99% levels. Nevertheless, β2 = 0 would imply both the exclusion of steam, a covariate recognized as important to the process, and the assumption of a linear predictor for the mean submodel. The most important information revealed by the figures in Table 11, though, is that only the bootstrapt interval considers the possibility that γ1 and γ2 are simultaneously equal to zero, at both the 95% and 99% levels. Bootp reaches this conclusion only at the 99% level, whereas the ML-Ia estimator admits only γ1 = 0, and only when 1 − α = 0.99.

Table 11. ML-Ia, Boott and Bootp interval estimates for the parameters of the Model ‘Model-I’.

Fluid catalytic cracking (FCC) data.

https://doi.org/10.1371/journal.pone.0272512.t011

Therefore, we shall evaluate a nonlinear simplex model with constant dispersion. Among the competing models, the one that presented the best goodness of fit uses complementary log-log and logarithmic link functions for the mean and dispersion submodels, as described in the following: and , with t = 1, …, 28.

In what follows we shall refer to this model as ‘Model-II’ and provide some quantities for its parameters, namely the maximum likelihood estimates, their bootstrap-corrected versions ‘(⋅)’ and the standard errors of the ML estimates ‘[⋅]’. Thus, β1: 0.9112 (0.9051) [0.1081], β2: −1.4166 (−1.5062) [0.0016], β3: −26.1208 (−25.7744) [3.6902], β4: −0.2615 (−0.2571) [0.0635], β5: −0.3366 (−0.3431) [0.0607] and γ1: 0.0455 (0.2782) [0.2673]. For Model-II it is possible to notice differences between the ML estimates and their bootstrap-corrected versions; for example, the estimate of β2 is −1.42 while its corrected version becomes −1.51. A further important issue regarding Model-II is that the correct modeling of the dispersion considerably reduced the standard errors of the estimates: the SE of the β2 estimate was 0.023 for Model-I and becomes 0.0016 for Model-II. Table 12 reports the interval estimates of the Model-II parameters at nominal levels of 90%, 95% and 99%. It is noteworthy that the correct dispersion modeling improves the performance of the interval estimators. The accuracy of the Boott and ML-Ia interval estimators for the β2 parameter is especially noteworthy. For the ML-Ia, Boott and Bootp estimators, respectively, the interval estimates are (−1.419; −1.413), (−1.419; −1.414) and (−1.904; −0.757) at 99%; (−1.419; −1.413), (−1.419; −1.413) and (−2.032; −0.631) at 95%; and (−1.420; −1.412), (−1.4212; −1.4116) and (−2.3049; −0.3229) at 90%. Here, we note that the Bootp displays a poor performance. It should be recalled that β2 is a parameter associated with the nonlinearity of the model, as is β3. In fact, once the dispersion was assumed constant, the Boott scheme provided intervals for β3 with considerably shorter lengths than those obtained under Model-I.

Table 12. ML-Ia, Boott and Bootp interval estimates for the parameters of Model-II.

Fluid catalytic cracking (FCC) data.

https://doi.org/10.1371/journal.pone.0272512.t012

One area of research on which we have been working intensively concerns model selection criteria for nonlinear models. The criterion proposed by [7] for the beta regression model, defined as the square of the correlation between g(y) and , has proven quite effective in assessing the goodness of fit of models to data in the different applications we have performed on nonlinear models. The corrected version was proposed by [30] and is defined as . Models I and II display measures equal to {0.6506, 0.5508} and {0.6818, 0.6095}, respectively. Thus, the choice of Model II is adequate, and this model was inferred based on the bootstrapt interval estimator.
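The uncorrected criterion of [7] — the squared correlation between the link-transformed response and the fitted predictor — can be sketched in a few lines. Assumptions to note: the logit link and the toy numbers below are ours, for illustration only (Model-I uses a probit link and Model-II a complementary log-log link), and the penalized correction of [30] is not shown here.

```python
import numpy as np

def pseudo_r2(y, mu_hat, link):
    """Squared sample correlation between g(y) and g(mu_hat)."""
    gy, gmu = link(y), link(mu_hat)
    return float(np.corrcoef(gy, gmu)[0, 1] ** 2)

logit = lambda u: np.log(u / (1.0 - u))   # illustrative link choice

y = np.array([0.10, 0.25, 0.40, 0.55, 0.70, 0.85])        # observed fractions
mu_hat = np.array([0.12, 0.22, 0.43, 0.50, 0.72, 0.80])   # hypothetical fit
r2 = pseudo_r2(y, mu_hat, logit)
```

A perfect fit gives a value of 1, and the corrected version of [30] additionally penalizes the number of estimated parameters, which is why it orders Models I and II more sharply than the raw measure.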

8 Conclusion

In this paper we evaluate the point and interval estimation for the parameters indexing the nonlinear simplex regression model [13] in small samples. Additionally, we propose inferential improvements based on the bootstrap method.

MLEs can often be biased when the sample size is small or even moderate. Thus, we compared the point estimation performance of the MLEs of the model parameters with that of their corrected versions obtained through a bootstrap scheme. The results of the Monte Carlo simulations showed that, in general, the corrected estimators presented lower biases than the maximum likelihood estimators, evidencing the efficacy of the bootstrap scheme in bias correction. The MLEs of the parameters of the dispersion submodel are strongly biased, and the bootstrap-corrected estimator provides a substantial reduction of these biases, reinforcing the importance of the proposed scheme for bias correction in the nonlinear simplex regression model.

Usually the asymptotic confidence intervals based on MLEs require large samples for the coverage rates to be close to the nominal ones. An alternative for constructing adequate confidence intervals in small samples is the bootstrap method. Thus, we considered three competing interval estimators, namely the asymptotic ML, percentile and bootstrapt intervals. Regarding coverage rates, the bootstrapt confidence interval outperformed its two competitors in every simulation scenario. Furthermore, in almost all experiments it was the best-balanced interval.

The price of this outperformance is that the bootstrapt interval is typically longer than its competitors. Overall, however, the bootstrapt interval proved to be the most appropriate interval estimator for nonlinear simplex regression, not only in the simulation results: it was also decisive in the application. In a scenario with only n = 28 observations, it was able to point out the misspecification of the dispersion submodel, which led to a new and better-fitted model.

References

  1. Barndorff-Nielsen OE, Jørgensen B. Some parametric models on the simplex. Journal of Multivariate Analysis. 1991;39(1):106–116.
  2. Jørgensen B. The theory of dispersion models. London: Chapman and Hall; 1997.
  3. McCullagh P, Nelder JA. Generalized linear models. London: Chapman and Hall; 1989.
  4. Song PXK, Tan M. Marginal models for longitudinal continuous proportional data. Biometrics. 2000;56(2):496–502. pmid:10877309
  5. Song PXK, Qiu Z, Tan M. Modelling heterogeneous dispersion in marginal models for longitudinal proportional data. Biometrical Journal: Journal of Mathematical Methods in Biosciences. 2004;46(5):540–553.
  6. López FO. A bayesian approach to parameter estimation in simplex regression model: a comparison with beta regression. Revista Colombiana de Estadística. 2013;36(1):1–21.
  7. Ferrari S, Cribari-Neto F. Beta regression for modelling rates and proportions. Journal of Applied Statistics. 2004;31(7):799–815.
  8. Mitnik PA, Baek S. The Kumaraswamy distribution: median-dispersion re-parameterizations for regression modeling and simulation-based estimation. Statistical Papers. 2013;54(1):177–192.
  9. Lemonte AJ, Bazan JL. New class of Johnson SB distributions and its associated regression model for rates and proportions. Biometrical Journal. 2016;58:727–746. pmid:26659998
  10. Mousa AM, El-Sheikh AA, Abdel-Fattah MA. A gamma regression for bounded continuous variables. Advances and Applications in Statistics. 2016;49(4):305–326.
  11. Guedes AC, Cribari-Neto F, Espinheira PL. Modified likelihood ratio tests for unit gamma regressions. Journal of Applied Statistics. 2020; p. 1–25. pmid:35707584
  12. Rocha SS, Espinheira LP, Cribari-Neto F. Residual and local influence analyses for unit gamma regressions. Statistica Neerlandica. 2021;75(2):137–160.
  13. Espinheira PL, Silva AO. Residual and influence analysis to a general class of simplex regression. Test. 2020; p. 1–30.
  14. Liu P, Yuen KC, Wu LC, Tian GL, Li T. Zero-one-inflated simplex regression models for the analysis of continuous proportion data. Statistics and Its Interface. 2020;13(2):193–208.
  15. Efron B. Bootstrap methods: another look at the jackknife. Annals of Statistics. 1979;7(1):1–26.
  16. Efron B, Tibshirani RJ. An introduction to the bootstrap. New York: Chapman & Hall; 1994.
  17. Simas AB, Barreto-Souza W, Rocha AV. Improved estimators for a general class of beta regression models. Computational Statistics & Data Analysis. 2010;54(2):348–366.
  18. Ferrari SLP, Pinheiro EC. Improved likelihood inference in beta regression. Journal of Statistical Computation and Simulation. 2011;81(4):431–443.
  19. Skovgaard IM. Likelihood asymptotics. Scandinavian Journal of Statistics. 2001;28(1):3–32.
  20. Cribari-Neto F, Lima FP. Resampling-based prediction intervals in beta regressions under correct and incorrect model specification. Communications in Statistics-Simulation and Computation. 2019; p. 1–19.
  21. Lima FP, Cribari-Neto F. Bootstrap-based testing inference in beta regressions. Brazilian Journal of Probability and Statistics. 2020;34(1):18–34.
  22. Atkinson AC. Plots, transformations and regression: an introduction to graphical methods of diagnostic regression analysis. New York: Oxford University Press; 1985.
  23. Silva FCd. Teste de diagnóstico baseado em influência local aplicado ao modelo de regressão simplex. Universidade Federal de Pernambuco; 2016.
  24. Cox DR, Reid N. Parameter orthogonality and approximate conditional inference. Journal of the Royal Statistical Society: Series B (Methodological). 1987;49(1):1–18.
  25. Davison AC, Hinkley DV. Bootstrap methods and their application. New York: Cambridge University Press; 1997.
  26. Ferrari SL, Cribari-Neto F. On bootstrap and analytical bias corrections. Economics Letters. 1998;58(1):7–15.
  27. Hall P. Theoretical comparison of bootstrap confidence intervals. The Annals of Statistics. 1988; p. 927–953.
  28. Ospina R, Cribari-Neto F, Vasconcellos KL. Improved point and interval estimation for a beta regression model. Computational Statistics & Data Analysis. 2006;51(2):960–981.
  29. Salazar SMG. Contribuicion al estudio de la reaccion de decomposición de la Zeolita Y em presencia de vapor de agua y vanadio. Universidad Nacional de Colombia; 2005.
  30. Bayer FM, Cribari-Neto F. Model selection criteria in beta regression with varying dispersion. Communications in Statistics-Simulation and Computation. 2017;46(1):729–746.