
Beta regression model nonlinear in the parameters with additive measurement errors in variables

Abstract

We propose in this paper a general class of nonlinear beta regression models with measurement errors. The motivation for this model arose from a real problem we discuss here. The application concerns a usual oil refinery process in which the main covariate is the concentration of a reagent that is typically measured with error, and the response is the percentage of crystallinity of a catalyst involved in the process. Such data have been modeled by nonlinear beta and simplex regression models. Here we propose a nonlinear beta model that allows the chemical reagent concentration to be measured with error. The model parameters are estimated by different methods. We perform Monte Carlo simulations to evaluate the performance of point and interval estimators of the model parameters. Both the simulation results and the application favor estimation by approximate maximum pseudo-likelihood.

1 Introduction

Regression models for dependent variables that assume values in the unit interval have proved quite important in the literature, with special highlight to the beta regression model proposed by [1] and generalized by [2], who proposed the nonlinear beta regression model. In the aforementioned models, all covariates are traditionally considered fixed, not random, and free of measurement error. In practice, covariates may not be observed directly or may be subject to measurement errors. It is important to emphasize that if this assumption is violated, unreliable inferential results are obtained [3]. In these circumstances, regression models with measurement errors are usually defined and structured so that the mean response is explained by covariates xt whose measurements are inaccurate. Thus, instead of the true value of xt, the value of another predictor variable, wt, which is contaminated by measurement error, is considered.

It is reasonable that in regression models for dependent variables assuming values in the unit interval, some of the covariates are not observed directly, but acquired with possible measurement errors. Thus, [4] proposed the beta regression model with measurement errors, in which covariates measured with errors can enter the mean and precision submodels described by linear predictors. A possible extension to the model proposed by [4] is to consider a nonlinear modeling for the parameters. In fact, a real problem has shown the necessity of this modeling proposal. The interest of the problem lies in modeling the percentage of crystallinity of a chemical reagent based on different concentrations of vanadium and steam, considering two values for temperature. It is expected that the higher the concentrations of vanadium and steam, the lower the percentage of crystallinity. The loss of crystallinity is a negative effect of the process, and precise knowledge of how the vanadium concentration destroys this crystallinity is one of the chief goals of this regression modeling. As we shall describe in the application, the measurement of the vanadium concentration may be inaccurate, so this covariate may be subject to measurement error. These data were modeled by [5] based on beta and simplex nonlinear regressions in which vanadium was taken as a fixed covariate. In our application the nonlinearity is related to steam, which is here treated as a fixed covariate.

Nonlinear models with measurement errors have recently been studied in the literature; see [6–9]. Our aim here is to propose a particular class of nonlinear beta regression models with measurement errors, in which the nonlinearity is on a parameter related to a fixed covariate. We consider three methods for the estimation of the parameters, namely: approximate maximum likelihood, approximate pseudo maximum likelihood and regression calibration. We compare these three methods with the estimation of the naive model, which assumes that no covariate is measured with error, that is, the classical nonlinear beta regression model. We evaluate the properties of point and interval estimators for the four estimation methods considered, based on Monte Carlo simulations under different scenarios. Both the simulations and our application point out that estimation by approximate pseudo maximum likelihood presents the best performance. Furthermore, [10] recently proposed simplex regression models with measurement error, and as future research we shall extend that proposal to the nonlinear case.

2 Model and estimation methods

Consider n independent random variables y1, …, yn such that each yt, t = 1, …, n, is beta distributed with density (1) f(yt; μt, ϕt) = [Γ(ϕt)/{Γ(μt ϕt)Γ((1 − μt)ϕt)}] yt^{μt ϕt − 1}(1 − yt)^{(1 − μt)ϕt − 1}, 0 < yt < 1, where 0 < μt < 1, ϕt > 0, E(yt) = μt and Var(yt) = μt(1 − μt)/(1 + ϕt). The class of nonlinear beta regression models with measurement errors proposed here considers (2) where α = (α1, …, αp) ∈ IRp, β = (β1, …, βr) ∈ IRr, , λ = (λ1, …, λs) ∈ IRs are unknown parameters, , and are fixed and unknown vectors of observations and , are vectors of covariates not directly observed or associated with measurement errors. The link functions g1(⋅):(0, 1)→IR and g2(⋅):(0, ∞)→IR are strictly monotonic, continuous and twice differentiable. Further, f1(⋅) and f2(⋅) are differentiable functions with Jacobian matrices F1 = ∂η1/∂α, F2 = ∂η1/∂β, F3 = ∂η2/∂γ and F4 = ∂η2/∂λ having ranks p, r, and s, respectively. We point out that the method proposed here is not applicable to models with random effects. Finally, the parameter defining the degree of variability of precision is here defined as . For the model with constant precision, ϕ1 = … = ϕn = ϕ, and, therefore, δ = 1.
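The (μ, ϕ) parameterization above corresponds to the usual beta shape parameters a = μϕ and b = (1 − μ)ϕ. The following sketch checks the moments quoted above numerically (using SciPy; the helper name is ours):

```python
from scipy import stats

def beta_mu_phi(mu, phi):
    """Beta distribution in the (mu, phi) parameterization of Eq. (1):
    shape parameters a = mu*phi, b = (1 - mu)*phi."""
    return stats.beta(a=mu * phi, b=(1.0 - mu) * phi)

# Check the moments quoted in the text: E(y) = mu, Var(y) = mu(1 - mu)/(1 + phi)
mu, phi = 0.3, 16.44
d = beta_mu_phi(mu, phi)
assert abs(d.mean() - mu) < 1e-12
assert abs(d.var() - mu * (1 - mu) / (1 + phi)) < 1e-12
```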

Following [4] we consider that the variables with measurement errors in the submodels for mean and precision are the same. Also, the vector of random measurements xt is not directly observed. The observed vector is wt = (wt1, …, wtr), which we here suppose related to xt in the form wt = τ1 + τ2 ∘ xt + et, where et is a vector of random errors, τ1 = (τ11, …, τ1r) ∈ IRr and τ2 = (τ21, …, τ2r) ∈ IRr are vectors of unknown parameters and "∘" represents the Hadamard product; for two vectors u1 = (u11, …, u1r) and u2 = (u21, …, u2r) of the same dimension, u1 ∘ u2 = (u11 u21, …, u1r u2r). We may assume τ1 to be a vector of zeros, as we may assume τ2 to be a vector of ones. However, we can also assume them to be unknown parameters to be estimated, as we shall see in the application.

Hence, our vector of unknown parameters is , where θ = (α, β, γ, λ) and are, respectively, the vectors of interest and nuisance parameters. The joint density function of (yt, wt) is (3)

In (3), f(yt|xt;θ) represents the density of a beta distribution, is the conditional density of wt given xt, and f(xt; ξ) is the density of the explanatory variable xt. Observe that, to obtain (3), we must assume that yt is independent of wt given xt, that is, the distribution of yt given (wt, xt) involves only xt [8]. Therefore, the log-likelihood function for the n independent observations is (4)

The log-likelihood above involves integrals that are analytically intractable, so approximation methods are needed. We consider here three different approaches to estimating the parameters. To describe each of the three methods, we will suppose, for illustration, that only xt is measured with error. However, it is important to stress that the methods used here can be easily generalized to r covariates with measurement errors.

From now on, we make the following assumptions: and, finally, xt and et, with t, t′ = 1, …, n, are independent. This is a structural model, for which the vector of non-observed variables follows a normal distribution. It is important to stress that the models for μt and ϕt, t = 1, …, n, are based on (2). From the above assumptions, we have (5) (6) where kx is known as the reliability coefficient. An additional assumption is that the variance of the measurement error is known. This assumption is necessary to avoid identifiability problems. When replicas of wt are available, the sample variance of wt can be used to estimate the variance of the measurement error.
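Under these structural assumptions, and with the common normalization τ1 = 0 and τ2 = 1 (an assumption for this illustration), the observed wt has variance σx² + σe² and the reliability coefficient is kx = σx²/(σx² + σe²). A small simulation sketch, with all numerical values illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
sigma2_x, sigma2_e = 1.0, 1.0 / 3.0          # chosen so that kx = 0.75

x = rng.normal(0.0, np.sqrt(sigma2_x), n)    # latent (unobserved) covariate
e = rng.normal(0.0, np.sqrt(sigma2_e), n)    # measurement error
w = x + e                                    # observed proxy (tau1 = 0, tau2 = 1)

kx = sigma2_x / (sigma2_x + sigma2_e)        # reliability coefficient
# Empirically, Var(w) is close to sigma2_x + sigma2_e
assert abs(w.var() - (sigma2_x + sigma2_e)) < 0.05
```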

In order to establish the log-likelihood for the model to be studied, we use the results in (5) and also that , from which we obtain (7) (8)

We also suppose that the log-likelihood is a concave function of the parameters to be estimated. The estimators we shall propose have no closed form. Therefore, we use Fisher's scoring nonlinear optimization method, with the initial values proposed by [11] for nonlinear beta regression models.

2.1 Estimation by approximate maximum likelihood

The integral in (8) can be approximated using the Gauss-Hermite quadrature, given by (9) where sq and νq are, respectively, the q-th zero and weight of the Q-th order orthogonal Hermite polynomial, Q being the number of quadrature points [12]. Using the change of variable in (8), that is, , we obtain an approximation for the integral in (8) based on (9), which yields the approximate log-likelihood function given by (10) where (11). Here, μtq and ϕtq are the parameters of the beta distribution of yt when the explanatory variable xt takes the value , with and defined in (6) and sq being the q-th zero of the Q-th order orthogonal Hermite polynomial. Consequently, from (2), we have
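The quadrature rule above replaces an integral of the form ∫ e^{−s²} h(s) ds by the weighted sum Σ νq h(sq). A minimal sketch with NumPy, using the same change of variable x = μx + √2 σx s as in the text, applied to a test function with a known answer:

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

# Q-point Gauss-Hermite rule: integral of exp(-s^2) h(s) ds ≈ sum nu_q h(s_q)
Q = 50
nodes, weights = hermgauss(Q)

# Example: E[h(X)] for X ~ N(mu_x, sigma_x^2) via the change of variable
# x = mu_x + sqrt(2) * sigma_x * s used in the text.
mu_x, sigma_x = 0.0, 1.0
h = lambda x: x**2                      # h with known answer: E[X^2] = 1
approx = np.sum(weights * h(mu_x + np.sqrt(2.0) * sigma_x * nodes)) / np.sqrt(np.pi)
assert abs(approx - 1.0) < 1e-10        # exact for polynomials up to degree 2Q-1
```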

As is typical in linear beta models with errors in covariates, the asymptotic distribution of the MLEa is normal with mean Ψ and covariance matrix for n and Q large enough [4]. Here, we shall also use this result to build asymptotic confidence intervals at level α for the interest parameters of the model defined in (2). We define Ja(θ) as the partition of Ja(Ψ) related to the interest parameters in the Appendix. Before the Ja(θ) matrix, we also present in the Appendix the score functions for the parameters of the model. Here, MLEa denotes the approximate maximum likelihood estimator.

2.2 Estimation by approximate pseudo maximum likelihood (pseudo-likelihood estimation)

This estimation method maximizes a function depending only on the parameters of interest, with the nuisance parameters replaced by consistent estimators in the original likelihood function defined in (3), following [13]. Pseudo-likelihood estimation consists of first obtaining the optimal point of the log-likelihood corresponding only to the nuisance parameters, which is given by , with 1t(ξ) as in (7). Once is obtained, the pseudo log-likelihood is defined as . It is important to emphasize that the integral 2t defined in (8) will also be approximated by Gauss-Hermite quadrature. However, since involves only θ, its approximation will be a function only of the interest parameters. Let be the approximation for obtained with (9). From this, the approximate pseudo log-likelihood function for the nonlinear beta regression model with unidimensional measurement errors is defined as (12) where tq(μtq;ϕtq), defined in (11), is evaluated at , that is, , and .
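To make the two-step scheme concrete, the following sketch implements it for a deliberately simplified model: one error-prone covariate, constant precision, logit link. Every name and numerical value here is illustrative, not the paper's exact model; the integral over the latent x is written equivalently as an integral against the normal conditional density of x given w, which leaves a function of the interest parameters only once the nuisance parameters are plugged in.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss
from scipy.special import gammaln, expit
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, sigma2_e = 200, 0.25                 # measurement-error variance taken as known
alpha1, beta1, phi = -0.5, 1.0, 50.0    # true interest parameters (illustrative)
x = rng.normal(0.0, 1.0, n)             # latent covariate
w = x + rng.normal(0.0, np.sqrt(sigma2_e), n)
mu = expit(alpha1 + beta1 * x)
y = rng.beta(mu * phi, (1.0 - mu) * phi)

# Step 1: estimate the nuisance parameters from the marginal of w alone.
mu_x_hat = w.mean()
sigma2_x_hat = max(w.var(ddof=1) - sigma2_e, 1e-6)
kx_hat = sigma2_x_hat / (sigma2_x_hat + sigma2_e)

# Step 2: maximize, in the interest parameters only, the Gauss-Hermite
# approximation of the likelihood, integrating the beta density of y over
# the normal conditional distribution of x given w.
nodes, wts = hermgauss(30)
cond_mean = mu_x_hat + kx_hat * (w - mu_x_hat)   # E(x | w)
cond_sd = np.sqrt(kx_hat * sigma2_e)             # sd(x | w)
xq = cond_mean[:, None] + np.sqrt(2.0) * cond_sd * nodes[None, :]

def neg_pseudo_loglik(theta):
    a1, b1, log_phi = theta
    ph = np.exp(log_phi)
    muq = expit(a1 + b1 * xq)
    # log of the beta density in the (mu, phi) parameterization
    logf = (gammaln(ph) - gammaln(muq * ph) - gammaln((1.0 - muq) * ph)
            + (muq * ph - 1.0) * np.log(y)[:, None]
            + ((1.0 - muq) * ph - 1.0) * np.log1p(-y)[:, None])
    lik = np.exp(logf) @ wts / np.sqrt(np.pi)
    return -np.sum(np.log(np.maximum(lik, 1e-300)))

fit = minimize(neg_pseudo_loglik, x0=np.array([0.0, 0.5, np.log(10.0)]),
               method="Nelder-Mead", options={"maxiter": 3000, "xatol": 1e-6})
a1_hat, b1_hat, phi_hat = fit.x[0], fit.x[1], np.exp(fit.x[2])
```

With this sample size the recovered slope b1_hat lands near the true β1 = 1.0, unlike a naive fit on w, which is attenuated toward zero.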

Let be the approximate pseudo maximum likelihood estimator of θ, obtained by maximizing (12). Then, under assumptions that are usually satisfied in practice, it can be shown that the asymptotic distribution of is normal with zero mean and covariance matrix , with , and , where rt(ξ) and pt(θ,ξ) are, respectively, the t-th elements of the log-likelihood restricted to the nuisance parameters, , and of p(θ,ξ) given in (12). For the nonlinear beta regression model with measurement errors, the matrices Iθ, Iξ and Iθξ have no explicit forms. Since the integral is approximated numerically, the second derivatives of rt(ξ) and pt(θ,ξ) are used, and the expected information matrix is replaced by the observed information matrix [14].

2.3 Estimation by regression calibration

In regression calibration, the central idea is to replace the non-observed variable xt in the likelihood function by an estimate of the expected value of xt given wt, E(xt|wt). If only one variable is measured with error, the calibration function is . Since , and are optimal estimators of μx and , respectively. However, it is not possible to estimate from the observed data w1, …, wn alone. We assume that kx or is known, or, alternatively, that it can be estimated when replicas of wt are available.

Replacing the estimated calibration function in the probability density function of yt given xt, that is, in f(yt|xt;α, β, γ, λ) given in (1), we obtain the modified log-likelihood , where (μt;ϕt) = log Γ(ϕt) − log Γ(μt ϕt) − log Γ[(1 − μt)ϕt] + (μt ϕt − 1) log yt + [(1 − μt)ϕt − 1] log (1 − yt), , and , with being the known or estimated value of kx. Observe that the modified log-likelihood involves only the interest parameter θ and is the same as that of the usual beta regression model, since plays the role of an explanatory variable measured without error. Standard errors of these estimators can be estimated via nonparametric bootstrap.
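A minimal sketch of the calibration step itself, using moment estimators of the nuisance parameters (all numerical values illustrative, with the error variance assumed known):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2_e = 2000, 0.25        # measurement-error variance assumed known
x = rng.normal(1.0, 1.0, n)     # latent covariate
w = x + rng.normal(0.0, np.sqrt(sigma2_e), n)

# Moment estimators of mu_x and sigma_x^2 from the observed w
mu_x_hat = w.mean()
sigma2_x_hat = w.var(ddof=1) - sigma2_e
kx_hat = sigma2_x_hat / (sigma2_x_hat + sigma2_e)

# Calibration function: replace w by the estimated E(x | w)
x_cal = mu_x_hat + kx_hat * (w - mu_x_hat)

# x_cal shrinks w toward its mean and tracks the latent x better than w does
assert np.mean((x_cal - x) ** 2) < np.mean((w - x) ** 2)
```

The calibrated values x_cal are then simply fed to the usual beta regression fit in place of an error-free covariate, as the text notes.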

3 Simulations

In order to check the performance of the estimation methods described in Section 2, we have carried out simulation studies with 10,000 Monte Carlo replications each. We consider beta regression models in which some of the covariates are measured with error and which, at the same time, are nonlinear with respect to the parameters associated with the covariates measured without error. This is the structure we have used in our application. The model for which we perform our simulations is a particular case of the class of models proposed in Section 2.

Model parameters are estimated with the naive model (ιnaive), which uses the traditional beta likelihood function, approximate maximum likelihood (ιa), approximate pseudo maximum likelihood (ιp) and regression calibration (ιrc). In all optimization processes, we have used the nonlinear BFGS quasi-Newton algorithm with numerical derivatives as implemented in the Ox programming language [15]. Integral approximations were computed using Q = 12, 30, 50 and 80 quadrature points; these four numbers of points yielded very similar results. We present here the results for Q = 50.

We have performed simulations for n = 40, 80 and 160. At each Monte Carlo replication, the n fixed observations of the covariate without measurement error, denoted by zt1, were obtained as independent draws from a uniform distribution on (0.2, 1.2), t = 1, …, n. Furthermore, xt and et are independent, t, t′ = 1, …, n. The simulations were performed for three different values of kx, namely: kx = 0.50 (, high measurement error), kx = 0.75 (, moderate measurement error) and kx = 0.95 (, low measurement error). Finally, we consider , that is, the logit link function, and g2(ϕt) = log(ϕt), t = 1, …, n.

3.1 Scenario 1—Constant precision

The model is defined as

The true values of the submodel parameters are α1 = −0.6, α2 = 2.4 and β1 = 0.8. Aiming to compare the performance of the competing estimation schemes presented in Section 2 for different precision magnitudes, we have chosen three values for γ1, namely 2.8, 4.0 and 5.7, leading to ϕ = 16.44, ϕ = 54.60 and ϕ = 298.87; in beta regression these ϕ values typically represent three degrees of model precision, namely substantially low, low to medium, and medium to high, respectively. Concerning the nuisance parameters, we chose μx = 0 and . Thus, for kx = 0.50 (, high measurement error), kx = 0.75 (, moderate measurement error) and kx = 0.95 (, low measurement error).
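Since g2 is the log link, the chosen γ1 values map directly to the quoted precisions via ϕ = exp(γ1):

```python
import math

# phi = exp(gamma1) under the log link g2(phi) = log(phi)
for gamma1 in (2.8, 4.0, 5.7):
    print(round(math.exp(gamma1), 2))
# 16.44, 54.6, 298.87
```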

3.2 Scenario 2—Varying precision I

In this scenario there is a covariate measured with error in both submodels. However, the nonlinearity is only in the mean submodel, such that

The true values of the parameters are α1 = −0.6, α2 = 2.4, β1 = 0.8, γ1 = 2.5, λ1 = 0.9. In this scenario we have μx = 0 and , yielding the same values of as in Scenario 1, resulting in kx = 0.50 (), kx = 0.75 () and kx = 0.95 ().

3.3 Scenario 3—Varying precision II

This is the most complex scenario: the beta regression model with varying precision, in which both submodels present measurement error and nonlinearity. The complete model is defined as

We admit that vt1 = zt1, for t = 1, …, n. The true values of the parameters are α1 = 0.7, α2 = 2.0, β1 = −1.5, γ1 = 1.5, γ2 = 2.0 and λ1 = 1.3. Here, μx = 1.5 and . Thus we have used kx = 0.50, with , kx = 0.75 with and kx = 0.95 with .

3.4 Simulation results

Table 1 displays the RMSE (root mean square error) of the estimators in Scenario 1. From this table, overall, we note that the RMSE decreases considerably when the model precision, the kx value and the sample size increase, which is the expected behavior.

Table 1. Root mean square error (RMSE) for the estimators of α1, α2, β1 and γ1 for the nonlinear beta regression model with constant precision.

Scenario 1.

https://doi.org/10.1371/journal.pone.0254103.t001

However, this behavior does not hold for the estimators of γ1, especially when we use the naive and regression calibration methods. It is noteworthy how the RMSEs of for these two methods become larger when ϕ increases. We focus on kx = 0.50 and n = 160. For ϕ = 16.4 and ϕ = 298.9, the RMSEs of for the naive (ιnaive), regression calibration (ιrc), approximate likelihood (ιa) and pseudo-likelihood (ιp) estimation methods are (0.7796, 0.7796, 0.7784, 0.7149) and (3.0352, 3.0352, 0.9895, 0.9565), respectively. Furthermore, the RMSE of does not decrease for the naive and regression calibration methods when the sample size increases, notably for kx = 0.50 and kx = 0.75.

Overall, the best RMSE-related performances are those of the approximate likelihood schemes. The largest RMSEs are those of . When ϕ = 298.9 these RMSEs are larger than when ϕ = 16.4. However, they decrease substantially when the sample size and the reliability coefficient increase. Furthermore, , and do not display large RMSEs even when the precision is low. Let us focus on the ιp (pseudo-likelihood) scheme and on , the estimator related to the measurement error. For kx = 0.75 and n = 40, when ϕ = 16.4 and ϕ = 298.9 the RMSEs are 0.1825 and 0.1059, respectively. Finally, it is important to note that only for n = 40, kx = 0.50 and the parameter β1 does estimation by approximate likelihood outperform the pseudo-likelihood method in terms of RMSE.

Fig 1 presents plots of the biases of the parameter estimators in Scenario 1, for n = 40, 80, 160, ϕ ≈ 17, 55, 300 and kx = 0.50. It is noteworthy how poorly the naive structure and regression calibration perform, especially for the precision estimation. As the value of ϕ increases, the biases of the γ1 estimators move considerably away from zero, reaching absolute values close to three. Both estimation schemes also show the worst performances for estimating the nonlinearity parameter α2, since their biases move further away from zero as the sample size increases. However, it is important to note that the absolute biases of , equally close to 0.2 for both methods, are considerably small compared to the true value of the parameter, α2 = 2.4.

Fig 1. Biases of estimators of α1, α2, β1 and γ1, for ϕ = 16.4, 54.6, 298.9.

ιa (square), ιp (circle), ιrc (triangle) and ιnaive (star). Scenario 1: , g2(ϕ) = γ1, t = 1, …, n, n = 40, 80, 160. kx = 0.50.

https://doi.org/10.1371/journal.pone.0254103.g001

In fact, it is remarkable how badly the naive estimator of β1 behaves. Its bias is constant for any value of n and ϕ and considerably high, equal to 0.4, while β1 = 0.8. Since β1 is the coefficient of the covariate measured with error, this behavior strongly disfavors the naive structure.
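This constant naive bias mirrors the classical attenuation effect: ignoring the measurement error shrinks the slope toward zero by the factor kx. In the linear analogue (an illustration only, not the paper's beta model), the naive slope converges to kx·β1, so with kx = 0.50 a true slope of 0.8 is estimated near 0.4:

```python
import numpy as np

rng = np.random.default_rng(7)
n, beta1, sigma2_x, sigma2_e = 200_000, 0.8, 1.0, 1.0   # kx = 0.5
x = rng.normal(0.0, np.sqrt(sigma2_x), n)               # latent covariate
w = x + rng.normal(0.0, np.sqrt(sigma2_e), n)           # error-prone proxy
y = beta1 * x + rng.normal(0.0, 0.1, n)                 # linear response

# Naive least-squares slope of y on the observed w
naive_slope = np.cov(w, y)[0, 1] / np.var(w)
kx = sigma2_x / (sigma2_x + sigma2_e)
assert abs(naive_slope - kx * beta1) < 0.01             # attenuated toward zero
```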

Regarding the methods based on approximate likelihood (ιa and ιp), Fig 1 leads to important conclusions. In general, their biases are not markedly high, especially when the sample size increases, even when ϕ is quite low. As an example, for β1, the coefficient of the covariate measured with error, the largest biases of these two estimators occur when n = 40 and ϕ ≈ 17, and are close to −0.1, getting closer and closer to zero for ϕ ≈ 55 and ϕ ≈ 300 (Fig 1(g)–1(i)). Consider now α2, the nonlinearity parameter. We note that the biases of the approximate likelihood and pseudo-likelihood estimators for this parameter are close to zero when n = 160 (Fig 1(d)–1(f)).

For a significant amount of measurement error, i.e., kx = 0.50, only when the ιa and ιp methods are used do the biases of the estimators of all model parameters decrease as the sample size increases, and their performances are quite similar. However, for , the ιp method exhibits slightly lower bias than the ιa method, particularly for n = 40.

The fact described above is the first evidence of a typically better performance of the ιp method. The smaller the bias of , the smaller the bias of the variances of the model estimators and the better the performance of the confidence intervals and hypothesis tests related to the model parameters.

In what follows we discuss interval estimation with respect to Scenario 1. Table 2 displays the coverage rates of the estimated asymptotic 95% confidence intervals for kx = 0.50, 0.75, 0.95, n = 80 and ϕ = 16.4, 298.9.

Table 2. Coverage rates and average lengths of the nominal 95% confidence interval estimators.

kx = 0.50, 0.75, 0.95. Scenario 1: , g2(ϕ) = γ1, t = 1, …, n, n = 80.

https://doi.org/10.1371/journal.pone.0254103.t002

At this point, it is interesting to note how the high biases negatively affected the performance of the interval estimation. In particular, the considerable overestimation of γ1 when we use the naive and regression calibration methods leads to highly underestimated variances of the estimators, the more so the larger the value of ϕ. Therefore, we have interval lengths that are too small, to the point of not covering the true values of the parameters. For example, when ϕ = 298.9 the coverage rates for the parameter γ1 based on the two methods are equal to zero, and 100% of the estimated values are greater than the true value of the parameter (Table 2).

We turn now to the coefficient of the covariate with measurement error, i.e., the parameter β1. Since the naive estimator considerably overestimates this parameter, regardless of the value of ϕ (Fig 1(j)–1(l)), in association with the underestimation of the variances already commented on above, 100% of its interval estimates for β1 lie above the true value of the parameter. The performance of the interval estimator based on the regression calibration method is considerably better than that of the naive method for β1. However, it is still considerably poorer than those of the ιa and ιp methods. For kx = 0.50 and ϕ = 298.9, the coverage rates and average lengths for ιrc, ιa and ιp are (66.59%, 73.27%, 93.38%) and (0.40, 0.75, 2.02), respectively. Here we note the outperformance of the ιp scheme. Its coverage rate is the closest to the nominal level of 95%, and its average length, equal to 2.02, decreases to 0.31 for kx = 0.75, ensuring a coverage rate closer to the nominal level (Table 2).

We now evaluate the interval estimators of α2. Based on Table 2 we note that for ϕ = 16.4, the ιnaive, ιrc, ιa and ιp schemes show coverage rates near each other, equal to (90.12%, 88.94%, 91.65%, 89.45%), respectively, for kx = 0.75. These coverage rates are even closer when kx = 0.50 and kx = 0.95. However, when ϕ = 298.9 the behavior of the four methods differs considerably. When kx = 0.75 those coverage rates become (90.93%, 89.82%, 72.78%, 75.20%). When kx = 0.50, they differ further, especially for the ιa estimator, being equal to (87.91%, 87.26%, 61.40%, 75.29%). Here it is necessary to increase the sample size to better understand the behavior of the estimators.

Thus, Fig 2 displays the coverage rates of the interval estimators of the beta regression model with measurement errors and nonlinearity with constant precision equal to 16.4. We consider n = 40, 80, 160 and kx = 0.95, 0.75, 0.50.

Fig 2. Coverage rates of the nominal 95% confidence intervals for the parameters α1, α2, β1 and γ1.

kx = 0.95, 0.75, 0.50. Scenario 1: , g2(ϕ) = γ1, t = 1, …, n. n = 40, 80, 160 ϕ = 16.4.

https://doi.org/10.1371/journal.pone.0254103.g002

These coverage rates are related to the biases of the point estimators. In this case, the ιnaive and ιrc methods have been incorrectly favored by being more biased. Their respective point estimators slightly overestimate α2, and this fact, associated with the considerable underestimation of the variance, in particular when ϕ = 298.9 (Fig 1(l)), leads to interval estimators that cover the true value of the parameters without loss of accuracy. However, this changes as the sample size increases. We can see in Fig 2(f) that when n = 160, the coverage rates of the interval estimators for α2 related to the ιa and ιp methods are close to the nominal level of 95%, while those of the ιnaive and ιrc methods remain constant for all sample sizes and close to 90%.

The ιa method is the most faithful in its performance, even for n = 80 and kx = 0.50 (ϕ = 298.9). The average bias of its can be considered equal to zero. Besides, its largest average bias is about −0.1, while α2 = 2.4. Thus, the chance that its interval estimates cover the true value of the parameter depends only on how well the point estimator's distribution is approximated by the normal distribution. For the ιa estimators this approximation seems to improve as the sample size increases, since the coverage rate of 84.68%, for n = 80 and ϕ = 16.4, increases and remains close to the nominal level, in this case 95%. This behavior can be seen in Fig 2(f).

This figure confirms what we discussed in the last paragraph. The interval estimators of the ιa and ιp methods are the ones presenting coverage rates closest to the nominal level of 95%, taking into account all parameters of the model. Most importantly, these rates get closer to the nominal level as the sample size increases. For the nonlinear beta regression with measurement error and constant precision, we conclude that the best performance regarding interval estimation belongs to the approximate pseudo maximum likelihood method.

Table 3 presents the RMSE (root mean square error) of the estimators for the second and third scenarios. We begin by considering the second scenario. We observe that, in practically all situations, the RMSE decreases when the sample size increases. The exceptions lie in the estimation of γ1 of the precision model when we use the naive model and regression calibration with significant amounts of measurement error, i.e., kx = 0.75 and kx = 0.50. For example, when using the naive model for kx = 0.50, we obtained , respectively, for n = (40, 80, 160).

Table 3. Root mean square error (RMSE) for the estimators of α1, α2, β1, γ1, γ2 and λ1 for the nonlinear beta regression model with nonconstant precision.

Scenario 2 and Scenario 3.

https://doi.org/10.1371/journal.pone.0254103.t003

In fact, for smaller values of kx it is reasonable that for at least one estimator (in our case it was ) we obtain a larger RMSE when n increases under the naive model. This is because, for small values of kx, the naive model represents a poor approximation to the true model, and larger values of n carry more information about the data. Therefore, larger values of n will reveal that the approximation is not good, which translates into a larger RMSE for at least one estimator. Also, since regression calibration does not work with the full information about the data distribution, but only with moments, it is expected that this loss of information intrinsic to regression calibration will be felt in the estimation process when more data are available. In any case, the poor performance of the naive model is a clear indication of the necessity, in practice, of the model proposed here.

We turn now to the parameter β1, which is the coefficient of the covariate with measurement error. Estimation by approximate maximum likelihood (ιa) and approximate pseudo maximum likelihood (ιp) presents the better performances. This is expected, since these methods use the full information about the data distribution. We then investigate whether regression calibration has comparable performance, and we can see that it does not. For example, in this scenario, for kx = 0.50 and n = 40, we have for the estimator that , while the values of the RMSE are typically smaller than 1 for the other estimators.

In general, with respect to the complete estimation of the model, when we have large measurement errors (kx = 0.50) and the sample size increases, the performances of approximate pseudo maximum likelihood and approximate maximum likelihood are comparable. The pseudo maximum likelihood method is therefore recommended here.

From the simulation results in Table 3, we can compare, for different reliability coefficients, the performances of the different estimators in terms of RMSE for the beta regression model with measurement errors in one covariate and nonlinearity in one parameter of both the mean and precision submodels, i.e., Scenario 3. For the coefficient β1 of the covariate with measurement error, the naive model presents the worst performance, in particular when the reliability coefficient is kx = 0.50. For example, for n = 80 and kx = 0.50, the RMSEs of the ιnaive, ιrc, ιa and ιp estimators of β1 are, respectively, 0.8355, 0.4412, 0.3088 and 0.3032. Thus, the approximate maximum likelihood and approximate pseudo maximum likelihood estimators present much better performances. The much worse performance of the naive estimator again indicates the usefulness of our proposed model. Also, we observe that approximate maximum likelihood and approximate pseudo maximum likelihood are comparable, which means that pseudo-likelihood can be a useful tool in our specific proposal.

Figs 3 and 4 present plots of the biases of the estimators of the parameters in Scenario 2, for n = 40, 80, 160. It is possible to see that the naive structure and regression calibration present much larger biases. Moreover, for these two estimators, the absolute values of the biases increase with the measurement error. In particular, consider and , estimators we are especially interested in, since they are related to the nonlinearity and the measurement error of the mean submodel. Fig 3(d)–3(i) shows how the biases of both estimators are near zero when we use approximate likelihood (ιa) and approximate pseudo-likelihood (ιp), for all sample sizes.

Fig 3. Biases of estimators of α1, α2 and β1 for kx = 0.95, 0.75, 0.50.

ιa (square), ιp (circle), ιrc (triangle) and ιnaive (star). Scenario 2: , g2(ϕt) = γ1 + λ1 xt1, t = 1, …, n, n = 40, 80, 160.

https://doi.org/10.1371/journal.pone.0254103.g003

Fig 4. Biases of estimators of γ1 and λ1 for kx = 0.95, 0.75, 0.50.

ιa (square), ιp (circle), ιrc (triangle) and ιnaive (star). Scenario 2: , g2(ϕt) = γ1 + λ1 xt1, t = 1, …, n, n = 40, 80, 160.

https://doi.org/10.1371/journal.pone.0254103.g004

It is remarkable how the bias of (related to the nonlinearity) increases with the measurement error for the regression calibration method (Fig 3(d)–3(f)). The same behavior can be observed for the precision submodel parameters, in particular for , which estimates the coefficient of the covariate with measurement error. In this case, the biases of the estimators using regression calibration increase considerably for large sample sizes and measurement errors (Fig 4(d)–4(f)). In fact, only approximate maximum likelihood and approximate pseudo maximum likelihood produce estimators with biases tending to zero as the sample size increases.

Figs 5 and 6 present the plots of the biases of the estimators of the parameters in Scenario 3, namely, nonlinear predictors with measurement error in both submodels. Approximate maximum likelihood (ιa) and approximate pseudo maximum likelihood (ιp) perform much better than the naive and regression calibration (ιrc) methods for the estimator of β1, the coefficient of the covariate with measurement error in the mean submodel (Fig 5(g)–5(i)). Estimators of the parameters related to nonlinearity are more biased, but, even so, the ιa and ιp methods perform slightly better than ιnaive and ιrc, in particular for larger values of the measurement error (Fig 5(f)). We can also check that, for the parameters of the precision submodel, the maximum likelihood based methods are clearly superior (Fig 6). In fact, once more, only the ιa and ιp methods yield estimators whose biases decrease as the sample size increases.

Fig 5. Biases of estimators of α1, α2 and β1 for kx = 0.95, 0.75, 0.50.

ιa (square), ιp (circle), ιrc (triangle) and ιnaive (star). Scenario 3: , , t = 1, …, n, n = 40, 80, 160.

https://doi.org/10.1371/journal.pone.0254103.g005

Fig 6. Biases of estimators of γ1, γ2 and λ1 for kx = 0.95, 0.75, 0.50.

ιa (square), ιp (circle), ιrc (triangle) and ιnaive (star). Scenario 3: , , t = 1, …, n, n = 40, 80, 160.

https://doi.org/10.1371/journal.pone.0254103.g006

Figs 7 and 8 present the coverage rates, for the different sample sizes and reliability coefficients, of the interval estimators for the beta regression model with measurement errors and nonlinearity in both submodels (Scenario 3). From those figures, it becomes evident that the approximate pseudo maximum likelihood method is a good alternative for estimating the parameters of the models.

Fig 7. Coverage rates of the nominal 95% confidence intervals for the parameters α1, α2 and β1.

ιa (square), ιp (circle), ιrc (triangle) and ιnaive (star). Scenario 3: , , t = 1, …, n, n = 40, 80, 160.

https://doi.org/10.1371/journal.pone.0254103.g007

Fig 8. Coverage rates of the nominal 95% confidence intervals for the parameters γ1, γ2 and λ1.

ιa (square), ιp (circle), ιrc (triangle) and ιnaive (star). Scenario 3: , , t = 1, …, n, n = 40, 80, 160.

https://doi.org/10.1371/journal.pone.0254103.g008

It is still important to stress how inference based on interval estimation for β1 can be affected when we do not take into account that the covariate is measured with error, that is, when we use the naive method. This becomes more remarkable if we consider that this is perhaps the parameter of greatest interest, the coefficient of the covariate in the mean submodel. Fig 7(g)–7(i) clearly shows how much we lose with naive estimation, even when the measurement error is small, kx = 0.95, since the coverage rates of the naive intervals become considerably distant from the nominal 95% level. This is particularly true for large sample sizes (Fig 7(g)). The sample size n = 160 provides enough information to detect that we may be dealing with a wrong asymptotic normal distribution for the estimators, or perhaps even with inconsistent estimators, due to the wrong assumption that the covariate has no measurement error. As expected, this behavior becomes more severe as the measurement error increases (Fig 7(h) and 7(i)).
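The coverage rates discussed here are empirical: the fraction of Monte Carlo replications whose interval contains the true parameter. A minimal sketch (illustrative Python with hypothetical values, not the paper's simulation design):

```python
import numpy as np

def coverage_rate(lower, upper, true_value):
    """Fraction of Monte Carlo confidence intervals containing the truth."""
    lower, upper = np.asarray(lower), np.asarray(upper)
    return float(np.mean((lower <= true_value) & (true_value <= upper)))

# Toy check: Wald-type intervals built from a correctly specified estimator
# should cover at close to the nominal 95% level; a biased (naive-style)
# estimator drifts away from it, as in Fig 7.
rng = np.random.default_rng(0)
beta1, se = 1.5, 0.3
est = rng.normal(beta1, se, size=20000)  # sampling distribution of the estimator
lo, hi = est - 1.96 * se, est + 1.96 * se
print(round(coverage_rate(lo, hi, beta1), 3))  # near 0.95
```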

Fig 8 refers to interval estimation of the parameters of the precision submodel. Here we emphasize, in particular, the very good performance of approximate pseudo maximum likelihood estimation. Considering all parameters, measurement errors and sample sizes, this seems to be the recommended method. This is an important aspect, since good estimation of the precision parameters of the observations is directly linked to the efficiency of the estimators and the robustness of hypothesis tests about the model parameters.

4 An application: Fluid Catalytic Cracking (FCC)

Fluid Catalytic Cracking (FCC) is an important chemical process used in petroleum refineries to convert heavy hydrocarbons into smaller molecules with high commercial value. This conversion is accomplished by contact of those hydrocarbons with a catalyst whose main component is a mineral known as zeolite Y [16]. A chemical element that can be found in FCC is vanadium. This element takes part in the process by reducing, among other characteristics, the crystallinity of zeolite Y, in particular when water steam is present. The chemical reaction also depends on temperature, which must be near 720 °C [16]. Hence, at the end of the process, it is important to study how the crystallinity fraction of zeolite Y is influenced by different concentrations of vanadium, by water steam and by temperature. The data set consists of n = 28 observations. The authors report that the response is moderately concentrated in the upper part of the unit interval; according to them, 75% of the observations are not smaller than 0.77. This characteristic will be an important factor in our modeling.

[5] modeled the response, considering the beta and simplex distributions, by using a nonlinear predictor for the mean but without measurement errors in the covariates. However, following [16], the concentration of vanadium for each observation was determined by a spectrophotometric method, with the aid of a chemical complex that includes the element and using a calibration curve. This calibration curve was defined with the same spectrophotometric method, using another complex that includes vanadium and a different reagent. Therefore, the concentration of vanadium can be considered a covariate that is obtained for each case with measurement error. Extending the nonlinear model in [5], we propose the nonlinear beta regression model with measurement errors given by (13), where x1t represents the vanadium concentration, z1t denotes water steam and z2t is a categorical variable indicating at which of the two temperatures the experiment was run (0 = 700 °C and 1 = 760 °C). Since x1t is a covariate with measurement error, we first need to estimate the nuisance parameters and . Thus, we fit a simple linear regression for the relation between the original x and the observed w chemical complexes, as wt = τ1 + τ2 xt + et, with t = 1, …, 15. The estimates obtained were (0.123), (0.089) and , with R2 = 0.98. Also, we have , supposing . This yields an estimate of the reliability coefficient of , a low measurement error.
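This nuisance-parameter step is an ordinary calibration regression. A hypothetical Python analogue (synthetic values standing in for the 15 calibration points; not the authors' measurements) estimating τ1, τ2, the measurement-error variance and the reliability coefficient:

```python
import numpy as np

# Hypothetical calibration data: true concentrations x and error-prone
# readings w from the spectrophotometric method (15 calibration points).
rng = np.random.default_rng(1)
x = np.linspace(0.5, 5.0, 15)
w = 0.1 + 1.0 * x + rng.normal(0.0, 0.08, size=15)  # w_t = tau1 + tau2*x_t + e_t

# Ordinary least squares for (tau1, tau2).
X = np.column_stack([np.ones_like(x), x])
tau_hat, *_ = np.linalg.lstsq(X, w, rcond=None)
resid = w - X @ tau_hat
sigma2_u = resid @ resid / (len(x) - 2)   # measurement-error variance estimate
sigma2_x = x.var(ddof=1)                  # variability of the true covariate
k_x = sigma2_x / (sigma2_x + sigma2_u)    # reliability coefficient
print(tau_hat, sigma2_u, round(k_x, 3))   # k_x near 1 => low measurement error
```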

We have tried two different link functions: the logit and the complementary log-log. According to [17], when the mean of the response variable is near 1, the complementary log-log link function for the mean softens the impact of influential points on maximum likelihood estimation. We therefore consider both link functions and compare the performances of the two fits.
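The difference between the two links near the upper boundary is easy to quantify; a small sketch (plain Python, illustrative μ values):

```python
import math

def logit(mu):
    """Logit link: log(mu / (1 - mu))."""
    return math.log(mu / (1.0 - mu))

def cloglog(mu):
    """Complementary log-log link: log(-log(1 - mu))."""
    return math.log(-math.log(1.0 - mu))

# Near the upper boundary the cloglog predictor grows much more slowly
# than the logit one, which tempers the leverage of observations with
# mu close to 1 (75% of the crystallinity responses are >= 0.77).
for mu in (0.77, 0.90, 0.99):
    print(mu, round(logit(mu), 3), round(cloglog(mu), 3))
```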

Tables 4 and 5 present the lower and upper limits of the confidence intervals estimated for the parameters of Model (13) at the 90%, 95% and 99% confidence levels, considering the logit and complementary log-log link functions, respectively. We also present the lengths of the interval estimates. From both tables, it can be seen that the only interval estimate containing zero is that for α2 at the nominal level of 99%, using the method of regression calibration and the logit model for the mean. Note that by accepting the hypothesis that α2 = 0, we eliminate the corresponding term from the model and, consequently, not only the covariate z1t but also the nonlinearity. Also, observe that the length 0.245 of this confidence interval is quite large, more than twice the length 0.0936 of the confidence interval for the same parameter and estimation method using the complementary log-log link function.

Table 4. Lower limits (LL), upper limits (UL) and lengths of the interval estimates obtained at the 90%, 95% and 99% confidence levels, using as estimation methods the naive method (ιnaive), regression calibration (ιrc), approximate maximum likelihood (ιa) and approximate pseudo maximum likelihood (ιp).

Logit link function. Fluid Catalytic Cracking (FCC) application.

https://doi.org/10.1371/journal.pone.0254103.t004

Table 5. Lower limits (LL), upper limits (UL) and lengths of the interval estimates obtained at the 90%, 95% and 99% confidence levels, using as estimation methods the naive method (ιnaive), regression calibration (ιrc), approximate maximum likelihood (ιa) and approximate pseudo maximum likelihood (ιp).

Complementary log-log link function. Fluid Catalytic Cracking (FCC) application.

https://doi.org/10.1371/journal.pone.0254103.t005

In fact, it is remarkable how the complementary log-log model presents better interval estimation, with confidence intervals whose lengths are much smaller than the corresponding ones for the logit model, in all cases. For example, for α3, the coefficient related to temperature, when the nominal confidence level is 95% and we use approximate pseudo maximum likelihood, the lengths of the interval estimates are 0.4220 and 0.2099 for the logit and complementary log-log models, respectively. Comparing the estimation methods, it is also remarkable that approximate pseudo maximum likelihood yields interval estimates with lengths that are, in general, smaller than those obtained with the other estimation methods. For example, using the complementary log-log link function, for the confidence intervals for α3, the parameter that makes the model nonlinear, the lengths associated with ιp are (6.22, 7.41, 9.74), while those associated with ιa, ιrc and ιnaive are, respectively, (8.96, 10.68, 14.03), (8.35, 9.95, 13.07) and (8.94, 10.65, 13.99).

It is important to observe here that approximate pseudo maximum likelihood performs much better than the naive model, the one that does not take measurement error into account, even for a very low measurement error, that is, . This shows how important it can be to consider measurement errors when estimating nonlinear beta regression models.

From all the results presented here, we select for this particular data set the nonlinear beta regression model in which the covariate x1t, the vanadium concentration, has measurement error, with the regression structure defined by

Inference for the measurement error in x1t produces , , and estimation based on approximate pseudo maximum likelihood yields , , , , equal to 0.910, −0.049, −26.451, −0.182, −0.292 and 4.405, respectively, with standard errors given by 0.063, 0.011, 1.891, 0.054, 0.050 and 0.286, all parameters being significant at the 1% level. Finally, note that , which for beta regression can be considered a moderate precision, not too low but far from high. That is, the application shows similarities with Scenario 1 of the simulations, for ϕ = 55 and kx = 0.95. The results of the application thus confirm those of the simulations, in which, among the methods evaluated, the approximate pseudo maximum likelihood method presents the best performance regarding interval estimation.
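In beta regression the precision parameter translates directly into the response variance, Var(y) = μ(1 − μ)/(1 + ϕ) [1]; a quick check (illustrative Python, with a hypothetical mean value) shows why a precision near 55 is only moderate:

```python
def beta_reg_variance(mu, phi):
    """Variance of a beta response in the mean/precision parameterization."""
    return mu * (1.0 - mu) / (1.0 + phi)

# For a mean response of 0.9 (crystallinity near the upper boundary) and
# phi = 55, the response standard deviation is still about 0.04 -- small,
# but not negligible on the unit interval.
var = beta_reg_variance(0.9, 55.0)
print(var, var ** 0.5)
```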

5 Conclusions

We have proposed in this work a beta regression model with nonlinearity in the parameters and covariates measured with error in both the mean and precision submodels. The log-likelihood for this kind of model is written in terms of integrals without closed-form solutions. For this reason, we have proposed to approximate those integrals using Gaussian quadrature. This yields the estimators we have referred to in this paper as the approximate maximum likelihood estimators. Because the approximate likelihood can be a very difficult function to maximize, we have also considered an approximate pseudo maximum likelihood, in which we first estimate the nuisance parameters. The advantage of this two-phase method is that it reduces the dimension of the problem, which makes the maximization easier. We have also tried a regression calibration method, in which the nonobserved variable is replaced by its conditional expectation in the likelihood function.
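The quadrature step can be sketched as follows (an illustrative Python stand-in for the authors' Ox implementation; `gauss_hermite_expectation` is a name chosen for exposition). With the change of variable x = μ + √2 σu, an integral against a Gaussian density reduces to a weighted sum over Hermite nodes:

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def gauss_hermite_expectation(f, mu, sigma, n_points=20):
    """Approximate E[f(X)] for X ~ N(mu, sigma^2) by Gauss-Hermite quadrature.

    After the substitution x = mu + sqrt(2)*sigma*u, the Gaussian integral
    takes the Hermite weight form: E[f(X)] ~= sum_q w_q f(x_q) / sqrt(pi).
    """
    nodes, weights = hermgauss(n_points)
    x = mu + np.sqrt(2.0) * sigma * nodes
    return float(np.sum(weights * f(x)) / np.sqrt(np.pi))

# Check against a closed form: for X ~ N(1, 0.5^2), E[X^2] = mu^2 + sigma^2.
print(gauss_hermite_expectation(lambda x: x**2, mu=1.0, sigma=0.5))  # 1.25
```

In the model itself, f would be the beta density of the response given the unobserved covariate, so the same weighted sum approximates each observation's likelihood contribution.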

Numerical simulations have shown that the approximate pseudo maximum likelihood proposal is a very good alternative for estimating the model parameters, competing with the approximate maximum likelihood estimators. On the other hand, regression calibration is not a good proposal, which suggests that too much information about the observations is lost when this method is used. Therefore, regression calibration is not recommended here, whereas we strongly recommend approximate pseudo maximum likelihood.

The application has also shown that approximate pseudo maximum likelihood performs much better than the naive model, which does not take measurement error into account, even for a very low measurement error. This shows how important it can be to consider measurement errors when estimating nonlinear beta regression models. Furthermore, the superior performance of the complementary log-log link function, in a sample whose responses are close to the upper limit of the unit interval, is noteworthy.

6 Appendix: Approximate maximum likelihood

Differentiating the approximate log-likelihood ℓa(Ψ) defined in (10) with respect to each of the interest parameters θ = (α, β, γ, λ), we obtain the approximate score function, given by with , , and . Also, B is an n × n diagonal matrix whose t-th diagonal element is given by (14), with E, A, H1, H2, and P being n × Q matrices with entries defined as etq = exp{ℓtq(μtq, ϕtq)}, (15), (16), , and , respectively, where ψ(⋅) denotes the digamma function (the first derivative of the log of the gamma function). Additionally, Vq is a Q-dimensional vector whose q-th element is given by , where νt is the weight of the orthogonal Hermite polynomial of order Q in (9). Finally, we have D2 = [(EP)∘(H1F2)]Vq and D4 = [(EA)∘(H2F4)]Vq.

The approximate score vector for the nuisance parameters is given by

The approximate information matrix for the interest parameters is given by −Ja(θ), with , in which , and for i, j = 1, 2, …, p (first row); i = 1, 2, …, p (second row); (third row); (the last rows); we have that (17)

Many expressions in (17) differ only in the parameter of interest. More specifically, they differ in the derivatives of η1 and η2 with respect to these parameters. Thus, we define a general argument ϑ varying among the interest parameters, namely αi, γj, β, λ. Furthermore, we also define the following functions of ϑ: (18)

In (18), at and bt are defined in (14) and (15), respectively, and , ctq = ϕtq[ψ′(μtq ϕtq)μtq + ψ′[(1 − μtq)ϕtq](1 − μtq)] and dtq = μtq{−μtq ψ′(μtq ϕtq) − (1 − μtq)ψ′[(1 − μtq)ϕtq]} + ψ′(ϕtq) − (1 − μtq)ψ′[(1 − μtq)ϕtq].

Supporting information

S1 File. We create a directory to allocate the supplementary information, namely "S1 File".

As supplementary material, we provide two simulation programs, the application program, the data set, the file of quadrature points, and a description of the data set.

https://doi.org/10.1371/journal.pone.0254103.s001

(ZIP)

Acknowledgments

We thank two anonymous referees for comments and suggestions that led to a much improved manuscript.

References

  1. Ferrari SLP, Cribari-Neto F. Beta regression for modelling rates and proportions. Journal of Applied Statistics. 2004;31(7):799–815.
  2. Simas AB, Barreto-Souza W, Rocha AV. Improved estimators for a general class of beta regression models. Computational Statistics & Data Analysis. 2010;54(2):348–366.
  3. Stefanski LA. The effects of measurement error on parameter estimation. Biometrika. 1985;72(3):583–592.
  4. Carrasco JMF, Ferrari SLP, Arellano VRB. Errors-in-variables beta regression models. Journal of Applied Statistics. 2014;41(7):1530–1547.
  5. Espinheira PL, Silva AO. Residual and influence analysis to a general class of simplex regression. Test. 2020;29(2):523–552.
  6. Huwang L, Huang YHS. On errors-in-variables in polynomial regression—Berkson case. Statistica Sinica. 2000;10:923–936.
  7. Wang L. Estimation of nonlinear models with Berkson measurement errors. The Annals of Statistics. 2004;32(6):2559–2579.
  8. Carroll RJ, Ruppert D, Stefanski L, Crainiceanu CM. Measurement Error in Nonlinear Models: A Modern Perspective. 2nd ed. New York: Chapman & Hall/CRC; 2006.
  9. Schennach SM. Measurement error in nonlinear models—a review. Advances in Economics and Econometrics, Theory and Applications: Tenth World Congress of the Econometric Society. 2013;3:296–337.
  10. Carrasco JMF, Reid N. Simplex regression models with measurement error. Communications in Statistics—Simulation and Computation. 2019;1–16.
  11. Espinheira PL, Santos EG, Cribari-Neto F. On nonlinear beta regression residuals. Biometrical Journal. 2017;59(3):445–461. pmid:28128858
  12. Abramowitz M, Stegun I. Handbook of Mathematical Functions. New York: Dover Publications; 1972.
  13. Gong G, Samaniego FJ. Pseudo maximum likelihood estimation: theory and applications. The Annals of Statistics. 1981;9(4):861–869.
  14. Buonaccorsi JP, Tosteson TD. Correcting for nonlinear measurement errors in the dependent variable in the general linear model. Communications in Statistics—Theory and Methods. 1993;22:2687–2702.
  15. Doornik JA. An Object-Oriented Matrix Language Ox 6. London: Timberlake Consultants Press; 2009.
  16. Salazar SMG. Contribución al estudio de la reacción de descomposición de la zeolita Y en presencia de vapor de agua y vanadio. Universidad Nacional de Colombia; 2005.
  17. Espinheira PL, Silva LCM, Silva AO, Ospina R. Model selection criteria on beta regression for machine learning. Machine Learning and Knowledge Extraction. 2019;1(1):427–449.