Process monitoring using inflated beta regression control chart

This paper provides a general framework for controlling quality characteristics related to control variables and limited to the intervals (0, 1], [0, 1), or [0, 1]. The proposed control chart is based on the inflated beta regression model considering a reparametrization of the inflated beta distribution indexed by the response mean, which is useful for modeling fractions and proportions. The contribution of the paper is twofold. First, we extend the inflated beta regression model by allowing a regression structure for the precision parameter. We also present closed-form expressions for the score vector and Fisher’s information matrix. Second, based on the proposed regression model, we introduce a new model-based control chart. The control limits are obtained considering the estimates of the inflated beta regression model parameters. We conduct a Monte Carlo simulation study to evaluate the performance of the proposed regression model estimators, and the performance of the proposed control chart is evaluated in terms of run length distribution. Finally, we present and discuss an empirical application to show the applicability of the proposed regression control chart.


Introduction
Standard control charts are directly applied to the output of a quality characteristic. However, the quality characteristic (process output) can be affected by external covariates (control variables), in which case we control a varying mean rather than a constant one. In these cases, the regression control chart [1] may be an effective statistical process control tool. Such a method is widely used when the quality of a process or product is better characterized by a functional relationship between the response variable and one or more explanatory variables [2].
The standard regression control chart is based on the linear regression model, where the variable of interest is assumed to be normally distributed. However, in practice, several of these variables may not follow a normal distribution, leading to poor Gaussian-based inferences. Thus, several studies have proposed non-Gaussian model-based control charts. [3] presented a model-based scheme for monitoring multiple gamma-distributed variables. By considering that robust methods can be effective in the presence of outlying observations, [4] explored the robust generalized linear model for a gamma-distributed response. [5] used a deviance residual for monitoring variables in a three-stage process assuming gamma, normal, and Poisson distributions.
Examples of a non-Gaussian process output are variables that assume values in the standard unit interval, such as fractions and proportions. In such instances, the usual regression control chart may be inappropriate since double bounded data are typically asymmetric and the Gaussian-based assumption is not suitable. In this sense, [6] proposed the beta regression control chart to monitor fractions and proportions related to control variables. The proposed control chart considers the beta regression model with varying dispersion [7], assuming that the mean and dispersion parameters of beta distributed variables are related to exogenous variables and modeled by regression structures. However, fractions and proportions may contain zeros and/or ones, leading to the unsuitable use of the beta distribution for data modeling [8].
Alternative regression models have been proposed to address the shortcomings of beta regression in the presence of zeros and/or ones. [9] presented a unit inflated beta model for modeling efficiency scores as a function of exogenous variables. [10] proposed a zero inflated beta model to analyze data on corporate capital structures. [8] introduced a general class of zero or one inflated beta regression models, which is a natural extension of the beta regression model [11] to model variables that assume values in (0, 1] or [0, 1). [12] proposed an inflated beta regression model based on a reparametrization of the inflated beta distribution. This model accommodates mixed random variable responses, with non-negligible probabilities of assuming zeros and/or ones and continuous values in the interval (0, 1) that follow a beta distribution. The inflated beta regression model introduced by [12] may be useful for developing model-based control charts for monitoring inflated beta distributed processes, as it is parametrized in terms of the response variable mean. However, the model proposed by [12] does not consider a regression structure for the precision parameter. Monitoring both the mean and the precision (or dispersion) is relevant to statistical process control [6,13,14]. In addition, incorrect modeling of the dispersion can generate a high number of false alarms or a loss of power to detect special causes [6]. Moreover, dispersion modeling is necessary in regression models in order to obtain accurate inferences about the parameters of the mean regression structure [15].
A control chart is a dynamic tool that operates in two phases, namely Phase I and Phase II. In practical situations, the in-control parameters are unknown and have to be estimated from a Phase I data set. Different Phase I data sets lead to different control chart performance. Thus, it is important to study the practitioner-to-practitioner variability due to parameter estimation. The aim of Phase I analysis is to estimate the parameters, while quick detection of an out-of-control state is the aim of Phase II [16]. The literature offers some studies related to Phase I and Phase II analyses in regression models. For example, [17] proposed Phase I profile monitoring schemes for binary responses that can be represented by logistic regression models, developing several Hotelling $T^2$-type Phase I control charts for monitoring the parameters of a logistic regression linking a binary response to one or more predictor variables. [18] developed control charts by integrating an exponentially weighted moving average scheme with a likelihood ratio test based on logistic regression models in a Phase II study. [19] proposed a new modeling and monitoring framework for Phase I analysis of multivariate profiles by incorporating the regression-adjustment technique into functional principal components analysis. [20] proposed the monitoring of profiles using generalized linear models during Phase II, in which the explanatory variables can follow a fixed design or an arbitrary random design.
In this context, this paper introduces the inflated beta regression control chart (IBRCC) with varying dispersion, useful for monitoring double bounded variables when zeros or ones appear in the data and control variables are present. The process output may represent individual measures (e.g., an efficiency score) or a ratio between continuous numbers (e.g., relative humidity). The contribution of the present paper is twofold. First, we extend the inflated beta regression model proposed by [12] by allowing a regression structure for the precision parameter. We also discuss likelihood inference for the model parameters. Second, we introduce the IBRCC based on the proposed inflated beta regression model with varying dispersion. Since in practice the parameters of the regression model are unknown, the proposed control chart is implemented in two phases. In Phase I, the parameters are estimated from an in-control sample, and in Phase II, we perform the monitoring scheme.
The remainder of the paper unfolds as follows. In Section 2, we describe the IBRCC and introduce the inflated beta mean regression model with varying dispersion. We also discuss likelihood inference and present the control limits estimation procedure. Section 3 presents a simulation study to evaluate (i) the estimators of the inflated beta regression model with varying dispersion and (ii) the performance of the proposed IBRCC and some competing control charts in the literature based on the run length (RL). In Section 4, we present and discuss an empirical application to show the applicability of the proposed IBRCC in real situations. Finally, some conclusions are presented in Section 5.

Inflated beta regression control chart
In this section, we introduce the IBRCC. First, in Subsection 2.1, we present the inflated beta regression model with varying dispersion. The model we propose in this work is an extension of the model proposed by [12], where the authors used a reparametrization of the inflated beta distribution indexed by the response mean. In Subsection 2.2, we present the model-based control limits for the proposed IBRCC. Then, in Subsubsection 2.2.1, we discuss the likelihood inference for the model parameters. Finally, in Subsubsection 2.2.2, we present the control limits estimation procedure.

Model-based control chart limits
The purpose of the IBRCC is to monitor double bounded processes that contain values equal to zero or one, considering that the mean, precision, and parameters related to the probabilities of zero and one ($\alpha_0$ and $\alpha_1$) of the quality characteristic of interest are affected by control variables. Let $(1 - \alpha)$ be the control region, where $\alpha$ is the type I error probability. The lower control limit (LCL), center line (CL), and upper control limit (UCL) of the proposed control chart are defined, respectively, by

$$\mathrm{LCL}_t = F^{-1}(\alpha/2;\, \alpha_{0t}, \alpha_{1t}, \gamma_t, \phi_t), \qquad \mathrm{CL}_t = \gamma_t, \qquad \mathrm{UCL}_t = F^{-1}(1 - \alpha/2;\, \alpha_{0t}, \alpha_{1t}, \gamma_t, \phi_t),$$

where $F(y) = P(Y \leq y) = \int_0^y f(u; \alpha_0, \alpha_1, \gamma, \phi)\,du$ is the inflated beta cumulative distribution function and $F^{-1}(\cdot)$ is the quantile function of the inflated beta variable. The parameters $\alpha_{0t}$, $\alpha_{1t}$, $\gamma_t$, and $\phi_t$ are functions of $\omega$, $\kappa$, $\beta$, and $\zeta$, respectively, through (2), (3), (4), and (5).
In practice, the model parameters are unknown and estimation methods are necessary to estimate the in-control limits. Thus, we consider the likelihood theory [26-28], which we discuss in the following subsection. We present results for the log-likelihood and score functions, which are extensions of those developed for the inflated beta regression model proposed by [12].

Likelihood inference.
We shall consider the maximum likelihood estimator (MLE) for the parameter vector $\theta = (\omega^\top, \kappa^\top, \beta^\top, \zeta^\top)^\top$. The log-likelihood function in (6) is the sum, over $t = 1, \ldots, n$, of the contributions of the individual observations, in which $\mathbb{1}_0(y_t)$ is an indicator function that equals 1 if $y_t = 0$ and 0 if $y_t \in (0, 1]$, $\mathbb{1}_1(y_t)$ is an indicator function that equals 1 if $y_t = 1$ and 0 if $y_t \in [0, 1)$, and $\alpha_{0t}$, $\alpha_{1t}$, $\gamma_t$, and $\phi_t$ are given by the regression structures in (2), (3), (4), and (5), respectively. Additionally,

$$y_t^{\dagger} = \begin{cases} \log(1 - y_t), & y_t \in (0, 1), \\ 0, & y_t = 0 \text{ or } y_t = 1. \end{cases}$$

By differentiating the log-likelihood function in (6) with respect to each element of the parameter vector $\theta$, we obtain the score vector $U(\theta)$. The MLE solves $U(\theta) = 0$, where $0$ is a null vector of dimension $(k_1 + k_2 + k_3 + k_4)$. The MLEs cannot be expressed in closed form, hence the log-likelihood function needs to be maximized numerically through a Newton or quasi-Newton algorithm. In this work, we used the quasi-Newton Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [29] for computational implementation.
The Fisher information matrix, $K(\theta)$, which is useful for large sample inferences, requires the expectations of second order derivatives of the log-likelihood function. The score vector $U(\theta)$ and Fisher's information matrix can be found in the Appendix.
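To test hypotheses on the parameter $\theta_j$, $j = 1, \ldots, (k_1 + k_2 + k_3 + k_4)$, we consider the null hypothesis $H_0: \theta_j = \theta_j^0$ against the alternative $H_1: \theta_j \neq \theta_j^0$. The Wald test may be performed using the following z statistic [26]:

$$z = \frac{\hat{\theta}_j - \theta_j^0}{\mathrm{se}(\hat{\theta}_j)},$$

where $\hat{\theta}_j$ represents the MLE of $\theta_j$ and the standard error of $\hat{\theta}_j$ is given by $\mathrm{se}(\hat{\theta}_j) = \sqrt{[K^{-1}(\hat{\theta})]_{jj}}$, in which $K^{-1}(\hat{\theta})$ is the asymptotic variance and covariance matrix of $\hat{\theta}$. In large samples and under $H_0$, the z statistic follows a standard normal distribution [26]. The test is performed by comparing the computed z statistic with the usual quantiles of the standard normal distribution.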
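As an illustration of the Wald test described above, the following minimal Python sketch computes the z statistic and a two-sided p-value. The function name `wald_test`, the default null value of zero, and the numerical inputs are hypothetical and are not taken from the paper; the standard error is assumed to have been obtained beforehand from the inverse of the estimated Fisher information matrix.

```python
from statistics import NormalDist


def wald_test(theta_hat, se, theta_null=0.0, level=0.05):
    """Two-sided Wald z test for a single parameter.

    theta_hat : MLE of the parameter (hypothetical input)
    se        : its asymptotic standard error (assumed precomputed)
    theta_null: value of the parameter under H0
    """
    z = (theta_hat - theta_null) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < level


# Illustrative values only.
z, p_value, reject = wald_test(theta_hat=0.8, se=0.25)
print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject H0: {reject}")
```

In this sketch a covariate would be declared significant at the chosen nominal level when `reject` is true.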

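Control limits estimation.
After obtaining the MLE $\hat{\theta} = (\hat{\omega}^\top, \hat{\kappa}^\top, \hat{\beta}^\top, \hat{\zeta}^\top)^\top$ of $\theta$ and considering an in-control process, the estimated control limits are given by

$$\widehat{\mathrm{LCL}}_t = F^{-1}(\alpha/2;\, \hat{\alpha}_{0t}, \hat{\alpha}_{1t}, \hat{\gamma}_t, \hat{\phi}_t), \qquad (7)$$

$$\widehat{\mathrm{UCL}}_t = F^{-1}(1 - \alpha/2;\, \hat{\alpha}_{0t}, \hat{\alpha}_{1t}, \hat{\gamma}_t, \hat{\phi}_t), \qquad (8)$$

where $\hat{\alpha}_{0t}$, $\hat{\alpha}_{1t}$, $\hat{\gamma}_t$, and $\hat{\phi}_t$ are obtained by evaluating (2), (3), (4), and (5) at $\hat{\omega}$, $\hat{\kappa}$, $\hat{\beta}$, and $\hat{\zeta}$. Thus, we propose the following algorithm to implement the IBRCC.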
1. Fit the inflated beta regression model with varying dispersion under Phase I and obtain the MLE $\hat{\theta}$.
2. Using the covariates in Phase II, estimate $\alpha_{0t}$, $\alpha_{1t}$, $\gamma_t$, and $\phi_t$ by evaluating the regression structures (2), (3), (4), and (5) at $\hat{\theta}$.
3. For a given type I error probability $\alpha$ and the estimates $\hat{\alpha}_{0t}$, $\hat{\alpha}_{1t}$, $\hat{\gamma}_t$, and $\hat{\phi}_t$, compute the estimated control limits using (7) and (8).
4. Plot each data point $y_t$ together with the estimated control limits $\widehat{\mathrm{UCL}}_t$ and $\widehat{\mathrm{LCL}}_t$, for $t = 1, \ldots, n$.
An observation $y_t$ that falls below $\widehat{\mathrm{LCL}}_t$ or above $\widehat{\mathrm{UCL}}_t$ is considered out-of-control.
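Steps 3 and 4 of the algorithm can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation (which was written in R). It assumes that the limits are the $\alpha/2$ and $1 - \alpha/2$ quantiles of the inflated beta distribution, that $P(Y = 0) = \alpha_0(1 - \gamma)$ and $P(Y = 1) = \alpha_1\gamma$, and that the continuous part is a beta law with shape parameters $\mu\phi$ and $(1 - \mu)\phi$, with $\mu$ chosen so that $E(Y) = \gamma$. All function names and numerical inputs are hypothetical. The beta distribution function is computed with the standard continued-fraction expansion so that only the standard library is needed.

```python
import math


def _betacf(a, b, x, max_iter=500, eps=1e-14):
    """Continued fraction of the incomplete beta function (modified Lentz)."""
    tiny = 1e-300

    def guard(v):
        # Avoid division by (near) zero inside the recurrence.
        return v if abs(v) > tiny else tiny

    c = 1.0
    d = 1.0 / guard(1.0 - (a + b) * x / (a + 1.0))
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        aa = m * (b - m) * x / ((a - 1.0 + m2) * (a + m2))  # even step
        d = 1.0 / guard(1.0 + aa * d)
        c = guard(1.0 + aa / c)
        h *= d * c
        aa = -(a + m) * (a + b + m) * x / ((a + m2) * (a + 1.0 + m2))  # odd step
        d = 1.0 / guard(1.0 + aa * d)
        c = guard(1.0 + aa / c)
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def beta_cdf(x, a, b):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def inflated_beta_quantile(u, alpha0, alpha1, gamma, phi):
    """Quantile of the mean-parametrized inflated beta law (assumed form)."""
    p0 = alpha0 * (1.0 - gamma)  # assumed P(Y = 0)
    p1 = alpha1 * gamma          # assumed P(Y = 1)
    pc = 1.0 - p0 - p1           # mass of the continuous part on (0, 1)
    if u <= p0:
        return 0.0
    if u >= 1.0 - p1:
        return 1.0
    mu = gamma * (1.0 - alpha1) / pc   # beta mean chosen so that E(Y) = gamma
    a, b = mu * phi, (1.0 - mu) * phi  # assumed beta shape parameters
    target = (u - p0) / pc
    lo, hi = 0.0, 1.0
    for _ in range(100):               # bisection on the beta CDF
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, a, b) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def control_limits(alpha, alpha0, alpha1, gamma, phi):
    """Return (LCL, UCL) as the alpha/2 and 1 - alpha/2 quantiles."""
    lcl = inflated_beta_quantile(alpha / 2.0, alpha0, alpha1, gamma, phi)
    ucl = inflated_beta_quantile(1.0 - alpha / 2.0, alpha0, alpha1, gamma, phi)
    return lcl, ucl


def is_out_of_control(y, lcl, ucl):
    """Flag an observation that falls below the LCL or above the UCL."""
    return y < lcl or y > ucl


# Hypothetical Phase II estimates for one zero-inflated observation.
lcl, ucl = control_limits(alpha=0.0027, alpha0=0.2, alpha1=0.0, gamma=0.4, phi=20.0)
print(lcl, ucl, is_out_of_control(0.0, lcl, ucl))
```

In a Phase II run, `control_limits` would be called once per observation with the parameter estimates implied by that observation's covariates, so the limits vary with t.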

Simulation study
This section presents a Monte Carlo simulation study to evaluate the estimators of the introduced inflated beta regression model with varying dispersion and the performance of the IBRCC. The performance of the proposed control chart is compared with some alternatives in the literature, namely: the usual linear regression control chart (RCC) [1], the beta regression control chart (BRCC) [6], and the inflated beta control chart (IBCC) [30]. Note that the RCC is a classical regression control chart that works under Gaussian assumptions. The other control charts are state-of-the-art alternatives, but the BRCC does not consider inflation in zeros and/or ones and the IBCC does not include covariates.
The data were generated from the inflated beta regression model with varying dispersion, for $t = 1, \ldots, n$, with mean and precision submodels $\mathrm{logit}(\gamma_t) = \beta_0 + \beta_1 x_t$ and $\log(\phi_t) = \zeta_0 + \zeta_1 \ddot{x}_t$, together with regression structures for $\alpha_{0t}$ and $\alpha_{1t}$. The values of the three binary covariates, among them $\ddot{x}_t$, were obtained from a Bernoulli distribution with parameter p = 0.3, and $x_t$ was generated from a uniform distribution in the interval (0, 1), thus considering discrete and continuous random variables. We considered 5,000 Monte Carlo replications and sample sizes n = 100, 200, and 500. According to [31] and [32], this number of replications is enough to obtain accurate results. All simulations were performed using the R programming language [33].
In the numerical evaluation, we considered several scenarios with different characteristics, namely: zero and one inflated beta regression model (Scenario 1), zero inflated beta regression model (Scenarios 2, 4, and 6), and one inflated beta regression model (Scenarios 3, 5, and 7).

Point estimation evaluation
For the point estimation evaluation, we computed the mean, percentage relative bias (RB), and mean square error (MSE) for each estimator in all scenarios (see Table 1). For brevity and because the results are similar, we only present results for Scenarios 1, 2, and 7 (n = 100 and n = 500), as shown in Table 2. The results show that the mean of the estimators is close to the corresponding parameter values. The RB and MSE decrease when the sample size increases, indicating that the MLEs are consistent. For instance, for $\beta_1$ ($\gamma$ submodel) in Scenario 1, the RB of the estimator is equal to 0.2257% for n = 100 and to −0.0548% for n = 500. Regarding the MSE, for $\omega_0$ in Scenario 2 the MSE is equal to 0.2291 for n = 100 and to 0.0373 for n = 500. As in other studies related to beta regression [21,34], it is noteworthy that the RB of the MLEs of the parameters associated with the precision covariates is greater than that of the MLEs of the parameters that model the mean response. For instance, for $\hat{\zeta}_1$ ($\phi$ submodel) in Scenario 7, we have RB = −8.2035% for n = 100 and RB = −2.4411% for n = 500. Regarding the parameters related to the probabilities of zeros and ones, the bias also decreases considerably as the sample size increases. For example, in Scenario 1 with n = 100, the estimator of $\omega_1$ ($\alpha_0$ submodel) yields RB = 16.9130% and the estimator of $\eta_1$ ($\alpha_1$ submodel) yields RB = 14.8922%. For n = 500, the bias of the same estimators reduces to 4.5210% and 3.4431%, respectively.
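The evaluation measures used above can be sketched as follows. The sketch assumes the usual definitions, namely RB as 100 times the difference between the Monte Carlo mean and the true value divided by the true value, and MSE as the mean squared deviation from the true value. The function names and the replicate values are hypothetical and serve only as an illustration.

```python
from statistics import fmean


def relative_bias_pct(estimates, true_value):
    """Percentage relative bias of a set of Monte Carlo estimates."""
    return 100.0 * (fmean(estimates) - true_value) / true_value


def mean_square_error(estimates, true_value):
    """Mean square error of a set of Monte Carlo estimates."""
    return fmean([(e - true_value) ** 2 for e in estimates])


# Hypothetical estimates of a parameter whose true value is 2.0.
replicates = [1.9, 2.1, 2.2, 2.0]
print(relative_bias_pct(replicates, 2.0), mean_square_error(replicates, 2.0))
```

In a full study, `replicates` would hold one estimate per Monte Carlo replication for a given parameter, scenario, and sample size.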
In practice, the regression model relating the output and covariates is rarely known and the parameters have to be estimated. Our simulation results show that the MLEs of the proposed model perform well, presenting low MSE in all situations. This suggests that the proposed control chart may also perform well in practice. In the next section, we investigate the run length performance of the IBRCC with estimated parameters.

Control charts performance
This section presents a run length analysis to evaluate the performance of the considered control charts. When the process is in-control, the run length (RL) follows a geometric distribution with parameter $\alpha$, which is the type I error probability [35]. The ability of a control chart to detect changes in the process is usually measured by the average number of observations until the detection of an out-of-control point (ARL) [36]. However, other measures can also be used for this purpose. We considered another location measure, the median (MRL), and a dispersion measure, the standard deviation (SDRL), of the RL distribution. Additionally, we computed the mean absolute percentage error (MAPE) for each measure for all evaluated control charts. We compared the proposed IBRCC with the standard RCC [1] and the state-of-the-art charts, namely the BRCC [6] and IBCC [30]. Since the BRCC does not accommodate values equal to zero or one, we replaced zeros by 0.0001 and ones by 0.9999 for its application. For all considered control charts, we examined two aspects of evaluation: in-control ($\mathrm{ARL}_0 = 1/\alpha$, $\mathrm{MRL}_0 = \ln(0.5)/\ln(1 - \alpha)$, $\mathrm{SDRL}_0 = \sqrt{(1 - \alpha)/\alpha^2}$) and out-of-control ($\mathrm{ARL}_1 = 1/(1 - \beta)$, $\mathrm{MRL}_1 = \ln(0.5)/\ln(\beta)$, $\mathrm{SDRL}_1 = \sqrt{\beta/(1 - \beta)^2}$, where $\beta$ is the type II error probability) [30,35]. For brevity, we do not present the MRL and SDRL results for the out-of-control process. The control charts were evaluated in Scenarios 2 to 7 (Table 2), considering inflation in 0 or 1. Scenario 1 was not covered in this section because it does not reflect real statistical process control situations, since it would allow perfectly nonconforming and perfectly conforming outcomes in the same process. To ensure that ARL$_1$ is compared between control charts with the same ARL$_0$, we adjusted the chart limits to obtain ARL$_0$ equal to the specified nominal values of 100 and 370. This control chart calibration is suggested in the literature [30,37-39].
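Because the run length is geometric, the nominal ARL, MRL, and SDRL follow directly from the per-sample signal probability. The minimal sketch below evaluates them. The function name `run_length_summary` and the out-of-control value of the type II error probability are hypothetical, and the median uses the continuous approximation ln(0.5)/ln(1 − p) rather than the integer median of the geometric distribution.

```python
import math


def run_length_summary(p_signal):
    """ARL, MRL, and SDRL of a geometric run length.

    p_signal is the probability that a single observation signals:
    alpha when the process is in-control, 1 - beta when it is out-of-control.
    """
    arl = 1.0 / p_signal
    mrl = math.log(0.5) / math.log(1.0 - p_signal)
    sdrl = math.sqrt(1.0 - p_signal) / p_signal
    return arl, mrl, sdrl


# In-control measures for the two type I error probabilities used in the study.
for alpha in (0.01, 0.0027):
    arl0, mrl0, sdrl0 = run_length_summary(alpha)
    print(f"alpha = {alpha}: ARL0 = {arl0:.2f}, MRL0 = {mrl0:.2f}, SDRL0 = {sdrl0:.2f}")

# Out-of-control measures for a hypothetical type II error probability.
beta = 0.9
arl1, mrl1, sdrl1 = run_length_summary(1.0 - beta)
print(f"beta = {beta}: ARL1 = {arl1:.2f}")
```

The two values of $\alpha$ correspond to the nominal ARL$_0$ levels of 100 and approximately 370.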
After ARL$_0$ calibration, a shift $\delta$ was induced in the mean and precision regression structures to generate out-of-control processes as follows: $\mathrm{logit}(\gamma_t) = \delta + \beta_0 + \beta_1 x_t$ and $\log(\phi_t) = \delta + \zeta_0 + \zeta_1 \ddot{x}_t$. By driving the process out of control, we obtained the estimated ARL$_1$ for different values of $\delta$. When $\delta = 0$, the process is in-control and the ARL$_0$ can be evaluated. The IBRCC showed better performance than the BRCC, RCC, and IBCC, reaching empirical values closer to the nominal levels in all evaluated scenarios. In Scenarios 2, 4, and 6, the IBRCC and IBCC obtained 0 as the lower control limit, thus no point fell below this limit. Similarly, in Scenarios 3, 5, and 7, the upper control limit of these charts was 1. These scenarios present 0 or 1 as control limits because of the probabilities of occurrence of 0 or 1. That is, the IBRCC and IBCC will present zero as the lower control limit when $P(Y = 0) = \hat{\alpha}_{0t}(1 - \hat{\gamma}_t) \geq \alpha/2$ and one as the upper control limit when $P(Y = 1) = \hat{\alpha}_{1t}\hat{\gamma}_t \geq \alpha/2$. The high probability of Y assuming values equal to one or zero means that these values are not atypical (out of control) but usual occurrences of the process.
Tables 3 and 4 also show the MAPE results. The proposed control chart has the lowest MAPE values. For example, considering $\alpha = 0.01$, the corresponding ARL$_0$ measure, and n = 200, the MAPE values obtained for the IBRCC, BRCC, RCC, and IBCC were 1.32, 83.82, 61.17, and 21.80, respectively. It is noteworthy that, for the IBRCC, the MAPE decreases considerably when the sample size increases.
Among the considered alternative control charts, the IBCC achieved better performance than the BRCC and RCC in all scenarios. In Scenario 2 (Table 4), for n = 200, the IBCC presented a false alarm every 411 samples on average, whereas a false alarm was expected every 370 samples. In the same scenario, the BRCC and the RCC presented a false alarm approximately every 11 and 106 samples, respectively. These results show the importance of considering an accurate model to reduce false alarms. We also note that the BRCC obtained the worst performance. In Table 3, considering Scenario 3 and n = 100, the RCC and BRCC presented a false alarm approximately every 50 and 27 observations, respectively. It is important to note that the BRCC performance worsened as the percentage of zeros or ones increased. Consistent with these findings, the IBCC also presented lower MAPE than the BRCC and RCC in all scenarios. Considering $\alpha = 0.0027$, n = 500, and the MRL$_0$ measure, the MAPE values obtained for the IBRCC, BRCC, RCC, and IBCC were, respectively, 4.82, 95.15, 80.22, and 35.45.
Results of the ARL$_1$ evaluation are shown graphically in Figs 1 and 2. It was not possible to calibrate ARL$_0$ for the BRCC due to its poor in-control performance. Thus, ARL$_1$ was evaluated only for the IBRCC, RCC, and IBCC. It is noteworthy that when several control charts are compared in terms of ARL, the one that presents the lowest ARL$_1$ among those with the same ARL$_0$ outperforms its competitors [30]. By analyzing the ARL$_1$ results when the perturbation was introduced in the mean of the process (Fig 1), we observe that in Scenario 3 the IBRCC performs better than the RCC and IBCC, and in Scenario 2 the performance of the control charts is similar. We note that the IBRCC detects the out-of-control process more quickly. For example, in Scenario 3 with ARL$_0$ = 370, n = 100, and $\delta = -0.4$, the IBRCC takes 176 samples on average to detect a change in the process, while the IBCC takes 186 and the RCC takes 192 to detect a change of the same magnitude. The simulation results showed similar behavior when a perturbation in the precision of the process occurs (Fig 2). The control charts detect process changes more quickly as the precision increases (dispersion decreases).

PLOS ONE

The simulation results show the need for a control chart based on an appropriate regression model, such as the IBRCC, when the variable of interest is restricted to the intervals [0, 1) or (0, 1]. The use of the linear regression-based control chart is inappropriate for data of this type since the support of the usual regression model is the whole real line. Interestingly, in the presence of values equal to zero or one, the BRCC proved to be less adequate than the traditional RCC and the IBCC, which uses the inflated beta distribution but does not consider a regression structure. Since the BRCC does not accommodate values equal to zero or one, replacing zeros with 0.0001 and ones with 0.9999 induces an inflation at these values. That is, the probability mass at 0.0001 and/or 0.9999 exceeds what is allowed by the beta distribution, which is an absolutely continuous distribution. This affects the estimates of the parameters of the regression structures and, consequently, impairs the estimates of the control limits.

Real data application
This section contains an empirical application in which the proposed control chart (IBRCC) and three competing control charts are analyzed: the RCC, BRCC, and IBCC. The data evaluated in this section refer to the public administrative efficiency of the municipalities in the state of São Paulo, Brazil. The data are a subset of those analyzed by [40], who considered all Brazilian municipalities. The dataset we used contains 427 municipalities for the year 2000 and is available at http://www.de.ufpb.br/~luiz/datasets/Dataset_plosone.txt. The covariates are from Secretaria do Tesouro Nacional (http://www.tesouro.fazenda.gov.br/), Instituto Brasileiro de Geografia e Estatística (IBGE) (https://www.ibge.gov.br/), and Instituto de Pesquisa Econômica Aplicada (IPEA) (https://www.ipea.gov.br/portal/), Brazil. The quality characteristic, y, was introduced by [40] and represents individual observations of an efficiency index, assuming values in (0, 1] and measuring how well mayors spend taxpayer money to provide public services. The efficiency index is equal to one when there is full efficiency. There are 32 units that are fully efficient (i.e., about 7.5% of the observations are equal to one). A brief description of the variables used in the analysis is presented in Table 5. Variables CONS, R2, and MT are dummies, i.e., they are equal to 0 or 1. The covariate CONS equals 1 if the municipality participates in inter-municipal consortia, the covariate R2 equals 1 if the municipality receives more than 10% of its tax revenue from royalties, and the covariate MT equals 1 if the municipality is a tourist destination; each of the three dummy covariates equals 0 otherwise. It is important to mention that 100 municipalities were randomly drawn to estimate the model parameters (Phase I), while the remaining observations were used for monitoring (Phase II).
At the outset, the inflated (at one) beta mean regression model, the beta regression model replacing 1 with 0.9999, and a linear regression model were selected and fitted. We used the logit link for $\gamma$ and $\alpha_1$ and the log link for $\phi$. For the beta regression, we considered the logit link for $\mu$ and the log link for $\phi$. The maximum likelihood estimates of the model parameters are displayed in Table 6. All covariates were significant at the nominal level of 5%.
To compare the control charts, Table 7 presents some descriptive statistics of the estimated control limits. Note that the proposed control chart and the IBCC are the only ones that have an upper control limit that is constant and equal to one. In contrast, when the beta regression control chart is used, the control limits are restricted to the open interval (0, 1) and thus, in this case, fully efficient municipalities are considered out-of-control. In addition, we verify that, by using the standard RCC, the limits assume values below zero and above one, not being restricted to the interval (0, 1], where the data are distributed. The interpretation of the limits, in this case, makes no practical sense and leads to a loss of power to detect out-of-control points.
We also performed a specification test for the fitted model, where the null hypothesis is that the fitted model is correctly specified and the alternative hypothesis is that there is model misspecification. We perform the test using the second power of the estimated mean linear predictor as the testing variable. We do not reject the null hypothesis at the 1% nominal level, thus suggesting that our model is correctly specified.

Conclusions
In this paper, we proposed a new model-based control chart for controlling quality characteristics limited to the intervals (0, 1] or [0, 1) using the inflated beta regression model. For this purpose, we extended the inflated beta regression model proposed by [12] by allowing a regression structure for the precision parameter. In this way, it is possible to model the mean response, the data precision, and functions of the probability of a given observation assuming zero or one through a regression framework. Our simulation study showed that the relative bias and mean square error decrease when the sample size increases. With regard to the sensitivity analysis in terms of run length (RL), the proposed IBRCC showed the best performance in all considered cases. In addition, the results indicated that it is better to ignore the explanatory variables and use the inflated beta control chart (IBCC) than to use a control chart based on an inappropriate regression model. We also considered an application to real data and highlighted the practical importance of the proposed chart when the response is distributed in unit intervals containing ones. Finally, we suggest the use of the inflated beta regression control chart to monitor output quality characteristics that are double bounded in unit intervals containing zeros or ones and that are better characterized by a functional relationship between the response variable and one or more explanatory variables.