A Control Chart Based on Cluster-Regression Adjustment for Retrospective Monitoring of Individual Characteristics

The tendency for experimental and industrial variables to include a certain proportion of outliers has become a rule rather than an exception. These clusters of outliers, if left undetected, have the capability to distort the mean and the covariance matrix of the Hotelling’s T 2 multivariate control charts constructed to monitor individual quality characteristics. The effect of this distortion is that the control chart constructed from it becomes unreliable as it exhibits masking and swamping, a phenomenon in which an out-of-control process is erroneously declared as an in-control process or an in-control process is erroneously declared as out-of-control process. To handle these problems, this article proposes a control chart that is based on cluster-regression adjustment for retrospective monitoring of individual quality characteristics in a multivariate setting. The performance of the proposed method is investigated through Monte Carlo simulation experiments and historical datasets. Results obtained indicate that the proposed method is an improvement over the state-of-art methods in terms of outlier detection as well as keeping masking and swamping rate under control.


Introduction
Following Staudte and Sheather [1], an outlier can be defined as an observation that is far from the bulk of the data. Similarly, Hampel et al. [2] defined outliers to be observations which deviate from the pattern set by majority of the data. Since this outlying observations depart from the general trend dictated by the bulk of the data, its identification in the dataset plays an important role in many practical applications. The dataset may contain outliers with abnormal values that are arbitrarily larger or smaller than the normal observations. This could lead to misjudgment of analysis in areas such as regression analysis, analysis of variance, statistical process control (SPC) and profile monitoring. Therefore, outlier identification is an important task prior to data analysis especially in monitoring product quality of a production process. Historical data often suffer from transcription errors and problems in data quality. These errors make historical data prone to outliers.
According to Montgomery et al. [3], outliers are identified as observations that produce residuals that are considerably larger in absolute values than the others; say, three or four sigma from the mean. This idea can be applied in SPC. Consider a situation where the objective of a multivariate control chart is to detect the presence of assignable causes of variation in multivariate quality characteristics. In particular, for a retrospective phase I analysis of a historical dataset, the objective is two fold: (i) to identify shifts in the mean vector that might distort the estimation of the in-control mean vector and covariance matrix; (ii) to identify and eliminate multivariate outliers. The purpose of seeking an in-control subset of historical dataset is to estimate in-control parameters for use in a phase II analysis. To achieve these objectives, individual retrospective multivariate control charts are constructed to determine if in a multivariate sense, the already obtained, sequentially ordered data points X = {x ij } i = 1, Á Á Á, n, j = 1, Á Á Á, p & R p are stable, that is, free of outliers, upsets or shifts. Fan et al. [4] refer to this kind of analysis as a retrospective Phase I analysis of a historical data set (HDS). The sample mean vector and covariance matrix are estimated from X. From these estimates, Hotelling's T 2 chart is constructed and used to flag outliers and mean shifts in the process.
According to Marcus and Pokojovy [5], given the multivariate quality characteristic, X, the Hotelling's T 2 statistic computed for {x ij } i = 1, Á Á Á, n, j = 1, Á Á Á, p & R p at time i is where m x is the sample mean given by andŜ is the sample covariance matrix given bŷ If T 2 i exceeds the upper control limit (UCL) defined as ; ð4Þ then x i is declared as an outlier or a special cause (out-of-control) is assumed to have occurred at sample number i or before it. Eq (4) assumes the x i 's are independent and come from a multivariate normal distribution. However, the T 2 are correlated because they each depend on the same m x andŜ (Mansion and Young [6]). The classical methods for detecting mean shift and outliers such as the one described in Eq (1) are powerful when the dataset contain only one outlier. Nevertheless, the power of these techniques reduces significantly when more than one outlying observations are present in the datasets. This loss of power is typically due to what can be referred to as the masking and swamping problems. Furthermore, these techniques do not always succeed in identifying mean shift and spurious outliers because they are based on several assumptions which are rarely met. Thus, they are affected by the observations that they are supposed to detect. Therefore, a technique devoid of these problems is desirable.
Let D i ðm x ;Ŝm x Þ ¼ f ðx i À m x ;ŜÞ; i ¼ 1; Á Á Á ; n be a suitable metric for measuring the distance between the i th observation x i and a location (mean) estimator m x in Eq (2), relative to a measure of dispersion (covariance matrix),Ŝ in Eq (3). Several forms of m x andŜ have been discussed in literature (see for instance, Rousseeuw [7], Rousseeuw and van Driessen [8], Billor et al. [9] and Hubert et al. [10]). The most frequently used form of f ðx i À m x ;ŜÞ; i ¼ 1; Á Á Á ; n is defined in Eq (1). The classical alternatives for m x andŜ are defined in Eqs (2) and (3) respectively. Eq (1) can be referred to as the squared Mahalanobis distance. It will be referred to as T 2 i for simplicity. A large value of T 2 i is a likely indication that the observation with indices corresponding to it is an outlier or better still, an out-of-control process. However, two problems occur in reality. First, outliers or out-of-control process may not essentially have large values of T 2 i . For instance, a small cluster of outliers will attract m x and will inflateŜ in its direction, leading to small values for T 2 i . This problem is referred to as the masking problem since the presence of one outlier masks the appearance of another outlier.
Secondly, not all observations with indices corresponding to large T 2 i values are essentially outliers or an out-of-control process. For instance, a small cluster of outliers will attract m x and inflateŜ in its direction and away from some other observations which belong to the trend suggested by the bulk of datasets, thereby yielding large T 2 i values for these observations. This problem is referred to as swamping problem.
Masking and swamping phenomena occur because m x andŜ are not robust. One way to remedy these problems is to use more robust estimators for Eqs (2) and (3) as an alternative to the location and covariance matrix. The resulting T 2 i could be used to effectively detect outliers or sustained shifts in the mean vector of the quality characteristics.
Consider the minimum volume ellipsoid, (MVE) of Rousseeuw [7] which covers at least half of the observations to construct robust estimators for location and dispersion in Eqs (2) and (3). The mean vector and covariance matrix of the observations included in the MVE are robust location and dispersion matrix estimators. The advantage of MVE estimators for location and dispersion is that they have a high break down point, (BDP) of approximately 50% (Lopuhaa and Rousseeuw [11]). They can also resist the presence of substantial amount of outlier in dataset as well as being able to effectively detect sustained shifts in the mean vector of the quality characteristics. However, the MVE is computationally intensive, and it may not even be computationally feasible, to implement the MVE algorithm estimators. As an illustration of the MVE's computational infeasibility, for an n × p data matrix X, if h is the integer part of ðnþ1Þ 2 , then the algorithm needs to compute the volumes of n! h!ðnÀhÞ! ellipsoids and to choose the ellipsoid with the minimum volume. This means that in an instance where n = 20, there are 184,756 such ellipsoids, for n = 25, there are 5,200,300 such ellipsoid and for n = 30, there are over 155 million ellipsoids to compute and to select the ellipsoid with the minimum volume. This enormous computational demand of MVE which grow geometrically as the sample size, n increases is common to nearly all methods that implement elemental sets or resampling algorithm.
Another robust estimator of mean vector and covariance matrix that can accommodate up to 50% outliers in dataset and also work well for Eqs (2) and (3) in computing T 2 i is the minimum covariance determinant estimator, (MCD). Rousseeuw and van Driessien [8] proposed a fast algorithm for computing the the MCD estimator and they referred to it as FastMCD. The FastMCD correspond to finding the h points for which the classical tolerance ellipsoid has minimum volume, and then taking it center. Consider all C n h subsets, and compute the determinant of the covariance matrix for each subset. The subset with the smallest determinant is used to calculate the usual 1 × p mean vector, m and corresponding p × p covariance matrix,Ŝ. This estimation procedure describes the MCD algorithm. The MCD estimator for location and dispersion are actually maximum likelihood estimators when h observations are from the multivariate normal distribution, while the other (n − h) observations are from a different, mean shifted, multivariate normal distribution. Like the MVE, it is equivariance and has high breakdown value of 50%. Cator and Lopuhaa [12] examined the asymptotic expansion of MCD estimator and stated that the MCD estimator is efficient and root ffiffiffi n p -consistent. Furthermore, Cator and Lopuhaa [13] derived the influence function for the MCD estimator. The MCD estimator is robust and equivariant and can resist the presence of substantial amount of outliers in dataset as well as being able to effectively detect sustained shifts in the mean vector of the quality characteristics. However, it often biasedly estimate the center (mean vector) and as a result underestimates the covariance matrix. According to Pison et al. [14], the MCD estimator is inconsistent and underestimates the volume of the covariance matrix such that robust distances constructed from it are too large resulting in identifying too many observations as outliers when they are originally inliers. This phenomenon is called swamping effect. Swamping makes the control chart identifies too many observations as outlying thereby declaring an originally in-control-process as out-of-control process Billor et al. [9] proposed forward search algorithm called BACON with two versions of the initial subsample. The algorithm has two "initial starts" in which version one uses the non-robust but equivariant classical mean vector while version two uses the robust but not equivariance median. The BACON algorithm is computable, easy to implement and can resist the presence of substantial amount of outlier in dataset as well as being able to effectively detect sustained shifts in the mean vector of the quality characteristics. However, the version one of BACON is not robust but equivariance. Hence, the subsequent iterations may not be robust as it will depend on the robustness of the initial subset. The version two of BACON is robust but not equivariance and hence the subsequent iterations may not be equivariant.
Hubert et al. [10] proposed the deterministic minimum covariance determinant (DetMCD) algorithm for estimating the location and dispersion matrix in Eqs (2) and (3) when datasets contains outliers and are presumed to exhibit drift or a mean shift in a process. The DetMCD uses six different initial estimators to select six initial subsets denoted as h 0 ¼ d n 2 e. From the six initial h 0 , a Mahalanobis type distance measure d i, l , (i = 1, Á Á Á, n; l = 1, Á Á Á,6) is computed. Thereafter, the method selects for all six initial estimates, the h observations x i with the smallest d i, l and apply the concentration steps (C-steps) until convergence.
According to Hubert et al. [10], the DetMCD estimator is highly robust and near equivariant. Thus, it can resist the presence of substantial amount of outliers in dataset as well as being able to effectively detect sustained shifts in the mean vector of the quality characteristics. However, the computational requirement of DetMCD is highly enormous. Imagine the computational rigor in computing six different initial estimates that are independent, and can each be seen as "stand alone" estimator. The accompanying C-step also requires enormous computation to converge. Moreover, since the DetMCD use the FastMCD objective, it will inherit the ills of FastMCD as noted in Pison et al. [14]. Thus, it often biasedly estimate the center and as a result underestimates the dispersion matrix. It is inconsistent and underestimates the volume of the covariance matrix such that robust distances constructed from it are too large and hence identifying too many observations as outliers when they are originally inliers, leading to swamping effect.
Vargas [15] and Jensen et al. [16] studied the performance and evaluated several different retrospective multivariate control charts methods whose classical mean vector and covariance matrix in Eqs (2) and (3) are constructed from robust estimation algorithms. Each of the charts were based on a T 2 statistic calculated using a selected combination of robust mean vector and covariance matrix estimators. Jobe and Pokejovy [5] noted that most of the robust estimators considered (particularly the MVE and MCD) do not take time (i) into consideration. This is a drawback in relation to detecting certain outlier configurations such as sustained shifts in the mean vector (see Jobe and Pokojovy [5] for details). The inability of some robust methods to account for time i in constructing the T 2 i may be linked to non equivariance tendency. Some methods that work well in this scenario, accounting for time includes that of Chenouri and Variyath [17], Chenouri et al. [18], Variyath and Vattathoor [19], Variyath and Vattathoor [20] Jobe and Pokojovy [5] and Fan et al. [4].
This article proposed a control chart method that is based on regression adjustment and clustering algorithm for retrospective monitoring of individual characteristics. Since the proposed control chart method blends the addition-point regression with an agglomerative hierarchical cluster algorithm, we refer to it as cluster-regression control chart, (crcc for short). Two algorithms are presented for the proposed crcc methodology, one to compute the mean vector and covariance matrix in Eqs (2) and (3) and the second one for computing the cluster-regression control chart statistic denoted as T 2 crcc;i statistic. The notion for the first algorithm is to replace Eqs (2) and (3) with a robust estimate of mean vector and covariance matrix that does not exhibit masking and swamping and can detect certain outlier configurations such as sustained shifts in the mean vector. To account for time i in the T 2 crcc;i second phase algorithm, a projector referred to as "anchor point matrix" proposed by Johnson and Wichern [21] and Lawrence et al. [22] is used to trace the data trend through an addition-point least squares regression upon which the cluster distance is constructed. Thus the T 2 crcc;i is simultaneously able to detect outliers as well as shifts in the mean vector while keeping masking and swamping under control at a given time i.
The remaining part of this article is organized in the following way: Section 2 describes the crcc methodology and provides algorithms for computing the T 2 crcc;i control chart. Section 3 deals with construction of the controls limits for the proposed T 2 crcc;i along with performance evaluation through Monte Carlo simulation experiment. A numerical illustration of artificial dataset generated through a Monte Carlo simulation experiment is closely followed by a real life pulp fiber dataset analysis to implement the crcc algorithm in section 4 while section 5 concludes the article.

The crcc Methodology
The cluster-regression control chart is conceived on the notion that the natural trend of multivariate X-quality characteristics can be traced through a projector called "anchor point matrix" which can be used to construct a cluster of h-observations where h ! ðnþpþ1Þ 2 .
Let m x andŜ be the center and deviation from center of an ellipsoid formed from X as defined in Eqs (2) where λ i and e i are the eigenvalues and eigenvectors ofŜ and w 2 a;p is the cutoff point of the ellipsoid of constant distance (see Johnson and Wichern [21] for details on O). If m x andŜ are robust, an OLS fit to O after it has been augmented with the i th row of X for n times (i = 1, Á Á Á, n), can mimic the trend of the quality characteristics, X. This way, observations that belong to the in-control-process tend to cluster together while those that belong to the out-of-control process as well as those that exhibit certain outlier configuration tend to cluster together and away from those that belong to the in-control-process. The OLS fit to O after it has been augmented with the i th row of X for n times (i = 1, Á Á Á, n) is referred to as "addition point OLS". Since the addition-point OLS is performed sequentially, one at a time, by adding one observation from X to O and fitting OLS to it, the time component (i) of T crcc, i is preserved. Furthermore, the proper working of O requires a robust m x andŜ. Hence, we propose to use a forward search algorithm of Billor et al. [9] to first screen the data of likely outlier configurations and then compute from it, the mean vector, m x and the covariance matrix,Ŝ prior to estimation of O. Armed with this synopsis, the crcc algorithms are presented below.

The BACON Forward Search Algorithm
Billor et al. [9] proposed an algorithm, the BACON: block adaptive computationally efficient outlier nominator, for outlier identification as well as estimator for location and dispersion matrix. The BACON algorithm has been used extensively in literature (see [10,[23][24][25]). Their algorithm select the initial subset denoted as m = p + 1 as the observations with indices corresponding to the smallest euclidean distance obtained from the coordinatewise mediañ m x .
Having selected the m initial subset, the algorithm then proceeds to increase the size of m by one observation at a time until the initial subset, m contains h observations. An observation is nominated as a candidate for m if its Mahalanobis distance is the smallest among the observations not included in the m initial subset. The algorithm iterates this procedure until all n observations are screened. When h observations are in m, a decision criteria for nominating an outlier is defined, the nominated outliers are removed from X, and the mean vector and covariance matrix are computed for the remaining observations in X using Eqs (2) and (3). Detailed algorithm is given below. Algorithm 1.
Step 1: Letm x be the 1 × p vector of coordinatewise median of X, compute the distanced Step 3: Compute the discrepancies where m b and C b are the mean and covariance matrix of the observations in the basic subset. If C b is not of full rank, increase the initial subset m by adding observations whose indices corresponds to the smallest distances in (Eq 7) until it has full rank.
Step 4: Define the new basic subset as all observations whose discrepancy in (8) is less than c npr w 2 ðp; a n Þ , where w 2 ðp; a n Þ is the ð1 À a n Þ percentile of the chi square distribution with p degree of freedom, c npr = c np + c hr is a correction factor, c hr ¼ maxf0; ðhÀrÞ ðhþrÞ g; h ¼ ½ ðnþpþ1Þ

2
, r is the size of the current basic subset and See Billor et al. [9] for details on the correction factor.
Step 5: Iterate steps 3 and 4 until the size of the basic subset no longer changes.
Step 6: Nominate the observations excluded from the final basic subset as outliers Step 7: The location and dispersion estimator is computed as the classical mean and covariance of the observations in the final basic subset using Eqs (2) and (3).

The crcc Main Algorithm
Let x ij , i = 1Á Á Á, n j = 1, Á Á Á, p be the j th quality characteristics in the i th sample. The algorithm below computes the proposed T 2 crcc;i for a given multivariate quality characteristics at time i. Algorithm 2.
Step 1: Compute the mean vector m x and the covariance matrix,Ŝ using algorithm 1 above Step 2: Determine a dependent variable, y from among the quality characteristics x ij , i = 1Á Á Á, n j = 1, Á Á Á, p. Two choices are available to achieve this. One way is to first compute a covariance matrix from x ij , i = 1Á Á Á, n j = 1, Á Á Á, p and obtain from the covariance matrix, the eigenvalues of each quality characteristics in x ij . The variable or a quality characteristic with the least eigenvalue is nominated as the dependent variance while the remaining (p − 1)-variables are treated as regressor variables. The second choice of determining the dependent variable is described below: i. Regress x j on all the other predictors for j = 1, Á Á Á, p. This will yield p different regression models for instance, suppose there are 3 quality characteristics denoted as x i1 ,x i2 , and x i3 then the p = 3 regression models are ii. For all models in Eqs (10), (11) and (12), obtain the corresponding R 2 -values and denote the dependent variable corresponding to the model with the highest R 2 as the overall dependent variable while the remaining (p − 1)-variables are treated as regressors for subsequent addition point OLS.
Step 4: Determine the (n × p) data matrix B with the i th row of B denoted by a (1 × p) vector of b i to be the estimator that result from an OLS regression of O, augmented by the i th row of x ij .
Step 5: Compute an (n × n) similarity matrix S whose elements are defined by The elements of S serves as a distance metric upon which an agglomerative hierarchical cluster (AHC) analysis of a complete linkage is performed on the data. The AHC then partition the dataset x ij into the main cluster C m containing at least h observations and the remaining observations fall into one of τ minor clusters labeled as C τ1 , C τ2 , C τ3 , Á Á Á. See [26][27][28] for details on cluster analysis.
Step 6: Fit a regression model to the observations in the main cluster C m and obtain from it, the fitted valuesŷ i ; i ¼ 1; Á Á Á ; n as well as the prediction variance where r i is the residuals from regression model and 1.4826 is a turning constant (see Maronna et al. [29] for details). At this stage, the data points in the minor cluster have not been used and they are said to be inactive. The activation process of the minor cluster is done sequentially, one cluster at a time in the following way: i. Augment the main cluster, C m with the first minor cluster C τ1 and obtain an OLS fitted values denoted asŷ iþC t1 ; i ¼ 1; Á Á Á ; n ii. Obtain the difference in fits statistic, DFFITS as iii. If DFFITS C t1 w 2 a;p , then C τ1 minor cluster is included in the main cluster, else, the minor cluster is excluded from x ij and remain inactive throughout the crcc estimation process. This procedure is repeated for all minor clusters. Thus, observation that does not harm the fit produces small value of DFFITS C τ1 and hence, they are activated. However, outliers and mean shift data points tend to produce large values of DFFITS C τ1 and as a result, they are not activated.
Step 7: After the minor clusters have been activated by augmenting the main cluster with the minor clusters that satisfy the augmentation condition or otherwise, the mean vector and covariance matrix for crcc are then estimated from data points arising from this activation process as andŜ where n a is the sample size of the augmented main cluster and x a ij is the p-dimensional multivariate quality characteristics in the current augmented main cluster. The corresponding T 2 crcc;i -control chart plots Construction of Control Limits and Performance Evaluation of T 2 This section discusses the construction of the control limits of the proposed T 2 crcc;i as well as examining its performance based on masking and swamping rates. Two other robust control chart methods extensively discussed in literature will be compared with the proposed T crcc, i . They are: the re-weighted MVE and MCD control charts denoted as T 2 RMVE;i and T 2 RMCD;i respectively. These methods have been discussed extensively in literature. A few list includes Chenouri and Variyath [17], Chenouri et al. [18], Variyath and Vattathoor [19], Variyath and Vattathoor [20] and Jensen et al. [16].

Construction of Control Limits Using Simulation Algorithm
In order to implement the proposed T 2 crcc;i control chart and to also compare the detection performance of various control charts, control limits have to be established first especially when the methods to be considered are based on the robust estimate of mean and variance whose distribution is rarely known. Several approaches have been used in literature for scenarios where the distributions of the control chart parameters are unknown. For instance, Chenouri et al. [18] proposed a robust control chart method in which the classical mean and covariance estimator was replaced with the mean and covariance estimator derived from the re-weighted version of the MCD algorithm. They constructed the control limits by using the result of a Monte Carlo simulation experiment to model the empirical distribution of the re-weighted MCD algorithm through a family of regression curves. The control limit of their control chart is then estimated from the empirical model at a known value of the dimension p and the sample size, n. Similar approach of constructing empirical distribution in form of a regression model can also be found in the work of Chenouri and Variyath [17] and Variyath and Vattathoor [19]. This approach works well in detecting little shifts in the mean especially in scenarios where data are contaminated with outliers. A method that is not based on modelling the empirical distribution of the location and dispersion estimators can be found in Jensen et al. [16] and Variyath and Vattathoor [20]. This method constructs the control limits of the robust control charts by iteratively computing the T 2 ðnÞ -statistic for i iterations and then taking the (1 − α) 100 th percentile of the T 2 ðnÞ -statistic as the established control limits. Our method follow this prescription, with our objective being that the T 2 crcc;i is able to detect little shift in mean in the presence of outliers. Consequently, the estimation of the control limit for T 2 crcc;i along with that of T 2 RMVE;i and T 2 RMCD;i is described below.
Given an overall false alarm rate of α, a control limit (c α ) can be obtained from Eq (19) 1 À a ¼ pðmax Because of the invariance of the T 2 RMVE;i , T 2 RMCD;i and the T 2 crcc;i statistics, μ and ∑ of the in-control multivariate normal distribution are set to zero vector 0 and identity matrix I respectively. Applying Eq (19), the simulation runs for constructing control limits can be performed using the following algorithm: 1. Generate a dataset containing n independent observations from N p (0,I).  Tables 1, 2 and 3 for  T 2 crcc;i , T 2 RMVE;i and T 2 RMCD;i respectively. As the overall false alarm rate was fixed at 0.05, it can be seen from Tables 1, 2 and 3 that the control limits for the proposed T crcc, i is stable at various levels of n and p when compared to the T 2 RMVE;i and T 2 RMCD;i . The T 2 RMVE;i seems to be moderate when compared to the T 2 RMCD;i counterpart. It is important to mention here that these limits work well for high dimensions say 5-10. However, with small sample sizes, the control limits may not be reliable.

Performance Evaluation Through Monte Carlo Simulation Experiment
In order to examine the performance of the proposed crcc on effective detection of outliers and mean shift as well as simultaneously minimizing swamping rate, Monte Carlo simulation experiment is performed. In the course of evaluation, the proposed method is compared with Cluster-Regression Control Chart other robust multivariate control chart methods such as the RMVE and and RMCD. The detection rates are used as the yardstick for the performance evaluation. The detection rate is computed as the ratio of the number of correctly identified outliers or mean shift by the methods to the total number of the outliers or mean shift in the dataset. Precisely, let the number of outliers in the dataset be π = bmγc and the number of outliers detected by a control chart method at a given non-centrality parameter λ 2 be π(λ 2 ), then the detection rate of a method T 2 ';i ; ' ¼ crcc; RMVE; RMCD; i ¼ 1; Á Á Á ; m is computed as where m is the sample size. Following the methodology of Variyath and Vattathoor [19], the datasets are generated from a standard multivariate normal distribution MVN(0, I p ) with r = 100,000 sample runs of size m. Within each regime, we considered four levels of mean shifts in data contaminations, namely, γ = 0.00, γ = 0.10, γ = 0.20 and γ = 0.25. Without loss of generality, we further assume that a process with the same covariance matrix has been shifted from μ 0 = 0 to μ 1 = (λ,0, Á Á Á,0) 0 where a non-centrality parameter, λ 2 = (μ 1 − μ 0 ) 0 ∑ −1 (μ 1 − μ 0 ) represents the size of a process shift taken to be λ 2 = 0,5,10,15,20. The clean datasets were simulated for observations with the indices i = 1, Á Á Á, m − h where h = bmγc and γ is the fraction of data contamination while the contaminated datasets were simulated for observations with the indices i = m − h + 1, Á Á Á, m. The simulation experiment is replicated for r = 100,000 runs and for each dataset x r ij , the the detection rate is computed using the three methods. The detection rate is expected to be as close as possible to 1 for a method to be classified as performing well. However, the detection rate should close to zero for the null model where λ 2 = 0. Using Eq (20), the results of the simulation experiment are presented in Tables 4, 5 and 6 for crcc, RMVE and RMCD control charts respectively.
From Tables 4, 5 and 6, it can be seen that: 1. At the null model ie λ 2 = 0, all methods considered performed quite well and hence, they did not identify any point as outliers and/or out-of-control. Cluster-Regression Control Chart 2. The crcc has a higher detection rate compared to the RMVE and RMCD. This is noticeable when the non-centrality parameter is small (say 5 λ 2 10). Hence, the crcc is sensitive in detecting small shift in the in-control mean when outliers are present in datasets 3. The RMVE and RMCD performed quite well especially when the non-centrality parameter is large (say λ 2 > 10) 4. In all scenarios considered, the sample size m and the dimension p has little or no effect on the detection rate. 5. The detection rate of all methods considered increases as the non-centrality parameter (mean shift = λ 2 ) and the level of contamination (outliers = γ) increases. Hence, λ 2 and γ are the two parameters that influences the detection rate.

Numerical Implementation and Real Life Data Application of crcc Algorithm Artificial Data
In order to facilitate the understanding of crcc algorithm and the working mechanisms, a follow up numerical illustration is given below. Following the methodology of Lawrence et al. [22], two variable quality characteristics of sample size 12 is simulated in the following way: Observations 1-9 comes from x i2 = 200 − 4x i1 + e i and x i1 * U[10; 20] with the error term e i * N(μ Cluster-Regression Control Chart = 0, σ = 5). Observations 10-12 are outliers deliberately planted such that: for 10 i 12, x i1 = (29,30,31) and x i2 = (153.800, 155.800, 80.932). The data arising from the process is presented in Table 7 below The stepwise implementation of crcc algorithm on the two variable artificial data is presented below.
Step 2: The eigenvalues of x i1 and x i2 denoted as λ 1 and λ 2 , computed from the covariance matrix,Ŝ in Eq (3)   with w 2 0:975;2 ¼ 7:3777 and hence the (5 × 2) projector matrix described in Eq (6) Step 4: The (12 × 2) data matrix B resulting from an OLS regression of O augmented by the i th row of x ij is computed this way: Add the first row of the data in Table 7 to O to obtain the data in Table 8.
Notice that O has become 6 × 2 matrix because row 1 of Table 7 has been merged with O in Eq (25). An OLS fit to the data in Table 8 result in the estimates b 1 = 200.7503, −4.4460. Perform this operation for n = 12 times. Note that the augmentation is done without replacement. The resulting estimate is presented in Table 9 below Step 5: The (n × n) similarity matrix computed for the data in Table 9, using Eq (13) is presented in Table 10 below

Cluster-Regression Control Chart
The agglomerative hierarchical cluster analysis based on the similarity matrix produces the dendrogram plot in Fig 1 below. Notice that the 3 outliers cluster differently while the inliers cluster together. The plot shows that the main cluster C M contains observations 1:9 while the 3 minor clusters are C τ1 = {10}, Step 6: An OLS fit the the data points in the main cluster yields the estimates b j = 204.4238, −4.3132 and the corresponding prediction variance computed using Eq (14) Notice that the DFIITS-statistics for the third cluster, C τ3 is less than the cutoff value. Hence cluster 3 with observation number 12 is activated while cluster 1 and 2 with observations numbers 10 and 11 respectively, remain inactive.
Step 7: Having activated cluster 3, n a = 10, the crcc control chart parameters are computed thus: The resulting control chart statistic, T 2 crcc;i alongside two robust control chart methods such as the RMVE control chart, T 2 RMVE;i , the RMCD control chart, T 2 RMCD;i and the classical Hotelling T 2 i control chart statistics are presented in Table 11 below. The corresponding control chart for the four methods is presented in Fig 2 below

Cluster-Regression Control Chart
Notice from Table 11 and Fig 2 above that the crcc control chart is able to detect the 3 spurious outliers planted at index 10, 11 and 12 as an out-of-control points. The crcc control chart has the feature to detect the in-control-trend while the dendrogram plot in Fig 1 also depict the outlier structure in multivariate data. The RMVE and RMDC were able to detect observations 10-12 as an out-of-control point. However, due to swamping effect, they erroneously nominated observations 1 and 5 which were originally in-control-points as out-of-control points. Surprisingly, the classical Hotelling T 2 computed from the function qcc in R language could not even identify any of the 3 mean shift outlier points. This is because observations 10-12 attracted the mean to themselves and away from the other in-control observations so that the covariance matrix becomes arbitrarily large and the squared Mahalanobis distances computed based on these mean and covariance becomes arbitrarily small and hence, the 3 observations are masked.

The Pulp Fiber Real-life Dataset
Following the experimental design of Lee [30], Rousseeuw et al. [31] describe an experiment that was conducted to determine the effect of pulp fiber properties on the paper made from them. The pulp fiber properties that were measured for paper production are: mean fiber length, x i1 , long fiber fraction, x i2 , fine fiber fraction, x i3 , and zero span tensile, x i4 . The paper properties that were measured after production are: breaking length, y i1 , elastic modulus, y i2 , stress at failure, y i3 , and burst strength, y i4 . The dataset arising from this process contains n = 62 measurements as presented in Table 12.
The crcc step-wise algorithm for the pulp-fiber data is presented below.
Step 1: Given that p = 8, and h = b(n + p + 1)/2c = 35, the (1 × 8) mean vector m x and the   Cluster-Regression Control Chart Step 2: The eigenvalues of x i1 , x i2 , x i3 , x i4 , x i5 , x i6 , x i7 and x i8 denoted as λ 1 , λ 2 , λ 3 , λ 4 , λ 5 , λ 6 , λ 7 and λ 8 , computed from the covariance matrix,Ŝ in Eq (30) are since x i8 has the least eigenvalue of λ 8 = 3.8e −5 , it is nominated and denoted as the dependent variable, y i and the remaining variables are treated as regressors at the regression stage of crcc. Step 3: The eigenvectors of x i1 , x i2 , x i3 , x i4 , x i5 , x i6 , x i7 and x i8 denoted as e 1 , e 2 , e 3 , e 4 , e 5 , e 6 , e 7 and e 8 computed from the covariance matrix,Ŝ in Eq (30) with w 2 0:975;8 ¼ 17:5346 and hence the (17 × 8) projector matrix described in Eq (6) is given as   Step 6: An OLS fit to the data points in the main cluster yields the estimates b j = -2.0016, 0.5229, -0.0129, -0.0081, -1.2572, 0.1118, 0.3068, 0.0882 and the corresponding prediction variance computed using Eq (14) Notice that the DFIITS-statistics for the second, third and eighth clusters are less than the cutoff value. Hence they are activated in the estimation of crcc charting parameters while the other minor clusters remain inactive. The inactive observations are then removed from the pulp fiber dataset X. The resulting observation x a ij is then used to compute the mean vector and covariance matrix.
Step 7: Having activated the 3 minor clusters, the crcc control chart parameters are computed thus: The resulting control chart statistic, T 2 crcc;i is presented in Table 14 below while the crcc control chart is plotted in Fig 4(a).
Notice from Table 14 and Fig 4(a) that that observations number 22, 46-48, 51-52, 56, and 58-62 are out-of-control points. According to Rousseeuw et al. [31] and Jung [32], these outof-control points are said to be observations produced from different woods and different pulping process from those of other observations. The details of these out-of-control points are given below: From the source of the data, it was found that all but the last four pulp samples (observations 59-62) were produced from fir wood. Furthermore, most of the out-of-control samples were obtained using different pulping processes. For instance, observation 62 is the only sample from a chemithermomechanical pulping process, observations 60 and 61 are the only samples from a solvent pulping process, and observations 51, 52, and 56 are obtained from a kraft pulping process. Finally, the smaller outliers (22, 46-48, and 58) all were Douglas fir samples. Consequently, these out-of-control points are removed from the dataset and the crcc revised control chart are constructed for the remaining observation. Thus the in-control parameters (mean vector and covariance matrix) computed using algorithm 1 in section 1 are: while the revised control chart statistic, T 2 crccÀrevised;i is presented in Table 15 The crcc control chart is presented in Fig 4(a) while revised control chart based on the incontrol-parameters is presented in Fig 4(b). Notice that the process is in a state of statistical  control and hence, this in-control parameters can be adopted as a standard for the pulping process and paper produced from them. The RMVE and RMCD control charts were also used to analyze the pulp fibre dataset using the function cov.rob in the MASS package of R software and we found that the results obtained were the same as the proposed method and hence its was not reported.

Conclusion
The real life application of quality management requires simultaneous monitoring of multiple quality characteristics. Real life data from production and service processes often contains spurious observations whose causes can be traced to assignable variation. This spurious variables often go unnoticed if proper statistical techniques are not employed prior to control chart construction. A multivariate control chart method known as the cluster-regression control chart crcc is proposed to simultaneously screen dataset for likely outlier structure and mimic the data trend prior to the construction of the control chart.
Most often, the assumptions needed for large sample theory are better approximated by the distribution of the untrimmed data than by the entire data set, and it is often suggested that statistical analysis should be conducted on the "cleaned data set" when the outliers have been deleted. The proposed method follow this prescription with our objective being that, the parameters of the control chart can be better estimated when the outlying observations which depart from the trend exhibited by the bulk of the dataset have been removed. The proposed method has the tendency to identify mean shift and outliers in datasets while keeping masking and swamping under control. No single robust algorithm estimator seems to be outstanding, and for any given estimator, it is easier to find outlier configurations and mean shift in datasets where the estimator fails and hence, we state that the performance of the proposed crcc method can be quite poor when the level of data contamination go beyond 40% of the sample size.