An improved approach for fault detection by simultaneously overcoming high-dimensionality, autocorrelation, and time-variability

Control charts based on Principal Component Analysis (PCA) and its extensions are among the data-driven methods for process monitoring and fault detection. Industrial process data involve complexities such as high dimensionality, autocorrelation, and non-stationarity, which may occur simultaneously. An efficient fault detection technique is robust to the training data, sensitive to all feasible faults of the process, and quick to detect them. To date, approaches such as the recursive PCA (RPCA) and moving-window PCA (MWPCA) models have been proposed for high-dimensional, non-stationary data, while the dynamic PCA (DPCA) model and its extensions have been suggested for autocorrelated data. However, using these techniques without considering all aspects of the process data degrades fault detection indicators such as the false alarm rate (FAR) and detection time delay (DTD), and confuses the operator or causes adverse consequences. A new PCA monitoring method is proposed in this study that can simultaneously reduce the impact of high dimensionality, non-stationarity, and autocorrelation. The technique uses the DPCA property to decrease the effect of autocorrelation and the adaptive behavior of MWPCA to handle non-stationarity. The proposed approach has been tested on the Tennessee Eastman Process (TEP). The findings show that it detects various forms of faults and improves the fault detection indicators compared with other approaches. The approach has also been applied to real turbine exit temperature (TET) data, where it successfully detected an actual fault.


Introduction
There are several conceptual approaches to fault detection and diagnosis, and different researchers, based on their points of view, present various classifications, none of which is universally accepted. In line with these problems, many researchers have attempted to increase the accuracy of the PCA model. In the literature, various quality control charts have been proposed to overcome autocorrelation or non-stationarity, including those based on principal component analysis (PCA). Ku et al. proposed dynamic PCA (DPCA), an approach that extends static PCA tools to address autocorrelation in a multivariate process [12]. Rato et al. suggested approaches to improve DPCA [10,11]. Ammiche et al. recommended a modified moving-window dynamic PCA with a fuzzy logic filter and applied it to fault detection; they claimed that this approach could handle the autocorrelation problem and reduce false alarm rates [13].
Li et al. provided a recursive PCA (RPCA) approach to cope with non-stationarity [14]. The RPCA technique updates the model over an ever-growing data set consisting of new samples without discarding the old ones. RPCA, which has been used efficiently for process monitoring, is theoretically simple but has some disadvantages. For instance, as the data set expands and the model is updated, the speed of adaptation decreases with the growing size of the data. Furthermore, when older samples are down-weighted, the forgetting factor cannot be easily selected without prior knowledge of likely fault conditions. Another approach proposed to cope with non-stationarity is moving-window PCA (MWPCA) [11,15]. The MWPCA method can tackle some of the limitations mentioned above by collecting a sufficient number of data points in the time window, which helps build an adaptive process. Specifically, MWPCA removes older samples and keeps the new samples representing the current operating conditions. Meanwhile, researchers have identified the window size in MWPCA as an essential parameter: if an appropriate window size is not selected, the model will over-fit [16]. Therefore, MWPCA with an application delay, the V-step-ahead prediction, was introduced to solve this problem. This method uses the model calculated at time t to predict the behavior of the system at time t + V and to detect possible faults. This step ensures that the model does not overly adapt to the data and can detect errors that build up gradually and would otherwise be recognized as normal observations [17]. De Ketelaere et al. reviewed PCA-based statistical process-monitoring methods for time-dependent, high-dimensional data, including DPCA, MWPCA, and RPCA [9].
Other researchers have proposed various approaches in the areas of DPCA, MWPCA, and RPCA, for instance to reduce false alarm rates, or have combined them with other methods [16,18-21]. As stated, RPCA and MWPCA are adaptive approaches: with each new sample correctly identified, the monitoring model and control limit are modified, so that normal time-varying information is incorporated into the monitoring model to distinguish between normal time-varying behavior and slow ramp faults. Gao et al. introduced another approach to the non-stationarity problem called incremental PCA (IPCA) [22]. This method, unlike the adaptive methods, keeps the model unchanged and introduces a new parameter, the incremental PC (IPC). The non-stationary behavior is extracted by calculating the IPCs; the monitoring model remains unchanged and uses a new statistic, called IT², to monitor the data.
As stated, because sampling is rapid in industrial processes, the data have autocorrelation properties and, depending on the situation, the process behavior can change in various modes and parameters such as the mean or variance, which makes the data non-stationary. The DPCA model can resolve autocorrelation, but it has fixed thresholds, and when the data are non-stationary, indicators such as the false alarm rate (FAR), missed detection rate (MDR), and detection time delay (DTD) show that it does not perform well. Methods such as MWPCA can overcome the non-stationarity problem with adaptive thresholds, but fault detection precision is again insufficient because the autocorrelation problem is overlooked. The resulting deterioration of these indicators, due to the lack of a monitoring scheme consistent with the data characteristics, delays fault detection and confounds operators. Based on De Ketelaere et al.'s [9] suggestion and our investigation, no research has been carried out on a model that can handle high dimensionality, non-stationarity, and autocorrelation simultaneously. In addition to using DPCA's data-matrix expansion to resolve autocorrelation, this research attempts to reduce indicators such as FAR, MDR, and DTD by using the adaptive thresholds of the MWPCA process. In other words, the aim of this article is to present a fault detection approach combining the properties of the MWPCA and DPCA models that can handle autocorrelation, where the time lag of each variable can differ, and non-stationarity at the same time.
This paper is organized as follows: the background of PCA models and how they are used in fault detection is outlined in Section 2. Section 3 presents the proposed approach, a combination of DPCA and MWPCA. In Section 4, the proposed approach is implemented on TEP data and turbine exit temperature (TET) data; the discussion and results are also presented in this section. Finally, the conclusion is given in Section 5.

Principal component analysis (PCA)
PCA converts a set of correlated variables into a smaller number of uncorrelated new variables, where the new variables retain most of the information of the original data [2,23,24]. Let $X \in \mathbb{R}^{n \times m}$ be the original data matrix with n samples and m variables. The first step in PCA is the standardization of the data; after scaling, the matrix X has zero mean and unit variance. Then, the PCA algorithm projects X onto a new orthonormal space by the linear transformation

$$T = XP \quad (2)$$

The loading matrix P can be calculated from the eigenvalue problem

$$CP = P\Lambda \quad (3)$$

where C represents the covariance matrix of X,

$$C = \frac{1}{n-1} X^{T} X \quad (4)$$

and Λ represents a diagonal matrix consisting of the non-negative real eigenvalues in descending order ($\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m \geq 0$). The next step is determining the principal dimensions, i.e., selecting the number of principal components (PCs) according to the distribution of variation in the new coordinate system. Several methods have been proposed for choosing the number of principal components [25,26]. Cumulative Percent Variance (CPV) is one of these methods; it gives the percentage of variance captured by the first r principal components as follows [27]:

$$\mathrm{CPV}(r) = \frac{\sum_{i=1}^{r} \lambda_i}{\sum_{i=1}^{m} \lambda_i} \times 100\% \quad (5)$$

Since the eigenvalues quantify how much variation each PC carries, CPV is an appropriate criterion for defining the number of PCs in a PCA model. After determining the number of PCs, X is decomposed into the PC space and the residual space E, as shown in Eq (6):

$$X = T_r P_r^{T} + E \quad (6)$$

The PC space explains the systematic variation of the system, while the residual space describes noise or model error [28].
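As a concrete illustration of the steps above, the following minimal Python sketch standardizes the data, eigendecomposes the covariance matrix, and selects the number of PCs by CPV. The function name, the 70% CPV target (used in the case studies later in the paper), and the toy data are our own assumptions, not the authors' code:

```python
import numpy as np

def fit_pca(X, cpv_target=0.70):
    """Standardize X, eigendecompose its covariance (Eqs 3-4), and retain the
    smallest number of PCs whose Cumulative Percent Variance reaches the target (Eq 5)."""
    mean, std = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - mean) / std                         # zero mean, unit variance
    C = Z.T @ Z / (Z.shape[0] - 1)               # covariance matrix C, Eq (4)
    lam, P = np.linalg.eigh(C)                   # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]                # re-order to descending
    lam, P = lam[order], P[:, order]
    cpv = np.cumsum(lam) / lam.sum()             # Eq (5), as a fraction
    r = int(np.searchsorted(cpv, cpv_target)) + 1
    return {"mean": mean, "std": std, "P_r": P[:, :r], "lam": lam, "r": r}

# Toy data: 500 samples, 6 variables, one strongly correlated pair
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=500)
model = fit_pca(X)
print(model["r"], "PCs retained")
```

Because the data are standardized, the eigenvalues sum to m, so the CPV denominator equals the number of variables.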
After constructing a PCA model based on the collected historical data, an instrument is needed to control the variations. Multivariate control charts using Hotelling's T² and the squared prediction error (SPE, or Q) can be plotted to detect faults. Determining two orthogonal subspaces of the original space reduces the monitoring problem to these two statistics (T² and Q) [27]. The major variations and the random noise in the data are controlled by T² and Q, respectively.
The T² statistic can be calculated for each new observation x by

$$T^{2} = x^{T} P_r \Lambda_r^{-1} P_r^{T} x \quad (7)$$

where Λ_r is the square matrix constructed from the first r rows and columns of Λ and, as previously mentioned, P_r contains the first r eigenvectors (columns of P). The upper confidence limit for T² is obtained from the F-distribution:

$$T_{\alpha}^{2} = \frac{r(n-1)(n+1)}{n(n-r)} F_{\alpha}(r, n-r) \quad (8)$$

where r is the number of principal components, n denotes the number of samples in the data, and α is the level of significance. A violation of this threshold means that the systematic variation of the system is out of control. Another statistic is the squared prediction error (SPE), or Q, which monitors the portion of the measurement space associated with the smallest m−r eigenvalues. The Q statistic is calculated as the sum of squares of the residuals:
$$Q = x^{T}\left(I - P_r P_r^{T}\right)x \quad (9)$$

where I is the identity matrix. The upper confidence limit for Q can be calculated from its approximate distribution:

$$Q_{\alpha} = \theta_1 \left[ \frac{C_{\alpha}\sqrt{2\theta_2 h_0^{2}}}{\theta_1} + 1 + \frac{\theta_2 h_0 (h_0 - 1)}{\theta_1^{2}} \right]^{1/h_0} \quad (10)$$

where $\theta_i = \sum_{j=r+1}^{m} \lambda_j^{\,i}$ for i = 1, 2, 3, $h_0 = 1 - \frac{2\theta_1\theta_3}{3\theta_2^{2}}$, and C_α is the value of the standard normal distribution at significance level α. A violation of this threshold indicates that an unusual event has occurred, causing a change in the covariance structure of the model [29]. Based on the previous description, if the Q or T² statistic exceeds its confidence limit, an abnormality is indicated.
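The four formulas above translate directly into code. The following is a minimal sketch (the function names are our own) that computes T² and Q for one standardized sample and their control limits from Eqs (8) and (10):

```python
import numpy as np
from scipy import stats

def t2_statistic(x, P_r, lam_r):
    """Hotelling T^2 of a standardized observation x (Eq 7)."""
    t = P_r.T @ x                      # scores of x in the PC space
    return float(t @ (t / lam_r))

def q_statistic(x, P_r):
    """SPE / Q statistic: squared residual outside the PC space (Eq 9)."""
    resid = x - P_r @ (P_r.T @ x)
    return float(resid @ resid)

def t2_limit(r, n, alpha=0.01):
    """Upper control limit of T^2 from the F-distribution (Eq 8)."""
    return r * (n - 1) * (n + 1) / (n * (n - r)) * stats.f.ppf(1 - alpha, r, n - r)

def q_limit(lam_residual, alpha=0.01):
    """Upper control limit of Q from its approximate distribution (Eq 10);
    lam_residual holds the m-r smallest eigenvalues."""
    th1, th2, th3 = (np.sum(lam_residual ** i) for i in (1, 2, 3))
    h0 = 1 - 2 * th1 * th3 / (3 * th2 ** 2)
    c_alpha = stats.norm.ppf(1 - alpha)
    return th1 * (c_alpha * np.sqrt(2 * th2 * h0 ** 2) / th1
                  + 1 + th2 * h0 * (h0 - 1) / th1 ** 2) ** (1 / h0)
```

A sample is declared abnormal when `t2_statistic(...) > t2_limit(...)` or `q_statistic(...) > q_limit(...)`.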
In summary, the condition-monitoring process with the PCA method is demonstrated in Fig 2.

Dynamic principal component analysis
A small sampling period is required for early fault detection in data streamed from fast industrial processes. As a result, the current values of the process variables depend on the past values, so the assumption of statistical independence between observations is violated and the conventional PCA model does not offer good performance [31]. Ku et al. [12] proposed the dynamic PCA method, which adds the concept of data dynamics to the PCA model. For this purpose, they used the time-lag-shift method and appended the necessary number of lags, l, to the data, giving the augmented data matrix

$$X_A(l) = \begin{bmatrix} x_k^{T} & x_{k-1}^{T} & \cdots & x_{k-l}^{T} \\ x_{k-1}^{T} & x_{k-2}^{T} & \cdots & x_{k-l-1}^{T} \\ \vdots & \vdots & & \vdots \\ x_{l+1}^{T} & x_{l}^{T} & \cdots & x_{1}^{T} \end{bmatrix}$$

where $x_k^{T}$ is the m-dimensional observation vector at time k.
The DPCA model is constructed by applying the PCA model to X_A(l). The monitoring procedure of DPCA is the same as that of PCA; the key issue in DPCA is selecting the number of lags.
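The augmentation step can be sketched as follows. The helper below is our own illustration; unlike Ku et al.'s formulation, it already allows a different lag per variable, as the proposed method later requires:

```python
import numpy as np

def augment_with_lags(X, lags):
    """Build a DPCA-style augmented matrix: each column of X is extended with
    its own number of time-lagged copies (lags may differ per variable).
    Rows with incomplete history (the first max(lags) samples) are dropped."""
    n, m = X.shape
    lmax = max(lags)
    cols = []
    for j in range(m):
        for l in range(lags[j] + 1):
            cols.append(X[lmax - l: n - l, j])   # x_j(t), x_j(t-1), ..., x_j(t-l_j)
    return np.column_stack(cols)

X = np.arange(20, dtype=float).reshape(10, 2)    # toy data: 10 samples, 2 variables
XA = augment_with_lags(X, lags=[2, 1])
print(XA.shape)   # 10 - max lag = 8 rows; (2+1) + (1+1) = 5 columns
```

Ordinary PCA applied to `XA` then yields the DPCA model.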

Moving window principal component analysis
The non-stationary property appears when process parameters, such as the mean or covariance, change over time [11]. To tackle the time-varying issue, several complementary multivariate statistical process monitoring (MSPM) methods have been introduced. To address non-stationary data, three classes of approaches, namely recursive PCA (RPCA), moving-window PCA (MWPCA), and incremental PCA (IPCA), have been applied to extend PCA methods [22].
As mentioned in the introduction, RPCA, which has been used efficiently for process monitoring, is theoretically simple. However, its implementation has two main drawbacks: the ever-growing data set on which the model is updated eventually slows down adaptation as the data size increases, and the model retains older data that are unrepresentative of the time-varying process. Moreover, when older samples are down-weighted, the forgetting factor cannot be easily selected without a priori knowledge of likely fault conditions [15].
The MWPCA method can tackle some of the aforementioned limitations by collecting a sufficient number of data points in the time window, which helps build an adaptive process. Specifically, MWPCA removes older samples and keeps the new ones representing the current operating conditions. Hence, for window size K, the data matrix at time k is X_k = (x_{k−K+1}, x_{k−K+2}, ..., x_k) and, at time k + 1, it is X_{k+1} = (x_{k−K+2}, x_{k−K+3}, ..., x_{k+1}). The observations in the new window are used to obtain the updated mean x̄_{k+1} and standard deviation s_{k+1} [32]. The MWPCA algorithm can be summarized as follows:

Offline step:

1. Implement a conventional PCA model on the training data (computing the loading vectors, the number of principal components, and the control limits of the monitoring indices, the T² and Q statistics).

2. Validate the model with the test data.
Online step:

1. Select a new online sample and normalize it with the means and variances of the training data set.
2. Calculate the monitoring indices (T² and Q statistics) for this new sample.
3. Compare the monitoring indices (T² and Q statistics) of the new sample with the current thresholds. If both are below the thresholds, go to step 4; otherwise, go to step 5.
4. Update the window by including the new sample and excluding the oldest one, update the PCA model by recalculating it and its thresholds, and go to step 1.

5. Flag the sample as faulty without updating the model, and go to step 1.
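The steps above can be condensed into a single online loop. The following is a minimal Python sketch under our own simplifications (the window model is refitted from scratch at each step rather than updated recursively, only one pair of statistics is used, and the limits come from Eqs (8) and (10)):

```python
import numpy as np
from scipy import stats

def fit_window(W, r=2):
    """PCA model of the current window: scaling, loadings, eigenvalues."""
    mu, sd = W.mean(0), W.std(0, ddof=1)
    Z = (W - mu) / sd
    lam, P = np.linalg.eigh(Z.T @ Z / (len(Z) - 1))
    idx = np.argsort(lam)[::-1]
    return mu, sd, P[:, idx[:r]], lam[idx[:r]], lam[idx[r:]]

def mwpca_online(train, test, K=100, r=2, alpha=0.01):
    """MWPCA online loop: an in-control sample slides the window and refreshes
    the model and thresholds; an out-of-control sample is flagged, and the
    model is left unchanged so the fault is not adapted into it."""
    window = train[-K:].copy()
    flags = []
    for x in test:
        mu, sd, P, lam, lam_res = fit_window(window, r)
        z = (x - mu) / sd
        t = P.T @ z
        t2 = float(t @ (t / lam))                 # Eq (7)
        resid = z - P @ t
        q = float(resid @ resid)                  # Eq (9)
        t2_lim = r * (K - 1) * (K + 1) / (K * (K - r)) * stats.f.ppf(1 - alpha, r, K - r)
        th1, th2, th3 = (np.sum(lam_res ** i) for i in (1, 2, 3))
        h0 = 1 - 2 * th1 * th3 / (3 * th2 ** 2)
        q_lim = th1 * (stats.norm.ppf(1 - alpha) * np.sqrt(2 * th2 * h0 ** 2) / th1
                       + 1 + th2 * h0 * (h0 - 1) / th1 ** 2) ** (1 / h0)
        if t2 < t2_lim and q < q_lim:
            flags.append(0)
            window = np.vstack([window[1:], x])   # adapt: slide the window
        else:
            flags.append(1)                       # fault: freeze the model
    return flags
```

Freezing the model on out-of-control samples is what prevents the adaptive scheme from slowly absorbing a fault into the "normal" model.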

Proposed method (Moving Window Dynamic PCA)
Nowadays, industrial systems are complex and fast, and their current observations depend strongly on past observations. Because the state of the system changes, the collected data are non-stationary. Also, the lags caused by autocorrelation may differ from variable to variable. Thus, an approach is required that can overcome high dimensionality, non-stationarity, and autocorrelation simultaneously. The proposed method is a data-driven one that uses the features of two well-known multivariate process-monitoring methods, DPCA and MWPCA: the augmented-matrix technique of DPCA overcomes the autocorrelation property, and MWPCA, through the generation of adaptive thresholds, reduces the time-varying property. The proposed method, Moving Window Dynamic PCA (MWDPCA), combines the features of DPCA and MWPCA to reduce the effects of autocorrelation and time variation and to enhance the sensitivity and robustness of the process-monitoring model. MWDPCA is applied in two phases, offline and online, illustrated by the following pseudo-code:

Offline phase:
  For i = 1 to p (p = number of variables):
    Determine the optimum lag for the i-th variable by the AIC criterion.
  End for
  Arrange the augmented data matrix by adding the lagged variables.
  Set α such that 1 − α is the given confidence level.
  Construct the PCA model from the augmented matrix (mean, standard deviation, and number of PCs).
  Compute the initial thresholds Q_0 and T²_0 using Eqs (8) and (10).
  Determine the window length H from the training data by minimizing FAR.
  Initialize the binary fault indicator to zero.
  Set n = number of rows of the training data set, and Q_off = T2_off = [0]_{1×n}.
  For i = 1 to n:
    Set x = i-th row of the training data set.
    Compute x*, the standardization of x by the mean and standard deviation above.
    Compute Q and T² for x* using Eqs (7) and (9), and store them as the i-th components of Q_off and T2_off.
  End for
  Set Q_fixed and T²_fixed to the (1 − α) quantiles of Q_off and T2_off.

Online phase:
  Set m = number of rows of the testing data set.
  For i = 1 to m:
    Set x = i-th row of the testing data set and append its lags.
    Compute x*, the standardization of x by the offline mean and standard deviation.
    Evaluate Q and T² for x*.
    If Q < Q_0 and T² < T²_0:
      Include the sample with its lags in the window and exclude the oldest one.
      Recalculate the DPCA model from the expanded matrix (mean, standard deviation, and number of PCs).
      Calculate the adaptive thresholds Q″ and T²″ by Eqs (8) and (10).
      Set Q_0 = Q″ and T²_0 = T²″, and set the binary fault indicator to 0.
    Else:
      Set the adaptive thresholds Q″ = Q_fixed and T²″ = T²_fixed; set Q_0 = Q″ and T²_0 = T²″.
      Set the binary fault indicator to 1.
    End if
  End for
The first step in the offline phase is determining the lag of each variable from the training data based on the Akaike Information Criterion (AIC). Each variable's lag may differ from the others, and finding the lag for each variable separately increases model accuracy. An extended training data matrix is then created by adding the lags to the data. As Ku et al. proved for DPCA, if enough lags l are added to the data matrix, the process-monitoring statistics become statistically independent from one moment to the next. The next step is determining the window length from the training data based on FAR minimization. Then, the mean and standard deviation are calculated and the data are normalized. Afterward, the PCA algorithm is applied to the data matrix, and the principal components and the initial thresholds Q_0 and T²_0 are determined using Eqs (8) and (10). The fault indicator is initialized to 0. For all observations of the training data, Q and T² are calculated using Eqs (7) and (9), and their quantiles are selected as the fixed thresholds Q_fixed and T²_fixed. These thresholds are used once a fault has been detected; they are set, by trial and error, so as not to interfere with detecting the process's recovery.
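The per-variable lag selection can be carried out by fitting autoregressive models of increasing order and keeping the order that minimizes AIC. Below is a minimal least-squares sketch of this idea (our own implementation, not the authors' code; the Gaussian AIC form is assumed):

```python
import numpy as np

def aic_lag(series, max_lag=10):
    """Select the AR order of one variable by AIC: fit AR(l) by least squares
    for l = 1..max_lag on a common sample and keep the order minimizing AIC."""
    series = np.asarray(series, dtype=float)
    n_eff = len(series) - max_lag              # common sample so AICs are comparable
    y = series[max_lag:]
    best_l, best_aic = 1, np.inf
    for l in range(1, max_lag + 1):
        lagged = [series[max_lag - k: len(series) - k] for k in range(1, l + 1)]
        X = np.column_stack([np.ones(n_eff)] + lagged)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = float(np.sum((y - X @ beta) ** 2))
        aic = n_eff * np.log(rss / n_eff) + 2 * (l + 1)   # Gaussian AIC, l+1 parameters
        if aic < best_aic:
            best_l, best_aic = l, aic
    return best_l

# Toy AR(2) series: the selected order should recover the true lag structure
rng = np.random.default_rng(0)
x = np.zeros(2000)
e = rng.normal(size=2000)
for t in range(2, 2000):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + e[t]
print("selected lag:", aic_lag(x))
```

Running `aic_lag` once per column of the training data yields the per-variable lags used to build the augmented matrix.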
The second phase is online. In this step, a new test sample is selected, expanded by adding its l lags, and normalized, and Q and T² are calculated for it. Then, Q and T² are checked against the thresholds Q_0 and T²_0. If both are below the thresholds, the process is under control: the fault indicator is set to 0, the sample and its lags are included in the window, and the oldest sample is excluded from the window. The model is updated and the adaptive thresholds Q″ and T²″ are computed by Eqs (8) and (10); then Q_0 = Q″ and T²_0 = T²″. The moving window and adaptive thresholds cope with the non-stationary property by limiting the effect of old observations and using recent observations to estimate the mean and variance and update the model. In other words, the proposed model can detect faults under the different operational modes of the system. If, however, the Q or T² statistic surpasses its control limit, the adaptive thresholds Q″ and T²″ are set to Q_fixed and T²_fixed (and Q_0 = Q″, T²_0 = T²″), and the binary fault indicator is changed to 1. Fig 4 illustrates the proposed method (moving window dynamic PCA) in the form of a flowchart.
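One online iteration of MWDPCA can be condensed as follows. This is a structural sketch of the branching and threshold logic described above; the hypothetical `model` and `refit` callables stand in for the DPCA evaluation and refit of Eqs (7)-(10), and the names are our own:

```python
import numpy as np

def mwdpca_step(z_lagged, model, q0, t20, q_fix, t2_fix, window, refit):
    """One online step of MWDPCA (our own sketch of the paper's pseudo-code).
    z_lagged : new sample already extended with its lags and standardized.
    model    : callable mapping a sample to its (Q, T^2) statistics.
    refit    : callable rebuilding the DPCA model and its adaptive thresholds
               from the updated window, returning (model, q0, t20)."""
    q, t2 = model(z_lagged)
    if q < q0 and t2 < t20:
        window = np.vstack([window[1:], z_lagged])   # slide the window
        model, q0, t20 = refit(window)               # adaptive thresholds, Eqs (8), (10)
        fault = 0
    else:
        q0, t20 = q_fix, t2_fix                      # fall back to fixed thresholds
        fault = 1                                    # model and window stay frozen
    return model, q0, t20, window, fault
```

The fixed-threshold fallback in the `else` branch is what keeps the chart usable during a fault: the adaptive thresholds are no longer updated from (possibly faulty) data, so the recovery of the process can still be recognized.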
The proposed method is applied to two applications: the Tennessee Eastman Process and the gas turbine exit temperature spread.

Tennessee Eastman Process
The Tennessee Eastman Process (TEP) was developed at the Eastman Chemical Company as a simulation of a real industrial process for evaluating process control and monitoring methods. This simulator is widely used to test and compare various process control and monitoring tasks. The process has five main units: a reactor, a product condenser, a vapor-liquid separator, a recycle compressor, and a product stripper. The process has 41 measured variables and 12 manipulated variables; details of these variables are reported in ref [33]. The data contain 20 faults, as shown in Table 1; fifteen of them are known, while five are unknown. The first seven faults are related to step changes in process variables, faults 8 to 12 to increased variability of some process variables, fault 13 is a slow drift in the reaction kinetics, and faults 14 and 15 are associated with sticking valves. Fig 5 shows the TEP diagram [13].
The proposed model is implemented on the data employed by Russell et al. [34]. These data have 53 variables and are divided into training and testing sections with 500 and 960 observations, respectively. All observations in the training section are under control, and there are 20 faults among the testing observations. The interval between two observations is three minutes, and in each faulty test data set the fault is introduced after 8 simulation hours. Twenty-one testing data sets were used for this study; the first is fault-free, while the remaining data sets contain the faults listed in Table 1. Based on the AIC index and the KPSS test, the simulated data exhibit autocorrelation and non-stationarity. Fig 6 illustrates a part of these data.
Initially, the number of time lags is calculated for each variable and, as mentioned earlier, an expanded matrix is created. Then, the PCA model is applied to the data, and the number of PCs covering 70% of the total variance is determined. The initial and adaptive thresholds of the proposed method are calculated at the 99% confidence level, and the fixed thresholds are defined experimentally. The window length is set to 40 based on the minimum FAR recorded for the fault-free test data set. Finally, the proposed method is applied to the observations. A good fault detection technique should have three characteristics: (1) it is robust to the training data set; (2) it is sensitive to all feasible faults in the process; (3) it reacts quickly in detecting faults. Robustness is measured by computing the false alarm rate (FAR) on the fault-free testing data set, sensitivity by the missed detection rate (MDR) on the faulty testing data sets, and promptness by the detection time delay (DTD).

According to the results of Table 2 and Fig 7, the proposed method has successfully minimized the FAR index for both the Q and T² charts in both the training and test sets. These results indicate that, compared to the other methods listed in Table 2, the proposed method has high potential for reducing FAR and is robust regardless of the training data set. It should be noted that Rato et al. [11] did not report the FAR index for the MWPCA and RPCA implementations on the TEP data. Table 3 shows the MDR results for PCA, DPCA, MWPCA, RPCA, KPCA, KDPCA, and the proposed method. A comparison of the results shows that the proposed method significantly reduces the MDR index for some faults, in both T² and Q or in one of them. To evaluate the overall performance of the proposed method, the mean MDR (MMDR) index is calculated.
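The three indicators can be computed directly from binary fault-indicator sequences. A small sketch with our own function names, following the definitions above (FAR on fault-free data; MDR and DTD on faulty data, with the sequence starting at the fault onset):

```python
import numpy as np

def far(flags_fault_free):
    """False alarm rate: fraction of fault-free samples flagged as faulty."""
    return float(np.mean(flags_fault_free))

def mdr(flags_faulty):
    """Missed detection rate: fraction of faulty samples that were not flagged."""
    return float(1.0 - np.mean(flags_faulty))

def dtd(flags_faulty):
    """Detection time delay: number of samples from the fault onset (start of
    the sequence) to the first alarm; None if the fault is never detected."""
    alarms = np.flatnonzero(flags_faulty)
    return int(alarms[0]) if alarms.size else None

flags = [0, 0, 0, 1, 1, 1, 0, 1]   # fault indicator; the fault is present from sample 0
print(far([0, 0, 1, 0]), mdr(flags), dtd(flags))
```

Averaging `mdr` over all faulty data sets gives the MMDR index used in Table 3.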
The results show that the proposed method has better overall performance in both the T² and Q charts than the other methods and is more sensitive to possible faults. Table 4 presents the detection time delay of the proposed method and compares it with the DTD index of the PCA, DPCA, kernel PCA, and kernel dynamic PCA methods. The numbers in the table indicate the number of observations from the moment the fault occurred to the moment it was detected.
According to the results in Table 4, the bold numbers indicate a better performance of the proposed method on the DTD index, for the Q chart, the T² chart, or both, compared to the other methods. As shown in Table 4, the proposed method is more agile for faults 4, 5, 6, 7, 11, 12, and 14. An empty cell means that the DTD was not reported; in particular, Rato et al. [11] did not report the DTD index in their research. Figs 8-16 present the Q and T² control charts of the proposed method with their adaptive thresholds. In each figure, the first chart from the left is the Q control chart, while the second and third charts show the T² monitoring and the fault indicator, respectively. In the Q and T² control charts, the red lines represent the adaptive thresholds Q_α and T²_α, and the blue dots represent the calculated values of Q and T² for each sample. As long as the blue dots in both charts are below the thresholds, the process is under control. The third chart illustrates the fault indicator, a binary variable: a value of 0 means that the process is under control, while a value of 1 indicates that a fault has been detected.

MWDPCA applied to turbine exhaust temperature spread
Gas turbines (GTs) are designed for many different purposes. In industry, they are commonly used to drive compressors that transport gas through pipelines and generators that produce electrical power [36]. In the past, GT use was generally limited to generating electricity in periods of peak demand; nowadays, however, they are also used in combined-cycle power plants for base-load production [37]. Consequently, their availability and reliability play a significant role. The development of gas turbines in recent years has been facilitated most considerably by three factors:
• metallurgical developments that allow high temperatures in the combustor and turbine components;
• increased underlying knowledge of aerodynamics and thermodynamics;
• the design and simulation of turbine airfoils, as well as combustor and turbine-blade cooling configurations, by computer software.
After the air is compressed in the compressor, fuel is injected into it and combustion increases the temperature of the gas. The turbine inlet temperature (TIT) is defined as the average temperature of the flue gas facing the first stage of turbine blades. During the expansion of the flue gas in the turbine, the pressure and temperature drop, and the flue gas leaves the turbine at the turbine exhaust temperature (TET).
The efficiency and specific power of a GT would improve if the TIT could be increased; nevertheless, designing and manufacturing turbines that withstand a higher TIT is a technological limit. Since the TIT is too hot to be measured directly, it is usually calculated from the measured TET. Due to the rotation and turbulence of the flue gas stream, both the TET and the TIT have a profile over their cross-section. In V94.2 GTs, the TET is measured by six temperature transmitters to obtain a precise profile. GT manufacturers use various methods to calculate the TIT from the measured TET, and by monitoring the TET the operator tries to keep the GT under protected conditions. Throughout this paper, TET data from an Iranian gas turbine company are used to illustrate the behavior of the proposed method. As observed in Fig 17, the data consist of measurements from six sensors, each representing the TET of a V94.2 gas turbine, recorded over approximately 1500 minutes at a sampling interval of 1 minute. Statistical tests were applied to characterize the data: they are non-stationary according to the KPSS test and autocorrelated according to the AIC index. In this subsection, the MWDPCA model was applied to the TET data for controlling the behavior of the GT and for early fault detection. Preprocessing is normally performed in various fields, and its type depends on the process; in the case of the TET data, no special preprocessing was necessary other than standardizing the data. The first 200 observations were used as the training data set; note that these data were collected when all parameters were under control. There are two kinds of fault in the testing data: (i) a step change, i.e., the occurrence of a sudden fault; and (ii) a slow ramp. Once the proposed method was applied to the TET data, the number of lags was selected for each variable, the expanded matrix was constructed according to the lags, and PCA was implemented on the data.
Two components were retained in accordance with the CPV criterion; these two principal components explain 70.0% of the total variance in the data. The window size was set to 80. The initial and adaptive thresholds were determined at the 99% confidence level, and the fixed thresholds were selected experimentally. Table 5 indicates the DTD and MDR for faults 1 and 2.

Conclusion
Although the conventional PCA model is a suitable approach for process monitoring, it does not perform well for data with autocorrelation and non-stationary features. As expected, DPCA is an extended PCA model that can decrease the effect of autocorrelation, but like conventional PCA it has fixed thresholds, which degrade evaluation indicators such as the false alarm rate, detection time delay, and missed detection rate. Also, adaptive PCA models such as RPCA and MWPCA can cope with some kinds of non-stationary data, but because they disregard the effect of autocorrelation in the data, the evaluation indicators do not show good performance, which hinders the operator's decision-making and may cause undesirable consequences. In this study, we proposed an improved PCA method, a combination of the properties of DPCA and MWPCA, which can resolve non-stationary and autocorrelation features while allowing the time lag of each variable to differ. The approach uses the simple structure of DPCA to overcome the autocorrelation feature: first, the lag of each variable is calculated and added to the data matrix. Then, using the MWPCA property of adaptive thresholds and a model updated for each observation, the effect of non-stationarity is reduced. The performance of the models was assessed by the reduction in criteria such as the false alarm rate (FAR), missed detection rate (MDR), and detection time delay (DTD). The proposed method was implemented on the TEP data and on turbine exhaust temperature as real data. The results on both the simulated and real data show that the proposed method has zero FAR on both the training and test data sets, indicating a good performance in reducing false alarms compared with the other methods. The proposed method was also more accurate in the MDR and DTD indicators than the other methods, regardless of the type of fault.
In other words, this method performed better in the MDR and DTD indices for 60% of all faults than the other methods. As a result, the proposed method improves process-monitoring performance and helps operators make better decisions. However, the approach needs further improvement to achieve fully satisfactory monitoring performance; hence, the use of adaptive thresholds for nonlinear PCA methods is suggested as future research.