Correlation-based common spatial pattern (CCSP): A novel extension of CSP for classification of motor imagery signal

Common spatial pattern (CSP) is shown to be an effective pre-processing algorithm in order to discriminate different classes of motor-based EEG signals by obtaining suitable spatial filters. The performance of these filters can be improved by regularized CSP, in which available prior information is added in terms of regularization terms into the objective function of conventional CSP. Variety of prior information can be used in this way. In this paper, we used time correlation between different classes of EEG signal as the prior information, which is clarified similarity between different classes of signal for regularizing CSP. Furthermore, the proposed objective function can be easily extended to more than two-class problems. We used three different standard datasets to evaluate the performance of the proposed method. Correlation-based CSP (CCSP) outperformed original CSP as well as the existing regularized CSP, Principle Component Cnalysis (PCA) and Fisher Discriminate Analysis (FDA) in both two-class and multi-class scenarios. The simulation results showed that the proposed method outperformed conventional CSP by 6.9% in 2-class and 2.23% in multi-class problem in term of mean classification accuracy.


Introduction
One of the most popular topics in computer science in the last decade has been research on Brain-Computer Interface (BCI) systems. BCI systems consist of sensors and a signal processing unit that can convert brain activities into logical commands for an external device or computer application. The basis of these systems is decoding brain signals using feature extraction and pattern recognition algorithms [1,2].
Motor imagery (MI) is a dynamic state in the brain that a subject imagines a body movement such as hand or tongue movements without any external movement or muscle activation [3]. This imagination task leads to neural activities in the primary sensorimotor cortex so similar to real movement which has different pattern depends on the type of imagination [4]. MI based-BCI systems have designed to detect different MI signals that can be helpful for people with motor disabilities or mental injuries to do some activities such as moving an artificial arm, spelling words or controlling a computer cursor [1,5]. One of the useful and powerful pre-processing methods for feature extraction in BCI is Common Spatial Pattern (CSP). CSP approach was used for the first time in detection of EEG abnormality [6]. Then, CSP filters applied to the multi-channel EEG signal for the classification of left and right hand MI signals [7]. Initially, CSP algorithm finds spatial filters for two classes of data. Through applying spatial filters to the data, the variance is maximized for one class of data while it is minimized for the other one. Therefore, two classes have more distinctive feature to be used in the classification stage.
Despite the mentioned advantages for the CSP method, this approach has such shortcomings as noise sensitivity, overfitting and two-class issue, which has led quite a few studies to propose extensions for it [5,[8][9][10].
CSP approach uses L2 norm in its optimization problem; as a consequence, it has noise sensitivity problem. In other words, outlier data and artifacts maybe have exaggerated effect as they mostly have large amplitude in brain signals. Therefore, L1 norm has been proposed based on CSP for solving this problem in [11]. This work used L1-norm method in the objective function instead of L2 norm to find eigenvectors. In the other work, Sparse CSP-L1 (sp-CSPL1) used L1 norm technique to solve optimization problem of CSP in two steps. The first one finds the robust spatial filters, and the other one penalizes the objective function by adding a penalty term to induce sparsity [12]. Correntropy Induced Metric based CSP (CSP-CIM) has been proposed to obtain a robust algorithm against outliers in [13]. CIM is used instead of L2 norm to approximate the objective function of CSP in the different dynamic regions. Another way to reduce the effect of artifacts and outlier data in obtaining spatial filters has been suggested in [14]. In this approach artifacts of data are removed in the preprocessing stage; therefore covariance matrices are calculated more accurate, as result CSP filters improved. As we know, an EEG signal includes amplitude and phase information; however, the phase information entirely ignored in CSP filter calculation. To solve this problem, three different methods have been presented in [15]. In the first method, phase information of EEG signal added to CSP objective function. In the second method, the CSP filters obtained by adding both amplitude and phase information to the objective function. Finally, the third approach has been used to handle nonlinearity in complex data points by the non-linear counterpart of Principal Component Analysis (NLPCA).
Non-stationarity of EEG signal occur for several reasons such as alternation in cognitive state of subject arising from mental fatigue and etc. adaptive common spatial pattern (ACSP) has been used to modify CSP filters based on mental fatigue of subject. The Results show ACSP outperformed conventional CSP in the term of class separability [16]. Regularized CSP (RCSP) has been proposed for solving noise sensitivity and non-stationarity problem in [17]. In this paper, CSP filters have been optimized by prior information about the noise of data. The other way to regularize the standard CSP algorithm is adding prior information to the covariance matrix estimation. Four new RCSP algorithms, including two proposed regularization terms for optimizing the objective function, are suggested in [18]. As a result, their proposed Tikhonov Regularization CSP (TRCSP) and Weighted Tikhonov Regularized CSP (WTRCSP) performed better than the CSP and other ones. Both of these approaches obtain spatial filters by adding prior information to the objective function. CSP approach can only extract spatial information of the brain activities. Regularizing Multi-bands CSP (RMCSP) has been proposed that extracts data in spectral, temporal and spatial domains using Tikhonov regularization method in [19]. CSP algorithm originally proposed to recognize between two classes of data. CSP depends on frequency bands that it is different for each subject regarding to individual characteristic nevertheless these differences have been neglected. Also CSP has sample based covariance matrix. To overcome these problems, a feature extraction approach has been suggested based on filter bank method in [20].
Turning to the two-class problem, as CSP was originally proposed to discriminate between two classes of data, first time Pair wise approach for multi-class problems has been proposed in [21]. Furthermore, One-Versus-the-Rest (OVR) approach has been proposed to tackle with multi-class problems [22], which cannot be considered as a genuine extension of two-class CSP to more than two-class cases. In [23] a new classification method has been proposed based on fuzzy system for multi-class problem. The proposed method outperformed in comparison with existing classifiers such as Linear Discriminant Analysis (LDA) and Support Vector Machine (SVM).
Correlation is a statistical criterion to measure similarity between two different signals. As well as one of the useful methods for detection of task related activation in the brain is the correlation based methods [24]. According to this fact, Local Temporal Correlation Common Spatial Pattern (LTCCSP) has been proposed to obtain optimized spatial filters. Covariance matric estimation is a noise sensitive step in obtaining CSP filters which is improved by imposing local temporal correlation information into it [25]. In other work, a Correlation-based Channel Selection (CCS) method was introduced for MI-based BCI. In this algorithms the channels with high correlation coefficient selected to obtain spatial filters by using new RCSP method [26]. In all the correlation-based approaches correlation calculated in the same class and correlation between different classes is neglected. In the other hand MI signals are similar to each other especially right hand and left hand MI signals. So these matters led us to propose novel CSP method by adding correlation between different classes in the calculation of CSP filters. As mentioned, one of the methods to improve spatial filters is regularized CSP by adding a suitable penalty term to the objective function. Temporal correlation between EEG channels is a good criterion to measure the similarity of different channels (right hand and left-hand movement in particular). Due to this fact, regularized CSP can be done using temporal correlation between different classes of EEG signals as prior information. In this paper, a new regularized CSP (CCSP) based on temporal correlation is proposed to improve spatial filters in twoclass scenario and it is further generalized to multi-class problem. Imposing the temporal correlation as penalty term in solving objective function to obtain the spatial filters is shown to be more effective than the existing methods in terms of discriminating the class-data, leading to higher classification accuracy. In this work, our focus is on obtaining optimized spatial filters using the proposed method for MI signals and all the simulation steps are done using this kind of signals as inputs. It should be noted that the primary results of this paper have been published in a conference paper [27].
The rest of this paper is organized as follows: First, we will provide the basis of the existing CSP and RCSP methods are Section 2. Then the proposed regularized multi-class CSP method will be given in Section 3. Furthermore, the block diagram of the proposed method, classification, feature extraction, and data description is also presented in Section 3. In Section 4, the simulation result of the proposed method is given and compared to the existing methods. Finally, concluding remarks are presented in Section 5.

Background
In the following, the mathematical foundations of standard CSP as well as RCSP are given.

Common spatial patterns
As pointed out previously, CSP is one of the most popular approaches for feature extraction in BCI technology. CSP finds spatial filters such that the variance of the transformed data is maximized for one class while it is minimized for the other one [7]. Suppose X 1 and X 2 stand for the EEG signals for classes 1 and 2, respectively; where X i 2 R n�c , n and c represent time sample and number of EEG channels, correspondingly. The solution of the following objective function is the desired spatial filters.
w denotes the spatial filter, T is the transposition operator and C 1 and C 2 represent covariance matrices of X 1 and X 2 respectively. There are several methods for solving the maximization problem (1). Using the Lagrange multiplier method, the constrained problem (1) is converted to the following unconstrained problem: where λ is the Lagrange multiplier. Therefore, to calculate the spatial filter w, the derivative of L must be taken and set equal to zero with respect to w, that is: So, we have the standard eigenvalue problem. The solution to (1) are the eigenvectors of M ¼ C À 1 2 C 1 , corresponding to its largest and lowest eigenvalues.

Regularized CSP
RCSP has been proposed to overcome some shortcoming of CSP such as being sensitive to noise and artifacts, non-stationary and overfitting. In the following, one of most common ways of regularizing standard CSP is explained. For this purpose, some prior information from data is used as a regularization term in the objective function [18]. Therefore the objective function (1) is modified as follows: Where α denotes the regularization parameter (α�0) and Q(w) is the penalty term (prior information). Q(w) can be defined in the form of non-quadratic or quadratic function. Considering a quadratic function as penalty term, i.e., Q(w) = w T Kw, where K denotes the prior information from the signal; the solution of (4), following the same Lagrangian approach gives: Taking the derivative with respect to w and setting it equal to zero, we have: The eigenvectors of M 1 = (C 2 +αK) −1 C 1 corresponding to its largest eigenvalues are the desired spatial filters. Also, to calculate the optimized CSP filters, the objective function J P 2 ðwÞ should be solved in the same manner.
The obtained filters by solving maximization problem (7) are the eigenvectors corresponding the largest eigenvalues of M 2 = (C 1 +αK) −1 C 2 . Thus, the spatial filters are obtained by putting largest eigenvalue of M 1 and M 2 together. There are several penalty functions with the difference in K as prior information. The simple assumption is K = I which I represent identity matrix [18].

Proposed Correlation-based CSP (CCSP) method
As mentioned before, CSP algorithm discriminates two classes of data by maximizing/minimizing their projected variance, which is initially designed for two classes of data. In order to overcome the shortcomings of the standard CSP and obtain a feature extraction method which can be easily generalized to multi-class cases, we proposed to exploit the temporal correlation between signals as a measure of similarity between them and impose it as penalty term to obtain a regularization CSP. In the following, at first the proposed CCSP algorithm is formulated for two-class problem. Then, its generalization for multi-class problem is given in the subsequent section.

Two-class problem
In the previous section, the basis of the RCSP approach is presented. In this section, the formulation of the proposed method based on the temporal correlation as a penalty term for twoclass problem is given. Consider the penalty term, Q 1 (w), to be a quadratic function of w as: where Q 1 (w) denotes the penalty term in J P 1 ðwÞ and R d1 2 R c�c is the diagonal matrix in which the elements of the main diagonal are the correlation between X 1 and X 2 . X 1 and X 2 represent the average of X 1 and X 2 for all of the trials, respectively; i.e. (in order to reduce computational complexity, X 1 and X 2 are used instead of X 1 and X 2 ) where i and j represent the class and trial indices, respectively, and t is the total number of trials for each class. Assume R to be the correlation matrix between two different classes as: where x i 1 and x j 2 denote the i-th and j-th column of the matrices X 1 and X 2 , respectively, and corr (.) is the correlation operator. Having the correlation matrix R, R d1 is constructed as follows: where the diagonal entries, a i , are obtained as: and |.| denotes the absolute operator. Considering the penalty term for J P 2 ðwÞ, i.e., Q 2 (w), it is also defined in the similar fashion as: where R d2 2 R c�c is the diagonal matrix of the form: and b j are calculated as: The spatial filters are obtained by replacing the proposed penalty terms Q 1 (w) and Q 2 (w) in (4) and (7) and solving the resulting objective functions. In order to maximize J P 1 ðwÞ and J P 2 ðwÞ, Q 1 (w) and Q 2 (w) must be minimized. Therefore, the EEG channels of higher correlation will have a lower contribution in solution of the optimization problem, which leads to having more discriminative projected signals.
Two clear the computational complexity concept that is mentioned before, assume e and k show the trials number of X 1 and X 2 respectively. Therefore the computational order to calculate the correlation between two different classes is O(e×k×n×c) that n and c represent time sample and number of EEG channels for X i . By using X 1 and X 2 instead of X 1 and X 2 the computational order of correlation matric is reduced to O(n×c). It must be noticed however the computational complexity is decreased but it is still high in comparison to conventional CSP.

Multi-class problem
In the following, the proposed two-class CCSP will be extended to more than two class problems by defining a suitable objective function for the spatial filters of each class. Considering the OVR classification method, the spatial filters for one class are calculated against the others [8]. Likewise the two-class problem, the temporal correlations are used as penalty terms to the objective functions. That is, the temporal correlations between one class and the remaining classes are calculated. Then, sum of the calculated temporal correlations is used to impose these correlations to the corresponding objective function. Take for example a four-class problem. Using the above described approach we have the following objective functions for: In which Q 1 (w) and Q 2 (w) are calculated as follows: where R 12 d1 , R 13 d1 and R 14 d1 represent diagonal correlation matrices that are calculated using correlation between class 1 and 2, class 1 and 3 and class 1and 4, respectively; entries of which are calculated using (12). Likewise, R 12 d2 , R 13 d2 and R 14 d2 denote diagonal correlation matrices between classes 1 and 2, 1 and 3 and 1 and 4, respectively; entries of which are calculated using (15). The problem for class 1 against the other classes is calculated by replacing Q 1 (w) and Q 2 (w) in (16) and (17) and solve these functions. This procedure can be followed to obtain the spatial filters for the remaining classes. Pre-processing step in this block-diagram includes extracting different time duration from each trial of datasets and band-pass filtering. At first, time segments (it is explained in more details in the date description section) are extracted from each dataset. Then, each trial of data is band-pass filtered to make the mean of EEG signals equal to zero. For this purpose, each trial of EEG signals is passed through an 8-30 HZ band-pass filter (Butterworth filter of order 5). Then, CCSP method was applied to the filtered data to obtain the spatial filters. We used cross-validation to choose the best regularization parameter (α) from [10 −6 : 5×10 −5 : 10 −3 ]. The next step is feature extraction, which is done using the log-variances method. In the fourth step, the extracted features are fed to the classifier. It should be noted that the majority voting step is used only for multi-class problem. Finally, the results are achieved.

Classification and feature extraction
In this paper, log-variances method, a traditional and powerful method, was used to extract feature vector from the projected MI signals. As mentioned before CSP is a preprocessing method based on maximization the variance between different classes. But it is not a proper feature extraction method itself. So performing a method that uses variance as the feature can be helpful. By using log-variances method, we extract feature vectors with more class separability. Besides the data is projected into the normal distribution by using logarithm operator; therefore it can cause the ease of classification step [7,28]. Assuming a 2l×n dimensional matrix as the filtered signal using 3 pairs of spatial filters (l = 3), the feature vector for each trial is calculated as: where f ij denotes feature from i-th class and j-th filter and z ij represents the projected signal of the i-th class and j-th filter. Furthermore, log(.) and var(.) denote the logarithm and variance operators, respectively. There are various classification methods for classifying the MI signals which some of them are popular and more efficient than the others such as LDA, SVM and K-Nearest Neighbor (KNN). Also several extensions have been proposed for each of these classifiers to improve their performances.
LDA is a simple and effective classifier which is separated two or more classes of data by hyper-planes [29]. This classifier has been extensively used for classifying the MI signals that reduce the risk of misclassification but it may do not work properly when the testing data outnumbered training ones [30,31].
As mentioned above KNN is one of the most common and powerful methods for classification of EEG signals. In this method, each data devote to the class number which is most common amongst its K nearest neighbour by a distance function such as Euclidean distance [29,32]. Two parameters affect on the performance of KNN classifier: 1) value of K 2) distance function [32].
Another popular classifier for BCI applications is SVM which is a supervised learning method. This method separates two classes of data by finding the best hyper-plane which can create the largest distance from each class [33]. This method extended to classified multi-class data in [34]. Unlike LDA, SVM method represents a good performance in the cases that the number of features is high [33].
In this work, we have used SVM classifier for two and multi-class classification problems. It is an efficient and suitable method for classifying motor imagery signals. In two-class problem, feature vector is fed to the SVM classifier and the results are obtained but in multi-class problem after using classifier, output of the SVM classifiers are combined by the majority voting strategy and the final results are obtained as follows: where p OVR (ω|x) denotes the probability of x to be classified as class 1, 2, 3 or 4. It should be pointed out 5-fold cross-validation is used to choose the best hyper-plane in the classification step.

Data description
In this study, three publically available datasets are used for simulation and comparison between the performance of the proposed algorithm and the existing ones. Each subject performed imagery movement and EEG signals of each subject have been recorded by several electrodes. These datasets consist of 17 subjects in overall. The EEG signal of 12 subjects have been used to simulate multi-class problem (dataset IIIa, BCI competition III and dataset IIa, BCI competition IV), while all of them have been used for two-class problem. The details of each dataset are given below: Each dataset has particular settings; therefore, different time segment was extracted from each of the datasets for training and testing purposes [19]. In dataset IIIa, BCI competition III, time segments of duration 3.5-7s are extracted from each trial. This time range is 0-3.5s (whole time interval) for dataset IVa, BCI competition III. Finally, time duration of 2.5-6s was used for dataset IIa, BCI competition IV.

Simulation results
After applying spatial filters to the datasets and extracting features from them, they are classified by SVM. In order to test the performance of the CCSP algorithm and compare it with other cases as the benchmark (CSP [7], TRCSP [18], PCA [28] and FDA [37]). The TRCSP algorithm used K = I as prior information in penalty term Q(w). The PCA is a mathematical method used for dimension reduction and feature extraction. This technique finds principal components (PCs) to reduce the dimension of data in which the few first components have the largest variances in comparison to the others [28]. FDA is a supervised feature mapping method which is found linear projection so that between-class scatter is maximum; meanwhile within-class scatter is minimum [37]. We used three measures namely classification of accuracy, sensitivity and specificity. Figs 2-6 illustrate the comparison between the classification accuracy of all methods in two-class and four-class scenarios. According to these Figs and regarding the two-class problem, the proposed CCSP algorithm had better performance in 9 out of 17 subjects, while the accuracy results were the same for 4 subjects (k3, al, A03, A08) within all datasets. To explain further, CCSP had better classification accuracy in 9 subjects and same classification accuracy in 4 subjects compared to TRCSP and outperformed CSP in all cases. In addition, CCSP excelled PCA and FDA methods in all cases. The proposed CCSP algorithm had improved classification accuracy rate from 0.63-17.88% compared to the original CSP regardless of the equal cases. This best improvement of classification accuracy occurred in subject aw in the Dataset IVa, BCI competition III which had 90.72%, 88.74%, 72.84%, 64.23% and 52.98% for CCSP, TRCSP, CSP, PCA and FDA respectively. Turning to  the four-class problem, CCSP outperformed in 7 out of 12 cases, and it had equal performance in two of them (k3 and A04). In order to clarify more, CCSP had better performance in 8 subjects and had same classification in the rest of them in comparison to CSP. As well as, CCSP outperformed TRCSP in 9 cases but 3 of them. Moreover CCSP surpassed PCA and FDA. Using the proposed approach, the improvement was from 0.69% to 6.67% compared to CSP (regardless of the equal cases). In comparison to CSP, in four-class problem best improvement of classification accuracy belongs to subject k6 in Dataset IIIa, BCI competition III. This subject shows 72.50%, 65%, 65.83%, 25.83% and 49.16% for CCSP, TRCSP, CSP, PCA and FDA respectively in term of classification accuracy. This must be pointed out that the proposed algorithm never had worse performance than the original CSP in all of the subject and twoclass and multi-class scenarios. Nevertheless, CSP outperformed TRCSP in 5 subjects. Figs 7 and 8 show boxplots for two-class and multi-class problems, respectively which are shown superiority of CCSP in comparison to the others. In Figs 7 and 8 the distribution of classification accuracy for CCSP is negatively skewed so it shows more than 50% of subjects have classification accuracy more than mean of CCSP method. Maximum and minimum of all of the data are specified in these diagrams. Table 1 reports mean, median and standard deviation of classification accuracy for five methods in two-class and multi-class problems. Compared to the CSP, the proposed method shows about 6.9% and 2.23% improvement in term of mean accuracy in two-class and multi-class problems, respectively. Furthermore, CCSP approach showed about 0.58% and 1.49% improvement in term of mean accuracy in two-class and multi-class problem compared to TRCSP. As well as, in term of mean classification accuracy, the CCSP represents 24.87% and 33.89% improvement in comparison to PCA and 19.23% and 15.43% improvement in comparison to FDA in two-class and multi-class problem respectively. The standard deviation of the CCSP method is better than CSP and TRCSP methods in multiclass problem and has better value than CSP and close to the TRCSP in two-class problem. Table 2 shows sensitivity and specificity for all datasets in two-class problem. Kappa coefficient is statistical method for measuring degree of agreement between classes which is more robust criterion than classification accuracy by considering random agreement. This method assigns zero to the random classification and one to the perfect classification [38]. Then in four-class problem zero is equal to 25% and one is equal to 100% in term of classification accuracy. https://doi.org/10.1371/journal.pone.0248511.g008 Table 1. Mean, median and standard deviation (std.) for two and multi-class problem (Best values are in boldface).

Mean
Median Std.  Table 3 reports kappa coefficient for all datasets in multi-class problem. It shows CCSP outperform other methods in term of kappa coefficient. Best performance for each subject has been reported in boldface in Tables 2 and 3. Table 4 represents running time of five algorithms for all subjects. Running time is the period of all simulation stage including preprocessing, applying spatial filters, feature extraction and classification for testing sets. As it is seen, running time for the proposed CCSP algorithm is approximately as same as the four other approaches. Fig 9 illustrates the first spatial filter using 3 methods for subjects aw (recorded by 118 electrodes), k3 (recorded by 60 electrodes), A06 and A08 (recorded by 22 electrodes) in two-class problem. It can be observed that CSP filter had large electrode weights spread over the unexpected area while CCSP filters provided more emphasis on the electrodes which is located in the motor cortex area [39]. It should be mentioned that the simulation is performed on a computer with 8 gigabytes of RAM and core i7 Intel CPU. Also MATLAB is used as a software in the simulation stage to obtain the results.

Conclusion
Regularized CSP is used to overcome the shortcomings such as overfitting and noise sensitivity in conventional CSP, by imposing appropriate prior information to its objective function. Several regularization terms are suggested for this purpose. Using prior information in terms of time correlation of the underlying signals can be useful in both two-class and multi-class problems. In this article, we introduced a novel term of prior information to penalize solution of the original CSP, named temporal correlation, which has the advantage of easy extension to multi-class problems. In order to evaluate the performance of the proposed CCSP method, we used 17 subjects in two-class problem and 12 subjects in multi-class problem. We further compared its simulation results with the classical CSP, TRCSP, PCA and FDA. In two-class problem, the results show that our approach outperforms CSP, TRCSP, PCA and FDA by 6.9%, 0.58%, 24.87% and 19.23% in the mean of classification accuracy, respectively. Likewise, in multi-class scenario, the proposed CCSP achieves an improvement of 2.23%, 1.49%, 33.89% and 15.43% in classification accuracy mean in comparison to the CSP, TRCSP, PCA and FDA methods. Also CCSP method provided neurophysiological plausibility spatial filters. We can suggest using the proposed algorithm as a processing unit in the MI-based BCI systems to have better recognizing between the MI signals. It should be noticed we did not investigate the performance of proposed method in datasets with small training data. As well as when EEG has been recorded with few numbers of electrodes our algorithms need to more investigation. As well as the proposed method needs more calculation to obtain optimized spatial filters therefore in the systems with weak hardware it maybe takes time to operate.