Z-Score Linear Discriminant Analysis for EEG Based Brain-Computer Interfaces

Linear discriminant analysis (LDA) is one of the most popular classification algorithms for brain-computer interfaces (BCI). LDA assumes that the data are Gaussian distributed with equal covariance matrices for the classes concerned; however, this assumption usually does not hold in actual BCI applications, where heteroscedastic class distributions are commonly observed. This paper proposes an enhanced version of LDA, namely z-score linear discriminant analysis (Z-LDA), which introduces a new decision boundary definition strategy to handle heteroscedastic class distributions. Z-LDA defines the decision boundary through a z-score that uses both the mean and the standard deviation of the projected data, and can therefore adaptively adjust the decision boundary to fit heteroscedastic distributions. Results from both a simulation dataset and two actual BCI datasets consistently show that Z-LDA achieves significantly higher average classification accuracy than conventional LDA, indicating the superiority of the proposed decision boundary definition strategy.


Introduction
Brain-computer interfaces (BCI) provide a direct communication channel between the brain and the external world without any peripheral muscular activity [1]. A BCI translates brain activity into signals that control external devices, and many augmentative communication and control systems have been built on BCIs [2][3][4][5][6], improving the lives of people with severe neuromuscular disorders.
Generally, an EEG-based BCI consists of four modules [1]: 1) a signal acquisition module to record and amplify EEG signals; 2) a feature extraction module to extract signal features that encode the user's intent; 3) a translation module to translate features into device commands; 4) a feedback and control module to synchronize the user's actions and achieve control of external devices. A high-performance EEG amplifier with a suitable reference strategy [7] will increase the quality of the recorded EEG signal, and employing innovative paradigms in the feedback and control module may yield higher-quality features and better control strategies [8][9][10][11]. Once the EEG amplifier, the reference strategy, and the feedback and control module are determined, the feature extraction and translation algorithms play the key roles in improving BCI performance. Currently, the conventional features used in scalp-EEG-based BCIs can be attributed to event-related potentials, the sensorimotor rhythm, transient visual potentials and steady-state potentials (both visual and auditory). To refine those specific features, many feature extraction algorithms have been proposed [12][13][14][15]. However, as an input-output system, the final translation module directly determines whether the subject's intention is correctly decoded [16]. Compared with conventional pattern recognition problems, a BCI system requires the translation module to handle small-sample-size training, heteroscedastic class distributions and nonstationary physiological signals, among other difficulties. Therefore, effective translation algorithms specifically suited to BCI applications are still required in the BCI discipline [17][18].
Linear discriminant analysis (LDA) is one of the most popular classification algorithms for BCI applications, and has been successfully used in a great number of BCI systems such as motor imagery based BCIs [19], the P300 speller [20] and steady-state movement-related potential based BCIs [21]. The original LDA has two derivations [22], Fisher LDA (FLDA) and least-squares LDA (LSLDA). FLDA is based on Fisher-Rao's criterion [22][23][24], which is to find the projection w maximizing the objective function J(w) = (w^T S_b w) / (w^T S_w w), the ratio of the between-class to the within-class variance. LSLDA is derived from a linear discriminant function y(x) = w^T x, where the weight vector w is chosen to minimize the mean squared error between w^T x and the target labels [25]. With a proper label coding scheme adopted in LSLDA, the solution of LSLDA is equivalent to that of FLDA [25].
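For concreteness, the Fisher-Rao objective can be evaluated as follows (a minimal NumPy sketch; the function and variable names are ours, not from the cited works):

```python
import numpy as np

def fisher_ratio(w, X1, X2):
    """Fisher-Rao objective J(w) = (w^T S_b w) / (w^T S_w w) for a
    candidate projection w and two classes of row-vector samples."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    d = (m1 - m2)[:, None]
    S_b = d @ d.T                                    # between-class scatter
    # within-class scatter: summed (biased) covariances times sample counts
    S_w = np.cov(X1.T, bias=True) * len(X1) + np.cov(X2.T, bias=True) * len(X2)
    return float(w @ S_b @ w) / float(w @ S_w @ w)
```

A projection aligned with the direction separating the class means yields a much larger J(w) than an orthogonal one, which is exactly what FLDA maximizes.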
Both kinds of LDA rest on the homoscedasticity assumption that the different classes follow Gaussian distributions with the same covariance matrix. However, EEG data recorded from actual BCI systems usually have heteroscedastic class distributions, which violates this fundamental assumption and notably degrades recognition performance. Heteroscedastic LDA (HLDA) is an extension of FLDA whose between-class scatter is generalized from the Euclidean distance to the Chernoff distance, taking into account both the class means and their covariance matrices [26]; HLDA therefore does not need the homoscedasticity assumption. Nonparametric discriminant analysis (NDA) is another extension of FLDA [27], which makes no prior assumption about the class distributions; its parameters can be estimated by the k-nearest-neighbor method and are then used to define the between-class scatter [28].
In essence, LDA linearly transforms data from a high dimensional space to a low dimensional space, and the decision is finally made in the low dimensional space; thus the definition of the decision boundary plays an important role in recognition performance.
Conventional LDA defines the mean of the projected data as the decision boundary because of the homoscedasticity assumption [25]. The nearest neighbor of the classes has also been proposed to serve as the decision boundary [29]. Unlike LDA, the support vector machine (SVM) first maps the data to a high dimensional space and then finds a hyperplane in that space such that the distance from the hyperplane to the nearest data point on each side is maximized [30]; theoretically the hyperplane is determined only by a small subset of the training data, called the support vectors. During the classification procedure of LDA, heteroscedastic class distributions are preserved in the projected space. Therefore, we argue that if the mean and variance of the projected data are both considered in the definition of the decision boundary, LDA can be extended to deal with practical heteroscedastic data; this is the starting point of the Z-LDA proposed in this paper.
The paper is organized as follows. Section Methods and Materials provides a detailed description of z-score LDA (Z-LDA); the results on a simulation dataset and on motor imagery EEG datasets are shown in section Results; section Discussion gives a general discussion of the proposed algorithm; section Conclusion summarizes this work.

Linear Discriminant Analysis
To simplify the description of the algorithm, we only consider the two-class case. Assume (x_11, x_12, ..., x_1m) ∈ C_1 and (x_21, x_22, ..., x_2n) ∈ C_2, with m and n being the numbers of samples in the two class sets C_1 and C_2, and let X = (x_11, x_12, ..., x_1m, x_21, x_22, ..., x_2n). The simplest representation of a linear discriminant function is obtained by taking a linear function of the input vector,

y(x) = w^T x + w_0, (1)

where w is called the weight vector and w_0 is a bias. Using vector notation, equation (1) can be converted to

y(x) = W̃^T x̃, (2)

where W̃ = (w_0, w^T)^T and x̃ = (1, x^T)^T is the corresponding augmented input vector with a dummy input x_0 = 1. Accordingly, with X̃ the augmented data matrix and t the vector of class labels, the least-squares solution of equation (2) is [25]

W̃ = (X̃^T X̃)^(-1) X̃^T t. (3)

With W̃ estimated from equation (3), the corresponding weight sum y(x) can be obtained. For conventional LDA, classification of an input x is based on comparing y(x) with a threshold, i.e., the decision boundary. If we denote by c_1 the label of class C_1 and by c_2 the label of class C_2, the corresponding decision boundary is defined by c = (c_1 + c_2)/2.
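The least-squares solution of equation (3) can be sketched as follows (an illustrative implementation with our own naming; `numpy.linalg.lstsq` returns the same minimum-norm least-squares solution as the pseudoinverse form):

```python
import numpy as np

def fit_lslda(X, t):
    """Least-squares LDA: append the dummy input x0 = 1 and solve for the
    augmented weight vector of equation (3) via least squares.
    X: (N, d) array of samples; t: (N,) array of class labels."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # dummy input x0 = 1
    W, *_ = np.linalg.lstsq(X_aug, t, rcond=None)
    return W                                           # W[0] = w0, W[1:] = w

def weight_sum(W, x):
    """The weight sum y(x) = w0 + w^T x of equation (1)."""
    return W[0] + W[1:] @ np.asarray(x, dtype=float)
```

With labels coded as -1 and 1, conventional LDA classifies a new sample by the sign of y(x) relative to the boundary c = (c_1 + c_2)/2 = 0.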

Z-score Linear Discriminant Analysis
Theoretically, the decision boundary of LDA is derived under the homoscedasticity assumption for the two classes; it may therefore not be competitive for heteroscedastic distributions, and we develop the following strategy to define a more robust decision boundary. Based on the estimate W̃ obtained from equation (3), the weight sum y(x) of each training sample can be calculated from equation (2), and the parameters of the Gaussian distributions of the weight sum y(x) for the two classes can be estimated as

m_k = mean{ y(x) : x ∈ C_k },  s_k = SD{ y(x) : x ∈ C_k },  k = 1, 2, (4)

where m_k and s_k (k = 1, 2) are the mean and standard deviation (SD) of the weight sum y(x) over training set C_k. During classification, when a new sample x* is input, first calculate the weight sum y(x*) from equation (2), then perform the normalization

z_k = (y(x*) - m_k) / s_k,  k = 1, 2. (5)

In essence, z_1 and z_2 are z-scores measuring how close the weight sum y(x*) of the newly input sample is to the two weight-sum distributions defined by the training set; the method is therefore called z-score linear discriminant analysis (Z-LDA). Finally, if |z_1| < |z_2| the sample is classified into C_1; otherwise it belongs to C_2. If the weight sums of the samples in the two classes follow Gaussian distributions with parameters m_k, s_k (k = 1, 2), the proposed decision boundary is the intersection of the two Gaussian distribution curves. The above description is based on LSLDA; since the only difference between LSLDA and FLDA is the way the weight vector w is estimated, and their solutions are substantially equivalent, the proposed decision boundary definition strategy can be extended to FLDA as well.
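The estimation and classification rule just described can be sketched as follows, assuming the weight sums y(x) have already been computed (variable names are ours):

```python
import numpy as np

def zlda_fit(y_train, labels):
    """Estimate the per-class mean m_k and SD s_k of the training
    weight sums y(x)."""
    return {c: (y_train[labels == c].mean(), y_train[labels == c].std(ddof=1))
            for c in np.unique(labels)}

def zlda_predict(y_new, params):
    """Assign the class with the smaller z-score magnitude
    |z_k| = |(y_new - m_k) / s_k|."""
    return min(params, key=lambda c: abs((y_new - params[c][0]) / params[c][1]))
```

Note that under heteroscedastic distributions a sample can be closer to one class mean in raw distance yet be assigned to the other class, because the wider class's SD shrinks its z-score.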

Relationship between LDA and Z-LDA
Theoretically, the decision boundary of conventional LDA is defined by

c = (c_1 + c_2)/2, (6)

i.e., the mean of the labels of the two classes. When the SDs are brought into the classification, the decision boundary of Z-LDA is instead defined as the point between the class means whose z-score magnitudes with respect to the two classes are equal,

(c* - m_1)/s_1 = (m_2 - c*)/s_2, (7)

which yields

c* = (s_2 m_1 + s_1 m_2)/(s_1 + s_2). (8)

Apparently, the decision boundary c* of Z-LDA is defined by both the means and the SDs of the weight sums of the two classes.
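The boundary of equation (8) can be checked numerically (a small stand-alone illustration with made-up means and SDs):

```python
def zlda_boundary(m1, s1, m2, s2):
    """Z-LDA decision boundary of equation (8): the point between the two
    class means where the z-score magnitudes coincide,
    (c* - m1)/s1 = (m2 - c*)/s2  =>  c* = (s2*m1 + s1*m2)/(s1 + s2)."""
    return (s2 * m1 + s1 * m2) / (s1 + s2)
```

With equal SDs the boundary is the midpoint of the means; with unequal SDs it shifts toward the class with the smaller SD.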
In the binary classification, the expectations of the means of the weight sum y(x) on the training set are E(m_1) = c_1 and E(m_2) = c_2, so the decision boundary of conventional LDA is theoretically equal to c = (m_1 + m_2)/2. When the weight sums of the two classes have equal SDs, the decision boundary of Z-LDA also reduces to c* = (m_1 + m_2)/2. Therefore, conventional LDA is a particular case of Z-LDA.

Simulation dataset. To compare LDA and Z-LDA under heteroscedastic class distributions, we kept the SD of the first class fixed, changed the SD of the second class step by step, and then performed the comparison on the resulting datasets. For each setting, a training set and a test set were generated, each consisting of 200 two-dimensional samples (100 per class).
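The comparison just described can be sketched end-to-end in a simplified, hypothetical form that works directly with one-dimensional projected weight sums rather than the full two-dimensional pipeline (class means of -1 and 1 are assumed, matching the label coding):

```python
import numpy as np

def compare_boundaries(sd2, n=100, sd1=0.3, seed=0):
    """One simulated run on projected weight sums: class 1 ~ N(-1, sd1),
    class 2 ~ N(1, sd2). Returns (LDA accuracy, Z-LDA accuracy) on an
    independently drawn test set of the same size."""
    rng = np.random.default_rng(seed)
    draw = lambda: (rng.normal(-1.0, sd1, n), rng.normal(1.0, sd2, n))
    tr1, tr2 = draw()                        # training weight sums
    te1, te2 = draw()                        # test weight sums
    m1, s1 = tr1.mean(), tr1.std(ddof=1)
    m2, s2 = tr2.mean(), tr2.std(ddof=1)
    c_lda = 0.0                              # midpoint of the labels -1 and 1
    c_z = (s2 * m1 + s1 * m2) / (s1 + s2)    # equation (8)
    acc = lambda c: 0.5 * ((te1 < c).mean() + (te2 >= c).mean())
    return acc(c_lda), acc(c_z)
```

With sd2 = sd1 the two boundaries coincide in expectation; as sd2 grows, c_z moves toward the tighter class and Z-LDA's test accuracy exceeds LDA's, mirroring the trend reported below.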

Classification performance of LDA and Z-LDA. After the simulated datasets with heteroscedastic class distributions were generated, the models of LDA and Z-LDA were estimated from the training set and then applied to classify the samples in the test set. The procedure was repeated 100 times to reduce random effects, and a paired t-test was performed to investigate whether a statistical difference exists between the two classifiers. Table 1 lists the mean and standard deviation of the classification accuracies over the 100 runs. Figure 1 illustrates the decision boundary definition procedure for the two classifiers when the standard deviation of the first class is (0.3, 0.3) and that of the second class is (1.0, 1.0). Figure 2 shows the difference in recognition performance on the test dataset for the two decision boundaries in Figure 1. When the SDs of the two classes are the same, LDA and Z-LDA achieve equal classification accuracy. But when we changed the SD of the second class while keeping that of the first class fixed, although the classification accuracies of both classifiers decreased, Z-LDA achieved higher accuracy than LDA. The paired t-test revealed that whenever a difference in SD between the two classes exists, Z-LDA achieves significantly higher classification accuracy than LDA (p < 0.05), with more pronounced improvement for the simulations with larger differences in SD.

Dataset recorded by our BCI system. Participants were asked to read and sign an informed consent form before participating in the study, and after the experiment all participants received monetary compensation for their time and effort. Subjects sat in a comfortable armchair in front of a computer screen and were asked to perform motor imagery with the left hand or the right hand according to the instructions that appeared on the screen. Each motor imagery period lasted 5 seconds and was followed by a 5-second rest.
Fifteen Ag/AgCl electrodes covering the sensorimotor area were used for EEG recording with a Symtop amplifier (Symtop Instrument, Beijing, China); the signals were sampled at 1000 Hz and band-pass filtered between 0.5 Hz and 45 Hz. Four runs were recorded on the same day for each subject; each run consisted of 50 trials, 25 per class, with a 3-minute break between two consecutive runs. The first 2 runs were treated as the training set and the last 2 runs as the test set.

Evaluations on Real BCI Dataset
2.2. Preprocessing. For the first dataset, we used the EEG segments recorded from 0.5 s to 3.75 s after the visual cue for the following analysis, according to [32]. For the second dataset, all EEG segments during motor imagery were selected for analysis, and trials whose absolute amplitude exceeded a 300 μV threshold were considered contaminated with strong ocular artifacts and removed from the analysis. Next, the optimal frequency band for each subject was obtained by the r² method [33] and used to design the band-pass filter for the selected EEG segments.
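The band-pass step might look as follows; the paper does not specify its filter design, so this crude FFT-based band-pass is only an assumed stand-in for illustration:

```python
import numpy as np

def fft_bandpass(eeg, low, high, fs=1000.0):
    """Crude FFT band-pass: zero all frequency bins outside [low, high] Hz.
    eeg is a (channels, samples) array sampled at fs Hz."""
    n = eeg.shape[-1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spec = np.fft.rfft(eeg, axis=-1)
    spec[..., (freqs < low) | (freqs > high)] = 0.0   # kill out-of-band bins
    return np.fft.irfft(spec, n=n, axis=-1)
```

In practice a causal IIR or FIR design would be preferred for online use; the FFT version is only a compact offline sketch.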
2.3. Feature extraction. Common spatial pattern (CSP) analysis was used to estimate the spatial projection matrix, which projects the EEG signal from the original sensor space to a surrogate sensor space [13,19]. Each row vector of the projection matrix is a spatial filter that maximizes the variance of the spatially filtered signal under one task while minimizing it under the other task. The 3 most discriminative pairs of spatial filters in the projection matrix were selected to transform the band-pass filtered EEG signal, and the logarithms of the variances of the transformed surrogate-channel EEG signals served as the final features for task recognition. Each EEG segment was thus transformed into a 6-dimensional feature vector.
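A compact sketch of CSP and the log-variance features (a standard whitening-based formulation; the normalization of the variances before the logarithm is a common convention and an assumption here, as the paper only states that the logarithm of the variance is used):

```python
import numpy as np

def csp_filters(trials1, trials2, n_pairs=3):
    """Common spatial patterns for two classes. trials1/trials2 are lists
    of (channels, samples) arrays; returns a (2*n_pairs, channels) matrix
    whose rows are spatial filters taken from both ends of the eigenvalue
    spectrum (maximal variance ratio for one class or the other)."""
    C1 = np.mean([np.cov(t) for t in trials1], axis=0)
    C2 = np.mean([np.cov(t) for t in trials2], axis=0)
    evals, evecs = np.linalg.eigh(C1 + C2)
    P = evecs @ np.diag(evals ** -0.5) @ evecs.T   # whitening transform
    d, B = np.linalg.eigh(P @ C1 @ P.T)            # eigenvalues ascending
    W = B.T @ P                                    # rows = spatial filters
    idx = np.concatenate([np.arange(n_pairs), np.arange(-n_pairs, 0)])
    return W[idx]

def csp_features(trial, W):
    """Normalized log-variance features of the spatially filtered trial."""
    var = (W @ trial).var(axis=1)
    return np.log(var / var.sum())
```

With n_pairs = 3 this yields the 6-dimensional feature vector mentioned above.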
2.4. Classification results. In this section, we compare the classification performance of Z-LDA with LDA, SVM, NDA and HLDA. LIBSVM with default parameters served as the SVM classifier [34]. NDA as in reference [28], with k = 5 nearest neighbors, and HLDA as in reference [26] were used for evaluation in the current work.
The classification results on Dataset IVa of BCI Competition III are summarized in Table 2. Z-LDA and NDA achieved higher average accuracy than LDA, SVM and HLDA. Although the average accuracy of NDA is slightly higher than that of Z-LDA, Z-LDA performed better for 4 of the 5 subjects, with the exception of subject ay, and a paired t-test showed no statistical difference between them (p = 0.4146). There are only 28 training samples for subject ay, a small-sample-size training problem; NDA is good at dealing with such problems, resulting in the obvious improvement for subject ay. Across the 5 subjects, with LDA as the baseline, only Z-LDA showed consistent improvement for all 5 subjects, and the paired t-test also revealed that only the accuracies obtained by Z-LDA are significantly higher than those of LDA (p = 0.0293).
The classification results on the dataset recorded by our BCI system are summarized in Table 3. Z-LDA achieved the highest mean accuracy among the 5 tested classifiers. Paired t-tests also showed that the accuracies obtained by Z-LDA are significantly higher than those of LDA (p = 0.0004), NDA (p = 0.0006) and HLDA (p < 10⁻⁵), with no statistical difference between Z-LDA and SVM (p = 0.0654).
The overall mean accuracies obtained by Z-LDA are 1.4% higher than those of LDA on both BCI datasets. As shown in Tables 2 and 3, the accuracies obtained by Z-LDA are consistently better than (or at least equal to) those of LDA for all subjects.
Subject 1 from the dataset recorded by our BCI system is used as an example to briefly reveal why the classification performance of Z-LDA is better than that of conventional LDA in practice. The distributions of the weight sum y(x) when subject 1 performed motor imagery with the left hand and the right hand are plotted in Figure 3 for the training and test sets, respectively. The decision boundaries of Z-LDA and LDA are also marked in Figure 3.

Discussion
The translation module of a BCI receives features from the preceding feature extraction module and translates them into device commands using certain classification algorithms. In practical BCI situations, the concerned tasks may have heteroscedastic class distributions, so taking the effect of the distribution variances into account may provide more robust task recognition. The conventional LDA, however, assumes homoscedastic class distributions and may not be competitive on actual BCI datasets with heteroscedastic class distributions. Motivated by this, we developed Z-LDA, which includes the variance information in the classification procedure in order to provide more robust classification of BCI tasks.
As shown in section Methods and Materials, the decision boundary of conventional LDA is decided by the class labels, while the decision boundary of Z-LDA is defined by both the mean and the SD of the weight sum, which gives it greater potential to capture the distribution information of the classes and to provide better classification performance in heteroscedastic situations.
The difference between the decision boundaries of the two classifiers can be clearly observed in Figure 1. If we define the labels of the two classes as -1 and 1, the decision boundary of LDA is fixed at c = 0, while the decision boundary of Z-LDA is determined by equation (8). If the SDs of the two classes are the same, the decision boundary of Z-LDA is also c* = 0; but when the SDs differ, the decision boundary of Z-LDA moves toward the class with the smaller SD. From Figure 1 we can see that because of the small SD of the first class, the SD of its weight sum y(x) is also small, resulting in a more concentrated distribution compared with the relatively dispersed distribution of class 2, which has the larger SD. Considering the areas under the two Gaussian curves between the two decision boundaries, the area corresponding to the second class is obviously larger than that of the first class, which means that with the newly defined decision boundary more samples are correctly recognized. Figure 2 further reveals that if the decision boundary of LDA is used on the test dataset, many samples belonging to the second class are incorrectly assigned to the first class. If instead the decision boundary of Z-LDA is used, the number of samples incorrectly assigned to the first class is reduced, at the cost that some samples belonging to the first class are incorrectly assigned to the second class.
When applied to the actual BCI datasets, Z-LDA consistently shows the best average accuracy among the five concerned classifiers, as shown in Table 2 and Table 3. Figure 3 clearly shows that the weight sums of the two types of tasks indeed follow different Gaussian distributions in a practical BCI application. In this case, the decision boundary of Z-LDA obtained from the training set is the green solid line, which is smaller than 0, and the decision boundary of LDA is the green dashed line, which equals 0. The black solid line in Figure 3 denotes the theoretical boundary for the test dataset. Obviously, the decision boundary of Z-LDA determined from the training set is closer to the theoretical boundary of the test dataset, leading to the better classification achieved by Z-LDA compared with LDA. We can therefore conclude that the proposed decision boundary definition strategy outperforms the conventional one in actual BCI applications, where the concerned samples usually have heteroscedastic distributions.
Another concern is the algorithm complexity for an online BCI system. In the current work, the algorithms were implemented in Matlab R2011b running on Windows 7 Ultimate SP1 64-bit with an Intel Core i5-3470 CPU at 3.2 GHz. The mean time to classify the 200 two-dimensional samples in the simulation study is 0.0004 s for Z-LDA and 0.0001 s for LDA, which indicates that Z-LDA is applicable to practical real-time BCIs.

Conclusion
Both the simulation and the actual BCI datasets confirm that Z-LDA is a more robust classification method. In essence, Z-LDA is an enhanced version of LDA, and it reduces to conventional LDA under the assumption of homoscedastic class distributions.
Moreover, a probability indicating how reliably the classification is performed can be derived from the z-score transformed weight sum, which may be helpful for the adaptive calibration problem [17,33,35].
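One way such a reliability probability could be derived (our assumption, not a formula given in the paper) is the Gaussian class posterior under equal priors:

```python
import math

def zlda_posterior(y, m1, s1, m2, s2):
    """Probability that weight sum y belongs to class 1, computed from the
    two Gaussian likelihoods N(m_k, s_k) with equal class priors. This is
    one illustrative realization of a classification-reliability score."""
    def lik(m, s):
        z = (y - m) / s
        return math.exp(-0.5 * z * z) / (s * math.sqrt(2.0 * math.pi))
    p1, p2 = lik(m1, s1), lik(m2, s2)
    return p1 / (p1 + p2)
```

Samples whose posterior is close to 0.5 could then be flagged as unreliable and excluded from, or down-weighted in, adaptive recalibration.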
Various algorithms based on LDA have been proposed for BCI applications, such as regularized LDA [36,37], Bayesian LDA (BLDA) [38] and enhanced BLDA [33]. Unlike Z-LDA, these algorithms improve LDA's performance from other aspects, such as regularization and Bayesian frameworks. Combining the proposed decision boundary definition strategy with these algorithms is possible and is left as future work. Moreover, we will also implement the proposed Z-LDA in our online BCI system.