A Novel Algorithm to Enhance P300 in Single Trials: Application to Lie Detection Using F-Score and SVM

The investigation of lie detection methods based on P300 potentials has drawn much interest in recent years. We present a novel algorithm that enhances the signal-to-noise ratio (SNR) of P300 and apply it to lie detection to increase the classification accuracy. Thirty-four subjects were divided randomly into guilty and innocent groups, and the EEG signals on 14 electrodes were recorded. A novel spatial denoising algorithm (SDA) based on independent component analysis was proposed to reconstruct the P300 with a high SNR. The proposed method differs from our own and other previously published methods mainly in how the P300 is extracted and how features are selected. Three groups of features were extracted from the denoised waves; then, the optimal features were selected by the F-score method. The selected feature samples were finally fed into three classical classifiers to compare their performance. The optimal parameter values in the SDA and the classifiers were tuned using a grid-search training procedure with cross-validation. The support vector machine (SVM) was combined with the F-score because this combination performed best. The presented model, F-score_SVM, reaches a significantly higher classification accuracy for P300 (specificity of 96.05%) and non-P300 (sensitivity of 96.11%) than the results obtained without SDA and the results obtained by the other classification models. Moreover, a higher individual diagnosis rate can be obtained compared with previous methods, and the presented method requires only a small number of stimuli in a real testing application.


Introduction
Research into lie detection has drawn a substantial amount of attention over the past several decades and has found many important applications in the legal, moral and clinical fields [1][2][3]. Currently, a number of studies that adopt neurophysiological signals have been conducted on lie detection. These methods have used Magnetic Resonance Imaging [4,5] and Event-Related Potentials (ERPs) [6,7]. P300, an endogenous ERP component, has been extensively investigated [8] and has been successfully used for deception detection [9].
Widely used P300-based lie detection methods can be roughly divided into three categories: the bootstrapped amplitude difference (BAD) [10,11], the bootstrapped correlation difference (BCD) [12] and machine learning methods [7,13,14]. For the methods listed above, there are three types of stimuli that are presented to subjects, i.e., Probe (P), Target (T) and Irrelevant (I) stimuli [7].
A good lie detection method should use a small number of stimuli to achieve as high an accuracy as possible. To realize this goal for P300-based lie detection, a critical step is to extract the P300 with a high signal-to-noise ratio (SNR). Although the P300 is time- and phase-locked to experimental stimuli, extracting it with a high SNR is still a challenging task because various types of noise are heavily superimposed on it [15]. BAD and BCD use the statistical technique of bootstrapping [16] to generate many different ERP averages from the same set of stimuli [7]. Bootstrapping increases the SNR of P300. However, it requires a large number of stimuli, which lengthens signal acquisition and increases the fatigue of the subjects. More recently, a few researchers have investigated single-trial lie detection methods based on machine learning [7,14]. In these methods, features were extracted from single trials and then used to train classifiers to differentiate between brain states. The testing results showed that machine learning methods could achieve a higher detection accuracy than the BAD and BCD methods [7]. However, they typically did not remove the noise embedded in single trials, resulting in unsatisfactory detection accuracy.
Consider the noise embedded in single trials for P300 extraction. The EEG recording on one sensor consists of two main parts. One part is extra-skull noise, and the other part is the signal produced by intra-skull neuronal sources at specific brain regions, including ERP and spontaneous EEG. Obviously, the ERP cannot be represented by the sensor signal directly. Conventional lie detection methods could not separate P300 from the noise and spontaneous EEG because their time courses and scalp projections usually overlap [17]. Recently, independent component analysis (ICA), a blind source separation (BSS) method [15,18,19,20], was used to decompose stimulus-related ERPs into independent components (ICs) [21][22][23][24]. The results showed that the decomposed ICs were more distinguishable than the "sensor signals" [22,23]. In our early study [25], we proposed an ICA-based template matching method, the topography-template matching (TTM) algorithm, to enhance the SNR of P300, and we achieved promising results. TTM considered only the effect of the P300 independent sources at the Pz site, and a neurophysiologist selected the P300 independent source based on experience. In this study, we present a novel spatial denoising algorithm (SDA) that improves on that early study. Compared with TTM, SDA considers more affected areas, namely the P3, P4, Pz, Cz and Oz sites, and it identifies the P300 independent sources automatically rather than by experience. Hence, SDA is more reasonable and objective than the early method. The key innovation is how to automatically identify the P300 ICs (i.e., the ICs accounting for the P300), which will be described in the following section.
By removing redundant features, feature selection can help the original classification system to achieve better classification performance, including lower computational costs and higher classification accuracy. Polat et al. indicated that feature selection improves the classification accuracy by using a hybrid system of feature selection and several classifiers [26]. In this study, the F-score [38], a simple but effective technique, was used to select the optimal features from the original extracted features. In addition, to select a suitable classifier, all of the training samples with the selected optimal features were fed into three popular classifiers to compare their performance.
For conventional lie detection methods such as BCD/BAD [10][11][12] and some other lie detection methods [7,13], many stimuli must be presented to the subjects in practical applications, because both the bootstrapping technique and the threshold-based classification rely on many stimulus responses. This limits the real application of lie detection. First, there is often very limited information related to criminal acts. Second, many repeated stimuli with little information cause two problems. One problem is fatigue, and the other is an increase in countermeasures [11], because real criminals might become familiar with the stimuli and tend to resist the detection when many stimuli are presented repeatedly. Furthermore, when the researcher needs to make the final judgment from the analysis results of many stimuli, a threshold strategy (see references [7,10,11,12,13] for details) is inevitably used, which makes the individual diagnostic rate depend on a subjective decision. The present method aims at using only a small number of stimuli and avoiding the threshold problem.

Ethics statement
The experiment was approved by Psychology Research Ethical Committee (PREC) of the College of Biomedical Engineering in South-Central University for Nationalities. Thirty healthy subjects (15 females, mean age of 21.5) were recruited from the university. The participants provided their written informed consent according to a human research protocol in this study.

EEG Data Acquisition
Twelve electrodes (Fp1, Fp2, F3, Fz, F4, C3, Cz, C4, P3, Pz, P4, Oz) of the International 10-20 system were used. The vertical EOG (VEOG) signal was recorded from the right eye (2.5 cm below and above the pupil), and the horizontal EOG (HEOG) signal was recorded from the outer canthus. EEG and EOG signals were filtered online with a band-pass filter of 0.1-30 Hz, and they were digitized at 500 Hz using Neuroscan Synamps. All of the electrodes were referenced to the right earlobe. Electrode impedances did not exceed 2 kΩ.

Experimental Protocol
The standard three-stimuli protocol [10,12] was employed in this study. The participants were randomly divided into two groups: a guilty group and an innocent group. Six different jewels were prepared, and their pictures served as stimuli during detection. A safe that contained one (for the innocent) or two (for the guilty) jewels was given to each participant. They were instructed to open the safe and memorize the details of the object. We instructed the guilty group to steal only one object, which would serve as the P stimulus. The other object in the safe was the T stimulus, and the remaining four pictures were the I stimuli. For the innocent group, the object in the safe was not stolen and served as the T stimulus. Then, from the remaining five pictures, one picture was selected randomly and set as the P stimulus, and the remaining four pictures were set as I stimuli. All of the subjects were instructed to write down the information on the objects in the safe, such as the styles and colors of the jewels.
After the preparation tasks introduced above, the participants began to perform the detection. They were seated in a chair, facing a video screen that was approximately 1 m away from their eyes. The stimulus pictures were presented randomly on the screen. Each item remained for 0.5 s with 30 iterations for one session, and each session lasted for approximately 5 minutes, with 2 minutes of resting time. The inter-stimulus interval was 1.6 s. Each subject was instructed to perform 5 sessions. The stimulus sequence diagram is given in Figure 1. Each subject was given push buttons and was asked to press the "Yes" button when faced with a familiar item and the "No" button when faced with an unknown item.
The guilty group was instructed to press the "Yes" and "No" buttons when faced with the T and I stimuli, respectively. With a P stimulus, they were asked to press the "No" button, attempting to hide the theft. In contrast, the innocent group made honest responses to all of the stimuli. All of the subjects had practiced the tasks above before the EEG signals were formally recorded. We planned to exclude any subjects that had more than a 5% clicking error, but none fell into this category. Finally, a sketch map of the above protocol is shown in Figure 2.

General description of method
The present method is separated into the following steps: (1) preprocess the continuous raw EEG recordings, and then apply SDA to the preprocessed datasets to reconstruct P300 waves that have a higher SNR (from the guilty) and non-P300 waves (from the innocent). For convenience, we hereafter refer to all of these results as reconstructed P300 waves (although they also contain non-P300 waves); (2) extract the original features from the reconstructed waves; (3) adopt the F-score method to select the optimal features, which are concatenated into a feature vector and fed into three typical classifiers; (4) train the classifiers using the two classes of training samples, and then evaluate them on the testing samples. Through the training procedure, the optimal parameter values, including the parameters in SDA and in each specific classifier, can be determined. During a practical application phase, only several stimuli (five probe stimuli were needed in this study) are presented to the subjects. The flowchart of the presented CIT system is shown in Figure 3.

Preprocessing
Using the EEGLAB toolbox, we segmented the continuous EEG data into epoched datasets, each of which lasted from 0.5 s before to 1.1 s after the stimulus onset. Then, the ocular artifacts [24] in each set were removed by the software SCAN of Neuroscan, i.e., the datasets that contained single trials with a voltage in excess of ±75 µV were discarded. All of the remaining trials were baseline corrected on the pre-stimulus interval. Lastly, the datasets corresponding to P responses were selected, and every 5 datasets within each subject were pooled into one average, resulting in 450 averaged datasets for each subject group.
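As a quick check on the dataset count, the following sketch reproduces the figure of 450 averaged datasets per group and shows one way to pool every 5 trials. It assumes that the probe is shown 30 times per session, that no probe trial is rejected, that each group has 15 subjects, and that consecutive trials are pooled; none of these is stated explicitly in the text, and the function name is illustrative.

```python
# Counting the averaged probe (P) datasets per group.
# Assumptions (not stated explicitly in the text): 30 probe presentations per
# session, no probe trial rejected as an artifact, 15 subjects per group.
PROBE_TRIALS_PER_SESSION = 30
SESSIONS_PER_SUBJECT = 5
TRIALS_PER_AVERAGE = 5
SUBJECTS_PER_GROUP = 15

probe_trials_per_subject = PROBE_TRIALS_PER_SESSION * SESSIONS_PER_SUBJECT
averages_per_subject = probe_trials_per_subject // TRIALS_PER_AVERAGE
averages_per_group = averages_per_subject * SUBJECTS_PER_GROUP


def block_average(trials, block=TRIALS_PER_AVERAGE):
    """Average consecutive groups of `block` single trials, sample by sample.

    `trials` is a list of equally long lists of samples (one channel).
    Pooling consecutive trials is an assumption; trailing trials that do not
    fill a whole block are dropped.
    """
    averaged = []
    for start in range(0, len(trials) - block + 1, block):
        group = trials[start:start + block]
        averaged.append([sum(samples) / block for samples in zip(*group)])
    return averaged


print(probe_trials_per_subject, averages_per_subject, averages_per_group)
# prints: 150 30 450
```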

Independent component analysis
Let $X(t) = [x_1(t), x_2(t), \ldots, x_C(t)]^T$ denote the observed time series with $t$ varying from 1 to $N$, where $N$ and $C$ denote the number of samples and sensors, respectively. In the ICA method, $X(t)$ is the result of an unknown mixture of a set of unknown source signals $S(t) = [s_1(t), s_2(t), \ldots, s_C(t)]^T$, and the mixture is viewed as linear: $X(t) = AS(t)$. Based on the principle of statistical independence [26][27], ICA estimates $S(t)$ by introducing the unmixing matrix $W$, i.e., $Z(t) = WX(t)$, where $Z(t)$ (the decomposed ICs) is the estimate of the signals $S(t)$. Accordingly, $W^{-1}$ is referred to as the mixing matrix. Once the signals $S(t)$ are estimated by an ICA algorithm, the $j$th column of the matrix $W^{-1}$ provides the projection strengths of the $j$th IC onto each electrode.
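The algebra of the mixing model can be illustrated with a small numerical sketch. It does not run an ICA algorithm: the unmixing matrix is simply taken as the inverse of a known, hypothetical mixing matrix, so that the roles of $W$, $W^{-1}$ and $Z(t)$ become visible. All numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
C, N = 3, 1000                        # illustrative numbers of sensors and samples

S = rng.laplace(size=(C, N))          # hypothetical independent sources S(t)
A = np.array([[1.0, 0.5, 0.2],        # hypothetical mixing matrix
              [0.3, 1.0, 0.4],
              [0.2, 0.6, 1.0]])
X = A @ S                             # observed signals: X(t) = A S(t)

# A real ICA algorithm estimates W from X alone; here W is taken from the
# known A purely to illustrate the algebra.
W = np.linalg.inv(A)
Z = W @ X                             # decomposed ICs: Z(t) = W X(t)
W_inv = np.linalg.inv(W)              # mixing matrix W^{-1}

# Column j of W^{-1} holds the projection strength of IC j on every sensor,
# so the outer product below is the contribution of IC j to the recording.
contributions = [np.outer(W_inv[:, j], Z[j]) for j in range(C)]

print(np.allclose(Z, S))                    # True
print(np.allclose(sum(contributions), X))   # True
```

Summing the contributions of all ICs recovers the recording exactly; keeping only some of them is the idea behind the back projection used later.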

Spatial denoising algorithm for P300 enhancement
The spatial denoising algorithm, referred to as SDA hereafter, is described in this section. First, each averaged dataset was decomposed by ICA, resulting in the mixing matrix $W^{-1}$ and the decomposed ICs $Z(t)$. The extended infomax algorithm (EICA) was used for ICA because it allows some sources to have sub-Gaussian distributions [28,29]. By accommodating sub-Gaussian distributions in the data, EICA could provide a more accurate decomposition of multi-channel EEG signals, especially when various neurophysiological signals follow different distributions.
Many investigators have found that P300 is usually the largest at Pz, the smallest at Fz, and takes intermediate values at Cz [30,32]. They typically acquired the P300 on one of the electrodes listed above [7,9,11,31]. According to the a priori physiological knowledge described above and the spatial distribution of an IC, SDA is divided into the following four steps:
(1) Compute $U = |W^{-1}|$, where the symbol $|\cdot|$ denotes the element-wise absolute value. Let $X'(t)$ denote a new EEG dataset, which was defined by . . .
(2) Let Pz, P3, P4, Cz and Oz equal their respective sequence numbers in the electrode set (e.g., Pz equals 10 in this study). For the $j$th column in each matrix $U$, we calculate a value $S_j$ as a weighted combination of the elements $U_{ij}$ at these sites, where the parameters k1, k2 and k3 denote the weights on the different elements $U_{ij}$. A grid-search procedure (see Figure 3) is used to obtain the optimal values of these parameters. Here, $S_j$ denotes the integrated distribution strength of the $j$th IC over the brain areas of interest. The larger $S_j$ is, the more likely the $j$th IC is a P300 IC.
(3) Sort the 14 values in $S = \{S_1, S_2, \ldots, S_{14}\}$ in descending order, resulting in a sorted vector $E$ and a sorted index vector $F$, with $F_j$ being the position in $S$ of the $j$th element of $E$.
(4) Back projection: Let $m$ denote how many P300 ICs should be selected to reconstruct the P300 wave, and let $Y_{pz}(t)$ be the reconstructed P300 wave on the Pz electrode. The back projection for $Y_{pz}(t)$ uses only the $m$ top-ranked ICs, i.e., only $m$ ICs are considered as P300 ICs and are back projected to the scalp. A grid-search procedure (see Figure 3) is used to determine the optimal value of the parameter $m$, which will be discussed later.
Lastly, for the two groups of subjects, two sets of reconstructed waves can be obtained. Let R-G denote the vector set for the guilty group, and let R-I denote the set for the innocent group. We expect that, by using the above SDA, the SNR of P300 in the set R-G is enhanced compared with the raw ERP signal.
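The four steps can be sketched as follows. Because the exact weighting formula for $S_j$ is not reproduced in the text, the form used here (k1 on Pz, k2 on P3 and P4, k3 on Cz and Oz) is an assumption, as are the function name and the placement of the two EOG channels after the 12 EEG channels. The values of k1, k2 and k3 in the example are the optimal values reported in the results.

```python
import numpy as np

# Channel order from the data-acquisition section; VEOG/HEOG as channels
# 13 and 14 is an assumption.
CHANNELS = ["Fp1", "Fp2", "F3", "Fz", "F4", "C3", "Cz", "C4",
            "P3", "Pz", "P4", "Oz", "VEOG", "HEOG"]
IDX = {name: i for i, name in enumerate(CHANNELS)}


def sda_reconstruct(W_inv, Z, m, k1, k2, k3, site="Pz"):
    """Back-project the m highest-scoring ICs onto one electrode.

    W_inv : (C, C) mixing matrix; Z : (C, N) independent components.
    The weighting (k1 on Pz, k2 on P3/P4, k3 on Cz/Oz) is an assumed form.
    """
    U = np.abs(W_inv)                                  # step 1
    S = (k1 * U[IDX["Pz"]]
         + k2 * (U[IDX["P3"]] + U[IDX["P4"]])
         + k3 * (U[IDX["Cz"]] + U[IDX["Oz"]]))         # step 2: one score per IC
    F = np.argsort(S)[::-1]                            # step 3: descending order
    selected = F[:m]
    y = W_inv[IDX[site], selected] @ Z[selected]       # step 4: back projection
    return y, selected


# Synthetic example: IC 3 is made to project strongly onto the five sites.
rng = np.random.default_rng(1)
W_inv = 0.1 * rng.normal(size=(14, 14))
for name in ("Pz", "P3", "P4", "Cz", "Oz"):
    W_inv[IDX[name], 3] = 2.0
Z = rng.normal(size=(14, 550))                         # 1.1 s at 500 Hz
y, selected = sda_reconstruct(W_inv, Z, m=1, k1=0.85, k2=0.70, k3=0.40)
print(selected.tolist())
```

With m equal to the number of channels, every IC is back projected and the function returns the original Pz signal, which matches the later remark that m = 14 corresponds to not using SDA.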

Feature extraction
Let $Y(t)$ denote a time wave in the set R-G or R-I, with $t$ varying from stimulus onset to 1.1 s after the stimulus onset. Time-domain, frequency-domain and wavelet features were selected as three groups of features in this study. Most of them have been demonstrated to be effective by many researchers [7,25,33,34,35]. The features are extracted from each signal $Y(t)$ by the following procedure.
Time-domain features. Four time-domain features are defined as follows: (1) Maximum amplitude, which is defined as $V_{max} = \max_t \{Y(t)\}$. (2) Latency, which is the time where $V_{max}$ occurs, i.e., the $t$ for which $Y(t) = V_{max}$. (3) Peak-to-peak, which is the difference between the maximum and the minimum of $Y(t)$. (4) Positive area, which is the sum of the positive signal values of $Y(t)$.
Frequency-domain features. The power spectral density (PSD) is first calculated on each $Y(t)$ by the Bartlett algorithm. Let $p(f)$ be the resultant PSD, and let $p_{max} = \max\{p(f)\}$ denote the maximum amplitude value of the PSD. Then three frequency-domain features can be calculated as follows: (1) Maximum frequency, i.e., the frequency $f$ at which $p(f) = p_{max}$. (2) Mean frequency, calculated as the weighted average of the frequency with the PSD values as the weights, i.e., $\sum_f f\,p(f) / \sum_f p(f)$. (3) The power of the main frequency band that involves the P300.
Wavelet features. Many authors have indicated that ERPs are transient signals that include typical frequency components in different frequency ranges, such as delta, theta, alpha, beta and gamma [36]. Recently, the wavelet transform (WT) has been widely used to analyze ERPs [36][37][38]. The WT breaks up a signal into shifted and scaled versions of the mother wavelet, which is a waveform that has a limited duration and a zero mean.
In this study, a fast algorithm for the discrete WT (DWT) was adopted to decompose those averaged single trials [39]. We selected quadratic B-spline functions as mother wavelets because they have a near-optimal time-frequency localization property and good similarity with the P300 components [40][41]. The wavelet coefficients were computed by a high-pass filter h and a low-pass filter g. The coefficients of these two filters and of the two reconstruction filters are given in Table 1. DWT was performed on each wave $Y(t)$, which resulted in seven sets of wavelet coefficients corresponding to different frequency bands: 0.3-3.9, 3.9-7.8, 7.8-15.6, 15.6-31.2, 31.2-62.5, 62.5-125 and 125-250 Hz. Only the first four bands were useful due to the earlier filtering. Because the delta band was the main frequency range for the P300 component, the coefficient set corresponding to the first frequency band was selected as the final wavelet features for each wave $Y(t)$.
Following the feature extraction, these feature samples were divided into two sample sets: the first set contained all of the P300 samples for the guilty group, and the second set contained non-P300 samples for the innocent group, with the class labels being 1 and −1, respectively.
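The band limits quoted above follow from the 500 Hz sampling rate and the dyadic structure of the DWT. The sketch below computes the nominal band edges for six decomposition levels; the function name is illustrative, and the lowest band is taken to start at 0 Hz in this idealized calculation, whereas the text quotes 0.3 Hz for its lower edge.

```python
def dwt_band_edges(fs, levels):
    """Nominal frequency bands (Hz) of a dyadic DWT, lowest band first.

    Detail level k covers fs / 2**(k + 1) to fs / 2**k; the final
    approximation covers 0 to fs / 2**(levels + 1).
    """
    bands = [(fs / 2 ** (k + 1), fs / 2 ** k) for k in range(1, levels + 1)]
    bands.append((0.0, fs / 2 ** (levels + 1)))
    return bands[::-1]


for low, high in dwt_band_edges(fs=500, levels=6):
    print(f"{low:g}-{high:g} Hz")
```

Six levels give seven coefficient sets, and the upper edges (3.90625, 7.8125, 15.625, 31.25, 62.5, 125 and 250 Hz) agree with the rounded values in the text.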
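The time- and frequency-domain features described in this section can be sketched as below. This is a minimal illustration under stated assumptions: the number of Bartlett segments, the limits of the "main frequency band" (0 to 4 Hz here) and all function names are hypothetical, and the PSD is computed only up to a constant scale factor, which does not affect the maximum or mean frequency.

```python
import numpy as np


def time_features(y, t):
    """Maximum amplitude, latency, peak-to-peak and positive area of y(t)."""
    i_max = int(np.argmax(y))
    v_max = float(y[i_max])
    latency = float(t[i_max])
    peak_to_peak = v_max - float(np.min(y))
    positive_area = float(np.sum(y[y > 0]))
    return v_max, latency, peak_to_peak, positive_area


def bartlett_psd(y, fs, n_segments=2):
    """Bartlett estimate: average the periodograms of non-overlapping segments."""
    seg_len = len(y) // n_segments
    segments = np.reshape(y[:seg_len * n_segments], (n_segments, seg_len))
    periodograms = np.abs(np.fft.rfft(segments, axis=1)) ** 2 / (fs * seg_len)
    freqs = np.fft.rfftfreq(seg_len, d=1.0 / fs)
    return freqs, periodograms.mean(axis=0)


def frequency_features(freqs, psd, band=(0.0, 4.0)):
    """Maximum frequency, mean frequency and power inside `band` (assumed limits)."""
    f_max = float(freqs[np.argmax(psd)])
    f_mean = float(np.sum(freqs * psd) / np.sum(psd))
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    band_power = float(np.sum(psd[in_band]))
    return f_max, f_mean, band_power


# Toy signal: a 4 Hz sine sampled at 500 Hz for 1 s (not real EEG).
fs = 500.0
t = np.arange(500) / fs
y = np.sin(2 * np.pi * 4.0 * t)
freqs, psd = bartlett_psd(y, fs)
print(time_features(y, t))
print(frequency_features(freqs, psd))
```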

Feature Selection
In this study, we adopted the F-score method to further select the best subset of features for classification. The F-score method is a very simple but robust feature-evaluating technique. Recently, many researchers have successfully used this method in pattern recognition systems to select the optimal feature subset [42,43].
Given the $i$th feature vector $\{x_{i1}, x_{i2}, \ldots, x_{in_+}, \ldots, x_{iB}\}$ with the number of positive instances $n_+$ and the number of all of the instances $B$, the F-score value of the $i$th feature is defined by

$$F_i = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{ik} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{B - n_+ - 1}\sum_{k=n_+ + 1}^{B}\left(x_{ik} - \bar{x}_i^{(-)}\right)^2}$$

where $\bar{x}_i^{(+)}$, $\bar{x}_i^{(-)}$ and $\bar{x}_i$ are the averages of the positive, negative, and whole samples, respectively, and $x_{ik}$ is the $k$th feature value in the $i$th feature vector. Positive and negative represent the two classes of identification. A larger F-score value indicates that the feature has more discriminative power. To apply this method, the F-score values of all of the features are sorted. Hence, in this study, those features that have relatively larger F-score values were selected to construct the feature subset.
There are two main methods used to select the appropriate feature subset: the filter method [44] and the wrapper method [45,46]. To obtain simplicity and a lower computation cost, we used the former method to select the feature number for the optimal feature subset.
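A minimal sketch of filter-style selection with the F-score is given below. It uses the standard F-score definition (between-class separation of the means divided by the sum of the within-class sample variances); the function names and the toy data are illustrative only, and the threshold of 0.85 is the one quoted in the results.

```python
from statistics import mean, variance


def f_score(positive, negative):
    """F-score of one feature from its values in the two classes."""
    overall = mean(positive + negative)
    between = (mean(positive) - overall) ** 2 + (mean(negative) - overall) ** 2
    within = variance(positive) + variance(negative)   # sample variances (n - 1)
    return between / within


def select_features(features, threshold):
    """Keep the features whose F-score exceeds `threshold`, best first.

    `features` maps a feature name to (positive_values, negative_values).
    """
    scores = {name: f_score(pos, neg) for name, (pos, neg) in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [name for name in ranked if scores[name] > threshold], scores


# Toy data: one feature separates the classes, the other does not.
toy = {
    "separable": ([4, 5, 6], [0, 1, 2]),
    "overlapping": ([0, 2, 4], [1, 2, 3]),
}
selected, scores = select_features(toy, threshold=0.85)
print(selected, scores)
```

Because each feature is scored on its own, without training a classifier, this is a filter method in the sense used above.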

Classification
The Fisher discriminant analysis (FDA) [47], back propagation neural network (BPNN) [48] and support vector machine (SVM) [49,50] were compared in this study to select an optimal classifier. The details of the three classifiers are given in the Supporting Information files (see Sections S1-S3 in File S1). The hybrid models that integrate F-score feature selection are referred to as F-score_FDA, F-score_BPNN and F-score_SVM in this study. Accordingly, three individual classification models (FDA, BPNN and SVM) were also utilized.
A Subject-Wise CV (SWCV) [25,51] was performed on the two classes of optimal feature sample sets. For each set, samples from 14 subjects were grouped into a training set and the samples from the remaining subject were used as a testing set. Thus, by this SWCV, 15 pairs of training sets and testing sets were obtained. For each pair, the training set consisted of the samples from 28 subjects, and the testing set consisted of the samples from 2 subjects (i.e., a guilty and an innocent subject). We would like to emphasize the importance of the SWCV procedure. In fact, a statistical classification model that could explain the data for some subjects did not necessarily generalize well to other subjects, even if those were drawn from the same distribution. Accordingly, the SWCV procedure was used to assess the generalization ability not only across different data within one subject but also across data from different subjects. Hence, the advantage of SWCV compared with common CV is that the test accuracy can simulate the generalization performance on other unseen subjects. Accordingly, we can obtain the testing results not only on the level of single trials, but also on the level of subjects, i.e., to test whether one subject can be recognized correctly.
For each training set yielded by SWCV, the feature samples were mixed to obtain two classes of samples: one is the lying group (considered as P300 feature samples) and the other is the truth-telling group (considered as non-P300 feature samples). Subsequently, a common 10-fold CV procedure [52] was performed on each training set, resulting in 10 pairs of sub-training sets and sub-validation sets. Figure 4 shows the schematic diagram of the division of samples and the cross-validation procedure.
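The subject-wise split can be sketched as follows. The pairing of the i-th guilty subject with the i-th innocent subject in each test set is an assumption, since the text does not say how the pairs are formed; the subject labels and the function name are hypothetical.

```python
def subject_wise_folds(guilty_ids, innocent_ids):
    """Yield (train_subjects, test_subjects) pairs for subject-wise CV.

    Fold i holds out the i-th guilty and the i-th innocent subject; this
    pairing is an assumption.
    """
    for g, i in zip(guilty_ids, innocent_ids):
        test = [g, i]
        train = [s for s in guilty_ids + innocent_ids if s not in test]
        yield train, test


guilty = [f"G{k:02d}" for k in range(1, 16)]      # hypothetical subject labels
innocent = [f"I{k:02d}" for k in range(1, 16)]
folds = list(subject_wise_folds(guilty, innocent))
print(len(folds), len(folds[0][0]), len(folds[0][1]))   # prints: 15 28 2
```

All samples of a held-out subject stay in the test set, so no subject contributes to both training and testing within a fold.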

Selection of optimal parameters
For the proposed lie detection method, two groups of parameters must be tuned: 1) the parameters in SDA: m, k1, k2 and k3, and 2) the specific hyperparameters for each classifier. Considering that the parameters in SDA can affect the optimal values of the hyperparameters, the two groups of parameters were tuned together using a multi-dimensional grid search. During the tuning, m varied from 1 to 14, and k1, k2 and k3 varied from 0.2 to 1 with a step size of 0.15, at the suggestion of an independent EEG expert. In the tuning procedure above, for BPNN, the number of sigmoid hidden nodes a and the learning rate g were tuned (the control precision was set to 0.002). For SVM, the penalty parameter $C$ and the radial width $\sigma$ of the radial basis function kernel [52] were tuned. The procedure of training and testing is described as follows:
(1) The classifiers were trained on each sub-training set with different combinations of tuning parameters. By the 10-fold CV, an averaged sensitivity and an averaged specificity can be obtained for the $j$th training set. Then, the mean and standard deviation (SD) of the 15 sensitivities (15 training sets), referred to as $M_{asen}$ and $SD_{asen}$ respectively, are calculated. Similarly, $M_{aspe}$ and $SD_{aspe}$ for specificity are obtained. Lastly, the balanced accuracy $BA_{train} = \frac{1}{2}(M_{asen} + M_{aspe})$ is calculated for the specific combination of tuning parameters.
(2) Repeat the above steps using a different combination of tuning parameters. The optimal parameter values are those for which $BA_{train}$ reaches its highest value.
(3) On the 15 testing sets, calculate the generalization performance of the trained classifiers with the optimal parameter values. Similar to step 1, $M_{tsen}$ and $SD_{tsen}$ (mean and SD of the 15 sensitivities) and $M_{tspe}$ and $SD_{tspe}$ (mean and SD of the 15 specificities) can be obtained. Finally, calculate the balanced testing accuracy $BA_{test} = \frac{1}{2}(M_{tsen} + M_{tspe})$. This accuracy is the final testing measure of the performance evaluation.
Spatial Denoising Method for P300 to Detect Liars. PLOS ONE | www.plosone.org
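The joint tuning of the SDA parameters and the classifier hyperparameters can be sketched as a plain exhaustive search. The callable `evaluate` is a hypothetical stand-in for the whole inner procedure (SDA, feature extraction, 10-fold CV on all 15 training sets); the toy version used here is constructed only so that the example runs, and the grids are illustrative.

```python
from itertools import product


def balanced_accuracy(mean_sensitivity, mean_specificity):
    """BA = (M_sen + M_spe) / 2."""
    return 0.5 * (mean_sensitivity + mean_specificity)


def grid_search(evaluate, m_values, k_values, classifier_grid):
    """Return (best_score, best_params) over all parameter combinations.

    `evaluate(m, k1, k2, k3, clf)` is a hypothetical callable returning
    (mean sensitivity, mean specificity) for one combination.
    """
    best_score, best_params = float("-inf"), None
    combos = product(m_values, k_values, k_values, k_values, classifier_grid)
    for m, k1, k2, k3, clf in combos:
        score = balanced_accuracy(*evaluate(m, k1, k2, k3, clf))
        if score > best_score:
            best_score, best_params = score, (m, k1, k2, k3, clf)
    return best_score, best_params


def toy_evaluate(m, k1, k2, k3, clf):
    # Stand-in for the real cross-validation; peaks at m = 2 by construction.
    accuracy = 95.0 - 5.0 * abs(m - 2)
    return accuracy, accuracy


best_score, best_params = grid_search(
    toy_evaluate,
    m_values=range(1, 15),
    k_values=[0.5],                                  # illustrative single value
    classifier_grid=[{"C": 1.0, "sigma": 1.0}],      # illustrative SVM setting
)
print(best_score, best_params)
```

Tuning both groups in one product, rather than one after the other, reflects the remark that the SDA parameters can change the optimal classifier hyperparameters.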

Preprocessing
The grand average ERPs on the Fz, Cz, Pz and Oz sites as a function of stimulus type were first calculated within each subject. Figure 5 gives the boxplot of the maximum amplitude at the Pz site for the three types of stimuli and the two subject groups, for which 450 samples for each type of stimulus and each group were used for statistical analysis. Using ANOVA on the guilty subjects, there is no significant difference (p > 0.05) in the maximum amplitude between the P and T stimuli. However, there is a significant difference (p < 0.001) between the P and I stimuli. In contrast, there is no significant difference (p > 0.05) between the P and I stimuli for the innocent subjects. A 2 × 2 mixed-model ANOVA (P vs. I × innocent vs. guilty) was performed on the maximum amplitude at the Pz site. The result is shown in Figure 6. More importantly, by a further independent effect analysis of innocent versus guilty when P stimuli were used, the person type effect is significant and yields F(1,28) = 1514.68, p < .0005. The amplitude of P300 for the guilty is higher than that for the innocent. In contrast, when using I stimuli, there is no significant person effect (F < 1). Hence, P responses at the Pz site were finally selected for further processing to enhance the feature difference of the P300 waves between the two classes of subjects.

SDA
First, the enhancement of the SNR of P300 by SDA is illustrated in Figure 7. Five raw EEG datasets from a randomly chosen guilty subject were taken as an example. The raw waves on Pz (solid thin lines) and their averaged wave (dashed thick line) are shown in Figure 7A. Similarly, we randomly selected an innocent subject, and the raw waves and averaged wave on Pz are shown in Figure 7B. Applying SDA to the two averaged datasets, respectively, yields the two reconstructed P300 waveforms on Pz shown in Figure 7C. There is no distinct P300 in the averaged waves (dashed lines) in Figure 7A and 7B. As Figure 7C shows, however, there is a clear P300 with a latency of approximately 280 ms for the guilty subject, and the two lines can be differentiated easily. During this evaluation, the parameters m, k1, k2 and k3 were set to 3, 0.9, 0.8 and 0.6, respectively, based on the a priori knowledge of an independent physiology expert.

Extraction of Wavelet Features
After SDA, the features were extracted from the reconstructed waves for the Pz. Here, we randomly selected a guilty and an innocent subject, and then conducted the wavelet transform on the two subjects' denoised P300 signals. The results of the DWT are shown in Figure 8A and 8B, respectively. The most distinct difference in the wavelet features and reconstruction waves between the two subjects is in the 0.3-3.9 Hz band (the delta band). For the guilty subject, it can be seen from the bottom row in Figure 8A that there are obvious peaks in the wavelet coefficients and reconstruction waves at approximately 500 ms post-stimulus for this band. This observation is in accordance with the time-domain features of the P300 waveform. In contrast, there are no obviously corresponding features in Figure 8B. The results above suggest that the wavelet coefficients corresponding to the delta band, as a class of P300 features, are suitable for differentiating the P responses between the two groups of subjects.

Result of the feature selection
Table 2 shows the results of the feature selection by the F-score method. $W_1$-$W_{22}$ denote the 22 WT coefficients. From this table, we can see the F-score values of the 29 original features. Those features with relatively larger F-score values were selected to construct a feature subset. For simplicity, we directly selected the 10 features whose F-score values were larger than 0.85 to form the optimal feature subset.
Observing these 10 features, we can see first that the two optimal time-domain features are closely related to the peak value of P300. Second, one feature ($A_{lf}$) is related to the main frequency range of P300 (0.3-3.9 Hz). Most importantly, most of the optimal features are selected from the original wavelet features. This indicates that the wavelet features have a better classification capability than the other two kinds of features.

Classification Performance
Using SWCV, $BA_{train}$ reaches the highest value, 96.18%, using the F-score_SVM, and the optimal parameters determined by grid searching are as follows: m = 2, k1 = 0.85, k2 = 0.70 and k3 = 0.40. The training accuracies as a function of the parameter m are shown in Figure 9A and 9B for the three hybrid models when k1 = 0.85, k2 = 0.70 and k3 = 0.40. As shown in Figure 9, the accuracy rates increase significantly when m changes from 1 to 2 for all of the models. For example, the increase for F-score_SVM is approximately 5%. In addition, the accuracies of F-score_FDA and F-score_SVM reach a maximum when m = 2, whereas the accuracy of F-score_BPNN still increases slightly as m varies from 2 to 3. More importantly, the accuracy rates decrease when more than 3 ICs are used in SDA. This result is basically consistent with the report of Lin et al. [53]. Note that the accuracies with m = 14 denote the performance without the SDA. For every classification model, those accuracies are distinctly lower than those when m = 2. The results discussed above indicate the remarkable performance of SDA. Furthermore, Table 3 gives the training accuracies ($M_{asen}$, $M_{aspe}$) and testing accuracies ($M_{tsen}$, $M_{tspe}$) of the six classification models with the optimal grid searching result. First, the accuracy of the models using FDA is obviously lower than that of the models using BPNN and SVM. This finding suggests that the data from the two types of subjects in the lie detection cannot be separated linearly. Additionally, the performance of the models that use SVM significantly exceeds that of the models that use FDA and BPNN. Using ANOVA, the statistical results (F(1, 28) = 7396.689 and p < 0.001) confirm that the testing accuracy for SVM is significantly greater than that for BPNN. The $BA_{test}$ of 96.08% for F-score_SVM strongly suggests that it is suitable for the classification of the two classes of subjects.
Additionally, we can see from Table 3 that each hybrid model achieves significantly higher accuracy than the corresponding individual model. For example, on the training sets, SVM reaches a sensitivity and specificity of 91% and 90.98%, respectively. In contrast, F-score_SVM obtains 96.07% and 96.30%, respectively. Based on the above experimental results, the model F-score_SVM reaches the highest classification performance of all of the models.

Comparison with previous methods
The individual diagnostic rates of the presented and previous methods were calculated, and they were compared in this section. In the BAD/BCD method, each 10 waveforms of each type of response on the Pz electrode were selected to average into a waveform, based on the technique of bootstrapping. In the BAD method, the P300 amplitudes of the three types of responses were calculated based on the Peak-to-Peak method [7,13,54]. For the BCD method, the time lag was equal to 0 when the CV was calculated.
For the BAD and BCD methods, we calculated 100 D-values obtained by 100 iterations for each subject. Let $N_d$ denote the number of times the D-values were larger than zero. Then $N_d$ and the percentage of $N_d$ were calculated for each subject, respectively. If the percentage of $N_d$ was greater than a threshold $N_{th}$, then this subject would be considered to be a guilty subject [7,12]. Lastly, the error rates of an individual diagnosis as a function of the setting threshold are shown in Figure 10A and 10B, respectively. Considering the equal importance of the detection rates of the two groups of subjects, the individual diagnostic rates of 92% and 88.71% are reached when the thresholds are set to 83.6% and 85.5% for the BAD and BCD methods, respectively.
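The threshold decision described above can be sketched as follows. The text does not define the D-value, so the definition used here (mean of resampled probe amplitudes minus mean of resampled irrelevant amplitudes, in the spirit of BAD) is an assumption, and the amplitudes are toy numbers rather than measured data.

```python
import random


def bootstrap_d_values(probe_amps, irrelevant_amps, iterations=100, seed=0):
    """D-values as mean(resampled probe) - mean(resampled irrelevant).

    This amplitude-difference definition of D is an assumption; the text
    does not spell it out.
    """
    rng = random.Random(seed)
    d_values = []
    for _ in range(iterations):
        p = [rng.choice(probe_amps) for _ in probe_amps]
        i = [rng.choice(irrelevant_amps) for _ in irrelevant_amps]
        d_values.append(sum(p) / len(p) - sum(i) / len(i))
    return d_values


def diagnose(d_values, threshold_percent):
    """Guilty if the percentage of positive D-values exceeds the threshold."""
    n_d = sum(d > 0 for d in d_values)
    percent = 100.0 * n_d / len(d_values)
    return percent > threshold_percent, percent


# Toy amplitudes in which every probe response exceeds every irrelevant one.
d_values = bootstrap_d_values([8.0, 9.0, 10.0], [1.0, 2.0, 3.0])
print(diagnose(d_values, threshold_percent=83.6))   # prints: (True, 100.0)
```

The outcome for a subject depends directly on the chosen threshold, which is the subjectivity that the text attributes to the BAD and BCD methods.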
For our method, based on the results in the above section, the individual diagnostic rate can reach 100% when a test accuracy of 90% is chosen as the decision criterion for a subject. That is, a subject was identified as a liar when the percentage of reconstructed samples classified as P300 was larger than 90%, and as a truth-teller when the percentage of reconstructed samples classified as non-P300 was larger than 90%. This diagnostic rate is higher than the rates of the BAD and BCD methods, and is also higher than those reported using other machine learning-based methods. For example, Abootalebi et al. [7] reported that the best detection rates are 74%, 80% and 79% for BAD, BCD and the machine learning methods, respectively.
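The per-subject decision criterion can be written as a small function. The sketch below assumes that the classifier output for each reconstructed sample is coded as 1 (P300) or 0 (non-P300); the function name and the "undetermined" outcome for subjects meeting neither condition are assumptions added for illustration.

```python
def diagnose(predicted_labels, criterion=90.0):
    """Label a subject from per-sample predictions (1 = P300, 0 = non-P300)."""
    n = len(predicted_labels)
    pct_p300 = 100.0 * sum(1 for y in predicted_labels if y == 1) / n
    pct_non_p300 = 100.0 - pct_p300
    if pct_p300 > criterion:
        return "liar"
    if pct_non_p300 > criterion:
        return "truth-teller"
    # Assumption: neither condition met; the text does not describe this case.
    return "undetermined"
```

With 20 samples, for instance, 19 samples classified as P300 (95%) would label the subject a liar under the 90% criterion.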

Discussion and conclusions
Lie detection methods using a large number of stimuli suffer from several inherent drawbacks such as more fatigue for subjects, more workload for examiners, increased probability of countermeasure behavior and lower flexibility [25,55]. Obviously, a lie detection method with only a small number of stimuli will be crucial for practical lie detection. The purpose of this study is to develop a novel detection method that uses several stimuli to identify the liars, and at the same time, to further increase the individual diagnostic rate and robustness compared to previous studies. For this purpose, we proposed a novel ICA-based SDA to enhance the SNR of P300, and then, we used a machine learning method to distinguish the P300 evoked by guilty subjects from the non-P300 in innocent subjects. Some recent studies suggested that machine learning-based lie detection methods are more reliable than the BAD and BCD methods. One advantage is that the investigation of the dynamic variation of single trials might help us to study more cognitive information on lying. The second major advantage lies in that the failure of one trial will not affect the classification results of the other trials. In contrast, for BAD and BCD, the failure will change many bootstrapping averages and hence, the overall result of the lie detection [7]. Third, one can utilize more features of P300 in addition to the time-domain features that are used in the BAD/BCD method. Lastly, note that, in previous methods, it is difficult to decide the related thresholds such as the N_th described earlier because this decision involves the tradeoff between the two individual diagnostic rates from the two groups of subjects. In contrast, we can see that this problem does not exist in our method.
Table 3. Sensitivity/specificity (%) on the training and testing sets for different classification models with the optimal parameter combination.
In the present study, we assumed that for a P300-based lie detection method, the noise in the single trials could be divided into two categories: one is the ill-assorted responses to a certain type of stimulus, which result from a variation of cognitive state during detection [55]; the other is normal noise such as EOG artifacts and spontaneous EEG. Hence, before applying SDA, we first averaged every 5 raw EEG datasets to decrease the impact of ill-assorted P300s on the SNR of P300, which would increase the robustness of the entire system for lie detection. The efficiency of this preprocessing method for lie detection is not addressed in this study because it has already been demonstrated in the previous report [55]. To reduce the influence of the second type of noise on the detection performance as much as possible, we proposed a novel SDA to separate the P300 components from the other noise signals, constructing new Pz waves with more obvious P300 features; this process can be viewed as a spatial filter for the P300.
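The 5-dataset averaging step can be sketched as below. This is an illustration under stated assumptions, not the exact preprocessing used in the study: each raw EEG dataset is treated as one array of shape (channels, samples), groups are formed from consecutive non-overlapping datasets, and leftover datasets that do not fill a group are discarded. The function name is hypothetical.

```python
import numpy as np


def average_in_groups(trials, group_size=5):
    """Average consecutive, non-overlapping groups of trials.

    trials: array of shape (n_trials, n_channels, n_samples).
    Returns an array of shape (n_trials // group_size, n_channels, n_samples).
    """
    trials = np.asarray(trials, dtype=float)
    n_groups = trials.shape[0] // group_size
    usable = trials[: n_groups * group_size]      # drop an incomplete last group
    grouped = usable.reshape(n_groups, group_size, *trials.shape[1:])
    return grouped.mean(axis=1)


# Example: 12 simulated trials on 14 electrodes, 100 samples each.
rng = np.random.default_rng(0)
raw = rng.normal(size=(12, 14, 100))
averaged = average_in_groups(raw)                 # two averaged datasets
```

Averaging uncorrelated zero-mean noise over 5 datasets reduces its standard deviation by a factor of about the square root of 5, which is why this step raises the SNR before ICA is applied.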
Previously, we introduced a topography-template matching (TTM) method [25] to reconstruct P300 waveforms that have a higher SNR. TTM was based on correlation theory of the topography of the ICs. SDA differs from the TTM method in the construction algorithm. SDA is computationally efficient to implement and could therefore decrease the training and testing time. In addition, the classification accuracy of the presented method is higher than that reported in [25]. For the sake of brevity, we have not compared the efficiency of these two methods here; the comparison will be addressed in future studies.
For SDA, the experimental results show that the detection accuracy is highest when 2 (or 3) P300 ICs are selected to reconstruct the Pz waveform. This finding might indicate that 2 or 3 neural sources are responsible for the task of responding to the P stimuli. This inference deserves further study. In addition, we consider that the physiological meaning of the three parameter values k1, k2 and k3 can be interpreted as follows. A realistic P300 IC (an unknown independent P300 neural source under the scalp) should have different weights on different scalp areas. Comparing the three k values, a P300 IC has the largest weight on the P3 and P4 areas, a medium weight on Cz and the smallest weight on Oz.
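Reconstructing the Pz waveform from a few selected ICs amounts to a back-projection through the ICA mixing matrix. The sketch below shows only this generic back-projection step; it does not reproduce the SDA rule that selects the P300 ICs using k1, k2 and k3, and the indices of the selected ICs are assumed to be given. All names are hypothetical.

```python
import numpy as np


def reconstruct_channel(mixing, sources, channel, selected):
    """Back-project selected independent components onto one electrode.

    mixing:   (n_channels, n_components) ICA mixing matrix A, with X = A @ S.
    sources:  (n_components, n_samples) IC time courses S.
    channel:  row index of the electrode to reconstruct (e.g. Pz).
    selected: list of indices of the ICs to keep.
    """
    mixing = np.asarray(mixing, dtype=float)
    sources = np.asarray(sources, dtype=float)
    weights = mixing[channel, selected]           # contribution of each kept IC
    return weights @ sources[selected, :]         # shape (n_samples,)


# Example with simulated data: 14 electrodes, 14 ICs, 200 samples.
rng = np.random.default_rng(1)
A = rng.normal(size=(14, 14))
S = rng.normal(size=(14, 200))
pz_index = 5                                      # assumed position of Pz
pz_denoised = reconstruct_channel(A, S, pz_index, selected=[0, 3])
```

Keeping all 14 ICs reproduces the original channel exactly, which corresponds to the m = 14 case described as the performance without SDA.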
It is worth mentioning that, even though only the waves on the Pz were finally used to extract features, 14 electrodes were still selected to run ICA in order to guarantee the efficiency of the EICA algorithm and SDA. Using ICA has another advantage in that it can help remove the ocular artifacts automatically in the preprocessing phase [24], which few previous studies of lie detection have addressed [56][57][58]. Using SDA to remove ocular artifacts simultaneously will be investigated in the future.
It should be acknowledged that the procedure for tuning parameters in the present study is complicated and time-consuming. However, once the optimal parameter values have been selected by the grid searching method on the training sets, they are kept fixed for testing and real applications. We assumed, for example, that the parameter m represents the volume conduction feature of the neurons accounting for the P300 on the scalp, which is thought to be relatively stable spatially [31]. Other parameter optimization methods [52,59] could also be used. We will evaluate this approach in future work.
Using the presented method, only 5 Probe stimuli (together with some Target and Irrelevant stimuli) must be presented to the subject in real applications. This arrangement is attractive and promising for practical applications. Moreover, to increase the reliability of the diagnoses, the examiner could perform our testing procedure multiple times and, then, make a more accurate decision by combining several independent testing results.
The F-score, which is a simple feature-selection method, was combined with the classifiers to choose the optimal features. The F-score helps to reduce the number of features and, hence, the computational burden. More importantly, the experimental results show that it enhances the classification accuracy compared with the individual classification models, indicating the importance of feature selection for the classification performance. For the sake of simplicity, we removed redundant features using a commonly used threshold strategy. In the future, a wrapper method could be used to improve the proposed method. Different kernel functions for SVM were not tested in this study. Because the training procedure in this study is already very complex, the selection of kernel functions was omitted to keep the procedure simple. In our earlier studies [25,55], we found that the radial basis function (RBF) performed better than the other kernel functions. Hence, RBF was used directly in the SVM, given the similarity of the lie detection tasks.
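The F-score ranking with a threshold can be sketched as follows. The sketch uses the widely cited two-class F-score definition (between-class separation of the feature means divided by the sum of the within-class sample variances); whether this matches the exact variant and threshold used in this study is an assumption, and the function names and the example threshold are hypothetical.

```python
import numpy as np


def f_score(X, y):
    """Two-class F-score of every feature.

    X: (n_samples, n_features) feature matrix.
    y: (n_samples,) labels, 1 for P300 and 0 for non-P300.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    mean_all = X.mean(axis=0)
    # Between-class term: squared distance of each class mean to the overall mean.
    numerator = (pos.mean(axis=0) - mean_all) ** 2 + (neg.mean(axis=0) - mean_all) ** 2
    # Within-class term: unbiased sample variance of each class.
    denominator = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)
    return numerator / denominator


def select_features(X, y, threshold):
    """Indices of the features whose F-score exceeds the threshold."""
    return np.flatnonzero(f_score(X, y) > threshold)


# Toy example: feature 0 separates the classes, feature 1 does not.
X_toy = np.array([[0.0, 1.0], [0.1, 2.0], [1.0, 1.0], [1.1, 2.0]])
y_toy = np.array([0, 0, 1, 1])
kept = select_features(X_toy, y_toy, threshold=1.0)
```

Only the columns in `kept` would then be passed to the classifier, which reduces the input dimension before the SVM with the RBF kernel is trained.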
The proposed method is not specific to research into lie detection and could be extended to other fields of ERP classification. We believe that more sophisticated feature selection approaches, such as genetic algorithms [7,60], could further improve the performance of the classifier.