A convolutional neural network for steady state visual evoked potential classification under ambulatory environment

The robust analysis of neural signals is a challenging problem. Here, we contribute a convolutional neural network (CNN) for the robust classification of a steady-state visual evoked potentials (SSVEPs) paradigm. We measure electroencephalogram (EEG)-based SSVEPs for a brain-controlled exoskeleton under ambulatory conditions in which numerous artifacts may deteriorate decoding. The proposed CNN is shown to achieve reliable performance under these challenging conditions. To validate the proposed method, we have acquired an SSVEP dataset under two conditions: 1) a static environment, in a standing position while fixated into a lower-limb exoskeleton and 2) an ambulatory environment, walking along a test course wearing the exoskeleton (here, artifacts are most challenging). The proposed CNN is compared to a standard neural network and other state-of-the-art methods for SSVEP decoding (i.e., a canonical correlation analysis (CCA)-based classifier, a multivariate synchronization index (MSI), a CCA combined with k-nearest neighbors (CCA-KNN) classifier) in an offline analysis. We found highly encouraging SSVEP decoding results for the CNN architecture, surpassing those of other methods with classification rates of 99.28% and 94.03% in the static and ambulatory conditions, respectively. A subsequent analysis inspects the representation found by the CNN at each layer and can thus contribute to a better understanding of the CNN’s robust, accurate decoding abilities.

Of these EEG paradigms, SSVEPs have shown reliable performance in terms of accuracy and response time, even with a small number of EEG channels, at a relatively high information transfer rate (ITR) [35] and reasonable signal-to-noise ratio (SNR) [36]. SSVEPs are periodic responses elicited by the repetitive fast presentation of visual stimuli; they typically operate at frequencies between 1 and 100 Hz and can be distinguished by their characteristic composition of harmonic frequencies [33,37].
Various machine learning methods are used to detect SSVEPs: first and foremost, classifiers based on canonical correlation analysis (CCA), a multivariate statistical method for exploring the relationships between two sets of variables, can harvest the harmonic frequency composition of SSVEPs. CCA detects SSVEPs by finding the weight vectors that maximize the correlations between the two datasets. In our SSVEP paradigm, the maximum correlation extracted by CCA is used to detect the respective frequencies of the visual stimuli to which the subject attended [37]. Modified CCA-based classifiers have been introduced, such as a multiway extension of CCA [38], phase-constrained [39] and multiset [40] CCA methods. In addition, stimulus-locked intertrace correlation (SLIC) [41] and the sparsity-inducing LASSO-based method [42] have been proposed for SSVEP classification. The multivariate synchronization index (MSI) was introduced to estimate the synchronization between two signals as an index for decoding stimulus frequency [43,44]. SSVEP decoding can be further extended by employing characteristics based on phase and harmonics [35], boosting the ITRs significantly. Recently, deep-learning-based SSVEP classification methods [45][46][47] have also been considered; however, all have thus far used prestructuring by employing a Fourier transform in the CNN layer.
recordings to indicate commands such as "walk forward", "turn left", "turn right", "sit", and "stand", and were approximately 1 s in length. Note that during the experimental tasks, all LEDs were blinking simultaneously at different frequencies.
• Task 1 (Static SSVEP): The subjects were asked to focus their attention on the visual stimulus in a standing position while wearing the exoskeleton. Corresponding visual stimuli were given by auditory cue and 50 auditory cues were presented in total (10 times in each class).
• Task 2 (Ambulatory SSVEP): The subjects were asked to focus on visual stimuli while engaged in continuous walking using the exoskeleton. The exoskeleton was operated by a wireless controller, per the decoded intention of the subject. A total of 250 auditory cues were presented (50 in each class).

Neural network architectures
We now investigate three neural network architectures for SSVEP decoding: CNN-1 and CNN-2, which use convolutional kernels, and NN, a standard feedforward neural network without convolution layers. We show that CNN-1 has the best classification rate; in CNN-2, we included a fully connected layer with 3 units for visualizing feature representations as a function of the learning progress.
Input data. The acquired EEG data were preprocessed for CNN learning by band-pass filtering from 4 to 40 Hz. Then, the filtered data were segmented using a 2 s sliding window (2,000 time samples × 8 channels). The segmented data were transformed using a fast Fourier transform (FFT). Then, we used 120 samples from each channel, corresponding to 5-35 Hz. Finally, the data were normalized to the range from 0 to 1. Therefore, the input data dimension for CNN learning was 120 frequency samples ($N_{fs}$) by 8 channels ($N_{ch}$). The number of input data for training depends on the experimental task and is therefore described in the Evaluation section.
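The construction of one network input can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' code: it assumes a 1,000 Hz sampling rate (implied by 2,000 samples per 2 s window), takes a window that has already been band-pass filtered, and assumes zero-padding to 4,000 FFT points so that 120 bins of 0.25 Hz cover 5 to 35 Hz (the text does not state how the 120 samples were obtained). The function name `spectral_input` and the per-window min-max scaling are also illustrative assumptions.

```python
import numpy as np

FS = 1000      # sampling rate in Hz (assumption implied by 2,000 samples per 2 s)
N_FFT = 4000   # ASSUMPTION: zero-padding to 4,000 points gives 0.25 Hz bins


def spectral_input(window, fs=FS, n_fft=N_FFT, f_lo=5.0, n_bins=120):
    """Map one band-pass-filtered EEG window (samples x channels) to the
    (n_bins x channels) magnitude-spectrum input described in the text."""
    # Magnitude spectrum per channel; rfft zero-pads each channel to n_fft.
    spectrum = np.abs(np.fft.rfft(window, n=n_fft, axis=0))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    # Keep n_bins consecutive bins starting at f_lo (5 Hz up to 34.75 Hz here).
    start = int(np.searchsorted(freqs, f_lo))
    selected = spectrum[start:start + n_bins]
    # Min-max normalization to [0, 1]; scaling per window is an assumption.
    lo, hi = selected.min(), selected.max()
    scale = hi - lo if hi > lo else 1.0
    return (selected - lo) / scale, freqs[start:start + n_bins]
```

With these assumptions, a 2,000 × 8 window yields a 120 × 8 array, matching the input dimension $N_{fs} \times N_{ch}$ given above.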
Network architecture overview. The CNN-1 network has three layers, each composed of one or several maps that contain frequency information for the different channels (similar to [55]). The input layer is defined as $I_{p,j}$ with $1 \le p \le N_{fs}$ and $1 \le j \le N_{ch}$; here, $N_{fs} = 120$ is the number of frequency samples and $N_{ch} = 8$ is the number of channels. The first and second hidden layers are composed of $N_{ch}$ maps. Each map in $C_1$ has size $N_{fs}$; each map in $C_2$ is composed of 110 units. The output layer has 5 units, which represent the five classes of the SSVEP signals. This layer is fully connected to $C_2$, as in Fig 4.
The CNN-2 network is composed of four layers. The input layer is defined as $I_{p,j}$ with $1 \le p \le N_{fs}$ and $1 \le j \le N_{ch}$. The first and second hidden layers are composed of $N_{ch}$ maps. Each map in $C_1$ has size $N_{fs}$. Each map in $C_2$ has 110 units. To this point, CNN-2 is equivalent to CNN-1. The difference comes in the third hidden layer $F_3$, which is fully connected and consists of 3 units. Each unit is fully connected to $C_2$. The output layer has 5 units that represent the five classes of SSVEP. This layer is fully connected to $F_3$. The 3 units in $F_3$ are used to visualize the properties of the representation that CNN-2 has learned, as depicted in Fig 5.
The standard NN is composed of three layers. For the input layer, we concatenated the 120 by 8 input into a 960-unit vector. The first hidden layer is composed of 500 units, the second has 100 units, and the output layer has 5 units to represent the five classes. All layers are fully connected, as in Fig 6.
Learning. A unit in the network is defined by $x_k^l(p) = f(\sigma_k^l(p))$, where $l$ is the layer, $k$ is the map, $p$ is the position of the unit in the map, and $f$ is the classical sigmoid function used for the layers, $f(\sigma) = 1 / (1 + e^{-\sigma})$. Here, $\sigma_k^l(p)$ represents the scalar product of a set of input units and the weight connections between these units and unit number $p$ in map $k$ in layer $l$.
For $C_1$ and $C_2$, which are convolutional layers, each unit of a map shares the same set of weights. The units of these layers are connected to a subset of units fed by the convolutional kernel from the previous layer. Instead of learning one set of weights for each unit, where the weights depend on the unit position, the weights are learned independently of the position of their corresponding output unit. $L_3$ is the output layer in CNN-1 and $L_4$ is the output layer in CNN-2.
• CNN-1
- For $C_1$: $\sigma_k^1(p) = w(1,k,0) + \sum_{j=1}^{N_{ch}} I_{p,j}\, w(1,k,j)$, where $w(1,k,0)$ is a bias and $w(1,k,j)$ is a set of weights with $1 \le j \le N_{ch}$. In this layer, there are $N_{ch}$ weights for each map. The convolution kernel has a size of 1 × $N_{ch}$.
- For $C_2$: $\sigma_k^2(p) = w(2,k,0) + \sum_{i=1}^{11} x_k^1(p+i-1)\, w(2,k,i)$, where $w(2,k,0)$ is a bias. This layer transforms the signal of 120 units into 110 new values in $C_2$, reducing the size of the signal to analyze while applying an identical linear transformation to the 110 units of each map. This layer acts as a set of spectral filters. The convolution kernel has a size of 11 × 1.
- For $L_3$: $w(3,0,p)$ is the bias of unit $p$. Each unit of $L_3$ is connected to each unit of $C_2$.
• CNN-2
- $C_1$ and $C_2$ are the same as in CNN-1.
- For $F_3$: $w(3,0,p)$ is the bias of unit $p$. Each unit of $F_3$ is connected to each unit of $C_2$.
- For $L_4$: $w(4,0,p)$ is the bias of unit $p$. Each unit of $L_4$ is connected to each unit of $F_3$.
The gradient descent learning algorithm uses standard error backpropagation to correct the network weights [56][57][58]. The learning rate was 0.1 and weights were initialized with a normal distribution on the interval $[-\sqrt{6/(N_{in}+N_{out})}, \sqrt{6/(N_{in}+N_{out})}]$, where $N_{in}$ is the number of input weights and $N_{out}$ is the number of output weights, following [58]. The number of learning iterations was 50, but training stopped once the decrease in error rate was smaller than 0.5% after 10 iterations.
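To make the layer sizes concrete, the following is a minimal NumPy sketch of a CNN-1 forward pass. It is not the authors' implementation. Several details are assumptions made only for illustration: that map $k$ of $C_2$ reads only map $k$ of $C_1$, that biases start at zero, that weights are drawn uniformly from the stated interval, and the fan-in/fan-out counts passed to the initializer. The names `init_cnn1` and `forward_cnn1` are hypothetical.

```python
import numpy as np

N_FS, N_CH, K, N_CLASSES = 120, 8, 11, 5   # sizes stated in the text
N_C2 = N_FS - K + 1                        # 110 units per C2 map


def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))


def init_uniform(rng, shape, n_in, n_out):
    # ASSUMPTION: uniform draw on [-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out))].
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=shape)


def init_cnn1(rng):
    # Fan-in/fan-out counts below are illustrative assumptions.
    return {
        "w1": init_uniform(rng, (N_CH, N_CH), N_CH, N_CH),   # C1: map k, channel j
        "b1": np.zeros(N_CH),
        "w2": init_uniform(rng, (N_CH, K), K, 1),            # C2: map k, kernel tap i
        "b2": np.zeros(N_CH),
        "w3": init_uniform(rng, (N_CLASSES, N_CH, N_C2), N_CH * N_C2, N_CLASSES),
        "b3": np.zeros(N_CLASSES),
    }


def forward_cnn1(params, x):
    """x has shape (N_FS, N_CH): frequency samples by channels."""
    # C1: the 1 x N_ch kernel mixes channels at each frequency bin -> (N_CH, N_FS).
    c1 = sigmoid(params["w1"] @ x.T + params["b1"][:, None])
    # C2: the 11 x 1 kernel slides along frequency within each map -> (N_CH, 110).
    taps = np.lib.stride_tricks.sliding_window_view(c1, K, axis=1)
    c2 = sigmoid(np.einsum("kpi,ki->kp", taps, params["w2"]) + params["b2"][:, None])
    # L3: every output unit is connected to every unit of C2 -> (5,).
    out = sigmoid(np.einsum("ckp,kp->c", params["w3"], c2) + params["b3"])
    return c1, c2, out
```

Because the kernel weights are shared along each map, the two convolutional layers in this sketch hold far fewer parameters than the fully connected output layer, which is the weight-sharing effect described above.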
For each classifier, we compute the 10-fold cross-validation error, splitting the data chronologically (a common method in EEG classification) to preserve the data's non-stationarity and avoid overfitting [27,59]. For the test data, both datasets (50 trials of 5 s for static SSVEPs and 250 trials for ambulatory SSVEPs, randomly permuted) were segmented using a 2 s sliding window with a 10 ms shift size, segmenting each 5 s trial into three hundred 2 s segments. As a result, there were 1,500 static and 7,500 ambulatory SSVEP test data points in each fold.
Deep neural networks generally show higher performance for larger amounts of data [53]. Hence, we tested the classifiers with different training data sizes; in particular, different segmentations of the data were considered. Using a 2 s sliding window with different shift sizes (60, 30, 20, 15, 12, and 10 ms), we segmented each trial into 50, 100, 150, 200, 250, and 300 data samples, respectively. Thus, there were 2,250, 4,500, 6,750, 9,000, 11,250, and 13,500 training data for the static SSVEPs, and 11,250, 22,500, 33,750, 45,000, 56,250, and 67,500 for the ambulatory SSVEPs.
Note that although we used a small shift size, there was no overlap between training and test data, in order to prevent overfitting. The CCA method does not require a training phase. Thus, we only show its results on test data.
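The data counts quoted above follow from simple arithmetic, sketched below. The helper names are hypothetical, and the segment count is taken as (trial length minus window length) divided by the shift, which is the convention that reproduces the 300 segments per trial reported in the text.

```python
SHIFTS_MS = (60, 30, 20, 15, 12, 10)


def n_segments(shift_ms, trial_ms=5000, window_ms=2000):
    # Convention matching the text: a 10 ms shift yields 300 segments per 5 s trial.
    return (trial_ms - window_ms) // shift_ms


def dataset_sizes(n_trials, train_shift_ms, n_folds=10, test_shift_ms=10):
    """Return (training, test) sample counts for one cross-validation fold."""
    test_trials = n_trials // n_folds
    train_trials = n_trials - test_trials
    return (train_trials * n_segments(train_shift_ms),
            test_trials * n_segments(test_shift_ms))


static_train = [dataset_sizes(50, s)[0] for s in SHIFTS_MS]
ambulatory_train = [dataset_sizes(250, s)[0] for s in SHIFTS_MS]
```

Here `static_train` and `ambulatory_train` reproduce the six training-set sizes listed for each task, with 45 and 225 training trials per fold, respectively.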
We now briefly describe the CCA, CCA-KNN, and MSI methods. CCA is a multivariate statistical method [60,61] that finds a pair of linear combinations of two sets of variables, $X$ and $Y$, such that the correlation between the resulting canonical variates is maximized. As $X(t)$, we chose 2 s EEG windows; as $Y_i(t)$, we use reference signals at the five stimulus frequencies ($f_1 = 9$, $f_2 = 11$, ..., $f_5 = 17$ Hz) of the five visual stimuli [14], evaluated at $t = 1/S, 2/S, \ldots, T/S$, where $T$ is the number of sampling points and $S$ denotes the sampling rate. CCA finds weight vectors, $W_x$ and $W_y$, that maximize the correlation between the canonical variates $x = X^T W_x$ and $y = Y^T W_y$, by solving
$\max_{W_x, W_y} \rho(x, y) = \frac{E[x^T y]}{\sqrt{E[x^T x]\, E[y^T y]}} = \frac{E[W_x^T X Y^T W_y]}{\sqrt{E[W_x^T X X^T W_x]\, E[W_y^T Y Y^T W_y]}}$.
The maximum $\rho$ with respect to $W_x$ and $W_y$ is the maximum canonical correlation. The canonical correlation $\rho_{f_i}$, where $i = 1, \ldots, 5$, is used for detecting the frequency of the LED that a subject is attending by selecting the class $O_{i^*}$ with $i^* = \arg\max_i \rho_{f_i}$, where $O_i$ are the output classes corresponding to the five visual stimuli.
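As an illustration of this decision rule, the following NumPy sketch computes the maximum canonical correlation between an EEG window and sine-cosine reference signals for each stimulus frequency and picks the largest. It is a sketch under assumptions, not the authors' code: the number of harmonics (`n_harmonics=2`), the 1,000 Hz sampling rate, and all function names are illustrative. The canonical correlations are obtained from a QR decomposition followed by an SVD, which is one standard way to compute them.

```python
import numpy as np


def reference_signals(freq, n_samples, fs, n_harmonics=2):
    """Sine-cosine references at freq and its harmonics (harmonic count assumed)."""
    t = np.arange(1, n_samples + 1) / fs          # t = 1/S, 2/S, ..., T/S
    columns = []
    for h in range(1, n_harmonics + 1):
        columns.append(np.sin(2 * np.pi * h * freq * t))
        columns.append(np.cos(2 * np.pi * h * freq * t))
    return np.column_stack(columns)               # shape (T, 2 * n_harmonics)


def max_canonical_correlation(x, y):
    """Largest canonical correlation between column sets x (T x p) and y (T x q)."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    qx, _ = np.linalg.qr(x)                       # orthonormal basis of span(x)
    qy, _ = np.linalg.qr(y)                       # orthonormal basis of span(y)
    s = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return float(min(s[0], 1.0))                  # guard against rounding above 1


def cca_classify(eeg, freqs=(9, 11, 13, 15, 17), fs=1000):
    """Return the index of the stimulus frequency with the largest correlation."""
    rhos = np.array([
        max_canonical_correlation(eeg, reference_signals(f, eeg.shape[0], fs))
        for f in freqs
    ])
    return int(np.argmax(rhos)), rhos
```

No training data enter this procedure, which is why the CCA results are reported on test data only.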
For CCA-KNN [14], the set of canonical correlations ($\rho = (\rho_{f_1}, \ldots, \rho_{f_5})$) is used as a feature vector for subsequent KNN classification, each with a class label. In the training step, the algorithm consists only of storing the feature vectors and class labels of the training samples. In classification, an unlabeled vector is assigned the label that is most frequent among its k nearest training samples, where the Euclidean distance is used as the distance metric.
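The KNN step can be sketched in a few lines. This is a generic majority-vote implementation with Euclidean distance, given only as an illustration; the function name `knn_predict` and the tie-breaking behavior (the label encountered first among the nearest neighbors wins) are assumptions, not details from the source.

```python
from collections import Counter

import numpy as np


def knn_predict(train_x, train_y, query, k=3):
    """Label a feature vector by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every stored training vector.
    dists = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(dists, kind="stable")[:k]
    # Count labels among the neighbors, visiting the closest neighbor first.
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

In the CCA-KNN setting, each row of `train_x` would be a five-dimensional vector of canonical correlations and `train_y` the corresponding stimulus class.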
For MSI [43,44], the S-estimator, based on the entropy of the normalized eigenvalues of the correlation matrix of multivariate signals, was used as the index. Thus, MSI creates a reference signal from the stimulus frequencies used in an SSVEP-based BCI system similarly to CCA.
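A compact sketch of an S-estimator-based synchronization index is given below. It follows one common formulation (whitening the joint correlation matrix of EEG and reference signals, then computing one minus the normalized entropy of its eigenvalues) and is not taken from the authors' code. The reference construction with two harmonics, the 1,000 Hz sampling rate, and all function names are assumptions for illustration.

```python
import numpy as np


def sincos_reference(freq, n_samples, fs, n_harmonics=2):
    """Sine-cosine reference signals (harmonic count is an assumption)."""
    t = np.arange(1, n_samples + 1) / fs
    cols = [fn(2 * np.pi * h * freq * t)
            for h in range(1, n_harmonics + 1) for fn in (np.sin, np.cos)]
    return np.column_stack(cols)


def inv_sqrt(c):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, v = np.linalg.eigh(c)
    return v @ np.diag(1.0 / np.sqrt(w)) @ v.T


def msi(x, y):
    """Synchronization index between x (T x channels) and y (T x references)."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    m = x.shape[0]
    c11, c22, c12 = x.T @ x / m, y.T @ y / m, x.T @ y / m
    # Whitening removes within-set correlations, leaving only cross-set structure.
    cross = inv_sqrt(c11) @ c12 @ inv_sqrt(c22)
    n, r = c11.shape[0], c22.shape[0]
    joint = np.block([[np.eye(n), cross], [cross.T, np.eye(r)]])
    lam = np.linalg.eigvalsh(joint)
    lam = lam / lam.sum()                 # normalized eigenvalues
    lam = lam[lam > 1e-12]                # drop numerical zeros before the log
    return 1.0 + float(np.sum(lam * np.log(lam))) / np.log(n + r)


def msi_classify(eeg, freqs=(9, 11, 13, 15, 17), fs=1000):
    scores = np.array([msi(eeg, sincos_reference(f, eeg.shape[0], fs)) for f in freqs])
    return int(np.argmax(scores)), scores
```

As with CCA, the stimulus frequency whose reference signals give the largest index is selected, and no training phase is involved.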

Results and discussion
EEG signals are highly variable across subjects and experimental environments (see Figs 5 and 6 and Tables 1 and 2 in [14]). The SSVEP signals acquired from the static exoskeleton show more pronounced frequency information than in the ambulatory environment. In the static SSVEP, we can observe the increased frequency components that are visible at the stimulus frequency. In the ambulatory SSVEP, however, because of the higher artifactual content, this effect becomes less clearly visible (see S1 Fig for selected input and average data under both conditions).

Static SSVEP
In Table 1, we show the 10-fold cross-validation results for 13,500 training data validated on 1,500 test data points for all subjects. CNN-1 showed the best classification accuracy among the classifiers for all subjects. For low-performing subjects, with a CCA accuracy under 80% in the ambulatory SSVEP (i.e., subjects S3-7, see Table 3), the neural network results stayed robust. Clearly, the CCA method exhibits significantly lower performance.
With fewer training data (see Table 2 and Fig 7 (top) for the 10-fold cross-validation results of the static task), we observe a decaying performance for the neural networks, which is to be expected. Note that the accuracies of the CCA and MSI methods stay essentially constant as a function of the amount of training data: no training phase is required, because the canonical correlations and synchronization indices with the reference signals are simply computed and the maximum value is selected. The CCA-KNN classifier was trained for k = 1, 3, 5, and 7, and k was selected on the training set to achieve the best accuracy. Fig 7 (top) presents the average accuracy of each classifier as a function of the number of training data for all subjects (a) and low-performing subjects (b). Statistical analysis of these results shows a significant improvement with larger training data sizes for the neural network classifiers. We provide more information on the difference between CNN-1 and the other methods for all subjects and low-performing subjects in S2(a) and S2(b) Fig, respectively. CNN-1 outperforms the other classifiers; however, CCA-KNN shows better classification results for 4,500 training data samples or fewer, as we can see from the positive values in brackets.
Fig 8(a) shows the decoding variability of the individual subjects at the minimum (dashed line) and maximum (solid line) number of data samples; here, a diamond indicates an increase of 5% or more in classification rates. Clearly, all subjects achieved increased accuracies with the neural network models. Specifically, low-performing subjects (S3-7) show a higher increase than the other subjects (S1 and S2). With the CCA-KNN method, only subjects S2 and S3 show an increase.
Table 3 reports the 10-fold cross-validation results for the ambulatory SSVEP setup when the number of training data is 67,500, with 7,500 test data for all subjects. CNN-1 showed the best classification accuracy among the classifiers for all subjects.
Even the low-performing subjects showed the highest accuracy with CNN-1. The competing classifier models showed a more pronounced performance deterioration owing to the higher artifact presence in the ambulatory environment (see Figs 5 and 6 in [14]). Table 4 and Fig 7 (bottom) present the 10-fold cross-validation results for the ambulatory SSVEP classification as a function of a changing number of training data with 7,500 test data. In particular, Fig 7 (bottom) shows the averaged accuracy of each classifier with increasing training data for all subjects (c) and low-performing subjects (d). One asterisk indicates the 5% significance level (compared to 67,500 training data samples), whereas two asterisks denote the 1% significance level. As expected, the analysis confirms the performance gains of the neural networks as the amount of training data increases, even if the data contain large artifacts. Fig 8(b) confirms the findings of the static setting for the more artifact-prone ambulatory setting.

Ambulatory SSVEP
With CNN-1, the ambulatory SSVEP requires more training data than the static SSVEP to reach high accuracy (classification performance of more than 90% at 56,250 training data samples). For the static SSVEP setup, 96.20% accuracy could be achieved using only 2,250 training data and 99.28% accuracy was achieved using 13,500 samples. In contrast, 81.40% accuracy and 94.03% accuracy were achieved in the ambulatory condition when 11,250 and 67,500 training data were used, respectively. Fig 9 shows the learning curves of subjects S2 (black) and S4 (red) in the static (solid line) and ambulatory (dashed line) SSVEP environments for CNN-1. The learning iteration of subject S2 stops at the 13th and 12th epochs, whereas the iteration of subject S4 stops at the 19th and 12th epochs in the two datasets (subject S2 records the best performance with CNN-1, and subject S4 has the lowest performance in the ambulatory SSVEP). The appearance of the kernels differs for each individual because the network training is subject-dependent. Unfortunately, there is no obvious and simple interpretation linked to
The darker circles indicate more training data. As more training data were given, we observed that the performance of CNN-1 increased consistently and that the increase was more pronounced under the ambulatory condition (more black circles are located on the right side). However, individual CCA-KNN decoding accuracies stay relatively stable, meaning that the accuracy of CCA-KNN is almost independent of the amount of training data in our experimental conditions.

Feature representation
Analyzing and understanding classification decisions in neural networks is valuable in many applications, as it allows the user to verify the system's reasoning and provides additional information [54,62]. Although deep learning methods are very successful in solving various pattern recognition problems, in most cases they act as a black box, not providing any information about why a particular decision was made. Hence, we present the feature representation from the CNN-2 architecture. The averaged features of each layer using static and ambulatory SSVEPs are shown in Figs 11 and 12, respectively. In both cases, the networks focus on the stimulus frequency components. For learning in layer $C_1$, we used a 1 × $N_{ch}$ convolutional kernel, which can give channel-wise (spatial) weights. The $C_2$ layer used an 11 × 1 convolutional kernel to detect frequency (spectral) information. The frequency components that were most discriminated by the convolutional layers are highlighted using black-lined boxes. With the exception of the 17 Hz class, the corresponding stimulus frequencies were reinforced through iterative training. We conjecture that the absence of second harmonics (34 Hz) for the 17 Hz SSVEPs results either from their low magnitude when compared with lower frequencies or from their lying outside the boundary of the ranges in the $C_2$ layer. In the second convolutional layer, the patterns were spread out (and slightly smoothed) when compared to the first convolutional layer. The $F_3$ layer is composed of three units, which we plotted with each unit as an axis direction. The 3D plot shows that all classes are well separated. Therefore, we can conclude that the CNN architecture is able to appropriately extract the meaningful frequency information of SSVEP signals. To compare the feature distributions with CCA-KNN, we show a scatter plot using CCA-KNN in S4 Fig. The features were extracted with CCA and classified using KNN with k = 3 for subject S6 (85%).
Test data were plotted on the $\rho_{f_1}$, $\rho_{f_2}$, and $\rho_{f_3}$ axes. Blue, red, green, black, and cyan circles indicate 9 Hz, 11 Hz, 13 Hz, 15 Hz, and 17 Hz, respectively. Note that the feature dimension is actually 5 (the number of classes); we therefore used only the projection onto $\rho_{f_1}$, $\rho_{f_2}$, and $\rho_{f_3}$ to visualize the feature distributions in the plot. Even so, the classes are clearly not as well separated as with CNN-2.

Conclusion
BMI systems have shown great promise, though significant effort is still required to bring neuroprosthetic devices from the laboratory into the real world. In particular, further advancement in the robustness of brain signal processing techniques is needed [63,64]. In this context, constructing reliable BMI-based exoskeletons is a difficult challenge owing to the various complex artifacts spoiling the EEG signal. These artifacts may be induced differently depending on subject population and may in particular be caused by suboptimal EEG measurements or broadband distortions due to movement of the exoskeleton. For example, while walking in the exoskeleton a subject's head may move, which can give rise to swinging movements in the line between the electrodes and EEG amplifiers, leading to disconnections or high impedance measurements. Furthermore, significant challenges still exist in the development of a lower-limb exoskeleton that can integrate with the user's neuromusculoskeletal system. Although these limitations exist, a brain-controlled exoskeleton may eventually be helpful for end-user groups.
The current study took a step toward more robust SSVEP-BMI classification. Despite the challenges imposed on signal processing by a lower-limb exoskeleton in an ambulatory setting, our proposed CNN exhibited promising and highly robust decoding performance for SSVEP signals. The neural network model was successfully evaluated offline against standard SSVEP classification methods on SSVEP datasets from static and ambulatory tasks. The three neural networks (CNN-1, CNN-2, and NN) showed increased performance in both environments when sufficient training data were provided. CNN-1 outperformed all other methods; the best accuracies achieved by CNN-1 were 99.28% and 94.03% in static and ambulatory conditions, respectively. Other methods (CCA-KNN, NN, CNN-2) showed high accuracy in the static environment, but CNN-1 recorded the smallest performance deterioration for the ambulatory SSVEP task. CNN-1's complexity is low because it has a comparatively simple structure (few layers, maps, and units) and the weights in the convolution layers are shared for every unit within one map, effectively reducing the number of free parameters in the network. Our application is far from being data rich ($N \le 67{,}500$); therefore, we adopted neither a pre-trained model, nor dropout, nor pooling, yet our relatively simple architecture worked efficiently after a brief training period. Overall, the proposed method has advantages for real-time usage and is highly accurate in the ambulatory conditions. Furthermore, our method can increase in accuracy with more data, if available. Note that we consider subject-dependent classifiers for decoding, which reflects the fact that individuals possess highly different patterns in their brain signals. From the kernel analysis, we therefore found, as expected, that the convolutional kernels were different for each individual. We also demonstrate the feature representations, as implemented using a bottleneck layer in CNN-2.
The CNN classifiers could determine the most discriminative frequency information for classification, nicely matching the stimulus frequencies of the respective SSVEP classes.
So far, our study has tested the performance of the CNN classifiers only on offline data. Future work will develop a real-time CNN system that can control a lower-limb exoskeleton based on the proposed method and evaluate its performance with healthy volunteers as well as with end-user groups to investigate its use in gait rehabilitation. We will also investigate subject-independent classification using CNNs. A subject-independent CNN-based classifier may be a more efficient system because it could reduce long training times.
Features were extracted with CCA and classified using KNN with k = 3 for subject S6. Test data were plotted along the $\rho_{f_1}$, $\rho_{f_2}$, and $\rho_{f_3}$ axes. Blue, red, green, black, and cyan are 9, 11, 13, 15, and 17 Hz, respectively. (PDF)