A comparison of stimulus types in online classification of the P300 speller using language models

The P300 Speller is a common brain-computer interface communication system. There are many parallel lines of research underway to overcome the system’s low signal-to-noise ratio and thereby improve performance, including using famous face stimuli and integrating language information into the classifier. While both have been shown separately to provide significant improvements, the two methods have not yet been implemented together to demonstrate that the improvements are complementary. The goal of this study is therefore twofold. First, we aim to compare the famous faces stimulus paradigm with an existing alternative stimulus paradigm currently used in commercial systems (i.e., character inversion). Second, we test these methods with language model integration to assess whether different optimization approaches can be combined to further improve BCI communication. In offline analysis using a previously published particle filter method, famous faces stimuli yielded superior results to both standard and inverting stimuli. In online trials using the particle filter method, all 10 subjects achieved a higher selection rate when using the famous faces flashing paradigm than when using inverting flashes. The improvements achieved by these methods are therefore complementary and a combination yields superior results to either method implemented individually when tested in healthy subjects.


Introduction
The P300 Speller is a common brain-computer interface (BCI) system that provides a means of communication for patients with high brain stem injuries or motor neuron diseases such as amyotrophic lateral sclerosis (ALS) [1]. The system relies on electroencephalogram (EEG) detection of evoked responses to rare target stimuli to identify intended letters for communication. Because the signal-to-noise ratio (SNR) is low, several trials must be combined in order to correctly classify responses. The resulting typing speed can therefore be slow, prompting many studies focused on system optimization. Approaches include varying the grid size [2][3][4], optimizing interstimulus interval (ISI) [5,6], and adopting different signal processing methods [7][8][9][10].
One active area of research has been to modify the type of visual stimulus used. In the original system, the character grid is gray and the intensified characters are changed to white. However, other types of visual stimuli could potentially elicit stronger P300 or other stimulus evoked responses, and several studies have aimed to show superior flashing methods by using character motion [11], modifying character size and sharpness [11], changing stimulus colors [12], varying the grid layout [13], or increasing stimulus contrast [14]. The most successful stimulus to date has been the presentation of "famous faces" [15]. In this system, stimuli consist of overlaying characters with images of a famous face. This method is based on previous findings that face recognition elicits two evoked responses in addition to the P300: the N170 and N400f [16]. By incorporating face images, the response signals elicited are more salient, leading to a reduction in the number of stimuli required for perfect accuracy by over 45%, greatly improving typing speed [15]. While the improvement using "famous faces" was significant over the traditional system, to our knowledge it has not been compared to other alternative stimuli. Moreover, while it has been validated online [17], that validation used only a traditional classifier and therefore does not reflect the true performance of an online BCI system using state of the art classification methods.
Separately, recent work has involved the incorporation of language information into the signal classifier [18]. This movement in BCI research integrates knowledge about the domain of natural language to improve classification, similar to methods used in other domains such as speech recognition [19]. Several BCI studies have shown incremental improvements in system speed and accuracy using n-gram language models, first using naïve Bayes [20,21] and later using a partially observable Markov decision process [22] and a hidden Markov model [23,24]. Recently, a particle filter (PF) algorithm was introduced which allowed for the use of more complicated language models to further improve results [25]. This method approximates distributions by projecting samples through a state-space language model based on the observed EEG signals [26]. The system then determines the most likely output by finding the state that attracts the highest number of samples. In offline trials, this method yielded an increase in typing speed from 5.87 characters/minute to 8.70 characters/minute over a system without language model integration.
While both famous faces stimuli and language model integration have been shown separately to provide significant improvements, the two methods have not yet been implemented together to demonstrate that the improvements are complementary. It is conceivable, for instance, that SNR could be improved to the point where perfect classification would be possible from the signal alone and adding a bias based on prior knowledge would not provide any benefit. It is necessary to test these methods together in order to verify that the combination is indeed better than the individual components.
The goal of this study is therefore twofold. First, we aim to compare the famous faces stimulus paradigm with an existing alternative stimulus paradigm currently used in commercial systems such as the Intendix speller (Guger Technologies, Graz, Austria). This comparison is necessary because, while the superiority of the famous faces paradigm over traditional stimuli has been previously established, it has not been compared to other paradigms that are in current use. Second, we test these methods with language model integration to see if the advances reported in these two research areas can be combined to further improve BCI communication. We hypothesized that using famous face stimuli would increase the speed and accuracy of the P300 speller system over other stimulus paradigms and that incorporating both famous face stimuli and a language model classifier would yield performance superior to either method individually.

Data collection
All data was acquired using g.tec amplifiers, active EEG electrodes, and an electrode cap (Guger Technologies, Graz, Austria). Signals were sampled at 256 Hz, referenced to the left ear, grounded to AFz, and filtered using a band-pass of 0.1-60 Hz. The electrode set consisted of 32 channels placed according to a previously published configuration (Fpz, Fz, FC1, FCz, FC2, FC4, FC6, C4, C6, CP4, CP6, FC3, FC5, C3, C5, CP3, CP5, CP1, P1, Cz, CPz, Pz, POz, CP2, P2, PO7, PO3, O1, Oz, O2, PO4, PO8) [5]. The system used a 6 × 6 character grid, row and column flashes, and a stimulus onset asynchrony of 125 ms (consisting of a 100 ms flash duration and a 25 ms interstimulus interval). After each stimulus, the next 600 ms of data from each of the 32 channels were used as features for classification.
This research was approved by the University of California, Los Angeles institutional review board (IRB), IRB#11-002062. Written consent was obtained from all subjects using a consent form approved by the IRB. The subjects in this study consisted of 25 healthy volunteers with normal or corrected to normal vision between the ages of 20 and 35. Fifteen of the subjects participated in a preliminary study comparing the inverting and non-inverting paradigms and the remaining 10 used the inverting and famous faces paradigms.
For each of the stimulus paradigms, training consisted of three sessions of copy spelling 10-character phrases. The approaches were counterbalanced across subjects to account for possible order or fatigue effects. In the main experiment, each subject then chose a target phrase to spell in online sessions, during which the subject had five minutes to spell as much of the phrase as they could using both stimulus paradigms. Subjects were instructed not to correct errors and to repeat the phrase if they completed it in under five minutes.
The training data was then analyzed retrospectively using three-fold cross-validation to provide an additional offline comparison of results using the two stimulus paradigms when using classifiers with and without a language model. BCI2000 was used for data acquisition and online analysis [27]. Offline analysis was performed using MATLAB (version 8.6.0, MathWorks, Inc, Natick, MA).
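The acquisition and timing parameters in this section fix the size of each classification epoch and the duration of one full pass over the grid. The short Python sketch below works through that arithmetic. It assumes that a partial sample at the end of the 600 ms window is rounded up and that no decimation is applied before classification; neither detail is stated above, and the helper names are illustrative only.

```python
import math

FS_HZ = 256            # sampling rate reported above
N_CHANNELS = 32        # electrodes in the montage
EPOCH_MS = 600         # post-stimulus window used for features
SOA_MS = 125           # 100 ms flash + 25 ms interstimulus interval
FLASHES_PER_PASS = 12  # 6 rows + 6 columns of the 6 x 6 grid

# Assumption: a partial sample at the end of the window is rounded up.
samples_per_epoch = math.ceil(EPOCH_MS * FS_HZ / 1000)

# Raw feature count per stimulus before any feature selection.
raw_features = samples_per_epoch * N_CHANNELS

# Time needed to flash every row and every column once.
pass_seconds = FLASHES_PER_PASS * SOA_MS / 1000


def epoch_bounds(onset_sample):
    """Return the half-open sample range [start, stop) following a flash onset."""
    return onset_sample, onset_sample + samples_per_epoch


def overlapping_flashes():
    """Number of later flash onsets that fall inside one 600 ms epoch."""
    return math.ceil(EPOCH_MS / SOA_MS) - 1
```

Because the epoch is longer than the stimulus onset asynchrony, consecutive epochs overlap, so each response window also contains activity evoked by the next few flashes.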

Interface
Three stimulus types are compared in this study. The first method is the standard method, consisting of highlighting flashed characters by "intensifying" the font color to white (Fig 1A) [1]. The second method is letter inversion, or changing the background to white and the character to black (Fig 1B). The third method overlays the character with an image of a face as proposed by Kaufmann and colleagues (Fig 1C) [15]. As in the Kaufmann study, the image of Einstein was used in this method.

Classifier
Feature selection for classification uses stepwise linear discriminant analysis (SWLDA), a classification algorithm that selects a set of signal features using ordinary least-squares regression [23]. It iteratively adds significant features and removes the least significant features until either the target number of features is met or a state is reached where no features are added or removed [10]. A score, $y_t$, for a stimulus response is then determined by taking the dot product of the feature vector with the associated EEG signal. Using the score means and variances for target ($\mu_a$ and $\sigma_a^2$) and non-target ($\mu_n$ and $\sigma_n^2$) signals, the likelihood of a signal given a target character, $x_t$, can be determined [21]:
$$f(y_t \mid x_t) = \begin{cases} \dfrac{1}{\sqrt{2\pi\sigma_a^2}} \exp\left(-\dfrac{(y_t-\mu_a)^2}{2\sigma_a^2}\right) & \text{if } x_t \text{ was among the flashed characters} \\ \dfrac{1}{\sqrt{2\pi\sigma_n^2}} \exp\left(-\dfrac{(y_t-\mu_n)^2}{2\sigma_n^2}\right) & \text{otherwise} \end{cases}$$
The PF method combines these likelihood probabilities with prior knowledge about language structure to decide the optimal character given the observed signal by estimating the probability distribution over possible outputs [26]. This distribution is created by sampling a batch of possible realizations of the model called particles, which move through states in the language model independently, based on transition probabilities. After each character selection, particles are resampled based on weights derived from observed EEG responses, effectively removing low probability realizations and replacing them with more likely realizations. The algorithm then estimates a probability distribution over the possible output strings by computing a histogram of the particles after they have moved through the model.
When a user begins using the system, a set of $P$ particles is generated with an empty history and a weight equal to $1/P$. At the start of a new character $t$, a sample $x_{0:t}^{(j)}$ is drawn for each particle, $j$, from the proposal distribution defined by the language model's transition probabilities from the particle's history, $x_{0:t-1}^{(j)}$:
$$x_{0:t}^{(j)} \sim p\left(x_{0:t} \mid x_{0:t-1}^{(j)}\right)$$
where $p(x_{0:t} \mid x_{0:t-1})$ is defined from the language model by finding the frequency of occurrence of substrings in an underlying corpus:
$$p(x_{0:t} \mid x_{0:t-1}) = \frac{c(x_0,\ldots,x_{t-1},x_t)}{c(x_0,\ldots,x_{t-1})}$$
where $c(x_0,\ldots,x_{t-1},x_t)$ refers to the number of times a word occurs in the corpus that begins with the string '$x_0,\ldots,x_{t-1},x_t$'. When a particle transitions between states, its history, $x_{0:t}^{(j)}$, is stored to represent the output character sequence associated with that particle. After each stimulus response, the probability weight of each particle is updated using the likelihood of the observed score:
$$w_t^{(j)} \leftarrow w_t^{(j)} \, f\left(y_t \mid x_t^{(j)}\right)$$
The weights are then normalized and the probability of the current character is found by summing the weights of all particles that end in that character:
$$p(x_t \mid y_t) = \sum_{j=1}^{P} w_t^{(j)} \, \delta\left(x_t, x_t^{(j)}\right)$$
where $\delta$ is the Kronecker delta. A new batch of particles, $x_t^{*}$, is then sampled from the current particles, $x_t$, based on the weight distribution, $w_t$. Each of the new particles is then assigned an equal weight, $w_t^{*(j)} = 1/P$. The subject then moves on to the next character and the process repeats with the new batch of particles.
Dynamic classification was implemented by setting a threshold probability to determine when a decision should be made. The program flashed characters until either the probability of at least one character reached the threshold or the number of flashes reached the maximum (120). The classifier then selected the character that had the highest probability. In offline analysis, the speeds, accuracies, and correct characters per minute (CCPM; see Evaluation) were found for threshold probability values between 0 and 1 in increments of 0.01, and the threshold that maximized CCPM was chosen for each subject. This optimization was impractical for online experiments, so a previously reported value of 0.95 was used for all trials [24].
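The particle filter and dynamic stopping steps described in this section can be sketched in a few dozen lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the corpus is a hypothetical toy word list, the language model covers single words only (no word boundaries), unseen prefixes fall back to a uniform proposal, and the score statistics and flash scores are made-up numbers. All function names are illustrative.

```python
import math
import random
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def prefix_counts(words):
    """c(s): number of corpus words that begin with the string s."""
    counts = Counter()
    for word in words:
        for k in range(len(word) + 1):
            counts[word[:k]] += 1
    return counts


def transition_probs(counts, history):
    """p(x_t | history) = c(history + x_t) / c(history), nonzero entries only."""
    total = counts[history]
    if total == 0:
        return {}
    return {ch: counts[history + ch] / total
            for ch in ALPHABET if counts[history + ch] > 0}


def gaussian_pdf(y, mean, var):
    """Normal density used as the score likelihood."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def propose(counts, histories, rng):
    """Move each particle one character forward using the language model."""
    moved = []
    for history in histories:
        probs = transition_probs(counts, history)
        if probs:
            chars = list(probs)
            ch = rng.choices(chars, weights=[probs[c] for c in chars])[0]
        else:
            ch = rng.choice(ALPHABET)  # assumption: uniform fallback for unseen prefixes
        moved.append(history + ch)
    return moved


def char_posterior(particles, weights):
    """Sum the normalized weights of particles ending in each character."""
    total = sum(weights)
    posterior = Counter()
    for particle, weight in zip(particles, weights):
        posterior[particle[-1]] += weight / total
    return posterior


def select_character(counts, histories, flashes, target, nontarget, rng,
                     threshold=0.95, max_flashes=120):
    """Run one character selection; flashes yields (flashed_chars, score) pairs."""
    particles = propose(counts, histories, rng)
    weights = [1.0 / len(particles)] * len(particles)
    posterior = char_posterior(particles, weights)
    used = 0
    for flashed, score in flashes:
        if used >= max_flashes:
            break
        used += 1
        # Weight update: target statistics if the particle's character was flashed.
        for j, particle in enumerate(particles):
            mean, var = target if particle[-1] in flashed else nontarget
            weights[j] *= gaussian_pdf(score, mean, var)
        total = sum(weights)
        weights = [w / total for w in weights]
        posterior = char_posterior(particles, weights)
        if max(posterior.values()) >= threshold:  # dynamic stopping
            break
    best = max(posterior, key=posterior.get)
    # Resample so the next character starts from equally weighted particles.
    resampled = rng.choices(particles, weights=weights, k=len(particles))
    return best, used, resampled


corpus = ["the", "then", "there", "to", "and"]        # hypothetical toy corpus
counts = prefix_counts(corpus)
rng = random.Random(0)
flashes = [({"t", "x"}, 1.0), ({"a", "b"}, 0.0)] * 5  # made-up flash groups and scores
best, used, particles = select_character(
    counts, [""] * 200, flashes, target=(1.0, 0.25), nontarget=(0.0, 0.25), rng=rng)
```

In this toy run the language model prior already favors one first letter, so few flashes are needed before the threshold is crossed. A target that is unlikely under the corpus would start with a low prior and require more evidence, which is the speed and accuracy trade-off that the threshold controls.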

Evaluation
Evaluation of a BCI system must take into account two factors: the ability of the system to achieve the desired result and the amount of time required to reach that result. The efficacy of the system can be measured as the selection accuracy, which we defined as the percentage of characters in the final output that matched the target string. The speed of the system was measured using the selection rate (SR), the inverse of the average time required to make a selection.
As there is a tradeoff between speed and accuracy, a metric is needed which takes both into account. Traditionally, BCI systems use information transfer rate (ITR), which calculates the amount of information conveyed in a system's output, taking into account the accuracy and the number of possible selections [28]. However, this metric makes several assumptions that are not valid in a natural language communication system, including lack of memory between selections, uniform probability of selection across all characters, and a uniform distribution of errors [29,30]. We include ITR here for context across existing P300 speller results, but focus instead on a simpler metric consisting of the number of correctly selected characters per minute (CCPM), discarding incorrect selections. Significance for all values was tested using Wilcoxon signed-rank tests.
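The metrics defined in this section are simple to compute from a session log. The sketch below is an illustration in Python with hypothetical function names. It assumes that accuracy is scored position by position against the target string and that ITR follows the widely used Wolpaw formula, $B = \log_2 N + P \log_2 P + (1-P)\log_2\frac{1-P}{N-1}$ bits per selection with $N = 36$ grid characters; the text above does not spell out either detail.

```python
import math


def selection_accuracy(output, target):
    """Fraction of output characters that match the target at the same position."""
    if not output:
        return 0.0
    matches = sum(1 for o, t in zip(output, target) if o == t)
    return matches / len(output)


def selection_rate(n_selections, minutes):
    """Selections per minute (inverse of the average time per selection)."""
    return n_selections / minutes


def ccpm(n_correct, minutes):
    """Correct characters per minute; incorrect selections are discarded."""
    return n_correct / minutes


def bits_per_selection(accuracy, n_choices=36):
    """Wolpaw information per selection (assumed ITR definition)."""
    bits = math.log2(n_choices)
    if accuracy > 0.0:
        bits += accuracy * math.log2(accuracy)
    if accuracy < 1.0:
        bits += (1.0 - accuracy) * math.log2((1.0 - accuracy) / (n_choices - 1))
    return bits


def itr_bits_per_minute(accuracy, rate, n_choices=36):
    """Information transfer rate: bits per selection times selections per minute."""
    return rate * bits_per_selection(accuracy, n_choices)
```

For a session with a fixed duration, CCPM equals the selection rate multiplied by the accuracy, which is why it penalizes errors without relying on the uniform-error assumptions built into ITR.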

Offline performance
In the preliminary experiment comparing traditional and inverted stimuli, subjects achieved significantly higher typing speeds (10.68 characters/minute versus 9.48 characters/minute) with comparable accuracy (93.39% versus 92.13%) when using inverted stimuli. The main experiment therefore compared performance using inverted and famous faces stimuli. In offline analysis without feedback, two classifiers were used: the standard SWLDA method and the PF method, both with dynamic stopping (Table 1, Fig 2). Using the combination of famous faces and particle filtering classification, there was an average selection rate of 11.97 characters per minute across all subjects, which was significantly higher than those achieved by famous faces with SWLDA (9.78 characters/minute, p = 0.0004) or letter inversion with particle filtering (10.34 characters/minute, p = 0.01). Although the average accuracy achieved by the combination was slightly higher (96.00%) than either of the individual methods (95.00% and 91.67% for famous faces and particle filtering, respectively), accuracy was not significantly different between the three analyses. Overall, the combination of famous faces and particle filtering yielded an average CCPM of 11.49 characters/min across subjects, with all subjects having a value over nine correct characters per minute. This performance was significantly better than that achieved using either famous faces with SWLDA (9.31 characters/min, p = 0.001) or inverted flashing with particle filtering (9.46 characters/min, p = 0.0003), with nine of the ten subjects having the highest performance using the combined method.

Online performance
In online experiments, only the PF classifier was used. All 10 subjects were able to type characters with at least 60% accuracy using each of the flashing paradigms (Table 2, Fig 3). Using the inverting method, nine of the 10 subjects achieved at least 75% accuracy and 6 characters per minute. Using the FF method, all subjects selected characters with at least 75% accuracy, with seven of 10 subjects having accuracies over 98%. All but one of the subjects had typing speeds over 10 characters per minute using the famous faces flashing paradigm. All 10 subjects achieved a higher bit rate when using the famous faces flashing paradigm than when using inverting flashes. On average, subjects selected 8.45 characters per minute with 85.49% accuracy, resulting in an average bit rate of 33.86 bits/minute using inverting flashes. When using the famous faces paradigm, subjects achieved significant improvements with an average selection rate of 11.16 characters/minute (32.0% improvement, p = 0.0005), an average accuracy of 94.21% (p = 0.02), CCPM of 10.56 (44.1% improvement, p<0.0001), and an average bit rate of 52.27 bits/minute (54.4% improvement, p = 0.0001).
Table 1. Optimal selection rates, accuracies, and correct characters per minute (CCPM) for the 10 subjects in offline trials using the inverted (Inv) and famous faces (FF) flashing paradigms with either the SWLDA or particle filtering (PF) classifiers with dynamic stopping.
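The relative improvements quoted for the online results can be roughly checked from the reported group means. The Python sketch below does this arithmetic; the function name is illustrative, and the result is only an approximation because percentages computed from group means need not equal values derived from per-subject data.

```python
def relative_improvement(baseline, improved):
    """Fractional gain of `improved` over `baseline`."""
    return (improved - baseline) / baseline


# Group means reported above: inverting flashes versus famous faces.
selection_rate_gain = relative_improvement(8.45, 11.16)
bit_rate_gain = relative_improvement(33.86, 52.27)

# The inverting CCPM mean is not quoted in the text; a 44.1% gain to 10.56
# implies a baseline of roughly 10.56 / 1.441 characters per minute.
implied_inverting_ccpm = 10.56 / 1.441
```

The selection rate and bit rate gains come out at about 32% and 54%, in line with the reported figures, and the implied inverting CCPM is a little over 7 correct characters per minute.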

Discussion
While there are many active areas of research in improving the P300 speller, relatively little work has been done to combine these improvements. Some of these methods could be mutually exclusive, such as the stimulus presentation pattern presented by Jin et al. [4] and the checkerboard paradigm developed by Townsend et al. [31]. Others, however, can be implemented together, which can potentially produce superior results to either method used individually. Developing a viable system for ALS patient communication will require utilizing many of the improvements that have been developed and it is important that we explore how these components will work together in a final product.
Table 2. Online selection rates, accuracies, and correct characters per minute (CCPM) for each subject using the inverted and famous faces flashing paradigms with the particle filtering classifier. Columns: SR (selections/min), ACC (%), CCPM (characters/min).
Here, we have demonstrated the performance of the P300 speller when implementing famous faces flashing with a language model-based signal classifier. All subjects achieved their best online performance using the combination of famous faces with the PF classifier. In offline experiments, the improvements were largely a result of a reduction of the number of stimuli required to achieve a similar accuracy. When using the particle filter, the addition of famous faces stimuli increased the selection rate from 10.34 characters/minute to 11.97 characters/minute, an equivalent of reducing the number of flashes by 52%, which is in line with the previously published reduction of 45% for famous faces without language modeling [15]. Using famous face stimuli with a traditional classifier and using standard flashing with the PF classifier achieved similar results, both of which were substantially higher in terms of selection rate than previously published results using standard methods, which were on the order of 6.5 characters/minute [21]. Combining the methods resulted in the best offline performance for all but one subject. The majority of subjects had worse offline performance using standard flashing compared to inverted stimuli, although famous faces stimuli yielded superior results to either alternative method.
There was a decrease in online performance compared to offline analysis, with lower average typing speeds and accuracies for each flashing paradigm. In both cases, the difference was mainly a result of a lower selection rate, as the accuracy did not significantly differ (p = 0.07 and p = 0.25 for inverted and famous faces flashing, respectively). A similar decrease was seen previously when using language model-based classifiers in an online setting [26]. This decrease could have been caused by the optimization of the probability threshold for each subject in the offline trials, whereas a fixed threshold was used online. Differences could also have been affected by the target sentence chosen by the users in online trials. Because offline analysis was performed on the training data, all subjects had the same target sentence and therefore benefitted from the language model equally. In online trials, subjects were allowed to choose their own text for free spelling. Sentences that contain words that are common in the language model would have higher prior probabilities, resulting in faster speeds as fewer stimulus responses would be needed for the classifier to reach a decision. Conversely, sentences that are not likely in the language model will have a bias against them and will therefore take longer and are more likely to contain errors. In a realistic system, language models can be individually tailored to reflect text that patients are more likely to type, resulting in further improved results.

Limitations and future directions
The current study was conducted only using healthy volunteers and their performance might not accurately reflect the performance of "locked-in" patients due to additional restrictions such as a lack of gaze control. The PF algorithm will likely have a similar effect in classifying signals from the target population as it is simply a means for improving speed and accuracy and does not affect the appearance of the system for the user. Famous faces stimuli have independently been validated in the target population [17], so it is reasonable to expect the combination of the methods to show an improvement for the target population. Nevertheless, this expectation needs to be tested in a study in the patient population to verify that these improvements will translate into a better system for end users.

Conclusion
Famous faces stimuli and language model-based classification have both been previously shown to greatly improve performance of BCI communication systems. Here, we have shown that the improvements achieved by these methods are complementary and that combining them yields superior results to either method implemented individually in terms of typing speed and information transfer rate. This result has been validated in both online and offline experimental settings. We have also demonstrated that famous faces stimuli are superior to inverted stimuli in addition to standard character intensifications.