Classifying Response Correctness across Different Task Sets: A Machine Learning Approach

Erroneous behavior usually elicits a distinct pattern in neural waveforms. In particular, inspection of the concurrently recorded electroencephalogram (EEG) typically reveals a negative potential at fronto-central electrodes shortly following a response error (Ne or ERN) as well as an error-awareness-related positivity (Pe). Seemingly, the brain signal contains information about the occurrence of an error. Assuming a general error evaluation system, the question arises whether this information can be utilized in order to classify behavioral performance within or even across different cognitive tasks. In the present study, a machine learning approach was employed to investigate the outlined issue. Ne as well as Pe were extracted from the single-trial EEG signals of participants conducting a flanker and a mental rotation task and subjected to a machine learning classification scheme (via a support vector machine, SVM). Overall, individual performance in the flanker task was classified more accurately, with accuracy rates of above 85%. Most importantly, it was even feasible to classify responses across both tasks. In particular, an SVM trained on the flanker task could identify erroneous behavior with almost 70% accuracy in the EEG data recorded during the rotation task, and vice versa. In sum, we replicate that the response-related EEG signal can be used to identify erroneous behavior within a particular task. Going beyond this, it was possible to classify response types across functionally different tasks. Therefore, the outlined methodological approach appears promising with respect to future applications.


Introduction
In order to adjust behavior rapidly to environmental and one's own demands it is necessary to monitor actions continuously. Especially the detection and compensation of erroneous (or undesired) outcomes play an important role in this regard. One neurophysiological correlate of such a response monitoring system can be measured in the electroencephalogram. Shortly following response errors (about 60 ms) a negative potential can be observed at fronto-central electrode positions: the error negativity (Ne [1]) or error-related negativity (ERN [2]). On the neurophysiological level, the anterior cingulate cortex (ACC) seems to be the main structure that is involved in the generation of the Ne, although the supplementary motor area has also been shown to be involved [3][4][5][6].
The Ne is followed by another correlate of such a response monitoring system which is also related to error processing: the error positivity (Pe [1,7]). Typically, the Pe peaks about 200-400 ms following response onset at centro-parietal electrode positions and is assumed to reflect error awareness and evaluation, such that the Pe can be observed whenever subjects are aware of a recently committed error [7,8]. With respect to the neurophysiological implementation of the Pe, source localization indicated that the neural regions (e.g. rostral ACC) generating Ne and Pe partly overlap [9]. In addition, however, activation of the insula is related to error awareness, indicating an increased awareness with respect to the autonomic reaction to errors [10]. Since the Ne does not vary with error awareness, both components might reflect distinct aspects of error processing [11]. In sum, it is well accepted that both ERP components seem to be closely related to behavioral adaptation.
Indeed, actions are not only adapted via detection and compensation of errors: correct responses are monitored as well, for example in order to increase motor precision. Accordingly, even after correct responses there is a prominent fronto-central negativity, the correct-related negativity (CRN, see e.g. [12]). Recent evidence pointed to a general response monitoring system (as reflected in Ne and CRN) that is central to the adaptation of actions [13,14]. However, for convenience we use the term Ne throughout the manuscript and specify whether it refers to correct or incorrect responses. Further evidence for a general response monitoring system can be derived from the finding that CRN and/or Ne are both observable in various types of speeded or simple response tasks [8,13,15-19]. Moreover, the response modality seemingly does not matter, since the Ne is present for example in tasks requiring vocal responses [20].
One theoretical explanation regarding the functional implementation of the Ne is the reinforcement-learning hypothesis (RFL [21]) which is supported by a huge amount of empirical findings (for an overview compare [22]). The RFL proposes that the Ne is triggered whenever an outcome, irrespective of being a response or an event, is worse than expected. Accordingly, the function of the neurophysiologic mechanisms generating the Ne is to initiate remedial action to control errors or undesired outcomes. The RFL is thereby not restricted to errors: it is a general theory about the compensation and adaptation to unexpected outcomes implicating a general functional network linked to the learning and monitoring of stimulus response contingencies [23,24]. However, more recent models and studies provide evidence, that the ACC does not code solely events being worse than expected, but rather the expectancy of events (i.e. if events are expected or not [25]).
Given the accuracy-differentiating character of CRN, Ne, and Pe, the practical question arises, whether this activity (from a statistical point of view) is specific for certain (cognitive) tasks, or can be used as a general classification feature across tasks. If such an adaptive response monitoring system is part of task processing in any task consisting of providing a 'to-be-evaluated-response' to some stimuli, classifications across different types of tasks should be feasible. Hence, the first hypothesis of the present study is that the response-related EEG signal can be utilized for the classification of the subject's current response state, more specifically, whether the current response is correct or not (via classification of Ne and Pe). This idea appears promising as especially the Ne is prominent in the single-trial EEG, and has been shown to be a very reliable ERP, with reliabilities >.85 within subjects [26]. Indeed, a huge number of studies already revealed that it is possible to classify response correctness by utilizing the single-trial Ne (see e.g. [27][28][29]). In the present study it is aimed to replicate this finding. Going beyond the question whether the Ne can be utilized to classify response correctness, previous research revealed that information about upcoming events can be derived from the individual EEG [30][31][32][33]. However, the latter studies were limited to correlates of attentional or perceptual maladaptations, not yielding classifications, but rather correlative dependencies. In a recent study a pattern recognition framework has been implemented in order to classify correct and incorrect responses via extraction of EEG features in the time window of Ne and Pe [34] and from these results it has been concluded that the Pe does not contribute much to classification performance. 
However, in their study the authors did not use the single-trial Ne or Pe. Instead, their algorithm automatically selected the best feature for classification irrespective of electrode position, that is, the whole electrode array was fed into the learning phase and the best predictor was selected. As a consequence, even temporal or lateral electrodes were sometimes included, so it is not clear whether the Pe or Ne were actually selected, or merely statistically relevant features [34].
In addition, the core aim of the present study is to test whether the supposed classification of response correctness is stable across tasks. There is much evidence indicating that the mechanisms involved in error processing constitute a general evaluation system [6,12-14,22,23,35,36]. Consequently, it should be possible to use the error-related EEG activity to classify responses accurately across tasks, even if the mechanisms that lead to an error in both tasks are quite different. Indeed, in a previous study it has been shown that the Ne correlates considerably across tasks [19].
To test the assumption that Ne, as well as Pe can be used as general classification features, two different cognitive tasks, namely a mental rotation and a flanker task were conducted and it was investigated whether the response-related activity of one task can be employed to assess behavioral performance in the other. In both experimental tasks actually different types of errors can be assumed; in the flanker task errors are due to lapses of attention [37,38], whereas in the rotation task errors are more likely "mistakes" (i.e. inappropriate target processing). Given that there is a common response, i.e. error monitoring system which is reflected in the Ne, and that error awareness as reflected in the Pe is involved as well, it can be hypothesized that the identification of errors and correct responses should be above chance, despite the fact that in both tasks errors arise due to different cognitive and neural mechanisms [22].

Methods
The data herein were taken from a previous experimental series [13]. Thus, the standard behavioral and ERP results must not contribute to any meta-analysis. However, we describe the corresponding methods again to spare the reader from cross-reading. Furthermore, the data were re-analyzed, leading to slightly different statistics (but not inferences). This is due to the fact that in the previous study the statistics were related to the independent components derived by independent component analysis (ICA) related to the error negativity. In the present study the "pure" ERP was analyzed, since the ICA analysis scheme would go far beyond the scope of the manuscript.

Participants
The sample consisted of 20 healthy young participants (11 women). Participants were aged between 21 and 27 years (mean = 23.8; SD = 1.9), gave written informed consent prior to participation and received €10 per hour for participation. The local ethics committee of the Leibniz Research Centre for Working Environment and Human Factors approved the study.

General procedure and experimental design
Participants were seated in an ergonomic seat in front of a 19"-CRT monitor (100 Hz). Responses were given by a button press of the left or right thumb on a force measuring device. The experiment consisted of two tasks, each consisting of eight blocks (one training block). Each block consisted of 80 trials. Following each block a break of 20 seconds and after half of the experimental blocks a break of 120 seconds was provided. The initial experiment consisted of a mixed 2 (group) x 2 (task) design with the between subjects factor group (accuracy, speed instruction) and the within subjects factor task (flanker, rotation). The design was fully balanced with respect to group, sequence of tasks, and response side for mirrored/non-mirrored letters. We did not analyze the speed-accuracy manipulation since it was beyond the scope of the present manuscript and, more importantly, did not lead to any differences with respect to Ne amplitude due to the adaptive deadline (for details compare [13]).
The first task was a modified flanker task [39]. In the center of the screen an arrowhead indicated the button that had to be pressed. This arrowhead was accompanied by two distracting arrowheads below and above which appeared 100 ms prior to target occurrence, which is known to induce maximal distraction [38,40]. These flankers could be congruent (pointing to the same direction) or incongruent (opposite direction). The probability for congruent and incongruent flankers was 50%, respectively.
The second task was a mental rotation task. One out of two letters (F, R) was presented to the participants. This letter was either rotated, mirrored across the main axis or both. Participants had to indicate with a left or right button press of the corresponding thumb if the letter was mirrored or not. The letters were rotated by 0°, 45°, 135°, 225° or 315°, resulting in 20 possible stimuli which were presented in random order. Thus, the rotation task was not only much more difficult than the flanker task, it also differed with respect to the degree of stimulus-response mapping.
In both tasks the participants received post-response feedback indicating whether they responded within an appropriate time interval. The feedback consisted of two pictograms: If the participants responded fast enough a yellow pictogram of a smiling face ("smiley") appeared in the center of the screen. A red angry looking pictogram appeared if they responded too fast or too slow. The deadline for the feedback was adapted block wise. If the error rate in one block (80 trials) was below 8%, the deadline was decreased subtracting one standard deviation from the mean RT of the previous block. In contrast, an error rate above 12% led to an increase of the deadline by adding four standard deviations to the mean RT of the previous block.
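The block-wise deadline adaptation can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the text: the new deadline is taken to be the mean RT of the previous block minus one (or plus four) standard deviations, the deadline is left unchanged for error rates between 8% and 12%, and the sample standard deviation is used. The function name update_deadline is hypothetical.

```python
from statistics import mean, stdev


def update_deadline(current_deadline_ms, error_rate, rts_prev_block_ms):
    """Return the response deadline (ms) for the next block.

    Assumed reading of the rule: the deadline is recomputed from the
    mean RT of the previous block, shifted by multiples of its SD.
    """
    m = mean(rts_prev_block_ms)
    sd = stdev(rts_prev_block_ms)  # sample SD (assumption)
    if error_rate < 0.08:
        # Few errors: tighten the deadline.
        return m - 1 * sd
    if error_rate > 0.12:
        # Many errors: relax the deadline.
        return m + 4 * sd
    # Assumption: no change inside the 8-12% band.
    return current_deadline_ms
```

Under this reading, the asymmetric step sizes (one versus four standard deviations) relax the deadline much faster than they tighten it.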

Behavioral data analysis
Error rates of both tasks were compared with each other by means of bootstrap t-tests [41]. Observed t-values (t_obs), adjusted bootstrap p-values (p_boot), and Cohen's d for repeated measures indicating effect sizes are reported [42].
The reaction times (RTs) were analyzed by means of a repeated measures ANOVA with the within subject factors task (flanker, rotation) and response (error, correct), with RTs faster than 100 ms and slower than 1000 ms being excluded from the analysis. Resulting F-values, p-values, and partial eta squared (ηp²) are reported. Whenever necessary, multiple comparisons were conducted via post-hoc t-tests, while the corresponding p-values were FDR-adjusted according to the method of Benjamini and Yekutieli [43]. Cohen's d is reported for effect sizes [42].
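For illustration, the Benjamini-Yekutieli adjustment of a family of p-values can be sketched as follows. This is a generic textbook version of the step-up procedure, not the analysis code of the study; the function name by_adjust is hypothetical.

```python
def by_adjust(pvals):
    """Benjamini-Yekutieli adjusted p-values (FDR under dependency).

    Each sorted p-value p_(i) is scaled by m * c(m) / i, with
    c(m) = sum_{k=1..m} 1/k, and monotonicity is enforced from the
    largest p-value downwards.
    """
    m = len(pvals)
    c_m = sum(1.0 / k for k in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A comparison is then considered significant if its adjusted p-value falls below the chosen alpha level.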
EEG data: pre-processing and ERPs
[...] with a sampling rate of 500 Hz. The EOG was recorded from the outer canthi and from above and below the right eye (SO2, IO2, LO1, LO2). Data were re-referenced off-line relative to linked mastoids. The EEG was filtered offline using a short non-linear FIR filter (high pass 0.5 Hz, low pass 25 Hz). Subsequently, the data for each participant were segmented into 1000 ms epochs yielding a temporary data set to which an automated artifact rejection procedure [44] was applied, followed by the ICA AMICA algorithm [45,46]. The automated artifact rejection procedure basically calculates the empirical distribution of all data points across all trials and time points and rejects statistical outliers, i.e. trials containing data points that exceed a criterion of 3 standard deviations. The maximum proportion of rejected trials was set to 5%. The derived ICA weights were applied to the previous continuous data set, and independent components representing ocular artifacts were removed by projecting back the mixing matrix with artifact components set to zero [47][48][49]. The pruned data sets were then segmented relative to stimulus onset and a second time relative to response onset. Finally, labels coding errors and correct trials were added in order to yield conditional vectors for the machine learning analysis. For calculation of the ERPs, the data were submitted to the aforementioned automated artifact procedure [44] and the data were averaged for both tasks and response types (errors, correct). The Ne was quantified as a mean voltage (20-100 ms following button press) at FCz, since topographic maps (spherical splines) indicated a maximum at this time point and channel location. The Pe was quantified as mean voltage in the time range 180-250 ms following button press at Cz. The statistical analysis of task (flanker vs. rotation) and accuracy (correct vs. incorrect) effects on these mean amplitudes consisted of repeated measures ANOVAs.
Due to the 2x2 factorial design the degrees of freedom were 1 for all ANOVAs. Thus, no sphericity correction was applied. We report F-values, p-values and effect sizes by means of ηp². If post-hoc t-tests were conducted due to significant interactions, we report t-values, alpha-adjusted p-values [43] and effect sizes by means of Cohen's d for repeated measures [42].
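The mean-voltage quantification of Ne and Pe described above can be sketched for a single response-locked epoch. The window limits and the sampling rate are taken from the text; the epoch onset relative to the button press (here -200 ms), the inclusive window borders, and the names window_mean, NE_WINDOW_MS and PE_WINDOW_MS are assumptions made for illustration.

```python
NE_WINDOW_MS = (20.0, 100.0)   # Ne window after button press (from the text)
PE_WINDOW_MS = (180.0, 250.0)  # Pe window after button press (from the text)


def window_mean(samples, window_ms, epoch_start_ms, srate_hz=500.0):
    """Mean voltage of one channel within a latency window.

    samples        : voltages of one response-locked epoch
    window_ms      : (start, end) latency in ms, borders inclusive (assumption)
    epoch_start_ms : latency of the first sample relative to the response
    """
    step_ms = 1000.0 / srate_hz
    lo, hi = window_ms
    selected = [
        v for i, v in enumerate(samples)
        if lo <= epoch_start_ms + i * step_ms <= hi
    ]
    if not selected:
        raise ValueError("window lies outside the epoch")
    return sum(selected) / len(selected)
```

At 500 Hz one sample covers 2 ms, so the Ne window spans 41 samples and the Pe window 36 samples under these assumptions.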

EEG data: machine learning
The EEG in the present study was analyzed by means of a machine learning approach, i.e. a support vector machine (SVM) was implemented and optimized for classification of single-trial EEG data. SVMs are supervised learning algorithms that aim towards an optimal separation of distinct classes. For that purpose input data is projected into a high-dimensional feature space in order to determine a hyperplane which is able to separate this data. An SVM trained this way is subsequently able to apply this "knowledge" to new data and hence able to classify it [50,51]. SVMs are used in a wide variety of scientific fields and have recently gained popularity in classifying brain activation. They were even successfully employed to investigate common neural mechanisms between different tasks. For instance, a shared neural basis for perceptual guesses and free decisions [52] or an involvement of spatial coding in mental arithmetic [53] was investigated this way.
Basic analysis scheme. EEG data were analyzed separately for both experiments via machine learning in order to classify errors and correct responses. More specifically, there were two analyses: the initial analysis consisted of a classification of errors and correct responses within task. The second analysis tested whether it is feasible to identify errors and correct responses across tasks.
Pattern recognition analysis scheme. Initially, the segmented EEG data was subjected to a feature extraction procedure. According to the hypotheses, feature sets for Ne, Pe, or a combination of both components were constituted for both tasks, resulting in six individual feature sets for each participant. Based on the results of the EEG data five electrodes were selected around the peak position of Ne (i.e. Fz, FCz, FC1, FC2, Cz) and Pe (i.e. Cz, FCz, CPz, C1, C2), respectively. Thus, the Ne and Pe feature sets comprised 5 features each. Specifically, the mean voltage on each of the selected electrodes (i.e. average of the EEG signal on that site) was calculated within a time window of 20-100 ms (Ne) and 180-250 ms (Pe). The combined feature set comprised both the Ne and the Pe feature set (i.e. 10 features). Subsequently, these feature sets were utilized for the classification process by means of a support vector machine (SVM). To this end, the extracted features of all individual feature sets were linearly scaled to a range between 0 and 1 [54].
In both experimental tasks, errors were less likely to occur than correct responses. The individual number of errors ranged between 12 and 237 (mean ≈ 104) in the flanker task and between 52 and 268 (mean ≈ 128) in the rotation task. For each participant a corresponding (equal) number of correct response trials was randomly selected for further processing steps. For instance, 100 errors would be matched with 100 (randomly selected) correct responses. Accordingly, the classifier's accuracy rate is defined as the number of correct classification incidents divided by the total number of classification incidents (i.e. selected trials).
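The linear scaling of the features to the range between 0 and 1 can be sketched as a column-wise min-max transformation. The text does not state whether the scaling ranges were estimated per task or carried over from the training to the test data; the sketch separates both steps so that either variant can be expressed. The names fit_minmax and apply_minmax are hypothetical.

```python
def fit_minmax(rows):
    """Return (min, max) per feature column; rows = trials x features."""
    return [(min(col), max(col)) for col in zip(*rows)]


def apply_minmax(rows, ranges):
    """Linearly map each feature to [0, 1] using the given ranges.

    Constant columns are mapped to 0.0 to avoid division by zero.
    Values outside the fitted range fall outside [0, 1].
    """
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, (lo, hi) in zip(row, ranges)]
        for row in rows
    ]
```

For a combined feature set, each row would hold the 10 mean voltages of one trial (5 Ne electrodes followed by 5 Pe electrodes).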
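The matching of error and correct trials and the resulting accuracy rate can be illustrated as follows. This is a minimal sketch that assumes sampling without replacement and a label coding of 1 for errors and 0 for correct responses; the names balance_trials and accuracy_rate are hypothetical.

```python
import random


def balance_trials(error_trials, correct_trials, rng):
    """Match all error trials with an equal number of random correct trials.

    rng is a random.Random instance, so that repeated draws (e.g. 10
    repetitions with different samples) are reproducible.
    """
    n = len(error_trials)
    matched = rng.sample(correct_trials, n)  # without replacement
    trials = list(error_trials) + matched
    labels = [1] * n + [0] * n  # 1 = error, 0 = correct (assumed coding)
    return trials, labels


def accuracy_rate(predicted, actual):
    """Correct classification incidents divided by all incidents."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)
```

Because both classes are equally frequent after matching, the chance level of the accuracy rate is 50%.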
The actual classification procedure was performed using the freely available toolbox libsvm [55] as implemented for MATLAB. An SVM with radial basis function (RBF) kernel was employed, since it offers the flexibility to handle linear and nonlinear relations between features and target classes [54,56]. Using an RBF-SVM, the penalty parameter C (controlling the cost of misclassifications) and the free parameter of the RBF kernel γ (determining the shape of the kernel) have to be chosen. Therefore, both parameters were individually determined by successive iteration using a nested grid search approach, i.e. following an initial coarse search a second, finer search is conducted [54]. For within task classification the selected features were subjected to a 10-fold cross validation. In other words, the dataset is subdivided into ten parts, of which nine parts serve to train the SVM while the remaining part is used for testing. This procedure is repeated ten times for each iteration level of the grid search and results in a mean accuracy rate. More importantly, the selected parameters (C and γ) determine the SVM model that is employed to perform across task classifications. For that purpose the SVM was trained on the flanker task data with the previously established parameters and then tested on the rotation task data, and vice versa. The reliability of cross-validation (within task) and across task classification was tested using a randomization test procedure [57,58] with 1000 permutations. To ensure that the results are not based on a biased selection of correct trials, the outlined analysis scheme was repeated 10 times using randomly selected samples of correct trials. The accuracy rates associated with the best-found parameters observed in the cross-validation (within task) as well as the across task classification accuracy rates were used to determine a mean value across these 10 analyses (see above). The resulting mean values are subsequently reported in the results section.
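The classification scheme outlined above can be approximated with the following sketch. It is not the original MATLAB/libsvm code: it uses scikit-learn (whose SVC class wraps libsvm) as a stand-in, the grid ranges and step sizes are illustrative assumptions, and the permutation test shuffles the training labels, which is only one possible reading of the randomization procedure. The function simulate_task merely generates hypothetical data for demonstration; all function names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC


def grid_search_rbf(X, y, log2_c, log2_gamma, seed=0):
    """10-fold cross-validated grid search over C and gamma (RBF-SVM)."""
    grid = {"C": [float(2.0 ** e) for e in log2_c],
            "gamma": [float(2.0 ** e) for e in log2_gamma]}
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    return GridSearchCV(SVC(kernel="rbf"), grid, cv=cv,
                        scoring="accuracy").fit(X, y)


def nested_grid_search(X, y, seed=0):
    """Coarse search, then a finer search around the coarse optimum."""
    coarse = grid_search_rbf(X, y, np.arange(-5, 12, 4),
                             np.arange(-15, 4, 4), seed)
    c0 = np.log2(coarse.best_params_["C"])
    g0 = np.log2(coarse.best_params_["gamma"])
    fine = grid_search_rbf(X, y, np.arange(c0 - 2, c0 + 2.1, 1.0),
                           np.arange(g0 - 2, g0 + 2.1, 1.0), seed)
    return fine.best_params_, fine.best_score_


def across_task_accuracy(params, X_train, y_train, X_test, y_test):
    """Train on one task with fixed (C, gamma), test on the other task."""
    model = SVC(kernel="rbf", **params).fit(X_train, y_train)
    return model.score(X_test, y_test)


def permutation_p_value(params, X_train, y_train, X_test, y_test,
                        n_perm=1000, seed=0):
    """P-value from retraining on shuffled training labels (assumption)."""
    rng = np.random.default_rng(seed)
    observed = across_task_accuracy(params, X_train, y_train, X_test, y_test)
    null = [across_task_accuracy(params, X_train, rng.permutation(y_train),
                                 X_test, y_test) for _ in range(n_perm)]
    return (1 + sum(a >= observed for a in null)) / (1 + n_perm)


def simulate_task(n_per_class, effect, seed):
    """Hypothetical balanced data: 10 features in [0, 1], 1 = error."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0.5, 0.1, size=(2 * n_per_class, 10))
    y = np.repeat([1, 0], n_per_class)
    X[y == 1, :5] -= effect   # errors: more negative "Ne" features
    X[y == 1, 5:] += effect   # errors: more positive "Pe" features
    return np.clip(X, 0.0, 1.0), y
```

A typical call would first run nested_grid_search on the data of one task, and then pass the returned parameters to across_task_accuracy and permutation_p_value together with the data of the other task.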

Behavioral data
Participants committed significantly fewer errors in the flanker task (13.87%) compared to the rotation task (17.67%, t_obs = 2.16, p_boot = .002, d = .48). Also, the reaction times were significantly shorter for the flanker task (277 ms, collapsed across correct and erroneous responses) than for the rotation task (441 ms, F(1,19) [...]

EEG data: response classification via machine learning
Within task classification. The classification of responses within task using an SVM approach was based on 10-fold cross-validation (see methods) and resulted in above chance performance (see Table 1). Trials from the flanker task were correctly classified in about 86% of cases across participants (75-96%, all p < 0.05) when both feature sets were employed. Classifier performance slightly decreased when Ne or Pe feature sets were utilized separately. Using only Ne or Pe features resulted in a mean accuracy rate of about 79% (Ne: 61-95%, all p < 0.01; Pe: 69-94%, all p < 0.05, Tables 2 and 3). A further investigation of the results from the individual subjects indicates, however, that there are differences related to the feature sets across participants: classifier performance in some participants was clearly driven by Pe features while in other participants the usage of Ne features was beneficial (Table 2 and Table 3).
A similar pattern of results was observed for rotation task classification, even though the overall accuracy was slightly lower at 75% (58-85%, all p < .01, Table 4). Again, classifier performance was diminished when Ne features (69%: 54-80%, all p < .01, except for one participant with p = .074, Table 5) or Pe features (71%: 58-85%, all p < .05, Table 6) were used separately. As observed for the flanker task classification, in some individuals the classifier was more accurate using Ne features, while in others Pe features led to better results. Thus, data derived from both tasks indicate that there is no advantage for either feature set on its own, but that the combination leads to a general improvement.
Across task classification. If there are generalizable neural patterns between the flanker and the rotation task, a classification across tasks should be feasible. Indeed, the SVM classifier was still able to classify responses above chance level across tasks in most cases. An SVM trained on flanker task data correctly classified responses from the rotation task in about 67% of cases (55-80%, Table 7). The permutation test indicated significance for ten participants (p < .05) and strong trends in seven additional participants (p < .11), while less reliable accuracy was only observed for three participants (i.e., p = .12; p = .12; p = .14). When the SVM was trained with Ne features only, the classifier's performance dropped to about 61% (50-74%, see p-values in Table 8); training with Pe features only yielded a slightly higher accuracy of 64% (49-82%, see p-values in Table 9).
Reversing the train-test order of the SVM (train on rotation task data, test on flanker task data) resulted in a mean accuracy rate of about 75% (58-88%, Table 10). Performance was significant for 13 participants (p < .05), with a trend towards significance in 6 participants (p < .11), while responses were not reliably classified in the remaining participant (p = .14). Training the classifier on Ne or Pe features only yielded similar accuracy rates (Ne: 68% (50-87%); Pe: 68% (58-90%), see p-values in Tables 11 and 12). Again, in both types of analysis it became evident that Ne and Pe do not have the same predictive value within individual participants.

Discussion
The present study addressed the question whether it is possible to classify response correctness (i.e. separating correct from erroneous responses) within and across two different cognitive tasks using a machine learning approach. For this purpose, data were re-analyzed from another study [13]. In this study, participants conducted a flanker and a mental rotation task. The reanalysis revealed the same pattern of behavioral data as in the previous publication [13]. This behavioral result pattern was accompanied by a significant Pe and Ne component in erroneous trials: both were clearly discernable, even though the difference between correct and incorrect trials was much more pronounced within the flanker task condition. Thus, both components (Ne and Pe) subsequently served as features for the SVM. Overall, the SVM yielded high accuracy rates of over 80%. As expected, responses were identified most accurately within particular task sets, whereby responses could be classified with higher precision in the flanker task. More importantly, it was possible to train an SVM with data from one task and to classify responses across tasks. In both cases (within and across tasks) accuracy classification was most precise using a combination of Ne and Pe features.
It has previously been demonstrated that the EEG signal can be utilized to classify response correctness utilizing single-trial Ne and Pe (e.g. [27][28][29]) and also to derive predictions about upcoming behavior and errors [30,32,33,52]. Thus, the present study replicates some findings in this regard. However, there is information besides the Ne that can enhance the separation of correct from erroneous responses. Indeed, it has recently been shown that the signal as reflected in the Ne is constituted by at least two processes: a centrally distributed error-sensitive factor and an outcome-independent factor contributing to both Ne and CRN [59]. Furthermore, the neural mechanisms involved in generating the Pe seem to differentiate between perceived and unperceived errors [60]. In this regard, this is the first time it is shown that the Pe appears to be stable across tasks as well (despite the fact that the degree of error awareness differs in both tasks). Thus, though being modulated by task features, the Pe can be utilized to classify effects of error awareness across tasks. In sum, the present results strengthen the notion that both Ne and Pe provide error-related information usable for precise classifications. Furthermore, the SVM trained on the rotation task and tested on the flanker task was more accurate compared to the opposite case (i.e. the SVM trained on the flanker task and tested on the rotation task). This adds evidence to the assumption that the rotation task per se is more difficult, most likely conveyed by attenuated response monitoring [4,61]. However, as responses in the flanker task condition are classified with higher accuracy, the supposed error-monitoring process might as well reflect the classifier's capability to better distinguish correct from erroneous responses in the "easier" flanker task.
With respect to the assumption of Ne and Pe reflecting a general response evaluation system, a recent investigation on convergent validity of error-related brain activity [19] revealed that there was overlapping variation in error-related brain activity across three different tasks (i.e. flanker, Stroop, Go/No-Go task). However, in their study Riesel and colleagues used correlational measures whereas in the present investigation a classification approach was utilized. This offers the advantage that, if the SVM is trained on one task, the derived classifier can be used in another task to identify erroneous and correct responses. Apparently, Ne and Pe constitute informative features with respect to response behavior within a particular task setting but also across tasks, and might thus be utilized as a general classifier. The latter might have important implications, for example it could be tested whether the Ne as well as Pe can be employed with respect to clinical applications [31].
Also, the results of the present study partly challenge the finding of Ventouras and colleagues [34], who concluded that the Pe does not improve classification performance. This competing result might be due to the different classification scheme: in the present study the Pe as well as Ne were selected a priori and fed into the learning phase, whereas in the work of Ventouras and colleagues [34] the most relevant statistical feature was selected, irrespective of electrode position. Thus, it is questionable whether the Pe was actually selected for classification. Furthermore, in their study a flanker task was utilized. Typically, in this kind of task participants have a clear impression of committing an error. Since the Pe has often been related to error awareness [7,8], their conclusion is unexpected and not in line with the present results. In contrast, we found a discernable Pe in single trials (compare Fig 2) that improved classification accuracy. However, the suggested approach also has some limitations. The performance of the classification is likely due to the fact that both experiments were conducted in a successive order, and thus the EEG signal is relatively stable. It is well known that the EEG signal is quite sensitive to noise, biorhythms, arousal and many other factors. Thus, the features derived for instance on one day might not be as predictive if the second task was conducted on another day, even if exactly the same experimental setup was used. Thus, it has yet to be tested whether these signals can be utilized not only to classify across tasks, but also to predict behavior if the tasks are not conducted immediately following each other.
In sum, the present findings replicate the role of Ne and Pe as reflections of basic error monitoring processes. What is more, those monitoring processes obviously apply across different cognitive tasks, even though a reliable classification was not observed in all participants. However, what goes beyond classification is that the features of one task are predictive with respect to the performance in a following task. Thus, the error-related EEG signal might be used to predict behavior in interleaved, i.e. subsequently conducted, tasks. Accordingly, the outlined approach is also promising from an application-oriented perspective, for instance within a brain-computer interface framework. Based on the present results it appears at least possible that upcoming errors might be detected even before they are committed, if tasks are conducted subsequently. Finally, the applied machine learning approach was successful in demonstrating that a functional mechanism of one task can be identified in another task. Moreover, the combination of advanced data mining strategies and EEG analysis provides the opportunity to test whether different cognitive tasks depend on similar neural mechanisms.

Author Contributions
Conceived and designed the experiments: TP MF SH. Performed the experiments: SH. Analyzed the data: TP SH. Contributed reagents/materials/analysis tools: TP SH. Wrote the paper: TP EW MF SH.