Comparison of EEG-Features and Classification Methods for Motor Imagery in Patients with Disorders of Consciousness

Current research aims at identifying voluntary brain activation in patients who are behaviorally diagnosed as being unconscious, but are able to perform commands by modulating their brain activity patterns. This involves machine learning techniques and feature extraction methods such as applied in brain computer interfaces. In this study, we try to answer the question if features/classification methods which show advantages in healthy participants are also accurate when applied to data of patients with disorders of consciousness. A sample of healthy participants (N = 22), patients in a minimally conscious state (MCS; N = 5), and with unresponsive wakefulness syndrome (UWS; N = 9) was examined with a motor imagery task which involved imagery of moving both hands and an instruction to hold both hands firm. We extracted a set of 20 features from the electroencephalogram and used linear discriminant analysis, k-nearest neighbor classification, and support vector machines (SVM) as classification methods. In healthy participants, the best classification accuracies were seen with coherences (mean = .79; range = .53−.94) and power spectra (mean = .69; range = .40−.85). The coherence patterns in healthy participants did not match the expectation of central modulated -rhythm. Instead, coherence involved mainly frontal regions. In healthy participants, the best classification tool was SVM. Five patients had at least one feature-classifier outcome with p0.05 (none of which were coherence or power spectra), though none remained significant after false-discovery rate correction for multiple comparisons. The present work suggests the use of coherences in patients with disorders of consciousness because they show high reliability among healthy subjects and patient groups. However, feature extraction and classification is a challenging task in unresponsive patients because there is no ground truth to validate the results.


Introduction
Voluntary brain activation has been extensively examined in disorders of consciousness (DOC). The goal of these endeavors is to develop a diagnostic tool to distinguish unresponsive from responsive patients if the latter are severely paralyzed and cannot react behaviorally to external stimuli. In these studies, simple instructions were presented which can be carried out by thought [1][2][3][4][5][6][7][8][9][10][11][12][13]. A possible command-following can be detected in the resulting brain activation with imaging methods (functional Magnetic Resonance Imaging, fMRI) or neurophysiology (Electroencephalogram, EEG). The perhaps most promising approach to apply such a diagnostic tool is based upon techniques from brain computer interface (BCI) research. Cruse et al. [3] developed a procedure, in which the patients' responses were classified with machine learning techniques. A support vector machine classified bandpass-filtered activity as recorded by central positioned channels of the EEG. Most interestingly, single patients seemed to show command following. However, the results of this research are not undisputed. Goldfine et al. [14] demonstrated that the data in this study did not meet the assumption of the statistical model, which led to artificially low p-values. By use of an appropriate permutation test the authors showed that there was no evidence for command following in any of these patients. Indeed, false positives are a major problem when applying machine learning techniques to patients with DOC. In BCI research, it is evident that the participants of the studies are performing the tests. In DOC research, we hardly know whether a patient follows the command or not. In addition, data from DOC patients differs from usual BCI-data because of the many artifact-sources such as stereotypical movements but also pathologic brain activity. These circumstances pose extraordinary demands on the data-analysis.
Cruse et al. [3] used bandpass-filtered data and a support vector machine (SVM) for classification. The healthy participant group yielded an average classification accuracy of .68 in those cases which could be classified significantly above chance. This average did not include three subjects who showed a classification accuracy between .44 and . 53. Thus, the average on all 12 participants may be around .63+. The goal of this study was to start the discussion on EEG-features for detection of voluntary brain activation in patients with DOC. Therefore, we investigated the accuracy of other markers and other classification methods in healthy participants and applied them to the data of DOC patients.
A meaningful feature/classifier yields a high accuracy among healthy participants and is reliable among healthy subjects, i.e., classifies the data of healthy participants above chance. The higher the sensitivity of a marker in healthy participants is, the higher is the sensitivity in patients with DOC. In other words, in healthy participants there should be no below-chance classification accuracy. Although a lot of BCI research concerning the optimal feature and best classifier selection has been conducted, this is seldom the case because there is a high inter-individual variability between subjects in the brain activity during motor imagery [15]. There are subjects without significant alternation of the m-rhythm, and thus, it is possible that classification of motor-imagery related activation fails. However, there can be other relevant sources of information in the recorded EEG of these subjects, perhaps allowing to classify the data even without modulation of the mrhythm. To take this option into account we used a nearly full montage of EEG-electrodes instead of a small number of central electrodes, which would capture only the activity of the motor cortex. In addition, for frequency-based features, we included frequencies ranging from delta to high-beta, to capture, e.g., non m-frequency activity related to executive functions which are involved in imaginary tasks.
In addition to a high accuracy and a low rate of non-responders, a good feature should show up with a small variance between participants, i.e., a small range and a small standard deviation (SD). Generally, these questions were and are subject of BCI research, and there are numerous studies comparing different features to each other in healthy participants. However, the examination of different features in DOC patients poses some additional demands. As stated above, false-positive results in patients are undesirable. Based on the assumption that the conducted behavioral diagnosis, carried out with the Coma-Recovery-Scale revised (CRS-R) by trained experts, which is the gold standard, is highly accurate [16], we assumed that DOC patients are not able to perform motor imagery and, thus, should not reveal above chance accuracies in the discrimination of motor imagery vs. rest trials. However, because of classification failures, for example, in unbalanced data sets, it is possible to obtain artificial high classification accuracies. This possibility and the multiple comparison problem require an adequate correction for false discoveries. We examined both, results with and without false-discovery-rate (FDR) correction since correcting the alpha level depresses the probability of Type I error but increases the probability of Type II error. We searched for features which reveal high classification accuracies in healthy participants which are significantly above chance (after FDR correction) and yield consistent results in patients. We considered results as consistent if they were significant despite correcting for multiple comparisons.
In the presented study, we recorded the EEG in healthy participants, patients in a minimally conscious state (MCS), and with unresponsive wakefulness syndrome (UWS) during imagery of moving both hands and during a rest condition (for details, see Figure 1). We extracted EEG-features to measure synchrony, complexity/entropy and to describe frequency characteristics. These features were chosen because of evidence for their value as features, for feature reduction or similar applications in BCIs or DOC (e.g. Hjorth parameters [17], brainrate [17], Wackermann features [17], power spectra [5,17], coherence [18,19], directed transfer function (DTF) [20,21], approximate entropy [22], Shannon entropy [23], Bhattacharyya distance [24]) or in other fields of EEG-research (e.g. [25]).
We compared 3 common algorithms as classification methods, discriminant analysis with a diagonal quadratic function (DADF), k-nearest neighbor classification (knn) with k = 1 and k = 3, and a support vector machine (SVM) with a linear kernel function. Previous work examined similar comparisons of classification methods and found no differences [26] or found differences which depend on the examined subject and feature [27]. There are results supporting the use of Bayesian linear discriminant analysis [28] or a stepwise linear discriminant analysis [29]. Other researchers report a slight advantage of nonlinear classification methods such as SVM over linear ones such as linear discriminant analysis [30]. There is also evidence that SVM perform much better than knn [31], but we supposed that the choice of k may influence the results.
In this study, we wanted to investigate which classification methods and features resulted in best accuracies in healthy subjects. In addition, we examined the effect of the choice of a particular feature in patients compared to healthy controls to address the question if specific features are more appropriate when investigating patients with DOC.

Classification methods
Healthy participants. The results in the healthy participant group differed depending whether or not FDR was applied to the proportional chance (c prop ) criterion. The results of the Wilcoxon signed-rank tests comparing the different classification methods to each other in healthy participants are shown in Table 1. Without FDR-correction, it becomes evident that SVM classification (mean accuracy = 0.59) yields significantly better results than knn classification with k = 1 (mean accuracy = 0.56) and k = 3 (mean accuracy = 0.57). The reason for a high p-value in the test between DADF (mean accuracy = 0.57) and SVM may be that those features resulting in the highest classification results (coherence) did not work in DADF and thus, decreased the degrees of freedom and did not influence the test result. This problem biased also the results of the comparisons with FDR correction. However, with FDR-correction a lower number of participants yielded above chance accuracies with DADF classification in the feature-set without coherences, demonstrating that DADF yields poor results.
In summary, the differences between classification methods are rather small, but the best classification result was achieved with SVM classification.
Patient groups. There were no significant differences between classification methods in the patient groups (with pvalues ranging from .5 to 1), neither with nor without FDR correction. Therefore, we chose the classification method which worked best in healthy participants for the comparison of features to each other. That is, the comparison was carried out on the SVM-results.

Features
Classification accuracies. Table 2 shows the SVM results, i.e., for each feature the average accuracies, SD, and range for all three groups (healthy subjects, MCS, and UWS). For the explanation of the features we refer to the Methods section. Note that the average accuracies of some healthy participants and patients in certain features (e.g., Wackermann features) yielded below-chance accuracies. Table 3 shows the average p-values and the average improvement over chance criterion (IOCC)-values for each group and each feature. Note that the p-value is also low for significant below-chance accuracy. Such a result is indicated by a negative effect size, i.e., a negative IOCC value.
Healthy participants. As can be seen in Table 2, in healthy subjects coherence yielded the highest average classification accuracy followed by FFT.
The lowest SD in the healthy participants group was found for coherence, Hurst exponent, and brainrate. However, Hurst exponent and brainrate showed very low classification accuracies in the healthy participant group.
The smallest range in the healthy participant group was found for Hjorth activity and complexity, and, again, for Hurst exponent and brainrate. Hjorth complexity had a rather small classification accuracy.
Patient groups. In patients in MCS, the highest average accuracy was found for DTF, followed by Hurst exponent and brainrate. In UWS patients, the highest average accuracy was found for brainrate.
In patients in MCS, the lowest SD was found for Tsallis knn, followed by coherence and Hjorth complexity. In UWS patients, the lowest SD was found for Hjorth activity, followed by Granger causality (Granger GW).
In patients in MCS, the significantly lowest ranges corresponded with the lowest SDs. In UWS patients, the lowest range was found for Hjorth activity, followed by partial directed coherence (PDC) and Granger GW.
Above-chance classification. The numbers of participants with above chance accuracy are shown in Table 4.
Healthy participants. The highest number of healthy participants with above chance accuracy without FDR-correction was found for coherence, followed by FFT. Both of these features yielded significantly higher numbers of above chance accuracies than the other features. Figures 2, 3, and 4 show the differences in power spectra between the two conditions in healthy participants, patients in MCS, and with UWS, respectively. Figures 5, 6 and 7 show the coherence-statistics for healthy participants, patients in MCS, and with UWS, respectively.
The number of healthy participants showing above-chance accuracy decreased considerably after FDR-correction. Only coherences still yielded a significantly higher number of abovechance accuracies compared to the other features (see Table 4, FDR-columns).
Patient groups. Most interestingly, there were no patients with above chance accuracies in power spectra and coherence. Without FDR-correction, three MCS patients had an above chance accuracy in the features Hjorth activity (patient MCS3), brainrate (patient MCS2), and DTF (patient MCS5). Among UWS patients, two patients had an above chance accuracy in the features Bhattacharyya knn (patient UWS5) and CorrEntr KDE direct (patient UWS4). Because of the occasional occurrence of above-chance accuracies in the patient groups, all of these values different from zero were significant compared to the other values.
We examined the patients with above-chance accuracies a bit closer. The above-chance accuracy in brainrate (MCS2) turned out to be a classification-error since there was only one value (brainrate = 10) for all channels and all trials. Thus, there was no difference between imagery trials and rest trials. For the other patients (MCS3, MCS5, UWS4, and UWS5), there was no obvious anomaly in the data of the respective features.
However, all of the positive results in patients were not significant after FDR-correction.

Discussion
In this study we evaluated several features and classification methods to answer the question if there are features and/or classification methods which are better suited for examining motor imagery in patients with DOC. SVM classification yielded the best results compared to knn classification in healthy participants. In the group of healthy participants we found best results for coherences and advantageous results for FFTs. Most interestingly, the coherence-results exceed classification accuracies reported previously [3]. In the following we discuss that a higher accuracy in healthy subjects is not the only reason why we suggest to use coherences with DOC-patients.

Coherences and power spectra in DOC
We found that coherences and power spectra yield high classification accuracies in healthy subjects. After FDR-correction, this advantage remains for coherences, only. Our results demonstrate that the possible value of coherences might have been underestimated so far. BCI-projects commonly use frequency analysis as revealed by FFT-power spectra [5], wavelet transform [32], or bandpass filtering [3], while studies involving coherences are rather rare. Coherences, like power spectra, use spectral information and treat the signal as stationary within the analyzed window and thus, allow to detect also pathologically delayed responses in patients [5]. The additional information in coherences over power spectra are connectivity patterns, i.e., brain regions showing synchronized frequency activity.
The use of coherences for BCI control has been shown to be a good option [18,19]. However, the pattern of coherences may differ between healthy participants and patients with severe motor disabilities [33]. The reason for distinctive patterns between patients and healthy participants was interpreted by Nam et al. [33] as reduced cortical differentiation and specialization in patients, reflected by a compensatory recruitment of more cortical regions. This difference may lead to lower performance of a BCIsystem and thus, decrease the classification accuracy in patients.
Against this background, one could argue that coherences are rather unsuited for use in DOC-patients because they may underestimate a patient's ability to perform motor imagery. However, the classification accuracy of features such as power spectra and coherence possibly depend also on the electrode positions used for classification [34]. Krusienski et al. [34] reported that FFT and coherence yield comparable classification accuracies in healthy participants with central electrode positions covering the motor cortex. Nevertheless, the authors state that it is very likely that electrodes positioned over all brain regions could yield different results (i.e., advantages for coherences or phase locking values). Such a montage could detect task-related activity which increases classification accuracy but does not stem from the motor cortex. For example, one could expect a strong activation of frontal regions due to executive functions involved in task performance. This is what we found in healthy participants, i.e., significant differences between rest and imagery coherences between frontal and other brain regions. This alternative source of brain activity could be relevant in patient groups. For example, patients with spinal cord injury suffer from deafferentiation and thereby negative neuroplasticity in the sensorimotor cortex. This may explain the rather low classification accuracies in BCI-systems in tetraplegic patients [35]. Coherences could yield the possibility to recognize motorimagery related activation even without typical activation of the motor cortex. Indeed, our approach of using a full montage revealed that healthy subjects show frontal connectivity patterns in coherences, suggesting that the high classification accuracy in this group stems not only from central activations. Thus, the way we applied coherences should be able to detect voluntary brain activation in patients with DOC even if they suffer from negative neuroplasticity in the motor cortex. A similar approach could be achieved also by features such as Granger causality (Granger GW). However, Granger causality requires a fine tuning of the lag length and produces a longer data vector, possibly causing a smallsample-size problem.
Another advantage of coherences is that they yield small SDs and rather small ranges, which makes them better suited to distinguish participants which do and do not perform motor imagery. In fact, after FDR-correction only 2 healthy participants yielded below-chance accuracy. There were also other features with small ranges or small SDs but only coherences had a small variance combined with high accuracy in the healthy participants group.

False positives
Without FDR-correction, there were some patients showing above chance accuracies in some of the features. These occurrences in the patient groups did not exceed the FDRcorrected threshold. Most interestingly, features which yielded above-chance accuracies in patients did not match with those features showing the highest classification accuracies in the healthy participant group. If the above chance accuracies in patients would have occurred in those features which yielded best results in the healthy participant group the results would have been more thrustworthy. Indeed, the brainrate-results turned out to be a severe classification error.
Only recently Goldfine et al. [14] found that noise in the data [3] may lead to failure of the classifier. Most importantly, several above-chance accuracies in healthy participants as well as in patients were not significant anymore after FDR-correction. As Goldfine et al. [14] point out, based on the statistical criterion of pv.05, a classifier could yield positive results in 5% of the tests even if these datasets were random. In the present data, in 14 patients and 20 features there were 5 above-chance accuracies before an FDR-correction was applied. Based on the assumption that the diagnosis with the CRS-R is highly accurate [16], we considered above-chance accuracies as false positives if they did not reach significance at the FDR-corrected level. Still, the CRS-R is the gold standard, but there is no ground truth to truly ascertain the results in studies involving DOC. Therefore, we need features which are robust against artifacts and classification errors. Perhaps just because of the possibility of false positives in other features we see the clear advantage of coherences. There were no above-chance accuracies with or without FDR correction in patients despite a high number of above-chance accuracies was found in healthy subjects. We consider it a main advantage of the coherences that they seem to be robust against erroneous classifications.
However, examining the healthy participant group showed that interpreting coherence-patterns is not straightforward. Analysis of motor-imagery-related EEG data is usually done on central positions, only, because they represent the central m-rhythm. Thus, we expected to find significant coherences involving central positions in healthy participants. In contrast, there are more healthy participants showing significant differences in the coherences between frontal and other (frontal, central, parietal) regions than participants showing significant differences between central regions. With respect to power spectra, the difference between motor-imagery and rest is obvious above central, parietal, and occipital regions in healthy participants, but with a high SD in the range of the m-rhythm. This inter-individual difference in activation patterns is well in line with previous research involving simple stimuli such as an oddball [36,37], subject's own name [38,39], but also complex tasks such as music [40], and specifically motor imagery [15,41], in which healthy participants show contradictory patterns of activation. As a consequence, visually exploring the pattern of coherences cannot specify whether a patient was performing the task or not. This becomes obvious when comparing the coherence-distribution in the patient groups to the distribution in the group of healthy participants. The attempt to compare power spectra between patients and healthy participants seems to be more useful. However, power spectra achieved a low number of above-chance classification in healthy participants after FDR-correction. Therefore, classification results in power spectra may underestimate a patients' task performance.
Finally, we want to mention that we cannot rule out that patients without below-chance accuracies in coherences (as in any other feature) are conscious. First, it is possible that patients were actually performing the task and the coherence feature failed to classify them correctly as reported previously for severely motor disabled patients [33]. The statistical procedure could have failed to reveal the motor imagery related brain activation or the patients are simply not be able or willing to follow the task instructions. Patients with dementia, for example, are surely conscious but at a certain stage of the progressive disease they may have difficulties in performing even such a simple task.   Classification problems: Overfitting We have found many accuracies smaller than .50 in the patient groups and some in the healthy participant group. It is very likely that the below-chance values are due to overfitting of the data. It is known that the risk for overfitting the system to a particular class rises when the number of samples used for the different classes is not well balanced. That is, the classifier will assign images more likely to the class which contains the most samples within the training set. In addition, overfitting is especially a problem with high dimensional data, i.e., long feature vectors. In EEG-data, this problem of long feature vectors is very common and there exist approaches to reduce e.g. the number of frequencies [42]. Finally, overfitting is also a problem in noisy data, because random variability in the data leads to an inaccurate model. In the worst case, overfitting leads to zero classification accuracy. This may happen in the case that both classes (motor imagery and rest) have equal numbers of trials. Let this number be N. With the leave-oneout cross-validation one trial x of class A is classified and all of the other trials of class A and B serve as training data. The training data includes N trials of class B and N-1 trials of class A. Because of overfitting, the classifier assigns the wrong class B to the trial x. As such, each trial is assigned to the wrong class and the classification accuracy is 0. In a similar way overfitting could lead to above-chance accuracy if there are strongly unbalanced datasets.
The main reason for overfitting in DOC patients and, thus, below-chance accuracies, may be that the data contains no taskrelated information. For the classifier, non-task-related information is equivalent to noise. In fact, if a patient does not perform motor imagery there is no model to build for distinguishing the motor-imagery trials from rest trials.
We believe that coherences are robust against overfitting. In healthy participants, coherences were the only feature with no subjects with below-chance accuracy. The lowest accuracy for coherence among healthy participants was .53 while in all other features the lowest value was below .5. This is also reflected by the fact that coherences had a very small standard deviation and a rather small range both in patients and healthy participants. The accuracies do not vary extremely between participants. This makes the results more reliable and leads to the assumption that a coherence value is better predictable than, e.g., a Wackermann feature, which yielded many below-chance accuracies and a high between-subject variability.

Choosing SVM as a classifier for EEG-data in DOC
We found no significant difference between classifiers in patients. Therefore, the deciding factor for choosing a classifier remains the classification accuracy in healthy participants. But also in healthy participants the differences between classifiers were rather small. SVM yielded best results for healthy participants in our study, which is not surprising given the results of previous research [31,43]. SVM has been used for synchronous brain computer interfaces [32] and for classifying motor imagery in patients with DOC [3]. The main drawback of SVM is that it is slower than other classification methods. However, for the diagnosis of DOC this is not critical because data are usually analyzed off-line. If used for BCI in patients with DOC or SCI, SVM can still be fast enough to reach real-time performance [32]. Nevertheless, the differences of the classification accuracies were rather small and the difference between SVM and DADF were not significant because of a smaller number of degrees of freedom (despite the average accuracy of DADF was the same as for knn k = 3, which was boarder-significantly different from SVM). These results are well in line with previous research, reporting only a slightly better result for SVM than linear discriminant analysis [30]. This suggests that the differences between the classifiers are not very wide and comparing SVM with other classification methods, e.g., a Bayesian classifier may yield no difference between classification results [44]. It is important to note that the kernel of the SVM plays an important role. We used a linear kernel function, but a non-linear kernel function allows a more flexible decision boundary in the data space [32] and also a Gaussian kernel function was shown to be advantageous [44]. However, when applying a non-linear kernel function we found that the classifier did not converge to a solution for each participant's data in certain features. Thus, when choosing a classifier and tuning the classifier's parameters there is a trade-off between best classification results and reliability of the algorithm to converge.

Conclusion
In this study we compared the use of different features and different classifiers in healthy participants and patients with DOC. SVM classification yielded the best results in healthy participants among the options we tested. Classification accuracies of coherences were found to be better than those of all other features.
In DOC patients, though, we recommend to examine coherences over the whole scalp since they identify a more distributed network activity reflecting the cognitive processes involved in motor imagery. This advantage could be crucial when examining severely-disabled patients since the individual response patterns can differ drastically from those of healthy people because of neuroplastic changes. Without FDR-correction we found above-chance accuracies for patients in other features. At least one of these occurrences was identified as failure of the classification method. We suggest that coherences are more robust against artifactual data. This claim is also supported by a small variability of accuracies in coherences, suggesting a good predictability of the classification values.
Future studies should examine if classification accuracies of coherences calculated on a full montage are affected by neuroplastic changes in severely motor disabled patients.

Ethics
The study was approved by the local Ethics Committee (Ethics Commission Salzburg/Ethikkommission Land Salzburg; number 415-E/952) and was designed according to the Declaration of Helsinki. Written informed consent was obtained from all control subjects and from the families or guardianship of all patients.

Subjects
Over the course of 2 years (2010-2012), 16 patients were assessed at the Christian-Doppler-Clinic in Salzburg (Austria). Inclusion criteria for patients were a diagnosis of UWS (n = 10) or MCS (n = 6) based on neuropsychological assessment of trained experts with the CRS-R [45]. Due to artifacts resulting from stereotypical movement patterns, the data of 2 subjects (1 UWS; 1 MCS) were excluded. The remaining sample of 9 UWS patients and 5 MCS patients is described in Table 5.
Cd, code: diagnosis and patient number; WHIM, value on Wessex Head Injury Matrix; CRS-R, value on Coma Recovery Scale -Revised; Dur, duration in months of the disorder at the time of assessment since onset; N.A., not available; H, haemorrhage.A sample of 22 high school graduated subjects (age: 20-26 years; mean = 22.86 years; SD = 1.81; 6 male) was recruited for the healthy subject group. None of the participants reported any history of neurological or psychiatric diseases, nor were they receiving any psychoactive medication. Healthy subjects were remunerated for their expenditure of time. The data of the healthy subject group was analyzed and published recently [15].

Experiment
The experiment consisted of three conditions. There were 24 trials for each condition. In the imagery-condition the participants were asked to imagine to open and close both hands. The resting condition consisted of no movement and no imagery but was preceded by the instruction to hold both hands firm. With this special ''resting task'' we controlled for speech-related activations, which are common in patients with DOC and do not reflect consciousness [46,47]. To ensure that the participants performed the task in the movement-and the imagery-condition during a certain period of time, they were instructed to perform the task while hearing a tone sequence of 2 tones. Both tones were alternatingly recurring once per second. Thus, the participants imagined to open and close their hands once per second. The instructions had durations of 6, 6.5, and 9.5 sec for the rest, move, and imagery conditions, respectively. After each instruction, there was an interval of 5 sec during which the participants were expected to follow the instruction (i.e., to move in the movementcondition, to imagine a movement in the imagery-condition, and to hold the hands firm in the rest-condition). To avoid expectation effects, conditions were ordered pseudo-randomly. Instructions were presented verbally and binaurally through earphones using Presentation software (Neurobehavioral Systems, version 12). The auditory material was recorded and processed with Audacity (version 1.2.6). The instructions were normalized to an equal sound level. Healthy participants were asked to look straight ahead during the experiment. Patients were awake during the experiment with eyes open. The presented data are based on these two conditions while the third condition, the movement condition, was excluded from analysis because of limited information content in the patient groups.

Data preparation
An overview of this and the following steps is given in the flowchart in Figure 1. Data were pre-processed with Brain Vision Analyzer (Version 1.05.0005, Brain Products GmbH). In order to re-reference all channels, a new reference was built by averaging the signal of mastoid electrodes. To obtain a bipolar vertical electrooculogram, the average of Fp1 and Fp2 was used as a reference for the lower-site vertical electrooculogram. Butterworth Zero Phase Filters from 1 to 48 Hz (time constant 0.1592s, 48dB/ oct) were applied to reduce noise.
Independent component analysis (ICA) was applied to detect and remove ocular, muscular, and cardiac artifactual sources [48][49][50]. The ICA was calculated on the entire dataset, on all channels, including the prepared electrooculographic channels. An experienced researcher identified the components containing ocular, cardiac, or muscle artifacts by visual inspection. These components were removed by performing the corresponding ICA backtransformation.
An automatic data inspection was carried out in order to exclude remaining artifacts. Maximal allowed voltage step per To extract features of individual trials, data was segmented into 4 sec epochs for each trial. These segments started at the end of the acoustic instruction and ended 1 sec before the next instruction to avoid expectancy effects. The preprocessed segments were exported into a generic data format and imported to MatlabH (The Mathworks). These segments were first spatially filtered to common average reference [51].
Patient's data was first downsampled to 250Hz to have the same sampling frequency as data collected from healthy participants.

Feature extraction
In the following, we list the features and refer to the software we used to compute them.
Hjorth parameters. The BioSig implementation [52] of the Hjorth features as proposed by Hjorth [53,54] was used. Hjorth parameters include activity, mobility, and complexity. Each of these features was evaluated in stationary mode separately for each channel to the whole segment (4 sec). Thus, the result was a feature per channel and, as a whole, a feature vector. Classification was carried out separately for the three Hjorth feature vectors.
Brainrate. The BioSig implementation [52] of the brainrate [55] was used. The brainrate was calculated in stationary mode for each channel over the whole segment.
Wackermann. The BioSig toolbox [52] provides a function to calculate the global field strength s, the global frequency w, and a measure of spatial complexity v [56,57]. Each of these features was evaluated in stationary mode separately for each channel over the whole segment. Classification was carried out separately for the three Wackermann feature vectors.
Hurst exponent. The BioSig toolbox [52] includes a function to estimate the Hurst parameter via the rescaled range [58]. The Hurst exponent, also known as index of long-range dependence, was calculated for each channel over the whole segment.
FFT. The FFT was calculated with the matlab function fft.m. To reduce the length of the feature vector (124 frequency steps between 1 and 30Hz for each channel), we calculated the average value at 2.44Hz frequency steps between 1 and 28Hz (12 values for each channel).
Coherence. Coherence was calculated with the matlab function mscohere.m. To reduce the length of the feature vector (124 frequency steps between 1 and 30Hz for each channel|channel combination) we calculated the average value at 2.44Hz frequency steps between 1 and 28Hz (12 values for each channel|channel combination).
Granger causality. Spectral granger causality [59,60] was calculated with the function cca_pwcausal.m in Seth's toolbox [61] which was applied to each single segment. This function returns the log-ratio of granger-causalities for each frequency step (GW, i.e., a matrix with channels in rows and columns for each frequency step), analogously values of coherence, and the power spectrum for each channel (pp). We used 2.44 Hz frequency steps between 1 and 28 Hz. Note that the information of these PDC and DTF. The PDC [62] and DTF [63] were calculated with functions provided by Omidvarnia [64] including also functions from the BioSig toolbox [52] and the arfit toolbox [65,66] with 12 2.44 Hz frequency steps between 1 and 28 Hz. For computing the time-varying PDC and DTF measures we used the optimum model order as estimated with the arfit toolbox between 1 and 20 for each trial.
Entropy. A large set of entropy measure calculations was first examined. Shannon entropy was calculated with the matlab function wentropy, approximate entropy was calculated with the function ApEn provided by Kijoon Lee in Matlab central file exchange with a tolerance of r~2 Ã SD. All of the other entropy features (again Shannon entropy but by knn estimation, with multivariate Edgeworth expansion, locally linear regression, approximate slope of the inverse distribution function with and without piecewise constant or linear correction as well as with and without bias correction; Rényi entropy estimated by knn, weighted knn, empiric entropy estimator of order m, approximate slope of the inverse distribution function, and continuously differentiable sample spacing (CDSS); Tsallis entropy estimated with k-nearest neighbors) were calculated with Szabó's Octave/Matlab Toolbox for Information Theoretical Estimators [67,68]. All of these entropies were then classified in all groups and we chose the Shannon entropy and the Rényi entropy with highest classification accuracies in the healthy participant group since the values in patient groups did not differ systematically. The highest accuracies were reached with the estimation with approximate slope of the inverse distribution (spacingV). Thus, we used 4 entropy measures: Shannon and Rényi entropy estimated by spacingV, Tsallis entropy estimated with k-nearest neighbors, and approximate entropy.
Bhattacharyya distance. Bhattacharyya distance to measure the divergence, i.e. the distance between to probability densities, was calculated with the with k-nearest neighbors implementation of Szabó's Octave/Matlab Toolbox for Information Theoretical Estimators [67,68].

Classification
Trials of each participant were classified with the leave-one-out cross-validation. Each trial was used once as validation data with the remaining trials as the training data. This was repeated such that each trial was used once as the validation data. Then, we calculated the percent of correctly classified trials. This proportion is called hereafter the classification accuracy.
For classification we used three algorithms (MATLAB-functions): discriminant analysis with a diagonal quadratic function (DADF). Each trial was classified by a specified discriminant function. In the present case, this function was ''diagonal quadratic''. In this function multivariate normal densities are fit with a diagonal covariance matrix estimate (naive Bayes classifiers) stratified by group.
k-nearest neighbor classification (knn). Each trial was classified by knn classification with Euclidean distance metric. We applied the knn-classifier twice, with k~1 and with k~3. These two values were chosen because with k~1 we produced the special case of the nearest neighbor algorithm and with k~3 we chose an odd number, which avoids tied votes in binary classification problems and this value for k is still small so that the boundaries between classes can be quite distinct. support vector machine classifier (SVM). Classification with SVM technique was used with a linear kernel function.

Above-chance classification
Thus, we got a classification accuracy for each participant. Based on the individuals' classification accuracies, we estimated the number of healthy participants, UWS-patients, and MCSpatients who showed above chance accuracies, separately for the 4 classification methods and for each of the 20 features. We used a proportional chance criteria and an adequate measure of significance and effect size, as described in [69], for assessing if the features' accuracies reflected task performance, that is, if they classified motor imagery and rest-trials with an accuracy that is significantly above chance. The proportional chance criteria HC prop is calculated by summing the squared proportion that each number of trials represents in the whole trial set: where N r is the number of resting-trials and N i is the number of motor imagery trials. We required that the accuracy should be significantly better than the chance criteria [69]. That is, an accuracy is considered as reflecting task performance if the improvement over chance criterion was significant as assessed by a z-statistic:.
where H 0 is the number of correctly classified trials and N is the total number of trials in both categories. Note that the z-test assumes that the data are independently sampled from a normal distribution, which was proved with both the Chi-square goodness-of-fit test (Matlab function chi2 g of.m) and the Lilliefors test (Matlab function lillietest.m; [70]).The significance of z is determined according to the critical values from a standard normal distribution. This was done by computing the probability density function of the normal distribution with mean 0 and standard deviation 1 at the resulting z-values. A false discovery rate (FDR) correction [71] for correcting multiple comparisons (20 features, 4 classification methods, and 36 participants) was applied to the critical alpha level of .05. This resulted in a critical alpha level of pƒ.0086.
In addition, the effect size of this improvement can be measured directly with the improvement over chance criterion (IOCC): For each group and each feature we reported the average pvalues and average IOCC-values. To compare the classification methods (discriminant analysis, knn classification with k = 1 and k = 3, SVM), we compared the number of significantly abovechance accuracies for C prop of the features separately in each participant group and separately for the numbers with and without FDR-correction. Normal distribution of these numbers was evaluated with the Kolmogorov-Smirnov test. Since the numbers were not distributed normally in any of the result-sets of the 4 classification methods and 3 participant groups, a non-parametric Wilcoxon signed rank test was used for the comparisons. The resulting p-values were interpreted at the FDR-corrected threshold of significance. For interpretation of significant results, we calculated also the mean accuracy over all features for each classification method result set (i.e., the mean over the 20 features for each classification method). The feature comparisons were then carried out for the best classification method, only.
We calculated the mean, standard deviation (SD), and the range (maximum-minimum) of each feature, separately for each group.
To find out which features had significantly higher average accuracy, lower SD, and lower range of accuracies than the other features, i.e., which features were significantly better than all other features, we calculated one-sample Wilcoxon signed rank tests for each feature vs. all other features on average accuracy, SD, and range. These 180 results were interpreted at the FDR-corrected threshold of significance pƒ.0242.
The same procedure was applied to the number of participants showing above-chance accuracy. That is, we evaluated separately for the C prop results with and without FDR-correction and separately for the 3 participant groups which features yielded significantly more above-chance accuracies than the other features. The level of significance was FDR corrected for these 120 comparisons and resulted in pƒ.0069.

Preparation of figures
For graphical purposes, we calculated the group-average and group-standard deviation of the difference in power spectra between the resting and imagery condition. These values were plotted for each electrode position of a subset of 13 electrodes. This subset was chosen to ensure that the graphics are not overloaded.
In addition, a graphical representation of coherences was prepared. We evaluated the difference between rest and imagery in the coherences statistically. This was done by first calculating the coherence over all trials instead of individually for each trial and then converting each coefficient for imagery and rest of each channel | channel combination and each frequency into a zscore using Fisher's r-to-z transformation. Then, taking into account the number of samples and trials for each condition (imagery and rest), these z-scores were compared using formula 2.8.5 from Cohen and Cohen [72].
Then, we calculated the number of subjects with significantly different coherences with a threshold of the FDR-corrected level of significance pv:0036 for a subset of 13 electrodes (the same subset as for power spectra). The graphics were constructed by ordering the electrode positions in a scalp-like manner for each frequency and by coding the numbers of participants with significantly different coherences with line-thickness and coloring.