Collaborative Filtering for Brain-Computer Interaction Using Transfer Learning and Active Class Selection

Brain-computer interaction (BCI) and physiological computing are terms that refer to using processed neural or physiological signals to influence human interaction with computers, environment, and each other. A major challenge in developing these systems arises from the large individual differences typically seen in the neural/physiological responses. As a result, many researchers use individually-trained recognition algorithms to process this data. In order to minimize time, cost, and barriers to use, there is a need to minimize the amount of individual training data required, or equivalently, to increase the recognition accuracy without increasing the number of user-specific training samples. One promising method for achieving this is collaborative filtering, which combines training data from the individual subject with additional training data from other, similar subjects. This paper describes a successful application of a collaborative filtering approach intended for a BCI system. This approach is based on transfer learning (TL), active class selection (ACS), and a mean squared difference user-similarity heuristic. The resulting BCI system uses neural and physiological signals for automatic task difficulty recognition. TL improves the learning performance by combining a small number of user-specific training samples with a large number of auxiliary training samples from other similar subjects. ACS optimally selects the classes to generate user-specific training samples. Experimental results on 18 subjects, using both nearest neighbors and support vector machine classifiers, demonstrate that the proposed approach can significantly reduce the number of user-specific training data samples. This collaborative filtering approach will also be generalizable to handling individual differences in many other applications that involve human neural or physiological data, such as affective computing.


Introduction
Future technologies that allow computer systems to adapt to individual users, or even to the current cognitive/affective state of the user, have many potential applications including entertainment, training, communication, and medicine. One promising avenue for developing these technologies is through brain-computer interaction (BCI) or physiological computing, i.e., using processed neural or physiological signals to influence human interaction with computers, environment, and each other [1,2]. There are numerous challenges to effectively using these signals in system development. One of the primary challenges is the individual differences in neural or physiological response to tasks or stimuli. In order to address these individual differences, many researchers train or calibrate their systems for each individual, using data collected from that individual. However, the time spent collecting this data is likely to decrease the utility of these systems, slowing their rate of acceptance. As an example, one of the primary reasons that slow cortical potential-based BCIs never achieved mainstream acceptance, even among the disabled, is that using them could require training for several hour-long sessions per week for months in order to achieve satisfactory user performance [3].
While it is possible to train a generic model with group or normative data, in practice this tends to result in significantly lower performance than calibrating with individual data [4]. An example of this may be found in our earlier work [5], in which we used a support vector machine (SVM) to classify three task difficulty levels from neural and physiological signals while a user was immersed in a virtual reality based Stroop task [6], which has been shown to have high individual differences in neural and physiological response as the task difficulty varies [7]. Results revealed that when each subject is considered separately, an average classification rate of 96.5% can be obtained by SVM; however, the average classification rate was much lower (36.9%, close to chance) when a subject's perception of task difficulty level was predicted using only data from other subjects. In a more recent study [8] on whether a generic model works for rapid event-related potential (ERP)-based BCI calibration, a generic model was derived from 10 participants' data and tested on the 11th participant. Experiments showed that seven of the 11 participants were able to use the generic model during online training, but the remaining four could not.
Novel approaches to analyses of individual differences have significant potential in helping to address these individual differences in neural and physiological responses [9,10]. In particular, we are interested in analytical methods that decrease the amount of data required for training/calibrating a customized BCI system, or equivalently, methods that increase the performance of the BCI system without increasing the number of user-specific training samples. Collaborative filtering, the process of making inferences about a user based on the combination of data collected from that user with a database of information collected from similar previous users, is one potential solution [11]. Using collaborative filtering to decrease the amount of time and data required for individual customization should, in turn, increase usability, leading to wider acceptance of BCI technologies [12].
This paper describes a successful collaborative filtering approach developed for implementation in a BCI system. While there are many types of BCI systems [1], the example application domain used herein was developed as a passive BCI (i.e. a BCI that uses a pattern recognition algorithm to passively monitor a user's cognitive and/or affective state [13]), that would monitor a user of a virtual environment (VE) for cognitive assessment and rehabilitation, looking for neural and physiological indicators of task difficulty. The specific VE used for the sample domain is the Virtual Reality Stroop Task (VRST), which uses neuropsychological tests embedded into military-relevant VEs to evaluate potential cognitive deficits [6,14,15]. Cognitive assessment and rehabilitative VEs or serious games such as VRST require immersion on the part of the user to be successful [16][17][18]. One of the key aspects for immersion is the difficulty of the task being performed. If the task is too difficult, the user will become frustrated and lose interest. However, if the task is too easy, the user will become bored, again resulting in a loss of interest [19][20][21].
There are many ways to modulate difficulty based solely on the behavioral measures of user performance [22]. However, there are also strong individual differences in ability to handle difficulty, i.e., some users are better able to handle more difficult tasks than others. One way to address these individual preferences would be to combine information obtained from neural and physiological measures with the behavioral measures to provide superior performance for difficulty modulation [23]. However, these methods for neural and physiological-based difficulty modulation are strongly affected by the differences between individual physiological responses [2]. Thus, for these approaches to be successful, we will require a robust method for addressing the widely varying individual differences in physiological response to task difficulty.
One method for tailoring a BCI pattern recognition algorithm for a specific user is to collect a set of user-specific training data samples at once, estimate the recognition performance using cross-validation, and iterate until the maximum number of iterations is reached or the cross-validation performance is satisfactory. The pseudocode for such an algorithm is shown in Figure 1, where the k-nearest neighbor (kNN) classifier is used for simplicity.
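The baseline loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reproduction of the pseudocode in Figure 1: the function names, the candidate values of k, the target accuracy, and the `collect_batch` callback (standing in for an online data-collection step) are all hypothetical.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Squared Euclidean distance is sufficient for ranking neighbors.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_x, train_y)
    )
    # Vote ties resolve toward the class encountered first (the nearest one).
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(xs, ys, k):
    """Leave-one-out cross-validation accuracy of kNN on (xs, ys)."""
    correct = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        correct += knn_predict(rest_x, rest_y, xs[i], k) == ys[i]
    return correct / len(xs)


def select_k(xs, ys, candidate_ks=(1, 2, 3, 4)):
    """Return (best_k, best_accuracy); the smallest k wins ties."""
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        if k >= len(xs):  # holding one sample out leaves len(xs) - 1 neighbors
            continue
        acc = loo_accuracy(xs, ys, k)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc


def calibrate(collect_batch, max_iterations=10, target_accuracy=0.9):
    """Collect batches until accuracy is satisfactory or iterations run out.

    `collect_batch` is a hypothetical callback returning (features, labels).
    """
    xs, ys = [], []
    best_k, acc = None, 0.0
    for _ in range(max_iterations):
        new_x, new_y = collect_batch()
        xs += new_x
        ys += new_y
        best_k, acc = select_k(xs, ys)
        if acc >= target_accuracy:
            break
    return best_k, acc, len(xs)
```

In this sketch every batch is drawn without regard to other subjects' data or to per-class performance, which is exactly what the two techniques introduced next aim to improve.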
There are many techniques to improve this method. Two of them are examined in this paper:
• Transfer Learning (TL) [24]: Use the information contained in other subjects' training data. Although training data from other subjects may not be completely consistent with a new subject's profile, they may still contain useful information, as people may exhibit similar responses to the same task. As a result, improved performance can be obtained at recognizing the difficulty of a task.
• Active Class Selection (ACS) [25]: Optimally generate the user-specific training samples online. If in an application there are many offline unlabeled training samples and the bottleneck is to label them, then active learning [26][27][28][29] can be used to optimally select a small number of training samples to label. However, in many applications we do not have unlabeled data, and all training samples need to be generated online. Thus we cannot propose which samples to label; instead we must obtain additional training samples. So, the ACS problem becomes how to drive the selection of the class from which training samples are obtained during an online calibration session with the user, so that a high-performance classifier can be constructed from a small number of training samples.
In our previous research we have shown that TL can improve classification performance compared with a baseline that uses only the user-specific training samples [30], and ACS can improve classification performance compared with a baseline that selects the classes uniformly [31]. Because TL considers how to make use of data from other subjects and ACS considers how to optimally generate user-specific training samples online, they are independent and complementary. This paper presents theory and experimental results on a collaborative filtering approach which combines TL and ACS for learning an optimal classifier from a minimum number of user-specific training data samples.
There has been some work on combining TL and collaborative filtering [32][33][34][35][36], where TL was used to make use of auxiliary data to address the data sparsity problem in collaborative filtering. Most of this work was for recommender systems, particularly movie recommendation. Our work is different from these in that: 1) we use TL to handle the data insufficiency problem instead of the sparsity problem; 2) we combine TL and ACS, instead of using TL only; and, 3) we apply our algorithm to a BCI system.
There are also a small number of existing collaborative filtering systems for BCI [8,37-40], which integrate information from other users to improve the performance for the current user. For example, Lu et al. [40] built two classification models for each user: an adaptive user-specific model trained from user-specific online training data only, and a user-independent model trained from offline training data from other users. The two models classified each new input independently, and the one with the higher confidence score was chosen. Jin et al. [8] built an online generic classification model by directly combining online user-specific training data and offline training data from other users. They showed that the online generic model achieved better performance than a generic model which used offline data from other users only. It also achieved similar performance to a typical model which used user-specific data only, but the online generic model needed less user-specific data so it was trained more quickly. Our work is different from these approaches in two aspects. First, we propose a different way to make use of the offline data from other users. Second, we propose an optimized procedure to generate the user-specific training data online.

Transfer Learning (TL)
This section introduces the theory and an algorithm for TL. For simplicity we use the kNN classifier as an example since it has only one parameter (k) to optimize given a fixed distance function and the type of normalization. However, these ideas can be generalized to other classifiers such as the SVM [41].
TL theory. In many machine learning applications, in addition to the data for the current task, we also have data from similar but not exactly the same tasks. The learning performance can be greatly improved if these additional data are used properly. TL [24,42] is a framework proposed for addressing this problem.
Definition (Transfer Learning). [24] Given a source domain $D_S$ with learning task $T_S$, and a target domain $D_T$ with learning task $T_T$, TL aims to help improve the learning of the target predictive function $f_T(\cdot)$ in $D_T$ using the knowledge in $D_S$ and $T_S$, where $D_S \neq D_T$, or $T_S \neq T_T$.
In the above definition, a domain is a pair $D = \{\mathcal{X}, P(X)\}$, where $\mathcal{X}$ is a feature space and $P(X)$ is a marginal probability distribution, in which $X = \{x_1, \ldots, x_n\} \in \mathcal{X}$. $D_S \neq D_T$ means that $\mathcal{X}_S \neq \mathcal{X}_T$, and/or $P(X_S) \neq P(X_T)$, i.e., the features in the source domain and the target domain are different, and/or their marginal probability distributions are different. Similarly, a task is a pair $T = \{\mathcal{Y}, P(Y|X)\}$, where $\mathcal{Y}$ is a label space and $P(Y|X)$ is a conditional probability distribution. $T_S \neq T_T$ means that $\mathcal{Y}_S \neq \mathcal{Y}_T$, and/or $P(Y_S|X_S) \neq P(Y_T|X_T)$, i.e., the label spaces between the source and target domains are different, and/or the conditional probability distributions between the source and target domains are different.
For example, in the domain of classifying the subjective difficulty level of a task in a VE based on neural and physiological signals, the labeled neural and physiological data from a user would be the primary data in the target domain, while the labeled neural and physiological data from other users would be the auxiliary data from the source domain. A single data sample would consist of the feature vector for a single epoch of neural and physiological data from one subject, collected as a response to a specific stimulus, and labeled with the difficulty of responding to that stimulus. Though the features in this primary data and auxiliary data would be the same, generally their marginal distributions are different, i.e., $P(X_S) \neq P(X_T)$, due to the fact that the baseline physiological levels for the subjects are likely to differ. Moreover, the conditional probabilities are also different, i.e., $P(Y_S|X_S) \neq P(Y_T|X_T)$, due to the significant individual differences in neural and physiological response to different difficulty levels. As a result, the auxiliary data from the source domain cannot represent the primary data in the target domain accurately, and must be integrated with some labeled primary data in the target domain to induce the target predictive function.
Previous work [42] has shown that when the primary training dataset is very small, training with auxiliary data can significantly improve classification accuracy, even when the auxiliary data is significantly different from the primary data. This result can be understood through a bias/variance analysis. When the size of primary training data is small, a learned classifier will have large variance and hence large error. Incorporating auxiliary data, which increases the number of training samples, can effectively reduce this variance. However, this data may increase the bias, since the auxiliary and primary training data have different distributions. This also suggests that as the amount of primary training data increases, the utility of auxiliary data should decrease [42].
TL algorithm. Suppose there are $N^p$ user-specific training samples $\{x_i^p, y_i^p\}_{i=1,2,\ldots,N^p}$ for the primary supervised learning problem, where $x_i^p$ is the feature vector of the $i$th training sample and $y_i^p$ is its corresponding class label. The superscript $p$ indicates the primary learning task. Additionally, there are $N^a$ auxiliary training samples (training samples from other subjects) $\{x_i^a, y_i^a\}_{i=1,2,\ldots,N^a}$, whose distribution is assumed to be similar to the primary training samples but not exactly the same. So, the auxiliary training samples should be treated as weaker evidence in designing a classifier. Moreover, we may want to select some "good" auxiliary training samples and discard the "bad" ones.
In the kNN classifier we need to optimize the number of NNs, $k$. This is done through internal cross-validation [42,43]. The most important quantity in determining the optimal $k$ is the internal cross-validation accuracy on the primary training samples, i.e., the proportion of correctly classified primary training samples in the internal cross-validation, $a^p$. However, because $N^p$ is very small, different $k$ may easily result in the same $a^p$. So, $a^a$, the internal cross-validation accuracy on the selected "good" auxiliary training samples, is used to break the ties. Once the optimal $k$ is identified for the kNN classifier, its performance can be evaluated as the accuracy of the algorithm classifying the test data.
As pointed out in [42], in many learning algorithms, the training data play two separate roles. One is to help define the objective function, and the other is to help define the hypothesis. Particularly, in kNN one role of the auxiliary data is to help define the objective function and the other is to serve as potential neighbors. In [30] we investigated both roles and found that using the auxiliary training samples in the validation part of the internal cross-validation algorithm generally achieved better performance. So, only this approach is considered in this paper. In each iteration the TL algorithm computes $a^p$ by leave-one-out cross-validation using the $N^p$ primary training samples, and $a^a$ using the $N^p$ primary training samples to classify a selected set of "good" auxiliary training samples. Its pseudo-code is given in Figure 2, and is denoted TL in this paper.
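The tie-breaking rule described above can be illustrated with a short Python sketch. It is a simplified reading of the text rather than the pseudo-code of Figure 2: the function names and candidate values of k are assumptions, and the auxiliary samples passed in are presumed to have already been selected as suitable.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k nearest training samples (squared Euclidean)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_x, train_y)
    )
    return Counter(y for _, y in dists[:k]).most_common(1)[0][0]


def tl_select_k(prim_x, prim_y, aux_x, aux_y, candidate_ks=(1, 2, 3, 4)):
    """Choose k by primary accuracy a_p, breaking ties with auxiliary accuracy a_a.

    Returns (k, a_p, a_a). The auxiliary samples are used only for validation,
    never as neighbors.
    """
    n = len(prim_x)
    best = None  # (a_p, a_a, k)
    for k in candidate_ks:
        if k >= n:  # leave-one-out needs k neighbors among n - 1 samples
            continue
        # a_p: leave-one-out accuracy on the primary samples.
        hits = 0
        for i in range(n):
            rest_x = prim_x[:i] + prim_x[i + 1:]
            rest_y = prim_y[:i] + prim_y[i + 1:]
            hits += knn_predict(rest_x, rest_y, prim_x[i], k) == prim_y[i]
        a_p = hits / n
        # a_a: accuracy of the primary samples classifying the auxiliary ones.
        a_a = sum(
            knn_predict(prim_x, prim_y, x, k) == y for x, y in zip(aux_x, aux_y)
        ) / len(aux_x)
        # Tuple comparison: a_p decides first, a_a only when a_p is tied.
        if best is None or (a_p, a_a) > best[:2]:
            best = (a_p, a_a, k)
    return best[2], best[0], best[1]
```

With only a handful of primary samples, several values of k often reach the same leave-one-out accuracy; in that case the auxiliary accuracy is the only signal that separates them.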
How the "good" auxiliary training samples are selected is very important to the success of the TL algorithm. The general guideline is to select auxiliary training samples that are similar to the primary training samples. Specifically, we use a mean squared difference heuristic, calculated as follows:
1. Compute the mean feature vector of each class for the new subject from the $N^p$ primary training samples. These are denoted $m_i^p$, where $i = 1, 2, \ldots, c$ is the class index.
2. Compute the mean feature vector of each class for each subject in the auxiliary dataset. These are denoted $m_i^j$, where $i = 1, 2, \ldots, c$ is the class index and $j$ is the subject index.
3. Select the subject with the smallest difference from the new subject, i.e., $\arg\min_j \sum_{i=1}^{c} \|m_i^p - m_i^j\|^2$, and use his/her data as auxiliary training data.
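As an illustration of this similarity heuristic, the following Python sketch computes per-class mean feature vectors and picks the auxiliary subject whose class means are closest to those of the new subject. The function names and the dictionary layout of `aux_subjects` are assumptions made for the example, and the sketch presumes that every auxiliary subject has at least one sample of each class seen in the primary data.

```python
def class_means(xs, ys):
    """Map each class label to the mean feature vector of its samples."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def most_similar_subject(prim_x, prim_y, aux_subjects):
    """Return (best_subject_id, distances).

    aux_subjects: dict mapping a subject id to a (features, labels) pair.
    The distance is the sum over classes of the squared Euclidean distance
    between the new subject's class mean and the auxiliary subject's.
    """
    m_p = class_means(prim_x, prim_y)
    distances = {}
    for sid, (xs, ys) in aux_subjects.items():
        m_j = class_means(xs, ys)
        distances[sid] = sum(
            sum((a - b) ** 2 for a, b in zip(m_p[c], m_j[c])) for c in m_p
        )
    best = min(distances, key=distances.get)
    return best, distances
```

Because the class means of the new subject are estimated from very few samples, the ranking of auxiliary subjects may change as more primary samples arrive, so it is reasonable to recompute it in each iteration.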
We note that the way to select "good" auxiliary training samples may be application dependent, and there are multiple potential avenues for research in this area. One of them is pointed out in the Future Research section.

Active Class Selection (ACS)
This section introduces the theory and an algorithm for ACS. For simplicity the kNN classifier is used; however, the algorithm can be extended to other classifiers such as the SVM.
ACS theory. Active learning (AL)  has been attracting a great deal of research interest recently. It addresses the following problem: suppose that we have considerable amounts of offline unlabeled training samples and that the labels are very difficult, time-consuming, or expensive to obtain; which training samples should be selected for labeling so that the maximum learning (classification or prediction) performance can be obtained from the minimum labeling effort? For example, in speech emotion estimation [29,44,45], the utterances and their features can be easily obtained; however, it is difficult to evaluate the emotions they express. In this case, AL can be used to select the most informative utterances to label so that a good classifier or predictor can be trained based on them. Many different approaches have been proposed for AL [28] so far, e.g., uncertainty sampling [46], query-by-committee [47,48], expected model change [49], expected error reduction [50], variance reduction [51], and density-weighted methods [52]. However, in many online applications we do not have large amounts of unlabeled offline data, and hence cannot propose which samples to label. Instead, it is possible to request more training samples on-the-fly from desired classes. For example, in the domain of classifying the subjective difficulty level of a task in a VE based on neural and physiological signals, there is no existing unlabeled data (i.e., neural/physiological signals), and it is not possible to generate sample training data with specific characteristics, such as a data sample with a specific heart rate, or EEG alpha power. However, we can control the difficulty level of the training sample. So, the problem of ACS is to optimally select the classes during real-time interaction with the user, by displaying stimuli from desired classes to the user in order to obtain training samples that allow a high-performance classifier to be constructed from minimal training samples.
Unlike the rich literature on AL, there has been limited research on ACS. Weiss and Provost [53] proposed a budget-sensitive progressive sampling algorithm for selecting training data. They considered a two-class classification problem. The proportion of Class 1 samples and Class 2 samples added in each iteration of the algorithm is determined empirically by forming several class distributions from the currently available training data, evaluating the classification performance of the resulting classifiers, and then determining the class distribution that performs best. They demonstrated that this heuristic algorithm performs well in practice, though the class distribution of the final training set is not guaranteed to be the best class distribution. Lomasky et al. [25] claimed that if one can control the classes from which training samples are generated, then utilizing feedback during learning to guide the generation of new training data may yield better performance than learning from any a priori fixed class distributions. They proposed several ACS approaches to iteratively select classes for new training instances based on the existing performance of the classifier, and showed that ACS may result in better classification accuracy. The below algorithm is based on and improves Lomasky et al.'s Inverse ACS algorithm [25].
ACS algorithm. In [31] we compared two ACS algorithms (Inverse and Accuracy Improvement), proposed by Lomasky et al. [25], with a baseline uniform sampling approach, and found that the Inverse algorithm consistently outperformed a baseline kNN classifier. This approach is considered and improved in this paper. The ACS method relies on the assumption that poor class accuracy is due to not having observed enough training samples. It requires internal cross-validation to evaluate the performance of the current classifier so that the class with poor performance can be identified and more training samples can be generated for that class.
We assume that there are $c$ classes and no limits on generating instances of a particular class. The ACS algorithm begins with a small set of $l_0$ labeled training samples, where $l_i$ is the number of instances to generate in Iteration $i$. ACS is used to determine $p_i^j$ ($0 \le p_i^j \le 1$), the proportion of the $l_i$ instances that should be generated from Class $j$. In Iteration $i$, we record the classification accuracy (in the leave-one-out cross-validation) for each class, $a_i^j$, $j = 1, 2, \ldots, c$. Then, Lomasky et al. defined the probability of generating a new instance from Class $j$ as:
$$p_i^j = \frac{1/a_i^j}{\sum_{k=1}^{c} 1/a_i^k},$$
i.e., it is proportional to the inverse of $a_i^j$. We have improved this approach by adding a constraint that no two consecutive new training samples can be generated from the same class, i.e., if the last new training sample is generated from Class $h$, then the next new training sample is generated from Class $j$ ($j \neq h$) with probability:
$$p_i^j = \frac{1/a_i^j}{\sum_{k \neq h} 1/a_i^k}.$$
This improvement reduces the risk that most new samples are generated from the class which has the lowest accuracy but is difficult to improve, and our experiments showed that it is more robust than Lomasky et al.'s original approach.
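The class-selection rule, including the constraint on consecutive samples, can be sketched as follows. This is an illustrative reading of the description above rather than the algorithm of Figure 3; the function names are hypothetical, and the small constant `eps`, which guards against division by zero when a class accuracy is 0, is an assumption not taken from the text.

```python
import random


def inverse_acs_probabilities(class_accuracies, last_class=None, eps=1e-6):
    """Return a dict mapping each class to its selection probability.

    class_accuracies: dict mapping class label to its cross-validation accuracy.
    last_class: class of the most recently generated sample; it is excluded
    so that no two consecutive samples come from the same class.
    Assumes at least two classes.
    """
    weights = {
        c: 1.0 / max(a, eps)  # eps is an assumed guard for zero accuracy
        for c, a in class_accuracies.items()
        if c != last_class
    }
    total = sum(weights.values())
    probs = {c: w / total for c, w in weights.items()}
    if last_class in class_accuracies:
        probs[last_class] = 0.0
    return probs


def pick_next_class(class_accuracies, last_class=None, rng=random):
    """Draw the class of the next training sample from the ACS distribution."""
    probs = inverse_acs_probabilities(class_accuracies, last_class)
    classes = list(probs)
    return rng.choices(classes, weights=[probs[c] for c in classes], k=1)[0]
```

For example, with class accuracies of 0.5, 0.25, and 1.0 the unconstrained probabilities are 2/7, 4/7, and 1/7; if the previous sample came from the second class, the remaining two classes are drawn with probabilities 2/3 and 1/3.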
The detailed algorithm is given in Figure 3, and it is denoted ACS in this paper.

Combining TL and ACS
Because in our case TL considers how to make use of training data from other subjects and ACS considers how to optimally generate samples of user-specific training data online, they are independent and complementary. So, we conjecture that a collaborative filtering approach based on combining TL and ACS will result in improved classification performance. The fundamental concept is to use TL to select the optimal classifier parameters for the current subject based on available data obtained from the current subject and other subjects, and then use ACS to obtain the most informative new training samples from the current subject, until the desired cross-validation accuracy is obtained, or the maximum number of training samples is reached, as illustrated in the left column of Figure 4. The pseudo-code for combining TL and ACS for a kNN classifier is given in Figure 5, and it is denoted TL+ACS in this paper.
The idea of TL+ACS can be illustrated with the following example. Suppose there are three label classes, and we start from 3 primary training samples (one for each class) and generate one new training sample in each iteration until the desired cross-validation accuracy is reached. In the first iteration, we use TL (combining the 3 primary training samples with a large number of "good" auxiliary training samples) to identify the optimal k, and then use ACS to compute the probability that the new training sample should be generated from each class. A new training sample is then generated according to the three probabilities and added to the primary training dataset. These 4 primary training samples are then combined with a large number of "good" auxiliary training samples and used in the second iteration. The program iterates until the desired cross-validation accuracy is reached. The optimal k obtained in the last iteration (identified by TL) is output as the optimal kNN parameter.
Recall that ACS considers the case in which we do not have unlabeled offline data in the target domain, and we can only control from which classes new training samples are generated on-the-fly. When there are large amounts of unlabeled data in the target domain and we want to suggest which ones to label, an AL approach is appropriate. As TL and AL are also independent and complementary, they could be combined in a similar fashion as TL and ACS [54], as shown in the right column of Figure 4.

Experiments
This section presents our experimental results from a comparison of the four algorithms (baseline, TL, ACS, and TL and ACS combined, for both the kNN and SVM classifiers). These algorithms were applied to the task difficulty level classification problem introduced in [5]. We first consider kNN because we have shown that the specific TL and ACS approaches presented in the previous sections work well with this classifier [30,31]. For example, in [30] we compared two TL approaches for the kNN classifier and found that the approach presented in this paper gave better results; in [31] we compared two ACS approaches for the kNN classifier and found that the approach presented in this paper gave better results. However, the generic framework of combining TL and ACS should apply to all classifiers, though the implementation details may differ. To demonstrate this, we also present results of these methods applied to an SVM classifier.
It is important to note that the purpose of the experiments is not to show how good a kNN or SVM classifier can be in task difficulty classification; instead, the goal is to demonstrate how TL and ACS, and especially their combination, can improve the performance of an existing classifier. The ideas proposed in this paper can also be extended to other classifiers, and also to other applications, including additional BCI, physiological computing, or affective computing systems [44].

Experiment Setup and Data Acquisition
The data included in this paper were drawn from a larger study on the VRST [6]. Neural and physiological measures were used to predict levels of threat and task difficulty. The VRST is part of a battery of tests developed by Parsons that are found in an adaptive VE, which consists of a virtual city, a virtual vehicle checkpoint, and a virtual Humvee driving scenario in simulated Iraq and Afghanistan settings [14,21,55-57].
Ethics statement. The University of Southern California's Institutional Review Board approved the study. Upon agreement to participate, prospective subjects were educated as to the procedure of the study, possible risks and benefits, and alternative options (non-participation). Prior to actual participation, they completed written informed consent forms approved by the University of Southern California's Institutional Review Board. After each subject's written informed consent was obtained, basic demographic information was recorded.
Participants and procedure. A total of 20 college-aged subjects participated in the study. Two of the 20 subjects did not respond at all in one of the three scenarios, and were excluded as outliers. While experiencing the VRST, participant neural and physiological responses were recorded using a Biopac MP 150 system in conjunction with the NeuroSim Interface (NSI) software developed at Parsons' Neuroscience and Simulation Laboratory (NeuroSim) at the University of Southern California. Electroencephalography (EEG), Electrocardiographic activity (ECG), Electrooculography (EOG), Electrodermal activity (EDA), and Respiration (RSP) were recorded. Following completion of the VRST protocol, none of the subjects reported simulator sickness. EEG was measured using seven electrodes placed at locations Fp1, Fp2, Fz, Cz, Pz, O1, and O2 according to the international 10-20 system for EEG electrode placement. The EEG signal was recorded at 512 Hz, and was referenced to linked ear electrodes. EDA was measured using Ag/AgCl electrodes placed on the index and middle fingers of the non-dominant hand [58]. ECG was recorded using a Lead 1 electrode placement, with one Ag/AgCl electrode placed on the right inner forearm below the elbow, another in the same position on the left inner forearm, and a third on the left inner wrist to serve as a ground. Finally, RSP was recorded with a transducer belt placed around the widest area of the rib cage.
Virtual Reality Stroop Task (VRST). The VRST involves the subject being immersed in a VE consisting of a Humvee that travels down the center of a road in a desert environment with military relevant events while Stroop stimuli appear on the windshield (see Figure 6). The VRST is a measure of executive functioning and was designed to emulate the classic Stroop test [59]. Like the traditional Stroop, the VRST requires an individual to respond by selecting one of three colors, (i.e., red, green, or blue). Unlike the traditional Stroop, a subject responds by pressing a computer key, and the VRST also adds a simulation environment with military relevant events in high and low threat settings. Participants interacted with the VRST through an eMagin Z800 head-mounted display (HMD). To increase the potential for sensory immersion, a tactile transducer was built by mounting six Aura bass shaker speakers on a three foot square platform.
Stimuli and design. Participants were immersed in the VRST as neural and physiological responses were recorded. EEG, ECG, EDA, and RSP were collected as participants rode in a simulated Humvee through alternating zones of low threat (i.e., little activity aside from driving down a desert road) and high threat (i.e., gunfire, explosions, and shouting amongst other stressors). The VRST was employed to manipulate levels of task difficulty. The VRST consisted of 3 conditions: 1) word-reading, 2) color-naming, and 3) Stroop interference. The Stroop interference condition displays a color word, such as red, in a different color font, while the subject's task is to name the color of the font, not read the word (right column of Figure 6). Each Stroop condition was experienced once in a high threat zone and once in a low threat zone.
There are many different task difficulty levels in VRST. In this study we chose the following three:
Scenario I: Low threat, color naming.
Scenario II: High threat, color naming.
Scenario III: High threat, Stroop interference.
Each scenario consisted of 50 stimuli. Three colors (Blue, Green, and Red) were used, and they were displayed randomly with equal probability. In Scenario I, 50 colored numbers were displayed one by one while the subject was driving through a safe zone. Scenario II was similar to Scenario I, except that the subject was driving through an ambush zone. Scenario III was similar to Scenario II, except that Stroop stimuli instead of color naming stimuli were used. In terms of task difficulty, the three scenarios are in the order of I < II < III. We set forth to classify these three scenarios using the proposed algorithms.
For each scenario, each of the 50 stimuli was displayed at a random location on the windshield in a different color, randomly selected from one of the three different color schemes, in order to reduce signal habituation. Stimuli were presented for a maximum of 1.25 seconds each, and participants were asked to respond as quickly as possible without making mistakes. As shown in Table 1, the average reaction time was less than one second. The next stimulus was displayed once the user had responded to the current one. In total, each stimulus took only a few seconds, and the 150 stimuli were finished in about 10 minutes.

Comparison of the Algorithms
Each of the 18 subjects had 150 responses (50 stimuli for each task difficulty level). The same 29 features from our previous analysis [5] were used (shown in Table 2), with all 29 features being used across all subjects for this analysis. Twenty-one features were extracted from EEG, three from EDA, three from RSP, and two from ECG. Feature extraction consisted of segmenting the data into 3-second epochs that were time locked from 1 second prior to the stimulus occurrence to 2 seconds after. EOG artifacts were removed from the EEG using a standard regression-based approach. Then, EEG data was filtered using a [1,30] Hz bandpass filter, epoched into overlapping 1 second windows, and detrended. Spectral power was then calculated in the theta [3.5, 7] Hz, alpha [7.5, 13.5] Hz, and beta [13.5, 19.5] Hz frequency bands for each channel. The EDA features were the mean, minimum, and maximum amplitude response in the epoch window. Respiration was scored similarly, with mean, minimum, and maximum amplitude in the epoch window. ECG features consisted of the number of heartbeats and the average inter-beat intervals (IBIs, scored as the time difference in seconds between successive R waves) in the epoched window. We normalized each feature for each individual subject to [0, 1].
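The EEG band-power and per-subject normalization steps described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names are hypothetical, the periodogram is a plain FFT estimate, detrending is approximated by mean removal, and EOG regression and the [1, 30] Hz bandpass filter are omitted. Only the sampling rate and the band edges are taken from the text. With seven electrodes and three bands, such a scheme would yield 21 EEG features, which matches the feature count reported above.

```python
import numpy as np

FS = 512  # EEG sampling rate in Hz, as stated in the text
BANDS = {"theta": (3.5, 7.0), "alpha": (7.5, 13.5), "beta": (13.5, 19.5)}

def band_powers(window, fs=FS, bands=BANDS):
    """Mean periodogram power per band for one single-channel window."""
    window = np.asarray(window, dtype=float)
    window = window - window.mean()  # assumption: mean removal stands in for detrending
    spectrum = np.abs(np.fft.rfft(window)) ** 2 / len(window)
    freqs = np.fft.rfftfreq(len(window), d=1.0 / fs)
    powers = {}
    for name, (lo, hi) in bands.items():
        mask = (freqs >= lo) & (freqs <= hi)
        powers[name] = float(spectrum[mask].mean())
    return powers

def minmax_normalize(features):
    """Scale each column of a (samples x features) array to [0, 1] for one subject."""
    features = np.asarray(features, dtype=float)
    lo = features.min(axis=0)
    span = features.max(axis=0) - lo
    span[span == 0] = 1.0  # constant features are mapped to 0 instead of dividing by zero
    return (features - lo) / span

# Usage: a synthetic 10 Hz oscillation concentrates its power in the alpha band.
t = np.arange(FS) / FS
powers = band_powers(np.sin(2 * np.pi * 10 * t))
```

Normalizing within each subject, rather than across the pooled data, keeps one subject's amplitude range from dominating the features of another.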
The coefficients of the first two principal components of the 29 features for the 18 subjects are shown in Figure 7. Different colors are used to denote different scenarios: red for Scenario I, green for Scenario II, and blue for Scenario III. Observe that generally the distributions of these coefficients are quite different among the subjects, which suggests that it may be impossible to find a generic classifier that works well for all subjects. This has been confirmed by our previous studies. In [5] we reported that when we trained an SVM classifier on 17 subjects and tested it on the remaining subject, the average classification rate was 36.9%, close to chance. We also trained a kNN classifier on 17 subjects and tested it on the remaining subject. The average classification rate was 35.8%, again close to chance. However, when examining Figure 7 more closely, we can observe that some subjects share similar distributions of the principal component coefficients, e.g., Subjects 4 and 6, Subjects 8 and 17, and Subjects 13 and 16. This suggests that auxiliary data from other subjects may be helpful in building a classifier for a new user. Next we show how classification performance can be improved above baseline using TL, ACS, and their combination.
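The train-on-17, test-on-1 evaluation mentioned above is a leave-one-subject-out scheme. The sketch below illustrates the idea with a 1-nearest-neighbor classifier under Euclidean distance; the function names and the toy data are hypothetical, and the classifier settings are not those of [5]. With three equally represented scenarios, chance level is about 33.3%, which is the reference point for the 36.9% and 35.8% rates quoted above.

```python
import math

def nearest_neighbor_label(x, train_X, train_y):
    """Label of the training sample closest to x in Euclidean distance."""
    best = min(range(len(train_X)), key=lambda i: math.dist(x, train_X[i]))
    return train_y[best]

def leave_one_subject_out_accuracy(data):
    """data maps subject id -> list of (feature_vector, label) pairs.

    Each subject is held out in turn and classified using only the
    pooled samples of all other subjects.
    """
    accuracies = {}
    for held_out, test_samples in data.items():
        train_X, train_y = [], []
        for subject, samples in data.items():
            if subject != held_out:
                for x, y in samples:
                    train_X.append(x)
                    train_y.append(y)
        correct = sum(
            nearest_neighbor_label(x, train_X, train_y) == y
            for x, y in test_samples
        )
        accuracies[held_out] = correct / len(test_samples)
    return accuracies

# Toy data (hypothetical): two subjects whose responses are distributed alike.
toy = {
    "A": [((0.0,), 1), ((1.0,), 2)],
    "B": [((0.1,), 1), ((0.9,), 2)],
}
result = leave_one_subject_out_accuracy(toy)
```

When subjects resemble each other, as in the toy data, pooled training works well; the near-chance rates reported above indicate that this is not the case across all 18 subjects.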
kNN classification. In kNN classification we set the maximum number of primary training samples to 30, and a, the minimum satisfactory classification accuracy, to 1, i.e., the algorithms terminated when 30 primary training samples were generated. The Euclidean distance was used to specify nearest neighbors. We studied each subject separately, and for each subject $l_0 = 3$ (so that there is at least one primary training sample for each labeled class). We used $l_i = \{1, 2, 3\}$ for $\forall i$, i.e., in the first experiment, only one primary training sample was generated in each iteration; in the second experiment, two primary training samples were generated in each iteration; and in the third experiment, three primary training samples were generated in each iteration. After Iteration $i$, the kNN classification performance was evaluated using the remaining $150 - \sum_{j=0}^{i-1} l_j$ responses from the same subject. We repeated the experiment 100 times (each time selecting the $l_0$ initial training samples randomly) for each combination of subject and $l_i$, and then recorded the average performances of the four algorithms. It was necessary to repeat the experiment many times to ensure the statistical significance of the results. This is because there were two forms of randomness: 1) training data samples were selected randomly, so for the same sequence of class labels the training samples were different; and, 2) the class to select training samples from was chosen according to a probability distribution instead of deterministically. Figure 8 shows the performances of the four algorithms on the 18 subjects for $l_i = 1$. Observe that significantly different classification accuracies were obtained for different subjects. For example, with 30 user-specific training samples, 95.82% accuracy was obtained for Subject 4 by TL+ACS, but only 56.76% for Subject 11.
However, regardless of the large individual differences, both TL and ACS outperformed the baseline for all 18 subjects, and TL+ACS achieved the best performance among the four.
The mean and standard deviation of the classification accuracy of the four algorithms for $l_i = \{1, 2, 3\}$ on the 18 subjects are shown in Figure 9. Observe that:
1. TL outperformed the baseline approach. The performance improvement is generally larger when $N_p$ is small. As $N_p$ increases, the performances of TL and the baseline converge, i.e., the effect of auxiliary training data decreases as the number of primary training data samples increases.
2. ACS outperformed the baseline approach, and when $N_p$ increased the performance improvement of ACS became larger than the performance improvement of TL over the baseline.
3. TL+ACS outperformed the other three approaches. It inherited both TL's superior performance for small $N_p$ and ACS's superior performance for large $N_p$, and showed improved performance overall.
To show that the performance differences among the four algorithms are statistically significant, we performed paired t-tests to compare their average accuracy (Table 3), using $\alpha = 0.05$. The results showed that the performance difference between any pair of algorithms is statistically significant (Table 4). Although our t-tests revealed significance, we decided to be very conservative with our results and took measures to ensure that the probability of Type I error did not exceed 0.05. Hence, we also performed Holm-modified Bonferroni corrections [60] to assess classification accuracy (Table 4).
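The Holm step-down procedure referred to above can be sketched as follows. The function name and the p-values in the usage line are hypothetical and are not results from this study. The procedure sorts the p-values, compares the smallest against $\alpha/m$, the next against $\alpha/(m-1)$, and so on, and stops rejecting at the first comparison that fails.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down test: True where the null hypothesis is rejected.

    The family-wise Type I error rate is controlled at alpha.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared to alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all remaining (larger) p-values are retained as well
    return reject

# Usage with made-up p-values for three pairwise comparisons.
decisions = holm_reject([0.01, 0.04, 0.03])
```

Because the threshold relaxes at each step, Holm's method rejects at least as many hypotheses as the plain Bonferroni correction while offering the same family-wise guarantee.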
We have shown that with the same number of primary training samples, TL, ACS, and TL+ACS can give higher classification accuracy compared with the baseline approach. As an equivalent goal of the improved algorithms is to learn an optimal classifier using a minimum number of primary training samples, it is also interesting to study how many primary training samples can be saved by using the three improved algorithms, as compared to the baseline approach. Take TL as an example. Assume that the TL algorithm has a classification accuracy of a when $N_p$ primary training samples are used. We then find how many primary training samples are needed by the baseline algorithm to achieve the same classification accuracy and denote that number by $N_p'$. Then $(N_p' - N_p)/N_p' \times 100\%$ is the percentage of primary training samples saved by TL. The mean and standard deviation of the percentages are shown in Figure 10. Note that we only show $N_p$ up to 20 because $N_p'$ is larger than $N_p$ and the maximal $N_p'$ in the experiments is 30. Observe from Figure 10 that:
1. TL can save over 7% of the primary training samples relative to the baseline approach, especially when $N_p$ is small. When $N_p$ increases the saving becomes smaller, which is intuitive, as more primary training samples diminish the usefulness of the auxiliary training samples.
2. ACS can save over 17% of the primary training samples relative to the baseline, especially when $N_p$ is small. When $N_p$ increases the saving generally becomes smaller, which is intuitive, as the classifier converges to the optimal one when each class has sufficient training samples, no matter how they are generated.
3. TL+ACS can save over 22% of the primary training samples relative to the baseline, especially when $N_p$ is small. It also outperformed both TL and ACS.
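The savings computation described above can be illustrated with a short sketch. The function names and the learning-curve values are hypothetical. As a simplifying assumption, $N_p'$ is taken to be the smallest number of samples at which the baseline curve reaches the target accuracy; the source does not state whether interpolation was used.

```python
def samples_needed(curve, target):
    """Smallest N_p at which the learning curve reaches target, or None.

    curve maps a number of primary training samples to an accuracy.
    """
    for n in sorted(curve):
        if curve[n] >= target:
            return n
    return None

def percent_saved(improved_curve, baseline_curve, n_p):
    """(N_p' - N_p) / N_p' * 100, with N_p' read off the baseline curve."""
    target = improved_curve[n_p]
    n_p_prime = samples_needed(baseline_curve, target)
    if n_p_prime is None:
        return None  # the baseline never reaches this accuracy in the tested range
    return (n_p_prime - n_p) / n_p_prime * 100.0

# Made-up learning curves, for illustration only.
baseline = {3: 0.40, 4: 0.45, 5: 0.50, 6: 0.55}
improved = {3: 0.45, 4: 0.55}
saving = percent_saved(improved, baseline, 3)  # baseline needs 4 samples, so 25.0
```

The None branch corresponds to the remark above that results are only shown for $N_p$ up to 20: beyond that point the baseline may not reach the target accuracy within the 30 samples available.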
To show that the differences in the percentages of saved primary training samples among the four algorithms are statistically significant, we performed paired t-tests to compare their average savings (Table 5), using $\alpha = 0.05$. The results showed that the difference in the percentage of saved primary training samples between any pair of algorithms is statistically significant (Table 6). Again, to ensure that the probability of Type I error does not exceed 0.05, we also performed Holm-modified Bonferroni correction on the percentages by considering the four algorithms and three $l_i$ together. The results indicate that all 15 differences are statistically significant.
SVM classification. To demonstrate that the generic framework of combining TL and ACS can be applied to other classifiers, we also implemented it for an SVM classifier [41,61] using $l_i = \{1, 2, 3\}$. The same 29 features as in the kNN classifier were again used in the experiment, and no feature selection was performed.
The radial basis function (RBF) SVM in LIBSVM [62] was employed, so in training we tuned two parameters: $C$, which is the penalty parameter of the error term, and $\gamma$, which defines the RBF kernel. Modification to the algorithms is very simple: in the TL part of Algorithms 2 and 4, instead of optimizing k in kNN, we now optimize $C$ and $\gamma$. Figure 11 shows the performances of the four algorithms on the 18 subjects for $l_i = 1$. Again, significantly different classification accuracies were obtained for different subjects. However, regardless of the large individual differences, generally both TL and ACS outperformed the baseline for all 18 subjects, and TL+ACS achieved the best performance among the four.
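For reference, the roles of the two tuned parameters can be written out using the standard soft-margin formulation documented for LIBSVM. The notation below ($\mathbf{w}$, $b$, $\xi_i$, $\phi$, and $n$ training samples) is the usual textbook notation and is not taken from this paper. The penalty parameter $C$ weights the slack variables in the training objective,

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2}\mathbf{w}^{T}\mathbf{w} + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i\left(\mathbf{w}^{T}\phi(\mathbf{x}_i) + b\right) \geq 1 - \xi_i, \;\; \xi_i \geq 0,$$

and the kernel parameter, written $\gamma$ in the LIBSVM documentation, sets the width of the RBF kernel,

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \, \|\mathbf{x}_i - \mathbf{x}_j\|^{2}\right), \quad \gamma > 0.$$

A larger $C$ penalizes training errors more heavily, and a larger $\gamma$ makes the decision boundary more local. Tuning both means searching a two-dimensional parameter space, compared with the single parameter k in kNN.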
The mean and standard deviation of the classification accuracy of the four algorithms for $l_i = \{1, 2, 3\}$ on the 18 subjects are shown in Figure 12. Observe that the average performances of TL, ACS and TL+ACS were all better than the baseline approach, which verified the effectiveness of the proposed approaches. However, unlike in Figure 9, where ACS outperformed TL when $N_p$ was large, for the SVM classifier ACS generally performed worse than TL. This is because the ACS approach used in our experiment, which selects a new class according to the inverse of the per-class cross-validation accuracy (called Inverse in [25]), is suitable for the kNN classifier but is not optimal for the SVM classifier. This fact is confirmed by Figure 1(b) in [25], where several ACS approaches for SVM were compared. In the future we will investigate better ACS approaches for the SVM classifier.
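The Inverse selection rule mentioned above can be sketched as follows. This is an illustration under assumptions rather than the exact rule of [25]: the function names are hypothetical, the weights are taken as the reciprocal of each class's cross-validation accuracy and normalized to sum to one, and a small floor eps guards against division by zero.

```python
import random

def inverse_acs_probabilities(per_class_accuracy, eps=1e-6):
    """Selection probabilities proportional to 1 / per-class CV accuracy."""
    # eps is an assumed safeguard for classes whose accuracy is exactly 0.
    weights = [1.0 / max(acc, eps) for acc in per_class_accuracy]
    total = sum(weights)
    return [w / total for w in weights]

def select_class(per_class_accuracy, rng=random):
    """Draw the class index from which the next training sample is requested."""
    probs = inverse_acs_probabilities(per_class_accuracy)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Usage: the class with the lowest accuracy is the most likely to be sampled next.
probs = inverse_acs_probabilities([0.5, 0.25, 1.0])
next_class = select_class([0.5, 0.25, 1.0])
```

Drawing the class at random instead of always picking the worst one is what introduces the second source of randomness in the experiments, since the class was chosen according to a probability distribution and not deterministically.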
To show that the performance differences among the four algorithms are statistically significant, we performed paired t-tests to compare their average accuracy (Table 3), using $\alpha = 0.05$. The results showed that the performance difference between any pair of algorithms is statistically significant (Table 4). To ensure that the probability of Type I error does not exceed 0.05, we also performed Holm-modified Bonferroni correction on the classification accuracy by considering the four algorithms and three $l_i$ together. The results indicate that all 15 differences are statistically significant, despite the very conservative nature of Bonferroni correction.
Similar to the kNN case, for the SVM classifier we also studied what percentage of primary training samples can be saved by using the three improved algorithms, compared with the baseline approach. The results are shown in Figure 13. TL+ACS can save over 21% of the primary training samples.
To show that the differences in the percentages of saved primary training samples among the four algorithms are statistically significant, we performed paired t-tests to compare their average savings (Table 5), using $\alpha = 0.05$. The results showed that the difference in the percentage of saved primary training samples between any pair of algorithms is statistically significant (Table 6). Again, to ensure that the probability of Type I error does not exceed 0.05, we performed Holm-modified Bonferroni correction on the percentages by considering the four algorithms and three $l_i$ together. The results indicate that all 15 differences are statistically significant.
The search space of the SVM classifier is much larger than that of the kNN classifier, which implies that more primary training samples are needed to identify the optimal SVM model parameters. We therefore expect that a more significant performance improvement can be demonstrated for the SVM classifier by using the improved algorithms. The conjecture is clearly verified in Figure 14, in which the mean baseline and TL+ACS performances for $l_i = 1$ for both kNN and SVM are shown. We would also expect the baseline SVM classifier to outperform the baseline kNN classifier as it is more sophisticated; however, Figure 14 shows that this is only true when $N_p$ is large, because a small number of primary training samples is not sufficient to identify the optimal SVM parameters in a large search space. In summary, it seems that TL+ACS is particularly advantageous when a sophisticated classifier with a large search space is used.

Discussion
We have demonstrated that a collaborative filtering approach based on TL and ACS can improve kNN and SVM classifier performance over a baseline classifier using the same quantity of training data, and that combining TL and ACS can achieve an even larger performance improvement. So, for the same level of classification accuracy, TL+ACS may require a smaller number of user-specific training samples. This will reduce the data acquisition effort to customize an automatic task difficulty recognition BCI system, and hence increase its usability and popularity.
The VRST application is a scenario reflecting the ways in which this collaborative filtering algorithm could be used in automatic task difficulty classification. The VE maintains a database with many different subjects and their neural and physiological responses at different task difficulty levels. A new user can build his/her profile for automatic task difficulty classification by performing a calibration task, which is a subset of the primary task. Assuming that there are c levels of task difficulty, the VE will display c stimuli, one from each difficulty level. The neural and physiological responses from the user, along with selected "good" auxiliary training samples from the subject database, are then used by TL to identify the optimal parameters for the classifier. The TL module also computes the classification accuracy from cross-validation using these optimal parameters.
If the classification accuracy is not satisfactory, the ACS module then determines from which difficulty level the next stimulus should be displayed. The VE generates the corresponding stimulus, records the user's neural and physiological responses during the test, and adds them to the primary training sample database. The TL module is used again to update the optimal parameters for the classifier and compute the cross-validation accuracy. If the accuracy is satisfactory, then the VE configures the classifier using the optimal parameters and stops training; otherwise, it calls the ACS module to generate another new primary training sample from the user and iterates the process. The advantage is that the user will need to provide far fewer responses with TL+ACS than without, which shortens the time spent calibrating the system.
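The calibration procedure described in the two paragraphs above can be summarized as a control loop. The sketch below is a hypothetical skeleton, not the authors' implementation: the three callables stand in for the VE (collect_sample), the TL module (fit_with_transfer), and the ACS module (choose_class), and their signatures are assumptions made for illustration. The default stopping values mirror the settings reported for the kNN experiments (accuracy threshold 1, at most 30 primary samples).

```python
def calibrate(collect_sample, fit_with_transfer, choose_class, n_classes,
              target_accuracy=1.0, max_samples=30):
    """Iterate TL and ACS until the classifier is good enough or the budget is spent.

    collect_sample(c)      -> one labeled response from difficulty level c (VE)
    fit_with_transfer(ps)  -> (model, cv_accuracy, per_class_accuracy)     (TL)
    choose_class(per_cls)  -> difficulty level for the next stimulus       (ACS)
    """
    # Initial calibration task: one stimulus from each difficulty level.
    primary = [collect_sample(c) for c in range(n_classes)]
    while True:
        model, cv_accuracy, per_class_accuracy = fit_with_transfer(primary)
        if cv_accuracy >= target_accuracy or len(primary) >= max_samples:
            return model, primary
        # Accuracy not yet satisfactory: ask ACS where to sample next.
        next_class = choose_class(per_class_accuracy)
        primary.append(collect_sample(next_class))

# Usage with stand-in stubs (all hypothetical): accuracy grows with sample count.
model, samples = calibrate(
    collect_sample=lambda c: ("response", c),
    fit_with_transfer=lambda ps: ("model", len(ps) / 10, [0.5, 0.5, 0.5]),
    choose_class=lambda per_class: 0,
    n_classes=3,
)
```

Passing the modules in as callables keeps the loop independent of the classifier, which reflects the point made earlier that the same framework applies to both kNN and SVM.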
Note that TL and ACS each requires a higher computational cost than the baseline approach, because TL needs to consider the auxiliary training samples in the internal cross-validation, and ACS needs to compute the per-class cross-validation accuracy. However, since the extra computational cost only occurs during the training process, it does not hinder the applicability of these improvements.

Conclusions and Future Research
Individual differences make it difficult to develop a generic BCI algorithm whose model parameters fit all subjects. It is hence important to customize the BCI algorithm for each individual user by adapting its parameters using user-specific training data. However, collecting user-specific data is time-consuming and may also decrease the user's interest in the BCI system. In this paper we have shown how TL, ACS, and a collaborative filtering approach based on their combination, can help learn an optimal classifier using a minimum amount of user-specific training data. TL exploits the information contained in auxiliary training data, and ACS optimally selects the class from which to generate new training data. This approach reduces the data acquisition effort in customizing a BCI system, improving its usability and, potentially, its popularity.
In the future we will improve both TL and ACS, thereby improving our collaborative filtering framework. For TL, we may be able to improve the selection of "good" auxiliary data by removing inconsistent data samples from the auxiliary data (i.e., reducing the intra-individual difference). If a subject cannot reliably classify his/her own perception of task difficulty, then it is unlikely that his/her data can give good suggestions on another subject's perception. One possible approach is that for each subject in the auxiliary data, we remove a minimum number of confusing data samples so that a 100% accurate classifier can be obtained. The remaining data from all subjects can then be combined to form the auxiliary dataset. To improve ACS, we will investigate other ACS approaches, such as Redistricting and Improvement [25].
Another direction of our future research is to integrate TL and ACS with feature selection. As shown in [5], many of the 29 features are not useful. However, the useful features are subject dependent. As these features directly affect the classification performance and computational cost, it is necessary to integrate TL and ACS with feature selection for further performance improvement. In addition, we are involved in several large-scale (>100 subjects) neural and physiological data collections, and intend to use that data to continue to refine and improve these collaborative filtering approaches.
Finally, we are interested in studying whether the proposed approach can also be used in cross-domain knowledge transfer [63,64], e.g., whether the labeled task difficulty data in VRST can help improve the task difficulty recognition performance in other related application domains like personalized learning and affective gaming [65][66][67].