The authors have declared that no competing interests exist.

Conceived and designed the experiments: YL HL PS. Performed the experiments: YL HL. Analyzed the data: JL YL LH PS. Wrote the paper: YL PS. Designed the algorithm: YL. Contributed to the writing: ZG.

Considering the two-class classification problem in brain imaging data analysis, we propose a sparse representation-based multivariate pattern analysis (MVPA) algorithm to localize brain activation patterns corresponding to different stimulus classes/brain states. Feature selection can be modeled as a sparse representation (or sparse regression) problem, and such techniques have been successfully applied to voxel selection in fMRI data analysis. However, a single round of selection based on sparse representation or other methods tends to yield only a subset of the most informative features rather than all of them. Herein, our proposed algorithm recursively eliminates the informative features selected by a sparse regression method until the decoding accuracy based on the remaining features drops to a threshold close to chance level. In this way, the resultant feature set, comprising all the identified features, is expected to include all the features informative for discrimination. According to the signs of the sparse regression weights, these selected features are separated into two sets corresponding to the two stimulus classes/brain states. Next, to remove irrelevant/noisy features from the two selected feature sets, we perform a nonparametric permutation test at the individual subject level or the group level. In our data analysis, we verified the algorithm with a toy data set and an intrinsic signal optical imaging data set; the results show that it accurately localizes the two class-related patterns. As an application example, we applied the algorithm to a functional magnetic resonance imaging (fMRI) data set and obtained two sets of informative voxels in the human brain, corresponding to two semantic categories (i.e., “old people” and “young people”), respectively.

One fundamental question in neuroscience focuses on determining how information is processed within local and global networks in the brain. Recently, multivariate pattern analysis (MVPA) approaches have been used successfully in revealing brain patterns activated by different stimulus conditions in brain imaging studies

Ideally, all of the informative features contributing to the discrimination should be extracted, no matter how small their contribution. In most MVPA algorithms, however, not all of the informative variables are selected, because a subset of them may already be sufficient for decoding or classification. Moreover, when extracting such a sufficient set of informative features, features that represent noise may also be selected; follow-up statistical tests are therefore necessary.

For the purpose of identifying all the informative features, we proposed a sparse representation-based pattern localization algorithm combined with a nonparametric statistical test in this study. We summarized the algorithm as three components: a K-fold cross-validation of the recursive feature search, in which feature weights were determined by a sparse representation method; the construction of two probability maps based on the selected features; and a permutation test at the individual or group level. In our previous study

The neuroimaging data were given by a matrix

Feature selection was based on the weights of all features, which were determined by sparse representation

For a data matrix

Model (1) can be seen as a sparse regression between the data matrix

The optimization problem (1) can be converted to a standard linear programming problem as below

Setting
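For illustration, the reduction of an ℓ1-minimization problem to a standard linear program can be sketched as follows. This is a minimal sketch assuming the basis pursuit form min ‖w‖₁ subject to Aw = y; the function name, variable names, and the choice of `scipy.optimize.linprog` as solver are ours, not necessarily the paper's implementation. The standard trick is to split w into its positive and negative parts, w = u − v with u, v ≥ 0, so that ‖w‖₁ becomes the linear objective 1ᵀu + 1ᵀv:

```python
import numpy as np
from scipy.optimize import linprog

def l1_min_via_lp(A, y):
    """Solve min ||w||_1 s.t. A w = y as a linear program (sketch).

    With w = u - v, u >= 0, v >= 0, the problem becomes
        min 1'u + 1'v   s.t.   [A, -A] [u; v] = y,   u, v >= 0,
    which is a standard-form LP.
    """
    m, n = A.shape
    c = np.ones(2 * n)                 # objective: sum(u) + sum(v) = ||w||_1
    A_eq = np.hstack([A, -A])          # equality constraint A(u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    u, v = res.x[:n], res.x[n:]
    return u - v                       # recover the (typically sparse) weights
```

At the optimum at most one of uᵢ, vᵢ is nonzero for each i, so the objective indeed equals ‖w‖₁.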

Here we could also use another regression approach, e.g., a support vector machine (SVM), instead of sparse representation for weight determination. One difference between the two methods is that the weights obtained by sparse representation are sparse, while those obtained by an SVM are dense. Sparse weights are useful for highlighting the variables relevant to the labels or the vector to be regressed

We first present the outline of our SPL algorithm here. As shown in

Regarding the cross-validation, both the data matrix

In each fold of cross-validation, we recursively eliminated the informative features selected by the sparse regression method until the decoding accuracy based on the remaining features dropped to a threshold close to the chance level of 50% for the two-class problem. According to the signs of the sparse regression weights, the identified features were separated into two sets corresponding to the two stimulus classes/brain states, respectively. In this way, the resultant two feature sets, comprising all the identified features, were expected to include as many of the features informative for discrimination as possible.

The process of the recursive feature search in the

Step 1: Sparse representation for feature weight determination. We applied the sparse representation method to the data updated in the previous iteration and obtained a weight vector denoted as

Step 2: Feature selection. We used the weight vector

Step 3: Informative feature removal. We removed these variables in

Step 4: Decoding. We performed a decoding by applying an SVM classifier to the updated data matrix, and the prediction accuracy of the labels, denoted as

Terminating criterion: The above iteration procedure ran until the prediction accuracy dropped to a threshold. Theoretically, the best threshold was the chance level 50% for the two-class problem, which was used in this paper.
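Steps 1–4 and the terminating criterion above can be sketched in Python. This is an illustrative sketch only: `Lasso` stands in for the paper's sparse-representation (linear-programming) solver, `LinearSVC` for the decoding SVM, and the parameter values (`n_per_iter`, `alpha`) are ours, not the paper's:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def spl_recursive_search(X, y, n_per_iter=25, chance=0.5, max_iter=50):
    """Recursive feature search of one cross-validation fold (sketch)."""
    remaining = np.arange(X.shape[1])   # global indices of unselected features
    pos_set, neg_set = [], []
    for _ in range(max_iter):
        # Step 1: sparse regression between data and labels -> sparse weights
        w = Lasso(alpha=0.05).fit(X[:, remaining], y).coef_
        # Step 2: keep the top-weighted nonzero features; the weight sign
        # assigns each selected feature to one of the two classes
        order = np.argsort(-np.abs(w))[:n_per_iter]
        idx = order[np.abs(w[order]) > 0]
        if idx.size == 0:
            break
        pos_set.extend(remaining[idx[w[idx] > 0]])
        neg_set.extend(remaining[idx[w[idx] < 0]])
        # Step 3: remove the selected (informative) features from the data
        remaining = np.delete(remaining, idx)
        if remaining.size == 0:
            break
        # Step 4: decode with an SVM using only the remaining features
        acc = cross_val_score(LinearSVC(), X[:, remaining], y, cv=5).mean()
        # Terminating criterion: stop once accuracy falls to chance level
        if acc <= chance:
            break
    return np.array(pos_set), np.array(neg_set)
```

Because the search continues until decoding on the leftover features is at chance, even weakly informative features are eventually pulled into one of the two sets; the later permutation test is what removes the noisy selections.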

After the iterations in the

Remark 1: (i) In the above Step 2, we assumed that the signs of those weights obtained by sparse representation were associated with the classes of stimuli/brain states. In

After

As described above, two sets of selected variables, which correspond to two classes of stimuli/brain states respectively, were obtained in each fold of cross-validation. Using the
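Constructing the two probability maps amounts to counting how often each feature is selected, for each class, across the K folds of cross-validation. A minimal sketch (function and variable names are ours):

```python
import numpy as np

def selection_probability_maps(selected_pos, selected_neg, n_features):
    """Build two probability maps from K folds of cross-validation (sketch).

    selected_pos / selected_neg: lists with one index array per fold,
    holding the features assigned to each class in that fold.
    """
    k = len(selected_pos)
    p_pos = np.zeros(n_features)
    p_neg = np.zeros(n_features)
    for idx in selected_pos:
        p_pos[idx] += 1.0
    for idx in selected_neg:
        p_neg[idx] += 1.0
    # Probability = selection frequency across the K folds
    return p_pos / k, p_neg / k
```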

In each fold of the cross-validation, the two feature sets obtained by the recursive feature search included as many informative features for discrimination as possible. However, some irrelevant/noisy features were unavoidably selected by the method. To determine whether a feature contributed to the discrimination between the two classes of stimuli/brain states, we performed a permutation test at the individual subject level or the group level on each of the two probability maps, as described below.

Permutation test at the individual level: In each permutation, we randomly permuted the labels of the data, repeated the above cross-validation procedure, and obtained two probability maps/density functions. A null distribution for each class was constructed by pooling all probability values of the 100 probability maps corresponding to that class, which were obtained through the 100 permutations

Permutation test at the group level: For each subject, we obtained two probability maps with respect to the features corresponding to the two classes, as in Sections 2.2.1 and 2.2.2. We averaged these probability maps across all subjects and obtained two average probability maps, again corresponding to the two classes. Next, we performed multiple permutations (e.g., 100). In each permutation, we randomly permuted the labels of each subject's data, obtained two probability maps for each subject, and then calculated two average probability maps across all subjects. A null distribution for each class was constructed by pooling all probability values of the 100 average probability maps corresponding to that class, which were obtained through the 100 permutations. For a given significance level
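The group-level thresholding step can be sketched as follows: pool the average probability maps from all label permutations into one null distribution and take its (1 − α) quantile as the significance threshold. The array layout and names are ours, chosen for illustration:

```python
import numpy as np

def permutation_threshold(prob_maps_per_perm, alpha=0.001):
    """Group-level permutation threshold on average probability maps (sketch).

    prob_maps_per_perm: array of shape (n_perms, n_subjects, n_features);
    each entry is the selection probability of a feature for one subject
    in one label-permuted run of the cross-validated feature search.
    """
    # Average across subjects within each permutation
    avg_maps = prob_maps_per_perm.mean(axis=1)   # (n_perms, n_features)
    # Pool all probability values from all permutations into one null
    null = avg_maps.ravel()
    # Threshold: the (1 - alpha) quantile of the null distribution
    return np.quantile(null, 1.0 - alpha)
```

Features whose observed (unpermuted) average probability exceeds this threshold are declared significant for the corresponding class.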

There were two simulations in this experiment. In the first simulation, we showed that when the data were sufficient, i.e., the number of known data vectors was much larger than that of the variables, two patterns (two sets of informative variables) corresponding to two classes of data could be well estimated by directly using the weights with signs determined by sparse representation or SVM. We also demonstrated that the signs of these weights were associated with the two classes of the data. In the second simulation, we considered the case where the number of known data vectors was much smaller than that of the variables. We illustrated our approach for localizing two patterns and demonstrated its effectiveness. We also used SVM to replace sparse representation in our SPL algorithm for weight determination and obtained comparable results.

In the first simulation, we considered the following optimization problem, which is a sparse representation model similar to

Remark 2: In this paper, the temporal SNR for a column of the data matrix

Note that in (3), the data matrix

In the second simulation, we used a much smaller data set to find the two patterns using our approach. Considering the model (3), we first generated two pattern vectors

For comparison, we used a standard SVM and the univariate correlation method for localizing the two patterns

In this experiment, intrinsic signal optical imaging

In this experiment, fMRI data were acquired from nine human subjects. This study was approved by the Ethics Committee of Guangdong General Hospital, China, and all subjects gave their written informed consent for the study. The visual stimuli were 80 grayscale pictures of Chinese faces at two different age levels (40 old persons and 40 young persons). During each trial, which lasted 10 seconds or 5 volumes (TR = 2 s), the subject was instructed to make a covert semantic categorization (old vs. young) based on the picture. Each picture was used only once in one trial, and 80 trials were collected for each subject. Mean responses of the third, fourth and fifth volumes in each trial were used, whereas the other volumes were discarded because of the delay of the BOLD response.

Preprocessing consisted of head motion correction, slice timing correction, coregistration between the functional scans and the structural scan, normalization to the MNI standard brain, data masking to exclude nonbrain voxels, time series detrending, and normalization of each time series to zero mean and unit variance.

To reduce the amount of computation and remove noise, initial voxel selection was performed using a correlation method. Cross-correlation was computed voxel by voxel between the BOLD responses and the stimulus function, and the top 2,500 voxels with correlation coefficients larger than 0.15 were selected for later processing.
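This screening step can be sketched as below: compute the Pearson correlation between each voxel's time series and the stimulus function, then keep the top voxels above the cutoff. The function name and arguments are illustrative, not the paper's code:

```python
import numpy as np

def initial_voxel_selection(bold, stimulus, n_top=2500, r_min=0.15):
    """Voxel-wise correlation screening (sketch of the preprocessing step).

    bold: (n_timepoints, n_voxels) BOLD responses
    stimulus: (n_timepoints,) stimulus function
    """
    # Center both signals
    b = bold - bold.mean(axis=0)
    s = stimulus - stimulus.mean()
    # Pearson correlation of each voxel's time series with the stimulus
    r = (b * s[:, None]).sum(axis=0) / (
        np.linalg.norm(b, axis=0) * np.linalg.norm(s) + 1e-12)
    # Keep the top n_top voxels among those exceeding the cutoff
    candidates = np.flatnonzero(r > r_min)
    order = np.argsort(-r[candidates])
    return candidates[order[:n_top]]
```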

We demonstrated the effectiveness of our analysis method and illustrated its application through three experiments. The first experiment contained two simulations based on toy data. The second was an orientation preference experiment in the monkey visual cortex using an intrinsic signal optical imaging technique. The third was an fMRI experiment for face recognition in the human brain.

There were two simulations conducted in this experiment. In the first simulation, we showed that when the data were sufficient, two patterns could be estimated by directly using the weights with signs determined by sparse representation or SVM, and that the signs of these weights were associated with two classes. In the second simulation, we illustrated our approach and showed that the two patterns could be localized when the data were insufficient. We also used a linear SVM instead of sparse representation for weight determination in our SPL algorithm and obtained comparable results.

First, we present the results of the first simulation based on

A: The average of 500 weight vectors obtained by sparse representation. B: The average of 500 weight vectors obtained by SVM.

In the following, we present the results of the second simulation. We regenerated the data 5 times (corresponding to 5 subjects in a real-world experiment) and performed our SPL algorithm with a permutation test at the group level (see

A and B: with sparse representation-based weight determination; C and D: with SVM-based weight determination. Left: the first pattern. Right: the second pattern. Each subplot shows an average probability density function, with stars (circled or non-circled) indicating probability values, and a horizontal dash-dotted red line representing the threshold (significance level: 0.001). Stars above the threshold correspond to the indices of the nonzeros of a predicted pattern; circled stars represent the indices of the nonzeros of the true pattern.

We also used a linear SVM instead of sparse representation for determining the weights in our SPL algorithm and repeated the above procedure. The results are shown in

Furthermore, we considered the classification performance based on the selected informative variables. Specifically, the above 100 data vectors with labels were used as training data. Then, an independent test set in which 50 vectors contained the pattern

Note that we localized the informative features with high accuracy, but the classification accuracy based on these features was relatively lower. This difference may be explained as follows. First, feature selection based on labeled training data plays an important role in classification. To achieve a high classification accuracy, only those features that carry strong discriminative information between the two categories should be selected; in other words, not all of the informative features are generally selected for classification. However, in many real applications, e.g., neuroimaging and novelty detection/fault diagnosis, it is important to find all informative features, no matter how weak they are. We thus proposed the SPL algorithm to localize all the informative features in this study. Using all of the localized informative features for classification is therefore not optimal, and the accuracies might be lower. Second, the noise level was quite high in the simulated data sets. If the SNR was sufficiently large, we could still obtain a high classification accuracy using all of the localized informative features, as shown below.

Here the parameter

To evaluate the robustness of our SPL algorithm to different noise levels, we generated eight data sets at different tSNR values. We applied our algorithm to these data sets, obtained accuracies for localizing the two patterns, and calculated ROC curves. For each noise level, we also generated an independent test set in which 50 vectors contained the pattern

A: accuracy curve for localizing informative features obtained by the sparse representation-based SPL algorithm with different noise levels. B: 8 ROC curves corresponding to 8 noise levels respectively. C: accuracy curve for predicting the labels of the independent test sets generated at different noise levels.

For comparison, we applied an SVM to the above data set containing 100 data vectors and obtained a 300-dimensional weight vector. For each variable, we also calculated the correlation coefficient between the 100-dimensional data vector corresponding to this variable and the label vector. We could localize the two patterns

Cells in animal early visual areas are sensitive to edge orientations and form repeated representations known as hypercolumns

After we finished all 80 folds of cross-validation, we counted the frequency with which each pixel was selected across the 80 folds and obtained two probability maps, one for each of the two sets of selected pixels. After projecting these pixels, with their probability values, back onto a 2-dimensional map, we calculated the difference between the two probability maps (horizontal probability map minus vertical probability map) and smoothed the difference map with a 2-D Gaussian mask (zero mean, unit variance, zero covariance). The difference between the two probability maps, reflecting the class information, is shown in
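The final smoothing step can be sketched as follows. Reading "zero mean, unit variance, zero covariance" as an isotropic 2-D Gaussian with σ = 1 in both axes (our interpretation), `scipy.ndimage.gaussian_filter` performs the equivalent operation; the function name is ours:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_difference_map(horizontal_map, vertical_map, sigma=1.0):
    """Difference of the two probability maps (horizontal minus vertical),
    smoothed with an isotropic 2-D Gaussian kernel (sigma = 1 assumed)."""
    diff = horizontal_map - vertical_map
    return gaussian_filter(diff, sigma=sigma)
```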

A: a difference map between the two stimulus conditions. B: the reconstructed condition difference map between the two stimulus conditions using our approach. C: the iterative curve of decoding accuracy rates in Experiment 2.

Human faces share many common features. The overlap of features makes it difficult to discriminate between different semantic categories related to faces (e.g., old and young, male and female) using fMRI signals

For each subject, we used a leave-one-trial-out method to carry out 80 folds of cross-validation. In each fold of cross-validation, we performed a recursive feature search, and 50 voxels were obtained in each iteration (25 voxels for “old people” and the other 25 voxels for “young people”). We then obtained two probability maps corresponding to the two age categories, respectively.

For each subject, we obtained two probability maps with respect to the voxels. Two average probability maps across all the 9 subjects were then calculated. Furthermore, we performed a permutation test of 100 permutations at the group level (see

The red clusters corresponded to the “old people” stimulus condition, and the blue clusters corresponded to the “young people” stimulus condition.

Facial appearance changes with age

In neuroimaging studies, the number of variables/features ranges from tens to hundreds of thousands. In comparison, only a small fraction of these features is engaged in a given stimulus condition or brain state. Furthermore, there are a limited number of trials (examples) for each stimulus condition, and it is known that too many variables may lead to overfitting in pattern classification

In this study, we proposed a sparse representation-based method for searching for informative features. Sparse representation is a promising method for feature selection and has recently received considerable attention

To improve the reliability of feature selection, we implemented a cross-validation procedure as suggested by several earlier studies

Neuroimaging data contain a large number of uninformative/noisy features that carry no useful information about the stimulus conditions. These uninformative features may be selected during each fold of the cross-validation. For instance, after the strongly informative features are removed from the data set in the early iterations, weakly informative features, as well as irrelevant features/noise, can be picked up in the later iterations, when the decoding accuracy is lower. Statistical tests are required to remove these uninformative features. Provided that all of the informative features can be extracted in each fold, the features that truly contribute to the discrimination should be picked up frequently during the cross-validation. Therefore, the contribution of an informative feature can be measured by the frequency of its appearance across all folds. To remove the irrelevant features, we used a nonparametric test at the individual subject level in this study. The same idea has been used previously to generate discrimination maps

The effectiveness of the above approach for selecting a high percentage of the informative features while removing the irrelevant features was clearly demonstrated in our data analysis in Examples 1 and 2. For instance, in the optical imaging data analysis, it is known that the vertical and horizontal orientations are represented as hypercolumns in the monkey primary visual cortex

Besides locating these informative features in the brain, we further specified the correspondence between the groups of informative features and the stimulus conditions. For instance, the ventral and dorsal visual pathways in the brain can be selectively activated by face and spatial-information stimuli, respectively

The multivariate pattern analysis approach is sensitive to subtle differences caused by different stimulus conditions/mental states and has been used to decode stimulus/task-related information in functional brain imaging data. In this paper, we proposed a sparse representation-based MVPA algorithm combined with a permutation test at the individual subject level or group level for feature selection in brain imaging data analysis. Two informative patterns, corresponding to two experimental conditions/brain states, respectively, were obtained. Through a recursive feature elimination procedure, most of the informative features were selected; using the two nonparametric permutation tests, irrelevant features were significantly reduced. Applications of our method to three experiments were presented. Specifically, the analysis results based on the toy data set and the optical imaging data set demonstrated the effectiveness of our approach, while the analysis results for an fMRI data set illustrated its application. Future work may include the extension of our algorithm to multiclass problems, as well as applications to data from other neuroimaging and medical imaging studies.

Effectiveness analysis for our sparse representation-based feature selection method.

(DOCX)

Brain areas for two sets of selected voxels in Example 3.

(DOCX)

The brain areas, volume sizes and the center coordinates of the clusters corresponding to the “old people” stimulus condition in Experiment 3. In a single brain area, at most two clusters (corresponding to the left and the right hemispheres respectively) are presented here.

(DOC)

The brain areas, volume sizes and the center coordinates of the clusters corresponding to the “young people” stimulus condition in Experiment 3. In a single brain area, at most two clusters (corresponding to the left and the right hemispheres respectively) are presented here.

(DOC)

The authors would like to thank the three anonymous reviewers and the Academic Editor for their insightful and constructive suggestions.