Learning a Weighted Meta-Sample Based Parameter Free Sparse Representation Classification for Microarray Data

Sparse representation classification (SRC) is one of the most promising classification methods for supervised learning. This method can effectively exploit discriminating information by introducing a regularization term to the data. With the desirable property of sparsity, SRC is robust to both noise and outliers. In this study, we propose a weighted meta-sample based non-parametric sparse representation classification method for the accurate identification of tumor subtypes. The proposed method includes three steps. First, we extract the weighted meta-samples for each subclass from raw data, and the rationality of the weighting strategy is proven mathematically. Second, sparse representation coefficients are obtained by regularization of underdetermined linear equations, so that data-dependent sparsity can be adaptively tuned. Third, a simple characteristic function is utilized to achieve classification. An asymptotic time complexity analysis is applied to our method. Compared with some state-of-the-art classifiers, the proposed method has lower time complexity and more flexibility. Experiments on eight publicly available gene expression profile data sets show the effectiveness of the proposed method.


Introduction
The development of high-throughput technologies has enabled scientists to monitor the expression levels of tens of thousands of genes simultaneously in a single experiment. This technology has become a symbol of the post-genomic era [1]. Biomedical research indicates that tumor development is related to changes in gene expression levels and that tumor-related biomarkers are usually associated with a few genes. Thus, accurately identifying tumor tissue or disease-related biomarkers is of great practical significance. However, gene expression profile data are characterized by very high dimensionality and small sample size. The curse of dimensionality makes classification challenging.
Some dimensionality reduction methods have recently been proposed to solve the "large p, small n" problem [2]. Feature extraction and feature selection are two approaches to dimensionality reduction. Feature extraction transforms the original features (genes) into a set of new features by subspace learning [3][4][5]. However, a suitable biological interpretation is difficult to obtain from the results of subspace learning. Feature selection is another commonly used dimensionality reduction method that selects, from the raw data, a subset of genes that can best predict the response values [6]. Although dimensionality reduction can significantly improve computational efficiency, this process can easily lead to over-fitting when a classifier is applied.
Sparse representation classification (SRC) was proposed by Wright et al. [7] for face recognition. With an $\ell_1$ sparsity constraint, a testing face can be approximately represented by the parts of the training data that are from the same class. Unlike traditional classification methods such as support vector machine and k nearest neighbor classifier, SRC is robust to both noise and outliers. However, the original training samples may not contain sufficient discriminating information compared with meta-samples [8].
To capture more alternative information from gene expression data, the so-called meta-samples are proposed in [8][9][10][11]. These samples can be regarded as a set of bases whose linear combinations can represent the training data. In [11], penalized matrix decomposition is used to extract meta-samples, and clustering is performed on those meta-samples. In [8], the meta-sample based sparse representation classification (MSRC) method is proposed. This method is robust to the over-fitting problem and to noise. However, MSRC needs two predefined parameters, namely, the number of meta-samples and the sparse penalty factor. These two parameters are data dependent. Thus, model selection methods, such as cross-validation (CV), significantly affect the classification results. In this study, we propose a non-parametric version of MSRC to address this optimal parameter selection problem. The main contributions of this paper are as follows:
1. The data-dependent sparsity can be automatically adjusted, rather than empirically chosen. Without computationally expensive model selection, our method is scalable and efficient.
2. The existing MSRC [8] method requires the appropriate selection of the number of meta-samples for each subclass, which is a laborious task. We address this problem by introducing a simple weighting strategy for the meta-samples of each category, and the rationality of the weighting strategy is mathematically proved.
3. Extensive experiments are performed to evaluate the proposed method. Experimental results show the superiority of the non-parametric version of MSRC compared with some state-of-the-art classifiers. Section 3 presents more details.
The remainder of this paper is organized as follows: prior work on sparse representation classification and the fundamentals of the proposed method are described in Section 2. Section 3 presents the experimental results. The proposed method is discussed in Section 4. Section 5 concludes this paper.

Methods
This study primarily aims to devise a robust classifier for tumor subtype classification. Given a microarray data set $\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m\} \in \mathbb{R}^{n \times m}$ and a set of class labels $C = \{1, 2, \ldots, c\}$, $\mathbf{X}$ is a matrix with $n$ rows and $m$ columns. Each column of $\mathbf{X}$ denotes a sample, whereas each row of $\mathbf{X}$ denotes a gene. Let $\mathbf{x}_j$ denote the $j$th sample, which is an $n$-dimensional column vector. For each element in $\mathbf{X}$, $x_{i,j} \in \mathbb{R}$ denotes the expression level of the $i$th gene in the $j$th sample. We provide a summary of the abbreviations used in this study in Table 1. For clarity, we use boldface lowercase letters for vectors and boldface capital letters for matrices.
Gene expression profile data are high-throughput data with tens of thousands of genes. However, the number of samples is usually very small, which makes classification challenging. To avoid the curse of dimensionality, differential gene expression analysis [12,13] is widely used to exclude redundant and irrelevant genes before classification. In our study, we use the ReliefF [14] method to select a subset of informative genes for further analysis. In the following subsections, we briefly review meta-samples and sparse representation classification. We then propose the weighted meta-sample based parameter free sparse representation classification (PFMSRC).

Meta-samples versus gene expression samples
As illustrated in Figure 1, meta-samples can be regarded as basis samples that contain the essential information of the original data. A given testing sample can be represented by a linear combination of meta-samples from the same class. Concretely, suppose $\mathbf{x}_i$ is associated with the $n_i$th class, where $n_i \in C$, and the $n_i$th class samples in the training data have $k$ meta-samples, namely, $\{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_k\} \in \mathbb{R}^{n \times k}$. Sample $\mathbf{x}_i$ can be formulated as Eq. (1).

$$\mathbf{x}_i = \mathbf{w}_1 h_{1,i} + \mathbf{w}_2 h_{2,i} + \cdots + \mathbf{w}_k h_{k,i} \tag{1}$$

Mathematically, meta-sample extraction can be regarded as a type of matrix decomposition, $\mathbf{X}_{n_i} \approx \mathbf{W}_{n_i} \mathbf{H}_{n_i}^T$, for which non-negative matrix factorization [15], singular value decomposition (SVD) [16], and principal component analysis [17] can be used. Here, the matrices $\mathbf{W}_{n_i} \in \mathbb{R}^{n \times k}$ and $\mathbf{H}_{n_i}^T \in \mathbb{R}^{k \times |n_i|}$ denote the meta-samples and the meta-genes, respectively. In singular value decomposition, $\mathbf{W}_{n_i}$ is a maximal linearly independent group of the column vectors of $\mathbf{X}_{n_i}$.
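The decomposition behind Eq. (1) can be illustrated with a small sketch. It assumes that SVD is the chosen factorization and uses NumPy; the function name `extract_meta_samples` and the random toy matrix are illustrative assumptions, not part of the original study.

```python
import numpy as np

def extract_meta_samples(X_class, tol=1e-10):
    """Factor a genes-by-samples class matrix as X_class = W @ H_T via SVD."""
    U, s, Vt = np.linalg.svd(X_class, full_matrices=False)
    k = int(np.sum(s > tol * s[0]))    # numerical rank of the class matrix
    W = U[:, :k]                       # meta-samples, one per column (n x k)
    H_T = s[:k, None] * Vt[:k, :]      # meta-genes (k x number of samples)
    return W, H_T

rng = np.random.default_rng(0)
X_class = rng.random((6, 3))           # toy data: 6 genes, 3 samples of one class
W, H_T = extract_meta_samples(X_class)

# Eq. (1) for the first sample: a linear combination of the meta-samples.
x_0 = W @ H_T[:, 0]
```

Each column of `H_T` holds the coefficients $h_{1,i}, \ldots, h_{k,i}$ of one sample, so `W @ H_T` reproduces the class matrix up to rounding error.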
Biologically, meta-samples are also called eigenarrays [18] or basis snapshots for gene expression data. Han et al. [17] used meta-samples to identify tumors from microarray data and found that meta-sample-based classification can effectively avoid over-fitting. Zheng et al. [10,11,18] proposed a novel clustering method based on meta-samples, in which meta-samples can be regarded as cluster indicators.
Prior works revealed that meta-samples preserve some desired discriminant information of samples from the same class.

Sparse representation classification problem revisited
In this subsection, we briefly revisit the sparse representation problem. Sparse representation is one of the most important topics in the machine learning and data mining communities and has wide applications in fields such as text mining, image classification, and bioinformatics. In this work, we interpret the sparse representation problem from the viewpoint of linear algebra.
From the standpoint of the linear equation system $\mathbf{X}\mathbf{a} = \mathbf{y}$, the solution has three possible states: the system is underdetermined and has infinitely many solutions, it has a unique solution, or it is overdetermined and has no exact solution. In the first scenario, one can pursue the sparsest solution by regularization [19]. The problem can be formulated as

$$\min_{\mathbf{a}} \|\mathbf{a}\|_0 \quad \text{s.t.} \quad \mathbf{X}\mathbf{a} = \mathbf{y} \tag{2}$$

However, $\ell_0$-norm minimization is an NP-hard combinatorial optimization problem and is difficult to solve. Fortunately, the $\ell_1$ norm is an appropriate convex approximation to the $\ell_0$ norm [20]. If the solution is sparse enough, $\ell_1$ minimization is equivalent to $\ell_0$ minimization [21], such that we can reformulate Eq. (2) as

$$\min_{\mathbf{a}} \|\mathbf{a}\|_1 \quad \text{s.t.} \quad \mathbf{X}\mathbf{a} = \mathbf{y} \tag{3}$$

For the other two scenarios, the sparsity of $\mathbf{a}$ cannot be guaranteed. However, one can still obtain a sparse solution by adding a penalty term that shares the same formulation as LASSO [22]:

$$\min_{\mathbf{a}} \|\mathbf{X}\mathbf{a} - \mathbf{y}\|_2^2 + \lambda \|\mathbf{a}\|_1 \tag{4}$$

Compared with Eq. (3), Eq. (4) is an unconstrained convex problem. Notably, $\lambda$ makes a tradeoff between sparsity and regression error and should be empirically chosen. A larger $\lambda$ yields a sparser $\mathbf{a}$. However, one might run the risk of increasing the regression error term $\|\mathbf{X}\mathbf{a} - \mathbf{y}\|_2^2$. Sparse representation assumes that a signal can be reconstructed by a linear combination of a small number of basis signals. Thus, Eq. (3) is known as basis pursuit [23]. In bioinformatics applications, one can suppose that a testing sample can be well reconstructed by a linear combination of the training data from the same class, which is a very useful assumption for our later work.
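The effect of the penalty factor in the LASSO-type formulation, written here as minimizing $\|\mathbf{X}\mathbf{a} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{a}\|_1$, can be illustrated with a short sketch. It uses the iterative shrinkage-thresholding algorithm (ISTA), which is only one of several possible solvers and is not the solver used in the original study; the function names and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (element-wise shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Approximately minimize ||X a - y||_2^2 + lam * ||a||_1 with ISTA."""
    # Step size 1/L, where L = 2 * sigma_max(X)^2 bounds the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    a = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ a - y)
        a = soft_threshold(a - step * grad, step * lam)
    return a

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))          # underdetermined: 20 equations, 50 unknowns
a_true = np.zeros(50)
a_true[[3, 17]] = [1.5, -2.0]
y = X @ a_true

lam_max = 2.0 * np.max(np.abs(X.T @ y))    # at or above this value the minimizer is all zeros
a_small = lasso_ista(X, y, lam=0.01 * lam_max)
a_large = lasso_ista(X, y, lam=1.1 * lam_max)
print(np.count_nonzero(a_small), np.count_nonzero(a_large))
```

A small penalty keeps some coefficients active, whereas a penalty above `lam_max` shrinks every coefficient to zero, which is the tradeoff between sparsity and regression error described above.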

Meta-sample based sparse representation
Zheng et al. [8] proposed the MSRC method to predict tumor subtypes. In this method, $c$ classes of meta-samples are extracted and denoted as $\mathbf{W} = [\mathbf{W}_1, \mathbf{W}_2, \ldots, \mathbf{W}_c]$, with the meta-samples of the same class grouped together and each meta-sample being a column vector (two kinds of meta-samples are proposed in [8]). Given a test sample $\mathbf{y}$ associated with class $i$, MSRC tries to find sparse reconstruction coefficients in terms of all meta-samples using Eq. (4). In particular, [8] solves the sparse representation problem $\min_{\mathbf{a}} \|\mathbf{W}\mathbf{a} - \mathbf{y}\|_2^2 + \lambda \|\mathbf{a}\|_1$. In ideal cases, the nonzero entries in $\mathbf{a}$ will only be associated with the $i$th class meta-samples of $\mathbf{W}$, as shown in Eq. (5).

$$\mathbf{a} = [0, \ldots, 0, \underbrace{a_{i1}, a_{i2}, \ldots, a_{i n_i}}_{i\text{th class}}, 0, \ldots, 0] \tag{5}$$

Notably, the gene expression profile contains data with high dimensionality and small sample size ($n \gg m$). The sparsity can only be achieved by adding a penalty term. However, the optimal number of meta-samples and the penalty factor are essential in classification applications. Figure 2 illustrates that if the number of meta-samples is improperly set, the prediction accuracy of MSRC drops seriously on the COLON dataset. Specifically, the left part of Figure 2 shows the 10-fold stratified cross-validation classification accuracy obtained by varying the number of meta-samples from 3 to 12 for each subclass. From the right part of Figure 2, we can observe that the performance is less sensitive to the regularization parameter within the range examined. Thus, model selection is essential and laborious work on different data sets.
To overcome this weakness, this study proposes a novel parameter free meta-sample based sparse representation classification (PFMSRC) method.

Parameter free meta-sample sparse representation (PFMSRC)
In this subsection, we first propose a heuristic weighting strategy, the reasonableness of which is theoretically proven. We then construct an underdetermined linear equation system, in which the data-dependent sparsity can be self-adaptively tuned by an $\ell_1$ norm regularizer.
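Let $\mathbf{X} = \{\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_c\} \in \mathbb{R}^{n \times m}$ be gene expression profile data, with the samples of the same class grouped together; that is, $\mathbf{X}_i$ contains all samples associated with the $i$th class. We factorize $\mathbf{X}_i$ by performing SVD, $\mathbf{X}_i = \mathbf{U}_i \boldsymbol{\Lambda} \mathbf{V}_i^T$. The singular values are sorted in descending order $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_k > 0$, where $k$ is the column rank of $\mathbf{X}_i$, and $\boldsymbol{\Lambda} = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k)$ denotes the diagonal matrix with the singular values as diagonal elements. One can extract the weighted meta-samples associated with class $i$ as

$$\mathbf{W}_i = [\sqrt{\sigma_1}\,\mathbf{u}_1, \sqrt{\sigma_2}\,\mathbf{u}_2, \ldots, \sqrt{\sigma_k}\,\mathbf{u}_k] \tag{6}$$

where $\mathbf{u}_j$ is the $j$th column vector of $\mathbf{U}_i$, and $\mathrm{rank}(\mathbf{X}_i) = k$.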
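The extraction of weighted meta-samples for one class can be sketched as follows, assuming NumPy's SVD routine and a relative tolerance for the numerical rank. The function name `weighted_meta_samples`, the tolerance, and the toy matrix are illustrative assumptions rather than details from the original implementation.

```python
import numpy as np

def weighted_meta_samples(X_i, tol=1e-10):
    """Scale each left singular vector of X_i by the square root of its singular value."""
    U, s, _ = np.linalg.svd(X_i, full_matrices=False)  # s is returned in descending order
    k = int(np.sum(s > tol * s[0]))                     # numerical rank of X_i
    W_i = U[:, :k] * np.sqrt(s[:k])                     # column j is sqrt(s_j) * u_j
    return W_i, s[:k]

rng = np.random.default_rng(1)
X_i = rng.random((8, 4))        # toy class matrix: 8 genes, 4 samples
W_i, s = weighted_meta_samples(X_i)
```

Because the left singular vectors are orthonormal, `W_i.T @ W_i` is the diagonal matrix of singular values, so the leading meta-samples carry the largest norms.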
Alternatively, Eq. (6) can be compactly reformulated as $\mathbf{W}_i = \mathbf{U}_i \boldsymbol{\Lambda}^{1/2}$. This weighting scheme enhances the influence of the main singular vectors in $\mathbf{U}_i$. That is, a larger singular value $\sigma_j$ makes the associated meta-sample more important. Moreover, the weighting scheme works well in the following experiments. Zheng et al. [8] also extracted meta-samples by performing SVD. However, in their algorithm framework, the number of meta-samples used for classification is determined during the cross-validation step. In contrast, PFMSRC avoids the cross-validation step by weighting all the meta-samples and weakening the influence of the minor singular vectors rather than using only several of them for classification. Proposition 1 theoretically proves the reasonableness of the weighting strategy in measuring the importance of each meta-sample.
Proposition 1. Singular value is a reasonable weighting factor for measuring the importance of meta-samples.
Proof. Let $\mathbf{X} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_k]\, \boldsymbol{\Lambda}\, [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k]^T$, where $\boldsymbol{\Lambda} = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k)$ and $\sigma_1 \ge \sigma_2 \ge \cdots$ This completes the proof. $\square$
The evaluation metric function measures the contribution of each meta-sample to the reconstruction of the raw data in terms of $\sigma_i$. Tr denotes the matrix trace. Note that the functions $f(x) = x$ and $g(x) = \sqrt{x}$ have the same monotonicity, which makes the weighting strategy reasonable.
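One way to make the trace-based contribution measure concrete is the following sketch. It is an illustrative argument based on the rank-one expansion of the SVD, with $\sigma_i$ denoting the singular values; it is not a reproduction of the original proof, and the symbol $\rho_i$ is introduced here only for illustration.

```latex
\begin{aligned}
\mathbf{X} &= \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^{T}, \\
\operatorname{Tr}\left(\mathbf{X}^{T}\mathbf{X}\right) &= \|\mathbf{X}\|_F^{2} = \sum_{i=1}^{k} \sigma_i^{2}, \\
\rho_i &= \frac{\operatorname{Tr}\left((\sigma_i \mathbf{u}_i \mathbf{v}_i^{T})^{T} (\sigma_i \mathbf{u}_i \mathbf{v}_i^{T})\right)}{\operatorname{Tr}\left(\mathbf{X}^{T}\mathbf{X}\right)}
        = \frac{\sigma_i^{2}}{\sum_{j=1}^{k} \sigma_j^{2}} .
\end{aligned}
```

Under this reading, $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_k$ implies $\rho_1 \ge \rho_2 \ge \cdots \ge \rho_k$, and because the square root is monotonically increasing, weighting by $\sqrt{\sigma_i}$ preserves the same ordering of importance.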
The $\ell_1$ graph was proposed by Cheng et al. [24] to measure the similarity among samples. Inspired by their work, sparsity can be obtained by applying an $\ell_1$ regularizer to underdetermined linear equation systems. Concretely, a testing sample can be recovered by a linear combination of weighted meta-samples with a noise term added, formulated as Eq. (7):

$$\mathbf{y} = \mathbf{W}\mathbf{a} + \mathbf{e} = [\mathbf{W}\ \ \mathbf{I}] \begin{bmatrix} \mathbf{a} \\ \mathbf{e} \end{bmatrix} \tag{7}$$

Let $\mathbf{B} = [\mathbf{W}\ \ \mathbf{I}] \in \mathbb{R}^{n \times (m'+n)}$ and $\mathbf{a}' = \begin{bmatrix} \mathbf{a} \\ \mathbf{e} \end{bmatrix} \in \mathbb{R}^{m'+n}$, where $m'$ represents the number of meta-samples corresponding to the $c$ classes, $\mathbf{I}$ is an identity matrix, and $\mathbf{e}$ is the noise term. Alternatively, one can solve the following minimization problem:

$$\min_{\mathbf{a}'} \|\mathbf{a}'\|_1 \quad \text{s.t.} \quad \mathbf{y} = \mathbf{B}\mathbf{a}' \tag{8}$$

Theorem 1 proves that Eq. (8) is an underdetermined linear system. As stated in Subsection 2.2, the sparsity of an underdetermined linear system can be automatically tuned by $\ell_1$ regularization (the first scenario). Moreover, Eq. (8) is a canonical convex problem with equality constraints, which optimizes the sparse representation coefficients and the noise term simultaneously. The globally optimal solution can be found efficiently by the CVX package [25] in polynomial time. Notably, the package solves the optimization problem by dualization rather than by an interior point method because the former is significantly faster than the latter.
Theorem 1. Linear equation system (8) is underdetermined, and $\mathrm{rank}(\mathbf{B}) = n$.
Proof. The matrix $\mathbf{B} \in \mathbb{R}^{n \times (m'+n)}$ contains the submatrix $\mathbf{I}$ with $\mathrm{rank}(\mathbf{I}) = n$, and the rank of $\mathbf{B}$ cannot exceed its number of rows $n$; hence $\mathrm{rank}(\mathbf{B}) = n$. Because $\mathbf{B}$ has $m'+n > n$ columns, the system is underdetermined. This completes the proof. $\square$
Note that $\mathbf{a}' \in \mathbb{R}^{m'+n}$ is a sparse vector with $m'+n$ entries. The first $m'$ components correspond to the linear representation coefficients, whereas the last $n$ components characterize the model noise or regression error. In most instances, however, the test sample $\mathbf{y}$ from one of the classes in the training data cannot be perfectly reconstructed by the meta-samples associated with the same class because of the existence of noise. Figure 3 illustrates the flowchart of our PFMSRC scheme, in which the redundant dictionary is constructed by combining the meta-samples and the noise term.
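Theorem 1 can be checked numerically on a toy dictionary. The sketch below assumes NumPy and uses a random matrix as a stand-in for the weighted meta-samples; the sizes are arbitrary and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m_prime = 5, 3                       # toy sizes: 5 genes, 3 meta-samples in total
W = rng.random((n, m_prime))            # stand-in for the weighted meta-samples
B = np.hstack([W, np.eye(n)])           # B = [W I], shape n x (m' + n)

rank_B = np.linalg.matrix_rank(B)
print(B.shape, rank_B)                  # more unknowns (columns) than equations (rows)
```

Whatever values `W` takes, the identity block guarantees full row rank, so the system always has more unknowns than independent equations.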
We define a projection function $\delta_i(\mathbf{a}'): \mathbb{R}^{m'} \to \mathbb{R}^{m'}$ for each class $i$, which selects the coefficients associated with the $i$th class from the first $m'$ components of $\mathbf{a}'$, whereas the other entries of $\delta_i(\mathbf{a}')$ are padded with zeros. The reconstruction relationship $\mathbf{y} = \mathbf{W}\delta_i(\mathbf{a}')$ does not always hold. However, the minimum reconstruction error criterion $r_i(\mathbf{y}) = \|\mathbf{y} - \mathbf{W}\delta_i(\mathbf{a}')\|_2$, $i = 1, \ldots, c$, is a good approximation for classifying testing samples. We summarize the proposed classification method as follows.
Step 1. Input the training set $\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_c] \in \mathbb{R}^{n \times m}$, the class number $c$, and the testing sample $\mathbf{y} \in \mathbb{R}^{n}$;
Step 2. Normalize the training samples and the testing sample to unit $\ell_2$-norm;
Step 3. Extract the weighted meta-samples $\mathbf{W} = [\mathbf{W}_1, \mathbf{W}_2, \ldots, \mathbf{W}_c]$ for each class (meta-samples of the same class are grouped together);
Step 4. Solve the non-parametric sparse representation problem by Eq. (8);
Step 5. Compute the residuals for each class, $r_i(\mathbf{y}) = \|\mathbf{y} - \mathbf{W}\delta_i(\mathbf{a}')\|_2$, $i = 1, \ldots, c$;
Step 6. Return the class label of $\mathbf{y}$ as $c(\mathbf{y}) = \arg\min_i r_i(\mathbf{y})$, $i = 1, \ldots, c$.
PFMSRC can be considered a non-parametric version of MSRC. Compared with MSRC, it has the following merits:
1. The weighted meta-samples are orthogonal to one another. That is, no redundancy exists among the meta-samples, and the weight enhances the influence of the main singular vectors, such that discriminant information can be well retained.
2. The data-dependent sparsity can be automatically tuned without human intervention. Thus, PFMSRC has better scalability and robustness.
In the following section, we conduct extensive experiments on microarray data to evaluate the effectiveness of our scheme. The microarray data repository information and the accession numbers are given in Table 2.
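Steps 1 to 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the original Matlab/CVX implementation: the $\ell_1$ problem is read as minimizing $\|\mathbf{a}'\|_1$ subject to $[\mathbf{W}\ \mathbf{I}]\mathbf{a}' = \mathbf{y}$, recast as a linear program, and solved with `scipy.optimize.linprog`. The function names `weighted_meta_samples` and `pfmsrc_predict` and the toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def weighted_meta_samples(X_i, tol=1e-10):
    """Step 3 for one class: columns sqrt(s_j) * u_j from the SVD of X_i."""
    U, s, _ = np.linalg.svd(X_i, full_matrices=False)
    k = int(np.sum(s > tol * s[0]))           # numerical rank
    return U[:, :k] * np.sqrt(s[:k])

def pfmsrc_predict(class_blocks, y):
    """Classify y given a list of genes-by-samples matrices, one per class."""
    # Step 2: scale every training sample and the test sample to unit l2 norm.
    blocks = [X / np.linalg.norm(X, axis=0, keepdims=True) for X in class_blocks]
    y = y / np.linalg.norm(y)
    # Step 3: weighted meta-samples, grouped by class.
    W_list = [weighted_meta_samples(X) for X in blocks]
    W = np.hstack(W_list)
    n, m_prime = W.shape
    # Step 4: min ||a'||_1 s.t. [W I] a' = y, as an LP with a' = u - v and u, v >= 0
    # (linprog's default bounds already enforce non-negativity).
    B = np.hstack([W, np.eye(n)])
    p = B.shape[1]
    lp = linprog(np.ones(2 * p), A_eq=np.hstack([B, -B]), b_eq=y, method="highs")
    if not lp.success:
        raise RuntimeError(lp.message)
    a = (lp.x[:p] - lp.x[p:])[:m_prime]       # keep coefficients, drop the noise part e
    # Steps 5 and 6: class-wise residuals and the arg min rule.
    residuals, start = [], 0
    for W_i in W_list:
        stop = start + W_i.shape[1]
        delta = np.zeros(m_prime)             # delta_i(a'): zero outside class i
        delta[start:stop] = a[start:stop]
        residuals.append(np.linalg.norm(y - W @ delta))
        start = stop
    return int(np.argmin(residuals)), residuals

# Toy data: class 0 is expressed on genes 0-1, class 1 on genes 2-3.
e = np.eye(4)
X0 = np.column_stack([e[0]] * 4 + [e[1]] * 2)
X1 = np.column_stack([e[2]] * 4 + [e[3]] * 2)
label_a, res_a = pfmsrc_predict([X0, X1], np.array([0.8, 0.6, 0.0, 0.0]))
label_b, res_b = pfmsrc_predict([X0, X1], np.array([0.0, 0.0, 0.6, 0.8]))
```

No regularization parameter and no meta-sample count appear in the call, which is the practical meaning of "parameter free" here; the only tunable value in this sketch is the numerical rank tolerance.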

Experiments
In this section, we evaluate the performance of the proposed PFMSRC algorithm against four state-of-the-art algorithms, namely, linear discriminant analysis (LDA+SVM), independent component analysis (ICA+SVM), SRC, and meta-sample sparse representation (SVD-MSRC). The first two are model based and accompanied by feature extraction. These two algorithms are regarded as baselines. For the model-based methods, a support vector machine [26,27] with a radial basis function kernel is employed as the classifier. The experiments are performed on four binary classification data sets and four multiclass classification data sets. All experiments are implemented in the Matlab environment and run on a personal computer with an Intel Pentium 4 dual-core 2.4 GHz CPU and 4 GB of RAM. The summarized descriptions of the eight gene expression profile datasets are provided in Table 3.

Dataset preprocessing and experiment setup
Gene expression profiling involves data with high dimensionality and small sample size. The exclusion of redundant and irrelevant data is critical for classification. As suggested by [36], retaining only the top 400 genes makes a good tradeoff between computational complexity and biological significance. In our experiment, the top 400 genes are selected from each dataset by applying the ReliefF [14] algorithm to the training set.
For the LDA+SVM algorithm, we simply extract $c-1$ new features to train the classifier, as LDA can find at most $c-1$ meaningful projection vectors in the subspace, where $c$ denotes the number of classes. SVM kernel parameters are determined by 10-fold cross-validation. In fact, the determination of the number of independent components is also empirical. Here, we use the same method as suggested by [18]. The SRC and MSRC methods need the parameter $\lambda$ to control sparsity. MSRC also needs the number of meta-samples of each class as a key parameter. For each dataset, $\lambda$ is searched over $\{0.001, 0.1, 1, 10, 100\}$ by 10-fold CV on the training data, and the number of meta-samples for each class is set as recommended by [8].
Table 7. Comparison on four multiclass tumor data sets; for each data set, 10 (8 for LeukemiaGloub) samples per class are randomly selected for training, and the rest are used for testing.

Experiments on binary classification problem
To evaluate the performance of the five methods on balanced split data sets, we randomly select $p = 5$ to $\min(|c_i|) - 1$ samples per subclass as the training set and use the rest for testing, which guarantees that at least one sample in each category can be used for testing. The random training/testing split is repeated 20 times, and the average classification accuracies are presented. The best prediction accuracy is in boldface for each gene expression profile dataset.
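The balanced split protocol can be sketched with the Python standard library as follows. The function name `balanced_split`, the seed handling, and the toy label vector are illustrative assumptions; only the rule "p samples per subclass for training, the rest for testing, repeated 20 times" comes from the text.

```python
import random
from collections import defaultdict

def balanced_split(labels, p, seed=None):
    """Pick p training indices per class at random; all other indices are for testing."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    train = []
    for idxs in by_class.values():
        if p > len(idxs) - 1:
            raise ValueError("p must leave at least one test sample in every class")
        train.extend(rng.sample(idxs, p))
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test

labels = [0] * 12 + [1] * 9            # toy binary data set with 21 samples
splits = [balanced_split(labels, p=5, seed=s) for s in range(20)]   # 20 random repetitions
```

In an experiment, a classifier would be trained and scored on each of the 20 splits and the accuracies averaged; for this toy label vector, `p` may range from 5 to 8.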
We show the average performance comparison on four binary classification tasks in Figure 4. PFMSRC exhibits encouraging performance. Although Gliomas is difficult to classify, the proposed approach can still achieve 85% classification accuracy when 20 samples per subclass are used for training. Notably, the classification accuracy of LDA+SVM and ICA+SVM drops quickly as more samples are taken for training; the same observations can be found in [36]. This fluctuation phenomenon can be interpreted as follows: (1) For the binary classification case, the feature extracted by LDA has only one dimension, which is insufficient to capture the intrinsic discriminating information. Thus, model-based classification methods have difficulty in preventing the over-fitting phenomenon. (2) The number of testing samples on which the performance is evaluated changes as more samples are used for training. Classification accuracy, specificity, and sensitivity are some popular evaluation metrics. In this work, we use all three to evaluate performance, and the results are reported in Tables 4, 5, and 6, respectively. The three sparse representation based methods achieve satisfactory performance not only on the specificity metric but also on the sensitivity metric. PFMSRC outperforms its competitors SRC and MSRC in most cases. Taking all metrics into consideration, PFMSRC achieves the best performance, followed by MSRC and SRC.
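For reference, the three evaluation metrics can be computed from predicted and true labels as in the sketch below. It uses the standard definitions based on true/false positives and negatives; the function name `binary_metrics`, the choice of the positive label, and the example labels are assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity, specificity) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn)       # true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    return accuracy, sensitivity, specificity

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
acc, sens, spec = binary_metrics(y_true, y_pred)   # 0.7, 0.75, 4/6
```

Reporting sensitivity and specificity alongside accuracy matters for tumor data because the two classes are often of unequal size.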

Experiments on multiclass classification problem
We investigate multiclass classification performance on four publicly available data sets. The experimental setup is the same as that for the binary classification case. On the one hand, Figure 5 and Table 7 show that (1) the classification accuracies of SRC, MSRC, and PFMSRC increase on all multiclass classification datasets as more samples per subclass are taken for training; (2) ALL has six subclasses, and the proposed PFMSRC achieves the highest classification accuracy, which indicates that the proposed method has potential superiority on the multiclass classification task; and (3) LDA can capture more discriminating information on the multiclass classification task, and the over-fitting phenomenon is reduced compared with the binary classification task.
On the other hand, sparse representation based classification methods are less sensitive to the number of training samples than model-based classification methods, which suggests a natural approach to selecting a classifier when the training sample size is small.

Experiments with different number of genes
In this subsection, we evaluate the performance of the five methods with different feature dimensions on eight tumor data sets. For the training data, 10 samples per subclass are randomly selected, whereas the remaining samples are used for testing. We perform the test with various numbers of genes, starting from 50 to 400 genes in steps of 20. The comparison experiment was performed 20 times, and the average prediction accuracy on the eight gene expression profile datasets was recorded for evaluation.
The balanced training sets for each dataset ensure fair evaluation, as stated by [36]. The experimental results in Figure 6 show that the proposed PFMSRC performs well even when only 100 genes are used. We can observe similar results in the multiclass classification case as well.
In the binary classification case, SRC, MSRC, and PFMSRC share the same curve trend. PFMSRC performs well with a smaller number of genes, whereas SRC and MSRC achieve comparable accuracy only by using more genes. Evidently, SRC, MSRC, and PFMSRC consistently outperform LDA+SVM and ICA+SVM on all datasets.
In the multiclass classification case, the performance of MSRC, SRC, and PFMSRC is very stable with respect to the number of genes, and all these methods converge fast to the optimal classification rate. Figure 7 shows that, compared with their performance in the binary classification case, SRC, MSRC, and PFMSRC are less influenced by gene dimension. Note that ALL is a multiclass dataset with six subclasses, but PFMSRC can still achieve a classification accuracy of 97%, which is higher than that of SRC and MSRC. The same conclusion can be drawn for the SRBCT dataset.
In Table 8, we report the detailed classification accuracy. PFMSRC outperforms its competitors on most gene expression profile datasets, whereas SRC and SVD-MSRC perform the second best.

Comparison of CV performance
To evaluate the classification performance on imbalanced split training/testing sets, we perform 10-fold stratified CV on the tumor subtype datasets. All samples are randomly divided into 10 subsets based on stratified sampling: nine subsets are used for training, and the remaining samples are used for testing. This evaluation process is repeated 10 times, and the average result is presented. The 10-fold CV results are summarized in Table 9, which shows that as the training sample size increases, the performance of these five classification methods is significantly improved. The model-based methods LDA+SVM and ICA+SVM perform very well, with the classification accuracy increased significantly. In particular, the prediction accuracy of ICA+SVM ranges from 86.5% to 96.57% over all tumor expression profile datasets, which is comparable with those of SRC, MSRC, and PFMSRC.
We can conclude that model-based approaches are more vulnerable to the small sample size problem and that over-fitting should be resolved properly.
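The stratified 10-fold partition can be sketched with the Python standard library as follows. The round-robin dealing scheme, the function name `stratified_k_folds`, and the toy label vector are illustrative assumptions; the original experiments were run in Matlab and their exact partitioning routine is not specified.

```python
import random
from collections import defaultdict

def stratified_k_folds(labels, k=10, seed=None):
    """Split sample indices into k folds while keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)     # deal the members of each class round-robin
    return folds

labels = [0] * 40 + [1] * 22               # toy imbalanced data set with 62 samples
folds = stratified_k_folds(labels, k=10, seed=0)

for test_idx in folds:                      # each fold serves once as the test set
    train_idx = [i for fold in folds if fold is not test_idx for i in fold]
    # a classifier would be trained on train_idx and evaluated on test_idx here
```

Each test fold then contains roughly one tenth of every class, so about 90% of the samples of each subtype are available for training in every round.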

Discussion
Based on the above experiments, we can draw the following observations:
1. Sparse representation based methods (SRC, MSRC, PFMSRC) consistently outperform the model-based methods (LDA+SVM, ICA+SVM) in all experiments. In particular, on the balanced split datasets, the prediction accuracy of the model-based methods is significantly lower than that of the sparse representation methods, which may be attributed to the small sample size problem. In contrast, SRC, MSRC, and PFMSRC perform well even when we take 5 samples per subclass for training and the rest for testing.
2. SRC, MSRC, and PFMSRC are robust to various sample sizes and feature dimensions and converge fast to the optimal classification rate. The experiments verify the results in [7], which favors the application of those methods. Note that the model-based methods (LDA+SVM, ICA+SVM) exhibit improved 10-fold CV classification accuracy. A reasonable explanation is that the over-fitting phenomenon is dramatically reduced when 90% of the original samples are used for training and the remaining 10% are used for evaluation in our experiments.
3. PFMSRC outperforms SRC and MSRC in most cases, which implies that the parameter free sparse representation and the weighting strategy can capture more discriminating information, especially in multiclass classification (see Figure 5).
4. PFMSRC is a parameter-free method in which the data-dependent sparsity can be self-adaptively tuned, whereas the search for a regularization parameter in SRC and MSRC is laborious work. Moreover, the number of meta-samples is a key parameter for MSRC, as shown in Figure 2, which makes model selection more difficult.

Conclusions
In this study, we proposed a novel non-parametric meta-sample-based sparse representation classification method. The algorithm assumes that test samples can be well reconstructed by a linear combination of weighted meta-samples in the same class. We theoretically proved the rationality of the weighting strategy. A simple but efficient projection function is constructed from the sparse representation coefficients to complete the classification. We also compared the performance of PFMSRC with that of two model-based methods and two sparse representation-based methods on eight tumor expression datasets. Experimental results have shown the superiority of the proposed method. We then drew some conclusions on the effects of both balanced split and imbalanced split training/testing sets on tumor classification problems.
PFMSRC exhibits stable performance with respect to different training sample sizes and feature dimensions compared with the other four algorithms. Thus, the extension of the sparse representation with dimensionality reduction (feature selection or feature extraction) in a unified framework is one of our future works.