Collaborative representation-based classification of microarray gene expression data

Microarray technology is important to simultaneously express multiple genes over a number of time points. Multiple classifier models, such as sparse representation (SR)-based method, have been developed to classify microarray gene expression data. These methods allocate the gene data points to different clusters. In this paper, we propose a novel collaborative representation (CR)-based classification with regularized least square to classify gene data. First, the CR codes a testing sample as a sparse linear combination of all training samples and then classifies the testing sample by evaluating which class leads to the minimum representation error. This CR-based classification approach is remarkably less complex than traditional classification methods but leads to very competitive classification results. In addition, compressive sensing approach is adopted to project the high-dimensional gene expression dataset to a lower-dimensional space which nearly contains the whole information. This compression without loss is beneficial to reduce the computational load. Experiments to detect subtypes of diseases, such as leukemia and autism spectrum disorders, are performed by analyzing the gene expression. The results show that the proposed CR-based algorithm exhibits significantly higher stability and accuracy than the traditional classifiers, such as support vector machine algorithm.


Introduction
The development of microarray technology facilitates the collection of information containing expression values of multiple genes under different experimental conditions. This technology also promotes and affects the progress in biological and biomedical research [1][2][3]. Therefore, huge amounts of genome-wide expression data have been remarkably acquired, and the quantity of data is continuously growing. These data provides the opportunity to better understand the tissues being studied and explore a finer molecular distinction between health states. Thus, approaches based on machine learning, which can automatically acquire qualitatively interesting patterns from gene data, have been widely adopted [4][5][6][7]. Among these machine learning pathways, support vector machine (SVM) and K-nearest neighbor (KNN) are used to study a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 performance [8][9][10][11]. To facilitate a more flexible and comprehensive analysis, Pse-in-one and Pse-analysis have been proposed. These methods are considered powerful bioinformatics analysis tools based on web server and Python package, respectively [12,13]. These tools can generate any desired pseudo components or feature vectors for protein/peptide and DNA/RNA sequences according to the need of users' in their studies. In [14], various neural network technologies for cancer classification were surveyed, and the results are beneficial to exploit the most cost-effective approaches for clinicians. Zheng et al. [15] used independent component analysis to refine a subset of genes to further improve the clustering performance of nonnegative matrix factorization. However, gene data are always in high-dimensional space because of the enormous numbers of measurements (e.g., genes and probes). The high-dimensional characteristic limits the applicability of the majority of conventional classifier models. This characteristic also leads to the poor performance of conventional models in the identifying diseases from genome-wide expression data. Moreover, parameters should be optimized to facilitate the implementation of the algorithm depending on the structure of the data set.
Sparse representation is well-known as a powerful tool in various applications and has been highly developed for signal processing and machine learning [16]. Sparse representation-based classification (SRC) was introduced to identify gene expression without requiring any parameter optimization [17]. The SRC scheme has been successfully applied in the diagnosis of microarray gene expression in cancer. In [18], SRC was applied and showed better performance than the state-of-the art methods in protein fold recognition. In the SRC scheme, a microarray gene expression y can be expressed as y = Fα by sparse representation over a dictionary F, while α is a sparse vector. Assuming that l 0 -norm indicates the number of non-zeros in α, the sparsity of α can be represented by l 0 -norm. Thus, α can be evaluated with the criteria of l 0 -norm minimization. However, the solution of l 0 -norm minimization is NP-hard, which is very timeconsuming. Consequently, l 1 -norm minimization is considered as the alternative to l 0 -norm minimization because of its closest convex function of the former. The former minimization process is widely equipped with in sparse coding as follows: min α kαk 1 s.t. ky − Fαk 2 ε, where ε is a small constant. Although the computational complexity of l 1 -norm minimization is much lower than that of l 0 -norm minimization, the applications of l 1 -norm minimization is still limited by the high complexity in real-time scenarios. Therefore, various algorithms have been presented to accelerate l 1 -norm minimization.
Another issue is that the dimension of the feature space may be too high for classification algorithm. Numerous algorithms may become invalid or infeasible when lots of feature data in a dimension have to be processed [19][20][21]. A successful method is to extract a small number of discriminative information from a high-dimensional space. That is, high-dimensional data can be mapped to a feature space with lower dimension to facilitate the design of a classification algorithm. The algorithm of feature extraction is effective but highly complex. Compressive sensing (CS) theory is a well-known solution for sparse signals in sampling theory [22]. The main result of CS, which was introduced by Candés and Tao [23] and Donoho [24], is that a length-N signal x, that is, K-sparse in some basis can be recovered exactly in the polynomial from just M = O(K log(N/K)) linear measurements of the signal. Using the CS approach, highdimensional feature data can be simply projected to a space with much lower dimension with a random sensing matrix. The low-dimensional data retain enough information and can recover the original high-dimensional features. In this paper, we employ a measurement matrix which is assumed to be very sparse to efficiently compress the feature from the gene expression dataset. CS facilitates the extraction of the feature data to a low-dimensional one in the compressed domain.
To improve the sparse representation with computationally expensive l 1 -norm minimization, this paper proposes a collaborative representation (CR)-based classification (CRC) to subtype gene expression datasets. CR collaboratively uses whole training samples from all classes to constitute the query sample y. CRC with regularized least square can achieve competitive classification results with considerably less complexity. Our works are beneficial in augmenting the study of sparsity-based genome-wide data pattern classification. We used benchmark cancer, autism spectrum disorder (ASD), and brain data sets, in the experiments. To measure the effectiveness of the CRC algorithm, accuracy in terms of prediction or error rate in classifying a selected gene subset is evaluated.

Materials and methods
In this study, the classification problem can be depicted as follows: The microarray gene data sets, which serve as sample to train, are classified for the diagnosis and prognosis of various types of diseases. To apply sparse representation model, the coding vector of y should be sparse. Moreover, coding should be performed collaboratively over the whole dataset instead of each subset. By identifying the sparsest representation of y over X, the subset is naturally discriminative.

Collaborative representation-based classification
Classification of gene expression datasets using algorithms has been well studied for the robust diagnosis and prognosis of various diseases to achieve high prediction accuracy [25]. The primary goal of this approach is to identify the important pathways or genes strongly associated with the clinical outcomes of interest. Sparse representation has been applied to classify genomic data [26,27]. The corresponding performance was tested by distinguishing two subtypes of leukemia by analyzing microarray gene expression. With the selected leukemia genes from the 7129 ones, SRC achieved a classification accuracy of 97.4% when performed with the leaveone-out (LOO) method. When the tests were imposed on the same datasets, compared with existing methods, such as weighted vote, SVM, sparse logistic regression method, and rough sets method, SRC shows potential advantages with respect to improved classification rates with fewer informative genes. SRC also helps reduce computational load in terms of computational complexity and memory storage when processing high-dimensional data. Detecting subtypes of diseases, such as leukemia, according to different genetic markups is important. SRC is favorable for personalized therapy and improving treatment.
According to the theory of sparse representation, a complete dictionary of atoms, denoted by F 2 R mÂn , can accurately represents any signal x 2 R m by linearly combining these atoms in F. Nevertheless, if F is orthogonal, the representation of x is required with many atoms from F, leading to a complex computation. Then, the orthogonality of F is relaxed to allow less atoms to represent x. That is, more atoms is required to be involved in F to allow more choices to represent x. The dictionary F with redundant atoms is referred to as an over-complete matrix, leading to a sparser representation of signal x. Such sparse representation has been utilized in image restoration, in which great success has been achieved.
A total number of K classes of targets are assumed. Let X i 2 R mÂn denote the dataset of the i th class, X = [X 1 , X 2 , Á Á Á, X K ] and the column of X i is a sample belonging to class i. The existence of a microarray gene data y 2 R m is assumed, which can be coded as y = Xα + w, where w is a vector indicating the residual, α = [α 1 ; Á Á Á; α i ; Á Á Á; α K ] and α i denotes the coding vector corresponding to class i. If y belongs to the ith class, w is a zero vector, then y = X i α i holds. Only the coefficients in α i are significant when most entries in α j , j 6 ¼ i are nearly zeros. That is, vector α is sparse, and its non-zero entries are used to encode the identity of sample y. Table 1 shows the matrix measurements of the microarray gene expression data for samples of prostate tumors and adjacent prostate tissue without tumor. Assuming that the training samples consist of these samples, the dictionary of samples is as follows: : ð1Þ A sample to be identified is treated as the measurement output, and α is to be reconstructed. Microarray gene expression datasets, such as breast cancer dataset, usually suffer from high dimensionality. In traditional SRC, enough training samples for each class is required for dictionary X i to be over-complete. However, the sample size of the gene expression data is very small compared to its high dimensionality, leading to the under-complete matrix X i . Consequently, the representation error would be unacceptable, even when y belongs to class i. This characteristic leads to a failure decision made by the classifier. One obvious solution is to use more samples of class i to represent y. However, collecting huge number of samples is expensive. Nevertheless, differences among genes are very small such that the diverse classes of gene expression data share similarities. Thus, the testing gene y can be coded collaboratively by the dictionary of all samples X = [X 1 , X 2 , Á Á Á, X k ] under the l 1 -norm sparsity constraint.
When y is collaboratively represented with all classes containing all samples, y can be classified individually by SRC by checking class by class [28]. The sparse solution to y = Xα can be determined by solving the following optimization problem: However, solving l 0 -optimization is an NP hard problem. Optimizations of l 0 -optimization and l 1 -optimization have recently been proved to be equivalent when α is sparse enough. In general, the sparse representation problem can be expressed as follows: We use sparsity constraint of l 2 -norm instead of that of l 1 -norm, Thus. the problem becomes If X is over-complete,â is computed by an algorithm such as OMP algorithm. Then, y is perpendicularly projected onto the space spanned by X, which can be expressed as follows: In SRC, a new sample y new can be classfied by evaluating the reconstruction representation errors: whereâ i designates the coding coefficient vector drawn according to class i. When the number of classes is too large, a stable solution of the least squareâ ¼ arg min a k y À Xa k 2 2 cannot be obtained. To overcome this limitation, CRC is proposed to collaboratively represent the query sample using X. The corresponding l 2 -minimization with regularized least square is as follows: Assuming P = (X T X + λI) −1 X T , this formula shows that P is independent of y. Hence, P is only required to be computed once and can be shared in classifying different samples. The reduction in the complexity of CRC is not achieved at the expense of performance. Similar to SRC,p is used to classify the type of gene expression data. The representation residual ky À X ipi k 2 is the main criteria for classification, wherep i is the coding vector associated with class i. However, the l 2 -norm kp i k 2 is also beneficial in providing some information about the class features of gene expression data and can be combined for classification. Such combination is useful to slightly improve the classification accuracy compared with that using only the representation residual.

Compressive sensing-based dimensionality reduction
Gene expression datasets may have very high dimensionality. High-dimensional gene expression datasets always result in high computational load and degradation of model performance of classification algorithm [10,19,21]. Therefore, feature selection approach or dimensionality reduction methods should eliminate redundant and irrelevant feature data to decrease the ratio of features to samples. On the other hand, this process reduces the probability of overfitting. A common method of dimensionality reduction in high-dimensional classification is choosing some data from a dataset in an order and dropping the other data. This approach would inevitably result in the loss of some information. Many probes in traditional microarrays are inactive during detection. That is, the microarray gene expression data can be sparse. Only a few significant entries of gene expression data matrix may be of interest. This phenomenon suggests the fast transformation of a microarray along with the CS measurement process, where each measurement y is a linear combination of entries in the microarray gene expression data vector x. A number of N gene expression data is assumed in every sample, but that at most K samples are collected, with K ( N. In random projection, a sensing matrix R with rows having unit length projects data from the high-dimensional feature x 2 R N to a lower- where M ( N. Each projection v is essentially identical to a compressive measurement during CS encoding. Therefore, using the CS principle, the number of data to be classified can be remarkably lower than the number of the original microarray gene data. With fewer data, the classification algorithm can also be more efficient. We refer to this reduction of microarray dimensionality as a CS dimensionality reduction. Ideally, R can offer a stable embedding that approximately retains the important information in any K-sparse signal when x 2 R N is projected to v 2 R M . Therefore, the so-called restricted isometry property (RIP) is derived to approximately maintain distances between any pair of K-sparse signals. Note that every signal pair shares the same K basis. That is, for any two K-sparse vectors x 1 and x 2 sharing the same K basis, Incoherence is achieved if the sparse signal satisfies the RIP condition. For example, random matrices with entries identically and independently drawn from a standard normal distribution, Bernoulli distributions, or Fourier matrix satisfy the RIP condition. However, given that the matrix is dense, the loads in terms of memory and computational complexity become unacceptable when N is large. The matrix which satisfies the RIP condition can be directly obtained using the Johnson-Lindenstrauss lemma. In this paper, we use a very sparse random measurement matrix with entries defined as follows: with probability 1 2r 0 with probability 1 2r À 1 with probability 1 2r This type of matrix with ρ = 1 or 3 satisfies the Johnson-Lindenstrauss lemma [29]. This matrix is easy to compute and requires only a uniform random generator. In the latter approach, the microarray readouts are linear combinations of input gene expression data components. Thus, the readouts can be expressed in the form given by Eq 8. With a reduced number of measurements, we are able not just to detect but also classify the target gene expression datasets. The proposed CRC with CS dimensionality reduction algorithm is summarized in Algorithm 1.

Results and discussion
To evaluate the classification performance, we performed experiments on two benchmark datasets and compared the proposed model with state-of-the-art models. The simulations were conducted on a system with i7-4790 CPU with 3.60 GHz processor and 8 GB RAM. The algorithms were evaluated based on accuracy to determine the rating prediction accuracy as follows: where TP denotes true positive (item is true and classified truly), TN is true negative (item is true but not classified truly), FP is false positive (item is false but classified truly), and FN is false negative (item is false and not classified truly). The performance of the classifiers is quantified by LOO cross-validation (LOOCV) and 10-fold cross validation (10-fold-CV). In the LOOCV scheme, each sample in the dataset was predicted by building a model from the remaining samples and recording the accuracy of each model. In 10-fold-CV, the dataset was randomly divided into 10 equally sized subsets. Nine of these subsets were used in the model construction, while the remaining subset was used for prediction.

Datasets
In this study, we use gene data for leukemia and ASD to evaluate the performance of CRC. The leukemia data set was downloaded from an open resource on the website of Gene Pattern in Broad Institute. The original training data set consisted of 38 bone marrow samples (27 ALL and 11 AML), while the testing data set consisted of 35 bone marrow samples (21 ALL and 14 AML) [14]. A total of 7129 gene expression samples were tested. The brain gene data was collected by a consortium consisting of Allen Institute for Brain Science and five collaborating universities [30][31][32]. Using the GENCODE consortium's release [33], the expression data were assembled and aligned in the form of RNA-sequencing reads. The data were in the units of reads per kilobase of transcript per million mapped reads (RPKM). Therefore, A log 2 (RPKM+1) transformation operation was imposed on these data. The dataset had 524 samples with a developmental time point range from 8 weeks post-conception to 40 years of age from 26 brain structures. To our knowledge, this brain dataset is currently the most comprehensive transcriptome of the human developing brain. In the training dataset, the expression values for the temporospatial time points acted as features.
We also considered the diffuse large B-cell lymphoma (DLBCL) [34] and breast cancer, which has high dimension. The DLBCL data set consisted of 58 samples from DLBCL patients and 19 samples from follicular lymphoma with 7070 genes. The gene expression profiles were organized using Affymetrix human oligonucleotide arrays. The gene dataset of breast cancer has 14 docetaxel-resistant samples and 10 docetaxel-sensitive samples. The cDNA microarrays comprised 12625 genes. All datasets are downloaded from http://www.biolab.si/supp/bi-cancer/projections/ index.html (Table 2), and have been preprocessed by t-test with a 0.05 confidence level.

Dimensionality reduction results
The computational load can be reduced, and overfitting can be avoided by reducing the dimensionality for a dataset. CS is used to project the high-dimensional data to a lowerdimensional space. The much lower bound for M is sufficient to achieve good results for gene data classification. The average LOOCV accuracy under the incremental dimension reduced by CS is shown in Figs 1-4. Fig 1 shows that in leukemia, when the reduced dimensionality is more than 500, the average accuracy would not increase. That is, 500 is an optimal choice for CS dimensionality reduction of leukemia gene data. Similarly, Figs 2 and 3 show that the optimal reduced dimensionalities for breast cancer and DLBCT are 200 and 600, respectively. By contrast, Fig 4 depicts the performance of ASD always remains stable even when the dimensionality is reduced to 3. Thus, the data in the gene datasets of this disease may be highly coherent. These coherent data may be sparse in the other form. Compared with the original dimensionality, CS dimensionality reduction is helpful to speed up the classification algorithm.
We also compared the running time of CRC and SRC with fast l 1 -minimization methods. We fixed the reduced dimensionality. In one LOOCV loop, the average running speeds of CRC and SRC are listed in Table 3.
The improved speed of CRC is obvious in the ASD database. However, in leukemia, DLBCL, and breast cancer datasets, the matrix X is overdetermined. SRC directly uses pseudoinverse to solve the l 1 minimization problem. When more samples are used for training, SRC has to use the other fast l 1 minimization methods, such as l_1_ls and homotopy. In such case, CRC is 7 times faster than SRC.

Performance comparison
Several state-of-art classifier models, including SRC, KNN [35], support vector machines (SVM) [36], and random forest [37], were adopted when performances were compared in terms of averaged accuracy. The procedure was repeated 100 times, and their averaged performances were calculated for all tested samples to enhance the quantification of the performance of the learning model. The numerical results of the LOOCV, 10-fold CV, and 5-fold CV are summarized in Tables 4-6. For comparison, we also considered the small round blue cell tumors (SRBCT) which have multiple classes. Four different SRBCTs are used in this dataset, namely, Ewing family tumor (EWS), Burkitt lymphoma (BL), neuroblastoma (NB), and rhabdomysarcoma (RMS). The training set contained 63 samples, and the test set contained 20 samples. The cDNA microarrays consisted of 2308 genes. Tables 4 and 5 shows that CRC outperformed the other classifiers using all validation methods. This approach has an average accuracies of 96.4% and 94.5% using LOOCV and 10-fold-CV, respectively. The accuracies of SRC and CRC are quite close and at least 7% higher than  those of the other three methods. Table 6 shows the performances in terms of 5-fold CV. All averaged accuracies are reduced compared with that of 10-fold CV. This phenomenon is due to the less samples used to train. Note that the performance of ASD was reduced slightly, because the total samples of ASD gene data were far more than the other gene datasets. The proposed method yielded a slightly higher average accuracy, implying that CR remarkably contributed to the classification. Considering these studies, CRC is the most robust method, because it showed the highest accuracy with the least number of genes not only in the test set but also using other types of validations, including 10-fold-CV and LOOCV.

Conclusion
This paper presented a new classifier model for classifying microarray gene expression. The proposed classifier uniquely incorporates CS dimensionality reduction CS to guide data classification. This work has two important contributions. First, an effective CRC approach was implemented. This method exhibited very high performance and valuable insight into the different types of cancer data sets. CRC did not require the optimization of parameters to facilitate classification. This method can be used for different types of gene expression data with multiple classes without any modifications. Second, CS method was adopted to reduce the dimensionality of data. The sensing matrix is a very sparse random matrix with entries as 0 and ffiffiffi r p . This method avoided the various feature extraction methods that are computationally expensive. The proposed method can be introduced for clinical application for each patient. Accurate diagnostics can be provided by only measuring a few gene expression data. We tested our algorithm on publicly available data sets of several diseases, including leukemia, breast cancer, ASD, DLBCL, and SRBCT. In conclusion, CRC can achieve high classification accuracy and fast computational speed.
Supporting information S1 File. The code (Matlab) and data files along with instructions are provided as a zip file. (ZIP)