Semi-Supervised Projective Non-Negative Matrix Factorization for Cancer Classification

Advances in DNA microarray technologies have made gene expression profiles a promising means of identifying different types of cancers. Traditional learning-based cancer identification methods utilize labeled samples to train a classifier, but they are inconvenient for practical application because labels are quite expensive to obtain in the clinical cancer research community. This paper proposes a semi-supervised projective non-negative matrix factorization method (Semi-PNMF) to learn an effective classifier from both labeled and unlabeled samples, thus boosting subsequent cancer classification performance. In particular, Semi-PNMF jointly learns a non-negative subspace from the concatenated labeled and unlabeled samples and indicates classes by the positions of the maximum entries of their coefficients. Because Semi-PNMF incorporates statistical information from the large volume of unlabeled samples into the learned subspace, it can learn more representative subspaces and boost classification performance. We developed a multiplicative update rule (MUR) to optimize Semi-PNMF and proved its convergence. The experimental results of cancer classification on two multiclass cancer gene expression profile datasets show that Semi-PNMF outperforms the representative methods.


Introduction
In cancer prognosis and treatment, it is crucial to identify different cancer types and subtypes. Traditional methods often rely on similar morphological appearances, but tumors with similar morphology can follow different clinical courses and respond differently to therapy; together with subjective interpretations and personal experience, this usually results in diagnostic confusion. Fortunately, the emergence of the DNA microarray technique removes this barrier in an objective and systematic manner and has shown great potential in outcome prediction of cancer types at genome-wide scales [1][2][3][4][5][6][7][8][9][10][11].
Numerous learning methods have been developed for cancer classification based on gene expression profiles [1][2][3]. For instance, Golub et al. [1] used a weighted voting scheme for the molecular classification of acute leukemia. Nguyen et al. [3] incorporated partial least squares into dimension reduction for tumor classification.


Semi-Supervised PNMF

PNMF builds on the assumption that the basis lies in the subspace spanned by the original samples. Given the data matrix V = [v_1, ⋯, v_n]^T ∈ ℝ^{n×m}, where n denotes the number of samples and m their dimensionality, PNMF learns the coefficients H ∈ ℝ^{n×r} to represent the original samples, i.e.,

min_{H≥0} ‖V − HH^T V‖²_F, (1)

where ‖·‖_F denotes the matrix Frobenius norm and r the number of clusters. It is non-trivial to analyze the convergence of objective (1) in theory because Eq (1) contains a fourth-order term. To remove this high-order term, we first introduce an auxiliary variable, i.e., the cluster centroids W, together with an equality constraint, into Eq (1). Thus, we obtain

min_{H≥0} ‖V − HW‖²_F, s.t. W = H^T V, W ≥ 0. (2)

This objective is very similar to BPNMF [26], but we cannot directly apply the optimization algorithm of BPNMF to it, especially when additional constraints such as the sparseness constraint and Laplacian regularization are imposed on the coefficients, as these constraints easily induce PNMF to produce the trivial solution. To avoid this drawback, we propose a semi-supervised PNMF method (Semi-PNMF) by recasting Eq (2) as

min_{H,W≥0} ‖V − HW‖²_F + α‖W − H^T V‖²_F, (3)

where α ≥ 0 is a regularization constant and W denotes the non-negative cluster centroids. Model (3) significantly differs from BPNMF because Eq (3) favors the representative capacity of the cluster centroids, while BPNMF focuses on the orthogonality of the non-negative subspace. Thus, Eq (3) induces sparse coefficients, while BPNMF produces a sparse basis. Building on Eq (3), we can incorporate the local coordinate constraint [38] to improve the representative power of the basis, meanwhile further inducing the sparse coefficients to act as class indicators.
Thus, we recast Eq (3) in the following regularized form:

min_{H,W≥0} ‖V − HW‖²_F + α‖W − H^T V‖²_F + β Σ_i Σ_j H_ij ‖W_j − V_i‖², (4)

where β trades off the local coordinate regularization, H_ij denotes the element in the i-th row and j-th column of the coefficients H, and W_j and V_i signify the j-th row vector of W and the i-th row vector of V, respectively.
To make full use of the partially labeled samples, we propagate the labels of labeled samples to unlabeled ones by minimizing the distance between their coefficients and the corresponding class indicators. In particular, we require the coefficients of labeled samples to be equal to the corresponding class indicators. Suppose the first d samples are labeled and the rest unlabeled; the data matrix V can then be divided into two parts, i.e., V = [V_L^T, V_U^T]^T. We obtain the objective function of Semi-PNMF as follows:

min_{W,H_U≥0} ‖V_L − QW‖²_F + ‖V_U − H_U W‖²_F + α‖W − H_U^T V_U‖²_F + β Σ_{i=1}^{n_U} Σ_j (H_U)_ij ‖W_j − (V_U)_i‖², (5)

where Q denotes the partial label matrix wherein Q_ij = 1 if v_i belongs to the j-th class and Q_ij = 0 otherwise, and H_U and n_U denote the coefficients and the number of the unlabeled samples, respectively.
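As a concrete illustration, the composite objective can be evaluated term by term. The following sketch assumes the reconstructed form of objective (5) given above (labeled-data fit with coefficients fixed to Q, unlabeled-data fit, centroid-approximation penalty, and the local coordinate regularizer); the function name and term weighting are ours, not the paper's.

```python
import numpy as np

def semi_pnmf_objective(V_L, V_U, Q, W, H_U, alpha, beta):
    """Evaluate the Semi-PNMF objective: labeled fit (coefficients fixed
    to the label matrix Q), unlabeled fit, centroid penalty, and the
    local coordinate regularizer (a sketch of objective (5))."""
    fit_labeled = np.linalg.norm(V_L - Q @ W, 'fro') ** 2
    fit_unlabeled = np.linalg.norm(V_U - H_U @ W, 'fro') ** 2
    centroid = alpha * np.linalg.norm(W - H_U.T @ V_U, 'fro') ** 2
    # local coordinate term: sum_ij (H_U)_ij * ||W_j - (V_U)_i||^2
    d2 = ((V_U[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)  # (n_U, r)
    local = beta * (H_U * d2).sum()
    return fit_labeled + fit_unlabeled + centroid + local
```

Note that every term is non-negative, so the objective is bounded below by zero, which is what makes the monotone convergence argument in the next section meaningful.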
Interestingly, Semi-PNMF has two distinct aspects. First, it replaces the learned coefficients of the labeled samples with the corresponding class indicators. This constraint is so strong that the learned basis is completely biased toward the labeled samples, which might induce trivial solutions for the coefficients of the unlabeled samples. Second, the centroid-approximation term of Semi-PNMF completely ignores the representation contribution of the labeled samples, so in that respect the learned basis favors only the unlabeled samples. It may appear that the two aspects contradict each other, but intrinsically they are complementary in our Semi-PNMF. In essence, the first aspect corresponds to supervised learning, which generates a reasonable solution yet does not ensure it is consistent with the underlying data distribution, while the second considers the data distribution but cannot by itself yield a reasonable solution. Thus, the combination of both aspects lets them mutually complement each other. Semi-PNMF learns a shared basis from the labeled and unlabeled instances, meanwhile inducing similar instances to have similar representations, i.e., coefficients. Because we require the coefficients of the labeled samples to equal their labels and impose the local coordinate constraint over the basis and coefficients, the coefficients of the unlabeled samples are implicitly as sparse as the label vectors. In this way, Semi-PNMF effectively propagates the labels of labeled samples to the unlabeled ones. Consequently, in cancer classification, it is reasonable, for each unlabeled sample, to choose the index of the largest entry of its coefficient as the predicted class once objective (5) yields the coefficients. This intuition is further verified by the toy example given in Figs 1 and 2.
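The prediction rule described above reduces to a row-wise argmax over the learned coefficients. A minimal sketch (the matrix below is an illustrative stand-in for a learned H_U, whose rows are nearly sparse label-like vectors):

```python
import numpy as np

def predict_classes(H_U):
    """Assign each unlabeled sample to the class whose coefficient is
    largest, i.e., the column index of the maximum entry in its row."""
    return np.argmax(H_U, axis=1)

# A near-sparse coefficient matrix: each row peaks at its true class.
H_U = np.array([[0.9, 0.05, 0.05],
                [0.1, 0.80, 0.10],
                [0.2, 0.10, 0.70]])
print(predict_classes(H_U))  # -> [0 1 2]
```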

Optimization Algorithm
It is difficult to optimize Eq (5) because it is jointly non-convex with respect to W and H_U. Fortunately, it is convex with respect to each of W and H_U individually. Thus, we can establish the following theorem.

Theorem 1: The objective function (5) is non-increasing under the multiplicative update rules (6) for W and (7) for H_U, where ⊙ denotes the element-wise product operator.

Proof. According to Eq (5), we can obtain the objective with respect to W, denoted J(W), as in Eq (8), where L_U^i denotes the diagonal matrix whose diagonal elements are the values of the i-th row vector of V_U. By Eq (8), we can define an auxiliary function G(W, W′) of J(W) as in Eq (9), which satisfies

G(W, W′) ≥ J(W) = G(W, W). (10)

Taking the derivative of Eq (9), as in Eq (11), and setting it to zero yields Eq (12); by simple algebra, the update rule (6) can be deduced from Eq (12). Likewise, we can obtain the auxiliary function of J(H_U) as in Eq (13); setting its derivative to zero gives Eq (14), from which we obtain the update rule (7) for H_U. Moreover, according to Eqs (10), (12) and (14), we have Eq (15), which guarantees that these update rules monotonically decrease the objective function. This completes the proof. ∎

According to the above theorem, we summarize the multiplicative update rule (MUR) for Semi-PNMF in Algorithm 1.
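The specific rules (6) and (7) are not reproduced here, but they follow the canonical multiplicative-update template of Lee and Seung: each factor is rescaled elementwise by a ratio of the negative and positive parts of the gradient, so non-negativity is preserved and the objective never increases. For the plain objective ‖V − HW‖²_F this template reads (a sketch of the classical rules, not of (6)–(7) themselves):

```python
import numpy as np

def nmf_mur_step(V, W, H, eps=1e-12):
    """One multiplicative update step for plain NMF:
    min_{H,W >= 0} ||V - HW||_F^2. Each factor is scaled elementwise
    by (negative gradient part)/(positive gradient part); eps guards
    against division by zero."""
    H = H * (V @ W.T) / (H @ W @ W.T + eps)
    W = W * (H.T @ V) / (H.T @ H @ W + eps)
    return W, H
```

Running this step repeatedly from random non-negative starting points drives the reconstruction error monotonically downward, which is exactly the property Theorem 1 establishes for the Semi-PNMF rules.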

Algorithm 1 MUR for Semi-PNMF
Input: Examples V ∈ ℝ^{n×m}, penalty parameters α and β, partial label matrix Q. Output: H_U.
To reduce the time overhead, Algorithm 1 uses the relative error of the objective as the stopping criterion; we set ε to 10⁻⁷ in our experiments. The main time cost of Algorithm 1 lies in lines 3 and 4, whose time complexities are O(r²n + mrn + r²m + rm) and O(mr(n − d) + r²m + rm + r² + r²(n − d)), respectively. Thus, the total time complexity of Algorithm 1 is O(r²n + mrn + mr(n − d) + mrd + r²m + rm + r² + r²(n − d)).
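The stopping rule can be sketched as a driver loop around any multiplicative update step; the function below is our illustration of the criterion described above (`step_fn` stands in for the paper's updates (6)–(7) and is an assumption, not the paper's code).

```python
import numpy as np

def run_mur(V, W, H, step_fn, eps=1e-7, max_iter=1500):
    """Iterate a multiplicative update until the relative change of the
    objective drops below eps, mirroring the stopping criterion of
    Algorithm 1. step_fn(V, W, H) must return updated (W, H)."""
    obj = np.linalg.norm(V - H @ W, 'fro') ** 2
    for _ in range(max_iter):
        W, H = step_fn(V, W, H)
        new_obj = np.linalg.norm(V - H @ W, 'fro') ** 2
        if abs(obj - new_obj) / max(obj, 1e-12) < eps:
            break  # objective relative error small enough: converged
        obj = new_obj
    return W, H
```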

Results
This section conducts a series of experiments on both synthetic and real-world datasets to verify the method proposed in this paper.

Synthetic Dataset
This section generates a small synthetic dataset to clarify the mechanism of Semi-PNMF. The synthetic dataset consists of three categories constructed from random samples x ∈ ℝ³, where each entry of x is sampled from the standard uniform distribution U(0,1).
For each category, we randomly generated 10 samples, of which three were selected as labeled samples and the rest as unlabeled ones; the synthetic dataset thus contains 30 samples in total. For clear illustration, the three categories are marked with three different colors, and the labeled and unlabeled samples are distinguished by two shapes. In Fig 1(d), each row of the learned basis has a different color, implying that the basis stands for the centroids of the different categories and owns discriminative representation ability. In Fig 1(c), each row of the learned coefficients is the lower-dimensional coefficient of the corresponding unlabeled sample; the larger an entry of the coefficient is, the darker its color. As shown in Fig 1(c), the maximum entry of each coefficient largely exceeds the other entries. Together, the maximum entries give the coefficients a diagonal form and imply the cluster memberships of all the samples. Thus, it is reasonable to select the index of the maximum entry of the coefficient as the class of an unlabeled sample, which verifies our previous intuition. Since all samples share a common basis, their coefficients become close to each other if they have the same labels. We impose the restriction that the coefficients of labeled samples equal their label vectors, and this also induces the coefficients of the unlabeled samples toward the same sparse, label-like form.
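The exact category generators used in the paper are not recoverable from the text, so the sketch below constructs a comparable toy dataset under an assumption of ours: each category draws U(0,1) entries in ℝ³ and is shifted along its own coordinate axis so that the three clusters are separable.

```python
import numpy as np

def make_synthetic(n_per_class=10, seed=0):
    """Generate three categories of points in R^3 with U(0,1) entries.
    Assumption (not from the paper): class c is shifted by +1 along
    coordinate c, so each category is dominated by one axis."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for c in range(3):
        x = rng.uniform(0.0, 1.0, size=(n_per_class, 3))
        x[:, c] += 1.0  # separate class c along its own axis
        X.append(x)
        y += [c] * n_per_class
    return np.vstack(X), np.array(y)
```

With three labeled samples per class and the rest unlabeled, this reproduces the 30-sample setup described above.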

GCM Dataset
This experiment compares traditional semi-supervised learning methods, including low density separation (LDS, [14]), transductive SVM (TSVM, [16]), constrained NMF (CNMF, [24]) and soft-constrained NMF (SCNMF, [25]), with Semi-PNMF by separating different types of cancers on the GCM dataset. The GCM dataset [8] contains the expression profiles of 218 tumor samples representing 14 common human cancer classes. It is available on the public website http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi and can also be downloaded from https://zenodo.org/record/21712. Following [8], we merge the training and testing sets of this gene expression data into a single dataset for cancer classification; the combined dataset contains 198 samples with 16,063 genes. Table 1 gives a brief description of this dataset. To remove very low noisy values and the saturation effects of very high values, we bound the gene expression data to a box constraint ranging from 20 to 16,000 units and then exclude genes whose ratios and absolute variations across samples are under 5 and 500, respectively. The resulting expression profile dataset contains the 11,370 genes that pass these filters. We compare the effectiveness of Semi-PNMF with LDS, TSVM, CNMF and SCNMF under varying configurations. Neither CNMF nor SCNMF involves parameter tuning. For Semi-PNMF, we set the two parameters to α = 2 and β = 0.0001. Because these representative methods converge within 1,500 iteration rounds, we set the maximum number of loops to 1,500. For LDS and TSVM, we adopt the parameter settings provided in the source code to obtain the classification results. We evaluate cancer classification by cross-validation over the whole dataset: each round selects one sample as the unlabeled sample and, meanwhile, learns the prediction model on all the samples for cancer diagnosis.
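The gene-filtering step described above can be sketched as follows; the function name and argument defaults are our reading of the text (clip expression values to [20, 16000], then keep genes whose max/min ratio is at least 5 and whose max − min variation is at least 500).

```python
import numpy as np

def filter_genes(X, low=20.0, high=16000.0, min_ratio=5.0, min_var=500.0):
    """Preprocess a samples-by-genes expression matrix: clip values to
    [low, high], then keep only genes whose across-sample max/min ratio
    and max-minus-min variation reach the given thresholds."""
    X = np.clip(X, low, high)
    gmax = X.max(axis=0)
    gmin = X.min(axis=0)  # >= low after clipping, so division is safe
    keep = (gmax / gmin >= min_ratio) & (gmax - gmin >= min_var)
    return X[:, keep], keep
```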
For the unlabeled sample, we choose the index of the largest value in the resultant consensus matrix to predict the class of this sample. Figs 3 to 7 report in detail the confusion matrices of the predicted results of Semi-PNMF, CNMF, SCNMF, LDS and TSVM. Each column denotes how many unlabeled samples are assigned to each cancer, while each row signifies the number of unlabeled samples affiliated with the real tumor type. Each color not only represents a specific cancer type but also highlights the correct prediction results, i.e., the diagonal elements of the confusion matrix. Figs 3 to 7 imply that Semi-PNMF can identify different tumor types more accurately than the representative methods. For example, when working with two labeled samples from each tumor type, Semi-PNMF achieves 70.71% classification accuracy and exceeds LDS, TSVM, SCNMF, and CNMF by 10.6%, 21.72%, 21.72%, and 32.3%, respectively. Moreover, Table 2 further confirms the effectiveness of Semi-PNMF compared with CNMF, SCNMF, TSVM, and LDS in terms of both sensitivity and specificity. For completeness, their definitions are

sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP),

where TP, FN, TN and FP denote the numbers of true positives, false negatives, true negatives and false positives, respectively.
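Both measures can be computed per class directly from a confusion matrix laid out as described above (rows = true classes, columns = predicted classes); a minimal sketch:

```python
import numpy as np

def per_class_sens_spec(C):
    """Per-class sensitivity TP/(TP+FN) and specificity TN/(TN+FP)
    from a confusion matrix C whose rows are true classes and whose
    columns are predicted classes."""
    C = np.asarray(C, dtype=float)
    tp = np.diag(C)
    fn = C.sum(axis=1) - tp   # true class, predicted elsewhere
    fp = C.sum(axis=0) - tp   # other classes predicted as this one
    tn = C.sum() - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)
```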

Acute Leukemia Dataset
We also conduct a cancer classification experiment to verify the classification performance of Semi-PNMF compared with low density separation (LDS, [14]), transductive SVM (TSVM, [16]), constrained NMF (CNMF, [24]), and soft-constrained NMF (SCNMF, [25]) on another popular dataset, i.e., the Acute Leukemia dataset [36]. This dataset comes from the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE13159) and can also be downloaded from https://zenodo.org/record/21712. We replace the unavailable entries of this dataset with the average values of their k nearest neighbor elements. The dataset consists of 2,096 samples with 54,675 probes in total. Because it contains different subtypes of acute leukemia, it poses a subtype classification task rather than the across-cancer classification of the GCM dataset. Table 3 gives a brief description of this dataset. We then feed the dataset to all the compared methods. For Semi-PNMF, we set the two parameters to α = 0.2 and β = 0.01. For the traditional semi-supervised learning methods, we adopt the same configurations as in the above subsection. The cross-validation process of the above subsection is repeated to evaluate the compared methods on this dataset. Figs 9 to 13 report in detail the confusion matrices of the predicted results of Semi-PNMF, CNMF, SCNMF, LDS and TSVM. Each column denotes how many unlabeled samples are assigned to each cancer subtype, while each row signifies the number of unlabeled samples affiliated with the real tumor subtype. Each color not only represents a specific cancer subtype but also highlights the correct prediction results, i.e., the diagonal elements of the confusion matrix.
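The k-nearest-neighbor imputation of unavailable entries can be sketched as below; the exact distance measure and k used in the paper are not stated, so this function (and its defaults) is our illustrative reading: for each sample with missing values, find the k closest samples over the features it does observe and fill each gap with their mean.

```python
import numpy as np

def knn_impute(X, k=5):
    """Fill NaN entries of a samples-by-features matrix with the mean
    of the same feature in the k nearest samples, where distance is
    computed over the features observed in the incomplete row."""
    X = np.asarray(X, dtype=float).copy()
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        obs = ~np.isnan(X[i])
        # squared distance to every sample over row i's observed features
        d = np.nansum((X[:, obs] - X[i, obs]) ** 2, axis=1)
        d[i] = np.inf  # never pick the row itself
        neighbors = np.argsort(d)[:k]
        for j in np.where(~obs)[0]:
            vals = X[neighbors, j]
            vals = vals[~np.isnan(vals)]
            # fall back to the column mean if all neighbors miss it too
            X[i, j] = vals.mean() if vals.size else np.nanmean(X[:, j])
    return X
```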
Figs 9 to 13 imply that Semi-PNMF can identify different tumor subtypes more accurately than the representative methods; Semi-PNMF achieves the highest total classification accuracy among the compared methods.

Discussion
This paper proposes the semi-supervised PNMF method (Semi-PNMF), which incorporates two types of constraints as well as an auxiliary basis to boost PNMF. In particular, Semi-PNMF uses linear combinations of examples to approximate the cluster centroids so that the centroids have more powerful representative ability. To effectively indicate the classes of unlabeled samples, Semi-PNMF enforces the coefficients of labeled samples to approach their labels, meanwhile representing the unlabeled samples with the same cluster centroids.
To optimize Semi-PNMF, we devised a multiplicative update rule (MUR) and established its convergence guarantee. Experiments on cancer classification with two real-world datasets show that Semi-PNMF quantitatively outperforms the representative methods.
Recently, Bayesian methods that incorporate both sparsity and a large number of covariates in the model have been extensively used for parameter estimation and classification in data sets whose dimensionality is large compared to the sample size, such as gene expression data [39][40][41]. They also improve model accuracy by introducing a slight bias into the model [40]. In future work, we may borrow the merits of Bayesian methods to further improve the classification performance of Semi-PNMF on large-scale datasets. Semi-PNMF provides a flexible framework for learning methods in cancer data processing and can also be utilized in other applications such as cancer recurrence prediction [42,43].