Impact of the Choice of Normalization Method on Molecular Cancer Class Discovery Using Nonnegative Matrix Factorization

Nonnegative Matrix Factorization (NMF) has proved to be an effective method for unsupervised clustering analysis of gene expression data. By the nonnegativity constraint, NMF provides a decomposition of the data matrix into two matrices that have been used for clustering analysis. However, the decomposition is not unique. This allows different clustering results to be obtained, resulting in different interpretations of the decomposition. To alleviate this problem, some existing methods directly enforce uniqueness to some extent by adding regularization terms in the NMF objective function. Alternatively, various normalization methods have been applied to the factor matrices; however, the effects of the choice of normalization have not been carefully investigated. Here we investigate the performance of NMF for the task of cancer class discovery, under a wide range of normalization choices. After extensive evaluations, we observe that the maximum norm showed the best performance, although the maximum norm has not previously been used for NMF. Matlab codes are freely available from: http://maths.nuigalway.ie/~haixuanyang/pNMF/pNMF.htm.


A simulation study
Here we conduct a simulation study, by which we try to find evidence to support answers to the following questions: Question 1: What will happen if there is only a small fraction of genes (signature genes) in a dataset differ between different types of tumors while more genes differ between levels of another categorical feature?
Question 2: What is the effect of irrelevant genes and the filter of removing irrelevant genes?
Question 3: What is the effect of a normalization method?
We design an artificial dataset containing 60 samples consisting of two classes I and II, representing two cancer types. 20 samples belong to class I, and the remaining 40 belong to class II. Here we consider three cases (A, A+, and B). To answer Question 2, we consider a case (data A) that contains 100 signature genes of the cancer types and 1000 irrelevant genes; also we consider a case (data B) that contains contains 100 signature genes of the cancer types and 50 irrelevant genes. To answer Question 1, we consider a case (data A+), in which some more signature genes discriminating another category are added to A. Then we compare the effects of different normalization methods to answer Question 3. Note that here we use NMF-Brunet as the Basic NMF that does not normalize the NMF solutions and it does not use the filter in Algorithm 1 in the main text. Because a Poisson noise is assumed underlying the NMF-Brunet, we shall generate data using a Poisson noise. The data will be repeatedly generated 100 times for evaluations.

Data A
In the first case, we try to make the data that contains 100 signature genes that can discriminate the two classes. Moreover, it contains another 1000 irrelevant genes that do not discriminate the two classes, and the data is generated by a Poisson noise. For this, we expect our post-processing method is better than the Basic NMF. We generate A 100 times by the following steps:

Data A+
In the second case, there are two other classes III and IV (e.g., Male and Female), and there are more signature genes discriminating classes III and IV than genes discriminating classes I and II. It is designed that half of the samples in class I and half of the samples in class II belong to class III while the remaining belongs to class IV. Based on A0, we add 200 signature genes that discriminate classes III and IV to generate A0+, and A+ is generated by adding a Poisson noise (A+=Poisson(A0+)). In such cases, the clusters discovered by NMF may not be consistent with the cancer types (I and II), and the evaluation based on the cancer types will be problematic. Among 100 realizations of these matrices, one realization of the matrices A0+, and A+=Poisson(A0+) are shown in Fig. 1.

Data B
In the third case, similar to data A, we try to make the data that contains 100 signature genes that can discriminate the two classes. Different from data A, it contains only 50 additional irrelevant genes that do not discriminate the two classes, and the data is generated by a Poisson noise. For this, we expect our post-processing method is better than the Basic NMF. Among 100 realizations of these matrices, one realization of the matrices W, H, B0=W*H, and B=Poisson(B0) is shown in Fig. 2.

Results
We show the results in Table A for both settings of using the filter or not. We used the default T=0.5. We performed 100 independent trials and computed the mean accuracy and the standard error of the mean. The clustering accuracy results are evaluated by classes I and II. In each of the 100 runs, A+ is re-generated by a Poisson noise based on a re-generated A0+, and H and W are randomly initialized. A corresponds to the 1100 rows of A+ without adding 200 signature genes for classes III and IV.

Meaningless results on A+ -an answer to Question 1
On the data A+, all methods do not work well for the purpose of grouping samples by cancer types. As we expected, the post-processing method cannot improve over the Basic NMF. The reason for this, is that the data tends to prefer clusters by classes III and IV (genders) to clusters by classes I and II (cancer types). While it is not difficult to select marker genes for cancer types given labels from the data A+ in a supervised learning algorithm, it is problematic to find meaningful clusters for cancer types in an unsupervised setting from a data having a lot of nuisance genes. Consequently the evaluation of performances for clustering  algorithms on datasets like A+ is meaningless. We speculate that there are some similar situations as A+ in some datasets whose original research is mainly focused on classification problems.

Effects of normalization methods on A and B-an answer to Question 3
On the data A, the Basic NMF works reasonably well. Without using the filter, using the maximum norm appears to be the second best while the normalization method SD is the best (the mean clustering accuracy is significantly different from that produced by the maximum norm, p-value=0.0083). With the filter, the normalization method SD appears to the best, but the mean clustering accuracy is not significantly different from that produced by the maximum norm (p-value=0.3656). Thus we conclude that using the maximum norm is a good choice for both setting of using the filter or not. It is surprising to see that the normalization method SD is excellent on the toy data A, and we could not explain why this happens. Nevertheless, it is not so great on the real data.
On the data B, without the filter, the normalization method 0.75q appears to the best, but the mean clustering accuracy is not significantly different from that produced by the maximum norm (p-value= 0.1172). With the filter, the situation is changed: see the discussion below.

Effects of irrelevant genes and the filter -an answer to Question 2
On the data A, we can see that, the filter improves all the normalization methods. This result may come from the effects from irrelevant genes (1000 genes that do not discriminate the two classes in this example). With the filter being intended to remove effects from irrelevant genes, such effects are expected to be removed to some extent and the clustering accuracy is improved. However, on the data B, the filter does not work any more. This is probably because less than half of the genes are irrelevant in B while the filter with the default setting removes half of the genes. This means that the current default setting of the threshold is not optimal, and this justifies the need of a further investigation on the filter in the future work. This is why we state in the end of the main text that "the filter in Algorithm 1 needs to be investigated further: there may exist a better threshold adapted to the data than the current default setting, in order to improve the performance of the normalization method using the maximum norm."

Evaluation of unsupervised methods using datasets designed for supervised algorithms
We download 11 gene expression datasets from http://www.gems-system.org/. They were collected for evaluating classification methods and their description can be found (Statnikov et al., 2005). The preprocessing method of these datasets was summarized in Table 1 in (Liu et al., 2008). The cluster accuracy for both the Basic NMF and the Post-Processing method is reported in Fig. 3).   Table B: Datasets and their original research problems. Those datasets that were used for clustering analysis in the original papers are marked a *, indicating that they are more appropriate for the purpose of evaluation of clustering algorithms. Note that in (Shipp et al., 2002), they did not do clustering analysis for DLBCL outcome although they did some initial clustering analysis. There is a similar situation in Su et al. (2001). Note datasets Leukemia1 and Brain Tumor1 are supersets of the datasets used in the main text.
The proposed method does not show consistent improvement over the Basic NMF. In the main text, we have speculated that some expression datasets are good for the task of classification but not for clustering -if only a small fraction of genes (signature genes) in a dataset differ between different types of tumors while the majority of genes differ between levels of another categorical feature (e.g., gender). In such cases, the clusters discovered by NMF may not be consistent with the cancer types, and the evaluation based on the cancer types will be problematic. This problem is illustrated further by a simulation study. Nevertheless, it is still mysterious to us that why the proposed method looks much worse than the Basic NMF for some classification datasets even if the evaluation of performances for clustering algorithms is meaningless. We have not been able to design a simulation study to explain this yet.
Although all these datasets were used to evaluate classification methods in (Statnikov et al., 2005), some of them were also used for clustering purpose in their original papers (see Table B), namely Brain Tumor1, Leukemia1, Leukemia2, Lung Cancer, SRBCT, and Prostate Tumor. These datasets are thus more appropriate for the purpose of evaluation of clustering algorithms. Among them, only on Prostate Tumor dataset, the proposed method performs slightly worse (but not significantly as the p-value is 0.271) than the Basic NMF. Therefore, from this set of evaluation, there is no evidence against the proposed method.

Effects of the filter on all datasets in the main text
To see the effects of the filter on the best normalization method -using the maximum norm, we compare the effects on all datasets in Table C Table C: Performance comparison in term of clustering accuracy in percentage for the effects of the filter on the best normalization method -using the maximum norm. Reported is the mean of clustering accuracies from 100 runs of Basic NMF together with the standard error of the mean. Also reported is the p-value produced by a paired two-sided t-test. Note that the proposed method is using 'max' normalization and using the filter.

Basic NMF minimizing Euclidian distance
The underlying assumption for NMF-EU is that A i,j is an observation from a normal distribution with mean (W H) i,j and a constant variance σ 2 , and W and H are model parameters.
Here, if the Basic NMF is the one minimizing Euclidian distance, the results of the postprocessing method are impressive in terms of clustering stability (see Fig. 4), as we expected. But it is not so in terms of clustering accuracy. For this, an explanation is that the KL divergence functional is better suited as a loss function for the matrix factorization for the underlying problems, i.e., the assumption of a Poisson noise is more reasonable than a Gaussian noise for the nonnegative expression data.
The improper use of Gaussian noise can be verified qualitatively: A i,j may be negative if it is from a normal distribution, but it never is. Or quantitatively this can be verified by showing the residuals A − W H produced by NMF-EU are not normally distributed while the model assumes that they should be. The Q-Q plots of the residuals A − W H in Fig. 5 strongly contradict the normality of the residuals as they are not a straight line for all these datasets with four settings.