Bayesian Hierarchical Clustering for Studying Cancer Gene Expression Data with Unknown Statistics

Clustering analysis is an important tool in studying gene expression data. The Bayesian hierarchical clustering (BHC) algorithm can automatically infer the number of clusters and uses Bayesian model selection to improve clustering quality. In this paper, we present an extension of the BHC algorithm. Our Gaussian BHC (GBHC) algorithm represents data as a mixture of Gaussian distributions. It uses a normal-gamma distribution as a conjugate prior on the mean and precision of each of the Gaussian components. We tested GBHC on 11 cancer datasets and 3 synthetic datasets. The results on the cancer datasets show that, in sample clustering, GBHC on average produces a clustering partition that is more concordant with the ground truth than those obtained from other commonly used algorithms, and it frequently infers a number of clusters that is close to the ground truth. In gene clustering, GBHC also produces a clustering partition that is more biologically plausible than those of several other state-of-the-art methods. This suggests GBHC as an alternative tool for studying gene expression data. The implementation of GBHC is available at https://sites.google.com/site/gaussianbhc/


S1 Hyperparameter Optimization
Consider a cluster $D_k$ containing $n_k$ observations with sample mean $\bar{x}_k$. Under the normal-gamma prior $NG(\mu_0, \kappa_0, \alpha_0, \beta_0)$ on the mean and precision of a Gaussian component, we have the marginal likelihood

$P(D_k \mid \alpha_0, \beta_0, \kappa_0) = \frac{\Gamma(\alpha_{n_k})}{\Gamma(\alpha_0)} \frac{\beta_0^{\alpha_0}}{\beta_{n_k}^{\alpha_{n_k}}} \left(\frac{\kappa_0}{\kappa_{n_k}}\right)^{1/2} (2\pi)^{-n_k/2},$

where

$\alpha_{n_k} = \alpha_0 + \frac{n_k}{2}, \quad \beta_{n_k} = \beta_0 + \frac{1}{2}\sum_{i=1}^{n_k}(x_i - \bar{x}_k)^2 + \frac{\kappa_0 n_k (\bar{x}_k - \mu_0)^2}{2(\kappa_0 + n_k)},$

and $\kappa_{n_k} = \kappa_0 + n_k$. Assume that the hyperparameters are given independent gamma hyperpriors, in which the probability density function of a gamma distribution is defined by

$\mathrm{Gamma}(x \mid a, b) = \frac{b^{a}}{\Gamma(a)} x^{a-1} e^{-bx}.$

Hence, the posterior of the hyperparameters given the cluster data is

$P(\alpha_0, \beta_0, \kappa_0 \mid D_k) \propto P(D_k \mid \alpha_0, \beta_0, \kappa_0)\, P(\alpha_0)\, P(\beta_0)\, P(\kappa_0). \qquad (11)$

The first and second derivatives of Equation (11) with respect to the hyperparameters $\alpha_0, \beta_0, \kappa_0$ are obtained by direct differentiation. Since $\alpha_0, \beta_0, \kappa_0 > 0$, it is recommended to perform the optimization on the logarithmic scale of the hyperparameters. Let $v = \ln \alpha_0$, $y = \ln \beta_0$, and $w = \ln \kappa_0$. Rewriting Equation (11) in terms of $v, y, w$ gives the objective $P(v, y, w \mid D_k)$ (Equation (25)), whose first and second derivatives with respect to $v, y, w$ follow from those of Equation (11) by the chain rule; we denote $P(v, y, w \mid D_k)$ by $p$ for brevity.
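As a minimal numerical sketch of this procedure, the snippet below evaluates the log marginal likelihood of one-dimensional data under a normal-gamma prior and maximizes it over the log-scale hyperparameters $(v, y, w)$. For simplicity it omits the gamma hyperprior terms and uses SciPy's general-purpose optimizer rather than the MultiStart/fmincon setup described later in the supplement; the function names are our own, not from the GBHC code.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def log_marginal_likelihood(x, alpha0, beta0, kappa0, mu0=0.0):
    """Log marginal likelihood of 1-D data x under a Gaussian likelihood
    with a normal-gamma prior NG(mu0, kappa0, alpha0, beta0)."""
    n = len(x)
    xbar = np.mean(x)
    kappa_n = kappa0 + n
    alpha_n = alpha0 + n / 2.0
    beta_n = (beta0 + 0.5 * np.sum((x - xbar) ** 2)
              + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n))
    return (gammaln(alpha_n) - gammaln(alpha0)
            + alpha0 * np.log(beta0) - alpha_n * np.log(beta_n)
            + 0.5 * (np.log(kappa0) - np.log(kappa_n))
            - 0.5 * n * np.log(2.0 * np.pi))

def optimize_hyperparameters(x):
    """Maximize the log marginal likelihood over (v, y, w) = log of
    (alpha0, beta0, kappa0); working on the log scale keeps the
    hyperparameters positive without explicit constraints."""
    objective = lambda t: -log_marginal_likelihood(x, *np.exp(t))
    result = minimize(objective, x0=np.zeros(3), method="BFGS")
    return np.exp(result.x)  # back to (alpha0, beta0, kappa0)
```

Optimizing over $(v, y, w)$ rather than the raw hyperparameters is exactly the reparameterization motivated above: the exponential map guarantees positivity, so an unconstrained optimizer can be used.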

S2 Synthetic Dataset
Synthetic Dataset 1: Mixture of Gaussian Distributions and Independent Data Variables
1000 observations of a 10-dimensional random vector, x, are generated from a mixture distribution (36), where N(·) denotes a multivariate Gaussian distribution. The mixing proportions and mean vectors are given by (38) and (39), and the covariance matrices Σ_i are chosen to be diagonal matrices with positive diagonal entries, so that the data variables are independent. The data are normalized before use.
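Since Equations (36)–(39) are not reproduced here, the weights and means in the sketch below are illustrative placeholders; it shows only the generation-and-normalization procedure for a diagonal-covariance Gaussian mixture.

```python
import numpy as np

def sample_gaussian_mixture(n, weights, means, variances, seed=0):
    """Draw n observations from a mixture of d-dimensional Gaussians with
    diagonal covariance matrices; `variances` holds the diagonal entries."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    labels = rng.choice(len(weights), size=n, p=weights / weights.sum())
    x = np.empty((n, np.shape(means)[1]))
    for i, c in enumerate(labels):
        # independent variates: sample each dimension separately
        x[i] = rng.normal(means[c], np.sqrt(variances[c]))
    return x, labels

def normalize(x):
    """Standardize each variable to zero mean and unit variance."""
    return (x - x.mean(axis=0)) / x.std(axis=0)
```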

Synthetic Dataset2: Mixture of Gaussian Distributions and Correlated Data Variables
Again, 1000 observations of a 10-dimensional random vector are generated from the mixture distribution (36) with settings (38) and (39), but the covariance matrices Σ_i are chosen to be symmetric positive semi-definite matrices with positive diagonal entries, so that the data variables are correlated. The data are normalized before use.
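One simple way to construct such covariance matrices (the specific Σ_i used in the paper are not reproduced here) is Σ = AAᵀ + εI, which is symmetric and positive semi-definite by construction:

```python
import numpy as np

def random_covariance(d, seed=0):
    """Symmetric positive semi-definite covariance matrix with strictly
    positive diagonal entries, built as Sigma = A A^T + eps * I."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(d, d))
    return a @ a.T + 1e-6 * np.eye(d)

# One correlated 10-dimensional Gaussian component
sigma = random_covariance(10)
rng = np.random.default_rng(1)
x = rng.multivariate_normal(mean=np.zeros(10), cov=sigma, size=1000)
```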
Synthetic Dataset 3: Mixture of Several Distributions
1000 observations of a 10-dimensional random vector, x = (x_1, ..., x_10), are generated from a mixture distribution in which the mixing proportions π_i are given by Equation (38) and the component distributions are as follows:
• P_1 is a multivariate Gaussian distribution whose variates are independent;
• P_2 is a multivariate gamma distribution whose variates are independent;
• P_3 is a multivariate uniform distribution whose variates are independent;
• P_4 is a multivariate Student's t-distribution whose variates are independent;
• P_5 is a multivariate Weibull distribution whose variates are independent;
• P_6 is a multivariate chi-squared distribution whose variates are independent;
• P_7 is a multivariate Gaussian distribution with density $P_7(x) = (2\pi)^{-5} |\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$, where (·)^T denotes the transpose operator, |·| denotes the determinant, μ is a 10-dimensional zero vector, and Σ is a symmetric positive semi-definite matrix of size 10 by 10 whose diagonal entries are positive.
Data generated from the different distributions are then shifted and centered at different locations given by the rows of (39). The data are normalized before use.
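The generation scheme above can be sketched as follows. The per-component parameters and shift locations are illustrative (the paper's Equations (38)–(39) and the component parameters are not reproduced here); the sketch omits the correlated Gaussian component P_7 for brevity.

```python
import numpy as np

def sample_mixed_family(n, d=10, seed=0):
    """Draw n observations from a mixture of distribution families with
    independent variates per component (parameters are illustrative)."""
    rng = np.random.default_rng(seed)
    samplers = [
        lambda s: rng.normal(size=s),              # P1: Gaussian
        lambda s: rng.gamma(2.0, 1.0, size=s),     # P2: gamma
        lambda s: rng.uniform(-1.0, 1.0, size=s),  # P3: uniform
        lambda s: rng.standard_t(5, size=s),       # P4: Student's t
        lambda s: rng.weibull(1.5, size=s),        # P5: Weibull
        lambda s: rng.chisquare(3, size=s),        # P6: chi-squared
    ]
    labels = rng.choice(len(samplers), size=n)
    x = np.empty((n, d))
    for c, sampler in enumerate(samplers):
        idx = labels == c
        # shift each component to a different location before normalization
        x[idx] = sampler((idx.sum(), d)) + 10.0 * c
    return x, labels
```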

S4 Technical Setting
• For APC and APE, we set the damping factor to 0.9 and set the preference for each data point to the median value of the pairwise similarities between data points.
• In all BHC algorithms, the concentration parameter α is set to 0.001.
• For GBHC-TREE, the hyperparameter optimization is performed as follows. An optimization is run from each starting point to find a local maximum, and the highest local maximum is selected. The optimization is performed using the MATLAB functions MultiStart and fmincon, with the stopping criterion that the distance between the current and the previous iterates is less than 1.
• KC and KE are randomly initialized. To find the best run, we therefore run each algorithm 5 times and choose the partition that gives the lowest total sum of point-to-centroid distances.
• To infer the number of clusters in KC and KE by the L-method, we run the algorithms with a predefined number of clusters k = 1, ..., n (synthetic data clustering: n = 50; sample clustering of gene expression data: n = number of samples; gene clustering of gene expression data: n = 100).
• Regarding the experimental platform, the sample clustering experiments are conducted on a MacBook Pro laptop with a 2.66 GHz Intel Core i7 processor, using only one core. The gene clustering experiments are conducted on a machine with a 3.10 GHz Intel Core i5 processor, using 4 cores. GBHC-TREE, GBHC-NODE, MBHC, KC, and KE, whose code runs in parallel, thus benefit from the latter setting.
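The best-of-5 selection used for KC and KE above can be sketched as follows; SciPy's kmeans2 stands in for the paper's k-means-style implementations, which is an assumption on our part.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def best_of_runs(x, k, n_runs=5, seed=0):
    """Run k-means n_runs times with random initialization and keep the
    partition with the lowest total point-to-centroid distance."""
    best = None
    for r in range(n_runs):
        centroids, labels = kmeans2(x, k, minit="++", seed=seed + r)
        total = np.sum(np.linalg.norm(x - centroids[labels], axis=1))
        if best is None or total < best[0]:
            best = (total, labels)
    return best  # (total distance, labels of the best run)
```

To pair this with the L-method, one would collect the best total distance for each k = 1, ..., n and locate the knee of the resulting evaluation curve.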

Effect of Degree of Correlation between Data Variables on the Performance of GBHC
To investigate the effect of the degree of correlation between a pair of data variables on the behavior of GBHC-TREE and GBHC-NODE, we generate 6 datasets. Each dataset contains a single cluster of 100 independent and identically distributed bivariate Gaussian random vectors, and the correlation coefficient between the data variables is 0.4, 0.5, ..., 0.9 across the datasets. Each dataset is then normalized and clustered by GBHC-TREE and GBHC-NODE. Table S3 shows the inferred number of clusters in each dataset. We can see that the number of clusters inferred by both algorithms tends to increase as the degree of correlation increases.
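A dataset of this kind can be generated as in the sketch below: a bivariate Gaussian with unit variances, so the off-diagonal covariance entry equals the desired correlation coefficient, followed by the same per-variable standardization.

```python
import numpy as np

def bivariate_with_correlation(n, rho, seed=0):
    """Generate n i.i.d. bivariate Gaussian vectors whose two variables
    have correlation coefficient rho, then standardize each variable."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])  # unit variances: cov == corr
    x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    return (x - x.mean(axis=0)) / x.std(axis=0)
```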

Effect of the Number of Strongly Correlated Pairs of Variables on the Performance of GBHC
We study the effect of the number of highly correlated pairs of variables on the performance of GBHC by considering three synthetic datasets. Each dataset contains a single cluster of 100 observations of a 4-dimensional random vector drawn from a multivariate Gaussian distribution. The correlation coefficient matrices of the datasets are given by Equations (59)-(61). In (60), there is one pair of strongly correlated variables (the 1st and 2nd variables, whose correlation coefficient is 0.9). In (61), there are two pairs of strongly correlated variables (the 1st and 4th, and the 2nd and 3rd, whose correlation coefficients are both 0.9). We therefore refer to the datasets corresponding to Equations (59), (60), and (61) as "no highly correlated pair", "1 highly correlated pair", and "2 highly correlated pairs", respectively. We normalized each dataset prior to clustering. The numbers of clusters inferred by GBHC-TREE and GBHC-NODE are shown in Table S4. The number of inferred clusters tends to increase as the number of strongly correlated pairs of variables increases.
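A sketch of the "1 highly correlated pair" case follows. Only the 0.9 correlation between the 1st and 2nd variables comes from the text; the zero off-diagonal entries are our assumption, since Equations (59)-(61) are not reproduced here.

```python
import numpy as np

# Illustrative correlation matrix: one strongly correlated pair (0.9).
R_one_pair = np.array([
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

def sample_with_correlation(R, n=100, seed=0):
    """Draw n observations of a zero-mean Gaussian with unit variances,
    so the covariance matrix equals the correlation matrix R."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(len(R)), R, size=n)
```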