Accelerating Bayesian Hierarchical Clustering of Time Series Data with a Randomised Algorithm

We live in an era of abundant data. This has necessitated the development of new and innovative statistical algorithms to get the most from experimental data. For example, faster algorithms make practical the analysis of larger genomic data sets, allowing us to extend the utility of cutting-edge statistical methods. We present a randomised algorithm that accelerates the clustering of time series data using the Bayesian Hierarchical Clustering (BHC) statistical method. BHC is a general method for clustering any discretely sampled time series data. In this paper we focus on a particular application to microarray gene expression data. We define and analyse the randomised algorithm, before presenting results on both synthetic and real biological data sets. We show that the randomised algorithm leads to substantial gains in speed with minimal loss in clustering quality. The randomised time series BHC algorithm is available as part of the R package BHC, which is available for download from Bioconductor (version 2.10 and above) via http://bioconductor.org/packages/2.10/bioc/html/BHC.html. We have also made available a set of R scripts which can be used to reproduce the analyses carried out in this paper. These are available from the following URL. https://sites.google.com/site/randomisedbhc/.


Introduction
Many scientific disciplines are becoming data intensive. These subjects require the development of new and innovative statistical algorithms to fully utilise these data. Time series clustering methods in particular have become popular in many disciplines such as clustering stocks with different price dynamics in finance [1], clustering regions with different growth patterns [2] or signal clustering [3].
Molecular biology is one such subject. New and increasingly affordable measurement technologies such as microarrays have led to an explosion of high-quality data for transcriptomics, proteomics and metabolomics. These data are generally high-dimensional and are often time-courses rather than single time point measurements.
It is well-established that clustering genes on the basis of expression time series profiles can identify genes that are likely to be co-regulated by the same transcription factors [4]. There have been a number of approaches developed to clustering time series, for example using finite or infinite hidden Markov models [5,6]. Another popular approach is the use of splines as basis functions [7][8][9][10]. [11] also use Fourier series as basis functions. A number of additional methods for time series data analysis have been reviewed by [12].
These statistical methods often provide superior results to standard clustering algorithms, at the cost of a much greater computational load. This limits the size of data set to which a given method can be applied in a given fixed time frame. Fast implementations of the best statistical methods are therefore highly valuable.
The Bayesian Hierarchical Clustering (BHC) algorithm has proven a highly successful tool for the clustering of microarray data [13][14][15]. The time series BHC method uses Gaussian processes to model time series in a flexible way, making the method highly adaptive and able to handle a wide range of structure in the data.
The principal downside of the BHC algorithm is its run-time, in particular its scaling with the number of items clustered. This can be addressed via randomised algorithms [16], a class of techniques that can be highly powerful in this regard. Randomised algorithms employ a degree of randomness as part of their logic, aiming to achieve good average case performance with high probability. Because the requirement for guaranteeing a certain (e.g. optimal) result is relaxed, it is often possible to obtain significantly improved performance as a result.
In this paper, we apply the approach of [17] to create a randomised BHC algorithm for clustering microarray time series. This allows much larger time series data sets to be analysed in a given amount of time, substantially extending the utility of the time series BHC method.

Synthetic Data Results
To demonstrate the effectiveness of the randomised BHC algorithm, we test its performance on a realistic synthetic data set. We use synthetic data constructed from several realisations of the S. cerevisiae synthetic data generated in [15]. Using the fact that Gaussian processes are generative models, we draw random realisations from the BHC model obtained on a 169-gene subset of the cell cycle gene expression data of [18], to give a total of 1000 genes, spread across 13 distinct clusters.
Given that for these synthetic data we know the ground truth clustering partition, we use the adjusted Rand index as our performance metric [19]. Figure 1 shows how the adjusted Rand index (averaged over runs) varies with the randomised algorithm parameter, m. For mv500, there is some loss in accuracy performance; however for mw500, the adjusted Rand index is approximately that of the greedy algorithm. Figure 2 shows the corresponding run-time performance. As expected, the algorithm is approximately linear in m and a significant speed-up can be obtained over the greedy algorithm. For these synthetic data, one could therefore pick m~500 and get approximately the same performance as for the greedy algorithm, but with more than a |2:5 speed-up. And if some performance drop-off was acceptable, as much as an order of magnitude improvement is possible. We note that such a run takes only approximately 5 hours to complete on a single node 2.40 GHz Intel Xeon CPU.
We also consider how the run-time varies as a function of the total number of genes analysed, n.  Figure 4 shows the same information, expressed a a speed-up over the greedy algorithm.
We note an interesting effect for the lowest value of m (m~10) in Figure 1. A significant part of the performance degradation for lower m values in Figure 1 comes from the randomised algorithm over-estimating the number of clusters (these being synthetic data, we know the ground truth number of clusters). Investigation of the m~10 point shows that this effect is lessened for the synthetic data for small m. We believe that this is because for small numbers of data items, the inferred noise level is more weakly constrained. This in turn allows for clusters with higher noise levels, meaning the algorithm can explain the data using a smaller number of noisy clusters.

Microarray Results
It is also important to validate the randomised algorithm on real microarray data. To do this, we use a subset of the data of [18], selecting genes that have a KEGG pathway annotation, using the version of the KEGG database to match that used in [20]. This consists of yeast cell cycle microarray time series for 1165 genes, measured at 17 time points.
As a performance metric, we choose the Biological Homogeneity Index (BHI) [21], as implemented in the R package clValid [22]. The BHI metric scores a clustering partition between 0 and 1, with higher scores assigned to more biologically homogeneous partitions with respect to a reference annotation set. This has proven to be an effective metric for measuring the performance of microarray-based gene clustering [14,15]. Figure 5 shows the BHI scores (averaged over 10 runs) as a function of the randomised-algorithm parameter, m. The BHI scores show very little variation for mw100, showing that the randomised algorithm is highly robust, in this case, to choice of m. There is typically a small drop in performance relative to the greedy algorithm.  Figure 6 shows the corresponding run-times. As with the synthetic data, we see the expected O(m) scaling. We note that here the overhead of the randomised algorithm means that for mw600 the greedy algorithm is actually faster. However, the BHI results in Figure 5 show that we could set m~200 and gain almost a factor of 3 in speed while incurring only a minimal loss of performance. We note that such a run takes only approximately 2 hours to complete on a single node 2.40 GHz Intel Xeon CPU.     We note an interesting difference between Figures 2 and 6 in run time, relative to the greedy BHC algorithm. Because the number of genes is similar in both cases, one might expect the performance relative to the greedy algorithm to be similar. However (as is shown in these figures) the efficiency of the randomised BHC algorithm depends on how balanced (or otherwise) the dendrogram is. For example, if many levels of the dendrogram split into subsets of very different sizes (one big, one small), the randomised algorithm may have to go through many iterations in order to define the entire dendrogram. The run time is therefore dependent not only on the number of genes and time points, but also on the underlying clustering structure in the data. Essentially, unbalanced dendrograms make the randomised algorithm less efficient.
We also note that for m values close to the actual number of genes, it is a general feature that randomised BHC will tend to be slower than the greedy algorithm. This is because in this case, the randomised algorithm has to perform a greedy run with almost the entire set of genes to define the top branching of the dendrogram, and then assign all the genes to one of the two branches and run the greedy algorithm again for each of these subsets. Figures 2 and 6 show an increased variance in the run time for mw600. We believe this effect is due to the fact that for higher âJ,mâJ TM values, the run time is more likely to be dominated by a single run of the greedy algorithm for DsubsetDvm items. This will make the run time very sensitive to DsubsetD, which will be affected by the randomisation of the overall algorithm.

Discussion
We have presented a randomised algorithm for the BHC clustering method. The randomised algorithm is statistically wellmotivated and leads to a number of concrete conclusions. The randomised BHC time series algorithm can therefore be used on data sets of well over 1000 genes.
Use of the randomised BHC algorithm requires the user to set a value of m. On the basis of the analyses presented in this paper, we recommend that a value of m in the range 100{200 is reasonable, giving significant speed-up with minimal cost in terms of statistical performance.
The randomised time series BHC algorithm is available as part of the R package BHC, which is available for download from Bioconductor (version 2.10 and above) via http://bioconductor. org/packages/2.10/bioc/html/BHC.html.
We have also made available a set of R scripts which can be used to reproduce the analyses carried out in this paper. These are available from the following URL. https://sites.google.com/site/ randomisedbhc/.

Methods
In this section, we provide a mathematical overview of the time series BHC algorithm. Greater detail can be found in [15]. time series BHC combines the BHC clustering algorithm, coupled with a Gaussian process data model to provide a flexible, generative representation of microarray time series. Here we replace the standard (greedy) BHC algorithm with a randomised algorithm, improving the computational complexity of the method and hence its run time for scientifically-useful numbers of genes.

BHC Algorithm
The BHC algorithm [13][14][15]23] performs agglomerative hierarchical clustering in a Bayesian setting. In agglomerative clustering algorithms, each gene begins in its own cluster and at each stage the two most similar clusters are merged. BHC uses a model-based criterion to do this, also learning the most likely number of clusters given the data (something which many clustering methods are unable to do in a principled way). We note that the BHC algorithm can be interpreted as a fast approximate inference method for a Dirichlet Process Model (DPM) [13].
The prior probability, p k , that a given pair of clusters, C 1 and C 2 , should be merged is defined by the DPM and is determined solely by the concentration hyperparameter for the DPM and the number of genes currently in each partition of the clustering. Bayes' rule is then used to find the posterior probability, r k , that the pair of clusters should be merged, where y~fy 1 , . . . ,y N g is the set of N data points contained in clusters C 1 and C 2 . P(yDH k 1 ) is the marginal likelihood of the data given the hypothesis, H k 1 , that the data y belong to a single cluster and requires the specification of a likelihood function, f , as the probabilistic model generating the observed data, y. P(yDT k ) is the probability that the data could be partitioned in any way which is consistent with the order of assembly of the current partition and is defined recursively, where T i and T j are previously merged clusters containing subsets of the data in y. When r k is greater than 0.5, it is more likely that the data points contained in the clusters C 1 and C 2 were generated from the same underlying function, f , than that the data points should belong to two or more clusters. When r k is less than 0.5 for all remaining pairs of clusters, the number of clusters and partitions best described by the data has been found.
For the purposes of the BHC algorithm, a complete dendrogram is constructed, with at each step the most likely merger being made. This allows us to see the log-probability of mergers in the whole dendrogram, even when this value is very small. To determine the likely number of clusters, given the data, we then cut the dendrogram wherever the probability of merger falls below 0.5 (i.e. non-merger is more likely).
As described in [13], the p k are dependent on a hyperparameter for the mixture model, a. As in previous work on BHC, we set a~0:001 as a fixed value. This has the effect of setting a prior assumption of only weak clustering. One could learn this parameter as part of the BHC algorithm; we choose to not do this as it will substantially increase the run time of the algorithm.
The BHC algorithm provides a lower bound of the DP marginal likelihood, as shown in [13]. For the randomised algorithm, we note that the lower bound on the DP marginal likelihood is effectively determined using a subset of only m data items. These lower bounds are used in the usual way to optimise hyperparameters for each potential merger. One could attempt in principle to compute the lower bound using all n data items. However, this will be computationally intensive and so we do not consider it in this paper.

Gaussian Process Regression
Gaussian processes define priors over the space of functions, making them highly suited for use as non-linear regression models. This is highly valuable for microarray time series [24][25][26][27], where a wide range of functional forms can be expected. In essence, Gaussian Process Regression (GPR) allows us to minimise the assumptions we must make as to the underlying structure in our time series data.
For the time series BHC model, we model an observation at time t i as y(t i )~f (t i )ze. For each cluster, we assume the latent function f is drawn from a Gaussian process with covariance function S, defined by hyperparameters, h S . We also assume iid Gaussian noise, N(0,s 2 e ). Let y~½y 1,1 . . . y G,t be the N~G|T observations in a cluster of G genes, where the fy g,t g are time series of f1, . . . ,Tg time points. Each gene is normalised to have mean 0 and standard deviation 1 across time points. The prior of f is given for fixed values of h S , such that P(f Dh S )~N(0,S). It follows that the likelihood function for f is P(yDf ,s 2 e )~N(f ,s 2 e I), where I is the N|N identity matrix. The marginal likelihood of the data, y, is then: where K~Szs 2 e I is the covariance function for y. Time series BHC implements either the squared exponential or cubic spline covariance functions. In this paper, we restrict our attention to the default choice of squared exponential covariance: where d ij is the Kronecker delta function and t i and t j are two time points for f . s 2 f is the signal variance parameter for the covariance function and l is the length-scale parameter.

Randomised BHC Algorithm
To speed up the time series BHC, we implement the randomised BHC algorithm of [17] (specifically, algorithm 1). The key insight from which we hope to benefit is that the standard, greedy BHC algorithm is dominated by the computation of merges at the lowest level of the tree. Therefore, if we can reduce this load in a sensible way, it may be possible to produce a substantially faster algorithm.
Throughout this paper we will refer to the top of the dendrogram. This is the highest level of the dendrogram, where the whole set of genes is split into two subsets.
For reasonably balanced trees, the top levels should be welldefined even using only a random subset of the genes. From this idea, we can define the following randomised algorithm. N Now recurse for the gene subsets in each branch, until each subset size is ƒm, at which point use the standard BHC algorithm to complete the lower levels of the tree.
In effect, we are using estimates of the higher levels of the tree to subdivide the genes so that it is not necessary to compute many of the potential low-level merge probabilities. Figure 7 shows a flow chart describing the algorithm.

Setting the Hyperparameters
The covariance function of the Gaussian processes used in this paper are characterised by a small number of hyperparameters. These hyperparameters are learned for each potential merger using the BFGS quasi-Newton method [28].
This merge-by-merge optimisation allows each cluster to have different hyperparameter values, allowing for example for clusters with different intrinsic noise levels and time series with different characteristic length scales.

Utilising the Covariance Matrix Block Structure
We assume in this paper that each time series is sampled at the same set of time points. This leads to a block structure in the covariance matrix, which can be utilised to greatly accelerate the computation of the Gaussian process marginal likelihood.
The computational complexity of BHC is dominated by inversion of the covariance matrix. Considering the case of a group of k genes, each sampled at the same T time points, the naive approach to matrix inversion would require us to invert a kT|kT matrix, which is an O(k 3 T 3 ) operation. However, we can instead use block matrix pseudoinversion, which recursively reduces the block size to one, at which point the remaining inversion is an O(T 3 ) operation.
We also note that this is equivalent to a Bayesian analysis using a standard multivariate Gaussian. Indeed, considering the task in this way may be a simpler way of doing so and is certainly a useful way of gaining additional insights into the workings of the model.

Computational Complexity
When proposed merges have constant cost (the case considered by [17]), the standard greedy BHC algorithm has O(n 2 ) computational complexity.
For the time series BHC algorithm however, the merges do not have have constant cost. For a given node, we are merging k gene time series, each of length T. We therefore have to consider a (kT)|(kT) covariance matrix, which we must invert. As noted in [15], this matrix is actually a block matrix consisting of k|k blocks, which means we can invert it in O(kT 3 ) operations.
Because k will be as large as n for the merges closer to the root node of the tree, this gives the greedy time series BHC algorithm a worst-case computational complexity of O(n 3 T 3 ).
The randomised algorithm for case of constant cost merges has O(n log n) complexity [17]). Heller and Ghahramani show that, for reasonably balanced trees, the complexity is dominated by the filtering step. Each of the O( log n) filtering steps is O(n), resulting in the overall O(n log n) complexity. For the time series BHC algorithm, the filtering step is O(nmT 3 ), because of the additional cost of merging time series clusters. As in the original analysis there will be O( log n) filtering steps, giving an overall computational complexity for the randomised version of time series BHC of O(mT 3 n log n).