
The authors have declared that no competing interests exist.

Conceived and designed the experiments: RD RSS EJC ZG PDWK DLW. Performed the experiments: RD. Analyzed the data: RD RSS EJC PDWK DLW. Contributed reagents/materials/analysis tools: RD RSS. Wrote the paper: RD RSS EJC ZG PDWK DLW.

Current address: London Centre for Nanotechnology, University College London, London, United Kingdom

Current address: Centre for Bioinformatics, Imperial College London, London, United Kingdom

We live in an era of abundant data. This has necessitated the development of new and innovative statistical algorithms to get the most from experimental data. For example, faster algorithms make practical the analysis of larger genomic data sets, allowing us to extend the utility of cutting-edge statistical methods. We present a randomised algorithm that accelerates the clustering of time series data using the Bayesian Hierarchical Clustering (BHC) statistical method. BHC is a general method for clustering any discretely sampled time series data. In this paper we focus on a particular application to microarray gene expression data. We define and analyse the randomised algorithm, before presenting results on both synthetic and real biological data sets. We show that the randomised algorithm leads to substantial gains in speed with minimal loss in clustering quality. The randomised time series BHC algorithm is available as part of the R package

Many scientific disciplines are becoming data-intensive. These subjects require the development of new and innovative statistical algorithms to fully utilise these data. Time series clustering methods in particular have become popular in many disciplines, for example for clustering stocks with different price dynamics in finance

Molecular biology is one such subject. New and increasingly affordable measurement technologies such as microarrays have led to an explosion of high-quality data for transcriptomics, proteomics and metabolomics. These data are generally high-dimensional and are often time-courses rather than single time point measurements.

It is well-established that clustering genes on the basis of expression time series profiles can identify genes that are likely to be co-regulated by the same transcription factors

These statistical methods often provide superior results to standard clustering algorithms, at the cost of a much greater computational load. This limits the size of data set to which a given method can be applied in a given fixed time frame. Fast implementations of the best statistical methods are therefore highly valuable.

The Bayesian Hierarchical Clustering (BHC) algorithm has proven a highly successful tool for the clustering of microarray data

The principal downside of the BHC algorithm is its run-time, in particular its scaling with the number of items clustered. This can be addressed via

In this paper, we apply the approach of

To demonstrate the effectiveness of the randomised BHC algorithm, we test its performance on a realistic synthetic data set. We use synthetic data constructed from several realisations of the

Given that for these synthetic data we know the ground truth clustering partition, we use the adjusted Rand index as our performance metric
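Since the adjusted Rand index (ARI) is the performance metric on the synthetic data, a from-scratch sketch may help make it concrete. This is a standard-library illustration, not the implementation used in the paper, and the two toy labelings are ours.

```python
# Adjusted Rand index between two partitions, computed from the
# contingency table of co-occurring labels (pure standard library).
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    n = len(truth)
    pair = Counter(zip(truth, pred))              # contingency counts
    a = Counter(truth)                            # row sums
    b = Counter(pred)                             # column sums
    index = sum(comb(c, 2) for c in pair.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)         # chance agreement
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred  = [0, 0, 1, 1, 1, 1, 2, 2, 2]              # one item misassigned
print(adjusted_rand_index(truth, truth))         # identical partitions score 1.0
print(adjusted_rand_index(truth, pred))
```

The ARI is corrected for chance, so a random partition scores near zero rather than near the raw Rand index's optimistic baseline.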

Each point is the average of 10 runs, with the error bars denoting the standard error on the mean. The horizontal dashed line shows the result for the full BHC method.

Each point is the average of 10 runs, with the error bars denoting the standard error on the mean. The horizontal dashed line shows the result for the full BHC method.

We also consider how the run-time varies as a function of the total number of genes analysed,

Shown are the results for

Shown are the results for

We note an interesting effect for the lowest value of

It is also important to validate the randomised algorithm on real microarray data. To do this, we use a subset of the data of

As a performance metric, we choose the Biological Homogeneity Index (BHI)

Each point is the average of 10 runs, with the error bars denoting the standard error on the mean. The horizontal dashed line shows the results for the full BHC method. Shown are the results for the different gene ontologies, Biological Process (red), Molecular Function (green), Cellular Component (blue) and the logical-OR of all three (black). The BHI scores were all generated using the org.Sc.sgd.db annotation R package.

Each point is the average of 10 runs, with the error bars denoting the standard error on the mean. The horizontal dashed line shows the results for the full BHC method.

We note an interesting difference between

We also note that for

We have presented a randomised algorithm for the BHC clustering method. The randomised algorithm is statistically well-motivated and leads to a number of concrete conclusions.

The randomised BHC algorithm can be used to obtain a substantial speed-up over the greedy BHC algorithm.

Substantial speed-up can be obtained at only small cost to the statistical performance of the method.

The overall computational complexity of the randomised BHC algorithm is

The randomised BHC time series algorithm can therefore be used on data sets of well over 1000 genes.

Use of the randomised BHC algorithm requires the user to set a value of

The randomised time series BHC algorithm is available as part of the R package

We have also made available a set of R scripts which can be used to reproduce the analyses carried out in this paper. These are available from the following URL.

In this section, we provide a mathematical overview of the time series BHC algorithm. Greater detail can be found in

The BHC algorithm

The prior probability,

When

For the purposes of the BHC algorithm, a complete dendrogram is constructed, with the most likely merger made at each step. This allows us to inspect the log-probability of every merger in the dendrogram, even when this value is very small. To determine the likely number of clusters given the data, we then cut the dendrogram wherever the posterior probability of merging falls below 0.5 (i.e. wherever non-merger is more likely).
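The cutting step above can be sketched as a short recursion. This is illustrative only: the node representation and the toy merge posteriors below are our assumptions, not the package's data structures.

```python
# Cut a completed dendrogram wherever the posterior probability of the
# merge falls below 0.5, returning the implied flat clustering.

def cut_dendrogram(node, clusters):
    """Collect clusters; split any node whose merge is less likely
    than non-merge (posterior < 0.5)."""
    leaves, p_merge, children = node
    if children is None or p_merge >= 0.5:
        clusters.append(leaves)          # keep this subtree as one cluster
    else:
        for child in children:           # non-merge more likely: descend
            cut_dendrogram(child, clusters)
    return clusters

# Toy dendrogram node: (leaf_set, merge_posterior, (left, right) or None)
leaf = lambda g: ({g}, 1.0, None)
tree = ({"g1", "g2", "g3", "g4"}, 0.2,   # root merge unlikely: cut here
        (({"g1", "g2"}, 0.9, (leaf("g1"), leaf("g2"))),
         ({"g3", "g4"}, 0.8, (leaf("g3"), leaf("g4")))))

print(cut_dendrogram(tree, []))          # two clusters
```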

As described in

The BHC algorithm provides a lower bound of the DP marginal likelihood, as shown in

Gaussian processes define priors over the space of functions, making them well suited for use as non-linear regression models. This is particularly valuable for microarray time series

For the time series BHC model, we model an observation at time

Let

Time series BHC implements either the squared exponential or cubic spline covariance functions. In this paper, we restrict our attention to the default choice of squared exponential covariance:
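As a concrete illustration, a squared exponential covariance matrix over a set of time points can be built as below. The hyperparameter names (`s2`, `length_scale`, `noise`) are our own labels for the signal variance, length scale and observation noise; they are not the package's parameterisation.

```python
# Squared exponential covariance: k(t, t') = s2 * exp(-(t - t')^2 / (2 l^2)),
# with i.i.d. observation noise added on the diagonal.
import numpy as np

def sq_exp_cov(t, s2=1.0, length_scale=1.0, noise=0.1):
    t = np.asarray(t, dtype=float)
    d = t[:, None] - t[None, :]                  # pairwise time differences
    K = s2 * np.exp(-d**2 / (2.0 * length_scale**2))
    return K + noise * np.eye(len(t))            # noise keeps K well-conditioned

K = sq_exp_cov([0.0, 1.0, 2.0, 3.0])
print(K.shape)                                   # (4, 4)
```

The resulting matrix is symmetric positive definite, which is what the marginal likelihood computation (via a Cholesky factorisation) relies on.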

To speed up the time series BHC, we implement the randomised BHC algorithm of

Throughout this paper we will refer to the

For reasonably balanced trees, the top levels should be well-defined even using only a random subset of the genes. From this idea, we can define the following randomised algorithm.

Select a subset of

Run BHC on the subset of

Filter the remaining

Including the original

Now recurse for the gene subsets in each branch, until each subset size is

In effect, we are using estimates of the higher levels of the tree to subdivide the genes so that it is not necessary to compute many of the potential low-level merge probabilities.

The main loop is the randomised part of the algorithm and is applied recursively; once a remaining gene subset is small enough, the greedy version of BHC is used to complete that part of the tree and the recursion terminates.
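The recursion can be sketched end to end with toy stand-ins. This is not the R package's implementation: `greedy_cluster_two_branches` stands in for running greedy BHC on the subset and reading off the top-level split, and the filtering step assigns each remaining item to the nearer branch mean rather than by merge probability, so the control flow can run on 1-D toy data.

```python
# Illustrative control flow of the randomised algorithm:
# 1. draw a random subset, 2. split it into two top-level branches,
# 3. filter the remaining items into a branch, 4./5. recurse.
import random

def greedy_cluster_two_branches(points):
    """Toy stand-in for greedy BHC on the subset: split around the median."""
    s = sorted(points)
    mid = len(s) // 2
    return s[:mid], s[mid:]

def randomised_cluster(points, m=8):
    if len(points) <= m:                 # base case: greedy BHC finishes it
        return [points]
    subset = random.sample(points, m)    # 1. random subset of m items
    left, right = greedy_cluster_two_branches(subset)   # 2. top-level split
    mu_l = sum(left) / len(left)
    mu_r = sum(right) / len(right)
    for p in points:                     # 3. filter remaining items into
        if p in subset:                  #    the branch they fit best
            continue
        (left if abs(p - mu_l) <= abs(p - mu_r) else right).append(p)
    # 4./5. recurse on each branch until subsets have size <= m
    return randomised_cluster(left, m) + randomised_cluster(right, m)

random.seed(0)
data = [random.gauss(0, 1) for _ in range(40)] + \
       [random.gauss(5, 1) for _ in range(40)]
clusters = randomised_cluster(data, m=8)
print(len(clusters), sum(len(c) for c in clusters))   # all 80 items kept
```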

The covariance function of the Gaussian processes used in this paper are characterised by a small number of hyperparameters. These hyperparameters are learned for each potential merger using the BFGS quasi-Newton method

This merge-by-merge optimisation allows each cluster to have different hyperparameter values, allowing for example for clusters with different intrinsic noise levels and time series with different characteristic length scales.
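A hedged sketch of this hyperparameter learning step: maximise the GP marginal likelihood for a candidate merge with BFGS. We use SciPy's optimiser and a log-scale parameterisation of our own choosing to keep the hyperparameters positive; the data are synthetic.

```python
# Learn SE-covariance hyperparameters (signal variance, length scale,
# noise) by minimising the negative log marginal likelihood with BFGS.
import numpy as np
from scipy.optimize import minimize

t = np.linspace(0, 5, 12)                       # shared time points
y = np.sin(t) + 0.1 * np.random.default_rng(1).standard_normal(12)

def neg_log_marginal_likelihood(log_params):
    # Clip to avoid overflow if the line search explores extreme values.
    s2, ell, noise = np.exp(np.clip(log_params, -10, 10))
    d = t[:, None] - t[None, :]
    K = s2 * np.exp(-d**2 / (2 * ell**2)) + (noise + 1e-8) * np.eye(len(t))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha + np.log(np.diag(L)).sum()
            + 0.5 * len(t) * np.log(2 * np.pi))

x0 = np.zeros(3)                                # start at s2 = ell = noise = 1
res = minimize(neg_log_marginal_likelihood, x0, method="BFGS")
print(res.fun <= neg_log_marginal_likelihood(x0))   # fit improved
```

In the algorithm proper this optimisation is repeated for every candidate merge, which is why the per-merge cost matters for the overall complexity discussed below.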

We assume in this paper that each time series is sampled at the same set of time points. This leads to a block structure in the covariance matrix, which can be utilised to greatly accelerate the computation of the Gaussian process marginal likelihood.

The computational complexity of BHC is dominated by inversion of the covariance matrix. Considering the case of a group of

We also note that this is equivalent to a Bayesian analysis using a standard multivariate Gaussian. Viewing the task in this way may simplify the computation, and offers additional insight into the workings of the model.
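The equivalence is easy to check numerically: the GP marginal likelihood of a zero-mean model is just the log-density of a multivariate Gaussian whose covariance is the kernel matrix. The kernel settings and data below are toy assumptions.

```python
# GP marginal likelihood via Cholesky vs. the log-pdf of N(0, K).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
t = np.linspace(0, 3, 8)
y = rng.standard_normal(8)

d = t[:, None] - t[None, :]
K = np.exp(-d**2 / 2) + 0.1 * np.eye(8)        # SE kernel + noise

L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
gp_logml = -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 4 * np.log(2 * np.pi)

mvn_logml = multivariate_normal(mean=np.zeros(8), cov=K).logpdf(y)
print(np.isclose(gp_logml, mvn_logml))         # True
```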

When proposed merges have constant cost (the case considered by

For the time series BHC algorithm, however, the merges do not have constant cost. For a given node, we are merging

Because

The randomised algorithm for case of constant cost merges has

We thank Jim Griffin for useful discussions.