Let’s Not Waste Time: Using Temporal Information in Clustered Activity Estimation with Spatial Adjacency Restrictions (CAESAR) for Parcellating FMRI Data

We have proposed a Bayesian approach for functional parcellation of whole-brain FMRI measurements which we call Clustered Activity Estimation with Spatial Adjacency Restrictions (CAESAR). We use distance-dependent Chinese restaurant processes (dd-CRPs) to define a flexible prior which partitions the voxel measurements into clusters whose number and shapes are unknown a priori. With dd-CRPs we can conveniently implement spatial constraints to ensure that our parcellations remain spatially contiguous and thereby physiologically meaningful. In the present work, we extend CAESAR by using Gaussian process (GP) priors to model the temporally smooth haemodynamic signals that give rise to the measured FMRI data. A challenge for GP inference in our setting is the cubic scaling with respect to the number of time points, which can become computationally prohibitive with FMRI measurements, potentially consisting of long time series. As a solution we describe an efficient implementation that is practically as fast as the corresponding time-independent non-GP model with typically-sized FMRI data sets. We also employ a population Monte-Carlo algorithm that can significantly speed up convergence compared to traditional single-chain methods. First we illustrate the benefits of CAESAR and the GP priors with simulated experiments. Next, we demonstrate our approach by parcellating resting state FMRI data measured from twenty participants as taken from the Human Connectome Project data repository. Results show that CAESAR affords highly robust and scalable whole-brain clustering of FMRI timecourses.

where Z = p(Y|A) is the normalization term, all the hyperparameters have been denoted with θ = {φ, τ, σ 2 m , l}, and Y k is shorthand for all the node time series belonging to cluster k given some partition π(λ). Because the posterior distribution (1) is analytically intractable, meaning that we cannot analytically compute the normalization constant Z or the expectations of any of the unknown variables, we form a sample-based approximation using MCMC methods.
To draw samples from the posterior we construct a partially collapsed Gibbs sampling framework, which integrates over the T -dimensional cluster time series x k when possible. At each iteration of the sampling algorithm we repeat the following steps: 1. Draw the GP hyperparameters ψ = [log(σ 2 m ), log(l)] from their marginal distribution that is obtained by integrating over the cluster time series {x k } K k=1 as described in Appendix 1.3: Because the conditional posterior (2) is not analytically tractable, we use slice sampling to draw from the unnormalized density. If separate hyperparameters are assigned to each cluster, (2) can be factorized and the sampling can be done independently for each k = 1, ..., K.
2. Draw the cluster time series X = [x 1 , ..., x K ] T given the current link assignments λ and the hyperparameters: which can be done independently for k = 1, ..., K as described in Appendix 1.3. The time series X are used only as auxiliary variables that facilitate subsequent sampling steps for the noise hyperparameters φ and τ .
3. Draw the overall noise precision τ given the cluster time series x k and the time-specific noise parameters φ: The conditional posterior (4) is a Gamma distribution from which samples can be drawn directly.
4. Draw the time-specific noise parameters φ = [φ 1 , ..., φ T ] given the cluster time series x k and the overall noise precision τ : for t = 1, ..., T , where y t and x t denote the observation and cluster time series vectors for each t, respectively. Each of the conditional posteriors in (5) is a Gamma distribution form which samples can be drawn directly.
5. Draw the node link assignments by doing a (number of) Gibbs sweep(s) over all the nodes i = 1, ..., N given the hyperparameters: The marginal density p(Y k |λ, φ, τ, ψ) can be computed as described in Appendix 1.3.
In theory, the collapsed step 5 should be done before steps 2-4, but here we have moved it to the last position because we wish to record samples {λ (l) , φ (l) , τ (l) , ψ (l) }, where the node link assignments λ represent the current hyperparameter setting. It is also better to draw the overall noise precision in step 3 before the time-specific noise parameters. This way τ will capture the average noise level and φ t will remain on average around one representing time-specific adjustments to τ . Clearly, the majority of the algorithm's running time is spent in step 5, because it can contain thousands of individual link assignments, each of which can require marginal likelihood computations related to one cluster split proposal and several different cluster merge proposals.

Parallel Inference Using Population Monte Carlo
Because typical FMRI measurements contain a large number of voxels and the measurements can be noisy, several different cluster assignment configurations can explain the observed data with relatively high posterior densities. Consequently, the posterior distribution (1) can have multiple distinct modes corresponding to different values of λ. Therefore, a single Gibbs sampling chain can get stuck in some low density modes for a large number of iterations resulting in slow overall convergence. In contrast, with fixed λ, the algorithm of Section 1.1 typically converges quickly.
To speed-up convergence, we propose to utilize several parallel chains in a population Monte Carlo (PMC) framework [1]. At each step of the PMC algorithm, we run one iteration of the Gibbs sampler of Section 1.1 in J parallel chains that propagate one sample particle each. Next we combine the resulting particles into a single posterior approximation using an importance sampling resampling update. First, the particles are initialized, e.g., by drawing random values for the λ (j,0) from the dd-CRP prior and choosing appropriate values for the hyperparameters φ (j,0) , τ (j,0) , and ψ (j,0) . Then, at each iteration i, the algorithm draws a sample from the proposal distributions q(·) in parallel for each particle j = 1, ..., J: n+1:N , φ (j,i) , τ (j,i) , ψ (j,i) ) for n = 1, ..., N.
Which is followed by computing the importance weight for each particle j = 1, ..., J as and resampling particles using the normalized weights. After resampling, the particles are given equal importance weights and typically only the descendants of the best particles survive to the next iteration. After sufficient number of iterations, the posterior expectations of any function h(·) of the unknowns can be summarized using a suitable number of iterations from the end of the PMC chain as: Because resampling is done at each iteration i, the weights can be set to equal values: w j,i = 1/J. Compared to a single Gibbs chain, the PMC algorithm has much better chances to escape local modes, because at each iteration the J parallel Gibbs sweeps explore more efficiently the high-dimensional posterior distribution near the particles that have survived from the previous iterations.

Batch GP Inference
This appendix describes a batch method for computing the marginal likelihood p(Y|λ, θ) and the conditional posterior p(X|Y, λ, θ), which is a crucial part of the Gibbs sampler of Appendix 1.1. The method is very efficient for a small to medium number of time points, roughly T ∈ (0, 10000] depending on memory constraints. Because the GP prior (Eq. 3 in the main text) is defined separately for each cluster time course x k and the likelihood (Eq. 2 in the main text) can also be factored over k, we can do the inference separately for each cluster. The conditional posterior distribution of x k given all hyperparameters and λ, is Gaussian and it is given by where Y k is a N k × T matrix containing the node measurements from cluster k. The posterior mean µ k and the covariance Σ k and the marginal likelihood where y k = vec(Y k ) is formed by stacking the columns of Y k into a single vector, B = I T ⊗ 1 N k , Σ = diag(τ −1 φ −1 1 , ..., τ −1 φ −1 T ) ⊗ I N k , 1 N k is a N k × 1 vector of ones, and ⊗ denotes the Kronecker product. In the Gibbs sampling for λ, the marginal likelihood Z k needs to be evaluated for each proposed cluster configuration and it requires a Cholesky decomposition of the T × T matrix Σ −1 k = B T Σ −1 B + K −1 , which is computationally expensive. Using the properties of the Kronecker product we can show that the marginal likelihood can be evaluated very efficiently if we pre-compute the following auxiliary variables after each hyperparameter update: where • is the entry-wise product. We assume that K 1/2 needs to be computed only after updating the GP hyperparameters. At each new cluster proposal we can evaluate the marginal likelihood as where d k,t = 1 + λ t N k and which involves only scalar summations over n ∈ {n|π n = k} and t = 1, ..., T .
In step 2 of the Gibbs sampler of Appendix 1.1 we need to draw time courses from the conditional posterior and after convergence we also want to compute the conditional mean and covariance of x k for summarizing the final results. These can be obtained using where L 1 = K 1/2 U, L 2 = K 1/4 U and r t ∼ N (0, I T ). Here we assume that L 1 and L 2 are obtained using the previously computed auxiliary variables and the eigendecomposition of the prior covariance K.