Switching state-space modeling of neural signal dynamics

Linear parametric state-space models are a ubiquitous tool for analyzing neural time series data, providing a way to characterize the underlying brain dynamics with much greater statistical efficiency than non-parametric data analysis approaches. However, neural time series data are frequently time-varying, exhibiting rapid changes in dynamics, with transient activity that is often the key feature of interest in the data. Stationary methods can be adapted to time-varying scenarios by employing fixed-duration windows under an assumption of quasi-stationarity. But time-varying dynamics can be explicitly modeled by switching state-space models, i.e., by using a pool of state-space models with different dynamics selected by a probabilistic switching process. Unfortunately, exact solutions for state inference and parameter learning with switching state-space models are intractable. Here we revisit a switching state-space model inference approach first proposed by Ghahramani and Hinton. We provide explicit derivations for solving the inference problem iteratively after applying a variational approximation on the joint posterior of the hidden states and the switching process. We introduce a novel initialization procedure using an efficient leave-one-out strategy to compare among candidate models, which significantly improves performance compared to the existing method that relies on deterministic annealing. We then utilize this state inference solution within a generalized expectation-maximization algorithm to estimate model parameters of the switching process and the linear state-space models with dynamics potentially shared among candidate models. We perform extensive simulations under different settings to benchmark performance against existing switching inference methods and further validate the robustness of our switching inference solution outside the generative switching model class. Finally, we demonstrate the utility of our method for sleep spindle detection in real recordings, showing how switching state-space models can be used to detect and extract transient spindles from human sleep electroencephalograms in an unsupervised manner.


Posterior discrete state concentration during the fixed-point iteration
It is intriguing that the approximate, factorized posterior tends to converge to a very sparse state through the fixed-point iteration inner loop. The form of the reciprocals is akin to the sparse Bayesian Gaussian regression and factor analysis procedures known as ARD. Automatic Relevance Determination (ARD), which is an iterative type-II maximum likelihood procedure, leads to sparsification of coefficients in a similar iterative manner. If there is a relationship that can be shown, I suggest adding some discussion. ARD has been popularized by Bishop's PRML book and has been used widely in neuroscience.
Response: We appreciate the reviewer for pointing out the relevance of ARD to the fixed-point iterations. ARD is a procedure to determine which input features are most relevant in a classification/regression problem using the correlation length scales in a covariance function. The sparsification of the classification/regression coefficients in an iterative type-II maximum likelihood estimation follows from the particular expression of the likelihood function. The mathematical construction of our problem is fundamentally different. Concretely, the fixed-point iterations in our problem update h_t^(m), which is a deterministic summary statistic, unlike the ARD procedure where the weight coefficients are randomly distributed with zero mean and a hyperparameter. The negative variational free energy expression (see S1 Appendix) involves h_t^(m) differently from that of the ARD procedure (compare to [Bishop, 2006, Eq. 7.97]). In our problem, the empirical sparsification of h_t^(m) is driven by the term R/h_t^(m) used in estimating the hidden model log-evidence, which is then used to compute h_t^(m) at the next iteration: this procedure in turn pushes h_t^(m) to either 1 or 0. Nevertheless, we find the analogy to ARD helpful for an intuitive understanding of the empirical effect of the fixed-point iterations, and we have thus included a reference to ARD in the Materials and methods section (lines 756-760, p. 43) when discussing this behavior of the fixed-point iterations.
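To make this empirical behavior concrete, the following toy sketch illustrates the feedback loop. It is a deliberate simplification using static Gaussian candidate models in place of the full Kalman smoother in the paper, with no HMM transition prior, and all parameter values are illustrative: scaling the observation noise to R/h_t^(m) when computing each model's evidence pushes the responsibilities toward 0 or 1 across iterations.

```python
import numpy as np

# Toy sketch of the reciprocal feedback (static candidate models, not the
# paper's full Kalman smoother; no HMM prior; values are illustrative).
# Each candidate m has x^(m) ~ N(mu_m, Q) observed as y = x + N(0, R).
# The posterior over x^(m) is computed with effective observation noise
# R / h_t^(m), and the expected squared error drives the next h_t^(m).
rng = np.random.default_rng(0)
T, Q, R = 200, 1.0, 1.0
mu = np.array([-2.0, 2.0])                     # candidate model means
s_true = (np.arange(T) >= T // 2).astype(int)  # one switch halfway through
y = mu[s_true] + rng.normal(0.0, np.sqrt(Q + R), T)

h = np.full((T, 2), 0.5)                       # initialized away from 0/1
for _ in range(15):
    post_var = 1.0 / (1.0 / Q + h / R)         # per (t, m) posterior of x
    post_mean = post_var * (mu[None, :] / Q + h * y[:, None] / R)
    # Hidden model log-evidence: expected quadratic error under each
    # model's posterior; a small h inflates the error, shrinking h further
    g = -0.5 * ((y[:, None] - post_mean) ** 2 + post_var) / R
    g -= g.max(axis=1, keepdims=True)
    h = np.exp(g) / np.exp(g).sum(axis=1, keepdims=True)

print(np.round(h[[0, 50, 150, 199]], 3))       # rows concentrate near 0/1
```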
2. Although it is nice to have a sparse selection of discrete states after the inference, this is clearly not always going to be close to the true posterior. I don't think this is necessarily a bad thing, but it needs to be made clear that the space of potential approximate posteriors is smaller than postulated by the explicit assumptions.

Response:
The reviewer raised an interesting and subtle point that was not previously explicated in the text. The sparse model selection that follows our inference step may in fact be close to the true posterior, since the true posterior p is expected to contain polarized values (i.e., with model probabilities close to 1 or 0) under the assumed generative structure with parallel candidate models. This polarization is due to the following factors:
• First, each transition between candidate models introduces a discontinuity in the observation sequence at the switch point, as the realization trajectories from different SSMs are distinct from each other.
• Second, since the SSM hidden state trajectories consist of multiple time points, they reside in a high-dimensional space. The dimensionality further increases as the SSM state dimension increases. Because of this high dimensionality, the separation between any two candidate models becomes very substantial (see the short numerical sketch after this list).
• Third, real recordings often have infrequent transitions between distinct dynamics. This empirical skewness leads to a high probability of staying within the same candidate model. Such probability propagates as a strong prior across time points, amplifying the effect of the other factors in polarizing the values of the posterior estimates.
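As a minimal numerical illustration of the second factor (a toy with two unit-variance Gaussian models rather than the SSMs in the paper; all values are illustrative), the log-odds between two candidate models grow linearly with the number of time points, so the model posterior polarizes quickly:

```python
import numpy as np

# Toy illustration of the dimensionality argument above (unit-variance
# Gaussian models with means +/-0.5; values are illustrative): the
# log-odds between two candidate models grow linearly with the number
# of time points, so the model posterior polarizes quickly.
rng = np.random.default_rng(1)
for T in [1, 10, 100]:
    y = rng.normal(0.5, 1.0, T)            # data generated near model A
    ll_A = -0.5 * np.sum((y - 0.5) ** 2)   # log-likelihood under model A
    ll_B = -0.5 * np.sum((y + 0.5) ** 2)   # log-likelihood under model B
    p_A = 1.0 / (1.0 + np.exp(ll_B - ll_A))
    print(T, round(p_A, 6))                # p_A approaches 1 as T grows
```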
We did not mean to indicate that the reciprocal interdependence between g_t^(m) and h_t^(m) in the fixed-point iterations was designed to polarize the h_t^(m) estimates. Instead, it is derived by solving the optimization problem in Eq. 13, and it could recover sparse solutions when the true posterior is sparse.
Correspondingly, the space of potential approximate posteriors is indeed P as defined in Eq. 12, which includes non-sparse distributions, and is not a smaller subspace of P. In fact, in both the VI-EM and VA-EM algorithms, the values of h_t^(m) are initialized away from 0 or 1 for the fixed-point iterations. Throughout the iterations, the values progressively become more polarized in both algorithms, in order to increase the negative variational free energy until reaching a local maximum. Thus, non-sparse models in the neighborhood of such a local maximum are already considered in the search space during the fixed-point iterations. We have now edited the Materials and methods section (lines 737-748, p. 43) to include this explanation, and we appreciate the reviewer for helping us to clearly describe this key empirical behavior of the variational inference approach presented here.

This perspective also relates to the static method derived in Bar-Shalom and Li [1993]. The static method uses the predicted density p(y_t | x_{t|t-1}^(m)) as a surrogate for the predicted observation probability p(y_t | s_t = m, y_1, ..., y_{t-1}) in an HMM. This is an approximation because the actual density p(y_t | s_t = m, y_1, ..., y_{t-1}) requires marginalizing out the entire history of switch states s_{1:t-1}. In the same vein of approximation and Gaussian merging, we use the interpolated density from each candidate model as a surrogate and run the forward-backward algorithm to obtain smoothed estimates of the switching state. Viewed from this perspective, the soft/hard segmentations work on the same principle as this early switching state-space inference method.

The usefulness of the soft segmentation could be assessed empirically within a specific application where one expects a gradual change in neural states as opposed to a sharp transition. For instance, a switch between two neural states may be driven by a covariate, perhaps a pharmacological agent that can be measured at a fine temporal resolution. Then relating variations of the covariate to the gradually changing model probability could be more practical. After all, the soft segmentation method emphasizes the transition periods between states, acknowledging uncertainty in choosing one of several underlying states, rather than dividing a time series into multiple distinct segments with different dynamics. Thus, its usage is more relevant in applications where key hypotheses focus on the transition periods between segments rather than on the segments themselves. We have added this discussion to the corresponding section in S4 Appendix (Additional analyses, last paragraph, p. 8).
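For concreteness, the following is a minimal log-space sketch of this forward-backward step (a generic implementation, not the one in our codebase; the per-model log-evidence values, e.g., from the interpolated densities, are assumed to be precomputed):

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward(log_ev, log_A, log_pi):
    # log_ev: (T, M) per-model log-evidence at each time point
    # log_A:  (M, M) log transition matrix, A[i, j] = p(s_t = j | s_{t-1} = i)
    # log_pi: (M,) log initial switching-state probabilities
    T, M = log_ev.shape
    la = np.zeros((T, M))
    lb = np.zeros((T, M))
    la[0] = log_pi + log_ev[0]
    for t in range(1, T):                  # forward pass
        la[t] = log_ev[t] + logsumexp(la[t - 1][:, None] + log_A, axis=0)
    for t in range(T - 2, -1, -1):         # backward pass
        lb[t] = logsumexp(log_A + (log_ev[t + 1] + lb[t + 1])[None, :], axis=1)
    lg = la + lb
    return np.exp(lg - logsumexp(lg, axis=1, keepdims=True))

# Soft segmentation: the returned (T, M) smoothed probabilities.
# Hard segmentation: h.argmax(axis=1) picks one model per time point.
```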
We apologize for the delay in uploading the code. Please see our somata GitHub repo for the most recent changes. Both the soft and hard segmentation results are now output by default (see line 163 of somata/switching/vb.py).
--> published in "Advances in Neural Information Processing Systems"

Response: Thanks for catching the typo. The reference [44] on p. 56 is now fixed in the manuscript.

Response to the Comments of Reviewer 2
Thank you for your thorough response and revisions. I still have some lingering, major concerns though.
Response: We once again thank Reviewer 2 for their constructive suggestions.
# Concern 1: Bringing interpolated density and parameter initialization to the fore
a. The revised manuscript makes clear that the methodological novelty is in the initialization scheme, but the main text gives very few details of the interpolated density initialization scheme. Instead, the details are punted to Supplement S3. Given that this is a core contribution, it should be a major component of the main text. The relative fraction of main text devoted to describing generalized EM as opposed to describing initialization strategies is not commensurate with the relative novelty of these two facets of the work.
Response: We understand the reviewer's concern regarding emphasizing the novel initialization procedure in the manuscript. We have indeed aimed to highlight the interpolated density initialization as a major component of the main text, but this can be improved.
• First, a section is already devoted to the interpolated density in the Materials and methods section (lines 827-849, pp. 48-49), emphasizing the motivation and interpretation of this initialization. S3 Appendix (Implementation details) covers the details of the actual computations of the interpolated density, which require an exposition of Kalman smoothing recursions and some proofs that we provide. We feel that explaining this implementation detail in the Materials and methods section instead may digress from understanding the use of the interpolated density in our context.
• Second, we admit that the relative fraction of the Materials and methods section describing the interpolated density is small compared to the description of the generalized EM algorithm. However, we feel the complete derivation and explanation of this particular generalized EM framework for switching state-space models is an equally important contribution of this paper. We note throughout the manuscript that this idea was initially proposed by Ghahramani and Hinton [2000], but to the best of our knowledge no published work has provided straightforward derivations using optimal functional forms of variational log-posteriors to arrive at the fixed-point iterations. As we allude to in the Introduction section (lines 87-90, p. 5), all closed-form equations are presented here to help readers from an applied field (e.g., neuroscience) understand switching state-space models, appreciate the statistical principles behind them, and be able to implement and expand on the solutions we present. While this perhaps makes the Materials and methods section obvious and less interesting to an expert in statistics, we hope to keep the organization as is for its pedagogical value (which is noted and appreciated by Reviewer 1). Related to this, one way to highlight the importance of the initialization procedure in this framework is by explaining how the fixed-point iterations arise from optimizing the negative free energy and noting the challenges of a non-concave landscape. We now expand on this to motivate the importance of the initialization and to further increase its weight (lines 761-768, pp. 43-44).
• Third, all Results sections are devoted to comparing switching inference and learning performance with and without the interpolated density during initialization. This focus demonstrates the improvements due to the novel initialization procedure and makes it a major component of the main text. We have added an explanation of the interpolated density in the leading paragraphs of the Results section (lines 129-141, p. 7) so that this novel initialization is better contextualized when the results are presented, perhaps before the Materials and methods section is perused. This change should further highlight the novel initialization as a core contribution of this manuscript.
b. A skeptical reader could ask whether the generalized EM algorithm is necessary at all, with reasonable initial estimates of the parameters. Given the same initial parameters θ^(0), one could use blocked Gibbs sampling (and many previous works have used blocked Gibbs to great success on SLDS, as the authors note). The analog of interpolated density initialization would seem to be: initialize the continuous states with samples from the conditional distribution p(x_{1:T}^(m) | y_{1:T}, s_t = m ∀t, θ^(0)), and then sample s_{1:T} from its conditional distribution. With weak, conjugate priors on the parameters and good initialization, I would expect blocked Gibbs sampling to work very well. I realize that doing this comparison is non-trivial so I don't consider it a must-have, but I would personally be interested to know the outcome. (FWIW, https://github.com/mattjj/pyslds has Gibbs sampling with conjugate priors implemented for SLDS, so some of the annoying parts of working with priors on matrix-valued parameters could be ameliorated.)

Response: We appreciate the reviewer for raising this great point. Blocked Gibbs sampling has indeed been used with great success on SLDS. We would first like to point out the similarities between a blocked Gibbs sampler and our closed-form distributional approach. In the current problem, a blocked Gibbs sampler would cycle between sampling from p(x_{1:T}^(m) | y_{1:T}, s_t = m ∀t, θ^(i-1)), from p(s_t = m | y_{1:T}, x_{1:T}^(m) ∀m, θ^(i-1)), and from p(θ^(i) | y_{1:T}, s_t = m ∀t, x_{1:T}^(m) ∀m). As the reviewer suggested, weak, conjugate priors on the parameters would be expected to work very well. The blocked sampling of p(x_{1:T}^(m) | y_{1:T}, s_t = m ∀t, θ^(i-1)) is done by forward-filtering backward-sampling, as described in [Frühwirth-Schnatter, 1994, section 3.2]. Similarly, the probabilities of the discrete states, p(s_t = m | y_{1:T}, x_{1:T}^(m) ∀m, θ^(i-1)), can be found by a forward-backward procedure similar to our work, and then a sample path can be generated [Carter and Kohn, 1994, Appendix 2]. Sampling from p(θ^(i) | y_{1:T}, s_t = m ∀t, x_{1:T}^(m) ∀m) is a bit trickier: variance parameters will have inverse-gamma distributions, AR parameters will follow Gaussian distributions, the oscillation frequency parameter will follow a von Mises distribution, and so on.
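As a point of reference, the following is a minimal sketch of the forward-filtering backward-sampling step for a single scalar candidate model (a toy AR(1) state-space model in the spirit of Frühwirth-Schnatter [1994]; the function name and parameter values are illustrative, not taken from our codebase):

```python
import numpy as np

# Minimal forward-filtering backward-sampling (FFBS) sketch for a single
# scalar candidate model x_t = a x_{t-1} + w_t, y_t = x_t + v_t, in the
# spirit of Fruhwirth-Schnatter [1994]; parameter values are illustrative.
def ffbs(y, a=0.9, q=0.5, r=1.0, m0=0.0, v0=1.0, rng=None):
    rng = rng or np.random.default_rng()
    T = len(y)
    m, v = np.zeros(T), np.zeros(T)
    mp, vp = m0, v0
    for t in range(T):                       # forward Kalman filter
        if t > 0:
            mp, vp = a * m[t - 1], a * a * v[t - 1] + q
        k = vp / (vp + r)                    # Kalman gain
        m[t] = mp + k * (y[t] - mp)
        v[t] = (1.0 - k) * vp
    x = np.zeros(T)                          # backward sampling pass
    x[-1] = rng.normal(m[-1], np.sqrt(v[-1]))
    for t in range(T - 2, -1, -1):
        j = a * v[t] / (a * a * v[t] + q)    # backward gain
        x[t] = rng.normal(m[t] + j * (x[t + 1] - a * m[t]),
                          np.sqrt(v[t] - j * a * v[t]))
    return x
```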
In comparison, our generalized EM algorithm uses the same basic blocks, but instead of generating sample paths, it works directly with the actual probability distributions. For example, the E-step cycles between updating p(x_{1:T}^(m) | y_{1:T}, s_t = m ∀t, θ^(i-1)) and p(s_t = m | y_{1:T}, x_{1:T}^(m) ∀m, θ^(i-1)). The former is updated using the forward-filtering backward-smoothing algorithm, which differs from the forward-filtering backward-sampling algorithm only in the very last step: instead of generating a sample path, it computes the mean and variance of the posterior distribution. The same remark applies to p(s_t = m | y_{1:T}, x_{1:T}^(m) ∀m, θ^(i-1)), with our algorithm stopping at the forward-backward procedure. Finally, in the M-step the model parameters are updated using a set of closed-form update rules, instead of sampling from p(θ^(i) | y_{1:T}, s_t = m ∀t, x_{1:T}^(m) ∀m). So, there is no reason to believe that, given the same initial parameters θ^(0), blocked Gibbs sampling would not result in similar estimates. In fact, when the posterior distributions of the states cannot be computed in closed form (e.g., in the case of a non-linear, Poisson observation model or when there is a dependency between s_{t+1} and x_t^(m)), blocked Gibbs sampling would be crucial, greatly facilitated by resources such as PySLDS. We now point that out in the Discussion section of the manuscript (lines 526-529, p. 29).
We would also like to point out that 'the analog of interpolated density initialization' suggested by the reviewer is akin to initializing g_t^(m) using the smoothed density instead of the interpolated density. That is because the samples from the conditional distribution p(x_{1:T}^(m) | y_{1:T}, s_t = m ∀t, θ^(0)) will not reflect the deletion of y_t from y_{1:T} when sampling x_t. We found the smoothed density to result in catastrophic failure of the fixed-point iterations to distinguish between candidate models, which in fact motivated us to derive the interpolated density. Our closed-form E-step gives the insight behind using the interpolated density for the initialization, and makes it easy to appreciate its role in the initialization. In a blocked Gibbs sampling setting, one can theoretically sample from the interpolated density derived in the paper, since the density parameters are composed of quantities already computed in the forward-filtering backward-sampling algorithm, and just run the very first blocked Gibbs sampling step of p(s_t = m | y_{1:T}, x_{1:T}^(m) ∀m, θ^(0)), where the block samples are drawn from the interpolated density. We omitted a direct comparison with sampling techniques since it is out of the scope of the current work. Instead, we have added a discussion pointing readers to the possibility of using blocked Gibbs sampling for the inference in place of the generalized EM algorithm in the Discussion section (lines 512-516, p. 29).
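A small static toy (for illustration only; not the Kalman-based derivation in the paper, and all values are illustrative) shows why the smoothed density discriminates poorly: a score that already conditions on y_t shrinks the residual for every candidate model, weakening the separation between them, whereas the leave-one-out (interpolated-style) score preserves it.

```python
import numpy as np

# Static toy contrast (for illustration only; not the paper's Kalman
# derivation): two candidate models x^(m) ~ N(mu_m, Q), y = x + N(0, R).
# The "smoothed" score conditions the posterior of x on y itself before
# scoring y, shrinking the residual for every model; the leave-one-out
# ("interpolated"-style) score keeps the candidate models well separated.
mu, Q, R = np.array([-2.0, 2.0]), 1.0, 0.1
y = 2.0                                         # observation near model 2
for m in range(2):
    pv = 1.0 / (1.0 / Q + 1.0 / R)              # posterior of x given y
    pm = pv * (mu[m] / Q + y / R)
    smoothed = -0.5 * (y - pm) ** 2 / R
    interp = -0.5 * (y - mu[m]) ** 2 / (Q + R)  # y excluded from posterior
    print(f"model {m}: smoothed {smoothed:7.3f}   interpolated {interp:7.3f}")
```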

# Concern 2: Framing of existing methods
The discussion presents existing works as "black-box variational inference" methods when in fact many of the cited references (39, 41, 71, 76) use structure-exploiting inference algorithms like blocked Gibbs sampling (39, 41) or structured mean field approximations (71, 76). Both sets of algorithms exploit conditional conjugacy to update blocks of random variables at once, and they do so with exact updates rather than stochastic gradient ascent steps. The structured mean field approaches are directly analogous to the structured mean field approach used in this paper. The main difference is not black box vs. structured, but rather the initialization scheme and the model/posterior family itself. The models and posterior approximation in this paper have a factorial construction with many continuous states evolving in parallel, rather than one continuous state with switching dynamics, as in the cited works.
Response: We would like to thank the reviewer for pointing out our misframing of the existing methods as all being black-box variational inference. Indeed, much existing work exploits the structure of switching state-space models to derive efficient solutions. We have revised the discussion paragraphs (lines 506-548, pp. 28-30) to correctly present the main differences as 1) model class, 2) initialization, and 3) sampling versus fixed-point iterations for state estimation. We would like to maintain the discussion on black-box variational inference per a previous suggestion from Reviewer 1, and we believe it is still a relevant inference approach; although, as pointed out in ref 76, structured mean field approximation (e.g., variational Laplace EM) tends to perform better, which we highlight as a highly effective alternative to the solution we present here (lines 546-548, p. 30).

# Concern 3: Switching SSM Framing in the Introduction
I disagree with the paragraph on lines 47-59 of the introduction.
-"These methods apply carefully constrained approximations to represent a broader range of switching processes besides the Markov process used in the HMM" My impression is that the cited work uses the same sorts of approximations as are developed in this submission, and they target very similar models (switching linear dynamical systems) with some extensions (nonparametric models, continuous-¿discrete dependencies, etc.)