Neuron’s eye view: Inferring features of complex stimuli from neural responses

Experiments that study neural encoding of stimuli at the level of individual neurons typically choose a small set of features present in the world—contrast and luminance for vision, pitch and intensity for sound—and assemble a stimulus set that systematically varies along these dimensions. Subsequent analysis of neural responses to these stimuli typically focuses on regression models, with experimenter-controlled features as predictors and spike counts or firing rates as responses. Unfortunately, this approach requires knowledge in advance about the relevant features coded by a given population of neurons. For domains as complex as social interaction or natural movement, however, the relevant feature space is poorly understood, and an arbitrary a priori choice of features may give rise to confirmation bias. Here, we present a Bayesian model for exploratory data analysis that is capable of automatically identifying the features present in unstructured stimuli based solely on neuronal responses. Our approach is unique within the class of latent state space models of neural activity in that it assumes that firing rates of neurons are sensitive to multiple discrete time-varying features tied to the stimulus, each of which has Markov (or semi-Markov) dynamics. That is, we are modeling neural activity as driven by multiple simultaneous stimulus features rather than intrinsic neural dynamics. We derive a fast variational Bayesian inference algorithm and show that it correctly recovers hidden features in synthetic data, as well as ground-truth stimulus features in a prototypical neural dataset. To demonstrate the utility of the algorithm, we also apply it to cluster neural responses and demonstrate successful recovery of features corresponding to monkeys and faces in the image set.


Introduction
The question of how the brain encodes information from the natural world forms one of the primary areas of study within neuroscience.For many sensory systems, particularly vision and audition, the discovery that single neurons modulate their firing of action potentials in response to particular stimulus features has proven foundational for theories of sensory function.Indeed, neuronal responses to contrast, edges, and motion direction appear to form fundamental primitives on which higher-level visual abstractions are built.Nevertheless, many of these higher-level abstractions, increasingly of interested to modern neuroscience, do not exist in a stimulus space with obvious axes.As a result, experimentalists must choose a priori features of interest in constructing their stimulus sets, with the result that cells may appear weakly tuned due to misalignment of stimulus and neural axes.
For example, in vision, methods like reverse correlation have proven successful in elucidating response properties of some cell types, but such techniques rely on a well-behaved stimulus space and a highly constrained encoding model in order to achieve sufficient statistical power to perform inference [1,2,3].However, natural stimuli are known to violate both criteria, generating patterns of neural activity that differ markedly from those observed in controlled experiments with limited stimulus complexity [3,4,5].Information-based approaches have gone some way in addressing this challenge [4], but this approach assumes a metric structure on stimuli in order to perform optimization, and was recently shown to be strongly related to standard Poisson regression models [6].
More recently, Gallant and collaborators have tackled this problem in the context of fMRI, demonstrating that information present in the blood oxygen level-dependent (BOLD) signal is sufficient to classify and map the representation of natural movie stimuli across the brain [7,8,9].These studies have used a number of modeling frameworks, from latent dirichlet allocation for categorizing scene contents [9] to regularized linear regression [8] to sparse nonparametric models [7] in characterizing brain encoding of stimuli, but in each case, models were built on pre-labeled training data.Clearly, a method that could infer stimulus structure directly from neural data themselves could extend such work to include less easily characterized stimulus sets like those depicting social interactions.
Another recent line of work, this one focused on latent Poisson processes, has addressed the task of modeling the low dimensional dynamics of neural populations [10,11,12,13].Using generalized linear models and latent linear dynamical systems as building blocks, these models have proven able to infer (functional) connectivity [10], estimate spike times from a calcium images [11], and identify subgroups of neurons that share response dynamics [13].Inference in thsese models is generally performed via expectation maximization, though [14] and [15] also used a variational Bayesian approach.Our work is distinct from those models, however, in that both were concerned with modeling and discriminating internal states based on neural responses, while this work focuses on detecting features in external stimuli.
Our model sits at the intersection of these regression and latent variable approaches.We utilize a Poisson observation model that shares many of the same features as the commonly used generalized linear models for Poisson regression.We also assume that the latent features modulating neural activity are time-varying and Markov.However, we make 3 additional unique assumptions: First, we assume that the activity of each neuron is modulated by a combination of multiple independent latent features governed by semi-Markov dynamics.This allows for latents to evolve over multiple timescales with non-trivial duration distributions, much like the hand-labeled features in social interaction data sets.Second, we assume that these latents are tied to stimulus presentation.That is, when identical stimuli are presented, the same latents are also present.This allows us to model the daynamics of latent features of the stimulus that drive neural activity, rather than intrinsic neural dynamics.Finally, we enforce a sparse hierarchical prior on modulation strength that effectively limits the number of latent features to which the population of neurons is selective.This allows for a parsimonious explanation of the firing rates of single units in terms of a small set of stimulus features.Finally, we perform full variational Bayesian inference on all model parameters and take advantage of conditional conjugacy to generate coordinate ascent update rules, nearly all of which are explicit.Combined with forward-backward inference for latent states, our algorithm is exceptionally fast, automatically implements Occam's razor, and facilitates proper model comparisons using the variational lower bound.

Observation model
Consider a population of U spiking neurons exposed to a series of stimuli indexed by a (unique) time index t ∈ {1 . . .T }.Each neuron is exposed to each stimulus M tu times, where we do not assume either that each neuron observes each stimulus time or that consecutive times are observed by the same sets of neurons.That is M tu may have many 0s.For each observation m (a unique unit, time pair), we then observe a spike count, N m .We model these spike counts as arising from a Poisson process with time-dependent rate Λ tu and observation-specific multiplicative overdispersion θ m : Figure 1: Generative model for spike counts.Spike counts N are observed for each of U units over stimulus time T for multiple presentations M .Counts are assumed Poisson-distributed, with firing rates Λ that depend on each unit's responses (λ) to both latent discrete states z t and observed covariates x t that change in time, as well as a baseline firing rate λ 0 .γ nodes represent hyperparameters for the firing rate effects.θ is a multiplicative overdispersion term specific to each observation, distributed according to hyperparameters s.
Note that both the unit and time are functions of the observation index, m, and that the distribution of the overdispersion for each observation is specific to the unit observed.

Firing rate model
At each stimulus time t, we assume the existence of K binary latent states z tk and R observed covariates x tr .We further assume that each unit's firing rate response at each time can be modeled as arising from the product of three effects: (1) a baseline firing rate specific to each unit, (2) a product of responses to each latent state, and (3) a product of responses to each observed covariate: Note that this is conceptually similar to the generalized linear model for firing rates (in which we model log Λ) with the identification β = log λ.However, by modeling the firing rate as a product and placing Gamma priors on the individual effects, we will be able to take advantage of closed-form variational updates resulting from conjugacy that avoid explicit optimization (see below).
In addition, to enforce parsimony in the inferred features, we put sparse hierarchical priors with hyperparameters γ = (c, d) on the λ z terms: That is, E[λ u ] = d −1 and var[λ u ] = (cd 2 ) −1 , so for c large and d ∼ O(1), the prior for firing rate response to each latent feature will be strongly concentrated around gain 1 (no effect).And once again, this choice of priors will lead to closed-form updates in our variational approximation.Finally, for the baseline terms, λ 0u , we use a non-sparse version of the same model, while for the specified covariate responses, λ xu , we model the unit effects non-hierarchially, using independent Gamma priors for each unit.

Latent state model
We model each of the latent states z tk as an independent Markov process for each feature k.That is, each k indexes an independent Markov chain with initial state probability π k ∼ Dir(α π ) and transition matrix A k ∼ Dir(α A ).For the semi-Markov case, we assume that the dwell times in each state are distributed independently for each chain according to an integer-valued, truncated lognormal distribution with support on the integers 1 . . .D: Note that we have allowed the dwell time distribution to depend on both the feature k and the state of the Markov chain j.In addition, we put indpendent Normal-Gamma priors on the mean (m kj ) and precision (τ kj ≡ s −2 kj ) parameters of the distribution: (m, τ ) ∼ NG(µ, λ, α, β).

Inference
We perform variational inference for the posteriors over the model parameters Θ = (λ 0 , λ z , λ x , A, π, c 0 , d 0 , c z , d z , s) and latents Z = (z kt , θ m ).In addition, for the semi-Markov case, we include (m, τ ), the parameters for the state dwell time distribution.That is, we wish to approximate the joint posterior density, by a variational posterior q Θ (Θ)q Z (Z) that factorizes over parameters and latents but is nonetheless close to p as measured by the Kullback-Leibler divergence [16].Equivalently, we wish to maximize the variational objective with H the entropy.Following the factorial HMM work of [17], we also assume that the posterior factorizes over each latent time series z •k and the overdispersion factor θ m , as well as the rate parameters λ •u• associated with each Markov process.This factorization results in a variational posterior of the form: With this ansatz, the variational objective decomposes in a natural way, and choices are available for many of the qs that lead to closed-form updates.

Variational posterior
From 1 and 2 above, we can write the probability of the observed data N as (8) where again, m indexes observations of (time, unit) pairs and the last two nontrivial terms represent the probability of the Markov sequence given by z tk .Given that 8 is of an exponential family form for θ and λ conditioned on all other variables, free-form variational arguments [16] suggest variational posteriors: For the first of these two, updates in terms of expected sufficient statistics involving expectations of γ = (c, d) are straightforward (see Supplement).However, this relies on the fact that z t ∈ {0, 1}.The observed covariates x t follow no such restriction, which results in a transcendental equation for the β x updates which we solve using an explicit BFGS optimization on each iteration.Moreover, we place non-hierarchical Gamma priors on these effects: λ xur ∼ Ga(a xur , b xur ).
As stated above, we place hierarchical priors on λ 0 , λ z , and θ of the form Eq. 3 that tie the hyperparameters for rate parameters λ and overdispersion effects θ across units.This involves multiple terms in the expected log evidence of the form In order to calculate the expectation, we make use of the inequality to lower bound the negative gamma function and approximate the above as Clearly, the conditional probabilities for c and d are gamma in form, so that if we use priors c ∼ Ga(a c , b c ) and d ∼ Ga(a d , b d ) the posteriors have the form This basic form, with appropriate indices added, gives the update rules for the hyperparameter posteriors for λ 0 and λ z .For θ, we simply set c = s u and d = 1.

Latent state inference
For Hidden Markov Models, given the observation model 8, inference for z, A, and π for each latent feature can be performed efficiently via conjugate updates and the well-known forward-backward algorithm [18].For the case of semi-Markov dynamics, we additionally need to perform inference on the parameters (m, τ ) of the dwell time distributions for each state.In the case of continuous dwell times, our model 4 would have W = 1 and be conjugate to the Normal-Gamma prior on (m, τ ), but the restriction to discrete dwell times requires us to again lower bound the variational objective: This correction for trunction must then be added to E q [p(z|Θ)].For inference in the semi-Markov case, we use an extension of the forward-backward algorithm [19], at the expense of computational complexity O(SDT ) (S = 2) per latent state, to calculate q(z k ) (see Supplement).For the 4SK hyperparameters of the Normal-Gamma distribution, we perform an explicit BFGS optimization on the 4S parameters of each chain on each iteration.

Synthetic data
We generated synthetic data from the model in Section 2 for U = 100 neurons for T = 10, 000 time bins of dt = 0.0333s (≈ 6min).Assumed firing rates and effect sizes were realistic for cortical neurons, with mean baseline rates of 10 spikes/s and firing rate effects given by a Ga(1, 1) distribution for K data = 3 latent features.In addition, we included R = 3 known covariates generated according to Markov dynamics.
For this experiment, we assumed that each unit was presented once with the stimulus time series, so that M = 1.That is, we tested a case in which inference was driven primarily by variabiliy in population responses across stimuli rather than pooling of data across repetitions of the same stimulus.Moreover, to test the model's ability to parsimoniously infer features, we set K = 5.That is, we asked the model to recover more features than were present in the data.Finally, we placed hierarchical priors on unit baseline firing rates and sparse hierarchical priors on firing rate effects of latent states.We used 5 random restarts and iterated over parameter updates until the fractional change in L dropped below 10 −4 .As seen in Figure 2, the model correctly recovers only the features present in the original data.We quantified this by calculating the normalized mutual information Î ≡ I(X, Y )/ H(X)H(Y ), between the actual states and the inferred states, with H(Z) and I estimated by averaging the variational posteriors (both absolute and conditioned on actual states) across time.Note that superfluous features in the model have high posterior uncertainty for z k and high posterior confidence for λ zk around 1 (no effect).In addition, the model correctly recovers coefficients for the observed covariates, and when limited to fewer features than in the generating model, recovers a subset of the features accurately rather than blending features together (see Supplement).

Labeled neural data
We applied our model to a well-studied neural data set comprising single neuron recordings from macaque area LIP during the performance of a perceptual discrimination task [20] 1 .In the experiment, stimuli consisting of randomly moving dots, some percentage of which moved coherently in either the preferred or anti-preferred direction of motion for each neuron.The animal's task was to report the direction of motion.Thus, in addition to 5 coherence levels, each trial also varied based on whether the motion direction corresponed to the target in or out of the response field as depicted in Fig. 3. 2We fit a model with K = 10 features and U = 27 units to neural responses from the 1-second stimulus presentation period of the task.Spike counts corresponded to bins of dt = 20ms.In this case, each unit experienced a different number of presentations of each stimulus condition, resulting in a ragged observation matrix.Figure 3 shows the experimental labels from the concatenated stimulus periods, along with labels inferred by our model.Once again, the model has left some features unused, but correctly discerned differences between stimuli in the unlabeled data.Even more importantly, though given ten distinct stimulus classes, the model has clearly inferred the factorial design of the experiment, with the two most prominent features, z 0 and z 1 , capturing the two levels of the crossed factor with the largest effect size: whether or not the relevant target is inside or outside the receptive field of the neuron.Moreover, the model correctly discriminates between two identical stimulus conditions (0% coherence) based on the monkey's eventual decision (In vs Out).In addition, the model correctly captures the initial 200ms "dead time" during the stimulus period, in which firing rates remain at pre-stimulus baseline.Finally, the model uses the same features to describe stimuli with similar firing rates, as well as roughly capturing the time course of stimulus differentiation.As a result, mismatch between the ground truth labels and model-inferred features reflects fundamental ambiguities in the neural data, with the model's latent states taking into account features of spiking response not necessarily known in advance by experimenters.

Movie data
Finally, we applied our model to a data set comprising single neuron recordings made in macaque monkeys as they freely viewed movies of primate social interactions in the wild.These video clips, collected from a rhesus macaque research field site, effectively represent the range of behaviorally relevant stimuli for macaques, including the four "F"s of animal behavior: fighting, fleeing, foraging, and reproduction.Thus this dataset represents a rare opportunity for exploratory feature selection based on neural responses.
On each trial of the experiment, monkeys viewed a random 5-second clip from a movie database (T ≈ 5h, dt = 30 frames/s).On the following trial, monkeys were given a two-alternative choices with options chosen from the set {repeat clip, continue clip, blank screen, new clip}.Since each neuron saw only a subset of clips within the database, and these clips began at with random offsets, observations were sparse, with total number of observations N obs = 7.42 × 10 6 and M = N obs /T ≈ 14.These data form a particularly intriguing application of our model, since the neurons recorded were located in the orbitofrontal and dorsolateral prefrontal cortices, areas not normally associated with low-level visual processing.In fact, there is reason to believe that the relevant features encoded in these areas for such naturalistic stimuli are behavioral and social [21].
Because the behaviors of interest in these stimuli potentially span a large range of time scales, we used semi-Markov dynamics with a maximum duration of D = 500 on the latent states 3 .We fit a model with K = 10 features and the same hierarchical priors on baselines and latent firing rate effects as in previous experiments.In addition, because many neurons in the data set exhibited a transitory increase in firing with movie onset on each trial (independent of movie content), we include this movie onset response as an external covariate x(t) specific to each neuron.We calculated this separately for each recorded unit by averaging spike counts (relative to stimulus onset) across trials, smoothing, and taking the log so that x(t) was coded as a multiplicative rather than an additive effect on firing.This can be loosely viewed as a means of preprocessing our data, in which we account for a large known effect by specifying it as a covariate in the model.
Figure 3: Top: Actual and recovered binary features during the stimulus presentation period.Note that model features 5 -8 are unused and that feature 0 closely tracks the Out feature of the data, albeit delayed.Bottom: Actual and predicted firing rates for the stimulus period.Note that the model infers stimulus categories from the data, including appropriate timing of differentiation between categories.
Figure 4 shows a comparison of Markov and semi-Markov models applied to feature detection in the stimulus set.Labels were calculated as maxima of posterior marginals, independently for each time bin.Clearly, the semi-Markov model captures the rich heterogeneity in the durations of stimulus-driven features in neural spiking, particularly in the on (1) state.This is also visible in the kernel density estimates, which show event durations on the order of tens of seconds for the semi-Markov model, as opposed to mere seconds in the Markov case.Thus semi-Markov models (and other models that take into account multiple timescales) will likely be necessary to capture variations in neural activity on the timescales relevant to natural behaviors.

Discussion
Here, we have proposed and implemented a method for learning features in stimuli via the responses of populations of spiking neurons.This work addresses a growing trend in systems neuroscience -the increasing use of rich and unstructured stimulus sets -without requiring either expert labeling or a metric on the stimulus space.As such, we expect it to be of particular use in disciplines like social neuroscience, olfaction, and other areas in which the real world is complex and strong hypotheses about the forms of the neural code are lacking.By learning features of interest to neural populations directly from neural data, we stand to generate unexpected, more potentially more accurate (less biased) hypotheses regarding the neural representation of the external world.
We have shown that our model is capable of parsimoniously and correctly inferring features in the low signal-to-noise regime of cortical activity, even in the case of independently recorded neurons.Moreover, by employing a fully variational, Bayesian approach to inference, we gain three key advantages: First, we gain the advantages of Bayesianism in general: estimates of confidence in inferences, parsimony and regularization via priors, and the ability to do principled model comparison.Second, variational methods scale well to large datasets.Finally, variational methods are fast, in that they typically converge within only a few tens of iterations and in many case (such as ours) require mostly simple coordinate updates.
Future directions include augmenting the model of overdispersion via an autoregressive process to take into account slow variations in neural activity not driven by the stimulus; adding non-trivial temporal dependence to latent variables in a manner consistent with kernels typically used with Poisson regression models; and an implementation of the model using stochastic variational inference for application to larger, next-generation neural data sets.
Finally, we want the entropy of the variational posterior over z, E q [− log q(z)].We can write this in the form with (η, Ã, π) the parameters of the variational posterior corresponding to the emission, transition, and initial state probabilities of the Markov chain (interpreted as matrices) and Z the normalization constant.From [18], we have that variational updates should give while the effective emission probabilities are Given these update rules, we can then alternate between calculating (η, Ã, π), performing forward-backward to get (ξ, Ξ, log Z) and recalculating (η, Ã, π).

Overdispersion, firing rate effects
Both the case of p(θ) and p(λ) are straightforward.If we ignore subscripts and write p(y) = Ga(a, b), q(y) = Ga(α, β), then where again, log y, y and H[q(y)] are straightforward properties of the gamma distribution.Expectations of the prior parameters are listed in Table 1.Note in the last line that there is no expectation, since we have not assumed a hierarchy over firing rate effects for the covariates, x.As we will see below, the expectations Table 1: Expectations of prior parameters for overdispersion and firing rates under the variational posterior of s, c, and d are themselves straightforward to calculate.

Hyperparameters
As shown in the main text, the hyperparameters c and d, given gamma priors, have conjugate gamma posteriors, so that their contribution to the evidence lower bound, E q log p(γ) q(γ) is a sum of terms of the form In other words, these are straightforward gamma expectations, functions of the prior parameters a and b for each variable and the corresponding posterior parameters α and β.Similarly, the overdispersion terms are exactly the same with the substitutions a → s, d → 1.

Conjugate updates
For updates on the overdispersion, firing rate, and hyperparameter variables, we have the simple conjugate update rules depicted in Table 2.These can be derived either from free-form variational arguments, or exponential family rules, but are in any case straightforward [22].If we assume a Ga(a, b) prior and Ga(α, β) variational posterior, all of these have a simple algebraic update in terms of sufficient statistics.
Here, we overload notation to write N tu = m δ u(m),u δ t(m),t N m , θ u = m δ u(m),u θ m , and make use of H, G, and F as defined in S7 -S9.Indeed, our implementation caches F and G for a substantial speedup (at the cost of additional memory requirements).

Non-conjugate updates
We employ explicit optimization steps for two updates in our iterative Algorithm 1.In each case, we employ an off-the-shelf optimization routine, though more efficient alternatives are likely possible.

Covariate firing rate effects
Because we do not restrict the covariates x(t) to be binary, the expression S6, no longer yields a conditional log probability for λ x in the exponential family.This leaves us with a transcendental equation to solve for β xr .However, from Table 2, we see that α xr t x tr , allowing us to approximate G t ≈ r (α xr /β xr ) xtr as we have done above.Moreover, since the sum t G t ∼ T G, we expect α x /β x ≈ 1 at the optimum for most reasonable datasets.We thus reparameterize β xur = α xur e − ur and write the relevant -dependent piece of the objective function L as We also supply the optimizer with the gradient: where we sum only over observations with u(m) = u.On each iterate, we then optimze L , initialzing β x to the just-updated value of α x .In addition, we do not update firing rate effects for covariates separately, but optimize β xur for all r together.Again, more efficient schemes are no doubt possible.

Semi-Markov duration distribution
If the dwell time distributions of states in the semi-Markov model were truly continuous, the parameters (m, s 2 ) of a lognormal distribution p(d|j) would have a Normal-Gamma conjugate prior, and updates would be closed-form.However, the requirement that the durations be integers and that there exist a maximal duration, D, over which these probability mass functions must normalize results in an extra term in E q [log p] arising from the normalization constant.In this case, we must explicitly optimize for the parameters of the Normal-Gamma prior on (m, s 2 ).However, unlike the case of 2.2.1, these parameters are only updated for one latent feature at a time.Since the Normal-Gamma distribution has only 4 parameters and the number of states of the latent variable is S = 2, this requires updating 4SK parameters, but only taken 4S = 8 at a time, a much more manageable task.
To derive the optimization objective, we begin by noting that, to terms proportional to Ξ and ξ 0 in the expectation of the latent state dynamics, the semi-Markov model adds an additional term djk C(d, j, k)E q log p k (d|j) − log where d is the duration variable, j labels states of the chain (here 0 and 1), and k, as elsewhere, labels chains [23,24,19].C is defined for each state and chain as the probability of a dwell time d in state j conditioned on the event of just having transitioned to j 4 5  Thus the objective to be optimized is E q [p k (d|j)] + E q log p(m, τ ) q(m, τ ) (S19) For the case of p(d|m, s 2 ) lognormal and (m, τ = s −2 ) Normal-Gamma, the terms E q [log p(d)] involve only routine expectations of natural parameters of the Normal-Gamma, and similarly for the expected log prior and entropy in the last term.Only slightly more complicated is the term arising from the normalization constant, which in the standard (µ, λ, α, β) parameterization of the Normal-Gamma takes the form E q [p(d|m, τ )] = dτ dm p(d|i, m, τ )q(m, τ ) = This can be derived either by performing the integral directly or by noting that the result should be proportional to the posterior (Normal-Gamma, by conjugacy) of (m, τ ) after having observed a single data point, log d. 4 Thus C is equivalent to D t|T in [19]. 5We also note that, just as the emission, transition, and initial state probabilities have their counterparts in variational parameters (η, Ã, π) in q(z), so C is matched by a term − djk C(d, j, k)ν djk in Eq[− log q].And in analogy with the HMM case, variation with respect to ν gives Eq [p k (d|j)] , (S18) implying that there are cancellations between log p(N, z|Θ) and log q(z) in L and care must be taken when calculating it [18].

Figure 2 :
Figure 2: Top: Actual and recovered binary features for a subset of stimulus times in the dataset.Note that inferred feature 3 is the inverse of actual feature 0, and that unused features are in gray, indicating a high posterior uncertainty in the model.Bottom left: Population posterior distributions for λ z .Features 0 and 1 are effectively point masses around gain 1 (no effect), while features 2-4 approximate the Ga(1, 1) data-generating model.Bottom right: Normalized mutual information between actual and inferred states.

Figure 4 :
Figure 4: Top: Maximum a posterior inferred features for the Markov and semi-Markov models.Plotted data are for the same stimulus for ≈ 17s of data.Bottom: Estimated dwell time distributions for off (0) and on (1) states in the Markov and semi-Markov models.Dwell times include self-transitions.Dwell times for the Markov case (blue) are clearly lower than those for the semi-Markov case (green).