Neural decoding with visual attention using sequential Monte Carlo for leaky integrate-and-fire neurons

How the brain makes sense of a complicated environment is an important question, and a first step is to be able to reconstruct the stimulus that gives rise to an observed brain response. Neural coding relates neurobiological observations to external stimuli using computational methods. Encoding refers to how a stimulus affects the neuronal output, and entails constructing a neural model and estimating its parameters. Decoding refers to reconstructing the stimulus that led to a given neuronal output. Existing decoding methods rarely explain neuronal responses to complicated stimuli in a principled way. Here we perform neural decoding for a mixture of multiple stimuli using the leaky integrate-and-fire model to describe neural spike trains, under the visual attention hypothesis of probability mixing, in which the neuron attends to only a single stimulus at any given time. We assume either a parallel or a serial processing visual search mechanism when decoding multiple simultaneous neurons. We consider one or multiple stochastic stimuli following Ornstein-Uhlenbeck processes, and dynamic neuronal attention that switches following discrete Markov processes. To decode stimuli in such situations, we develop various sequential Monte Carlo particle methods in different settings. The likelihood of the observed spike trains is obtained through first-passage time probabilities computed by solving the Fokker-Planck equations. We show that the stochastic stimuli can be successfully decoded by sequential Monte Carlo, and that the particle methods perform differently depending on the number of observed spike trains, the number of stimuli, and the model complexity. The proposed novel decoding methods, which analyze neural data through psychological theories of visual attention, provide new perspectives for understanding the brain.


Introduction
Neural coding is the science of characterizing the relationship between a stimulus presented to a neuron or an ensemble of neurons, and the neuronal responses [1]. Neural encoding refers to the map from stimulus to response, i.e., how the neurons respond to a specific stimulus. For example, if we can construct an encoding model, it can be used to predict responses to other stimuli. Neural decoding refers to the reverse map, from response to stimulus, and the challenge is to reconstruct a stimulus, or certain aspects of that stimulus, from the evoked spike train. Neural coding is extensively studied in computational neuroscience. Our aim is to decode complicated multiple stochastic stimuli from neural spike trains. We combine biophysical spiking neural models with visual attention theories, bridging computational neuroscience and cognitive psychology. Following the visual attention model [2,3], attention to complicated multiple stimuli is viewed as probability mixtures. The two visual search mechanisms in psychology, parallel and serial processing [4], are employed for decoding neuron ensembles.
The goal of this paper is to develop, explore and compare various decoding methods based on sequential Monte Carlo for multiple stimuli in a visual attention setting.

Neural decoding
Given neurobiological observations, a decoding algorithm aims at reconstructing the unknown stimulus information encoded by the neural system. Neural decoding plays an important role in understanding the mechanisms of neurons and the brain, and well-performing decoding algorithms constitute necessary components of brain-machine interfaces [5,6]. Different methods have been explored for neural decoding. Some focus on regression-related approaches, building linear models between spike trains and the corresponding stimulus by optimal linear estimation (OLE) [7,8]. Machine learning methods have also been applied to stimulus decoding, such as artificial neural networks [9], kernel regression [10], and a recently developed approach using kernel-based neural metrics [11]. These methods employ general statistical techniques and omit the specific spike-generating mechanism of the neural response. On the other hand, stimulus decoding may directly employ spiking neural models that describe the spike-generating mechanisms from stimuli [12-15]. Various encoding models can be used. Approximate methods using point processes treat the spikes in a spike train as sequential random events, which can be equivalently formulated as generalized linear models (GLM) for model fitting [15,16]. Meanwhile, there are also biophysically motivated methods like integrate-and-fire models, which study the stochastic evolution of the membrane potential. In decoding tasks, these encoding models are used in the posterior distribution to infer the most likely stimuli. Decoding of a constant stimulus can be obtained from the posterior distribution using maximum a posteriori (MAP) or Monte Carlo methods. The decoding of temporal stimuli can be discretized into a sequence of constant decoding tasks, which can be solved by Kalman filtering [17] or sequential Monte Carlo particle methods [18-21].

Modeling visual attention
Stimulus mixture and probability mixing. We define a stimulus mixture to be multiple non-overlapping stimuli inside the receptive field of a neuron. We assume that the neuronal response to a stimulus mixture follows the probability-mixing model [2,22], where the neuron responds at any given time to only one of the single stimuli in the mixture with certain probabilities. In [3], data from MT neurons in macaque monkeys are analyzed, and the probability-mixing model appears to be more in agreement with data compared to the competing response-averaging model. The probability-mixing model enables us to accurately perform decoding, i.e., to recover the single stimulus that caused the response.
Neural behaviour during parallel and serial processing. The two opposing visual search mechanisms of parallel and serial processing have been long debated in psychology, and empirical behavioral experiments have shown evidence supporting both mechanisms [4,23-25].

We describe neural spike trains with the leaky integrate-and-fire (LIF) model: between spike times t_j, the membrane potential is modeled by the solution to the stochastic differential equation

dX(t) = b(X(t), t)dt + σdW(t) = (−a(X(t) − μ) + I(t) + H(t))dt + σdW(t),  X(t_j^+) = x_0,

where t_j^+ denotes the right limit taken at t_j. The drift term b(·) contains three currents: the leak current −a(X(t) − μ), where a > 0 is the decay rate and μ is the reversal potential; the stimulus-driven current I(t); and the post-spike current H(t). The potential X(t) evolves until it reaches the threshold, x_th, where it resets to x_0. The membrane potential X(t) is not measured; only the spike times {t_1, t_2, . . .} are observed. Thus, the scaling of X is arbitrary, and we can use any values for threshold and reset. We set x_0 = 0 and x_th = 1, such that X is measured in units of the distance between reset and spike threshold. The noise is modelled by the standard Wiener process, W(t), with diffusion parameter σ > 0.
The stimulus current I(t) is shaped from the external stimulus S(t) through a stimulus kernel k_s(t): I(t) = ∫_{−∞}^{t} k_s(t − s)S(s) ds. The post-spike current arises from past spikes convolved with a response kernel k_h(t): H(t) = ∫_{−∞}^{t} k_h(t − s) ∑_{t_j} δ(s − t_j) ds, where ∑_{t_j} δ(s − t_j) represents the spike train and δ(·) denotes the Dirac delta function.
We assume a stimulus kernel without delay, such that k_s(t) = δ(t), implying that I(t) = S(t). The response kernel is assumed to be the difference of two exponentials decaying over time, with four positive parameters, η = (η_1, η_2, η_3, η_4). By adjusting the parameters, different kernels are obtained. Three types of kernels are used here, described in Table 1 and illustrated later in the Results section, where the center panels show example spike trains generated from the different kernels and different stimuli.
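As an illustration of the encoding model, a minimal Euler-Maruyama simulation of the LIF dynamics might look as follows. This is a sketch under simplifying assumptions: the post-spike current H(t) is omitted, and the parameter values and function names are illustrative, not the paper's fitted values.

```python
import math
import random

def simulate_lif(stimulus, a=1.0, mu=0.0, sigma=0.5, dt=1e-3,
                 x0=0.0, x_th=1.0, T=5.0, seed=0):
    """Euler-Maruyama simulation of the LIF model
    dX = (-a(X - mu) + I(t)) dt + sigma dW, with reset to x0 at the
    threshold x_th. `stimulus` maps time t to the input current I(t);
    the post-spike current H(t) is omitted here for brevity."""
    rng = random.Random(seed)
    x, t, spikes = x0, 0.0, []
    while t < T:
        drift = -a * (x - mu) + stimulus(t)
        x += drift * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        t += dt
        if x >= x_th:          # threshold crossing: record spike, reset
            spikes.append(t)
            x = x0
    return spikes
```

For a constant suprathreshold current, e.g. `simulate_lif(lambda t: 1.5)`, the model produces a noisy but regular spike train; only these spike times, not the path of X, would be available to the decoder.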

Likelihood of an observed spike train
Suppose there are a total of K stimuli inside the receptive field of the neuron, denoted by S = (S_1, . . ., S_K). Let Y = (Y_1, . . ., Y_M) denote M spike trains. The realizations of stimuli and spike trains are respectively s = (s_1, . . ., s_K) and y = (y_1, . . ., y_M). According to the probability-mixing encoding model, the stimulus-driven current I(t) follows a probability mixture: I(t) = S_k(t) with probability α_k, for k = 1, . . ., K, where ∑_{k=1}^{K} α_k = 1. Then the probability of a spike train y_m generated under the exposure of the K stimuli is also a mixture distribution, p(y_m | s) = ∑_{k=1}^{K} α_k p(y_m | s_k), where p(y_m | s_k) is the probability of generating spike train y_m from the single stimulus s_k. It equals the product of the probability densities of all spike times within y_m = (t_1, t_2, . . .), where the dependence between spike times is accounted for by conditioning on the history of past spike times, H^m_{t_{i−1}}: p(y_m | s_k) = ∏_i g(t_i | s_k, H_{t_{i−1}}), where g(t | s_k, H_{t_{i−1}}) is the conditional probability density of spiking at time t given the kth stimulus and the spike history up to the previous spike time t_{i−1}. The probability density g(·) can be obtained from the density of the first-passage time of model (1).
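The mixture itself is a simple weighted sum. The sketch below assumes the per-stimulus likelihoods p(y_m | s_k) have already been computed (in the paper, from first-passage time densities); the function name and interface are our own.

```python
def mixture_likelihood(per_stimulus_likelihoods, alphas):
    """Probability-mixing likelihood of one spike train:
    p(y | s) = sum_k alpha_k * p(y | s_k), where the attention
    probabilities alpha_k sum to one."""
    assert abs(sum(alphas) - 1.0) < 1e-9, "alphas must sum to 1"
    return sum(a * p for a, p in zip(alphas, per_stimulus_likelihoods))
```

For example, with per-stimulus likelihoods 0.2 and 0.6 and equal attention probabilities, the mixture likelihood is 0.4.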

Decoding of stochastic stimulus mixtures with Markov switching
We consider stochastic stimulus mixtures with Markov attention switching, described by stochastic processes with unknown parameters. The focus is both on estimating the parameters governing the law of the kth stimulus and on decoding the stochastic realization of the stimulus. We discretize the time interval of a trial into smaller intervals of length v, and assume that the neurons can only switch attention between intervals, but attend to the same stimulus throughout any of these small intervals. [30] found that sustained attention naturally fluctuates with a periodicity of 4-8 Hz, thus switching attention at most every 125 ms. In the simulations presented later, we set v = 100 ms. Denote by C_n the index of the attended stimulus at the nth time point, C_n ∈ {1, . . ., K}, n = 1, . . ., N, such that vN is the length of the total observation interval, and let S_n denote the stochastic realization of the attended stimulus at the nth time point. In the decoding algorithm, it is assumed that S_n is constant, thus approximating the stochastic stimulus process by a piecewise constant process. Assume the neuron switches attention between two consecutive time intervals following a Markov chain with transition probability matrix (TPM) Γ. Denote the elements of Γ by λ_kl for k, l = 1, . . ., K. Thus, λ_kl = P(C_n = l | C_n−1 = k) is the probability that at time n the attended stimulus is S_l, given that the neuron attended stimulus S_k at time n − 1. The stochastic stimuli are described by Ornstein-Uhlenbeck (OU) processes. For a mixture of K stimuli S = (S_1, . . ., S_K), the kth stimulus component is governed by the stochastic differential equation (SDE) dS_k(t) = −β_k S_k(t) dt + γ dW(t), where β_k and γ are parameters and W(t) is a standard Wiener process. Only the drift parameter β_k is stimulus specific; the diffusion parameter γ is assumed to be the same for all stimuli in the mixture.
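To make the generative setup concrete, the following sketch simulates K OU stimuli by Euler-Maruyama together with an interval-wise Markov attention sequence. It assumes a mean-zero OU drift and illustrative parameter values; names and interface are our own.

```python
import math
import random

def simulate_stimuli_with_attention(betas, gamma, tpm, n_steps,
                                    v=0.1, dt=0.01, seed=0):
    """Simulate K OU stimuli dS_k = -beta_k S_k dt + gamma dW
    (Euler-Maruyama on a grid of size dt), plus a Markov chain C_n over
    the K stimulus indices that switches only between intervals of
    length v. Returns the stimulus paths and the attention sequence."""
    rng = random.Random(seed)
    K = len(betas)
    sub = round(v / dt)                      # Euler steps per interval
    s = [0.0] * K
    paths = [[] for _ in range(K)]
    c = rng.randrange(K)                     # initial attended index
    attention = []
    for n in range(n_steps):
        attention.append(c)
        for _ in range(sub):
            for k in range(K):
                s[k] += (-betas[k] * s[k] * dt
                         + gamma * math.sqrt(dt) * rng.gauss(0.0, 1.0))
                paths[k].append(s[k])
        # sample the next attended index from row c of the TPM
        u, acc = rng.random(), 0.0
        for l, p in enumerate(tpm[c]):
            acc += p
            if u <= acc:
                c = l
                break
    return paths, attention
```

With a sticky TPM such as [[0.9, 0.1], [0.1, 0.9]], the attention sequence stays on one stimulus for several intervals before switching, matching the slow-switching regime assumed by the discretization.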
The parameters describing the stimulus are unknown, namely γ, β = (β_1, . . ., β_K) and the TPM Γ, so that θ = (γ, β, Γ). The parameter space is thus Θ = R_+ × R^K × O, where O is the space of K × K stochastic matrices. For simplicity, the number of mixture components K is assumed to be known. If K is unknown, the algorithm is run for candidate values K = 1, 2, . . ., and the value that minimizes the BIC is chosen. We focus on various Monte Carlo techniques for decoding, including the bootstrap filter, the auxiliary particle filter with parameter learning, and fixed-lag and fixed-interval smoothing; see [31] for a review of such methods. The goal is to decode the stochastic realization of S_n for n = 1, . . ., N. We present online methods, where parameter estimates are updated sequentially as observations become available, and we also explore smoothing techniques, where some delay is allowed before the stimulus is reported.

Sequential Monte Carlo methods
We will now establish sequential Monte Carlo methods for decoding. In Table 2 below, we summarize the methods that are developed and compared. The details are described in the following sections.
To represent the various methods, we use a unified naming scheme: {·, i, m}{BF, APF}{·, g}-{F, lag, FB}. The prefix i or m stands for individual decoding or marginal-likelihood decoding in parallel processing, respectively. The main term, BF or APF, indicates the filtering algorithm. The suffix g stands for using the geometric mean of the likelihood value. Finally, the last part indicates whether we use filtering (F), fixed-lag smoothing (lag) or fixed-interval smoothing with the forward-filtering backward-smoothing algorithm (FB). For example, the method iBF-lag represents individual decoding in parallel processing using bootstrap filtering with fixed-lag smoothing.

State space model. We use a state-space model to describe the evolution of the stochastic stimuli. The state space is extended to include not only the stimuli S, but also the unknown stimulus-related parameters, which are included for the construction of the decoding algorithms. Fig 1 shows the graph of the state-space model. The stimuli S are continuous-state Markov processes, and the attention states C are discrete-state Markov processes. The spike trains Y depend on both the stimuli and the attention, and are also affected by the spiking history. S, C and Y may be multi-dimensional, containing multiple stimuli and neurons. The transitions of the states S and C are parameterized by θ = (γ, β, Γ). In the algorithms, the parameters θ are also treated as states propagating according to the Markov processes given in (10), but they are not shown in the graph. Denote by Z_n = (S_n, C_n, θ_n) the full hidden states, and by z_n a realization of Z_n. Similar methods were used in [32], where the authors employed a state-space model describing spike train data with Poisson distributions and an animal's position with Gaussian noise. Sequential Monte Carlo methods were used to estimate parameters and decode the position based on spike trains.
Here we include latent states explaining visual attention and describe the spike trains with leaky integrate-and-fire models. The full states are Z_n = (S_n, C_n, θ_n) = (S_n, C_n, γ_n, β_n, Γ_n). The subscript n stands for the current time in the state evolution. Note that, even if Γ, γ and β are constant in model (7), the filters will at each time point update the information regarding their values, and thus they are allowed to change at each time point. Ideally, they converge towards their true values as more spikes are used in the decoding algorithm. The propagation of states at time n is given by:

{λ_kl,n}_{l=1,...,K} ~ Dir({V_λ^{−1} λ_kl,n−1}_{l=1,...,K}), with ∑_{l=1}^{K} λ_kl,n = 1 and λ_kl,n ≥ 0,
C_n ~ Γ(C_n−1), C_n ∈ {1, . . ., K},

for k, l = 1, . . ., K. The state propagation is explained as follows. Each row of the TPM is sampled from a Dirichlet distribution, with parameters being the probabilities in the previous time step multiplied by a concentration parameter V_λ^{−1} controlling the sampling variance. The index of the attended stimulus is sampled from a multinomial distribution given by row C_n−1 of the TPM, Γ(C_n−1). The parameters γ_n and β_n are updated using Gaussian distributions with variances V_γ and V_β, respectively. Since γ_n > 0, a positively truncated Gaussian distribution is used. The strength of each stimulus, S_n^k, is updated according to the OU model, following a Gaussian distribution with mean M_k. The likelihood of the spike train given the parameters is obtained from the encoding model. Until we deal with multiple simultaneous spike trains, we focus on decoding single spike trains, so for readability we write y for a single spike train. Let y_n = (t_1, . . ., t_{L_n}) denote the spike train within the duration of the nth interval, where y_n may be empty if no spikes were fired.
Since the intervals are short, we need to take boundary effects into account, i.e., the time from the left boundary of the interval to the first spike, and the time from the last spike to the right boundary. Let T_b and T_e denote the beginning and the end of the interval, respectively. Then, if y_n is non-empty, T_b ≤ t_1 < · · · < t_{L_n} ≤ T_e. Given stimuli S_1:n = s_1:n and attention indices C_1:n = c_1:n from the first to the nth time step, the likelihood of y_n is then

p(y_n | s_n^{c_n}, s_{n−1}^{c_{n−1}}, y_1:n−1) = g(t_1 | s_n^{c_n}, s_{n−1}^{c_{n−1}}, H_{T_b}) (boundary ISI at the beginning of the interval)
  × ∏_{l=2}^{L_n} g(t_l | s_n^{c_n}, H_{t_{l−1}}) (complete ISIs inside the interval)
  × P(no spike in (t_{L_n}, T_e] | s_n^{c_n}, H_{t_{L_n}}) (survival beyond the last spike).

If there are no spikes in the interval, the likelihood is given by the survival probability over the entire interval. The decoding of stimuli aims at obtaining the conditional distribution p(s_1:n | y_1:n). We have different types of filtering based on the distribution p(s_1:n | y_1:n).
Online filtering refers to the distribution p(s 1:n |y 1:n ) or the marginal distribution p(s n |y 1:n ) where n represents the current time step. In online filtering, when new data y n arrive, the unknown hidden state s n is inferred, and the decoding procedure is online.
Offline smoothing refers to the distribution p(s 1:N |y 1:N ) or the marginal p(s n |y 1:N ), where N represents the total time and n is some past time. In offline smoothing, we infer any states in the past s n , n = 1, . . ., N, after we observe all data y 1:N , and the decoding procedure is offline.
A third type is a semi-online smoothing, where we target the distribution p(s n − Δn |y 1:n ), for Δn > 0. We infer the state at a past time s n−Δn after we receive the data at the current time y 1:n . This semi-online decoding procedure can be conducted if we allow for some delay Δn before reporting the online result.
A bootstrap particle filter. Sequential Monte Carlo methods aim to obtain the distribution (13) through sequential sampling over time, and the strategy relies on the following decomposition:

p(z_1:n | y_1:n) = p(z_1:n−1 | y_1:n−1) p(z_n | z_n−1) p(y_n | z_n, y_1:n−1) / p(y_n | y_1:n−1).  (14)

The method is to sample a new z_n at each time step n and sequentially update the weight of each sample z_n based on the above decomposition [31]. In the bootstrap particle filter (BF), z_n is sampled from p(z_n | z_n−1) and the weight of each sample is updated using p(y_n | z_n, y_1:n−1). Each particle is a sample from the state space at all time points, where we write Z_n,i = z_n,i for the sampled value of Z_n of particle i. Particle filtering approximates the distribution p(z_1:n | y_1:n) by the empirical distribution using I particles:

p̂(z_1:n | y_1:n) = ∑_{i=1}^{I} 1_{{z_1:n = z_1:n,i}} w̄_n,i,  (15)

where w̄_n,i denotes the normalized weight of particle i at time n. Since we are interested only in the marginal distribution of the stimuli, p(s_1:n | y_1:n), we use the marginal

p̂(s_1:n | y_1:n) = ∑_{i=1}^{I} 1_{{s_1:n = s_1:n,i}} w̄_n,i  (16)

and also p̂(s_n | y_1:n) = ∑_{i=1}^{I} 1_{{s_n = s_n,i}} w̄_n,i, with the same set of weight values. Then the stimulus at time n is estimated by the posterior mean, Ŝ_n = ∑_{i=1}^{I} w̄_n,i S_{n,i}^{C_{n,i}}. Using the state evolution and the likelihood, the BF is formulated in Algorithm 1. In this particle filter, each particle has the attended target C_n as a state, and only the information about the attended stimulus is used to calculate the weights. In the first step at n = 1, the states are initialized by sampling from uniform distributions. The attention state C is sampled from a discrete uniform distribution on the indices of the K stimuli, U{1, . . ., K}, and the other states are sampled from continuous uniform distributions, with intervals given in the Results section.
In this and the subsequent filters, we resample particles using systematic resampling to avoid weight degeneracy, which is conducted as follows. Denote by U_j, for j = 0, 1, . . ., I − 1, a total of I random grid variables. A uniform variable Ū is sampled from U(0, 1]. The grid variables follow U_j = (j + Ū)/I. The number of duplicates of particle i, i = 1, 2, . . ., I, after resampling is

W_i = #{j : ∑_{i'=1}^{i−1} w̄_n,i' < U_j ≤ ∑_{i'=1}^{i} w̄_n,i'},

i.e., the number of grid variables that fall into the ith increment of the cumulative sum of the normalized weights. It follows that ∑_{i=1}^{I} W_i = I and W_i ≥ 0 for i = 1, 2, . . ., I. Afterwards, we set the weight of all resampled particles to 1/I.
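The systematic resampling step can be sketched as follows; the function name and interface are our own.

```python
import random

def systematic_resample(weights, rng=None):
    """Systematic resampling: a single uniform draw places I evenly
    spaced grid points on (0, 1]; particle i is duplicated once for
    every grid point that falls inside its increment of the cumulative
    sum of the normalized weights. Returns the duplicate counts W_i."""
    rng = rng or random.Random(0)
    I = len(weights)
    total = sum(weights)
    norm = [w / total for w in weights]
    u_bar = rng.random()                     # the single uniform variable
    grid = [(j + u_bar) / I for j in range(I)]
    counts = [0] * I
    cum, j = 0.0, 0
    for i in range(I):
        cum += norm[i]
        while j < I and grid[j] <= cum:      # grid points in ith increment
            counts[i] += 1
            j += 1
    return counts
```

Because all grid points derive from one uniform draw, the counts always sum to I and their variance is lower than under multinomial resampling.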

Algorithm 1 Bootstrap particle filter, BF
Initialization: at n = 1
1: for particle i = 1, . . ., I do
2:  Sample each row of Γ using the Dirichlet distribution with equal weights, and sample C_1,i, γ_1,i, β_1,i and S_1,i from uniform distributions
3:  Calculate the weights, w_i = p(y_1 | S_{1,i}^{C_{1,i}})
4: end for
5: Calculate normalized weights, w̄_i = w_i / ∑_i w_i
Iteration: for n = 2, . . ., N
6: Resample particles (systematic resampling)
7: for particle i = 1, . . ., I do
8:  Propagate states: first Γ_n,i, then C_n,i, γ_n,i, β_n,i, and finally S_n,i, from distributions (10)
9:  Calculate the weights, w_i = p(y_n | S_{n,i}^{C_{n,i}}, S_{n−1,i}^{C_{n−1,i}}, y_1:n−1)
10: end for
11: Calculate normalized weights, w̄_i = w_i / ∑_i w_i
12: Estimate the attended stimulus, Ŝ_n = ∑_{i=1}^{I} w̄_i S_{n,i}^{C_{n,i}}

Auxiliary particle filter with parameter estimation. In the bootstrap filter, the resampling weights are calculated from past observations. A more reasonable idea is to calculate the weights based on the current observation. In the auxiliary particle filter (APF) [33], the resampling relies on auxiliary variables, for example the likelihood of the current observation conditional on the expected states, u_n = w_n−1 p(y_n | μ_n, S_n−1, y_1:n−1), with μ_n = E(S_n | S_n−1). The idea is that resampling based on the current observation provides particles that are distributed more closely to the posterior at the following time point. Therefore, the weights degenerate less and the effective number of particles is larger.
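The propagate/weight/resample cycle shared by these filters can be illustrated on a deliberately simplified toy state-space model (random-walk state, Gaussian observation). This is only the skeleton of the bootstrap filter, not the paper's LIF likelihood or attention states; all names and parameter values are illustrative.

```python
import math
import random

def bootstrap_filter(obs, n_particles=500, sigma_s=0.3, sigma_y=0.5, seed=0):
    """Minimal bootstrap filter on a toy model: state s_n is a Gaussian
    random walk, observation y_n ~ N(s_n, sigma_y^2). Each step
    propagates particles from the transition, weights them by the
    observation likelihood, reports the posterior mean and resamples."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    estimates = []
    for y in obs:
        # propagate from p(s_n | s_{n-1})
        particles = [s + rng.gauss(0.0, sigma_s) for s in particles]
        # weight with the observation likelihood p(y_n | s_n)
        weights = [math.exp(-0.5 * ((y - s) / sigma_y) ** 2)
                   for s in particles]
        total = sum(weights)
        weights = [w / total for w in weights]
        # posterior-mean estimate, then resampling
        estimates.append(sum(w * s for w, s in zip(weights, particles)))
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates
```

Fed a constant observation sequence, the filter's posterior mean converges toward the observed value, mirroring how the decoded stimulus tracks the data in Algorithm 1.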
The stimulus model contains fixed hyperparameters θ that are estimated using artificial propagation, which introduces information loss over time [34]. To overcome this, we propagate the hyperparameter γ_n using kernel smoothing as proposed by [34]. The propagation of γ_n follows a Gaussian distribution,

γ_n ~ N(ψγ_n−1 + (1 − ψ)γ̄_n, h²v_n),  (23)

where γ̄_n and v_n are the mean and the variance of the posterior p(γ | y_1:n), evaluated from the particles at time n. In practice, we use a truncated version of the Gaussian distribution in (23), since the parameter γ is positive. The constants ψ = (3δ − 1)/(2δ) and h² = 1 − ψ² are evaluated using a discount factor δ ∈ (0, 1], typically around 0.95-0.99 as recommended by the authors. For the parameters Γ_n and β_n, which depend on the stimulus components, we use the same propagation distributions as before, due to the problem of label switching in mixture models [35,36]. It is difficult to evaluate the posterior of the elements of Γ_n and β_n because each particle can label each component differently.
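The kernel-smoothing propagation for a positive scalar parameter can be sketched as follows. This is an assumption-laden sketch: positivity is enforced by crude rejection rather than an exact truncated Gaussian, and the function name is our own.

```python
import math
import random

def kernel_smooth_propagate(gammas, weights, delta=0.98, rng=None):
    """Kernel smoothing for a static parameter as in (23): shrink each
    particle's value toward the weighted posterior mean and add Gaussian
    jitter with variance h^2 * v_n, so the propagated cloud preserves
    the posterior mean and variance. Negative draws are rejected as a
    stand-in for the truncated Gaussian."""
    rng = rng or random.Random(0)
    psi = (3 * delta - 1) / (2 * delta)
    h2 = 1 - psi ** 2
    mean = sum(w * g for w, g in zip(weights, gammas))
    var = sum(w * (g - mean) ** 2 for w, g in zip(weights, gammas))
    out = []
    for g in gammas:
        loc = psi * g + (1 - psi) * mean       # shrink toward the mean
        x = loc + math.sqrt(h2 * var) * rng.gauss(0.0, 1.0)
        while x <= 0:                          # crude positivity truncation
            x = loc + math.sqrt(h2 * var) * rng.gauss(0.0, 1.0)
        out.append(x)
    return out
```

With δ near 1, ψ is close to 1 and h² is small, so each particle moves little: the jitter counteracts sample impoverishment without inflating the posterior variance.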
Due to label switching, each particle could label the stimulus components differently, and it is then difficult to produce correct output with the posterior mean [35]. Here we use a simple method: the stimuli in each particle are sorted first, and the posterior mean is then calculated for the sorted stimuli. The hope is that after sorting, each particle labels the components in the same order. The bootstrap particle filter with marginal likelihood is formulated in Algorithm 3.
For single spike trains, we cannot decode all components of the stimulus mixture because only one is attended at a time. Therefore marginal likelihood is less appealing for single spike train decoding. However, if we have multiple independent observations at each time point, marginal likelihood will be more appropriate.
Algorithm 3 Bootstrap particle filter with marginal likelihood, mBF
Initialization: at n = 1
1: for particle i = 1, . . ., I do
2:  Sample each row of Γ using the Dirichlet distribution with equal weights, and sample the initial states from uniform distributions
3:  Calculate the weights, w_i = p(y_1 | S_1,i)
4: end for
5: Calculate normalized weights, w̄_i = w_i / ∑_i w_i
Iteration: for n = 2, . . ., N
6: Resample particles (systematic resampling)
7: for particle i = 1, . . ., I do
8:  Propagate states: first Γ_n,i, then γ_n,i, β_n,i and finally S_n,i, from distributions (10)
9:  Calculate the weights, w_i = p(y_n | S_n,i, S_n−1,i, y_1:n−1)
10: end for
11: Calculate normalized weights, w̄_i = w_i / ∑_i w_i
12: Estimate all components S_n = (S_n^1, . . ., S_n^K) using Ŝ_n^k = ∑_{i=1}^{I} w̄_i S_n,i^k on sorted stimulus components

Auxiliary particle filtering with parameter estimation and marginal likelihood. The ideas of the APF and parameter learning using kernel smoothing can also be applied to the particle filter with marginal likelihood. We calculate the first-stage weights using the marginal likelihood, u_n = w_n−1 p(y_n | μ_n, S_n−1, y_1:n−1), where μ_n = E(S_n | S_n−1) is the expectation of all components of S_n. The calculation of the marginal likelihood p(y_n | μ_n, S_n−1, y_1:n−1) follows the same approach as in Eq (24). Due to label switching, only the propagation of the common parameter γ_n is done using the kernel smoothing method of [34]. The algorithm is formulated in Algorithm 4.
Algorithm 4 Auxiliary particle filter with kernel smoothing and marginal likelihood, mAPF
Initialization: at n = 1
1: for particle i = 1, . . ., I do
2:  Sample each row of Γ using the Dirichlet distribution with equal weights, and sample the initial states from uniform distributions
3:  Calculate the weights, w_i = p(y_1 | S_1,i)
4: end for
Iteration: for n = 2, . . ., N
5: for particle i = 1, . . ., I do
6:  Calculate μ_n,i = E(S_n,i | S_n−1,i, γ_n−1,i)
7:  Calculate the first-stage weight, u_i = w_i p(y_n | μ_n,i, S_n−1,i, y_1:n−1)
8: end for
9: Resample particles (systematic resampling) using {u_i}, giving a new set of particles N
10: for particle j ∈ N do
11:  Propagate γ_n,j using (23), then β_n,j, Γ_n,j and finally S_n,j
12:  Evaluate the weight, w_j = p(y_n | S_n,j, S_n−1,j, y_1:n−1) / p(y_n | μ_n,j, S_n−1,j, y_1:n−1)
13: end for
14: Normalize weights and output the estimate based on sorted stimulus components

Decoding from multiple spike trains with serial and parallel processing. Now we consider multiple neurons simultaneously recorded in one trial, providing multiple spike trains. Since stochastic stimuli contain inevitable noise and cannot be reproduced by repeated trials in real applications, all estimates of the stimuli depend entirely on the spike trains from one trial. Thus, the attentional behavior of the simultaneously recorded neurons is of great importance for recovering the full information about the stimuli.
For multiple, simultaneously recorded spike trains we consider two opposing hypotheses for visual search in neuronal attention, namely the serial and the parallel processing. In serial processing, all stimuli are processed sequentially. The neural interpretation is that all neurons attend to the same stimulus at the same time, and switch to another all together. Therefore, all spike trains would have similar spiking patterns. On the contrary, in parallel processing, stimuli are processed in parallel. Each neuron attends its own stimulus and can switch to another stimulus independently of the other neurons. The spike trains are then distinct from each other.
Serial processing. For stimulus decoding using particle methods, serial processing essentially means an increase of the observation size at each time point, making the decoding more accurate. However, it only decodes the attended stimulus at any time, and the data contain no information about the other stimuli at that time point. For M spike trains, y = {y^1, y^2, . . ., y^M}, the likelihood function under the serial processing assumption within a small interval is then

p(y_n | S_n^{c_n}, S_n−1^{c_n−1}, y_1:n−1) = ∏_{m=1}^{M} p(y_n^m | S_n^{c_n}, S_n−1^{c_n−1}, y_1:n−1^m),

where the right hand side is evaluated using expression (11).

Parallel processing. In parallel processing, each spike train has its own attended stimulus. Stimulus decoding can then estimate multiple components of the mixture. Each single spike train is decoded independently using Algorithm 1 or 2, which produces estimates of each neuron's attended stimulus at each time point; the results from all spike trains then give an empirical distribution of the stimulus mixture at each time point. We then run a cluster analysis at each time point in one-dimensional space based on the stimulus estimates. Since there are outliers (see the Results section), we apply k-medoids clustering [37, 38, chap. 14] using the square root of the Euclidean distance as the dissimilarity measure. The k-medoids clustering is preferred over k-means because k-medoids is more robust against outliers [38]. Furthermore, the square root of the Euclidean distance puts less weight on extreme outliers than the Euclidean distance itself. Finally, we use the median of each cluster as the estimate of each component of the stimulus mixture.
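A minimal one-dimensional k-medoids with the square-root-Euclidean dissimilarity might look as follows. This is an illustrative sketch, not the paper's exact procedure: the initialization by sorted quantiles and the function name are our own choices.

```python
def k_medoids_1d(points, k, n_iter=20):
    """Toy 1-D k-medoids using sqrt(|a - b|) as the dissimilarity, which
    down-weights extreme outliers relative to the Euclidean distance.
    Medoids are initialized at sorted quantiles, then alternately
    reassigned and recomputed until they stop moving."""
    pts = sorted(points)
    medoids = [pts[(2 * j + 1) * len(pts) // (2 * k)] for j in range(k)]
    dis = lambda a, b: abs(a - b) ** 0.5
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: dis(p, medoids[j]))
            clusters[j].append(p)
        # the new medoid of each cluster minimizes the summed dissimilarity
        new = [min(c, key=lambda m: sum(dis(m, p) for p in c)) if c
               else medoids[j] for j, c in enumerate(clusters)]
        if new == medoids:
            break
        medoids = new
    return medoids, clusters
```

On data with two tight groups plus a distant outlier, the medoids stay inside the groups instead of being dragged toward the outlier, which is the property motivating k-medoids over k-means here.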
Another decoding method for parallel processing is to exploit the marginal likelihood, since we have multiple independent observations. Each particle can now decode all stimulus components, and all decoded components are used for the output estimation. When calculating the weights, we need the likelihood, which is the product of the marginal likelihoods of all spike trains:

p(y_n | S_n, S_n−1, y_1:n−1) = ∏_{m=1}^{M} p(y_n^m | S_n, S_n−1, y_1:n−1^m),  (30)

where the right hand side is evaluated using Eq (24).

Adjusting auxiliary variables for large data size. In Algorithms 2 and 4, based on the APF for population decoding, the auxiliary variables are calculated using the likelihood, which can take extreme values if the sample size is large, e.g., when the data contain multiple spike trains. The consequence is that only a few particles with extreme weight values survive the resampling, reducing the posterior variance and leading to degeneracy of the parameter learning [39,40]. To slow down the degeneracy, we use the geometric mean of the likelihood value over the number of spike trains,

p̃(y_n | μ_n, S_n−1, y_1:n−1) = (∏_{m=1}^{M} p(y_n^m | μ_n, S_n−1, y_1:n−1^m))^{1/M},

when calculating the auxiliary variables in Algorithms 2 and 4.
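In practice the geometric mean is best computed in log space, which also avoids numerical underflow when the product of M small likelihoods is formed; the function name is our own.

```python
import math

def geometric_mean_likelihood(per_train_likelihoods):
    """Geometric mean of the per-spike-train likelihoods,
    (prod_m p_m)^(1/M), computed via the mean of the logs so that a
    product of many small values does not underflow to zero."""
    M = len(per_train_likelihoods)
    return math.exp(sum(math.log(p) for p in per_train_likelihoods) / M)
```

For example, two trains with likelihoods 4.0 and 1.0 give a geometric mean of 2.0, whereas their product would be 4.0; the tempered value keeps the auxiliary weights on a comparable scale regardless of M.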
Semi-online smoothing. The above online algorithms return estimates of stimuli by approximating the filtering probability conditional on the observation up to the current time, p(s 1:n |y 1:n ). An alternative is offline methods that make use of later observations or the entire data set when estimating the stimuli at a past time point. This posterior is referred to as the smoothing distribution. A full-length smoothing reports the posterior of the stimulus at any time n conditional on all observations over 1: N, p(s n |y 1:N ), but we can also apply partial smoothing when only certain delays are allowed. Say we need to report the stimulus after a delay of Δn time points, then we can decode the stimulus at time n using partial smoothing, p(s n−Δn |y 1:n ). Thus, filtering does real-time online decoding, while smoothing does semi-online decoding with some delay or offline decoding after the full observation. Here we pursue the semi-online decoding allowing a delay of Δn before reporting the stimulus, p(s n−Δn |y 1:n ). Two smoothing methods have been tried, the fixed-lag smoothing and the fixed-interval smoothing [41].
In fixed-lag smoothing, we simply marginalize the filtering distribution p(s_1:n | y_1:n) at time n − Δn:

p̂(s_n−Δn | y_1:n) = ∑_{i=1}^{I} 1_{{s_n−Δn = s_n−Δn,i}} w̄_n,i,

where the weights are the same as the online filtering weights. We then estimate the stimulus at time n − Δn by the posterior mean, Ŝ_n−Δn = ∑_{i=1}^{I} w̄_n,i S_n−Δn,i^{C_n−Δn,i}. This requires additional memory to store the history of S.
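A fixed-lag estimate only needs a rolling buffer of past particle states. The sketch below ignores resampling ancestry for brevity (after a resampling step the stored histories would have to be reindexed accordingly); the class name and interface are our own.

```python
from collections import deque

class FixedLagSmoother:
    """Keep a rolling window of each particle's past states; the
    fixed-lag estimate at time n - dn reuses the *current* filtering
    weights together with the states stored dn steps earlier."""
    def __init__(self, dn):
        self.dn = dn
        self.history = deque(maxlen=dn + 1)  # history[0] is time n - dn
    def step(self, particle_states, weights):
        self.history.append(list(particle_states))
        if len(self.history) <= self.dn:
            return None                      # delay not yet elapsed
        past = self.history[0]
        return sum(w * s for w, s in zip(weights, past))
```

The memory cost is just (Δn + 1) snapshots of the particle cloud, which is the "additional memory to store the history of S" mentioned above.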
In fixed-interval smoothing we apply the forward-filtering backward-smoothing algorithm, and calculate the smoothing distribution p(s n−Δn |y 1:n ) for the desired time n − Δn, instead of using the joint filtering distribution p(s 1:n |y 1:n ). The smoothing distribution p(s n−Δn |y 1:n ) is obtained using recursive backward smoothing from n after a full forward filtering up to n [41]. For the semi-online smoothing at n − Δn, we keep the online filtering running. Whenever we receive new data y n , we proceed with the online filtering to obtain p(s 1:n |y 1:n ) and go back Δn time steps to obtain the smoothing distribution p(s n−Δn |y 1:n ). See Appendix II: Forward-Filtering Backward-Smoothing for a full description of the forward-filtering backward-smoothing algorithm.
Continuous-time switching. All the decoding algorithms assume that neuronal attention is fixed within intervals of duration 100 ms, and only switches between two intervals. To test how robust the algorithms are when this assumption is violated, we also simulate spike trains with continuous-time switching, i.e., the attentional switching need not take place exactly between two intervals. One example, used in the simulations, is switching that follows a Poisson process. In that case, decoding with discretization will be less accurate. However, if the switching rate is sufficiently low, such that the average inter-switch interval is much longer than the discretized intervals, the Poisson attentional switching is well approximated by the discretized approach.
A fixed TPM on the discretized time grid approximates the Poisson switching model well because of the memoryless property of the Poisson process. Moreover, since the TPM is updated at each time point as a latent state, the model is easy to extend to non-Poissonian switching with memory effects by adapting the TPM to a specific model; this is not pursued here.
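To illustrate, a sketch of how well a fixed per-interval TPM approximates Poisson switching for a two-state example (the switching rate is a hypothetical value; the 0.1 s grid matches the discretization above):

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical switching rate; the 100 ms interval matches the paper.
rate = 0.5          # Poisson switching rate per second
dt = 0.1            # discretized interval of 100 ms

# Generator of a two-state continuous-time switching process: the neuron
# leaves its current state at the Poisson rate and jumps to the other one.
Q = np.array([[-rate, rate],
              [rate, -rate]])

# Exact TPM over one discretized interval; because the process is
# memoryless, applying this fixed TPM at the grid points reproduces the
# marginal switching statistics.
P = expm(Q * dt)

# With rate*dt = 0.05 << 1, the mean inter-switch interval (1/rate = 2 s)
# is much longer than the 0.1 s grid, so P is close to the first-order
# approximation with off-diagonal entries rate*dt.
approx = np.array([[1 - rate * dt, rate * dt],
                   [rate * dt, 1 - rate * dt]])
```

This makes the claim in the text concrete: when the inter-switch interval dominates the grid spacing, the discretized TPM and the continuous-time model agree to first order in rate·dt.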

Results
Throughout the following examples, we use the parameters for the LIF encoding model shown in Table 3. Fig 2 illustrates some realizations of spike trains generated from the encoding model using different response kernels and stimuli.
In the decoding simulations, we perform many repeated trials. In each trial, we simulate realizations of the stochastic stimuli and the spike trains, and then perform decoding with the sequential Monte Carlo particle methods. Specifically, we simulate K new stimuli according to the OU model. Each spike train is generated using the simulated stimuli within the period [1, 6] s (a period of 5 s after a 1 s burn-in). The time step for generating the stimulus is 0.01 s. We then decode the stochastic mixtures from the spike trains.
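For concreteness, here is a sketch of exact OU simulation on the 0.01 s grid with a 1 s burn-in; the parametrization (mean-reversion rate `theta`, mean level `beta`) and the numerical values are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def simulate_ou(beta, gamma, theta, dt, n_steps, s0, rng):
    """Exact sampling of an OU stimulus on a regular grid.

    Assumes the OU form dS_t = theta*(beta - S_t) dt + gamma dW_t;
    beta (level) and gamma (diffusion) follow the paper's parameter
    names, while theta and the exact form are illustrative assumptions.
    """
    s = np.empty(n_steps + 1)
    s[0] = s0
    a = np.exp(-theta * dt)                         # one-step decay factor
    sd = gamma * np.sqrt((1 - a**2) / (2 * theta))  # exact one-step std
    for k in range(n_steps):
        s[k + 1] = beta + (s[k] - beta) * a + sd * rng.standard_normal()
    return s

rng = np.random.default_rng(0)
# Simulate 6 s at 0.01 s resolution and discard the 1 s burn-in,
# mirroring the setup described above (parameter values hypothetical).
path = simulate_ou(beta=100.0, gamma=20.0, theta=1.0, dt=0.01,
                   n_steps=600, s0=100.0, rng=rng)
stimulus = path[100:]   # keep the 5 s after burn-in
```

Using the exact one-step transition (rather than Euler steps) keeps the simulated stimulus distributionally correct at any grid size.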
The root mean squared deviation (RMSD) between true and decoded stimuli is used to evaluate the performance. Since the stochastic stimuli are simulated with steps of 0.01 s and we approximate the stochastic process with a discretized piecewise constant function with steps of 0.1 s, we can never achieve a perfect decoding and the RMSD will always be greater than 0. To take this into account, a relative root mean squared deviation (rRMSD) is used,

rRMSD = √( Σ_{n=1}^{N} Σ_{l=1}^{10} (S_{n,l} − Ŝ_n)² ) / √( Σ_{n=1}^{N} Σ_{l=1}^{10} (S_{n,l} − Ŝ*_n)² ),

where N is the number of discretized intervals, l = 1, 2, …, 10 = 0.1 s / 0.01 s indexes the simulation steps within each discretized interval, S_{n,l} denotes the true stimulus (different for each n and l), Ŝ_n is the prediction of the stimulus, and Ŝ*_n is the artificial stimulus that minimizes the RMSD, Ŝ*_n = (1/10) Σ_{l=1}^{10} S_{n,l}. The best achievable value of rRMSD is then 1.

The effective sample size (ESS) measures the weight degeneracy of the sequential Monte Carlo methods. The ESS at time n for I particles with normalized weights w_n^{(i)} is

(N_eff)_n = 1 / Σ_{i=1}^{I} (w_n^{(i)})².

If the weights are evenly spread, (N_eff)_n attains its maximum value I. The smaller the ESS, the less effective the particles are in representing the distribution. The performance of the different particle methods is compared using rRMSD, ESS and the trace of parameter learning over time.
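These two performance measures can be implemented directly; a minimal sketch (the array shapes are an assumed layout):

```python
import numpy as np

def rrmsd(true_fine, decoded_coarse):
    """Relative RMSD between the fine-grid true stimulus and the
    piecewise-constant decoded stimulus.

    true_fine: shape (N, 10), the values S_{n,l} with 10 simulation
        steps of 0.01 s per 0.1 s decoding interval.
    decoded_coarse: shape (N,), one estimate per interval.
    """
    best = true_fine.mean(axis=1)        # interval means minimize the RMSD
    num = np.sqrt(((true_fine - decoded_coarse[:, None]) ** 2).mean())
    den = np.sqrt(((true_fine - best[:, None]) ** 2).mean())
    return num / den

def ess(weights):
    """Effective sample size 1 / sum_i (w^(i))^2 for normalized weights."""
    w = np.asarray(weights)
    return 1.0 / np.sum(w ** 2)
```

A decoded stimulus equal to the interval means achieves rRMSD exactly 1, and uniform weights over I particles give ESS exactly I, matching the limiting cases stated above.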
We tried stimulus mixtures of K = 1, 2 and 3 components. A mixture of one component implies that the neuron's attention is fixed on the single stimulus. For the mixtures of two or three components we fix the TPM in advance. Table 4 shows the β parameters used for each component and the common γ values for each mixture.
During initialization, the values of γ, β and the stimulus strength S are uniformly sampled from U(0, 40), U(0, 200) and U(0, 200), respectively. The parameters for the algorithmic updating of Γ, γ and β are V_λ = 0.02, V_γ = 1 and V_β = 4, respectively. For the APF algorithm with kernel smoothing, we use δ = 0.95. Throughout the experiments, the number of particles is I = 500. The delay for particle smoothing is Δn = 10 intervals, i.e., 1 s.
All data are simulated according to the state-space model and the diffusion process described in the Models and Methods section using the parameters given above.
In Table 5 we show a summary of the performance comparison of the different methods from the simulations. In both the single and multiple spike train simulations, we focus on discrete-time switching and the bursting kernel to compare the particle algorithms, and then include extensions with continuous-time switching and other response kernels. Detailed explanations of the results can be found in the following sections. The source code for these experiments is in the repository https://osf.io/tkvhs/ (DOI: 10.17605/OSF.IO/TKVHS).

Single spike trains
In the single spike train experiments, the decoding trials are repeated 50 times. In each trial, new stimuli are generated and one spike train is simulated following the stimulus mixture; all decoding is then conducted on this single spike train. Various combinations of three estimation schemes (online filtering, fixed-lag smoothing and fixed-interval smoothing), two particle methods (BF and APF) and three component sizes (K = 1, 2 and 3) are tried. The decoding performance tends to be better when there are fewer stimulus components and when we use delayed smoothing rather than online filtering. The benefit of APF is not observed for K = 1 and K = 2, but becomes notable for K = 3. Fig 6 shows the ESS of the different particle methods for different numbers of components. The ESS is calculated at all time steps, so each boxplot covers 2500 samples (50 repetitions at 50 time steps). The ESS of APF outperforms that of BF only when K = 3. When K = 2, the medians of APF and BF are comparable, but the variance of BF is smaller. As K grows, weight degeneracy quickly becomes a problem for BF, whereas the weights of APF are less sensitive to K. This corresponds to the findings in the rRMSD plots in Fig 5. Finally, Fig 7 shows examples of the time course of parameter learning for γ, the diffusion parameter in the OU model of the stimuli. Parameter learning converges faster using APF than BF when there is more than one stimulus, although the convergence is not as fast as the parameter degeneracy observed and explained below for population decoding.

Multiple spike trains
In population decoding of multiple spike trains, we use a mixture of two stimuli, also of length 5 s. In each trial we simulate new stimuli and 20 simultaneous spike trains, and we conduct 50 repetitions. Population decoding assumes either serial or parallel processing. A decoding example following serial processing is shown in Fig 8. The figure compares filtering, fixed-lag smoothing and fixed-interval smoothing, all using BF. The top of the figure shows the 20 spike trains used for decoding; they follow similar spiking patterns because under serial processing all neurons attend to the same stimulus.
A decoding example following parallel processing is shown in Fig 9. Spike trains can be quite distinct because different stimuli are attended, and all stimuli can be decoded simultaneously at each time point. Two decoding methods are used. First we apply individual decoding of each spike train, obtaining 20 estimates which are clustered into two categories; the median of each category is the final estimate. The histograms to the right show the distributions of the 20 estimates at two selected time points. Sometimes one category contains few estimates. This occurs when the two components differ in strength and most spike trains happen to attend to one component, or when the two components have similar strength and outliers form a second category. A category with few estimates is marked in red with stars in the figure.

Population decoding using multiple spike trains generally performs better than single spike train decoding. Judging from the rRMSD results, APF performs worse than BF for serial processing, and as well as or better than BF for parallel processing. For both serial and parallel population processing, smoothing yields little or no improvement over filtering. The exception is, of course, the individual decoding methods for parallel processing, since they are based on decoding from single spike trains: significant improvement is observed when using smoothing instead of filtering for iBF and iAPF. The reason for these performances of BF and APF, filtering and smoothing, can partly be found in the ESS values shown in Fig 11. Most notably, the ESS values are much smaller than those of the single spike train results (Fig 6), due to extreme weights for larger sample sizes. This can lead to inaccurate approximations of the marginalization in fixed-lag smoothing and of the integrals in the forward-filtering backward-smoothing algorithm. The smoothing performance is more affected by small ESS than the filtering performance.
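The per-time-point aggregation used in individual decoding under parallel processing (cluster the 20 estimates, report category medians) can be sketched as follows; the paper does not specify the clustering algorithm, so a simple 1-D k-means is assumed:

```python
import numpy as np

def cluster_medians(estimates, k=2, n_iter=50):
    """Cluster 1-D per-neuron stimulus estimates into k categories and
    report the median of each (a simple 1-D k-means sketch; the
    clustering algorithm itself is an illustrative assumption)."""
    x = np.sort(np.asarray(estimates, dtype=float))
    # initialize the centers at evenly spaced quantiles
    centers = np.quantile(x, np.linspace(0.1, 0.9, k))
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean()
    medians = np.array([np.median(x[labels == j]) if np.any(labels == j)
                        else np.nan for j in range(k)])
    sizes = np.array([np.sum(labels == j) for j in range(k)])
    return medians, sizes
```

Returning the category sizes alongside the medians makes it easy to flag a category with few estimates, as in Fig 9.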
Furthermore, for serial processing BF has better ESS, with a higher median and smaller variance, than APF, whereas for parallel processing APF has the better ESS. This explains the different performances of BF and APF in serial and parallel processing in Fig 10. Finally, regarding the use of geometric means, we do not observe much improvement of APFg and mAPFg over APF and mAPF. Using geometric means has some positive effects, since the ESS values are larger and the parameter degeneracy slows down (Fig 12) with APFg and mAPFg. However, the geometric mean changes the resulting posterior distribution and introduces a bias.
In Fig 12, examples of parameter learning of γ are plotted for the different methods. The APF algorithm for serial population decoding suffers from parameter degeneracy. Parameter degeneracy of APF with kernel smoothing [34] under large sample sizes has been reported previously [39]; it is a phenomenon where the parameter distribution quickly becomes narrow or collapses to a Dirac delta function. If parameter learning degenerates before it has received sufficient data to achieve a good estimate, the parameter can become fixed at values far from the true one, reducing the decoding accuracy. Using the geometric mean slows down the degeneracy for serial processing. Other parameter learning methods based on sufficient statistics have been studied previously and may avoid the degeneracy problem [39,40]; they are not pursued here. For particle filtering with marginal likelihood in parallel population decoding, there is no large difference between APF and BF in terms of degeneracy.

Approximating continuous-time switching
Here we simulate the attentional switching in continuous time following a Poisson process, to test how robust the methods are to discretization errors, using the same setup and methods as before. For single spike train decoding, the estimate around a switching time tends to lie somewhere between the two values before and after the switch (first spike train in the upper panel of Fig 14), but sometimes the estimate can be far from the true stimulus (second spike train at 0.8 s in the lower panel of Fig 14).

Decoding with the delay and decay kernel
In the above analysis we used the burst response kernel, which generates rhythmic, oscillatory bursting spiking patterns. Here we also try parallel population decoding using the decay and the delay kernels, shown in Figs 15 and 16, respectively, again with the same setup and methods. For the delay kernel, good performance is achieved, comparable with the burst kernel. For our current specification of the decay kernel, the spiking rate decreases rapidly over time and we have to use a stronger stimulus; even so, long ISIs remain (e.g., in the middle region from 2 s to 4 s), which reduce the decoding accuracy.

Discussion
We have shown how to decode mixtures of multiple stochastic stimuli in the framework of visual attention under the hypothesis of probability mixing, which assumes the neuron responds to only a single stimulus at any time. The opposing hypothesis is response averaging [42], which assumes the neuron responds to a weighted average of the mixture. In that case, decoding each single stimulus would be much harder or impossible, since individual stimuli cannot easily be identified from an estimate of their weighted average, and the characteristics of the individual stimuli would not be identifiable. This is an argument for why the neural system probably follows the probability-mixing hypothesis, as also shown in [3]. In the decoding simulations with stochastic mixtures, we successfully decode the attended stimulus component using a single spike train, or using population data under serial processing. Using population data under parallel processing enables us to obtain information about all stimulus components. Various types of particle methods are employed and compared. Interestingly, we find that the more complicated techniques using APF and kernel-based parameter learning do not necessarily perform better than basic methods using BF, and that smoothing, although conditional on more observations, does not necessarily perform better than filtering. This is related to sample size and model complexity.

Fig 10: The rRMSD values using different particle methods for serial and parallel processing, calculated from 50 repetitions. In the x-axis labels, APFg: APF with geometric mean; iBF: individual decoding using BF; iAPF: individual decoding using APF; mBF: BF with marginal likelihood; mAPF: APF with marginal likelihood; mAPFg: APF with marginal likelihood and geometric mean. For example, APFg-FB means using APF with geometric mean and reporting estimates by fixed-interval smoothing with the forward-filtering backward-smoothing algorithm.
https://doi.org/10.1371/journal.pone.0216322.g010

For a limited number of particles (500 in our case), smoothing performance is closely related to the ESS and to how extreme the weights are. If the sample size is increased, the weights become extreme and the ESS decreases. After Δn = 10 rounds of resampling, the values {S_{n−Δn}^{(i)}; i = 1, …, I} used in fixed-lag smoothing contain very few, or only one, unique values, which affects the accuracy. The forward-filtering backward-smoothing algorithm is also affected, because the backward sweep requires integration over the past particles. Therefore, for a large sample size, smoothing can perform worse than filtering. In addition, the backward-smoothing procedure requires the transition probabilities p(z_n | z_{n−1}), which we compute using different particles at times n and n − 1, and the performance is affected by label switching.
The performance of APF compared with BF has previously been studied; see e.g. [43-45]. APF applies new proposal weights to resample particles through an early introduction of the subsequent distribution, as a variance reduction approach: the estimation variance is reduced if the first-stage weights give a good prediction of the subsequent weights, and thus a larger ESS. When the data size is large, the distributions become narrow and the first-stage weights in APF may not predict the subsequent distribution well; meanwhile, the more complicated two-stage numerical calculations under a limited particle size can introduce additional variance and bias. Therefore, the variance reduction can perform worse for a large data size. When the model is more complicated, so are the prior and transition distributions of the states, and it becomes difficult for BF to obtain good samples with a limited number of particles. APF, on the other hand, gains an advantage by introducing information about the subsequent states, and therefore suffers less from increased model complexity than BF. Increased model complexity also makes the distributions less narrow under a large sample size, due to the higher dimension. In summary, APF is favored for smaller sample sizes and more complex models. In our case, population decoding involves a larger sample size than single spike train decoding. Increasing the stimulus number K yields higher dimensions and thus a more complex model. For the same K, parallel processing with mAPF (using full stimulus information) has a higher dimension than serial processing with APF (using partial stimulus information).
In the simulations of parallel processing, the stimulus number K is much smaller than the number of simultaneously recorded spike trains, and each stimulus component has a sufficiently large probability of being attended. Consequently, at every time point each component is likely to be attended by some neurons, and it becomes possible to decode all stimulus components. If, on the other hand, K is too large, or the probability of attending to one or more components is very small, the decoded stimuli will likely not form as many as K clusters. In that case we could try different K values in the clustering analysis and report the k* that minimizes the Bayesian information criterion. This means that among all K stimuli, k* are most likely attended by the recorded neurons, and we decode those k* attended stimuli.

We have included some extensions, namely approximating continuous-time switching and various response kernels. The framework can be extended in much broader ways. For example, we may consider other spiking neuron models such as point processes, spiking models incorporating neuronal interactions, or even more sophisticated ion channel models if we have access to intracellularly recorded membrane voltage data. This amounts to modifying the likelihood of the observed data conditional on the stimulus and the history in Eq (6). Another feature of this state-space framework is that it takes the hidden attentional states into account, which is particularly useful if we have prior knowledge about neuronal attention. Using prior information, we can, e.g., put constraints on the TPM of attention switching, or set appropriate discretization intervals.
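The BIC-based choice of k* mentioned above could, for instance, be implemented with a simple one-dimensional clustering model; the Gaussian cluster model and the parameter count below are illustrative assumptions, since the text only states that the BIC is minimized over K:

```python
import numpy as np

def bic_1d_kmeans(x, k, n_iter=50):
    """BIC of a simple k-cluster Gaussian model fitted to 1-D decoded
    estimates (an illustrative model choice, not the paper's)."""
    x = np.sort(np.asarray(x, dtype=float))
    centers = np.quantile(x, np.linspace(0.1, 0.9, k))
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean()
    n = x.size
    # pooled within-cluster variance as a common noise estimate
    var = np.mean((x - centers[labels]) ** 2) + 1e-12
    loglik = -0.5 * n * (np.log(2 * np.pi * var) + 1)
    n_params = 2 * k          # k means + k mixing proportions (rough count)
    return n_params * np.log(n) - 2 * loglik
```

Evaluating `bic_1d_kmeans` over candidate values of k and taking the argmin gives the k* of the text: well-separated clusters reward larger k through the likelihood term, while the log(n) penalty discourages spurious categories.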
Our methods provide a state-space, Monte Carlo framework for neural decoding incorporating single neurons' attention, which can easily be extended to different neural models and experimental settings. The framework is especially useful for applications with complex stochastic stimuli and multiple simultaneously recorded neurons, or when we want to infer the neuronal attention scheme in addition to decoding the stimuli. The simulation results can serve as a reference for choosing proper algorithms when researchers apply the methods to experimental data.

The probability that the neuron has not yet fired at time t, 1 − G(t), is equivalent to the probability that the potential has not yet reached x_th, F(x_th, t). Thus, the probability density of an ISI is

g(t) = −∂_t F(x_th, t).   (36)

The transition probability density with a resetting threshold follows the Fokker-Planck equation, defined by the partial differential equation (PDE)

∂_t f(x, t) = −∂_x (b(x, t) f(x, t)) + (σ²/2) ∂²_xx f(x, t),   (37)

with absorbing boundary condition f(x_th, t) = 0 and initial condition f(x, 0) = δ(x − x_0). For numerical reasons, we also approximate by imposing a reflecting boundary condition at a small value x = x⁻, where the flux equals 0.

We now formulate a PDE for the CDF F(x, t) [22,46,47]. Plugging f(x, t) = ∂_x F(x, t) into (37) gives

∂_t ∂_x F(x, t) = −∂_x (b(x, t) ∂_x F(x, t)) + (σ²/2) ∂_x ∂²_xx F(x, t).

Integrating both sides with respect to x yields

∂_t F(x, t) = −b(x, t) ∂_x F(x, t) + (σ²/2) ∂²_xx F(x, t) + C(t).

At the lower reflecting boundary x = x⁻, we have F(x⁻, t) = 0 and thus ∂_t F(x, t)|_{x=x⁻} = 0. The flux equals 0 there, so

J(x⁻, t) = −b(x⁻, t) f(x⁻, t) + (σ²/2) ∂_x f(x, t)|_{x=x⁻} = −b(x, t) ∂_x F(x, t)|_{x=x⁻} + (σ²/2) ∂²_xx F(x, t)|_{x=x⁻} = 0.

Thus C(t) = 0, and we obtain the PDE for F(x, t):

∂_t F(x, t) = −b(x, t) ∂_x F(x, t) + (σ²/2) ∂²_xx F(x, t),   (41)

with boundary conditions ∂_x F(x_th, t) = 0 and F(x⁻, t) = 0, and initial condition F(x, 0) = H(x − x_0), where H(·) is the Heaviside step function. The PDE is solved numerically with the Crank–Nicolson finite difference method, discretizing time and potential with grid sizes Δt and Δx. The result is the CDF F(x, t), which is differentiated in time to obtain the desired g(t) following Eq (36).
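A minimal numerical sketch of this procedure, solving Eq (41) with a Crank–Nicolson scheme and differencing the survival probability F(x_th, t) in time to obtain g(t) via Eq (36); the drift function and grid sizes are illustrative choices, not the paper's settings:

```python
import numpy as np

def isi_density(b, sigma, x0, x_th, x_lo, T, J=100, M=400):
    """Solve the CDF equation (41) with Crank-Nicolson and return the
    ISI density g(t) = -dF(x_th, t)/dt on a time grid.

    b(x, t) is the drift of the membrane potential; for an LIF neuron
    attending a constant stimulus s one could take b = -x/tau + s
    (an illustrative choice).
    """
    x = np.linspace(x_lo, x_th, J + 1)
    dx = x[1] - x[0]
    dt = T / M
    F = (x >= x0).astype(float)               # F(x, 0) = H(x - x0)
    surv = np.empty(M + 1)
    surv[0] = F[-1]                           # survival probability F(x_th, t)
    for m in range(M):
        bi = b(x, (m + 0.5) * dt)
        # spatial operator  L F = -b dF/dx + (sigma^2/2) d2F/dx2
        L = np.zeros((J + 1, J + 1))
        for j in range(1, J):
            L[j, j - 1] = bi[j] / (2 * dx) + sigma**2 / (2 * dx**2)
            L[j, j] = -sigma**2 / dx**2
            L[j, j + 1] = -bi[j] / (2 * dx) + sigma**2 / (2 * dx**2)
        A = np.eye(J + 1) - 0.5 * dt * L      # Crank-Nicolson step
        rhs = (np.eye(J + 1) + 0.5 * dt * L) @ F
        A[0, :] = 0.0; A[0, 0] = 1.0; rhs[0] = 0.0        # F(x_lo, t) = 0
        A[-1, :] = 0.0; A[-1, -1] = 1.0; A[-1, -2] = -1.0  # dF/dx(x_th) = 0
        rhs[-1] = 0.0
        F = np.linalg.solve(A, rhs)
        surv[m + 1] = F[-1]
    t = np.linspace(dt, T, M)
    g = -np.diff(surv) / dt                   # Eq (36)
    return t, g
```

For example, with b(x, t) = −x + 1.5, σ = 0.5, x0 = 0, threshold x_th = 1 and reflecting boundary x⁻ = −2, the returned g integrates over [0, T] to approximately the probability that the neuron has fired by time T.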