Community-based benchmarking improves spike inference from two-photon calcium imaging data

In recent years, two-photon calcium imaging has become a standard tool to probe the function of neural circuits and to study computations in neuronal populations1, 2. However, the acquired signal is only an indirect measurement of neural activity due to the comparatively slow dynamics of fluorescent calcium indicators3. Different algorithms for estimating spike trains from noisy calcium measurements have been proposed in the past4‒8, but it is an open question how far performance can be improved. Here, we report the results of the spikefinder challenge, launched to catalyze the development of new spike inference algorithms through crowd-sourcing. We present ten of the submitted algorithms which show improved performance compared to previously evaluated methods. Interestingly, the top-performing algorithms are based on a wide range of principles from deep neural networks to generative models, yet provide highly correlated estimates of the neural activity. The competition shows that benchmark challenges can drive algorithmic developments in neuroscience.

challenge, launched to catalyze the development of new spike inference algorithms through crowd-sourcing. We present ten of the submitted algorithms which show improved performance compared to previously evaluated methods. Interestingly, the top-performing algorithms are based on a wide range of principles from deep neural networks to generative models, yet provide highly correlated estimates of the neural activity. The competition shows that benchmark challenges can drive algorithmic developments in neuroscience.
We organized the spikefinder challenge to crowd-source the problem of developing algorithms for inferring spike rates from calcium signals. Such challenges have been used successfully in machine learning and computer vision to drive algorithmic developments 9 . We used a benchmark data set consisting of 92 recordings from 73 neurons 6 , acquired in the primary visual cortex and the retina of mice using different calcium indicators (OGB-1 and GCaMP6s), different scanning methods, and different sampling rates. For this data, calcium imaging has been performed simultaneously with electrophysiological recordings allowing to evaluate the performance of spike inference algorithms on ground truth data. Importantly, all data has been acquired at a zoom factor typically used during population imaging experiments, ensuring that all benchmark results reflect performance under the typical use-case conditions.
For the challenge, we split the data into a training and a test set (see Table 1). For the training data, we made both the calcium and the spike traces publicly available, but kept the spike traces secret for the test data. Additionally, the publicly available data sets provided by the GENIE project 10 were available as training data. This allowed participants to adjust their models on the 3 training data set, while avoiding overfitting to the specific benchmark data set. Participants could upload predictions for the spike rate generated by their algorithm on a dedicated website (see Methods) and see their performance on the training set during the competition phase. Results on the test set were not accessible to the participants during the competition. To evaluate the quality of the prediction, we computed the Pearson correlation coefficient between the true spike trace and the prediction sampled at 25 Hz (equivalent to 40 ms time bins) as previously described 6 .
We obtained 37 submissions, from which we selected all algorithms performing better than the STM algorithm, which had previously been shown to outperform other algorithms on this data 6 . In addition, if there were multiple submissions from the same group, we used the one with the highest correlation on the test set. This resulted in a total of ten algorithms that we studied in greater detail (see Table 2). Interestingly, these submissions include algorithms based on very different principles: some of the algorithms built on the classical generative models of spike-induced calcium dynamics 4 , while others relied on purely data-driven training of deep neural networks or pursued hybrid strategies (see Table 3).
We found that several of the submitted algorithms performed substantially better than the STM, increasing the average correlation on the test set from 0.36 by 0.08 to 0.44 for the best algorithm ( Figure 1A, B; Table 3). Because increases in correlation are hard to interpret, we converted it to mutual information assuming a Gaussian distribution (see Methods), yielding an estimated increase of almost 50% on average. For all algorithms, performance varied substantially between data sets with the best results observed on data set I. Interestingly, performance gains were Colors indicate different data sets (for details, see Table 1). Data sets I, II, and IV were recorded with OGB-1 as indicator, III and V with GCaMP6s. Black dots are mean correlation coefficients across all N = 32 cells in the test set. Colored dots are jittered for better visibility. STM: Spiketriggered mixture model 6 ; f-oopsi: fast non-negative deconvolution 11 (continued next page) typically larger on GCaMP6 than on OGB-1 data sets ( Figure 1B). Surprisingly, the top group of six algorithms performed virtually indistinguishably ( Figure 1C), despite using very different methodologies. Finally, we evaluated to what extent the algorithms overfitted the training data.
For example, it is possible that algorithms extracted peculiarities of the training data that did not transfer to the test data, resulting in artificially high correlation coefficients on the training data.
We found that most algorithms showed similar performance for both the training and the test set, with evidence for overfitting in some of the DNN-based algorithms ( Figure 1D).
As the algorithms in the top group used a range of algorithmic strategies, we wondered whether they also made different predictions, e.g., each capturing certain aspects of the spikecalcium relationship. However, the predictions of the different algorithms are typically very similar with an average pairwise correlation coefficient among the first six algorithm of 0.82 ± .04 (mean ± SD, Figure 2  indicates that despite their different algorithmic strategies, all algorithms capture similar aspects of the spike-fluorescence relationship.
In summary, the spikefinder challenge has shown that a community competition making use of suitable benchmark data can catalyze algorithmic developments in neuroscience. The challenge triggered a range of new and creative approaches towards solving the problem of spike inference from calcium data and improved the state-of-the-art substantially. Interestingly, the challenge did not distill the optimal strategy out of the different possible algorithmic approaches, something we had initially hoped for; rather, it showed that -given the current data -a range of approaches yield very similar outcomes. In particular, it is noteworthy that algorithms based on generative models such as those by Team 1 and 6 perform on par with more flexible deep learning-based approaches.
Each algorithm comes with their own advantages and disadvantages regarding speed, interpretability, and incorporation of prior knowledge. For example, training the DNN-based models is computationally quite costly, and care has to be taken when using complex DNN-based models to avoid overfitting the training set, as this could lead to false confidence about the prediction performance on new data. On the other hand, the DNN-based approaches are less reliant on prior knowledge and may be more easily adapted when new training data sets are acquired with new indicators with different dynamics. In contrast, the algorithms based on generative models allow for very fast inference (Team 6) or Bayesian treatment (Team 1), but may be less easily adapted to novel settings.
More generally, the spikefinder challenge raises the question of what the actual performance bound of an ideal decoder is. Model simulations can help to answer these questions 7,12 , but their interpretation is limited by the accuracy of the model regarding indicator dynamics, noise structure, and other experimental factors 6 . However, the challenge contributions by Team 9 and a recent paper 13 by Team 3 have shown that they can actually be useful for training supervised algorithms.
It remains to be seen whether new and larger data sets of simultaneously recorded and imaged neurons will yield further improvements and distinguish more clearly between different algorithmic strategies. It will also be interesting to see whether new indicators will allow for more precise spike-rate inference.

8
In addition to improving on the state-of-the-art, competitions such as the spikefinder challenge can boost standardization of algorithms, something that has been lacking from neuroscience analysis tool chains 14 . During and after the challenge, lively debates ensued online about the best way to handle several of the processing steps highlighting several potential pitfalls in analyzing calcium imaging data. Some of these difficulties were also reflected in the choices made for this challenge. For example, we resampled all data to 100 Hz for ease of comparison, which induced problems for some of the participating algorithms through the properties of the used filter. In addition, several of the participating teams found it necessary to introduce means of adapting the model parameters to the specific data set as, e.g., the average time delay between the spike and calcium signal seemed to vary between data sets. This may be due to different preprocessing procedures performed in the different data sets, but such confounds should be avoided in future competitions on this topic. Finally, the challenge was performed on traces extracted from the raw imaging data by averaging all the pixels within manually placed regions-of-interest (ROIs). It is thus possible that the extracted signals contain contamination from the neuropil or were suboptimally placed, a problem tackled by methods that combine ROI placement and calcium-trace extraction in a single algorithm 5,15 . However, at least for data with simultaneous imaging and electrophysiological recordings, it is not clear how methods integrating ROI placement and spike-rate extraction should be evaluated and compared to regular data, since the recording electrode is always present in the picture.
Alternatives to our primary evaluation measure, the correlation coefficient, may also need to be considered. While it is easy to interpret and more sensitive than the often used AUC-score, it is still invariant under global scaling of the predicted spike rate 6 . In fact, different metrics can lead to different conclusions about which algorithm is optimal since the metric contains part of the task specification 16 . On the other hand, information gain -which may be considered a canonical model comparison criterion for probabilistic predictions 6, 17 -is considered less intuitive by some.
We believe that quantitative benchmarks are an essential ingredient for progress in the field, providing a reference point for future developments and a common standard with regards to how new algorithms should be evaluated. We strongly believe that many fields of computational neuroscience can benefit from community-based challenges to assess where the field stands and how it should move forward. As for the problem of spike inference from two-photon imaging, the spikefinder challenge should not be considered the last word in this matter: More comprehensive data sets and new functional indicators may require organizing another round of community-based development, further pushing the boundaries of what is attainable.

Methods
Data The challenge was based on data sets collected for a previous benchmarking effort 6 and the publicly available cai-1 data set from crcns.org 10 . Details about the recording region, indicators, scan rate and cell numbers are summarized in Table 1. All data was resampled to 100 Hz independent of the original sampling rate. Upon request during the challenge, we made the data available at the native sampling rate.
Challenge organization For the challenge, we split the available data into training and test sets (see Table 1). The training set contained both calcium and spike data, while for the test set, only calcium data was available during the challenge period. The GENIE datasets were only used as training data, since they are completely publicly available and consist of recordings from individual zoomed-in cells.
The data and instructions were available on a dedicated website, based on an open-source web framework (https://github.com/codeneuro/spikefinder). There was a discussion board linked from the website to allow for questions and discussion among participants.
Each team could make multiple submissions, but during the challenge period, only results on the training set were shown. The challenge ran from 30/11/2016 to 04/05/2017.
Algorithms The submitted algorithms are described in detail in the Supplementary Information.
For comparison, we used publicly available implementations of the STM algorithm 6 and fastoopsi 11 . STM parameters were optimized on the training set.
Evaluation The evaluation of the submissions was done in Python using Jupyter notebooks. All notebooks are available at https://github.com/berenslab/spikefinder_analysis.
We used the correlation coefficient between the inferred and the real traces resampled to 25 Hz (40 ms time bins) as a quality measure. Because difference in correlation is hard to interpret (e.g., because an increase from .40 to .45 means something very different than an increase from .94 to .99), we converted correlation coefficients to information rate assuming a Gaussian channel: Improvements in performance were then computed as the increase in average information rate compared to the STM algorithm: We used the R package afex to compute a repeated measures ANOVA on the correlation coefficients with within-subject factor algorithm and cells as subjects. Pairwise comparisons between algorithms were performed using the lsmeans package with Holm-Bonferroni correction for 66 tests.  LT preprocessed data; PB wrote the paper with input from all authors.

Supplementary Material
For all algorithms, we denote the spike train s, the fluorescence trace f and the underlying calcium signal c, where applicable. We observe a total of T time bins, and the measurement in time bin t is written s t , f t and c t , respectively. Similarly to the method by Vogelstein et al. 4 , the conversion of neuronal spiking activity to calcium fluorescence is modeled by a biophysical dynamical system, and a two-ways filtering scheme is applied to estimate the hidden dynamics of the intracellular calcium concentration given the noisy fluorescence recording. MLspike implements two major improvements over previous models: The first one is an extension of the biophysical model including a slowly drifting baseline, which allows disentangling a wide range of noises often observed in the real data from the spike-related signals. The second one is to represent probabilities as dense arrays rather than using Monte-Carlo approximations, namely making MLspike a histogram filter instead of a particle filter, which improves both speed (at least for a models hidden state dimension not greater than 2) and accuracy.
For the spikefinder competition, MLspike was set to estimate a-posteriori probabilities E(s|c) rather than maximum-a-posteriori spike trains arg max s p(s|c). The biophysical model entailed a drifting baseline and nonlinear calcium to fluorescence conversion (i.e. saturation for OGB dataset; polynomial supralinearity for GCaMP6 dataset), and therefore had 6 or 7 parameters.
One of these parameters (the a-priori spiking rate) was fixed while the 5 or 6 remaining ones were estimated independently for each training dataset so as to maximize the match between the estimated and observed spikes. This was preferred to using MLspikes autocalibration method on each individual neuron, because the data appeared too noisy for this autocalibration to perform accurately. The match between spike trains was defined as the correlation after resampling to 25 Hz. We used an additional postprocessing optimization on the estimates consisting of temporal smoothing and a time shift to account for apparent differences between data sets. Once these optimizations were performed, the same model and postprocessing parameters were applied to the test data sets. Interestingly, the hyperparameter optimization strategy pursued here was very similar to that chosen by Team 6.
Code is available at https://github.com/MLspike.  To distribute the amount of information that different units are carrying, dropout is applied at the output of the first five convolutional layers.
In separate work 13 , we developed an approach for training this network in an unsupervised fashion using variational auto-encoders 19 that can perform inference on a wide range of biophysical generative models (e.g. using the ones used by team 1). For the challenge, all available training data was labeled (i.e. ground truth spikes were provided), and we therefore trained the network using supervised learning.
Our architecture contains one forward running RNN that uses a multi-layer CNN with leaky ReLUs units to extract features from the input trace. The outputs of the forward RNN and CNN are transformed into Bernoulli spike probabilities through a dense sigmoid layer. Additional input is provided by a second RNN that runs backwards and also receives input from the CNN. Forward and backward RNN have a single layer with 128 gated recurrent units each 20 . In order to generate a single marginal probability distribution q φ (s t |f ) for evaluation, samples drawn by running the stochastic RNN 50 times were averaged. We trained one separate network for each dataset.
To minimize the artifacts introduced by upsampling the data to a common imaging rate, we where the goal is to estimate the parameters of a basis transformation g and an output non-linearity h such that a loss L(µ t , s t ) between spike train s t and prediction µ t is minimized. g is given by a deep convolutional artificial neural network. The first layer of this network is a standard convolutional layer followed by a rectified linear (ReLU) nonlinearity 21 , mapping each calcium time series f t to 32 parallel time series indexed by k.
Here (x) + ≡ max(0, x) is the ReLU nonlinearity. We use a large window (33 time points, or 330 ms), batch normalization, as well as a dropout fraction of .3. This initial layer is followed by seven adjustment layers in the style of a residual network 22 : These adjustment layers use smaller windows (9 time points, or 90 ms). The nonlinear component of each layer is batch normalized. Finally, the output is composed linearly via: The output nonlinearity h was given by a RelU nonlinearity, such that s t = η + t . We minimized a scaled sum of squared error criterion: Here, i indexes different neurons and α i is a set of scalars which are learned alongside the other parameters of the model (w and β). One can show that the loss is equivalent to 1 − ρ 2 , where ρ 2 is the square of the cosine similarity between prediction and spike train. The model was specified and fit using the tf.contrib.learn library in TensorFlow 23 . Model parameters were initialized with the Xavier method and fit using the Adam optimizer. One large model for all 10 recordings from the training set was fit (173 neurons) in this phase and goodness-of-fit was monitored on a leave-aside validation set to control overfitting by early stopping. Convergence took close to 200,000 iterations.
The model described so far uses fixed filters for each recording, and uses local information (≈ 1 second of data) to estimate spikes from calcium traces. To adapt filters, we learned longrange features with an unsupervised mixture density network 24, 25 . The model, a 3-layer recursive neural network with 512 long-short-term memory (LSTM) nodes, was fit to the calcium data to obtain one 1536-dimensional latent vector per mini-batch, which was reduced to 32 dimensions by PCA after z-scoring. These features were processed by two fully connected layers to produce 4 hidden features γ tp . These 4 hidden features were used to additively adapt the filters w 0 tjk = p softmax(γ tp )W jkp , in a manner similar to attention models.
Originally, one large model was fit for all recordings. We then created refined versions of this model for each of the 10 data sets by a transfer learning process. We took the large model with its learned parameters and ran up to 50,000 extra iterations of gradient descent on just the data from the kth dataset.
Code is available at https://github.com/patrickmineault/spikefinder_submission Team 5 -P. Rupprecht, S. Gerhard, R. W. Friedrich In order to infer a spike probability for time bin t (Fig. 5A), the calcium trace located around t was used, including 25% before and 75% after t, totaling to 128 samples, i.e., 1.28 sec (Fig. 5C). A convolutional neural network was trained to use these 128-wide windows to predict the corresponding spiking probability. To facilitate gradient ascent, we smoothed the discrete spiking ground truth with a Gaussian filter Fig. 5B).
We The 5% neurons that were worst at predicting their own spiking were discarded from the following modeling, assuming bad recording quality that is not suited for inclusion into a training dataset.
Normalization over columns, symmetrization of the matrix and averaging over datasets yields a matrix of predictive power, i.e., a matrix of proximity in prediction-space between datasets (Fig.   6B). A PCA of this matrix results in an embedding space that was limited to two dimensions due to the low number of datasets. Datasets close to each other in the embedding space (e.g. 2 and 4) 31 can predict each other's spikes very well, whereas datasets distant from each other in space (e.g. datasets 4 and 5) fail to do so. The idea behind this approach is very similar to the embedding spaces used by Team 8.
Using this approach, it is however not yet possible to map a neuron of a new dataset of unknown properties onto the right location of the embedding space above. To solve this problem, the following statistical properties of the raw calcium time traces were calculated (Fig. 6D), in an approach that is similar to the long-range features of calcium traces used by Team 4: • coefficient of variation, kurtosis, skewness • autocorrelation of the calcium time trace with its future value in 0.5, 1 and 2 seconds • generalized Hurst exponents of order 1-5 • the power spectral density at different frequencies between 0.1 and 3.6 Hz We did not attempt to find a minimal set of predictive properties to reduce computation time here, but used dimensional reduction techniques to automatically extract the relevant independent components. After averaging the standardized values over datasets (Fig. 6E), we used the two first principal components to generate a map of proximity in statistical property space (Fig. 6F). This map was generated using the training datasets (numbers located on the right side of the symbols).
Test datasets were mapped into this PCA space (numbers on the left side of the symbols).
To generate a mapping between the locations of the datasets in the two embedding spaces,  a simple regressor (DecisionTreeRegressor from the scikit-learn package 27 ) was fit to the training datasets (schematic arrows in Fig. 6C,F). We then used this mapping to determine the position of the test datasets in the embedding space of mutual predictive power.
Once the position in the embedding space is known for a dataset, the model that had been trained before on all datasets is retrained, but preferentially with neurons from datasets that lie close to the position in the embedding space. This preference was weighted with a function that decays exponentially over distance in the embedding space, as indicated by the red shading (Fig.   6C). Again, the functional form of the decay and the decay constant have been chosen heuristically without systematic optimization, since our goal was to showcase the power of our embedding space approach rather than finding a global optimum.
Embedding spaces as a visual and explicit intermediate step for model refinement are more easily accessible for users, allow the use of relatively small convolutional neuronal networks and can highlight similarities and differences between datasets. For example, it is interesting to see that in both embedding spaces, datasets 3 and 5 cluster together, whereas dataset 8, which uses the same calcium indicator (GCaMP6s) in the same brain region (V1), is in proximity of dataset 6 (GCaMP5k in V1). It was also observed that the datasets that use OGB-1 as indicator (1,2 and 4) tend to occupy similar regions of the embedding spaces. This indicates that model selection is not only based on the calcium indicator and the brain region, but on hidden parameters, e.g., signal-to-noise of the calcium recording, sampling rate, spike rate, temperature, indicator concentration, or others. To reliably comprise these possible hidden parameters with embedding spaces, it will be necessary to increase the number of datasets in order to support as many possible types of datasets as possible. However, the unknown dimensionality of this hidden parameter space makes it difficult to predict how many datasets would be required.
The observed fluorescence f ∈ R T is related to the calcium concentration as 11 : where a is a non-negative scalar, b is a scalar offset parameter, and the noise is assumed to be i.i.d.
zero mean Gaussian with variance σ 2 . We assume units such that a = 1 without loss of generality.
The goal of calcium deconvolution is to extract an estimateŝ of the neural activity s from the vector of observations f . This leads to the following non-negative LASSO problem for estimating the calcium concentration 8,11 : where the 1 penalty onŝ enforces sparsity of the neural activity. Note that the spike signalŝ is relaxed from non-negative integers to arbitrary non-negative values 11 .
Problem (4) is solved by iterative warm-started runs of OASIS to solve Eq. (4) while adjusting λ,b (and optionally also γ i ) between runs until Eq. (5) holds. We refer the reader to 8 for the full algorithmic details.
The above parameter choices rely on a robust noise estimateσ. The resampling of each spikefinder dataset to a fixed frame rate introduced artifacts into the data that corrupted the autoco-variance and PSD such that it was not possible to obtain reliable noise and AR estimates based on the preprocessed data. Therefore, these parameters for baseline, sparsity and AR dynamics were determined based on the training data sets and kept fix for each test trace, thus not accounting for differences between neurons within one data set. Six parameters were fit: the percentile value and window length to estimate the baseline using a running percentile, the two AR coefficients, and the slope and offset of a linear function that determines the sparsity parameter λ as function of the noise. The latter was estimated on traces that were decimated by a factor of 10 to counteract the artifacts that had been introduced by upsampling the raw data.
Running OASIS with the known parameters directly yields an estimateŝ of the neural activity. This estimate was already good for datasets 6-10, but noticeably improved for the first 5 datasets by convolving it with some kernel k, to obtain the final estimateŝ =ŝ * k. The kernel adjusts for mismatches between the actual calcium response kernel and the AR(2) model, smoothes the estimate, and accounts for the uncertainty of the exact spike timings by distributing spikes as spike rates over a few time bins. We used a kernel width of 30 bins and obtained it by averaging the closed form solutions of the least squares linear regression problem k = arg min k ŝ * k − s 2 for each true spike train s in the training set. Interestingly, the strategy used for hyperparameter optimization used here was very similar to that used by Team 1.
Because the evaluation criterion was correlation not the residual sum of squares, we considered to further optimize the kernel for this specific criterion using gradient decent initialized at the least squares solution; however, we did not obtain significant improvements.
where k is the calcium kernel (assumed, or estimated), s * k is the convolution of s and k, and s 0 is the L 0 norm of s, in other words the number of non-zero entries in s. We do not describe here the inference method for this model, but point the reader to our original derivation in 30 . This solution is approximate, due to its greedy nature. Exact solutions have been obtained by 31 , in the case where the positivity constraint on s was removed, and the calcium kernel was restricted to be exponential.
The L 0 deconvolution model was developed as an alternative to the L 1 -deconvolution model 5,8,11 . We originally believed that an L 0 penalty would better account for the binary nature of spike trains, and allow the algorithm to return sparse spike trains. The algorithm indeed returns very sparse descriptions of the calcium data, which can deceptively look like electrophysiologically recorded spike trains. However, neither the L 0 nor the L 1 penalties are necessary or desirable for achieving best performance, the positivity constraint is sufficient 30 .
Here, the training data was not used to set the parameters for the deconvolution (with the exception of time lags, see below). Instead, calcium kernels were chosen to be exponentials with timescales obtained from the literature for each specific sensor 3 . Following deconvolution, dataset-specific timelags were introduced for some of the spikefinder datasets. Also, the output was smoothed with a Gaussian kernel of a preset standard deviation (80ms for spikefinder data sets, 20 ms for GENIE datasets).
Code is available at https://github.com/cortex-lab/Suite2P. During experimentation, it was observed that the spike behavior varied quite a bit in different data sets. From this observation, it was inferred that different convolutional filters would perform well for different data sets. To implement this idea, "data set embeddings" were used to weight Intuitively, these vectors represent embeddings for each data set, and the similarity between two embeddings represents the similarity of the spike behavior in each data set, since a model trained to infer spikes in the two data set would employ similar convolutional filters. This is illustrated in Figure 7.
Code is available at https://github.com/codekansas/spikefinder modeled by low-pass filtering white noise. The baseline drift component significantly improved inference quality, in agreement with the observation of Team 1. In contrast, the statistics of the simulated spike trains, as well as the randomized jump sizes following each spike, had a much smaller impact on decoder performance.
The time constants of the autoregressive model, γ, and the scale of the noise, σ, were also sampled from random uniform distributions (such that a single decoder trained on these data could work well across multiple indicators and frame rates), but the precise parameterization was varied between model training sessions. However, in all cases, we found that a wide range of γ values could be learned by a single decoder model. For instance, we successfully trained decoder models across data with γ values spanning approximate decay times for fast OGB data recorded at 25 Hz, to slow GCaMP6S data recorded at 100 Hz. This shows that a single decoding model trained on simulated data can be used to analyze data produced by many different indicators and many different acquisition rates in agreement with Theis et al. 6 . Similarly, wide ranges of σ values could be learned by single decoders. However, there was a slight performance enhancement seen by training single encoders on mostly low SNR or high SNR data to improve performance on low SNR and high SNR real data, respectively.
Finally, in almost all cases, the second-order models (i.e. p = 2 in Equation 7) outperformed first-order models (p = 1). An exception were the OGB-1 data sets, as the sensor displays very rapid rise time kinetics following action potentials and has been previously shown to be especially well-described by first-order models 11 .
To estimate the most likely spike train underlying a given fluorescence trace, we built a convolutional neural network. During training, the network was presented with f as well as parameter estimates for σ and γ given by methods published in earlier work 5 . Because the spikefinder data was upsampled from its native resolution and this introduced artifacts in the power spectrum of each fluorescence trace, we decimated each trace by a factor of 7-10 (depending on the approximate native time resolution of each dataset) before performing subsequent parameter fitting and analysis. The target of the network during training was the set of simulated spike trains, s, used to generate f using Equation 7.
A fairly simple architecture inspired by research into the construction of generative models for audio data was found to be effective 32 . In brief, the network consists of four 1D dilated convolutional layers containing 100 units, a filter size of 32, and rectified-linear (relu) nonlinearities.
The first layer was dilated by a factor of 1, the second layer by 2, the third by 4 and the fourth by a factor of 8. Dropout (rate = 0.5) was also used at each layer. Finally, a fifth 1D convolutional layer with one unit, a filter size of one, and a relu nonlinearity was used to read-out a non-negative estimate of s from each f vector provided.
This architecture contained about 950,000 parameters and could be trained on a simulated data set of 5,000 traces in about 20 minutes (over 20 epochs) using the Google ML Engine. A single model trained on simulated data that spanned a wide range of σ and γ values performed well, but an ensemble of four models, each trained on a slightly different simulated data set, worked even better-as some decoders tended to work better or worse on each spikefinder data set. For 43 our submission, we chose the decoding model that worked best for each dataset to use as our submission. For data set 5, which had high firing rates, we found that convolving the results with a small Gaussian kernel resulted in a modest improvement to our inference quality.
Code is available at https://bitbucket.org/tamachado/encoder-decoder The even filter is a Gaussian,h even (τ ) = A exp(−τ 2 /2σ 2 ), and the odd filter is the derivative of a Gaussian h odd (τ ) = Bt exp(−τ 2 /2σ 2 ). The constants A and B are such that the norm of the filters is normalized to one, A = B = 1. These two filters are linearly combined while keeping the norm of resulting filter equal to one, h(τ ) = cos αh even (τ ) + sin αh odd (τ ). The output nonlinearity is a rectifier to a power, φ(x) = (x − θ) β if x > θ, and zero otherwise.
The model has only 4 parameters, σ, α, θ, β. The amount of smoothing of the signal is controlled by σ, the shape of the filter is controlled by α, and the threshold θ and power β determine the shape of the nonlinearity. The model is fit by finding the optimal values of σ, α, θ, β that maximize the correlation between its outputŝ(t) and the recorded spiking of the neuron. Matlabs