A comparison of neuronal population dynamics measured with calcium imaging and electrophysiology

Calcium imaging with fluorescent protein sensors is widely used to record activity in neuronal populations. The transform between neural activity and calcium-related fluorescence involves nonlinearities and low-pass filtering, but the effects of the transformation on analyses of neural populations are not well understood. We compared neuronal spikes and fluorescence in matched neural populations in behaving mice. We report multiple discrepancies between analyses performed on the two types of data, including changes in single-neuron selectivity and population decoding. These were only partially resolved by spike inference algorithms applied to fluorescence. To model the relation between spiking and fluorescence we simultaneously recorded spikes and fluorescence from individual neurons. Using these recordings we developed a model transforming spike trains to synthetic-imaging data. The model recapitulated the differences in analyses. Our analysis highlights challenges in relating electrophysiology and imaging data, and suggests forward modeling as an effective way to understand differences between these data.

Introduction Electrophysiological recordings ('ephys') and calcium imaging offer distinct tradeoffs for interrogating activity in neural populations. Ephys directly reports spiking of neurons with high signal-to-noise ratio, temporal fidelity, and dynamic range, but typically offers access only to a sparse subset of relatively active neurons [1]. In addition, the ability to track the same population of neurons across time, important for understanding the neural basis of learning, remains challenging [2][3][4].
However, calcium imaging reports spikes only indirectly [10,20]. The transformation from spikes to calcium is inherently non-linear due to the dynamics of intracellular calcium concentrations [21]. Additional nonlinearities are imposed by the protein-based indicators of calcium [12][13][14]22]. Together these produce a low-pass filtered, delayed, and transformed version of neural activity, which complicates relating neural activity to behavior. Calcium imaging also has lower signal-to-noise ratio for detecting spikes and limited dynamic range [10]. In addition, during animal behavior, spike rates can vary by orders of magnitude across behavioral epochs and across neurons, even neurons of the same type [23][24][25] and spike rates change over times of milliseconds to seconds [25][26][27]. Finally, the coupling between spikes and calcium-dependent fluorescence likely differs across different neuron types and even individual neurons within a type [13,28].
The complexities in the relation between spiking activity and calcium imaging at the level of single neurons have been long appreciated [12][13][14]21,22,29]. However, the effect of these factors on analyses of population activity are not fully known [30]. Ideally a detailed understanding of the transformation from spikes-to-calcium-dependent fluorescence would allow inversion of this transformation and the reliable extraction of spikes. Calcium indicators with high sensitivity allow reliable detection of action potentials, at least under conditions when single spikes or burst of spikes are separated in time [13,31,32]. However, under behaviorally relevant conditions neurons operate with a large range of spike rates, and spiking responses are typically superposed on a substantial background spike rate, which varies across the population [23,25]. Moreover, neuron-to-neuron variability in calcium dynamics, calcium indicator dynamics and patterns of firing rate could conspire to make this inversion challenging. These issues are compounded by the paucity of simultaneously recorded spikes and fluorescence data. It is therefore unclear if spike inference can invert the fluorescence data accurately to eliminate potential discrepancies between analyses performed on ephys and calcium imaging data.
Here, we explore these issues empirically in data collected in a decision-making task, where the dynamics of the neural circuit are rich and variable across neurons. In particular, neurons 12786296.v1. The new raw data in precompiled data include (1) simultaneously ephys-imaging data in TG mice in primary visual cortex (passive viewing task) that were recorded by B.J.L. and (2) imaging data in 6f-TG mice in anterior lateral motor cortex (delayed discrimination task) that were recorded by K.D. and roi-extracted by Z.W. manually; both of them can be downloaded at https://doi.org/10.6084/m9.figshare.12792587. All codes for model benchmarks and comparison metrics are recompiled and packed with data through im-phys-API (https://github.com/zqwei/ Im-phys-API), which can be available at http://imphys.org/codes. The API will come with a userfriendly interface in which one can reproduce all results in our paper and extensive results on http:// im-phys.org. We also provide repos for benchmarks of S2F and F2S models at https:// github.com/zqwei/Ca-Imaging-Deconv-List (DOI: 10.5281/zenodo.3960635) and comparison metrics at https://github.com/zqwei/Neural-Recording-Methodology-Comparison (DOI: 10. 5281/zenodo.3979786) and website interface at https://github.com/zqwei/Im-phys-org.

PLOS COMPUTATIONAL BIOLOGY
in frontal cortex show a wide range of spike rates and exhibit diverse temporal dynamics and selectivity [25,26,33]. We first analyzed ephys and calcium imaging data, recorded separately but measured in matched neuronal populations in the same delayed response task, and directly compared the results of standard measurements of selectivity and population dynamics. We find qualitative discrepancies at both the level of single cells and neural populations. Spike inference algorithms were limited in resolving these differences. However, a phenomenological model of the spike-to-fluorescence transformation, based on a new set of simultaneous imaging and electrophysiology data [13,14], explains many differences across the data sets. Finally, we developed a web-based platform, im-phys.org, that allows quantification of the effects of various transformations from electrophysiology to imaging.

Results
We measured neural activity using electrophysiology ('ephys') and calcium imaging under identical behavioral conditions and in matched neural populations, but in separate experiments. Mice performed a tactile delayed response task [25,34,35] (Fig 1A). In each trial, mice judged the location of an object with their whiskers. During the subsequent delay epoch (approximately 1.3 seconds), mice planned an upcoming response. Following an auditory 'go' cue, mice reported object location with directional licking (lick-left or lick-right). Two-photon calcium imaging and ephys were performed in left anterolateral motor cortex (ALM; Fig 1B, 1D and 1F). We report the results of three variants of calcium indicators in this study: GCaMP6s delivered by viral gene transfer, and GCaMP6s and GCaMP6f expressed in Thy-1 transgenic mice. In the first series of imaging experiments, neurons were transduced with adeno-associated virus expressing GCaMP6s (6s-AAV), a widely-used method [10,13] (data from [25], 1493 neurons, 4 mice). In the second, neural activity was recorded by imaging transgenic mice expressing GCaMP6s in cortical pyramidal neurons (6s-TG, data from [36], 2293 neurons, 1 mouse). We treated these datasets separately since the mode of delivery of GCaMP can affect its properties. Specifically transgenic GCaMP typically results in neurons that have lower GCaMP expression levels and faster fluorescence dynamics compared to neurons transduced with AAV [31]. Finally, we collected a dataset obtained with a faster, but less sensitive indicator, GCaMP6f (6f-TG, 2672 neurons, 2 mice). We refer to these three datasets as 6s-AAV, 6s-Tg and 6f-Tg, respectively. The 6s-TG data, though containing a large number of neurons across multiple behavioral sessions, came from a single animal. We compared this data to ephys data acquired with silicon probes that record multiple neurons simultaneously (720 neurons, 19 mice [25]) (Fig 1C, 1E and 1G). Ephys recordings were subsampled so that their recording depths matched the generally more superficial calcium imaging experiments. Neurons were recorded by 6s-AAV and 6s-Tg at 120-740 μm. The matched ephys subset was taken at 100-800 μm leaving 720 neurons. Neurons were recorded by 6f-Tg at 140-470 μm. The matched ephys subset was taken at 100-470 μm, leaving 225 neurons (S1 Table).

Filtering of selectivity by calcium imaging
Individual ALM neurons exhibit diverse temporal dynamics, including changes in selectivity over time (Fig 2B) [25,33,34]. We classified dynamics into three categories: 'monophasic' neurons showed consistent selectivity across the trial (Fig 2A); 'multiphasic' neurons changed selectivity over time (defined as having consistent selectivity for at least 335 ms which then changes and remains stable for at least 335 ms more) (Fig 2B); 'non-selective' neurons responded similarly across trial types but were still modulated during the task (Fig 2C). The proportion of monophasic selective neurons was similar between the datasets (58% ephys; 66% 6s-AAV; 50% 6s-Tg; 45% 6f-Tg). However, the ephys data set contained a substantial proportion of multiphasic neurons (220/720; 31%), much larger than the imaging datasets (6s-AAV: 76/1493, 5%; 6s-Tg, 98/2293, 4%; compare to matched ephys, 220/720, 31%; 6f-Tg, 69/ 2672, 3%; compare to matched ephys, 52/225, 20%; p < .001, χ 2 test; Fig 2D-2F). As neural response properties can change across cortical layers, we performed a more detailed analysis of the effect of recording depth on single neuron selectivity and find that selectivity was reduced in imaging compared to ephys across depths (S1A-S1D Fig).   What could account for this difference? Ephys records a sparse subset of neurons blindly, which could introduce biases, for example a bias towards neurons with higher spike rates (S1E-S1I Fig). In contrast, in our imaging experiments all visualized neurons were analyzed. In addition, the spike sorting procedure used to identify units from raw electrode potentials can introduce artifacts, including erroneous merging of neurons (S1J Fig). However, we found that these factors were unlikely to explain our results for two reasons. First, we considered intracellular recordings for which spike sorting is not required [37] (we used only trials without photostimulation and excluded one cell due to the small number of trials; see Materials and Methods). The fraction of multiphasic neurons was 25.7% (n = 9/35), similar to the extracellular ephys data (p = .67, χ 2 test; Fig 2G) but significantly different from the imaging data (p < .001, χ 2 test with both 6s-AAV and 6s-Tg). Second, we tested the question of spike-sorting induced biases by considering synthetic data in which we deliberately introduced merges at different probabilities. Merging neurons generated more multi-selective neurons when two neurons with different temporal selectivity profiles were merged, but the ratio of accidental merging had to be high (i.e., more than 10% of the neurons need to be completely conflated with another neuron) to explain the difference between datasets (S1J Fig). This suggests that the above differences are not driven by ephys being biased towards different populations of neurons than imaging.
Another source of difference could be that our comparisons so far were performed on the imaging data, not on spike rates inferred from the imaging data. Spike inference algorithms attempt to undo the transformation from spikes to calcium-dependent fluorescence, thereby recovering spike times (or spike rates) from imaging data [32,[38][39][40][41][42][43]. We tested two high-performing published methods: MLSpike [42] and MCMC [40]. We also tested seven additional models (http://im-phys.org/analyses, for additional comparisons between inference techniques see [44]). In our hands, spike inference only partially corrected the differences between the datasets and in some cases actually pushed the data even further apart (Fig 2H). For instance, MLSpike produced even lower proportions of multiphasic neurons. MCMC was more accurate, increasing the proportions of multiphasic neurons, but still far short of the actual proportion in the ephys dataset (and for 6s-Tg and 6f-Tg decreased instead of increased the proportion of monophasic neurons). Deconvolution at best recovered about half of the missing multiphasic selectivity (6s-AAV, 18%; 6s-TG, 17%, compared to 31% in matched ephys; 6f-TG, 8%, compared to 20% in matched ephys; Fig 2H).
Differences between calcium and ephys were not limited to the temporal nature of responses but were also present in trial-type selectivity. In the ephys dataset, right-preferring neurons (i.e., neurons whose firing rate before right licks was higher than before left licks) were as common as left-preferring neurons (Fig 2I, left; p = .118, χ 2 test) [25,34]. The same was true for imaging with a fast calcium indicator (Fig 2I, right; p = .102, χ 2 test), but not for imaging with slow indicators (Fig 2I, left; p < .001, χ 2 test; S4B Fig, spike-inference measure). What could be the cause of these differences? Spike rates in individual ALM neurons often increase or decrease during a trial in ramp-like patterns [25,34,45]. Right-preferring selectivity was more often associated with neurons ramping up on right trials, whereas left-preferring selectivity included many neurons with firing rates ramping down in the non-preferred estimates of fractions of monophasic (left) and multiphasic (right) neurons. The distribution of fraction of neurons for imaging data (source data), is given in gray for 6s-AAV (top), 6s-TG (middle) and 6f-TG (bottom). The distribution for ephys (target data) is in black. Distributions from inferred spike rates from MCMC (40) are in cyan and for MLSpike (42) are in magenta. Arrows denote the difference between the imaging data and ephys data (gray arrow) or inferred ephys and ephys data (cyan arrow for MCMC and magenta arrow for MLSpike). I. Fraction of right-preferring neurons in the different datasets divided into slow indicators (left) and fast indicators (right). J. Bar plot of fractions of ramp-down, ramp-up and 'other' cells in ephys for right-preferring (left) and leftpreferring neurons (right). https://doi.org/10.1371/journal.pcbi.1008198.g002

PLOS COMPUTATIONAL BIOLOGY
('right') trial (Fig 2J). The large difference between the rise and decay times of calcium indicators could lead to differences in how the selectivity of neurons that ramp up or ramp down gets transformed by the indicator. To test to what degree such explanations explain the data we developed a model of the spike-to-fluorescence transformation.

Simultaneous loose-seal electrophysiology and calcium imaging
Modeling the spike-to-fluorescence transformation requires simultaneous electrophysiology and calcium imaging at the level of individual neurons. Since this data was not available for the transgenic calcium indicators used here we performed loose-seal recordings and calcium imaging in individual neurons (Fig 3). The dataset consists of GCAMP6f-and GCAMP6sexpressing L2/3 neurons in transgenic mice (6s-TG, 22 cells; 6f-TG, 18 cells; S2 Table). This new data, which we make publicly available, more than doubles the number of currently publicly available simultaneously recorded neurons [32]. In addition we used published data with AAV-based gene transduction [13] (http://dx.doi.org/10.6080/K02R3PMN). Bursts of spikes produced fluorescence transients in the imaged neurons (Fig 3A-3D; top, 6f-TG cell; bottom, 6s-TG cell). Peak fluorescence responses increased monotonically with the number of spikes. The ability to detect single spikes varied considerably between neurons (Figs 3E-3G and S2).

PLOS COMPUTATIONAL BIOLOGY
The detection of single spikes was lower in transgenic mice than with AAV-based gene transduction, likely reflecting the lower expression level in the transgenic mice. Detection of spikes varied across neurons and, as expected, was better for 6s-TG than 6f-TG (Fig 3F). We used these recordings to build models of the spike to fluorescence transformation.

Spike-to-fluorescence transformations explain differences in single neuron selectivity
Using the newly recorded data we developed a spike-to-fluorescence (S2F) forward model to generate a synthetic calcium imaging data based on a neuron's spike train (Fig 4A) [13,14,46,47]. In brief, spike times were first converted to a latent variable, c(t), by convolution with a double-exponential kernel, with parameters rise-time (τ r ) and decay-time (τ d ). This latent variable was pushed through a non-linearity, F(c), with a non-linearity sharpness parameter (k), a half-activation parameter (c 1/2 , corresponding to the half-rise point of the nonlinearity) and a maximum fluorescence change (F m ) (Materials and Methods). The neurons were well-fit by the model (S3B Fig; variance explained, 6s-AAV, .87 ± .17, mean ± std.; 6s-TG, .80 ± .20; 6f-AAV, .82 ± .27, mean ± std.; 6f-TG, .66 ± .23). The inferred parameter values reflected known indicator kinetics. For instance, the decay times measured for neurons expressing GCaMP6s were longer than those expressing GCaMP6f (Fig 4B). However, there was substantial variability between the parameter values inferred across neurons (Figs 4B and S3). This heterogeneity likely reflects differences in calcium indicator expression and differences in calcium influx and calcium extrusion rates [28,48,49]. This variability is one factor that could explain the difficulty of the inversion of calcium responses which is central to spike inference approaches. We refer to simulations of calcium-dependent fluorescence based on application of the S2F model to spiking activity as 'ΔF/F Synth '. We applied the model to ramp-up and ramp-down neurons. For ramp-up cells the separation of activity across trial types was retained in ΔF/F Synth , albeit with slower dynamics (Fig  4C; top, an example neuron; bottom, synthetic 6s-AAV imaging of that neuron). In contrast, for many ramp-down cells ΔF/F Synth became non-selective (Fig 4D). Overall, selectivity was conserved more frequently for ramp-up cells than for ramp-down cells. Since right-preferring cells were more often associated with ramp-up dynamics, and calcium imaging is more likely to capture ramp-up selectivity than ramp-down selectivity, the model explains the greater fraction of right-preferring neurons in the calcium imaging data (Fig 4E). This was true whether a neuron happened to be a right-or left-preferring neuron, i.e., there were no significant differences in the fraction of detectability in the synthetic data once the data was broken down into two categories, ramp-up and ramp-down (p > .05, χ 2 test for all imaging conditions; S4A Fig). Consistent with the difference being produced by the slow decay kinetics of GCaMP6s, there was little difference between the fraction of right-and left-preferring neurons in the 6f-TG data (p > .05 for both cell types). In line with these results, we found that the forward model accounted for the drop in multiphasic neurons presented in the previous section (S4C Fig).
We further confirmed that these (and previous differences) between ephys and imaging were not driven by low SNR manually-identified neurons (S5 Fig). These data show that the spiketo-fluorescence transformation introduces systematic discrepancies in comparing the same analysis performed on ephys or imaging data.

Dimensionality reduction emphasizes different sources of variance in ephys and imaging
Large-scale recording methods are often used in combination with dimensionality reduction techniques to provide a compact description of the data [30]. For example, principal component analysis (PCA) finds modes of population activity that capture the largest amount of variance in neural activity [30]. Data visualization and analysis are often performed after from an input spike train. Middle: example fit and data of two cells. Experimental, measured ΔF/F (blue) is overlaid with the simulated ΔF/F Synth (orange) from the S2F model. The input to the model, the simultaneously recorded spikes (black), is shown below the traces. B. Distributions of the inferred model parameters for different indicators (yellow: 6s-AAV; green: 6s-TG; Purple: 6f-TG; gray: 6f-AAV. C. An example ramp-up neuron (top, ephys; bottom, 6s-AAV synthetic of that neuron); selectivity remains detectable in synthetic imaging data. D. An example ramp-down neuron (top, ephys; bottom, 6s-AAV synthetic of that neuron); selectivity becomes undetectable in synthetic imaging. E. S2F model predicts that selectivity of ramp-down neurons but not ramp-up neurons, would be often obscured in imaging datasets. Bar plot shows fraction of cells that remain detectably selective in synthetic imaging (6s-AAV synthetic, left; 6s-TG synthetic, middle; 6f-TG synthetic, right) plotted separately for ramp-down and ramp-up cells. https://doi.org/10.1371/journal.pcbi.1008198.g004

PLOS COMPUTATIONAL BIOLOGY
truncating the decomposition after a few components. Moreover, regression analyses are typically performed following dimensionality reduction to avoid having the number of variables (neurons) be close to the number of samples (trials). We found that the contribution of different sources of variance to the first principal components diverges between ephys and imaging. Accordingly, truncation of PCA in the first few principal components can lead to a qualitatively different PCA decomposition of neural activity between ephys and imaging.
We found substantial differences in performing PCA on ephys and imaging datasets. First, the content of the first PCs was remarkably different between ephys and imaging. In the ephys data, variance in the first PC was mostly due to temporal dynamics (98.71 ± 0.06%, mean ± std., bootstrap analysis). In contrast, for GCaMP6s imaging trial-type selectivity was the dominant source of variance in the first PC (6s-AAV: 60.39 ± 0.29%; 6s-TG: 44.51 ± 0.65%) (Fig 5A). This difference was consistent with the temporal smoothing imposed by slower indicators, and as expected temporal dynamics were predominant in the first PC of GCaMP6f, closer to the values found in ephys, (6f-TG: 64.87 ± 2.47%; depth matched ephys: 91.02 ± 0.14%). Second, in the ephys data, a relatively large number of PCs (> 10) contribute substantially to the variance, whereas in imaging and synthetic imaging most variance was explained by the first few PCs (test for number of PCs required to explain 90% of the variance, p < .001; t test, bootstrap) (a difference in the explained variance per component has been previously reported [50]; Fig S9 there).
These differences in the sources of explained variance can be seen in the profiles of the PC scores (Fig 5B) as well as in the profiles obtained by a standard exploratory visualization, depicting the evolution of activity over time as a trajectory in the space of the first two PCs (Fig 5C). Spike inference algorithms correctly reduced the amount of trial-type variance in the first principal components (although not fully), but with the caveat that the fraction of variance in the first two principal components was reduced too much (Fig 5D).The spike-to-fluorescence model captured the qualitative differences between ephys and imaging, but overestimated the increase in variance in the first two principal components (Fig 5A-5C).

Population activity history affects instantaneous decoding differently in ephys and imaging
Decoding analysis relating population activity to behavioral variables is widely used in systems neuroscience [3,6,33,51]. Such analyses typically relate the state of population activity at a given time point to a behavioral variable of interest, such as behavioral choice. They are one of the most common analyses as they are a straightforward approach to addressing the question of what information does a population of neurons contain. We performed decoding analysis to predict either trial type or the current behavioral epoch from population activity (Eqs 3-5, Materials and Methods). Decodability of trial type in ephys increased earlier (one-tail t-test, p < .001), but saturated at a lower level (one-tail t-test, p < .001) than in calcium imaging ( Fig  6A). Spike inference models, the MCMC framework in particular, partially reduced the delay of the rise of decodability but overestimated the decrease in decodability yielding lower performance in delay-response epoch than the ephys data (S6 Fig). Both observations were recapitulated by the S2F model (delay: one-tail rank sum test, 6s-AAV, p < .001, 6s-TG, p < .001, 6f-TG, p < .001; enhancement: 6s-AAV, p < .001, 6s-TG, p < .001, 6f-TG, p < .001; Fig 6B and  6C). The counterintuitive result of higher decoding accuracy in imaging for matched population size is explained by the long decay time of slow calcium imaging. The long integration in calcium imaging causes instantaneous decoding on imaging to be equivalent not to instantaneous decoding on spiking data, but to decoding on a more time averaged variable. Such a choice is advantageous when a larger proportion of the selectivity is stable, as was the case in ALM sample and delay selectivity (27). Consistently, decoders built on ephys that incorporated a one second integration time were more accurate than instantaneous ephys decoders and as accurate as slow indicators (Fig 6D). The delayed increase of decodability was also explained PLOS COMPUTATIONAL BIOLOGY by the forward model. GCaMP6f, with its reduced signal to noise, yielded less accurate population decoders (Fig 6E; spike inference measure, Fig 6F).
For different decoding analyses such averaging can reduce accuracy. For instance, the neurons that can be used to decode trial-type change substantially between the delay and response period, i.e., the patterns of population selectivity are typically dynamical themselves. To test the interaction of these dynamics with calcium indicators, we trained decoders to distinguish the current epoch in the task from the pattern of neural activity. In ephys (Fig 6G) we observed a rapid decrease of the probability of activity to belong to the previous epoch following a change in behavioral epoch, along with a sharp increase in the probability of belonging to the current epoch. In contrast, in the calcium imaging data such changes tended to be delayed and gradual, even for the fast calcium indicator (Fig 6H). This effect was also recapitulated in the synthetic calcium data from the S2F model (Fig 6I). In other words, at the change of a behavioral epoch, the asymmetry of fast rise times and long decay times in calcium indicators yields calcium imaging signals that are a mix of the decaying profile of activity in the previous epoch and the newly activated profile of activity elicited by the response epoch.

Population dynamics is temporally dispersed in calcium imaging
Neurons show temporally complex responses, even in simple trial-based behaviors [25,26]. These spike rate changes are critical for an understanding of neural circuit models of neural computation. For instance, relative timing can be analyzed for propagation of information through neural circuits [52]. Our analysis revealed a qualitative difference in the dynamics between populations recorded by ephys or imaging: a dispersion of the apparent dynamics. That is, the spike rates recorded in ALM peaked at transitions between behavioral epochs ( Fig  7A) [25]. In contrast, in the calcium imaging data, peaks of fluorescence were delayed and spread out over time, producing a more sequence-like appearance (Fig 7B) [51,53].
To quantify this effect we computed a measure of the 'peakiness' of the distribution of neuronal activity ('s') across recording modalities as the difference between observed neural activity and temporally uniformly distributed neural activity (P ¼ 1 2T ), summed over lick-left and lick-right trials:

PLOS COMPUTATIONAL BIOLOGY
TG (0.58 ± 0.07; one-tail rank sum test, p < .001) imaging data (Fig 7C). The forward model was able to recapitulate the differences between ephys and imaging (s = 0.39 ± 0.04, 6s-AAV ΔF/F Synth ; s = 0.37 ± 0.03, 6s-TG ΔF/F Synth ; s = 0.71 ± 0.07, 6f-TG ΔF/F Synth ; Fig 7DE). Using the forward model we found that the degree of delay in the peak response is dependent on interactions between multiple factors including the assumed temporal and non-linear parameters of the indicator, as well as the absolute value of the underlying firing rate (Fig 7FG). In addition, slow changes in spike rate can interact with transient dynamics. For example, small fractional changes in spike rate can appear as prominent ramping in ΔF/F, whereas subsequent brief increases in spike rate can appear blunted in ΔF/F [54] (S6 Fig). Here, spike inference algorithms were able to partially undo the difference between imaging and ephys, yielding a reduction in the temporal dispersal (Fig 7H). Similar overall results were obtained with different metrics for the sharpness of the maximum-activity-time distribution relative to a uniform distribution, such as the Kullback-Leibler divergence. Similar analyses on single neuron and population activity properties were performed on ephys and imaging data from the primary somatosensory cortex with qualitatively similar results (S7 Fig).

Discussion
Calcium imaging using fluorescent protein sensors is a powerful method for recording activity in large neuronal populations [5,8]. In systems neuroscience, cellular calcium imaging fills a complementary role to extracellular electrophysiology. Imaging can sample neural activity densely [5,10] and reveal spatial relationships between neurons with related activity patterns [55,56]. Imaging can be used in a cell-type specific mode to sample rare neuronal populations that are difficult to target using electrophysiology [9]. Imaging can be combined with postexperiment molecular analysis [23,55,57] or serial electron microscopy reconstruction [58,59]. Imaging can track the activity of individual neurons over long time scales to explore the circuit basis of learning [6,60]. Finally, imaging allows recording activity in neuronal microcompartments that are not accessible to electrophysiology [13,[61][62][63]. Electrophysiological recordings report neural activity with high temporal precision but have limitations of their own. Ephys recordings have a bias towards large neurons with high spike-rates. In addition, the process of transforming raw recordings into spike times associated with individual isolated units, i.e., spike sorting, can introduce artifacts such as merging spikes from different neurons.
Calcium imaging and ephys are often used almost interchangeably. A few studies have attempted to compare calcium imaging and electrophysiology and generally found qualitative agreement [8,13], but only using static and relatively coarse measures. More refined measurements reveal clear differences between the methods. For example, under standard recording conditions the detection efficiency for individual spikes is low for imaging and high for ephys (Figs 3E and S2C) [64]. Here we explored the effects of differences between ephys and imaging on measures typically used in system neuroscience. By comparing activity recorded with electrophysiology or imaging from matched neuronal populations during the same behavioral task we showed that the different recording methods can lead to diverging results. On the level of single neurons, the proportion of neurons with specific response properties and different dynamics of selectivity differs between calcium imaging and ehpys. At the level of neuronal populations, we find diverging results for the content of population activity variance (trial condition differences being the main source of variance in imaging while temporal dynamics are the main source of variance in ephys), the relation of population activity to behavior, and the overall pattern of population dynamics. Spike inference algorithms only partially recovered the difference between ephys and imaging across the multiple metrics considered in this study ( S8B Fig). Notably, we find large neuron-to-neuron variability in the inferred parameters of a forward, spike-to-fluorescence model. Analytical approaches that ignore this heterogeneity, as most do, will likely infer an incorrect average inverse solution which will be a poor match for individual neurons. Indeed, such variability coupled with the large heterogeneity in firing rates and temporal patterns makes correctly solving the inverse problem difficult, which potentially explains our results. At the same time, most of the differences we found between ephys and imaging were explainable by a forward-model that generates a synthetic imaging experiment counterpart of a neuron's ephys responses. Such a model takes into account the specific heterogeneity found in ephys recordings and can take into account neuron-to-neuron variability in calcium imaging properties by sampling randomly from the varying parameters of the spike-to-fluorescence transformation. Lastly, baseline subtraction can distort inference of modulation of spiking activity when the underlying baseline spike rate is unknown. For example, a small gradual change in baseline spike rate can be amplified compared to a large phasic response (S6 Fig). This poses a challenge especially for the interpretation of photometry [54], where averaging is performed over neurons.

Im-phys.org-A website for more detailed comparison ephys and imaging
We presented an extensive dataset with three calcium indicators, extracellular and intracellular electrophysiology and multiple models. However, a single research paper still represents a small distillation of all possible analyses. We developed an online resource, im-phys.org (http://im-phys.org/ S8 Fig) with three goals. First, the website allows analysis of all combinations of dataset and model, to evaluate the scenario that is most relevant to particular experiments. Im-phys.org allows spike inference algorithms to be systematically tested in real use case scenarios, i.e., not just testing recovery of any aspect of the patterns of spike rates but rather testing the impact of performing spike inference on undoing differences in specific metrics extracted from ephys and imaging (S8B Fig). Second, we hope that other groups will share data, models and analyses to allow more general comparison of ephys and imaging data. Imphys.org allows submission of data that can be incorporated into various comparisons that are displayed on the website, controlled through UIs. Though few labs have matched ephys and imaging datasets, many labs have one or the other. Our resource can serve to aggregate and combine these datasets, as well as find a best match from an imaging to ephys dataset (S8A Fig). Third, im-phys.org is linked to a github repository containing the analyses code, models (S2F and F2S), and related data. These allow the application of analyses and models on data without sharing it through im-phys.org.

Differences between interrogating population activity by ephys and imaging affect data-driven models
Differences in metrics of population activity between calcium imaging and ephys not only complicate the research literature but can result in the divergence of models used to understand the underlying data. Most population models, whether models in which the single units are modeled in more biophysical detail or more abstractly, are still highly reduced in the way they treat population heterogeneity. As such they often rely on dimensionality reduction of the recorded data to define the aspects of population activity the model is meant to capture. We found substantial differences between ephys and imaging data in application of PCA, and the truncation of the data after a few important data components can further amplify differences. In extreme cases one may be left with subsets that differ dramatically across imaging and electrophysiology. The amplification of difference by dimensionality reduction is relevant not just for modeling of the data, but more generally when generic forms of dimensionality reduction, such as PCA, are used early in the analysis pipeline to improve signal-to-noise ratio (which is important given the limited duration of typical behavioral experiments) for subsequent analysis, such as population decoding. Dimensionality reduction can be hard to avoid when analyzing large datasets [30], but can be modified to be less sensitive to known issues.

Going forward
Going forward, the discrepancies between ephys and calcium imaging can be reduced by improvements in calcium indicators, adjustments to experimental design and use of forwardmodels to identify the sensitivity of metrics of interest to the transformation in calcium dynamics. Calcium indicators could be improved on multiple fronts. They could be made faster and less nonlinear [18,65]. In addition, more uniform expression across cells can allow for more aggressive modeling of the nonlinearities that cannot be reduced, especially when coupled with priors on activity profiles derived from large scale electrophysiology. Faster indicators will result in the effect of previous activity history washing away faster, thus reducing effects that are history dependent. Imaging with multiple types of indicators in different experiments might produce additional constraints and help reduce biases. Voltage imaging holds great potential for fast accurate measurement of spiking activity, at least in sparsely labeled neuronal populations [66,67]. At the level of experimental design, when population activity in a given behavioral epoch involves fixed dynamics, such as settling to a steady state or consistent ramping, longer trial epochs will allow the effect of the previous dynamical state to decay away. Indeed, we found a smaller discrepancy between the number of multiphasic neurons in ephys and 6s-TG data when the behavioral paradigm was adapted to use longer delay epochs.
Finally and most importantly, the sensitivity to the specific properties of population activity that are of interest to a particular hypothesis can be evaluated by forward models, as we performed here. For example, imaging studies could use forward models on published ephys data to evaluate the potential effect of the spike to calcium transformation on the metrics of interest. Then differences from the expected value given the transformation can be analyzed and metrics that are shown to be more variable given heterogeneity in transformation parameters can be flagged. This effort will become easier as neurophysiology probes become more powerful [68], data sharing more common, and preprocessing more standardized.
Overall our results highlight the importance of a deeper understanding of the transformation imposed by calcium imaging. The fact that our model was able to reproduce differences between the recording methods suggests that additional data and associated analysis methodology developments could potentially better address quantitative comparisons between analyses of population activity performed from imaging or ephys data. The online resource we built allows researchers to better understand how the discrepancies we observed would be relevant for the circuit and recording method of interest. More quantitative interpretation of calcium imaging and full utilization of all its advantages will require investment in ground-truth data sets and new statistical approaches. We hope this study and our online resource will catalyze this crucial effort.

Electrophysiological and imaging population activity recordings
Electrophysiological ('ephys') [25] or calcium imaging [25,36] recordings were performed in separate experiments and described in detail in the original publications (S1 Table;http://imphys.org/data). Mice were trained to perform a delayed version of a tactile discrimination task. Mice reported the position of a pole (anterior or posterior) by directional licking (lick-left or lick-right) after a delay period. The duration of sample and delay epoch was 2.6 s. In ephys, the delay epoch was 1.3 s; in imaging, it was 1.4 s. Trials with early licking were excluded from analysis. Neuronal depths were 100 to 800 um (ephys), 150-740 um (6s-AAV), 120-640 um (Thy1-GP4.3 mice, 6s-TG), and 140-470 um (Thy1-GP5.17 mice, 6f-TG). Only sessions with more than 20 trials for each type (right-trial and left-trial) were included. For imaging data, we performed a post-hoc detection of outliers and removed trials where more than 30% of the time points contain a signal with 3 standard deviations away from median (these outliers relate to baseline fluctuations across trials, and removing them was necessary for variance-based analysis). Neurons were limited to putative pyramidal neurons. These reduced the total number of neurons with sufficient number of trials, yielding 1493, 2293, and 2672 units for 6s-AAV, 6s-TG and 6f-TG imaging, respectively. We note that despite the 6s-TG data containing many neurons across multiple sessions all data came from a single animal. Though we find this data to be consistent with other imaging data, the single animal source could be a potential issue in that differences in an individual animal's behavior may cause differences in neural encoding.
We used two sets of data from loose-seal electrophysiological recordings and imaging from GCaMP6-expressing neurons in primary visual cortex. In one set neurons were transduced with 6s-AAV and 6f-AAV (data from [13]). For 6s-AAV data, imaging was performed after 2-4 weeks of expression. In the other set we used 6s-TG and 6f-TG mice [31,69]. More details of all datasets are described at http://im-phys.org/data.
In the imaging data, individual neurons were visually identified based on average fluorescence images as well as "neighborhood correlation maps" (where the brightness of each pixel encodes the correlation of its fluorescent time course to that of its neighbors), which highlights active cells. Each ROI was inspected to correspond to a morphological neuron and have a generally donut-like shape (since the nucleus does not express the indicator). The fluorescence time course of each cell was measured by averaging all pixels within the ROI, with a correction for neuropil contamination. The fluorescence signal of a cell body was estimated as F cell (t) = F roi (t)-r � F neuropil (t), with r = 0.7. The neuropil signal F neuropil (t) surrounding each cell was measured by averaging the signal of all pixels within a 40 μm radius from the cell center (excluding all selected cells). For each imaging plane, the number of ROIs was about 30+/-19 cells in the 6s-AAV imaging conditions, and 82+/-48 cells per recording plane in 6s-TG, and 122+/-32 cells per recording plane in 6f-TG.
Simultaneous loose-seal recordings and imaging (Figs 3 and S2; S2 Table) was performed as described previously [13] (more details at http://im-phys.org/data). GP4.3 and GP5.17 mice [31] were lightly anesthetized (0.5% isoflurane). Drifting grating visual stimuli were used to drive activity in the visual cortex. Loose-seal recordings were made through a craniotomy windows over the primary visual cortex. Two-photon imaging and loose-seal, cell-attached recordings were performed simultaneously. We acquired images in both low (284 x 284 um 2 ) and high (38 x 38 um 2 ) zoom configurations. Extraction of fluorescence transients was as described [13]. All procedures in mice experiments were performed in compliance with the Janelia Research Campus Institutional Animal Care and Use Committee.
To analyze the spike-triggered fluorescence changes, we created 1.2-s snippets around action potentials (APs), where a few APs only happened from 200 ms to 400 ms from the onset of each snippet. We computed baseline fluorescence using the snippets without AP in the entire time series. For snippet with APs, we required the fluorescence changes within the first 200 ms (before APs) was around baseline level (Fig 3C and S2A). We computed ROC curves for detecting one (Fig 3E, inset panels) or many APs (Figs 3E and S2C) compared to baseline fluorescence fluctuations. D-prime was computed as <maxðDF=FÞ 1AP >À <maxðDF=FÞ no AP > stdðDF=FÞ no AP .

Spike-to-fluorescence model
We developed a phenomenological model that converts spike times to synthetic fluorescence time series [13,14,25,46]. This 'spike-to-fluorescence' (S2F) model consists of two steps. First, spikes at times {t k } are converted to a latent variable, c(t), by convolution with a double-exponential kernel: τ r and τ d are the rise and decay times, respectively. n i ðtÞ � Nð0; s 2 i Þ is Gaussian distributed 'internal' noise. c(t) was truncated at zero if noise drove it to negative values. Second, c(t) was converted to a synthetic fluorescence signal through a sigmoidal function: k is a non-linearity sharpness parameter, c 1/2 is a half-activation parameter, F m is the maximum possible fluorescence change. n e ðtÞ � Nð0; s 2 e Þ is Gaussian external noise [28,46,70]. We estimated the model parameters for each imaging condition using the simultaneous ephys and imaging experiments (S3A, S3B and S3C Fig). We then applied the S2F model to ephys data using parameters randomly sampled from the parameter distributions except for the parameters directly related to the nonlinearity. Since ALM spike rates in ephys vary over a larger range than the spike rates in the primary visual cortex these parameters may be underconstrained. Accordingly, we followed an alternative strategy to choose these parameters for a given neuron. For each neuron, after assigning the rest of the parameters, we transformed the spike trains to calculate the phenomenological calcium variable c(t). We then estimated the nonlinear parameters for that neuron by calculating the values that would best transform c(t) to the fluorescence dynamics of any neuron in the imaging dataset. For all neurons we were able to find matches with Spearman correlation higher than 0.7 between mean dF/F and mean synthetic dF/F. The parameters inferred in this process recapitulated the correlation structure of c 1/2 and k found in the data (S3D Fig).
Given the short timeframe over which baseline activity was recorded before each trial started, we extended the pre-trial period by simulating a Poisson spike train for the unrecorded time between trials with a constant rate equal to the baseline mean activity.
To relate this model to previously studied models, Eq 2 can be generalized as ΔF/F Synth (t) = f(c(t))+n e (t), where f(�)and n e ðtÞ � Nð0; s 2 e Þ is Gaussian external noise [28,46,70]. We considered two alternative S2F models used by previous studies (note though that both of these models did not contain internal noise in Eq 1): S2F Linear model: f(c(t)) = F max c(t)+F 0 , where F max is a scaling parameter (we kept the naming as max to clarify the relationship to other models); F 0 is the baseline (S3G Fig, left).

S2F
Hill model: f cðtÞ ð Þ ¼ F max cðtÞ n cðtÞ n þK d , where F max is the maximum possible fluorescence change; n is the nonlinearity; K d is a half-activation parameter. Model performance is summarized in the supplementary material and reported in http://im-phys.org/analyses for each single cell (S3G Fig, right).
Model parameter sensitivity (S3C Fig) was defined as the decrease of the fraction of explained variance, as a function of the deviation of the parameter value from the estimated solution: g ¼ DEV=EV DP=P , where P2{τ r ,τ d ,k,c 1/2 ,F m }.

Calcium imaging to spikes for non-simultaneous ephys-imaging recordings
We performed fluorescence-to-spike (F2S) inference using two published models [40,42] and code available on GitHub. Specifically, the default model in CaImAn was used to solve the FOOPSI problem to infer firing rates (using OASIS) and then MCMC was used to infer spike times. When performing algorithm comparison some published approaches may rely on parameters that are difficult to tune. In order to avoid mistuning the hyperparameters, we intentionally selected F2S models that do not require manual setting of parameters or hyperparameters. Instead, both toolboxes autonomously fit the parameters they required through optimization processes provided in the shared code. Moreover, we used the simultaneous recorded loose patch and imaging data (where the loose patch provides ground truth) to ensure that fluorescence-to-spike models were implemented correctly and return reasonable results.

Single neuron analyses
Neural selectivity for left-or right-trials was determined using two-sample t-tests, with neural activity binned over 67 ms, which corresponds to one imaging frame. A neuron was selective if it showed selectivity (p < .05) for >335 ms (5 continuous frames). A selective neuron was multiphasic if the polarity of selectivity switched, with continuous periods of selectivity lasting at least 335 ms long. Selective neurons that were not classified as multiphasic according to this criterion were classified as monophasic. Selective neurons (mono-and multiphasic) were classified into left-and right-preferring cells according to the condition in which their activity was higher (Fig 2IJ). Ramp-down (ramp-up) were defined as neurons that have activity that is greater (less) in the baseline epoch compared to the delay epoch (paired t-test, p < .05 across trials). Note that ramp-down cells were excluded from the analysis of peakiness (Fig 7). respectively. 6f-Tg related population analyses were only applied to cells with ROC > 0.7.

Population decoding
We applied regularized linear discriminant analysis (LDA) on neural dynamics grouped into bins corresponding to single imaging frames (67 ms) to compute the instantaneous decodability of trial type. Regularization was performed by sparsity-regularized LDA [33,71]. The optimal LDA decoder was computed separately for each time bin using correct trials only. We estimated performance for the instantaneous LDA decoder by sampling subsets of units and averaging 100 subsamples. We separated the trials of each neuron into non-overlapping training (70%) and testing (30%) sets. The instantaneous decoder of trial type was computed from training set and its performance was evaluated on the testing set.
We tested the ability of neuronal population activity at different times to discriminate the behavioral epoch by using a four-class LDA (Fig 6F-6H). We defined the latency of neuronal response to behavioral epoch by the first time at which decoding reached a 0.7 accuracy threshold (arrows on Fig 6F-6H). Regularization was performed by sparsity-regularized LDA [33,71].

Sensitivity analysis of peakiness
We used as a reference value an artificial, synthetic ephys dataset with 50 neurons whose firing rates were manually set to be non-zero only at the time corresponding to one imaging frame. From left to right in Fig 7G, S2F model was configured (1) using the same parameters for all cells, except that the internal noise and external noise were randomly generated (at the same amplitudes); (2) using the same parameters for all cells, except that the spike times were jittered within the time length of the frame (i.e., all spikes were kept in the same image frame); (3) using the same parameters for all cells, except that the spike rates in the original frame varied from 0.1 Hz to 5 Hz (spike trains generated using Poisson process); (4) using the same parameters for all cells, except that the decay time constant of calcium indicator was randomly sampled from its distribution; (5) using the same parameters for all cells, except that nonlinearity of calcium indicator was randomly sampled from its distribution; (6) both decay time constant and nonlinearity of calcium indicator were randomly sampled; (7) the same as (6) except that the spike rates in the original frame vary from 0.1 Hz to 5 Hz (spike trains generated using Poisson process).

Distributions of measures
For S2F model, one can randomly sample all the parameters from the distributions measured using simultaneous ephys-imaging recordings and all possible noise levels. The distribution of a measure ψ (e.g. fraction of mono-selective neurons, peakiness etc.) can then be computed through synthetic data using randomly sampled S2F models. Specifically: where P(ΔF/F Synth (t)|{t spike },Θ) is derived from Eqs 1, 2, and P(ψ|ΔF/F Synth (t)) describes probability of measure ψ at a given value for dynamics ΔF/F Synth (t), P({t spike }) is the empirical distribution of spike events in ground truth ephys and P(Θ) is the distributions of S2F parameters. For unsupervised-learning-based F2S models (i.e. MCMC and MLSpike), we performed 100 subsamples of deconvolved synthetic ephys data to estimate distribution of the parameters.

Computer code
All codes for model benchmarks and comparison metrics are recompiled and packed with data through im-phys-API (https://github.com/zqwei/Im-phys-API), which can be available at im-phys.org/codes. The API will come with a user-friendly interface in which one can reproduce all results in our paper and extensive results on im-phys.org.
We also provide repos for benchmarks of S2F and F2S models at https://github.com/zqwei/ Ca-Imaging-Deconv-List (DOI: 10.5281/zenodo.3960635) and comparison metrics at https:// github.com/zqwei/Neural-Recording-Methodology-Comparison (DOI: 10.5281/zenodo. 3979786) and website interface at https://github.com/zqwei/Im-phys-org. 19; 6s-AAV: p = .73; 6s-TG: p = .97; 6f-TG: p = .43). For the same depth, ephys has more selective neurons and more multiphasic selective neurons than imaging (χ 2 -test, p < .001 for all). B. Percentage of variance of neural activity explained by each principal component (Fig 5). Left: length of horizontal bar shows fraction of variance in each principal component. Colors show breakdown into different types of variance (blue: trial-type, red: time, orange: other). Right: horizontal bar shows number of neurons in each depth. For the same depth, the 1 st PC show more temporal dynamics content in ephys and 6f-TG (χ 2 -test, p < .001 for all), while that show more trial-type content in 6s-AAV and 6s-TG (χ 2 -test, p < .001 for all). C. Decodability of trial type (Fig 6). The number of cells at each depth is identical to that in PCA analyses. The decodability differs across depths, where the neurons in superficial layers show weak decodability of trial type in sample-delay epoch (multivariate ANOVA test on time-series to depths with n > 50 cells in ephys, 6s-AAV and 6s-TG; that to depth with n > 10 cells at ROC > 0.7 in 6f-TG; p < .001, 1000 bootstrap). For the same depth, the average decodability of trial type is higher in late delay to early response in imaging than that in ephys (rank sum test, p < .001 for all, 1000 bootstrap). D. Peakiness (Fig 7). The peakiness differs across depths (rank sum test, p < .001, 1000 bootstrap). For the same depth, peakiness is higher in ephys than imaging (rank sum test, p < .001 for all, 1000 bootstrap). E-I. Analysis as a function of spike rates. E. Schematic of resampling procedure to target firing rate and distribution of firing rates after subsampling to different average spike rates (magenta: original data; cyan: ephys subsampled to 1 Hz average; yellow: ephys subsampled to 4 Hz average; green: ephys subsampled to 10 Hz average. F. Effect of target firing rate subsampling on fraction of monophasic (left) and multiphasic neurons (right) G. Values of peakiness are shown with the same color code as F. H. Fraction of variance in the first principal components are shown by length of bar with same color code as F. Saturation of bar shows the breakdown into different components of variance (trial-type, time, other). I. Trial-type decodability over time shown with the same color code as F and with 6s-AAV added as a reference. J. Analysis as a function of spike sorting accuracy-possible effects of merging. Increased fraction of multiphasic neurons is unlikely to have stemmed exclusively from failures of spike-sorting. Box plots indicate fraction of neurons in each selectivity class (left: non-selective, middle: monophasic, right: multiphasic) as a function of increased probability of artificially induced merging between two neurons. Dashed line indicates fraction of selectivity type found in the ephys dataset. Pairwise correlation plots for each of the spike-to-fluorescence parameters. Panels along the diagonal describe the distribution of each parameter (these are identical to Fig 4B but reproduced to facilitate comparisons). Off-diagonal panels depict the correlation between two parameters. Spearman's rank correlation of parameters across cells (regardless of recording method) and associated p-value are provided in each off-diagonal panel. Each circle corresponds to a response set. Data from the different indicator conditions is overlaid and marked by color. (gray: 6f-AAV, 11 neurons, 37 response sets; yellow: 6s-AAV, 9 neurons, 21 response sets; purple: 6f-TG, 18 cells, 32 response sets; green: 6s-TG, 22 neurons, 33 recording periods). B. Boxplots of explained variance of S2F on validation data for simultaneously recorded neurons (color follows the same convention as in A). C. Boxplot of distribution of parameter sensitivity values. D. Pairwise correlation of re-estimation of k and c 1/2 using ALM imaging dynamics (Materials and methods). The re-estimated parameter values are shown as a scatter plot. Each dot corresponds to a neuron (n = 720 for 6s-AAV and 6s-TG; n = 225 for 6f-TG in matched depths). The distribution of the re-estimated parameter values strongly overlapped with those obtained in simultaneous imaging-ephys recordings. c 1/2 and k had a strong inverse correlation as in the simultaneously recorded data (r s < -.64, p < .001). E. Boxplots of firing rates of neurons in each recording sessions (6f-AAV, gray, 0.51 ± 0.25 Hz, mean ± std., range 0.05-1.25 Hz; 6s-AAV, yellow, 0.43 ± 0.38 Hz, range 0.05-1.68 Hz; 6f-TG, purple, 1.25 ± 1.48 Hz, range 0.09-5.22 Hz; 6s-TG, green, 1.08 ± 0.85 Hz, range 0.09-3.00 Hz). F. Scatter of simultaneous ephys-imaging data model fit and the dynamical range of the data (expressed as mean spike rate). G. Scatter of simultaneous ephys-imaging data model fit quality between different S2F models (Materials and methods). Left: comparison between S2F linear model (x-axis) and S2F sigmoid model (y-axis); right, comparison between S2F hill model (x-axis) and S2F sigmoid model (y-axis). (TIF)

S4 Fig. Forward model explains differences in neuronal selectivity between imaging and ephys. A.
Fraction of cells that remain selective in synthetic imaging plotted separately for ramp-down and ramp-up cells (left: 6s-AAV synthetic, middle: 6s-TG synthetic, right: 6f-TG synthetic), which is further broken down into right-(blue) and left-preferring (red) trials. B. Fraction of right-preferring neurons in imaging after spike inference models. Left: the same analyses as that in Fig 2I, but performed on inferred spiking data obtained via the MCMC framework; right: the same analyses as that in Fig 2I, but performed on inferred spiking data obtained via the MLSpike framework. C. Estimation of the fraction of monophasic and multiphasic neurons that would be discovered by an imaging experiment through use of the S2F forward model. Plots show the estimates for monophasic (left) and multiphasic (right) neurons. The proportion of the source data, ephys, is in black. The experimentally measured proportions in imaging are in gray. Blue color shows the distribution of selectivity type proportion for different repetitions of each algorithm on subsamples of the dataset for synthetic imaging using 6s-AAV (top), 6s-TG (middle) and 6f-TG (bottom) parameters. D. Example neurons that change their selectivity after F2S models. Top, a mono-phasic neuron becomes nonselective after F2S model; bottom, a mono-selective neuron becomes multi-phasic after F2S model. E. Fraction of selectivity change per selective group after F2S models. Left, MCMC F2S model; right, MLSpike F2S model. Top, 6s-TG imaging; middle, 6s-AAV; bottom, 6f-TG. First column, non-selective neurons before F2S model; second column, mono-phasic; third column, multi-phasic. First bar, non-selective neurons after F2S model; second bar, mono-phasic; third bar, multi-phasic. (TIF)

S5 Fig. Fraction of selective neurons as a function of the imaging signal-to-noise ratio.
We estimated SNR using the procedure in the widely used CaImAn package, in which the noise level is estimated as the exponential of the mean of the logarithm of power spectral density. We then generated datasets including only a subset of neurons by moving the threshold up from its zeroth percentile to its 100th percentile. A. non-selective (yellow), mono-selective (blue) and multi-selective neurons (red). Top, 6s-AAV imaging; middle, 6f-TG; right, 6s-TG. An important assumption of F2S models is that baseline fluorescence reflects zero spikes. This assumption is rarely met. For example, in our study, ALM neurons fire at about 6 Hz in the pre-sample period. This background firing rate, which can vary across time and from neuron to neuron, can distort measures of neural dynamics based on imaging. We explored this effect using computer simulations. The firing rate of a simulated neuron (baseline at 3 Hz) was gradually increased by 2 Hz over four seconds followed by a brief phasic response (1 to 5 spikes were evoked in 70 ms; Fig S6A). We computed the peak over ramp ratio (i.e. ratio of the maximum firing rate during phasic firing to the maximum firing rate before phasic firing) as the measure of the detectability of the phasic activity from tonic activity. We found that the small change of the tonic activity became prominent while detectability of phasic activity was reduced by a factor of >10 in calcium imaging (Fig S6BC). This stems from the integration in calcium dynamics. Although the ramping activity was weak, it was integrated over seconds; although the phasic activity was strong, it was only integrated over 100 ms. The degree to which the detectability was reduced in imaging (comparing to ephys) increased with the level of baseline spike rate (Fig S6D). Therefore, baseline subtraction can be problematic for inference when the underlying baseline spike rate is unknown. A. Simulation (200 trials) of a single neuron, whose firing rate slowly increased from 3 Hz to 5 Hz over~4 seconds and was then followed by a transient increase (phasic firing) to 15Hz or 30 Hz in 70 ms, and then reset to 3 Hz (baseline). Black dots, spike events; gray dash line, onset time of transient increased spike events. B. Spike rate. Black line, phasic firing at 30 Hz; gray, phasic firing at 15 Hz. C. Mean ΔF/F synth from S2F model. The change of fluorescence came more from the small change of the baseline firing, and little from the strong phasic firing, which results in difficulty to detect transient modulations of spikes in calcium. D. ΔF/F synth peak/ramp ratio is less than that in spike, and such effect increases with baseline spike rates. Colors of circles correspond to baseline spike rates; size corresponds to number of spikes in phasic firing. Dash line corresponds to equal ratio between x and y axes. (TIF) S7 Fig. Analysis of matched datasets from a primary somatosensory area. Differences between ephys and imaging are likely to depend not only on the analysis and indicator, but also on the underlying dynamics which change from one brain area to the other. We analyzed a second group of matched population recordings, obtained from primary somatosensory area (S1) rather than ALM. We find that differences in some analyses were no longer present, but others remained. We find that the fraction of multiphasic neurons in S1 was far smaller than that in ALM (n = 1/55, ephys; n = 4/719, 6s-AAV; p < .001, χ 2 test) and there was no significant difference between the fraction of multiphasic neurons observed in ephys and imaging (p = .801, χ 2 test). Our forward model correctly predicted this lack of change (p = .674, χ 2 test between imaging data and synthetic imaging data). Similarly to ALM data, trial type variance dominated the first principal component in imaging but not in ephys and population decoding was substantially delayed in imaging relative to ephys. A. Single neuron selectivity type. Bar plots show fraction of neurons found in each of the three selectivity types (left: monophasic, middle: multiphasic, right: nonselective) for the different recording methods (left: ephys, middle: 6s-AAV, right: 6s-AAV synthetic). B. principal component variance content. Bar plots show fraction of variance contained in the first three principal components (from left to right: PC1, PC2, PC3). Each bar is broken into the contribution from trial-type variance (blue), time variance (red) and other (yellow). C. Population trial-type decodability. Plot shows mean decodability over time for ephys: top, 6s-AAV: middle and synthetic 6s-AAV: bottom. Dashed lines designate different trial periods (sample, delay response). Note that the experiments with 6s-AAV had a slightly shorter delay period, hence the difference in location of dashed lines. Since 6s-AAV synthetic is derived from ephys it has the same trial structure as ephys. (TIF) S8 Fig. A community based online resource, im-phys.org, for determining quantitative effects of measuring population activity by imaging or ephys. A. Top, schematic of our community resource that can allow datasets acquired by different labs to be found in one location and matched in analyses. Bottom, schematic of combining different analyses with different datasets on im-phys.org. B. Schematic of using im-phys.org to predict values (metric distributions) expected for different population analyses from datasets acquired by different techniques through use of a variety of forward and inverse models. (TIF) S1 Table. Summary of large-scale ephys and imaging recording, more data can be found at im-phys.org/data. List of datasets. Includes type of dataset, number of neurons, link to dataset, figures in manuscript and citation for data. (DOCX) S2 Table. Summary of simultaneous ephys-imaging recording of single cells in GCaMP6-TG mice. List of single neurons recorded simultaneously by ephys and imaging. Includes duration of recording, spike rate properties and inferred decay time constant of calcium imaging. (DOCX)