VolPy: Automated and scalable analysis pipelines for voltage imaging datasets

Voltage imaging enables monitoring neural activity at sub-millisecond temporal and sub-cellular spatial scales, unlocking the study of subthreshold activity, synchrony, and network dynamics with unprecedented spatio-temporal resolution. However, high data rates (>800 MB/s) and low signal-to-noise ratios create bottlenecks for analyzing such datasets. Here we present VolPy, an automated and scalable pipeline to pre-process voltage imaging datasets. VolPy features motion correction, memory mapping, automated segmentation, denoising and spike extraction, all built on a highly parallelizable, modular, and extensible framework optimized for memory and speed. To aid automated segmentation, we introduce a corpus of 24 manually annotated datasets from different preparations, brain areas and voltage indicators. We benchmark VolPy against ground truth segmentation, simulations and electrophysiology recordings, and we compare its performance with existing algorithms in detecting spikes. Our results indicate that VolPy's performance in spike extraction and scalability are state-of-the-art.


…crucial to check claims and expected behavior against other methods, and to revise either the methods or the claims if and when unexpected results appear.
Both concerns are reasonable, and we agree that addressing them makes for a more impactful paper. We have addressed all of the raised concerns, as outlined below in our responses to the reviewers.
There are also a large number of specific questions and concerns from reviewer 1, which must be addressed point by point.
We addressed, point by point, all of the questions and concerns raised by Reviewers 1 and 2.
Please also answer the following questions from the editor:
1. Were surrogate data sets used, or can they be used, to test the validity of the methods?
2. Were model data sets with known ground truth spike times, cell identities, etc. constructed or used, or can they be used, to test the validity of the methods?
We addressed questions 1 and 2 from the editor in:
• Reviewer 1: Main comments, point 1, and Specific comments, point 6
• Reviewer 2: Main point 2
To keep an already long rebuttal letter compact, we do not duplicate those responses here.

Comments to the Authors:

Reviewer #1:
The paper presents an automated analysis framework for voltage imaging data alongside a collection of 24 manually annotated datasets. The framework performs motion correction, segmentation and spike extraction, and has been incorporated into the CaImAn Python codebase. Performance on the included datasets, especially L1, is impressive. This software pipeline, along with the curated datasets, will advance the analysis of voltage imaging in the community.
We thank the reviewer for the thorough review and the useful suggestions, which we hope to have addressed satisfactorily.

Main comments:
1. The method in [13] has also been tested on voltage imaging; why is there no comparison? If the results are indeed poor, then the authors can at least report that they attempted these approaches with poor performance. A comparison demonstrating that the authors have indeed improved on the SpikePursuit algorithm should also be included.
Following the reviewer's suggestions, we compared VolPy thoroughly to other approaches, including calcium imaging analysis algorithms and [13]. In most cases, some very specific preprocessing of the data was necessary to get the algorithms working. Moreover, no evaluated method except SpikePursuit had a means to automatically extract spikes, so we also implemented some basic spike extraction methods. In the text we included the following explanation regarding the methods we compared to:

For the simulations, we compared VolPy to all benchmarked algorithms mentioned above. The results showed that in most cases, especially in the low-SNR settings, VolPy performed better than the other methods in F1 score and SpNR. Since spike extraction accuracy (F1 score) changes with the threshold, we selected for each algorithm (including VolPy) the threshold that yields the best F1 score for a given spike amplitude; this threshold represents the best result an algorithm could ideally achieve (Fig 4c left). SpikePursuit adopts an adaptive thresholding method that does not require manual thresholding, and was therefore compared directly to VolPy with the automatic threshold (Fig 4c right). The performance of VolPy was slightly better than that of SpikePursuit, mainly because we introduced a more robust way to remove the background.

We also compared VolPy with CaImAn, MeanROI and SGPMD-NMF on voltage data with simultaneous electrophysiology. Only VolPy and MeanROI were able to extract reasonable fluorescence traces on the two fish datasets. We believe that CaImAn and SGPMD-NMF failed because neurons in these datasets were not firing with homogeneous spatial footprints, as one can observe in S4 Vid. In addition, the denoising step in SGPMD-NMF sometimes reduced the SNR on these noisy datasets. For the other voltage imaging datasets (L1, TEG, HPC), VolPy showed better SpNR than CaImAn, MeanROI and SGPMD-NMF. These comparisons are reported in Figs 4 and 5.
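To make the threshold-selection procedure above concrete, here is a minimal sketch; the detector and scorer arguments are hypothetical placeholders standing in for any benchmarked algorithm, not VolPy functions:

```python
def best_f1_threshold(trace, gt_spikes, thresholds, detect, score_f1):
    """Sweep candidate thresholds and keep the one that maximizes F1.

    `detect(trace, thr)` and `score_f1(detected, gt_spikes)` stand in for
    an algorithm's spike detector and an F1 scorer against ground truth.
    """
    best_thr, best_f1 = None, -1.0
    for thr in thresholds:
        detected = detect(trace, thr)       # spike times at this threshold
        f1 = score_f1(detected, gt_spikes)  # match against ground truth
        if f1 > best_f1:
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1
```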
Since some of the algorithms we compared against VolPy require a substantial amount of work and tweaking to run, often involving the execution of multiple portions of code in different languages, we did not carry out multiple runs of the simulations. We do think the results across simulations and real datasets display sufficient evidence of the superiority of VolPy.
Although the increment in spike detection accuracy over SpikePursuit is only mild, the computational performance of VolPy is significantly improved (Fig 6c, d): VolPy was 3.5× faster than SpikePursuit and consumed 3× less memory. Simulation results showed that VolPy with either the adaptive or the simple threshold outperforms SpikePursuit. Moreover, VolPy is packaged as a usable, documented and maintained open-source package, already popular within the community. Finally, VolPy was more than 10× faster and used much less memory than SGPMD-NMF.

2. Why were only 2 annotators used? What was the level of agreement between them?
We thank the reviewer for the opportunity to improve the paper. We re-annotated the datasets with three independent annotators, using a different strategy that selects neurons based only on the mean and correlation images, and we assessed the degree of agreement between annotators. We explain these points in the text as follows: In Fig 3c, we compared the performance of VolPy with that of the annotators. We regarded the combined annotations as ground truth and compared each human annotation against it; the level of agreement was measured as an F1 score, which also serves as a measure of the difficulty of each dataset. For the L1 datasets, the average F1 score of the human annotations was 0.92±0.01; for the TEG datasets, 0.89±0.02; and for the HPC datasets, 0.82±0.09. Human annotators thus agreed most on the L1 datasets, less on the TEG datasets, and least on the HPC datasets.
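For concreteness, the sketch below shows one way to compute such an F1 agreement score between two sets of binary neuron masks; the intersection-over-union criterion and the 0.5 threshold are illustrative assumptions, not necessarily the exact matching criterion used in the paper:

```python
import numpy as np

def annotation_f1(masks_ref, masks_test, iou_thr=0.5):
    """F1 agreement between two sets of boolean 2-D neuron masks.

    A pair of masks counts as a match when their intersection-over-union
    exceeds iou_thr; each mask is matched at most once (highest IoU first).
    """
    ious = np.zeros((len(masks_ref), len(masks_test)))
    for i, ma in enumerate(masks_ref):
        for j, mb in enumerate(masks_test):
            inter = np.logical_and(ma, mb).sum()
            union = np.logical_or(ma, mb).sum()
            ious[i, j] = inter / union if union else 0.0
    matched = 0
    while ious.size and ious.max() > iou_thr:
        i, j = np.unravel_index(ious.argmax(), ious.shape)
        matched += 1
        ious[i, :] = 0   # each mask participates in at most one match
        ious[:, j] = 0
    precision = matched / len(masks_test) if masks_test else 0.0
    recall = matched / len(masks_ref) if masks_ref else 0.0
    return 2 * precision * recall / (precision + recall) if matched else 0.0
```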
3. Pages 7-8: multiple parameters and algorithmic choices are made in the processing and spike time estimation algorithms. Can the authors provide some insight into, or motivation for, the selected parameters and heuristics? For example, why are 8 principal components used to estimate the background in line 180?
In our experience, many of the largest principal components describe structured global noise in voltage recordings. We chose to subtract the largest 8 components because subtracting fewer components would remove less of this spurious variance, while subtracting many more components would risk removing neuronal signal. We admit that this is a rudimentary denoising method, but it is simple to implement and effective. We have clarified this in the text.

Regarding the other parameters and heuristics, we thank the reviewer for the suggestion and agree that we were not precise in discussing these points. We hope to have addressed them in detail in the following sections of the paper. In summary:
• We described the model underlying spike detection (lines 259-268)
• We specified in more detail the two different portions of the algorithm in which spikes are extracted and noise is estimated
• We clarified that the threshold is used to extract events after the matched filtering
• We added pseudocode for the described functions directly in the main text and further organized the text into sections
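As a rough sketch of the background-removal idea, assuming `bg_pixels` holds the traces of the surrounding background region as a (T, P) array, the top background principal components can be regressed out of a neuron's trace as follows (VolPy's actual implementation may differ in details, e.g., regularization):

```python
import numpy as np
from sklearn.decomposition import PCA

def remove_background(trace, bg_pixels, n_components=8):
    """Regress the top background principal components out of a trace.

    trace:     (T,) fluorescence trace of one neuron
    bg_pixels: (T, P) traces of the surrounding background pixels
    """
    pca = PCA(n_components=n_components)
    bg_pcs = pca.fit_transform(bg_pixels)   # (T, 8) structured background
    # least-squares fit of the trace to the background components
    coeffs, *_ = np.linalg.lstsq(bg_pcs, trace, rcond=None)
    return trace - bg_pcs @ coeffs          # residual trace
```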

Specific comments:
1. Fig 1: the mean image displayed in the "back" of subplot b is not clearly visible, and it is not clear why it is included. Both images can be plotted at a smaller scale so that both are visible, if the authors want to display both.
Thank you for pointing this out. We have updated Fig 1 as suggested.

2. Line 63: what do the authors mean by "normalizing by the z-score"? Are they z-scoring each pixel, or are they dividing each pixel by the number of standard deviations the value is above/below the mean (this is the definition of the z-score)? The same applies to line 66.
We thank the reviewer for pointing out the confusing definition. We have updated the text to clarify this point.
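For reference, a minimal sketch of per-pixel z-scoring, assuming this is the intended operation and that `movie` is a (T, H, W) array:

```python
import numpy as np

def zscore_pixels(movie, eps=1e-9):
    """Z-score each pixel's time series: subtract its temporal mean and
    divide by its temporal standard deviation."""
    mean = movie.mean(axis=0, keepdims=True)  # (1, H, W) per-pixel means
    std = movie.std(axis=0, keepdims=True)    # (1, H, W) per-pixel stds
    return (movie - mean) / (std + eps)
```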

3. A short description of the motion correction algorithm and its suitability for voltage imaging would make the paper more self-contained.
Following the reviewer's advice, we expanded the motion correction section and discussed the suitability of the algorithm for voltage imaging.
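For reference, motion correction in VolPy is run through CaImAn's NoRMCorre-based interface; a minimal sketch along the lines of the public demo (file name and parameter values are illustrative):

```python
import caiman as cm
from caiman.motion_correction import MotionCorrect

# start a local cluster for parallel processing
_, dview, n_processes = cm.cluster.setup_cluster(backend='local', n_processes=None)

mc = MotionCorrect(['movie.tif'], dview=dview,
                   max_shifts=(5, 5),   # maximum allowed rigid shift (pixels)
                   pw_rigid=False)      # set True for piecewise-rigid correction
mc.motion_correct(save_movie=True)      # writes a motion-corrected, memory-mapped file

cm.stop_server(dview=dview)
```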

4. Lines 124-125: do the authors mean that instead of a 3-channel RGB image, they are inputting an image that has the mean duplicated to two channels and the correlation in the third? Doesn't this create artifacts, as the original network learns features dependent on relations between the color channels that now don't exist?
Thank you for pointing this out. It is correct that the input image has the mean image duplicated in two channels and the correlation image in the third. Even though relations among channels are captured in the original network weights, our results indicate that this problem is attenuated by retraining the last 22 layers of the ResNet (50 layers in total). Mask R-CNN pretrained weights for segmentation tasks are only available for the COCO dataset, and retraining from scratch is not possible in our case given our reduced training set. We have included a discussion of this point in the text.
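Concretely, the network input can be assembled as below; the normalization and channel ordering are illustrative assumptions:

```python
import numpy as np

def network_input(mean_img, corr_img):
    """Build the 3-channel Mask R-CNN input: the normalized mean image
    duplicated in two channels and the correlation image in the third."""
    norm = lambda im: (im - im.min()) / (im.max() - im.min() + 1e-9)
    m, c = norm(mean_img), norm(corr_img)
    return np.stack([m, m, c], axis=-1)  # (H, W, 3)
```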

5. Lines 139-143: this is not clear. Is the magic wand interface to be used to annotate videos in order to then retrain the deep network? Or is it intended to add/remove/correct cells?
We apologize for the lack of clarity. The VolPy GUI we developed, which includes the Python Cell Magic Wand Tool, is used to add/remove/correct cells from the Mask R-CNN outputs (see S2 Vid). In addition, users can bypass the neural network step and directly input their own masks, annotated through the VolPy GUI or other software. We have clarified this in the text, as follows, and with added supplements.

6. Line 164: What happens if the dilated background region includes pixels from overlapping neurons?
This is indeed an important problem. We quantified in detail, via simulations, how spike detection performance changes as a function of the overlap with other neurons. The results in Fig 4e show that VolPy is robust to overlaps smaller than about 20%, but performance starts degrading as the overlap increases, especially in low spike amplitude scenarios.
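For clarity, the overlap percentage referred to above can be computed as the fraction of a neuron's pixels shared with other neurons; a minimal sketch (our simulation code may differ in details):

```python
import numpy as np

def overlap_fraction(mask, other_masks):
    """Fraction of a neuron's pixels that are shared with any other neuron.

    mask: non-empty boolean 2-D array; other_masks: list of boolean arrays.
    """
    others = np.zeros_like(mask, dtype=bool)
    for m in other_masks:
        others |= m                      # union of all other footprints
    return np.logical_and(mask, others).sum() / mask.sum()
```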

7. Line 170: can the authors elaborate on the spatial filter w? Is this the filter in equation (8)?
Indeed it is, as we have now clarified in the text.

8. Line 201: why is the template time-flipped? Are the authors using correlation?
Exactly. As we now specify in the text, we perform template matching by cross-correlating the mean spike waveform with the signal. We changed "convolution" to "correlation" in the new text to avoid confusion; see the answer to Main comments, point 3.
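The equivalence the reviewer points at is easy to state in code: `np.convolve` time-flips its second argument, so convolving with the reversed template is exactly cross-correlation with the template. A minimal sketch:

```python
import numpy as np

def matched_filter(trace, template):
    """Cross-correlate a trace with the mean spike waveform."""
    t = template - template.mean()
    t /= np.linalg.norm(t) + 1e-9        # unit-norm template
    # convolution flips its second argument, so passing the time-reversed
    # template yields the cross-correlation with the original template
    return np.convolve(trace, t[::-1], mode='same')
```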

9. Table 2: the number of neurons in the HPC datasets is extremely low given the data size. Doesn't this impact the training of the network? Line 240 claims Table 2 includes a "train/val" column. Which column is this?
The reference should be to Table 3, not Table 2; we apologize for the confusion.
Indeed, the low number of samples has an effect on training. We investigated in the paper how the training set size affects the performance of the network on the different datasets, and we showed how the learning curves correspondingly behave (Fig 3d and S4). We also added the corresponding discussion to the text.
10. Line 253: can the authors comment on their choice of cost for evaluating spike extraction performance? Why is M=25? The authors should add the equations for the Victor-Purpura distance to the paper.
We have adopted a new strategy for spike matching. Given that we only want to compute the F1 score/precision/recall, the previous method based on the Victor-Purpura distance and the Hungarian algorithm is not necessary. We now use a greedy matching method which is much faster and outputs the same F1 score as the previous method; the details of the algorithm can be found in the paper. In this algorithm we only allow spikes within 10 ms of each other to be matched. We chose 10 ms based on the fact that neighboring spikes in the electrophysiology ground truth we evaluated had a minimum interspike interval of 30 ms.
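A minimal sketch of the greedy matching idea (not necessarily VolPy's exact implementation), with spike times in ms:

```python
import numpy as np

def spike_f1(s_gt, s_det, max_dist=10.0):
    """Greedily match sorted ground-truth and detected spike times;
    a pair may only be matched if within max_dist (ms)."""
    s_gt, s_det = np.sort(s_gt), np.sort(s_det)
    matched = i = j = 0
    while i < len(s_gt) and j < len(s_det):
        if abs(s_gt[i] - s_det[j]) <= max_dist:
            matched += 1
            i += 1
            j += 1
        elif s_gt[i] < s_det[j]:
            i += 1   # unmatched ground-truth spike (false negative)
        else:
            j += 1   # unmatched detected spike (false positive)
    precision = matched / len(s_det) if len(s_det) else 0.0
    recall = matched / len(s_gt) if len(s_gt) else 0.0
    return 2 * precision * recall / (precision + recall) if matched else 0.0
```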
11. Line 275: the authors comment that the performance on TEG is "fair" because only 2 datasets were used, and that in HPC there are not enough neurons. A more accurate statement should connect the size of the imaging plane, the number of time frames and the number of neurons. How much data is necessary to obtain good training results?
Thank you for this great point and the opportunity to improve the paper. We performed a more accurate analysis of the failure modes for each of the dataset types, as well as an analysis of network performance as a function of training set size. See also the answer to point 9.

12. Figure 3:
Thank you for identifying these issues.
• Subplot a: for HPC, is this the best overlay? There are structures in the image that appear bright but were not annotated; are these not cells?
We think that in our previous round of annotations we did miss a few neurons, especially in the HPC datasets. We believe this is because looking only for active neurons in the correlation image and correlation movie was unreliable. Our new annotation selects all neurons visible in the mean and correlation images, thereby providing more consistent results. We added more annotators and created a consensus ground truth that now captures most of the visible neurons.

• What are the vertical red lines that appear within the contours?
The vertical lines were the result of a visualization bug, which we have fixed.
• For HPC and TEG, lines can be made thicker for better visibility. It is hard to see yellow on top of the white cells; perhaps a different color would give better contrast?
○ We have made the lines for HPC and TEG thicker.
○ We have changed the yellow color to green to provide better contrast.
• Plots (h) and (i): how do the plots correspond to one another? How many processors were used for plot (h)? How many time frames are included in the data of plot (i)?
○ We have updated the scalability plots. All scalability benchmarks were performed on an L1 movie with a 512×128 pixel FOV and 75 selected neurons. The processing time depends on the number of frames and on the number of processors used; we controlled one variable and measured how processing time changed with the other. In plot (h) we show how processing time changes with the number of frames, and in plot (i) how it changes with the number of processors (a timing sketch of this protocol follows below).
○ Plot (h) was generated using 8 processors with 10,000, 20,000 and 40,000 frames.
○ Plot (i) was generated on 40,000 frames using 1, 2, 4 and 8 processors.
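The timing protocol referenced above can be sketched as follows; the per-chunk processing function is a hypothetical placeholder, not the exact benchmark code used for the figures:

```python
import time
from multiprocessing import Pool

def time_processing(n_procs, chunks, process_chunk):
    """Measure wall-clock time to process a movie split into chunks
    across n_procs worker processes."""
    t0 = time.time()
    with Pool(n_procs) as pool:
        pool.map(process_chunk, chunks)   # run the pipeline stage in parallel
    return time.time() - t0
```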