Fig 1.
Overview of frame sorting strategies.
Orange indicates fully annotated reference frames, blue indicates parent frames with at least one child frame, and green indicates child frames. A: In the simplest strategy, all frames are initialized by the closest reference frame. B: Frames are sorted into ordered queues based on similarity. Each of these branches starts with a reference frame, and new child frames are added such that the parent-child similarity distance is minimized, naturally clustering similar frames around each reference frame. C: Frames are sorted chronologically, branching both forward and backward from each reference frame.
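The similarity-based sorting in strategy B can be sketched as a greedy procedure: each remaining frame is attached to whichever already-sorted frame minimizes the parent-child similarity distance, inheriting that parent's branch. This is a minimal illustration, not ZephIR's actual implementation; the function name `sort_into_branches` and the precomputed distance matrix are assumptions for the example.

```python
import numpy as np

def sort_into_branches(dist, reference_idxs):
    """Greedy similarity-based frame sorting (strategy B).

    dist: (T, T) symmetric matrix of similarity distances between frames
          (smaller = more similar).
    reference_idxs: indices of the fully annotated reference frames.

    Returns (order, parent, root): the order in which frames are tracked,
    the parent of each frame, and the reference frame rooting its branch.
    """
    T = dist.shape[0]
    order = list(reference_idxs)
    parent = {r: None for r in reference_idxs}
    root = {r: r for r in reference_idxs}
    unassigned = set(range(T)) - set(reference_idxs)
    while unassigned:
        # pick the (sorted parent, unassigned child) pair with minimal distance
        p, c = min(
            ((p, c) for p in order for c in unassigned),
            key=lambda pc: dist[pc[0], pc[1]],
        )
        parent[c] = p
        root[c] = root[p]
        order.append(c)
        unassigned.remove(c)
    return order, parent, root
```

Because a child can attach to any already-sorted frame, similar frames naturally cluster around each reference frame, as described above.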
Fig 2.
A: Examples of input datasets, created with BioRender.com. ZephIR can track keypoints in various biological systems, including fluorescent cellular nuclei in a tissue and body parts that summarize a posture. Input dimensions can range from 3D (time, XY) to 5D (time, channel, XYZ). Colored dots indicate example keypoints to be tracked. B: Frame sorting schemes. A branch defines an ordered queue of frames to be tracked. Each branch begins at a manually annotated reference frame (orange), but subsequent parent (blue) and child (green) frames in a single branch can be sorted either by chronology (top) or by minimizing the similarity distance between each parent-child pair (bottom). C: Overview of the tracking loss, which comprises four terms: 1) LR (top left), overlap of local image features around each keypoint, sampled from the current frame and its nearest reference frame; 2) LN (top right), elastic connections between neighboring keypoints, with stiffnesses that vary with the covariance of the connected keypoints; 3) LD (bottom left), proximity to features detected by a shallow model selector network that takes the outputs of several existing feature detection tools as input channels; 4) LT (bottom right), smoothness of the temporal dynamics at each keypoint position. D: Overview of steps for manual verification and additional supervision. Users can verify tracking results as correct or identify incorrect results. After fixing a few key incorrect results, ZephIR can use those new annotations, as well as the verified correct results, to improve tracking for all other keypoints in that frame (and all of its child frames).
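The four-term structure of the tracking loss can be illustrated with toy implementations of the three geometric terms; the image-registration term LR depends on sampled image features, so it is passed in precomputed here. The function names and the quadratic forms chosen are assumptions for illustration, not ZephIR's exact formulation.

```python
import numpy as np

def spring_loss(x, edges, k, rest):
    """LN: elastic connections between neighboring keypoints.
    x: (N, D) current positions; edges: (E, 2) index pairs;
    k: (E,) stiffness per edge; rest: (E,) rest lengths from a reference."""
    d = np.linalg.norm(x[edges[:, 0]] - x[edges[:, 1]], axis=1)
    return np.sum(k * (d - rest) ** 2)

def detection_loss(x, peaks):
    """LD: proximity to detected features; squared distance to nearest peak."""
    d2 = np.sum((x[:, None, :] - peaks[None, :, :]) ** 2, axis=-1)
    return np.sum(d2.min(axis=1))

def temporal_loss(x, x_prev, x_prev2):
    """LT: smoothness of temporal dynamics, penalizing acceleration."""
    return np.sum((x - 2 * x_prev + x_prev2) ** 2)

def tracking_loss(L_R, x, x_prev, x_prev2, edges, k, rest, peaks,
                  lam_N=1.0, lam_D=1.0, lam_T=1.0):
    """Weighted sum of the four terms; LR (image registration) is
    precomputed since it depends on the image data."""
    return (L_R
            + lam_N * spring_loss(x, edges, k, rest)
            + lam_D * detection_loss(x, peaks)
            + lam_T * temporal_loss(x, x_prev, x_prev2))
```

The weights lam_N, lam_D, lam_T correspond to the λ coefficients used elsewhere in the paper; their values here are placeholders.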
Fig 3.
ZephIR analysis workflow and results for tracking GCaMP fluorescence from neuronal nuclei in 3D volumes of freely behaving C. elegans.
A: Plot of mean distance (in similarity space) to the nearest reference frame vs the number of reference frames (left), and the first three median frames (maximum intensity projections of shape 200 × 512) recommended by ZephIR's k-medoids clustering algorithm (right). The first three median frames clearly represent the three main postures that the worm cycles through as it crawls. B: Accuracy (higher is better) and precision (lower is better) vs the number of reference frames. Accuracy is measured as the average percentage of neurons correctly tracked, where a neuron is considered correctly tracked if the tracked coordinate is within the volume of the neuron as identified by a manual annotator. Precision is reported as the average RMS error, in pixels, between the predicted position and the manually annotated position of each neuron. Note that once the majority of the postures present in the data are well represented by the first three reference frames, subsequent additions return diminishing improvements. The last data point shows ZephIR's accuracy using 10 reference frames with 10 partial annotations across all frames (panel C). C: 10 neurons were randomly selected to be verified or corrected to serve as partial annotations. Traces of 5 of these neurons, extracted using the initial ZephIR results with 10 reference frames (left) and using the verified true positions (right), are shown along with 5 other randomly selected neurons. Traces are calculated as the fold change of the ratio between GCaMP and RFP fluorescence of each neuronal nucleus over the baseline, where the baseline is defined as the ratio in the first frame. Tracking quality for these 10 neurons can also be seen in individual crops around the neurons, averaged across all frames (a sharper image of the cell at the center reflects better tracking accuracy and precision).
Note how the five unannotated neurons show improvements in tracking quality after the addition of partial annotations, exemplifying the effect of partial annotations on unannotated neurons in the same frame. D: Neuronal activity traces from 178 neurons, extracted using results from ZephIR with 10 reference frames and 10 partial annotations in all frames. Traces are calculated as the fold change of the ratio between GCaMP and RFP fluorescence of each neuronal nucleus over the baseline, where the baseline is defined as the ratio in the first frame. Behavior is shown in the ethogram below the heatmap. The trajectory of the worm (t = 0 at bottom right) is also colored by the behavioral state at each time point. The trajectory matches the changes in behavior over time as expected, and many of the neuronal activity traces show strong correlation with behavior. E: Accuracy vs the number of reference frames for tracking 79 neurons in a publicly available dataset of freely moving C. elegans [14]. Since the spatiotemporal patterns in the data are similar to those in the previously tracked data, we can reuse the same parameters and follow the same procedure to track the 79 neurons in the head of the worm. Each volume has been centered and rotated, but no further straightening has been done. Since this particular dataset has also been used to benchmark a number of recently developed algorithms, we can also compare ZephIR's accuracy with Neuron Registration Vector Encoding (NeRVE) [14], fast Deep Neural Correspondence (fDNC) [11], 3DeeCellTracker [23], and CeNDeR [46].
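The trace computation described in panels C and D (fold change of the GCaMP/RFP ratio over the first-frame baseline) can be sketched as follows; the function name and the small `eps` guard are assumptions, and "fold change" is read here as a simple ratio to baseline.

```python
import numpy as np

def activity_traces(gcamp, rfp, eps=1e-6):
    """Ratiometric activity traces.

    gcamp, rfp: (T, N) fluorescence intensities per frame and neuron.
    Returns the fold change of the GCaMP/RFP ratio relative to the
    baseline ratio, defined as the ratio in the first frame.
    """
    ratio = gcamp / (rfp + eps)
    baseline = ratio[0]            # ratio in the first frame
    return ratio / (baseline + eps)
```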
Table 1.
Summary of results and performance of ZephIR and other recently developed algorithms for tracking neurons in freely moving C. elegans.
When benchmarked on a common dataset (provided by [14]), ZephIR achieves top-2 accuracy with significantly fewer annotations and faster performance. Note that the tracking inference speed does not include time spent on any training or detection.
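The accuracy and precision metrics reported in Fig 3B and Table 1 can be sketched as below; a fixed radius stands in for the "within the volume of the neuron" criterion, and both function names are hypothetical.

```python
import numpy as np

def accuracy(pred, gt, radius):
    """Fraction of keypoints correctly tracked: a prediction counts as
    correct if it falls within `radius` of the annotated position.
    pred, gt: (N, D) coordinates."""
    err = np.linalg.norm(pred - gt, axis=1)
    return np.mean(err <= radius)

def rms_error(pred, gt):
    """Precision as the RMS error (in pixels) between predicted and
    manually annotated positions."""
    return np.sqrt(np.mean(np.sum((pred - gt) ** 2, axis=1)))
```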
Fig 4.
Results for tracking GCaMP fluorescent neuron nuclei in 3D volumes of freely behaving C. elegans (same dataset as shown in Fig 3A–3D) with varying combinations of tracking loss terms.
Three reference frames and fixed weights (λR, λN, λD) were used for all results shown. A. One of the three reference frames. B. Network of connections between neighboring neurons in the frame shown in panel A. The edge weights represent their relative stiffness for calculating the spring loss, LN. The connections and their stiffnesses are modulated by the distance between the neurons and the covariance of the connected pair across reference frames. C. Results from feature detection, LD, on the frame shown in panel A. Feature detection achieves an average precision and recall of 0.948 ± 0.024 and 0.931 ± 0.016, respectively. D. Tracking accuracy (top) and precision (bottom) as the loss components are changed. We can confirm the positive contributions of both LN and LD, as adding either or both to image registration, LR, increases accuracy and precision. We also note that LD carries similar local information to the image descriptors when represented as a probability map with peaks at the neuron centers. When combined with the spring network, LN, it reaches similar accuracy to image registration, and the sum of the three terms, LR + LN + LD, results in only a small increase in accuracy compared to either combination, LR + LN or LN + LD.
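The spring network in panel B, with stiffnesses modulated by inter-neuron distance and the covariance of each connected pair across reference frames, can be sketched as follows; the distance threshold and the inverse-variance weighting are illustrative assumptions, not ZephIR's exact formula.

```python
import numpy as np

def spring_network(ref_positions, d_max, k0=1.0):
    """Build a neighbor network from reference-frame annotations.

    ref_positions: (F, N, D) keypoint positions in F reference frames.
    Pairs whose mean distance is below d_max are connected; stiffness
    decreases with the variance of the pairwise distance across
    reference frames, so pairs that move rigidly together get stiffer
    springs.
    Returns (edges, stiffness, rest) for use in a spring loss.
    """
    F, N, D = ref_positions.shape
    edges, stiffness, rest = [], [], []
    for i in range(N):
        for j in range(i + 1, N):
            disp = ref_positions[:, i] - ref_positions[:, j]   # (F, D)
            dists = np.linalg.norm(disp, axis=1)
            if dists.mean() < d_max:
                edges.append((i, j))
                # stiffer when the pair covaries rigidly across frames
                stiffness.append(k0 / (1.0 + dists.var()))
                rest.append(dists.mean())
    return np.array(edges), np.array(stiffness), np.array(rest)
```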
Fig 5.
Results for tracking posture of a behaving mouse in 2D.
We compare the performance of ZephIR and DeepLabCut on tracking 10 body parts that characterize the mouse's posture over time. A: Accuracy (average percentage of keypoints correctly tracked in unannotated frames) and precision (average distance between predicted and ground truth keypoint positions) vs the number of manually labeled or ground truth frames. These labeled frames are used as reference frames for ZephIR and as training data for DeepLabCut. The frames are selected based on automated recommendations from each algorithm, meaning the two sets of frames used may not be identical. The last data point (results with 200 training frames) for DeepLabCut is produced with training data generated by verifying and correcting ZephIR results with 10 reference frames. Note that ZephIR achieves better accuracy when only a few labeled frames are provided, but DeepLabCut ultimately reaches a higher accuracy when its training data is augmented with ZephIR. B, C: DeepLabCut and ZephIR results with 20 labeled frames (vertical line in panel A) for tracking mouse body parts as the mouse raises its paws. Note that ZephIR is more stable during motion, while DeepLabCut tends to jump between different body parts. Table: Annotation and computation speed comparison. Annotation time is measured for the same person using the respective GUIs provided with each software package. Training and inference times are tested in the same CPU and single-GPU environment with 20 reference frames (vertical line in panel A). While DeepLabCut is faster at inference, it requires a slow training phase, dramatically increasing the total computation time. This data was produced and provided by the Churchland Lab (UCLA). Raw data is available at: https://ibl.flatironinstitute.org/public/churchlandlab/Subjects/CSHL047/2020-01-20/001/raw_video_data/.