PriMAT: Robust multi-animal tracking of primates in the wild

doi:10.1371/journal.pone.0347669

Fig 1.

Bounding box based detection vs. keypoint detection on different datasets.

Bounding box based tracking has advantages over keypoint tracking when used as a starting point for automated behavior analysis, such as reduced labeling time, higher tracking robustness for videos from the wild, and universal extensibility to other objects of interest.


Fig 2.

(A) Structure of our datasets and experiments.

All models were trained with image datasets, while the performance was evaluated primarily on videos. We compared different datasets for pretraining, and applied them to two use cases: videos of redfronted lemurs and Assamese macaques. We compared our bounding box based approach with DeepLabCut [23]. Finally, we demonstrated the use of our additional classification branch for individual identification. (B) Overview of the model architecture: Input images are processed by a Convolutional Neural Network (CNN, in our case HRNet32 [41]) into a feature volume. Different heads learn tracking-related tasks using two-layer CNNs. We omitted the offset head for simplicity. Afterwards, individuals are cut out and processed by the classification branch for individual identification. We used ResNet18 [42] as the CNN for individual recognition.


Fig 3.

Problem during a jump: The Kalman filter predicts the location of the lemur track based on past velocity.

The detection in the last frame cannot be matched to the lemur track. Top row: Detected bounding boxes and assigned tracks identified by the color of the bounding box. The track of the jumping lemur is lost in the last frame (black) and a new track is instantiated (green). Bottom row: Kalman filter predictions for the expected position of the lemur. The Kalman filter assumes linear motion and predicts a different position when the jump ends (black). The newly instantiated track starts with default parameters for the Kalman filter prediction (green).
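The failure mode in this caption can be illustrated with a minimal sketch. It assumes a one-dimensional constant-velocity motion model and reproduces only the prediction step of a Kalman filter (no covariance or update step). The positions, the gating distance, and the function names are hypothetical and are not taken from PriMAT.

```python
def predict_constant_velocity(position, velocity, dt=1.0):
    """Prediction step of a constant-velocity model: x' = x + v * dt."""
    return position + velocity * dt


def can_match(predicted, detected, gate):
    """A detection is matched only if it lies within the gating distance."""
    return abs(predicted - detected) <= gate


# Hypothetical vertical box-center positions (pixels) of a jumping lemur.
# The animal moves at a steady 40 px/frame and then lands abruptly.
observed = [100.0, 140.0, 180.0, 220.0, 225.0]

gate = 20.0  # assumed gating distance in pixels
matches = []
for t in range(2, len(observed)):
    # Velocity is estimated from the two previous observations only.
    velocity = observed[t - 1] - observed[t - 2]
    predicted = predict_constant_velocity(observed[t - 1], velocity)
    matches.append(can_match(predicted, observed[t], gate))

print(matches)  # [True, True, False]
```

In the last frame the linear prediction overshoots the landing position, the detection falls outside the gate, and a tracker following this logic would start a new track, as in the figure.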


Fig 4.

(A) Setup for social learning experiments: Four feeding boxes were placed on the ground and eight cameras filmed from three different perspectives ((B) close, (C) top, (D) far).


Fig 5.

Two examples that illustrate the difficulty of identifying individuals in every single frame of a track.

Instead, we combine the tracking and identification models to come to a majority vote decision.
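A majority vote over a track can be sketched as follows. This is a minimal illustration, assuming the identification model returns one identity label per frame of a track; the labels and the function name are hypothetical.

```python
from collections import Counter


def majority_vote(frame_predictions):
    """Return the most frequent identity label in a track and its vote share.

    Expects a non-empty list of per-frame labels.
    """
    counts = Counter(frame_predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(frame_predictions)


# Hypothetical per-frame identity predictions for one track.
track_predictions = ["A", "A", "B", "A", "C", "A", "B", "A"]
identity, share = majority_vote(track_predictions)
print(identity, share)  # A 0.625
```

Single-frame errors (here "B" and "C") are outvoted as long as the correct identity is predicted most often along the track.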


Fig 6.

Pretraining with MacaqueCopyPaste.

(A) Examples from MacaquePose [3] (top), MacaqueCopyPaste with ImageNet [43] backgrounds (middle), and MacaqueCopyPasteWild with backgrounds from Thailand (bottom). (B) Models trained with the copy-paste approaches achieve higher scores when tested in-domain, i.e., on their own test sets. This can probably be attributed to sharper edges and greater saliency against the background. (C) Results of testing on different test sets relative to the in-domain performance from panel B. MacaqueCopyPaste detects macaques well on all three datasets, whereas MacaquePose does not generalize well to other datasets. (D) Both copy-paste strategies outperform MacaquePose on the macaque validation videos. Error bars: Standard error of the mean from training three models with different random seeds. The top row images as well as the monkeys in the other images are reprinted from Labuguen et al. (2020) and Google Open Images under a CC BY 2.0 license. Original copyright remains with the authors. Our model was trained using ImageNet images; however, for illustration purposes in this figure (middle row), all ImageNet backgrounds have been replaced with representative images created by the authors.
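The copy-paste idea can be sketched in a few lines. This is a minimal illustration, assuming a binary segmentation mask of the animal is available and that images are small grayscale grids; the function name and the toy data are hypothetical, and this is not the pipeline used to build MacaqueCopyPaste.

```python
def paste_foreground(background, foreground, mask, top, left):
    """Copy masked foreground pixels onto a copy of the background.

    Images are lists of rows of grayscale values; mask entries are 0 or 1.
    Returns the composite and the box (x_min, y_min, x_max, y_max) that
    encloses the pasted pixels, which can serve as the detection label.
    """
    composite = [row[:] for row in background]  # leave the input untouched
    xs, ys = [], []
    for i, mask_row in enumerate(mask):
        for j, keep in enumerate(mask_row):
            if keep:
                composite[top + i][left + j] = foreground[i][j]
                xs.append(left + j)
                ys.append(top + i)
    return composite, (min(xs), min(ys), max(xs), max(ys))


# Toy example: a 2x2 cutout pasted onto an empty 4x5 background.
background = [[0] * 5 for _ in range(4)]
foreground = [[9, 9], [9, 9]]
mask = [[1, 0], [1, 1]]
image, box = paste_foreground(background, foreground, mask, top=1, left=2)
print(box)  # (2, 1, 3, 2)
```

Because the box is derived from the mask, each synthetic image comes with a bounding box label at no extra annotation cost.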


Table 1.

Performance of lemur and macaque tracking models.


Fig 7.

Tracking results for lemurs and macaques on the test videos when the model is trained with a subset of the training data.

We selected five non-overlapping subsets of size 100, and two subsets of size 200. Training with fewer samples reduces performance; however, even with 100 annotated frames, we get decent tracking results.
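Drawing non-overlapping training subsets can be sketched as follows. This is a minimal illustration, assuming a pool of 500 annotated frames identified by integer IDs; the pool size, the seed, and the function name are assumptions, and the caption does not state how the subsets were actually drawn.

```python
import random


def disjoint_subsets(items, subset_size, n_subsets, seed=0):
    """Shuffle once, then slice into n_subsets non-overlapping chunks."""
    if subset_size * n_subsets > len(items):
        raise ValueError("not enough items for the requested subsets")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return [
        shuffled[k * subset_size:(k + 1) * subset_size]
        for k in range(n_subsets)
    ]


frame_ids = list(range(500))  # hypothetical pool of annotated frames
subsets_100 = disjoint_subsets(frame_ids, subset_size=100, n_subsets=5)
subsets_200 = disjoint_subsets(frame_ids, subset_size=200, n_subsets=2)
print([len(s) for s in subsets_100], [len(s) for s in subsets_200])
```

Slicing a single shuffled list guarantees that subsets of the same size share no frames, so the spread across subsets reflects the choice of frames rather than overlap between them.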


Table 2.

Results of transfer learning to other species.


Fig 8.

Comparison between our bounding box based tracking model and a DeepLabCut (DLC) model [23].

Our model was trained with 500 images from three camera perspectives. The DLC model was trained using 33 keypoints, resulting in 100 frames. (A) A qualitative example where DLC does not detect most keypoints. (B) DLC missed many keypoints and shows large root-mean-square errors (RMSE) for the predicted keypoints. (C, D) Performance difference on the 12 test videos. The evaluation metric was chosen to favor DLC: a detection counts as positive as soon as at least two keypoints are correctly detected.
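The two quantities mentioned in this caption can be sketched as follows. This is a minimal illustration, assuming that a keypoint counts as correct when it lies within a pixel threshold of its ground-truth position; the threshold, the coordinates, and the function names are hypothetical, and undetected keypoints are not modeled.

```python
import math


def keypoint_rmse(predicted, ground_truth):
    """Root-mean-square Euclidean error over paired (x, y) keypoints."""
    squared = [
        (px - gx) ** 2 + (py - gy) ** 2
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    return math.sqrt(sum(squared) / len(squared))


def is_detected(predicted, ground_truth, max_error, min_correct=2):
    """Count an individual as detected if at least min_correct keypoints
    lie within max_error pixels of their ground-truth position."""
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
        if math.hypot(px - gx, py - gy) <= max_error
    )
    return correct >= min_correct


# Hypothetical keypoints: errors of 5 px, 0 px, and 50 px.
ground_truth = [(10.0, 10.0), (20.0, 10.0), (30.0, 10.0)]
predicted = [(13.0, 14.0), (20.0, 10.0), (60.0, 50.0)]

rmse = keypoint_rmse(predicted, ground_truth)
detected = is_detected(predicted, ground_truth, max_error=5.0)
print(round(rmse, 2), detected)
```

The example shows why the criterion is lenient: one keypoint is off by 50 px and dominates the RMSE, yet the individual still counts as detected because two other keypoints are close enough.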


Fig 9.

Qualitative comparison between PriMAT and DLC.

Ground truth bounding boxes are black, PriMAT bounding boxes are yellow. The different colors of the keypoints refer to different detected individuals.
