Multi-appearance segmentation and extended 0-1 programming for dense small object tracking

To address dense small object tracking, we propose an image-to-trajectory framework comprising detection and tracking, in which Track-Oriented Multiple Hypothesis Tracking (TOMHT) is revised for the tracking stage. Unlike common cases of multi-object tracking, merged detections and the greater number of objects make dense small object tracking a more challenging problem. First, we handle frequent merged detections in both detection and hypothesis selection. To tackle merged detections, we revise the Local Contrast Method (LCM) and propose a multi-appearance variant, which exploits tree-like topological information and assigns one threshold per object. Meanwhile, a one-to-many constraint is enforced via the proposed extended 0-1 programming, which enables hypothesis selection to handle track exclusions caused by merged detections. Second, to alleviate the high complexity caused by dense objects, we consider batch optimization and more rigorous and precise pruning techniques. Specifically, we propose an autocorrelation-based motion score test and two-stage hypothesis pruning. Experimental results verify the strength of our methods and indicate the speed and performance advantages of our tracker.


Detection Metrics
Detection Rate(DR↑) We define $DR = N_C / N_T$, where $N_C$ and $N_T$ are the number of correctly detected objects and the number of true objects, respectively.
False Alarms(FA↓) We define $FA = N_{IC} / N$, where $N_{IC}$ is the number of incorrectly detected objects and $N$ is the length of the sequence.
Standard Deviation of Detection Rate(DR-STD↓) A lower standard deviation of the detection rate indicates a more stable detection method.
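
The three detection metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the input layout (one tuple of counts per frame) and the function name `detection_metrics` are hypothetical, and DR-STD is assumed here to be the population standard deviation of the per-frame detection rate, which the text does not spell out.

```python
from statistics import pstdev


def detection_metrics(frames):
    """Compute DR, FA and DR-STD from per-frame counts.

    frames: list of (n_correct, n_true, n_incorrect) tuples, one per frame
    (hypothetical layout chosen for this sketch).
    """
    n_c = sum(f[0] for f in frames)   # correctly detected objects, N_C
    n_t = sum(f[1] for f in frames)   # true objects, N_T
    n_ic = sum(f[2] for f in frames)  # incorrectly detected objects, N_IC

    dr = n_c / n_t                    # DR = N_C / N_T
    fa = n_ic / len(frames)           # FA = N_IC / N, with N the sequence length

    # Assumption: DR-STD is taken over the per-frame detection rates.
    per_frame_dr = [c / t for c, t, _ in frames if t > 0]
    dr_std = pstdev(per_frame_dr)
    return dr, fa, dr_std


frames = [(8, 10, 1), (9, 10, 0), (10, 10, 2)]
dr, fa, dr_std = detection_metrics(frames)
print(dr, fa, dr_std)
```

With these toy counts, 27 of 30 objects are detected and 3 false alarms occur over 3 frames.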

Tracking Metrics (Traditional)
Traditional metrics treat each object as a point when evaluating tracking performance.
Optimal Sub-pattern Assignment Distance(OSPA-T↓) The modified Optimal Sub-pattern Assignment distance proposed by Branko Ristic is widely used as a performance measure in multi-object tracking. c and are set to 25 as default, and p is 2.
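
For orientation, the sketch below computes the base OSPA distance between two point sets with cut-off c = 25 and order p = 2. It is deliberately not OSPA-T: the track-label term of the modified metric is omitted, and the brute-force search over assignments is an assumption made for brevity that only suits small sets. The function name `ospa` is hypothetical.

```python
from itertools import permutations
from math import dist


def ospa(X, Y, c=25.0, p=2):
    """Base OSPA distance between two lists of points (no label term)."""
    if len(X) > len(Y):
        X, Y = Y, X                    # ensure len(X) <= len(Y)
    m, n = len(X), len(Y)
    if n == 0:
        return 0.0                     # both sets empty

    # Brute-force the optimal assignment of X into Y (small sets only).
    best = min(
        sum(min(c, dist(x, Y[j])) ** p for x, j in zip(X, perm))
        for perm in permutations(range(n), m)
    )
    # Unassigned points in the larger set are each penalised by c.
    return ((best + (c ** p) * (n - m)) / n) ** (1.0 / p)


d_single = ospa([(0.0, 0.0)], [(3.0, 4.0)])
d_card = ospa([(0.0, 0.0)], [(0.0, 0.0), (100.0, 100.0)])
print(d_single, d_card)
```

The first call measures a pure localisation error; the second shows how a cardinality mismatch is penalised through the cut-off c.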
Track Completeness Factor(TCF↑) Track Completeness Factor measures how well a given object is detected after association [3]. The parameter tol used in TCF is set to 15.
Track Fragmentation(TF↑) Track Fragmentation measures how well identity is maintained [3]. The parameter tol used in TF is also set to 15.

Tracking Metrics (CLEAR MOT)
We also use metrics borrowed from visual multi-object tracking, a field that has attracted great attention and developed many sophisticated evaluation mechanisms. The representative example is the CLEAR MOT metrics [2], which include MOTA, MOTP, etc. Those metrics were designed for objects of a certain size with detection boxes, rather than for small objects of only a few pixels. However, the small objects in our dense tracking scenario span more than just a few pixels and occupy considerable space (they are still small objects of no more than one hundred pixels), and higher density amplifies the effect of their size. Moreover, those metrics were designed to evaluate complex scenarios with massive occlusions, which are more complicated than the traditional scenario. Applying them adds diversity to our results. For these reasons, we adopt the CLEAR MOT metrics [2].

Number of Identity Switches(IDSW↓)
An identity switch is counted each time a ground truth target i is matched to hypothesis j while its last known assignment was k (k ≠ j) [2].
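
A minimal sketch of this counting rule, assuming the per-frame matching between ground truth targets and hypotheses has already been computed. The input layout (one dict per frame mapping ground truth id to hypothesis id) and the name `count_id_switches` are hypothetical; a switch is counted whenever the current hypothesis differs from the last known assignment of the same target, even across frames where the target was unmatched.

```python
def count_id_switches(assignments):
    """Count identity switches over a sequence.

    assignments: list of dicts, one per frame, mapping ground-truth id ->
    matched hypothesis id (unmatched targets are simply absent).
    """
    last_known = {}   # ground-truth id -> last hypothesis id it was matched to
    switches = 0
    for frame in assignments:
        for gt_id, hyp_id in frame.items():
            if gt_id in last_known and last_known[gt_id] != hyp_id:
                switches += 1
            last_known[gt_id] = hyp_id
    return switches


frames = [
    {1: "a", 2: "b"},
    {1: "a"},            # target 2 is missed in this frame
    {1: "c", 2: "b"},    # target 1 switches from "a" to "c"
    {1: "c", 2: "a"},    # target 2 switches from "b" to "a"
]
idsw = count_id_switches(frames)
print(idsw)
```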
Multiple Object Tracking Accuracy(MOTA↑) Thanks to its expressiveness, the Multiple Object Tracking Accuracy [2] may be the most widely used figure for evaluating a tracker's performance. It is defined in Eq. (1):
$$\mathrm{MOTA} = 1 - \frac{\sum_t (FN_t + FP_t + IDSW_t)}{\sum_t GT_t} \qquad (1)$$
It combines three sources of error, where $t$ is the frame index and $GT$ is the number of ground truths. $FN$ is the number of false negatives, $FP$ is the number of false positives, and $IDSW$ is the number of identity switches.
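
As a worked illustration, the sketch below evaluates MOTA using the standard CLEAR MOT form, one minus the sum over frames of false negatives, false positives and identity switches, divided by the total number of ground truths. The per-frame tuple layout and the function name `mota` are assumptions made for this example.

```python
def mota(per_frame):
    """MOTA from per-frame error counts.

    per_frame: list of (fn, fp, idsw, gt) tuples, one per frame
    (hypothetical layout chosen for this sketch).
    """
    errors = sum(fn + fp + idsw for fn, fp, idsw, _ in per_frame)
    total_gt = sum(gt for _, _, _, gt in per_frame)
    return 1.0 - errors / total_gt


# 6 errors in total against 30 ground-truth objects.
score = mota([(1, 0, 0, 10), (0, 2, 1, 10), (1, 1, 0, 10)])
print(score)
```

Note that MOTA can become negative when the number of errors exceeds the number of ground truths.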

Multiple Object Tracking Precision(MOTP↑) The Multiple Object Tracking Precision is the average dissimilarity between all true positives and their corresponding ground truth targets [2].
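
A minimal sketch of this average follows. Because MOTP is marked as higher-is-better here, the example assumes the per-match score is an overlap measure, specifically the intersection-over-union of axis-aligned boxes given as (x1, y1, x2, y2). That choice, together with the names `iou` and `motp`, is an assumption for illustration and is not stated in the text.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def motp(matches):
    """Average overlap over all matched (ground truth, hypothesis) box pairs."""
    scores = [iou(gt, hyp) for gt, hyp in matches]
    return sum(scores) / len(scores)


matches = [
    ((0, 0, 2, 2), (0, 0, 2, 2)),   # perfect overlap, IoU = 1
    ((0, 0, 2, 2), (1, 0, 3, 2)),   # half-shifted box, IoU = 1/3
]
precision_score = motp(matches)
print(precision_score)
```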

Ratio of Misses Over Total Number(FN↓)
The ratio of misses in the sequence to the total number of objects present in all frames [3].
Ratio of False Positives Over Total Number(FP↓) The ratio of false positives to the total number of objects present in all frames [3].
Recall(REC↑) The number of correctly matched detections divided by the total number of detections in ground truth.
Precision(PRE↑) The number of correctly matched detections divided by the total number of output detections.

We use an up arrow ↑ to indicate that a higher score is better; conversely, a down arrow ↓ indicates that a lower score is better.
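
The four ratio metrics (FN, FP, REC, PRE) can be derived from three sequence-level counts, as in the sketch below. It assumes the total number of ground truth objects equals true positives plus misses and the total number of output detections equals true positives plus false positives. The function name `ratio_metrics` and its argument names are hypothetical.

```python
def ratio_metrics(tp, fn, fp):
    """FN ratio, FP ratio, recall and precision from sequence-level counts.

    tp: correctly matched detections, fn: misses, fp: false positives.
    """
    total_gt = tp + fn     # objects present in all frames (assumed tp + fn)
    total_out = tp + fp    # detections output by the tracker
    return {
        "FN": fn / total_gt,
        "FP": fp / total_gt,
        "REC": tp / total_gt,
        "PRE": tp / total_out,
    }


m = ratio_metrics(tp=80, fn=20, fp=10)
print(m)
```

Under these assumptions REC and FN always sum to one, so reporting both serves readability rather than adding information.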