Multi-appearance Segmentation and Extended 0-1 Program for Dense Small Object Tracking

To address fast multi-object tracking of dense small objects in cluttered backgrounds, we revisit track-oriented multi-hypothesis tracking (TOMHT) from the perspective of batch optimization. Using an autocorrelation-based motion score test and a staged hypothesis merging approach, we build a homologous hypothesis generation and management method. A new one-to-many constraint is proposed and applied to handle track exclusions during complex occlusions. In addition, to improve detection, we develop a multi-appearance segmentation that exploits tree-like topological information and assigns one threshold to each object. Experimental results verify the strength of our methods, indicating the speed and performance advantages of our tracker.


I. INTRODUCTION
In video content analysis, whether for interpretation, indexing or coding, object tracks are of great importance. One important field is dim small object tracking. Detection and tracking of small, dim moving objects are becoming increasingly vital in scenarios such as military infrared guidance, physical particle analysis and micro-animal observation. The objective of this paper is to develop algorithms that can detect and track small objects in complex scenarios. Such an algorithm should be capable of accurately locating dim small objects, and of starting, maintaining and terminating their tracks.
Many challenging factors stand in the way of successful tracking. Events such as temporary misdetections, occlusions and crossings may occur, giving rise to significant ambiguities in the association of consecutive measurements to a track. Proper localization of small objects is the first challenge. Time-varying object intensity, structured backgrounds, electronic noise and frequent occlusions are some of the factors that impede the detection of small objects. The overall unreliability of object detection results in corrupted measurements. The other challenge comes from data association. Because small objects are featureless, with similar small and isotropic shapes, spatial position is usually the only feature that can be relied on for data association. Thus, object motion should be extensively exploited to provide valuable information. Generally, the core problem of multi-object tracking is believed to be data association. Data association determines the source from which each measurement derives; in other words, it builds the link between measurement and track. From this perspective, most contemporary multi-object tracking methods can be divided into two main categories.
One-to-one association, also known as unique-neighbour association, means that each measurement is attached to at most one track. A measurement either belongs to an existing track or is regarded as the start of a new track or a false alarm. In short, it results from a unique source. The classic methods following this rule include plain NN (nearest neighbour) and its variants, for instance GNN (global nearest neighbour). Those early methods make the association decision immediately after acquiring new measurements. In fact, a delayed decision improves the reliability of association because it integrates later information. This idea postpones the association decision for several frames, utilizing the information from subsequent measurements, and has proved to be considerably effective. MHT [1], [2], [3], [4], MDA [5] and SGTS [6], [7] are typical methods based on this idea. Reid proposed the original MHT algorithm [8]. Afterwards, many researchers followed this work and explored the power of the algorithm by employing efficient assignment methods, such as A* search [4] and Murty's algorithm [9], and by optimizing the framework. MDA [5], which stands for multidimensional assignment, expands the bipartite graph matching in MHT to multidimensional assignment. SGTS employs a semi-greedy algorithm to obtain an approximately optimal solution to the association problem.
Those improved one-to-one association methods with delayed decision also bring new issues. The most serious one is the explosion of association hypotheses, which comes from the exponential growth of association combinations. For now, this exponential growth is partially alleviated by various restriction and branch pruning techniques.
Moreover, most network optimization based methods can be classified into this category. They often use an abstract connected graph to represent the tracking problem. Since they always add a one-to-one association constraint to the optimization problem, they satisfy the definition of a one-to-one association method. Those methods include spatio-temporal path based optimization, like the K-shortest paths optimization [10] proposed by Jérôme Berclaz et al., and tracklet based optimization, like the works of Bing Wang et al. [11], [12]. Network optimization based tracking has been a crucial research focus, especially in the last few decades.

arXiv:1712.05116v1 [cs.CV] 14 Dec 2017
Many impressive results [13], [14], [15] have indicated that it is a useful approach with a simple and clear framework. Zhang et al. [13] used a push-relabel method to solve the min-cost flow problem. Jérôme Berclaz et al. and Pirsiavash et al. [14] proposed more efficient successive shortest path algorithms, which provide roughly the same globally optimal tracking results with less running time. Butt et al. [15] incorporated higher-order track smoothness constraints for multi-target tracking.
The second category is many-to-one association (also called all-neighbours association), where multiple measurements are used to update a single track. The principal assumption is that each measurement within the threshold gate should contribute to the update of the track, but with a different weight. This method naturally avoids both fixed association and exponential combination. The typical methods include PDA and JPDA [16]. The PDA algorithm calculates, for each validated measurement at the current time, the probability of its association with the target being tracked. This probabilistic or Bayesian information is used in the PDAF tracking algorithm to account for the uncertainty of measurement origin. JPDA can only track a fixed and known number of objects.
Beyond one-to-one (measurement-to-track) association and many-to-one association, we exploit one-to-many association to resolve occlusions between objects with approximately the same motion. Some studies [17], [18] show that the precision of JPDA may be inferior to that of MHT in some cases with massive numbers of objects. After a careful analysis of MHT, we develop a new tracking framework that maintains its ability to deal with massive numbers of low-observable objects while avoiding daunting computation and time consumption. MMHT (management for multi-hypothesis tree), inspired by TOMHT, is employed as the generator of plausible hypotheses. An extended 0-1 program is then used for hypothesis selection, which integrates considerations of mutual exclusion and other critical factors.

B. Motivation
Complexity is a fatal defect of MHT and its variants. MHT has proved to be effective compared with other methods in dense-object scenarios. However, massive numbers of objects and an exponentially growing number of hypotheses make it nearly intractable to find compatible hypothesis sets as the probabilities evolve. Generally, this compatible-set problem is transformed into a graph problem. Clustering then divides it into individual subproblems, each of which is formulated as a linear program, an MWISP problem, etc. Moreover, the computation of compatible hypothesis sets is performed at every frame and thus entails frequent heavy computation. Sometimes, the clustering and compatible-set searches operate on similar graph structures over a few frames, which leads to partly repeated computation. This is because most incompatibility relations, caused by crossings or other events, are inherited from the previous frame to record the history of events. Not every new frame brings new incompatibility events, so the compatibility graph often remains quite similar to the previous one.
To address the massive computational complexity of MHT, especially in dense scenes, we try to extend the interval between graph processing steps. Since similar graph structures arise within a small interval, this also helps to partly eliminate the repeated computation. Generally, trajectories can be generated online [19], [20], offline [12], [21], [22], or with a short latency [23], [24]. Batch optimization tracking is in some sense like deferred tracking with an affordable delay, as long as a short batch length is applied. We try to apply batch decisions to the MHT framework.
The first problem we encountered is the exponential explosion of hypotheses. The result of graph processing determines whether a hypothesis is pruned or maintained, so prolonging the interval means increasing the depth of the hypothesis tree. A new and strong hypothesis management method is needed to restrict the number of hypotheses while preserving the valid ones. The second problem is hypothesis selection. With more hypotheses, graph processing based algorithms may no longer be suitable, since the graph may grow considerably, and an increasing graph scale expands the solution space exponentially.
Quite a number of segmentation algorithms have been used for the detection of small objects and have proved to be highly efficient. The local contrast method proposed by C. L. Philip Chen et al. [25] shows state-of-the-art performance. However, it has the drawback of enlarging the object because of its max-pooling operation, which hampers accurate segmentation. In fact, most detectors concentrate on enhancing the probability that small objects emerge. They usually do not consider dense objects with many occlusions. To achieve better tracking results, we propose our multi-appearance segmentation. Unlike normal segmentation, which often uses a unified threshold for a single image, multi-appearance segmentation adopts different thresholds for different objects in the same image. To distinguish touching objects during occlusions, we need to use different combinations of thresholds. A topological tree structure is built to organize the relationships between objects under different thresholds.
We make the following contributions: (1) We propose a one-to-many association based constraint for dense small object tracking and implement this constraint with an extended 0-1 program; to the best of our knowledge, this association idea is the first to differ from common practice. (2) To keep the hypothesis set at a tractable scale, we design the MMHT method for hypothesis management, which makes deeper trees steerable. (3) A novel multi-appearance segmentation method is proposed for small object detection. It uses topological tree structures to manage the relationships between local thresholds for different objects and refines individual thresholds. (4) Owing to the efforts to reduce the complexity and the number of hypotheses, the implementation of our tracker proves to be impressively fast.

C. Outline of the Paper
The primary methods proposed in this paper comprise two parts, presented in Sections 2 and 3. In Section 2, we propose the multi-appearance segmentation for small object detection. Our tracking method is then presented in Section 3. In Section 4, the experiments on detection, tracking and verification of the new constraint are introduced. Finally, we discuss our results and future work in Section 5.

Fig. 2. Example of image slides. The threshold in a1 leaves the objects on the right connected to each other, which merges them into a single object. The threshold of a3 is too high, so the segmentation misses the lower-right object entirely. The segmentation of a2, in turn, misses the upper-left object.

A. Multi-appearance segmentation
We employ the tracking-by-detection paradigm, which is widely used in visual tracking [10], [26], [27], [14], [13], [28]. Detection of small objects is therefore our first task, and it is a fundamental and vital step. Small objects usually appear as points or irregular blocks of a few pixels, which cannot carry much information beyond intensity and rough shape. In an image with low SNR, objects are so thoroughly mixed with noise that it is nearly impossible to distinguish them. Occlusion is another severe difficulty.
Unlike normal segmentation, which often uses a unified threshold for a single image, multi-appearance segmentation adopts different thresholds for different objects in the same image. Using the multi-appearance information of small objects, multi-appearance segmentation can automatically vary the threshold in a local area and make it more suitable for the small object in that area.
1) Intention: Different segmentation thresholds produce entirely different results. As Fig. 2 shows, a low threshold cannot separate objects from noise points well, while too high a threshold loses some objects. In the example of Fig. 2 we show only three thresholds. In practice, the number of layers should be determined according to the specific variance of the image. The appropriate threshold changes with the position in the image. Unified threshold segmentation methods hardly provide sufficient discrimination to distinguish false alarms from real objects.
Each segmentation of gray objects demonstrates a single appearance, which can be interpreted as one slice measurement. From one slice measurement, we can obtain the corresponding object distribution hypothesis score. Sequential layers of slice measurements, together with their affiliation connections, form a small object appearance tree. The affiliation tree of the objects in Fig. 2 is shown in Fig. 3. The critical problem in segmentation is to select a more appropriate threshold for each object from all the layers. The criterion for layer selection is to maintain the shape of objects.

Fig. 3. Multi-appearance structure with tree relation. The corresponding tree structure is built according to the relations in the binarization slides of the original image. As the simple tree example (right side of the image) shows, the combination of candidate object selections in the tree with root node $S^1_3$ can only be one of $\{S^1_3\}$, $\{S^2_2, S^2_3\}$ and $\{S^3_2, S^2_3\}$. This follows from the affiliation constraints indicated by the tree structure. $S^1_3$ and $S^3_2$ cannot be selected simultaneously, since the threshold for a local region is unique.
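The selection rule in the Fig. 3 example can be checked with a small enumeration. The sketch below is only an illustration: the dictionary `TREE` and the node names (`S1_3` and so on) are hypothetical encodings of the example tree, and `valid_selections` is an assumed helper, not part of the proposed method. It lists every selection that keeps exactly one node on each root-to-leaf branch.

```python
from itertools import product

# Hypothetical encoding of the Fig. 3 example tree:
# S1_3 (layer 1) -> S2_2, S2_3 (layer 2); S2_2 -> S3_2 (layer 3).
TREE = {
    "S1_3": ["S2_2", "S2_3"],
    "S2_2": ["S3_2"],
    "S2_3": [],
    "S3_2": [],
}

def valid_selections(tree, node):
    """Return every selection with exactly one node on each root-to-leaf path."""
    # Option 1: keep this node and drop its whole subtree.
    selections = [frozenset([node])]
    children = tree[node]
    if children:
        # Option 2: skip this node and combine one valid selection per child.
        per_child = [valid_selections(tree, child) for child in children]
        for combo in product(*per_child):
            selections.append(frozenset().union(*combo))
    return selections

if __name__ == "__main__":
    for selection in valid_selections(TREE, "S1_3"):
        print(sorted(selection))
```

For this tree the enumeration yields the three combinations named in the caption, and a pair such as `S1_3` with `S3_2` never appears because both lie on the same branch.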
2) Details of detection method: Our multi-appearance segmentation is composed of three stages. In the first stage, the gray image is transformed into the corresponding map using Eqs. (2) and (1), and global threshold binarization is then performed on the transformed image. In the second stage, the tree structure is built from the information in the slides. In the third stage, the depth-first branch adjustment algorithm is executed for each multi-appearance tree.
In Eq. (2), $MI_i$ is the mean intensity of the designated neighbor area. The neighbor area of a pixel is the region between the outer and inner square windows centered on that pixel. The outer and inner widths are set to 9 and 3, respectively. $N_u$ is the number of pixels in the neighbor area, and $I_j$ is the gray level of the jth pixel in the neighbor area. $I_n$ represents the gray level of the central pixel and $C_n$ is the final corresponding value.
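The neighbor-area mean described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `ring_mean` is hypothetical, borders are handled by edge replication (the text does not say how borders are treated), and only the mean over the ring is computed, not the final value $C_n$.

```python
import numpy as np

def ring_mean(image, outer=9, inner=3):
    """Mean gray level of the ring between the outer and inner square windows."""
    img = np.asarray(image, dtype=float)
    ro, ri = outer // 2, inner // 2
    # Assumption: replicate border pixels so every pixel has a full window.
    padded = np.pad(img, ro, mode="edge")
    height, width = img.shape
    out = np.empty_like(img)
    n_u = outer * outer - inner * inner  # N_u: number of pixels in the ring
    for y in range(height):
        for x in range(width):
            cy, cx = y + ro, x + ro
            outer_sum = padded[cy - ro:cy + ro + 1, cx - ro:cx + ro + 1].sum()
            inner_sum = padded[cy - ri:cy + ri + 1, cx - ri:cx + ri + 1].sum()
            out[y, x] = (outer_sum - inner_sum) / n_u
    return out

if __name__ == "__main__":
    demo = np.zeros((11, 11))
    demo[5, 5] = 72.0
    print(ring_mean(demo)[5, 5], ring_mean(demo)[5, 7])
```

With the widths 9 and 3 from the text, the ring holds 72 pixels, and a bright central pixel does not contribute to its own neighbor mean.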
For the first stage, Eqs. (2) and (1) are inspired by the work of C. L. P. Chen et al. [25]. However, we remove the minimum search and keep the edges of objects clear. The global threshold, which is the sum of the average intensity and K times the variance, is calculated at the beginning. Then, using the global threshold as the middle reference value, n layers of thresholds are set at equal intervals. For each threshold, a corresponding segmentation is performed, and the object information is acquired via a filling algorithm. The affiliation relationships between objects from adjacent layers are preserved. After this stage, a multi-appearance tree is built, in which a clear tree-like structure describes the relationships between objects from adjacent layers.
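The layered thresholds can be sketched as below, assuming the transformed map is available as a NumPy array. The function names and the default values of `k`, `n_layers` and `step` are hypothetical; only the rule "mean plus K times the variance, then n equally spaced layers centered on it" is taken from the text.

```python
import numpy as np

def layer_thresholds(contrast_map, k=3.0, n_layers=5, step=1.0):
    """Equally spaced thresholds centered on the global threshold."""
    c = np.asarray(contrast_map, dtype=float)
    # Global threshold: average intensity plus K times the variance.
    global_th = c.mean() + k * c.var()
    # Offsets are symmetric around zero, so global_th is the middle value.
    offsets = (np.arange(n_layers) - (n_layers - 1) / 2.0) * step
    return global_th + offsets

def binary_slices(contrast_map, thresholds):
    """One binary slice per threshold; each slice is one 'appearance'."""
    c = np.asarray(contrast_map, dtype=float)
    return [c > th for th in thresholds]

if __name__ == "__main__":
    demo = np.array([[0.0, 2.0], [0.0, 2.0]])
    print(layer_thresholds(demo))
```

Because the thresholds increase monotonically, each slice is contained in the previous one, which is what allows objects from adjacent layers to be linked into a tree.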
Algorithm: Depth-first branch adjustment
Input: $n_{root}$
Output: $S_{root}$
  $S_{childscoresum} = 0$
  Calculate $S_{appearance}$ according to Eq. (3).
  for each $n_{child}$ that is a child node of $n_{root}$ do
    $S_{childscoresum} \leftarrow S_{childscoresum}$ + DepthFirstBranchAdjustment($n_{child}$)
  end for
  if $S_{childscoresum} < S_{appearance}$ then
    $S_{root} \leftarrow S_{childscoresum}$
  else
    mark $n_{root}$ as candidate node
    $S_{root} \leftarrow S_{appearance}$
  end if

The depth-first branch adjustment algorithm is then used to mark the candidate nodes in the appearance tree. The children of a candidate node are discarded together with their part of the tree. Finally, a breadth-first search determines the final objects, which are the first candidate nodes on each branch from the root node to the leaf nodes.
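A runnable sketch of the two passes is given below. It follows the comparison in the listing literally, so a smaller score wins, and it adds one assumption that the listing does not state: a leaf node is always marked as a candidate, because it has no children to defer to. The tree encoding, the node names and the scores are hypothetical.

```python
from collections import deque

def branch_adjust(tree, scores, node, candidates):
    """Depth-first pass: return S_root for `node` and record candidate nodes."""
    children = tree[node]
    if not children:
        # Assumption: a leaf has no finer split, so it is always a candidate.
        candidates.add(node)
        return scores[node]
    child_sum = sum(branch_adjust(tree, scores, c, candidates) for c in children)
    if child_sum < scores[node]:
        return child_sum  # the children are preferred over this node
    candidates.add(node)
    return scores[node]

def final_objects(tree, root, candidates):
    """Breadth-first pass: first candidate met on each root-to-leaf branch."""
    found, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        if node in candidates:
            found.append(node)  # stop here, the subtree below is discarded
        else:
            queue.extend(tree[node])
    return found

if __name__ == "__main__":
    tree = {"S1_3": ["S2_2", "S2_3"], "S2_2": ["S3_2"], "S2_3": [], "S3_2": []}
    scores = {"S1_3": 10.0, "S2_2": 5.0, "S2_3": 2.0, "S3_2": 1.0}
    marked = set()
    print(branch_adjust(tree, scores, "S1_3", marked))
    print(final_objects(tree, "S1_3", marked))
```

With these example scores the split into `S3_2` and `S2_3` is kept; lowering the root score to 2.0 makes the root itself the single final object.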
We define the appearance score as the product of three score components (intensity, shape and bubble punishment):

$$S_{appearance} = S_{intensity} \cdot S_{shape} \cdot S_{bubblepunishment} \quad (3)$$

$S_{appearance}$ evaluates the object's likelihood on the premise that the intensity variation remains fairly stable inside a separated object. In fact, the greatest intensity variation in the image should occur at the edges that separate object and background. $S_{intensity}$ is defined as the variance of intensity.
We use $S_{shape}$ to measure the impact of the object's geometric shape. Accurate segmentation should make the object profile smooth and regular. $S_{shape}$ is defined as the variance of pixel distance.
A punishment factor $S_{bubblepunishment}$ is used for regularization; it treats undetected pixels inside the object as bubbles and uses them to measure the defect as a punishment.

A. Our tracking framework
We propose a tracking framework with MMHT as the hypothesis generator and an extended 0-1 program as the hypothesis selection component. MMHT is inspired by TOMHT and uses a similar tree hypothesis structure and the idea of limiting branch expansion. The extended 0-1 program is proposed and employed as a substitute for the hypothesis enumeration procedure. In traditional MHT, hypothesis enumeration is carried out at each frame to find the group of hypotheses in which each hypothesis is compatible with the others. We transform this intractable problem into an extended 0-1 linear program and implement the one-to-many association assumption through adjunctive binary variables.
The tracking framework is illustrated by the flow diagram in Fig. 1. MMHT produces the most reliable hypotheses of every tree, which are sent to the hypothesis selection component as input data. Our extended 0-1 program then determines the final set of preserved hypotheses, taking full account of the compatibility relationships among hypotheses. Note that our 0-1 program does not treat every pair of incompatible tracks in a strictly hard way. The extended 0-1 program is a soft method with more flexibility to handle complex circumstances.

B. MMHT to form candidate tracks
MMHT (management for multi-hypothesis tree) is developed as the generator of potential tracks. We integrate several hypothesis processing techniques and organize this collection of ordered procedures into our hypothesis management method. MMHT comprises our hypothesis management method and other plain data processing steps.
1) Tracking description: We first introduce the mathematical formulation. Assume a sensor scans the surveillance region periodically. The set of measurements received at frame t is denoted by $M(t)$:

$$M(t) = \{m_i^t\}_{i=1}^{N_t}, \quad t = 1, \dots, N$$

where N is the number of frames, $m_i^t$ is the ith measurement received within frame t, and $N_t$ is the number of measurements received at frame t. In addition, a dummy observation $m_0^t$ is defined for each frame t to denote possible missed detections.
A track hypothesis (hereafter simply hypothesis) $T_j^t$ at frame t is defined as a sequence of observations over successive frames. This definition imposes the restriction that a track can contain at most one measurement at any particular frame. A track hypothesis score is associated with each hypothesis to evaluate its likelihood of being a true target. $O_l^t$ is a set of hypothesis tracks in which all $T_n^t$ share the same root. $O^t$ is the set of all $O_l^t$, representing all hypothesis trees at frame t.
2) Track management method: Since our tracking method is designed for batch optimization, the 0-1 program is not executed at every frame but over a larger span. Before the hypothesis selection in the 0-1 program, the hypothesis tree would grow to a colossal scale if no special restriction were employed.
The principles of the new management method are: 1) hypothesis growth is more restricted; 2) complexity is reduced by various measures. Our hypothesis management method is flexible and easy to control. It includes the following functions: gating, low-level hypothesis assessment for acceptance or rejection, and hypothesis merging.
a) Score for moving variability: First, we define $S_{MV}$, the score of moving variability.
where the following notations are used:
$N_C^j$: depth of hypothesis j, which does not exceed the depth of the practical hypothesis tree;
$S_{MV}(T_j^t)$: score of leaf hypothesis j at frame t for moving variability;
$V_{m_i^t}$: the velocity of $m_i^t$ at frame t, assuming that the association between $m_i^t$ and $T_j^{t-1}$ is built;
$m_i^t$: measurement i at frame t;
$P^t_{j_{t-1}}$: prediction deduced from $T_j^{t-1}$;
mahaldist(·): function returning the Mahalanobis distance.
The velocity and prediction used here are acquired through the correction step of the Kalman filter. In fact, $S_{MV}(T_j^t)$ can be regarded as a score based on an autoregressive model: if we unroll $S_{MV}(T_j^t)$, we obtain an autoregressive formula in $\Delta V_{m_i^l, P^l_{j_{l-1}}}$, with l as the frame number. The initial $S_{MV}(T_j^t)$ is set to zero, so the constant term of this autoregressive formula is zero.
As a result of the decreasing weight term used in Eq. (13), the coefficient $a_l$ decreases as l decreases. The approximate effective order of this autoregressive model varies from 10 to 20 over a correspondingly short period, depending on the value of $N_C$.
b) Score test for moving variability: For each leaf node in the hypothesis tree, $S_{MV}(T_j^t)$ is calculated and maintained iteratively. Then, we use the test in Eq. (15) to filter out unsatisfactory hypotheses before branch growth (also called gating). $A_{ij} = 0$ means that the corresponding association does not pass gating.
$th_n$: multistage threshold for the score test;
$A_{ij}$: association filter mask between measurement i and hypothesis j; $A_{ij} = 1$ indicates that association between measurement i and hypothesis j is permitted.
We use a figure to illustrate our definition of $S_{MV}$, the score test and the considerations behind them. As Fig. 5 shows, our score test is a polyline with a descending threshold at first and a constant one later. $|\Delta V_{m,P} - S_{MV}(T_j)|$ should stay under the threshold line; once it crosses the line, the corresponding hypothesis is abandoned. We define this score test to describe how strongly a track maintains its moving pattern. The moving pattern here can be interpreted as the degree of volatility. As an autoregressive formula of $\Delta V_{m,P}$, $S_{MV}$ reflects the expected degree of acceleration. Note that we assume object movement has constant acceleration. $|\Delta V_{m,P} - S_{MV}(T_j)|$ reflects the deviation of the current movement from the expected one in terms of acceleration. Therefore, our score test is capable of capturing the coarse pattern of movement.
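The score test can be sketched as a polyline threshold followed by a comparison. All numeric values below (`th_start`, `th_floor`, `ramp_len`) are hypothetical placeholders, the function names are illustrative, and the threshold is assumed to be indexed by the number of updates the hypothesis has received.

```python
def multistage_threshold(depth, th_start=8.0, th_floor=2.0, ramp_len=5):
    """Polyline threshold: descends linearly over ramp_len updates, then constant."""
    if depth >= ramp_len:
        return th_floor
    return th_start + (th_floor - th_start) * depth / ramp_len

def association_mask(delta_v, s_mv, depth):
    """A_ij = 1 if |delta_v - s_mv| stays under the threshold line, else 0."""
    return 1 if abs(delta_v - s_mv) < multistage_threshold(depth) else 0

if __name__ == "__main__":
    # The same deviation is tolerated for a young hypothesis but rejected later.
    print(association_mask(5.0, 0.0, depth=0), association_mask(5.0, 0.0, depth=6))
```

The loose threshold at small depth lets new hypotheses form despite erratic initial movement, while the constant floor enforces the established moving pattern later.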
For the motion score of $T_1$ (the curve with a frequently changing moving pattern), three potential hypotheses are given. The corresponding diagram of the temporary moving score is drawn on the right side. H1 and H2 are rejected because their scores explode, since these two smooth tracks deviate far from the track's original fickle pattern. As for $T_2$, a sudden change of movement likewise goes against its previous smoothness. The assumption we employ here is that the volatility of a track stays within a certain degree over a limited time span. To describe this degree, we use a simple multistage threshold, which facilitates the generation of hypotheses by tolerating erratic movement at the beginning.
c) Hypothesis score: Each hypothesis is assigned a hypothesis score indicating the likelihood of the track, which is used for hypothesis selection and merging. The score of hypothesis j at frame t is defined as a weighted combination of two terms. $S_{LTM}(T_j^t)$ and $S_{STM}(T_j^t)$ are the scores of hypothesis j with respect to long-term motion and short-term motion, respectively. $\omega_{LTM}$ and $\omega_{STM}$ are the weights for long-term motion and short-term motion.
$S_{LTM}(T_j^t)$ used here is the original score formulation of traditional MHT. We add $S_{STM}(T_j^t)$ to capture the variability of short-term motion, in order to enhance the sensitivity of the score to rapid motion variation. The original score is a slowly changing value that increases over time for potential tracks. After a period of updates, it reaches a rather high value and partly loses its sensitivity to motion change. Although some methods such as SPRT employ a threshold to detect its variation in the high-value state, this hard threshold cannot provide sufficiently agile information for describing short-term motion. So, $S_{LTM}(T_j^t)$ reflects the likelihood of a hypothesis from a global view, or over the rather long term.
$S_{LTM}(T_j^t)$ is defined using the cumulative log-likelihood ratio in the typical way.
If hypothesis $T_j^{t-1}$ is associated with measurement $m_i^t$ at frame t, the hypothesis score receives an increment $\Delta S_{LTM}(T_j^t)$, for which the following notations are used:
$m_i^t$: vector of measurement i at frame t;
$p(\cdot)$: the probability density function (PDF) of measurement $m_i^t$ conditioned on the one-step prediction of hypothesis $T_j^{t-1}$;
$P_D$: detection probability;
$\lambda_{fa}$: the expected number of false alarms per unit volume of the measurement space per frame (spatial density of clutter);
$\lambda_{nt}$: spatial density of new targets.
The initial hypothesis score $\Delta S^{i}_{LTM}(T_j^t)$ is given as $\log(\lambda_{nt} / \lambda_{fa})$. Then, we define $S_{STM}(T_j^t)$ as in Eq. (20).
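For reference, a form of the log-likelihood-ratio increment that is common in the MHT literature is sketched below. It is offered as an assumption about the general shape of the score, not as the exact expression used in this work; the missed-detection branch in particular is the usual textbook choice and is not stated in the text.

```latex
\Delta S_{LTM}(T_j^t) =
\begin{cases}
\log \dfrac{P_D \, p(m_i^t \mid T_j^{t-1})}{\lambda_{fa}},
  & \text{if } T_j^{t-1} \text{ is associated with } m_i^t,\ i \ge 1, \\[2ex]
\log (1 - P_D),
  & \text{if } T_j^{t-1} \text{ is associated with the dummy observation } m_0^t .
\end{cases}
```

Under this form, a measurement that fits the prediction well relative to the clutter density raises the score, while each missed detection lowers it by a fixed amount.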
$|\Delta V_{m,P} - S_{MV}(T_j)|$ is used for score testing, as mentioned before. The accumulation of this deviation can be used to show the deviation tendency over the known life of a certain object. Then, we use Eq. (20) to acquire the score $S_{STM}$ based on the history of deviation, where IC is the number of effective emergences. IC counts the occasions on which a valid measurement is assigned to the current object; otherwise, IC decreases as a punishment for the missing association. $T_s(IC)$ is the accumulation of the two-stage threshold $th_n$ used in score testing.
d) Hypothesis merging: The flow diagram of our hypothesis management method is shown in Fig. 1. Two stages of merging are carried out to remove logically incompatible hypotheses. The first stage is strong hypothesis merging, which is carried out between different trees. A comparison is conducted between the hypothesis with the highest score and the one with the highest IC. These two hypotheses represent the historically strongest one and the temporarily strongest one. Hypotheses are merged if they share too many measurements in the recent period. Looped hypothesis merging, in turn, detects loop paths, where two hypothesis tracks deriving from the same node separate along the path and are later assigned the same measurement. Such loops may cause exponential growth in the number of hypotheses if no special treatment is employed.
Algorithm (hypothesis merging, partial listing):
  ...
  for each $T_i^t$ satisfying $m_n^t \in T_i^t$ do
    find $T_{sm}^t$ with max $S(T_{sm}^t)$
    find $T_{mic}^t$ with max $IC(T_{mic}^t)$
    if $T_{mic}^t = T_{sm}^t$ and they share more than depth number of detections then
      ...
    end if
  end for
end for

For inter-tree hypothesis merging, the hypothesis score is employed as the assessment criterion. For intra-tree merging, the temporary score is used.
e) Birth and death of hypothesis: Detections that cannot be linked to any hypothesis are regarded as the starts of new objects. We employ the SPRT [8] to decide whether to delete a hypothesis that has no linked detection to update.
We maintain a score rank for each hypothesis tree. Once a hypothesis is terminated, it is added to the corresponding position in the rank according to its score. In addition, all existing hypotheses are added to the rank at the last frame of each batch. For the final intra-tree hypothesis selection based on the extended 0-1 program, we provide only the top 20 percent of hypotheses as input, which discards most of the less likely hypotheses and reduces the computational burden.
The main function of the hypothesis management method in our paper is to enhance the quality and lower the quantity of hypotheses.
With one-to-one association, the ownership of a given detection is unique, which gives rise to a scramble between tracks for key detections. The results may look like those presented in sub-figures c and d: some hypotheses lose out and cannot maintain the continuity of their trajectories, or are even rejected entirely because they miss ownership of the key detections.

C. Overall hypothesis selection as an extended 0-1 program
The more objects appear in the scene, the more frequently occlusions happen. In fact, occlusion is still the most significant challenge.
Various occlusion situations significantly aggravate the complexity of data association. In our observations, the most difficult and deceptive situation is the one in which objects move in proximity. Proximate motion means that the objects have similar speeds when they encounter each other. The occlusion then lasts for a longer period, with a very high probability of losing the tracks.

1) One-to-many association for intractable occlusions:
When we focus on occlusions under approximate motion, one-to-many association may be better suited than one-to-one association.
Most past studies assumed that at most one object is associated with each measurement. In a stubborn occlusion with approximate motion, however, two or more objects are detected as a single measurement during the occlusion period. As Fig. 8 illustrates, if the one-to-one association assumption is enforced, some tracks are left without associated measurements and develop a large gap, which hinders the correct formation of tracks. On the contrary, one-to-many association allows each hypothesis to be formed through the zone in which only one measurement is detected.
2) Formula for one-to-many association: Track selection is naturally a binary linear program if each hypothesis selection is treated as a binary switch variable. The classical formulation of one-to-one association (such as [29]) is as follows.
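A typical way of writing such a one-to-one selection problem is sketched below. It is a generic set-packing form given as an assumption, using the symbols $C_{T_n}$ and $\xi_{T_n}$ of this section and assuming that costs are minimized; the exact formulation in [29] may differ.

```latex
\min_{\xi} \ \sum_{n} C_{T_n}\, \xi_{T_n}
\quad \text{s.t.} \quad
\sum_{n \,:\, m \in T_n} \xi_{T_n} \le 1 \ \ \forall m,
\qquad \xi_{T_n} \in \{0, 1\}.
```

The constraint states that every measurement m may be owned by at most one selected hypothesis, which is exactly the hard exclusion that the extended program relaxes.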
$C_{T_n}$ is defined as the cost of hypothesis n. $\xi_{T_i}$ is a binary variable indicating whether $T_i$ is selected. To take the degree of mutual exclusion into account feasibly and efficiently, we design an extended 0-1 linear program with adjunctive binary variables $I_{ij}$. The hypothesis selection is formulated as follows.
$I_{ij}$ represents the incompatibility indicator between hypotheses i and j. $I_{ij}$ equals 1 when hypotheses i and j share a measurement; $I_{ij} = 0$ indicates that hypotheses i and j are entirely unrelated. Constraints (26) and (27) ensure that $I_{ij}$ is 1 when $T_i$ and $T_j$ are both selected and 0 otherwise. $C_{I_{ij}}$ represents the incompatibility cost of selecting both hypotheses i and j.
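One standard pair of linear constraints with this effect is $\xi_{T_i} + \xi_{T_j} - 1 \le I_{ij}$ and $2 I_{ij} \le \xi_{T_i} + \xi_{T_j}$. This is an assumed linearization, and constraints (26) and (27) may be written differently. The short check below enumerates all binary cases and confirms that the only feasible value of $I_{ij}$ is the product of the two selection variables; the function name is hypothetical.

```python
from itertools import product

def feasible_I(xi_i, xi_j):
    """Binary values of I_ij allowed by the two assumed linear constraints."""
    return [
        I for I in (0, 1)
        if xi_i + xi_j - 1 <= I and 2 * I <= xi_i + xi_j
    ]

# Truth table over all selections of the two hypotheses.
TABLE = {(a, b): feasible_I(a, b) for a, b in product((0, 1), repeat=2)}

if __name__ == "__main__":
    for key, value in TABLE.items():
        print(key, value)
```

Since the feasible value always equals $\xi_{T_i} \xi_{T_j}$, the linear term $I_{ij} C_{I_{ij}}$ can stand in for the quadratic term without changing the objective.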
Unlike the classical formulation, where a simple constraint is employed to forbid collisions, we use the adjunctive binary variables $I_{ij}$ to handle both mutual exclusion and one-to-many assignment. The new term $I_{ij} C_{I_{ij}}$ plays the role of the product term $\xi_{T_i} \xi_{T_j} C_{I_{ij}}$. Without the adjunctive variables, this product would have to be written directly, turning the program into a quadratic 0-1 problem, which is NP-hard and considerably more expensive to solve. Solving a linear program instead of a quadratic one greatly reduces the computation. Besides, with $I_{ij}$ and $C_{I_{ij}}$, the restriction on incompatible hypotheses becomes adjustable. We slightly relax this restriction to encourage the formation of robust long hypotheses.
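The effect of the adjunctive variables can be illustrated on a toy instance. The following is a minimal sketch, not the authors' implementation: it assumes the program is a minimization in which good hypotheses have negative cost, it uses two standard linking constraints to tie each $I_{ij}$ to the product of the two selection variables, and it replaces the integer programming solver with exhaustive enumeration, which is only practical for a handful of hypotheses. The function name and all numbers are hypothetical.

```python
from itertools import product


def solve_extended_01(costs, pair_costs):
    """Exhaustively solve a tiny extended 0-1 program.

    costs[n]           -- cost C_Tn of selecting hypothesis n (assumed: lower is better)
    pair_costs[(i, j)] -- cost C_Iij, given only for incompatible pairs with i < j
    Returns (best objective value, tuple of selection variables).
    """
    pairs = sorted(pair_costs)
    best = None
    for x in product((0, 1), repeat=len(costs)):        # selection variables
        for inc in product((0, 1), repeat=len(pairs)):  # adjunctive variables I_ij
            # Two linear constraints that force I_ij = x_i * x_j for binaries.
            linked = all(
                inc[k] >= x[i] + x[j] - 1 and 2 * inc[k] <= x[i] + x[j]
                for k, (i, j) in enumerate(pairs)
            )
            if not linked:
                continue
            value = sum(c * s for c, s in zip(costs, x))
            value += sum(pair_costs[p] * inc[k] for k, p in enumerate(pairs))
            if best is None or value < best[0]:
                best = (value, x)
    return best


# Hypothetical numbers: hypotheses 0 and 1 share one measurement (cheap penalty),
# hypotheses 1 and 2 share many measurements (prohibitive penalty).
value, selected = solve_extended_01([-10, -8, -3], {(0, 1): 2, (1, 2): 100})
print(value, selected)  # prints: -16 (1, 1, 0)
```

In this toy case a hard one-to-one constraint would forbid selecting hypotheses 0 and 1 together and the best selection would be (1, 0, 1) with value -13, whereas the penalized program keeps both strong hypotheses and pays a small sharing cost.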
3) Details of extended 0-1 linear program: In practice, the adjunctive binary variables $I_{ij}$ are needed only for incompatible pairs $(i, j)$, which make up a small portion of all pairs. The total number of variables therefore stays limited, and the program is tractable and, as our experiments show, fast to solve.
Considering the featureless character of small objects, we propose the following criterion as the cost of a hypothesis, in which motion is the critical component.
Here $N^n_s$ is the number of sustaining frames, $L^n_t$ is the score of leaf hypothesis $T_n$ at frame $t$, and $K$ is an adjustment coefficient. For the cost of an incompatible pair of hypotheses, we use the following formula.
$N^{ij}_I$ denotes the number of incompatible measurements between $T_i$ and $T_j$. $C_{IN}$ is a constant, the cost multiplier for sharing a single measurement. $L_T$ is the upper limit of tolerable sharing. If $N^{ij}_I$ exceeds $L_T$, a very large value is assigned to $C_{I_{ij}}$, indicating that $T_i$ and $T_j$ are completely incompatible with each other.
After the 0-1 program has been built for each batch of frames, we use lpsolve to solve it. Lpsolve is reasonably efficient, and thanks to the reasonable assumptions and optimized constraints, the solving time is affordable. Detailed time consumption and analysis are presented in the next section.
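The description of the incompatibility cost suggests a piecewise rule: a cost proportional to the number of shared measurements up to the threshold, and a prohibitive value beyond it. The sketch below follows that reading. The linear form below the threshold, the representation of a hypothesis as a set of (frame, measurement id) pairs, the function name, and the value used as the prohibitive cost are all assumptions; the defaults for `c_in` and `l_t` mirror the values of $C_{IN}$ and $L_T$ reported in the parameter settings.

```python
def incompatibility_cost(track_i, track_j, c_in=1.0, l_t=5, prohibitive=1e6):
    """Cost C_Iij of selecting two hypotheses that share measurements.

    track_i, track_j -- sets of (frame, measurement_id) pairs (assumed representation)
    c_in             -- cost multiplier C_IN per shared measurement
    l_t              -- threshold L_T on the tolerated number of shared measurements
    prohibitive      -- hypothetical stand-in for the very large cost above the threshold
    """
    shared = len(set(track_i) & set(track_j))  # N_I^{ij}
    if shared > l_t:
        return prohibitive
    return c_in * shared


# Two hypothetical tracks that pass through the same measurement at frame 2.
a = {(1, 0), (2, 3), (3, 4)}
b = {(1, 1), (2, 3), (3, 5)}
print(incompatibility_cost(a, b))  # prints: 1.0
```

Pairs with no shared measurement are compatible and would not need an adjunctive variable at all, which is consistent with the remark that such variables make up only a small portion of the program.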

IV. EXPERIMENTS
In this section, we describe several experiments that verify the performance of our segmentation and tracker. Detailed data and comparative analysis are presented to demonstrate the capability of the proposed methods.

A. Comparing methods
First, we list the methods used for comparison and their parameter settings. The experiments have two parts, covering segmentation methods and tracking methods respectively.
1) Segmentation methods:
a) LCM based detection: Using a contrast based template operator helps to extract salient points and is also conducive to local optimization [25]. Its sensitivity to small objects in low-SNR circumstances is notable. For brevity, we use LCM hereinafter to denote the LCM based detection method.
b) Top-hat filter: The top-hat operation extracts small elements and details from an image, and has been found useful for small object detection.
2) Tracking methods:
a) MHT: The parameters for Cox's implementation [9] of MHT were mainly taken from the default settings. Only a few changes were made to adapt it to some scenarios. We also compared the speed of Cox's implementation with that of our method, both running on the same platform and implemented in the same language (C++).
b) SGTS: We also employ SGTS as a compared method. It is implemented in MATLAB, so no speed comparison is presented for SGTS. The number of semigreedy solutions generated before selection was set to 60.
c) MDA: The duality gap of the termination criterion was set to 0.02, and the maximum number of iterations to 100. The Lagrange multiplier updating scheme was the heuristic price update, because it is believed to be more efficient and to exploit fully the structure of the intermediate feasible solutions found by the Auction algorithm.
d) GRASP-MHT: Greedy randomized adaptive search was applied to multi-object tracking by Murphey et al. [30] and Robertson et al. [31]. We used a MATLAB implementation from Ren's work [32], where GRASP serves as the engine of hypothesis generation in the MWISP formulated TOMHT [32]. The parameters nv, np, and nitr, which control the amount of computation in candidate construction, were set to 20, 20, and 3, respectively. In the hypothesis pruning procedure, hypotheses with a probability lower than $10^{-6}$ are discarded.
b) Track Completeness Factor(TCF↑): Track Completeness Factor measures how well we detect a given object after the association [33]. The tol used in TCF is set to 15.
c) Track Fragmentation(TF↑): Track Fragmentation measures how well we maintain identity [33]. The tol used in TF is also set to 15.
3) Tracking Metrics Assuming that Objects Occupy Certain Space: We also used metrics borrowed from visual multi-object tracking tasks such as pedestrian tracking, which have attracted great attention recently. Those metrics were designed for objects with a certain size and a detection box, rather than for small objects of a few pixels. However, the small objects in our dense tracking scenario cover more than just a few pixels and occupy considerable space (they are still small objects of no more than one hundred pixels), and higher density amplifies the effect of their size. Those metrics were also designed for complex scenarios with massive occlusions, which are more complicated than the traditional scenario. Applying them adds diversity and reference value to our results. For these reasons, we used the CLEAR MOT metrics [34].
a) Number of Identity Switch(IDSW↓): Identity Switch counts the number of times a ground truth target $i$ is matched to hypothesis $j$ while the last known assignment was $k$ ($k \neq j$) [34].
b) Multiple Object Tracking Accuracy(MOTA↑): Thanks to its expressiveness, the Multiple Object Tracking Accuracy [34] may be the most widely used figure for evaluating a tracker's performance. It is defined as

$$\mathrm{MOTA} = 1 - \frac{\sum_t \left(FN_t + FP_t + IDSW_t\right)}{\sum_t GT_t} \qquad (31)$$

It combines three different sources of error, where $t$ is the frame index and $GT$ is the number of ground truths. $FN$ is the number of false negatives, and $FP$ is the number of false positives.
c) Multiple Object Tracking Precision(MOTP↑): The Multiple Object Tracking Precision is the average dissimilarity between all true positives and their corresponding ground truth targets [34].

d) Ratio Misses Over Total Number(FN↓): The ratio of misses in the sequences over the total number of objects present in all frames [33].
e) Ratio False Positive Over Total Number(FP↓): The ratio of false positives over the total number of objects present in all frames [33].
f) Recall(REC↑): The number of correctly matched detections divided by the total number of detections in ground truth.
g) Precision(PRE↑): The number of correctly matched detections divided by the total number of output detections.
We use the up arrow ↑ to indicate that a higher score is better. Conversely, the down arrow ↓ indicates that a lower score is preferred.
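The accuracy, recall and precision figures listed above reduce to simple ratios once the per-frame counts are available. The following minimal sketch computes MOTA in its standard CLEAR MOT form [34], together with recall and precision as defined above. The function names and the example counts are hypothetical, and the sketch assumes that the matching between outputs and ground truth has already been done.

```python
def mota(fn, fp, idsw, gt):
    """Multiple Object Tracking Accuracy from per-frame counts.

    fn, fp, idsw, gt -- sequences with one entry per frame: false negatives,
    false positives, identity switches, and ground-truth objects.
    """
    total_gt = sum(gt)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / total_gt


def recall(true_positives, total_ground_truth):
    """Correctly matched detections over all ground-truth detections."""
    return true_positives / total_ground_truth


def precision(true_positives, total_output):
    """Correctly matched detections over all output detections."""
    return true_positives / total_output


# Hypothetical three-frame example: 5 errors against 25 ground-truth objects.
print(round(mota(fn=[1, 0, 2], fp=[0, 1, 0], idsw=[0, 0, 1], gt=[10, 10, 5]), 3))  # prints: 0.8
```

Note that MOTA can become negative when the number of errors exceeds the number of ground-truth objects.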

C. Dataset
Two datasets were used in our experiments, denoted as Larva and Verti Hat respectively.
a) Larva: Scenes with moving micro-animals are the primary application of this paper, and three video segments were used. All of them contain nearly one hundred objects in a single frame, and some show frequent occlusions.
They were captured under different conditions with different image qualities, and the three sequences represent three degrees of difficulty. The image quality of Larva s2 is somewhat lower than that of Larva s1 because of extra ripple interference caused by sensor noise. In Larva s3, the focal length changes several times, which makes the focus plane drift. Objects can suddenly become blurry for a couple of frames and be missed by the detector.
We sampled the video at regular intervals and used half of the frames in the experiments to reduce the computational load. The ground truth was produced from the tracking results of several of the tracking methods enumerated above. Unlike in the experiments, the video used for this preliminary labeling was the full sequence without down sampling, which is less ambiguous because the extra frames fill the uncertainty gaps. We then checked and corrected the trajectories manually, using those results as a reference to reduce effort and focusing on key frames where occlusions or false alarms occur.
b) Verti Hat: This pedestrian video is taken from a vertical view, in which around 60 persons wearing hats walk around a designated region. After disturbances are added and some measurements are removed, the scene becomes considerably difficult. The dataset comes from Ren's work [32]. We took the first 600 frames of the pedestrian video and down sampled them at 5 Hz as the author did. In the end, 120 frames were used in this scenario test, twice the length of the data used in Ren's paper.

D. Parameters
The depth of the hypothesis tree was set to 6, and the parameters for the multi-stage threshold were set to α = 20, β = 0.8, γ = 10, δ = 6. $\omega_{LTM}$ and $\omega_{STM}$ were both 1. The parameters in $S_{STM}$ were $P_D = 0.9$, $\lambda_{fa} = 10^{-6}$, $\lambda_{nt} = 10^{-8}$. $K$, $C_{IN}$ and $L_T$ in the programming part were set to 5, 1 and 5 respectively. The optimization batch length was set to 20 with a consideration window of 40, i.e. the window also contains earlier frames, and every frame in it is taken into account during track selection.
Usually, a detection is declared correct if the distance between a ground truth and the detected position is within a threshold [35]. Our tracking scenario has dense and bigger objects, and the resolution of the sequences used here is more than three hundred thousand pixels, compared with about sixty thousand in C. L. P. Chen's work [25]. We therefore chose 15 as the distance threshold, roughly in proportion to this increase. For the metrics that assume objects occupy a certain space, we treat each object as a box with a radius of 15.
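The distance-gated correctness test can be sketched as follows. The text only fixes the threshold of 15; the greedy nearest-pair matching, the one-to-one pairing of detections and ground truths within a frame, and the function name are assumptions made for illustration, and an optimal assignment (for example the Hungarian method) could be substituted.

```python
import math


def match_detections(ground_truth, detections, threshold=15.0):
    """Count correct detections in one frame with a distance gate.

    ground_truth, detections -- lists of (x, y) positions
    Returns (true_positives, false_negatives, false_positives).
    Matching is greedy on the closest remaining pair (an assumed rule).
    """
    candidates = sorted(
        (math.dist(g, d), gi, di)
        for gi, g in enumerate(ground_truth)
        for di, d in enumerate(detections)
    )
    used_gt, used_det = set(), set()
    for distance, gi, di in candidates:
        if distance > threshold:
            break  # all remaining pairs are even farther apart
        if gi in used_gt or di in used_det:
            continue
        used_gt.add(gi)
        used_det.add(di)
    tp = len(used_gt)
    return tp, len(ground_truth) - tp, len(detections) - tp


# Hypothetical frame: two objects, three detections, one of them spurious.
print(match_detections([(0, 0), (100, 100)], [(3, 4), (200, 200), (101, 100)]))
# prints: (2, 0, 1)
```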

E. Experiment arrangement
Two components of experiments were arranged and performed. They focus on segmentation method and tracking method respectively.
1) Segmentation method: We arranged a detection experiment to compare our multi-appearance segmentation with a classical method (Top-hat) and a state-of-the-art method (LCM).
First, we used the F-score to pre-evaluate the three detection methods. The F-score is widely used for assessment and is defined as F = 2 * recall * precision/(recall + precision), which takes both recall and precision into account. For every segmentation method, the best K was chosen from its F-score curve, obtained by a sweep over samples of a video fragment (part of the Larva dataset). This gave best K = 1.8, 3.5, 5.8 for LCM, MAS and Top-hat respectively. In fact, the range K ∈ [3,5] recommended in [25] gives poor results in our dense object scenario, with a detection rate (DR) below seventy percent. Dense objects occupy more space, so a suitable K should be determined through sample tests.
Then, we used the best K for each detection method so that each segmentation yields its best result. Several indexes are listed in Table I, including some tracking indexes that assess detection from the view of the whole tracking procedure. Judged by F-score alone, our segmentation is not superior to LCM; however, LCM exposes some problems in the final result. As Fig. 9 shows, objects expand in the response map of LCM, which works against precise segmentation. The impact of this defect is also visible in the final tracking metrics, which is why we developed our new segmentation to improve the tracking results. Besides, since the main purpose of our segmentation is to be integrated into tracking, evaluating the final tracking outcome is more objective and revealing. The tracking procedure used for the three detection methods is identical, with the same settings.
Table I shows that our method achieves the best detection rate and the lowest DR-STD, while yielding more false alarms. The first priority of detection is to reveal objects; removing false alarms comes second. According to the final tracking
Our detection method is designed to detect dense objects and to cooperate with our tracking method, so it may not be appropriate for general small object detection. MAS takes better account of occlusion and crossing situations for dense objects, and uses the multi-appearance tree to retain distinguishability, preventing tiny objects from being covered by bigger agglomerated objects. LCM uses a max-pooling like operation to enhance and agglomerate objects, which can help to remove false alarms. However, agglomerated objects may worsen occlusion and reduce distinguishability.
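The selection of the best K from an F-score curve can be sketched as a simple sweep. The F-score formula is the one given above; the helper names and the recall and precision values in the example are hypothetical and are not the measured curve.

```python
def f_score(recall, precision):
    """F = 2 * recall * precision / (recall + precision), with 0 for the empty case."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


def best_k(curve):
    """Return the K whose (recall, precision) pair maximizes the F-score.

    curve -- dict mapping a candidate K to a (recall, precision) pair
    measured on sample frames.
    """
    return max(curve, key=lambda k: f_score(*curve[k]))


# Hypothetical sweep: a low K favors recall, a high K favors precision.
sweep = {1.0: (0.9, 0.5), 2.0: (0.8, 0.8), 3.0: (0.5, 0.9)}
print(best_k(sweep))  # prints: 2.0
```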
2) Tracking method: We present a series of experiments here, aimed at testing the overall performance and the one-to-many assumption. Apart from the comparison with other tracking methods, we explore the effect of the new ideas employed in this paper.
The four tracking methods listed in the previous subsection were used in this comparison. The two datasets used in this part represent the traditional scenario and the dense scenario, denoted as Verti Hat and Larva respectively.
a) Scenario Verti Hat: The results are presented in Table II. Our tracking method outperforms the others in the principal metrics such as OSPA-T, MOTA and IDSW. OSPA-T is a critical metric in the traditional tracking scenario, and our tracking method reduces this distance by nearly 20% compared with the second-best method, i.e. GRASP-MHT. Besides, a lower IDSW is a notable feature of our method, which can also be seen in the following experimental results.
b) Scenario Larva: The primary goal of this paper is to develop an efficient method for dense objects in cluttered backgrounds, and larva tracking from biology applications is exactly such a situation. We collected three sequences with different image qualities to represent different tracking conditions. The final row in Table III is the weighted average of the metrics over the three sequences, where each weight is the frame number of the sequence divided by the total frame number. Note that IDSW, FN and FP are treated differently: they are sums over the three sequences. In this target scenario, our method outperforms the other methods and ranks first in 6 of the 10 metrics. The recall of our method is at an ordinary level, because our efforts to reduce the computational complexity also sacrifice some valid hypotheses. After trading off complexity against performance, we reach a balance that keeps the computation tractable, and indeed very fast in this scenario, while preserving decent results.
c) Experiment of one-to-many assumption: We also provide a comparison test to demonstrate the effect of our one-to-many assumption. Using different constraints for track selection, the results of flexible association (one-to-many) and hard association (one-to-one) are listed in Table IV. One-to-one association rejects some potential hypotheses, because its all-or-nothing ('to be or not to be') decision discards hypotheses that are mostly correct in some cases.
As Table IV shows, our one-to-many constraint boosts the performance notably. Critical indexes such as MOTA and OSPA-T improve by a large margin: OSPA-T decreases by 22%, while MOTA increases by more than 20%. Some metrics, such as IDSW, degrade modestly. Even so, the one-to-many constraint is worthwhile given the clear improvement on the critical indexes.
During the MMHT stage, our efforts to keep the number of hypotheses at a tractable scale have removed most short and weak hypotheses. With only the strong and long hypotheses left, an overly rigorous constraint could reject imperfect ones with small flaws. Over extended time durations and in complex scenes, some strong hypotheses inevitably encounter long occlusions or crossings, as illustrated before, where measurement competition occurs. One-to-one association rejects a strong hypothesis for such flaws even though the major part of the hypothesis is correct. Since the weak short hypotheses have already been removed, no substitute composed of split hypotheses remains as a fallback choice. Our one-to-many constraint fills this gap in the procedure and preserves the usable strong hypotheses.
The one-to-many constraint is compatible with our tracking framework. In addition, it could be a new direction that, in a certain respect, breaks through the traditional idea of hard association. We provide only a modest relaxation: only a limited number of shared measurements is permitted. A massive number of sharings would entangle the tracking problem and degrade the tracker performance.
d) Experiment of speed comparison: Reducing the running time is another important target for tracking, especially for motion analysis of dense objects. However, speed is often treated as a secondary factor because of the difficulty of the problem and the trade-off between performance and speed. Our tracking method achieves decent performance while keeping the running time at a low level. As Table V shows, our tracking method is more than ten times faster than Cox's MHT. The platform we used is a PC with an E5 (3.3 GHz) CPU. Only two methods are listed in the table, because they use the same language (C++) while the others use MATLAB. Cox's MHT algorithm offers both good tracking capability and an efficient implementation, according to the recent comparison by Chanho Kim et al. [37], so in some respects Cox's implementation remains a meaningful reference. The speed of our implementation comes from screening out weak hypotheses first and preserving strong ones afterwards, combined with batch optimization instead of frame-by-frame evolution.

V. CONCLUSION
In this paper, we propose and implement methods for a complete tracking process for dim small objects, covering detection and tracking, from the raw image sequence to identified object trajectories. For detection, we present our multi-appearance segmentation. It employs multi-layer thresholds to produce multi-appearance slides, and exploits the tree structure to describe the relations between objects from different layers. Instead of global segmentation, we aim at one threshold for one detection, to maximize adaptability. A deep-first-branch-adjustment algorithm is then designed to optimize the threshold for every individual object. According to the final tracking results, this detection method markedly improves the performance of our tracking method by providing valuable detection input.
We build the tracking management structure on the classical tree from TOMHT, and implement it with some heuristic techniques, such as hypothesis merging based on loop detection and a modified motion score that uses auto-regression information. For the global hypothesis selection, we propose an extended 0-1 program based on the idea of one-to-many association, which integrates compatibility information and object likelihood at the same time. This new idea permits a small number of sharings, in which one measurement can be assigned to multiple objects. It mitigates the "cut-off" phenomenon and preserves the identity of objects. Thanks to our efforts to reduce the complexity and the number of hypotheses, our tracking method is implemented efficiently and proves to be fast in the experiments.