A Deep-Structured Conditional Random Field Model for Object Silhouette Tracking

In this work, we introduce a deep-structured conditional random field (DS-CRF) model for the purpose of state-based object silhouette tracking. The proposed DS-CRF model consists of a series of state layers, where each state layer spatially characterizes the object silhouette at a particular point in time. The interactions between adjacent state layers are established by inter-layer connectivity dynamically determined based on inter-frame optical flow. By incorporating both spatial and temporal context in a dynamic fashion within such a deep-structured probabilistic graphical model, the proposed DS-CRF model allows us to develop a framework that can accurately and efficiently track object silhouettes that can change greatly over time, as well as under different situations such as occlusion and multiple targets within the scene. Experimental results using video surveillance datasets containing different scenarios such as occlusion and multiple targets showed that the proposed DS-CRF approach provides strong object silhouette tracking performance when compared to baseline methods such as mean-shift tracking, as well as state-of-the-art methods such as context tracking and boosted particle filtering.


Introduction
Structured prediction, where one wishes to predict structured states given structured observations, is an interesting and challenging problem that is important in a number of different applications, one of which is object silhouette tracking. The goal of object silhouette tracking is to identify the silhouette of the same object over a video sequence. This is very challenging due to a number of factors such as occlusion, object motion changing dynamically over a video sequence, and the object silhouette changing drastically over time.
Much of the early literature in object tracking has consisted of generative tracking methods, where the joint distribution of states and observations is modeled. The classical example is the use of Kalman filters [1], where predictions of the object are made with Gaussian assumptions on both states and observations, based on predefined linear system dynamics. However, since object motion does not follow Gaussian behaviour and has non-linear system dynamics, the use of Kalman filters can often lead to poor prediction performance for object tracking. To address the issue of non-linear system dynamics, researchers have made use of modified [...] a context-specific deep CRF model where the local factors in linear-chain CRFs are replaced with sum-product networks. Yu et al. [17,18] proposed a deep-structured CRF model composed of multiple layers of simple CRFs, with each layer's input consisting of the previous layer's input and the resulting marginal probabilities. Given that the problem of object silhouette tracking is one where a set of video frames can contribute to predicting the object silhouette in a new video frame, one is motivated to investigate the efficacy of deep-structured CRF models for solving this problem.
In this work, we propose an alternative framework for state-based object silhouette tracking based on the concept of deep-structured discriminative modeling. In particular, we introduce a deep-structured conditional random field (DS-CRF) model consisting of a series of state layers, with each state layer spatially characterizing the object silhouette at a particular point in time. The interactions between adjacent state layers are established by inter-layer connectivity dynamically determined based on inter-frame optical flow. By incorporating both spatial and temporal context in a dynamic fashion within such a deep-structured probabilistic graphical model, the proposed DS-CRF model allows us to develop a framework that can accurately and efficiently track object silhouettes that can change greatly over time. Furthermore, such a modeling framework does not require distinct stages for prediction and update, and does not require independent training for the dynamics of each object silhouette being tracked. Experimental results show that the proposed framework can estimate object silhouettes over time in situations where there is occlusion as well as large changes in object silhouette appearance over time.

Materials and Methods
Within a statistical modeling framework, one can describe the problem of object silhouette tracking as a classification problem, where the goal is to classify each pixel in a video frame as either foreground (part of the object silhouette) or background. The goal is to maximize the posterior probability of the states given the observations, $P(Y \mid M)$, where $Y$ is the state plane characterizing the object silhouette and $M$ is the corresponding observations (e.g., video). Discriminative models, such as CRFs, derive the posterior probability $P(Y \mid M)$ directly and as such do not require the independence assumptions necessary for generative modeling approaches. In the proposed DS-CRF modeling framework for object silhouette tracking, the object silhouette and corresponding background at the pixel level for each video frame are characterized by a state layer, with the series of state layers interconnected based on inter-frame optical flow information to form a deep-structured conditional random field model that facilitates interactions amongst adjacent state layers. A detailed description of CRFs in the context of object silhouette tracking, followed by a detailed description of the proposed DS-CRF model, is provided below.

Conditional Random Fields
Conditional random fields (CRFs) are amongst the most effective and widely-used discriminative modeling tools developed in the past two decades. The idea of CRF modeling was first proposed by Lafferty et al. [19]; based on the Markov property, the CRF directly models the conditional probability of the states given the measurements, without requiring the specification of any sort of underlying prior model, and relaxes the conditional independence assumption commonly used by generative models.
Formally, let $G = (V, E)$ be an undirected graph such that $y_i \in Y$ is indexed by the vertices $v_i \in V$ in $G$. $(Y, M)$ is said to be a CRF if, when globally conditioned on $M$, the random variables $y_i$ obey the Markov property with respect to the graph $G$. In other words, $P(y_i \mid M, y_{V-\{i\}}) = P(y_i \mid M, y_{N_i})$, where $V - \{i\}$ is the set of all nodes in $G$ except node $i$, $Y$ is a set of output variables that we aim to predict, and $N_i$ and $M$ are the sets of neighbors of $i$ and of observed input variables, respectively. The general form of a CRF is given by

$$P(Y \mid M) = \frac{1}{Z(M)} \prod_{c \in C} \psi_c(y_c, M)$$

where $Z(M)$ is a normalization constant, essentially the so-called partition function of Gibbs fields, with respect to all possible values of $Y$, $C$ represents the set of all cliques, and $\psi_c$ encodes potential functions with a non-negative value condition. According to the non-negative constraint for $\psi_c$, and based on the Principle of Maximum Entropy [20], a proper probability distribution is the one that maximizes the entropy, given the constraints from the training set [21]. As such, a new form of the CRF is then given by

$$P(Y \mid M) = \frac{1}{Z(M)} \exp\Big( \sum_{c \in C} \sum_{k} \lambda_k\, f_k(y_c, M) \Big)$$

where $f_k(y_c, M)$ is a feature function with respect to clique $c$, and $\lambda$ denotes the weight of each feature function to be learned. The feature function expresses the relationship amongst the random variables in a clique. The feature functions defined with respect to each clique are indexed by $k$.
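To make the log-linear form concrete, the following is a minimal sketch that evaluates $P(Y \mid M)$ on a four-node chain by enumerating every labeling to obtain $Z(M)$. The two feature functions (label/observation agreement and neighbor agreement), the weights `w_unary` and `w_pair`, and the 0.5 threshold are illustrative assumptions and are not taken from the paper; brute-force enumeration is only feasible for toy graphs.

```python
import itertools
import math

def score(labels, obs, w_unary, w_pair):
    """Weighted sum of (hypothetical) feature functions over all cliques of a chain."""
    # Unary cliques: +1 if the label agrees with the thresholded observation, else -1.
    s = sum(w_unary * (1.0 if y == int(m > 0.5) else -1.0)
            for y, m in zip(labels, obs))
    # Pairwise cliques: +1 if two neighbouring labels are equal, else -1.
    s += sum(w_pair * (1.0 if a == b else -1.0)
             for a, b in zip(labels, labels[1:]))
    return s

def posterior(labels, obs, w_unary=1.0, w_pair=0.5):
    """P(Y | M) = exp(score) / Z(M), with Z(M) summed over all binary labelings."""
    z = sum(math.exp(score(y, obs, w_unary, w_pair))
            for y in itertools.product((0, 1), repeat=len(obs)))
    return math.exp(score(labels, obs, w_unary, w_pair)) / z

obs = [0.9, 0.8, 0.2, 0.1]
probs = {y: posterior(y, obs)
         for y in itertools.product((0, 1), repeat=len(obs))}
best = max(probs, key=probs.get)  # most probable labeling under this toy model
```

Because $Z(M)$ sums over all $2^n$ labelings, the probabilities in `probs` sum to one; practical CRFs replace the enumeration with approximate inference.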
Two-dimensional CRFs have been applied to many computer vision problems, such as segmentation and classification. In particular, because of the undirected structure of most images, the 2D CRF leads to efficient performance in computer vision [7,22,23]. Although early CRFs incorporate spatial relationships (spatial feature functions) amongst random variables into the model, these relationships repeat sequentially in many applications such as visual tracking, where incorporating this property into the framework can lead to better modeling.
Feature functions play an important role in the context of CRF modeling. Selecting appropriate feature functions speeds up the convergence of the CRF training process, whereas inappropriate feature functions can cause inconsistent results in CRF inference. To illustrate the importance of selecting appropriate feature functions for object silhouette tracking, we train a CRF for predicting the object silhouette at one frame based on the previous frame using only spatial feature functions, without incorporating any feature function describing the temporal relationship amongst frames. The two frames contain a simulated object that moves slightly between them. As seen in Fig 1, the prediction result of the object silhouette is poor as the CRF could not learn object motion dynamics in the absence of temporal feature functions, leading to poor object silhouette tracking performance.
To tackle this issue of selecting appropriate feature functions to improve tracking performance, Shafiee et al. [13] proposed the incorporation of temporal feature functions such as inter-frame optical flow into the CRF modeling framework to better take advantage of temporal relationships for visual tracking. Although this approach showed promising results and illustrated the feasibility of temporal processing for visual tracking in the CRF modeling framework, it only makes use of motion information from the previous frame to estimate object position in the current frame and as such cannot handle large changes in motion dynamics or shape over time, nor can it handle accelerated motion dynamics. Furthermore, it is designed for object position tracking and does not handle object silhouette tracking. Therefore, motivated by the benefits of incorporating both spatial and temporal context in a dynamic fashion in a manner that addresses the aforementioned issues, we propose a deep-structured CRF (DS-CRF) model for object silhouette tracking, where the series of interconnected state layers making up the model, along with the set of corresponding temporal observations, allow for better modeling of the more complex motion and shape dynamics that can occur in realistic scenarios.

Deep-structured Conditional Random Fields
Here, we will describe the proposed DS-CRF model in detail as follows. First, the graph representation of the DS-CRF model is presented. Second, the manner in which inter-layer connectivity within the DS-CRF model is established dynamically based on motion information derived from temporal observations is presented. Third, a set of new feature functions incorporated in the DS-CRF model for object silhouette tracking is presented.

Graph Representation
Let the graph $G(V, E)$ represent the proposed DS-CRF model, which consists of several state layers $Y_{t:1}$ corresponding to times $t{:}1$, as shown in Fig 3. Each state layer characterizes the object silhouette at a specific time step by modeling the conditional probability of $Y_t$ given the temporal observations and the preceding state layers:

$$P(Y_t \mid M_{t:1}, Y_{t-1:1}) = \frac{1}{Z(M_{t:1})} \exp\Big( \sum_{c \in C} \sum_{k} \lambda_{k,c}\, f_{k,c}(\cdot) \Big)$$

where $Z(M_{t:1})$ is a normalization constant, $C$ is the set of inter-layer and intra-layer cliques, $\lambda_{k,c}$ determines the weight of each feature function, and $f_{k,c}(\cdot)$ denotes the feature function over clique $c$. The intra-layer connectivity between nodes in each layer (i.e., $e^{t}_{ij}$ in layer $Y_t$, Fig 3) imposes the smoothness property of the target object on the model, while the inter-layer connectivity between two adjacent state layers (i.e., $e^{t,t-1}_{k}$ for layers $Y_t$ and $Y_{t-1}$ corresponding to node $y_k$) incorporates object motion dynamics into the model. As such, the inter-layer connectivity carries the energy corresponding to the unary potential in the model, and is specified dynamically and adaptively in the proposed framework based on motion information derived from temporal observations, which will be described in detail in the next section.
Fig 1. Example of CRF modeling of object motion using only spatial feature functions for object silhouette tracking. The first column (A) shows the temporal observation and the second column (B) shows the label used for training (the two rows are two consecutive frames at $t-1$ and $t$). The third column (C) shows the prediction result of the object silhouette. As seen, the prediction result of the object silhouette is poor since the CRF could not learn object motion dynamics in the absence of temporal feature functions, leading to poor object silhouette tracking performance.
To reduce computational complexity, the implementation of the DS-CRF model in this work makes use of the three previous frames as observations in the modeling of the conditional probability. The last three frames are chosen because they provide sufficient information to reasonably model accelerated motions, given that both acceleration and velocity can be computed, thus allowing for the handling of various motion situations in short time steps.

Motion-guided Inter-layer Connectivity
As discussed above, the intra-layer connectivity between nodes in a layer incorporates spatial context, while the inter-layer connectivity between layers incorporates temporal context into the DS-CRF model. The simplest approach to establishing inter-layer connectivity between nodes from different layers would be to simply create inter-layer cliques between nodes that represent the same spatial location at two different time steps. This creates a simple regular spatio-temporal lattice that is fixed across time. However, this is not appropriate for object silhouette tracking: temporal neighbors established under such a fixed inter-layer connectivity structure would not share relevant information, since target objects undergo drastic motion and shape changes over time, and thus the feature functions under such a structure would hold little meaning. Therefore, we are motivated to establish the inter-layer connectivity in the proposed DS-CRF model in a dynamic and adaptive manner, where motion information derived from the temporal observations is used to determine the inter-layer cliques at each state layer.
In this work, we dynamically determine the inter-layer cliques of each node at each layer $Y_t$ of the DS-CRF model based on the velocity obtained by inter-frame optical flow computed from two consecutive temporal observations $M_t$ and $M_{t-1}$:

$$y_{c,t} = \{\, y_{i,t},\; y_{k,t-1} \,\}$$

where $y_{c,t}$ is an inter-layer clique at time $t$, $y_{i,t}$ is a node at time $t$, and $y_{k,t-1}$ is its neighbor node at time $t-1$ based on the inter-layer clique connectivity. $v_x$ and $v_y$ encode the velocities in the $x$ and $y$ directions, and node $k$ is chosen such that its $x$ coordinate $k_x$ is consistent with that of node $i$ (i.e., $i_x$) under $v_x$, and likewise for the $y$ direction under $v_y$. An illustrative example of the motion-guided inter-layer connectivity strategy is shown in Fig 4, where the inter-layer clique structures are established at $Y_t$ and $Y_{t-1}$ based on the inter-frame optical flow between temporal observations $M_t$ and $M_{t-1}$. It can be seen that the nodes corresponding to the target object (indicated here as gray nodes, e.g., $y_{i,t}$, $y_{l,t}$, and $y_{k,t}$) form inter-layer clique structures with nodes from the previous state layer that characterize different spatial locations due to motion, while nodes corresponding to the background (indicated by white nodes, e.g., $y_{j,t}$) form inter-layer cliques with nodes from the previous state layer corresponding to the same spatial location, since there is no motion at that position. As such, this motion-guided dynamic inter-layer connectivity strategy allows for better characterization of the temporal context of the object silhouette being tracked and allows the feature functions to hold meaning.
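The motion-guided connectivity described above can be sketched as a lookup from each node at time $t$ to its temporal neighbor at time $t-1$. This is a minimal illustration under stated assumptions: the function name `inter_layer_neighbors`, the sign convention (a pixel now at column `c` came from column `c - vx`), the rounding to the nearest pixel, and the clamping at the image border are all choices made for this sketch and are not specified in the text.

```python
def inter_layer_neighbors(vx, vy):
    """Map each node (row, col) at time t to its temporal neighbour at time t-1.

    vx[r][c], vy[r][c]: per-pixel displacement (in pixels) from frame t-1 to
    frame t. Assumed convention: the pixel now at (r, c) came from
    (r - vy, c - vx) in the previous frame.
    """
    rows, cols = len(vx), len(vx[0])
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            kr = r - int(round(vy[r][c]))
            kc = c - int(round(vx[r][c]))
            # Clamp to the image so every node keeps a valid temporal neighbour.
            kr = min(max(kr, 0), rows - 1)
            kc = min(max(kc, 0), cols - 1)
            neighbors[(r, c)] = (kr, kc)
    return neighbors

# Toy flow field: everything is static except one pixel that moved right by one.
rows, cols = 3, 4
vx = [[0.0] * cols for _ in range(rows)]
vy = [[0.0] * cols for _ in range(rows)]
vx[1][2] = 1.0
neighbors = inter_layer_neighbors(vx, vy)
```

Static (background) nodes are linked to the same spatial location in the previous layer, while the moving node is linked to the location it came from, mirroring the gray and white nodes in the description of Fig 4.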

Feature Functions
In addition to the inter-layer connectivity between state layers, it is important to also describe the feature functions incorporated into the proposed DS-CRF model. The feature functions are: i) optical flow, ii) target appearance, iii) spatial coherency, and iv) edge.
Optical Flow. This crucial feature function is described by the velocity of each pixel in the $x$ and $y$ directions between two adjacent frames and is estimated via inter-frame optical flow. Optical flow is an approximation of motion based upon local derivatives in a given sequence of images [24]. It specifies the distance moved by each pixel between two adjacent images, under the assumption that

$$I(x, y, t) = I(x + \delta x,\; y + \delta y,\; t + \delta t),$$

which, to first order, gives the constraint

$$I_x v_x + I_y v_y = -I_t$$

where $(I_x, I_y)$ denotes the spatial intensity gradient, $I_t$ denotes the temporal intensity derivative, and $v_x$ and $v_y$ denote motion in the two directions (in this implementation, $\delta t = 1$). Optical flow assumes that the change in a pixel's intensity corresponds to the displacement of pixels [13]. Here, inter-frame optical flow is applied between two temporal observations.
Target Appearance. The model utilizes simple unary appearance feature functions based on features describing the target object's appearance, including RGB color and the target appearance in the previous frame. To obtain this unary feature function, the label state at time $t-1$ is shifted by the computed velocity and the corresponding value is found for each node, where $S(\cdot)$ denotes the operator that shifts $Y_{t-1}$ based on the velocities $v_x$ and $v_y$.
Spatial Coherency. Each target in the scene has spatial color coherency. This term reflects the relationship between neighboring nodes in the image: each node belonging to a target is strongly related to the other nodes corresponding to the target silhouette. In other words, the target appearance is coherent in each time frame. By adding this feature function to the DS-CRF tracking framework, the proposed algorithm can track the target object's silhouette despite large changes over time. A rough segmentation algorithm [25] enforces label consistency among the nodes within each segment produced by segmenting frame $M_t$.
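As an illustration of how the optical flow constraint $I_x v_x + I_y v_y = -I_t$ yields a velocity estimate, the sketch below solves it in the least-squares sense for a single global translation between two frames. The text does not state which optical flow estimator is used, so this Lucas-Kanade-style solve over the whole image, the function name `estimate_translation`, and the synthetic test frames are assumptions made for illustration only.

```python
import math

def estimate_translation(frame0, frame1):
    """Least-squares solve of Ix*vx + Iy*vy = -It over the interior pixels."""
    rows, cols = len(frame0), len(frame0[0])
    sxx = sxy = syy = sxt = syt = 0.0
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Central differences for the spatial gradient, frame difference for It.
            ix = (frame0[r][c + 1] - frame0[r][c - 1]) / 2.0
            iy = (frame0[r + 1][c] - frame0[r - 1][c]) / 2.0
            it = frame1[r][c] - frame0[r][c]
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("gradient structure too weak to estimate motion")
    # Normal equations: [[sxx, sxy], [sxy, syy]] @ [vx, vy] = [-sxt, -syt].
    vx = (-sxt * syy + syt * sxy) / det
    vy = (-syt * sxx + sxt * sxy) / det
    return vx, vy

def make_frame(dx, dy, size=40):
    """Smooth synthetic image shifted by (dx, dy) pixels (x = column, y = row)."""
    return [[math.sin(0.2 * (c - dx)) + math.cos(0.15 * (r - dy))
             for c in range(size)] for r in range(size)]

frame0 = make_frame(0.0, 0.0)
frame1 = make_frame(0.3, 0.2)   # true motion: 0.3 px in x, 0.2 px in y
vx, vy = estimate_translation(frame0, frame1)
```

The linearization only holds for small displacements, which is consistent with the later remark in the Discussion that large displacements within a short time may need additional motion cues.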
Edge. The Ising model is the ordinary edge energy function utilized in many different problems, which we incorporate into the model as a spatial smoothness feature function:

$$f(y_{i,t}, y_{j,t}, M_t) = \mu(y_{i,t}, y_{j,t}) \cdot (m_{i,t} - m_{j,t}) \qquad (8)$$

where $\mu(\cdot)$ is the penalty function based on the similarity of the two nodes $i$ and $j$.
Fig 4. Inter-layer connectivity. The inter-layer connectivity between nodes in two adjacent layers is determined based on the motion information computed from two consecutive frames. The inter-layer cliques are constructed dynamically and adaptively based on the motion information corresponding to each node in layer $Y_t$. As seen, the temporal neighbor of node $y_{i,t}$ is $y_{k,t-1}$, based on the inter-layer clique structure determined using inter-frame optical flow. The gray color corresponds to nodes associated with the target object, where there is movement in the previous frame.

Training and Inference
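Maximum likelihood is a common method for estimating the parameters of CRFs. As such, the training of the proposed DS-CRF is done by maximizing the log-likelihood $\ell$ over the training data:

$$\ell(\lambda) = \sum_{n} \log P\big(Y^{(n)} \mid M^{(n)}; \lambda\big)$$

where $n$ indexes the training examples. Because the log-likelihood function $\ell(\lambda)$ is concave, the parameters $\lambda$ can be chosen such that the global maximum is obtained, where the gradient (the vector of partial derivatives with respect to each parameter $\lambda_k$) becomes zero. Differentiating $\ell(\lambda)$ with respect to the parameter $\lambda_k$ gives:

$$\frac{\partial \ell}{\partial \lambda_k} = \sum_{n} \Big( \sum_{c \in C} f_k\big(y_c^{(n)}, M^{(n)}\big) - \mathbb{E}_{Y \sim P(\cdot \mid M^{(n)}; \lambda)} \Big[ \sum_{c \in C} f_k\big(y_c, M^{(n)}\big) \Big] \Big)$$

A closed-form solution does not exist; therefore, the parameters are determined iteratively using gradient descent optimization. Our DS-CRF training is performed via the belief propagation method [26]. After the training of the DS-CRF, inference is performed by evaluating the probability of each random variable in the represented graph given the observations $M_{t-2:t-1}$ and $Y_{t-1}$, while decoding is performed by assigning to the output variables $Y_t$ the states with maximum probability:

$$P\big(y_{i,t} \mid M_{t-2:t-1}, Y_{t-1}\big) \quad \forall\, y_{i,t} \in Y_t \qquad (11)$$

$$Y_t^{*} = \arg\max_{Y_t} P\big(Y_t \mid M_{t-2:t-1}, Y_{t-1}\big) \qquad (12)$$

where Eqs (11) and (12) show the formal definition of the inference and decoding process, respectively.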
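The training and decoding steps can be illustrated on a toy chain CRF. The sketch below is not the authors' implementation: it replaces belief propagation [26] with brute-force enumeration of all labelings (only feasible for a handful of nodes), uses plain gradient ascent on the log-likelihood, and relies on two hypothetical feature functions and a made-up two-example training set. It does show the key structure of the gradient, namely observed feature counts minus their expectation under the current model.

```python
import itertools
import math

def features(labels, obs):
    """Hypothetical global feature vector: (label/observation agreement, neighbour agreement)."""
    unary = sum(1.0 if y == int(m > 0.5) else -1.0 for y, m in zip(labels, obs))
    pair = sum(1.0 if a == b else -1.0 for a, b in zip(labels, labels[1:]))
    return (unary, pair)

def all_labelings(n):
    return list(itertools.product((0, 1), repeat=n))

def log_likelihood_and_grad(weights, data):
    """Return the log-likelihood and its gradient (observed minus expected features)."""
    ll, grad = 0.0, [0.0, 0.0]
    for labels, obs in data:
        feats = [features(y, obs) for y in all_labelings(len(obs))]
        scores = [sum(w * f for w, f in zip(weights, fv)) for fv in feats]
        log_z = math.log(sum(math.exp(s) for s in scores))
        observed = features(labels, obs)
        ll += sum(w * f for w, f in zip(weights, observed)) - log_z
        for k in range(2):
            expected = sum(math.exp(s - log_z) * fv[k] for s, fv in zip(scores, feats))
            grad[k] += observed[k] - expected
    return ll, grad

def train(data, steps=50, lr=0.1):
    """Plain gradient ascent on the concave log-likelihood."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        _, grad = log_likelihood_and_grad(weights, data)
        weights = [w + lr * g for w, g in zip(weights, grad)]
    return weights

def decode(weights, obs):
    """Decoding: the labeling with maximum probability (argmax of the score)."""
    return max(all_labelings(len(obs)),
               key=lambda y: sum(w * f for w, f in zip(weights, features(y, obs))))

data = [((1, 1, 0, 0), [0.9, 0.7, 0.3, 0.1]),
        ((0, 0, 1, 1), [0.2, 0.4, 0.8, 0.6])]
ll_start, _ = log_likelihood_and_grad([0.0, 0.0], data)
weights = train(data)
ll_end, _ = log_likelihood_and_grad(weights, data)
decoded = decode(weights, [0.9, 0.7, 0.3, 0.1])
```

Since the partition function does not depend on the labeling, decoding only needs the unnormalized score, which is why `decode` never computes $Z$.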

DS-CRF Tracking Framework
Based on the DS-CRF model described above, we can then develop a state-based framework for tracking object silhouettes across time in a video sequence as follows. The first two frames are annotated by the user as initialization. The velocity is computed based on these two frames, and the DS-CRF starts the tracking procedure at the third frame; that is, the DS-CRF tracks objects automatically after frame 2. At each step, optical flow is computed on the two most recent frames. Since the optical flow is computed for each time frame and the parameters are trained based on the velocity, the model needs to be trained only once.
The DS-CRF essentially plays the role of fusing spatial context, such as target object shape and appearance, with temporal context, such as motion dynamics, within the proposed tracking framework. The contribution of each aspect of the spatial and temporal information within the DS-CRF model is determined by the weights learned during the training step. The inference process described in Eq 11 is then performed based on the learned DS-CRF.
To reduce the computational complexity of the proposed DS-CRF tracking framework, the decoding result in each step (see Eq 12) is utilized to obtain the object silhouette. The decoding result consists of a binary label field $Y$ (i.e., each pixel has a value $y \in \{0, 1\}$, where 0 indicates object silhouette pixels and 1 indicates background pixels). An example of a temporal observation (i.e., video frame) and its corresponding binary label field is shown in Fig 5. The use of a binary label setup not only reduces the computational complexity of the training process, but also aids its convergence.
One issue that needs to be tackled when using a binary label setup is that, while well suited for single-target object silhouette tracking, it is less appropriate for multi-target object silhouette tracking. To address this issue, we introduce a data association procedure where connected components in the binary label field are assigned to the target objects being tracked. This is accomplished by matching the object silhouettes determined for the previous time step to the connected components in the binary field at the current time step to determine the best template matches:

$$T_j(t) = \arg\max_{c \in C} \; M\big(T_j(t-1),\, c\big)$$

where $T_j(t)$ is target $j$'s silhouette at time $t$, $C$ is the set of connected components detected as targets, and $M(\cdot)$ is a template matching function that evaluates the similarity of two input silhouettes.
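The data association step can be sketched as follows. The text does not specify the template matching function $M(\cdot)$, so intersection-over-union is used here purely as a stand-in similarity; the 4-connectivity, the function names, and the toy label field are likewise assumptions. The `object_label=0` default follows the label convention stated above (0 for silhouette pixels, 1 for background).

```python
from collections import deque

def connected_components(field, object_label=0):
    """Return one set of (row, col) pixels per 4-connected object component."""
    rows, cols = len(field), len(field[0])
    seen, components = set(), []
    for r in range(rows):
        for c in range(cols):
            if field[r][c] != object_label or (r, c) in seen:
                continue
            comp, queue = set(), deque([(r, c)])
            seen.add((r, c))
            while queue:  # breadth-first flood fill
                pr, pc = queue.popleft()
                comp.add((pr, pc))
                for nr, nc in ((pr + 1, pc), (pr - 1, pc), (pr, pc + 1), (pr, pc - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and field[nr][nc] == object_label
                            and (nr, nc) not in seen):
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            components.append(comp)
    return components

def iou(a, b):
    """Stand-in similarity between two silhouettes given as pixel sets."""
    return len(a & b) / len(a | b)

def associate(prev_silhouettes, field):
    """Assign to each tracked target the component that best matches its previous silhouette."""
    comps = connected_components(field)
    return {name: max(comps, key=lambda comp: iou(sil, comp))
            for name, sil in prev_silhouettes.items()}

# Binary label field at time t (0 = object, 1 = background) with two components.
field = [[1, 1, 1, 1, 1, 1],
         [1, 0, 0, 1, 1, 0],
         [1, 0, 0, 1, 1, 0],
         [1, 1, 1, 1, 1, 1]]
# Silhouettes of the two targets at time t-1 (hypothetical values).
previous = {"a": {(1, 0), (1, 1), (2, 0), (2, 1)},
            "b": {(1, 5), (2, 5), (3, 5)}}
matches = associate(previous, field)
```

Note that this greedy per-target argmax can assign the same component to two targets when silhouettes merge, so a real system would need an explicit policy for such cases.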

Results
To evaluate the performance of the proposed DS-CRF model for the purpose of object silhouette tracking, a number of different experiments were performed to allow for a better understanding and analysis of the model under different conditions and factors. First, a set of experiments involving video of a simulated object with different motion dynamics is performed to study the capability of the DS-CRF model in handling objects with changing motion dynamics. Second, a set of experiments performed on videos of humans moving within a subway station from the PETS2006 database is used to study the capability of the DS-CRF model in handling object silhouette tracking scenarios where there is occlusion and objects that change drastically in shape and size over time.

Experiment 1: Simulated object motion undergoing acceleration
In this experiment, we examine the ability of the proposed DS-CRF method to track the silhouette of an object with different motion dynamics over time. To accomplish this, we produce three video sequences consisting of a simulated object undergoing the following motion dynamics over time:
• Motion1: Object undergoes acceleration but remains constant in shape over time.
• Motion2: Object undergoes size change over time but moves at constant velocity.
• Motion3: Object undergoes acceleration as well as size change over time.
Sample frames from the video sequences Motion1, Motion2, and Motion3 are shown in the first rows of Fig 6(a), 6(b) and 6(c), respectively. The proposed method was then used to predict the object silhouette over time based on these video sequences. The predicted results are shown in the second rows of Fig 6(a), 6(b) and 6(c), respectively. It can be observed that the proposed method is able to provide accurate object silhouette tracking results for all three video sequences, thus illustrating its ability to handle uncertainties in both motion and object appearance over time.

Experiment 2: Real-life video of human targets
In this experiment, we examine the capability of the DS-CRF model in handling object silhouette tracking scenarios where there is occlusion and objects that change drastically in shape and size over time. To accomplish this, we made use of three different video sequences from the PETS2006 database depicting human targets moving within a subway station (one of which is used for evaluation in [15]), each used to illustrate different aspects of the capability of the proposed method:
• Subway1: This sequence is used to illustrate the capability of the proposed method in handling single object silhouette tracking over time. The object target in this sequence is crossing the hallway from the top of the scene to the bottom of the scene.
• Subway2: This sequence is used to illustrate the capability of the proposed method in handling object occlusions. The object target in this sequence is crossing the hallway from the right of the scene to the left of the scene, and becomes occluded by a person walking from the left of the scene to the right of the scene.
• Subway3: This sequence is used to illustrate the capability of the proposed method in handling multiple object silhouette tracking over time. Two of the target objects in this sequence are crossing the hallway from the bottom of the scene to the top of the scene, while a third object target is crossing the hallway from the top of the scene to the bottom of the scene.
The PETS2006 database is a public dataset which is available from http://www.cvg.reading.ac.uk/PETS2006/data.html. It is worth noting, however, that the proposed method can be replicated on any video tracking dataset.
To provide a comparison for the performance of the proposed method, four different existing tracking methods are also evaluated:
• Mean-shift tracking [27]: Mean-shift tracking is based on non-parametric feature space analysis, where the goal is to determine the maxima of a density function, which in the case of visual tracking is based on the color histogram of the target object. This goal is achieved via an iterative optimization strategy that locates the new target object position near the previous object position based on a similarity measure such as the Bhattacharyya distance.
• Context tracking [28]: Context tracking is a discriminative tracking approach which utilizes a specific trained detector in a semi-supervised fashion to locate the target in consecutive frames. The goal of this method is to locate all possible regions that look similar to the target. Context tracking then identifies and differentiates the target object from the 'distracters' within the set of possible regions based on a confidence measure derived from the posterior probability and supporting features.
• Boosted particle filtering [6]: Particle filtering is a discriminative tracking approach that approximates the posterior $P(Y_t \mid M_{0:t})$ with a Dirac measure using a finite set of $N$ particles $\{Y_t^i\}_{i=1 \ldots N}$. The sample candidate particles are drawn based on the proposal distribution. The importance weight of each particle is then updated according to its previous weight and the importance function, which is often the transition prior. After that, the particles are resampled using their importance weights. Here, we employed the boosted particle filter proposed in [6], which incorporates mixture particle filtering [29] that is ideally suited to multi-target tracking.
• Visual Silhouette Tracker [15]: The visual silhouette tracking method fuses different visual cues by means of conditional random fields. The object silhouette is estimated in every frame according to visual cues including temporal color similarity, spatial color continuity, and spatial motion continuity. The incorporated energy functions are minimized within a conditional random field framework.
Note that the mean-shift tracking and context tracking methods are only evaluated for the Subway1 and Subway2 sequences, as the implementations used were not designed for tracking multiple object targets within the same scene. The visual silhouette tracking method was only compared for the Subway1 sequence, as only the object silhouette results for that sequence were provided by the authors of [15]. Finally, the boosted particle filtering and proposed DS-CRF methods were evaluated for all three sequences (Subway1, Subway2, and Subway3).
To compare the methods quantitatively, the number of frames in which the tracker tracks the object correctly, divided by the total number of frames in the sequence, is reported as the accuracy:

$$\text{Accuracy} = \frac{\text{Number of Correctly Tracked Frames}}{\text{Total Number of Frames}} \times 100 \qquad (14)$$

Table 1 shows the quantitative results for the Subway1 and Subway2 sequences, while Table 2 presents the results corresponding to the Subway3 sequence. First, let us examine the performance of the proposed DS-CRF method in the situation where the object being tracked changes significantly in size and shape over time. Fig 7 shows the single-object silhouette tracking results of the tested tracking methods for the Subway1 sequence. It can be observed that while the mean-shift tracking, context tracking, and boosted particle filtering methods lose the object target completely, both the visual silhouette tracking method and the proposed DS-CRF method are able to track the object silhouette all the way through. It can also be observed that the object silhouette obtained using the proposed method is more accurate than that obtained using the visual silhouette method. These results illustrate the capability of the proposed DS-CRF method in tracking the object silhouette over time in spite of drastic changes in size and shape over time.
Next, let us examine the performance of the proposed DS-CRF method in the situation where the object being tracked undergoes occlusion by other objects over time. Fig 8 shows the single-object object silhouette tracking results of the tested tracking methods for the Subway2 sequence. It can be observed that while the mean-shift tracking, context tracking, and boosted particle filtering methods lose the object target completely, the proposed DS-CRF method is able to track the object silhouette all the way through despite being occluded by another person. These results illustrate the capability of the proposed DS-CRF method in tracking the object silhouette over time in spite of object occlusion.
Finally, let us examine the performance of the proposed DS-CRF method in the situation where we wish to track multiple object silhouettes over time. Fig 9 shows the multiple-object silhouette tracking results of the tested tracking methods for the Subway3 sequence. It can be observed that while the boosted particle filtering method is able to track two of the three object targets completely, it loses one of the object targets as a result of it crossing paths with one of the other object targets. Furthermore, the boosted particle filtering method does not provide pixel-level object silhouettes and is able to only track bounding boxes. On the other hand, the proposed DS-CRF method is able to track all three of the object silhouettes at the pixel level all the way through. These results illustrate the capability of the proposed DS-CRF method in tracking multiple object silhouettes over time in a reliable manner.
Table 1. Quantitative results for different video sequences. The accuracy of the Visual Silhouette Tracker method is reported for one sequence since only one sequence of results has been provided by the authors of [15]. MST, CT, VST, and BPF refer to Mean-shift tracking [27], Context tracking [28], Visual Silhouette Tracker [15], and Boosted Particle Filtering [6], respectively.
Fig 7. Example tracking results for Subway1. It can be observed that while the mean-shift tracking, context tracking, and boosted particle filtering methods lose the object target completely, both the visual silhouette tracking method and the proposed DS-CRF method are able to track the object silhouette all the way through. It can also be observed that the object silhouette obtained using the proposed method is more accurate than that obtained using the visual silhouette method. The original results can be found in [35].

Discussion
Here, we proposed a deep-structured conditional random field (DS-CRF) model for object silhouette tracking. In this model, a series of state layers are used to characterize the object silhouette at all points in time within a video sequence. Connectivity between state layers, formed dynamically based on inter-frame optical flow, allows for interactions between adjacent state layers and facilitates the utilization of both spatial and temporal context within a deep-structured probabilistic graphical model. Experimental results showed that the proposed DS-CRF model can be used to facilitate accurate and efficient pixel-level tracking of object silhouettes that can change greatly over time, as well as under different situations such as occlusion and multiple targets within the scene. Experimental results using both simulated data and real-world video datasets containing different scenarios demonstrated the capability of the proposed DS-CRF approach to provide strong object silhouette tracking performance when compared to existing tracking methods.
One of the main contributing factors to the proposed method's ability to handle uncertainties in object motion dynamics and size and shape changes over time is the way the inter-layer connectivity is established dynamically based on inter-frame optical flow information. If the inter-layer connectivity were established statically at all state layers of the deep-structured model, then the feature functions would hold little meaningful relationship in the temporal domain as the object accelerates and changes size over time. By making use of inter-frame optical flow information to determine inter-layer connectivity between adjacent state layers, the feature functions maintain meaning over time in guiding the prediction process.
Another important contributing factor is the incorporation of object shape feature functions (spatial coherency), which forces the proposed method to consider object shape variations in time and also aids in the handling of changes in size and shape over time.
Fig 9. Example tracking results for Subway3. It can be observed that while the boosted particle filtering method is able to track two of the three object targets completely, it loses one of the object targets as a result of it crossing paths with one of the other object targets. Furthermore, the boosted particle filtering method does not provide pixel-level object silhouettes and is able to only track bounding boxes. On the other hand, the proposed DS-CRF method is able to track all three of the object silhouettes at the pixel level all the way through. The original results can be found in [35].
Future work involves extending the proposed DS-CRF model to incorporate not only inter-frame optical flow information, but also additional motion information via descriptor matching to better guide the establishment of inter-layer connectivity in the situation of large object displacements within a short time in the video sequence. Furthermore, we aim to explore the extension of the DS-CRF model with high-order and fully-connected clique structures [30] to improve the modeling of spatial relationships for better object silhouette boundaries. Finally, we aim to explore the application of the proposed DS-CRF model for the purpose of improved video saliency detection using texture distinctiveness-based feature functions [31][32][33] and improved content-based video retargeting using energy gradient feature functions [34].