A Part-Based Probabilistic Model for Object Detection with Occlusion

Part-based methods are a fast-rising framework for object detection. They are attracting increasing attention for their detection precision and partial robustness to occlusion. However, little research has focused on occlusions that overlap the part regions, which can reduce the performance of the system. This paper proposes a part-based probabilistic model and a corresponding inference algorithm for the part occlusion problem. The model is built entirely on Bayesian theory and aims to be robust to large occlusions. In the model construction stage, the parts constitute the vertex set of a fully connected graph, and a binary variable is assigned to each part to indicate its occlusion status. In addition, we introduce a penalty term to regularize the argument space of the objective function. Part detection is thus formulated as an optimization problem, which is divided into two alternating procedures: the outer inference and the inner inference. A stochastic tentative method is employed in the outer inference to determine the occlusion status of each part. In the inner inference, the gradient descent algorithm is employed to find the optimal positions of the parts given the current occlusion status. Experiments were carried out on the Caltech database. The results demonstrate that the proposed method achieves strong robustness to occlusion.


Introduction
Object detection [1] is a classical problem in computer vision. Among the numerous methods for object detection, statistical approaches [2][3][4][5][6][7][8][9][10][11][12][13] have become mainstream. They discriminate the given object from others by learning a model of it, and hence achieve more robust detection. As a typical statistical method, the part-based model has attracted increasing attention in the past decade [5][6][7][8][9][10][11][14][15][16][17][18][19]. As the name implies, a part is a local area of the object. Part-based models capture both the local appearance and the spatial structure of the object simultaneously, which makes them robust, to some extent, to variations in object pose and appearance. Like other statistical methods, part-based methods involve two fundamental problems: training and detection. The former refers to modeling the part appearance and the spatial relationship among the parts. The latter refers to the optimization problem of acquiring information about the parts (such as their positions and occlusion status). It is worth emphasizing that most part-based methods attend to the part areas only. They are therefore robust only to occlusions that do not overlap the part areas (as illustrated in Figure 1(a)). If the occlusions overlap the part areas, as illustrated in Figure 1(b), not only are the occluded parts affected, but the non-occluded ones are also shifted from their correct positions because of the spatial relationship among the parts (refer to the experiments below for more details).
The bag model considers almost no spatial relationship among the parts. Other part-based models improve detection performance by adding the spatial relationship among the parts. In particular, the pictorial structure model and the k-fan model perform fast detection via dynamic programming [20,21], since the introduction of the generalized distance transform (GDT) [22] greatly reduced the time complexity of dynamic programming. However, they do not allow cycles in the spatial relationship [20]. Many solutions to this problem have been proposed [23][24][25][26][27], of which the simplest is the gradient descent algorithm (GD).
Although the above part-based models belong to the category of sparse representation, they still suffer when the part areas are shaded (as demonstrated in Figure 3). In this paper, these shaded parts are called disabled parts. Some methods [19] employ a part appearance representation that is robust to occlusion, but the robustness is limited, especially when the occlusion region is large. In the occlusion literature, removing the occluded parts from the model is the most intuitive way to solve the occlusion problem [8,14,28,29]. In general, this is achieved by estimating a mask for the test image. However, the mask variable is difficult to model in the objective function precisely enough for accurate inference, and it does not work in the case of large occlusion. Papandreou et al. proposed a robust objective function [30] that weakens the role of the occluded parts, but it is difficult to completely eliminate the influence of the occluded areas. Li et al. [31] addressed the occlusion problem under a novel RANSAC framework; however, it is difficult to incorporate the spatial relationship into that framework to improve the detection of the keypoints. In [32,33], the authors addressed the occlusion problem under a sparse framework, which is usually used for batch image processing.
This paper applies the part-based method to object detection, with special emphasis on occlusion handling, and proposes a part-based probabilistic model with an alternating detection scheme. To increase detection accuracy, we constructed a fully connected graph to describe the spatial relationship among the parts in the training stage. Each edge is represented by a 2D Gaussian distribution over the vector difference of the position coordinates. Moreover, we introduced a penalty term into the objective function to obtain more accurate detection results. For the occlusion problem, we assigned a binary status variable to each part to indicate whether it is occluded, and proposed a method to model the prior probability of the occlusion status variable. Then, according to Bayesian theory, we constructed a new posterior probability as the objective function. In the detection stage, we designed two alternating procedures, named the inner inference and the outer inference. The inner inference uses GD to determine the positions of the parts given the current occlusion status variable. The outer inference determines the occlusion status of the parts according to their current positions. To perform detection efficiently, we adopted a stochastic tentative method in the outer inference. In addition, we incorporated a validity test mechanism into the detection procedure to reject invalid inner inference results.

Methods
Consider a model with $n$ parts $V=\{v_1,v_2,\ldots,v_n\}$. A detection result for a given image is expressed as $H=\{L,S\}$. The argument $L=\{l_1,l_2,\ldots,l_n\}$ is the position variable, where $l_i=\{y_i,x_i\}$ denotes the position of part $v_i$. The argument $S=\{s_1,s_2,\ldots,s_n\}$ represents the occlusion status variable, where $s_i$ is a Boolean variable ($s_i=1$ if part $v_i$ is a normal part; $s_i=0$ if it is a disabled part). We define the objective function (Eq. 1) as the posterior probability of a result $H$ given a test image $I$, in which $\exp(Q(H))$ is a penalty term and $Q(H)$ is defined in the following subsection.

Construction of the model
In Eq. 1, the posterior probability contains four terms:
1. $p(I\mid H)=p(I\mid L,S)$, which represents the total matching probability of the normal parts.
2. $p(L\mid S)$, which represents the prior probability of the spatial relationship among the normal parts.
3. $p(S)$, the prior probability of the occlusion.
4. The penalty term $\exp(Q(H))$.
The total matching probability of all of the normal parts is given by Eq. 2, where $C_0$ is a constant for a test image $I$ and $g_i(I,l_i)$ is the matching probability of a single part [11]. For the prior probability of the spatial relationship $p(L\mid S)$, we employed a fully connected graph to represent the spatial relationship among the parts. However, as demonstrated in Figure 3(c), the disabled parts severely affect the detection of the normal parts because of the edges between them. We overcame this problem by discarding the edges connected to the disabled parts. In addition, the positions of the disabled parts were assumed to follow independent uniform distributions [8]. The resulting prior is given by Eq. 3, where $A$ is the number of normal parts (those with $s_i=1$), $M$ is a constant representing the number of possible positions where a part could be placed, $E^{*}$ is the edge set of the fully connected graph, and $p(l_i\mid l_j)$ is defined as $p(l_i-l_j)$, which follows a 2D Gaussian distribution.
Let $w(m,n)$ denote the conditional probability of part $v_n$ being shaded given that $v_m$ is disabled. If the mean distance $d(m,n)$ between $v_m$ and $v_n$ satisfies $d(m,n)<2c_o$ ($c_o$ is a constant), we have proved the relation in Eq. 4 (refer to Appendix S1 for the derivation). The prior probability $p(S)$ is then calculated as in Eq. 5, where the pairwise term (Eq. 6), which includes the case $s_i=1$ and $s_j=1$, represents the joint probability of the occlusion status of the part pair $(v_i,v_j)$, $p_t$ is a constant standing for the probability of a part being present, and $U$ is a normalization constant.
The penalty term $Q(H)$ aims to improve the detection results by emphasizing the weak parts, and also to regularize the argument space of the objective function. It is defined in Eq. 7, where $V_S$ is the set of parts whose occlusion status is 1. Substituting Eqs. 2, 3, 5 and 7 into Eq. 1, the posterior probability can be rewritten as Eq. 8, where $C=C_0/[U\cdot p(I)]$ is a constant for a test image $I$. Taking the negative logarithm of both sides of Eq. 8 yields Eq. 9, which is the objective function used for detection.
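As a rough illustration of how the matching term and the spatial prior turn into an energy after taking the negative logarithm, the sketch below evaluates only those two factors for a fixed $S$. It assumes the forms described in the text: a product of $g_i$ over the normal parts, a 2D Gaussian on $l_i-l_j$ for every edge joining two normal parts, and a uniform $1/M$ position prior for each disabled part. The prior $p(S)$ and the penalty $Q(H)$ are deliberately left out, so this is not the full objective. All function and parameter names are hypothetical.

```python
import math
from itertools import combinations


def gaussian2d_logpdf(d, mean, cov):
    """Log density of a 2D Gaussian evaluated at the displacement d = (dy, dx)."""
    (a, b), (c, e) = cov
    det = a * e - b * c
    ry, rx = d[0] - mean[0], d[1] - mean[1]
    # Mahalanobis term, using the closed-form inverse of a 2x2 matrix.
    maha = (e * ry * ry - (b + c) * ry * rx + a * rx * rx) / det
    return -0.5 * maha - math.log(2 * math.pi) - 0.5 * math.log(det)


def partial_energy(L, S, match_prob, edge_params, M):
    """Negative log of (matching term x spatial prior) for a fixed S.

    L            -- list of (y, x) part positions
    S            -- list of 0/1 occlusion status bits (1 = normal part)
    match_prob   -- match_prob[i](l) returns g_i(I, l) for part i (assumed callable)
    edge_params  -- edge_params[(i, j)], i < j, is (mean, cov) of l_i - l_j
    M            -- number of possible positions of a part
    p(S) and Q(H) are NOT included in this sketch.
    """
    energy = 0.0
    for i, s in enumerate(S):
        if s == 1:
            energy -= math.log(match_prob[i](L[i]))
        else:
            energy += math.log(M)  # uniform 1/M prior on a disabled part
    for i, j in combinations(range(len(L)), 2):
        if S[i] == 1 and S[j] == 1:  # edges touching disabled parts are discarded
            mean, cov = edge_params[(i, j)]
            d = (L[i][0] - L[j][0], L[i][1] - L[j][1])
            energy -= gaussian2d_logpdf(d, mean, cov)
    return energy
```

Discarding an edge simply means its Gaussian term never enters the sum, which is why a badly matched disabled part can no longer pull its neighbours away from their correct positions.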

Detection
In the detection step, we look for an optimal detection result $H^{*}=\{L^{*},S^{*}\}$ with minimum energy (Eq. 10). In this paper, we adopted an alternating optimization strategy to solve this problem. The basic idea is to let $S$ move through the space of occlusion statuses (the outer inference), and after each movement of $S$, the inner inference searches for the current optimal part positions $L^{*S}$. These two procedures are performed alternately until the termination conditions are satisfied.
Inner inference and outer inference. In the inner inference, given the current status vector $S=\{s_1,\ldots,s_n\}$, Eq. 9 can be expressed as Eq. 11, where $C_0$ is a constant. Eq. 11 is the objective function of the inner inference. We used the gradient descent algorithm (GD) to search for the current optimal position variable $L^{*S}=\{l^{*S}_1,\ldots,l^{*S}_n\}$. In the outer inference, given $L^{*S}$, the objective function Eq. 9 becomes a function of $S$ only (Eq. 12). As mentioned above, the outer inference is responsible for determining the occlusion status variable. At the beginning of the outer inference, the occlusion status variable is initialized to $S=\{1,1,\ldots,1\}$, implying that no part is occluded. The aim of the outer inference is to find the next probable $S$ that reduces the value of Eq. 12.
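The inner inference can be pictured with a minimal gradient descent loop such as the one below. It is only a sketch under several assumptions: positions are treated as continuous, the gradient is approximated by central differences instead of an analytic expression, only normal parts are moved, and the step size and iteration count are arbitrary. The `energy` callable stands in for the inner objective at a fixed $S$; its name and signature are hypothetical.

```python
def numeric_gradient(energy, L, eps=1e-3):
    """Central-difference gradient of energy(L) with respect to every coordinate."""
    grad = []
    for i in range(len(L)):
        g = []
        for k in range(2):
            plus = [list(p) for p in L]
            minus = [list(p) for p in L]
            plus[i][k] += eps
            minus[i][k] -= eps
            g.append((energy(plus) - energy(minus)) / (2 * eps))
        grad.append(g)
    return grad


def inner_inference(energy, L0, S, step=0.1, iters=200):
    """Gradient descent on the positions of the normal parts for a fixed S.

    energy -- callable taking a list of [y, x] positions (assumed interface)
    L0     -- initial positions
    S      -- occlusion status bits; parts with s_i = 0 are left where they are
    """
    L = [list(p) for p in L0]
    for _ in range(iters):
        grad = numeric_gradient(energy, L)
        for i, s in enumerate(S):
            if s == 1:
                L[i][0] -= step * grad[i][0]
                L[i][1] -= step * grad[i][1]
    return [tuple(p) for p in L]
```

A fixed step size is the simplest choice; a line search or a decaying step would be natural refinements when the matching term is not smooth.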
Because the space of $S$ is discrete, we adopted a stochastic tentative method for the outer inference. In each iteration, we calculated the gradient vector $h=G(S)=\{\partial E/\partial s_1,\ldots,\partial E/\partial s_n\}$ for Eq. 12. If $(s_j=0 \wedge h(j)<0)\vee(s_j=1 \wedge h(j)\geq 0)$ holds, we consider $s_j$ a feasible descending bit. The bit $s_j$ with the minimal value of $|h(j)|$ is considered the most irresolute bit. The procedure of the outer inference is illustrated in Figure 4 and detailed in Table 1.
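The bit-selection rule can be sketched as follows, combining the definitions above with Steps 3(d) to 3(f) of Algorithm 1. The gradient vector `h` is taken as given, since how it is computed depends on Eq. 12. The function names are hypothetical, and the uniform fallback for the case where every feasible bit has zero gradient is an assumption added only to keep the sampling well defined.

```python
import random


def feasible_descending_bits(S, h):
    """Indices j with (s_j = 0 and h(j) < 0) or (s_j = 1 and h(j) >= 0)."""
    return [j for j, (s, g) in enumerate(zip(S, h))
            if (s == 0 and g < 0) or (s == 1 and g >= 0)]


def most_irresolute_bit(h):
    """Index of the bit with the smallest |h(j)|."""
    return min(range(len(h)), key=lambda j: abs(h[j]))


def choose_bit_to_invert(S, h, rng=random):
    """Pick the bit to invert in one outer-inference iteration."""
    feasible = feasible_descending_bits(S, h)
    if not feasible:
        return most_irresolute_bit(h)      # no feasible bit: invert the most irresolute one
    if len(feasible) == 1:
        return feasible[0]                 # exactly one feasible bit: invert it
    weights = [abs(h[j]) for j in feasible]
    if sum(weights) == 0:
        return rng.choice(feasible)        # assumed fallback: all gradients are zero
    # Two or more feasible bits: sample with probability proportional to |h(j)|.
    return rng.choices(feasible, weights=weights, k=1)[0]
```

Sampling rather than always taking the steepest bit is what makes the search tentative: a wrong flip can be undone in a later iteration instead of locking the search into one path.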
The validity test checks whether the inner inference has obtained a feasible result. For a two-part model, if the likelihood
$a(l_1,l_2)=g_1(I,l_1)\,g_2(I,l_2)\,p(l_2\mid l_1)$ (13)
is larger than a threshold, the two-part model is said to pass the validity test. A full-part model is said to pass the validity test if at least one of its two-part sub-models passes the validity test.
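A minimal sketch of this test is shown below. It assumes the matching probabilities and the pairwise spatial probabilities have already been evaluated at the current positions and are passed in as plain numbers; the function names, the dictionary layout and the threshold value are hypothetical.

```python
from itertools import combinations


def pair_likelihood(g_i, g_j, p_rel):
    """Eq. 13 for one part pair: g_i * g_j * p(l_j | l_i)."""
    return g_i * g_j * p_rel


def passes_validity_test(match, rel, threshold):
    """True if at least one two-part sub-model exceeds the threshold.

    match -- match[i] is g_i(I, l_i) at the current position of part i
    rel   -- rel[(i, j)], i < j, is p(l_j | l_i) at the current positions
    """
    return any(pair_likelihood(match[i], match[j], rel[(i, j)]) > threshold
               for i, j in combinations(range(len(match)), 2))
```

For example, with `match = [0.9, 0.8, 0.1]` and every pairwise probability equal to 0.5, the best pair scores 0.9 × 0.8 × 0.5 = 0.36, so the model passes with a threshold of 0.3 and fails with 0.4.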
Steps 3g to 3h constitute the inner inference procedure and keep the solutions from deviating from the correct occlusion status variable. Finally, once the output has been obtained, the positions of the disabled parts can be estimated purely from the spatial relationship among all of the parts, i.e. by minimizing Eq. 14, where $\{l_i \mid v_i\in V_{S^{*}}\}$ is the known variable, which has been obtained in Algorithm 1 (Table 1).
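One simple way to carry out such an estimate is sketched below. It rests on assumptions that may not match the paper's expression: each disabled part is estimated independently of the other disabled parts, and every edge to a normal part is a 2D Gaussian on the position difference. Under those assumptions the minimizer has a closed form, the precision-weighted mean of the positions predicted by the normal parts. All names are hypothetical.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested pairs."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))


def estimate_disabled_position(normal_positions, means, covs):
    """Estimate one disabled part from the normal parts.

    normal_positions[k] -- (y, x) of normal part k
    means[k], covs[k]   -- Gaussian mean and covariance of (l_disabled - l_k)
    Minimizes the sum of Mahalanobis distances to the predicted positions.
    """
    P = [[0.0, 0.0], [0.0, 0.0]]  # accumulated precision matrices
    b = [0.0, 0.0]                # accumulated precision * predicted position
    for l, mu, cov in zip(normal_positions, means, covs):
        prec = inv2(cov)
        pred = (l[0] + mu[0], l[1] + mu[1])
        for r in range(2):
            for c in range(2):
                P[r][c] += prec[r][c]
            b[r] += prec[r][0] * pred[0] + prec[r][1] * pred[1]
    Pinv = inv2(P)
    return (Pinv[0][0] * b[0] + Pinv[0][1] * b[1],
            Pinv[1][0] * b[0] + Pinv[1][1] * b[1])
```

With identity covariances the estimate reduces to the plain average of the predicted positions, which is a convenient sanity check.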

Results and Discussion
In this section, we tested the performance of our method on the Faces dataset in the Caltech database [8,34] and compared it with that of the 1-fan model [11]. For this dataset, as in [11], six parts were selected: the left eye, the right eye, the nose, the left corner of the mouth, the right corner of the mouth and the chin (defined as parts $1,2,\ldots,6$ respectively). In our experiment, the distance error $e$ is defined as the mean distance of the $n$ parts in the detection result from their corresponding ground-truth positions. The smaller $e$ is, the better the given model performs on the specific test image.
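The distance error can be computed as in the short sketch below, assuming Euclidean distance in pixel coordinates; the function name is hypothetical.

```python
import math


def distance_error(detected, ground_truth):
    """Mean Euclidean distance between detected and ground-truth part positions."""
    dists = [math.hypot(d[0] - g[0], d[1] - g[1])
             for d, g in zip(detected, ground_truth)]
    return sum(dists) / len(dists)
```

For instance, two parts with errors of 0 and 5 pixels give $e = 2.5$. When only the normal parts are to be scored, the lists passed in would contain those parts alone.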
Table 1. Algorithm 1 (procedure of the outer inference).
Input: The initial occlusion status variable $S=\{1,1,\ldots,1\}$, the initial energy value $O_t=\inf$;
Step 1. Use the simulated annealing algorithm to obtain the initial position, then obtain the current optimal position $L^{*S}$ via the inner inference; if it passes the validity test (explained below), go to Step 3, otherwise go to Step 2;
Step 2. Use each of the two-part sub-models (illustrated in Figure 5) to make an inner inference (determine the positions of these two parts), until a two-part sub-model $T$ whose results pass the validity test appears. Then, estimate the approximate positions of the parts other than the two parts in $T$, and refine their positions by GD. If none of the two-part models passes the validity test, quit the outer inference in failure;
Step 3. While the result $S^{*}$ is not altered and the maximum iteration number $m$ is not reached,
(a) Calculate the gradient vector $h$;
(b) If $E(S)\leq O_t$, $S^{*}$ is updated as $S$, and $O_t$ is updated as $E(S)$;
(c) In $S$, search for the feasible descending bits;
(d) If there is no feasible descending bit, invert the most irresolute bit in $S$, and go to Step 3g, otherwise go to Step 3e;
(e) If there is only one feasible descending bit, invert it, and go to Step 3g, else go to Step 3f;
(f) If there are at least two feasible descending bits, invert one of them, chosen with a probability proportional to the absolute value of its gradient;
(g) Carry out the GD algorithm for $E_S(L)$; if the results cannot pass the validity test, go to Step 3h;

We first demonstrated the influence of the disabled parts on the normal ones. We chose 200 images from the Faces dataset to train a 1-fan model, and chose 100 test images to construct a test dataset by shading the right eye and the right corner of the mouth in each test image. Then, we compared the discarded 1-fan model (in which the disabled parts had been discarded from the 1-fan model; shown in Figure 6(b)) with the original 1-fan model (shown in Figure 6(a)). We used the distance error $e$ as the evaluation index. The distance error $e$ of all of the test images was normalized to $[0,1]$. The comparison result is illustrated in Figure 7, where $N(e)$ is the distribution function of the distance error. The horizontal axis represents the normalized $e$, and the vertical axis represents the percentage of test images whose distance error is smaller than $e$. The higher the curve is, the better the model performs. It should be emphasized that only the normal parts were used to calculate the distance error $e$. Figure 7 shows that the detection results of the discarded 1-fan model were much better than those of the original 1-fan model, owing to the removal of the disabled parts. In other words, the disabled parts severely affect the detection of the normal parts if they are not handled properly. Once occlusion happens, the matching degree of the disabled parts is very likely to be low at the correct positions, so they search other positions to minimize the objective function, which increases the deformation of the edges connecting them. As a result, the adjacent normal parts adjust their positions to reduce the edge cost (as illustrated in Figure 3(c)). For this reason, we discarded the disabled parts in our method. The performance will be demonstrated in the following experiments.
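The curve $N(e)$ can be reproduced from a list of per-image distance errors along the lines of the sketch below. Two details are assumptions, since the text does not specify them: the errors are normalized by dividing by the largest error, and an image is counted when its error is less than or equal to the threshold, so that the curve reaches 1 at $e=1$. The function names are hypothetical.

```python
def normalize_errors(errors):
    """Scale distance errors into [0, 1] by dividing by the largest one (assumed scheme)."""
    top = max(errors)
    if top == 0:
        return [0.0 for _ in errors]
    return [e / top for e in errors]


def cumulative_distribution(norm_errors, thresholds):
    """Fraction of test images whose normalized error does not exceed each threshold."""
    n = len(norm_errors)
    return [sum(1 for e in norm_errors if e <= t) / n for t in thresholds]
```

Plotting `cumulative_distribution` against the thresholds for two models on the same axes gives the kind of comparison described for Figure 7, where the higher curve indicates the better model.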
In the second experiment (the partially occluded experiment), we compared the proposed model with the 1-fan model and evaluated the detection accuracy of our method when one or two parts were partially occluded. Both models were trained on 200 images selected randomly from the Faces dataset. We randomly selected another 100 images to construct two test datasets. The first, termed the one-part-shaded test dataset, was constructed by shading part(1), part(2), ..., part(6) in turn with different occlusion degrees for each test image. The occlusion degree, defined as the ratio of the occlusion area to the part area, varied from about 44% (occlusion region of $40\times 40$) to 100% (occlusion region of $60\times 60$) over 11 values. The number of test images in the one-part-shaded dataset was $100\times 6\times 11=6600$. The second, termed the two-part-shaded test dataset, was constructed by shading 7 adjacent part pairs (i.e., part(1,2), part(1,3), part(1,4), part(4,6), part(2,3), part(2,5), part(5,6)) in turn with different occlusion degrees for each test image. The number of images in the second dataset was $100\times 7\times 11=7700$. Test images from the two-part-shaded test dataset are illustrated in Figure 8. The distance error $e$ was again used as the evaluation index for detection accuracy. The average distance errors over all of the test images are plotted in Figure 9. As a typical instance, the results on the part(1,2)-shaded test are listed in Table 2. Figure 9 shows that the average distance error of our model is almost constant and much smaller than that of the 1-fan model as the occlusion degree changes from 44% to 100%. This is because the disabled parts were discarded from our model and could not affect the detection of the normal parts. For the 1-fan model, the average distance error on the one-part-shaded dataset was smaller than that on the two-part-shaded dataset. Moreover, the average distance error of the 1-fan model increased with the occlusion degree. Specifically, Table 2 shows that once the occlusion degree exceeded 70%, the average distance error increased sharply. That is because part(1,2) (see Figure 8(a)) carries more information about the face than the other parts, so its occlusion greatly misguided the 1-fan model.
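The dataset sizes and the range of occlusion degrees can be checked with a few lines of arithmetic. The sketch assumes that the part area is $60\times 60$ (the size at which the occlusion degree reaches 100%) and that the 11 occlusion sizes are evenly spaced in steps of 2 between 40 and 60; the spacing is a guess, as only the end points and the count are stated. Variable names are hypothetical.

```python
# Assumed: square occlusion regions with side 40, 42, ..., 60 (11 values).
sizes = list(range(40, 61, 2))

# Occlusion degree = occlusion area / part area, with an assumed 60 x 60 part area.
degrees = [s * s / (60 * 60) for s in sizes]

one_part_images = 100 * 6 * len(sizes)   # 100 images x 6 parts x 11 degrees
two_part_images = 100 * 7 * len(sizes)   # 100 images x 7 part pairs x 11 degrees
```

Under these assumptions the smallest degree is $1600/3600\approx 0.44$ and the largest is exactly 1, matching the stated range of about 44% to 100%.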
To further evaluate the performance of our method, we constructed four test datasets by completely shading one, two, three, or four parts (the completely shading experiment). We ran our algorithm on these test datasets. To evaluate the occlusion status variable $S$, we used two evaluation indices computed over all of the bits in $S$: the occlusion false alarm probability $p_f$ and the occlusion false dismissal probability $p_d$. We defined $p_f$ as the probability that a bit in $S$ was wrongly estimated as 0 when it was actually 1, and $p_d$ as the probability that a bit in $S$ was wrongly estimated as 1 when it was actually 0. To evaluate the position variable $L$, we again used the distance error $e$ as the evaluation index. The results of the completely shading experiment are shown in Table 3. We did not compare our method with the 1-fan model in this experiment because the 1-fan model hardly works when three or more parts are shaded.
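The two indices can be computed from estimated and true status vectors as in the sketch below. It assumes both probabilities are taken as fractions of all bits across all test images, which is one reading of "for all of the bits"; a conditional definition (dividing by the number of truly normal or truly disabled bits) would give different values. The function name is hypothetical.

```python
def occlusion_error_rates(estimated, actual):
    """Return (p_f, p_d) over all bits of all status vectors.

    estimated, actual -- lists of status vectors (lists of 0/1), one per test image
    p_f: bit estimated as 0 but actually 1 (false alarm of occlusion)
    p_d: bit estimated as 1 but actually 0 (false dismissal of occlusion)
    """
    bits = [(e, a)
            for s_est, s_true in zip(estimated, actual)
            for e, a in zip(s_est, s_true)]
    total = len(bits)
    false_alarm = sum(1 for e, a in bits if e == 0 and a == 1)
    false_dismissal = sum(1 for e, a in bits if e == 1 and a == 0)
    return false_alarm / total, false_dismissal / total
```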
From Table 3, we can see that both $p_f$ and $p_d$ increase as the number of disabled parts increases. That is because when more parts are occluded, it is more difficult to obtain valid results in Step 1 of Algorithm 1 (Table 1), and the number of valid two-part sub-models in Step 2 of Algorithm 1 (Table 1) is also reduced. Table 3 also shows that the average distance error increases as the number of disabled parts increases. Besides the reasons given above, this is because the positions of the disabled parts are estimated only from their spatial relationship with the normal ones. The experimental results in Table 3 demonstrate that our method remains capable of object detection even when most parts are occluded.

Table 2. Average e of the 1-fan model and our method when part(1,2) was shaded.