Quick, Accurate, Smart: 3D Computer Vision Technology Helps Assessing Confined Animals’ Behaviour

Mankind directly controls the environment and lifestyles of several domestic species for purposes ranging from production and research to conservation and companionship. These environments and lifestyles may not offer these animals the best quality of life. Behaviour is a direct reflection of how the animal is coping with its environment. Behavioural indicators are thus among the preferred parameters to assess welfare. However, behavioural recording (usually from video) can be very time consuming and the accuracy and reliability of the output rely on the experience and background of the observers. The outburst of new video technology and computer image processing gives the basis for promising solutions. In this pilot study, we present a new prototype software able to automatically infer the behaviour of dogs housed in kennels from 3D visual data and through structured machine learning frameworks. Depth information acquired through 3D features, body part detection and training are the key elements that allow the machine to recognise postures, trajectories inside the kennel and patterns of movement that can be later labelled at convenience. The main innovation of the software is its ability to automatically cluster frequently observed temporal patterns of movement without any pre-set ethogram. Conversely, when common patterns are defined through training, a deviation from normal behaviour in time or between individuals could be assessed. The software accuracy in correctly detecting the dogs’ behaviour was checked through a validation process. An automatic behaviour recognition system, independent from human subjectivity, could add scientific knowledge on animals’ quality of life in confinement as well as saving time and resources. This 3D framework was designed to be invariant to the dog’s shape and size and could be extended to farm, laboratory and zoo quadrupeds in artificial housing. 
The computer vision technique applied to this software is innovative in non-human animal behaviour science. Further improvements and validation are needed, and future applications and limitations are discussed.


I. INTRODUCTION
Stray dog populations are a serious concern for human health and safety, and for the dogs themselves, in many European countries [1]. The main population control strategy in Italy, as in other countries, is the confinement of stray dogs in shelter facilities until re-homing. Unfortunately, the imbalance between entrance and adoption rates often leads to overcrowding, and dogs are likely to spend most of their lives in the shelter. Previous literature has shown that long-term confinement in shelters has detrimental effects on dogs' welfare [2], [3]. Behavioral responses are a direct reflection of an animal's attempts to cope with its environment. Failure of these coping strategies may lead to a reduction in the expression of normal behaviors and an increase in abnormal or repetitive behaviors. The study of behavior as an indicator of poor welfare is therefore critical when assessing the well-being of shelter dogs. Video recording with subsequent image analysis is by far the most widely applied technique because it is non-invasive. However, manual or semi-automatic methods of image scoring are very time consuming, and their results depend on the subjectivity, sensitivity and accuracy of the observer. An automatic image recording system would make it possible to collect larger amounts of data over long periods of time with high precision, while saving human labor. The aim of this paper is therefore to propose an innovative and, to our knowledge, unique framework for automatically measuring the behavioral parameters of dogs kept in a kennel environment.
Computer vision analysis of dog body parts is absent from the literature. Only a few experiments exist on animals, mostly involving livestock and focusing on the classification of motion patterns. Shao et al. [4] analyzed the thermal comfort behavior of swine using programmable cameras and information based on the top view of the animals. Tillett et al. [5] applied image processing techniques to pigs in a pen in order to track their movement and extract information about position, rotation, bending and head nodding, with the aim of studying their individual behavior. Leroy et al. [6] developed a model-based computer vision system to study the behavior of hens and assess their welfare. By analyzing the contour of the hens they extract the posture and classify the possible behaviors into predetermined categories (e.g. "standing", "walking", "scratching"). Cangar et al. [7] developed an automatic image analysis system able to identify some locomotion and posture behaviors of cows prior to calving, with the purpose of raising an alarm when human intervention is necessary. However, these works often rely on simplified models and operate in supervised and controlled settings. The complexity of real scenarios poses new challenges and creates the need to analyze complex animal behaviors precisely in order to evaluate welfare in real situations. To this aim, we believe it is of broad interest to provide details about the posture and body parts of quadrupeds. To our knowledge there are no proposals on this topic. Conversely, most of the existing works on body part detection focus on human beings. Most solutions rely on the analysis of 2D images, where features are analyzed using pre-trained classifiers, either generative [8], [9] or discriminative [10]. 3D approaches, in contrast, exploit the richness of the three-dimensional representation, which conveys information that can resolve partial occlusions between body parts [11].
Among 3D sensors, the Microsoft Kinect has been profitably exploited for body part detection and tracking [12]. Recently, structural classifiers have emerged as a valuable tool for body part detection. Structural classification exploits body part models and their mutual relations at the same time in a joint classification framework [13]. Its main flaws are the need for an exhaustive training set and the computational cost of the classification algorithm, which prevents real-time application.
Although techniques that explicitly model the human body cannot be directly applied to quadrupeds, we propose the adoption of a structural classifier to detect dog body parts. Our proposal relies on an efficient on-line training technique that, through the use of kernel functions, allows the classification to be both real-time and widely applicable. Our solution is the first attempt at identifying dog body parts, taking into account that a dog's body has a completely different structure from a human one (e.g. different axes of symmetry, a different kind of self-occlusion, different motion constraints). The adoption of 3D features makes the method invariant to dog breed and size, and the method can be extended to other quadruped body models by changing the kernel functions and leaving the rest of the proposal unchanged.

II. STRUCTURAL CLASSIFICATION OF DOG BODY PARTS
The method for body part classification consists of two steps: first the dog is located inside the scene and its image is extracted, then depth features are extracted and classified. We restrict the dog body model to seven body parts (torso, head, tail and the four paws) and consider pens containing a single animal. This assumption simplifies the detection and tracking problem. Nevertheless, dog tracking can be carried out with a single-target tracker even in complex scenarios; for further details readers can refer to the survey in [14]. Given the structure of the pen, the depth maps acquired by the Kinect sensor (Fig. 1.b) are used to remove the planes that delimit the pen with a least-squares plane fitting method. After plane removal, only the blob containing the dog remains (Fig. 1.c). Finally, morphological binary operators are applied to the dog mask to fill potential holes due to noise or Kinect errors (Fig. 1.d), and both the dog depth image and the distance transform computed on the dog binary mask are extracted as the features for classification (Fig. 1.e-f).
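The plane-removal step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes, for simplicity, that a pen boundary can be written as z = a·x + b·y + c in pixel coordinates, and the function names `fit_plane` and `foreground_mask`, as well as the tolerance `tol`, are hypothetical.

```python
def fit_plane(points):
    """Least-squares fit of z = a*x + b*y + c to a list of (x, y, z) points."""
    # Build the augmented normal equations (A^T A | A^T z) for rows [x, y, 1].
    sxx = sxy = sx = syy = sy = sxz = syz = sz = 0.0
    for x, y, z in points:
        sxx += x * x; sxy += x * y; sx += x
        syy += y * y; sy += y
        sxz += x * z; syz += y * z; sz += z
    m = [[sxx, sxy, sx, sxz],
         [sxy, syy, sy, syz],
         [sx, sy, float(len(points)), sz]]
    # Gauss-Jordan elimination with partial pivoting on the 3x3 system.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        if abs(m[col][col]) < 1e-12:
            raise ValueError("points are degenerate (collinear or too few)")
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, 4):
                    m[r][c] -= f * m[col][c]
    a, b, c = (m[i][3] / m[i][i] for i in range(3))
    return a, b, c


def foreground_mask(depth, plane, tol):
    """Mark pixels whose depth differs from the fitted plane by more than tol."""
    a, b, c = plane
    return [[abs(z - (a * col + b * row + c)) > tol
             for col, z in enumerate(line)]
            for row, line in enumerate(depth)]
```

In practice one plane would be fitted per pen boundary (floor, walls), and the resulting mask would still need the morphological hole filling mentioned in the text.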

A. Structural Support Vector Machine
We formulate the problem of dog body part detection as a structured learning problem using a Structural Support Vector Machine (SSVM) [15]. SSVMs are an effective solution for structured learning problems and have been profitably applied in different computer vision contexts, from segmentation [16] to tracking [17]. The use of a structural classifier makes it possible to consider jointly the body part models and their mutual relationships derived from anatomical constraints, leading to an accurate classification without the need to define the body model explicitly. The classification is performed frame by frame. Let us consider an input vector x ∈ X that represents the dog features and a set of possible solutions (i.e. body part labelings) Y. In a supervised discriminative setting, the classifier aims to learn a classification function h : X → Y based on training samples of input-output pairs. This function is expressed as the maximization of a discriminant function F : X × Y → R that measures the compatibility between (x, y) pairs, returning a high score for well-matched pairs:

$$h(x) = \arg\max_{y \in Y} F(x, y) \qquad (1)$$

The classifier is trained by parametrizing the scoring function of Eq. (1) with a weight vector w and expressing F as a dot product between w and a joint kernel map Ψ(x, y) that maps an input-output pair (x, y) to a real-valued feature vector, F(x, y) = ⟨w, Ψ(x, y)⟩. This Structured Support Vector Regression problem is solved by estimating the parameter vector w in a loss-augmented learning setting where the loss function Δ(y, ȳ) measures the difference between two possible solutions y and ȳ. The objective is to learn w so that the value of F(x_i, y_i) − F(x_i, y) mimics as closely as possible the behavior of the loss function Δ(y_i, y). Based on a set of sample pairs {(x_1, y_1), ..., (x_n, y_n)}, during training we solve the dual SSVM convex optimization problem in its re-parametrized form proposed in [18]:

$$\max_{\beta} \; -\sum_{i, y} \Delta(y, y_i)\, \beta_i^{y} - \frac{1}{2} \sum_{i, y, j, \bar{y}} \beta_i^{y} \beta_j^{\bar{y}} \langle \Psi(x_i, y), \Psi(x_j, \bar{y}) \rangle \qquad (2)$$

$$\text{s.t.} \quad \forall i, \forall y : \; \beta_i^{y} \le \delta(y, y_i)\, C, \qquad \forall i : \; \sum_{y} \beta_i^{y} = 0 \qquad (3)$$

where δ(y, y_i) = 1 if y = y_i and 0 otherwise. The discriminant function then becomes:

$$F(x, y) = \sum_{i, \bar{y}} \beta_i^{\bar{y}} \langle \Psi(x_i, \bar{y}), \Psi(x, y) \rangle \qquad (4)$$

In Eq.
(4) the joint kernel map Ψ does not need to be defined explicitly because it appears only inside a dot product, so the problem can be solved using a kernel function K(x_i, y_i, x_j, y_j) instead, as in the dual SVM case. In a similar vein, the pairs (x_i, y) having β_i^y ≠ 0 are the support vectors. Support vectors having β_i^y > 0 are referred to as positive because they contribute positively to the discriminant function; conversely, those having β_i^y < 0 are referred to as negative. Those x_i that are included in at least one support vector are defined as support patterns.
The use of kernels in structural SVMs is an advantage because, as with standard SVMs, it allows the classification function to be non-linear. However, the computational cost of the dual problem increases because the kernel must be evaluated for every support vector [19].
To overcome this problem we use LaRank [18], an iterative technique based on Sequential Minimal Optimization (SMO).
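To make the kernelized discriminant function concrete, the following minimal sketch scores candidate labelings as a weighted sum of kernel evaluations against the support vectors and returns the best one. It is only an illustration under stated assumptions: joint feature vectors are taken as already computed, a Gaussian kernel with a 2σ² normalization is used as one possible choice, and the names `rbf_kernel`, `discriminant` and `predict` are hypothetical.

```python
import math


def rbf_kernel(u, v, sigma2=1.0):
    """Gaussian kernel between two joint feature vectors (assumed 2*sigma^2 scaling)."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-d2 / (2.0 * sigma2))


def discriminant(phi_xy, support_vectors, kernel=rbf_kernel):
    """F(x, y): sum of beta * K(support vector features, candidate features)."""
    return sum(beta * kernel(phi_sv, phi_xy) for beta, phi_sv in support_vectors)


def predict(candidates, support_vectors, kernel=rbf_kernel):
    """Return the label of the candidate with the highest discriminant score.

    candidates is a list of (label, joint feature vector) pairs.
    """
    best = max(candidates, key=lambda c: discriminant(c[1], support_vectors, kernel))
    return best[0]
```

The cost of each prediction grows with the number of support vectors times the number of candidates, which is the overhead the text refers to.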

B. LaRank
LaRank is a stochastic learning algorithm used to estimate the coefficients β of the constrained optimization problem of Eq. (2). It combines partial gradient information with the randomization arising from the sequence of training examples. Unlike optimization algorithms that rely on the evaluation of the full solution space, LaRank performs a randomized exploration inspired by subgradient methods. The algorithm is based on a sequence of SMO-style steps [20]. At every step, it modifies a pair of coefficients β_i^{y+} and β_i^{y−} by adding a quantity λ to the first and subtracting it from the second, so that the constraint Σ_y β_i^y = 0 remains fulfilled. This constitutes a one-dimensional maximization problem in λ that can be solved using the SMO technique, see Alg. 1. The gradients g_i in Alg. 1 are computed for a single coefficient β_i^y as:

$$g_i(y) = -\Delta(y, y_i) - F(x_i, y) \qquad (5)$$

LaRank considers three different update strategies for choosing y+ and y− in Alg. 1.
The PROCESSNEW step processes a new training sample (x_i, y_i) and selects y+ = y_i and y− = argmin_{y∈Y} g_i(y). This step adds the correct solution (x_i, y+) as a positive support vector and searches for the worst solution (x_i, y−) as the corresponding negative support vector. Note that a new support vector is not created if the SMO step does not modify the β coefficients.
The PROCESSOLD step instead processes an existing support pattern x_i, chosen randomly, where y+ = argmax_{y∈Y} g_i(y) subject to β_i^y < δ(y, y_i)C and y− = argmin_{y∈Y} g_i(y). This step revisits an existing positive support vector, possibly adding (x_i, y−) as a new negative example. Lastly, the OPTIMIZE step processes an existing support pattern x_i, chosen randomly, setting y+ = argmax g_i(y) subject to β_i^y < δ(y, y_i)C and y− = argmin g_i(y), where both searches are restricted to the labelings y that already form a support vector with x_i. The algorithm does not specify a termination criterion. As suggested in [18], we schedule the update steps as follows: given a new training sample (x_i, y_i) we invoke a PROCESSNEW followed by η_R REPROCESS steps, each defined as a PROCESSOLD followed by η_O OPTIMIZE steps. We set η_O = η_R = 10.
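As an illustration of the SMO-style step on a coefficient pair, the sketch below follows the form commonly used for this kind of dual problem: the unconstrained optimum of λ is computed from the two gradients and the kernel values, clipped to the box constraint, and the gradients are then updated. Since Alg. 1 is not reproduced here, this should be read as an assumption about its structure rather than the authors' code; the dictionary layout and the names `smo_step`, `beta`, `grad` and `feats` are hypothetical.

```python
def smo_step(beta, grad, feats, i, y_pos, y_neg, y_true, C, kernel):
    """One SMO-style update on the pair (i, y_pos), (i, y_neg).

    beta, grad : dicts keyed by (pattern index, label) with coefficients and gradients
    feats      : dict with the same keys holding joint feature vectors
    Returns the step size lambda that was applied.
    """
    kp, kn = (i, y_pos), (i, y_neg)
    k_pp = kernel(feats[kp], feats[kp])
    k_nn = kernel(feats[kn], feats[kn])
    k_pn = kernel(feats[kp], feats[kn])
    denom = k_pp + k_nn - 2.0 * k_pn
    if denom <= 0.0:
        return 0.0  # degenerate direction, nothing to optimize
    # Unconstrained optimum along the direction (+lambda, -lambda).
    lam = (grad[kp] - grad[kn]) / denom
    # Box constraint: beta may reach C only for the ground-truth label, 0 otherwise.
    upper = (C if y_pos == y_true else 0.0) - beta.get(kp, 0.0)
    lam = max(0.0, min(lam, upper))
    # Adding and subtracting the same amount keeps the coefficients summing to zero.
    beta[kp] = beta.get(kp, 0.0) + lam
    beta[kn] = beta.get(kn, 0.0) - lam
    # Refresh the stored gradients of all tracked pairs.
    for key in grad:
        grad[key] -= lam * (kernel(feats[key], feats[kp]) - kernel(feats[key], feats[kn]))
    return lam
```

After the step the two gradients involved are equal whenever the box constraint is not active, which is the optimality condition along that direction.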

C. Kernel and Loss functions for dogs body parts
We define a solution y of Eq. (1) as a 14-dimensional vector containing the image coordinates of the segments that represent the body parts in the following order: torso, head, tail, front left paw, front right paw, bottom left paw and bottom right paw, as can be seen in Fig. 2.

a) Kernel Function: Every input-output pair (x_i, y_j) generates a feature vector based on the mapping Ψ : X × Y → R^d. For our problem, we use features derived both from the depth map obtained from the Kinect sensor and from the distance transform computed on the binary mask of the dog body. The distance transform maps a binary image into a gray-scale image by replacing every pixel of the object with its distance from the nearest background pixel, Fig. 1(e-f). In particular, we choose the mean and the variance of the depth and of the distance transform values along the segments that identify the body parts in the solution vector y, Fig. 2(b-c). These are simple and effective descriptors: the distance transform provides a sketch of the dog skeleton, while the depth image helps to distinguish among the paws, as our experiments showed. We additionally include the components of the motion vector of the dog barycenter, computed between two consecutive frames, to help the system point the torso in the correct direction. The result is a 16-dimensional real-valued vector ϕ(x, y). Given two input-output pairs, the kernel function K(Ψ(x_i, y_i), Ψ(x_j, y_j)) is then computed using a Gaussian RBF kernel with σ² = 1:

$$K(\Psi(x_i, y_i), \Psi(x_j, y_j)) = \exp\left(-\frac{\lVert \phi(x_i, y_i) - \phi(x_j, y_j) \rVert^2}{2\sigma^2}\right) \qquad (6)$$

b) Loss Function: The loss function Δ : Y × Y → R in Eq. (2), as previously stated, is used during training to evaluate the dissimilarity between two solutions. We derive the loss function as the inverse of the PCP measure (Percentage of Correctly estimated body Parts) of Eichner and Ferrari [21]. The PCP is a well-established measure for evaluating the accuracy of human body part classification systems.
The PCP value is based on the criterion that a body part is considered correctly estimated if its segment endpoints lie within a fixed fraction of the length of the ground-truth segment from their annotated locations. The loss function Δ(y, ȳ) is then computed as the inverse of the PCP between two different solutions:

$$\Delta(y, \bar{y}) = 2\,(1 - \mathrm{PCP}(y, \bar{y})) \qquad (7)$$

Once the SSVM is trained, inference on the solution vector y is obtained by maximizing the discriminant function of Eq. (1) using Eq. (4). The maximization requires generating feasible solution vectors y ∈ Y over which the argmax is computed. In the case of body part detection, this would involve considering all possible dog poses w.r.t. the camera, which is not feasible due to the complexity of the articulated motion of quadrupeds. We overcome this limitation with a heuristic method for solution generation, in which possible solutions y are generated from the distance transform DT computed on the dog binary mask, as in Sec. II-C. In detail, the i-th body part segment s_i is described by the quadruple {x_0i, y_0i, θ_i, L_i}, where (x_0i, y_0i) is one of the endpoints of the segment, θ_i is the angle between a horizontal line and the segment (measured counterclockwise) and L_i is the length of the segment. Let n be the number of searched body parts and S = {x_0k, y_0k, θ_k, L_k}, k = 1, ..., n, be the body part segments. We formulate the problem of finding possible solutions as finding the set of segments that maximizes the sum of DT along their points:

$$S^{*} = \arg\max_{S} \sum_{k=1}^{n} \sum_{p \in s_k} DT(p) \qquad (8)$$

We constrain the set of possible solution segments S using anatomical constraints of quadrupeds. The first set of constraints involves the top body parts, namely the torso, head and tail. We force the torso segment to pass through the dog barycenter, lying inside the area where the distance transform reaches its maximum values, Fig. 3.c. The head and tail then start from the start point and the endpoint of the torso respectively, and we additionally impose an angular constraint of 150 degrees w.r.t. the torso direction on the search space of the segment parameters, as shown in Fig. 3.d.
Finally, the paws are constrained to the area below the torso segment. We heuristically search for the set of possible solutions by iteratively finding, for every body part segment, its global maximum of Eq. (8) until a complete solution involving all seven body parts is built. The complete heuristic procedure is sketched in Fig. 3. First the torso is searched using the aforementioned constraint. Then head and tail segments are scanned starting from the torso endpoints inside the limited angular area. Lastly the paws are extracted. We first scan the area below the torso using four fixed-size vertical segments until a local maximum of Eq. (8) is found, Fig. 3.e. The paws are then refined by shrinking and rotating the segments w.r.t. their midpoints, starting points and endpoints until the maximum value of DT along the segments is reached, Fig. 3.f. To generate a set of possible solutions, we iterate the procedure N times, removing the previously found segments from the DT image. After the set of N possible solution segments is computed, we create the y vectors by assigning a label to every segment. A label permutation step between head and tail, and correspondingly between the paws, is employed to account for different dog orientations in the solution generation process. This method covers a very large portion of the possible torso orientations, showing some shortcomings only when the dog is exactly facing or facing away from the camera. An example of a possible solution computed by this heuristic procedure is depicted in Fig. 3(g-i).
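The quantity maximized in Eq. (8), the sum of the distance transform along a candidate segment, can be sketched as follows. This is a simplified illustration, not the authors' code: it uses a city-block distance transform computed by breadth-first search (the distance metric actually used is not specified), samples the segment at unit steps with nearest-pixel rounding, assumes the mask contains at least one background pixel, and the names `distance_transform` and `segment_score` are hypothetical.

```python
import math
from collections import deque


def distance_transform(mask):
    """City-block distance of every foreground pixel to the nearest background pixel."""
    h, w = len(mask), len(mask[0])
    dist = [[None if mask[r][c] else 0 for c in range(w)] for r in range(h)]
    # Multi-source BFS started from all background pixels at once.
    queue = deque((r, c) for r in range(h) for c in range(w) if not mask[r][c])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w and dist[rr][cc] is None:
                dist[rr][cc] = dist[r][c] + 1
                queue.append((rr, cc))
    return dist


def segment_score(dt, x0, y0, theta, length):
    """Sum of DT values along the segment {x0, y0, theta, length}.

    theta is measured counterclockwise from the horizontal; the image row
    index grows downwards, hence the minus sign on the sine term.
    """
    h, w = len(dt), len(dt[0])
    total = 0
    for t in range(int(length) + 1):
        col = int(round(x0 + t * math.cos(theta)))
        row = int(round(y0 - t * math.sin(theta)))
        if 0 <= row < h and 0 <= col < w:
            total += dt[row][col]
    return total
```

On a rectangular blob, a segment lying on the medial axis scores higher than one along the border, which is why maximizing this score tends to trace the skeleton of the dog mask.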

IV. EXPERIMENTAL RESULTS
To perform the tests, we acquired a dedicated dataset of dog videos in a kennel environment. We remark that no publicly available dataset of this kind exists. Two real scenarios were considered, depending on whether the dog pen is indoor or outdoor, which affects the lighting conditions during recording. In particular, two test trials, Trial 1 and Trial 2, were performed in an Italian kennel. An additional trial, Trial 3, was performed in a fully controlled environment in our laboratory. Trial 1 contains sequences acquired with constant lighting conditions. Conversely, Trial 2 exhibits severe differences in the lighting of the pen. All the trials involved different breeds of dogs. Videos were recorded using the Microsoft Kinect at a distance of one meter from the pen. The pen size was restricted to a maximum of 3 meters in width and 3 meters from the sensor. This restriction is due to the Kinect operating range, but the method can be applied to wider pens using a stereo camera instead. All the trials involve sequences of varying length, for a total recording time of half an hour per trial at 10 frames per second. The total number of frames considered for classification is 8340 for Trial 1, 8110 for Trial 2 and 6000 for Trial 3; in all these frames the dog is present and awake. After the acquisition, 10% of the frames, randomly chosen for every trial, were manually annotated and used for quantitative evaluation, while the remaining frames were evaluated qualitatively by three experts, and the detected parts were labeled as either correct or wrong based on majority voting. The effectiveness of the heuristic for solution generation, described in Sec. III, was evaluated on the manually annotated portion of the dataset. We counted the frames in which the algorithm was able to generate at least one correct solution: this happened in 97.7% of the tested frames, while solutions with at most two mislabeled body parts were generated in 100% of the frames.
The classification accuracy was evaluated using the PCP measure [21], the same measure we used to compute the loss function of the SSVM in Sec. II-C. The PCP is a state-of-the-art performance measure for the human body pose estimation problem and can be directly employed for quadrupeds without any modification. We trained the classifier by randomly choosing the input-output pairs (x, y) among the manually annotated images. Different tests were performed varying the training set size, while the C parameter of the SSVM was set by grid search. Quantitative results, in terms of PCP, for the three test trials on the annotated frames are shown in Tab. I.
The lower performance in Trial 2 is mostly due to the sensitivity of the Kinect to strong illumination changes, which resulted in dog masks with many holes and imprecise depth images. We performed an additional test to highlight which body parts are most often mislabeled, computing the PCP separately for the top body parts (torso, head and tail) and for the bottom ones (the four paws), Tab. II. The results show that paws are wrongly classified more frequently than the other body parts. Most of the errors involve swapping the paws closer to the camera with the farther ones, mainly when depth images are imprecise or noisy due to sensor inaccuracies. This problem can be partially mitigated by adopting a stereo camera with a higher resolution than the Kinect, at the price of a higher system cost. Since no methods exist for dog body part classification, we compare our system against the proposal in [13], which employs an SSVM for human body pose estimation. To perform the comparison, the same kernels were used for both methods, while the solution generation process and the loss function were varied according to [13].
The comparison was performed on the annotated part of the dataset. The results in Tab. III indicate that both the heuristic described in Sec. III and the loss function, being specifically tailored to quadruped classification, lead to more accurate results. Finally, qualitative tests were performed on the complete dataset. We classified all the frames automatically and, for every solution, asked three experts to evaluate the classification result using three classes (in case of disagreement among the experts we use majority voting):
• Correct Solution: where body parts appear visually correct.
• Partially Correct Solution: where at least half of the body parts appear visually correct.
• Mostly Wrong Solution: where visually the body parts are perceived as wrongly detected.
The qualitative scores in Tab. IV are higher than the quantitative ones because the PCP measure accounts for the precise localization of the body parts. Nevertheless, the experts agree that on average only 4% of the images are completely misclassified. Visual results obtained by our proposal on three different dog breeds in the trial scenarios can be observed in Fig. 4.
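For reference, the PCP measure used in the quantitative evaluation, and the loss Δ = 2(1 − PCP) derived from it in Sec. II-C, can be sketched as below. The endpoint threshold `frac=0.5` is an assumption based on the value commonly used with PCP, not a value stated here; the segment representation and the names `pcp` and `pcp_loss` are hypothetical.

```python
import math


def pcp(pred, truth, frac=0.5):
    """Fraction of body parts whose two endpoints are both close to the ground truth.

    pred and truth are equally ordered lists of segments ((x0, y0), (x1, y1)).
    A part counts as correct when each endpoint lies within frac times the
    ground-truth segment length of its annotated location.
    """
    correct = 0
    for (p0, p1), (g0, g1) in zip(pred, truth):
        limit = frac * math.dist(g0, g1)
        if math.dist(p0, g0) <= limit and math.dist(p1, g1) <= limit:
            correct += 1
    return correct / len(truth)


def pcp_loss(pred, truth, frac=0.5):
    """Loss between two solutions, computed as 2 * (1 - PCP)."""
    return 2.0 * (1.0 - pcp(pred, truth, frac))
```

Because a part is either counted or not, small localization errors inside the threshold do not change the score, while a swap between two paws typically makes both parts incorrect at once.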

V. CONCLUSIONS
We presented a system that approaches the novel problem of dog body part detection using a 3D sensor. The 3D depth images are acquired by the Microsoft Kinect sensor and used, in conjunction with the distance transform values, to classify the dog body parts. The adoption of a structural classifier makes it possible to capture the relations among body parts without explicitly modeling all the anatomical constraints of the dog body. In experiments carried out on dogs kept in kennels, we observed promising results both in terms of the quantitative PCP measure and of the qualitative visual evaluation. The tests showed that the proposal is independent of dog breed, and we expect it to be applicable to other kinds of quadrupeds without excessive changes. We believe that this can constitute a first important step towards analyzing dog behavior in kennels in order to detect repetitive and other aberrant behaviors, which are common indicators of poor welfare in confined animals.