Learning Dictionaries of Sparse Codes of 3D Movements of Body Joints for Real-Time Human Activity Understanding

Real-time human activity recognition is essential for human-robot interaction in assisted healthy independent living. Most previous work in this area has been performed on traditional two-dimensional (2D) videos, using both global and local methods. Because 2D videos are sensitive to changes in lighting conditions, view angle, and scale, researchers have in recent years begun to explore applications of 3D information in human activity understanding. Unfortunately, features that work well on 2D videos usually do not perform well on 3D videos, and there is no consensus on what 3D features should be used. Here we propose a model of human activity recognition based on 3D movements of body joints. Our method has three steps: learning dictionaries of sparse codes of 3D movements of joints, sparse coding, and classification. In the first step, space-time volumes of 3D movements of body joints are obtained via dense sampling, and independent component analysis is then performed to construct a dictionary of sparse codes for each activity. In the second step, the space-time volumes are projected onto the dictionaries and a set of sparse histograms of the projection coefficients is constructed as the feature representation of the activities. Finally, the sparse histograms are used as inputs to a support vector machine to recognize human activities. We tested this model on three databases of human activities and found that it outperforms the state-of-the-art algorithms. Thus, this model can be used for real-time human activity recognition in many applications.


Introduction
A smart environment is a place where humans and objects (including mobile robots) can interact and communicate with each other in a human-like way [1]. It has a wide range of applications in home and office work, health care, assistive living, and industrial operations. Current pervasive computing technologies and low-cost digital imaging devices make the development of smart environments feasible. In smart environments, accurate, real-time human activity recognition is a paramount requirement, since it allows individuals' or patients' activities of daily living [2] (such as taking medicine, dressing, cooking, eating, drinking, falling down, and feeling pain) to be monitored, so that their functional health can be tracked and timely interventions can be made to improve their health [3][4][5][6][7]. Fig. 1 shows several human activities in the CAD-60 dataset [8], including "wearing contact lens", "talking on the phone", "brushing teeth", and "writing on the white board".
Automated human activity understanding is a challenging problem due to the diversity and complexity of human behaviors [9]. Different people do the same activity in a multitude of ways, and even a single person may do the same activity in different ways at different times. Most previous work in human activity understanding has been performed on traditional 2D color images/videos, and both global and local spatial-temporal features have been proposed (reviewed in [10][11][12]). Because it is difficult to deal with variations in 2D images/videos due to changes in lighting conditions, view angle, and scale, researchers have begun to explore applications of 3D information in human activity understanding [9]. In contrast to 2D images/videos, depth maps such as those acquired by the Microsoft Kinect system are related to object geometry and thus are independent of lighting conditions. However, it is difficult to develop features that represent human activities based on 3D information, because depth images have much less texture than 2D images and are sensitive to occlusion [13]. Adapting recognition algorithms developed for 2D images and videos is not trivial either. For example, interest-point detectors such as Dollar [14] and STIP [15] perform badly on 3D videos. Currently, there are two approaches to using depth data for activity recognition: depth-based methods and skeleton/joint-based methods [9]. A recent study showed that relative joint positions carry significant information about activities [16], but these features are difficult to extract without human intervention. Thus, although several recognition algorithms that use manually selected joint-related features have been developed [8][17][18][19][20][21][22][23][24], there is no consensus on what joint-related features should be extracted and how they should be used for activity recognition.
We propose a method that automatically learns sparse representations of human activities. Specifically, we treat 3D movements of joints as space-time volumes and densely sample the volumes along the time axis to obtain a set of sub-volumes. We then use reconstruction independent component analysis (RICA) [25] to learn a dictionary of over-complete codes from the sub-volumes for each activity. In this learning procedure, the sub-volumes are represented by the learned codes in a sparse manner. From the coefficients of the sub-volumes projected onto the sparse codes, we construct a sparse histogram for each activity. Finally, we concatenate the sparse histograms and use them as inputs to a multi-class support vector machine (SVM) to perform activity recognition. We tested this model on three widely used databases of human activities and found that it outperforms the state-of-the-art algorithms. The contributions of this paper to joint-based activity recognition are: (1) a general dictionary-based framework that automatically learns sparse, high-dimensional spatial-temporal features of 3D movements of joints; (2) an efficient method that constructs sparse codes and histograms; (3) a real-time system for human activity recognition that can be easily implemented; and (4) extensive evaluations of the proposed model and superior results on three datasets of human activities.
The paper is organized as follows. In Section 2, we briefly describe related work and how our model is different. In Section 3, we describe the procedures of data processing and learning dictionaries of codes of 3D movements of body joints. In Section 4, we propose a set of sparse histograms of the codes of human activities. In Section 5, we present an algorithm for activity recognition via a multi-class SVM with sparse histograms as input features. In Section 6, we report the recognition results of our model on three datasets of human activities and compare them to the state-of-the-art algorithms. In Section 7, we briefly summarize the main points of our model and address several aspects of the model that can be improved.

Related Work
We briefly describe related work below. For work on activity recognition based on 2D videos, we refer readers to several surveys [10][11][12].

Depth map-based approaches
Features automatically or manually extracted from depth images/videos have been proposed, including bag of points [26], Space-Time Occupancy Patterns (STOP) [27], Random Occupancy Pattern (ROP) [28], HOG from Depth Motion Maps (DMM-HOG) [24], Histogram of Oriented 4D Surface Normals (HON4D) [29], Pixel Response and Gradient Based Local Feature [30], Local Trajectory Gradients, and SIFT [31]. In [32], depth silhouettes are used as features and a hidden Markov Model (HMM) is used to model temporal dynamics of activities. Different from these methods, our algorithm is based on joints which are the best features for human activity recognition [16].

Skeleton/Joint based approaches
It was observed in the 1970s that a range of human activities can be recognized on the basis of 3D movements of body joints [33]. However, joint-based activity recognition drew research attention only recently, owing to the availability of low-cost Microsoft Kinect cameras that can acquire 3D videos of joint movements. Campbell and Bobick [17] proposed to compute action curves by projecting 3D joint trajectories onto low-dimensional phase spaces and to classify actions based on the action curves. This approach works only for simple activities. Lv et al. [18] proposed seven types of local features and used HMMs to describe the evolution of these features. In [19], a so-called Histogram of 3D Joint Location (HOJ3D) was designed to characterize the distribution of joints around the central joint (hip joint), and an HMM was developed to model temporal changes of the feature. In [20], SIFT features for objects and skeleton features for humans were developed, and an MRF was used to model human activities. Sung et al. [8] computed HOG from RGBD data and position-angle features from joints and used a Maximum Entropy Markov Model (MEMM) to represent activities hierarchically. Wang et al. [34] designed the Local Occupancy Pattern (LOP), which is computed from a set of 3D points around each joint. Finally, geometric relationships among joints were used in [23]. All these methods need manually designed features. In contrast, in the method we present here a set of dictionaries of sparse codes of human activities is obtained without manual intervention.
The work most closely related to ours is EigenJoints, which describe positional differences between joints within or across video frames and are used for action recognition via a Naive Bayes nearest neighbor classifier [24]. EigenJoints are simple and easy to compute, and so are the features of our model presented below. Our model is different in two ways. First, a set of dictionaries of codes of human activities is learned. Second, an approximate sparse coding is performed to obtain a set of sparse histograms for action recognition via a multi-class SVM.

Joint-Dictionary Learning
We propose to learn a set of dictionaries of sparse codes to represent the complex spatial-temporal relationships among body joints. For this purpose, we first introduce some notation. We densely sample the space-time volume $V_c^d$ along the time dimension (the "frame" axis in Fig. 2) to obtain $N_s$ ($N_s = N_f^d$) sub-volumes for each video. Thus, we take all possible sub-volumes of $V_c^d$. One can also use various methods to take sub-volumes at sampled points in the time dimension ($N_s \le N_f^d$). Suppose that the sample sizes are $3$, $N$, and $N_f^s$ along the "xyz" axis, the "joint" axis, and the "frame" axis, respectively. Each sub-volume $S_i$ is then a stack of $N_f^s$ coordinate images $I_{t_j^i}$, where $I_{t_j^i}$ is the $t_j^i$-th coordinate image. The third dimension of sub-volume $S_i$ can be permuted with the first dimension by a permutation operation with the order vector $[3, 2, 1]$, which indicates that the second dimension of $S_i$ stays where it is while the first dimension is swapped with the third dimension; we denote the permuted sub-volume by $S_i^p$. From equation (1), it can be seen that, after this permutation, the same coordinate components of each joint form the columns of the permuted sub-volume $S_i^p$. As a result, the x, y, or z coordinate components of a joint in the sampled frames of sub-volume $S_i$ form one column of $S_i^p$. For example, the x coordinate components of the head joint in the different frames of sub-volume $S_i$ are one column of $S_i^p$. This is illustrated by the horizontal color bars in Fig. 3, which arise because body joints in neighboring frames tend to have similar coordinates. To examine the sub-volumes, we form a new matrix $S^u$ by reordering the permuted sub-volumes $S_i^p$ lexicographically.
$S^u$ represents the sub-volumes from one video, with each column corresponding to one sub-volume. One $S^u$ is shown in Fig. 3, where the "Sample index" axis indicates the indices of all the sub-volume samples and the "Coordinate index" axis is the row index of matrix $S^u$. As shown in Fig. 3, gradual changes between samples occur along the "Sample index" axis (corresponding to the time axis). Thus, the configurational relationships among body joints evolve in the time domain, as they should in human activities.
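The sampling, permutation, and lexicographic reordering described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' Matlab implementation: the function name `extract_subvolumes` and its arguments are hypothetical, the video is assumed to be stored as an array of shape (3, N, number of frames), and only windows that lie fully inside the video are taken, so the sketch returns slightly fewer sub-volumes than the $N_s = N_f^d$ stated in the text.

```python
import numpy as np

def extract_subvolumes(volume, n_frames_sub):
    """Densely sample a joint volume along the frame axis (hypothetical helper).

    volume       : array of shape (3, N, n_frames), axes = (xyz, joint, frame)
    n_frames_sub : number of frames per sub-volume (N_f^s in the text)
    Returns a matrix with one vectorized, permuted sub-volume per column.
    """
    n_xyz, n_joints, n_frames = volume.shape
    # Assumption: keep only windows that fit entirely inside the video.
    n_sub = n_frames - n_frames_sub + 1
    columns = []
    for start in range(n_sub):
        sub = volume[:, :, start:start + n_frames_sub]   # S_i, shape (3, N, N_f^s)
        sub_p = np.transpose(sub, (2, 1, 0))              # S_i^p, order [3, 2, 1]
        # Column-major flattening mimics Matlab's (:) operator, so the frames of
        # one coordinate of one joint end up next to each other.
        columns.append(sub_p.reshape(-1, order="F"))
    return np.stack(columns, axis=1)                      # S^u

# Example: 15 joints, 20 frames, sub-volumes of 11 frames (15 * 3 * 11 = 495 rows).
video = np.arange(3 * 15 * 20, dtype=float).reshape(3, 15, 20)
S_u = extract_subvolumes(video, 11)
print(S_u.shape)
```

With this layout, the first `n_frames_sub` entries of a column hold the x coordinate of the first joint across the frames of that sub-volume, which corresponds to the horizontal bars described for Fig. 3.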

Semantics of space-time-joint sub-volumes
The $i$-th sub-volume $S_i$ described above contains several video frames ($N_f^s$ frames), which may capture components of one or more activities. For large $N_f^s$, there are more frames in a sub-volume, which may capture an entire activity. For small $N_f^s$, there are few frames in a sub-volume, which may capture only a part of an activity. The two extreme cases are $N_f^s = 1$ and $N_f^s$ equal to the total number of frames of the video.
In the following section, we propose to learn a set of dictionaries of codes that can be used to represent complex human activities. The words (i.e., codes) in the dictionaries should be components whose concatenations in the space and time domains constitute representations of human activities. Thus, $N_f^s$ should be neither too small nor too large, so that the sub-volumes are samples of components of human activities. Unfortunately, it is difficult to set a fixed value of $N_f^s$ for all human activities, which may have components at a variety of spatial and temporal scales and may be captured by cameras with a range of imaging parameters. Therefore, we set the values of $N_f^s$ via a learning procedure for the three datasets tested in this paper.

Joint-dictionary learning
We propose a method to learn a set of sparse codes that can be used to represent human activities. Sparse representation is useful for object recognition [25]. A number of algorithms have been proposed to learn sparse features, including restricted Boltzmann machines [35], sparse auto-encoders [36], independent component analysis [37], sparse coding [38], and RICA [25]. Since RICA works well on approximately whitened data and is fast [25], we use RICA to learn a dictionary of codes from a set of sub-volumes $S_i$, $i = 1, 2, \cdots, N_s$, for each activity. The learned dictionary is called the "Joint-dictionary". To the best of our knowledge, this is the first work on feature learning from 3D movements of body joints.
For each activity $c$, $c = 1, 2, \cdots, N_a$ ($N_a$ is the number of activities), we obtain a dictionary $W_c$. Suppose $N_c$ is the total number of sub-volume samples from activity $c$. Then the class-specific dictionary $W_c$ can be obtained by solving the following optimization problem [25]
$$W_c = \arg\min_{W} \frac{\lambda}{N_c} \sum_{i=1}^{N_c} \left\| W^T W S_i^c(:) - S_i^c(:) \right\|_2^2 + \sum_{i=1}^{N_c} \sum_{j=1}^{k} g\left( W_j S_i^c(:) \right) \qquad (7)$$
where $S_i^c$ is the $i$th sub-volume sample from activity $c$; $S_i^c(:)$ is a lexicographical operation on $S_i^c$ to form a column vector; $W_j$ is the $j$th row of $W$; $g(\cdot)$ is a nonlinear convex function (in this paper, the smooth $L_1$ penalty function $g(\cdot) = \log\cosh(\cdot)$ [39]); and $k$ and $\lambda$ are the number of features (rows of $W_c$) and a balancing parameter, respectively.
The objective function in (7) is smooth, so the optimization problem (7) can be solved easily by any unconstrained solver (e.g., L-BFGS or CG [40]).
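To make the smoothness argument concrete, the following sketch evaluates a RICA-style objective of the form used in [25] (a reconstruction term weighted by $\lambda$ plus a $\log\cosh$ sparsity penalty) together with its analytic gradient. It is a minimal illustration under assumptions, not the authors' code: the function name `rica_objective` is hypothetical, samples are assumed to be stored as columns of a matrix `X` that has already been whitened, and the reconstruction term is averaged over the samples.

```python
import numpy as np

def rica_objective(W, X, lam):
    """Value and gradient of a RICA-style objective (illustrative sketch).

    W   : (k, D) dictionary, one word per row
    X   : (D, m) vectorized sub-volumes as columns, assumed whitened
    lam : balancing parameter of the reconstruction term
    """
    m = X.shape[1]
    Z = W @ X                          # (k, m) projections W x_i
    R = W.T @ Z - X                    # (D, m) residuals W^T W x_i - x_i
    reconstruction = (lam / m) * np.sum(R ** 2)
    # log cosh(z) computed stably as log(e^z + e^-z) - log 2
    sparsity = np.sum(np.logaddexp(Z, -Z) - np.log(2.0))
    # Gradient of the reconstruction term is (2 lam / m) * W (X R^T + R X^T);
    # gradient of the sparsity term is tanh(W X) X^T.
    grad = (2.0 * lam / m) * (Z @ R.T + W @ R @ X.T) + np.tanh(Z) @ X.T
    return reconstruction + sparsity, grad

rng = np.random.default_rng(0)
W0 = 0.1 * rng.standard_normal((4, 6))
X0 = rng.standard_normal((6, 20))
value, gradient = rica_objective(W0, X0, lam=0.5)
print(value, gradient.shape)
```

Because the function returns both the value and the gradient, it could be handed to a quasi-Newton routine such as L-BFGS after flattening `W` into a vector; the choice of solver and its settings are left open here.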
We learn a class-specific dictionary $W_c$ for each activity $c$ and pool all the learned class-specific dictionaries $W_c$, $c = 1, 2, \cdots, N_a$, to form a code book $W$ as follows
$$W = \left[ W_1^T, W_2^T, \cdots, W_{N_a}^T \right]^T \qquad (8)$$
The code book $W$ contains $k \times N_a = 400 \times N_a$ words in total. Note that $W$ is over-complete, since the number of words is larger than the size of the sub-volumes. Fig. 4 shows two dictionaries, for "talking on the phone" and "writing on white board". Each dictionary contains 400 words. The words shown in Fig. 4 are used to represent 3D spatial-temporal sub-volumes and are different from conventional words (e.g., oriented bars) learned from 2D natural image patches [25]. These words are the bases of segments of space-time concatenations of body joints, from which any segment of an activity can be constructed linearly. Unfortunately, unlike independent components of natural scenes, which resemble edge elements, the words obtained here are difficult to visualize.

Sparse Histograms
In this section, we propose an approximate sparse coding scheme and compile a set of sparse histograms. Any sample $x$ can be sparsely represented by $W$ as the solution of an $\arg\min$ problem (9) over $s$, where $s$ is the vector of sparse coefficients of sample $x$ represented by dictionary $W$. A number of algorithms have been proposed to solve this sparse representation problem [41]. Instead of solving the optimization problem (9) for each video, which is prohibitively time consuming, we propose to project any sample $x$ onto the code book,
$$s = W x \qquad (10)$$
where $s$ is the vector of coefficients of sample $x$. The $N_s$ (400 in this paper) largest coefficients are kept and the remaining coefficients of $s$ are set to zero to make $s$ sparse.
Note that the dimension of $s$ is $N_a \times 400$ ($N_a$ is the number of activities). The number of kept sparse coefficients (400 in this paper) may seem large, but it is smaller than the dimensionality of the sub-volumes, which is $15 \times 3 \times 11 = 495$ for the CAD-60 database, and much smaller than the dimensionality of the entire video. In Section 6 we show that $N_s$ can be much smaller while good performance on activity recognition can still be achieved by our method. The computation in equation (10) is very fast. Although this is an approximate sparse coding scheme, our results show that this approximation does not impair activity recognition (see Section 6).
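The projection-and-truncation step can be illustrated with a short NumPy sketch. It is written under assumptions that go beyond the text: the helper names `build_codebook` and `approximate_sparse_code` are hypothetical, the class-specific dictionaries are assumed to be stacked row-wise, and "largest" is interpreted as largest in absolute value.

```python
import numpy as np

def build_codebook(dictionaries):
    """Stack class-specific dictionaries (each of shape (k, D)) row-wise."""
    return np.vstack(dictionaries)

def approximate_sparse_code(W, x, n_keep):
    """Project x onto the code book and keep the n_keep largest coefficients.

    Assumption: 'largest' means largest magnitude; the text does not say
    whether signed values or magnitudes are compared.
    """
    s = W @ x                                   # projection coefficients
    n_drop = max(s.size - n_keep, 0)
    order = np.argsort(np.abs(s))               # ascending magnitude
    s[order[:n_drop]] = 0.0                     # zero out the smallest ones
    return s

# Toy example: two "activities" with 3 words each on 4-dimensional samples.
rng = np.random.default_rng(0)
W = build_codebook([rng.standard_normal((3, 4)), rng.standard_normal((3, 4))])
code = approximate_sparse_code(W, rng.standard_normal(4), n_keep=2)
print(W.shape, np.count_nonzero(code))
```

The cost per sample is one matrix-vector product plus a sort, which is why this scheme is far cheaper than solving an optimization problem for every sub-volume.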
We then obtain the histogram h of nonzero coefficients of samples of a video u by counting the number of occurrences of nonzero coefficients for each word in W. Thus, the ith component of h is the number of occurrences of the ith word that appears in video u.
The sparsity degrees of the two histograms in Fig. 5 are 10.375% and 13.104%, respectively. Thus, the histograms constructed this way are sparse.
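The histogram construction can be sketched as follows. This is a hedged illustration rather than the authors' procedure: the function name `sparse_histogram` is hypothetical, coefficients are ranked by magnitude, no normalization is applied, and the "sparsity degree" is taken here to mean the fraction of nonzero bins, which is an assumption since the text does not define it.

```python
import numpy as np

def sparse_histogram(W, samples, n_keep):
    """Count, for each word, how often it has a nonzero coefficient in a video.

    W       : (n_words, D) code book
    samples : (D, n_samples) vectorized sub-volumes of one video, as columns
    n_keep  : number of coefficients kept per sample (ranked by magnitude)
    """
    coeffs = W @ samples                              # (n_words, n_samples)
    n_drop = max(coeffs.shape[0] - n_keep, 0)
    if n_drop > 0:
        order = np.argsort(np.abs(coeffs), axis=0)    # ascending per column
        np.put_along_axis(coeffs, order[:n_drop], 0.0, axis=0)
    return np.count_nonzero(coeffs, axis=1)

W = np.eye(4)
samples = np.array([[3.0, 0.1],
                    [1.0, 5.0],
                    [0.2, 4.0],
                    [0.1, 0.3]])
h = sparse_histogram(W, samples, n_keep=2)
# Assumed definition: share of bins that are nonzero.
sparsity_degree = np.count_nonzero(h) / h.size
print(h, sparsity_degree)
```

In this toy case the first sample activates words 0 and 1 and the second activates words 1 and 2, so word 1 is counted twice and word 3 never appears.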
Note that the histogram bins in Fig. 5 have more or less the same height (about 0.3). This may be due to similar words in the dictionaries for the activities in the dataset. Since a dictionary is learned from each activity independently, it is likely that some words are shared by more than one activity. It is worth pointing out, though, that shared words do not impair the performance of our algorithm.

Classification
We compile a sparse histogram for each activity and use it as a feature for recognition via a multi-class SVM. In this procedure, we train one SVM in a one-vs.-rest scheme for each activity; use the homogeneous kernel map expansion [42] with a "$\chi$-square" kernel to expand the dimensionality of the feature by a factor of 2;  Fig. 6 and Fig. 7, respectively.
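For readers unfamiliar with homogeneous kernel maps, the sketch below shows the general idea for the $\chi$-square kernel $k(x, y) = 2xy/(x + y)$: each non-negative histogram bin is expanded into a few components whose inner product approximates the kernel, so that a linear SVM can then be trained on the expanded features. This is a minimal illustration of the sampled-spectrum construction, not the library implementation used in [42]; the function name `chi2_feature_map` and the parameters `order` and `step` are assumptions, and with these settings each bin is expanded into $2 \cdot \mathrm{order} + 1$ components rather than the factor reported in the paper.

```python
import numpy as np

def chi2_feature_map(X, order=3, step=0.5):
    """Approximate explicit feature map for the chi-square kernel (sketch).

    X     : non-negative features, shape (d,) or (n, d)
    order : number of sampled frequencies (assumed value)
    step  : sampling step of the kernel spectrum (assumed value)
    The spectrum of the chi-square kernel is sech(pi * frequency).
    """
    X = np.asarray(X, dtype=float)
    positive = X > 0
    safe = np.where(positive, X, 1.0)          # avoid log(0); masked out below
    log_x = np.log(safe)
    parts = [np.sqrt(safe * step)]             # zero-frequency term, sech(0) = 1
    for j in range(1, order + 1):
        kappa = 1.0 / np.cosh(np.pi * j * step)
        scale = np.sqrt(2.0 * safe * step * kappa)
        parts.append(scale * np.cos(j * step * log_x))
        parts.append(scale * np.sin(j * step * log_x))
    out = np.concatenate(parts, axis=-1)
    mask = np.concatenate([positive] * (2 * order + 1), axis=-1)
    return np.where(mask, out, 0.0)            # empty bins map to zero

x = np.array([0.2, 0.5, 0.1, 0.0])
y = np.array([0.5, 0.5, 0.3, 0.4])
approx = chi2_feature_map(x) @ chi2_feature_map(y)
exact = sum(2 * a * b / (a + b) for a, b in zip(x, y) if a + b > 0)
print(approx, exact)
```

The two printed numbers should agree to within a few percent; a finer `step` combined with a higher `order` tightens the approximation at the cost of a longer feature vector.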

Results
We tested our algorithm on three publicly available datasets: the Cornell Activity Dataset-60 (CAD-60) [8], the MSR Action3D [43], and the MSR Daily Activity 3D [22]. Our results show that the model proposed here is better than the state-of-the-art methods.

CAD-60 dataset
The CAD-60 dataset is an RGBD dataset acquired with a Microsoft Kinect sensor at 30 Hz and has a resolution of $640 \times 480$ pixels [8] (Dataset S1). The 3D coordinates of 15 joints are the real-time outputs of the skeleton tracking algorithm of the sensor [44]. The dataset contains 14 human activities performed indoors by 4 subjects (two males and two females) for about 45 seconds. The total number of frames of each activity of each person is about one thousand. We follow the "new person" setting in [8], where data of 3 subjects are used for training and the remaining subject for testing. To improve recognition performance, we mirrored the joints of the left-handed subject to make her activities similar to those of the other 3 right-handed subjects, which is a common practice. Briefly, a plane $P$ was first found by fitting four joints: left arm, right arm, left hip, and right hip. Then, a mirror plane $P_m$ was computed under the constraints that $P_m$ is perpendicular to $P$ and passes through the middle point between the two arm joints and through the middle point between the two hip joints. Finally, all joints of the left-handed subject were mirrored with respect to $P_m$. Fig. 8 shows 4 confusion matrices for the four cases in which three subjects are chosen for training and the remaining subject for testing. We compare our results to those of 9 algorithms in terms of average accuracy, precision, and recall in Table 1. The results of the other algorithms are from the website http://pr.cs.cornell.edu/humanactivities/results.php, which reports results on the dataset. As shown in Table 1, our algorithm is the best in terms of accuracy, precision, and recall on this dataset. Since some authors reported the performance of their algorithms in terms of only some of the above metrics, there are blank cells in Table 1.
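The mirroring step lends itself to a small geometric sketch. The code below is one possible reading of the description, not the authors' implementation: the function name `mirror_skeleton` and the joint index arguments are hypothetical, the plane $P$ is fitted by least squares via a singular value decomposition, and the swapping of left and right joint labels that one might apply after reflection is not shown.

```python
import numpy as np

def mirror_skeleton(joints, l_arm, r_arm, l_hip, r_hip):
    """Reflect all joints of one frame about the body's mirror plane P_m.

    joints : (N, 3) array of 3D joint positions
    l_arm, r_arm, l_hip, r_hip : row indices of the four reference joints
    """
    anchors = joints[[l_arm, r_arm, l_hip, r_hip]]
    centered = anchors - anchors.mean(axis=0)
    # Normal of the fitted plane P: right singular vector with the smallest
    # singular value of the centered reference joints.
    normal_p = np.linalg.svd(centered)[2][-1]
    mid_arm = 0.5 * (joints[l_arm] + joints[r_arm])
    mid_hip = 0.5 * (joints[l_hip] + joints[r_hip])
    # P_m contains the line through the two midpoints and the normal of P,
    # so its own normal is perpendicular to both.
    normal_m = np.cross(normal_p, mid_arm - mid_hip)
    normal_m = normal_m / np.linalg.norm(normal_m)
    signed_dist = (joints - mid_hip) @ normal_m
    return joints - 2.0 * np.outer(signed_dist, normal_m)

# Toy skeleton: left arm, right arm, left hip, right hip, and one extra joint.
skeleton = np.array([[-1.0, 1.0, 0.0],
                     [ 1.0, 1.0, 0.0],
                     [-0.5, 0.0, 0.0],
                     [ 0.5, 0.0, 0.0],
                     [ 0.3, 2.0, 0.5]])
print(mirror_skeleton(skeleton, 0, 1, 2, 3))
```

For this symmetric toy skeleton the mirror plane is $x = 0$, so the left arm joint lands on the position of the right arm joint, and applying the function twice returns the original coordinates.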

MSR Action3D dataset
The MSR Action3D dataset contains 20 activities acquired from 10 subjects, each of whom performed each activity 2 or 3 times. The resolution is $320 \times 240$ pixels and the frame rate is 16 Hz. The dataset provides the 3D movement data of 20 joints per person. We used 557 of the 567 videos in the dataset, since 10 videos have missing or erroneous joints [22] (Dataset S2).
To allow fair comparison, we followed the same setting as [22]: subjects Nos. 1, 3, 5, 7, and 9 as the training set and subjects Nos. 2, 4, 6, 8, and 10 as the testing set. The 20 actions are divided into three subsets, AS1, AS2, and AS3, according to the experimental setting in [22,43]; the subsets are listed in Table 2. The accuracy of our algorithm on AS1, AS2, and AS3 is 87.62%, 87.5%, and 97.3%, respectively. The average accuracy on the dataset is 90.81%. The three confusion matrices for AS1, AS2, and AS3 are shown in Fig. 9. Our algorithm thus performs better on AS3 than on AS1 and AS2. Table 3 compares the performance of our model to that of 9 other methods. The accuracies of the other methods are from a recent paper [29]. The performance of our model (90.81%) is the best.

MSR Daily Activity 3D dataset
The MSR Daily Activity 3D dataset contains 16 activities, each of which was performed twice by 10 subjects [22] (Dataset S3). The dataset contains 320 videos in each of 3 channels: RGB, depth, and joint. Twenty body joints are recorded, and their positions are quite noisy because of two poses: "sitting on sofa" and "standing close to sofa".
The experimental setting is the same as in [22], which split the dataset into 3 subsets, AS1, AS2, and AS3, as listed in Table 4. We followed the same setting as [22]: subjects Nos. 1, 3, 5, 7, and 9 as the training set and subjects Nos. 2, 4, 6, 8, and 10 as the testing set. The accuracy of our algorithm on AS1, AS2, and AS3 is 71.67%, 81.25%, and 85.00%, respectively, and the average accuracy is 79.31%. The confusion matrices are shown in Fig. 10. Our algorithm performs better on AS3 than on AS1 and AS2. Table 5 lists the results of our model and several other methods. The results of the other methods are from a recent paper [22]. The accuracy of our model is 79.31%, which is lower than the best result (85.75%). However, only joint information is used in our model.
Table 3. Performance of our model and other methods on the MSR Action 3D dataset.

Comparison with a baseline method
We have evaluated the performance of our method on three public datasets. Our method has four steps: generating samples, learning dictionaries, constructing sparse histograms, and classifying via SVMs. In this section, we replace the RICA-based dictionary learning in our method with k-means clustering. We cluster the samples with the k-means algorithm and take the cluster centers as words in the dictionaries. We call this the baseline method. The results of the baseline method and of our original method on the three datasets are shown in Table 6. Both methods perform well, with our original method being slightly better. Thus, the joint dictionaries and sparse histograms used in both methods are responsible for the good performance.
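As an illustration of the baseline, the sketch below learns a dictionary with plain Lloyd-style k-means and returns the cluster centers as words. It is only a sketch under assumptions: the function name `kmeans_dictionary`, the random initialization from data points, the fixed number of iterations, and the handling of empty clusters (the previous center is kept) are all choices made here, not details given in the text.

```python
import numpy as np

def kmeans_dictionary(samples, n_words, n_iter=50, seed=0):
    """Learn a dictionary by k-means; rows of the result are the words.

    samples : (D, m) vectorized sub-volumes as columns
    n_words : number of clusters, i.e. words in the dictionary
    """
    rng = np.random.default_rng(seed)
    X = samples.T                                          # (m, D), one sample per row
    start = rng.choice(len(X), size=n_words, replace=False)
    centers = X[start].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(n_words):
            members = X[labels == k]
            if len(members):                               # keep old center if empty
                centers[k] = members.mean(axis=0)
    return centers

# Toy data: two tight groups of 2-D samples.
rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 0.01, size=(2, 30))
group_b = rng.normal(10.0, 0.01, size=(2, 30))
words = kmeans_dictionary(np.hstack([group_a, group_b]), n_words=2)
print(words)
```

The resulting matrix has the same layout as a class-specific dictionary with one word per row, so it could be pooled into a code book and used in the projection step in the same way.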

Parameter setting and time performance
There are seven parameters in our model: $N_f^s$, the sampling size along the z-direction; $N_w$, the number of words in each class-specific dictionary; $\lambda$, the balancing parameter in Eq. 7; $N_s$, the number of the largest coefficients; $N_t$, the factor by which the dimensionality of the feature vector is expanded; $\gamma$, the parameter of the $\chi$-square kernel; and $\lambda_{sum}$, the balancing parameter of the SVM. These parameters are probably independent of each other, since they belong to different phases of our algorithm: sampling, dictionary learning, sparse histogram construction, and SVM training.
Of the seven parameters, the sampling size $N_f^s$, the number of words $N_w$, and the number of the largest coefficients $N_s$ are new in our algorithm, while the other parameters have appeared in other published studies [25,42]. Therefore, we explore how to choose the values of these three parameters while setting the other parameters to the values recommended by other researchers [25,42]. We ran our algorithm with different parameter values on the CAD-60 dataset. Fig. 11 shows the average accuracy as a function of the sampling size $N_f^s$ when $N_w = 400$ and $N_s = 400$; Fig. 12 shows the average accuracy as a function of the number of words $N_w$ when $N_f^s = 11$ and $N_s = 400$; and Fig. 13 shows the average accuracy as a function of the number of the largest coefficients $N_s$ when $N_f^s = 11$ and $N_w = 400$. The good recognition results obtained under a wide range of parameter settings show that our method is not sensitive to parameter values. Therefore, setting the parameters of our algorithm for good recognition performance is not challenging.
The values of the parameters for all the experiments are listed in Table 7. For simplicity, we set the parameter values to be the same for the three databases, except for $N_f^s$, the sampling size along the z-direction, which may depend on the speed of the activities and the frame rate of the videos. As shown in Tables 1, 3 and Figs. 8-13, there is a range of parameter values in our method that lead to very good performance, which may be further improved by finely tuned parameter values.
The proposed algorithm was implemented in Matlab without any programming optimization. We evaluated the time performance of our method on an Intel(R) Core(TM)2 Duo CPU E8600@3.33 GHz with the 64-bit Windows 7 Professional SP1 OS. Only one of the two available cores was used, with single-threaded programming. We report 5 measures on the three datasets in Table 8: the average training time (ATT), the average testing time per video (ATTPV), the average number of training videos (ANTV), the average number of test videos (ANOTV), and the average number of training classes (ANTC).
As shown in the table, our method took 0.50, 0.03, and 0.10 seconds per video to classify the activities of the CAD-60 dataset, the MSR Action3D dataset, and the MSR Daily Activity 3D dataset, respectively. The training time was 513.43 seconds, 73.02 seconds, and 125.60 seconds on the CAD-60 dataset, the MSR Action3D dataset, and the MSR Daily Activity 3D dataset, respectively. This time performance can be improved significantly by optimized C++ code running on much faster CPUs. Therefore, our model is a real-time method that can be used in smart environments and deployed in robots for human-robot collaboration.

Discussion
In this paper we proposed a real-time algorithm that makes use of joint information to recognize human activities. In the first step of the algorithm, videos of 3D movements of body joints are sampled to obtain a set of spatial-temporal 3D volumes, which capture the complex spatial-temporal relationships of joints in human activities at a data size that is much smaller than that of an RGBD volume. Second, RICA is performed on the spatial-temporal 3D volumes to obtain a set of dictionaries of codes that form a sparse representation of human activities. An approximate sparse coding scheme is then used to compile a set of sparse histograms as features for activity recognition. Finally, a multi-class SVM is used to perform activity recognition. We performed extensive tests of this algorithm on three widely used datasets of human activities. Our results show that this algorithm produces the best recognition accuracy so far on these datasets. Our algorithm automatically learns discriminative features for activity recognition and is very fast and easy to implement. Since joint information can be obtained by low-cost cameras such as the Microsoft Kinect system, our algorithm can be used in smart environments and deployed in robots for human-robot collaboration. This model can be improved by using the rich information in depth images. To include this information, we will extend the model presented here and our recent model of activity recognition based on multi-scale activity structures [45].

Acknowledgments
We thank Drs. He Cui and Suxibing Liu for helpful comments.