Hessian-Regularized Co-Training for Social Activity Recognition

Co-training is a major multi-view learning paradigm that alternately trains two classifiers on two distinct views and maximizes the mutual agreement on the two-view unlabeled data. Traditional co-training algorithms usually train a learner on each view separately and then force the learners to be consistent across views. Although many co-training variants have been developed, it is quite possible that a learner will receive erroneous labels for unlabeled data when the other learner has only mediocre accuracy. This usually happens in the first rounds of co-training, when there are only a few labeled examples. As a result, co-training algorithms often have unstable performance. In this paper, Hessian-regularized co-training is proposed to overcome these limitations. Specifically, each Hessian is obtained from a particular view of examples; Hessian regularization is then integrated into the learner training process of each view by penalizing the regression function along the potential manifold. The Hessian can properly exploit the local structure of the underlying data manifold, and Hessian regularization significantly boosts the generalizability of a classifier, especially when there are a small number of labeled examples and a large number of unlabeled examples. To evaluate the proposed method, extensive experiments were conducted on the unstructured social activity attribute (USAA) dataset for social activity recognition. Our results demonstrate that the proposed method outperforms baseline methods, including the traditional co-training and LapCo algorithms.


Introduction
The rapid development of Internet technology and computer hardware has resulted in an exponential increase in the quantity of data uploaded and shared on media platforms [1] [2]. Processing these data presents a major challenge to machine learning, especially since most of the data are unlabeled and are described by multiple representations in different computer vision applications [3] [4]. One of the earliest multi-view learning schemes was co-training, in which two classifiers are alternately trained on two distinct views in order to maximize the mutual agreement between the two views of unlabeled data [5]. In general, the co-training algorithms train a learner on each view separately and then force the learners to be consistent across views.
A number of co-training approaches have been proposed in many applications [6] [7] [8] [9] since the original implementation [10] [11] and can be divided into four groups: (1) co-EM [12] [13]; (2) co-regression [14] [15]; (3) co-regularization [16]; and (4) co-clustering. The co-EM algorithm combines co-training with the probabilistic EM approach by using naive Bayes as the underlying learner [12]. Brefeld and Scheffer [13] subsequently developed a co-EM version of support vector machines (SVMs). The co-regression algorithm extends co-training to regression problems; for example, Zhou and Li [14] employed two k-nearest neighbor regressors with different distance metrics to develop a co-training style semi-supervised regression algorithm, and Brefeld et al. [15] investigated a semi-supervised least squares regression algorithm based on the co-learning schema. The co-regularization algorithm formulates co-training as a joint complexity regularization between the two hypothesis spaces, each of which contains a predictor approximating the target function [16]. The co-clustering algorithms [17] [18] [19] apply the idea of co-training to unsupervised learning settings with the assumption that a point will be assigned to the same cluster in each view by the true underlying clustering.
Although many co-training variants have been developed, most co-training-style methods aim to obtain satisfactory performance in multi-view learning by minimizing the disagreement between two classifiers. However, it is likely that a learner will receive erroneous labels on unlabeled data when the other learner has only mediocre accuracy. This usually happens in the first rounds of co-training, when there are only a few labeled examples.
To address the aforementioned problem, here we propose Hessian-regularized co-training, in which Hessian regularization is integrated into the learner training process of each view to significantly boost performance. Specifically, each Hessian is obtained from a particular view of examples and is then used to penalize the classifier along the potential manifold. Compared with other manifold regularizations, e.g. Laplacian regularization, the Hessian has a richer nullspace and steers the learned function to vary linearly along the underlying manifold. Thus Hessians can properly exploit the local distribution geometry of the underlying data manifold [20]. To evaluate the proposed Hessian regularized co-training, we conduct extensive experiments on the unstructured social activity attribute (USAA) dataset [22] [23] for social activity recognition [24] [25] [26] [27]. The USAA dataset contains eight different semantic class videos, which are home videos of social occasions, including birthday parties, graduation parties, music performances, non-music performances, parades, wedding ceremonies, wedding dances, and wedding receptions. We compare the proposed Hessian regularized co-training (HesCo) with traditional co-training and Laplacian regularized co-training (LapCo). The experimental results demonstrate that the proposed method outperforms the baseline algorithms.

Method Overview
In the standard co-training setting, we are given a two-view data set consisting of $l$ labeled examples and $u$ unlabeled examples. On the other hand, manifold learning assumes that close example pairs $x_i$ and $x_j$ will have similar conditional distribution pairs $P(y|x_i)$ and $P(y|x_j)$ [28]. It is therefore important to properly exploit the intrinsic geometry of the manifold $M$ that supports $P_X$, and here we employ Hessian regularization to explore the geometry of the underlying manifold. Hessian regularization penalizes the second derivative along the manifold. This approach ensures that the learner is steered linearly along the data manifold, and it is superior to first order manifold learning algorithms, including Laplacian regularization, for both classification and regression [29] [30] [31]. The effectiveness of Hessian regularization has been well explored by Eells [32], Donoho [21], and Kim [20].
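The difference between first-order and second-order penalties can be illustrated on a toy example. The sketch below is only an illustration under strong simplifying assumptions: a one-dimensional chain of equally spaced points stands in for the manifold, squared first differences stand in for a Laplacian-style penalty, and squared second differences stand in for a Hessian-style penalty. The function names are hypothetical and are not part of the proposed method.

```python
# Illustrative only: a 1-D chain of equally spaced points stands in for the manifold.

def first_order_penalty(f):
    """Sum of squared first differences (a Laplacian-style smoothness penalty)."""
    return sum((f[i + 1] - f[i]) ** 2 for i in range(len(f) - 1))


def second_order_penalty(f):
    """Sum of squared second differences (a Hessian-style curvature penalty)."""
    return sum((f[i + 1] - 2 * f[i] + f[i - 1]) ** 2 for i in range(1, len(f) - 1))


xs = list(range(10))
constant = [3.0 for _ in xs]
linear = [2.0 * x + 1.0 for x in xs]
quadratic = [float(x * x) for x in xs]

for name, f in [("constant", constant), ("linear", linear), ("quadratic", quadratic)]:
    print(name, first_order_penalty(f), second_order_penalty(f))
```

In this toy setting only the constant function has zero first-order penalty, whereas both the constant and the linear function have zero second-order penalty. This is the sense in which a second-order penalty has a richer nullspace and does not bias the learner against functions that vary linearly.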
For convenience, we list the important notations used in this paper in Table 1.
In this section, we first briefly introduce Hessian regularization derived from Hessian eigenmaps [21] [20]. We then present the Hessian-regularized support vector machine (HesSVM), which is applied as the classifier for each view of co-training. Finally, we summarize the proposed Hessian regularized co-training.

Hessian regularization
The Hessian $H_f(p)$ of a function $f$ at a point $p$ depends on the choice of the coordinate system used in the underlying tangent space $T_p(M)$. Fortunately, the usual Frobenius norm of a Hessian matrix is invariant to coordinate changes [21]. Therefore, we have the Hessian regularizer that measures the average curviness of $f$ along the manifold $M$:
$$\mathcal{H}(f) = \int_M \|H_f(p)\|_F^2 \, dp,$$
where $\|A\|_F$ is the usual Frobenius norm of matrix $A$.
Step 1: Find the k-nearest neighbors $N_p$ of sample $x_i$ and form a matrix $X_i$ whose rows consist of the centralized examples $x_j - x_i$ for all $x_j \in N_p$.
Step 2: Estimate the orthonormal coordinate system of the tangent space $T_{x_i}(M)$ by performing a singular value decomposition $X_i = U D V^T$.
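Steps 1 and 2 above can be sketched with NumPy as follows. This is a minimal sketch, not the authors' implementation: the function name `tangent_coordinates`, the Euclidean neighbor search, and the assumed intrinsic dimension `d` are illustrative assumptions, and the later steps of the Hessian estimation are not covered.

```python
import numpy as np


def tangent_coordinates(X, i, k, d):
    """Estimate local tangent coordinates around X[i] (Steps 1 and 2 only).

    X: (n, D) data matrix; k: number of neighbors; d: assumed manifold dimension.
    """
    # Step 1: k nearest neighbors of x_i (excluding x_i itself), centered at x_i.
    dists = np.linalg.norm(X - X[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]
    Xi = X[neighbors] - X[i]
    # Step 2: SVD of the centered neighborhood; the top-d right singular
    # vectors give an orthonormal basis of the estimated tangent space.
    U, D, Vt = np.linalg.svd(Xi, full_matrices=False)
    basis = Vt[:d].T        # (D, d) matrix with orthonormal columns
    coords = Xi @ basis     # (k, d) tangent coordinates of the neighbors
    return neighbors, basis, coords


rng = np.random.default_rng(0)
# Toy data: points on a 2-D plane embedded in 3-D (third coordinate is zero).
X = np.column_stack([rng.normal(size=(50, 2)), np.zeros(50)])
neighbors, basis, coords = tangent_coordinates(X, i=0, k=10, d=2)
```

On this toy plane, the recovered basis should span the first two coordinate axes, so its third row is numerically zero.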

The Hessian-regularized support vector machine (HesSVM)
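The Hessian-regularized support vector machine (HesSVM) for binary classification takes the form of the following optimization problem:
$$f^* = \arg\min_{f \in \mathcal{H}_K} \frac{1}{l} \sum_{i=1}^{l} \left(1 - y_i f(x_i)\right)_+ + \gamma_A \|f\|_K^2 + \gamma_I \mathbf{f}^T H \mathbf{f}, \qquad (1)$$
where $(1 - y_i f(x_i))_+ = \max(0, 1 - y_i f(x_i))$ is the hinge loss, $\|f\|_K^2$ is the classifier complexity penalty term in an appropriate reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$, $H$ is the Hessian matrix, $\mathbf{f} = [f(x_1), \ldots, f(x_{l+u})]^T$, and the term $\mathbf{f}^T H \mathbf{f}$ is the Hessian regularizer that penalizes $f$ along the manifold $M$. Parameters $\gamma_A$ and $\gamma_I$ balance the loss function and the regularization terms, respectively.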
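According to the representer theorem [28], the solution to problem (1) is given by
$$f^*(x) = \sum_{i=1}^{l+u} \alpha_i^* K(x, x_i). \qquad (2)$$
By substituting (2) back into (1) and introducing the slack variables $\xi_i$ for $1 \le i \le l$ and an unregularized bias term $b$, the primal problem of HesSVM is the following:
$$\min_{\alpha \in \mathbb{R}^{l+u},\ \xi \in \mathbb{R}^{l}} \frac{1}{l} \sum_{i=1}^{l} \xi_i + \gamma_A \alpha^T K \alpha + \gamma_I \alpha^T K H K \alpha \quad \text{s.t.} \quad y_i \Big( \sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b \Big) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \ldots, l. \qquad (3)$$
Using the Lagrangian method, the solution to (3) is
$$\alpha^* = (2\gamma_A I + 2\gamma_I H K)^{-1} J^T Y \beta^*, \qquad (4)$$
where the identity matrix inside the inverse is of size $(l+u) \times (l+u)$, $J = [I\ \ 0]$ is an $l \times (l+u)$ matrix with $I$ as the $l \times l$ identity matrix and $0$ as the $l \times u$ zero matrix, $Y = \mathrm{diag}(y_1, y_2, \ldots, y_l)$, and $\beta^*$ is the solution to the following problem:
$$\beta^* = \arg\max_{\beta \in \mathbb{R}^{l}} \sum_{i=1}^{l} \beta_i - \frac{1}{2} \beta^T Q \beta \quad \text{s.t.} \quad \sum_{i=1}^{l} \beta_i y_i = 0,\ \ 0 \le \beta_i \le \frac{1}{l},\ \ i = 1, \ldots, l, \qquad (5)$$
where $Q = Y J K (2\gamma_A I + 2\gamma_I H K)^{-1} J^T Y$. Problem (3) is thus transformed into the standard quadratic programming problem (5), which can be solved using an SVM solver.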
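The matrix $Q = Y J K (2\gamma_A I + 2\gamma_I H K)^{-1} J^T Y$ and the mapping from the dual variables to the expansion coefficients can be assembled in a few lines of NumPy. The sketch below is a hedged illustration: the function name `hessvm_dual_terms` is hypothetical, the mapping $\alpha = (2\gamma_A I + 2\gamma_I H K)^{-1} J^T Y \beta$ is assumed to follow the usual manifold-regularized SVM derivation, and the toy matrix `H` is only a positive semidefinite placeholder rather than an estimated Hessian. The quadratic program itself is left to an external SVM or QP solver.

```python
import numpy as np


def hessvm_dual_terms(K, H, y, gamma_A, gamma_I):
    """Return Q and a function mapping dual variables beta to coefficients alpha.

    K: (n, n) Gram matrix over l labeled + u unlabeled examples (labeled first).
    H: (n, n) Hessian regularization matrix. y: (l,) labels in {-1, +1}.
    """
    n, l = K.shape[0], y.shape[0]
    J = np.hstack([np.eye(l), np.zeros((l, n - l))])   # J = [I 0], size l x (l+u)
    Y = np.diag(y.astype(float))
    M = 2.0 * gamma_A * np.eye(n) + 2.0 * gamma_I * H @ K
    # Solve M Z = J^T Y instead of forming the inverse explicitly.
    Z = np.linalg.solve(M, J.T @ Y)                    # (n, l)
    Q = Y @ J @ K @ Z                                  # (l, l)
    return Q, (lambda beta: Z @ beta)


rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq)                      # RBF Gram matrix
A = rng.normal(size=(12, 12))
H = A.T @ A / 12.0                         # placeholder PSD matrix, NOT a real Hessian estimate
y = np.array([1, -1, 1, -1])               # the first l = 4 examples are labeled
Q, alpha_from_beta = hessvm_dual_terms(K, H, y, gamma_A=1.0, gamma_I=0.1)
alpha = alpha_from_beta(np.full(4, 0.1))   # an arbitrary beta with sum(beta_i * y_i) = 0
```

Solving the linear system once and reusing the result for both `Q` and the coefficient mapping avoids an explicit matrix inverse, which is the dominant cost for dense Gram matrices.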

Hessian regularized co-training (HesCo)
Similar to standard co-training algorithms, HesCo also iteratively learns the classifiers from the labeled and unlabeled training examples. In each iteration, HesSVM exploits the local geometry to significantly boost the prediction of unlabeled examples, which helps to effectively augment the training set and update the classifiers. Table 2 summarizes the procedure of HesCo by integrating HesSVM into CoTrade [33].
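The iterative structure described above can be sketched as a generic two-view co-training loop. This is not the CoTrade procedure and not the HesCo algorithm itself: the confidence measure (absolute decision score), the parameters `rounds` and `per_round`, and the ridge-regression base learner `fit_ridge` are all illustrative assumptions, with the ridge learner standing in where HesSVM would be trained on each view.

```python
import numpy as np


def fit_ridge(X, y, lam=1e-2):
    """Stand-in base learner (NOT HesSVM): linear ridge regression on +/-1 labels."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def co_train(views, y, labeled, rounds=5, per_round=2):
    """Generic two-view co-training loop.

    views: [X1, X2], each (n, d_k); y: (n,) with +/-1 at the labeled indices.
    labeled: indices of the initially labeled examples.
    """
    # Each view keeps its own labeled pool, grown by the other view's predictions.
    pools = [list(labeled), list(labeled)]
    labels = [y.astype(float).copy(), y.astype(float).copy()]
    for _ in range(rounds):
        weights = [fit_ridge(views[k][pools[k]], labels[k][pools[k]]) for k in (0, 1)]
        for k in (0, 1):
            other = 1 - k
            candidates = [i for i in range(len(y)) if i not in pools[other]]
            if not candidates:
                continue
            scores = views[k][candidates] @ weights[k]
            # The most confident predictions of view k become pseudo-labels for the other view.
            top = np.argsort(-np.abs(scores))[:per_round]
            for t in top:
                idx = candidates[t]
                labels[other][idx] = 1.0 if scores[t] >= 0 else -1.0
                pools[other].append(idx)
    # Final classifiers trained on the augmented pools.
    return [fit_ridge(views[k][pools[k]], labels[k][pools[k]]) for k in (0, 1)]


rng = np.random.default_rng(0)
n = 40
y_true = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)
X1 = 3.0 * y_true[:, None] + 0.5 * rng.normal(size=(n, 2))   # toy view 1
X2 = 3.0 * y_true[:, None] + 0.5 * rng.normal(size=(n, 3))   # toy view 2
y_init = np.zeros(n)
y_init[:4] = y_true[:4]                                      # only four labeled examples
w1, w2 = co_train([X1, X2], y_init, labeled=[0, 1, 2, 3])
```

The loop makes the failure mode discussed in the introduction visible: if the classifier of one view is inaccurate in the early rounds, its wrong pseudo-labels enter the pool of the other view, which is why a stronger per-view learner matters.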

Complexity analysis
Suppose we are given $n$ examples. Computing the inverse of a dense Gram matrix costs $O(n^3)$, and general HesSVM implementations typically have a training time complexity that scales between $O(n)$ and $O(n^{2.3})$. Hence, in each iteration of co-training, the time cost is approximately $O(n^3)$. Denoting the number of iterations as $g$, the total cost of the proposed method is about $g \cdot O(n^3)$.

Experiments
We conducted experiments for social activity recognition on the USAA database [22] [23]. The USAA database is a subset of the CCV database [34] and contains eight different semantic class videos, as described above.
In our co-training experiments, we used tagging features as one view and visual features as the other. The tagging features are the 69 ground-truth attributes provided by Fu et al. [22] [23], and the visual features are low-level features that concatenate SIFT, STIP, and MFCC according to [34].
We used the same training/testing partition as in [22] and [23], in which the training set contains 735 videos and the testing set contains 731 videos. Each class contains around 100 videos for training and testing, respectively. In our experiments, we selected any two of the eight classes to evaluate performance, resulting in a total of 28 one vs. one binary classification experiments. We randomly divided the training set 10 times to examine the robustness of the different methods. In each experiment, we selected 10%, 20%, 30%, 40%, and 50% of the training videos as labeled examples, and the rest as unlabeled examples, for initialization assignment. Parameters $\gamma_A$ and $\gamma_I$ in HesSVM and LapSVM were tuned using the candidate set $\{10^e \mid e = -10, -9, \ldots, 9, 10\}$. The parameter $k$, which denotes the number of neighbors when computing the Hessian and graph Laplacian, was set to 100.
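The size of this protocol follows directly from the numbers above and can be enumerated as a quick check. The variable names below are illustrative only, and the assumption that both regularization parameters are searched jointly over the same grid is ours.

```python
from itertools import combinations

classes = ["birthday party", "graduation party", "music performance",
           "non-music performance", "parade", "wedding ceremony",
           "wedding dance", "wedding reception"]

# All one-vs-one binary tasks over the eight classes.
pairs = list(combinations(classes, 2))

# Candidate values 10^e for e = -10, ..., 10, used for both regularization parameters.
gamma_grid = [10.0 ** e for e in range(-10, 11)]

# Fractions of the training videos used as labeled examples.
labeled_fractions = [0.1, 0.2, 0.3, 0.4, 0.5]

print(len(pairs), len(gamma_grid), len(gamma_grid) ** 2)
```

Eight classes give 28 class pairs, and the grid contains 21 candidate values per parameter, so a joint search over both parameters would cover 441 combinations per task and labeled fraction.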
We compared the proposed HesCo with CoTrade and Laplacian regularized co-training (LapCo). The accuracy and mean accuracy (MA) for all classes were used as assessment criteria. Figure 1 shows the confusion matrix for the CoTrade method on the eight social activity classes. The subfigures correspond to the performance of the algorithm using different numbers of labeled examples. The x- and y-coordinates are the class labels. Figure 4 shows the MA boxplots for the different co-training methods, with each subfigure corresponding to one case of labeled examples. LapCo and HesCo both perform better than CoTrade, and HesCo outperforms LapCo. Figure 5 shows the accuracy of the different methods for the eight activity classes. Each subfigure corresponds to one activity class in the dataset, and the x-coordinate is the number of labeled examples. Manifold regularized co-training methods, including LapCo and HesCo, significantly boost performance for every activity class, especially when the number of labeled examples is small. HesCo outperforms LapCo in most cases.

Conclusion
Here we propose Hessian regularized co-training (HesCo) to boost co-training performance. In this method, each Hessian is first obtained from a particular view of examples. Second, Hessian regularization is used to explore the local geometry of the underlying manifold for the training of the classifier. Hessian regularization significantly boosts the performance of the learners and thereby improves the effectiveness of augmenting the training set in each co-training round. Comprehensive experiments on social activity recognition on the USAA dataset were conducted to evaluate the proposed HesCo algorithm. The results demonstrate that HesCo outperforms baseline methods, including the traditional co-training algorithm and Laplacian regularized co-training, especially with small numbers of labeled examples.