3D Face Recognition Based on Multiple Keypoint Descriptors and Sparse Representation

Recent years have witnessed a growing interest in developing methods for 3D face recognition. However, 3D scans often suffer from the problems of missing parts, large facial expressions, and occlusions. To be useful in real-world applications, a 3D face recognition approach should be able to handle these challenges. In this paper, we propose a novel general approach to deal with the 3D face recognition problem by making use of multiple keypoint descriptors (MKD) and the sparse representation-based classification (SRC). We call the proposed method 3DMKDSRC for short. Specifically, with 3DMKDSRC, each 3D face scan is represented as a set of descriptor vectors extracted from keypoints by meshSIFT. Descriptor vectors of gallery samples form the gallery dictionary. Given a probe 3D face scan, its descriptors are extracted at first and then its identity can be determined by using a multitask SRC. The proposed 3DMKDSRC approach does not require the pre-alignment between two face scans and is quite robust to the problems of missing data, occlusions and expressions. Its superiority over the other leading 3D face recognition schemes has been corroborated by extensive experiments conducted on three benchmark databases, Bosphorus, GavabDB, and FRGC2.0. The Matlab source code for 3DMKDSRC and the related evaluation results are publicly available at http://sse.tongji.edu.cn/linzhang/3dmkdsrcface/3dmkdsrc.htm.


Introduction
Recognizing the identity of a person with high confidence is a critical issue in various applications, such as e-banking, access control, passenger clearance, and national ID cards. The need for reliable user authentication techniques has significantly increased in the wake of heightened concerns about security and rapid advances in networking, communication, and mobility [1]. Biometrics, which refers to the automatic identification of individuals based on their measurable physiological or behavioral attributes, has received considerable attention because of its high accuracy and convenience in the modern e-world. Because its data acquisition is natural and non-intrusive, the face has many benefits when compared to other biometric identifiers.
Face recognition has received substantial attention over the last three decades. To date, the majority of implemented face recognition systems are based on 2D images. Unfortunately, despite the great efforts made over the last decades, face recognition using 2D images remains a great challenge due to various adverse factors, such as illumination variation, pose changes, makeup, and facial expressions. The emergence of reliable and inexpensive 3D scanners has provided new opportunities for researchers to use 3D shape information of the face to obtain better performance [2]. 3D scanning has a major advantage over 2D imaging in that these nuisance factors have a relatively smaller influence. 3D face recognition algorithms identify a person from the 3D shape of the face. In the literature, some works in this field attempt to integrate discriminating information from the 2D and 3D modalities simultaneously [3], while others depend solely on 3D information. In this paper, our discussion is confined to the latter.

Previous Work
The task of recognizing 3D face scans has been approached in various ways, with varying levels of success. Some representative and prominent works are briefly reviewed here. The existing 3D face recognition algorithms can be roughly classified into "holistic-based" and "local-based" techniques.
The holistic techniques employ information from the whole face or at least from large regions of the 3D face. Many early-stage 3D face recognition algorithms were simply extended versions of holistic 2D approaches, in which the portrait images are replaced by range images. Typically, the input range images are aligned and then reformatted into feature vectors. After that, some statistical dimensionality reduction techniques, such as the principal component analysis (PCA) [4][5][6][7][8], the linear discriminant analysis (LDA) [9,10], and the independent component analysis (ICA) [11], are adopted to learn the subspace of the feature vectors. Thereafter, facial images are projected onto the learned subspace and then are compared by means of a suitable metric in that space.
Apart from the aforementioned appearance-based schemes, there are also other kinds of holistic techniques. Some researchers attempted to deal with the 3D face recognition problem by using "surface matching" techniques, in which the two facial surfaces under comparison are iteratively registered as closely as possible in 3D space by minimizing a distance metric. Representative examples belonging to this category are iterative closest point (ICP) and its variants [12][13][14][15]. The ICP-based surface matching techniques are robust to variable facial poses and illumination variations. However, ICP-based registration procedures are not guaranteed to converge to a global minimum and they are computationally expensive. Another limitation of these methods is their sensitivity to facial expressions, which are in fact non-rigid deformations of the facial surface [14].
Other methods rely on deforming facial surfaces into one another under some criteria, and use quantifications of these deformations as metrics for face comparison. Representative works belonging to this category include [16][17][18], in which elastic registration with morphable models was used. In order to deal with variable facial expressions, some researchers utilize geodesic distances between points on facial surfaces to define features that are eventually used for comparison. These methods assume that geodesic distances are relatively invariant to small changes in facial expressions and can consequently help generate features that are robust to facial expressions. Motivated by these insights, Bronstein et al. [19,20] proposed a 3D face recognition approach that matches intrinsic representations of facial features computed using multi-dimensional scaling. Samir et al. [21] proposed to use the level curves of the surface distance from the tip of the nose as features for face recognition. Berretti et al. [22] used surface distances to extract equal-width iso-geodesic facial stripes, which in turn were used as nodes in a graph-based recognition algorithm. However, the approaches proposed in [21,22] are not able to deal with the problems caused by missing data or occlusions, since in these cases the shape of the level curves is inevitably affected.
In [23], Mahoor and Abdel-Mottaleb represent each range image by ridge lines on the 3D surface of the face using a 3D binary image, namely the ridge image, which is the locus of the points whose principal curvatures are greater than a threshold. For the matching strategy, they also resorted to ICP. The limitation of [23] is that it can only deal with frontal or near-frontal range scans. In [24], Drira et al. represent facial surfaces by radial curves emanating from the nose tip and use elastic shape analysis of those curves to develop a Riemannian framework for analyzing the shapes of full facial surfaces. In [25], 3D face scans are converted to a canonical representation, namely the spherical depth map, from which spherical harmonic features can be derived. Smeets et al. [26] proposed a geodesic distance matrix (GDM)-based representation scheme, in which the vector of eigenvalues of the GDM was used as an isometry-invariant shape representation. Such a method is also sensitive to the problems caused by missing data or occlusions.
Although 3D data offer several advantages over their 2D counterparts, the non-rigid deformations due to facial expressions, missing data, and self-occlusions arising in data acquisition severely affect the accuracy of 3D face recognition. To cope with these issues, another common framework is based on matching only parts or regions rather than full faces. In [27], Lee et al. extracted eight fiducial points that are geometrically invariant and then used ratios of distances and angles between the fiducial points as features, followed by an SVM classifier. Motivated by findings in facial anthropometry, Gupta et al. proposed a 3D face recognition approach, namely "Anthroface 3D" [28]. In "Anthroface 3D", ten anthropometric facial fiducial points are detected first, and then the 3D Euclidean and geodesic distances between the detected fiducial points are employed as features. The weakness of such an approach is its sensitivity to missing data or occlusions, since under these adverse conditions it is nontrivial to faithfully detect the anthropometric fiducial points. In [29], Li et al. designed a feature pooling and ranking scheme to collect various types of low-level geometric features, such as the curvature at each vertex, the area of each triangle, and the length of each edge, and to rank them according to their sensitivity to facial expressions. In [30], Faltemier et al. proposed a region-ensemble-based 3D face recognition framework. In their method, the nose tip is automatically selected and then 28 face regions around the face are extracted. When matching a gallery-probe pair, corresponding regions are first matched using ICP and then the overall matching score is obtained by fusing the local matching results. The idea of part-based matching [30] was also explored in other works, such as [31][32][33][34][35][36]. In [37], Elaiwat et al. explored the curvelet transform to detect salient points on the face scan and to build multi-scale local surface descriptors. Inspired by SIFT [38], a very successful method for matching 2D images, Smeets et al. [39] developed the meshSIFT algorithm, which detects keypoints and builds local descriptors for 3D meshes. This algorithm has been applied to 3D face recognition and promising results were reported on the Bosphorus database [40].

Overview of Our Approach
When missing data, large facial expressions, or occlusions exist in 3D face scans, it is difficult for an approach based on holistic representations to succeed. Methods resorting to local representations seem more appealing. For most state-of-the-art local-representation-based methods, it is imperative to first detect some semantic fiducial points, such as the nose tip, the eye corners, and the mouth corners. However, it is nontrivial to design an approach that can automatically and robustly detect fiducial points when missing data, self-occlusions, or large expressions exist in face scans.
In this paper, we propose a novel general 3D face recognition scheme based on local representations. This approach requires neither the alignment of facial range images nor the detection of meaningful fiducial points. It is strongly motivated by the success of a recent work designed for 2D partial face matching, namely MKDSRC (Multiple Keypoint Descriptors and Sparse Representation based Classification) [41]. MKDSRC, proposed by Liao et al. [41], is an alignment-free 2D partial face matching approach in which each face is represented by a set of descriptor vectors extracted from keypoints and a multi-task SRC is used for classification. This method addresses the problem of 2D partial face matching well.
Specifically, in our approach, for each 3D face scan F, we first use meshSIFT [39] to extract multiple keypoints from it and then build the associated local descriptors. With meshSIFT, keypoints are detected as mean curvature extrema in the scale space. The set of local descriptors derived from F serves as the representation of F. To build the gallery dictionary, all the local descriptors extracted from the gallery samples are concatenated together. Given a probe face scan, its local descriptors are extracted first and then its identity is determined by using a multi-task SRC. The proposed method is called 3DMKDSRC (3D Multiple Keypoint Descriptors and Sparse Representation based Classification). 3DMKDSRC uses a variable-sized description, so each face scan is represented by a set of descriptors. Since the MKD dictionary comprises a large number of gallery descriptors, it is highly likely that descriptors from a probe scan can be sparsely represented, irrespective of whether the scan is holistic, partial, or occluded. 3DMKDSRC is particularly appropriate for matching 3D scans with missing parts, facial expressions, or occlusions. Its efficacy has been validated on three widely used benchmark databases.
The rest of this paper is organized as follows. Section 2 briefly reviews meshSIFT, based on which we extract interest points from 3D face scans and construct local descriptors. Section 3 presents the proposed 3DMKDSRC approach, which is followed by the experimental results and the conclusions.

meshSIFT
In our 3DMKDSRC approach, each 3D face scan is represented by a set of local descriptors extracted from keypoints. For keypoint detection and local descriptor construction on 3D scans, we resort to meshSIFT [39], a recently proposed method that is effective for these particular tasks. MeshSIFT was strongly motivated by SIFT [38], which is now a widely used method for building scale-invariant local descriptors for 2D gray-scale images. In this section, we briefly review the key steps of meshSIFT.

Keypoint Detection
The keypoint detection step in meshSIFT is similar to that of SIFT. A scale space containing smoothed versions of the input mesh is constructed first [the defining equation is missing from this copy], with $\bar{e}$ the average edge length. Fig. 1 shows the shapes of two face scans in the scale space.
To detect keypoints in the scale space, the mean curvature is computed for each vertex $i$ at each scale $s$ as:
$$H_i^s = \frac{k_{i,1}^s + k_{i,2}^s}{2}$$
where $k_{i,1}^s$ and $k_{i,2}^s$ respectively stand for the maximum and minimum curvatures of vertex $i$ at scale $s$. The difference between subsequent scales is computed as:
$$dH_i^s = H_i^{s+1} - H_i^s$$
A vertex is selected as a keypoint only when its value $dH_i^s$ is larger or smaller than those of all its neighboring vertices in the upper, current, and lower scales. The scale $\sigma_s$ at which the extremum is obtained is assigned to each keypoint. Fig. 2 shows an example of keypoint detection results for 3 face scans collected from the same person.
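The extremum test described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the meshSIFT implementation: it assumes that the per-vertex mean curvatures have already been computed for every scale (`H[s][i]`), that mesh connectivity is given as a hypothetical adjacency mapping `neighbors`, and that, as in SIFT, the vertex itself at the lower and upper scales also takes part in the comparison.

```python
def detect_keypoints(H, neighbors):
    """Return (vertex, scale) pairs whose curvature difference is a scale-space extremum.

    H[s][i]      -- mean curvature of vertex i at scale s (assumed precomputed)
    neighbors[i] -- ids of the vertices adjacent to vertex i (assumed input format)
    """
    n_vertices = len(H[0])
    # Difference of mean curvature between subsequent scales.
    dH = [[H[s + 1][i] - H[s][i] for i in range(n_vertices)]
          for s in range(len(H) - 1)]
    keypoints = []
    # Only interior levels have both a lower and an upper scale to compare with.
    for s in range(1, len(dH) - 1):
        for i in range(n_vertices):
            others = [dH[s][j] for j in neighbors[i]]
            for t in (s - 1, s + 1):
                others.append(dH[t][i])  # same vertex, adjacent scale (assumption)
                others.extend(dH[t][j] for j in neighbors[i])
            if dH[s][i] > max(others) or dH[s][i] < min(others):
                keypoints.append((i, s))
    return keypoints
```

Strict inequalities are used so that flat regions, where all differences are equal, do not produce keypoints.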

Local Descriptor
Having detected keypoints, the next step is to describe them with local descriptors which actually summarize the local neighborhood information around them. In order to obtain an orientation-invariant descriptor, each keypoint is assigned a canonical orientation. With such a canonical orientation, it is possible to construct a local reference frame in which the vertices of the neighborhood can be expressed independent of the facial pose.
For a keypoint P, all vertices within a spherical region of radius $9\sigma_s$ around it are its neighboring points. For each neighboring point, its normal vector is computed and its geodesic distance to P is determined based on the fast marching algorithm [42]. The normal vectors of these points are projected onto the tangent plane of the mesh at P. The projected normal vectors are gathered in a weighted histogram with 360 bins. Each histogram entry is Gaussian weighted with the geodesic distance to P. The highest peak in the histogram and the peaks above 80% of this highest peak value are selected as canonical orientations. A keypoint that has more than one canonical orientation is regarded as multiple keypoints, each assigned one of the canonical orientations.
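The orientation assignment can be illustrated with the following sketch. It assumes that the projected normals have already been converted to angles (in degrees) in the tangent plane and that geodesic distances are available; the Gaussian `bandwidth` parameter and the rule that a "peak" is a local maximum of the circular histogram are assumptions made for illustration, since the text above does not specify them.

```python
import math

def canonical_orientations(angles_deg, geodesic_dists, bandwidth,
                           n_bins=360, ratio=0.8):
    """Return the bin angles (degrees) chosen as canonical orientations."""
    hist = [0.0] * n_bins
    for angle, dist in zip(angles_deg, geodesic_dists):
        # Gaussian weight that decays with the geodesic distance to the keypoint.
        weight = math.exp(-dist * dist / (2.0 * bandwidth * bandwidth))
        hist[int((angle % 360.0) * n_bins / 360.0) % n_bins] += weight
    highest = max(hist)
    if highest == 0.0:
        return []
    orientations = []
    for b, value in enumerate(hist):
        left, right = hist[(b - 1) % n_bins], hist[(b + 1) % n_bins]
        # Keep local maxima that reach at least `ratio` of the highest peak;
        # the asymmetric comparison counts a plateau only once.
        if value >= ratio * highest and value > left and value >= right:
            orientations.append(b * 360.0 / n_bins)
    return orientations
```

A keypoint for which this function returns several angles would be duplicated, one copy per returned orientation.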
The generation of a local descriptor for P is based on 9 subregions. As described in Fig. 3, the locations of these 9 regions are based on the canonical orientation of P. The geodesic distances from the centers of regions 2, 4, 6 and 8 to P are all $4.5\sigma_s$, while the geodesic distances from the centers of regions 3, 5, 7 and 9 to P are all $4.5\sqrt{2}\,\sigma_s$. For each of the 9 regions, two histograms $p_S$ and $p_\theta$ are used for generating the descriptor. The first histogram contains the shape index, which is expressed as:
$$S_i = \frac{2}{\pi}\arctan\left(\frac{k_{i,1} + k_{i,2}}{k_{i,1} - k_{i,2}}\right)$$
where $k_{i,1}$ and $k_{i,2}$ are the maximum and the minimum curvatures, respectively. The second histogram contains the slant angles, which are defined as the angles between the projected normals and the canonical orientation. Both the shape index and the slant angle histograms are Gaussian weighted with the geodesic distances to P.
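As a small illustration of the per-region histograms, the sketch below computes a shape index from the two principal curvatures and accumulates weighted values into a histogram. It assumes the common definition of the shape index as $\frac{2}{\pi}\arctan\frac{k_1 + k_2}{k_1 - k_2}$; the number of bins (8 here) and the function names are illustrative assumptions, and the Gaussian weights are expected to be supplied by the caller.

```python
import math

def shape_index(k_max, k_min):
    """Shape index in [-1, 1]; requires k_max >= k_min.

    atan2 is used instead of a plain division so that umbilic points
    (k_max == k_min) do not raise a ZeroDivisionError.
    """
    return (2.0 / math.pi) * math.atan2(k_max + k_min, k_max - k_min)

def weighted_histogram(values, weights, lo, hi, n_bins=8):
    """Accumulate the weights of the values that fall into each of n_bins bins on [lo, hi]."""
    hist = [0.0] * n_bins
    width = (hi - lo) / n_bins
    for value, weight in zip(values, weights):
        b = int((value - lo) / width)
        hist[min(max(b, 0), n_bins - 1)] += weight  # clamp the end points
    return hist
```

For one region, `weighted_histogram(shape_indices, gaussian_weights, -1.0, 1.0)` would play the role of $p_S$; the slant-angle histogram can be built the same way over the range of the slant angles.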

3DMKDSRC
In this section, the proposed 3D face recognition scheme, 3DMKDSRC, is presented in detail.

Construction of the Gallery Dictionary
For each sample 3D face scan in the gallery set, its local descriptors are computed by meshSIFT. Then, the gallery dictionary is constructed by concatenating these descriptors together. Suppose that there are $C$ subjects in the gallery and that for each subject $i$ there are in total $n_i$ derived descriptors, collected in a sub-dictionary $D_i \in \mathbb{R}^{m \times n_i}$, where $m$ is the descriptor dimension. Usually, these $n_i$ descriptors are obtained from multiple samples of subject $i$. The gallery dictionary is then:
$$D = (D_1, D_2, \ldots, D_C) \in \mathbb{R}^{m \times K} \qquad (7)$$
where $K = \sum_{i=1}^{C} n_i$ represents the total number of descriptors in the gallery set. Typically, $K$ is very large, making $D$ an over-complete description space of the $C$ classes. According to the theory of compressed sensing, a sparse solution is possible for an over-complete dictionary [43]; therefore, any descriptor from a probe face scan can be expressed by a sparse linear combination of the items from the dictionary $D$.
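The concatenation step can be sketched as follows. The input layout (a mapping from a subject id to the descriptor lists of its gallery scans) and the function name are assumptions made for illustration; the dictionary is stored as a list of columns together with a parallel list of subject labels, so that the coefficients of a given class can be selected later.

```python
def build_gallery_dictionary(gallery):
    """Concatenate all gallery descriptors into one dictionary.

    gallery -- {subject_id: [scan_descriptors, ...]}, where each scan_descriptors
               is a list of descriptor vectors extracted from one scan (assumed layout)
    Returns (atoms, labels): atoms[k] is the k-th column of D and labels[k]
    is the subject that column came from.
    """
    atoms, labels = [], []
    for subject_id in sorted(gallery):
        for scan_descriptors in gallery[subject_id]:
            for descriptor in scan_descriptors:
                atoms.append(list(descriptor))
                labels.append(subject_id)
    return atoms, labels
```

With this layout, `len(atoms)` corresponds to K and the number of distinct values in `labels` corresponds to C.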

Multi-task Sparse Representation
Given a probe 3D face scan, we first compute from it a set of local descriptors:
$$Y = (y_1, y_2, \ldots, y_n) \qquad (8)$$
with $n$ the number of keypoints detected from this scan. Then, the sparse representation problem is formulated as:
$$\hat{X} = \arg\min_{X} \sum_{i=1}^{n} \|x_i\|_0, \quad \text{s.t. } Y = DX \qquad (9)$$
where $X = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^{K \times n}$ is the sparse coefficient matrix, and $\|\cdot\|_0$ denotes the $l_0$-norm of a vector. However, solving this problem is NP-hard. As suggested by the research results of compressed sensing [44], sparse signals can be well recovered with a high probability via $l_1$-minimization. Therefore, Eq. (9) can be approximated by:
$$\hat{X} = \arg\min_{X} \sum_{i=1}^{n} \|x_i\|_1, \quad \text{s.t. } Y = DX \qquad (10)$$
where $\|\cdot\|_1$ represents the $l_1$-norm of a vector. This is a multi-task problem as both $X$ and $Y$ have multiple columns. Equivalently, we can solve the following set of $n$ $l_1$-minimization problems, one for each probe descriptor:
$$\hat{x}_i = \arg\min_{x_i} \|x_i\|_1, \quad \text{s.t. } y_i = D x_i, \quad i = 1, 2, \ldots, n \qquad (11)$$
To solve Eq. (11), several prominent algorithms have been developed in the past few years, including Homotopy [45], FISTA [46], DALM [47], SpaRSA [48], and l1_ls [49]. In our implementation, we use the Homotopy algorithm proposed in [45]. Usually, if the identity of the probe face scan is covered by the gallery set, the coefficient vectors of its local descriptors are very sparse, as illustrated in Fig. 4. Inspired by [41,50], we adopt the following multi-task SRC to determine the identity of the probe face scan:
$$\text{identity}(Y) = \arg\min_{c} \sum_{i=1}^{n} \left\| y_i - D\,\delta_c(\hat{x}_i) \right\|_2^2 \qquad (12)$$
where $\delta_c(\cdot)$ is a function which selects only the coefficients corresponding to class $c$. Eq. (12) makes use of the sum of the reconstruction residuals of the $n$ descriptors with respect to each class to determine the identity of the input face scan.
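The classification rule can be illustrated with a small, self-contained sketch. Note the assumptions: instead of the Homotopy solver used in our implementation, the sketch uses plain ISTA on the Lagrangian relaxation $\frac{1}{2}\|y - Dx\|_2^2 + \lambda\|x\|_1$ (a swapped-in solver chosen only because it fits in a few lines), the residual is taken as a squared $l_2$ norm, and the function names, `lam`, and `n_iter` are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def reconstruct(atoms, x):
    """Return D x, where atoms holds the columns of D."""
    out = [0.0] * len(atoms[0])
    for column, coeff in zip(atoms, x):
        if coeff != 0.0:
            for r, value in enumerate(column):
                out[r] += coeff * value
    return out

def soft_threshold(value, threshold):
    return math.copysign(max(abs(value) - threshold, 0.0), value)

def ista(atoms, y, lam=0.01, n_iter=300):
    """Approximately minimize 0.5 * ||y - D x||_2^2 + lam * ||x||_1 (stand-in for Homotopy)."""
    # The squared Frobenius norm bounds the largest eigenvalue of D^T D,
    # so 1 / lipschitz is a safe, if conservative, step size.
    lipschitz = sum(v * v for column in atoms for v in column)
    x = [0.0] * len(atoms)
    for _ in range(n_iter):
        residual = [a - b for a, b in zip(reconstruct(atoms, x), y)]
        x = [soft_threshold(xi - dot(column, residual) / lipschitz, lam / lipschitz)
             for xi, column in zip(x, atoms)]
    return x

def classify(atoms, labels, probe_descriptors, lam=0.01):
    """Pick the class with the smallest summed reconstruction residual."""
    codes = [ista(atoms, y, lam) for y in probe_descriptors]
    best_label, best_residual = None, math.inf
    for c in sorted(set(labels)):
        total = 0.0
        for y, x in zip(probe_descriptors, codes):
            # delta_c: keep only the coefficients that belong to class c.
            x_c = [xi if label == c else 0.0 for xi, label in zip(x, labels)]
            recon = reconstruct(atoms, x_c)
            total += sum((a - b) ** 2 for a, b in zip(y, recon))
        if total < best_residual:
            best_label, best_residual = c, total
    return best_label
```

Each probe descriptor is coded independently, which is what makes the problem separable into n single-task problems; only the final residual sum couples the descriptors of one scan.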

Dictionary Shrinking and Sparsity Criterion
In practice, the size ($K$) of the dictionary can be extremely large, making it difficult to solve Eq. (11). Hence, we adopt an idea similar to that of Liao et al. [41] to derive a fast approximate solution. For each probe descriptor $y_i$, we first compute the correlations between $y_i$ and all the dictionary items:
$$d_i = D^{T} y_i$$
Then, for each $y_i$, we only keep $L$ ($L \ll K$) descriptors in $D$ according to the $L$ largest values of $d_i$, resulting in a small sub-dictionary $D^{(i)} \in \mathbb{R}^{m \times L}$. Then, $D$ is replaced by $D^{(i)}$ in Eq. (11) and Eq. (12) is adjusted accordingly. In our implementation, $L$ is set to 400.
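The shrinking step amounts to a top-L selection, sketched below under the assumption that $d_i$ holds the inner products between the probe descriptor $y_i$ and every dictionary item. The list-of-columns layout and the function name are illustrative.

```python
def shrink_dictionary(atoms, labels, y, L=400):
    """Keep the L dictionary columns with the largest inner product with y.

    atoms  -- list of dictionary columns (descriptor vectors)
    labels -- subject label of each column
    Returns the sub-dictionary and its labels, in the original column order.
    """
    scores = [sum(a * b for a, b in zip(column, y)) for column in atoms]
    ranked = sorted(range(len(atoms)), key=lambda k: scores[k], reverse=True)
    keep = sorted(ranked[:L])
    return [atoms[k] for k in keep], [labels[k] for k in keep]
```

Because the labels are carried along with the selected columns, the class-wise selection in Eq. (12) can be applied to the sub-dictionary without any further bookkeeping.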
In addition, we assume that if the identity of the probe face scan belongs to the $j$-th subject of the gallery, the entries of $\hat{x}_i$ should be small except those associated with the $j$-th subject. If the coefficients of $\hat{x}_i$ are not concentrated on any subject and the values of $\hat{x}_i$ instead spread evenly over all the gallery subjects, $y_i$ is likely to be a noisy descriptor that provides little discriminative information. Thus, such an $\hat{x}_i$ is not considered when computing Eq. (12).
To evaluate the sparsity of $\hat{x}_i$, we use a score $\mathrm{sparsity}(\hat{x}_i)$ [its defining equation is missing from this copy], where $k$ is the number of subjects in $D^{(i)}$ and $\mathrm{Main}(\hat{x}_i)$ stands for the summation of the absolute values of the coefficients in $\hat{x}_i$ corresponding to the top 5 percent of subjects ranked by their sums of absolute coefficients. If $\mathrm{sparsity}(\hat{x}_i)$ is larger than a threshold (0.8 in our implementation), we consider $\hat{x}_i$ sparse enough and it is involved in the further determination of identity (see Eq. (12)). Fig. 5 shows an example of the distribution of a coefficient vector which is not sparse. The overall pipeline of our proposed 3DMKDSRC algorithm is illustrated in Fig. 6.
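The Main(·) term used by the sparsity criterion can be computed as sketched below. Only this term is shown, since it is the part fully described in the text; how many subjects make up "5 percent" when the count is not an integer is not specified, so rounding up with a minimum of one subject is an assumption, as are the function and parameter names.

```python
import math

def main_mass(coefficients, labels, top_fraction=0.05):
    """Sum of absolute coefficients over the top subjects.

    coefficients -- entries of one coefficient vector
    labels       -- subject label of each entry
    Subjects are ranked by their per-subject sum of absolute coefficients,
    and the sums of the top `top_fraction` of subjects are added up.
    """
    per_subject = {}
    for coeff, label in zip(coefficients, labels):
        per_subject[label] = per_subject.get(label, 0.0) + abs(coeff)
    # Rounding up with at least one subject is an assumption of this sketch.
    n_top = max(1, math.ceil(top_fraction * len(per_subject)))
    return sum(sorted(per_subject.values(), reverse=True)[:n_top])
```

A coefficient vector concentrated on one subject yields a value close to its total $l_1$ mass, whereas a vector spread evenly over many subjects yields a small fraction of it.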

Experimental Results and Discussions
In this section, we will provide a comparative performance analysis of our method with the other state-of-the-art or representative approaches using three public datasets, Bosphorus, GavabDB, and FRGC2.0.

Experiments on Bosphorus
The Bosphorus database [40] consists of 4666 facial range scans from 105 different subjects and is acquired by an Inspeck Mega Capturor 3D scanner leading to 3D point clouds of approximately 35000 points. In Bosphorus, facial expression variations, pose variations, and occlusions are present. The majority of the subjects are aged between 25 and 35.
In our experiment, for each subject we chose 3 face scans with neutral expressions to form the gallery set, which therefore contained 315 samples. When forming the test set, two cases were considered. In the first case, the test set included all the remaining samples, while in the second case the test set only contained the remaining frontal samples. Besides 3DMKDSRC, meshSIFT was also evaluated under the same experimental settings. The identification results in terms of rank-1 recognition rate are summarized in Table 1. In addition, results of several other algorithms are also reported. They include an ICP-based method [6], a PCA-based method [...], and the method of [34]. It should be noted that the experiments conducted in [6], [32] and [33] were based on Bosphorus 2.0, which contains 2491 facial scans collected from 47 subjects and is smaller than the version used in our experiments. In addition, only frontal samples were involved in those experiments.
From the results listed in Table 1, it can be seen that the proposed 3DMKDSRC performs much better than the other methods evaluated.

Experiments on GavabDB
GavabDB [51] is designed to be the most expression-rich and noise-prone 3D face database. The database consists of Minolta Vi-700 laser range scans from 61 subjects. For each subject, 9 scans are collected, covering different poses and various facial expressions. We skipped the 2 types of scans that are largely rotated (±90 degrees). For each subject, we chose 3 neutral faces to build the gallery set. When forming the test set, two cases were considered. In the first case, the test set included all the remaining samples, while in the second case the test set only contained the remaining neutral samples. Besides 3DMKDSRC, meshSIFT was also evaluated using the same experimental protocol. The rank-1 recognition rates are summarized in Table 2. Results of several other methods, including [23], are also reported in Table 2 for comparison.
The superiority of 3DMKDSRC over the other competitors can be clearly observed from the results listed in Table 2. In particular, when the test set only contains samples with neutral expressions, the rank-1 recognition rate of 3DMKDSRC reaches 100%.

Experiments on FRGC2.0
In this experiment, conducted on FRGC2.0, we randomly chose 3 face scans for each subject to form the gallery set. For subjects with fewer than 3 samples, we put all their samples in the gallery. The rest of the faces in the database were used for testing. The rank-1 recognition rates obtained under these settings by 3DMKDSRC and meshSIFT are listed in Table 3. Some state-of-the-art methods, such as [24], achieve higher recognition accuracy than 3DMKDSRC on FRGC2.0. However, it should be noted that those methods usually apply a complicated data preprocessing procedure (e.g., hole filling) to the face scans in FRGC2.0 to improve the data quality. By contrast, in our experiments, no extra data preprocessing was performed. This is the main cause of the lower recognition accuracies of 3DMKDSRC and meshSIFT reported here. 3D data preprocessing is an independent area, which we may investigate further in future work.

Conclusions
In this paper, we have addressed the problem of 3D face recognition and proposed a novel approach, namely 3DMKDSRC. 3DMKDSRC represents each 3D face scan by a set of keypoint descriptor vectors extracted by meshSIFT and constructs a large dictionary from all the gallery descriptors. At the testing stage, the descriptors of a probe face scan are sparsely represented by the dictionary, and its identity is determined accordingly by solving a multi-task SRC problem. 3DMKDSRC is particularly appropriate for matching range scans with missing parts, large expressions, or occlusions. Its efficacy has been corroborated by extensive experiments conducted on three benchmark databases.