Abstract
In this paper, we develop a pose-aware facial expression recognition technique. The proposed technique employs K-nearest neighbor for pose detection and a neural network-based extended stacking ensemble model for pose-aware facial expression recognition. For pose-aware facial expression classification, we extend the stacking ensemble technique from a two-level ensemble model to a three-level ensemble model: base level, meta level, and predictor. The base-level classifier is a binary neural network. The meta-level classifier is a pool of binary neural networks whose outputs are combined using a probability distribution to build a neural network ensemble. A pool of neural network ensembles is trained to learn the similarity between multi-pose facial expressions, where each neural network ensemble represents the presence or absence of one facial expression. The predictor is a Naive Bayes classifier: it takes the binary output of the stacked neural network ensembles and classifies an unknown facial image as one of the facial expressions. The facial concentration region was detected using the Viola-Jones face detector. The Radboud Faces Database was used for training and testing the stacked ensembles. The experimental results demonstrate that the proposed technique achieved 90% accuracy using Eigen features with 160 stacked neural network ensembles and the Naive Bayes classifier, performing competitively with state-of-the-art pose-aware facial expression recognition techniques.
Citation: Altaf MF, Iqbal MW, Ali G, Shinan K, Alhazmi HE, Alanazi F, et al. (2025) Neural network-based ensemble approach for multi-view facial expression recognition. PLoS ONE 20(3): e0316562. https://doi.org/10.1371/journal.pone.0316562
Editor: Vijayalakshmi G V. Mahesh, BMS Institute of Technology and Management, INDIA
Received: December 9, 2023; Accepted: December 11, 2024; Published: March 19, 2025
Copyright: © 2025 Altaf et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data underlying the results presented in this study are available via RaFD after receiving the proper approvals. Requests to access the dataset can be sent to info@rafd.nl.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
Research on automatic pose-aware facial expression recognition has made significant progress in the past [1,2]. Automatic recognition of eight basic expressions (neutral, anger, happiness, surprise, fear, sadness, disgust, and contempt) from the frontal pose has been achieved with fairly high accuracy. However, recognition of multi-pose facial expressions is still a challenging problem, primarily due to the dissimilarity in multi-pose facial expression representation. Most existing techniques make use of images that are relatively still and demonstrate posed facial expressions in a near-frontal pose [2]. However, many real-life applications involve multi-pose facial expressions in human-to-human interaction, where the assumption of immovable subjects is impractical.
However, extracting facial features from facial images independent of pose is a difficult task, because any change in pose also affects the facial expression representation. Consequently, multi-pose facial expression classification requires a large amount of training data to learn the variations among facial expressions and poses, which is generally not available [3]. In addition, the accuracy of the classifier decreases when it is applied to facial images with continuous changes in pose. Multi-pose facial expression recognition techniques rely on 2D/3D facial images to distinguish image variation caused by changes in facial expression from that caused by changes in pose. However, the accuracy of multi-pose expression recognition depends on the accuracy of pose detection, which is not an easy task [4].
According to Ekman, the facial region around the eyes and mouth contains more information about action units compared with other regions of the face [5]. This observation motivated us to focus on extracting the facial concentration region for facial feature representation. To do this, we utilized the Viola-Jones face detection method [6] to find and clip the facial region from the facial image. The size, contrast, and brightness of the detected facial concentration region were then normalized. Subsequently, we used the histogram of oriented gradients (HOG) and principal component analysis (PCA) for feature extraction. These features map the sample image into a feature space where a stacked neural network ensemble (SNNE) is used to recognize the facial expression. Another interesting point is that a posed facial image provides only partial facial information; consequently, the facial expression information entailed in facial images of different poses is not the same. This situation inspired us to detect the pose before applying the facial expression classification technique. As a result, we propose a pose-aware facial expression recognition technique for pose detection and pose-specific expression recognition. For pose-aware facial expression classification, we extend the stacking ensemble technique from a two-level ensemble model to a three-level ensemble model: base level, meta level, and predictor. The base-level classifier is a binary neural network. The meta-level classifier is a pool of binary neural networks, whose outputs are combined using the probability distributions of Equations 1 and 2 to build the neural network ensemble, as presented in [7]. The output of a neural network ensemble is either a one or a zero, indicating whether an expression is present or not.
Each neural network ensemble in the pool of SNNEs is trained to represent the presence or absence of a facial expression, in order to learn the similarities across multi-pose facial expressions. The predictor is the Naive Bayes (NB) classifier, inspired by [8]. It identifies the sample expression as one of the possible outcomes using the binary output of the SNNEs. The reason for using NB is that it operates on binary data: it either accepts or rejects the existence of an expression. Moreover, the K-nearest neighbor (KNN) classifier is used to determine the relationship between the facial features of multiple poses. In the KNN classifier, the response is the pose detected from the facial feature vector of a given facial image.
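As a minimal sketch with hypothetical values (and with majority voting standing in for the probability combination of Equations 1 and 2), the meta level can be pictured as follows:

```python
import numpy as np

def snne_output(bnn_outputs):
    """Meta level: combine a pool of binary neural network decisions
    (1 = expression present, 0 = absent) into a single binary value
    via majority voting."""
    return int(np.sum(bnn_outputs) > len(bnn_outputs) / 2)

# Hypothetical decisions of a pool of 10 base-level binary networks
# trained for one expression (e.g. "happiness"):
pool = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])
print(snne_output(pool))  # 1: this SNNE asserts the expression is present
```

The single bits produced by all SNNEs are then concatenated into the binary vector consumed by the final Naive Bayes predictor.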
The major objectives of this research were to develop a pose-aware facial expression recognition technique while satisfying the following conditions.
- The binary neural networks are trained and tested on a multi-pose dataset.
- The extracted features are invariant to differences in facial structure and head pose.
- A sufficient multi-pose dataset is available.
- The system works automatically without human intervention.
The rest of the paper is organized as follows. Section 2 presents the related work and Section 3 the contributions. Section 4 describes the methodology and framework implementation, along with the restrictions, constraints, and design choices. Section 5 presents the experimental results, and Section 6 gives the discussion.
2. Related work
Until early 2008, the issue of pose-aware facial expression recognition was comparatively ignored in the literature; Hu et al. pointed this out in [3]. In the facial expression recognition literature, the use of multi-pose facial images is rare, and comparatively little attention has been given to the problem of multi-pose facial expression recognition. Hu et al. [3] made the first attempt to recognize multi-pose facial expressions. The facial features around the eyebrows, eyes, and mouth were detected using 2D displacements of 38 facial landmark points. Linear Bayes, quadratic Bayes, Parzen, support vector machine (SVM), and KNN classifiers were applied to the facial landmarks to evaluate the performance of multi-pose facial expression classification techniques. The best average recognition accuracy on the BU-3DFE facial expression database was achieved at the 45-degree facial pose. Ognjen et al. [9] proposed a coupled scaled Gaussian process model to normalize the pose for multi-pose facial expression recognition. The model learns the relationship between each pair of poses to determine the dependencies among multiple poses.
Most facial expression recognition systems employ Ekman's [5,10] facial action coding system to represent facial expressions in the form of action units. This technique involves a lot of manual effort to label the images, as noted by Chew et al. [11]. Many newly developed techniques, such as [12] and [13], employed the facial action coding system but were unable to define the relationship between action units and facial expression recognition. This issue led to the use of appearance-based feature extraction techniques.
Mostafa et al. [2] used local binary pattern (LBP), Sobel, and discrete Laplace features to recognize spontaneous facial expressions with multiple poses (0, 18, 36, 49, 62, 75, and 90 degrees). They proposed a novel random forest-based ensemble classifier to classify multi-pose facial expressions, using different datasets for training and testing. The subjects of the facial expression dataset belong to different cultural, ethnic, and geographical regions. The best average accuracy was obtained on frontal-pose facial expressions. Zheng et al. [14] used the scale-invariant feature transform (SIFT) feature vector to recognize multi-pose facial expressions. Similarly, in [4], facial landmarks were extracted from a set of SIFT features to represent five poses for multi-pose facial expression recognition. The experimental results show that the best average expression recognition accuracy was achieved at the 45-degree pose. Moore and Bowden [15] proposed a multi-class SVM classifier for multi-pose facial expression recognition using texture descriptors. Various variants of the LBP descriptor were used for feature extraction to investigate the effect of orientation and multi-resolution analysis on non-frontal facial expression recognition. The sample image was divided into a set of grids to extract the LBP features, and the LBP features of each grid were then concatenated to form a feature vector representing the corresponding facial image. The experiments were performed on the BU-3DFE and Multi-PIE databases to evaluate the optimal poses for multi-pose facial expressions.
Recently, Wenming [16] developed a group sparse reduced-rank regression model for multi-pose facial expression recognition. The model describes the correlation between facial features and the corresponding expression class label vector by selecting the optimal facial regions, i.e., those that contribute most to expression recognition. Each sample facial image was divided into a set of equal-size facial regions, and the facial feature vector was extracted using the LBP descriptor. A comprehensive review of deep learning techniques for facial expression recognition is presented in [17]. Karnati et al. [18] introduced a parallel network based on texture features to overcome the issue of intra-class facial appearance variation in representing facial expressions. Karnati et al. [19] developed a novel illumination normalization technique for recognizing facial expressions in the wild; its main focus was on extracting five prominent facial regions that contribute most to representing facial expression variations. More recently, a hybrid deep convolutional neural network-based technique was developed to efficiently recognize facial expressions, where the geometric features are extracted using a local gravitational force descriptor and the holistic features using convolutional layers [20].
In this work, we designed a novel ensemble model using the Radboud Faces Database (RaFD). The proposed ensemble model permits each SNNE to represent multiple binary neural networks by a single value, as is the usual case in the ensemble classifier literature [21,22]. Thus, each SNNE's binary value signifies whether an expression is present. The number of binary neural networks in each SNNE, which is substantial, regulates the difficulty of multi-pose expression recognition.
3. Contributions
The contributions of this research are as follows:
- Novel Ensemble Classifier: The proposed neural network-based ensemble model is a novel ensemble structure that is the extended version of stacking ensemble model.
- Pose-aware facial expression recognition: We developed a two-level approach to pose-aware facial expression recognition: at the first level it detects the pose from a given facial image, and then it invokes the pose-specific classifier to recognize the pose-aware facial expression. The use of multiple pose-specific models reduces the complex multi-pose problem to a set of simpler, solvable problems.
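The two-level dispatch can be sketched as follows; the detector and per-pose models here are hypothetical stand-ins for the trained KNN pose detector and the pose-specific stacked ensembles:

```python
def recognize(features, pose_detector, pose_models):
    """Two-level pose-aware recognition: detect the pose first, then
    invoke the classifier trained for that specific pose."""
    pose = pose_detector(features)
    return pose_models[pose](features)

# Hypothetical stand-ins for the trained components:
detector = lambda f: 0 if f[0] > 0 else -45
models = {0: lambda f: "happiness", -45: lambda f: "surprise"}

print(recognize([1.0], detector, models))   # happiness
print(recognize([-1.0], detector, models))  # surprise
```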
4. Methodology
Before discussing the proposed methodology, we present the used abbreviations, notations and perspective description in this paper as shown in Table 1.
The proposed pose-aware facial expression recognition framework is presented in Fig 1, which consists of the following 5 major parts:
Pre-processing
This section largely focuses on methods for image pre-processing, such as noise reduction, normalization, thresholding, sharpening, and cropping the face area. It is crucial to extract a facial image that shows only the face region, has normalized intensity, is noise-free, and is consistent in size and shape. The first stage in image pre-processing is the localization and alignment of face images based solely on appearance. The face alignment problem could be solved more effectively using the face alignment and detection techniques described in [23]; however, analyzing each new face image this way is computationally expensive. Therefore, to extract the facial region from an image, we employed the Viola-Jones face detection technique. The size of the detected face region was adjusted to the mean of each batch of training poses. As a result, the face alignment stage only needs to perform simple arithmetic computations.
- Feature extraction: In this phase, the method of facial feature extraction from multi-pose facial images is examined. The PCA and HOG feature extraction techniques were both used to represent a facial image. The extracted feature vectors were further processed with principal component analysis for dimensionality reduction, similar to the method shown in [24].
- Pose detection: This phase focuses on pose detection from multi-pose facial images. The major aim is to establish the relationship among the facial features of different poses using the KNN classifier. In the KNN classifier, the response is the pose detected from the facial feature vector of the given facial image.
- Ensemble classifier training: In a neural network ensemble, each binary neural network receives a one-dimensional feature vector of reduced dimensionality as input. Each SNNE consists of a collection of binary neural networks together with the Naive Bayes classifier. A group of binary neural networks trained to classify each of the eight facial expressions was combined to create an SNNE. Each SNNE's binary output asserts the presence or absence of a particular expression, and the outcomes of these ensembles, combined via the probability distribution, span the SNNE.
- Multi-pose facial expression recognition: This stage focuses on identifying multi-pose facial expressions from testing samples. First, facial features are extracted from a sample image, and then the pose is detected using the KNN classifier. Once the pose has been determined, the binary neural networks of the corresponding SNNEs receive the reduced feature vectors. To identify the most prominent pose-aware facial expression, the output of the SNNEs is connected to the NB classifier.
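The feature-extraction step above can be sketched with scikit-image and scikit-learn; the HOG parameters, image size, and component count below are illustrative assumptions (the paper keeps the 50 most prominent features):

```python
import numpy as np
from skimage.feature import hog
from sklearn.decomposition import PCA

# Hypothetical batch of 20 pre-processed 64x64 face crops.
rng = np.random.default_rng(0)
faces = rng.random((20, 64, 64))

# One HOG descriptor per image (parameters are illustrative).
hog_feats = np.array([hog(f, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
                      for f in faces])

# PCA reduces each descriptor to its 10 most prominent components.
pca = PCA(n_components=10)
reduced = pca.fit_transform(hog_feats)
print(reduced.shape)  # (20, 10)
```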
Finally, we observed that our experimentation requires evaluating several types of classifiers.
- Firstly, the binary neural networks are trained to construct an SNNE. The outputs of the binary neural networks within the SNNE were combined using a majority voting approach. However, this does not provide evidence of any relationship between the probability value for the presence of an individual expression and the other SNNEs.
- A KNN classifier with the city-block distance measure and 10 nearest neighbors was implemented to detect one of the five possible poses.
- An SVM classifier with different kernels was designed as a final predictor.
- A KNN with Hamming distance was implemented as a predictor to combine the decisions of the SNNEs.
- The proposed NB classifier combines the decisions of the SNNEs.
Dataset preparation
The RaFD dataset contains 6840 images of 57 participants, of whom 38 are male and 19 female; the participants are of Moroccan and Caucasian origin [24]. Each subject posed eight facial expressions (anger, happiness, surprise, sadness, fear, neutral, contempt, and disgust) with three gaze directions (left, right, and front) and five head poses (−90°, +90°, −45°, +45°, 0°). The posed still images of each subject were captured and digitized at 681 x 1024 pixels. Every subject in this database has three images of each facial expression for each pose.
The stacked ensemble models were trained using the RaFD multi-pose dataset. The facial region of each facial image was cropped using the Viola-Jones face detector, as shown in Fig 2. The cropped facial regions vary in size and illumination. To create a consistent dataset, these images were scaled to the average size over the whole dataset for each pose. Finally, the image histogram was equalized to normalize the effects of lighting variance [25].
Neural network ensembles training
Principal component analysis and the histogram of oriented gradients are used for feature extraction from sample images. Each sample image is converted into a one-dimensional feature vector, which is used to train the binary neural networks. To construct the SNNEs, a subset of expression-specific data was first randomly selected from the training data; then another subset of samples from the other classes was randomly selected to train the base-level classifier. A set of SNNEs is trained for each facial expression to recognize one of the eight expressions. The output of the stacked neural network ensembles depends on the predictions of the base-level classifiers, because these predictions are combined with majority voting to produce the output of the SNNEs. During the training of the SNNEs, if the performance of a base-level classifier was below 99%, that classifier was discarded. We used a learning rate of 0.3 and a maximum of 2000 epochs; most of the binary neural networks (BNNs) converged before 50 epochs. The scaled conjugate gradient function was adopted as the training function. The number of SNNEs in each stacked ensemble model was varied over 80, 160, 240, 320, and 400; in the case of 400 SNNEs, 50 SNNEs were trained for each expression. The final decision about the number of SNNEs in each stacked ensemble model was made on the basis of the expression accuracy achieved during training. The 50 most prominent features were selected to represent a facial image, the number of neurons in the hidden layer was 10, and half of the total training samples were randomly selected to train each base-level classifier.
The binary neural networks were trained using the feature vectors of pose-dependent facial expressions against the feature vectors of all other expressions. This model learns inter-expression variability under different head poses, as described by Ekman [5,10]. Five pose-aware stacked ensemble models were trained using the above-described parameters. The outputs of all SNNEs were concatenated to produce the meta input vector. The output of a pose-aware stacked ensemble model is an 80-, 160-, 240-, 320-, or 400-element vector of binary values indicating the presence of each expression. The output of an SNNE is determined using majority voting.
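Under the stated parameters, training a single SNNE can be sketched as follows. This is a simplified sketch: scikit-learn's MLPClassifier has no scaled-conjugate-gradient solver, so its default solver stands in, and the data, labels, and the 99% discard rule are hypothetical or omitted:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_snne(X, y, expression, n_bnns=10, seed=0):
    """Train one SNNE: a pool of binary networks, each learning
    `expression` vs. all other expressions on a random half of the data."""
    rng = np.random.default_rng(seed)
    binary_y = (y == expression).astype(int)  # one-vs-rest labels
    pool = []
    for _ in range(n_bnns):
        # Each base-level classifier sees a random half of the training set.
        idx = rng.choice(len(X), size=len(X) // 2, replace=False)
        bnn = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                            random_state=int(rng.integers(1 << 30)))
        bnn.fit(X[idx], binary_y[idx])
        pool.append(bnn)
    return pool

def snne_predict(pool, x):
    """Majority vote of the pool: 1 if the expression is judged present."""
    votes = [int(bnn.predict(x.reshape(1, -1))[0]) for bnn in pool]
    return int(sum(votes) > len(votes) / 2)
```

On a toy two-cluster dataset, the pooled vote recovers the presence/absence bit for the target expression.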
Selection of parameters for ensemble classifier
Five types of stacked ensemble models were trained using the previously mentioned parameters with respect to the five poses (0°, −45°, +45°, −90°, +90°). Experimental results indicate that the accuracy with 160, 240, and 400 SNNEs was higher than with 80 and 320 SNNEs. A count of 160 SNNEs gives the best accuracy using Eigen vectors, achieving the best accuracy on the frontal pose. These experiments also showed that the performance of the stacked ensemble model with SVM and KNN predictors is significantly lower than with the NB predictor. The optimal ensemble classifier structure (BNN training parameters, SNNE count, feature type, and final predictor) for each pose is presented in Table 3.
Pose detection
This part focuses on pose detection from multi-pose facial images. The major objective is to use the KNN classifier to determine the relationship between the facial features of different poses; in the KNN classifier, the response is the pose of the facial feature vector of the given facial image. KNN is one of the simplest classifiers in machine learning [15]. It postpones generalization until the classification of a sample is required. The algorithm predicts the class of a sample by computing the similarity between the training set attributes and the sample attributes using some distance measure, picking the K closest training samples, and assigning the most common class among them to the test sample. KNN was applied with varying values of K and different distance measures. Finally, a KNN classifier with the city-block distance measure and K = 10 was implemented to detect one of the five possible poses (−90°, −45°, 0°, +45°, +90°).
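A small sketch of such a pose detector with scikit-learn is shown below. The feature vectors are synthetic stand-ins, and "manhattan" is scikit-learn's name for the city-block distance:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

poses = [-90, -45, 0, 45, 90]
rng = np.random.default_rng(1)
# Synthetic 50-dimensional feature vectors, 30 per pose, clustered by pose.
X = np.vstack([rng.normal(p, 5.0, (30, 50)) for p in poses])
y = np.repeat(poses, 30)

# City-block (Manhattan) distance with K = 10, as in the paper.
knn = KNeighborsClassifier(n_neighbors=10, metric="manhattan")
knn.fit(X, y)
print(knn.predict(np.zeros((1, 50))))  # [0]: the frontal-pose cluster
```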
Pose-aware facial expression recognition
We used NB classifier as a level-three classifier to combine the decisions of level-two classifiers in extended stacking ensemble model. The NB classifier determines the presence of one of the eight expressions in an unknown facial image.
Naive Bayes
Naive Bayes is a simple probability-based classification technique whose performance is comparable to the most prominent classification techniques. The classifier predicts an unknown sample by defining a probabilistic relationship between the underlying features of the training samples and the test sample. It computes prior and posterior probabilities of the sample features with respect to the unknown sample while assuming no dependencies among the features, and assigns to the unknown sample the class label with the maximum likelihood. In practice, NB performs competitively with many state-of-the-art classification techniques.
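Because the SNNE outputs are binary, the predictor corresponds to a Bernoulli Naive Bayes model. A toy sketch with hypothetical five-bit SNNE vectors:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary SNNE output vectors (one bit per SNNE) for three
# training images, together with their expression labels.
X = np.array([[1, 0, 0, 1, 0],
              [0, 1, 0, 0, 1],
              [1, 0, 1, 1, 0]])
y = np.array(["happiness", "surprise", "happiness"])

nb = BernoulliNB()
nb.fit(X, y)
print(nb.predict([[1, 0, 0, 1, 0]]))  # ['happiness']
```

In the real system the vectors have 80 to 400 bits, one per SNNE, and the labels span the eight expressions.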
5. Results
In this study, experiments were performed on the RaFD multi-pose facial expression database; a detailed description of the RaFD dataset is given in Table 2. The facial expression classification process was carried out using the RaFD database for each pose by varying the number of SNNEs (80, 160, 240, 320, and 400), with 10 BNNs each. The NB, KNN, and SVM classifiers were trained to predict the presence or absence of an expression. The results obtained with the optimal ensemble classifier structure are presented in Table 4, along with the accuracy for each expression, while the complete evaluation results are presented in Tables 5–9.
Table 4 indicates that the performance of the stacked ensemble model with NB is superior using PCA compared with HOG. The results for the neutral, sadness, fear, and disgust expressions are very poor using HOG features; consequently, the accuracy differences between PCA and HOG features are very high for disgust, fear, and neutral. The expression recognition accuracy difference is about 24.35% for disgust and about 11% for neutral and fear.
Fig 3 presents a deeper analysis of each expression in multi-pose expression classification using the test dataset. The diagonal entries represent correctly classified samples and the off-diagonal entries misclassified samples. The labels AN (anger), HA (happiness), SU (surprise), SA (sadness), FE (fear), NE (neutral), CO (contempt), and DI (disgust) are used for the x-axis and y-axis. From these confusion matrices, it is noticed that the rate of misclassification is high between fear-surprise and fear-sadness, whereas neutral and disgust are the most confused expressions. This reflects the similar relationship between the fear, sadness, and surprise expressions on HOG features presented in [26], where disgust was demonstrated to be the most confused expression. The highest misclassification rates are between neutral-disgust and fear-surprise: 67 samples of fear are misclassified as surprise, and 49 samples of neutral are misclassified as disgust. Moreover, these results demonstrate that, among the eight multi-pose expressions, surprise, contempt, and happiness are easier to classify irrespective of pose variations (Fig 3).
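Such a confusion matrix is computed directly from the true and predicted labels; the five samples below are hypothetical:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["AN", "HA", "SU", "SA", "FE", "NE", "CO", "DI"]
y_true = ["HA", "HA", "SU", "FE", "NE"]
y_pred = ["HA", "HA", "SU", "SU", "DI"]  # fear -> surprise, neutral -> disgust

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm[labels.index("FE"), labels.index("SU")])  # 1 fear sample called surprise
```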
The experimental results are broken down by pose, from 0°, −45°, −90°, +45° to +90°; a detailed description is presented in Tables 5–9. The illustrated results include the average accuracy along with the performance for each expression under different combinations of SNNE count and final predictor. The average results are the overall performances of the predictor, not the averages of each column. The best accuracy is presented in bold face. Each SNNE is a set of 10 BNNs. The BNNs of each pose were trained using the subset of feature vectors for each expression at that facial pose against the feature vectors of the other expressions at the same pose. The whole test dataset was used for the evaluation of the five types of pose-aware extended stacking ensemble models.
Table 5 shows the highest expression recognition accuracy on the frontal pose with the pose-aware ensemble model, for different SNNE counts using PCA and HOG features. The results in Table 5 indicate that the level of difficulty on the frontal pose is significantly lower than on non-frontal poses. Tables 6–9 show a significant drop in recognition accuracy compared with the results in Table 5. The reason for the lower recognition accuracies on non-frontal poses is that relatively less information is available regarding facial structure and expression representation. These results indicate that the pose-aware stacked ensemble model performs well with both PCA and HOG descriptors; however, PCA achieved better performance than HOG. The best average accuracies on the frontal pose are 90%, 88.78%, and 88.29% using PCA features, and 86.59%, 86.34%, and 84.15% using HOG features. The overall best recognition accuracy is 90%, obtained using Eigen features with 160 SNNEs and the NB classifier.
The experimental results of all five types of pose-aware stacked ensemble models indicate that expression recognition accuracy for each expression varies with respect to different facial poses. It is observed that the most favorable facial pose is the frontal pose. In addition, among the three final predictors, NB achieves better recognition accuracy than KNN and SVM. This is most likely due to the fact that the NB learns using conditional probabilities of features derived from training data.
These experimental results indicate that the stacked ensemble model with the NB predictor outperforms the stacked ensemble models with KNN and SVM predictors, while the stacked ensemble model with the KNN predictor outperforms the one with the SVM predictor when trained on both PCA and HOG features. The best average accuracies for the stacked ensemble models with NB (90.00%), KNN (88.78%), and SVM (88.78%) using Eigen vectors are illustrated in Table 5. In contrast, the average classification accuracy of the stacked ensemble model with the SVM predictor is superior to that with the KNN predictor for the +90° facial pose (presented in Table 9) when trained using HOG features. Overall, the expression recognition accuracy using HOG features is significantly lower than with PCA features.
The performance of the pose-aware stacked ensemble model with the combination of NB and PCA is remarkably superior on the frontal pose compared with non-frontal poses; its performance drops dramatically when applied to non-frontal facial images, as illustrated in Tables 6–9. This issue can also be observed in the literature on multi-pose facial expression recognition [2,16] and [27], where higher recognition accuracy was achieved on frontal-pose facial images. The combination of all three final predictors with Eigen vectors performed better than with HOG features; the performance of the three final predictors on PCA is remarkable. Moreover, considering the HOG features, the combination of HOG with NB outperformed the combinations with KNN and SVM across the five types of pose-aware stacked ensemble models [28–35].
As Table 5 demonstrates, the best expression recognition accuracy was achieved with 160 SNNEs using Eigen vectors. It also shows that the pose-aware stacked ensemble model with the NB predictor outperforms the stacked ensemble models with KNN and SVM predictors, and that the stacked ensemble models trained using PCA features outperform those trained with HOG features. Conversely, the stacked ensemble model with the KNN predictor performed well with a smaller number of SNNEs. The experimental results presented in Tables 6–9 demonstrate that the stacked ensemble approach performed better with 160 and 400 SNNEs than with the other SNNE counts.
Consider the confusion matrices illustrated in Figs 3–8, pertaining to the best combinations of stacked ensemble model, final predictor, and feature extraction technique. These figures show the resulting confusion matrices of the five types of stacked ensemble models. The grayscale matrices visualize the results, with gray values varying from black to white to represent intensities from high to low, so that the dark diagonal values are easy to perceive in contrast to the lighter off-diagonal values. These confusion matrices show the recognized facial expression on the x-axis and the facial image labels on the y-axis, where each row represents the confusion intensity of one expression with respect to the others. The gray intensity in the confusion matrices represents the level of inter-expression similarity and dissimilarity. Fig 4 illustrates the level of confusion between different expression combinations at the best expression recognition accuracy of 90%. These confusion matrices illustrate that disgust and neutral are the most confused expressions. Moreover, among the eight expressions, happiness, surprise, and contempt are easier to recognize than anger, sadness, fear, neutral, and disgust. Considering the best performance of each pose-aware stacked ensemble model, it can be noticed that the happiness expression is easy to recognize compared with the other expressions, whereas in the best-performing stacked ensemble model, the surprise expression has the highest classification accuracy (100%).
Confusion matrices: (a)-(c) correspond to PCA features, and (d)-(f) correspond to HOG features.
Confusion matrices: (a)-(c) correspond to PCA features, and (d)-(f) correspond to HOG features.
+45° yaw pose confusion matrices comparing the performance of NB, KNN, and SVM on PCA and HOG features: (a)-(c) correspond to PCA features, and (d)-(f) correspond to HOG features.
es, (a)-(c) confusion matrices correspond to PCA features, and (d)-(f) confusion matrices corresponds to HOG features.
+ 90° confusion matrices comparing performance of stacked ensemble model of type 1 combined with NB, KNN and SVM on PCA and HOG features: (a)-(c) confusion matrices correspond to PCA features, and (d)-(f) confusion matrices corresponds to HOG features.
To visualize the confusion among the eight expressions, we present the confusion matrices of the five types of pose-aware stacked ensemble models in Figs 4–8. These confusion matrices indicate that the expressions of surprise, contempt, and happiness are easier to recognize than the others. This is due to the large muscular deformation involved in producing these three facial expressions compared to the remaining ones. Surprise is the easiest expression to recognize and disgust is the most confused. Moreover, these figures illustrate that the disgust, fear and sadness expressions are difficult to recognize because of their high confusion rates. After neutral and disgust, surprise and fear are the most confused expressions in non-frontal poses. This confusion may be attributed to their similarly low muscle deformation.
6. Discussion
During the training of the SNNEs, we observed that the stacked ensemble model consistently achieved higher accuracy on the frontal pose than on non-frontal poses. This was also evident in the experimental results of Mostafa et al. [2] and Wenming [16], where different facial databases were used to train and test the expression recognition model. The expression recognition accuracies demonstrate that the prediction rate is not equal across all expressions. For example, from the human recognition accuracy for RaFD reported in [24], it can be noticed that surprise and happiness have the highest recognition accuracy compared to the other expressions. These findings strengthen the results of the proposed ensemble model. It can also be noticed that fear and sadness have the lowest recognition accuracy and are the most confused expressions. This attests to the stability of the stacked ensemble model.
We also noticed that using the whole dataset for classifier training and testing, irrespective of pose variations, degrades the performance of the stacked ensemble model severely. Table 10 reports the expression recognition accuracies on multi-pose databases, where several datasets were utilized to train and evaluate expression recognition models. These findings show that the accuracy of expression recognition models is quite poor compared to using pose-specific datasets for training and testing. Again, this variation in expression recognition accuracy strengthens the evidence for the generalization of the stacked ensemble model.
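The pose-specific arrangement discussed above (a KNN pose detector first, then a pose-specific expression model) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the feature layout, the `knn_pose` and `recognise` helpers, and the toy models are all assumptions.

```python
import numpy as np

# Assumed yaw poses, matching the five views discussed in this paper.
POSES = [-90, -45, 0, 45, 90]

def knn_pose(feature, train_feats, train_poses, k=3):
    """Majority vote among the k nearest training features' pose labels."""
    d = np.linalg.norm(train_feats - feature, axis=1)
    nearest = np.argsort(d)[:k]
    votes = [train_poses[i] for i in nearest]
    return max(set(votes), key=votes.count)

def recognise(feature, train_feats, train_poses, pose_models):
    """Detect the pose, then dispatch to the pose-specific ensemble."""
    pose = knn_pose(feature, train_feats, train_poses)
    return pose_models[pose](feature)

# Toy data: the pose is encoded crudely in the first feature dimension.
train_feats = np.array([[p, 0.0] for p in POSES for _ in range(3)])
train_poses = [p for p in POSES for _ in range(3)]
# Stand-in "pose-specific models" that just report which model was chosen.
models = {p: (lambda f, p=p: f"expression-from-{p}-model") for p in POSES}
print(recognise(np.array([44.0, 0.1]), train_feats, train_poses, models))
```

Routing each image to a model trained only on its pose is what avoids the severe degradation seen when one model is trained across all poses.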
The work most similar to the proposed technique is presented in [41], where the KDEF dataset is used with five facial views (+90°, −90°, +45°, −45°, 0°). The best performance of the proposed technique is 87% on −90° and 88% on −45° facial images, so the proposed approach performed slightly lower than the work presented in [41]. In contrast, there is a large difference in the computational cost of the two techniques. A meaningful comparison of the results on RaFD with the other databases used in the literature is not possible because, to the best of our knowledge, existing work has not reported any results on the multi-pose RaFD database; instead, it used only the frontal pose of RaFD, or the BU-3DFE and Multi-PIE databases. The results of other techniques using the RaFD database are listed in Table 11 for comparison.
The most notable work on the difficulty of recognizing multi-pose facial expressions is presented in [2]. The authors trained the expression recognition model on the BU-3DFE dataset and evaluated its performance on five multi-pose and frontal-pose facial expression datasets, including SFEW, RaFD, JAFFE, and KDEF. Another similar work is presented in [27], where the BU-3DFE dataset was used to train the classifier and the Multi-PIE dataset was used to test the recognition model. A further interesting work, presented in [39], demonstrates the effectiveness of Bayesian networks in capturing posed and spontaneous spatial features for gender and expression recognition. Table 10 lists the most significant results that are comparable to the performance achieved with the pose-aware stacked ensemble model. The difference in performance between the proposed ensemble model and the other models is due to differences in the number of facial expressions and in multi-pose expression representation. The work presented in [42] and [43] used only frontal-pose images, achieving 98.1% and 95.6% accuracy with six and seven facial expressions respectively.
7. Conclusions
In this paper, we proposed a pose-aware stacked ensemble model for learning facial expression discrimination from multi-pose facial images. The ensemble approach was evaluated on the RaFD facial expression database using PCA and HOG features. The experimental results show that the proposed method achieves competitive recognition performance compared with state-of-the-art methods. Further, the NB classifier built on the Bernoulli distribution is a valuable choice when the outputs of the SNNEs are binary; we would expect the Bernoulli distribution to increase the classifier's ability to differentiate between expressions. The significant contribution of this research is the development of a pose-aware ensemble model for multi-pose facial expression recognition. The combination of pose detection followed by a pose-specific facial expression recognition ensemble model is an entirely novel structure that gains the advantage of pose-dependent expression recognition. Moreover, the SNNEs at the meta level provide high generality by introducing a pool of binary neural networks for extended stacked ensemble learning. The binary outputs of the base-level and meta-level classifiers enable the final predictor to accurately discriminate among the eight facial expressions; without such a structure, it is very difficult for a single classifier to efficiently recognize facial expressions under varying pose and facial appearance. The results obtained with the pose-aware stacked ensemble model are comparable to the performance of the multi-pose facial expression recognition techniques presented in the literature [2,37,42,43]. Consequently, we observed that the combination of the stacked ensemble model with NB outperforms the SVM and KNN predictors.
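The final-predictor idea, a naive Bayes classifier that models each binary SNNE output with a Bernoulli distribution, can be illustrated with a small hand-rolled sketch. Everything here (class name, four-bit toy vectors, add-one smoothing) is an illustrative assumption, not the paper's implementation; in practice each bit would be one stacked ensemble's presence/absence decision for an expression.

```python
import numpy as np

class BernoulliNBPredictor:
    """Naive Bayes over binary feature vectors: each bit is modelled as an
    independent Bernoulli variable per class, with Laplace smoothing."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)  # rows: binary SNNE output vectors
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        self.log_prior_ = np.log([np.mean(y == c) for c in self.classes_])
        # Smoothed estimate of P(bit_j = 1 | class c).
        self.theta_ = np.array(
            [(X[y == c].sum(axis=0) + 1) / (np.sum(y == c) + 2)
             for c in self.classes_])
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Bernoulli log-likelihood: x*log(theta) + (1-x)*log(1-theta).
        log_lik = (X @ np.log(self.theta_).T
                   + (1 - X) @ np.log(1 - self.theta_).T)
        return self.classes_[np.argmax(log_lik + self.log_prior_, axis=1)]

# Toy usage: 4-bit "ensemble outputs" for two expressions.
X = [[1, 0, 0, 1], [1, 0, 1, 1], [0, 1, 1, 0], [0, 1, 0, 0]]
y = ["happiness", "happiness", "fear", "fear"]
model = BernoulliNBPredictor().fit(X, y)
print(model.predict([[1, 0, 0, 1]]))  # predicts "happiness"
```

The product of per-bit Bernoulli likelihoods is exactly what lets the predictor combine the presence/absence votes of many stacked ensembles into a single expression decision.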
Acknowledgments
The authors extend their appreciation to the Department of Computer Science, GC Women University Sialkot, Pakistan for research support.
References
- 1. Usman T, Jianchao Y, Thomas SH. Multi-pose facial expression recognition analysis with generic sparse coding feature. Computer Vision – ECCV 2012, Workshops and Demonstrations. 2012;7585:578–88.
- 2. Abd El Meguid MK, Levine MD. Fully automated recognition of spontaneous facial expressions in videos using random forest classifiers. IEEE Trans Affective Comput. 2014;5(2):141–54.
- 3. Hu Y, Zeng Z, Yin L, Wei X, Tu J, Huang T. A study of non-frontal-pose facial expressions recognition. Proceedings of the 19th International Conference on Pattern Recognition. 2008, p. 1–4.
- 4. Hu Y, Zeng Z, Yin L, Wei X, Tu J, Huang T. Multi-pose facial expression recognition. Proceedings of the 8th IEEE International Conference on Automatic Face and Gesture Recognition. 2008, p. 1–6.
- 5. Ekman P, Friesen W. The facial action coding system: a technique for the measurement of facial movement. Consulting Psychologists Press. 1978;8(4):219–38.
- 6. Viola P, Jones M. Rapid object detection using a boosted cascade of simple features. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2001, p. 511–8.
- 7. Miguel L, Patricia M, Oscar C. A method for creating ensemble neural networks using a sampling data approach. Analysis and Design of Intelligent Systems using Soft Computing Techniques Advances in Soft Computing. 2007;41(1):355-64.
- 8. Chenlei F, Fan Y, Xuan Z, Xin L. Bayesian model fusion on Bernoulli distribution for efficient yield estimation of integrated circuits. Proceedings of the Design Automation Conference. p. 1–6.
- 9. Daniilidis K, Maragos P, Paragios N. Coupled Gaussian processes for pose-invariant facial expression recognition. Computer Vision – ECCV 2010, Lecture Notes in Computer Science. 2010;6312:350–63.
- 10. Tian Y-L, Kanade T, Cohn JF. Recognizing Action Units for Facial Expression Analysis. IEEE Trans Pattern Anal Mach Intell. 2001;23(2):97–115. pmid:25210210
- 11. Chew S, Lucey P, Sridharan S, Fookes C. Exploring visual features through Gabor representations for facial expression detection. International Conference on Auditory-Visual Speech Processing. 2010, p. 1–7.
- 12. Tian Y-L, Kanade T, Cohn JF. Recognizing upper face action units for facial expression analysis. Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2000). 2000;1:294–301.
- 13. Smith RS, Windeatt T. Facial expression detection using filtered local binary pattern features with ECOC classifiers and platt scaling. J Machine Learning Research. 2010;11(1):111–8.
- 14. Zheng W, Tang H, Lin Z, Huang TS. A novel approach to expression recognition from non-frontal face images. 2009 IEEE 12th International Conference on Computer Vision. 2009, p. 1901–8.
- 15. Moore S, Bowden R. Local binary patterns for multi-view facial expression recognition. Computer Vision and Image Understanding. 2011;115(4):541–58.
- 16. Zheng W. Multi-view facial expression recognition based on group sparse reduced-rank regression. IEEE Trans Affective Comput. 2014;5(1):71–85.
- 17. Karnati M, Seal A, Bhattacharjee D, Yazidi A, Krejcar O. Understanding deep learning techniques for recognition of human emotions using facial expressions: a comprehensive survey. IEEE Trans Instrum Meas. 2023;72:1–31.
- 18. Karnati M, Seal A, Yazidi A, Krejcar O. FLEPNet: Feature level ensemble parallel network for facial expression recognition. IEEE Trans Affective Comput. 2022;13(4):2058–70.
- 19. Karnati M, Seal A, Jaworek-Korjakowska J, Krejcar O. facial expression recognition in-the-wild using blended feature attention network. IEEE Trans Instrum Meas. 2023;72:1–16.
- 20. Mohan K, Seal A, Krejcar O, Yazidi A. Facial expression recognition using local gravitational force descriptor-based deep convolution neural networks. IEEE Trans Instrum Meas. 2021;70:1–12.
- 21. Shen X, Lin Z, Brandt J, Wu Y. Detecting and Aligning Faces by Image Retrieval. 2013 IEEE Conference on Computer Vision and Pattern Recognition. 2013, p. 3460–7.
- 22. Bai X-F, Wang W-J. An approach for facial expression recognition based on neural network ensemble. 2009 International Conference on Machine Learning and Cybernetics. 2009, p. 19–23.
- 23. Zhao X, Zhang S. Facial expression recognition using local binary patterns and discriminant kernel locally linear embedding. EURASIP J Adv Signal Process. 2012.
- 24. Langner O, Dotsch R, Bijlstra G, Wigboldus DHJ, Hawk ST, van Knippenberg A. Presentation and validation of the Radboud Faces Database. Cognition & Emotion. 2010;24(8):1377–88.
- 25. Ojala T, Pietikäinen M, Harwood D. A comparative study of texture measures with classification based on featured distributions. Pattern Recogn. 1996;29(1):51–9.
- 26. da Silva FAM, Pedrini H. Effects of cultural characteristics on building an emotion classifier through facial expression analysis. J Electron Imaging. 2015;24(2):023015.
- 27. Farajzadeh N, Pan G, Wu Z. Facial expression recognition based on meta probability codes. Pattern Anal Applic. 2013;17(4):763–81.
- 28. Friedman M. A comparison of alternative tests of significance for the problem of m rankings. Ann Math Statist. 1940;11(1):86–92.
- 29. Iman RL, Davenport JM. Approximations of the critical region of the fbietkan statistic. Communications in Statistics - Theory and Methods. 1980;9(6):571–95.
- 30. Holm S. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics. 1979;6(2):65–70.
- 31. Doe J. Understanding the Universe. J Astrophys. 2023;12(3):45–67.
- 32. Ali G, Dastgir A, Iqbal MW, Anwar M, Faheem M. A hybrid convolutional neural network model for automatic diabetic retinopathy classification from fundus images. IEEE J Transl Eng Health Med. 2023;11:341–50.
- 33. Arif E, Khuram Shahzad S, Mustafa R, Arfan Jaffar M, Waseem Iqbal M. Deep neural networks for gun detection in public surveillance. Intelligent Automation & Soft Computing. 2022;32(2):909–22.
- 34. Nazir Z, Iqbal MW, Hamid K, Muhammad H, Nazir A, Hussain N, et al. Voice assisted real-time object detection using Yolo V4- tiny algorithm for visual challenged. 2023, p. 56.
- 35. Memon A, Nazir A, Hamid K, Iqbal MW. An efficient approach for data transmission using the encounter prediction. Tianjin Daxue Xuebao Ziran Kexue Yu Gongcheng Jishu Ban. 2023;56:92–109.
- 36. Eleftheriadis S, Rudovic O, Pantic M. Discriminative shared Gaussian processes for multiview and view-invariant facial expression recognition. IEEE Trans Image Process. 2015;24(1):189–204. pmid:25438312
- 37. Bijlstra G, Dotsch R. FaceReader 4 emotion classification performance on images from the Radboud Faces Database. 2011.
- 38. Noldus. FaceReader methodology. 2014.
- 39. Wang S, Wu C, He M, Wang J, Ji Q. Posed and spontaneous expression recognition through modeling their spatial patterns. Machine Vision Appl. 2015;26(2–3):219–31.
- 40. Hsieh C, Hsih M, Jiang M, Cheng Y. Effective semantic features for facial expressions recognition using SVM. Multimedia Tools Appl. 2015;1(1):1–20.
- 41. Dong J, Zhang Y, Fan L. A Multi-view face expression recognition method based on DenseNet and GAN. Electronics. 2023;12(11):2527.
- 42. Ali M, Hossein M. Multimodal facial expression recognition based on 3D face reconstruction from 2D images. Facial Expression Recognition from Real World Videos. 2015;8912:46–57.
- 43. Ahmady M, Ghasemi R, Hamidreza R. Local weighted pseudo zernike moments and fuzzy classification for facial expression recognition. Proceedings of the 13th Iranian Conference on Fuzzy Systems. 2013, p. 1–4.