Dynamics of facial actions for assessing smile genuineness

Applying computer vision techniques to distinguish between spontaneous and posed smiles is an active research topic of affective computing. Although there have been many works published addressing this problem and a couple of excellent benchmark databases created, the existing state-of-the-art approaches do not exploit the action units defined within the Facial Action Coding System that has become a standard in facial expression analysis. In this work, we explore the possibilities of extracting discriminative features directly from the dynamics of facial action units to differentiate between genuine and posed smiles. We report the results of our experimental study which shows that the proposed features offer competitive performance to those based on facial landmark analysis and on textural descriptors extracted from spatial-temporal blocks. We make these features publicly available for the UvA-NEMO and BBC databases, which will allow other researchers to further improve the classification scores, while preserving the interpretation capabilities attributed to the use of facial action units. Moreover, we have developed a new technique for identifying the smile phases, which is robust against the noise and allows for continuous analysis of facial videos.


Introduction
Facial expressions are the observable temporal alterations in human face appearance caused by motions of the muscles located just under the facial skin, controlled by the facial nerve. While there is no doubt that the primary function of facial expressions for humans is to convey information on the emotional state of an individual, their origin from the evolutionary perspective could be quite different [1], for example related to increasing or decreasing the sensory exposure [2]. Facial expression recognition is an inherent capability of humans, and it plays a substantial role in their interpersonal communication. Automatic recognition of facial expressions from digital images and videos has been explored for years, becoming a multidisciplinary research topic that embraces computer vision, machine learning, psychology, neuroscience, and cognitive sciences. Potential applications of recognizing facial expressions are related to healthcare, surveillance, animation engines, driver safety, creating responsive human-computer interfaces, and more [3]. An important direction in facial expression analysis is concerned with assessing the genuineness of the manifested non-verbal messages. In particular, the problem of discriminating between spontaneous and posed smiles has been given considerable attention in the literature [4][5][6]. Smiles are one of the most common facial expressions, and their detection using computer vision techniques has been widely investigated [7]. Over the years, a variety of benchmark datasets were created, including the famous UvA-NEMO Smile Database [4] which contains over a thousand videos with genuine and posed smiles. This encouraged the researchers to focus on recognizing smile genuineness, and the study reported in this paper addresses this interesting problem as well.

Facial action coding system
The current state of the art in automatic facial expression recognition originates from the work of Ekman and Friesen, who introduced the Facial Action Coding System (FACS) [8] to describe the facial activity. In FACS, all the observable expressions are represented as a combination of basic visually discriminable muscle actions, termed Action Units (AUs). Importantly, FACS is a descriptive system, which considers the face from an observer's perspective, rather than performing anatomical or emotional analysis. This makes FACS particularly useful in creating computer vision solutions aimed at recognizing facial expressions from images or videos, as the analysis can be performed in a two-stage approach [9]: first, the AUs are automatically detected, and subsequently they are interpreted during the second stage. There have been many successful attempts to exploit FACS for recognizing facial expressions [10,11], and the mapping between FACS and expressed emotions was confirmed by Wegrzyn et al. in their recent study [12]. Furthermore, Khorrami et al. reported an interesting observation that the features elaborated automatically using deep learning employed for recognizing facial expressions are highly correlated with the AUs defined in FACS [13], which once again confirmed the adequacy of this observation model. Importantly, detection of AUs, alongside assessing their intensity, can be effectively performed relying on computer vision solutions [14][15][16][17], and a number of implementations are publicly available.
The dynamic process of manifesting a smile is composed of three main phases, namely: (i) onset (when the face alters from neutral expression to a smile), (ii) apex (when the observable expression of the face is a smile with varying intensity), and (iii) offset (when the facial expression turns back to neutral). A smile is mainly concerned with the following AUs: AU6 (cheek raiser) and AU12 (lip corner puller); however, other AUs are very often involved as well. One of the reasons is that there is a wide range of possible underlying emotional states which could be expressed with a smile, including happiness, enjoyment, pleasure, embarrassment, sadness, or even fear, depending on the context. Although the subtle differences between these types of smiles can be relatively easily perceived by humans in most cases (this appears non-trivial for patients with mental disorders, e.g., schizophrenia [18]), it is a challenging computer vision and pattern recognition task. Discriminating between genuine (spontaneous) and posed smiles, along with understanding which facial features exhibit overwhelmingly different human intentions, became a vital topic and attracted attention in many domains, ranging from machine learning to clinical research [19]. A more general problem of recognizing the genuineness of manifold facial expressions was recently studied by Healey et al. [20]. They used average intensity of AUs to differentiate between spontaneous and intentionally expressed reactions to positive and negative images. For intentional expressions, the AU intensity was higher both for AUs associated with negative (AU1, AU2, AU4, and AU5) and positive (AU6 and AU12) emotions. However, neither the dynamics of AUs, nor their mutual relation were studied in that research.

Contribution
Despite many successful attempts to exploit FACS for recognizing facial expressions, AUs are not commonly used for assessing smile genuineness. The only attempt to exploit AUs for automatic recognition of spontaneous smiles was reported in 2006 by Valstar et al. [21]. Three AUs related to the eyebrow movements (AU1, AU2 and AU4) were studied in [21]. Recently, Ruan et al. [22] reported a psychological study aimed at improving people's ability to differentiate between posed and spontaneous smiles by focusing on AU6 and AU12 related to the mouth movements. The recent approaches are based on direct analysis of facial landmarks [4], rely on spatial-temporal textural features [5], or are underpinned by the features extracted from smile intensity dynamics [6].
The goal of the research reported here was to verify whether AUs defined in FACS contain sufficient information to discriminate between posed and spontaneous smiles, as this problem has not been tackled in the literature so far. We explore how to exploit AUs for recognizing smile genuineness, to increase the interpretability of automated methods that solve this task. Furthermore, we report our study to investigate which AUs carry the most valuable information in assessing whether a smile is posed or spontaneous. Overall, our contribution is threefold:
1. We introduce the AU Dynamics Analysis (AUDA) method for recognizing smile genuineness. The method is underpinned by new features that capture the dynamics of particular AUs, as well as their mutual relations. We publish the AUDA features extracted for the UvA-NEMO and BBC benchmarks (https://doi.org/10.7910/DVN/X5QGLA), which should allow for further research focused on improving their classification.
2. We study the relevance of particular AUs, as well as the pair-wise differences in their dynamics, for deciding whether an observed smile is spontaneous.
3. We propose a new approach towards detecting the smile phases (the source code for detecting the smile phases is available at https://github.com/jkawulok/audaphases). In contrast to many existing approaches, we do not assume that a given video sequence presents a single cycle of a smile composed of onset, apex and offset, making it suitable for continuous face analysis.
The results of our experimental study indicate that the proposed features have competitive discriminating capabilities when compared with the features exploited by the existing state-of-the-art techniques [4,23]. At the same time, their physiological interpretation is straightforward, as they are entirely based on the AUs. This showcases that the FACS features convey the information that allows for discriminating posed smiles from spontaneous ones.

Related work
Facial expression recognition. Analysis and recognition of facial expressions has been intensively studied in the literature [9,11,24,25,26]. Existing approaches are either based on the holistic features, extracted from the entire facial region, or on the local ones retrieved from particular facial components and facial landmarks. Furthermore, the features can be extracted from the spatial domain [27,28] (each image is analyzed independently) or directly from the spatial-temporal domain [29] (the features are extracted across multiple frames of a video sequence).
Taking into account whether and how FACS is exploited, two approaches can be distinguished: (i) to detect AUs given a still facial image (or an image sequence) followed by interpreting the recognized actions [30], and (ii) to recognize the expressions or non-verbal messages directly from the facial region without detecting the AUs [31]. The latter approach encompasses both local and global features, including Local Binary Patterns (LBPs) [32], Gabor wavelets [33,34], extreme learning machines [7], and many solutions based on deep Convolutional Neural Networks (CNNs) [35,36]. Moreover, some of the recent methods based on deep learning exploit the knowledge on FACS in an indirect way. Khorrami et al. studied the deep features learned by CNNs trained to recognize facial expressions, and they discovered that these features resemble the AUs defined in FACS [13]. Furthermore, Liu et al. proposed a deep network [37], whose architecture is inspired by the AUs. In this way, the analysis is intended to be split into detecting the AUs using adaptive receptive fields, and then the network groups the features to recognize specific expressions.
Detection of facial action units. The problem of detecting AUs from face images has been recently thoroughly reviewed by Martinez et al. [25]. The general pipeline for detecting AUs encompasses three main phases, namely: preprocessing aimed at detecting face alongside the facial landmarks, topped with face normalization, which is followed by feature extraction to prepare the basis for higher-level analysis of facial actions to detect, recognize and classify the particular AU.
Face and facial landmark detection has been widely explored [38] and among the most effective approaches are active appearance models [39], supervised descent [40], or constrained local model [41], whose implementation is available in the OpenFace suite [42,43] (OpenFace library is available at https://cmusatyalab.github.io/openface). From the detected landmarks, local appearance-based features, with different variations of LBPs [44] and Histogram of Oriented Gradients (HOG) [45] being most common, are extracted and classified to detect particular AUs. In OpenFace, the geometry-based features are coupled with HOG features reduced using Principal Component Analysis (PCA), and classified with a linear Support Vector Machine (SVM) to detect the AUs [46]; recently, in [47], this SVM-HOG approach was reported to outperform solutions based on CNNs.
There have also been some successful attempts to detect AUs using CNNs [48]; the most important challenge here is the need for large amounts of annotated data. Tong et al. reported an increase in the accuracy of detecting AUs obtained by exploiting their dynamic and semantic relationships [16]. Relationships between the manifested AUs have also been recently studied by Wang et al. [49], and they were subsequently exploited to improve AU recognition using a hybrid Bayesian network. Overall, state of the art in AU detection allows for excellent performance for frontal faces in controlled environment, and the main research challenges are concerned with robustness against head pose variations and realistic illumination conditions. Importantly, the algorithms for facial expression recognition that are underpinned with AU detection are easier to interpret and understand.
Smile genuineness. Discrimination between posed (deliberate) and spontaneous (genuine) smiles from facial images and videos is an intensively explored research topic [50,51]. In the last decade, many advances have been made, focused both on developing new computer vision techniques and on creating appropriate databases that could serve as benchmarks, including the excellent UvA-NEMO Smile Database [4]. The latter task is particularly important, as it is quite challenging to ensure that the person being recorded is presenting the expected (i.e., posed or spontaneous) smile [52]; creating such benchmarks requires close cooperation between psychologists, camera operators, and computer vision specialists. Overall, the process of collecting such data remains an important challenge in expression genuineness recognition.
Most of the state-of-the-art algorithms for recognizing spontaneous and posed facial behaviors are focused on the temporal analysis of various facial features. In one of the earliest approaches towards recognizing smile genuineness, Cohn and Schmidt [53] investigated changes in the Smile Onset Amplitudes and their Durations (SOAD), extracted from detected and tracked facial landmarks, and found strong evidence that spontaneous smiles are characterized by smaller amplitudes and significantly more stable relations between these two features. Valstar et al. [21] exploited the AUs focused on the eyebrow region (i.e., AU1, AU2, and AU4), extracted from the positions of facial landmarks. An interesting, yet simple approach, in which the asymmetry of facial expressions is exploited, was presented by Senechal et al. [54]. Extracting distance-based and angular features from eyelid movements for this task was proposed in [55].
Dibeklioğlu et al. demonstrated that although the eyelid features are most discriminating [4], as claimed in [53], the classification performance can be boosted if these features are coupled with those extracted from other facial components (encompassing, e.g., cheeks and/or lip corners). This finding indicates that different facial regions can contribute differently to the classification of smiles in their particular phases. Here, the onset phase is detected as the longest continuous increase in the distance between the mouth corners, the offset is the longest continuous decrease, and the frames between these two are considered to represent the apex phase. Such an approach is not robust against inaccurate localization of facial features, and it is underpinned with the assumption that a given sequence always presents a single smile cycle. In order to address the shortcoming resulting from the sensitivity to facial feature localization, appearance-based techniques were also developed. Liu and Wu proposed to detect AU6 and AU12 using Gabor wavelets with 2D PCA and Adaboost, and final classification to assess smile genuineness is performed using SVM [56]. Recently, the psychological aspects of focusing on these two AUs while teaching people to differentiate between posed and genuine smiles were explored by Ruan et al. [22].
Pfister et al. [57] proposed to utilize the Completed Local Binary Pattern (CLBP): the standard LBP is complemented with textural features from Three Orthogonal Planes, which creates an appearance-based local spatial-temporal descriptor (CLBP-TOP). The CLBP-TOP descriptor was enhanced by Wu et al. [23]: the entire image sequence is divided into blocks in both spatial and temporal domains, using the flexible facial sub-region cropping. Then, five discriminative facial points (eyes, lip corners, and nose tip) are detected and tracked to retrieve facial sub-region volumes which are further analyzed. Each sub-region volume is divided into three blocks in the temporal domain, reflecting three smile phases: onset, apex, and offset (in a similar manner to [4]). In this paper, we refer to that approach as CLBP-TOP+. In addition to that, the authors in [23] proposed an adaptive learning procedure to extract an optimal (most discriminative) subset of all CLBP-TOP features (termed disCLBP-TOP). Although this algorithm retrieved high classification scores, inaccurate detection of facial landmarks can notably jeopardize its performance. The initial work by Wu et al. was further improved in [5] by introducing a discriminative learning model (DLM) to classify the disCLBP-TOP features.
In our earlier work [6], we proposed to analyze Smile Intensity Dynamics (SID) to estimate smile genuineness. Smile intensity is measured in the facial region, as well as in two facial components: the eyes region and the mouth region. The assessment is made in a frame-wise manner, relying on the LBP features classified with SVM. Dynamics of smile intensity is analyzed in each frame, as well as from the whole sequence, and these features are classified once again using SVM to distinguish the spontaneous from posed smiles.
Overall, the state-of-the-art methods that were reported to render high classification scores do not rely on the AUs. Most of them are based on the features extracted directly from the images or they exploit the landmark locations and smile intensities. This makes it more challenging to integrate these methods with the existing AU-based systems for facial expression analysis.
Smile genuineness recognition may also be performed employing multi-modal techniques which benefit from the observation that people communicate by means of language, facial expressions, head movement, gestures and poses [58]. To fully exploit the information coming from different sources, the multi-modal methods fuse them to improve the classification performance. This fusion may be performed at various abstraction levels (they are often referred to as early, mid-level, and late fusion strategies), e.g., across different smile phases, or for various facial regions. In [59], three different facial regions are used to extract features (eyes, cheeks, and mouth). Then, SVMs are trained for each region separately, and they are used to classify the feature vectors. An algorithm which fuses head, face, and shoulder modalities was proposed in [60] (different landmark trackers were employed for each modality). The authors efficiently combined these modalities, and highlighted which of them carry discriminative information. According to the authors, the tracked facial landmarks were related to AU6, AU12, and AU13. Another interesting research direction includes thermal imaging, in which the heat radiated from the face is used to recognize deception [61]. Recently, Saito et al. demonstrated that smile genuineness can be assessed based on a signal measured with smart eyewear equipped with 16 photo-reflective sensors [62].

Method
A general overview of the proposed approach is presented in Fig 1. At first, facial AUs are detected using the SVM-HOG technique [46]: for every frame, the intensity for each of 17 AUs is retrieved (in the plot, the intensities of individual AUs are scaled from 0 to 1), which forms a frame-wise AU feature vector. Subsequently, we employ an SVM to estimate the smile intensity from each AU feature vector. The obtained smile intensity series is processed to detect a smile in the temporal domain (here, the smile intensity is scaled between -1 and 1, with the value of 0 being the classifier's decision boundary between the smile and non-smile classes) and to divide it into three phases (i.e., onset, apex, and offset). For each detected phase, as well as for all of them, we capture the dynamics of each AU alongside their mutual dependencies, to extract four feature vectors that characterize the considered sequence. Finally, these feature vectors are classified using an SVM ensemble to determine whether the presented smile is spontaneous or posed. These subsequent steps are discussed in detail later in this section.

Capturing the expression dynamics
In order to detect the smile phases, as well as to extract features which allow for discriminating between posed and spontaneous smiles, we analyze a series (v) of estimated intensities (of the smile and/or individual AUs) to capture the dynamics of the signal. For a series of the intensity values {v_i}, we first apply median filtering over three consecutive values, followed by linear regression in a sliding window of ω subsequent scores with a unit stride. While we assume the series to have a frequency of 50 frames per second (fps), we adjust the window length accordingly for sequences of a different frame rate (alternatively, the sequences could be normalized, so that the time span between subsequent intensities in the series equals 20 ms). For each window, we obtain a trend line characterized by its slope:
$$\delta = \frac{\sum_{i} (t_i - \bar{t})(v_i - \bar{v})}{\sum_{i} (t_i - \bar{t})^2}$$
and regression coefficient:
$$r = \frac{\sum_{i} (t_i - \bar{t})(v_i - \bar{v})}{\sqrt{\sum_{i} (t_i - \bar{t})^2 \sum_{i} (v_i - \bar{v})^2}}$$
where t is the frame capture timestamp, $\bar{v}$ and $\bar{t}$ are the mean values of v and t inside the window, and the sums run over the samples inside the window. The regression coefficient r ∈ [−1; 1] indicates how well the linear trend fits the data (the higher its absolute value is, the more linear they are). For every i-th intensity in the sequence (v_i), we compute δ_i, hence the v signal is transformed into δ that represents its first-order dynamics. We replicate the boundary values when processing the initial or final values. By applying different window lengths (hence obtaining a variety of δ series), we determine the dynamics at different scales. We also extract the second-order dynamics by processing δ once again, which produces the δ² signal.
In addition to that, the slope value can be adjusted based on the regression coefficient, which yields the r-adjusted values. In Fig 2, we present an example of δ and δ² signals obtained from the smile intensity series using windows of different lengths (the length must be an odd number, and we demonstrate the signals for subsequent powers of 3). It can be seen that for ω = 3, the noise influences the δ and δ² signals, but for longer windows (e.g., ω = 9 and ω = 27, corresponding to 160 ms and 520 ms), the smile dynamics are well captured. On the other hand, a larger scale (such as ω = 81) may not be suitable for highlighting the dynamics of smiles that are a few seconds long. Therefore, we decided to focus on the range 9 ≤ ω ≤ 27.
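The sliding-window regression described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' C++ implementation: the function names `median3` and `window_dynamics` are hypothetical, timestamps are derived from a fixed frame rate, the window is centered on each sample, and r is set to 0 when the window is constant (a case the text does not discuss).

```python
import math
from statistics import median

def median3(v):
    """Median filter over three consecutive values (boundary values replicated)."""
    padded = [v[0]] + list(v) + [v[-1]]
    return [median(padded[i:i + 3]) for i in range(len(v))]

def window_dynamics(v, omega, fps=50.0):
    """Slope and regression coefficient of a least-squares line fitted in a
    centered window of `omega` samples, computed for every sample of `v`."""
    if omega < 3 or omega % 2 == 0:
        raise ValueError("omega must be an odd number >= 3")
    half = omega // 2
    # Replicate the boundary values so that every sample has a full window.
    padded = [v[0]] * half + list(v) + [v[-1]] * half
    t = [k / fps for k in range(omega)]          # timestamps in seconds (assumed)
    t_mean = sum(t) / omega
    s_tt = sum((tk - t_mean) ** 2 for tk in t)
    slopes, coeffs = [], []
    for i in range(len(v)):
        w = padded[i:i + omega]
        w_mean = sum(w) / omega
        s_tv = sum((tk - t_mean) * (wk - w_mean) for tk, wk in zip(t, w))
        s_vv = sum((wk - w_mean) ** 2 for wk in w)
        slopes.append(s_tv / s_tt)
        # Assumption: a constant window has no linear trend, so r = 0.
        coeffs.append(s_tv / math.sqrt(s_tt * s_vv) if s_vv > 0 else 0.0)
    return slopes, coeffs
```

Under these assumptions, the first-order dynamics would be obtained as `delta, r = window_dynamics(median3(v), 27)`, and the second-order dynamics by processing the result once more, `delta2, _ = window_dynamics(delta, 27)`.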

Detecting smile phases
We estimate the smile intensity for every frame using an SVM trained to classify AU feature vectors as presenting a smile or not; the intensity of a smile is determined based on the distance from the hyperplane that separates the opposite classes (i.e., smile and non-smile), whose position is found during the SVM training. Although such an approach was reported to be not as effective as when multiclass or regression models are used [51], the latter require training sets with continuous smile intensity labels (or multiple intensity classes) that are difficult to acquire and prone to annotation errors. Contrary to that, the binary ground-truth labels are less problematic to obtain, and we have found an SVM trained with them sufficient to analyze the dynamics of the smile intensity signal and to detect the smile phases. Due to the large sizes of the training sets and their imbalance (the smile frames being the majority class), we employed our training set selection algorithm [63] to train an SVM.
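To illustrate how a distance from the separating hyperplane can serve as an intensity score, the following minimal sketch computes the signed distance for a linear decision function. It assumes a linear model with hypothetical weights `w` and bias `b`; the text does not state which kernel is used for the smile intensity SVM, so this is an illustration of the idea rather than the authors' setup.

```python
import math

def signed_distance(x, w, b):
    """Signed distance of the feature vector x from the hyperplane w.x + b = 0.
    Positive values fall on the 'smile' side, negative on the 'non-smile' side."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
```

A frame lying exactly on the decision boundary gets a score of 0, which matches the convention mentioned earlier that 0 separates the smile and non-smile classes.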
Algorithm 1. An algorithm to determine the subsequent smile phases (onset: t_on to t_ap, apex: t_ap to t_off, and offset: t_off to t_end).

PLOS ONE
1: Input signals: v, δ, δ² of length T;
2: Output: P; ⊳ P_i ∈ {onset, apex, offset, none}
3: t_0 ← 1; ⊳ Indicates a current smile starting point
4: repeat
5: … ⊳ Indicates a current position of the search
8: v_max ← v_c;
9: repeat ⊳ A loop to determine the final smile intensity descent
10: if …
We determine a vector of the smile phases (P) based on the relative changes observed in the estimated smile intensity signal v, as well as in its first-order and second-order dynamics (δ and δ²). To obtain these signals, we use a window of ω = 27 (for 50 fps). In contrast to the existing approaches, we make no assumptions on the number of phases in a presented sequence. Algorithm 1 presents the procedure for detecting the smile phases and the process is illustrated in Fig 3. The search for a new smile starts when δ > 0 (line 5) and it is composed of three major steps, whose goal is to: (i) determine the temporal extent of the smile event, (ii) find the approximate limits of the apex phase, and (iii) fine tune the apex boundaries.
First, the signal is scanned to find the final descent of the smile intensity (lines 9-15). During that step, visualized in Fig 3a, the signal is scanned as long as the current smile intensity value is over a continuously updated reference value (v_c > v_ref). To avoid stopping in case of incidental low v (e.g., resulting from noise), it is required that δ < 0 to finish the search. This determines the t_c timestamp, after which the next positive value of δ is considered as the end of the current smile event (line 16). This determines the boundaries of the current smile (t_on to t_end).
During the second step (Fig 3b), the δ signal is analyzed within the detected range ⟨t_on; t_end⟩. We assume that the fastest increase (δ_max) happens during the onset, while the fastest decrease (δ_min) happens during the offset phase (lines 17 and 18). This sets the initial limits of the apex phase (we expect it to start after t_δmax and finish before t_δmin).
The third step consists of inspecting the second-order dynamics of v to find the maximum convexity of the smile intensity which would indicate the apex phase's bounds (Fig 3c). For this purpose, we scan δ² for the first (line 19) and last (line 20) local minimum in the range ⟨t_δmax; t_δmin⟩. If the local minima are not found, then we use the initial limits (t_δmax and t_δmin) determined during the second step (line 21). Finally, we validate the determined limits (we also check whether the smile lasts at least 1 second) to approve the detected phases (line 23 and Fig 2d).
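The second and third steps can be sketched as follows. This is a simplified illustration, not a reproduction of Algorithm 1: the function name `apex_bounds` is hypothetical, the smile extent (t_on, t_end) is assumed to have been found already, the definition of a local minimum is our own, and the validation step (including the minimum duration check) is reduced to a single ordering test.

```python
def apex_bounds(delta, delta2, t_on, t_end):
    """Return (t_ap, t_off) for a smile spanning the inclusive index range
    [t_on, t_end], or None if the onset does not precede the offset."""
    frames = range(t_on, t_end + 1)
    # Step 2: the fastest increase marks the onset, the fastest decrease the offset.
    t_dmax = max(frames, key=lambda i: delta[i])
    t_dmin = min(frames, key=lambda i: delta[i])
    if t_dmax >= t_dmin:
        return None
    # Step 3: local minima of the second-order dynamics between these two points
    # (assumed definition: strictly below the previous value, not above the next).
    minima = [i for i in range(t_dmax + 1, t_dmin)
              if delta2[i] < delta2[i - 1] and delta2[i] <= delta2[i + 1]]
    if not minima:
        return t_dmax, t_dmin        # fall back to the initial limits
    return minima[0], minima[-1]     # first and last local minimum
```

Note that if only one local minimum exists, this sketch returns a zero-length apex; how such a case is handled in the full procedure is not covered here.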

Classifying the smiles based on facial actions dynamics
Every detected smile is classified based on the features extracted from AU sequences within each individual smile phase, as well as from the entire smile cycle (i.e., from t_on till t_end). We extract two types of features, namely AU-wise features, derived independently from each individual AU, and cross-AU features that capture the mutual relations between the AUs. The purpose of the cross-AU features is to retrieve the dependencies between the dynamics of individual AU signals. For every pair of AUs, we compute the dynamics difference signal δ_Δ, and we take its minimum (δ_Δ^min) and maximum (δ_Δ^max) values as features. In addition to that, we locate the minimum and maximum for the r-adjusted dynamics, and for each pair of AUs, we consider their distance in the temporal domain (Δt_δmax and Δt_δmin). In this way, we retrieve information on whether the maximum linear increases (or decreases) in two AU signals are close to each other. For a single value of ω, we obtain 544 cross-AU features (four features for each of the 136 pairs that can be formed from 17 AUs).
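A minimal sketch of the cross-AU feature extraction is given below. Several details are assumptions made for illustration: the difference signal is taken as the element-wise difference of the two first-order dynamics signals, the temporal distance is taken as an absolute value expressed in seconds, and the r-adjusted signals are passed in precomputed because their formula is not reproduced here. The function and argument names are hypothetical.

```python
from itertools import combinations

def _argmax(s):
    return max(range(len(s)), key=s.__getitem__)

def _argmin(s):
    return min(range(len(s)), key=s.__getitem__)

def cross_au_features(delta, delta_adj, fps=50.0):
    """delta, delta_adj: dicts mapping an AU name to its dynamics signal and to
    its r-adjusted dynamics signal (lists of equal length).
    Returns a dict mapping each AU pair to four features."""
    feats = {}
    for a, b in combinations(sorted(delta), 2):
        # Assumed definition of the dynamics difference signal.
        diff = [x - y for x, y in zip(delta[a], delta[b])]
        feats[(a, b)] = (
            min(diff),
            max(diff),
            # Temporal distance between the maxima (and minima) of the
            # r-adjusted dynamics of both AUs, in seconds (assumed unit).
            abs(_argmax(delta_adj[a]) - _argmax(delta_adj[b])) / fps,
            abs(_argmin(delta_adj[a]) - _argmin(delta_adj[b])) / fps,
        )
    return feats
```

With 17 AUs, `combinations` yields 136 pairs, so four features per pair give the 544 cross-AU features reported for a single value of ω.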
The aforementioned two types of features are extracted from four different time ranges that reflect the smile phases, hence we obtain eight feature vectors, as presented in Fig 4. Each feature is subject to standardization based on the training set, and the obtained feature vector is classified using an SVM with a Radial Basis Function (RBF) kernel. During training, the features are selected with Recursive Feature Elimination (RFE) [64] to simplify the model. We assess the importance of each individual feature by excluding it from the feature set to observe the performance of the model trained without that feature. The least important features are recursively eliminated (we allow for eliminating multiple features at a time), as long as the classification performance, measured for the validation set, does not decrease. The validation set is a part of the training set (not to be confused with the test set which remains unseen during that procedure). Finally, we treat these first-level SVMs as an ensemble: the SVM responses (i.e., the distances from separating hyperplanes) are treated as the elements of a second-level feature vector which is classified using an SVM with a polynomial kernel. This produces the final decision on whether the considered smile is spontaneous or posed.
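The elimination loop described above can be sketched independently of the classifier. In the snippet below, `score` is a hypothetical callable that stands in for training an SVM on the given feature subset and returning its validation performance; the function name `eliminate_features` and the `batch` parameter (how many features are dropped per round) are likewise illustrative assumptions.

```python
def eliminate_features(features, score, batch=1):
    """Greedy backward elimination: repeatedly drop the `batch` least important
    features as long as the validation score does not decrease."""
    selected = list(features)
    best = score(selected)
    while len(selected) > batch:
        # Importance of a feature = score lost when it is excluded.
        def drop(f):
            return best - score([g for g in selected if g != f])
        least_important = sorted(selected, key=drop)[:batch]
        candidate = [g for g in selected if g not in least_important]
        candidate_score = score(candidate)
        if candidate_score < best:
            break                     # performance would decrease: stop
        selected, best = candidate, candidate_score
    return selected
```

Each round retrains the model once per remaining feature, so in practice removing several features at a time (a larger `batch`) keeps the cost manageable for feature vectors with hundreds of elements.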

Experimental setup
We evaluate the proposed algorithm using two benchmarks created for assessing smile genuineness recognition: UvA-NEMO database [65] which contains 1240 video sequences of posed and spontaneous smiles (643 and 597 sequences, respectively, involving 400 subjects) with a resolution of 1920 × 1080 pixels, captured at 50 fps, and the BBC database (available at http://www.bbc.co.uk/science/humanbody/mind/surveys/smiles) with 20 video sequences (10 posed and 10 spontaneous smiles of 20 different subjects), captured at 25 fps, with a resolution of 314 × 286 pixels. For UvA-NEMO, we followed the official evaluation protocol published by the database authors (the UvA-NEMO database alongside all the metadata and division into the folds are available at https://www.uva-nemo.org/index.html) which is based on 10-fold cross validation: SVMs are trained with 9 folds, and the performance is tested for the remaining fold unseen during training (the subjects whose images are in the test set do not appear in the training set). The process is repeated for every fold, and the scores obtained for the test sets are averaged over all the folds. For the BBC database, we report the scores using 10-fold cross validation.
For detecting AUs, we exploit the OpenFace library which implements the SVM-HOG method [46]. Our algorithms for capturing the dynamics of facial expressions, detecting the smile phases, followed by extraction and classification of the features, were implemented in the C++ language with the use of the libsvm library. The SVM hyper-parameters were determined based on a grid search, performed for every fold. The validation sets used to evaluate the model during the grid search and feature selection procedures were extracted from the training set, hence the test set remained unseen during training. To compare the proposed features with alternative approaches, we have also implemented the feature extraction in the Facial Landmark Analysis (FLA) method by Dibeklioğlu et al. [4], and we classified them with SVM. We ran our experiments on a computer equipped with an Intel Core i7-3740QM 2.7 GHz processor and 32 GB of RAM. Processing a sequence composed of 100 frames consumes 3 ms to identify the smile phases, 6 ms to extract and classify the AU-wise features, and 153 ms to extract and classify the cross-AU features. Overall, this allows for real-time analysis.
Experimental validation is composed of three major parts that are presented and discussed later in this section: (i) evaluation of smile phase detection, (ii) analysis of the proposed AUDA method, (iii) comparison with the state of the art. The performance of recognizing smile genuineness is evaluated based on the classification accuracy (the percentage of correctly classified samples) as well as with the area under the receiver operating characteristic curve (AUC).
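Both measures can be computed directly from the labels and the classifier outputs. The following Python sketch assumes, purely for illustration, that label 1 denotes a spontaneous smile and label 0 a posed one, and that a higher decision score means "more likely spontaneous". The AUC is obtained from its rank interpretation, i.e., the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one.

```python
def accuracy(y_true, y_pred):
    """Percentage of correctly classified samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise formulation is quadratic in the number of samples, which is acceptable for test folds of the size used here.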

Evaluation of smile phase detection
As there are no ground-truth data available on when the particular smile phases start and finish, the accuracy of the proposed smile phase detection algorithm cannot be determined directly by comparing the outcome against the reference. Therefore, we evaluated the algorithm qualitatively, by inspecting the obtained outcome, and quantitatively to assess: (i) the algorithm's behavior for sequences presenting multiple smiles and (ii) its robustness against the noise injected into the smile intensity signal.
In the example presented earlier in Fig 3, the smile phases were clearly visible and they were correctly identified. Fig 5 demonstrates three examples of non-obvious cases. In Fig 5a, the sequence contains two cycles of the smile intensity; as can be seen from the plots, they have been correctly identified and split into three phases. Fig 5b shows a case with a rather smooth transition between the apex and offset phases. In Fig 5c, the smile intensity remains low across the whole sequence; in fact, the smile is not detected with the binary frame-wise classifier (which is wrong, judging by the corresponding frames presented over the plot), but the smile and the smile phases are identified correctly by analysing the intensity dynamics.
We expect the smile phases to be identified regardless of the length of the presented sequence and the number of smile events. In order to verify that, we combined all the original single-smile sequences from UvA-NEMO into a single long sequence. We treat the smile phases detected from the original sequences as a reference, and we compare them against the phases identified from the long combined sequence. In Table 1, we report the confusion matrices for spontaneous and posed smiles that show the differences between detecting the smile phases in these two scenarios. It can be seen that the frames classified as belonging to the apex phase from the original sequences are mostly classified as apex from the long sequence (97.9% and 96.3% for posed and spontaneous smiles, respectively), and the differences are mainly in the lengths of the onset and offset phases. It is quite common that given a broader context in the long sequences, the offset phase is moved forward (making the apex phase longer). Overall, despite some discrepancies, the phase detection was stable for multiple smile events in a sequence, making it suitable for continuous analysis: over 90% of the frames were assigned the same phase in both scenarios.
In other works on recognizing the smile genuineness [4,5], it is assumed that the onset phase is the longest continuous increase of the smile intensity (measured as the distance between the lip corners), which makes it quite vulnerable to noisy values (e.g., resulting from imprecise detection of the landmarks). Our algorithm was designed to be robust against noisy data, hence we investigated its behavior in the presence of Gaussian noise. We have contaminated the smile intensity signals with different levels of Gaussian noise to obtain a signal-to-noise ratio (SNR) of 5, 10, 15, 20, and 25 dB, and we detected the smile phases from these noisy data.
In Fig 6, we show an example of the smile intensity signal with different levels of the noise, and in Table 2 we report the percentage of frames whose identified phase was not affected by the noise. It can be seen that phase detection is more vulnerable for the spontaneous smiles (in general, the intensity signals are less smooth here than for the posed smiles), however in both cases the detection remains stable for SNR of at least 20 dB.
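The noise-robustness test can be sketched in Python as follows. The sketch assumes that the signal power is the mean squared value of the intensity signal (the exact definition used in the experiments is not stated here), and the function names are hypothetical. The phase detector itself is not reproduced; `unchanged_ratio` only compares two sequences of per-frame phase labels.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, rng):
    """Add zero-mean Gaussian noise scaled to reach the target SNR (in dB).

    Assumption: signal power is the mean of the squared samples.
    """
    power = sum(x * x for x in signal) / len(signal)
    noise_power = power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """SNR actually obtained, computed from the clean and noisy signals."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10.0 * math.log10(signal_power / noise_power)


def unchanged_ratio(reference_phases, noisy_phases):
    """Percentage of frames whose phase label is not affected by the noise."""
    same = sum(a == b for a, b in zip(reference_phases, noisy_phases))
    return 100.0 * same / len(reference_phases)
```

For example, `add_gaussian_noise(intensity, 20, random.Random(0))` returns a contaminated copy of `intensity` at roughly 20 dB, which can then be passed to the phase detector and compared against the phases found in the clean signal.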

Table 1. Confusion matrices showing the differences between smile phase detection performed for original sequences from the UvA-NEMO database (which contain a single smile per sequence) vs. a long combined sequence composed of the single-smile ones.
Bold values indicate the numbers of frames whose phases match in both approaches.
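A confusion matrix of this kind can be built from two per-frame label sequences, as in the minimal Python sketch below. It assumes that only the three phase labels occur; frames lying outside any smile would need an additional label. The function names are introduced for illustration.

```python
PHASES = ("onset", "apex", "offset")


def phase_confusion(reference, candidate, labels=PHASES):
    """Count frames by (reference phase, candidate phase)."""
    counts = {r: {c: 0 for c in labels} for r in labels}
    for r, c in zip(reference, candidate):
        counts[r][c] += 1
    return counts


def row_percentages(counts):
    """Normalise each reference-phase row to percentages."""
    result = {}
    for r, row in counts.items():
        total = sum(row.values())
        result[r] = {c: (100.0 * n / total if total else 0.0)
                     for c, n in row.items()}
    return result


def overall_agreement(counts):
    """Percentage of frames assigned the same phase in both scenarios."""
    total = sum(sum(row.values()) for row in counts.values())
    same = sum(counts[label][label] for label in counts)
    return 100.0 * same / total
```

The diagonal entries correspond to the bold values of the table, and `overall_agreement` gives the share of frames whose phase is identical in both scenarios.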

Analysis of the proposed smile genuineness recognition
At first, we investigated the classification performance for the features extracted from individual AUs (for the AU-wise features) and pairs of AUs (for the cross-AU features). In Table 3, we report the scores obtained for the features extracted from each individual AU within the whole detected sequence (i.e., between t_on and t_end) and from each smile phase. In addition to that, we combine these four classifiers using the SVM with a polynomial kernel. In general, the accuracy is similar for the features extracted from the whole sequence and for those derived from the onset phase, and it is lower for those extracted from apex and offset. Importantly, for all AUs, the SVM ensemble renders a higher classification accuracy than the phase-wise SVMs, which exposes the importance of identifying the smile phases. It can be seen that the most discriminative are the dynamics of AU12 (lip corner puller), AU6 (cheek raiser), and AU10 (upper lip raiser), followed by AU25 (lips part), AU14 (dimpler), and AU5 (upper lid raiser). Interestingly, the dynamics of each individual AU, including AU45 (blinking), allow for obtaining the classification accuracy of over 65%.

Table 3. Classification accuracy (in %) obtained for the UvA-NEMO database using AU-wise features extracted from individual AUs. The features were extracted from each smile phase (onset, apex and offset), as well as from the whole sequence. The "combined" column shows the scores obtained using an ensemble of four AU-wise SVM classifiers (as shown in Fig 4). The darker the background, the higher the accuracy is.

Table 4 shows the scores obtained using the cross-AU features extracted from the individual pairs of AUs. Here, we report the final classification accuracy obtained with the ensemble of four SVMs trained with the feature vectors extracted from the individual smile phases.
It can be observed that the best scores (over 70%) are obtained relying on the pairs that include AU12 (lip corner puller) and AU6 (cheek raiser), especially when coupled with AUs that code the behavior of lips and mouth (including AU15: lip corner depressor, AU17: chin raiser, AU20: lip stretcher, and AU26: jaw drop). However, the effectiveness of the pair AU6-AU12 is relatively low; as noted in [66], these AUs are correlated with each other, and possibly this correlation does not differ significantly between spontaneous and posed smiles. It is worth noting that quite high scores are obtained for AU2 (outer brow raiser) coupled with AU6 and AU12, as well as with AU10 (upper lip raiser) and AU7 (lid tightener). This confirms some of the earlier findings [4,6,59] on the importance of the correlation between the mouth and eye regions.

Table 4. Classification accuracy (in %) obtained for the UvA-NEMO database using cross-AU features extracted from particular pairs of AU signals. The scores were obtained using an ensemble of four cross-AU SVM classifiers (as shown in Fig 4). The darker the background, the higher the accuracy is.

In Table 5, we report the classification accuracy and AUC for several variants of the proposed AUDA method, including the use of exclusively the AU-wise features (for all AUs) and the cross-AU features (for all of the AU pairs). Also, we investigate the scores for the dynamics extracted using different sets of the window lengths (ω). As it was discussed earlier in the paper and demonstrated in Fig 2, the sensible values of ω are between 9 and 27 at 50 fps, which corresponds to the windows of 160 ms and 520 ms. During the experiments, we have sampled that range more densely, adding the values of ω = 15 (280 ms) and ω = 21 (400 ms). As the standard deviations across the folds are considerable (compared with the differences between the variants), we employed the two-tailed Wilcoxon test to verify the hypothesis that the variants do not differ between each other. For the accuracy and AUC, we boldfaced the highest score, and the variants for which the hypothesis has been rejected at p < 0.05 were underlined. The best results were obtained using all the features extracted from two (ω ∈ {9, 27}) and four windows (ω ∈ {9, 15, 21, 27}) without any statistically significant difference between these variants. As they are significantly different from the single-window variants (for ω = 9 and ω = 27), we decided to use the two-window variant as our baseline. It is also clear from the table that using all the features delivers better scores than relying exclusively on cross-AU and AU-wise features, which justifies exploiting both types of them.
It may also be noted that including all the AU pairs renders higher accuracy (81.23%) than the score obtained with a single pair in Table 4 (i.e., 78.56% for AU12-AU15). Similarly, the best score obtained for a single AU in Table 3 (i.e., 78.95% for AU12) is lower than using all the AUs (82.25%).
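The significance check between two variants can be sketched in Python as below. The sketch assumes that the test is the paired (signed-rank) form of the Wilcoxon test applied to the per-fold scores of two variants, which is an interpretation rather than a detail confirmed in the text. With ten folds, the exact two-sided p-value can be obtained by enumerating all sign assignments. The function name is hypothetical.

```python
from itertools import product


def wilcoxon_signed_rank(a, b):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Zero differences are dropped and tied magnitudes get average ranks.
    Returns (statistic, p_value); intended for small samples such as 10 folds.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    statistic = min(w_plus, total - w_plus)
    # Enumerate every assignment of signs to the ranks (2^n cases).
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= statistic:
            hits += 1
    return statistic, hits / 2 ** n
```

Given two lists of ten per-fold accuracies, the hypothesis that the variants do not differ would be rejected when the returned p-value is below 0.05.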
For the selected baseline variant, we performed the RFE-based feature selection. In Table 6, we report the performance of the first-level SVMs, as well as of the final classification ensemble, trained without and with feature selection. With the selected subset of features, the scores are slightly worse than with all the features for the SVMs trained from the features extracted from the whole sequence, as well as for the SVM trained from the cross-AU features extracted from the apex phase. The performance of the remaining first-level SVMs and that of the final ensemble is better after applying feature selection. It is worth noting that when using all AU-wise features extracted from the onset, apex and offset phases, the classification accuracy is lower than for SVMs trained based on individual AUs (Table 3). For example, SVMs trained from AU6, AU10, AU12, AU14, and AU25 onset features are better than using all AUs. After feature selection, these scores are higher than relying on any single AU.
Table 5. Scores (classification accuracy and AUC) obtained for UvA-NEMO database using different variants of our AUDA method. The best score in each column is marked as bold and the scores that are not significantly different from the best (in the statistical sense) are underlined.

In Table 7, we report the ratios of the selected features grouped by the action unit they originate from (for the cross-AU features, each feature originates from two AUs). Similarly, in Table 8, the ratios are categorized by the particular dynamics extracted from all AUs. It can be observed that there is substantial information redundancy among the features and in most cases over half of them can be rejected without affecting the final classification performance (more for the cross-AU features). From Table 7, it can be seen that the features related with AU6 and AU12 (and AU25 for AU-wise) were more often picked than those related with other AUs. This is coherent with the observations made for the classifiers based on single AUs (Table 3) and their pairs (Table 4), discussed earlier in this section. However, some AUs (e.g., AU10) that rendered high classification scores when treated individually, were not that often selected with RFE. Importantly, even though the features related with some AUs were selected more frequently, none of the AUs nor feature types were entirely eliminated, which confirms that all of the proposed features that capture the AUs' dynamics are relevant for discriminating between posed and spontaneous smiles.
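The grouping of selected features by action unit can be sketched in Python as follows. The sketch assumes that the ratio for an AU is the share of the features involving that AU which survived the selection; this is one plausible reading of the tables, and the function and feature names are hypothetical. A cross-AU feature is counted once for each of its two AUs.

```python
from collections import Counter


def selection_ratio_by_au(features, selected):
    """Percentage of features kept by the selection, grouped by AU.

    features: dict mapping a feature name to a tuple of the AUs it
              originates from (one AU for AU-wise, two for cross-AU).
    selected: set of feature names retained by the feature selection.
    """
    total, kept = Counter(), Counter()
    for name, aus in features.items():
        for au in aus:
            total[au] += 1
            if name in selected:
                kept[au] += 1
    return {au: 100.0 * kept[au] / total[au] for au in total}
```

The same routine yields the grouping by the type of dynamics when the tuples hold the names of the dynamics instead of the AUs.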

Comparison with the state of the art
In Tables 9 and 10, we compare the obtained classification accuracy and AUC with the state-of-the-art techniques for the UvA-NEMO and BBC databases, respectively. The best results were reported for the DLM method [5] applied to classify the disCLBP-TOP features extracted from spatial-temporal blocks. The authors in [5] stated that their disCLBP-TOP features exploit information concerned with facial appearance which conveys the age of a person. Therefore, they compare their DLM method against the FLA+Age variant from [4] that benefits from the age-based stratification. As the AUDA features are extracted from the AU signals, they do not capture the facial appearance. Also, it is worth noting that most of the existing methods (including DLM, disCLBP-TOP, and FLA) assume that every presented sequence contains a single smile event, which simplifies the process of identifying the subsequent smile phases. Our AUDA method is not restricted in this way, allowing for processing continuous smile sequences. The original FLA method [4] is based on the geometric features, from which the most discriminative ones are selected using the min-redundancy max-relevance algorithm before final classification with an SVM. We also report the scores obtained with our implementation of the FLA method (without the feature selection step, termed FLA-all), which we combined with the classifiers based on the AUDA features (an SVM trained from FLA-all is included as the ninth classifier in our ensemble). For UvA-NEMO, such a combination (AUDA+FLA-all variant) renders better results than both AUDA and FLA-all, and the difference is statistically significant according to the Wilcoxon test (at p < 0.05). For BBC, the FLA-all features turned out to be less effective (for the 10-fold split, they do not improve the results when combined with our method).
In addition to the cross-validation tests, each of which is performed on a single dataset, we exploited all the recordings from one database to train the FLA-all and AUDA methods to subsequently test them using the other database. The scores reported in Table 11 indicate that the performance is lower in such a scenario. The AU-wise features render very low scores for UvA-NEMO when trained with BBC (in the opposite scenario, the results are much better), and the cross-AU features allow for achieving classification accuracy of 74.35% and 70% for UvA-NEMO and BBC, respectively. Although these scores are much lower compared with when the models were trained and tested using the same databases (85.11% and 90%), they are still comparable to those rendered by the SOAD and CLBP-TOP methods (Tables 9 and 10). FLA-all is much more affected here, achieving the accuracies of only 55.25% and 65% for UvA-NEMO and BBC. The limited robustness of these methods may have two main reasons. The first one lies in a different frame rate (50 fps for UvA-NEMO compared with 25 fps for BBC) and that the videos were recorded in a different setting using various cameras (the latter may affect the tools employed to extract the AUs and localize the facial landmarks). It is worth noting here that contrary to AUDA, FLA does not take into account the frame rate during feature extraction, which may make the trained model adapted to a specific frame rate. The second reason is that the criteria of creating the reference data may have been different for both databases. In the case of UvA-NEMO, the recorded subjects were stimulated in the same way: they were shown short funny videos to elicit spontaneous smiles, and they were asked to pose a smile as realistically as they could. Unfortunately, this procedure is not clear for the BBC dataset, and it is actually unknown whether the labels were defined based on how the subjects were stimulated or relying on the judgement of an expert.
These scores indicate that transferring the models across different databases (including the devices used for video acquisition) remains a challenging research problem which has not been tackled in the literature so far.
Overall, AUDA outperforms the SOAD and CLBP-TOP techniques, and its performance is comparable with that obtained using FLA and CLBP-TOP+ features. These results indicate that the AU dynamics alone allow for discriminating between spontaneous and posed smiles, while preserving high interpretation capabilities. The disCLBP-TOP and DLM methods, as well as FLA enriched with the age-based stratification that exploits additional metadata (FLA+Age), perform better than AUDA. DLM and disCLBP-TOP capture information on facial appearance which is not present in the AU signals that constitute the input data for AUDA. We have published the features extracted for the UvA-NEMO and BBC datasets (available at https://doi.org/10.7910/DVN/X5QGLA), which makes it possible for the community to employ more sophisticated classification methods, as well as to combine them with appearance-based features, to further improve the performance of emerging algorithms.

Conclusions
In this paper, we presented a new AUDA technique for capturing the dynamics of facial action units. The elaborated features were used to classify the smiles as spontaneous or posed, and we demonstrated that these features are competitive with the features extracted from the facial landmarks [4] as well as with the CLBP-TOP+ textural features extracted from spatial-temporal blocks [23]. An important benefit of our approach is that it offers interpretability in the domain of facial action units that are widely used for analyzing facial expressions. Overall, we have shown that classification of smile genuineness can be entirely based on the AUs defined in FACS. Furthermore, we proposed a new technique for identifying the smile phases from video sequences. We demonstrated that it does not require an analyzed sequence to contain a single smile cycle and that it is robust against Gaussian noise.
The experimental study has shown that although the proposed technique is highly effective, the DLM and disCLBP-TOP methods [5] render higher classification scores for the UvA-NEMO database. They are based on the features which capture the facial appearance, allowing them to extract more information on the subject (like age or gender). Our AUDA method, as well as FLA [4], are based on the data that abstract from the appearance of a subject (these are the facial action units for AUDA and facial landmark positions for FLA). On the one hand, this is a certain limitation of AUDA, but on the other hand, our features can be fused with the appearance-based information at a later stage. Furthermore, our experiments concerned with feature selection using RFE have shown that there is considerable redundancy within the AUDA features. Our ongoing research is aimed at exploiting deep recurrent neural networks to analyze the dynamics of AUs, and we use the attention modules to highlight the most important features [67]. In addition to that, the networks fed with AU dynamics may be coupled with the branches with convolutional layers to extract and benefit from the appearance-based features. These research directions can be explored in the future, and for this purpose, we publish the features extracted from the UvA-NEMO and BBC databases, alongside the first-order and second-order dynamics.
The research reported in this paper is limited to the problem of recognizing the smile genuineness, but potentially the AUDA features can be exploited for solving alternative tasks related with facial expression analysis. Furthermore, it would be interesting to determine not only whether the smile is genuine or posed, but to recognize the underlying emotional state that triggered the smile. However, to train and validate such approaches, it would be necessary to create appropriate benchmark datasets in cooperation with the psychologists. Creating more datasets may also help address the problem of the model transferability across different databases which would make the methods more robust in the real-life scenarios under uncontrolled conditions.