Abstract
We explore the efficacy of multimodal behavioral cues for explainable prediction of personality and interview-specific traits. We utilize elementary head-motion units named kinemes, atomic facial movements termed action units and speech features to estimate these human-centered traits. Empirical results confirm that kinemes and action units enable discovery of multiple trait-specific behaviors while also enabling explainability in support of the predictions. For fusing cues, we explore decision and feature-level fusion, and an additive attention-based fusion strategy which quantifies the relative importance of the three modalities for trait prediction. Examining various long-short term memory (LSTM) architectures for classification and regression on the MIT Interview and First Impressions Candidate Screening (FICS) datasets, we note that: (1) Multimodal approaches outperform unimodal counterparts, achieving the highest PCC of 0.98 for Excited-Friendly traits in MIT and 0.57 for Extraversion in FICS; (2) Efficient trait predictions and plausible explanations are achieved with both unimodal and multimodal approaches, and (3) Following the thin-slice approach, effective trait prediction is achieved even from two-second behavioral snippets. Our implementation code is available at: https://github.com/deepsurbhi8/Explainable_Human_Traits_Prediction.
Citation: Madan S, Gahalawat M, Guha T, Goecke R, Subramanian R (2025) Explainable human-centered traits from head motion and facial expression dynamics. PLoS ONE 20(1): e0313883. https://doi.org/10.1371/journal.pone.0313883
Editor: Alessandro Bruno, International University of Languages and Media: Libera Universita di Lingue e Comunicazione, ITALY
Received: August 16, 2023; Accepted: November 2, 2024; Published: January 17, 2025
Copyright: © 2025 Madan et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data has been updated in a public repository with the following URL: https://github.com/deepsurbhi8/Explainable_Human_Traits_Prediction The data is available without any access restrictions.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Personality is a psychological construct that describes human behavior in terms of habitual and fairly stable patterns of emotions, thoughts, and attributes [1, 2]. Personality is typically characterized by the OCEAN traits typified by the big-five model [3]: Openness (creative vs conservative), Conscientiousness (diligent vs disorganized), Extraversion (social vs aloof), Agreeableness (empathetic vs distant) and Neuroticism (anxious vs emotionally stable). Other popular personality models include the big-two model which categorizes these five traits into the Plasticity and Stability dimensions [4], and the 16 personality factors model [5].
Personality plays a crucial role in shaping an individual’s behavioral and communication traits, and how one conducts themselves in different social situations. To this end, multimodal non-verbal cues are critical in exhibiting an individual’s inter-personal skills in the context of ‘multimedia CVs’ [6, 7]. Subjective impressions of interviewees’ personality traits can influence hiring decisions [8], and even one behavioral modality can explain personality attributions [9]. E.g., Conscientiousness characterizing diligence and honesty is reflected in an upright posture and minimal head movements, while Neuroticism indicating anxiety and stress is revealed through fidgeting and camera aversion in self-presentation videos [7].
This paper builds on the above findings, and explores the efficacy of multimodal behavioral cues to explainably predict personality and job interview traits. In particular, we examine (i) elementary head motions termed kinemes, (ii) atomic facial movements called action units (AUs), and (iii) prosodic and acoustic speech features for trait prediction (see Fig 1 for an overview). We first evaluate the efficacy of the temporal characteristics of each individual behavioral channel in predicting these traits using long-short term memory (LSTM) architectures. Next, we explore different multimodal fusion strategies (feature fusion, decision fusion, and additive soft attention) to enhance each channel’s predictive power and explainability. Recent studies have already shown the effectiveness of kineme patterns for emotional trait prediction [10, 11], while acoustic features and facial expressions have been successfully employed for estimating personality attributes [1, 12, 13] and candidate hireability (suitability to hire/interview later) [14, 15].
Examining various LSTM architectures for classification and regression on the diverse FICS [16] and MIT interview [17] datasets, we make the following observations: (i) Both kinemes and AUs achieve explanative trait prediction. (ii) Multimodal approaches leverage cue-complementarity to better predict interview and personality attributes than unimodal ones. (iii) Trimodal fusion-based attention scores enable behavioral explanations, and provide insights into the relative contribution of each modality over time. (iv) Adequate predictive power is achieved even with 2s-long behavioral episodes or slices. Overall, this paper makes the following research contributions:
- Building upon our initial results [18], we novelly employ kinemes, action units and speech features for the estimation of personality and interview traits. Given the strong correlations among personality and interview traits [16, 19], we show that the three behavioral modalities are both predictive and explanative of these traits. We explore distinct strategies for temporally fusing behavioral features. Fusion approaches outperform unimodal ones by a large margin owing to the complementary nature of the cues and modalities.
- Our experiments reveal that speech features are highly predictive of interview traits on the MIT dataset [17], and achieve performance comparable to kinemes and AUs for OCEAN trait prediction on the FICS dataset.
- Kineme and AU features enable behavioral explanations to support their predictions. We employ scores obtained from the additive attention fusion model to assess the relative importance of our three modalities per trait.
- We perform ablative studies presenting unimodal and multimodal results over thin-slices of varying lengths. We show that satisfactory continuous and discrete trait prediction performance can be achieved even with 2s slices, with more accurate predictions possible over longer slices in line with expectation.
2 Literature review
This section reviews research on (a) personality and interview trait prediction, and (b) multimodal behavior analytics to position our work with respect to the literature.
2.1 Trait prediction
Human thoughts, emotions and behavioral patterns are influenced by their personality, typically characterized via the OCEAN model [3] describing human personality in terms of Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism. Various non-verbal behavioral cues such as eye movements [20, 21], head motion [22, 23], and facial features [13, 19] have been employed for personality trait prediction.
Numerous studies have examined the relationship between a candidate’s personality traits and their job-interview performance [14, 17, 24]. For instance, Conscientiousness is positively correlated with job and organizational performance [25, 26], while Conscientiousness and Extraversion impact interview success [27, 28] and job ratings [29]. While Mount et al. [30] observed that Emotional stability, Conscientiousness and Agreeableness are positively related to job performance, Rothmann et al. [31] associated Conscientiousness, Extraversion, Emotional stability and Openness with job performance and creativity. While these correlations among personality and interview traits have been discovered via statistical analyses, very few studies have explored the relationships between non-verbal behavioral cues and personality-cum-interview traits in a predictive (regression/classification) setting.
Explainable trait prediction.
Despite achieving excellent performance on multiple prediction problems, deep learning models fall short in terms of explainability and interpretability due to their ‘black-box’ nature [32]. Recent studies alleviate this issue by interpreting the results of deep learning models, e.g., Wicaksana and Liem [33] predict OCEAN personality traits explicitly focusing on human-explainable features and a transparent decision-making process. Wei et al. [34] propose a deep bimodal regression framework, in which Convolutional Neural Networks (CNNs) are modified to aggregate descriptors for improving regression performance on apparent personality analysis. A CNN-based approach for interpretability is explored, where the authors observe a correlation between AUs and CNN-learned features [35]. Interpretability is achieved via a visualization technique highlighting image regions activating different units in each layer. Another work [36] trains a deep residual network with audiovisual descriptors for personality trait prediction, where predictions are elucidated via face image visualization and occlusion analysis. In contrast, our approach provides trait-specific behavioral explanations, encompassing features (kineme and AUs based) and model-based (modality contribution) explanations.
2.2 Multimodal behavior analytics
Low-level behavioral features have been largely employed for human-centred trait prediction. E.g., head-motion has been modeled with descriptors such as amplitude of Fourier components [37], Euler rotation angles and velocity. Head motion is often restricted to nods and shakes [38]. Yang and Narayanan [39] extract arbitrary head motion patterns, which do not have a physical interpretation. Subramanian et al. [23] predict Extraversion and Neuroticism employing positional and head pose patterns.
Audio-visual features are typically combined to achieve effective trait prediction. Low-level speech descriptors such as pitch, intensity, spectral, cepstral coefficients and pause duration are commonly used for personality [40, 41] and affect recognition [42–44]. Other works use acoustic, prosodic and linguistic features for personality prediction [13, 45].
Many trait prediction studies focus solely on visual cues, with facial cues playing a crucial role. E.g., multivariate regression is employed to infer user personality impressions from Twitter profile images [46], while eigenfaces combined with Support Vector Machines are used to predict whether a depicted person scores above/below the median for each of the big-five traits [47]. Meng et al. [48] investigate the connection between gratification-sought (e.g., escape, fashion, entertainment) and personality traits, finding that extroverts are more active in contributing to, and participating in, engaging behaviors. Short-term facial dynamics are learned from short videos via an emotion-guided, encoder-based approach for personality analysis in [49].
2.3 Summary
The literature review reveals the following research gaps:
- Personality and interview traits are known to be highly correlated based on statistical observations, but few works have explored learning of features that can effectively predict as well as explain these traits.
- While personality and interview traits have been predicted via machine/deep learning approaches, the majority employ statistics of low-level audiovisual features (relating to head motion, eye-gaze, facial expressions, speech and prosody), which limits the explanations available to support the predictions. While head motion patterns have been identified as critical non-verbal behavioral cues, interpretable head-motion units have not previously been employed for personality or interview trait prediction. We show how kineme and AU features can intuitively explain trait-specific behaviors.
- Multimodal behavioral analytics have largely been restricted to feature and decision fusion, treating all behavioral channels equally. Differently, we utilize additive soft attention [50]-based fusion, which learns the relative contribution of each channel from data. This allows for quantifying and explaining the relative contribution of the different modalities towards the prediction result.
3 Methodology
3.1 Feature extraction
We now present feature extraction for the three employed modalities: (i) 3D head motions denoted via a sequence of kinemes, (ii) facial action units describing muscle movements, and (iii) low-level descriptors for speech representation. As in [18], we encode these features into 2s temporal segments with a 50% overlap to obtain feature vectors.
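The 2s, 50%-overlap windowing scheme can be sketched as follows (a minimal NumPy illustration; the 30 fps frame rate and array names are our assumptions, not taken from the paper's code):

```python
import numpy as np

def overlapping_windows(x, win, hop):
    """Split a (T, d) feature array into overlapping windows.

    win: window length in frames; hop: step between window starts.
    Returns an array of shape (n_windows, win, d).
    """
    n = 1 + (len(x) - win) // hop
    return np.stack([x[i * hop : i * hop + win] for i in range(n)])

# Example: 30 fps video, 2 s windows with 50% overlap -> win=60, hop=30.
pose = np.random.default_rng(0).random((300, 3))   # 10 s of pitch/yaw/roll angles
segments = overlapping_windows(pose, win=60, hop=30)
print(segments.shape)                              # (9, 60, 3)
```

The same windowing is reused for all three modalities, with only the window contents (head pose, AU intensities, or speech descriptors) differing.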
Kineme representation.
A compact approach to modeling head motion is to represent it via a small number of fundamental and interpretable units termed kinemes [10]; they are analogous to phonemes in human speech [51]. We extract the 3D Euler rotation angles pitch (θp), yaw (θy) and roll (θr) per frame to represent head pose using the Openface toolkit [52]. Head motion over a time period T can then be represented as a multivariate time-series of 3D angles θ = {(θp(t), θy(t), θr(t))}, t = 1, …, T. This time-series is divided into overlapping segments of length l, where the ith segment is flattened into a characterization vector h(i) of dimension 3l. These overlapping segments enable shift-invariance and generate better representations of the head motion [11].
Further, we define the characterization matrix Hθ = [h(1), h(2), ⋯, h(s)], with s denoting the number of segments in the training sample. All N training samples are combined to form the head motion matrix H of size 3l × Ns, where each column of H represents a single head-motion time-series segment. Non-negative Matrix Factorization (NMF) is performed on H to obtain the basis and coefficient matrices B and C respectively, such that H ≈ BC. We then employ Gaussian Mixture modeling to cluster the coefficient vectors (columns of C) in the low-dimensional space, yielding a k-column matrix C* of cluster centers (k ≪ Ns). The matrix C* is transformed as H* = BC* to obtain the kinemes in the original space; the columns of H* yield the k kinemes K1, …, Kk.
On learning the kineme representation, any head motion time-series can be expressed as a kineme sequence by mapping each time-series segment to an individual kineme. To this end, we compute the characterization vector h(i) for the ith segment, and project h(i) onto the learned subspace spanned by B to obtain its coefficient vector c(i). We then assign the ith segment to the kineme maximizing the posterior probability P(Kj | c(i)) under the learned mixture model. Thus, we can map any head motion time-series to a kineme sequence. Selected kinemes extracted from the MIT and FICS datasets are visualized in Fig 2(a) and 2(b).
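The kineme-learning pipeline can be sketched with scikit-learn as follows (a simplified sketch on synthetic non-negative data; the matrix shapes follow the text, while the NMF rank, parameter values and random data are our assumptions — real head-pose angles would first need a non-negativity transform):

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# H: each column is one flattened head-motion segment (3l rows, Ns columns).
l, Ns, k, r = 20, 500, 16, 10          # segment length, #segments, #kinemes, NMF rank
H = rng.random((3 * l, Ns))            # synthetic stand-in for real segments

# H ~= B C: B holds the basis, C the per-segment coefficient vectors.
nmf = NMF(n_components=r, init="nndsvda", max_iter=500, random_state=0)
B = nmf.fit_transform(H)               # (3l, r)
C = nmf.components_                    # (r, Ns)

# Cluster the coefficient vectors with a GMM; the k cluster means form C*.
gmm = GaussianMixture(n_components=k, random_state=0).fit(C.T)
C_star = gmm.means_.T                  # (r, k)

# Map back to the original space: the columns of H* are the k kinemes.
H_star = B @ C_star                    # (3l, k)

# A new segment is assigned to the kineme with maximum posterior probability.
c_new = C.T[:1]                        # coefficient vector of some segment
kineme_id = gmm.predict(c_new)[0]
print(H_star.shape, kineme_id)
```

Note that scikit-learn's `fit_transform` treats rows as samples, so factorizing H directly yields B as the transformed output and C as `components_`.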
Action unit detection.
We extract 17 facial action units (AUs) per video frame using Openface. Each AU is described by a presence value specifying whether the AU is visible, and an intensity score representing AU sharpness on a 5-point scale (minimal to maximal). We employ the mean intensity as a threshold to identify the dominant AUs within each 2s window (with 1s overlap, as above). Some of the common AUs from the two datasets are presented in Table 1.
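Dominant-AU selection via mean-intensity thresholding can be sketched as below (a NumPy illustration; the thresholding rule is our reading of the text, and the window shape and random intensities are assumptions):

```python
import numpy as np

def dominant_aus(intensities):
    """Binarize per-frame AU intensities for one 2 s window.

    intensities: (frames, 17) array of AU intensity scores (0-5 scale).
    An AU is marked dominant if its window-mean intensity exceeds the
    mean intensity taken over all AUs in the window.
    """
    per_au_mean = intensities.mean(axis=0)        # (17,)
    threshold = per_au_mean.mean()                # mean-intensity threshold
    return (per_au_mean > threshold).astype(int)  # binary 17-element AU vector

window = np.random.default_rng(0).random((60, 17)) * 5   # one 2 s window at 30 fps
au_vec = dominant_aus(window)
print(au_vec.shape)                               # (17,)
```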
Speech feature extraction.
We extract low-level audio descriptors (LLDs) via the Librosa library [53], following the Interspeech 2009 emotion challenge [54]: fundamental frequency (F0), voice probability, zero-crossing rate (ZCR) and Mel-frequency cepstral coefficients (MFCCs). A local feature vector is created by extracting the LLDs over a sliding 93ms window with a 23ms overlap across the entire video duration. These local features are averaged and concatenated to obtain a 23-dimensional feature vector for each 2s segment. For each dataset, these features are normalized to have zero mean and unit variance.
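The sliding-window LLD extraction can be illustrated as follows (a NumPy-only sketch using zero-crossing rate as a stand-in for the full Interspeech 2009 set; the librosa calls are omitted, the test tone is invented, and treating the stated 23 ms as the overlap between consecutive windows is our assumption):

```python
import numpy as np

def zcr(frame):
    """Zero-crossing rate of one audio frame."""
    return np.mean(np.abs(np.diff(np.sign(frame))) > 0)

def windowed_lld(audio, sr, win_s=0.093, overlap_s=0.023):
    """Compute an LLD (here ZCR) over 93 ms windows overlapping by 23 ms."""
    win = int(win_s * sr)
    hop = win - int(overlap_s * sr)
    n = 1 + (len(audio) - win) // hop
    return np.array([zcr(audio[i * hop : i * hop + win]) for i in range(n)])

sr = 16000
audio = np.sin(2 * np.pi * 220 * np.arange(2 * sr) / sr)   # 2 s, 220 Hz test tone
local = windowed_lld(audio, sr)
segment_feature = local.mean()     # averaged over the 2 s segment
print(local.shape)
```

In the full pipeline, each LLD is averaged this way over a 2s segment and the averages are concatenated into the 23-dimensional segment vector.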
3.2 Models
Long short-term memory (LSTM) models for regression and classification: We trained LSTMs with the kineme (LSTM Kin), AU (LSTM AU) and speech sequences (LSTM Aud). We also performed bimodal feature fusion (FF) and decision fusion (DF) with all combinations (LSTM Kin+AU, LSTM Kin+Aud and LSTM AU+Aud), and trimodal LSTM fusion (LSTM Kin+AU+Aud). The kineme sequences are one-hot encoded, where the kineme denoting a given time-window is coded to 1 and the rest to 0. AU sequences are encoded by setting the dominant AUs to 1 and rest to 0 for the time-window, creating a binary 17-element AU vector. Speech sequences are created by z-normalizing LLDs averaged over the time-window. For a behavioral slice involving L time windows with N training samples, the kineme, AU and speech features form 3D matrices of size 16 × N × L, 17 × N × L, and 23 × N × L respectively.
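The kineme one-hot encoding can be sketched as below (a NumPy illustration; the 16-kineme vocabulary follows the text, while the example sequence is invented):

```python
import numpy as np

def one_hot_kinemes(seq, n_kinemes=16):
    """Encode a kineme-index sequence of length L as an (n_kinemes, L) one-hot matrix."""
    out = np.zeros((n_kinemes, len(seq)), dtype=int)
    out[seq, np.arange(len(seq))] = 1   # set the active kineme per time window
    return out

seq = np.array([3, 3, 7, 0, 15])        # one kineme per 2 s time window
X = one_hot_kinemes(seq)
print(X.shape)                          # (16, 5)
```

Stacking such matrices over N training samples yields the 16 × N × L input tensor; the binary AU vectors and z-normalized speech features are stacked analogously.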
Unimodal and feature fusion (FF).
A single hidden LSTM layer is employed for unimodal prediction followed by a dense layer involving one neuron with sigmoidal/linear activation for classification/regression. For bimodal and trimodal feature fusion, unimodal descriptors are fused by applying a single LSTM layer to each feature. The subsequent outputs are merged followed by a dense layer comprising a single neuron as above (see Fig 3). The hyperparameters such as number of neurons, activation function and dropout rate are tuned via the validation set. An Adam optimizer is utilized for training with learning rate of 0.01. We employ binary cross entropy and mean absolute error as loss functions for classification and regression respectively.
N denotes the number of neurons per layer. For regression, the dense layer output uses linear activation, with 32 neurons in the LSTM layer.
Attention fusion (LSTM AF).
To achieve multimodal explanations, we employ attention-based trimodal fusion as in [50] to assign importance weights to the three modalities at each time window (Fig 4). While dense layers are employed for each cue in [50], we use one LSTM layer per modality to quantify its importance weight. Also, while we compute weights from softmax scores generated per time step, [50] focuses only on the channel with the maximum attention weight, discarding the others. As in Fig 4(a), an LSTM layer is employed for each modality to learn temporal dynamics, resulting in a fixed-length feature vector per modality. The unimodal descriptors are concatenated and passed through a fully connected layer followed by a softmax layer composed of three neurons (Fig 4(b)). The attention scores generated by the softmax layer are deemed the relative contributions of the modalities per time window. Layer normalization is applied over each unimodal feature vector. To fuse the normalized features, we employ an additive layer that sums the weighted unimodal features, followed by a dense layer comprising a single neuron with sigmoidal/linear activation for classification/regression. We aggregate weights to compute modality contributions over behavioral slices spanning multiple time windows.
(a) Additive attention fusion architecture overview, and (b) Attention score computation process (FC layer comprises twelve neurons). N denotes the number of neurons per layer. Linear/sigmoid activation is applied on the dense layer output for regression/classification.
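The attention-score computation and additive fusion can be sketched as below (a NumPy illustration of the weighting step only, not the trained model; vector sizes, the random weight matrix and feature values are all assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One fixed-length feature vector per modality from its LSTM (here random).
rng = np.random.default_rng(1)
f_kin, f_au, f_aud = rng.random(32), rng.random(32), rng.random(32)

# An FC layer maps the concatenated features to 3 logits; softmax yields
# per-modality attention scores (relative contributions).
W = rng.random((3, 96))
logits = W @ np.concatenate([f_kin, f_au, f_aud])
a = softmax(logits)                        # a.sum() == 1

# Layer-normalize each modality, then take the attention-weighted sum.
norm = lambda f: (f - f.mean()) / f.std()
fused = a[0] * norm(f_kin) + a[1] * norm(f_au) + a[2] * norm(f_aud)
print(a.round(3), fused.shape)
```

Because the scores a sum to one, averaging them over all time windows of a slice directly gives the per-modality contribution percentages reported for explanations.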
Decision fusion (DF).
We adopt the fusion weight estimation approach [55] outlined below. Assuming the unimodal classifier/regressor scores are p1 and p2 for the bimodal fusion, the test sample score is defined as αp1 + (1 − α)p2, α ∈ [0, 1]. We perform grid search with a step-size of 0.05 to identify the optimal α* maximizing F1-score and Pearson correlation coefficient (PCC), respectively, for classification and regression (the same is extended to trimodal fusion).
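The fusion-weight search can be sketched as below (a NumPy illustration for the regression case, maximizing PCC; the toy unimodal scores are invented):

```python
import numpy as np

def pcc(a, b):
    """Pearson correlation coefficient between two score vectors."""
    return np.corrcoef(a, b)[0, 1]

def fuse_weight(p1, p2, y, step=0.05):
    """Grid-search alpha in [0, 1] maximizing PCC of alpha*p1 + (1-alpha)*p2 vs y."""
    alphas = np.arange(0, 1 + 1e-9, step)
    scores = [pcc(a * p1 + (1 - a) * p2, y) for a in alphas]
    best = int(np.argmax(scores))
    return alphas[best], scores[best]

rng = np.random.default_rng(2)
y = rng.random(100)                       # ground-truth trait scores
p1 = y + 0.1 * rng.standard_normal(100)   # a good unimodal regressor
p2 = rng.random(100)                      # an uninformative regressor
alpha, best_pcc = fuse_weight(p1, p2, y)
print(round(alpha, 2), round(best_pcc, 3))
```

For classification, the same search is run with F1-score in place of PCC, and the trimodal case adds a second nested weight.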
4 Experimental results
4.1 Datasets
The FICS dataset [16] contains 10K self-presentation snippets derived from YouTube videos of people talking into the camera. Averaging 15s in length, these videos are split into a 3:1:1 proportion for train (6000 samples), validation (2000 samples) and test (2000 samples). All videos are annotated with OCEAN trait scores, with ‘N’ scores denoting emotional stability instead of Neuroticism. The MIT dataset [17] comprises audio-visual recordings of 138 mock job interviews with 69 undergraduate students, with videos being 4.7 minutes long on average. All videos are annotated with 16 interviewee-specific traits. We focus on the following traits: recommended hiring score (RH) denoting the candidate’s hireability, level of excitement (Ex), friendliness (Fr) and eye-contact (EC). We also examine the Overall (Ov) interview score in prediction experiments. Representative examples from the two datasets are presented in [18].
4.2 Quantitative experiments
Prediction settings.
Both datasets provide continuous trait scores, naturally posing human trait estimation as a regression problem. We explore both continuous and discrete predictions for personality and interview traits. For regression, annotation values are standardized to the range [0, 1]. For binary classification, trait scores are dichotomized by thresholding at their median value (refer to Table 2 for the class distribution). Tables 3 and 4 present regression results, while Tables 5 and 6 showcase the classification results. For the FICS dataset, the models are fine-tuned via the pre-defined validation set, while hyperparameter tuning is achieved via 10-fold cross-validation (cv) on the smaller MIT Interview dataset (resulting in 90% of the data for training and 10% for testing). Results reported on the MIT dataset are μ±σ statistics noted over 50 runs (5 repeated runs of 10-fold cross-validation). Early stopping with a patience of 4 epochs is employed to prevent model degradation.
MIT class distributions correspond to 1-minute video samples employed for analysis.
Accuracy and PCC values are tabulated as (μ±σ) values, with highest PCC achieved per trait denoted in bold.
Accuracy and PCC values for different methods are tabulated, with highest PCC achieved per trait denoted in bold.
Accuracy and F1-score are tabulated as (μ±σ) values, with highest F1 achieved per trait denoted in bold.
Accuracy and F1-score for different methods are tabulated, with highest F1 achieved per trait denoted in bold.
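Label preparation for the two prediction settings described above can be sketched as follows (a NumPy illustration on invented annotation scores):

```python
import numpy as np

scores = np.array([2.1, 3.4, 4.9, 6.0, 3.3, 5.2])   # raw trait annotations

# Regression targets: standardize to the [0, 1] range.
y_reg = (scores - scores.min()) / (scores.max() - scores.min())

# Classification targets: dichotomize at the median.
y_cls = (scores > np.median(scores)).astype(int)
print(y_reg.round(2), y_cls)
```

Median thresholding keeps the two classes roughly balanced, which is why accuracy and F1 are both meaningful in Tables 5 and 6.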
Chunk vs video-level prediction.
To examine trait prediction over tiny behavioral episodes (or slices), we segment the original videos into smaller chunks of 2–7s for FICS, and 2–60s for the MIT dataset. All video chunks are assigned the source video label. We then compute metrics over a) all chunks (chunk-level performance), and b) over all videos by assigning the majority label/mean value over all chunks (video-level performance) for classification/regression. A comparison of chunk vs video-level predictions for the three modalities is presented in S1–S3 Figs.
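The chunk-to-video aggregation can be sketched as below (a NumPy illustration; the per-chunk predictions are invented):

```python
import numpy as np

def video_level(chunk_preds, task):
    """Aggregate one video's chunk predictions: majority vote for
    classification, mean for regression."""
    if task == "classification":
        return int(np.bincount(chunk_preds).argmax())
    return float(np.mean(chunk_preds))

labels = np.array([1, 0, 1, 1, 0])        # per-chunk class predictions
scores = np.array([0.42, 0.55, 0.47])     # per-chunk regression predictions
print(video_level(labels, "classification"), video_level(scores, "regression"))
```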
Thin-slice predictions.
We explore trait prediction over short behavioral episodes known as thin slices, and present multimodal classification and regression results using soft additive attention fusion over 2s behavioral slices in Table 7. The results convey that reasonable prediction performance is achieved even with 2s-long slices, demonstrating the efficacy of small behavioral slices for predicting different traits. For more details, please refer to S1 Text.
4.3 Experimental details
All experiments are performed using the two mentioned datasets, without external data for pre-training or fine-tuning. We optimized model training with the binary cross entropy loss function for classification and mean absolute error for regression. The network is trained using the Adam optimizer with a learning rate of 0.01. Specifically, when training on the MIT dataset, we employed 20 neurons, a batch size of 32, a dropout rate of 0.2, and set the number of epochs to 30. For the FICS dataset, the configuration includes 32 neurons, a batch size of 100 and a dropout rate of 0.2. We set the number of epochs to 300, and applied early stopping with the patience value set to 5.
4.4 Results and discussion
Based on Tables 3–7, we make the following observations:
- For regression benchmarking (Tables 3 and 4), PCC is a more stringent measure than Acc, as very low PCC values are observed with relatively high Acc values for the FICS dataset (Table 4). Tables 3 and 5 show that regression and classification results are comparable for the (smaller) MIT dataset. For FICS, the regression scores are considerably higher than the classification scores, which can be attributed to Gaussian-distributed FICS traits with means around 0.5 [16].
- Speech features achieve optimal interview trait prediction (Table 3), while Kineme and AU features perform comparably. Optimal personality trait regression is also achieved with audio features (Table 4), even as AUs significantly outperform kinemes on the FICS dataset.
- Higher PCC scores are achieved with multimodal as compared to unimodal methods on both the MIT and FICS datasets. Bimodal and trimodal fusion perform very similarly for both interview and personality trait prediction, with maximum PCC values of 0.98 achieved for the Excited and Friendliness interview traits, and a peak PCC of 0.566 achieved for the Extraversion personality trait on FICS obtained with trimodal fusion.
- Focusing on multimodal methods, bimodal combinations involving audio outperform others for interview trait prediction, implying that speech features individually and in combination with others acquire high predictive power, mirroring findings in [17]. Bimodal predictions improving over unimodal ones conveys that kinemes and AUs provide complementary information concerning interview and personality traits.
- Among trimodal fusion methods, decision fusion slightly outperforms attention and feature fusion on the MIT dataset, while decision, attention and feature fusion approaches perform first, second and third best on the FICS dataset. These results again reveal the complementary utility of the kineme, AU and speech features; optimal performance achieved with trimodal decision fusion conveys that the AU and kineme classifiers improve prediction performance in instances where speech descriptors are ineffective.
- Focusing on classification (Tables 5 and 6), considering unimodal results, audio features achieve optimal F1-scores on Interview traits (highest F1 of 0.95 for Recommended Hiring and Excited), while AUs achieve the best classification on personality traits (maximum F1 of 0.651 for Extraversion). AUs and kinemes perform similarly on the MIT dataset, while speech descriptors achieve much higher F1-scores than kinemes on FICS.
- Multimodal approaches again outperform unimodal methods in categorizing both interview and personality traits. With respect to bimodal methods, combinations involving speech tend to perform well for both interview and personality prediction.
- Trimodal fusion performs best, producing peak F1 scores of 0.98 and 0.695 for the RH interview, and Extraversion personality traits. Decision fusion produces the best trait classification performance on both datasets, with feature and attention fusion having comparable scores.
The above results represent trait prediction at the video level, on examining 15s FICS videos or upon collating classification/regression results over 5–60s chunks/segments on the MIT dataset (the best results obtained by averaging chunk-level values, or computing the majority label over all chunks are listed in Tables 3 and 5). Table 7 presents results for the 2s behavioral slice for both datasets.
4.4.1 Comparison with the state-of-the-art approaches.
Table 8 compares our proposed methodology with available baseline approaches for the FICS and MIT datasets. Among studies on the MIT dataset, the paper introducing the dataset [17] performed a series of binary classification and regression experiments utilizing multiple behavioral cues such as prosodic features, facial expressions, and interviewee language, achieving a highest PCC of 0.77 for the Excited trait and a lowest PCC of 0.27 for Eye Contact. Agrawal et al. [56] employed similar multimodal cues to predict different class labels associated with the interview process, reporting a classification accuracy of 0.6428 for the Eye Contact label. Kumar et al. [57] examined only speech features for regression analysis using CNN-LSTM fusion, obtaining a highest accuracy of 0.96 for the Overall trait and a highest PCC of 0.93 for Excited and Friendly. Compared to these previous studies, our trimodal fusion-based approach achieves an improved regression accuracy of 0.98 for all traits except Eye Contact (0.97), a PCC of 0.98 for Excited and Friendly, and the highest classification accuracy of 0.98 for Recommend Hiring. Beyond achieving better performance, we also demonstrate the efficacy of different behavioral cues in providing explanations for the interview traits.
For the FICS dataset, Yan et al. [58] investigated the biases in multimodal personality assessment induced by various sources, such as individual behavioral differences and late fusion approaches, employing data balancing and adversarial learning to report a best regression accuracy of 0.92 for Extraversion. Yagmur et al. [59] proposed an audiovisual deep residual network comprising auditory and visual streams, achieving a regression accuracy of 0.91 for all traits. Zhang et al. [60] employed the Deep Bimodal Regression (DBR) framework, modifying traditional CNNs to combine audio and visual information, and achieved a highest accuracy of 0.92 for the Conscientiousness trait. Subramanian et al. [61] introduced two bimodal end-to-end deep neural network architectures using temporally ordered audio and visual features, reporting an accuracy of 0.91 for all traits. Comparatively, our approach achieves a highest regression accuracy of 0.90 for multiple traits, including Openness, Extraversion, and Agreeableness, and a best classification accuracy of 0.69 for the Extraversion trait. Besides attaining comparable results on the FICS OCEAN traits, our approach enables behavioral explanations to support its predictions using multimodal cues.
4.4.2 Model generalisability.
Considering the diverse datasets utilized in our study, we investigate the generalisability of our approach by training the model on one dataset and testing on another. The two datasets utilized in our study are curated for distinct objectives; the FICS data are compiled primarily for the automated assessment of personality traits, while the MIT dataset is compiled for examining interview behavior. Therefore, we focus on predicting the Recommend Hiring (RH) and the Interview score traits from the MIT and FICS datasets respectively, as they share similar meanings. For evaluating model generalisability, we consider the following configurations:
- Configuration 1: We synthesize kineme units from FICS head pose angles following the procedure outlined in Sec 3.1. We then map head pose angles in the MIT dataset to the learned FICS kinemes for feature extraction and regression. Since we synthesize kineme sequences for both datasets, either dataset can be used for model training/testing.
- Configuration 2: We learn kinemes based on head movements in the MIT dataset, which are then used to represent head pose angles in the FICS data.
Empirical settings. We employ the trimodal decision fusion approach utilizing the kineme, AU and speech features, given its optimal performance over the two datasets. As the FICS dataset has fixed train, validation and test sets, we utilized the FICS train and test sets for training and testing respectively; for MIT, the entire dataset is used for training/testing. For both configurations, we used mean absolute error as the loss function, the Adam optimizer with a learning rate of 0.01, 20 neurons per LSTM layer, a batch size of 32, a dropout rate of 0.2, and 300 epochs. Early stopping was applied with a patience value of 5.
Results & discussion. Table 9 displays regression results for both configurations (classification measures were poor and are omitted here). We observe from the table that the dataset employed for kineme generation does not influence outcomes much: very similar accuracy and PCC scores are obtained for both configurations. We note that, across configurations, (a) models trained on the MIT dataset achieve better performance despite it being the smaller of the two, which suggests that the variance in appearance and/or head rotations is lower in the FICS dataset than in MIT, and (b) while relatively high Acc values (between 0.83–0.87) are obtained, cross-dataset-trained models produce very low PCC values, conveying that model predictions differ substantially from the ground-truth values. Cumulatively, while kineme-based models may not generalise optimally, the predictions across datasets are reasonably accurate and robust. Future work will focus on further improving model generalisability.
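The co-occurrence of high Acc and near-zero PCC follows from the metrics themselves. Assuming Acc denotes the FICS-style 1 − MAE score (an assumption; the definition is not restated here), near-constant predictions close to the label mean illustrate the gap:

```python
import numpy as np

def regression_acc(y_true, y_pred):
    """FICS-style regression accuracy: 1 - mean absolute error."""
    return 1.0 - np.mean(np.abs(y_true - y_pred))

def pcc(y_true, y_pred):
    """Pearson correlation coefficient between predictions and labels."""
    return np.corrcoef(y_true, y_pred)[0, 1]

# Near-constant predictions around the label mean score well on 1 - MAE,
# yet correlate poorly with the labels, reproducing the Acc/PCC gap.
rng = np.random.default_rng(1)
y = rng.uniform(0.3, 0.9, 200)                        # toy trait labels
flat = np.full_like(y, y.mean()) + rng.normal(0, 0.01, len(y))
```

Here `regression_acc(y, flat)` lands in the 0.85 region while `pcc(y, flat)` is close to zero, mirroring the cross-dataset behavior in Table 9.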
4.4.3 Ethical aspect.
Research on understanding human traits requires careful attention to privacy, consent and accuracy, with a crucial awareness of cultural, gender and ethnic differences to prevent misinterpretations. We highlight the ethical considerations associated with our work below. The developed methods facilitate accurate prediction of users’ personality and job interview traits, which can then be used to give insights and recommendations to users, especially since our predictions are supported by explanations. At the same time, these powerful tools, when used as behavioral benchmarks, could create biases and discrimination [62]. Likewise, Shahjehan et al. [63] demonstrate a correlation between Openness/Neuroticism and impulsive buying behavior, implying that individuals’ personality traits could be exploited for sales promotion. For our research, potential biases relating to racial and cultural backgrounds may arise due to the nature of the training data derived from public datasets [16, 17]. In this regard, (a) the transparency of our model, which provides relevant explanations supporting predictions [64], and (b) its generalizability, as demonstrated by prediction results on a different dataset, support the view that the presented results are generally devoid of biases. Our research strictly adheres to ethical standards, utilizing open-sourced data with a valid End User License Agreement (EULA) solely for research purposes, and does not per se target sensitive problem domains such as job recruitment.
5 Explainability & interpretability
5.1 Interpretation via kinemes and AUs
Along with their predictive power, kinemes and AUs also enable facile trait-specific behavioral explanations. To this end, we considered the top and bottom 10-percentile videos for each trait, and computed the most frequently occurring AUs and kinemes in each set. The four most frequently occurring kinemes and five dominant AUs for these high (H) and low (L)-rated videos are presented in Table 10. Analysing the table, we make the following remarks:
MIT kinemes in bold font are visualized in Fig 4.
- The presence of kineme 16 (denoting head nodding and shaking) for all OCEAN traits conveys the significance of head motion for the characterization of personality traits. The combination of head nodding and shaking with other kineme representations highlights the subtle difference between high and low-rated personality impressions. Also, note that AUs 25 and 26, signifying talking behavior, are present in all videos.
- Focusing on other kinemes, high Openness is characterized by kinemes 2 and 8, which signify persistent head movements. This finding is echoed in [65], where large motion variations are found to associate with high O impressions. Presence of AUs 12 and 14 indicates that a smiling demeanor characterizes high O. Conversely, kineme 6 denoting minimal head motion and AUs 4 and 17 typical of frowning and diffident behavior are commonly noted for low O videos.
- Kineme 1 denoting an upward head tilt is associated with high C, while kinemes 2 and 4 depicting tilt-down and head-shaking are associated with low C. This indicates that attempting to maintain eye-contact conveys diligence and honesty, while avoiding eye-contact conveys insincerity.
- Extraversion appears to be conveyed better by AUs than kinemes; dominant AUs for high E include 10, 12 and 17, indicating a friendly and talkative nature, while dominant kinemes 2 and 14 convey significant head movements. Conversely, low E is associated with kineme 4 denoting head-shaking and AUs 4, 7 and 17 indicating frowning, overall conveying a socially distant nature.
- High Agreeableness is characterized by kineme 3 (head-nod), and AUs 12 and 14 which constitute a smile. Conversely, kinemes 1, 8 and 9 dominate low A, and they collectively convey persistent head motion. Also, AUs dominant for low A are 4, 14 and 17, cumulatively describing a frown; overall, nodding and smiling is viewed as courteous, while frequent head movements and frowning convey hostility.
- Emotional stability (high N) is associated with kinemes 2 and 8, and AUs 7, 12 and 17, indicating persistent head motion and facial expressiveness. On the other hand, a neurotic trait is conveyed via limited head motion and head-shaking (kinemes 1, 5, 12) and frowning (described by AUs 4, 7, 10).
- While kinemes for the MIT videos are less discernible, due to smaller face size and the fact that they capture an interactional setting, some patterns are nevertheless evident as seen in Fig 2(b); these kinemes are highlighted in Table 10. As with FICS, kineme 14 denoting a head-nod is commonly observed for all high-trait videos, while kineme 11 depicting a head-shake is common for all low-trait videos.
- High RH scores are elicited by expressive facial behavior involving head-nodding and smiling. Conversely, low RH scores are associated with head-shaking and limited facial expressions. Highly excited behavior is associated with the same AUs as high RH, together with persistent head motion. Inversely, low excitement scores are associated with head-shaking and limited facial expressiveness.
- Identical AUs are observed for both high and low eye-contact, implying that head movements primarily impact eye-contact impressions. Head nodding (kineme 14) is associated with high EC, while kinemes 11 and 16, depicting head shaking and frequent head-nodding, elicit low EC scores. Interestingly, then, while head nodding is beneficial, frequent nodding is perceived as avoiding eye-contact.
- High friendliness is characterized by kinemes 11, 14 and 16, signifying persistent head motion along with expressive and smiling facial movements (AUs 5, 12 and 14). Conversely, low friendliness is associated with head-shaking (kineme 11) and frowning (AUs 4, 6, 7).
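The percentile analysis behind these remarks can be sketched as follows; the 10-percentile split is from the text, while the toy kineme sequences and the frequency-ranking details are assumptions for illustration:

```python
import numpy as np
from collections import Counter

def dominant_units(unit_seqs, scores, k_units=4, pct=10):
    """Most frequent behavioral units (kinemes or AUs) in the top and
    bottom pct-percentile videos, ranked by a trait score."""
    scores = np.asarray(scores)
    lo, hi = np.percentile(scores, [pct, 100 - pct])
    high = Counter(u for s, seq in zip(scores, unit_seqs) if s >= hi for u in seq)
    low = Counter(u for s, seq in zip(scores, unit_seqs) if s <= lo for u in seq)
    return ([u for u, _ in high.most_common(k_units)],
            [u for u, _ in low.most_common(k_units)])

# Toy example: hypothetical kineme sequences for ten videos scored 0..9;
# the lowest-rated video frowns (units 4, 7), the highest nods (unit 16).
seqs = [[4, 4, 7]] + [[1, 2, 3]] * 8 + [[16, 16, 2]]
high_units, low_units = dominant_units(seqs, list(range(10)), k_units=2)
```

Running the same routine per trait over AU and kineme streams yields Table 10-style summaries of high- versus low-rated behavior.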
5.2 Attention score-based interpretations
While Table 10 presents unimodal behavioral explanations via kinemes and AUs, behaviors are expressed and best modeled multimodally as seen from our empirical results (Section 4.4). For multimodal explanations, we explore the attention-fusion network (Fig 4) to estimate the relative contribution of each modality towards trait regression. We visualize softmax scores learned by the attention-fusion network as follows. For the FICS dataset, we present mean attention scores obtained over 10 runs on 15s test videos (Fig 5(left)), while we present softmax scores averaged over 15s chunks for MIT videos across 50 runs (Fig 5(right)). Our remarks from the weight plots are as follows:
Error bars denote standard error.
- Cumulatively, Fig 5 conveys that while the relative contribution of speech features towards weighted fusion is not high for personality trait prediction, they tend to play a significant role in predicting interview traits on the MIT dataset. These observations mirror prior findings; the criticality of visual features such as head movements and facial movements for personality trait recognition has been noted in [14, 68] while the impact of prosodic speech features on interview trait impressions is discussed in [17].
- Fig 5(left) conveys that either kineme or AU features are most critical for personality trait prediction. Specifically, kinemes maximally contribute to the prediction of Openness and Extraversion, while AUs are most critical for predicting Agreeableness and Neuroticism. Both kinemes and AUs are found to be equally critical for estimating Conscientiousness, in line with the findings in [69]. Extraversion and Openness are conveyed by exaggerated physical and head movements [65, 70], with different head movement patterns representing high and low Extraversion [71]. While Agreeableness is also positively correlated with head movements [70, 71], empathetic behavior is accurately conveyed via facial expressions as denoted by the higher AU weights. Facial movements (e.g., unconcerned or anxious) better convey emotional stability [72].
- From Fig 5(right), it can be seen that facial movements have relatively less impact on interview trait prediction, with the exception of eye contact. This can partly be attributed to the smaller face size in MIT videos, limiting the efficacy of AU detection. Conversely, speech features significantly impact trait prediction, with the exception of the recommended hiring and eye contact traits. While prosodic speech behavior has been found to considerably influence interview trait impressions [17, 73], other forms of non-verbal behavior such as positive facial expressions and frequent postural changes are known to impact hireability [74].
- For the Excited trait, speech plays a prominent role with a high correlation to continuous or restricted head movement [75]. On the surprising finding of AUs and speech features impacting eye-contact, prior studies [76] have revealed a low-yet-meaningful correlation between eye contact impressions and vocal acoustic features. Friendliness is best characterized by head movement and voice features, showing that the integration of visual and auditory modalities can be crucial in discerning interviewee friendliness [77].
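The modality weights discussed above can be made concrete with a minimal numpy sketch of additive attention fusion. The score form v·tanh(W h_m) is a common choice assumed here; in the actual network, W and v are learned jointly with the LSTM encoders:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention_fusion(h, W, v):
    """Score each modality encoding h_m with v . tanh(W h_m), normalise the
    scores with a softmax, and fuse the modalities by their weights."""
    scores = np.array([v @ np.tanh(W @ hm) for hm in h])
    alpha = softmax(scores)                    # relative modality importance
    fused = (alpha[:, None] * h).sum(axis=0)   # attention-weighted fusion
    return fused, alpha

rng = np.random.default_rng(0)
d = 8                                  # toy encoding dimensionality
h = rng.standard_normal((3, d))        # kineme, AU and speech encodings (toy)
W = rng.standard_normal((d, d))
v = rng.standard_normal(d)
fused, alpha = additive_attention_fusion(h, W, v)
# alpha sums to 1; averaging it over runs yields Fig 5-style weight plots
```

The `alpha` vector is exactly what is visualized in Fig 5: one non-negative weight per modality, interpretable as that modality's relative contribution to the fused representation.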
To summarise, we explore interpretability using the visual features by identifying the most frequently occurring AUs and kinemes for the top and bottom 10-percentile videos. A further manual analysis reveals characteristic head movements and subtle facial actions representative of different personality and interview traits, consistent with human understanding. For multimodal explanations, we extend the interpretability approach using the attention-fusion trimodal architecture to measure the relative contribution of each modality towards trait regression. The softmax scores obtained over 15s videos are visualized as means over 10 runs for FICS and 50 runs for MIT. This analysis further validates the relative differences between the modality-specific attention weights based on prior findings. We acknowledge that averaging may obscure significant variations and nuances across the different runs of the model, and intend to explore alternate behavioral explanations in future work, including analysing the impact of systematically altering modality-specific attention weights on predictions.
6 Conclusion
This work demonstrates the efficacy of multimodal (kineme, AU and speech) behavioral cues to achieve explainable prediction of OCEAN and interview traits. Our results confirm that efficient trait prediction can be achieved with both unimodal and multimodal approaches. Also, multimodal approaches outperform their unimodal counterparts owing to complementary information provided by trait-specific behavioral cues. In addition, frequently occurring kineme and AU patterns enable behavioral explanations associated with each trait.
In terms of limitations, this work extracts all behavioral features over a fixed time window (same time-scale); however, behaviors associated with human personality may manifest over different time scales. For example, facial expression or head motion patterns could be affected by speaking behavior (talkative: drastic variation in speaking behavior across video frames; reserved: lingering silence over most video frames). Investigating the effect of temporal scales will be a future research direction. Trait-specific behavioral patterns can also be utilized to create virtual agents that train users in interviewing or public speaking settings. The authors do not advise using the proposed methodologies for complex processes like job recruitment per se; however, explanatory technologies can be utilized as a complementary tool in decision-making processes.
Supporting information
S1 Fig. Chunk vs video-level predictions with kinemes for FICS (left) and MIT (right).
https://doi.org/10.1371/journal.pone.0313883.s001
(TIF)
S2 Fig. Chunk vs video-level predictions with AUs for FICS (left) and MIT (right).
https://doi.org/10.1371/journal.pone.0313883.s002
(TIF)
S3 Fig. Chunk vs video-level predictions with speech features for FICS (left) and MIT (right).
https://doi.org/10.1371/journal.pone.0313883.s003
(TIF)
Acknowledgments
We would like to thank A. Samanta, IIT Kanpur for sharing the kineme implementation.
References
- 1. Jacques Junior JCS, Güçlütürk Y, Pérez M, Güçlü U, Andujar C, Baró X, et al. First impressions: A survey on vision-based apparent personality trait analysis. IEEE Transactions on Affective Computing. 2019;13(1):75–95.
- 2. Vinciarelli A, Mohammadi G. A survey of personality computing. IEEE Transactions on Affective Computing. 2014;5(3):273–91.
- 3. McCrae RR, Costa PT. Validation of the five-factor model of personality across instruments and observers. Journal of Personality and Social Psychology. 1987;52(1):81–90. pmid:3820081
- 4. Digman JM. Higher-order factors of the Big Five. Journal of Personality and Social Psychology. 1997;73(6):1246–56. pmid:9418278
- 5. Cattell HE. The sixteen personality factor (16PF) questionnaire. In: Understanding psychological assessment. Springer; 2001. p. 187–215.
- 6. Raza SM, Carpenter BN. A model of hiring decisions in real employment interviews. Journal of Applied Psychology. 1987;72(4):596–603.
- 7. Batrinca LM, Mana N, Lepri B, Pianesi F, Sebe N. Please, tell me about yourself: automatic personality assessment using short self-presentations. In: Proceedings of the 13th international conference on multimodal interfaces. 2011. p. 255–62.
- 8. Van Dam K. Trait Perception in the Employment Interview: A Five–Factor Model Perspective. International Journal of Selection and Assessment. 2003;11(1):43–55.
- 9. DeGroot T, Gooty J. Can nonverbal cues be used to make meaningful personality attributions in employment interviews? Journal of business and psychology. 2009;24:179–92.
- 10. Samanta A, Guha T. On the role of head motion in affective expression. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2017. p. 2886–90.
- 11. Samanta A, Guha T. Emotion Sensing From Head Motion Capture. IEEE Sensors Journal. 2021 Feb;21(4):5035–43.
- 12. Sidorov M, Ultes S, Schmitt A. Automatic recognition of personality traits: A multimodal approach. In: Proceedings of the 2014 Workshop on Mapping Personality Traits Challenge and Workshop. 2014. p. 11–5.
- 13. Kampman O, Barezi EJ, Bertero D, Fung P. Investigating audio, visual, and text fusion methods for end-to-end automatic personality prediction. arXiv preprint arXiv:1805.00705. 2018.
- 14. Malik H, Dhillon H, Goecke R, Subramanian R. I am empathetic and dutiful, and so will make a good salesman: Characterizing Hirability via Personality and Behavior. 2020;
- 15. Eddine Bekhouche S, Dornaika F, Ouafi A, Taleb-Ahmed A. Personality traits and job candidate screening via analyzing facial videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 2017. p. 10–3.
- 16. Escalante HJ, Kaya H, Salah AA, Escalera S, Güçlütürk Y, Güçlü U, et al. Modeling, Recognizing, and Explaining Apparent Personality From Videos. IEEE Transactions on Affective Computing. 2022 Apr;13(2):894–911.
- 17. Naim I, Tanveer MdI, Gildea D, Hoque ME. Automated Analysis and Prediction of Job Interview Performance. IEEE Transactions on Affective Computing. 2018 Apr;9(2):191–204.
- 18. Madan S, Gahalawat M, Guha T, Subramanian R. Head matters: explainable human-centered trait prediction from head motion dynamics. In: Proceedings of the 2021 International Conference on Multimodal Interaction. 2021. p. 435–43.
- 19. Güçlütürk Y, Güçlü U, Baro X, Escalante HJ, Guyon I, Escalera S, et al. Multimodal first impression analysis with deep residual networks. IEEE Transactions on Affective Computing. 2017;9(3):316–29.
- 20. Hoppe S, Loetscher T, Morey SA, Bulling A. Eye movements during everyday behavior predict personality traits. Frontiers in human neuroscience. 2018;12:328195. pmid:29713270
- 21. Rauthmann JF, Seubert CT, Sachse P, Furtner MR. Eyes as windows to the soul: Gazing behavior is related to personality. Journal of Research in Personality. 2012;46(2):147–56.
- 22. Jayagopi DB, Hung H, Yeo C, Gatica-Perez D. Modeling dominance in group conversations using nonverbal activity cues. IEEE Transactions on Audio, Speech, and Language Processing. 2009;17(3):501–13.
- 23. Subramanian R, Yan Y, Staiano J, Lanz O, Sebe N. On the relationship between head pose, social attention and personality prediction for unstructured and dynamic group interactions. In: Proceedings of the 15th ACM on International conference on multimodal interaction. 2013. p. 3–10.
- 24. Malik H, Dhillon H, Parameshwara R, Goecke R, Subramanian R. Examining the Influence of Personality and Multimodal Behavior on Hireability Impressions. In: Proceedings of the Fourteenth Indian Conference on Computer Vision, Graphics and Image Processing. 2023. p. 1–9.
- 25. Hassan S, Akhtar N, Yılmaz AK. Impact of the conscientiousness as personality trait on both job and organizational performance. Journal of Managerial Sciences. 2016;10(1).
- 26. Moy JW, Lam KF. Selection criteria and the impact of personality on getting hired. Personnel Review. 2004;33(5):521–35.
- 27. Tay C, Ang S, Van Dyne L. Personality, biographical characteristics, and job interview success: a longitudinal study of the mediating effects of interviewing self-efficacy and the moderating effects of internal locus of causality. Journal of Applied Psychology. 2006;91(2):446. pmid:16551195
- 28. Barrick MR, Mount MK. The big five personality dimensions and job performance: a meta‐analysis. Personnel psychology. 1991;44(1):1–26.
- 29. Witt LA, Burke LA, Barrick MR, Mount MK. The interactive effects of conscientiousness and agreeableness on job performance. Journal of Applied Psychology. 2002;87(1):164–9. pmid:11916210
- 30. Mount MK, Barrick MR, Stewart GL. Five-factor model of personality and performance in jobs involving interpersonal interactions. Human performance. 1998;11(2–3):145–65.
- 31. Rothmann S, Coetzer EP. The big five personality dimensions and job performance. SA Journal of industrial psychology. 2003;29(1):68–74.
- 32. Samek W, Müller KR. Towards explainable artificial intelligence. Explainable AI: interpreting, explaining and visualizing deep learning. 2019;5–22.
- 33. Wicaksana AS, Liem CC. Human-explainable features for job candidate screening prediction. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE; 2017. p. 1664–9.
- 34. Wei XS, Zhang CL, Zhang H, Wu J. Deep bimodal regression of apparent personality traits from short video sequences. IEEE Transactions on Affective Computing. 2017;9(3):303–15.
- 35. Ventura C, Masip D, Lapedriza A. Interpreting cnn models for apparent personality trait regression. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 2017. p. 55–63.
- 36. Gucluturk Y, Guclu U, Perez M, Jair Escalante H, Baro X, Guyon I, et al. Visualizing apparent personality analysis with deep residual networks. In: Proceedings of the IEEE International Conference on Computer Vision Workshops. 2017. p. 3101–9.
- 37. Ding Y, Shi L, Deng Z. Low-level characterization of expressive head motion through frequency domain analysis. IEEE Transactions on Affective Computing. 2018;11(3):405–18.
- 38. Gunes H, Pantic M. Dimensional emotion prediction from spontaneous head gestures for interaction with sensitive artificial listeners. In: Intelligent Virtual Agents: 10th International Conference, IVA 2010, Philadelphia, PA, USA, September 20-22, 2010 Proceedings 10. Springer; 2010. p. 371–7.
- 39. Yang Z, Narayanan SS. Modeling dynamics of expressive body gestures in dyadic interactions. IEEE Transactions on Affective Computing. 2016;8(3):369–81.
- 40. An G, Levitan R. Lexical and Acoustic Deep Learning Model for Personality Recognition. In: INTERSPEECH. 2018. p. 1761–5.
- 41. Valente F, Kim S, Motlicek P. Annotation and Recognition of Personality Traits in Spoken Conversations from the AMI Meetings Corpus. In: INTERSPEECH. 2012. p. 1183–6.
- 42. Mangalam K, Guha T. Learning spontaneity to improve emotion recognition in speech. arXiv preprint arXiv:1712.04753. 2017.
- 43. Tawari A, Trivedi MM. Speech emotion analysis: Exploring the role of context. IEEE Transactions on multimedia. 2010;12(6):502–9.
- 44. Abdel-Hamid L. Egyptian Arabic speech emotion recognition using prosodic, spectral and wavelet features. Speech Communication. 2020;122:19–30.
- 45. Levitan SI, Levitan Y, An G, Levine M, Levitan R, Rosenberg A, et al. Identifying individual differences in gender, ethnicity, and personality from dialogue for deception detection. In: Proceedings of the second workshop on computational approaches to deception detection. 2016. p. 40–4.
- 46. Dhall A, Hoey J. First impressions-predicting user personality from twitter profile images. In: Human Behavior Understanding: 7th International Workshop, HBU 2016, Amsterdam, The Netherlands, October 16, 2016, Proceedings 7. Springer; 2016. p. 148–58.
- 47. Al Moubayed N, Vazquez-Alvarez Y, McKay A, Vinciarelli A. Face-based automatic personality perception. In: Proceedings of the 22nd ACM international conference on Multimedia. 2014. p. 1153–6.
- 48. Meng KS, Leung L. Factors influencing TikTok engagement behaviors in China: An examination of gratifications sought, narcissism, and the Big Five personality traits. Telecommunications Policy. 2021;45(7):102172.
- 49. Song S, Jaiswal S, Sanchez E, Tzimiropoulos G, Shen L, Valstar M. Self-supervised learning of person-specific facial dynamics for automatic personality recognition. IEEE Transactions on Affective Computing. 2021;14(1):178–95.
- 50. Sharma R, Guha T, Sharma G. Multichannel attention network for analyzing visual behavior in public speaking. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE; 2018. p. 476–84.
- 51. Birdwhistell RL. Essays on body motion communication. Philadelphia: University of Pennsylvania; 1970.
- 52. Baltrušaitis T, Robinson P, Morency LP. Openface: an open source facial behavior analysis toolkit. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE; 2016. p. 1–10.
- 53. McFee B, Raffel C, Liang D, Ellis DP, McVicar M, Battenberg E, et al. librosa: Audio and music signal analysis in python. In: SciPy. 2015. p. 18–24.
- 54. Schuller B, Steidl S, Batliner A. The interspeech 2009 emotion challenge. 2009;
- 55. Koelstra S, Patras I. Fusion of facial expressions and EEG for implicit affective tagging. Image and Vision Computing. 2013;31(2):164–74.
- 56. Agrawal A, George RA, Ravi SS. Leveraging multimodal behavioral analytics for automated job interview performance assessment and feedback. arXiv preprint arXiv:2006.07909. 2020.
- 57. Kumar D, Raman B. Speech-Based Automatic Prediction of Interview Traits. In: International Conference on Computer Vision and Image Processing. Springer; 2022. p. 586–96.
- 58. Yan S, Huang D, Soleymani M. Mitigating biases in multimodal personality assessment. In: Proceedings of the 2020 International Conference on Multimodal Interaction. 2020. p. 361–9.
- 59. Güçlütürk Y, Güçlü U, van Gerven MA, van Lier R. Deep impression: Audiovisual deep residual networks for multimodal apparent personality trait recognition. In: Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part III 14. Springer; 2016. p. 349–58.
- 60. Zhang CL, Zhang H, Wei XS, Wu J. Deep bimodal regression for apparent personality analysis. In: European conference on computer vision. Springer; 2016. p. 311–24.
- 61. Subramaniam A, Patel V, Mishra A, Balasubramanian P, Mittal A. Bi-modal first impressions recognition using temporally ordered deep audio and stochastic visual features. In: Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part III 14. Springer; 2016. p. 337–48.
- 62. McClendon J, Bogdan R, Jackson JJ, Oltmanns TF. Mechanisms of Black–White disparities in health among older adults: Examining discrimination and personality. Journal of Health Psychology. 2021;26(7):995–1011. pmid:31250666
- 63. Shahjehan A, Qureshi JA, Zeb F, Saifullah K. The effect of personality on impulsive and compulsive buying behaviors. African journal of business management. 2012;6(6):2187.
- 64. Van Nuenen T, Ferrer X, Such JM, Coté M. Transparency for whom? Assessing discriminatory artificial intelligence. Computer. 2020;53(11):36–44.
- 65. Koppensteiner M. Motion cues that make an impression: Predicting perceived personality by minimal motion information. Journal of experimental social psychology. 2013;49(6):1137–43. pmid:24223432
- 66. Ishii R, Ahuja C, Nakano YI, Morency LP. Impact of personality on nonverbal behavior generation. In: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents. 2020. p. 1–8.
- 67. Osugi T, Kawahara JI. Effects of head nodding and shaking motions on perceptions of likeability and approachability. Perception. 2018;47(1):16–29. pmid:28945151
- 68. Lepri B, Subramanian R, Kalimeri K, Staiano J, Pianesi F, Sebe N. Connecting meeting behavior with extraversion—A systematic study. IEEE Transactions on Affective Computing. 2012;3(4):443–55.
- 69. Celiktutan O, Gunes H. Automatic prediction of impressions in time and across varying context: Personality, attractiveness and likeability. IEEE transactions on affective computing. 2015;8(1):29–42.
- 70. Oberzaucher E, Grammer K. Everything is movement: on the nature of embodied communication. Embodied communication in humans and machines. 2008;151–77.
- 71. Ruhland K, Zibrek K, McDonnell R. Perception of personality through eye gaze of realistic and cartoon models. In: Proceedings of the ACM SIGGRAPH Symposium on Applied Perception. 2015. p. 19–23.
- 72. Breil SM, Osterholz S, Nestler S, Back MD. 13 contributions of nonverbal cues to the accurate judgment of personality traits. The Oxford handbook of accurate personality judgment. 2021;195–218.
- 73. DeGroot T, Kluemper D. Evidence of predictive and incremental validity of personality factors, vocal attractiveness and the situational interview. International Journal of Selection and Assessment. 2007;15(1):30–9.
- 74. Levine SP, Feldman RS. Women and men’s nonverbal behavior and self-monitoring in a job interview setting. Applied HRM Research. 2002;7(1):1–14.
- 75. Walther S, Ramseyer F, Horn H, Strik W, Tschacher W. Less structured movement patterns predict severity of positive syndrome, excitement, and disorganization. Schizophrenia bulletin. 2014;40(3):585–91. pmid:23502433
- 76. Eyben F, Weninger F, Paletta L, Schuller BW. The acoustics of eye contact: detecting visual attention from conversational audio cues. In: Proceedings of the 6th workshop on Eye gaze in intelligent human machine interaction: gaze in multimodal interaction. 2013. p. 7–12.
- 77. House D. Integrating audio and visual cues for speaker friendliness in multimodal speech synthesis. In: INTERSPEECH. Citeseer; 2007. p. 1250–3.