Abstract
We explore the efficacy of multimodal behavioral cues for explainable prediction of personality and interview-specific traits. We utilize elementary head-motion units named kinemes, atomic facial movements termed action units and speech features to estimate these human-centered traits. Empirical results confirm that kinemes and action units enable discovery of multiple trait-specific behaviors while also enabling explainability in support of the predictions. For fusing cues, we explore decision and feature-level fusion, and an additive attention-based fusion strategy which quantifies the relative importance of the three modalities for trait prediction. Examining various long-short term memory (LSTM) architectures for classification and regression on the MIT Interview and First Impressions Candidate Screening (FICS) datasets, we note that: (1) Multimodal approaches outperform unimodal counterparts, achieving the highest PCC of 0.98 for Excited-Friendly traits in MIT and 0.57 for Extraversion in FICS; (2) Efficient trait predictions and plausible explanations are achieved with both unimodal and multimodal approaches, and (3) Following the thin-slice approach, effective trait prediction is achieved even from two-second behavioral snippets. Our implementation code is available at: https://github.com/deepsurbhi8/Explainable_Human_Traits_Prediction.
Citation: Madan S, Gahalawat M, Guha T, Goecke R, Subramanian R (2025) Explainable human-centered traits from head motion and facial expression dynamics. PLoS ONE 20(1): e0313883. https://doi.org/10.1371/journal.pone.0313883
Editor: Alessandro Bruno, International University of Languages and Media: Libera Universita di Lingue e Comunicazione, ITALY
Received: August 16, 2023; Accepted: November 2, 2024; Published: January 17, 2025
Copyright: © 2025 Madan et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data has been updated in a public repository with the following URL: https://github.com/deepsurbhi8/Explainable_Human_Traits_Prediction The data is available without any access restrictions.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Personality is a psychological construct that describes human behavior in terms of habitual and fairly stable patterns of emotions, thoughts, and attributes [1, 2]. Personality is typically characterized by the OCEAN traits typified by the big-five model [3]: Openness (creative vs conservative), Conscientiousness (diligent vs disorganized), Extraversion (social vs aloof), Agreeableness (empathetic vs distant) and Neuroticism (anxious vs emotionally stable). Other popular personality models include the big-two model which categorizes these five traits into the Plasticity and Stability dimensions [4], and the 16 personality factors model [5].
Personality plays a crucial role in shaping an individual’s behavioral and communication traits, and how one conducts themselves in different social situations. To this end, multimodal non-verbal cues are critical in exhibiting an individual’s inter-personal skills in the context of ‘multimedia CVs’ [6, 7]. Subjective impressions of interviewees’ personality traits can influence hiring decisions [8], and even one behavioral modality can explain personality attributions [9]. E.g., Conscientiousness characterizing diligence and honesty is reflected in an upright posture and minimal head movements, while Neuroticism indicating anxiety and stress is revealed through fidgeting and camera aversion in self-presentation videos [7].
This paper builds on the above findings, and explores the efficacy of multimodal behavioral cues to explainably predict personality and job interview traits. In particular, we examine (i) elementary head motions termed kinemes, (ii) atomic facial movements called action units (AUs), and (iii) prosodic and acoustic speech features for trait prediction (see Fig 1 for an overview). We first evaluate the efficacy of the temporal characteristics of each individual behavioral channel in predicting these traits using long-short term memory (LSTM) architectures. Next, we explore different multimodal fusion strategies (feature fusion, decision fusion, and additive soft attention) to enhance each channel’s predictive power and explainability. Recent studies have already shown the effectiveness of kineme patterns for emotional trait prediction [10, 11], while acoustic features and facial expressions have been successfully employed for estimating personality attributes [1, 12, 13] and candidate hireability (suitability to hire/interview later) [14, 15].
Examining various LSTM architectures for classification and regression on the diverse FICS [16] and MIT interview [17] datasets, we make the following observations: (i) Both kinemes and AUs achieve explanative trait prediction. (ii) Multimodal approaches leverage cue-complementarity to better predict interview and personality attributes than unimodal ones. (iii) Trimodal fusion-based attention scores enable behavioral explanations, and provide insights into the relative contribution of each modality over time. (iv) Adequate predictive power is achieved even with 2s-long behavioral episodes or slices. Overall, this paper makes the following research contributions:
- Building upon our initial results [18], we novelly employ kinemes, action units and speech features for the estimation of personality and interview traits. Given the strong correlations among personality and interview traits [16, 19], we show that the three behavioral modalities are both predictive and explanative of these traits. We explore distinct strategies for temporally fusing behavioral features. Fusion approaches outperform unimodal ones by a large margin owing to the complementary nature of the cues and modalities.
- Our experiments reveal that speech features are highly predictive of interview traits on the MIT dataset [17], and achieve performance comparable to kinemes and AUs for OCEAN trait prediction on the FICS dataset.
- Kineme and AU features enable behavioral explanations to support their predictions. We employ scores obtained from the additive attention fusion model to assess the relative importance of our three modalities per trait.
- We perform ablative studies presenting unimodal and multimodal results over thin-slices of varying lengths. We show that satisfactory continuous and discrete trait prediction performance can be achieved even with 2s slices, with more accurate predictions possible over longer slices in line with expectation.
2 Literature review
This section reviews research on (a) personality and interview trait prediction, and (b) multimodal behavior analytics to position our work with respect to the literature.
2.1 Trait prediction
Human thoughts, emotions and behavioral patterns are influenced by their personality, typically characterized via the OCEAN model [3] describing human personality in terms of Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism. Various non-verbal behavioral cues such as eye movements [20, 21], head motion [22, 23], and facial features [13, 19] have been employed for personality trait prediction.
Numerous studies have examined the relationship between a candidate’s personality traits and their job-interview performance [14, 17, 24]. For instance, Conscientiousness is positively correlated with job and organizational performance [25, 26], while Conscientiousness and Extraversion impact interview success [27, 28] and job ratings [29]. While Mount et al. [30] observed that Emotional stability, Conscientiousness and Agreeableness are positively related to job performance, Rothmann et al. [31] associated Conscientiousness, Extraversion, Emotional stability and Openness with job performance and creativity. While these correlations among personality and interview traits have been discovered via statistical analyses, very few studies have explored the relationships between non-verbal behavioral cues and personality-cum-interview traits in a predictive (regression/classification) setting.
Explainable trait prediction.
Despite achieving excellent performance on multiple prediction problems, deep learning models fall short in terms of explainability and interpretability due to their ‘black-box’ nature [32]. Recent studies alleviate this issue by interpreting the results of deep learning models, e.g., Wicaksana and Liem [33] predict OCEAN personality traits explicitly focusing on human-explainable features and a transparent decision-making process. Wei et al. [34] propose a deep bimodal regression framework, in which Convolutional Neural Networks (CNNs) are modified to aggregate descriptors for improving regression performance on apparent personality analysis. A CNN-based approach for interpretability is explored, where the authors observe a correlation between AUs and CNN-learned features [35]. Interpretability is achieved via a visualization technique highlighting image regions activating different units in each layer. Another work [36] trains a deep residual network with audiovisual descriptors for personality trait prediction, where predictions are elucidated via face image visualization and occlusion analysis. In contrast, our approach provides trait-specific behavioral explanations, encompassing features (kineme and AUs based) and model-based (modality contribution) explanations.
2.2 Multimodal behavior analytics
Low-level behavioral features have been largely employed for human-centred trait prediction. E.g., head-motion has been modeled with descriptors such as amplitude of Fourier components [37], Euler rotation angles and velocity. Head motion is often restricted to nods and shakes [38]. Yang and Narayanan [39] extract arbitrary head motion patterns, which do not have a physical interpretation. Subramanian et al. [23] predict Extraversion and Neuroticism employing positional and head pose patterns.
Audio-visual features are typically combined to achieve effective trait prediction. Low-level speech descriptors such as pitch, intensity, spectral, cepstral coefficients and pause duration are commonly used for personality [40, 41] and affect recognition [42–44]. Other works use acoustic, prosodic and linguistic features for personality prediction [13, 45].
Many trait prediction studies focus solely on visual cues, with facial cues playing a crucial role. E.g., multivariate regression is employed to infer user personality impressions from Twitter profile images [46], while eigenfaces combined with Support Vector Machines are used to predict whether a depicted person scores above/below the median for each of the big-five traits [47]. Meng et al. [48] investigate the connection between gratification-sought (e.g., escape, fashion, entertainment) and personality traits, finding that extroverts are more active in contributing to, and participating in, engaging behaviors. Short-term facial dynamics are learned from short videos via an emotion-guided, encoder-based approach for personality analysis in [49].
2.3 Summary
The literature review reveals the following research gaps:
- Personality and interview traits are known to be highly correlated based on statistical observations, but few works have explored learning of features that can effectively predict as well as explain these traits.
- While personality and interview traits have been predicted via machine/deep learning approaches, the majority employ statistics of low-level audiovisual features (relating to head motion, eye-gaze, facial expressions, speech and prosody), which limits the explanations available to support the predictions. While head motion patterns have been identified as critical non-verbal behavioral cues, interpretable head-motion units have not previously been employed for personality or interview trait prediction. We show how kineme and AU features can intuitively explain trait-specific behaviors.
- Multimodal behavioral analytics have largely been restricted to feature and decision fusion, treating all behavioral channels equally. Differently, we utilize additive soft attention [50]-based fusion, which learns the relative contribution of each channel from data. This allows for quantifying and explaining the relative contribution of the different modalities towards the prediction result.
3 Methodology
3.1 Feature extraction
We now present feature extraction for the three employed modalities: (i) 3D head motions denoted via a sequence of kinemes, (ii) facial action units describing muscle movements, and (iii) low-level descriptors for speech representation. As in [18], we encode these features into 2s temporal segments with a 50% overlap to obtain feature vectors.
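The 2s, 50%-overlap windowing scheme can be sketched as follows (a minimal NumPy illustration; the 30 fps frame rate and array names are our assumptions, not taken from the paper's code):

```python
import numpy as np

def overlapping_windows(x, win, hop):
    """Split a (T, d) feature array into overlapping windows.

    win: window length in frames; hop: step between window starts.
    Returns an array of shape (n_windows, win, d).
    """
    n = 1 + (len(x) - win) // hop
    return np.stack([x[i * hop : i * hop + win] for i in range(n)])

# Example: 30 fps video, 2 s windows with 50% overlap -> win=60, hop=30.
pose = np.random.default_rng(0).random((300, 3))   # 10 s of pitch/yaw/roll angles
segments = overlapping_windows(pose, win=60, hop=30)
print(segments.shape)                              # (9, 60, 3)
```

The same windowing is reused for all three modalities, with only the window contents (head pose, AU intensities, or speech descriptors) differing.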
Kineme representation.
A compact approach to modeling head motion is to represent it via a small number of fundamental and interpretable units termed kinemes [10]; they are analogous to phonemes in human speech [51]. We extract the 3D Euler rotation angles pitch (θp), yaw (θy) and roll (θr) per frame to represent head pose using the Openface toolkit [52]. Head motion over a time period T can then be represented as a multivariate time-series of 3D angles θ = {(θp(t), θy(t), θr(t))}, t = 1, …, T. This time-series is divided into overlapping segments of length l, where the ith segment is flattened into a characterization vector h(i) of dimension 3l. These overlapping segments enable shift-invariance and generate better representations of the head motion [11].
Further, we define the characterization matrix Hθ = [h(1), h(2), ⋯, h(s)], with s denoting the number of segments in the training sample. All N training samples are combined to form the head motion matrix H of size 3l × Ns, where each column of H represents a single head-motion time-series segment. Non-negative Matrix Factorization (NMF) is performed on H to obtain the basis and coefficient matrices B and C respectively, such that H ≈ BC. We then employ Gaussian Mixture modeling to cluster the coefficient vectors (columns of C) in the low-dimensional space, yielding a k-column matrix C* of cluster centers (k ≪ Ns). The matrix C* is transformed as H* = BC* to obtain the kinemes in the original space; the columns of H* yield the k kinemes K1, …, Kk.
On learning the kineme representation, any head motion time-series can be expressed as a kineme sequence by mapping each time-series segment to an individual kineme. To this end, we compute the characterization vector h(i) for the ith segment, and project h(i) onto the learned subspace spanned by B to obtain its coefficient vector c(i). We then assign the ith segment to the kineme maximizing the posterior probability P(Kj | c(i)) under the learned mixture model. Thus, we can map any head motion time-series to a kineme sequence. Selected kinemes extracted from the MIT and FICS datasets are visualized in Fig 2(a) and 2(b).
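The kineme-learning pipeline can be sketched with scikit-learn as follows (a simplified sketch on synthetic non-negative data; the matrix shapes follow the text, while the NMF rank, parameter values and random data are our assumptions — real head-pose angles would first need a non-negativity transform):

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# H: each column is one flattened head-motion segment (3l rows, Ns columns).
l, Ns, k, r = 20, 500, 16, 10          # segment length, #segments, #kinemes, NMF rank
H = rng.random((3 * l, Ns))            # synthetic stand-in for real segments

# H ~= B C: B holds the basis, C the per-segment coefficient vectors.
nmf = NMF(n_components=r, init="nndsvda", max_iter=500, random_state=0)
B = nmf.fit_transform(H)               # (3l, r)
C = nmf.components_                    # (r, Ns)

# Cluster the coefficient vectors with a GMM; the k cluster means form C*.
gmm = GaussianMixture(n_components=k, random_state=0).fit(C.T)
C_star = gmm.means_.T                  # (r, k)

# Map back to the original space: the columns of H* are the k kinemes.
H_star = B @ C_star                    # (3l, k)

# A new segment is assigned to the kineme with maximum posterior probability.
c_new = C.T[:1]                        # coefficient vector of some segment
kineme_id = gmm.predict(c_new)[0]
print(H_star.shape, kineme_id)
```

Note that scikit-learn's `fit_transform` treats rows as samples, so factorizing H directly yields B as the transformed output and C as `components_`.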
Action unit detection.
We extract 17 facial action units (AUs) per video frame using Openface. Each AU is described by a presence value specifying whether the AU is visible, and an intensity score representing AU sharpness on a 5-point scale (minimal to maximal). We employ the mean intensity as a threshold to identify the dominant AUs within each 2s window (with 1s overlap, as above). Some of the common AUs from the two datasets are presented in Table 1.
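Dominant-AU selection via mean-intensity thresholding can be sketched as below (a NumPy illustration; the thresholding rule is our reading of the text, and the window shape and random intensities are assumptions):

```python
import numpy as np

def dominant_aus(intensities):
    """Binarize per-frame AU intensities for one 2 s window.

    intensities: (frames, 17) array of AU intensity scores (0-5 scale).
    An AU is marked dominant if its window-mean intensity exceeds the
    mean intensity taken over all AUs in the window.
    """
    per_au_mean = intensities.mean(axis=0)        # (17,)
    threshold = per_au_mean.mean()                # mean-intensity threshold
    return (per_au_mean > threshold).astype(int)  # binary 17-element AU vector

window = np.random.default_rng(0).random((60, 17)) * 5   # one 2 s window at 30 fps
au_vec = dominant_aus(window)
print(au_vec.shape)                               # (17,)
```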
Speech feature extraction.
We extract low-level audio descriptors (LLDs) via the Librosa library [53], following the Interspeech 2009 emotion challenge [54]: fundamental frequency (F0), voice probability, zero-crossing rate (ZCR) and Mel-frequency cepstral coefficients (MFCCs). A local feature vector is created by extracting the LLDs over a sliding 93ms window with a 23ms overlap across the entire video duration. These local features are averaged and concatenated to obtain a 23-dimensional feature vector for each 2s segment. For each dataset, these features are normalized to have zero mean and unit variance.
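The sliding-window LLD extraction can be illustrated as follows (a NumPy-only sketch using zero-crossing rate as a stand-in for the full Interspeech 2009 set; the librosa calls are omitted, the test tone is invented, and treating the stated 23 ms as the overlap between consecutive windows is our assumption):

```python
import numpy as np

def zcr(frame):
    """Zero-crossing rate of one audio frame."""
    return np.mean(np.abs(np.diff(np.sign(frame))) > 0)

def windowed_lld(audio, sr, win_s=0.093, overlap_s=0.023):
    """Compute an LLD (here ZCR) over 93 ms windows overlapping by 23 ms."""
    win = int(win_s * sr)
    hop = win - int(overlap_s * sr)
    n = 1 + (len(audio) - win) // hop
    return np.array([zcr(audio[i * hop : i * hop + win]) for i in range(n)])

sr = 16000
audio = np.sin(2 * np.pi * 220 * np.arange(2 * sr) / sr)   # 2 s, 220 Hz test tone
local = windowed_lld(audio, sr)
segment_feature = local.mean()     # averaged over the 2 s segment
print(local.shape)
```

In the full pipeline, each LLD is averaged this way over a 2s segment and the averages are concatenated into the 23-dimensional segment vector.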
3.2 Models
Long short-term memory (LSTM) models for regression and classification: We trained LSTMs with the kineme (LSTM Kin), AU (LSTM AU) and speech sequences (LSTM Aud). We also performed bimodal feature fusion (FF) and decision fusion (DF) with all combinations (LSTM Kin+AU, LSTM Kin+Aud and LSTM AU+Aud), and trimodal LSTM fusion (LSTM Kin+AU+Aud). The kineme sequences are one-hot encoded, where the kineme denoting a given time-window is coded to 1 and the rest to 0. AU sequences are encoded by setting the dominant AUs to 1 and rest to 0 for the time-window, creating a binary 17-element AU vector. Speech sequences are created by z-normalizing LLDs averaged over the time-window. For a behavioral slice involving L time windows with N training samples, the kineme, AU and speech features form 3D matrices of size 16 × N × L, 17 × N × L, and 23 × N × L respectively.
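The kineme one-hot encoding can be sketched as below (a NumPy illustration; the 16-kineme vocabulary follows the text, while the example sequence is invented):

```python
import numpy as np

def one_hot_kinemes(seq, n_kinemes=16):
    """Encode a kineme-index sequence of length L as an (n_kinemes, L) one-hot matrix."""
    out = np.zeros((n_kinemes, len(seq)), dtype=int)
    out[seq, np.arange(len(seq))] = 1   # set the active kineme per time window
    return out

seq = np.array([3, 3, 7, 0, 15])        # one kineme per 2 s time window
X = one_hot_kinemes(seq)
print(X.shape)                          # (16, 5)
```

Stacking such matrices over N training samples yields the 16 × N × L input tensor; the binary AU vectors and z-normalized speech features are stacked analogously.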
Unimodal and feature fusion (FF).
A single hidden LSTM layer is employed for unimodal prediction followed by a dense layer involving one neuron with sigmoidal/linear activation for classification/regression. For bimodal and trimodal feature fusion, unimodal descriptors are fused by applying a single LSTM layer to each feature. The subsequent outputs are merged followed by a dense layer comprising a single neuron as above (see Fig 3). The hyperparameters such as number of neurons, activation function and dropout rate are tuned via the validation set. An Adam optimizer is utilized for training with learning rate of 0.01. We employ binary cross entropy and mean absolute error as loss functions for classification and regression respectively.
N denotes the number of neurons per layer. For regression, the dense layer output uses linear activation, with 32 neurons in the LSTM layer.
Attention fusion (LSTM AF).
To achieve multimodal explanations, we employ attention-based trimodal fusion as in [50] to assign importance weights to the three modalities at each time window (Fig 4). While dense layers are employed for each cue in [50], we use one LSTM layer per modality to quantify its importance weight. Also, while we compute weights from softmax scores generated per time step, [50] focuses only on the channel with the maximum attention weight, discarding the others. As in Fig 4(a), an LSTM layer is employed for each modality to learn temporal dynamics, resulting in a fixed-length feature vector per modality. The unimodal descriptors are concatenated and passed through a fully connected layer followed by a softmax layer composed of three neurons (Fig 4(b)). The attention scores generated by the softmax layer are deemed the relative contributions of the modalities per time window. Layer normalization is applied over each unimodal feature vector. To fuse the normalized features, we employ an additive layer that sums the weighted unimodal features, followed by a dense layer comprising a single neuron with sigmoidal/linear activation for classification/regression. We aggregate weights to compute modality contributions over behavioral slices spanning multiple time windows.
(a) Additive attention fusion architecture overview, and (b) Attention score computation process (FC layer comprises twelve neurons). N denotes the number of neurons per layer. Linear/sigmoid activation is applied on the dense layer output for regression/classification.
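The attention-score computation and additive fusion can be sketched as below (a NumPy illustration of the weighting step only, not the trained model; vector sizes, the random weight matrix and feature values are all assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One fixed-length feature vector per modality from its LSTM (here random).
rng = np.random.default_rng(1)
f_kin, f_au, f_aud = rng.random(32), rng.random(32), rng.random(32)

# An FC layer maps the concatenated features to 3 logits; softmax yields
# per-modality attention scores (relative contributions).
W = rng.random((3, 96))
logits = W @ np.concatenate([f_kin, f_au, f_aud])
a = softmax(logits)                        # a.sum() == 1

# Layer-normalize each modality, then take the attention-weighted sum.
norm = lambda f: (f - f.mean()) / f.std()
fused = a[0] * norm(f_kin) + a[1] * norm(f_au) + a[2] * norm(f_aud)
print(a.round(3), fused.shape)
```

Because the scores a sum to one, averaging them over all time windows of a slice directly gives the per-modality contribution percentages reported for explanations.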
Decision fusion (DF).
We adopt the fusion weight estimation approach [55] outlined below. Assuming the unimodal classifier/regressor scores are p1 and p2 for the bimodal fusion, the test sample score is defined as αp1 + (1 − α)p2, α ∈ [0, 1]. We perform grid search with a step-size of 0.05 to identify the optimal α* maximizing F1-score and Pearson correlation coefficient (PCC), respectively, for classification and regression (the same is extended to trimodal fusion).
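The fusion-weight search can be sketched as below (a NumPy illustration for the regression case, maximizing PCC; the toy unimodal scores are invented):

```python
import numpy as np

def pcc(a, b):
    """Pearson correlation coefficient between two score vectors."""
    return np.corrcoef(a, b)[0, 1]

def fuse_weight(p1, p2, y, step=0.05):
    """Grid-search alpha in [0, 1] maximizing PCC of alpha*p1 + (1-alpha)*p2 vs y."""
    alphas = np.arange(0, 1 + 1e-9, step)
    scores = [pcc(a * p1 + (1 - a) * p2, y) for a in alphas]
    best = int(np.argmax(scores))
    return alphas[best], scores[best]

rng = np.random.default_rng(2)
y = rng.random(100)                       # ground-truth trait scores
p1 = y + 0.1 * rng.standard_normal(100)   # a good unimodal regressor
p2 = rng.random(100)                      # an uninformative regressor
alpha, best_pcc = fuse_weight(p1, p2, y)
print(round(alpha, 2), round(best_pcc, 3))
```

For classification, the same search is run with F1-score in place of PCC, and the trimodal case adds a second nested weight.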
4 Experimental results
4.1 Datasets
The FICS dataset [16] contains 10K self-presentation snippets derived from YouTube videos of people talking into the camera. Averaging 15s in length, these videos are split into a 3:1:1 proportion for train (6000 samples), validation (2000 samples) and test (2000 samples). All videos are annotated with OCEAN trait scores, with ‘N’ scores denoting emotional stability instead of Neuroticism. The MIT dataset [17] comprises audio-visual recordings of 138 mock job interviews with 69 undergraduate students, with videos being 4.7 minutes long on average. All videos are annotated with 16 interviewee-specific traits. We focus on the following traits: recommended hiring score (RH) denoting the candidate’s hireability, level of excitement (Ex), friendliness (Fr) and eye-contact (EC). We also examine the Overall (Ov) interview score in prediction experiments. Representative examples from the two datasets are presented in [18].
4.2 Quantitative experiments
Prediction settings.
Both datasets provide continuous trait scores, naturally posing human trait estimation as a regression problem. We explore both continuous and discrete predictions for personality and interview traits. For regression, annotation values are standardized to the range [0, 1]. For binary classification, trait scores are dichotomized by thresholding at their median value (refer to Table 2 for the class distribution). Tables 3 and 4 present regression results, while Tables 5 and 6 showcase the classification results. For the FICS dataset, the models are fine-tuned via the pre-defined validation set, while hyperparameter tuning is achieved via 10-fold cross-validation (cv) on the smaller MIT Interview dataset (resulting in 90% of the data for training and 10% for testing). Results reported on the MIT dataset are μ±σ statistics noted over 50 runs (5 repeated runs of 10-fold cross-validation). Early stopping with a patience of 4 epochs is employed to prevent model degradation.
MIT class distributions correspond to 1-minute video samples employed for analysis.
Accuracy and PCC values are tabulated as (μ±σ) values, with highest PCC achieved per trait denoted in bold.
Accuracy and PCC values for different methods are tabulated, with highest PCC achieved per trait denoted in bold.
Accuracy and F1-score are tabulated as (μ±σ) values, with highest F1 achieved per trait denoted in bold.
Accuracy and F1-score for different methods are tabulated, with highest F1 achieved per trait denoted in bold.
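Label preparation for the two prediction settings described above can be sketched as follows (a NumPy illustration on invented annotation scores):

```python
import numpy as np

scores = np.array([2.1, 3.4, 4.9, 6.0, 3.3, 5.2])   # raw trait annotations

# Regression targets: standardize to the [0, 1] range.
y_reg = (scores - scores.min()) / (scores.max() - scores.min())

# Classification targets: dichotomize at the median.
y_cls = (scores > np.median(scores)).astype(int)
print(y_reg.round(2), y_cls)
```

Median thresholding keeps the two classes roughly balanced, which is why accuracy and F1 are both meaningful in Tables 5 and 6.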
Chunk vs video-level prediction.
To examine trait prediction over tiny behavioral episodes (or slices), we segment the original videos into smaller chunks of 2–7s for FICS, and 2–60s for the MIT dataset. All video chunks are assigned the source video label. We then compute metrics over a) all chunks (chunk-level performance), and b) over all videos by assigning the majority label/mean value over all chunks (video-level performance) for classification/regression. A comparison of chunk vs video-level predictions for the three modalities is presented in S1–S3 Figs.
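The chunk-to-video aggregation can be sketched as below (a NumPy illustration; the per-chunk predictions are invented):

```python
import numpy as np

def video_level(chunk_preds, task):
    """Aggregate one video's chunk predictions: majority vote for
    classification, mean for regression."""
    if task == "classification":
        return int(np.bincount(chunk_preds).argmax())
    return float(np.mean(chunk_preds))

labels = np.array([1, 0, 1, 1, 0])        # per-chunk class predictions
scores = np.array([0.42, 0.55, 0.47])     # per-chunk regression predictions
print(video_level(labels, "classification"), video_level(scores, "regression"))
```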
Thin-slice predictions.
We explore trait prediction over short behavioral episodes known as thin slices, and present multimodal classification and regression results using soft additive attention fusion over 2s behavioral slices in Table 7. The results convey that reasonable prediction performance is achieved even with 2s-long slices, demonstrating the efficacy of small behavioral slices for predicting different traits. For more details, please refer to S1 Text.
4.3 Experimental details
All experiments are performed using the two mentioned datasets, without external data for pre-training or fine-tuning. We optimized model training with the binary cross entropy loss function for classification and mean absolute error for regression. The network is trained using the Adam optimizer with a learning rate of 0.01. Specifically, when training on the MIT dataset, we employed 20 neurons, a batch size of 32, a dropout rate of 0.2, and set the number of epochs to 30. For the FICS dataset, the configuration includes 32 neurons, a batch size of 100 and a dropout rate of 0.2. We set the number of epochs to 300, and applied early stopping with the patience value set to 5.
4.4 Results and discussion
Based on Tables 3–7, we make the following observations:
- For regression benchmarking (Tables 3 and 4), PCC is a more stringent measure than Acc, as very low PCC values are observed with relatively high Acc values for the FICS dataset (Table 4). Tables 3 and 5 show that regression and classification results are comparable for the (smaller) MIT dataset. For FICS, the regression scores are considerably higher than the classification scores, which can be attributed to Gaussian-distributed FICS traits with means around 0.5 [16].
- Speech features achieve optimal interview trait prediction (Table 3), while Kineme and AU features perform comparably. Optimal personality trait regression is also achieved with audio features (Table 4), even as AUs significantly outperform kinemes on the FICS dataset.
- Higher PCC scores are achieved with multimodal as compared to unimodal methods on both the MIT and FICS datasets. Bimodal and trimodal fusion perform very similarly for both interview and personality trait prediction, with maximum PCC values of 0.98 achieved for the Excited and Friendliness interview traits, and a peak PCC of 0.566 achieved for the Extraversion personality trait on FICS obtained with trimodal fusion.
- Focusing on multimodal methods, bimodal combinations involving audio outperform others for interview trait prediction, implying that speech features individually and in combination with others acquire high predictive power, mirroring findings in [17]. Bimodal predictions improving over unimodal ones conveys that kinemes and AUs provide complementary information concerning interview and personality traits.
- Among trimodal fusion methods, decision fusion slightly outperforms attention and feature fusion on the MIT dataset, while decision, attention and feature fusion approaches perform first, second and third best on the FICS dataset. These results again reveal the complementary utility of the kineme, AU and speech features; optimal performance achieved with trimodal decision fusion conveys that the AU and kineme classifiers improve prediction performance in instances where speech descriptors are ineffective.
- Focusing on classification (Tables 5 and 6), considering unimodal results, audio features achieve optimal F1-scores on Interview traits (highest F1 of 0.95 for Recommended Hiring and Excited), while AUs achieve the best classification on personality traits (maximum F1 of 0.651 for Extraversion). AUs and kinemes perform similarly on the MIT dataset, while speech descriptors achieve much higher F1-scores than kinemes on FICS.
- Multimodal approaches again outperform unimodal methods in categorizing both interview and personality traits. With respect to bimodal methods, combinations involving speech tend to perform well for both interview and personality prediction.
- Trimodal fusion performs best, producing peak F1 scores of 0.98 and 0.695 for the RH interview, and Extraversion personality traits. Decision fusion produces the best trait classification performance on both datasets, with feature and attention fusion having comparable scores.
The above results represent trait prediction at the video level, on examining 15s FICS videos or upon collating classification/regression results over 5–60s chunks/segments on the MIT dataset (the best results obtained by averaging chunk-level values, or computing the majority label over all chunks are listed in Tables 3 and 5). Table 7 presents results for the 2s behavioral slice for both datasets.
4.4.1 Comparison with the state-of-the-art approaches.
Table 8 compares our proposed methodology with available baseline approaches for the FICS and MIT datasets. Among studies on the MIT dataset, the paper introducing the dataset [17] performed a series of binary classification and regression experiments utilizing multiple behavioral cues such as prosodic features, facial expressions, and interviewee language, achieving a highest PCC of 0.77 for the Excited trait and a lowest PCC of 0.27 for Eye Contact. Agrawal et al. [56] employed similar multimodal cues to predict different class labels associated with the interview process, reporting a classification accuracy of 0.6428 for the Eye Contact label. Kumar et al. [57] examined only speech features for regression analysis using CNN-LSTM fusion, obtaining a highest accuracy of 0.96 for the Overall trait and a highest PCC of 0.93 for Excited and Friendly. Compared to these previous studies, our trimodal fusion-based approach achieves an improved regression accuracy of 0.98 for all traits except Eye Contact (0.97), a PCC of 0.98 for Excited and Friendly, and the highest classification accuracy of 0.98 for Recommend Hiring. Beyond achieving better performance, we also demonstrate the efficacy of different behavioral cues in providing explanations for the interview traits.
For the FICS dataset, Yan et al. [58] investigated the biases in multimodal personality assessment induced by various sources, such as individual behavioral differences and late fusion approaches, employing data balancing and adversarial learning to report a best regression accuracy of 0.92 for Extraversion. Yagmur et al. [59] proposed an audiovisual deep residual network comprising auditory and visual streams, achieving a regression accuracy of 0.91 for all traits. Zhang et al. [60] employed the Deep Bimodal Regression (DBR) framework, modifying traditional CNNs to combine audio and visual information, and achieved a highest accuracy of 0.92 for the Conscientiousness trait. Subramanian et al. [61] introduced two bimodal end-to-end deep neural network architectures using temporally ordered audio and visual features, reporting an accuracy of 0.91 for all traits. Comparatively, our approach achieves a highest regression accuracy of 0.90 for multiple traits, including Openness, Extraversion, and Agreeableness, and a best classification accuracy of 0.69 for the Extraversion trait. Besides attaining comparable results on the FICS OCEAN traits, our approach enables behavioral explanations to support its predictions using multimodal cues.
4.4.2 Model generalisability.
Considering the diverse datasets utilized in our study, we investigate the generalisability of our approach by training the model on one dataset and testing on another. The two datasets utilized in our study are curated for distinct objectives; the FICS data are compiled primarily for the automated assessment of personality traits, while the MIT dataset is compiled for examining interview behavior. Therefore, we focus on predicting the Recommend Hiring (RH) and the Interview score traits from the MIT and FICS datasets respectively, as they share similar meanings. For evaluating model generalisability, we consider the following configurations:
- Configuration 1: We synthesize kineme units from FICS head pose angles following the procedure outlined in Sec 3.1. We then map head pose angles in the MIT dataset to the learned FICS kinemes for feature extraction and regression. Since we synthesize kineme sequences for both datasets, either dataset can be used for model training/testing.
- Configuration 2: We learn kinemes based on head movements in the MIT dataset, which are then used to represent head pose angles in the FICS data.
Empirical settings. We employ the trimodal decision fusion approach utilizing the kineme, AU and speech features, given its optimal performance over the two datasets. As the FICS dataset has fixed train, validation and test sets, we utilized the FICS train and test sets for training and testing respectively; for MIT, the entire dataset is used for training/testing. For both configurations, we used mean absolute error as the loss function, the Adam optimizer with a learning rate of 0.01, 20 neurons per LSTM layer, a batch size of 32, a dropout rate of 0.2, and 300 epochs. Early stopping was applied with a patience value of 5.
Results & discussion. Table 9 displays regression results for both configurations (classification measures were poor and are omitted here). We observe from the table that the dataset employed for kineme generation does not influence outcomes much: very similar accuracy and PCC scores are obtained for both configurations. We note that, across configurations, (a) models trained on the MIT dataset achieve better performance despite it being the smaller of the two, which suggests that the variance in appearance and/or head rotations is lower in the FICS dataset than in MIT, and (b) while relatively high Acc values (between 0.83–0.87) are obtained, cross-dataset-trained models produce very low PCC values, conveying that model predictions differ substantially from the ground-truth values. Cumulatively, while kineme-based models may not generalise optimally, the predictions across datasets are reasonably accurate and robust. Future work will focus on further improving model generalisability.
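The co-occurrence of high Acc and near-zero PCC follows from the metrics themselves. Assuming Acc denotes the FICS-style 1 − MAE score (an assumption; the definition is not restated here), near-constant predictions close to the label mean illustrate the gap:

```python
import numpy as np

def regression_acc(y_true, y_pred):
    """FICS-style regression accuracy: 1 - mean absolute error."""
    return 1.0 - np.mean(np.abs(y_true - y_pred))

def pcc(y_true, y_pred):
    """Pearson correlation coefficient between predictions and labels."""
    return np.corrcoef(y_true, y_pred)[0, 1]

# Near-constant predictions around the label mean score well on 1 - MAE,
# yet correlate poorly with the labels, reproducing the Acc/PCC gap.
rng = np.random.default_rng(1)
y = rng.uniform(0.3, 0.9, 200)                        # toy trait labels
flat = np.full_like(y, y.mean()) + rng.normal(0, 0.01, len(y))
```

Here `regression_acc(y, flat)` lands in the 0.85 region while `pcc(y, flat)` is close to zero, mirroring the cross-dataset behavior in Table 9.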
4.4.3 Ethical aspect.
Research on understanding human traits requires careful attention to privacy, consent and accuracy, with a crucial awareness of cultural, gender and ethnic differences to prevent misinterpretations. We highlight the ethical considerations associated with our work below. The developed methods facilitate accurate prediction of users’ personality and job interview traits, which can then be used to give insights and recommendations to users, especially since our predictions are supported by explanations. At the same time, these powerful tools, when used as behavioral benchmarks, could create biases and discrimination [62]. Likewise, Shahjehan et al. [63] demonstrate a correlation between Openness/Neuroticism and impulsive buying behavior, implying that individuals’ personality traits could be exploited for sales promotion. For our research, potential biases relating to racial and cultural backgrounds may arise due to the nature of the training data derived from public datasets [16, 17]. In this regard, (a) the transparency of our model, which provides relevant explanations supporting predictions [64], and (b) its generalizability, as demonstrated by prediction results on a different dataset, support the view that the presented results are generally devoid of biases. Our research strictly adheres to ethical standards, utilizing open-sourced data with a valid End User License Agreement (EULA) solely for research purposes, and does not per se target sensitive problem domains such as job recruitment.
5 Explainability & interpretability
5.1 Interpretation via kinemes and AUs
Along with their predictive power, kinemes and AUs also enable facile trait-specific behavioral explanations. To this end, we considered the top and bottom 10-percentile videos for each trait, and computed the most frequently occurring AUs and kinemes in each set. The four most frequently occurring kinemes and five dominant AUs for these high (H) and low (L)-rated videos are presented in Table 10. Analysing the table, we make the following remarks:
MIT kinemes in bold font are visualized in Fig 4.
- The presence of kineme 16 (denoting head nodding and shaking) for all OCEAN traits conveys the significance of head motion for the characterization of personality traits. The combination of head nodding and shaking with other kineme representations highlights the subtle difference between high and low-rated personality impressions. Also, note that AUs 25 and 26, signifying talking behavior, are present in all videos.
- Focusing on other kinemes, high Openness is characterized by kinemes 2 and 8, which signify persistent head movements. This finding is echoed in [65], where large motion variations are found to associate with high O impressions. Presence of AUs 12 and 14 indicates that a smiling demeanor characterizes high O. Conversely, kineme 6 denoting minimal head motion and AUs 4 and 17 typical of frowning and diffident behavior are commonly noted for low O videos.
- Kineme 1 denoting an upward head tilt is associated with high C, while kinemes 2 and 4 depicting tilt-down and head-shaking are associated with low C. This indicates that attempting to maintain eye-contact conveys diligence and honesty, while avoiding eye-contact conveys insincerity.
- Extraversion appears to be conveyed better by AUs than kinemes; dominant AUs for high E include 10, 12 and 17, indicating a friendly and talkative nature, while dominant kinemes 2 and 14 convey significant head movements. Conversely, low E is associated with kineme 4 denoting head-shaking and AUs 4, 7 and 17 indicating frowning, overall conveying a socially distant nature.
- High Agreeableness is characterized by kineme 3 (head-nod), and AUs 12 and 14 which constitute a smile. Conversely, kinemes 1, 8 and 9 dominate low A, and they collectively convey persistent head motion. Also, AUs dominant for low A are 4, 14 and 17, cumulatively describing a frown; overall, nodding and smiling is viewed as courteous, while frequent head movements and frowning convey hostility.
- Emotional stability (high N) is associated with kinemes 2 and 8, and AUs 7, 12 and 17, indicating persistent head motion and facial expressiveness. On the other hand, a neurotic trait is conveyed via limited head motion and head-shaking (kinemes 1, 5, 12) and frowning (described by AUs 4, 7, 10).
- While kinemes for the MIT videos are less discernible, due to smaller face size and the fact that they capture an interactional setting, some patterns are nevertheless evident as seen in Fig 2(b); these kinemes are highlighted in Table 10. As with FICS, kineme 14 denoting a head-nod is commonly observed for all high-trait videos, while kineme 11 depicting a head-shake is common for all low-trait videos.
- High RH scores are elicited by expressive facial behavior involving head-nodding and smiling. Conversely, low RH scores are associated with head-shaking and limited facial expressions. Highly excited behavior is associated with the same AUs as high RH, together with persistent head motion. Inversely, low excitement scores are associated with head-shaking and limited facial expressiveness.
- Identical AUs are observed for both high and low eye-contact, implying that head movements primarily impact eye-contact impressions. Head nodding (kineme 14) is associated with high EC, while kinemes 11 and 16, depicting head shaking and frequent head-nodding, elicit low EC scores. Interestingly, then, while head nodding is beneficial, frequent nodding is perceived as avoiding eye-contact.
- High friendliness is characterized by kinemes 11, 14 and 16, signifying persistent head motion along with expressive and smiling facial movements (AUs 5, 12 and 14). Conversely, low friendliness is associated with head-shaking (kineme 11) and frowning (AUs 4, 6, 7).
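The percentile analysis behind these remarks can be sketched as follows; the 10-percentile split is from the text, while the toy kineme sequences and the frequency-ranking details are assumptions for illustration:

```python
import numpy as np
from collections import Counter

def dominant_units(unit_seqs, scores, k_units=4, pct=10):
    """Most frequent behavioral units (kinemes or AUs) in the top and
    bottom pct-percentile videos, ranked by a trait score."""
    scores = np.asarray(scores)
    lo, hi = np.percentile(scores, [pct, 100 - pct])
    high = Counter(u for s, seq in zip(scores, unit_seqs) if s >= hi for u in seq)
    low = Counter(u for s, seq in zip(scores, unit_seqs) if s <= lo for u in seq)
    return ([u for u, _ in high.most_common(k_units)],
            [u for u, _ in low.most_common(k_units)])

# Toy example: hypothetical kineme sequences for ten videos scored 0..9;
# the lowest-rated video frowns (units 4, 7), the highest nods (unit 16).
seqs = [[4, 4, 7]] + [[1, 2, 3]] * 8 + [[16, 16, 2]]
high_units, low_units = dominant_units(seqs, list(range(10)), k_units=2)
```

Running the same routine per trait over AU and kineme streams yields Table 10-style summaries of high- versus low-rated behavior.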
5.2 Attention score-based interpretations
While Table 10 presents unimodal behavioral explanations via kinemes and AUs, behaviors are expressed and best modeled multimodally as seen from our empirical results (Section 4.4). For multimodal explanations, we explore the attention-fusion network (Fig 4) to estimate the relative contribution of each modality towards trait regression. We visualize softmax scores learned by the attention-fusion network as follows. For the FICS dataset, we present mean attention scores obtained over 10 runs on 15s test videos (Fig 5(left)), while we present softmax scores averaged over 15s chunks for MIT videos across 50 runs (Fig 5(right)). Our remarks from the weight plots are as follows:
Error bars denote standard error.
- Cumulatively, Fig 5 conveys that while the relative contribution of speech features towards weighted fusion is not high for personality trait prediction, they tend to play a significant role in predicting interview traits on the MIT dataset. These observations mirror prior findings; the criticality of visual features such as head movements and facial movements for personality trait recognition has been noted in [14, 68] while the impact of prosodic speech features on interview trait impressions is discussed in [17].
- Fig 5(left) conveys that either kineme or AU features are most critical for personality trait prediction. Specifically, kinemes maximally contribute to the prediction of Openness and Extraversion, while AUs are most critical for predicting Agreeableness and Neuroticism. Both kinemes and AUs are found to be equally critical for estimating Conscientiousness, in line with the findings in [69]. Extraversion and Openness are conveyed by exaggerated physical and head movements [65, 70], with different head movement patterns representing high and low Extraversion [71]. While Agreeableness is also positively correlated with head movements [70, 71], empathetic behavior is accurately conveyed via facial expressions as denoted by the higher AU weights. Facial movements (e.g., unconcerned or anxious) better convey emotional stability [72].
- From Fig 5(right), it can be seen that facial movements have relatively less impact on interview trait prediction, with the exception of eye contact. This can partly be attributed to the smaller face size in MIT videos, limiting the efficacy of AU detection. Conversely, speech features significantly impact trait prediction, with the exception of the recommended hiring and eye contact traits. While prosodic speech behavior has been found to considerably influence interview trait impressions [17, 73], other forms of non-verbal behavior such as positive facial expressions and frequent postural changes are known to impact hireability [74].
- For the Excited trait, speech plays a prominent role with a high correlation to continuous or restricted head movement [75]. On the surprising finding of AUs and speech features impacting eye-contact, prior studies [76] have revealed a low-yet-meaningful correlation between eye contact impressions and vocal acoustic features. Friendliness is best characterized by head movement and voice features, showing that the integration of visual and auditory modalities can be crucial in discerning interviewee friendliness [77].
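The modality weights discussed above can be made concrete with a minimal numpy sketch of additive attention fusion. The score form v·tanh(W h_m) is a common choice assumed here; in the actual network, W and v are learned jointly with the LSTM encoders:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention_fusion(h, W, v):
    """Score each modality encoding h_m with v . tanh(W h_m), normalise the
    scores with a softmax, and fuse the modalities by their weights."""
    scores = np.array([v @ np.tanh(W @ hm) for hm in h])
    alpha = softmax(scores)                    # relative modality importance
    fused = (alpha[:, None] * h).sum(axis=0)   # attention-weighted fusion
    return fused, alpha

rng = np.random.default_rng(0)
d = 8                                  # toy encoding dimensionality
h = rng.standard_normal((3, d))        # kineme, AU and speech encodings (toy)
W = rng.standard_normal((d, d))
v = rng.standard_normal(d)
fused, alpha = additive_attention_fusion(h, W, v)
# alpha sums to 1; averaging it over runs yields Fig 5-style weight plots
```

The `alpha` vector is exactly what is visualized in Fig 5: one non-negative weight per modality, interpretable as that modality's relative contribution to the fused representation.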
To summarise, we explore interpretability using the visual features by identifying the most frequently occurring AUs and kinemes for the top and bottom 10-percentile videos. A further manual analysis reveals characteristic head movements and subtle facial actions representative of different personality and interview traits, consistent with human understanding. For multimodal explanations, we extend the interpretability approach using the attention-fusion trimodal architecture to measure the relative contribution of each modality towards trait regression. The softmax scores obtained over 15s videos are visualized as means over 10 runs for FICS and 50 runs for MIT. This analysis further validates the relative differences between the modality-specific attention weights based on prior findings. We acknowledge that averaging may obscure significant variations and nuances across the different runs of the model, and intend to explore alternate behavioral explanations in future work, including analysing the impact of systematically altering modality-specific attention weights on predictions.
6 Conclusion
This work demonstrates the efficacy of multimodal (kineme, AU and speech) behavioral cues to achieve explainable prediction of OCEAN and interview traits. Our results confirm that efficient trait prediction can be achieved with both unimodal and multimodal approaches. Also, multimodal approaches outperform their unimodal counterparts owing to complementary information provided by trait-specific behavioral cues. In addition, frequently occurring kineme and AU patterns enable behavioral explanations associated with each trait.
In terms of limitations, this work extracts all behavioral features over a fixed time window (same time-scale); however, behaviors associated with human personality may manifest over different time scales. For example, facial expression or head motion patterns could be affected by speaking behavior (talkative: drastic variation in speaking behavior across video frames; reserved: lingering silence over most video frames). Investigating the effect of temporal scales will be a future research direction. Trait-specific behavioral patterns can also be utilized to create virtual agents that train users in interviewing or public speaking settings. The authors do not advise using the proposed methodologies for complex processes like job recruitment per se; however, explanatory technologies can be utilized as a complementary tool in decision-making processes.
Supporting information
S1 Fig. Chunk vs video-level predictions with kinemes for FICS (left) and MIT (right).
https://doi.org/10.1371/journal.pone.0313883.s001
(TIF)
S2 Fig. Chunk vs video-level predictions with AUs for FICS (left) and MIT (right).
https://doi.org/10.1371/journal.pone.0313883.s002
(TIF)
S3 Fig. Chunk vs video-level predictions with speech features for FICS (left) and MIT (right).
https://doi.org/10.1371/journal.pone.0313883.s003
(TIF)
Acknowledgments
We would like to thank A. Samanta, IIT Kanpur for sharing the kineme implementation.
References
- 1. Jacques Junior JCS, Güçlütürk Y, Pérez M, Güçlü U, Andujar C, Baró X, et al. First impressions: A survey on vision-based apparent personality trait analysis. IEEE Transactions on Affective Computing. 2019;13(1):75–95.
- 2. Vinciarelli A, Mohammadi G. A survey of personality computing. IEEE Transactions on Affective Computing. 2014;5(3):273–91.
- 3. McCrae RR, Costa PT. Validation of the five-factor model of personality across instruments and observers. Journal of Personality and Social Psychology. 1987;52(1):81–90. pmid:3820081
- 4. Digman JM. Higher-order factors of the Big Five. Journal of Personality and Social Psychology. 1997;73(6):1246–56. pmid:9418278
- 5. Cattell HE. The sixteen personality factor (16PF) questionnaire. In: Understanding psychological assessment. Springer; 2001. p. 187–215.
- 6. Raza SM, Carpenter BN. A model of hiring decisions in real employment interviews. Journal of Applied Psychology. 1987;72(4):596–603.
- 7. Batrinca LM, Mana N, Lepri B, Pianesi F, Sebe N. Please, tell me about yourself: automatic personality assessment using short self-presentations. In: Proceedings of the 13th international conference on multimodal interfaces. 2011. p. 255–62.
- 8. Van Dam K. Trait Perception in the Employment Interview: A Five–Factor Model Perspective. International Journal of Selection and Assessment. 2003;11(1):43–55.
- 9. DeGroot T, Gooty J. Can nonverbal cues be used to make meaningful personality attributions in employment interviews? Journal of business and psychology. 2009;24:179–92.
- 10. Samanta A, Guha T. On the role of head motion in affective expression. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2017. p. 2886–90.
- 11. Samanta A, Guha T. Emotion Sensing From Head Motion Capture. IEEE Sensors Journal. 2021 Feb;21(4):5035–43.
- 12. Sidorov M, Ultes S, Schmitt A. Automatic recognition of personality traits: A multimodal approach. In: Proceedings of the 2014 Workshop on Mapping Personality Traits Challenge and Workshop. 2014. p. 11–5.
- 13. Kampman O, Barezi EJ, Bertero D, Fung P. Investigating audio, visual, and text fusion methods for end-to-end automatic personality prediction. arXiv preprint arXiv:1805.00705. 2018.
- 14. Malik H, Dhillon H, Goecke R, Subramanian R. I am empathetic and dutiful, and so will make a good salesman: Characterizing Hirability via Personality and Behavior. 2020;
- 15. Eddine Bekhouche S, Dornaika F, Ouafi A, Taleb-Ahmed A. Personality traits and job candidate screening via analyzing facial videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 2017. p. 10–3.
- 16. Escalante HJ, Kaya H, Salah AA, Escalera S, Güçlütürk Y, Güçlü U, et al. Modeling, Recognizing, and Explaining Apparent Personality From Videos. IEEE Transactions on Affective Computing. 2022 Apr;13(2):894–911.
- 17. Naim I, Tanveer MdI, Gildea D, Hoque ME. Automated Analysis and Prediction of Job Interview Performance. IEEE Transactions on Affective Computing. 2018 Apr;9(2):191–204.
- 18. Madan S, Gahalawat M, Guha T, Subramanian R. Head matters: explainable human-centered trait prediction from head motion dynamics. In: Proceedings of the 2021 International Conference on Multimodal Interaction. 2021. p. 435–43.
- 19. Güçlütürk Y, Güçlü U, Baro X, Escalante HJ, Guyon I, Escalera S, et al. Multimodal first impression analysis with deep residual networks. IEEE Transactions on Affective Computing. 2017;9(3):316–29.
- 20. Hoppe S, Loetscher T, Morey SA, Bulling A. Eye movements during everyday behavior predict personality traits. Frontiers in human neuroscience. 2018;12:328195. pmid:29713270
- 21. Rauthmann JF, Seubert CT, Sachse P, Furtner MR. Eyes as windows to the soul: Gazing behavior is related to personality. Journal of Research in Personality. 2012;46(2):147–56.
- 22. Jayagopi DB, Hung H, Yeo C, Gatica-Perez D. Modeling dominance in group conversations using nonverbal activity cues. IEEE Transactions on Audio, Speech, and Language Processing. 2009;17(3):501–13.
- 23. Subramanian R, Yan Y, Staiano J, Lanz O, Sebe N. On the relationship between head pose, social attention and personality prediction for unstructured and dynamic group interactions. In: Proceedings of the 15th ACM on International conference on multimodal interaction. 2013. p. 3–10.
- 24. Malik H, Dhillon H, Parameshwara R, Goecke R, Subramanian R. Examining the Influence of Personality and Multimodal Behavior on Hireability Impressions. In: Proceedings of the Fourteenth Indian Conference on Computer Vision, Graphics and Image Processing. 2023. p. 1–9.
- 25. Hassan S, Akhtar N, Yılmaz AK. Impact of the conscientiousness as personality trait on both job and organizational performance. Journal of Managerial Sciences. 2016;10(1).
- 26. Moy JW, Lam KF. Selection criteria and the impact of personality on getting hired. Personnel Review. 2004;33(5):521–35.
- 27. Tay C, Ang S, Van Dyne L. Personality, biographical characteristics, and job interview success: a longitudinal study of the mediating effects of interviewing self-efficacy and the moderating effects of internal locus of causality. Journal of Applied Psychology. 2006;91(2):446. pmid:16551195
- 28. Barrick MR, Mount MK. The big five personality dimensions and job performance: a meta‐analysis. Personnel psychology. 1991;44(1):1–26.
- 29. Witt LA, Burke LA, Barrick MR, Mount MK. The interactive effects of conscientiousness and agreeableness on job performance. Journal of Applied Psychology. 2002;87(1):164–9. pmid:11916210
- 30. Mount MK, Barrick MR, Stewart GL. Five-factor model of personality and performance in jobs involving interpersonal interactions. Human performance. 1998;11(2–3):145–65.
- 31. Rothmann S, Coetzer EP. The big five personality dimensions and job performance. SA Journal of industrial psychology. 2003;29(1):68–74.
- 32. Samek W, Müller KR. Towards explainable artificial intelligence. Explainable AI: interpreting, explaining and visualizing deep learning. 2019;5–22.
- 33. Wicaksana AS, Liem CC. Human-explainable features for job candidate screening prediction. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE; 2017. p. 1664–9.
- 34. Wei XS, Zhang CL, Zhang H, Wu J. Deep bimodal regression of apparent personality traits from short video sequences. IEEE Transactions on Affective Computing. 2017;9(3):303–15.
- 35. Ventura C, Masip D, Lapedriza A. Interpreting cnn models for apparent personality trait regression. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 2017. p. 55–63.
- 36. Gucluturk Y, Guclu U, Perez M, Jair Escalante H, Baro X, Guyon I, et al. Visualizing apparent personality analysis with deep residual networks. In: Proceedings of the IEEE International Conference on Computer Vision Workshops. 2017. p. 3101–9.
- 37. Ding Y, Shi L, Deng Z. Low-level characterization of expressive head motion through frequency domain analysis. IEEE Transactions on Affective Computing. 2018;11(3):405–18.
- 38. Gunes H, Pantic M. Dimensional emotion prediction from spontaneous head gestures for interaction with sensitive artificial listeners. In: Intelligent Virtual Agents: 10th International Conference, IVA 2010, Philadelphia, PA, USA, September 20-22, 2010 Proceedings 10. Springer; 2010. p. 371–7.
- 39. Yang Z, Narayanan SS. Modeling dynamics of expressive body gestures in dyadic interactions. IEEE Transactions on Affective Computing. 2016;8(3):369–81.
- 40. An G, Levitan R. Lexical and Acoustic Deep Learning Model for Personality Recognition. In: INTERSPEECH. 2018. p. 1761–5.
- 41. Valente F, Kim S, Motlicek P. Annotation and Recognition of Personality Traits in Spoken Conversations from the AMI Meetings Corpus. In: INTERSPEECH. 2012. p. 1183–6.
- 42. Mangalam K, Guha T. Learning spontaneity to improve emotion recognition in speech. arXiv preprint arXiv:1712.04753. 2017.
- 43. Tawari A, Trivedi MM. Speech emotion analysis: Exploring the role of context. IEEE Transactions on multimedia. 2010;12(6):502–9.
- 44. Abdel-Hamid L. Egyptian Arabic speech emotion recognition using prosodic, spectral and wavelet features. Speech Communication. 2020;122:19–30.
- 45. Levitan SI, Levitan Y, An G, Levine M, Levitan R, Rosenberg A, et al. Identifying individual differences in gender, ethnicity, and personality from dialogue for deception detection. In: Proceedings of the second workshop on computational approaches to deception detection. 2016. p. 40–4.
- 46. Dhall A, Hoey J. First impressions-predicting user personality from twitter profile images. In: Human Behavior Understanding: 7th International Workshop, HBU 2016, Amsterdam, The Netherlands, October 16, 2016, Proceedings 7. Springer; 2016. p. 148–58.
- 47. Al Moubayed N, Vazquez-Alvarez Y, McKay A, Vinciarelli A. Face-based automatic personality perception. In: Proceedings of the 22nd ACM international conference on Multimedia. 2014. p. 1153–6.
- 48. Meng KS, Leung L. Factors influencing TikTok engagement behaviors in China: An examination of gratifications sought, narcissism, and the Big Five personality traits. Telecommunications Policy. 2021;45(7):102172.
- 49. Song S, Jaiswal S, Sanchez E, Tzimiropoulos G, Shen L, Valstar M. Self-supervised learning of person-specific facial dynamics for automatic personality recognition. IEEE Transactions on Affective Computing. 2021;14(1):178–95.
- 50. Sharma R, Guha T, Sharma G. Multichannel attention network for analyzing visual behavior in public speaking. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE; 2018. p. 476–84.
- 51. Birdwhistell RL. Essays on body motion communication. Philadelphia: University of Pennsylvania; 1970.
- 52. Baltrušaitis T, Robinson P, Morency LP. Openface: an open source facial behavior analysis toolkit. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE; 2016. p. 1–10.
- 53. McFee B, Raffel C, Liang D, Ellis DP, McVicar M, Battenberg E, et al. librosa: Audio and music signal analysis in python. In: SciPy. 2015. p. 18–24.
- 54. Schuller B, Steidl S, Batliner A. The interspeech 2009 emotion challenge. 2009;
- 55. Koelstra S, Patras I. Fusion of facial expressions and EEG for implicit affective tagging. Image and Vision Computing. 2013;31(2):164–74.
- 56. Agrawal A, George RA, Ravi SS. Leveraging multimodal behavioral analytics for automated job interview performance assessment and feedback. arXiv preprint arXiv:2006.07909. 2020.
- 57. Kumar D, Raman B. Speech-Based Automatic Prediction of Interview Traits. In: International Conference on Computer Vision and Image Processing. Springer; 2022. p. 586–96.
- 58. Yan S, Huang D, Soleymani M. Mitigating biases in multimodal personality assessment. In: Proceedings of the 2020 International Conference on Multimodal Interaction. 2020. p. 361–9.
- 59. Güçlütürk Y, Güçlü U, van Gerven MA, van Lier R. Deep impression: Audiovisual deep residual networks for multimodal apparent personality trait recognition. In: Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part III 14. Springer; 2016. p. 349–58.
- 60. Zhang CL, Zhang H, Wei XS, Wu J. Deep bimodal regression for apparent personality analysis. In: European conference on computer vision. Springer; 2016. p. 311–24.
- 61. Subramaniam A, Patel V, Mishra A, Balasubramanian P, Mittal A. Bi-modal first impressions recognition using temporally ordered deep audio and stochastic visual features. In: Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part III 14. Springer; 2016. p. 337–48.
- 62. McClendon J, Bogdan R, Jackson JJ, Oltmanns TF. Mechanisms of Black–White disparities in health among older adults: Examining discrimination and personality. Journal of Health Psychology. 2021;26(7):995–1011. pmid:31250666
- 63. Shahjehan A, Qureshi JA, Zeb F, Saifullah K. The effect of personality on impulsive and compulsive buying behaviors. African journal of business management. 2012;6(6):2187.
- 64. Van Nuenen T, Ferrer X, Such JM, Coté M. Transparency for whom? Assessing discriminatory artificial intelligence. Computer. 2020;53(11):36–44.
- 65. Koppensteiner M. Motion cues that make an impression: Predicting perceived personality by minimal motion information. Journal of experimental social psychology. 2013;49(6):1137–43. pmid:24223432
- 66. Ishii R, Ahuja C, Nakano YI, Morency LP. Impact of personality on nonverbal behavior generation. In: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents. 2020. p. 1–8.
- 67. Osugi T, Kawahara JI. Effects of head nodding and shaking motions on perceptions of likeability and approachability. Perception. 2018;47(1):16–29. pmid:28945151
- 68. Lepri B, Subramanian R, Kalimeri K, Staiano J, Pianesi F, Sebe N. Connecting meeting behavior with extraversion—A systematic study. IEEE Transactions on Affective Computing. 2012;3(4):443–55.
- 69. Celiktutan O, Gunes H. Automatic prediction of impressions in time and across varying context: Personality, attractiveness and likeability. IEEE transactions on affective computing. 2015;8(1):29–42.
- 70. Oberzaucher E, Grammer K. Everything is movement: on the nature of embodied communication. Embodied communication in humans and machines. 2008;151–77.
- 71. Ruhland K, Zibrek K, McDonnell R. Perception of personality through eye gaze of realistic and cartoon models. In: Proceedings of the ACM SIGGRAPH Symposium on Applied Perception. 2015. p. 19–23.
- 72. Breil SM, Osterholz S, Nestler S, Back MD. 13 contributions of nonverbal cues to the accurate judgment of personality traits. The Oxford handbook of accurate personality judgment. 2021;195–218.
- 73. DeGroot T, Kluemper D. Evidence of predictive and incremental validity of personality factors, vocal attractiveness and the situational interview. International Journal of Selection and Assessment. 2007;15(1):30–9.
- 74. Levine SP, Feldman RS. Women and men’s nonverbal behavior and self-monitoring in a job interview setting. Applied HRM Research. 2002;7(1):1–14.
- 75. Walther S, Ramseyer F, Horn H, Strik W, Tschacher W. Less structured movement patterns predict severity of positive syndrome, excitement, and disorganization. Schizophrenia bulletin. 2014;40(3):585–91. pmid:23502433
- 76. Eyben F, Weninger F, Paletta L, Schuller BW. The acoustics of eye contact: detecting visual attention from conversational audio cues. In: Proceedings of the 6th workshop on Eye gaze in intelligent human machine interaction: gaze in multimodal interaction. 2013. p. 7–12.
- 77. House D. Integrating audio and visual cues for speaker friendliness in multimodal speech synthesis. In: INTERSPEECH. Citeseer; 2007. p. 1250–3.