The authors have declared that no competing interests exist.
Emotion state recognition using wireless signals is an emerging area of research with an impact on neuroscientific studies of human behaviour and on well-being monitoring. Currently, standoff emotion detection relies mostly on the analysis of facial expressions and/or eye movements acquired from optical or video cameras. Meanwhile, although machine learning approaches have been widely accepted for recognizing human emotions from multimodal data, they have mostly been restricted to subject-dependent analyses, which lack generality. In this paper, we report an experimental study that collects the heartbeat and breathing signals of 15 participants from radio frequency (RF) reflections off the body, followed by novel noise filtering techniques. We propose a novel deep neural network (DNN) architecture based on the fusion of raw RF data and the processed RF signal for classifying and visualising various emotion states. The proposed model achieves a high classification accuracy of 71.67% for independent subjects, with precision, recall and F1-score values of 0.71, 0.72 and 0.71 respectively. We have compared our results with those obtained from five different classical ML algorithms, establishing that deep learning offers superior performance even with a limited amount of raw RF and post-processed time-sequence data. The deep learning model has also been validated by comparing our results with those from ECG signals. Our results indicate that using wireless signals for standoff emotion state detection is a strong alternative to other technologies, offering high accuracy and much wider applicability in future studies of the behavioural sciences.
With the advancements in body-centric wireless systems, physiological monitoring has been revolutionized for improving healthcare and wellbeing of people [
Given the impact of the aforementioned applications on daily life, an extensive range of strategies has been explored for emotion detection, primarily focusing on audio [
While conventional machine learning (ML) algorithms have performed well for emotion classification, especially under the constraint of subject dependency [
A similar architecture is deployed in [
These approaches vary from simple multi-layer feed forward neural networks [
The recent progress in wearable electronic sensors has enabled the collection of physiological data, such as heart rate, respiration rate and electroencephalography (EEG), for several physical manifestations of emotions. However, wearable sensors and devices are cumbersome during routine activities and can lead to false judgements in recognizing people’s true emotions. In [
This paper focuses on exploring deep neural networks for affective emotion detection in comparison to traditional ML algorithms. A framework is developed for recognizing human emotions using a wireless system without bulky wearable sensors, making it truly non-intrusive and directly applicable in future smart home/building environments. An experimental database containing the heartbeat and breathing signals of 15 subjects was created by extracting the radio frequency (RF) reflections off the body, followed by noise filtering techniques. The RF based emotion sensing systems (
The Tx antenna is used to transmit RF signals towards the participant, whereas the Rx antenna is used to receive RF reflections off the body. The ECG monitor is also connected to the participant’s chest for recording heartbeats. The data received from the ECG are used to correlate heartbeat variations with the emotion-evoking videos.
The proposed network achieves state-of-the-art classification accuracy in comparison to five different traditional ML algorithms. In addition, a similar architecture is used for emotion recognition from the ECG signals. Our results indicate that deep learning is capable of utilizing a range of building blocks to learn from the RF reflections off the body for precise emotion detection, and it excludes manual feature extraction techniques. Furthermore, we propose that RF reflections can be an excellent alternative to ECG or bulky wearables for subject-independent human emotion detection, with high and comparable accuracy.
Feature extraction is an integral part of signal (electromagnetic, acoustic, etc.) classification that can be performed manually or by using a neural network. Deploying traditional machine learning algorithms for signal classification necessitates laborious extraction of statistical parameters from the raw data input. This manual approach can be tedious and may result in the omission of some useful features. In contrast, deep neural networks can extract a vast number of features from the raw data itself, whether prominent or subtle [
The time-domain RF signal is processed through two 1-D convolutional layers and an additional LSTM cell that captures the time dependency (section 1 in
The convolutional layers are exceptional feature extractors and often outperform humans in this regard. A convolutional layer may have many kernels in the form of matrices (e.g. 3 × 3 and 5 × 5) whose numerical values capture a variety of different features (e.g. brightness, darkness, blurring, edges, etc., of an image) from raw data. A kernel runs through the input data as a sliding window and, at every distinct location, performs element-wise multiplication with the overlapping input data and takes the summation to obtain the value of that location in the generated feature map. Max-pooling layers are not involved in feature extraction; instead, they reduce the dimensions of the convolutional layers’ outputs and hence the computational complexity. A typical convolutional layer has 32, 64 or even 128 kernels and thus produces the same number of feature maps. As observed in
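The sliding-window multiply-and-sum operation and the subsequent max-pooling described above can be sketched for the 1-D case (a minimal NumPy illustration, not the authors' implementation; the kernel values are arbitrary):

```python
import numpy as np

def conv1d(signal, kernel):
    """Valid-mode 1-D convolution: slide the kernel over the signal,
    taking an element-wise product and sum at each position."""
    k = len(kernel)
    return np.array([np.dot(signal[i:i + k], kernel)
                     for i in range(len(signal) - k + 1)])

def maxpool1d(feature_map, size=2):
    """Down-sample the feature map by keeping the maximum of each window."""
    n = len(feature_map) // size
    return feature_map[:n * size].reshape(n, size).max(axis=1)

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0])
kernel = np.array([-1.0, 0.0, 1.0])   # a simple slope/edge-like detector
fmap = conv1d(signal, kernel)          # length 8 - 3 + 1 = 6
pooled = maxpool1d(fmap)               # halves the feature-map length
```

A real convolutional layer simply applies many such kernels in parallel (32, 64, 128, …), with the kernel values learned during training rather than fixed by hand.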
The accuracy of classification is evaluated with leave-one-out cross validation (LOOCV) [
The proposed DL model yielded 71.67% LOOCV accuracy. This is quite a high percentage, considering that human emotions are highly dependent on the level of stimulation generated in the brain by the same audio-video stimuli, which can induce emotions of differing intensity from one person to another. It is tempting to judge the model solely on its classification accuracy. However, a model with a high classification accuracy can still perform suboptimally, especially when the database is unbalanced, with some classes containing many more data points than others. To describe the model more fully, we therefore adopt other performance metrics such as precision, recall and F1-score. Precision indicates how many selected instances are relevant (a measure of quality), whereas recall indicates how many relevant instances are selected (a measure of quantity). F1-score, the harmonic mean of precision and recall, reveals the trade-off between the two, and can be likened to the effective resistance of two parallel resistors (precision and recall) in a closed-loop circuit. The F1-score becomes low if either of these figures is low in comparison to the other, thus illustrating the reliability of the model across all classes. Although these parameters are defined for binary classification, they can be extended to multi-class problems by calculating the inter-class mean and standard deviation. The calculated values of precision, recall and F1-score after LOOCV are 0.713, 0.716 and 0.714 respectively, implying that the model has achieved good generalizability.
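These metrics can be computed per class from a confusion matrix and macro-averaged for the multi-class case. A hedged sketch with an illustrative 4-class matrix (toy counts, not our experimental data):

```python
import numpy as np

def macro_metrics(cm):
    """Per-class precision/recall/F1 from a confusion matrix
    (rows = true class, columns = predicted class), macro-averaged."""
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)          # TP / (TP + FP)
    recall = tp / cm.sum(axis=1)             # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision.mean(), recall.mean(), f1.mean()

# Illustrative 4-class confusion matrix (e.g. relax/scary/disgust/joy)
cm = np.array([[12,  1,  1,  1],
               [ 2, 10,  2,  1],
               [ 1,  2, 11,  1],
               [ 1,  1,  1, 12]])
p, r, f = macro_metrics(cm)
```

The harmonic-mean form of F1 is what makes it drop sharply when either precision or recall is low, exactly as described above.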
We have employed a traditional ML pipeline comprising data pre-processing, feature extraction, model training and classification (section 3, S5 Fig in
Feature extraction can be regarded as a core step of ML algorithms for analysing data. Given its importance, an efficient feature extraction algorithm can significantly improve the classification accuracy while reducing the impact of interfering redundant RF signals and random noise. In the literature, a variety of feature extraction parameters have been studied, mostly in the fields of affective recognition and biological engineering [
Apart from the entropy value, it is well documented that the power spectral density (PSD) and statistical (variance, skewness, kurtosis) parameters are also related to the affective state of participants [
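A hedged sketch of such a feature vector (variance, skewness, kurtosis and a periodogram-based spectral entropy as a PSD summary), using NumPy only; the exact feature set used in our pipeline may differ:

```python
import numpy as np

def extract_features(x, eps=1e-12):
    """Statistical and spectral features from a 1-D time-series segment."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    z = (x - mu) / (sigma + eps)
    skewness = np.mean(z ** 3)
    kurtosis = np.mean(z ** 4) - 3.0            # excess kurtosis
    psd = np.abs(np.fft.rfft(x)) ** 2 / len(x)  # periodogram PSD estimate
    pk = psd / (psd.sum() + eps)                # normalise to a distribution
    spectral_entropy = -np.sum(pk * np.log2(pk + eps))
    return np.array([sigma ** 2, skewness, kurtosis, spectral_entropy])

t = np.linspace(0, 10, 1000)
segment = np.sin(2 * np.pi * 1.2 * t)   # stand-in for a breathing-like trace
features = extract_features(segment)
```

For a near-pure sinusoid the spectral entropy is low (energy concentrated in one frequency bin), whereas noisy or irregular physiological segments yield higher values, which is what makes entropy-style features discriminative.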
The confusion matrices obtained using LOOCV for CNN+LSTM model and five classical ML algorithms are depicted in
The metric ‘Accuracy’ refers to LOOCV accuracy.
| Model | Accuracy (%) | Precision | Recall | F1-score |
|---|---|---|---|---|
| CNN + LSTM | 71.67 | 0.713 (±0.08) | 0.716 (±0.12) | 0.714 (±0.10) |
| Random forest | 63.33 | 0.646 (±0.27) | 0.633 (±0.29) | 0.634 (±0.18) |
| SVM | 63.33 | 0.645 (±0.17) | 0.63 (±0.04) | 0.637 (±0.08) |
| KNN | 61.7 | 0.64 (±0.21) | 0.616 (±0.18) | 0.615 (±0.19) |
| Decision tree | 55.0 | 0.554 (±0.30) | 0.549 (±0.23) | 0.55 (±0.14) |
| LDA | 51.7 | 0.544 (±0.36) | 0.516 (±0.27) | 0.526 (±0.28) |
Data visualization is pivotal for the basic identification of patterns and trends in data, helping to interpret and elaborate the results obtained from machine learning models. However, high-dimensional data, such as that obtained by feature extraction, need to be compressed into a lower dimension for visualization. T-distributed stochastic neighbour embedding (t-SNE) is a nonlinear dimensionality reduction machine learning algorithm often used for visualising high-dimensional data by projecting it onto a 2D or 3D space (section 4 in
The plots were obtained by reducing the dimensions of the continuous wavelet images of each signal. It can be observed that the wavelet images of RF signals (panel (a)) demonstrate a better separability between emotions than that of ECG signals (panel (b)).
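The projection step can be sketched with scikit-learn's t-SNE; here random clustered vectors stand in for the flattened wavelet images (illustrative data, not ours):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for flattened wavelet images: 30 samples, 3 loose clusters
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(10, 64))
               for c in (0.0, 3.0, 6.0)])

# Project the 64-dimensional features onto a 2-D plane for plotting
emb = TSNE(n_components=2, perplexity=5, init="pca",
           random_state=0).fit_transform(X)
```

The `perplexity` parameter (roughly the effective number of neighbours) must be smaller than the sample count; the resulting 2-D coordinates are then scatter-plotted, coloured by emotion label, to assess separability visually.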
It is understood that the emotions evoked by the audio-visual stimuli are highly subject dependent and therefore difficult to classify on a common ground. For this reason, it is essential to assess the capability of the models to distinguish between classes. A receiver operating characteristic (ROC) curve is a probability curve obtained by plotting sensitivity against (1 − specificity). The area under the curve (AUC) represents the degree of separability. ROC is defined for a binary classifier system; however, it can be extended to multiclass classification by building a single classifier per class, known as the one-vs.-rest or one-against-all strategy. The ROC curve and AUC for each class obtained using the SVM model are illustrated in
The emotions ‘Disgust’ and ‘Relax’ are highly separable from the rest. Micro-average aggregates the contribution from all classes to compute the average ROC curve. Macro-average computes the ROC metric for each class independently and takes the average, hence treating all classes equally.
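The one-vs.-rest construction can be sketched directly: treat each class in turn as positive and all others as negative, then compute AUC from the rank (Mann–Whitney) statistic. A NumPy-only illustration with toy scores, not our SVM outputs:

```python
import numpy as np

def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative (Mann-Whitney statistic; ties count half)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = ((pos[:, None] > neg[None, :]).sum()
            + 0.5 * (pos[:, None] == neg[None, :]).sum())
    return wins / (len(pos) * len(neg))

def one_vs_rest_auc(prob, y, n_classes):
    """Per-class AUC: class k as positive, every other class as negative."""
    return [auc(prob[:, k], (y == k).astype(int)) for k in range(n_classes)]

# Toy predicted probabilities for 6 samples over 3 classes
prob = np.array([[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.2, 0.7, 0.1],
                 [0.1, 0.8, 0.1], [0.1, 0.2, 0.7], [0.3, 0.1, 0.6]])
y = np.array([0, 0, 1, 1, 2, 2])
per_class = one_vs_rest_auc(prob, y, 3)   # perfectly separable toy case
```

Micro-averaging would pool all (score, binarised label) pairs before computing one curve, while macro-averaging takes the plain mean of the per-class values, as described above.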
We have used the CNN+LSTM model to predict the variations of emotion probabilities across all the videos for a randomly selected subject from the test set.
Smooth probability curves are generated by interpolating the discrete probability values.
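Such smoothing can be reproduced with simple linear interpolation of the discrete per-timestamp probabilities (a NumPy sketch with illustrative values; the published figure may use spline interpolation instead):

```python
import numpy as np

# Discrete probabilities of one emotion at a few time stamps (seconds)
t_discrete = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
p_discrete = np.array([0.10, 0.55, 0.80, 0.60, 0.25])

# Dense time axis yields a smooth-looking curve for plotting
t_dense = np.linspace(0.0, 40.0, 401)
p_dense = np.interp(t_dense, t_discrete, p_discrete)
```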
Human clinical conditions, whether physical or mental, cause subtle variations in heart rate that are also reflected in the ECG signal. Therefore, existing health condition monitoring systems predominantly depend on ECG data for discovering the underlying causes and categorizing the conditions. In order to make a comparison with the RF results, we use the simultaneously extracted ECG signal data to train a similar DNN architecture, as shown in
The metric ‘Accuracy’ refers to LOOCV accuracy.
| Model | Accuracy (%) | Precision | Recall | F1-score |
|---|---|---|---|---|
| CNN | 71.67 | 0.720 (±0.03) | 0.716 (±0.09) | 0.714 (±0.03) |
| SVM | 68.33 | 0.692 (±0.07) | 0.68 (±0.05) | 0.681 (±0.02) |
All experimental study was approved by Queen Mary Ethics of Research Committee of Queen Mary University of London under QMERC2019/25. All research was performed in accordance with guidelines/regulations approved by Ethics of Research Committee. Written informed consent was obtained from the participants involved in the study.
The experiment was performed on 15 participants. All participants were English speaking, aged between 22 and 35 years. The participants were briefed on the measurement details before the start of the experiment. They were provided with a comfortable environment so that they could focus on watching the videos with minimal distraction.
To induce emotions in the participants, videos were selected that evoke four emotional states (relax, scary, disgust, and joy) (section 7, S8 Fig in
Measurements were performed in an anechoic chamber to reduce any interfering noise emanating from the external environment that might alter the emotions of a participant during the experiment (section 8, S11 Fig in
The videos were shown one at a time to the participant, who was seated on a chair in front of the display monitor at a distance of approximately 1 meter. The participants were exposed to an RF power level of 0 dBm. After the end of each video, the participant was asked to relax before the start of the next video. While each video was playing, RF reflections from the participant’s body were detected through the receiving Vivaldi antenna, which was connected to the VNA. In our experiment, the phase difference of the RF reflections is captured using radar techniques. We have employed a procedure that calculates the phase difference between the transmitted signal and the RF reflections off the body. For instance, the transmitted signal is given as:
The phase difference between transmitted signal and received signal is:
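The exact expressions do not survive in this extract; as a hedged reconstruction of the standard continuous-wave radar relations, with a transmitted tone of frequency $f$ and a reflection delayed by the round trip to the chest at distance $d(t)$:

```latex
T(t) = \cos(2\pi f t), \qquad
R(t) = A \cos\!\left( 2\pi f \left( t - \frac{2\,d(t)}{c} \right) \right),
\qquad
\Delta\phi(t) = \frac{4\pi f\, d(t)}{c} = \frac{4\pi\, d(t)}{\lambda}.
```

In this form, millimetre-scale chest displacements due to breathing and heartbeat modulate $\Delta\phi(t)$, which is the quantity recovered from the received signal.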
ECG signals have been extensively explored in the literature for emotion detection, particularly in the field of affective computing. The emotional states of a person are closely associated with human psychological activity and cognition. In our experiment, we have used an ECG monitor (PC-80B) to extract the heartbeat variations of a participant during the experiment. The ECG monitor is convenient to use and has three electrodes that can be attached easily to the participant’s chest.
For further classification, we have used the continuous wavelet transform (CWT) to transform the 1-D RF signals into 2-D scaleograms. In mathematics, the CWT is a formal (i.e., non-numerical) tool that provides a complete representation of a signal, with the capability to continuously vary the scale parameter of the wavelet. Based on the CWT, a 1-D RF signal can be transformed into a 2-D scaleogram stored in an image format. A scaleogram is beneficial for an in-depth understanding of the dynamic behaviour of body movements, and the individual body movements of participants while watching videos can also be distinguished. The normalized time series and its Fourier transform sequence are extracted as the 1-D features, while the 2-D scaleogram, stored as an image, provides the 2-D features. In the classification stage, the combination of the 1-D and 2-D features is used to classify the different emotional states of the participants.
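A minimal NumPy sketch of a CWT scaleogram built from Ricker (Mexican-hat) wavelets, one filtered row per scale (an illustration of the transform only; the wavelet family and scales used in our pipeline may differ):

```python
import numpy as np

def ricker(points, a):
    """Ricker (Mexican-hat) wavelet sampled at `points` with width `a`."""
    t = np.arange(points) - (points - 1) / 2.0
    amp = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return amp * (1 - (t / a) ** 2) * np.exp(-0.5 * (t / a) ** 2)

def cwt_scaleogram(x, scales, wavelet_len=101):
    """2-D scaleogram: convolve the signal with the wavelet at each scale
    and stack the responses into an image (rows = scales, cols = time)."""
    return np.vstack([np.convolve(x, ricker(wavelet_len, a), mode="same")
                      for a in scales])

t = np.linspace(0, 4, 512)
signal = np.sin(2 * np.pi * 2 * t)     # stand-in for a breathing trace
scales = np.arange(1, 33)               # 32 scales -> a 32 x 512 image
scaleogram = cwt_scaleogram(signal, scales)
```

The resulting 2-D array can be saved or fed to the image branch of the classifier directly, which is how the 1-D signal becomes a 2-D feature.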
Emotion detection has emerged as a paramount area of research in neuroscientific studies as well as in many other strands of well-being, especially for mentally ill elderly people who are susceptible to physiological fatigue and undergo interactive therapy as treatment. In this study, we have proposed a novel deep learning architecture that fuses wirelessly received raw time-domain data with frequency-domain representations and achieves state-of-the-art emotion detection performance. We have experimentally demonstrated that four different human emotions can be recognized in a subject-independent manner with over 71% accuracy, even in a data-limited regime. Moreover, our results indicate that deep learning offers superior performance on the present classification task in comparison to five different machine learning algorithms. We further tested the performance of the proposed DL architecture on simultaneously extracted ECG data, establishing that wireless RF measurements could be a better alternative to contact-based methods such as ECG and EEG for human emotion detection. We further evaluated the generalizability of our DL model across other databases by validating it on a well-established ECG database. We believe the framework proposed in the present study is a low-cost, hassle-free solution for carrying out emotion-related research, and it offers high detection accuracy in comparison with alternative approaches.