Deep learning framework for subject-independent emotion detection using wireless signals

Emotion state recognition using wireless signals is an emerging area of research that has an impact on neuroscientific studies of human behaviour and well-being monitoring. Currently, standoff emotion detection is mostly reliant on the analysis of facial expressions and/or eye movements acquired from optical or video cameras. Meanwhile, although machine learning approaches have been widely accepted for recognizing human emotions from multimodal data, they have been mostly restricted to subject-dependent analyses, which lack generality. In this paper, we report an experimental study that collects heartbeat and breathing signals of 15 participants from radio frequency (RF) reflections off the body, followed by novel noise filtering techniques. We propose a novel deep neural network (DNN) architecture based on the fusion of raw RF data and the processed RF signal for classifying and visualising various emotion states. The proposed model achieves a high classification accuracy of 71.67% for independent subjects, with precision, recall and F1-score values of 0.71, 0.72 and 0.71, respectively. We have compared our results with those obtained from five different classical ML algorithms, and it is established that deep learning offers superior performance even with a limited amount of raw RF and post-processed time-sequence data. The deep learning model has also been validated by comparing our results with those from ECG signals. Our results indicate that using wireless signals for standoff emotion state detection is a better alternative to other technologies, with high accuracy, and has much wider applications in future studies of behavioural sciences.


Introduction
With the advancements in body-centric wireless systems, physiological monitoring has been revolutionized for improving the healthcare and wellbeing of people [1][2][3][4][5][6]. These systems predominantly rely on wireless intelligent sensors that are capable of retrieving clinical information from physiological signals to interpret the progression of various ailments. The recent progress in wearable electronic sensors has enabled the collection of physiological data, such as heart rate, respiration rate and electroencephalography (EEG), for several physical manifestations of emotions. However, wearable sensors and devices are cumbersome during routine activities and can lead to false judgement in recognizing people's true emotions. In [11], a wireless system is demonstrated that can measure minute variations of a person's heartbeat and breathing rate in response to individually prepared stimuli (memories, photos, music and videos) that evoke a certain emotion during the experiment. Most of the participants in that study were actors experienced in evoking emotions. The RF reflections off the body are preprocessed and fed to machine learning (ML) algorithms to classify four basic emotion types: anger, sadness, joy and pleasure. The proposed system removes the requirement of carrying on-body sensors for emotion detection. Nevertheless, emotions were classified only using conventional ML algorithms, and investigating the competence of deep learning for wireless signal classification has become an exciting research area.
This paper focuses on exploring deep neural networks for affective emotion detection in comparison to traditional ML algorithms. A framework is developed for recognizing human emotions using a wireless system without bulky wearable sensors, making it truly non-intrusive and directly applicable in future smart home/building environments. An experimental database containing the heartbeat and breathing signals of 15 subjects was created by extracting the radio frequency (RF) reflections off the body, followed by noise filtering techniques. RF based emotion sensing systems (Fig 1) can overcome the limitations of traditional body worn devices, which have a limited sensing range and cause inconvenience to people. To elicit a particular emotion in the participant, four videos were selected from an online video platform. The videos were not shown to the participants before the start of the experiment. Thus, our approach of evoking emotions in the participants differs from [11], in which each participant has to prepare their own stimuli (photos, personal memories, music, videos) before the start of the experiment and act the intended emotion during the experiment. A novel convolutional neural network (CNN) architecture integrated with long short-term memory (LSTM) sequence learning cells, which leverages both the processed RF signal and the raw RF reflection, is utilized for the classification.
The proposed network achieves state-of-the-art classification accuracy in comparison to five different traditional ML algorithms. In addition, a similar architecture is used for emotion recognition using the ECG signals. Our results indicate that deep learning is capable of utilizing a range of building blocks to learn from the RF reflections off the body for precise emotion detection, and that it removes the need for manual feature extraction techniques. Furthermore, we propose that RF reflections can be an exceptional alternative to ECG or bulky wearables for subject-independent human emotion detection with high and comparable accuracy.
Fig 1. Emotion detection process, in which each participant is asked to watch emotion evoking videos on the monitor while being exposed to radio waves. The Tx antenna is used to transmit RF signals towards the participant, whereas the Rx antenna is used to receive RF reflections off the body. An ECG monitor is also connected to the participant's chest for recording heartbeats. The data received from the ECG is used to correlate heartbeat variations with the emotion evoking videos.
https://doi.org/10.1371/journal.pone.0242946.g001

Detection of emotional states
Deep learning analysis. Feature extraction is an integral part of signal (electromagnetic, acoustic, etc.) classification and can be performed manually or by using a neural network. Deploying traditional machine learning algorithms for signal classification necessitates laborious extraction of statistical parameters from the raw input data. This manual approach can be tedious and may result in the omission of some useful features. In contrast, deep neural networks can extract an enormous number of features from the raw data itself, whether they are significant or minute details [39]. Therefore, we employ an appropriate DNN architecture to process the time domain wireless signal (RF reflections off the body) and the corresponding frequency domain version obtained by continuous wavelet (CW) transformation. Here, the RF reflection signal is one-dimensional (1D) and the CW transformation is a three-dimensional (3D) image, represented in the format (width, height, channels). The parameters of the wavelet image can be regarded as time (x-axis), frequency (y-axis) and amplitude. The proposed DL architecture, shown in Fig 2, can be described as a 'Y' shaped neural network that accepts inputs in two distinct forms and fuses the processed inputs at the end to produce classification probabilities for the four emotions. The neural network consists of two sets of convolutional 1D and maxpooling 1D layers, followed by a long short-term memory (LSTM) cell, to capture the features and time dependency of the time domain RF signal. Another two sets of convolutional 2D and maxpooling 2D layers are used to process the CW transformed image.
Convolutional layers are exceptional feature extractors and often outperform humans in this regard. A convolutional layer may have many kernels in the form of matrices (e.g. 3 × 3 and 5 × 5) that embed numerical values to capture a variety of features (e.g. brightness, darkness, blurring and edges of an image) from raw data. A kernel runs through the input data as a sliding window and, at every distinct location, performs element-wise multiplication with the overlapping input data and sums the products to obtain the value at that location of the generated feature map. Maxpooling layers are not involved in feature extraction; instead, they reduce the dimensions of the outputs of the convolutional layers, hence reducing the computational complexity. A typical convolutional layer has 32, 64 or even 128 kernels and thus produces the same number of feature maps. As observed in Fig 2, the feature maps carry even the minute information available in the input image, which the human eye is unable to capture, making convolutional layers extraordinary feature extractors.
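The sliding-window operation and the pooling step described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the 5 × 5 input, the hand-written 3 × 3 edge kernel and the function names `conv2d_valid` and `maxpool2d` are assumptions chosen only to make the arithmetic easy to follow (in a trained network the kernel values are learned).

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Element-wise multiply the kernel with the overlapping patch, then sum.
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        feature_map.append(row)
    return feature_map


def maxpool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            row.append(max(feature_map[r + i][c + j]
                           for i in range(size) for j in range(size)))
        pooled.append(row)
    return pooled


# Hypothetical input: a 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical kernel that responds to left-to-right intensity increases.
kernel = [[-1, 0, 1] for _ in range(3)]

features = conv2d_valid(image, kernel)  # 3x3 feature map
pooled = maxpool2d(features)            # 1x1 map after 2x2 pooling
```

The feature map is large where the kernel overlaps the edge and zero over the uniform region, and pooling keeps only the strongest response in each window, which is why it reduces dimensionality without adding learned parameters.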
The accuracy of classification is evaluated with leave-one-out cross validation (LOOCV) [40]. Although cross validation is widely used with ML models to assess the generalizability of a model, it is somewhat unconventional to perform cross validation with deep learning due to (1) extreme computational complexity and (2) the difficulty of tracking overfitting/underfitting conditions with a fixed number of iterations while training the model. However, in order to make a fair judgement on our DL predictions, we used the full database and performed LOOCV, despite it being the most computationally intensive form of K-fold cross validation. In K-fold cross validation, the database is split into k subsets, of which one is kept as the test set and the other k − 1 are put together to form the training set. This process is repeated k times such that every data point is in the test set exactly once, eliminating the effect of a biased division of the data into training and test sets. LOOCV is obtained by setting k equal to N, the number of data points in the database.
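The relationship between K-fold cross validation and LOOCV described above can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names, the toy one-dimensional features and the nearest-neighbour placeholder classifier are assumptions made for the example and are unrelated to the models used in the study.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def leave_one_out_splits(n_samples):
    """LOOCV is k-fold cross validation with k equal to the number of data points."""
    return k_fold_splits(n_samples, n_samples)


def cross_validated_accuracy(features, labels, fit, predict, splits):
    """Train on each training split and score on the held-out samples."""
    correct = total = 0
    for train, test in splits:
        model = fit([features[i] for i in train], [labels[i] for i in train])
        for i in test:
            correct += predict(model, features[i]) == labels[i]
            total += 1
    return correct / total


# Placeholder 1-nearest-neighbour classifier on scalar features (assumption).
def fit(train_x, train_y):
    return list(zip(train_x, train_y))


def predict(model, x):
    return min(model, key=lambda pair: abs(pair[0] - x))[1]


# Hypothetical toy data: two well-separated classes.
features = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
labels = ["relax", "relax", "relax", "joy", "joy", "joy"]

accuracy = cross_validated_accuracy(
    features, labels, fit, predict, leave_one_out_splits(len(features))
)
```

Because every sample is held out exactly once, the model is retrained N times, which is the source of the computational cost noted in the text.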
The proposed DL model yielded 71.67% LOOCV accuracy. This is quite a high percentage, considering that human emotions depend strongly on the level of stimulation generated in the brain, so the same audio-video stimuli can induce emotions of different intensity from one person to another. It is tempting to judge the performance of a model solely on the classification accuracy. However, a model with a high classification accuracy can still perform suboptimally, especially when the database is unbalanced, i.e. some classes contain many data points and others do not. In order to have a better description of the model, we adopt other performance metrics such as precision, recall and F1-score. Precision indicates how many selected instances are relevant (a measure of quality), whereas recall indicates how many relevant instances are selected (a measure of quantity). The F1-score reveals the trade-off between precision and recall and is analogous to the effective resistance of two parallel resistors (precision and recall) in a closed loop circuit. The F1-score becomes low if either of these figures is low in comparison to the other, thus illustrating the reliability of the model across all classes. Although these parameters are defined for binary classification, they can be extended to multi-class problems by calculating inter-class averages.
Fig 2. Time domain RF signal is processed through two convolutional-1D layers and an additional LSTM cell that captures the time dependency (section 1 in S1 File). The CW transformation is processed by two convolutional-2D layers (section 2.1 in S1 File). Each feature map in the convolutional layers represents a unique feature extracted from the layer input. The features extracted from the two distinct inputs of the model are then concatenated, leading to a broad learning capability. A detailed visualization of the 32 and 64 feature maps is presented in section 2.2 of S1 File.
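As an illustration of the performance metrics discussed in this section, the sketch below computes per-class precision, recall and F1-score from a confusion matrix and averages them across classes. The confusion matrix is hypothetical (it is not taken from Fig 3), and unweighted macro-averaging is an assumption about how the multi-class extension is done. Note that F1 is the harmonic mean 2PR/(P + R), which is twice the value given by the parallel-resistor formula PR/(P + R).

```python
def per_class_metrics(confusion):
    """Return one (precision, recall, f1) tuple per class.

    confusion[i][j] counts samples of true class i that were predicted as class j.
    """
    n = len(confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                           # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


def macro_average(metrics):
    """Unweighted mean of each metric over the classes."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics) / n for i in range(3))


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class).
confusion = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]

metrics = per_class_metrics(confusion)
macro_precision, macro_recall, macro_f1 = macro_average(metrics)
```

Reporting the per-class values alongside the averages makes it visible when a high overall accuracy hides one poorly recognised class.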
Machine Learning (ML) analysis. We have employed the traditional ML pipeline of data pre-processing, feature extraction, model training and classification (section 3, S5 Fig in S1 File). In our experiment, the RF signals reflected off the body encompass human body movements and random noise that is mostly contributed by the environment, the equipment (VNA, cables, etc.) and other moving objects. For this reason, it is essential to filter the noise from the received RF signals before further processing. Moreover, we have also applied data normalization to circumvent the influence of intensity variations in body movement between participants.
The feature extraction process can be regarded as a core step of ML algorithms for analysing data. Given the importance of feature extraction for ML, an efficient feature extraction algorithm can significantly improve the classification accuracy while reducing the impact of interfering redundant RF signals and random noise. In the literature, a variety of feature extraction parameters have been studied, mostly in the fields of affective recognition and biological engineering [41][42][43]. Permutation entropy is a widely used nonlinear parameter for evaluating the complexity of a sequence and is a prevalent approach for characterizing the pattern of biological signals, such as the electrocardiogram (ECG) and electroencephalogram (EEG). It is also capable of detecting real-time dynamic characteristics and is highly robust.
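For readers unfamiliar with permutation entropy, the following sketch shows one common way to compute it: count the ordinal patterns of short windows and take the Shannon entropy of their distribution. The embedding `order` and `delay` defaults are assumptions for illustration; the text does not state the values used in the study.

```python
import math
from collections import Counter


def permutation_entropy(signal, order=3, delay=1, normalize=True):
    """Shannon entropy (in bits) of the ordinal patterns found in `signal`.

    `order` is the number of samples per window and `delay` the spacing between
    them; both defaults are illustrative assumptions.
    """
    n_patterns = len(signal) - (order - 1) * delay
    if n_patterns <= 0:
        raise ValueError("signal is too short for the requested order and delay")
    counts = Counter()
    for start in range(n_patterns):
        window = [signal[start + i * delay] for i in range(order)]
        # The ordinal pattern is the index order that sorts the window
        # (ties keep their original order because sorted() is stable).
        pattern = tuple(sorted(range(order), key=lambda i: window[i]))
        counts[pattern] += 1
    entropy = -sum((c / n_patterns) * math.log2(c / n_patterns)
                   for c in counts.values())
    if normalize:
        # Divide by the maximum entropy, log2(order!), to map the result to [0, 1].
        entropy /= math.log2(math.factorial(order))
    return entropy


# A monotonic ramp contains a single ordinal pattern, so its entropy is zero.
ramp_entropy = permutation_entropy(list(range(20)))
# An alternating sequence with order 2 uses both possible patterns equally often.
alternating_entropy = permutation_entropy([0, 1] * 5 + [0], order=2)
```

Regular signals therefore score near 0 and irregular ones near 1, which is what makes the measure useful as a complexity feature.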
Apart from the entropy value, it is well documented that the power spectral density (PSD) and statistical parameters (variance, skewness, kurtosis) are also related to the affective state of participants [44]. In our analysis, the permutation entropy, the PSD in the ranges 0.15-2 Hz, 2-4 Hz and 4-8 Hz, and the variance, skewness and kurtosis values are extracted from the pre-processed signals. Overall, therefore, seven parameters are used in the feature extraction process (section 3, S5 Fig in S1 File).
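The spectral and statistical features listed above (three band powers plus variance, skewness and kurtosis) can be sketched as follows. This is a hedged illustration rather than the authors' code: the direct-DFT periodogram, the half-open band edges, the use of non-excess kurtosis, the 32 Hz sampling rate and the synthetic test signal are all assumptions.

```python
import cmath
import math


def band_power(signal, fs, f_low, f_high):
    """Power of `signal` in [f_low, f_high) Hz from a one-sided periodogram.

    A direct DFT is used, which is adequate for short windows; the DC and
    Nyquist bins are ignored.
    """
    n = len(signal)
    mean = sum(signal) / n
    centred = [x - mean for x in signal]
    power = 0.0
    for k in range(1, (n + 1) // 2):
        freq = k * fs / n
        if f_low <= freq < f_high:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(centred))
            power += 2 * abs(coeff) ** 2 / n ** 2
    return power


def shape_statistics(signal):
    """Return (variance, skewness, kurtosis) using population moments.

    Kurtosis is the non-excess value (3 for a Gaussian). A constant signal
    has zero variance and would raise ZeroDivisionError.
    """
    n = len(signal)
    mean = sum(signal) / n
    m2 = sum((x - mean) ** 2 for x in signal) / n
    m3 = sum((x - mean) ** 3 for x in signal) / n
    m4 = sum((x - mean) ** 4 for x in signal) / n
    return m2, m3 / m2 ** 1.5, m4 / m2 ** 2


# Hypothetical 2 s window sampled at 32 Hz: a 1 Hz and a weaker 3 Hz component.
fs = 32.0
signal = [math.sin(2 * math.pi * 1.0 * i / fs) + 0.5 * math.sin(2 * math.pi * 3.0 * i / fs)
          for i in range(64)]

bands = [(0.15, 2.0), (2.0, 4.0), (4.0, 8.0)]  # band edges in Hz, as in the text
powers = [band_power(signal, fs, lo, hi) for lo, hi in bands]
variance, skewness, kurtosis = shape_statistics(signal)
```

Together with the permutation entropy, these six values would form the seven-element feature vector described in the text.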
Analysis of deep learning and machine learning results. The confusion matrices obtained using LOOCV for the CNN+LSTM model and five classical ML algorithms are depicted in Fig 3. As tabulated in Table 1, deep learning outperforms the conventional machine learning algorithms in all performance metrics. We identify two main reasons why deep learning is superior in the current learning problem. First, having both the time domain wireless signal and the CW transformed image as inputs is a rich source of learning for the CNN+LSTM model, whereas the ML algorithms are trained with extracted features as inputs, which are sensitive to human judgement in selecting the features and involve an obvious loss of information from the original data. Second, CNNs are self-learners that, given the correct hyper-parameters, learn even the minute information hidden in the raw data that helps to reconstruct the target values. ML algorithms are somewhat reliant on humans to identify meaningful statistical parameters (or a combination of parameters) from the raw data to be fed to the model. Nevertheless, these ML models still report an acceptable performance that can be used as a baseline for measuring how well the implemented DL model performs.

Data visualization
Data visualization is pivotal for the basic identification of patterns and trends in data, which helps to understand and elaborate the results obtained from the machine learning models. However, high dimensional data, such as that obtained by feature extraction, needs to be compressed into a lower dimension for visualization. T-distributed stochastic neighbour embedding (t-SNE) is a nonlinear dimensionality reduction algorithm often used for visualising high dimensional data by projecting it onto a 2D or 3D space (section 4 in S1 File).

Discussion
It is understood that the emotions evoked by audio-visual stimuli are highly subject dependent and therefore difficult to classify on a common ground. For this reason, it is essential to assess the capability of the models to distinguish between classes. A receiver operating characteristic (ROC) curve is a probability curve obtained by plotting sensitivity against (1 − specificity). The area under the curve (AUC) represents the degree of separability. ROC is defined for a binary classifier system; however, it can be extended to multiclass classification by building a single classifier per class, known as the one-vs.-rest or one-against-all strategy. The ROC curve and AUC for each class obtained using the SVM model are illustrated in Fig 5. The AUCs indicate that the emotions 'Disgust' and 'Relax' have a higher degree of separability, agreeing well with the DL and ML classification results. It should be noted that the four video stimuli of the respective emotions were displayed to the subjects with minimum delay between the videos, and hence it is possible for emotions evoked by the preceding video to persist in the initial part of the following video before they completely vanish.
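The one-vs.-rest construction of ROC curves and AUC values described above can be sketched in plain Python. The score matrix below is hypothetical and does not reproduce Fig 5; the function names are likewise assumptions. Each class is treated in turn as the positive class, its scores are swept from high to low, and the area under the resulting curve is computed with the trapezoidal rule.

```python
def roc_curve_points(scores, is_positive):
    """Return (false_positive_rate, true_positive_rate) points for one binary problem."""
    n_pos = sum(is_positive)
    n_neg = len(is_positive) - n_pos
    ranked = sorted(zip(scores, is_positive), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Trapezoidal area under a list of ROC points ordered by false positive rate."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def one_vs_rest_auc(class_scores, labels, classes):
    """class_scores[i][k] is the score of sample i for classes[k]."""
    result = {}
    for k, name in enumerate(classes):
        scores = [row[k] for row in class_scores]
        is_positive = [label == name for label in labels]
        result[name] = auc(roc_curve_points(scores, is_positive))
    return result


classes = ["relax", "scary", "disgust", "joy"]
# Hypothetical classifier scores for four samples, one per true class.
labels = ["relax", "scary", "disgust", "joy"]
class_scores = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.4, 0.3, 0.2, 0.1],
]

aucs = one_vs_rest_auc(class_scores, labels, classes)
```

An AUC of 1 means the class is perfectly separable from the rest, while 0.5 corresponds to scores that carry no information about that class.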

PLOS ONE
We have used the CNN+LSTM model to predict the variation of emotion probabilities across all the videos for a randomly selected subject from the test set. Fig 6 depicts the variation of the emotion probabilities over time, together with the mean probabilities.

RF vs ECG performance comparison
Human clinical conditions, either physical or mental, cause subtle variations in heart rate that are also reflected in the ECG signal. Therefore, existing health condition monitoring systems predominantly depend on ECG data for discovering the underlying reasons and categorizing the conditions. In order to make a comparison with the RF results, we utilize the simultaneously recorded ECG signal data to train a DNN architecture similar to that shown in Fig 2. Likewise, the wavelet transformation is applied to the ECG data (Fig 7a). As observed in Fig 4b, the CW transformation alone is not enough to distinguish between emotions. Therefore, we calculated 81 features from the ECG signal, known as inter beat interval (IBI) features, and further applied the minimum redundancy maximum relevance (mRmR) feature selection method [45] to reduce the dimensions, resulting in 30 features. Since the features do not form a time sequence, we have omitted the LSTM cell from the DL architecture. The extracted features and the CW transformations are used to train the CNN model for emotion recognition. To establish a baseline performance, we have trained an SVM model with the extracted 30 features. Table 2 lists the performance metrics for ECG classification. The confusion matrices of the CNN and SVM models are shown in Fig 7b and 7c, respectively. In general, the deep learning classification performances of RF and ECG (Fig 7) are high and very similar (both having 71.67% LOOCV accuracy), indicating that RF signals can describe the underlying emotion of a person as well as an ECG signal, with the added benefits of being wireless and more practical. Furthermore, we tested the performance of the proposed deep learning architecture on the well established DREAMER ECG database [41] for emotion recognition. The model achieved 68.48% subject-independent LOOCV classification accuracy, with precision, recall and F1-score values of 0.678, 0.685 and 0.680, respectively (section 6 in S1 File). Thus, it is evident that the proposed novel DL architecture can be employed across different databases generated under diverse conditions.

Ethical approval
The experimental study was approved by the Queen Mary Ethics of Research Committee of Queen Mary University of London under QMERC2019/25. All research was performed in accordance with the guidelines/regulations approved by the Ethics of Research Committee. Written informed consent was obtained from the participants involved in the study.

Participants
The experiment was performed on 15 participants. All participants were English speaking and aged between 22 and 35 years. The measurement details were briefly explained to the participants before the start of the experiment. They were provided with a comfortable environment so that they could focus solely on watching the videos with minimum distractions.

Stimuli: Emotions evoking videos
To induce emotions in the participants, individual videos were selected that can induce four emotional states (relax, scary, disgust, and joy) (section 7, S8 Fig in S1 File). The duration of each video clip was 3-4 minutes. A survey was prepared and provided to the participants, in which the emotions can be mapped and graded according to the intensity of the emotions felt during the experiment [46]. Participants were asked to record the intensity of their emotions in the survey after watching each video. The self-assessment results indicated that the videos are capable of inducing a particular emotion in the participant during the experiment (section 5, S6 Fig in S1 File). However, it was also observed that some participants experienced multiple emotions while watching a single video. For instance, while watching the video corresponding to the happy emotion, some participants indicated in the survey that they did not find the video content happy enough and that they remained relaxed while watching it. This implies that emotion detection requires a complex procedure to distinguish the emotions of a participant.

Emotion detection experiment
Measurement set-up. Measurements were performed in an anechoic chamber to reduce any interfering noise emanating from the external environment that might alter the emotions of a participant during the experiment (section 8, S11 Fig in S1 File). A pair of Vivaldi type antennas was used to form the radar, operating at 5.8 GHz (section 8, S10 Fig in S1 File). One antenna was used for RF signal transmission towards the body (Green Signal, Fig 1), while the second antenna was used for receiving RF reflections off the body (Red Signal, Fig 1). A pair of coaxial cables was used to connect both antennas to the programmable vector network analyzer (VNA; Rohde & Schwarz, N5230 C). A laptop was used to play the videos and the participants were asked to wear headphones so that they could focus effectively on the audio. The distance between the antennas and the participants was 30 cm, as illustrated in the measurement set-up (S2 Fig in S1 File).
Detection of RF reflections from the participants. The videos were shown one at a time to the participant, who was seated on a chair in front of the display monitor at a distance of approximately 1 meter. The participants were exposed to an RF power level of 0 dBm. After the end of each video, the participant was asked to relax before the start of the next video. While each video was playing, RF reflections from the participant's body were detected through the receiving Vivaldi antenna, which was connected to the VNA. In our experiment, the phase difference of the RF reflections is captured using radar techniques: we calculate the phase difference between the transmitted signal and the RF reflections off the body. The transmitted signal is characterized by $\omega_0$, its frequency (the operating frequency of 5.8 GHz), and $\varphi_0$, its initial phase. The distance $d(t_0)$ between the participant and the Tx antenna is composed of $d$, the static distance between the participant and the Tx antenna, and $f(t_0)$, which corresponds to the movement of the participant's body. In the received signal, $\Delta t = 2d(t_0)/c$ is the round-trip time that the transmitted RF signal takes to reach the participant's body and return, and $\Gamma = |\Gamma_0| e^{j\varphi_0}$ is the reflection coefficient of the participant. The participant's body movement can be regarded as a quasi-periodic signal, and the expression for the received signal is expanded accordingly. The phase difference $\Phi(t_0)$ between the transmitted signal and the received signal contains a constant term $C_0$, and its amplitude is proportional to the VNA frequency $\omega_0$ and the body movements $A_i$. We can infer from the phase difference expression that the variations of the phase difference correspond to the participant's body movement. We analysed the emotions over the last 120 seconds of each video.
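The phase relationship described above can be written out as a short derivation. What follows is a sketch of a standard continuous-wave radar phase model expressed with the symbols named in the text, not a reproduction of the equations of the original article: the sinusoidal form assumed for $f(t_0)$ and the symbols $\omega_i$, $\theta_i$ and $\varphi_\Gamma$ (the phase of the reflection coefficient, kept distinct from the initial phase here) are assumptions introduced for this illustration.

```latex
% Sketch only: standard continuous-wave radar phase model.
% Assumed here (not from the source): the sinusoidal expansion of f(t_0)
% and the symbols \omega_i, \theta_i and \varphi_\Gamma.
\begin{align}
T(t_0) &= \cos(\omega_0 t_0 + \varphi_0) \\
d(t_0) &= d + f(t_0), \qquad
          f(t_0) \approx \sum_i A_i \sin(\omega_i t_0 + \theta_i) \\
R(t_0) &= |\Gamma_0| \cos\bigl(\omega_0 (t_0 - \Delta t) + \varphi_0 + \varphi_\Gamma\bigr),
          \qquad \Delta t = \frac{2 d(t_0)}{c} \\
\Phi(t_0) &= \omega_0 \Delta t - \varphi_\Gamma
           = \frac{2 \omega_0}{c} \sum_i A_i \sin(\omega_i t_0 + \theta_i) + C_0,
           \qquad C_0 = \frac{2 \omega_0 d}{c} - \varphi_\Gamma
\end{align}
```

Under these assumptions the time-varying part of $\Phi(t_0)$ scales with both $\omega_0$ and $A_i$, in line with the statement in the text. As a rough worked example, at 5.8 GHz the wavelength is about 5.17 cm, so a hypothetical 1 mm displacement of the chest changes the phase by $4\pi \times 1\,\text{mm} / 51.7\,\text{mm} \approx 0.24$ rad, or roughly 14 degrees.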
Restricting the analysis to the last 120 seconds of each video ensures that the intensity of emotions is higher than at the start of the video.
Data acquisition using ECG. ECG signals have been extensively explored in the literature for emotion detection, particularly in the field of affective computing. The emotional states of a person are closely associated with the psychological activities and cognition of humans. In our experiment, we have used an ECG monitor (PC-80B) to record the heartbeat variations of a participant during the experiment. The ECG monitor is convenient to use and has three electrodes that can be attached to the participant's chest.
Signal processing analysis. ECG signals. We have employed signal processing techniques on the ECG signals to extract information about the heartbeat variation owing to the emotions elicited in the participants. Generally, ECG signals occupy a bandwidth in the range of 0.5-45 Hz. For this reason, to remove the baseline drift in the ECG signal, re-sampling is applied at a frequency of 154 Hz and a bandpass Butterworth filter is used to perform filtering from 0.5 to 45 Hz. In the next step, we have used the Augsburg Biosignal Toolbox (AuBT) for Matlab to extract statistical features from the ECG signals for the different emotional states (section 9 in S1 File). The extracted features are essential for the further classification of emotions. The classification results indicate that the audio-visual stimuli successfully evoke discrete emotional states, which can be recognized in terms of psychological activities.
RF signals. After pre-processing the raw data (section 3, S5 Fig in S1 File), the next step is to extract features from, and apply a transformation to, the processed data. The parameters extracted for ML have been discussed in the previous section; here, the transformation based on the continuous wavelet transform (CWT) is introduced.
For further classification, we have used the continuous wavelet transform (CWT) to convert the 1-D RF signals into a 2-D scalogram. In mathematics, the CWT is a formal (i.e., non-numerical) tool that provides a complete representation of a signal by allowing the scale parameter of the wavelets to vary continuously. Based on the CWT, the 1-D RF signals can be transformed into a 2-D scalogram that is represented in an image format. A scalogram is beneficial for an in-depth understanding of the dynamic behaviour of body movements, and the body movements of individual participants while watching the videos can also be distinguished. The normalized time series and its Fourier transform sequence are extracted as the 1-D features. The 2-D scalogram, which is stored in an image format, can be considered as the 2-D features. In the classification section, the combination of the 1-D features and the 2-D features is used to classify the different emotional states of the participants.
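To make the transformation from a 1-D signal to a 2-D scalogram concrete, the sketch below evaluates a CWT by direct summation in pure Python. It is an illustration under stated assumptions: the complex Morlet mother wavelet with centre frequency parameter `omega0 = 6`, the 16 Hz sampling rate, the analysis frequencies and the synthetic 2 Hz test tone are all hypothetical, since the text does not specify the wavelet or the scales used in the study.

```python
import cmath
import math


def morlet(t, omega0=6.0):
    """Complex Morlet mother wavelet (the small admissibility correction is neglected)."""
    return math.pi ** -0.25 * cmath.exp(1j * omega0 * t) * math.exp(-t * t / 2)


def scalogram(signal, fs, frequencies, omega0=6.0):
    """Return |CWT| as a list of rows, one row per analysis frequency (in Hz)."""
    n = len(signal)
    rows = []
    for freq in frequencies:
        # Scale (in seconds) whose Morlet wavelet oscillates at `freq`.
        scale = omega0 / (2 * math.pi * freq)
        # Sample the conjugated, scaled wavelet once for every possible lag.
        kernel = {lag: morlet(lag / fs / scale, omega0).conjugate()
                  for lag in range(-(n - 1), n)}
        row = []
        for centre in range(n):
            # Discrete approximation of (1/sqrt(s)) * integral x(t) psi*((t - b)/s) dt.
            acc = sum(signal[k] * kernel[k - centre] for k in range(n))
            row.append(abs(acc) / (fs * math.sqrt(scale)))
        rows.append(row)
    return rows


# Hypothetical test signal: 8 s of a 2 Hz tone sampled at 16 Hz.
fs = 16.0
n = 128
signal = [math.sin(2 * math.pi * 2.0 * i / fs) for i in range(n)]
frequencies = [1.0, 2.0, 4.0]

rows = scalogram(signal, fs, frequencies)
# Magnitudes at the centre of the window, one value per analysis frequency.
centre_column = [row[n // 2] for row in rows]
```

Each row of `rows` is one horizontal line of the scalogram image (time along the x-axis, frequency along the y-axis, magnitude as intensity); for the test tone the row at 2 Hz dominates. Direct summation costs O(N²) per frequency, so a practical implementation would normally use an FFT-based routine instead.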

Conclusions
Emotion detection has emerged as a paramount area of research in neuroscientific studies as well as in many other strands of well-being, especially for mentally ill elderly people who are susceptible to physiological fatigue and undergo interactive therapy for treatment. In this study, we have proposed a novel deep learning architecture that fuses wirelessly received raw time-domain data with its frequency domain counterpart and achieves state-of-the-art emotion detection performance. We have experimentally demonstrated that four different human emotions can be recognized in a subject-independent manner with over 71% accuracy, even in a data-limited regime. Moreover, our results indicate that deep learning offers superior performance in the present classification task in comparison to five different machine learning algorithms. We further tested the performance of the proposed DL architecture on simultaneously recorded ECG data. It was established that wireless RF measurements could be a better alternative to other invasive methods such as ECG and EEG for human emotion detection. We further evaluated the generalizability of our DL model across other databases by validating it on a well established ECG database. We believe the framework proposed in the present study is a low-cost, hassle-free solution for carrying out emotion related research, and that it offers high detection accuracy in comparison with alternative approaches.