Using Activity-Related Behavioural Features towards More Effective Automatic Stress Detection

This paper introduces activity-related behavioural features that can be automatically extracted from a computer system, with the aim of increasing the effectiveness of automatic stress detection. The proposed features are based on processing of appropriate video and accelerometer recordings taken from the monitored subjects. For the purposes of the present study, an experiment was conducted that utilized a stress-induction protocol based on the Stroop colour word test. Video, accelerometer and biosignal (Electrocardiogram and Galvanic Skin Response) recordings were collected from nineteen participants. Then, an explorative study was conducted by following a methodology mainly based on spatiotemporal descriptors (Motion History Images) that are extracted from video sequences. A large set of activity-related behavioural features, potentially useful for automatic stress detection, was proposed and examined. Experimental evaluation showed that several of these behavioural features correlate significantly with self-reported stress. Moreover, it was found that the use of the proposed features can significantly enhance the performance of typical automatic stress detection systems, commonly based on biosignal processing.


Introduction
Growing interest has surrounded the roles of technology in emotions, and in particular, in psychological stress. As such, automatic stress detection has become a challenging issue, for both research and clinical practice. At the moment, physiological measurements and self-report questionnaires are the most common methods used to detect stress [1,2]. Although questionnaires are affected by personal convictions [3] and biosensors are often too obtrusive [4], research over the last decades has demonstrated an increasing quality of these methods and related technologies. In this line, effort has been made to improve the performance, and also to reduce the obtrusiveness, of the adopted systems. However, it still remains partially unclear how stress can be effectively detected with the help of systems that retain an increased degree of unobtrusiveness.
According to Cohen, Janicki-Deverts, and Miller [5], psychological stress occurs when an individual perceives that the environmental demands exceed his or her adaptive ability to meet them. This gap gives rise to the labeling of oneself as stressed and elicits a concomitant negative emotional response. In physiological measures, such a response can lead to increased stress hormone levels, blood pressure [6], heart rate, pupil dilation, and skin conductivity [7,8]. In activity-related behavior such an emotional response can lead to a wide range of ''behavioural symptoms''; for example, hands and foot trembling [9], body hyperactivity [2,10], compulsive movement [11], and faster eye gaze [12].
Over the last few decades, many factors, such as the lower technology price and its higher availability, portability, and usability, have allowed a closer connection and interaction between automatic detection systems and affective models. A vivid example of such joining is the promising field of Affective Computing [13,14], which has among others, the aim to identify emotions during human-computer interaction, by examining biological signals [15,16], facial expressions [17], speech [18], hands [19], and further parameters.
Several studies have shown interesting results that support the feasibility of detecting affective states through psychophysiological data acquisition and analysis [20]. For example, the affective computing group at MIT, led by Rosalind W. Picard (the pioneer in the field who also coined the term), conducted several research studies that highlighted the use of psychophysiological measures towards deducing and classifying emotional states. In particular, researchers have highlighted the usefulness of wearable biosensors that detect changes in physiological and subsequently, affective states [7,14,15].
However, considering the practical applicability of such methodologies, a major problem with wearable biosensors is related to obtrusiveness. In fact, the regularly utilized biosensors are not transparent to subjects, something that may even affect the study goals; e.g. in studying stress, there can be cases where subjects become stressed as a consequence of the biosensors themselves [4]. Until now, the common tactic has been to decrease the obtrusiveness of such devices, sometimes also making them almost invisible, e.g. by using an electrocardiogram under a t-shirt that transmits data to a smart phone [4] or a skin conductance recorder worn on the wrist [21].
In an effort to augment automatic affect detection with less obtrusive monitoring methods, the applicability of automatically extracted activity-related parameters has been recently examined [22]. Activity-related behaviours suggest a clear aspect of continuous regulatory actions, observable in movement qualities, contours, expressions, and also perceived in vocal tonality [23]. In this respect, the body attitude is related to the constant shape of the body, its general pose and the distinct location of its parts [24,25]. This concept recalls the so called ''background emotion'' [23] or ''state of mind'' that considers how one perceives oneself and how this affects others.
In a more general view, gesture and posture are considered as parts of a wider semiotic system that underlies human communication. Along this line, it has been reported that attitude, intention, and, in general, meaning, are expressed only in part by verbal content, and as much or even more so through nonverbal channels [17,23,26,27,28]. In this perspective, nonverbal behaviours could be interpreted to indicate the ''unsaid'' elements representing our internal states.
Human gestural parameters have recently been extracted from monitoring video sequences, and it was shown that they have interesting potential towards automatic affect recognition [22]. The latter work drew inspiration from [29], where the degree of dependence between body movements and postures and certain emotions, like joy, happiness, anger, etc., had been investigated. However, it has to be noted that, to our knowledge, no study has explored until now the potential of activity-related behavioural features with respect to the practical problem of automatic stress detection. These features could be collected at a distance through the use of appropriate sensors or cameras, being this way totally unobtrusive. This consideration needs to be reviewed in further studies, in order to understand the implications of such features for research and for health care practice. Following this line, a study aiming to automatically detect the stress level of participants has been conducted herein. Psychological self-reports and common physiological measures were used, and the latter were compared to a less obtrusive technology, mainly based on a low cost, single view depth imaging camera (Microsoft Kinect [30]). In the context of our study, the following research questions are put forth:
RQ1: Is there a relationship between behavioural features that can be automatically extracted from a computer system, and the self-reported stress levels of subjects?
RQ2: Is it possible to enhance stress detection effectiveness by adding automatically detected behavioural information to standard systems?
The present work tries to answer both these research questions by comparing standard psychological questionnaires, common psychophysiological measures, and activity-related behavioural indexes. An experiment was deployed with the aim of collecting appropriate video and accelerometer data from participants who followed a stress induction protocol. Galvanic Skin Response (GSR) and Electrocardiogram (ECG) biosignals were also recorded during the experiment. An explorative study was then conducted, to identify whether behavioural features extracted from video and accelerometer recordings can improve the effectiveness of automatic stress detection. For this purpose, a large set of behavioural features was extracted from the collected data, and their relation to self-reported stress was examined. Mixed linear hierarchical regressions were computed, to take into consideration the nested structure of the data. Moreover, utilizing a linear classifier, it was found that the proposed behavioural features were able to significantly enhance the effectiveness of automatic stress detection, compared to the results obtained when only common physiological features, extracted from the monitored biosignals, were used.

Materials and Methods
In order to collect appropriate data, a stress-induction experiment was conducted, as explained in the following.

Participants
Twenty-one right-handed subjects (4 women, 17 men, Mean = 30.4 years, SD = 3.7) participated in the experiment, which was conducted on the premises of the Informatics and Telematics Institute, Centre for Research and Technology Hellas (CERTH-ITI) in Thessaloniki, Greece. All subjects gave written informed consent to the experimental procedure, which was approved by the local ethics committee of the Centre for Research and Technology Hellas.

Hardware Setup
Video data was collected through a Microsoft Kinect [30] camera placed opposite the participant, at a distance of around 2.5 meters. Accelerometer data was collected from two tri-axial accelerometer sensors developed by Phidget Corp. [31] that were placed at the participant's knees, with the aim of detecting foot trembling. Moreover, physiological (GSR and ECG) data was collected using a Procomp5 [32] Infinity device. For GSR signal acquisition, one two-electrode GSR sensor was placed at the index and middle fingers of the subject's non-dominant hand. Also, one three-electrode ECG sensor was placed at the subject's chest, covering Einthoven's triangle. The overall sensor setup of the study and a sample screenshot taken from the video recordings are shown in Figure 1.a and Figure 1.b, respectively.

Stimuli and Procedure
Stimuli. The stress-induction stimulus of the experiment was based on a custom Stroop colour word test [33] application (Figure 1). The Stroop test has been commonly utilized in the past to examine attention and cognitive flexibility [34], emotion perception [35], as well as the effect of stress manipulation on cognitive performance [36]. However, because it is a mental task whose difficulty may substantially increase (through manipulation of task pacing etc.), it has also been considered in recent years as capable of forming the basis for stress-induction stimuli [20,37,38]. Following this line, the Stroop test was used in our work to provide a mental task of increased difficulty, within a stressor framework that was also based on time pressure. Thus, the stressor of our experiment was mainly based on two parameters, i.e. time pressure and increased task difficulty, both commonly known to induce stress [39,40]. As explained in the following, these two factors were manipulated throughout the different conditions of our experiment, during the participants' interaction with our custom Stroop-based application.
In our specific stroop test, five colours were utilized, namely red, green, blue, yellow and pink. Two versions of the test were implemented: In Version A, the subject was presented with five buttons labelled after the specific colours. For each question, the subject had to press the correct button. In Version B, speech recognition was utilized; the subject had to speak out the name of the correct colour.
Experimental Protocol. During the experiment, the subject was initially briefed and asked to sign the consent form. Then, the sensors were installed. All experiments lasted for about one hour in total, including the sensor setup phase. The stress induction aim of the experiment was kept hidden from participants throughout its duration. This way, it was ensured that stress induction would occur naturally through the stressor and that self reports would not be biased by subjects' prior knowledge of the fact that they ''should get stressed''. Moreover, subjects were unaware of the processing methods that would subsequently be applied to their video and accelerometer recordings. It was thus also ensured that the collected data would not be biased by deliberate manipulation from subjects aware that their body language was being analysed.
The experimental procedure consisted of the following eight conditions:
Rest: The subject was asked to relax for two minutes with eyes closed. GSR and ECG baseline data were recorded during this period.
Condition 0: The subject watched a relaxing video, compiled from pictures of Greek islands. GSR, ECG, video and accelerometer data were recorded during this period, which lasted for one and a half minutes. Data from the same sensors/devices was also recorded during the rest of the conditions (1-6). At the end of condition 0, the subject answered the post-condition questionnaire, which is explained below.
Condition 1: The subject played an easy (congruent) version of the Stroop colour word test (Version A), where the font colour was always the same as the colour name displayed. The time limit for each question was five seconds. The condition ended when sixty questions had been answered, and then the subject filled in the post-condition questionnaire.
Condition 2: The same specifics as in condition 1 were followed, only this time, Version B of the stroop test was used, and thus the subject had to speak out the font colour instead of clicking the colour buttons.
Condition 3: The subject played the typical Stroop colour word test (Version A), where the font colour was always different from the colour name displayed (Figure 2). The time limit for each question was three seconds. After two minutes of playing, the game automatically paused for one minute and then automatically resumed, so that the subject played for another two minutes. At the end, the subject completed the post-condition questionnaire.
Condition 4: The same specifics as in condition 3 were followed, only this time, Version B of the Stroop test was used.
Condition 5: The same specifics as in condition 3 were followed; however, this time the subject played a Stroop colour word test of increased difficulty (Version A): the font colour was always different from the colour name displayed, and after each of the subject's answers, the order of the buttons changed in a random manner. The time limit for each question was two seconds.
Condition 6: The same specifics as in condition 5 were followed; however, this time the subject played a Stroop test of increased difficulty (Version B), where three colour words were presented in each question instead of the typical one. Two out of the three words always had the same colour. The subject had to identify the dominant font colour and speak it out loud.
Summarizing, the experiment consisted of eight different conditions, one for recording of baseline biosignals data (as typically done in biosignals-based affect detection studies, e.g. [37,41,42]) and seven (condition 0-condition 6) for collecting both physiological and behavioural measurements during periods with presence and absence of stress.
From each participant, we needed as many recordings as possible, taken during both non-stressed and stressed states. For this purpose, the initial Rest condition was followed by condition 0, from which physiological and behavioural measurements were to be collected during a period with potentially complete absence of stress. Then, the interaction with the Stroop-based application involved three different levels of difficulty: a very easy level (conditions 1 and 2), a moderately difficult level (conditions 3 and 4) and a very difficult one (conditions 5 and 6). Whereas the first two levels had a strong resemblance to (congruent-incongruent) Stroop tests that have been used in the past [37], our third level involved a Stroop test variation of very high task difficulty and time pressure. As such, conditions 5 and 6 were expected to prove particularly stressful.
Each of the above difficulty levels consisted of two conditions, one of which employed the version of our Stroop-based application with button-press (Version A, used in the odd-numbered conditions) and the other, the version with speech recognition (Version B, used in the even-numbered conditions). This way, within each difficulty level, the participant faced two different versions of the application. As a result, we were able to obtain double the measurements from each difficulty level, while simultaneously avoiding boredom, which might have appeared if the participant had played exactly the same version of the test twice in each level.
Questionnaires. Stress self-assessment was conducted at the end of each condition (post-condition questionnaire), using two different question types. The first was a Likert-scaled (1-5) question directly asking subjects whether they were feeling stressed during the condition, following the rationale of the free-scale question used in [41]. The second was a subset of the Stress-Appraisal-Measure (SAM) questionnaire [43], consisting of the four questions related to stress (questions 2, 16, 24 and 26). Based on the two question types, two different variables were formed, to be thereafter used as ground truth (subjects' self-reported stress levels) in our analysis: 1) the answers that the subjects had given to the Likert-scaled question directly assessing stress (Stress_1-5) and 2) the average value of the subjects' answers to the four SAM questions (Stress_SAM).
Acquired Dataset. The acquired dataset consisted of 126 (21×6) trials that were recorded while subjects played the different versions of the Stroop colour word test, 21 trials recorded while subjects watched the relaxing video (condition 0), and also 21 rest sessions. Apart from the 21 rest sessions, 147 trials were recorded in total. From all these relaxation and Stroop-playing conditions, various features were extracted from each monitoring modality (GSR, ECG, video, accelerometer) and analyzed.

Behavioural Features Extraction Procedure
In the current section, the extraction procedure of a plethora of activity-related features is described. Initially, the features that are solely based on visual information are presented, while the accelerometer-based behavioural features follow right after.

Video-based Feature Extraction
The vision-based features examined in the current work are mainly based on spatiotemporal information about the subject's movements. The proposed method for stress-related analysis of the user's movement relies on Motion History Images (MHIs) [44], vision-based spatiotemporal descriptors that can be extracted from video sequences depicting the monitored subject (Figure 3). An MHI is a spatiotemporal template, where the intensity value ($MHI_t$) at each point is a function of the motion properties at the corresponding spatial location in an image sequence, according to the following equation:

$$MHI_t(x, y, n) = \begin{cases} t, & \text{if } D(x, y, n) = 1 \\ \max\bigl(0,\; MHI_t(x, y, n-1) - 1\bigr), & \text{otherwise} \end{cases}$$

where t is the number of frames contributing to the MHI generation, n is the index of the current frame, and D(x,y,n) equals 1 if there is a difference in the intensity of a pixel between two successive frames. The older a change is, the darker its depiction on the MHI will be, while changes older than t frames fade out completely.

Before presenting each extracted feature, a short description of the MHI properties should be given. In this respect, it can be noticed in the equation above that the value of t determines how much history information is taken into account. As such, large values of t form an MHI that extends deep into the past, while small values refer only to the very recent past. Moreover, the bigger the differences between two successive frames, the larger the non-black area ($A_{non\text{-}black}$, abbreviated $A_{nb}$) on the MHI. Similarly, identical successive frames would produce a completely black MHI. This would be valid for any arbitrary number t of utilized frames. Based on these observations, significant motion-related information can be extracted from an MHI, by properly adjusting parameter t.
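The MHI update rule described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in the study: the function names `frame_difference` and `update_mhi`, and the optional noise threshold `diff_thresh`, are assumptions introduced here.

```python
import numpy as np

def frame_difference(prev_frame, frame, diff_thresh=0):
    """Binary mask D: 1 where the pixel intensity changed between two
    successive grey-scale frames. diff_thresh is a hypothetical noise
    threshold; 0 means that any intensity change counts as motion."""
    diff = np.abs(frame.astype(np.int32) - prev_frame.astype(np.int32))
    return (diff > diff_thresh).astype(np.uint8)

def update_mhi(mhi, motion_mask, t):
    """One MHI step: pixels with motion are set to t, all other pixels
    decay by one and are floored at zero (black)."""
    decayed = np.maximum(np.asarray(mhi, dtype=np.int32) - 1, 0)
    return np.where(np.asarray(motion_mask) == 1, t, decayed)

# Usage: a single moving pixel fades out after t motion-free frames.
mhi = np.zeros((2, 2), dtype=np.int32)
mhi = update_mhi(mhi, np.array([[1, 0], [0, 0]]), t=10)
for _ in range(10):
    mhi = update_mhi(mhi, np.zeros((2, 2), dtype=np.uint8), t=10)
```

With t = 10 the template covers the long-term window used in this work, whereas t = 2 restricts it to two successive frames.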
Given the specifications of our system (Intel i5-2500k processor, 4 GB RAM), the experimentally measured average frame rate was 10 frames per second (fps) for online processing. Moreover, it was noticed that small human movements typically do not last longer than 1 sec. In this respect, the extracted (Long term -) MHIs are generated within the time period of 1 sec (approximately 10 frames) and updated accordingly. However, since a minimum duration for such movements cannot be trivially defined, and given the fact that fast and sudden movements may form a strong stress indicator, a second (Short term -) MHI is also produced in parallel, by processing only two successive frames (t = 2).
In order to preserve a common reference for all subjects, the head's position is constantly tracked and updated by the system. The robust detection of the head's position is a vital prerequisite for the extraction of a series of stress related features, as it will be shown in the following. As such, in the current work the head detection algorithm was implemented as the combination of a face detection algorithm [45] and a tracking mean-shift based algorithm [46]. This enhancement is used, as the Haar-based detector utilised in Viola & Jones algorithm fails to detect a face, when the latter is significantly diverged from the frontal (camera) view.
Specifically, within our method, the face centre (and thus the head centre) is initially detected at each frame via the Viola & Jones algorithm. If this algorithm fails, the last successfully detected face rectangle, with pixels $\{P^*_i\}_{i=1,\dots,n}$ and centre $P_0$, is passed to the mean-shift algorithm, and handled as follows. First, a function $b: \mathbb{R}^2 \rightarrow \{1,\dots,m\}$ is defined, which associates the pixel at location $P^*_i$ with the index $b(P^*_i)$ of the histogram bin corresponding to the colour of that pixel. The probability of a colour u in the target model is derived by employing a convex and monotonically decreasing kernel profile k, which assigns a smaller weight to the locations that are further away from the centre of the target. Therefore, it can be written as

$$\hat{q}_u = C \sum_{i=1}^{n} k\left(\lVert P^*_i - P_0 \rVert^2\right)\, \delta\left[b(P^*_i) - u\right],$$

whereby C is computed by imposing the condition $\sum_{u=1}^{m} \hat{q}_u = 1$, i.e. the probabilities for u = 1,…,m sum to one. Further, when the target model is passed on to the next frame, the probability $\hat{p}_u$ of colour u in the target candidate centred at location y (initialised at $P_0$) with radius h is calculated as:

$$\hat{p}_u(y) = C_h \sum_{i=1}^{n_h} k\left(\left\lVert \frac{y - P_i}{h} \right\rVert^2\right) \delta\left[b(P_i) - u\right],$$

where $P_i$, i = 1,…,$n_h$, are the pixels of the candidate region and $C_h$ is the corresponding normalization constant. The most probable location $P'$ of the target pixel area in the current frame is obtained by minimizing the distance

$$d(y) = \sqrt{1 - \rho(y)}$$

over the candidate locations y, for $\hat{p}(y) = (\hat{p}_1(y), \hat{p}_2(y), \dots, \hat{p}_m(y))$ and $\hat{q} = (\hat{q}_1, \hat{q}_2, \dots, \hat{q}_m)$, or by simply maximizing the Bhattacharyya coefficient

$$\rho(y) = \sum_{u=1}^{m} \sqrt{\hat{p}_u(y)\, \hat{q}_u}.$$

Based on the outcome of the MHI extraction and head detection algorithms described above, a series of behavioural features (V1, V2, V3,…) were defined and extracted from the video sequences that were recorded during the experiment. For better notation, these behavioural features can be regarded as forming the video-based feature vector of our study: $F_v$ = {V1, V2,…}. The features that were extracted from the video modality belong to the following categories:

Global Activity Level. First of all, the Average and Standard Deviation (SD) of global upper-body Activity Level within a time period were examined as potentially useful behavioural features towards automatic stress detection. The specific features followed the rationale of [22], where the overall energy spent by a subject, approximated by the total amount of displacement in her/his hands and head, was examined as an expressive feature useful for automatic affect recognition. Also, in [29], it was found that dynamics/energy/power of movements can show significant differences among different emotional states.
Hence, the amount of non-black areas in the MHIs of a given time period was expected to provide a powerful clue of stress indication, given that stressed persons would probably move nervously (and arbitrarily) more frequently than calm ones. In this respect, an MHI regarding the last ten captured frames was constantly updated, and given this, the proportion k of the non-black area $A_{nb}$ over the whole area A of an MHI was calculated as:

$$k = \frac{A_{nb}}{A} = \frac{A_{nb}}{W \cdot H},$$

where W and H stand for the width and height of the MHI, respectively. Based on the mean (Avg) and standard deviation (SD) of k within a condition's MHIs, various stress-related features were extracted: V1: Avg(k); V2: Avg(k), for k > 0; V3: SD(k), for k > 0. Moreover, each time the parameter k of a certain MHI exceeded the experimentally set threshold l (k > l), a signal ''Increased Movement Detected'' was triggered. This way, a new subset of the original MHIs was also preserved, that consisted only of executed movements above an energy threshold, so as to discard small-scale movements of the hands or head. Thus, the following features were extracted similarly to the above: V4: Avg(k) for MHIs with k > l, and V5: SD(k) for MHIs with k > l. Moreover, features similar to V4 and V5 were extracted by considering only the MHIs where the activity level exceeded a threshold smaller than the one used for ''Increased Movement'' detection (k > $l_s$, 0 < $l_s$ ≪ l). With this threshold, only micro-movements of extremely small scale (e.g. movement of the mouse of a PC) were discarded: V6: Avg(k) for k > $l_s$, and V7: SD(k) for k > $l_s$.
Finally, the frequency of the detected ''Increased Movement'' events within the given time period was extracted as a further potentially stress-related feature, V8: the proportion of seconds with ''Increased Movement'' detected to the total number of seconds of the condition. The numerator used in the calculation of feature V8 was taken as the number of non-overlapping seconds within the time period considered for which at least one MHI with k > l existed. Feature V8 aimed to provide a further indicator of the subject's activity level, by taking into account activities that produced MHIs denoting increased activation, and expressing their frequency within the examined time period.
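As a hedged illustration of how features V1-V8 could be computed from a condition's MHIs, the sketch below takes the per-MHI proportions k as input. The function names, the arguments `thr` and `thr_small` (standing for the two experimentally set thresholds), the per-MHI timestamps, and the use of the population standard deviation are all assumptions of this sketch, not details confirmed by the text.

```python
import numpy as np

def non_black_proportion(mhi):
    """k: share of non-black pixels over the whole W x H area of an MHI."""
    mhi = np.asarray(mhi)
    return np.count_nonzero(mhi) / mhi.size

def avg_sd(values):
    """Mean and population SD; NaN when no MHI satisfies the condition."""
    values = np.asarray(values, dtype=float)
    if values.size == 0:
        return float("nan"), float("nan")
    return float(values.mean()), float(values.std())

def activity_level_features(k, thr, thr_small):
    """V1-V7 from the k values of all MHIs of one condition."""
    k = np.asarray(k, dtype=float)
    v2, v3 = avg_sd(k[k > 0])
    v4, v5 = avg_sd(k[k > thr])
    v6, v7 = avg_sd(k[k > thr_small])
    return {"V1": float(k.mean()), "V2": v2, "V3": v3,
            "V4": v4, "V5": v5, "V6": v6, "V7": v7}

def increased_movement_ratio(k, timestamps_s, thr, duration_s):
    """V8: share of non-overlapping seconds containing at least one MHI
    with k above the 'Increased Movement' threshold."""
    k = np.asarray(k, dtype=float)
    seconds = np.floor(np.asarray(timestamps_s, dtype=float)).astype(int)
    return np.unique(seconds[k > thr]).size / duration_s

# Usage with made-up values (not data from the study).
k_values = [0.0, 0.1, 0.3, 0.0, 0.5, 0.02]
features = activity_level_features(k_values, thr=0.2, thr_small=0.05)
v8 = increased_movement_ratio(k_values, [0.0, 0.5, 1.2, 1.7, 3.1, 3.6],
                              thr=0.2, duration_s=4)
```

The same helper can be reused for the short-term MHIs by passing their proportions instead of the long-term ones.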
Sharp Activities Energy. A sharp activity is defined within our proposed system as activity occurring between two consecutive recorded frames. Taking into account the definition of the MHIs, by restricting the analysis window (parameter t) to the value of 2, MHIs can be extracted on the basis of only two consecutive frames (Short-term MHIs). Thus, the main difference of the features explained in the remainder of this section, compared to features V1, V3-V5, is the selection of parameter t, which defines the inspection time for the generation of a single MHI; contrary to the above, where t had been set to 10 frames, a value of t = 2 is used for the current features.
Practically, this means that only the movement captured within two successive frames is taken into account. In this respect, rather rapid movements are expected to be detected, and from the MHIs, features expressing several qualitative characteristics (e.g. energy) of these rapid movements can be extracted. The larger the area $A_{nb}$, the faster the performed movement can be considered to be. Thus, the correlation between rapid, nervous movements and stress is studied through the average of the proportion $k_s$ of the non-black area ($A_{nb}$) over the whole area A of the Short-term MHIs: V9: Avg($k_s$). Also, by considering only the Short-term MHIs (t = 2) corresponding to time periods where movement was detected (k > 0) in the Long-term (t = 10) MHIs (as defined in feature V2), features V10: Avg($k_s$), for k > 0, and V11: SD($k_s$), for k > 0, were extracted. Finally, taking into account only the Short-term MHIs (t = 2) that correspond to the Long-term ones (t = 10) for which ''Increased Movement'' was detected (k > l), features V12: Avg($k_s$), for k > l, and V13: SD($k_s$), for k > l, were extracted.
Activity Symmetry. The relevance of gestural symmetry as a behavioural and affective feature has been recently studied [22]. Although in [22] no significant differentiation was found between several symmetry-related features and the quarters of the valence-arousal space, in [47] it was shown that arm-position asymmetry was a relevant behavioural feature to identify a ''relaxed'' attitude and relatively high social status of a person within a group. Following this line, activity symmetry-related features were extracted in our work from the MHIs of the recorded video sequences. In particular, the symmetry of the human gesture was defined as the divergence of the vector $s_v$, drawn between the user's head and the MHI's centre of gravity, from the upright position (Figure 4). Specifically, given that the head's location is detected as described above, its movements can be ''subtracted'' from the MHI. Then, by estimating the centre of gravity (CoG) on the remaining MHI, the symmetry vector $s_v$ is obtained as the vector between the head and the CoG. From the symmetry vectors of the MHIs taken within a time period of interest (i.e., a given condition of the experimental session), several features were extracted, by taking into account either all MHIs of the time period, or only those MHIs where k was larger than l or $l_s$, as shown in Table 1.
From Table 1, it is clear that the extracted symmetry-related features mainly encode the average and standard deviation of the Euclidean, horizontal, and vertical size of $s_v$, as well as its divergence from the upright position.
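A minimal sketch of the symmetry vector and its per-MHI quantities is given below. It rests on several assumptions that the text does not settle: the CoG is taken as the intensity-weighted centroid of the MHI, image coordinates are used (y grows downwards), and the ''upright position'' is taken as the vertical axis through the head, so that the divergence angle is zero when the CoG lies directly below the head. The function names are illustrative.

```python
import math
import numpy as np

def centre_of_gravity(mhi):
    """Intensity-weighted centroid (x, y) of an MHI, or None when the MHI
    is completely black (assumed definition of the CoG)."""
    mhi = np.asarray(mhi, dtype=float)
    total = mhi.sum()
    if total == 0:
        return None
    ys, xs = np.indices(mhi.shape)
    return float((xs * mhi).sum() / total), float((ys * mhi).sum() / total)

def symmetry_vector(head_xy, cog_xy):
    """Horizontal size, vertical size, Euclidean size and divergence from
    the vertical axis (in degrees) of the vector from head to CoG."""
    dx = cog_xy[0] - head_xy[0]
    dy = cog_xy[1] - head_xy[1]
    length = math.hypot(dx, dy)
    angle_deg = math.degrees(math.atan2(dx, dy))
    return dx, dy, length, angle_deg

# Usage: a CoG 3 pixels to the right of and 4 pixels below the head.
dx, dy, length, angle = symmetry_vector((0.0, 0.0), (3.0, 4.0))
```

Averages and standard deviations of these four quantities over the MHIs of a condition would then give features of the kind listed in Table 1.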

Position and movement of subject's head and MHI barycenters. According to the relevant literature [22,29,48], head position (sometimes indicative of pose) and movement can be considered as important features for distinguishing between various emotional expressions. Along this line, a set of features was extracted, expressing the position and movement of the subject's head during each condition. First of all, the Average (V38) and Standard Deviation (V39) of the head's distance from the image centre (IC) were calculated from:

$$d_{IC} = \sqrt{\bigl(P(x) - IC(x)\bigr)^2 + \bigl(P(y) - IC(y)\bigr)^2},$$

where P denotes the detected position of the head. Additionally, the features shown in Table 2 were extracted by taking into account the initial position (IP) of the head within the same condition, the initial position of the head within the specific subject's condition 0 (IP_0), or the average position of the head within the specific subject's condition 0 (AP_0).
Also, the Average and SD of the head's velocity (V58, V61 respectively), acceleration (V59, V62) and jerk (V60, V63) were calculated. Within our explorative study, these parameters were also examined in respect of the MHI barycenters (Centre of Gravity -CoG), since in [22], gestural smoothness/jerkiness were examined as behavioural parameters related to emotions. As a result, further features expressing qualitative aspects of the MHI barycenters' movement were extracted, as shown in Table 3.
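The velocity, acceleration and jerk statistics can be illustrated with finite differences over a tracked trajectory. This is a sketch under stated assumptions: the text does not say how the derivatives were computed, so first-order differences at the 10 fps frame rate and the magnitudes of the resulting 2-D vectors are assumed here, and `motion_statistics` is a hypothetical helper name.

```python
import numpy as np

def motion_statistics(positions, fps=10.0):
    """Average and SD of the magnitudes of velocity, acceleration and jerk
    for an (N, 2) trajectory of (x, y) positions, with N >= 4."""
    p = np.asarray(positions, dtype=float)
    dt = 1.0 / fps
    velocity = np.diff(p, axis=0) / dt             # first derivative
    acceleration = np.diff(velocity, axis=0) / dt  # second derivative
    jerk = np.diff(acceleration, axis=0) / dt      # third derivative
    stats = {}
    for name, series in (("velocity", velocity),
                         ("acceleration", acceleration),
                         ("jerk", jerk)):
        magnitude = np.linalg.norm(series, axis=1)
        stats[name] = (float(magnitude.mean()), float(magnitude.std()))
    return stats

# Usage: a head (or MHI barycenter) moving along x with x = n^2.
stats = motion_statistics([[0, 0], [1, 0], [4, 0], [9, 0], [16, 0]], fps=1.0)
```

The same function applies unchanged to the sequence of head centres or to the sequence of MHI barycenters.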
Frequency of specific gesture occurrence. From the MHIs, specific gestures made by the subject can also be detected. For this purpose, each recorded MHI was initially transformed according to the Radial Integration Transform (RIT) and the Circular Integration Transform (CIT), which were used due to their aptitude to represent meaningful shape characteristics.
The RIT of a function f(·,·) is defined as the integral of f(·,·) along a line starting from a centre ($x_0$, $y_0$), which forms angle θ with the horizontal axis (Figure 5, left picture). In the proposed feature extraction method, the discrete form of the RIT was applied, which computes the transform in steps of Δθ and is given by the equation:

$$RIT(t\Delta\theta) = \frac{1}{K} \sum_{k=1}^{K} MHI\bigl(x_0 + k\Delta\rho\cos(t\Delta\theta),\; y_0 + k\Delta\rho\sin(t\Delta\theta)\bigr), \quad t = 1,\dots,T.$$

In a similar manner, the CIT is defined as the integral of a function along a circle with centre ($x_0$, $y_0$) and radius ρ (Figure 5, right). Similarly to the RIT, the discrete form of the CIT was used, as given by the following equation:

$$CIT(k\Delta\rho) = \frac{1}{T} \sum_{t=1}^{T} MHI\bigl(x_0 + k\Delta\rho\cos(t\Delta\theta),\; y_0 + k\Delta\rho\sin(t\Delta\theta)\bigr), \quad k = 1,\dots,K,$$

with T = 360°/Δθ, where Δρ and Δθ are the constant step sizes of the radius ρ and angle θ variables. Finally, KΔρ is the radius of the smallest circle that encloses the grey-scaled MHI. Thus, each MHI can be represented by two one-dimensional vectors, which are simpler to process (Figure 6). It should be noted that the origin point ($x_0$, $y_0$) for the aforementioned transforms was taken in our case as the centre of the head, which was detected by the head detection algorithm described above.
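The discrete RIT and CIT can be sketched by sampling the MHI on a polar grid around the head centre and averaging along radii and along circles, respectively. The code below is an illustrative approximation only: nearest-neighbour sampling, zero contribution for samples falling outside the image, and an enclosing circle computed for the whole image (rather than only for its non-black region) are assumptions, as is the function name `rit_cit`.

```python
import numpy as np

def rit_cit(mhi, x0, y0, d_theta_deg=5.0, d_rho=1.0):
    """Return (rit, cit): rit has one value per angle step (T entries),
    cit has one value per radius step (K entries)."""
    mhi = np.asarray(mhi, dtype=float)
    h, w = mhi.shape
    # Radius of the smallest circle around (x0, y0) enclosing the image.
    corners = np.array([[0, 0], [w - 1, 0], [0, h - 1], [w - 1, h - 1]], float)
    max_radius = np.hypot(corners[:, 0] - x0, corners[:, 1] - y0).max()
    K = max(int(np.ceil(max_radius / d_rho)), 1)
    T = int(round(360.0 / d_theta_deg))
    thetas = np.deg2rad(d_theta_deg * np.arange(1, T + 1))
    radii = d_rho * np.arange(1, K + 1)
    # Polar sampling grid: rows index the radius, columns index the angle.
    xs = np.rint(x0 + radii[:, None] * np.cos(thetas)[None, :]).astype(int)
    ys = np.rint(y0 + radii[:, None] * np.sin(thetas)[None, :]).astype(int)
    inside = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
    samples = np.zeros((K, T))
    samples[inside] = mhi[ys[inside], xs[inside]]
    rit = samples.mean(axis=0)  # average along each radial line
    cit = samples.mean(axis=1)  # average along each circle
    return rit, cit

# Usage: a uniform 21 x 21 image with the origin at its centre.
rit, cit = rit_cit(np.ones((21, 21)), x0=10, y0=10)
```

Both outputs are plain one-dimensional vectors, which is what makes the subsequent template matching inexpensive.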
Via the RIT- and CIT-based MHI transformations, specific gestures of the subject were detected by applying a threshold-based template matching algorithm against predefined templates of gestures of interest. For this purpose, we created a gallery of predefined template MHIs, with one MHI for each gesture of interest. As gestures of interest we selected the "Right hand on head" and "Left hand on head" activities. These specific gestures were used because activities like nail biting, scratching of the head, smoothing of (already smooth or even long gone) hair etc. are known to occur as behavioural symptoms of stress [1].
In order to detect the gestures of interest, two matching scores between the probe and the gallery templates were simultaneously produced, so as to increase both the robustness and the performance of the algorithm. These matching scores were a) the L1-norm distance and b) the correlation factor between each of the RIT- and CIT-transformed vectors. An event was detected only when the scores returned by both classifiers exceeded experimentally selected thresholds, so as to diminish false positives. The final decision about which activity was occurring was taken according to the most matches with the prototype MHIs within a predefined time period (majority voting rule); in our case, this period was one second, and the analysis was performed in non-overlapping intervals. Once the specific gestures had been detected within each non-overlapping one-second interval of the whole time period considered (i.e. the time period of a recorded condition), features expressing the frequency of occurrence of each gesture were extracted.

Table 2 (fragment). Avg{Sqrt((P(x) − P_0(x))² + (P(y) − P_0(y))²)}: V42, V48, V54. SD{Sqrt((P(x) − P_0(x))² + (P(y) − P_0(y))²)}: V45, V51, V57. doi:10.1371/journal.pone.0043571.t002
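The two-score matching and the majority voting rule can be sketched as follows. This is only an illustrative sketch: the function names, the gallery layout (one transformed vector per gesture label) and the threshold parameters `max_l1` and `min_corr` are hypothetical. In particular, the sketch treats a match as a small L1 distance together with a high correlation, which is one possible reading of the thresholding step.

```python
import math
from collections import Counter

def l1_distance(a, b):
    """L1-norm distance between two equally long vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def correlation(a, b):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def match_frame(probe, gallery, max_l1, min_corr):
    """Label of the closest template passing both tests, or None."""
    best, best_d = None, None
    for label, template in gallery.items():
        d = l1_distance(probe, template)
        c = correlation(probe, template)
        # Both scores must agree before a match is accepted.
        if d <= max_l1 and c >= min_corr and (best_d is None or d < best_d):
            best, best_d = label, d
    return best

def vote(frame_labels):
    """Majority vote over the frames of one interval; None if no match."""
    hits = Counter(lab for lab in frame_labels if lab is not None)
    return hits.most_common(1)[0][0] if hits else None
```

Here `vote` would be called once per one-second interval with the per-frame outputs of `match_frame`, and a gesture frequency feature could then be obtained as the fraction of intervals in which a given label wins the vote.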

Accelerometer-based Feature Extraction
Two tri-axial accelerometers were used in our framework, one at each knee of the participant. The aim was to monitor the occurrence of "foot trembling", a behaviour known to often accompany stress. Each accelerometer provided a triplet of values denoting the acceleration along the three axes. Accelerometer data were collected at a 60 Hz sampling rate and were processed in one-second-long, non-overlapping intervals.
Initially, within each interval, the total Power Spectral Density (PSD) of the accelerometer output was calculated as the average PSD of the three axes. Following a rationale similar to [49], foot trembling was detected only when a) the ratio of the signal power in the experimentally set range [f_low, f_high] Hz to the total signal power and b) the total signal power were both above experimentally selected thresholds. Each second of the recording where foot trembling was detected was annotated as such. Then, in order to diminish false positives, intervals were treated as pairs of consecutive ones; only when foot trembling was detected in both intervals of a pair were these intervals finally marked as having foot trembling occurrence. This processing was done for each of the two accelerometers separately, and eventually the outcomes of the two accelerometers were fused by using an "OR" rule for each interval.
Following this processing, one feature was finally calculated from the accelerometer modality, which regarded the frequency of foot trembling occurrences within the time period of each condition: A1: the ratio of the number of seconds with foot trembling detected to the total number of seconds.
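The accelerometer processing chain described above can be sketched in a few lines of Python. The sketch rests on several assumptions that are not taken from the study: the function names are hypothetical, the spectrum is obtained with a naive DFT after removing each axis mean, the band limits and thresholds are left as parameters, and "pairs of consecutive intervals" is read as every two adjacent seconds.

```python
import cmath
import math

def power_spectrum(x, fs):
    """Naive one-sided DFT power per bin (mean removed): (freqs, power)."""
    n = len(x)
    mean = sum(x) / n
    xc = [v - mean for v in x]
    freqs, power = [], []
    for k in range(n // 2 + 1):
        s = sum(xc[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        freqs.append(k * fs / n)
        power.append(abs(s) ** 2 / n)
    return freqs, power

def trembling_in_interval(ax, ay, az, fs, f_low, f_high, min_ratio, min_power):
    """True if band-power ratio and total power both exceed their thresholds."""
    spectra = [power_spectrum(a, fs) for a in (ax, ay, az)]
    freqs = spectra[0][0]
    # Total PSD taken as the average of the three axes.
    psd = [sum(sp[1][k] for sp in spectra) / 3.0 for k in range(len(freqs))]
    total = sum(psd)
    if total <= 0.0:
        return False
    band = sum(p for f, p in zip(freqs, psd) if f_low <= f <= f_high)
    return band / total > min_ratio and total > min_power

def confirm_pairs(flags):
    """Keep detections only where two adjacent intervals are both flagged."""
    out = [False] * len(flags)
    for i in range(len(flags) - 1):
        if flags[i] and flags[i + 1]:
            out[i] = out[i + 1] = True
    return out

def feature_a1(left_flags, right_flags):
    """A1: fraction of seconds with trembling after OR-fusing both knees."""
    fused = [l or r for l, r in
             zip(confirm_pairs(left_flags), confirm_pairs(right_flags))]
    return sum(fused) / len(fused)
```

With a 60 Hz sampling rate, each call to `trembling_in_interval` would receive 60 samples per axis, and `feature_a1` would be evaluated once per recorded condition.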

Physiological features extraction
A further set of features was also extracted from the biosignals (GSR, ECG) that were monitored throughout the experiment. This set consisted of features commonly used in the literature for automatic stress detection or, more generally, affect detection. Given the acknowledged effectiveness of these biosignal features in our context, they were used to provide the basis for assessing whether the examined behavioural features are capable of enhancing the accuracy of a typical automatic stress detection system. In the following, the features extracted from the GSR and ECG signals recorded during each trial (trials 0-6) of the experiment are described. GSR and ECG data were collected at 16 Hz and 256 Hz sampling rates, respectively. From the ECG data, Inter-Beat Intervals (IBIs) were calculated directly by the monitoring device software. In order to treat between-subject variability in physiological measurements, all extracted biosignal features were normalized by dividing them by their baseline values, recorded during each subject's rest session.
From both the GSR and IBI time series, the following typical features were extracted for each trial: Average (Avg) and Standard Deviation (SD) [18], Minimum (Min) and Maximum (Max). Moreover, following [15], the means of the absolute values of the first differences of the raw and normalized signals were calculated:

$$\delta(x) = \frac{1}{N-1}\sum_{i=1}^{N-1}\left|x_{i+1}-x_i\right|, \qquad \delta_{norm}(x) = \frac{1}{N-1}\sum_{i=1}^{N-1}\left|\tilde{x}_{i+1}-\tilde{x}_i\right|$$

Also, the mean of the first differences of the GSR and IBI signals convolved with a Hanning window was given by:

$$\frac{1}{N-1}\sum_{i=1}^{N-1}\left(s_{i+1}-s_i\right)$$

In the above three equations, $x$ is the IBI or GSR signal, $N$ is the number of samples of the respective series, and $s_i$ is the $i$-th sample of the time series that results from the raw signal after sub-sampling at 16 Hz and convolution with a 3-second Hanning window. As in [15], the normalized signal $\tilde{x}_i$ used in the $\delta_{norm}(x)$ calculation was given by $(x_i - x_{mean})/x_{sd}$, where $x_i$ is a signal value recorded during a trial, and $x_{mean}$ and $x_{sd}$ are the signal's average and standard deviation during the trial, respectively. Moreover, the Skewness and Kurtosis [50] of the GSR and IBI signals were calculated by:

$$Skewness(x) = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i-\bar{x}}{\sigma}\right)^{3}, \qquad Kurtosis(x) = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i-\bar{x}}{\sigma}\right)^{4}$$

where $x$ is the IBI or GSR signal, $x_i$ is the $i$-th sample of the raw signal, and $\bar{x}$ and $\sigma$ are the signal's average and standard deviation, respectively.
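For concreteness, the first-difference and moment features can be computed as in the short Python sketch below. The helper names are hypothetical, and the sketch assumes the population standard deviation (division by N) and the plain moment-based forms of skewness and kurtosis; other conventions (e.g. bias-corrected estimators) would give slightly different values.

```python
import math

def mean(v):
    return sum(v) / len(v)

def std(v):
    """Population standard deviation (division by N); an assumption here."""
    m = mean(v)
    return math.sqrt(sum((x - m) ** 2 for x in v) / len(v))

def mean_abs_first_diff(x):
    """Mean of the absolute values of the first differences of x."""
    return sum(abs(b - a) for a, b in zip(x, x[1:])) / (len(x) - 1)

def normalize(x):
    """Per-trial z-normalization: (x_i - x_mean) / x_sd."""
    m, s = mean(x), std(x)
    return [(v - m) / s for v in x]

def skewness(x):
    m, s = mean(x), std(x)
    return sum(((v - m) / s) ** 3 for v in x) / len(x)

def kurtosis(x):
    m, s = mean(x), std(x)
    return sum(((v - m) / s) ** 4 for v in x) / len(x)
```

The normalized first-difference feature is then simply `mean_abs_first_diff(normalize(x))` for a GSR or IBI series `x` of one trial.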
Furthermore, the following features were extracted only from the GSR signals recorded during the trials: Average, RMS (Root Mean Square), and proportion of negative samples of the 1st derivative (Avg1, RMS1, prop1) and of the smoothed (convolved with a Bartlett window) 1st derivative (Avg1s, RMS1s, prop1s), following [51]. Skin Conductance Responses (SCRs) were detected similarly to [51], and their Average Amplitude and Duration (SCR_Amp, SCR_Dur) were calculated for each trial. Also, the Rate of SCR occurrences (number of SCRs divided by the intermediate duration, SCR_Rate), as well as Quantile thresholds at 25%, 50%, 75%, 85%, and 95% for the Amplitude (SCR_AmpQ25, …, SCR_AmpQ95) and Duration (SCR_DurQ25, …, SCR_DurQ95) of SCRs were calculated similarly to [42]. Furthermore, the average area under the rising half of each GSR response (SCR_arUnder) was calculated.
From the ECG modality and the IBI time series of each trial, the following features, typically used in the literature for biosignal-based stress and affect detection, were extracted: RMSSD, pNN50, and average LF/HF power ratio (LF/HF). Also, the standard deviations SD1 and SD2 were calculated from the IBI Poincaré plot geometry, similarly to [52]. Finally, following [15], the mean of the absolute values of the second differences of the raw and normalized signals was calculated as:

$$\frac{1}{N-2}\sum_{i=1}^{N-2}\left|x_{i+2}-x_i\right|$$

As shown in Figure 7, the stressor employed in our experiment was eventually found to be effective (especially in condition 5), leading to average stress self-reporting values close to relevant ones that have been reported in the literature [41].

Correlations between behavioural features and stress
The hierarchical structure of the experimental data makes traditional forms of analysis less resilient to the different levels considered. Subjects are measured repeatedly, at many time points. Traditional repeated-measures designs require the same number of observations for each subject and no missing data, and would thus be applicable in our case. However, multilevel models are more appropriate for analysing such data, above all because the dependencies that exist due to repeated measurements are included in the parameter estimates. Moreover, further dependencies existing in the data can be taken into account.
Since in our case the entries were nested within conditions and within participants, the relation of the physiological and behavioural indexes to the stress level reported through the free-scale (Stress_1-5) questionnaire was estimated with hierarchical linear analysis, an alternative to multiple regression that is suitable for our nested data. We referred to two levels in the model: condition-level and subject-level.
Selection of the models was done on the basis of three criteria:

1. significance levels of involved variables;
2. Quasi Likelihood under Independence Model Criterion (QIC) in the smaller-is-better form;
3. Corrected Quasi Likelihood under Independence Model Criterion (QICC) in the smaller-is-better form.
The usual goodness of fit statistics, like R-square, could not be computed. Instead, the above information criteria, based on a generalization of the likelihood were computed.
In particular, the analyses used self-reported stress as the dependent variable and proceeded in successive steps.
Table 5 shows the mixed linear hierarchical regressions conducted for each feature (independent variable), using self-reported stress as the dependent variable; in particular, results for features selected from Step 2 of the aforementioned analysis are presented. From Table 5, it is evident that a significant effect of stress was found for several physiological as well as behavioural features. As concerns behavioural parameters, a positive relation was found between stress and features expressing aspects of the Global Activity Energy (V1, V2, V4, V5, V7, V8) and the Sharp Activities Energy (V9, V10 and V11). A significant stress effect was also found for features focusing on Activity Symmetry (V14, V16, V31, V33), as well as for ones related to the position and movement of the head (V41, V46, V52, V54, V58, V59, V60) and the MHI barycenters (V70). Interestingly, increased self-reported stress was also found to be accompanied by an increase in the frequency of occurrence of Right Hand on Head movements (V76) and foot trembling (A1).
Table 6 shows the four mixed linear hierarchical multiple regression models, conducted for different combinations of multiple independent variables, using again self-reported stress as the dependent variable (Steps 3, 4, 5 and 8). Features V5, V16, V31 and V46 were found to have a significant effect within the final, full model that consisted of both behavioural and physiological features (Table 6). It was thus found that features expressing the SD of the global activity level (V5), the divergence of our proposed activity symmetry vector from the upright position (V16) and its vertical length (V31), as well as the horizontal difference between the head's position during the trial and its position during relaxation (V46), had a significant effect in modelling stress, even when used in conjunction with physiological features derived from the GSR and ECG modalities.
Finally, Figure 8 (specific values are given in Table 4) depicts the variation of several of the examined variables among the different conditions. As shown by Table 6 and Figure 8, the results obtained can be regarded as confirming a relationship between the physiological, behavioural and psychological data of our experiment.

Efficiency of behavioural features towards automatic stress detection
From the analysis presented above, it is clear that some of the behavioural features showed a significant relation to self-reported stress, similarly to physiological (GSR and IBI) features commonly used in the relevant literature (e.g. [41,42]) for automatic stress detection. Following these findings, it was examined whether the proposed behavioural features can be used in conjunction with (or even instead of) the typical physiological features, to enhance the effectiveness of automatic stress detection. For this purpose, an LDA-based classifier [52] was used over the multi-subject data set of our experiment. In Fisher's LDA, the optimum projection for a given data set is realized through the transformation matrix $W$, which is calculated so as to maximize the criterion:

$$J(W) = \frac{\left|W^{T} S_b W\right|}{\left|W^{T} S_w W\right|}$$

where $S_b$ is the "between class scatter matrix" and $S_w$ is the "within class scatter matrix" of the training data set. In two-class LDA, data from the initial feature space are projected on a single projection axis, which best discriminates the training data among the available classes. Thus, once the optimum transformation vector $W$ is calculated from the training data set, it can be used to calculate the projection of each class centroid and of each new (test) case onto the transformation axis. Classification is then performed in the transformed space by assigning the new case to the nearest class along the projection axis, using:

$$class(case) = \arg\min_{c \in \{0,1\}} \left|W^{T} F(case) - W^{T} \mu_c\right|$$

where $F(case)$ is the feature vector of the test case, $\mu_0$ and $\mu_1$ are the centroids of the two classes under consideration, calculated using the training data, and $W$ is the transformation matrix.
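The two-class case can be illustrated with a small pure-Python sketch. It relies on the standard closed-form solution for two classes, in which the projection direction is proportional to the inverse of the within-class scatter matrix applied to the difference of the class centroids; the function names are hypothetical, and the sketch assumes that the within-class scatter matrix is non-singular (no regularization is applied).

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the matrix is singular.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fisher_direction(class0, class1):
    """Return (w, m0, m1) with w = S_w^{-1} (m1 - m0)."""
    d = len(class0[0])

    def centroid(rows):
        return [sum(r[j] for r in rows) / len(rows) for j in range(d)]

    m0, m1 = centroid(class0), centroid(class1)
    # Within-class scatter: sum of outer products of centred cases.
    sw = [[0.0] * d for _ in range(d)]
    for rows, m in ((class0, m0), (class1, m1)):
        for r in rows:
            diff = [r[j] - m[j] for j in range(d)]
            for a in range(d):
                for b in range(d):
                    sw[a][b] += diff[a] * diff[b]
    w = solve(sw, [m1[j] - m0[j] for j in range(d)])
    return w, m0, m1

def classify(case, w, m0, m1):
    """Assign the case to the class whose projected centroid is nearest."""
    def proj(v):
        return sum(wi * vi for wi, vi in zip(w, v))
    p = proj(case)
    return 0 if abs(p - proj(m0)) <= abs(p - proj(m1)) else 1
```

For two classes, this direction maximizes the Fisher criterion up to scale, so the nearest-centroid decision on the projection axis is unaffected by the missing normalization.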
Leave-one-out cross validation (LOOCV) was employed as in [52], and the final Correct Classification Ratio (CCR) of the classifier was calculated by CCR = N_c/N, where N_c is the number of cases correctly classified and N is the total number of cases constituting the full data set. In order to assess whether behavioural features are useful for automatic stress detection, various feature sets consisting of physiological features and/or behavioural ones were used as the input of the classifier, in an effort to identify:
- The effectiveness of well-known physiological features for stress detection in the given dataset (used thereafter as a basis for comparison).
- The effectiveness of the behavioural features for stress detection, compared to the physiological features.
- Whether there would be an increase (or decrease) in stress detection accuracy when using behavioural features in addition to physiological ones, compared to the initial stress detection accuracy achieved using only physiological features.
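The LOOCV estimate of the CCR can be sketched generically as below. The sketch is not tied to any particular classifier: `train` and `predict` are hypothetical callables, and the nearest-centroid classifier included here is only a stand-in used to keep the example self-contained, not the LDA-based classifier of the study.

```python
def loocv_ccr(cases, labels, train, predict):
    """CCR = N_c / N under leave-one-out cross validation."""
    correct = 0
    for i in range(len(cases)):
        # Train on all cases except the i-th, then test on the i-th.
        model = train(cases[:i] + cases[i + 1:], labels[:i] + labels[i + 1:])
        if predict(model, cases[i]) == labels[i]:
            correct += 1
    return correct / len(cases)

def train_centroids(cases, labels):
    """Stand-in classifier: store one centroid per class label."""
    model = {}
    for lab in set(labels):
        rows = [c for c, l in zip(cases, labels) if l == lab]
        model[lab] = [sum(col) / len(rows) for col in zip(*rows)]
    return model

def predict_centroid(model, case):
    """Return the label of the nearest stored centroid."""
    return min(model, key=lambda lab: sum((a - b) ** 2
                                          for a, b in zip(case, model[lab])))
```

A call such as `loocv_ccr(cases, labels, train_centroids, predict_centroid)` returns a value between 0 and 1, which corresponds to the CCR expressed as a fraction rather than a percentage.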
For each different feature set considered, an SBS (Sequential Backward Search) feature selection procedure was employed as in [52], to retain the subset of features that would yield in each case the best stress detection accuracy. Starting with the full initial feature set, SBS first calculated a criterion value, in our case the classifier's performance. An iterative feature removal process was then employed; within each iteration, the feature whose removal increased the criterion value the most was permanently removed from the feature set. As a result, the features that produced the best CCR were finally selected from the initial feature set.

Data Annotation. The LDA-SBS classification schema was applied over two different two-class stress detection problems, which were formulated by following different annotation procedures over the full dataset, which consisted of 147 cases in total (i.e., all recordings of all subjects regarding trials 0-6).
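A backward search of the kind used in the LDA-SBS schema can be sketched as follows. The function name `sbs` and the `criterion` callable are hypothetical; in the setting described here, `criterion` would return the LOOCV CCR of the classifier restricted to the given feature subset. The sketch continues removing features down to a single one and returns the best subset encountered, preferring smaller subsets on ties, which is an assumption about details that the text does not specify.

```python
def sbs(features, criterion):
    """Sequential Backward Search.

    features:  list of feature names.
    criterion: callable mapping a feature subset to a score
               (higher is better), e.g. a cross-validated CCR.
    Returns (best_subset, best_score).
    """
    current = list(features)
    best_subset, best_score = list(current), criterion(current)
    while len(current) > 1:
        # Score every candidate removal and keep the most beneficial one.
        score, drop = max(
            ((criterion([f for f in current if f != g]), g) for g in current),
            key=lambda t: t[0],
        )
        current = [f for f in current if f != drop]
        if score >= best_score:
            best_subset, best_score = list(current), score
    return best_subset, best_score
```

Because the criterion is re-evaluated for every candidate removal, the number of classifier evaluations grows quadratically with the size of the initial feature set.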
For the annotation of the first dataset (Dataset1), the subjects' answers to the Likert-scaled direct stress self-assessment question (Stress_1-5) were taken into account. In particular, trials for which the answer to this question was "1" or "2" were annotated as "Not Stressed" (NS), whereas trials for which this answer was "4" or "5" were annotated as "Stressed" (S). Trials for which the answer was "3" were excluded. As a result, Dataset1 consisted of 108 trials in total, 82 labelled as NS and 26 as S.
For the annotation of the second dataset (Dataset2), the average of the subjects' answers to the four SAM questions (Stress_SAM) was taken into account. In particular, trials for which the average value of the answers to these questions was higher than 2.5 were annotated as "Stressed" (S), and the rest of the trials were annotated as "Not Stressed" (NS). As a result, Dataset2 consisted of 147 trials in total, 93 labelled as NS and 54 as S.
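The two annotation rules translate directly into code. The following sketch uses hypothetical function names and assumes that the Stress_1-5 answer is an integer and that the SAM answers of a trial are given as a list of numbers.

```python
def label_dataset1(stress_1_5):
    """Dataset1 rule: 1 or 2 -> "NS", 4 or 5 -> "S", 3 -> None (excluded)."""
    if stress_1_5 in (1, 2):
        return "NS"
    if stress_1_5 in (4, 5):
        return "S"
    return None

def label_dataset2(sam_answers):
    """Dataset2 rule: average of the SAM answers above 2.5 -> "S"."""
    avg = sum(sam_answers) / len(sam_answers)
    return "S" if avg > 2.5 else "NS"
```

Trials for which `label_dataset1` returns `None` are the ones left out of Dataset1, which is why that dataset is smaller than Dataset2.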
The purpose of using these two datasets was to evaluate the LDA-based classification schema on stress detection applied over:
- A portion of the full dataset of this study, containing only the more extreme cases of not stressed trials and stressed ones (Dataset1)
- The full dataset of this study (Dataset2)

For each dataset, three different feature sets were used as the initial feature set of the SBS procedure, consisting of:
- All physiological features (feature set: FS1)
- All behavioural features (feature set: FS2)
- All physiological and all behavioural features (feature set: FS3)

From each feature set, SBS selected the features that provided the best CCR for each of the two different stress detection problems. In the following, the confusion matrices of the best stress detection results obtained from the various feature sets with respect to Dataset1 and Dataset2 are provided, along with the features that were finally selected by SBS and yielded the best results for each feature set.
Classification Results. With respect to Dataset1, as shown by a comparison between Tables 7 and 8, the extracted behavioural features proved as effective as the physiological features in the stress detection problem concerning the more extreme cases of stress and no stress that existed in this dataset. Furthermore, when behavioural features were combined with the physiological ones, the best average CCR significantly increased (by 7.41%), reaching the maximum correct classification rate of 100% (Table 9).
Regarding Dataset2, a comparison of Tables 10 and 11 shows that the proposed behavioural features were more effective than the physiological ones in the stress detection problem concerning all stress/no stress cases that existed in our dataset; a significant increase of 7.49% in the CCR was achieved with the behavioural features. The significance of this increase in performance was tested as in [53], by a two-tailed pair-wise t-test applied over the classification results of FS1 and FS2 (t = 1.996, df = 146, p < .05). Moreover, when behavioural features were used together with the physiological ones as the initial feature set of SBS, the best average CCR again increased significantly (t = 3.964, df = 146, p < .001), by 13.61% compared to the best average CCR achieved with physiological features (Table 12).
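A paired t-test of this kind can be sketched as follows, under the assumption (suggested by df = 146 for 147 cases, but not stated explicitly) that it is applied to per-case correctness indicators of the two classifiers. The function name is hypothetical, and the p-value lookup from the t distribution is left out.

```python
import math

def paired_t(correct_a, correct_b):
    """Paired t statistic and degrees of freedom.

    correct_a, correct_b: per-case outcomes of two classifiers on the
    same cases (1 = correctly classified, 0 = misclassified).
    Raises ZeroDivisionError if all paired differences are identical.
    """
    d = [b - a for a, b in zip(correct_a, correct_b)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((v - mean_d) ** 2 for v in d) / (n - 1)  # sample variance
    return mean_d / math.sqrt(var_d / n), n - 1
```

A positive statistic indicates that the second classifier was correct on more cases than the first; the two-tailed p-value then follows from the t distribution with the returned degrees of freedom.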
Furthermore, instead of using all features of our work, SBS and the LDA classifier were also applied only over the features for which significant regressions to self-reported stress levels were found in the previously described regression analysis (Table 5). The classification results obtained for Dataset1 were 88.89% for physiological (8 selected) features and 96.30% for physiological and behavioural features (13 selected features, 4 physiological and 9 behavioural). The respective results for Dataset2 were 73.47% with 6 features and 86.40% with 26 features (10 physiological and 16 behavioural). Behavioural features were thus again found effective in both datasets, even when the initial feature space of the SBS procedure was already limited through the regression analysis.
As said, all of the above classification analyses were based on LOOCV. Furthermore, in order to examine our approach over a validation sample completely independent from the training one, we randomly split the full dataset (consisting of data taken from 21 participants) into a training set, consisting of data taken from 9 participants, and a validation set, consisting of data taken from the remaining 12 participants. Using validation samples of subjects whose data were entirely absent during training simulates the hardest affect detection scenario, where the system tries to identify emotions of unknown persons on the basis of knowledge it has acquired from a limited training set. Overall stress detection accuracy was therefore expected to decrease in this case; however, the purpose of this analysis was to examine whether behavioural features would again lead to an increase in automatic stress detection performance.

Using behavioural features to predict stress-related increase in the GSR
The analysis of the previous section focused on the automatic detection of stress, using self-reports as the ground truth for the classification. It could however be argued that physiological responses (such as the increase in the average GSR level) could provide a more objective and reliable measure of stress than self-reports. From this perspective, it would be interesting to also examine whether our proposed behavioural features could be used to effectively predict stress-related alterations of physiological signals. Therefore, following the correlates that were found between behavioural features and physiological ones, a further analysis was conducted over the present dataset, towards assessing whether the increase in the average GSR level (a well-known, reliable index of stress [41]) could be predicted through the proposed features. For this purpose, a further dataset was formed (Dataset3), by annotating the recorded conditions on the basis of the GSR average value (Avg(GSR)). In order to do so, a normalized value for the Avg(GSR) value of each recorded condition was calculated as:

$$Avg(GSR)_{norm,ij} = \frac{Avg(GSR)_{ij} - Avg(GSR)_{min,j}}{Avg(GSR)_{max,j} - Avg(GSR)_{min,j}}$$

where $Avg(GSR)_{norm,ij}$ is the normalized GSR value of condition $i$ of participant $j$, $Avg(GSR)_{ij}$ is the actual value of the Avg(GSR) feature for the same condition, and $Avg(GSR)_{min,j}$ and $Avg(GSR)_{max,j}$ are the minimum and maximum values of the Avg(GSR) feature found in all conditions of participant $j$. The result of the above equation was a value in the range [0, 1], expressing the increase in GSR level that was observed within each subject's recorded conditions. Then, considering the average value of the GSR as the ground truth, annotation took place on the basis of $Avg(GSR)_{norm}$, with the rule: if $Avg(GSR)_{norm} < 0.5$, the condition was labelled as "Not Stressed" (NS); otherwise, the condition was labelled as "Stressed" (S). As a result, Dataset3 consisted of 147 cases in total, 57 annotated as NS and 90 as S.
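The per-participant normalization and the labelling rule can be sketched as follows. The function name and the dictionary layout (participant identifier mapped to the list of Avg(GSR) values of that participant's conditions) are hypothetical; the handling of a participant whose values are all equal is an assumption added only to avoid a division by zero.

```python
def label_by_gsr(avg_gsr_by_participant, threshold=0.5):
    """Min-max normalize Avg(GSR) per participant and label each condition.

    Returns a dict mapping each participant to a list of "NS"/"S" labels,
    in the same order as the input conditions.
    """
    labels = {}
    for pid, values in avg_gsr_by_participant.items():
        lo, hi = min(values), max(values)
        span = hi - lo
        # Degenerate case (all values equal) is mapped to 0.0 by assumption.
        norm = [(v - lo) / span if span > 0 else 0.0 for v in values]
        labels[pid] = ["NS" if x < threshold else "S" for x in norm]
    return labels
```

Since the normalization is done within each participant, every participant with varying GSR levels contributes at least one "NS" condition (the minimum) and at least one "S" condition (the maximum).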
Following this labelling process, Dataset3 had as ground truth each participant's increase in GSR, instead of the answers given to stress self-reports. Thus, the purpose was in fact to examine whether the behavioural features can predict the increase in GSR, which can in turn be regarded as a reliable measure of stress. By applying LOOCV over the behavioural features (selected from the FS2 feature set after SBS), a stress (GSR increase) detection accuracy of 93.88% (138/147; NS: 54/57, S: 84/90) was obtained. This result further underlines the correlation that exists between the behavioural features of our work and the level of GSR, a well-known physiological stress metric.
In the same context, a further analysis followed, this time using self-reported stress to predict stress as measured by the increase in GSR. In this case, we considered two further features, Stress_1-5 and Stress_SAM, which regarded the respective self-reports that were obtained after the end of each recorded condition. When these two features were used in the LDA-based classifier, stress (increase in GSR) was predicted in Dataset3 with an accuracy of 75.51% (111/147; NS: 47/57, S: 64/90). Thereafter, these two features formed, together with the behavioural ones, a further feature set (FS4), from which the best features (selected after SBS) provided a stress detection accuracy in Dataset3 of 94.56% (139/147; NS: 55/57, S: 84/90). From a perspective that takes as reference the subject's increase in stress level, as reflected by the increase in GSR, the latter two results indicate that our behavioural features, used together with self-reports, can lead to a significant (t = 5.039, df = 146, p < .001) increase in the performance of a stress detection system that is solely based on self-reports.

Discussion
In this work, a large set of seventy-eight behavioural features was extracted from video and accelerometer data collected in the conducted experiment, and analysed with the aim of answering the two research questions of the present study.
Our first research question (RQ1) aimed to understand the relationship between automatically extracted behavioural features and self-reported stress levels of subjects. Analyses based on mixed linear hierarchical regression models appeared to confirm that relationships between the proposed behavioural features and self-reported stress exist. We defined statistical models of self-perceived stress, explained by the calibrated mixing of physiological and behavioural measures. This showed even more clearly the subtle relationships among different changes in subjects' behaviour due to an increased stress level. Interestingly, several behavioural features were found to have a significant effect in modelling self-reported stress, even when used in conjunction with physiological features.
Our second research question (RQ2) aimed to investigate whether more robust stress detection is feasible by adding automatically detected behavioural information. Results showed that when the behavioural features were used together with common physiological measures (FS3), stress detection accuracy significantly increased, compared to the case when only the latter were utilized (FS1). It was even observed that in the full dataset of the present study (Dataset2), the total replacement of the physiological features by the proposed behavioural ones (i.e. using feature set FS2 instead of FS1) again led to an increase in performance. Moreover, behavioural features appeared to also enhance automatic stress detection within a harder classification scenario, where only limited training data, taken from different persons than the validation ones, exist. These results suggest that the proposed behavioural features provide an appropriate basis for implementing an efficient real affective computing system. They can be used either to replace conventional, more obtrusive physiological measures, or in conjunction with them. In both cases, the results of the present study show that stress detection effectiveness can increase. Moreover, considering future practical applications of automatic stress detection, it should be noted that the proposed features extracted from the video modality form an unobtrusive activity-related behaviour monitoring framework that is based on a low-cost camera. Additionally, their extraction is based only on the silhouette of the subject depicted through the MHIs, together with the head's position. Both the MHIs and the head position can be calculated in real time. As a result, even for applying post-processing on the recorded data, the original video sequences that fully depict the subject are not required, and thus do not need to be stored. Compared to typical facial expression recognition methods, the proposed framework therefore has a higher chance of ethical acceptance in future practical applications.
Nevertheless, the accelerometer modality used in our study cannot be regarded as being as unobtrusive as the video modality; it can be considered rather obtrusive, similarly to the typical physiological modalities. However, two comments should be made in this respect. First, our proposed framework is mainly based on video processing, since only one out of the seventy-eight features examined was extracted from the accelerometers. Thus, the accelerometer modality could be omitted in a future practical system in order to make it as unobtrusive as possible, at the cost of a possibly small degradation in performance. Second, different (e.g., vision-based) methodologies could be developed in the future to detect foot trembling, thus making the use of accelerometers unnecessary.
In the present analysis, behavioural features were extracted in off-line mode from video and accelerometer recordings, something that can also be done in practical clinical settings, where subjects can be monitored for a time period and behavioural features can be extracted as soon as the monitoring period ends. However, considering further future practical applications that may need real-time extraction of the proposed behavioural features, it has to be noted that, by using multi-threading techniques, the proposed features can also be extracted in real time, similarly to the procedure that would be followed for common physiological features. This way, our proposed parameters can also provide further input to a practical on-line stress detection system, so as to enhance its effectiveness.
Our analysis involved data collected during a situation that required sustained attention to a visual display. Such situations are typically encountered in various daily settings, including a typical day at the office, the monitoring of a safety-critical system etc., where stress is highly likely to appear. Moreover, our behavioural features can also be extracted in further settings, which do not necessarily involve sustained attention to visual displays but do require the subject to be seated in front of the monitoring camera. For instance, our proposed system could be applied at a psychologist's office, so as to monitor the patient's activity during the treatment session. It can be argued that physiological responses could be utilized for stress monitoring in further situations of diverse daily settings. However, where applicable, our proposed behavioural feature extraction system provides less obtrusive stress monitoring, which, as indicated by the results of our work, can significantly increase the effectiveness of conventional methods based on physiological responses. In any case, the proposed behavioural features can provide effective automatic stress detection in situations that constitute rather fertile ground for the future application of automatic stress monitoring systems.
In particular, we found that behavioural parameters can improve a stress detection system based on objectively measurable features, making it appropriate also for clinical settings.
Based on the idea that bodies are as specific as words, some researchers began to speak of a "kinetic text", defining the set of subjects' movements as "thick and specific as the words we speak" [54]. Our study emphasizes the role of automatic behavioural feature detection as a way, in conjunction with physiological measures, of objectifying subjectivity.
"Even patients lying on the couch move as they speak and convey their own rhythm and shape patterns. The patient's body position, movements of limbs, have an impact on the analyst whose movements, though perhaps unseen by the patient, are felt and heard so that both together form a kinetic, as well as verbal text." [54]
The passage above highlights the importance of understanding activity-related behavioural features also for a therapist. In a laboratory setting, such analyses can provide the researcher with further information for detecting situations that are hardly conveyable otherwise. To our knowledge, this study represents one of the first attempts towards a system for activity-related automatic stress detection. We believe that the role of activity in the understanding of human behaviour is a cutting-edge theme in science and for future technological development. Our results are a preliminary step towards a complete and effective development, and more studies are needed in this direction.