Posttraumatic stress disorder hyperarousal event detection using smartwatch physiological and activity data

Posttraumatic Stress Disorder (PTSD) is a psychiatric condition affecting nearly a quarter of the United States war veterans who return from war zones. Treatment for PTSD typically consists of a combination of in-session therapy and medication. However, patients often experience their most severe PTSD symptoms outside of therapy sessions. Mobile health applications may address this gap, but their effectiveness is limited by the lack of continuous monitoring and detection capabilities that would enable timely intervention. The goal of this article is to develop a novel method to detect hyperarousal events using physiological and activity-based machine learning algorithms. Physiological data, including heart rate and body acceleration, as well as self-reported hyperarousal events were collected over several days from 99 United States veterans diagnosed with PTSD, using a tool developed for commercial off-the-shelf wearable devices. The data were used to develop four machine learning algorithms: Random Forest, Support Vector Machine, Logistic Regression and XGBoost. The XGBoost model had the best performance in detecting onset of PTSD symptoms with over 83% accuracy and an AUC of 0.70. Post-hoc SHapley Additive exPlanations (SHAP) analysis showed that algorithm predictions were correlated with average heart rate, minimum heart rate and average body acceleration. Findings show promise in detecting onset of PTSD symptoms, which could be the basis for developing remote and continuous monitoring systems for PTSD. Such systems may address a vital gap in just-in-time interventions for PTSD self-management outside of scheduled clinical appointments.


Introduction
PTSD is a psychiatric condition experienced by individuals after exposure to life-threatening events, such as physical assault, sexual abuse, and combat exposure [1]. PTSD symptomology includes avoidance, hyperarousal, and reexperiencing trauma through dreams and recollections [1]. Avoidance symptoms include circumventing activities or thoughts associated with the traumatic event, decreased interest in daily life, and an overall feeling of detachment from fluctuations due to physical activity and heart rate fluctuations due to mental stress. Therefore, the objective of this article is to extend McDonald et al.'s [23] study with machine learning algorithms that use body acceleration and heart rate data to predict PTSD hyperarousal events in veterans. In addition, to improve the interpretability of the algorithm, we further analyze the developed model to investigate the significant factors contributing to the model's detection output.

Materials and methods
Four machine learning algorithms were trained to predict PTSD hyperarousal events using self-reported data collected naturalistically from veterans: Random Forest, XGBoost, Logistic Regression and non-linear SVM. Fig 1 provides an overview of the methods used in this study. This study was approved by the Institutional Review Board (IRB) at Texas A&M University (IRB2017-0210D) and all participants provided informed consent prior to data collection.

Participants
Participants were recruited from seven Project Hero's United Healthcare Ride 2 Recovery (R2R) challenges. Project Hero is a non-profit organization dedicated to helping veterans and first responders diagnosed with PTSD. In each challenge, veterans rode for an average of 7 days between key destinations in California, Washington DC, Minneapolis, Texas, and Nevada. Each day of the challenge involved approximately 8 hours of biking, with the remaining time for resting and socializing. The research team joined a total of 5 rides in 2017, 2018 and 2019. Data from 99 veteran participants (82 male; 17 female) were used in this study. Participants' ages ranged from 22 to 75 years (M = 45.5, SD = 10). The majority of participants reported a Veterans Affairs disability rating of over 90% related to PTSD. Table 1 summarizes other relevant demographics.

Data collection
The data collection application (app) for smart wearable devices used in [23] was employed. Participants were asked to wear smart watches (MOTO 360 Gen 1 or Gen 2, Apple Watch series 3 or 4) with the app installed for the duration of the study. The app ran continuously in the background and connected to participants' phones for data transfer. The app continuously and remotely collected physiological data, including heart rate and acceleration, at a frequency of 1 Hz. The app also allowed the user to report a hyperarousal event (symptomatic of PTSD) through a simple 'double tap' anywhere on the watch face, which created a time-stamped self-reported event. These events were used for training the machine learning algorithms.

Data preprocessing
All data analysis including data preprocessing and machine learning were conducted in Python 3.8.2 and R 3.6.2. The data preprocessing included four main steps: (1) imputation, (2) windowing and labeling, (3) dividing the data into training and testing, and (4) resampling the training dataset.
Further, four machine learning algorithms (Random Forest, XGBoost, Logistic Regression and non-linear SVM) were used to detect hyperarousal events. These four algorithms have been used and shown promise for stress detection in previous research [32][33][34][35][36][37][38][39][40] and have several strengths that make them suitable for this specific research. For example, Random Forest has been widely used for large physiological datasets due to its robustness in dealing with missing values [41,42]. In addition, Logistic Regression is a computationally efficient algorithm commonly used for predicting binary outputs [43]. Lastly, XGBoost was chosen due to its efficiency, accuracy, interpretability, and ease of integration with mobile applications [42].
Data imputation. Kalman filter imputation was used to impute missing acceleration and heart rate data. Kalman filter imputation is an established method for time series data imputation [44], especially for heart rate data [45]. To determine the cutoff range, we calculated the average Mean Square Error (MSE) between the imputed data and the corresponding actual values. Gui et al. [45] suggest a cutoff of 15 MSE for estimating randomly dropped values. Based on the Kalman filter imputation analysis, we chose 5 as the maximum imputation range, because it was the largest number of successive missing values for which the MSE remained below 15 [23].
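The gap-limited imputation described above can be illustrated with a minimal sketch. The state-space model used in the study is not specified here, so the sketch assumes a simple local-level (random walk plus noise) model with a Kalman filter followed by a Rauch-Tung-Striebel smoother. The function name `kalman_impute` and the noise variances `q` and `r` are illustrative assumptions; only the maximum gap of 5 successive values comes from the text.

```python
import numpy as np

def kalman_impute(y, q=1.0, r=1.0, max_gap=5):
    """Fill short runs of NaN using a local-level Kalman smoother.

    q, r    : assumed process and measurement noise variances (illustrative)
    max_gap : longest run of successive missing values that is imputed
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    observed = ~np.isnan(y)
    if not observed.any():
        return y.copy()

    x_pred, p_pred = np.zeros(n), np.zeros(n)
    x_filt, p_filt = np.zeros(n), np.zeros(n)
    # Near-diffuse prior centred on the first observed value.
    x, p = y[observed][0], 1e6
    for t in range(n):
        p = p + q                      # predict: random-walk state
        x_pred[t], p_pred[t] = x, p
        if observed[t]:                # update only when a sample exists
            k = p / (p + r)
            x = x + k * (y[t] - x)
            p = (1.0 - k) * p
        x_filt[t], p_filt[t] = x, p

    # Backward (Rauch-Tung-Striebel) smoothing pass.
    x_smooth = x_filt.copy()
    for t in range(n - 2, -1, -1):
        c = p_filt[t] / p_pred[t + 1]
        x_smooth[t] = x_filt[t] + c * (x_smooth[t + 1] - x_pred[t + 1])

    # Replace only gaps no longer than max_gap; longer gaps stay missing.
    out = y.copy()
    t = 0
    while t < n:
        if observed[t]:
            t += 1
            continue
        end = t
        while end < n and not observed[end]:
            end += 1
        if end - t <= max_gap:
            out[t:end] = x_smooth[t:end]
        t = end
    return out
```

For example, `kalman_impute([60, 62, float("nan"), 66, 68])` fills the single missing sample with a value between its neighbours, whereas a run of six missing samples is left untouched.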
Windowing and labeling. To investigate the patterns of hyperarousal events, the data were divided into 60-second sliding windows with a 30-second overlap, chosen based on prior work [24] on predicting stress severity from physiological reactions. Each window was assigned a label based on the presence or absence of reported hyperarousal events. If a hyperarousal event occurred anywhere in the window, the window was labeled as a hyperarousal event; otherwise, it was labeled as a non-hyperarousal event. All windows with over 80% missing values were dropped from the dataset. The final dataset included 530 and 13,554 instances of hyperarousal and non-hyperarousal events, respectively.
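The windowing and labeling rule above can be sketched as follows. This is a minimal illustration, assuming a 1 Hz series stored as a NumPy array with NaN for missing samples and event times given as sample indices; the function name `make_windows` and its parameters are hypothetical and not taken from the study's code.

```python
import numpy as np

def make_windows(hr, event_times, win=60, step=30, max_missing=0.8):
    """Slice a 1 Hz series into overlapping windows and label each one.

    hr          : 1-D array of heart rate samples at 1 Hz (NaN = missing)
    event_times : sample indices (seconds) of self-reported events
    Returns a list of (start_index, label) pairs; label 1 = hyperarousal.
    """
    hr = np.asarray(hr, dtype=float)
    events = np.asarray(event_times, dtype=int)
    windows = []
    for start in range(0, len(hr) - win + 1, step):
        segment = hr[start:start + win]
        # Drop windows in which more than 80% of the samples are missing.
        if np.isnan(segment).mean() > max_missing:
            continue
        # A window is positive if any reported event falls inside it.
        label = int(np.any((events >= start) & (events < start + win)))
        windows.append((start, label))
    return windows
```

Because consecutive windows overlap by 30 seconds, a single reported event typically labels two adjacent windows as positive.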
Training, testing, and upsampling. To validate the algorithm, the data were separated into training (70%) and testing (30%) sets by participant. Each participant was assigned to either the training set or the testing set, never both, to ensure generalizability of the results. Table 2 shows the initial dataset classifications. One of the challenges of training the algorithm to detect PTSD hyperarousal events was the imbalanced dataset: 96.2% of the windows were labeled non-hyperarousal events. To address this issue, we upsampled the training data. Upsampling was used because it decreases the information lost in the quantification process, thereby reducing the noise and increasing the resolution of the results [46,47]. Upsampling has several advantages over downsampling. The main advantage of upsampling is its ability to use all of the information in the data rather than omitting part of the data [48]. In addition, for noisy datasets, which is often the case when collecting data in naturalistic settings, oversampling is more robust to the noise in the data than undersampling and performs better for prediction [49]. Based on a sensitivity analysis comparing different resampling ratios, including 1-1, 2-1, 3-1, 3-2, and 4-3, a ratio of 4 (non-hyperarousal events) to 3 (hyperarousal events) windows was used for upsampling (Table 2).
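A minimal sketch of the participant-wise split and the 4-to-3 upsampling follows. It assumes that the 70/30 split is taken over participants and that upsampling means random resampling of hyperarousal windows with replacement; the text does not specify either detail, and the function names `split_by_participant` and `upsample_minority` are hypothetical.

```python
import numpy as np

def split_by_participant(participant_ids, test_fraction=0.3, seed=0):
    """Return boolean masks (train, test) with no participant in both sets."""
    rng = np.random.default_rng(seed)
    ids = np.unique(participant_ids)
    rng.shuffle(ids)
    n_test = int(round(test_fraction * len(ids)))
    test_ids = set(ids[:n_test].tolist())
    is_test = np.array([p in test_ids for p in participant_ids])
    return ~is_test, is_test

def upsample_minority(X, y, ratio=(4, 3), seed=0):
    """Resample positive windows with replacement until the
    negative-to-positive ratio is approximately ratio[0]:ratio[1]."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    neg = np.flatnonzero(y == 0)
    pos = np.flatnonzero(y == 1)
    target_pos = int(round(len(neg) * ratio[1] / ratio[0]))
    n_extra = max(target_pos - len(pos), 0)
    # Assumes at least one positive window exists when n_extra > 0.
    extra = rng.choice(pos, size=n_extra, replace=True)
    keep = np.concatenate([neg, pos, extra])
    return X[keep], y[keep]
```

Only the training mask should be passed to `upsample_minority`, so that the test set keeps its natural class imbalance.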

Feature generation and selection
Heart rate data. Previous research has shown that time domain features of heart rate are strongly correlated with PTSD [50]. We extracted time domain features of heart rate, including maximum heart rate (bpm), minimum heart rate (bpm), heart rate standard deviation (bpm), heart rate range (max-min) (bpm), and average heart rate (bpm), from each window to use for PTSD hyperarousal prediction. Key features were extracted based on recommendations from a review article on detecting psychological stress using biosignals [51].
Acceleration data. Research on stress prediction has used scalar measures of body acceleration to estimate body activity and to remove noise from the data [25][51][52][53]. Garcia-Ceja et al. [52] used time domain and frequency domain features of acceleration to predict stress in participants in real work environments. In line with these approaches, we calculated the magnitude of the body acceleration vector for each moment using the following widely used formula: $\text{body acceleration} = \sqrt{a_x^2 + a_y^2 + a_z^2}$, where $a_x$ is the body acceleration in the X direction, $a_y$ is the body acceleration in the Y direction, and $a_z$ is the body acceleration in the Z direction. Further, based on previous research [e.g., 25, 52], time domain features of body acceleration, including average body acceleration (m/s²), maximum body acceleration (m/s²), minimum body acceleration (m/s²), and range of body acceleration (m/s²), were extracted to feed the machine learning algorithms.
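The per-window heart rate and acceleration features described above can be sketched as follows. This is a minimal illustration, assuming each window is available as NumPy-compatible arrays with NaN for missing samples. The names `hrmin`, `hrsd`, `linaccmean` and `linaccmin` follow the feature labels used later in the article; the function name `window_features`, the remaining dictionary keys, and the use of the sample standard deviation are assumptions.

```python
import numpy as np

def window_features(hr, ax, ay, az):
    """Time domain features for one window of heart rate and acceleration."""
    hr = np.asarray(hr, dtype=float)
    ax, ay, az = (np.asarray(v, dtype=float) for v in (ax, ay, az))
    # Magnitude of the acceleration vector at each sample.
    mag = np.sqrt(ax ** 2 + ay ** 2 + az ** 2)
    return {
        "hrmax": np.nanmax(hr),
        "hrmin": np.nanmin(hr),
        "hrsd": np.nanstd(hr, ddof=1),          # sample standard deviation
        "hrrange": np.nanmax(hr) - np.nanmin(hr),
        "hrmean": np.nanmean(hr),
        "linaccmean": np.nanmean(mag),
        "linaccmax": np.nanmax(mag),
        "linaccmin": np.nanmin(mag),
        "linaccrange": np.nanmax(mag) - np.nanmin(mag),
    }
```

The NaN-aware reductions let a window with some missing samples still yield features, which matters because windows are kept as long as no more than 80% of their values are missing.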
Model assessment. Four models, including Random Forest (size of feature set = 10, number of trees = 50), XGBoost (maximum depth = 37, number of trees = 50, learning rate = 0.3), Logistic Regression (nLambda = 100) and non-linear SVM (degree = 3, C-Classification, Radial Basis Function), were trained to predict hyperarousal events. We generated a confusion matrix to assess the performance of each model. Model comparisons were conducted with a 5x2 cross validation test following the recommendations in Dietterich [54] to minimize the type 1 error. This method uses p values and t statistics to compare the algorithms. The null hypothesis states that there is no significant difference between the algorithms in terms of performance (average accuracy), whereas the alternative hypothesis states that one algorithm is more accurate than the others. Algorithms were further assessed with the Area Under the receiver operating characteristic (ROC) Curve (AUC).
Feature importance and model interpretation. The complexity of black box machine learning models and the need to make these models explainable necessitate an evaluation of the influence of an algorithm's features on its predictions. In this study, we used SHapley Additive exPlanations (SHAP) to address this. SHAP uses game theoretic concepts to allocate values to features in a model based on their importance in prediction [55]. SHAP values indicate how much each feature contributes to the prediction of the machine learning algorithm. SHAP value summary plots generate a feature importance list along with the distribution of each feature and show how each value affects the output of the model. This method is computationally efficient, consistent with human intuition, and interpretable for explaining class differences [56]. Using SHAP values to interpret machine learning algorithms has several advantages over using more traditional methods such as partial dependence plots [56]. For example, dependence plots do not usually show features' distributions, which may lead to misinterpreting regions with significant missing data. Conversely, SHAP plots show feature distributions. SHAP values also indicate how much a feature affects the output of the prediction by considering interaction effects, whereas partial dependence plots do not account for interactions between features.
Table 3 shows confusion matrices for the developed algorithms at three probability cutoffs. The first confusion matrix prioritizes hyperarousal detection, the second confusion matrix balances the true positive and false positive rates, and the third matrix prioritizes minimizing the false positive rate. The pairwise 5x2 cross validation test results showed that XGBoost significantly outperformed the Random Forest (t = -13.25, p < 0.001), SVM (t = -13.02, p < 0.001), and Logistic Regression (t = -11.97, Table 3, XGBoost showed the best performance in detecting PTSD hyperarousal events.
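The 5x2 cross validation comparison reported above relies on Dietterich's paired t statistic, which can be sketched as follows. The sketch assumes that the per-fold performance differences between two algorithms have already been computed from five replications of 2-fold cross validation; the function name `five_by_two_cv_t` is hypothetical. The resulting statistic is compared against a t distribution with 5 degrees of freedom.

```python
import math

def five_by_two_cv_t(diffs):
    """Dietterich's 5x2cv paired t statistic.

    diffs : five (d1, d2) pairs, one pair per replication, where d1 and d2
            are the score differences (algorithm A minus algorithm B) on
            the two folds of that replication.
    """
    if len(diffs) != 5 or any(len(pair) != 2 for pair in diffs):
        raise ValueError("expected five replications with two folds each")
    variance_sum = 0.0
    for d1, d2 in diffs:
        mean = (d1 + d2) / 2.0
        # Variance estimate for this replication.
        variance_sum += (d1 - mean) ** 2 + (d2 - mean) ** 2
    if variance_sum == 0.0:
        raise ValueError("all replication variances are zero")
    # Numerator is the difference on the first fold of the first replication.
    return diffs[0][0] / math.sqrt(variance_sum / 5.0)
```

The sign of the statistic depends on the order in which the two algorithms are subtracted, so a large negative value is as informative as a large positive one.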

PLOS ONE
significance of the feature's value in predicting the output, and the X-axis indicates how the feature affects the output of the model (that is, whether that feature with that specific value contributes to experiencing a hyperarousal event). The X-axis further indicates the log-odds of perceiving a PTSD hyperarousal event. According to the SHAP analysis, the most important body acceleration features are average body acceleration (linaccmean) and minimum body acceleration (linaccmin). The most important heart rate time domain features for predicting PTSD hyperarousal events are minimum heart rate (hrmin) and heart rate standard deviation (hrsd). Fig 4 shows the SHAP dependence plots for the two most important acceleration and heart rate features. SHAP dependence plots show the contribution of a specific feature to a model based on the feature's distribution. In these plots, each point represents an observation from the dataset, the X-axis shows the value of the feature for that observation, and the Y-axis shows the SHAP value for that feature, which indicates the effect of that feature with that specific value on the prediction. The unit of the X-axis is the same as the unit of the feature (for instance, beats per minute for heart rate measures), and the unit of the Y-axis is the log-odds of perceiving a PTSD hyperarousal event.
As shown in Fig 4, hyperarousal events are more likely to be observed with higher minimum heart rate values over the window. When the minimum heart rate is over 140 bpm, the risk of perceiving hyperarousal events increases. Also, as the average body acceleration and minimum body acceleration increase, the odds of detecting PTSD hyperarousal events decrease. Finally, higher heart rate standard deviation, i.e., higher heart rate fluctuation, increases the risk of hyperarousal events.
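To make the log-odds contributions discussed above more concrete, the following sketch computes exact Shapley values for a small toy model by averaging each feature's marginal contribution over all feature orderings. It is not the SHAP library and not the study's XGBoost model: the function name `shapley_values`, the toy value function, and all coefficients are invented for illustration only. The feature labels are borrowed from the article.

```python
from itertools import permutations

def shapley_values(value_fn, features):
    """Exact Shapley values: average marginal contribution over orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value_fn(frozenset(present))
            present.add(f)
            totals[f] += value_fn(frozenset(present)) - before
    return {f: totals[f] / len(orderings) for f in features}

def toy_log_odds(present):
    """Hypothetical log-odds output given which features are 'switched on'."""
    out = -3.0                                   # baseline log-odds
    if "hrmin" in present:
        out += 1.0
    if "hrsd" in present:
        out += 0.5
    if "hrmin" in present and "hrsd" in present:
        out += 0.4                               # interaction term
    if "linaccmean" in present:
        out -= 0.8                               # more movement lowers the odds
    return out

phi = shapley_values(toy_log_odds, ["hrmin", "hrsd", "linaccmean"])
```

In this toy case the interaction term is split evenly between `hrmin` and `hrsd`, and the three values sum to the difference between the full model output and the baseline, which is the additive property that SHAP plots rely on.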

Discussion
This study developed, evaluated, and explicated machine learning algorithms to predict PTSD hyperarousal events among veterans using smartwatch-based naturalistic heart rate and accelerometer data. The ground truth was subjectively reported PTSD hyperarousal events. After preprocessing the data, we trained four different algorithms: Random Forest, SVM, Logistic Regression and XGBoost. Among the developed algorithms, XGBoost was the most robust, yielding an AUC of 0.70 and over 81% accuracy. We ranked the most important features in the prediction process. The top three body acceleration features were average body acceleration, minimum body acceleration and range of body acceleration. The top three heart rate time domain features were minimum heart rate, standard deviation of heart rate and maximum heart rate. The initial analysis of the SHAP summary plot and SHAP dependence plots shows that heart rate and body acceleration features have nonlinear relationships with PTSD episodes. A deeper look into the SHAP plots indicates that as body acceleration increases, indicating more activity from the participant, the algorithm is less likely to predict a PTSD hyperarousal event. This result is consistent with prior studies demonstrating a significant relationship between increased physical activity and a reduction in PTSD hyperarousal events [57][58][59].
The SHAP dependence plot for the average heart rate data corroborates that when the heart rate is between 60 and 70 bpm, PTSD hyperarousal events are more likely to happen (cf. [60]). The SHAP summary plot indicates that heart rate standard deviation was one of the most important features contributing to the odds that the algorithm will predict a PTSD hyperarousal event. In particular, our findings suggest that as the heart rate standard deviation increases, i.e., as heart rate fluctuates more and in higher ranges, the odds of detecting a PTSD hyperarousal event increase. This result supports previous findings [22,61,62] which showed that during PTSD hyperarousal events, participants experience increased heart rate acceleration and fluctuation.
While previous research on objective assessment of PTSD exists, most of such research has been conducted in controlled lab settings by inducing external stimuli or so-called triggers (e.g., [62][63][64]). Naturalistic studies investigating common internally generated stimuli such as thoughts or flashbacks [65] are largely absent. In addition, while several studies have attempted to detect stress using physiological data, most of these studies have used Heart Rate Variability (HRV) features (e.g., [24,34,66,67]). Given the availability and non-intrusiveness of heart rate (or pulse rate) sensors, it is timely to investigate the efficacy of using heart rate features to monitor or predict stress and more specifically PTSD hyperarousal. However, to our knowledge, only one study [23] has used heart rate features to predict PTSD hyperarousal events using data collected in naturalistic settings.
This study builds on a prior study conducted by McDonald et al. [23], but there are several key differences that both enhance the algorithm and improve our understanding of algorithm performance. First, we expanded McDonald et al.'s dataset by conducting two additional field studies. Second, McDonald et al. indicated that one of the main limitations of their work was the use of downsampling for data preprocessing. We addressed this limitation by using an upsampling method for preprocessing the data. Third, McDonald et al. used frequency domain features of heart rate, such as coefficients of the Fourier decomposition of heart rate. In our analysis we used time domain features of heart rate, which may provide more tangible and interpretable results for this specific context [50,68]. This change may be beneficial for integrating the detection algorithm into a treatment device, because explaining predictions to veterans based on time domain features (e.g., standard deviation of heart rate) will be more understandable than explanations based on frequency domain features (e.g., the phase of the 5th Fourier component). Fourth, in our analysis we added body acceleration features to decrease the noise in the data by differentiating heart rate changes due to physical activity from fluctuations due to stress (similar to [27]), and we used a new XGBoost algorithm to train the machine learning model. We believe these changes have contributed to the significant improvement in the machine learning algorithm performance compared to the findings from McDonald et al. (81% in this study compared to 70% in [23]). Finally, McDonald et al. explain that one of their study limitations is the limited insight into the performance-shaping factors for the algorithm predictions. We addressed this issue by adding feature importance and SHAP analysis to improve the interpretability of the results.
Several limitations of this study should be addressed in future work. First, stress and PTSD hyperarousal events are highly idiosyncratic. A stimulus that triggers one individual may or may not trigger someone else. Because of the subjective and sustained characteristics of stress, defining the start, end, duration, and intensity of a hyperarousal event is an uncertain task [69]. As a result, it is complex and difficult to define and measure a ground truth for stress. Hyperarousal events might have been over- or under-reported due to the subjectivity of the perceived events. Individual differences such as gender, age, lifestyle and other factors can affect PTSD hyperarousal events; therefore, personalizing machine learning algorithms might boost their performance. Another issue in this study was the high number of missing heart rate values due to the naturalistic nature of the study. While the non-intrusiveness of smart watches makes them suitable for naturalistic data collection, smart watches, like most wrist-based sensors, use optical technology, the accuracy of which is affected by skin tone and proximity to the skin [70]. Future work may validate the findings presented here using more accurate sensors, such as chest straps that use electrical pulse. Lastly, although machine learning algorithms work in theory, external validation of these algorithms in naturalistic settings is necessary to evaluate their accuracy and applicability in real-world settings.
This article provides preliminary evidence of efficacy for data-driven real-time PTSD hyperarousal detection tools that can be used beyond clinic walls to remotely and continuously monitor veterans suffering from PTSD. In addition to the promise shown by the machine learning algorithm, in this article we used analytical techniques that identify the most important features contributing to such detection, thereby improving the interpretation of the outcomes and moving towards explainable ML tools for PTSD monitoring. Although other machine learning algorithms exist to detect stress, to the best of our knowledge, the algorithm documented in this paper is one of very few algorithms that are specific to PTSD. Work is in progress to validate this algorithm in longitudinal home studies using smart watches and smart phones.