Sensor-based measurement of critical care nursing workload: Unobtrusive measures of nursing activity complement traditional task and patient level indicators of workload to predict perceived exertion

Objective To establish the validity of sensor-based measures of work processes for predicting perceived mental and physical exertion of critical care nurses. Materials and methods Repeated measures mixed-methods study in a surgical intensive care unit. Wearable and environmental sensors captured work process data. Nurses rated their mental (ME) and physical exertion (PE) for each four-hour block, and recorded patient and staffing-level workload factors. Shift was the grouping variable in multilevel modeling where sensor-based measures were used to predict nursing perceptions of exertion. Results There were 356 work hours from 89 four-hour shift segments across 35 bedside nursing shifts. In final models, sensor-based data accounted for 73% of between-shift, and 5% of within-shift variance in ME; and 55% of between-shift, and 55% of within-shift variance in PE. Significant predictors of ME were patient room noise (ß = 0.30, p < .01), the interaction between time spent and activity levels outside main work areas (ß = 2.24, p < .01), and the interaction between the number of patients on an insulin drip and the burstiness of speaking (ß = 0.19, p < .05). Significant predictors of PE were environmental service area noise (ß = 0.18, p < .05), and interactions between: entropy and burstiness of physical transitions (ß = 0.22, p < .01), time speaking outside main work areas and time at nursing stations (ß = 0.37, p < .001), service area noise and time walking in patient rooms (ß = -0.19, p < .05), and average patient load and nursing station speaking volume (ß = 0.30, p < .05). Discussion Analysis yielded highly predictive models of critical care nursing workload that generated insights into workflow and work design. Future work should focus on tighter connections to psychometric test development methods and expansion to a broader variety of settings and professional roles. 
Conclusions Sensor-based measures are predictive of perceived exertion, and are viable complements to traditional task demand measures of workload.


Introduction
The increasing workload under which physicians and nurses operate in today's health care system adversely impacts patient outcomes (i.e., patient experience, [1] healthcare-acquired infections, [2] delays in treatment, [3] postoperative complications, [4] unplanned extubations, [5] and mortality [6,7]), workforce outcomes (i.e., burnout and job-dissatisfaction, [8] as well as turnover and disengagement from or exiting the professions [9][10][11]), and organizational efficiency and productivity. [12,13] Workload is the level of effort required to complete a task in relation to the resources available to expend on that task. [14,15] When demands exceed available resources, an individual's performance deteriorates. Despite the importance of workload, there remains a gap in strategies to measure it for health care professionals. Most methods rely on some form of staffing ratio [16,17] that inadequately represents workload. [18] Other workload measurement methods are observation [19] or self-report, [20] which are expensive and burdensome for respondents, respectively. To better understand and manage workload, more dynamic measurement is needed.
Recent advances in low-cost wearable and environmental sensors offer the potential for large-scale, unobtrusive measurement of work processes and related constructs. [21][22][23] Compelling feasibility studies demonstrate the potential utility of sensor data for understanding workforce issues [24,25] and patient data, [26,27] but few examples provide rigorous evidence that wearable and environmental sensors can validly measure work processes in vivo. [28] The objective of this study was to evaluate whether, after accounting for variance associated with traditional measures of workload, sensor-based measures of work processes could predict significant variance in nurses' perceived mental and physical exertion while performing demanding tasks in a surgical ICU.

Materials and methods
Data were collected in a surgical intensive care unit at an academic medical center in July and August, 2014. Eight critical care nurses from the unit were recruited through email and flyer notifications. The study was approved by the Johns Hopkins University School of Medicine Institutional Review Board.

Sensor-based measurement system
The sensor-based measurement system included wearable and stationary sensor badges equipped with radiofrequency identification (RFID) in the 2.4 GHz band (Sociometric Solutions, Inc., Boston, MA) [29] and an infrared sensor (TFDU4300, Vishay, Malvern, PA) that captured physical proximity and location. In addition, two omnidirectional micro-electrical-mechanical system (MEMS) microphones (SPM0103-NE3, Knowles Electronics, LLC, Itasca, IL) captured features of speech and environmental noise, and a three-axis MEMS accelerometer (ADXL330, Analog Devices, Inc., Norwood, MA) captured body movement and activity. Audio signals were filtered on board the sensor badge to extract speech features without saving the full signal.
Nurse participants wore a sensor badge and their location was detected through a network of 41 stationary sensor badges placed in 16 of 20 patient rooms, both nursing stations, and three service areas (medication, supply, and nutrition rooms; see Fig 1). A feature engineering process mapped sensor capabilities to nursing work processes using four separate one hour focus group sessions with eight RNs, four hours of observing nurses at work by an experienced human factors researcher, and review of an existing nursing task taxonomy for ICUs. [19] This process resulted in 72 features organized into seven high level categories: location-based (time in location, movement through physical space), accelerometer-based (body movement and activity in location), environmental noise (volume), speaking (time speaking, pitch, volume, burstiness [distribution of activity over time]), posture, walking (time and burstiness of walking), and temperature. All sensor-based measures and definitions are in Supplementary Methods A in S1 File.

Focus groups and survey instrument
We used nursing focus groups to identify patient and shift level drivers of workload most meaningful in our study unit. Human Factors professionals moderated a total of four sessions, each one hour in duration, with the eight RNs participating in this study. Nursing task demands identified by focus group participants and included in this study cover staffing factors (number of patients, number of patients assigned a sitter, and whether nurse had an assistant), and patient factors (e.g., number of assigned patients requiring specific care interventions; 8 variables). Composite measures of task demand are also reported in Supplementary Methods B in S1 File including descriptions of all variables and measurement definitions.
A brief survey instrument was developed to collect data on staffing and patient task demands, to elicit perceptions of physical exertion (PE) and mental exertion (ME), and to link the sensor data to the completed survey. Perceptions of exertion were measured using the 15-grade Borg Scale for rating PE (scale range, 6 = very, very light to 20 = very, very hard) [30] and a version modified to rate ME. The modified ME scale changed the referent of the survey item from 'physical' to 'mental' exertion, and the response scale was unchanged. We chose this approach because evidence has established the validity of concurrently measuring mental and physical exertion as related yet distinct sub-dimensions of an overall exertion construct. [31]

Data collection
At the beginning of each shift, a participating nurse retrieved a sensor badge stored on the unit and recorded the badge number on part one of the brief survey. During the shift, the sensors recorded features of the nurse's activity. Nurses rated their perceptions of PE and ME every four hours on the survey and recorded patient and staffing task demands at the end of the shift. Four-hour blocks within a shift were chosen because they corresponded to natural breaks in nursing workflow.

Data analysis
Data analysis proceeded in two phases: 1) feature selection, and 2) multi-level modeling (MLM). Feature selection was conducted to determine whether any of the 72 sensor-based measures were predictive of nurses' perceived ME or PE. Elastic net methods, which combine the least absolute shrinkage and selection operator (LASSO) and ridge regression penalties [32,33], were applied to select a parsimonious set of predictors for consideration in MLM. An extension of Elastic net [34] was used to explore all pairwise combinations of predictive features for significant interactions. Feature selection was performed in R (version 3.2) [35] using glmnet (version 2.0-2) [36] and glinternet (version 1.0.0) [37] packages. Elastic net methods do not account for clustering in the data; therefore, a more lenient shrinkage penalty was selected so important predictors were not eliminated at this stage. Subsequently, a traditional backwards elimination process was used in MLM with the shift grouping structure in place to further reduce the feature set.
Fig 1. Environmental sensor placements throughout the surgical intensive care unit. Sixteen patient rooms were each instrumented with two sensor badges, one immediately inside the room by the computer terminal and the second on the wall opposite the door; two nursing stations were each instrumented with three sensor badges; and three service areas (medication, supply, and nutrition) were each instrumented with one sensor badge. The service areas were relatively isolated from other work areas and redundant sensors were not needed for accurate localization. Two low occupancy and two isolation rooms on the unit were excluded from this study.
MLM was used to evaluate the predictive validity of sensor-based measures, and was conducted in R (version 3.2) using nlme (version 3.1-122) [38] and multilevel (version 2.5) [39] packages. MLM was chosen to account for the non-independence of data collected in four-hour segments within a shift and to test cross-level interactions between task demand variables and sensor-based measures. Shift was the grouping variable used to analyze perceptions of ME and PE as dependent variables, sensor-based measures of work processes as Level 1 predictors, and task demand workload variables as Level 2 predictors. All sensor-based measures were grand mean centered prior to analyses. Intraclass correlation coefficients (ICC) measured the proportion of variance between different shifts relative to four-hour segments within the same shift. Model deviance was computed to compare the fit of nested models using a likelihood ratio test. An alpha level of 0.05 was used for assessing significance. Supplementary Methods C in S1 File provides full detail on data analysis methods.
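The deviance comparison described above can be sketched in a few lines. This is only an illustration, not the authors' R code: the function names and the two deviance values in the usage note are hypothetical, and the chi-square tail probability is computed with a standard recurrence for integer degrees of freedom rather than with a statistics library.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if x < 0 or df < 1:
        raise ValueError("x must be >= 0 and df a positive integer")
    half = x / 2.0
    # Closed-form base cases: df = 1 (odd) and df = 2 (even).
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1
    else:
        q, k = math.exp(-half), 2
    # Recurrence: Q(k + 2) = Q(k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1).
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

def likelihood_ratio_test(deviance_reduced, deviance_full, df_diff):
    """Return (chi-square statistic, p-value) for two nested models."""
    stat = deviance_reduced - deviance_full
    return stat, chi2_sf(stat, df_diff)

# Hypothetical deviances chosen so that the difference is 6.60 on 1 df.
stat, p = likelihood_ratio_test(250.0, 243.4, 1)
print(round(stat, 2), round(p, 3))
```

A deviance difference of 6.60 on one degree of freedom gives a p-value of about .01, which is the order of magnitude reported for the model comparisons in the Results.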

Results
Our analysis included 89 four-hour shift segments across 35 bedside nursing day shifts (between 7 AM and 7 PM), totaling 356 work hours of data collected in July and August, 2014. Seventy percent (62/89) were weekday shift segments.

Feature selection
Elastic net analyses selected 23 variables related to ME from the initial pool of 72 (listed in Supplementary Methods D in S1 File) and 6 interaction terms, as well as 14 variables related to PE and 6 interaction terms. Each of the features retained as either a main effect or interaction term is indicated in Supplementary Methods D in S1 File. Two tuning parameters are used for Elastic net: α, which specifies the degree of mixing of penalties from LASSO and ridge regression, and λ, which controls the degree of shrinkage. α ranges from 0 to 1, and λ is non-negative. An α of 0 indicates a pure ridge regression penalty, and an α of 1 indicates a pure LASSO. Values in between indicate a proportional mixing of the penalties. For these analyses, α was set at .9, which more heavily weighted the LASSO penalty. A λ value of 0 indicates no shrinkage is performed, and increasing values indicate more severe shrinking of coefficients. For these analyses, the lambda.min value returned by glmnet's cross-validation routine identified the λ values that minimized cross-validation error (for PE: λ = 0.18; and for ME: λ = 0.19).
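To make the roles of α and λ concrete, the following minimal sketch evaluates the Elastic net penalty in the parameterization documented for glmnet, λ[(1 − α)/2 · Σβ² + α · Σ|β|]. The function name and the coefficient vector are hypothetical and chosen only for illustration; the intercept is assumed to be excluded from the penalty.

```python
def elastic_net_penalty(coefs, alpha, lam):
    """Elastic net penalty: lam * ((1 - alpha) / 2 * sum(b^2) + alpha * sum(|b|))."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if lam < 0.0:
        raise ValueError("lam must be non-negative")
    l1 = sum(abs(b) for b in coefs)          # LASSO part
    l2_squared = sum(b * b for b in coefs)   # ridge part
    return lam * ((1.0 - alpha) / 2.0 * l2_squared + alpha * l1)

# Hypothetical coefficients; alpha and lambda match the values used for PE.
beta = [1.0, -2.0, 0.0]
print(elastic_net_penalty(beta, alpha=0.9, lam=0.18))  # mostly LASSO
print(elastic_net_penalty(beta, alpha=1.0, lam=0.18))  # pure LASSO
print(elastic_net_penalty(beta, alpha=0.0, lam=0.18))  # pure ridge
```

With α = .9, almost all of the penalty comes from the absolute-value term, which is what drives weak predictors to exactly zero and yields a sparse feature set.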

Multi-level modeling (MLM)
Tables 1 and 2 detail results of MLM for ME and PE, respectively. Level 1 variables included sensor-based measures as predictors and perceived exertion as dependent variables collected for each four-hour shift segment. Level 1 variables were grouped within shift, and Level 2 variables were task demands associated with that specific shift such as the number of patients cared for and their status level. We detail each step of the MLM process below, followed by a summary of the final ME and PE models.
ICC values supported the use of shift as the grouping structure for ME (ICC = 0.63) and PE (ICC = 0.57), indicating that 63% of total variance in ME and 57% in PE occurred between shifts. Group mean reliability exceeded the standard of 0.7 for both ME (0.81) and PE (0.76). Both ME0 (χ²(1) = 27.30, p < .001) and PE0 (χ²(1) = 24.98, p < .001) had significantly better fit than models without the shift grouping variable.
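As a rough illustration of how an ICC and a group mean reliability can be obtained, the sketch below uses the one-way ANOVA estimators often labeled ICC(1) and ICC(2). It assumes equal numbers of segments per shift, which was not the case in this study (89 segments across 35 shifts), and the ratings are invented; it is not the computation performed by the multilevel package.

```python
def icc_from_groups(groups):
    """Return (ICC1, ICC2) from a one-way ANOVA on equally sized groups."""
    n_groups = len(groups)
    k = len(groups[0])  # segments per shift
    if n_groups < 2 or k < 2 or any(len(g) != k for g in groups):
        raise ValueError("this sketch assumes >= 2 equally sized groups of size >= 2")
    grand_mean = sum(sum(g) for g in groups) / (n_groups * k)
    group_means = [sum(g) / k for g in groups]
    # Mean squares between and within groups.
    ms_between = k * sum((m - grand_mean) ** 2 for m in group_means) / (n_groups - 1)
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    ) / (n_groups * (k - 1))
    icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    icc2 = (ms_between - ms_within) / ms_between  # reliability of group means
    return icc1, icc2

# Hypothetical exertion ratings: three shifts, two four-hour segments each.
ratings = [[10, 12], [14, 16], [18, 20]]
icc1, icc2 = icc_from_groups(ratings)
print(round(icc1, 3), round(icc2, 3))
```

In this toy data most of the spread lies between shifts rather than within them, so both values are high; an ICC near zero would instead suggest that grouping by shift adds little.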
Models ME1 and PE1 added Level 1 sensor-based predictors. To generate ME1 and PE1, all features retained from Elastic net analysis (i.e., 23 main effects and six interaction terms for ME; 14 main effects and six interaction terms for PE) were added to the respective ME0 or PE0 model, which included the shift grouping structure, and a traditional backward elimination process was performed. ME1 and PE1, as detailed in Tables 1 and 2 respectively, represent the end of the backward elimination feature reduction process. This process produced a model for ME with one significant main effect term and one significant interaction term, accounting for 28% of between-shift and 8% of within-shift variance. Model PE1 was reduced to four interaction terms (Table 2, predictors 10 to 13), accounting for 65% of between-shift and 24% of within-shift variance. Models ME1 and PE1 were significantly better fitting models compared to ME0 (χ²(4) = 14.27, p = .007) and PE0 (χ²(12) = 43.25, p < .001), respectively.
Models ME2 and PE2 added task demands documented by nurses working that shift (Level 2). One task demand, number of patients on an insulin drip, was a significant predictor of ME, producing a model that accounted for 44% of between-shift and 5% of within-shift variance. Model ME2 exhibited a significantly better fit compared to ME1 (χ²(1) = 6.60, p = .01). No task demand predictors were retained for PE. Therefore, PE2 was equivalent to PE1.
Table 1 notes. Proportion of variance explained for Level 1 (within shifts) and Level 2 (between shifts) was calculated relative to ME0, the model with only a grouping variable; proportion of slope variation explained was calculated relative to ME3, the model with random coefficients but no cross-level interactions. (d) Reduction in model deviance was tested as follows: ME0 contained only the shift grouping variable and was tested against a model without the grouping variable; ME1 included Level 1 sensor-based measures and was tested against ME0; ME2 included Level 2 task demand fixed effects and was compared to ME1; ME3 included random coefficients for Level 1 sensor-based measures and was evaluated relative to ME2, the model without random coefficients; ME4 included cross-level interactions between Level 1 and Level 2 predictors and was evaluated relative to ME3.
One significant random coefficient term was retained in Models ME3 (burstiness of speaking) and PE3 (volume while speaking at nursing stations), producing significantly better fitting models compared to ME2 (χ²(3) = 7.93, p = .05) and PE2 (χ²(2) = 9.08, p < .05), respectively. Model ME3 accounted for 66% of between-shift and 5% of within-shift variance. Model PE3 accounted for 53% of between-shift and 62% of within-shift variance. In Model PE3, a previously significant main effect term (temperature in service areas) and interaction term (volume while speaking at nursing stations by temperature in service areas) became non-significant and were excluded from further analysis.
One significant cross-level interaction was retained in Models ME4 and PE4. The final model for ME, ME4, included a significant and positive cross-level interaction between a Level 2 task demand variable, number of patients on an insulin drip, and a Level 1 sensor-based measure, burstiness of speaking (ß = 0.19, p < .05), as well as the interaction term between two Level 1 sensor-based measures (time spent and activity levels outside of main work areas, ß = 2.24, p < .01) and a Level 1 main effect term (environmental noise in patient rooms, ß = 0.30, p < .01). We defined the main patient care or work areas as patient rooms, nursing stations, and service areas. Areas outside of these main patient care areas included unit halls, locker room, and conference or break room areas. Main work areas were instrumented in this study, and areas in the unit outside of these were not. This model accounted for 73% of between-shift and 5% of within-shift variance, and for 75% of the variation in slopes between burstiness of speaking and mental exertion across shifts. The final model for PE, PE4, included a significant and positive cross-level interaction between a Level 2 task demand variable, average patient load, and a Level 1 sensor-based measure, volume while speaking at the nurses' station (ß = 0.30, p < .05), as well as three interaction terms between Level 1 sensor-based measures (entropy by burstiness of physical transitions, ß = 0.22, p < .01; time speaking outside of main work areas by time at nursing stations, ß = 0.37, p < .001; environmental noise in service areas by time walking in patient rooms, ß = -0.19, p < .05) and one Level 1 main effect (environmental noise in service areas, ß = 0.18, p < .05). This model accounted for 55% of between-shift and 55% of within-shift variance, and for 41% of variation in slopes between volume while speaking at the nurses' station and physical exertion across shifts. Both Models ME4 and PE4 had large decreases in model deviance. This decrease was significant for Model ME4 (χ²(1) = 5.04, p < .05), but model complexity (degrees of freedom lost due to including non-significant main effect terms for multiple interaction terms) precluded significance testing for Model PE4. Significant Level 1 and cross-level interactions for final reduced models are depicted in Fig 2A and 2B (ME4) and Fig 3A through 3D (PE4). The interaction plots detailed in Figs 2 and 3 were generated with the R package sjPlot [40].
Table 2 notes. Proportion of variance explained for Level 1 (within shifts) and Level 2 (between shifts) was calculated relative to PE0, the model with only a grouping variable; proportion of slope variation explained was calculated relative to PE3, the model with random coefficients but no cross-level interactions. (d) Reduction in model deviance was tested as follows: PE0 contained only the shift grouping variable and was tested against a model without the grouping variable; PE1 included Level 1 sensor-based measures and was tested against PE0; PE2 included Level 2 task demand fixed effects and was compared to PE1; PE3 included random coefficients for Level 1 sensor-based measures and was evaluated relative to PE2, the model without random coefficients; PE4 included cross-level interactions between Level 1 and Level 2 predictors and was evaluated relative to PE3.
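The proportions of variance explained reported for each model are stated to be relative to the grouping-only model. One common way to compute such a figure is the proportional reduction in a variance component, sketched below. The variance components used here are invented, since the article text does not report them; only the resulting proportion (73% of between-shift variance in ME) comes from the Results.

```python
def proportion_variance_explained(var_null, var_model):
    """Proportional reduction in a variance component relative to the null model."""
    if var_null <= 0:
        raise ValueError("null-model variance must be positive")
    return (var_null - var_model) / var_null

# Hypothetical between-shift variance: 4.0 in the grouping-only model,
# 1.08 after adding predictors, i.e. a 73% reduction.
print(round(proportion_variance_explained(4.0, 1.08), 2))
```

The same function applies to the within-shift (Level 1) residual variance and, with the random-coefficient model as the reference, to slope variance.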
Variables were centered by subtracting the mean value of that variable, and then scaled by dividing values by the standard deviation. Each plot was constructed by plotting the relationship between two of the three interaction terms while holding the third moderator variable constant at the upper (maximum value depicted in blue) and lower (minimum value depicted in red) bounds. Fig 2A illustrates a cross-level interaction where burstiness of speaking, a Level 1 predictor, becomes a stronger predictor of mental exertion with increasing numbers of patients on an insulin drip, a Level 2 predictor. Fig 2B shows the positive interaction between two Level 1 predictors, activity levels and time spent outside of main work areas.
Fig 3. Significant interactions in the final physical exertion model (PE4). Panel A: Level 1 interaction between entropy and burstiness of physical transitions indicates that work shifts with physical transition events that are both highly unstructured (high entropy) and bursty (high clumping together of transition events in time) are more physically exerting. Panel B: Level 1 interaction illustrates the positive conditional effects in which higher levels of time at nursing station and higher levels of time speaking outside of main work areas are associated with higher levels of physical exertion. Panel C: Level 1 interaction illustrates a negative effect in which less time walking in the patient rooms and high levels of environmental noise in service areas were associated with higher physical exertion. Panel D: Cross-level positive interaction between volume of speaking at the nursing station (Level 1) and average patient load (task workload factor) indicated that a general vocal stress indicator (speaking volume) is only significantly associated with physical exertion when localized to the nursing station and when caring for more complex patients. https://doi.org/10.1371/journal.pone.0204819.g003

Discussion
In the final reduced models for mental and physical exertion, sensor-based measures of work processes accounted for large proportions of unique variance above and beyond task demand variables typically used for evaluating workload (i.e., task demands derived from patient and shift level factors). These findings support the further development of these technologies for workforce management issues in healthcare. The significant cross-level interactions in our models are consistent with existing multi-level frameworks of nursing workload [41] in that relationships between different work processes and perceived exertion changed based on higher level task demand workload factors.

Main study findings
The final model for ME included noise in patient rooms, an interaction between time spent and activity levels outside the main areas, and an interaction between number of patients on an insulin drip and burstiness of speaking. Environmental noise is a well-documented stressor with a positive relationship with perceptions of workload. [42,43] As illustrated in Fig 2B, the positive interaction between time and activity levels outside of main work areas indicated that the more time a nurse spent away from patient rooms, nursing stations, or service areas while activity levels in those outside areas were high, the higher the nurse's mental exertion. High activity levels outside the main patient care areas could mean the nurse was searching for team member support or supplies, while low levels of activity could indicate downtime. For example, high activity outside main work areas could involve walking up and down the unit halls to seek assistance, and low activity outside main work areas could involve socializing in a break room.
As illustrated in Fig 2A, the positive interaction between number of patients on an insulin drip and the burstiness of speaking indicated that certain sensor-based measures were predictive of mental workload when caring for patients requiring specific care interventions. The burstiness of speaking is a measure of the temporal distribution of time spent speaking. Higher levels of burstiness of speaking mean speaking is more clumped together in time, with periods of relative intensity and sparseness, and lower levels mean a more even distribution of speaking over time. Insulin infusion protocols improve outcomes for ICU patients, [44] but the nursing workload associated with these complex protocols is known to be high. [45] For each patient on an insulin drip, a nurse must assess blood sugar levels, make complex calculations, enter changes into the infusion pump, document all information, and find a second nurse to independently double check the completeness of the steps. Fig 2A illustrates that this bursty social dynamic, potentially an indicator of interruptions [46] or challenges in finding an available nurse to perform the independent double check, was only significantly associated with mental exertion in the context of managing patients on insulin drips.
The final reduced model for physical exertion had one significant main effect and positive as well as negative interactions. First, environmental noise within service areas was the significant main-effect predictor of physical exertion. This potentially indicates congestion in these areas contributing to perceptions of physical effort.
Second, as shown in Fig 3A, the interaction between entropy of transitions and burstiness of transitions pertained to patterns of movement through physical space. Entropy of transitions was calculated using the Shannon entropy of the time series of physical locations. Higher levels of entropy indicated less predictability in the sequence of transitions. The burstiness of transitions characterized the temporal variation of movement events from one physical space to another. Higher levels of burstiness indicated more clumping in time of movement between physical areas. Shift segments where physical transitions were both unstructured and clumped in time were more physically exerting.
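The two transition measures can be illustrated with a short sketch. The exact definitions used in the study are in the supplementary methods and are not reproduced here, so two assumptions are made: entropy is taken as the Shannon entropy of how often each location appears in the sequence, and burstiness is taken as the coefficient (σ − μ)/(σ + μ) of the gaps between transition events, a widely used definition that ranges from −1 (perfectly regular) toward 1 (highly clumped). Function names and data are hypothetical.

```python
import math
import statistics
from collections import Counter

def shannon_entropy(locations):
    """Shannon entropy (bits) of the distribution of visited locations."""
    counts = Counter(locations)
    total = len(locations)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def burstiness(event_times):
    """(sd - mean) / (sd + mean) of the gaps between sorted event times."""
    gaps = [b - a for a, b in zip(event_times, event_times[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three events")
    mean_gap = statistics.mean(gaps)
    sd_gap = statistics.pstdev(gaps)
    if mean_gap + sd_gap == 0:
        raise ValueError("all events occur at the same time")
    return (sd_gap - mean_gap) / (sd_gap + mean_gap)

# Hypothetical four-hour segment: location sequence and transition times (minutes).
print(shannon_entropy(["room", "room", "station", "station"]))
print(burstiness([0, 1, 2, 3, 4]))   # evenly spaced transitions
print(burstiness([0, 1, 2, 3, 50]))  # transitions clumped at the start
```

Under these assumptions, a segment scores high on both measures when the nurse visits many locations with similar frequency and moves between them in irregular clusters rather than at a steady pace.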
Third, as illustrated in Fig 3B, the positive interaction between high levels of time speaking outside of the main work areas and high levels of time spent at nursing stations could indicate care of more complex patients, requiring more documentation and coordination of activities, thereby compressing physical activity in the patient rooms into less time.
A fourth and also challenging interaction to interpret was the negative interaction between environmental noise in service areas and time walking in patient rooms, as shown in Fig 3C. This relationship could indicate that a busier service area (more congested and noisier) and more physical activity in the patient room combined to impact physical workload. These are two areas that require the most physical activity from nurses (e.g., patient handling and procedures in the patient room; moving and lifting supplies in service areas).
Fifth, as illustrated in Fig 3D, the interaction between average patient load and volume while speaking at the nursing stations indicated increased strength in the relationship between speaking volume at the nursing station and physical exertion in shifts where patients required more care interventions. Speaking intensity or volume is a feature of speech commonly associated with stress. [47] In our study, volume was only predictive of perceived physical exertion when localized to the nursing station and related to the level of monitoring and intensity of task demands made on the nurse.

Implications for future research and practice
This study demonstrated the predictive validity of sensor-based measures. Some features were clearly meaningful, while the interpretation of others was more challenging. A tighter integration with existing psychometric test development processes is needed to help ensure that the content of these sensor-based systems is indicative of the construct purported to be measured. With a refined system, sensor-based measures could be used to guide more fine-grained workflow analyses [48] to identify recurrent trends and target these areas for further investigation and improvement efforts, or as real-time feedback to help staff on the unit self-regulate and balance workload. [49] Projections indicate severe shortages in nurse [50] and physician [51] workforces for decades to come. Identifying mechanisms to improve productivity and retention could yield substantial savings. By better understanding workload in real time, managers can provide lateral support to reduce workload and ultimately create a safer and more productive work environment.

Limitations
This study was conducted in one surgical critical care unit in an academic medical center. Larger datasets collected across multiple critical care units in different facilities, including a wider range of task demands and clinical roles, will be required to establish the generalizability of sensor-based measurement features across settings and personnel. For example, burstiness of speaking was related to mental exertion only when managing patients on insulin drips. This dynamic could be indicative of higher levels of workload in other situations, but this study was underpowered to detect the effect. While this study drew from existing models of nursing workload, qualitative focus groups, and observational methods, it remained largely exploratory. Advances in integrating sensor-based measurement within the psychometric test development framework will be necessary to support more prospective measure development and validation. In contrast to traditional assessment methods (e.g., self-report and observation) with established best practices and methods, these types of wearable and environmental sensors have generally unknown error structures, [52] and systematic device-related variance has been demonstrated in other research for sensors like those used in this study. [53] Additionally, detecting naturalistic speech and isolating it to the sensor wearer (vs. others speaking in the area) may be particularly difficult and potentially error prone. We did not formally assess the reliability of all sensor features used in analyses reported here.

Conclusion
Sensor-based measurement systems are valuable tools for understanding performance in complex socio-technical systems. These methods have the potential to enhance patient safety, improve productivity and reduce burnout among nurses. This approach may be applied to physicians and other health care workers and extended to other types of organizational performance, such as coordination and teamwork which are known drivers of safety and quality yet difficult to measure on a large scale with currently available methods.