Impact of delayed response on wearable cognitive assistance

Wearable cognitive assistants (WCA) are anticipated to become a widely-used application class, in conjunction with emerging network infrastructures like 5G that incorporate edge computing capabilities. While prototypical studies of such applications exist today, the relationship between infrastructure service provisioning and WCA usability is largely unexplored, despite the relevance of these applications for future networks. This paper presents an experimental study assessing how WCA users react to varying end-to-end delays induced by the application pipeline or infrastructure. Participants interacted directly with an instrumented task-guidance WCA as delays were introduced into the system in a controllable fashion. System and task state were tracked in real time, and biometric data from wearable sensors on the participants were recorded. Our results show that periods of extended system delay cause users to correspondingly (and substantially) slow down in their guided task execution, an effect that persists for a time after the system returns to a more responsive state. Furthermore, this slow-down in task execution is correlated with a personality trait, neuroticism, associated with intolerance for time delays. We show that our results implicate impaired cognitive planning, as contrasted with resource depletion or emotional arousal, as the cause of slowed task execution under system delay. These findings have several implications for the design and operation of WCA applications and of computational and communication infrastructure, as well as for the development of performance analysis tools for WCA.


Introduction
Wearable Cognitive Assistants, WCA for short, have recently started to garner attention from the research community [1,2]. They represent a novel category of highly interactive and context-sensitive augmented reality applications that aim to augment human cognition in both day-to-day tasks and professional settings. Their mode of operation is analogous to how GPS navigation systems guide drivers, by seamlessly providing relevant instructions and feedback relating to the task at hand. Note that this implies seamless interaction with the context of the user: at no moment does the user need to trigger an update explicitly, as the application is constantly tracking the state of the target task. An example is an IKEA assistant [3] which monitors the assembly of a piece of furniture in real time, providing timely, step-by-step feedback to guide the user toward completion. WCA systems show great potential for two main use-cases. One is providing quality-of-life improvements to the millions of people around the world affected by some form of cognitive decline. WCA can, for instance, provide assistance to people recovering from traumatic brain injuries, smoothly guiding them through day-to-day interactions with the world which would otherwise be extremely challenging.
The second use-case is as a companion tool for specialists and as a means for guiding their training. Non-wearable augmented reality and cognitive assistance systems have already proven to be valuable tools in the industrial workplace [4,5]. Detethering this assistance from a fixed location promises to make it available to many more fields.
Based on these use cases, we identify three main requirements for WCA:
1. WCA systems should be available whenever the user requires them, without being tethered to a particular physical location. Assistants need to be pervasive and mobile.
2. Interaction with the system should be immersive and seamless, i.e. the assistant should be able to analyze the current context and automatically provide relevant feedback without explicit commands from the user. In this sense, WCA is expected to operate much like a human assistant would, by observing the performance of the user and offering guidance proactively.
3. Feedback should be "quick", relative to the task at hand. This requirement is further strengthened by the previously mentioned "seamless interaction" characteristic of these systems, which means that users will expect constant, immediate feedback as they progress through the task. In the case of a step-by-step task like IKEA assembly, delayed feedback might simply confuse or distract the user. However, in a highly interactive task like a Ping-Pong assistant [2,6], late guidance is at best a nuisance and at worst a severe handicap.
Item 1 implies the use of lightweight and low-power devices, preferably a wearable device that frees both hands for work. Item 2, on the other hand, suggests a level of context sensitivity and proactivity that can only be provided by real-time analysis of sensor inputs such as video and audio feeds. The compute-intensive processing suggested by Item 2 cannot be met by the lightweight wearable devices suggested by Item 1. Only by offloading computation from a wearable device to cloud-based or edge-based infrastructure can this circle be squared. However, offloading implies an extended end-to-end pipeline with many potential sources of queueing, transmission, and processing delays. Item 3 therefore emerges as a key concern, requiring deep understanding of the impact of end-to-end delays on WCA users.
Item 3 forms the base motivation for the research presented in this paper. We still have a very limited understanding of how humans react to delays in these systems; specifically, of how changes in system responsiveness impact users. System responsiveness here denotes a qualitative scale ranging from high (i.e. not subject to delay, or subject to negligible delay with respect to human perception) to low (i.e. subject to considerable delay). Characterizing the relationships between system responsiveness and user behavior and experience is of paramount importance for the design and evaluation of these applications. A clear understanding of these relationships would allow, for instance, the development of load-balancing and optimization strategies for large-scale deployments of WCA.
This paper builds upon preliminary work in the field of time perception and delay characterization of WCA. We expand upon the findings of Ha et al. [1], who identified the need for low-latency offloading in WCA, and of Chen et al. [7], who outlined the bounds for "noticeable" and "unbearable" latencies in these systems. While these bounds provide a general understanding of when a user is likely to abandon an application, they do not provide any insight into what happens before that point, i.e. how human behavior changes with system responsiveness. We aim to tackle this question through the characterization of human responses to delays in the application pipeline, using latencies in the range defined by the previously established bounds. This is an important step toward a more systematic understanding of human behavior in this domain.
We present in this paper an experimental WCA test-bed of our own design. This test-bed was subsequently employed in a study in which undergraduate students were asked to interact with and follow the instructions given to them by a cognitive assistant. Unbeknownst to the participants, we altered the responsiveness of the system in real time and recorded the resulting behavioral and physiological reactions. The participants wore an array of biometric sensors measuring physiological responses that have proven useful in assessing cognitive workload during human-computer interaction [8,9], such as heart rate and EEG.
Our results indicate that reduced responsiveness in WCA systems leads to a disruption of participants' cognitive plan for the task. This is evidenced by an emergent pacing effect on user actions as system responsiveness is reduced. While it may seem self-evident that users take longer to complete a task when using a system with low responsiveness, since they have to wait longer for new instructions, our study found that user slow-down represents a source of substantial additional delay. To be more precise, the data indicate that users slow down not only because they have to wait for the system to catch up, but also because their reactions to new instructions are delayed. Furthermore, this effect persists for a while after system response improves, and it is modulated by intrinsic personality traits, in particular neuroticism [10], which has previously been connected to intolerance for time delay [11].
We believe that these results have concrete and relevant implications for WCA design, deployment, and optimization. One example is the behavioral slow-down: since it extends application runtime significantly, it has clear and direct implications for resource and power consumption. Another is the fact that the adverse effects of delay on users do not immediately subside as delay is diminished, which has potential consequences for resource allocation strategies. Moreover, in multi-user scenarios, the dependency of user slow-down effects on delay means that efficient resource allocation across applications potentially looks different from what could be considered "fair".
Our hope is that these results prove useful for the understanding and optimization of WCA deployments. They represent unexpected, valuable findings, which can be employed to model and understand how users react to latency in applications and systems, and to develop resource allocation and power optimization strategies. Additionally, we hope that these results pave the way for the improvement of performance evaluation tools such as our previous work in Olguín Muñoz et al. [12,13]. Such systems would greatly benefit from this knowledge, as it would allow for the design and implementation of realistic models of human behavior, making highly accurate benchmarks a reality in the domain of WCA.
The structure of this paper is as follows. We describe the existing body of research on time perception and the effects of delay on human performance in Section 2. Section 3 presents the experimental design, measures, and specific protocol. Then, in Section 4 we detail the results of our experiment. Implications for further modeling of the effects of delay are presented in Section 5, before we conclude the paper in Section 6.
2 Background and Related Work

Time perception in computing systems
The question of how people respond to delay in a computer system is grounded in how people perceive time. Time perception has been described as regulated by an attentional gate that, when opened, starts a cognitive pulse counter [14,15]. More recent research indicates, however, that duration perception is highly malleable and the result of multiple timing mechanisms found in overlapping, flexible neural systems [16,17]. The estimation of an event's duration varies with context of various types: (i) events subsequent to a long or short interval are contracted or extended, respectively [18]; (ii) repeated events tend to be perceived as shorter than novel ones [19]; and (iii) arousal can expand durations [20].
Expectations play a critical role in time perception as well [14,15]. It has been shown that people have a general tendency to be hypersensitive to delays in worse-than-expected states, and under-sensitive to meeting or exceeding expectations [21]. Accordingly, failures to meet expected fast response times tend to be experienced as highly negative, whereas fast responses go unnoticed. Violations of expectancy have a strong impact on the acceptability of computer systems. Users of a computer system anticipate the latency of events, and these standards only become more stringent as systems improve in response time. In immersive systems like WCA, which aim to provide seamless interaction, delays are particularly noticeable.
It has long been recognized that slow system response times can undermine cognitive processing, slow the pace of users, and lead to stereotyped behavior and errors, as well as cause negative emotional consequences [22]. However, standards for what constitutes a tolerable delay have changed dramatically compared to three decades ago, when delays on the order of 10 s were deemed acceptable [23,24,25]. Today's usage contexts, and WCA in particular, often demand response times orders of magnitude shorter.
For WCA, the acceptable range of latencies was explored by Chen et al. [7], who constructed assistants for tasks with a range of time constraints, including step-by-step tasks and more interactive contexts like playing Ping-Pong against a human opponent. They then proposed latency tolerance zones according to the task demands. For an essentially self-paced task like LEGO assembly, they found two key ranges of latency: unnoticeable, 0 to 0.6 s; and impaired, 0.6 to 2.7 s. Beyond that, users could begin to show the negative outcomes previously catalogued [22].

Potential Mechanisms Relating Delay to Human Performance
While behavioral changes and negative interaction outcomes have been well documented in prior research on system delay, the specific mechanisms that mediate these outcomes are less well understood. These mechanisms could be cognitive or emotional in origin.
Research on cognitive and motor planning suggests that delay may move users from relatively automatic to more attention-demanding processing. Cognitive and motor tasks are commonly described as a hierarchical system, progressing from high-level goals to the sequence of commands that accomplishes them.
As competency in a task increases, execution of the hierarchy becomes increasingly automated. Automatization has been described from a computational perspective in Anderson's ACT-R model as the compiling of multiple productions into one [26]. Neural measurements indicate that with automaticity, control moves from frontal brain areas to more posterior ones [27,28], and similar distinctions have been related to temporal processing [29,30,31].
Although activities guided by a WCA are not simple motor actions, immediate feedback after each of a series of repeated actions should promote development and automatic execution of a hierarchical plan. Delays, in contrast, would disrupt such a plan through the loss of automated control [31].
An alternative view of delay effects appeals to emotional systems rather than cognitive processes. As users of a system become emotionally aroused by delay, they may be subject to generalized arousal, causing decrements in performance [31].
A third potential explanation of delay effects is what has been called "ego depletion", the notion that expending effort on self-control eliminates resources needed for further effort [32,33].
The various processing accounts of delay effects predict different outcomes, which we will consider in the context of the current data. If delay increases attentional demands on cognitive processes, responses should be slowed and errors expected, particularly on time-critical tasks. Generalized arousal triggered by emotional stress from delay should emerge in physiological measures, such as increased heart rate or skin conductivity. Arousal can also reduce movement smoothness or add erratic gestures [34]. Ego depletion has been found to produce premature responses culminating in error [33], or to lead to abandoning a task entirely [32].
Overarching prescriptions for tolerable system response time have tended not to take into account individual differences among users with respect to salient variables like cognitive ability or personality. Relevant research can be found in studies of delay discounting, the tendency to devalue rewards for which one must wait. High discounting rates, indicative of waiting intolerance, have been associated with negative social and academic outcomes. Hirsh, Morisano & Peterson [11] found that higher discounting was associated with extraversion among those with low cognitive function, whereas lower discounting was associated with emotional stability (low neuroticism) for people with high cognitive function. Among computer system users, who tend to have relatively high cognitive ability (which presumably describes the present experimental population), this points to neuroticism as a personality factor that might modulate tolerance for waiting. Extraversion could also be a moderating factor among the broader target audience of WCA, which is intended for relatively inexperienced users of an application. These and other measures of individual variation were considered here.

Experimental Design
The core elements of our experiment are shown in Fig. 1:
• Subjects interact with a WCA while wearing an array of biometric sensors.
• The responsiveness of the application, i.e. the interval of time between an input being provided to the system and the associated output returned to the user, is manipulated in real time. The effects of these manipulations on the subjects are recorded and subsequently analyzed.
This study was conducted with the approval of the Carnegie Mellon University Institutional Review Board. Subjects were recruited from a pool of undergraduate students fulfilling a course requirement at Carnegie Mellon University. No particular exclusion criteria were applied. In total, 40 participants were recruited, all of them of college student age (18 to 25 years old).

Fig 1. Experimental test-bed. Participants interact with the cognitive assistant through task-related inputs and outputs; in practice, these correspond to the video feed captured by the assistant and the instructions provided by it. The assistant itself has been instrumented with a data collection layer, which collects and processes experiment-related data such as biometric signals from the participants (these are merely processed here and do not form part of the inputs to the cognitive assistant), and a delay buffer, which introduces controlled delays in the transit of information from the core processing component.

The Cognitive Assistance Application
We used a modified version of the LEGO Assistant application introduced by Chen et al. [2]. This application belongs to a category of WCAs that are designed to guide users through the execution of a sequential task. Such applications constitute "conversational computing tasks" in the taxonomy proposed by Dabrowski & Munson [22]. A set of instructions is to be performed by the user in a semi-predetermined order. The results of the user performing these instructions are provided as inputs to the system. These inputs may either be correct, in which case the system proceeds to output the next instruction, or incorrect, in which case a procedurally generated corrective step that fixes the mistake is provided to the user.
In more formal terms, we can provide definitions for task, subtask, and step in such an application:

Definition 1 - Task and subtask. A task will be understood as a finite sequence of instructions to be performed in order. Subtask will refer to a specific action to be performed by the user, described by a single instruction. See for instance Fig. 2, which pictures the task of assembling a simple LEGO model, with each subtask corresponding to the addition of a specific LEGO piece to the current model.

Definition 2 - Step. The interval of time delimited by two consecutive instructions; see Figs. 2 and 4a. Steps are characterized by the actions of the user and the assistant. At the beginning of a step, an instruction is given to the user. The user then proceeds to perform the subtask specified by the instruction, while the cognitive assistant continuously samples the subtask state at specified intervals. While the subtask remains unfinished, the results of the processing of the sampled inputs are discarded. Once the user finishes the required action, the next sample taken will contain a valid input, and the cognitive assistant will thus provide a new instruction. This finishes the current step and potentially begins a new one if instructions remain to be performed in the task.
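The sampling loop within a step can be sketched as follows. This is a minimal sketch: `show_instruction`, `capture_frame`, and `subtask_complete` are hypothetical callbacks standing in for the assistant's display, camera, and vision pipeline, and are not part of the actual implementation.

```python
import time

def run_step(instruction, show_instruction, capture_frame, subtask_complete):
    """One step of a step-by-step WCA: display an instruction, then
    periodically sample the task state until the subtask is observed complete."""
    show_instruction(instruction)
    sample_times = []                    # timestamps t_0, ..., t_n of the samples
    while True:
        frame = capture_frame()          # sample the current task state
        sample_times.append(time.time())
        if subtask_complete(frame):      # valid input: the user finished the subtask
            return sample_times          # the caller then issues the next instruction
        # frames captured mid-subtask are silently discarded
```

The returned timestamps correspond to the sequence S used later to estimate execution time.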
In the base LEGO Assistant, the task consists of steps leading the user through the assembly of a LEGO model; each subtask requires the user to append a LEGO brick to the model at a specific location and orientation.The system monitors progress through a video feed and provides timely feedback in the form of visual and textual instructions to guide the user towards the desired end result.
The LEGO assistant has features that make it a good target for assessment of delay effects in a cognitive assistant:
1. It is easily understood by users and requires essentially no training.
2. The step-by-step nature of these applications simplifies the isolation of the relevant experimental variables and the effects of the delay.
3. The states of the display can be recognized by simple image processing algorithms and do not require extensive training as for machine-learning-based applications.
4. Each step has an intrinsic hierarchical cognitive structure (see Fig. 3), affording multiple levels of cognitive control.

The original design of the LEGO assistance application was based on a client-server model communicating over a wireless network, with the client software running on a wearable device and the server software deployed on a cloudlet. For the purposes of this study, this design was altered to be executable on a single, non-networked computer, in order to eliminate the stochastic effects of jitter and latency on the network link. By this means, we achieve much more fine-grained control over the latencies to which the system is subject. Additionally, this greatly simplified the instrumentation of the application. Instructions were output in image and text form to a computer display situated on a table directly in front of the participants. Participants performed the instructions on the table, with these actions captured by a high-definition camera located on top of the display.
Finally, a data collection and experimentation layer was implemented between the user interface and video capture on one side, and the core processing component of the LEGO Assistant on the other. This layer controlled the manipulated experimental variables and recorded the measured biometric and task-related effects.

The Experimental LEGO Task
For the experimental LEGO task, a key modification was made to the structure of the steps. After the processing of each input frame is completed, the result is withheld for a variable period of time until a specific target delay is reached, as illustrated in Fig. 4b. The length of this delay is one of the two independent variables we manipulated in the experiment. Seven levels of delay were used: no added delay (which we will also refer to as 0 s delay), 0.6, 1.125, 1.65, 2.175, 2.7, and 3.0 s, chosen based on the latency bounds found previously by Chen et al. [7]. A value of 600 ms was identified as the bound at which users start noticing delays in the assistant. Conversely, 2.7 s was identified as the upper bound beyond which the application is considered to be in such a degraded state that it is essentially "unusable". Thus, our selection of delays is centered on the range where latency is noticeable to users but the application remains "usable", while including one delay value in the unnoticeable range and one fully in the "unusable" range.
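The delay injection described above can be sketched as follows. This is a simplified sketch, not the study's actual delay buffer: `process` stands in for the assistant's frame-processing pipeline, and the padding assumes processing normally finishes before the target delay elapses.

```python
import time

# The seven delay levels used in the experiment, in seconds
DELAY_LEVELS = [0.0, 0.6, 1.125, 1.65, 2.175, 2.7, 3.0]

def process_with_target_delay(process, frame, target_delay):
    """Process a frame, then withhold the result until the elapsed time
    since frame arrival reaches the target delay (in seconds)."""
    start = time.monotonic()
    result = process(frame)
    elapsed = time.monotonic() - start
    if elapsed < target_delay:
        time.sleep(target_delay - elapsed)   # pad latency up to the target
    return result
```

Padding to a fixed target, rather than adding a fixed sleep, keeps the end-to-end delay constant even when per-frame processing time varies.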
Additionally, in order to study the effects of a delay applied across multiple steps, we implemented an experimental design component called a block:

Definition 3 - Block. A sequence of consecutive steps within a task subject to the same delay; see Fig. 4c. The length of a block corresponds to the number of steps it encompasses, and is the second of the two independent variables manipulated in our experiment. We used lengths of 4, 8, and 12 steps, values chosen as representative of the number of steps in tasks in the base LEGO Assistant application. Additionally, we define the duration of a block as the time elapsed between the start timestamp of the first step in the block and the end timestamp of the final step in the block; e.g. for Fig. 4c, the duration of the pictured block would be t_{n+k} − t_n.
To assign tasks to the participants, a pseudo-random permutation of the combinations of block length and delay was first generated, and a unique sequence of steps was assigned to each of these combinations to create 21 unique blocks. A Latin-square-type design was then used to reorder this initial permutation in order to generate a task for each participant. This counterbalances the order of the blocks across participants and avoids systematic learning effects. The design rotates the block types (as defined by length and delay) across participants so that each type is tested in each ordinal position, but it only coarsely samples from the 21 × 20 possible transitions from one block type to another. Note that unlike the base LEGO Assistant task, in which instructions led participants through the assembly of a specific model, the experimental LEGO task consisted of a sequence of instructions with no evident goal.
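The rotation scheme can be sketched as a cyclic Latin-square construction. This is a simplified sketch of the general idea, not necessarily the exact design used in the study:

```python
import itertools
import random

def latin_square_orders(conditions, n_participants):
    """Cyclically rotate a base permutation so that, across participants,
    each condition appears in each ordinal position."""
    k = len(conditions)
    return [[conditions[(i + p) % k] for i in range(k)]
            for p in range(n_participants)]

# 21 block types: every combination of block length and delay level
lengths = [4, 8, 12]
delays = [0.0, 0.6, 1.125, 1.65, 2.175, 2.7, 3.0]
block_types = list(itertools.product(lengths, delays))
random.shuffle(block_types)                    # initial pseudo-random permutation
orders = latin_square_orders(block_types, 40)  # one block order per participant
```

Each participant sees every block type exactly once; over any 21 consecutive participants, each block type occupies each ordinal position once, which is the counterbalancing property the design relies on.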
(a) Structure of a step in a generic cognitive assistance application. The assistant provides an instruction to the user and continuously samples the subtask state; inputs captured while the subtask is unfinished are silently "discarded" by the assistant (i.e. they do not cause the generation of a new instruction), as they do not contain relevant information. However, once the user finishes performing the given instruction, the next input will cause the generation of a new instruction, which will subsequently be provided to the user.
(b) Modified structure of a step in the experimental task. In contrast to Fig. 4a, an additional variable segment of time is introduced immediately following the processing of the input frame in the cognitive assistant, in order to extend the perceived processing time of the input to a specific target delay.

Users were directed to either add a piece to the ongoing model or to remove a piece, and blocks were designed in such a way as to ensure that the transitions between them were invisible to the user.
In this paper we will thus consider blocks to be our basic element of study, and most aggregations will be done at this level (with a few exceptions). For this, we will need additional definitions:

Definition 4 - Delay associated with a block. We will refer to the delay of a block as the delay applied to every frame of every step in that block.

Definition 5 - Execution time. Given the variability in the system latency to detect step completion, correcting step-completion time for system response time by a fixed amount would not be sufficiently precise. We therefore estimated the execution time for an individual step empirically, as the total time between the user receiving the instruction for a subtask and their presenting the completed subtask to the system.
Formally, for an arbitrary step, we define S = {t_0, t_1, ..., t_n} as the sequence of timestamps corresponding to the sampling instants during the step. t_0 then corresponds to the instant at which the instruction for the step is given to the user (and the first sample is taken), and t_n to the timestamp of the last sample before a new instruction is given (equivalently, the sample which captured the finished subtask as presented by the user). If we define t_c as the instant at which the user finished and presented the subtask to the system, we can infer the following:
• t_c < t_n, since by definition the user must have finished the subtask before the system took the sample that triggered a new instruction.
• t_c > t_{n−1}, since otherwise the system would have triggered a new instruction after some t_k, k ∈ [0, n − 1], instead of after t_n.
Therefore, t_{n−1} < t_c < t_n (see also Fig. 5). Due to the discrete sampling of the task state, some imprecision remains in the estimate of execution time relative to t_c. However, this introduces no bias in the results, as we can infer that t_c is uniformly distributed on (t_{n−1}, t_n). We therefore calculate the execution time for each step as t_c − t_0, with t_c ∼ U(t_{n−1}, t_n), which on average works out to an adjustment of the observed time by 1.5 times the mean sampling interval of the step. This is also aggregated into an average at the block level.
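The execution-time estimate can be sketched directly from the definitions above. This is a minimal sketch; the function and its arguments are illustrative names, not the study's actual code:

```python
import random

def estimated_execution_time(sample_times, rng=random):
    """Estimate t_c - t_0 for a step from its sampling timestamps
    S = [t_0, ..., t_n], drawing t_c uniformly from (t_{n-1}, t_n)."""
    t_0, t_prev, t_n = sample_times[0], sample_times[-2], sample_times[-1]
    t_c = rng.uniform(t_prev, t_n)       # t_c ~ U(t_{n-1}, t_n)
    return t_c - t_0
```

Since t_c is drawn uniformly from (t_{n−1}, t_n), the estimate is unbiased with respect to the true completion instant, as argued above.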

Collected Data
The collected data from the experiments fall into four categories: behavioral and personality indicators, frame-to-frame metrics, biometric data, and video recordings.

Behavioral and Personality Indicators
Before beginning the experimental procedure, participants were asked to fill out two questionnaires.
The first of these, the Big Five Inventory of Personality (BFI [10]), consists of 44 questions answered on a 5-point Likert-type scale, assessing the traits of agreeableness: detached to compassionate (8 questions); conscientiousness: careless to organized (9 questions); extraversion: reserved to outgoing (8 questions); neuroticism: secure to sensitive (8 questions); and openness: cautious to curious (9 questions). Of these, extraversion and neuroticism have been related to tolerance for delay [11].
The second survey, the Immersive Tendencies Questionnaire (ITQ [35]), comprises 29 questions, 28 of which were used for the study (one categorical question was disregarded), assessing the sub-scales of involvement, the tendency to become involved in activities; focus, the tendency to maintain focus on current activities; and games, the tendency to play games. These questions use a 7-point horizontal scale with opposing descriptors anchoring the ends. Participants were asked to mark the appropriate point on the scale, and these responses were converted to a numerical value between 1 and 7 for processing.
In post-processing, the obtained scores for both questionnaires were normalized to fall in the [0, 1] range for ease of interpretation.See Table 1 for their means and standard deviations.
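The normalization is a simple linear rescaling of each score from its native range; for instance:

```python
def normalize(score, lo, hi):
    """Linearly map a questionnaire score from its native range [lo, hi] to [0, 1]."""
    return (score - lo) / (hi - lo)

# e.g. a BFI trait score of 3.5 on the 1-5 Likert scale
normalize(3.5, 1, 5)  # -> 0.625
```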

Frame-to-frame metrics
During the execution of the task, we logged every event occurring in the application pipeline. Each incoming frame (including frames discarded by the assistant), as well as its associated outputs, was logged at multiple points in the pipeline, along with associated metadata such as the delay currently in effect as specified by the experimental design. This allowed us to extract metrics relating to the performance of the task, such as the time spent by participants on particular steps, any mistakes made, etc. In particular, it made possible the easy segmentation of the other time-series data we collected into our main unit of analysis, the aforementioned block.

Biometric Data
The participant wore devices to acquire four physiological measures:
• galvanic skin response (GSR);
• accelerometer data from the dominant wrist;
• brain activity in the form of electroencephalography (EEG);
• heart rate.
These metrics have been used as indicators of stress and cognitive load by an ample body of previous research [36,37,38]. More specifically, galvanic skin response (GSR, also known as electrodermal activity) is the measure of the variation of the conductive properties of human skin due to changes in the state of the sweat glands. It is interpreted as an indicator of physiological arousal and has long been a widely used metric in studies seeking to characterize mental workload [36,37,38,39,40,41]. Electroencephalography (EEG) refers to the monitoring of brain activity through the measurement of the fluctuations of the electric field surrounding the brain. EEG measures the voltage fluctuations due to electrical activity within neurons in the brain, which result in distinct waves of specific frequencies associated with different contexts, emotions, and actions. EEG has been used to measure cognitive load in the context of human-computer interaction [9,42,43].
Wrist acceleration, GSR, and heart rate data were obtained using the Empatica E4 [44] biosensing wristband. Accelerometer data were sampled at 32 Hz, GSR was sampled at 4 Hz, and instantaneous heart rate was calculated from a blood volume pulse (BVP) signal sampled at 64 Hz. Participants were asked to wear the device for approximately 10-15 minutes before starting the experiment, in order to allow the sensors to reach a stable equilibrium and establish a baseline for the signals.
The E4 wristband was chosen for its small, non-invasive and wireless form factor (samples were streamed to the system over Bluetooth LE) and because its use has been experimentally validated in previous research [45,46]. The E4 also includes a skin temperature thermometer; however, due to its failure to reach equilibrium, this measure was not used in the present study.
For the EEG data we employed the OpenBCI EEG Headband Kit [47], which consists of a number of dry electrodes fastened to a Velcro headband. It thus provides a quick, easy and non-invasive way of obtaining EEG signals from participants. Electrodes were placed according to the 10-20 Electrode System [48] on the Fp1 and Fp2 points, in order to capture brain activity in the frontal lobe. Ground and reference electrodes were positioned on the right and left earlobes, respectively.
This kit was paired with the OpenBCI Ganglion Biosensing Board [49] for the actual acquisition of the signals, which were sampled at 200 Hz and streamed over Bluetooth to the experiment computer.
Following capture, the EEG signal was post-processed in the following fashion:
1. Since our main interest was in the α (8 to 12 Hz) and β (12 to 30 Hz) bands, a low-pass Butterworth filter of order 8 and cutoff frequency 40 Hz was applied to filter out noise in the higher end of the spectrum (in particular, noise from the board power supply at 60 Hz).
2. A high-pass Butterworth filter of order 8 and cutoff frequency 0.1 Hz was then applied to filter out noise at the low end of the spectrum.
3. Subsequently, a pair of cascaded Savitzky-Golay filters [50], both of order 8 and window size 21, were applied to the signal in order to smooth out noise, as proposed by Agarwal et al. [51].
4. A spectrogram was then calculated for the signal.
5. Finally, the power over time for each EEG frequency band was obtained by integrating over the relevant frequencies at each time step in the spectrogram.
An example of the effects of the filtering on the raw EEG signal can be observed in Fig. 6.
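Using SciPy, the five-step pipeline above can be sketched roughly as follows. The filter parameters are taken from the text; the spectrogram settings and the exact filtering functions are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, savgol_filter, spectrogram

FS = 200  # Hz, Ganglion board sampling rate

def band_power(raw, fs=FS, band=(8.0, 12.0)):
    """Post-process a raw EEG trace into band power over time,
    following the pipeline described in the text (a sketch, not
    the study's exact implementation)."""
    # 1. 8th-order low-pass Butterworth, 40 Hz cutoff
    #    (removes the 60 Hz power-supply artifact)
    sos_lp = butter(8, 40.0, btype="low", fs=fs, output="sos")
    x = sosfiltfilt(sos_lp, raw)
    # 2. 8th-order high-pass Butterworth, 0.1 Hz cutoff
    sos_hp = butter(8, 0.1, btype="high", fs=fs, output="sos")
    x = sosfiltfilt(sos_hp, x)
    # 3. Two cascaded Savitzky-Golay filters, order 8, window 21
    x = savgol_filter(savgol_filter(x, 21, 8), 21, 8)
    # 4. Spectrogram of the cleaned signal
    f, t, Sxx = spectrogram(x, fs=fs)
    # 5. Integrate power over the band of interest at each time step
    mask = (f >= band[0]) & (f <= band[1])
    df = f[1] - f[0]
    return t, Sxx[mask, :].sum(axis=0) * df
```

For a real recording, the α and β traces would then be segmented into blocks using the event log described earlier.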

Video recordings
During the task, participants were recorded by two separate cameras. One camera was angled downwards, towards the table, the LEGO board, and the participant's hands. This camera was used to capture the necessary inputs for the LEGO assembly task as well as to record the actions performed by the user. The second camera was angled towards the participant, in order to record their facial expressions during the execution of the task.
Both video feeds were captured at a rate of 24 frames per second (i.e. with a sampling interval of 41.6 ms) in parallel processes to ensure a constant rate of capture. Examples of the frames can be seen in Fig. 7. The video feeds were not used for the present study, but may be utilized in future analysis.

Results

The results consist of the execution time per step and the outcomes of the physiological variables: heart rate, GSR, and EEG. Each will be discussed in turn.

Execution Time
Before describing the analysis and results relating to execution time, it should be noted that participants' performance during the execution of the task was error-free; all steps were completed as instructed.
Fig. 8 shows the mean per-step execution time per block, for each block length (number of steps) and artificial delay. We can clearly see a trend for the execution time to increase with the delay, and increasingly so for longer blocks. Since the per-step execution time compensates for the added delay in the measure per se, this trend must result from the participants' behavioral adjustment to the delay. This leads to one of the key outcomes of this study: participants tend to act more slowly on steps affected by longer delays; i.e., there is evidence of a pacing effect in users' behavior with respect to the responsiveness of the system.
We confirmed this effect through an analysis of variance (ANOVA) with factors of block length and delay. An ANOVA test uses the F-test statistic, which is a ratio of the variability in the data introduced by experimental manipulations (and their interactions) to the variability in the data attributable to randomness. The degrees of freedom represent the number of observations going into each of these variability measures. As we are using within-subject designs, the variability in the data due to randomness is estimated from the variability among subjects across the levels of the experimental variables. The partial-η² (η²p) statistic is a measure of effect size, corresponding to the proportion of variance explained by the effect after excluding variance from others, and the p-value is a measure of the probability of the effect under the null hypothesis; p < .05 is the criterion for significance. The length × delay ANOVA found significant main effects of both factors and their interaction, as shown in Table 2.

Further analysis focused on the progressive effect on per-step execution time as additional steps occurred at a constant delay. For this purpose, the steps within a block were aggregated over sequences of 4, each constituting a slice. Note that the first slice within a block is procedurally identical for all block lengths, in the sense that a participant currently performing a step in the first slice of a block has no way of predicting if the block ends after step 4 or not. The same logic can be applied to slice 2 for blocks of length 8 and 12. Accordingly, for each participant, slice 1 comprised data from steps 1-4 of all block lengths, slice 2 the second four steps of blocks of length 8 and 12, and slice 3 the last four steps in blocks of length 12. An ANOVA on slice number (1-3) and delay (7 values) yielded effects of slice number, delay, and the interaction, detailed in Table 3. As shown in Fig. 9, blocks with shorter delays showed a trend for the execution time to progressively decline over the course of the block (i.e., by slice number), indicating that the participant accommodated to the feedback pace with more efficiently timed responses. With the longest delays, where the execution time per step was longest, the slow-down persisted; that is, the system response time hindered the participant's execution throughout the course of the block.

The data also allowed us to perform sub-analyses to assess the effects of carryover from one delay to another. Specifically, we measured the per-step execution time for the first four steps of a block when participants transferred from a relatively long delay (2.175 to 3.0 s) versus a short delay (0 to 1.65 s). Note that use of the first four steps controls for block length. We performed analyses where participants transitioned from either a short- or long-delay block into a: (i) no-delay block (36 subjects); (ii) 1.65 s delay block (40 subjects); (iii) 2.7 s delay block (40 subjects); and (iv) 3.0 s delay block (37 subjects). The destination delays were chosen so as to maximize the number of participants contributing to the analyses, the results of which are pictured in Fig. 10. We found that transitions from long-delay blocks carried over to significantly longer per-step execution times in the initial steps of the destination blocks that were assessed. In summary, we observe two direct effects on execution time due to added delays. First, we see a clear hampering of the improvement of execution time across steps that is otherwise evident across blocks. Second, we notice that this effect lingers on even after the delay is removed, affecting subsequent blocks in the task.
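As a reference for how these statistics relate, the F-statistic and partial-η² can be computed from sums of squares as follows. This is a generic illustration of the definitions, not tied to the study's data.

```python
def f_statistic(ss_effect, df_effect, ss_error, df_error):
    """F = (SS_effect / df_effect) / (SS_error / df_error): mean square
    of the effect over mean square of the error."""
    return (ss_effect / df_effect) / (ss_error / df_error)

def partial_eta_squared(ss_effect, ss_error):
    """Proportion of variance explained by an effect after excluding
    variance attributable to other effects."""
    return ss_effect / (ss_effect + ss_error)
```

For example, an effect with SS = 3 against an error SS = 7 yields η²p = 0.3, regardless of the other effects in the model.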

Acceleration
Accelerometer data from the E4 wristband were taken in 3 axes defined relative to the device. As our interest was in the amount of movement rather than its direction, we calculated a "movement score" for each block, defined as the sum of the magnitudes of the acceleration vector samples collected during that block. This score would include the time imposed by the delay and the time to decode the instructions and respond by moving the pieces so that the next state was recognized. It is only during the last part of the step that explicit movement is required, so any additional accelerations would derive from arbitrary movements while processing and waiting. We normalized the sum of the accelerations by the duration of the step to correct for differences in delay and in execution time, which tends to increase with delay as described above.

An ANOVA on block length and delay showed a significant effect such that the movement score decreased with delay, as shown in Table 4 and Fig. 11. There was also a significant delay by length interaction, reflecting that delay effects occurred particularly for the longer lengths. We next conducted the same analysis as for execution time, dividing blocks into "slices" of four steps each, such that slice 1 comprised the first four steps of all block lengths, slice 2 the second four steps of blocks of length 8 and 12, and slice 3 the last four steps of blocks 12 steps long. As shown in Fig. 12, the effect of delay was only apparent in the later slices, yielding main effects of slice number and delay and an interaction (Table 5).
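Under the definition above, the movement score can be sketched as follows; the exact normalization is inferred from the textual description, so treat the details as an assumption.

```python
import numpy as np

def movement_score(accel_samples, duration_s):
    """Movement score for one block: the sum of the magnitudes of the
    3-axis acceleration samples, normalized by the block's duration in
    seconds. Reconstructed from the textual description in the paper."""
    accel = np.asarray(accel_samples, dtype=float)  # shape (N, 3)
    magnitudes = np.linalg.norm(accel, axis=1)      # per-sample |a|
    return float(magnitudes.sum() / duration_s)
```

Normalizing by duration is what makes scores comparable across blocks with different delays, since longer delays stretch the block without adding movement.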

EEG
Analyses were conducted on the log power in the alpha band, the beta band, and the total of all bands measured. Readings from the two electrode placements were highly correlated and were pooled into an average. Logs were taken for analysis because the EEG distribution tended to have a rightward tail. Twelve participants were excluded from the analysis, 9 due to device failure and 3 because of extreme values (i.e., the participant mean of log total power was greater than 3 s.d. from the mean of all participants).
The analyses then comprised 28 participants. Omnibus ANOVAs were conducted with delay and block length (number of steps) as factors, on the EEG data from the alpha and beta bands. The analysis of alpha EEG found no significant effects. Beta EEG showed only an effect of block length, F(2, 54) = 3.56, p = .035, η²p = .12, with a tendency for the 4-step length to produce lower log power (mean 3.9 vs. 4.1 for lengths of 8 and 12). However, this effect was small and not consistent across delays. Again the analysis dividing blocks into 4-step slices was conducted, with delay and slice number as factors. For both the alpha and beta bands, ANOVAs yielded effects of slice number; see Table 6. Both bands showed the same tendency: EEG declined as a sequence of steps with the same delay progressed. These effects are shown in Fig. 13. Thus, EEG taken from frontal locations mimics the execution time data in showing a decline over the course of a block, but unlike the execution time, there was no tendency for the decline in EEG to be reduced at longer delays.
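The participant exclusion rule described above (a mean log total power more than 3 s.d. from the group mean) can be sketched as:

```python
import numpy as np

def exclude_extreme(participant_means, k=3.0):
    """Keep the participant means whose distance from the group mean
    is at most k standard deviations (k = 3 in the study). A minimal
    sketch of the rule as described in the text."""
    means = np.asarray(participant_means, dtype=float)
    mu, sd = means.mean(), means.std()
    return means[np.abs(means - mu) <= k * sd]
```

Note that a single very large outlier inflates the standard deviation, so in samples this small the rule is deliberately conservative.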

Galvanic Skin Response (GSR) and Heart Rate
The measure of GSR was the log of total amplitude in the signal. To control for effects of block length (steps × delay) and the slow-down in execution time at longer delays, GSR data were normalized by the temporal duration of the block. Due to sensor failure, and after the elimination of one subject with extreme values of GSR amplitude using the same rule as for EEG, the analysis comprised 34 participants. No systematic trend due to delay or block length was observed. In addition, the same analysis of slice position within block length as was performed for the EEG revealed no effects, indicating that GSR was stable across step positions within a block. Heart rate, measured in beats per minute, averaged 90.2 BPM and showed no trend related to the experimental variables of delay and block length. The same was true of the variability in heart rate (the standard deviation in beats across conditions was 20.3).
The absence of systematic trends in both these results is interesting in the context of our initial suggestions of potential mechanisms relating delay in the application to human behavior. In Section 2.2 we proposed that emotional arousal during the execution of the task was a potential explanation for the effects observed in the human participants. However, these results seem to refute this hypothesis, as will be further discussed in Section 5.

Individual Difference Analysis
The Big 5 personality inventory and Immersive Tendencies Questionnaire (ITQ) were combined with outcome variables in an analysis of individual differences. Given the results, we initially considered the following outcome variables:
• execution time in the most demanding block (length 12, delay 3.0 s);
• total of alpha and beta band EEG in the first four steps at all lengths ("Slice 1");
• average heart rate;
• and average log GSR.
Although the heart rate and GSR data had produced no significant effects in the analysis of the experimental variables, they could in principle correlate with experimental outcomes across individuals. A principal-components analysis (PCA) on a subset of these variables, shown in Table 7, was ultimately conducted on the 28 participants for which there was EEG data. Three of the personality measures were excluded after initial analyses indicated significant correlations among neuroticism, openness, agreeableness, and extroversion. Neuroticism (with poles of sensitivity and security) was selected as relevant to the issue of response to system delay and was included along with conscientiousness. GSR was used in a subsidiary analysis, because only 25 participants had both EEG and GSR measures that were reliable. The PCA produced three components that accounted for 73.13 % of the variance in the six factors considered. Component 1 included neuroticism, both ITQ scores, and execution time, indicating that more sensitive and immersed individuals tended to slow their responses under extended delay. Component 2 included conscientiousness and the absence of neuroticism along with immersive involvement, indicating that efficient and secure participants tended to be more involved. The third component had positive loadings on EEG in Slice 1 (first four steps of a block) and the focus component of the ITQ. An additional analysis including heart rate added a component but improved the PCA fit by only 4.5 %, indicating that any variability in this measure across individuals is unrelated to personality, immersiveness, or outcome. A further analysis including GSR produced a solution in which GSR loaded with the first component, along with execution time.
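The core of such an analysis, the explained-variance ratios of the leading components, can be sketched with a plain SVD on standardized data. This is an illustration of the technique, not the exact statistical software used in the study.

```python
import numpy as np

def explained_variance_ratio(data, n_components=3):
    """Fraction of total variance captured by each of the leading
    principal components, computed via SVD on z-scored columns."""
    X = np.asarray(data, dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each variable
    _, s, _ = np.linalg.svd(X, full_matrices=False)
    ratios = s ** 2 / np.sum(s ** 2)
    return ratios[:n_components]
```

Component loadings (which variables load on which component, as reported in Table 7) come from the right singular vectors of the same decomposition.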
Fig. 14 shows the correlation between neuroticism and execution time for the block with extreme values of length (12 steps) and delay (3.0 s), for the full set of 40 participants. It confirms the relationship indicated by the first component of the PCA on the smaller sample of 28 participants; that is, higher neuroticism is associated with more responsiveness to delay. On the whole, these results suggest that individual differences in widely accepted personality variables and immersive tendencies moderate the response to delay. This fact could have practical implications in the future. It could, for instance, provide a tunable parameter for eventual models aiming to emulate human interaction with a WCA. In addition, physiological measures of heart rate and EEG appear not to be direct indicators of behavioral response to delay, although GSR may be more promising in this regard.

Discussion
We start our discussion with the main results of our experimentation as presented in the previous section:
• First, and perhaps most importantly, we find that a system slow-down induces an additional user slow-down. That is, as system responsiveness decreases, the data indicate that users significantly slow down in their execution of the task. The slow-down scales with the decrease in responsiveness; compared to the no-delay case, participants were on average 12 % slower at 1.65 s delay and 26 % slower at the maximum delay. Moreover, there is a temporal component to this effect; users become progressively slower the more time passes with reduced system responsiveness.
• Secondly, we find that the effects of behavioral slow-down due to impaired system responsiveness remain for at least a few steps after system responsiveness is restored. This is evidenced by the longer per-step execution times of the first four steps of blocks immediately following a high-delay block, as pictured in Fig. 10. The question of whether any lingering effect can be measured after these four steps remains open.
• Thirdly, we observe a speed-up in execution time over a series of steps; that is, users get faster at performing steps as the task progresses. However, the magnitude of this effect decreases as delay increases. Whereas for blocks without delay users performed the last four steps of a 12-step block on average 36 % faster than the first four, at the maximum delay this effect practically disappears.
• Fourthly, in terms of inter-subject differences, PCA revealed three main factors in users' response to delay. The first factor represents sensitivity to delay, moderated by the "Big Five" personality trait of neuroticism and both measures of immersion: focus and involvement. Factors two and three represent dedication to the task as opposed to delay intolerance, and reflect variables related to attentiveness, respectively. In simple terms, these results suggest that the effects of delay are most potent in individuals who are sensitive to and involved in the task. The findings appear selective to cognitive assistance tasks like the present ones, inasmuch as the same measures did not correlate with outcomes in other computer-intensive environments such as immersive VR [52]. These correlations are also consistent with previous findings indicating that individuals scoring high in neuroticism tend to be intolerant of delayed reward [11].
A central question therefore arises: to which physio-and psychological mechanisms can these findings, most importantly the substantial slow-down in task execution, be attributed?
In Section 2, we initially considered the possibility that delays might produce negative emotional reactions. These could in turn elicit generalized arousal. We also postulated that adapting to delay might progressively deplete cognitive resources in users. However, the present data provide relatively little support for these alternatives, in that the physiological measures of GSR and HR failed to show evidence of differential arousal under long vs. short delays, and the speed-induced errors and non-completions predicted by resource depletion were not observed. Nor do the acceleration data indicate that extended delay increases erratic movement. To the contrary, the data suggest this effect results from a delay in movement after an instruction is introduced. That is, users fail to capitalize on the new information as quickly as they could. Thus, contrary to our preliminary postulations, behavioral effects seem to arise from impaired cognitive control mechanisms, and not from emotion or resource depletion.

We hypothesize that the effects of feedback latency can best be understood as changes in the use of a cognitive plan. As was described in Section 3 and Fig. 3, complex cognitive motor tasks have been modeled as the unfolding of a hierarchy of command, from high-level plans to physical output. Long system latencies, we propose, disrupt the automating of such a plan, instead relegating it to attention-based control at the step-by-step level that is easily diverted. This also provides a possible explanation for the lingering effects of delay after an acceptable system responsiveness is restored, as the user needs time to re-adjust and re-automate their cognitive plans.

As to the applicability of our findings to other applications, it must be noted that these results pertain to a specific class of applications, namely step-based task-guidance WCA. However, we would expect our findings to extend to similar applications, as long as they follow the same pattern of seamless interaction; i.e. such that the user does not need to explicitly interact with the application to advance the state.
The results presented here provide a number of possible implications for WCA design and optimization, both for single- and multi-application flows.
• Due to the behavioral slow-down in users, even short-term reductions in responsiveness will lead to significantly extended application lifetimes.This has implications for resource and power consumption.
• The fact that the adverse effects of delay on users do not immediately disappear as the system returns to a highly responsive state could have unconventional consequences for resource allocation. This is of particular importance, for instance, in cases where the user may be able to finish the task before these effects subside. In such cases, the limited potential gains might not justify diverting valuable resources to the impaired application.
• In multi-user environments, the time dependency of user slow-down effects means that fair degradation of system responsiveness across applications may not ultimately be beneficial to the system as a whole. Take for instance two applications on the same system which negatively interfere with each other. The more they interfere with each other, the longer their respective lifetimes are going to be, which in turn causes them to interfere even longer, potentially entering a self-reinforcing feedback loop. In such a case, prioritizing one over the other rather than trying to improve responsiveness for both might lead to resources being freed up faster system-wide.
• Based on our findings relating individual differences between users to their sensitivity to delays, it might also be possible to extrapolate user characteristics from measured execution times. This could prove a valuable tool for load balancing, for instance by prioritizing resource allocation to users with a higher sensitivity to system-state degradation. However, this remains an open challenge.
To wrap up, we believe the present data provide novel and unexpected insights for the understanding and optimization of WCA deployments. Although more subtle than expected, and in some cases somewhat counterintuitive, these insights represent a valuable tool to tackle inefficiencies in these systems. Moreover, we also argue these findings represent a first step towards a full-fledged understanding of the relationship between application responsiveness and human behavior. More research in this area will surely uncover more complex and interesting behaviors. Finally, we believe the data provide parameters that can usefully be integrated into cognitive models of WCA that might be constructed under existing architectures like ACT-R. These same parameters could be used to modulate the timing and generation of inputs in trace-based workload generation tools such as the EdgeDroid platform [12,13], allowing such a tool to use a trace to generate workloads for a multitude of different user profiles.

Conclusion
In this paper, we presented the results of a study on the physiological and behavioral reactions of users of WCA to delays in the application pipeline. Our ultimate aim was to identify and categorize the ways in which humans react to low system responsiveness in step-based cognitive assistance systems.
We approached this in an experimental manner, by having participants interact with an instrumented WCA setup, and found that delay appears to affect the cognitive planning of users, preventing them from automating the task they are performing. The results show that user interactions in a WCA slow down in the presence of delays in the pipeline. When system responsiveness is high, the user responds quickly; when it is low, the user slows down. This was evidenced by an increase in task execution times, even after accounting for the artificially introduced delays, as well as by accelerometer data from participants' wrists. Additionally, we found that the strength of this effect is modulated by individual differences between subjects.
These results are interesting as they open up hitherto unexplored opportunities for the design, optimization, and benchmarking of WCA systems. In this context, we believe there are two direct and important next steps to be performed. The first of these relates to the implications for system optimization and resource allocation discussed in Section 5. We believe these need to be implemented, tested, and validated in real setups. In particular, we identify two of these implications as prime candidates for their own experimental studies. One, our postulation that in cases where an impaired application is close to finishing, diverting resources to it might not be the optimal course of action; and two, the possibility that "fair" degradation of system responsiveness across applications may be, in some cases, undesirable. Both of these questions could be answered with straightforward experimental setups.
The second step corresponds to the extension of existing tools for WCA benchmarking with the findings presented in this paper. These tools are of great value for the study of WCA systems, as they allow for automated large-scale testing without having to resort to human users. However, they are still somewhat simplistic and unrealistic in their workload generation schemes. Incorporating the results presented here would allow for much more realistic workloads. For instance, our findings relating to the effects of delay on execution times could be directly adapted to modulate timings in the input stream. Another example would be employing the results linking neuroticism to a heightened sensitivity to delays to provide a "tuning knob" for the user models in these workload generation schemes. Extending and perfecting these tools will allow for much more realistic benchmarking and testing of WCA systems, providing data of significantly better quality and ultimately leading to faster improvement, optimization, and adoption of these systems.

Fig 1 .
Fig 1. Experimental test-bed. Participants interact with the cognitive assistant through task-related inputs and outputs; in practice, these correspond to the video feed captured by the assistant and the instructions provided by it. The assistant itself has been instrumented with a data collection layer, which collects and processes experiment-related data such as biometric signals from the participants (these are merely processed here and do not form part of the inputs to the cognitive assistant as such), and a delay buffer, which introduces controlled delays in the transit of information from the core processing component.

Fig 2 .
Fig 2. Example of a Cognitive Assistance task and its component subtasks and steps.

"
Find a 1 × 1 black piece and add it to the top right of the current model."

Fig 3 .
Fig 3. Hierarchical cognitive structure of a step in the LEGO task.
(c) Structure of a block in the experimental task.

Fig 4 .
Fig 4. Components of the cognitive assistance task.

Fig 5 .
Fig 5. Visualization of the execution time of a step.

Fig 6 .
Fig 6. Comparison of an EEG signal before and after filtering. Note in particular the artifact from the DC power supply of the capture board around 60 Hz in the raw signal, and how it essentially disappears after filtering.

Fig 7.
(a) Frame from the face recording of a random participant, clearly showing the locations of the EEG electrodes. (b) Frame from the board recording of the same participant.

Fig 8 .
Fig 8. Per-step execution time by block length vs. delay.Error bars indicate the Standard Error of the Mean (S.E.M.)

Fig 10 .
Fig 10. Per-step execution time across the first four steps after a block transition from block B_{k-1} to B_k. Error bars indicate S.E.M.

Fig 14 .
Fig 14. Correlation between the neuroticism score of participants and their execution time in the longest block at the longest delay. Pearson correlation coefficient r = .40; 2-tailed p = .01.

Table 1 .
Means and standard deviations of normalized questionnaire scores.

Table 2 .
Significant effects on per-step execution time from ANOVA on factors delay and block length.

Table 3 .
Significant effects on per-step execution time from ANOVA on factors block slice and delay.

Table 4 .
Significant effects on accelerometer data from ANOVA on factors delay and block length.

Table 5 .
Significant effects on accelerometer data from ANOVA on factors delay and slice number. As the delay unfolds, more of the step duration is spent without movement. This can be interpreted in the context of the increased execution time at long delays demonstrated in the previous analysis. Assuming that adding or removing a piece takes essentially the same amount of time and accelerates the wrist equally at any one step, it appears that the participant simply remains stationary for the extra time induced by a longer delay. Accordingly, the acceleration per unit time, our movement score, is reduced.

Fig 12.
Movement score vs. delay, per block slice. Error bars indicate S.E.M.

Table 6 .
Significant effects on log EEG power from ANOVA on factors delay and block slice.

Fig 13.
Mean log EEG power for alpha and beta bands per step slice. Error bars indicate S.E.M.

Table 7 .
Principal Component Analysis. (a) Main components identified.