Vigilance, arousal, and acetylcholine: Optimal control of attention in a simple detection task

Paying attention to particular aspects of the world or being more vigilant in general can be interpreted as forms of ‘internal’ action. Such arousal-related choices come with the benefit of increasing the quality and situational appropriateness of information acquisition and processing, but incur potentially expensive energetic and opportunity costs. One implementational route for these choices is widespread ascending neuromodulation, including by acetylcholine (ACh). The key computational question that elective attention poses for sensory processing is when it is worthwhile paying these costs, and this includes consideration of whether sufficient information has yet been collected to justify the higher signal-to-noise ratio afforded by greater attention and, particularly if a change in attentional state is more expensive than its maintenance, when states of heightened attention ought to persist. We offer a partially observable Markov decision-process treatment of optional attention in a detection task, and use it to provide a qualitative model of the results of studies using modern techniques to measure and manipulate ACh in rodents performing a similar task.


Introduction
Vigilance is commonly taken to mean the state of being alertly watchful, and is principally equated with sustained attention in the scientific literature [1,2]. People notoriously find it difficult and taxing to maintain high levels of vigilance, or sustained attention, over long periods when required by the exigencies of a task, typically exhibiting fluctuations in their performance as well as an overall decline in performance over time (reviews include [3][4][5]). Vigilance was particularly actively investigated in the middle of the last century, for instance in the case of radar operators having to search for targets over long periods [6,7], and was commonly related to arousal, which was generally conceptualized in more physiological terms as the level of nonspecific activation of cerebral cortex [8][9][10] (though see [11]). Vigilance has also been explored in non-human animals, such as in the sustained attention task developed for rodents by Sarter and colleagues [12].
However, rather like selective forms of attention [13,14], there remain gaps in the examination of vigilance using modern conceptions of cost-sensitive information-processing in the brain, as, for instance, in the framework of the expected value of control (EVC) [15,16] (though see the recent review by [17]), in which the degree of cognitive control applied in a situation is normatively determined by a cost-benefit analysis. Thus, in the EVC framework, putative benefits of vigilance might include an increased signal-to-noise ratio (SNR) for sensation and cognition, at the expense of actual costs of excess neural activity and/or opportunity costs from the diversion of processing to the task at hand rather than to other potentially valuable internally- or externally-focused computations [4,17-19].
Concomitantly, EVC conceptions could be enriched by insights from the human and animal paradigms concerned with the neural foundations of vigilance, which, for instance, point to the central involvement of neuromodulatory systems [2,20-23]. Neuromodulators can be seen as one solution adopted by the brain for centralised regulation of distributed processing [24]. This has at least two facets: increasing excitability to boost the SNR of neural representations; and, since vigilance is necessary when aspects of the external environment are known to be unknown, manipulating the balance between the influence of prior expectations versus input likelihoods. The latter is the putative role of ACh as the medium of expected uncertainty in the theoretical suggestion of Yu and Dayan [25,26].
Here, we use an abstract model of the sustained attention task (SAT) of Sarter and colleagues [12] to elaborate the statistical and computational bases of vigilance in EVC terms, and explore qualitative aspects of the role of neuromodulation (specifically, ACh) in controlled information-processing. Thus, we note at the outset that our study focuses on one particular experimental paradigm, and on findings about ACh gleaned from that paradigm; this means that we certainly do not claim to address all of the broad range of findings on vigilance more generally, which also frequently involve rather different tasks (see, e.g., [5]).
In the SAT, animals face repeated trials in each of which they have to watch out for a potentially extremely brief signal (a flash of light at a particular location) that comes at an uncertain time during an extended epoch. The end of this epoch is indicated by the insertion of two levers into the operant chamber, and the animal indicates whether it believes a signal has been present or absent during the trial by its selection of lever. Correct responses (i.e., hits and correct rejections) are rewarded, while incorrect responses (i.e., misses and false alarms) are unrewarded. Withdrawal of the levers, either once the animal has made a response or after the maximum response time of 4s, marks the beginning of the next trial. Uncertainty about whether and when the brief signal will arrive means the animals have to be vigilant across the whole epoch. Many studies [27][28][29][30][31][32] have shown that good quality performance depends causally on cholinergic neuromodulation associated both with basal forebrain sources of ACh and controlled release in the medial prefrontal cortex; complex patterns of ACh release over sequences of trials have also been observed [33]. This long line of investigation by Sarter and colleagues into the role of ACh in the SAT makes it a natural target for computational modelling.
From a computational perspective, we treat the decision problem faced by an animal in the SAT as a form of partially observed Markov decision process (POMDP) [34]. We suggest that animals can be more or less vigilant (i.e., have strong or weak attention), with the difference reflected in the SNR of the input, but also with costs for switching into, and maintaining, strong attention (cf. [35]). Thus, the agent's objective is not exactly to report correctly whether a trial contains a signal or not at the trial's conclusion, but rather to maximize the difference between external rewards (provided by the experimenter based on successful task execution) and internal costs (from the engagement of attention over the course of the trial) by appropriate choice of both attentional states and final report. The existence of different attentional states (strong vs. weak) and the ability to control which attentional state is occupied makes for a form of extended signal detection theory problem with explicitly controlled variable signal quality [36][37][38][39]. Our computational agent has to use the information it has acquired during a trial so far to decide whether it is worth paying the cost of switching from weak to strong attention. If so, then consistent with the sustained attention styling of the task, it has to decide whether it is worth maintaining the state of elevated attention across successive trials.
We interpret the attentional switching as an internal action [40] mediated by ACh (albeit itself under the influence of extensive cortical control), and use our model to interpret particular neuromodulatory findings of Sarter and colleagues [30,33] in terms of a phasic theory of ACh, to complement the tonic characterization offered by Yu and Dayan [25]. One important difference from the tonic theory is that there is no exact equivalence between weak/strong and top-down/bottom-up information influences in the cortex. Rather, we treat strong attention as a top-down controlled state in which bottom-up information is more reliable (and is typically known to be more reliable; the consequences of possible mismatches between what is actual and what is known or believed are considered in Section 'Optogenetic manipulations', below), and so exerts a greater influence over hierarchical processing.
We first describe the abstraction of the SAT, and then consider the deployment of attention over the course of single trials, showing conditions under which it is adaptive to switch from weak to strong attention when a sufficient hint of a signal is detected. We show that even such a simple model of attention can lead to rich patterns of behaviour, both within and across trials, and in a manner that depends in subtle ways on expectations about whether and when a signal may appear, informational quality, and assumed attentional costs. We also report subtle issues concerning the source of misses and false alarms when ACh is manipulated [27,30]. We next consider sustained attention across multiple trials, particularly in the light of findings about specific circumstances under which ACh is engaged in the task [33]. Finally, we discuss our results and interpretations in the light of EVC, expected uncertainty, and older ideas about vigilance and arousal.

Model
Briefly, we model an abstract version of the SAT (Fig 1A) in which a potentially very brief signal may or may not occur (and if so, at an uncertain time) during an extended period. At the end of each trial, the decision-maker reports whether or not they think a signal occurred during the trial. Imperfect information about signal presence or absence is carried by noisy observations. Crucially, how informative these observations are depends on the attentional state of the agent, which we assume to be under its direct control. On each time step, the agent makes a binary choice about whether to attend weakly or strongly; strong attention yields better quality information, but at a cost. We consider the problem of how to optimize this sequence of decisions regarding attentional state, balancing the putative costs of enhanced attention against the benefits of better evidence. With the exception of Section 'Optogenetic manipulations', in which we consider how optogenetic manipulations may disturb correct inference, we assume that the actual and believed qualities of the signal are identical. Formally, the decision problem is modelled as a partially observed Markov decision process (POMDP; Fig 1B), with elements detailed in the following.
States. On each trial, the agent occupies a series of $N$ states $S_1, S_2, \ldots, S_N$, where each state comprises two variables, $S_n = (X_n, Y_n)$. $X_n \in \{\text{pre}, \text{on}, \text{post}\}$ is an 'external' state component determined by the experimenter, corresponding to the three possible stages of a trial (respectively: pre-signal; signal on; and post-signal), and is not directly known by the agent; $Y_n \in \{\text{weak}, \text{strong}\}$ is an 'internal' variable that reflects the agent's current attentional state and is always assumed to be known by the agent (Fig 1B). Formally, $Y_n$ is a necessary part of the state description to allow the agent to pay different costs for switching attention from weak to strong, and for maintaining strong attention.

Fig 1. (A) Each trial, in which a signal may or may not occur, comprises a sequence of $N = N_0 + N_1 + 1$ time steps. Whether a signal occurs, and if so, its time of onset and offset, is determined by the conditional probability of arrival, $\lambda_n$, at each step $n$, and the (constant) probability of turning off per time step, $q$. On the final time step, the decision-maker reports whether a signal did or did not occur during the trial. (B) On each time step, the state comprises the pair of variables $(X_n, Y_n)$, where $X_n \in \{\text{pre}, \text{on}, \text{post}\}$ is the (unknown) stage within the trial, and $Y_n \in \{\text{weak}, \text{strong}\}$ is the (known) attentional state; the attentional action $a_n \in \{\text{weak}, \text{strong}\}$ determines the quality of the observation $O_n$ by determining the amount of Gaussian noise (right), and the attentional state at the next time step (if $n < N$) or at the beginning of the next trial (if $n = N$). Unshaded nodes indicate unobserved/latent random variables (actions are assumed known). For simplicity, we have omitted the additional, 'external' action taken at $N$ (i.e., report 'signal' or 'no signal'). (C) Conditional state transition probabilities. These can be decomposed into independent 'external' and 'internal' components, $P(S_{n+1} \mid S_n, a_n) = P(X_{n+1} \mid X_n)\, P(Y_{n+1} \mid a_n)$. (D) Transition into the final, decision state of each trial. (E) Each trial always begins in the pre-signal state, i.e., $X^t_1 = \text{pre}$ for any trial $t$; there is a probability $\delta$ that the next trial will be started in the weak attentional state even if the agent chose strong at the end of the previous trial. https://doi.org/10.1371/journal.pcbi.1010642.g001

PLOS COMPUTATIONAL BIOLOGY
Actions. On all time steps $1, 2, \ldots, N$, the agent makes a binary choice $a_n \in \{\text{weak}, \text{strong}\}$ between attending only weakly to sensory information and attending strongly. On time steps $1, 2, \ldots, N-1$, choosing to attend strongly has the benefit of providing better information about the underlying state, but carries greater cost, as detailed below. At time step $N$, this same decision has no immediate informational value (since there is no observation on this step), but it determines whether the agent begins the next trial in the weak or strong attentional state. A choice of strong here can therefore be unambiguously interpreted as being motivated by future, rather than immediate, benefits.
In addition, on the final step $N$ of each trial, the agent must indicate whether or not a signal was presented during the trial. Thus, on the final step, the agent in fact selects one of four possible actions, since it must choose both an 'internal', attentional action and an 'external' action, i.e., $a_N \in \{\text{weak}, \text{strong}\} \times \{\text{no signal}, \text{signal}\}$. In the following, for simplicity, $a_n$ will always refer to the attentional action unless otherwise indicated.
State transitions. A trial contains a signal (referred to as a 'signal trial') with assumed (correctly or not) probability $p_1$, and contains no signal ('non-signal trial') with probability $p_0 = 1 - p_1$. On all trials, we assume $N_0 \geq 1$ time steps where it is known that a signal cannot be present (i.e., the agent occupies $X = \text{pre}$ with certainty), followed by $N_1$ time steps, where a signal may arrive. We write $N = N_0 + N_1 + 1$, allowing for the extra time step for the {no signal, signal} report. A signal, should one occur, is assumed to have equal probability of arriving at any one of the $N_1$ time steps in a trial, so that a signal's time of arrival $\tau$ is uniformly distributed,

$$P(\tau = n \mid \text{signal trial}) = \frac{1}{N_1}, \qquad n = N_0 + 1, \ldots, N_0 + N_1.$$

Taking into account the probability $p_1$ that a trial contains a signal, the probability that a signal arrives on step $n$ given it hasn't arrived sooner (i.e., the hazard function) is

$$\lambda_n = \frac{p_1 / N_1}{1 - p_1 (n - N_0 - 1) / N_1} = \frac{p_1}{N_1 - p_1 (n - N_0 - 1)}, \qquad n = N_0 + 1, \ldots, N_0 + N_1,$$

with $\lambda_n = 0$ for $n \leq N_0$. Conversely, if a signal has already arrived, we assume that it turns off again with constant probability $q$ per time step (Fig 1A). It is always turned off before the final decision step $n = N$. Note that the assumption that a signal has equal probability of arriving at any one of the $N_1$ time steps means that $\lambda_n$ increases over the course of a trial (Fig 2A). Note also the effect of $q$ on the probability of being in the signal state at any time, $P(X_n = \text{on})$. To be in the on state at $n$, the signal must have arrived on or before time step $n$ and not have turned off since. If we now let $\tau_1$ denote the time of signal onset, and $\tau_2$ denote the time of signal offset, then this probability is

$$P(X_n = \text{on}) = P(\tau_1 \leq n < \tau_2) = \sum_{k = N_0 + 1}^{n} \frac{p_1}{N_1} (1 - q)^{n - k}, \qquad n = N_0 + 1, \ldots, N - 1.$$

If we have $q = 0$, so that a signal would always persist until the end of the trial once on, then this probability increases to $p_1$ just before the decision state (i.e., $P(X_{N-1} = \text{on}) = p_1$); if $q = 1$, so that a signal only ever lasts a single time step, then we have $P(X_n = \text{on}) = p_1 / N_1$ for each of the $N_1$ time steps; and other values of $q$ lead to intermediate cases (Fig 2B).
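As a numerical check on the expressions above, the sketch below computes the hazard $\lambda_n$ and propagates the marginal $P(X_n)$ forward through the pre, on, post chain, comparing $P(X_n = \text{on})$ with the closed-form sum. It is a minimal illustration under the stated assumptions (uniform onset over the $N_1$ eligible steps, constant offset probability $q$); the function names and the parameter values ($N_0 = 3$, $N_1 = 6$, $p_1 = 0.5$) are hypothetical and are not the paper's defaults.

```python
import numpy as np


def hazard(n, N0, N1, p1):
    """lambda_n: P(signal arrives on step n | no arrival before n), mixing signal and non-signal trials."""
    if n <= N0 or n > N0 + N1:
        return 0.0                        # arrival impossible outside the N1-step window
    k = n - N0 - 1                        # eligible steps already passed without an arrival
    return p1 / (N1 - p1 * k)             # (p1/N1) / (1 - p1*k/N1)


def stage_marginals(N0, N1, p1, q):
    """Rows n = 1..N-1 of (P(pre), P(on), P(post)), propagated forward from 'pre'."""
    dist = np.array([1.0, 0.0, 0.0])      # every trial starts in 'pre'
    rows = []
    for n in range(1, N0 + N1 + 1):
        lam = hazard(n, N0, N1, p1)
        pre, on, post = dist
        dist = np.array([pre * (1 - lam),            # still waiting for a signal
                         pre * lam + on * (1 - q),   # arrives now, or stays on
                         post + on * q])             # turns off
        rows.append(dist)
    return np.array(rows)


def p_on_closed_form(n, N0, N1, p1, q):
    """P(X_n = on) = sum_{k=N0+1}^{n} (p1/N1) (1-q)^(n-k); the empty sum gives 0 for n <= N0."""
    return sum(p1 / N1 * (1 - q) ** (n - k) for k in range(N0 + 1, n + 1))


N0, N1, p1 = 3, 6, 0.5                    # hypothetical illustration values
print(stage_marginals(N0, N1, p1, q=0.0)[-1])   # with q = 0, P(on) at step N-1 equals p1
```

Setting `q=1.0` instead gives $P(X_n = \text{on}) = p_1/N_1$ on every eligible step, the other limiting case described in the text.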
As we will see, the time course of $P(X_n = \text{on})$ can affect the allocation of attention: since strong attention increases the detectability of a signal at a cost, it can be appropriate to wait until late in a trial, when a signal is more likely to be present, before engaging strong attention. For the sake of completeness, we also note that the assumption that each trial has a fixed length of $N$ steps means that the distribution over signal durations deviates, due to truncation, from a geometric distribution with parameter $q$ and mean $1/q$ (Fig 2C). The probability of a signal of duration $L$ (conditioned on a signal occurring), which is nonzero only for $L = 1, \ldots, N_1$, can be expressed as

$$P(L = l) = \frac{(1 - q)^{l - 1}\, \bigl[1 + (N_1 - l)\, q\bigr]}{N_1}, \qquad l = 1, \ldots, N_1.$$

For steps $1, 2, \ldots, N-1$, we assume that the attentional state on step $n + 1$ is completely determined by action $a_n$, so that if the choice is weak on step $n$, then at the start of the next time step we have $Y_{n+1} = \text{weak}$, and likewise for strong. Note that these 'external' (between world states) and 'internal' (between agent states) transitions occur independently, so the dynamics can be written in terms of a Kronecker product of the external and internal transition matrices, $P(S_{n+1} \mid S_n, a_n) = P(X_{n+1} \mid X_n)\, P(Y_{n+1} \mid a_n)$, with $P(Y_{n+1} = a_n \mid a_n) = 1$.
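Because the external and internal transitions are independent, the joint transition matrix over $S = (X, Y)$ can be assembled with a Kronecker product. The sketch below does this with `numpy.kron`; the state ordering (index $2x + y$, with $x$ over pre, on, post and $y$ over weak, strong) and the function names are illustrative choices of ours, and `lam` stands for the arrival hazard of the next step.

```python
import numpy as np

ATTN = ("weak", "strong")


def external_T(lam, q):
    """P(X_{n+1} | X_n): rows index X_n, columns X_{n+1}, both ordered (pre, on, post)."""
    return np.array([[1.0 - lam, lam, 0.0],
                     [0.0, 1.0 - q, q],
                     [0.0, 0.0, 1.0]])


def internal_T(action):
    """P(Y_{n+1} | Y_n, a_n): Y_{n+1} = a_n deterministically, whatever Y_n was."""
    T = np.zeros((2, 2))
    T[:, ATTN.index(action)] = 1.0
    return T


def joint_T(lam, q, action):
    """P(S_{n+1} | S_n, a_n) for S = (X, Y), ordered X-major (index 2*x + y)."""
    return np.kron(external_T(lam, q), internal_T(action))


T = joint_T(lam=0.2, q=0.3, action="strong")
print(T.shape)   # (6, 6); each row sums to 1
```

Keeping the two factors separate makes it easy to change the hazard per step while reusing the same deterministic attentional factor.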
Some exceptions to the preceding apply. Firstly, the transition between $X_{N-1}$ and $X_N$ is such that any belief in the on state at the end of time step $N-1$ is transferred to post at the start of $N$ (Fig 1D). Secondly, it is certain that each trial begins in pre, but we assume that on transitioning from the $N$th step of trial $t$ to the 1st step of trial $t + 1$, there is some probability $\delta$ of reverting from the strong to the weak attentional state (Fig 1E). One might think of this as allowing the possibility of a passive or spontaneous 'decay' process. We generally set this decay probability to be small ($\delta = 0.001$).
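The two exceptions can be written as special-case transitions. In this sketch, the step into the decision state moves all 'on' mass to 'post', and the first step of the next trial puts all belief on 'pre' while strong attention survives the trial boundary with probability $1 - \delta$. Joint states are indexed as $2x + y$ (with $x$ over pre, on, post and $y$ over weak, strong); that ordering and the function names are hypothetical choices for illustration, and only the default $\delta = 0.001$ comes from the text.

```python
import numpy as np


def final_external_T():
    """X_{N-1} -> X_N: any 'on' mass moves to 'post'; no new arrivals (rows: X_{N-1})."""
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0],
                     [0.0, 0.0, 1.0]])


def next_trial_start(a_N, delta=0.001):
    """Distribution over S = (X, Y) at step 1 of trial t+1, given the attentional action a_N."""
    p_strong = (1.0 - delta) if a_N == "strong" else 0.0   # spontaneous decay to weak
    s = np.zeros(6)
    s[0] = 1.0 - p_strong      # (pre, weak)
    s[1] = p_strong            # (pre, strong)
    return s


belief_N = np.array([0.2, 0.5, 0.3]) @ final_external_T()
print(belief_N)                       # 'on' mass now sits in 'post'
print(next_trial_start("strong"))
```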
Observations. Depending on the sequence of states (and also, crucially, on the agent's actions), the agent receives a sequence of observations $O_1, O_2, \ldots, O_{N-1}$. As mentioned above, we assume that no observation is provided at $N$, where the agent must report whether there was a signal or not during the trial. Observations carry information about whether there is currently a signal or not, but just how informative they are is determined by the choice of attentional action. In particular, if the agent chooses to attend weakly on the current time step, then we assume

$$O_n \mid X_n, a_n = \text{weak} \sim \mathcal{N}\bigl(\mathbb{1}[X_n = \text{on}],\ \sigma_w^2\bigr).$$

If the agent chooses to attend strongly, then we have

$$O_n \mid X_n, a_n = \text{strong} \sim \mathcal{N}\bigl(\mathbb{1}[X_n = \text{on}],\ \sigma_s^2\bigr),$$

where $\sigma_s^2 \leq \sigma_w^2$ and $\mathbb{1}[\cdot]$ is the indicator function. In other words, observations are always drawn from a normal distribution, with a mean of either 0 (if no signal) or 1 (if signal), and these observations are less 'noisy' in the strong than in the weak attentional state (Fig 1B, right). Note that variability is assumed not to depend on whether the signal is present or not.
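The observation model can be sketched as follows: the mean is 1 only while the signal is on, and the attentional action selects the noise level. The function names and the example noise levels ($\sigma_w = 2$, $\sigma_s = 0.5$) are hypothetical. Since the two means differ by 1, a single observation has discriminability $1/\sigma$, so lowering the noise from $\sigma_w$ to $\sigma_s$ sharpens the likelihood ratio between 'on' and the other stages.

```python
import numpy as np

MEANS = np.array([0.0, 1.0, 0.0])   # observation mean for X = pre, on, post


def noise_sd(action, sigma_w, sigma_s):
    """Standard deviation of O_n under the chosen attentional action."""
    return sigma_s if action == "strong" else sigma_w


def obs_likelihood(o, action, sigma_w, sigma_s):
    """Vector of Gaussian densities p(o | X = x, a) for x in (pre, on, post)."""
    s = noise_sd(action, sigma_w, sigma_s)
    return np.exp(-0.5 * ((o - MEANS) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))


def sample_obs(stage_index, action, sigma_w, sigma_s, rng):
    """Draw O_n given the stage index (0 = pre, 1 = on, 2 = post) and the action."""
    return rng.normal(MEANS[stage_index], noise_sd(action, sigma_w, sigma_s))


rng = np.random.default_rng(0)
o = sample_obs(1, "strong", 2.0, 0.5, rng)      # an observation while the signal is on
print(o, obs_likelihood(o, "strong", 2.0, 0.5))
```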
Costs and rewards. While strong attention yields better quality information, we assume this comes at a cost, which may additionally depend on the current attentional state. Costs are always treated as non-positive ($\leq 0$) reward values. The only positive reward available in the task comes from correctly reporting the presence or absence of a signal.
We denote the reward function $R_n(X, Y, a)$, and assume that this itself is a sum of two terms,

$$R_n(X_n, Y_n, a_n) = R_i(Y_n, a_n) + \chi_{n=N}\, R_e(X_{N-1}, a_N), \qquad (7)$$

where $R_i(Y_n, a_n)$ gives the 'internal', attentional cost associated with attentional action $a_n$ in attentional state $Y_n$,

$$R_i(Y_n, a_n) = \begin{cases} 0 & \text{if } a_n = \text{weak}, \\ \epsilon < 0 & \text{if } Y_n = \text{strong and } a_n = \text{strong}, \\ \kappa < \epsilon & \text{if } Y_n = \text{weak and } a_n = \text{strong}, \end{cases}$$

and $R_e$ gives the 'external' reward arising from whether it was correctly or incorrectly reported on step $N$ (signified by the indicator $\chi_{n=N}$) that a signal was present or absent during the trial. $R_e$ is positive only for a correct report, where the correctness of a no signal or signal action $a_N$ depends on whether $X_{N-1}$ is pre or one of {on, post}. Note that the cost of attention depends only on the attentional action $a_n$ and on the subcomponent $Y_n$ (i.e., the attentional state) of $S_n$. It can be interpreted as follows: the cost of weak attention ($a_n = \text{weak}$) is zero, regardless of the previous action, while there is always a cost of strong attention ($a_n = \text{strong}$); furthermore, it is assumed that choosing to attend strongly given that one is already in the strong attentional state ($Y_n = \text{strong}$) is less costly than choosing to attend strongly given that one is currently in the weak attentional state ($Y_n = \text{weak}$). The intuition behind this latter assumption is that such repeated action can also be conceptualized as maintaining a (non-default) internal state, which may incur less cost than switching between states (and in particular, from 'default'/disengaged to 'non-default'/engaged states; see Discussion).

Belief MDP. In the current case, the agent's posterior distribution over states given its observations, or 'belief state', coupled with memory of its most recent action, is sufficient to optimize behaviour. In particular, we can formulate the corresponding 'belief MDP', and use familiar dynamic programming methods to solve for the optimal policy.
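A direct transcription of the reward function in Eq (7) is sketched below. The magnitude of the reward for a correct report (`r_correct = 1`) and the zero reward for an incorrect one are assumptions made for illustration, consistent with the statement that the only positive reward comes from a correct report; the default cost values are likewise hypothetical, chosen only to satisfy $\kappa < \epsilon < 0$.

```python
def internal_cost(Y, a, kappa, eps):
    """R_i(Y_n, a_n): 0 for weak; eps (< 0) to maintain strong; kappa (< eps) to switch to strong."""
    if a == "weak":
        return 0.0
    return eps if Y == "strong" else kappa


def external_reward(X_last, report, r_correct=1.0):
    """R_e(X_{N-1}, a_N); r_correct and the zero for errors are illustrative assumptions."""
    signal_present = X_last in ("on", "post")       # 'pre' at N-1 means no signal occurred
    correct = (report == "signal") == signal_present
    return r_correct if correct else 0.0


def reward(n, N, Y, a, X_last=None, report=None, kappa=-0.2, eps=-0.05):
    """R_n = R_i(Y_n, a_n) + chi_{n=N} R_e(X_{N-1}, a_N)."""
    r = internal_cost(Y, a, kappa, eps)
    if n == N:                                      # external reward only at the decision step
        r += external_reward(X_last, report)
    return r


print(reward(10, 10, "strong", "weak", X_last="post", report="signal"))  # 1.0
```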
Using simplified notation similar to [34], we denote by $b_{n+1}(x)$ the agent's belief that $X_{n+1} = x$ after seeing all the information $O_{1:n}$ on time steps $1, \ldots, n$, including knowledge of the attentional actions $a_{1:n}$ on those steps (and after one further probabilistic transition). (Strictly, $b_{n+1}(x)$ is a deterministic function of $O_{1:n}, a_{1:n}$ but, for notational convenience, we treat it as a random variable and do not write down this dependency.) That is,

$$b_{n+1}(x) := P(X_{n+1} = x \mid O_{1:n}, a_{1:n}) = P(X_{n+1} = x \mid O_n, a_n, b_n).$$

We then combine this belief over $X_{n+1}$ with the known value of $Y_{n+1}$ to make a state $\tilde{S}_{n+1} = (b_{n+1}(x), Y_{n+1})$ for what is called the belief MDP.
The transition and reward functions for the belief MDP are derived in the usual way. Specifically, the posterior probability over $X_{n+1}$ given action $a_n$ and particular observation $O_n = o_n$ can be derived recursively via Bayes' rule,

$$\begin{aligned}
b_{n+1}(x) &= P(X_{n+1} = x \mid o_n, a_n, b_n) \\
&= \sum_w P(X_{n+1} = x \mid X_n = w, o_n, a_n, b_n)\, P(X_n = w \mid o_n, a_n, b_n) \\
&= \sum_w P(X_{n+1} = x \mid X_n = w)\, P(X_n = w \mid o_n, a_n, b_n) \\
&\propto \sum_w P(X_{n+1} = x \mid X_n = w)\, P(o_n \mid X_n = w, a_n, b_n)\, P(X_n = w \mid b_n, a_n) && (10) \\
&= \sum_w P(X_{n+1} = x \mid X_n = w)\, P(o_n \mid X_n = w, a_n)\, b_n(w). && (11)
\end{aligned}$$

Following [34], we can think of this process in terms of the application of a state estimator function $SE_n(b_n, a_n, o_n)$, which takes the initial belief state at $n$, along with the selected action and ensuing observation, and returns the resulting belief state. The transition function over belief states is then

$$F_n(b_n, a_n, b_{n+1}) := P(b_{n+1} \mid a_n, b_n) = \sum_{o_n} P(b_{n+1} \mid o_n, a_n, b_n)\, P(o_n \mid a_n, b_n), \qquad (12)$$

where

$$P(b_{n+1} \mid o_n, a_n, b_n) = \begin{cases} 1 & \text{if } SE_n(b_n, a_n, o_n) = b_{n+1}, \\ 0 & \text{otherwise}, \end{cases}$$

and where $P(o_n \mid a_n, b_n)$ is the normalizing constant arising in Eq (10). This transition function, which marginalizes out the possible observations, is important when planning ahead, since the actual observations will not be known at that point. Using the full state $\tilde{S}_n = (b_n, Y_n)$, the reward function in the belief MDP is derived from the original reward function via

$$r_n(\tilde{S}_n, a_n) = \sum_x b_n(x)\, R_n(x, Y_n, a_n). \qquad (14)$$

Optimal behaviour. We consider optimal behaviour in two cases: the case of a single trial, and the case of the average reward rate over continuing trials.
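The recursive update of Eqs (10) and (11), i.e. the state estimator $SE_n$, amounts to a standard Bayes filter: weight the current belief by the observation likelihood, normalize, then push the result through the external transition matrix. The sketch assumes the Gaussian observation model described above and takes the $3 \times 3$ matrix $P(X_{n+1} \mid X_n)$ as an input; the function name and the example numbers are hypothetical.

```python
import numpy as np

MEANS = np.array([0.0, 1.0, 0.0])   # observation means for X = pre, on, post


def state_estimator(b, a, o, T_ext, sigma_w, sigma_s):
    """SE_n(b_n, a_n, o_n) -> b_{n+1}.

    b: belief over (pre, on, post) at step n; T_ext[w, x] = P(X_{n+1} = x | X_n = w).
    """
    sigma = sigma_s if a == "strong" else sigma_w
    lik = np.exp(-0.5 * ((o - MEANS) / sigma) ** 2)   # Gaussian constant cancels on normalizing
    post = lik * b                                     # P(o_n | X_n = w, a_n) b_n(w)
    post = post / post.sum()                           # P(X_n = w | o_n, a_n, b_n)
    return post @ T_ext                                # sum_w P(X_{n+1} = x | X_n = w) (...)


T = np.array([[0.8, 0.2, 0.0],      # hypothetical transition with lambda = 0.2, q = 0.3
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
b = np.array([0.6, 0.3, 0.1])
print(state_estimator(b, "strong", 1.0, T, 2.0, 0.5))
print(state_estimator(b, "weak", 1.0, T, 2.0, 0.5))   # smaller shift toward 'on'
```

The normalizing sum computed inside the function (before the Gaussian constant is dropped) is what Eq (10) calls $P(o_n \mid a_n, b_n)$.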
For the single-trial case, a suitable objective is to determine a policy that maximizes the expected (undiscounted) return over the trial. Such an optimal policy can be found by simple backward induction. First consider the final time step $N$. Here, there is no observation to provide additional information and, furthermore, since there are no future trials to be considered, there is nothing to be gained by choosing other than the attentional action $a_N = \text{weak}$ for any state. There is then the additional choice of whether to report that a signal was present or absent. The optimal decision here is simply to report 'no signal' if one's belief in being in the pre-signal state is greater than 0.5, and 'signal' otherwise. From these considerations about the optimal policy at $N$, we can find the corresponding optimal values $v^*_N(\tilde{S}_N)$, i.e., the maximum expected return for each possible state $\tilde{S}_N$, using Eq (14), since the expected return is equal to the expected reward at this final state. Working backwards, we can then find the optimal values and associated optimal actions for all preceding time steps $n < N$ using the Bellman optimality equation

$$v^*_n(\tilde{S}_n) = \max_{a_n} \Bigl[ r_n(\tilde{S}_n, a_n) + \sum_{b_{n+1}} F_n(b_n, a_n, b_{n+1})\, v^*_{n+1}(\tilde{S}_{n+1}) \Bigr],$$

where $\tilde{S}_{n+1} = (b_{n+1}, a_n)$. We thereby find a deterministic optimal policy $\pi^*(\tilde{S}_n)$ for all states $\tilde{S}_n$, mapping from the belief and attentional state $Y_n$ at step $n$ to an attentional action $a_n$.

The multi-trial case is made slightly more complex by the need to consider future trials in addition to the current trial, because the actions in trial $t$ can affect the starting state in trial $t + 1$.
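Before turning to the continuing case, the last step of the backward induction can be made concrete for a single belief state. At step $N-1$, the value of each attentional action is its cost plus the expected value of the final report, $v^*_N$, averaged over the observation $o_{N-1}$; because the transition into step $N$ only moves 'on' mass to 'post', $b(\text{pre})$ is unaffected. The sketch below integrates over $o$ numerically on a fine grid, assumes a reward of 1 for a correct report and 0 otherwise, and uses hypothetical cost and noise values. It illustrates one Bellman backup, not the paper's full grid-based solution.

```python
import numpy as np
from statistics import NormalDist


def gauss(o, mu, sigma):
    return np.exp(-0.5 * ((o - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def expected_final_value(b, sigma, n_grid=20001):
    """E_o[v*_N] after observing o_{N-1} with noise sigma, where v*_N = max(b'(pre), 1 - b'(pre)).

    p(o) b'(pre) = b(pre) N(o; 0, sigma) and p(o) (1 - b'(pre)) is the on/post mixture,
    so the expectation is the integral of the larger of the two joint densities.
    """
    o = np.linspace(-10 * sigma - 2.0, 10 * sigma + 3.0, n_grid)
    joint_pre = b[0] * gauss(o, 0.0, sigma)
    joint_sig = b[1] * gauss(o, 1.0, sigma) + b[2] * gauss(o, 0.0, sigma)
    return np.sum(np.maximum(joint_pre, joint_sig)) * (o[1] - o[0])


def q_values_step_before_last(b, Y, sigma_w, sigma_s, kappa, eps):
    """Action values at step N-1 in attentional state Y (reward 1 per correct report, assumed)."""
    cost_strong = eps if Y == "strong" else kappa
    return {"weak": expected_final_value(b, sigma_w),
            "strong": cost_strong + expected_final_value(b, sigma_s)}


b = np.array([0.5, 0.5, 0.0])
print(q_values_step_before_last(b, "weak", 2.0, 0.5, kappa=-0.1, eps=-0.05))
# For b = (0.5, 0.5, 0) the exact expectation is Phi(1 / (2 sigma)):
print(NormalDist().cdf(1 / (2 * 2.0)), NormalDist().cdf(1 / (2 * 0.5)))
```

With these numbers, switching to strong is worthwhile at $\kappa = -0.1$ but not at $\kappa = -0.3$, while for a belief already certain of 'pre' weak attention is always preferred, since there is nothing left to learn.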
One suitable objective in this continuing case is to maximize the average rate of reward [41], and we can solve for this maximum average reward rate $g^*$ using value iteration to solve the system of Bellman optimality equations

$$v^*_n(\tilde{S}_n) = \max_{a_n} \Bigl[ r_n(\tilde{S}_n, a_n) - g^* + \sum_{b_{n+1}} F_n(b_n, a_n, b_{n+1})\, v^*_{n+1}(\tilde{S}_{n+1}) \Bigr],$$

with the convention that the successor of step $N$ is step 1 of the next trial (Fig 1E), and where it should be noted that $v^*_n(\tilde{S}_n)$ now refers to a differential value, based on the average-adjusted sum of (undiscounted) future rewards from each state (see, e.g., [42] for details).
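Average-reward problems of this kind are commonly solved by relative value iteration, which repeatedly applies the Bellman backup and subtracts the value of a reference state, so that the differential values stay bounded while the subtracted quantity converges to $g^*$. The sketch below shows the technique on a small, randomly generated MDP with strictly positive transition probabilities (so that it is unichain and aperiodic); it is not the attention model itself, and the exact variant of value iteration used in the paper may differ. A brute-force search over all deterministic stationary policies provides a check on the gain.

```python
import itertools
import numpy as np


def relative_value_iteration(P, R, ref=0, tol=1e-10, max_iter=100_000):
    """Solve h(s) + g = max_a [R(s, a) + sum_t P[a, s, t] h(t)] with h(ref) = 0.

    P: (A, S, S) transition probabilities; R: (S, A) expected one-step rewards.
    Returns the gain g, differential values h, and a greedy policy.
    """
    h = np.zeros(R.shape[0])
    for _ in range(max_iter):
        Q = R + np.einsum("ast,t->sa", P, h)   # one-step lookahead for every (s, a)
        v = Q.max(axis=1)
        g = v[ref]                              # gain estimate from the reference state
        h_new = v - g                           # subtract to keep values bounded
        if np.max(np.abs(h_new - h)) < tol:
            return g, h_new, Q.argmax(axis=1)
        h = h_new
    raise RuntimeError("relative value iteration did not converge")


def gain_of_policy(P, R, policy):
    """Long-run average reward of a deterministic stationary policy (unichain assumed)."""
    S = R.shape[0]
    P_pi = np.array([P[policy[s], s] for s in range(S)])
    r_pi = np.array([R[s, policy[s]] for s in range(S)])
    A = np.vstack([P_pi.T - np.eye(S), np.ones(S)])     # stationarity + normalization
    rhs = np.append(np.zeros(S), 1.0)
    mu = np.linalg.lstsq(A, rhs, rcond=None)[0]
    return float(mu @ r_pi)


rng = np.random.default_rng(1)
P = rng.random((2, 3, 3)) + 0.1                 # toy MDP: 2 actions, 3 states
P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(3, 2))
g, h, policy = relative_value_iteration(P, R)
best = max(gain_of_policy(P, R, p) for p in itertools.product(range(2), repeat=3))
print(g, best)
```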
In both single- and multi-trial cases, we approximate the continuous component of the belief state with a finite number of discrete states, specified by applying a regular grid with non-overlapping tiles of side $\Delta b$ [43]. Interpolation can then be used to extract values and actions for any other point in the belief space.
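One simple way to realise such a discretization is to place grid points at multiples of $\Delta b$ in the coordinates $(b(\text{pre}), b(\text{on}))$, with $b(\text{post})$ implied, and to interpolate linearly within the triangular cells. The sketch below uses barycentric (piecewise-linear) interpolation; this particular interpolation scheme, the function names, and $\Delta b = 0.1$ are illustrative assumptions rather than details given in the text. A linear function of the belief is reproduced exactly, which gives a simple check.

```python
import numpy as np


def simplex_grid(delta_b):
    """Grid indices (i, j) with b(pre) = i*delta_b, b(on) = j*delta_b and i + j <= K."""
    K = int(round(1.0 / delta_b))
    return K, [(i, j) for i in range(K + 1) for j in range(K + 1 - i)]


def interpolate(values, K, b):
    """Barycentric interpolation of grid `values` (dict keyed by (i, j)) at belief b."""
    x, y = b[0] * K, b[1] * K
    i, j = int(np.floor(x)), int(np.floor(y))
    fx, fy = x - i, y - j
    if fx + fy <= 1.0:                       # lower triangle of the square cell
        corners = [((i, j), 1.0 - fx - fy), ((i + 1, j), fx), ((i, j + 1), fy)]
    else:                                    # upper triangle of the square cell
        corners = [((i + 1, j + 1), fx + fy - 1.0), ((i + 1, j), 1.0 - fy), ((i, j + 1), 1.0 - fx)]
    # Corners with (numerically) zero weight can lie just outside the simplex; skip them.
    return sum(w * values[v] for v, w in corners if w > 1e-12)


def linear_value(pre, on):
    """Hypothetical value function, linear in the belief, used only for illustration."""
    return 2.0 * pre - on + 0.5


K, points = simplex_grid(0.1)
values = {(i, j): linear_value(i / K, j / K) for i, j in points}
print(len(points), interpolate(values, K, np.array([0.33, 0.27, 0.40])))
```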

Attention and ACh: Bridging hypothesis
When considering optimal (Bayesian decision-theoretic) behaviour, the representation and updating of uncertainty (as well as utilities) are obviously of key importance. It has been previously suggested [24,25,44,45] that the neuromodulators acetylcholine (ACh) and norepinephrine (NE) play an important, if circumscribed, role in representing (though not themselves calculating) expected uncertainty (i.e., variability in the world that is predicted) and unexpected uncertainty (i.e., variability that is not predicted) at both relatively slow (for learning) and fast (for inference) timescales. This complements other forms of representation, such as in population codes [46,47]. Our interest here is primarily in fast ACh signalling and its possible roles in inference [31,32,48].
The earlier ideas about the role of ACh in inference were that it directly reports cases in which prior expectations about the world are vague (for example, in the SAT, arising from variable timing and the low SNR of sensory inputs). In such cases, it is appropriate to weight external information more heavily (i.e., boosting bottom-up over top-down information) and potentially to improve one's prior from this new information. However, that work involved Bayesian inference rather than Bayesian decision theory, and so did not consider the additional possibility of an attentional action, i.e., the possibility that the SNR of the input could advantageously, though possibly expensively, be changed to increase the net reward rate. We suggest that ACh also plays this additional role in the SAT, such that the release of ACh can be treated as an internal action rather than as a passive reporter of expected uncertainty. In some cases, these roles may be functionally indistinguishable.
We generally consider ACh to report $a_n = \text{strong}$. However, when we consider experiments in which ACh is manipulated directly by optogenetics [30], we discuss some of the complexities of breaking the relationship between the optimally intended and the actual release of ACh, since there could then be a mismatch between the assumed and actual qualities of the information that is being reported and integrated (see Section 'Optogenetic manipulations').
It is important to note that ACh is substantially heterogeneous [49,50]. In the experiments we report, there are differential effects associated with sensory cortical and prefrontal cortical ACh. At present, it is not possible to be completely precise in the model about the release and effects of ACh, themes to which we return in the Discussion.

Plots and measures
To facilitate understanding of subsequent results, we briefly introduce the main types of plot we use to understand model behaviour. Key measures of performance are also described.
One type of plot concerns which attentional action is optimal in each possible belief state $b_n$. Since at any time step $n$ there are 3 possible trial states, $X_n \in \{\text{pre}, \text{on}, \text{post}\}$, and a belief state must sum to 1, the belief space at any time can be visualized as a 2-dimensional simplex (i.e., a triangle). The lower panels of Fig 3A-3C, below, depict the belief space at a particular time step, and show the value (Fig 3A, lower) and the optimal attentional actions (Fig 3B-3C, lower) for each belief state, conditioned on the current value of $Y$. Each time step in a trial has a belief space of this form, and so we can visualize changes in value or policy over the course of a trial by appropriate concatenation of such plots (as in the upper panels of Fig 3A-3C).
Another plot type shows the average evolution of the belief state over a trial, along with the average evolution of the attentional state. Fig 4A (left) shows one such case: note that at any time, the beliefs in the three possible states (pre, on, post) sum to 1 (upper plot); the evolving attentional state (we simply label the y-axis 'attention') is more precisely the probability of selecting strong at each time step, $P(a_n = \text{strong})$ (lower plot). When parameters are varied systematically, the probability of choosing strong at each step is instead rendered as shaded squares (e.g., Fig 8A, upper left), along with the 'marginal' expected number of steps per trial in which strong is chosen (e.g., Fig 8A, lower left).
Predicted ACh activity under the model (Fig 12A and 12B, right panels) is based on the assumption that ACh reports $a_n = \text{strong}$ (see Section 'Attention and ACh: Bridging hypothesis' above), and so is similarly derived directly from $P(a_n = \text{strong})$. The only additional subtlety is that the ACh transients reported by Howe et al. [33] (Fig 12A and 12B, left panels) are measured relative to a pre-signal baseline (2s before signal onset, or an analogous period on a non-signal trial), and we therefore similarly report predicted changes in ACh relative to a baseline obtained by averaging the attentional state over the 3 time steps before signal onset (signal trials) or before the average signal onset time (non-signal trials).
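The baseline correction can be written in a few lines. This sketch assumes that the onset is given as an integer, 1-indexed time step (for non-signal trials, the average onset time rounded to a step, which is our assumption) and that at least three steps precede it; `p_strong` holds $P(a_n = \text{strong})$ for each step of the trial, and all names and numbers are hypothetical.

```python
import numpy as np


def predicted_ach_change(p_strong, onset, n_baseline=3):
    """P(a_n = strong) minus its mean over the n_baseline steps just before `onset`.

    onset: 1-indexed step of signal onset (or a stand-in onset on non-signal trials).
    """
    p = np.asarray(p_strong, dtype=float)
    start = onset - 1 - n_baseline                 # 0-indexed first baseline step
    if start < 0:
        raise ValueError("not enough pre-onset steps for the baseline")
    baseline = p[start:onset - 1].mean()           # steps onset-3 .. onset-1 (1-indexed)
    return p - baseline


p = [0.1, 0.1, 0.2, 0.3, 0.9, 0.8, 0.4]           # hypothetical P(a_n = strong) per step
print(predicted_ach_change(p, onset=5))          # baseline = mean(0.1, 0.2, 0.3) = 0.2
```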
Fig 3. (A) At the decision state ($n = 10$), the value of being more certain about whether there has been a signal or not is clear, since these belief states have higher $v^*$. This value is the same regardless of whether one is certain that there was or was not a signal. For earlier time steps, the highest values occur only where one is certain that a signal is, or has been, present. (B-C) The associated optimal policies (upper plots) conditioned on currently occupying (B) the weak attentional state ($Y_n = \text{weak}$) or (C) the strong attentional state ($Y_n = \text{strong}$); lower plots for time step $n = 7$. Darker regions bound the set of beliefs for which choosing strong is optimal; in general, this region grows over time.

In addition to these plots, we report the common sensitivity measure $d'$ used in signal detection theory [51,52],

$$d' = z(H) - z(F),$$

where $H$ is the hit rate (i.e., the probability of reporting a signal given a signal was presented), $F$ is the false alarm rate (i.e., the probability of reporting a signal given no signal was presented), and $z(\cdot)$ is the quantile function of the standard normal distribution. Note that we can make such measures conditional on the onset and duration of a signal (as in Fig 6); that is, while $F$ is the probability of reporting signal when none has been presented, we can separately consider the hit rates $H$ for signals of different duration, disregarding onset (Fig 6A), or even examine $H$ for all possible onset-duration combinations (Fig 6B). In Fig 8, we report hit and false alarm rates (middle plots) in addition to $d'$ (upper right plots).
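The sensitivity measure $d' = z(H) - z(F)$ can be computed with the standard-normal quantile function from Python's standard library (`statistics.NormalDist.inv_cdf`). The function name below is ours; note that `inv_cdf` raises an error for rates of exactly 0 or 1, so empirical rates at those extremes would need some correction, which the text does not specify.

```python
from statistics import NormalDist

_z = NormalDist().inv_cdf      # quantile function z(.) of the standard normal


def d_prime(hit_rate, false_alarm_rate):
    """d' = z(H) - z(F); both rates must lie strictly between 0 and 1."""
    return _z(hit_rate) - _z(false_alarm_rate)


print(round(d_prime(0.9, 0.1), 3))   # 2.563
print(d_prime(0.5, 0.5))             # 0.0
```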

Results
We consider both internal and external decision-making over the two key timescales in the task. The most obvious timescale is that associated with each individual trial. Within a trial, the agent has to accumulate evidence about the presence of the signal, decide whether and when to engage or disengage attention, and finally report their external choice (report 'signal' or 'no signal'). However, we are also concerned with what occurs across trials. Although from a signal detection viewpoint each trial is independent, and so the actual presence of a signal on one trial bears no relationship to its presence on previous or future trials, the agent's attentional state is allowed to persist across trials. Indeed, as we noted in the Introduction, it might be substantially more expensive to engage strong attention than to maintain it. Then, if attention is already engaged on trial t, it might be optimal to pay the cost of sticking in the strong state for trial t + 1 and thereby avoid the even greater cost of having to re-engage this state, if necessary. In turn, this would mean that the choice to engage attention on trial t can be partly justified by the benefits of strong attention on trial t + 1. Thus, we need to consider an intertrial timescale as well as an intra-trial one. We start with the case of an isolated trial, and then treat cross-trial influences.

Single trial
When does it make sense to engage (weak → strong) or disengage (strong → weak) attention within a trial if we only consider optimizing behaviour in a single trial (i.e., ignoring the effect of choices on future trials)? We expect multiple model parameters to interact to determine the answer, since the probability that a signal is on and its temporal extent (determined by $p_1$, $q$ and $N_1$), the relative qualities of information available in the different attentional states ($\sigma_w$, $\sigma_s$), and the costs of strong attention (κ, �) can all be expected to play a role. To gain some insight, we examine a particular regime in detail (these will be our 'default' parameters) before examining the effects of varying parameters more systematically.
Example: Frequent signals. In our first, 'default' regime, we assume that a signal is as likely as not, $p_1 = 0.5$, with $q = 0.2$. In terms of the relative quality of observations, we set $\sigma_w = 1$ and $\sigma_s = 0.5$. For costs, we assume a weak → strong shift to be moderately costly, κ = −0.1, and maintaining strong attention to be substantially cheaper, � = −0.014. Finally, for convenience, a trial is relatively short, with $N_0 = 3$ and $N_1 = 6$ (so that, in total, a trial includes $N = 10$ time steps).
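For reference, the default parameters can be collected in one place. This is a hypothetical configuration object: the field names are ours, `maintenance_cost` stands for the per-step cost of remaining in the strong state (its symbol is garbled in this copy), and reading $N$ as $N_0 + N_1$ plus one final decision step is our inference from the stated totals.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskParams:
    """Default ('frequent signals') parameters; field names are hypothetical."""
    p1: float = 0.5                   # prior probability that a trial contains a signal
    q: float = 0.2                    # per-step probability that an 'on' signal switches off
    sigma_w: float = 1.0              # observation noise s.d. under weak attention
    sigma_s: float = 0.5              # observation noise s.d. under strong attention
    kappa: float = -0.1               # cost of a weak -> strong shift
    maintenance_cost: float = -0.014  # cost of remaining in the strong state
    n0: int = 3                       # initial steps known to be pre-signal
    n1: int = 6                       # further steps during which a signal may occur (our reading)

    @property
    def n_steps(self) -> int:
        # N0 + N1 steps plus the final decision step (10 for the defaults)
        return self.n0 + self.n1 + 1

print(TaskParams().n_steps)  # 10
```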
The upper plot in Fig 3A shows the optimal values of each belief state at selected time steps (1, 4, 7 and 10) when occupying the weak attentional state; the lower plot highlights step n = 7 for clarity. At the decision state (n = 10), state values are higher for more certain beliefs (where one is more likely to be correct), and lowest where there is most uncertainty (where one can expect to perform at chance). Moving backwards in time, the value remains high when convinced there is, or has been, a signal (i.e., when $P(X_n = \text{on}) + P(X_n = \text{post}) \approx 1$). This is not true where there is strong belief in the pre-signal state (i.e., $P(X_n = \text{on}) + P(X_n = \text{post}) \approx 0$), since this latter conviction does not preclude the possibility of a signal arriving in the future. Fig 3B and 3C show the associated optimal policies when occupying the weak and strong attentional states, respectively. In each case, the dark, bounded region depicts the set of belief states at each time step where it is optimal to choose strong. (The policy is defined only on the belief plane at each discrete time step; the interpolated rendering is for didactic clarity.) This region grows over time in accordance with the growing probability of a signal being on. In neither case, however, is there any reason to select strong in the final decision state ($n = N = 10$): no information is provided at this state, so it can have no bearing on performance in the current trial (and any impact on performance in future trials is not considered here).
Working backwards through time again, we see in the weak attentional state (Fig 3B) that choosing strong is generally worthwhile if one is uncertain about whether a signal is present and belief in the post-signal state is low; this region extends backwards in time over a shrinking set of beliefs, until at $n = N_0 = 3$ there is no belief state where choosing strong pays off. As already mentioned, the set of beliefs for which choosing strong is optimal grows over time because the probability of a signal being present is not uniform across time steps (cf. Fig 2B): since $q < 1$, a signal typically stays on for more than one time step, so it can be more economical to choose strong later rather than earlier.
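To make the role of q concrete, here is a sketch of the distribution of signal durations, assuming (consistently with the limiting cases q = 0 and q = 1 discussed under 'Effects of varying model parameters') that an 'on' signal switches off with probability q after each step, and truncating at a hypothetical number of remaining steps `max_steps`. The actual onset and truncation scheme is the one shown in Fig 2B, which this sketch does not reproduce.

```python
def duration_pmf(q, max_steps):
    """Probability that a signal stays on for exactly k steps, k = 1..max_steps.

    Assumes the signal switches off with probability q after each step it is on;
    any probability of still being on after max_steps - 1 steps is assigned to
    max_steps (the signal is cut off by the end of the trial).
    """
    pmf = {}
    still_on = 1.0  # probability that the signal is still on at step k
    for k in range(1, max_steps):
        pmf[k] = still_on * q
        still_on *= 1.0 - q
    pmf[max_steps] = still_on
    return pmf

pmf = duration_pmf(0.2, 6)
print(round(sum(k * p for k, p in pmf.items()), 3))  # 3.689 (mean duration)
print(round(1 - pmf[1], 3))                          # 0.8 (lasts more than one step)
```

With q = 0.2, a signal lasts more than one step with probability 0.8, which is why deferring strong attention can still leave time to benefit from it.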
Conversely, once in the strong attentional state, it can be optimal to remain in this state (Fig 3C). Again, on the penultimate time step, this occurs where there is uncertainty about being in the signal state and belief in the post-signal state is low. Moving backwards in time, this region tends to shrink, and notably incorporates the region where belief in the pre-signal state is high (i.e., $P(X_n = \text{on}) + P(X_n = \text{post}) \approx 0$); again, a current belief that one is in the pre-signal state is no guarantee that a signal will not occur in the future. Note that this includes the initial $N_0 = 3$ time steps: even though with the current parameters we would never expect a trial to begin in the strong state, if this were to happen, the strong state would be maintained over the initial $N_0$ steps, despite the certainty that the pre-signal state is occupied. It is certainly not worthwhile to continue choosing strong once one is sufficiently certain of being in the signal or post-signal state.
While such plots of the optimal values and policy are instructive, they do not by themselves tell us how beliefs or attentional state evolve over time. For example, they do not make explicit that, since each trial begins with certainty that one is in the pre-signal state, $P(X_1 = \text{pre}) = 1$, and this certainty extends over the $N_0$ initial time steps, there is no change in belief during this period.
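The belief dynamics can be sketched as a forward filter over the three hidden states (pre, on, post). This is a minimal Python sketch under stated assumptions, not the authors' implementation: observations are Gaussian with mean 1 when the signal is on and 0 otherwise (as described under 'Optogenetic manipulations'), the signal switches off with probability q per step, and, hypothetically, onset on a signal trial is uniform over steps $N_0 + 1, \dots, N_0 + N_1$. Under these assumptions the first $N_0$ steps leave $P(X_n = \text{pre}) = 1$, whatever is observed.

```python
from statistics import NormalDist

def onset_hazard(n, p1, n0, n1):
    """P(signal turns on at step n | state is 'pre' at step n - 1).

    Hypothetical assumption: on a signal trial, onset is uniform over
    steps n0 + 1, ..., n0 + n1; no onset is possible outside this window.
    """
    if n <= n0 or n > n0 + n1:
        return 0.0
    passed = n - n0 - 1                  # onset slots already passed without a signal
    still_pre = 1.0 - p1 * passed / n1   # prior probability of 'pre' at step n - 1
    return (p1 / n1) / still_pre

def filter_step(belief, obs, n, sigma, p1=0.5, q=0.2, n0=3, n1=6):
    """Predict-then-update step for belief = (P(pre), P(on), P(post)).

    obs ~ N(1, sigma^2) if the signal is on and N(0, sigma^2) otherwise;
    sigma is sigma_w or sigma_s depending on the attentional state in use.
    """
    pre, on, post = belief
    h = onset_hazard(n, p1, n0, n1)
    # Predict: pre -> on with hazard h; on -> post with probability q.
    predicted = (pre * (1 - h), pre * h + on * (1 - q), on * q + post)
    # Update: weight each state by the likelihood of the observation.
    lik_on = NormalDist(1.0, sigma).pdf(obs)
    lik_off = NormalDist(0.0, sigma).pdf(obs)
    unnorm = (predicted[0] * lik_off, predicted[1] * lik_on, predicted[2] * lik_off)
    total = sum(unnorm)
    return tuple(u / total for u in unnorm)

belief = (1.0, 0.0, 0.0)  # P(X_1 = pre) = 1
observations = [0.1, -0.3, 0.9, 1.2, 0.8, 0.2, -0.1, 0.0]  # illustrative, steps 2-9
for n, obs in zip(range(2, 10), observations):
    belief = filter_step(belief, obs, n, sigma=0.5)  # strong attention throughout
    print(n, [round(b, 3) for b in belief])
```

Under the uniform-onset assumption, the hazard returned by `onset_hazard` rises over the onset window, which is one way an increasing probability of a signal being on can arise.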
For an alternative view, Fig 4 shows the average belief trajectories and average probability of choosing strong attention throughout a trial, separated by whether a signal did or did not occur, and by whether the trial was started in a weak ($Y_1 = \text{weak}$) or strong ($Y_1 = \text{strong}$) attentional state (the difference between which is quite subtle for the current parameters). In all cases, the initial certainty about the pre-signal state decreases; on average, belief in the pre-signal state ends up above 0.5 for a non-signal trial and below 0.5 for a signal trial.
When starting in the weak state (Fig 4A), the probability of choosing strong increases at n = 5, whether (right) or not (left) there is actually a signal in the trial, and returns to zero at the decision state (n = 10). While the probability of choosing strong is initially similar for non-signal and signal trials, it reaches a higher level in the non-signal case and is maintained up to the penultimate time step. By contrast, the probability of choosing strong on a signal trial decreases from time step 6 onwards. This difference comes from the asymmetry already mentioned: once a signal has been detected, there is no reason to pay the cost of remaining in the strong attentional state, since higher-quality sensory information cannot change the agent's mind (a signal can either stay on or switch off). However, if the signal has not yet been detected, it remains possible (until n = 10) for it to turn on. We also see this effect when starting in the strong state (Fig 4B): whether a signal is absent (left) or present (right), strong is initially maintained, but the average probability of choosing strong decreases more rapidly on a signal trial. Fig 5 further illustrates this effect by averaging signal trials in which the signal always lasts 3 time steps but comes on at different times (i.e., at n = 4, 5, 6, or 7). Regardless of whether the trial starts in the weak or strong attentional state, an earlier onset means that the agent can conclude earlier that a signal is indeed present, and so attention can be relaxed sooner within the trial. On the other hand, being sure that there has been no signal so far is no guarantee that one will not arise in the future, and so, once in the strong attentional state, it can make sense to maintain this heightened state of vigilance.
As one would expect, discrimination performance is better for longer signals, and starting in the strong state improves sensitivity (Fig 6A). At a finer grain, sensitivity is also modulated by signal onset (Fig 6B): regardless of whether one starts the trial in a weak or strong attentional state, the later a signal arrives, the more likely it is to be detected. This can ultimately be attributed to the increasing hazard function of a signal over the course of a trial, which makes it more likely under the optimal policy that attention is strong while the signal is on. Note also that the improvement in sensitivity when starting in the strong attentional state is due to improved detection of signals that arrive early in the trial.
Example: Rare signals. In the preceding example, we set $p_1 = 0.5$, meaning that in expectation half of the trials contain a signal. One observation was that a weak → strong shift was likely to occur whether or not a signal was present, and indeed, the probability of occupying the strong state on a given step was generally higher on a non-signal trial (cf. Fig 4A). Fig 7 shows an example in which occupation of the strong state is instead higher on a signal trial. This regime, in which signals are expected to be rare (we set $p_1 = 0.15$), will also be relevant when we consider sequential effects below (Section 'Sequential effects and ACh'). Although the overall probabilities of occupying the strong attentional state are much lower (note the y-axis scale for attention), engagement of strong attention is more likely on a signal than on a non-signal trial. Note, however, the much greater probability of missing a signal since, again in expectation, $P(X_N = \text{pre}) > 0.5$ on a signal trial (leading to a report of 'no signal').

PLOS COMPUTATIONAL BIOLOGY
Vigilance, arousal, and acetylcholine

Effects of varying model parameters
We next measure the effects of varying the default set of parameters systematically, one at a time. In reporting hit/false alarm rates, sensitivities, and average reward rates, we compare the optimal policy to two extreme cases: the 'always-weak' policy, which enforces constant occupation of the weak attentional state; and the 'always-strong' policy, which means the strong attentional state is always occupied.
The overall results (Fig 8) can be summarized as follows. For $p_1$ and $q$, intermediate parameter values lead to the greatest engagement of strong attention. Otherwise, as the relative improvement in quality of information from engaging strong decreases (i.e., as $\sigma_s$ approaches $\sigma_w$), or as the associated costs increase (i.e., |κ| and/or |�| increase), there is progressively less engagement of strong attention until a 'breaking point' is reached, beyond which it is no longer worth paying the price of strong attention at all.
Signal probability, $p_1$. When we vary signal probability $p_1$ systematically (Fig 8A), we see that the highest occupancy of the strong attentional state occurs when there is most uncertainty about whether there will be a signal or not, i.e., $p_1 = 0.5$ (Fig 8A, left); this is also where strong attention is likely to be engaged earliest within a trial. As one would expect, the ability to engage strong attention generally leads to improved hit and false alarm rates (Fig 8A, middle), i.e., improved sensitivity (Fig 8A, upper right). Also as expected, $p_1 = 0.5$ is where reward rate is lowest, but it is also where the ability to choose strong yields the largest increase in reward rate relative to the always-weak policy (Fig 8A, lower right).
Signal length, $q$. If $q = 0$, signals persist until the end of a trial; the optimal strategy in this case would be to choose strong only towards the end of a trial (taking account of how much total information it is ideal to collect). On the other hand, if $q = 1$, a signal only ever lasts a single time step, and the cost of choosing strong even at a single time step outweighs the benefit in terms of enhanced detection (Fig 8B, left). Between these two extremes, we observe that, as $q$ increases, the occupancy of the strong attentional state initially increases, peaking at around $q = 0.3$; beyond this, the occupancy rate decreases, until at $q \approx 0.8$ it is no longer worthwhile engaging the strong attentional state at all. This breaking point, where it is no longer worth paying the cost of strong attention, marks where the performances of the optimal and always-weak policies converge (Fig 8B, middle and right).
Information quality, $\sigma_s$. For choosing strong to be worthwhile, the informational benefit of this state must outweigh its cost. As we increase $\sigma_s$ towards the value of $\sigma_w$, choosing strong becomes less worthwhile, until there is another breaking point (at $\sigma_s \approx 0.6$) beyond which it is not worthwhile at all (Fig 8C, left). Again, this leads to a convergence in performance between the optimal and always-weak policies (Fig 8C, middle and right).
Attention costs, κ and �. Much as when reducing the quality of information via $\sigma_s$, as we increase the cost of a weak → strong shift (i.e., increase |κ|), there is again a point (|κ| ≈ 0.15) at which this cost is no longer worth paying (Fig 8D, left), and we observe convergence of the optimal and always-weak policies (Fig 8D, middle and right). For fixed κ, as we gradually make staying in the strong state more expensive (i.e., increase |�|), we also observe a decrease in the willingness to occupy this state (Fig 8E, left). At |�| = |κ| = 0.1, where there is no benefit to maintaining the strong attentional state (compared with re-engaging it), it is still worthwhile to choose strong, but only for a single time step just before the decision is made.
Fig 8 caption (excerpt): For each parameter, we show the expected occupation of the strong state (left; upper plot shows probability of occupying strong at each time step, while lower plot shows the 'marginal' expected number of steps in strong state); the hit rates and false alarm rates (middle); and the sensitivities d′ derived from the hits and false alarms, and average reward rates $g^*$ (right). For the middle and right plots, we compare the performance of the optimal policy with the 'always-weak' and 'always-strong' policies (see main text). Parameters varied one at a time, with other parameter values fixed at 'default' values: $p_1 = 0.5$; $q = 0.2$; $\sigma_w = 1$; $\sigma_s = 0.5$; κ = −0.1; � = −0.014; $N_0 = 3$; $N_1 = 6$. https://doi.org/10.1371/journal.pcbi.1010642.g008

Optogenetic manipulations
Gritton et al. [30] used optogenetics to either stimulate or suppress cholinergic activity while mice performed the sustained attention task (SAT); separate experiments targeted either cholinergic cell bodies in the basal forebrain (BF) or cholinergic terminals in the right medial prefrontal cortex (mPFC). They reported a rich pattern of results (Fig 9A). On signal trials, BF activation led to an increase in hits for shorter signals and BF suppression led to a decrease in hits for longer signals, whereas no significant effect on performance was found for mPFC manipulations. On non-signal trials, activation of either area led to an increase in false alarms, while suppression of BF had no effect on false alarms. Despite the relative simplicity of our model, these results inspired us to examine the effects of analogous manipulations in it. In the cases treated so far, we assumed that the actual and believed qualities of the signal are the same. For example, if the present choice is to attend strongly ($a_n = \text{strong}$), then $O_n$ is indeed drawn from $\mathcal{N}(\mu, \sigma_s^2)$, where $\mu$ depends on whether a signal is in fact present ($\mu = 1$) or not ($\mu = 0$), and inference proceeds on this basis (i.e., based on $P(O_n \mid X_n, a_n = \text{strong})$).
However, as well as considering different cases in which this assumption is met, we can study cases where it breaks down, so that there is a mismatch between the actual and believed quality of the signal. We report results for the default set of model parameters (cf. Section 'Single trial'), but the qualitative pattern of results holds for a broad range of model parametrizations.
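The 'mismatch inf.' manipulation discussed below can be illustrated with a one-observation toy version of the task (our simplification, not the full POMDP): the observation is generated with one noise level but interpreted with another. The prior probability of a signal (here a hypothetical 0.3, standing in for the default expectation that no signal is present) matters: with a prior of 0.5 the decision threshold sits at 0.5 regardless of the believed noise level, whereas with a prior favouring 'no signal', believing observations to be more precise (as under 'activation') lowers the threshold and so raises both hits and false alarms.

```python
from math import log
from statistics import NormalDist

def report_rates(sigma_actual, sigma_believed, prior_signal):
    """Hit and false-alarm rates in a one-observation toy detection task.

    o ~ N(mu, sigma_actual^2), with mu = 1 (signal) or mu = 0 (no signal).
    The agent reports 'signal' when its posterior exceeds 0.5, computing the
    likelihood ratio as if the noise s.d. were sigma_believed.
    """
    # log prior odds + (2o - 1) / (2 sigma_b^2) > 0
    #   <=>  o > 0.5 - sigma_b^2 * log(prior / (1 - prior))
    log_prior_odds = log(prior_signal / (1 - prior_signal))
    threshold = 0.5 - sigma_believed ** 2 * log_prior_odds
    hit = 1 - NormalDist(1.0, sigma_actual).cdf(threshold)
    false_alarm = 1 - NormalDist(0.0, sigma_actual).cdf(threshold)
    return hit, false_alarm

# Observations always have s.d. 0.75; only the believed s.d. differs.
for label, believed in [("act.", 0.5), ("control", 0.75), ("supp.", 1.0)]:
    hit, fa = report_rates(0.75, believed, prior_signal=0.3)
    print(f"{label:8s} hit={hit:.3f} fa={fa:.3f}")
```

In this toy, overweighting the observation relative to the prior is what moves hits and false alarms together, which is the intuition behind the prior-suppression reading of the 'mismatch inf.' results.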
Fig 9C (left) shows the 'matched' case: observations are generated according to the optimal ('control'), always-strong ('act.') or always-weak ('supp.') attentional policies, and inference is consistent with (matched to) the policy actually used. As expected, hits increase (marginally) under always-strong and decrease under always-weak, while false alarms remain essentially unchanged (a tiny decrease) and increase, respectively (cf. the hits and false alarms at $p_1 = 0.5$ in Fig 8A, middle). Modulo the relative sizes of these effects, this is unsurprising: more precise observations lead to better performance, less precise observations to worse performance, and this is apparent in both hits and false alarms.
One type of mismatch we can consider is when observations are treated during inference as being generated under the optimal attentional policy, but are in fact generated under an always-strong ('act.') or always-weak ('supp.') policy (Fig 9C, middle, 'mismatch obs.'). This leads to a slight inflation of false alarms for the latter case, but otherwise yields the same qualitative pattern as the matched condition.
Another type of mismatch is when observations are indeed generated under the optimal attentional policy, but treated as if generated under the always-strong ('act.') or always-weak ('supp.') policy (Fig 9C, right, 'mismatch inf.'). Interestingly, what we see here is qualitatively aligned with the BF findings of Gritton et al. (cf. Fig 9B): hit rates increase under 'activation' and decrease under 'suppression'; in addition, 'activation' increases the false alarm rate, and 'suppression' decreases it (to a lesser degree).
The fact that the latter, and not the former, type of mismatch captures the qualitative trend of results is notable, as it allows us to decompose the three putative roles that ACh plays: for (i) an increase in the SNR of the input to have an effect, the impact of the bottom-up information concerned has to be increased relative to prior expectations, which can happen through (ii) boosting the strength of this signal, (iii) the suppression of priors, or both. The mismatch more consistent with the data (the rightmost plot of Fig 9C) implicates (ii) and/or (iii), suggesting that this might be the dominant effect of the ACh manipulations. False alarms and hits both go up with BF activation, since the prior, which would be suppressed, implements the default assumption in this mode of signal processing that the signal is not present. Suppressing ACh boosts the relative effect of the prior, thus decreasing hits. The effect on false alarms is harder to assess, not least because of a floor effect. That activating ACh just in the mPFC does not increase hits could be because the prior is already incorporated earlier in the sensory processing hierarchy, implying that hits cannot be substantially rescued at the level of the mPFC.

Multiple trials
So far, we have considered the optimal control of attention when the optimization problem is to maximize performance on a single trial. In the single-trial case, it never makes sense to choose strong in the decision state, since one would be paying a cost for no advantage, informational or otherwise. In this section, we consider the optimal policy when future trials are also taken into account. We show that it can then make sense to maintain a strong attentional state into the next trial in order to avoid the possible future cost of a weak → strong shift, and suggest that this may help explain a sequential effect in ACh signalling.
Maintaining strong attention across trials. We keep the same, 'default' parameters as in the frequent signal, single-trial case, but we now solve for a policy that takes into account future trials; in particular, we solve for the policy that maximizes the average reward rate (see Methods).
The optimal policies in the weak and strong attentional states are essentially identical to those in the single-trial case; the only exception is that if one occupies the strong attentional state at the time of decision, it is optimal to choose strong for all belief states, and thereby start the next trial in the strong attentional state. This is despite the fact that there is no information to be gained in the decision state, and that the first $N_0$ steps of the next trial are known to be spent in the pre-signal state; the only reason to maintain strong into the next trial is that this is cheaper than paying the cost of a subsequent weak → strong shift.
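The cost argument can be made explicit with some deliberately simplified bookkeeping (our own, not the average-reward computation described in Methods): compare the cost of staying strong over a hypothetical number of 'bridging' steps (the decision step plus the $N_0$ pre-signal steps of the next trial) with the expected cost of a later weak → strong shift. It ignores informational value and all downstream consequences, so it only illustrates the trade-off.

```python
def maintaining_is_cheaper(kappa, maintenance_cost, bridge_steps, p_reengage=1.0):
    """Compare staying strong across a trial boundary with disengaging.

    Simplified, hypothetical bookkeeping: staying strong pays |maintenance_cost|
    on each of `bridge_steps` steps, whereas disengaging pays |kappa| once,
    with probability `p_reengage` that strong attention is needed again.
    """
    cost_stay = bridge_steps * abs(maintenance_cost)
    cost_switch = p_reengage * abs(kappa)
    return cost_stay < cost_switch

# Default costs; bridge = decision step + N0 = 3 pre-signal steps.
print(maintaining_is_cheaper(-0.1, -0.014, bridge_steps=4))                  # True
print(maintaining_is_cheaper(-0.1, -0.014, bridge_steps=4, p_reengage=0.5))  # False
```

With the default costs, bridging 4 steps costs 0.056 against 0.1 for a shift, so staying strong is cheaper if re-engagement is certain to be needed; if it were needed only half the time (expected cost 0.05), it would not be.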
Correspondingly, the average trajectories show that attention never completely reverts strong → weak at the end of a trial (Fig 10). Furthermore, for a signal trial, the later the signal arrives, the more likely it is that strong attention is maintained into the next trial (Fig 11).

Sequential effects and ACh
Howe et al. [33] measured ACh release in right mPFC while rats performed the SAT. They reported that ACh transients were only present on signal trials in which an animal correctly reported a signal (i.e., a hit trial), and not on correct rejection or miss trials; false alarm trials were not analyzed because of their comparatively low number (approximately 16% of non-signal trials). Furthermore, Howe et al. reported that a significant ACh increase occurred on a hit trial only when a non-signal was reported on the previous trial ('incongruent hit'), while no such increase was observed on a hit trial if the previous trial was also a hit trial ('consecutive hit'; Fig 12A, left). A nonsignificant trend for a decline in ACh during an 'incongruent correct rejection' (i.e., a correct rejection when the previous trial was a hit trial) was also reported (Fig 12B, left).
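The trial categories used by Howe et al. depend on the previous trial's outcome. A small hypothetical helper that assigns them (the outcome codes and function name are our own):

```python
def label_trial(previous, current):
    """Sequential category of a trial, given outcome codes 'hit', 'miss',
    'cr' (correct rejection) or 'fa' (false alarm) for it and the previous trial.
    Returns None for combinations outside the categories described above."""
    previous_reported_no_signal = previous in ("miss", "cr")
    if current == "hit" and previous_reported_no_signal:
        return "incongruent hit"
    if current == "hit" and previous == "hit":
        return "consecutive hit"
    if current == "cr" and previous == "hit":
        return "incongruent correct rejection"
    return None

outcomes = ["miss", "hit", "hit", "cr", "miss", "hit"]  # illustrative sequence
print([label_trial(p, c) for p, c in zip(outcomes, outcomes[1:])])
# ['incongruent hit', 'consecutive hit', 'incongruent correct rejection', None, 'incongruent hit']
```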
Dayan [24] suggested that if an initial detection of a signal establishes a detection-oriented task set that lasts across trials, then there would be little unexpected uncertainty when a (detected) signal follows a (detected) signal, and so ACh release would not be expected during these consecutive hits (Fig 12C; see also [32]). Implicit here is the additional requirement that strong → weak shifts do sometimes occur, thereby temporarily deactivating the detection-oriented task set; otherwise, there would be no interesting dynamics to observe at all.
In terms of our model, attentional dynamics (and, by hypothesis, ACh dynamics) resembling the reported pattern of ACh activation will arise if the following conditions are satisfied: (i) a weak → strong shift is more likely to occur on a signal trial than on a non-signal trial (since an increase in ACh was only observed during signal trials); (ii) a strong attentional state is at least sometimes maintained into the next trial following a hit trial (since no increase in ACh was observed for consecutive hits); and (iii) strong → weak shifts do sometimes occur.
We already saw that a weak → strong shift is more likely to occur on a signal trial than on a non-signal trial in the model if signals are expected to be rare (i.e., $p_1 < 0.5$; Fig 7). We also saw a regime in which strong attention is maintained across trials (Fig 10). What we have not yet examined closely is when disengagement (i.e., a strong → weak shift) occurs. One might draw a distinction between passive and active mechanisms of disengagement: active disengagement would be mandated by the optimal policy (responding to the relative costs and benefits), while passive disengagement would not be so mandated, but would rather occur sporadically, somewhat like a 'lapse' process, or perhaps even be associated with the sort of spontaneous fluctuations in attention apparent in longer-running vigilance tasks [3,5]. The parameter δ in the model was introduced principally to allow for the latter possibility: even if the intention is to begin the next trial in the strong attentional state, one in fact begins in the weak state with probability δ.
Fig 12 caption (excerpt): Left: a significant increase in ACh on signal trials in which an animal correctly reported a signal (i.e., hit trial), but only when a non-signal was reported on the previous trial ('incongruent hit') and not if the previous trial was also a hit trial ('consecutive hit'). Adapted from [33]. Right: simulation results. (B) Non-signal trials. Left: no increase in ACh on correct rejection trials; a non-significant trend for a transient decline in ACh on a correct rejection trial if preceded by a hit trial ('incongruent correct rejection'). Adapted from [33]. Right: simulation results. In the experiments, ACh signals are relative to a pre-signal baseline (period of 2 s before signal onset, or analogous period on non-signal trial), and aligned to signal onset or analogous time point for non-signal trials (black arrows); average time of levers' insertion following cue (red arrows). In the simulations, 'ACh' is measured relative to a baseline obtained by averaging attentional state over 3 time steps before the signal onset (signal trials) or the average signal onset time (non-signal trials). Vertical dashed line and black arrow indicate cue; red arrow indicates average time of trial end. Averages reflect 10,000 simulated trials. Prior signal probability $p_1$ switches between 0.1 and 0.15, with parameters otherwise fixed as follows: $q = 0.2$, $\sigma_w = 1$, $\sigma_s = 0.5$, κ = −0.1, � = −0.005, δ = 0.001. (C) Left: if a signal has prompted a weak → strong attentional shift (and detection) on trial t−1, this strong attention may persist into trial t, where detection is associated with no change in attentional state. Right: if the weak attentional state is currently occupied, a weak → strong shift is most likely to occur on a (detected) signal trial; t−1 may have no signal or a signal that is undetected (i.e., a miss trial). (D) Left: consecutive correct reports of no signal are not expected to be associated with attentional change. Right: if a signal (and detection) means that the strong attentional state is occupied at the end of trial t−1, it may persist into trial t; however, if there is no signal in this next trial, then there may be a reversion back to the weak attentional state (i.e., a relative decrease). https://doi.org/10.1371/journal.pcbi.1010642.g012
Here, we focus on what is arguably the more interesting case: active disengagement. One possibility, amongst others (see Discussion), arises if we relax the very strong assumption that an agent knows the true parametrization of the task. For example, the agent cannot take for granted that the true probability of a signal trial is $p_1 = 0.5$, or that it is fixed at this level. The signal probability needs to be estimated across trials, and may or may not be assumed to be stationary. Since the optimal policy depends on the estimated signal probability, if the latter changes, then so too should the policy (the same logic applies if there are nonstationarities in other model quantities, such as increases in cost with fatigue).
Consider the case where the estimated signal probability $\hat{p}_1$ tends to be low, but changes over time with trial-to-trial feedback. Furthermore, assume that for relatively high estimates it is optimal to maintain the strong attentional state across trials, while for lower estimates it is not. Since a signal trial will lead to an increase in $\hat{p}_1$, the occurrence of a signal trial will encourage maintenance of the strong attentional state if it is occupied; conversely, a non-signal trial will lead to a decrease in $\hat{p}_1$, thus discouraging maintenance of the strong attentional state.
In Fig 13, we show an example of the evolution of beliefs, attention, and estimated signal probability over 6 trials. For simplicity, we assume that $\hat{p}_1$ switches between only two (low) values: a higher probability ($\hat{p}_1 = 0.15$) and a lower probability ($\hat{p}_1 = 0.1$). Changes between these estimates are prompted in two ways. First, if $\hat{p}_1$ is currently low and a weak → strong shift occurs during the trial, then this shift immediately prompts $\hat{p}_1$ to be increased, and the increase lasts at least until feedback at the end of the trial. This can be justified by considering that a weak → strong shift always indicates a suspicion that a signal may be present, whereas a strong → weak shift does not necessarily indicate a belief that a signal is absent (indeed, we saw above that switches of the latter type are more associated with certainty that a signal has occurred). Second, $\hat{p}_1$ changes in response to feedback (i.e., reward vs. no reward) at the end of each trial, which, at least from the experimenter's point of view, unambiguously indicates whether a signal was or was not in fact present in the trial. Accordingly, we deterministically set $\hat{p}_1 = 0.15$ after a signal trial, and $\hat{p}_1 = 0.1$ after a non-signal trial. A change in $\hat{p}_1$ yields an immediate change in policy, and, in our chosen example, strong is always maintained across trials when $\hat{p}_1 = 0.15$, but not when $\hat{p}_1 = 0.1$ (parameters otherwise set as follows: $q = 0.2$, $\sigma_w = 1$, $\sigma_s = 0.5$, κ = −0.1, � = −0.005, δ = 0.001).
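The estimation rules for $\hat{p}_1$ in this example are simple enough to state as code. This is a minimal Python sketch of the rules described above; the function names are hypothetical, and `maintain_strong` encodes the outcome of the chosen example (strong kept across trials only at the higher estimate) rather than something derived here. The lapse parameter δ is included as described in the text.

```python
import random

P1_HIGH, P1_LOW = 0.15, 0.10  # the two estimates used in the example

def p1_within_trial(p1_hat, weak_to_strong_shift):
    """A weak -> strong shift during the trial raises the estimate to P1_HIGH."""
    return P1_HIGH if weak_to_strong_shift else p1_hat

def p1_after_feedback(signal_present):
    """End-of-trial feedback sets the estimate deterministically."""
    return P1_HIGH if signal_present else P1_LOW

def maintain_strong(p1_hat):
    """Example policy outcome: strong is kept across trials only at P1_HIGH."""
    return p1_hat == P1_HIGH

def next_start_state(intends_strong, delta, rng):
    """Even when a strong start is intended, it lapses to weak with probability delta."""
    if intends_strong and rng.random() >= delta:
        return "strong"
    return "weak"

# Signal presence on Trials 1-5 of the example; estimate at the start of Trials 2-6.
signal_present = [True, True, True, False, True]
print([p1_after_feedback(s) for s in signal_present])  # [0.15, 0.15, 0.15, 0.1, 0.15]
rng = random.Random(0)
print(next_start_state(maintain_strong(0.15), delta=0.001, rng=rng))
```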
The first trial is a miss (M) trial (i.e., there was a signal, but a non-signal was reported), but the feedback indicates that a signal was in fact present, and so $\hat{p}_1$ is increased. Trial 2 is also a signal trial; during this trial there is a weak → strong shift, and the response is correct (hit), making this an incongruent hit (iH). Since $\hat{p}_1$ is already high, the strong attentional state is maintained into Trial 3. This trial is also a signal trial, and the response is correct (i.e., a congruent hit (cH)), but crucially, since attention is already in the strong state, there is no change in attentional state. Trial 4 is an (incongruent) correct rejection (iCR) trial, and the lack of a signal leads to a reversion of $\hat{p}_1$ to the lower value. Trial 5 is a miss trial, leading $\hat{p}_1$ to be increased at its end. Finally, Trial 6 sees a weak → strong shift, and is a hit.
The point of this example is to show one way in which incongruent hits can be associated with weak → strong shifts (i.e., relative ACh activation), while congruent hits can be associated with no change in attentional state (i.e., no change in ACh). Indeed, if we run a large number of trials, and treat attentional/ACh dynamics in the model as Howe et al. treated their ACh data (i.e., baseline-corrected and aligned), we find a similar difference between congruent and incongruent hits (Fig 12A, right). Interestingly, in the model we see a rather large negative change at the end of an incongruent correct rejection trial (Fig 12B, right). This reflects a strong → weak shift: the previous trial is a hit trial, on which the strong attentional state is likely to be occupied and maintained into the next trial (since $\hat{p}_1$ will be high); the current trial is a non-signal trial, during which strong tends to be maintained throughout, but at its end, feedback leads to a decrease in $\hat{p}_1$ (since there was indeed no signal), leading to a strong → weak shift (i.e., a relative decrease in attention/ACh).
Of course, Howe et al. [33] observed only a non-significant, and apparently transitory, trend towards a decrease in ACh on incongruent correct rejections, and so this latter pattern of model results is, strictly speaking, a misprediction. Nevertheless, we think it worth considering, since it is the mirror of the case for hit trials: incongruent hits are associated with a weak → strong shift because attention is likely to start weak (because of the preceding trial type) and be activated during the course of the current, signal trial; incongruent correct rejections are associated with a strong → weak shift because attention is likely to start strong (due to the preceding trial type) and be deactivated at the end of the current, non-signal trial (Fig 12D). Our misprediction is likely to be influenced by a particular simplifying assumption that we made concerning when $\hat{p}_1$ can be reduced, namely only at the end of the trial. Thus, in the simulation results in Fig 12A, the incongruent hit signal represents averaging over weak → strong shifts occurring on different time steps (due to different possible onsets of the signal). By comparison, in Fig 12B, the incongruent correct rejection signal averages over strong → weak shifts that all occur at the same time (i.e., trial end). The latter thus leads to a much stronger (and consequently mispredicted) decrease in attention/ACh. In this light, it would certainly be interesting to consider more realistic assumptions about ongoing estimation of $\hat{p}_1$, as well as other modifications of the model.
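The averaging argument can be illustrated with a toy calculation, assuming idealised step-shaped traces and a hypothetical trial length of 10 time steps: eight weak → strong shifts spread over different steps versus eight strong → weak shifts all at the final step. Both averages change by the same total amount, but the jittered one does so in small increments while the synchronized one changes in a single step.

```python
def step_trace(shift_time, length, before, after):
    """Idealised attention trace switching from `before` to `after` at `shift_time`."""
    return [before if t < shift_time else after for t in range(length)]


def average(traces):
    """Time-aligned mean across trials."""
    return [sum(col) / len(traces) for col in zip(*traces)]


def largest_one_step_change(trace):
    """Largest absolute change between consecutive time steps."""
    return max(abs(b - a) for a, b in zip(trace, trace[1:]))


T = 10  # time steps per trial (hypothetical)
# Incongruent hits: weak→strong shifts at different steps, as signal onsets vary.
jittered = average([step_trace(s, T, 0, 1) for s in range(2, T)])
# Incongruent CRs: strong→weak shifts all at the last step (feedback).
synchronized = average([step_trace(T - 1, T, 1, 0) for _ in range(8)])

print(largest_one_step_change(jittered))      # 0.125
print(largest_one_step_change(synchronized))  # 1.0
```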

Discussion
We considered vigilance, or sustained attention, from the point of view of an optimal control problem in which a series of choices of internal, attentional action are made, and where these internal actions may accrue both costs and benefits. In particular, inspired by the sustained attention task (SAT; [12]) developed for rodents, we considered the control of a binary attentional state in an abstract model of the SAT involving trials in which strong attention yields better sensory information than weak, but incurs costs. We analysed how optimal allocation of attention (i.e., the choice of which attentional state to occupy when) depends on signal characteristics (whether a trial is likely to contain a signal, and if so, when it is likely to be on), the relative qualities of information in the different attentional states, and the costs associated with paying strong attention. We also considered what happens when the actual and expected signal properties do or do not match, and possible relationships with the effects on detection performance observed in an optogenetic study.
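To make the informational benefit of strong attention concrete, here is a sketch assuming Gaussian observations at each time step, with mean $\mu$ when the signal is on and 0 when it is off, and noise standard deviation $\sigma_w = 1$ (weak) or $\sigma_s = 0.5$ (strong), the values used in the Fig 13 example. The signal mean $\mu = 1$, and the simplification that the signal is either on at every observed step or absent throughout, are assumptions made for illustration only.

```python
import math

MU = 1.0                      # hypothetical signal mean; noise-only mean is 0
SIGMA_W, SIGMA_S = 1.0, 0.5   # noise SDs in the weak and strong states


def llr(x, sigma, mu=MU):
    """Log-likelihood ratio log N(x; mu, sigma^2) - log N(x; 0, sigma^2)."""
    return (mu * x - mu ** 2 / 2) / sigma ** 2


def expected_llr_per_step(sigma, mu=MU):
    """Mean evidence (nats) gained per step while the signal is on."""
    return mu ** 2 / (2 * sigma ** 2)


def posterior_signal(prior, xs, sigmas):
    """Belief that the signal is present, given observations xs, each taken
    with the noise level of the attentional state occupied at that step."""
    log_odds = math.log(prior / (1 - prior))
    for x, s in zip(xs, sigmas):
        log_odds += llr(x, s)
    return 1 / (1 + math.exp(-log_odds))


print(expected_llr_per_step(SIGMA_W), expected_llr_per_step(SIGMA_S))
```

Under these assumptions the expected evidence per step is $\mu^2/(2\sigma^2)$, so halving the noise standard deviation quadruples it (0.5 vs 2.0 nats per step).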
In the multi-trial case, we saw that optimal allocation of attention also takes into account performance in future trials. Here, it can be optimal to pay the costs of maintaining strong attention into the next trial, not because of any immediate informational benefit (there is none), but in order to avoid the future cost of a weak → strong shift. We demonstrated that this can lead to trial-history effects in when weak → strong shifts occur, and suggested that such sequential effects may help explain when cortical ACh transients have been observed to occur in the SAT.

Elective attention
We treated the elective allocation of attention as a characteristic form of internal action [40]. Such internal actions can be considered on an equivalent footing with external actions, potentially coming with actual costs associated with the active deployment of neural resources, opportunity costs associated with the unavailability of those resources for other tasks (which we did not consider; see [18]), and, in our case, well-defined benefits in terms of increasing the signal-to-noise ratio (SNR) of external inputs. Our calculations are examples of the expected value of control (EVC) [15,16], and may involve similar psychological and neural considerations. Indeed, it has been observed that rewarding human subjects for their performance on sustained attention tasks can improve their momentary performance [53,54], consistent with the proactive engagement of various cortical attention networks [55]. In such tasks, there is often an overall decrement in vigilance over the course of a session, which is also subject to varied motivational effects [53,56,57]. We did not address vigilance decrement in the current study, particularly since this is not a reliable feature of SAT performance [12,27] (M. Sarter, personal communication, July 2022). However, given the prominent place of vigilance decrement in the sustained attention literature, and recent intriguing reports of dissociable influences of motivational manipulations on overall performance versus performance decrements over time [53], this would certainly be a salient target for future work.
From the perspective of a general POMDP [34], the way that information-gathering is instrumental for reward-gathering (or cost-avoiding) is completely standard. It also arises in the context of sequential probability ratio tests or drift diffusion decision-making [58,59], where the benefits of gaining new information about an uncertain external stimulus are balanced against the opportunity cost of time. In economics, considerations such as ours are associated with the framework of rational inattention [39,60-62], in which agents are charged for the mutual information between internal representations and external signals, and so, like our agent, opt to remain partially ignorant. Recently, Mikhael et al. [63] adapted the rational inattention framework to the case in which the neuromodulator dopamine (DA), by reporting the average reward rate in an environment [64], controls the fidelity of the internal representation of the external information (for similar lines of thought, see [65][66][67][68]). In our terms, this suggests a rich relationship between DA, as the medium of expected value, and ACh, as the medium of arousal and attentional improvement. For example, St. Peters et al. [69] provided evidence that NMDA-mediated stimulation of the shell (though not the core) of the nucleus accumbens, a principal target of dopaminergic neurons in the ventral tegmental area, can, via activation of cholinergic cortical projections, counteract distractor-induced impairments of performance in the SAT (though, interestingly, performance was not improved by such stimulation in the absence of distractors).
We assumed just two attentional states, strong and weak, which has the practical benefit of facilitating analysis, but is also consistent with interpretations of behaviour in the SAT in terms of shifts between a small number of task-sets or 'modes', such as 'default' and 'detection' modes [32] (see also below). Nevertheless, the model could readily be extended to allow for different degrees of attention incurring different costs, as, for instance, in the graded information-based costs in rational inattention [39,61], thereby permitting more fine-grained control of the SNR.
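As a sketch of what a graded, information-based cost could look like, assume (for illustration only) that the attentional setting fixes the error rate of a binary symmetric channel through which a binary signal is read out, and that the cost is proportional to the mutual information $I(S;X) = H(X) - H(X \mid S)$ in bits. This is a simple stand-in for the more general rational inattention formulation [39,61], not the model used in this paper; `price_per_bit` is a hypothetical parameter.

```python
import math


def entropy_bits(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def mutual_information(prior, error):
    """I(S; X) in bits for a binary signal S with P(S=1) = prior, read out
    through a binary symmetric channel that flips it with probability `error`."""
    p_x1 = prior * (1 - error) + (1 - prior) * error
    return entropy_bits(p_x1) - entropy_bits(error)


def information_cost(prior, error, price_per_bit):
    """Graded attentional cost, proportional to the information acquired."""
    return price_per_bit * mutual_information(prior, error)


for error in (0.5, 0.3, 0.1, 0.0):  # from no attention to perfect read-out
    print(error, round(mutual_information(0.5, error), 3))
```

Lowering the error rate (stronger attention) buys more information at a proportionally higher cost, giving a continuum of settings rather than the two states assumed here.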

Heterogeneous attentional deployment
Particularly when engaging and/or maintaining attention is expensive, it would often be better to have either strong or weak attention for the whole trial; indeed, this is conventional for signal detection theory [51]. Here, we investigated the more subtle case when the agent collects sufficient information to conclude that it becomes worth engaging attention part-way through a trial, leading to a variable rate of information gathering. The work therefore has natural links with previous studies that consider optimal integration for (uncontrolled) time-varying reliability of evidence [36,37] and control of visual fixations [38,70]. The structure of the task considered here, in which the signal never comes on for a set period after initiation and is more likely to be on towards the end of a trial, implies that the participants should engage in a form of temporal orienting of attention [71], most obviously by refusing to pay attention immediately after a trial starts. Of course, such temporal considerations are subject to the vagaries of interval timing [72], which would rapidly make this orienting imprecise over a timescale of several seconds. In our case, this conditional engagement of attention arises under a somewhat particular set of parameter values in which, very crudely, the default operating regime is to assume the absence of a signal (due to its actual/assumed low probability of occurrence in a trial and/or temporal sparsity within a trial), but if a signal is suspected, then it is worth confirming with the higher-precision information available following strong attentional engagement. False alarms could then arise from a form of lapse process or 'trembling hand', which, for convenience, we did not model. Altogether, this makes the task involve a form of vigilance [7,27,73]. Note that in the regime we assumed in relation to Howe et al. [33], although a signal is as likely to appear as not on any given trial, we set the assumed probability of a signal in the model to be much lower, which is why strong attention was more likely to be engaged on signal trials. There was therefore a mismatch between the 'true' generative model (of the experimenter) and the model ascribed to the agent. While this was somewhat pragmatic in our hands, we suspect such mismatches are not uncommon. In the particular case of the SAT, subjects ought, if they understand that the feedback at the end of each trial reliably indicates whether a signal was present, to apprehend that the long-run relative frequency of a signal trial is approximately one half, but only if they also (correctly) believe that this relative frequency is fixed.
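The consequence of this mismatch can be quantified with Bayes' rule in log-odds form. A sketch, assuming the total evidence on a trial is summarised as a log-likelihood ratio in nats: starting from an assumed prior of 0.1 (for example, the low value of $\hat{p}_1$ in the Fig 13 example) rather than the experimenter's 0.5, the agent needs twice as much evidence to reach a posterior of 0.9 (about 4.39 rather than 2.20 nats).

```python
import math


def log_odds(p):
    return math.log(p / (1 - p))


def posterior(prior, total_llr):
    """Posterior P(signal) from a prior and total evidence (log LR, nats)."""
    return 1 / (1 + math.exp(-(log_odds(prior) + total_llr)))


def evidence_needed(prior, target):
    """Total log-likelihood ratio needed to move belief from `prior` to `target`."""
    return log_odds(target) - log_odds(prior)


for prior in (0.5, 0.1):  # experimenter's base rate vs a low assumed value
    print(prior, round(evidence_needed(prior, 0.9), 2))
```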
Other cases of heterogeneous attentional engagement have been studied when external information signals that the environment has switched into a markedly different state (e.g., going from safe for foraging to dangerous [74]). This has been considered as a sort of interrupt signal that involves norepinephrine (NE) rather than ACh [45,75]. The case of danger also involves arousal, but more a form of cognitive reset and task-switching from unexpected uncertainty than a controlled engagement of focus within a single task arising from expected uncertainty [25]. The task that Dayan and Yu [45] used to examine phasic NE [76] also involved signal detection, but NE was treated purely as a read-out mechanism for a state change in the external world, rather than having an effect on stimulus processing as here. It would certainly be intriguing to also measure the involvement of NE in the task we considered.
Shea-Brown et al. [77] previously considered the involvement of NE in a signal detection task. For them, NE release also depended on excursions beyond a threshold of certainty about the presence of a signal. However, in their model NE played the role of destabilizing the dynamics of a downstream decision-making network so that it stopped integrating new information, and instead evolved quickly to a boundary in state space at which an action would be initiated. ACh plays a radically different role in our model, regulating rather than curtailing ongoing integration. One important consideration raised by [77] is the speed of action of the neuromodulator. NE takes effect in cortex rather slowly: at least around 100 ms after electrical stimulation of the source nucleus, the locus coeruleus [78]. A similar speed of action for ACh [79] would suggest that its phasic engagement might be too sluggish to improve the processing of short signals (the shortest were 25 ms and 50 ms in Howe et al. [33]), though conceivably fast enough to affect neural activity that is prompted by the signal and outlasts the signal itself, even at relatively early stages of vision (e.g., [80]).
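The latency argument is simple arithmetic. A sketch, assuming the neuromodulatory effect is triggered at signal onset (optimistic, since release would in practice follow some evidence of the signal) and that signal-driven activity persists for a hypothetical `persistence_ms` after the signal ends; the 100 ms latency is the order of magnitude cited above.

```python
def overlap_ms(signal_ms, persistence_ms, latency_ms):
    """Duration (ms) during which a modulatory effect starting latency_ms after
    signal onset coincides with signal-driven activity, assumed to last for the
    signal plus persistence_ms afterwards (hypothetical parameter)."""
    return max(0.0, signal_ms + persistence_ms - latency_ms)


LATENCY_MS = 100  # order of magnitude of the cortical latency cited above
for persistence in (0, 200):  # no persistence vs a hypothetical 200 ms
    print(persistence, overlap_ms(25, persistence, LATENCY_MS))
```

Without persistent activity, a 25 ms signal is over long before the modulator acts; only activity that outlasts the signal leaves a window in which the modulator can help.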
On a rather different timescale are the fluctuations in attention typically observed over the course of minutes to hours in the long-run vigilance tasks often used to interrogate sustained attention in humans [3,5,17]. These are also associated with NE (e.g., slow fluctuations in tonic activity, which are anticorrelated with phasic responses to task-relevant stimuli [23,81]). Providing an integrated understanding of the effects across these various timescales, perhaps including their contribution, if any, to spontaneous strong → weak attentional transitions, is an important task for the future.

Acetylcholine
We designed and illustrated our account in the light of the sustained attention task (SAT) of Sarter and colleagues [12,30,33]. However, given the complexity of the paradigm, we sought a qualitative rather than quantitative match with their results. ACh is known to play a causal role in this task because of the effects of pharmacological and optogenetic suppression and stimulation [27,30]. In our information-processing operating regime, the default is that the signal is absent and ACh is involved in the choice to gather sufficient information to refute this presumption. Thus, it is no surprise that the main effect of suppressing ACh is to reduce detection (i.e., reduce hits). That stimulating the BF increases detection can be explained in the same way-allowing higher-quality information about signal presence to be collected-particularly aiding detection of signals that are not normally able to benefit from ACh release in virtue of being too short. It is notable that activation of ACh axon terminals in right mPFC had no effect on signal trials [30].
The observation that stimulating the BF also increases false alarms [30] is harder to reconcile with the view that ACh release simply increases the quality of sensory information. One possibility, hinted at by the account we gave in which there was a discrepancy between the assumed and actual information, is to note that there are two ways that information integration can be affected in the light of a changed relative SNR in the external input versus the internal default. One is for there to be extra activity in the representation of that input, such as extra spikes associated with the signal that can overwhelm meagre activity in the integrator representing signal absence [82]. The other is for recurrent interactions in the integrator to be suppressed in favour of the external signal [32,83-85]. To the extent that the latter happens when the external signal is actually not of any better quality, a false alarm becomes more likely on a non-signal trial. It is likely that both mechanisms are engaged, and over multiple layers of cortical inference about the signal; disturbance of the delicate balance between internal expectations of signal unlikeliness and external signal and noise would lead to false alarms. Unfortunately, Gritton et al. [30] did not report whether trial sequence also affected the consequences of optogenetic manipulations, as one might suspect given the previous report of Howe et al. [33].
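The second mechanism can be caricatured with a toy decision rule; this is our illustrative assumption, not the model of this paper. The observer reports 'signal' when its prior log-odds plus a weight $w$ times the evidence exceeds zero, and $w > 1$ stands for suppressing the internal default in favour of the external input without any real gain in input quality. With a prior below one half, raising $w$ lowers the effective criterion, so false alarms rise along with hits; with a prior of exactly one half, $w$ has no effect.

```python
import math


def upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def report_rates(prior, weight, mu=1.0):
    """Return (hit rate, false-alarm rate) for a toy observer that reports
    'signal' when log-odds(prior) + weight * LLR(x) > 0, with x ~ N(mu, 1) on
    signal trials and x ~ N(0, 1) on non-signal trials.

    weight = 1 is calibrated; weight > 1 over-trusts the external input
    (a hypothetical stand-in for suppressed recurrent/default processing)."""
    prior_log_odds = math.log(prior / (1 - prior))
    # LLR(x) = mu * x - mu**2 / 2, so the rule is equivalent to x > threshold.
    threshold = (mu ** 2 / 2 - prior_log_odds / weight) / mu
    return upper_tail(threshold - mu), upper_tail(threshold)


for w in (1.0, 2.0):
    hit, fa = report_rates(prior=0.1, weight=w)
    print(w, round(hit, 3), round(fa, 3))
```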
In contrast to the effects of stimulating the cell bodies of cholinergic neurons in the BF, Gritton et al. [30] found that activation of their axons in right mPFC in non-signal trials led to an increase in false alarms, but activation in signal trials did not improve detection. The authors speculated that activation of ACh axon terminals in mPFC may be sufficient to modulate behaviour under circumstances where endogenous ACh transients are not expected (i.e., non-signal trials), but insufficient to modulate behaviour in circumstances where endogenous ACh transients are expected (i.e., signal trials); BF activation/suppression, by modulating ACh release more widely, may exert a stronger influence. Unfortunately, technical limitations meant that they did not record cholinergic transients concurrently with optogenetic manipulations. In our terms, stimulation in PFC might only affect the suppression of the default integration, and so mainly lead to problems with false alarms.
Note that the authors of the original study have a different interpretation of the ACh release, at least in PFC. That is, Howe et al. [33] interpret their findings in terms of a mechanism supporting shifts from 'externally-directed' attention, which in this case involves sensory monitoring for the possible appearance of a signal, to 'internally-directed' attention, here involving the retrieval and (re)activation of signal-associated response rules when a signal is detected (cf. [86]). Thus, they consider ACh to be engaged following full-blown detection, but to be involved in the correct response rather than in refined signal processing. Indeed, Sarter et al. [87] cite previous results [88] showing that selective cholinergic deafferentation of primary and secondary visual cortices does not alter animals' SAT performance. Of course, one of the beneficial characteristics of the task is that signal and no signal have the same response demands (just requiring different levers to be pressed), adding some complexity to this account.
Pinto et al. [79] used optogenetics to activate or inactivate ACh neurons in BF in awake mice selectively while they performed a visual go/no-go task involving discriminating between a vertical (target) and horizontal (non-target) drifting grating; task difficulty was manipulated by adjusting stimulus contrast. They found that ACh activation led to enhanced discriminability at all contrast levels, and this was due to a selective increase in the hit rate (there was no significant change in false alarm rate). Activation of cholinergic axons in V1 also led to enhanced discriminability attributable to increased hits, though not at the highest contrast. This could be consistent with enhanced processing of the input stimuli if the targets were more temporally dense (indeed, in this paradigm, 1s after an initial cue, a drifting grating was presented for 4s; the intertrial interval was 3s), so that there was no equivalent default of 'no-go' as in [30,33]. Inactivation of ACh cells in BF reduced sensitivity at all contrast levels. However, this may be due to an increase in false alarm rate, rather than a decrease in hits (L. Pinto, personal communication, December 2021), which could be harder to explain.
Pinto et al. [79] primarily targeted the nucleus basalis, though they note the possibility that ACh neurons in other BF nuclei may also have been activated by optogenetic stimulation; retrograde tracing revealed that cholinergic neurons throughout the BF project their axons to V1. Dual retrograde tracing revealed very few ACh neurons projected to both V1 and mPFC (5%), or to both V1 and primary auditory cortex (6%), while a slightly higher number projected to both V1 and neighbouring higher visual areas (13%).
However, we should note that some studies of the activity of ACh neurons do not find the sorts of responses that our analysis would expect. For instance, Hangya et al. [89] (see also [90]) recorded from optogenetically-identified ACh neurons in the BF of mice during an auditory task, in which animals had to respond to one tone with a lick to gain a water reward ('go' stimulus), and withhold licking in response to a differently pitched tone to avoid an airpuff punishment ('no-go' stimulus); to manipulate difficulty, the loudness of tones was also varied. Hangya et al. reported that most recorded ACh neurons were activated by both reward and punishment delivery. Neural responses to reward were modulated by the intensity of the preceding tone, so that the strongest neural response was observed following the quietest tone, when the animal may be least certain of the outcome-consistent with the coding of a degree of 'reinforcement surprise'; neural responses to punishment showed no such modulation by stimulus strength. Only 2/34 ACh neurons showed attentional modulation, operationalized as activity before stimulus onset that predicts RT or accuracy. Although reinforcement surprise is a key driver of expected uncertainty, consistent with the view of ACh reported above [25], the lack of attentional effect in this study is more puzzling. The more recent study of Laszlovszky et al. [90], which reports a greater degree of heterogeneity in ACh responses, and evidence about local control of release of ACh [91] may help resolve such puzzles. It may also be important, as with [79], to consider distinctions between the paradigm used in [89,90] compared to the SAT, such as differences in stimuli, timing, and response requirements (e.g., go/no-go lick vs. press left/right lever). 
Further clues may come from studies of the action of ACh at a detailed, implementational, level such as, for instance, recent modelling work that uses excitatory-inhibitory networks to explore how ACh may promote local gamma oscillations and theta-gamma coupling [92,93].
Along with these effects within a single trial is the evidence about the engagement of ACh when considering the sequential structure of trials [33]. Here, perhaps the most surprising finding is that ACh is particularly activated on a signal trial when the previous trial either did not involve a signal, or it did but the signal was missed. We interpret this vivification of ACh as being associated with the deployment of attention. Indeed, the observed ACh transients are thought to be generated within PFC (i.e., local release) based on interactions between thalamic glutamatergic afferents and heteroreceptor-mediated regulation of cholinergic terminals, rather than reflecting phasic activity of BF ACh neurons [91]. Howe et al. [33] speculate that during consecutive hits, ACh activity associated with the first (incongruent) hit may induce persistent spiking and continuing release of ACh that helps maintain the activated (signal-oriented) task set. Decay of the latter activation would then lead to a return to the predominant monitoring state. One prediction we can make related to sequentiality is that short signals will be better detected when attention is already engaged at the beginning of the trial on which they occur, as a result of events on the previous trial.
Our account of why these sequential effects arise is rather fragile, and depends on two key assumptions. The first is that once attention is engaged in a trial, it is worth keeping it engaged even after the agent is statistically convinced (or has reported) that the signal has been present. This depends on the cost of maintaining strong attention being much less than that of a weak → strong switch, and also being outweighed by the benefit of starting a new trial with strong attention engaged (including, for instance, the excess probability of correctly detecting a very short signal, or an immediate change in the assumed likelihood that trials will include a signal). The second assumption is that attention will sometimes revert to weak, thus requiring another weak → strong shift; in our parameter regime, these latter shifts will happen preferentially on signal trials. We focused on an active mechanism of disengagement (i.e., strong → weak shift) arising from fluctuating estimates of signal probability but, as noted, changes in other parameters could also play a role; utilities, costs, and signal-to-noise ratios could all plausibly fluctuate over the course of a trial as an animal's state of satiety and/or fatigue varies (e.g., [94]). The model could also accommodate passive shifts or sporadic distraction, or indeed the sort of 'in the zone' to 'out of the zone' spontaneous fluctuations that happen in long-run vigilance tasks [3,5], are affected by motivational factors [56], and might be associated with NE [23,81]. The converse spontaneous increases in attention could be similarly modelled.
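The first assumption amounts to a cost comparison. A sketch with hypothetical costs in arbitrary units, assuming, as in the model, that only weak → strong switches incur a switch cost: keeping strong attention across a 10-step gap at 0.005 per step costs 0.05, which is worth paying when re-engagement is likely (probability 0.8, expected switch cost 0.08) but not when it is unlikely (probability 0.3, expected switch cost 0.03), unless starting strong carries enough extra benefit.

```python
# Hypothetical cost comparison (arbitrary units); names are illustrative only.


def cost_of_keeping_strong(maint_cost_per_step, gap_steps):
    """Cost of carrying strong attention across the gap into the next trial."""
    return maint_cost_per_step * gap_steps


def expected_cost_of_dropping(switch_cost, p_reengage, lost_benefit=0.0):
    """Reverting to weak is free (only weak→strong switches are charged), but
    with probability p_reengage a weak→strong switch must be paid later, and
    any benefit of starting the next trial strong (lost_benefit) is forgone."""
    return p_reengage * switch_cost + lost_benefit


def should_maintain(maint_cost_per_step, gap_steps, switch_cost,
                    p_reengage, lost_benefit=0.0):
    """True if keeping strong attention is cheaper in expectation."""
    keep = cost_of_keeping_strong(maint_cost_per_step, gap_steps)
    drop = expected_cost_of_dropping(switch_cost, p_reengage, lost_benefit)
    return keep < drop


print(should_maintain(0.005, 10, 0.1, p_reengage=0.8))  # re-engagement likely
print(should_maintain(0.005, 10, 0.1, p_reengage=0.3))  # re-engagement unlikely
```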
Our account followed Hasselmo and Sarter [32] in treating weak and strong states as forms of task-sets or 'modes': a default response mode, in which the animal simply executes the most extensively-practiced response (i.e., presses the lever to report a non-signal, which is indeed the majority response) and a detection mode. The costs in switching between the two are then analogous to the switch costs that are prevalent in the task-switching literature [95]. We assumed that switch cost only accrues for the shift away from the default.
One important predecessor of the current work is by Atkinson [35], who assumed that maintaining a high level of sensitivity would be costly, and so subjects would aim to adjust their sensitivity level dynamically to trade off any consequent decrement in performance against the minimization of this cost. He specified particular rules for updating, trial to trial, both the level of sensitivity and the decision rule (i.e., criterion), and noted that sequential effects in detection performance could arise from adjustments to either or both. Partly inspired by this earlier work, Gilden and Wilson [3] studied 'streakiness' of performance (i.e., non-stationarity in correct vs. incorrect responses across trials) in signal detection tasks, akin to our discussion of maintained and lapsed attention. They interpreted their results as suggesting that the greater the attentional resources demanded in a task, the lower the level of streakiness; they also considered the possibility that the delay between trials is an important determiner of sequential dependence, where longer delays encourage greater independence between trials, though their evidence on this was mixed. We note that Sarter et al. [96], in interpreting ACh transients observed during a task slightly simpler than the SAT (rats were trained to approach a port to retrieve a reward following a visual signal; see [97]), suggest that the much longer ITI involved (90±30 s) meant that ACh transients were reliably elicited when animals detected the signal, since animals would most likely disengage during the long ITI, meaning that any subsequent successful detection would be an 'incongruent hit' (and thus elicit ACh release).
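The ITI argument can be made quantitative under a simple assumption: disengagement occurs as a Poisson event at a hypothetical constant rate $\lambda$, so the probability of still being engaged after an ITI of $t$ seconds is $e^{-\lambda t}$. With $\lambda = 0.1$ per second, about 37% of engaged states survive a 10 s gap, but only about 0.01% survive a 90 s gap, so nearly every detection after a long ITI would start from disengagement.

```python
import math


def p_still_engaged(iti_s, disengage_rate_per_s):
    """Probability that attention is still engaged after an ITI of iti_s
    seconds, if disengagement is a Poisson event with a constant
    (hypothetical) rate."""
    return math.exp(-disengage_rate_per_s * iti_s)


RATE = 0.1  # hypothetical disengagement rate, events per second
for iti in (10, 90):  # a short hypothetical ITI vs the 90 s ITI cited above
    print(iti, round(p_still_engaged(iti, RATE), 4))
```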
Gilden and Wilson [3] also considered a number of models that would give rise to streaky outcomes, including a model of 'intermittent attention' with probabilistic transitions between a low-effort state (with lower hit rate) and a high-effort state (the conditional transition probabilities were assumed to be fixed). Similar switching in engagement has more recently been studied in the context of animals solving a series of signal detection tasks [98], and may relate to the fluctuations in attentional engagement found over the long run in human sustained attention tasks that we mentioned above [5,17,81].
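A two-state chain of this kind produces streakiness whenever the states are persistent. A sketch, assuming a signal on every trial, fixed transition probabilities $a$ (low → high) and $b$ (high → low), and hit rates that depend only on the current state: conditioning on a hit on the previous trial shifts belief towards the high-effort state, so $P(\text{hit} \mid \text{previous hit})$ exceeds $P(\text{hit})$ whenever $1 - a - b > 0$ and the high-effort state has the higher hit rate. With the hypothetical values $a = b = 0.1$ and hit rates 0.9 and 0.5, the figures are 0.70 and about 0.746.

```python
def stationary_high(a, b):
    """Stationary P(high-effort state) for P(low→high) = a, P(high→low) = b."""
    return a / (a + b)


def hit_rates(a, b, hit_high, hit_low):
    """Return (P(hit), P(hit | previous hit)) at stationarity, assuming a
    signal on every trial and a hit rate that depends only on the state."""
    pi_h = stationary_high(a, b)
    p_hit = pi_h * hit_high + (1 - pi_h) * hit_low
    # Bayes: belief that the previous trial was in the high state, given a hit.
    post_h = pi_h * hit_high / p_hit
    # One transition of the chain to the current trial.
    next_h = post_h * (1 - b) + (1 - post_h) * a
    p_hit_after_hit = next_h * hit_high + (1 - next_h) * hit_low
    return p_hit, p_hit_after_hit


print(hit_rates(0.1, 0.1, hit_high=0.9, hit_low=0.5))  # persistent states: streaky
print(hit_rates(0.5, 0.5, hit_high=0.9, hit_low=0.5))  # memoryless: no streaks
```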
Our task and analysis concern a relatively simple signal, namely the turning on and off of a light of fixed intensity and location and of variable duration (i.e., unimodal, and varying along one dimension), and a simple response (i.e., pressing one of two levers), and so are likely not ideal for addressing heterogeneity in cholinergic neuromodulation, let alone the substantial additional complexities of cortical control over attentional enhancement (e.g., [99]) and its motivational sensitivities [54-56,100]. Heterogeneity in all the neuromodulatory systems is being intensively investigated [101][102][103], and ACh has long been identified as being more specific [104]. This is appropriate to the extent that ACh reports on expected uncertainty, given that such expectations can be diverse.

Conclusion
Stepping back from the details of this task and model, our simulations exemplify ways in which a neuromodulator might provide central regulation and coordination of the sort of radically distributed processing that otherwise occurs in the context of point-to-point, wired, connections.