Abstract
Animal responses occur according to a specific temporal structure composed of two states, in which a bout of responses is followed by a long pause until the next bout. Such a bout-and-pause pattern has three components: the bout length, the within-bout response rate, and the bout initiation rate. Previous studies have investigated how these three components are affected by experimental manipulations. However, it remains unknown what underlying mechanisms cause bout-and-pause patterns. In this article, we propose two mechanisms and examine computational models developed based on reinforcement learning. The model is characterized by two mechanisms. The first mechanism is choice: an agent makes a choice between the operant and other behaviors. The second mechanism is cost: a cost is associated with the changeover of behaviors. These two mechanisms are extracted from past experimental findings. Simulation results suggested that both the choice and cost mechanisms are required to generate bout-and-pause patterns; if either of them is knocked out, the model does not generate bout-and-pause patterns. We further analyzed the proposed model and found that it reproduced the relationships between experimental manipulations and the three components that have been reported by previous studies. In addition, we showed that alternative models can generate bout-and-pause patterns as long as they implement the two mechanisms.
Citation: Yamada K, Kanemura A (2020) Simulating bout-and-pause patterns with reinforcement learning. PLoS ONE 15(11): e0242201. https://doi.org/10.1371/journal.pone.0242201
Editor: Gennady Cymbalyuk, Georgia State University, UNITED STATES
Received: June 15, 2020; Accepted: October 29, 2020; Published: November 12, 2020
Copyright: © 2020 Yamada, Kanemura. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Simulation programs are available from GitHub: https://github.com/echo0yasum1/simulating_bout_and_pause_pattern.
Funding: This study was supported in part by a Grant-in-Aid for JSPS Fellows (20J21568) to KY from the Japan Society for the Promotion of Science (http://www.jsps.go.jp/english/e-grants). The funder had no role in study design, data collection, data analysis, or preparation of the manuscript. KY and AK are employed by and receive salaries from LeapMind Inc. (https://leapmind.io/en/), and both authors played roles in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section. There was no additional external funding received for this study.
Competing interests: KY and AK are employed by and receive salaries from LeapMind Inc. This does not alter our adherence to PLOS ONE policies on sharing data and materials. There are no patents, products in development, or marketed products to declare.
Introduction
Animals engage in various activities in their daily lives. For humans, they may be working, studying, practicing sports, or playing video games. For rats, they may be grooming, foraging, or escaping from a predator. Although specific activities differ between species, common behavioral features are often observed.
Bout-and-pause patterns are one of the behavioral features commonly observed in many species. Activities in which an animal engages do not occur uniformly over time but often contain short periods in which a burst of responses is observed. For example, in an operant conditioning experiment, a rat presses a lever repeatedly for a short period and then stops lever pressing. After a moment, the rat starts lever pressing again. The rat switches between pressing and not pressing the lever again and again throughout the experiment. Such a temporal structure, comprising short bursts of responses and long pauses, is observed in various species and activities; for example, email and letter communication by humans [1], foraging by cows [2], and walking by Drosophila [3].
Shull et al. [4] showed that bout-and-pause patterns, observed under an environment where rewards are available probabilistically at a constant rate (a variable interval (VI) schedule), can be described with a broken-stick shape in the log-survivor plot of interresponse times (IRTs), which is characterized by a bi-exponential probability model. If IRTs follow a single exponential distribution, the log-survivor plot shows a straight line. If IRTs follow a mixture of exponential distributions called a bi-exponential model, the log-survivor plot shows a broken-stick shape composed of two straight lines with different slopes. Killeen et al. [5] found that lever pressing by rats is well described with a bi-exponential model, suggesting that this behavior has a bout-and-pause pattern. If IRTs follow a bi-exponential distribution, there are two different types of responses: within-bout responses, which have short IRTs, and between-bout responses, which have long IRTs. Each response type has its own exponential distribution in a bi-exponential model. Killeen et al. [5] formulated the bi-exponential model as follows: p(IRT = τ) = q ω exp(−ω τ) + (1 − q) b exp(−b τ), (1) where the first term describes IRTs of within-bout responses and the second term describes IRTs of between-bout responses. This model has three free parameters: q, ω, and b, each of which corresponds to a different component of bout-and-pause patterns. First, q denotes the mixture ratio of the two exponential distributions in the model and it corresponds to the mean length of a bout. The bout length is the number of responses contained in one bout. Second, ω denotes the rate parameter of the exponential distribution of within-bout IRTs and it corresponds to the within-bout response rate. Finally, b denotes the rate parameter of the exponential distribution of between-bout IRTs and it corresponds to the bout initiation rate. These three model parameters define the overall response rate. They are also called bout components.
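To illustrate how Eq (1) produces a broken-stick log-survivor plot, the following short Julia sketch evaluates the survivor function P(IRT > τ) = q exp(−ω τ) + (1 − q) exp(−b τ) of the bi-exponential model on a log scale; the parameter values are illustrative only and are not taken from any experiment.

# Log-survivor function of the bi-exponential IRT model in Eq (1); illustrative parameters only.
q, w, b = 0.4, 2.0, 0.1                                  # mixture ratio, within-bout rate, bout-initiation rate
survivor(t) = q * exp(-w * t) + (1 - q) * exp(-b * t)    # P(IRT > t)
# At short IRTs the within-bout term (slope -w) dominates; at long IRTs the bout-initiation
# term (slope -b) dominates, producing two limbs with different slopes.
for t in 0.0:2.0:20.0
    println(round(t, digits = 1), "\t", round(log(survivor(t)), digits = 3))
end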
The bout length, the within-bout response rate, and the bout initiation rate are affected by motivational and schedule-type manipulations [4, 6–11]. Motivational manipulations include the reinforcement rate, the response-reinforcement contingency, and the deprivation level. An example of schedule-type manipulations is adding a small variable ratio (VR) schedule in tandem to a variable interval (VI) schedule.
Table 1 summarizes existing findings on the relationships between experimental manipulations and the two bout components. The bout length was reported to be affected by manipulations as follows:
- It increases or stays the same as the reinforcement rate increases [4, 6].
- It increases or stays the same as the deprivation level increases [4, 7, 8].
- It decreases or stays the same by extinction [10, 12].
- It increases by tandem VR [4, 6, 13]. When a VI schedule is followed by a small VR (tandem VI VR), an animal stays in a bout longer and emits more responses in each bout.
The bout initiation rate was reported to change with manipulations as follows:
- It increases as the reinforcement rate increases [4, 6, 14, 15].
- It increases as the deprivation level increases [4, 7, 8].
- It decreases by extinction [10, 12, 16].
- It decreases or stays the same by tandem VR [10]. Brackney et al. [10] showed that adding a small VR schedule in tandem to a VI schedule slightly decreases the bout initiation rate.
Although previous studies have investigated the relationships between some experimental manipulations and the bout components, we still do not know how to construct a model that generates bout-and-pause patterns based on the experimental findings. Smith et al. [17] showed experimentally that choice and cost play important roles in organizing responses into bout-and-pause patterns. When pigeons were trained under a single schedule, the log-survivor plot did not show a broken-stick shape [18, 19]. Smith et al. [17] trained pigeons under a concurrent VI VI schedule with and without a changeover delay (COD). When pigeons were trained under the concurrent VI VI schedule without a COD, the log-survivor plot still did not show a broken stick, resulting in a straight line. However, under the concurrent VI VI schedule with a COD, the log-survivor plot showed a broken stick, indicating that bout-and-pause patterns were clearly observed. Similar observations have been made for rats, assuming that they engage in alternative behaviors during conditioning [20]. From these experimental observations, we extracted the following three facts. 1) When animals engage in only one response in a given situation, bout-and-pause patterns are not observed. 2) If animals can choose responses from two alternatives without a COD, bout-and-pause patterns are still not observed. 3) Considering 1) and 2), we conclude that bout-and-pause patterns are organized only when animals have two (or more) possible alternatives in a given situation (i.e., choice is available) and there is a COD between the start of engagement and a reinforcement (i.e., cost is associated with a changeover). These findings are interesting, but they remain inductive, and we still do not have a constructive explanation of how bout-and-pause patterns are generated. Existing studies on bout-and-pause patterns have aimed to describe the phenomenon rather than to provide constructive models. Although many models have been proposed [5, 9, 21], they are descriptive and did not answer the question of “what mechanisms shape responses into bout-and-pause patterns?”
Kulubekova and McDowell [22] examined a computational model aimed at reproducing bout-and-pause patterns, based on the principle of selection by consequences developed by McDowell [23], but they did not test which mechanisms are behind bout-and-pause patterns. In other words, they showed that a computational model of selection by consequences could reproduce bout-and-pause patterns but did not show the minimal requirements to reproduce them.
In this article, we propose a computational model based on reinforcement learning that provides a constructive account of bout-and-pause patterns. We assume that bout-and-pause patterns are generated by two mechanisms: a choice between the operant and other behaviors, and a cost that is incurred when making a transition from one behavior to another. We suppose that motivational manipulations affect only the choice mechanism and schedule-type manipulations affect the cost mechanism. To incorporate those two mechanisms, we design a three-state Markov transition model, which has an extra state in addition to the bout and pause states. We perform three simulation studies to analyze the proposed model. In Simulation 1, we introduce our model on the basis of the two mechanisms, choice and cost. We show that the proposed model can reproduce bout-and-pause patterns by finding that the log-survivor plot shows a broken-stick shape. We compare three models: a dual model, a no cost model, and a no choice model. The dual model is composed of both the choice and cost mechanisms. The no cost model has only the choice mechanism and the no choice model has only the cost mechanism. Simulation results demonstrate that the dual model can reproduce bout-and-pause patterns but the other two models fail to reproduce them. This implies that both choice and cost are required for animal responses to be organized into bout-and-pause patterns. In Simulation 2, we analyze the dual model in depth and report its behavior under various experimental settings to test whether the dual model can reproduce the relationships between the experimental manipulations and the bout components discovered so far. Simulation results suggest that the dual model can reproduce them not only qualitatively but also quantitatively. In Simulation 3, we show that a two-state model can also reproduce bout-and-pause patterns even without the third state because it incorporates the two mechanisms. However, having the third state is useful for separating the effects of the choice and cost mechanisms. We speculate that real animals might have mechanisms similar to those of the dual model for generating bout-and-pause patterns, and the dual model can be a useful computational tool for studying animal behavior.
1 Simulation 1
Material and method
Model.
Our model is based on reinforcement learning [24]. We designed a three-state Markov process for modeling bout-and-pause patterns (Fig 1(a)). Two of the three states are “Operant” and “Others,” in which the agent engages in the operant behavior or performs other behaviors, respectively. We call them Operant and Others, instead of engagement or visit and disengagement or pause, to emphasize that bout-and-pause patterns are the result of a choice between the operant and other behaviors. In the third, “Choice” state, the agent makes a decision between the operant and other behaviors. By having the Choice state in our model, we incorporate the knowledge that animals can choose their behavior from available options (e.g., grooming, exploration, and excretion) when they move freely during an experiment. The second piece of knowledge is the cost required to make a transition from one behavior to another. Animals must decide whether to keep doing the same behavior or to make a transition, because fast switching is not optimal if a transition incurs a cost. Fig 1(b) and 1(c) show two knockout models, the no choice model and the no cost model, respectively. In each model, one of the two mechanisms of the dual model is removed. In the no choice model, the agent can choose only the operant behavior in a given situation. In the no cost model, no cost is required when a transition is made.
(a) The model scheme of the dual model. The upper node, the bottom left node, and the bottom right node correspond to the Choice state, the Operant state, and the Others state, respectively. Each arrow denotes a transition from one state to another. (b) The model scheme of the no choice model. In this model, the Others state is omitted. (c) The model scheme of the no cost model. In this model, the self-transitions in the Operant and Others states are omitted.
Here is how the agent travels through the proposed model. In the Choice state, the agent chooses either the operant or other behaviors. As a result of the choice, it moves from the Choice state to the Operant or the Others state. It makes the choice based on the preference for each behavior, which is denoted by Qpref. We will explain how to calculate Qpref in the next paragraph. In the Operant state, the agent engages in the operant behavior, and, after every response, it decides whether to stay in the Operant state or to move back to the Choice state. It decides to stay or move based on Qcost, which represents a transition cost to the Choice state, whose mathematical definition will be given later in this Model section. The Others state is the same as the Operant state except that the agent performs other behaviors.
The preference Qpref is a function that compares the operant and other behaviors when the agent makes a choice between them. The Qpref function changes over time since it is updated based on the presence (or absence) of a reinforcer per bout. The following equations describe the updating rule for Qpref: Qpref(i)(t + 1) = Qpref(i)(t) + αrft [r − Qpref(i)(t)] when a reinforcer is presented, (2a) and Qpref(i)(t + 1) = Qpref(i)(t) + αext [0 − Qpref(i)(t)] when no reinforcer is presented, (2b) where t denotes time in the session; αrft and αext denote the learning rates of reinforcement and extinction, respectively; r denotes the reinforcer value, and we assume r > 0 when a reinforcer is present and r = 0 when a reinforcer is absent; and i ∈ {Operant, Others} denotes each option, that is, i = Operant if the operant behavior is chosen and i = Others if other behaviors are chosen. We omit the superscript (i) and write Qpref when it can be either i = Operant or i = Others.
In the Choice state, the agent chooses either the Operant or the Others state according to a probability distribution calculated from the preferences for the two behaviors. The probability of a transition to option i ∈ {Operant, Others} is defined as follows: p(i) = exp(β Qpref(i)) / Σj∈{Operant, Others} exp(β Qpref(j)), (3) where the softmax inverse temperature parameter β represents the degree to which the choice is focused on the highest-value option.
The cost Qcost is a function that defines a barrier to making a transition from the performed behavior to the Choice state. We assumed that the cost is independent of the preference and depends only on the number of responses that are emitted to obtain a reinforcer from a bout initiation. When a reinforcer is presented, the cost function Qcost is updated according to Qcost(i)(t + 1) = Qcost(i)(t) + αrft [log x − Qcost(i)(t)], (4) where x denotes the number of responses that are emitted to obtain a reinforcer in a bout. Then, x is initialized to 1 when the agent receives a reinforcer or comes back to the Choice state without a reinforcer. The other parameters are the same as in Eqs (2a) and (2b). The same (i)-omitting rule applies also to Qcost. In Eq (4), x is attenuated by taking its logarithm. This is because, if we do not attenuate x, the barrier defined by Qcost becomes too high and the agent keeps staying in the performed state. To avoid this, we employed Fechner's law [25] to make the performed state less attractive.
If the agent is in either the Operant or the Others state, it decides whether to stay in the same state or to go back to the Choice state. The decision is made according to the probability of staying in the same state, which is calculated from the cost and the preference for the state and is defined as follows: (5) where wpref and wcost are positive weighting parameters for Qpref and Qcost, respectively. We assumed wcost > wpref because schedule-type manipulations have stronger effects on the bout length than motivational manipulations. When Qpref or Qcost increases, pstay also increases.
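As a concrete illustration of these rules, the following Julia sketch implements the preference update, the cost update, the softmax choice of Eq (3), and the stay probability. The sketch assumes a delta-rule reading of Eqs (2a), (2b), and (4) and a logistic form for the stay probability in Eq (5); these forms, like the function names (update_pref!, p_stay, and so on), are our assumptions for illustration and are not taken from the released code.

# Sketch of the dual model's update and decision rules (assumed forms; see the text above).
mutable struct Agent
    Qpref::Dict{Symbol,Float64}    # preference for each option, i in {Operant, Others}
    Qcost::Dict{Symbol,Float64}    # changeover cost for each option
end
Agent() = Agent(Dict(:Operant => 0.0, :Others => 0.0), Dict(:Operant => 0.0, :Others => 0.0))

const αrft, αext, β = 0.05, 0.01, 12.5     # values used in Simulation 1
const wpref, wcost = 1.0, 3.5

# Eqs (2a)/(2b): move Qpref toward r when reinforced, decay it toward 0 otherwise.
function update_pref!(a::Agent, i::Symbol, r::Float64)
    a.Qpref[i] += r > 0 ? αrft * (r - a.Qpref[i]) : αext * (0.0 - a.Qpref[i])
end

# Eq (4): move Qcost toward log(x), where x is the number of responses emitted in the bout.
function update_cost!(a::Agent, i::Symbol, x::Int)
    a.Qcost[i] += αrft * (log(x) - a.Qcost[i])
end

# Eq (3): softmax choice between Operant and Others in the Choice state.
function choose(a::Agent)
    z = Dict(i => exp(β * v) for (i, v) in a.Qpref)
    return rand() < z[:Operant] / (z[:Operant] + z[:Others]) ? :Operant : :Others
end

# Eq (5), assumed logistic form: staying becomes more likely as Qpref and Qcost grow.
p_stay(a::Agent, i::Symbol) = 1 / (1 + exp(-(wpref * a.Qpref[i] + wcost * a.Qcost[i])))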
Simulation.
In Simulation 1, we compared three possible models: the dual model, the no choice model, and the no cost model. The dual model (Fig 1(a)) includes both the choice and cost mechanisms, as described in the Model section. The second model was the no choice model (Fig 1(b)), which has only the cost mechanism and can be thought of as a model made by removing the choice mechanism from the dual model. In the no choice model, the agent engages only in the operant behavior. In other words, this model chooses only the operant behavior in the Choice state. The third model was the no cost model, which has only the choice mechanism without the cost mechanism. The no cost model chooses either the operant or other behaviors independently of the previous behavior; that is, according to this model, the agent does not continue to stay in the same state and comes back to the Choice state after each response. In the no cost model, the self-transition paths were removed because pstay is very low without Qcost in Eq (5).
Simulation conditions were as follows. The schedule for the operant behavior was VI 120 s (0.5 reinforcers per min) without an inter-trial interval, and the schedule for the other behaviors was FR 1. The maximum number of reinforcers in the Operant state was 1,000; that is, if the number of reinforcers reached 1,000, the simulation was terminated. The value of a reinforcer given by taking the operant behavior was r(Operant) = 1.0 and that given by taking other behaviors was r(Others) = 0.5. The model parameters were αrft, αext, β, wpref, and wcost. We set αrft = 0.05, αext = 0.01, β = 12.5, wpref = 1.0, and wcost = 3.5. The response probabilities in the Operant and the Others states were fixed at 1/3 in each time step. These parameters were designed based on knowledge of experimental conditions, e.g., the reinforcer for the operant behavior should be larger than that for other behaviors, implying r(Operant) > r(Others). Before the start of the simulation, we initialized the agent and the experimental environment. The initial values of Qpref and Qcost were both 0, and we created a VI table according to Fleshler and Hoffman [26]. We set the time step of the simulation to 0.1 s.
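For concreteness, the following Julia sketch shows one way the VI timer for the operant schedule could be set up. The actual simulation builds a VI table following Fleshler and Hoffman [26]; here, as a simplifying assumption, inter-reinforcer setup intervals are drawn from an exponential distribution with a 120 s mean, which preserves the scheduled rate of 0.5 reinforcer setups per min on average. The type and function names are ours, for illustration only.

# Sketch of a VI 120 s timer; exponential intervals are an assumption replacing the Fleshler-Hoffman table.
const vi_mean = 120.0                          # mean interval of the VI schedule in seconds

mutable struct VISchedule
    next_setup::Float64                        # time at which the next reinforcer becomes available
    armed::Bool                                # whether a reinforcer is waiting to be collected
end
VISchedule() = VISchedule(-vi_mean * log(rand()), false)

function step!(s::VISchedule, t::Float64)      # called once per 0.1 s time step
    if !s.armed && t >= s.next_setup
        s.armed = true
    end
end

function respond!(s::VISchedule, t::Float64)   # called when the agent emits an operant response
    if s.armed                                 # reinforcer delivered; schedule the next setup
        s.armed = false
        s.next_setup = t - vi_mean * log(rand())
        return true
    end
    return false
end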
We show pseudocode of the model and simulation in Algorithm 1, where NumResponses means x in Eq 4 and the three Behavior() functions are defined in Algorithms 2, 3, and 4. We implemented the algorithm in Julia 1.0 and ran simulations on a computer with a 1.80 GHz Intel i7-8565 processor, 16 GB of RAM, and 1 TB of SSD, operating with Ubuntu 18.04 LTS. The same configuration was used also for Simulations 2 and 3. The Julia code is available at: https://github.com/echo0yasum1/simulating_bout_and_pause_pattern.
Algorithm 1 Pseudocode of simulation
t ← 0, NumRewards ← 0, ResponseTimes ← {}, i ← Choice
while NumRewards < 1000 do
t ← t + 0.1
if i = Choice then
ChoiceBehavior()
end if
if i = Operant and uniform(0, 1)≤1/3 then
OperantBehavior()
end if
if i = Others and uniform(0, 1)≤1/3 then
OthersBehavior()
end if
end while
Algorithm 2 Definition of ChoiceBehavior()
Select a state i ∈ {Operant, Others} with probability defined by Eq (3)
NumResponses ← 1
Algorithm 3 Definition of OperantBehavior()
Append t to ResponseTimes
NumResponses ← NumResponses + 1
Select a state i ∈ {Operant, Choice} with probability defined by Eq (5)
if reward is presented then
Update Qpref(Operant) according to Eq (2a)
Update Qcost(Operant) according to Eq (4)
NumRewards ← NumRewards + 1
NumResponses ← 1
end if
if reward is absent then
if i = Choice then
Update Qpref(Operant) according to Eq (2b)
end if
end if
Algorithm 4 Definition of OthersBehavior()
NumResponses ← NumResponses + 1
Update Qpref(Others) according to Eq (2a)
Update Qcost(Others) according to Eq (4)
reward is presented according to FR 1
NumResponses ← 1
Select a state i ∈ {Others, Choice} with probability defined by Eq (5)
Results: Simulation 1
Fig 2(a) shows event records of responses generated by each model and Fig 3 shows the model schemes with transition probabilities. The top panel of Fig 2(a) shows that the no choice model generated a dense repetition of only the operant behavior at a high rate without long pauses. From Fig 3, the empirical probability that the agent stayed in the Operant state was 0.95. In the middle panel of Fig 2(a), the response rate under the no cost model was low and each response was separated by long pauses. From Fig 3, the empirical probability that the agent chose to transition to the Operant state was 0.06, and the agent returned to the Choice state immediately after it responded. In the bottom panel of Fig 2(a), the agent with the dual model generated a repetitive pattern of responses emitted at a high rate in a short period followed by a long pause. From Fig 3, the agent in the Choice state made a transition to the Operant state with a probability of 0.12 and it stayed in the Operant state with a probability of 0.71.
(a) Response event records in the (top) no choice, (middle) no cost, and (bottom) dual models in the 50 s period just after 500 reinforcers were presented (event records were stable after 500 reinforcer presentations). Each vertical line denotes one response. (b) Log-survivor plots of the three models drawn by using all the IRTs after 500 reinforcers.
Fig 2(b) shows log-survivor plots, from which we can see whether each model produces a straight line or a broken stick. We used the IRTs from after the agent obtained 500 reinforcers to the end of the simulation. The log-survivor plots of the no choice model and the no cost model were described by a single straight line, whereas that of the dual model was described by a broken-stick shape. The no choice model had a steeper slope than the no cost model and was tangential to the curve of the dual model at the leftmost position. The slope of the no cost model was slightly steeper than that of the dual model on the right side.
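The quantity plotted in Fig 2(b) can be computed directly from the recorded response times. The following Julia sketch is our own illustration of that analysis step, not the authors' plotting code; in the actual simulation, the placeholder data would be replaced by the ResponseTimes vector collected in Algorithm 1.

# Empirical log-survivor function of IRTs; a broken stick indicates a bi-exponential distribution.
response_times = cumsum(rand(2000))              # placeholder data; use ResponseTimes from the simulation
irts = sort(diff(response_times))                # interresponse times, in ascending order
n = length(irts)
# P(IRT > t) evaluated at each observed IRT except the largest (to avoid log(0)).
logsurv = [log((n - k) / n) for k in 1:(n - 1)]
points = collect(zip(irts[1:(n - 1)], logsurv))  # (IRT, log survivor probability) pairs to plot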
1.1 Discussion: Simulation 1
Both the event records and log-survivor plots in Fig 2 imply that only the dual model generated bout-and-pause patterns and the other two models failed to reproduce them. The event records in Fig 2(a) suggest that only the dual model exhibited bout-and-pause patterns. Only the log-survivor plot of the dual model in Fig 2(b) showed not a straight line but a broken-stick shape, which is evidence that the underlying IRTs follow a bi-exponential distribution. Thus, only the dual model reproduced bout-and-pause patterns.
We posit that both the choice and cost mechanisms are necessary to organize responses into bout-and-pause patterns. The no choice model failed because it lacks the choice mechanism. Without the choice mechanism, the agent almost always stayed in the Operant state and responded at a high rate without pauses. The reason behind the failure of the no cost model was the knockout of the cost mechanism. When the cost of a changeover is zero, the agent easily returns to the Choice state, resulting in sporadic operant responses followed by long pauses. Similar behaviors were observed in pigeons under a concurrent VI VI schedule without a COD [17]. The choice and cost mechanisms contribute differently to generating bout-and-pause patterns; the choice mechanism generates pauses and the cost mechanism produces response bursts. Since the dual model has both mechanisms, it reproduced bout-and-pause patterns.
Since we have full control of the simulation environment and the agent in it, we can exclude the possibility of contamination by other factors. The results of Smith et al. [17] implied that choice and cost are behind bout-and-pause patterns, but it was not clear whether other factors influence the formation of bout-and-pause patterns; this is an inherent limitation of experimental studies. It was not straightforward to draw conclusions like “these mechanisms are enough to generate bout-and-pause patterns” from the experimental finding that IRT distributions observed in pigeons followed a bi-exponential distribution under concurrent VI VI schedules with a COD. In contrast, our constructive approach makes it clear that the two mechanisms are sufficient to reproduce bout-and-pause patterns, and this conclusion is hard to draw from the experimental findings of [17] alone.
We suggest that what is important for generating bout-and-pause patterns is not the specific architecture of our model but the choice and cost mechanisms. Our model is composed of three states and five equations, and those equations come from one of the most popular reinforcement learning algorithms, Q-learning. Even if this model architecture and algorithm are substituted with others, the new model will still reproduce bout-and-pause patterns if it involves the choice and cost mechanisms. The specific functional forms, such as the softmax function in Eq (3) or the logarithm in Eq (4), can also be replaced with other forms. We do not reject other possible ways of implementing the two mechanisms.
We also do not claim the uniqueness of our experimental settings. Although we employed an FR 1 schedule for the other behaviors, other schedules including VI should produce similar results.
2 Simulation 2
Having demonstrated in Simulation 1 that the dual model successfully reproduced bout-and-pause patterns, in Simulation 2 we analyzed this model under various environments. The previous studies [4, 6–10, 27] have applied various experimental manipulations to animals to understand bout-and-pause patterns, as summarized in Table 1. We applied manipulations to the agent in the model by changing environmental settings.
2.1 Method: Simulation 2
Using the dual model, we performed four experiments, in each of which we manipulated only one of four variables while keeping the other three the same as in Simulation 1. The simulation procedure was also the same as in Simulation 1.
The four experimental manipulations were applied independently to each of the four variables: 1) the rate of reinforcement, 2) the deprivation level, 3) the presence of extinction, and 4) the schedule type. 1) We manipulated the rate of reinforcement by varying the mean interval of the VI schedule. The mean intervals used in this simulation were VI 30 s, 120 s, and 480 s (2.0, 0.5, and 0.125 reinforcers per min). 2) We varied the reward value obtained in the Operant state to control the deprivation level of the agent. The values were 0.5, 1.0, and 1.5 to induce low deprivation, baseline, and high deprivation levels, respectively. The reward value that the agent received by taking other behaviors was the same as in Simulation 1 throughout all the simulations. 3) To attenuate engagement in the operant response, we switched the schedule from VI 120 s (0.5 reinforcers per min) to extinction after the agent obtained 1,000 reinforcers. The extinction phase finished when 3,600 s (36,000 time steps) had elapsed. 4) We manipulated the schedule type by adding a small VR schedule in tandem to a variable time (VT) schedule. The mean interval of the VT schedule was fixed to 120 s and the VR values were 0, 4, and 8.
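The mapping from these four manipulations to simulation settings can be summarized in a small configuration structure. The values below are those given in the text, while the structure and field names are ours, for illustration only.

# Simulation settings for the four manipulations in Simulation 2 (values from the text).
manipulations = Dict(
    :reinforcement_rate => (vi_mean_s = [30.0, 120.0, 480.0],),           # VI mean intervals
    :deprivation_level  => (r_operant = [0.5, 1.0, 1.5],),                # reward value in the Operant state
    :extinction         => (after_reinforcers = 1000, duration_s = 3600.0),
    :schedule_type      => (vt_mean_s = 120.0, tandem_vr = [0, 4, 8]),    # tandem VT VR requirements
)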
When we analyzed the IRT data from the extinction simulation, we used a dynamic bi-exponential model [10], in which the model parameters q, ω, and b are time-dependent and Eq (1) is rewritten as follows: p(IRT = τ; t) = q(t) ω(t) exp(−ω(t) τ) + (1 − q(t)) b(t) exp(−b(t) τ). (6)
Extinction causes exponential decay of the model parameters according to the following equations: q(t) = q0 exp(−γ t), (7) and b(t) = b0 exp(−δ t), (8) where q0 and b0 denote the parameter values at the onset of extinction and γ and δ denote the decay rates of q and b, respectively. Since the decay of any of the three model parameters q, b, and ω can cause extinction, we need to identify which of these parameters actually decayed during the extinction simulation. We excluded ω because it was fixed to 1/3 during the simulation. To identify whether one or both of the q and b parameters decayed, we compared three models, the qb-decay, q-decay, and b-decay models, and calculated WAIC (the widely applicable information criterion [28]) for each model. We used Markov chain Monte Carlo (MCMC) with Stan [29] to estimate the posterior distributions and used the MCMC samples to calculate WAIC. The same computational configuration as in Simulation 1 was used in Simulation 2.
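For reference, WAIC can be computed from the pointwise log-likelihoods stored in the MCMC draws. The following generic Julia sketch shows that computation; it is not the Stan model itself, and the matrix layout (draws by data points) is an assumption.

# Generic WAIC computation from an S-by-N matrix of pointwise log-likelihood draws.
using Statistics

function waic(loglik::AbstractMatrix{<:Real})
    S, N = size(loglik)
    lppd = 0.0                                 # log pointwise predictive density
    p_waic = 0.0                               # effective number of parameters
    for i in 1:N
        col = loglik[:, i]
        m = maximum(col)
        lppd += m + log(mean(exp.(col .- m)))  # log-mean-exp over draws, numerically stable
        p_waic += var(col)                     # variance of the log-likelihood over draws
    end
    return -2 * (lppd - p_waic)                # lower WAIC indicates better expected fit
end

# Example with dummy draws: waic(randn(4000, 200) .- 1.0)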
To examine the molar relationship between the reinforcement rate and the response rate, we fitted Herrnstein's hyperbola [30] to the simulated data. We used its modern version [31], R = k r^a / (r^a + c re^a), (9) where R is the response rate, r is the reinforcement rate, re is the external reinforcement rate, k is the total amount of behavior, and a and c are the exponent and bias parameters, respectively. Since the parametrization of the term c re^a is redundant, we did not fit re and c separately and instead estimated the combined term c re^a.
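As an illustration of this fitting step, the sketch below fits a single-alternative hyperbola of the form R = k r^a / (r^a + c re^a), the form assumed above, by a coarse grid search over k, a, and the combined term c re^a. A real analysis would use a proper nonlinear optimizer, and the grid ranges here are arbitrary.

# Least-squares grid-search fit of the modern hyperbola (functional form and grid ranges are assumptions).
predicted(r, k, a, cre) = k * r^a / (r^a + cre)

function fit_hyperbola(rates::Vector{Float64}, responses::Vector{Float64})
    best_sse, best_params = Inf, (0.0, 0.0, 0.0)
    for k in 50.0:5.0:300.0, a in 0.5:0.25:3.0, cre in 0.1:0.1:20.0
        sse = sum((responses[i] - predicted(rates[i], k, a, cre))^2 for i in eachindex(rates))
        if sse < best_sse
            best_sse, best_params = sse, (k, a, cre)
        end
    end
    return best_params                         # (k, a, c*re^a) minimizing the squared error
end

# Example with made-up data: fit_hyperbola([0.125, 0.5, 2.0], [40.0, 110.0, 180.0])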
2.2 Results: Simulation 2
Fig 4 shows the log-survivor plots of IRTs from each of the four simulations. Fig 4(a) and 4(b) show that manipulating the rate of reinforcement or the deprivation level changed the slope and intercept of the right limb. As the rate of reinforcement or the deprivation level increased, the slope of the right limb became steeper, indicating that the bout initiation rate became larger. The broken sticks in Fig 4(c) have different slopes and y-axis intercepts, suggesting that both the bout initiation rate and the bout length changed. Fig 4(d) shows that adding the tandem VR schedule to the VT schedule affected only the y-axis intercept of the right limb without changing its slope. As the response requirement increased from the baseline to VR 4 or VR 8, the bout length became larger. However, the right limbs were not stable, and we performed the fitting analysis described in the next paragraph.
Table 2 shows the estimated parameters of the bi-exponential model, q, ω, and b, in the three simulations other than extinction. Parameter q increased as the reinforcement rate, the deprivation level, and the number of required responses increased. Parameter ω did not change under any manipulation. Parameter b increased as the rate of reinforcement and the deprivation level increased.
In Fig 4(c), the total number of IRTs during the extinction phase was insufficient to reliably estimate the right limb. We therefore analyzed the dynamic bi-exponential model fitted to the IRTs during extinction. Table 3 shows the WAIC values for the three models. The smallest WAIC was attained by the qb-decay model, but the differences from the other models are not large, and it is not conclusive whether the bout initiation rate, the bout length, or both decayed during extinction.
Fig 5 shows boxplots of Qpref and Qcost in the three simulations other than extinction, which we used to assess how the changes in the bout components are mediated. We excluded the extinction simulation because we already knew that Qpref causes the change in the bout components, since Qcost is fixed during the extinction phase. The top panel shows that Qpref and Qcost increased as the rate of reinforcement increased. The middle panel indicates that increasing the deprivation level moved Qpref and Qcost upward. From the bottom panel, we can see that adding the tandem VR schedule increased Qcost without affecting Qpref. Table 5 summarizes the dependency of Qpref and Qcost on the experimental manipulations. Comparing Tables 1 and 5, Qpref and Qcost correspond to the bout initiation rate and the bout length, respectively.
The top, middle, and bottom rows correspond to the reinforcement rate, the deprivation level, and the tandem VT VR simulations, respectively, and the left and right columns show Qpref and Qcost, respectively.
Fig 6 shows the relationship between the reinforcement rate and the response rate in our model. The response rate increased with a diminishing gradient, converging to k = 187.41. The exponent was fitted to be a = 2.25. The percentage of variance accounted for (%VAF) was 99.3, and a = 2.25 implies that our model showed overmatching. In our model, β in Eq (3) controls the sensitivity of the choice to the difference between the values of the Operant and Others behaviors, and we can move from overmatching toward strict matching by lowering the value of β.
The dots are from the simulation and the line is the modern version of Herrnstein’s hyperbola (the generalized matching law) fitted to the data.
2.3 Discussion: Simulation 2
In Simulation 2, we tested whether the dual model has the same characteristics as the animals reported in previous studies. We analyzed the model with four experimental manipulations: the rate of reinforcement, the deprivation level, the presence of extinction, and the schedule type. The rate of reinforcement, the deprivation level, and the presence of extinction affected both the bout initiation rate and the bout length, whereas adding the tandem VR schedule to the VT schedule affected only the bout length.
Table 4 summarizes the relationships between the experimental manipulations and the bout components observed in the dual model, which suggests that the behaviors of the dual model are consistent with the existing knowledge on animal behaviors. Furthermore, we made stable predictions for the cells with the question marks in Table 1. Our predictions are stable because our results can be easily reproduced and tested using the same simulation code. In contrast, experimental studies with animals could report different conclusions. Although our model does not implement Herrnstein's hyperbola a priori, the molar relationship between the reinforcement rate and the response rate is well described by the modern matching theory (Fig 6). Cheung et al. [12] and Brackney et al. [32] showed that the bout initiation rate and the bout length decayed during extinction. Table 3 shows the WAIC-based model selection for the dynamic bi-exponential model; the differences between the models are small, but the lowest-WAIC model is consistent with these previous studies. Therefore, the dual model satisfies at least the necessary conditions for a model to be analyzed for the generation mechanism of bout-and-pause patterns.
Table 2 showed the estimated parameters of the bi-exponential model in each simulation, and they are consistent with the parameters reported in previous studies with real animals.
The dependency of Qpref and Qcost on the experimental manipulations, shown in Table 5, can be understood according to the categorization of motivational and schedule-type manipulations proposed by Shull et al. [4]. In our simulations, manipulating any of the three motivational variables, i.e., the rate of reinforcement, the deprivation level, or extinction, changed Qpref and Qcost. The change in Qcost was not a primary but a secondary effect, because Qcost changed as a result of the increased Qpref; with a higher Qpref, the agent emits more responses. The schedule-type manipulation affected only Qcost. These changes in Qpref and Qcost are consistent with what was proposed by Shull et al. [4].
The dual model is limited in that it reproduces only some of the previous findings. Here are three examples of its limitations. First, our model is not designed for analyzing the addition of a tandem VR schedule to a VT schedule, a manipulation with which Tanno [9] and Matsui et al. [21] found changes in the within-bout response rate, which was fixed in our model. Second, the value of a reinforcer and the delay between a response and a reinforcer were fixed in our model. Brackney et al. [10] and Podlesnik et al. [8] considered that delayed reinforcement from a bout initiation causes an inverse correlation between the bout initiation rate and the bout length. This result can be reproduced if the response requirement of the tandem VR is very high (more than 32). Third, sometimes the bout length does not decrease during extinction [10]. Our dual model could not reproduce this result even when we changed the model parameters.
3 Simulation 3
In Simulation 3, we examined a two-state model that incorporates the choice and cost mechanisms, in order to explore the possibility of alternative, simpler models. We built a two-state model without the Choice state and ran simulations with it.
3.1 Method: Simulation 3
Fig 7 shows the two-state model, comprising the Operant and Others states. Although it does not have the Choice state, the choice mechanism is implemented as the transitions between the Operant and Others states. The probability of staying in the same state is defined as follows: (10) where wpref and wcost are positive weights for Qpref and Qcost, respectively. The updating rules for Qpref and Qcost are the same as Eqs (2a), (2b), and (4), respectively. The parameters of the two-state model were sought in the ranges shown in Table 6, which include the parameter values used for the three-state dual model. The following parameter settings were selected from these ranges: αrft = 0.01, αext = 0.01, wpref = 4.0, wcost = 3.5, r(Operant) = 1.0, and r(Others) = 0.5.
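To make the contrast with the three-state model concrete, the sketch below shows one way the two-state stay probability could combine the two mechanisms. Eq (10) itself is not reproduced above, so the logistic form and the use of the preference difference here are assumptions of ours for illustration, not the model's published definition.

# Assumed two-state stay rule: choice (preference difference) and cost act in a single probability.
const wpref2, wcost2 = 4.0, 3.5                # weights reported for the two-state model

function p_stay_two_state(Qpref::Dict{Symbol,Float64}, Qcost::Dict{Symbol,Float64}, current::Symbol)
    other = current == :Operant ? :Others : :Operant
    drive = wpref2 * (Qpref[current] - Qpref[other]) + wcost2 * Qcost[current]
    return 1 / (1 + exp(-drive))               # higher relative preference or cost makes staying more likely
end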
To examine if the two-state model could generate bout-and-pause patterns and if it could be used for simulations with experimental manipulations, we performed simulation analysis. We varied the reinforcement rate as VI 30 s, VI 120 s, and VI 480 s, which were the same as the values used in Simulation 2.
3.2 Results: Simulation 3
Fig 8 shows the log-survivor plots of IRTs from the simulation of the two-state model with different values of the reinforcement rate. The plots showed broken-stick shapes, and the slopes and intercepts of the right limbs decreased as the reinforcement rate decreased.
3.3 Discussion: Simulation 3
Since the log-survivor plots of IRTs generated by the two-state model showed broken-stick curves, bout-and-pause patterns were reproduced. In addition, the change of the log survivor plots of the two-state model was consistent with experimental findings. Therefore, we can construct alternative models even without the explicit third state. Also, the two-state model implements the two mechanisms through Eq (10).
We consider that the three-state dual model has advantages in modeling and analyzing bout-and-pause patterns.
In the three-state dual model, the effects of choice and cost are separated. It is clear in the dual model shown in Fig 1(a) that the choice between the operant and other behaviors is made at the Choice state and whether the agent continues to stay in the same state is moderated by the cost mechanism at each of the Operant and Others states. This can be understood by Eq (3), which describes only the choice rule, and Eq (5), which calculates the stay probability based on the cost mechanism. However, in the two-state model, choice and stay are not well separated; in Eq (10), choice and cost are mixed and the behavior of the agent cannot be explained by only one of them.
4 General discussion
In this paper, we have developed a computational model based on reinforcement learning. The model is meant to explain how bout-and-pause patterns can be generated, and we examined its validity by comparing computer simulations with experimental findings. We hypothesized that two independent mechanisms, the choice between the operant and other behaviors and the cost of a changeover between behaviors, are necessary to organize responses into bout-and-pause patterns. We demonstrated in Simulation 1 that the dual model reproduced bout-and-pause patterns under a VI schedule. Simulation 2 found that the relationships between various experimental manipulations and the bout components in our model were consistent with previous experimental findings. Simulation 3 found that a two-state model incorporating the two mechanisms can also reproduce bout-and-pause patterns; however, the third state has advantages in analyzing the agent's behavior because it separates the effects of the choice and cost mechanisms. These results support our hypothesis that an agent transitioning among the three states, driven by the choice and cost mechanisms, organizes its responses into bout-and-pause patterns. This is our answer to the question of why bout-and-pause patterns are organized.
Our constructive model reproduced the descriptive results reported by [4, 6, 7, 27]. Although our dual model does not explicitly include the bi-exponential model in Eq (1), IRTs generated by the dual model followed the bi-exponential model.
The fundamental difference between our model, based on reinforcement learning, and the model of Kulubekova and McDowell [22], based on selection by consequences, is that our model explicitly has the choice and cost mechanisms whereas their model is unclear about them. Their model did not generate a clear distinction between bursts of responses in short periods and the long pauses that separate bursts, resulting in a dull bend of the log-survivor plot. Kulubekova and McDowell [22] discussed that this divergence from live animals might be due to the lack of CODs in their model. Our model reproduced a clear distinction between bursts and pauses (Fig 2), and this was because our model can change CODs through the cost mechanism. Another advantage of our approach over that of Kulubekova and McDowell [22] is that, whereas they did not compare their model with alternatives, we tested our hypotheses about the choice and cost mechanisms with the knockout analysis in Simulation 1.
Our model has at least two shortcomings, which concern the range of its parameters and its possible redundancy. First, the parameters αrft, αext, β, wpref, and wcost in our model have not been optimized to fit behavioral data from real animals. The evidence that supports our parameter selection is that our model quantitatively reproduced bout-and-pause patterns. Second, although our model has five parameters, fewer parameters may suffice to reproduce bout-and-pause patterns. To verify our model on these two points, it would be useful to compare empirical data from real animals with our computational model.
Standing on the model proposed in this paper, we can extend our research in many directions to explain further aspects of bout-and-pause patterns. Here we discuss four of them in the following four paragraphs.
First, our results were retrospective with respect to data from previous behavioral experiments, and the proposed model was not tested on its ability to predict unseen data. Our model can suggest a new experiment that could add new knowledge about how manipulating CODs affects animals' behavior under a concurrent VI VI schedule. Smith et al. [17] pointed out that employing asymmetrical CODs in a concurrent VI VI schedule could produce behaviors like those under a single response schedule. Our modeling is consistent with what Smith et al. [17] pointed out but approaches it from a different direction. We consider that, even when an animal is under a single schedule, it makes choices between the operant behavior and other behaviors; this is implemented in our simulation as a concurrent VI FR 1 schedule. We used an FR 1 schedule for other behaviors in our simulations, but we can change it from FR 1 to a VI schedule so that the whole schedule becomes a concurrent VI VI schedule. In our model, the cost for the operant behavior, defined by Qcost(Operant), affects the actions of the agent only in the Operant state, without affecting those in the Others state; similarly, the cost for other behaviors influences the agent's actions only in the Others state. Therefore, according to our model, it is expected that, in a concurrent VI VI schedule, if the experimenter varies the COD for one schedule, the behavior of the animal changes only for the varied schedule without affecting the behavior for the other schedule. It will be interesting to conduct such experiments with real animals to reveal the actual effects of CODs on behavior under concurrent VI VI schedules. In this way, our model can bridge animal behaviors observed in concurrent schedules and single schedules by offering a unified framework.
The second direction is verification based on neuroscientific knowledge. Even if the model can correctly predict unseen data from behavioral experiments, it is not guaranteed that animals employ the same model. To explore the real mechanisms that animals implement, it would be effective to compare the internal variables of the model with neural activities measured from real animals during behavioral experiments. Possible experiments are to perform knockout experiments by inducing lesions in specific areas of the brain that should be active during the task, or to activate or deactivate specific neurons during the experiment.
Third, we can assess the plausibility of our model in more detail by conducting simulations under new experimental manipulations, including disruptors, or by analyzing measures that we did not analyze here. For example, recent studies showed that the distribution of bout lengths is sensitive to experimental manipulations [13, 33, 34]. Sanabria et al. [35] have proposed a computational formulation of behavior systems [36], and their descriptive model described bout-and-pause patterns well, including the distribution of bout lengths.
Fourth, we can design models that are not Markov transition models. The bout-and-pause response patterns shown in Fig 2 can be generated by a Markov transition model whose transition matrix is given a priori without reinforcement learning. We argue that the statistical description of the Markov model (i.e., the transition matrix defined by the transition probabilities shown in Fig 3) is not the source of the reproducibility of bout-and-pause patterns. There may be other models that are not formulated by Markov transition, such as the model proposed by McDowell [23]. We can introduce the choice and cost mechanisms to such models.
Reinforcement learning can be employed to model and explain animal behaviors other than bout-and-pause patterns, since it is a general framework in which an agent learns optimal behaviors in a given environment through trial and error [24]. Such a reinforcement learning framework agrees well with the three-term contingency in behavior analysis. There are three essential elements in reinforcement learning: a state, an action, and a reward. The state is what the agent observes, that is, information about the environment. The action is a behavior that the agent takes in a given state. The reward is what the agent obtains as the result of the action. These three elements are similar to a discriminative stimulus, a response, and an outcome, respectively. This similarity would allow behavior analysts to employ reinforcement learning in their research. For example, Sakai and Fukai [37] employed actor-critic reinforcement learning to model the matching law. We hope more computational studies will be performed to expand the methods of behavioral science.
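As a minimal illustration of this correspondence, the toy Julia fragment below treats a discriminative stimulus as the state, a response as the action, and an outcome as the reward in a tabular value update; it is a generic textbook-style example, not the model used in this paper, and the stimulus and response names are invented for illustration.

# Toy mapping of the three-term contingency onto tabular reinforcement learning.
states  = [:light_on, :light_off]              # discriminative stimuli
actions = [:press, :wait]                      # responses
Q = Dict((s, a) => 0.0 for s in states, a in actions)
α = 0.1                                        # learning rate
for trial in 1:1000
    s = rand(states)                           # stimulus presented on this trial
    greedy = Q[(s, :press)] >= Q[(s, :wait)] ? :press : :wait
    a = rand() < 0.1 ? rand(actions) : greedy  # ε-greedy response selection
    r = (s == :light_on && a == :press) ? 1.0 : 0.0   # outcome: reinforcer only for pressing when the light is on
    Q[(s, a)] += α * (r - Q[(s, a)])           # value update from the obtained outcome
end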
References
- 1. Barabasi AL. The origin of bursts and heavy tails in human dynamics. Nature. 2005;435(7039):207.
- 2. Tolkamp BJ, Kyriazakis I. To split behaviour into bouts, log-transform the intervals. Animal Behaviour. 1999;57(4):807–817.
- 3. Sorribes A, Armendariz BG, Lopez-Pigozzi D, Murga C, de Polavieja GG. The origin of behavioral bursts in decision-making circuitry. PLoS Computational Biology. 2011;7(6):e1002075.
- 4. Shull RL, Gaynor ST, Grimes JA. Response rate viewed as engagement bouts: Effects of relative reinforcement and schedule type. Journal of the Experimental Analysis of Behavior. 2001;75(3):247–274.
- 5. Killeen PR, Hall SS, Reilly MP, Kettle LC. Molecular analyses of the principal components of response strength. Journal of the Experimental Analysis of Behavior. 2002;78(2):127–160.
- 6. Shull RL, Grimes JA, Bennett JA. Bouts of responding: The relation between bout rate and the rate of variable-interval reinforcement. Journal of the Experimental Analysis of Behavior. 2004;81(1):65–83.
- 7. Shull RL. Bouts of responding on variable-interval schedules: Effects of deprivation level. Journal of the Experimental Analysis of Behavior. 2004;81(2):155–167.
- 8. Podlesnik CA, Jimenez-Gomez C, Ward RD, Shahan TA. Resistance to change of responding maintained by unsignaled delays to reinforcement: A response-bout analysis. Journal of the Experimental Analysis of Behavior. 2006;85(3):329–347.
- 9. Tanno T. Response-bout analysis of interresponse times in variable-ratio and variable-interval schedules. Behavioural Processes. 2016;132:12–21.
- 10. Brackney RJ, Cheung TH, Neisewander JL, Sanabria F. The isolation of motivational, motoric, and schedule effects on operant performance: A modeling approach. Journal of the Experimental Analysis of Behavior. 2011;96(1):17–38.
- 11. Chen X, Reed P. Factors controlling the micro-structure of human free-operant behaviour: Bout-initiation and within-bout responses are effected by different aspects of the schedule. Behavioural Processes. 2020; p. 104106.
- 12. Cheung TH, Neisewander JL, Sanabria F. Extinction under a behavioral microscope: Isolating the sources of decline in operant response rate. Behavioural Processes. 2012;90(1):111–123.
- 13. Brackney RJ, Sanabria F. The distribution of response bout lengths and its sensitivity to differential reinforcement. Journal of the Experimental Analysis of Behavior. 2015;104(2):167–185.
- 14. Reed P. The structure of random ratio responding in humans. Journal of Experimental Psychology: Animal Learning and Cognition. 2015;41(4):419.
- 15. Reed P, Smale D, Owens D, Freegard G. Human performance on random interval schedules. Journal of Experimental Psychology: Animal Learning and Cognition. 2018;44(3):309.
- 16. Brackney RJ, Cheung TH, Sanabria F. A bout analysis of operant response disruption. Behavioural Processes. 2017;141:42–49.
- 17. Smith TT, McLean AP, Shull RL, Hughes CE, Pitts RC. Concurrent performance as bouts of behavior. Journal of the Experimental Analysis of Behavior. 2014;102(1):102–125.
- 18. Bennett JA, Hughes CE, Pitts RC. Effects of methamphetamine on response rate: A microstructural analysis. Behavioural Processes. 2007;75(2):199–205.
- 19. Bowers MT, Hill J, Palya WL. Interresponse time structures in variable-ratio and variable-interval schedules. Journal of the Experimental Analysis of Behavior. 2008;90(3):345–362.
- 20. Wallace M, Singer G. Schedule induced behavior: A review of its generality, determinants and pharmacological data. Pharmacology Biochemistry and Behavior. 1976;5(4):483–490.
- 21. Matsui H, Yamada K, Sakagami T, Tanno T. Modeling bout–pause response patterns in variable-ratio and variable-interval schedules using hierarchical Bayesian methodology. Behavioural Processes. 2018;157:346–353.
- 22. Kulubekova S, McDowell JJ. A computational model of selection by consequences: Log survivor plots. Behavioural Processes. 2008;78(2):291–296.
- 23. McDowell JJ. A computational model of selection by consequences. Journal of the Experimental Analysis of Behavior. 2004;81(3):297–317.
- 24. Sutton RS, Barto AG. Reinforcement learning: An introduction. 2nd ed. MIT Press; 2018.
- 25. Nieder A. Counting on neurons: The neurobiology of numerical competence. Nature Reviews Neuroscience. 2005;6(3):177–190.
- 26. Fleshler M, Hoffman HS. A progression for generating variable-interval schedules. Journal of the Experimental Analysis of Behavior. 1962;5(4):529–530.
- 27. Shull RL, Gaynor ST, Grimes JA. Response rate viewed as engagement bouts: Resistance to extinction. Journal of the Experimental Analysis of Behavior. 2002;77(3):211–231.
- 28. Watanabe S. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research. 2010;11(Dec):3571–3594.
- 29. Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, et al. Stan: A probabilistic programming language. Journal of Statistical Software. 2017;76(1).
- 30. Herrnstein RJ. On the law of effect 1. Journal of the Experimental Analysis of Behavior. 1970;13(2):243–266.
- 31. McDowell J. On the classic and modern theories of matching. Journal of the Experimental Analysis of Behavior. 2005;84(1):111–127.
- 32. Brackney RJ, Cheung TH, Herbst K, Hill JC, Sanabria F. Extinction learning deficit in a rodent model of attention-deficit hyperactivity disorder. Behavioral and Brain Functions. 2012;8(1):59.
- 33. Jiménez ÁA, Sanabria F, Cabrera F. The effect of lever height on the microstructure of operant behavior. Behavioural Processes. 2017;140:181–189.
- 34. Daniels CW, Sanabria F. About bouts: A heterogeneous tandem schedule of reinforcement reveals dissociable components of operant behavior in Fischer rats. Journal of Experimental Psychology: Animal Learning and Cognition. 2017;43(3):280.
- 35. Sanabria F, Daniels CW, Gupta T, Santos C. A computational formulation of the behavior systems account of the temporal organization of motivated behavior. Behavioural Processes. 2019;169:103952.
- 36. Timberlake W. Behavior systems and reinforcement: An integrative approach. Journal of the Experimental Analysis of Behavior. 1993;60(1):105–128.
- 37. Sakai Y, Fukai T. The actor-critic learning is behind the matching law: Matching versus optimal behaviors. Neural Computation. 2008;20(1):227–251.