
The authors have declared that no competing interests exist.

Humans and other animals are able to discover underlying statistical structure in their environments and exploit it to achieve efficient and effective performance. However, such structure is often difficult to learn and use because it is obscure, involving long-range temporal dependencies. Here, we analysed behavioural data from an extended experiment with rats, showing that the subjects learned the underlying statistical structure, albeit suffering at times from immediate inferential imperfections as to their current state within it. We accounted for their behaviour using a Hidden Markov Model, in which recent observations are integrated with evidence from the past. We found that over the course of training, subjects came to track their progress through the task more accurately, a change that our model largely attributed to improved integration of past evidence. This learning reflected the structure of the task, decreasing reliance on recent observations, which were potentially misleading.

Humans and other animals possess the remarkable ability to find and exploit patterns and structures in their experience of a complex and varied world. However, such structures are often temporally extended and latent or hidden, being only partially correlated with immediate observations of the world. This makes it essential to integrate current and historical information, and creates a challenging statistical and computational problem. Here, we examine the behaviour of rats facing a version of this challenge posed by a brain-stimulation reward task. We find that subjects learned the general structure of the task, but struggled when immediate observations were misleading. We captured this behaviour with a model in which subjects integrated evidence from recent observations together with evidence from the past. The subjects’ performance improved markedly over successive sessions, allowing them to overcome misleading observations. According to our model, this was made possible by more effective usage of past evidence to better determine the true state of the world.

Natural environments are replete with statistical structure and regularities over many spatial and temporal scales. Humans and other animals are adept at extracting this structure by building cognitive maps [

One critical aspect of prediction in environments involving temporal regularities is that it typically depends on memory, with the immediate sensory evidence alone being insufficient [

Hidden structures exist over a variety of timescales. Short timescales of a few seconds are associated with accumulate-to-bound decision making [

By analysing particular aspects of the behavioural data in the task, we find that rats learn to use such medium-term structure to predict oncoming states and adjust their actions accordingly. However, their behaviour reveals occasional errors which result from chance recent observations that are misleading as to the identity of the hidden state. We show how to account for their performance by building an HMM which characterises the environment, and in which evidence from past observations is imperfectly integrated with recent observations. We find that as training progressed, subjects learned to predict oncoming states more accurately. This revealed a process by which subjects learn to use past evidence more effectively to infer their state in the world.

We consider a cumulative handling time task [

At the beginning of each trial, a high frequency stimulation train, called a prime, is delivered. The subjects are then free to choose whether and when to engage with the lever. We analyse two major dependent variables. The first is the engagement probability (EP), which is the probability that subjects engage with the lever at all. The second, if subjects do indeed engage, is the initial response time (IRT), which is the time it takes them to first press the lever following the prime; we define these in more detail in materials and methods.

Trials come in a predictable cyclic triad consisting of ‘lead’, ‘test’ and ‘trail’ trial types (


For lead trials, which correspond to fixed, high-frequency stimulation with a short price of 1 second, subjects typically work the entire duration of the trial, as the high-frequency stimulation is highly rewarding. By contrast, for trail trials, which have fixed, low-frequency stimulation at the same short price of 1 second, subjects barely work. Test trials, which involve a range of frequencies and prices that change from trial to trial (but are held fixed within a given trial), give rise to variable amounts of work, depending on the particular values of the frequency and price.

The data in the present paper are drawn from [

We studied a total of six subjects, each of which had experienced approximately 1500 triads of trials over a period of weeks. To allow adjustment from training to the full task we excluded the first 126 triads from our analysis, corresponding to one complete survey of the test trial frequencies and prices as defined in [

Previous analysis of these data has primarily focused on behaviour during test trials, and in particular on responses occurring after the initial responses [

H_{1}). For the significant subject, the median trail trial IRT was not large (3.15s) and the EP was 1, and so this likely reflected initial stages of learning.

(H_{3}). (H_{4}), as subjects learned to predict the highly rewarding lead trial. For test trials, with their lower expected rewards, median IRTs remained relatively constant and were longer in trained subjects than lead trial IRTs for all subjects (permutation test; H_{2}). For the poorly rewarding trail trials, median IRTs appeared not to change consistently, but we examine the properties of the trail trial distribution in more detail in

After the training period, the same subjects emitted very different initial responses for the different trial types. To a first approximation, the difference in these initial responses for trained subjects reflected the expected worth of the trial: the larger this worth, the greater the EP (up to a maximum of 1) and the shorter the IRT.

For lead and test trials, the EP was generally very close to 1, with test trial IRTs being slightly longer for all subjects than those for (the on average more valuable) lead trials (permutation test; H_{2}). That test trial IRTs were longer than lead trial IRTs is interesting as this behaviour is seemingly suboptimal—subjects need to explore to find out the test trial’s value before they can determine the appropriate response, and waiting at the beginning of a trial reduces their potential to exploit the test trial if it is indeed of high value. We therefore interpret the longer latency on test trials as indicating a sub-optimal Pavlovian response to an accurate prediction of relatively lower expected future reward, an effect which has been observed elsewhere [

Subjects responded very differently on the negligibly rewarding trail trials. EPs were typically small, and when subjects did engage, the resulting IRTs were often long. However, on a substantial fraction of occasions, the IRTs were instead short, which is surprising because trail trials were designed to be effectively worthless to the subject. We explore the possibility that the pattern of long and short IRTs is a signature of subjects’ inability to predict trail trials perfectly, and thus results from erroneous inference. These responses therefore provide a window into the subjects’ inferential processes.

Trail trials are preceded by test trials, which involve a range of different frequencies and prices. Some of these conditions resemble either lead (region α,

(H_{5}). Similarly, short IRTs on trail trials following test trials in region α are more similar to lead trial IRTs than to test trial IRTs for 3/6 subjects, with the difference not being significant for the remaining subjects (permutation test; H_{6}).

To test this hypothesis, we sorted the trail trial IRTs by the frequency and price of the previous test trial (

To quantify whether the short IRTs sorted in this way are more lead-like or test-like, we calculated the earth mover’s similarity between these distributions and the lead and test distributions (H_{5}). Similarly, responses on a trail trial following a trail-like test trial were significantly more lead-like than test-like for 3/6 subjects (permutation test; H_{6}).

Having discovered this confusion effect, we investigated it in more detail by considering the separate influences of frequency, price and duration. We found that frequency strongly influenced subjects’ inferences (

(H_{7}) and ‘High f’ (binomial proportion test; H_{8}) categories. (H_{9}). (H_{10}).

Whilst the description of confusion outlined in

The task itself can be described in the form of an HMM, with hidden states representing the trial type, and a binary transition matrix reflecting the deterministic cyclic triad structure. Given the predominant regularities in responses, highlighted earlier, we assume that subjects have learned this essential structure, associated with a task transition matrix (
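As a concrete illustration of this deterministic triad structure (a sketch in our own notation, not the authors' code), the lead → test → trail cycle corresponds to a 3×3 binary transition matrix:

```python
import numpy as np

# States indexed as 0 = lead, 1 = test, 2 = trail.
# T[i, j] = probability of moving from state i to state j;
# the triad structure is deterministic and cyclic.
T = np.array([
    [0.0, 1.0, 0.0],  # lead  -> test
    [0.0, 0.0, 1.0],  # test  -> trail
    [1.0, 0.0, 0.0],  # trail -> lead
])

# Starting from a lead trial, three transitions return to lead.
belief = np.array([1.0, 0.0, 0.0])
for _ in range(3):
    belief = belief @ T
print(belief)  # -> [1. 0. 0.]
```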

(s_{1}). As it has learned the transition structure, described by matrix T, the subject can predict the identity of the next trial (s_{2}). If recognition was perfect, this knowledge would persist through the test trial; we model subjects’ imperfection as arising from uncertainty in past evidence, which we describe using a parameter. During the test trial the subject receives an observation (o_{3}) of frequency and price. This leads to a posterior belief (s_{3}), which then leads to the subjective belief about the trial type at the beginning of what is actually the trail trial (s_{4}). This can then be used to generate a response: either no engagement or engagement with an associated IRT. (

In our model, matrix

If the subjects’ inference was perfect relative to the actual Markov chain, they would continue to believe that they were in a test trial throughout its entirety. However, unlike other trials, during a test trial subjects may be presented with observations that are misleading as to the trial type. Continued belief therefore depends on subjects being able to correctly rely on past information in the face of competing and more recent evidence.

We model imperfections in the subjects’ ability to do this as arising from an incorrect generative model involving an intermediate matrix (

If the value of
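One plausible form for such an intermediate matrix, sketched here under our own assumptions (the interpolation weight `gamma` is our stand-in name for the paper's past-evidence parameter, whose symbol is not reproduced in this extract), blends "stay in the current state" with the task's cyclic transitions:

```python
import numpy as np

# Hypothetical sketch: a subjective within-trial transition matrix
# that interpolates between the identity (beliefs persist) and the
# task's cyclic transition matrix T (beliefs can move prematurely).
def subjective_transitions(T, gamma):
    I = np.eye(T.shape[0])
    return gamma * I + (1.0 - gamma) * T

T = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
S = subjective_transitions(T, 0.9)
print(S)
# Rows still sum to 1, so S remains a valid stochastic matrix.
```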

For a given triad of trials, probabilistic integration according to the HMM can be described using Bayes rule as:
p(s_{3} | o_{3}, s_{2}) ∝ p(o_{3} | s_{3}) p(s_{3} | s_{2}), where s_{3} is the inferred trial type at the end of a test trial, o_{3} is the observed frequency and price, and s_{2} is the state at the beginning of the test trial.

This makes clear the influence of both recent observations, p(o_{3} | s_{3}), and evidence from the past, p(s_{3} | s_{2}), on the posterior belief at the end of a test trial. Having determined this belief we find the belief at the beginning of a trail trial, p(s_{4} | o_{3}, s_{2}), simply by applying the task transition matrix T to p(s_{3} | o_{3}, s_{2}).
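The update can be sketched numerically (all probabilities below are invented for illustration, not fitted values from the paper):

```python
import numpy as np

# One step of HMM filtering at the end of a test trial.
prior = np.array([0.05, 0.90, 0.05])  # p(s3 | s2): mostly "test"
lik = np.array([0.60, 0.10, 0.30])    # p(o3 | s3): a lead-like observation

posterior = prior * lik               # Bayes rule (unnormalised)
posterior /= posterior.sum()          # p(s3 | o3, s2)

# Propagate through the task transition matrix to obtain the belief
# at the start of the next trial, which is actually a trail trial.
T = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
belief_next = posterior @ T           # p(s4 | o3, s2)

# Strong past evidence keeps "test" most likely despite the
# misleading observation, so the next-trial belief favours "trail".
print(posterior, belief_next)
```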

We then calculate the probability of a particular response according to:
p(r_{4} | o_{3}, s_{2}) = Σ_{s_{4}} p(r_{4} | s_{4}) p(s_{4} | o_{3}, s_{2}), where r_{4} is the response at the beginning of a trail trial (including no responses) and the summation is over the three possible trial types.
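A toy numerical instance of this marginalisation (all numbers invented for illustration):

```python
import numpy as np

# p(r4 | o3, s2) = sum over s4 of p(r4 | s4) p(s4 | o3, s2).
belief = np.array([0.11, 0.22, 0.67])               # p(s4): lead, test, trail
p_short_given_state = np.array([0.95, 0.60, 0.05])  # p(short IRT | s4)

p_short = belief @ p_short_given_state
print(p_short)  # overall probability of a short IRT on this trial
```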

To calculate the probability of IRTs given a known trial type we used non-parametric fitting of lead, test and non-confusing trail trial IRTs (
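The non-parametric step can be sketched as a simple Gaussian kernel density estimate over synthetic IRTs (a hypothetical stand-in for the fitted distributions; the data and bandwidth are ours):

```python
import numpy as np

# Gaussian kernel density estimate: average of Gaussian kernels
# centred on each observed IRT.
def kde_pdf(x, data, bandwidth):
    z = (np.asarray(x)[None, :] - np.asarray(data)[:, None]) / bandwidth
    k = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return k.sum(axis=0) / (len(data) * bandwidth)

rng = np.random.default_rng(0)
lead_irts = rng.gamma(2.0, 1.0, size=500)   # fake lead-trial IRTs

density = kde_pdf([2.0], lead_irts, bandwidth=0.3)
print(density[0])  # likelihood of a 2 s IRT under the "lead" model
```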

Having built the HMM we then split the data into three tertiles (details outlined in the following subsection), and determined the maximum likelihood estimate (MLE) of the parameters independently for each tertile. We were then able to simulate response distributions by sampling from
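The per-tertile fitting step can be sketched as a grid search for the maximum-likelihood parameters; the log-likelihood below is a toy stand-in (with an invented peak) for the HMM's actual likelihood of the observed responses:

```python
import numpy as np

# Toy log-likelihood surface peaked at gamma = 0.8, sigma = 0.5;
# in the real analysis this would score trail-trial responses
# under the HMM for a given tertile of data.
def loglik(gamma, sigma, data):
    return -((gamma - 0.8) ** 2 + (sigma - 0.5) ** 2) * len(data)

data = np.zeros(100)
gammas = np.linspace(0.0, 1.0, 101)
sigmas = np.linspace(0.05, 1.0, 96)
ll = np.array([[loglik(g, s, data) for s in sigmas] for g in gammas])

# Maximum-likelihood estimate = grid point with the highest score.
gi, si = np.unravel_index(ll.argmax(), ll.shape)
print(gammas[gi], sigmas[si])
```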


To investigate simpler versions of the model that could provide a more parsimonious explanation for the observed responses, we also tested models in which subjects only used evidence from one of frequency or price but not both, models which either used past information perfectly or not at all, and a model with two parameters T_{f} and T_{b} which allow forward and backward transitions to be fit separately. This last model was intended to test the hypothesis that subjects were more likely to prematurely transition their beliefs ‘forwards’ from test to trail rather than ‘backwards’ to lead. Interestingly, we found this asymmetry to be present, according to BIC, for two subjects. However, on average this model performed worse (by a BIC score of 5.8) and so we do not use it for further analysis.

Subject | No frequency | No price | (…) | (…) | T_{f}, T_{b}
---|---|---|---|---|---
1 | 1331.5 | 261.9 | 15.2 | 30.4 | 14.4
2 | 173.8 | 33.7 | 26.8 | 33.1 | 11.4
3 | 115.5 | 37.7 | 9.6 | 6.1 | -7.4
4 | 833.3 | 125.9 | 30.0 | 20.9 | -4.1
5 | 400.7 | 63.1 | 28.6 | 76.3 | 2.8
6 | 822.0 | 70.6 | 47.8 | 70.0 | 17.7

Subjects typically encountered well over a thousand triads of trials. We therefore analysed improvements on the task with experience by dividing the data by triads into three sequential tertiles. When comparing the final tertile to the first tertile for subject 1 we observe a marked decrease both in the EP and in the probability of short IRTs on trail trials (H_{11}). Taken together, this indicates that by the final tertile most subjects had improved their ability to track their progress through the task, as even on misleading trials they were rarely confused.

(H_{11}), indicating that these subjects improve in their ability to identify the trail trial. The remaining two subjects show no significant change. (H_{12}). Although significance was tested using a permutation test, we illustrate error bars using the mean square error in the MLE of the parameters. The decrease in the parameter associating observations with trial types was not consistently significant (H_{13}). We therefore do not find strong evidence to suggest that improvements in performance in the majority of subjects were due to a more accurate association of frequency and price with the trial type.

In order to understand these changes in the context of our model, we fit model parameters independently to each tertile. MLEs of the parameters identified significantly lower values of the past-evidence uncertainty parameter in the final tertile (permutation test; H_{12}). The subjects for which this parameter changed significantly corresponded to those which had shown a significant decrease in the fraction of short IRTs. This suggests that over time, the majority of the subjects learned to use evidence from the past more effectively and so improved their identification of the test and subsequent trail trials.

We also examined changes in the MLEs of the parameter governing the association of observations with trial types (H_{13}); for most subjects there was no significant change. This indicates that for most subjects there is no evidence that improvements in performance can be attributed to a more accurate association of frequency and price with the appropriate trial type.

Finally, to assess the linear correlation between estimates of the parameters

We have shown that subjects learned a model of the world which reflected an experimentally defined transition structure. However, we also identified a small fraction of trials where behaviour seemingly went awry, as evidenced by subjects responding rapidly in advance of unrewarding trials. We demonstrated that these responses could be attributed to mistaken inference of the trial type, and described this process using an HMM. This involved introducing two parameters:

An important part of the work we have described is not only demonstrating subjects’ abilities to learn structure in their environment but also in building a statistical model which describes inference in this context. The model developed was clearly defined and involved parameters which were interpretable, allowing for greater insight into the changing role of past and present evidence.

The representation used in our model proposed that subjects maintain belief states in an HMM. It is also conceivable that they might instead have adopted a less compressed, history-based representation of state, by storing the frequency and price of previous trials (either explicitly or implicitly). It is hard to distinguish these based only on behaviour (particularly given the relative paucity of errors); but such a history-based representation would, of course, still constitute a functional form of world model.

In our preferred representation, observed ‘mistakes’, corresponding to short IRTs on worthless trail trials, are due to mistaken inference of the hidden state. To support this claim we demonstrated that when the inference problem was easy, such as following lead, trail or non-confusing test trials, subjects’ responses reflected clear understanding of the structure. By contrast, when inference was hard, subjects more frequently responded inappropriately in a way which we were able to predict.

This imperfection arose in our model from uncertainty in past evidence, such that subjects failed to maintain their initial beliefs during a test trial. However, it is difficult to pin down the precise interpretation for this uncertainty. One interpretation is forgetting, or a lack of certainty in memory, which allows for a potential switching of beliefs when presented with observations which are more likely to have been generated by a lead or trail trial. Alternatively, this uncertainty could arise from imperfections in subjects’ generative model, such that transitions could occur at any point during a trial as a result of misleading observations. One issue with this latter view is that subjects always experienced each trial as being deterministically stable across time, with no change in either price or frequency. Nevertheless, further work is necessary to distinguish between these two interpretations.

In our analysis we primarily focused on responses immediately after test trials, as only test trials varied across triads and thus posed substantial possibilities for confusion. By contrast, both lead and trail trials were unchanged in frequency, price and duration across triads, and empirically resulted in consistent response properties on subsequent trials after only a small amount of training. Whilst there may also have been confusion early in training for these trials, this learning may have progressed too rapidly to enable detailed analysis of its progress.

Our model is starkly simple, using only two parameters to predict behaviour without reference to the detailed microstructure of a given trial, such as the number of reward encounters or the average reward rate.

Nevertheless, we made this choice to capture and highlight the predominant effects observed across subjects whilst also maintaining interpretability. In turn, this allowed us to identify a significant change in the

Another related aspect of fitting our model was the choice not to use trial duration in addition to price to predict responses. As alluded to earlier, this was due to the strong correlation between duration and price, which implied that using either would produce similar results. On the other hand, we were able to show that subjects do use both frequency and price/duration, indicating that they successfully combined multiple sources of evidence in the inferential process.

Two aspects of learning merit future work. One is how the subjects learned the overall model of the world over early training—particularly given their initially imperfect memories and their ignorance of the number of potential states. One promising approach is to consider a non-parametric statistical structure such as an infinite hidden Markov model [

Finally, understanding the neural underpinnings of the diverse processes involved in this task provides an exciting challenge for future research. In the case of working memory, its functioning is thought to be supported by persistent activity in a number of brain regions, including medial prefrontal cortex [

Animal-care and experimental procedures were carried out in accordance with the principles in the Canadian Council on Animal Care's (CCAC) Guide to the Care and Use of Experimental Animals, with the approval of the Concordia University Animal Research Ethics Committee (certificate #: 30000302).

We outline here elements of the modelling methodology. For a full description of the experimental methodology see [

Since all trial types terminate after set intervals (25s for lead and trail; a variable duration for the test), some care is necessary with the resulting censoring of the time during which the subjects could engage. Furthermore, we occasionally observed cases towards the end of trail trials in which the subject pressed the lever so briefly that there was no possibility of obtaining reward. This might have been a Pavlovian reaction to the expectation of the upcoming lead trial.

To avoid problems from these cases, we counted a trial as having been engaged in for the purposes of the EP if at least one reward was obtained, and we only considered IRTs (defined as the time taken from the beginning of a trial to press the lever for the first time) on those same trials. This constraint implies that initial responses after 24s on lead and trail trials would be impossible as there remains insufficient time to obtain a reward; so we only examine the properties of the IRT below this value. For test trials, which have variable trial duration, this ignored potential IRTs much larger than 24 seconds. However, in practice such cases were extremely rare (
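These definitions can be sketched as follows (field names and data are ours, purely illustrative):

```python
# A trial counts as "engaged" for the EP only if at least one reward
# was obtained, and IRTs are taken only from engaged trials and only
# below the 24 s cutoff (after which no reward remains obtainable).
def engagement_and_irts(trials, cutoff=24.0):
    engaged = [t for t in trials if t["rewards"] > 0]
    ep = len(engaged) / len(trials) if trials else 0.0
    irts = [t["irt"] for t in engaged if t["irt"] < cutoff]
    return ep, irts

trials = [
    {"irt": 1.2, "rewards": 3},
    {"irt": 20.0, "rewards": 1},
    {"irt": 24.5, "rewards": 0},  # pressed too late: no reward possible
    {"irt": 5.0, "rewards": 0},   # brief press, no reward: not engaged
]
print(engagement_and_irts(trials))  # -> (0.5, [1.2, 20.0])
```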

One facet of the experimental design is that the subjects received idiosyncratic calibrated frequencies of brain stimulation reward. We duly defined lower (

Subject | f_{lead} | P_{lead} | f_{trail} | P_{trail} | (…) | (…) | (…) | (…) | (…) | (…) | (…)
---|---|---|---|---|---|---|---|---|---|---|---
1 | 217.4 | 1.0 | 10.0 | 1.0 | 20.0 | 125.9 | 0.4 | 3.9 | 31.6 | 79.4 | 8.1
2 | 196.1 | 1.0 | 10.0 | 1.0 | 35.5 | 100.0 | 0.4 | 3.9 | 63.1 | 125.9 | 4.1
3 | 250.0 | 1.0 | 10.0 | 1.0 | 44.7 | 158.5 | 0.4 | 3.9 | 79.4 | 158.5 | 4.1
4 | 200.0 | 1.0 | 10.0 | 1.0 | 31.6 | 100.0 | 0.4 | 3.9 | 50.1 | 79.4 | 4.1
5 | 250.0 | 1.0 | 10.0 | 1.0 | 28.2 | 125.9 | 0.4 | 3.9 | 39.8 | 100.0 | 4.1
6 | 163.9 | 1.0 | 10.0 | 1.0 | 25.1 | 79.4 | 0.4 | 3.9 | 39.8 | 100.0 | 4.1

We tested for statistical significance using two-tailed permutation and binomial proportion tests. Permutation tests determined the probability that the observed difference in test statistics between classes would occur under random permutation of the class labels. In all cases we used 1000 simulations.

One non-trivial use of the permutation test was to assess whether changes in the MLE of model parameters were significant. For this we determined the MLE of the model parameters in the first and last tertiles for data in which the time labels were permuted randomly and calculated the absolute difference between these parameter values. This was repeated 1000 times to generate a distribution of differences. We then tested whether the absolute difference in the MLE of the parameters for the non-permuted data was significant (greater than the 95th percentile).
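A minimal sketch of such a two-tailed permutation test on medians (synthetic data; the statistic and sample sizes are ours):

```python
import numpy as np

# Shuffle class labels, recompute the difference in medians, and
# compare the observed difference against the permutation distribution.
def permutation_test(a, b, n_sims=1000, seed=0):
    rng = np.random.default_rng(seed)
    observed = abs(np.median(a) - np.median(b))
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_sims):
        rng.shuffle(pooled)
        diff = abs(np.median(pooled[:len(a)]) - np.median(pooled[len(a):]))
        if diff >= observed:
            count += 1
    return count / n_sims

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 200)
y = rng.normal(1.0, 1.0, 200)
p_val = permutation_test(x, y)
print(p_val)  # small p-value: the medians clearly differ
```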

The binomial proportion tests were used to determine the probability of the equality of two binomial proportions for two observed distributions. To compute this we evaluated the test statistic:
z = (p_{1} − p_{2}) / √(p(1 − p)(1/n_{1} + 1/n_{2})), where p_{1} and p_{2} are the observed proportions, n_{1} and n_{2} the corresponding number of observations, and p = (n_{1}p_{1} + n_{2}p_{2})/(n_{1} + n_{2}) is the pooled proportion.

We then calculated two-tailed p-values from the standard normal distribution.
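The test statistic and two-tailed p-value can be computed as follows (a sketch; the counts are invented):

```python
import math

# Two-proportion z-test: pooled proportion, test statistic, and a
# two-tailed p-value from the standard normal CDF (via math.erf).
def two_proportion_z(k1, n1, k2, n2):
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)  # pooled proportion
    z = (p1 - p2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return z, 2.0 * (1.0 - phi)

z, p = two_proportion_z(30, 100, 50, 100)
print(z, p)
```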

Our null hypotheses referenced in the Results section were as follows:

H_{1}: Trail trial IRTs have the same median as a combined grouping of lead and test IRTs for untrained subjects (permutation test)

H_{2}: Test trial IRTs have the same median as lead trial IRTs for trained subjects (permutation test)

H_{3}: EPs on trail trials are the same for trained subjects as they are for untrained subjects (binomial proportion test)

H_{4}: Lead trial IRTs have the same median for trained subjects and untrained subjects (permutation test)

H_{5}: Test trial IRTs are equally similar to trail trial IRTs with preceding test trials in region

H_{6}: Lead trial IRTs are equally similar to trail trial IRTs with preceding test trials in region

H_{7}: The fraction of short trail trial IRTs is the same for the ‘Intermediate’ category as for the ‘Low f’ category, with P = 1s (binomial proportion test)

H_{8}: The fraction of short trail trial IRTs is the same for the ‘Intermediate’ category as for the ‘High f’ category, with P = 1s (binomial proportion test)

H_{9}: The fraction of short trail trial IRTs is the same for the ‘P = 1s’ category as for the ‘P > 7s’ category, with high f (binomial proportion test)

H_{10}: The fraction of short trail trial IRTs is the same for the ‘Low f’ category as for the ‘High f’ category, with P < 0.3s (binomial proportion test)

H_{11}: The fraction of short trail trial IRTs is the same in the final tertile as it is in the first tertile (binomial proportion test)

H_{12}: The MLE of

H_{13}: The MLE of

The P-values for these hypotheses for all subjects are listed in

Null hypothesis | Subject 1 | Subject 2 | Subject 3 | Subject 4 | Subject 5 | Subject 6
---|---|---|---|---|---|---
H_{1} | 0.506 | 0.001 | 0.472 | 0.200 | 0.488 | 0.543
H_{2} | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001
H_{3} | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001
H_{4} | 0.024 | < 0.001 | 0.532 | 0.025 | < 0.001 | 0.723
H_{5} | < 0.001 | < 0.001 | < 0.001 | 0.003 | 0.003 | 0.001
H_{6} | 0.023 | 0.181 | < 0.001 | 0.003 | 0.414 | 0.177
H_{7} | 0.004 | < 0.001 | < 0.001 | 0.014 | < 0.001 | < 0.001
H_{8} | 0.002 | < 0.001 | < 0.001 | 0.002 | < 0.001 | < 0.001
H_{9} | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001
H_{10} | 0.007 | 0.058 | < 0.001 | < 0.001 | 0.014 | 0.108
H_{11} | < 0.001 | < 0.001 | < 0.001 | 0.165 | 0.210 | < 0.001
H_{12} | < 0.001 | < 0.001 | 0.049 | 0.204 | 0.232 | < 0.001
H_{13} | 0.355 | 0.044 | 0.065 | 0.743 | < 0.001 | 0.022

We calculate the Bayesian Information Criterion (BIC) for a given model M according to:
BIC(M) = k_{M} ln(N_{D}) − 2 ln p(D | θ^{ML}, M), where θ^{ML} are the maximum likelihood parameters of the model, k_{M} is the number of model parameters and N_{D} is the number of data points.

As we split the data into tertiles, we calculate the BIC for each tertile first and sum these to form an overall BIC for each subject.
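A minimal sketch of this computation (the per-tertile log-likelihoods and counts below are invented):

```python
import numpy as np

# BIC as defined above: penalty term minus twice the log-likelihood.
def bic(log_lik, k_params, n_data):
    return k_params * np.log(n_data) - 2.0 * log_lik

# Per-tertile log-likelihoods and data counts for a hypothetical
# two-parameter model, summed into an overall per-subject BIC.
tertile_ll = [-410.2, -395.7, -388.1]
tertile_n = [500, 500, 500]
total = sum(bic(ll, 2, n) for ll, n in zip(tertile_ll, tertile_n))
print(total)
```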

When comparing model parameters across tertiles, for illustration we compute error bars as √(Σ_{ii}), where Σ_{ii} is the i^{th} diagonal element of the matrix (−H)^{−1}, the negative inverse of the Hessian of the log-likelihood.

As the matrix

Frequencies and prices used during training for subject 1 are shown. The range of frequencies and prices is more limited than that employed in the full experiment.

(PDF)

When trail trial responses are filtered such that only those with preceding test trials in region λ are included, a decrease in the EP is observed.

(PDF)

We evaluated the probability of the observed responses given certainty about the trial type by constructing kernel density estimates of the observed responses. For lead and test trials, which do not lead to confusion, the density estimate was based directly on the observed distributions. For trail trials, to account for confusion, we first filtered the trials such that only those with preceding test trials in region λ were included.

(PDF)

We determined the linear correlation between estimates of the parameters

(PDF)

We would like to thank Wittawat Jitkrittum, Jesse Geerts, Tian Tian and Li Wenliang for fruitful discussions. Steve Cabilio developed and maintained the experimental-control and data acquisition software used in this study. The experimental-control and data acquisition hardware was designed, built, and maintained by David Munro.