
Fig 1.

The Alternating Serial Reaction Time (ASRT) task with second-order dependence structure.

(a) Participants had to press the key corresponding to the current sequence element (i.e. cue location) on the screen as accurately and quickly as possible, using the index and middle fingers of both hands. In the display, the possible locations were outlined in black and the cue always looked the same; fill color and saturation are used here only for explanatory purposes. (b) The structure of the example sequence segment in (a). Color saturation and outline indicate the element that was presented on a trial. The vertical arrow indicates the current trial. The task was generated from an eight-element second-order sequence in which every second element was deterministic and the elements in between were random. The deterministic components in this example are red-blue-yellow-green. The element on any random trial (including the current one) is unpredictable. However, the current trial happens to mimic the deterministic second-order dependence in which green is followed by red after a gap of one trial, making it a high-probability trigram trial (H). The other random elements were associated with lower-probability trigrams (L). (c) Under the true generative model, when in a random state, high-probability trigrams (rH) and low-probability trigrams (rL) are equally unexpected. (d) A learner who can pick up second-order dependencies, but who is agnostic to the higher-order alternating state structure, would expect rH more than rL. (e) In the last training session (session 8; after more than 14,000 trials), participants responded faster to deterministic than to random trials, suggesting that they learned to predict the upcoming element. They also responded quickly even on random trials if those happened to complete a high-probability trigram (rH). The y-axis shows the standardised reaction time (RT) averaged over the different trial types on the last session of learning. The error bars indicate the 95% CI.
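The alternating structure described above can be sketched in a few lines of code. The following is a minimal illustration (our own sketch, not the authors' stimulus code; all names are hypothetical): a deterministic four-element pattern alternates with uniformly random fillers, and a random trial is labeled H when its element continues the pattern's second-order dependence from two trials back.

```python
import random

def generate_asrt(pattern, n_trials, seed=0):
    """Generate an ASRT stream: deterministic pattern elements
    alternate with uniformly random filler elements."""
    rng = random.Random(seed)
    elements = sorted(set(pattern))
    stream = []
    for t in range(n_trials):
        if t % 2 == 0:                      # even positions: deterministic
            stream.append(pattern[(t // 2) % len(pattern)])
        else:                               # odd positions: random filler
            stream.append(rng.choice(elements))
    return stream

def trigram_label(stream, t, pattern):
    """Label trial t as high (H) or low (L) probability: H if the
    element two trials back is followed, after a gap of one trial,
    by its successor in the deterministic pattern."""
    succ = {pattern[i]: pattern[(i + 1) % len(pattern)]
            for i in range(len(pattern))}
    return 'H' if succ.get(stream[t - 2]) == stream[t] else 'L'
```

Under this sketch, every deterministic trial is H by construction, while a random trial is H only when it happens to mimic the second-order dependence, as in the example above.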


Fig 2.

Modeling strategy.

We adopted a model-based approach, fitting the hyperparameters θ of an internal sequence model (upper box), together with low-level effects (the spatial distance between subsequent response locations, response repetition, and error and post-error trials; lower box), to participants’ response times. The contribution of the sequence model is the scaled log of the predictive probability of each key press k (one of the four keys, marked as a transparent square), given the context u (previous events, marked as a string of colored squares). The sequence model makes predictions by flexibly combining information from progressively deeper windows onto the past, considering fewer or more previous stimuli.
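As a rough sketch of how such a response model combines its terms (an illustrative simplification of ours, not the fitted model itself), the predicted RT can be written as an intercept plus additive low-level effects plus a scaled negative log predictive probability of the pressed key:

```python
import math

def predict_rt(p_keypress, intercept_ms, seq_weight_ms,
               low_level_effects_ms=()):
    """Illustrative RT model: intercept + additive low-level effects
    + scaled surprise of the pressed key. All names are hypothetical."""
    surprise = -math.log(p_keypress)   # lower probability -> more slowing
    return intercept_ms + sum(low_level_effects_ms) + seq_weight_ms * surprise
```

Under this sketch, a key with predictive probability 1 incurs no sequence-related slowing, and less expected keys are predicted to be slower.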


Table 1.

Terminology of the hierarchical Chinese restaurant process mapped onto the experimental measures of the current study.


Fig 3.

Treating the sequence learning problem as a hierarchical nonparametric clustering problem.

(a) The traditional, unforgetful Chinese restaurant process (CRP) is a nonparametric Bayesian model where the probability that a new observation belongs to an existing cluster or a new one is determined by the cluster sizes and the strength parameter α. In the metaphor, the new customer (new observation; see the terminology in Table 1; shown as black dots) sits at one of the existing tables (clusters labeled by key press identity, e.g., ‘response to left side of the screen’; shown as colored circles) or opens up a new table (shown as an open circle) with probabilities proportional to the number of customers sitting at the tables and to α. Here, the most likely next response would be of the type pink. (b) The distance-dependent or ‘forgetful’ Chinese restaurant process (ddCRP) is governed by a distance metric, according to the ‘close together sit together’ principle. In our case, the customers are subject to exponential decay with rate λ, as shown in the inset (and illustrated by the grey colours of the customers). Even though the same number of customers sit at the tables as in (a), this time the predictive probability of a yellow response is highest because most of the recent responses were yellow. (c) In the distance-dependent hierarchical Chinese restaurant process (HCRP), restaurants are labeled by the context of some number of preceding events and are organized hierarchically such that the restaurants with the longest context are on top. Thus, each restaurant models the key press of the participant at time point t, k_t, given a context of n events (e_{t−n}, …, e_{t−1}). A new customer arrives first at the topmost restaurant that corresponds to its context in the data (in the example, the customer is bound to visit the restaurant labeled by the context ‘yellow-blue’ when it arrives at level 2). If it opens up a new table, it also backs off to the restaurant corresponding to the context one element shorter (in the example, to the restaurant labeled by the context ‘blue’).
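The seating probabilities in (a) and (b) can be made concrete with a short sketch (our simplified illustration of the predictive rule, not the authors' implementation; function names are ours, and the exponential-decay weighting follows the λ description above):

```python
import math

def crp_probs(table_counts, alpha):
    """Classic CRP prediction: existing tables in proportion to their
    occupancy, a new table in proportion to the strength alpha."""
    total = sum(table_counts) + alpha
    return [c / total for c in table_counts] + [alpha / total]

def ddcrp_probs(table_times, alpha, lam, t_now):
    """'Forgetful' variant: each past customer's weight decays
    exponentially with rate lam, so recent observations dominate.
    table_times: one list of past time stamps per table."""
    weights = [sum(math.exp(-lam * (t_now - t)) for t in times)
               for times in table_times]
    total = sum(weights) + alpha
    return [w / total for w in weights] + [alpha / total]
```

With λ = 0 the forgetful variant reduces to the classic CRP, and with large λ only the most recent customers carry weight, which is what makes recent responses dominate the prediction in (b).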


Fig 4.

Fitted parameter values shown session by session, and at a subsession resolution in the initial and final sessions.

The grey band at the bottom of each plot shows the sequence that participants practiced: the old sequence in sessions 1–8 (dark grey), the new sequence in session 9 (light grey), and both sequences alternately in session 10. Point distance in (a) and cell width in (b) are proportional to data bin size; we fitted the model to 5 epochs within sessions 1, 9, and 10 to assess potentially fast shifts. (a) Fitted values of the response parameters in units of τ [ms]. The error bars indicate the 95% CI for the between-subjects mean. (b) Fitted values of the strength α (left) and forgetting rate λ (middle) parameters are shown, as well as their joint effect on prediction (right). A context of n previous events corresponds to level n in the HCRP. Lower values of α and λ imply a greater contribution from the context to the prediction of behavior. The context gain for context length n is the decrease in the KL divergence between the predictive distribution of the complete model and a partial model upon considering n previous elements, compared to considering only n − 1 previous elements. Note that the scale of the context gain is reversed and higher values signify more gain.
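The context-gain measure defined above can be sketched as follows (an illustrative reading of the definition with hypothetical inputs, not the authors' code; `partial_preds[n]` stands for the partial model's predictive distribution when n previous elements are considered):

```python
import math

def kl(p, q):
    """KL divergence D(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def context_gain(full_pred, partial_preds):
    """Context gain for each depth n >= 1: the drop in
    KL(full || partial with context n) relative to
    KL(full || partial with context n - 1)."""
    kls = [kl(full_pred, q) for q in partial_preds]
    return [kls[n - 1] - kls[n] for n in range(1, len(kls))]
```

In this sketch, a large gain at depth n means that the n-th previous element moves the partial model's prediction substantially closer to the complete model's prediction.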


Fig 5.

Correlation between the fitted HCRP parameters and working memory.

(a, b) Pearson correlation matrices of the working memory test scores and the strength parameters α and decay parameters λ of the HCRP model, respectively. Correlations that met the significance criterion of p < .05 are marked with black boxes. (c, d) Scatter plots of the correlations that met the significance criterion of p < .05. Bands represent the 95% CI.


Fig 6.

Trial-by-trial predictive check.

In (a) and (b) we show example segments of held-out data from sessions 1 and 7 of participant 102. (Top) Colored bars show the positive (slowing) and negative (speeding) effects predicted by the different components in our model relative to the intercept (horizontal black line). The overall predicted RT value (red line) is the sum of all effects. The color codes of the event and the response are shown at the bottom. A mismatch between the two indicates an error. (Middle) Predictive probabilities of the four responses are shown for each trial. The cells’ hue indicates the response identity; saturation indicates the probability value. The sequence prediction effect (pale red bar in (Top)) is inversely proportional to the probability of the response, i.e. a higher probability yields a faster response. The ticks at the bottom indicate high-probability trigram trials. (Bottom) We show what proportion of the predictive probability comes from each context length. Higher saturation indicates a larger weight for a context length. (c) Test prediction performance of the full model and of each model component in terms of unique variance explained, averaged across participants. Bands represent the 95% CI.


Fig 7.

Calibration of the HCRP model.

(a) RTs predicted by our HCRP model are shown against measured RTs for d versus r trials on held-out test data. (b) Same as (a) for rH versus rL trials. The two dashed lines mark the mean RTs for d and rH trials in session 8. The RT advantage of d over rH by session 8 marks (> 2)-order sequence learning. (c-d) rH versus rL trials are labeled according to the old trigram model (i.e. old sequence) or the new trigram model (i.e. new sequence). The grey band at the bottom shows the sequence that participants practiced: the old sequence in sessions 1–8 (dark grey), the new sequence in session 9 (light grey), and the two sequences alternately in session 10. (e) (> 2)-order sequence learning, quantified as the standardized RT difference between rH and d trials, shown for measured and predicted RTs. In session 1, rH trials are more expected because they reoccur sooner on average. By session 8, d trials are more expected because they are more predictable given a > 2 context. This was predicted by the HCRP but not the trigram model. (f) Correlation of the measured and predicted (> 2)-order effect in sessions 1 and 8. (g) Average predictive performance of the HCRP and the trigram models. (a-g) The error bands and bars represent the 95% CI.


Table 2.

Repeated measures ANOVAs in sessions 1–8.

In the left set of columns, the trial type is defined as the state and in the right set of columns it is defined as P(trigram).


Table 3.

Repeated measures ANOVAs in sessions 9 and 10.

In the left set of columns, the trial type is defined as P_old(trigram) and in the right set of columns it is defined as P_new(trigram).


Fig 8.

Proportions of errors of different types across training sessions.


Fig 9.

Predicting the latency of errors.

(a) Pattern errors. (b) Recency errors. In the case of HCRPf, the hyperparameter priors were adjusted to express more forgetfulness. The error bars represent the 95% CI.


Fig 10.

The HCRPf weighting of context depth differs among errors of different types.

(Left) Weights of all HCRPf levels. (Right) Zoomed in for HCRPf level 3 only.
