Representation of foreseeable choice outcomes in orbitofrontal cortex triplet-wise interactions

Shared neuronal variability has been shown to modulate cognitive processing. However, the relationship between shared variability and behavioral performance is heterogeneous and complex in frontal areas such as the orbitofrontal cortex (OFC). Mounting evidence shows that single-units in OFC encode a detailed cognitive map of task-space events, but the existence of a robust neuronal ensemble coding for the predictability of choice outcome is less established. Here, we hypothesize that the coding of foreseeable outcomes is potentially unclear from the analysis of units activity and their pairwise correlations. However, this code might be established more conclusively when higher-order neuronal interactions are mapped to the choice outcome. As a case study, we investigated the trial-to-trial shared variability of neuronal ensemble activity during a two-choice interval-discrimination task in rodent OFC, specifically designed such that a lose-switch strategy is optimal by repeating the rewarded stimulus in the upcoming trial. Results show that correlations among triplets are higher during correct choices with respect to incorrect ones, and that this is sustained during the entire trial. This effect is not observed for pairwise nor for higher than third-order correlations. This scenario is compatible with constellations of up to three interacting units assembled during trials in which the task is performed correctly. More interestingly, a state-space spanned by such constellations shows that only correct outcome states that can be successfully predicted are robust over 100 trials of the task, and thus they can be accurately decoded. However, both incorrect and unpredictable outcome representations were unstable and thus non-decodeable, due to spurious negative correlations. Our results suggest that predictability of successful outcomes, and hence the optimal behavioral strategy, can be mapped out in OFC ensemble states reliable over trials of the task, and revealed by sufficiency complex neuronal interactions.

However, the relationship between shared variability in frontal areas and behavioral performance is heterogeneous and complex [9]. It depends on the memory of choices preceding the current trial, such that successful task engagement is associated with low variability [14]. By contrast, some orbitofrontal cortex (OFC) neurons show the opposite trend [22]; and whilst prefrontal cortex (PFC) neurons encode predictable biases in action timing, stochastic variability is strongly represented downstream [2]. Moreover, a recent study has shown that mean pairwise correlation might not serve as a proxy for encoded information nor for behavioral performance in general, since only the variability along the encoding axis is detrimental to information [38][39][40].
In this work, we investigate this scenario further by focusing on the relationship between optimal choice behavior and shared trial-to-trial variability in rodent OFC [20,41]. The OFC provides a particularly interesting case study, since it has been associated with multiple behaviorally relevant variables in the decision-making task space (e.g., [42][43][44][45][46][47][48][49][50][51]) such as outcomes expectations that guide action [41,52], their desirability [53] or the availability of multiple-valued choices in economic decision making [45] (but see also [54]). In contrast with other frontal areas [30,55,56], the OFC representation of whether optimal choices are or are not predictable from previous trials outcomes [53] is less established [50,51,57].
Our hypothesis is that the representation of the outcome predictability is not always evident from individual unit rates and their pairwise correlations. However, it might be established when higher-order neuronal interactions are considered on a trial-by-trial basis. To test this hypothesis, we used a two-choice interval-discrimination task devised such that the previous outcome enables the animal to infer the optimal course of action, but in which the stimulus is not always foreseeable. In this task, it has been previously shown [41] that OFC individual units encode a compact combination of past-trial state variables that can inform the upcoming decision [41]. Here, we focus on OFC ensembles for analyzing the role of shared variability in predicting the choice outcome [15,16].
We first observed that shared variability among triplets of neurons is higher during correct choices, and that this was sustained during the entire trial duration; but this effect was unclear for pairwise correlations. This suggested that stronger three-way correlations are systematically associated with successful outcomes; whilst pairwise correlations are not sufficient to discriminate the choice outcome. Paralleling this result, a neuronal state-space spanned by up to threeway interactions optimally decoded the correct choice versus the rest of the incorrect choices.
Intriguingly, only states representing predictable correct choices remained stable for over 100 trials of the task, while ensemble states for incorrect or unpredictable choice outcomes randomly wandered in the state space and hence could not be effectively decoded. All in all, our results suggested that correct-choice predictability and hence the optimal behavioral strategy could be encoded in metastable states temporarily assembled by sufficiently complex, third-order lateral OFC (lOFC) constellations.

Shared variability among triplets of neurons is higher for correct-choice outcomes
The mapping of choices to single-unit activity across the course of a trial has been previously demonstrated [41]. Thus, given the role of second-order statistics in stimulus and behavior coding (e.g., [5,33,38]), we sought to investigate the relationship of behavioral choices with trial-to-trial variability and neuronal interactions. To this end, we recorded orbitofrontal ensembles from three behaving rats implanted with tetrodes, while they performed the task outlined in Fig 1A. The interval discrimination task is described in more detail in Materials and Methods and in [41]. The animal had to access the central socket in order to trigger a sequence of two 50 ms pure tones (T1 and T2) separated by either a short or a long inter-tone interval; by nose-poking either the left socket (for short ITI, light orange shades, see example in Fig 1A) or the right socket (for long ITI, darker orange shades) to successfully retrieve the reward. After an incorrect trial, the previous ITI was repeated. Otherwise, the ITI was randomly drawn from a distribution of values which grade the difficulty of the task (Fig 1A, see also Materials and methods and Fig 5A).
First, as an exploratory analysis, we computed a common measure of shared neuronal variability (e.g., [9,33,38,[58][59][60]), the absolutely value of the trial-averaged correlations for each ensemble; separately for correct-and incorrect-choice outcomes (Eq (1) in Materials and methods).  Fig 1C), and do not discern between correct-and The animal has to discriminate between two sequences of 50 ms pure tones (T1 and T2) separated by an inter-tone-interval (ITI) of variable duration, by nose-poking either the right socket for long ITI (termed here stimulus long, ITI 350 − 500 ms) or the left socket for short ITI (termed stimulus short, ITI 50 − 200 ms, example shown in the figure) to successfully retrieve the reward. Vertical lines indicate the average position of different salient events during the trial (yellow: T1 onset, orange: T2 onset, cyan: averaged movement onset, which has a high variance and thus is merely indicative). (B) Trial-averaged correlations (Eq (1) in Materials and methods) among pairs (left panel), triplets of neurons (center panel) and quadruplets (right panel), further averaged across 5 ensembles having n � 5 units. Correlations are computed separately on trials in which the choice was correct (blue) and for the remaining trials (incorrect, black). �� p < 0.001 (test details in main text, error bars are SD). Vertical lines indicate the average position of different salient events.

PLOS COMPUTATIONAL BIOLOGY
Representation of foreseeable outcomes in orbitofrontal cortex triplet-wise interactions incorrect-choice outcomes (Fig 1B, left), in line with a recent study that analyzes the effects of mean pairwise correlations on different tasks and brain regions [39].
However, the distinction between correct-and incorrect-choice outcomes (encompassing all types of incorrect responses) is more strongly expressed in systems of triplets of units ( Fig  1B and 1C). Such correlations are higher in magnitude for correct-choice outcomes (Fig 1B, center; two-tailed t-test of mean correlation coefficients T(40) = 7.3, p = 6.2 � 10 −9 , MANOVA incorrect vs correct trials, Wilks' V = 0.88, p = 1.8 � 10 −4 ), are sustained throughout the entire trial and significant (one-sided Wilcoxon signed-rank of mean correlation coefficients for all trials W = 1125, p = 1.11 � 10 −9 ).
Moreover, we tested this observation in a setting in which animals were passively exposed to the same stimuli, but in which rewards were not provided [41]. If triple-wise correlations are associated with the task, passive correlations should be weaker than correlations associated with correct-choice trials, that is, when the task has been performed successfully. S1 Fig suggests that passive correlations are not distinguishable from correlations during incorrect trials and hence they are significantly lower than for correct-choice trials.
Overall, this exploratory analysis seems consistent with constellations of up to three interacting units, assembled during trials in which the task is successfully performed, whilst pairwise correlations do not suffice to discriminate the choice outcome.

Successful choices are encoded in orchestrated interactions within small ensembles
Mechanistically, correlations in Fig 1 are thus suggestive of a sustained, orchestrated firing of small OFC networks accompanying successful performance. This raises the question of whether neuronal interactions help decodability of the choice outcome and thus behavior.
To assess this question, we first constructed state spaces as explained in Materials and Methods (schematics in Fig 2A): the multi-unit space spanned by the firing rates of each ensemble units (Fig 2A, left), is then enriched by incorporating pairwise (Fig 2A center) plus triplet products of the rates (Fig 2A right) as new axes (Eq (8)). A multi-unit space enlarged by θ > 1 rate products can represent up to θ th -order correlations, as demonstrated in Materials and Methods/S1 Methods. (Eqs (10) and (11), (S17a), (S17b)) and will be thus referred to as the θ th -order space [10,37,56].
Second, a decoder specialized in operating on such sparse high-dimensional input data (a regularized kernel-Fisher discriminant [61], Eqs (3) to (7)) was used to predict the trial choice outcome from ensemble rates. The entire trial duration was used, since three-way correlations consistently depend on the correctness of the choice during all trial periods (Fig 1).
Is worth stressing that this decoder is just a conventional linear discriminant operating in a state space spanned not only by the neuron fining rates as in previous studies, but also by products among such firing rates. For instance, for θ = 3 and an ensemble of three neurons with firing rates x i (t), i = 1, 2, 3, the expanded space would contain axes such as x 1 (t), x 2 (t), x 3 (t), (Fig 2A, right panel; see Materials and methods and S1 Methods for further details). The decisive advantage of this approach over a conventional decoder is in leveraging the θ-order correlations to foster the out-of-sample decoding performance, as shown in the next section.  [56,61]). Each marker corresponds to the firing-rate observation during a 80 ms bin, plots show data from 40 consecutive trials each. Colors represent behavioral response categories for the trial, namely "C" (correct response, blue), "M" (missed responses and false alarms, red), "P-C" (premature central response, gray), "P-L" (premature lateral response, see description in Materials and methods). (C) Triplet (left) and pairwise (right) correlations, averaged for the entire trial duration for each one of the response categories. Dashed lines in the pairwise correlation plots indicate the triplet-wise correlations on the left, which are significantly higher for correct choices ( � omitted for clarity). Error bars are SD. � p < 0.05 (test details in main text).  In Figs 2 and 3, DC stands for the discriminant coordinates, which define the optimal subspace (further orthogonalized) in which the data is projected for decoding (see Data analysis section in Materials and methods). All categories of unsuccessful choice outcomes (missed and false alarms, plus both sides premature responses; red, grey and green dots, respectively, in Fig 2B) are mapped onto different volumes in the state space, drifting randomly and largely overlapping around the origin. By contrast, the response pattern during correct choices (blue) remains in the same position of the state space from trial to trial (the same axes are used for all plots in Fig 2B).
Moreover, correlations averaged for all ensemble units and trial bins (as in Fig 1C) are consistent with this visualization: all triplet-wise correlations are substantially stronger for correct choices (Fig 2C; Kruskall-Wallis χ 2 (3, 124) = 50.4, p = 6.6 � 10 −11 ; Bonferroni-corrected comparisons with correct choices at p < 0.05 indicated). However, this effect is not observed in pairwise correlations, which, in addition, decrease significantly for correct choices with respect to triplets of neurons (Wilcoxon rank sum, W = 1040, p = 7.2 � 10 −12 , dashed versus solid lines in Fig 2C) and only distinguish among correct and premature central responses. The same striking observation is salient for smaller ensembles (for instance Fig 3A); again consistent with mean triplet-wise correlations ( The robustness of the decoding over trials was assessed by two causally cross-validated indexes computed on future trials not used to train the decoder: the decoding error, DE, and the divergent trajectories, DT, indexes defined in Materials and Methods. This cross-validation process is specifically designed to quantify how much the 3-dimensional subspace DC1 − DC2 − DC3 obtained in previous trials is still a valid decoder for future trials; that is, the stability of the subspace over time. We refer to "trajectory" as the sequence of consecutive firing rate bins during a single trial (Eq (12)). For instance, during a 2000 ms trial using a 80 ms bin, a trajectory is a sequence of 25 consecutive points in the space corresponding to the same trial (see an example in Fig 3A). The DT index measures the percentage of trials containing at least one incorrectly decoded point, starting from the trial endpoint and checking the trajectory backwards in time (Eq (13)). This index is suggestive of the degree in which decoded states differ empirically from ideal attracting sets as discussed in [10,37,56].
The overall decoding accuracy is significant but low. This is expected, since the cross-validation imposed is demanding: training (estimation) and validation sets may not be consecutive and both contain the same number of trials. By contrast, training sets are typically much larger than test sets in decoding studies [41]. Despite this, and in line with the visual display, DE and DT are substantially lower for the correct choice than in the rest of the responses (see analysis with the full dataset in Fig 4). Permutation tests constructed by shuffling the trial choice outcome (Materials and methods) further confirm this analysis: the correct choice is significantly decoded ( Note that the decoder is still multivariate, thus, standard receiver-operating characteristics curve analyses do not apply and the chance level of a uniformly random decoder would be at DE = 75%. In addition, and consistently with the correlation analysis ( Fig 1B-1D), the decoding of correct choices is suboptimal when the space is spanned only by the units activity (rendering a conventional linear discriminant analysis, order 1 in Fig 4 top panels, post-hoc tests order θ = 1 vs order θ = 3, p <0.037, Bonferroni-corrected) or by using up to pairwise interactions; this is also the trend observed when more complex interactions over triplets are considered (for establishing a fair comparison between orders, only ensembles with four or more units are used in the top panels in Fig 4 and in forthcoming analyses, rendering the same conclusions; see pink bars in bottom panels for comparison). Note that larger state spaces (θ > 3) tend to over-fit, as shown, for example, in [37,56]. Moveover, decoding is still possible when the correlation with the immediately previous trial is removed, suggesting that higher interactions may carry extra information for decoding show the decoding performance indexes per behavioral response category, calculated in a range of state spaces spanning from the multiple single-unit activity space (order 1) to state spaces incorporating up to four-way interactions (order 4). For establishing a fair comparison across orders, only ensembles consisting of four or more units are considered in the inset plots and in forthcoming analyses. Results for the optimal state-space order (order 3, magenta) are also shown in the large panels below for comparison.
https://doi.org/10.1371/journal.pcbi.1007862.g004 [41]. To test this, we regressed units rates with the preceding trial (Eq (2)), and repeated the decoding analysis on the residuals of the optimal adjustment (S3 Fig). Even though there is an expected, significant drop in decoding performance, we observed a similar phenomenon as in Fig 4. In summary, only the "correct-choice state" can be decoded from the ensemble rates; that is, it is associated with high trial-to-trial stability (Figs 2B, 3 and 4), often even on a singleensemble basis (S2 Fig). This phenomenon is optimally salient when up to three-way interactions within each ensemble are considered for decoding (inset panels in Fig 4) and is attenuated, but still expressed when the information from the immediately previous trial is linearly decimated (S3 Fig).

Destabilization of unpredictable correct-choice states
Results so far suggest that correct-choice ensemble states are robust over trials. However, correct-choice trials are not consecutive and can have different cognitive demands: this task is designed such that the stimulus is repeated in the upcoming trial after an incorrect-choice outcome, and randomized otherwise ( Fig 5A and Behavior section in Materials and methods). In addition, previous studies showed that OFC encodes a map of the task-space variables for both the current and the immediately previous trial [41]. Thus, we hypothesize that the low decoding error for correct-choice trials (Fig 4) represents the predictability of the decision after incorrect choices.
To verify this hypothesis, it should be impossible to decode the correct-choice state when immediately preceded by another correct-choice trial ( In contrast, by decoding only trials following correct choices; that is, trials in which the stimulus is randomized (termed here Unpredictable trials, Fig 5A top), correct choices cannot be decoded any longer (permutation test). The correct-choice state also becomes indistinguishable from other states (Fig 5B left,

Shared variability is associated with predictability of successful choices
All in all, low decoding errors for correct choices (Fig 4) are associated with correlated variability among small constellations consisting of three units (Figs 1B, 2 and 3). Moreover, the optimal decoder operates in a space in which constellations of up to three units are explicitly represented (Fig 4). Thus, we expect that stronger positive correlations during trials after incorrect-choice outcomes might explain the reason for low decoding errors for choices associated with predictable successful outcomes (Fig 5).  By contrast, negative triplet-wise correlations for correct choices depend on the preceding trial outcome (Fig 6C center, filled bars, W = 143391, p = 2.4 � 10 −7 ). Thus, we hypothesized that negative correlations may explain the differences in DE between unpredictable and predictable correct-choice trials shown in Fig 5B. Intuitively, infrequent negative correlations, randomly appearing during the trial, would introduce perturbations causing cross-validated decoding errors to increase with the strength of negative correlations. Fig 6 tests this hypothesis. Three-way negative correlations are stronger for correct choices during trials preceded by another consecutive correct choice ( Fig 6A); that is, when the current stimulus is unpredictable (insets in Fig 6A, coefficients averaged over the trial duration, W = 45381, p = 4.1 � 10 −9 ). These weak but significant triplet-wise negative correlations can counter-balance the effect of dominant positive correlations in the stability of the correctchoice state during the trial, explaining the high DE for correct-choice trials in Fig 5A (left). This is the case either for most of the periods of interest individually (Fig 6B left plots, averages over the three periods of interests defined in Materials and Methods and in Fig 1).
On the contrary, negative correlations among triplets of neurons do not distinguish among choice outcomes after trials in which the stimulus is repeated (predictable stimuli, inset in Fig  6A right plot, W = 123951, p = 0.066). In this scenario, infrequent negative correlations are too weak to destabilize the correct-choice ensemble state, explaining the lower DE for predictable correct choices (Fig 5B right). Consistent with our previous analysis, this predictability-dependent effect is overall not observed for negative pairwise correlations, shown in S4A Fig. Interestingly, negative correlations during correct choices are particularly low during early trial stages, when no information about the upcoming stimulus is available (Fig 6A right, 'Initiation' period averages, W = 119577, p = 2 � 10 −4 ). We thus propose that for predictable trials the hampering effect of infrequent negative correlations in decoding is negligible, since in such trials the information on the optimal decision is already available during the trial initiation period, and negative correlations during this period are dampened. To further test this hypothesis, we have devised a simple surrogate index termed differential correlation, described in S1 Methods and in S5 Fig. In short, this index amounts how much positive correlations express more strongly the difference between correct and incorrect trials than negative ones during a specific time period (Eqs (S18) and (S19)). According to our hypothesis, this effect is only significant for third-order correlations and early trial stages (S5 Fig, central row, trial initiation and stimulus periods).
However, trial-to-trial correlations could be also induced by stimulus repetition in the preceding trial and thus might not be indicative of stimulus predictability per se. To control for this possibility, we repeated the correlation analysis but on the residual of the optimally regressed units rates with the preceding trial; exactly as we did for the decoding analysis in S3  Fig (Eq (2)). Fig 7 and S4B Fig show these partial correlation coefficients (correlation coefficients of the residuals), which render similar results. That is, negative three-way correlations for correct-choice trials are still significantly stronger when the current stimulus is unpredictable (Fig 7C filled bars). This distinction is inconclusive for partial three-way positive correlations (Fig 7C, unfilled bars), and again for partial pairwise correlations (S4B Fig). To conclude, the ensemble state during correct-choice trials is robust when the optimal choice can be predicted from the previous trial outcome; it temporarily destabilizes otherwise. This state emerges from up to three-way interactions within OFC networks consistently over trials, and it is not observed for incorrect choices.

Discussion
The representation of behavioral choices in frontal areas variability is currently the focus of a debate (e.g., [9,30,35,55,62,63]). In this study, we find that correct choices that can be predicted from the previous trial outcome are associated with a reliable representation over tens of trials in lOFC (Figs 2-5, S2 and S3 Figs). Constellations of triplets of neurons positively correlated (Figs 1, 6 and 7, S1, S4 and S5 Figs) seem to furnish such a robust representation of correctly predicted choice outcomes. However, the representation of unsuccessful or unpredictable choice outcomes is weakly cohesive and randomly wanders in the state space.
Recent evidence suggests that the cognitive map of the task-space provided by OFC units [64] enable them to encode a compact combination of past-trial state variables, which can predict the upcoming decision [41]. However, the coding of behavioral performance in OFC ensembles remains controversial [41,45,50,57,64]. Our results suggest that correlations within OFC constellations involving up to three units, enhance the reliability of the decoding of the behavioral responses, in accordance with recent findings in the PFC [3]. Leveraging the continuous activity over tens of trials, and harnessing the finer temporal synchrony provided by multi-unit recordings [2], we found that such three-way interactions enable lOFC ensembles to encode predictable outcomes.

Shared variability dynamics during cognitive processing
The relationship between variability decline and the cognitive state has been extensively studied using pairwise noise correlations; that is, how much of such variability across trials is shared across units. Weakly correlated trial-to-trial variability has been often considered to be beneficial for behavioral performance [14,15,33,34]. Attenuation in noise correlations was typically associated with top-down effects of selective attention [32]; which reduces task-irrelevant variability in the visual cortex [34] and thus increases the signal-to-noise ratio. This effect led to suggest that low correlations may contribute to the effective processing of cognitive decisions, as also proposed by various models [9,65].
However, recent results challenge these classical views, suggesting instead that attention reshapes the stimulus representation in earlier stages such that they are more effectively decoded by downstream neurons, to ultimately guide decision-making [66]. This optimization can result in opposite effects of attention on shared variability within and between processing areas. For instance, attention would result in an increase of pairwise correlated variability between the middle-temporal area and the superior colliculus for effective visualmotor processing [66].
Moreover, despite the vast literature in noise correlations, it was recently shown that lower (higher) mean pairwise correlations do not necessarily imply higher (lower) signal-to-noise ratios, and thus they are not guaranteed to affect behavior as often suggested [39]. Indeed, the dynamic pattern of noise correlations and not their absolute magnitude is often the determinant factor which modulates the information encoded by ensembles [3,5,32,38,40,59].
This was found to be the case for correlations that are similar over all periods of behavioral interest during the task [3], as we also observed in the present work for triple-wise correlations (Fig 1D). In our study, slightly correlated variability is suggestive of a systematic three-way interaction pattern occurring during trials in which the choice outcome can be predicted. Consistently, although relatively weak, third-order positive correlations are the strongest of all and contribute to equip the most informative state space [3]; whilst infrequent triplet negative correlations seem to undermine the decoding of unpredictable trial outcomes (Figs 6 and 7, S4 and S5 Figs).
By contrast, mean pairwise correlations do not suffice to discriminate between correct and incorrect choices (Fig 1B left and S4 Fig), nor are informative enough to build up the state space in which correct outcomes are stable (order 2 in Fig 4 top, inset panels). This result also seems consistent with previous reports studying the roles of weak higher-order correlations in cortical ensembles (e.g., [67][68][69]). It is also in line with Nogueira et al. recent study [39] discussed before, in which pairwise correlations computed on multiple behavioral tasks and brain areas are not reliable surrogates for the behavioral performance [39]. As previously stressed in [38], only the variability along the encoding axis modulates information coding, not necessarily the mean magnitude of pairwise correlations.

Neuronal correlations and state-space decoding
Stable subspaces of the space spanned by the ensemble activity, encoding components of the task-space, have been identified in primate PFC (for instance in [24]) and in rodent anterior cingulate cortex [37,56] during working memory. In our study, a linear combination of ensemble unit activity and their correlations up to a third order define the correct-choice robust subspace, indicating a stable weighing of the connectivity among units over time.
The identification of this functional interaction pattern between units could be addressed in future studies leveraging recent dynamic generalized linear models, which implement adaptive Granger causality analysis for spike trains [70]. In addition, within-trial non-stationarity in high-order interactions could be captured by specifically designed state-space approaches [71]. This robust, fixed interaction pattern may facilitate a subsequent readout of the information downstream as proposed in [24]. This hypothesis is also reminiscent of a recent aforementioned study [66], suggesting that optimal processing could entail remolding stimulus representations with respect to fixed readout dimensions at subsequent processing stages.
Other computationally efficient methods to estimate high-order correlations for larger neuronal populations are specialized information geometry frameworks (e.g., [72][73][74]); whilst a recent approach demonstrates that third-order correlations are capable of inducing synergy/ redundancy states in an information-theoretic sense [75]. More broadly, these informationtheoretic approaches are advantageous to assess the importance of orders higher than three, which is computationally challenging for the direct calculation of the coefficients (Eq (1) and (S17)), and would require larger ensembles.
In this study, we propose that triple-wise correlations switching their sign irregularly during the trial (Figs 6 and 7) hinder decoding of unpredictable choice outcomes (Fig 5A left). All in all, absolute correlation coefficients are low (Fig 1B), indicating sparse spiking and the presence of frequent silence periods. Simultaneous silences in spike trains can underlie the observed higher-order correlations, as was recently demonstrated in cultured hippocampal slices [69]. Moreover, such simultaneous silences explain the alternating signs at successive higher interactions orders estimated in maximum entropy models. Interestingly, the authors [69] observed how higher-order (θ > 2) interactions vanish when inhibition is blocked. In this scenario, pairwise correlations suffice to account for the probability distribution of spike patterns [69], as reported in retina [76]. The sparse firing and sign-alternating triplet interactions we observed here may also stem from balanced inhibition [65]; whilst higher orders are too weak in these small ensembles to play a significant role in decoding.
Our results are also in line with a recent report which shows evidence of long-term memory representations of cue-reward associations in rodent OFC neurons, which are stable for days and even after changing such associations [51]. The cognitive map of the task-space that we previously found in single-units and small ensembles [41], seemed to be more robust when the animal follows the optimal lose-switch strategy in this task [41]. Thus, we speculate that task-space representations in OFC might be modulated by a higher-order representation of the behavioral outcomes operating at a longer timescale (Figs 2 and 3); which resembles findings in premotor areas [14]. The downstream routing of this information to optimize behavior remains unknown, and could be investigated using in vivo two-photon calcium imaging [51].
Intriguingly, the weak correlations observed (Fig 1B) are also suggestive of a relatively irregular, near-asynchronous dynamics, typically associated with a wide dynamical repertoire (e.g., [65,77]). In a more conjectural vein, the successful processing of the task by small lOFC ensembles could map to long-lasting metastable states over tens of trials [78], which gain stability when the choice is deterministic and behaviorally relevant, and destabilize otherwise. Such a metastable portrait might subserve the enhancement of detailed coding of task-epoch variables that enable the animal to effectively predict the upcoming choice. Electrophysiology. Data were obtained from three Wistar rats (250 − 350 g) that were chronically implanted (when the psychometric curve reached the 70% in the task outlined below) with tetrodes in their lateral orbitofrontal (lOFC) cortex and trained as described in detail in [41]. We recorded n = 137 single-units and ensembles from three animals. Single-unit and small ensembles (up to three units) were analyzed in a previous work [41]. In the present study, we specifically focus on ensembles containing a minimum of three units (82 units in total), and their corresponding local field potential (LFP). The LFP was downsampled to 250 Hz to filter out the typical spiking activity component (see S1 Methods).

Experiments
Spike trains were convolved with Gaussian functions to obtain statistically reliable estimates of spike densities. The value of the optimal bandwidth for each neuron (variance of the Gaussian smoothing function) was optimized using a multivariate kernel density estimation approach as recommended in [56,79]. Spike density estimates were then binned at 40 ms. The reason for selecting this bin width is that at least 99% of bins for each unit contained up to a single spike, enabling us to compute standard spike train correlation analyses for most of the bins. Results reported were statistically invariant by using bin sizes up to 80 ms which was then used to reduce the computational cost of the decoding analysis (Data analysis section). Lowresponsive units (< 2 Hz) are considered in this study, since the focus is on analyzing the dynamic pattern of correlations during the trial irrespective of their mean magnitude or the mean firing rate [38,39].
Behavior. The experiment was designed to identify deterministic temporal patterns across consecutive trials as a function of animal performance. Further details of the behavioral task can be found in [41]. Briefly, rats were trained in an auditory time-interval categorization task. Trials were self-initiated by the animals by nose poking in a central slot (Fig 1A), which triggered a tone of 50 ms duration (delivered through earphones) after a randomly drawn delay uniformly discretely distributed from 50 to 300 ms in steps of 50 ms. After an inter-tone interval (ITI), randomly drawn (except for incorrect trials, see next section), a second pure tone of the same duration and frequency was presented. The task is to classify the ITI, as short (50, 100, 150 or 200 ms) or long (350, 400, 450 or 500 ms).
A reward was delivered when the animal poked the left socket for short ITI and the right for long ITI; and was available for 3 s before the beginning of the next trial. False alarms (poking the incorrect side) and withdrawals before the second tone were followed by white noise after a 3-s delay (WAV-file, 0.5 s, 80 dB sound pressure level). After an incorrect trial, the ITI of the previous trial was repeated (see schematics in Fig 5A). More details can be found in [41]. Only trials in which animals engaged in the task were considered for the study.
For variability and correlation analyses, three periods of interest within the first 1000 ms of the trial were considered [41]: (1) trial-initiation, which starts with the rat nose-poking into the central socket and ends 150 ms later; (2) the stimulus offset period, which starts 100 ms before the second tone onset and it finishes 50 ms later with its offset; and (3) the choice period, corresponding to the 150 ms time window starting from the rat nose-poking into one of the two lateral sockets.
For decoding analyses (see Data analysis section), we focused on 160 consecutive entire trials of the task for visualization purposes (Figs 2 and 3). Behavioral responses associated with a trial were summarized in c = 4 non-overlapping categories, namely "correct response" (the animal pokes the correct lever and successfully retrieves the reward), "missed responses and false alarms" (this category encompasses all types of non-premature yet incorrect responses: the animal stays idle and comes back to the central socket without choosing a side, or moves away from it towards the wrong side), "premature central response" (the animal moves away from the central socket too early, before the presentation of second tone) and "premature lateral response" (the animal pokes any side again too early).
This categorization was chosen to provide a similar number of trials for each behavioral response. It also enabled us to visualize the decoding process in a three-dimensional space using discriminant analysis, which projects multi-unit ensemble rates to a (c − 1)-dimensional subspace (number of distinct behavioral categories-1) to perform the decoding (see next section). For such decoding analysis, the discrepancy in the number of bins per behavioral category was kept to � 10% by discarding excess bins, enabling us to hypothesize a similar prior probability of each choice overall. In three of the ensembles analyzed, both premature responses were joined into a single category in order to have a balanced number of bins per category for a more reliable decoding, thus c = 3 for them.
In the comparative analyses of trials after correct and after incorrect behavioral responses (termed Unpredictable and Predictable respectively in Figs 5-7, S4 and S5 Figs), trials are not consecutive. Thus, to establish a faithful comparison among correct and other trial choices (termed incorrect here), only some of the incorrect trials were considered; specifically those which occurred immediately after (or as near as possible) correct-choice trials, such that the number of trials of both types is balanced. In this way, both types of trials are in comparable temporal vicinities.

Data analysis
Shared neuronal variability: Statistical testing. Shared variability was analyzed by trialaveraged correlations between combinations of n distinct units within each ensemble, where trial-averages were specific to each response category. Thus, we may loosely refer to them as "noise decision correlations" (hereafter simply "correlations"), since the behavioral response is fixed, even though the stimulus is typically randomized from trial to trial. Figs 1-3, 6 and 7, S1, S4 and S5 Figs, show means and standard deviations of these correlations across units and ensembles.
The Pearson correlation among up to θ units x i in a given ensemble, 1 � i � n, of rates x 1 (t, T), x 2 (t, T), . . ., x n (t, T) is defined next (Eq (1)). In this θ-order correlation coefficient, i indicates the neuron number, t is the time bin within each trial and T is the trial number, whilst < x i (t) > is the trial-average: θ = 2: TÞÀ < x 1 ðtÞ >Þ � ðx 2 ðt; TÞÀ < x 2 ðtÞ >Þ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi P T ðx 1 ðt; TÞÀ < x 1 ðtÞ >Þ 2 � P T ðx 2 ðt; TÞÀ < x 2 ðtÞ >Þ 2 2 q ; . . . θ � n: ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi ffi ffi ffi such that m i are any natural numbers that verify: where the denominator of the coefficients for a correlation order θ odd is real in this study; this definition ensures that 0 � Corr(x 1 , x 2 , . . ., x n ; θ)(t) � 1. Noticeably, the θ-order correlation must take into account all possible products of the neurons i = 1, . . ., n up to order θ � n. For instance, in the same vein that the numerator of the pairwise Pearson correlation coefficient between units i = 1, 2 for zero-mean rates, contains x 1 (t, T) � x 2 (t, T) summands; the θ = 3 numerator of one of the coefficients among units i = 1, 2, 3 contains terms such as x 1 (t, T) � x 2 (t, T) � x 3 (t, T), whilst another 3 rd -order coefficient contains summands like x 2 1 ðt; TÞ � x 2 ðt; TÞ etc. Further details are discussed in S1 Methods. The kernel-discriminant described below is a classifier which will take into consideration such θ-order correlations for decoding; as discussed in the next section and demonstrated in S1 Methods.
The partial correlation coefficient Corrðx 1 ;x 2 ; . . . ;x n ; yÞ (Fig 7 and S4B Fig) is defined as in Eq (1), but variables are instead the residuals of a linear regression adjustment with the preceding trial,x where x i (t, T − 1) is the rate of the i th unit at the same time bin t in the immediately preceding trial T − 1, and b 0 , b 1 are linear regression coefficients optimized by least-squares across trials. Nonparametric tests were used when normality assumption was rejected according to conservative Lilliefors tests at p = 0.01, as further detailed in the Results section. For the decoding analysis (next section), permutation tests were performed by generating n = 1000 bootstraps, providing a one-tailed significance level p = 0.001. Bootstraps were designed by shuffling the bin order for each neuron within each single trial (orange boxplots in Figs 3-5 and S2 Fig). Whiskers in the boxplots show outliers, defined here as ±2.7 � SD, orange triangle markers indicate the 1% percentile.
Decoding algorithm. Analyses shown in Figs 2-5, S2 and S3 Figs were based on a robust standard decoder, termed a regularized kernel-Fisher discriminant [10,37,56,80,81] (briefly, kernel-discriminant). In this section, the trial index T will be omitted for simplicity, that is, x (t) � x(t, T); where, like in the previous section, t is time bin index within a specific trial.
As intuitively outlined in Results, this decoder is simply a standard linear discriminant operating in a state space spanned by firing rates plus products among them (see Fig 2A, right panel schematics). The advantage over a standard linear discriminant is in improving its decoding capability by considering θ-order correlations, as explained in the next section (see also Fig 4 top panels for a comparison with a linear discriminant, order 1 lines). See for example [10,37,56] for comprehensive descriptions of this standard approach in machine learning, but adapted for ensemble recording analyses. In addition, S1 Methods show a detailed description of the decoder and intuitive parallels with classical discriminant analyses.
Like the conventional linear discriminant, under optimal conditions (next section), the kernel-discriminant provides the Gaussian conditional probability P(x(t)|y) of an (n × 1) observation vector x(t) = [x 1 (t), x 2 (t), . . ., x n (t)] T , spanned by the firing-rates in an n-units ensemble. y 2 {1, . . ., c} represents the behavioral category index, where c is the maximum number of different behavioral categories as defined in the previous section (correct responses, missed responses and false alarms, and premature central/lateral responses, see full description in Experiments).
As discussed earlier, for θ = 1, the kernel-discriminant is fully equivalent to a classical discriminant analysis (Fig 2A, left panel); whilst for θ > 1 it functions in a state space explicitly accounting for all possible interactions among units up to order θ (Fig 2A, center and right).
Other standard approaches such as kernel-logistic regression, relevance support-vector machines [61], or more recently deep learners [82] can have comparable capabilities. However, as discussed in the following sections and in S1 Methods, this classifier straightforwardly provides a direct link with (neuronal) correlations, and an intuitive visualization of the effect of correlations for decoding (Figs 2 and 3, see also [10,56]). In contrast to other approaches, it is based on the optimization of a single parameter λ (Eq (3) below). Thus, it was the choice for this study for its interpretability and robustness.
The kernel-discriminant can be casted as a constrained optimization problem [61] (S1 Methods), in which the goal is to obtain c − 1 nonzero vectors α of dimension (l × 1). In what follows, l ¼ P c y¼1 l y is the number of observation vectors on each estimation (training) dataset (see Reliability over trials analysis section for dataset definitions), where l y are the observations associated with the y th behavioral category. The optimization program is: α : min α ðα T Nα þ l � α T KαÞ; subject to: where terms in such an optimization program and the probabilistic interpretation are described below. Loss term. The first summand in Eq (3) plus the constraint is analogous to a loss function (S1 Methods). N is a (l × l) matrix defined as: where 1 y 0 is a (l × 1) vector whose entries are ones if the observation x(t) occurs during a trial in which the behavioral response is y = y 0 and zero otherwise. K is termed the gram matrix (l × l), whose entries K t,t 0 for two input observations x(t), x(t 0 ) are provided by the following function commonly termed inhomogeneous multinomial kernel [83]: where θ is the maximum correlation order represented in the space, as will be discussed below.
The output of the discriminant is the next (l × (c − 1)) matrix F of entries F ij : In Eq (6), A is the (l × (c − 1)) matrix whose columns A j = α are the vectors α of dimension (l × 1) providing subsequent minimums to the optimization program (Eq (3)) i.e., the j = 1, . . ., c − 1 columns of F are F j = Kα for each α. This program can be solved via least squares or, for moderate sizes of the gram matrices K, directly as an eigenvalue problem in which α are the successive generalized eigenvectors of ð P c y¼1 ðμ y À mÞ � ðμ y À mÞ T Þα ¼ n � ðN þ l � KÞα [61], sorted in decreasing order of the eigenvalues ν (see details in S1 Methods). Rows of F, termed F i = f(x(t)) are (1 × (c − 1)) projections for the single observation x(t) and all its interactions up to order θ (see next section) onto an optimally discriminant subspace of dimension c − 1 (Figs 2 S2 and S3 Figs). This intuitive interpretation is further discussed in S1 Methods, adapted from well-known results and previous studies [37,61]. All in all, F row vectors present minimal within-category variance whilst maximizing the distance between mean vectors per category, like a conventional discriminant [84], to facilitate decoding, as discussed in the next section.
Regularization and probabilistic interpretation. The second summand in Eq (3) is the Tikonov (L2) regularization term to avoid over-fitting, and it is the same as the one used in support vector machines [61], weighed by the penalization constant λ optimized by cross-validation per ensemble (see Reliability over trials analysis section).
The discriminant criterion is a Bayes-optimal decoding choice over other classifiers if the data projected in the output subspace is normally distributed [84]. That is, if fðxðtÞÞ � N cÀ 1 , where N cÀ 1 is a c − 1 multivariate normal distribution. This is typically the case in our data (Lilliefors non-parametric test, p < 0.045 for all ensembles but a small ensemble n = 5 units, see S1 Methods for further discussion). Thus, a reliable estimation of the probability of an observation x(t) to be classified in the category y = y 0 is where < f> y 0 and S y 0 are, respectively, the mean vector and the (c − 1 × c − 1) covariance matrix of the projected data vector f computed for all the l 0 observations associated with the behavioral category y 0 . Thus, the predicted class is the one that maximizes Eq (7), since equal class-priors were assumed (Experiments section). State spaces and correlations. The decoder operates in a range of state spaces (see schematics in Fig 2A). A θ-order state space is defined as the ensemble multi-unit space spanned by the units rates as dimensions, further augmented by axes representing constellations of units interacting up to a specific order θ (Fig 2A) [37]. The j th component of a vector at the time bin t, ϕ(t) on a θ th -order space constructed from a n-units ensemble is defined as: θ = 1: where j(i 0 ,.., i n ) is the j th entry of the ϕ(θ, x(t)) vector uniquely associated with specific values for the indexes i 0 ,.., i n , and the binomial coefficient is The dimensionality of such spaces is D ¼ nþy n À � À 1; which in this study spans from D = n to a much higher dimensionality. For instance, in an ensemble of n = 9 units and θ = 3, D ¼ 12 9 À � À 1 ¼ 1319. Such state spaces are typically sparse since D ≳ l dataset patterns (see next section). This leads to computational challenges for any classifier operating is such high-dimensional spaces.
Thus, instead of classifying directly in spaces spanned by ϕ(θ, x(t)) high-dimensional vectors, the decoder used (Eq (3)) alleviates this drawback by recasting the optimization problem in terms of the (l × l) symmetric gram matrix, whose entries are provided by Eq (5), which is computationally tractable here. This step is the well-known kernelization process of a classifier, which relies on the fact that only the D × D covariance matrix and thus products of input vectors x(t) are typically needed for their classification, whereas explicit representations of the individual vectors are unnecessary [61].
Correlation coefficients (Eq (1)) are directly linked with the state space. For this discussion, we make explicit again the trial index T for a given observation vector x(t, T), and start by redefining the state-space vectors as: f�ðy; xðt; TÞÞg j ¼ f�ðy; xðt; TÞÞg j ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi subject to the same constraints indicated in Eq (8). Thus, for z-scored data, the correlation coefficient between any pair of different units a 1 6 ¼ a 2 in an ensemble (Eq (1)) is simply Corrðx a 1 ; x a 2 ; y ¼ 2ÞðtÞ ¼ P T f�ðy ¼ 2; xðt; TÞÞg j ;where j ¼ jði a 1 ; i a 2 Þ; a 1 ; a 2 > 0 is a specific entry of the� vector (Eq (10)) corresponding to i a 1 ¼ i a 2 ¼ 1. The average correlation for all possible pairs of different units in an ensemble of n units at time bin t within a trial can be expressed as: where a 1 6 ¼ a 2 and nþ1 nÀ 1 À � À n is the number of different correlation coefficients per ensemble. The S1 Methods demonstrate this relationship between mean projections on space coordinates and correlations for an arbitrary higher-order θ (Eqs (S17a) and (S17b)).
Reliability over trials analysis. Columns α of the matrix A are further orthogonalized using the Gram-Schmidt algorithm to depict an intuitive portrait of the data projected in a Euclidean space. Thus, the projected data matrix (Eq (6)) isF ¼ K �Â, whereÂ columns are orthogonal [10]. This is the view displayed on Figs 2A and 3A. Orthogonalization provides a faithful representation of the data projected in an optimized subspace.
Consecutive blocks B k of 40 successive trials each (hereafter termed blocks) were used as estimation and test data; this was the lowest number of test trials to effectively decode the choice outcome. To quantify the reliability of the decoded representation through future trials, causal cross-validation was defined as follows: first, the optimization program (Eq (3)) was computed for a given block B k . Second, solution vectors (columns of A) were held fixed. Third and finally, the discriminant output f was computed for the following blocks B k+1 , . . ., B 4 (rendering a six-fold-ahead cross-validation), that is, the test data is always set in the future.
This cross-validation strategy was specifically devised to test the stability of the decoded choices for the longest possible period, even on non-consecutive validation blocks up to 160 trials away. This exigent setting, in which estimation and validation sets are of the same size and not adjacent, is particularly challenging for any decoder [84].
To evaluate the reliability over future trials of the representation obtained in the training data set, two simple indexes were computed: The conventional classification error per behavioral category (decoding error, DE) and the divergent trajectories index (DT) [56]. We term DE as the fraction of misclassified trials on the test data, averaged over all cross-validation blocks per choice outcome category. Penalization constants λ were fixed to yield minimum DEs from a uniform grid of up to 1000 sampled values dawn from [0, 0.5] [56].
Divergent trajectories. For computing DT, we leveraged the decoding probabilities (Eq (7)) to estimate the distance of each vector x(t) with respect to the centroid of the category; this spatial relationship is not straightforwardly provided by other approaches [84].
Consider the matrix R y 0 of size (n × l 0 ) associated with a category y = y 0 , containing the sequence of consecutive firing rate vectors x(t) of an n-units ensemble during a single trial consisting of l 0 bins (see an example in Fig 3A), A divergent trajectoryR y 0 is defined as a matrix R y 0 in which its last r + 1 > 0 consecutive column vectors, x(t = t l 0 −r ), . . ., xðt ¼ t l 0 Þ are misclassified. That is, at least the vector xðt ¼ t l 0 Þ is misclassified, and, potentially, any immediately previous vectors, or in other words: such that for an index r, 0 � r < l 0 it holds that: max y ðPðxðtÞjyÞÞ 6 ¼ y 0 ; 8t 2 ft l 0 À r ; . . . ; t l 0 À 1 ; t l 0 g; ð13Þ where DT is also defined as the mean fraction of divergent trajectoriesR y 0 over test sets; that is, in upcoming blocks of trials. This is an indication that such trajectory tends to escape the category boundaries defined by the discriminant, as further discussed in [10]. In Fig 5 analyses, trials are not consecutive and thus only DEs are computed. Analyses were performed in a 16-core Hewlett-Packard Z440 workstation using Matlab Parallel Computing Toolbox 2018 (Matworks inc.). Demonstrations and further implementation details are provided in S1 Methods. The j th differential correlation coefficient Δδ predictable(unpredictable) (j; θ) is the difference between positive (δ + ) and negative (δ − ) deltas; where δ + (−) consist of the difference between positive (negative) correlations during correct and incorrect trials, aggregated during a specific time period (see Eq (S18)). The figure shows mean differential correlation coefficients for θ = 2 (top row), θ = 3 (middle row) and θ = 4 (bottom row), bars are SEM. Consistently with Fig 6 ((A) and (B), right panels), the mean Δδ predictable (θ = 3) is significantly stronger than Δδ unpredictable (θ = 3) during early stages of the trial; especially before the upcoming stimulus becomes available (trial initiation period, left panel in middle row, Wilcoxon rank-sum W = 122311, p = 0.015, n = 359; stimulus period, p = 0.023). This effect does not reach significance neither for pairwise (top row) nor for quadruple-wise correlations (bottom row, dotted lines show the third order correlations, Δδ(θ = 3), for comparison