Exploring the limits of learning: Segregation of information integration and response selection is required for learning a serial reversal task

Animals are proposed to learn the latent rules governing their environment in order to maximize their chances of survival. However, rules may change without notice, forcing animals to keep a memory of which one is currently at work. Rule switching can lead to situations in which the same stimulus/response pairing is positively and negatively rewarded in the long run, depending on variables that are not accessible to the animal. This fact raises the question of how neural systems are capable of reinforcement learning in environments where the reinforcement is inconsistent. Here we address this issue by asking which aspects of connectivity, neural excitability and synaptic plasticity are key for a very general, stochastic spiking neural network model to solve a task in which rules change without being cued, taking the serial reversal task (SRT) as paradigm. Contrary to what could be expected, we found strong limitations for biologically plausible networks to solve the SRT. In particular, we proved that no network of neurons can learn an SRT if a single neural population both integrates stimulus information and is responsible for choosing the behavioural response. This limitation is independent of the number of neurons, neuronal dynamics or plasticity rules, and arises from the fact that plasticity is locally computed at each synapse, and that synaptic changes and neuronal activity are mutually dependent processes. We propose and characterize a spiking neural network model that solves the SRT, which relies on separating the functions of stimulus integration and response selection. The model suggests that experimental efforts to understand neural function should focus on the characterization of neural circuits according to their connectivity, neural dynamics, and the degree of modulation of synaptic plasticity with reward.

Introduction
Natural environments are complex places in which animals strive to survive, with hidden variables and stochastic factors such that the information available at any moment is partial, and it must be sampled at several time points and integrated. What is more, the rules governing the environment might change with time, leading to conflicting information. For example, an animal might learn how and where to seek food, but if the feeding place changes cyclically, or the means of obtaining food change, the animal has to switch strategies accordingly [1,2]. In this case, no single strategy suffices, and several strategies must be learned. More importantly, the value of a response depends not only on the current scenario, but also on the history of events, for example, the history of recent success of a given strategy. Therefore, it is relevant to study tasks in which rules might change over time in such a way that the reinforcement of stimulus/response pairings is inconsistent, i.e. Inconsistent-Reinforcement Tasks (IRTs). In particular, the Serial Reversal Task (SRT) is an IRT in which two rules alternate over time, requiring the animal to keep track of previous events in order to maximize reward [3,4]. With enough training, animals learn to adapt their behaviour as soon as a reversal occurs. However, learning an SRT through a neural network model can be problematic: since each stimulus/response pairing is positively and negatively reinforced in the long run, learning of one rule may lead to the erasure of information regarding other rules, constituting a case of catastrophic forgetting [5]. On the other hand, although brain regions like the prefrontal cortex [6,7] and the striatum [8,9] have been found necessary for learning the SRT, the precise neural mechanisms involved are not well understood.
The goal of this work is to find the essential properties required by biologically plausible neural networks to solve an IRT, taking the SRT as paradigm. We focus on stochastic spiking neural networks (SSNN), a very general kind of neural network model that has been employed to explain how key features of neural circuits, like excitatory-inhibitory balance [10] and spike timing-dependent plasticity (STDP) [11], can lead to Bayesian inference [12] and reinforcement learning [13]. For a very general family of SSNNs, we show analytically that strong limitations to learning the SRT emerge when the functions of integration of stimuli information and response selection are conducted by the same neural population. We propose a model that is able to learn the SRT and discuss the implications of the results in relation to the neural mechanisms of decision-making.

Results
We will study the characteristics of an agent controlled by a biologically plausible neural network that learns to solve an SRT, conforming to what we will define as the hypothesis of functionality by learning, which states that the set of configurations that gives functionality is a small subset of the set of initial configurations. In this way, functionality is acquired by a learning mechanism that always leads the system from any random initial condition to one of the functional configurations. The hypothesis implies that the system is not designed to solve a given task from the start.
An SRT is a discrimination task in which the mapping between the stimulus and the correct response is reversed after a given (random) number of trials (Fig 1a). One out of two possible cue stimuli (s 1 or s 2 ) is presented to the agent. During cue presentation the agent has to execute one out of two possible responses (R 1 or R 2 ) in order to get a reward. Which response is correct depends on the current rule (rule L 1 : s 1 → R 1 , s 2 → R 2 ; rule L 2 : s 1 → R 2 , s 2 → R 1 ). A reward stimulus is shown after cue presentation: r 1 for correct responses or r 0 for incorrect ones. A rule stays in force until a switch of rules occurs at random. Switching occurs with low probability, to ensure that a considerable number of consecutive trials are presented under the same rule. The structure of the task implies that any agent that follows only one stimulus/response mapping as strategy will fail to get reward in half of the trials. Moreover, information provided by the stimuli is useless unless the agent is capable of retaining information about the current rule. Optimal performance can be achieved by adhering to a successful strategy and switching strategies when the current one is no longer successful.
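The two claims that close the task description can be checked with a short simulation. The following is a minimal sketch, not the network model studied here: the value `p_switch = 0.05`, the 0/1 coding of stimuli, rules and responses, and the class names `FixedMapping` and `WinStayLoseSwitch` are assumptions made only for illustration.

```python
import random

def correct_response(stimulus, rule):
    """Rule 0 (L1): s1->R1, s2->R2. Rule 1 (L2): s1->R2, s2->R1. Indices are 0/1."""
    return stimulus ^ rule

def run_srt(policy, n_trials=20000, p_switch=0.05, seed=0):
    """Run the task and return the fraction of rewarded trials.

    `policy` must provide act(stimulus) -> response and update(rewarded).
    p_switch is an assumed value; the text only says that it is low.
    """
    rng = random.Random(seed)
    rule = 0
    rewarded_trials = 0
    for _ in range(n_trials):
        if rng.random() < p_switch:      # uncued reversal
            rule ^= 1
        stimulus = rng.randrange(2)
        response = policy.act(stimulus)
        rewarded = response == correct_response(stimulus, rule)
        policy.update(rewarded)
        rewarded_trials += rewarded
    return rewarded_trials / n_trials

class FixedMapping:
    """Always applies the L1 mapping and ignores the reward."""
    def act(self, stimulus):
        return correct_response(stimulus, 0)
    def update(self, rewarded):
        pass

class WinStayLoseSwitch:
    """Keeps a one-bit belief about the current rule and flips it after an error."""
    def __init__(self):
        self.believed_rule = 0
    def act(self, stimulus):
        return correct_response(stimulus, self.believed_rule)
    def update(self, rewarded):
        if not rewarded:
            self.believed_rule ^= 1

fixed_score = run_srt(FixedMapping())
wsls_score = run_srt(WinStayLoseSwitch())
print(f"fixed mapping: {fixed_score:.2f}, win-stay/lose-switch: {wsls_score:.2f}")
```

In this sketch the fixed mapping is rewarded only while rule L 1 is in force, so its score stays near one half, whereas the agent that keeps one bit of memory about the current rule errs only on the trial that follows a reversal.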
We will consider an agent that is controlled by a spiking stochastic neural network composed of a sensory module Y and an integration/decision module K (Fig 1b). Neurons in module Y code the sensory stimuli and project to module K, while neurons in module K project to the response neurons and to other K neurons. One half of the K population projects to response neuron R 1 (the K R 1 subset of module K), the other half to response neuron R 2 (the K R 2 subset of module K). We assume that the firing of any neuron within a K R group is enough to trigger the corresponding behavioural response. Therefore, the K module integrates sensory information together with information from within the network, and at the same time it defines the response that is going to be executed.

Fig 1. Serial Reversal protocol and simple network connectivity. (a) Each trial is composed of a cue stimulus presentation, during which the behavioural response must be executed, and a reward stimulus presentation. Correct responses depend on the stimulus presented and the current rule, which changes with probability p switch . (b) Diagram representing the general connectivity of the simple network. Neurons in module Y codify both cue and reward stimuli, and project to the K module. K neurons connect with each other and project to one of the two response neurons. Therefore, K neurons can be sorted in two halves depending on whether they project to neuron R 1 (K R 1 neurons) or neuron R 2 (K R 2 neurons). Firing of any K neuron elicits its target R neuron to fire. Connections between module K and module R are assumed to be hardwired prior to any learning, such that firing in module K completely defines the executed response. (c) An example sequence of 3 trials of the SRT for the model depicted in (b) prior to learning, with a minimal K module composed of 8 neurons in which K R 1 = {n 1 , n 2 , n 3 , n 4 } and K R 2 = {n 5 , n 6 , n 7 , n 8 }. The current rule is L 1 .
In what follows we will show that the network sketched in Fig 1b (referred to as simple network) is incapable of learning to solve the SRT without contradicting the hypothesis of functionality by learning. First we will consider a "reduced" example of the simple network of Fig 1b that nevertheless exposes the nature of the problem (see the Methods section for a proof regarding both the reduced network and a general version of the simple network).
The firing state of module K will be represented by a vector n(t), where each element n i (t) ∊ {0, 1} represents the firing state of the ith K neuron. Similarly, we define a vector y(t) where each element y i (t) ∊ {0, 1} represents the firing state of neuron y i . As a shortcut, we will use p (n i (t) = 1) and p(n i (t)) as equivalent expressions that represent the probability of neuron i of being active at time t. The same holds for p(y i (t) = 1) and p(y i (t)).
We will consider a network with a Y module composed of 4 neurons such that each stimulus is perfectly codified by one specific neuron, i.e. (p(y i |S i ) = 1 and p(y i |S j ) = 0 ∀ i ≠ j) where S i is the ith element of S = (s 1 , s 2 , r 1 , r 0 ). Module K is composed of 8 neurons, which is the minimum number of neurons required to solve the SRT: one neuron for each stimulus (cue or reward) for each rule. Each trial T has two time points (t and t + 1), one for cue presentation and another for reward stimulus presentation. The K R 1 group comprises neurons from 1 to 4; K R 2 comprises neurons from 5 to 8. Only one Y neuron and one K neuron fire at each time point, and the decision is evaluated during cue presentation (Fig 1c). Then, each neuron in module K has a probability of firing that is given by Eq (1), where w stands for all synaptic weights in the network, w i is a vector containing the synaptic weights of afferent connections from all Y and K neurons onto the ith neuron in module K, and z is a vector containing the firing states of all Y and K neurons such that w ij is the synaptic weight of the jth neuron with firing state z j that projects to neuron i. The function f can be any function with the sole condition of being strictly increasing with w ij . Eq (1) endows the K module with characteristics of a "soft winner-take-all" circuit in which a highly excited neuron inhibits the other neurons in the module through a global inhibitory circuit [12]. Synaptic weights w ij change according to the local pre/postsynaptic activity and the reward stimuli r. The change Δw ij of a synaptic weight w ij is given by a function g:
$$\Delta w_{ij} = g\big(z_i(t),\, z_j(t-1),\, \mathrm{rew}\big) \qquad (2)$$
where z i (t) and z j (t − 1) are the firing states of the postsynaptic and presynaptic neurons, respectively, and rew is a function of the delivery of reward, such that rew = 1 during the cue and reward presentation for trials in which the response was correct, and rew = 0 otherwise.
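The two ingredients just described, a firing probability that increases with the afferent weights and a local plasticity rule that reads only the pre/postsynaptic states and rew, can be illustrated as follows. This is a sketch under stated assumptions, not the form used in the paper: the softmax expression is only one admissible choice of f, and the δ values in `DELTA` are arbitrary placeholders.

```python
import math

def firing_probabilities(weights, z):
    """Assumed soft winner-take-all: p_i = exp(u_i) / sum_k exp(u_k), u_i = sum_j w_ij * z_j."""
    drives = [sum(w_ij * z_j for w_ij, z_j in zip(w_i, z)) for w_i in weights]
    peak = max(drives)                           # subtract the max for numerical stability
    exps = [math.exp(u - peak) for u in drives]
    total = sum(exps)
    return [e / total for e in exps]

# One delta per (postsynaptic state, presynaptic state, rew) combination.
# The numbers are placeholders chosen for illustration only.
DELTA = {
    (1, 1, 1): +0.10, (1, 1, 0): -0.05,
    (1, 0, 1): -0.02, (1, 0, 0): 0.0,
    (0, 1, 1): -0.02, (0, 1, 0): 0.0,
    (0, 0, 1): 0.0,   (0, 0, 0): 0.0,
}

def g(post_now, pre_before, rew):
    """Local plasticity rule: a lookup of delta for one synapse."""
    return DELTA[(post_now, pre_before, rew)]

def apply_plasticity(weights, z_before, n_now, rew):
    """Return updated weights; each synapse uses only its own pre/post states and rew."""
    return [[w_ij + g(n_i, z_j, rew) for w_ij, z_j in zip(w_i, z_before)]
            for w_i, n_i in zip(weights, n_now)]

weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]     # 2 K neurons, 3 presynaptic neurons
z = [1, 0, 0]                                    # only presynaptic neuron 0 fired
before = firing_probabilities(weights, z)
weights = apply_plasticity(weights, z, n_now=[1, 0], rew=1)
after = firing_probabilities(weights, z)
print(before, after)
```

With these placeholder values, a rewarded coincidence of pre- and postsynaptic firing raises the weight of that synapse and therefore the probability that the same K neuron fires again for the same input, which is the only lever the network has for changing its transition probabilities.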
The g function can in principle be any function taking real values δ, with one δ for each combination of pre- and postsynaptic state and reward function rew. We assume that the neural network sketched in Fig 1b fulfils the Markov condition: the firing state of the system (i.e. which neuron is firing at time t) is only dependent on the firing state of the network at the previous time. This means that information about past events can only be carried in the current state of the system. In the case of an SRT, a cue stimulus should elicit either the response R 1 or R 2 , depending on the current rule. For example, s 1 should elicit response R 1 only during rule L 1 , or R 2 only during rule L 2 . This implies that s 1 should elicit a response from a subset of the K R 1 group when rule L 1 is current, or from the K R 2 group when rule L 2 is current. Since there is no explicit stimulus acting as a cue of the rule, the differential response of the K module to the same stimulus can be achieved only if the K neurons integrate inputs from the Y module together with inputs from the K module itself. This means that each stimulus must be coded by different groups of K neurons depending on the current rule. Then, the occurrence of an error should act as a pivot, leading the system to the set of states associated with the other strategy.
We can write the transition probability of the Markov chain that describes the dynamics of the whole system (network, stimuli and rules):
$$p[s(t+1), L(t+1), y(t+1), n(t+1), w(t+1) \mid s(t), L(t), y(t), n(t), w(t)] = p[n(t+1) \mid w(t), y(t), n(t)] \cdot p[y(t+1) \mid s(t+1)] \cdot p[s(t+1)] \cdot p[L(t+1)] \qquad (3)$$
Eq (3) is obtained by applying the chain rule of conditional probabilities, and using the fact that L is independent of stimulus, and that firing state n(t + 1) is independent of any other variable when conditioned to n(t), y(t) and w(t). Note that, since plasticity is assumed deterministic, Eq (3) is true if w(t + 1) is the resulting synaptic weight configuration of applying function g given (n(t), y(t), n(t + 1)). Any transition to a different synaptic weight configuration will have zero probability. Now we can find the transition probabilities that solve the SRT and study under what conditions a learning process is capable of reaching the solution. Fig 2 shows the directed graph for the transitions in the state space that solve the SRT. Under rule L 1 , neurons n 1 and n 2 fire with cue s 1 and cue s 2 respectively, while neuron n 3 codes r 1 and n 4 codes r 0 . For rule L 2 , neurons n 5 and n 6 fire with cue s 1 and cue s 2 , while neuron n 7 codes r 1 and n 8 codes r 0 . Neurons n 4 and n 8 are responsible for the strategy switching in the behaviour of the agent. Each time a transition between rules occurs, an error is committed, and the corresponding error neuron fires. Eq (1) tells us that the only way to change the transition probabilities is by adjusting the synaptic weights. Since the f function is strictly increasing with w ij , weights must be increased to favour a transition, or decreased to make a transition less probable.
The transition probabilities depicted in Fig 2 lead to specific transition probabilities for the firing state of each neuron in module K, conditioned to the firing state of their respective presynaptic neurons (Fig 3). For example, if stimulus s 1 is presented at time t and neuron n 3 fired at time t − 1, then the firing probability of neuron n 1 at time t should be high, while the firing probability of the other neurons should be low. This is because each combination of stimulus and rule must be coded by one specific neuron of the eight neurons that compose module K. This specificity in transition probabilities required to solve the SRT translates into a specificity in the solution weight matrix (Fig 4a), due to the strictly increasing relation between firing probability and synaptic weight expressed in Eq (1).
Based on the hypothesis of functionality by learning, we can say that the network learns to solve the SRT only if the plasticity function g leads the system to the solution weight matrix of Fig 4a, regardless of the initial conditions. However, the SRT is problematic in that there are no combinations of cue stimulus and behavioural response that are always rewarded. To understand this point, we can compare the SRT with another task, a discrimination task (DT), which comprises two stimuli and two responses as in the SRT. Moreover, there are two possible rules which define which stimulus/response pairing is rewarded, as shown in Fig 1a. The difference with the SRT lies in that in the DT each rule is cued by a specific stimulus (different from s 1 and s 2 ), and these rule cues are in turn codified by neurons in the Y module. In this way, the network has direct information about which rule is current at a given moment. This means that the set of stimulus/response pairings that leads to reward and the set that leads to no reward are disjoint sets. Fig 4b shows the synaptic weight matrix that allows the network to solve a DT in which the stimulus/response pairings {s 1 , L 1 , R 1 } and {s 2 , L 2 , R 2 } are always rewarded, while the other combinations are never rewarded. It can be seen from Fig 4b that, to solve the DT, it is enough to increment the synaptic weights of connections from the Y neurons that codify the cues to the K neurons that exert the correct response. This fact is what makes it possible to find a network that converges to the solution matrix for the DT by choosing a suitable g function, such as a Hebbian plasticity function that leads to increments in the synaptic weights only when a reward is obtained.
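The contrast between a cued and an uncued rule can be made explicit by enumerating what the network observes on rewarded and on unrewarded trials. The sketch below assumes, for illustration only, that both tasks use the stimulus/response mapping of Fig 1a and that the only difference is whether the rule is part of the observation; the function names are hypothetical.

```python
from itertools import product

STIMULI, RULES, RESPONSES = ("s1", "s2"), ("L1", "L2"), ("R1", "R2")

def is_rewarded(stimulus, rule, response):
    """Fig 1a mapping: L1 pairs s1-R1 and s2-R2; L2 pairs s1-R2 and s2-R1."""
    same_index = stimulus[1] == response[1]
    return same_index if rule == "L1" else not same_index

def outcome_sets(rule_is_observable):
    """Collect the observations that occur on rewarded and on unrewarded trials."""
    rewarded, unrewarded = set(), set()
    for s, rule, resp in product(STIMULI, RULES, RESPONSES):
        observation = (s, rule, resp) if rule_is_observable else (s, resp)
        (rewarded if is_rewarded(s, rule, resp) else unrewarded).add(observation)
    return rewarded, unrewarded

dt_rewarded, dt_unrewarded = outcome_sets(rule_is_observable=True)      # cued rule (DT)
srt_rewarded, srt_unrewarded = outcome_sets(rule_is_observable=False)   # uncued rule (SRT)

print(len(dt_rewarded & dt_unrewarded))     # 0: the two sets are disjoint
print(len(srt_rewarded & srt_unrewarded))   # 4: every (stimulus, response) pair is in both
```

When the rule is observable, reward partitions the observations into two disjoint sets, so a reward-gated rule can strengthen exactly the rewarded ones. When the rule is hidden, every stimulus/response pair appears in both sets, and reward alone cannot tell them apart.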
Fig 2. [...] have very low probability. Under these transition probabilities, for each rule there is a different set of neurons that codes each stimulus, and one neuron per rule that elicits the transition between rules when an error occurs. https://doi.org/10.1371/journal.pone.0186959.g002
However, in the case of the SRT there are no disjoint sets of stimulus/response pairings that separate reward from no reward. In fact, since we assume that the system is initiated without any information about how to solve the task, it can be seen that $p(r_1 \mid s_x, n_y) = p(r_0 \mid s_x, n_y) = \tfrac{1}{2} \;\; \forall x, y$, and p(n|r 1 ) = p(n|r 0 ) = p(n). In particular: [...] This allows us to write the average change ⟨Δw ij ⟩ for a given w ij : [...] where [...] The ⟨Δw ij ⟩ can be understood as the inner product between the vector $\vec{p}_{z_x, z_y}$, representing the probability distribution of the pre/postsynaptic pair, and $\vec{\delta}$, which contains the net change in w ij for each pre/post configuration. The inner product implies a kind of correlation between the two vectors, and changing a pair of synaptic weights in specific directions requires a precise adjustment of this inner product: [...] Thus, to get to the solution weight matrix pictured in Fig 4a, a detailed adjustment between the probability distribution of n(t) and the plasticity function must hold. Adjusting $\vec{\delta}$ to $\vec{p}_{z_x, z_y}$ would mean that the plasticity was designed to solve a specific task for a specific initial condition, contradicting the hypothesis of functionality by learning. Adjusting $\vec{p}_{z_x, z_y}$ to $\vec{\delta}$ would mean that the initial synaptic weights were specifically chosen to solve the task, again contradicting the hypothesis. Therefore, since the requirements for reaching the w ij that are solution to the SRT necessarily contradict the hypothesis of functionality by learning, we must conclude that the neural network sketched in Fig 1b and [...]
Learning to solve a SRT requires segregation of stimulus history coding from decision making
The incapacity of the model depicted in Fig 1b for solving the SRT stems from the fact that the solution weight matrix cannot be reached by any plasticity function g. In turn, this characteristic arises from two facts:
1. Correct stimulus/response pairings change over time, and there are no cues that give information about the current rule. Thus, in order to keep information about the current rule, the response of the system towards the stimuli must be specifically conditioned by the previous states of the system.
2. The population that codes information about the current rule is the same population that defines the behavioural motor response.
Fact number 1 implies that the task cannot be solved as a DT, since the reward does not separate stimulus/response pairs into two disjoint subsets. Fact number 2 implies that coding of stimuli cannot be done freely, because when a neuron codes a stimulus by firing, it is also defining a motor response that is expected to lead to reward. Fact number 1 cannot be avoided because it stems from the very nature of the task. But fact number 2 can be circumvented in a model in which coding and decision functions are performed by separate neural populations. Fig 5a depicts such a model (referred to as complex network; see Methods for a detailed description of its implementation). There, module K integrates information about cues and reward as before, and about the response executed as well, but does not define the motor response. Neurons in the integration module K project to two decision neurons D 1 and D 2 . The decision neuron that fires uniquely determines which response neuron (R 1 or R 2 ) will activate, leading to the corresponding motor response.
Therefore, module K needs to codify all the information required to solve the task. Ideally, it would suffice that neurons in module K codified the cue presented and the current rule. Nevertheless, no cue informs about the current rule, and module K only sees stimuli. Therefore, information about the current rule must be extracted from the history of perceived stimuli. For example, the sequence (s 1 , R 1 , r 1 ) shows that rule L 1 was in force, and it should continue to be so unless a reversal occurs, which is unpredictable but relatively rare. In this manner, a possible solution is that neurons in module K codify each stimulus differently, depending on the previous stimulus history or contingency. This can be done following the model presented in Kappel et al [14]. There, it was shown that stochastic spiking neural networks with lateral excitation and a global inhibitory feedback, in combination with spike timing-dependent plasticity (STDP), have as an emergent property the formation of neural assemblies that encode external stimuli differently depending on the sequence of stimuli that preceded them. In our case, module K should divide into groups of neurons codifying sequences of 4 stimuli: (s(T − 1), R(T − 1), r(T − 1), s(T)), implying 16 possible contingencies.
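The count of 16 contingencies, and the way a past trial reveals the rule, can be written down directly. The sketch below is an illustration rather than part of the model: it codes every element as 0 or 1 (s 1 , R 1 and L 1 as 0; r 1 as 1), and the helper names `implied_rule` and `target_response` are hypothetical.

```python
from itertools import product

def implied_rule(prev_stimulus, prev_response, prev_reward):
    """Rule index (0 = L1, 1 = L2) implied by one past trial.

    Under L1 a response is rewarded when it has the same index as the
    stimulus; under L2 when it has the other index.
    """
    matched = prev_stimulus == prev_response
    return 0 if matched == bool(prev_reward) else 1

def target_response(contingency):
    """Response to s(T) if the rule implied by trial T-1 still holds."""
    prev_s, prev_resp, prev_rew, current_s = contingency
    return current_s ^ implied_rule(prev_s, prev_resp, prev_rew)

# (s(T-1), R(T-1), r(T-1), s(T)), each coded as 0 or 1
CONTINGENCIES = list(product((0, 1), repeat=4))
lookup = {c: target_response(c) for c in CONTINGENCIES}

print(len(CONTINGENCIES))       # 16
print(lookup[(0, 0, 1, 0)])     # 0: (s1, R1, r1) implies L1, so s1 calls for R1
print(lookup[(0, 0, 0, 0)])     # 1: (s1, R1, r0) implies L2, so s1 calls for R2
```

A reader of module K only needs such a mapping from contingency to response. An unrewarded previous trial already points to the other rule, which is how an error can act as the pivot between strategies.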
The SRT structure for the following simulations is depicted in Fig 5b. Each trial starts with the presentation of one cue, 25 ms long. At t decision = 15 ms from trial onset, the state of the neurons in the R module is updated based on which neuron is firing in module D. At the same time, the response is characterized as correct or incorrect. During the interval [15 ms, 25 ms] the synapses from module K to module D are modified following Eq (24) (see Methods section). The states of the R neurons are held unaltered between updates. The reward stimulus, also 25 ms long, is presented immediately after cue offset, being r 1 or r 0 depending on the correctness of the response. The rule is reversed every 15-20 trials, unless otherwise stated.
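The timing of a trial and the reversal schedule can be summarised in a few lines. This is a sketch of the protocol as described in the text, with two assumptions that the text does not state: block lengths are drawn uniformly from 15 to 20 trials, and the first block uses rule L 1 . The names `rule_schedule` and `trial_events` are hypothetical.

```python
import random

CUE_MS, REWARD_MS, T_DECISION_MS = 25, 25, 15
PLASTICITY_WINDOW_MS = (T_DECISION_MS, CUE_MS)   # K-to-D plasticity window within a trial

def rule_schedule(n_trials, min_block=15, max_block=20, seed=0):
    """Return one rule label per trial, reversing every min_block..max_block trials.

    Uniform block lengths and a first block under L1 are assumptions.
    """
    rng = random.Random(seed)
    rules, rule = [], "L1"
    while len(rules) < n_trials:
        block_length = rng.randint(min_block, max_block)   # inclusive bounds
        rules.extend([rule] * block_length)
        rule = "L2" if rule == "L1" else "L1"
    return rules[:n_trials]

def trial_events(trial_index):
    """Absolute times (ms) of the events of one trial."""
    onset = trial_index * (CUE_MS + REWARD_MS)
    return {
        "cue_on": onset,
        "decision": onset + T_DECISION_MS,
        "reward_on": onset + CUE_MS,
        "trial_end": onset + CUE_MS + REWARD_MS,
    }

schedule = rule_schedule(200)
print(schedule[:3], trial_events(2))
```

Each trial therefore spans 50 ms, and information about trial T − 1 has to survive at least until the decision time of trial T.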
Conceptually, learning is achieved in two steps. In the first place, neurons in module K need to form subpopulations that respond differently to each cue at time t, given the past contingency up to the cue presentation at trial T − 1. This is achieved by the plasticity rule described by Eq (23), provided that the system has enough memory so that events in trial T − 1 have an impact during trial T. Next, neurons in module D need to read the firing of module K, mapping each contingency coded in module K to the correct response. This is achieved thanks to the learning rule described by Eq (24), which is proved to reduce the distance between module [...]
Fig 5. [...] Each cue and reward stimulus is coded by the Y neuron population, like in the simple network. Besides, the executed motor response gives sensory feedback, such that each response is also coded by module Y. Module Y connects to all neurons in the integration module K, which in turn connect with each other and with each neuron in the decision module D. Each neuron D is hardwired to one neuron R, so that the response executed is entirely defined by the D module. Synapses between module Y and module K, and within module K, are plastic, subject to the plasticity rule defined in Eq (23), which is applied at all times and is not dependent on reward. Synapses between module K and module D are plastic, subject to the plasticity rule defined in Eq (24), which depends on reward. (b) Serial reversal protocol for training the network depicted in (a). Stimuli are presented for 25 ms, and the motor response to be executed is chosen at t decision = 15 ms from cue onset. Plasticity between K and D neurons is applied only if there was reward and within a window spanning from t decision to the end of cue presentation.
The model effectively learns to solve a SRT, as can be seen in Fig 6a. After 10000 trials of training, the model is capable of changing strategies in the trial immediately following rule reversal (Fig 6b). The dynamics of synaptic weights along training depends on each kind of connection (Fig 7).
After learning, neurons in module K fire in sequences (Fig 8) which presumably contain the information employed by module D to choose the right response. We studied the firing profile of the K module by computing the probability of firing of each K neuron during t decision (Fig 9a). It can be seen that each one of the 16 possible contingencies has a firing profile that is almost unique. Some contingencies are codified by a single neuron (for example, contingency 15), while other contingencies are codified by a set of neurons that fire more evenly (for example, contingency 14). This can be seen more clearly by computing a Similarity Index (SI) for pairs of firing profiles (Fig 9b). Most pairs have a small SI, and many contingencies are coded by unique sets of neurons. Therefore, the firing state of the K module together with the response executed form a set of states that can be separated in two disjoint subsets when conditioned to reward, which allows the D module to map each firing state in module K to the correct motor response by means of the plasticity rule described in Eq (24).
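The exact expression of the SI is not given in this section; the caption of Fig 9 only states that it decreases with the 1-norm between firing profiles and is normalized between 0 and 1. The following sketch shows one index with those two properties. Normalising each profile to unit sum and the linear form 1 − d/2 are assumptions and may differ from the definition actually used.

```python
def similarity_index(profile_a, profile_b):
    """Return 1 for identical profiles and 0 for profiles with no overlap.

    Each profile is first normalised to sum to 1, so the 1-norm of the
    difference lies in [0, 2] and the index lies in [0, 1].
    Assumed form; the source only says that the SI is based on the 1-norm.
    """
    total_a, total_b = sum(profile_a), sum(profile_b)
    a = [x / total_a for x in profile_a]
    b = [x / total_b for x in profile_b]
    l1_distance = sum(abs(x - y) for x, y in zip(a, b))
    return 1.0 - l1_distance / 2.0

single_neuron = [1.0, 0.0, 0.0, 0.0]      # contingency coded by one neuron
spread_out = [0.25, 0.25, 0.25, 0.25]     # contingency coded by an evenly firing set
print(similarity_index(single_neuron, single_neuron))
print(similarity_index(single_neuron, spread_out))
```

Under this assumed definition, two contingencies coded by non-overlapping sets of K neurons have an SI of 0, which is the situation that lets a downstream reader assign them different responses.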
It is interesting to note that only half of the 16 contingencies are possible within blocks of trials under rule L 1 , the other half being possible only within blocks of trials under rule L 2 . This implies that learning the contingencies could be subject to a problem of catastrophic forgetting. However, this was seldom the case, as can be seen from Fig 9, at least for the protocol of 15-20 trials for each rule. To further explore this issue, we trained networks in an SRT during 10000 trials under protocols with blocks of an increasing number of trials with the same rule, and computed the SI and average performance (Fig 10a and 10b). As trials per block were increased, performance dropped as quickly as the SI values went up, reaching a plateau for the longest blocks. However, it is worth noting that high performance (69%) is still attainable for blocks of 320 trials, showing that the model has a remarkable resilience to catastrophic forgetting of information regarding contingencies.
To better understand the dynamics of learning, we computed how well neurons in module K codified each element of the contingency vector (s(T − 1), R(T − 1), r(T − 1), s(T)) along training. Every 1000 trials of training we employed the most recently updated synaptic weights in a separate simulation of 200 trials without plasticity, and assessed contingency coding by training a tree bagger classifier to classify each of the 16 contingencies based on the firing of all K neurons during t decision . Then, the classifier was used to classify trials sharing one of each of the components of the contingency vector (Fig 11a). Classification performance (CP) before training was around 50% for each separate element, and around 6% for the whole contingency, matching the CP values expected by chance (1/2 for a binary element and 1/16 for the full contingency). After 1000 trials of training, the CP of s(T), R(T − 1) and r(T − 1) were almost 100%. The response stimulus is the only stimulus that lasts 50 ms, and is only changed after t decision , meaning that its coding demands the least memory and thus is expected to be the easiest to code, along with s(T). Coding of reward stimulus r(T − 1) demands more memory from the system, but nevertheless is coded with similar proficiency to that of s(T) and R(T). On the other hand, the CP of s(T − 1) grows following a sigmoid-shaped function that resembles the temporal dynamics of the synaptic weights within the K module. Within the contingency vector, s(T − 1) is the first stimulus to be presented, and presumably the one having the strongest memory requirements. Moreover, it is followed by r(T − 1), which could act as a source of interference. The coding dynamics of s(T − 1) is almost identical to the coding dynamics of the entire contingency vector, and also grows similarly to behavioural performance (Fig 6a), suggesting that coding s(T − 1) is the bottleneck for contingency coding, and presumably for behavioural learning.
Results in Fig 11a show that module K has enough memory to retain information for at least 50 ms. To further explore the memory capacity of the system, we tested the model that learned the SRT by simulating 2000 trials without plasticity. Trials were sorted according to their membership to each contingency and a Naive Bayes classifier was trained to classify trials according to their membership to a given contingency, based on the activity of the K neuron population at time points ranging from the start of s(T) to the end of s(T + 5) (300 ms of consecutive activity). The CP was assessed for each contingency separately, and for the set of 16 contingencies (global performance) (Fig 11b). Global performance starts around 50% at t = 0 ms, which means that the s(T − 1), R(T − 1) and r(T − 1) components were already codified at trial initiation; uncertainty remained regarding s(T), which is expected since this stimulus had not been presented at t = 0 ms. Global performance rose rapidly, reaching its maximum of 97% at t = 15 ms. At this time, the response is updated and thus can differ from the response in the contingency being analysed, which explains why the maximum global performance is found at t decision .
Fig 9. [...] Similarity index (SI) between pairs of contingency firing profiles, which is inversely proportional to the 1-norm between firing profiles, and normalized to the interval between 0 (no similarity) and 1 (total similarity). In general the SI values are low. The highest SI was equal to 0.23, between contingencies 8 and 16, which differ only in their s(T − 1). The second highest SI value was equal to 0.06, computed between contingencies 7 and 8, which differ only in s(T). There was a tendency for SI values to be high for pairs of contingencies that share the same s(T − 1) or s(T). https://doi.org/10.1371/journal.pone.0186959.g009
Fig 11. [...] The information conveyed by the K module about the contingencies was estimated by employing tree-bagger classifiers trained on the K module firing profile to classify trials according to their membership to a given group of contingencies that share some specific element, depicted in the legend. Probe simulations were run before beginning training (Trial = 0) and then every 1000 trials. Firing profiles were computed at t decision . Information about s(T − 1) takes more training to be acquired, acting as a bottleneck for the coding of the whole contingency. (b) Memory about the occurrence of each contingency was estimated by assessing the classification performance of a Naive Bayes classifier trained to correctly classify the 16 contingencies based on the K module firing profile computed from t = 0 of trial T to the end of trial T + 5 (T being the trial in which s(T) of the target contingency was presented). The CP value peaks around t decision , as expected, since the contingency may change after that time. For contingencies which involved the r 1 stimulus, information is retained above chance levels long after the time of decision. On the contrary, information about contingencies involving r 0 was retained for a shorter period, suggesting that information retention is proportional to the frequency of occurrence of the contingency. (c) When reward is delivered at random, differences in information retention between contingencies involving r 1 and r 0 disappear. https://doi.org/10.1371/journal.pone.0186959.g011
For good performance, information about the previous trial must be retained until t decision . We can see that the memory of the system far exceeds this minimum requirement, with a CP of 22% at t = 300 ms. Notably, CP values per contingency are clustered in two well defined groups that differ in how fast classification performance drops.
The group of contingencies for which the system has shorter memory (CP drops fast) is composed of contingencies where r(T − 1) = r 0 (red curves), while memory is longer for rewarded contingencies. It is important to note that r 0 is less and less presented as learning progresses, leading to an underrepresentation of contingencies containing r(T − 1) = r 0 . This suggests that the number of times a given contingency is presented during training defines for how long the system retains information about that contingency. To test this hypothesis, we performed a new training in which both r 1 and r 0 have equal chances of been presented regardless of the chosen motor response. In this case, the CP of all contingencies followed a similar temporal course, as expected (Fig 11c).

Discussion
In this work we have studied under what conditions a biologically plausible neural network is capable of solving a serial reversal task. The distinctive feature of this paradigm is that each stimulus/response pairing is eventually reinforced, since correct responses depend on the current rule. Thus, the information about the perceived stimulus and executed response collected at any single point in time is not sufficient to solve the task. This problem is reminiscent of catastrophic forgetting, also called the stability/plasticity dilemma, which is usually stated as the difficulty that many neural network models have in acquiring new information without erasing old information [5,16]. Catastrophic forgetting studies usually focus on paradigms where a set of stimulus/response pairings must be learned sequentially. There, the difficulty of the task lies in the distributed representation of stimuli in the neural network, where the same set of synaptic weights is modified each time a new pairing is presented. It has been shown that forgetting can be alleviated in models that incorporate different levels of plasticity, i.e. metaplasticity [17,18]. Moreover, previously acquired information can be preserved in the correlated firing of the neural population [19]. Thus, it might be reasonable to think that similar mechanisms could be at work in a behavioural paradigm like the SRT. However, the results presented in this work show that no plasticity rule or neural activation function is sufficient to guarantee good performance in the SRT without contradicting the hypothesis of functionality by learning. In particular, we showed that the SRT cannot be learned by any network in which the same neural population integrates stimulus information and at the same time defines the motor response through non-plastic connections.
It is assumed that learning occurs through neural mechanisms that drive the network to a configuration that solves the task. A prerequisite for learning is that the probability of sequences of stimuli and responses must be different when conditioned on reward than when conditioned on no reward; if this prerequisite is not fulfilled, reward delivery does not depend on behaviour and there is nothing to be learned. The network must then achieve two properties: to differentially code in its states the sets of rewarded and non-rewarded sequences of stimulus/response pairings, and to map network states to the correct motor response. It is important to note that this last property (mapping) is only attainable after the first property (coding) is achieved. In the simple network, once coding is achieved the mapping is completely defined, since motor responses are pre-defined based on the activity of the integration/decision module K. But adequate mapping requires appropriate coding as a prerequisite, implying that simple networks will achieve mappings that allow high performance only by chance. Although this is not impossible, since we considered stochastic networks, it can hardly be regarded as learning. Moreover, the probability of finding a solution in this way would be very low, since the solution trajectories are only a small subset of all possible trajectories. For example, the module K ruled by Eqs (18) to (24) is capable of coding the 16 (s(T − 1), r(T − 1), R(T − 1), s(T)) sequences. Let us consider 16 K neurons, half of them leading to $R_1$ and the other half leading to $R_2$. We may assume that each neuron will code one of the 16 possible contingencies at random, since initial conditions were randomly chosen. Then, there are $8! \times 8!$ out of $16!$ possible assignments between $K_i$ neurons and contingencies $C_j$ that lead to 100% correct responses. This means that, by choosing a random initial condition, this K module will exhibit 100% performance with a probability of approximately $7.8 \times 10^{-5}$, i.e. 1 in 12,870.
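The probability quoted above can be checked with a few lines of Python. This is only a sketch of the counting argument: it assumes, as in the text, 16 K neurons hard-wired half to each response and a random one-to-one assignment of contingencies to neurons, and it further assumes that exactly half of the contingencies require each response. The function name is illustrative.

```python
from math import comb, factorial

def perfect_assignment_probability(n_neurons: int = 16) -> float:
    """Probability that a random one-to-one assignment of contingencies to
    K neurons sends every contingency to a neuron wired to the correct response."""
    half = n_neurons // 2
    favourable = factorial(half) * factorial(half)  # 8! x 8! correct assignments
    total = factorial(n_neurons)                    # 16! possible assignments
    return favourable / total

p = perfect_assignment_probability()
print(f"{p:.2e}")
```

The ratio equals $1/\binom{16}{8} = 1/12870$, in agreement with the value given in the text.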
Building on the restrictions exhibited by the simple network scheme, we proposed a neural network model capable of solving the SRT, which relies on assigning the functions of contingency coding and response selection to different neural populations (integration module K and decision module D in Fig 5a). In this way, all the information required to solve the SRT (i.e. the coding of the (cue, response, reward) contingencies) is first acquired in module K, and then module D adapts its response through reinforcement learning in order to maximize reward. Besides the SRT, the model should perform well in any task that involves unpredictable changes of rules. Also, other related phenomena, like the overtraining reversal effect, could be recapitulated in the model by the addition of attentional mechanisms, such as reward-modulated stimulus gain [20].
It is interesting to note that, although the separation of functions achieved in the complex model makes it possible to untangle the problem generated by the reversal paradigm, the coding of the contingencies themselves implies a possible problem of catastrophic forgetting, because the same set of synaptic weights must change to learn contingencies that are presented in a sequential schedule. Nevertheless, the soft winner-take-all network implemented as module K showed a remarkable resilience to forgetting. Although information about contingencies within a block of trials with the same rule could persist long enough into the other block, this is not likely, since the memory of the system declines considerably after 6 trials (Fig 11b). A possible explanation for the resilience to forgetting is that the distribution of synaptic weights attained among neurons in module K is sparse (as shown in Fig 7c), a fact that could decrease the chances of interfering representations [21].
The impossibility result shown here has special meaning for brain regions typically related to decision making, like the prefrontal cortex (PFC). The PFC is key to several high-level cognitive processes such as behavioural plasticity [22], working memory [23][24][25], rule learning [26] and decision making [27]. Experiments involving brain lesions have shown that different subregions within the PFC and the striatum are differentially involved in the SRT. In particular, it has been found that the orbitofrontal cortex (OFC), the medial PFC, and the medial and dorsomedial striatum are required for learning an SRT [8,9,28]. In most cases, lesions of the involved areas led to slower learning of the SRT, with a higher rate of perseverative errors. In our model, perseverative errors occur if module K encodes stimuli but does not have enough memory to encode the cues, rewards and responses that took place in the previous trial. Since the coding capacity of module K stems from the competitive dynamics between neurons that occurs through inhibition, a failure in the inhibitory system would harm the coding capacity of module K, leading to perseverative errors. This is consistent with [29], in which mutant mice with a deficit in frontal cortical inhibitory neurons showed more perseverative errors and impaired learning in the SRT.
The experimental results enumerated above, together with our theoretical results, suggest that, in order to understand the neural mechanisms required for solving the SRT, and the IRTs in general, it would be of great value to characterize subpopulations of neurons according to their afferent and efferent projections and in relation to their firing profile. It could be expected, for example, that PFC neurons could be sorted into populations of coding neurons, which code complex contexts and stimulus histories, and decision neurons, which integrate contingency information from the coding population and project to motor structures like the dorsal striatum or the motor cortices. Another interesting possibility is that simple and complex networks coexist within different brain regions, for example as circuits spanning the PFC and the basal ganglia. If these two kinds of networks are somehow segregated, then a specific brain lesion could damage more complex networks than simple networks. The damage to complex networks would hamper learning in tasks like the SRT, while the remaining simple networks would still be capable of solving other non-IRT tasks like a simple discrimination or a delayed matching-to-sample task.
Synaptic plasticity in the model depicted in Fig 5a fulfils two different functions. In module K, plasticity allows the system to classify stimulus contingencies. The modulation of plasticity by reward would make no difference there, since all contingencies are equally rewarded, at least at the beginning of learning. Evidence of sustained plasticity has been found experimentally, in the form of the continuous formation and erasure of synaptic spines in cortex, which occurs even in the absence of any obvious reward [30]. On the other hand, plasticity between module K and module D has the function of allowing the D module to read the firing of K neurons that carries contingency information, and to map it to the correct response. In this case a reward-modulated form of synaptic plasticity is essential, and related experimental evidence can be found in the known effects that the neuromodulator dopamine (DA) has on synaptic plasticity in brain regions like the cerebral cortex [31], hippocampus [32] and striatum [33], and in the fact that DA neurons code reward and reward-predicting cues [34,35]. This fundamental difference in plasticity modes in the model suggests that experimental approaches to understanding neural computation should focus on searching for subpopulations based on their synaptic plasticity profile, dissecting populations of neurons according to how sensitive their synaptic changes are to neuromodulators related to reward. Understanding the relationship between connectivity, firing profile, and reward-modulated and non-reward-modulated plasticity could help to discover the building blocks of neural computation.
In brief, the study of a well-known task such as the SRT allowed us to gain new insights into the computational limits of an important set of biological neural networks that are commonly considered as models of learning and decision making, and to give new theoretical support to the experimental exploration of the anatomy and function of neural circuits. Future work should focus on the rules of connectivity that allow greater memory for coding more complex contingencies, and on the kinds of algorithms that can be learned by combining different circuit motifs with reward-modulated and non-reward-modulated plasticity.

Methods
Proofs for the impossibility of simple neural networks to learn to solve the SRT

Achieving high performance in the SRT implies that the network responds to stimuli according to the transitions depicted in Fig 2. The behaviour of the network will be inherently stochastic, since it is required to respond to stimuli that are themselves stochastic. However, given the state of the network at time t, the transition probability for the correct response is expected to be close to one, with all other responses having transition probabilities close to zero. Without the stochasticity of the stimuli, the network would follow a deterministic limit cycle, in which n(t) = n(t + m), where m is the length of the cycle. In this sense, we say that the transition probability matrix is a deterministic probability matrix, and that the network follows a stochastic limit cycle, where the stochastic component of the behaviour is given by external factors that do not depend on the activity of the network.
With these concepts in mind, we will prove that the reduced neural network cannot learn to solve the SRT by showing that, given any excitation function f and plasticity function g, the network either does not converge or converges to one of many possible stochastic limit cycles, of which only a small subset allows high performance in the SRT.
First, we will study the convergence properties of the reduced neural network, assuming that external stimuli are not stochastic. We build on the mathematical framework of decision systems as presented in [36]. There, a decision system is defined as being composed of a state space $X$, a decision space $D$, and transition probabilities $p_i(x) := p(i \mid x)$ and $P_i(x, A) := p(x' \in A \mid x, i)$, where $x \in X$, $i \in D$, and $A$ is any element of the sigma algebra on $X$. At each time $t$, a decision $i$ is taken given $x(t)$ and $p_i(x)$, obtaining $i(t + 1)$. Then, $x(t + 1)$ is obtained, conditioned on $i(t + 1)$ and $x(t)$, through $P_i(x(t), x(t + 1))$.
The evolution of a stochastic spiking network can be represented within this framework as follows:

$D = \{(z_i, z_j)\}$, the set of all possible pairs of firing states the network can assume, where $z_i$ is the ith vector of the set of all possible firing state vectors $z$.

$X = \{(w, z, rew)\}$, the set of all possible combinations of whole-system synaptic weight configurations ($w$), network firing states ($z$) and values of the reward function $rew$.
where $w_i$ is the ith vector of the set of all possible synaptic weight configurations for the whole network. By reachable we mean that $w_l$ is the whole synaptic configuration that is obtained when applying plasticity function $g$ after the transition from $z_o$ to $z_v$, with $rew$ taking the value corresponding to that trial given $z_v$, $s$ and $L$. Theorem 1 in [36] shows that a decision system converges with probability 1 to a limit cycle if and only if for each state $x$ there is a decision $i$ such that:

where $c$ is a constant and $Q_i^n$ is defined inductively as $Q^{n+1}$

Condition (7) is fulfilled only when the probability of transitioning from firing state $z_i$ to $z_j$ infinitely often does not vanish, which happens only if the probability converges to 1.
In the case of a reduced network: where $l$ is the K neuron that is active in state $j$. The function $f$ is any function that is strictly increasing in $w \in \mathbb{R}$.
Because synaptic weights change deterministically, $P_i(x, x')$ is either 1 or 0. This allows us to simplify condition (7) to: where $l$ is the active neuron in the destination state $j$, $z$ is the source state, and the transformation $T$ is defined by

$$T_{m,q,rew}(w_l) = T^1_{m,q,rew}(w_l) = v, \qquad v_k = \begin{cases} w_{l,k}, & k \notin \{m, q\} \\ w_{l,k} + g(z_l, z_k, rew), & k \in \{m, q\} \end{cases}$$

and $T^n_{m,q,rew}(w_l) = T_{m,q,rew}\big(T^{n-1}_{m,q,rew}(w_l)\big)$. The transformation $T$ takes the vector of synaptic weights of the inputs to neuron $l$ and applies a synaptic change according to plasticity function $g$ (Eq 2) to the weights corresponding to the presynaptic neurons $q$ and $m$ (one a Y neuron, the other a K neuron) that are active, i.e. $z_m = z_q = 1$.
Eq (9) holds only if $\log \varphi_{l,z,rew}$, which is an infinite sum of logarithms, converges. In turn, the sum converges if the application of $T$ leads to an increase in the transition probability. Since $f$ is strictly increasing in $w$, Eq (9) holds if $\delta_{1,1,rew} > 0$. In other words, the network will converge to a limit cycle if for each pair of active neurons Y and K there is a transition to a neuron $n_l$ such that, if the transition is repeated infinitely many times, the probability of the transition increases. This occurs if pre/post activation leads to potentiation of the synapse, i.e. Hebbian plasticity.
As stated before, a neural network that must learn to solve the SRT will not reach a limit cycle, since it is bound to follow stimuli that are stochastic. However, the evolution of the network can be segmented into transitions that eventually reach probability 1. Namely, for states defined as $(s, n_i, L)$ and $(r, n_i, L)$, we can consider transitions conditioned on a given $s$ and $L$, i.e. the external stochastic factors that are independent of the network behaviour. For example, the transitions between a given source state $(s, n_i, L)$ and destination states $(r, n_j, L)$, for any neuron $n_j$, can be considered a decision system. Then, if condition (9) is fulfilled, any of these decision systems will converge to a "limit cycle" in which only one destination state $(r, n_j, L)$ is chosen. The same holds for transitions between source state $(r, n_i, L)$ and destination states $(s, n_j, L')$. It is important to note that a neural network that solves the SRT needs to converge to a unique decision even for incorrect trials, i.e. for source states $(r_0, n, L)$. This means that $g(n_i, n_j, rew) > 0$ for any $rew \in \{1, 0\}$.
Any pair of source and destination states can become the limiting transition, the probability of this happening depending on the initial transition probability, which depends on the initial synaptic weights. In particular, for networks in which Eq (9) holds, any limiting transition is attracting since the transition probability rises with probability equal to itself. The SRT is solved with high performance for only a small subset of all the possible limiting transitions. Therefore, a simple network which is initialized with random synaptic weights will reach a synaptic configuration that solves the SRT with very low probability. In particular, the probability of reaching the solution will be high only if the initial transition probabilities are close to the solution probabilities.
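The attracting character of limiting transitions can be illustrated with a toy simulation. This is a minimal sketch and not the network analysed in the proof: it assumes a softmax as the increasing function f, a constant potentiation step `delta` applied to whichever transition is taken (regardless of reward), and no bound on the weights. All names and parameter values are hypothetical.

```python
import math
import random

def run_until_locked(w, delta=0.2, steps=500, rng=None):
    """Repeat one stochastic choice among destination states; the transition
    that is taken is potentiated on every step, whatever the reward."""
    rng = rng or random.Random()
    w = list(w)
    for _ in range(steps):
        z = [math.exp(v) for v in w]
        total = sum(z)
        probs = [v / total for v in z]  # assumed softmax transition probabilities
        j = rng.choices(range(len(w)), weights=probs)[0]
        w[j] += delta                   # Hebbian potentiation of the chosen transition
    # The destination with the largest weight is taken as the limiting transition.
    return max(range(len(w)), key=lambda k: w[k])

def locking_frequencies(w0, runs=300, seed=0):
    """Fraction of independent runs that end locked on each destination state."""
    rng = random.Random(seed)
    counts = [0] * len(w0)
    for _ in range(runs):
        counts[run_until_locked(w0, rng=rng)] += 1
    return [c / runs for c in counts]

print(locking_frequencies([0.0, 0.0]))  # unbiased initial weights
print(locking_frequencies([2.0, 0.0]))  # initial weights biased towards destination 0
```

In this toy setting every run ends dominated by a single destination, and which one is selected depends mainly on the initial weights rather than on whether it is the correct response.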
A more general definition for the simple neural network

The reduced neural network can be extended to a more general definition of simple network, with an arbitrary number of neurons, for which the impossibility result still holds. In this case, the network dynamics evolve in discrete time steps of 1 ms. The SRT is structured in trials composed of a cue stimulus presentation followed by a reward stimulus presentation, each lasting $t_{\mathrm{stimulus}}$ ms. The response is observed in the interval $[t_{\mathrm{cue\ offset}} - \Delta t_{\mathrm{response}},\ t_{\mathrm{cue\ offset}}]$. The sequence of firing states of module K during this time interval uniquely defines the behavioural response R.
We will consider a simple neural network composed of $N_Y$ neurons in module Y and $N_K$ neurons in module K. The firing state of the ith neuron in module Y will be represented by the variable $y_i$, the ith element of vector $y$. The firing state of the ith neuron in module K will be represented by the variable $n_i$, the ith element of vector $n$. Neurons in module Y fire independently of each other, conditioned on the stimulus presented: where $I^1_y$ is the set of indexes of Y neurons that are active in vector $y$, and $I^0_y$ is the set of indexes of Y neurons that are inactive in vector $y$.
The postsynaptic potential $PP_{i,j}$ elicited by the train of spikes of neuron $j$ onto neuron $i$ is defined as the product of the postsynaptic potential time course $x_j$ and the corresponding synaptic weight:

$$PP_{i,j}(t) = w_{i,j}(t)\, x_j(t).$$

The variable $x_j$, the postsynaptic potential time course associated with the spike train of neuron $j$, is defined as:

$$x_j(t) = \sum_{t'} \epsilon(t - t'),$$

where $\epsilon$ is a kernel function, and $t'$ runs over all the firing times up to time $t$ at which the jth neuron of the module fired. The excitability of neuron $i$ in module K is defined as:

In turn, its probability of firing is:

where $\mathrm{sign}(f(PP_{i,j})) = \mathrm{sign}(PP_{i,j})$, and the limits of $f$ are such that neuron $i$ will fire with probability 1 with the sole firing of a neuron $j$, provided that $w_{ij}$ is maximal, and will remain silent with probability 1 if $w_{ij}$ is inhibitory (negative) and maximal in absolute value. Any number of neurons may fire at the same time, and all neurons are conditionally independent of each other given PP. Thus, the probability of an activation state $n(t)$ of the whole module K is given by:

where $I^1_{n(t)}$ and $I^0_{n(t)}$ are the sets of indexes of neurons that are respectively active and inactive in $n(t)$.
Neurons are plastic all the time. Synaptic weight $w_{i,j}$ changes according to the function $g$, defined as:

$$\Delta w_{i,j}(t) = g\big(z_i(t), PP_{i,j}(t), w_{i,j}(t), rew(t)\big). \qquad (16)$$

The $\Delta w_{i,j}$ values depend on $w_{i,j}$ in such a way that $\lim_{n \to \infty} T^n_{z_i, PP_{i,j}, rew}(w_{i,j}) < |w_{max}|$. This ensures that synaptic weights remain within reasonable prefixed limits. In this case, the variable $rew$ is defined as: with $\gamma$ a kernel function and $t'$ the onset times of stimulus $r_1$. For a simple neural network defined according to Eqs (10) to (17), the impossibility result holds. In particular, since the plasticity rule is deterministic, transitions with probability one will be possible if the corresponding value of $\Delta w_{i,j}$ is positive. In this case, all the variability in the network will stem from the stochastic nature of stimulus presentation and rule switching, and from the uncertainty in the coding of stimuli by the sensory module Y.

Implementation of the complex network
In the implementation of the complex network sketched in Fig 5a, module Y was composed of two neurons for coding each cue stimulus, two neurons for each reward stimulus, and one neuron for coding each response. Module K was composed of $N_K = 150$ neurons. All initial synaptic weights were sampled from a normal distribution with mean 0 and standard deviation 1/64. There were no self-connections ($w_{i,i} = 0$).
Each neuron $i$ in module K has a variable $u_i$: where $w_{iY}$ is a vector containing the synaptic weights of the connections from each neuron in module Y to the ith neuron in module K, while $w_{iK}$ is an analogous vector for the inputs that neuron $i$ receives from the other neurons in module K. The vector products $w_{iY}(t)\,x_Y(t)$ and $w_{iK}(t)\,x_K(t)$ represent the postsynaptic potentials (PP) at time $t$ associated with the train of spikes at each afferent synapse from module Y and K, respectively. The ith element of any vector $x$ represents the temporal course of the PP, which only depends on the spike emission times, and is defined as:

$$x_i(t) = \sum_{t'} \epsilon(t - t'),$$

where $t'$ runs over all the firing times up to time $t$ at which the ith neuron of the module fired, and $\epsilon$ is a double exponential kernel function: where $\tau_1 = 2$ ms, $\tau_2 = 20$ ms, and $\Theta$ stands for the Heaviside function. The parameter $b_i$ controls the excitability of the neuron. This parameter was adjusted at each time $t$ following the homeostatic mechanism described in Habenschuss et al. [37]: which ensures that each neuron in the module fires with equal probability, helping to exploit all neurons in the module, avoiding silent neurons and thus favouring learning. The parameter $\mu$ was set to 0.1. The firing probabilities of neurons in the Y module were defined by the stimulus they coded, such that $p(y^1_{x_i} \mid x_i) = p(y^2_{x_i} \mid x_i) = 0.95$ and $p(y^1_{x_i} \mid x_j) = p(y^2_{x_i} \mid x_j) = 0,\ \forall i \neq j$, where $y^q_{x_i}$ is the qth Y neuron coding stimulus $x_i$. Each executed response was coded by one Y neuron, such that $p(y_{R_1} \mid R_1) = p(y_{R_2} \mid R_2) = 1$ and $p(y_{R_1} \mid R_2) = p(y_{R_2} \mid R_1) = 0$.
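The PP time course described above can be sketched in a few lines of Python. The exact kernel expression is not reproduced here, so the form used below, $\epsilon(s) = \Theta(s)\,(e^{-s/\tau_2} - e^{-s/\tau_1})$ without normalisation, is an assumption; only the values of $\tau_1$ and $\tau_2$ come from the text. Function names and spike times are hypothetical.

```python
import math

TAU_1 = 2.0   # ms, value of tau_1 given in the text
TAU_2 = 20.0  # ms, value of tau_2 given in the text

def kernel(s: float) -> float:
    """Assumed double-exponential kernel; zero for s < 0 (Heaviside factor)."""
    if s < 0:
        return 0.0
    return math.exp(-s / TAU_2) - math.exp(-s / TAU_1)

def psp_trace(spike_times, t: float) -> float:
    """PP time course x(t): sum of the kernel over all spikes emitted up to time t."""
    return sum(kernel(t - t_spike) for t_spike in spike_times if t_spike <= t)

spikes = [10.0, 12.0, 40.0]  # hypothetical spike times in ms
for t in (5.0, 15.0, 45.0, 100.0):
    print(t, round(psp_trace(spikes, t), 4))
```

With this assumed form the response to a single spike starts at zero, peaks roughly 5 ms after the spike and then decays with the slower time constant.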
Within module K, the firing probability of neuron i is defined as: with index j going through all neurons in module K.
The firing probabilities of the two neurons in module D are defined just as for neurons in module K, with the sum in Eq (22) encompassing only the two D neurons. Only one neuron in module K and one neuron in module D fire at each time t.
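A constraint of exactly one active neuron per module per time step can be sketched as a categorical draw. The sketch below assumes that Eq (22) has the usual soft winner-take-all form, $p(n_i = 1) = e^{u_i} / \sum_j e^{u_j}$; this form is an assumption and is not taken from the text, and the excitability values are hypothetical.

```python
import math
import random

def softmax(u):
    """Numerically stable softmax over the excitability values u."""
    m = max(u)
    z = [math.exp(v - m) for v in u]
    total = sum(z)
    return [v / total for v in z]

def sample_winner(u, rng):
    """Draw the index of the single neuron that fires at this time step."""
    probs = softmax(u)
    return rng.choices(range(len(u)), weights=probs)[0]

rng = random.Random(1)
u = [0.5, 1.5, -0.3, 0.0]  # hypothetical excitabilities u_i
winner = sample_winner(u, rng)
firing = [1 if i == winner else 0 for i in range(len(u))]
print(softmax(u), firing)
```

The same function could be applied to the two D neurons by passing a list of two excitabilities.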
Connections from module Y to module K, from module K to module D, and between neurons in module K are plastic. The connections from neurons Y to neurons K and between neurons in module K change at each time $t$ according to $\Delta w_{ij}$:

$$\Delta w_{ij}(t) = \left(e^{-w_{ij}(t)}\, x_j(t) - 1\right)\alpha_1, \qquad (23)$$

where index $i$ refers to the postsynaptic neuron, index $j$ to the presynaptic neuron, $x$ is the time course of the postsynaptic potential associated with neuron $j$, and $\alpha_1 = 5 \times 10^{-4}$ is a learning constant. This plasticity rule is a kind of STDP rule that leads the model to encode each stimulus with a different population of neurons. Note that the rule does not depend on reward, and weight changes are applied at each time $t$. Connections from module K to module D change over time according to: where $d_i$ stands for the firing state of decision neuron $i$, $u_i$ is its excitability variable, $x_j$ is the PP time course of afferent neuron $j$ and $\alpha_2 = 8 \times 10^{-4}$ is a learning constant. The variable $rew$ equals 1 only during the decision window and only if the motor response was correct. Otherwise, $rew = 0$.
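The reward-independent update of Eq (23) can be written as a one-line function. The sketch below reads the equation as $(e^{-w_{ij}}\,x_j - 1)\,\alpha_1$, with the exponential acting on the weight only; this reading is an assumption about the typeset equation. The presynaptic PP values used in the loop are hypothetical.

```python
import math

ALPHA_1 = 5e-4  # learning constant alpha_1 given in the text

def delta_w(w_ij: float, x_j: float, alpha: float = ALPHA_1) -> float:
    """Weight change under the assumed reading of Eq (23):
    (exp(-w_ij) * x_j - 1) * alpha."""
    return (math.exp(-w_ij) * x_j - 1.0) * alpha

w = 0.0
for x_j in (1.0, 1.0, 0.0, 1.0):  # hypothetical presynaptic PP values
    w += delta_w(w, x_j)          # applied at every time step, without reward
    print(round(w, 6))
```

Under this reading, a constant presynaptic trace $x_j > 0$ drives the weight towards $\ln x_j$, where the update vanishes, while a silent input ($x_j = 0$) depresses the weight by $\alpha_1$ per step.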

Simulations and analysis
where $p_C$ is a vector in which the ith element is the estimated probability of firing of neuron $i$ conditioned on contingency $C$. The SI takes values from 0 (when firing probabilities under both contingencies are equal for each neuron) to 1 (when every neuron fires with probability 1 under one contingency and with probability 0 under the other contingency). We employed classifiers to obtain a measure of the information conveyed by the neuron population of module K about contingencies. Specifically, for the result shown in Fig 11a we employed the TreeBagger function in Matlab R2009b to train 50 trees to match the firing of the K module at $t_{\mathrm{decision}}$ with the corresponding contingency. The classification performance was obtained as $CP = 100 \times (1 - err)$, where $err$ is the out-of-bag misclassification probability, obtained through the oobError function. For the result shown in Fig 11b we employed the NaiveBayes function. We trained 100 classifiers on 80% of each training set and tested performance on the remaining 20%. The CP in this case was the average performance of the 100 classifiers on the test set, expressed as a percentage. The results shown hold regardless of which classifier was employed.
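The repeated 80/20 hold-out procedure used for the Naive Bayes CP can be sketched as follows. This is not the Matlab code used in the study: it substitutes a small Bernoulli naive Bayes classifier written in plain Python for the Matlab NaiveBayes function, and it runs on synthetic firing vectors. The data generator, the smoothing constant and all function names are assumptions made for illustration.

```python
import math
import random

def train_bernoulli_nb(X, y, smoothing=1.0):
    """Fit class priors and per-neuron firing probabilities for binary vectors."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        n = len(rows)
        probs = [(sum(col) + smoothing) / (n + 2 * smoothing) for col in zip(*rows)]
        model[c] = (math.log(n / len(y)), probs)
    return model

def predict(model, x):
    """Return the class with the highest log posterior for firing vector x."""
    def log_post(c):
        prior, probs = model[c]
        return prior + sum(math.log(p if v else 1 - p) for v, p in zip(x, probs))
    return max(model, key=log_post)

def classification_performance(X, y, n_classifiers=100, train_frac=0.8, seed=0):
    """Mean held-out accuracy (in %) over repeated random 80/20 splits."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    scores = []
    for _ in range(n_classifiers):
        rng.shuffle(idx)
        cut = int(train_frac * len(idx))
        train, test = idx[:cut], idx[cut:]
        model = train_bernoulli_nb([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(100.0 * hits / len(test))
    return sum(scores) / len(scores)

# Hypothetical data: 4 "contingencies", each driving one of 4 neurons at p = 0.9.
data_rng = random.Random(42)
X, y = [], []
for _ in range(400):
    c = data_rng.randrange(4)
    X.append([int(data_rng.random() < (0.9 if i == c else 0.05)) for i in range(4)])
    y.append(c)
print(classification_performance(X, y))
```

With four equally likely classes, chance level in this synthetic example is 25%, so a CP well above that value indicates that the firing vectors carry information about the class.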