Basal ganglia role in learning rewarded actions and executing previously learned choices: Healthy and diseased states

The basal ganglia (BG) is a collection of nuclei located deep beneath the cerebral cortex that is involved in learning and selection of rewarded actions. Here, we analyzed BG mechanisms that enable these functions. We implemented a rate model of a BG-thalamo-cortical loop and simulated its performance in a standard action selection task. We have shown that potentiation of corticostriatal synapses enables learning of a rewarded option. However, these synapses became redundant later as direct connections between prefrontal and premotor cortices (PFC-PMC) were potentiated by Hebbian learning. After we switched the reward to the previously unrewarded option (reversal), the BG was again responsible for switching to the new option. Due to the potentiated direct cortical connections, the system was biased to the previously rewarded choice, and establishing the new choice required a greater number of trials. Guided by physiological research, we then modified our model to reproduce pathological states of mild Parkinson’s and Huntington’s diseases. We found that in the Parkinsonian state PMC activity levels become extremely variable, which is caused by oscillations arising in the BG-thalamo-cortical loop. The model reproduced severe impairment of learning and predicted that this is caused by these oscillations as well as a reduced reward prediction signal. In the Huntington state, the potentiation of the PFC-PMC connections produced better learning, but altered BG output disrupted expression of the rewarded choices. This resulted in random switching between rewarded and unrewarded choices resembling an exploratory phase that never ended. Along with other computational studies, our results further reconcile the apparent contradiction between the critical involvement of the BG in execution of previously learned actions and yet no impairment of these actions after BG output is ablated by lesions or deep brain stimulation. 
We predict that the cortico-BG-thalamo-cortical loop conforms to previously learned choices in healthy conditions, but impedes those choices in disease states.


Introduction
The basal ganglia (BG) is an evolutionarily conserved complex network of excitatory and inhibitory neurons located deep in the brain of vertebrates that controls action selection (see, e.g., [1]). The BG comprises the dorsal striatum, the external and internal portions of the globus pallidus (GPe, GPi), the subthalamic nucleus (STN), and the substantia nigra [2]. It has traditionally been implicated in motor control, since BG lesions are associated with movement disorders [3,4]. More broadly, the BG is a shared processing center involved in a wide spectrum of motor and cognitive control [2]. A cortico-BG-thalamo-cortical neurocircuit loop is suggested to be the structure that provides this control [2,5]. However, understanding of how this loop functions remains far from complete and requires more experimental and theoretical studies.
The BG is also widely recognized for its involvement in learning [6,7]. Reinforcement learning is recognized as the mechanism that establishes behavioral responses for rewards, such as food or drugs of abuse, and is altered in numerous disorders and disease states, including Parkinson's disease [8-10]. Reinforcement learning is based on communication between midbrain dopamine neurons and the striatum [9], specifically ventral tegmental area (VTA) projections to the ventral striatum in the mesolimbic neurocircuit and substantia nigra pars compacta (SNc) projections to the dorsal striatum in the BG [11,12]. Dopamine (DA) released by dopaminergic VTA and SNc inputs to the striatum signals the difference between received and expected rewards: the reward prediction error (RPE) [10,13]. RPE encoding in VTA-ventral striatal neurocircuits involves prediction of reward value, which in turn feeds back to both VTA and SNc dopamine neurons [13]. Given its role in motor control, the SNc-dorsal striatum component of the BG translates the RPE into action: the hypothesized critic-actor roles of these two dopaminergic projections [10,13]. If the RPE is positive, additional DA release leads to positive reinforcement of the preceding action; if the error is negative (more expected than received), a pause in DA release leads to negative reinforcement and blocks the action. As a mechanism for this control, DA modulates the plasticity of synaptic projections from the cortex to striatal medium spiny neurons (MSNs) [14,15]. Reflecting this bidirectional DA modulation, there are two types of MSNs: those responsible for promoting movement are part of the BG direct pathway and express D1-type dopamine receptors (D1-MSNs), and those that inhibit movement are part of the BG indirect pathway and express D2-type dopamine receptors (D2-MSNs) [16-18].
Indirect and direct BG pathways respectively inhibit or disinhibit the thalamocortical relay neurons responsible for producing particular movements [19,20]. The coordination of activity within the two types of MSNs determines action [21][22][23]. Within the BG loops, synaptic plasticity of corticostriatal projections is a key node in the learning of rewarded choices [6,7,24,15].
The BG is suggested to remain involved in action selection after the action-reward association is learned [5,25]. On the other hand, clinical interventions for Parkinson's disease (PD) do not cause impairments in learned movements [26-28]. Specifically, GPi lesions and deep brain stimulation (DBS) in the STN, both of which are thought to disrupt the main output of the BG, are used to improve motor functions. This observation gave rise to the hypothesis that the BG plays a critical role in learning, but not in the expression of already learned actions or choices [29,30]. These choices are suggested instead to be stored in synaptic connections within the cortex. This hypothesis apparently contradicts the suggested involvement of the BG in executing previously learned actions. Therefore, it is essential to fill this knowledge gap by further investigating the role of the BG in action learning.
Mathematical modeling has been widely used to reproduce and explain various aspects of BG electrophysiology and related behavior. A large set of these models is focused on understanding the dynamics of specific neurons in disease and control conditions, irrespective of BG function [31-34]. Other models are constructed based on functional ideas and emulate how biophysical changes caused by a disorder disrupt those functions [35-39]. Both types of models contribute to the understanding of BG function at different levels [40,41]. However, the picture remains far from complete, the obvious reasons being the complexity of BG circuitry and physiology as well as the diversity of its functions. The present model was designed as a simple implementation of the principles suggested to underlie the learning and action-selection functions of the BG in the simplest, yet most frequently used, two-choice instrumental conditioning task. This simplification allows for a comprehensive implementation of the mechanisms of BG functions and dysfunctions. Thus, our paper addresses the need for a simplified BG model that reproduces learning and action selection in a standard behavioral task.
The goal of the present study was to design a simple model of BG function that utilizes experimentally known physiological processes and replicates behavior in a classical task. Such a model provides an opportunity to identify gaps in knowledge and thereby better guide additional experimentation. To this end, this paper presents a computational model of the cortico-BG-thalamo-cortical loop involved in a two-choice instrumental conditioning task [25]. This task is standard for assessing action-reward association in animals and humans.

Materials and methods
We adopt the rate-model formalism extensively used to reproduce the activity and function of numerous brain structures [46]. In particular, we follow a validated model of motor control [42] and modify it for action selection.

Structure of the basal ganglia
Fig 1 presents a schematic diagram of the nuclei and connections within the BG and their connections with the cortices. The cortico-BG-thalamo-cortical loop is separated into channels selective for each of the two actions of the model (see below). First, the striatum, the primary input structure of the BG, receives excitatory inputs from the prefrontal cortex (PFC) and premotor cortex (PMC) in the cerebrum as well as from the thalamus. From the striatum, two competing pathways are activated: a direct pathway (striatum-SNr/GPi) and an indirect pathway (striatum-GPe-STN-SNr/GPi). These two pathways converge at the BG output nuclei, the SNr and GPi, and serve to modulate their activities. In the model, SNr and GPi activity is treated as one unit. SNr/GPi activity inhibits the corresponding neural group in the thalamus and PMC and blocks the corresponding action. Thalamus and PMC activity is likewise treated as a single unit (PMC/Thal). To execute an action, SNr/GPi activity must decrease and disinhibit the PMC/Thal neurons. The actions (channels) compete with each other via reciprocal inhibition at the PMC level. Reciprocal inhibition also exists at the GPe level, but it was omitted in the model because STN-GPe network dynamics was shown to depend only weakly on this inhibition [47,48]. In addition, DA neurons in the SNc signal the reward prediction error (RPE), which changes the synaptic weights of the PFC-striatum connections via DA-dependent long-term synaptic potentiation (LTP) and long-term synaptic depression (LTD) to allow for reward-based learning.

Behavioral task
Our model implements a standard design for two-choice instrumental conditioning tasks [25]. The circuitry shown in Fig 1 is built to reproduce selection between two actions, one of which is rewarded. A typical task is to learn that, for instance, action 1 is rewarded if a conditioning stimulus (CS) is presented. After this contingency is learned, the task is "reversed": the reward following the same CS is shifted to action 2. Thus, the cortico-BG-thalamo-cortical loop has two channels, one for each choice, except for the PFC, which represents the CS, and the SNc, which represents the unexpected reward. Activation of neural groups 1 and 2 in the PMC/thalamus corresponds to execution of actions 1 and 2, respectively. Thus, in the model, an action is considered selected if the activity level of the corresponding PMC neural group at the end of a simulated trial exceeds that of the other group by more than the noise level (0.1; see below). The behavioral readout is whether the stimulus-reward contingencies can be learned, and how many trials learning takes.

Fig 1. The structure of the cortico-basal ganglia-thalamo-cortical loop model. The BG receives inputs from the prefrontal cortex (PFC) signaling the conditioning stimulus (CS) as well as reward inputs via the substantia nigra pars compacta (SNc). The SNc forms a dopamine reward prediction error (RPE) signal, which governs plasticity of the connections from the PFC (DA LTP/LTD; green). The BG input structure, the striatum, contains medium spiny neurons (MSNs), which cluster into two subtypes: D1 and D2 dopamine receptor-containing (direct and indirect pathways, respectively). The remaining nuclei are the globus pallidus external (GPe), the subthalamic nucleus (STN), and the output structures: the substantia nigra pars reticulata and globus pallidus internal (SNr/GPi). The loop is completed by connections from and to the premotor cortex/thalamus (PMC/Thal). The two channels of the loop are colored purple/blue. https://doi.org/10.1371/journal.pone.0228081.g001
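As a minimal sketch of this readout (in Python for illustration; the published code is in MATLAB), the end-of-trial selection rule can be written as:

```python
def selected_action(pmc1, pmc2, margin=0.1):
    """End-of-trial readout: an action counts as selected only if the
    activity of its PMC group exceeds the other group's by the noise
    level (0.1); otherwise no clear choice is registered."""
    if pmc1 > pmc2 + margin:
        return 1
    if pmc2 > pmc1 + margin:
        return 2
    return None  # ambiguous trial: no action selected
```

For example, selected_action(0.8, 0.3) returns 1, while selected_action(0.5, 0.45) returns None because the margin is below the noise level.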

Firing rate equations
The activity of every neural group (except the dopaminergic neurons in the SNc) is governed by a differential equation of the form [42]

τ dA/dt = −A + σ(I) + N(t),

where A is the instantaneous activity level of the group, N(t) is uniformly distributed noise with amplitude 0.1, and τ is a time constant taken to equal 12.8 msec for the STN, 20 msec for the GPe, and 15 msec for all other neural groups, based on previous models and experimental studies [49]. I is the synaptic input to the group; the expressions for the synaptic input to each neural group are compiled in Table 1. σ(I) is a normalized sigmoidal response function that constrains the firing rates of all nuclei to lie between 0 and 1. This normalization avoids the difficulties of modeling the very different firing rates observed in the BG of different species and allows us to focus on the general learning mechanism. We have adopted the following notation: X_m denotes the activity (firing rate) of neural group X in the pathway for the m-th action. Since our model contains only two actions, the only possible values of m are 1 and 2. The index n in the formula for X_m is equal to 2 if m = 1, and n = 1 if m = 2; i.e., it refers to the other of the two channels and describes the interaction between them. Further, w_X-Y denotes the synaptic weight (strength of connection) from group X to group Y, and dr_X denotes a tonic drive to group X. Many of these weights are held constant throughout the trials, but several are plastic, as described below.

Synaptic plasticity
The synaptic weights from PFC to PMC neurons and from PFC to MSNs are plastic, which means that they change depending on the activity of these nuclei and on the behavioral outcome (reward received), respectively [43-45]. In simulations, the synaptic weights are updated at the beginning of every trial depending on the behavior of the model in previous trials. Before we discuss the specific mechanisms by which we update these plastic synaptic weights, we first describe how we calculate the activity of the dopaminergic neurons in the SNc, which mediate reward-based learning. The activity of the SNc neurons is associated with a reward prediction error (RPE) [50]. Following previous models (e.g., [42]), we assume that the activity of the SNc neural group reflects the difference between the actual and the expected reward:

SNc_j = R_j − R^e_j,

where R_j is the actual reward given on the j-th trial based on the action selected, and R^e_j is the expected reward on the j-th trial. The animals are pre-trained on a single-choice task and therefore expect a reward. The expected reward on the first trial, R^e_1, is equal to 1 and is subsequently updated according to the following scheme [42]:

R^e_{j+1} = R^e_j + α (R_j − R^e_j),

where α is a constant (set equal to 0.15) and R_j denotes the actual reward received by the model on the j-th trial. Note that the RPE may be positive (actual greater than expected) or negative (actual less than expected). The actual reward received in simulations is R_j = 1 if the rewarded action is selected and R_j = 0 otherwise, where the selected action is determined by comparing the activities of the PMC neurons at the end of each trial, as described above. Altogether, after each trial, the PFC-striatal synaptic connections are updated by adding increments of the form

Δw_PFC-D1m = λ_D1 · SNc_j · PFC · D1_m − d · w_PFC-D1m,
Δw_PFC-D2m = −λ_D2 · SNc_j · PFC · D2_m − d · w_PFC-D2m,

where PFC, D1_m, and D2_m denote the activity of the respective neural group at the end of the trial (m = 1, 2), and the signs are chosen so that a positive RPE potentiates PFC connections to D1 MSNs while a negative RPE potentiates PFC connections to D2 MSNs. Here, λ_D1 and λ_D2 are learning-rate constants and d is the decay-rate constant.
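These between-trial updates can be sketched in a few lines of Python (the learning-rate and decay values in the usage example are placeholders, not the calibrated constants):

```python
def rpe(R, R_expected):
    """Dopaminergic RPE carried by the SNc: actual minus expected reward."""
    return R - R_expected

def update_expected(R_expected, R, alpha=0.15):
    """Delta-rule update of the expected reward between trials:
    R_e <- R_e + alpha * (R - R_e), with alpha = 0.15 as in the Methods."""
    return R_expected + alpha * (R - R_expected)

def update_pfc_msn(w, dopa, pfc, msn, lam, d):
    """Three-factor increment for a PFC->MSN weight:
    w <- w + lam * RPE * pre * post - d * w (D1-type form; for D2 MSNs
    the RPE enters with the opposite sign, so reward omission potentiates)."""
    return w + lam * dopa * pfc * msn - d * w
```

For instance, starting from an expected reward of 1, a single unrewarded trial gives rpe(0, 1) = −1 and lowers the expectation to update_expected(1.0, 0.0) = 0.85.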
The values of these constants are adopted from the previous literature [30,42] with a modification that accounts for the difference between synapses contacting D1 and D2 MSNs: Gurney et al. [51] showed, using experiments and modeling, that the plasticity of synapses on D2 MSNs is weaker by approximately a factor of two than that on D1 MSNs. Note that the formalism does not allow for direct modeling of the eligibility traces necessary for stimulus-reward association [52-54]; we account for this by PFC activity that persists for the duration of the trial. Lastly, we describe the mechanism by which we update the connections between the PFC and PMC neurons. Here, we let w_PFC-PMCm denote the synaptic weight of the connection between the PFC neural group and the m-th PMC neural group. After each trial, the synaptic weights are updated according to the following Hebbian learning rule:

Δw_PFC-PMCm = λ_CM · PFC · PMC_m − d_CM · w_PFC-PMCm,

where λ_CM is the learning rate and d_CM is the decay rate of the cortical connections. Here, PFC and PMC_m denote the activity of the PFC neurons and the m-th PMC neuron group at the end of the trial. Next, we outline our methodology for calibrating the three different BG model states: healthy, Parkinsonian, and Huntington's disease.
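The cortical rule can likewise be sketched in Python (the default rate constants shown are hypothetical placeholders, not the Table 2 values):

```python
def update_pfc_pmc(w, pfc, pmc_m, lam_cm=0.01, d_cm=0.001):
    """Reward-independent Hebbian update of a direct PFC->PMC weight:
    w <- w + lam_cm * PFC * PMC_m - d_cm * w. Co-activity of the PFC and
    a PMC group strengthens their connection; inactivity lets it decay."""
    return w + lam_cm * pfc * pmc_m - d_cm * w
```

Because the rule is reward-independent, it gradually transfers the learned choice from the BG to direct cortico-cortical connections, as described in the Results.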

Healthy BG state
We aim to reproduce rodent behavior in instrumental conditioning (IC) tasks [25,29], in which an animal learns contingencies between a conditioning signal and a rewarded action: pressing one of two levers. We reduce the model of [30,42], focus on the interaction of the thalamocortical and BG networks (Fig 1), and reproduce the function of the cortico-BG-thalamo-cortical loop in the above two-choice task. The parameter values are shown in Table 2. The values were taken from previous studies [30,42] with a few minor modifications that allow for both robust instrumental conditioning and reversal learning.
Specifically, the equations for the D1 and D2 MSN neurons reproduce their balanced excitation by cortical inputs in vivo [21,55]. The balance is supported by a number of complex mechanisms, from the differential effects of DA on the excitability of D1 and D2 MSNs [56] to their lateral inhibition and the contribution of striatal fast-spiking interneurons [32]. These mechanisms are very hard to implement in a rate model, so we calibrate the D1 and D2 MSN equations identically to reflect the balance. The balance is perturbed in the Parkinsonian DA-depleted state (see below).

The plastic synaptic weights are initially set randomly between 0 and 0.001 and are updated after each trial (Table 2).

Parkinsonian BG state
There are multiple mechanisms that break the activation balance of D1 and D2 MSNs in the DA-depleted state [56]. All of them lead to a net increase in the activation of the D2 MSNs and a decreased activation of the D1 MSNs [32]. Thus, we model these changes by decreased synaptic excitation of D1 MSNs and increased synaptic excitation of D2 MSNs (Table 3).
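The direction of these changes can be sketched as a re-calibration of the healthy-state parameters (a Python sketch; the baseline weights and the MSN scaling factors are placeholders, not the entries of Table 3, while the 70% SNc reduction is the value quoted in the Results):

```python
# Hypothetical healthy baseline values chosen for illustration only.
healthy = {"w_Ctx-D1": 1.0, "w_Ctx-D2": 1.0, "snc_gain": 1.0}

parkinsonian = {
    # Decreased synaptic excitation of D1 MSNs (assumed factor).
    "w_Ctx-D1": healthy["w_Ctx-D1"] * 0.8,
    # Increased synaptic excitation of D2 MSNs (assumed factor).
    "w_Ctx-D2": healthy["w_Ctx-D2"] * 1.2,
    # SNc dopaminergic signaling reduced by 70%, per the Results.
    "snc_gain": healthy["snc_gain"] * (1 - 0.70),
}
```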

Huntington's BG state
The pathology of Huntington's Disease (HD) is less well-understood; however, it is clear that there is a progression of the disease from chorea (involuntary, jerky movement) at its onset to akinesia (loss of the power of voluntary movement) at its conclusion [76]. We modeled the chorea phase (Grade 2 HD) by weakening the D2 MSN-GPe connection by 75%, weakening the D1 MSN-GPi connection by 35%, and decreasing the PFC and PMC inputs to account for destruction of the cortices [76,77]. These percentages are gathered from the physiological observations of Reiner et al. [76]. The resulting parameters are shown in Table 4.
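The quoted percentages translate into a simple scaling of the healthy-state weights (a Python sketch; the baseline values of 1.0 are placeholders rather than the entries of Table 2):

```python
# Hypothetical healthy baseline values chosen for illustration only.
healthy = {"w_D2-GPe": 1.0, "w_D1-GPi": 1.0}

huntington = {
    # Indirect-pathway D2 MSN-GPe connection weakened by 75%.
    "w_D2-GPe": healthy["w_D2-GPe"] * (1 - 0.75),
    # Direct-pathway D1 MSN-GPi connection weakened by 35%.
    "w_D1-GPi": healthy["w_D1-GPi"] * (1 - 0.35),
}
```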

Numerical simulations
Our model was coded in MATLAB. We considered a trial to last 750 msec, at the end of which we register the activity of each neuron in the circuit. We chose to cut off trials at this point because it is sufficient to guarantee that the neural activity converges to a steady state. An exception is the case in which neural activity does not approach a steady state and remains oscillatory, which we also found in this study. We update the strengths of the plastic synapses after each trial. Finally, we reset the initial activity of the neurons to randomized levels at the beginning of each subsequent trial. We ran simulations consisting of 500 such trials. The code is available in the ModelDB database at http://modeldb.yale.edu/261616.
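The simulation protocol can be sketched as follows (in Python for illustration; the within-trial dynamics here is a toy leaky integrator with hypothetical fixed inputs standing in for the full network, not the published MATLAB equations):

```python
import numpy as np

def run_session(n_trials=500, trial_ms=750, dt=1.0, seed=0):
    """Skeleton of the protocol from Methods: 750-msec trials, activities
    re-randomized at the start of each trial, end-of-trial registration of
    every unit, and 500 trials per session."""
    rng = np.random.default_rng(seed)
    tau = 15.0                            # msec (cortical time constant, Methods)
    end_of_trial_activity = []
    for _ in range(n_trials):
        A = rng.uniform(0.0, 0.1, 2)      # randomized initial PMC activities
        for _ in range(int(trial_ms / dt)):
            I = np.array([0.6, 0.4])      # hypothetical fixed synaptic inputs
            A += (dt / tau) * (-A + I + 0.1 * rng.uniform(-1, 1, 2))
        end_of_trial_activity.append(A.copy())   # registered for the readout
        # In the full model, the plastic weights are updated here, between trials.
    return np.array(end_of_trial_activity)
```

With the fixed inputs above, each trial's activity relaxes toward its input level before the end-of-trial readout, mimicking the convergence to steady state assumed in the Methods.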

Results
We simulated the same standard two-choice IC and reversal task in three conditions: healthy, Parkinsonian, and Huntington's BG. Fig 1 presents a schematic diagram of the nuclei and connections within the BG and their connections with the cortices. The model is described in detail in Materials and methods. The model received a stimulus (CS) that activates prefrontal cortical (PFC) neurons on all 500 trials. We say that the network chooses action 1 if the activity of the premotor cortical (PMC) neural group 1 exceeds the activity of PMC group 2 by 0.1. The comparison of the activity levels is made at the end of each trial. For reversal training, action 1 was rewarded in trials 1 through 199, and action 2 was rewarded in trials 200 through 500 instead. We analyze and compare the learning and reversal performance in the three model states below.
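The reward schedule used throughout the Results can be written as a one-line rule (Python sketch):

```python
def rewarded_action(trial):
    """Reward schedule for the reversal task: action 1 is rewarded on
    trials 1-199 and action 2 on trials 200-500 (1-indexed trials)."""
    return 1 if trial < 200 else 2
```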

Healthy BG facilitates learning of rewarded choices
Fig 2A shows the choices made in the simulations: higher activity of PMC1 manifests choice 1 and vice versa. The graph shows the activity at the end of each trial, which is taken to be 750 msec long. On early trials, the choice is made randomly due to random initial conditions in the PMC network and the mutual inhibition of PMC1 and PMC2. This reproduces the exploration phase, in which information about the reward is collected [78,79]. The modeled animal receives an unexpected reward every time it chooses action 1 (PMC1 on top). Within several trials, the system starts to consistently choose the rewarded action, although a few exploratory deviations may be made after that. This fast initial learning replicates experiments and is thought to occur so quickly because animals are first pretrained on a single-choice task (e.g., to press a single lever for reward). On trial 200, we switch the simulated task to reversal: action 2 is rewarded instead. This quickly reestablishes exploratory behavior and then locks the system to the newly rewarded choice, with occasional exploratory returns to choice 1. Fig 2E shows performance improvement over reversal learning that matches experiments [29]. As explained below, our model allows for a detailed analysis of the mechanism of this learning. Two mechanisms facilitate learning of the rewarded choice: one fast and one slow. The first mechanism is the potentiation of the PFC-to-striatum synaptic connections (Fig 2B). Since the animals are pre-trained on a single-choice task, they expect a reward, and a reward omission creates a negative RPE (unexpected punishment; Fig 3) encoded by SNc DA signaling that potentiates PFC connections to all D2-MSNs (Fig 2B). Importantly, whereas the DA signal itself is not selective for MSNs specific to the rewarded action, the DA-mediated potentiation of PFC-MSN synapses is selective.
What makes potentiation selective is the activation level of the corresponding MSN: in the initial trials, the reward is omitted when choice 2 is selected, that is, when PMC2 activity is greater, and, consequently, MSNs selective for choice 2 are activated more (due to static synaptic connections from PMC to MSNs specific for each choice). Since synaptic plasticity explicitly depends on the activity of the postsynaptic neuron, PFC-to-D2-MSN2 connections are potentiated much more strongly than PFC-to-D2-MSN1 connections (Fig 2B purple vs. yellow). Then, every choice that is not followed by the expected reward activates the corresponding indirect pathway (i.e., D2-MSN2), which excites the downstream GPi2 neurons and consequently inhibits PMC2 activity. This blocks the nonrewarded action and helps to lock the choice to the rewarded action.
Simultaneously, reward omission reduces the expected reward, and the next rewarded trial results in a positive RPE and leads to potentiation of the connections to D1-MSNs (Fig 2B blue). This further selectively activates the D1-MSNs responsible for action 1. The mechanism for this selectivity is the same: the reward is granted only if choice 1 is selected, that is, when PMC1 activity is greater, and, consequently, the MSNs corresponding to choice 1 are activated more (Fig 4). The increased activity of the D1-MSN1s selectively inhibits downstream GPi1 neurons and, consequently, disinhibits the PMC1 neural group (Figs 2A and 4). Thus, due to direct excitation from the PFC associated with the stimulus, the activation of D2-MSNs associated with choice 2 and of D1-MSNs associated with choice 1 increases. Co-activation of these two mechanisms is sufficient to lock the choice to the rewarded action. Note that the complex pattern of co-activation of the D1 and D2 MSN populations is in agreement with the recent literature [21,80].
During subsequent repetitions of the same trial, the PFC-MSN connection strength starts to decrease and approaches zero (Fig 2B trials 40 to 200). However, the persistence of the rewarded choice remains intact (Fig 2A). The mechanism for this is the growth of direct PFC-PMC1 connections (Fig 2C) via classical reward-independent Hebbian synaptic plasticity: the two neural groups are co-active most of the time. This transition from PFC-MSN to PFC-PMC connections as a robust supporting mechanism for the rewarded choice occurs after the number of repetitions exceeds approximately a hundred (Fig 2). In these later trials, the PFC-MSN connection strengths are decreased, but the choice remains locked to the rewarded action. Therefore, the model predicts that direct cortico-cortical connections are responsible for the choice of the rewarded action after long training.
We next analyzed the behavior of the model when we began rewarding a choice different from the one the model had previously been conditioned to make; this task is called reversal learning [81]. Beginning at trial 200, we rewarded the model for selecting the other action (choice 2). Thus, starting at trial 200, the model mimics omission of a reward (unexpected punishment) for selecting action 1. This punishment potentiates synaptic connections from the PFC to D2-MSNs associated with action 1 (D2-MSN1, Fig 2B yellow) and, slightly later, to D1-MSNs associated with action 2 (D1-MSN2, Fig 2B red). This engagement of both direct and indirect pathways offsets the model's bias for action 1 and quickly sends the model into another exploratory phase. As Fig 2A demonstrates, between trials 200 and 300 the model chooses randomly between the two actions. It is important to note that, in accordance with others' findings [82,83], this second exploratory phase lasts longer than the initial one. During reversal, the new potentiation of the PFC-MSN connections is not enough to effectively overcome the bias for the initially learned choice and ensure choosing the newly rewarded option. The reversal exploratory phase ends only when the PFC-PMC2 connections become as strong as PFC-PMC1 and remove the bias (Fig 2). Thus, the longer exploratory phase during reversal occurs because the model must first overcome its bias for the previously learned choice and then develop a new stimulus-choice 2 association. The lengths of the exploratory phases match experimental results [82,83].
After the onset of reversal learning, the system continues choosing option 1, even though it is not rewarded, due to the potentiated PFC-PMC1 connection. This generates a negative reward prediction error (Fig 3) and potentiates PFC connections to the D2-type neurons associated with action 1 (D2-MSN1; Fig 2B yellow). The potentiation of PFC connections to D1-MSN2 lags by several trials (Fig 2B, red), during which the exploratory phase begins and allows the new rewarded option to be found. Both the initial and the reversal learning engage the direct pathways at a greater strength than the indirect ones (Fig 2B, yellow and red) because the model incorporates a greater plasticity rate for the cortical connections to D1 than to D2 MSNs.

Mild parkinsonian BG: Impeded learning and spontaneous oscillations
In our Parkinsonian BG model, the indirect pathway is strengthened by parameter changes reflecting physiological data. Our simulations (Fig 5) show a drastic difference in the dynamics of the PMC neurons during initial learning and reversal in the model with mild-parkinsonian BG. During both phases, learning is severely impaired. First, the choice remains random for approximately the first 50 trials. Second, the model does not reliably choose the rewarded option even after this period, although the rewarded option is chosen on a much greater number of trials (Fig 5A blue above red in the initial learning and vice versa in the reversal). Third, the activity of the PMC neurons is reduced overall compared to that in the model with healthy BG, and the trial-to-trial variations of this activity are drastically increased, even when only trials with the same choice are considered.
The underlying dynamics of the synaptic weights is also significantly altered. During both initial learning and reversal, the activation levels of both the direct and indirect pathways (Fig 5B) are much lower than in the model with healthy BG (Fig 2B). The latter follows directly from the reduced SNc signaling (by 70%), which decreases the RPE and thus impedes potentiation of the PFC-MSN connections. Since both PMC neural groups are active at a similar level, both connections from the PFC are potentiated (Fig 5C), and the system does not develop a preference for the rewarded choice. After trial 50, the rewarded choice starts to prevail as the PFC-PMC connections come to reflect the preference for choice 1. However, the PFC-PMC1 connection does not reach the level achieved in the model with healthy BG (Fig 2C) within the 200 trials designated for initial learning. Hence, exploration between the choices persists for all 200 trials, and the prevalence of the rewarded choice requires the persistent activation of the PFC-MSN connections. Therefore, the model with mild-parkinsonian BG is capable of learning the choices, but the effective learning rate is much lower.
Reversal learning has been shown to be impaired in PD conditions [84-86]. In the model, the low levels of the PFC-PMC connections persist into the reversal phase and never reach the levels shown by the model with healthy BG, even though the plasticity rules for the PFC-PMC connections remain the same in both models. Therefore, our modeling predicts that the mild-parkinsonian BG does not allow for proper potentiation of the PFC-PMC connections, and this leads to impaired learning. Learning based on cortical synaptic potentiation simply reflects the choice frequency, because the PMC group responsible for the choice fires together with the PFC. One reason for the lack of proper potentiation is that the model with parkinsonian BG cannot maintain the rewarded choice. Experimentally, the inability to maintain the choice was observed in 6-OHDA-lesioned rats [86]. The model also reproduces perseveration of the previously correct choice, shown experimentally [86] to contribute to the low performance at the very beginning of reversal (trials 1-5). Interestingly, the reversal phase starts with activation of both indirect pathways simultaneously (Fig 5B, purple and yellow). This suppresses the activity of both PMC neural groups, blocks any choice (Fig 5E), and blocks changes in the PFC-PMC synaptic weights. Only after some 50 trials is the blocking signal for choice 2 removed (Fig 5 purple). The abrupt drop in the connection weight to the D2-MSN2 group is caused by a positive RPE following the choice of a rewarded option. The choice is made by chance, and most of the previous trials were not rewarded because PMC activity was suppressed and the probability of one PMC group significantly exceeding the activity of the other was low. Thus, the model with mild-parkinsonian BG predicts that the exploratory phase at the beginning of reversal learning is replaced by a blockade of any choice, and this further impedes learning.
Perhaps the most interesting change in the model with parkinsonian BG is the drastic increase in the trial-to-trial variability of the PMC neurons (Fig 5A). To explain the mechanism of this variability, we considered the within-trial dynamics of activity for all neural groups in the model. Fig 6 shows these dynamics for the PMC, GPe, and STN neural groups in the healthy vs. parkinsonian BG models. In the healthy case, activity levels come to an equilibrium, while in the parkinsonian case, they engage in persistent oscillations. The oscillations arise from the negative feedback loop that the BG, and in particular its indirect pathway, provides for the activity of each PMC neural group. Indeed, the static PMC to D2 MSN connections, which constitute this negative feedback, are stronger in the parkinsonian case (w_PMC-D2 in Table 3). The period of these oscillations is approximately 150 ms, corresponding to a frequency of about 6.7 Hz. No potentiation of the PFC-PMC and PFC-MSN connections within the ranges in Fig 5B and 5C suppresses the oscillations. Therefore, the simulations predict that the trial-to-trial variability of the PMC neurons in the model with parkinsonian BG is caused by robust within-trial oscillations in the activity of all neuron groups in the model. In order to model the impact of surgical interventions on performance and learning in PD, we performed additional simulations of the PD model in which the BG signal to the PMC was ablated from trial 150 until the end (Fig 7). This directly models GPi lesions, which were the first standard surgical treatment for PD, and also mimics DBS treatment, which is suggested to reduce GPi output (see Discussion). In this period, the variability of the PMC activity vanishes completely. Furthermore, the PFC-striatal connections no longer exert any influence on the choices, but the PFC-PMC connections are strong enough to lock the choice to the rewarded option, and the cortical connections increase further at a greater rate.
After the reversal on trial 200, however, the changed choice values go unnoticed by the system: the choice remains locked on the now-unrewarded option, and the cortical connections supporting this choice keep rising. In this state, behavior improves, but learning is impaired.

Grade 2 Huntington's disease BG state: Persistent exploratory behavior
Whereas the above parkinsonian case is associated with a strengthened indirect pathway, in Huntington's disease the connections in the indirect pathway become weaker (Table 4). The major difference from the healthy BG model is that the trial-to-trial dynamics of the PMC neural groups looks as if the exploratory phase never ends (Fig 8A). At the same time, we see from the synaptic weights (Fig 8B and 8C) that choice-reward contingencies are learned almost as effectively as in the healthy case (note the similarity with Fig 2), although the synaptic weights are somewhat lower. The synaptic weight dynamics is qualitatively similar to the healthy case because the plasticity rules stay the same. The differences are the persistence of the potentiated PFC-MSN connections for the duration of initial/reversal training, similar to the parkinsonian case, and the activation of the indirect pathway for choice 2 lingering at the beginning of the reversal phase (Fig 8B, purple). The former, however, is not a cause but rather a consequence of the continuous exploratory choices that bring no reward. Therefore, despite the efficacious learning (Fig 8C), choice behavior is impaired relative to control (Fig 8A).
The cause of the persistent exploratory phase is the positive PMC-BG feedback loop through the D1 MSNs, which is not balanced by the D2 MSN pathway. Thus, activation of the D2 MSN pathway cannot robustly stop the unwanted action. Indeed, an occasional increase in the activity of the PMC2 neural group, which represents a non-rewarded action, excites the corresponding D1 MSN group and, through inhibition of GPi2 activity, further increases the PMC2 activity (Fig 9). The reduced connectivity in the D2 MSN pathway makes the GPi neural activity the same for choices 1 and 2 (Fig 9) and excludes the BG from the competition between the choices. This leads to occasional choices of the non-rewarded option, and our simulations show that this behavior is robust with respect to growing PFC-PMC and PFC-MSN connections (Fig 8). Therefore, the lack of balance between the direct and indirect pathways in the model of Huntington's disease causes persistent random switching between rewarded and non-rewarded choices after both initial learning and reversal.
In order to model the impact of BG DBS or surgical interventions on performance and learning in HD, we also performed additional simulations of the HD model in which the BG signal to PMC was ablated from trial 100 until the end (Fig 10). The random switches between the choices cease shortly after, though not immediately at, the onset of the treatment. The response to the treatment is very similar to that in the PD case (Fig 7). In this period, the PFC-striatal connections no longer exert any influence on the choices, but the PFC-PMC connections are strong enough to lock the choice to the rewarded option. After the reversal on trial 200, however, the changed choice values go unnoticed by the system, the choice remains locked on the now-unrewarded option, and the cortical connections supporting this choice keep rising. Therefore, during DBS, or after surgical interventions ablating BG output, behavior improves, but learning is impaired in HD as well as in the PD state.

Fig 8. Trial-to-trial dynamics of PFC neural activity (A) and underlying dynamics of synaptic weights (B,C). The notation is the same as in Fig 2. (D) and (E) present performance at the beginning and the end of the initial learning and reversal, respectively. The performance scores were averaged over 10 simulated animals. https://doi.org/10.1371/journal.pone.0228081.g008

Discussion
Our model implements the cortico-BG-thalamo-cortical loop function in a standard 2-choice instrumental conditioning task. We have shown that potentiation of cortico-striatal synapses enables learning of rewarded options. However, these synapses later become redundant as direct connections between prefrontal and premotor cortices (PFC-PMC) are potentiated by Hebbian learning. The model shows that disease-related imbalances of the direct and indirect pathways in the BG impair learning, and it suggests that these imbalances may also impede choices that have been learned previously, in spite of BG redundancy for those choices.
Our model of the parkinsonian state reproduces several major behavioral and electrophysiological features documented experimentally. First, the overall PMC activity is diminished in the PD state, consistent with PD studies [68]. Further, the model predicts that this activity is lowest at the beginning of the reversal due to aberrant engagement of the indirect pathway, which may manifest as stronger bradykinesia and very low task performance scores. Reversal learning has been shown to be impaired in PD conditions [84][85][86]. Perseveration of the previously correct choice and impairment in maintaining the new choice have been shown experimentally to contribute to the low performance [86], and our model reproduces these components as well. We have not found experimental evidence for the prediction of stronger bradykinesia. However, PD physiology is diverse, and such bradykinesia may be evident at its more advanced stages. We tested the model for a range of parameters, and the duration of the choice blockade increases gradually as the model transitions from the healthy to the parkinsonian state. Additionally, the blockade may be interrupted by fluctuations in neuronal activity, and perturbations such as changes in the environment or forced-choice trials would end the blockade phase. Second, the model shows robust oscillations in the activity of the cortico-BG-thalamo-cortical loop in the PD state. The frequency of these oscillations is about 6 Hz, which is in the theta band. An increase in the EEG theta band is a marker of PD-related cognitive decline [87,88]. The oscillations are generated by a negative feedback branch of the loop through the indirect pathway, as suggested before [40,89]. The hyperdirect pathway also contributes to this negative feedback and may support oscillations. Our simulations show that the oscillations cause multiple choice errors and, consequently, impede task performance and learning.
Parkinsonian-state oscillations in the BG, although in the beta band and caused by a different mechanism, have been suggested to affect decision making by another model [90].
In the HD state, our model displays persistent, randomly occurring choices of the unrewarded option, especially frequent after the reversal. This would register as impaired learning in behavioral tests, which is consistent with experimental results for cognitive [91,92] and motor tasks [93,94] in HD patients in the early stages of the disease. Furthermore, the model suggests that performance on previously learned tasks is also reduced, by approximately 20%. Therefore, our model reproduces impairments of previously learned actions documented in BG-affecting diseases like PD and HD, as well as after certain BG lesions [5,25,95]. However, surgical and DBS interventions in PD and HD patients do not impair, but rather restore, motor function [26][27][28][96]. How can these two lines of evidence be reconciled? Learning in the model consists of two phases: BG-based and cortex-based. In the faster BG-based phase, the connections from PFC to MSNs are potentiated according to the RPE signal. The BG output inhibits choices with negative RPE and disinhibits those with positive RPE. Once the behavior is learned, the RPE vanishes, and the PFC-MSN connections decay to zero. Future choices are supported by the slower cortex-based learning phase: the connections from PFC directly to PMC are potentiated by the Hebbian mechanism. Our simulations show that, even after the cortico-cortical connections increase to levels ensuring robust choice of the rewarded option in the healthy state, both disease models are unable to make robust choices. Thus, behaviors that no longer need the BG are impaired. In accord with this result, 6-OHDA-lesioned rodents cannot maintain the correct choice, especially after reversal [86]. The model shows that it is the abnormal BG output that impairs the choices. Indeed, the BG output to the PMC does not vanish even when the behavior is learned and the BG no longer receives any RPE signal.
In this case, due to the inputs from the PMC, the healthy BG disinhibits the previously learned choice, i.e., it conforms with the PFC-PMC associations. This disinhibitory function is impaired in both PD and HD, as well as after striatal lesions [5,25,95]. According to this prediction, disruption of the BG output would improve performance on previously learned tasks. Indeed, our model of a lesion of BG output demonstrates strengthened performance on previously learned choices. Consistent with this, GPi lesions were predominantly used in early surgical treatments of PD and are sometimes still used [97]. Additionally, DBS has been successfully used in PD [26][27][28] and tested in HD patients [96]. The mechanism of DBS is not fully known, but it is thought to functionally lesion the excitatory input to GPi from STN and reduce GPi activity, either by synaptic depletion or plasticity [40,98]. Therefore, our model explains how GPi lesions that abolish BG output, or DBS that reduces the impact of this output, restore previously learned behaviors that were lost due to disrupted BG function; however, this comes at the expense of decreased cognitive flexibility. A similar solution was suggested in an extensive computational study by Scholl and colleagues [36,41,99]. However, our model also combines the functional alterations with aberrant neural oscillations in PD.
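The two-phase learning scheme can be caricatured by a toy two-choice bandit (a drastic simplification of the rate model; the learning rates, decay constant, and softmax gain below are hypothetical): a fast "striatal" weight is potentiated by RPE and decays once reward is fully predicted, while a slow "cortical" weight grows with choice frequency and eventually carries the behavior on its own.

```python
import math
import random

random.seed(0)
w_str = [0.0, 0.0]   # fast PFC-MSN-like weights: RPE-gated, decaying
w_ctx = [0.0, 0.0]   # slow PFC-PMC-like weights: Hebbian, track choice frequency
V = 0.0              # critic: expected reward under the current policy
peak_striatal = 0.0  # largest value reached by the striatal route

for trial in range(500):
    q0, q1 = w_str[0] + w_ctx[0], w_str[1] + w_ctx[1]
    p0 = 1.0 / (1.0 + math.exp(-5.0 * (q0 - q1)))   # softmax over two options
    choice = 0 if random.random() < p0 else 1
    reward = 1.0 if choice == 0 else 0.0            # option 0 is rewarded
    rpe = reward - V                                # reward prediction error
    V += 0.1 * rpe
    w_str[choice] += 0.3 * rpe                      # fast RPE-driven potentiation
    w_str = [0.95 * w for w in w_str]               # decays once RPE vanishes
    w_ctx[choice] += 0.01 * (1.0 - w_ctx[choice])   # slow bounded Hebbian growth
    peak_striatal = max(peak_striatal, w_str[0])
```

Over training, the striatal weight rises first and then decays back toward zero as the RPE vanishes, while the cortical weight saturates near its bound and keeps the choice locked on the rewarded option, mirroring the hand-off from BG-based to cortex-based control described above.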
The combination of the two learning mechanisms has been proposed and explored previously, both experimentally and computationally [29,30,42]. Such combinations have been shown to be essential for cortical sensorimotor control, have explained how reinforcement learning can shape cortical plasticity, and have been used in brain-machine interfaces [100][101][102]. Here, we demonstrated how cortical learning can be indirectly disrupted in PD and HD conditions. The three types of dynamics (healthy-, PD- and HD-like behavior) persist over wide ranges of parameters in the model, whereas specific quantitative features, such as performance scores, show gradual parameter dependence. We tested multiple model manipulations, such as ablation of the hyperdirect pathway or of the STN-GPe feedback pathway, to verify model robustness and the mechanisms supporting the dynamical properties of the model (data not shown). We showed that, in pathological states, ablation of the BG output may reveal hidden cortical learning and drastically improve performance. Cortical learning simply reflects the average of past choices, regardless of reward. If the switch to reversal occurs much earlier, the model predicts that cortical learning is not sufficiently engaged, and reversal takes fewer trials. By contrast, the model predicts that after a sufficiently large number of rewarded trials, cortical learning may lock the choice even if it becomes non-rewarded, or even punished. Such aversion-resistant behavior is shown in substance abuse disorders [103,104]. The ability of the system to stop an unwanted behavior depends on the strength of the cortical vs. BG inputs to the PMC. To avoid aversion-resistant behavior for non-addictive, natural reinforcers, it is necessary to assume that cortico-cortical synaptic plasticity is further limited to the low values achieved in our simulations.
Homeostatic mechanisms that counteract Hebbian potentiation are plentiful [105], and malfunction of these mechanisms may, therefore, lead to aversion-resistant behaviors.
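The dependence of reversal on the length of initial training, including the aversion-resistant locking discussed above, can be sketched with a deterministic (greedy-choice) variant of the same toy scheme; again, all learning rates and decay constants are hypothetical, not the paper's fitted parameters:

```python
def perseveration(n_initial, max_reversal=300):
    """Count post-reversal trials on which a greedy agent keeps the old choice."""
    w_str = [0.0, 0.0]   # fast RPE-gated striatal route (flexible)
    w_ctx = [0.0, 0.0]   # slow Hebbian cortical route (tracks choice frequency)
    V = 0.0              # expected reward
    for trial in range(n_initial + max_reversal):
        rewarded = 0 if trial < n_initial else 1    # reward switches at reversal
        q = [w_str[0] + w_ctx[0], w_str[1] + w_ctx[1]]
        choice = 0 if q[0] >= q[1] else 1           # greedy choice
        r = 1.0 if choice == rewarded else 0.0
        rpe = r - V
        V += 0.1 * rpe
        w_str[choice] += 0.1 * rpe                  # fast, sign-sensitive update
        w_str = [0.95 * w for w in w_str]           # decays once reward predicted
        w_ctx[choice] += 0.002 * (1.0 - w_ctx[choice])  # slow Hebbian growth
        if trial >= n_initial and choice == 1:
            return trial - n_initial                # switched to the new option
    return max_reversal                             # never switched: locked

after_moderate = perseveration(150)   # moderate initial training
after_extensive = perseveration(600)  # extensive initial training
```

In this sketch, after moderate initial training the accumulating negative RPE overcomes the cortical bias within a few trials, whereas after extensive training the Hebbian weight exceeds the largest deficit the decaying striatal weight can produce, so the old choice is never abandoned, an analogue of the predicted aversion-resistant locking.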
On the other hand, our model will forget a choice that was last rewarded tens of trials in the past, due to the decay of the cortico-striatal connections. While there may be situations in which rare decisions are kept in memory, which are not reproduced by the model, these situations are probably maintained by other memory systems (e.g., emotional memory). Other limitations of our model are mostly related to the firing rate formalism. First, the formalism does not allow for direct modeling of the eligibility traces necessary for stimulus-reward association, and we account for that by PFC activity that persists for the duration of the trial. Second, the D1 and D2 MSNs in the model reproduce their balanced excitation by cortical inputs in vivo [21,53]. The balance is supported by a number of complex mechanisms, from differential effects of DA on the excitability of the D1 and D2 MSNs [54] to their lateral inhibition and the contribution of striatal fast-spiking interneurons [32]. These mechanisms are very hard to implement in a rate model, and we calibrate the D1 and D2 MSN equations identically to reflect the balance. The balance is perturbed in the PD DA-depleted state (see Methods). Third, reciprocal inhibition between actions is implemented only at the PMC level and omitted at the GPe level, as STN-GPe network dynamics was shown to be only weakly dependent on this inhibition [45,46]. Fourth, the model does not differentiate between the premotor cortex and the thalamus because their interaction is a complex topic that merits separate study. Finally, the oscillations that the model shows in the PD state are highly regular in spite of the noise added to all model components. The firing rate model is defined in terms of averaged firing rates of neural populations and, therefore, does not aim to reproduce all the noise. Instead, it reproduces the robust, deterministic dynamical mechanisms that determine signal processing in these networks.
The BG is suggested to be one of the main brain structures that determine action selection in multiple tasks and contexts. Accordingly, BG dysfunction has been linked to a broad spectrum of diseases, from Parkinson's disease to drug abuse. Traditionally, research efforts on these diseases have been disconnected from one another, even though they concern the same circuitry. Combining these efforts, in particular through modeling, will give us a more comprehensive picture of the mechanisms involved in action selection at different levels of the brain circuitry. Modeling such complex mechanisms requires connecting multiple brain regions, both cortical and subcortical. As a future direction, this model will be used as a building block in simulations of this circuitry. In particular, separating the dorsomedial and dorsolateral striatal circuits (and, correspondingly, the cortical regions that project to them) will allow one to address the development of goal-directed and habitual behavior in simulations, as the two parts of the striatal circuitry are associated with these two distinct types of behavior [106]. One can further separate thalamic and cortical circuits to take into account the contribution of their interaction to action selection [107]. Finally, the model may be generalized to more complex tasks with multiple stimulus-response mappings. Thus, the simplicity of our model allows for qualitative explanation of mechanisms and, simultaneously, for building large-scale models that involve multiple brain regions.
Altogether, we have modeled the function of the cortico-BG-thalamo-cortical loop in a 2-choice instrumental conditioning task and shown that imbalance of the direct and indirect pathways is the mechanism by which this function is disrupted in HD and PD conditions. The model predicts that, after long training, direct cortico-cortical connections are responsible for the choices, and the cortico-BG-thalamo-cortical loop conforms to previously learned choices. The model also predicts that reversal is easier to achieve after short training of the initial contingency and may be greatly impeded after a very large number of repetitions of the initially rewarded choice. We have predicted how, in pathological states, when the BG impedes these choices, GPi lesion or DBS restores them but completely disrupts learning of new behavior. Along with other computational studies [36,98,99], our results further reconcile the apparent contradiction between the critical involvement of the BG in execution of previously learned actions and yet no impairment of these actions after BG output is ablated by lesions or reduced by DBS.