Figure 1.
Block diagram of the symbiotic BMI controller.
The architecture contains two key components. The Actor is driven by the primary motor cortex (st), and its primary role is to select actions (ai) in the environment. These actions are evaluated by the Critic, which is driven by the NAcc. At each instant in time, the Critic provides an error signal (εt) that is computed from the gradient of reward expectation (vt) and is used to adapt the parameters of the Actor so that it chooses actions that lead to reward. Throughout this system, there is an intrinsic coupling between the motor system, the reward system, and the environment.
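The Actor-Critic loop described in this caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the error is the discounted difference of successive reward expectations and that the Actor is a single linear layer. The names `td_error`, `actor_step`, and `update_actor`, as well as the parameters `gamma`, `lr`, and `explore_prob`, are hypothetical.

```python
import numpy as np

def td_error(v_prev, v_curr, gamma=0.9):
    """Error from successive reward expectations (assumed form: gamma*v_t - v_{t-1})."""
    return gamma * v_curr - v_prev

def actor_step(weights, state, rng, explore_prob=0.1):
    """Select an action index from linear action values (hypothetical Actor)."""
    values = weights @ state                      # one value per action
    if rng.random() < explore_prob:
        action = int(rng.integers(len(values)))   # occasional random exploration
    else:
        action = int(np.argmax(values))           # otherwise pick the best-valued action
    return action, values

def update_actor(weights, state, action, error, lr=0.05):
    """Strengthen or weaken the chosen action's weights in proportion to the error."""
    new_weights = weights.copy()
    new_weights[action] += lr * error * state
    return new_weights

# Usage: 4 actions driven by a 3-dimensional motor state vector.
rng = np.random.default_rng(0)
w = np.zeros((4, 3))
s = np.array([1.0, 0.5, 0.0])
a, _ = actor_step(w, s, rng)
w = update_actor(w, s, a, td_error(v_prev=0.2, v_curr=0.6))
```

A positive error raises the value of the action just taken for the current motor state, and a negative error lowers it, which is the sense in which the Critic's signal adapts the Actor.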
Figure 2.
Conceptual diagram of reward expectation modulation of the user based on IA actions.
The temporal structure of NAcc neuronal activity indicates the expectation of reward or aversion in goal-directed tasks. The Critic must interpret this activity and transform it into a scalar error signal.
Table 1.
Adaptation algorithm of the Actor structure.
Figure 3.
Stereotaxic neurosurgical methods were used to target the NAcc and M1.
In experiments involving simultaneous recording of NAcc and M1, a dual electrode array was implanted.
Figure 4.
Top view of the animal behavioral box.
A nose poke into the IR beam initiated the random selection of a target lever, cued by a light (LED). The animal had up to 4 seconds to press a lever. If the correct lever was pressed, a water reward was delivered.
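The trial rule in this caption can be written as a small decision function. This is only a sketch: the function name, the outcome labels, and the handling of a wrong or missing press are assumptions, since the caption specifies only the 4-second limit and the reward for a correct press.

```python
def trial_outcome(cued_lever, pressed_lever, latency_s, timeout_s=4.0):
    """Classify one trial of the two-lever task (outcome labels are hypothetical)."""
    # No press, or a press after the allowed window, counts as a timeout.
    if pressed_lever is None or latency_s > timeout_s:
        return "timeout"
    # Water reward is delivered only for the cued lever.
    return "reward" if pressed_lever == cued_lever else "no_reward"

# Usage: a left cue answered with a left press after 1.2 s.
outcome = trial_outcome("left", "left", 1.2)
```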
Figure 5.
Perievent time histogram of 3 representative NAcc neurons.
Dual-nonselective neurons (firing decreases after the cue for both targets) for (A) left and (B) right trials; dual-selective neurons (increase and decrease firing for both targets) for (C) left and (D) right trials; and uni-selective neurons (firing decreases for one target and stays constant for the other) for (E) left and (F) right trials.
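For readers unfamiliar with the perievent time histogram, the following sketch shows one common way to compute it: spike times are aligned to each event (here, the cue), binned, and averaged across trials to give a firing rate. The window, bin width, and function name are illustrative assumptions, not values taken from the study.

```python
import numpy as np

def peth(spike_times, event_times, window=(-1.0, 2.0), bin_width=0.1):
    """Trial-averaged firing rate (spikes/s) around events, with times in seconds."""
    if len(event_times) == 0:
        raise ValueError("at least one event is required")
    spike_times = np.asarray(spike_times, dtype=float)
    n_bins = int(round((window[1] - window[0]) / bin_width))
    edges = window[0] + bin_width * np.arange(n_bins + 1)
    counts = np.zeros(n_bins)
    for t in event_times:
        # Re-reference spikes to this event; spikes outside the window are ignored.
        counts += np.histogram(spike_times - t, bins=edges)[0]
    rate = counts / (len(event_times) * bin_width)
    return edges[:-1], rate

# Usage: one spike 50 ms after each of two cues, in a 200 ms window.
bin_starts, rate = peth([0.05, 1.05], [0.0, 1.0], window=(0.0, 0.2), bin_width=0.1)
```

Computing the histogram separately for left and right trials, as in panels A to F, only requires passing the corresponding subset of cue times as `event_times`.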
Figure 6.
Decoding performance during sequential presentation of the targets in the four-target configuration.
(A) Sequential presentation of 4 targets, as indicated by red stems. Blue stems indicate whether the target was acquired (1) or missed (0). Note that when a new target is introduced, the performance decreases, but it recovers within a few trials. (B) Temporal sequence of action values. Each colored trace represents the value of one action (i.e., up, down, left, right). Note that for each target, only certain actions have high value, since they are required to acquire the target. (C) Weight values for the output layer of the Actor. Each colored trace corresponds to an individual weight. Note that when a new target is introduced, the weights adapt and then plateau once the performance improves. (D) The temporal difference error becomes maximally positive when the targets are acquired.
Table 2.
Decoding performance during sequential target acquisition.
Figure 7.
Neural tuning map of the synthetic M1 neurons.
(A) before and (B) after reorganization. Here ‘hot’ colors indicate maximal firing.
Figure 8.
Network adaptation after reorganization of the tuning map.
At trial 100, the reorganization was imposed. (A) Performance degradation, indicated by the blue stems at (0), was observed at trial 100. However, with adaptation, perfect performance was regained at trial 200. (B–C) The decrease in performance was matched by adaptation of the action values and weight values to compensate for the neural changes. (D) Evaluative feedback, shown here as the error, modulates more frequently when performance is poor and stabilizes once performance is regained.
Figure 9.
Actor-Critic decoding performance in navigating the robot to the target based on M1 neural activity.
Target acquisition performance and robot trajectory in 3D space during adaptation of the Actor using (A–B) evaluative feedback extracted from the NAcc and (C–D) random values as evaluative feedback.
Figure 10.
Actor's parameter adaptation during closed-loop control.
(A) Cumulative reward over time. (B) Action values computed at the output layer of the Actor. Each color represents the value of a specific action. Here, red corresponds to the action that navigates the robot in a direct path to the target. (C) Output of the 3 hidden-layer processing elements of the Actor. Larger adaptation of the values occurs before the “knee” of the cumulative reward curve. After the “knee”, the system parameters stabilize their relative values, indicating consolidation of the performance.