Emergence of Functional Hierarchy in a Multiple Timescale Neural Network Model: A Humanoid Robot Experiment

doi:10.1371/journal.pcbi.1000220

Figure 1.

Schematic drawings of (A) local representation model and (B) multiple timescale model.

(A) Curves colored red, blue, and green represent sensori-motor sequences corresponding to motor primitives. The output of the system consists of behavior sequences made up of combinations of these primitives. In the local representation model, functional hierarchy is realized through an explicit hierarchical structure: local modules in the lower level represent motor primitives, and a higher module represents the order of motor primitives, which are switched via additional mechanisms such as gate selection. (B) In the multiple timescale model, primitives are represented by fast context units whose activity changes quickly, whereas sequences of primitives are represented by slow context units whose activity changes slowly.

Figure 2.

Task design.

(A) A humanoid robot was fixed to a stand. In front of the robot, a workbench was set up, and a cubic object (approximately 9×9×9 cm) was placed on the workbench to serve as the goal object. The task for the robot was to autonomously generate five different types of behavior: (1) move the object up and down three times, (2) move the object left and right three times, (3) move the object backward and forward three times, (4) touch the object with one hand, and (5) clap hands three times. For each behavior, the robot began from the home position and ended at the same home position. (B) For each behavior other than the clapping hands task, the object was located at five different positions (positions 1–5). Since the clapping hands behavior was independent of the location of the object, the object was located at the center of the workbench (position 3) and was never moved for this task.

Figure 3.

System overview.

(A) Action generation mode. Inputs to the system were the proprioception m̂_t and the vision sense ŝ_t. Based on the current m̂_t and ŝ_t, the system generated predictions of the proprioception m_t₊₁ and the vision sense s_t₊₁ for the next time step. The predicted proprioception m_t₊₁ was sent to the robot in the form of target joint angles, which acted as motor commands for the robot in generating movements and interacting with the physical environment. Changes in the environment were sent back to the system as sensory feedback. The main components of the system were modeled by the CTRNN, which is made up of input-output units and context units. Context units were divided into two groups based on the value of the time constant τ: a group of fast context units (τ = 5) and a group of slow context units (τ = 70). Every unit of the CTRNN is connected to every other unit, including itself, with the exception of input units, which do not have a direct connection to the slow context units (see Method). (B) Training mode. In the training process, the network generates behavior sequences based on the synaptic weights at a certain moment during the learning process. Synaptic weights are updated based on the error between the generated predictions (m_t₊₁, s_t₊₁) and the teaching signals (m*_t₊₁, s*_t₊₁). In training mode, the robot did not interact with the physical environment. Instead of actual sensory feedback, predicted proprioception and vision served as the input for the following time step (mental simulation). Through this mental simulation process, the network was able to autonomously reproduce behavior sequences without producing actual movements. In addition to virtual sensory feedback, in order to accelerate convergence, a small amount of the teaching signal of the previous time step (m*_t₊₁, s*_t₊₁) was also mixed into (m_t₊₁, s_t₊₁) (see Method for details). In both generation mode and training mode, the initial state of the slow context units was set according to the task goal.
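
The effect of the two time constants, and of mixing a little teaching signal into the fed-back prediction, can be illustrated with a minimal sketch. It assumes the common discrete-time leaky-integrator form of a CTRNN update; the actual equations are deferred to the Method section and are not reproduced in this caption. The function names `leaky_update` and `mix_feedback` and the mixing weight `alpha` are hypothetical and chosen only for illustration.

```python
import math

TAU_FAST = 5.0   # time constant of the fast context units (from the caption)
TAU_SLOW = 70.0  # time constant of the slow context units (from the caption)


def leaky_update(u, drive, tau):
    """One leaky-integrator step (assumed form, not quoted from the paper).

    A large tau keeps most of the old state, so the unit changes slowly.
    """
    return (1.0 - 1.0 / tau) * u + drive / tau


def sigmoid(u):
    """Logistic activation, used here as a stand-in for the unit nonlinearity."""
    return 1.0 / (1.0 + math.exp(-u))


def mix_feedback(prediction, teaching, alpha):
    """Closed-loop training input: mostly the network's own prediction,
    plus a small share `alpha` (hypothetical value) of the teaching signal."""
    return [(1.0 - alpha) * p + alpha * t for p, t in zip(prediction, teaching)]


# Drive one fast and one slow unit with the same constant input for 10 steps.
u_fast = u_slow = 0.0
for _ in range(10):
    u_fast = leaky_update(u_fast, 1.0, TAU_FAST)
    u_slow = leaky_update(u_slow, 1.0, TAU_SLOW)

y_fast, y_slow = sigmoid(u_fast), sigmoid(u_slow)
print(f"fast: u={u_fast:.3f}  slow: u={u_slow:.3f}")
```

Under this assumed update, the fast unit has covered most of the distance to its steady state after 10 steps while the slow unit has barely moved, which is the property the caption attributes to the two groups of context units.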

Table 1.

Learning error and robot performance for the basic pattern training.

Figure 4.

Example of behavior sequence for up-down behavior.

Proprioception (first row), vision (second row), sparsely encoded RNN activation (third row), and fast and slow context activation (fourth and fifth rows) are shown for the teaching signal (left column), mental simulation of the trained network (center column), and actual sensory feedback in the physical environment (right column) during up-down behavior at position 3. For proprioception, 4 out of a total of 8 dimensions are plotted (full line: left arm pronation, dashed: left elbow flexion, dot-dashed: right shoulder flexion, dotted: right arm pronation). For vision, the two lines correspond to the relative position of the object (full line: X-axis, dashed line: Y-axis). Values for proprioception and vision were mapped to the range from 0.0 to 1.0. CTRNN outputs are sparsely encoded. In both the CTRNN output and context activation graphs, each position on the y axis corresponds to one unit among the output units or context units. A long horizontal rectangle thus indicates the activity of a single neuron over many time steps. The first 64 output units correspond to proprioception and the last 36 output units correspond to vision. Colors of rectangles indicate activation level, as indicated in the color bar at the lower right. Reach: reach for the object, UD: up-down, Home: return to the home position.

Figure 5.

Examples of behavior sequences for the other basic behaviors.

Proprioception, vision, and fast and slow context activation values of the teaching signal, as well as actual values in the physical environment, are shown for the left-right (LR: first column), backward-forward (BF: second column), touch with single hand (Touch: third column), and clapping hands (Clap: fourth column) behaviors at position 3. Correspondences for line types in each graph are the same as in Figure 4. Reach: reach for the object, Home: return to the home position.

Figure 6.

Changes in context state space associated with changes in object position.

Changes in context activation during each behavior at every position are shown in a two-dimensional space obtained by PCA. The four graphs on the left side and the single graph on the right side correspond to fast context activities and slow context activities, respectively. State changes of the fast context units for each behavior exhibited a particular structure that shifted with the object position. In contrast, activity of the slow context units for a particular behavioral task exhibited very little location-dependent variation. UD: up-down, LR: left-right, BF: backward-forward, and Touch: touch with single hand.
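
As an illustration of the kind of projection used for these plots, the following sketch reduces a set of activation vectors to their first two principal components. The caption does not say how the PCA was computed, so the method here (power iteration with deflation on the covariance matrix) and the function names are assumptions, and the input rows are toy values rather than recorded context activations.

```python
import math


def center(rows):
    """Subtract the per-dimension mean from every row."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - mean[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_eigenpair(mat, iters=200):
    """Power iteration for the dominant eigenpair of a symmetric PSD matrix."""
    d = len(mat)
    start = [float(k + 1) for k in range(d)]  # arbitrary non-zero start vector
    s_norm = math.sqrt(sum(x * x for x in start))
    v = [x / s_norm for x in start]
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left in any direction
            break
        v = [x / norm for x in w]
    eigval = sum(v[i] * sum(mat[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigval, v


def project_2d(rows):
    """Project rows onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    l1, pc1 = leading_eigenpair(cov)
    # Deflate: remove PC1 so the next dominant direction is PC2.
    deflated = [[cov[i][j] - l1 * pc1[i] * pc1[j] for j in range(d)]
                for i in range(d)]
    _, pc2 = leading_eigenpair(deflated)
    return [[sum(r[k] * pc1[k] for k in range(d)),
             sum(r[k] * pc2[k] for k in range(d))] for r in centered]


# Toy 3-dimensional "activations" (placeholders, not data from the experiment).
toy_rows = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
toy_projection = project_2d(toy_rows)
```

In practice each row would be the activation of all fast (or all slow) context units at one time step, and the two output columns would be the plotted coordinates.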

Figure 7.

Example of behavior sequence for novel combinations of motor primitives.

Proprioception, vision, and fast and slow context activation values of the teaching signal, as well as actual values in the physical environment, are shown for two novel behaviors at position 3. The first behavior (left column) consists of moving the object up and down three times, then moving the object left and right three times, and finally returning to the home position. The second behavior (right column) consists of moving the object backward and forward three times, then touching the object with one hand, and finally returning to the home position. Correspondences for line types in each graph are the same as in Figure 4. UD: up-down, LR: left-right, BF: backward-forward, and Touch: touch with single hand. Reach: reach for the object, Home: return to the home position.

Figure 8.

Primitive representations in fast context units before and after additional training.

Changes in context activation during each movement before and after additional training are visualized in a two-dimensional space obtained by PCA (plotted only for position 3). The four graphs on the left side and the two graphs on the right side correspond to representations before and after additional training, respectively. The first and second movements in the novel sequences learned through additional training are colored red (UD and BF) and green (LR and Touch), respectively. The structure of the representations corresponding to each primitive was preserved even after additional training, indicating that motor primitives were represented in the dynamics of the fast context units, with novel behavior sequences constructed out of combinations of these primitives. UD: up-down, LR: left-right, BF: backward-forward, and Touch: touch with single hand.

Figure 9.

Effects of multiple timescales.

Learning errors for basic pattern training and novel pattern training are shown for various values of the slow context time constant. Differences in timescale are described by the ratio of the τ values of the slow and fast context units (τ-slow/τ-fast). Bars in the graph correspond to mean values over 5 learning trials for each parameter setting. Error bars indicate standard deviation. Asterisks indicate significant differences in mean values between the standard setting (τ-ratio = 14.0) and other settings. The significance of these differences was examined using a randomized test. In both basic pattern training and additional training, performance for small τ-ratios was significantly worse than for the standard setting. These results suggest that multiple timescales in the fast and slow context units were an essential factor leading to the emergence of hierarchical functional differentiation.
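
The standard τ-ratio follows directly from the time constants given for Figure 3 (70 / 5 = 14.0). The caption does not describe the randomized test in detail, so the sketch below shows one plausible reading: an exact two-sided permutation test on the difference of means between two groups of 5 trials. The function name `permutation_p_value` and the error values are hypothetical placeholders, not results from the paper.

```python
from itertools import combinations

TAU_FAST, TAU_SLOW = 5.0, 70.0
tau_ratio = TAU_SLOW / TAU_FAST  # 14.0, the standard setting named in the caption


def permutation_p_value(a, b):
    """Exact two-sided permutation test on the difference of group means.

    Enumerates every way of relabelling the pooled values into groups of the
    original sizes and counts relabellings at least as extreme as observed.
    """
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    observed = abs(sum(a) / n_a - sum(b) / n_b)
    count = extreme = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        count += 1
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / count


# Placeholder learning errors for 5 trials each (illustrative only).
errors_standard = [1.0, 2.0, 3.0, 4.0, 5.0]
errors_small_ratio = [11.0, 12.0, 13.0, 14.0, 15.0]
p_value = permutation_p_value(errors_standard, errors_small_ratio)
```

With 5 trials per setting there are only 252 possible relabellings, so exhaustive enumeration is cheap and avoids the sampling noise of a Monte Carlo randomization.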

Figure 10.

System details.

The main part of the system is the CTRNN. The total number of CTRNN units was 180. The first 100 units (indices i = 1‥100) correspond to input-output units (O). Among the input-output units, the first 64 units (indices i = 1‥64) correspond to proprioceptive inputs (M), whereas the last 36 units (indices i = 65‥100) correspond to vision inputs (S). The remaining 80 units (indices i = 101‥180) correspond to the context units. Among the context units, the first 60 units (indices i = 101‥160) correspond to the fast context units (Cf), and the last 20 units (indices i = 161‥180) correspond to the slow context units (Cs). Inputs to the system were the proprioception m̂_t and the vision sense ŝ_t, which were transformed into sparsely encoded vectors using topology preserving maps (TPM, Equation 3), one map corresponding to proprioception (TPMm) and one map corresponding to vision (TPMs). The 100-dimensional vector produced by the TPMs (p_i,t) and the previous activation levels of the context units (y_i,t₋₁) are set as the neural states x_i,t (Equation 7). The membrane potential (u_i,t) and activation (y_i,t) of each unit are calculated using Equation 5 and Equation 6, respectively. Outputs of the CTRNN (y_i,t, i∈O) are transformed into 10-dimensional vectors (m_t₊₁ and s_t₊₁) using the inverse computation of the TPM (iTPM, Equation 4). These 10-dimensional vectors correspond to predictions of the proprioception m_t₊₁ and the vision sense s_t₊₁ for the next time step. The predicted proprioception m_t₊₁ was sent to the robot in the form of target joint angles, which acted as motor commands for the robot in generating movements and interacting with the physical environment. Changes in the environment resulting from this interaction were sent back to the system in the form of sensory feedback. In training, the output of the CTRNN (y_i,t, i∈O) is compared with the desired output y*_i,t calculated from the target sensori-motor states m*_t₊₁ and s*_t₊₁ using the same TPMs.
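
The unit layout described above can be written down as index ranges, together with a connectivity mask for the one stated exception to full connectivity. This is a sketch under assumptions: indices are 0-based here (the caption uses 1-based indices), the names are hypothetical, and the exception "input units do not have a direct connection to the slow context units" is read as blocking both directions, which the captions do not state explicitly.

```python
N_UNITS = 180

# 0-based ranges; the caption's 1-based indices are given in the comments.
IO = range(0, 100)        # input-output units O   (i = 1..100)
PROPRIO = range(0, 64)    # proprioceptive units M (i = 1..64)
VISION = range(64, 100)   # vision units S         (i = 65..100)
FAST = range(100, 160)    # fast context units Cf  (i = 101..160)
SLOW = range(160, 180)    # slow context units Cs  (i = 161..180)


def connection_mask():
    """Return mask[i][j], True when a weight from unit j to unit i is allowed.

    Assumption: input-output units and slow context units are not directly
    connected in either direction; every other pair, including self
    connections, is allowed.
    """
    mask = [[True] * N_UNITS for _ in range(N_UNITS)]
    for s in SLOW:
        for o in IO:
            mask[s][o] = False  # no direct path from input-output to slow context
            mask[o][s] = False  # no direct path from slow context to input-output
    return mask


mask = connection_mask()
n_allowed = sum(sum(row) for row in mask)
print(f"{n_allowed} of {N_UNITS * N_UNITS} connections allowed")
```

Under this reading, the slow context units can influence the output only through the fast context units.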
