Maximization of Learning Speed in the Motor Cortex Due to Neuronal Redundancy

Many redundancies play functional roles in motor control and motor learning. For example, kinematic and muscle redundancies contribute to stabilizing posture and impedance control, respectively. Another redundancy lies in the number of neurons themselves: there are overwhelmingly more neurons than muscles, and many combinations of neural activation can generate identical muscle activity. The functional roles of this neuronal redundancy remain unknown. Analysis of a redundant neural network model makes it possible to investigate these functional roles while varying the number of model neurons and holding the number of output units constant. Our analysis reveals that learning speed reaches its maximum value if and only if the model includes sufficient neuronal redundancy. This analytical result does not depend on whether the distribution of preferred directions is uniform or skewed and bimodal, both of which have been reported in neurophysiological studies. Neuronal redundancy maximizes learning speed even if the neural network model includes recurrent connections, a nonlinear activation function, or nonlinear muscle units. Furthermore, our results do not rely on the shape of the generalization function. These findings suggest that one of the functional roles of neuronal redundancy is to maximize learning speed.


Introduction
In the human brain, numerous neurons encode information about external stimuli, e.g., visual or auditory stimuli, and internal stimuli, e.g., attention or motor planning. Each neuron exhibits different responses to stimuli, but neural encoding, especially in the visual and auditory cortices, can be explained by the maximization of stimulus information [1][2][3]. This maximization framework can also explain learning that occurs when the same stimuli are repeatedly presented; previous neurophysiological experiments have suggested that perceptual learning causes changes in neural encoding to enhance the Fisher information of a visual stimulus [4]. However, a recent study has suggested that information maximization alone is insufficient to explain neural encoding. Salinas has suggested that "how encoded information is used" needs to be taken into account: neural encoding is influenced by the downstream circuits and output units to which neurons project, and it is ultimately influenced by animal behavior [5]. In the motor cortex, neural encoding is influenced by the characteristics of muscles (output units) because motor cortex neurons send motor commands to muscles through the spinal cord. In adaptation experiments, some motor cortex neurons exhibit rotations in their preferred directions (PDs), and these rotations result in a population vector that is directed toward a planned target [6]. Neural encoding therefore changes to minimize errors between planning and behavior, suggesting that neural encoding is influenced by behavior and by the properties of output units.
A critical problem exists in the relationship between motor cortex neurons and output units: the neuronal redundancy problem, or overcompleteness, which refers to the fact that the number of motor cortex neurons far exceeds the number of output units. Many different combinations of neural activities can therefore generate identical outputs. Neurophysiological and computational studies have revealed that the motor cortex exhibits neuronal redundancy [7,8]. However, it remains unknown how neuronal redundancy influences neural encoding. In other words, we do not yet understand the functional roles of neuronal redundancy in motor control and learning, though other types of redundancies are known to play various functional roles [9].
One of these types of redundancy is muscle redundancy: many combinations of muscle activities can generate identical movements. The functional roles of this muscle redundancy include impedance control to achieve accurate movements [10], reduction of motor variance by constructing muscle synergies [11], and learning internal models by changing muscle activities [12]. Another redundancy is kinematic redundancy: many combinations of joint angles result in identical hand positions. This redundancy ensures the stability of posture even if one joint is perturbed [13], and it facilitates motor learning by increasing motor variance in dimensions irrelevant to the desired movements [14]. Redundancies therefore play important functional roles in motor control and learning.
Similar to the muscle and kinematic redundancies, neuronal redundancy likely has functional roles in motor control and learning. However, the functional roles of this redundancy are unclear. Here, using a redundant neural network, we investigate these functional roles by varying the number of model neurons while holding the number of output units constant. This manipulation allows us to control the degree of neuronal redundancy because, if a neural network includes a large number of neurons and a small number of output units, many different combinations of neural activities can generate identical outputs. It should be noted that we used a redundant neural network model that can explain neurophysiological motor cortex data [7]. The key conclusion arising from our study is that one of the functional roles of neuronal redundancy is the maximization of learning speed.
Initially, a linear model with a fixed decoder was used. Analytical calculations revealed that neuronal redundancy is a necessary and sufficient condition to maximize learning speed. This maximization holds whether the distribution of PDs is unimodal [6] or bimodal [15][16][17]; both distributions have been reported in neurophysiological investigations. Second, numerical simulations confirmed that our results are invariant even when the neural network includes an adaptable decoder, a nonlinear activation function, recurrent connections, or nonlinear muscle units. Third, we show that our results do not depend on the learning rule by using weight and node perturbation, both of which are representative stochastic gradient methods [18]. Finally, we demonstrate that our hypothesis does not depend on the shape of the generalization function, whose shape depends on the task (broad in force field adaptation [19,20] and sharp in visuomotor rotation adaptation [21]). Our results strongly support our hypothesis that neuronal redundancy maximizes learning speed.

Results
Neuronal redundancy is defined as the dimensional gap between the number of neurons N and the number of outputs M. It is synonymous with overcompleteness [22]: many combinations of neural activities A ∈ R^{N×1} can generate identical outputs x ∈ R^{M×1} through a decoder Z ∈ R^{M×N} (x = ZA) because there are more neurons than necessary, i.e., N ≫ M (Figure 1). It should be noted that neuronal redundancy is defined not by N alone but by the relationship between N and M. In most parts of this study, the number of constrained tasks T is the same as M and is set to two, i.e., M = T = 2, so there is neuronal redundancy if N > 2. Thus, throughout this paper, the extent of neuronal redundancy can be expressed simply using the number of neurons. In this study, we can change only the neuronal redundancy; N can be increased while T is held constant at two, enabling the investigation of the functional roles of neuronal redundancy. In the Importance of Neuronal Redundancy section, we distinguish the effects of neuronal redundancy from the effects of neuron number by varying both N and T.
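As an illustrative sketch of this definition (our own code, not the paper's; all variable names are ours), a decoder with N ≫ M has an (N − M)-dimensional null space, so many activity vectors A map to the identical output x = ZA:

```python
import numpy as np

# Sketch of neuronal redundancy: N neurons, M = 2 outputs. Any vector in
# the null space of the decoder Z can be added to A without changing x.
rng = np.random.default_rng(0)
N, M = 10, 2
Z = rng.normal(size=(M, N)) / N          # decoder with O(1/N) entries
A = rng.normal(size=N)                   # one activity pattern
x = Z @ A                                # its output

# The trailing rows of V^T from the SVD span the (N - M) null directions.
null_basis = np.linalg.svd(Z)[2][M:]
A_alt = A + 3.0 * null_basis[0]          # a very different activity pattern
x_alt = Z @ A_alt

assert np.allclose(x, x_alt)             # identical output, different activity
```

The dimension of the null space, N − M, is exactly the "dimensional gap" used in the text to quantify redundancy.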
In this study, we discuss the relationship between neuronal redundancy and learning speed by assuming adaptation to either a visuomotor rotation or a force field. These tasks are simulated using a rotational perturbation R = (cos φ, −sin φ; sin φ, cos φ), where φ is the rotational angle. Due to this perturbation, if an error occurs between the target position t_{k(t)} = (cos θ_{k(t)}, sin θ_{k(t)})^T and the output (motor command) x in the tth trial, the neural activities A(θ_{k(t)}) are modified to minimize the error, where θ_{k(t)} is the angle of the k(t)th target and the K targets are distributed radially and equally (t = 1, …, Trial; k(t) ∈ {1, …, K}; θ_{k(t)} = 2πk(t)/K). To model the learning process in the motor cortex, we used a linear rate model, which can reproduce neurophysiological data [7] and can easily be analyzed. In this model, x is given by a weighted average of A, and each component of Z is accordingly set to O(1/N), i.e., the (i,j)th component of Z is defined as Z_ij = (1/N) z_ij, where z_ij is a variable that is independent of N. Because of this assumption, the learning rate is set to NB so that the trial-to-trial variation of x does not depend on N (it is O(1)); the optimized learning rate η* is then O(N) (see Text S1), i.e., η* = NB*, meaning that we consider a quasi-optimal learning rate in this study. It should be noted that, because the following results do not depend on B, our results hold when the optimal learning rate is used. Furthermore, even when each component of Z is O(1), the following results are invariant if we set the learning rate to its optimal value (see Text S1). Our study shows that neuronal redundancy is necessary and sufficient to maximize learning speed.

Author Summary

There are overwhelmingly more neurons than muscles in the motor system. The functional roles of this neuronal redundancy remain unknown. Our analysis, which uses a redundant neural network model, reveals that learning speed reaches its maximum value if and only if the model includes sufficient neuronal redundancy. This result does not depend on whether the distribution of preferred directions is uniform or skewed and bimodal, both of which have been reported in neurophysiological studies. We have confirmed that our results are consistent regardless of whether the model includes recurrent connections, a nonlinear activation function, or nonlinear muscle units. Additionally, our results are the same when using either a broad or a narrow generalization function. These results suggest that one of the functional roles of neuronal redundancy is to maximize learning speed.

Neuronal redundancy maximizes learning speed
Fixed homogeneous decoder. In the case of a fixed decoder, Z = (1/N)(cos Φ_1 … cos Φ_N; sin Φ_1 … sin Φ_N), the ith neuron has a uniform force amplitude (FA), (1/N²)(cos²Φ_i + sin²Φ_i) = 1/N², and a force direction (FD), Φ_i, which is randomly sampled from a uniform distribution. Because of this uniformity, we refer to this decoder as a fixed homogeneous decoder. This model corresponds to the one proposed by Rokni et al. [7].
In this case, the squared error can be calculated recursively as

e^{t+1} = (I − BL) e^t,   (1)

where e = t − x = t − RZA. Here, we assume that a single target is repeatedly presented for simplicity (the general case is discussed in the Methods section), I is the identity matrix, L = NRZZ^T R^T, and NB is the learning rate. Neural activity A is updated as

A^{t+1} = A^t + NB (RZ)^T e^t   (2)

for the tth trial to minimize the squared error. Multiplication by N in equation (2) is included for the purpose of scaling; it ensures that the amount of trial-to-trial variation in A does not explicitly depend on N. Equation (1) can thus be simplified as

v_i^{t+1} = (1 − Bλ_i) v_i^t,   (3)

where the diagonal elements of λ, λ_1 and λ_2, are the eigenvalues of L, L is decomposed as V^T λ V (V^T V = I), and v^t = V e^t. Learning speed is therefore determined by the eigenvalues of L, each component of which is O(1). The larger λ_i becomes, the faster learning becomes (i = 1, 2). It should be noted that learning speed and λ_i do not explicitly depend on N.
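This error recursion can be checked numerically. The sketch below is our own (variable names ours), assuming the activity-space gradient step A' = A + NB (RZ)^T e from the text:

```python
import numpy as np

# Numerical check that one gradient step on the activities contracts the
# error exactly as e' = (I - B L) e, with L = N R Z Z^T R^T.
rng = np.random.default_rng(0)
N, B, phi_rot = 20, 0.2, np.pi / 3
Phi = rng.uniform(0, 2 * np.pi, size=N)
Z = np.vstack([np.cos(Phi), np.sin(Phi)]) / N      # homogeneous decoder
R = np.array([[np.cos(phi_rot), -np.sin(phi_rot)],
              [np.sin(phi_rot),  np.cos(phi_rot)]])
t = np.array([1.0, 0.0])                           # single repeated target
A = rng.normal(size=N)                             # current neural activity

e = t - R @ Z @ A                                  # current error
A_next = A + N * B * (R @ Z).T @ e                 # gradient step on A
e_next = t - R @ Z @ A_next

L = N * R @ Z @ Z.T @ R.T
assert np.allclose(e_next, (np.eye(2) - B * L) @ e)
```

The factor t^T t = 1 is implicit here because a single unit-norm target is used, matching the simplification in the text.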
Analytical calculations can yield necessary and sufficient conditions to maximize learning speed (see the Methods section). The following self-averaging properties [23] maximize learning speed, i.e., maximize the minimum eigenvalue of L:

(1/N) Σ_i cos²Φ_i = ∫ cos²Φ P(Φ) dΦ = 1/2,
(1/N) Σ_i sin²Φ_i = ∫ sin²Φ P(Φ) dΦ = 1/2,
(1/N) Σ_i cos Φ_i sin Φ_i = ∫ cos Φ sin Φ P(Φ) dΦ = 0,

where P(Φ) is the probability distribution from which the FDs are randomly sampled. These properties hold if and only if the neuronal redundancy is sufficient, i.e., N → ∞, because the fluctuation of Monte Carlo integrals is O(1/√N) [24]. Thus, in the case of a fixed homogeneous decoder, neuronal redundancy plays a functional role in maximizing learning speed.
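A quick illustration of this condition (our own sketch): for the homogeneous decoder the matrix L has trace 1, so its smaller eigenvalue can never exceed 1/2, and it approaches that maximum only as N grows and the self-averaging sums converge:

```python
import numpy as np

# The smaller eigenvalue of L = N R Z Z^T R^T (rotation-invariant, so R is
# omitted) is bounded by 1/2 and approaches 1/2 as N grows.
rng = np.random.default_rng(1)

def min_eigenvalue(N):
    Phi = rng.uniform(0.0, 2.0 * np.pi, size=N)    # force directions
    Z = np.vstack([np.cos(Phi), np.sin(Phi)]) / N  # homogeneous decoder
    L = N * Z @ Z.T
    return np.linalg.eigvalsh(L)[0]                # smallest eigenvalue

small = np.mean([min_eigenvalue(4) for _ in range(200)])
large = np.mean([min_eigenvalue(1000) for _ in range(200)])
assert small < large <= 0.5     # more redundancy, larger minimum eigenvalue
```

Since the trace of L is fixed, maximizing the minimum eigenvalue at 1/2 is equivalent to the self-averaging conditions above holding exactly.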
We numerically confirmed the above analytical results. Figures 2A and 2B show the learning speed and learning curves calculated using the results of 1,000 sets of randomly sampled Φ values, an identical target sequence (K = 8), and φ = π/3. The greater the neuronal redundancy, the faster the learning speed. Figure 2C shows the relationship between learning speed and neuronal redundancy. The horizontal axis denotes the number of neurons, and the vertical axis denotes the increase in learning speed. Although the increase saturates, greater neuronal redundancy still yields faster learning. These figures therefore support our analytical results: in the case of a fixed homogeneous decoder, neuronal redundancy maximizes learning speed.
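A stripped-down version of this comparison can be reproduced in a few lines. This is our own sketch with illustrative parameters, iterating only the noiseless error recursion rather than the full simulation:

```python
import numpy as np

# Compare the residual error after a fixed number of trials for a small
# and a large network, iterating e_{t+1} = (I - B L) e_t.
rng = np.random.default_rng(2)
B, trials = 0.2, 50

def final_error(N):
    Phi = rng.uniform(0, 2 * np.pi, size=N)
    Z = np.vstack([np.cos(Phi), np.sin(Phi)]) / N
    L = N * Z @ Z.T
    phi_rot = np.pi / 3                            # rotation angle, as in Figure 2
    R = np.array([[np.cos(phi_rot), -np.sin(phi_rot)],
                  [np.sin(phi_rot),  np.cos(phi_rot)]])
    t = np.array([1.0, 0.0])                       # single target for simplicity
    e = t - R @ t                                  # initial error after perturbation
    for _ in range(trials):
        e = (np.eye(2) - B * L) @ e
    return e @ e

errs = {N: np.mean([final_error(N) for _ in range(100)]) for N in (4, 100)}
assert errs[100] < errs[4]                         # more redundancy, faster learning
```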
Fixed non-homogeneous decoder. The question remains whether it is necessary for the FDs and FAs to be distributed uniformly, so we assume that the values (Z_{1i}, Z_{2i}) are randomly sampled from a probability distribution P(Z_1, Z_2) to make the FDs and FAs non-homogeneous, i.e., the FDs are non-uniformly distributed, and the FAs differ across neurons. In the case of a non-homogeneous decoder, the necessary and sufficient conditions to maximize learning speed are again self-averaging properties: the empirical second moments of Z_1 and Z_2 must converge to the corresponding integrals over the distribution (equations (8) and (9)), where P(Z_1) and P(Z_2) are the marginalized distributions. Figures 3A and 3D show distributions of Z that satisfy equations (8) and (9); Z is randomly sampled from unimodal and bimodal Gaussian distributions in Figures 3A and 3D, respectively. Because these figures show non-uniformity in both FD and FA, neuronal redundancy maximizes learning speed regardless of these non-uniformities.
Distribution of preferred directions. Some neurophysiological studies have suggested that the distribution of PDs is skewed and bimodal [15][16][17], whereas other neurophysiological studies have suggested that it is uniform [6]. We investigated whether our results were consistent with these neurophysiological findings. Figures 3B and 3E depict the distributions of PDs that result when Z is randomly sampled as shown in Figures 3A and 3D, respectively, with PDs calculated as PD_i = argmax_θ A_i(θ) (see the Methods section). Figures 3B and 3E show that both a skewed bimodal distribution and a uniform distribution can be observed when P(Z_1, Z_2) satisfies equations (8) and (9), suggesting that our hypothesis is consistent with the results of previous neurophysiological experiments. Figures 3C and 3F show the distribution of modulation depth, calculated as m_i = max_θ |A_i(θ)| (see the Methods section). Our results suggest that the distribution of modulation depth is skewed.
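For cosine tuning A_i(θ) = W_{i1} cos θ + W_{i2} sin θ, the argmax definition of the PD and the max definition of modulation depth have closed forms (atan2 and the vector norm); the sketch below (our own, with hypothetical weights) verifies them against a grid search:

```python
import numpy as np

# PD of a cosine-tuned unit equals atan2(W_i2, W_i1); modulation depth
# equals the norm of (W_i1, W_i2). Verify against a dense grid.
w1, w2 = 0.6, -0.8                                 # example weights (ours)
theta = np.linspace(0, 2 * np.pi, 100000, endpoint=False)
activity = w1 * np.cos(theta) + w2 * np.sin(theta)

pd_grid = theta[np.argmax(activity)]               # argmax definition
pd_closed = np.arctan2(w2, w1) % (2 * np.pi)       # closed form (quadrant-safe)
assert abs(pd_grid - pd_closed) < 1e-3
assert abs(activity.max() - np.hypot(w1, w2)) < 1e-6
```

atan2 is used rather than arctan(W_{i2}/W_{i1}) so that the correct quadrant is recovered for all weight signs.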
Adaptable decoder. We have analytically elucidated the relevance of neuronal redundancy to learning speed only when Z is fixed, but the question remains whether neuronal redundancy can maximize learning speed even when Z is adaptable. In this case, calculating learning speed analytically is intractable, so we used numerical simulations. Figure 4A shows the learning speed when N = 2, 4, 10, 100, or 1000 in the case of an adaptable decoder. Although there was no significant difference in learning speed between N = 100 and N = 1000, neuronal redundancy maximized learning speed even when the decoder was adaptable. Figure 4B, which shows the learning curve when N = 2, 4, or 100, also supports this conclusion.

Importance of neuronal redundancy
Although we have revealed that neuronal redundancy maximizes learning speed when T = 2, it is important to verify that the effect is caused by the neuronal redundancy, i.e., the dimensional gap between N and T, and not simply by the number of neurons N. In this section, we investigate this question by varying both N and T while assuming that each component of t is randomly sampled from a Gaussian distribution. Figures 5A and 5B show the learning speed and the learning curve produced when N = T = 10, 50, or 100 with a fixed non-homogeneous decoder. If N alone were important for maximizing learning speed, learning would be faster when N = T = 100 than when N = T = 10 or N = T = 50. However, the results shown in these figures support the opposite conclusion: learning becomes slower when N = T = 100 than in the other cases. This result suggests that the number of neurons alone is not important for maximizing learning speed. Figures 5C and 5D show the learning speed and learning curve produced when T = 10, 50, or 100 with N = 50 and a fixed non-homogeneous decoder. If neuronal redundancy were important, learning would be faster when T = 10 than when T = 50 or T = 100. These figures support this hypothesis; learning speed increased when T = 10 compared to the other cases. Taken together, these results indicate that the important factor for maximizing learning speed is in fact neuronal redundancy and not simply the number of neurons.
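The N-versus-T comparison can be sketched without running the full simulation: as noted in the Methods, learning speed is governed by the smallest eigenvalue of N Z Z^T, which collapses toward zero when N = T and stays well away from zero when N ≫ T. This is our own simplified illustration with a Gaussian decoder:

```python
import numpy as np

# Smallest eigenvalue of N Z Z^T for a Gaussian decoder Z in R^{T x N}
# (entries scaled by 1/N, as in the model).
rng = np.random.default_rng(3)

def min_eig(N, T):
    Z = rng.normal(size=(T, N)) / N
    return np.linalg.eigvalsh(N * Z @ Z.T)[0]

square = np.mean([min_eig(50, 50) for _ in range(100)])     # N = T: no redundancy
redundant = np.mean([min_eig(50, 10) for _ in range(100)])  # N >> T: redundant
assert square < redundant        # redundancy, not neuron count, speeds learning
```

Note that `square` uses the same N = 50 as `redundant`; only the dimensional gap N − T differs, isolating redundancy from neuron number.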
In addition, we investigated whether neuronal redundancy or neuron number is important when Z is adaptable. In this case, we show only learning curves because the error curves cannot be fitted with an exponential function, which makes it impossible to calculate learning speed. Figures 5E and 5F show the learning curves calculated when N = T = 10, 50, or 100 and when T = 10, 50, or 100 with N = 50, respectively. These figures show the same results as the case in which Z is fixed; even when Z is adaptable, the important factor for maximizing learning speed is neuronal redundancy, not simply the number of neurons.

Generality of our results
The generality of our results should be investigated because we analyzed only linear, feed-forward networks, whereas neurophysiological experiments have suggested the existence of recurrent connections [25] and nonlinear neural activation functions [26]. Also, only a linear rotational perturbation task was considered, so we need to investigate whether our results hold when the constrained tasks are nonlinear because, in fact, motor cortex neurons solve nonlinear tasks: they send motor commands to muscles whose activities are nonlinearly determined, since muscles can pull but cannot push. Using numerical simulations, we show that neuronal redundancy maximizes learning speed even when the neural network includes recurrent connections (Figure S1), when it includes nonlinear activation functions (Figure S2), and when the task is nonlinear (Figure S3).
In addition, we used only deterministic gradient descent, so the generality regarding the learning rule needs to be investigated. In fact, previous studies have suggested that stochastic gradient methods are more biologically relevant than deterministic ones [27,28]. Analytical and numerical calculations confirm that our results are invariant even when the learning rule is stochastic ( Figure S4). Our results therefore have strong generality.
Activity noise and plasticity noise. Although our results have strong generality, an open question remains regarding robustness to noise: does neuronal redundancy maximize learning speed even in the presence of neural noise? Indeed, neural activities show trial-to-trial variation [29], and the neural plasticity mechanism also includes trial-to-trial fluctuations [7]. This section investigates the relationships between neuronal redundancy, learning speed, and neural noise. Figures 6A and 6D show the variance of the learning curves when σ_a = 0, 0.1, 0.2, 0.3, 0.4, 0.5 and σ_p = 0, 0.05, 0.1, 0.15, 0.2, respectively, with N = 4, 10, 100, or 1000, where σ_a and σ_p represent the standard deviations of activity noise and plasticity noise, respectively. The variance is defined as (1/Trial) Σ_{t=1}^{Trial} Var(E_t), which is a measure of the stability of learning. Examples of learning curves are shown in Figures 6B, 6C, 6E, and 6F. These figures show that neuronal redundancy enhances the stability of learning by suppressing the influences of activity and plasticity noise. Neuronal redundancy therefore not only maximizes learning speed but also provides robustness to neural noise.
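The averaging mechanism behind this robustness can be sketched directly (our own illustration, not the paper's simulation): with decoder entries of order 1/N, i.i.d. activity noise ξ reaches the output as Zξ, whose variance shrinks as N grows:

```python
import numpy as np

# Output variance induced by i.i.d. activity noise through a homogeneous
# decoder: each output component has variance ~ sigma_a^2 / (2N).
rng = np.random.default_rng(5)
sigma_a = 0.3                                      # illustrative noise level

def output_noise_var(N, reps=2000):
    Phi = rng.uniform(0, 2 * np.pi, size=N)
    Z = np.vstack([np.cos(Phi), np.sin(Phi)]) / N
    noise = Z @ rng.normal(0.0, sigma_a, size=(N, reps))
    return noise.var()

var_small_N = output_noise_var(4)
var_large_N = output_noise_var(1000)
assert var_large_N < var_small_N    # redundancy averages out activity noise
```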
Shape of the generalization function. In many situations, learning in one context generalizes to different contexts, such as different postures [30], different arms [31], and different movement directions [19][20][21], with the degree of generalization depending on the task. In this study, we define the generalization function as the degree of generalization to different movement directions. The performance of reaching towards θ_{k(t)} is generalized to that of reaching towards θ, and the degree of this generalization is determined by the generalization function f(θ − θ_{k(t)}). In visuomotor rotation adaptation, the generalization function is narrow in the direction metric [21]. In contrast, the generalization function is broad in force field adaptation [19,20]. To investigate the generality of our results across various kinds of tasks, it is necessary to investigate the relationships between neuronal redundancy, learning speed, and the shape of the generalization function. Figure 7 shows the relationship between the shape of the generalization function and learning speed. Figures 7A and 7B show the learning speed and learning curve calculated when the generalization function is broad (Figure 7C). Figures 7D and 7E show the learning speed and learning curve calculated when the generalization function is narrow (Figure 7F). Although these figures show that narrower generalization results in slower learning, neuronal redundancy maximizes learning speed independently of the shape of the generalization function.

Discussion
We have quantitatively demonstrated that neuronal redundancy maximizes learning speed. The larger the dimensional gap between the number of neurons and the number of constrained tasks, the faster the learning speed. This maximization does not depend on whether the PD distribution is unimodal or bimodal, the decoder is fixed or adaptable, the network is linear or nonlinear, the task is linear or nonlinear, or the learning rule is stochastic or deterministic. Additionally, we have shown that neuronal redundancy has another important functional role: it provides robustness to neural noise. Furthermore, neuronal redundancy maximizes learning speed in a manner independent of the shape of the generalization function. These results strongly support the generality of our conclusions.

Neuronal redundancy maximizes learning speed because only T equalities, x = t, need to be satisfied, while the N-dimensional neural activity A is adaptable (N ≫ T). This dimensional gap yields a large, (N−T)-dimensional subspace of A in which the T equalities are satisfied. The more N increases, the greater the fraction of this subspace becomes: lim_{N→∞} (N−T)/N = 1. Neuronal redundancy, rather than the number of neurons, thus enables A to rapidly reach a point in the subspace. This interpretation likely applies even in the cases of an adaptable decoder, recurrent connections, a nonlinear network, a nonlinear task, and a stochastic learning rule. Furthermore, this interpretation is supported by the results shown in Figure 5; the larger (N−T) grows, the faster the learning speed becomes.
At first glance, our results may seem inconsistent with those of Werfel et al. [18], who concluded that learning speed is inversely proportional to N. In their model, because they considered a single-layer linear model, N equals the number of input units, which is defined as T (= M) in the present study. A similar tendency can be observed in Figure 5; the more T increases, the slower learning becomes. We calculated the optimal learning rate and speed as shown in Text S1 and confirmed that learning speed is inversely proportional to T. Thus, our results are consistent with Werfel's study and additionally suggest that neuronal redundancy maximizes learning speed.
Neuronal redundancy plays another important role: generating robustness in response to neural noise ( Figure 6). Because neuronal redundancy has the same meaning as overcompleteness, its functional role is the same as the robustness of overcompleteness in the face of perturbations in signals [32]. This additional functional role further supports our hypothesis that neuronal redundancy is a special neural basis on which to maximize learning speed. For example, if we increase the learning rate B in a non-redundant network, the learning speed approaches the maximal speed in a redundant network in which the learning rate is fixed to B. As shown in Figure 6, however, a non-redundant network is not robust with respect to neural noise. Furthermore, neuronal redundancy minimizes residual errors when the neural network includes synaptic decay [7] (see the Methods section and Figure S5). Thus, neuronal redundancy represents a special neural basis for maximizing learning speed while minimizing residual error and maintaining robustness in response to neural noise.

Methods

Model definition
Our study assumed the following task: participants move their arms towards one of K radially distributed targets. If the k(t)th target is presented in the tth trial, the neural network model receives the input t_{k(t)} = (cos θ_{k(t)}, sin θ_{k(t)})^T (k(t) ∈ {1, …, K}; t = 1, …, Trial), where θ_{k(t)} = 2πk(t)/K. The input units project to neurons (hidden units), the activities of which are determined by

A^t = W^t t_{k(t)} + σ_a ξ^t,

where W^t ∈ R^{N×2} is the synaptic weight matrix in the tth trial, σ_a is the standard deviation of neural activity noise, ξ^t ∈ R^{N×1} denotes independent normal Gaussian random variables, and N is the number of neurons (Figure 1). The ith neuron has a PD given by PD_i = arctan(W_{i2}/W_{i1}) and a modulation depth m_i = √(W_{i1}² + W_{i2}²); this cosine tuning has been reported by many neurophysiological studies.
The neural population generates a force F^t_{k(t)} through a decoder matrix Z ∈ R^{M×N}:

F^t_{k(t)} = Z A^t,

where M is the number of outputs, which, in most cases, is set to 2. When Z is fixed and homogeneous, the (1,i)th and (2,i)th components of Z are defined as Z_{1i} = (1/N) cos Φ_i and Z_{2i} = (1/N) sin Φ_i, respectively, where the division by N is used for scaling and the FD Φ_i is randomly sampled from a uniform distribution (i = 1, …, N). When Z is fixed and non-homogeneous, (Z_{1i}, Z_{2i}) is randomly sampled from a probability distribution P(Z_1, Z_2) and divided by N. The neural network then generates the final hand coordinate

x^t_{k(t)} = R F^t_{k(t)},

which means that F^t is perturbed by a rotation R = (cos φ, −sin φ; sin φ, cos φ), modeling a visuomotor rotation or a curl force field. Rotational perturbations are assumed because many behavioral studies have used them. Because we discuss only the endpoint of the movement, we refer to x^t_{k(t)} as the motor command. The constrained tasks require that the neural network generate x_{k(t)} toward t_{k(t)}, i.e., x_{k(t)} = t_{k(t)}, which means the number of constrained tasks T is the same as M. We use T instead of M in the following sections.
If an error occurs between t and x, the synaptic weights W^t are adapted to reduce the squared error E^t = (1/2)|t_{k(t)} − x^t_{k(t)}|²:

W^{t+1} = A W^t + NB (RZ)^T e^t t_{k(t)}^T + σ_p f^t,   (13)

where A is the synaptic decay rate, B is the learning rate (B is set to 0.2 in most parts of the present study), σ_p is the strength of synaptic drift, and f^t ∈ R^{N×2} denotes normal Gaussian random variables.
Since each component of Z is O(1/N), multiplying B by N allows the trial-by-trial variation of both A and W to be O(1). As shown in Text S1, the optimal learning rate η* is O(N) (η* = NB*), meaning that we consider a quasi-optimal learning rate. It should be noted that our results hold whether the learning rate is optimal or quasi-optimal because the results do not depend on B. It should also be noted that the amount of variation in W does not explicitly depend on N.

Learning curve
Equation (13) yields the following update rule for the error:

e^{t+1} = (AI − BL) e^t + (1 − A) t,   (14)

where L = NRZZ^T R^T and I denotes the identity matrix. First, we assume the case K = 1 for simplicity. Because L is symmetric, AI − BL can be decomposed as AI − BL = V^T (AI − Bλ) V, where each row of V is one of the eigenvectors (V^T V = I) and each diagonal component of the diagonal matrix λ is one of the eigenvalues of L. This decomposition transforms equation (14) into the simple form

v^{t+1} = (AI − Bλ) v^t + (1 − A) s,   (15)

where v^t = (v_1^t, v_2^t)^T = V e^t and s = (s_1, s_2)^T = V t. This recurrence formula yields the analytical form of the learning curve:

v_i^t = (A − Bλ_i)^t (v_i^0 − v_i^∞) + v_i^∞,  with v_i^∞ = (1 − A) s_i / (1 − A + Bλ_i).   (16)

Equation (16) shows that the larger the eigenvalues become, the faster the learning speed becomes and the smaller the residual error becomes (Figure S5). Because R is a rotation, the eigenvalues of L equal those of

N Z Z^T = ( (1/N) Σ_i cos²Φ_i, (1/N) Σ_i cos Φ_i sin Φ_i ; (1/N) Σ_i cos Φ_i sin Φ_i, (1/N) Σ_i sin²Φ_i ),   (17)

each component of which is O(1). Simple algebra gives the analytical form of the eigenvalues,

λ_{1,2} = 1/2 ± √( ((1/N) Σ_i cos²Φ_i − 1/2)² + ((1/N) Σ_i cos Φ_i sin Φ_i)² ),   (18)

which are also O(1), suggesting that learning speed does not explicitly depend on N. Because the trace of N Z Z^T is 1, the smaller eigenvalue is at most 1/2. Because the learning speed is determined by the smaller eigenvalue, the necessary and sufficient conditions to maximize learning speed, i.e., to maximize the smaller eigenvalue, are

(1/N) Σ_i cos²Φ_i = (1/N) Σ_i sin²Φ_i = 1/2   (19)

and

(1/N) Σ_i cos Φ_i sin Φ_i = 0.   (20)

What kind of conditions can simultaneously satisfy equations (19) and (20)? The only answer is sufficient neuronal redundancy, i.e., N → ∞, because sufficient neuronal redundancy enables self-averaging properties to exist in the neural network, i.e.,

(1/N) Σ_i cos²Φ_i = ∫ cos²Φ P(Φ) dΦ,   (21)
(1/N) Σ_i sin²Φ_i = ∫ sin²Φ P(Φ) dΦ,   (22)
(1/N) Σ_i cos Φ_i sin Φ_i = ∫ cos Φ sin Φ P(Φ) dΦ,   (23)

where P(Φ) is the probability distribution from which the FDs are randomly sampled. Conversely, if equations (21), (22), and (23) are to be satisfied for all sets of randomly sampled FDs, the number of neurons must satisfy N → ∞ because the fluctuation of Monte Carlo integrals is O(1/√N) [24]. Therefore, the necessary and sufficient condition to maximize learning speed is sufficient neuronal redundancy.
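A quick numerical check (our own sketch) of the scalar recursion implied by the eigendecomposition above, v_{t+1} = (A − Bλ) v_t + (1 − A) s, and of its closed-form solution:

```python
import numpy as np

# Iterate the scalar recursion and compare with the closed form
# v_t = (A - B*lam)^t (v_0 - v_inf) + v_inf, v_inf = (1-A)s / (1 - A + B*lam).
A_decay, B, lam, s, v0 = 0.95, 0.2, 0.4, 0.7, 1.3   # illustrative values (ours)

v = v0
for _ in range(30):
    v = (A_decay - B * lam) * v + (1 - A_decay) * s

v_inf = (1 - A_decay) * s / (1 - A_decay + B * lam)
v_closed = (A_decay - B * lam) ** 30 * (v0 - v_inf) + v_inf
assert np.isclose(v, v_closed)
```

The fixed point v_inf makes the residual-error claim explicit: a larger eigenvalue λ both speeds the geometric decay and shrinks the residual.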
The above analytical calculations hold even when K > 1. Equation (13) yields the recurrence equation of the error, where A is set to 1 for simplicity. Using L = V^T λ V, this equation can be written as

v^{t+1}_{i,k(t+1)} = v^t_{i,k(t+1)} − B λ_i cos(θ_{k(t+1)} − θ_{k(t)}) v^t_{i,k(t)}.

The larger the eigenvalue becomes, the faster the learning speed becomes if v^t_{i,k(t+1)} and v^t_{i,k(t)} cos(θ_{k(t+1)} − θ_{k(t)}) have the same sign, i.e., if v^t_{i,k(t+1)} cos(θ_{k(t+1)} − θ_{k(t)}) v^t_{i,k(t)} > 0. This inequality holds if the equality v^T_{k(t+1)} v_{k(t)} = e^T_{k(t+1)} V^T V e_{k(t)} = C cos(θ_{k(t+1)} − θ_{k(t)}) can be proved, where C is a positive constant. To prove this equality, consider the 1st trial after the rotational perturbation R is applied: the output can be written as x_{k(t)} = R t_{k(t)} because the neural network generates accurate outputs in the absence of the perturbation. In this case, e_{k(t)} = (I − R) t_{k(t)}, so that

e^T_{k(t+1)} e_{k(t)} = t^T_{k(t+1)} (I − R)^T (I − R) t_{k(t)} = 2(1 − cos φ) cos(θ_{k(t+1)} − θ_{k(t)}),

where 2(1 − cos φ) is a positive constant. Thus, the larger λ_i becomes, the faster the learning speed becomes even when K > 1; the analytical calculations show that neuronal redundancy maximizes learning speed even when K > 1.
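The inner-product identity used in this argument can be verified numerically for an arbitrary pair of targets (our own sketch):

```python
import numpy as np

# With e_k = (I - R) t_k after the perturbation, (I - R)^T (I - R) equals
# 2(1 - cos(w)) I, so e2.e1 = 2(1 - cos(w)) cos(th2 - th1).
w = np.pi / 3                                       # rotation angle
R = np.array([[np.cos(w), -np.sin(w)], [np.sin(w), np.cos(w)]])
th1, th2 = 0.4, 1.9                                 # two arbitrary target angles
t1 = np.array([np.cos(th1), np.sin(th1)])
t2 = np.array([np.cos(th2), np.sin(th2)])
e1, e2 = (np.eye(2) - R) @ t1, (np.eye(2) - R) @ t2

assert np.isclose(e2 @ e1, 2 * (1 - np.cos(w)) * np.cos(th2 - th1))
```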

Learning rule of decoder Z
When Z is adaptable, it too is updated by gradient descent on the squared error with learning rate B_Z, where the initial value Z^0 is set to η_0/N, η_0 is a normal Gaussian random variable, and B_Z is set to 0.1 in the Adaptable Decoder section and 0.05 in the Importance of Neuronal Redundancy section. This learning rule corresponds to back-propagation [34].

High dimensional tasks
In the Importance of Neuronal Redundancy section, the neural network generates the output x ∈ R^{T×1} through the decoder Z for the tth trial. The initial value Z^0 is randomly sampled from the normal Gaussian distribution and divided by N for scaling. The input t is randomly sampled from the normal Gaussian distribution and normalized to satisfy t^T t = 1 to avoid any effect of its magnitude on learning speed. In addition, we used a fixed value of t across trials, i.e., t^t = t, because the generalization function (see the following section) strongly depends on T. It should be noted that learning speed does not explicitly depend on T because it is determined only by the minimum eigenvalue of NZZ^T.
The generalization function and the update rule for motor commands

Equation (13) yields the following update rule for motor commands:

x^{t+1}_{k(t+1)} = x^t_{k(t+1)} + B L e^t_{k(t)} (t^T_{k(t)} t_{k(t+1)}).   (33)

If equations (27) and (28) (or (22) and (23)) are satisfied, equation (33) can be written as

x^{t+1}_{k(t+1)} = x^t_{k(t+1)} + B s² f(θ_{k(t+1)} − θ_{k(t)}) e^t_{k(t)},   (34)

where the cross term of t^T_{k(t)} and t_{k(t+1)} determines the generalization function f(θ_{k(t+1)} − θ_{k(t)}), e.g., f(θ_{k(t+1)} − θ_{k(t)}) = cos(θ_{k(t+1)} − θ_{k(t)}) if we define t_{k(t)} = (cos θ_{k(t)}, sin θ_{k(t)})^T. We set B and s² to satisfy B s² = 0.2. It should be noted that equation (34) corresponds to a model of sensorimotor learning that can explain the results of behavioral experiments [35], suggesting that our hypothesis is consistent with behavioral findings.
Because the shape of the generalization function depends on the task, we need to confirm the generality of our results with regard to this shape. To simulate various shapes of generalization functions, we used the von Mises function, t_i = (1/Z_I) exp(a cos(θ − μ_i)), where a, μ_i, and N_I are the precision parameter, the preferred direction of the ith input unit, and the number of input units, respectively, and t = (t_1, …, t_{N_I})^T. The normalization factor Z_I is determined such that t^T_{k(t)} t_{k(t)} = 1 to avoid any influence of the input magnitude on learning speed. This normalization permits us to investigate the influence of the shape of the generalization function alone. The larger the value of a, the sharper the generalization function becomes. We set N_I to 100 throughout this study.
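A minimal sketch of this input encoding (our own code and names): the overlap between the inputs for two nearby directions shrinks as the precision a grows, which is what makes the effective generalization function narrower:

```python
import numpy as np

# Von Mises population input, normalized to unit norm (t^T t = 1).
def von_mises_input(theta, a, n_inputs=100):
    mu = 2 * np.pi * np.arange(n_inputs) / n_inputs  # input-unit preferred directions
    t = np.exp(a * np.cos(theta - mu))
    return t / np.linalg.norm(t)

broad = von_mises_input(0.0, a=0.5)                  # broad tuning
sharp = von_mises_input(0.0, a=5.0)                  # sharp tuning

# Overlap with the input for a direction 45 degrees away.
overlap_broad = broad @ von_mises_input(np.pi / 4, a=0.5)
overlap_sharp = sharp @ von_mises_input(np.pi / 4, a=5.0)
assert overlap_sharp < overlap_broad < 1.0           # sharper code generalizes less
```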

Numerical simulation procedure
We conducted 100 baseline trials with φ = 0 and K = 8 to identify the baseline values of W. The initial value of W, W^0, was set to 0. After these trials, 100 learning trials were conducted using φ = π/3 and K = 8. Learning speed b was calculated by fitting the exponential function Ê_t = a exp(−bt) + c to E_t. All figures report b obtained from the learning trials only. The present study calculated learning speed and learning curves by averaging the results of 1,000 sets of baseline and learning trials, each set including an identical, randomly sampled target sequence, and each set using different FD values.
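The learning-speed estimate can be sketched as follows. This is our own simplified version: we fit a pure exponential (c = 0) by log-linear regression on a synthetic error curve, whereas the paper fits a exp(−bt) + c to the simulated E_t:

```python
import numpy as np

# Recover the decay rate b of an exponential learning curve.
t = np.arange(100)
E = 1.5 * np.exp(-0.12 * t)                  # synthetic error curve (b = 0.12)
slope, intercept = np.polyfit(t, np.log(E), 1)
b = -slope                                   # estimated learning speed
assert abs(b - 0.12) < 1e-8
```

With a nonzero asymptote c, a nonlinear fit (e.g., least squares over a, b, c) would be needed instead, as in the paper.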
For all statistical tests, we used the Wilcoxon signed-rank test. P-values are indicated only where differences were statistically significant; no statistically significant differences were detected.

Supporting Information

Text S1 Generality of our results. This file contains the detailed descriptions of the Generality of Our Results section. (PDF)