Transferring Learning from External to Internal Weights in Echo-State Networks with Sparse Connectivity

Modifying weights within a recurrent network to improve performance on a task has proven to be difficult. Echo-state networks in which modification is restricted to the weights of connections onto network outputs provide an easier alternative, but at the expense of modifying the typically sparse architecture of the network by including feedback from the output back into the network. We derive methods for using the values of the output weights from a trained echo-state network to set recurrent weights within the network. The result of this “transfer of learning” is a recurrent network that performs the task without requiring the output feedback present in the original network. We also discuss a hybrid version in which online learning is applied to both output and recurrent weights. Both approaches provide efficient ways of training recurrent networks to perform complex tasks. Through an analysis of the conditions required to make transfer of learning work, we define the concept of a “self-sensing” network state, and we compare and contrast this with compressed sensing.


Introduction
Training a network typically involves making adjustments to its parameters to implement a transformation or map between the network's input and its output, or to generate a temporally varying output of a specified form. Training in such a network could consist of modifying some or all of its weights. Learning schemes that modify the recurrent weights are notoriously difficult to implement [1][2] (although see [3]). To avoid these difficulties, Maass and collaborators [4] and Jaeger [5] suggested limiting synaptic modification during learning to the output weights, leaving the recurrent weights unchanged. This scheme greatly simplifies learning, but is limited because it does not allow the dynamics of the recurrent network to be modified. Jaeger and Haas [6] proposed a clever compromise in which modification is restricted to the output weights, but a feedback loop carries the output back into the network. By permitting the output to affect the network, this scheme modifies the intrinsic dynamics of the network. FORCE learning was developed as an efficient algorithm for implementing this approach with the benefits of creating stable networks and enabling the networks to operate in a more versatile regime [7].
While the echo-state approach greatly expands the capabilities for performing complex tasks [6][8][7], this capacity comes at the price of altering the architecture of the network through the addition of the extra feedback loop (Figure 1A), effectively creating an all-to-all coupled network. In neuroscience applications in particular, the original connectivity of the network is typically restricted to match anatomical constraints such as sparseness, but the additional feedback loop may violate these constraints by being non-sparse or excessively strong, and thus may be biologically implausible. This raises an interesting question: Can we train a network without feedback (Figure 1B) to perform the same task as a network with feedback (Figure 1A), using the same output weights, by modifying the internal, recurrent connections?
The answer is yes, and previously [7] we described how the online FORCE learning rule could be applied simultaneously to recurrent and output weights in the absence of an output-to-network feedback loop (Figure 1B). We now expand this result in three ways. First, we develop batch equations for transferring learning achieved using a feedback network with online FORCE learning to the recurrent connections of a network without feedback. The reason for this two-step approach is that it speeds up the learning process considerably. Second, we use results from this first approach to more rigorously derive the online learning rule for training recurrent weights that we proposed previously [7]. Third, we introduce the concept of a self-sensing network state, and use it to explore the range of network parameters under which internal FORCE learning works.
There has been parallel work in studying methods for internalizing the effects of trained feedback loops into a recurrent pool. These studies focused on control against input perturbations [9][10], regularization [11] and prediction [12]. The principal issue that we study in this manuscript is motivated by a computational neuroscience perspective: what are the conditions under which transfer of external feedback loops to the recurrent network will be successful, while preserving sparse connectivity? Maintenance of sparsity requires us to work within a random sampling framework. Our focus on respecting locality and sparseness constraints increases the biological relevance of our results and leads to a network learning rule that only requires a single, global error signal to be conveyed to network units.

Results
Our network model (Figure 1) is described by an $N$-dimensional vector of activation variables, $x$, and a vector of corresponding "firing rates", $r = \tanh(x)$ (other nonlinearities, including non-negative functions, can be used as well). The equation governing the dynamics of the activation vector for the network of Figure 1B is of the standard form
$$\tau \frac{dx}{dt} = -x + Jr + vI(t). \tag{1}$$
The time constant $\tau$ has the sole effect of setting the time scale for all of our results. For example, doubling $\tau$ while making no other parameter changes would make the outputs we report evolve twice as slowly. The $N \times N$ matrix $J$ describes the weights of the recurrent connections of the network, and we take it to be randomly sparse, meaning that only $n < N$ randomly chosen elements are non-zero in each of its rows. The non-zero elements of $J$ are initially drawn independently from a Gaussian distribution with zero mean and variance $g^2/n$. The parameter $g$, when it is greater than 1, determines the amplitude and frequency content of the chaotic fluctuations in the activity of the network units. In order for FORCE learning to work, $g$ must be small enough so that feedback from the output into the network can produce a transition to a non-chaotic state (see below and Sussillo and Abbott, 2009). The scalar input to the network, $I(t)$, is fed in through the vector of weights $v$ with elements drawn independently and uniformly over the range $[-1, 1]$. Thus, up to the scale factors $v$, every unit in the network receives the same input. The output of the network, $z(t)$, is constructed from a linear sum of the activities of the network units, described by the vector $r$, multiplied by a vector of output weights $w$ [13][4][5],
$$z(t) = w^T r(t). \tag{2}$$
Training in such a network could, in principle, consist of modifying some or all of the weights $v$, $w$ or $J$. In practice, we restrict weight modification to either $w$ alone (Figure 1A), or $w$ and $J$ (Figure 1B).
Increasing the number of inputs or outputs introduces no real difficulties, so we treat the simplest case of one input and one output.
The idea introduced by Jaeger and Haas [6], which allows learning to be restricted solely to the output weights $w$, is to change equation 1 for the network of Figure 1B to
$$\tau \frac{dx}{dt} = -x + Jr + uz + vI(t) = -x + (J + uw^T)r + vI(t) \tag{3}$$
for the network of Figure 1A. The components of $u$ are typically drawn independently and uniformly over the range $-1$ to $1$ and are not changed by the learning procedure. As indicated by the second equality in equation 3, the effective connectivity matrix of the network with the feedback loop in place is $J + uw^T$. This changes when $w$ is modified, even though $J$, $u$ and $v$ remain fixed. This is what provides the dynamic flexibility for this form of learning. The problem we are trying to solve is to duplicate the effects of the feedback loop in the network of Figure 1A by making the modification $J \to J + \delta J$ in the network of Figure 1B. A comparison of equations 1 and 3 would appear to provide an obvious solution: simply set $\delta J = uw^T$. In other words, the network without output feedback is equivalent to the network with feedback if the rank-one matrix $uw^T$ is added to $J$. The problem with this solution is that the replacement $J \to J + uw^T$ typically violates the sparseness constraint on $J$. Even if both $u$ and $w$ are sparse, it is unlikely that the outer product $uw^T$ will satisfy the specific sparseness conditions imposed on $J$. This is the real problem we consider: duplicating the effect of the addition of a rank-one matrix to the recurrent connectivity by a modification of higher rank that respects the sparseness of the network.
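The equivalence between output feedback and a rank-one change to the recurrent matrix can be checked numerically. The following is a minimal sketch, not the authors' simulation code: it assumes the rate dynamics $\tau\,dx/dt = -x + Jr + uz + vI(t)$ with $r = \tanh(x)$ and $z = w^T r$, integrates them with a simple forward-Euler step, and uses illustrative parameter values. The helper names `make_sparse_J` and `simulate` are hypothetical.

```python
import numpy as np

def make_sparse_J(N, n, g, rng):
    """N x N matrix with exactly n non-zero entries per row, each drawn
    from a Gaussian with zero mean and variance g**2 / n."""
    J = np.zeros((N, N))
    for i in range(N):
        cols = rng.choice(N, size=n, replace=False)
        J[i, cols] = rng.normal(0.0, g / np.sqrt(n), size=n)
    return J

def simulate(J, w, v, I, u=None, tau=10.0, dt=1.0, x0=None):
    """Forward-Euler integration of tau dx/dt = -x + J r + u z + v I(t).
    Passing u=None drops the output feedback term."""
    N = J.shape[0]
    x = np.zeros(N) if x0 is None else np.array(x0, dtype=float)
    z_trace = np.empty(len(I))
    for k in range(len(I)):
        r = np.tanh(x)                      # firing rates
        z = w @ r                           # linear readout
        feedback = u * z if u is not None else 0.0
        x = x + (dt / tau) * (-x + J @ r + feedback + v * I[k])
        z_trace[k] = z
    return z_trace

rng = np.random.default_rng(0)
N, n, g = 200, 20, 1.5                      # illustrative sizes only
J = make_sparse_J(N, n, g, rng)
w = rng.normal(0.0, 1.0 / np.sqrt(N), size=N)
u = rng.uniform(-1.0, 1.0, size=N)
v = rng.uniform(-1.0, 1.0, size=N)
I = np.sin(2.0 * np.pi * np.arange(200) / 100.0)
x0 = rng.normal(0.0, 0.5, size=N)

# Network with output feedback versus network with J + u w^T and no feedback.
z_feedback = simulate(J, w, v, I, u=u, x0=x0)
J_dense = J + np.outer(u, w)
z_rank_one = simulate(J_dense, w, v, I, x0=x0)
```

The two output traces agree to numerical precision, but `J_dense` has essentially no zero entries, which is exactly the violation of the sparseness constraint that the rest of the paper works around.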

Review of the FORCE Learning Rule
Because the FORCE learning algorithm provides the motivation for our work, we briefly review how it works. More details can be found in [7]. The FORCE learning rule is a supervised learning procedure, based on the recursive least squares algorithm (see [14]), that is designed to stabilize the complex and potentially chaotic dynamics of recurrent networks by making very fast weight changes with strong feedback. We describe two versions of FORCE learning, one applied solely to the output weights of a network with the architecture shown in Figure 1A, and the other applied to both the recurrent and output weights of a network of the form shown in Figure 1B. In both cases, learning is controlled by an error signal, which is the difference between the actual network output, $z$, and the desired or target output, $f$,
$$e(t) = z(t) - f(t). \tag{4}$$
For the architecture of Figure 1A, learning consists of modifications of the output weights made at time intervals $\Delta t$ and defined by
$$w(t) = w(t - \Delta t) - e(t)P(t)r(t). \tag{5}$$
$P(t)$ is a running estimate of the inverse of the network correlation matrix,
$$C = \sum_t r(t)r^T(t), \tag{6}$$
where the sum over $t$ refers to a sum over samples of $r$ taken at different times. FORCE learning is based on a related matrix $C_{\text{approx}}(t)$ that is initially set proportional to the identity matrix, $C_{\text{approx}}(0) = \alpha I$. At each learning interval, $C_{\text{approx}}(t)$ is updated with a sample of $r$, so that $C_{\text{approx}}(t) = C_{\text{approx}}(t - \Delta t) + r(t)r^T(t)$.
As $t \to \infty$, $C_{\text{approx}}(t)$ approaches the correlation matrix $C$ defined in equation 6 (more precisely, they approach each other if normalized by the number of samples). At each time step, $P(t)$ is the inverse of $C_{\text{approx}}(t)$; however, it does not have to be determined by computing a matrix inverse. Instead, it can be computed recursively using the update rule, which is derived from the Woodbury matrix identity [14],
$$P(t) = P(t - \Delta t) - \frac{P(t - \Delta t)\,r(t)\,r^T(t)\,P(t - \Delta t)}{1 + r^T(t)\,P(t - \Delta t)\,r(t)}. \tag{7}$$
Equations 5 and 7 define FORCE learning applied to $w$. The factor $\alpha^{-1}$ acts both as the initial learning rate and as a regularizer for the recursive matrix inversion being performed. By setting $\alpha^{-1}$ to a large value, the learning rule is able to drive the network out of the chaotic regime by feeding back a close approximation of the target signal $f(t)$ through the feedback weights $u$ [7]. As learning progresses, the matrix $P$ acts as a set of $N$ learning rates with a $1/t$ annealing schedule. This is seen most clearly by shifting to a basis in which $P$ is diagonal. Provided that learning has progressed long enough for $P$ to have converged to the inverse correlation matrix of $r$, the diagonal basis is achieved by projecting $w$ and $r$ onto principal component (PC) vectors of $C$. In this basis, the learning rate, $\eta_a$, for the component of $w$ aligned with PC vector $a$ after $M$ weight updates is $1/(M\lambda_a + \alpha)$, where $\lambda_a$ is the corresponding PC eigenvalue. This rate divides the learning process into two phases. The first is an early control phase when $M < \alpha/\lambda_a$ and $\eta_a \approx 1/\alpha$, and the major role of weight modification is virtual teacher forcing, that is, to keep the output close to $f(t)$ and drive the network out of the chaotic regime. The second phase begins when $M > \alpha/\lambda_a$ and $\eta_a \approx 1/(M\lambda_a)$, and now the goal of weight modification is traditional learning, i.e. to find a static set of weights that makes $z(t) = f(t)$. Components of $w$ with large eigenvalues quickly enter the learning phase, whereas those with small eigenvalues spend more time in the control phase. Controlling the components with small eigenvalues allows weight projections in dimensions with large eigenvalues to be learned despite the initial chaotic state of the network. At all times during learning, the network is driven through $u$ with a signal that is approximately equal to $f(t)$, thus the name FORCE learning (First-Order Reduced and Controlled Error learning).
FORCE learning was also proposed as a method for inducing a network without feedback (Figure 1B) to perform a task by simultaneously modifying $w$ and $J$. In this formulation, equations 5 and 7 are applied to the actual output unit and, in addition, to each unit of the network, which is treated as if it were providing the output itself. In other words, equations 5 and 7 are applied to every unit of the network, including the output, all using the same error signal defined by equation 4. The only difference is that the modification in equation 5 for network unit $i$ is applied to the vector of weights $J_{ij}$ for all $j$ for which $J_{ij} \neq 0$ rather than $w$, and the values of $r$ used in equations 5 and 7 are restricted to those values providing input to unit $i$. Details of this procedure are provided in [7] and, in addition, this "in-network" algorithm is rederived in a later section below. The idea of treating a network unit as if it were an output is also a recurring theme in the following sections.
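The recursive least squares core of the output-weight rule can be illustrated in isolation. The sketch below is not the full closed-loop FORCE procedure: it assumes surrogate random rate vectors rather than rates generated by a network with feedback, initializes $P(0) = I/\alpha$, and computes the error with the weights from before the update. The function name `force_step` and all parameter values are illustrative.

```python
import numpy as np

def force_step(w, P, r, f):
    """One recursive-least-squares update of the output weights."""
    Pr = P @ r
    # Rank-one update of the running inverse correlation matrix
    # (P is symmetric, so P r r^T P equals outer(Pr, Pr)).
    P = P - np.outer(Pr, Pr) / (1.0 + r @ Pr)
    e = w @ r - f              # error computed with the pre-update weights
    w = w - e * (P @ r)        # weight change proportional to e * P * r
    return w, P, e

rng = np.random.default_rng(0)
N, M, alpha = 20, 300, 1.0                 # illustrative sizes only
w_target = rng.normal(size=N)
R = np.tanh(rng.normal(size=(M, N)))       # surrogate rates, one row per sample
F = R @ w_target                           # target output for each sample

w = np.zeros(N)
P = np.eye(N) / alpha                      # inverse of C_approx(0) = alpha * I
for r, f in zip(R, F):
    w, P, e = force_step(w, P, r, f)

# Direct (non-recursive) computation for comparison.
C_approx = alpha * np.eye(N) + R.T @ R
w_ridge = np.linalg.solve(C_approx, R.T @ F)
```

In this open-loop setting the recursion reproduces the directly computed inverse of `C_approx`, and the weights match the regularized least-squares solution, which illustrates the role of $\alpha$ as a regularizer.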

Learning in Sparse Networks
Because sparseness constraints are essential to the problem we are considering, it is useful to make the sparseness of the network explicit in our formalism. To do this, we change the notation for $J$. Each row of $J$ has only $n < N$ non-zero elements. We collect all the non-zero elements in row $i$ of the matrix $J$ into an $n$-dimensional column vector $j_{(i)}$. In addition, for each unit (unit $i$ in this case) we introduce an $n \times N$ matrix $S_{(i)}$ that is all zeros except for a single 1 in each row, with the location of the 1 in the $k$-th row indicating the identity of the $k$-th non-zero connection in row $i$ of $J$. Using this notation, equation 1 for unit $i$ can be rewritten as
$$\tau \frac{dx_i}{dt} = -x_i + j_{(i)}^T S_{(i)} r + v_i I(t), \tag{8}$$
a notation that, as stated, explicitly identifies and labels the sparse connections. This is only a change of notation; the set of equations 8 for $i = 1, 2, \ldots, N$ is completely equivalent to equation 1. However, in this notation, the sparseness constraint on $\delta J$ is easy to implement; we can modify the $n$-dimensional vectors $j_{(i)}$, for $i = 1, 2, \ldots, N$, by $j_{(i)} \to j_{(i)} + \delta j_{(i)}$ with no restrictions on the vectors $\delta j_{(i)}$. According to equation 8, the modification $j_{(i)} \to j_{(i)} + \delta j_{(i)}$ induces an additional input to unit $i$ given by $\delta j_{(i)}^T S_{(i)} r$. This will duplicate the effect of the feedback term in equation 3, if we can choose $\delta j_{(i)}$ such that
$$\delta j_{(i)}^T S_{(i)} r \approx u_i w^T r. \tag{9}$$
The goal of learning in a sparse network is to make this correspondence as accurate as possible for each unit (exact equality may be unattainable). By doing this, the total input to unit $i$ in the network of Figure 1B is whatever it receives through its original recurrent connections plus the contribution from changing these connections, $\delta j_{(i)}^T S_{(i)} r$, which is now as equal as possible to the input provided by the feedback loop, $u_i w^T r$, in the network with feedback (Figure 1A). In this way, a network without an output feedback loop operates as if the feedback were present.
Equivalence of training a sparse unit and a sparse output. Equation 9, which is our condition on the change $\delta j_{(i)}$ of the sparse connections for unit $i$, is similar in form to equation 2 that defines the network output. To make this correspondence clearer we write
$$\delta j_{(i)} = u_i w_{\text{sparse}}. \tag{10}$$
Each unit of the network has its own vector $w_{\text{sparse}}$ if this equation is applied to all network units, so $w_{\text{sparse}}$ should really have an identifying index $(i)$ similar to the subscript on $\delta j_{(i)}$. However, because each network unit is statistically equivalent in a randomly connected network with fixed sparseness per unit, we can restrict our discussion, at this point, to a single unit and thus a single vector $w_{\text{sparse}}$. This allows us to drop the identifier $(i)$, which avoids excessive indexing. Similarly, we will temporarily drop the $(i)$ index on $S_{(i)}$, simply calling it $S$. We return to discussing the full ensemble of network units and re-introduce the index $i$ in a following section. From equation 9, we can define the quantity
$$z_{\text{sparse}}(t) = w_{\text{sparse}}^T S r(t). \tag{11}$$
Satisfying equation 9 as nearly as possible then amounts to making $z_{\text{sparse}}(t)$ as close as possible to $z(t)$. Comparing equations 2 and 11 shows that, although $z_{\text{sparse}}(t)$ arises from our consideration of the recurrent inputs to a network unit, it is completely equivalent to an output extracted from the network, just as $z(t)$ is extracted, except that there is a sparseness constraint on the output weights. Therefore, the problem we now analyze, namely how $w_{\text{sparse}}$ can be chosen to minimize the difference between $z_{\text{sparse}}(t)$ and $z(t)$, is equivalent to examining how accurately a sparsely connected output can reproduce the signal coming from a fully connected output. In order for our results to apply more generally, we allow the number of connections to this hypothetical sparse unit, which is the dimension of $w_{\text{sparse}}$, to be any integer $m < N$, although for the network application we started with and will come back to, $m = n$.
We optimize the match between $z_{\text{sparse}}(t)$ and $z(t)$ by minimizing $\int dt\,(z_{\text{sparse}}(t) - z(t))^2$. Solving this least-squares problem gives
$$w_{\text{sparse}} = (SCS^T)^{+} SCw, \tag{12}$$
with $C$ defined by equation 6. The superscript $+$ indicates a pseudoinverse, which is needed here because $SCS^T$ may not be invertible. The matrix being pseudoinverted in equation 12 is not the full correlation matrix, but rather $C$ restricted to the $m \times m$ elements corresponding to correlations between units connected to the sparse output or, equivalently, the network unit that we are considering. This pseudoinverse matrix multiplies (with the sum in the matrix product restricted by $S$ to sparse terms) the correlation matrix times the full weight vector. Note that if $m$ is equal to $N$ and the connections are labeled in a sensible way, $S$ is the identity matrix and equation 12 reduces to $w_{\text{sparse}} = w$. This recovers the trivial solution for modifying the network connections implied by the second equality in equation 3. We now study the non-trivial case, when $0 < m < N$.
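The least-squares solution for the sparse readout can be written in a few lines. This is a minimal sketch under stated assumptions: the rates are surrogate random samples rather than network activity, the correlation matrix is the unnormalized sum of outer products, and the helper names `selection_matrix` and `sparse_readout_weights` are hypothetical.

```python
import numpy as np

def selection_matrix(idx, N):
    """m x N matrix S with a single 1 per row, picking out the units in idx."""
    S = np.zeros((len(idx), N))
    S[np.arange(len(idx)), idx] = 1.0
    return S

def sparse_readout_weights(C, w, idx):
    """w_sparse = pinv(S C S^T) S C w for a readout restricted to units idx."""
    S = selection_matrix(idx, C.shape[0])
    return np.linalg.pinv(S @ C @ S.T) @ (S @ C @ w)

rng = np.random.default_rng(1)
N, T, m = 50, 2000, 10                     # illustrative sizes only
R = np.tanh(rng.normal(size=(T, N)))       # surrogate rates, one row per sample
C = R.T @ R                                # sum over samples of r r^T
w = rng.normal(0.0, 1.0 / np.sqrt(N), size=N)

# With every unit selected, S is the identity and the full weights come back.
w_full = sparse_readout_weights(C, w, np.arange(N))

# With m < N units, the result is the ordinary least-squares fit of the
# full output z = R w using only the selected columns of R.
idx = np.sort(rng.choice(N, size=m, replace=False))
w_sparse = sparse_readout_weights(C, w, idx)
w_lstsq = np.linalg.lstsq(R[:, idx], R @ w, rcond=None)[0]
```

Because the surrogate rates here are full-rank, the sparse readout is only an approximation for `m < N`; the sections that follow concern activity of low effective dimension, where the approximation can become very accurate.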
For what follows, it is useful to express equation 12 in the basis of principal component vectors. To do this, we express $C = VDV^T$, where $V$ is the $N \times N$ matrix constructed by arranging the eigenvectors of $C$ into columns, and $D$ is the diagonal matrix of eigenvalues of $C$ ($D_{ii} = \lambda_i$, the $i$-th eigenvalue of $C$). These eigenvectors are the principal component (PC) vectors. We arrange the diagonal elements of $D$ and the columns of $V$ so that they are in decreasing order of PC eigenvalue. Using this basis, we introduce
$$\hat{w} = V^T w \quad \text{and} \quad \hat{w}_{\text{sparse}} = V^T S^T w_{\text{sparse}}, \tag{13}$$
where the hats denote vectors described in the PC basis. In this basis, equation 12 becomes
$$\hat{w}_{\text{sparse}} = V^T S^T (SVDV^TS^T)^{+} SVD\,\hat{w}. \tag{14}$$

The Dimension of Network Activity
Equation 11 corresponds to a sparsely connected unit with $n$ input connections attempting to extract the same signal $z(t)$ from a network as the fully connected output. For this to be done, it must be possible to access the full dynamics of $N$ network units from a sampling of only $n < N$ of them. The degree of accuracy of the approximate equality in equation 9 that can be achieved depends critically on the dimension of the activity of the network.
At any instant of time, the activity of an N-unit network is described by a point in an N-dimensional space, one dimension for each unit. Over time, the network state traverses a trajectory across this space. The dimension of network activity is defined as the minimum number of dimensions into which this trajectory, over the duration of the task being considered, can be embedded. If this can only be done to a finite degree of accuracy, we refer to the effective dimension of the network. The key feature of the networks we consider is that the effective dimension of the activity is typically less than, and often much less than, N.
For most networks performing tasks that involve inputs and parameters with reasonable values, the PC eigenvalues fall rapidly, typically exponentially [15][7][16]. Thus, we can write $\lambda_i = \exp(-i/p_{\text{eff}})$, where $p_{\text{eff}}$ acts as an effective dimension of the network activity. If $p_{\text{eff}} < N$, this raises the possibility that only $n \approx p_{\text{eff}}$ rates can provide access to all the information needed to reconstruct the activity of the entire network. Therefore, we ask: how many randomly chosen rates are required to sample the meaningful dimensions of network activity? In addressing this question, we first consider the idealized case when $p$ PC eigenvalues are nonzero and $N - p$ are identically zero. We then consider an exponentially decaying eigenvalue spectrum.

Accuracy of Sparse Readout
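For the idealized case where the activity of the network is strictly $p$-dimensional, we define $\tilde{V}$ as the $N \times p$ matrix obtained by keeping only the first $p$ columns of $V$ and, similarly, $\tilde{D}$ as the $p \times p$ diagonal matrix obtained by keeping only the nonzero diagonal elements of $D$. When $p < N$, we can replace $V$ and $D$ in equation 14 by $\tilde{V}$ and $\tilde{D}$, and ignore the components of $\hat{w}$ beyond the first $p$. Equation 14 then becomes
$$\hat{w}_{\text{sparse}} = \tilde{V}^T S^T (S\tilde{V}\tilde{D}\tilde{V}^TS^T)^{+} S\tilde{V}\tilde{D}\,\hat{w}. \tag{15}$$
The matrix $S\tilde{V}$ has dimension $m \times p$ and thus is not invertible if $m \neq p$. However, provided that the $m$ rows of $S\tilde{V}$ span $p$ dimensions (see the final section before the Discussion), we have
$$(S\tilde{V}\tilde{D}\tilde{V}^TS^T)^{+} = \left((S\tilde{V})^T\right)^{+} \tilde{D}^{-1} (S\tilde{V})^{+}. \tag{16}$$
Furthermore, if $m \geq p$, $(S\tilde{V})^{+}(S\tilde{V})$ is equal to the identity matrix (although $(S\tilde{V})(S\tilde{V})^{+}$ is not). As a result,
$$\hat{w}_{\text{sparse}} = (S\tilde{V})^T \left((S\tilde{V})^T\right)^{+} \tilde{D}^{-1} (S\tilde{V})^{+} (S\tilde{V})\,\tilde{D}\,\hat{w} = \hat{w}. \tag{17}$$
Therefore, $z_{\text{sparse}} = z$, and we find that a sparse output or a network unit with $m$ connections can reproduce the full output perfectly if $m \geq p$ and $p$, the dimension of the network activity, is less than $N$.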
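The exactness of sparse readout for strictly $p$-dimensional activity is easy to check numerically. The sketch below makes simplifying assumptions that are not in the text: the "rates" are a linear, rank-$p$ mixture of Gaussian latent signals (no $\tanh$ nonlinearity, so that the rank is exactly $p$), and a tolerance `rcond` is passed to the pseudoinverse so that numerically zero singular values are discarded. The function name `sparse_readout_error` is hypothetical.

```python
import numpy as np

def sparse_readout_error(R, w, idx, rcond=1e-10):
    """Relative squared error of the best readout restricted to units idx."""
    C = R.T @ R
    C_sub = C[np.ix_(idx, idx)]                          # S C S^T
    w_sparse = np.linalg.pinv(C_sub, rcond=rcond) @ (C @ w)[idx]
    z = R @ w                                            # full readout
    z_sparse = R[:, idx] @ w_sparse                      # sparse readout
    return np.sum((z_sparse - z) ** 2) / np.sum(z ** 2)

rng = np.random.default_rng(2)
N, T, p = 100, 500, 8                                    # illustrative sizes only
latent = rng.normal(size=(T, p))
mixing = rng.normal(size=(N, p))
R = latent @ mixing.T                                    # rank-p surrogate activity
w = rng.normal(0.0, 1.0 / np.sqrt(N), size=N)

# m >= p randomly chosen units: reconstruction is exact up to round-off.
err_enough = sparse_readout_error(R, w, rng.choice(N, size=2 * p, replace=False))
# m < p randomly chosen units: some of the output variance is missed.
err_too_few = sparse_readout_error(R, w, rng.choice(N, size=p // 2, replace=False))
```

With the random Gaussian mixing used here, any $m \geq p$ sampled rows span the $p$ dimensions with probability one, which is the spanning condition that the exactness argument requires.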
When the PC eigenvalues fall off exponentially with effective dimension $p_{\text{eff}}$, sparse reconstruction of a full network output is not perfect, but it can be extremely accurate. The error in approximating a fully connected output with a sparse output depends, of course, on the nature of the full output, which is determined by $w$. To estimate the error, and to compute it in network simulations, we assume that the components of $\hat{w}$ are chosen independently from a Gaussian distribution with zero mean and variance $1/N$. This is in some sense a worst case because, in applications involving a specific task, we expect that the components of $\hat{w}$ corresponding to PC vectors with large eigenvalues will dominate. Thus, the accuracy of sparse outputs in specific tasks (where $w$ is trained) is likely to be better than our error results with generic output weights.
The error we wish to compute is $\langle (z_{\text{sparse}}(t) - z(t))^2 \rangle$. As a standard against which to measure this error, we introduce another, more common way of approximating a full output using only $m$ terms: simply using the first $m$ components of $\hat{w}$ (in the PC basis) to construct an approximate output that we denote as $z_{\text{PC}}$. The error $\langle (z_{\text{PC}}(t) - z(t))^2 \rangle$ is easy to estimate, because this approximation matches the first $m$ PCs exactly and sets the rest to zero. The error coming from the $N - m$ missing components is
$$\frac{\langle (z_{\text{PC}}(t) - z(t))^2 \rangle}{\sigma_z^2} \approx \exp(-m/p_{\text{eff}}). \tag{18}$$
Here
$$\sigma_z^2 = \langle z^2(t) \rangle \approx \frac{p_{\text{eff}}}{N}, \tag{19}$$
using the same set of results and approximations as for equation 18. In this context, the squared error of the approximation is expressed as the fraction of the output variance that is missing. We expect the error for $z_{\text{sparse}}$ to be larger than that for $z_{\text{PC}}$ because $\hat{w}_{\text{sparse}}$ does not perfectly match the first $m$ components of $\hat{w}$, nor does it approximate the remaining components as zero. We extracted a good fit to the error for a sparse output with $m$ connections when the effective network dimension is $p_{\text{eff}}$ by studying a large number of numerical experiments and network simulations (for examples, see Figure 2). We found that this error is well-approximated by
$$\frac{\langle (z_{\text{sparse}}(t) - z(t))^2 \rangle}{\sigma_z^2} \approx \left(1 + \frac{m}{p_{\text{eff}}}\right) \exp(-m/p_{\text{eff}}). \tag{20}$$
The difference between the accuracy of the output formed by $m$ random samplings of $r$ and that constructed by a PC analysis is the factor $1 + m/p_{\text{eff}}$ in equation 20. This factor grows with $m$, but it multiplies a term that decays exponentially as $m$ increases. Thus, using $m$ randomly selected inputs is almost as good as using an optimal PC approximation with $m$ modes. The latter requires full knowledge of the eigenvectors and the locations of the meaningful PC dimensions, whereas the former relies only on random sampling.
To illustrate the accuracy of these results, we constructed a network with $N = 1000$, $n = 100$, $g = 1.5$ and $\tau = 10$ ms, and injected a time-dependent input with variable amplitude. Changing the amplitude of the input allowed us to modulate $p_{\text{eff}}$, which is a decreasing function of input amplitude [17]. The readout weights, $w$, were selected randomly so that all modes of the network were sampled. There is good agreement between the results of the network simulation for the error in $z_{\text{PC}}$ (filled blue circles) and equation 18 (blue curve), and the error in $z_{\text{sparse}}$ (filled red circles) and our estimate, equation 20 (red curve). Both equations fit the simulation data over a wide range of $m$ and $p_{\text{eff}}$ values.
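The exponential fall-off of the PC truncation error follows directly from the two stated assumptions: an eigenvalue spectrum $\lambda_i = \exp(-i/p_{\text{eff}})$ and PC-basis weights with zero mean and variance $1/N$. Under those assumptions the expected missing fraction of output variance is approximately $\exp(-m/p_{\text{eff}})$ when $N$ is much larger than $p_{\text{eff}}$. The sketch below evaluates the expected sums directly rather than simulating a network; the function name `missing_variance_fraction` is hypothetical, and only the PC-truncation estimate is checked, not the empirical fit for the sparse readout.

```python
import numpy as np

def missing_variance_fraction(m, N, p_eff):
    """Expected fraction of output variance lost when only the first m
    PC components of the readout weights are kept."""
    i = np.arange(1, N + 1)
    lam = np.exp(-i / p_eff)           # assumed exponential PC spectrum
    expected_w2 = 1.0 / N              # variance of each PC-basis weight
    total = np.sum(lam * expected_w2)            # expected output variance
    missing = np.sum(lam[m:] * expected_w2)      # components m+1, ..., N
    return missing / total

N, p_eff = 1000, 20.0                  # illustrative values only
for m in (20, 50, 100):
    exact = missing_variance_fraction(m, N, p_eff)
    approx = np.exp(-m / p_eff)
    print(m, exact, approx)
```

The ratio of geometric sums differs from $\exp(-m/p_{\text{eff}})$ only by terms of order $\exp(-(N-m)/p_{\text{eff}})$, which are negligible for the values used here.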

Transfer of Learning from a Feedback to a Non-Feedback Network
We now return to the full problem of adjusting the recurrent weights for every unit in a network in order to reproduce the effects of an output feedback loop. This merely involves extending the previous results from a single unit to all the units. In other words, we combine equations 10 and 12 to obtain an equation determining $\delta j_{(i)}$ for all $i$ values,
$$\delta j_{(i)} = u_i \left(S_{(i)} C S_{(i)}^T\right)^{+} S_{(i)} C w. \tag{21}$$
Note that we have restored the $(i)$ indexing that identifies the sparseness matrices for each unit. If these adjustments satisfy equation 9 to a sufficient degree of accuracy, a network of the form shown in Figure 1B, with the synaptic modification and output weights $w$, should have virtually identical activity to a network with unmodified recurrent connections, the same output weights, and feedback from the output back to the network (Figure 1A). We discuss the conditions required for this to happen in the final section before the Discussion. An example of a network constructed using equation 21 is shown in Figure 3. First, a network ($N = 2000$, $n = 600$, $g = 1.35$, $\tau = 10$ ms) with output feedback was trained with online FORCE learning to generate an output pulse after receiving two brief input pulses, but only if these pulses were separated by less than 1 second (Figure 3A, left column). When presented with input pulses separated by more than 1 second, the network was trained not to produce an output pulse (Figure 3A, right column). The input pairs were always either less than 975 ms or more than 1025 ms apart to avoid ambiguous intervals extremely close to 1 s. The learning was then batch transferred to the recurrent connections using equation 21, and the output feedback to the network was removed. After this transfer of learning to the sparse recurrent weights, the network performed almost exactly as it did in the original configuration (Figure 3B).
Over 940 trials, the original feedback network performed perfectly on this task, and the network with no feedback but learning transferred to its recurrent connections performed with 98.8% accuracy. The green traces in Figure 3 show that $\delta j_{(i)}^T S_{(i)} r$ matches $u_i w^T r$ quite accurately.
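The batch transfer step amounts to one small least-squares problem per unit, restricted to that unit's existing connections. The following is a minimal sketch, not the code used for Figure 3: it assumes the per-unit solution $\delta j_{(i)} = u_i (S_{(i)} C S_{(i)}^T)^{+} S_{(i)} C w$, uses rank-$p$ linear surrogate activity in place of recorded network rates, and passes an explicit tolerance to the pseudoinverse. The function name `transfer_to_recurrent` and all sizes are illustrative.

```python
import numpy as np

def transfer_to_recurrent(J, C, u, w, rcond=1e-10):
    """Return dJ, with the sparsity pattern of J, whose row i approximates
    the feedback input u_i * w^T r using only the inputs unit i already has."""
    N = J.shape[0]
    dJ = np.zeros_like(J)
    Cw = C @ w
    for i in range(N):
        idx = np.flatnonzero(J[i])           # presynaptic units of unit i
        C_sub = C[np.ix_(idx, idx)]          # S_i C S_i^T
        dJ[i, idx] = u[i] * (np.linalg.pinv(C_sub, rcond=rcond) @ Cw[idx])
    return dJ

rng = np.random.default_rng(3)
N, n, p, T = 60, 15, 5, 400                  # illustrative sizes only
J = np.zeros((N, N))
for i in range(N):
    cols = rng.choice(N, size=n, replace=False)
    J[i, cols] = rng.normal(0.0, 1.5 / np.sqrt(n), size=n)

R = rng.normal(size=(T, p)) @ rng.normal(size=(p, N))   # rank-p surrogate rates
C = R.T @ R
u = rng.uniform(-1.0, 1.0, size=N)
w = rng.normal(0.0, 1.0 / np.sqrt(N), size=N)

dJ = transfer_to_recurrent(J, C, u, w)

feedback_input = np.outer(R @ w, u)          # entry [t, i] = u_i * w^T r(t)
transferred_input = R @ dJ.T                 # entry [t, i] = (dJ r(t))_i
rel_err = (np.linalg.norm(transferred_input - feedback_input)
           / np.linalg.norm(feedback_input))
```

In this idealized setting, with $n \geq p$, the transferred input matches the feedback input to round-off. For real network activity with an exponentially decaying spectrum the match is approximate, as the reported 98.8% task accuracy reflects.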
Relation to simultaneous online learning of $w$ and $J$. The previous section described a batch procedure for transferring learning from output weights to recurrent connections. It is also possible to implement this algorithm as an online process. To do this, rather than duplicating the complete effects of feedback with output weight vector $w$ by making a batch modification $\delta j_{(i)}$, we can make a series of modifications $\Delta j_{(i)}(t)$ at each learning time step that duplicate the effects of a sequence of weight changes $\Delta w(t)$. We could accomplish this simply by applying equation 21 at each learning time step, replacing the factor of $w$ with $\Delta w(t)$. However, this would assume that we knew the correlation matrix $C$, whereas FORCE learning, as described earlier, constructs this matrix (actually a diagonally loaded version of its inverse) recursively. Therefore, the correct procedure is to replace the factors of $C$ in equation 21, when it is applied at time $t$, by $C_{\text{approx}}(t)$. Similarly, the matrix $(S_{(i)} C S_{(i)}^T)^{+}$ in equation 21 is replaced by a running estimate, $P_{(i)}(t)$, updated by an equation analogous to equation 7,
$$P_{(i)}(t) = P_{(i)}(t - \Delta t) - \frac{P_{(i)}(t - \Delta t)\,S_{(i)} r(t)\,r^T(t) S_{(i)}^T\,P_{(i)}(t - \Delta t)}{1 + r^T(t) S_{(i)}^T\,P_{(i)}(t - \Delta t)\,S_{(i)} r(t)}. \tag{22}$$
There is no problem with doing the inverse (rather than pseudoinverse) here because, as a consequence of setting $C_{\text{approx}}(0) = \alpha I$, $P_{(i)}$ is diagonally loaded. The recursive learning rule for modifying $J$ in concert with the modification of the output weights (equation 5) is then
$$\Delta j_{(i)}(t) = u_i P_{(i)}(t) S_{(i)} C_{\text{approx}}(t)\,\Delta w(t) = -e(t)\,u_i P_{(i)}(t) S_{(i)} C_{\text{approx}}(t) P(t) r(t) = -e(t)\,u_i P_{(i)}(t) S_{(i)} r(t),$$
because $C_{\text{approx}}(t)$ and $P(t)$ are inverses of each other. Thus,
$$j_{(i)}(t) = j_{(i)}(t - \Delta t) - e(t)\,u_i P_{(i)}(t) S_{(i)} r(t). \tag{23}$$
The factor of $u_i$ is needed if these modifications are designed to match those of a specific output feedback loop that uses $u$ as its input weights. If all that is required is to generate a network without a feedback loop (Figure 1B) that does a desired task, any non-singular set of $u_i$ values can be chosen, for example $u_i = 1$ for all $i$. Equation 23 is equivalent to the learning rule proposed previously when this particular choice of $u$ is made [7].
Note that all recurrent units and outputs are changing their weights through exactly the same functional form using only the global error and information that is local to each unit. Please see Appendix S1 in the supplemental materials for a derivation of these equations using index notation, which may be more helpful for implementation on a computer.
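The locality of the in-network rule can be made concrete with a short sketch. It is an illustration under stated assumptions rather than the implementation from Appendix S1: each unit keeps its own small matrix $P_{(i)}$ over its $n$ inputs, updated by the same rank-one recursion used for the output, and changes its incoming weights by $-e(t)\,u_i P_{(i)}(t) S_{(i)} r(t)$. The rates and the global error below are random stand-ins, and the function name `in_network_step` is hypothetical.

```python
import numpy as np

def in_network_step(j_list, P_list, idx_list, r, e, u):
    """One online update of every unit's sparse input weights.
    j_list[i]  : weights onto unit i (length n)
    P_list[i]  : unit i's running inverse correlation matrix (n x n)
    idx_list[i]: indices of the units that project to unit i
    r, e       : current rate vector and the single global error signal"""
    for i, idx in enumerate(idx_list):
        r_i = r[idx]                                   # S_i r: local inputs only
        Pr = P_list[i] @ r_i
        P_list[i] = P_list[i] - np.outer(Pr, Pr) / (1.0 + r_i @ Pr)
        j_list[i] = j_list[i] - u[i] * e * (P_list[i] @ r_i)

rng = np.random.default_rng(4)
N, n, alpha, M = 30, 6, 1.0, 100                       # illustrative sizes only
idx_list = [np.sort(rng.choice(N, size=n, replace=False)) for _ in range(N)]
j_list = [rng.normal(0.0, 1.5 / np.sqrt(n), size=n) for _ in range(N)]
P_list = [np.eye(n) / alpha for _ in range(N)]         # diagonally loaded start
u = np.ones(N)                                         # u_i = 1 for all i

rates = np.tanh(rng.normal(size=(M, N)))               # stand-in rate samples
errors = rng.normal(0.0, 0.1, size=M)                  # stand-in global error e(t)
for r, e in zip(rates, errors):
    in_network_step(j_list, P_list, idx_list, r, e, u)

# Each P_(i) tracks the inverse of alpha*I plus the summed outer products
# of that unit's own inputs.
local = rates[:, idx_list[0]]
P0_direct = np.linalg.inv(alpha * np.eye(n) + local.T @ local)
```

Each unit stores and updates only an $n \times n$ matrix and an $n$-vector, so the cost per step scales as $N n^2$ rather than $N^3$, and the only non-local quantity is the scalar error.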

Self-Sensing Networks and Compressed Sensing
We can now state the condition for successful transfer of learning between the networks of Figures 1A and 1B. This condition defines our term self-sensing. We require that, for each unit in the network, an appropriate modification of its sparse set of input weights allows the unit to approximate any function that can be extracted from the activity of the network by a linear readout with full connectivity. In other words, with an appropriate choice of $\delta j_{(i)}$, $\delta j_{(i)}^T S_{(i)} r(t)$ can approximate any readout, $z(t) = w^T r(t)$, for all $i$ from 1 to $N$.
Self-sensing and our analysis of it have relationships to the field of compressed sensing [18][19]. Both consider the possibility of obtaining complete or effectively complete knowledge of a large system of size $N$ from $m < N$ (and often $m \ll N$) random samples. Self-sensing, as we have defined it, refers to the accuracy of outputs derived from random sparse samples of network activity. Compressed sensing refers to complete reconstruction of a sparse data set from random sampling. The problem in compressed sensing is that the data can arise from a large or even infinite set of different low-dimensional bases, and the reconstruction procedure is not provided with knowledge about which basis is being used. In self-sensing, the sparse basis is given by PCA, but the problem is that a sparsely connected unit cannot perform PCA on the full activity of the network. No matter what computational machinery is available to a unit for computing PCs, it cannot find the high variance PC vectors due to a lack of information. In a parallel and distributed setting, the only strategy for a unit with sparse inputs to determine what a network is doing is through random sampling. The general requirements for both self- and compressed sensing arise from their dependence on random sampling. The conditions for both are similar because it is as difficult to randomly sample sparsely from a single, unknown low-dimensional space as it is to sample from a sparse one when the low-dimensional state is unknown.
Our approach to constructing weights for sparse readouts is to start with the matrix of PC eigenvectors $V$, keep only the $p$ relevant vectors giving $\tilde{V}$, and then randomly sample $m$ components from each of these vectors, giving the matrix $S\tilde{V}$ (e.g. see equation 14). Random sampling of this form will fail, that is, generate zero vectors, if any of the eigenvectors in $V$ are aligned with specific units or if the $m$ rows of $S\tilde{V}$ fail to span $p$ dimensions. These requirements for a self-sensing network correspond to the general concepts of incoherence and isotropy in the compressive sensing literature [19]. Put into our language, incoherence requires that the important PC eigenvectors not be concentrated onto a small number of units. If they were, it is likely that our random sparse sampling would miss these units and thus would have no access to essential PC directions. Isotropy requires that, over the distribution of random samples (all $S$), the columns of $SV$ are equally likely to point in all directions. This corresponds to our requirement that the $m$ rows of the matrix $S\tilde{V}$ span $p$ dimensions.
To be more specific, a random sampling of the network will fail to sample all of the modes of the network if some of the modes are created by single units. This problem can be eliminated by imposing an incoherence condition that the maximum element of Ṽ be of order 1/√N [18], which ensures that Ṽ is rotated well away from the single-unit basis (the basis in which each unit corresponds to a single dimension). We require this condition, but it is almost certain to be satisfied in the networks we consider. One reason for this is that the connectivity described by J is random, and no single unit or small set of units in the networks we consider is decoupled from the rest of the network. Further, random connections induce correlations between units, and these correlations almost always ensure that the eigenvector basis is rotated away from the single-unit basis. Even if such an aligned eigenvector existed, the loss in reconstruction accuracy would likely be small because the r variables defining the correlation matrix are bounded. This implies that it is unlikely that an aligned mode would be among those with the largest eigenvalues because eigenvectors involving all of the units can construct larger total variances.
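As a rough illustration of the incoherence condition, the largest element of Ṽ can be compared with 1/√N. The sketch below uses a hypothetical helper, coherence, and two stand-in bases that are assumptions of the example rather than data from the networks studied here: a random orthonormal basis and a basis aligned with single units.

```python
import numpy as np

def coherence(V_tilde):
    """Largest |element| of V_tilde in units of 1/sqrt(N); order 1 means incoherent."""
    N = V_tilde.shape[0]
    return np.sqrt(N) * np.max(np.abs(V_tilde))

rng = np.random.default_rng(1)
N, p = 400, 10

# Random orthonormal columns: a basis rotated well away from the single-unit basis.
Q, _ = np.linalg.qr(rng.standard_normal((N, p)))

# Worst case: each retained mode is carried by exactly one unit.
aligned = np.eye(N)[:, :p]

mu_random = coherence(Q)         # stays of order 1 (grows only slowly with N)
mu_aligned = coherence(aligned)  # equals sqrt(N)
```

In the aligned case, a random sample of m units misses any given single-unit mode with probability 1 - m/N, which is why such modes defeat sparse sampling.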
We now address the isotropy condition, which in our application means that the m columns of SṼ span p dimensions, as was required to prove that sparse reconstruction is exact if p ≤ m < N (equation 17). The columns of the full eigenvector matrix V are constrained to be orthogonal and so, of course, they isotropically sample the network space. However, if m ≪ N, the column vectors of SṼ are no longer orthogonal. We make the assumption that, in this limit, the elements selected by the random matrix S can be treated as independent random Gaussian variables. Studies of V matrices extracted from network activity and randomly sparsified support this assumption (Figure 4). If SṼ is a random Gaussian variable, the m columns of SṼ are unbiased and isotropically sample the relevant p-dimensional space.
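The isotropy assumption can be probed numerically along the following lines. This is only a sketch: a random orthonormal matrix stands in for Ṽ (in practice Ṽ would be extracted from network activity), Ṽ is stored as an N × p array, and the sizes and number of samples are arbitrary choices. It averages the rescaled Gram matrix of SṼ over many random S and collects the sampled elements, rescaled by √N.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, m, n_samples = 400, 5, 20, 2000

# Stand-in for the retained PC eigenvectors: N x p with orthonormal columns.
V_tilde, _ = np.linalg.qr(rng.standard_normal((N, p)))

gram_mean = np.zeros((p, p))
entries = []
for _ in range(n_samples):
    idx = rng.choice(N, size=m, replace=False)   # one random sparse sample S
    SV = V_tilde[idx, :]                         # m x p
    # Rescaled Gram matrix; its average over S should approach the identity.
    gram_mean += (N / m) * SV.T @ SV / n_samples
    entries.append(np.sqrt(N) * SV.ravel())
entries = np.concatenate(entries)

# Deviation of the averaged Gram matrix from the p x p identity.
isotropy_error = np.max(np.abs(gram_mean - np.eye(p)))

# Rescaled sampled elements should look like zero-mean, unit-variance variables.
entry_mean, entry_var = entries.mean(), entries.var()
```

A small isotropy_error indicates that, averaged over S, no direction in the p-dimensional space is favored. Any single sample of m ≪ N units still deviates from orthogonality, as noted above.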
In networks with a strictly bounded dimensionality of p, self-sensing requires n ≥ p. In networks with exponentially falling PC eigenvalues, self-sensing should be realized with an accuracy given by equation 20 if n > p_eff. The effective dimensionality is affected by the inputs to a network, which reduce p_eff for increasing input amplitude, and the variance of the elements of J (controlled by g²), which increases p_eff for increasing g². In response to an input [17] or during performance of a task, p_eff drops dramatically and is likely to be determined by the nature of the task rather than by N. The crucial interplay is then between the scale of the input and the variance of J, controlled by g². The self-sensing state should be achievable in many applications where the networks are either input driven or are pattern generators that are effectively input driven due to the output feeding back.
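One simple way to estimate an effective dimensionality from a PC spectrum is to fit the exponential fall-off directly. The sketch below assumes the parametrization λ_a ∝ exp(-a / p_eff), which is a choice made for this example and may differ in detail from the definition used with equation 20. The helper effective_dim and the synthetic spectrum are hypothetical.

```python
import numpy as np

def effective_dim(eigvals, n_fit):
    """Estimate p_eff by a straight-line fit to the log of the n_fit largest eigenvalues."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1][:n_fit]
    a = np.arange(n_fit)
    slope, _ = np.polyfit(a, np.log(lam), 1)   # log(lambda_a) = -a / p_eff + const
    return -1.0 / slope

# Synthetic spectrum with a known decay constant (assumed form, not measured data).
lam = np.exp(-np.arange(50) / 7.0)
p_eff = effective_dim(lam, n_fit=30)
```

For eigenvalues computed from real network activity, the fit should be restricted to the range where the spectrum is actually exponential, since the smallest eigenvalues are typically dominated by numerical noise.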

Discussion
We have presented both batch and online versions of learning within a recurrent network. The fastest way to train a recurrent network without feedback is first to train a network with feedback and then to transfer the learning to the recurrent weights using equation 21. This will work if the network is in what we have defined as a self-sensing state.
An interesting feature of the online learning we have derived is that equation 23, specifying how a unit internal to the network should change its input weights, and equation 5, determining the weight changes for the network output, are entirely equivalent. Both involve running estimates of the inverse correlation matrix of the relevant inputs (P^(i)(t) for network unit i and P(t) for the output) multiplying the firing rates of those inputs (either S^(i) r or r). Importantly, both involve the same error measure e(t). This means that a single global error signal transmitted to all network units and to the output is sufficient to guide learning. The modifications on network unit i are identical to those that would be applied by FORCE learning to a sparse output unit with connections specified by S^(i). In other words, each unit of the network is being treated as if it were a sparse readout trying to reproduce, as part of its input, the desired output of the full network. The self-sensing condition, which assures that this procedure works, relies on the same incoherence and isotropy conditions as compressed sensing. These assure that units with a sufficient number of randomly selected inputs have access to all, or essentially all, of the information that they would receive from a complete set of inputs. In this sense, a sparsely connected network in a self-sensing state acts as if it were fully connected.
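The structural equivalence of the two updates can be made concrete with a single update function applied to different inputs. The sketch below assumes the standard recursive-least-squares form of FORCE learning and omits any per-unit prefactor that equation 23 may contain. The function rls_step, the synthetic open-loop activity, and all parameter values are assumptions made for the example.

```python
import numpy as np

def rls_step(P, w, x, e):
    """One recursive-least-squares update for inputs x and a scalar error e."""
    Px = P @ x
    P = P - np.outer(Px, Px) / (1.0 + x @ Px)   # running inverse-correlation estimate
    w = w - e * (P @ x)                         # weight change proportional to the error
    return P, w

rng = np.random.default_rng(3)
N, p, m, T = 50, 3, 10, 500
A = rng.standard_normal((N, p))                 # maps p latent signals onto N units
idx = rng.choice(N, size=m, replace=False)      # S^(i): the m inputs seen by unit i

P_out, w_out = np.eye(N), np.zeros(N)           # output: P(t) and readout weights
P_i, J_i = np.eye(m), np.zeros(m)               # unit i: P^(i)(t) and its input weights

errors = []
for t in range(T):
    z = np.sin(2 * np.pi * np.arange(1, p + 1) * t / 100.0)
    r = A @ z                                   # stand-in for network firing rates
    f = z.sum()                                 # target output
    e = w_out @ r - f                           # single global error signal
    P_out, w_out = rls_step(P_out, w_out, r, e)     # output update, inputs r
    P_i, J_i = rls_step(P_i, J_i, r[idx], e)        # unit update, inputs S^(i) r
    errors.append(abs(e))

late_error = float(np.mean(errors[-50:]))       # mean |e| over the last 50 steps
```

In this open-loop demonstration the weights J_i do not feed back into the activity, so only the readout error is meaningful as a check. The point is that both updates call the same function with the same e(t) and differ only in which inputs they see.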

Supporting Information
Appendix S1 Equations with Indices for “internal” FORCE Learning Rule. (PDF)