Stabilizing patterns in time: Neural network approach

Recurrent and feedback networks are capable of holding dynamic memories. Nonetheless, training a network for that task is challenging: one must cope with the nonlinear propagation of errors through the system, and small deviations from the desired dynamics, due to error or inherent noise, may have dramatic effects in the future. A method to cope with these difficulties is thus needed. In this work we focus on recurrent networks with linear activation functions and a binary output unit, and we characterize their ability to reproduce a temporal sequence of actions over the output unit. We suggest casting the temporal learning problem as a perceptron problem. In the discrete case a finite margin appears, providing the network some robustness to noise, within which it performs perfectly (i.e., it produces the desired sequence flawlessly for an arbitrary number of cycles). In the continuous case the margin approaches zero whenever the output unit changes state, so the network can reproduce the sequence only with slight jitters. Numerical simulations suggest that in the discrete-time case, the longest sequence that can be learned scales, at best, as the square root of the network size. A dramatic effect occurs when learning several short sequences in parallel: their total length substantially exceeds the length of the longest single sequence the network can learn. The model generalizes easily to an arbitrary number of output units, which boosts its performance; this effect is demonstrated on two practical examples of sequence learning. This work suggests a way to overcome stability problems in training recurrent networks and further quantifies the performance of a network under this specific learning scheme.

Here sign(x) = −1 for x ≤ 0, τ_x is the time scale of the generating neurons, and t ∈ [0, ∞) is a continuous variable representing time. We wish to examine whether the network is capable of learning a continuous target sequence, z_t(t). As in the discrete case, we would like the network to reproduce the target sequence upon cuing it with the appropriate initial condition, and then to keep reproducing it periodically. We chose to model the target sequence as a two-state (z_t = ±1) symmetric Markov process, and denote by τ_z^{-1} the process parameter, which reflects the rate at which the sequence switches state. Under these assumptions the time average of z_t(t) is 0, resembling the discrete-case target sequence.
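Such a target can be generated as a random telegraph process. The sketch below is our own minimal illustration (the function name and parameter values are assumptions, not from the paper): in each small time step dt the state flips with probability dt/τ_z, and by symmetry the time average is near zero.

```python
import numpy as np

def telegraph_process(T, tau_z, dt=0.01, rng=None):
    """Sample a two-state (z = +/-1) symmetric Markov (telegraph) process.

    The process switches state at rate 1/tau_z, i.e. in every small
    time step dt the state flips with probability dt/tau_z.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_steps = int(T / dt)
    z = np.empty(n_steps)
    state = rng.choice([-1.0, 1.0])
    for i in range(n_steps):
        if rng.random() < dt / tau_z:
            state = -state
        z[i] = state
    return z

z = telegraph_process(T=1000.0, tau_z=5.0, dt=0.05, rng=np.random.default_rng(0))
print(abs(z.mean()))  # small: the symmetric process spends equal time in each state
```

On average the sample above contains T/τ_z = 200 switches, so the empirical mean is close to 0.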
As in the discrete case, given a target sequence z_t(t), one can find the solution to Eq. (S8), where the initial condition is given by the periodicity constraint x(T) = x(0): Eq. (S9). Together with z_t(t), Eq. (S9) forms an infinite training set. In order to apply the same learning method as in the discrete case, one must form a finite training set from which a proper set of weights J can be inferred. Finding a method to construct such a set, from which the network can generalize and classify the entire sequence correctly, is the crux of the matter.
Our simulations show that finding such a solution is possible. S2 Fig shows an example of a stable solution: the network reproduces the desired sequence periodically, with only slight jitters in both the output unit and the activity within the network. S2 Fig C shows explicitly the projected error, R(n) · J, at each time step. We note that most of the time the projected error remains small, indicating that small perturbations rapidly decay. Larger errors occur at times when the readout unit produces erroneous output. When learning is successful, this error decays exponentially and does not induce future errors.
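The rapid decay of small perturbations can be illustrated in the linear regime. Assuming (as a sketch, not the paper's trained network) that a deviation δ(t) from the target trajectory obeys τ_x dδ/dt = −δ + Wδ between output switches, the deviation shrinks whenever the spectrum of W lies inside the unit disk; the random W below, scaled to spectral radius ≈ 0.5, is our own hypothetical choice.

```python
import numpy as np

# Assumed linearized deviation dynamics: tau_x * d(delta)/dt = -delta + W @ delta.
# With the spectrum of W inside the unit disk, delta decays exponentially.
rng = np.random.default_rng(0)
N = 100
W = rng.normal(scale=0.5 / np.sqrt(N), size=(N, N))  # spectral radius ~ 0.5
tau_x, dt = 1.0, 0.01

delta = rng.normal(size=N)          # small initial perturbation
norms = []
for _ in range(2000):               # integrate up to t = 20 * tau_x (Euler)
    delta = delta + (dt / tau_x) * (-delta + W @ delta)
    norms.append(np.linalg.norm(delta))
print(norms[0], norms[-1])  # the perturbation norm shrinks by orders of magnitude
```

The decay rate is governed by the eigenvalue of W with the largest real part; here it is at least (1 − 0.5)/τ_x, so over 20 time constants the perturbation essentially vanishes.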
Constructing The Training Set
In order to find an appropriate training set we rely on an iterative method of re-sampling and learning, where we limit the number of iterations to three. Obviously, if the target sequence is learnable, then there exists a sampling of the dynamics from which we can infer an appropriate set of weights J. Naturally a computational problem arises, as there are infinitely many ways to sample a single sequence. The challenge is thus to construct a sampling algorithm that is both computationally reasonable and reliable.
The sampling method we propose is as follows. First, we identify the transition points of the target sequence as key points, and we start by sampling them; note that on average the sequence contains T/τ_z switching points. In addition, we uniformly sample the sequence at frequency τ_s^{-1}, chosen to satisfy τ_z^{-1} < τ_s^{-1} < τ_x^{-1}. This choice reflects the fact that for a neuron with time constant τ_x to be able to track a target sequence that switches states at frequency τ_z^{-1}, the relation τ_x < τ_z must hold. Sampling the sequence at frequency τ_s^{-1} provides another T/τ_s points for the training set. Combining the points from the two sampling methods, we form a finite training set and can then try to solve the perceptron problem for this set. In case the learning does not converge, we re-sample the sequence at points where it was not classified correctly and add them to the training set. We repeat this procedure until the solution converges; in our simulations we restricted the number of iterations to three, mostly for computational reasons.
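The sample-and-learn loop above can be sketched as follows. This is a minimal illustration under our own assumptions: the trajectory x(t) is passed in as a precomputed array (in the paper it comes from Eq. (S9)), a plain online perceptron stands in for the learning step, and all names (`train_by_resampling`, `tau_s`, the toy data) are ours.

```python
import numpy as np

def train_by_resampling(x_traj, z_target, t_grid, tau_s,
                        max_iters=3, epochs=200, lr=0.1):
    """Iterative sample-and-learn sketch.

    x_traj  : (n_steps, N) network activity along the target trajectory
    z_target: (n_steps,) desired +/-1 output at each time step
    """
    dt = t_grid[1] - t_grid[0]
    # key points: transitions of the target sequence ...
    idx = set((np.flatnonzero(np.diff(z_target) != 0) + 1).tolist())
    # ... plus uniform samples at frequency 1/tau_s
    idx |= set(range(0, len(t_grid), max(1, int(round(tau_s / dt)))))
    idx = np.array(sorted(idx))
    J = np.zeros(x_traj.shape[1])
    for _ in range(max_iters):
        X, y = x_traj[idx], z_target[idx]
        for _ in range(epochs):          # online perceptron passes
            updated = False
            for xi, yi in zip(X, y):
                if yi * (xi @ J) <= 0:   # misclassified -> perceptron update
                    J += lr * yi * xi
                    updated = True
            if not updated:
                break
        # re-sample: add every time step that is still misclassified
        wrong = np.flatnonzero(np.sign(x_traj @ J) != z_target)
        if wrong.size == 0:
            break
        idx = np.union1d(idx, wrong)
    return J

# Toy check on a linearly separable "trajectory" with a built-in margin
rng = np.random.default_rng(1)
x_traj = rng.normal(size=(200, 20))
J_true = rng.normal(size=20)
J_true /= np.linalg.norm(J_true)
z_target = np.sign(x_traj @ J_true)
x_traj += 0.5 * z_target[:, None] * J_true   # guarantee a finite margin
t_grid = np.arange(200) * 0.1
J = train_by_resampling(x_traj, z_target, t_grid, tau_s=0.5)
accuracy = (np.sign(x_traj @ J) == z_target).mean()
print(accuracy)
```

Because the toy data are separable with a finite margin, the perceptron convergence theorem guarantees the inner loop fits each sampled set, and the re-sampling rounds mop up the remaining misclassified time steps.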

The Solution Dynamics
We first give a heuristic description of the solution dynamics in terms of its margin, κ. We start by plotting the averaged κ dynamics for a specific target sequence; that is, we use the same target sequence but change W and V in each realization. The results (S3 Fig) indicate that κ → 0 every time the target sequence switches state. Hence the network can only hope to produce the target sequence with small jitters. Thinking of the task as classifying a continuous trajectory in phase space (i.e., the x(t) trajectory) reveals this difficulty. If we denote by t_j a time at which the target sequence changes state, continuity implies lim_{Δt→0} [x(t_j + Δt) − x(t_j − Δt)] = 0; hence, for any choice of weights, κ → 0 at the switching points of the target sequence. We now turn to quantifying the performance of the solution.
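The continuity argument can be made concrete with a toy trajectory: for any unit weight vector J, the two samples straddling a switching time t_j carry opposite labels but nearly identical activity, so their smaller margin is bounded by half the chord, |x(t_j + Δt) − x(t_j − Δt)|/2, which vanishes as Δt → 0. The circular trajectory below is a hypothetical illustration, not the network's actual x(t).

```python
import numpy as np

def x(t):
    """Hypothetical smooth trajectory in phase space (stand-in for x(t))."""
    return np.array([np.cos(t), np.sin(t)])

t_j = 1.0  # assumed switching time of the target sequence

# For any unit J, the smaller margin of the pair straddling t_j is at most
# |x(t_j + dt) - x(t_j - dt)| / 2, since the two points have opposite labels.
bounds = [np.linalg.norm(x(t_j + dt) - x(t_j - dt)) / 2
          for dt in (0.1, 0.01, 0.001)]
print(bounds)  # the bound shrinks with dt, so kappa -> 0 at switching points
```

For this smooth trajectory the bound shrinks linearly with Δt, matching the observation that κ → 0 at every switching point regardless of the weights.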
Quantifying The Memory Capacity
We quantify the MC as the longest sequence the network is capable of learning (denoted T*). Since the nature of the target sequence is defined by its switching-rate parameter τ_z^{-1}, an informative quantity for the MC is the dimensionless variable T*/τ_z. This variable represents the expected maximal number of switches the target sequence can have and still be learnable. Results (S4 Fig) show that there exists an optimal ratio τ_z/τ_x for which the MC is maximized. Intuitively, one may assume that a sequence is easy for the network to learn when its typical time scale resembles the typical time scale of the network; hence from this numerical result we may estimate the effective time scale τ_x^eff of the network. In the regime of slow sequence dynamics, where the target sequence switches states at a frequency much slower than the effective time scale of the network, the shape of the memory curve can be fairly estimated from statistical considerations alone.