Information Driven Self-Organization of Complex Robotic Behaviors

Information theory is a powerful tool for expressing principles that drive autonomous systems, because it is domain invariant and allows for an intuitive interpretation. This paper studies the use of the predictive information (PI), also called excess entropy or effective measure complexity, of the sensorimotor process as a driving force for generating behavior. We study nonlinear and nonstationary systems and introduce the time-local predictive information (TiPI), which allows us to derive exact results together with explicit update rules for the parameters of the controller in the dynamical systems framework. In this way the information principle, formulated at the level of behavior, is translated to the dynamics of the synapses. We underpin our results with a number of case studies on high-dimensional robotic systems. We show spontaneous cooperativity in a complex physical system with decentralized control. Moreover, a jointly controlled humanoid robot develops a high behavioral variety depending on its physics and the environment it is dynamically embedded in. The behavior can be decomposed into a succession of low-dimensional modes that increasingly explore the behavior space. This is a promising way to avoid the curse of dimensionality, which prevents learning systems from scaling well.


A Estimating the TiPI
As stated above, we consider the TiPI on the error propagation process because it allows us to derive explicit expressions. Thus we start with the definition of the error propagation, derive eq. (11), and provide further insights.
As a first step, using the notion of an orbit of the dynamical system defined by the map ψ : ℝ^n → ℝ^n, we define a sequence of states ŝ_{t′} ∈ ℝ^n for any time t′ within the time window t − τ ≤ t′ ≤ t, starting from state ŝ_{t−τ} = s_{t−τ}. Here ψ^{(k)}(s) denotes the k-fold iteration of the map ψ, with ψ^{(0)}(s) = s. We can consider ŝ_{t′} as the predicted state over t′ − (t − τ) time steps. In particular, the prediction over τ steps is ŝ_t = ψ^{(τ)}(s_{t−τ}). The error propagation can now be defined as the difference between the true state s_{t′}, eq. (6), and the state ŝ_{t′} obtained by the deterministic dynamics (ψ), see Figure 1:

δs_{t′} = s_{t′} − ŝ_{t′}.

The dynamics of the δs_{t′} obeys the rule

δs_{t′} = L(ŝ_{t′−1}) δs_{t′−1} + ξ_{t′} (A3)

with starting state δs_{t−τ} = 0 and L(s) denoting the Jacobian matrix of ψ. This can be derived by using ŝ_{t′} = ψ(ŝ_{t′−1}) and writing

s_{t′} = ψ(ŝ_{t′−1} + δs_{t′−1}) + ξ_{t′} ≈ ψ(ŝ_{t′−1}) + L(ŝ_{t′−1}) δs_{t′−1} + ξ_{t′}.

In the following we will use this approximation, which is arbitrarily good for infinitesimally small noise. Note that this dynamics corresponds to that of a linear system, however with a state-dependent dynamical operator L. In a linear system, L is independent of the state and thus ŝ_{t′} = L ŝ_{t′−1}, such that the dynamical evolutions of δs and s are the same.
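The linearized propagation can be checked numerically. The following is a minimal sketch (with an arbitrary smooth map ψ = tanh(A·s) and its Jacobian chosen purely for illustration): it runs the noisy dynamics, the noise-free orbit, and the linearized error propagation side by side, and confirms that for small noise the propagated δs matches the true prediction error s_t − ŝ_t.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical smooth map psi: R^n -> R^n and its Jacobian L(s).
def psi(s):
    return np.tanh(A @ s)

def jacobian(s):
    z = A @ s
    return (1.0 - np.tanh(z) ** 2)[:, None] * A  # diag(1 - tanh(z)^2) @ A

n, tau, noise = 3, 5, 1e-5
A = rng.normal(scale=0.5, size=(n, n))

# True noisy trajectory s_{t'} and noise-free orbit s_hat_{t'} from s_{t-tau}.
s = rng.normal(size=n)
s_hat = s.copy()
delta = np.zeros(n)                        # delta s_{t-tau} = 0
for _ in range(tau):
    xi = noise * rng.normal(size=n)
    delta = jacobian(s_hat) @ delta + xi   # linearized propagation, eq. (A3)
    s = psi(s) + xi                        # true stochastic dynamics
    s_hat = psi(s_hat)                     # deterministic prediction

# For small noise the linearized delta matches the true error s_t - s_hat_t.
print(np.allclose(delta, s - s_hat, atol=1e-7))
```

The linearization error is of second order in the noise amplitude, so shrinking `noise` tightens the agreement, in line with the approximation being arbitrarily good for infinitesimally small noise.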
As a remark, in the case of finite noise we can obtain a related exact rule by using the mean value theorem of differential calculus, which states that under mild restrictions one can find a state s̃_{t′} ∈ [ŝ_{t′}, s_{t′}] such that

δs_{t′} = L(s̃_{t′−1}) δs_{t′−1} + ξ_{t′} (A4)

yields the exact dynamics of the multi-step prediction error δs_{t′}. The interesting point now is that I_τ(S_t : S_{t−1}) (eq. (4)) is equal to that of the process defined by the error propagation dynamics, i.e.

I_τ(S_t : S_{t−1}) = I_τ(δS_t : δS_{t−1}).
For the proof consider two random vectors S and S′ together with the shifted vectors U = S + a and U′ = S′ + a′. Using that the probability densities p_S(s) and p_U(u) obey p_U(u) = p_U(s + a) = p_S(s), one obtains H(S) = H(U). Analogously, the joint probability densities obey p_{UU′}(u, u′) = p_{UU′}(s + a, s′ + a′) = p_{SS′}(s, s′), so that H(S′|S) = H(U′|U). This result is central for the following arguments: we will make use of the fact that the dynamics eq. (A3) is more easily treated to obtain explicit estimates for the TiPI and its gradient.
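The shift invariance of the entropies can be illustrated empirically: for a Gaussian, the differential entropy depends only on the covariance, and shifting samples by a constant vector leaves the covariance unchanged. A small sketch (with an arbitrary mixing matrix and shift vector chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_entropy(cov):
    # H = 0.5 * ln det(2*pi*e*Sigma) for a multivariate Gaussian.
    n = cov.shape[0]
    return 0.5 * (n * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

samples = rng.normal(size=(10000, 2)) @ np.array([[1.0, 0.3], [0.0, 0.5]])
shifted = samples + np.array([5.0, -2.0])          # U = S + a

# The empirical covariances (and hence the Gaussian entropies) agree: H(S) = H(U).
print(np.allclose(np.cov(samples.T), np.cov(shifted.T)))
```

Since the mutual information is a difference of such entropies, the same invariance carries over to the TiPI, which is what licenses working with the δs process.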

Explicit expressions
By iterating eq. (A3) we obtain an explicit expression for δs_t:

δs_t = Σ_{k=0}^{τ−1} L^{(k)} ξ_{t−k} (A6)

using here and in the following

L^{(k)} = L(ŝ_{t−1}) L(ŝ_{t−2}) ··· L(ŝ_{t−k}), L^{(0)} = I (A7)

for any t. In general it is very complicated to obtain the entropy of δS_t in realistic situations with high-dimensional physical systems. Therefore we will base the further considerations on a convenient estimate of the latter. With white Gaussian noise, the process δS_t is Gaussian as well, i.e. δS_t ∼ N(0, Σ_t) (it is a linear combination of independent Gaussians), so that the entropy is given in terms of the covariance matrix Σ_t of the random vector δS_t as [1]

H(δS_t) = ½ ln |Σ_t| + (n/2) ln(2πe),

|A| denoting the determinant of a square matrix A,

Σ_t = ⟨δs_t δs_t^⊤⟩ (A9)

is the covariance matrix of δS_t, and p(δs_t) is the probability density of the random variable δS_t. Using eq. (A6), explicit expressions for Σ can readily be obtained, see eq. (A13) below. By the same arguments, the conditional entropy is obtained, using eq. (7), as

H(δS_t | δS_{t−1}) = H(Ξ_t) = ½ ln |D_t| + (n/2) ln(2πe),

where Ξ denotes the process of the noise with p(ξ) being the probability density function of Ξ_t ∼ N(0, D_t). Thus we obtain the estimate of the TiPI as

Î_τ = H(δS_t) − H(Ξ_t) = ½ ln |Σ_t| − ½ ln |D_t|, (A11)

which is the entropy of the state δs minus that of the noise.
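In the Gaussian case the estimate above reduces to two log-determinants. A minimal sketch (the covariance values are arbitrary illustrations), including the sanity check that the estimate vanishes when δs_t is pure noise, i.e. when Σ_t = D_t:

```python
import numpy as np

# Entropy-based TiPI estimate: H(delta S_t) - H(Xi_t), both Gaussian,
# so only the log-determinants of the two covariances are needed.
def gaussian_entropy(cov):
    n = cov.shape[0]
    return 0.5 * (n * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

def tipi_estimate(sigma_t, d_t):
    return gaussian_entropy(sigma_t) - gaussian_entropy(d_t)

D = np.diag([1e-4, 2e-4])          # assumed noise covariance D_t
# With no deterministic amplification the error is pure noise,
# so Sigma_t = D_t and the TiPI estimate vanishes.
print(abs(tipi_estimate(D, D)) < 1e-12)
```

Any amplification of the noise by the dynamics makes |Σ_t| exceed |D_t| and hence the estimate positive.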

White noise
Explicit expressions revealing more details of the theory are obtained for the case of white noise, i.e.

⟨ξ_t ξ_{t′}^⊤⟩ = 0 if t ≠ t′, (A12)

so that using eq. (A6) in eq. (A9) yields

Σ_t = Σ_{k=0}^{τ−1} L^{(k)} D_{t−k} L^{(k)⊤}. (A13)

In particular, in the case of τ = 2, the shortest nontrivial time window, we find

Σ_t = D_t + L D_{t−1} L^⊤

(writing L for L(ŝ_{t−1})). It is also useful to introduce the transformed dynamical operator

L̂ = √(D_t)^{−1} L √(D_{t−1})

and (using eq. (A11))

Î_2 = ½ ln |I + L̂ L̂^⊤|.

This corresponds to using a so-called whitening transformation on the state dynamics, replacing in eq. (A4) the state vector δs by a new vector δx = √(D)^{−1} δs, so that the covariance matrix of the noise in the δx dynamics is just the unit matrix.
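The equivalence between the log-determinant difference and the whitened form can be verified numerically. A sketch for τ = 2 with an arbitrary Jacobian and diagonal noise covariances (all values are illustrative assumptions):

```python
import numpy as np

# White-noise case, tau = 2: Sigma_t = D_t + L D_{t-1} L^T, and with the
# whitened operator Lhat = D_t^{-1/2} L D_{t-1}^{1/2} one has
# ln|Sigma_t| - ln|D_t| = ln|I + Lhat Lhat^T|.
rng = np.random.default_rng(2)
n = 3
L = rng.normal(scale=0.4, size=(n, n))          # assumed Jacobian
D_t = np.diag(rng.uniform(0.5, 2.0, size=n))    # noise covariances
D_tm1 = np.diag(rng.uniform(0.5, 2.0, size=n))

sigma = D_t + L @ D_tm1 @ L.T
Lhat = np.diag(np.diag(D_t) ** -0.5) @ L @ np.diag(np.diag(D_tm1) ** 0.5)

lhs = np.linalg.slogdet(sigma)[1] - np.linalg.slogdet(D_t)[1]
rhs = np.linalg.slogdet(np.eye(n) + Lhat @ Lhat.T)[1]
print(np.isclose(lhs, rhs))
```

The whitened form makes explicit that only the noise-rescaled dynamics L̂ enters the TiPI, not the absolute noise scale.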
Interestingly, the L̂ operators also exist if the overall noise strength λ goes to zero, so that Î_τ stays finite although the defining entropies, conditioned on the state s_{t−τ}, are equal to zero in the deterministic system. This can be seen by introducing D̂ = λ^{−2} D, where D̂ stays finite as λ → 0: we have L̂ = √(D_t)^{−1} L √(D_{t−1}) = √(D̂_t)^{−1} L √(D̂_{t−1}) since λ cancels out.

The linear case
For linear systems explicit expressions for the PI were obtained in [2]. In this case L does not depend on the state s_t of the system, so that L^{(k)} = L^k in eq. (A7). Using eq. (A13) with τ → ∞, we recover the results of [2]. Note that all eigenvalues of the Jacobian matrix L must be less than one in absolute value so that the limit exists. This requirement also guarantees that the conditioning on s_{t−τ} loses its influence for τ → ∞. Under the additional assumptions that L is a normal matrix and the noise is isotropic, the explicit expression Σ = (I − LL^⊤)^{−1} D was obtained.
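The closed form for the stationary covariance can be checked against the series Σ = Σ_k L^k D L^{k⊤}. A sketch with a scaled rotation (a normal matrix with spectral radius below one) and unit isotropic noise, both chosen purely for illustration:

```python
import numpy as np

# Linear case: for a normal Jacobian L with isotropic unit noise,
# Sigma = sum_k L^k (L^T)^k converges to (I - L L^T)^{-1}
# when all eigenvalues of L lie inside the unit circle.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
L = 0.8 * R                       # normal matrix, spectral radius 0.8 < 1

sigma = np.zeros((2, 2))
term = np.eye(2)
for _ in range(200):
    sigma += term @ term.T        # accumulate L^k (L^k)^T
    term = L @ term

closed_form = np.linalg.inv(np.eye(2) - L @ L.T)
print(np.allclose(sigma, closed_form))
```

For a non-normal L the series still converges under the same spectral condition, but it no longer collapses to this simple closed form.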

B Explicit gradient step
In order to derive the general gradient step on the TiPI based on eq. (13), we need to calculate the derivative ∂/∂θ ln |Σ_t|. Considering any (square) matrix M depending on a single parameter θ_k of the set θ, we have (see for example [3])

∂/∂θ_k ln |M| = Tr(M^{−1} ∂M/∂θ_k),

so that, using Σ = Σ^⊤ = ⟨δs δs^⊤⟩ and omitting the time index,

∂/∂θ_k ln |Σ| = 2 Tr(Σ^{−1} ⟨(∂δs/∂θ_k) δs^⊤⟩). (A16)

By using the cyclic invariance of the trace we obtain from eq. (A16)

∂/∂θ ln |Σ| = 2 ⟨δs^⊤ Σ^{−1} (∂δs/∂θ)⟩,

now valid for the entire set of parameters θ. By eq. (A4) we obtain (ignoring the dependence of ξ on the parameters)

∂δs_t/∂θ = Σ_{k=0}^{τ−2} L^{(k)} (∂L(ŝ_{t−k−1})/∂θ) δs_{t−k−1},

where (Σ is symmetric) we introduce the auxiliary vector χ = Σ^{−1} δs_t. Stipulating the self-averaging property of the stochastic gradient, see section One-shot gradients for details, we realize the update rule as

Δθ = ε χ^⊤ Σ_{k=0}^{τ−2} L^{(k)} (∂L(ŝ_{t−k−1})/∂θ) δs_{t−k−1}.

Here we see again that τ = 2 is the simplest non-trivial case, where the sum consists of a single term.
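The log-determinant derivative identity at the heart of this step can be verified numerically. A sketch with a single parameter θ scaling an arbitrary fixed matrix L0 inside the two-step covariance Σ(θ) = D + θ² L0 D L0^⊤ (all quantities are illustrative assumptions):

```python
import numpy as np

# Check d/dtheta ln|Sigma(theta)| = Tr(Sigma^{-1} dSigma/dtheta)
# against a central finite difference of the log-determinant.
rng = np.random.default_rng(3)
n = 3
L0 = rng.normal(scale=0.3, size=(n, n))   # assumed fixed matrix
D = np.eye(n)

def sigma(theta):
    return D + theta ** 2 * (L0 @ D @ L0.T)

theta, eps = 0.9, 1e-5
dsigma = 2 * theta * (L0 @ D @ L0.T)      # exact dSigma/dtheta
trace_side = np.trace(np.linalg.inv(sigma(theta)) @ dsigma)

numeric = (np.linalg.slogdet(sigma(theta + eps))[1]
           - np.linalg.slogdet(sigma(theta - eps))[1]) / (2 * eps)
print(np.isclose(trace_side, numeric, rtol=1e-5, atol=1e-8))
```

The same identity, applied to Σ = ⟨δs δs^⊤⟩, is what turns the entropy gradient into an expression over the error propagation terms.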

Characterizing the parameter dynamics
In order to better characterize the parameter dynamics, let us consider for the moment Σ on the r.h.s. of eq. (A16) to be some fixed, positive definite matrix (not depending on the parameters θ_k). Then we can write

∂/∂θ ln |Σ| = ⟨∂/∂θ (δs^⊤ Σ^{−1} δs)⟩

(using the cyclic invariance of the trace in the last step). The update rule eq. (13) becomes, using again the self-averaging,

Δθ = ε ∂/∂θ ‖δs‖²_Σ, (A21)

where ‖a‖²_M = a^⊤ M^{−1} a defines the squared length of a vector a in the metric given by M (considered fixed in the current gradient step). From eq. (A21) it becomes obvious that following the gradient means increasing the norm of δs in the Σ metric.
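This interpretation can be illustrated with a toy model: with Σ held fixed, a single ascent step on the squared Σ-norm increases it. The parametrization δs(θ) = θ·v and all numerical values below are illustrative assumptions:

```python
import numpy as np

# Gradient ascent on ||delta s||^2_Sigma = delta_s^T Sigma^{-1} delta_s
# with Sigma held fixed, for a toy one-parameter family delta_s(theta) = theta*v.
Sigma = np.diag([2.0, 0.5])
Sigma_inv = np.linalg.inv(Sigma)
v = np.array([1.0, 1.0])

def norm_sq(theta):
    d = theta * v
    return d @ Sigma_inv @ d

theta, eps = 1.0, 0.01
grad = 2 * (theta * v) @ Sigma_inv @ v       # d/dtheta of the squared norm
theta_new = theta + eps * grad               # one gradient ascent step
print(norm_sq(theta_new) > norm_sq(theta))
```

Directions in which Σ assigns little variance (small entries of Σ) are weighted most strongly by the metric, so the gradient preferentially amplifies the error along the least-explored directions.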

C Neural networks-derivation of the update rule
We derive the parameter dynamics for neural networks eq. (28) from the general parameter dynamics for the two-step time window given by eq. (14). According to eq. (26) we have L = V G (z) C + T with z = Cs + h and G (z) = diag[g 1 (z), . . . , g m (z)]. Putting this into eq. (14) yields (omitting the time indices) The second term remains to be calculated. Because G is a diagonal matrix the vectors on both sides of the derivative carry the index i such that we get In the case of g (z) = tanh (z) we find, using g i (z) = −2g i (z) g i (z) and a = g(z) The final update rule follows by putting eq. (A24) and eq. (A26) into eq. (A22) Analogously we obtain the parameter dynamics of h as A more compact matrix notation can be obtained by introducing the diagonal matrix Γ Γ = diag[γ 1 , · · · , γ i ] and thus (reintroducing the time indices) In the case of arbitrary neuron activation functions g we obtain equivalent formula by defining Note the factor − g i g i gi is 2 in the case of g = tanh. In the derivation of eqs. (A24) and (A28) we ignored the dependence of the state s in g (Cs + h) on the parameters C and h. This dependence can be considered explicitly if the state is at a fixed point. In that case, a more detailed discussion in [4] (section 6.2) shows that the effect of the derivative can be condensed into the so-called sense parameter α multiplying γ. Thus we replace γ as where α is an empirical constant, typically α ≥ 1, by which the sensitivity of the sensorimotor dynamics to external perturbations can be regulated. This works also in more general cases like a limit cycle dynamics, see [4].