Automated Design of Complex Dynamic Systems

Several fields of study are concerned with uniting the concept of computation with that of the design of physical systems. For example, a recent trend in robotics is to design robots in such a way that they require a minimal control effort. Another example is found in the domain of photonics, where recent efforts try to benefit directly from the complex nonlinear dynamics to achieve more efficient signal processing. The underlying goal of these and similar research efforts is to internalize a large part of the necessary computations within the physical system itself by exploiting its inherent non-linear dynamics. This, however, often requires the optimization of large numbers of system parameters, related to both the system's structure as well as its material properties. In addition, many of these parameters are subject to fabrication variability or to variations through time. In this paper we apply a machine learning algorithm to optimize physical dynamic systems. We show that such algorithms, which are normally applied on abstract computational entities, can be extended to the field of differential equations and used to optimize an associated set of parameters which determine their behavior. We show that machine learning training methodologies are highly useful in designing robust systems, and we provide a set of both simple and complex examples using models of physical dynamical systems. Interestingly, the derived optimization method is intimately related to direct collocation a method known in the field of optimal control. Our work suggests that the application domains of both machine learning and optimal control have a largely unexplored overlapping area which envelopes a novel design methodology of smart and highly complex physical systems.


Structure
This document contains the mathematical derivations for the paper: Automated Design of Complex Dynamical Systems. It is structured as follows. Section 2 is devoted to explaining the BPTT algorithm as it is commonly used for training recurrent neural networks in machine learning. It explains the discrete-time version of BPTT and the necessary concepts. Is section 3 we introduce the online version of continuous time backpropagation through time for ordinary differential equations. In section 4 we show how it is possible to transform this into a computationally less demanding algorithm. Sections 5 to 7 provide similar derivations for important other kinds of differential equations, such that the primary theoretical contribution of this paper is contained within sections 3 to 7. Throughout this document, we will often use the word 'training' in the same context as 'optimisation'.

Discrete time backpropagation through time
Before we start, it is useful to consider notations. Throughout the following derivations (sections 3 to 7), we will use bold uncapitalised letters for column vectors, except for the gradients ∇ θ and γ θ which are row vectors by definition, and bold capital letters for matrices. When we define a derivative of the form dx/dy, this is a matrix with vertical size the dimension of x and horizontally that of y. In this section we will give a brief overview of backpropagation through time (BPTT) as it is commonly applied in recurrent neural networks (RNNs). Suppose we have a state a t , which is governed by the following update equation: in which s t is a discrete time external input signal, and θ is the set of parameters that need to be updated. In typical recurrent neural networks (RNN), these parameters are the elements that make up a square recurrent weight matrix W and input matrix V, and the nonlinear function is commonly a sigmoid function, e.g., the hyperbolic tangent. The equation takes the form a t+1 = tanh (Wa t + Vs t ) , but here we will keep the derivation fully general. The goal is to minimise a certain cost function C(a t , t) at a certain moment in time. The gradient of this cost function w.r.t the parameters is given by Here,ē t is the output error (the transpose of the gradient of the cost function w.r.t, a t ), and is the total state derivative w.r.t. θ. Due to the recursion relation, a t will depend on a t−1 , a t−2 , ..., a 0 (we assume that a 0 is a fixed initial value). This means that we can use the chain rule to "unfold" the total derivative in G t in time, until we end up with nothing but partial derivatives: and equation 3 can be rewritten as and More efficiently, G t can be computed with a recursion relation: This forms the basis of real-time recurrent learning (RTRL), an online variant of the more common BPTT scheme. It provides a gradient for the associated cost at each time step, such that optimisation can happen during the system updates. The downside of this approach is that, while K t is often very sparse, G t generally is not, and computing it iteratively is slow and cumbersome compared to updating the recursion equation. It has dimensions equal to the number of states times the number of parameters, which can be a very large number.
Suppose that instead of computing the cost gradient at just one moment in time, we consider its sum over a whole sequence of length T : Inserting equation 1 and replacing G t with the expression given in equation 6 yields If we exchange the order of summation, we get The variable e i can be computed recursively, just like G t , but backwards in time (hence the term backpropagation through time): This means that we can write the sum of the gradients over the whole sequence as The advantage of this expression is twofold. First of all, computing the sequence e i is usually comparable in computational cost to computing the sequence of states a t , as it has the same dimensionality. Furthermore, in most practical situations K t is highly sparse. This is especially true for RNNs, but the argument remains valid for many continuous time systems as well (which is important in the following sections). The sum over matrix-vector multiplications in equation 7 can commonly be written as a single, highly efficient matrix matrix multiplication which only takes a small fraction of additional time to compute. In what follows we will formally extend the presented algorithm to continuous time.

Cost gradient for ordinary differential equations
In this section we introduce the simplest method for computing a parameter gradient for dynamical systems (DS). Suppose we have a continuous-time dynamical system characterised by a state a(t) and a set of parameters θ. It is also excited by an external input signal s(t). The dynamics are governed by the following differential equation Suppose we have a cost function C(a(t), t) we wish to minimise. The gradient of this cost function w.r.t the parameters is given by Here,ē(t) is the output error (the gradient of the cost function w.r.t, a(t)), and is the total derivative of the state a(t) w.r.t. the parameters; a matrix with vertical dimension the number of elements in a(t), and horizontal dimension the number of elements in θ For notational reasons we also introduce the partial derivative: which has the same dimensions as G(t). Finally, we introduce the Jacobian of f w.r.t. a(t) as and w.r.t.ȧ(t) If we take the derivative of equation 8 w.r.t the parameters, we can write: In other words, we have defined a differential equation for the evolution of G(t), provided that J 1 (t) is an invertible matrix. This is for instance the case for explicit ODEs, where J 1 (t) is the identity matrix. Computing G(t) allows to perform online learning. We can compute G(t) online, together with the actual simulation of the DS, and slowly adjust the parameters concurrently 1 . In reality, the dimensionality of G(t) may prove to be impractically large. Even though the size of the matrix is the same, the partial derivative K(t) is usually much sparser, as most variables only depend directly on a small subset of the parameters.

Offline backpropagation through time
Very much like in discrete-time RNNs, it is possible to avoid explicitly computing G(t). We can for continuous time let the error evolve backwards in time too, where it evolves according to a differential equation instead of an update equation. It is possible to get the same result by considering BPTT on an Euler integration of the differential equation, and taking the limit of dt → 0. This step can be omitted, however, and we will stay in the continuous time domain, which proves that the validity of our approach does not depend on any specific method used to solve the differential equations. Let us start by considering the integral of the gradient for a given time span T . Indeed, if we consider limited time sequences in which we update the parameters slowly each moment in time using the previously defined gradient, the total change of the parameters would be the integral over time of the gradient (provided the change is so small that the dynamics of the DS are not changing during the considered time interval).
We will assume that the input signal, the internal state, the derivatives G(t) and the error signalē(t) are identical to zero for t < 0 and t > T . If we use γ θ to perform parameter updates, we can avoid computing G(t) explicitly. To achieve this, we introduce the variable e(t), which is governed by the following differential equation : where s = T − t, i.e., s is time running backwards, starting from T . This means that, in order to solve e(s) numerically, we need to integrate the equation backwards in time. The unknown variable x(s) is what we wish to find. Assuming thatē(t) = e(t) = 0 if t < 0 or t > T , we can implicitly solve equation 15 using an integral: or in the normal time domain this becomes: As we will show, e(t) is the continuous time equivalent of the backpropagated error. We start by replacing G(t) in equation 41 by its implicit integral solution: We can switch the order of integration over t and t , yielding If we reorder equation 17, we find that Taking the transpose and insertion in equation 18 yields Switching the order of integration in the final term yields such that the last two terms in equation 20 disappear if This means that with this definition of x(t) the time integral of the gradient reduces to in which the matrix G(t) no longer occurs.

Differential algebraic equations
In the case that J 1 (t) is non-invertible, the previous analysis no longer holds. The system of equations is known as differential algebraic equations (DAE). This situation happens often in engineering applications such as finite-element methods, electric circuits etc. Often, it is possible to rewrite the equations as follows: Here, the bottom equation is known as the algebraic part of the system. It will impose restrictions on the variables q(t) through additional variables r(t). First of all we introduce the following notations: K g (t) = ∂g(q(t), r(t), s(t), θ) ∂θ , K h (t) = ∂h(q(t), r(t), s(t), θ) ∂θ , Using these notations we can take the derivative of equation 23 w.r.t. θ: This takes on the form of another set of equations. For the remainder of the derivation we will assume that J h r (t) is non-singular, which is true for DAEs of index 1. Now we can solve for G r (t) in the second equation, which yields leaving a single differential equation for G q (t). We wish to find out if we can redo the previous analysis in order to find a backpropagation algorithm. First we define output errors 2 on q(t) and r(t) asē q (t) andē r (t). The integral over the gradient is then given by Eliminating G r (t) leads to The first term no longer contains G q (t). In the second term we replace the expression between curved brackets byē T x (t). Changing G q (t) into its implicit integral solution and changing the order of integration then yields for the second term (ST): We now assume that there exists a backpropagated error e q (t), defined by such that we find that We can insert this expression into the second term and largely repeat the same derivation as before. The end result is given by:

Delayed differential equations
Many real-world DSs are governed by interconnection delays. One important example is the finite speed of light, which matters greatly in high-speed electronics or photonics components. Delays are an interesting phenomenon in DSs as, at least in principle, they make the state space infinite-dimensional (the state history over the full length of the delay is now part of the "immediate" state of the DS). Furthermore, in contrast to discrete time DSs, in continuous time delays are differentiable, such that they too are parameters trainable by gradient descent. We define a set of N delays, denoted by d = [d 1 , d 2 , · · · , d N ]. We now consider the following differential equation.
Here, a d (t) is a vertical concatenation of delayed versions of a(t): such that we assume that the evolution of a(t) depends on N delays. Note that, if the state evolution would happen to depend on delayed versions of the input signal, we can simply use all previous definitions, where we augment the input signal with the delayed versions. In this case, however, the situation is a little more complicated because we cannot reduce the equation to an ordinary differential equation, and we will need to redo the previous derivation. Before we begin we use the chain rule to find the total derivative of the state of the DS w.r.t. the parameters. We find thaṫ We will use the following definitions: and This reduces equation 37 toĠ Consequently, changing G(t) by its implicit integral solution in equation 41 yields: We now wish to define a differential equation for the error backpropagation. We will for now assume it to be equal to de(s) ds =ē(s) + x(s), where x(s), is the unknown part, again with s = T − t. We repeat the same derivation as before, but now exchange the integral overē(t) by Changing the order of the integration, and substituting the integral overē(t) finally yields: and after switching the order of the integration in the final term and replacing the implicit integral solution of G(t), this becomes We again wish to obtain the following equality: such that we require that We start by making the following observation where This means that we can rewrite the left hand side of equation 45 as Due to the fact that we have explicitly defined G(t) and e(t) to be equal to zero outside the range 0 < t < T , we can integrate over each term in this equation separately and change the integration variable without changing the limits of integration, yielding: which is fulfilled if This result reduces the differential equation for e(s) to such that it too is defined by a delayed differential equation. Let us as an example consider what happens if we wish to find the gradient w.r.t. a certain delay value d k . The only variable we need to consider is the part of K(t) which corresponds to the partial derivative of f w.r.t. d k . As we assume that d k is part of the parameter set, we also assume that f explicitly depends on d k . We can write K d k (t) = ∂f (a(t), s(t), a d (t), θ) ∂d k = ∂f (a(t), s(t), a d (t), θ) ∂a(t − d k ) 7 Delayed differential algebraic equations Obviously, certain DSs will have both an algebraic part and interconnection delays. We will limit ourselves here to one particular case which describes the photonic SOA networks. We again use two variables r(t) and q(t), the first algebraic, the second dynamic. Furthermore we write r d (t) = [r(t − d 1 ), r(t − d 2 ), · · · , r(t − d N )]. The DDAE we will consider is of the formq (t) = g(q(t), r d (t), s(t), θ) r(t) = h(q(t), r d (t), s(t), θ).
If external errors on q and r are written asē q andē r , respectively, we can associate the backpropagated errors e q and e r aṡ e T q (t) =ē T q + e T q (t)J g q (t) + e T r (t)J h q (t) in which J g d i (t) = ∂g(q(t), r d (t), s(t), θ) ∂h(q(t), r d (t), s(t), θ) ∂r(t − d i ) .
We end up with the following expression for the gradient The proof is lengthy but highly similar to the previous two derivations, therefore we omit it here. For the photonic networks, the dynamic variable q(t) corresponds to the free carrier densities of the SOAs, and the algebraic variable r(t) to the complex amplitudes of the incoming light (see main text for more details)