Pruning deep neural networks generates a sparse, bio-inspired nonlinear controller for insect flight

Insect flight is a strongly nonlinear and actuated dynamical system. As such, strategies for understanding its control have typically relied on either model-based methods or linearizations thereof. Here we develop a framework that combines model predictive control on an established flight dynamics model with deep neural networks (DNNs) to create an efficient method for solving the inverse problem of flight control. We turn to natural systems for inspiration, since they inherently demonstrate network pruning, with the consequence of yielding more efficient networks for a specific set of tasks. This bio-inspired approach allows us to leverage network pruning to optimally sparsify a DNN architecture in order to perform flight tasks with as few neural connections as possible; however, there are limits to sparsification. Specifically, as the number of connections falls below a critical threshold, flight performance drops considerably. We develop sparsification paradigms and explore their limits for control tasks. Monte Carlo simulations also quantify the statistical distribution of network weights during pruning given initial random weights of the DNNs. We demonstrate that, on average, the network can be pruned to retain only a small fraction of its original weights and still perform comparably to its fully-connected counterpart. The relative number of remaining weights, however, is highly dependent on the initial architecture and size of the network. Overall, this work shows that sparsely connected DNNs are capable of predicting the forces required to follow flight trajectories, and that sparsification has sharp performance limits.

Introduction

Sparse neural networks require far fewer connections for generating input-output computations. Thus they have important practical advantages over their fully-connected counterparts. They are also more representative of biological neural systems, in which neural pathways are sparsely and specifically connected for task performance.
The inverse problem of insect flight concerns a highly nonlinear dynamical system, in part due to the unsteady mechanisms of flapping flight [23,24] and the noisy environment through which insects maneuver. As such, the inverse problem of insect flight serves as an exemplar for studying whether a DNN can solve a biological motion control problem while maintaining a sparse connectivity pattern. In an inverse problem, the initial and final conditions of a dynamical system are known and are used to find the parameters necessary to control the system. In other words, the DNN in this study is trained to predict the controls required to move the simulated insect from one point in state space to another. The inverse problem of insect flight has previously been solved using a genetic algorithm wedded with a simplex optimizer for hawkmoth-level forward flight and hovering [25]. Another study linearized the dynamical system of simulated hawkmoth flight and found the system to operate on the edge of stability [26]. Recently, a study developed an inertial dynamics model of M. sexta flight as it tracked a vertically oscillating signal, modeling the control inputs using Monte Carlo methods in a scheme inspired by model predictive control (MPC) [27].
In this work, we use the inertial dynamics model in [27] to simulate examples of M. sexta hovering flight. Fig. 1 shows the physical parameters of the simulated moth and the inertial dynamics model. These data are used to train a DNN to learn the controllers for hovering. Drawing inspiration from pruning in biological neural systems, we sparsify the network using neural network pruning. Here, we prune weights based simply on their magnitudes, removing those weights closest to zero. Importantly, the pruned weights remain zeroed out throughout the sparsification process. This bio-inspired approach to sparsity allows us to find the optimally sparse network for completing flight tasks. Insects must maneuver through high noise environments to accomplish controlled flight. It is often assumed that there is a trade-off between perfect flight control and robustness to noise and that the sensory data may be limited by the signal-to-noise ratio. Thus the network need not train for the most accurate model since in practice noise prevents high-fidelity models from exhibiting their underlying accuracy. Rather, we seek to find the sparsest model capable of performing the task given the noisy environment. We employed two methods for neural network pruning: either through manually setting weights to zero or by utilizing binary masking layers. Furthermore, the DNN is pruned sequentially, meaning groups of weights are removed slowly from the network, with retraining in-between successive prunes, until a target sparsity is reached. Monte Carlo simulations are also used to quantify the statistical distribution of network weights during pruning given random initialization of network weights. This work shows that sparse DNNs are capable of predicting the controls required for a simulated hawkmoth to move from one state-space to another, or through a sequence of control actions. 
Specifically, for a given signal-to-noise level the pruned network can perform at the level of the fully connected network while requiring only a fraction of the memory footprint and computational power.

Results
Network pruning results

Fig. 2 shows the learning curve for a network trained using the sequential pruning protocol with TensorFlow's Model Optimization Toolkit (see Methods section for details) [28]. The network is trained until a minimum error is reached, then pruned to a specified sparsity percentage, and then retrained until the loss is once again minimized. The sparsity (or pruning) percentages are shown in Fig. 2 where they occur in the training process. An arbitrary threshold error of 10^-3 (shown as a red, dashed line) was chosen to define the optimally sparse network (i.e. the sparsest possible network that performs under the specified loss). This specific threshold value was chosen because it is near the performance of the trained, fully-connected network. In practice, the red line represents the noise level encountered in the flight system. Specifically, given a prescribed signal-to-noise ratio, we wish to train a DNN to accomplish a task with a certain accuracy that is limited by noise. Thus high-fidelity models, which can only practically exist with perfect data, are traded for sparse models which are capable of performing at the same level for a given noise figure. In the example in Fig. 2, the optimally sparse network occurs at 94% sparsity (i.e. when only 6% of the connections remain). Beyond 94% sparsity, the performance of the network breaks down because too many critical weights have been removed.

Fig. 1 caption: The moth body is made of two ellipses attached with a spring. There are three control variables (F, α, and τ) and four parameters to describe the state space (x, y, θ, and φ). See Table 2.

Fig. 2 caption: Learning curve for sequential pruning of the network. The fully-connected neural network is trained until the mean-squared error is minimized. Then, the network is sequentially pruned by adding in masking layers and trained again. The performance of the network improves below the minimum error achieved by the fully-connected network for low levels of pruning, and performs comparably to the fully-connected network until 94% of the network is pruned.

Monte Carlo results
To compare the effects of pruning across networks, we trained and pruned 1320 networks with different random initializations on the same dataset. In this experiment, the hyper-parameters, pruning percentages, and architecture are held constant. Fig. 3 shows the training curves of 9 sample networks. The red, dashed line in each of the panels represents the same threshold as in Fig. 2 (10^-3). The black, solid lines in Fig. 3 represent the optimally sparse networks. Although the majority of networks in this subset break down at 93% sparsity, a few break down at higher and lower levels of connectivity. Fig. 4 shows the loss after pruning the 1320 networks at varying pruning percentages (from 0% sparsity to 98% sparsity). The box plot in Fig. 4 is directly comparable to Fig. 2, but it is the compilation of the results for many different networks. The networks do not all converge to the same set of weights, which is evident from the numerous outliers, as well as the variance around the median loss.
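The box-plot statistics described above can be sketched as follows. This is an illustrative computation on synthetic losses (drawn log-normally around the reported median of 7.9 × 10^-4), not the study's actual data; the threshold matches the 10^-3 line in Figs. 2-4.

```python
import numpy as np

# Synthetic stand-ins for the per-network losses at one pruning percentage.
rng = np.random.default_rng(0)
losses = rng.lognormal(mean=np.log(7.9e-4), sigma=0.3, size=1320)

# Box-plot statistics: quartiles, and outliers by the usual 1.5 * IQR rule.
q1, med, q3 = np.percentile(losses, [25, 50, 75])
iqr = q3 - q1
outliers = losses[(losses < q1 - 1.5 * iqr) | (losses > q3 + 1.5 * iqr)]

# Count how many networks fall under the optimal-sparsity threshold.
threshold = 1e-3
n_under = int(np.sum(losses < threshold))
```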
The median minimum loss achieved by the networks before pruning is 7.9 × 10^-4. The first box in the box plot in Fig. 4 corresponds to the losses of all the trained networks before any pruning occurs. The variance on the loss is relatively small, but there are several outliers. Once again, the red, dashed line in the box plot in Fig. 4 represents the threshold below which a network is optimally sparse. Many networks follow a similar pattern and perform under the threshold until they exceed 93% sparsity. Also, many networks perform better than the median performance of the fully-connected networks when pruned up to 85% sparsity.
The number of optimally sparse networks in each sparsity category is shown in the bar plot at the top of Fig. 4. Of the 1320 networks trained, 858 are optimally sparse at 93% sparsity. A small number of networks (5) remain under the threshold up to 95% pruned. Note that the total number of networks represented in the bar plot does not add up to 1320; this is because several networks never perform below the threshold at any pruning percentage.

Analysis of layer sparsity
The subset of optimally sparse networks pruned to 93% (858 networks) is used in the following analysis of network structure. The sparsity across all the layers was found to be uniform (7% of weights remain in each layer) despite the protocol not explicitly requiring uniform pruning. Table 1 shows the average number of remaining connections across the 858 networks, as well as the variance and the fraction of remaining connections. Fig. 5 shows a box plot of the number of connections from the input layer to the first hidden layer for the subset of pruned networks. Interestingly, the initial head-thorax angular velocity was completely pruned out of all of the networks in the subset, meaning it has no impact on the output and predictive power of the network. Additionally, the initial abdomen angular velocity connects to either zero, one, or two nodes in the first hidden layer, while all the other inputs have a median connection to at least 5% of the weights in the first hidden layer.

Materials and methods
All code associated with the simulations and the DNNs is available on Github [29].

Moth model
The simulated insect uses an inertial dynamics model developed in Bustamante et al., 2021 [27], inspired by the flight control of the hawkmoth M. sexta, with body proportions rounded to the nearest 0.1 cm. The simulated moth was made up of two ellipsoid body segments: the head-thorax mass (m1) and the abdomen mass (m2). The body segments are connected by a pin joint consisting of a torsional spring and a torsional damper, as in [30]. The simulated moth could translate in the x-y plane, and both the head-thorax mass and the abdominal mass could rotate with angles (θ, φ) in the x-y plane. See Fig. 1 and Tables 2 and 3 for a fuller description of the simulated insect. The computational model of the moth had three control variables and four state-space variables (as well as the respective state-space derivatives). This model is by definition underactuated because the number of control variables is less than the number of degrees of freedom. The controls are as follows: F, the magnitude of the force applied; α, the direction of the force applied (with respect to the midline of the head-thorax mass); and τ, the abdominal torque exerted about the pin joint connecting the two body segment masses (with its equal and opposite response torque). The controls are randomized every 20 ms, which is approximately the period of a wing downstroke or upstroke for M. sexta (25 Hz wing beat frequency) [31].
The motion of the moth is described by four state-space parameters (x: horizontal position, y: vertical position, θ: head-thorax angle, and φ: abdomen angle), as well as the respective state-space derivatives (ẋ: horizontal velocity, ẏ: vertical velocity, θ̇: head-thorax angular velocity, and φ̇: abdomen angular velocity). The x and y positions indicate the location of the pin joint where the head-thorax connects with the abdomen.

Data preparation for deep neural network training
The force (F) and force angle (α) were converted to horizontal and vertical components (F_x and F_y) using the following equations: F_x = F · cos(α) and F_y = F · sin(α). The data were split into training and validation sets for cross validation (80:20 split). The validation data provide an unbiased evaluation of a model fit while tuning the hyper-parameters (such as the number of hidden units, number of layers, optimizer, etc.). The data were scaled using a min-max scaler fit on the training dataset, transforming values to lie between −0.5 and +0.5. The same scaler was then used to transform the validation and test data.
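A minimal sketch of this preprocessing, assuming the samples are held in a NumPy array; the dataset, array shapes, and random values are placeholders:

```python
import numpy as np

# Placeholder data: 1000 samples of 10 state-space input variables.
rng = np.random.default_rng(1)
data = rng.uniform(-2.0, 2.0, size=(1000, 10))

# 80:20 train/validation split.
n_train = int(0.8 * len(data))
train, val = data[:n_train], data[n_train:]

# Min-max scaler fit on the training set only, mapping values to [-0.5, 0.5].
lo, hi = train.min(axis=0), train.max(axis=0)
scale = lambda x: (x - lo) / (hi - lo) - 0.5
train_s, val_s = scale(train), scale(val)   # same scaler applied to both
```

Note that the validation set may fall slightly outside [-0.5, 0.5], since the scaler's bounds come from the training data alone.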

Training and pruning a deep neural network
The deep, fully-connected neural network was constructed with ten input variables and seven output variables (see Fig. 1). The initial and final state-space conditions are the inputs to the network. The network predicts the control variables and the final derivatives of the state space in its output layer. The final derivatives of the state space were made outputs so that 20 ms solutions can be chained together, allowing the moth to complete a complex trajectory in future work. The training and pruning protocols were developed using Keras [33] with the TensorFlow backend [28]. To scale up training for the statistical analysis of many networks, the training and pruning protocols were parallelized using the Jax framework [34].
To demonstrate the effects of pruning, the network was chosen to have a deep, feed-forward architecture with wide hidden layers (many more nodes than in the input and output layers). The network had four hidden layers with 400, 400, 400, and 16 nodes, respectively. Wide hidden layers were used rather than a bottleneck structure (narrower hidden layer width) to allow the network to find the optimal mapping with little constraint; the specific choices of layer widths, however, were arbitrary. The inverse tangent activation function was used for all hidden layers to introduce nonlinearity in the model. To account for the multiple outputs, the loss function was the uniformly-weighted average of the mean squared error over all the outputs.

Algorithm 1: Sequential pruning and fine-tuning
  Train fully-connected model until loss is minimized;
  Define list of sparsity percentages;
  for each sparsity percentage do
    while loss is not minimized do
      for each epoch do
        Set n weights to zero s.t. n/N equals the sparsity percentage;
        Evaluate loss;
        Update weights;
      end
    end
  end
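The feed-forward mapping described above can be sketched in plain NumPy. The random weights, batch size, and row-vector convention (x A rather than A x) are illustrative assumptions, not the trained model:

```python
import numpy as np

# 10 inputs -> hidden layers of 400, 400, 400, 16 nodes -> 7 outputs,
# with inverse tangent (arctan) activations on the hidden layers.
rng = np.random.default_rng(2)
widths = [10, 400, 400, 400, 16, 7]
A = [rng.normal(0, 0.05, size=(m, n)) for m, n in zip(widths[:-1], widths[1:])]

def forward(x):
    for W in A[:-1]:
        x = np.arctan(x @ W)   # arctan nonlinearity on hidden layers
    return x @ A[-1]           # linear output layer

x = rng.normal(size=(4, 10))   # batch of 4 input vectors
y = forward(x)

# Uniformly-weighted mean squared error over all seven outputs.
target = np.zeros_like(y)      # placeholder targets
loss = np.mean((y - target) ** 2)
```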
For optimizing performance, there are several hyper-parameter differences between the TensorFlow model and the Jax model. In developing the training and pruning protocol in TensorFlow, the network was trained using the rmsprop optimizer with a batch size of 2^12 samples. However, to scale up and speed up training, we used the Jax framework and the Adam optimizer [35], and reduced the batch size to 128 samples. Regularization techniques such as weight regularization, batch normalization, and dropout were not used. However, early stopping (with a minimum delta of 0.01 and a patience of 1000 batches) was used to reduce overfitting by monitoring the mean squared error.
After the fully-connected network is trained to a minimum error, we use neural network pruning to promote sparsity between the network layers. In this work, a target sparsity (percentage of pruned network weights) is specified and those weights are forced to zero. The network is then retrained until a minimum error is reached. This process is repeated until most of the weights have been pruned from the network.
We developed two methods to prune the neural network: 1) a manual method that involves setting a number of weights to zero after each training epoch and 2) a method using TensorFlow's Model Optimization Toolkit [28] which involves creating a masking layer to control sparsity in the network. Both methods are described in detail in the following sections.

Manual Pruning
Alg. 1 describes a method of pruning in which the n weights whose magnitudes are closest to zero are manually set to zero. If N is the total number of weights in the network, the n weights are chosen such that n/N is equivalent to a specified pruning percentage (e.g. 15%, 25%, ..., 98%). After the n weights are set to zero, the network is retrained for one epoch. This process is repeated until the loss is minimized. After the network has been trained to a minimum loss, we select the next pruning percentage from the predetermined list and repeat the retraining process. The entire pruning process is repeated until the network has been pruned to the final pruning percentage in the list.
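One pruning step of this procedure might be sketched as follows. The network-wide ranking of magnitudes follows the n/N description above, but the helper name and layer shapes are illustrative:

```python
import numpy as np

def prune_smallest(weights, sparsity):
    """Zero out the fraction `sparsity` of weights nearest zero, ranked
    globally across all weight matrices (n/N equals the sparsity)."""
    flat = np.abs(np.concatenate([W.ravel() for W in weights]))
    n = int(sparsity * flat.size)
    cutoff = np.sort(flat)[n]                       # magnitude threshold
    return [np.where(np.abs(W) < cutoff, 0.0, W) for W in weights]

# Illustrative two-layer network.
rng = np.random.default_rng(3)
W = [rng.normal(size=(10, 400)), rng.normal(size=(400, 7))]
W_pruned = prune_smallest(W, 0.93)

frac_zero = sum((Wp == 0).sum() for Wp in W_pruned) / sum(Wp.size for Wp in W_pruned)
```

After this step the network is retrained for one epoch, and the step repeats until the loss is minimized at the current pruning percentage.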
Upon retraining, the weights are able to regain a non-zero value, and the network is evaluated using these non-zero weights. Although this likely still captures the effects of pruning the network over the full training time, it is not true pruning in the sense that connections that have been pruned can regain weight.

Algorithm 2: Sequential pruning with masks and fine-tuning
  Train fully-connected model until loss is minimized;
  Define list of sparsity percentages;
  for each sparsity percentage do
    Define pruning schedule using ConstantSparsity;
    Create prunable model by calling prune_low_magnitude;
    Train pruned model until loss is minimized;
  end

Pruning using Model Optimization Toolkit
The manual pruning method described above has the downside of allowing weights to regain a non-zero value during retraining. These weights are subsequently set back to zero in the next epoch, but the algorithm does not guarantee that the same weights will be pruned every time.
To ensure weights remain pruned during retraining, we implemented the pruning functionality of a TensorFlow toolkit called the Model Optimization Toolkit [28]. The toolkit contains functions for pruning deep neural networks. In the Model Optimization Toolkit, pruning is achieved through binary masking layers that are multiplied element-wise with each weight matrix in the network. A four-layer neural network can be mathematically described as

ŷ = σ_4(A_4 σ_3(A_3 σ_2(A_2 σ_1(A_1 x)))).    (2)

In Eq. 2, the inputs to the network are represented by x, the predictions by ŷ, the weight matrices by A_i, and the activation functions by σ_i, where i = 1, 2, 3, 4 for the four layers of the network. During pruning, a binary masking matrix M_i is placed at each layer:

ŷ = σ_4((M_4 ∘ A_4) σ_3((M_3 ∘ A_3) σ_2((M_2 ∘ A_2) σ_1((M_1 ∘ A_1) x)))).    (3)

In Eq. 3, the binary masking matrices M_i are multiplied element-wise with the weight matrices (∘ denotes the element-wise Hadamard product). The sparsity of each layer is controlled by a separate masking matrix to allow for different levels of sparsity in each layer. Before pruning, all elements of M_i are set to 1. At each pruning percentage (e.g. 15%, 25%, ..., 98%), the n weights whose magnitudes are nearest to zero are found and the corresponding elements of the M_i are set to zero. The network is then retrained until a minimum error is achieved. The masking layers are non-trainable, meaning they will not be updated during backpropagation. Then, the next pruning percentage is selected and the process is repeated until the network has been pruned to the final pruning percentage.
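A minimal NumPy sketch of the masked forward pass in Eq. 3, using fixed binary masks and the Hadamard product; the layer widths are illustrative and a row-vector convention (x A rather than A x) is assumed:

```python
import numpy as np

rng = np.random.default_rng(4)
widths = [10, 400, 16, 7]
A = [rng.normal(size=(m, n)) for m, n in zip(widths[:-1], widths[1:])]

def make_mask(W, sparsity):
    # Keep only the largest-magnitude weights; smallest magnitudes are masked.
    cutoff = np.quantile(np.abs(W), sparsity)
    return (np.abs(W) >= cutoff).astype(W.dtype)

# Non-trainable per-layer masks at 93% sparsity (7% of weights remain).
M = [make_mask(W, 0.93) for W in A]

def forward(x):
    for Wi, Mi in zip(A[:-1], M[:-1]):
        x = np.arctan(x @ (Mi * Wi))   # Hadamard-masked hidden layer
    return x @ (M[-1] * A[-1])         # masked linear output layer

y = forward(rng.normal(size=(3, 10)))
```

Because the masks are applied in both the forward pass and (in the real training loop) the gradient update, pruned weights can never regain a non-zero value.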
In the TensorFlow Model Optimization Toolkit, the binary masking layer is added by wrapping each layer into a prunable layer. The binary masking layer controls the sparsity of the layer by setting terms in the matrix equal to either zero or one. The masking layer is bi-directional, meaning it masks the weights in both the forward pass and backpropagation step, ensuring no pruned weights are updated [36]. Alg. 2 shows the pruning paradigm utilizing the Model Optimization Toolkit.
Rather than controlling for sparsity at each epoch of training, as was done in the manual pruning method described above, we control for sparsity each time we want to prune more weights from the network. Sparsity is kept constant throughout each pruning cycle and therefore we can use TensorFlow's built-in functions for training the network and regularization.

Preparing for statistical analysis of pruned networks
To be able to train and analyze many neural networks, the training and pruning protocols were parallelized in the Jax framework [34]. Rather than requiring data to be in the form of tensors (as in TensorFlow), Jax is capable of performing transformations on NumPy [37] structures. Jax, however, does not come with a toolkit for pruning, so pruning by way of the binary masking matrices was coded into the training loop.
The networks were trained and pruned using an NVIDIA Titan Xp GPU operating with CUDA [38]. At most, 400 networks were trained at the same time, and the total number of networks used in the analysis was 1320. These networks were all trained with identical architectures, pruning percentages, and hyper-parameters; the only difference between the networks is the random initialization of the weights before training and pruning. The Adam optimizer [35] and a batch size of 128 were used to speed up training, and cross-validation was omitted. However, early stopping was used on the training data to avoid training beyond the point where the loss was adequately minimized. Additionally, early stopping evaluated the decrease in loss across batches, rather than epochs.
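The early-stopping rule described above (a minimum improvement delta, with patience measured in batches) can be sketched as follows; the class and its interface are hypothetical, not the actual implementation used in the study:

```python
class EarlyStopping:
    """Stop when the monitored loss has not improved by at least
    `min_delta` for `patience` consecutive batches."""

    def __init__(self, min_delta=0.01, patience=1000):
        self.min_delta, self.patience = min_delta, patience
        self.best, self.wait = float("inf"), 0

    def should_stop(self, loss):
        if loss < self.best - self.min_delta:
            self.best, self.wait = loss, 0     # meaningful improvement
        else:
            self.wait += 1                     # stagnant batch
        return self.wait >= self.patience

# Illustrative run: improvements smaller than min_delta count as stagnation.
stopper = EarlyStopping(min_delta=0.1, patience=3)
losses = [1.0, 0.8, 0.79, 0.78, 0.77]
stops = [stopper.should_stop(l) for l in losses]
print(stops)  # [False, False, False, False, True]
```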

Discussion
In this study, we set out to investigate whether a sparse DNN can control a biologically relevant motor task: in our case, the dynamic control of a hovering insect. Taking inspiration from the synaptic pruning found across wide-ranging animal taxa, we pruned a DNN to different levels of sparsity in order to find the optimally sparse network capable of controlling moth hovering. The DNN uses data generated by the inertial dynamics model in [27], which models the forward problem of flight control. In this work, the DNN models the inverse problem of flight control by learning the controls given the initial and final state-space variables.
Through this work, we found that sparse DNNs are capable of solving the inverse problem of flight control, i.e. predicting the controls required for a moth to hover to a specified state space. In addition, we demonstrated that, across many networks, a network can be pruned by as much as 93% and perform comparably to the median performance of a fully-connected network. However, there are sharp performance limits, and most networks pruned beyond 93% see a breakdown in performance. We found that although uniform pruning was not enforced, on average each layer in the network pruned to match the overall sparsity (i.e. the sparsity of each layer was 93% for networks pruned to an overall sparsity of 93%). Finally, we examined the sparsity of individual layers and found that the initial head-thorax angular velocity is consistently pruned from the input layer of networks pruned to 93% sparsity, indicating a redundancy in the original forward model.
Though we have shown that a DNN is capable of learning the controls for a flight task, there are several limitations to this work. Firstly, though the model in [27] used to generate the training data provided control predictions for accurate motion tracking in a two-dimensional task, biological reality is richer and more complex than can be captured by the forward model. Thus, since the DNN is trained with these data, it is only capable of learning the dynamics captured in the model in [27]. Furthermore, the size, shape, and body biomechanics of this system all matter. This study uses the same global parameters across the data set (see Table 2), but in reality these parameters vary significantly (across insect taxa and within the life of an individual), and this likely affects the performance of the DNN.
We have shown here that DNNs are capable of learning the inverse problem of flight control. The fully-connected DNN used here learned a nonlinear mapping between input and output variables, where the inputs are the initial and final state-space variables and the outputs are the controls and final velocities. A fully-connected network can learn this task with a median loss of 7.9 × 10^-4. However, due to the random initialization of weights preceding training, some networks perform as much as an order of magnitude worse (see Fig. 4). This suggests that the performance of a trained DNN is heavily influenced by the random initialization of its weights.
We used magnitude-based pruning to sparsify the DNNs in order to find the optimal, sparse network capable of controlling moth hovering. For the task of moth hovering, a DNN can be pruned to approximately 7% of its original network weights and still perform comparably to the fully-connected network. The results of this analysis show that when trained to perform a biological task, fully-connected DNNs are indeed overparameterized. Much like their biological counterparts, DNNs do not require fully-connected connectivity to accomplish this task. Additionally, flying insects maneuver through high noise environments and therefore perfect flight control is traded for robustness to noise. It is therefore assumed that the data has a given signal-to-noise ratio or performance threshold. The performance threshold represented by the red dashed line in Figs. 2, 3, and 4 was arbitrarily chosen to represent a loss comparable to the loss of the fully-connected network (i.e. 0.001). In other words, this line represents a noise threshold, below which the network is considered well-performing and adapted to noise. It has been shown that biological motor control systems are adapted to handle noise [39]. Biological pruning may be a mechanism for identifying sparse connectivity patterns that allow for control within a noise threshold.
On average, when the networks are pruned beyond 7% connectivity, there is a dramatic performance breakdown. Beyond 93% sparsity, the performance of the networks breaks down because too many critical weights have been removed. A significant proportion of the 1320 networks (approximately 30%) break down before they reach 7% connectivity. This again supports the aforementioned claim that the random initialization of the weights before training affects the performance of a DNN, an effect that can be exacerbated by neural network pruning. Additionally, this shows that there exists a diversity of network structures that perform within the bounds of the noise threshold.
To investigate the substructure of the well-performing, sparse networks, we looked closer at the subset of networks that were optimally sparse at 93% pruned (858 networks). We have shown that the average sparsity of each layer in this subset is uniform, meaning each of the five layers has approximately 7% of its original connections remaining. However, the variance in the number of remaining connections between the input layer and the first hidden layer, and between the final hidden layer and the output layer, is markedly higher than the variance in the weight matrices between the hidden layers. This suggests that in networks pruned to 93% sparsity, the greatest amount of change in network connectivity occurs in the input and output layers. However, there are notable features in the connectivity between the input and first hidden layer that are consistent across the 858 networks. Fig. 5 shows that the input parameter for the initial head-thorax angular velocity (θ̇_i) is completely pruned from all of the 858 networks. The initial abdomen angular velocity (φ̇_i) is also almost entirely pruned from all of the networks. All of the other input parameters maintain a median of at least 5% connectivity to the first hidden layer. The complete pruning of θ̇_i suggests a redundancy in the original forward model. This redundancy makes physical sense because θ̇_i and φ̇_i are coupled in the original forward model.
In this work, we have shown that a sparse neural network can learn the controls for a biological motor task, and we have also shown, via Monte Carlo simulations, that there exist at least some aspects of network structure that are stereotypical. There are several computationally non-trivial extensions to the work presented here. Firstly, network analysis techniques (such as network motif theory) could be used to further compare the pruned networks and investigate the impacts of neural network structure on a control task. Network motifs are statistically significant substructures in a network and have been shown to be indicative of network functionality in control systems [40]. Other areas of future work include investigating the sparse network's response to noise and to changes in the biological parameters. Biological control systems are adapted to function adequately in the presence of noise. Pruning improves the performance of neural networks up to a certain level of sparsity; however, the effects of noise on this bio-inspired control task are yet to be explored. Furthermore, the size and shape of a real moth can change rapidly (e.g. a change of mass after feeding). The question of whether sparsity improves robustness in the face of such physical changes could also be a future extension of this work.

Conclusion
Synaptic pruning has been shown to play a major role in the refinement of neural connections, leading to more effective motor task control. Taking inspiration from synaptic pruning in biological systems, we apply the similarly well-studied method of DNN pruning to the inverse problem of insect flight control. We use the inertial dynamics model in [27] to simulate examples of M. sexta hovering flight. These data are used to train a DNN to learn the controls for moving the simulated insect between two points in state space. We then prune the DNN weights to find the optimally sparse network for completing flight tasks. We developed two paradigms for pruning: via manual weight removal and via binary masking layers. Furthermore, we pruned the DNN sequentially, with retraining occurring between prunes. Monte Carlo simulations were also used to quantify the statistical distribution of network weights during pruning and to find similarities in the internal structure across pruned networks. In this work, we have shown that sparse DNNs are capable of predicting the controls required for a simulated hawkmoth to move from one state space to another.

Table fragment: Resting configuration of the torsional spring = (initial abdomen angle) − (initial head-thorax angle) − π; time step t = 0.02 s.

Variable   Expression                               Units    Description
m1         ρ_head · (4/3)π · (b_head)^2 · a_head    g        Mass of the head-thorax
m2         ρ_butt · (4/3)π · (b_butt)^2 · a_butt    g        Mass of the abdomen
ec_head    a_head / b_head                          N/A      Eccentricity of the head-thorax
ec_butt    a_butt / b_butt                          N/A      Eccentricity of the abdomen
—                                                   g·cm^2   Moment of inertia of the abdomen
S_head     π · (b_head)^2                           cm^2     Surface area of the head-thorax (modeled as a sphere)
S_butt     π · (b_butt)^2                           cm^2     Surface area of the abdomen (modeled as a sphere)
—                                                   N/A      Reynolds number for the abdomen