The information theory of developmental pruning: Optimizing global network architectures using local synaptic rules

During development, biological neural networks produce more synapses and neurons than needed. Many of these synapses and neurons are later removed in a process known as neural pruning. Why networks should initially be over-populated, and the processes that determine which synapses and neurons are ultimately pruned, remain unclear. We study the mechanisms and significance of neural pruning in model neural networks. In a deep Boltzmann machine model of sensory encoding, we find that (1) synaptic pruning is necessary to learn efficient network architectures that retain computationally-relevant connections, (2) pruning by synaptic weight alone does not optimize network size, and (3) pruning based on a locally-available measure of importance derived from Fisher information allows the network to distinguish structurally important from unimportant connections and neurons. This locally-available measure of importance has a biological interpretation in terms of the correlations between presynaptic and postsynaptic neurons, and implies an efficient activity-driven pruning rule. Overall, we show how local activity-dependent synaptic pruning can solve the global problem of optimizing a network architecture. We relate these findings to biology as follows: (I) Synaptic over-production is necessary for activity-dependent connectivity optimization. (II) In networks that have more neurons than needed, cells compete for activity, and only the most important and selective neurons are retained. (III) Cells may also be pruned due to a loss of synapses on their axons. This occurs when the information they convey is not relevant to the target population.
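(For readers skimming this response letter: the pruning rule at issue can be summarized in a short sketch. The following minimal illustration is ours, not the manuscript's exact estimator; all names are hypothetical, and the coactivation-variance proxy simply stands in for the locally-available, correlation-based importance measure described above.)

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_hidden(v, W, b):
        # Standard RBM conditional: sample binary hidden units given visible units.
        p = 1.0 / (1.0 + np.exp(-(v @ W + b)))
        return (rng.random(p.shape) < p).astype(float)

    def synapse_importance(V, H):
        # Local importance proxy per weight: variance of the pre/post coactivation
        # v_i * h_j across a batch. Each synapse needs only the activity of the
        # two neurons it connects -- the "locally available" property.
        co = V[:, :, None] * H[:, None, :]          # shape (batch, n_vis, n_hid)
        return co.var(axis=0)

    def prune_lowest(W, importance, frac=0.1):
        # Zero out the fraction of surviving weights with the lowest importance.
        alive = W != 0
        cutoff = np.quantile(importance[alive], frac)
        W = W.copy()
        W[alive & (importance <= cutoff)] = 0.0
        return W

    # Toy usage: random data and weights stand in for a trained sensory-encoding RBM.
    V = (rng.random((500, 20)) < 0.5).astype(float)
    W = rng.normal(0.0, 0.1, size=(20, 10))
    H = sample_hidden(V, W, np.zeros(10))
    W = prune_lowest(W, synapse_importance(V, H), frac=0.1)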


Reviewer #1, comment #1
The authors put great effort into improving the paper and addressed most of my concerns. I only have a few remaining issues (and suggestions) before I recommend publication: Concerning 1.4: Random unit pruning: (a) "… in the one-layer RBM all visible units are connected to all hidden units. When a hidden unit is removed here, the activity and Fisher information can completely re-arrange with re-training". As the unit pruning seems to be the main advantage of the FI-approach, I think this control case should also be included for the single-layer RBM. The above intuition can then be discussed and demonstrated in Fig 2C: one would see a large divergence before but not after retraining.
Our response #1.1 We had conducted this experiment, and initially decided not to include it in the original manuscript.
The results are as you predicted: a large divergence before, but not after, retraining. We added this result in Figure 2C of the revised manuscript. We further added a paragraph where we discuss this additional control case (see lines 181-189).
Reviewer #1, comment #2 (b) From what I see in Figure 3, the random unit pruning is not really fair with respect to the units in hidden layer 1, as it removes an order of magnitude more neurons in that layer. To really demonstrate that FI-pruning leads to a better network structure more quickly, I propose to adapt this and remove fewer neurons in h1 to arrive at a comparable structure. Our response #1.2 Indeed, the number of units in hidden layer 1 is much lower in the case of random unit removal. The random unit removal was included as an additional control case after the first revision. We implemented it in such a way that a comparable number of weights is pruned as with our synaptic pruning rules. We think it is a suitable control to show that FI-pruning allows topological optimization of the network in the different layers, as it preserves more units in the first layer (see lines 277-282).
Since the ratio of weights removed to units removed is fixed for "random unit" pruning, it is impossible to match the horizontal axes between all plots in Fig 3B,C. We focus on how pruning of weights can optimize the network, with unit-pruning as a useful emergent side-effect of using the FI-based rules. If one were to remove fewer units in Figure 3C (top), such that the random unit pruning removes the same number of units, then the number of weights would not be matched in Figure 3B.
Still, we conducted another experiment where we removed fewer units in the first hidden layer (corresponding to 5% of weights instead of 10%). Even then, more hidden units were removed than with the other criteria. As one can see in panel B of Figure 1, the layer ends up with a higher number of weights than when it was pruned according to other criteria, complicating comparability. The encoding performance deteriorated to a similar degree (see Figure 1). This demonstrates that the first hidden layer is a bottleneck for performance: if it loses too many neurons, the performance decreases.
Reviewer #1, comment #3 l.134/l.333f Could you provide more motivation why it is easier to track firing rates and weights instead of correlations and weights? Biologically, Ca or CaMKII are thought to be local proxies of correlated activity, but I am not aware of molecular signals tracking especially the presynaptic rate.

Our response #1.3
This is true. We think it is interesting to note that weight magnitude (i.e. synaptic strength) can serve as a proxy for these correlations, since it allows for various possible biological implementations. Experiments to explore whether similar pruning rules occur in vivo should therefore examine not only correlations, but also other variables that correlate with synaptic weights. We now briefly elaborate on this in lines 135-139.
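As a side note on why this proxy is plausible (a standard property of Boltzmann machine learning, stated here for illustration rather than as the manuscript's derivation): the learning rule itself is correlation-driven,

    \Delta w_{ij} \;\propto\; \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}},

so a converged weight is effectively an accumulated record of the coactivation of the presynaptic and postsynaptic neurons it connects.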
Reviewer #1, comment #4 l.165 I guess here you need to discuss the results a bit more deeply, as otherwise panel 2C would have been sufficient to make the point. Specifically, I noticed that the generative performance only seems to be poor for rare patterns, whereas the performance for abundant patterns seems to match (although with larger variation in the Anti-FI case). Is this really so bad for a neural system? From an information-theoretic viewpoint, they are surely the most informative patterns. However, as these unmatched patterns are rare, the error introduced by them may be negligible.
Our response #1.4 Thanks for this suggestion. This is an issue with most theoretical neuroscience works that model sensory channels as optimal encoders. Not all sensory information is equally important, and in practice sensory systems do not transmit all patterns. The selection of information, however, likely also depends on ecological and behavioral factors, and it seems difficult to test hypotheses beyond a general information-maximization objective. The energies in our models here could be interpreted not as the environmental probabilities, but rather as a more complicated behavioral cost function. To model this explicitly, one would need to modify the wake-sleep learning rules to adjust pattern frequency or the amount of plasticity driven by each pattern. This is an interesting line of further study, but beyond the scope of our work.
Yet we agree that we should discuss Figure 2B further. We now briefly explain the results shown in the figure (lines 175-178).
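To spell out the information-theoretic point above (a textbook identity, added here only for clarity): the generative objective weights each pattern's mismatch by its probability,

    D_{\mathrm{KL}}(p \,\|\, q) \;=\; \sum_{\mathbf{x}} p(\mathbf{x}) \log \frac{p(\mathbf{x})}{q(\mathbf{x})},

so rare patterns (small p(x)) contribute little to the total error, unless the model assigns them vanishingly small probability q(x), in which case the log-ratio term grows large.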
Reviewer #1, comment #5 l.372 The statement seems a bit bold. Maybe use "activity-dependent pruning that aims to identify uninformative neurons". Our response #1.5 We agree and changed the sentence to read as suggested.
Reviewer #1, comment #6 Suggestions to improve readability: -In my opinion, it would make sense to move the introduction of the RBMs (l.23-33) to the end of the introduction (after l.55). Our response #1.6 Thanks for the suggestion. We agree that moving this part to the suggested position eases the flow of reading. We adjusted the text accordingly.
Reviewer #1, comment #7 -l.70 Maybe one could also mention the relation between energy and pattern probability in Equation 1.
Our response #1.7 Thanks, we added the sentence "Lower energy corresponds to higher probability of the respective model state." at line 74.
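For completeness, the relation this sentence summarizes is the standard Boltzmann distribution (written here in generic notation, since Equation 1 is not reproduced in this letter):

    p(\mathbf{s}) \;=\; \frac{e^{-E(\mathbf{s})}}{Z}, \qquad Z \;=\; \sum_{\mathbf{s}'} e^{-E(\mathbf{s}')},

so states with lower energy E(s) are exponentially more probable.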
Reviewer #1, comment #8 -l.101 I would mention how the models were fitted here (wake-sleep algorithm).
Our response #1.8 Thanks! We added this (now at line 102).
Reviewer #1, comment #9 -l.101 It is not immediately clear what is meant by "parameter-wise" (first mention). I would stick to the terms full and diagonal, or at least specify what is meant in this sentence. Moreover, I think it may be less confusing to discuss the results in the order they are presented in the figure and to move the reference to Equation 2. Also, an activity-dependent form is only available from Equation 3 or 4, right?
Our response #1.9 Thanks a lot for this comment. We agree that the flow of reading was a bit unsteady here. We changed the order to match the one presented in the figure. We also dropped the term "parameter-wise" throughout the manuscript and no longer refer to Equation 2 here. Thanks for noticing this!

Reviewer #1, comment #10 -l.141 It is not immediately clear why the FI introduced before is "variance"-based. Maybe the term could be introduced together with the method and the motivation of "variance" could be explained.
Our response #1.10 Thanks, we now explain why we call it the variance estimate of FI in the paragraph above the introduction of the heuristic estimate (see lines 133-134).
Reviewer #1, comment #11 -l.150 I think it should be briefly motivated what the generative performance means/relates to in the neuronal/biological system, to give a better intuition of what the FI-approach actually preserves.

Our response #1.11
In RBMs and DBMs, good generative performance is equivalent to Shannon-optimal encoding (Hinton et al. 1995). It also implies internal models that can accurately predict lower-level inputs from internal states. We now motivate evaluating the generative performance and comment on this in the text at lines 152-157.
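The link can be stated in one line (a standard source-coding identity, given here for intuition): encoding an input v under the model q costs -log2 q(v) bits, and the expected code length decomposes as

    \mathbb{E}_{p}\!\left[-\log_2 q(\mathbf{v})\right] \;=\; H(p) + D_{\mathrm{KL}}(p \,\|\, q) \;\ge\; H(p),

so improving generative performance (reducing the divergence) drives the code length toward the Shannon limit H(p).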

Reviewer #1, comment #12
Finally, I would have another suggestion: Another advantage of the FI-dependent pruning over other methods may be the fact that it could be used to determine when pruning should be stopped. At the moment this is not the case, as the lowest-FI quantile of synapses is always removed. If, instead, only synapses below an FI-threshold were removed, pruning would naturally stop once all synapses have high FI. Such a convergence would remove the necessity to select a suitable number of pruning iterations for the model and prevent the performance loss of the FI-based models after massive pruning in Fig 3. Assuming that pruning stops after all synapses have high FI, one would get one "optimal" pruned model (instead of one per pruning iteration). Determining these optimal models for different input statistics would also allow predictions on the number of surviving synapses and neurons as well as weight distributions (for example, comparing the networks after training with a 5-class MNIST subset and the full dataset). Varying the input statistics and getting different resulting models would greatly underline the point that FI-pruning actually selects input-related "optimal" model architectures and not just "smaller" models whose size is determined by the number of iterations. Moreover, such an analysis would provide more insight into the relation between the encoding of the Boltzmann machine and optimal pruned models, which, I guess, was a goal of this line of research. The differences in the resulting optimal networks could, in turn, be compared with existing data on network complexity/neuron and spine densities in animals reared in different environments (e.g. dark rearing, rearing with differently oriented bars, normal cages, enriched environments). This would make a nice connection to biology and provide actually testable pre/postdictions. (Concerning the experiments you proposed: at least the experimentalists I know say that it is not feasible to track pre- and postsynaptic activity and the weight of an identified synapse over time at the moment.)
I am aware that this additional analysis may be work-intensive and beyond the scope of this paper. However, I think it may greatly improve the manuscript or at least provide an interesting direction for future research.
Our response #1.12 These are really interesting thoughts and ideas for future directions. We included an additional graph in the appendix of the previous revised manuscript, showing that a rise in the average latent activity may also be a signal to stop pruning.
Unfortunately, we have to agree that these analyses are beyond the scope of this article. However, they will make an excellent project for future students.
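For future reference, the suggested convergence criterion can be sketched in a few lines (hypothetical names; importance_fn stands in for any FI estimate, and this variant is not implemented in the manuscript):

    import numpy as np

    def prune_until_converged(W, importance_fn, fi_threshold, max_iters=100):
        # Threshold-based variant: instead of always removing the lowest-FI
        # quantile, remove only synapses whose importance falls below a fixed
        # threshold. Pruning stops on its own once every surviving synapse is
        # above threshold, yielding a single converged ("optimal") model.
        for _ in range(max_iters):
            importance = importance_fn(W)
            prunable = (W != 0) & (importance < fi_threshold)
            if not prunable.any():   # all surviving synapses carry high FI
                break
            W = W.copy()
            W[prunable] = 0.0
            # (in the full pipeline, the network would be retrained here)
        return W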