Abstract
Artificial neural networks are often interpreted as abstract models of biological neuronal networks, but they are typically trained using the biologically unrealistic backpropagation algorithm and its variants. Predictive coding has been proposed as a potentially more biologically realistic alternative to backpropagation for training neural networks. This manuscript reviews and extends recent work on the mathematical relationship between predictive coding and backpropagation for training feedforward artificial neural networks on supervised learning tasks. Implications of these results for the interpretation of predictive coding and deep neural networks as models of biological learning are discussed along with a repository of functions, Torch2PC, for performing predictive coding with PyTorch neural network models.
Citation: Rosenbaum R (2022) On the relationship between predictive coding and backpropagation. PLoS ONE 17(3): e0266102. https://doi.org/10.1371/journal.pone.0266102
Editor: Gennady S. Cymbalyuk, Georgia State University, UNITED STATES
Received: June 29, 2021; Accepted: March 14, 2022; Published: March 31, 2022
Copyright: © 2022 Robert Rosenbaum. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All code and data to reproduce the results in the paper have been uploaded to figshare (https://doi.org/10.6084/m9.figshare.19387409.v2).
Funding: This work was supported by National Science Foundation grants NSF DMS 1654268, NSF Neuronex DBI 1707400, and the Air Force Office of Scientific Research (AFOSR) award number FA9550-21-1-0223. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
The backpropagation algorithm and its variants are widely used to train artificial neural networks. While artificial and biological neural networks share some common features, a direct implementation of backpropagation in the brain is often considered biologically implausible in part because of the nonlocal nature of parameter updates: The update to a parameter in one layer depends on activity in all deeper layers. In contrast, biological neural networks are believed to learn largely through local synaptic plasticity rules for which changes to a synaptic weight depend on neural activity local to that synapse. While neuromodulators can have non-local impact on synaptic plasticity, they are not believed to be sufficiently specific to implement the precise, high-dimensional credit assignment required by backpropagation. However, some work has shown that global errors and neuromodulators can work with local plasticity to implement effective learning algorithms [1, 2]. Backpropagation can be performed using local updates if gradients of neurons’ activations are passed upstream through feedback connections, but this interpretation implies other biologically implausible properties of the network, like symmetric feedforward and feedback weights. See previous work [3, 4] for a more complete review of the biological plausibility of backpropagation.
Several approaches have been proposed for achieving or approximating backpropagation with ostensibly more biologically realistic learning rules [2–14]. One such approach [11–14] is derived from the theory of “predictive coding” or “predictive processing” [15–23]. A relationship between predictive coding and backpropagation was first discovered by Whittington and Bogacz [11] who showed that, when predictive coding is used to train a feedforward neural network on a supervised learning task, it can produce parameter updates that approximate those computed by backpropagation. These original results have since been extended to more general network architectures and to show that modifying predictive coding by a “fixed prediction assumption” leads to an algorithm that produces the exact same parameter updates as backpropagation [12–14].
This manuscript reviews and extends previous work [11–14] on the relationship between predictive coding and backpropagation, as well as some implications of these results for the interpretation of predictive coding and artificial neural networks as models of biological learning. The main results in this manuscript are as follows:
- Accounting for covariance or precision matrices in hidden layers does not affect parameter updates (learning) for predictive coding under the “fixed prediction assumption” used in previous work.
- Predictive coding under the fixed prediction assumption is algorithmically equivalent to a direct implementation of backpropagation, which raises the question of whether it should be interpreted as more biologically plausible than backpropagation.
- Empirical results show that the magnitudes of prediction errors do not necessarily correspond to surprising features of inputs.
In addition, a public repository of Python functions, Torch2PC, is introduced. These functions can be used to perform predictive coding on any PyTorch Sequential model (see Materials and methods).
Results
A review of the relationship between backpropagation and predictive coding from previous work
For completeness, let us first review the backpropagation algorithm. Consider a feedforward deep neural network (DNN) defined by
v̂ℓ = fℓ(v̂ℓ−1; θℓ)    (1)
for ℓ = 1, …, L, where v̂0 = x is the input, each v̂ℓ is a vector or tensor of activations, each θℓ is a set of parameters for layer ℓ, and L is the network’s depth. In supervised learning, one seeks to minimize a loss function
ℒ(ŷ, y)
where y is a label associated with the input, x, and
ŷ = v̂L
is the network’s output, which depends on the parameters
θ = {θ1, …, θL}.
The loss is typically minimized using gradient-based optimization methods with gradients computed using automatic differentiation tools based on the backpropagation algorithm. For completeness, backpropagation is reviewed in the pseudocode below.
Algorithm 1 A standard implementation of backpropagation.
Given: Input (x) and label (y)
# forward pass
v̂0 = x
for ℓ = 1, …, L
  v̂ℓ = fℓ(v̂ℓ−1; θℓ)
# backward pass
δL = ∂ℒ(v̂L, y)/∂v̂L
dθL = −δL ∂fL(v̂L−1; θL)/∂θL
for ℓ = L − 1, …, 1
  δℓ = δℓ+1 ∂fℓ+1(v̂ℓ; θℓ+1)/∂v̂ℓ
  dθℓ = −δℓ ∂fℓ(v̂ℓ−1; θℓ)/∂θℓ
A direct application of the chain rule and mathematical induction shows that backpropagation computes the gradients,
δℓ = ∂ℒ(ŷ, y)/∂v̂ℓ and dθℓ = −∂ℒ(ŷ, y)/∂θℓ,
for ℓ = 1, …, L.
The negative gradients, dθℓ, are then used to update parameters, either directly for stochastic gradient descent or indirectly for other gradient-based learning methods [24]. For the sake of comparison, I used backpropagation to train a 5-layer convolutional neural network on the MNIST data set (Fig 1A and 1B; blue curves). I next review algorithms derived from the theory of predictive coding and their relationship to backpropagation, as originally derived in previous work [11–14].
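For concreteness, a minimal sketch of Algorithm 1 is given below for a small stack of fully connected layers of the form fℓ(v; θℓ) = relu(Wℓv + bℓ) with a squared-Euclidean loss; the layer sizes and random data are illustrative assumptions and not the models used in the figures.

import torch

# Minimal sketch of Algorithm 1 for layers f_l(v; theta_l) = relu(W_l v + b_l)
# with Loss = 0.5*||vhat_L - y||^2. Sizes and data are placeholders.
torch.manual_seed(0)
sizes = [4, 8, 8, 3]
W = [0.5 * torch.randn(sizes[l + 1], sizes[l]) for l in range(len(sizes) - 1)]
b = [torch.zeros(sizes[l + 1]) for l in range(len(sizes) - 1)]
L = len(W)

x = torch.randn(sizes[0])   # input
y = torch.randn(sizes[-1])  # label (one-hot in practice for MSE training)

# forward pass: store pre-activations and activations
vhat, z = [x], []
for l in range(L):
    z.append(W[l] @ vhat[-1] + b[l])
    vhat.append(torch.relu(z[-1]))

# backward pass: delta_l is the gradient of the loss w.r.t. vhat_l
delta = vhat[-1] - y
dW, db = [None] * L, [None] * L
for l in reversed(range(L)):
    dz = delta * (z[l] > 0).float()     # gradient through the ReLU
    dW[l] = -torch.outer(dz, vhat[l])   # dtheta_l = -dLoss/dtheta_l
    db[l] = -dz
    delta = W[l].T @ dz                 # delta for the previous layer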
A,B) The loss (A) and accuracy (B) on the training set (pastel) and test set (dark) when a 5-layer network was trained using a strict implementation of predictive coding (Algorithm 2 with η = 0.1 and n = 20; red) and backpropagation (blue). C,D) The relative error (C) and angle (D) between the parameter update, dθ, computed by Algorithm 2 and the negative gradient of the loss at each layer. Predictive coding and backpropagation give similar accuracies, but the parameter updates are less similar.
A strict interpretation of predictive coding does not accurately compute gradients.
I begin by reviewing supervised learning under a strict interpretation of predictive coding. The formulation in this section is equivalent to the one first studied by Whittington and Bogacz [11], except that their results are restricted to the case in which fℓ(vℓ−1; θℓ) = θℓgℓ(vℓ−1) for some point-wise-applied activation function, gℓ, and connectivity matrix, θℓ. The formulation here extends these results to arbitrary vector-valued differentiable functions, fℓ. For the sake of continuity with later sections, I also use the notational conventions from [12], which differ from those in [11].
Predictive coding can be derived from a hierarchical, Gaussian probabilistic model in which each layer, ℓ, is associated with a Gaussian random variable, Vℓ, satisfying
pr(Vℓ = vℓ | Vℓ−1 = vℓ−1) = N(vℓ; fℓ(vℓ−1; θℓ), Σℓ)    (2)
where
N(v; μ, Σ)
is the multivariate Gaussian distribution with mean, μ, and covariance matrix, Σ, evaluated at v. Following previous work [11–14], I take Σℓ = I to be the identity matrix, but later relax this assumption [21].
If we condition on an observed input, V0 = x, then a forward pass through the network described by Eq (1) corresponds to setting v̂0 = x and then sequentially computing the conditional expectations or, equivalently, maximizing conditional probabilities,
v̂ℓ = E[Vℓ | Vℓ−1 = v̂ℓ−1] = argmax_{vℓ} pr(Vℓ = vℓ | Vℓ−1 = v̂ℓ−1) = fℓ(v̂ℓ−1; θℓ),
until reaching an inferred output,
ŷ = v̂L.
Note that this forward pass does not necessarily maximize the global conditional probability,
pr(V1 = v̂1, …, VL = v̂L | V0 = x),
and it does not account for a prior distribution on VL, which arises in related work on predictive coding for unsupervised learning [15, 21]. One interpretation of a forward pass is that each v̂ℓ is the network’s “belief” about the state of Vℓ when only V0 = x has been observed.
Now suppose that we condition on both an observed input, V0 = x, and its label, VL = y. In this case, generating beliefs about the hidden states, Vℓ, is more difficult because we need to account for potentially conflicting information at each end of the network. We can proceed by initializing a set of beliefs, vℓ, about the state of each Vℓ, and then updating our initial beliefs to be more consistent with the observations, x and y, and parameters, θℓ.
The error made by a set of beliefs, vℓ, under parameters, θℓ, can be quantified by
ϵℓ = fℓ(vℓ−1; θℓ) − vℓ
for ℓ = 1, …, L − 1 where v0 = V0 = x is observed. It is not so simple to quantify the error, ϵL, made at the last layer in a way that accounts for arbitrary loss functions. In the special case of a squared-Euclidean loss function,
ℒ(ŷ, y) = (1/2)‖ŷ − y‖²,
where ‖u‖² = uᵀu, standard formulations of predictive coding [20, 21] use
ϵL = fL(vL−1; θL) − y    (3)
where recall that y is the label. In this case, ϵL satisfies
ϵL = ∂ℒ(ỹL, y)/∂ỹL    (4)
where
ỹL = fL(vL−1; θL).
We use the notation ỹL to emphasize that ỹL is different from ŷ = v̂L (which is defined by a forward pass starting at v̂0 = x) and is defined in a fundamentally different way from the vℓ terms (which do not necessarily satisfy vℓ = fℓ(vℓ−1; θℓ)). We can then define the total summed magnitude of errors as
F = (1/2) Σℓ ‖ϵℓ‖²
where the sum runs over ℓ = 1, …, L.
More details on the derivation of F in terms of variational Bayesian inference can be found in previous work [12, 16, 20, 21], where F is known as the variational free energy of the model. Essentially, minimizing F produces a model that is more consistent with the observed data. Minimizing F by gradient descent on vℓ and on θℓ produces the inference and learning steps of predictive coding, respectively.
Under a more heuristic interpretation, vℓ represents the network’s “belief” about Vℓ, and fℓ(vℓ−1; θℓ) is the “prediction” of vℓ made by the previous layer. Under this interpretation, ϵℓ is the error made by the previous layer’s prediction, so ϵℓ is called a “prediction error.” Then F quantifies the total magnitude of prediction errors given a set of beliefs, vℓ, parameters, θℓ, and observations, V0 = x and VL = y.
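As a concrete illustration of how F is assembled from the prediction errors, the short sketch below computes F for arbitrary beliefs in a small three-layer model; the tanh layer functions, sizes, and random data are assumptions made only for illustration.

import torch

# Illustrative computation of F for a 3-layer model with arbitrary beliefs v_l,
# using the squared-Euclidean output error of Eq (3). Layer functions, sizes,
# and data are placeholders.
torch.manual_seed(1)
sizes = [5, 6, 6, 2]
theta = [0.5 * torch.randn(sizes[l + 1], sizes[l]) for l in range(3)]

def f(l, v):                       # f_l(v_{l-1}; theta_l), layers indexed 1..3
    return torch.tanh(theta[l - 1] @ v)

x = torch.randn(sizes[0])          # observed input, v_0 = x
y = torch.randn(sizes[-1])         # observed label, v_L = y

# arbitrary beliefs about the hidden layers (noisy forward-pass values here)
v = [x]
for l in range(1, 3):
    v.append(f(l, v[-1]) + 0.1 * torch.randn(sizes[l]))

eps = [f(l, v[l - 1]) - v[l] for l in (1, 2)]  # hidden-layer prediction errors
eps_L = f(3, v[2]) - y                         # output error, Eq (3)
F = 0.5 * sum((e ** 2).sum() for e in eps + [eps_L])
print(float(F))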
In predictive coding, beliefs, vℓ, are updated to minimize the error, F. This can be achieved by gradient descent, i.e., by making updates of the form
vℓ ← vℓ + η dvℓ
where η is a step size and
dvℓ = −∂F/∂vℓ = ϵℓ − ϵℓ+1 ∂fℓ+1(vℓ; θℓ+1)/∂vℓ.    (5)
In this expression, ∂fℓ+1(vℓ; θℓ+1)/∂vℓ is a Jacobian matrix and ϵℓ+1 is a row-vector to simplify notation, but a column-vector interpretation is similar. If x is a mini-batch of m data points instead of one data point, then vℓ is an m × nℓ matrix and derivatives are tensors. These conventions are used throughout the manuscript. The updates in Eq (5) can be iterated until convergence or approximate convergence. Note that the prediction errors, ϵℓ = fℓ(vℓ−1; θℓ) − vℓ, should also be updated on each iteration.
Learning can also be phrased as minimizing F with gradient descent on parameters. Specifically, parameters are updated in proportion to
dθℓ = −∂F/∂θℓ = −ϵℓ ∂fℓ(vℓ−1; θℓ)/∂θℓ.    (6)
Note that some previous work uses the negative of the prediction errors used here, i.e., they use ϵℓ = vℓ − fℓ(vℓ−1; θℓ). While this choice changes some of the expressions above, the value of F and its dependence on θℓ is not changed because F is defined by the norms of the ϵℓ terms. The complete algorithm is defined more precisely by the pseudocode below:
Algorithm 2 A direct interpretation of predictive coding.
Given: Input (x), label (y), and initial beliefs (vℓ)
# error and belief computation
for i = 1, …, n
  ϵL = fL(vL−1; θL) − y
  for ℓ = L − 1, …, 1
    ϵℓ = fℓ(vℓ−1; θℓ) − vℓ
    dvℓ = ϵℓ − ϵℓ+1 ∂fℓ+1(vℓ; θℓ+1)/∂vℓ
    vℓ = vℓ + η dvℓ
# parameter update computation
for ℓ = 1, …, L
  dθℓ = −ϵℓ ∂fℓ(vℓ−1; θℓ)/∂θℓ
Here and elsewhere, n denotes the number of iterations for the inference step. The choice of initial beliefs is not specified in the algorithm above, but previous work [11–14] uses the results from a forward pass, vℓ = v̂ℓ, as initial conditions, and I do the same in all numerical examples.
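The sketch below implements Algorithm 2 for a small fully connected network with tanh layers and a squared-Euclidean loss, writing out the Jacobian-vector products in Eq (5) by hand; the sizes, step size, and iteration count are illustrative assumptions rather than the models used in the figures.

import torch

# Minimal sketch of Algorithm 2 for layers f_l(v; theta_l) = tanh(theta_l v)
# with a squared-Euclidean loss. Sizes, eta, and n are placeholders.
torch.manual_seed(2)
sizes = [5, 8, 8, 3]
theta = [0.5 * torch.randn(sizes[l + 1], sizes[l]) for l in range(3)]
L, eta, n = 3, 0.1, 20
x, y = torch.randn(sizes[0]), torch.randn(sizes[-1])

def f(l, v):                                  # f_l(v_{l-1}; theta_l), 1-indexed
    return torch.tanh(theta[l - 1] @ v)

# initial beliefs from a forward pass (as in the numerical examples)
v = [x]
for l in range(1, L + 1):
    v.append(f(l, v[-1]))

eps = [None] * (L + 1)
for i in range(n):                            # error and belief computation
    eps[L] = f(L, v[L - 1]) - y               # output error, Eq (3)
    for l in range(L - 1, 0, -1):
        eps[l] = f(l, v[l - 1]) - v[l]
        u = f(l + 1, v[l])
        Jtv = (eps[l + 1] * (1 - u ** 2)) @ theta[l]   # eps_{l+1} times Jacobian
        dv = eps[l] - Jtv                     # Eq (5)
        v[l] = v[l] + eta * dv

# parameter updates, Eq (6): dtheta_l = -eps_l * df_l/dtheta_l
dtheta = []
for l in range(1, L + 1):
    u = f(l, v[l - 1])
    dtheta.append(-torch.outer(eps[l] * (1 - u ** 2), v[l - 1]))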
I tested Algorithm 2 on MNIST using a 5-layer convolutional neural network. To be consistent with the definitions above, I used a mean-squared error (squared Euclidean) loss function, which required one-hot encoded labels [24]. Algorithm 2 performed similarly to backpropagation (Fig 1A and 1B) even though the parameter updates did not match the true gradients (Fig 1C and 1D). Algorithm 2 was slower than backpropagation (31s for Algorithm 2 versus 8s for backpropagation when training metrics were not computed on every iteration) in part because Algorithm 2 requires several inner iterations to compute the prediction errors (n = 20 iterations used in this example). Algorithm 2 failed to converge on a larger model. Specifically, the loss grew consistently with iterations when trying to use Algorithm 2 to train the 6-layer CIFAR-10 model described in the next section. S1 Fig shows the same results from Fig 1 repeated across 30 trials with different random seeds to quantify the mean and standard deviation across trials.
Fig 1C and 1D show that predictive coding does not update parameters according to the true gradients, but it is not immediately clear whether this would be resolved by using more iterations (larger n) or different values of the step size, η. I next compared the parameter updates, dθℓ, to the true gradients, −∂ℒ(ŷ, y)/∂θℓ, for different values of n and η (Fig 2). For the smaller values of η tested (η = 0.1 and η = 0.2) and larger values of n (n > 100), parameter updates were similar to the true gradients in the last two layers, but they differed substantially in the first two layers. The largest values of η tested (η = 0.5 and η = 1) caused the iterations in Algorithm 2 to diverge.
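For reference, the comparison metrics used in Figs 1–4 can be computed as follows; dtheta_pc and dtheta_bp below are placeholders for a predictive coding update and the corresponding update from backpropagation for a single parameter tensor, and the angle is reported in degrees here (the full figure code is in the linked repository).

import torch

# Relative error and angle between a predictive coding update and the true
# gradient-based update, as defined in the figure captions.
def compare_updates(dtheta_pc, dtheta_bp):
    rel_err = torch.norm(dtheta_pc - dtheta_bp) / torch.norm(dtheta_bp)
    cos = torch.sum(dtheta_pc * dtheta_bp) / (
        torch.norm(dtheta_pc) * torch.norm(dtheta_bp))
    angle = torch.acos(torch.clamp(cos, -1.0, 1.0)) * 180 / torch.pi
    return rel_err.item(), angle.item()

# example with random tensors standing in for the two updates
a = torch.randn(10, 5)
print(compare_updates(a + 0.1 * torch.randn(10, 5), a))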
Relative error and angle between dθℓ produced by predictive coding (Algorithm 2) as compared to the exact gradients, computed by backpropagation (relative error defined by ‖dθpc − dθbp‖/‖dθbp‖). Updates were computed as a function of the number of iterations, n, used in Algorithm 2 for various values of the step size, η, using the model from Fig 1 applied to one mini-batch of data. Both models were initialized identically to the pre-trained parameter values from the trained model in Fig 1. Parameter updates converge near the gradients after many iterations for smaller values of η, but diverge for larger values.
Some choices in designing Algorithm 2 were made arbitrarily. For example, the three updates inside the inner for-loop over ℓ could be performed in a different order or the outer for-loop over i could be changed to a while-loop with a convergence criterion. For any initial conditions and any of these design choices, if the iterations over i are repeated until convergence or approximate convergence of each vℓ to a fixed point, v*ℓ, then the increments must satisfy dvℓ = 0 at the fixed point and therefore the fixed point values of the prediction errors,
ϵ*ℓ = fℓ(v*ℓ−1; θℓ) − v*ℓ,
must satisfy
ϵ*ℓ = ϵ*ℓ+1 ∂fℓ+1(v*ℓ; θℓ+1)/∂v*ℓ    (7)
for ℓ = 1, …, L − 1. By the definition of ϵL, we have
ϵ*L = ∂ℒ(ỹ*L, y)/∂ỹ*L    (8)
where
ỹ*L = fL(v*L−1; θL).
Combining Eqs (7) and (8) gives the fixed point prediction errors of the penultimate layer
ϵ*L−1 = ϵ*L ∂fL(v*L−1; θL)/∂v*L−1 = ∂ℒ(ỹ*L, y)/∂v*L−1    (9)
where we used the fact that ỹ*L = fL(v*L−1; θL) and the chain rule. The error in layer L − 2 is given by
ϵ*L−2 = ϵ*L−1 ∂fL−1(v*L−2; θL−1)/∂v*L−2 = [∂ℒ(ỹ*L, y)/∂v*L−1][∂fL−1(v*L−2; θL−1)/∂v*L−2].
Note that we cannot apply the chain rule to reduce this product (like we did for Eq (9)) because it is not necessarily true that v*L−1 = fL−1(v*L−2; θL−1). I revisit this point below. We can continue this process to derive
ϵ*L−3 = [∂ℒ(ỹ*L, y)/∂v*L−1][∂fL−1(v*L−2; θL−1)/∂v*L−2][∂fL−2(v*L−3; θL−2)/∂v*L−3]
and continue for ℓ = L − 4, …, 1. In doing so, we see (by induction) that ϵ*ℓ can be written as
ϵ*ℓ = [∂ℒ(ỹ*L, y)/∂v*L−1][∂fL−1(v*L−2; θL−1)/∂v*L−2] ⋯ [∂fℓ+1(v*ℓ; θℓ+1)/∂v*ℓ]    (10)
for ℓ = 1, …, L − 2. Therefore, if the inference loop converges to a fixed point, then the subsequent parameter update obeys
dθℓ = −ϵ*ℓ ∂fℓ(v*ℓ−1; θℓ)/∂θℓ    (11)
by Eq (6). It is not clear whether there is a simple mathematical relationship between these parameter updates and the negative gradients,
−∂ℒ(ŷ, y)/∂θℓ,
computed by backpropagation.
It is tempting to assume that v*ℓ = fℓ(v*ℓ−1; θℓ), in which case the product terms would be reduced by the chain rule. Indeed, this assumption would imply that
v*ℓ = v̂ℓ
and
ỹ*L = ŷ
and, finally, that
ϵ*ℓ = ∂ℒ(ŷ, y)/∂v̂ℓ
and
dθℓ = −∂ℒ(ŷ, y)/∂θℓ,
identical to the values computed by backpropagation. However, we cannot generally expect to have
v*ℓ = fℓ(v*ℓ−1; θℓ)
because this would imply that
ϵ*ℓ = 0
and therefore
∂ℒ(ŷ, y)/∂v̂ℓ = 0.
In other words, Algorithm 2 is only equivalent to backpropagation in the case where parameters are at a critical point of the loss function, so all updates are zero. Nevertheless, this thought experiment suggests a modification to Algorithm 2 for which the fixed points do represent the true gradients [11, 12]. I review that modification in the next section.
Note also that the calculations above rely on the assumption of a squared-Euclidean loss function, ℒ(ŷ, y) = (1/2)‖ŷ − y‖². If we want to generalize the algorithm to different loss functions, then Eqs (3) and (4) could not both be true, and therefore Eqs (7) and (8) could not both be true. This leaves open the question of how to define ϵL when using loss functions that are not proportional to the squared Euclidean norm. If we were to define ϵL by (3), at the expense of losing (4), then the algorithm would not account for the loss function at all, so it would effectively assume a Euclidean loss, i.e., it would compute the same values that are computed by Algorithm 2 with a Euclidean loss. If we instead were to define ϵL by Eq (4) at the expense of (3), then Eqs (5) and (7) would no longer be true for ℓ = L − 1 and Eq (6) would no longer be true for ℓ = L. Instead, all three of these equations would involve second-order derivatives of the loss function, and therefore the fixed point Eqs (10) and (11) would also involve second order derivatives. The interpretation of the parameter updates is not clear in this case. One might instead try to define ϵL by the result of a forward pass,
ϵL = ∂ℒ(ŷ, y)/∂ŷ,
but then ϵL would be a constant with respect to vL−1, so we would have ∂ϵL/∂vL−1 = 0, and therefore Eq (5) at ℓ = L − 1 would become
dvL−1 = ϵL−1,
which has a fixed point at
ϵ*L−1 = 0.
This would finally imply that all the errors converge to
ϵ*ℓ = 0
and therefore dθℓ = 0 at the fixed point.
I next discuss a modification of Algorithm 2 that converges to the same gradients computed by backpropagation, and is applicable to general loss functions [11, 12].
Predictive coding modified by the fixed prediction assumption converges to the gradients computed by backpropagation.
Previous work [11, 12] proposed a modification of the predictive coding algorithm described above called the “fixed prediction assumption,” which I now review. Motivated by the considerations in the last few paragraphs of the previous section, we can selectively substitute some terms of the form vℓ and fℓ(vℓ−1; θℓ) in Algorithm 2 with v̂ℓ (or, equivalently, fℓ(v̂ℓ−1; θℓ)) where v̂ℓ are the results of the original forward pass starting from v̂0 = x. Specifically, the following modifications are made to the quantities computed by Algorithm 2
ϵℓ = v̂ℓ − vℓ
dvℓ = ϵℓ − ϵℓ+1 ∂fℓ+1(v̂ℓ; θℓ+1)/∂v̂ℓ    (12)
dθℓ = −ϵℓ ∂fℓ(v̂ℓ−1; θℓ)/∂θℓ
for ℓ = 1, …, L − 1, and the output error is held fixed at ϵL = ∂ℒ(v̂L, y)/∂v̂L, which generalizes Eq (4) to arbitrary loss functions. This modification can be interpreted as “fixing” the predictions at the values computed by a forward pass and is therefore called the “fixed prediction assumption” [11, 12]. Additionally, the initial conditions of the beliefs are set to the results from a forward pass,
vℓ = v̂ℓ
for ℓ = 1, …, L − 1. The complete modified algorithm is defined by the pseudocode below:
Algorithm 3 Supervised learning with predictive coding modified by the fixed prediction assumption. Adapted from the algorithm in [12] and similar to the algorithm from [11].
Given: Input (x) and label (y)
# forward pass
v̂0 = x
for ℓ = 1, …, L
  v̂ℓ = fℓ(v̂ℓ−1; θℓ)
# error and belief computation
vℓ = v̂ℓ for ℓ = 1, …, L − 1
ϵL = ∂ℒ(v̂L, y)/∂v̂L
for i = 1, …, n
  for ℓ = L − 1, …, 1
    ϵℓ = v̂ℓ − vℓ
    dvℓ = ϵℓ − ϵℓ+1 ∂fℓ+1(v̂ℓ; θℓ+1)/∂v̂ℓ
    vℓ = vℓ + η dvℓ
# parameter update computation
for ℓ = 1, …, L
  dθℓ = −ϵℓ ∂fℓ(v̂ℓ−1; θℓ)/∂θℓ
Note, again, that some choices in Algorithm 3 were made arbitrarily. The three updates inside the inner for-loop over ℓ could be performed in a different order or the outer for loop over i could be changed to a while-loop with a convergence criterion. Regardless of these choices, the fixed points of the prediction errors can again be computed by setting dvℓ = 0 to obtain
ϵℓ = ϵℓ+1 ∂fℓ+1(v̂ℓ; θℓ+1)/∂v̂ℓ.
Now note that ϵL is fixed, so
ϵL = ∂ℒ(v̂L, y)/∂v̂L
and we can combine these two equations to compute
ϵL−1 = [∂ℒ(v̂L, y)/∂v̂L][∂fL(v̂L−1; θL)/∂v̂L−1] = ∂ℒ(ŷ, y)/∂v̂L−1
where we used the chain rule and the fact that
v̂L = fL(v̂L−1; θL).
Continuing this approach we have,
ϵℓ = ∂ℒ(ŷ, y)/∂v̂ℓ
for all ℓ = 1, …, L (where recall that ŷ = v̂L is the output from the feedforward pass). Combining this with the modified definition of dθℓ, we have
dθℓ = −ϵℓ ∂fℓ(v̂ℓ−1; θℓ)/∂θℓ = −[∂ℒ(ŷ, y)/∂v̂ℓ][∂fℓ(v̂ℓ−1; θℓ)/∂θℓ] = −∂ℒ(ŷ, y)/∂θℓ
where we use the chain rule and the fact that
v̂ℓ = fℓ(v̂ℓ−1; θℓ).
We may conclude that, if the inference step converges to a fixed point (dvℓ = 0), then Algorithm 3 computes the same values of dθℓ as backpropagation and also that the prediction errors, ϵℓ, converge to the gradients, δℓ = ∂ℒ(ŷ, y)/∂v̂ℓ, computed by backpropagation. As long as the inference step approximately converges to a fixed point (dvℓ ≈ 0), then we should expect the parameter updates from Algorithm 3 to approximate those computed by backpropagation. In the next section, I extend this result to show that a special case of the algorithm computes the true gradients in a fixed number of steps.
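The sketch below implements Algorithm 3 under the same illustrative assumptions as the Algorithm 2 sketch above (tanh layers, squared-Euclidean loss, assumed sizes); the only changes are that predictions and Jacobians are evaluated at the forward-pass values v̂ℓ and that ϵL is held fixed at ∂ℒ(v̂L, y)/∂v̂L.

import torch

# Minimal sketch of Algorithm 3 (fixed prediction assumption) for layers
# f_l(v; theta_l) = tanh(theta_l v) with a squared-Euclidean loss.
torch.manual_seed(3)
sizes = [5, 8, 8, 3]
theta = [0.5 * torch.randn(sizes[l + 1], sizes[l]) for l in range(3)]
L, eta, n = 3, 0.1, 20
x, y = torch.randn(sizes[0]), torch.randn(sizes[-1])

def f(l, v):
    return torch.tanh(theta[l - 1] @ v)

# forward pass and initialization of beliefs
vhat = [x]
for l in range(1, L + 1):
    vhat.append(f(l, vhat[-1]))
v = [t.clone() for t in vhat]

eps = [None] * (L + 1)
eps[L] = vhat[L] - y                          # dLoss/dvhat_L for 0.5*||vhat_L - y||^2
for i in range(n):                            # error and belief computation
    for l in range(L - 1, 0, -1):
        eps[l] = vhat[l] - v[l]
        s = 1 - vhat[l + 1] ** 2              # tanh' evaluated at forward-pass values
        dv = eps[l] - (eps[l + 1] * s) @ theta[l]
        v[l] = v[l] + eta * dv

# parameter updates; at the fixed point these match backpropagation
dtheta = [-torch.outer(eps[l] * (1 - vhat[l] ** 2), vhat[l - 1])
          for l in range(1, L + 1)]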
I next tested Algorithm 3 on MNIST using the same 5-layer convolutional neural network considered above. I used a cross-entropy loss function, but otherwise used all of the same parameters used to test Algorithm 2 in Fig 1. The modified predictive coding algorithm (Algorithm 3) performed similarly to backpropagation in terms of the loss and accuracy (Fig 3A and 3B). Parameter updates computed by Algorithm 3 did not match the true gradients, but pointed in a similar direction and provided a closer match than Algorithm 2 (compare Fig 3C and 3D to Fig 1C and 1D). Algorithm 3 was similar to Algorithm 2 in terms of training time (29s for Algorithm 3 versus 31s for Algorithm 2 and 8s for backpropagation). S2 Fig shows the same results from Fig 3 repeated across 30 trials with different random seeds to quantify the mean and standard deviation across trials.
Same as Fig 1 except Algorithm 3 was used (with η = 0.1 and n = 20) in place of Algorithm 2. The accuracy of predictive coding with the fixed prediction assumption is similar to backpropagation, but the parameter updates are less similar for these hyperparameters.
I next compared the parameter updates computed by Algorithm 3 to the true gradients for different values of n and η (Fig 4). When η < 1, the parameter updates, dθℓ, appeared to converge, but did not converge exactly to the true gradients. This is likely due to numerical floating point errors accumulated over iterations. When η = 1, the parameter updates at each layer remained constant for the first few iterations, then immediately jumped to become very near the updates from backpropagation. In the next section, I provide a mathematical analysis of this behavior and show that when η = 1, Algorithm 3 computes the true gradients in a fixed number of steps.
Relative error and angle between dθ produced by predictive coding modified by the fixed prediction assumption (Algorithm 3) as compared to the exact gradients computed by backpropagation (relative error defined by ‖dθpc − dθbp‖/‖dθbp‖). Updates were computed as a function of the number of iterations, n, used in Algorithm 3 for various values of the step size, η, using the model from Fig 3 applied to one mini-batch of data. Both models were initialized identically to the pre-trained parameter values from the backpropagation-trained model in Fig 3. In the rightmost panels, some lines are not visible where they overlap at zero. Parameter updates quickly converge to the true gradients when η is larger.
To see how well these results extend to a larger model and more difficult benchmark, I next tested Algorithm 3 on CIFAR-10 [25] using a six-layer convolutional network. While the network only had one more layer than the MNIST network used above, it had 141 times more parameters (32,695 trainable parameters in the MNIST model versus 4,633,738 in the CIFAR-10 model). Algorithm 3 performed similarly to backpropagation in terms of loss and accuracy during learning (Fig 5A and 5B) and produced parameter updates that pointed in a similar direction, but still did not match the true gradients (Fig 5C and 5D). Algorithm 3 was substantially slower than backpropagation (848s for Algorithm 3 versus 58s for backpropagation when training metrics were not computed on every iteration).
Same as Fig 3 except a larger network was trained on the CIFAR-10 data set. The accuracy of predictive coding with the fixed prediction assumption is similar to backpropagation and parameter updates are similar to the true gradients.
Predictive coding modified by the fixed prediction assumption using a step size of η = 1 computes exact gradients in a fixed number of steps.
A major disadvantage of the approach outlined above—when compared to standard backpropagation—is that it requires iterative updates to vℓ and ϵℓ. Indeed, previous work [12] used n = 100–200 iterations, leading to substantially slower performance compared to standard backpropagation. Other work [11] used n = 20 iterations as above. In general, there is a tradeoff between accuracy and performance when choosing n, as demonstrated in Fig 4. However, more recent work [13, 14] showed that, under the fixed prediction assumption, predictive coding can compute the exact same gradients computed by backpropagation in a fixed number of steps. That work used a more specific formulation of the neural network which can implement fully connected layers, convolutional layers, and recurrent layers. They also used an unconventional interpretation of neural networks in which weights are multiplied outside the activation function, i.e., fℓ(x; θℓ) = θℓgℓ(x), and inputs are fed into the last layer instead of the first. Next, I show that their result holds for arbitrary feedforward neural networks as formulated in Eq (1) (with arbitrary functions, fℓ) and this result has a simple interpretation in terms of Algorithm 3. Specifically, the following theorem shows that taking a step size of η = 1 yields an exact computation of gradients using just n = L iterations (where L is the depth of the network).
Theorem 1. If Algorithm 3 is run with step size η = 1 and at least n = L iterations, then the algorithm computes
ϵℓ = ∂ℒ(ŷ, y)/∂v̂ℓ
and
dθℓ = −∂ℒ(ŷ, y)/∂θℓ
for all ℓ = 1, …, L, where v̂ℓ are the results from a forward pass with v̂0 = x and ŷ = v̂L is the output.
Proof. For the sake of notational simplicity within this proof, define δℓ = ∂ℒ(ŷ, y)/∂v̂ℓ, and note that the chain rule gives δℓ = δℓ+1 ∂fℓ+1(v̂ℓ; θℓ+1)/∂v̂ℓ. Therefore, we first need to prove that ϵℓ = δℓ. First, rewrite the inside of the error and belief loop from Algorithm 3 while explicitly keeping track of the iteration number in which each variable was updated,
ϵℓ^i = v̂ℓ − vℓ^{i−1}
dvℓ^i = ϵℓ^i − ϵℓ+1^i ∂fℓ+1(v̂ℓ; θℓ+1)/∂v̂ℓ
vℓ^i = vℓ^{i−1} + η dvℓ^i.
Here, ϵℓ^i, dvℓ^i, and vℓ^i denote the values of ϵℓ, dvℓ, and vℓ respectively at the end of the ith iteration, vℓ^0 = v̂ℓ corresponds to the initial value, and all terms without superscripts are constant inside the inference loop. There are some subtleties here. For example, we have vℓ^{i−1} in the first line because vℓ is updated after ϵℓ in the loop. More subtly, we have ϵℓ+1^i in the second equation instead of ϵℓ+1^{i−1} because the for loop goes backwards from ℓ = L − 1 to ℓ = 1, so ϵℓ+1 is updated before ϵℓ. First note that ϵℓ^1 = 0 for ℓ = 1, …, L − 1 because vℓ^0 = v̂ℓ.
Now compute the change in ϵℓ across one step,
ϵℓ^{i+1} − ϵℓ^i = −(vℓ^i − vℓ^{i−1}) = −η dvℓ^i = −η ϵℓ^i + η ϵℓ+1^i ∂fℓ+1(v̂ℓ; θℓ+1)/∂v̂ℓ.
Note that this equation is only valid for i ≥ 1 due to the i − 1 term (vℓ^{−1} is not defined). Adding ϵℓ^i to both sides of the resulting equation gives
ϵℓ^{i+1} = (1 − η) ϵℓ^i + η ϵℓ+1^i ∂fℓ+1(v̂ℓ; θℓ+1)/∂v̂ℓ.
We now use induction to prove that ϵℓ = δℓ after n = L iterations. Indeed, we prove a stronger claim that ϵℓ^i = δℓ for all i ≥ L − ℓ + 1. First note that ϵL^i = δL for all i because ϵL is initialized to δL and then never changed. Therefore, our claim is true for the base case ℓ = L.
Now suppose that ϵℓ+1^i = δℓ+1 for all i ≥ L − (ℓ + 1) + 1 = L − ℓ. We need to show that ϵℓ^i = δℓ for all i ≥ L − ℓ + 1. From above, setting η = 1, we have
ϵℓ^{i+1} = ϵℓ+1^i ∂fℓ+1(v̂ℓ; θℓ+1)/∂v̂ℓ = δℓ+1 ∂fℓ+1(v̂ℓ; θℓ+1)/∂v̂ℓ = δℓ
whenever i ≥ L − ℓ, i.e., whenever i + 1 ≥ L − ℓ + 1. This completes our induction argument. It follows that ϵℓ^i = δℓ at iteration i = L − ℓ + 1 at all layers ℓ = 1, …, L. The last layer to be updated to the correct value is ℓ = 1, which is updated on iteration number i = L − 1 + 1 = L. Hence, ϵℓ = δℓ for all ℓ = 1, …, L after n = L iterations. This proves the first statement in our theorem. The second statement then follows from the definition of dθℓ,
dθℓ = −ϵℓ ∂fℓ(v̂ℓ−1; θℓ)/∂θℓ = −δℓ ∂fℓ(v̂ℓ−1; θℓ)/∂θℓ = −∂ℒ(ŷ, y)/∂θℓ.
This completes the proof.
This theorem ties the implementation and formulation of predictive coding from [12] (i.e., Algorithm 3) to the results in [13, 14]. As noted in [13, 14], this result depends critically on the assumption that the values of vℓ are initialized to the activations from a forward pass, vℓ = v̂ℓ. The theoretical predictions from Theorem 1 are confirmed by the fact that all of the errors in the rightmost panels of Fig 4 converge to zero after n = L = 5 iterations.
To further test the result empirically, I repeated Figs 3 and 5 using η = 1 and n = L (in contrast to Figs 3 and 5 which used η = 0.1 and n = 20). The loss and accuracy closely matched those computed by backpropagation (Figs 6A and 6B and 7A and 7B). More importantly, the parameter updates closely matched the true gradients (Figs 6C and 6D and 7C and 7D), as predicted by Theorem 1. The differences between predictive coding and backpropagation in Fig 6 were due to floating point errors and the non-determinism of computations performed on GPUs. For example, similar differences to those seen in Fig 6A and 6B were present when the same training algorithm was run twice with the same random seed. The smaller number of iterations (n = L in Figs 6 and 7 versus n = 20 in Figs 3 and 5) resulted in a shorter training time (13s for MNIST and 300s for CIFAR-10 for Figs 6 and 7, compare to 29s and 848s in Figs 3 and 5, and compare to 8s and 58s for backpropagation).
Same as Fig 3 except η = 1 and n = L. Predictive coding with the fixed prediction assumption approximates true gradients accurately when η = 1.
Same as Fig 5 except η = 1 and n = L. Predictive coding with the fixed prediction assumption approximates true gradients accurately when η = 1.
In summary, a review of the literature shows that a strict interpretation of predictive coding (Algorithm 2) does not converge to the true gradients computed by backpropagation. To compute the true gradients, predictive coding must be modified by the fixed prediction assumption (Algorithm 3). Further, I proved that Algorithm 3 computes the exact gradients when η = 1 and n ≥ L, which ties together results from previous work [12–14].
Predictive coding with the fixed prediction assumption and η = 1 is functionally equivalent to a direct implementation of backpropagation
The proof of Theorem 1 and the last panel of Fig 4 give some insight into how Algorithm 3 works. First note that the values of vℓ in Algorithm 3 are only used to compute the values of ϵℓ and are not otherwise used in the computation of dθℓ or any other quantities. Therefore, if we only care about understanding parameter updates, dθℓ, we can ignore the values of vℓ and only focus on how ϵℓ is updated on each iteration, i. Secondly, note that when η = 1, each ϵℓ is updated only once: ϵℓ = 0 for i < L − ℓ + 1 and ϵℓ = δℓ for i ≥ L − ℓ + 1, so ϵℓ is only changed on iteration number i = L − ℓ + 1. In other words, the error computation in Algorithm 3 when η = 1 and n = L is equivalent to
# error computation
ϵL = ∂ℒ(v̂L, y)/∂v̂L
for i = 1, …, L
  for ℓ = L − 1, …, 1
    if ℓ == L − i + 1
      ϵℓ = ϵℓ+1 ∂fℓ+1(v̂ℓ; θℓ+1)/∂v̂ℓ
The two computations are equivalent in the sense that they compute the same values of the errors, ϵℓ, on every iteration. The formulation above makes it clear that the nested loops are unnecessary because, for each value of i, ϵℓ is only updated at one value of ℓ. Therefore, the nested loops and if-statement can be replaced by a single for-loop. Specifically, the error computation in Algorithm 3 when η = 1 is equivalent to
# error computation
ϵL = ∂ℒ(v̂L, y)/∂v̂L
for ℓ = L − 1, …, 1
  ϵℓ = ϵℓ+1 ∂fℓ+1(v̂ℓ; θℓ+1)/∂v̂ℓ
This is exactly the error computation from the standard backpropagation algorithm, i.e., Algorithm 1 (with ϵℓ playing the role of δℓ). Hence, if we use η = 1, then Algorithm 3 is just backpropagation with extra steps, and these extra steps do not compute any non-zero values. If we additionally want to compute the fixed point beliefs, then they can still be computed using the relationship
vℓ = v̂ℓ − ϵℓ.
We may conclude that, when η = 1, Algorithm 3 can be replaced by an exact implementation of backpropagation without any effect on the results or effective implementation of the algorithm. This raises the question of whether predictive coding with the fixed prediction assumption should be considered any more biologically plausible than a direct implementation of backpropagation.
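The single-loop error computation above can be checked numerically with autograd vector-Jacobian products; in the sketch below, the small Sequential model, the squared-Euclidean loss, and the choice to treat each module as one predictive coding layer are assumptions made for illustration.

import torch
import torch.nn as nn

# Numerical check that the errors eps_l from the single-loop computation equal
# the gradients dLoss/dvhat_l computed by automatic differentiation.
torch.manual_seed(4)
layers = nn.Sequential(nn.Linear(6, 8), nn.Tanh(), nn.Linear(8, 4))
fs = list(layers)               # treat each module as one layer
L = len(fs)
x = torch.randn(6)
y = torch.randn(4)

# forward pass, keeping the graph for the autograd comparison
vhat = [x]
for f in fs:
    vhat.append(f(vhat[-1]))
loss = 0.5 * torch.sum((vhat[-1] - y) ** 2)

# error computation equivalent to Algorithm 3 with eta = 1 and n = L
eps = [None] * (L + 1)
eps[L] = (vhat[L] - y).detach()
for l in range(L - 1, 0, -1):
    # vector-Jacobian product of f_{l+1} evaluated at vhat_l
    eps[l] = torch.autograd.functional.vjp(fs[l], vhat[l].detach(), eps[l + 1])[1]

# compare to the gradients from backpropagation
grads = torch.autograd.grad(loss, vhat[1:L])
for l in range(1, L):
    print(torch.allclose(eps[l], grads[l - 1], atol=1e-6))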
Accounting for covariance or precision matrices in hidden layers does not affect learning under the fixed prediction assumption
Above, I showed that predictive coding with the fixed prediction assumption is functionally equivalent to backpropagation. However, the predictive coding algorithm was derived under an assumption that covariance matrices in the probabilistic model are identity matrices, Σℓ = I. This raises the question of whether relaxing this assumption could generalize backpropagation to account for the covariances, as suggested in previous work [11, 12, 26].
We can account for covariances by returning to the calculations starting from the probabilistic model in Eq (2) and omitting the assumption that Σℓ = I. To this end, it is helpful to define the precision-weighted prediction errors [20, 21, 26],
ϵ̃ℓ = ϵℓ Pℓ
for ℓ = 1, …, L − 1 where
Pℓ = Σℓ⁻¹
is the inverse of the covariance matrix of Vℓ, which is called the “precision matrix.” Recall that we treat ϵℓ as a row-vector, which explains the right-multiplication in this definition.
Modifying the definition of ϵL to account for covariances is not so simple because the Gaussian model for Vℓ is not justified for non-Euclidean loss functions such as categorical loss functions. Moreover, it is not clear how to define the covariance or precision matrix of the output layer when labels are observed. As such, I restrict to accounting for precision matrices in hidden layers only, and leave the question of accounting for covariances in the output layer for future work with some comments on the issue provided at the end of this section. To this end, let us not modify the last layer’s precision and instead define
ϵ̃L = ϵL.
The free energy is then defined as [20, 21]
F = (1/2) Σℓ ϵℓ Pℓ ϵℓᵀ
where the sum runs over ℓ = 1, …, L with PL = I (up to additive terms that do not depend on the beliefs, vℓ, or the parameters, θℓ).
Performing gradient descent on F with respect to vℓ therefore gives
dvℓ = −∂F/∂vℓ = ϵ̃ℓ − ϵ̃ℓ+1 ∂fℓ+1(vℓ; θℓ+1)/∂vℓ
and performing gradient descent on F with respect to θℓ gives
dθℓ = −∂F/∂θℓ = −ϵ̃ℓ ∂fℓ(vℓ−1; θℓ)/∂θℓ.
These expressions are identical to Eqs (5) and (6) derived above except that
ϵ̃ℓ
takes the place of ϵℓ.
The precision matrices themselves can be learned by performing gradient descent on F with respect to Pℓ or, as suggested in other work [21], by parameterizing the model in terms of the covariance matrices, Σℓ, and performing gradient descent with respect to Σℓ. Alternatively, one could use techniques from the literature on Gaussian graphical models to learn a sparse or low-rank representation of Pℓ. I circumvent the question of estimating Pℓ by instead just asking how an estimate of Pℓ (however it is obtained) would affect learning. I do assume that Pℓ is symmetric. I also simplify the calculations by restricting the analysis to predictive coding with the fixed prediction assumption, leaving the analysis of fixed point prediction errors and parameter updates under strict predictive coding with precision matrices for future work. Some analysis has been performed in this direction [21], but not for the supervised learning scenario considered here.
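As a small illustration of the precision weighting itself (independent of how Pℓ is obtained), the sketch below forms ϵ̃ℓ = ϵℓ Pℓ using the inverse of a regularized sample covariance matrix; this estimator is just one possible choice and is not taken from the manuscript.

import torch

# Illustration of precision-weighted prediction errors. P is the inverse of a
# regularized sample covariance matrix; this estimator is only one option.
torch.manual_seed(5)
n_samples, dim = 200, 6
samples = torch.randn(n_samples, dim) @ torch.randn(dim, dim)  # correlated data
Sigma = torch.cov(samples.T) + 1e-3 * torch.eye(dim)           # covariance of V_l
P = torch.linalg.inv(Sigma)                                    # precision matrix

eps = torch.randn(dim)          # a prediction error for layer l (row vector)
eps_tilde = eps @ P             # precision-weighted error used in dv_l and dtheta_l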
Putting this together, predictive coding under the fixed prediction assumption while accounting for precision matrices in hidden layers is defined by the following equations
ϵ̃ℓ = (v̂ℓ − vℓ) Pℓ
dvℓ = ϵ̃ℓ − ϵ̃ℓ+1 ∂fℓ+1(v̂ℓ; θℓ+1)/∂v̂ℓ
dθℓ = −ϵ̃ℓ ∂fℓ(v̂ℓ−1; θℓ)/∂θℓ
with ϵ̃L = ϵL = ∂ℒ(v̂L, y)/∂v̂L. The only difference between these equations and Eq (12) is that they use
ϵ̃ℓ
in place of
ϵℓ.
Following the same line of reasoning, therefore, if the updates to vℓ are repeated until convergence, then fixed point precision-weighted prediction errors satisfy
ϵ̃ℓ = ϵ̃ℓ+1 ∂fℓ+1(v̂ℓ; θℓ+1)/∂v̂ℓ.
Notably, this is the same equation derived for ϵℓ under the fixed prediction assumption with Σℓ = I, so fixed point precision-weighted prediction errors are also the same,
ϵ̃ℓ = ∂ℒ(ŷ, y)/∂v̂ℓ = δℓ,
and, therefore, parameter updates are the same as well,
dθℓ = −ϵ̃ℓ ∂fℓ(v̂ℓ−1; θℓ)/∂θℓ = −∂ℒ(ŷ, y)/∂θℓ.
In conclusion, accounting for precision matrices in hidden layers does not affect learning under the fixed prediction assumption. Fixed point parameter updates are still the same as those computed by backpropagation. This conclusion is independent of how the precision matrices are estimated, but it does rely on the assumption that fixed points for vℓ exist and are unique.
Above, we only considered precision matrices in the hidden layers because accounting for precision matrices in the output layer is problematic for general loss functions. The use of a precision matrix in the output implies the use of a Gaussian model for the output layer and labels, which is inconsistent with some types of labels and loss functions. If we focus on the case of a squared-Euclidean loss function,
ℒ(ŷ, y) = (1/2)‖ŷ − y‖²,
then the use of precision matrices in the output layer is more parsimonious and we can define
ϵ̃L = ϵL PL = (v̂L − y) PL
in place of the definition above (recalling that
ϵL = ∂ℒ(v̂L, y)/∂v̂L = v̂L − y
under the fixed prediction assumption). Following the same calculations as above gives fixed points of the form
ϵ̃ℓ = (v̂L − y) PL ∂v̂L/∂v̂ℓ
and, therefore, weight updates take the form
dθℓ = −(v̂L − y) PL [∂v̂L/∂v̂ℓ][∂fℓ(v̂ℓ−1; θℓ)/∂θℓ]
at the fixed point. Hence, accounting for precision matrices at the output layer can affect learning by re-weighting the gradient of the loss function according to the precision matrix of the output layer. Note that the precision matrices of the hidden layers still have no effect on learning in this case. Previous work relates the inclusion of the precision matrix in output layers to the use of natural gradients [26, 27].
Prediction errors do not necessarily represent surprising or unexpected features of inputs
Deep neural networks are often interpreted as abstract models of cortical neuronal networks. To this end, the activations of units in deep neural networks are compared to the activity (typically firing rates) of cortical neurons [3, 28, 29]. This approach ignores the representation of errors within the network. More generally, the activations in one particular layer of a feedforward deep neural network contain no information about the activations of deeper layers, the label, or the loss. On the other hand, the activity of cortical neurons can be modulated by downstream activity and information believed to be passed upstream by feedback projections. Predictive coding provides a precise model for the information that deeper layers send to shallower layers, specifically prediction errors.
Under the fixed prediction assumption (Algorithm 3), prediction errors in a particular layer are approximated by the gradients of the loss function with respect to that layer’s activations, δℓ = ∂ℒ(ŷ, y)/∂v̂ℓ, but under a strict interpretation of predictive coding (Algorithm 2), prediction errors do not necessarily reflect gradients. We next empirically explored how the representations of images differ across the activations from a feedforward pass, v̂ℓ, and the prediction errors under the fixed prediction assumption, ϵℓ = δℓ, as well as the beliefs, vℓ, and prediction errors, ϵℓ, under a strict interpretation of predictive coding (Algorithm 2). To do so, we computed each quantity in VGG-19 [30], which is a large, feedforward convolutional neural network (19 layers and 143,667,240 trainable parameters) pre-trained on ImageNet [31].
The use of convolutional layers allowed us to visualize the activations and prediction errors in each layer. Specifically, we took the Euclidean norm of each quantity across all channels and plotted them as two-dimensional images for layers ℓ = 1 and ℓ = 10 and for two different input images (Fig 8). For each image and each layer (each row in Fig 8), we computed the Euclidean norm of four quantities. First, we computed the activations from a forward pass through the network (v̂ℓ, second column). Under predictive coding with the fixed prediction assumption (Algorithm 3), we can interpret the activations, v̂ℓ, as “beliefs” and the gradients, δℓ, as “prediction errors.” Strictly speaking, there is a distinction between the beliefs, v̂ℓ, from a feedforward pass and the beliefs, vℓ, when labels are provided. Either could be interpreted as a “belief.” However, we found that the difference between them was negligible for the examples considered here.
The Euclidean norm of feedforward activations (v̂ℓ, interpreted as beliefs under the fixed prediction assumption), gradients of the loss with respect to activations (δℓ, interpreted as prediction errors under the fixed prediction assumption), beliefs (vℓ) under strict predictive coding, and prediction errors (ϵℓ) under strict predictive coding computed from the VGG-19 network [30] pre-trained on ImageNet [31] with two different photographs as inputs at two different layers. The vertical labels on the left (“triceratops” and “Irish wolfhound”) correspond to the guessed label, which was also used as the “true” label (y) to compute the gradients.
Next, we computed the gradients of the loss with respect to the activations (δℓ, third column in Fig 8). The theory and simulations above and from previous work confirm that these gradients approximate the prediction errors from predictive coding with the fixed prediction assumption (Algorithm 3). Indeed, for the examples considered here, the differences between the two quantities were negligible. Next, we computed the beliefs (vℓ, fourth column in Fig 8) computed by strict predictive coding (Algorithm 2). Finally, we computed the prediction errors (ϵℓ, last column in Fig 8) computed by strict predictive coding (Algorithm 2).
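A sketch of how such gradient maps can be obtained from a pre-trained VGG-19 with a forward hook and autograd is shown below; the torchvision weights argument, the choice of layer index, and the random placeholder image are assumptions, and the actual figure code is in the linked repository.

import torch
import torchvision

# Sketch: channel-wise Euclidean norm of the gradients dLoss/dvhat_l
# ("prediction errors under the fixed prediction assumption") for VGG-19.
# The image tensor here is random; in Fig 8 it would be a preprocessed photograph.
model = torchvision.models.vgg19(weights="IMAGENET1K_V1").eval()
img = torch.randn(1, 3, 224, 224)               # placeholder for a real image

acts = {}
def save_act(name):
    def hook(module, inp, out):
        out.retain_grad()
        acts[name] = out
    return hook

# hook an early convolutional layer (index chosen for illustration)
model.features[10].register_forward_hook(save_act("layer10"))

out = model(img)
label = out.argmax(dim=1)                        # use the network's guess as y
loss = torch.nn.functional.cross_entropy(out, label)
loss.backward()

delta = acts["layer10"].grad[0]                  # gradients w.r.t. activations
err_map = delta.norm(dim=0)                      # Euclidean norm across channels
print(err_map.shape)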
Note that we used a VGG-19 model that was pre-trained using backpropagation. Hence, the weights were not necessarily the same as the weights that would be obtained if the model were trained using predictive coding, particularly strict predictive coding (Algorithm 2) which does not necessarily converge to the true gradients. Training a large ImageNet model like VGG-19 with predictive coding is extremely computationally expensive. Regardless, future work should address the question of whether using pre-trained weights (versus weights trained by predictive coding) affects the conclusions reached here.
Overall, the activations, v̂ℓ, from a feedforward pass were qualitatively very similar to the beliefs, vℓ, computed under a strict interpretation of predictive coding (Algorithm 2). To a slightly lesser degree, the gradients, δℓ, were qualitatively similar to the prediction errors, ϵℓ, computed under a strict interpretation of predictive coding (Algorithm 2). Since v̂ℓ and δℓ approximate beliefs and prediction errors under the fixed prediction assumption, these observations confirmed that the fixed prediction assumption does not make large qualitative changes to the representation of beliefs and errors in these examples. Therefore, in the discussion below, we use “beliefs” and “prediction errors” to refer to the quantities from both models.
Interestingly, prediction errors were non-zero even when the image and the network’s “guess” were consistent with the label (no “mismatch”). Indeed, the prediction errors were largest in magnitude at pixels corresponding to the object predicted by the label, i.e., at the most predictable regions. While this observation is an obvious consequence of the fact that prediction errors are approximated by the gradients, δℓ, it is contradictory to the heuristic or intuitive interpretation of prediction errors as measurements of “surprise” in the colloquial sense of the word [16].
As an illustrative example from Fig 8, it is not surprising that an image labeled “triceratops” contains a triceratops, but this does not imply a lack of prediction errors because the space of images containing a triceratops is large and any one image of a triceratops is not wholly representative of the label. Moreover, the pixels to which the loss is most sensitive are those pixels containing the triceratops. Therefore, those pixels give rise to larger values of δℓ. Hence, in high-dimensional sensory spaces, predictive coding models do not necessarily predict that prediction error units encode “surprise” in the colloquial sense of the word.
In both examples in Fig 8, we used a label, y, that matched the network’s “guessed” label, i.e., the label to which the network assigned the highest probability. Prediction errors are often discussed in the context of mismatched stimuli in which top-down input is inconsistent with bottom-up predictions [32–37]. Mismatches can be modeled by taking a label that is different from the network’s guess. In Fig 9, we visualized the prediction errors in response to matched and mismatched labels. The network assigned a probability of p = 0.9991 to the label “carousel” and a probability of p = 3.63 × 10−8 to the label “bald eagle”. The low probability assigned to “bald eagle” is, at least in part, a consequence of the network being trained with a softmax loss function, which implicitly assumes one label per input. When we applied the mismatched label “bald eagle,” prediction errors were larger in pixels that are salient for that label (e.g., the bird’s white head, which is a defining feature of a bald eagle). Moreover, the prediction errors as a whole were much larger in magnitude in response to the mismatched label (see the scales of the color bars in Fig 9).
Same as Fig 8, but for the bottom row the label did not match the network’s guess.
In summary, the relationship between prediction errors and gradients helped demonstrate that prediction errors sometimes, but not always, conform to their common interpretation as unexpected features of a bottom-up input in the context of a top-down input. Also, beliefs and prediction errors were qualitatively similar with and without the fixed prediction assumption for the examples considered here.
Discussion
We reviewed and extended previous work [11–14] on the relationship between predictive coding and backpropagation for learning in neural networks. Our results demonstrated that a strict interpretation of predictive coding does not accurately approximate backpropagation, but is still capable of learning (Figs 1 and 2). Previous work proposed a modification to predictive coding called the “fixed prediction assumption” which causes predictive coding to converge to the same parameter updates produced by backpropagation, under the assumption that the predictive coding iterations converge to fixed points. Hence, the relationship between predictive coding and backpropagation identified in previous work relies critically on the fixed prediction assumption. Formal derivations of predictive coding in terms of variational inference [20] do not produce the fixed prediction assumption. It is possible that an alternative probabilistic model or alternative approaches to the variational formulation could help formalize a model of predictive coding under the fixed prediction assumption.
We proved analytically and verified empirically that taking a step size of η = 1 in the modified predictive coding algorithm computes the exact gradients computed by backpropagation in a fixed number of steps (modulo floating point numerical errors). This result is consistent with similar, but slightly less general, results in previous work [13, 14].
A closer inspection of the fixed prediction assumption with η = 1 showed that it is algorithmically equivalent to a direct implementation of backpropagation. As such, any potential neural architecture and machinery that could be used to implement predictive coding with the fixed prediction assumption could also implement backpropagation directly. This result calls into question whether predictive coding with the fixed prediction assumption is any more biologically plausible than a direct implementation of backpropagation.
Visualizing the beliefs and prediction errors produced by predictive coding models applied to a large convolutional neural network pre-trained on ImageNet showed that beliefs and prediction errors were activated by distinct parts of input images, and the parts of the images that produced larger prediction errors were not always consistent with an intuitive interpretation of prediction errors as representing surprising or unexpected features of inputs. These observations are consistent with the fact that prediction errors approximate gradients of the loss function in backpropagation [11–14]. Gradients are large for input features that have a larger impact on the loss. While surprising features can have a large impact on the loss, unsurprising features can as well. We only verified this finding empirically on a few examples. The reader can try additional examples by inserting the URL of any image into the file PredErrsFromURLimage.ipynb contained in the directories linked in Materials and Methods, and can also be accessed directly at https://bit.ly/3JwGUM9. Future work should attempt to quantify the relationship between prediction errors and surprising features more systematically across many inputs. In addition, prediction errors could be computed for learning tasks associated with common experimental paradigms so they can be used to make experimentally testable predictions.
When interpreting artificial deep neural networks as models of biological neuronal networks, it is common to compare activations in the artificial network to biological neurons’ firing rates [28, 29]. However, under predictive coding models and other models in which errors are propagated upstream by feedback connections, many biological interpretations posit the existence of “error neurons” that encode the errors sent upstream. In most such models (including predictive coding), error neurons reflect or approximate the gradient of the loss function with respect to artificial neurons’ activations, δℓ. Any model that hypothesizes the neural representation of backpropagated errors would predict that some recorded neural activity should reflect these errors. Therefore, if we want to draw analogues between artificial and biological neural networks, the activity of biological neurons should be compared to both the activations and the gradients of artificial neurons.
Following previous work [11, 12], we took the covariance matrices underlying the probabilistic model to be identity matrices, Σℓ = I, when deriving the predictive coding model. We also showed that relaxing this assumption by allowing for arbitrary precision matrices in hidden layers does not affect learning under the fixed prediction assumption. Future work should consider the utility of accounting for covariance (or precision) matrices in models without the fixed prediction assumption (i.e., under the “strict” model) and accounting for precisions or covariances in the output layer. Moreover, precision matrices could still have benefits in other settings such as recurrent network models, unsupervised learning, or active inference.
Predictive coding and deep neural networks (trained by backpropagation) are often viewed as competing models of brain function. Better understanding their relationship can help in the interpretation and implementation of each algorithm as well as their mutual relationships to biological neuronal networks.
Materials and methods
All numerical examples were performed on GPUs using Google Colaboratory with custom-written PyTorch code. The networks trained on MNIST used two convolutional and three fully connected layers with rectified linear activation functions and were trained for 2 epochs with a learning rate of 0.002 and a batch size of 300. The networks trained on CIFAR-10 used three convolutional and three fully connected layers with rectified linear activation functions and were trained for 5 epochs with a learning rate of 0.01 and a batch size of 256. All networks were trained using the Adam optimizer with gradients replaced by the output of the respective algorithm. All of the code to produce the figures in the manuscript can be found at https://doi.org/10.6084/m9.figshare.19387409.v2. A Google Drive folder with Colab notebooks that produce all figures in this text can be found at https://drive.google.com/drive/folders/1m_y0G_sTF-pV9pd2_sysWt1nvRvHYzX0. An additional copy of the same code is also stored at https://github.com/RobertRosenbaum/PredictiveCodingVsBackProp. Full details of the neural network architectures and metaparameters can be found in this code.
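For orientation, the block below shows a Sequential model consistent with the MNIST description above (two convolutional and three fully connected layers with rectified linear activations, grouped into five predictive coding layers); the channel counts, kernel sizes, and pooling are assumptions for illustration, and the exact architectures and hyperparameters are specified in the linked code.

import torch.nn as nn

# A Sequential model consistent with the description above, not the exact
# architecture used in the figures.
mnist_model = nn.Sequential(
    nn.Sequential(nn.Conv2d(1, 16, 3), nn.ReLU(), nn.MaxPool2d(2)),
    nn.Sequential(nn.Conv2d(16, 32, 3), nn.ReLU(), nn.MaxPool2d(2)),
    nn.Sequential(nn.Flatten(), nn.Linear(32 * 5 * 5, 64), nn.ReLU()),
    nn.Sequential(nn.Linear(64, 32), nn.ReLU()),
    nn.Linear(32, 10))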
Torch2PC software for predictive coding with PyTorch models
The figures above were all produced using PyTorch [38] models combined with custom written functions for predictive coding. Functions for predictive coding with PyTorch models are collected in the Github Repository Torch2PC. Currently, the only available functions are intended for models built using the Sequential class, but more general functions will be added to Torch2PC in the future. The functions can be imported using the following commands
!git clone https://github.com/RobertRosenbaum/Torch2PC.git
from Torch2PC import TorchSeq2PC as T2PC
The primary function in TorchSeq2PC is PCInfer, which performs one predictive coding step (computes one value of dθ) on a batch of inputs and labels. The function takes an input ErrType, which is a string that determines whether to use a strict interpretation of predictive coding (Algorithm 2; ErrType="Strict"), predictive coding with the fixed prediction assumption (Algorithm 3; "FixedPred"), or to compute the gradients exactly using backpropagation (Algorithm 1; "Exact"). Algorithm 2 can be called as follows,
vhat,Loss,dLdy,v,epsilon=T2PC.PCInfer(model,LossFun,X,Y,"Strict",eta,n,vinit)
where model is a Sequential PyTorch model, LossFun is a loss function, X is a mini-batch of inputs, Y is a mini-batch of labels, eta is the step size, n is the number of iterations to use, and vinit is the initial value for the beliefs. If vinit is not passed, it is set to the result from a forward pass, vinit = vhat. The function returns a list of activations from a forward pass at each layer as vhat, the loss as Loss, the gradient of the output with respect to the loss as dLdy, a list of beliefs, vℓ, at each layer as v, and a list of prediction errors, ϵℓ, at each layer as epsilon. The values of the parameter updates, dθℓ, are stored in the grad attributes of each parameter, model.param.grad. Hence, after a call to PCInfer, gradient descent could be implemented by calling
with torch.no_grad():
    for p in model.parameters():
        p -= eta * p.grad
Alternatively, an arbitrary optimizer could be used by calling
optimizer.step()
where optimizer is an optimizer created using the PyTorch optim class, e.g., by calling
optimizer = optim.Adam(model.parameters()) before the call to T2PC.PCInfer.
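Putting these pieces together, the following is a hedged end-to-end sketch of a single training step with PCInfer and Adam; the small model, the random stand-in mini-batch, and the hyperparameter choices are placeholders, while the PCInfer call follows the interface described in this section.

import torch
from torch import nn, optim
from Torch2PC import TorchSeq2PC as T2PC

# One training step with predictive coding (fixed prediction assumption) and
# Adam. The model and mini-batch below are placeholders for illustration.
model = nn.Sequential(
    nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU()),
    nn.Sequential(nn.Linear(64, 32), nn.ReLU()),
    nn.Linear(32, 10))
LossFun = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.002)

X = torch.randn(64, 1, 28, 28)                 # stand-in mini-batch of inputs
Y = torch.randint(0, 10, (64,))                # stand-in labels

vhat, Loss, dLdy, v, epsilon = T2PC.PCInfer(
    model, LossFun, X, Y, "FixedPred", eta=1, n=len(model))

optimizer.step()                               # applies the updates stored in .grad
optimizer.zero_grad()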
The input model should be a PyTorch Sequential model. Each layer is treated as a single predictive coding layer. Multiple functions can be included within the same layer by wrapping them in a separate call to Sequential. For example the following code:
model = nn.Sequential(
    nn.Conv2d(1,10,3),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(10,10,3),
    nn.ReLU())
will treat each item as its own layer (5 layers in all). To treat each “convolutional block” as a separate layer, instead do
model = nn.Sequential(
    nn.Sequential(
        nn.Conv2d(1,10,3),
        nn.ReLU(),
        nn.MaxPool2d(2)),
    nn.Sequential(
        nn.Conv2d(10,10,3),
        nn.ReLU()))
which has just 2 layers.
Algorithm 3 can be called as follows,
vhat,Loss,dLdy,v,epsilon=T2PC.PCInfer(model,LossFun,X,Y,"FixedPred",eta,n)
The input vinit is not used for Algorithm 3, so it does not need to be passed in. The exact values computed by backpropagation can be obtained by calling
vhat,Loss,dLdy,v,epsilon=T2PC.PCInfer(model,LossFun,X,Y,"Exact")
The inputs vinit, eta, and n are not used for computing exact gradients, so they do not need to be passed in. Theorem 1 says that
T2PC.PCInfer(model,LossFun,X,Y,"FixedPred",eta=1,n=len(model))
computes the same values as
T2PC.PCInfer(model,LossFun,X,Y,"Exact")
up to numerical floating point errors. The inputs eta, n, and vinit are optional. If they are omitted by calling
T2PC.PCInfer(model,LossFun,X,Y,ErrType)
then they default to eta=0.1, n=20, and vinit=None, which produces vinit = vhat when
ErrType="Strict". More complete documentation and a complete example are provided as
SimpleExample.ipynb in the GitHub repository and in the code accompanying this paper. More examples are provided by the code accompanying each figure above.
Supporting information
S1 Fig. Comparing backpropagation and predictive coding in a convolutional neural network trained on MNIST across multiple trials.
Same as Fig 1 except the model was trained 30 times with different random seeds. Dark curves show the mean values and shaded regions show ± one standard deviation across trials.
https://doi.org/10.1371/journal.pone.0266102.s001
(EPS)
S2 Fig. Comparing backpropagation and predictive coding modified by the fixed prediction assumption in a convolutional neural network trained on MNIST across multiple trials.
Same as Fig 3 except the model was trained 30 times with different random seeds. Dark curves show the mean values and shaded regions show ± one standard deviation across trials.
https://doi.org/10.1371/journal.pone.0266102.s002
(EPS)
References
- 1. Izhikevich EM. Solving the distal reward problem through linkage of STDP and dopamine signaling. Cerebral cortex. 2007;17(10):2443–2452. pmid:17220510
- 2. Clark DG, Abbott L, Chung S. Credit Assignment Through Broadcasting a Global Error Vector. arXiv preprint arXiv:210604089. 2021.
- 3. Lillicrap TP, Santoro A, Marris L, Akerman CJ, Hinton G. Backpropagation and the brain. Nature Reviews Neuroscience. 2020;21(6):335–346. pmid:32303713
- 4. Whittington JC, Bogacz R. Theories of error back-propagation in the brain. Trends in Cognitive Sciences. 2019;23(3):235–250. pmid:30704969
- 5. Urbanczik R, Senn W. Learning by the dendritic prediction of somatic spiking. Neuron. 2014;81(3):521–528. pmid:24507189
- 6. Lillicrap TP, Cownden D, Tweed DB, Akerman CJ. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications. 2016;7(1):1–10. pmid:27824044
- 7. Scellier B, Bengio Y. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in computational neuroscience. 2017;11:24. pmid:28522969
- 8. Aljadeff J, D’amour J, Field RE, Froemke RC, Clopath C. Cortical credit assignment by Hebbian, neuromodulatory and inhibitory plasticity. arXiv preprint arXiv:191100307. 2019.
- 9. Kunin D, Nayebi A, Sagastuy-Brena J, Ganguli S, Bloom J, Yamins D. Two routes to scalable credit assignment without weight symmetry. In: International Conference on Machine Learning. PMLR; 2020. p. 5511–5521.
- 10. Payeur A, Guerguiev J, Zenke F, Richards BA, Naud R. Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits. Nature Neuroscience. 2021; p. 1–10.
- 11. Whittington JC, Bogacz R. An approximation of the error backpropagation algorithm in a predictive coding network with local hebbian synaptic plasticity. Neural Computation. 2017;29(5):1229–1262. pmid:28333583
- 12. Millidge B, Tschantz A, Buckley CL. Predictive coding approximates backprop along arbitrary computation graphs. arXiv preprint arXiv:200604182. 2020.
- 13. Song Y, Lukasiewicz T, Xu Z, Bogacz R. Can the brain do backpropagation?—exact implementation of backpropagation in predictive coding networks. Advances in Neural Information Processing Systems. 2020;33:22566. pmid:33840988
- 14. Salvatori T, Song Y, Lukasiewicz T, Bogacz R, Xu Z. Predictive Coding Can Do Exact Backpropagation on Convolutional and Recurrent Neural Networks. arXiv preprint arXiv:210303725. 2021.
- 15. Rao RP, Ballard DH. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience. 1999;2(1):79–87. pmid:10195184
- 16. Friston K. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience. 2010;11(2):127–138. pmid:20068583
- 17. Huang Y, Rao RP. Predictive Coding. Wiley Interdisciplinary Reviews: Cognitive Science. 2011;2(5):580–593. pmid:26302308
- 18. Bastos AM, Usrey WM, Adams RA, Mangun GR, Fries P, Friston KJ. Canonical microcircuits for predictive coding. Neuron. 2012;76(4):695–711. pmid:23177956
- 19. Clark A. Surfing uncertainty: Prediction, action, and the embodied mind. Oxford University Press; 2015.
- 20. Buckley CL, Kim CS, McGregor S, Seth AK. The free energy principle for action and perception: A mathematical review. Journal of Mathematical Psychology. 2017;81:55–79.
- 21. Bogacz R. A tutorial on the free-energy framework for modelling perception and learning. Journal of Mathematical Psychology. 2017;76:198–211. pmid:28298703
- 22. Spratling MW. A review of predictive coding algorithms. Brain and cognition. 2017;112:92–97. pmid:26809759
- 23. Keller GB, Mrsic-Flogel TD. Predictive processing: a canonical cortical computation. Neuron. 2018;100(2):424–435. pmid:30359606
- 24. Goodfellow I, Bengio Y, Courville A, Bengio Y. Deep Learning. MIT Press; 2016.
- 25. Krizhevsky A, Hinton G, et al. Learning multiple layers of features from tiny images. Citeseer. 2009;.
- 26. Millidge B, Seth A, Buckley CL. Predictive Coding: a Theoretical and Experimental Review. arXiv preprint arXiv:210712979. 2021.
- 27. Amari Si. Information geometry of the EM and em algorithms for neural networks. Neural networks. 1995;8(9):1379–1408.
- 28. Schrimpf M, Kubilius J, Hong H, Majaj NJ, Rajalingham R, Issa EB, et al. Brain-Score: Which Artificial Neural Network for Object Recognition is most Brain-Like? bioRxiv preprint. 2018.
- 29. Schrimpf M, Kubilius J, Lee MJ, Murty NAR, Ajemian R, DiCarlo JJ. Integrative Benchmarking to Advance Neurally Mechanistic Models of Human Intelligence. Neuron. 2020;. pmid:32918861
- 30. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:14091556. 2014.
- 31. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision. 2015;115(3):211–252.
- 32. Hertäg L, Sprekeler H. Learning prediction error neurons in a canonical interneuron circuit. Elife. 2020;9:e57541. pmid:32820723
- 33. Gillon CJ, Pina JE, Lecoq JA, Ahmed R, Billeh Y, Caldejon S, et al. Learning from unexpected events in the neocortical microcircuit. bioRxiv. 2021;.
- 34. Keller GB, Bonhoeffer T, Hübener M. Sensorimotor mismatch signals in primary visual cortex of the behaving mouse. Neuron. 2012;74(5):809–815. pmid:22681686
- 35. Zmarz P, Keller GB. Mismatch receptive fields in mouse visual cortex. Neuron. 2016;92(4):766–772. pmid:27974161
- 36. Attinger A, Wang B, Keller GB. Visuomotor coupling shapes the functional development of mouse visual cortex. Cell. 2017;169(7):1291–1302. pmid:28602353
- 37. Homann J, Koay SA, Glidden AM, Tank DW, Berry MJ. Predictive coding of novel versus familiar stimuli in the primary visual cortex. BioRxiv. 2017; p. 197608.
- 38. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems. 2019;32:8026–8037.