On the relationship between predictive coding and backpropagation

Artificial neural networks are often interpreted as abstract models of biological neuronal networks, but they are typically trained using the biologically unrealistic backpropagation algorithm and its variants. Predictive coding has been proposed as a potentially more biologically realistic alternative to backpropagation for training neural networks. This manuscript reviews and extends recent work on the mathematical relationship between predictive coding and backpropagation for training feedforward artificial neural networks on supervised learning tasks. Implications of these results for the interpretation of predictive coding and deep neural networks as models of biological learning are discussed along with a repository of functions, Torch2PC, for performing predictive coding with PyTorch neural network models.


Introduction
The backpropagation algorithm and its variants are widely used to train artificial neural networks. While artificial and biological neural networks share some common features, a direct implementation of backpropagation in the brain is often considered biologically implausible, in part because of the nonlocal nature of parameter updates: the update to a parameter in one layer depends on activity in all deeper layers. In contrast, biological neural networks are believed to learn largely through local synaptic plasticity rules, for which changes to a synaptic weight depend on neural activity local to that synapse. While neuromodulators can have non-local effects on synaptic plasticity, they are not believed to be sufficiently specific to implement the precise, high-dimensional credit assignment required by backpropagation. However, some work has shown that global errors and neuromodulators can work with local plasticity to implement effective learning algorithms [1,2]. Backpropagation can be performed using local updates if gradients of neurons' activations are passed upstream through feedback connections, but this interpretation implies other biologically implausible properties of the network.

Results
A review of the relationship between backpropagation and predictive coding from previous work

For completeness, let us first review the backpropagation algorithm. Consider a feedforward deep neural network (DNN) defined by

v̂_ℓ = f_ℓ(v̂_{ℓ−1}; θ_ℓ),  ℓ = 1, …, L,   (1)

with v̂_0 = x, where each v̂_ℓ is a vector or tensor of activations, each θ_ℓ is a set of parameters for layer ℓ, and L is the network's depth. In supervised learning, one seeks to minimize a loss function L(ŷ, y) where y is a label associated with an input, x, and ŷ = f(x; θ) = v̂_L is the network's output, which depends on the parameters θ = {θ_ℓ}_{ℓ=1}^{L}. The loss is typically minimized using gradient-based optimization methods, with gradients computed by automatic differentiation tools based on the backpropagation algorithm. For completeness, backpropagation is reviewed in the pseudocode below.
Algorithm 1 A standard implementation of backpropagation.
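As a concrete illustration of Algorithm 1, the following minimal Python sketch runs backpropagation on a toy scalar chain network with f_ℓ(v; θ_ℓ) = tanh(θ_ℓ v) and the squared loss L(ŷ, y) = (ŷ − y)²/2. The network, activation function, and parameter values are illustrative assumptions, not the paper's convolutional models.

```python
import math

# Minimal sketch of Algorithm 1 on a toy scalar chain network with
# f_l(v; theta_l) = tanh(theta_l * v) and squared loss L = (yhat - y)^2 / 2.
# The network, activation, and parameter values are illustrative assumptions.

def forward(x, theta):
    """Forward pass: vhat[0] = x, vhat[l] = f_l(vhat[l-1]; theta_l)."""
    vhat = [x]
    for th in theta:
        vhat.append(math.tanh(th * vhat[-1]))
    return vhat

def backprop(x, y, theta):
    """Return the loss and the gradients dL/dtheta_l for l = 1, ..., L."""
    vhat = forward(x, theta)
    L = len(theta)
    loss = 0.5 * (vhat[L] - y) ** 2
    delta = vhat[L] - y                                       # delta_L = dL/dvhat_L
    grads = [0.0] * L
    for l in range(L, 0, -1):                                 # backward pass
        grads[l - 1] = delta * vhat[l - 1] * (1 - vhat[l] ** 2)  # dL/dtheta_l
        delta = delta * theta[l - 1] * (1 - vhat[l] ** 2)        # dL/dvhat_{l-1}
    return loss, grads
```

Gradients from the backward pass match centered finite differences on the loss, which is the standard sanity check for a backpropagation implementation.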
The negative gradients, dθ_ℓ, are then used to update parameters, either directly for stochastic gradient descent or indirectly for other gradient-based learning methods [24]. For the sake of comparison, I used backpropagation to train a 5-layer convolutional neural network on the MNIST data set (Figure 1A,B; blue curves). I next review algorithms derived from the theory of predictive coding and their relationship to backpropagation, as originally derived in previous work [11,12,14,13].
A strict interpretation of predictive coding does not accurately compute gradients.
I begin by reviewing supervised learning under a strict interpretation of predictive coding. The formulation in this section is equivalent to the one first studied by Whittington and Bogacz [11], except that their results are restricted to the case in which f_ℓ(v_{ℓ−1}; θ_ℓ) = θ_ℓ g_ℓ(v_{ℓ−1}) for some point-wise applied activation function, g_ℓ, and connectivity matrix, θ_ℓ. The formulation here extends theirs to arbitrary vector-valued differentiable functions, f_ℓ. For the sake of continuity with later sections, I also use the notational conventions from [12], which differ from those in [11].
Predictive coding can be derived from a hierarchical, Gaussian probabilistic model in which each layer, ℓ, is associated with a Gaussian random variable, V_ℓ, satisfying

p(V_ℓ = v_ℓ | V_{ℓ−1} = v_{ℓ−1}) = N(v_ℓ; f_ℓ(v_{ℓ−1}; θ_ℓ), Σ)   (2)

where N(v; µ, Σ) is the multivariate Gaussian distribution with mean, µ, and covariance matrix, Σ, evaluated at v. Following previous work [11,12,14,13], I take Σ = I to be the identity matrix, but later relax this assumption [21].
If we condition on an observed input, V_0 = x, then a forward pass through the network described by Eq. (1) corresponds to setting v̂_0 = x and then sequentially computing the conditional expectations

v̂_ℓ = E[V_ℓ | V_{ℓ−1} = v̂_{ℓ−1}] = f_ℓ(v̂_{ℓ−1}; θ_ℓ)

or, equivalently, maximizing conditional probabilities, until reaching an inferred output, ŷ = v̂_L. Note that this forward pass does not necessarily maximize the global conditional probability, p(V_L = ŷ | V_0 = x), and it does not account for a prior distribution on V_L, which arises in related work on predictive coding for unsupervised learning [15,21]. One interpretation of a forward pass is that each v̂_ℓ is the network's "belief" about the state of V_ℓ when only V_0 = x has been observed. Now suppose that we condition on both an observed input, V_0 = x, and its label, V_L = y. In this case, generating beliefs about the hidden states, V_ℓ, is more difficult because we need to account for potentially conflicting information at each end of the network. We can proceed by initializing a set of beliefs, v_ℓ, about the state of each V_ℓ, and then updating our initial beliefs to be more consistent with the observations, x and y, and parameters, θ_ℓ.
The error made by a set of beliefs, {v_ℓ}_{ℓ=1}^{L}, under parameters, {θ_ℓ}_{ℓ=1}^{L}, can be quantified by

ǫ_ℓ = f_ℓ(v_{ℓ−1}; θ_ℓ) − v_ℓ

for ℓ = 1, …, L − 1. It is not so simple to quantify the error, ǫ_L, made at the last layer in a way that accounts for arbitrary loss functions. Consider the special case of a squared-Euclidean loss function,

L(ŷ, y) = ‖ŷ − y‖²/2,

where ‖u‖² = uᵀu. Standard formulations of predictive coding [21,20] use

ǫ_L = ṽ_L − y   (3)

where recall that y is the label. In this case, ǫ_L satisfies

ǫ_L = ∂L(ṽ_L, y)/∂ṽ_L   (4)

where ṽ_L = f_L(v_{L−1}; θ_L). We use the ˜ to emphasize that ṽ_L is different from v̂_L (which is defined by a forward pass starting at v̂_0 = x) and is defined in a fundamentally different way from the v_ℓ terms (which do not necessarily satisfy v_ℓ = f_ℓ(v_{ℓ−1}; θ_ℓ)). We can then define the total summed magnitude of errors as

F = (1/2) Σ_{ℓ=1}^{L} ‖ǫ_ℓ‖².

More details on the derivation of F in terms of variational Bayesian inference can be found in previous work [16,21,20,12], where F is known as the variational free energy of the model. Essentially, minimizing F produces a model that is more consistent with the observed data. Minimizing F by gradient descent on v_ℓ and θ_ℓ produces the inference and learning steps of predictive coding, respectively.
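For intuition, F can be computed directly on a toy scalar chain network with f_ℓ(v; θ_ℓ) = tanh(θ_ℓ v) and a squared loss (an illustrative sketch with assumed values, not the paper's models). When the beliefs equal the forward-pass activations, every hidden-layer error vanishes and F reduces to the loss itself:

```python
import math

# Sketch of the free energy F = (1/2) sum_l ||eps_l||^2 on a toy scalar
# chain network with f_l(v; theta_l) = tanh(theta_l * v) and squared loss.
# All numerical values are illustrative assumptions.

def free_energy(x, y, v_hidden, theta):
    """v_hidden = [v_1, ..., v_{L-1}] are beliefs about the hidden layers."""
    v = [x] + list(v_hidden)
    L = len(theta)
    F = 0.0
    for l in range(1, L):
        eps_l = math.tanh(theta[l - 1] * v[l - 1]) - v[l]   # eps_l = f_l(v_{l-1}) - v_l
        F += 0.5 * eps_l ** 2
    vtilde_L = math.tanh(theta[L - 1] * v[L - 1])           # vtilde_L = f_L(v_{L-1})
    F += 0.5 * (vtilde_L - y) ** 2                          # eps_L = vtilde_L - y
    return F
```

With beliefs taken from a forward pass, the hidden errors are zero, so F equals the loss L(ŷ, y); beliefs that better reconcile the input with the label yield smaller F.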
Under a more heuristic interpretation, v_ℓ represents the network's "belief" about V_ℓ, and f_ℓ(v_{ℓ−1}; θ_ℓ) is the "prediction" of v_ℓ made by the previous layer. Under this interpretation, ǫ_ℓ is the error made by the previous layer's prediction, so ǫ_ℓ is called a "prediction error." Then F quantifies the total magnitude of prediction errors given a set of beliefs, v_ℓ, parameters, θ_ℓ, and observations, V_0 = x and V_L = y.
In predictive coding, beliefs, v_ℓ, are updated to minimize the error, F. This can be achieved by gradient descent, i.e., by making updates of the form

v_ℓ ← v_ℓ + η dv_ℓ

where η is a step size and

dv_ℓ = −∂F/∂v_ℓ = ǫ_ℓ − ǫ_{ℓ+1} ∂f_{ℓ+1}(v_ℓ; θ_{ℓ+1})/∂v_ℓ.   (5)

In this expression, ∂f_{ℓ+1}(v_ℓ; θ_{ℓ+1})/∂v_ℓ is a Jacobian matrix and ǫ_{ℓ+1} is a row-vector to simplify notation, but a column-vector interpretation is similar. If x is a mini-batch of m data points instead of one data point, then v_ℓ is an m × n_ℓ matrix and derivatives are tensors. These conventions are used throughout the manuscript. The updates in Eq. (5) can be iterated until convergence or approximate convergence. Note that the prediction errors, ǫ_ℓ = f_ℓ(v_{ℓ−1}; θ_ℓ) − v_ℓ, should also be updated on each iteration. Learning can also be phrased as minimizing F with gradient descent on parameters. Specifically,

dθ_ℓ = −∂F/∂θ_ℓ = −ǫ_ℓ ∂f_ℓ(v_{ℓ−1}; θ_ℓ)/∂θ_ℓ.   (6)

Note that some previous work uses the negative of the prediction errors used here, i.e., ǫ_ℓ = v_ℓ − f_ℓ(v_{ℓ−1}; θ_ℓ). While this choice changes some of the expressions above, the value of F and its dependence on θ_ℓ is not changed because F is defined by the norms of the ǫ_ℓ terms. The complete algorithm is defined more precisely by the pseudocode below:

Algorithm 2 A direct interpretation of predictive coding.
Given: Input (x), label (y), and initial beliefs (v_ℓ)
# error and belief computation

Here and elsewhere, n denotes the number of iterations for the inference step. The choice of initial beliefs is not specified in the algorithm above, but previous work [11,12,14,13] uses the results from a forward pass, v_ℓ = v̂_ℓ, as initial conditions, and I do the same in all numerical examples. I tested Algorithm 2 on MNIST using a 5-layer convolutional neural network. To be consistent with the definitions above, I used a mean-squared error (squared Euclidean) loss function, which required one-hot encoded labels [24]. Algorithm 2 performed similarly to backpropagation (Fig. 1A,B) even though the parameter updates did not match the true gradients (Fig. 1C,D). Algorithm 2 was slower than backpropagation (31s for Algorithm 2 versus 8s for backpropagation when training metrics were not computed on every iteration), in part because Algorithm 2 requires several inner iterations to compute the prediction errors (n = 20 iterations used in this example). Algorithm 2 failed to converge on a larger model. Specifically, the loss grew consistently with iterations when trying to use Algorithm 2 to train the 6-layer CIFAR-10 model described in the next section.

Fig 2. Relative error and angle between dθ_ℓ produced by predictive coding (Algorithm 2) as compared to the exact gradients, ∂L/∂θ_ℓ, computed by backpropagation (relative error defined by ‖dθ_pc − dθ_bp‖/‖dθ_bp‖). Updates were computed as a function of the number of iterations, n, used in Algorithm 2 for various values of the step size, η, using the model from Fig. 1 applied to one mini-batch of data. Both models were initialized identically to the pre-trained parameter values from the trained model in Fig. 1. Parameter updates converge near the gradients after many iterations for smaller values of η, but diverge for larger values.
Fig. 1C,D shows that predictive coding does not update parameters according to the true gradients, but it is not immediately clear whether this would be resolved by using more iterations (larger n) or different values of the step size, η. I next compared the parameter updates, dθ_ℓ, to the true gradients, ∂L/∂θ_ℓ, for different values of n and η (Fig. 2). For the smaller values of η tested (η = 0.1 and η = 0.2) and larger values of n (n > 100), parameter updates were similar to the true gradients in the last two layers, but they differed substantially in the first two layers. The largest values of η tested (η = 0.5 and η = 1) caused the iterations in Algorithm 2 to diverge. Some choices in designing Algorithm 2 were made arbitrarily. For example, the three updates inside the inner for-loop over ℓ could be performed in a different order, or the outer for-loop over i could be changed to a while-loop with a convergence criterion. For any initial conditions and any of these design choices, if the iterations over i are repeated until convergence or approximate convergence of each v_ℓ to a fixed point, v*_ℓ, then the increments must satisfy dv_ℓ = 0 at the fixed point, and therefore the fixed point values of the prediction errors, ǫ*_ℓ, must satisfy

ǫ*_ℓ = ǫ*_{ℓ+1} ∂f_{ℓ+1}(v*_ℓ; θ_{ℓ+1})/∂v*_ℓ   (7)

for ℓ = 1, …, L − 1. By the definition of ǫ_L, we have

ǫ*_L = ∂L(ṽ*_L, y)/∂ṽ*_L   (8)

where ṽ*_L = f_L(v*_{L−1}; θ_L). Combining Eqs. (7) and (8) gives the fixed point prediction errors of the penultimate layer,

ǫ*_{L−1} = ∂L(ṽ*_L, y)/∂ṽ*_L · ∂f_L(v*_{L−1}; θ_L)/∂v*_{L−1} = ∂L(ṽ*_L, y)/∂v*_{L−1}   (9)

where we used the fact that ṽ*_L = f_L(v*_{L−1}; θ_L) and the chain rule. The error in layer L − 2 is given by

ǫ*_{L−2} = ∂L(ṽ*_L, y)/∂v*_{L−1} · ∂f_{L−1}(v*_{L−2}; θ_{L−1})/∂v*_{L−2}.   (10)

Note that we cannot apply the chain rule to reduce this product (like we did for Eq. (9)) because it is not necessarily true that v*_{L−1} = f_{L−1}(v*_{L−2}; θ_{L−1}). I revisit this point below. We can continue this process to derive the errors in shallower layers for ℓ = L − 3, L − 4, …, 1. In doing so, we see (by induction) that ǫ*_ℓ can be written as

ǫ*_ℓ = ∂L(ṽ*_L, y)/∂v*_{L−1} · ∂f_{L−1}(v*_{L−2}; θ_{L−1})/∂v*_{L−2} ⋯ ∂f_{ℓ+1}(v*_ℓ; θ_{ℓ+1})/∂v*_ℓ   (11)

for ℓ = 1, …, L − 2.
Therefore, if the inference loop converges to a fixed point, then the subsequent parameter update obeys

dθ_ℓ = −ǫ*_ℓ ∂f_ℓ(v*_{ℓ−1}; θ_ℓ)/∂θ_ℓ

by Eq. (6). It is not clear whether there is a simple mathematical relationship between these parameter updates and the negative gradients, dθ_ℓ = −∂L/∂θ_ℓ, computed by backpropagation.
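These fixed-point relations can be checked numerically. The sketch below runs the inference loop of Algorithm 2 on a toy scalar chain network (f_ℓ(v; θ_ℓ) = tanh(θ_ℓ v), squared loss; the network, hyperparameters, and values are illustrative assumptions) and verifies that the converged errors satisfy ǫ*_ℓ = ǫ*_{ℓ+1} ∂f_{ℓ+1}(v*_ℓ; θ_{ℓ+1})/∂v_ℓ rather than the backpropagated gradients:

```python
import math

# Sketch of Algorithm 2 (strict predictive coding) on a toy scalar chain
# network with f_l(v; theta_l) = tanh(theta_l * v) and squared loss.
# Network size, parameters, and hyperparameters are illustrative assumptions.

def strict_pc(x, y, theta, eta=0.1, n=2000):
    """Run the inference loop, then return (dtheta, beliefs v, errors eps)."""
    L = len(theta)
    v = [x]                                   # beliefs initialized by a forward pass
    for th in theta:
        v.append(math.tanh(th * v[-1]))
    eps = [0.0] * (L + 1)
    for _ in range(n):
        # recompute prediction errors from the current beliefs
        for l in range(1, L):
            eps[l] = math.tanh(theta[l - 1] * v[l - 1]) - v[l]
        vtilde = math.tanh(theta[L - 1] * v[L - 1])
        eps[L] = vtilde - y                   # eps_L = dL/dvtilde_L for squared loss
        # gradient descent on F with respect to the hidden beliefs
        for l in range(1, L):
            jac = theta[l] * (1 - math.tanh(theta[l] * v[l]) ** 2)  # df_{l+1}/dv_l
            v[l] += eta * (eps[l] - eps[l + 1] * jac)
    dtheta = [-eps[l] * v[l - 1] * (1 - math.tanh(theta[l - 1] * v[l - 1]) ** 2)
              for l in range(1, L + 1)]       # dtheta_l = -eps_l * df_l/dtheta_l
    return dtheta, v, eps
```

In this toy setting the converged beliefs differ from the forward-pass activations, so the first-layer update differs noticeably from the true negative gradient, consistent with the analysis above.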
It is tempting to assume that v*_ℓ = f_ℓ(v*_{ℓ−1}; θ_ℓ), in which case the product terms would be reduced by the chain rule. Indeed, this assumption would imply that v*_ℓ = v̂_ℓ and ṽ*_L = v̂_L and, finally, that ǫ*_ℓ = ∂L/∂v̂_ℓ and dθ_ℓ = −∂L/∂θ_ℓ, identical to the values computed by backpropagation. However, we cannot generally expect to have v*_ℓ = f_ℓ(v*_{ℓ−1}; θ_ℓ) because this would imply that ǫ*_ℓ = 0 and therefore ∂L/∂v*_ℓ = ∂L/∂θ_ℓ = 0. In other words, Algorithm 2 is only equivalent to backpropagation in the case where parameters are at a critical point of the loss function, so all updates are zero. Nevertheless, this thought experiment suggests a modification to Algorithm 2 for which the fixed points do represent the true gradients [12,11]. I review that modification in the next section.
Note also that the calculations above rely on the assumption of a squared-Euclidean loss function, L(ŷ, y) = ‖ŷ − y‖²/2. If we want to generalize the algorithm to different loss functions, then Eqs. (3) and (4) could not both be true, and therefore Eqs. (7) and (8) could not both be true. This leaves open the question of how to define ǫ_L when using loss functions that are not proportional to the squared Euclidean norm. If we were to define ǫ_L by Eq. (3), at the expense of losing Eq. (4), then the algorithm would not account for the loss function at all, so it would effectively assume a Euclidean loss, i.e., it would compute the same values that are computed by Algorithm 2 with a Euclidean loss. If we were instead to define ǫ_L by Eq. (4) at the expense of Eq. (3), then Eqs. (5) and (7) would no longer be true for ℓ = L − 1 and Eq. (6) would no longer be true for ℓ = L. Instead, all three of these equations would involve second-order derivatives of the loss function, and therefore the fixed point equations (10) and (11) would also involve second-order derivatives. The interpretation of the parameter updates is not clear in this case. One might instead try to define ǫ_L by the result of a forward pass, but then ǫ_L would be a constant with respect to v_{L−1}, so we would have ∂ǫ_L/∂v_{L−1} = 0, and therefore Eq. (5) at ℓ = L − 1 would become

dv_{L−1} = ǫ_{L−1},

which has a fixed point at ǫ*_{L−1} = 0. This would finally imply that all the errors converge to ǫ*_ℓ = 0 and therefore dθ_ℓ = 0 at the fixed point.

Fig 3. The accuracy of predictive coding with the fixed prediction assumption is similar to backpropagation, but the parameter updates are less similar for these hyperparameters.
I next discuss a modification of Algorithm 2 that converges to the same gradients computed by backpropagation, and is applicable to general loss functions [12,11].
Predictive coding modified by the fixed prediction assumption converges to the gradients computed by backpropagation.
Previous work [11,12] proposed a modification of the predictive coding algorithm described above, called the "fixed prediction assumption," which I now review. Motivated by the considerations in the last few paragraphs of the previous section, we can selectively substitute some terms of the form v_ℓ and f_ℓ(v_{ℓ−1}; θ_ℓ) in Algorithm 2 with v̂_ℓ (or, equivalently, f_ℓ(v̂_{ℓ−1}; θ_ℓ)), where the v̂_ℓ are the results of the original forward pass starting from v̂_0 = x. Specifically, for ℓ = 1, …, L − 1, the predictions in Algorithm 2 are fixed at the forward-pass values, so that the errors become ǫ_ℓ = v̂_ℓ − v_ℓ and the Jacobians are evaluated at v̂_ℓ. This modification can be interpreted as "fixing" the predictions at the values computed by a forward pass and is therefore called the "fixed prediction assumption" [11,12]. Additionally, the initial conditions of the beliefs are set to the results from a forward pass, v_ℓ = v̂_ℓ for ℓ = 1, …, L − 1. The complete modified algorithm is defined by the pseudocode below:

Algorithm 3 Supervised learning with predictive coding modified by the fixed prediction assumption. Adapted from the algorithm in [12] and similar to the algorithm from [11].
Given: Input (x) and label (y)

Note, again, that some choices in Algorithm 3 were made arbitrarily. The three updates inside the inner for-loop over ℓ could be performed in a different order, or the outer for-loop over i could be changed to a while-loop with a convergence criterion. Regardless of these choices, the fixed points, ǫ*_ℓ, can again be computed by setting dv_ℓ = 0 to obtain

ǫ*_ℓ = ǫ*_{ℓ+1} ∂f_{ℓ+1}(v̂_ℓ; θ_{ℓ+1})/∂v̂_ℓ  and  ǫ*_L = ∂L(ŷ, y)/∂ŷ,

and we can combine these two equations to compute

ǫ*_{L−1} = ∂L(ŷ, y)/∂ŷ · ∂f_L(v̂_{L−1}; θ_L)/∂v̂_{L−1} = ∂L(ŷ, y)/∂v̂_{L−1}

where we used the chain rule and the fact that v̂_ℓ = f_ℓ(v̂_{ℓ−1}; θ_ℓ). Continuing this approach, we have, for all ℓ = 1, …, L,

ǫ*_ℓ = ∂L(ŷ, y)/∂v̂_ℓ

(where recall that ŷ = v̂_L is the output from the feedforward pass). Combining this with the modified definition of dθ_ℓ, we have

dθ_ℓ = −ǫ*_ℓ ∂f_ℓ(v̂_{ℓ−1}; θ_ℓ)/∂θ_ℓ = −∂L/∂θ_ℓ

where we use the chain rule and the fact that v̂_ℓ = f_ℓ(v̂_{ℓ−1}; θ_ℓ). We may conclude that, if the inference step converges to a fixed point (dv_ℓ = 0), then Algorithm 3 computes the same values of dθ_ℓ as backpropagation, and also that the prediction errors, ǫ_ℓ, converge to the gradients, δ_ℓ = ∂L/∂v̂_ℓ, computed by backpropagation. As long as the inference step approximately converges to a fixed point (dv_ℓ ≈ 0), we should expect the parameter updates from Algorithm 3 to approximate those computed by backpropagation. In the next section, I extend this result to show that a special case of the algorithm computes the true gradients in a fixed number of steps.

I next tested Algorithm 3 on MNIST using the same 5-layer convolutional neural network considered above. I used a cross-entropy loss function, but otherwise used all of the same parameters used to test Algorithm 2 in Fig. 1. The modified predictive coding algorithm (Algorithm 3) performed similarly to backpropagation in terms of the loss and accuracy (Fig. 3A,B). Parameter updates computed by Algorithm 3 did not match the true gradients, but pointed in a similar direction and provided a closer match than Algorithm 2 (compare Fig. 3C,D with Fig. 1C,D). I next compared the parameter updates computed by Algorithm 3 to the true gradients for different values of n and η (Fig. 4). When η < 1, the parameter updates, dθ_ℓ, appeared to converge, but did not converge exactly to the true gradients. This is likely due to numerical floating point errors accumulated over iterations. When η = 1, the parameter updates at each layer remained constant for the first few iterations, then immediately jumped to become very near the updates from backpropagation. In the next section, I provide a mathematical analysis of this behavior and show that when η = 1, Algorithm 3 computes the true gradients in a fixed number of steps.

Fig 4. Relative error and angle between dθ produced by predictive coding modified by the fixed prediction assumption (Algorithm 3) as compared to the exact gradients computed by backpropagation (relative error defined by ‖dθ_pc − dθ_bp‖/‖dθ_bp‖). Updates were computed as a function of the number of iterations, n, used in Algorithm 3 for various values of the step size, η, using the model from Fig. 3 applied to one mini-batch of data. Both models were initialized identically to the pre-trained parameter values from the backpropagation-trained model in Fig. 3. In the rightmost panels, some lines are not visible where they overlap at zero. Parameter updates quickly converge to the true gradients when η is larger.
To see how well these results extend to a larger model and more difficult benchmark, I next tested Algorithm 3 on CIFAR-10 [25] using a six-layer convolutional network. While the network only had one more layer than the MNIST network used above, it had roughly 141 times more parameters (32,695 trainable parameters in the MNIST model versus 4,633,738 in the CIFAR-10 model). Algorithm 3 performed similarly to backpropagation in terms of loss and accuracy during learning (Fig. 5A,B) and produced parameter updates that pointed in a similar direction, but still did not match the true gradients (Fig. 5C,D). Algorithm 3 was substantially slower than backpropagation (848s for Algorithm 3 versus 58s for backpropagation when training metrics were not computed on every iteration).
Predictive coding modified by the fixed prediction assumption using a step size of η = 1 computes exact gradients in a fixed number of steps.
A major disadvantage of the approach outlined above, when compared to standard backpropagation, is that it requires iterative updates to v_ℓ and ǫ_ℓ. Indeed, previous work [12] used n = 100–200 iterations, leading to substantially slower performance compared to standard backpropagation. Other work [11] used n = 20 iterations, as above. In general, there is a tradeoff between accuracy and performance when choosing n, as demonstrated in Fig. 4. However, more recent work [13,14] showed that, under the fixed prediction assumption, predictive coding can compute the exact same gradients as backpropagation in a fixed number of steps. That work used a more specific formulation of the neural network, which can implement fully connected layers, convolutional layers, and recurrent layers. They also used an unconventional interpretation of neural networks in which weights are multiplied outside the activation function, i.e., f_ℓ(x; θ_ℓ) = θ_ℓ g_ℓ(x), and inputs are fed into the last layer instead of the first. Next, I show that their result holds for arbitrary feedforward neural networks as formulated in Eq. (1) (with arbitrary functions, f_ℓ), and that this result has a simple interpretation in terms of Algorithm 3.
Specifically, the following theorem shows that taking a step size of η = 1 yields an exact computation of gradients using just n = L iterations (where L is the depth of the network).

Theorem 1. If Algorithm 3 is applied with step size η = 1 and n = L iterations, then its prediction errors satisfy ǫ_ℓ = ∂L(v̂_L, y)/∂v̂_ℓ and its parameter updates satisfy dθ_ℓ = −∂L/∂θ_ℓ, i.e., it computes the same parameter updates as backpropagation.

Proof. For the sake of notational simplicity within this proof, define δ_ℓ = ∂L(v̂_L, y)/∂v̂_ℓ. We first need to prove that ǫ_ℓ = δ_ℓ. First, rewrite the inside of the error and belief loop from Algorithm 3 while explicitly keeping track of the iteration number in which each variable was updated,

ǫ_ℓ^i = v̂_ℓ − v_ℓ^{i−1}
dv_ℓ^i = ǫ_ℓ^i − ǫ_{ℓ+1}^i ∂f_{ℓ+1}(v̂_ℓ; θ_{ℓ+1})/∂v̂_ℓ
v_ℓ^i = v_ℓ^{i−1} + η dv_ℓ^i.

Here, v_ℓ^i, ǫ_ℓ^i, and dv_ℓ^i denote the values of v_ℓ, ǫ_ℓ, and dv_ℓ, respectively, at the end of the ith iteration, v_ℓ^0 = v̂_ℓ corresponds to the initial value, and all terms without superscripts are constant inside the inference loop. There are some subtleties here. For example, we have v_ℓ^{i−1} in the first line because v_ℓ is updated after ǫ_ℓ in the loop. More subtly, we have ǫ_{ℓ+1}^i in the second equation instead of ǫ_{ℓ+1}^{i−1} because the for-loop goes backwards from ℓ = L − 1 to ℓ = 1, so ǫ_{ℓ+1} is updated before ǫ_ℓ. First note that ǫ_ℓ^1 = 0 for ℓ = 1, …, L − 1 because v_ℓ^0 = v̂_ℓ. Now compute the change in ǫ_ℓ across one step,

ǫ_ℓ^{i+1} − ǫ_ℓ^i = −(v_ℓ^i − v_ℓ^{i−1}) = −η dv_ℓ^i.

Note that this equation is only valid for i ≥ 1 due to the i − 1 term (v_ℓ^{−1} is not defined). Adding ǫ_ℓ^i to both sides of the resulting equation and setting η = 1 gives

ǫ_ℓ^{i+1} = ǫ_{ℓ+1}^i ∂f_{ℓ+1}(v̂_ℓ; θ_{ℓ+1})/∂v̂_ℓ.

We now use induction to prove that ǫ_ℓ = δ_ℓ after n = L iterations. Indeed, we prove a stronger claim: that ǫ_ℓ^i = δ_ℓ at i = L − ℓ + 1. First note that ǫ_L^i = δ_L for all i because ǫ_L^i is initialized to δ_L and then never changed. Therefore, our claim is true for the base case ℓ = L. Now suppose that ǫ_{ℓ+1}^i = δ_{ℓ+1} for i = L − (ℓ + 1) + 1 = L − ℓ. We need to show that ǫ_ℓ^{i+1} = δ_ℓ. From above, we have

ǫ_ℓ^{i+1} = ǫ_{ℓ+1}^i ∂f_{ℓ+1}(v̂_ℓ; θ_{ℓ+1})/∂v̂_ℓ = δ_{ℓ+1} ∂f_{ℓ+1}(v̂_ℓ; θ_{ℓ+1})/∂v̂_ℓ = δ_ℓ.

This completes our induction argument. It follows that ǫ_ℓ^i = δ_ℓ at iteration i = L − ℓ + 1 at all layers ℓ = 1, …, L. The last layer to be updated to the correct value is ℓ = 1, which is updated on iteration number i = L − 1 + 1 = L. Hence, ǫ_ℓ = δ_ℓ for all ℓ = 1, …, L after n = L iterations. This proves the first statement in the theorem. The second statement then follows from the definition of dθ_ℓ,

dθ_ℓ = −ǫ_ℓ ∂f_ℓ(v̂_{ℓ−1}; θ_ℓ)/∂θ_ℓ = −δ_ℓ ∂f_ℓ(v̂_{ℓ−1}; θ_ℓ)/∂θ_ℓ = −∂L/∂θ_ℓ.

This completes the proof.
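Theorem 1 can be checked numerically. The sketch below runs the belief/error loop of Algorithm 3 on a toy scalar chain network (f_ℓ(v; θ_ℓ) = tanh(θ_ℓ v), squared loss; the network, values, and the sign convention ǫ_ℓ = v̂_ℓ − v_ℓ are illustrative assumptions, not the paper's convolutional models) and confirms that, with η = 1 and n = L, the parameter updates equal the negative gradients:

```python
import math

# Numerical check of Theorem 1: Algorithm 3 (predictive coding with the
# fixed prediction assumption) with eta = 1 and n = L iterations computes
# the exact gradients. Toy scalar chain network with f_l(v) = tanh(theta_l*v)
# and squared loss; all values are illustrative assumptions.

def pc_fixed_prediction(x, y, theta, eta=1.0, n=None):
    L = len(theta)
    n = L if n is None else n
    vhat = [x]                               # forward pass
    for th in theta:
        vhat.append(math.tanh(th * vhat[-1]))
    v = vhat[:]                              # beliefs initialized to forward pass
    eps = [0.0] * (L + 1)
    eps[L] = vhat[L] - y                     # eps_L = dL/dyhat, held fixed
    for _ in range(n):
        for l in range(L - 1, 0, -1):        # backward sweep over hidden layers
            eps[l] = vhat[l] - v[l]          # prediction "fixed" at vhat_l
            jac = theta[l] * (1 - vhat[l + 1] ** 2)   # df_{l+1}/dv at vhat_l
            v[l] += eta * (eps[l] - eps[l + 1] * jac)
    for l in range(L - 1, 0, -1):            # recompute errors from final beliefs
        eps[l] = vhat[l] - v[l]
    dtheta = [-eps[l] * vhat[l - 1] * (1 - vhat[l] ** 2) for l in range(1, L + 1)]
    return dtheta, eps
```

Centered finite differences on the loss confirm dθ_ℓ = −∂L/∂θ_ℓ for every layer, matching the theorem.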
This theorem ties together the implementation and formulation of predictive coding from [12] (i.e., Algorithm 3) and the results in [14,13]. As noted in [14,13], this result depends critically on the assumption that the values of v_ℓ are initialized to the activations from a forward pass, v_ℓ = v̂_ℓ. The theoretical predictions from Theorem 1 are confirmed by the fact that all of the errors in the rightmost panels of Fig. 4 converge to zero after n = L = 5 iterations.
To further test the result empirically, I repeated the simulations from Figs. 3 and 5 using η = 1 and n = L (Figs. 6A,B and 7A,B). More importantly, the parameter updates closely matched the true gradients (Figs. 6C,D and 7C,D), as predicted by Theorem 1. The differences between predictive coding and backpropagation in Fig. 6 were due to floating point errors and the nondeterminism of computations performed on GPUs. For example, similar differences to those seen in Fig. 6A,B were present when the same training algorithm was run twice with the same random seed. The smaller number of iterations (n = L in Figs. 6 and 7 versus n = 20 in Figs. 3 and 5) resulted in faster training. In summary, a review of the literature shows that a strict interpretation of predictive coding (Algorithm 2) does not converge to the true gradients computed by backpropagation. To compute the true gradients, predictive coding must be modified by the fixed prediction assumption (Algorithm 3). Further, I proved that Algorithm 3 computes the exact gradients when η = 1 and n ≥ L, which ties together results from previous work [12,14,13].
Predictive coding with the fixed prediction assumption and η = 1 is functionally equivalent to a direct implementation of backpropagation.
The proof of Theorem 1 and the last panel of Fig. 4 give some insight into how Algorithm 3 works. First note that the values of v_ℓ in Algorithm 3 are only used to compute the values of ǫ_ℓ and are not otherwise used in the computation of dθ_ℓ or any other quantities. Therefore, if we only care about understanding parameter updates, dθ_ℓ, we can ignore the values of v_ℓ and only focus on how ǫ_ℓ is updated on each iteration, i. Secondly, note that when η = 1, each ǫ_ℓ is updated to a non-zero value only once: on iteration i = L − ℓ + 1, it jumps from zero to

ǫ_ℓ = ǫ_{ℓ+1} ∂f_{ℓ+1}(v̂_ℓ; θ_{ℓ+1})/∂v̂_ℓ.

In other words, the error computation in Algorithm 3 when η = 1 and n = L is equivalent to performing this single update at one value of ℓ on each iteration. This formulation makes it clear that the nested loops are unnecessary because, for each value of i, ǫ_ℓ is only updated at one value of ℓ. Therefore, the nested loops and if-statement can be replaced by a single for-loop. Specifically, the error computation in Algorithm 3 when η = 1 is equivalent to

for ℓ = L − 1, …, 1:  ǫ_ℓ = ǫ_{ℓ+1} ∂f_{ℓ+1}(v̂_ℓ; θ_{ℓ+1})/∂v̂_ℓ.

This is exactly the error computation from the standard backpropagation algorithm, i.e., Algorithm 1.
Hence, if we use η = 1, then Algorithm 3 is just backpropagation with extra steps, and these extra steps do not compute any non-zero values. If we additionally want to compute the fixed point beliefs, then they can still be computed using the relationship v*_ℓ = v̂_ℓ − ǫ_ℓ. We may conclude that, when η = 1, Algorithm 3 can be replaced by an exact implementation of backpropagation without any effect on the results or effective implementation of the algorithm. This raises the question of whether predictive coding with the fixed prediction assumption should be considered any more biologically plausible than a direct implementation of backpropagation.
Accounting for covariance or precision matrices in hidden layers does not affect learning under the fixed prediction assumption.
Above, I showed that predictive coding with the fixed prediction assumption is functionally equivalent to backpropagation. However, the predictive coding algorithm was derived under an assumption that covariance matrices in the probabilistic model are identity matrices, Σ_ℓ = I. This raises the question of whether relaxing this assumption could generalize backpropagation to account for the covariances, as suggested in previous work [11,12,26]. We can account for covariances by returning to the calculations starting from the probabilistic model in Eq. (2) and omitting the assumption that Σ_ℓ = I. To this end, it is helpful to define the precision-weighted prediction errors [21,20,26],

ǫ̃_ℓ = ǫ_ℓ P_ℓ

for ℓ = 1, …, L − 1, where P_ℓ = Σ_ℓ^{−1} is the inverse of the covariance matrix of V_ℓ, which is called the "precision matrix." Recall that we treat ǫ_ℓ as a row-vector, which explains the right-multiplication in this definition.
Modifying the definition of ǫ_L to account for covariances is not so simple because the Gaussian model for V_ℓ is not justified for non-Euclidean loss functions, such as categorical loss functions. Moreover, it is not clear how to define the covariance or precision matrix of the output layer when labels are observed. As such, I restrict the analysis to precision matrices in hidden layers only, and leave the question of accounting for covariances in the output layer for future work, with some comments on the issue provided at the end of this section. To this end, let us not modify the last layer's precision and instead define

ǫ̃_L = ǫ_L = ∂L(ŷ, y)/∂ŷ.

The free energy is then defined as [21,20]

F = (1/2) Σ_{ℓ=1}^{L−1} ǫ_ℓ P_ℓ ǫ_ℓᵀ + (1/2) ‖ǫ_L‖².

Performing gradient descent on F with respect to v_ℓ therefore gives

dv_ℓ = ǫ̃_ℓ − ǫ̃_{ℓ+1} ∂f_{ℓ+1}(v_ℓ; θ_{ℓ+1})/∂v_ℓ

and performing gradient descent on F with respect to θ_ℓ gives

dθ_ℓ = −ǫ̃_ℓ ∂f_ℓ(v_{ℓ−1}; θ_ℓ)/∂θ_ℓ.

These expressions are identical to Eqs. (5) and (6) derived above, except that ǫ̃_ℓ takes the place of ǫ_ℓ. The precision matrices themselves can be learned by performing gradient descent on F with respect to P_ℓ or, as suggested in other work [21], by parameterizing the model in terms of Σ_ℓ = P_ℓ^{−1} and performing gradient descent with respect to Σ_ℓ. Alternatively, one could use techniques from the literature on Gaussian graphical models to learn a sparse or low-rank representation of P_ℓ. I circumvent the question of estimating P_ℓ by instead asking how an estimate of P_ℓ (however it is obtained) would affect learning. I do assume that P_ℓ is symmetric. I also simplify the calculations by restricting the analysis to predictive coding with the fixed prediction assumption, leaving the analysis of fixed point prediction errors and parameter updates under strict predictive coding with precision matrices for future work. Some analysis has been performed in this direction [21], but not for the supervised learning scenario considered here.
Putting this together, predictive coding under the fixed prediction assumption, while accounting for precision matrices in hidden layers, is defined by the same equations as Eqs. (12) except that the precision-weighted prediction errors ε̃ℓ appear in place of εℓ = vℓ − v̂ℓ. Following the same line of reasoning, therefore, if the updates to vℓ are repeated until convergence, then the fixed point precision-weighted prediction errors satisfy the same equation derived for εℓ under the fixed prediction assumption with Σℓ = I. Fixed point precision-weighted prediction errors are therefore the same, and the parameter updates are the same as well. In conclusion, accounting for precision matrices in hidden layers does not affect learning under the fixed prediction assumption: fixed point parameter updates are still the same as those computed by backpropagation. This conclusion is independent of how the precision matrices are estimated, but it does rely on the assumption that fixed points for vℓ exist and are unique. Above, we only considered precision matrices in the hidden layers because accounting for precision matrices in the output layer is problematic for general loss functions: the use of a precision matrix in the output layer implies a Gaussian model for the output layer and labels, which is inconsistent with some types of labels and loss functions. If we focus on the case of a squared-Euclidean loss function, then the use of precision matrices in the output layer is more parsimonious, and we can modify the definition of the output-layer error accordingly (recalling that ŷ = v̂L). Following the same calculations as above gives fixed points at which the weight updates equal the gradients of the loss re-weighted by the output-layer precision matrix. Hence, accounting for precision matrices at the output layer can affect learning by re-weighting the gradient of the loss function according to the precision matrix of the output layer. Note that the precision matrices of the hidden layers still have no effect on learning in this case. Previous work relates the inclusion of the precision matrix in output layers to the use of natural gradients [27,26].
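As a sketch of the squared-Euclidean case just described, the output-layer result can be written in LaTeX as follows. This is a reconstruction consistent with the surrounding derivation, not a verbatim copy of the original display equations; notation follows the surrounding text.

```latex
% Squared-Euclidean loss with an output precision matrix \Sigma_L^{-1}.
% Output-layer prediction error at the fixed point:
\epsilon_L = \Sigma_L^{-1}\,\bigl(y - \hat v_L\bigr)
% Fixed-point parameter updates: the backpropagation update with the
% loss gradient re-weighted by the output-layer precision matrix:
d\theta_\ell =
\left(\frac{\partial \hat v_L}{\partial \theta_\ell}\right)^{\!\top}
\Sigma_L^{-1}\,\bigl(y - \hat v_L\bigr)
```

With Σ_L = I this reduces to the unweighted backpropagation update, matching the hidden-layer result above.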
Prediction errors do not necessarily represent surprising or unexpected features of inputs.
Deep neural networks are often interpreted as abstract models of cortical neuronal networks. To this end, the activations of units in deep neural networks are compared to the activity (typically firing rates) of cortical neurons [28,29,3]. This approach ignores the representation of errors within the network. More generally, the activations in one particular layer of a feedforward deep neural network contain no information about the activations of deeper layers, the label, or the loss. On the other hand, the activity of cortical neurons can be modulated by downstream activity and by information believed to be passed upstream by feedback projections. Predictive coding provides a precise model of the information that deeper layers send to shallower layers, specifically prediction errors. Under the fixed prediction assumption (Algorithm 3), prediction errors in a particular layer are approximated by the gradients of the loss with respect to that layer's activations, εℓ = δℓ = ∂L/∂v̂ℓ, but under a strict interpretation of predictive coding (Algorithm 2), prediction errors do not necessarily reflect gradients. We next empirically explored how the representations of images differ between the activations from a feedforward pass, v̂ℓ; the prediction errors under the fixed prediction assumption, εℓ = δℓ; and the beliefs, vℓ, and prediction errors, εℓ, under a strict interpretation of predictive coding (Algorithm 2). To do so, we computed each quantity in VGG-19 [30], a large, feedforward convolutional neural network (19 layers and 143,667,240 trainable parameters) pre-trained on ImageNet [31].
The use of convolutional layers allowed us to visualize the activations and prediction errors in each layer. Specifically, we took the Euclidean norm of each quantity across all channels and plotted the result as a two-dimensional image for layers ℓ = 1 and ℓ = 10 and for two different input images (Fig. 8). For each image and each layer (each row in Fig. 8), we computed the Euclidean norm of four quantities. First, we computed the activations from a forward pass through the network (v̂ℓ, second column). Under predictive coding with the fixed prediction assumption (Algorithm 3), we can interpret the activations, v̂ℓ, as "beliefs" and the gradients, δℓ, as "prediction errors." Strictly speaking, there is a distinction between the beliefs, v̂ℓ, from a feedforward pass and the beliefs, vℓ = v̂ℓ + εℓ, when labels are provided; either could be interpreted as a "belief." However, we found that the difference between them was negligible for the examples considered here.
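The channel-norm visualization described above can be sketched with forward hooks on a small convolutional stack. The model below is a toy stand-in for VGG-19 with random weights (a real experiment would load pretrained weights, e.g., via torchvision), and names like `act_maps` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small convolutional stack as a stand-in for VGG-19 (random weights).
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(16 * 8 * 8, 10),
)

acts = {}

def save_activations(idx):
    def hook(module, inputs, output):
        output.retain_grad()   # keep dL/dv_l on this non-leaf tensor after backward()
        acts[idx] = output
    return hook

# Attach a forward hook to each convolutional layer.
for idx, layer in enumerate(model):
    if isinstance(layer, nn.Conv2d):
        layer.register_forward_hook(save_activations(idx))

x = torch.randn(1, 3, 8, 8)
y = torch.tensor([4])
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()

# Reduce each (batch, channel, H, W) tensor to a 2-D magnitude map by
# taking the Euclidean norm across channels, as in Fig. 8.
act_maps = {i: v.norm(dim=1).squeeze(0) for i, v in acts.items()}
grad_maps = {i: v.grad.norm(dim=1).squeeze(0) for i, v in acts.items()}
```

The same reduction applies unchanged to beliefs and prediction errors computed by either predictive coding algorithm, since they share the activations' shapes.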
Next, we computed the gradients of the loss with respect to the activations (δℓ, third column in Fig. 8). The theory and simulations above, along with previous work, confirm that these gradients approximate the prediction errors from predictive coding with the fixed prediction assumption (Algorithm 3). Indeed, for the examples considered here, the differences between the two quantities were negligible. Next, we computed the beliefs (vℓ, fourth column in Fig. 8) produced by strict predictive coding (Algorithm 2). Finally, we computed the prediction errors (εℓ, last column in Fig. 8) produced by strict predictive coding (Algorithm 2).
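Strict predictive coding inference (Algorithm 2) can be sketched as gradient descent on an energy over the beliefs. This is a minimal toy version, not the Torch2PC implementation; the energy shown (sum of squared layer-wise errors plus the loss on the output belief) is one common formulation, and the two-hidden-layer network is illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy network; f[l] maps beliefs at layer l-1 to predictions at layer l.
f = nn.ModuleList([
    nn.Sequential(nn.Linear(4, 6), nn.Tanh()),
    nn.Sequential(nn.Linear(6, 6), nn.Tanh()),
    nn.Linear(6, 3),
])
loss_fun = nn.MSELoss(reduction="sum")

x, y = torch.randn(2, 4), torch.randn(2, 3)
eta, n = 0.1, 20   # step size and iteration count, as in Algorithm 2

# Initialize beliefs v_l to the feedforward activations vhat_l.
v = []
h = x
for layer in f:
    h = layer(h)
    v.append(h.detach().clone().requires_grad_(True))

# Gradient descent on the predictive coding energy
#   F = sum_l ||v_l - f_l(v_{l-1})||^2 / 2 + L(v_L, y)
for _ in range(n):
    pre = [x] + v[:-1]
    eps = [v[l] - f[l](pre[l]) for l in range(len(f))]
    F = sum(0.5 * (e ** 2).sum() for e in eps) + loss_fun(v[-1], y)
    grads = torch.autograd.grad(F, v)
    with torch.no_grad():
        for vl, g in zip(v, grads):
            vl -= eta * g

# Prediction errors eps_l at the (approximate) fixed point.
pre = [x] + v[:-1]
eps = [v[l] - f[l](pre[l]) for l in range(len(f))]
```

After the loop, `v` holds the beliefs and `eps` the prediction errors that the figures visualize via the channel norm.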
Note that we used a VGG-19 model that was pre-trained using backpropagation. Hence, the weights were not necessarily the same as those that would be obtained if the model were trained using predictive coding, particularly strict predictive coding (Algorithm 2), which does not necessarily converge to the true gradients. Training a large ImageNet model like VGG-19 with predictive coding is extremely computationally expensive. Regardless, future work should address whether using pre-trained weights (versus weights trained by predictive coding) affects the conclusions reached here.
Overall, the activations, v̂ℓ, from a feedforward pass were qualitatively very similar to the beliefs, vℓ, computed under a strict interpretation of predictive coding (Algorithm 2). To a slightly lesser degree, the gradients, δℓ, from a feedforward pass were qualitatively similar to the prediction errors computed under a strict interpretation of predictive coding (Algorithm 2). Since v̂ℓ and δℓ approximate beliefs and prediction errors under the fixed prediction assumption, these observations confirm that the fixed prediction assumption does not make large qualitative changes to the representation of beliefs and errors in these examples. Therefore, in the discussion below, we use "beliefs" and "prediction errors" to refer to the quantities from both models.
Interestingly, prediction errors were non-zero even when the image and the network's "guess" were consistent with the label (no "mismatch"). Indeed, the prediction errors were largest in magnitude at pixels corresponding to the object predicted by the label, i.e., at the most predictable regions. While this observation is an obvious consequence of the fact that prediction errors are approximated by the gradients, δℓ = ∂L/∂v̂ℓ, it contradicts the heuristic or intuitive interpretation of prediction errors as measurements of "surprise" in the colloquial sense of the word [16]. As an illustrative example from Fig. 8, it is not surprising that an image labeled "triceratops" contains a triceratops, but this does not imply a lack of prediction errors because the space of images containing a triceratops is large, and any one image of a triceratops is not wholly representative of the label. Moreover, the pixels to which the loss is most sensitive are those containing the triceratops, so those pixels give rise to larger values of εℓ ≈ δℓ = ∂L/∂vℓ. Hence, in high-dimensional sensory spaces, predictive coding models do not necessarily predict that prediction error units encode "surprise" in the colloquial sense of the word.
In both examples in Fig. 8, we used a label, y, that matched the network's "guess," i.e., the label to which the network assigned the highest probability (argmax(ŷ)). Prediction errors are often discussed in the context of mismatched stimuli, in which top-down input is inconsistent with bottom-up predictions [32,33,34,35,36,37]. Mismatches can be modeled by taking a label that differs from the network's guess. In Fig. 9, we visualized the prediction errors in response to matched and mismatched labels. The network assigned a probability of p = 0.9991 to the label "carousel" and a probability of p = 3.63 × 10^-8 to the label "bald eagle." The low probability assigned to "bald eagle" is, at least in part, a consequence of the network being trained with a softmax loss function, which implicitly assumes one label per input. When we applied the mismatched label "bald eagle," prediction errors were larger in pixels that are salient for that label (e.g., the bird's white head, a defining feature of a bald eagle). Moreover, the prediction errors as a whole were much larger in magnitude in response to the mismatched label (see the scales of the color bars in Fig. 9).
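The mismatch effect can be illustrated on a toy classifier by comparing the loss and its input gradient for the network's own guess versus an unlikely label. All names here are illustrative (the paper's experiment used VGG-19 and real photographs); the key point is that the error signal is driven by the gap between the assigned probability and the applied label.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy classifier standing in for VGG-19.
net = nn.Sequential(nn.Flatten(), nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))
x = torch.randn(1, 1, 8, 8)

def input_gradient(label):
    """Gradient of the cross-entropy loss with respect to the input pixels."""
    xg = x.clone().requires_grad_(True)
    loss = F.cross_entropy(net(xg), torch.tensor([label]))
    loss.backward()
    return xg.grad.squeeze(), loss.item()

probs = F.softmax(net(x), dim=1).squeeze()
guess = int(probs.argmax())      # the network's own guess (matched label)
mismatch = int(probs.argmin())   # a maximally unlikely (mismatched) label

g_match, loss_match = input_gradient(guess)
g_mismatch, loss_mismatch = input_gradient(mismatch)

# The cross-entropy loss, and hence the overall error signal, is larger
# for the mismatched label, mirroring the color-bar scales in Fig. 9.
print(loss_match < loss_mismatch)  # prints: True
```

Since cross-entropy is -log p of the applied label, the mismatched (low-probability) label necessarily yields the larger loss; the spatial structure of `g_mismatch` plays the role of the pixel-wise error maps in Fig. 9.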
In summary, the relationship between prediction errors and gradients helped demonstrate that prediction errors sometimes, but not always, conform to their common interpretation as unexpected features of a bottom-up input in the context of a top-down input. Also, beliefs and prediction errors were qualitatively similar with and without the fixed prediction assumption for the examples considered here. When drawing analogues between artificial and biological neural networks, the activity of biological neurons should therefore be compared to both the activations and the gradients of artificial neurons.
Following previous work [12,11], we took the covariance matrices underlying the probabilistic model to be identity matrices, Σℓ = I, when deriving the predictive coding model. We also showed that relaxing this assumption by allowing arbitrary precision matrices in hidden layers does not affect learning under the fixed prediction assumption. Future work should consider the utility of accounting for covariance (or precision) matrices in models without the fixed prediction assumption (i.e., under the "strict" model) and of accounting for precisions or covariances in the output layer. Moreover, precision matrices could still have benefits in other settings such as recurrent network models, unsupervised learning, or active inference.
Predictive coding and deep neural networks (trained by backpropagation) are often viewed as competing models of brain function.Better understanding their relationship can help in the interpretation and implementation of each algorithm as well as their mutual relationships to biological neuronal networks.

Materials and Methods
All numerical examples were performed on GPUs using Google Colaboratory with custom-written PyTorch code. The networks trained on MNIST used two convolutional and three fully connected layers with rectified linear activation functions and were trained for 2 epochs with a learning rate of 0.002 and a batch size of 300. The networks trained on CIFAR-10 used three convolutional and three fully connected layers with rectified linear activation functions and were trained for 5 epochs with a learning rate of 0.01 and a batch size of 256. All networks were trained using the Adam optimizer with gradients replaced by the output of the respective algorithm. A Google Drive folder with Colab notebooks that produce all figures in this text can be found at https://drive.google.com/drive/folders/1m_y0G_sTF-pV9pd2_sysWt1nvRvHYzX0 and a copy of the same code is also stored at https://github.com/RobertRosenbaum/PredictiveCodingVsBackProp. Full details of the neural network architectures and hyperparameters can be found in this code.
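The "Adam optimizer with gradients replaced" scheme amounts to writing the algorithm's output into each parameter's `.grad` attribute before calling `optimizer.step()`. A minimal sketch follows; `compute_dtheta` is a hypothetical stand-in for the predictive coding update (here it just returns the exact gradients, which is what Algorithm 3 with η = 1 and n = L reduces to).

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
optimizer = optim.Adam(model.parameters(), lr=0.002)  # MNIST learning rate from the text

x, y = torch.randn(16, 4), torch.randint(0, 2, (16,))

def compute_dtheta(model, x, y):
    """Hypothetical stand-in for the predictive coding update dtheta;
    here it simply returns the exact gradients."""
    loss = nn.CrossEntropyLoss()(model(x), y)
    return torch.autograd.grad(loss, list(model.parameters()))

optimizer.zero_grad()
for p, d in zip(model.parameters(), compute_dtheta(model, x, y)):
    p.grad = d              # replace .grad with the algorithm's output
before = [p.detach().clone() for p in model.parameters()]
optimizer.step()            # Adam consumes the replaced gradients
```

Any update rule that fills `.grad` with a tensor of matching shape can be swapped in, which is how the strict and fixed-prediction algorithms are plugged into Adam.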

Torch2PC software for predictive coding with PyTorch models
The figures above were all produced using PyTorch [38] models combined with custom-written functions for predictive coding. Functions for predictive coding with PyTorch models are collected in the GitHub repository Torch2PC. Currently, the only available functions are intended for models built using the Sequential class, but more general functions will be added to Torch2PC in the future. The functions can be imported using the following commands:

!git clone https://github.com/RobertRosenbaum/Torch2PC.git
from Torch2PC import TorchSeq2PC as T2PC

The primary function in TorchSeq2PC is PCInfer, which performs one predictive coding step (computes one value of dθ) on a batch of inputs and labels. The function takes an input ErrType, which is a string that determines whether to use a strict interpretation of predictive coding (Algorithm 2; ErrType="Strict"), predictive coding with the fixed prediction assumption (Algorithm 3; "FixedPred"), or to compute the gradients exactly using backpropagation (Algorithm 1; "Exact"). Algorithm 2 can be called as follows:

vhat,Loss,dLdy,v,epsilon = T2PC.PCInfer(model,LossFun,X,Y,"Strict",eta,n,vinit)

where model is a Sequential PyTorch model, LossFun is a loss function, X is a mini-batch of inputs, Y is a mini-batch of labels, eta is the step size, n is the number of iterations to use, and vinit is the initial value for the beliefs. If vinit is not passed, it is set to the result from a forward pass, vinit=vhat. The function returns a list of activations from a forward pass at each layer as vhat, the loss as Loss, the gradient of the loss with respect to the output as dLdy, a list of beliefs, vℓ, at each layer as v, and a list of prediction errors, εℓ, at each layer as epsilon. The input vinit is not used for Algorithm 3, so it does not need to be passed in. The exact values computed by backpropagation can be obtained by calling

vhat,Loss,dLdy,v,epsilon = T2PC.PCInfer(model,LossFun,X,Y,"Exact")

The inputs vinit, eta, and n are not used for
computing exact gradients, so they do not need to be passed in. Theorem 1 says that

T2PC.PCInfer(model,LossFun,X,Y,"FixedPred",eta=1,n=len(model))

computes the same values as

S2 Figure. Comparing backpropagation and predictive coding modified by the fixed prediction assumption in a convolutional neural network trained on MNIST across multiple trials. Same as Figure 3 except the model was trained 30 times with different random seeds. Dark curves show the mean values and shaded regions show ± one standard deviation across trials.

Figure 1 :
Figure 1: Comparing backpropagation and predictive coding in a convolutional neural network trained on MNIST. A,B) The loss (A) and accuracy (B) on the training set (pastel) and test set (dark) when a 5-layer network was trained using a strict implementation of predictive coding (Algorithm 2 with η = 0.1 and n = 20; red) and backpropagation (blue). C,D) The relative error (C) and angle (D) between the parameter update, dθ, computed by Algorithm 2 and the negative gradient of the loss at each layer. Predictive coding and backpropagation give similar accuracies, but the parameter updates are less similar.
S1 Figure shows the same results from Fig. 1 repeated across 30 trials with different random seeds to quantify the mean and standard deviation across trials.

Figure 2 :
Figure 2: Comparing parameter updates from predictive coding to true gradients in a network trained on MNIST. Relative error and angle between dθℓ produced by predictive coding (Algorithm 2) and the exact gradients, ∂L/∂θℓ, computed by backpropagation (relative error defined by ‖dθ_pc − dθ_bp‖/‖dθ_bp‖). Updates were computed as a function of the number of iterations, n, used in Algorithm 2 for various values of the step size, η, using the model from Fig. 1 applied to one mini-batch of data. Both models were initialized identically to the pre-trained parameter values from the trained model in Fig. 1. Parameter updates converge near the gradients after many iterations for smaller values of η, but diverge for larger values.

Figure 3 :
Figure 3: Predictive coding modified by the fixed prediction assumption compared to backpropagation in a convolutional neural network trained on MNIST. Same as Fig. 1 except Algorithm 3 was used (with η = 0.1 and n = 20) in place of Algorithm 2. The accuracy of predictive coding with the fixed prediction assumption is similar to backpropagation, but the parameter updates are less similar for these hyperparameters.
to Fig. 1C,D). Algorithm 3 was similar to Algorithm 2 in terms of training time (29s for Algorithm 3 versus 31s for Algorithm 2 and 8s for backpropagation). S2 Figure shows the same results from Fig. 3 repeated across 30 trials with different random seeds to quantify the mean and standard deviation across trials.

Figure 4 :
Figure 4: Comparing parameter updates from predictive coding modified by the fixed prediction assumption to true gradients in a network trained on MNIST. Relative error and angle between dθ produced by predictive coding modified by the fixed prediction assumption (Algorithm 3) and the exact gradients computed by backpropagation (relative error defined by ‖dθ_pc − dθ_bp‖/‖dθ_bp‖). Updates were computed as a function of the number of iterations, n, used in Algorithm 3 for various values of the step size, η, using the model from Fig. 3 applied to one mini-batch of data. Both models were initialized identically to the pre-trained parameter values from the backpropagation-trained model in Fig. 3. In the rightmost panels, some lines are not visible where they overlap at zero. Parameter updates quickly converge to the true gradients when η is larger.

Figure 5 :
Figure 5: Predictive coding modified by the fixed prediction assumption compared to backpropagation in convolutional neural networks trained on CIFAR-10. Same as Fig. 3 except a larger network was trained on the CIFAR-10 data set. The accuracy of predictive coding with the fixed prediction assumption is similar to backpropagation, and parameter updates are similar to the true gradients.

Figure 6 :
Figure 6: Predictive coding modified by the fixed prediction assumption with η = 1 compared to backpropagation in convolutional neural networks trained on MNIST. Same as Fig. 3 except η = 1 and n = L. Predictive coding with the fixed prediction assumption approximates true gradients accurately when η = 1.

We repeated the simulations in Figs. 3 and 5 using η = 1 and n = L (in contrast to Figs. 3 and 5, which used η = 0.1 and n = 20). The loss and accuracy closely matched those computed by backpropagation (Figs. 6A,B and 7A,B).

Figure 7 :
Figure 7: Predictive coding modified by the fixed prediction assumption with η = 1 compared to backpropagation in convolutional neural networks trained on CIFAR-10. Same as Fig. 5 except η = 1 and n = L. Predictive coding with the fixed prediction assumption approximates true gradients accurately when η = 1.

Figure 8 :
Figure 8: Magnitude of activations, beliefs, and prediction errors in a convolutional neural network pre-trained on ImageNet. The Euclidean norm of feedforward activations (v̂ℓ, interpreted as beliefs under the fixed prediction assumption), gradients of the loss with respect to activations (δℓ = ∂L/∂v̂ℓ, interpreted as prediction errors under the fixed prediction assumption), beliefs (vℓ) under strict predictive coding, and prediction errors (εℓ) under strict predictive coding, computed from the VGG-19 network [30] pre-trained on ImageNet [31] with two different photographs as inputs at two different layers. The vertical labels on the left ("triceratops" and "Irish wolfhound") correspond to the guessed label, which was also used as the "true" label (y) to compute the gradients.

Figure 9 :
Figure 9: Magnitude of activations, beliefs, and prediction errors in response to matched and mismatched inputs and labels. Same as Fig. 8, but for the bottom row the label did not match the network's guess.
T2PC.PCInfer(model,LossFun,X,Y,"Exact") up to numerical floating-point errors. The inputs eta, n, and vinit are optional. If they are omitted by calling

T2PC.PCInfer(model,LossFun,X,Y,ErrType)

then they default to eta=.1, n=20, and vinit=None, which produces vinit=vhat when ErrType="Strict". More complete documentation and a complete example are provided in SimpleExample.ipynb in the GitHub repository and in the code accompanying this paper. More examples are provided by the code accompanying each figure above.
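Theorem 1's equivalence can be checked in a few lines of self-contained NumPy (a hand-rolled stand-in, not Torch2PC): with η = 1 and n = L, the fixed-prediction-assumption error updates reproduce the backpropagated gradients exactly. The tiny tanh network and layer sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny MLP standing in for a Sequential model: two tanh layers, linear readout.
sizes = [4, 5, 5, 3]
W = [rng.standard_normal((sizes[l + 1], sizes[l])) * 0.5 for l in range(3)]
x = rng.standard_normal(4)
y = rng.standard_normal(3)

# Feedforward pass: vhat[l] for l = 0 (input) through 3 (output).
vhat = [x]
vhat.append(np.tanh(W[0] @ vhat[0]))
vhat.append(np.tanh(W[1] @ vhat[1]))
vhat.append(W[2] @ vhat[2])

def vjp(l, e):
    """Map an error at layer l+1 back to layer l (transposed Jacobian)."""
    if l == 2:                                        # linear output layer
        return W[2].T @ e
    return W[l].T @ (e * (1.0 - vhat[l + 1] ** 2))    # tanh layers

# Backpropagation: delta_l = dL/dvhat_l for L = 0.5*||vhat_3 - y||^2.
delta = {3: vhat[3] - y}
for l in (2, 1):
    delta[l] = vjp(l, delta[l + 1])

# Predictive coding under the fixed prediction assumption (Algorithm 3)
# with eta = 1 and n = L = 3: hidden errors start at zero and are updated
# simultaneously by eps_l <- eps_l + eta*(-eps_l + vjp(l, eps_{l+1})).
eps = {3: vhat[3] - y, 2: np.zeros(5), 1: np.zeros(5)}
for _ in range(3):
    new = {l: eps[l] + 1.0 * (-eps[l] + vjp(l, eps[l + 1])) for l in (1, 2)}
    eps.update(new)

print(np.allclose(eps[1], delta[1]), np.allclose(eps[2], delta[2]))  # prints: True True
```

With η = 1 the update collapses to eps_l = vjp(l, eps_{l+1}), so each sweep propagates the exact output error one layer deeper; after L sweeps every layer holds its backpropagated gradient, which is the content of Theorem 1.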
S1 Figure. Comparing backpropagation and predictive coding in a convolutional neural network trained on MNIST across multiple trials. Same as Figure 1 except the model was trained 30 times with different random seeds. Dark curves show the mean values and shaded regions show ± one standard deviation across trials.
The values of the parameter updates, dθℓ, are stored in the grad attributes of each parameter, model.param.grad. Hence, after a call to PCInfer, gradient descent could be implemented by calling

with torch.no_grad():
    for p in modelPC.parameters():
        p -= eta*p.grad

Alternatively, an arbitrary optimizer could be used by calling optimizer.step() where optimizer is an optimizer created using the PyTorch optim class, e.g., by calling

optimizer = optim.Adam(model.parameters())

before the call to T2PC.PCInfer. The input model should be a PyTorch Sequential model. Each layer is treated as a single predictive coding layer. Multiple functions can be included within the same layer by wrapping them in a separate call to Sequential. For example, the following code:

vhat,Loss,dLdy,v,epsilon = T2PC.PCInfer(model,LossFun,X,Y,"FixedPred",eta,n)