Fig 1.
(a) Schematic of visual processing across several areas defined by anatomical structure and function of the area such as V1, V2, V4, and Inferior Temporal cortex (IT). The major input to the visual cortex is from the lateral geniculate nucleus (LGN) whose principal neurons receive input from the retina [26]. (b) Models of the visual cortex. In feedforward models, information is only processed in one direction. In models with feedback connections, information from distant layers loops back to earlier layers (e.g. from IT to V1) across multiple distances [27]. Also, local activity feeds back to the same layer. (c) Feedforward models map each point in the input space to a point in the output space. In a model with feedback, a static point in the input space can generate a temporal dynamic (or trajectory) in the output space. The properties of the trajectory strongly depend on the feedback connections and the input. This image is inspired by Fig 2 in [28] describing “core vision”, the sub-second processing in primates during a single fixation.
Fig 2.
Two implementations of feedforward transmission.
In the artificial case, feedforward connections (black arrows) transmit information instantaneously (Δ = 0). In the biological case, feedforward transmission requires time and thus introduces a time delay (Δ = 1). In both cases, feedback connections require a delay (Δ = 1) irrespective of distance (0, 1, or 2 in green, blue, and red). Note that in the artificial case, units along line ri depend only on information from units along line ri−1, whereas in the biological case units along line di depend on information from units along several lines di−1, di−2, …. The lines ri and di represent how the state of the network evolves as a function of the layers and time.
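The difference between the two transmission schemes can be illustrated with a minimal sketch. It assumes scalar, linear units, one unit per layer, and an input that is available at every time step; the names `step_artificial`, `step_biological`, `ff`, and `fb` are hypothetical and not taken from the paper.

```python
def step_artificial(h_prev, x, ff, fb):
    """One time step with instantaneous feedforward (delay 0) and delayed feedback (delay 1).

    h_prev: layer activities at time t-1
    ff[l]:  feedforward weight into layer l (assumed scalar)
    fb[l][m]: feedback weight from layer m >= l at time t-1 into layer l
    """
    n = len(h_prev)
    h = [0.0] * n
    for l in range(n):
        below = x if l == 0 else h[l - 1]  # same time step: no delay
        feedback = sum(fb[l][m] * h_prev[m] for m in range(l, n))
        h[l] = ff[l] * below + feedback
    return h


def step_biological(h_prev, x, ff, fb):
    """One time step where feedforward transmission also takes one step (delay 1)."""
    n = len(h_prev)
    h = [0.0] * n
    for l in range(n):
        below = x if l == 0 else h_prev[l - 1]  # previous time step: delay 1
        feedback = sum(fb[l][m] * h_prev[m] for m in range(l, n))
        h[l] = ff[l] * below + feedback
    return h


def run(step, n_layers, steps, x=1.0):
    """Run a purely feedforward chain (all feedback weights zero) from rest."""
    ff = [1.0] * n_layers
    fb = [[0.0] * n_layers for _ in range(n_layers)]
    h = [0.0] * n_layers
    history = []
    for _ in range(steps):
        h = step(h, x, ff, fb)
        history.append(list(h))
    return history


artificial = run(step_artificial, 3, 1)
biological = run(step_biological, 3, 3)
print(artificial)
print(biological)
```

In this toy setting the artificial scheme fills all three layers within a single time step, while the biological scheme needs one step per layer before the input reaches the top.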
Fig 3.
(a) The CNN is composed of N layers (gray boxes) and bottom-up and top-down connections (black and colored arrows). Each bottom-up connection involves spatial downsampling or local pooling. The network input is a static image (i.e. h0,t = x) and the output is the activity of the last layer after T time steps (i.e. hN,T). (b) Each layer l is composed of a mechanism for aggregating information from other layers (i.e. hl−1,t, hl+1,t−1, …, hN,t−1) and a recurrent map R that updates the activity of the layer hl,t—see Eq (4)—and transmits this to other layers. (c) The aggregation consists of upsampling feedback from higher layers with the function S to match spatial dimensions of the current layer and concatenating along the channel axis (⊕). Then a linear map Tl combines these channels resulting in the same number of channels as hl−1,t, to which this feedback input is then added—see Eq (6). Finally, Fl is an additional nonlinear mapping that implements downsampling.
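The aggregation step in panel (c) can be sketched on one-dimensional feature maps. This is an illustration under stated assumptions, not the paper's implementation: nearest-neighbour upsampling stands in for S, a position-wise channel-mixing matrix stands in for Tl, and the functions `upsample`, `pointwise_linear`, and `aggregate` are hypothetical names. The recurrent map R and the downsampling map Fl are left out.

```python
def upsample(feature, length):
    """Nearest-neighbour upsampling of a multi-channel 1-D map to `length` positions (stand-in for S)."""
    return [[ch[i * len(ch) // length] for i in range(length)] for ch in feature]


def pointwise_linear(weights, feature):
    """Mix channels at every position with a matrix of shape (out_channels, in_channels) (stand-in for T_l)."""
    length = len(feature[0])
    return [
        [sum(w * ch[i] for w, ch in zip(row, feature)) for i in range(length)]
        for row in weights
    ]


def aggregate(h_below, feedbacks, weights):
    """Upsample each feedback map, concatenate along channels, mix, and add to the bottom-up input."""
    length = len(h_below[0])
    stacked = []
    for fb in feedbacks:
        stacked.extend(upsample(fb, length))  # concatenation along the channel axis
    mixed = pointwise_linear(weights, stacked)  # same channel count as h_below
    return [[a + b for a, b in zip(ch_b, ch_m)] for ch_b, ch_m in zip(h_below, mixed)]


h_below = [[1.0, 2.0, 3.0, 4.0]]           # one channel, four positions
feedbacks = [[[10.0, 20.0]], [[100.0]]]    # two higher layers at coarser resolution
weights = [[1.0, 0.5]]                     # maps two stacked channels to one
result = aggregate(h_below, feedbacks, weights)
print(result)
```

The mixing matrix must have as many rows as `h_below` has channels so that the final addition is well defined, which mirrors the requirement in the caption that the linear map produces the same number of channels as the bottom-up input.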
Fig 4.
Training and inference dynamics.
(a) Learning trajectories in parameter space. There is a region of parameter space, enclosed by the bifurcation boundary (blue curve), in which the activity dynamics are stable. The trajectories ω1 and ω2 both start with stable dynamics. After some gradient updates, a trajectory may remain in the stable domain (ω1) or move beyond it (ω2). (b) Activity space, showing dynamics that converge (Ω1: orange curve, corresponding to the endpoint of learning trajectory ω1) and dynamics that diverge (Ω2: green curve, corresponding to the endpoint of learning trajectory ω2).
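The two regimes in panel (b) can be reproduced with a toy linear recurrence. The sketch below assumes a two-unit system h_t = W h_{t−1} + x with a static input, which is a simplification and not the network of the paper; the weight values are arbitrary choices that place one matrix inside and one outside the stable region (largest eigenvalue 0.5 and 1.3, respectively).

```python
def simulate(weights, x, steps):
    """Iterate h_t = W h_{t-1} + x from h_0 = 0 and return the whole trajectory."""
    h = [0.0] * len(x)
    trajectory = [list(h)]
    for _ in range(steps):
        h = [sum(w * v for w, v in zip(row, h)) + xi for row, xi in zip(weights, x)]
        trajectory.append(list(h))
    return trajectory


def norm(h):
    """Euclidean norm of an activity vector."""
    return sum(v * v for v in h) ** 0.5


x = [1.0, 1.0]
# Eigenvalues 0.5 and -0.1: the trajectory approaches the fixed point [2, 2].
stable = simulate([[0.2, 0.3], [0.3, 0.2]], x, 50)
# Eigenvalues 1.3 and 0.3: the trajectory grows without bound.
unstable = simulate([[0.8, 0.5], [0.5, 0.8]], x, 50)
print(norm(stable[-1]), norm(unstable[-1]))
```

In both cases the input is the same static point; only the recurrent weights decide whether the activity settles or diverges.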
Table 1.
Coefficients of characteristic polynomials.
For matrices of size N, the coefficients of the characteristic polynomials are indicated as a function of the values of the matrix. For each size, the results are shown for the cases of biological and artificial feedforward connections. We denote κ1 = α11 + α22 + α33, κ2 = α12α21 + α23α32, κ3 = α12α23α31, κ4 = α11α22 + α11α33 + α22α33, κ5 = α11α23α32 + α33α12α21, κ6 = α11α22α33.
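As a sanity check on the κ notation, the sketch below computes the characteristic polynomial of a generic 3×3 weight matrix with α13 = 0 in two ways: from the trace, principal minors, and determinant, and from the κ terms defined above. It is the standard linear-algebra identity; which row of the table it corresponds to is not asserted here, and the function names and the integer test matrix are hypothetical.

```python
def char_poly_3x3(a):
    """Coefficients [c2, c1, c0] of det(lambda*I - A) = lambda^3 + c2*lambda^2 + c1*lambda + c0."""
    trace = a[0][0] + a[1][1] + a[2][2]
    minors = (
        a[0][0] * a[1][1] - a[0][1] * a[1][0]
        + a[0][0] * a[2][2] - a[0][2] * a[2][0]
        + a[1][1] * a[2][2] - a[1][2] * a[2][1]
    )
    det = (
        a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
        - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
        + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0])
    )
    return [-trace, minors, -det]


def kappa_coefficients(a):
    """Same coefficients written with kappa_1..kappa_6; valid when a[0][2] (alpha_13) is zero."""
    k1 = a[0][0] + a[1][1] + a[2][2]
    k2 = a[0][1] * a[1][0] + a[1][2] * a[2][1]
    k3 = a[0][1] * a[1][2] * a[2][0]
    k4 = a[0][0] * a[1][1] + a[0][0] * a[2][2] + a[1][1] * a[2][2]
    k5 = a[0][0] * a[1][2] * a[2][1] + a[2][2] * a[0][1] * a[1][0]
    k6 = a[0][0] * a[1][1] * a[2][2]
    return [-k1, k4 - k2, -(k6 - k5 + k3)]


# Arbitrary integer entries, with alpha_13 = 0.
alpha = [[2, 3, 0], [5, 7, 11], [13, 17, 19]]
direct = char_poly_3x3(alpha)
via_kappa = kappa_coefficients(alpha)
print(direct, via_kappa)
```

Under this assumption the polynomial reads λ³ − κ1 λ² + (κ4 − κ2) λ − (κ6 − κ5 + κ3).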
Fig 5.
Bifurcation boundaries for 2, 3, and 5 layers.
The region of the parameter space where the network is stable depends on the number of layers N, the type of feedforward transmission (B: biological with delay, A: artificial without delay), and the distance of the feedback (see Eqs (7) and (8)). For the networks presented in the lower panels (a) N = 2, (b) N = 3, (c) N = 5, we see that the stability region for the biological case is larger than or equal to that of the artificial case. This behavior repeats for different N. Notation: (a) wi = α11 = α22, wp = α12α21. (b) wi = α11 = α22 = α33, wp = α12α21 = α23α32 and α31 = 0. (c) wshort = α23α32 = α34α43 = α45α54, wlong = α23α34α45α52. Note that in all cases, the red and blue curves intersect at the axes (wi = 0, wp = 0, wshort = 0 or wlong = 0). This is a consequence of the discussion presented in Section 3.2.
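For intuition about what such a boundary encodes, the sketch below tests stability for N = 2 when every connection has a unit delay. It assumes linear dynamics h_t = A h_{t−1} with A = [[wi, α12], [α21, wi]], so that the eigenvalues are wi ± √wp and the network is stable when both lie inside the unit circle. Eqs (7) and (8) are not reproduced here, so this is an illustration rather than the paper's boundary; the function names are hypothetical.

```python
import cmath


def eigenvalues_two_layers(w_i, w_p):
    """Eigenvalues of [[w_i, a12], [a21, w_i]] with a12 * a21 = w_p (complex if w_p < 0)."""
    root = cmath.sqrt(w_p)
    return (w_i + root, w_i - root)


def is_stable(w_i, w_p):
    """Discrete-time linear stability: spectral radius strictly below 1."""
    return max(abs(lam) for lam in eigenvalues_two_layers(w_i, w_p)) < 1.0


print(is_stable(0.5, 0.04))   # eigenvalues 0.7 and 0.3
print(is_stable(0.5, 0.36))   # eigenvalues 1.1 and -0.1
print(is_stable(0.0, -0.25))  # eigenvalues +0.5j and -0.5j
```

Sweeping `w_i` and `w_p` over a grid and marking where `is_stable` changes value would trace the boundary for this simplified case.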
Fig 6.
Equivalence between biological and artificial implementation.
When the network consists of feedforward connections and single-distance feedback connections, the biological and artificial implementations have the same bifurcation boundaries. For a network with feedback connections of distance q = 2, the biological implementation (top panel) is represented with a pattern of arrows (bold lines) that repeats (q+1)-times. The complete information about the dynamics is in the pattern (center panel). When applying a temporal contraction, the pattern is equivalent to the information flow of the artificial implementation (bottom panel). The presented theorem (see main text) shows this equivalence through the transformation of the characteristic polynomial pB to pA.
Fig 7.
The dominant eigenvalue for a network with feedforward connections and feedback connections of a single distance q.
The eigenvalue in this network is proportional to the number of loops and the effective weight of the loops (fqβq). When all feedback connections of distance q are considered, there are a total of N−q loops.
Fig 8.
Fully connected and layered networks.
a) Decomposition into eigenvalues and eigenstates of a fully connected network. Nodes of the same color are in-phase synchronized, the nodes with opposite colors (yellow-red) are anti-phase synchronized, and the black nodes are deactivated. b) Dominant eigenvalue in a layered network.
Table 2.
Properties of activation functions.
Fig 9.
Performance on object detection during the training process.
The results here are on the validation set. Recurrent CNNs (a-d) were used as backbones in Faster R-CNN. (a, c) Feedforward networks with 3 and 5 layers, respectively. (b, d) Feedback connections of distance 0 (green arrows) and 1 (blue arrows) were added to networks (a) and (c). (e) During the training stage, the validation loss of the feedforward networks evolves similarly, regardless of depth (lines a, c). Adding feedback connections reduces the validation loss (lines b, d). (f) Average precision and recall for detection of objects of different sizes in images of the validation set. The initial value of each metric (epoch 1) tends to be higher as the number of layers in the network increases. However, the evolution of each metric depends on the size of the image and whether feedback connections are included. The gray line indicates the performance of the Feature Pyramid Network (FPN) pretrained on this data [35].
Fig 10.
Examples of object detection and classification results.
Predictions for five images (rows) of the evaluation set using Faster R-CNN. The backbone for Faster R-CNN is one of our four recurrent CNNs or the Feature Pyramid Network (FPN) (columns).
Fig 11.
Accuracy in classification task.
Recurrent CNNs (a-c) were used as feature extractors in the classification task. (a, c) Feedforward networks with 2 and 3 layers, respectively. (b) A feedback connection of distance 0 (green arrow) was added to network (a). During the training of the networks (a-c), the accuracy calculated over the training set and test set increases. The performance of the networks is reduced when evaluating on the images of the test set with Gaussian noise.
Fig 12.
Temporal dynamics of the classification network.
This simulation uses the trained network of Fig 11b. a) Examples of t-SNE projection of the trajectories of the activity space of the last layer. The activity space has dimension 10, while the projection is two-dimensional. Please note that for feedforward networks without temporal dynamics, the trajectory is a point that remains constant over time. Therefore, the stability of the dynamics during inference is assured. b) Distribution of entropy of the output as a function of the time steps in the inference stage. c) Performance of the network as a function of the time steps in the inference stage for the low (green line) and high (red line) entropy images. d) Performance of the network as a function of the time steps in the inference stage. We evaluated the network on the complete test set (solid black line) and by classes (dashed lines). Based on the accuracy by classes, the classes that are easy and hard for the network to classify were identified (green and red lines, respectively).
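The entropy in panel (b) can be illustrated with a short sketch. It assumes that the 10-dimensional output is turned into class probabilities with a softmax and that entropy is the Shannon entropy in nats; neither choice is stated in the caption, and the function names and logit values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Ten classes: an undecided output versus an output dominated by one class.
uncertain = entropy(softmax([0.0] * 10))
confident = entropy(softmax([20.0] + [0.0] * 9))
print(uncertain, confident)
```

Evaluating `entropy` on the output at each time step of inference would give one value per image and step, from which a distribution like the one in panel (b) can be built.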