Architecture of the brain’s visual system enhances network stability and performance through layers, delays, and feedback

doi:10.1371/journal.pcbi.1011078

Fig 1.

Visual processing.

(a) Schematic of visual processing across several areas defined by anatomical structure and function of the area such as V1, V2, V4, and Inferior Temporal cortex (IT). The major input to the visual cortex is from the lateral geniculate nucleus (LGN) whose principal neurons receive input from the retina [26]. (b) Models of the visual cortex. In feedforward models, information is only processed in one direction. In models with feedback connections, information from distant layers loops back to earlier layers (e.g. from IT to V1) across multiple distances [27]. Also, local activity feeds back to the same layer. (c) Feedforward models map each point in the input space to a point in the output space. In a model with feedback, a static point in the input space can generate a temporal dynamic (or trajectory) in the output space. The properties of the trajectory strongly depend on the feedback connections and the input. This image is inspired by Fig 2 in [28] describing “core vision”, the sub-second processing in primates during a single fixation.

Fig 2.

Two implementations of feedforward transmission.

In the artificial case, feedforward connections (black arrows) transmit information instantaneously (Δ = 0). In the biological case, feedforward transmission takes time and thus introduces a delay (Δ = 1). In both cases, feedback connections carry a delay (Δ = 1) irrespective of their distance (0, 1, or 2, shown in green, blue, and red). Note that in the artificial case the units along line r_i depend only on information from units along line r_{i−1}, whereas in the biological case the units along line d_i depend on information from units along several lines d_{i−1}, d_{i−2}, …. The lines r_i and d_i represent how the state of the network evolves as a function of layer and time.
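
The difference between the two transmission schemes can be illustrated with a toy linear network. The sketch below is not the model used in the paper: it assumes one scalar unit per layer and a linear update, and the names `make_weights`, `step_biological`, `step_artificial`, and `run` are hypothetical. In the biological step every connection reads the previous time step; in the artificial step feedforward connections read the current time step, which amounts to solving a triangular linear system.

```python
import numpy as np

def make_weights(n_layers, ff=1.0, feedback=None):
    """Return (F, B), feedforward and feedback matrices indexed [target, source].

    `feedback` maps a feedback distance q to a weight; q = 0 is a self-loop.
    All values here are illustrative assumptions.
    """
    F = np.zeros((n_layers, n_layers))
    B = np.zeros((n_layers, n_layers))
    for l in range(1, n_layers):
        F[l, l - 1] = ff                      # layer l-1 -> layer l
    for q, w in (feedback or {}).items():
        for l in range(n_layers - q):
            B[l, l + q] = w                   # layer l+q -> layer l
    return F, B

def step_biological(h, F, B, u):
    """All connections are delayed by one step: h_t = (F + B) h_{t-1} + u."""
    return (F + B) @ h + u

def step_artificial(h, F, B, u):
    """Feedforward is instantaneous: h_t = F h_t + B h_{t-1} + u."""
    n = len(h)
    return np.linalg.solve(np.eye(n) - F, B @ h + u)

def run(step, F, B, u, n_steps):
    """Iterate `step` from zero activity and return the states as rows."""
    h = np.zeros(len(u))
    states = []
    for _ in range(n_steps):
        h = step(h, F, B, u)
        states.append(h.copy())
    return np.array(states)

# A 3-layer feedforward chain driven by a static input into the first layer.
F, B = make_weights(3)
u = np.array([1.0, 0.0, 0.0])
print(run(step_biological, F, B, u, 3))   # the input needs one step per layer
print(run(step_artificial, F, B, u, 1))   # the input reaches every layer at once
```

Passing, for example, `feedback={0: 0.2, 1: 0.1}` to `make_weights` adds self-loops and distance-1 feedback, both delayed by one step in either scheme.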

Fig 3.

A recurrent CNN.

(a) The CNN is composed of N layers (gray boxes) and bottom-up and top-down connections (black and colored arrows). Each bottom-up connection involves spatial downsampling or local pooling. The network input is a static image (i.e. h_0,t = x) and the output is the activity of the last layer after T time steps (i.e. h_N,T). (b) Each layer l is composed of a mechanism for aggregating information from other layers (i.e. h_l−1,t, h_l+1,t−1, …, h_N,t−1) and a recurrent map R that updates the activity of the layer h_l,t—see Eq (4)—and transmits this to other layers. (c) The aggregation consists of upsampling feedback from higher layers with the function S to match spatial dimensions of the current layer and concatenating along the channel axis (⊕). Then a linear map T_l combines these channels resulting in the same number of channels as h_l−1,t, to which this feedback input is then added—see Eq (6). Finally, F_l is an additional nonlinear mapping that implements downsampling.
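
The aggregation step in panel (c) can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: S is taken to be nearest-neighbour upsampling, T_l is taken to be a per-pixel linear map over channels (a 1×1 convolution), feedback is resized to the spatial size of h_l−1,t so that the addition is defined, and the names `upsample_nearest` and `aggregate` are hypothetical. The downsampling map F_l and the recurrent map R are not modelled.

```python
import numpy as np

def upsample_nearest(x, out_hw):
    """Stand-in for S: nearest-neighbour resize of a (C, H, W) array."""
    _, h, w = x.shape
    out_h, out_w = out_hw
    rows = np.arange(out_h) * h // out_h      # source row for each output row
    cols = np.arange(out_w) * w // out_w      # source column for each output column
    return x[:, rows[:, None], cols[None, :]]

def aggregate(h_below, feedbacks, T):
    """Add upsampled, channel-mixed feedback to the bottom-up input.

    h_below:   (C, H, W) bottom-up activity.
    feedbacks: list of (C_k, H_k, W_k) activities from higher layers.
    T:         (C, sum of C_k) matrix standing in for the linear map T_l.
    """
    target_hw = h_below.shape[1:]
    stacked = np.concatenate(
        [upsample_nearest(f, target_hw) for f in feedbacks], axis=0
    )                                          # concatenation along the channel axis
    mixed = np.einsum("oc,chw->ohw", T, stacked)
    return h_below + mixed

rng = np.random.default_rng(0)
h_below = rng.normal(size=(4, 8, 8))
feedbacks = [rng.normal(size=(6, 4, 4)), rng.normal(size=(8, 2, 2))]
T = rng.normal(size=(4, 14)) * 0.1
print(aggregate(h_below, feedbacks, T).shape)
```

The result has the same shape as `h_below`, so it can be passed on to whatever downsampling map follows.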

Fig 4.

Training and inference dynamic.

(a) Learning trajectories in parameter space. There is a region of parameter space (inside the bifurcation boundary, blue curve) in which the dynamics of the activity are stable. The trajectories ω₁ and ω₂ both start with stable dynamics. After some gradient updates, a trajectory may remain in the stable domain (ω₁) or move beyond it (ω₂). (b) Activity space, showing dynamics that converge (Ω₁: orange curve, corresponding to the endpoint of learning ω₁) and dynamics that diverge (Ω₂: green curve, corresponding to the endpoint of learning ω₂).
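
A minimal numerical sketch of this picture, assuming the activity follows a linear update h_t = W h_{t−1} + u so that the stable region is the set of W with spectral radius below 1. The two parameter paths below are hypothetical rescalings of a fixed matrix, not real gradient updates, and all function names are illustrative.

```python
import numpy as np

def spectral_radius(W):
    """Largest eigenvalue modulus of the (assumed linear) update matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(W))))

def first_unstable_update(weight_path):
    """Index of the first parameter setting outside the stable region, or None."""
    for k, W in enumerate(weight_path):
        if spectral_radius(W) >= 1.0:
            return k
    return None

def activity_norms(W, u, n_steps):
    """Norm of h_t for h_t = W h_{t-1} + u, starting from zero activity."""
    h = np.zeros_like(u)
    norms = []
    for _ in range(n_steps):
        h = W @ h + u
        norms.append(float(np.linalg.norm(h)))
    return norms

# Hypothetical learning paths: both start from the same stable matrix.
W0 = np.array([[0.5, 0.2], [0.2, 0.5]])                 # eigenvalues 0.7 and 0.3
omega_1 = [W0 * (1 + 0.02 * k) for k in range(10)]      # stays inside the region
omega_2 = [W0 * (1 + 0.10 * k) for k in range(10)]      # crosses the boundary
u = np.ones(2)

print(first_unstable_update(omega_1), first_unstable_update(omega_2))
print(activity_norms(omega_1[-1], u, 200)[-1])          # settles to a fixed value
print(activity_norms(omega_2[-1], u, 200)[-1])          # keeps growing
```

For a nonlinear network the same check would apply to the Jacobian at a fixed point rather than to W itself.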

Table 1.

Coefficients of characteristic polynomials.

For matrices of size N, the coefficients of the characteristic polynomials are indicated as a function of the values of the matrix. For each size, the results are shown for the cases of biological and artificial feedforward connections. We denote κ₁ = α₁₁ + α₂₂ + α₃₃, κ₂ = α₁₂α₂₁ + α₂₃α₃₂, κ₃ = α₁₂α₂₃α₃₁, κ₄ = α₁₁α₂₂ + α₁₁α₃₃ + α₂₂α₃₃, κ₅ = α₁₁α₂₃α₃₂ + α₃₃α₁₂α₂₁, κ₆ = α₁₁α₂₂α₃₃.
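
The κ shorthand can be checked numerically for N = 3. The sketch below rests on assumptions that the caption does not state: the linearized update is a plain one-step map whose matrix holds α_ij at position (i, j), and there is no skip feedforward connection (α₁₃ = 0). Under those assumptions the characteristic polynomial is λ³ − κ₁λ² + (κ₄ − κ₂)λ − (κ₆ − κ₅ + κ₃), which the code compares against `numpy.poly`. The function names are hypothetical, and the expression is not claimed to reproduce any specific entry of the table.

```python
import numpy as np

def kappas(a):
    """Return (κ1, ..., κ6) for a 3x3 array with a[i-1, j-1] = α_ij."""
    k1 = a[0, 0] + a[1, 1] + a[2, 2]
    k2 = a[0, 1] * a[1, 0] + a[1, 2] * a[2, 1]
    k3 = a[0, 1] * a[1, 2] * a[2, 0]
    k4 = a[0, 0] * a[1, 1] + a[0, 0] * a[2, 2] + a[1, 1] * a[2, 2]
    k5 = a[0, 0] * a[1, 2] * a[2, 1] + a[2, 2] * a[0, 1] * a[1, 0]
    k6 = a[0, 0] * a[1, 1] * a[2, 2]
    return k1, k2, k3, k4, k5, k6

def charpoly_from_kappas(a):
    """Coefficients of det(λI - a), highest power first, valid when α_13 = 0."""
    k1, k2, k3, k4, k5, k6 = kappas(a)
    return np.array([1.0, -k1, k4 - k2, -(k6 - k5 + k3)])

rng = np.random.default_rng(1)
alpha = rng.normal(size=(3, 3))
alpha[0, 2] = 0.0                        # assumption: no skip feedforward connection
print(charpoly_from_kappas(alpha))
print(np.poly(alpha))                    # NumPy's characteristic polynomial
```

The two printed coefficient vectors should agree up to rounding error.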

Fig 5.

Bifurcation boundaries for 2, 3, and 5 layers.

The region of the parameter space where the network is stable depends on the number of layers N, the type of feedforward transmission (B: biological with delay, A: artificial without delay), and the distance of the feedback (see Eqs (7) and (8)). For the networks presented in the lower panels (a) N = 2, (b) N = 3, (c) N = 5, we see that the stability region for the biological case is larger than or equal to that for the artificial case. The same behavior repeats for other values of N. Notation: (a) w_i = α₁₁ = α₂₂, w_p = α₁₂α₂₁. (b) w_i = α₁₁ = α₂₂ = α₃₃, w_p = α₁₂α₂₁ = α₂₃α₃₂ and α₃₁ = 0. (c) w_short = α₂₃α₃₂ = α₃₄α₄₃ = α₄₅α₅₄, w_long = α₂₃α₃₄α₄₅α₅₂. Note that in all cases, the red and blue curves intersect at the axes (w_i = 0, w_p = 0, w_short = 0 or w_long = 0). This is a consequence of the discussion presented in Section 3.2.
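
For the two-layer notation in (a), a stability test can be written down directly in one simplified setting. The sketch covers only the case where every connection is delayed by one step and the dynamics are linear, which are assumptions made here for illustration; the artificial case and the paper's Eqs (7) and (8) are not reproduced. With the update matrix [[w_i, α₂₁], [α₁₂, w_i]], the eigenvalues are w_i ± √w_p, and the network is stable when both have modulus below 1. The names `stable_two_layer` and `stability_grid` are hypothetical.

```python
import cmath
import numpy as np

def stable_two_layer(w_i, w_p):
    """True if both eigenvalues w_i ± sqrt(w_p) lie inside the unit circle."""
    root = cmath.sqrt(w_p)                   # complex when w_p < 0
    return max(abs(w_i + root), abs(w_i - root)) < 1.0

def stability_grid(w_i_values, w_p_values):
    """Boolean array with one row per w_p value and one column per w_i value."""
    return np.array(
        [[stable_two_layer(wi, wp) for wi in w_i_values] for wp in w_p_values]
    )

w_i_values = np.linspace(-1.5, 1.5, 61)
w_p_values = np.linspace(-1.5, 1.5, 61)
grid = stability_grid(w_i_values, w_p_values)
print(grid.mean())                           # fraction of the scanned square that is stable
```

The boundary of the `True` region in `grid` is the bifurcation boundary of this simplified two-layer model.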

Fig 6.

Equivalence between biological and artificial implementation.

When the network consists of feedforward connections and single-distance feedback connections, the biological and artificial implementations have the same bifurcation boundaries. For a network with feedback connections of distance q = 2, the biological implementation (top panel) is represented by a pattern of arrows (bold lines) that repeats q+1 times. The complete information about the dynamics is contained in the pattern (center panel). After a temporal contraction, the pattern is equivalent to the information flow of the artificial implementation (bottom panel). The presented theorem (see main text) shows this equivalence through the transformation of the characteristic polynomial p_B to p_A.
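
The equivalence can be probed numerically in a toy setting. The sketch assumes linear scalar layers with a uniform feedforward weight and a uniform feedback weight at a single distance q; the function names and the weight values are hypothetical. In this setting each nonzero eigenvalue of the biological update, raised to the power q+1, is an eigenvalue of the artificial update, so both spectral radii cross 1 at the same parameter values.

```python
import numpy as np

def single_distance_network(n_layers, q, ff, fb):
    """Feedforward matrix F and distance-q feedback matrix B, indexed [target, source]."""
    F = np.zeros((n_layers, n_layers))
    B = np.zeros((n_layers, n_layers))
    for l in range(1, n_layers):
        F[l, l - 1] = ff                     # layer l-1 -> layer l
    for l in range(n_layers - q):
        B[l, l + q] = fb                     # layer l+q -> layer l
    return F, B

def spectral_radius(M):
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def radii(n_layers, q, ff, fb):
    """Spectral radii of the biological and artificial linear updates."""
    F, B = single_distance_network(n_layers, q, ff, fb)
    # Biological: h_t = (F + B) h_{t-1}.
    rho_bio = spectral_radius(F + B)
    # Artificial: h_t = F h_t + B h_{t-1}, i.e. h_t = (I - F)^{-1} B h_{t-1}.
    rho_art = spectral_radius(np.linalg.solve(np.eye(n_layers) - F, B))
    return rho_bio, rho_art

for n_layers, q in [(3, 2), (3, 1), (4, 1), (5, 2)]:
    rho_bio, rho_art = radii(n_layers, q, ff=0.9, fb=0.4)
    print(n_layers, q, rho_bio ** (q + 1), rho_art)
```

The last two printed columns should agree for each network, which is the temporal contraction by a factor of q+1 expressed in terms of eigenvalues.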

Fig 7.

The dominant eigenvalue for a network with feedforward connections and feedback connections of a single distance q.

The eigenvalue in this network is proportional to the number of loops and the effective weight of the loops (f^qβ_q). When all feedback connections of distance q are considered, there are a total of N−q loops.

Fig 8.

Fully connected and layered networks.

a) Decomposition into eigenvalues and eigenstates of a fully connected network. Nodes of the same color are in-phase synchronized, the nodes with opposite colors (yellow-red) are anti-phase synchronized, and the black nodes are deactivated. b) Dominant eigenvalue in a layered network.

Table 2.

Properties of activation functions.

Fig 9.

Performance on object detection during the training process.

The results shown are on the validation set. Recurrent CNNs (a-d) were used as backbones in Faster R-CNN. (a, c) Feedforward networks with 3 and 5 layers, respectively. (b, d) Feedback connections of distance 0 (green arrows) and 1 (blue arrows) were added to networks (a) and (c). (e) During training, the validation loss of the feedforward networks evolves similarly regardless of depth (lines a and c). Adding feedback connections reduces the validation loss (lines b and d). (f) Average precision and recall for detection of objects of different sizes in images of the validation set. The initial value of each metric (epoch 1) tends to be higher as the number of layers in the network increases. However, the evolution of each metric depends on the size of the image and whether feedback connections are included. The gray line indicates the performance of the Feature Pyramid Network (FPN) pretrained on this data [35].

Fig 10.

Examples of object detection and classification results.

Predictions for five images (rows) of the evaluation set using Faster R-CNN. The backbone for Faster R-CNN is one of our four recurrent CNNs or the Feature Pyramid Network (FPN) (columns).

Fig 11.

Accuracy in classification task.

Recurrent CNNs (a-c) were used as feature extractors in the classification task. (a, c) Feedforward networks with 2 and 3 layers, respectively. (b) A feedback connection of distance 0 (green arrow) was added to network (a). During training of the networks (a-c), the accuracy on both the training set and the test set increases. Performance drops when the networks are evaluated on test-set images corrupted with Gaussian noise.

Fig 12.

Temporal dynamic of the classification network.

This simulation uses the trained network of Fig 11b. (a) Examples of t-SNE projections of trajectories in the activity space of the last layer. The activity space has dimension 10, while the projection is two-dimensional. Note that for feedforward networks without temporal dynamics, the trajectory is a single point that remains constant over time, so the stability of the dynamics during inference is guaranteed. (b) Distribution of the entropy of the output as a function of the time step during inference. (c) Performance of the network as a function of the time step during inference for low-entropy (green line) and high-entropy (red line) images. (d) Performance of the network as a function of the time step during inference. We evaluated the network on the complete test set (solid black line) and by class (dashed lines). Based on the per-class accuracy, the classes that are easy and hard for the network to classify were identified (green and red lines, respectively).
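
One way to compute an output entropy per image and per time step is sketched below. The caption does not say how the entropy was obtained, so this assumes the 10-dimensional output is turned into class probabilities with a softmax and that the Shannon entropy is measured in nats; the function names and the random logits are purely illustrative.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def output_entropy(logits):
    """Shannon entropy (nats) of softmax(logits) over the last axis."""
    p = softmax(logits)
    log_p = np.log(np.clip(p, 1e-12, None))   # clip avoids log(0)
    return -(p * log_p).sum(axis=-1)

# Hypothetical outputs: 5 time steps, 3 images, 10 classes.
rng = np.random.default_rng(2)
logits_over_time = rng.normal(size=(5, 3, 10))
entropy = output_entropy(logits_over_time)    # shape (time steps, images)
print(entropy.shape)
print(entropy.mean(axis=1))                   # mean entropy at each time step
```

Low values indicate a confident prediction, and the maximum possible value for 10 classes is log(10).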
