Architecture of the brain’s visual system enhances network stability and performance through layers, delays, and feedback

In the visual system of primates, image information propagates across successive cortical areas, and there is also local feedback within an area and long-range feedback across areas. Recent findings suggest that the resulting temporal dynamics of neural activity are crucial in several vision tasks. In contrast, artificial neural network models of vision are typically feedforward and do not capitalize on the benefits of temporal dynamics, partly due to concerns about stability and computational costs. In this study, we focus on recurrent networks with feedback connections for visual tasks with static input corresponding to a single fixation. We demonstrate mathematically that a network's dynamics can be stabilized by four key features of biological networks: layer-ordered structure, temporal delays between layers, longer-distance feedback across layers, and nonlinear neuronal responses. Conversely, when feedback has a fixed distance, one can omit delays in feedforward connections to achieve more efficient artificial implementations. We also evaluated the effect of feedback connections on object detection and classification performance using standard benchmarks, specifically the COCO and CIFAR10 datasets. Our findings indicate that feedback connections improved the detection of small objects, and classification performance became more robust to noise. We found that performance improved over the course of the temporal dynamics, similar to what is observed in the core visual system of primates. These results suggest that delays and layered organization are crucial features for stability and performance in both biological and artificial recurrent neural networks.

where $c \in C$ indicates that connection $c$ is part of the path $C$. The nodes along the path $C_{FF}$ are referred to as layers and are indexed as $l \in \{1, \ldots, N\}$, where $l = 1$ represents the input node and $l = N$ represents the output node. Within this framework, the definitions of feedforward and feedback connections align with those presented in Section 2. For the architecture we analyzed in Section 3.4, the feedforward path $C_{FF}$ is well-defined. However, in a fully connected network in which all connections have the same temporal delay $\Delta$, the path $C_{FF}$ is not well-defined, and the analysis of Section 3.4 loses meaning, as there is no notion that distinguishes feedforward from feedback connections.
We have assumed a uniform delay in all connections. However, in biological networks delays may differ, with the shortest delays typically appearing within layers. If we take these shortest delays as the unit of time and express all other delays as integer multiples of it, then the analysis carried out here could be extended to non-uniform delays by analyzing stability using the Z-transform, following [1]. If discrete delays are insufficient, one would have to switch to continuous dynamical-system approximations. In both cases, the stability analysis involves finding the solutions of the characteristic equation. In the continuous approximation, the characteristic equation is typically transcendental, unlike the case with a single delay, where it is a polynomial equation (Section 2.1). This may be the subject of future work.
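For the discrete-delay case, the stability check can be sketched numerically: the characteristic roots of a linear system with delayed terms are the eigenvalues of its block-companion matrix. A minimal numpy sketch, with arbitrary illustrative weights and delays:

```python
import numpy as np

def companion_spectral_radius(A_list):
    """Spectral radius of the block-companion matrix of the linear system
    x[t] = A_list[0] @ x[t-1] + ... + A_list[d-1] @ x[t-d].
    The zero solution is stable iff the radius is < 1, i.e. iff all roots
    of the characteristic equation lie inside the unit circle."""
    n = A_list[0].shape[0]
    d = len(A_list)
    C = np.zeros((n * d, n * d))
    C[:n, :] = np.hstack(A_list)          # top block row: the delay matrices
    C[n:, :-n] = np.eye(n * (d - 1))      # shift registers for past states
    return max(abs(np.linalg.eigvals(C)))

# One unit, feedforward weight 0.5 at delay 1, feedback weight -0.3 at delay 2:
# the characteristic equation is lambda^2 - 0.5*lambda + 0.3 = 0.
rho = companion_spectral_radius([np.array([[0.5]]), np.array([[-0.3]])])
print(rho < 1.0)   # stable
```

With non-uniform integer delays, one simply adds more matrices to the list; the companion matrix grows linearly with the maximal delay, which is the discrete analogue of the Z-transform analysis mentioned above.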

A.2 Stability analysis for layers with multiple units.
In Section 3.1, we analyzed the stability of networks with $N$ layers of a single unit each. For this reduced network, let $M^{red}_{FF}$ and $M^{red}_{FB}$ denote the weight matrices of the feedforward and feedback connections, respectively. Also, let $\Lambda_{B,red}$ and $\Lambda_{A,red}$ be the sets of eigenvalues of the matrices corresponding to the biological and artificial implementations, respectively.
Let us now assume a network with $m$ units per layer. A unit $i$ of layer $l$ can project onto any unit of layer $l+1$ with weight $\alpha_{l \to l+1}$ (feedforward connections). On the other hand, units in the same layer do not interact, and feedback connections between layers $l_1 \to l_2$ are topographic, namely, they only affect units in the same position $i$, say with weight $\alpha_{l_1 \to l_2}$. These assumptions can be summarized in the connection weight matrices $M_{FF}$ and $M_{FB}$, where $\mathbb{1}_{m \times m} \in \mathbb{R}^{m \times m}$ is the matrix with all entries equal to 1 and $\otimes$ indicates the tensor product.
For this extended network, the matrices $M_B$ and $M_A$ associated with the biological and artificial implementations can be written in terms of $M_{FF}$ and $M_{FB}$. Note that the eigenvalues of $\mathbb{1}_{m \times m}$ are 0 (with multiplicity $m-1$) and $m$ (with multiplicity 1). Then, using properties of the tensor product, the eigenvalues of $M_B$ are 0 and the eigenvalues of $M^{red}_B$ multiplied by $m$. On the other hand, the eigenvalues of $M_A$ are 0 and the eigenvalues of the matrix $(Id - mM^{red}_{FF})^{-1} M^{red}_{FB}$ multiplied by $m$. It is important to note that the eigenvalues of $(Id - mM^{red}_{FF})^{-1} M^{red}_{FB}$ are not equal to those of $M^{red}_A$ but are at least $m$ times larger. Therefore, the eigenvalues of $M_A$ are $m^2$ times larger than those of $M^{red}_A$. As mentioned above, the bifurcation boundaries for extended networks result from applying a scale factor to those of the reduced networks: $1/m$ for the biological case and $1/m^2$ for the artificial case. From the analysis presented in Section 3.1, we know that for reduced networks the area of the region of stability of the biological implementation is greater than that of the artificial implementation. This fact does not change when the scale factors are applied to the bifurcation boundaries. Essentially, we find that the stability advantage of having feedforward delays also holds for networks with many units per layer, at least for this simple topographic feedback structure.
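The eigenvalue scaling used above can be checked numerically. This is a minimal sketch assuming, as in the text, that each reduced weight is replicated over an all-ones $m \times m$ block; the reduced matrix is a random placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 3, 4                              # units per layer, number of layers
M_red = rng.normal(size=(N, N))          # placeholder reduced weight matrix

# Extended matrix: each scalar weight replicated over an all-ones m x m block
M_ext = np.kron(M_red, np.ones((m, m)))

eig_red = np.linalg.eigvals(M_red)
eig_ext = np.linalg.eigvals(M_ext)

# The N*m eigenvalues of M_ext are: 0 (with multiplicity N*(m-1)), together
# with the N eigenvalues of M_red, each multiplied by m.
top = eig_ext[np.argsort(-np.abs(eig_ext))[:N]]
assert np.allclose(np.sort_complex(top), np.sort_complex(m * eig_red))
assert np.allclose(np.sort(np.abs(eig_ext))[:N * (m - 1)], 0.0, atol=1e-9)
```

This is exactly the property of the tensor product invoked in the text: the spectrum of $A \otimes B$ is the set of products of eigenvalues of $A$ and $B$.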

A.3 Roots of a polynomial with absolute value equal to 1
Let the polynomial be $p(\lambda) = \sum_{n=0}^{N} c_n \lambda^n$ and let $\theta \in [0, 2\pi)$ be such that $p(e^{i\theta}) = 0$. Separating the real and imaginary parts of the equation, we obtain
$$0 = \sum_{n=0}^{N} c_n T_n(z), \qquad 0 = \sin(\theta) \sum_{n=1}^{N} c_n U_{n-1}(z),$$
where $z = \cos(\theta)$ and the $T_n$'s and $U_n$'s are Chebyshev polynomials of the first and second kind, respectively [2].
The equality is satisfied in two cases. The first case is $\theta = m\pi$ (i.e., $\lambda = \pm 1$), which happens when the coefficients $c_n$ satisfy $0 = \sum_{n=0}^{N} c_n (-1)^{mn}$. The other case, with $\sin(\theta) \neq 0$, is when the equations $\sum_{n=0}^{N} c_n T_n(z) = 0$ and $\sum_{n=1}^{N} c_n U_{n-1}(z) = 0$ are fulfilled simultaneously.
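The Chebyshev substitution can be verified numerically. A small sketch, with arbitrary real coefficients, using the identities $\cos(n\theta) = T_n(\cos\theta)$ and $\sin(n\theta) = \sin(\theta)\,U_{n-1}(\cos\theta)$:

```python
import numpy as np

def cheb_T(n, z):
    """Chebyshev polynomial of the first kind, T_n(z), via the recurrence
    T_{n+1} = 2 z T_n - T_{n-1}."""
    a, b = np.ones_like(z), z
    if n == 0:
        return a
    for _ in range(n - 1):
        a, b = b, 2 * z * b - a
    return b

def cheb_U(n, z):
    """Chebyshev polynomial of the second kind, U_n(z), same recurrence
    with U_0 = 1, U_1 = 2z."""
    a, b = np.ones_like(z), 2 * z
    if n == 0:
        return a
    for _ in range(n - 1):
        a, b = b, 2 * z * b - a
    return b

# Evaluate p(lambda) = sum_n c_n lambda^n on the unit circle
c = np.array([0.3, -1.2, 0.5, 1.0])      # arbitrary real coefficients
theta = 0.7
z = np.cos(theta)
p = sum(cn * np.exp(1j * theta) ** n for n, cn in enumerate(c))

re = sum(cn * cheb_T(n, z) for n, cn in enumerate(c))
im = np.sin(theta) * sum(cn * cheb_U(n - 1, z) for n, cn in enumerate(c) if n >= 1)

assert np.isclose(p.real, re) and np.isclose(p.imag, im)
```

Locating roots on the unit circle thus reduces to a real polynomial system in $z = \cos(\theta)$, which is what makes the bifurcation boundaries computable.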

A.4 Demonstration of the theorem
First, let us calculate the inverse of the matrix $-\lambda Id + M_{FF}$. Using properties of triangular matrices, we obtain an explicit expression for its entries, Eq. (3). We then calculate the characteristic polynomial $p_B(\lambda)$ from a block decomposition with blocks $S_1 \in \mathbb{R}^{q \times q}$, $S_2 \in \mathbb{R}^{q \times (N-q)}$, $S_3 \in \mathbb{R}^{(N-q) \times q}$, and $S_4 \in \mathbb{R}^{(N-q) \times (N-q)}$. Then, using the block-matrix determinant formula, $p_B(\lambda) = \det(S_1)\,\det(S_4 - S_3 S_1^{-1} S_2)$. Since $S_1$ is a lower triangular matrix, $\det(S_1) = (-\lambda)^q$. On the other hand, the product $S_3 S_1^{-1} S_2$ has entries $f_j = \alpha_{q,q+1}\,\alpha_{q+j,j}\,(S_1^{-1})_{qj}$. Moreover, $(S_1^{-1})_{ij}$ can be calculated using Eq. (3), and we obtain $f_j = -\left(\tfrac{1}{\lambda}\right)^{q-j+1} \alpha_{q+j,j} \prod_{k=j}^{q} \alpha_{k,k+1}$. For now, we focus on the case $q > \tfrac{N}{2} - 1$. Then, calculating by columns, we obtain the matrix $F_B$ using Eq. (3); the null blocks have $q$ columns and the block $L \in \mathbb{R}^{(N-q) \times (N-q)}$ is defined by $L_{ij} = r_i f_j$, where $r_i = \prod_{k=q}^{i+q-1} \alpha_{k,k+1}$ and $f_j = \alpha_{j+q,j} \prod_{k=j}^{q-1} \alpha_{k,k+1}$. Under the hypothesis that $q > \tfrac{N}{2} - 1$, we have $\min(N-q, i+q) = N-q$, and we can express $L = \vec{r}\,\vec{f}^{\,T}$ with $\vec{f} \neq 0$. Then $L$ has $N-q-1$ independent eigenvectors associated with the eigenvalue $\lambda = 0$, and $\vec{r}$ is an eigenvector associated with the eigenvalue $\vec{f}^{\,T}\vec{r}$. Table A shows some examples of the theorem.

A.5 Faster R-CNN
Faster Region-based Convolutional Neural Network (Faster R-CNN) [3] is one of the top models used for object detection. It consists of two modules: a) a Region Proposal Module (RPM) and b) a Detector Module (DM). The RPM selects the parts of the image that most likely contain an object (called region proposals). Each of these regions is then processed by the DM, in which a classifier labels each region proposal and a regressor refines the bounding boxes. Before the two modules, Faster R-CNN uses a CNN (called the backbone) to transform the input image into a feature map with dimensions $H \times W \times Ch$ (e.g., $Ch = 512$). A schematic of the Faster R-CNN architecture is shown in Fig A. The goal of the RPM is to learn whether an object is present in the input image at each location and to estimate its size. To do this, the network uses a set of $n_{anch}$ anchors on the input image for each location on the feature map. These anchors represent possible objects of various sizes and aspect ratios at that location. The RPM consists of a $3 \times 3$ convolution with 512 channels (padding, stride = 1) followed by two independent branches. The first branch is a $1 \times 1$ convolution with $2 n_{anch}$ channels whose output ($H \times W \times n_{anch} \times 2$) is associated with the probabilities that a feature-map point contains an object, for each of the anchors (confidence scores). The other branch is a $1 \times 1$ convolution with $4 n_{anch}$ channels whose output ($H \times W \times n_{anch} \times 4$) corresponds to the 4 regression coefficients of each of the anchors ($x, y$ coordinates of the center, height, width) for every point on the feature map. These regression coefficients are used to refine the predicted position and size of the anchors that contain objects.
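The anchor construction can be sketched as follows; the stride, scales, and aspect ratios below are the illustrative defaults of the original Faster R-CNN paper, not values taken from our experiments:

```python
import numpy as np

def make_anchors(H, W, stride=16, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate n_anch = len(scales) * len(ratios) anchors (cx, cy, w, h) per
    feature-map location, in input-image coordinates. An anchor of scale s and
    ratio r has area s**2 and aspect ratio w/h = r."""
    anchors = []
    for y in range(H):
        for x in range(W):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    anchors.append((cx, cy, s * np.sqrt(r), s / np.sqrt(r)))
    return np.array(anchors)              # shape: (H * W * n_anch, 4)

anchors = make_anchors(2, 3)
print(anchors.shape)   # (54, 4): H * W * n_anch boxes with n_anch = 9
```

Each anchor is later scored by the confidence branch and adjusted by the four regression coefficients of the regression branch.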
At this point, the RPM returns a list of $H \cdot W \cdot n_{anch}$ boxes with their corresponding confidence scores and regression coefficients. From this list, only the boxes that match specific selection criteria are used by the DM. For example, boxes with height/width smaller than a threshold, or boxes that cross the image boundary, are ignored. Another way to filter the boxes is to compare the confidence score with a threshold. A high confidence score is associated with anchor boxes that probably contain an object, while a low score indicates that the anchor box contains no object (background). However, if the score is somewhere in between, the anchor box could contain a partial object and is not a good reference to locate the object; it is therefore deleted. One last option is to use Non-Max Suppression (NMS), which identifies boxes with high IoU and removes the anchor box with the lower objectness score. Usually, filtering gives about ∼2k proposals per image.

Table A. Comparison between the characteristic polynomials for networks with $N$ layers and only feedback connections of distance $q$. For all $N$, in the case $q = 0$ one obtains $g(\lambda) = \prod_{i=1}^{N} (\lambda - \alpha_{ii})$. Note that in all cases Eq. (9) is verified. The coefficients of $g$ depend on the entries of the matrices involved; however, the order of $g$ only depends on $N$ and $q$. In the example, we show the scheme for the biological and artificial implementations for a network with $N = 3$ layers and feedback connections of distance $q = 2$ (see highlighted row). For this case, the polynomial $g$ is linear (order 1).
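The NMS filtering step mentioned above can be sketched with a minimal greedy implementation; the box format ($x_1, y_1, x_2, y_2$ corners) and the threshold value are illustrative assumptions:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy Non-Max Suppression: keep the highest-scoring box, drop every
    remaining box that overlaps it above iou_thresh, and repeat."""
    order = np.argsort(-scores)           # indices, best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return keep
```

For example, of two heavily overlapping boxes only the higher-scoring one survives, while a distant box is kept regardless of its score.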
The Detector Module consists of an ROI pooling layer and fully connected layers followed by two branches for classification and bounding-box regression. It uses a number $N_{prop}$ of proposals from the RPM depending on whether it is in the training or evaluation stage. In the training stage, all proposals are used, while in the validation and test stages only the top $N_{prop} \sim 10$–$100$ proposals are selected. For each proposal, the ROI pooling layer takes the corresponding region ($h_r \times w_r$) from the feature map. Then, it divides this region into a fixed number of sub-windows (e.g., $49 = 7 \times 7$ windows) and applies max-pooling over each sub-window. The output of the ROI pooling layer has a fixed size ($N_{prop} \times 7 \times 7 \times Ch$), which is flattened out ($N_{prop} \times 25088$). Finally, this result goes through two fully connected layers and is fed into the classification and regression branches. The classification layer has $C$ units (one per class in the detection task), while the bounding-box regressor consists of $4C$ units ...
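The ROI max-pooling computation can be sketched as follows; integer ROI coordinates on the feature map are a simplifying assumption (real implementations quantize or interpolate):

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=7):
    """Crude ROI max-pooling: split the ROI (r1, c1, r2, c2) of the feature
    map into out_size x out_size sub-windows and take the max of each, so
    regions of any size map to a fixed out_size x out_size grid."""
    r1, c1, r2, c2 = roi
    region = feature_map[r1:r2, c1:c2]
    h, w = region.shape[:2]
    out = np.full((out_size, out_size) + region.shape[2:], -np.inf)
    rows = np.linspace(0, h, out_size + 1).astype(int)
    cols = np.linspace(0, w, out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            # force at least one row/column per sub-window
            sub = region[rows[i]:max(rows[i + 1], rows[i] + 1),
                         cols[j]:max(cols[j + 1], cols[j] + 1)]
            out[i, j] = sub.max(axis=(0, 1))
    return out
```

A $14 \times 14$ region, for instance, is reduced to a $7 \times 7$ grid by max-pooling over $2 \times 2$ sub-windows, which is what gives the DM a fixed-size input regardless of proposal size.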

A.6 ResNet stages
The ResNet architecture is a type of feedforward neural network consisting of a set of building blocks [4]. Each block is made up of 2 or 3 convolutions and a residual connection. A set of blocks that operate sequentially is called a 'ResNet stage'. Table B shows the architecture of ResNet-50 and ResNet-18 that we use for our recurrent CNNs. In Eq. (6), $F_L$ represents a 'ResNet stage'. For example, $F_1$ for ResNet-50 is the combined operation of applying a $7 \times 7$ convolution (64 filters) and $3 \times 3$ max-pooling.
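A basic block and a stage can be sketched in plain numpy; this is a single-channel toy version without batch normalization or striding, purely to show the block structure (two convolutions plus the identity shortcut):

```python
import numpy as np

def conv3x3(x, w):
    """'Same' 3x3 convolution of a single-channel map x with kernel w (stride 1)."""
    xp = np.pad(x, 1)
    return sum(w[i, j] * xp[i:i + x.shape[0], j:j + x.shape[1]]
               for i in range(3) for j in range(3))

def basic_block(x, w1, w2):
    """ResNet-18-style basic block: two 3x3 convolutions with ReLU,
    plus the identity shortcut added before the final ReLU."""
    h = np.maximum(conv3x3(x, w1), 0)
    return np.maximum(conv3x3(h, w2) + x, 0)

def stage(x, weights):
    """A 'ResNet stage': basic blocks applied sequentially."""
    for w1, w2 in weights:
        x = basic_block(x, w1, w2)
    return x
```

With both kernels set to zero, the block reduces to the shortcut, $\mathrm{ReLU}(x)$, which is the property that makes deep stacks of such blocks easy to optimize.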

A.7 Metrics
On the other hand, the area under the recall–IoU curve is $AR_{s,c} = \int_{0.5}^{1} r(s, \mu, c)\, d\mu$, and the Mean Average Recall ($mAR_s$) is defined as the mean of $AR_{s,c}$ across all classes. In Section 3.7, we show the performance of the CNNs in terms of mAP and mAR. For these metrics, all confidence intervals (95%) were estimated using a bootstrap procedure, resulting in a relative standard error of less than 2%.
Both mAR and mAP can be calculated on the set of all detected objects or on a subset of them. For example, we can separate objects based on their size: a 'small object' covers an area of less than $32^2$ pixels, 'medium objects' cover an area between $32^2$ and $96^2$ pixels, and 'large objects' cover an area greater than $96^2$ pixels.
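These definitions can be sketched directly; `average_recall` numerically approximates the integral $AR_{s,c} = \int_{0.5}^{1} r(s,\mu,c)\, d\mu$ with the trapezoidal rule, and the recall values in the example are placeholders:

```python
import numpy as np

def size_bucket(area):
    """COCO-style size bucket for a ground-truth box area, in pixels."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

def average_recall(recall, ious):
    """AR = integral of recall over the IoU thresholds (trapezoidal rule),
    matching AR_{s,c} = int_{0.5}^{1} r(s, mu, c) d mu."""
    recall, ious = np.asarray(recall, float), np.asarray(ious, float)
    return float(np.sum((recall[1:] + recall[:-1]) / 2 * np.diff(ious)))

# A perfect detector (recall 1 at every IoU threshold) attains AR = 0.5,
# the length of the integration interval, under this (unnormalized) definition.
ious = np.linspace(0.5, 1.0, 11)
print(np.isclose(average_recall(np.ones(11), ious), 0.5))   # True
```

mAR is then the mean of `average_recall` over classes, optionally restricted to detections whose ground truth falls in one `size_bucket`.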

Fig A. Faster R-CNN architecture. The model consists of a backbone (CNN) and two main modules: a) a Region Proposal Module (RPM) and b) a Detector Module (DM). The goal of the RPM is to detect the objects, while the DM takes care of classifying them. Details about the components are presented in the main text. Abbreviations: FC (m): fully connected layer with $m$ units.

Fig B. Scenarios in object detection. Boxes with solid lines indicate the ground truth and boxes with dashed lines indicate the predictions. Blue and red colors indicate different classes (e.g., dogs and cats). In the main text, it is indicated whether the predictions correspond to TP, FP, or FN. Note that in (b) the prediction is an FP case and, simultaneously, there is an FN case. On the other hand, in (d) the prediction is an FP for the red class, but there is an FN case for the blue class.