Reconstructing feedback representations in ventral visual pathway with a generative adversarial autoencoder

While vision evokes a dense network of feedforward and feedback neural processes in the brain, visual processes are primarily modeled with feedforward hierarchical neural networks, leaving the computational role of feedback processes poorly understood. Here, we developed a generative autoencoder neural network model and adversarially trained it on a categorically diverse data set of images. We hypothesized that the feedback processes in the ventral visual pathway can be represented by reconstruction of the visual information performed by the generative model. We compared representational similarity of the activity patterns in the proposed model with temporal (magnetoencephalography) and spatial (functional magnetic resonance imaging) visual brain responses. The proposed generative model identified two segregated neural dynamics in the visual brain: a temporal hierarchy of processes transforming low-level visual information into high-level semantics in the feedforward sweep, and a temporally later dynamic of inverse processes reconstructing low-level visual information from a high-level latent representation in the feedback sweep. Our results add to previous studies on neural feedback processes by presenting new insight into the algorithmic function and the information carried by the feedback processes in the ventral visual pathway.

Author summary

It has been shown that the ventral visual cortex consists of a dense network of regions with feedforward and feedback connections. The feedforward path processes visual inputs along a hierarchy of cortical areas that starts in early visual cortex (an area tuned to low-level features, e.g. edges/corners) and ends in inferior temporal cortex (an area that responds to higher-level categorical content, e.g. faces/objects). The feedback connections, in turn, modulate neuronal responses in this hierarchy by broadcasting information from higher to lower areas.
In recent years, deep neural network models trained on object recognition tasks have achieved human-level performance and shown activation patterns similar to those of the visual brain. In this work, we developed a generative neural network model that consists of encoding and decoding sub-networks. By comparing this computational model with temporal (magnetoencephalography) and spatial (functional magnetic resonance imaging) human brain response patterns, we found that the encoder processes resemble the brain's feedforward processing dynamics and the decoder shares similarity with the brain's feedback processing dynamics. These results provide an algorithmic insight into the spatiotemporal dynamics of feedforward and feedback processes in biological vision.

Introduction

... The decoder sub-network receives the latent representation and aims to reproduce the visual input from the information encoded in the latent representation.

This generative model enables us not only to investigate the encoding process of visual representations along the hierarchy of the encoder sub-network layers, but also provides insight into the reverse process, i.e. reconstructing the representations along the decoder sub-network layers. We hypothesize that the visual information in the encoder sub-network of our computational model mimics the feedforward pathway in the ventral visual stream, and that the decoder sub-network, which performs the reverse function, may reveal the representations along the feedback pathway. To test this hypothesis, after training the proposed model, we compare the representations along its layers with magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI) data acquired from fifteen human participants in a visual recognition experiment [13].

Our model identified two separate dynamics of representational similarities with the MEG temporal data. The first is consistent with a temporal hierarchy of processes transforming low-level visual information into high-level semantics in the feedforward sweep, and the second reveals a temporally subsequent dynamic of inverse processes reconstructing low-level visual information from a high-level latent representation in the feedback sweep. Further, comparison of encoder and decoder representations with two fMRI regions of interest, namely EVC and IT, revealed a growing categorical representation along the encoder layers (feedforward sweep) similar to IT, and a progression of detailed visual representations along the decoder layers (feedback sweep) akin to EVC.

Construction of a generative model performing image compression and reconstruction

... visual cortex are critical to resolving visual recognition in the brain [12,23,24]. Therefore, these feedforward deep neural network models do not fully represent the complex visual processes in the ventral visual pathway. Here, we investigate whether a deep generative model trained to compress and reconstruct images could reveal representations similar to the feedforward and feedback processes in the ventral visual pathway. With this aim, we developed a deep generative autoencoder neural network model using the adversarial autoencoder (AAE) framework [34]. An AAE is a generative adversarial network (GAN) [35] in which the generator has an autoencoder architecture. The autoencoder is trained on dual objectives - a reconstruction loss criterion and an adversarial criterion. Training with dual objectives turns the autoencoder into a generative model whose latent space learns properties of the data distribution, which enables the generative process and avoids overfitting to the reconstruction objective. We hypothesize that the encoder sub-network models the feedforward pathway of processes in the ventral visual stream, while the decoder sub-network models the reconstruction of visual features in the feedback pathway.
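The dual-objective training described above can be sketched numerically. The snippet below is a minimal numpy illustration of the two criteria - a pixel-wise reconstruction loss on the autoencoder output, and the adversarial losses that push the encoder's latent codes toward the prior. It is not the paper's actual implementation; the function names and the choice of mean squared error and log-loss are illustrative assumptions.

```python
import numpy as np

def reconstruction_loss(x, x_hat):
    # Reconstruction criterion: pixel-wise mean squared error between
    # the input image x and the decoder's reconstruction x_hat.
    return np.mean((x - x_hat) ** 2)

def adversarial_losses(d_prior, d_code, eps=1e-8):
    # Adversarial criterion on the AAE latent space.
    # d_prior: discriminator outputs on samples drawn from the prior p(z)
    # d_code:  discriminator outputs on latent codes produced by the encoder
    # The discriminator learns to tell prior samples from encoder codes;
    # the encoder is trained to make its codes indistinguishable from p(z).
    d_loss = -np.mean(np.log(d_prior + eps) + np.log(1.0 - d_code + eps))
    g_loss = -np.mean(np.log(d_code + eps))
    return d_loss, g_loss
```

In each training step the autoencoder would be updated on `reconstruction_loss`, the discriminator on `d_loss`, and the encoder additionally on `g_loss`, yielding the generative latent space described above.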

78
To train our model, we assembled a super-category data set (see Materials and methods section for details). The super-category data set includes 1,980,000 images from four equally distributed categories: (1) Faces, (2) Animates, (3) Objects, and (4) Scenes. The rationale behind assembling and using this data set is two-fold: (1) ecologically, the human brain learns to develop high-level category representations across multiple recognition tasks (e.g. faces, animals, ...).

In this study, we compare model representations with brain imaging data (fMRI and MEG) from a visual recognition experiment [13]. We maintained consistency with the four categories from the stimulus set utilized in the brain imaging experiment, which includes 156 images. Please note that this stimulus set was ...

Representational similarity of the generative model to early and late brain regions in the ventral visual stream

We first determined the encoder/decoder representational similarities with early and late brain regions along the ventral visual cortex. For this, we chose early visual cortex (EVC) and inferior temporal cortex (IT), defined anatomically.

Figure: Computational model architecture. The model is a generative adversarial network. The generator is an autoencoder consisting of five convolutional blocks (E1-E5) and one fully connected layer (E6) in the encoder, and one fully connected layer (D1) followed by five deconvolutional blocks (D2-D6) in the decoder. Each convolutional block encompasses batch normalization, convolution, a nonlinear activation function, and pooling operations. Each deconvolutional block encompasses batch normalization, deconvolution, a nonlinear activation function, and upsampling operations. The discriminator consists of two fully connected layers. The training data set consists of 1,980,000 images organized into four super-ordinate categories: (i) Faces, (ii) Animates, (iii) Objects, (iv) Scenes. LV denotes the latent vector generated by the encoder and DL is a one-hot data set label (one of the four mentioned training data sets). Both vectors are concatenated and fed to the discriminator, while only the latent vector is fed to the decoder.
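The block structure in the figure caption can be captured with a small shape walk-through. The sketch below only bookkeeps tensor shapes through the blocks described above (E1-E5 convolutional blocks with pooling, E6 fully connected; D1 fully connected, D2-D6 deconvolutional blocks with upsampling, plus the latent-vector/label concatenation fed to the discriminator). The input size, channel widths, latent dimensionality, and 2x pooling/upsampling factors are illustrative assumptions, not values from the paper.

```python
INPUT = (3, 64, 64)                   # assumed RGB input (channels, height, width)
ENC_WIDTHS = (32, 64, 128, 256, 512)  # assumed channel widths of E1-E5
DEC_WIDTHS = (256, 128, 64, 32, 3)    # assumed channel widths of D2-D6
LATENT = 100                          # assumed latent vector (LV) size
N_DATASETS = 4                        # one-hot data set label (DL) size

def conv_block(shape, out_ch):
    # E-block: batch norm -> convolution (size-preserving) -> nonlinearity
    # -> 2x2 pooling, which halves the spatial dimensions.
    _, h, w = shape
    return (out_ch, h // 2, w // 2)

def deconv_block(shape, out_ch):
    # D-block: batch norm -> deconvolution -> nonlinearity -> 2x upsampling,
    # which doubles the spatial dimensions.
    _, h, w = shape
    return (out_ch, h * 2, w * 2)

def encode(shape):
    for out_ch in ENC_WIDTHS:         # E1-E5
        shape = conv_block(shape, out_ch)
    return LATENT                     # E6: flatten + fully connected -> LV

def decode(latent_dim):
    shape = (ENC_WIDTHS[-1], 2, 2)    # D1: fully connected -> small feature map
    for out_ch in DEC_WIDTHS:         # D2-D6
        shape = deconv_block(shape, out_ch)
    return shape

def discriminator_input_size(latent_dim):
    # LV concatenated with the one-hot data set label DL.
    return latent_dim + N_DATASETS
```

Under these assumed sizes, five halvings take a 64 x 64 input down to a 2 x 2 feature map before the latent bottleneck, and five doublings in the decoder restore the input resolution.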
We used the representational similarity analysis (RSA) method [37,38] as the integrative framework for model-brain comparisons.

For each region of interest (ROI), we extracted the fMRI response patterns to each image, vectorized them, and computed condition-specific pairwise distances (1 - Pearson's R) to create a 156 x 156 representational dissimilarity matrix (RDM) per participant. We also fed the images to the generator, extracted layer activations, and created layer-specific RDMs (see Figure 3 and Materials and methods section for details). We then compared subject-specific ROI RDMs with the model layer ...

The fMRI and MEG data used in this study have been published previously in [13] and are publicly available at http://twinsetfusion.csail.mit.edu/. The preprocessing of the functional data included slice-time correction, realignment and co-registration to the first-session T1 structural scan, and normalization to the standard MNI space. For multivariate analysis, we did not smooth the data. The MEG data were preprocessed for temporal source space separation and head movement correction [40,41].
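The RDM construction above can be sketched in a few lines of numpy. The snippet below builds a 1 - Pearson's R dissimilarity matrix from vectorized response patterns and compares two RDMs via the Spearman correlation of their upper triangles - a common choice in RSA; the exact comparison statistic used by the paper is not specified in this excerpt.

```python
import numpy as np

def compute_rdm(patterns):
    # patterns: (n_conditions, n_features) array, one vectorized response
    # pattern per image. np.corrcoef correlates rows, so the result is an
    # (n_conditions, n_conditions) matrix of 1 - Pearson's R.
    return 1.0 - np.corrcoef(patterns)

def compare_rdms(rdm_a, rdm_b):
    # RDMs are symmetric with a zero diagonal, so only the upper triangle
    # carries information. Spearman = Pearson correlation of the ranks.
    iu = np.triu_indices_from(rdm_a, k=1)
    ranks_a = np.argsort(np.argsort(rdm_a[iu]))
    ranks_b = np.argsort(np.argsort(rdm_b[iu]))
    return np.corrcoef(ranks_a, ranks_b)[0, 1]
```

With 156 stimuli, `compute_rdm` applied to a (156, n_voxels) pattern matrix yields the 156 x 156 per-participant ROI RDM described above; applying the same function to layer activations yields the model's layer-specific RDMs.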
The decoder takes the latent code and aims to reproduce the input data. Alternatively, the GAN framework is a min-max adversarial game between two distinct neural networks: (i) the generator (G), which aims at generating synthetic data by learning the distribution of the real data, and (ii) the discriminator (D), which aims at distinguishing the generator's fake data from real data. The generator uses a function G(z) that maps samples z from the prior p(z) (normal distribution) to the data space p(x). G(z) is trained to maximally confuse the discriminator into believing that the samples it generates come from the data distribution. The solution to this game can be expressed as follows [35]:

min_G max_D E_{x ~ p_data(x)}[log D(x)] + E_{z ~ p(z)}[log(1 - D(G(z)))]

We chose an AE architecture because we hypothesize that the encoder embodies neuronal characteristics similar to those of image classification DNNs and thereby could resemble the human brain's feedforward representations.

Alternatively, we hypothesize that the decoder part of the AE architecture, which generates the image from the latent-space code, would encompass neuronal representations similar to feedback processes in the human visual brain.

... and tested against zero for statistical significance.
We used nonparametric statistical test methods, which make no assumptions about the distribution of the data [50,51]. For statistical inference on the correlation time series, we used permutation-based cluster-size inference with a null hypothesis of zero. For statistical assessment of peak latencies, we bootstrapped the subject-specific correlation time series 1000 times to estimate an empirical distribution over peak latencies [12,29,43].
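The bootstrap procedure for peak latencies can be sketched as follows. This is a minimal illustration under the assumption that each bootstrap sample redraws subjects with replacement and takes the peak of the resampled group mean; the function and parameter names are illustrative.

```python
import numpy as np

def bootstrap_peak_latency(subject_series, times, n_boot=1000, seed=0):
    # subject_series: (n_subjects, n_times) subject-specific correlation
    # time series; times: (n_times,) latencies in seconds.
    # Each bootstrap iteration resamples subjects with replacement,
    # recomputes the group-average time series, and records the latency
    # of its maximum, yielding an empirical distribution of peak latencies.
    rng = np.random.default_rng(seed)
    n_subjects = subject_series.shape[0]
    peak_latencies = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_subjects, size=n_subjects)
        group_mean = subject_series[idx].mean(axis=0)
        peak_latencies[b] = times[np.argmax(group_mean)]
    return peak_latencies
```

Percentiles of the returned distribution (e.g. the 2.5th and 97.5th) give confidence intervals on the peak latency, and comparing two such distributions supports latency comparisons between model layers or brain regions.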