MouseNet: A biologically constrained convolutional neural network model for the mouse visual cortex

Convolutional neural networks trained on object recognition derive inspiration from the neural architecture of the visual system in mammals, and have been used as models of the feedforward computation performed in the primate ventral stream. In contrast to the deep hierarchical organization of primates, the visual system of the mouse has a shallower arrangement. Since mice and primates are both capable of visually guided behavior, this raises questions about the role of architecture in neural computation. In this work, we introduce a novel framework for building a biologically constrained convolutional neural network model of the mouse visual cortex. The architecture and structural parameters of the network are derived from experimental measurements, specifically the 100-micrometer resolution interareal connectome, the estimates of numbers of neurons in each area and cortical layer, and the statistics of connections between cortical layers. This network is constructed to support detailed task-optimized models of mouse visual cortex, with neural populations that can be compared to specific corresponding populations in the mouse brain. Using a well-studied image classification task as our working example, we demonstrate the computational capability of this mouse-sized network. Given its relatively small size, MouseNet achieves roughly two-thirds of the performance level of VGG16 on ImageNet. In combination with the large-scale Allen Brain Observatory Visual Coding dataset, we use representational similarity analysis to quantify the extent to which MouseNet recapitulates the neural representation in mouse visual cortex. Importantly, we provide evidence that optimizing for task performance does not improve similarity to the corresponding biological system beyond a certain point. We demonstrate that the distributions of some physiological quantities are closer to the observed distributions in the mouse brain after task training.
We encourage the use of the MouseNet architecture by making the code freely available.


Introduction
Convolutional neural networks (CNNs) trained on object recognition were originally inspired by the visual processing in cats [1,2], and have been used as models of feedforward computation performed in the primate ventral stream [3][4][5].
Indeed, the activity in these networks often resembles activity recorded from areas of the primate visual system, from oriented Gabor-like features in early layers [6] to responses to curves and more complex geometries [7] and even functional, or representational, similarity at the population level [8,9]. Task-trained artificial neural networks have been shown to produce similar neural representations or develop predictive models of neural activity in visual [10][11][12], auditory [13], rodent whisker areas [14], and more [15][16][17]. Despite these successes and the clear power of CNNs to solve machine learning problems in the visual domain, among others [6,18], they are not structural or architectural analogues for the underlying biological circuits. Recent endeavors [19,20] show that imposing brain like structure such as shallowness and recurrence in network models can improve their functional similarity to the primate brain. The interplay of architecture and computation remains an open problem in both machine learning and neuroscience.
This issue is especially pronounced for studies of mouse visual cortex, a field undergoing explosive growth. Large scale tract tracing data sets have revealed neuro-anatomical structure in unprecedented detail [21][22][23][24]. From these efforts we have learned, in contrast to the hierarchical organization of primates, that the visual system of the mouse has a much more parallel structure [25]. Since rodents are capable of visually guided behavior including invariant object recognition [26,27], this raises questions about the role of architecture in neural computation. Recently, data from a large-scale physiological survey of neural activity in the mouse visual system [28] was used to compare the representations of visual stimuli in cortex with those of modern deep networks [29][30][31]. It was found that even purportedly "early" regions such as V1 in mouse cortex are more similar at the level of representation to middle layers of networks such as VGG16, rather than to early layers that respond optimally to simple visual features and bear more resemblance to the "simple" and "complex" cells normally supposed to describe the early visual pathway. However, the profound difference in architecture between modern CNNs and the mouse cortex raises significant challenges in interpreting these findings. To begin, many modern computational models of vision (in particular CNNs, which often have a high input resolution) have a larger number of units than the number of neurons in mouse visual cortex. Moreover, CNNs from computational vision are largely of feedforward type, either purely so or with some skip connections (e.g., in ResNet architectures), which ignores the large amount of recurrence present in real neural circuits. Furthermore, the mouse thalamo-cortical system is quite shallow [25]. Most importantly, as stated above and detailed more below, the mouse visual cortex has an intriguing parallel structure.
Here we introduce a novel framework for incorporating these data to build a biologically constrained convolutional neural network model of the mouse visual cortex, which we call MouseNet.
The structural parameters of MouseNet are derived from experimental measurements, specifically estimates of numbers of neurons in each area and cortical layer, the 100-micrometer resolution interareal connectome, and the statistics of connections between cortical layers.
MouseNet is constructed to support detailed task-optimized models of mouse visual cortex, with neural populations that can be compared to specific corresponding populations in the mouse brain. In this work, we use a well established image classification task as a working example to demonstrate the usage of the MouseNet and to show the combined functional effects of adding all the currently known biological constraints to the architecture by contrasting the MouseNet results with VGG16, a typical artificial neural network without any biological constraints. We leave investigation of the specific functional effects of each single constraint to future work.
We train MouseNet to perform classification using the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) [32] as well as the CIFAR10 [33] data sets. We find that, although MouseNet is much smaller than a typical CNN and has specific architectural differences, it can reach above 90% validation accuracy on CIFAR10, and roughly 2/3rds of the performance level of a typical CNN (VGG16) on the more challenging ImageNet classification benchmark.
Next, using the large-scale functional data sets from the Allen Brain Observatory [28] on visual responses of neurons across visual cortex, we investigate the functional properties of the MouseNet architecture after training on the ImageNet dataset. We use representational similarity analysis [29,34,35] to investigate the relative effects of task-training on the different computational layers in the model. We see that ImageNet classification training of MouseNet makes responses in its corresponding level of layers more similar to responses recorded from the mouse brain.
We then contrast these results for the biologically constrained MouseNet with those for a standard CNN network, VGG16, trained on the same task. We show that the representational similarity of MouseNet to the mouse brain is comparable to that of VGG16, even though VGG16 produces significantly higher task performance.
We study the training process for both networks, and find that the highest representational similarity [29,34] between a model neural network and the mouse brain areas is not necessarily achieved by the best performing models, but rather at early or intermediate points during the training process. We take this as an indication that image classification using ImageNet is not the appropriate task to describe the mouse visual cortex (or at least those regions measured in the Allen Brain Observatory), rather than as a failure of the task-training approach. This conclusion is perhaps to be expected. However, we feel that the use of object recognition is an important baseline for comparison with established results in primates.
Furthermore, in addition to broad measures of representational similarity across images, we also demonstrate the effect of image classification training on MouseNet by showing how it affects the other functional properties such as lifetime sparseness and orientation selectivity index [28]. We find that training drives both of these properties to become more similar between MouseNet and the biological mouse brain. Finally, by comparing both VGG16 and MouseNet representations in individual layers before and after training, we find that the image classification task makes MouseNet layers more diverse after training, a phenomenon we attribute to the parallel pathways in the MouseNet architecture.
Overall, we describe an open framework for constructing MouseNet that is general and can be easily modified to incorporate new data on the structure of the mouse brain [36] and to study the functional significance of individual structural properties (such as connection densities) in future work. Likewise, MouseNet can be readily trained on other tasks [37], including those corresponding more closely to natural behavior. We encourage future research along these lines by making the Python code publicly available at https://github.com/mabuice/Mouse_CNN/tree/v0, together with the step-by-step description of the model construction that we present next.

Methods
In this section, we introduce our framework for constructing MouseNet. Fig 1 shows an overview of this framework. The basic idea is to use available sources of anatomical data (e.g. tract tracing data, cell counts, and statistics of short-range connections) to constrain the CNN network structure and architectural hyperparameters. We discuss the details of each step below.

Network architecture
MouseNet spans the dorsal lateral geniculate nucleus (dLGN) and seven visual areas (Fig 2A). Input to the network passes first through dLGN, and then to the primary visual area VISp. After VISp, the architecture branches into five parallel pathways, representing five secondary lateral visual areas: VISl (lateral visual area), VISal (anterolateral), VISpl (posterolateral), VISli (laterointermediate), and VISrl (rostrolateral). Finally, the output of VISp together with that of all five lateral visual areas provides input to VISpor (postrhinal). We include only the lateral areas because they are more closely associated with object recognition, while the medial areas are more involved in multimodal integration [42]. The three-level architecture among the VIS areas was derived from an analysis of the hierarchy of mouse cortical and thalamic areas (Fig 6e in [25]), which considered feedforward and feedback connection structures in each area. In this analysis, VISp was clearly low in the hierarchy, and VISpor was clearly high, but the other lateral visual areas had similar intermediate positions.
In the MouseNet model, each VIS area is represented by three separate cortical layers: layer 4 (L4), layer 2/3 (L2/3) and layer 5 (L5). We call a specific cortical layer within a specific area a "region". Here we only consider the feedforward pathway, thought in primates to drive responses within ~100 ms of stimulus presentation [4,10]. Following depictions of the canonical microcircuit (e.g. as summarized in Fig 5 in [43]), we consider the feedforward pathway to consist of laminar connections from L4 to L2/3, and from L2/3 to L5. Input from other areas feeds into L4, and all of L4, L2/3 and L5 output to downstream areas, as shown in Fig 2B. This is consistent with broad connectivity among visual areas from each of these layers (Fig 2f of [25]). Fig 2C shows the MouseNet architecture in full detail, including all 22 regions and associated connections.

From architecture to convolutional neural net
Similar to the CALC model for the primate visual cortex by one of the authors [44], the general idea is to use convolution (Conv) operations to model the projections between different regions in the mouse visual cortex. Conv operations are linear combinations of many inputs, so they impose the assumption of linear synaptic integration. They are widely used in machine learning because they run efficiently on graphics processing units and because they share parameters across the visual field, leading to reduced memory requirements and faster learning relative to general linear maps.
Each connection from source brain region i to target brain region j is modelled with a Conv operation, $\mathrm{Conv}_{ij}$. The input to $\mathrm{Conv}_{ij}$ corresponds to the neural activities in source region i, and the output of $\mathrm{Conv}_{ij}$ drives neural activities in the target region j. For example, as shown in Fig 3A, the projection from Region 1 to Region 2 ($\mathrm{Proj}_{1 \to 2}$) is modeled by $\mathrm{Conv}_{12}$. The neural activities in Region 1 correspond to the input to $\mathrm{Conv}_{12}$, while the neural activities in Region 2 are a nonlinear function (ReLU, as shown in Fig 3C) of the output of $\mathrm{Conv}_{12}$. In MouseNet, L4 of all areas except VISp receives multiple converging inputs, similar to Region 4 in Fig 3A. In this case, each projection ($\mathrm{Proj}_{2 \to 4}$ and $\mathrm{Proj}_{3 \to 4}$) is modeled by a separate Conv layer ($\mathrm{Conv}_{24}$ and $\mathrm{Conv}_{34}$), and a nonlinear function (ReLU) is applied to the sum of the outputs of both Conv layers to produce the neural activities in Region 4. Note that we do not use any pooling layers in the main architecture.

Fig 1 (caption fragment). [...] [25] on the Allen Mouse Brain Connectivity Atlas (http://connectivity.brain-map.org) [22]; and the meta-parameters are mostly fixed by the combination of the 100-micrometer resolution interareal connectome [23] with detailed estimates of neuron density [38], and the statistics of connections between cortical layers from the literature [39][40][41]. https://doi.org/10.1371/journal.pcbi.1010427.g001

Finding meta-parameters consistent with mouse data
After fixing the architecture, we need to determine the meta-parameters for constructing the kernels for each Conv operation (Fig 3). The standard Conv operation is defined in terms of a four-dimensional kernel. The output of the kernel is a three-dimensional tensor of activations for region j, $A^j$, which pass through element-wise ReLU nonlinearities to produce non-negative rates. Element $A^j_{\alpha\beta\gamma}$ is the activation of the neuron at the $\alpha$th vertical and $\beta$th horizontal position in the visual field, in the $\gamma$th channel (or feature map). The $\gamma$th channel of the activation tensor for region j, $A^j_\gamma$, depends on inbound connections as

$$A^j_\gamma = \sum_{i \in I_j} \sum_{\delta} C^{ij}_{\gamma\delta} * A^i_\delta, \tag{1}$$

where $I_j$ is the set of regions that provide input to region j and $*$ denotes convolution. Note that both $C^{ij}_{\gamma\delta}$ and $A^i_\delta$ are two-dimensional, and they undergo standard two-dimensional convolution. The meta-parameters of kernel $C^{ij}$ are: number of input channels $c^{ij}_{in}$, number of output channels $c^{ij}_{out}$, stride $s^{ij}$, padding $p^{ij}$, and finally kernel size $k^{ij}$, i.e. the height and width (which are set equal) of $C^{ij}_{\gamma\delta}$. To make the connections realistically sparse, we add a binary Gaussian mask on the Conv operations, whose parameters are also estimated from data. See Fig 3B for an illustration of a Conv operation with a Gaussian mask. We constrain these meta-parameters with quantitative data whenever possible, and with reasonable assumptions indicated by experimental observations otherwise, as indicated below.

PLOS COMPUTATIONAL BIOLOGY

Cortical population constraints
Assumptions about area output size. We set the horizontal and vertical resolution of the input (in pixels) based on mouse visual acuity. According to [45], the upper bound for visual acuity in mice is 0.5 cycles/degree, corresponding to a Nyquist sampling rate of 2 pixels/cycle × 0.5 cycles/degree = 1.0 pixel/degree. According to retinotopic map studies [46], V1 covers a visual range of approximately 60° in altitude and approximately 90° in azimuth. We further simplified this to a square input size of 64 by 64 pixels.
The resolution of the other regions depends on both the resolution of the input and the strides of the connections. The stride of a connection is the sampling period with respect to its input. For example, a Conv with a stride of one samples every element of its input, whereas a Conv with a stride of two samples every other element (both horizontally and vertically), leading to output of half the size in each dimension. Because cortical neurons are not organized into discrete channels in the same way as convolutional network layers, there is no strong anatomical constraint on the stride. However, the mean stride has to be somewhat less than two; there are ten steps in the longest path through MouseNet, but if only six of them had a stride of two, the 64x64 input would be reduced to 1x1 in VISpor, with no remaining topography. Lacking strong constraints, for simplicity, we first attempted to set all the strides to one, but this left very few channels in some of the smaller regions (due to an interaction between channels and strides that we describe below). We therefore set the strides of the connections outbound from VISp to two, and others to one. The feature maps of dLGN and VISp were therefore 64x64 (the same as the input), and all others were 32x32.
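The resolution bookkeeping in this paragraph can be checked with a few lines of Python. This is a minimal sketch, assuming that padding is chosen so a stride-s Conv maps a side length n to n // s; the helper name `output_size` and the example paths are illustrative and are not taken from the MouseNet code.

```python
def output_size(input_size, strides):
    """Side length of a square feature map after a chain of Convs.

    Assumes padding is chosen so that a stride-s Conv maps a side
    length n to n // s (an illustrative simplification).
    """
    size = input_size
    for s in strides:
        size //= s
    return size

# Nyquist sampling at the acuity limit: 2 pixels/cycle * 0.5 cycles/degree.
pixels_per_degree = 2 * 0.5

# Six stride-2 steps among ten steps collapse a 64x64 input to one pixel.
collapsed = output_size(64, [2] * 6 + [1] * 4)

# Illustrative short path: two stride-1 steps, then one stride-2 step
# (standing in for a connection outbound from VISp).
lateral = output_size(64, [1, 1, 2])

print(pixels_per_degree, collapsed, lateral)  # 1.0 1 32
```

The same helper makes it easy to see why a mean stride close to two is not viable along a ten-step path.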
Given the resolutions of the channels in each region, the numbers of channels are constrained by the number of neurons. Specifically, let $n^i$ be the number of neurons in region i and $(l^i_x, l^i_y)$ be the size of the output in area i; then the number of channels in area i is determined by $c^i = n^i / (l^i_x \times l^i_y)$.
Estimating number of neurons in each area from data. We only model the excitatory neural population in our model, consistent with the fact that all neurons in the model project to other visual areas. In fact, neurons in convolutional networks are neither excitatory nor inhibitory, but have both positive and negative output weights. However, past modelling work [47,48] has shown that such mixed-weight projections can be transformed so that the original neurons are all excitatory, and an additional population of inhibitory neurons recovers the functional effects of the original weights.
According to [49], the estimated number of excitatory neurons in dLGN is 21200. For VISp, VISal, VISl, VISpl, we use estimated density for excitatory neurons given by [38] (https://bbp.epfl.ch/nexus/cell-atlas/), which is summarized in Table 1. Note that we use neuron density instead of counts to get a more stable estimation of number of neurons out of different versions of brain parcellations. For the remaining areas VISrl, VISli and VISpor, we approximate their density by taking the average across the above four areas with separated cortical layers.
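As a worked instance of the channel-count rule (channels = neurons divided by feature-map positions), the sketch below uses the dLGN neuron count quoted above and the 64x64 dLGN feature map. Rounding to a whole number of channels is our assumption for illustration; the function name is hypothetical.

```python
def channels_per_region(n_neurons, height, width):
    """Channels implied by c = n / (l_x * l_y)."""
    return n_neurons / (height * width)

# dLGN: 21200 excitatory neurons [49] on a 64x64 feature map.
dlgn_ratio = channels_per_region(21200, 64, 64)

# Assumption: round to the nearest whole channel, with at least one channel.
dlgn_channels = max(1, round(dlgn_ratio))

print(f"{dlgn_ratio:.2f} -> {dlgn_channels}")
```

Because the divisor grows with the square of the map side, halving the resolution of a region (stride two) quadruples the number of channels its neurons can support, which is the interaction between channels and strides mentioned earlier.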
Combined with the number of 10μm voxels counted in the Allen Mouse Brain Common Coordinate Framework (CCFv3) [50] (Table 2), we summarize the estimated number for all the regions in our model in Table 3.

Cortical connection constraints
Neurons tend to receive relatively dense inputs from other neurons that are above or below them, in other cortical layers, and the connection density decreases with increasing horizontal distance. Similarly, inputs from other cortical areas tend to have a point of highest density, with smoothly decreasing density around that point. We approximate such connection-density profiles with two-dimensional Gaussian shaped functions. Specifically, the fan-in connection probability from source region i to target region j at position (x, y) (position offset from center in μm) is modeled as

$$P^{ij}(x, y) = d^{ij}_p \exp\left(-\left(\frac{x^2}{2 (d^{ij}_x)^2} + \frac{y^2}{2 (d^{ij}_y)^2}\right)\right), \tag{2}$$

where $d^{ij}_p$ is the peak probability at the center and $d^{ij}_x$ and $d^{ij}_y$ are the widths in the x and y directions. For simplicity, we assume $d^{ij}_x = d^{ij}_y \triangleq d^{ij}_w$ and let $r = \sqrt{x^2 + y^2}$ denote the offset from the center of the source layer. The above equation then simplifies to

$$P^{ij}(r) = d^{ij}_p \exp\left(-\frac{r^2}{2 (d^{ij}_w)^2}\right), \tag{3}$$

where $d^{ij}_w$ (μm) is the Gaussian width. Both $d^{ij}_p$ and $d^{ij}_w$ are estimated from mouse data. The parameters for interlaminar connections are estimated from statistics of connections between cortical layers in paired electrode studies (Section Estimating $d^{ij}_w$, $d^{ij}_p$ for interlaminar connections), and the parameters for interareal connections are estimated from the mesoscale mouse connectome (Section Estimating $d^{ij}_w$ and $d^{ij}_p$ for interareal connections).

Conv layer with Gaussian mask
To translate our Gaussian models of connection density into network meta-parameters, we apply a binary mask to the weights of the Conv layers (Fig 3B). To do that, we first change the unit of $d^{ij}_w$ in Eq 3 from micrometers to source area-dependent "pixels" (unit of output size of source area i) by multiplying it with $s_i = \sqrt{\,\cdots\,}$ [expression lost in extraction]. The connection probability is then expressed as a function $P^{ij}(\tilde{r})$ of the pixel offset, where both $\tilde{r}$ and $\tilde{d}^{ij}_w$ are in the "pixel" unit. The kernel size of the Conv layer is set to be $k^{ij} = \cdots$ [expression lost in extraction], where $s^{ij}$ is the stride of the Conv layer. During initialization, a mask containing zeros and ones is generated for each Conv layer, with size $(c^{ij}_{out}, c^{ij}_{in}, k^{ij}, k^{ij})$. The probability of each element being one is $P^{ij}(\tilde{r})$, where $\tilde{r}$ (pixel) is the offset from the center of the mask. The weights of the Conv layer are then multiplied by the mask. This gives the connections realistic densities (or sparsities), with realistic spatial profiles.
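The mask generation step can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' implementation: we assume the Gaussian profile has the form peak × exp(−r²/(2 × width²)) with the width already converted to pixels, we take the kernel size `k` as a given input instead of deriving it, and we cap probabilities above one at one. The function name `gaussian_mask` and all numeric values are hypothetical.

```python
import math
import random

def gaussian_mask(c_out, c_in, k, peak, width_px, seed=0):
    """Binary mask of shape (c_out, c_in, k, k) as nested lists.

    Each entry is 1 with probability peak * exp(-r^2 / (2 * width_px^2)),
    where r is the offset in pixels from the kernel center.
    Probabilities above one are capped at one.
    """
    rng = random.Random(seed)
    center = (k - 1) / 2.0
    mask = []
    for _ in range(c_out):
        out_block = []
        for _ in range(c_in):
            plane = []
            for y in range(k):
                row = []
                for x in range(k):
                    r2 = (x - center) ** 2 + (y - center) ** 2
                    p = min(1.0, peak * math.exp(-r2 / (2.0 * width_px ** 2)))
                    row.append(1 if rng.random() < p else 0)
                plane.append(row)
            out_block.append(plane)
        mask.append(out_block)
    return mask

# Hypothetical values: 4 output channels, 3 input channels, 5x5 kernel.
mask = gaussian_mask(c_out=4, c_in=3, k=5, peak=1.0, width_px=1.5)

# With peak = 1 the center tap has probability one and is always kept.
center_taps = [mask[o][i][2][2] for o in range(4) for i in range(3)]
```

In a training framework the mask would be fixed at initialization and multiplied element-wise into the kernel weights, so that masked connections stay at zero.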

Estimating d ij w ; d ij p for interlaminar connections
For the interlaminar connections, we estimate the Gaussian width d ij w from multiple experimental resources. Firstly, from Table 3 in [40], we get the estimation of d ij w to be 114 micrometers for functional connections between pairs of L4 pyramidal cells in mouse auditory cortex. Secondly, manually extracted from [41] Fig 8B, we obtain the variation of the Gaussian width depending on source and target layer from cat V1. Finally, we use this variation to scale the L4 to L4 width of 114 μm to other layers in the mouse cortex. We summarize the Gaussian widths from cat cortex, along with corresponding scaled estimates for mouse cortex, in Table 4. Note that in the current model, we only use the values for connections from L4 to L2/3 and from L2/3 to L5 (Fig 2B).
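To estimate the Gaussian peak probability $d^{ij}_p$, we first collect the connection probability between excitatory populations at an offset of 75 micrometers, $d^{ij}_{75}$ (Fig 4A in [39]). We then get the peak probability $d^{ij}_p$ by evaluating the Gaussian profile at $r = 75$ μm and solving for the peak:

$$d^{ij}_p = d^{ij}_{75} \Big/ \exp\left(-\frac{75^2}{2 (d^{ij}_w)^2}\right).$$

We summarize the probability at 75 micrometers $d^{ij}_{75}$ along with the peak probability $d^{ij}_p$ in Table 5.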
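The conversion from a probability measured at 75 μm to a peak probability can be illustrated as follows. The sketch assumes a Gaussian profile of the form peak × exp(−r²/(2 × width²)); the input values (probability 0.1, width 114 μm) are hypothetical and are not entries from Table 5.

```python
import math

def peak_probability(p_at_offset, width_um, offset_um=75.0):
    """Invert P(r) = d_p * exp(-r^2 / (2 * d_w^2)) at r = offset_um."""
    return p_at_offset / math.exp(-offset_um ** 2 / (2.0 * width_um ** 2))

# Hypothetical measurement: probability 0.1 at 75 um, Gaussian width 114 um.
d_p = peak_probability(0.1, 114.0)

# Evaluating the Gaussian at 75 um recovers the measured value.
recovered = d_p * math.exp(-75.0 ** 2 / (2.0 * 114.0 ** 2))

print(round(d_p, 4), round(recovered, 4))
```

Narrower widths amplify the correction, since a fixed 75 μm offset then sits further out on the tail of the profile.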

Estimating d ij w and d ij p for interareal connections
To estimate interareal connection strengths and spatial profiles, we use the mesoscale model of the mouse connectome [23,24]. This model estimates connection strengths between 100 μm voxels. For the purpose of our analysis, we need to map the three-dimensional structure into two dimensions. First, we fit the visual area positions by a sphere with center $c \in \mathbb{R}^3$ and radius $r$. [mapping equation lost in extraction], where $v = 100$ μm is the size of the voxel.

Table 4. Estimated Gaussian width $d^{ij}_w$ for interlaminar excitatory connections. The values outside the parentheses are extracted from [41]; the values inside the parentheses are scaled to mouse cortex, using the width 114 μm for L4-to-L4 connections in mouse auditory cortex [40]. Units are micrometers (μm).

Area size. Approximations of the surface area for each brain region are needed to convert the widths of connection profiles (see Conv layer with Gaussian mask) from voxels in the mesoscale model to convolutional-layer pixels in MouseNet. For this purpose, each region's surface area size is approximated by the area of a convex hull of its mapped two-dimensional positions. These estimates are summarized in Table 6.
Estimating $d^{ij}_w$. For each connection from source region i to target region j, we estimate $d^{ij}_w$ from the mesoscale model. The first step is to estimate the widths of connections to individual voxels in j. The incoming width $d^{ij}_k$ for target voxel k in j is estimated by the standard deviation of the connection strength about its center of mass. Specifically,

$$d^{ij}_k = \sqrt{\frac{\sum_l w_{lk}\, d_l^2}{\sum_l w_{lk}}},$$

where l indexes the voxels in source region i, $w_{lk}$ is the connection weight between source and target voxels l and k in the mesoscale model, and $d_l$ is the distance from voxel l to the center of mass of these connection weights. We then estimate $d^{ij}_w$ as the average of these widths over the voxels in j. We omit from this average any target voxels that have multi-modal input profiles. This procedure provides an upper bound for $d^{ij}_w$, because a target voxel may include multiple neurons with partially overlapping input areas.
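The per-voxel width estimate described here, a weighted standard deviation of source positions about their center of mass, can be sketched in a few lines. The function name and the toy data are hypothetical; positions are assumed to be two-dimensional coordinates in micrometers after the flattening step.

```python
import math

def incoming_width(weights, positions):
    """Weighted spread of source voxels about their center of mass.

    weights: connection weights w_lk from each source voxel l to one
             target voxel k.
    positions: (x, y) coordinates of the source voxels, in micrometers.
    """
    total = sum(weights)
    cx = sum(w * x for w, (x, _) in zip(weights, positions)) / total
    cy = sum(w * y for w, (_, y) in zip(weights, positions)) / total
    second_moment = sum(
        w * ((x - cx) ** 2 + (y - cy) ** 2)
        for w, (x, y) in zip(weights, positions)
    ) / total
    return math.sqrt(second_moment)

# Hypothetical example: four equally weighted voxels on a 100 um grid.
width = incoming_width([1, 1, 1, 1], [(0, 0), (100, 0), (0, 100), (100, 100)])
print(round(width, 2))
```

Averaging this quantity over the target voxels of region j (skipping voxels with multi-modal input profiles) would then give the region-level width.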
Estimating $d^{ij}_p$. The mesoscale model provides estimates of relative densities of connections between pairs of voxels. But an additional factor is needed to convert these relative densities into neuron-to-neuron connection probabilities. For this purpose, we assumed that each neuron received inputs from 1000 neurons in other areas (we call this number the extrinsic in-degree, e). This is on the order of the estimate from Fig S9 M in [51]. Given this assumption, we calculated $d^{ij}_p$ by the relation [equation lost in extraction], where $w^{ij}$ is the connection strength from source area i to target area j, estimated by integrating the connection weights of the corresponding areas in the mesoscale model. The estimated values for $d^{ij}_w$ and $d^{ij}_p$ are given in S1 Table. Note that the Gaussian peak values $d^{ij}_p$ we get for VISal2/3→VISpor4 and VISal5→VISpor4 for the current model are greater than 1. This is because the number of channels in VISal is too small, i.e. inconsistent with the other parameters we have chosen. In our current model, probabilities greater than one are rounded down to one when we generate the binary mask. This is an interesting limitation of our current model, which suggests that further assumptions about the architecture that can reduce the number of channels in VISal could be meaningful.

Conv kernel size for dLGN
The above methods allowed us to set kernel sizes for intracortical connections, but not subcortical ones. We set the kernel sizes for inputs to dLGN and VISp L4 according to receptive field sizes in these regions. Receptive fields are about 9 degrees in dLGN and 11 degrees in VISp [52]. As mentioned in Section Cortical population constraints, mouse visual acuity is approximately 1 pixel/degree, therefore we set kernel size of the connection from input to dLGN to 9x9. We then set the kernel size of the connection from dLGN to VISp to 3x3, such that the receptive field size for VISp is 11x11 pixels.
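The receptive-field arithmetic behind these kernel sizes follows the usual rule for stacked convolutions: each layer adds (kernel − 1) times the product of the preceding strides. The short sketch below checks the 9 and 11 pixel figures; the helper name is ours.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a chain of Convs.

    layers: list of (kernel_size, stride) pairs, from input to output.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

dlgn_rf = receptive_field([(9, 1)])          # input -> dLGN
visp_rf = receptive_field([(9, 1), (3, 1)])  # input -> dLGN -> VISp L4

print(dlgn_rf, visp_rf)  # 9 11
```

At roughly 1 pixel/degree, these values correspond to the 9 degree and 11 degree receptive fields cited for dLGN and VISp.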

Summary tables
In Table 7, we summarize the calculated number of channels in each area (in parentheses) and the kernel size for each Conv layer.
The parameters used in the model based on biological sources and assumptions are summarized in Table 8, and the formulae for calculating the Conv layer meta-parameters are summarized in Table 9.

Results
In this section, we use a well established image classification task as a working example to demonstrate the usage of MouseNet and to show the effect of adding biological constraints to the architecture by contrasting the MouseNet results with VGG16, a typical artificial neural network without any biological constraints. We first assess the computational performance of this mouse-architecture network on an image classification task. Then, through systematic comparisons with the large scale Allen Brain Observatory dataset, we show how MouseNet can be used to probe the effect of a CNN's specific task training and architecture on its similarities and differences with responses in the biological brain.

Implementation of MouseNet
To enable training of MouseNet on a standard image classification task, we implemented the MouseNet structure in PyTorch [53]. Each Conv layer was followed by a batch normalization layer and a ReLU non-linearity. For regions such as VISpor L4 that receive input from multiple Conv layers, the outputs of the Conv layers are summed before being fed into the batch normalization layer and ReLU non-linearity.
To train the MouseNet model on an image classification task, we added a simple classifier. Specifically, in order to include the final processing output from each individual area, so that the information is not bottlenecked by the relatively small VISpor area, we took the L5 output from all seven areas and reduced each to 4x4 with an average pooling layer. We then flattened and concatenated these outputs and fed them to a linear fully-connected layer, which reduced the dimension to the number of classes of the task. The outputs were then transformed to probabilities by the softmax function, and the cross-entropy loss of the predicted probabilities (determined from the categorical distribution where individual class probabilities are from the output of the softmax) relative to the ground truth labels was used to train on the image classification task.
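To make the shape of the classifier input concrete, the following sketch average-pools a square feature map to 4x4 and counts the length of the concatenated feature vector. It uses plain Python lists instead of PyTorch tensors, and the L5 channel counts are hypothetical placeholders, not the values from Table 7.

```python
def average_pool_to(feature_map, out_size):
    """Average-pool a square 2D map (list of rows) to out_size x out_size.

    Assumes the input side length is a multiple of out_size.
    """
    n = len(feature_map)
    k = n // out_size
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            block = [feature_map[i * k + a][j * k + b]
                     for a in range(k) for b in range(k)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled

def classifier_input_length(l5_channels, out_size=4):
    """Length of the flattened, concatenated vector fed to the linear layer."""
    return sum(c * out_size * out_size for c in l5_channels)

# A 32x32 map of ones pools to a 4x4 map of ones.
pooled = average_pool_to([[1.0] * 32 for _ in range(32)], 4)

# Hypothetical L5 channel counts for the seven areas (not the paper's values).
n_features = classifier_input_length([5, 20, 10, 8, 6, 4, 12])
print(len(pooled), len(pooled[0]), n_features)
```

Pooling every area to the same 4x4 grid lets regions with different resolutions (64x64 for VISp, 32x32 elsewhere) contribute to a single linear readout.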

Computational performance of MouseNet on image classification
We trained MouseNet end-to-end using stochastic gradient descent with momentum, adapting the training script from the ImageNet example in the PyTorch examples GitHub repository. Full training details and scripts are available on the MouseNet GitHub repo: https://github.com/mabuice/Mouse_CNN.
Since there are currently no known behavioral experiments in the literature testing mice on invariant object recognition with natural images, our results provide a first guess of how a mouse-sized architecture may perform on such tasks. We first found that MouseNet achieved above 90% validation accuracy on CIFAR10 [33], a simple classification of 32x32 images into 10 categories. Note that the input to MouseNet is always resized to 64x64. To make a fair comparison with MouseNet, the input to VGG16 is also resized to 64x64. Interestingly, this is close to the state-of-the-art performance of modern networks, suggesting that mouse-sized networks are fully capable of performing this simple task.
We then moved to the more challenging image classification benchmark of ImageNet [54], which requires classification of higher resolution images into 1000 categories. We found that, even for input images downsampled to a resolution of (64x64), MouseNet can still be trained

The effects of task training on functional properties
To examine the effect of the image classification task training on the functional similarity of MouseNet and the biological mouse brain, we make use of the large-scale, publicly available Allen Brain Observatory dataset [28]. We study representational similarity of MouseNet and the biological mouse brain across a set of natural images, along with the basic functional properties of sparsity and orientation selectivity.
The Allen Brain Observatory data set. The Allen Brain Observatory data set is a large-scale standardized in vivo survey of physiological activity in the mouse visual cortex, featuring representations of visually evoked calcium responses from GCaMP6f-expressing neurons. In this work, we use the population neural responses to a set of 118 natural image stimuli, each presented 50 times. The images were presented for 250 ms each, with no inter-image delay or intervening "gray" image. The neural responses we use are events detected from fluorescence traces using an L0 regularized deconvolution algorithm, which deconvolves pointwise events assuming a linear calcium response for each event and penalizes the total number of events included in the trace. Full information about the experiment and database is given in [28].
The Allen Brain Observatory includes data from six different brain areas, namely VISp, VISal, VISl, VISpm, VISam and VISrl. The number of neurons in the dataset, for each of the regions we use, is summarized in Table 11.
The Similarity of Similarity Matrices metric (SSM). To compare the functional similarity between two representations of a set of images, one in MouseNet and one in the biological mouse brain, we use the Similarity of Similarity Matrices (SSM) metric [29,34]. We begin with a matrix of neural activities, in which each row contains the population activities for a certain image. We calculate the Pearson correlation coefficient between every pair of rows within one representation matrix, to form an n by n "similarity matrix" for this representation, where n is the number of images and each entry describes the similarity of the population responses to a pair of images. Next, to compare two similarity matrices, we flatten the matrices to vectors and compute the Spearman rank correlation between these vectors. Like the Pearson correlation coefficient, the rank correlation lies in the range [−1, 1], indicating how similar (close to 1) or dissimilar (close to −1) the two representations are. Rather than examining one neuron at a time [55,56], this metric compares representations based on the activities of whole populations of artificial and biological neurons, revealing functional similarity at the population level. Another such population similarity metric is Singular Vector Canonical Correlation Analysis (SVCCA) [29,57]. An excellent review of such similarity metrics and their properties can be found in [58].
Table 9. Meta-parameters for Conv layer connecting source area i to target area j.

PLOS COMPUTATIONAL BIOLOGY
Following the procedures in [29], we construct the representation matrix for a given mouse visual cortex region by taking the trial-averaged mean responses of the neurons during the 250 ms image presentation. Activities of neurons recorded in different experiments for the same brain area are grouped together to construct the representation matrix, whose dimensions are the number of images by the number of neurons. The representation matrices for MouseNet layers are obtained by feeding the same set of 118 images (resized to 64x64) to MouseNet and collecting all the activations from a given layer of the model.
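As an illustration of the SSM computation described above, the following is a minimal Python sketch rather than the authors' implementation. It assumes two representation matrices with one row per image. The function names (`average_ranks`, `similarity_matrix`, `ssm`) and the `upper_triangle_only` option are hypothetical and are introduced here only for illustration. The Spearman correlation is computed as the Pearson correlation of average ranks.

```python
import numpy as np


def average_ranks(values):
    """Rank a 1-D array (1-based); tied entries receive their average rank."""
    values = np.asarray(values, dtype=float).ravel()
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(values.size, dtype=float)
    ranks[order] = np.arange(1, values.size + 1)
    # Replace the ranks of tied values by the mean rank of each tie group.
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return (rank_sums / counts)[inverse]


def similarity_matrix(representation):
    """Pearson correlation between every pair of rows (one row per image)."""
    return np.corrcoef(np.asarray(representation, dtype=float))


def ssm(rep_a, rep_b, upper_triangle_only=False):
    """Spearman rank correlation between two flattened similarity matrices.

    rep_a, rep_b: arrays of shape (n_images, n_units); n_units may differ.
    upper_triangle_only: assumption, not stated in the text; if True, only
    the entries above the diagonal are compared.
    """
    sim_a = similarity_matrix(rep_a)
    sim_b = similarity_matrix(rep_b)
    if upper_triangle_only:
        idx = np.triu_indices_from(sim_a, k=1)
        vec_a, vec_b = sim_a[idx], sim_b[idx]
    else:
        vec_a, vec_b = sim_a.ravel(), sim_b.ravel()
    # Spearman correlation equals the Pearson correlation of the ranks.
    return float(np.corrcoef(average_ranks(vec_a), average_ranks(vec_b))[0, 1])


# Hypothetical random data standing in for real responses and activations.
rng = np.random.default_rng(0)
brain_rep = rng.normal(size=(118, 200))  # 118 images x 200 neurons
model_rep = rng.normal(size=(118, 512))  # 118 images x 512 model units
print(ssm(brain_rep, model_rep))
```

Because the two matrices only need to share the number of images, a brain area and a model layer with different numbers of units can be compared directly. Flattening the full matrix, as in the text, includes the diagonal and both symmetric halves; the optional flag restricts the comparison to the upper triangle, which is a common variant and an assumption here.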
Neural reliability and SSM noise ceiling. We next compute the SSM noise ceiling from the Allen Brain Observatory data. We use split-half reliability to quantify the reliability of a single neuron from the Allen Brain Observatory. This is done by separating the 50 trials into two non-overlapping sets of 25 trials, and taking the correlation of the trial-averaged responses between the two sets. We make ten random splits, and take the mean of the ten correlations to represent the reliability of each neuron. The reliability distributions of the neural populations are shown in Fig 4 (left). VISp, VISl and VISal are the most reliable areas, while VISpm, VISam and VISrl are less reliable.
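The split-half procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the responses of one neuron are assumed to be stored as an array of shape (images, trials), and the function name, argument names, and seed handling are hypothetical.

```python
import numpy as np


def split_half_reliability(responses, n_splits=10, seed=0):
    """Mean split-half correlation for ONE neuron.

    responses: array of shape (n_images, n_trials), e.g. (118, 50).
    Each split divides the trials into two non-overlapping halves, averages
    each half over trials, and correlates the two mean response vectors.
    """
    responses = np.asarray(responses, dtype=float)
    n_trials = responses.shape[1]
    half = n_trials // 2
    rng = np.random.default_rng(seed)
    correlations = []
    for _ in range(n_splits):
        order = rng.permutation(n_trials)
        mean_a = responses[:, order[:half]].mean(axis=1)
        mean_b = responses[:, order[half:2 * half]].mean(axis=1)
        correlations.append(np.corrcoef(mean_a, mean_b)[0, 1])
    return float(np.mean(correlations))


# Hypothetical example: a neuron with a fixed tuning curve plus trial noise.
rng = np.random.default_rng(1)
tuning = rng.normal(size=118)
reliable_neuron = tuning[:, None] + 0.1 * rng.normal(size=(118, 50))
print(split_half_reliability(reliable_neuron))
```

A neuron whose trial-averaged response is constant in either half yields an undefined correlation (NaN); how such neurons are handled is not specified in the text and is left out of this sketch.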
To estimate the noise ceiling of the SSM metric, we compare the mouse data representation matrices with themselves. Specifically, we split the 50 trials in the dataset into two non-overlapping sets and calculate the trial-averaged representation matrix for each set. The SSM between these two representation matrices is the noise ceiling of the SSM metric. Ten splits of the dataset are used to estimate the mean and standard deviation of the noise ceiling.
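The noise ceiling estimate described above can be illustrated with the short sketch below. It is written under assumptions: single-trial responses for one brain area are taken to be available as an array of shape (images, trials, neurons), the full similarity matrices are flattened, and all function names are hypothetical. A compact SSM helper is included so that the block runs on its own.

```python
import numpy as np


def _average_ranks(values):
    """1-based ranks of a 1-D array, with ties given their average rank."""
    values = np.asarray(values, dtype=float).ravel()
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(values.size, dtype=float)
    ranks[order] = np.arange(1, values.size + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    return (np.bincount(inverse, weights=ranks) / counts)[inverse]


def ssm(rep_a, rep_b):
    """Spearman correlation between flattened Pearson similarity matrices."""
    vec_a = np.corrcoef(rep_a).ravel()
    vec_b = np.corrcoef(rep_b).ravel()
    return float(np.corrcoef(_average_ranks(vec_a), _average_ranks(vec_b))[0, 1])


def ssm_noise_ceiling(trial_responses, n_splits=10, seed=0):
    """Mean and standard deviation of the split-half SSM over random splits.

    trial_responses: array of shape (n_images, n_trials, n_neurons).
    """
    trial_responses = np.asarray(trial_responses, dtype=float)
    n_trials = trial_responses.shape[1]
    half = n_trials // 2
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_splits):
        order = rng.permutation(n_trials)
        # Trial-averaged representation matrix for each half of the trials.
        rep_a = trial_responses[:, order[:half], :].mean(axis=1)
        rep_b = trial_responses[:, order[half:2 * half], :].mean(axis=1)
        values.append(ssm(rep_a, rep_b))
    return float(np.mean(values)), float(np.std(values))


# Hypothetical data: 30 images, 50 trials, 40 neurons with shared signal.
rng = np.random.default_rng(1)
signal = rng.normal(size=(30, 1, 40))
data = signal + 0.5 * rng.normal(size=(30, 50, 40))
print(ssm_noise_ceiling(data))
```

Restricting the last axis to a subset of neurons before calling `ssm_noise_ceiling` gives the noise ceiling for that subset.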
To examine how the noise ceiling changes with the reliability of the neural population, we calculate the noise ceilings using only neurons whose reliability exceeds different thresholds, as shown in Fig 4 (right). We see that for some regions, selecting neurons above a certain reliability threshold yields a higher noise ceiling than using all neurons. We therefore order the neurons in each region by reliability, calculate the noise ceiling using only the neurons above a given reliability threshold, and take the threshold that results in the highest noise ceiling to define the "best noise ceiling" for that region. We summarize the reliability and best noise ceiling for each area in Fig 5. In this paper, we concentrate our discussion on the most reliable areas, VISp, VISl and VISal, which are included in the MouseNet model. We use the best noise ceiling when comparing with the models.
Task training improves the similarity between MouseNet and the Allen Brain Observatory. To examine the effect of training on an image classification task on the functional similarity of MouseNet to the brain, we compute the SSM value between each layer of MouseNet and data from a brain region recorded in the Allen Brain Observatory. To account for the randomness due to initialization, we train four instances of MouseNet on ImageNet starting from different initial weights and report their mean statistics.

Observatory. We see that for layer 2/3 of area VISp in the Allen Brain Observatory, five different model areas show a significant change in SSM value from the untrained model. In the following, we add the prefix "m" to the modeled areas of MouseNet to distinguish them from those of the real brain. One of these is an early layer, mVISp5, while the others are in the parallel pathway portion of the architecture. Of the others, mVISl4 shows an increase in similarity with VISp_layer23, while three other model regions show a decrease in similarity. For the other two regions in Fig 6, mVISp5 shows a significant increase in similarity. For VISl_layer23, there are six other model regions that all show an increase in similarity. These statements hold specifically when comparing model regions to each other for the same area in the Brain Observatory. Comparing areas of the Brain Observatory to each other requires a different adjustment for the number of comparisons (see black vs. red stars in Fig 6). These results are consistent with the idea from Shi et al. [29] that VISp is a lower-order area than VISl and VISal (VISp maps to a lower "pseudo-depth" than both VISl and VISal when compared to a CNN). Layers 4 and 5 show results that are similar, but not identical, to those of layer 2/3. Note that, although training on ImageNet improves the similarity of the corresponding model regions to the brain, the highest SSM value does not always occur in the model layer corresponding to the specific region considered in the Brain Observatory. For example, the SSM values for the mVISp regions are higher than those for the mVISl regions when comparing to the brain area VISl L2/3. This is possibly because the visual areas are more similar to each other than they are to the MouseNet regions (see S2 Table for the SSM values between the brain areas themselves), such that improving the similarity to one brain region can also improve the similarity to other regions. Nevertheless, looking at all the layers globally, we see that for the earliest visual area, VISp, ImageNet classification training raises the SSM values of the mVISp layers in MouseNet while lowering the values for the later layers; whereas for the secondary visual area VISl, the training raises the SSM values of both earlier layers and later layers in the parallel pathways, suggesting a higher place in the functional hierarchy (cf. the results of [29]).
Higher task performance on image classification does not guarantee higher similarity to the mouse brain. To examine how performance on the ImageNet classification task affects the functional similarity to the brain, we contrast the SSM values for MouseNet with those of another network that can perform this task, the VGG16 network discussed above. We use the same input resolution on the same task (see Section Computational performance of MouseNet on image classification). As for MouseNet, we calculate the SSM values between each layer in VGG16 and the regions in the mouse visual cortex. Because VGG16 does not have a "corresponding layer" for each region, we choose the VGG16 layer that has the highest SSM with each mouse brain region. For this comparison, we do the same for MouseNet, so that for each region we compare the "best layer" SSM value of VGG16 with the best layer SSM value of MouseNet.
The best layer's SSM values for both VGG16 and MouseNet, for each mouse cortical layer in VISp, VISl, and VISal, are summarized in Fig 7. As we can see in the figure, although VGG16 has much higher performance on the ImageNet task (about 60% vs 40%), it does not have much higher SSM values to the brain for most regions. The saturation of functional similarity between the brain and models in terms of image classification performance is also observed in primates, albeit at a much higher performance level [59].
To further probe the limited relationship between a model's task performance and its functional similarity to the mouse brain, we look at how the models' functional similarity to brain data changes during training. The results are shown in Fig 8; see S3 Table for a summary of the best models, which achieved the highest SSM values for each brain region. These results show that optimizing performance on this particular task, at least beyond an early or intermediate level of performance, does not necessarily improve the model's similarity to the biological brain. If the approach of building models for neural responses via task training of artificial networks is broadly correct, then we take this as an indication that ImageNet is not the correct task to consider for the representations in the mouse brain.
Task training with the MouseNet architecture increases the similarity of other functional properties to the mouse brain. As mentioned above, the SSM metric compares functional representations, based on activities of the whole neural population in a given model layer and a set of recordings from a given brain area. For a complementary view of the effect of task training on MouseNet representations, and of the role of its architecture, we can also study the statistical distributions of single neuron functional properties, such as orientation selectivity and lifetime sparseness [28].

Lifetime sparseness measures the selectivity of a neuron's mean response to different stimulus conditions, defined as [28,60]
$$S_L = \frac{1 - \frac{1}{N}\,\frac{\left(\sum_{i} r_i\right)^2}{\sum_{i} r_i^2}}{1 - \frac{1}{N}},$$
where N is the number of stimulus conditions and r_i is the response of the neuron to stimulus condition i averaged across trials. A neuron that responds strongly to only a few stimuli will have a lifetime sparseness close to 1, whereas a neuron that responds broadly to many stimuli will have a lower lifetime sparseness. The statistical distributions of lifetime sparseness for the mouse data on natural scene stimuli and for all the units in trained/untrained MouseNet and VGG16 models, responding to the same natural scene stimuli as in the Allen Brain Observatory, are shown in Fig 9 (top row). This demonstrates clearly that training on the image classification task makes the distribution of lifetime sparseness values much closer to the mouse brain data for MouseNet, but not as much for VGG16.
Similarly, we can study the orientation selectivity of individual neurons by using the static grating stimuli in the Allen Brain Observatory dataset. Specifically, we calculate the circular selectivity index (which is one minus the circular variance defined in [61]), defined as
$$\mathrm{CSI} = \frac{\left|\sum_{k} r_k e^{2 i \theta_k}\right|}{\sum_{k} r_k},$$
where r_k is the response of the neuron to a grating with angle θ_k averaged across trials. A neuron that responds strongly to only one direction will have a circular selectivity index close to 1, whereas a neuron that responds broadly to many directions will have a lower circular selectivity index. The statistical distributions of the circular selectivity index, for the mouse data with static grating stimuli and for trained/untrained MouseNet and VGG16 models with the same stimuli, are shown in Fig 9 (bottom row). As for the case of lifetime sparseness above, task training changes the distribution of selectivity values. These distributions, after training, are closer to the mouse brain data for the MouseNet networks than for VGG16, once again showing how the more specifically matched architecture of MouseNet can lead to model responses that are more similar to the biological brain. Note that the spikier distributions of the models result from the deterministic nature of the models, in contrast to the noisier brain data, in response to the (only) six total static grating directions. If we were to simulate neural noise in the model responses, it would smooth the distributions, resulting in a closer approximation of the data, as we show in S3 Fig.
Taken together, these results show how the MouseNet model can be used to explore the impact of task training on a variety of response statistics that are commonly computed in physiology studies, and that those defined on individual neurons can demonstrate complementary and in some cases more dramatic changes with training than those averaged over entire populations. It is interesting to note that building anatomical constraints into the architecture increases its similarity to physiological data on these single neuron metrics, but not on population representational similarity, an important divergence that could be of interest for future study.
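The two single-neuron statistics discussed above can be sketched in a few lines of Python. This is an illustration under assumptions rather than the authors' code. Lifetime sparseness is taken in its commonly used form, (1 − (Σ r_i)² / (N Σ r_i²)) / (1 − 1/N). The circular selectivity index is taken as |Σ r_k exp(2iθ_k)| / Σ r_k, where the factor of 2 assumes grating orientations that span 180 degrees. Responses are assumed to be non-negative, and the function names are hypothetical.

```python
import numpy as np


def lifetime_sparseness(mean_responses):
    """Lifetime sparseness of one neuron from its trial-averaged responses.

    mean_responses: 1-D array with one non-negative value per stimulus.
    Returns NaN if there are fewer than two stimuli or no response at all.
    """
    r = np.asarray(mean_responses, dtype=float)
    n = r.size
    sum_sq = np.sum(r ** 2)
    if n < 2 or sum_sq == 0:
        return float("nan")
    return float((1.0 - (r.sum() ** 2) / (n * sum_sq)) / (1.0 - 1.0 / n))


def circular_selectivity_index(mean_responses, angles_deg):
    """One minus the circular variance over grating orientations.

    mean_responses: 1-D array with one non-negative value per grating.
    angles_deg: grating orientations in degrees (assumed to span 180 degrees).
    """
    r = np.asarray(mean_responses, dtype=float)
    theta = np.deg2rad(np.asarray(angles_deg, dtype=float))
    total = r.sum()
    if total == 0:
        return float("nan")
    return float(np.abs(np.sum(r * np.exp(2j * theta))) / total)


# Hypothetical neuron responding mostly to one of six orientations.
angles = [0, 30, 60, 90, 120, 150]
responses = [5.0, 1.0, 0.2, 0.1, 0.2, 1.0]
print(lifetime_sparseness(responses))
print(circular_selectivity_index(responses, angles))
```

Both functions operate on one unit at a time; applying them to every neuron or model unit gives the distributions that are compared between the models and the data.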
Task training diversifies functional representation among MouseNet layers. Finally, we study how task training and network architecture affect the general "geometric" layout of the models' representations, separately from their similarity to representations in the mouse brain data. To do this, we calculate the SSM values between every pair of layers from both trained/untrained MouseNet and VGG16, and visualize them in two-dimensional space via the metric multidimensional scaling algorithm implemented in scikit-learn [62,63]. The results are shown in Fig 10. Table 12 summarizes the diversity index for each model, defined as the product of the singular values of principal component analysis for the corresponding cluster. For VGG16, we see that the representations in different layers are clustered together both before and after training. By contrast, for MouseNet the representations become much more diversified after training. We hypothesize that it is the parallel architecture of MouseNet that leads it to learn this more diversified representation as it solves the image classification task. [...] the Conv layers, those pathways do learn very different representations. This specialization of representations for different parallel pathways is consistent with what is observed in the literature [64] and a follow-up theoretical study (see [65] chapter 4). Unraveling any specific functions of each pathway, in this task or in others, is an opportunity left for future studies.
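The diversity index mentioned above can be illustrated with the following sketch. The text does not give the exact computation, so this block makes an assumption: the index is the product of the singular values of the mean-centered two-dimensional embedding coordinates of one model's layers, which is how PCA singular values are usually defined. The embedding itself (for example from metric multidimensional scaling) is assumed to be given, and the function name is hypothetical.

```python
import numpy as np


def diversity_index(points):
    """Product of the PCA singular values of one cluster of points.

    points: array of shape (n_layers, n_dims) holding the embedding
    coordinates of one model's layers (assumed input).
    """
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)
    # Singular values of the centered data are the PCA singular values.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return float(np.prod(singular_values))


# Hypothetical embeddings: a tight cluster and a spread-out cluster.
rng = np.random.default_rng(0)
tight_cluster = 0.1 * rng.normal(size=(16, 2))
spread_cluster = 1.0 * rng.normal(size=(22, 2))
print(diversity_index(tight_cluster), diversity_index(spread_cluster))
```

Under this assumption the index scales with the area covered by the cluster, so a model whose layers spread out in the embedding receives a larger value than one whose layers stay close together.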

Discussion
Task-optimized deep networks show promise for brain modelling, because they are functionally sophisticated, and they often develop internal representations that overlap strongly with representations in the brain [10-17]. Convolutional neural networks share weights across the visual field, and thus form an approximation of the functional properties that may be imposed by the translation invariance of natural stimuli, leading to equivariant representations in neural systems [3-5]. This weight sharing makes them much easier to train, which is an important practical consideration for model development. Although it requires non-local weight updates, it can be made more biologically plausible by a dynamical weight sharing mechanism [66]. While deep network architectures were originally loosely inspired by the brain, there has been an extensive empirical exploration of the effects of architectural features in machine learning, in directions often independent from neuroscience. In parallel, however, a great deal more has been learned about the architecture of the biological brain, with that of the mouse brain having been particularly well characterized.
We have developed MouseNet, a deep network architecture that is consistent with a wide range of data on mouse visual cortex, including data from tract-tracing studies and studies of local connection statistics. We constrain the architecture with mouse data whenever possible, and borrow data from other species or make reasonable assumptions indicated by experimental observations otherwise. The framework introduced here is flexible enough to incorporate new data and connections once they are available. While standard deep networks have provided useful points of comparison with neurobiological systems, in the long term more biologically realistic deep networks may enable more specific comparisons with the brain, including comparisons between homologous groups of neurons, modeling of specific lesions, and analysis of functional differences between brain areas and pathways.
Using image classification as a working example, we use MouseNet to investigate the task-training approach to modeling the functional representations in the mouse brain. An aspect of special interest is whether training on this task drives the representations in the model to be closer to those recorded from the real mouse brain, in comparison to representations in untrained versions of the MouseNet model or in generic deep networks. Using recordings from the large-scale Allen Brain Observatory survey, we find, consistent with the literature [10,11] for other model species and systems, that training on an image classification task does drive MouseNet representations to better resemble those of the real data. However, this increase in functional similarity is not strictly monotonic with task performance. In our experiments, we see the SSM correlation with the Brain Observatory responses saturating or even reaching its maximum well before we achieve maximum accuracy on the task. This is true for both MouseNet and VGG16.
Within the task-training paradigm, these results suggest that the specific image classification task we used, and perhaps image classification overall, is not the appropriate task for describing what the mouse visual cortex has learned and developed to compute. Nonetheless, MouseNet is an important reference for studies in more established species, which rely on comparisons of the ventral stream with architectures designed for object recognition. Although we know rodents are capable of performing tasks that require visual object discrimination, mouse ethology suggests that alternative computations are more important for the mouse visual system, such as motion tracking, predation, and predator avoidance. A promising future direction is to use task training of the MouseNet model, together with the metrics tested here, to develop more realistic tasks and stimuli that may lead to more closely matched representations.
In sum, this work links anatomical and physiological data to task-driven CNN models, providing a road map for developing better task-driven models of the biological brain. It opens the door to building more detailed structures into the model, such as adding further brain areas, adding recurrence, and using different inputs and readouts for different pathways. Incorporating new anatomical data is also easy within this framework. By making our code publicly available, and illustrating the model's successes and failures in matching representations using well-studied metrics and tasks, we hope to facilitate future research along these lines.
Supporting information
S1 Table. SSM values between mouse visual cortex areas. Note that even with the neural subsampling issue [29], the similarity values between VISp, VISl, and VISal are much higher than they are with the CNN models.
[...] The x axis includes all the layers in the model in a serial way. The five parallel secondary visual area pathways in the model are shown on a shaded grey background. Black stars denote that the p-value of a two-sample t-test with Benjamini/Hochberg correction for the 22 comparisons within one brain area is less than 0.05; red stars denote that the p-value of a two-sample t-test with Benjamini/Hochberg correction for all 9x22 comparisons across all 9 brain areas is less than 0.05. (TIF)
S2 Fig. Functional similarity and validation accuracy during the training process for multiple MouseNet instances. Each row compares models with a different brain area. We show three instances of MouseNet during their training process. Each dot represents the best layer's SSM of one instance at a certain epoch to the specified brain area, with each instance's highest achieved SSM during the training process marked by a cross. The clear jumps in validation accuracy occurred when we reduced the learning rate.