XDream: Finding preferred stimuli for visual neurons using generative networks and gradient-free optimization

A longstanding question in sensory neuroscience is what types of stimuli drive neurons to fire. The characterization of effective stimuli has traditionally been based on a combination of intuition, insights from previous studies, and luck. A new method termed XDream (EXtending DeepDream with real-time evolution for activation maximization) combines a generative neural network and a genetic algorithm in a closed loop to create strong stimuli for neurons in the macaque visual cortex. Here we extensively and systematically evaluate the performance of XDream. We use ConvNet units as in silico models of neurons, enabling experiments that would be prohibitive with biological neurons. We evaluate how the method compares to brute-force search and how well it generalizes to different neurons and processing stages, and we explore design and parameter choices. XDream can efficiently find preferred features for visual units without any prior knowledge about them. It extrapolates to different layers, architectures, and developmental regimes, performing better than brute-force search and often better than exhaustive sampling of >1 million images. Furthermore, XDream is robust to the choice of image generator, optimization algorithm, and hyperparameters, suggesting that its performance is locally near-optimal. Lastly, we found no significant advantage to problem-specific parameter tuning. These results establish expectations and provide practical recommendations for using XDream to investigate neural coding in biological preparations. Overall, XDream is an efficient, general, and robust algorithm for uncovering neuronal tuning preferences using a vast and diverse stimulus space. XDream is implemented in Python, released under the MIT License, and works on Linux, Windows, and macOS.


Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
To Whom It May Concern: We are enclosing a manuscript entitled "Finding Preferred Stimuli for Visual Neurons Using Generative Networks and Gradient-Free Optimization" for your consideration for publication as a methods or software article in PLOS Computational Biology.
A central question in neuroscience is the elucidation of the tuning properties of cortical neurons. In the visual system, neurophysiological recordings over the last five decades have discovered neurons with center-surround receptive fields, tuning to bars of specific orientations, tuning to curvature, all the way to tuning to complex objects such as faces. Despite these textbook notions of stimulus selectivity along ventral visual cortex, we still do not know whether these stimuli are truly optimal or whether there could be other images that trigger even stronger responses. The main challenge to defining preferred stimuli is that it is impossible to exhaustively sample all images. Instead, past investigations have depended on ingenious heuristics based on intuitions, natural image statistics, computational models, and random exploration.
We recently introduced a new method to address this challenge and systematically investigate neural coding properties in an unbiased manner. We refer to this method as XDream (for EXtending DeepDream with Real-time Evolution for Activation Maximization). In Ponce et al., Cell 2019, we presented initial experimental evidence of the method's feasibility in macaque monkey neurons. This method elicited strong interest, leading to a large number of important questions about the performance characteristics and practical application of the algorithm. In the current manuscript, we quantitatively investigate these questions, with the goal of making the algorithm more widely applicable by the research community. By using computational models as a proxy for neuronal recordings, we are able to systematically and extensively test critical questions about the utility of the algorithm. Specifically, we determined the following performance characteristics of the algorithm:
§ XDream can find "super stimuli" that trigger stronger responses than more than one million natural images;
§ XDream is highly "expressive," allowing extensive sampling of image space;
§ XDream can find globally optimal stimuli;
§ The ability of XDream to find effective stimuli generalizes across stages of processing from early to late vision, across different network architectures, and across different training regimes analogous to developmental shaping of selectivity;
§ XDream is robust to different random initial conditions.
In addition, we extended the method in the following way: we incorporated several image generators and two other optimization algorithms, and empirically optimized parameter settings. The performance of XDream is robust to a wide range of parameter settings and algorithmic choices.
Based on the enclosed results, we can provide specific recommendations for applying this method. Given the ability of the XDream method to uncover stimulus preferences in an unbiased, efficient, and robust manner, we expect that these results will be of interest to the large community of researchers using electrophysiological recordings, calcium imaging, or other methods to functionally study visual cortex across species and visual areas. The principles of the algorithm could be informative for designing similar methods for studying other sensory modalities.
The source code of the algorithm is openly shared with the community (https://github.com/willwx/XDream). As a part of publishing the software, we are updating the repository as recommended in the pre-submission inquiry and will include license information, more API documentation, and more examples before acceptance of the paper.
The manuscript is not under consideration elsewhere. We thank you in advance for considering this submission.

Introduction
What stimuli excite a neuron, and how can we find them? Considering vision as a paradigmatic example, the selection of stimuli to probe neural activity has shaped our understanding of how visual neurons represent information. It is practically impossible to exhaustively evaluate neuronal responses to images, due to the combinatorially large number of possible images. Instead, investigators have traditionally selected stimuli guided by natural image statistics, behavioral relevance, theoretical postulates about internal representations, intuitions from previous studies, and serendipitous findings. Stimuli selected in this way underlie our current understanding of how circular center-surround receptive fields [2] give rise to orientation tuning [3], then to encoding of more complex shapes such as curvatures [4,5], and further to selective responses to complex objects such as faces [6-8].

Despite the progress made in understanding visual cortex by testing limited sets of hand-chosen stimuli, these experiments could be missing the true feature preferences of neurons. In other words, there could be other images that drive visual neurons better than those found so far, and such images could lead us to revisit our current descriptions of feature tuning in visual cortex.

A recently introduced method shows promise to begin bridging the gap. Termed XDream (eXtending DeepDream with real-time evolution for activation maximization), this method combines a genetic algorithm and a deep generative neural network [9] (both inspired by previous work [10-13]) to evolve images that trigger high activation in neurons [1]. XDream can generate strong stimuli for neurons in macaque inferior temporal (IT) and primary visual cortex (V1).

The performance and design options of XDream have not been thoroughly evaluated, due to the time-intensiveness of neuronal recordings and the difficulty of fully controlling all experimental variables. To overcome these challenges, here we test the performance of XDream using state-of-the-art in silico models of visual neurons in lieu of real neurons, in the same spirit as [14]. Specifically, we use convolutional neural networks (ConvNets) pre-trained on visual recognition tasks as an approximation to the computations performed along ventral visual cortex [15-17]. Using these models as a proxy for real neurons allows us to compare synthetic stimuli with a large set of reference images, to evaluate XDream's performance across processing stages, model architectures, and training regimes, to empirically optimize algorithm and parameter choices in a systematic fashion, and to disentangle the effects of response stochasticity. Although there is a rich literature in computer science on feature visualization [18-20],

Random exploration of stimulus space is inefficient

A common approach for exploring neuronal selectivity is to use arbitrarily selected images, often from a limited number of categories (for example, in [8]). Thus, we considered random exploration as a baseline for comparison. We used the AlexNet architecture as the target model ([25], implemented as CaffeNet; Table 2) and sampled images from ImageNet ([23]; ILSVRC12 dataset, 1,431,167 images), a large dataset common in computer vision that also contains the training set of CaffeNet. We randomly sampled n images either from all of ImageNet or from 10 categories randomly selected from the 1,000 training categories in ImageNet (n/10 images per category). For units in different layers of the network, we evaluated the activation value to these images and calculated the maximum relative activation, defined as the ratio between the maximum activation in the n random images and the maximum activation in all of ImageNet (S1 Fig). By definition, the relative activation for the best image in ImageNet is 1, which is also an upper bound on the observed relative activation values. Randomly selected images with both approaches typically yielded relative activation values well below 1. As expected, the maximum observed relative activation increased with n, but only slowly, with near-logarithmic growth. Moreover, for later layers (e.g., fc8), sampling from only 10 categories yielded significantly worse results than sampling completely randomly, which we hypothesize is because the small number of categories imposes a bottleneck on the diversity of high-level features represented. In neuroscience studies, category selection is clearly not completely random; investigators may have intuitions and prior knowledge about the types of stimuli that are likely more effective. To the extent that those intuitions are correct, they can enhance the search process.
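The maximum-relative-activation metric is straightforward to compute. The sketch below is illustrative only: the function name and the synthetic log-normal responses are assumptions, not part of the released XDream code.

```python
import numpy as np

def max_relative_activation(activations, n, rng=None):
    """Maximum relative activation for a random sample of n images.

    `activations` is a 1-D array of one unit's responses to every image in
    the reference set (e.g., all of ImageNet). The ratio is taken against
    the best response in the full set, so it is bounded above by 1.
    """
    rng = np.random.default_rng(rng)
    sample = rng.choice(activations, size=n, replace=False)
    return sample.max() / activations.max()

# Toy demonstration with synthetic long-tailed responses: a small random
# sample rarely contains the single best image, so the ratio stays below 1.
rng = np.random.default_rng(0)
responses = rng.lognormal(size=1_000_000)
ratio = max_relative_activation(responses, n=10_000, rng=1)
```

Sampling the entire reference set (n equal to the set size) recovers a ratio of exactly 1, matching the upper bound noted above.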
Fig 1. Overview of the XDream method. a) XDream combines in a closed loop an image generator (e.g., a generative adversarial network), a target neuron (e.g., a unit in a ConvNet), and a non-gradient-based optimization algorithm (e.g., a genetic algorithm). In each iteration, the optimization algorithm proposes a set of codes, the image generator synthesizes the codes into images, the images are evaluated by the target neuron to produce one scalar score per image, and the scores are used by the optimization algorithm to propose a new set of codes. Importantly, no optimization gradient is needed from the neuron. b,c) An example experiment targeting CaffeNet layer fc8, unit 1. b) Mean activation achieved over 500 generations, 20 images per generation (10,000 total image presentations). c) Images obtained at a few example generations indicated by minor x-ticks in b). The activation to each image is labeled above the image and indicated by the color of the margin.

[…] an optimization algorithm performing the search (Fig 1a). In each generation, the generator creates images from their latent representations (codes), the target unit activation is evaluated for each of the generated images, and the optimizer refines the codes based on the measured activation values. Initialized randomly (examples shown in Fig 1a), the algorithm is iterated for 10,000 total image presentations, a relatively small number that comprises < 1% of ImageNet and that is accessible in a typical neuroscience experiment. Critically, the algorithm does not use any prior knowledge about the architecture or weights of the target model. […] ImageNet. Unit 1 in fc8 was trained to be a "goldfish" detector. Correspondingly, when we randomly sampled 10,000 images from ImageNet, the best images were photos of goldfish (Fig 1d). The highest activation value observed in this random sample was […] generated images that elicited higher activation than any natural image from ImageNet.
We refer to such images with relative activation > 1 as super stimuli.

Fig 2. […] (see Table 2 for the specific layers). The violin contours indicate kernel density estimates of the distributions, white circles indicate the medians, thick bars indicate first and third quartiles, and whiskers indicate 1.5× interquartile ranges. For comparison, grey boxes (interquartile ranges) and lines (medians) show the distribution of maximum relative activation for 10,000 random ImageNet images. The horizontal dashed line corresponds to the best ImageNet image. b) Optimized (top row) and best ImageNet (bottom row) images and activations for 10 example units across layers and architectures. For output units, corresponding category labels are shown below the images.
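The closed loop described above can be sketched in a few lines. Everything below is a simplified stand-in (an identity "generator", a linear "neuron", and a bare-bones genetic algorithm), not the released XDream implementation; it only illustrates that selection on scalar scores, with no gradient from the scored unit, can drive codes toward high activation.

```python
import numpy as np

def evolve(generate, score, code_dim=4096, pop_size=20, n_gen=500,
           n_survivors=10, mut_sigma=0.1, seed=0):
    """Minimal closed loop: propose codes -> synthesize -> score -> select.

    `generate(code) -> image` and `score(image) -> float` are black boxes;
    no gradient from the scored unit is ever used.
    """
    rng = np.random.default_rng(seed)
    codes = rng.standard_normal((pop_size, code_dim))
    for _ in range(n_gen):
        scores = np.array([score(generate(c)) for c in codes])
        order = np.argsort(scores)[::-1]          # best first
        parents = codes[order[:n_survivors]]
        children = []
        for _ in range(pop_size):
            a, b = parents[rng.integers(n_survivors, size=2)]
            mask = rng.random(code_dim) < 0.5     # uniform crossover
            child = np.where(mask, a, b) + mut_sigma * rng.standard_normal(code_dim)
            children.append(child)
        codes = np.array(children)
    scores = np.array([score(generate(c)) for c in codes])
    return codes[np.argmax(scores)], scores.max()

# Stand-in generator and "neuron": identity generator, unit whose activation
# is the projection of the image onto a preferred direction.
target = np.ones(8)
best_code, best_score = evolve(lambda c: c,
                               lambda img: float(img @ target),
                               code_dim=8, n_gen=50)
```

In the real method the generator is a deep generative network and the score comes from a ConvNet unit (or a recorded neuron), but the loop structure is the same.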
XDream can efficiently search a large and diverse stimulus space to find the ground-truth optimal stimulus

Is XDream limited in the kind of images it can find? This is an essential question for the utility of XDream. An analysis of this question was presented in the supplement to [1], but the question is relevant here, so we discuss the analyses in the current context using slightly different data. Because XDream optimizes in the latent space of a generative network, a first constraint is the range of images that can be created by the generative network. It is hard to quantify what fraction of all possible images is represented by a generative network. Instead, we qualitatively assessed the expressiveness of the generative network by challenging it to synthesize diverse, arbitrarily selected target images (Fig 3) using two heuristic methods: 1) iteratively optimizing an image code to minimize the pixel-wise difference between the generated and target image (labeled "opt"); 2) directly using the CaffeNet fc6 representation of the image as the image code, because the generative network was originally trained to invert this representation (labeled "ivt"; see Methods for details of both methods).

Fig 3. The image generator can approximate arbitrary images, and XDream can find these images using only a scalar distance as a loss function. This figure reproduces Supplementary Figure 1 in [1]. The generative network is challenged to synthesize arbitrary target images (row 1) using one of two encoding methods, "opt" (row 2) and "ivt" (row 3; Methods). In addition, XDream can discover the target image efficiently (within 10,000 test image presentations) by using the genetic algorithm to minimize the mean squared difference between the target image and any test image as a loss function, either in pixel space (row 4) or in CaffeNet pool5 representation space (row 5).
Not only does XDream need to represent diverse images, it must also efficiently find those images (i.e., in a reasonable number of queries). When investigating the activation of units in a target model as in […] presentations. The optimized images generated by XDream were not perfectly identical to the ground-truth target. At least part of the remaining differences could be attributed to the loss function: pixel-wise loss is known to lead to excessive smoothing, and pool5 loss is expected to lose some features and spatial information due to pooling operations and ReLU activations in preceding layers. These results show that XDream is highly "expressive" (i.e., it can generate a very large and diverse set of images) and suggest that it is in principle possible to reach close to the global maximum. Of note, these results do not show that any image can be generated, nor do they provide a proof of convergence to the global maximum for all visual neurons.

The default generative network used in XDream was trained to invert the internal representations of CaffeNet layer fc6, which was in turn trained on ImageNet [9]. Could this generator allow XDream to generalize to other network layers, architectures, and training sets? If XDream is specific to certain layers and architectures, or specific to ImageNet-trained networks, this may limit its applicability to real neurons. We first assessed whether XDream could extrapolate to other layers in CaffeNet by selecting 100 units each from the early, middle, late, and output layers of CaffeNet (Fig 2a). XDream was able to find optimized images that were better than the best randomly selected images across all layers (p < 10^-16, FDR corrected for 28 tests in this section).
The optimized images were also significantly better than the best images in ImageNet across all units (p < 10^-9, FDR corrected for 28 tests in this section). The highest relative activation was obtained for the late layer, but this was not a result of using the fc6-based generative network, a possibility we tested in Fig 5.

Next, we tested 100 units from 4 layers each of 5 different network architectures: ResNet-v2 152- and 269-layer variants [30], Inception-v3 [31], Inception-v4, and Inception-ResNet-v2 [32]. These models were all trained on ImageNet. XDream was able to generate better images than the best random images for the vast majority of […] Fig 2b. Notably, in all networks, the late layer could be driven to higher relative activation than the other layers, suggesting that the relatively high optimized activation may be a property of layers at this relative processing stage in the network.

Finally, we tested a network with the same architecture as CaffeNet but trained on a different dataset, PlacesCNN [24]. PlacesCNN was also trained on photographic images, but ones depicting scenes rather than salient objects. Again, XDream was able to find super stimuli across layers in this network, even though XDream had no access to the corresponding training images (Fig 2a, […]

[…] form the initial population (Fig 4c). To convert images into image codes comprising the initial population, we used either the "opt" or the "ivt" method (Methods).

Initializing with better or worse natural images did not improve the optimized images in the conv2 layer (p = 0.87 and 0.19 for "opt" and "ivt", respectively, FDR-corrected for 8 tests in this and the next sentence). In higher layers, initializing with the best natural images led to slightly higher relative activation values (Fig 4c;
[…] for "opt" and p < 10^-10 for "ivt" across layers).

Fig 4. For each unit, the 20 best, middle, and worst images from ImageNet, as ranked by that unit, were used to initialize the genetic algorithm. The images were converted to image codes using one of two encoding algorithms, "opt" or "ivt" (see Methods).

"Slope" quantifies the improvement in relative activation (median across 100 random 214 units each layer) if a better initialization is used (worst → middle or middle → best).

Concretely, the slope is the linear regression coefficient with the independent variable being {0, 1, 2} for {worst, middle, best}, respectively. The "p-value" indicates whether there is a statistically significant difference when initializing with {worst, middle, best} images (one-way ANOVA, FDR-corrected).
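As a minimal illustration of this slope, the coefficient can be computed as follows; the median values passed in are made-up numbers for demonstration, not results from the paper.

```python
import numpy as np

def initialization_slope(median_worst, median_middle, median_best):
    """Linear-regression coefficient of median relative activation on
    initialization quality, coded as 0 (worst), 1 (middle), 2 (best)."""
    x = np.array([0.0, 1.0, 2.0])
    y = np.array([median_worst, median_middle, median_best])
    # Ordinary least squares; polyfit returns [slope, intercept] for deg=1.
    return float(np.polyfit(x, y, 1)[0])

# Hypothetical medians for one layer (illustrative numbers only):
slope = initialization_slope(1.10, 1.12, 1.15)
```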

To summarize, initializing the algorithm with different random conditions resulted in only a small variation in the optimized image activation, and the images were similar, although not identical, at the pixel level. Initializing with prior knowledge has little to no effect on the optimized image activation, unless the seed is comparable to the best image in ~1 M images, and only in later layers.

An essential component of XDream is the image generator. In the other experiments in this paper, we used a generative network based on CaffeNet fc6 representations [9]. Here, we examine whether the choice of image generator matters for the performance of XDream, and whether the answer depends on the target unit. We hypothesized that image […]

[…] shown thus far were based on using a genetic algorithm as the optimization algorithm, a choice inspired by previous work [10-12]. Here, we compared the genetic algorithm to two additional algorithms, a naïve finite-difference gradient descent algorithm (FDGD; Methods) and Natural Evolution Strategies (NES; [26], Methods). NES has been used in a related problem [27]. FDGD and NES were significantly worse than the
genetic algorithm in CaffeNet conv2 (p < 10^-13 for FDGD and NES, FDR corrected for 20 tests here and in the stochastic condition in the next section) and conv4 layers (p < 10^-3). Yet, both FDGD and NES were significantly better than the genetic algorithm in CaffeNet fc6 (p < 10^-16), fc8 (p < 10^-16), and Inception-ResNet-v2 classifier layers (p < 10^-12; Fig 6).

[…] evoke different responses on repeated presentations (even though the trial-averaged response may be highly consistent; see [33]). To test whether XDream could still find super stimuli with noisy units, we implemented a simple model of stochasticity in the units by using the true activation value to control the rate of a homogeneous Poisson process, from which the "observed" activation value on a single trial was drawn (Methods).

Homogeneous Poisson processes have been used extensively to model stochasticity in cortical neurons [34].
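A minimal sketch of this noise model follows, assuming the true activation is rectified and used directly as the Poisson rate; the exact scaling used in Methods may differ.

```python
import numpy as np

def noisy_response(activation, rng, scale=1.0, floor=0.0):
    """Single-trial 'observed' response from a unit's true activation.

    The true activation (rectified and scaled -- both assumptions here)
    sets the rate of a homogeneous Poisson process, and one count per
    presentation is drawn from it.
    """
    rate = max(floor, activation) * scale
    return rng.poisson(rate)

rng = np.random.default_rng(0)
trials = [noisy_response(20.0, rng) for _ in range(1000)]
# Single trials vary (Poisson variance equals the mean), but the
# trial-averaged response approaches the true rate.
```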

As expected, performance deteriorated when noise was added (Fig 6, noisy condition). However, XDream was still able to find optimized stimuli that were better than random exploration for most layers (p < 10^-10 for all tested layers except p = 0.19 for CaffeNet fc8, FDR-corrected for 5 tests) and was also able to find super stimuli for some layers (p < 10^-5 for CaffeNet conv4 and fc6 layers; p = 0.069 for CaffeNet conv2 layer; FDR-corrected for 5 tests), although it was not able to find super stimuli for most units in the CaffeNet fc8 layer and Inception-ResNet-v2 classifier layer (p = 1).
Discussion

It is challenging to study the stimulus features encoded by visual cortical neurons, because the number of possible images is beyond astronomical while experimental time with a neuron is limited. In this work, we thoroughly characterize a method, named XDream, that combines an image generation algorithm with closed-loop optimization to directly use neural activity to search image space and find preferred stimuli. Using units in artificial neural networks as models of neurons in the brain, and with realistic constraints, we systematically evaluated the performance of XDream. We found that XDream can efficiently find images that trigger high activations, often higher than even the best among over a million natural images (Fig 1). XDream can generate a diverse set of stimuli and, in one scenario where the global maximum is known by construction, can find images approximating that maximum (Fig 3). XDream can generalize across early and late processing stages, across widely different architectures, and also across different training sets, which resemble different developmental environments (Fig 2). Furthermore, XDream is robust to different initial conditions (Fig 4) and to noise in unit responses (Fig 6).

In the computer science literature, activation maximization is a well-known approach for visualizing features represented by units in a ConvNet [13,20,35-37]. However, these techniques are only applicable to networks that provide optimization gradients. In other words, perfect knowledge of the architecture and weights is assumed. Clearly, such […]

Another method that has inspired the current work is to use a genetic algorithm to search a parametrically defined stimulus space [10-12]. XDream extends this idea by using a more diverse stimulus space learned by a generative neural network, which does not require prior knowledge or intuitions about the tuning properties of the neurons under study. In addition, we frame the approach more broadly, incorporating additional image generators and optimization algorithms.

Recently, several other studies have pursued goals similar to those of XDream, but with a different strategy [21,22,38,39]. In that strategy, a ConvNet-based model is first fitted to predict neuronal firing responses; then, standard white-box activation-maximization techniques are applied to the ConvNet model. The relation between this strategy and XDream resembles the relation between the so-called "substitute model" approach and what, in comparison, we may call a "direct" approach in computer science research on black-box adversarial attacks. There, studies have found that the direct approach is both free of transferability problems (because no substitute model is involved) and more sample-efficient [27,40]. In this light, when comparing the substitute-model and direct strategies for studying neural coding properties, sample efficiency is an important consideration, as is test-case performance. Recent preliminary results suggest that some ConvNet-based models do not fully extrapolate to images very different from those used during training [1,21]. If so, these models may not adequately guide the exploration of image space, of which training images can represent only a small fraction.

The performance of XDream is robust to many design and parameter choices in the algorithm. The genetic algorithm and several related high-level generative networks [9] are adequate, and perhaps ideally suited, for this family of tasks. In addition, empirically optimized hyperparameter values are listed in S1 Table. The robustness of the algorithm indicates that specific parameter choices are unlikely to drastically change performance, and suggests that there is no need to tailor parameters to specific neurons, areas, or species. Nevertheless, it may be possible to further […]
[…] identical images that trigger similar activation values (Fig 4b). We speculate that there may be a whole "invariance manifold" of related images that elicit similar […]

[…] cortex, these models cannot replace actual neurons. The fact that XDream extrapolates across layers, architectures, and models trained on different datasets bodes well for its ability to extrapolate to different ventral stream areas and even to the visual cortices of different species, but experiments with biological neurons are necessary to evaluate how well XDream will actually generalize to different neuronal types, brain areas, and species.

In summary, the XDream method is able to discover preferred features of visual […]

[…] [15-17]. When substituting in silico units for visual neurons, we considered the activation of a unit to a given image as analogous to the firing rate of a real neuron to a picture; this activation provides the objective function used by the optimization algorithm to search for preferred images. Further, we considered a layer in the models as analogous to a cortical area. We treated model layers as the units of comparison and report statistics across 100 randomly selected units in each layer, unless otherwise noted. When comparing two conditions for the same 100 units, we used a Wilcoxon signed-rank test. When comparing multiple conditions (in Fig 4 and Fig 5), we used one-way ANOVA. We considered one-way ANOVA sufficient because the distributions were approximately normal, one-way ANOVA is conservative under deviations from normality, and its power quickly converges to the normal-case power when sample sizes are reasonably large. For brevity, we cite only the corresponding p-values and do not name the type of test in the text; it should be clear from context. For multiple comparisons made to answer the same question, p-values were corrected for false discovery rate using the Benjamini-Hochberg procedure with an alpha level of 0.01; the number of conditions corrected is mentioned in the text with the first p-value in the group.
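For concreteness, the Benjamini-Hochberg step-up procedure can be sketched as follows. This is a generic implementation, not code from the XDream repository, and the example p-values are arbitrary.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.01):
    """Boolean mask of hypotheses rejected at FDR level alpha.

    Standard step-up procedure: sort the m p-values, find the largest k
    with p_(k) <= (k / m) * alpha, and reject hypotheses 1..k.
    """
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = (np.arange(1, m + 1) / m) * alpha
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest passing rank (0-based)
        reject[order[: k + 1]] = True
    return reject

# Example: with alpha = 0.01 and 4 tests, only sufficiently small
# p-values survive the correction.
mask = benjamini_hochberg([1e-16, 0.004, 0.19, 0.87], alpha=0.01)
```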

Target models and layers

We selected several state-of-the-art ConvNets as target models, many of which have been shown to be reasonably good models of primate visual neuron responses [16]. In each model, we tested approximately the early, middle, and late processing stages as well as the output layer; these layers roughly correspond to early-to-late processing stages in the ventral visual cortex [17,21,22]. Table 2 specifies which architectures and layers were used. One hundred (100) units were randomly selected from each layer. For convolutional layers, only the center spatial position was selected for each feature channel. All the networks were trained on the ImageNet dataset [23] except PlacesCNN, which was trained on the Places-205 dataset [24].

Table 2. Layers targeted in each model (early, middle, late, output):
    inception-v3:        pool2 3x3 s2, reduction a concat, reduction b concat, classifier
    inception-v4:        inception stem3, reduction a concat, reduction b concat, classifier
    inception-resnet-v2: stem concat, reduction a concat, reduction b concat, classifier
    placesCNN:           conv2, conv4, fc6, fc8

For each network, 4 layers covering roughly the early, middle, and late stages of processing, together with the output layer before softmax, were selected as targets. PlacesCNN has the same architecture as CaffeNet but is trained on the Places-205 dataset [24]. CaffeNet is as implemented in https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet, PlacesCNN as in [24], and the remaining models as in https://github.com/GeekLiB/caffe-model.
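For concreteness, the unit-selection rule above (100 random feature channels, center spatial position for convolutional layers) can be sketched as follows; the function name and the (channels, height, width) array layout are illustrative assumptions, not the XDream API:

```python
import numpy as np

def select_units(activations, n_units=100, seed=0):
    """Select up to n_units random units from a layer's activations.

    For a convolutional layer (shape (C, H, W)), only the center spatial
    position of each randomly chosen feature channel is used; for a fully
    connected layer (shape (C,)), channels are sampled directly."""
    rng = np.random.default_rng(seed)
    if activations.ndim == 3:                      # conv layer: (C, H, W)
        c, h, w = activations.shape
        channels = rng.choice(c, size=min(n_units, c), replace=False)
        return activations[channels, h // 2, w // 2]
    c = activations.shape[0]                       # fully connected: (C,)
    channels = rng.choice(c, size=min(n_units, c), replace=False)
    return activations[channels]
```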

Image generators
An image generator is a function that outputs an image given some representation of that image (an image code) as input. We tested the family of DeePSiM generators developed in [9]; they are generative adversarial networks trained to invert each layer of AlexNet [25]. The pre-trained models are available on the authors' website.

Optimization algorithms

An optimization algorithm uses the fitness values associated with a set of image codes (i.e., the activation values of a target model unit in response to the image associated with each code) to propose a new set of codes expected to have higher fitness. We used a genetic algorithm by default, but also considered two other algorithms: finite-difference gradient descent (FDGD) and natural evolution strategies (NES).
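Schematically, the closed loop between generator, target unit, and optimizer can be written as below; `generator`, `score_fn`, and `optimizer` are placeholders for the components described above, not the actual XDream interfaces:

```python
import numpy as np

def xdream_loop(generator, score_fn, optimizer, n_generations=10):
    """Closed-loop search: the optimizer proposes image codes, the generator
    renders them, and the target unit's activation serves as fitness."""
    codes = optimizer.init_codes()
    best_code, best_fitness = None, -np.inf
    for _ in range(n_generations):
        images = [generator(c) for c in codes]                    # codes -> images
        fitness = np.array([score_fn(im) for im in images])       # unit activations
        i = int(np.argmax(fitness))
        if fitness[i] > best_fitness:                             # track best so far
            best_code, best_fitness = codes[i], fitness[i]
        codes = optimizer.propose(codes, fitness)                 # next generation
    return best_code, best_fitness
```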

The genetic algorithm works as follows. Each generation consists of n codes, where n is the population-size parameter. Their corresponding fitness values y_i, i = 1, ..., n are transformed into probability weights w_i = exp((y_i - min_i(y_i)) / k), where k = stdev_i(y_i) / s is analogous to temperature in the Boltzmann equation and s is the selectivity parameter (higher s is analogous to lower temperature and means that high fitness is more heavily favored). To create each code in the next generation (a progeny), two codes (parents) are drawn, with the probability of drawing code i equal to p_i = w_i / Σ_j w_j. In our setting, the two parents do not have to be distinct. Each component of the progeny code is inherited from one of the two parents, and random Gaussian noise is then added as mutation.

In our implementation of NES, we do not estimate separate variances for each component, nor any covariances, different from the general case discussed in [26]. However, we do update the scale of the search distribution, different from [27]. We have tried updating separate, independent σ's for each component in the image code, but the performance was much worse, presumably because there is too little information to reliably estimate gradients for the second moment.
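A minimal sketch of one generation of this genetic algorithm is given below; the uniform crossover and the fixed mutation scale are simplifying assumptions, as the actual recombination and mutation settings are governed by the hyperparameters discussed later:

```python
import numpy as np

def next_generation(codes, fitness, selectivity=2.0, mutation_sigma=0.1, rng=None):
    """One generation: Boltzmann-weighted parent selection, uniform
    crossover, and Gaussian mutation. codes: (n, d) array; fitness: (n,)."""
    rng = rng or np.random.default_rng()
    codes = np.asarray(codes, dtype=float)
    y = np.asarray(fitness, dtype=float)
    k = np.std(y) / selectivity          # "temperature" k = stdev(y) / s
    if k == 0:                           # all fitness equal: uniform weights
        w = np.ones_like(y)
    else:
        w = np.exp((y - y.min()) / k)    # w_i = exp((y_i - min(y)) / k)
    p = w / w.sum()                      # p_i = w_i / sum_j(w_j)
    n, d = codes.shape
    progeny = np.empty_like(codes)
    for j in range(n):
        i1, i2 = rng.choice(n, size=2, p=p)   # parents need not be distinct
        mask = rng.random(d) < 0.5            # uniform crossover
        progeny[j] = np.where(mask, codes[i1], codes[i2])
    progeny += mutation_sigma * rng.normal(size=progeny.shape)  # mutation
    return progeny
```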

Converting images to image codes

In several cases (e.g., Fig 3, Fig 4), we needed to convert an image into an image code in the input space of the image generator. We used two heuristic methods for this purpose, "opt" and "ivt." In the "opt" method, starting with an all-zero image code, the image code was iteratively optimized using backpropagation and gradient descent to minimize the pixel-wise difference between the generated image and the target image. In the "ivt" method, the fc6-layer encoding of the target image by AlexNet was used directly as the image code, because the generator was originally trained to invert this encoding [9].

Hyperparameter optimization

To choose a set of good hyperparameters, we used a greedy algorithm that maximized performance over a small set of target units by varying one hyperparameter at a time. To keep the computation tractable, we used 12 units in total, 3 randomly chosen from the output layer of each of 4 networks: CaffeNet, ResNet-152, Inception-v2, and PlacesCNN. Starting from an educated guess of hyperparameter values, one hyperparameter was chosen at a time. Four test values were chosen around the current value with a pre-defined step size, and optimization performance was measured at the test values. The value that yielded the best performance was set as the current value. Then another hyperparameter was chosen to be varied. The same hyperparameter was not chosen again until all others had been considered once; we call each repeat over all hyperparameters one round. If no hyperparameter was updated in a given round, the step size was decreased for the hyperparameter that had not been updated for the longest time. This procedure was repeated until the pre-defined, progressively decreasing step sizes for each parameter were exhausted. The final best parameter settings were used as the default values.
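The greedy, one-hyperparameter-at-a-time search can be sketched as follows (simplified: a fixed set of four test offsets and no progressively decreasing step-size schedule; all names are illustrative):

```python
def greedy_tune(params, step_sizes, evaluate, rounds=3):
    """Coordinate-wise greedy search.

    params: dict of hyperparameter name -> current value.
    step_sizes: dict of name -> step size for the four test values.
    evaluate: dict -> performance (higher is better).
    """
    best_score = evaluate(params)
    for _ in range(rounds):
        updated = False
        for name, step in step_sizes.items():      # one hyperparameter at a time
            current = params[name]
            for k in (-2, -1, 1, 2):               # four test values around current
                trial = dict(params, **{name: current + k * step})
                score = evaluate(trial)
                if score > best_score:             # keep the best value found
                    best_score, params = score, trial
                    updated = True
        if not updated:                            # no improvement in a full round
            break
    return params, best_score
```

For a smooth objective such as a negative quadratic, this coordinate-wise search reaches the optimum within a few rounds.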
Hyperparameters were optimized separately for each optimization algorithm, for each image generator, and for the noiseless and noisy cases (described below). The hyperparameters used in this paper are listed in S1 Table.

Simulated noise

For a unit with activation value y, the activation value corrupted by stochastic noise was drawn from Y ~ Poisson(max(0, γ)), where the rate parameter γ is analogous to the number of spikes of a neuron and equals y times a constant scaling factor. The scaling factor is necessary because the signal-to-noise ratio (SNR) of a Poisson process, in terms of mean over standard deviation, increases as the rate of the process increases, but normalized vs. unnormalized networks, and different layers within an unnormalized network, produce activation values of different scales. We used a scaling factor of 20/ŷ, where ŷ is the median of the maximum activation to 2,500 random ImageNet images, and 20 is a realistic number of spikes a biological neuron may fire to a preferred stimulus within a measurement window of 200 ms. Note that a Poisson process with γ = 20 has an SNR of μ/σ = √20 ≈ 4.5; the SNR is lower for less optimal images with a lower rate parameter. To simulate repeated image presentations, we simply drew multiple values of Y from the same Poisson distribution, with the realistic trade-off that fewer unique images could be presented given the same total number of allowed queries.
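The noise model can be sketched directly from the definitions above (the helper name is ours; `y_hat` stands for ŷ, the median maximum activation to 2,500 random ImageNet images):

```python
import numpy as np

def noisy_activation(y, y_hat, n_repeats=1, rng=None):
    """Draw noisy responses Y ~ Poisson(max(0, gamma)), gamma = y * 20 / y_hat.

    A preferred stimulus (y close to y_hat) yields a rate of about 20 spikes
    per 200-ms presentation; n_repeats simulates repeated presentations."""
    rng = rng or np.random.default_rng()
    gamma = y * (20.0 / y_hat)           # rate: activation scaled to spike counts
    rate = max(0.0, gamma)               # Poisson rate cannot be negative
    return rng.poisson(rate, size=n_repeats)
```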

Computing environment

The generative models were based on the caffe library [28] in Python. We have converted the models to PyTorch for the convenience of future research. Links to the converted models are available in the code repository (see Code availability below).

Neural network computations were performed on NVIDIA GPUs.

Source of the target images in Fig 3

The leftmost 2 images were manually created; the third image was synthesized as described in [29]; the fourth image is from the ImageNet test set; the rightmost 3 images are public-domain images from NASA and The Metropolitan Museum of Art.

We measured the maximum relative activation expected under two random-sampling schemes. "Random" refers to picking a given number of images randomly from the ImageNet dataset (blue). "10 categories" refers to first randomly picking 10 of the 1,000 ImageNet categories and then picking images randomly from those categories so that the total number of images equals the one indicated on the x-axis (gray). We considered 4 layers from the CaffeNet architecture. Lines indicate the median relative activation (highest activation in the sample divided by the highest activation over all ImageNet images). Shading indicates the 25th-to-75th-percentile range.
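The random-sampling baseline in this analysis can be sketched as follows (names are ours; `all_activations` holds a unit's responses to the full image set):

```python
import numpy as np

def max_relative_activation(all_activations, sample_size, n_draws=100, rng=None):
    """Median, over n_draws random samples, of the sample's best activation
    divided by the best activation over the entire image set."""
    rng = rng or np.random.default_rng()
    acts = np.asarray(all_activations, dtype=float)
    best = acts.max()
    draws = [acts[rng.choice(len(acts), size=sample_size, replace=False)].max() / best
             for _ in range(n_draws)]
    return float(np.median(draws))
```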