A High-Throughput Screening Approach to Discovering Good Forms of Biologically Inspired Visual Representation

While many models of biological object recognition share a common set of “broad-stroke” properties, the performance of any one model depends strongly on the choice of parameters in a particular instantiation of that model—e.g., the number of units per layer, the size of pooling kernels, exponents in normalization operations, etc. Since the number of such parameters (explicit or implicit) is typically large and the computational cost of evaluating one particular parameter set is high, the space of possible model instantiations goes largely unexplored. Thus, when a model fails to approach the abilities of biological visual systems, we are left uncertain whether this failure is because we are missing a fundamental idea or because the correct “parts” have not been tuned correctly, assembled at sufficient scale, or provided with enough training. Here, we present a high-throughput approach to the exploration of such parameter sets, leveraging recent advances in stream processing hardware (high-end NVIDIA graphics cards and the PlayStation 3's IBM Cell Processor). In analogy to high-throughput screening approaches in molecular biology and genetics, we explored thousands of potential network architectures and parameter instantiations, screening those that show promising object recognition performance for further analysis. We show that this approach can yield significant, reproducible gains in performance across an array of basic object recognition tasks, consistently outperforming a variety of state-of-the-art purpose-built vision systems from the literature. As the scale of available computational power continues to expand, we argue that this approach has the potential to greatly accelerate progress in both artificial vision and our understanding of the computational underpinnings of biological vision.


Input and Pre-processing
The input of the model was a 200 × 200 pixel image. In the pre-processing stage, referred to as Layer 0, this input was converted to grayscale and locally normalized:

N_0 = Normalize(grayscale(I))

where the Normalize operation is described in detail below. Because this normalization is the final operation of each layer, in the following sections we refer to N_{ℓ−1} as the input of each Layer ℓ > 0 and N_ℓ as the output.
Linear Filtering
Description: The input N_{ℓ−1} of each subsequent layer (i.e. Layer ℓ, ℓ ∈ {1, 2, 3}) was first linearly filtered using a bank of k_ℓ filters to produce a stack of k_ℓ feature maps, denoted F_ℓ. In a biologically-inspired context, this operation is analogous to the weighted integration of synaptic inputs, where each filter in the filterbank represents a different cell.
Definitions: The filtering operation for Layer ℓ is denoted:

F_ℓ = Filter(N_{ℓ−1}, Φ_ℓ)

and produces a stack, F_ℓ, of k_ℓ feature maps, with each map, F_ℓ^i, given by:

F_ℓ^i = N_{ℓ−1} ⊗ Φ_ℓ^i

where ⊗ denotes a correlation of the output of the previous layer, N_{ℓ−1}, with the filter Φ_ℓ^i (i.e. sliding along the first and second dimensions of N_{ℓ−1}). Because each successive layer after Layer 0 is based on a stack of feature maps, N_{ℓ−1} is itself a stack of 2-dimensional feature maps. Thus the filters contained within Φ_ℓ are, in turn, 3-dimensional, with their third dimension matching the number of filters (and therefore the number of feature maps) from the previous layer (i.e. k_{ℓ−1}).

Parameters:
• The filter shapes f_s × f_s × f_d were chosen randomly with f_s ∈ {3, 5, 7, 9} and f_d = k_{ℓ−1}.
• Depending on the layer considered, the number of filters k_ℓ was chosen randomly from the following lists:
  - In Layer 1, k_1 ∈ {16, 32, 64}
  - In Layer 2, k_2 ∈ {16, 32, 64, 128}
  - In Layer 3, k_3 ∈ {16, 32, 64, 128, 256}
All filters were initialized to random starting values, and their weights were then learned during the Unsupervised Learning Phase (described below; an example of a set of learned filterbanks from one model instance is shown in Figure S6).
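The filtering step described above can be sketched in NumPy as a "valid"-domain correlation of the previous layer's feature stack with each 3-dimensional filter. The function names and toy array shapes below are illustrative, not from the original implementation:

```python
import numpy as np

def correlate3d_valid(stack, filt):
    """'Valid' correlation of an (H, W, k_prev) feature stack with one
    (fs, fs, k_prev) filter, sliding over the first two dimensions only."""
    H, W, _ = stack.shape
    fs = filt.shape[0]
    out = np.empty((H - fs + 1, W - fs + 1))
    for y in range(H - fs + 1):
        for x in range(W - fs + 1):
            out[y, x] = np.sum(stack[y:y + fs, x:x + fs, :] * filt)
    return out

def filter_layer(N_prev, Phi):
    """Apply a bank of k filters (Phi: (k, fs, fs, k_prev)) to the previous
    layer's output, producing an (H', W', k) stack of feature maps."""
    return np.stack([correlate3d_valid(N_prev, f) for f in Phi], axis=-1)
```

Plain Python loops are used for clarity; the paper's point is precisely that fast stream-processing implementations of this operation make large-scale search feasible.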

Activation Function
Description: Filter outputs were subjected to a threshold-and-saturation activation function, wherein output values were clipped to lie within a parametrically defined range. This operation is analogous to the spontaneous activity thresholds and firing saturation levels observed in biological neurons.
Definitions: We define the activation function:

A_ℓ = Activate(F_ℓ)

that clips the outputs of the filtering step, such that:

A_ℓ = min(max(F_ℓ, γ_min), γ_max)

where the two parameters γ_min and γ_max control the threshold and saturation, respectively. Note that if the minimum and maximum threshold values are −∞ and +∞, respectively, the activation is linear (no output is clipped).
Parameters:
• γ_min was randomly chosen to be −∞ or 0
• γ_max was randomly chosen to be 1 or +∞
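The clipping operation above maps directly onto a single NumPy call; a minimal sketch:

```python
import numpy as np

def activate(F, gamma_min=0.0, gamma_max=1.0):
    """Threshold-and-saturation nonlinearity: clip filter outputs to the
    range [gamma_min, gamma_max]. With (-inf, +inf) the activation is
    linear (no output is clipped)."""
    return np.clip(F, gamma_min, gamma_max)
```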

Pooling
Description: The activations of each filter within some neighboring region were then pooled together and the resulting outputs were spatially downsampled.
Definitions: We define the pooling function:

P_ℓ = Pool(A_ℓ)

such that:

P_ℓ^i = downsample_α( ( (A_ℓ^i)^p ⊗ 1_{a×a} )^{1/p} )

where ⊗ is the 2-dimensional correlation function, 1_{a×a} is an a × a matrix of ones (a can be seen as the size of the pooling "neighborhood"), and downsample_α denotes spatial downsampling with stride α along each dimension. The variable p controls the exponent in the pooling function.
Parameters:
• The stride parameter α was fixed to 2, resulting in a downsampling factor of 4 in pixel count.
Note that for p = 1, this is equivalent to blurring with an a × a boxcar filter. When p = 2 or p = 10, the output is the L_p-norm of the values in the pooling neighborhood; the L_10-norm produces outputs similar to a max operation (i.e. softmax).
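The pooling step can be sketched as follows; the loop-based implementation and default parameter values are illustrative only, and non-negative activations are assumed (e.g. γ_min = 0) so that the exponent is well defined for odd p:

```python
import numpy as np

def lp_pool(A, a=3, p=2, alpha=2):
    """L_p pooling over a x a spatial neighborhoods ('valid' domain),
    applied independently to each feature map of the (H, W, k) stack A,
    followed by spatial downsampling with stride alpha."""
    H, W, k = A.shape
    out = np.empty((H - a + 1, W - a + 1, k))
    for y in range(H - a + 1):
        for x in range(W - a + 1):
            # sum of p-th powers over the a x a window, per feature map
            out[y, x, :] = (A[y:y + a, x:x + a, :] ** p).sum(axis=(0, 1))
    # take the 1/p root, then keep every alpha-th row and column
    return (out ** (1.0 / p))[::alpha, ::alpha, :]
```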

Normalization
Description: As a final stage of processing within each layer, the outputs of the Pooling step were normalized by the activity of their neighbors within some radius (across space and across feature maps). Specifically, each response was divided by the magnitude of the vector of neighboring values if that magnitude was above a given threshold. This operation draws biological inspiration from the competitive interactions observed in natural neuronal systems (e.g. contrast gain control mechanisms in cortical area V1, and elsewhere [1,2]).
Definitions: We define the normalization function:

N_ℓ = Normalize(P_ℓ)

such that:

N_ℓ = ρ C_ℓ / ‖C_ℓ‖   if ρ ‖C_ℓ‖ > τ
N_ℓ = C_ℓ             otherwise

with

C_ℓ = P_ℓ − δ · (P_ℓ ⊗ 1_{b×b×k}) / (b · b · k)

where δ ∈ {0, 1}, ⊗ is a 3-dimensional correlation over the "valid" domain (i.e. sliding over the first two dimensions only), 1_{b×b×k} is a b × b × k array of ones, and ‖C_ℓ‖ denotes the magnitude of the vector of values in the b × b × k neighborhood at each position. b can be seen as the normalization "neighborhood" and δ controls whether this neighborhood is centered (i.e. whether the mean of the vector of neighboring values is subtracted) before divisive normalization. ρ is a "magnitude gain" parameter and τ is a threshold parameter below which no divisive normalization occurs.
Parameters:
• The size b of the neighborhood region was randomly chosen from {3, 5, 7, 9}.
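A per-position sketch of the normalization step is given below. It loops over positions for clarity rather than speed, and the default values of rho and tau are illustrative placeholders, not values from the original search space:

```python
import numpy as np

def normalize(P, b=3, rho=1.0, tau=0.1, delta=1):
    """Divisive normalization sketch: each response is divided by the
    magnitude of its b x b x k neighborhood (across space and across
    feature maps) when rho times that magnitude exceeds the threshold tau.
    delta=1 subtracts the neighborhood mean (centering) first."""
    H, W, k = P.shape
    h, w = H - b + 1, W - b + 1
    out = np.empty((h, w, k))
    c0 = b // 2
    for y in range(h):
        for x in range(w):
            win = P[y:y + b, x:x + b, :]
            win = win - delta * win.mean()       # optional centering
            center = win[c0, c0, :]
            mag = np.sqrt((win ** 2).sum())      # neighborhood magnitude
            out[y, x, :] = rho * center / mag if rho * mag > tau else center
    return out
```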

Final model output dimensionality
The output dimensionality of each candidate model was determined by the number of filters in the final layer, and the x-y "footprint" of the layer (which, in turn, depends on the subsampling at each previous layer). In the model space explored here, the possible output dimensionalities ranged from 256 to 73,984.

Unsupervised Learning
Description: During the Unsupervised Learning Phase, filter weights were learned from input video sequences. This procedure bears similarity to nonparametric density estimation, e.g. online K-means clustering. The algorithm for this phase additionally contains simple mechanisms for taking advantage of temporal information in a video sequence, and thus Unsupervised Learning was conducted on sequences of video frames. In this work, 15,000 video frames were used.

Definitions:
For each incoming video frame, an output for each filter at each location was computed, and a "winning" filter Φ_winner was selected:

Φ_winner = argmax_{Φ^i ∈ Φ} (output of Φ^i)

This winning filter was adapted to the input by adding the corresponding input patch, times a fixed learning rate λ, to the filter weights:

Φ_winner ← Φ_winner + λ · patch

The resulting updated filter was then re-normalized to zero mean and unit length:

Φ′_winner = (Φ_winner − Φ̄_winner) / ‖Φ_winner − Φ̄_winner‖

where Φ̄_winner represents the mean of the winner's weights and Φ′_winner is the filter carried forward into the next learning iteration. The incoming patch could be normalized (i.e. ‖patch‖₂ = 1), or not, under parametric control, and multiple patches could enter into one "round" of competition at the same time (e.g. filter stack outputs corresponding to multiple patches could be evaluated, and the largest output across all patches could decide the winner). The selection of the number of patches simultaneously competing was governed by the Competition Neighborhood Size and Competition Neighborhood Stride parameters, which served to tile a set of competing filter stacks across the input.
• "Temporal Advantage" (or "trace"; see also [3,4,5,6] for variants): the output score of the last-winning filter is multiplied by 1, 2, or 4 (3 choices) prior to determining which filter "wins." A value of 1 is equivalent to no advantage; a value of 2 doubles the effective output of the filter for the purposes of competition, biasing it to win again.
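A single online update of this learning rule can be sketched as follows; the function name, the learning rate, and the trace value are illustrative, and the score is simplified to a dot product between each filter and one patch:

```python
import numpy as np

def learn_step(filters, patch, last_winner=None, lam=0.01, trace=2.0,
               normalize_patch=True):
    """One winner-take-all update (online K-means flavor): score each filter
    against the patch, boost the last winner's score by `trace` ("temporal
    advantage"), nudge the winner toward the patch by learning rate `lam`,
    then re-normalize it to zero mean and unit length. Returns the winner."""
    p = patch.ravel().astype(float)
    if normalize_patch:
        p = p / (np.linalg.norm(p) + 1e-12)   # optional patch normalization
    scores = filters.reshape(len(filters), -1) @ p
    if last_winner is not None:
        scores[last_winner] *= trace          # bias the last winner
    winner = int(np.argmax(scores))
    f = filters[winner].ravel() + lam * p     # adapt toward the input
    f = f - f.mean()                          # zero mean
    f = f / (np.linalg.norm(f) + 1e-12)       # unit length
    filters[winner] = f.reshape(filters[winner].shape)
    return winner
```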

Classification during Screening and Validation Phases
During the Screening and Validation Phases, the representations generated during the Unsupervised Learning Phase were evaluated in a variety of object recognition tasks (see main text). This Classification Phase consisted of the following steps, with fixed parameters across all model instantiations:
• A random sampling of up to 5,000 outputs from the full representation was taken (to accelerate processing).
• Dimensionality was further reduced by PCA (using training data only, keeping the full eigensubspace projection, i.e. as many dimensions as training examples).
• A linear SVM (using the libsvm solver, with regularization parameter C = 10) was used with a 10-trial random subsampling cross-validation scheme (150 training and 150 testing examples).
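The steps above can be sketched end to end in NumPy. To keep the example self-contained, a ridge-regression linear classifier stands in for the libsvm linear SVM; the feature subsampling and training-data-only PCA follow the description above, and the ridge strength is an illustrative placeholder:

```python
import numpy as np

def classify(X_train, y_train, X_test, n_features=5000, ridge=0.1, seed=0):
    """Screening-style evaluation sketch. Randomly subsample up to
    n_features outputs, reduce with PCA fit on training data only (keeping
    as many dimensions as training examples), then apply a linear
    classifier. y_train is in {0, 1}; returns predicted labels for X_test.
    Note: a ridge-regression classifier replaces the linear SVM here."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X_train.shape[1])[:n_features]  # feature subsample
    Xtr, Xte = X_train[:, idx], X_test[:, idx]
    mu = Xtr.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xtr - mu, full_matrices=False)  # PCA basis
    Ztr, Zte = (Xtr - mu) @ Vt.T, (Xte - mu) @ Vt.T
    t = 2.0 * y_train - 1.0                                  # targets {-1,+1}
    W = np.linalg.solve(Ztr.T @ Ztr + ridge * np.eye(Ztr.shape[1]),
                        Ztr.T @ t)
    return (Zte @ W > 0).astype(int)
```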

Random Exploration
Note that the parameters and parameter ranges described here clearly do not constitute the most comprehensive possible search space; rather, they represent a starting point intended to demonstrate the utility of the overarching approach. While a brute-force search procedure was used here, other, more elaborate optimization schemes (e.g. evolutionary algorithms [7]) could also be used.
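Sampling one model instantiation from the parameter lists given in the preceding sections is a one-liner per parameter; the sketch below covers the filter-shape, filter-count, activation, and normalization-neighborhood choices (the full space also includes pooling and unsupervised-learning parameters):

```python
import random

def sample_model_params(rng=random):
    """Draw one random model instantiation from the parameter ranges
    listed above (a representative subset of the full search space)."""
    return {
        "filter_size": [rng.choice([3, 5, 7, 9]) for _ in range(3)],
        "n_filters": [rng.choice([16, 32, 64]),             # Layer 1
                      rng.choice([16, 32, 64, 128]),        # Layer 2
                      rng.choice([16, 32, 64, 128, 256])],  # Layer 3
        "gamma_min": rng.choice([float("-inf"), 0.0]),
        "gamma_max": rng.choice([1.0, float("inf")]),
        "norm_neighborhood": rng.choice([3, 5, 7, 9]),
    }
```

Each sampled dictionary corresponds to one candidate model to be built, trained, and screened.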