Object detection through search with a foveated visual system

Humans and many other species sense visual information with varying spatial resolution across the visual field (foveated vision) and deploy eye movements to actively sample regions of interest in scenes. The advantage of such a varying-resolution architecture is a reduced computational, and hence metabolic, cost. But what are the performance costs of such a processing strategy relative to a scheme that processes the visual field at high spatial resolution? Here we first focus on visual search and combine object detectors from computer vision with a recent model of peripheral pooling regions found at the V1 layer of the human visual system. We develop a foveated object detector that processes the entire scene with varying resolution, uses retino-specific object detection classifiers to guide eye movements, aligns its fovea with regions of interest in the input image and integrates observations across multiple fixations. We compared the foveated object detector against a non-foveated version of the same object detector, which processes the entire image at homogeneous high spatial resolution. We evaluated the accuracy of the foveated and non-foveated object detectors in identifying 20 different object classes in scenes from a standard computer vision dataset (the PASCAL VOC 2007 dataset). We show that the foveated object detector can approximate the performance of the object detector with homogeneous high spatial resolution processing while bringing significant computational cost savings. Additionally, we assessed the impact of foveation on the computation of bottom-up saliency. An implementation of a simple foveated bottom-up saliency model with eye movements showed agreement with a non-foveated high resolution saliency model in the selection of the top salient regions of scenes.
Together, our results might help explain the evolution of foveated visual systems with eye movements as a solution that preserves perceptual performance in visual search while resulting in computational and metabolic savings to the brain.


INTRODUCTION
There has been substantial progress (e.g. [7], [14], [22], [29], [37], [46], [53], to name a few) in object detection research in recent years. However, humans are still unsurpassed in their ability to search for objects in visual scenes. The human brain relies on a variety of strategies [9], including prior probabilities of object occurrence, global scene statistics [33], [45] and object co-occurrence [10], [28], [35], to successfully detect objects in cluttered scenes. Object detection approaches have increasingly included some of the human strategies [1], [2], [11], [14], [41]. One remaining crucial difference between the human visual system and a modern object detector is that, while humans process the visual field with decreasing resolution away from the fixation point [26], [39], [43], [49] and make saccades to collect information, typical object detectors [14] scan all locations at the same resolution and repeat this at multiple scales. The goal of the present work is to investigate the impact on object detector performance of using a foveated visual field and saccade exploration rather than the dominant sliding window paradigm [14], [29], [53]. Such an endeavor is of interest for two reasons. First, from the computer vision perspective, using a visual field with varying resolution might reduce computational complexity and thus lead to more efficient object detection algorithms. Second, from a scientific perspective, if a foveated object detection model can achieve accuracy similar to that of a non-foveated sliding window approach, it might suggest a possible reason for the evolution of foveated systems in organisms: achieving successful object detection while minimizing computational and metabolic costs.
Contemporary object detection research can be roughly outlined by the following three important components of a modern object detector: the features, the detection model and the search model. The most popular choices for these three components are Histogram of Oriented Gradients (HOG) features [6], a mixture of linear templates [14], and the sliding window (SW) method, respectively. Although there are efforts to go beyond these standard choices (e.g. new features [37], [46]; alternative detection models [22], [46]; whether object parts should be modeled or not [8], [54]; and alternative search methods [13], [14], [20], [24], [46]), HOG, mixtures of linear templates and SW form the crux of modern object detection methods ([14], [29], [53]). Here, we build upon the "HOG + mixture of linear templates" framework and propose a biologically inspired alternative search model to the sliding window method, where the detector searches for the object by making saccades instead of processing all locations at fine spatial resolution (see Section 4 for a more detailed discussion on related work).
The human visual system is known to have a varying resolution visual field. The fovea has higher resolution and this resolution decreases towards the periphery [26], [39], [43], [49]. As a consequence, the visual input at and around the fixation location has more detail relative to peripheral locations away from the fixation point. Humans and other mammals make saccades to align their high resolution fovea with the regions of interest in the visual environment. There are many possible methods to implement such a foveated visual field in an object detection system. In this work, we opt to use a recent model [15] which specifies how responses of elementary sensors are pooled at the V1 and V2 layers of the human visual cortex. The model specifies the shapes and sizes of the V1 and V2 regions which pool responses from the visual field. We use a simplified version of this model as the foveated visual field of our object detector (Figure 1, whose axes show horizontal and vertical eccentricity in degrees). We call our detector "the foveated object detector (FOD)" due to its foveated visual field.
The sizes of pooling regions in the visual field increase as a function of eccentricity from the fixation location. As the pooling regions get larger towards the periphery, more information is lost at these locations, which might seem to be a disadvantage. However, the exploration of the scene with the high resolution fovea through a guided search algorithm might mitigate the apparent loss of peripheral information. At the same time, fewer computational resources are allocated to process these low resolution areas, which in turn lowers the computational cost. In this paper, we investigate the impact of using a foveated visual field on detection performance and the computational cost savings it brings.

Overview of our approach
The foveated object detector (FOD) mimics the process by which humans search for objects in scenes, utilizing eye movements to point the high resolution fovea to points of interest (Figure 2). The FOD is assigned an initial fixation point on the input image and collects information by extracting image features through its foveated visual field. The features extracted around the fixation point are at a fine spatial scale, while features extracted away from the fixation location are at a coarser scale. This fine-to-coarse transition is dictated by the pooling region sizes of the visual field. Then, based on the information collected, the FOD chooses the next fixation point and makes a saccade to that point. Finally, the FOD integrates information collected through multiple saccades and outputs object detection predictions.
Training such an object detector entails learning templates at all locations in the visual field. Because the visual field has varying resolution, the appearance of a target object varies depending on where it is located within the visual field. We use HOG [6] as image features and a simplified version of the V1 model [15] to compute pooled features within the visual field. A mixture of linear templates is trained at selected locations in the visual field using a latent-SVM-like [14], [18] framework.

Contribution
We present an object detector that has a foveated visual field based on physiological measurements in primate visual cortex [15] and that models the appearance of target objects not only in the high resolution fovea but also in the periphery. Importantly, the model is developed in the context of a modern object detection algorithm and a standard dataset (PASCAL VOC), allowing for the first time direct evaluation of the impact of a foveated visual system on an object detector.
We believe that object detection using a foveated visual field offers a novel and promising direction of research in the quest for an efficient alternative to the sliding window method, and also a possible explanation for why foveated visual systems might have evolved in organisms. We show that our method achieves greater computational savings than a state-of-the-art cascaded detection method. Another contribution of our work is the latent-LDA formulation (Section 2.4.2), where linear discriminant analysis is used within a latent-variable learning framework.
In the next section, we describe the FOD in detail. We report experimental results in Section 3, followed by the related work section, conclusions and discussion.

Foveated visual field
The Freeman-Simoncelli (FS) model [15] is a neuronal population model of the V1 and V2 layers of the visual cortex. The model specifies how responses are pooled (averaged together) hierarchically, beginning from the lateral geniculate nucleus to V1 and then the V2 layer. V1 cells encode information about local orientation and spatial frequency, whereas the cells in V2 pool V1 responses non-linearly to achieve selectivity for compound features such as corners and junctions. The model is based on findings and physiological measurements of the primate visual cortex and specifies the shapes and sizes of the receptive fields of the cells in V1 and V2. According to the model, the sizes of receptive fields increase linearly as a function of the distance from the fovea, and this rate of increase is larger in V2 than in V1, which means V2 pools larger areas of the visual field in the periphery. The reader is referred to [15] for further details.
We simplify the FS model in two ways. First, the model uses a Gabor filter bank to compute image features, and we replace these with HOG features [6], [14]. Second, we only use the V1 layer and leave the nonlinear pooling at V2 as future work. We use this simplified FS model as the foveated visual field of our object detector, which is shown in Figure 1. The fovea subtends a radius of 2 degrees. We also only simulate a visual field with a radius of 10 degrees, which is sufficient to cover the test images presented at a typical viewing distance of 40 cm. The square boxes with white borders (Figure 1) represent the pooling regions within the fovea. The surrounding colored regions are the peripheral pooling regions. While the foveal regions have equal sizes, the peripheral regions grow in size as a function, specified by the FS model, of their distance to the center of the fovea. The color represents the weights used in pooling, i.e. in the weighted summation of the underlying responses. A pooling region partly overlaps with its neighboring pooling regions (see the supplementary material of [15] for details). Assuming a viewing distance of 40 cm, the whole visual field covers about a 500x500 pixel area (a pixel subtends 0.08°). The foveal radius is 52 pixels, subtending a visual angle of 4 degrees.
Given an image and a fixation point, we first compute the gradient at each pixel. Then, for each pooling region, the gradient magnitudes are pooled per orientation over the pixels that fall under the region. At the fovea, where the pooling regions are 8x8 pixels, we use the HOG features at the same spatial scale as the original DPM model [14]. In the periphery, each pooling region takes a weighted sum of the HOG features of the 8x8 regions that are covered by that pooling region.
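As a rough illustration of the peripheral pooling step described above, the following sketch forms one pooled feature vector as a weighted sum of the HOG vectors of the 8x8 cells covered by a pooling region. The function name `pool_region`, the toy 3-bin histograms and the weight values are assumptions made for illustration only; in the paper the weights are specified by the FS model [15].

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (row, col) index of an 8x8-pixel HOG cell


def pool_region(hog: Dict[Cell, List[float]],
                weights: Dict[Cell, float]) -> List[float]:
    """Weighted sum of the HOG vectors of the cells covered by one pooling region.

    `weights` maps each covered cell to its pooling weight. The values used
    below are hypothetical; the real ones come from the FS model.
    """
    dim = len(next(iter(hog.values())))
    pooled = [0.0] * dim
    for cell, weight in weights.items():
        for k, value in enumerate(hog[cell]):
            pooled[k] += weight * value
    return pooled


# Toy example: a peripheral region covering two cells with 3 orientation bins each.
hog_cells = {(0, 0): [1.0, 0.0, 2.0], (0, 1): [3.0, 1.0, 0.0]}
region_weights = {(0, 0): 0.5, (0, 1): 0.5}
pooled = pool_region(hog_cells, region_weights)
```

Whatever the size of the region, the pooled vector has the same length as a single cell's HOG vector, which is why peripheral templates end up lower-dimensional than foveal ones.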

The model
The model $M$ consists of a mixture of $n$ components

$$M = \{(w_i, \ell_i) : i = 1, 2, \dots, n\} \tag{1}$$

where $w_i$ is a linear template and $\ell_i$ is the location of the template with respect to the center of the visual field.
The location variable $\ell_i$ defines a unique bounding box within the visual field for the $i$th template. Specifically, $\ell_i = (\omega_i, h_i, x_i, y_i)$ is a 4-tuple whose variables respectively denote the width, height and x, y coordinates of the $i$th template within the visual field. The template $w_i$ is a matrix of weights on the features extracted from the pooling regions underlying the bounding box $\ell_i$. The dimensionality of $w_i$, i.e. the total number of weights, depends both on the width and height of its bounding box and on its location in the visual field. A component within the fovea covers a larger number of pooling regions than a peripheral component with the same width and height, hence the dimensionality of a foveal template is larger. Three example components are illustrated in Figure 3, where the foveal component (red) covers 7x5 = 35 pooling regions while the peripheral components (blue and green) cover 15 and 2 regions, respectively. Since a fixed number of features is extracted from each pooling region (regardless of its size), foveal components have higher-resolution templates associated with them.
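To make the relation between location and template dimensionality concrete, here is a minimal sketch of a component record. The class name `Component`, its fields, the box coordinates and the value of 31 features per pooling region (the size of a DPM-style HOG descriptor) are assumptions for illustration; only the region counts 35, 15 and 2 come from the Figure 3 example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One mixture component: a bounding box in the visual field plus the
    number of pooling regions that the box covers."""
    width: float
    height: float
    x: float
    y: float
    n_regions: int

    def template_dim(self, features_per_region: int) -> int:
        # A fixed number of features is extracted from every pooling region,
        # so the template size scales with the number of covered regions.
        return self.n_regions * features_per_region


FEATURES_PER_REGION = 31  # assumption: DPM-style HOG descriptor length

# Hypothetical boxes; only the region counts follow the Figure 3 example.
foveal = Component(width=56, height=40, x=0, y=0, n_regions=35)
peripheral_a = Component(width=56, height=40, x=120, y=0, n_regions=15)
peripheral_b = Component(width=56, height=40, x=220, y=0, n_regions=2)

dims = [c.template_dim(FEATURES_PER_REGION)
        for c in (foveal, peripheral_a, peripheral_b)]
```

Under these assumptions, three boxes of identical width and height yield templates of very different sizes, which is the sense in which foveal templates are higher-resolution.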

Detection model
Suppose that we are given a model $M$ that is already trained for a certain object class. The model is presented with an image $I$ and assigned an initial fixation location $f$. We are interested in searching for an object instance in $I$. Because the size of a searched object is not known a priori, the model has to analyze the input image at various scales. We use the same set of image scales given in [14] and use $\sigma$ to denote a scale from that set. When used as a subscript to an image, e.g. $I_\sigma$, it denotes the scaled version of that image, i.e. the width (and height) of $I_\sigma$ is $\sigma$ times the width (and height) of $I$. $\sigma$ also applies to fixation locations and bounding boxes: if $f$ denotes a fixation location $(f_x, f_y)$, then $f_\sigma = (\sigma f_x, \sigma f_y)$; for a bounding box $b = (w, h, x, y)$, $b_\sigma = (\sigma w, \sigma h, \sigma x, \sigma y)$.
To check whether an arbitrary bounding box $b$ within $I$ contains an object instance while the model is fixating at location $f$, we compute a detection score as

$$s(I, b, f) = \max_{\sigma} \; \max_{c \in G(b_\sigma, f_\sigma)} w^{\top} \Psi(I_\sigma, f_\sigma, c) \tag{2}$$

where $\Psi(I_\sigma, f_\sigma, c)$ is a feature extraction function which returns the features of $I_\sigma$ for component $c$ (see Equation (1)) when the model is fixating at $f_\sigma$. The vector $w$ is the blockwise concatenation of the templates of all components.
The fixation location $f_\sigma$, together with the component $c$, defines a unique location, i.e. a bounding box, on $I_\sigma$. $G(b_\sigma, f_\sigma)$ returns the set of all components whose templates have a predetermined overlap (intersection over union should be at least 0.7, as in [14]) with $b_\sigma$ when the model is fixating at $f_\sigma$. During both training and testing, $\sigma$ and $c$ are latent variables for example $(I, b)$. Ideally, $s(I, b, f) > 0$ should hold for an appropriate $f$ when $I$ contains an object instance within $b$. For an image that does not contain an object instance, $s(I, b = \emptyset, f) < 0$ should hold for any $f$. For this to work, a subtlety in the definition of $G(\cdot)$ is needed: $G(\emptyset, f)$ returns all components of the model (Equation (1)). During training (Section 2.4), this forces the responses of all components to be suppressed for a negative image.
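The role of the scale operation and of G can be sketched as follows. This is an illustrative approximation, not the authors' code: the names `scale_box`, `iou` and `candidate_components` are hypothetical, boxes are assumed to be axis-aligned (w, h, x, y) tuples with (x, y) the top-left corner, and the component boxes are assumed to be already placed in image coordinates for the current fixation.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (w, h, x, y); (x, y) assumed top-left


def scale_box(b: Box, sigma: float) -> Box:
    """Return b_sigma = (sigma*w, sigma*h, sigma*x, sigma*y)."""
    w, h, x, y = b
    return (sigma * w, sigma * h, sigma * x, sigma * y)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    aw, ah, ax, ay = a
    bw, bh, bx, by = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def candidate_components(b_sigma: Optional[Box], component_boxes: List[Box],
                         min_iou: float = 0.7) -> List[int]:
    """Indices of components overlapping b_sigma enough (the role of G).

    Passing None stands for the empty box of a negative image, in which
    case every component is returned.
    """
    if b_sigma is None:
        return list(range(len(component_boxes)))
    return [i for i, c in enumerate(component_boxes) if iou(b_sigma, c) >= min_iou]


component_boxes = [(20.0, 20.0, 0.0, 0.0), (20.0, 20.0, 2.0, 0.0), (20.0, 20.0, 15.0, 0.0)]
b_sigma = scale_box((10.0, 10.0, 0.0, 0.0), 2.0)
candidates = candidate_components(b_sigma, component_boxes)
```

The detection score would then be the maximum template response over the returned components and over all scales.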

Integrating observations across multiple fixations
So far, we have looked at the situation where the model has made only one fixation. We describe in Section 2.3 how the model chooses the next fixation location. For now, suppose that the model has made $m$ fixations, $f_1, f_2, \dots, f_m$, and we want to find out whether an arbitrary bounding box $b$ contains an object instance. This computation involves integrating observations across multiple fixations, which is a considerably more complicated problem than the single fixation case. The Bayesian decision on whether $b$ contains an object instance is based on the comparison of posterior probabilities:

$$\frac{P(y_b = 1 \mid f_1, f_2, \dots, f_m, I)}{P(y_b = 0 \mid f_1, f_2, \dots, f_m, I)} \tag{3}$$

where $y_b = 1$ denotes the event that there is an object instance at location $b$. We use the ratio of the posteriors as a detection score: the higher it is, the more likely it is that $b$ contains an instance. Computing the probabilities in (3) requires training a classifier per combination of fixation locations for each different value of $m$, which is intractable. We approximate it using a conditional independence assumption (derivation given in Appendix A):

$$\frac{P(y_b = 1 \mid f_1, f_2, \dots, f_m, I)}{P(y_b = 0 \mid f_1, f_2, \dots, f_m, I)} \approx \prod_{i=1}^{m} \frac{P(y_b = 1 \mid f_i, I)}{P(y_b = 0 \mid f_i, I)} \tag{4}$$

We model the probability $P(y_b = 1 \mid f, I)$ using a classifier and use the sigmoid transfer function to convert raw classification scores to probabilities:

$$P(y_b = 1 \mid f, I) = \frac{1}{1 + e^{-s(I, b, f)}} \tag{5}$$

We simplify the computation in (4) by taking the log (derivation given in Appendix B):

$$\log \prod_{i=1}^{m} \frac{P(y_b = 1 \mid f_i, I)}{P(y_b = 0 \mid f_i, I)} = \sum_{i=1}^{m} s(I, b, f_i) \tag{6}$$

Taking the logarithm of the posterior ratios does not alter the ranking of detection scores for different locations, i.e. $b$'s, because the logarithm is a monotonic function. In short, the detection score computed by the FOD for a certain location $b$ is the sum of the individual scores for $b$ computed at each fixation.
After evaluating (6) for a set of candidate locations, final bounding box predictions are obtained by non-maxima suppression [14], i.e. given multiple predictions for a certain location, all predictions except the one with the maximal score are discarded.
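The identity behind this simplification can be checked numerically: with sigmoid-calibrated probabilities, the log of the product of per-fixation posterior odds equals the plain sum of the per-fixation scores. The sketch below is only a sanity check; the function names and the three score values are hypothetical.

```python
import math
from typing import Sequence


def sigmoid(s: float) -> float:
    """Convert a raw classification score into a probability."""
    return 1.0 / (1.0 + math.exp(-s))


def log_posterior_ratio(scores: Sequence[float]) -> float:
    """Log of the product of per-fixation posterior odds for one location b."""
    total = 0.0
    for s in scores:
        p = sigmoid(s)
        total += math.log(p / (1.0 - p))
    return total


def integrated_score(scores: Sequence[float]) -> float:
    """The simplified form: the sum of the per-fixation scores."""
    return sum(scores)


# Hypothetical scores s(I, b, f_i) for the same box b at three fixations.
scores_at_b = [-0.4, 1.2, 2.0]
lhs = log_posterior_ratio(scores_at_b)
rhs = integrated_score(scores_at_b)
```

The two quantities agree up to floating-point error, so in practice only the running sum of scores needs to be stored per candidate location.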
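A common way to realize this step is greedy non-maxima suppression, sketched below. The greedy scheme and the 0.5 overlap threshold are assumptions for illustration and are not taken from the text; the box format (w, h, x, y) with a top-left origin is likewise assumed.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (w, h, x, y); (x, y) assumed top-left
Detection = Tuple[Box, float]            # (box, integrated detection score)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    aw, ah, ax, ay = a
    bw, bh, bx, by = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def non_maxima_suppression(dets: List[Detection],
                           max_iou: float = 0.5) -> List[Detection]:
    """Keep the highest-scoring detection among mutually overlapping ones."""
    kept: List[Detection] = []
    for box, score in sorted(dets, key=lambda d: d[1], reverse=True):
        # Discard a detection if it overlaps too much with a better one.
        if all(iou(box, kept_box) <= max_iou for kept_box, _ in kept):
            kept.append((box, score))
    return kept


detections = [((10.0, 10.0, 0.0, 0.0), 0.9),
              ((10.0, 10.0, 1.0, 1.0), 0.8),
              ((10.0, 10.0, 50.0, 50.0), 0.3)]
kept = non_maxima_suppression(detections)
```

In this toy input the second box overlaps the first heavily and is dropped, while the distant third box survives despite its lower score.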

Eye movement strategy
We use the maximum-a-posteriori (MAP) model [4] as the basic eye movement strategy of the FOD. The MAP model has been shown to be consistent with human eye movements in a variety of visual search tasks [4], [47]. Studies have demonstrated that in some circumstances human saccade statistics better match an ideal searcher [31] that makes eye movements to locations that maximize the accuracy of localizing targets, yet in many circumstances the MAP model approximates the ideal searcher [32], [51] while being computationally more tractable for objects in real scenes. The MAP model selects the location with the highest posterior probability of containing the target object as the next fixation location, that is

$$f_{i+1} = \text{center of } \ell^{*} \quad \text{where} \quad \ell^{*} = \arg\max_{\ell} P(y_\ell = 1 \mid f_1, f_2, \dots, f_i, I). \tag{7}$$

Finding the maximum of the posterior above is equivalent to finding the maximum of the posterior ratios: for two arbitrary locations $\ell_1$, $\ell_2$, let $p_1 = P(y_{\ell_1} = 1 \mid \cdot)$ and $p_2 = P(y_{\ell_2} = 1 \mid \cdot)$; then we have

$$\frac{p_1}{1 - p_1} > \frac{p_2}{1 - p_2} \iff p_1 > p_2.$$
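Because the posterior ratio is monotonic in the posterior, and its logarithm is the sum of per-fixation scores, the MAP rule can be sketched as an argmax over summed scores. The data layout (a dictionary from candidate boxes to their score history) and the function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (w, h, x, y); (x, y) assumed top-left
Point = Tuple[float, float]


def box_center(b: Box) -> Point:
    """Center of a (w, h, x, y) box."""
    w, h, x, y = b
    return (x + w / 2.0, y + h / 2.0)


def next_fixation(scores_per_location: Dict[Box, List[float]]) -> Point:
    """MAP rule: fixate the center of the location with the largest summed score.

    The summed score is the log posterior ratio, which ranks locations in the
    same order as the posterior probability itself.
    """
    best = max(scores_per_location, key=lambda b: sum(scores_per_location[b]))
    return box_center(best)


# Hypothetical scores collected at two earlier fixations for two candidate boxes.
history = {
    (20.0, 10.0, 0.0, 0.0): [0.1, -0.5],
    (10.0, 10.0, 40.0, 30.0): [0.3, 0.9],
}
fixation = next_fixation(history)
```

A random strategy, used later as a baseline, would simply replace the argmax with a uniformly sampled location.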

Initialization
A set of dimensions (width and height) is determined from the bounding box statistics of the examples in the training set, as done in the initialization of the DPM model [14]. Then, for each width and height, new components with these dimensions are created to tile the entire visual field. However, the density of components in the visual field is not uniform. Locations, i.e. bounding boxes, that do not overlap well with the underlying pooling regions are discarded. To define goodness of overlap, a bounding box is said to intersect with an underlying pooling region if more than one fifth of that region is covered by the bounding box. Overlap is the average coverage across the intersected regions. If the overlap is more than 75%, then a component for that location is created; otherwise the location is discarded (see Figure 4 for an example). In addition, no components are created for locations that are outside of the visual field. Weights of the component templates ($w_i$) are initialized to arbitrary values. Training the model is essentially optimizing these weights on a given dataset.

Optimizing the cost function in (10) is manageable for mixtures with few components. However, the FOD has a large number of components in its visual field (typically, for an object class in the PASCAL VOC 2007 dataset [12], there are around 500-700), and optimizing this cost function becomes prohibitive in terms of computational cost. As an alternative, cheaper linear classifiers can be used. Recently, linear discriminant analysis (LDA) has been used in object detection [18], producing surprisingly good results with much faster training time. Training an LDA classifier amounts to computing $\Sigma^{-1}(\mu_1 - \mu_0)$, where $\mu_1$ is the mean of the feature vectors of the positive examples, $\mu_0$ is the same for the negative examples and $\Sigma$ is the covariance matrix of these features. Here, the most expensive computation is the estimation of $\Sigma$, which is required for each template with different dimensions. However, it is possible to estimate a global $\Sigma$ from which covariance matrices for templates of different dimensions can be obtained [18]. For the FOD, we estimate the covariance matrices for the foveal templates directly and obtain the covariance matrices for peripheral templates by applying the feature pooling transformations to the foveal covariance matrices.
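Returning to the component-creation rule used during initialization, the overlap test can be sketched as below. For simplicity the sketch assumes rectangular pooling regions given as (x0, y0, x1, y1), which is a simplification of the actual region shapes in the FS model; the function names are hypothetical, while the one-fifth and 75% thresholds are those stated in the text.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed axis-aligned


def area(r: Rect) -> float:
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def coverage(box: Rect, region: Rect) -> float:
    """Fraction of the pooling region's area that lies inside the box."""
    inter = (max(box[0], region[0]), max(box[1], region[1]),
             min(box[2], region[2]), min(box[3], region[3]))
    return area(inter) / area(region)


def keep_location(box: Rect, regions: List[Rect],
                  min_cover: float = 0.2, min_overlap: float = 0.75) -> bool:
    """Create a component only if the box fits the underlying regions well.

    A region counts as intersected if more than `min_cover` of it is covered;
    overlap is the mean coverage over the intersected regions.
    """
    intersected = [c for c in (coverage(box, r) for r in regions) if c > min_cover]
    if not intersected:
        return False
    return sum(intersected) / len(intersected) > min_overlap


regions = [(0.0, 0.0, 10.0, 10.0), (10.0, 0.0, 20.0, 10.0), (20.0, 0.0, 30.0, 10.0)]
aligned = keep_location((0.0, 0.0, 20.0, 10.0), regions)      # covers two regions fully
misaligned = keep_location((5.0, 0.0, 17.0, 10.0), regions)   # covers them only partly
```

This kind of test explains why component density is non-uniform: in the periphery, where regions are large, fewer box placements align well with region boundaries.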
We propose to use LDA in a latent-SVM-like framework as an alternative to the method in [18], where positive examples are clustered first and then an LDA classifier is trained per cluster. Consider the $t$th template, $w_t$. LDA gives us that $w_{t,\mathrm{LDA}} = \Sigma_t^{-1}(\mu_t^{pos} - \mu_t^{neg})$, where $\Sigma_t$ is the covariance matrix for template $t$, and $\mu_t^{pos}$ and $\mu_t^{neg}$ are the means of the positive and negative feature vectors, respectively, assigned to template $t$. We propose to apply an affine transformation, with parameters $\alpha_t$ and $\beta_t$, to the LDA classifier (Equation (11)) and to modify the cost function accordingly (Equation (12)). In this cost function, the first summation pushes the score of the mean of the negative examples below zero and the second summation, taken over positive examples only, pushes the scores above zero. $\alpha$ and $\beta$ are the appropriate blockwise concatenations of the $\alpha_t$'s and $\beta_t$'s. $C$ is the regularization constant. Overall, this optimization effectively calibrates the dynamic ranges of the responses of different templates in the model, so that the scores of positive examples and negative means are pushed away from each other while the norm of $w$ is constrained to prevent overfitting. This formulation does not require the costly mining of hard-negative examples used by latent-SVM. We call this formulation (Equation (12)) latent-LDA.
To optimize (12), we use the classical coordinate-descent procedure. We start by initializing $w$ by training on warped-positive examples as in [14]. Then, we alternate between choosing the best values for the latent variables while keeping $w$ fixed, and optimizing for $w$ while keeping the latent variables of positive examples fixed.
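The plain LDA step that latent-LDA builds on can be sketched in a few lines. This covers only the computation of the template as the inverse covariance applied to the difference of class means; it does not implement the affine calibration or the latent-variable optimization. The function names and the toy 2-dimensional data are hypothetical, and the covariance is passed in, mirroring the idea of a precomputed background covariance.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def mean(vectors: List[Vector]) -> Vector:
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def solve(a: Matrix, b: Vector) -> Vector:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def lda_template(pos: List[Vector], neg: List[Vector], sigma: Matrix) -> Vector:
    """w = sigma^{-1} (mean(pos) - mean(neg)), computed without forming the inverse."""
    diff = [p - q for p, q in zip(mean(pos), mean(neg))]
    return solve(sigma, diff)


# Toy 2-D features; sigma plays the role of a precomputed covariance matrix.
sigma = [[2.0, 0.0], [0.0, 1.0]]
positives = [[3.0, 1.0], [5.0, 3.0]]
negatives = [[0.0, 0.0], [2.0, 0.0]]
w_lda = lda_template(positives, negatives, sigma)
```

Solving the linear system rather than inverting the covariance is a standard numerical choice; with the covariance fixed in advance, training reduces to averaging feature vectors and one such solve per template.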

EXPERIMENTS
We evaluated our method on the PASCAL VOC 2007 detection (comp3) challenge dataset and protocol (see [12] for details). All results are obtained by training on the train+val split and testing on the test split.

Comparison of SW based methods
We first compared our SW implementation, which corresponds to using foveal templates only, to three state-of-the-art methods that are also SW based [14], [18], [29]. Table 1 gives the AP (average precision) results, i.e. the area under the precision-recall curve per class, and the mean AP (mAP) over all classes. Originally, the deformable parts model (DPM) uses object parts; however, in order to make a fair comparison with our model, we disabled its parts. The first row of Table 1 shows the latest version of the DPM system [17] with the parts-learning code disabled. The second row shows results for another popular SVM-based system, known as the exemplar-SVM (E-SVM), which also models only whole objects, not their parts. Finally, the third row shows results from an LDA-based system, "discriminative decorrelation for classification" (DCC) [18]. All three systems are based on HOG features and a mixture of linear templates. The results show that the SVM-based systems perform better than the LDA-based systems, which is not surprising since discriminative models generally outperform generative models in classification tasks. However, LDA compensates for this performance loss by being very fast to train, which is exactly the reason we chose to use LDA instead of SVM. Once the background covariance matrices are estimated (which can be done once and for all [18]), training is as easy as taking the average of the feature vectors of positive examples and doing a matrix multiplication. We estimated that training an SVM-based system for our FOD would take about 300 hours (approximately 2 weeks) for a single object class, whereas the LDA-based system can be trained in under an hour on the same machine, which has an Intel i7 processor.
Although our SW method achieves the same mean AP (mAP) score as the DCC method [18], the latter has a detection model with higher computational cost. We use 2 templates per class, while DCC trains more than 15 templates per class within an exemplar-SVM [29]-like framework. DCC considers the dot product of the feature vector of the detection window with every exemplar within a cluster, which basically means that a detection window is compared to all positive examples in the training set. In our case, the number of dot products considered per detection window is equal to the number of templates, which is 2 in this paper. This is the main computational advantage of our latent-LDA approach over DCC [18].

Comparison of FOD with SW
Next, we compared the performance of the FOD with our SW method. We experimented with two eye movement strategies, MAP (Section 2.3) and a random strategy, to demonstrate the importance of guiding eye movements.
Table 2 shows the AP scores for the FOD with different eye movement strategies and different numbers of fixations. We also include in this table the "Our SW" result from Table 1 for ease of reference. The MAP and random strategies are denoted by MAP and RAND, respectively. Because the model accuracy results depend on the initial point of fixation, we ran the models with different initial points of fixation. The suffix on a model name refers to the location of the initial fixation: "-C" stands for the center of the input image, i.e. (0.5, 0.5) in normalized image coordinates where the top-left corner is taken as (0, 0) and the bottom-right corner is (1, 1); and "-E" stands for the two locations at the left and right edges of the image, 10% of the image width away from the image border, that is (0.1, 0.5) and (0.9, 0.5). MAP-E and RAND-E results are the performance average of two different runs, one with the initial fixation close to the left edge of the image and the other close to the right edge of the image. For the random eye movement strategy, we report the 95% confidence interval for AP over 10 different runs. We ran all systems for a total of 5 fixations. Table 2 shows results after 1, 3 and 5 fixations. A condition with one fixation is a model that makes decisions based only on the initial fixation.

Table 2 notes: the "RAND-C, 1" row is the same as the "MAP-C, 1" row, and the "RAND-E, 1" row is the same as the "MAP-E, 1" row.

Figure 5: Ratio of mean AP scores of FOD systems relative to that of the SW system. The graph shows two eye movement algorithms, maximum a posteriori probability (MAP) and random (RAND), and two starting points (C: center; E: edge).
The results show that the FOD using the MAP rule with 5 fixations (MAP-C,5 for short) performs nearly as well as the SW (a difference of 0.2 in mean AP).
Figure 5 shows the ratio of mean AP for the FOD with the various eye movement strategies to that of the SW system (relative performance) as a function of the number of fixations. The relative performance of MAP-C to SW (AP of MAP-C divided by AP of SW) is 98.8% for 5 fixations, 96.5% for 3 fixations and 84.8% for 1 fixation. The FOD with eye movement guidance towards the target (MAP-C,5) achieves or exceeds SW's performance with only [...]. MAP-C performs quite well (84.8% relative performance) even with 1 fixation. The reason behind this result is that, on average, bounding boxes in the PASCAL dataset cover a large portion of the images (the average bounding box area normalized by image area is 0.2) and are located at and around the center [44]. To reduce the effects of these biases in object placement on the results, we assessed the models with an initial fixation close to the edge of the image (MAP-E). When the initial fixation is closer to the edge of the image, performance is initially worse than when the initial fixation is at the center of the image. The difference in performance diminishes with more fixations, and similar performance is reached with five fixations (0.2 difference in mean AP). Figure 6 shows how the distribution of AP scores for different object classes for MAP-E improves from 1 fixation to 5 fixations.

Importance of the guidance algorithm
To assess the importance of guiding saccades towards the target, we compared the performance of the MAP model against an FOD whose eye movements are produced by a random eye movement generator.
Figure 5 allows comparison of the relative performance of the MAP FOD and the FODs with a random eye movement strategy. The performance gaps within the MAP-C, RAND-C pair and within the MAP-E, RAND-E pair show that the MAP eye movement strategy is effective in improving the performance of the system.

Computational cost
The computational complexity of the SW method is easily expressed in terms of image size. However, this is not the case for our model. The computational complexity of the FOD is O(mn), where m is the number of fixations and n is the total number of components, hence templates, in the visual field. These numbers do not explicitly depend on the image size; so in this sense, the complexity of the FOD is O(1) in terms of image size. Currently, m is given as an input parameter, but if it were to be chosen automatically, e.g. to achieve a certain detection accuracy, m would implicitly depend on several factors such as the difficulty of the object class and the location and size distribution of positive examples. Targets that are small (relative to the image size) and that are located far away from the initial fixation location would require more fixations to reach a certain detection accuracy. The number of components, n, depends on both the visual field parameters (the number of angle and eccentricity bins which, in our case, are fixed based on the Freeman-Simoncelli model [15]) and the bounding box statistics of the target object. These dependencies make it difficult to express the theoretical complexity in terms of input image size. For this reason, we compare the computational costs of the FOD and SW in a practical framework, expressed in terms of the total number of operations performed in template evaluations.
In both SW-based methods and the FOD, linear template evaluations, i.e. taking dot products, are the main time-consuming operation. We define the computational cost of a method based on the total number of template evaluations it executes (as also done in [46]). A model may have several templates with different sizes, so instead of counting each template evaluation as 1 operation, we take into account the dimensionalities of the templates. For example, the cost of evaluating a (6-region)x(8-region) HOG template is counted as 48 operations.
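The cost measure defined above can be sketched as a simple counter. This is an illustrative simplification with hypothetical function names: the SW count assumes every template is evaluated at every valid position of each scale's cell grid, and the FOD count ignores image scales and assumes every component is evaluated once per fixation, whereas the paper counts the operations actually performed on test images.

```python
from typing import List, Tuple

Size = Tuple[int, int]  # (rows, cols), measured in HOG cells or pooling regions


def template_cost(size: Size) -> int:
    """Cost of one template evaluation, e.g. a 6x8 template costs 48 operations."""
    rows, cols = size
    return rows * cols


def sliding_window_cost(grids: List[Size], templates: List[Size]) -> int:
    """Operations when each template visits every valid position of each grid."""
    total = 0
    for grid_rows, grid_cols in grids:
        for t_rows, t_cols in templates:
            positions = max(0, grid_rows - t_rows + 1) * max(0, grid_cols - t_cols + 1)
            total += positions * template_cost((t_rows, t_cols))
    return total


def fod_cost(n_fixations: int, regions_per_component: List[int]) -> int:
    """O(m n)-style count: every component evaluated once per fixation."""
    return n_fixations * sum(regions_per_component)


sw_ops = sliding_window_cost(grids=[(10, 10)], templates=[(6, 8)])
fod_ops = fod_cost(n_fixations=5, regions_per_component=[35, 15, 2])
```

The SW count grows with the size of the cell grid, and therefore with image size, while the FOD count depends only on the number of fixations and the components in the visual field.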
It is straightforward to compute the computational cost (as defined above) of the SW method. For the FOD, we run the model on a subset of the testing set and count the number of operations actually performed. Note that, in order to compute a detection score, the FOD first performs a feature pooling (based on the location of the component in the visual field) and then a linear template evaluation. Since these are both linear operations, we combine them into a single linear template. The last column of Table 2 gives the computational costs of the SW method and the FOD. For the FOD, the computational cost is reported as a function of the number of fixations. For ease of comparison, we normalized the costs so that the SW method performs 100 operations in total. The results show that the FOD is computationally more efficient than SW: the FOD achieves 98.8% of SW's performance at 49.6% of the computational cost of SW. Note that this saving is not directly comparable to that of the cascaded detection method reported in [13], because the FOD's computational savings come from fewer root filter evaluations, whereas in [13] a richer model (DPM, with root filters and part filters) is used and the savings come from fewer evaluations of the part filters (i.e., the model first applies the root filters at all locations and then sequentially runs the other filters on the non-rejected locations).

Using richer models to increase performance
To directly compare the computational savings of the FOD model to those of a cascade-type object detector, we used a richer and more expensive detection model at the fovea. This is analogous to the cascaded detection idea, where cheaper detectors are applied first and more expensive detectors are applied later at the locations not rejected by the cheaper detectors. To this end, we run our FOD and, after each fixation, evaluate the full DPM detector (root and part filters together) [17] only at foveal locations that score above a threshold, which is determined on the training set to achieve a high recall rate (95%). We call this approach "FOD-DPM cascade", or FOD-DPM for short. Table 3 and Figure 7 give the performance results of this approach. FOD-DPM achieves a similar average performance to that of DPM (98.2% relative performance, 0.6 AP gap) using 9 fixations and exceeds DPM's performance starting from 11 fixations. On some classes (e.g. bus, car, horse), FOD-DPM exceeds DPM's performance, probably due to the smaller number of evaluations and fewer false positives; on other classes (e.g. bike, dog, tv), FOD-DPM underperforms, probably due to the low recall rate of the FOD detector for these classes. Figure 8 gives per-class AP scores of FOD-DPM and DPM to demonstrate the improvement from 1 to 9 fixations.
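The thresholding step of the cascade described above can be sketched in Python. This is a minimal illustration under stated assumptions: the function names, the use of scores of positive training examples, and the callables `cheap_score` and `expensive_score` are all hypothetical stand-ins for the FOD and the full DPM detector, not the original code.

```python
import math


def recall_threshold(positive_scores, target_recall=0.95):
    """Largest threshold t such that at least `target_recall` of the
    positive training scores satisfy score >= t."""
    if not positive_scores:
        raise ValueError("need at least one positive score")
    ranked = sorted(positive_scores, reverse=True)
    # Number of positives that must pass; the small epsilon guards
    # against floating-point round-up in the product.
    k = max(1, math.ceil(target_recall * len(ranked) - 1e-9))
    return ranked[k - 1]


def cascade(candidates, cheap_score, expensive_score, threshold):
    """Run the expensive detector only on candidates that the cheap
    detector scores at or above the threshold."""
    results = []
    for c in candidates:
        if cheap_score(c) >= threshold:
            results.append((c, expensive_score(c)))
    return results
```

Choosing the threshold for high recall keeps most true targets in play for the expensive detector, at the price of passing along some false positives that the richer model is expected to reject.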
We compare the computational complexities of FOD-DPM and DPM by their total number of operations as defined above. For a given object class, the DPM model has 3 root filters and 8 part filters of size 6x6. It is straightforward to calculate the number of operations performed by DPM, as it uses the SW method. For FOD-DPM, the total number of operations is calculated by adding: 1) FOD's operations and 2) DPM's operations at each high-scoring foveal detection b, namely one DPM root filter (the one whose shape is most similar to b) and 8 part filters evaluated at all locations within the boundaries of this root filter. Note that we ignore the time for optimal placement of parts in both DPM and FOD-DPM. The cost of feature extraction is also not included, as the two methods use the same feature extraction code. We report the computational costs of FOD-DPM and DPM in the last column of Table 3. The costs are normalized so that DPM's cost is 100 operations. The results show that FOD-DPM drastically reduces the cost from 100 to 3.09 for 9 fixations. Assuming both methods are implemented equally efficiently, this would translate to an approximately 32x speed-up, which is better than the 20x speed-up reported for a cascaded evaluation of DPM [13]. These results demonstrate the effectiveness of our foveated object detector in guiding the visual search.
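The cost accounting and the speed-up figure quoted above can be checked with a short Python sketch. Only the 8 part filters of size 6x6 and the normalized costs 100 and 3.09 come from the text; the function names, the `(root_height, root_width, n_part_locations)` layout and any concrete root filter sizes are assumptions for illustration.

```python
def speed_up(baseline_cost, new_cost):
    """Ratio of baseline cost to new cost, e.g. 100 / 3.09."""
    return baseline_cost / new_cost


def fod_dpm_cost(fod_ops, detections, n_parts=8, part_size=(6, 6)):
    """FOD operations plus, for each high-scoring foveal detection,
    one root filter evaluation and `n_parts` part filters evaluated
    at every part location inside that root filter.

    `detections` is a list of (root_height, root_width,
    n_part_locations) tuples (a hypothetical layout).
    """
    ops_per_part_location = n_parts * part_size[0] * part_size[1]
    total = fod_ops
    for root_h, root_w, n_part_locations in detections:
        total += root_h * root_w + ops_per_part_location * n_part_locations
    return total
```

With the normalized costs reported in the text, `speed_up(100, 3.09)` evaluates to roughly 32.4, consistent with the "approximately 32x" figure.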
Finally, in Figure 9 we give sample detections by the FOD system. We ran the trained bicycle, person and car models on an image outside of the PASCAL dataset. The models were assigned the same initial location and we ran them for 3 fixations. The results show that each model fixates at different locations, and that these locations are attracted towards instances of the target object being searched for.

RELATED WORK
The sliding window (SW) method is the dominant model of search in object detection. The complexity of identifying object instances in a given image is O(mn), where m is the number of locations to be evaluated and n is the number of object classes to be searched for. Efficient alternatives to sliding windows can be categorized into two groups: (i) methods aimed at reducing m, and (ii) methods aimed at reducing n. Since typically m >> n, more effort has gone into reducing m; however, reducing the contribution of the number of object classes has recently received increasing interest as the search for hundreds of thousands of object classes has started to be tackled [7]. According to this categorization, our proposed FOD method falls into the first group, as it is designed to locate object instances by making a set of sequential fixations where in each fixation only a sparse set of locations is evaluated.

Reducing the number of evaluated locations (m)
In efforts to reduce the number of locations to be evaluated, one line of research is branch-and-bound methods ([20], [24]), where an upper bound on the quality function of the detection model is used in a global branch-and-bound optimization scheme. Although the authors provide efficiently computable upper bounds for popular quality functions (e.g. linear template, bag-of-words, spatial pyramid), it might not be trivial to derive suitable upper bounds for a custom quality function. Our method, on the other hand, uses a binary classification detection model and is agnostic to the quality function used.
Another line of research is the cascaded detection framework ([13], [23], [48]), where a series of cheap to expensive tests is performed to locate the object. Cascaded detection is similar to our method in the sense that simple, coarse and cheap evaluations are used together with complex, fine and expensive evaluations. However, our method differs in that cascaded detection is essentially a sliding window method with a coarse-to-fine heuristic used to reduce the total number of evaluations. Another coarse-to-fine search scheme is presented in [34], where a set of low to high resolution templates is used. The method starts by evaluating the lowest resolution template (which is essentially a sliding window operation) and selecting the high responding locations for further processing with higher resolution templates. Our method, too, uses a set of varying resolution templates; however, these templates are evaluated at every fixation instead of serializing their evaluations with respect to resolution.
In [46], a segmentation based method is proposed to yield a small set of locations that are likely to correspond to objects, which are subsequently used to guide the search in a selective manner. The locations are identified in an object class-independent way using an unsupervised multiscale segmentation approach. Thus, the method evaluates the same set of locations regardless of which object class is being searched for. In contrast, in our method, the selection of locations to be foveated is guided by learned object class templates.
The method in [1], similar to ours, works like a fixational system: at a given time step, the location to be evaluated next is decided based on previous observations. However, there are important differences. In [1], only a single location is evaluated at a time step, whereas we evaluate all template locations within the visual field at each fixation. Their method returns only one box as the result, whereas our method is able to output many predictions.
There are also vector quantization based methods [19], [21], [40] that aim to reduce the time required to compute linear template evaluations. These methods for reducing the contribution of m in O(mn) are orthogonal to our foveated approach. Thus, vector quantization approaches can be integrated with the proposed foveated object detection method.

Reducing the number of evaluations of object classes (n)
Works in this group aim to reduce the time complexity contributed by the number of object classes. The method proposed in [7] accelerates the search by replacing the costly linear convolution with a locality sensitive hashing scheme that works on non-linearly coded features. Although they evaluate all locations in a given image, their approach scales in constant time with respect to the number of classes, which enables them to evaluate thousands of object classes in a very short amount of time.
Another method [42] uses a sparse representation of object part templates, and then uses the basis of this representation to reconstruct template responses. When the number of object categories is large, the sparse representation serves as a shared dictionary of parts and accelerates the search.
Importantly, the way the methods in this group accelerate search is orthogonal to the savings proposed by using a foveated visual field. Therefore, these methods are complementary and can be integrated with our method to further accelerate search.
In the context of the references listed in this and the previous sections, our method of search through fixations using a non-uniform foveated visual field is novel.

Biologically inspired methods
There have been previous efforts (e.g. [41]) on biologically inspired object recognition. However, these models do not have a foveated visual field and thus do not execute eye movements. More recent work has implemented biologically inspired search methods. In [11], a fixed, pre-attentive, low-resolution wide-field camera is combined with a shiftable, attentive, high-resolution narrow-field camera, where the pre-attentive camera generates saccadic targets for the attentive, high-resolution camera. The fundamental difference between this and our method is that while their pre-attentive system has the same coarse resolution everywhere in the visual field, our method, which is a model of the V1 layer of the visual cortex, has a varying resolution that depends on the radial distance to the center of the fovea. There have been previous efforts to create foveated search models with eye movements [30], [31], [38], [51]. Such models have been applied mostly to detect simple signals in computer generated noise [31], [51] and used as benchmarks to compare against human eye movements and performance.
Other biologically inspired methods include the target acquisition model (TAM) [50], [52], the Infomax model [5] and artificial neural network based models [2], [25]. TAM is a foveated model; it uses scale invariant feature transform (SIFT) features [27] for representation and utilizes a training set of images to learn the appearance of the target object. However, it does not include the variability in object appearance due to scale and viewpoint, and the evaluation is done by placing the objects on a uniform background. The Infomax model, on the other hand, can use any previously trained object detector and works on natural images. Its authors report successful results on a face detection task. Both TAM and Infomax use the same template for all locations in the visual field, while our method uses different templates for different locations. The model in [25] was applied to image categorization and the one in [2] to object tracking in videos. Critically, none of these models have been tested on standard object detection datasets, nor have they been compared to a SW approach to evaluate the potential performance loss and computational savings of modeling a foveated visual field.

CONCLUSIONS AND DISCUSSION
We present an implementation of a foveated object detector with a recent neurobiologically plausible model of pooling in the visual periphery and report the first ever evaluation of a foveated object detection model on a standard data set in computer vision (PASCAL VOC 2007). Our results show that the foveated method achieves nearly the same performance as the sliding window method at 49.6% of sliding window's computational cost. Using a richer model (such as DPM [14]) to evaluate high-scoring locations, FOD is able to outperform the DPM with more computational savings than a state-of-the-art cascaded detection system [13]. These results suggest that using a foveated visual system offers a promising potential for the development of more efficient object detectors.

Fig. 1. The foveated visual field of the proposed object detector. Square blue boxes with white borders at the center are foveal pooling regions. Around them are peripheral pooling regions, which are radially elongated. The sizes of peripheral regions increase with distance to the fixation point, which is at the center of the fovea. The color within the peripheral regions represents pooling weights.

Fig. 2. Two example detections by our foveated object detector (FOD). Yellow dots show fixation points, numbers in yellow font indicate the sequence of fixations and the bounding box is the final detection. Note that FOD does not have to fixate on the target object in order to localize it (example on the right).

Fig. 3. Illustration of the visual field of the model. (a) The model is fixating at the red cross mark on the image. (b) Visual field (Figure 1) overlaid on the image, centered at the fixation location. White lines delineate the borders of pooling regions. Nearby pooling regions do overlap. The weights (Figure 1) of a pooling region sharply decrease outside of its shown borders. White borders are actually iso-weight contours for neighboring regions. Colored bounding boxes show the templates of three components on the visual field: red, a template within the fovea; blue and green, two peripheral templates at 2.8 and 7 degrees in the periphery, respectively. (c,d,e) Zoomed in versions of the red (foveal), blue (peripheral) and green (peripheral) templates. The weights of a template, w_i, are defined on the gray shaded pooling regions.

Fig. 4. Two bounding boxes (A, B) are shown on the visual field. While box A covers a large portion of the pooling regions that it intersects with, box B's coverage is not as good. Box B is discarded as it does not meet the overlap criteria (see text); therefore, no component for B is created in the model.

2.4.2 Training
Consider a training set $D = \{(I_i, b_i)\}_{i=1}^{K}$, where $I_i$ is an image, $b_i$ is a bounding box and $K$ is the total number of examples. If $I_i$ does not contain any positive examples, i.e. object instances, then $b_i = \emptyset$. Following the DPM model [14], we train model templates using a latent-SVM formulation:

$$\arg\min_{\mathbf{w}} \; \frac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_{i=1}^{K} \sum_{f \in F(I_i, b_i)} \max\bigl(0,\; 1 - y_i\, s(I_i, b_i, f)\bigr) \qquad (10)$$

where $y_i = 1$ if $b_i \neq \emptyset$ and $y_i = -1$ otherwise. The set $F(I_i, b_i)$ denotes the set of all feasible fixation locations for example $(I_i, b_i)$. For $b_i \neq \emptyset$, a fixation location is considered feasible if there exists a model component whose bounding box overlaps with $b_i$. For $b_i = \emptyset$, all possible fixation locations on $I_i$ are considered feasible.
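The data term of the training objective, i.e. the hinge loss max(0, 1 − y_i s(I_i, b_i, f)) accumulated over feasible fixation locations, can be sketched in Python as below. This is a rough illustration under assumptions: the summation over all feasible fixations, the representation of an empty bounding box as `None`, and the callables `feasible_fixations` and `score` are hypothetical and are not taken from the original code; regularization of the templates is omitted.

```python
def label(box):
    """+1 for an example with a target bounding box, -1 for an
    example without one (represented here as None)."""
    return 1 if box is not None else -1


def hinge_data_term(examples, feasible_fixations, score):
    """Sum of hinge losses over examples and their feasible fixations.

    examples           -- list of (image, box) pairs, box may be None
    feasible_fixations -- callable (image, box) -> iterable of fixations
    score              -- callable (image, box, fixation) -> float
    """
    total = 0.0
    for image, box in examples:
        y = label(box)
        for f in feasible_fixations(image, box):
            total += max(0.0, 1.0 - y * score(image, box, f))
    return total
```

A positive example scored below the margin of 1, or a negative example scored above −1, contributes to the loss; correctly scored examples beyond the margin contribute nothing.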

Fig. 5. Ratio of mean AP scores of FOD systems relative to that of the SW system. The graph shows two eye movement algorithms, maximum a posteriori probability (MAP) and random (RAND), and two starting points (C: center; E: edge).

Fig. 6. AP scores achieved by SW and MAP-E per class.

Fig. 7. FOD-DPM's performance (mean AP over 20 classes) as a function of number of fixations. FOD-DPM achieves DPM's performance at 11 fixations and exceeds it with more fixations.

Fig. 9. Fixation locations and bounding box predictions of FOD for different object classes (bicycle, person, and car from left to right) but for the same image and initial point of fixation.

TABLE 1
Average precision (AP) scores of SW based methods on the PASCAL VOC 2007 dataset.

TABLE 2
AP scores and relative computational costs of SW and FOD on the PASCAL VOC 2007 dataset.

TABLE 3
AP scores and relative computational costs of FOD-DPM and DPM on the PASCAL VOC 2007 dataset.