Object detection through search with a foveated visual system

doi:10.1371/journal.pcbi.1005743

Fig 1.

The foveated visual field of the proposed object detector.

Square blue boxes with white borders at the center are foveal pooling regions. Around them are peripheral pooling regions, which are radially elongated. The sizes of the peripheral regions increase with distance from the fixation point, which is at the center of the fovea. The color within the peripheral regions represents the pooling weights.

Fig 2.

Flowchart of the non-foveated sliding window (SW) model and the foveated object detector (FOD).

The feature extraction step is common to both models. First, the image is filtered with simple edge detection filters with different orientations, and gradient magnitude and orientation are estimated at each pixel. Then, the image is divided into small square boxes on a regular grid. Within each box, total gradient magnitude per orientation is computed, which results in a histogram. The output is a collection of feature maps for x, y locations and orientations. For simplicity, only one feature map (H) is shown as input to both models. Right side: Foveated Object Detector. The FOD has an initial fixation position that determines the pooling regions of the underlying histogram of gradient features. FOD’s templates are learned through training and are specific to each retinotopic location. The scores reflecting probability of target presence are used to guide saccades to the most likely target location. The object probability scores for each location are integrated across saccades and used for the final perceptual decision.
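
The saccade-guidance loop described above can be illustrated with a minimal sketch. It assumes a hypothetical `score_fn(fixation)` that returns one score per candidate location, and it integrates evidence by simple summation across fixations; both are illustrative assumptions, not the paper's actual scoring or integration rule. The names `map_search` and `toy_scores` are likewise invented for this example.

```python
import math

def map_search(score_fn, start, n_fixations):
    """Greedy search: fixate, score every location, then move to the best one.

    score_fn(fixation) is a hypothetical callable returning a dict that maps
    each candidate location to a target-presence score for that fixation.
    """
    if n_fixations < 1:
        raise ValueError("n_fixations must be at least 1")
    fixation = start
    total = {}          # scores integrated across fixations (assumed: plain sum)
    path = [start]      # sequence of fixation points
    for i in range(n_fixations):
        for loc, score in score_fn(fixation).items():
            total[loc] = total.get(loc, 0.0) + score
        # MAP-style rule: the currently most likely target location.
        best = max(total, key=total.get)
        if i < n_fixations - 1:
            fixation = best
            path.append(fixation)
    # The final decision uses the integrated scores.
    return best, total, path

def toy_scores(fixation, target=(5, 5), size=10):
    """Toy stand-in for a detector: evidence is weaker far from fixation."""
    scores = {}
    for x in range(size):
        for y in range(size):
            acuity = 1.0 / (1.0 + math.hypot(x - fixation[0], y - fixation[1]))
            scores[(x, y)] = -math.hypot(x - target[0], y - target[1]) * acuity
    return scores
```

Calling `map_search(toy_scores, start=(0, 0), n_fixations=3)` returns the best location, the integrated score map and the fixation path. A random (RAND) strategy would replace the `max` step with a random choice of the next fixation.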

Fig 3.

Histogram of oriented gradients (HoG) of a sample image.

Left: input image; right: HoG result. First, the input image is convolved with two 1-D filters, namely [+1 0 −1] and its transpose. The gradient magnitude and orientation at each pixel are estimated from the convolution results. Then, the image is divided into small, square bins. In each bin, an orientation histogram is computed, which shows the (relative) total gradient magnitude per orientation. Finally, the histogram in each bin is normalized by the total “energy” (e.g. ℓ₂ norm) of a 2×2 block containing the bin, akin to divisive local contrast normalization. This final step is known as block normalization. On the right, each HoG bin is represented with short, oriented line segments whose brightness encodes the magnitude of the associated orientation. Due to the block normalization, in homogeneous areas (e.g. top-right) all orientations have high and similar magnitudes. (Image source: the original picture on the left was taken by the first author.)
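
The steps in this caption can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the bin size of 8 pixels, the 9 unsigned orientations, the replicated image borders, and the use of a single 2×2 block per bin are all choices made here for brevity. In the code, a "cell" is one of the caption's square bins.

```python
import math

def hog_cells(image, cell=8, n_orient=9):
    """Orientation histogram per cell for a grayscale image (list of rows)."""
    h, w = len(image), len(image[0])
    rows, cols = h // cell, w // cell
    hist = [[[0.0] * n_orient for _ in range(cols)] for _ in range(rows)]
    for y in range(rows * cell):
        for x in range(cols * cell):
            # [+1 0 -1] filter and its transpose (borders replicated).
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            mag = math.hypot(gx, gy)
            # Unsigned orientation in [0, pi), quantized into n_orient bins.
            theta = math.atan2(gy, gx) % math.pi
            b = min(int(theta / math.pi * n_orient), n_orient - 1)
            # Accumulate gradient magnitude per orientation.
            hist[y // cell][x // cell][b] += mag
    return hist

def block_normalize(hist, eps=1e-6):
    """Divide each cell histogram by the L2 norm of one 2x2 block containing it."""
    rows, cols = len(hist), len(hist[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            # Top-left corner of a 2x2 block that contains cell (r, c).
            r0 = max(min(r, rows - 2), 0)
            c0 = max(min(c, cols - 2), 0)
            energy = sum(
                v * v
                for rr in range(r0, min(r0 + 2, rows))
                for cc in range(c0, min(c0 + 2, cols))
                for v in hist[rr][cc]
            )
            norm = math.sqrt(energy) + eps
            out_row.append([v / norm for v in hist[r][c]])
        out.append(out_row)
    return out
```

Because the divisor is the local block energy, a nearly uniform region with tiny gradients is rescaled to magnitudes comparable to those of a strongly textured region, which is consistent with the bright, similar-looking orientations in homogeneous areas noted in the caption.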

Table 1.

Per class percent average precision (AP), mean average precision (mAP) over all 20 classes and relative computational costs of non-foveated SW and FOD on the PASCAL VOC 2007 dataset.

(Object class abbreviations are as follows. ap: aeroplane, bk: bike, bd: bird, bt: boat, bl: bottle, bs: bus, cr: car, ct: cat, ch: chair, cw: cow, dt: dining-table, dg: dog, hs: horse, mb: motorbike, pr: person, pt: potted-plant, sh: sheep, sf: sofa, tr: train, tv: tv-monitor.)

Fig 4.

Ratio of mean average precision (mAP) scores of FOD systems relative to that of the non-foveated SW system.

The graph shows two eye movement algorithms, maximum a posteriori probability (MAP) and random (RAND), and two starting points (C: center of the image; E: left or right edge of the image).

Fig 5.

Area under the precision-recall curve (AP scores) achieved by the non-foveated (SW) model and the foveated object detector with a maximum a posteriori eye movement strategy and a starting fixation point at the side of the image (MAP-E).

Each symbol represents one object class. The identity (diagonal) line corresponds to equal performance of the two models.

Table 2.

Per class percent average precision (AP), mean average precision (mAP) over all 20 classes and relative computational costs of FOD-DPM and DPM on the PASCAL VOC 2007 dataset.

(Object class abbreviations are as follows. ap: aeroplane, bk: bike, bd: bird, bt: boat, bl: bottle, bs: bus, cr: car, ct: cat, ch: chair, cw: cow, dt: dining-table, dg: dog, hs: horse, mb: motorbike, pr: person, pt: potted-plant, sh: sheep, sf: sofa, tr: train, tv: tv-monitor.)

Fig 6.

FOD-DPM’s performance (mean AP over all 20 classes) as a function of number of fixations.

FOD-DPM achieves SW-DPM’s performance at 11 fixations and exceeds it with more fixations.

Fig 7.

Per class AP scores achieved by FOD-DPM and non-foveated SW-DPM.

Fig 8.

Fixation locations and bounding box predictions of the FOD for three different object classes (person, car and bicycle) on the same image and with the same initial point of fixation.

Top-left: original image (source: https://www.flickr.com/photos/kristoffer-trolle/27882648666/, with Creative Commons license); top-right: person detection; bottom-left: car detections; bottom-right: bicycle detection. Yellow dots show fixation points, yellow numbers indicate the sequence of fixations, and the bounding boxes are the final detections.

Fig 9.

Performance comparison of the foveated saliency model versus the non-foveated saliency model.

We ran both models on the simple task of identifying the topmost salient location in 100 natural images randomly selected from the PASCAL VOC 2007 dataset. The blue curve plots the average distance (in degrees) between the topmost salient locations, S1 and S2, found by the foveated and the non-foveated model, respectively, on the same image. Note that this location is unique and fixed for the non-foveated model, while it changes for the foveated model as the model explores the image, i.e. makes more and more fixations. The red curve plots the average number of iso-orientation suppression operations of the foveated model relative to that of the non-foveated model. Again, the number of such operations is fixed for the non-foveated model but changes for the foveated model with the number of fixations. The foveated model finds the same topmost salient location as the non-foveated model after 16 fixations. Notably, after 8 fixations, the distance between S1 and S2 falls below 1 degree. The foveated model achieves this level of accuracy with 42% fewer iso-orientation suppression operations than the non-foveated model.

Fig 10.

Illustration of the visual field of the model.

(a) The model is fixating at the red cross mark on the image (see Fig 8’s caption for the source of the image). (b) Visual field (Fig 1) overlaid on the image, centered at the fixation location. White lines delineate the borders of pooling regions. Neighboring pooling regions overlap. The weights (Fig 1) of a pooling region decrease sharply outside of its shown borders. The white borders are actually iso-weight contours for neighboring regions. Colored bounding boxes show the templates of three components on the visual field: red, a template within the fovea; blue and green, two peripheral templates at 2.8 and 7 degrees in the periphery, respectively. (c, d, e) Zoomed-in versions of the red (foveal), blue (peripheral) and green (peripheral) templates. The weights of a template, w_i, are defined on the gray shaded pooling regions.

Fig 11.

Two bounding boxes (A, B) are shown on the visual field.

Box A covers a large portion of the pooling regions that it intersects, whereas box B’s coverage is poorer. Box B is discarded because it does not meet the overlap criteria (see text); therefore, no component for B is created in the model.
