Considering the Spatial Layout Information of Bag of Features (BoF) Framework for Image Classification

Spatial pooling methods such as spatial pyramid matching (SPM) are crucial in the bag of features (BoF) model used for image classification. SPM partitions the image into a set of regular grids and assumes that the spatial layout of all visual words obeys a uniform distribution over these regular grids. In practice, however, different visual words should obey different spatial layout distributions. To improve SPM, we develop a novel spatial pooling method, namely spatial distribution pooling (SDP). The proposed SDP method uses an extension of the Gaussian mixture model to estimate the spatial layout distributions of the visual vocabulary. For each visual word type, SDP generates a set of flexible grids rather than the regular grids of traditional SPM. Furthermore, it computes grid weights for visual word tokens according to their spatial coordinates. The experimental results demonstrate that SDP outperforms traditional spatial pooling methods and is competitive with state-of-the-art classification accuracy on several challenging image datasets.


Introduction
Image classification plays a significant role in computer vision research. The recent state-of-the-art image classification pipeline consists of two major parts: 1) the image representation, e.g., bag of features (BoF) [1][2][3] and spatial pyramid matching (SPM) [4]; 2) the classifier, e.g., support vector machines (SVMs) and their variants [5,6]. Developing a discriminative image representation remains a challenging problem for image classification.
Borrowing from the bag of words (BoW) model used in textual information retrieval, the BoF method has been widely used for image representation [1][2][3]. The standard BoF model first extracts local features, e.g., SIFT descriptors, from all images, and then uses clustering algorithms or vector quantization methods to transform the local features into a visual vocabulary, where each cluster represents a visual word type. Thus, BoF describes an image as an orderless collection of visual words. The representative extensions of BoF include geometric correspondence search [7,8], discriminative vocabulary learning [9][10][11][12], and constrained coding methods [5,13].
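As a minimal sketch of the quantization step described above, the following Python snippet assigns each local descriptor to its nearest visual word and counts an orderless histogram. The function names `nearest_word` and `bof_histogram`, the Euclidean distance, and the toy 2-D descriptors are illustrative assumptions; real SIFT descriptors are 128-dimensional and the vocabulary would come from a clustering step.

```python
import math

def nearest_word(descriptor, vocabulary):
    """Return the index of the visual word closest to the descriptor (Euclidean distance)."""
    return min(range(len(vocabulary)),
               key=lambda v: math.dist(descriptor, vocabulary[v]))

def bof_histogram(descriptors, vocabulary):
    """Count how many descriptors fall on each visual word, ignoring their positions."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_word(d, vocabulary)] += 1
    return hist

# Toy example: three visual words and four 2-D descriptors (hypothetical values).
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.1), (0.9, 0.2), (0.8, -0.1), (0.1, 0.7)]
histogram = bof_histogram(descriptors, vocabulary)
```

The histogram discards all position information, which is exactly the limitation that spatial pooling methods try to address.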
To further improve BoF by considering the spatial layout information, the authors of [4] propose a downstream SPM method for BoF. After generating the visual vocabulary, SPM partitions the image into a set of regular grids at different levels and concatenates the histograms of visual words from each grid. Empirical results show that SPM can significantly improve the classification performance. However, it assumes that the spatial layout of all visual words obeys a uniform distribution over these regular grids. This conflicts with the intuition that different visual words should obey different spatial layout distributions. To address this problem, we propose a novel spatial distribution pooling (SDP) algorithm to improve SPM. In SDP, we develop an extension of the Gaussian mixture model (GMM), and use this model to estimate the spatial layout distribution for each visual word type. SDP generates a set of flexible grids rather than the regular grids of traditional SPM. As the example in Fig 1 shows, SDP can generate more reasonable grids than SPM, resulting in a more consistent image-level representation than SPM (Fig 2). Furthermore, SDP can compute grid weights for visual word tokens according to their spatial coordinates. A number of experiments have been conducted to evaluate SDP. The experimental results demonstrate that SDP outperforms the existing spatial pooling methods.
The remainder of this paper is organized as follows. In Section 2, we introduce the framework of the proposed algorithm. In Section 3, we report and discuss the experimental results. In Section 4, we present our conclusions.

Proposed Algorithm
In this section, we first review the popular SPM-based image classification system, and then introduce the proposed SDP algorithm.
Given a visual vocabulary with $V$ visual words, let $\vec{N} = [n_1, n_2, \cdots, n_V]$, where $n_v$ is the number of times that visual word $v$ has occurred in the training images, and let $C_v = \{\vec{c}_{v,i}\}_{i=1}^{n_v}$ be the full spatial layout (i.e., spatial coordinate) set for visual word $v$ (as shown in Fig 1).

SPM-based Image Classification System
We first introduce the traditional flowchart of the SPM-based image classification system. As shown in Fig 3, this system extracts local features, e.g., SIFT and DHOG [1,14] descriptors, from all images, and then codes these local features into a visual vocabulary using clustering algorithms or vector quantization methods [5,9-13]. For each image, it transforms the local features into visual words according to the visual vocabulary, and then generates its image-level feature vector using SPM. After all images have been processed, a conventional algorithm, e.g., SVMs, is commonly used to train the classifier. In this image classification system, SPM is used to capture the spatial layout information. For clarity, we illustrate an example of SPM with $2^l \times 2^l$ grids at each level, where the level $l$ is set to 0, 1 and 2. As shown in Fig 4, suppose that the visual vocabulary contains three visual word types (i.e., V = 3), which are indicated by circles, diamonds and crosses. Following the above setting, SPM divides the image at three levels, and then counts the visual word histogram for each grid at each level. Concatenating all visual word histograms together, we finally construct an image-level feature vector. In this example, each image corresponds to a 63-dimensional (i.e., $V \times \sum_{l=0}^{2} 2^{2l} = 3 \times 21 = 63$) feature vector.
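The SPM example above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name `spm_feature`, the cell-major ordering of the output vector, and the toy tokens are assumptions, and coordinates are assumed to be normalized to [0, 1].

```python
def spm_feature(tokens, num_words, levels=(0, 1, 2)):
    """Concatenate per-cell visual word histograms over a spatial pyramid.

    tokens: list of (word_index, x, y) with x and y normalized to [0, 1].
    """
    feature = []
    for level in levels:
        cells = 2 ** level                          # 2^l cells per axis at this level
        hist = [0] * (cells * cells * num_words)
        for word, x, y in tokens:
            col = min(int(x * cells), cells - 1)    # clamp x == 1.0 into the last cell
            row = min(int(y * cells), cells - 1)
            hist[(row * cells + col) * num_words + word] += 1
        feature.extend(hist)
    return feature

# Toy image with V = 3 visual word types and four tokens (hypothetical values).
tokens = [(0, 0.10, 0.20), (1, 0.60, 0.30), (2, 0.80, 0.90), (0, 0.40, 0.70)]
feature = spm_feature(tokens, num_words=3)
```

With V = 3 and levels 0, 1 and 2, the vector has 3 × (1 + 4 + 16) = 63 entries, and every token is counted once per level.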

Spatial Distribution Pooling
SPM rigidly partitions the image into several regular grids, and assumes that the spatial layout of all visual words obeys a uniform distribution over these grids. That is to say, each visual word in SPM occurs in the regular grids at each level with equal probability. However, this conflicts with the intuition that different visual words should obey different spatial layout distributions. To address this problem, we propose a spatial distribution pooling (SDP) algorithm. Inspired by the idea of generative Bayesian models [15,16], we develop an extension model of GMM (e-GMM) to describe the spatial layout distributions of visual words. We assume that visual words are independently drawn. For each visual word $v$, its spatial layout $\vec{\theta}_v$ is a multinomial distribution over $K$ latent grids, drawn from the Dirichlet prior $\beta$. Each latent grid $k$ is a bivariate Gaussian distribution with respect to the spatial coordinate of the visual word token, where $\vec{\mu}_{vk}$ is the expectation and $\Sigma_{vk}$ is the covariance matrix. Formally, the spatial coordinate generative process for visual word $v$ is as follows:
1. Draw the spatial layout: $\vec{\theta}_v \sim \mathrm{Dirichlet}(\beta)$.
2. For each of the $n_v$ tokens of visual word $v$:
a. Choose a latent grid: $z \sim \mathrm{Multinomial}(\vec{\theta}_v)$.
b. Choose a spatial coordinate: $\vec{c} \sim \mathcal{N}(\vec{\mu}_{vz}, \Sigma_{vz})$,
where $\mathcal{N}(\vec{c}; \vec{\mu}, \Sigma) = \frac{1}{2\pi |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\vec{c} - \vec{\mu})^{\top} \Sigma^{-1} (\vec{c} - \vec{\mu})\right)$ is the bivariate Gaussian density.
The graphical model representation of e-GMM is given in Fig 5. SDP is more flexible than SPM. Under e-GMM, SDP can assign each visual word to a latent grid according to its spatial coordinate, instead of a regular grid. For each image, we construct its image-level feature vector by concatenating the visual word histograms of all latent grids.
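To make the generative view concrete, the sketch below samples spatial coordinates for one visual word: grid proportions are drawn from a symmetric Dirichlet prior, then each token picks a latent grid and a coordinate from that grid's bivariate Gaussian. The function names, the toy means and covariances, and the use of Python's `random` module are assumptions for illustration only.

```python
import math
import random

def sample_dirichlet(beta, k, rng):
    """Draw a K-dimensional sample from a symmetric Dirichlet(beta) via Gamma draws."""
    draws = [rng.gammavariate(beta, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_gaussian_2d(mean, cov, rng):
    """Draw one point from a bivariate Gaussian using a 2x2 Cholesky factor."""
    (a, b), (_, c) = cov                      # cov = [[a, b], [b, c]]
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    u, v = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (mean[0] + l11 * u, mean[1] + l21 * u + l22 * v)

def generate_coordinates(n_tokens, means, covs, beta=1.0, seed=0):
    """Generate (latent grid, coordinate) pairs for the tokens of one visual word."""
    rng = random.Random(seed)
    k = len(means)
    theta = sample_dirichlet(beta, k, rng)    # grid proportions of this visual word
    samples = []
    for _ in range(n_tokens):
        z = rng.choices(range(k), weights=theta)[0]               # choose a latent grid
        samples.append((z, sample_gaussian_2d(means[z], covs[z], rng)))  # choose a coordinate
    return theta, samples

# Two hypothetical latent grids in normalized image coordinates.
means = [(0.25, 0.25), (0.75, 0.75)]
covs = [[[0.01, 0.0], [0.0, 0.01]], [[0.02, 0.005], [0.005, 0.02]]]
theta, samples = generate_coordinates(5, means, covs)
```

Sampling is not needed for classification itself; it only illustrates which quantities are latent (grid proportions and grid assignments) and which are observed (coordinates).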

Inference and Estimation
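In this section, we discuss the two inherent issues of e-GMM: 1) Inference: if the parameters of e-GMM (i.e., the spatial layout distributions $\{\vec{\theta}_v\}_{v=1}^{V}$, the expectations $\{\vec{\mu}_{vk}\}_{k=1,v=1}^{k=K,v=V}$ and the covariance matrices $\{\Sigma_{vk}\}_{k=1,v=1}^{k=K,v=V}$) are known, given a visual word $v$ with spatial coordinate $\vec{c}_v$, we want to infer its corresponding latent grid; 2) Estimation: given a number of tokens of visual word $v$ with the spatial coordinate set $C_v$, we want to estimate the model parameters.
Inference. The inferential problem is to compute the posterior distribution of the grid assignment given a visual word $v$ with spatial coordinate $\vec{c}_v$. It can be computed as follows:
$$p(z = k \mid \vec{c}_v) = \frac{\theta_{vk}\, \mathcal{N}(\vec{c}_v; \vec{\mu}_{vk}, \Sigma_{vk})}{\sum_{j=1}^{K} \theta_{vj}\, \mathcal{N}(\vec{c}_v; \vec{\mu}_{vj}, \Sigma_{vj})}$$
where $\mathcal{N}(\cdot; \vec{\mu}, \Sigma)$ denotes the bivariate Gaussian density. We consider the posterior as the grid weights of this visual word token. We sort these $K$ grid weights, and use the top $M$ (where $M \in \{1, 2, \cdots, K\}$) values to accumulate histograms of visual word $v$ in the corresponding latent grids. The final grid weight values used are computed by:
$$w_{k_m} = \frac{p(z = k_m \mid \vec{c}_v)}{\sum_{i=1}^{M} p(z = k_i \mid \vec{c}_v)}$$
where $k_m$, as well as $k_i$, is one of the top $M$ latent grids. For clarity, we illustrate an example with the M = 3 setting. Suppose that a visual word token $v$ is assigned to three grids {1, 2, 3} with grid weights {0.1, 0.3, 0.6}. We then consider that the visual word $v$ occurs 0.1 times in latent grid 1, 0.3 times in latent grid 2 and 0.6 times in latent grid 3.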
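The inference step can be sketched as follows, assuming the posterior is the usual mixture posterior (grid proportion times bivariate Gaussian density, normalized over grids) and that the top M values are renormalized to sum to one. The function names and toy numbers are hypothetical.

```python
import math

def gaussian_2d_pdf(point, mean, cov):
    """Density of a bivariate Gaussian with covariance [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) via the closed-form 2x2 inverse.
    quad = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def grid_posterior(point, theta, means, covs):
    """Posterior probability of each latent grid for one token coordinate."""
    scores = [t * gaussian_2d_pdf(point, m, s) for t, m, s in zip(theta, means, covs)]
    total = sum(scores)
    return [s / total for s in scores]

def top_m_weights(posterior, m):
    """Keep the M largest posterior values and renormalize them to sum to one."""
    top = sorted(range(len(posterior)), key=lambda k: posterior[k], reverse=True)[:m]
    total = sum(posterior[k] for k in top)
    return {k: posterior[k] / total for k in top}

# A token halfway between two identical grids is split evenly between them.
means = [(0.25, 0.25), (0.75, 0.75)]
covs = [[[0.02, 0.0], [0.0, 0.02]], [[0.02, 0.0], [0.0, 0.02]]]
posterior = grid_posterior((0.5, 0.5), [0.5, 0.5], means, covs)

# Top-3 selection on a hypothetical four-grid posterior.
weights = top_m_weights([0.05, 0.15, 0.30, 0.50], m=3)
```

Each retained grid then receives a fractional count equal to its weight, as in the M = 3 example in the text.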
Estimation. For each visual word $v$, given the spatial coordinate set $C_v$, we want to estimate its spatial layout distribution $\vec{\theta}_v$, expectations $\vec{\mu}_v$ and covariance matrices $\Sigma_v$. This can be achieved by maximizing the likelihood of $C_v$ (Eq (4)), where $z$ denotes the grid assignments. Because this likelihood is intractable to compute and the variables $\vec{\theta}_v$ and $z$ are latent, we use the expectation maximization (EM) algorithm to optimize the model parameters. The EM algorithm iteratively optimizes the likelihood Eq (4). Each EM iteration consists of two steps, i.e., the expectation step (E-step) and the maximization step (M-step). In the E-step, we fix $\vec{\theta}_v$, $\vec{\mu}_v$ and $\Sigma_v$, and compute the expectations for $z$, i.e., the posterior grid probabilities of each token. In the M-step, we fix the expectations of $z$ obtained in the E-step, and update $\vec{\theta}_v$, $\vec{\mu}_v$ and $\Sigma_v$ by maximizing the likelihood Eq (4). Iterating the E-step and M-step until convergence, we obtain the optimal $\vec{\theta}_v$, $\vec{\mu}_v$ and $\Sigma_v$. For clarity, we summarize the estimation process in Fig 6.
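The following is a hedged sketch of the estimation loop for a single visual word, not the exact update rules of e-GMM. It assumes standard Gaussian mixture EM updates for the means and covariances, and a MAP update of the grid proportions under a symmetric Dirichlet prior with beta >= 1 (which reduces to the plain maximum likelihood update when beta = 1). The function name `em_fit`, the initial covariance, and the small regularizer `reg` are assumptions.

```python
import math

def gaussian_2d_pdf(point, mean, cov):
    """Density of a bivariate Gaussian with covariance [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    quad = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def em_fit(points, init_means, n_iter=50, beta=1.0, reg=1e-6):
    """Fit grid proportions, means and covariances for one visual word."""
    k, n = len(init_means), len(points)
    theta = [1.0 / k] * k
    means = [tuple(m) for m in init_means]
    covs = [[[0.05, 0.0], [0.0, 0.05]] for _ in range(k)]   # assumed initial spread
    for _ in range(n_iter):
        # E-step: posterior probability of each latent grid for each token.
        resp = []
        for p in points:
            scores = [theta[j] * gaussian_2d_pdf(p, means[j], covs[j]) for j in range(k)]
            total = sum(scores)
            resp.append([s / total for s in scores])
        # M-step: re-estimate the parameters from the soft assignments.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)         # soft token count of grid j
            theta[j] = (nj + beta - 1.0) / (n + k * (beta - 1.0))
            mx = sum(r[j] * p[0] for r, p in zip(resp, points)) / nj
            my = sum(r[j] * p[1] for r, p in zip(resp, points)) / nj
            sxx = sum(r[j] * (p[0] - mx) ** 2 for r, p in zip(resp, points)) / nj + reg
            syy = sum(r[j] * (p[1] - my) ** 2 for r, p in zip(resp, points)) / nj + reg
            sxy = sum(r[j] * (p[0] - mx) * (p[1] - my) for r, p in zip(resp, points)) / nj
            means[j] = (mx, my)
            covs[j] = [[sxx, sxy], [sxy, syy]]
    return theta, means, covs

# Two well-separated toy clusters in normalized coordinates (hypothetical data).
points = [(0.18, 0.22), (0.22, 0.18), (0.20, 0.25), (0.25, 0.20), (0.15, 0.15),
          (0.82, 0.78), (0.78, 0.82), (0.80, 0.75), (0.75, 0.80), (0.85, 0.85)]
theta, means, covs = em_fit(points, [(0.3, 0.3), (0.7, 0.7)])
```

A fixed iteration count is used here for brevity; a practical version would monitor the change in log-likelihood and guard against tokens that lie far from every grid.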

Related work
There are some related attempts aimed at improving the spatial pooling of SPM. The work in [27] proposes a feature and spatial covariant kernel under the histogram representation, which considers spatial constraints against heavy clutter and occlusion. The authors of [28] combine convolutional neural networks with the spatial pooling method. The receptive field learning (RFL) and generalized regular spatial pooling (GRSP) methods proposed in [29] and [30], respectively, explore optimal pooling grids based on SPM. RFL learns adaptive grids by optimizing the pooling parameters, and GRSP allows the pooling grids in the same resolution (i.e., level) to have denser or sparser distributions. Our SDP also focuses on learning better pooling grids than SPM. In comparison with the above two relevant works, the advantage of SDP is that it considers each visual word individually. This is more reasonable given the intuition that different visual words should obey different spatial layout distributions.
There are also some other works aimed at improving spatial pooling from the part model perspective. The reconfigurable bag of words (RBoW) model [31] divides the image into a set of pre-defined sub-models, which are related to the spatial information. The visual words in RBoW have different weights for different sub-models. In a sense, this RBoW model is similar to topic models, which assign each grid of the image a "topic". Deformable part-based models (DPM) such as [32,33] use deformation parameters to penalize the deviation of the parts from their default locations, which are relative to the root. The main difference between these algorithms and our SDP is that they consider spatial pooling at the grid level, whereas SDP considers spatial pooling at the visual word level. In particular, in DPM the deformation parameters and the part appearance models are trained jointly (i.e., latent SVMs), whereas in SDP the latent grids of visual words and the classifiers are trained separately.

Experiment
In this section, we evaluate the proposed SDP algorithm on three widely used datasets: Caltech-101 [17], Caltech-256 [18] and MIT Indoor-67 [37]. We use the dense SIFT descriptor to extract local features. The SIFT descriptors are extracted from 16×16 pixel patches that are densely sampled from each image on a grid with a step size of 8 pixels [13]. The locality-constrained linear coding (LLC) [5] algorithm is used to train the visual vocabulary, and the number of neighbors is set to 5 with the shift-invariant constraint. In this setting, five visual words are actually assigned to each descriptor. For each visual word per descriptor, SDP estimates its latent grid and accumulates its word weight from LLC to this grid. To process images of different sizes, SDP normalizes the coordinates by the width and height of the images. For the final image-level representation, we use "max-pooling" combined with "L2 normalization" as in [5]. For SDP, the number of latent grids is set to 21, the parameter M is set to 3, and the Dirichlet prior β is set to 1; for SPM, 1×1, 2×2 and 4×4 regular grids are used. The popular LibSVM [26] tool is used to train the classifier.
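The coordinate normalization and the pooling step of this setup can be sketched as below. This is an illustrative simplification: the function names are hypothetical, each token weight is assumed to be the product of a coding weight and a grid weight, and weights are assumed non-negative so that pooling against an initial zero is harmless.

```python
import math

def normalize_coordinate(x, y, width, height):
    """Map pixel coordinates to [0, 1] so images of different sizes are comparable."""
    return x / width, y / height

def max_pool_l2(token_weights, num_words, num_grids):
    """Max-pool weighted tokens into a (word, grid) vector and L2-normalize it.

    token_weights: list of (word_index, grid_index, weight) triples.
    """
    pooled = [0.0] * (num_words * num_grids)
    for word, grid, weight in token_weights:
        idx = word * num_grids + grid
        pooled[idx] = max(pooled[idx], weight)      # keep the strongest response per cell
    norm = math.sqrt(sum(p * p for p in pooled))
    return [p / norm for p in pooled] if norm > 0 else pooled

# Toy example with V = 2 visual words and K = 3 latent grids (hypothetical values).
coord = normalize_coordinate(150, 75, 300, 300)
feature = max_pool_l2([(0, 0, 0.3), (0, 0, 0.4), (1, 2, 0.3)], num_words=2, num_grids=3)
```

With the settings reported in the text (2048 visual words and 21 latent grids), a vector built this way would have 2048 × 21 = 43,008 entries, the same length as SPM with 1×1, 2×2 and 4×4 grids.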

Caltech-101
The Caltech-101 dataset contains 9144 images divided into 101 classes. The majority of images are of medium resolution, around 300×300 pixels, and the number of images per class varies from 31 to 800. Following previous studies [5,14], we train on 5, 10, 15, 20, 25 and 30 images per class and test on no more than 50 images per class. For consistency, all images are resized to be no larger than 300×300 pixels.
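A per-class split of this kind could be drawn as in the sketch below. The function name `split_per_class`, the random shuffling, and the toy file names are assumptions; the text does not specify how the training and test images are selected.

```python
import random

def split_per_class(images_by_class, n_train, max_test=50, seed=0):
    """Pick n_train training images and at most max_test test images per class."""
    rng = random.Random(seed)
    train, test = {}, {}
    for label, images in images_by_class.items():
        shuffled = list(images)
        rng.shuffle(shuffled)
        train[label] = shuffled[:n_train]
        test[label] = shuffled[n_train:n_train + max_test]   # disjoint from the training part
    return train, test

# Hypothetical classes with 100 and 40 images.
images_by_class = {
    "class_a": [f"a_{i}.jpg" for i in range(100)],
    "class_b": [f"b_{i}.jpg" for i in range(40)],
}
train, test = split_per_class(images_by_class, n_train=30)
```

Small classes simply contribute fewer test images, which is why the per-class test count is an upper bound rather than a fixed number.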
We train a visual vocabulary with 2048 visual words. Some reported results in [4, 13, 19-22, 29, 30, 34] are used as performance baselines. The experimental results are shown in Table 1. It can be seen that SDP outperforms other spatial pooling methods, e.g., an improvement of about 2.5% over SPM in all settings and about 2% over RFL. However, a gap still exists between our SDP and the state-of-the-art algorithms, which use more complex coding methods. This indicates that the coding method is more important for classification on the Caltech-101 dataset.

Caltech-256
The dataset Caltech-256 collects 30,607 images divided into 256 classes, where each class contains at least 80 images. We train on 15, 30, 45 and 60 images per class and at most 50 images for testing. Similar to dataset Caltech-101, we also resize the images to be no larger than 300×300 pixels.
We train a visual vocabulary with 2048 visual words. We use some reported results in [18, 22-25, 35] as performance baselines. The experimental results are shown in Table 2. First, we observe that SDP outperforms SPM in all settings, with improvements of about 3% to 5%. Second, SDP is competitive with the reported results, e.g., an improvement of about 2% over [25], and is slightly lower than the state-of-the-art algorithms based on more complex coding and heterogeneous features. We argue that SDP is an effective spatial pooling method that improves on SPM.

MIT Indoor-67
The MIT Indoor-67 dataset contains 15,620 images divided into 67 indoor scene classes. We train on 80 images per class and test on 20 images per class. A visual vocabulary with 2048 visual words is trained, and some reported results in [22,30,34,36] are used as performance baselines. Table 3 shows the experimental results. Again, we observe that SDP outperforms other spatial pooling methods, e.g., an improvement of about 5% over SPM and about 2% over GRSP, and is competitive with some reported results. Although a gap still exists between SDP and the state-of-the-art results using more complex coding methods, SDP is successful in spatial pooling.

Experiments with Parameters
We investigate the influence of two significant parameters M and K on datasets Caltech-101 and Caltech-256. For dataset Caltech-101/Caltech-256, 30/60 images per class are used for training.
We fix K = 21, and evaluate the classification accuracy with different M values over the set {1, 2, ..., 21}. The experimental results are shown in Fig 7. For both datasets, we observe that the results show similar trends, i.e., smaller M values commonly perform better than larger M values. For example, M = 3 is about 35% better than M = 21 on the Caltech-101 dataset, and M = 3 is about 25% better than M = 21 on the Caltech-256 dataset. This is because larger M values lead to dense image-level feature vectors, which reduces the discriminative power of the feature vectors and has a negative influence on classifiers, especially SVMs. Besides, we observe that the performance gaps between M = 1, 2, 3, 4 are very small, and when M becomes larger than 4, the performance starts to drop. In practice, the value 3 performs best and is used as the default setting for the parameter M.
We fix M = 3, and evaluate the classification accuracy with different K values over the set {3, 6, 9, 12, 15, 18, 21, 24, 27}. The experimental results are shown in Fig 8. It can be seen that relatively larger K values perform better than smaller K values, and when K becomes larger than 21, the performance starts to drop. The best performance is achieved by K = 18 and K = 21. Interestingly, this is close to the commonly used SPM setting of 1×1, 2×2 and 4×4 grids, where the total number of regular grids is 21. In practice, the value 21 is used as the default setting for the parameter K.

Conclusion
In this paper, we develop a novel SDP algorithm to improve the spatial pooling in the BoF model for image classification. Different from SPM, the SDP algorithm considers each visual word individually. SDP is based on the proposed e-GMM model, which describes the generative process for the spatial coordinates of visual word tokens. This e-GMM model assumes that for each visual word there are some latent grids, and that neighboring tokens tend to be assigned to the same grid. SDP uses e-GMM to organize flexible latent grids, and then concatenates all visual word histograms together to construct image-level feature vectors. This is more reasonable than SPM, which divides images into regular grids. We evaluate the proposed SDP algorithm on three widely used image datasets: Caltech-101, Caltech-256 and MIT Indoor-67. The experimental results indicate that SDP significantly improves on the performance of SPM. In our experiments we use the setting of LLC+SDP. However, this setting performs worse than some state-of-the-art algorithms that use more complex coding methods. In the future, we plan to refine SDP and apply it to state-of-the-art features such as Fisher vectors [36].