Pooling region learning of visual word for image classification using bag-of-visual-words model

When there is not enough data for deep learning, the Bag-of-Visual-Words (BoVW) model remains a good alternative for image classification. In the BoVW model, many pooling methods have been proposed to incorporate the spatial information of local features into the image representation vector, but none of them gives each visual word its own pooling regions. The practice of designing the same pooling regions for all the words restrains the discriminability of the image representation, since the spatial distributions of the local features indexed by different visual words are not the same. In this paper, we propose to give each visual word its own pooling regions, and present a simple yet effective method for learning them. Concretely, a kind of small window named the observation window is used to obtain its responses to each word over the whole image region. The pooling regions of each word are organized in a tree structure, in which each node indicates a pooling region. For each word, the pooling regions are learned by constructing a tree with its labelled coordinate data, which consist of the coordinates of responses and image class labels. The effectiveness of our method is validated by observing whether there is an obvious improvement in classification accuracy after applying it. Our experimental results on four small datasets (i.e., Scene-15, Caltech-101, Caltech-256 and Corel-10) show that the classification accuracy is improved by about 1% to 2.5%. We experimentally demonstrate that giving each word its own pooling regions is beneficial to the image classification task, which is the significance of our work.


Introduction
Image classification, one of the most challenging tasks in computer vision, has attracted much attention over the decades. Its target is to classify images into predefined semantic classes. There are many challenges in the image classification task, such as changes in viewpoint and illumination, partial occlusion, clutter, and inter- and intra-class visual diversity. A great number of works have been proposed to deal with these challenges. Among pooling-oriented methods, Feng et al. [13] learned a weighted l_p-norm spatial pooling function tailored for the class-specific feature spatial distribution. However, this pooling function is still used under the framework of spatial pyramid matching (SPM) [14].
In this paper, we propose to give each word its own pooling regions, and present a simple yet effective method for learning them. Specifically, a kind of small window named the observation window is used to obtain its responses to each word over the whole image region. We adopt a tree structure to organize the pooling regions of each word. The pooling regions of each word are learned by constructing a tree with the labelled coordinate data, which consist of the coordinates of responses and image class labels. In the process of tree construction, when dividing the pooling region of a parent node, we employ linear discriminant analysis (LDA) to learn a dividing direction, and select the best dividing line from the set of candidate dividing lines in this direction according to information gain. The effectiveness of our method is validated by observing whether there is an obvious improvement in classification accuracy after applying it. Our experiments are conducted on four small datasets, i.e., Scene-15 [14], Caltech-101 [15], Caltech-256 [16] and Corel-10 [17]. The results show that the classification accuracy is improved by about 1% to 2.5%, which demonstrates that giving each word its own pooling regions is beneficial to the image classification task.
The remainder of this paper is organized as follows: Section 2 reviews related works, Section 3 illustrates our work in detail, Section 4 reports the experimental evaluation and analysis, and Section 5 draws the conclusion.

Related works
The most related work is SPM, proposed by Lazebnik et al. [14]. It partitions the whole image into multiple blocks at resolution levels of 1 × 1, 2 × 2 and 4 × 4, and then concatenates the pooling vectors obtained in these blocks to form the image representation vector. Several improved methods build on SPM. Huang et al. [18] weighted the spatial locations of local features in each block by a Gaussian function. Wu et al. [19] built a directed graph with the blocks as nodes to model the relationships between blocks. Harada et al. [20] proposed to form the image representation vector as a weighted sum of all the pooling vectors. Since SPM is not invariant to global geometric transformations, some works have been devoted to this problem. Zhang et al. [21] proposed different heuristic methods employing three frequency histograms, i.e., shapes, pairs and binned log-polar feature representations. Penatti et al. [22] proposed a method named word spatial arrangement (WSA). It captures the relative positions of visual words by partitioning the image space into four quadrants with the position of a given word as the origin, and then aggregates the statistics of all the words in each quadrant. Apart from these works on dividing images into subregions of different shapes, some works focus on learning discriminative image regions, such as [23] and [24]. In [23], a saliency map indicating the discriminability of local features is used to weight the coding vectors of local features. In [24], a latent support vector machine is adopted to learn a set of latent pyramidal regions. Jia et al. [25] proposed to adaptively learn discriminative blocks from an overcomplete set of spatial blocks, and a boosting method for block learning was introduced by Zhang et al. [26].
Another way of incorporating spatial information is to encode the relationships or co-occurrences of visual words. The works [27] [28] [29] group spatially close visual words into visual phrases and then represent an image as a histogram of these phrases. Similarly, Silva et al. [30] and Dammak et al. [31] employed graphs instead of visual phrases to describe the spatial relationships among visual words more accurately. Unlike methods that obtain visual phrases after feature quantization, Morioka et al. [32] and Boureau et al. [2] concatenated neighboring local features into a joint feature to preserve local region information. A notable work by Khan et al. [33] considered the global geometric relationships among Pairs of Identical Words (PIWs): based on the angles between these identical visual words, a normalized histogram termed PIWAH (Pairs of Identical Visual Word Angle Histogram) is calculated. Anwar et al. [34] extended this work to encode the global geometric relationships of visual words in a scale- and rotation-invariant manner, computing the angles made by triplets of identical visual words and constructing histograms from these angles, termed TIWAH (Triplets of Identical Visual Words Angle Histogram). Building on this, a very recent work by Zafar et al. [35] calculates an orthogonal vector relative to each point in the triplets of identical visual words and then creates a histogram from the magnitudes of these orthogonal vectors.

Our work
In this section, we first illustrate our work under the framework of BoVW model. Afterwards, the detail about observation window is presented. Next, the method for learning pooling regions is illustrated in detail. In the end, we explain how to obtain the image representation vector.

Process of image representation
The BoVW model has formed a unified framework over the last decade, with five basic stages: image patch extraction, image patch description, dictionary learning, feature coding and feature pooling. Our work involves only the last stage, i.e., feature pooling. Fig 1 shows the process of image representation including our work.
As shown in Fig 1, the input image is converted into a set of coding vectors through the first four stages. The first stage extracts patches from the input image, usually by densely sampling local areas of the image, e.g., patches of 16 × 16 pixels with a step of 8 pixels. Then, the image patches are described as feature descriptors (local features), usually via statistical analysis over the pixels of the patches. SIFT is widely used to describe an image patch as a 128-dimensional vector; local binary patterns and HOG are also employed in some works. Afterwards, the feature descriptors are encoded as coding vectors with a visual dictionary, which is generated from the feature descriptors extracted from all training images. Each feature descriptor activates a number of visual words and generates a coding vector whose length equals the number of visual words. Various coding methods differ in how they activate the visual words.
At the last stage, the set of coding vectors is converted into an image representation vector by our proposed method. Concretely, this stage consists of three steps: 1) obtaining the responses of different kinds of observation windows to each visual word (illustrated in Section 3.2); 2) grouping the responses to each visual word according to the pooling regions learned for that word (illustrated in Section 3.3); 3) taking the maximum from each group, and concatenating all the maximums to form the image representation vector (illustrated in Section 3.4).

Observation window
In this paper, we propose the observation window. For clarity, we illustrate its principle under the condition that each feature descriptor is represented only by its most similar visual word. An observation window of size (w, h) includes w × h image patches. If the feature descriptor of some image patch in an observation window is represented by the ith visual word, the response of the window to the ith word is set from 0 to 1 to denote that the ith word exists in the window. The observation window is slid over the image with certain horizontal and vertical steps to obtain its responses to each word over the whole image region. For each word, the responses of the observation window to it form a response matrix. Fig 2 explains this process. Furthermore, observation windows of different sizes and steps can also be applied simultaneously, as shown in Fig 1. In this case, for each kind of observation window, each word has a corresponding response matrix obtained with that window. Instead of the coding vectors, the response matrices of all the words are used to obtain the image representation vector. In practice, a feature descriptor is usually encoded by multiple visual words. Our method assumes that at the place where a feature descriptor is located, multiple visual words appear simultaneously. In this case, the response of an observation window to the ith visual word is the maximum of the coding coefficients to the ith word of the feature descriptors included in the window.
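The response computation described above can be sketched as follows, assuming the dense patches form a regular grid so that each patch's coding vector sits in a (rows, cols, K) array; the function name `window_responses` is a hypothetical helper for illustration. Taking the maximum coding coefficient inside the window covers both the hard-assignment case (0/1 responses) and the multi-word coding case described in the text.

```python
import numpy as np

def window_responses(coding, w, h, step):
    """Slide an observation window over a grid of patch coding vectors.

    coding: array of shape (rows, cols, K) holding each patch's coding
    coefficients over K visual words. The response of a window to word k
    is the maximum coefficient of word k among the patches it covers.
    Returns an array of shape (n_windows_y, n_windows_x, K): one response
    matrix per word.
    """
    rows, cols, K = coding.shape
    out_r = (rows - h) // step + 1
    out_c = (cols - w) // step + 1
    resp = np.zeros((out_r, out_c, K))
    for i in range(out_r):
        for j in range(out_c):
            patch = coding[i*step:i*step + h, j*step:j*step + w, :]
            resp[i, j] = patch.reshape(-1, K).max(axis=0)
    return resp
```

With one-hot (hard-assignment) coding, the responses are exactly the 0/1 existence indicators described above.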
The advantages of using the observation window are twofold. First, the robustness of the image representation to image variability, such as object size, location and pose, is improved by checking the existence of a visual word in a slightly larger window (demonstrated in Tables 2 and 3). Second, the existence of a visual word in windows of different sizes and steps provides more discriminative information than in a window of a single size. In fact, this observation is supported by the effectiveness of SPM, which partitions an image into multiple blocks at resolution levels of 1 × 1, 2 × 2 and 4 × 4. From the viewpoint of a feature descriptor, the existence of the word assigned to each descriptor is checked in three kinds of windows (1/4 × 1/4, 1/2 × 1/2 and 1 × 1 of the image size).

Pooling regions of visual word
A pooling region is a 2-dimensional region in image space. In the existing methods (e.g., [14,18,19]), the whole image region is divided into multiple pooling regions, and the coding vectors of all the feature descriptors from an image are grouped by these regions. The coding vectors in a group are aggregated into a pooling vector by computing a statistical value (e.g., the maximum) of the coding coefficients to each visual word. The length of the pooling vector equals the number of visual words. From the viewpoint of a visual word, the coding coefficients to each word are grouped by the same pooling regions, and a statistical value is computed for each group.
In this paper, we allow each word to have its own pooling regions, so that the coding coefficients to each word can be grouped by its own regions. There is a reason for this practice. Feng et al. [13] have pointed out that, for images from one class, the feature descriptors indexed by different words have distinct spatial distributions, while the feature descriptors indexed by the same word often share a similar spatial distribution; besides, the class-specific spatial distributions are distinct from each other. This observation implies that 1) each word can have its own pooling regions, which can be designed to be more discriminative; and 2) pooling regions of different shapes, sizes and locations have different discriminability.
Taking this into account, we propose to learn pooling regions for each visual word in terms of its spatial distributions on the images of different classes. To achieve this, the pooling regions of a visual word are organized in a tree structure, as shown in Fig 3. Each node corresponds to one pooling region, indicated by the blue colour. The shape of a region is not restricted to rectangles, since the region of a parent node can be divided in any direction. In such a tree, the root node indicates the whole image region, and each parent node has two child nodes corresponding to two subregions. The union of the pooling regions indicated by all the nodes at any level is the whole image region. In this manner, SPM can also be easily represented by this kind of tree structure: the regions indicated by the nodes at levels 0, 2 and 4 correspond to the regions divided at the resolution levels of 1 × 1, 2 × 2 and 4 × 4.
Based on this tree-based representation, the pooling regions can be learned by constructing a tree from the root node to the leaf nodes. The key to pooling region learning is how to divide the pooling region of a parent node. Moreover, it is worth noting that, for each word, instead of the coding coefficients to it, the responses of the observation window to it need to be grouped by its own pooling regions in our method. To this end, we generate the labelled coordinate data using the coordinates of the responses of the observation window and the image class labels (illustrated in Section 3.3.1), and learn the best dividing line using the labelled coordinate data (illustrated in Section 3.3.2). After obtaining a tree, we group the responses of the observation window by the pooling regions indicated by its nodes (illustrated in Section 3.3.3).
3.3.1 Labelled coordinate data. In our method, the pooling regions of a visual word are applied to the responses of the observation window to that word in order to group the responses. Therefore, for each word, its pooling regions need to be learned in terms of its spatial distributions obtained with the observation window on the training images of each class. The class-specific spatial distribution data of a visual word can be attained by recording the coordinates of the non-zero responses to it on each training image together with the class label of that image. The coordinate of a response is defined as the average of the center coordinates of the image patches included in the observation window that generates the response.
Concretely, a coordinate s = (x, y) of a response and a class label c constitute a labelled coordinate datum (s, c). For the kth word, the coordinates of the non-zero responses to it on the jth image and the class label c_j of this image are combined to form the labelled coordinate data B^k_j = {(s^j_i, c_j), i = 1, 2, ..., N_j} of the word obtained on the jth image, where N_j is the number of the non-zero responses to the kth word. The labelled coordinate data B^k_j obtained on each training image are gathered to obtain the labelled coordinate data B^k = {(s^j_i, c_j), i = 1, 2, ..., N_j, j = 1, 2, ..., M} of the kth word, where M is the number of the training images.
Each word has its own labelled coordinate data. Furthermore, observation windows of different sizes and steps will generate different responses to the same word. Hence, if different kinds of observation windows are used, then for each kind of window, each word will have labelled coordinate data obtained with that kind of window. Given O kinds of observation windows, B^{k,1}, B^{k,2}, ..., B^{k,O} corresponding to the O kinds of windows will be generated for the kth word, so each word will have O sets of labelled coordinate data. Due to the variety of image sizes, the center coordinate of an image patch is normalized according to the image size to attain the normalized coordinate of a response. In this paper, the image center is defined as the origin of the coordinate system; the center coordinate (x, y) of an image patch means that it is x pixels and y pixels away from the image center in the horizontal and vertical directions, respectively. For an image of size a × b, the normalized coordinate is calculated as (2x/a, 2y/b). After normalization, the horizontal and vertical coordinate values of an image patch are both limited to the range from -1 to 1.
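A minimal sketch of collecting the labelled coordinate data for one word on one image, using the normalization (2x/a, 2y/b) just described. The function name and the pre-computed arrays of window-center offsets (`centers_x`, `centers_y`) are assumptions for illustration.

```python
import numpy as np

def labelled_coordinates(resp_k, centers_x, centers_y, a, b, label):
    """Collect labelled coordinate data for one word on one image.

    resp_k: 2-D response matrix of one word (one kind of observation
    window); centers_x/centers_y give, for each window position, the
    average patch-center offset from the image center in pixels.
    Coordinates are normalized by the image size (a, b) as (2x/a, 2y/b).
    Returns a list of labelled coordinate data ((x, y), label).
    """
    data = []
    for (i, j) in zip(*np.nonzero(resp_k)):
        s = (2.0 * centers_x[i, j] / a, 2.0 * centers_y[i, j] / b)
        data.append((s, label))
    return data
```

Gathering these lists over all training images of all classes yields the per-word data B^k used for tree construction.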

Pooling region learning.
Pooling region learning of a visual word is achieved by constructing a tree for the word, from the root node to the leaf nodes. The whole image region is taken as the pooling region of the root node, and the pooling region of a parent node is divided into two subregions by a dividing line. In order to obtain subregions with high discriminability, the best dividing line is found according to information gain. In detail, when dividing the pooling region of a parent node (splitting a parent node), we first learn a dividing direction by applying LDA to the labelled coordinate data from the region, and then find the best dividing line in this direction, namely the one for which the weighted entropy of the resulting subregions is smallest. The step of splitting a node (dividing a pooling region) is performed recursively until any one of the following three conditions is satisfied: (1) the depth of the node in the tree exceeds a user-specified threshold (maximum depth); (2) the number of labelled coordinate data from its pooling region is less than a user-specified threshold; (3) the labels of the labelled coordinate data from its pooling region are all identical.
The complete tree construction process is as follows:

function Tree-Construction(Node Q)
step 1: If the node Q does not meet any one of the stop conditions of splitting a node, then
step 2: Learn a dividing direction p = (p_x, p_y) of Q using the labelled coordinate data from the pooling region of Q, as stated below.
step 3: Learn the best dividing line y = (p_y/p_x)x + b of Q using the labelled coordinate data from the pooling region of Q, as stated below.
step 4: Create two child nodes Q_l and Q_r of Q by the best dividing line. The data of Q_l are the labelled coordinate data from the left region of the dividing line, and the data of Q_r are the rest, obtained by subtracting the data of Q_l from the data of Q. The depths of Q_l and Q_r are both set to d + 1, where d is the depth of Q.
step 5: Tree-Construction(Node Q_l)
step 6: Tree-Construction(Node Q_r)
step 7: End

The above function is performed recursively after inputting the root node. For the kth word, the data of the root node are the labelled coordinate data B^k. In the following, we present in detail the method for learning the dividing direction and the method for learning the best dividing line.
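The recursion in steps 1-7 can be sketched as below. This is a simplified stand-in, not the paper's implementation: the `split_fn` placeholder (a horizontal median split by default) takes the place of the LDA-based direction learning and entropy-based line selection described in the following subsections, and all names are illustrative.

```python
import numpy as np

class Node:
    def __init__(self, depth):
        self.depth = depth
        self.k = self.b = None        # dividing line y = k*x + b
        self.left = self.right = None

def build_tree(S, labels, depth=0, max_depth=4, min_samples=10,
               split_fn=None):
    """Recursively split a pooling region (steps 1-7 in the text).

    S: (N, 2) labelled coordinates, labels: (N,) class labels.
    split_fn returns (k, b) for the dividing line y = k*x + b; the
    default is a crude median split standing in for the LDA step.
    """
    node = Node(depth)
    # Stop conditions (1)-(3): depth limit, too few data, pure labels.
    if (depth >= max_depth or len(S) < min_samples
            or len(np.unique(labels)) < 2):
        return node
    if split_fn is None:
        split_fn = lambda S, y: (0.0, float(np.median(S[:, 1])))
    node.k, node.b = split_fn(S, labels)
    f = S[:, 1] - node.k * S[:, 0] - node.b
    left = f > 0                      # sign of f decides the side
    if left.all() or (~left).all():
        return node                   # degenerate split: stop
    node.left = build_tree(S[left], labels[left], depth + 1,
                           max_depth, min_samples, split_fn)
    node.right = build_tree(S[~left], labels[~left], depth + 1,
                            max_depth, min_samples, split_fn)
    return node
```

Plugging in the LDA direction and entropy-based intercept search (next subsections) as `split_fn` would recover the method described in the text.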
Learning of the dividing direction. Each element in the image representation vector corresponds to a pooling region of a visual word, so the discriminability of the pooling regions is closely related to the discriminability of the image representation vector. In our work, the discriminability of a pooling region is evaluated by the entropy of the label distribution of the labelled coordinate data from the region. Because the spatial distributions of the coordinate data with different labels are not the same, the weighted entropies of the subregions produced by dividing lines of different locations and directions also differ. To obtain pooling subregions with high discriminability, LDA is employed to learn a dividing direction, and the best dividing line is selected from the set of candidate dividing lines in this direction according to information gain.
The advantages of using LDA are twofold. First, it reduces the computational cost of finding the best dividing line, since it avoids building candidate split points along each dimension and calculating the information gain for each split point. Second, if the coordinate distribution of each class is Gaussian-like, a dividing line with higher information gain is more likely to be found among the candidate dividing lines than among the axis-aligned split points.
Let B = {(s_i, c_i), i = 1, ..., N} be the labelled coordinate data with C classes {ω_c}, c = 1, ..., C, from a parent region, where s_i ∈ R^{2×1} denotes the ith coordinate datum and c_i is the class label of s_i. The Fisher criterion is defined as

J(w) = (w^T S_B w) / (w^T S_W w),  (1)

with the between-class and within-class scatter matrices

S_B = Σ_{c=1}^{C} N_c (m_c − m)(m_c − m)^T,  S_W = Σ_{c=1}^{C} Σ_{c_i = ω_c} (s_i − m_c)(s_i − m_c)^T,  (2)

where N_c is the number of the coordinates with the label c, m_c is the mean of the coordinates with the label c, and m is the mean of all the coordinates. The learning objective is to maximize the Fisher criterion J(w) under the condition w^T S_W w = 1. This objective is achieved by solving the generalized eigenvalue problem S_B w = λS_W w. We retain the eigenvector w with the largest eigenvalue λ_max. The dividing direction p = (−w_y, w_x) is obtained by calculating the orthogonal unit vector of the eigenvector w = (w_x, w_y).
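Under the definitions above, the LDA step might be sketched as follows. The function name is an assumption, and the small ridge term added to keep S_W invertible on degenerate data is a numerical convenience, not part of the original method.

```python
import numpy as np

def dividing_direction(S, labels):
    """Learn the dividing direction from labelled 2-D coordinates.

    Builds the within-class scatter S_W and between-class scatter S_B,
    solves S_B w = lambda * S_W w for the leading eigenvector w, and
    returns the unit vector orthogonal to w, i.e. p = (-w_y, w_x).
    """
    mean = S.mean(axis=0)
    Sw = np.zeros((2, 2))
    Sb = np.zeros((2, 2))
    for c in np.unique(labels):
        Sc = S[labels == c]
        mc = Sc.mean(axis=0)
        d = Sc - mc
        Sw += d.T @ d                        # within-class scatter
        m = (mc - mean).reshape(2, 1)
        Sb += len(Sc) * (m @ m.T)            # between-class scatter
    # Solve the generalized eigenproblem via inv(S_W) S_B; the small
    # ridge keeps S_W invertible when one coordinate has no spread.
    vals, vecs = np.linalg.eig(np.linalg.inv(Sw + 1e-8 * np.eye(2)) @ Sb)
    w = np.real(vecs[:, np.argmax(np.real(vals))])
    p = np.array([-w[1], w[0]])
    return p / np.linalg.norm(p)
```

For two classes separated along the x-axis, w points along x and the returned dividing direction p points along y, so the dividing line runs parallel to the class boundary's normal direction.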
Learning of the best dividing line. After obtaining the dividing direction of a pooling region, we find the best dividing line from the set of candidate lines in this direction. The weighted entropy of the subregions obtained with each candidate line is calculated, and the candidate corresponding to the smallest weighted entropy is selected as the best dividing line.
Given the dividing direction p = (p_x, p_y) obtained for the region R and a dividing line y = kx + b in this direction, where k = p_y/p_x and b is the vertical intercept, the weighted entropy H of the entropies H_l, H_r of the two subregions R_l, R_r is computed as

H = (N_l/N) H_l + (N_r/N) H_r,  (3)

where R_l (R_r) is the left (right) region of this line, N_l (N_r) is the number of labelled coordinate data from the subregion R_l (R_r), and N = N_l + N_r. The entropy H_l (H_r) is computed on the label distribution of the labelled coordinate data from the subregion R_l (R_r). To judge which of the two subregions a coordinate belongs to, the function f(x, y) = y − kx − b is built and used as follows:

if k ≥ 0: (x, y) ∈ R_l when f(x, y) > 0, and (x, y) ∈ R_r otherwise;
if k < 0: (x, y) ∈ R_r when f(x, y) > 0, and (x, y) ∈ R_l otherwise.  (4)

For example, given a coordinate s_i = (x_i, y_i), if k ≥ 0 and f(x_i, y_i) > 0, then it belongs to the subregion R_l.
The best dividing line is selected from the set S of candidate dividing lines, which differ only in the vertical intercept b. Given the labelled coordinate data B = {(s_i, c_i), i = 1, ..., N} from the region R and the dividing direction p, the set S is built by the following steps. First, we find all the vertical intercepts I = {b_1, b_2, ..., b_N} by computing b = y − (p_y/p_x)x for each coordinate in B; for the coordinate s_i = (x_i, y_i), the vertical intercept b_i is y_i − (p_y/p_x)x_i. Then, we sort I in ascending order to obtain a sorted set J = {d_1, d_2, ..., d_N}. Finally, the dividing direction p and an average value (d_i + d_{i+1})/2 form a candidate dividing line y = (p_y/p_x)x + (d_i + d_{i+1})/2. Since there are N intercepts in J, the cardinality of the obtained set S is N − 1. For each line in S, we compute its weighted entropy H by formula (3). The line with the smallest entropy H_min (maximum information gain) is selected as the best dividing line.
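A sketch of this candidate-intercept search, assuming p_x ≠ 0 as in the parameterization y = (p_y/p_x)x + b used above; the helper names are illustrative and the side-assignment uses the f(x, y) > 0 convention from the running example.

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of a label distribution.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_dividing_line(S, labels, p):
    """Select the intercept b minimizing the weighted entropy of formula (3).

    Candidate intercepts are midpoints between sorted values of
    b_i = y_i - (p_y / p_x) * x_i, one per coordinate, giving N - 1
    candidate lines y = (p_y / p_x) x + b in the learned direction p.
    """
    k = p[1] / p[0]
    intercepts = np.sort(S[:, 1] - k * S[:, 0])
    best_b, best_h = None, np.inf
    for b in (intercepts[:-1] + intercepts[1:]) / 2.0:
        left = (S[:, 1] - k * S[:, 0] - b) > 0
        n_l, n_r = left.sum(), (~left).sum()
        if n_l == 0 or n_r == 0:
            continue                 # degenerate split, skip
        h = (n_l * entropy(labels[left])
             + n_r * entropy(labels[~left])) / len(S)
        if h < best_h:
            best_b, best_h = b, h
    return k, best_b, best_h
```

For two classes cleanly separated along the dividing direction, the weighted entropy of the best line drops to zero.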

Grouping the responses of observation window.
The responses of the observation window to each word are grouped by the pooling regions of the word; the responses belonging to the same pooling region form a group. This requires knowing which pooling regions each response belongs to. Specifically, for a response, deciding which child node it belongs to is performed from the root node down to a leaf node according to its coordinate and the best dividing lines of the non-leaf nodes. The pooling regions of the nodes on this decision path are the regions the response belongs to. Given the coordinate s_i = (x_i, y_i) of a response and a non-leaf node q, as in formula (4), the sign of f(x_i, y_i) = y_i − k_q x_i − b_q decides which of the child nodes the response belongs to, where k_q and b_q are the parameters of the best dividing line of the node q.
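Routing a response down the learned tree might look like the following sketch; the minimal `Node` class and function name are assumptions for illustration, and the f > 0 branch follows the k ≥ 0 case of the running example.

```python
class Node:
    """Minimal tree node: a dividing line y = k*x + b and two children."""
    def __init__(self, k=None, b=None, left=None, right=None):
        self.k, self.b, self.left, self.right = k, b, left, right

def region_path(node, x, y):
    """Walk a response coordinate from the root down to a leaf.

    At each non-leaf node the sign of f(x, y) = y - k*x - b picks the
    child; the visited nodes identify all the pooling regions (hence
    the groups) the response falls into.
    """
    path = []
    while node is not None:
        path.append(node)
        if node.left is None:         # leaf: no dividing line
            break
        f = y - node.k * x - node.b
        node = node.left if f > 0 else node.right
    return path
```

Collecting, for every node, the responses whose paths pass through it yields exactly the per-region groups used for pooling.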
In fact, the responses can also be grouped by only a subset of all the pooling regions of a visual word, e.g., the regions indicated by the nodes at levels 0, 2 and 4. This practice not only decreases the dimensionality of the image representation vector, but also reduces the redundant information incurred by too many pooling regions (demonstrated in Table 4).

Image representation vector
The image representation vector consists of the maximums from the groups of all the visual words. If only one kind of observation window is used, each word has only one tree. For the kth word, the maximums v^k_1, v^k_2, ..., v^k_G from its G groups obtained with its tree are concatenated to form the vector v^k = (v^k_1, v^k_2, ..., v^k_G), and the vectors of all the words are concatenated to form the image representation vector.
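The final concatenation step can be sketched as below for a single observation window; the function name is illustrative, and the handling of empty groups (contributing 0) is an assumption not specified in the text.

```python
import numpy as np

def representation_vector(groups_per_word):
    """Concatenate per-group maxima into the final representation.

    groups_per_word: for each of the K words, a list of G groups, each
    holding the responses that fell into one pooling region of that
    word's tree. The maximum of each group is taken (max pooling);
    empty groups are assumed to contribute 0.
    """
    parts = []
    for groups in groups_per_word:
        parts.extend(max(g) if len(g) else 0.0 for g in groups)
    return np.array(parts)
```

The resulting vector has length K × G, one element per (word, pooling region) pair.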

Datasets
In our experiments, four small datasets are used to evaluate the classification performance of our proposed method. Scene-15: It consists of 4492 images from fifteen classes, such as bedroom, industrial and forest. The number of images per class varies from 260 to 440. We randomly chose 100 images from each class to form the training set, and the remaining images are used as the test set.
Caltech-101: Caltech-101 dataset is a challenging object recognition dataset, which contains 9,144 images in 101 object classes and one background class. The number of images per class ranges from 31 to 800. We consider 30 training images and up to 30 testing images per class.
Caltech-256: The Caltech-256 dataset consists of 257 object classes with 30,607 images in total. Compared with Caltech-101, it presents much higher variability in object size, location and pose. 30 images and 20 images from each class are used for training and testing, respectively.
Corel-10: It contains 1,000 images in 10 classes (flower, elephant, owls, tiger, building, beach, skiing, horses, mountains, food). For evaluation, the images were randomly divided into 50 training and 50 test images for each class.

Implementation details
For images from all the datasets, we extract dense patches of 16 × 16 pixels. The step between two neighboring patches is set to 8 pixels for Scene-15 and Corel-10, and 6 pixels for Caltech-101 and Caltech-256. Each patch is described as a SIFT descriptor (a 128-dimensional vector). We use the K-means implementation in VLFeat [36] to learn the visual dictionary. The dictionary size is set to 1024 for Scene-15 and Corel-10, and 2048 for Caltech-101 and Caltech-256. Localized Soft-assignment Coding (LSaC) [37] is applied to encode the SIFT descriptors with the learned dictionary, owing to its superior performance to sparse coding [38] and Locality-constrained Linear Coding [39]. As suggested in [37], the number of visual words used to encode a descriptor is set to 5; in this case, at the place where a descriptor is located, five different visual words appear simultaneously. For all the datasets, a one-versus-rest linear SVM is trained for each class. We adopt [37] as our baseline, termed SPM(baseline). All experiments are conducted on 64-bit Windows 10 with an Intel Core i5-4590 at 3.30 GHz × 4 and 16 GB RAM.
To evaluate more reliably whether the classification accuracy is improved after applying our method, the following experimental setup is adopted. For each dataset, we randomly select the training and testing images 10 times to obtain 10 fixed training sets and testing sets, and learn 10 fixed dictionaries on the fixed training sets, respectively. The coding and pooling strategies adopted by our method and SPM(baseline) are also the same, so the only factor that influences the classification accuracy is the pooling regions. For each experimental setup regarding pooling region learning, we conduct the experiment 10 times on the 10 fixed training and testing sets, and report the average classification accuracy over the 10 runs. The average classification accuracy of SPM(baseline) is also obtained on the 10 fixed training and testing sets.

Impact of the nodes at different levels
The experiments start with an in-depth analysis of the discriminability of the pooling regions indicated by the nodes at different levels. We investigate the nodes at levels 0 to 6, respectively; at level l, the number of nodes is 2^l. In SPM, the blocks obtained by dividing the whole image region at the resolution levels of 1 × 1, 2 × 2, 4 × 4 and 8 × 8 correspond to the pooling regions of the nodes at levels 0, 2, 4 and 6. Here, we use only one kind of observation window, with size 1 × 1 and step 1. Table 1 shows the classification accuracies corresponding to the different levels. For all the datasets, the accuracies obtained by our method are consistently superior to the baseline, demonstrating that the learned pooling regions are more discriminative than those of SPM. We note a drop in accuracy after some level, for example, level 4 for Scene-15 and level 5 for Caltech-101, which can be explained by the noise incurred by overly fine division. Besides, we can achieve accuracies similar to the best obtained by SPM with lower dimensionality. For Scene-15, the accuracy 82.62% (4096 dimensions) of the single level 2 is very close to 82.84% (21504 dimensions, shown in Table 5). For Caltech-101, Caltech-256 and Corel-10, we can draw the same conclusion (shown in Tables 6-8). We also note that, for all the datasets, the accuracies at level 6 are clearly higher than those of SPM. This means that, compared with SPM, the pooling regions learned by our method are more robust to the noise incurred by overly fine division.

Impact of observation window
In this section, we investigate the impact of the size and step of the observation window by testing a series of parameter combinations. Here, the nodes at levels 0, 2 and 4 are selected for grouping the responses. Tables 2 and 3 show the accuracies obtained on Scene-15 and Caltech-101, respectively. From these tables, we find that the size and step of the observation window influence the classification accuracy. Fig 4 shows the accuracies in terms of the area of the observation window. Overall, a slightly larger window achieves higher accuracy. For Scene-15, the highest accuracy of 84.89% is obtained when the size of the observation window is set to 5 × 5, and for Caltech-101, a window of size 8 × 8 leads to the highest accuracy of 75.09%. However, an obvious drop appears when the window exceeds some size, such as 5 × 5 for Scene-15 and 8 × 8 for Caltech-101. The reason is that the robustness of the image representation to image variability is improved by checking the existence of a word in a slightly larger window, but information on the spatial distribution of the word is lost when applying an even larger window, such as 6 × 6. Moreover, we also note that, for Scene-15, when the aspect ratio r of the observation window is less than 1, the accuracies are higher than those obtained when r > 1, such as 84.72% for 1 × 3 vs. 84.32% for 3 × 1; however, we cannot observe the same phenomenon on Caltech-101. Compared with the accuracies shown in Table 1, the accuracies obtained with the nodes at levels 0, 2 and 4 are higher.

Impact of the combination of observation window and level of tree
We evaluate the classification accuracy of the combination of observation window and level of tree.

Accuracy comparison
In this section, we compare our method with SPM (baseline) under the same experimental setup (illustrated in Section 4.2). In addition, we also list the results reported by some representative methods, including deep learning methods [40,41,44,45,50] and BoVW methods [38,42,48]. The results of these methods are obtained without the help of transfer learning, i.e., without using a pre-learned CNN to extract image features. As shown in Tables 5-8, compared with SPM (baseline), our method improves classification accuracy by about 1% to 2.5%. This demonstrates that making each word have its own pooling regions is beneficial to the image classification task. However, our results are lower than those of some methods [2,43,48,50]. Although the accuracies obtained by our method are not attractive enough, it can easily be combined with a number of BoVW methods to achieve higher classification accuracy (illustrated in Section 4.7). We also note that the deep learning methods do not achieve an obvious improvement over the BoVW methods due to the lack of training data. We compute the average computational time for Caltech-101 spent on converting an image into an image representation vector. The average time required by our method (0.76s) is slightly more than that of SPM (baseline) (0.61s). The time spent on pooling region learning is about 380s when the observation window of 4(2) × 4(2) is used.

Discussion
The advantages and disadvantages of our method are as follows:
• Our method improves the discriminability of the image representation vector by learning pooling regions for each word. However, pooling region learning carries a computational cost: for Caltech-101, it takes about 380s when the observation window of 4(2) × 4(2) is used. Besides, the time spent on converting an image into an image representation vector increases slightly.
• Although the classification accuracies obtained by our method are not attractive enough compared with some existing methods [2,43,48,50], our method can easily be combined with a number of BoVW methods to achieve higher classification accuracy. The reason is that our method only involves the feature pooling stage. Existing works focusing on feature extraction, feature description, dictionary learning and feature coding can be used in conjunction with our method. Besides, works on Analysis Dictionary Learning (ADL) can also be applied to the image representation vector obtained by our method, resulting in a more discriminative representation.
• The effectiveness of our method depends to a great extent on the inter- and intra-class visual diversity. When the spatial distributions of the local features highly relevant to the image class are similar across images of the same class, our method works well. For example, an accuracy improvement of about 2% to 2.5% is obtained on Scene-15, Caltech-101 and Caltech-256; for most of the classes in these three datasets, the images are artificially well aligned. By contrast, a gain of only 0.9% is obtained on Corel-10 due to its large intra-class visual diversity.
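Because our method only touches the pooling stage, the way learned regions enter the final representation can be sketched independently of the rest of the pipeline. In this sketch each word pools its response activations only inside its own regions, and the pooled values are concatenated into the image representation vector; max pooling and the input formats are illustrative assumptions.

```python
# Hedged sketch of per-word pooling with learned regions.
# responses: word index -> list of ((x, y), activation) pairs (assumed format);
# regions_per_word: word index -> list of (x0, y0, x1, y1) learned regions.
# Each word contributes one max-pooled value per region; values are
# concatenated in word order to form the image representation vector.
def represent(responses, regions_per_word):
    vec = []
    for w in sorted(regions_per_word):
        for (x0, y0, x1, y1) in regions_per_word[w]:
            inside = [a for (x, y), a in responses.get(w, [])
                      if x0 <= x <= x1 and y0 <= y <= y1]
            vec.append(max(inside) if inside else 0.0)
    return vec
```

Any coding scheme that produces per-word activations, or any dictionary learned by another method, can feed this pooling step unchanged, which is why the approach composes with other BoVW components.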

Conclusion
In this paper, we proposed to make each word have its own pooling regions, and presented a simple yet effective method for learning them. A small window, named the observation window, was used to obtain responses to each word over the whole image region. The pooling regions of each visual word were learned by constructing a tree from the coordinates of the responses and the image class labels. After applying our method, the classification accuracy was improved by about 1% to 2.5%, which demonstrates that making each word have its own pooling regions is beneficial to the image classification task. Furthermore, we found that the image representation vector obtained with a slightly larger observation window (e.g., 4 × 4) achieves higher accuracy than that obtained with a small window (e.g., 1 × 1). The classification accuracy is likely to be improved further by applying observation windows of different sizes and steps simultaneously. Future work includes: 1) learning the pooling regions of visual phrases; 2) applying the idea of our method to the convolutional layers of a CNN.