A proposal of prior probability-oriented clustering in feature encoding strategies

Codebook-based feature encodings are a standard framework for image recognition. A codebook is usually constructed by a clustering algorithm, such as the k-means or the Gaussian Mixture Model (GMM). The codebook size is an important factor that determines the trade-off between recognition performance and computational complexity. However, the traditional framework has a disadvantage when a large codebook is used: the number of unique clusters becomes smaller than the designated codebook size because some clusters converge to nearly the same positions. This paper addresses this disadvantage from the perspective of the distribution of prior probabilities and presents a clustering framework with two objectives that serve as alternatives to the k-means and the GMM. Our approach is first evaluated with synthetic clustering datasets to analyze how it differs from traditional clustering. In these experiments, our alternative to the k-means generates results similar to those of the k-means while finely tuning the clusters for our objective. Our alternative to the GMM significantly improves our objective and constructs intuitively appropriate clusters, especially for large sample sets with complicated distributions. In the experiments on image recognition, two state-of-the-art encodings, the Fisher Vector (FV) using the GMM and the Vector of Locally Aggregated Descriptors (VLAD) using the k-means, are evaluated with two publicly available image datasets, the Birds and the Butterflies. With our approach, the recognition performance of the VLAD tends to be worse than that of the original VLAD. On the other hand, the FV with our approach improves the performance, especially for larger codebook sizes.


Introduction
Clustering is a fundamental technique for several purposes such as statistical analysis and data mining. The main purpose of clustering is to make groups called clusters. Each clustering technique has a specific objective to make groups, such as finding groups that minimize a quantization error and estimation of the appropriate distribution [1,2]. This paper focusses on clustering in image recognition algorithms and presents an efficient objective.
In recent image recognition problems, the local feature framework is a key technique. It detects regions of interest in an image and describes a discriminative feature vector for each region [3][4][5]. The basic idea of codebook-based encodings is to capture the statistics of the distribution of local features extracted from an image. By treating local features as visual vocabularies appearing in an image, images can be processed in the same way as in natural language processing (NLP). In NLP, specifically, the bag-of-words (BOW) model [6] expresses a document feature vector by assigning the words in sentences to corresponding common words and counting their frequencies. For images, common visual words, called a codebook, are constructed by clustering local features extracted from various images. The model in image recognition follows the same procedure as the BOW to represent image feature vectors. This approach is well known as the bag-of-visual-words (BoVW) model [7], and its variants [8][9][10][11][12][13] have achieved excellent performance on several tasks, such as object recognition [8,9,11] and image retrieval [12,13]. Gosselin et al. [10] have suggested that increasing the number of common visual vocabularies is an important factor for improving recognition performance. For instance, the best recognition rate was observed with the largest vocabulary size in their experiment. It has also been reported that the recognition performance did not saturate as the vocabulary size increased. On the other hand, a huge vocabulary size causes high computational complexity [10] and may generate unsuitable vocabularies due to over-fitting to the clustering samples [9].
PLOS ONE | https://doi.org/10.1371/journal.pone.0210146 January 10, 2019
Our previous study [14] has considered that the distribution of prior probability can be used to measure the quality of image feature vectors in codebook-based feature encoding strategies. In addition, optimization of the distribution does not require additional computational complexity in practical applications because it is an offline step in the image recognition pipeline.
This paper focuses on the codebook construction step and presents a clustering procedure, named prior probability oriented clustering, that generates a suitable codebook considered from the perspective of the distribution of prior probabilities [14] for feature encoding strategies. The contribution of this paper is threefold: first, our proposal has an explicit objective to optimize the codebook parameters. Second, it relaxes conditions to construct an optimized codebook, compared with the grid search used in [14]. Third, the framework uses general optimization techniques to minimize our objective.
The rest of this paper is organized as follows. The next section briefly reviews the relationship between clustering algorithms and feature encoding approaches. We then describe our proposed clustering framework and analyze its numerical characteristics with synthetic clustering datasets. After that, we evaluate its effect on image recognition performance with image recognition datasets. Finally, we conclude this paper.

Literature review of feature encodings
The basic pipeline for recognizing objects consists of the following steps.

Extract local features.
A given image is first converted to a set of d-dimensional local features. The local features [3][4][5] are robust to some deformations, such as scale, rotation, and occlusion.
Here, the codebook is constructed in advance in an offline step. This section reviews the codebook construction step and the feature encoding step.

Codebook construction
The basic clustering algorithms are the k-means [1] and the Gaussian mixture model (GMM) [2], which are useful in several research fields [7,16,17,18], such as image processing, signal processing, and physiology. The aim of the k-means algorithm is to find the clusters that minimize the quantization error between the given samples and the corresponding mean vectors. A mean vector is the representative position of a cluster, and the quantization error is defined as the sum of squared distances between a mean vector and the samples belonging to its cluster. The GMM constructs Gaussians that represent the distribution of the given samples well. In general, clustering algorithms cannot directly find the global optimum analytically. To find a suboptimal solution, the above algorithms follow an iterative procedure, called the expectation-maximization (EM) algorithm, for exploring local minima. This algorithm consists of the following two steps: the expectation step and the maximization step.
In the case of the k-means, let $X = \{x_t\}_{t=1}^{T}$ and $\Theta = \{\mu_k\}_{k=1}^{K}$ respectively be the clustering samples and the model parameters. The objective function is defined as follows:
$$J_{k\text{-means}} = \sum_{t=1}^{T} \sum_{k=1}^{K} p(x_t; \mu_k) \, \lVert x_t - \mu_k \rVert^2 \tag{1}$$
where $J_{k\text{-means}}$ is the objective value, which measures the quantization error between the samples and the clusters, $p(x_t; \mu_k)$ is a probability function that becomes 1 if $\mu_k$ is the nearest cluster to $x_t$ and 0 otherwise, and $\lVert \cdot \rVert$ is the Euclidean norm operator. To minimize the quantization error, the k-means algorithm iteratively optimizes the model parameters with Eq (2) for the expectation step and Eq (3) for the maximization step.
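In the expectation step, the assignment probabilities $q_{t,k}$ of a sample $x_t$ are computed using the current mean vectors:
$$q_{t,k} = \begin{cases} 1 & \text{if } k = \arg\min_{j} \lVert x_t - \mu_j \rVert^2 \\ 0 & \text{otherwise} \end{cases} \tag{2}$$
Then, the maximization step updates the positions of the mean vectors:
$$\mu_k = \frac{\sum_{t=1}^{T} q_{t,k} \, x_t}{\sum_{t=1}^{T} q_{t,k}} \tag{3}$$
The EM algorithm iterates the above two steps until a termination criterion, such as a designated maximum number of iterations or the convergence of the mean positions, is satisfied.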
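The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the function names, the tolerance `tol`, and the choice to leave empty clusters in place are assumptions made for the example.

```python
import numpy as np


def expectation_step(X, means):
    """Hard assignments: q[t, k] = 1 if means[k] is nearest to X[t], else 0."""
    # Squared Euclidean distances between every sample and every mean, shape (T, K).
    sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    q = np.zeros_like(sq_dist)
    q[np.arange(len(X)), sq_dist.argmin(axis=1)] = 1.0
    return q, sq_dist


def maximization_step(X, q, means):
    """Move each mean to the centroid of the samples assigned to it."""
    counts = q.sum(axis=0)                       # samples per cluster, shape (K,)
    new_means = means.copy()
    nonempty = counts > 0                        # assumption: empty clusters stay put
    new_means[nonempty] = (q.T @ X)[nonempty] / counts[nonempty][:, None]
    return new_means


def kmeans(X, init_means, max_iter=100, tol=1e-8):
    means = np.asarray(init_means, dtype=float)
    for _ in range(max_iter):
        q, _ = expectation_step(X, means)
        new_means = maximization_step(X, q, means)
        moved = np.abs(new_means - means).max()
        means = new_means
        if moved < tol:                          # the means have stopped moving
            break
    q, sq_dist = expectation_step(X, means)
    objective = (q * sq_dist).sum()              # quantization error
    return means, q, objective


X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
means, q, objective = kmeans(X, init_means=[[1.0, 0.0], [9.0, 0.0]])
```

On this toy input the two means settle at the centroids of the left and right pairs of points, and the returned objective is the summed squared distance of each point to its mean.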
Fitting the GMM also uses the EM algorithm. The GMM contains the parameters $\Theta = \{w_k, \mu_k, \Sigma_k\}_{k=1}^{K}$, where $\mu_k$ and $\Sigma_k$ denote the mean and the covariance matrix of the k-th Gaussian and $w_k$ is a mixing weight for mixing the K Gaussians. The mixing weight $w_k$ is also called the "prior probability", which indicates how easily a sample is assigned to the k-th Gaussian. The density of the k-th Gaussian is
$$p(x_t; \mu_k, \Sigma_k) = \frac{1}{\sqrt{(2\pi)^d \lvert \Sigma_k \rvert}} \exp\left( -\frac{1}{2} (x_t - \mu_k)^\top \Sigma_k^{-1} (x_t - \mu_k) \right)$$
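To make the role of the prior probability concrete, the following sketch computes GMM posterior probabilities for Gaussians with diagonal covariances. It uses the standard mixture posterior, proportional to $w_k \, p(x_t; \mu_k, \Sigma_k)$; the function names and the toy parameters are illustrative assumptions.

```python
import numpy as np


def log_gaussian_diag(X, mean, var):
    """Log density of a Gaussian with diagonal covariance diag(var), per sample."""
    d = X.shape[1]
    maha = (((X - mean) ** 2) / var).sum(axis=1)           # squared Mahalanobis distance
    log_norm = -0.5 * (d * np.log(2.0 * np.pi) + np.log(var).sum())
    return log_norm - 0.5 * maha


def posteriors(X, weights, means, variances):
    """Posterior of each Gaussian for each sample, proportional to w_k * p(x_t)."""
    log_p = np.stack(
        [np.log(w) + log_gaussian_diag(X, m, v)
         for w, m, v in zip(weights, means, variances)],
        axis=1,
    )
    log_p -= log_p.max(axis=1, keepdims=True)               # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)


X = np.array([[0.0, 0.0], [4.0, 4.0], [2.0, 2.0]])
weights = np.array([0.5, 0.5])                              # prior probabilities w_k
means = np.array([[0.0, 0.0], [4.0, 4.0]])
variances = np.ones((2, 2))                                 # diagonal of each covariance
q = posteriors(X, weights, means, variances)
```

The third sample lies exactly halfway between the two Gaussians, so its posterior is decided by the prior probabilities alone; changing `weights` to `[0.9, 0.1]` shifts its posterior to the same proportions.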

Feature encoding
As introduced in the previous section, the BoVW is the simplest approach to represent image features and performs well in image recognition applications. The BoVW usually uses the k-means codebook. Letting $X = \{x_t\}_{t=1}^{T}$ be a set of d-dimensional local descriptors extracted from an image, the BoVW feature is defined as:
$$F_{\mathrm{BoVW}} = [f_1, \ldots, f_K]^\top$$
where $f_k \in \mathbb{R}^1$ is the frequency of the local descriptors assigned to the k-th visual vocabulary. To capture image information precisely, the BoVW requires a huge codebook, because the dimensionality of the BoVW is equal to the codebook size K, and this increases the computational cost, such as that of finding nearest neighbors as in Eq (2). Recently developed approaches [8,13] relax this issue by capturing higher-order statistics on the d-dimensional local feature space with a smaller codebook. In recent reports, the Fisher Vector (FV) [8,9] and the Vector of Locally Aggregated Descriptors (VLAD) [12,13] encodings are well known as state-of-the-art approaches. The FV supplements the frequency with two higher-order statistics computed with the GMM codebook. For the k-th Gaussian, it captures $F^{(w)}_k \in \mathbb{R}^1$, $F^{(\mu)}_k \in \mathbb{R}^d$, and $F^{(\sigma)}_k \in \mathbb{R}^d$, which respectively denote the frequency, mean, and covariance statistics, and the FV signature concatenates them over the K Gaussians. The Gaussians are assumed to have diagonal covariances because of the derivation [8] and for computational reasons [9,10]. Therefore, an FV signature has K(2d + 1) dimensions. The VLAD captures only mean statistics by aggregating the residuals between the local features and the mean vectors of the codebook as follows:
$$v_k = \sum_{t=1}^{T} q_{t,k} \, (x_t - \mu_k)$$
where $q_{t,k}$ is the nearest-neighbor assignment of Eq (2) and the dimensionality of a VLAD signature is Kd.
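The BoVW and VLAD encodings described above can be sketched as follows. This is a simplified illustration under stated assumptions: the BoVW frequency is normalized by the number of descriptors, no power or L2 normalization is applied to either signature, and the function names are hypothetical.

```python
import numpy as np


def nearest_assignments(X, means):
    """Index of the nearest codebook mean for each local descriptor."""
    sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)


def encode_bovw(X, means):
    """K-dimensional histogram of visual-word frequencies."""
    nearest = nearest_assignments(X, means)
    counts = np.bincount(nearest, minlength=len(means)).astype(float)
    return counts / len(X)                       # assumption: relative frequency


def encode_vlad(X, means):
    """Kd-dimensional vector of residuals summed per visual word."""
    nearest = nearest_assignments(X, means)
    vlad = np.zeros_like(means, dtype=float)
    for k in range(len(means)):
        assigned = X[nearest == k]
        if len(assigned) > 0:
            vlad[k] = (assigned - means[k]).sum(axis=0)
    return vlad.ravel()


means = np.array([[0.0, 0.0], [10.0, 10.0]])            # toy codebook, K = 2, d = 2
X = np.array([[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]])     # toy local descriptors
bovw = encode_bovw(X, means)
vlad = encode_vlad(X, means)
```

With K = 2 and d = 2 the BoVW signature has 2 components and the VLAD signature has 4, matching the dimensionalities K and Kd given in the text.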

The prior probability-oriented clustering
As described in the above section, the distribution of the prior probabilities $w_k$ is an important factor for measuring the quality of the feature encodings. The main aim of the prior probability-oriented clustering is to minimize the variance of the prior probabilities. The k-means and the GMM follow iterative procedures because there is no analytic solution for unknown samples [19], as described in the literature review. Our approach likewise uses general optimization algorithms for finding local minima. The objective function, Eq (14), consists of two terms: a main objective term and a regularization term weighted by λ. The main objective term is an approximated measure of the variance of the prior probabilities around $\bar{w}$, the average of the prior probabilities. The regularizer $d(x_t; \Theta)$ measures the quantization error between the t-th sample and its nearest cluster mean, and it serves to smooth the solution space. For example, when clustering a number of samples with only the main objective, the solution space might be discrete, which means that small changes of candidate mean positions probably give the same objective value. λ is a weighting factor that controls the relative importance of the main objective term and the regularization term. In our concept, λ is set to a small value to emphasize the main objective. The effect of λ is discussed in the next section.
A black-box optimization framework is used to minimize our objective in Eq (14); such a framework does not impose any requirement, such as differentiability, on the objective function. In the next section, some black-box optimization frameworks are evaluated with synthetic clustering datasets. The general optimization procedure to find a suboptimal solution is as follows:
1. generate initial mean vectors by the k-means++ algorithm [20];
2. repeat:
3. evaluate our proposed objective function as in Eq (14), where the details of how to evaluate the regularization term are described in the subsections below;
4. update the mean vectors by a black-box optimization framework;
5. until the number of iterations reaches the designated limit.
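The evaluate-and-update loop above can be illustrated with a very small derivative-free optimizer. The sketch below deliberately swaps in a plain random local search for the black-box frameworks evaluated later, and it uses a placeholder quadratic in place of Eq (14). The initial means are passed in by the caller rather than generated by k-means++, and all names and default values are assumptions for illustration.

```python
import numpy as np


def black_box_minimize(objective, init_means, n_evals=2000, step=0.05, seed=0):
    """Derivative-free search over all K*d coordinates of the mean vectors.

    A random local search stands in for the optimizers named in the text;
    it only calls objective(means) and never needs gradients.
    """
    rng = np.random.default_rng(seed)
    best = np.asarray(init_means, dtype=float).copy()
    best_value = objective(best)
    for _ in range(n_evals - 1):                 # budget counted in evaluations
        candidate = best + step * rng.standard_normal(best.shape)
        value = objective(candidate)
        if value < best_value:                   # keep only improving moves
            best, best_value = candidate, value
    return best, best_value


def toy_objective(means):
    """Placeholder for Eq (14): squared distance to a fixed target layout."""
    target = np.array([[0.2, 0.3], [0.7, 0.8]])
    return float(((means - target) ** 2).sum())


init = np.zeros((2, 2))                          # K = 2 means in d = 2 dimensions
best_means, best_value = black_box_minimize(toy_objective, init)
```

Because the loop treats the objective as an opaque callable, the same function can be reused with any scalar objective defined on the mean vectors.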

Hard clustering alternated to the k-means
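In this case, the clustering problem is defined as minimizing the variance of prior probabilities while minimizing the quantization error. The quantization error is defined as follows:
$$d(x_t; \Theta) = \sum_{k=1}^{K} q_{t,k} \, \lVert x_t - \mu_k \rVert^2 \tag{15}$$
The procedure for evaluating the objective with Eq (15) is as follows.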
1. predict the assignment probabilities $q_{t,k}$ for all clustering samples $X = \{x_t\}_{t=1}^{T}$, using Eq (2);
2. compute the prior probabilities, in the same manner as the GMM, as $w_k = \frac{1}{T} \sum_{t=1}^{T} q_{t,k}$;
3. evaluate the objective value, using Eq (14) with Eq (15).
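The three steps of the hard objective can be sketched as a single function. Since the exact scaling of Eq (14) is not reproduced here, the sketch assumes one plausible form, the sum of squared deviations of the prior probabilities from their average plus λ times the total squared distance of each sample to its nearest mean. The function name is hypothetical.

```python
import numpy as np


def hard_objective(means, X, lam=1e-9):
    """Assumed form: sum_k (w_k - mean(w))^2 + lam * sum_t min_k ||x_t - mu_k||^2."""
    # Step 1: hard assignment of every sample to its nearest mean.
    sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    nearest = sq_dist.argmin(axis=1)
    # Step 2: prior probabilities as the fraction of samples per cluster.
    w = np.bincount(nearest, minlength=len(means)) / len(X)
    # Step 3: main term plus the weighted quantization error.
    main_term = ((w - w.mean()) ** 2).sum()      # w.mean() is always 1/K
    quantization_error = sq_dist[np.arange(len(X)), nearest].sum()
    return main_term + lam * quantization_error, w


X = np.array([[0.0], [1.0], [2.0], [3.0]])
balanced, w_balanced = hard_objective(np.array([[0.5], [2.5]]), X)
skewed, w_skewed = hard_objective(np.array([[1.0], [3.5]]), X)
```

In this toy case the first pair of means splits the four samples evenly, so the main term vanishes and only the small regularization term remains; the second pair assigns three samples to one cluster and is penalized by the main term.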

Soft clustering alternated to the GMM
In order to estimate Gaussians with only mean vectors, the posterior probability of each sample for the k-th cluster is approximated by the nearest-neighbor assignment $q_{t,k}$ of Eq (2); this approximation is referred to as Eq (16). It is a natural approximation for the following reasons.
• Many GMM implementations [21,22] use the k-means initialization before the EM iterations.
• The distribution of posterior probabilities is peaky in general: one posterior probability is close to 1 and the others are nearly 0.
• The Mahalanobis distance term is dominant in the posterior probability function in Eq (4).
The regularization term is calculated with the same distance metric as the GMM, the Mahalanobis distance, as follows:
$$d(x_t; \Theta) = \sum_{k=1}^{K} q_{t,k} \, (x_t - \mu_k)^\top \Sigma_k^{-1} (x_t - \mu_k) \tag{17}$$
The procedure of the soft objective is as follows.
1. predict the assignment probabilities $q_{t,k}$ for all clustering samples $X = \{x_t\}_{t=1}^{T}$, using Eq (2);
2. estimate $w_k$ and $\Sigma_k$ in the same manner as Eq (6);
3. evaluate the objective value, using Eq (14) with Eq (17).
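A sketch of the soft objective follows, under several assumptions that are not taken from the text: the covariances are diagonal and estimated about each cluster's sample centroid (as a standard GMM maximization step would do), variances are floored at `min_var` to avoid division by zero, and Eq (14) has the form of a sum of squared deviations of the priors plus λ times the summed Mahalanobis distances. The function name is hypothetical.

```python
import numpy as np


def soft_objective(means, X, lam=1e-9, min_var=1e-6):
    """Assumed form: sum_k (w_k - mean(w))^2 + lam * summed Mahalanobis distances."""
    # Step 1: approximate the posteriors by nearest-mean assignment.
    sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    nearest = sq_dist.argmin(axis=1)
    K = len(means)
    # Step 2: estimate the priors and a diagonal covariance for each cluster.
    w = np.bincount(nearest, minlength=K) / len(X)
    maha_total = 0.0
    for k in range(K):
        members = X[nearest == k]
        if len(members) == 0:
            continue                             # an empty cluster adds nothing
        var = np.maximum(members.var(axis=0), min_var)
        # Step 3 (regularizer): squared Mahalanobis distance to the candidate mean.
        maha_total += (((members - means[k]) ** 2) / var).sum()
    main_term = ((w - w.mean()) ** 2).sum()
    return main_term + lam * maha_total, w


X = np.array([[0.0], [2.0], [10.0], [12.0]])
centered, w_centered = soft_objective(np.array([[1.0], [11.0]]), X)
shifted, w_shifted = soft_objective(np.array([[2.0], [11.0]]), X)
```

Both candidate layouts give balanced priors here, so the main term is zero in each case and the regularizer alone distinguishes them, favoring the means that sit at the cluster centroids.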

Numerical analysis
In this section, we first explore which optimization framework is better for our objective function. Then, we analyze the characteristics of the traditional clustering approaches, described in the previous section, and our proposed clustering approach. To evaluate these algorithms, we used two synthetic clustering datasets, the A-sets [23] and the S-sets [24], which are publicly available [25]. The A-sets consist of A1, A2, and A3, which vary the number of clusters, and the S-sets consist of S1, S2, and S3, which vary the spatial complexity [23,24]. Their statistics are shown in Table 1.
The following shows the experimental setup.
• Parameters in the k-means and the GMM. The initialization algorithm was the k-means++ algorithm [20], which improves the stability of solutions. The covariance matrices of the Gaussians were assumed to be diagonal. For the analysis, the implementations in the scikit-learn package [26] for the Python programming language were used. The termination criterion was that the number of iterations of the EM procedure reached 2,000.
• Parameters in our proposal. As optimization frameworks, the Nelder-Mead (NM) [27], the Subplex [28], the Constrained Optimization BY Linear Approximations (COBYLA) [29], the NEWUOA [30], and the AUGmented LAGrangian algorithm (AUGLAG) [31,32], which are implemented in the NLopt library [33], were evaluated. These algorithms are usually used for problems whose solution space structure is unknown, and they do not require any information other than the objective function, such as derivatives. The initial position was set to the concatenated mean vectors generated by the k-means algorithm with 10 iterations. Therefore, the optimization frameworks explore the Kd-dimensional space. The termination criterion was that the number of evaluations of the objective function reached 2,000. The weighting factor was set to $\lambda = 10^{-9}$.
• Clustering samples. The subsets A1, A2, and A3 of the A-sets were used for hard clustering, and the subsets S1, S2, and S3 of the S-sets were used for soft clustering. The samples of each subset were linearly normalized so that the values in each dimension fit within the range [0, 1].
Tables 2 and 3 show the objective values optimized by the optimization algorithms with the weighting factor $\lambda = 10^{-9}$, where each value is the best over five trials and the values for the baselines were obtained only by the main objective term in Eq (14). For all the optimization algorithms, the optimized values were smaller than the baseline results. Specifically, the Subplex gave the smallest objective values on all datasets except for S1. For the Subplex results on the A-sets, the objective values increased as the number of samples or clusters increased, in the order of A1, A2, and A3. The mean value of the prior probabilities is always 1/K because of the probabilistic constraint, and a large number of clusters is expected to decrease the value of our main objective term. Therefore, our proposal with the hard objective might not be effective for a large number of samples or clusters. For the S-sets, the results of the soft objective suggest an advantage with respect to the spatial complexity of the sample distribution: the objective value decreases as the sample distribution becomes more complicated, for all the optimization algorithms.

Comparison of the optimization algorithms
In the results on S3, the Subplex showed an especially better value compared with the results of the other optimization algorithms. The k-means objective in Eq (1) does not have a term to minimize the variance of prior probabilities. Therefore, our proposal finely tunes the mean positions for the main objective in Eq (14). Fig 2 shows the estimated Gaussians on the S-sets. The GMM generated fewer Gaussians than the designated number of clusters, as in Fig 2(A)-2(C); three Gaussians for S1 and S2 and seven Gaussians for S3 converged to the same positions as other Gaussians. It is considered that the number of unique Gaussians becomes smaller as the clustering samples become more complicated. In the codebook construction step, many local features, usually 100K-1M, are used as clustering samples. Therefore, this characteristic is a disadvantage: the number of unique visual words becomes less than the designated codebook size when generating a codebook. Specifically, some components of an image signature have the same trend due to the overlapping of Gaussians, or always become zero when using the approximation of the assignment probability, as in Eq (16). Consistent with Table 3, the objective value of our proposal becomes better as the samples have more spatial complexity. In addition, this suggests that our proposal is possibly better for the codebook construction.
The ranges of the main objective term and the regularization term are [0, 0.0035] and [0.001, 0.003] for the A-sets, and [0, 0.02] and [1.4, 2] for the S-sets. The minimum of the main term ideally becomes 0 when all prior probabilities $w_k$ have the same value 1/K. The regularization term never becomes 0 because a cluster consists of scattered samples. For the results on the A-sets in Fig 3(A)-3(C), the values of the regularization term decrease as the number of clusters increases because the dispersion of samples in each cluster becomes smaller in the order of A1, A2, and A3 in Fig 1.
For the S-sets in Fig 3(D)-3(F), the values of the regularization term increase in the order of S1, S2, and S3 because of the increase in spatial complexity.

Effect of the weighting factor
As shown in Fig 3(A) and 3(B), a larger weighting factor probably increases the main objective term, whose value needs to be small. We consider that a relatively small weighting factor ($\lambda < 10^{-7}$) works correctly, especially in Fig 3(A). Over the whole of Fig 3(A)-3(F), there was no clear trend of the main objective with respect to the weighting factor. The tendency of the regularization term is relatively intuitive: in particular for the soft objective, the quantization error decreases as the weighting factor increases.

Experiments with image databases
This section evaluates our proposal on image recognition tasks with the following image datasets: Birds [34] and Butterflies [35] provided by Ponce Group.
The Birds dataset consists of 600 images categorized into six bird species, where each category has 100 images. The Butterflies dataset has 619 images of seven different butterflies. Each category has about 40 to 130 images. The above two datasets are composed of visually similar images.
In the experiments with the above datasets, we used the same parameter setup except for the numbers of training images used to construct a codebook and a discriminant model. We used SURF [5] as the local feature framework. To extract SURF features, we followed the dense sampling strategy [36], in which SURF features were described at the intersection points of a lattice with six-pixel intervals, with multiple scale regions of 16, 20, 24, and 28 pixels for each point; each image was resized so that its long side was 300 pixels. Each SURF feature was projected to an 8-dimensional space by Principal Component Analysis before constructing a codebook and encoding an image feature [37].
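The dense sampling layout described above can be sketched without any image library. The example only generates the lattice points and scales; the SURF description and the PCA projection are not shown. It assumes that the lattice starts at the image origin with no border margin and that the resized dimensions are rounded to the nearest pixel. The function names are hypothetical.

```python
def resized_shape(width, height, long_side=300):
    """Scale the image so that its longer side equals long_side pixels."""
    scale = long_side / max(width, height)
    return round(width * scale), round(height * scale)


def dense_keypoints(width, height, step=6, scales=(16, 20, 24, 28)):
    """(x, y, scale) triples on a regular lattice; every point gets every scale."""
    return [
        (x, y, s)
        for y in range(0, height, step)
        for x in range(0, width, step)
        for s in scales
    ]


width, height = resized_shape(640, 480)
keypoints = dense_keypoints(width, height)
print(width, height, len(keypoints))
```

For a 640 x 480 input this gives a 300 x 225 image with a 50 x 38 lattice and four scales per point, which illustrates why even a few images already yield a large number of clustering samples.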
To construct a codebook, the clustering samples were the SURF features extracted from 10 images per category for the Birds and 5 images per category for the Butterflies, which corresponds to about 10% of the number of images in the smallest category of each dataset. Five codebook sizes, K = {16, 32, 64, 128, 256}, were used. The termination criterion for the k-means and the GMM was set to 30 iterations because they sometimes do not converge. For our proposal, the termination criterion was set to 2,000 evaluations of the objective function. The Gaussians of the GMM and of our proposal with the soft objective were assumed to have diagonal covariances. The weighting factor of our objective was set to $\lambda = 10^{-9}$. The k-means and ours with the hard objective were used for the VLAD encoding, and the GMM and ours with the soft objective were used for the FV encoding. Here, the dimensionality of image signatures depends on the experimental setting, for example, the number K of clusters and the dimensionality d of the local features. As introduced in the literature review section, the dimensionality becomes Kd for the VLAD and K(2d + 1) for the FV. For example, the VLAD and the FV have 2,048 and 4,352 features, respectively, when K = 256 and d = 8.
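The signature dimensionalities follow directly from the formulas Kd and K(2d + 1). As a quick check with d = 8, the largest codebook K = 256 gives 2,048 VLAD features and 4,352 FV features, while K = 32 gives 256 and 544. The helper names below are illustrative.

```python
def vlad_dim(K, d):
    """Dimensionality of a VLAD signature: one d-dimensional residual per cluster."""
    return K * d


def fv_dim(K, d):
    """Dimensionality of an FV signature: frequency, mean and covariance per Gaussian."""
    return K * (2 * d + 1)


d = 8                                            # PCA-reduced SURF dimensionality
for K in (16, 32, 64, 128, 256):
    print(K, vlad_dim(K, d), fv_dim(K, d))
```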
The SVM with the linear kernel, implemented in [26], was used as the discriminant model. The number of training images for each category was {30, 40, 50} for the Birds and {20, 30, 40} for the Butterflies. The training images were randomly selected, and the remaining images were used for the test. The recognition accuracy was the ratio of the number of correctly recognized images to the number of test images. We report the average over five different sets of training and test images. In Fig 4, the baseline (the VLAD with the k-means codebook) and the VLAD with our hard objective showed similar performances regardless of parameters such as the number of training images and the codebook size. As discussed in the numerical analysis section, the hard objective mainly serves to finely tune the mean positions, so the k-means and our hard objective clustering potentially construct similar codebooks. Table 6 shows the objective values of the codebooks used in Fig 4. When the codebook size is not greater than 64, the hard objective showed significantly better objective values compared with the k-means. However, when the codebook size is greater than or equal to 64, they showed almost the same objective values. The k-means can construct suitable clusters from the perspective of the variance of prior probabilities, regardless of the size of the clustering sample set or the codebook size, as shown in Fig 4. The hard objective might have difficulty in effectively optimizing a codebook for a large clustering sample set or a large codebook size, as discussed in the qualitative comparison in the numerical analysis section.
On the other hand, the FV with our soft objective often showed better performance compared with the FV with the GMM codebook, especially when the codebook size is 128. When the codebook size was small (K = 16 and K = 32), there was no significant difference between the recognition performances of the baseline and the FV with the soft objective. For the larger codebook sizes, the FV with the soft objective achieved better accuracies. Moreover, our soft objective with a relatively large codebook size was more effective when the training image set was small compared with the test image set. The highest mean recognition accuracy was achieved when the codebook size was 64, 128, and 128 for 30, 40, and 50 training images per category, respectively. Therefore, an increase in the codebook size does not necessarily improve recognition performance; a codebook size of K = 64 or K = 128 might be enough for the Birds dataset. Table 7 shows the objective values of the codebooks used in Fig 5. In contrast to the trend of the objective values of the hard objective, the soft objective maintained better values, as shown in Table 7, even when the codebook size was increased. As discussed in the numerical analysis, the soft objective is able to construct a suitable codebook, from the perspective of the variance of prior probabilities, even in image recognition tasks. When comprehensively comparing the results of the VLADs in Table 4 and the FVs in Table 5, the FV with our soft objective (K = 64) showed the best accuracy of 68.71 when there were 30 training images for each category. The FV with ours (K = 128) also showed the best accuracies of 71.56 for 40 training images and 74.13 for 50 training images. Figs 6 and 7 respectively show the average recognition accuracies of the VLAD and the FV on the Butterflies dataset. Tables 8 and 9 show the detailed values (mean accuracy and standard deviation over the five trials) corresponding to Figs 6 and 7.
From the results in Fig 6, the hard objective may deteriorate recognition performance when the codebook size is smaller than or equal to 64. In addition, the objective values of the hard objective, shown in Table 10, were not sufficiently optimized, as in the case of the Birds dataset. For the results with the FV, the GMM and the soft objective showed similar performances when the codebook size is small. As in the numerical analysis, a smaller codebook size has less influence on the convergence of the Gaussians, whereas the GMM tends to converge Gaussians to the same positions when the clustering samples have a spatially complicated distribution and the codebook size is large. The soft objective, however, clearly improved recognition performance when the codebook size was larger than 32, for all numbers of training images per category, including when the codebook size was 256. Table 11 shows the objective values of the soft objective with respect to the codebook size, and these values suggest that our framework is able to estimate proper Gaussians regardless of the codebook size. When comparing the results of the VLADs in Table 8 and the FVs in Table 9, the VLAD with the k-means (K = 256) showed the best accuracy: 87.93 for 20 training images and 90.27 for 30 training images. On the other hand, for 40 training images, the FV with ours (K = 256) showed the best accuracy of 91.33.

Conclusions
This paper focussed on clustering from the perspective of the variance of prior probabilities and presented clustering frameworks, namely the hard and soft objectives, that are respectively alternatives to basic approaches such as the k-means and the GMM. In the numerical analysis, four optimization frameworks were evaluated with synthetic clustering datasets. The results of all of the frameworks were better than those of the basic clusterings. In particular, the Subplex optimizer gave better objective values from the perspective of the variance of prior probabilities and constructed intuitively appropriate clusters for clustering samples with complicated distributions. In the experiments with image datasets, the hard objective was probably not effective for the VLAD encoding because its objective values became worse compared with the k-means results as the number of clusters increased. On the other hand, the FV encoding with the soft objective showed improvements in recognition performance regardless of parameters such as the codebook size and the ratio of training to test images.