Using an Improved SIFT Algorithm and Fuzzy Closed-Loop Control Strategy for Object Recognition in Cluttered Scenes

Partial occlusions, large pose variations, and extreme ambient illumination conditions generally degrade the performance of object recognition systems. Therefore, this paper presents a novel approach for fast and robust object recognition in cluttered scenes based on an improved scale invariant feature transform (SIFT) algorithm and a fuzzy closed-loop control method. First, a fast SIFT algorithm is proposed by classifying SIFT features into several clusters based on several attributes computed from the sub-orientation histogram (SOH); in the feature matching phase, only features that share nearly the same corresponding attributes are compared. Second, feature matching is performed following a prioritized order based on the scale factor, which is calculated between the object image and the target object image, guaranteeing robust feature matching. Finally, a fuzzy closed-loop control strategy is applied to increase the accuracy of the object recognition, which is essential for autonomous object manipulation. Compared to the original SIFT algorithm for object recognition, the proposed method significantly increases the number of SIFT features extracted from an object, and the computing speed of the object recognition process increases by more than 40%. The experimental results confirmed that the proposed method performs effectively and accurately in cluttered scenes.


Introduction
Object recognition has become one of the most active research topics in the fields of computer vision and pattern recognition because of its potential value in practical applications. Many novel methods have been proposed in the literature for object recognition, and they can be broadly classified into two main categories: holistic methods and local feature-based methods. The holistic methods attempt to recognize the object as a whole: the query image is acquired, pre-processed, and segmented, global features are extracted, and finally statistical classification techniques are applied. This class of algorithms is especially suited to homogeneous objects, which can be easily segmented. Typical holistic methods can be found in [1][2][3][4]. The holistic methods are simple and fast, but their recognition performance is limited under changes in illumination and pose. Local feature-based methods, in contrast, are better suited to textured objects and are more robust with respect to variations in viewpoint and illumination. They are based on the idea of representing an object by a collection of local invariant patches. Generally, local feature-based methods involve the following steps. First, salient points, which are typically corners or blob-like shapes, are extracted from the image to be matched. Second, descriptors are constructed from regions around the salient key points using mechanisms that aim to keep the characteristics of these regions insensitive to viewpoint and illumination changes and invariant to rotation, scaling and affine transformations. Finally, correspondence points between the query and model images are computed based on the extracted features. From the matched points, an affine transformation between the query and model images can be computed using a fitting method, such as the least squares method or the random sample consensus (RANSAC) method [5].
The matching process is then iteratively refined by removing the correspondence points that do not fit this affine transformation. The idea originates from the work of Schmid and Mohr [6], whereby the centers of patches are located at points of interest and are invariant under rotation. Typical local feature-based methods can be found in [7][8][9][10]. Lowe developed an efficient object recognition approach based on the scale invariant feature transform (SIFT) [11].
The SIFT algorithm, proposed by Lowe, is one of the most widely used local feature-based methods for object recognition and is useful in a wide range of computer vision tasks. The algorithm attempts to detect similar feature points in each of the available images and subsequently describes these points with a feature vector, which is invariant to scale and rotation and is partially invariant to illumination and viewpoint changes. In addition to these properties, SIFT features are highly distinctive and relatively easy to extract and match against large databases of local features. However, the main drawback of the SIFT algorithm is that its computational complexity increases rapidly with an increasing number of key points, especially during the matching step, because of the high dimensionality of the SIFT feature descriptor. To overcome this drawback, various modifications have been proposed. In general, strategies addressing the acceleration of SIFT feature matching can be classified into three categories: reducing the descriptor dimensionality [12], [13], using parallelization and exploiting the power of hardware [14][15][16], and improving feature matching algorithms [17][18][19].
Traditional local feature-based object recognition methods are open-loop methods, which means that the result of each step depends on the result of the previous step. Therefore, errors are accumulated over the entire recognition system and propagated to the final step. Hence, the final result tends to be error prone and unreliable. This problem is usually solved using closed-loop control techniques. Because the method is non-linear and no mathematical model is available, a fuzzy control strategy is used. Ever since fuzzy set theory was used to synthesize a fuzzy logic controller for a simple dynamic process by Mamdani and Assilian [20], fuzzy logic control has become one of the most successful applications of the theory. An important application of a fuzzy-knowledge-based system is the control of complex, nonlinear systems [21]. Control algorithms with fuzzy controllers offer better response and efficiency in the case of complex nonlinear systems when compared to conventional controllers [22], [23]. The basic difference between fuzzy and conventional controllers is that the latter are designed using a mathematical model of the process being controlled. In contrast, fuzzy controllers are based on the synthesis of prior knowledge, which is provided by human expertise to construct a set of rules in the form of IF-THEN statements [24]. Typically, the design of a fuzzy controller is based on expert control experience [25] or on a self-learning process [26], both of which require human experience to design fuzzy control systems that demonstrate good performance.
In this paper, we propose an improved SIFT algorithm and a fuzzy closed-loop control strategy for object recognition in cluttered scenes. A fast SIFT algorithm is proposed by classifying SIFT features into several clusters based on several attributes computed from the SIFT orientation histogram; in the feature matching step, only features that share nearly the same corresponding attributes are compared. Feature matching is performed following a prioritized order based on the scale factor, which is calculated between the object image and the target image, to guarantee robust feature matching. A fuzzy closed-loop control strategy based on SIFT features is applied to increase the invariance to affine transformations and thereby the quality of the matching results, which is essential for autonomous object manipulation. To evaluate the improved SIFT algorithm, the proposed method was compared with two algorithms for approximate nearest neighbor (ANN) search: the hierarchical k-means tree (HKMT) [27] and randomized KD-trees (RKDTs) [28]. The hierarchical k-means tree is constructed by splitting the data points at each level into K distinct regions using k-means clustering and then applying the same method recursively to the points in each region. The recursion stops when the number of points in a region is less than K. The randomized KD-trees are built by choosing the split dimension randomly from the first D dimensions on which the data have the greatest variance. When searching the trees, a single priority queue is maintained across all the randomized trees such that the search can be ordered by increasing distance to each bin boundary. The degree of approximation is determined by examining a fixed number of leaf nodes, at which point the search is terminated and the best candidates are returned. The presented experimental results indicate that the proposed method outperforms the two other considered algorithms.
Additionally, several images from a standard image database and from real-world stereo images were tested under different conditions, including changes in viewpoint, partial occlusion, pose, and illumination during image acquisition. The experimental results confirmed that the proposed method performs effectively and accurately for object recognition in cluttered scenes.

Fast SIFT algorithm
Although impressive object recognition results were achieved early on using the SIFT algorithm, efficient object recognition under cluttered-scene conditions is still challenging. To achieve fast object recognition, a novel strategy to accelerate the SIFT feature matching process is introduced in this paper. The strategy is based on hashing SIFT features into several clusters during the feature extraction phase using new attributes computed from the SOH. First, in the key point detection stage, the key points are split into two types: Maxima and Minima. To speed up the feature matching process, it is assumed that some new independent angles can be assigned to each feature. These angles are invariant to changes in the viewing geometry and illumination, and they are computed from the sub-orientation histograms (SOHs) of the SIFT descriptor (SIFT-D). In the original SIFT algorithm, to compute the SIFT-D, the interest region around the key point is subdivided into sub-regions in a rectangular grid.
From each sub-region, an SOH is built [29]. Theoretically, a SIFT feature can be extended using a number of angles equal to the number of SOHs because these angles are to be calculated from the SOHs. In the case of 4x4 grids, the number of angles is 16, as shown in Fig. 2.
However, to speed up SIFT matching, these angles should be components of a multivariate random variable that is uniformly distributed in the 16-dimensional space $[-180°, 180°]^{16}$. To meet this requirement, the following two conditions must be verified. First, each angle must be uniformly distributed in [-180°, 180°] [30]. Second, the angles should be pair-wise independent [31]. The angles between the orientations corresponding to the vector sum of all bins of each SOH and the horizontal orientation are suggested as the SIFT feature angles. Mathematically, the proposed angles $\{\theta_{ij};\ i, j = 1, \dots, 4\}$ are calculated as follows:

$\theta_{ij} = \arctan\left(\frac{\sum_{k} mag_{ij}(k)\,\sin(ori_{ij}(k))}{\sum_{k} mag_{ij}(k)\,\cos(ori_{ij}(k))}\right)$

where $mag_{ij}(k)$ and $ori_{ij}(k)$ are the magnitude and orientation of the k-th bin of the ij-th SOH, respectively, and the sums run over all bins of that SOH. Because the angles $\theta_{ij}$ are computed from the SOHs from which the SIFT-D is built, they are invariant under geometrical and photometrical transformations. Of the 16 angles, four can be pair-wise independent, and only border angles can meet the equally likely condition. Therefore, the best choices are the corner angles, $\phi_1 = \theta_{11}$, $\phi_2 = \theta_{14}$, $\phi_3 = \theta_{41}$ and $\phi_4 = \theta_{44}$, which can be considered as new attributes of the SIFT feature. Compared with matching on the original 128-dimensional SIFT descriptor alone, the improved SIFT algorithm can lead to a significant decrease in computational time. At the matching stage, only features that share the same corresponding angles, and may therefore lead to correct matches, are compared. Among all possible matches, only a small number of correct matches exist. For each possible match, four angle differences $\{\Delta\phi_{11}, \Delta\phi_{22}, \Delta\phi_{33}, \Delta\phi_{44}\}$ can be constructed from the pair of SIFT features. Considering the angle differences as random variables, the behaviors of these random variables vary according to the type of match being analyzed. Fig. 3 shows the probability density function of the angle differences. For a false match, the two features are independent.
Therefore, the two corresponding angles of a false match are independent, because the four angles are uniformly distributed and pair-wise independent. In contrast, the corresponding angles of a correct match tend to be equal because the features of correct matches tend to have the same SIFT descriptors. Therefore, the four angle differences of a correct match tend to be concentrated at approximately 0°. By calculating the probability density function of the random variables, approximately 95% of the correct matches and only 15% of the false matches are found to fall within a narrow range around 0°. Thus far, we have extended a SIFT feature by adding 4 pair-wise-independent angles that are invariant to rotation, scale and illumination changes. During the extraction phase, the SIFT features are classified based on their angles into different clusters. Thus, in the matching phase, only SIFT features that belong to clusters from which correct matches may be expected are compared. The resulting fast SIFT algorithm improves the efficiency of the object recognition system.
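The corner angles and the angle-based pre-selection described in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a 128-dimensional descriptor stored as a 4x4 grid of SOHs with 8 bins each, bin orientations spaced 45° apart starting at 0°, and a hypothetical tolerance `tol_deg`; none of these values is taken from the paper. A full implementation would hash the features into clusters during extraction instead of testing pairs one by one.

```python
import numpy as np

N_BINS = 8  # assumed number of orientation bins per SOH
# Assumed bin orientations: 0, 45, ..., 315 degrees (in radians).
BIN_ORI = np.deg2rad(np.arange(N_BINS) * 360.0 / N_BINS)


def corner_angles(descriptor):
    """Return the four corner angles (phi1..phi4) in degrees.

    The descriptor is assumed to be laid out as 4 x 4 SOHs x 8 bins.
    Each angle is the orientation of the vector sum of all bins of one SOH.
    """
    soh = np.asarray(descriptor, dtype=float).reshape(4, 4, N_BINS)
    x = (soh * np.cos(BIN_ORI)).sum(axis=2)
    y = (soh * np.sin(BIN_ORI)).sum(axis=2)
    theta = np.degrees(np.arctan2(y, x))  # one angle per SOH
    return np.array([theta[0, 0], theta[0, 3], theta[3, 0], theta[3, 3]])


def angle_diff(a, b):
    """Difference a - b wrapped into [-180, 180) degrees."""
    return (np.asarray(a, dtype=float) - np.asarray(b, dtype=float) + 180.0) % 360.0 - 180.0


def is_candidate(phi_a, phi_b, tol_deg=30.0):
    """True if all four corner-angle differences are within +/- tol_deg.

    tol_deg is a hypothetical tolerance chosen for illustration only.
    """
    return bool(np.all(np.abs(angle_diff(phi_a, phi_b)) <= tol_deg))
```

Only pairs for which `is_candidate` returns true would then be passed on to the full descriptor comparison.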

Robust SIFT feature matching
Although SIFT features are reasonably invariant, they cannot accommodate large changes in viewpoint or extreme illumination conditions, which is the core problem of object recognition in cluttered scenes. This problem is caused by the absence of true positive correspondences or by their proportion being too small for fitting methods to work correctly. This paper introduces a new procedure to determine the scale factor between the object images to be recognized by dividing SIFT features into different subsets based on their octaves. Then, the matching process is performed following a prioritized order, whereby only features of the same scale ratio are compared in each step. Additionally, a scale ratio histogram (SRH) is constructed. Only matches of the step corresponding to the highest SRH bin are provided to the fitting method. This restriction decreases the proportion of outliers among positive matches, leading to an improved performance of the fitting method, here the random sample consensus (RANSAC) method. This strategy exhibits an increased matching performance and robustness with no additional computational time cost.
In general, the SIFT algorithm is a local image operator that takes an input image and transforms it into a collection of local features. To use the SIFT operator for object recognition purposes, it is applied to two images: a model image and a test image. The model image presents only the object taken under predefined conditions, whereas the test image is an image including the target object captured in cluttered scenes. Using the SIFT operator, the two object images are transformed into two SIFT image feature sets. These two feature sets are divided into subsets according to the octaves in which the features arise. To perform the newly proposed SIFT feature matching strategy, the feature subsets obtained are arranged so that a subset of the model image feature set is aligned to a subset of the test image feature set. The process of aligning the model image subsets with the test image subsets is performed in n+m-1 steps, where n and m are the total numbers of octaves (subsets) corresponding to the model image and test image, respectively. For each step, all pairs of aligned subsets must have the same scale ratio v, that is, the same ratio between the scales of the octaves of the two aligned subsets. The total number of positive matches within each step is indexed using the appropriate shift index k, which can be negative, positive or zero. The highest number of positive matches achieved determines the optimal shift index $k_{opt}$ and the consequent scale factor $S = 2^{k_{opt}}$. To realize the proposed procedure mathematically, a scale ratio histogram F(x) is defined as the sum of $R(M_1^i, M_2^j)$ over all pairs of subsets aligned in the step with index x, where $R(M_1^i, M_2^j)$ is the number of positive matches between the i-th subset $M_1^i$ of the model image feature set and the j-th subset $M_2^j$ of the test image feature set, and x is the modified shift index introduced for the sake of simplicity.
The scale ratio histogram F(k) obtains its maximum at the shift index $k_{opt} = \arg\max_k F(k) = 1$, which corresponds to the scale factor $S = 2^{k_{opt}}$. The optimal shift index defines a "domain of correct matches". All matches outside this domain, including positive matches, are excluded. The positive matches from the domain of correct matches are used to determine the affine transformation (rotation matrix and translation vector) between the two feature sets using the RANSAC method. Once the transformation is calculated, every match that is either positive or negative within the domain of correct matches is examined to determine if it meets the previously calculated transformation. If the match fulfills the transformation, it is labeled as correct; otherwise, it is labeled as a false match. Thus, this method can significantly reduce the number of false matches.
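The selection of the optimal shift index can be sketched in a few lines of Python. The sketch is illustrative only and rests on stated assumptions: positive matches are given as hypothetical `(model_octave, test_octave)` index pairs, and the shift index is taken as the difference `model_octave - test_octave`; the sign convention used in the paper may differ.

```python
from collections import Counter


def scale_ratio_histogram(matches):
    """Count positive matches per shift index k.

    matches: iterable of (model_octave, test_octave) pairs, one per positive
    match. The shift index k = model_octave - test_octave is an assumed
    convention for this sketch.
    """
    return Counter(i - j for i, j in matches)


def optimal_shift(matches):
    """Return (k_opt, S): the shift index of the highest histogram bin
    and the scale factor S = 2 ** k_opt."""
    hist = scale_ratio_histogram(matches)
    k_opt = max(hist, key=hist.get)
    return k_opt, 2.0 ** k_opt


def domain_of_correct_matches(matches, k_opt):
    """Keep only the matches whose shift index equals k_opt."""
    return [(i, j) for i, j in matches if i - j == k_opt]
```

For example, with the matches `[(1, 0), (2, 1), (3, 2), (0, 0), (1, 2)]` the shift indices are 1, 1, 1, 0 and -1, so the highest bin is k = 1 and only the first three matches would be passed to the fitting method.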
Among all the found matches, many correct matches yield a distance ratio that exceeds Lowe's threshold τ and are therefore rejected. To retrieve these correct matches, the ratio between the Euclidean distance to the nearest feature neighbor and the Euclidean distance to the second nearest feature neighbor must be reduced. Suppose that the feature $F_1^i$ from the model image feature set is correctly assigned to the feature $F_2^{j_0}$ from the test image feature set, and that $F_2^{j_1}$ is the second nearest feature to $F_1^i$. Reducing the ratio can be performed either by reducing the smallest distance $d_1(F_1^i, F_2^{j_0})$ or by increasing the next smallest distance $d_2(F_1^i, F_2^{j_1})$. In practice, the first alternative is impossible, whereas the enlargement of the next smallest distance can be achieved by limiting the search area for both the nearest feature and the next nearest feature to $F_1^i$ within a specified domain. Let $F_2^{j_2}$ be the second nearest feature to $F_1^i$ when the search is limited to the octave in which the feature $F_2^{j_0}$ is found. As shown in Fig. 5, because $d_3(F_1^i, F_2^{j_2}) \geq d_2(F_1^i, F_2^{j_1})$ always holds, the following is obtained:

$\frac{d_1(F_1^i, F_2^{j_0})}{d_3(F_1^i, F_2^{j_2})} \leq \frac{d_1(F_1^i, F_2^{j_0})}{d_2(F_1^i, F_2^{j_1})}$

Thus, by reducing the search area, the ratio related to the feature $F_1^i$ can be decreased and can fall below the threshold τ. The proposed method improves the robustness of object feature matching for object recognition in a cluttered background.
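The effect of restricting the search for the second nearest neighbor can be illustrated with a small Python sketch. It is a simplified illustration under assumptions: descriptors are plain NumPy vectors, `octaves` is a hypothetical array giving the octave of each test feature, and the default threshold `tau=0.8` is an assumed value commonly used with Lowe's ratio test, not one reported in this paper.

```python
import numpy as np


def ratio_test(query, candidates, octaves=None, tau=0.8):
    """Ratio test with an optionally restricted second-neighbor search.

    Returns (index of nearest candidate, distance ratio, accepted).
    If octaves is given, the second nearest neighbor is searched only among
    candidates in the same octave as the nearest neighbor.
    """
    query = np.asarray(query, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    dists = np.linalg.norm(candidates - query, axis=1)
    nearest = int(np.argmin(dists))
    if octaves is None:
        allowed = np.ones(len(dists), dtype=bool)
    else:
        octaves = np.asarray(octaves)
        allowed = octaves == octaves[nearest]
    allowed[nearest] = False  # the nearest neighbor cannot be its own runner-up
    # With no runner-up in the domain, the ratio is treated as 0 (assumption).
    second = dists[allowed].min() if allowed.any() else np.inf
    ratio = float(dists[nearest] / second) if second > 0 else 1.0
    return nearest, ratio, ratio < tau
```

With a query at the origin and candidates at distances 1.0, 1.1 and 3.0, the unrestricted ratio is about 0.91 and the match is rejected. If the candidate at distance 1.1 lies in a different octave than the nearest one, the restricted ratio drops to about 0.33 and the match is accepted.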

A fuzzy closed-loop control strategy for object recognition based on SIFT features
A fuzzy closed-loop control approach is proposed for object recognition based on SIFT features. This approach uses the benefits of a fuzzy closed-loop structure to increase the invariance to affine transformations and, consequently, the quality of the matching process, which is essential for autonomous object manipulation in cluttered scenes. Fig. 6 presents the proposed closed-loop control system.
The idea of this approach is to extract two independent parallel feature streams (Maxima and Minima SIFT features) from both the model and test images and then match them to features belonging to the corresponding streams to estimate two independent affine transformations. The dissimilarity between these transformations is used as a feedback variable to observe and control the matching process. If this variable is larger than a certain threshold, one of the transformations is selected using a fuzzy controller to warp the model image. The procedure is repeated until the two transformations become similar or until one of them converges to the identity matrix.
This closed-loop principle is used to improve the quality and the quantity of the matches, which enhances the efficiency of the object recognition system. To close the loop, a quantitative measurement should be defined to describe the quality of the matching result and to modify the input of the image matching process so that its output improves when the matching result is not accepted. Such a measurement can be defined because the SIFT feature locations are detected by identifying the Maxima and Minima of the difference-of-Gaussian (DoG) scale space.
Each set of the SIFT features for the test image GF test and for the model image GF model is divided into two subsets: one for the Maxima SIFT features and the other for the Minima SIFT features.
By matching the Maxima SIFT features with Maxima and the Minima SIFT features with Minima, two independent sets of positive matches GM max and GM min are obtained.
From these sets of positive matches, two independent affine transformations (Maxima and Minima affine transformations) can be estimated using the RANSAC algorithm.
The next step is to calculate the dissimilarity between the two affine transformations. Because at least three non-collinear corresponding points between two images are required to determine the affine transformation, at least three non-collinear points are required to compute the dissimilarity between two affine transformations $T_1$ and $T_2$. Assuming that $p_1(a, a)$, $p_2(a, -a)$, and $p_3(-a, a)$ are three non-collinear points in the xy plane, where a is an arbitrary value, each of these points is mapped by each affine transformation: $p_i' = T_1(p_i)$ and $p_i'' = T_2(p_i)$, where i = 1,2,3. Hence, the dissimilarity $Dis(T_1, T_2)$ is defined in terms of the distances $d(p_i', p_i'')$ between the corresponding mapped points, where $d(p_1, p_2)$ is the Euclidean distance between two points $p_1(x_1, y_1)$ and $p_2(x_2, y_2)$ and is computed as follows:

$d(p_1, p_2) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$

The dissimilarity between these transformations is used as a signal indicating the matching quality. The transformations are fed back to a controller to improve the matching result. Because there is no available mathematical model for the system, a fuzzy controller is used. The dissimilarities between the identity matrix and each of the affine transformations are delivered to the fuzzy controller. The task of the controller is to select the best transformation to produce a new model image to be used in the next matching iteration as long as the termination criterion is not met. For each channel (Maxima and Minima) of the object recognition system, the error $e_{max/min}$, which is computed according to equation 12, and the error derivative $\Delta e_{max/min}$ are chosen as inputs:

$e_{max/min} = Dis(T_{max/min}, I)$

where I is the identity transformation. The output is defined as a quality index, which is a real value in the range [0,1] representing how correct the corresponding affine transformation estimation is. The fuzzy controller consists of three main stages: the formation of membership functions, the definition and evaluation of fuzzy rules, and the selection of the defuzzification method. In the proposed method, a triangular shape is selected for the membership functions, as shown in Fig. 7.
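The dissimilarity between two affine transformations described above can be sketched in Python as follows. The sketch assumes that each transformation is given as a 2x3 matrix (linear part and translation vector) and that the three point distances are combined by summation; the exact aggregation used in the paper is not reproduced here, so the sum is an assumption for illustration.

```python
import numpy as np

# Identity transformation as a 2x3 matrix [A | t] (assumed representation).
IDENTITY = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])


def apply_affine(T, points):
    """Map an (n, 2) array of points with the 2x3 affine matrix T."""
    T = np.asarray(T, dtype=float)
    return points @ T[:, :2].T + T[:, 2]


def dissimilarity(T1, T2, a=1.0):
    """Sum of Euclidean distances between the images of three
    non-collinear points under T1 and T2 (summation is an assumption)."""
    pts = np.array([[a, a], [a, -a], [-a, a]], dtype=float)
    diff = apply_affine(T1, pts) - apply_affine(T2, pts)
    return float(np.linalg.norm(diff, axis=1).sum())
```

For instance, a pure translation by (3, 4) moves every point by a distance of 5, so its dissimilarity to `IDENTITY` under this definition is 15, and the error input of one channel would be obtained as `dissimilarity(T_channel, IDENTITY)`.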
The fuzzy rules in a linguistic form are shown in Table 1. Knowledge is interpreted using IF-THEN rules, and multiple statements are joined by the AND connective. The centroid area method is used for the defuzzification process. In this method, the resultant membership function is developed by considering the union of the outputs of each rule, which means that the overlapping area of the fuzzy output sets is counted only once. The center of gravity of the resulting shape is obtained by the following equation:

$z^* = \frac{\int \mu(z)\, z\, dz}{\int \mu(z)\, dz}$

where $\mu(z)$ is the aggregated output membership function and $z^*$ is the crisp output. The proposed controller is based on fuzzy expert rules and uses triangular membership functions for fuzzification, max/min operators for inference, and the centroid area method for defuzzification. The method has been verified through several experiments, which are reported in the Results section.
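A controller of this kind (triangular membership functions, min for the AND connective, max for aggregation, centroid defuzzification) can be sketched in Python as follows. The membership parameters and the four rules below are hypothetical placeholders chosen for illustration; they are not the rules of Table 1. Both inputs are assumed to be normalized to [0, 1].

```python
import numpy as np


def tri(x, a, b, c):
    """Triangular membership with feet a and c and peak b.
    a == b or b == c yields a shoulder on that side."""
    x = np.asarray(x, dtype=float)
    left = (x - a) / (b - a) if b > a else np.ones_like(x)
    right = (c - x) / (c - b) if c > b else np.ones_like(x)
    return np.clip(np.minimum(left, right), 0.0, 1.0)


# Hypothetical linguistic terms on [0, 1].
TERMS = {"low": (0.0, 0.0, 0.5), "med": (0.0, 0.5, 1.0), "high": (0.5, 1.0, 1.0)}

# Hypothetical rules: IF e is A AND de is B THEN quality is C.
RULES = [("low", "low", "high"), ("low", "high", "med"),
         ("high", "low", "med"), ("high", "high", "low")]


def quality_index(e, de, n=1001):
    """Quality index in [0, 1] from the error e and its derivative de."""
    z = np.linspace(0.0, 1.0, n)
    agg = np.zeros_like(z)
    for e_term, de_term, out_term in RULES:
        # AND connective: minimum of the two input memberships.
        strength = min(float(tri(e, *TERMS[e_term])), float(tri(de, *TERMS[de_term])))
        # Clip the output set and aggregate with max (union, overlap counted once).
        agg = np.maximum(agg, np.minimum(strength, tri(z, *TERMS[out_term])))
    total = agg.sum()
    # Discrete approximation of the centroid; 0.5 if no rule fires.
    return float((agg * z).sum() / total) if total > 0 else 0.5
```

With these placeholder rules, a small error with a small derivative gives a quality index of about 0.83, whereas a large error with a large derivative gives about 0.17.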

Results
To evaluate the performance of the proposed method, several experiments were conducted on different pairs of images from the standard LEAR image database [32] and from real-world images. The first experiment was performed to investigate the difference between the original SIFT algorithm and the proposed optimized SIFT algorithm. The performance of the improved SIFT algorithm was compared with the performance of two algorithms: HKMT and RKDTs. The comparisons were performed using the fast library for approximate nearest neighbors (FLANN). In the experiment, SIFT features were extracted from images in the LEAR database; subsequently, each pair of corresponding images was matched using the HKMT, RKDT and speedy SIFT algorithms under different degrees of precision. The matching step trades off matching speedup against matching accuracy, and the experimental results are shown in Fig. 8. The precision degree is defined as the ratio between the number of correct matches returned by the considered algorithm and the number returned by the exhaustive search, whereas the speedup factor is defined as the ratio between the exhaustive matching time and the matching time of the corresponding algorithm. As shown in Fig. 8, the speedy SIFT algorithm outperforms the other two considered algorithms in speeding up feature matching for all precision degrees. For a precision of approximately 95%, the speedy SIFT algorithm obtains a speedup factor of approximately 1250.

Table 1. Fuzzy-expert rules in linguistic form.
In the second experiment, the proposed object recognition system was compared with the original SIFT-based method under the cluttered scene condition. A variety of images photographed by a regular digital camera were used in the experiments. The input of the system is a model image in which only a single target object is present and a test image that includes the target object captured under the cluttered scene condition. The recognition process was performed using an Intel Core2 2.6-GHz processor with images of size 1024×768 pixels. The comparison results for coffee cup recognition in the cluttered scene are illustrated in Fig. 9. Tables 2 and 3 compare the SIFT feature matching results of the original SIFT algorithm and the improved SIFT algorithm. Table 2 shows the improvement in the robustness of the feature matching process: the number of correct SIFT features is significantly increased. Table 3 presents the computational matching time of the proposed SIFT approach and of the original SIFT approach. Compared with the original SIFT algorithm, a 40% reduction in processing time was achieved. In these experiments, the appearance of objects in the test images differs from their appearance in the model images because of different conditions, such as illumination during image acquisition, viewpoint, partial occlusion, and rotation. The advantage of the proposed recognition technique over the original SIFT matching technique is evident.

Discussion
Most applications of object recognition, such as face recognition systems and robotic vision, require efficient performance. Conventional methods are time consuming, and their performance drops significantly under partial occlusions, large pose variations, and extreme ambient illumination conditions. Thus, this paper proposed a method for fast object recognition in cluttered scenes based on an improved SIFT algorithm and a fuzzy closed-loop control strategy. The proposed method is highly distinctive and significantly speeds up object feature matching by dividing the SIFT features into several clusters, restricting the matching based on the scale factor, and decreasing the proportion of outliers among positive matches, which leads to an improvement in the robustness of the object recognition system in a cluttered background.
Possible directions for future work on the proposed methods are as follows. First, the original SIFT algorithm contains some simple but computationally intensive operations, such as Gaussian filtering and the detection of scale-space extrema, and the proposed fast SIFT algorithm is based on dividing features into several subsets. Therefore, the feature matching process can be parallelized so that it can be adapted to parallel computation and implemented with a hardware pipeline in a field programmable gate array (FPGA). Achieving an on-chip architecture for the SIFT algorithm would be a novel way to obtain an on-chip hardware and software co-design, which gives users the flexibility to customize the SIFT feature descriptors according to the needs of the object recognition application. Second, another main application of the fuzzy closed-loop control strategy is camera calibration. Camera calibration is the estimation of a camera's intrinsic, extrinsic, and lens-distortion parameters. Typical uses of a calibrated camera are correcting optical distortion artifacts, estimating the distance of an object from a camera, measuring the size of objects in an image, and constructing 3D views for augmented reality systems. Camera calibration is thus an important potential application in computer vision tasks.