Texture Classification by Texton: Statistical versus Binary

Statistical textons have recently shown great success in texture classification. Maximal response 8 (Statistical_MR8), image patch (Statistical_Joint) and locally invariant fractal (Statistical_Fractal) are typical statistical texton algorithms and state-of-the-art texture classification methods. However, these methods have two limitations. First, they need a training stage to build a texton library, so the recognition accuracy depends heavily on the training samples. Second, during feature extraction, each local feature is assigned to a texton by searching for the nearest texton in the whole library, which is time consuming when the library is large and the feature dimension is high. To address these two issues, this paper proposes three binary texton counterparts: Binary_MR8, Binary_Joint, and Binary_Fractal. These methods require no training step and encode local features into a binary representation directly. Experimental results on the CUReT, UIUC and KTH-TIPS databases show that binary textons achieve sound results with fast feature extraction, especially when the image size is not large and the image quality is not poor.


Introduction
Texture analysis is an active and fundamental research topic in the fields of computer vision and pattern recognition. Generally speaking, there are four basic problems in texture analysis: classifying images based on texture content; segmenting an image into regions of homogeneous texture; synthesizing textures for graphics applications; and establishing shape information from texture cues [1]. Texture classification has been widely studied because of many potential applications, including fabrics inspection [2], remote sensing [3], and medical image analysis [4].
Early texture classification methods focus on the statistical analysis of texture images. Representative ones include the co-occurrence matrix method [5] and the filtering based method [6]. These methods can achieve good classification results if the training and testing samples are captured at similar orientations. To address rotation invariance, some model-based methods were proposed, such as the circular autoregressive model [7], the multiresolution autoregressive model [8], the hidden Markov model [9], and the Gaussian Markov random field [10]. Recently, scale and affine invariance have received extensive attention, and some algorithms were developed to address them, such as the fractal transform [11] and local phase information [12].
In fact, classifying texture images taken under arbitrary viewing and illumination conditions is a difficult task. Statistically representing local image features has achieved great success on this problem [13]. Two paradigms for image representation have been proposed: Signature (representative descriptors of an image [13]) and Statistical Texton (''the putative units of pre-attentive human texture perception'' [30], which has no precise definition). In the former paradigm, an image is represented by signatures which are adaptively extracted from each image [14], and the Earth mover's distance [15] is used to compare different images; in the latter paradigm, an image is modeled by a texton histogram over a dictionary of textons [16][17][18][19][20][21][22], and a histogram dissimilarity (usually chi-square) is used for histogram comparison [23][24].
Statistical texton based methods are simple to implement and can achieve good performance on texture image classification [16][17][18][19][20][21][22]. However, these methods suffer from two disadvantages. First, they require an offline step to learn a texton dictionary from training samples, so the recognition accuracy depends on the training samples. Second, to build the histogram, one needs to search the dictionary for the nearest texton at each pixel. This step is time consuming, especially when the feature dimension is high and the dictionary is large.
Local binary pattern (LBP) is a simple and efficient operator which labels the pixels of an image by thresholding the neighborhood of each pixel and treating the result as a binary number [25]; it is not affected by the two issues above. Inspired by the idea of LBP, this paper proposes three binary texton methods: Binary_MR8, Binary_Joint, and Binary_Fractal. These methods do not require learning and are fast in feature extraction. They can be regarded as the counterparts of the three state-of-the-art statistical texton methods, Statistical_MR8 [16][17], Statistical_Joint [18][19] and Statistical_Fractal [20]. Our previous work showed that Binary_MR8 could get better accuracy than Statistical_MR8 on the CUReT database [26]. This paper extends that work by proposing two new binary texton methods, Binary_Joint and Binary_Fractal, and by reporting more comprehensive experiments on three databases: the CUReT database [27], the UIUC database [14] and the KTH-TIPS database [28].
The rest of the paper is organized as follows. Section 2 introduces three statistical texton methods. Section 3 presents the three proposed binary texton methods and the dissimilarity metric. Section 4 reports the experimental results on three texture databases. Section 5 gives the conclusion and provides a suggestion for future work.

Review of Statistical_MR8
The Statistical_MR8 filter bank [16][17] consists of 38 filters, which are shown in Fig. 1. To achieve rotation invariance, the filters are implemented at multiple orientations and on multiple scales. On each scale, only the maximal response among the different orientations is kept. The final response at each pixel is an 8-dimensional feature vector (3 scales for the edge and bar filters, plus 2 isotropic filters).
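The maximum-response step of Statistical_MR8 (keeping, on each scale, only the largest response across orientations) can be sketched as follows. This is a minimal illustration, not the implementation of [16][17]: the function name `mr8_responses`, the array layout, the channel order, and the use of the signed (rather than absolute) maximum are assumptions made here, and the 6 orientations per scale are inferred from the 38-filter count.

```python
import numpy as np

def mr8_responses(edge, bar, gaussian, laplacian):
    """Collapse 38 filter responses into an 8-channel response (assumed layout).

    edge, bar           : arrays of shape (3, 6, H, W), i.e. 3 scales x 6 orientations
    gaussian, laplacian : arrays of shape (H, W), the 2 isotropic filter responses
    """
    edge_max = edge.max(axis=1)  # (3, H, W): largest response over orientations, per scale
    bar_max = bar.max(axis=1)    # (3, H, W)
    # Assumed channel order: 3 edge scales, 3 bar scales, then the 2 isotropic filters.
    return np.concatenate([edge_max, bar_max, gaussian[None], laplacian[None]], axis=0)
```

Each pixel of the returned array then carries the 8-dimensional feature vector used in the remainder of this section.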
During dictionary learning, n images are selected for each texture class and the filter responses to all these images are aggregated; c texton cluster centres are then computed using the standard K-Means algorithm [29]. The textons learnt for each texture are collected into a single dictionary (n*c). For a given image, the 8-dimensional feature vector of each pixel is compared with the dictionary, and the pixel is labeled with the closest texton. Finally, the frequency of occurrence of every texton in the dictionary is accumulated as the histogram feature of the image.

Review of Statistical_Fractal
To address scale and affine variation, a fractal feature based on the Statistical_MR8 filter bank was proposed [20]. Given an image point (x, y) and its 8-dimensional filter response from the Statistical_MR8 filter bank, $f(x,y) = [f_1(x,y), f_2(x,y), \ldots, f_8(x,y)]^T$, the fractal feature is computed under the following assumption: given a suitable measure $\mu$, the ''size'' of local point sets in textured images follows a local power law,

$$\mu(B_i(x,y,r)) \propto r^{D_i(x,y)} \qquad (1)$$

where $\mu(B_i(x,y,r))$ is the sum of all pixel filter responses of the i-th dimension (i = 1, 2, …, 8) that lie within a closed disk B of radius r centered at the image point (x, y). $D_i(x,y)$ (slope of $\log \mu$ versus $\log r$) and $L_i(x,y)$ (intercept of $\log \mu$ versus $\log r$) are computed by least squares estimation. The former is the local fractal dimension and is invariant to scale changes, while the latter is the local fractal length and is rotation invariant only [20]. Fig. 2 illustrates an example of computing D and L. Two new 8-dimensional features, D(x, y) and L(x, y), are thus computed for each pixel, after which the same dictionary learning and feature extraction procedure as in Statistical_MR8 is used. In the following, Statistical_Fractal_D and Statistical_Fractal_L denote the texton histograms of D and L, respectively.
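The least squares estimation of the local fractal dimension and length can be illustrated with a short sketch. It assumes a non-negative measure image (for example, the absolute value of one filter response channel), a small hypothetical set of radii, and a point far enough from the image border; the function name `local_fractal` is introduced here for illustration only.

```python
import numpy as np

def local_fractal(measure, x, y, radii=(1, 2, 3, 4, 5)):
    """Fit log(disk sum) = D * log(r) + L at pixel (x, y) by least squares.

    measure : 2-D array of non-negative values (assumption: strictly positive disk sums)
    returns : (D, L), the slope and intercept of the log-log fit
    """
    h, w = measure.shape
    yy, xx = np.ogrid[:h, :w]
    dist2 = (xx - x) ** 2 + (yy - y) ** 2
    # Sum of the measure inside the closed disk of each radius.
    sums = np.array([measure[dist2 <= r * r].sum() for r in radii])
    slope, intercept = np.polyfit(np.log(radii), np.log(sums), 1)
    return float(slope), float(intercept)
```

Under this sketch, multiplying the measure by a constant leaves D unchanged and only shifts L by the logarithm of that constant.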

Review of Statistical_Joint
Instead of applying filters to gray-level images, Statistical_Joint uses the multi-dimensional intensity (gray) values of a local patch as the texton feature [18][19]. For a point x, an r*r rectangle around x is selected and the intensities within the rectangle represent the texton feature of the point. Fig. 3 shows an example for r = 3.

Figure 3. A 3*3 image patch is converted to a 1*9 texton by recording intensity values row by row [18][19].

Fig. 4 shows an overview of the statistical texton and binary texton methods. As illustrated in Fig. 4, a binary texton is extracted from intensity values or filter responses directly; it does not need any training stage and its feature map is fast to build.
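As a concrete illustration of the statistical texton pipeline reviewed in this section, the sketch below extracts row-by-row patch vectors, learns a small dictionary with a plain Lloyd-style K-Means, and builds a texton histogram by nearest-texton search. It is a simplified stand-in rather than the implementation of [18][19]: all function names are hypothetical, the patch size r is assumed to be odd, border pixels are skipped, and the histogram is normalized to sum to one.

```python
import numpy as np

def extract_patches(img, r=3):
    """One r*r-dimensional vector per interior pixel, intensities read row by row."""
    h, w = img.shape
    k = r // 2
    rows = [img[y - k:y + k + 1, x - k:x + k + 1].ravel()
            for y in range(k, h - k) for x in range(k, w - k)]
    return np.array(rows, dtype=float)

def nearest_texton(features, dictionary):
    """Index of the closest dictionary entry (Euclidean) for every feature vector."""
    d2 = ((features[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def learn_textons(features, c, n_iter=20, seed=0):
    """Plain K-Means (Lloyd iterations) returning c cluster centres."""
    rng = np.random.default_rng(seed)
    centres = features[rng.choice(len(features), size=c, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest_texton(features, centres)
        for j in range(c):
            members = features[labels == j]
            if len(members):               # keep the old centre if a cluster is empty
                centres[j] = members.mean(axis=0)
    return centres

def texton_histogram(img, dictionary, r=3):
    """Normalized frequency of each texton over the image."""
    labels = nearest_texton(extract_patches(img, r), dictionary)
    hist = np.bincount(labels, minlength=len(dictionary)).astype(float)
    return hist / hist.sum()
```

The nearest-texton search compares every pixel feature with every dictionary entry, which is the cost that grows with both the feature dimension and the dictionary size.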

Binary_MR8
As shown in Fig. 5, some local regions may have multiple dominant orientations. The magnitude of the filter response at each angle can be treated as a measure of confidence that the feature occurs at that orientation [17]. It is therefore intuitive to define a binary texton over multiple orientations by thresholding each of the 38 filter responses $g_i * I$ at zero (Eq. (2)), where $g_i$ is the i-th filter shown in Fig. 1, * is the convolution operator, and I is the input image. However, the feature defined by Eq. (2) is too long (38 bits) and is not rotation invariant. Thus, the 38 bits are divided into 8 rows according to the filters shown in Fig. 1, and for each row a rotation invariant sub-texton is defined from the filter(s) of one scale. Each of the last two rows is a 1-bit string and is rotation invariant by nature. For each of the first six rows, the output is a 6-bit binary string, and a rotation invariant sub-texton for row j is defined based on the idea of ''rotation invariant uniform'' patterns [25].
The output at each position is thus an 8-dimensional vector, and there are 1,048,576 (8*8*8*8*8*8*2*2) possible patterns in total. A histogram of this size is too large to build and would be costly to compute. To reduce the feature size, we empirically divide the 38 filters into 2 groups as shown in Fig. 6. Thus, for each image, only two 4-dimensional histograms, each with 1,024 (8*8*8*2) bins, need to be built and concatenated. The final histogram size is reduced to 2,048 (8*8*8*2*2). Fig. 7 shows an example to illustrate the difference between Statistical_MR8 and Binary_MR8.
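The counting argument above can be checked with a small sketch. The exact sub-texton equations are not reproduced in this text, so the code assumes the LBP-style ''rotation invariant uniform'' convention of [25] (number of 1-bits if the circular bit string has at most two 0/1 changes, otherwise one extra label) and an assumed grouping of three oriented rows plus one isotropic row per histogram; the function names are hypothetical.

```python
import numpy as np

def riu_label(bits):
    """Assumed riu2-style label of a circular bit string: popcount if uniform, else len + 1."""
    bits = np.asarray(bits, dtype=int)
    changes = int(np.sum(bits != np.roll(bits, 1)))  # circular 0/1 changes
    return int(bits.sum()) if changes <= 2 else len(bits) + 1

def binary_mr8_labels(rows):
    """rows: 8 sequences of filter responses at one pixel
    (six rows of 6 oriented responses, two rows with 1 isotropic response)."""
    labels = []
    for row in rows:
        bits = (np.asarray(row, dtype=float) >= 0).astype(int)  # threshold at zero
        labels.append(int(bits[0]) if len(bits) == 1 else riu_label(bits))
    return labels

def group_bin(l1, l2, l3, iso):
    """Histogram bin of one assumed group: three 8-valued labels and one 1-bit label."""
    return ((l1 * 8 + l2) * 8 + l3) * 2 + iso  # value in [0, 1024)
```

With 6-bit rows this convention yields exactly 8 labels per oriented row, which is consistent with the 8*8*8*2 bins per group quoted above.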

Binary_Fractal
In Statistical_Fractal, only selected fractal values are kept. As shown in Section 3.1, a texture image is complex and may contain multiple orientations in a local region. Similar to Binary_MR8, Binary_Fractal binarizes all fractal values and keeps them for feature extraction.
After the 38 filtered images $F_i$ (i = 1, 2, …, 38) are obtained with the Statistical_MR8 filter bank, 38 images of local fractal dimensions ($D_i$) and 38 images of local fractal lengths ($L_i$) are computed by Eq. (1). Unlike the filter responses in Binary_MR8, the local fractal values are all positive, so the threshold ''0'' used in Binary_MR8 cannot be used to obtain binary results. Instead, the median value of each fractal image is selected as the threshold:

$$BD_i(x,y) = \begin{cases} 1, & D_i(x,y) \ge \mathrm{median}(D_i) \\ 0, & \text{otherwise} \end{cases}, \quad i = 1, 2, \ldots, 38 \qquad (6)$$

$$BL_i(x,y) = \begin{cases} 1, & L_i(x,y) \ge \mathrm{median}(L_i) \\ 0, & \text{otherwise} \end{cases}, \quad i = 1, 2, \ldots, 38 \qquad (7)$$

where median($D_i$) and median($L_i$) are the median values over the whole image of $D_i$ and $L_i$, respectively. Then, similar to Binary_MR8, for each position the 38 binary bits of $BD_i(x,y)$ and of $BL_i(x,y)$ are each divided into 8 rows, and for each row a rotation invariant sub-texton is defined. Finally, as discussed in Section 3.1, the 8 rows are divided into 2 groups to reduce the feature size, and a feature histogram of size 2,048 (8*8*8*2*2) is built for Binary_Fractal_D and for Binary_Fractal_L.
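The median thresholding step can be sketched in a few lines. The function name `binarize_by_median` and the (K, H, W) array layout are assumptions made for illustration, as is the choice to map values equal to the median to 1.

```python
import numpy as np

def binarize_by_median(maps):
    """Threshold each fractal map at its own median.

    maps : array of shape (K, H, W) holding K positive-valued maps (D_i or L_i)
    returns an array of 0/1 values with the same shape
    """
    med = np.median(maps, axis=(1, 2), keepdims=True)  # one median per map
    return (maps >= med).astype(np.uint8)
```

For positive-valued maps a zero threshold would set every bit to 1 and carry no information, whereas a per-map median splits each map roughly in half.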

Binary_Joint
Although Statistical_Joint can get good results on texture classification, it is too slow for some real-time applications. Taking the 7*7 patch based Statistical_Joint as an example, searching 49-dimensional statistical textons is time consuming, especially for a large image [19]. Inspired by the idea of LBP, a binary counterpart of the local patch is proposed.
As the 7*7 patch based Statistical_Joint gets good results in most applications [18][19], Binary_Joint is defined on a 7*7 patch only. First, taking a position (x, y) in the original image as the central point, a 7*7 rectangle is selected from the original image. Then 49 binary bits are computed by comparing each gray value in the rectangle with the average intensity of the whole image, where $G_{m,n}$ is the gray value at coordinate (m, n) and $G_I$ is the average intensity of the whole image.
As 49 binary bits are too long for feature extraction, the 7*7 binary rectangle is divided into 6 blocks as shown in Fig. 8.
Then, for each block, a sub-histogram is built. For blocks 1, 2, and 3, the 7 bits of the central column form a sub-texton $ST_c$, where $B^c_i$ denotes the i-th bit of the central column. There are 128 possible values for 7 bits. To reduce the feature size, the idea of ''uniform'' patterns from LBP [25] is used to reduce the number of sub-textons: a sub-texton is labeled ''non-uniform'' if $U_0$, the number of bitwise 0/1 changes, is larger than 2. Thus, the total number of possible sub-textons of the central column is 45 (44 uniform patterns plus one non-uniform label). Table 1 lists the 45 kinds of sub-textons with their labels.
For the left and right columns, the number of 1-bits is compared with that of the central column, which gives three possible values for each of the sub-textons $ST_l$ and $ST_r$, where $B^l_i$ and $B^r_i$ are the i-th bits of the left and right columns, respectively. A sub-histogram based on [$ST_r$, $ST_c$, $ST_l$] is built for the whole image. Thus, for blocks 1, 2, and 3, the feature size of each sub-histogram is 405 (45*3*3). Similarly, for blocks 4, 5, and 6, a sub-texton is based on the central row, and its number of 1-bits is compared with those of the upper and lower rows. A sub-histogram of size 405 is extracted for each block. Finally, the six sub-histograms are concatenated and a feature histogram with 2,430 (405*6) bins is extracted for each image. Fig. 9 shows an example to illustrate the difference between Statistical_Joint and Binary_Joint.

Table 1. 45 kinds of sub-textons with 7 bits and their labels.
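The bin counts of one Binary_Joint block (45*3*3 = 405) can be reproduced with the following sketch. Several details are assumptions made here because the corresponding equations and Table 1 are not reproduced in this text: 0/1 changes are counted along the column without wrap-around, uniform patterns are numbered in increasing order of their integer value (which need not match Table 1), and the three-valued side label encodes fewer, equal, or more 1-bits than the central column as 0, 1, 2. All names are hypothetical.

```python
import numpy as np

def changes(bits):
    """Number of 0/1 changes between consecutive bits (assumed non-circular)."""
    b = np.asarray(bits, dtype=int)
    return int(np.sum(b[1:] != b[:-1]))

# Assumed labelling: uniform 7-bit patterns get labels 0..43, all others share label 44.
UNIFORM_LABEL = {}
for value in range(128):
    pattern = tuple((value >> k) & 1 for k in range(7))
    if changes(pattern) <= 2:
        UNIFORM_LABEL[pattern] = len(UNIFORM_LABEL)
NON_UNIFORM = len(UNIFORM_LABEL)

def centre_label(centre):
    """Label of the central column, one of 45 values."""
    return UNIFORM_LABEL.get(tuple(int(b) for b in centre), NON_UNIFORM)

def side_label(side, centre):
    """0, 1 or 2: the side column has fewer, equally many, or more 1-bits than the centre."""
    diff = int(np.sum(side)) - int(np.sum(centre))
    return 0 if diff < 0 else (1 if diff == 0 else 2)

def block_bin(left, centre, right):
    """Histogram bin in [0, 405) built from [ST_r, ST_c, ST_l]."""
    return (side_label(right, centre) * 45 + centre_label(centre)) * 3 + side_label(left, centre)
```

Counting changes circularly instead would also give 44 uniform patterns for 7 bits, so the total of 45 labels does not depend on that choice.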

Dissimilarity Metric
The dissimilarity between sample and model histograms is a test of goodness-of-fit, which can be measured with a nonparametric statistical test. There are many metrics to evaluate the goodness-of-fit between two histograms, such as histogram intersection, log-likelihood ratio, and the chi-square statistic [25]. In this study, a test sample T is assigned to the class of the model L that minimizes the chi-square distance:

$$\chi^2(T, L) = \sum_{n=1}^{N} \frac{(T_n - L_n)^2}{T_n + L_n}$$

where N is the number of bins, and $T_n$ and $L_n$ are the values of the sample and model image at the n-th bin, respectively. In this paper, the nearest neighbor classifier with the chi-square distance is used, because it is equivalent to the optimal Bayesian classification [23] and achieves good performance for texture classification [24].
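A minimal sketch of nearest neighbor classification with the chi-square distance is given below. The function names are hypothetical, and bins where both histograms are zero are assumed to contribute nothing to the sum, a guard the text does not discuss.

```python
import numpy as np

def chi_square(t, l):
    """Chi-square distance between two histograms; empty bins are skipped (assumption)."""
    t = np.asarray(t, dtype=float)
    l = np.asarray(l, dtype=float)
    num = (t - l) ** 2
    den = t + l
    mask = den > 0
    return float(np.sum(num[mask] / den[mask]))

def classify(sample, models, labels):
    """Return the label of the model histogram closest to the sample."""
    distances = [chi_square(sample, m) for m in models]
    return labels[int(np.argmin(distances))]
```

For example, `classify([0.8, 0.2], [[0.9, 0.1], [0.1, 0.9]], ["a", "b"])` selects the first model, since its histogram is closer to the sample.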

Experimental Results and Discussion
To evaluate the effectiveness of the proposed methods, we carried out a series of experiments on three large and comprehensive texture databases: the Columbia-Utrecht Reflection and Texture (CUReT) database, which contains 61 classes of real-world textures, each imaged under different combinations of illumination and viewing angle [27]; the University of Illinois at Urbana-Champaign (UIUC) database [14], which includes 25 classes with 40 images per class collected under significant viewpoint variations; and the Kungliga Tekniska högskolan (Swedish) - Textures under varying Illumination, Pose and Scale (KTH-TIPS) database [28], which contains 10 classes with 81 images per class imaged under different scales, poses and illumination conditions. Except for Binary_Fractal, each image sample was normalized to have an average intensity of 0 and a standard deviation of 1 for the different methods [16][17][18][19][20], which removes the effect of global intensity and contrast. For Binary_Fractal, each image sample was normalized to have an average intensity of 128 and a standard deviation of 20, as this setting gave better accuracy. A 7*7 local patch is used for Statistical_Fractal and Binary_Joint. Although a larger patch could give better recognition accuracy, it is more time consuming [19], and the main focus of this work is to investigate the effect of binary representation. For comparison, the LBP method [25] is also evaluated. To get better results, a multiscale scheme is used for the LBP method [25]. The chi-square dissimilarity defined in Section 3.4 and the nearest neighbor classifier were used for all methods.

Experimental Results on CUReT Database
The CUReT database contains 61 textures, as shown in Fig. 10, and there are 205 images of each texture acquired at different viewpoints and illumination orientations. Of these, 118 images have been shot from a viewing angle of less than 60°. From these 118 images, we selected 92 images from which a sufficiently large region (200*200) could be cropped across all texture classes [17]. To get statistically significant experimental results [19], L training images are randomly chosen from each class while the remaining 92-L images are used as the test set. The first 23 images of each class are used to learn the library, and 40 textons are clustered from each of the texture classes. The average accuracy and standard deviation over 1000 random splits are listed in Table 2.
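The random-split protocol can be sketched as follows for feature histograms that have already been extracted. This is an illustrative outline only: the function names are hypothetical, the texton library learning from the first images of each class is not modeled, and the nearest neighbor rule with chi-square distance from Section 3.4 is re-implemented here so that the block is self-contained.

```python
import numpy as np

def chi_square_matrix(a, b):
    """Pairwise chi-square distances between rows of a (test) and rows of b (train)."""
    num = (a[:, None, :] - b[None, :, :]) ** 2
    den = a[:, None, :] + b[None, :, :]
    safe = np.where(den > 0, den, 1.0)          # avoid division by zero on empty bins
    return np.where(den > 0, num / safe, 0.0).sum(axis=2)

def random_split_accuracy(hists, labels, L, n_splits=1000, seed=0):
    """Mean and standard deviation of accuracy over random splits.

    In each split, L images per class are drawn for training; the rest are tested.
    """
    rng = np.random.default_rng(seed)
    hists = np.asarray(hists, dtype=float)
    labels = np.asarray(labels)
    accuracies = []
    for _ in range(n_splits):
        train = np.zeros(len(labels), dtype=bool)
        for c in np.unique(labels):
            idx = np.flatnonzero(labels == c)
            train[rng.choice(idx, size=L, replace=False)] = True
        d = chi_square_matrix(hists[~train], hists[train])
        predicted = labels[train][d.argmin(axis=1)]
        accuracies.append(np.mean(predicted == labels[~train]))
    return float(np.mean(accuracies)), float(np.std(accuracies))
```

The two returned values correspond to the average accuracy and standard deviation reported per method and per value of L.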
Two observations can be made from Table 2. First, the proposed methods are much better than the simple LBP method. Their feature length is somewhat longer, but this is not a significant issue for current computers.
Second, binary operators show their superiority over statistical operators on the CUReT database. For example, Binary_MR8 and Binary_Fractal get better results than Statistical_MR8 and Statistical_Fractal, while Binary_Joint gets results competitive with Statistical_Joint.
Furthermore, as there is a training step in statistical operators, their performance depends on the training set. Table 3 shows the classification rate of Statistical_MR8 and Statistical_Joint under different training samples. Here C is the number of classes of training samples used to learn the texton dictionary. As shown in Table 3, the performance degrades when the training set is reduced.
Compared with statistical textons, binary textons are sensitive to noise. A small disturbance may leave the statistical texton unchanged while changing the binary texton. Fig. 11 shows an example: in a 5*5 local patch, one pixel value is changed from 0.51 to 0.49 while the other pixels stay the same. Because the difference is very small, the statistical texton may be the same; however, as illustrated in Section 3.3, the binary texton will be significantly different.
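The sensitivity argument can be illustrated numerically. In the sketch below the threshold 0.5 is a hypothetical stand-in for the image mean, the remaining patch values are made up, and the function name `perturbation_effect` is introduced here for illustration; the code only contrasts the size of the change in intensity space with the change in the binarized patch.

```python
import numpy as np

def perturbation_effect(before, after, threshold):
    """Return (Euclidean distance between patches, number of flipped bits after thresholding)."""
    euclid = float(np.linalg.norm(before - after))
    flipped = int(np.sum((before >= threshold) != (after >= threshold)))
    return euclid, flipped

patch = np.full((5, 5), 0.8)   # hypothetical 5*5 patch
patch[2, 2] = 0.51
noisy = patch.copy()
noisy[2, 2] = 0.49             # the small disturbance described in the text

euclid, flipped = perturbation_effect(patch, noisy, threshold=0.5)
```

The intensity vectors differ by only about 0.02 in Euclidean distance, so the nearest statistical texton is unlikely to change, whereas one bit flips and can move the pixel to a different sub-texton and histogram bin.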
To further illustrate the noise effect, we created three texture databases with noise, where $I(x,y)$ is the gray value of an original image pixel, $\hat{I}(x,y)$ is the transformed value, $n(x,y)$ is random noise with zero mean and unit standard deviation, and t (t = 1, 2, 3) is the parameter that controls the degree of noise. Table 4 shows the classification rate for the different datasets. As shown in Table 4, statistical texton method could get sound results even when t = 3, while Binary_Joint fails to over 30% at the same

Experimental Results on UIUC Database
The UIUC texture database [14] includes 25 classes and 40 images in each class. The resolution of each image is 640*480. The database contains materials imaged under significant viewpoint variations as shown in Fig. 12. Similar to Section 4.1, L training images are randomly chosen from each class while the remaining 40-L images are used as the test set. The first 10 images of each class are used to learn the library and 100 textons are clustered from each of the texture classes. The average accuracy and standard deviation over 1000 random splits are listed in Table 5.
Several observations can be made from Table 5. First, similar to what we found in Table 2, the proposed methods are much better than the simple binary operator, LBP. Second, the proposed scheme fails to get better results than the statistical texton methods. This is possibly because of the high resolution: as the image resolution is much higher, statistical textons can describe subtle differences. For example, in the UIUC database there is a 4.71% probability that two different statistical textons have an identical binary texton, while in the CUReT database the probability is only 2.22%. To further illustrate the effect of resolution on statistical and binary operators, we manually down-sampled every image of the UIUC database to 1/2 and 1/4 of its original size and applied the Statistical_Joint and Binary_Joint methods to these two databases. As shown in Table 6, the classification rate of the statistical texton method drops when the image is down-sampled, since the subtle differences it relies on are lost; in contrast, the binary texton method may get better results. Thus, statistical texton methods are more favorable for high resolution images.

Note to Table 5: some classification rates are originally reported in [19][20]; the other classification rates are computed by us.

Experimental Results on KTH-TIPS Database
The KTH-TIPS texture database [28] contains 10 kinds of materials. Images were taken at 9 different scales spanning two octaves. At each scale, 9 images were taken in a combination of three poses and three illumination conditions. Thus, for each class, there are 81 image samples. A 200*200 patch is manually cropped from each sample. However, due to large camera-target distances, some samples are smaller than 200*200 pixels. Fig. 13 shows image samples of the KTH-TIPS database. Similar to Section 4.1, L training images are randomly chosen from each class while the remaining 81-L images are used as the test set. The first 20 images of each class are used to learn the library, and 250 textons are clustered from each of the texture classes. The average accuracy and standard deviation over 1000 random splits are listed in Table 7.
Similar observations can be made from Table 7. First, the proposed methods are much better than the simple LBP method. Second, as the image resolution is not high, binary texton methods still perform well. Binary_MR8 and Binary_Fractal get better results than Statistical_MR8 and Statistical_Fractal, respectively, but Binary_Joint is slightly worse than Statistical_Joint.

Time Cost
The proposed methods and the statistical texton methods are implemented in Matlab R2008a on a PC running Windows XP with a T6400 CPU (2.13 GHz) and 2 GB of RAM. As the feature lengths are similar across methods and the classifier is the same, we list only the average feature extraction time on the different databases. As shown in Table 8, Statistical_Joint is the most time-consuming method, while the binary methods are much faster than the statistical methods.

Conclusion
In this paper, we proposed three binary texton methods and compared their experimental results with those of their statistical counterparts on three large public texture databases. We found empirically that the statistical methods get good results in most cases. However, their feature extraction may be time consuming, especially when the image is large. Furthermore, they require a training step to build the texton dictionary, which may limit the accuracy when there are not enough training samples. For good quality images of small size, binary texton methods can get better results than statistical ones; they also require no training step and are fast in feature extraction. As the different schemes have different advantages, future work should investigate how to combine these properties to improve the classification rate further.