Gray-level invariant Haralick texture features

Haralick texture features are common texture descriptors in image analysis. To compute the Haralick features, the image gray-levels are reduced, a process called quantization. The resulting features depend heavily on the quantization step, so Haralick features are not reproducible unless the same quantization is performed. The aim of this work was to develop Haralick features that are invariant to the number of quantization gray-levels. By redefining the gray-level co-occurrence matrix (GLCM) as a discretized probability density function, it becomes asymptotically invariant to the quantization. The invariant and original features were compared using logistic regression classification to separate two classes based on the texture features. Classifiers trained on the invariant features showed higher accuracies, and had similar performance when training and test images had very different quantizations. In conclusion, using the invariant Haralick features, an image pattern will give the same texture feature values independent of image quantization.


Introduction
The easiest and most intuitive image features for most applications in image analysis are first order statistics computed from histograms of the gray-level values in images, like their mean, variance, skewness and kurtosis. Such features involve the values of individual pixels, but ignore the spatial interaction between pixels. Histogram features do not reflect objects or patterns in the image, only the distribution of gray-levels. This inability makes first order statistics a blunt tool for quantifying changes in images, or any change in the spatial distribution of gray values.
Most texture analysis methods use higher-order statistics, and consider the relation between two or more pixels at a time. Methods such as local binary patterns [1], wavelets [2] and Gabor filters [3] can be used to assess texture in images. Haralick et al. [4] proposed using a gray-level co-occurrence matrix (GLCM) as a method of quantifying the spatial relation of neighboring pixels in an image. Haralick texture features, computed from the GLCM, are widely used due to their simplicity and intuitive interpretations, and have successfully been applied in e.g. in the analysis of skin texture [5], in land-use and forest-type classification [6], in automatic pollen detection [7], in fabric defect detection [8], in plant leaf classification [9], in cutting tool condition monitoring [10], and electrophoresis analysis [11]. In recent years there has been a rapid increase in the application of Haralick  analysis of ultrasound and MRI images of the liver [12,13], the heart [14], X-ray mammography [15,16], MRI images in the study of breast cancer [17,18], prostate cancer [19][20][21] and brain cancer [22][23][24]. It is also used in radiomics [25,26], which is an emerging technique where a large number of quantitative features are extracted from medical images, and used to build models predicting e.g. tumor phenotype [27], survival [28,29] and tumor classification [30]. When creating a GLCM from an image or a region of interest (ROI), the image bit depth, i.e. the number of gray-levels, may be reduced in a process called quantization. It is, however, not straight-forward to set an appropriate bit depth, and many projects have been looking at the optimal bit depths for various applications. [28,[31][32][33][34][35] The optimal bit depth depends on the size of the image or ROI, image noise [36] and, of course, the content of the image [33]. To complicate things further, the values of the Haralick features depend strongly on the bit depth and the choice of maximum and minimum values used in the quantization [34,37]. Clausi et al. [38] proposed a method for normalizing two features with respect to the quantization gray-levels to improve classification, however this method is not generalizable to all Haralick features. Shafiq-ul-Hassan et al. [39] proposes another method for scaling a selection of the Haralick features to make them less sensitive to the quantization gray-levels. The presented approach is empirical, and no general rule for feature normalization is presented. Currently, most Haralick feature values cannot be compared between analyses using different numbers of gray-levels, even when they are computed on images depicting the same texture. This means that Haralick features cannot be used to create general models that can be applied regardless of image size or noise levels, and are not reproducible unless the same quantization is performed.
The aim of this work is to propose a modified set of Haralick texture features that are asymptotically invariant to the image quantization, while preserving most of the interpretations of the original features. The modified features allow statistical models to be constructed from images with varying number of gray-levels, or to apply a model trained on one number of gray-levels to image data using a different number of gray-levels.

Theory
The first step in texture analysis using the Haralick features is to map the gray-levels in the original image with size M × K, I 0 , from the range [a, b], to a quantized image, I, in the range [1, N], that has the desired number of gray-levels, N. We represent this step with a quantization function that returns the quantized image In this work we have set φ to be an affine function, but there is no constraint on this function for the purpose of this work.
The second step is to construct the non-normalized GLCM, X 2 N N�N , by counting the number of times each pair of gray-levels occurs as neighbors the image, I, or in an arbitrary region in I. The neighbor of a pixel is defined by a displacement vector, δ = (d x , d y ), in two dimensions, where d x ; d y 2 Z represent the displacement in x and y in units of pixels. Each element, X(i, j), in the GLCM is computed as where I(m, k) is element m, k in the quantized image I. Simply put, the element X(i, j) of the non-normalized GLCM counts how many times gray values i and j occur as neighbors in I, where i, j 2 [1, N]. It is possible to create a GLCM with several displacement vectors, e.g.
for the eight immediate neighbors of a pixel, in two dimensions. This can be extended to three dimensions, in which case 26 displacement vectors would be required. A GLCM created using all of these displacement vectors, e.g. by summation, is said to be direction invariant. If displacement vectors with opposite directions are used to construct the GLCM, the GLCM will be symmetric. Symmetric GLCMs are semi-direction invariant, they represent relations between pixels in e.g. vertical or horizontal directions instead of up or down, left or right. A semi-direction invariant GLCM can also be created by adding the transpose of a GLCM created by one displacement vector to itself. This means that only one of the columns in Eq 4 needs to be calculated to obtain a direction invariant GLCM in two dimensions, or 13 displacement vectors for GLCMs constructed from neighbors in three dimensions. The third step is to construct a normalized GLCM, P 2 Q N�N , where each element represents the estimated probability of each combination of pairs of neighboring gray-levels in the image. The normalized GLCM is computed as The normalized GLCM can be interpreted as a probability mass function of the gray-level pairs in the image.
The fourth step is to compute the texture features from the normalized GLCM. The Haralick texture features are functions of the elements and their corresponding indices in the normalized GLCM, and can be written in a general form as where ψ is a function of an element of P, g is a vector-valued function of P, and ϕ is a function of the indices and g. Examples of g are found in in Table 1. For instance when computing Correlation (see Table 2), we have The number of gray-levels in the quantized image determines the size of the GLCM, and will affect the texture feature values. By increasing the number of gray-levels, N, the sums in Tables 1 and 2 will change. Further, the element values, p(i, j), of the normalized GLCM will decrease as the number of gray-levels increases, since the sum of all elements is unity. These two properties of the GLCM and the Haralick texture features means that the texture feature values depend on the number of gray-levels in the quantized image.     [4], which has been corrected. λ 2 (Q(i, j)) denotes the second largest eigenvalue of a matrix Q(i, j). Note, however, that this feature is computationally unstable, and was therefore not included in the examples in this work. For symmetric GLCMs, μ x and μ y is identical, and is represented by μ in the expression for Cluster prominence and Cluster shade.

Feature Original expression Invariant expression
Information measure of correlation 1 [4] HXY À HXY1 max ðHX; HYÞ g HXY À g HXY1 maxð g HX; g HY Þ Information measure of correlation 2 [4] ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi 1 À exp ½À 2ðHXY2 À HXYÞ� p ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi

Invariant texture features
To develop asymptotically invariant texture features, we propose an interpretation of the GLCM as a discrete approximation of a probability density function. In this interpretation, the texture features are expressed as integrals over functions of the density, i.e. as where p � is the underlying true probability density function, for which Eq 7 can be approximated by a Riemann sum: where where Δ i and Δ j are differentials, andP is the invariant GLCM. So, to make the Haralick The discrete approximations of the integrals depend on the discretization of the distribution (i.e., the quantization, or the number of gray-levels), but the approximation is equal to the integral in the limit as N ! 1. The original features do not give the same approximated values for different discretizations, they either approach zero, a fixed value or infinity when N ! 1.

Materials and methods
We investigated the properties of the original and invariant texture features using two data sets. Dataset 1 contained T1-weighted MRI volumes of the brain. We computed texture features for the cerebellum and the prefrontal cortex, see Fig 3. We used logistic regression to classify each region based on their texture properties, and compared the result for the two methods. This dataset has limited clinical utility, and is used to compare the performance of   respectively, and the bandwidth was 392 Hz/pixel. A region of the prefrontal cortex (30 × 30 × 10 mm 3 , 23 × 23 × 10 voxels), and a region of the cerebellum (25 × 25 × 10 mm 3 , 19 × 19 × 10 voxels), were delineated on a reference image set, see Fig 3. The other images were rigidly registered to the reference, and the corresponding regions were extracted. Twenty-three images from 13 patients where the image registration failed were excluded from the study. The clinical trial was approved by the local Regional Ethical Review Board of Umeå University, and oral and written consent was given by all subjects.

Texture analysis
Thirty-two direction invariant GLCMs were created from the immediate voxel neighbors in Eq 4, for each brain region or gland structure, one for every quantization of 8-256 gray-levels in steps of 8. We used fixed upper and lower limits of 1500 and 4500 pixel units for the cerebellum and 1000 and 8000 pixel units for the prefrontal cortex, determined by the limits of the histograms from all regions. The upper and lower limits of the glandular structures were set to 0-256. The 20 original Haralick features and the corresponding invariant features described in Table 2 were computed for each quantization. All texture analysis was done using MICE Toolkit [42] and MATLAB 2016b (MathWorks, Inc., Natick, MA).

Classification
The goal of each logistic regression model was to correctly predict the brain regions (cerebellum or prefrontal cortex) in Dataset 1 or the glandular structures (benign or malignant) in Dataset 2, from the texture values. Two scenarios were explored. In the first scenario, the datasets were split into training (50%) and test dataset (50%). One hundred classifiers were trained using a random quantization of each member in the training data. These models were used to predict the correct class in the test set for all quantization levels. In the second scenario, the datasets were split into training (50%), validation (25%) and test (25%) sets. One classifier was trained for each of the 32 quantizations. These classifiers were used to predict the correct class in the test set for all quantization levels. For each dataset and texture method, the validation set was used to perform variable selection to remove irrelevant or highly correlated features. We used forward selection [43], and maximized the mean accuracy.

Evaluation
The first scenario was evaluated by the average accuracy of the 100 classifiers that was trained on randomly quantized data. We compared the original and invariant Haralick features using a two-sample, two-tailed Welch's t-test for populations with unequal variances. The second scenario was evaluated using a Mann-Whitney U-test to compare the results of the original and invariant Haralick features.

Fig 4 shows the original GLCMs (left) and the invariant GLCMs (right)
for the reference region of the cerebellum for 16, 32, 64, and 128 gray-levels. The standard GLCMs decrease in amplitude with orders of magnitude as the number of gray-levels increase. The invariant features retain the same volume, and the amplitude only increases slightly between 8 and 128 gray-levels, due to noise introduced with increased gray-level resolution. Fig 5 shows how the texture feature values change with the number of gray-levels of the GLCM, computed from a benign glandular structure in the gland dataset (Dataset 2). In the left column, the upper graph shows features that increase rapidly with the number of gray-levels, the middle graph shows features that increase modestly, or reach a limit as the number of gray-levels increases, and the lower graph shows features that decrease with increasing number of gray-levels. The right column shows the corresponding invariant texture features. Note that the original features are plotted on a logarithmic scale on the vertical axis, whereas the invariant features are plotted on a linear scale. Fig 6 shows the accuracy of 100 logistic regression models trained on randomly quantized images in the range of 8-256 gray-levels in steps of 8, and tested on images quantized to all gray-levels in the range 8-256 in steps of 8. The error bars show the standard deviation of the accuracy of the 100 classifiers, for each test dataset quantization level. Consistently, classifiers trained on the invariant features outperformed classifiers trained on the original features, and had a lower standard deviation of the accuracy. The average accuracies were 0.77 ± 0.08 and 0.96 ± 0.03 (p < 10 −39 , t = 20.82, using a two-sample Welch's t-test with 138 degrees of freedom, estimated using the Satterthwaite-Welch formula [44]) for Dataset 1 (the brain dataset), and 0.69 ± 0.06 and 0.80 ± 0.03 (p < 10 −35 , t = 17.15, using a two-sample Welch's t-test with 142 degrees of freedom [44]) for Dataset 2 (the gland dataset) for the original and invariant features, respectively. Fig 7 shows the accuracy of the logistic regression models trained on one quantization levels, and tested on all quantization levels in the range 8-256 in steps of 8, for Dataset 1 (the brain dataset) and Dataset 2 (the gland dataset). The classifiers that were trained and tested on the original features had a high accuracy only when the test data quantization was close to the training data quantization. The classifiers that were trained on the invariant features had a high accuracy for all combinations of training and test dataset quantizations. The renormalized feature accuracies were significantly larger (p < 10 −99 using a Mann-Whitney U-test) for both datasets.

Discussion
We have presented a simple modification of the Haralick texture features that makes them asymptotically invariant to the numbers of gray-levels in the quantizations. This is achieved by viewing the GLCM as a discrete approximation of a probability density function, instead of a probability mass function, over the pairs of gray-levels in the image. We have shown examples of how the standard and invariant GLCMs scale with increasing bit depth, Fig 4, and how most of the modified texture features quickly approach a limit, whereas most of the original features diverge or converge to zero, Fig 5. We have demonstrated the benefit of the proposed modified features by training logistic regression models to separate two brain regions in Dataset 1, and the malignancy of colorectal glands in Dataset 2, based only on texture features. Classifiers based on the invariant features performed better than the original features in all tests performed in this work. Further, the invariant Haralick features are rescaled versions of the original features, which means that they retain their original interpretations in most cases. Finally, the proposed modifications allow texture features to be reproducible regardless of quantization, since the same texture feature will give similar values independent of the quantization.
We tested the texture features in two scenarios. In the first scenario, we trained classifiers on texture features computed from images with different quantizations. In this scenario, the quantization levels might be chosen to enhance features optimally in each image, which is employed by e.g. Leijenenaar et al. [45]. Fig 6 shows that models trained on the invariant features have a higher accuracy and smaller standard deviation than models trained on the original features. In the second scenario, we trained the classifiers on one quantization level and predicted the classes from features computed on each of the other quantization levels. This scenario represents cases in e.g. radiomics, where a predictive model is created using one quantization, and employed to other datasets where the quantization is different due to e.g. resolution constrains, the size of the ROI, or where the optimal quantization is different. The classifiers trained on the invariant features had equal or higher accuracies compared to classifiers trained on the original features for the same combination of training and test dataset quantizations in most cases. The exceptions were some classifiers trained on the original features where the test quantization is slightly higher than the training quantization, see the upper left plot in Fig 7. In these situations, feature differences between the two classes are enhanced by the changes in feature values of the test data due to the different quantization gray-levels, as shown in Fig 5. This effect depends on the inherent texture of the two classes to be separated, and the features that are used in the model.
Maximum probability is very sensitive to noise, which is evident from Figs 4 and 5. For the invariant features, the maximum probability is now interpreted as the maximum probability density, i.e. the density value at the mode.
In the continuous case with the invariant features, the entropy becomes a differential entropy, and no longer has all the properties of the discrete counterpart. For instance, the differential entropy can be negative. This reinterpretation has particularly problematic consequences for the Information Measure of Correlation 1, (see Table 2). The denominator can be both positive and negative, and in particular, it can be close to zero. We note that in our particular case, with integration limits running from zero to one, division by zero can only occur if either of the marginal distributions of P are uniform.

Gray-level invariant Haralick texture features
It is important to note that except for maximum probability and the features that involve entropy, the interpretations of the texture features do not change. The invariant features are obtained by renormalizing the GLCM, the indices, and by multiplication with a differential determined by the bin size. It is merely a rescaling of the features to make them independent of the number of gray-levels of the GLCM. This has generally no impact on the interpretations of these features.
We chose to study the behavior of the invariant features between 8 and 256 quantization gray-levels. Too few gray levels will lead to a crude Riemann approximation in Eq 9, which can be seen in e.g. Fig 5 where the feature values stabilize around 16 gray-levels for most features, or in Figs 6 and 7 where the accuracies improve drastically between 8 and 16 gray-levels. Choosing excessively many quantizations gray-levels, e.g. on the order of the pixels present in the image, can produce sparse GLCMs. The invariant features will suffer from the same symptoms as the original features in this regard, i.e. an over-sensitivity to noise; and for small image regions which will produce sparse GLCMs, a failure to properly represent the underlying texture information.
Clausi et al. [38] proposed a normalized version of Homogeneity and Inverse difference by dividing the indices i and j by N to improve classification. The result is equivalent to the invariant versions presented in Table 2, since the effect of the differential Δ ij and the renormalized GLCM,pði; jÞ cancel out in these features. However Clausi's approach to only normalize the indices will not render all features invariant to the quantization; features such as Energy, Entropy measures, Variance measures, Information measure of correlation 1 and 2 and Maximum probability are not explicitly expressed in terms of the GLCM indices and cannot be normalized in this way. Shafic-ul-Hassan et al. [39] empirically derive normalization factors in terms of the quantization gray-levels for a selection of Haralick features. Of those, only one (Contrast) were identical to our results. A few of their normalized features were similar but not identical to the results presented here. Hence, most empirical derivations presented by Shafic-ul-Hassan do not fit the theory presented in this work. Further, many features presented in Table 2 cannot be reduced to a simple scaling factor of the original features, e.g. any feature containing entropy.
In this study we chose global min-max limits when quantizing the images in each dataset to minimize variations in the feature values due to image noise or other structures in the region of interest. Another approach is to set the limits for each ROI based on the minimum and maximum values inside each ROI [37,45]. However, one extremely high or low pixel value inside the ROI will shift the center of mass of the GLCM, drastically affecting the Haralick feature values. The global limits approach requires that the image intensities are comparable between all images in the dataset, which requires that the images are acquired with the same hardware and imaging settings, or it can be achieved by e.g. normalizing the intensity of common structures in the images.
The invariant features can be used when analyzing multiple images of different sizes (different number of pixels), and possibly even different amounts of noise, by optimizing the quantization level for each image, while still obtaining comparable texture features. Image noise will affect the texture feature values [36,37,46], but the smoothing effect of gray-level quantization could reduce the impact on the resulting features. Some features are more sensitive to noise [37], and a more aggressive gray-level quantization for the invariant version of the features could reduce impact of the noise, while retaining the possibility to compare feature values with those from images analyzed at different quantization levels. This approach requires further research. Finally, there are other methods of extracting features, such as the Gray-level Size Zone Matrix (GLSZM), Gray-level Run Length Matrix (GLRLM) and the Neighborhood Gray-tone Difference Matrix (NGTDM), which feature values are also affected by the graylevel quantization. Shafic-ul-Hassan et al. [39] proposes empirical normalization factors for some features from these methods, and it is be feasible that a similar approach to what is presented in this work could reduce the impact of gray-level quantization to these methods as well. Making more methods independent of the gray-level quantization will increase the applicability and reproducibility of radiomics analyses, so this is also an interesting prospect for future research.

Conclusion
By reinterpreting the GLCM as a discretized probability density function, it is possible to construct a modified set of Haralick texture features that are asymptotically invariant to the image quantization. Except for maximum probability and features involving entropy, the invariant features retain their original interpretations. We show that the invariant features can be used in different classification setups, with results superior to the original definitions. This mean that the invariant Haralick texture features are reproducible even when different gray-level quantizations are used, unlike the original definitions.