Automatic microarray image segmentation with clustering-based algorithms

Image segmentation, a key step of microarray image processing, is crucial for obtaining the spot expressions simultaneously. However, state-of-the-art clustering-based segmentation algorithms are sensitive to noise. To solve this problem and improve the segmentation accuracy, this article introduces several improvements into the fast and simple clustering methods (K-means and Fuzzy C-means). Firstly, a contrast enhancement algorithm is implemented in image preprocessing to improve the gridding precision. Secondly, data-driven means are proposed for cluster center initialization instead of the usual random setting. Thirdly, multiple features, including intensity features, spatial features, and shape features, replace the sole pixel intensity feature used in traditional clustering-based methods, so that noise is not taken as spot pixels. Moreover, principal component analysis is adopted for feature extraction. Finally, an adaptive adjustment algorithm based on data mining and learning is proposed for further dealing with missing spots or low-contrast spots. Experiments on real and simulation data sets indicate that, with these improvements, the proposed method obtains higher segmentation precision than the traditional K-means and Fuzzy C-means clustering methods.


Introduction
The microarray is a useful biotechnological tool that has been widely applied in the life sciences, in fields such as cancer research, pharmacology, disease diagnosis, and environmental engineering [1]. Microarray technology allows scientists to measure the expression levels of thousands of genes simultaneously. It involves sample preparation, microarray design, scanning, image processing, and data analysis [2,3]. Among them, image processing plays an important role in extracting the gene expressions. Microarray image processing mainly consists of four parts: 1) pre-processing, 2) gridding, 3) segmentation, and 4) intensity extraction [3]. Pre-processing is mainly aimed at reducing noise and enhancing the image quality. Gridding finds a series of horizontal and vertical lines that separate one slide image into sub-grids and individual spot areas. Segmentation divides the pixels of each spot area into foreground, background, or noise. Finally, intensity extraction obtains the gene expression levels from the results of the previous operations. Among the four procedures, segmentation is crucial for accurately extracting gene expressions. However, it is a challenging task because real microarray images usually contain noise, poor contrast, and various spot shapes. In the past decades, many state-of-the-art tools, such as GenePix [4], Imagene [5], QuantArray [6], ScanAlyze [7], Magic Tool [8], Spot [9], Dapple [10], Spotfinder [11], P-Scan [12], and UCSF Spot [13], have been proposed for microarray image processing. They were originally manual and are now semi-automatic. In addition, since 1997, microarray image segmentation has attracted increasing attention because it is an essential step for extracting gene expression values, and many segmentation algorithms have been proposed.
These segmentation methods can be classified into the following seven categories:
1. Shape-based segmentation, including fixed circle or adaptive circle, adaptive shape [14], active contours [15][16][17] and the Snake Fisher model [18]. It obtains the boundary and region information of spots based on the relativity of the target shape.
2. Model-based segmentation, involving Markov Random Field [19], 3D spot modeling [20] and the total variation (TV)-based regularization method [21]. It segments spots by building a spot model or by modeling the image as a function according to a series of parameter estimations.
3. Region-based segmentation, containing the seeded region growing (SRG) algorithm [22] and the watershed algorithm [23]. The spots are segmented by splitting the image into areas depending on the image topology.
4. Threshold-based segmentation, including global and local thresholding [24] and soft-thresholding [25]. This method separates foreground and background based on fluorescence intensities.
6. Supervised learning-based segmentation, containing neural networks [28] and support vector machines [29]. These methods segment the spots depending on a prepared training dataset.
However, among these methods, the clustering-based methods are prone to be affected by noise. Meanwhile, real microarray images usually suffer from quality problems such as low contrast, noise, artifacts, and spots of varied shape. All these deficiencies make clustering-based segmentation a challenging task. Therefore, in this article, an improved clustering-based scheme is proposed to improve the segmentation efficiency. Firstly, automatic contrast enhancement and noise reduction are introduced into the image preprocessing. Subsequently, an initial cluster center design scheme is proposed to improve the clustering performance, replacing random point selection. Because the single intensity feature cannot reflect all characteristics of spots, spatial, intensity, and shape features are considered together. In addition, principal component analysis (PCA) is implemented for feature selection. Finally, an adaptive adjustment strategy is adopted to improve the segmentation accuracy. This article is organized as follows. Section 2 introduces the improved clustering-based method. Section 3 presents the comparison experiments on different data sets. The main conclusions are described in Section 4.

The improved clustering-based method
As mentioned above, a great number of clustering methods have been proposed for cDNA microarray image segmentation. However, only a portion of them perform well on simulation images. In addition, a single intensity feature is adopted for clustering in almost all of these algorithms. To improve the segmentation accuracy and take full advantage of the traditional clustering methods, we improved the KM and FCM clustering algorithms. The proposed method (online code is in S1 Algorithm) includes six parts: 1) image preprocessing, 2) cluster center initialization, 3) feature selection, 4) feature extraction with PCA, 5) improved KM and FCM clustering, and 6) adaptive adjustment, as shown in Fig 1.

Image preprocessing
To enhance the image quality, contrast enhancement and gridding are conducted during the image preprocessing step. Our previous method [3] is adopted for contrast enhancement. After the contrast enhancement operation, a median filter is applied for noise reduction.
In addition, gridding based on the maximum between-class variance is conducted as described in [3].
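The maximum between-class variance criterion is commonly known as Otsu's method. How [3] applies it to gridding is not described in this section, so the following Python sketch only illustrates the threshold selection itself. The function name `otsu_threshold` and the 256-bin histogram are assumptions made for illustration.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Return the gray level that maximizes the between-class variance."""
    values = np.asarray(values, dtype=float).ravel()
    hist, edges = np.histogram(values, bins=bins)
    prob = hist / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])

    w0 = np.cumsum(prob)                  # weight of the class up to each bin
    w1 = 1.0 - w0                         # weight of the remaining class
    cum_mean = np.cumsum(prob * centers)  # cumulative first moment
    total_mean = cum_mean[-1]

    # Between-class variance for every candidate threshold.
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (total_mean * w0 - cum_mean) ** 2 / (w0 * w1)
    sigma_b = np.nan_to_num(sigma_b, nan=0.0, posinf=0.0, neginf=0.0)
    return float(centers[np.argmax(sigma_b)])
```

Applied to an image (or to a row or column intensity profile), values above the returned threshold would be treated as the brighter class.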

Cluster center initialization
The class number and the initial cluster centers are two vital factors for obtaining good clustering performance. Here we set the number of clustering classes to 2. For the initial cluster center of one group, we propose adopting the background gray value k estimated in the image preprocessing step instead of random initialization. For the other group, the mean gray value of a 3 × 3 window at each spot center is selected as the initial cluster center.
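The initialization described above can be sketched in Python as follows. This is a minimal illustration, assuming that the spot center coincides with the geometric center of the rectangular spot cell produced by gridding; the function name `initial_centers` and its arguments are hypothetical.

```python
import numpy as np

def initial_centers(spot_img, background_level):
    """Return (background_center, foreground_center) for one spot cell.

    background_level: background gray value k from preprocessing.
    The foreground center is the mean of the 3 x 3 window at the
    cell center (assumed here to be the spot center).
    """
    spot_img = np.asarray(spot_img, dtype=float)
    r, c = spot_img.shape[0] // 2, spot_img.shape[1] // 2
    window = spot_img[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
    return float(background_level), float(window.mean())
```

Both values are gray levels; when clustering is done in a multi-feature space, the corresponding feature vectors would have to be built in the same way as for ordinary pixels.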

Feature selection
Generally, pixel intensity features are adopted by clustering algorithms, and the classification is realized according to the Euclidean distance between the pixel and the cluster center. If the intensity of a pixel is high, it has a high probability of belonging to the spot, and vice versa. However, misclassification occurs because noise may also have high intensities. Therefore, we implemented additional features, including intensity features, spatial features and a shape feature, as shown in Table 1. To be specific, the intensity features are composed of the pixel intensity, the mean intensity, the intensity standard deviation and the entropy [29]. The spatial features include the coordinates of each pixel, and the Euclidean distance and the city block distance between the pixel and the clustering center. In addition, to describe the similarity between the pixel-centered area and the theoretical spot, a shape feature that computes the correlation coefficient is adopted. In the shape feature, the neighborhood size of the pixel and the size of the Gaussian template are both d, which is obtained from the gridding step, where H and V denote the coordinates of the horizontal and vertical gridding lines, respectively.
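As an illustration of the shape feature, the following Python sketch computes the correlation coefficient between a pixel-centered patch and a Gaussian template of the same size. The template width (`sigma = d / 4`) and the function names are assumptions, since the text does not specify them.

```python
import numpy as np

def gaussian_template(d, sigma=None):
    """d x d Gaussian template; sigma = d / 4 is an assumed default."""
    sigma = d / 4.0 if sigma is None else sigma
    ax = np.arange(d) - (d - 1) / 2.0
    yy, xx = np.meshgrid(ax, ax, indexing="ij")
    return np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))

def shape_feature(patch, template):
    """Correlation coefficient r between a patch and a template of equal shape."""
    p = np.asarray(patch, dtype=float)
    t = np.asarray(template, dtype=float)
    p = p - p.mean()
    t = t - t.mean()
    denom = np.sqrt((p ** 2).sum() * (t ** 2).sum())
    # A flat patch carries no shape information; return 0 in that case.
    return float((p * t).sum() / denom) if denom > 0 else 0.0
```

A patch that looks like a scaled and shifted copy of the template gives r close to 1, whereas an isolated bright noise pixel correlates poorly with the template.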
According to the definition above, a feature vector $F = [r, i, j, D_M, D_{Eud}, I(i,j), \bar{I}(i,j), \delta, E]^T$ of each pixel can be obtained. However, how to use these different features is quite important, and to date no literature has discussed a multi-feature fusion scheme for microarray image segmentation. Therefore, we deal with these features by using PCA, which transforms a set of correlated variables into a set of uncorrelated ones by a linear transformation. According to the covariance eigenvalues output by the PCA operation, the top three transformed principal components are selected. Finally, the Euclidean distance is adopted for evaluating the relationship between the clustering center and each pixel.

Table 1. Features used in our approach.

Type | Form | Description
Shape feature | $r = \frac{\sum_i \sum_j (I(i,j) - \bar{I})(T(i,j) - \bar{T})}{\sqrt{\sum_i \sum_j (I(i,j) - \bar{I})^2 \sum_i \sum_j (T(i,j) - \bar{T})^2}}$ | Correlation coefficient; T is a Gaussian template with variable size d×d, where d represents the estimated spot diameter
Spatial features | $i$ | Row of the pixel
 | $j$ | Column of the pixel
 | $D_M$ | City block distance between the pixel and the clustering center
 | $D_{Eud}$ | Euclidean distance between the pixel and the clustering center
Intensity features | $I(i,j)$ | Pixel intensity
 | $\bar{I}(i,j)$ | Mean intensity
 | $\delta$ | Intensity standard deviation
 | $E$ | Entropy
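The PCA step can be sketched in Python as below. This is a minimal illustration that projects per-pixel feature vectors onto the three principal components with the largest covariance eigenvalues. Standardizing each feature before PCA is an assumption (the features have very different scales), and the function name `pca_top_components` is hypothetical.

```python
import numpy as np

def pca_top_components(features, n_components=3):
    """Project features (n_pixels, n_features) onto the leading components.

    Returns the projected data and the corresponding eigenvalues.
    """
    X = np.asarray(features, dtype=float)
    X = X - X.mean(axis=0)
    std = X.std(axis=0)
    X = X / np.where(std > 0, std, 1.0)   # assumed standardization step

    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return X @ eigvecs[:, order], eigvals[order]
```

The projected vectors are mutually uncorrelated, so the Euclidean distance to a cluster center no longer counts redundant features several times.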

Improved KM and FCM clustering
Since KM clustering is a well-known general algorithm, we mainly introduce the operations of FCM clustering here. FCM clustering is conducted in the following steps.
Step 1: Initialize the membership grade function, where $u_{ij}$ represents the membership degree of pixel j to cluster i.
Step 2: Update the membership values for each pixel, where $d_{ij} = \|x_j - c_i\|$ and $d_{kj} = \|x_j - c_k\|$ denote the Euclidean distances between the feature vector of pixel j and the centers of clusters i and k.
Step 3: Update the cluster centroids, where m is the fuzziness parameter.
Step 4: Compute the cost function.
Step 5: Repeat steps 2-4 until the cost function is minimized.
The detailed pseudocode of the proposed FCM algorithm is as follows.
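The steps above match the structure of the standard FCM iteration. The following Python sketch is an illustration under that assumption and is not the authors' pseudocode: it uses the textbook update $u_{ij} = 1 / \sum_k (d_{ij}/d_{kj})^{2/(m-1)}$, the centroid $c_i = \sum_j u_{ij}^m x_j / \sum_j u_{ij}^m$, and the cost $J = \sum_i \sum_j u_{ij}^m d_{ij}^2$. Starting from given initial centers instead of a random membership matrix, the stopping tolerance, and the function name `fcm` are further assumptions.

```python
import numpy as np

def fcm(X, centers, m=2.0, max_iter=100, tol=1e-5):
    """Standard fuzzy C-means on X (n_pixels, n_features).

    centers: initial cluster centers (n_clusters, n_features).
    Returns the membership matrix U (n_clusters, n_pixels) and the centers.
    """
    X = np.asarray(X, dtype=float)
    C = np.asarray(centers, dtype=float).copy()
    prev_cost = np.inf
    for _ in range(max_iter):
        # d[i, j] = ||x_j - c_i||, clipped to avoid division by zero
        d = np.linalg.norm(X[None, :, :] - C[:, None, :], axis=2)
        d = np.maximum(d, 1e-12)
        # Membership update for every pixel
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=0, keepdims=True)
        # Centroid update with fuzziness parameter m
        Um = U ** m
        C = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Cost evaluated with the distances used for this membership update
        cost = float((Um * d ** 2).sum())
        if abs(prev_cost - cost) < tol:
            break
        prev_cost = cost
    return U, C
```

Each pixel can then be assigned to the cluster with the largest membership, for example with `U.argmax(axis=0)`.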

Adaptive adjustment
After clustering with IKM and IFCM, the situations shown in Fig 3 may still exist. On the one hand, the segmented spots may be surrounded by noise, as in the top row of Fig 3. On the other hand, missing spots appear in different situations, as depicted in the bottom row of Fig 3. To overcome these deficiencies, an adaptive adjustment is a crucial step. Therefore, we put forward a method of noise removal and missing spot estimation based on the spot size. Firstly, an approximate spot size is estimated from the microarray image data. To realize this, we utilize the boundary line information computed in the gridding step [3], namely the coordinates of the boundary grid lines in the horizontal and vertical directions (the latter denoted $b_v$). From these coordinates the size of one spot is estimated, and the rough spot size for one microarray sub-grid image is then obtained. The adaptive adjustment algorithm is executed according to the following steps:
Step 1. Estimate the rough spot diameter s by using the results drawn from the gridding step.
Step 2. Draw a circle of diameter s within each spot area.
Step 3. Consider the recognized foreground pixels (the white pixels) outside the circle as noise and remove them.
Step 4. Count the total pixel number $n_s$ and the recognized foreground pixel number $n_f$ within the rectangular spot area.
Step 5. Count the number of recognized foreground pixels $n_c$ inside the circle area.
Step 6. A spot is considered a missing spot when $n_c < 0.5 \times (3.14 \times s)$ or $(n_c \geq 3.14 \times s)$ & $(n_f > 0.9 \times n_s)$. The former condition indicates that the recognized spot pixels number less than half of the pixels in the circle area. The latter describes the situation in which the recognized spot pixels are at least as many as the pixels in the circle area and the recognized foreground pixels make up more than 90% of the pixels in the rectangular area (see Fig 3). Both conditions indicate that the segmented result is false.
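Steps 2 to 6 can be sketched in Python for a single rectangular spot cell as follows. Two assumptions are made here. First, the counts $n_f$ and $n_c$ are taken on the mask before the noise pixels are removed, because otherwise the second condition could rarely be met. Second, the circle pixel number is counted directly from the drawn circle rather than taken from the expression 3.14×s given in the text. The function name `adaptive_adjust` is hypothetical.

```python
import numpy as np

def adaptive_adjust(mask, s):
    """Clean one spot cell and flag it as missing if the result looks false.

    mask: boolean array for the rectangular spot area (True = foreground).
    s: rough spot diameter estimated from gridding.
    Returns (cleaned_mask, is_missing).
    """
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    yy, xx = np.ogrid[:h, :w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    # Step 2: circle of diameter s centered in the cell
    circle = (yy - cy) ** 2 + (xx - cx) ** 2 <= (s / 2.0) ** 2

    n_s = mask.size                    # Step 4: all pixels in the cell
    n_f = int(mask.sum())              # Step 4: foreground pixels in the cell
    n_c = int((mask & circle).sum())   # Step 5: foreground pixels in the circle
    circle_pixels = int(circle.sum())  # assumed stand-in for the circle area

    # Step 6: too few spot pixels, or nearly the whole cell marked as spot
    is_missing = (n_c < 0.5 * circle_pixels) or (
        n_c >= circle_pixels and n_f > 0.9 * n_s
    )
    # Step 3: foreground pixels outside the circle are treated as noise
    cleaned = mask & circle
    return cleaned, bool(is_missing)
```

For a cell flagged as missing, one plausible follow-up (not stated in the text) is to substitute the drawn circle as the estimated spot region.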

Intensity extraction
Once the spots are segmented into foreground and background parts by our proposed clustering method, the spot expression value can be obtained according to $E = \log_2(I_{cy3}/I_{cy5})$, in which $I_{cy3} = R_{fore} - R_{back}$ and $I_{cy5} = G_{fore} - G_{back}$ indicate the background-corrected spot intensities of the red and green channels, $R_{fore}$ and $G_{fore}$ represent the mean intensities of the foreground pixels, and $R_{back}$ and $G_{back}$ denote the mean intensities of the background pixels.
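The expression value can be computed from a segmented spot as in the following Python sketch. The function name `expression_value` is hypothetical, and returning NaN when a background-corrected intensity is not positive is an assumption, since the text does not say how such spots are handled.

```python
import numpy as np

def expression_value(red, green, fg_mask):
    """E = log2(I_cy3 / I_cy5) for one spot cell, following the definitions above."""
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    fg = np.asarray(fg_mask, dtype=bool)

    i_cy3 = red[fg].mean() - red[~fg].mean()      # R_fore - R_back
    i_cy5 = green[fg].mean() - green[~fg].mean()  # G_fore - G_back
    if i_cy3 <= 0 or i_cy5 <= 0:
        return float("nan")  # assumed handling of non-positive intensities
    return float(np.log2(i_cy3 / i_cy5))
```

For example, a spot with $I_{cy3} = 400$ and $I_{cy5} = 100$ gives $E = \log_2 4 = 2$.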

Image quality assessment
To estimate the performance of our proposed methods, two means are adopted. One is conducting experiments on both the simulation and real images. The other is introducing various quantitative measurements.

Testing data sets
For the experiments on simulation and real images, six real data sets and one simulation data set are adopted, as shown in Table 2. In addition, we used a microarray simulation model (see http://www.gnu.org/copyleft/gpl.html [44]) to generate images of various quality. First, the ScanAlyze toolbox is used to generate a real microarray image, from which some parameters (the gray mean of each spot and the background intensity) are obtained. Then the gray mean of each spot is taken as the input of the simulation model. Meanwhile, the cDNA microarray data can be generated by simulating the hybridization behavior of probes on the slide surface and modifying the general options. To control the quality of the simulated image, model parameters such as the noise model selection, the slide parameters, the hybridization parameters, and the virtual scanner parameters are tuned. Based on the steps above, three types (good, normal and poor) of microarray images are generated.

Quantitative measurement
To comparatively analyze the segmentation results, some quantitative measurements are conducted on the expression level of spots. Firstly, the log differential expression ratio $M = \log_2(I_{cy3}/I_{cy5})$ and the log intensity $A = \frac{1}{2}\log_2(I_{cy3} \times I_{cy5})$ are introduced. In addition, a quality index [49] $q_{index} = (q_{sig-noise} \times q_{bkg1} \times q_{bkg2})^{1/3}$ is used, where $q_{sig-noise} = F_{mean}/(F_{mean} + B_{mean})$ represents the signal to noise ratio, and $q_{bkg1}$ and $q_{bkg2}$ characterize the local background. Here $F_{mean}$ denotes the mean value of the spot, $B_{mean}$ is the mean value of the local background, BSD indicates the standard deviation of the local background, and $bgk_0$ represents the global average of the background. The numbers of pixels clustered as foreground and background for each spot are represented by $N_{fore}$ and $N_{back}$, respectively. Furthermore, the means of the expression values $MI_{cy3}$ and $MI_{cy5}$ for each sub-grid are also introduced. For the simulation data, since the corresponding annotated image is available, we can compute the pixel-to-pixel accuracy $acc = (TP+TN)/(TP+TN+FP+FN)$, the sensitivity $se = TP/(TP+FN)$ and the specificity $sp = TP/(TP+FP)$ (a ratio also known as precision). Here TP denotes the number of pixels correctly segmented as spot, TN represents the number of pixels correctly segmented as background, and FP and FN indicate the numbers of background pixels wrongly recognized as spot and of spot pixels wrongly recognized as background, respectively.
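The M-A values and the pixel-to-pixel measures can be computed as in the following Python sketch. The function names are hypothetical, and sp follows the formula given in the text, TP/(TP+FP). The sketch assumes positive background-corrected intensities and at least one predicted and one true spot pixel, so that no division by zero occurs.

```python
import numpy as np

def ma_values(i_cy3, i_cy5):
    """Return (M, A) for background-corrected intensities (assumed positive)."""
    i_cy3 = np.asarray(i_cy3, dtype=float)
    i_cy5 = np.asarray(i_cy5, dtype=float)
    M = np.log2(i_cy3 / i_cy5)
    A = 0.5 * np.log2(i_cy3 * i_cy5)
    return M, A

def pixel_metrics(pred, truth):
    """Return (acc, se, sp) comparing a segmentation with its annotation."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int((pred & truth).sum())    # spot pixels segmented as spot
    tn = int((~pred & ~truth).sum())  # background segmented as background
    fp = int((pred & ~truth).sum())   # background segmented as spot
    fn = int((~pred & truth).sum())   # spot segmented as background
    acc = (tp + tn) / (tp + tn + fp + fn)
    se = tp / (tp + fn)
    sp = tp / (tp + fp)               # formula as written in the text
    return acc, se, sp
```

For instance, with 2 true positives, 4 true negatives, 1 false positive and 1 false negative, the sketch gives acc = 0.75 and se = sp = 2/3.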

Results and discussion
The proposed clustering based algorithms (IKM and IFCM) are compared with traditional KM, FCM methods and moving K-means (MKM) [3] on six different data sets and one simulation data set.

Image segmentation
The performance of the four methods is compared on 188 real sub-grids drawn from 6 data sets, and randomly selected original images and their corresponding segmented results are shown in Fig 4. It can be seen that a majority of spots are separated from the background by these four methods. Segmentation results on the GEO and SMD data sets are poor due to the low contrast of the original images. In contrast, segmentation results on the SIB data set are better owing to the high contrast of the original images, and FCM performs worst due to its sensitivity to noise. It can be concluded that the performance of these methods depends on image quality. Moreover, the IKM and IFCM algorithms can extract more low-contrast spots than the KM and FCM methods. In particular, the IKM and IFCM methods can recognize missing spots by means of our proposed adaptive adjustment method.
To further compare the performance of the four methods, the corresponding gene expression values for the images in Fig 4 are presented in Fig 5. Generally, gene expression values between -2 and +2 are considered normal data, whereas those less than -2 or higher than +2 may represent noise or special genes. Therefore, we use a larger symbol to mark those special gene expressions, as shown in Fig 5. Fig 5A-5C reveals that gene expression values vary greatly on the GEO, SMD and BCM data sets owing to the low contrast of the original images. At the same time, there are more gene expressions above +2 or below -2, and most of them can be correctly obtained by the IKM and IFCM methods, indicating that the improved methods perform better on poor quality images than the KM and FCM algorithms. For Fig 5D-5F, a majority of gene expressions lie between -2 and +2 due to the high quality of the original images. In other words, almost all the spots are correctly segmented by these four methods.
In addition, the numbers of gene expressions beyond +2 and -2 are counted for the different methods on the 6 data sets, as shown in Table 3. It can be seen that the IKM and IFCM algorithms obtain more special gene expressions than KM and FCM. Especially for the SMD and GEO data sets, some spots cannot be extracted by the KM and FCM methods due to their poor quality. Therefore, many more special gene expressions are obtained by IKM and IFCM.

Quantitative analysis
The scatter plot method [45] is usually used for describing the correlation between two quantities. In general, data points lying close to a straight line in a scatter plot indicate a strong correlation, and points on the diagonal correspond to a perfect correlation with a ratio equal to 1. Here, we adopt the scatter plot to compare the background-corrected spot intensities of the red channel $I_{cy3}$ and the green channel $I_{cy5}$. The scatter plots of the four methods on the DeRisi and SMD data sets are shown in Figs 6 and 7, respectively. Meanwhile, the M-A plot is utilized to display the log differential expression ratio M against the log intensity A of each spot; the results of the four methods on the DeRisi and SMD data sets are shown in Figs 8 and 9. To display the results clearly, we convert the original 16-bit gray values into 8-bit values by dividing the original data by 256. In Figs 6 and 7, all data points are shown in red, and partially selected data points (to clearly display the correlation between $I_{cy3}$ and $I_{cy5}$) are shown in blue and green, respectively. From Fig 6, it can be seen that the results of the four methods are quite similar and a majority of data points are close to the diagonal line. This phenomenon is in agreement with the analysis of gene expression presented in Fig 5. However, Fig 7 reveals quite different results. First of all, the background-corrected intensities of the two channels on the SMD data set are much lower than those on the DeRisi data set (Fig 6); that is, most spot intensities are confined around 15 for SMD, but around 60 for DeRisi. In addition, the IKM and IFCM algorithms obtain more spots with lower intensities compared to the KM and FCM methods (see the circle in Fig 7). In other words, these four methods possess similar performance on spots with higher intensities, while only the IKM and IFCM methods perform well in extracting low-contrast spots.
To further prove the effectiveness of our proposed methods, the mean ($MI_{cy3}$, $MI_{cy5}$), minimum and maximum background-corrected intensities for the two channels are calculated, as shown in Table 4. To display the data clearly, we also converted the data from 16-bit to 8-bit by dividing by 256.
One can see from Table 4 that almost all mean background-corrected intensities (see $MI_{cy3}$ and $MI_{cy5}$) obtained by IKM and IFCM are smaller than those obtained by KM and FCM on the six data sets. The reason is that the proposed IKM and IFCM methods can extract more low-contrast spots than KM and FCM. Meanwhile, the minimum and maximum intensities obtained by IKM and IFCM also exhibit a wider range, which implies that our proposed methods perform better on images of different quality.
In addition, the M-A plots of the four methods on the DeRisi and SMD data sets are shown in Figs 8 and 9, respectively. From these two figures, it can be seen that the four methods perform similarly for spots whose log intensity A is higher than its mean value, but differ remarkably for spots whose log intensity A is smaller than its mean. In other words, for the low log intensity part, IKM and IFCM obtain more spots with M beyond +2 and -2 than KM and FCM (see the circled spots in Fig 8 and the spots beyond the dotted line in Fig 9). Spots with higher intensities are easy to extract, whereas spots with lower intensities are difficult to find. Therefore, the fact that the IKM and IFCM algorithms obtain more spots with lower intensities reveals that these two methods have superior performance. In summary, all these results indicate that IKM and IFCM perform better than KM and FCM, which supports the effectiveness of our improved strategies.

Spot segmentation
Real spots in a microarray image usually have various shapes rather than being perfect circles. Therefore, we choose some spots randomly according to their quality and shape. Examples of these spots and their corresponding segmented binary images are shown in Fig 10. For good quality spots, regardless of their expression level, the IKM and IFCM algorithms extract the spots almost completely, whereas the KM and FCM methods cannot segment the spots completely at low gene expression levels. The proposed methods also perform better than KM and FCM for both normal and poor quality spots. MKM performs better than KM, but worse than IKM. Similarly, our proposed methods yield better segmentation results than KM and FCM on spots of various shapes.
To further evaluate the effectiveness of our proposed methods, a quantitative analysis (pixels segmented as foreground or background, expression level and signal to noise ratio) of the above spots is displayed in Table 5. From Table 5, one can observe that our improved methods (IKM, IFCM) classify more of the required pixels into the foreground ($N_{fore}$) than the KM and FCM algorithms, so that the result is closer to the real spot shape. Furthermore, the spot intensities for the two channels ($I_{cy3}$, $I_{cy5}$) obtained by IFCM and IKM are generally lower than those obtained by KM and FCM, indicating that IKM and IFCM find more pixels with lower gray values around the spot edge. In other words, the KM and FCM algorithms preferentially extract pixels with higher gray values. Moreover, the number of foreground pixels obtained by MKM is similar to that of IKM and larger than that of KM. Finally, the higher values of the signal to noise ratio $q_{sig-noise}$ also indicate the better performance of our proposed methods.

Segmentation results on simulation images
Because the real images lack annotation images, we choose simulation images for further comparison. Fig 11 displays the segmented results of the four methods. It can be seen that all the spots are extracted from the background by all four methods, because the simulation image is simple and of high quality compared to the real images. Meanwhile, as the locally magnified parts show, the segmentation results of KM and FCM clearly contain much noise.
Furthermore, the pixel-to-pixel accuracy, specificity and sensitivity are computed for each spot, together with their corresponding means, on the images in Fig 11, as shown in Table 6. In general, IKM and IFCM have higher values of acc, sp and se than the original KM and FCM, indicating that these two improved algorithms perform better. In addition, the ratios with which they cluster the correct pixels into the foreground (see se in Table 6) are both more than 99%. All these results indicate that our proposed IKM and IFCM algorithms outperform the KM, FCM and MKM methods.

Computational complexity analysis
The computational complexity of KM is O(n), where n represents the number of spot pixels. Compared to KM, FCM involves computing the centroid of each cluster according to its fuzzy membership relation and updating the membership values, so its computational cost is several times that of KM. MKM has only one more pixel reassignment operation than KM, so their computational complexity is similar. For our improved algorithms, the image contrast enhancement, the multi-feature computation, the feature selection by PCA and the refinement are extra operations compared to KM and FCM, so our methods may take about two or three times as long as the original methods. Table 7 illustrates the average computational time of the above five methods on the six data sets. From Table 7 we can see that the same algorithm requires different computational times for different data sets due to the different image layouts and resolutions displayed in Table 2; blocks in the UCSF data set take the minimum time, whereas blocks in the BCM data set require the maximum time. According to the above analysis, we can draw the following conclusions: for good quality images with high contrast, less noise and a sparse spot distribution, IKM can be used for segmentation. For poor quality images with low contrast and more noise, IFCM can be utilized to segment the spots. For images that contain many spots of special shape, IKM performs better than IFCM. Finally, IKM and IFCM are both suitable for normal quality image segmentation.

Conclusions
In conclusion, although current state-of-the-art clustering-based segmentation methods (KM, FCM) are easy, fast and effective, they are prone to be affected by noise and are mostly conducted on a single feature. Therefore, we proposed the IKM and IFCM algorithms by introducing multiple features, namely intensity, spatial, and shape features. Meanwhile, an adaptive adjustment method for the segmentation results is introduced and a cluster center initialization strategy is considered. Experiments on six real data sets and one simulation data set verify that our proposed IKM and IFCM methods perform better than the original KM and FCM. In addition, the quantitative analysis of gene expression, background-corrected intensities and log differential expression ratio versus log intensity further supports the effectiveness of our proposed methods.