A Combinational Clustering Based Method for cDNA Microarray Image Segmentation

Microarray technology plays an important role in drawing useful biological conclusions by analyzing thousands of gene expressions simultaneously. Especially, image analysis is a key step in microarray analysis and its accuracy strongly depends on segmentation. The pioneering works of clustering based segmentation have shown that k-means clustering algorithm and moving k-means clustering algorithm are two commonly used methods in microarray image processing. However, they usually face unsatisfactory results because the real microarray image contains noise, artifacts and spots that vary in size, shape and contrast. To improve the segmentation accuracy, in this article we present a combination clustering based segmentation approach that may be more reliable and able to segment spots automatically. First, this new method starts with a very simple but effective contrast enhancement operation to improve the image quality. Then, an automatic gridding based on the maximum between-class variance is applied to separate the spots into independent areas. Next, among each spot region, the moving k-means clustering is first conducted to separate the spot from background and then the k-means clustering algorithms are combined for those spots failing to obtain the entire boundary. Finally, a refinement step is used to replace the false segmentation and the inseparable ones of missing spots. In addition, quantitative comparisons between the improved method and the other four segmentation algorithms--edge detection, thresholding, k-means clustering and moving k-means clustering--are carried out on cDNA microarray images from six different data sets. Experiments on six different data sets, 1) Stanford Microarray Database (SMD), 2) Gene Expression Omnibus (GEO), 3) Baylor College of Medicine (BCM), 4) Swiss Institute of Bioinformatics (SIB), 5) Joe DeRisi’s individual tiff files (DeRisi), and 6) University of California, San Francisco (UCSF), indicate that the improved approach is more robust and sensitive to weak spots. More importantly, it can obtain higher segmentation accuracy in the presence of noise, artifacts and weakly expressed spots compared with the other four methods.


Introduction
DNA Microarray is a powerful tool for biologists to simultaneously analyze thousands of genes [1]. Microarray technology becomes useful in many fields such as the diagnosis of disease, biomedicine, gene discovery, drug discovery and so on [2]. DNA microarray analysis aims to identify different gene expressions which can be used in studying the function of genes [3]. Generally, microarray technology includes experimental design, RNA and probe preparation, hybridization to DNA arrays, image processing, data analysis, biological confirmation and microarray databases [4][5][6]. First of all, Ribonucleic Acids (RNAs) are isolated from the experimental sample and control sample. Secondly, these extracted RNAs are converted into cDNAs. Subsequently, the mixture of these cDNAs, labeled with fluorescent dyes Cy3 (Green) and Cy5 (Red), is then hybridized to a glass slide. Finally, the slide is scanned with red and green laser and two different channels of array images are obtained [7]. Actually, the difference in fluorescence between these two color channels shows the relative difference of the gene's expression between those two sources (experimental and control). Therefore, the microarray image processing plays an important role in extracting a series of meaningful biological conclusions regarding gene expression.
Generally speaking, microarray image processing is comprised of three major steps: 1) gridding, 2) segmentation and 3) intensity extraction. Gridding aims to separate each spot into a single area by segmenting the image into numerous compartments. Segmentation usually classifies the pixels in a region immediately surrounding the gene as belonging to either the foreground or background domains [3]. Intensity extraction tends to calculate the red and green foreground intensity pairs and background intensities [8]. Thus, a higher precision of each image processing step is needed for extracting accurate gene expression values and meaningful biological application. Especially, segmentation, as a previous step to intensity extraction, imposes a significant effect on the accuracy of image processing.
However, the complex preparation procedure of microarray, including the manufacture of cDNA microarray chip, hybridization of mRNA extracted from the sample, and scanning of chips may introduce serious errors. These facts demand the compensation of defects to the image processing procedures. To the best of our knowledge, an ideal microarray image would be characterized by deterministic grid geometry, known background intensity with zero uncertainty, pre-defined spot shape, and constant spot intensity [9]. Yet the actual microarray images may contain thousands of spots of various sizes, shapes, and intensity levels. They also contain inhomogeneous background and are contaminated by noise and artifacts. Fig 1 exhibits the typical difficulties in microarray image processing.
According to our knowledge, the diverse image content comes from the following three aspects: 1. As for the slides, the numbers of grid lines and spots are various towing to the different chip manufacturers. The image and spot resolutions are also different, such as one image in BCM data set (see Fig 1a) with the image resolution of 4,325×11,612 and spot of 25×25 (pixel). However, the resolutions of image and spot are 1,910×5,550 and 18×18 respectively for one image in SMD data set (see Fig 1f). Moreover, the image quality sometimes is high or poor. Here the poor image quality is defined as that image with lower contrast and may contain noise and missing spots.
2. In terms of sub-grid, there is a non-uniform distribution, e.g., some sub-grids in DeRisi data set (see Fig 1c) are compact while others in SIB data set may be sparse (see Fig 1d). In addition, the sub-grid may contain noises with various types and levels. The missing spots also exist to varying degrees (see Fig 1e).
3. According to spot, its shapes vary from circle to square to triangle, depending on the gene chip companies. If the basic shape is circle, there may exist the "peak-shaped spot, doughnut-shaped spot, egg-shaped spot or volcano-shaped spot" as Fig 1g shown [10]. Furthermore, some spots may be stuck together.
Microarray image reveals that various qualities, as mentioned above, can be attributed to 1) different physical and chemical conditions during construction stage, 2) the laser light reflection, 3) photon noise and electronic noise introduced during scanning, and 4) dust on the glass slide [11]. These impairments will affect the cDNA microarray image formation and make the image processing be indeed complicated and challenging. Moreover, segmentation of spots can be further complicated in the non-uniform shape and surface intensity distribution. To eliminate the processing errors from propagating further down the processing pipeline to the gene expression analysis tasks, more accurate and sophisticated segmentation methods are needed [12].
Actually, each method has its advantages and disadvantages, and there is no one algorithm that can tackle all the microarray image segmentation problems perfectly. What's more, the most recent and state-of-the-art work for automatic image segmentation is clustering based algorithms. Compared to the hierarchical clustering, the partition clustering is simple and it includes several algorithms, such as k-means, k-medoids, k-modes, k-prototypes, fuzzy cmeans and so on. Among those the k-means clustering algorithm, which classifies the objects into k number of group based on features, is the most simple and fast method even though it is sensitive to the noises [2]. However, the k-medoids approach is another classical partitioning method by selecting one point in cluster as representative instead of the cluster center in kmeans [22]. But it suffers from the serious drawback that its performance heavily depends on the initial starting conditions. K-modes approach is an extending of k-means by replacing the means of clusters with modes. Furthermore, k-prototypes method integrates the k-means and k-modes algorithm to cluster the mixed data by using k-modes approach to update the categorical attribute values of cluster prototypes [23]. In addition, the fuzzy c-means clustering method is also introduced to dealing with some uncertainty problems, i.e. they considered that each pixel may belong to more than one cluster [24]. However, fuzzy c-means clustering algorithm strongly depends on the fuzziness parameter and also sensitive to noises. Recently, to minimize the affect of noise, a moving k-means clustering algorithm is proposed based on kmeans clustering by introduced fitness function [25]. Considering that most of the partitional clustering approaches are based on k-means, and moving k-means clustering can minimize the noise affect, this paper introduces an idea of integrating both k-means clustering and moving k-means clustering to conduct the segmentation.
Therefore, to improve the segmentation accuracy, in this article we propose an adaptive segmentation method by combining the k-means and moving k-means clustering methods. This method is unique in the way to enhance the image contrast automatically and well-performed on different data sets.
The rest of the paper is organized as follows. Section II makes an insight into the related works. A framework of the proposed method which combines k-means clustering method and moving k-means clustering method is discussed in Section III. Contrast enhancement is also introduced. Selected cDNA Microarray images from six data sets are used for experiments. Comparisons with the prior art in cDNA image segmentation are also provided in Section IV. Finally, the main conclusions are summarized in Section V.

Related Works
In the last few years, several commercial softwares and freeware packages have been built for microarray image processing. As shown in Fig 2, most of these tools are manual or semi-manual and a series of image processing technologies have been proposed. Fixed circle segmentation has been used in ScanAlyze (Eisen 1999), adaptive circle segmentation has been applied in GenePix (Axon Instruments, Inc. 1999), seeded region growing (SRG) algorithm has been employed in Spot (Buckly 2000), threshold based method has been used in QuantArray (GSI Lumonics 1999), gradient based spot segmentation employed in Dapple, and a histogrambased segmentation method has been applied in ImaGene [26].
However, since a spot's morphology is not always a circle, it is hard for the previously mentioned methods to tackle the real spots accurately and automatically. Therefore, a series of automatic algorithms have been proposed and there is a trend from shape-based segmentation to learning dependent segmentation, as shown in Fig 3. We can classify these methods into the following categories: 1. Thresholding-based algorithms make use of statistical intensity modeling and find the optimal threshold to segment out the spot [27,28], but its performance relies on the appropriate choice of background samples.
2. Edge and shape detection-based methods utilize gradients, snakes and active contours to capture the boundary and region information of spot [13][14]29]. A disadvantage of these methods is that it will give inaccurate results in the presence of noise and artifacts.
3. Morphology-based techniques combine the mathematical morphology operations to realize the spot detection [15,26,[30][31], yet the structure element is designed humanly and the segmentation effect is strongly dependent on the shape of structure element.  4. Intelligent optimization methods (genetic algorithm or neural network) based algorithms make use of iterative operation to assign pixels into spot or non-spot classes [3,7,[16][17]. However, it is prone to be affected by noises and requires parameter presetting or network training.

5.
Modeling-based algorithms rely on the clustering of pixels' values to achieve the spot localization [32][33]. One major drawback of these methods is that it ignores the spatial dependencies among adjacent pixels which will lead to an over-segment of the microarray spots.
6. MRF modeling-based method utilizes the neighboring information, along with the intensity information to segment spots [18][19]34]. However, it requires an initial classification of the pixels and in turn which will affect the final results.
7. Clustering-based algorithms take advantage of K-means, hybrid K-means and fuzzy Cmeans (FCM) to determine which pixels should belong to the spot or background area [20][21][24][25][35][36][37]. Nevertheless, these techniques become inaccurate when the spots have poor contrast or they are closer to each other.
8. Other methods utilize pattern recognition or classification to realize the spot segmentation [38][39][40][41], yet input of parameters is required for these methods.
All aforementioned algorithms are automatic to some extent, and they also need human intervention to define input parameters or to correct the segmentation results. To summarize the discussions made so far, we can draw the observations that 1) segmentation is an important and challenging problem; 2) image quality needs to be improved for extracting gene expression; 3) a lot of segmentation methods have been proposed with some performances better than others; 4) no single segmentation algorithm can meet the demands of all microarray images, and 5) there has been little progress on developing sufficiently fast, efficient but effective algorithms to segment a microarray image by using up-to-date techniques [7]. In addition, some segmentation algorithms are normally designed to perform well on microarray images acquired by certain types of arrayers and scanners. Thus, it is significant to explore a new method for microarray image segmentation.

Materials and Methods
The proposed method mainly consists of five steps: 1) image contrast enhancement, 2) gridding, 3) segmentation, 4) refinement of segmentation, and 5) intensity extraction, as shown in

Image Quality Enhancement
We have conducted a comparison experiment on the influence of contrast enhancement on gridding, as shown in Fig 5. Obviously, the gridding accuracy is greatly increased when the image contrast is enhanced compared to those without enhancement. As the example shown in Considering that the low contrast in microarray image between foreground and background makes it difficult to distinguish one from the other, the contrast enhancement is necessary to highlight important features embedded in the microarray image data. Let f(x,y)(x2 [1,w],y2 [1, h])represent the gray microarray image,w and h represent the width and height of the image, respectively. The 2D signal is first transferred into 1D signal p. At the same time, in order to  only enhance the spots features, the contrast enhanced image g can be gained by ( in which C is the contrast degree estimated automatically by the following equation Where p is the mean value, s means a standard deviation, s 2 represents the mean square error, and s 4 denotes the four-order moment.
Since spots play a fundamental role in microarray image understanding, one good way to enhance the contrast is to enhance the spots. Therefore, we propose a method to estimate the background pixel value k by performing following steps: 1. Randomly select an area A j by a 10×10 rectangular shape window in an image edge region.
For reasonability, choose 3 blocks in each top, left, right and bottom edge region, respectively.
2. Compute the maximum gray value in each area.
3. Adopt the minimum one as background gray value among all maximum gray values.
4. Repeat step 1) to 3) for m times to avoid noise effects, here m = 10, therefore the background pixel value can be defined as Finally, a 3×3 median filter is adopted to reduce noises.

Gridding
The main role of gridding is to divide spots into independent areas, and here we adopt the gridding method previously designed by us [42]. The steps of gridding process are shown in Fig 6. 1. First, the contrast enhanced image g(x,y) is used as input for gridding.

A morphological reconstruction of H0
4. Next, the maximum between-class variance operation with d ¼ m d ðoÀm dt Þ 2 oð1ÀoÞ is developed to look for the optimal threshold. a. Take the horizontal signal H 0 as example, let L denote the maximum gray value and n i represent the number of pixels at gray level i, its corresponding probability P i = n i /h,i2[1, L] and the total mean u ¼ P L i¼1 ip i can be computed.
b. Supposing that t2[1,L] is a threshold and o ¼ P t i¼1 p i describes the occurrence probability of one class divided by the threshold t, the image average variance m 2 d ¼ 1 L P L i¼1 ði À uÞ 2 p i and the class average variance m 2 dt ¼ 1 t P t i¼1 ði À u t Þ 2 p i can be obtained, in which u t ¼ P t i¼1 ip i denotes the first order cumulative moments of the histogram up to t th level.
c. Change the threshold t and recalculate the between-class variance until a maximum d is gained.
5. Subsequently, the horizontal signal H 0 is transferred into a binary signal according to the threshold d. 7. However, some grid lines may locate at the spot even though the spots number l and s computed by our proposed method are correct. Therefore, we put forward a refinement step based on the statistical analysis of the obtained grid lines coordinate data and give some heuristic rules [42].
8. Finally, an accurate gridding of the microarray image can be obtained based on the above mentioned steps.

Segmentation
After gridding, all spots are divided into different parts. Then k-means clustering and moving k-means clustering are executed separately in each spot area. First of all, the feature selection for clustering is crucial. As for each pixel, there are two basic features defined as intensity for the Red channel and Green channel [21]. In addition, the rows of the pixel, columns of the pixel and the Euclidean distance are also introduced for spatial description [20]. Recently, to segment the spot by classification, shape features have also been proposed besides the mean intensity, intensity standard deviation and entropy of intensity features [38]. In this paper, we selected five features as described in Table 1.
Based on these features, the k-means clustering and the moving k-means clustering can be conducted. To take advantage of both algorithms, the final result is defined by To be specifically, the moving k-means clustering is first performed for all spots within their regions and the clustering result g 1 (x,y) can be obtained. Then, count the number of pixels belonging to the foreground N t and background N b in each spot area, respectively. If N t <N b conduct the k-means clustering and obtain another result g 2 (x,y). Next, combine these two results by Eq 4.
The moving k-means clustering method consists of the following steps [4].
1. Extract the spot area according to the gridding result.
2. Select the maximum and the minimum gray value as the original clustering center c 1 and c 2 for class S 1 and S 2 , respectively.
4. Compute the Euclidean distance of each pixel g i to the two clustering center by Δ j = ||g j −c j ||, j = 1,2.
5. If Δ 1 <Δ 2 it indicates that g i is closer to class S 1 , then merge g i into the closer class S 1 , and vice versa.
6. Update the new clustering center c j ¼ 1 N j X i2S j g i . N j represents the pixel number in class S j .

Define a new fitness function and compute the fitness for two clustering centers
8. Compare the two fitness values and denotes the higher and lower one as c h ,c l .
9. If F(c l )<a a F(c h ) and there is g i <c h for the pixel within c h , then classify it into the class c l . Calculate the new clustering center with c h ¼ 1 10. Update α a with α a = α a /2 and repeat step 8) to 9) until F(c l )!a a F(c h ).
13. Finally, the segmented result g 1 (x,y) can be achieved.
The operation of k-means clustering method is similar to the step 1) to 6) of the moving kmeans clustering algorithm, so that we also can obtain another segmented result g 2 (x,y).
In addition, the circle shape segmentation will be carried out when N t <0.3 (N t +N b ), where the diameter of circle is automatically obtained in the gridding step.

Intensity Extraction
After segmentation, each spot area can be located. Then the intensity extraction operation is conducted on the original raw image f(x,y). The reason for extracting the spot intensity on the raw image, instead of the preprocessed image, is that the preprocess step will result in the image information changing. Especially for our contrast enhancement operation which only enhance the spot gray values with those of background remaining unchanged, as described in Eq 1. Take one sub-grid of image 49 from the SMD data set for example, owing to the contrast enhancement that is automatically processed, the contrast of channel 1 is magnified 24 times and channel 2 15 times. If we use the preprocessed image to extract the intensity, a big error will occur when counting the gray value of the foreground. The following steps are defined to extract the spot intensity: 1. Compute the gray value of the foreground and background R t ¼ 4. Repeat step 1) to 3) and compute the average gray value for each microarray image channel, then the spot intensity value can finally be determined by

Performance
Fig 7 exhibits the segmentation results of five different methods on three data sets. It can be seen from this figure that a lot of spots are detected as the whole spot area boundary instead of its real shape in the edge detection method. Similar results are found in the thresholding method. In terms of k-means clustering method and the moving k-means algorithm, the former can segment the spot close to its real edge, while the latter is able to detect the spot with lower contrast. However, there still exists the situation of the discontinuous area and irregular shapes in these two clustering methods. Moreover, all the methods perform better on the DeRisi data set owing to its sub-grid having a dense spot layout. All in all, although the proposed method performs best, some spots still can't be segmented accurately. Fig 8 shows the average segmentation accuracy of these five methods on six data sets. The accuracy is defined as the correct segmentation number of spots/total number of test spots. Clearly, the imprved method has remarkably higher accuracy than the other four methods. Among these algorithms, the edge detection method gains the lowest accuracy on all situations, while the improved method is best performed at all times. In addition, the moving k-means clustering algorithm always performed better than the k-means clustering method. What's more, there is a similar trend on all data sets, which is all the algorithms gain a highest accuracy on the DeRisi data set and lowest on GEO data set. The reason for such lower segmentation accuracy of the other four methods on the GEO data set is that most sub-grids have low contrast. While the sub-grids in the DeRisi data set have the higher contrast. Particularly, our algorithm improved the accuracy for almost 20% after contrast enhancement. All above analysis proved that the contrast enhancement for microarray image was crucial.

Segmentation in the Presence of Noise
To further verify the effect of the improved method we selected some sub-grids at the presence of noises from six data sets. Fig 9 shows the segmentation examples of five methods on three data sets. It can be seen that the edge detection method and the thresholding method classify the noises into spots. What's more, they can't recognize the spots those have much lower contrast. For the k-means clustering method and the moving k-means clustering method, they only locate partial of the spots owing to the effect of noises. The improved method proposed performs better than all the other four algorithms. However, it will be a little bit affected by noises so that some spots are located with extra boundary. Furthermore, Fig 10 exhibits some examples that our improved method fails to obtain the boundaries of spots. For example, a crescent moon shape spot is obtained as shown in Fig 10a  owing to the noises in that region have similar fluorescent with that of spot. Due to noises have higher gray values than spots so they are classified into spots by mistake as shown in Fig  10b,10c,10d,and 10e. In addition, a square shape spot is segmented in Fig 10f because that area is fulfilled with noises. Table 3 exhibits the segmentation accuracy when noise appears. The accuracy of detected spots are divided into incorrectly, marginally and correctly [43], here the marginally, which is different from the definition for gridding, means the spot edge detection is correct but with some holes in it. The segmentation accuracy on the SIB data set dropped a lot owing to its sparse distribution of spots, which makes the algorithm prone to be affected by noise. In addition, owing to the uneven distribution of spot gray values, there are some holes in the detected spots on different data sets. What's more, to perform the quantitative analysis, we selected 10 similar spots from each data set. At the same time, the number of pixels clustered to spots m i is counted and the root mean square error RMSE ¼ 1

Segmentation of Spots with Special Shapes
is also calculated, where m 0 i represents the actual number of spots counted manually.  The average number of pixels which belong to 10 spots is presented in Table 4. It can be observed that the result obtained by the improved method is closer to the standard ones. Of course, there are some errors because the standard number of spots is counted manually, and the randomly selected 10 spots can't represent the whole image. However, the statistical result reveals a tendency in accordance with the former analysis.

Intensity Extraction
To further analyze the validity of different methods, we plot the intensity distribution for one sub-grid drawn from the UCSF data set, as shown in Fig 13, where the intensity for each spot is calculated by Eq 4. Fig 13 reveals that the improved method obtains a best gene expression outcome with less noise because this method located the spot near to its edge. However, the edge detection method exhibits the worst result with lots of unstable intensity values owing to it  divides the background pixels into spots by mistake or vice versa. Meanwhile, the thresholding algorithm also displays the second best stability because it segments the spots with a continuous edge.
In addition, to describe the degree that spot intensity deviates from zero, we calculated the intensity MSE for five methods on six data sets, as shown in Table 5. The improved method possessed the lowest MSE value, meaning that all the intensities are distributed near the region of zero. Because the image in BCM data set contains a large number of noises, there is a higher MSE value. Fig 14 exhibits the average time consumption for the whole process of five methods on six different data sets. This processing time starts from sub-grids input to the intensity extraction. There is a different time consumption trend on different data sets owing to the various resolution and spots number of microarray images. Obviously, less processing time is needed for the edge detection method. In contrast, the improved method and the moving k-means clustering algorithm require much more processing time which might be ten times that of the edge  detection approach. What's more, all the methods spend less time on the SIB data set owing to there is only 35 spots in each sub-grid. However, there are 440 and 1,600 spots in each sub-grid drawn from the BCM and DeRisi data sets, respectively, so they require more run-time. Of course, the processing times for all methods are longer because the experiments are done on the Matlab platform. The time consummation will be remarkably decreased if the algorithm is compiled by C language. In addition, the improved method only combines results of the kmeans and the moving k-means clustering methods simply with "or" operation. We can design a new scheme to evaluate the spot quality and select the appropriate method to segment it, such as some spots segmented by k-means clustering and others by moving k-means clustering, so that the current processing by these two methods simultaneously can be replaced. Certainly, the dealing time will also decline greatly.

Conclusions
Microarray technology has been widely applied in drug design, environmental health research, clinical diagnosis and treatment, and in cancer detection. Image processing is a key step of this technology. What's more, the segmentation step is critical in the processing of microarray image, and the segment accuracy and consistency is also significant. However, owing to the presence of noise, artifacts and laser reflection during microarray experiment procedure, the real microarray image quality varies in spot size, shape, contrast and quantity of missing spots. These dynamics of quality will exert a great challenge to segmentation. In recent years, clustering based approaches such as k-means, fuzzy c-means, and moving k-means, are frequently used in bioinformatics and show better performance than the shape based segmentation ones. However, the conventional clustering based methods tend to face unsatisfactory result when image quality is poor. Therefore, in this paper we proposed an improved method by combining the k-means and the moving k-means clustering methods. Specifically, an automatic contrast enhancement algorithm is used in the pre-processing step to improve the image quality. The segmentation ability of edge detection, thresholding, k-means clustering, moving kmeans clustering and the improved method has been made on six different data sets. Experiment results show that our improved method provides better results than the other four algorithms. It will be more suitable for microarray image segmentation with better performance.
However, it is more time-consuming for about 5 times than that of k-means clustering algorithm. Therefore, optimization of the improved algorithm with less running time may become our forthcoming work, and the GPU-based computing structure for microarray image segmentation will be considered.