Unsupervised morphological segmentation of tissue compartments in histopathological images

Algorithmic segmentation of histologically relevant regions of tissues in digitized histopathological images is a critical step towards computer-assisted diagnosis and analysis. For example, automatic identification of epithelial and stromal tissues in images is important for spatial localisation and guidance in the analysis and characterisation of tumour micro-environment. Current segmentation approaches are based on supervised methods, which require extensive training data from high quality, manually annotated images. This is often difficult and costly to obtain. This paper presents an alternative data-independent framework based on unsupervised segmentation of oropharyngeal cancer tissue micro-arrays (TMAs). An automated segmentation algorithm based on mathematical morphology is first applied to light microscopy images stained with haematoxylin and eosin. This partitions the image into multiple binary ‘virtual-cells’, each enclosing a potential ‘nucleus’ (dark basins in the haematoxylin absorbance image). Colour and morphology measurements obtained from these virtual-cells as well as their enclosed nuclei are input into an advanced unsupervised learning model for the identification of epithelium and stromal tissues. Here we exploit two Consensus Clustering (CC) algorithms for the unsupervised recognition of tissue compartments, that consider the consensual opinion of a group of individual clustering algorithms. Unlike most unsupervised segmentation analyses, which depend on a single clustering method, the CC learning models allow for more robust and stable detection of tissue regions. The proposed framework performance has been evaluated on fifty-five hand-annotated tissue images of oropharyngeal tissues. Qualitative and quantitative results of the proposed segmentation algorithm compare favourably with eight popular tissue segmentation strategies. Furthermore, the unsupervised results obtained here outperform those obtained with individual clustering algorithms.


Introduction
The automatic segmentation of cells and tissues in digitized histopathological images is an essential stage in computer-assisted analysis. Over the years, various approaches have been proposed integrating computer vision and machine learning tools, e.g. [1,2]. Ideally, a first step is to partition the image into distinct regions; followed by machine learning methods that automatically detect patterns within those regions conforming to given histological models in order to guide further analysis and enable enable 'intelligent' procedures that respond to image contents.
There has been a growing interest in developing automatic routines for segmentation and classification of tissue compartments in images of H&E stained histological sections obtained by light microscopy. Several studies emphasise that determining accurately the spatial distribution of such tissues is essential for extracting results of prognostic value, e.g. cancer growth and progression and tumour microenvironement [3]. Therefore analysis of the sample makeup (typically epithelium and stroma in the samples investigated here) will serve for identification of specific areas (e.g. epithelial cancerous tissues) as well as the tumour progression in digital images. However, the large variability and complexity of histological samples makes this a challenging task, especially for the H&E stained images which, in addition, exhibit an element of variability in staining uptake.
There is a rich body of literature on automated segmentation of tissue compartments, particularly those represented by epithelial and stromal tissues. Current approaches utilize supervised segmentation methods (e.g. [4][5][6][7]) where predictive models are built from labelled training data to predict the classes of novel unlabelled data. While such techniques have delivered promising results, they have a number of limitations. Firstly, those methods often require large amounts of hand-labelled region annotations (delineated histological components by experts) for training purposes. In practice, obtaining detailed annotations on digital histopathology images is challenging and time-consuming. Secondly, most approaches exploit traditional segmentation strategies (e.g. they use fixed-size square windows or are pixel-based) to partition the image into multiple binary subregions prior to applying the predictive models. Such low-level image segmentation approaches can yield poor quality results (e.g. over/undersegmented images) and this affects the classification accuracy. Thirdly, the supervised learning models are unable to perform well in on-line learning settings, where only one data pass is allowed, that is on the raw unlabelled data. On-line learning is often needed in a variety of data stream problems, including real-time decision making and resource-constrained learning [8]. These problems sometimes arise in histopathological imaging analysis, where an instant decision is required on novel unlabelled data. From the machine learning perspective, these problems are often amenable to the application of unsupervised learning models.
Cluster analysis is an unsupervised task that seeks to partition unlabelled data samples into homogeneous clusters based on assumed similarity measures. In histological imaging, clustering is particularly useful as an exploratory tool as it can provide information about the hidden image structures that may support anatomical and diagnostic models (e.g. [9][10][11]). Unsupervised recognition approaches are attractive because, unlike supervised methods, they do not require predefined image data annotations, making clustering powerful in scope and potentially useful in the current context of histological imaging.
The machine learning literature covers several different types of clustering techniques, each with its own strengths and weaknesses; often these different approaches, when applied to the same dataset, will give rise to varying segmentation results, depending on the algorithms used. Furthermore, some clustering algorithms have been shown to be sensitive to initialisation parameter changes so that again, they may lead different cluster results are obtained for the same data when those parameters are adjusted [12,13]. Although re-sampling and cross validation techniques help in optimising model parameters and assessing the stability, it is hard to find the "best" algorithm for a given data set. To overcome this, recent approaches have suggested collecting results from multiple clustering algorithms in a repository known as the 'cluster ensemble', and combining the multiple results into one robust solution. This approach is known as consensus clustering (CC) [14][15][16], and lends itself to imaging applications (e.g. [17,18]). However, to the best of our knowledge, its application to histopathological imaging, and particularly to the segmentation of tissue compartments, remains to be fully explored.
In this paper we present an unsupervised (data-independent) strategy for segmenting images into epithelial and stromal tissues, in haematoxylin and eosin (H&E) stained sections from tissue micro-array (TMA) cores. The first step is to partition the image into distinct regions; this is followed by the use of unsupervised learning models, based on the CC method, to automatically detect tissue compartments in images without the necessity for prior annotation of training data (hand annotated images) that is often required in supervised models. A block-diagram with an overview of the proposed method is presented in Fig 1. The contributions of the proposed framework can be summarized as follows: 1. We introduce a novel initial segmentation algorithm based on mathematical morphology [19], to partition images into binary regions or tiles representing potentially relevant image segments. Unlike most of the current segmentation methods, our approach reduces the image complexity while capturing some intended abstracted model of reality (i.e., virtualcells, each enclosing a potential nucleus).
2. We utilize colour features which quantify the distribution of the dye uptake (haematoxylin, eosin and a residual channel) within those segmented regions, instead of exploiting the standard red, green, and blue components of the image. We also present a method that extracts and combines morphological features from the segmented virtual-cells as well as their enclosed nuclei.
3. We exploit a consensus clustering framework for the unsupervised recognition of epithelial and stromal regions to provide a robust unsupervised identification of different tissue compartments. This is in contrast to most of the existing unsupervised histological segmentation approaches that utilize a single clustering method, which affects the reliability and robustness of the results (e.g. [9][10][11]). Two consensus functions are considered here, the Evidence Accumulation Clustering (EAC) [14] and the voting-based consensus function (e.g. [16]). We also suggest an ensemble selection procedure for selecting an effective cluster ensemble based on diversity measures.

4.
We propose a implementation for the voting-based consensus method, based on image processing operations, to generate a consistent labelling scheme among the base clustering outcomes. A preliminarily version of the voting-based algorithm applied to other segmentation methods, not covered in this paper, will be presented (with different findings) at a forthcoming conference [20].
This paper is organised as follows: Section 1 gives the background and briefly describes relevant previous work. Section 2 gives a brief description of the data used and preparation process. Sections 3, 4 and 5 describe our unsupervised segmentation, feature extraction and CC methods, respectively. Experimental results are presented in section 6 and discussed in section 7. Section 8 concludes the study.

Related work
The various approaches used in histological image segmentation depend on the nature of the analyses required and the features needing identification, e.g. nuclei, cells, higher hierarchical structures such as glands, tissue types where cells reside (compartments), the relations they hold and any histological or diagnostic types they might represent. The attributes of those histologically-relevant regions are often quantified on the basis of size, shape, colour intensity and texture. For this purpose, classical segmentation techniques are often based on pixel-based routines, e.g., intensity thresholding to identify differentially stained structures as foreground objects [21]. More advanced algorithms (such as the marker watershed, hidden markov models, grow-cut, seeded region growing and geodesic active contour) have been used to target more complex scenarios [22][23][24][25]. Algorithms have also been used to address the problem of partial occlusion and merged nuclei, e.g. [26]. Image segments resulting from these segmentation steps can be characterized using multi-dimensional feature sets, which in turn can be fed to supervised or unsupervised learning models in order to determine their nature, identify abnormalities or group them into different histological types or grades of disease. Comprehensive surveys of image segmentation techniques are presented in [1,2].
The automated identification of stromal and epithelial regions within H&E tissue images has recently received increased attention in the histopathological analysis community owing to pervasive use of this type of material in diagnostic pathology. Current approaches often exploit traditional binary segmentation to select patches from large images prior to feature extraction. Then they apply supervised learning models which attempt to discriminate between the classes of interest by learning from a large training data set (predefined handannotated histological images). In [4], texture features (based on local binary patterns) and a SVM classifier have been exploited for the automated segmentation of epithelial and stromal tissues in TMAs of colorectal cancer. The initial segmentation was performed by downscaling the images by 50% and then partitioning it into square blocks. Binary classification of epithelial and stromal tissue was then performed by the SVM model which processes the blocks independently. Hiary et al. [5] proposed a Bayesian voting-based model for automated segmentation of stromal tissue from immunohistochemistry (IHC) images of breast cancer. The study utilized colour-based texture features extracted from predefined square image blocks. A feature learning based on deep convolutional neural networks was presented in [6] to automatically segment and classify epithelial from stromal tissues in H&E and IHC images of breast and colon cancer. Unlike other approaches that are based on hand-crafted feature representation, such as colour and texture, the DCNN used a deep architecture to learn high-level features in a data-driven fashion. The extracted features were used to construct a supervised classifier for discriminating the two types of tissues. However, deep learning models generally require large amount of training data and therefore it is computationally expensive to train. Wang et al. [7] present a pixel-based supervised segmentation method, using texture features, which segments H&E tissue images into four classes of tissue morphologies, including tumour, stroma, lymphoid/inflammatory cells/necrosis and background.
Several approaches in the context of unsupervised segmentation have used clustering analysis. In [9] a hierarchical self-organizing map clustering algorithm was exploited to segment H&E stained images of prostate tissues and identify four tissue clusters (glands, epithelium, stroma, and cell nuclei) based on colour and textural features. Chen et al. [10] used a multilevel fuzzy c-means clustering algorithm to identify cell nuclei, stroma and fat-like regions that enabled the detection of potential cancerous cells in virtual slide images. Naqvi and Garibaldi [11] studied unsupervised learning methods as objective replacements for breast cancer grading using principal component analysis for data dimensionality reduction, followed by fuzzy cmeans clustering algorithm for automatic determination of cancer grades.
Owing to the high complexity of histological images, unsupervised segmentation using clustering methods remains a challenging task because performance often relies on single clustering algorithms and these tend to behave differently with various types of images. Here we suggest an improved unsupervised histological segmentation that aggregates the strengths of several individual clustering algorithms. In this so-called Consensus Clustering (CC) approach, the results of different clusterings are combined into a single, more robust and stable solution. A thorough summary of the existing CC approaches is presented in [15].
While CC has received attention in the machine learning community, it has not been efficiently exploited in the context of unsupervised histological imaging. Khan et al [17] proposed a framework for random projections of data features with ensemble clustering and applied it to pixel-level classification of tumour vs. non-tumour regions in breast cancer images. In this case, multiple random projections of features were used to generate multiple clustering results from the low dimensional representation of features and a consensus function combined the partitions and generated a final consensus result. However, the consensus utilised a single pixel-level clustering algorithm (the standard k-means [27]). Cooper et al. [18], used the CC model to aggregate tumours into groups based on their morphological signatures derived from the analysis of cell images.

Tissue samples, digitisation and preprocessing
Our analysis was performed on images of H&E stained tissue cores from oropharyngeal cancer TMAs prepared at the Institute of Cancer and Genomic Sciences, University of Birmingham, UK (the research was carried out under REC ethics reference 10/h1210/9, field Permit Institute for Head and Neck Studies and Education (InHANSE), University of Birmingham, UK). H&E is the routine staining method used in diagnostic microscopy; the haematoxylin dye is primarily taken up by nucleic acids (staining the nuclei blue/violet) and the eosin acts as counterstain and stains in pink protein-rich material in the intra-and extra-cellular compartments. TMA slides were scanned using an Olympus BX50 microscope (Olympus Optical Co. Ltd, Tokyo, Japan) with a x20 magnification objective (N.A. 0.5, resolution 0.67 μm), attached to a QImaging Retiga 2000R camera with a liquid crystal RGB filter (Surrey, BC, Canada). A motorised stage (OASIS Glide XY Scanning Stage, Objective Imaging, Cambridge, UK) and a Prior H122 (Prior Scientific, Cambridge, UK) motorised focus drive were controlled by the Surveyor (Object Imaging) platform which scanned the image and automatically stitched multiple fields. Before stitching, each field was background-corrected using the light transmittance as the ratio of transmitted light through the specimen and the incident light, scaled by the bit depth of the image (for each of the RGB components). The individual core images were approximately 3300 × 3300 pixels in size and the inter-pixel distance was 0.367μm. We selected fifty-five images to be used in the analysis (ten images were used for the optimization of model parameters and forty-five images were used for testing). The selected images had substantial variation in their appearance as well as their tissue distribution (2.3% to 98.8% of epithelium tissue component and 25.5% to 83.2% of background out of the whole image).
The images were cropped to remove uninformative background areas around the tissue cores. The purpose of this procedure was to reduce the burden on the computational processing required for the subsequent tasks. This was achieved by using a thresholded version of the images, followed by a morphological opening by reconstruction, so that only the large portions of stained tissue are represented as foreground. Then the binary image was scanned from the boundary regions inwards until foreground pixels were found. Their coordinates were then used to define a new frame for the image and the image cropped.

Initial segmentation using mathematical morphology
This section describes our segmentation based on Mathematical Morphology (MM) routines [19], that partitions an image I (Fig 2(a)) into a number of binary segments or 'virtual-cells' (vcells) represented in image S (Fig 2(g)). The algorithm, denoted here as "vcells-MM", can be described in four main steps as follows: 1. Colour deconvolution. This procedure separates the contribution of the two (H&E) dyes to exploit the differential stain uptake to identify the different structures in the image, in a similar way to how human observers interpret stained sections (e.g. nuclei are more strongly stained by haematoxylin than cystoplasmic and extracelluar stromal material, etc.). Each pixel in I has colour information represented as red, green, and blue channels as captured by the imaging equipment. Colour deconvolution [28] can process the RGB information from two or three combined dyes (in our case, haematoxylin and eosin) into separate 'stain channels' (see Fig 2(b) and 2(c)) if the original RGB components of separate dyes are known, assuming they combine subtractively (i.e. they are light absorbing dyes following the Beer-Lambert law [29]). In the case of two-dye stains, a third component represents a residual channel of the deconvolution process. The results of the deconvolution process can then be combined into a "stain" RGB image here denoted I Ã (see Fig 2(d)), where the R, G and B image channels hold the light transmittance of the haematoxylin, eosin and residual images, instead of containing their red, green, and blue components. Note that this image (I Ã ) is used for the forthcoming feature extraction step.

Generation of the tissue mask.
A set of operations is applied to I Ã to generate a mask representing the stained tissue regions (i.e. the foreground of the image, Fig 2(e)). This is given by the minimum of haematoxylin and eosin channels. Huang's fuzzy global thresholding [30] is then applied to obtain a binary mask and exclude the background. Unlike in other segmentation techniques (e.g. watershed [31], superpixels [32,33]), which segment the whole image (background and foreground) and result in a large number of regions being processed, our method ensures that the subsequent partitioning process is only applied on the image foreground, thereby reducing the number of segments to process and improving the performance of the procedure.
3. Extraction of nuclear markers (seeds). This is achieved by finding the regional minima (dark regions) in the haematoxylin image (Fig 2(b)) using greyscale reconstruction (see [34]) for extracting dark "basins" of a given depth h in the greyscale space. The procedure is also known as h-concave transform [19] and is obtained by subtracting the original image from the h-minima transform. These regions are dark 8-connected regions of up to the given depth h, surrounded by strictly higher greyscale value pixels. Note that h controls the amount of detail in the image. A large h (deep basins) results in detection of darkly stained structures (in this case cell nuclei) while small h results in excessive amount of detail detected, leading to over-segmentation. The value of h was tuned experimentally over the range of [10, 100] using a cross validation procedure in order to find an optimum value. The binary image containing the set of nuclear seeds is shown in Fig 2(f). Small regions attributable to noise (which would lead to over-segmentation) were removed using a binary opening by reconstruction. This consists of deleting regions that disappear after a number of morphological erosions e and it is achieved by eroding the mask image e times and reconstructing the original using the eroded image as seeds. This advantage of this over traditional morphological opening is that it deletes small regions without smoothing the remaining ones.

Partitioning of the tissue mask into v-cells.
In this step, mask pixels are assigned to exclusive zones of influence for each seed; using the watershed transform [35,36]. The segmented image S (see Fig 2(g)) is finally saved for further processing. Note that the size and the number of the generated v-cells is controlled by the value of h, as discussed above.

Feature extraction and selection
The purpose of this step is to extract a set of features from the segmented regions in S to perform clustering using a machine learning framework. As shown in previous works (e.g. [4][5][6][7]10]), colour and morphological features are carriers of important structural information in histological images. While H&E is not a stoichiometric staining method, the differences in dye uptake across histological structures achieved with standardised laboratory protocols is sufficient to enable humans to recognise features unequivocally in relation to the differences in the composition of tissues, e.g. identifying different tissue compartments in a section. We investigated the distribution of dye uptake in the v-cells using eleven descriptors for each dye (haematoxylin, eosin and residual channels) from image I Ã . However, colour features alone were not sufficient to obtain high clustering accuracy. Therefore, we selected a set of morphological descriptors for the v-cells as well as the nuclear markers (seeds) located inside the v-cells, ensuring that the morphological features obtained from each v-cell is correctly combined with those from their enclosed nucleus. This was achieved by first labelling each v-cell with a unique value and then assigning the same label to the enclosed nucleus. Experimental observations showed that inclusion of morphological features alongside colour features resulted in superior clustering results than using colour features alone. For the feature extraction, we used an Ima-geJ plugin in [37] for estimating various statistics of binary 8-connected segmented regions. A brief description of the used features is provided in Table 1.
There is a large number of possible features to be extracted from our images, so we used a feature selection algorithm applied to a validation set (ten H&E labelled images) in order to remove the redundant and non-discriminative features. In particular, we used the Weka 'feature subset evaluator' (CfsSubsetEval) [38], to compute the worth of a subset of features by considering their relevance with respect to the prediction process along with the degree of redundancy between the features. The optimal set consisted of sixty three features and these are presented in Table 1.

Consensus clustering (CC) framework
Consensus Clustering (CC) is an elaboration of the classical clustering problem that aims to address the variable performance of single clustering algorithms under the assumption that the consensual opinion of a group of solutions is more reliable than the opinion of just one. In this approach, results of various clustering solutions are combined into a single consensus partition without accessing the features or algorithms that were used to obtain the individual clusterings. This improves the robustness, stability and quality of the unsupervised segmentation. The CC framework consists of three steps: (a) ensemble generation receives the input dataset and returns an ensemble of cluster solutions; (b) in ensemble selection, an effective sub-set of cluster solutions is chosen from the ensemble based on their diversity measure, and finally (c) the consensus function combines the solutions into a single, more robust clustering result.

Ensemble generation
Let X = {x 1 , x 2 , . . ., x n } be a set of n v-cells in the binary segmented image S. Note that each x i is defined over the 63 features described earlier. In this study, a clustering algorithm takes X as an input and groups the n segments into (at least) two clusters corresponding to stroma and epithelium regions, forming a data partition L.

Ensemble selection
The level of diversity between the ensemble members has been identified as an important factor in the performance of consensus clustering [43,44]. Sometimes the generated ensemble includes a set of identical or irrelevant partitions. An ensemble of duplicate partitions will not outperform the individual ensemble members and hence there would be no need to apply a CC method. Conversely, an ensemble incorporating a few significantly inconsistent partitions (poorly correlated with respect to the rest of the ensemble) may lead to unreliable consensus output. The problem of generating a better performing cluster ensemble, using diversityrelated heuristics, has recently received attention in the scientific community (e.g. [43]). In [44] it was shown that better consensus results are obtained when ensembles have moderate diversity among their partitions. Here we explored the dissimilarity among ensemble members to obtain an effective cluster ensemble, denoted _ E, from the original pool of partitions E. The procedure aims at maintaining a moderate level of diversity among the partitions by pruning out significantly inconsistent partitions as well as identical or low diversity partitions and it is as follows: Given clustering solutions L i in the original ensemble E, to decide whether to include L i in _ E we measure how well L i agrees with the general trend contained in the original ensemble. The diversity of a given partition L i is calculated using the average Rand Index (RI) [45] between L i and each clustering solution L j contained in E, where i = 1, Á Á Á C, as follows: where (L i , L j 2 E) and (i 6 ¼ j). The RI counts pairs of points where two clusterings agree or disagree and it is calculated as: where TP and TN are the number of point pairs that are grouped in the same cluster, and in different clusters, respectively, under both L i and L j . FP and FN are the number of point pairs that are grouped in the same cluster under L i but not under L j , and in same cluster under L j but not under L i , respectively. The RI lies between 0 and 1, where 1 implies two partitions agreeing perfectly, and 0 that they completely disagree. Two thresholds D 1 and D 2 are defined corresponding to the minimum and maximum accepted level of diversity among the partitions. If L i exhibits an acceptable level of diversity with respect to the rest of the population in E (i.e., larger than D 1 and less than D 2 ) then it is considered as an eligible voter and is added to the new ensemble, if the opposite applies then it is taken out from the new ensemble, as shown below, and replaced with another solution that satisfies the criteria. Note that it is not always possible to find another partition of the same algorithm that satisfies the criterion. In this case, the partition is removed without replacement. The total number of selected partitions in _ E is denoted here as _ C, where _ C C, _ E is formed as follows, It is worth mentioning that we tried applying the CC methods (to be discussed below) on the original ensemble E by skipping the ensemble selection step, however, this yielded lower clustering accuracy when compared to the case of exploiting the new ensemble _ E.

Consensus function
The goal of the consensus step is to find an optimal data partition, denoted here as L Ã , which represents the CC based on the information contained in _ E. For this we use two consensus functions described below.

Evidence Accumulation Consensus (EAC) function.
The EAC algorithm [14] uses a voting mechanism that considers the co-occurrences of pairs of patterns in the same cluster as votes for their association. Note that the estimation of voting between partitions is deemed nontrivial owing to the usage of arbitrary cluster labels. To this end, the EAC uses a pairwise inspection approach to construct a new similarity measure between patterns through an n × n consensus matrix, defined as M ij ¼ m ij = _ C, where m ij is the number of times the pattern pair (i, j) are grouped together in the same cluster and _ C is the number of cluster solutions in ensemble _ E. Note that the consensus matrix M is a symmetric matrix of real numbers within the range from 1 (perfect consensus among the partitions) down to 0 (no association). The entries of M are treated as similarities and in order to produce L Ã we run another clustering algorithm on M. Here the Agglomerative Hierarchical Clustering (AH) algorithm was applied to extract L Ã from M. Ideally, L Ã should be consistent with the clustering results in ensemble _ E and exhibit robustness to small variations in results. The clustering output is represented in another image, namely _ S.

Voting-based consensus function.
Unlike in classification tasks, labels resulting from clustering methods are symbolic, and therefore an individual partition (clustering solution) in _ E includes clusters that do not necessarily have labels that correspond to other clusters in different partitions of _ E. The voting-based consensus function initially addresses the cluster label mismatch problem, and uses a simple majority voting technique to find the L Ã that optimally summarizes _ E. The label mismatch is defined as the problem of finding the optimal relabelling of a given partition L i with respect to a reference partition L j . This problem is commonly formulated as a weighted bipartite matching problem [16], and it is solved by inspecting whether data patterns in two partitions share labels more than with other clusters.
In this paper we propose a re-labelling algorithm (based on image processing tools) to solve the labelling mismatch problem among the partitions in _ E. Recall that data points in each partition represent image regions in the binary segmented image S. Labels resulting from the different partitions in _ E are assigned to the corresponding image regions in S using two different colours. The labelled regions are displayed in two images denoted here as IMG L i and IMG L j for partition L i and L j , respectively. As a result of label mismatching, however, a pair of correlated clusters from different partitions may be assigned different labels in the images. Our target is therefore to permute the cluster labels so as to maximise agreement amongst the labels within each partition of _ E. Specifically, epithelial tissue regions should have the same label across all partitions in _ E, and hence appear in similar colours in their corresponding image regions, and the same applies to the stromal regions.
The algorithm basically exploits the fact that pairs of individual clusters from different partitions, L i and L j match when the majority of their pixels overlap, in IMG L i and IMG L j , respectively. To this end, individual clusters, denoted here as c and c 0 , found in partitions L i and L j are visualized in two separate binary images, denoted here as IMG L i c and IMG L j c0 , respectively.
The algorithm then estimates the degree of overlapping/similarity between the two binary images, using the Jaccard Index (JI) evaluation measure [45] as follows: The JI value lies within the range of [0, 1], where 1 means a perfect correspondence between the two images and 0 implies zero correspondence. For every cluster c in L i we compute JI obtained against the other clusters in L j . Then, we inspect the cluster of the highest JI, if it has a different label (c 6 ¼ c 0 ) then the match is achieved by swapping the labels in the image IMG L j and therefore the labels in L j . The process is repeated until the labels in the ensemble partitions are matched with the labels in the reference partition L i . Note that L i remains unchanged throughout the re-labelling process. The now aligned labels for all partitions are combined into a final consensus partition L Ã via a majority voting technique. In exceptional cases, where the numbers of votes are equal we select the vote of the partitions that produce the highest total similarity (RI) with respect to the ensemble (Eq (2)). As before, L Ã will be represented in image _ S. Note that the suggested Voting-based significantly outperforms the existing EAC in terms of execution time. This is because, while EAC and Voting-based have complexity functions of order O(n 2 ) and O(K 2 ), respectively (where K is the number of classes), in our case n reached as high as 4000 in some images, but in every case K = 2.

Experiments and results
The purpose of the following experiments is two-fold. First, using the same initial segmentation technique (vcells-MM) we illustrate how the CC algorithms (EAC, Voting-based) improve the accuracy of clustering, compared to individual clustering approaches. Second, we evaluate the impact of using our proposed vcells-MM methodology on the performance obtained against other popular segmentation approaches. In this paper, all imaging procedures and machine learning algorithms were implemented on the ImageJ platform [46] using the WEKA data mining JAVA libraries [47]. We used a computer with an Intel R core(TM) i7-4790 CPU running at 3.60GHZ, with 32GB of RAM and 64-bit Linux operating system.

Clustering evaluation methods
The quantitative evaluation of clustering results is problematic owing to the label correspondence problem. Therefore, the analysis does not necessarily provide a clear interpretation/ labelling of (in our case) tissue types. Because of this unavoidable limitation and for reasons of qualitative evaluation, a total of fifty-five images were manually labelled by one of us (GL), with a background in Oral Pathology, into epithelium, connective stroma and background areas to serve as a gold-standard (denoted here as R).
Five external cluster measures were used which quantify the similarity between our clustering result _ S and the gold-standard R. However, the difference in nature and number of data patterns in R (pixels) and in _ S (v-cell regions) does not allow a direct quantitative comparison between them. To overcome this problem, we generated a further set of gold-standard images (here denoted G) from R by assigning to each of the generated v-cells in the binary image S the corresponding tissue type found in R. This procedure unifies the number of the compared units (v-cells) among the images (G and _ S) and facilitates the quantitative comparisons. However, on a few occasions a single v-cell in S incorporates more than one tissue type from the corresponding image R, and in such cases we selected the most prevalent type as the label. The evaluation measures used here are as follows: • The Rand Index (RI): This is defined in Eq 1 and it compares the final clustering solution given in our segmented image _ S with a known reference partition (gold-standard G), that is generally based on the class labels associated with the data.
• Precision: It denotes the number of true positive results divided by the number of all positive results as follows: • Recall: It is the number of true positive results divided by the number of positive results that should have been returned as follows: Precision estimates the quality, whereas recall estimates the quantity. In particular, high precision denotes more relevant results returned by the algorithm than irrelevant ones, whereas high recall denotes that the algorithm returned most of the relevant results [48].
• F1-score: It is a weighted average of the precision and recall and it is defined as: F1-score is usually more significant than the RI, especially if you have an uneven class distribution. It reaches its best value at 1 and worst at 0.
• Jaccard Index (JI): This is defined in Eq 4 and it is estimated here by dividing the number of pixels in the intersection of two images ( _ S and G) by the number of pixels in their union.

Comparison with individual clustering methods
This section compares the unsupervised CC frameworks (EAC and Voting-based) against five standard clustering algorithms (previously used in the ensemble generation step) namely, kmeans, LVQ, EM, MDB and AH. Firstly, for each scanned image I we applied the vcells-MM algorithm which partitioned I into n v-cells and produced the binary segmented image S. According to a cross-validation procedure performed on a validation set of ten images, we set the two segmentation parameters h and e to 30 and 3, respectively. As mentioned before, the parameter h controls the number of generated v-cells. Secondly, we extracted a set of colour and morphological features described in section 4 from each v-cell in S as well as their enclosed nuclear markers. The extracted features were passed to the five clustering algorithms (mentioned above). We fixed the number of clusters to two in all experiments. The k-means and the EM algorithms were run 10 times with different initialisation parameters. The LVQ, MDB and AH were run 4, 1 and 6 times, respectively. The number of seeds in k-means and EM were randomly chosen from the range [10,200]. The AH algorithm was used with two link types (Complete and Mean) and three different distance functions (Euclidean, Manhattan and Minkowski). We also tried other link types (single, centroid and average methods), but they returned poor clustering results. Learning rates in the LVQ algorithm were set at the values of 0.05, 0.07, 0.09 and 0.1. The ensemble generation process yielded a total of 31 clustering solutions, stored in the initial pool of cluster solutions, E. Then, we applied the diversity selection strategy to form another better performing ensemble _ E, which pruned the significantly inconsistent partitions and also reduced the number of identical clustering solutions. The D1 and D2 acceptance thresholds were set to 0.5 and 0.9, respectively. Finally, we applied the two discussed consensus functions (EAC and Voting-based) which produced one consensual robust clustering solution L Ã .
Quantitative and qualitative results of the EAC and Voting-based methods as well as the five individual clustering approaches are presented in Table 2 and Fig 3, respectively. To evaluate the performance of the individual clustering algorithms (Table 2), we evaluated the mean evaluation measures as well as the standard deviations across the forty-five test images for each algorithm run, then we estimated the average results across the multiple runs. Performance of the CC methods was evaluated by calculating the mean evaluation measures and the standard deviations across the forty-five test images. S1 Table provides a detailed version of the quantitative results for all the applied algorithms.
For display purposes, in Fig 3 we randomly selected a single clustering output (out of the multiple runs) to represent the performance of the compared clustering algorithms. S1, S2, S3 and S4 Figs provide a detailed qualitative results for four example H&E stained tissue images. Each figure illustrates the initial segmentation steps used to partition a single tissue image into binary tiles. This is followed by showing all the clustering results obtained by the applied algorithms. We also provide in S6, S7, S8, S9, S10, S11, S12, S13 Figs and S2 Table a minimal anonymized data set with the resulting images in full resolution, as used in the study. The source code of our proposed methods is provided as a supporting file in S1 Source-code.
The overall result demonstrates the superiority of the EAC and Voting-based frameworks over the individual clustering approaches in detecting epithelial and stromal tissue regions. In particular, the highest accuracy was achieved by the EAC and Voting-based approaches (80% and 81% in terms of F1 score (respectively) and 77% in terms of RI for both). In terms of the comparison between EAC and Voting-based algorithms, Table 2 shows that the performance of Voting-based is slightly better than EAC, but more importantly (based on experimental observations) it is significantly faster due to its lower computational complexity. S5 Fig com-pares the time performance of the EAC against the Voting-based method across the forty-five tested images. The average time required by the EAC and the Voting-based to estimate the consensus clustering L Ã was 85.88 and 9.99 milliseconds, respectively.
Besides the discriminative ability, the visual comparisons provided in Fig 3 show that the EAC and Voting-based algorithms exhibit more consistent performance over individual clustering methods, as illustrated by lower standard deviations of the different evaluation measures. In particular, although most of the individual clusterings provide acceptable output across the displayed images, they all failed to perform well in at least one case. By contrast, the CC methods which show more stable clustering results.

Comparison with other segmentation methods
This section compares results of the colour-based clustering performance using the vcells-MM segmentation with those from other eight commonly used segmentation methods. The Table 2. Performance evaluation of the consensus clustering methods (EAC and Voting-based) compared against five individual clustering approaches in terms of mean RI, Precision, Recall, F1-score and JI along with standard deviations (±) across the forty-five tested images. In all algorithms initial segmentations are performed via our vcells-MM method. Best RI, F1-score and JI results are marked in bold font. compared methods utilize various segmentation approaches and they can be categorised into three groups: region-based, pixel-based and histogram-based segmentation.

Measure EAC Voting-based k-means
In region-based segmentation, regions are constructed by grouping homogeneous neighbouring pixels into regions. We assessed the performance of our method against the watershed [31], waterfall [49] and superpixels [32,33].
The watershed transform takes its inspiration from geography. Given an irregular landscape flooded by water, the watersheds of a landscape define lines (dams) which divide the catchment areas (basins) that are filled with water. There are algorithms that implement this idea in the digital domain (e.g. [50]), although the classical watershed often leads to over-segmentation.
The waterfall algorithm is a hierarchical segmentation method [49] that aims to reduce the over-segmentation of the watershed transform. An initial watershed segmentation is processed iteratively to achieve increasingly simplified partitions by removing watershed dams between basins that are surrounded by higher basins. Thus, the importance of a basin dam line is considered with respect to its neighbouring basins. The process may be repeated until a final single region is obtained. The initial watershed segmentation was applied to the greyscale gradients of the original image. Then two consecutive levels (iterations) of the waterfall algorithm were applied, denoted as levels L1 and L2. Subsequent levels were also tried, but they returned grossly under-segmented results deemed not suitable for our purposes.
In superpixels segmentation, image pixels with similar colour and spatial properties are also grouped into atomic regions, so-called superpixels, which reduces the complexity of pixel grid images. Here we used an advanced version of the superpixels segmentation, namely Simple Linear Iterative Clustering (SLIC) framework [32,33] which partitions the original image I into a set of compact and relatively uniform superpixels. The method allows the size and compactness of the resulting superpixels to be adjusted, providing some control over the number of superpixels generated. In our experiments, we used the size that provided the best clustering results on the validation set. We used the recently proposed jSLIC [51], a Java implementation of SLIC that is faster than the original version in [32].
In the three aforementioned methods, initial segments were firstly extracted from image I to generate a binary segmented image (see Fig 4). Colour features of the segmented regions were extracted and passed to the subsequent unsupervised recognition procedures. Note that, unlike the earlier experiment, we excluded morphological features because many of the resulting segments obtained with the other various methods did not represent comparable histological components across epithelial and stromal compartments (e.g. superpixels and pixel-based). Therefore, some binary segments can exhibit marked heterogeneity, which can affect the subsequent analysis. For fair comparisons among the segmentation approaches, clusterings were performed using just the k-means algorithm and not the CC methods, because in this comparison we wanted to highlight the advantages of the initial vcells-MM segmentation method alone. The number of randomized seeds used in k-means was fixed at 50 in all experiments.
The pixel-based segmentation method is based on a k-means pixel clustering using colour features (values from the RGB channels). Each pixel in the output image represents the cluster to which the original pixel was assigned. For this, we used the ImageJ pixel-based k-means clustering plugin [52].
Histogram-based segmentation methods are among the simplest and the most commonly used techniques in image segmentation. They operate on pixels based on the image intensity histogram. We tested the performance of three histogram-based segmentation methods and compared their segmentation results with the ones obtained by our approach. In particular, we tested the performance of the multi Otsu thresholding technique [53] (an extended version of the Otsu's method [54]), which maximises the between-class variance of pixel intensity to partition the image. The extended multi Otsu thresholding generates multiple thresholds from an image. Note that our images contain three classes of (background, epithelial tissue and stromal tissues). We used the ImageJ multi Otsu Thresholding plugin available in [55] to segment the grey level images into three classes.
We also compared our results against a fuzzy-based thresholding approach known as 'nothreshold segmentation' [56] where grades of membership to C classes are estimated using a global probability density function, and the mode of the regularised histogram corresponds to one class of pixel. Here, we used three classes corresponding to our image regions of interest.
This procedure produces three images, each presenting the grades of membership of every pixel to the three classes. The final segmentation is obtained by applying probabilistic relaxation to the images and defuzzification to the grades of membership. An ImageJ plugin for this purpose is available in [57]. Unsupervised morphological segmentation of tissue compartments in histopathological images Finally, the maximum entropy multi-thresholding technique had also been assessed against our segmentation method. It is a generalization of the single maximum entropy method that is presented in [58].
Note that owing to the complexity of our images, the waterfall segmentations required extremely long computation times. To improve this, we downscaled the original image I by 50% before applying all the above eight segmentation methods. For all segmentation methods, comparisons were done in terms of Jaccard Index obtained against the rescaled (by 50%) goldstandard image R. The JI was averaged across the forty-five test images for each segmentation method. The region-based methods are known for being computationally expensive, for this we compared the total execution time (in seconds) and the total number of segments generated for the region-based as well as the pixel-based methods.
Quantitative results (presented in Table 3) show that the unsupervised segmentation performance obtained by our vcells-MM framework outperforms the other segmentation methods. In particular, for the region-based segmentation methods and with respect to the mean JI, the vcells-MM approach achieved an average relative performance improvement of 19%, 28%, 25% and 13% over the watershed, waterfall L1, waterfall L2 and superpixels, respectively. It also maintained the lowest mean number of segments as well as a reasonable execution time. For the pixel-based segmentation, our method achieved an average relative performance improvement of 28% over the k-means pixel clustering. For the histogram-based methods our algorithm attained a performance improvement of 25%, 23%, 76% over the multi Otsu, no-threshold and maximum entropy multi-thresholding, respectively.
Qualitative results are presented in Figs 4 and 5. Fig 4 shows the binary segmented images of six example cases of H&E core tissue images. The segmented images are resulting from our vcells-MM method (second column) as well as the rest of the compared region-based segmentation approaches (discussed above). Fig 5 shows the output for the detection of epithelial and stromal regions for the same six core tissue images. The first row (from the top) shows the original scanned image, the second row is the gold-standard and the third row shows our segmentation results. The remaining rows depict the results obtained by the various methods discussed above.
As shown, our vcells-MM segmentation provides an efficient initial binary partitioning which removes the background area while maintaining an effective segmentation of the core tissue image. Note that, the watershed yields an excessive number of segments which leads to high spatial complexity in image and therefore poor accuracy (JI). Furthermore, the vcells- Table 3. Performance evaluation of our vcells-MM segmentation compared against eight existing commonly used segmentation methods. The compared methods utilize various approaches including region-based segmentation, pixel-based segmentation and histogram-based segmentation. For all methods performance are evaluated in terms of mean Jaccard Index (JI) along with standard deviations (±) across forty-five tissue images. The table also reports the mean number of segments generated, as well as the mean execution time (in seconds) for region-based and pixel-based segmentations. Best JI result (the vcells-MM) is marked with bold font. Unsupervised morphological segmentation of tissue compartments in histopathological images MM segmentation produces more natural regions than the ones produced by the superpixels (with more compact and uniform segments). Regarding the detection of epithelial and stromal regions our method avoids the over-segmentation results (occurring in the watershed, pixelbased methods, Otsu thresholding and no-threshold segmentation) as well as the under-segmentation results (observed in the waterfall L2 and the maximum entropy multi-thresholding).

Discussion
The proposed unsupervised framework for segmentation of epithelial and stromal regions in H&E images offers several advantages over existing methods. Unlike the current supervised segmentation approaches, our data-independent segmentation doesn't require predefined training data of hand-labelled region annotations. However, it could be argued that the average recognition accuracy obtained here (F1 score of 80-81%) is somewhat lower than the average accuracy reported with supervised methods (e.g. [4][5][6][7]). This is due to the fact that clustering methods do not take advantage of any predefined manual annotations. So while they can be used to alleviate the burden of the predefined manual annotation, this is often at the cost of slightly inferior accuracy. Segmentation of microscopy images independently (or in absence) of a manually-annotated image is particularly needed in real-time systems where a large variety of images with undefined content and no gold-standard need to be segmented. Unlike the individual clustering algorithms, the use of CC enabled a more robust detection of tissue regions in our data set. The procedure attempts to emulate real-life strategies which often arise when experts do not agree about a particular problem. In such cases, experts meet to discuss all known aspects of a case in order to reach a consensus. As illustrated in Fig 3, it is difficult to select one best clustering method that is consistently superior across all images and although the k-means algorithm showed promising results in most cases, it failed to perform well in the third case from the top. By contrast, although the LVQ method produced poor results in most cases, it provided good results in at least four cases. These results also illustrate the difficulty of standardising clustering algorithms and the advantages of CC in the context of histological image unsupervised analysis.
In contrast to the compared region-based segmentation methods, the proposed segmentation approach helps to decrease the spatial complexity of the image while appearing to retain important information about tissue compartments. This also improves the visualization and hence the interpretation of the image. It is also computationally less expensive when compared to the popular watershed method.
Although the histogram-based and the pixel-based segmentation methods do not require prior binary initial segmentations, they neglect the spatial information in the image as they only rely on the intensity of pixels. This fact explains their poor segmentation results attained in our experiments. Furthermore, the histogram-based algorithms are sensitive to noise and often fail in images with no defined distribution peaks.

Conclusion
We proposed an integrated framework for unsupervised segmentation of epithelial and stromal tissues, in haematoxylin and eosin (H&E) stained sections from tissue micro-array (TMA) cores. This involves an initial segmentation algorithm, based on mathematical morphology techniques, to partition the digitized images into binary virtual-cells. Colour and morphological features were extracted and selected for the discrimination between the two tissue compartments. This is achieved by utilizing a selective Consensus Clustering (CC) technique, which considers the consensual opinion of a group of clustering algorithms to provide a single more reliable and stable clustering result. For this, we exploited two CC functions, the EAC and the voting-based. For the latter, we introduced a label matching technique which imposed consistency to the different base clustering outcomes.
The strength of the proposed framework is derived from the fact that (a) it provides a data annotation-independent recognition model, (b) it facilitates the interpretation and visualization of the image by reducing its spatial complexity, while retaining its essential histopathological contents, (c) unlike other standard segmentation models our framework avoids over-/ under-segmentations, (d) the CC algorithm incorporates the capability of multiple clustering models and hence provides more robust unsupervised identification of different tissue compartments, and (e) qualitative and quantitative results tested on a set of forty-five hand-segmented H&E stained tissue images verified that our segmentation method outperforms in accuracy and stability current individual clustering approaches. It also compares favourably with commonly used segmentation methods.
Supporting information S1 Table. Detailed quantitative results of the individual and the consensus clustering approaches. The table displays the average evaluation measures (across the forty-five tested images) for each run of the individual clustering methods. Performance is evaluated in terms of Rand Index, Precision, Recall, F1-score and Jaccard Index. (PDF) S2 Table. Results table with