Hierarchical Information Clustering by Means of Topologically Embedded Graphs

doi:10.1371/journal.pone.0031929

Figure 1.

A schematic overview of the construction of the bubble tree.

(i) An example of a Planar Maximally Filtered Graph (PMFG) made of nine vertices and containing three separating 3-cliques. (ii) Each separating 3-clique has its own vertex set. (iii) The separating 3-cliques identify four planar sub-graphs called “bubbles”, each with its own vertex set. (iv) The graph can be viewed as a “bubble tree” made of four bubbles connected through three separating 3-cliques.
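The notion of a separating 3-clique can be illustrated with a minimal sketch. It assumes that a 3-clique is "separating" when removing its three vertices disconnects the rest of the graph. The helper names and the five-vertex toy graph are hypothetical and are not taken from the paper; the brute-force search is only meant for small examples.

```python
from itertools import combinations


def build_adjacency(edges):
    """Turn an undirected edge list into a dict of neighbour sets."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def is_connected(adj, nodes):
    """Check whether the sub-graph induced by `nodes` is connected."""
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes


def separating_3_cliques(adj):
    """List the triangles whose removal disconnects the rest of the graph."""
    found = []
    for a, b, c in combinations(sorted(adj), 3):
        if b in adj[a] and c in adj[a] and c in adj[b]:
            rest = set(adj) - {a, b, c}
            if rest and not is_connected(adj, rest):
                found.append((a, b, c))
    return found


# Toy maximal planar graph (hypothetical, not the nine-vertex graph of Fig. 1):
# K4 on {0, 1, 2, 3} plus vertex 4 placed inside the face (0, 1, 3).
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (4, 0), (4, 1), (4, 3)]
toy_cliques = separating_3_cliques(build_adjacency(edges))
print(toy_cliques)  # [(0, 1, 3)]
```

In this toy graph the single separating 3-clique (0, 1, 3) splits the graph into two bubbles, one with vertices {0, 1, 2, 3} and one with vertices {0, 1, 3, 4}.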


Figure 2.

Illustration of the DBHT technique.

(i) Construction of the directed bubble tree, where directions are given to the three separating 3-cliques (from Fig. 1) according to the largest weight (see Eq. 1). In this example there are two converging bubbles. A unique set of vertices can be associated with each of the two converging bubbles, where vertices shared by both converging bubbles are assigned according to the largest strength (Eq. 2). (ii) All the other non-assigned vertices are associated with the cluster with minimum average shortest path length (Eq. 3). (iii) The vertex set is uniquely divided into two clusters, respectively associated with the two converging bubbles. (iv) The hierarchical organization and the clustering structure can be represented with a dendrogram.
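Step (ii) can be sketched as follows. Eq. 3 is not reproduced in this listing, so the sketch makes two loudly stated assumptions: path length is counted in hops on an unweighted, connected graph, and the average is taken over all vertices already assigned to a cluster. The function names and the six-vertex path graph are hypothetical.

```python
from collections import deque


def hop_distances(adj, source):
    """Breadth-first search: number of edges from `source` to each reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def assign_by_mean_path_length(adj, clusters):
    """Attach each unassigned vertex to the cluster that is closest on average.

    Assumes a connected graph, so every cluster member is reachable.
    """
    assigned = set().union(*clusters.values())
    result = {}
    for v in sorted(set(adj) - assigned):
        dist = hop_distances(adj, v)
        mean = {name: sum(dist[m] for m in members) / len(members)
                for name, members in clusters.items()}
        result[v] = min(mean, key=mean.get)
    return result


# Hypothetical path graph 0-1-2-3-4-5 with two seed clusters at its ends.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
clusters = {"A": {0, 1}, "B": {4, 5}}
assignment = assign_by_mean_path_length(adj, clusters)
print(assignment)  # {2: 'A', 3: 'B'}
```

Vertex 2 is on average 1.5 hops from cluster A and 2.5 hops from cluster B, so it joins A; vertex 3 is the mirror case and joins B.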


Figure 3.

Demonstration that the DBHT technique can outperform other state-of-the-art clustering techniques, namely: k-means++ [29], Spectral clustering via Normalized cut on k-nearest neighbor graph (kNN-Spectral) [30], [31], Self Organizing Map (SOM) [32], and Qcut [33].

The figures report the adjusted Rand indexes [34] for the comparison between the ‘true’ partition embedded in the artificially generated data and the partition retrieved by the clustering methods. In these examples there are eight clusters of 5 elements each and one cluster of 64 elements. The plots report average values over a set of 30 trials. The horizontal axis reports the gap between average intra- and inter-cluster correlations, which becomes smaller when the noise increases. (a) Normally distributed correlated datasets with added Normal noise, with a noise parameter varying from 0 to 4. (b) Log-Normally distributed correlated datasets with added power law noise, with a noise parameter varying from 0 to 0.1.
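For readers unfamiliar with the score reported in these plots, the adjusted Rand index can be computed from pair counts, as in the minimal sketch below. It uses the standard pair-counting formula rather than any code from the paper, and the function name and the four-element example are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index of two partitions given as equal-length label lists."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs of elements placed together in both partitions, in the first, in the second.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = pairs_true * pairs_pred / comb(n, 2)  # agreement expected by chance
    maximum = (pairs_true + pairs_pred) / 2
    if maximum == expected:
        return 1.0  # both partitions are trivial and identical
    return (pairs_both - expected) / (maximum - expected)


same_up_to_renaming = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
one_cluster_split = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
print(same_up_to_renaming, round(one_cluster_split, 4))  # 1.0 0.5714
```

The score is 1 for identical partitions regardless of how the clusters are labelled, and it drops to 4/7 when one of the two true clusters is split in two.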


Figure 4.

Demonstration that the DBHT technique can detect clusters at different hierarchical levels outperforming other established linkage methods.

The synthetic data are generated via a multivariate Gaussian generator with added power law noise. (a) Input correlation for a synthetic data structure with nested hierarchical clustering with 4 ‘large’ clusters, containing 8 ‘medium’ clusters, containing 16 ‘small’ clusters. (b) Dendrogram associated with the DBHT hierarchical structure. (c) Dendrogram associated with the Average linkage. (d) Dendrogram associated with the Complete linkage.
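The two linkage methods used as baselines in panels (c) and (d) differ only in how the distance between two clusters is defined. The sketch below is a naive agglomerative procedure written for illustration, not the implementation used in the paper; the function name and the four-point distance matrix are hypothetical.

```python
from itertools import combinations


def agglomerate(dist, linkage="average"):
    """Naive agglomerative clustering on a symmetric distance matrix.

    Returns the list of merges as (cluster_a, cluster_b, height) tuples.
    """
    def between(a, b):
        pairwise = [dist[i][j] for i in a for j in b]
        if linkage == "complete":
            return max(pairwise)  # farthest pair of members
        return sum(pairwise) / len(pairwise)  # mean over all pairs of members

    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters that are currently closest.
        height, a, b = min(
            ((between(a, b), a, b) for a, b in combinations(clusters, 2)),
            key=lambda item: item[0],
        )
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
        merges.append((a, b, height))
    return merges


# Hypothetical example: four points on a line at positions 0, 1, 5 and 7.
positions = [0, 1, 5, 7]
dist = [[abs(p - q) for q in positions] for p in positions]
average_merges = agglomerate(dist, "average")
complete_merges = agglomerate(dist, "complete")
print(average_merges[-1])   # ((0, 1), (2, 3), 5.5)
print(complete_merges[-1])  # ((0, 1), (2, 3), 7)
```

Both methods merge the same pairs here, but the final merge sits at height 5.5 under average linkage and at height 7 under complete linkage, which is why the two dendrograms have different shapes even on the same data.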


Figure 5.

Comparison between the clusterings obtained via: (a) the DBHT technique, (b) the best Qcut and (c) the best kNN-Spectral on the iris flower data set from Fisher [38].

The labels inside the symbols correspond to the three different types of flowers: (s) Iris Setosa; (v) Iris Versicolour; (g) Iris Virginica. The shapes of the symbols correspond to the clusters retrieved by the different clustering techniques.


Figure 6.

Average adjusted Rand index comparing the performance of the clustering algorithms k-means++, Qcut, kNN-Spectral and DBHT on the benchmark data sets collected by de Souto et al. [40] (k++ indicates k-means++).

The relatively high performing “Parameter given” results refer to cases when the true number of clusters is given to the algorithm as input. In all the other cases the number of clusters is computed by using internal validity measures. (a) Affymetrix data; (b) cDNA data.


Figure 7.

Adjusted Rand indexes for each sample in the de Souto et al. [40] datasets.

(Top) Performances for each dataset when the true number of clusters is given as input. (Bottom) Performances for each dataset when the number of clusters is computed by using internal validity measures. (Left) Affymetrix data; (Right) cDNA data.


Figure 8.

Comparison between the clusters obtained with the DBHT method and the clusters obtained from the kNN graph with Qcut for the optimal Q, for the dataset Yeoh-v1 Affymetrix [40].

(a) Correlation matrix, ordered according to the ‘known’ clustering structure of the Yeoh-v1 data. (b, c, d) Insets: correlation matrices ordered according to Qcut, kNN-Spectral and DBHT respectively. The clusters are indicated on the bottom with color bars. (b, c, d) Main plots: results for Qcut, kNN-Spectral and DBHT respectively, where the ‘gold standard’ clusters for the Yeoh-v1 data (as given by de Souto et al. [40]) are depicted as vertices of different shapes: square or circle. The computed clusters are instead depicted in different colors, shown both in the graphs and in the color bars on the bottom of the correlation matrix. One can note that, although the kNN-Spectral technique gives a very good agreement with the ‘gold standard’ provided by de Souto et al., the structure extracted by the DBHT method gives a very clean clustering partition that is clearly revealed in the visualization of the corresponding correlation matrix in the inset of (d).


Figure 9.

Sample-cluster structure for 96 malignant and normal lymphocyte samples from Alizadeh et al. 2000 [44]. The labels inside the symbols correspond to the different sample types as listed in the legend.

The DBHT technique retrieves 11 sample-clusters here represented with different symbols (see legend). The underlying network is the PMFG from which the clustering has been computed.


Table 1.

Survival rates of cancer patients with the DLBCL type of lymphoma. The patients are divided into four groups corresponding to the four sample-clusters containing DLBCL obtained with the DBHT technique (see Fig. 9).


Table 2.

Number of up-regulated (on the left) and down-regulated (on the right) expression profiles for each group of clones with known physiological roles as reported in Ref. [44].


Figure 10.

Expression profiles for six significant gene-clusters obtained by the DBHT method.

Left: Heat map of gene expression profiles for the clusters of genes. Each row represents the expression profile from a clone, and each column represents a sample. The samples are organized according to the DBHT hierarchy as shown on the dendrogram on the top. Significant gene-clusters are highlighted with different colors as follows (from top to bottom, colours online): Red - gene-cluster ‘44’ (significant for sample-cluster ‘1’); Green - gene-cluster ‘109’ (significant for sample-cluster ‘4’); Blue - gene-cluster ‘1’ (significant for sample-cluster ‘5’); Black - gene-cluster ‘4’ (significant for sample-cluster ‘7’); Magenta - gene-cluster ‘125’ (significant for sample-cluster ‘9’); Yellow - gene-cluster ‘102’ (significant for sample-cluster ‘11’). The same color scheme is used on the bottom of the heat map to denote the corresponding sample-clusters. Right: Mean expression profile for each gene-cluster together with the expression profiles of noteworthy genes for each sample-cluster. The x-axes report the gene clusters. The boundaries of the relevant sample-cluster for each gene-cluster are indicated with the vertical dashed lines.
