ClueNet: Clustering a temporal network based on topological similarity rather than denseness

doi:10.1371/journal.pone.0195993

Fig 1.

Illustration of how a raw temporal dataset (left) is modeled as a dynamic network (right).

One parameter is the length of the temporal window during which interactions are aggregated. In our illustration, this parameter value is one week (note that weeks begin on Monday and end on Sunday). For example, the network snapshot for week 1 (January 1st through January 7th) will aggregate interactions between nodes A and B, B and C, and C and D. Another parameter is the minimum number of events that must occur between the same nodes within the given time window in order to link these nodes in the corresponding snapshot of the dynamic network. This parameter is set to one in this example.
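The two construction parameters described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `build_snapshots`, its argument names, the toy events, and the choice of 2018 as the year (picked only because January 1, 2018 falls on a Monday) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import date


def build_snapshots(events, start, window_days=7, min_events=1):
    """Aggregate timestamped events into per-window edge sets.

    events: iterable of (node_u, node_v, day) tuples, where day is a date.
    start: first day of the first window (hypothetical parameter).
    window_days: length of the temporal window, in days.
    min_events: minimum number of events needed to link two nodes.
    """
    counts = defaultdict(Counter)
    for u, v, day in events:
        index = (day - start).days // window_days  # which window the event falls in
        edge = tuple(sorted((u, v)))               # treat edges as undirected
        counts[index][edge] += 1
    return {
        index: {edge for edge, n in edge_counts.items() if n >= min_events}
        for index, edge_counts in sorted(counts.items())
    }


# Toy data (assumed, not taken from the paper).
toy_events = [
    ("A", "B", date(2018, 1, 2)),
    ("B", "C", date(2018, 1, 3)),
    ("C", "D", date(2018, 1, 6)),
    ("A", "B", date(2018, 1, 9)),
]
snapshots = build_snapshots(toy_events, start=date(2018, 1, 1))
print(snapshots)
```

With a one-week window and a threshold of one event, the first snapshot links A and B, B and C, and C and D, as in the caption's example. Raising `min_events` to two would leave both toy snapshots without edges, since no pair interacts twice within the same week.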

Fig 2.

Illustration of (a) static and (b) dynamic graphlets.

(a) All nine static graphlets with up to four nodes. (b) All dynamic graphlets with up to three events. Multiple events along the same edge are separated with commas. For the purpose of illustration, only the smaller static and dynamic graphlets are shown, but larger graphlets are also used. The figure originates from [5].

Fig 3.

Illustration of the existence of dynamic graphlets D₁, D₂, and D₉ in a toy dynamic network with three snapshots.

Dashed lines denote instances of the same node in different snapshots. Colored lines denote what it means for D₁ (blue), D₂ (green), and D₉ (red) to exist in a network. The figure originates from [5].

Fig 4.

Summary of ClueNet.

Fig 5.

Pairwise edge overlaps between the snapshots of (a) social Enron and (b) biological aging-related dynamic networks.

The darker the color, the higher the edge overlap between the given snapshots. For the Enron data, the following network construction parameter values are used: t_w = 2 months and w = 2, but the results are similar for the other tested parameter values. Equivalent results for the other two social networks (hospital and high school), which are similar to the Enron results, are shown in S1 Fig.
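The caption does not state how edge overlap is computed. As a hedged sketch only, the following Python code assumes the Jaccard index (shared edges divided by the union of edges) as the overlap measure; the function names and the toy snapshots are hypothetical.

```python
def edge_overlap(edges_a, edges_b):
    """Jaccard overlap between two edge sets (an assumed measure)."""
    union = edges_a | edges_b
    if not union:
        return 0.0  # convention chosen here for two empty snapshots
    return len(edges_a & edges_b) / len(union)


def overlap_matrix(snapshots):
    """Pairwise overlaps between all snapshots, as a nested list."""
    return [[edge_overlap(a, b) for b in snapshots] for a in snapshots]


# Toy snapshots (assumed, not taken from the paper).
toy_snapshots = [
    {("A", "B"), ("B", "C")},
    {("A", "B")},
    {("C", "D")},
]
for row in overlap_matrix(toy_snapshots):
    print(row)
```

A heat map of such a matrix would show darker cells for snapshot pairs that share more edges, which is the kind of view the figure describes.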

Fig 6.

The fit of each method’s partition to the ground truth partition(s).

The fit of each method’s partition to the ground truth (GT) partition(s), for (a) social Enron, (b) social hospital, (c) social high school, and (d) biological aging-related dynamic networks, with respect to topological (D-GDV) similarity versus interaction denseness (modularity). The methods are ClueNet (its three versions: C-ST, C-D, and C-C), Louvain (L), Infomap (I), Hierarchical Infomap (HI), label propagation (LP), simulated annealing (SA), and Multistep (M). In a given panel, a method is good if its partition is in the same quadrant as the ground truth partition, i.e., if the two partitions agree in showing high or low D-GDV similarity and high or low modularity scores. In panel (a), only the three ClueNet versions match both the high D-GDV similarity and the low (close to 0 but positive) modularity scores of the ground truth partition. In panel (b), all three versions of ClueNet are closer to the ground truth partition than the existing methods. The Louvain method is missing from panel (b) because it did not produce any output for this network. In panel (c), all methods mimic well both the high D-GDV similarity and the high (positive) modularity scores, but the three ClueNet versions are the closest to the ground truth partition, along with simulated annealing and Multistep. All five of these methods produce the exact same partition, so some of their points have been shifted slightly up/down or left/right to make all five methods visible. In panel (d), there are four ground truth partitions, depending on which aging-related ground truth data is considered (BE2004, BE2008, AD, or SequenceAge; Section Data). For three of the four ground truth partitions, only the three versions of ClueNet match both the high (positive) D-GDV similarity and the low (close to 0 but positive) modularity scores.

Fig 7.

The ranking of all DNC methods used in this study.

The ranking of the methods (ClueNet (its three versions: C-ST, C-D, and C-C), Louvain (L), Infomap (I), Hierarchical Infomap (HI), label propagation (LP), simulated annealing (SA), and Multistep (M)) over all considered (a) social datasets (i.e., the three ground truth partitions corresponding to the three social dynamic networks; the first column) and (b) biological datasets (i.e., the four ground truth partitions corresponding to the biological aging-related dynamic network; the second column) with respect to precision, recall, and AMI combined (F-score is excluded here because it is redundant with precision and recall). Each row corresponds to one of the three versions of ClueNet that is compared to the existing methods: C-ST (top), C-D (middle), and C-C (bottom). The ranking is expressed as the percentage of all cases (across all ground truth partitions and all three partition quality measures) in which the given method yields the k-th best score across all methods. We rank the methods based on their p-values (i.e., the smaller the p-value, the better the method); in case of ties, we compare the methods based on their raw partition quality scores. The ‘N/A’ rank signifies that the given method did not produce a statistically significant partition under the given partition quality score.
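The ranking rule in this caption (smaller p-value first, ties broken by the raw score, ‘N/A’ for non-significant partitions) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the significance threshold `alpha=0.05`, the assumption that a higher raw score is better, and the toy numbers are all hypothetical.

```python
from collections import Counter, defaultdict


def rank_case(results, alpha=0.05):
    """Rank methods for one (ground truth, quality measure) case.

    results: {method: (p_value, raw_score)}.
    alpha: assumed significance threshold; the caption does not give one.
    """
    significant = {m: r for m, r in results.items() if r[0] < alpha}
    # Smaller p-value first; on ties, the higher raw score wins.
    # Methods tied on both keep their input order (sorted() is stable).
    ordered = sorted(significant, key=lambda m: (significant[m][0], -significant[m][1]))
    ranks = {m: i + 1 for i, m in enumerate(ordered)}
    for m in results:
        ranks.setdefault(m, "N/A")  # not statistically significant
    return ranks


def rank_percentages(cases, alpha=0.05):
    """Percentage of cases in which each method obtains each rank."""
    tallies = defaultdict(Counter)
    for results in cases:
        for method, rank in rank_case(results, alpha).items():
            tallies[method][rank] += 1
    return {
        m: {rank: 100.0 * n / len(cases) for rank, n in c.items()}
        for m, c in tallies.items()
    }


# Toy cases (assumed values, not results from the paper).
toy_cases = [
    {"C-ST": (0.001, 0.9), "L": (0.001, 0.5), "I": (0.2, 0.4)},
    {"C-ST": (0.01, 0.8), "L": (0.001, 0.6), "I": (0.3, 0.1)},
]
print(rank_percentages(toy_cases))
```

In the first toy case the two significant methods tie on p-value, so the raw score decides the order; in the second, the smaller p-value wins outright.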

Fig 8.

Detailed method comparison.

Detailed method comparison results for the social Enron (left) and biological aging-related (right) dynamic networks, quantifying the fit of each method (ClueNet (C-ST, C-D, C-C), Louvain (L), Infomap (I), Hierarchical Infomap (HI), label propagation (LP), simulated annealing (SA), and Multistep (M)) to the corresponding ground truth partition in terms of precision. There is one ground truth partition for the Enron network (results shown in the figure). There are four ground truth partitions for the aging-related network, depending on which aging-related ground truth data is considered (BE2004, BE2008, AD, or SequenceAge; Section Data). Results are shown in this figure for the SequenceAge-based ground truth partition. For each dataset and each method, we compare the precision score of the partition produced by the given method (red) to the average precision score of its random counterparts (blue) and show the resulting p-value (see Section Measuring partition quality for details). These are representative results for one network/ground truth partition from each of the social and biological domains and one measure of partition quality. Equivalent results for the other three biological aging-related ground truth partitions, for the other two social dynamic networks (hospital and high school), and for the other three partition quality measures (recall, F-score, and AMI) are shown in S3, S4, S5 and S6 Figs.
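How the p-value is obtained is described in Section Measuring partition quality, which is not reproduced here. Purely as an assumed illustration, the Python sketch below uses a common empirical estimate: the fraction of random counterparts that score at least as well as the observed partition, with an add-one correction. The function name and the toy scores are hypothetical, and the paper's actual procedure may differ.

```python
def empirical_p_value(observed, random_scores):
    """Empirical p-value of an observed score against random counterparts.

    Assumed estimator: (count of random scores >= observed, plus 1)
    divided by (number of random scores, plus 1).
    """
    as_good = sum(1 for s in random_scores if s >= observed)
    return (as_good + 1) / (len(random_scores) + 1)


# Toy values (assumed, not results from the paper).
observed_precision = 0.8
random_precisions = [0.1, 0.2, 0.3, 0.9]
mean_random = sum(random_precisions) / len(random_precisions)
print(mean_random, empirical_p_value(observed_precision, random_precisions))
```

The add-one correction keeps the estimate strictly positive, which avoids reporting a p-value of zero from a finite number of random partitions.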

Table 1.

Results when ClueNet’s dynamic graphlet-based topological similarities are used on top of the existing denseness-based simulated annealing method.

Fig 9.

Running time comparison.

Running time comparison of the different methods (ClueNet (its three versions: C-ST, C-D, and C-C), Louvain (L), Infomap (I), Hierarchical Infomap (HI), label propagation (LP), simulated annealing (SA), and Multistep (M)) for the (a) social Enron and (b) biological aging-related dynamic networks. The y-axis uses a base-10 logarithmic scale. Equivalent results for the other two social networks (hospital and high school), which are similar to the Enron results, are shown in S8 Fig.
