Skip to main content
Advertisement
  • Loading metrics

Vertex clustering in diverse dynamic networks

  • Devavrat Vivek Dabke ,

    Contributed equally to this work with: Devavrat Vivek Dabke, Olga Dorabiala

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Level Ventures, New York, NY, United States of America

  • Olga Dorabiala

    Contributed equally to this work with: Devavrat Vivek Dabke, Olga Dorabiala

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    olgad400@uw.edu

    Affiliation Dept. of Applied Mathematics, University of Washington, Seattle, WA, United States of America

Abstract

We present theoretical and experimental results for spatiotemporal graph k-means (STGkM)—a new unsupervised method to cluster vertices within a dynamic network. STGkM finds both short-term dynamic clusters and a “long-lived” partition of vertices within a network whose topology is evolving over time; we first introduced this technique in a recent conference paper. Here, we update our algorithm with a more efficient relaxation scheme, provide additional theoretical results, compare its performance to several other methods, and demonstrate its capabilities on real, diverse datasets. We construct a theoretical foundation to distinguish STGkM from connected components and static clustering and prove results for the stochastic setting for the first time. In addition to our previous experiments on the United States House of Representatives dataset, we report new state-of-the-art empirical results on a dynamic scientific citation network and Reddit dataset. These findings demonstrate that STGkM is accurate, efficient, informative, and operates well in diverse settings. Finally, as previously noted, one of the main advantages of STGkM is that it has only one required parameter: k, the number of clusters; we therefore include an extended analysis of the range of this parameter and guidance on selecting its optimal value. Our data and code are available on Github; see: https://github.com/dynestic/stgkm.

Author summary

With the explosion of data about the world around us, we must constantly develop new ways of studying datasets. One popular method for analysis is k-means, which can identify clusters of related objects based on shared characteristics. Traditionally, k-means worked over sets of objects (e.g. animal subjects in a biology study) for which it was possible to define a list of consistent, numerical, and complete “features” or pieces of information (like age or weight). Over the years, this algorithm has been adapted for a variety of datasets and implemented in efficient code libraries. Our paper builds on this vast literature to extend k-means to the setting of networks that change over time, and we provide a practical and efficient implementation of it for real-world usage. We show that our new algorithm works on datasets like the United States House of Representatives roll call votes, citation networks from major publications, and a sample of Reddit posts. We also provide formal mathematical proofs and demonstrate the theoretical soundness of our technique.

1 Introduction

As we have previously noted [1, 2], dynamic graphs are prevalent mathematical structures that capture important aspects of the rich and frenetic world around us. Though graphs, or networks, have traditionally been studied as static objects, many such systems evolve over time. Also called “time-varying” or spatiotemporal graphs, they extend static graphs by permitting edges to change, and they inherently reflect many systems, e.g., road networks, online communities, and epidemic spread. While we have made strides in understanding them, there are still many exciting open questions about dynamic graphs, especially when compared to our relatively compendious knowledge about static ones.

Given a dynamic network, we may ask: what is the relationship between vertices over time? For example, given a social network with evolving friendship, we may want to understand which people are the most influential, important, or well-connected. Or, in a voting network, we may want to track the rise of parties and factions over time. These types of problems—finding the emergent dynamic relationships between vertices—is called community detection and is a form of vertex clustering. Besides the variety of motivating applications, vertex clustering is a fundamental theoretical problem in the static setting, and so it is natural to extend it to the dynamic one.

In this paper, we expand upon spatiotemporal graph k-means (STGkM)—a novel technique that we first introduced as a recent conference paper [1]—which is able to track the multi-scale relationships between graph vertices. STGkM applies a two-phase clustering approach, wherein the first phase outputs an assignment for each vertex at every time step and the second phase produces a single, long-term partition of vertices based on historical cluster membership. STGkM identifies communities of interest and automatically tracks their evolution.

While we had previously provided some theoretical and experimental results when introducing the technique, the results presented in this paper are entirely novel. Leveraging some theoretical analysis, we can show when and how our method relates to dynamic connected components. In some cases, we exactly align with this technique for vertex clustering, but we also demonstrate where our method finds sensible clusters when connected components is too strong of a concept. We also demonstrate why dynamic clustering is superior to simple static clustering in a dynamic network and we provide a basic result demonstrating robustness to noise. Our empirical results show, first and foremost, that STGkM is applicable to a diverse range of datasets. It is also efficient and tractable. We also observe many interesting results, like finding political parties in voting networks and evidence for polarization in online discourse: here, we provide mathematical evidence for corresponding results in the social sciences [3]. Finally, we also provide a comparison to other clustering techniques, including aggregation methods, using connected components, and heuristic approaches like DCDID [4]. The main benefits of using STGkM are:

  1. It is more granular than connected components.
  2. It is more powerful than traditional static or “aggregation”-based clustering.
  3. It is robust to noise.

2 Related work

Much of the literature on dynamic graphs focuses on extending well-known concepts from the static case like connectivity [5], optimal routing [6], induced dynamical systems [7], and more. Further work has been done in applying dynamic networks to sports analytics, machine learning [8], and epidemiological spread [9, 10]. Our work fits into this foundational literature by extending the notion of vertex clustering. Graph clustering is a fundamental tool for network analysis, with applications across the social and natural sciences, and we seek to bring this tool to the dynamic setting. In dynamic graph clustering, we find a partition of graph vertices that takes into account both spatial similarity—so that there are many edges within a cluster and relatively few between clusters—and temporal similarity, so clusters stay consistent. These partitions help us detect latent community structures.

In static graphs, vertex clustering has a broad literature with interdisciplinary interest and there has been a push to extend these results to the dynamic setting. Most approaches to dynamic community detection find clusters independently at each time step and then use aggregation to successively infer relationships between partitions [11]. These methods are often unable to achieve temporal smoothness and inevitably do not capture the dynamics of the network [12]. Another subset of methods first constructs a single coupling graph that summarizes the temporal properties of the dynamic network and then runs a classic community detection method on this graph [13]. As with aggregation, the use of coupling graphs results in a loss of temporal information.

Evolutionary clustering addresses this shortcoming through a unified framework, where clusters are iteratively formed based on current network structure and previous partitions. A cost function regulates the tradeoff between cluster quality at each snapshot and cluster consistency [14]. This framework has been successfully adopted and refined [15, 16]. Similarly, incremental methods, identify an initial network structure and use some criterion to successively update clusters. One such method is DCDID, which uses batch processing to discover interesting partitions [4]. Other lines of research extend static community detection using online algorithms [17], machine learning [18], or systems-based approximation algorithms [19]. These papers leverage diverse methods to contend with the sometimes staggering size of dynamic networks.

Our method, STGkM, develops a unified framework akin to, but distinct from, evolutionary clustering [14]. STGkM, achieves temporal smoothness by restricting the search space of new cluster centers based on temporal reachability from previous centers. In addition, our method goes further than existing techniques to also extract long-lived communities of vertices based on historical dynamic cluster membership. STGkM is the graph analogue to our previously developed point-based method [20]. We first introduced a notion of STGkM in an extended abstract with some preliminary evidence of its effectiveness and later as a conference paper, but we have since refined our approach and this work provides our complete results [1, 2].

Finally, we note that there are numerous methods to group vertices within a dynamic network, each with its own motivation, challenges, and rich literature. Our method of vertex partitioning prioritizes long-term stable connections and our main theoretical result highlights the relationship to the distinct concept of connected components. However, there are many other interesting notions of connectivity [21] with variants in stochastic settings [22] along with other related problems, like motif detection [23], centrality measurement [2427], and even novel frameworks for capturing properties of dynamic networks [28].

3 Spatiotemporal graph k-means (STGkM)

Our goal is to partition a vertex set given a dynamic graph. In STGkM, we construct a partition by finding central nodes to represent each cluster and then assigning each remaining vertex based on its closest central node. Just as with k-means, we define the problem of finding good clusters as a minimization problem; our novel objective has a unified formulation over space and time that predicts a partition for each vertex at every time step. After pre-processing, STGkM consists of two phases: in a single pass of Phase 1, the algorithm outputs vertex membership and dynamic cluster center journeys; in Phase 2, we extract the long-lived communities. We first presented most of this algorithm in [1], but we reproduce it here for completeness; however, the relaxation scheme in Section 3.3.3 is entirely novel. Also, our approach is perhaps more analogous to k-medoids, but in a network context, the distinction between k-means and k-medoids is not clear.

3.1 Setup

As input data, we need:

  • An indexed vertex set: V. We assume it is finite, but in principle, SGTkM would exist in the infinite case; we could not find a practical application and so will not consider that possibility in this paper.
  • A (finite) totally ordered time set: , usually
  • A dynamic graph: where EtV × V
  • (optional) a non-negative real cost function for all

Our parameters are , the number of clusters, and—optionally—, the maximum cluster center drift, and , the drift time window.

3.2 Pre-processing

For all pairs of vertices across time, we compute and store the s-journey δ, see [5] for details. The value of δt(u, v) is the length of the shortest journey (i.e. dynamic path) starting at vertex u at time t and ending at vertex v. If no such journey exists, it assigns + ∞. Thought not a true metric (it is missing symmetry and coincidence), this function has the same purpose as a distance in classical k-means. If we have a weight function, we consider this the “cost” of an edge, i.e., an edge with a smaller weight is a “shorter” edge and a missing edge has essentially “infinite” weight. If no weight functions are provided or if the weight functions only output natural numbers, then δ will assign only natural numbers. (As a practical matter, we can deploy a tie-breaking scheme when comparing two distances that are both infinity.) We also define the related true metric with the additional convention that .

3.3 Phase 1

Given a fixed value of k, the first phase of STGkM selects a set of k vertices to serve as cluster centers and assigns each vertex to a cluster at every time step. Vertices have the flexibility to switch cluster membership at every time step, but cluster centers are constrained by drift parameters λ and γ.

3.3.1 Natural objective.

The natural extension of k-means would be to optimize the objective function in Expression 1: (1) where we minimize the graph distance over cluster centers and assignment tensors . Formally, is the set of all sequences of length where each element is an ordered subset of V with k elements; such that . Note that each vertex is assigned to at least one cluster at every time step.

3.3.2 Objective with regularization.

Optimizing Expression 1 is -hard (we can reduce STGk)M to k-medoids, which is -hard [29]), so we instead iteratively optimize a modified objective function that restricts the search space. We begin by choosing initial cluster centers c0 to be the nodes that are most closely connected to all others at t0. When there are ties, we sample randomly. At each time t henceforth, we assume that we have chosen optimal cluster centers cs for all s < t, and we minimize Expression 2. (2)

The constraint in Expression 2 imposes that the center of a given cluster can only switch from vertex u to vertex v if the distance between them is no more than λ for the previous γ time steps. This regularization serves two purposes: first, it associates dynamic clusters between time steps; second, it restricts the search space for cluster centers. In the worst case, e.g. when the graph is complete at every time step, optimizing this objective is still -hard, but in practice, it makes STGkM tractable. As we decrease λ or increase γ, we decrease the number of potential centers at time t and enforce stricter cluster consistency; see Fig 1 for an example.

thumbnail
Fig 1. Cluster center selection process at time t.

At time t, the current cluster center is chosen based on the previous γ cluster centers , 1 ≤ qγ. The drift time window γ determines for how many previous time steps centers must be within maximum drift λ of one another. The objective in Expression 2 is evaluated for all potential cluster centers; the center that minimizes the objective is chosen.

https://doi.org/10.1371/journal.pcsy.0000023.g001

3.3.3 Optimizing our objective.

We can further make our method more efficient, see Section 5.3.1 for empirical performance comparisons. In optimizing Expression 2, observe that at each time step, each vertex could belong to multiple clusters, since the rows of Wt are not restricted; we only require the overall tensor to be binary. However, we can further require that each row of Wt have exactly one non-zero entry; in other words, each vertex can only be in one cluster. The former approach, as explored in [1], allows us to avoid making strong assignments and potentially falling into a local minimum, but increases computational complexity. The latter approach is more attractive for the analysis of larger graphs, as it restricts the search space of potential cluster centers even further. We therefore focus on the latter method in this paper so that we can explore larger dynamic graph structures.

Our algorithm is an adaptation of FasterPAM with eager swapping, as detailed in [30]. FasterPAM consists of two phases: the BUILD phase, which finds initial cluster centers, and the SWAP phase, which efficiently seeks iterative improvements for the cluster centers by scanning over all vertices in the graph. We run the BUILD phase, as implemented in [30], on the first time slice of our matrix storing s-journeys. For all following time steps, we update the center of a cluster only if the objective value is improved. To do so, we run eager SWAP phases, where instead of considering all points as potential swaps for each center, we consider only those points which are reachable based on the constraint in Expression 2. We greedily update cluster centers, performing any update that yields some improvement in the loss function. Our algorithm terminates until either the clusters stabilize or we reach a maximum number of iterations.

3.3.4 Algorithmic analysis.

The main feasibility issue with STGkM arises from finding new cluster centers at every time step. Evaluating all possible subsets of size k of |V| is -hard. The objective in Expression 2 does not obviate this possibility in the worst case, even though in practice, the added regularization makes it unlikely. In order to further improve run time, we build upon FasterPAM, as discussed above and take advantage of the fact that FasterPAM deploys the strategies of local search and greedy updates. As discussed in [30], the runtime complexity of a single BUILD phase of FasterPAM is O(k|V|2), and in the worst case, where every vertex would be reachable from every medoid, the runtime complexity of each SWAP phase would be O(|V|(|V| − k)). This means that in the worst case, Phase 1 of STGkM has a runtime complexity of .

3.4 Phase 2

By building on Phase 1, Phase 2 of STGkM aims to identify the long-lived partitions of graph vertices. The output is an assignment of communities containing vertices with the most similar spatiotemporal characteristics. Intuitively, we expect vertices with similar partitioning histories to belong to the same persisting community in the long run.

Recall that the Hamming distance is defined by counting the number of entries where two matrices disagree: . Using this distance, we define similarity sim(u, v) as (3)

This definition gives us a powerful way to compare all pairs of vertices. Since sim(·, ·) is compatible with traditional clustering techniques, we input it to agglomerative clustering. We then output the resulting partition to get a clustering of the vertices based on long-lived communities as desired.

3.4.1 Cluster balance.

There is no strong requirement that clusters be balanced. Indeed, our experiments all operate on data with differently sized clusters. We do require that the graph has the same number of vertices over time, but it’s possible to generate clusters with very different sizes with STGkM. For balanced clusters, there are two modifications that are possible: the easiest is to update Phase 2 to require balanced clusters, which is relatively straightforward with agglomerative or spectral clustering; the other option is to update the main objective function in Phase 1 with a regularization term that penalizes large differences in cluster sizes, which is a function of the weight matrix.

In our theoretical results, the first theorem also does not require balanced clusters (indeed, we do not make such an assumption). For the rest of the theoretical results, we assume balanced clusters for simplicity: the main purpose of each theorem is to demonstrate the robustness of STGkM in certain important conditions (like in the stochastic setting). It is possible to generalize these results on unbalanced clusters (with additional conditions, e.g., that the smallest cluster has to have more than k elements), but it adds substantially complexity to the statement of the result and the corresponding proofs. We therefore only present the results with balanced clusters in this work and leave the generalization to unbalanced clusters to future work.

4 Theoretical results

STGkM has theoretical guarantees that explain its good performance in a variety of practical situations, namely:

  1. Where important, STGkM can recover the well-known concept of dynamic connected components. (This result was first shown in [1] and we simply restate the main theorem here.) See: Theorem 4.1.
  2. STGkM produces more insightful clusters that dynamic connected components, since they requires strong assumptions about a system. See: Theorem 4.3.
  3. One common approach to clustering in dynamic networks is to flatten a dynamic graph into a single static one and then perform traditional static clustering. While sometimes effective, this approach can be catastrophically wrong, but STGkM is immune to these types of issues. See: Theorem 4.8.
  4. Finally, we show that our technique is robust to noise in the stochastic setting. (Admittedly, this result is relatively simple and we do not provide a sensitivity analysis with respect to noise.) See: Theorem 4.11.

4.1 Harmony with connected components

Although the clusters that STGkM generates are distinct from connected components, we can find connected components under certain conditions, as presented in Theorem 4.1; this result, with more extensive proofs and intermediate results, is simply reproduced from [1] for context. Though our method may not be most efficient way to find connected components (and there are other notions of connected components), our theoretical result provides evidence that STGkM can find interesting partitions.

Definition 4.1 (Dynamic Connected Component). Vertices u, v are (dynamically) connected if there exists a finite journey from u to v and from v to u over all time steps. A set of vertices U (where UV) is a (dynamic) connected component if all vertices in this set are connected and there is no vertex in V\U that is connected to a vertex in U.

Theorem 4.1. Given a holding, non-stranding dynamic graph with k connected components, the partition of vertices induced by the optimal solution to Expression 1 is exactly the connected components given sufficient time.

Proof. See [1].

This theorem states that—given the correct setting of k—STGkM finds the expected partition, namely the connected components. It does not give guidance on selecting an appropriate k. As seen below, with the special case of a single connected component, there are other interesting settings of k that find more intuitive clusters. This theorem demonstrates that if the goal is to find connected components, it is readibly doable.

4.2 More granularity than connected components

Dynamic connectivity is a relatively strong requirement for two vertices. Indeed, if two vertices become merely disconnected for just one time step, they cannot be dynamically connected. With little noise (especially in real-world data), it is easy to generate dynamic networks with no true non-trivial dynamic connected components; alternatively, it is just as easy to find real-world data where the whole graph is one dynamic connected component and we cannot distinguish between vertices. To demonstrate this issue and how STGkM can handle such cases, we construct a special type of dynamic network, see Definition 4.2. We then show that it has only one dynamic connected component, but an intuitive notion of clusters. Lemma 4.2 shows that using dynamic connected components would fail. Theorem 4.3 demonstrates our ability to find the correct k clusters. Note that this theorem is not incompatible with Theorem 4.1, which implies that if we set ran STGkM with a parameter of 1, we would indeed find the single connected component.

Definition 4.2 (Clique-cross-Clique). Let k be the “ground-truth” number of clusters and n be the number of vertices per cluster. We have a total of Nk × n vertices in V. For ease of indexing, we let vi,j be the jth vertex in the ith cluster, i.e. i ∈ [k], j ∈ [n]. We construct the following dynamic graph: (vi,j, vi′,j) is in Et if and only if one of the following conditions is satisfied:

  1. self-loops, i.e. i = i′, j = j
  2. the vertices are in the same cluster, i.e. i = i′, jj
  3. the vertices have the same position in their respective clusters and it is their “turn” to be connected, i.e. ii′, j = jand j = t mod n.

We will call such a dynamic graph a Clique-cross-Clique. An example is shown in Fig 2.

thumbnail
Fig 2. Visualization of a Clique-cross-Clique over three timesteps.

The graph consists of three “ground truth” clusters with five members each.

https://doi.org/10.1371/journal.pcsy.0000023.g002

Lemma 4.2. A Clique-cross-Clique has one dynamic connected component, namely all of its vertices.

Proof. This dynamic graph is holding and non-stranding; at every time step, the static graph induced by Gt = (V, Et) is connected. Therefore, by Proposition 3.4 in [5], this graph is dynamically connected. Graphs that are dynamically connected have, by definition, one dynamic connected component. We note that intuitively, the idea is simply that each of the k clusters is fully connected at all time steps and that between any two clusters, there is always an edge.

Theorem 4.3. Given a Clique-cross-Clique with k clusters, the partition of vertices induced by the optimal solution to Expression 1 is exactly the k clusters with sufficient time.

Proof. First, we will show the simple fact that vertices in the same cluster have a smaller objective value than those in different clusters given sufficient time. We observe that δt(u, v) = 1 for u, v in the same cluster for all time steps. If u, v are in different clusters, then δt(u, v)≥2 at least for time steps (this bound is conservative) given sufficiently large T (specifically, T > 2k). Therefore, for any that where u, v are in the same cluster but u, v′ are in distinct clusters.

Second, we will show that the optimal cluster assignment is always to assign one entry of ct to each cluster; we proceed by contradiction. Assume that we have found an optimal assignment ct, W; select j ∈ [k] and assume that ct contains no vertices that are in cluster j; let v be some vertex in this cluster. By the pigeonhole principle, there must be two vertices in ct that are in the same cluster: call them u, u′ and let i, i′ be their respective indices. We will show that we can switch one of these “duplicate” vertices to v to decrease our objective value. Select some vertex w in the same cluster as u and w′ in the same cluster as v. Since δt(w, u) = δt(w, u′), we know that . In other words, if we remove u′ from , then we could simply update to take the value of without changing the objective value. Thus, if we set to v, then we observe that the overall objective value has not changed for vertices in the same cluster as u.

However, if we set to v, by the fact that we showed in the beginning of this proof, there would be an update to W that would indeed decrease the overall objective value. For completeness, we observe that for a vertex that is not in the cluster with u nor with v would remain unaffected. Such an assignment is thus not minimal, which is a contradiction.

Remark 4.4. Note that when λ = γ = 1, then Expression 2 and, in fact, even our single membership scheme also finds the optimal result.

4.3 More powerful than static clustering

One common clustering technique with dynamic networks is to simply convert a dynamic network into a static one. In particular, one easy way to “aggregate” a dynamic network is to count the total number of dynamic edges between two vertices. We can then perform traditional static clustering on such a static graph. However, this method will certainly miss important dynamic properties. We construct an example in Definition 4.3 and show that the total number of edges between vertices is uniform in Lemma 4.7—and so static clustering will not be able to meaningfully distinguish between vertices—but that STGkM will find the appropriate number of clusters, see Theorem 4.8. This example is, in fact, a motivating example for STGkM and provides some insight into how we can find induced clusters over time.

Definition 4.3 (Theseus Clique). Let U be the set of vertices u0, …, un−1 and W be the set of vertices w0, …, wn−1. In this case, VUW is our vertex set with 2n vertices. Let be the “round” index and be the “slice” index; we parameterizes our “time” index as t(r, i) = r * n + (i mod n).

We thus define a dynamic graph where and then where Sr,i swaps all edges involving the vertices ui, w(i+r) mod n. Formally, (a, b) ∈ Sr, i(E) if and only if:

  1. aPr, i, bPr, i and (a, b) ∈ E
  2. aPr, i, bPr, i and (a, b) ∈ E
  3. aPr, i, bPr, i and
  4. aPr, i, bPr, i and

where Pr,i = {ui, w(i+r) mod n} and of an element • in Pr,i is simply the other element. An example is shown in Fig 3.

thumbnail
Fig 3. Visualization of a Theseus Clique, with n = 5 nodes in each ground-truth cluster.

The clusters are disconnected every n time steps, and the connectivity pattern repeats every n2 time steps.

https://doi.org/10.1371/journal.pcsy.0000023.g003

Lemma 4.5. The total number of edges at any given time in a More powerful than static clustering is 2n2.

Proof. We can proceed by induction. By construction E0 has 2n2 edges since it is the union of two sets of size n2; also, note that, by construction, each vertex has degree n.

We assume that El has 2n2 edges and every vertex has degree n. By construction, note that if an edge is in El+1, there is a corresponding edge in El and so we cannot have more than 2n2 edges. Now, we observe that (a, b) ∈ El+1 when (a, b) ∈ Ela, bP, where P is the relevant swapping set. In these cases, we have clearly preserved the number of edges and the degree of each vertex. Now, we note that if aP, then for every edge that includes a in El, there is an edge that includes in El+1; if there is an edge that includes in El, then there is an edge that includes a in El+1. In other words, we have not changed the number of edges, nor the degree of any vertex.

Lemma 4.6. A Theseus Clique has two static connected components that are cliques at any given time step.

Proof. We can proceed by induction. By construction, E0 has two cliques. We assume that El has two cliques, say A, B. Let a, bP be distinct. If a, b are in the same clique, then El+1 trivially preserves cliques. If a, b are in different cliques, say aA, bB, then we observe that (a, b) ∉ El and so (a, b) ∉ El+1. Next, we note that (a, a′) ∉ El+1 for any aA: because (b, a) ∉ El (by our inductive hypothesis), then because . Symmetrically, this must be true for b. Finally, we need only observe that by Lemma 4.5, since degree is preserved, then we must have maintained two cliques.

Lemma 4.7. For a Theseus Clique with total time steps T that is some positive multiple of n2, then the number of edges between any two vertices v, v′ ∈ V is .

Proof. Every vertex has constant degree by Lemma 4.5. Because of Lemma 4.6, we know there are always two cliques. At the end of each round, our cliques return to U, V by construction. Finally, note that every vertex in U and every vertex in V is connected exactly once per n2 time steps.

Theorem 4.8. Given a More powerful than static clustering, the partition of vertices induced by the optimal solution to Expression 1 with 2 clusters has lower objective value than with 1 cluster.

Proof. By construction, δt(u, v) is either 1 or ∞ if u, v are the in same clique at time t or not, respectively. The objective value with 1 cluster is therefore ∞, since there exists at least one vertex that is not in the same clique as the cluster center. There exists a finite objective value with 2 clusters: namely, at each time step, simply select a cluster center from each clique, which is well-defined by Lemma 4.6.

4.4 Robust to noise

Here, we use a similar setup as in Section 4.2, but we define a stochastic system, rather than a deterministic one. Theorem 4.11 shows that we still find the correct number of clusters in expectation. This result is somewhat weak, since we do not provide any analysis on the distribution of the number of clusters, nor do we provide a sensitivity analysis with respect to noise. That type of analysis is much more complicated, which we hope to provide in future work. However, our result provides at least some theoretical guarantees that STGkM is insensitive to noise, and we suspect that it is actually quite robust to noise based on our empirical results.

Definition 4.4 (Random Clique-cross-Clique). Let k be the “ground-truth” number of clusters and n be the number of vertices per cluster. We have a total of Nk × n vertices in V. For ease of indexing, we let vi,j be the jth vertex in the ith cluster, i.e. i ∈ [k], j ∈ [n]. We construct the following random dynamic graph: is where 1 > p > p′ > 0. In other words, each edge at each time step is a Bernoulli random variable and we further require that each is independent. We will call such a random dynamic graph a Random Clique-cross-Clique. An example is shown in Fig 4.

thumbnail
Fig 4. Visualization of a Random Clique-cross-Clique containing three clusters with five nodes each, intra-cluster connection probability p = .30, and inter-cluster connection probability p′ = .20.

https://doi.org/10.1371/journal.pcsy.0000023.g004

Lemma 4.9. A Robust to noise is time-homogenous Markovian.

Proof. For distinct time steps t, t′ note that induced static graphs are independent by definition. Note that each edge is a Bernoulli random variable with probability that does not depend on time, so are identically distributed.

Lemma 4.10. Given a Robust to noise with appropriate k, p, p′; and given vertices u, v, vsuch that u, v are in the same cluster and u, vare in distinct clusters; then .

Proof. In the trivial case where u = v, then ; since there is non-zero probability, namely 1 − p′, that (u, v′) ∉ Et, then . We thus proceed by assuming that uv.

First, let Xu,v represent the first t such that (u, v)∈Et—formally, Xu,v = min{t: (u, v) ∈ Et}—and let Xu,v be the analogous random variable for the edge u, v′. These are geometric random variables by construction and have well-defined expectations. Since δt(u, v) is the length of the shortest journey, then . This fact holds because one possible journey from u to v is simply to stay at u (because of the almost sure self-loops) and then to cross the edge from u to v when it appears for the first time. The shortest journey must thus be no larger than whenever this edge first appears. Therefore, is well-defined and, by similar logic, so is .

Next, we will show that any journey between u, v is at least as likely to exist in the dynamic graph as any “equivalent” journey between u, v′: Because our dynamic graph is Markovian by Lemma 4.9, we can simply assume we are starting at time t = 1 without loss of generality. Let j ≜ (u, w1, …, wl−1, v) be any journey of length l starting at time step t where no wi is v or v′. Let j′ ≜ (u, w1, …, wl−1, v′) be the “equivalent” journey of length l. We observe that by construction. Since any journey can be written in this form, any journey between u, v is at least as likely to exist in the dynamic graph as any journey between u, v′.

It is a tedious and technical matter to address the case where j includes v′ and j′ includes v. To do this, we first observe that if v appeared in j before the last vertex, then it would not be a proper journey of length l by definition. We then define a notion of journey equivalence where if j contains any instances of v′, we swap these to v in j′ and we show that we have not inadvertently increase the probability of j′. From here, we can conclude that our statement is still correct.

Finally, we note that if any journey between u, v is at least as likely to exist as the equivalent journey between u, v′, then by definition of a shortest journey. To wit, if the probability of a journey j increases, then the probability that the shortest journey is longer than j cannot increase. Now, we observe that: where c, c′ represents the expected shortest journey length conditioned on the edge (u, v), (u, v′) not existing. Since we showed that 1 < cc′ and by definition p > p′, then .

Theorem 4.11. In expectation, given a Robust to noise with k clusters, p, p′, the partition of vertices induced by the optimal solution to Expression 1 is exactly the k clusters with sufficient time.

Proof. By Lemma 4.10, we know that . By linearity of expectation, we can simply apply Theorem 4.3.

5 Experimental results

5.1 Synthetic data

We begin by evaluating STGkM across a series of synthetic networks with ground truth. We compare STGkM’s performance against three other methods: connected components found at every time step (CC), k-medoids run on a coupling graph defined as the inverse of the sum of the total number of edges between nodes over time, and DCDID [4]. We choose to compare against DCDID, since it was shown in [4] to outperform a handful of other methods in detecting quality dynamic communities. We note that DCDID is completely parameter free and does not require knowledge of the number of clusters k, whereas k is a required input for STGkM and k-medoids. While the parameter-free nature of DCDID is attractive and admittedly gives the algorithm the ability to operate in domains where both the number of nodes and clusters change over time, we show that this flexibility results in a tradeoff with cluster consistency. We will show that although STGkM maintains a fixed number of clusters, the restriction gives insight into cluster behaviour that none of the other methods can provide.

We complete 100 independent runs of each method. The parameters for STGkM are set to the default values of λ = 1 and γ = 1. The ground truth contains the long-lived partitions of nodes, but STGkM, DCDID, and CC all report short-term clusters, i.e. a label for each node at every time step. Therefore, for all the aforementioned methods, we use Phase 2 of STGkM to predict long-term clusters. Since k-medoids run on a flattened graph outputs only a single, long-term cluster assignment, we use it’s output directly. We report average Adjusted Mutual Information (AMI) score, with results shown in Table 1. Though these scores are representative of the quality of long-term cluster identification, we also stress the quality and importance of identifying cohesive, short-term clusters. Of all the methods, STGkM is the only one that is built to enforce lasting cluster identities (i.e. cluster 0 at time 0 is cluster 0 at time n). The implications of this requirement can be observed in Figs 510, which visualize the clustering histories of each node, as predicted by STGkM, CC, and DCDID. The colors in the figures correspond to unique cluster identities. While CC and DCDID can sometimes predict upwards of 50 unique short-term clusters in a time window, STGkM tracks the evolution of a specified number of clusters that are automatically associated. This innovation, unavailable with any of the other methods, allows us to directly observe the evolution of the makeup and stability of our long-term clusters.

thumbnail
Table 1. Average AMI scores (higher is better) of 100 independent runs of various community detection methods over a range of synthetic datasets.

STGkM is our method, CC uses dynamic connected components, k-medoids compresses a dynamic graph into a single static one and uses k-medoids, and DCDID is a heuristic method [4]. The best performance is bolded.

https://doi.org/10.1371/journal.pcsy.0000023.t001

thumbnail
Fig 5. Cluster assignment histories of STGkM, CC, and DCDID run on a Clique-cross-Clique.

CC and DCDID identify only the single, connected component at every timestep, whereas STGkM finds three stable clusters.

https://doi.org/10.1371/journal.pcsy.0000023.g005

thumbnail
Fig 6. Cluster assignment histories of STGkM, CC, and DCDID run on a Strong Random Clique-cross-Clique.

Due to the strong connectivity, CC can only ever find a single connected component. DCDID sometimes finds multiple clusters, but they are inconsistent. STGkM finds three relatively stable, evolving communities.

https://doi.org/10.1371/journal.pcsy.0000023.g006

thumbnail
Fig 7. Cluster assignment histories of STGkM, CC, and DCDID run on a Mixed Random Clique-cross-Clique.

Though all three methods have three unique groups of cluster histories, only STGkM’s persist throughout time and give insight into cluster evolution.

https://doi.org/10.1371/journal.pcsy.0000023.g007

thumbnail
Fig 8. Cluster assignment histories of STGkM, CC, and DCDID run on a Weak Random Clique-cross-Clique.

Due to the disconnected nature of the network, CC and DCDID find upwards of fifty unique clusters throughout the duration of the simulation. Contrastingly, STGkM monitors the evolution of three cohesive communities.

https://doi.org/10.1371/journal.pcsy.0000023.g008

thumbnail
Fig 9. Cluster assignment histories of STGkM, CC, and DCDID run on a Theseus Clique.

CC and DCDID identify the two connected components at every time step, while STGkM attempts to maintain some level of stability in cluster identity over time.

https://doi.org/10.1371/journal.pcsy.0000023.g009

thumbnail
Fig 10. Cluster assignment histories of STGkM, CC, and DCDID run on Three Clusters.

Only STGkM is able to track how three clusters evolve. CC and DCDID predominantly find only a single cluster.

https://doi.org/10.1371/journal.pcsy.0000023.g010

5.1.1 Clique-cross-Clique.

We create a Clique-cross-Clique with a ground truth of three clusters containing five nodes each. As seen in Table 1, both STGkM and k-medoids identify the long-term ground truth communities correctly, while DCID and CC perform quite poorly. Although k-medoids does well in identifying the long-lived node partitions, it gives no insight into short-term cluster behavior, which all three of the other methods do. Fig 5 visualizes how cluster membership evolves for each node using the other three algorithms. While DCDID and CC identify only the connected component in the network (see Lemma 4.2), STGkM properly identifies the three distinct clusters. Unsurprisingly, these clusters remain extremely stable. This experiment emphasizes STGkM’s granularity over connected components.

5.1.2 Strong random Clique-cross-Clique.

Next, we evaluate a strong Random Clique-cross-Clique with three clusters, five nodes per cluster, p = 0.50, and p′ = 0.30. We choose both p and p′ to be well above the threshold of the Erdős–Rényi bound [31], meaning that the graph will most likely be fully connected at every time step. We generate a new network for every evaluation. Again, STGkM and k-medoids perform best. Fig 6 emphasizes that once more, STGkM is the only algorithm that finds consistent short-term clusters as well.

5.1.3 Mixed random Clique-cross-Clique.

Four our next experiment, we use a Random Clique-cross-Clique with three clusters, five nodes per cluster, p = 0.50, and p′ = 0.05. This time, we set p above and p′ below the Erdős–Rényi bound [31]. We generate a new network for every evaluation. In this paradigm, the graph will often separate into three separate connected components, making it unsurprising that all three methods correctly identify the ground truth long-term clusters. However, as we see in Fig 7, the quality of short-term clusters produced by STGkM, CC, and DCDID differ greatly. Although the clustering histories of points 0 − 4, 5 − 9, and 10 − 15 have strong similarities to one another, respectively, across all three algorithms, only STGkM’s clusters are consistent and allow us to observe cluster evolution, highlighting its performance over aggregate or incremental algorithms.

5.1.4 Weak random Clique-cross-clique.

Our final Random Clique-cross-Clique has three clusters, ten nodes per cluster, p = 0.10, and p′ = 0.01. With values of p and p′ both below the Erdős–Rényi bound [31], the graph is be extremely disconnected. We generate a new network for every evaluation. As seen in Table 1, all four methods identify the long-term partitions near perfectly. However, Fig 8 again showcases STGkM’s superior identification of short-term node relationships. Both CC and DCDID predict upwards of 50 unique clusters in a given time window, while STGkM provides a continuous perspective of cluster evolution.

5.1.5 Theseus Clique.

Our fifth synthetic graph is the Theseus Clique composed of two clusters with ten nodes each over fifty time steps. As shown in Lemma 4.7, the total number of edges between any two vertices in this graph is 25. As a result, k-medoids performs poorly on this dataset, since the coupling graph cannot distinguish the strength of relationships between nodes. Contrastingly, STGkM, CC, and DCDID all identify the ground truth clusters perfectly. In Fig 9, we see that CC and DCDID perfectly separate the connected components at every time step, while STGkM attempts to maintain some level of consistency, albeit weak, between the two clusters.

5.1.6 Three clusters.

Finally, we apply all three methods to a synthetic dataset, consisting of three clusters with ten fully connected nodes in each cluster, tracked over twenty time steps. The result is a dynamic graph with 30 nodes and 300 edges at each time t. At every time step we randomly choose up to 30 edges to remove within clusters and up to 30 edges to add between clusters. Table 1 shows that STGkM, k-medoids, and DCDID capture the ground truth clusters best. Unlike the other two methods, k-medoids does not give a glimpse into how the clusters evolve over time. A snapshot of the evolution of detected clusters, as predicted by STGkM, CC, and DCDID, is shown in Fig 10.

5.1.7 Robustness.

Overall, we observe that STGkM has the most consistent performance over a range of synthetic networks with respect to the accuracy of detected long-term clusters. In addition, STGkM is the only method that forces consistency of predicted short-term clusters, allowing us a glimpse into how clusters change over time. Our next goal is to argue the robustness of STGkM.

According to Theorems 4.3 and 4.11, given sufficient time, STGkM will find the optimal long-term clusters for both standard and Random Clique-cross-Cliques. We experimentally test this claim by evaluating the accuracy of clusters predicted by STGkM versus number of ingested time steps on one standard and multiple Random Clique-cross-Cliques with varying numbers of clusters, cluster sizes, and intra- and inter-cluster connection probabilities. Fig 11 shows the sensitivity of STGkM to number of ingested time steps over networks with various spatial complexities, temporal complexities, and connectivity levels. On a standard Clique-cross-Clique, regardless, STGkM always quickly converges to the ground truth communities. On Strong Random Clique-cross-Cliques with both strong intra- and inter- cluster connectivity, STGkM takes much longer to converge when the ground truth communities are made up a small number of nodes. Still, we observe a clear upward trend in AMI score over the number of time steps. On Mixed Random Clique-cross-Cliques with strong intra- but weak inter- cluster connectivity, STGkM again converges quickly to the ground truth across network complexities. On Weak Random Clique-cross-Cliques, where the network is predominantly disconnected, STGkM again converges slower with small ground truth communities, but with a clear upward trend. Convinced of STGkM’s utility, we move on to testing the method on real-world data.

thumbnail
Fig 11. Sensitivity of AMI score for STGkM run on Standard and Random Clique-cross-Cliques with varying spatial complexities, time complexities, and levels of connectivity.

Though STGkM may take much longer to converge on small communities when inter- and intra- connectivity are both very high or very low, there is a clear upward trend over time. STGkM will eventually find the ground truth communities.

https://doi.org/10.1371/journal.pcsy.0000023.g011

5.2 Choosing k

Perhaps the greatest challenge in using a k-means-based approach for clustering is determining the optimal number of clusters. With regard to classical k-means, two of the most common methods for choosing k are the Elbow Method [32], wherein the sum of square error of each cluster is calculated, and the value of k which results in the most extreme difference (the “elbow”) is chosen and silhouette score [33, 34], which provides a measure of cluster cohesion versus separation.

In our previous work [1], we chose k using an approach similar to the Elbow Method but specialized for multi-cluster membership. Unlike classical k-means where increasing k results in points getting progressively closer to their centers, in mutli-membership STGkM, increasing k is likely to cause vertices to be assigned to progressively more clusters simultaneously, resulting in in a larger objective value. This allows us to search for a local minimum, as opposed to the “elbow,” in our plot of k vs objective value.

In this work, because we choose to restrict vertices to single cluster membership for improved computational efficiency, we can no longer follow the same procedure. Since the “elbow” can be very difficult to identify when using real data, we instead turn to the silhouette score [33, 34]. As a note, silhouette score ranges between 0 and 1, and a higher silhouette score is better. For each value of k, we calculate the average silhouette score over all time steps, and choose the number of clusters that results in the maximum average silhouette score. We repeat the experiments from [1] and show that they are consistent with our previous results. An example of this process on the three cluster dataset from Section 5.1.6 is shown in Fig 12. As expected, the highest average silhouette score is achieved when STGkM is run with the true number (three) of ground truth clusters.

thumbnail
Fig 12. Average silhouette score versus number of clusters on a synthetic dataset.

Note: a higher score is better.

https://doi.org/10.1371/journal.pcsy.0000023.g012

5.3 Detecting political parties

The experimental results in this subsection were first introduced in [1] and we reproduce them here with minor edits for completeness. To demonstrate the utility of STGkM on a real-world dataset, we turn to the political sphere. Communities naturally arise is politics, particularly in recent years where we have witnessed polarization with political figures consistently voting along party lines. Taking inspiration from [35], we form a dynamic graph based on 100 roll call votes from the House of Representatives between June 21, 2023 and July 27, 2023. Each vote is a time step, each representative is a node, and nodes are connected if they vote the same way on a bill. Possible votes are “Yea”, “Nay”, and “Present”. If a representative does not cast a vote, they have no connecting edges for that vote. The ground truth communities are representative’s affiliated political parties.

By running STGkM on the roll call graph, we can identify the communities of representatives that vote similarly and observe how those communities evolve over time. Interestingly, in this example, the entire roll call graph is dynamically connected, since there are votes that have consensus. Thus, using dynamic connected components as a clustering technique would not work and we have to use a technique like STGkM to find clusters that align with the ground truth.

We choose our maximum center drift to be λ = 1 and our time connectivity to be γ = 5. Intuitively, we expect k = 2, corresponding to the two major US political parties, but in [1], when using multi-membership STGkM, our k selection process recommended k = 3. Beyond separating Democrats and Republicans into separate long-term clusters, we also identified an additional sub-cluster of three Democrats who very often vote “Present” together, as opposed to the majority Democratic party vote. Here, when using single-membership STGkM, we instead find k = 2 to be optimal, as shown in Fig 13. Interestingly, however, if we set k = 3, we are still able to recover the same sub-cluster of outlying Democrats, as in [1], up to stochasticity.

thumbnail
Fig 13. Average silhouette score versus number of clusters on the roll call dataset.

Note: a higher score is better.

https://doi.org/10.1371/journal.pcsy.0000023.g013

Fig 14 visualizes the similarity scores, as defined in Eq 3, between the cluster assignment histories for each pair of representatives when single-membership STGkM is run using k = 2 and k = 3. The rows and columns of the similarity matrices are ordered according to the discovered long-term communities. In the left figure, these clusters correspond to Republicans followed by Democrats, while in the right figure, the three outlying Democrats are moved to the final three rows and columns of the matrix. We observe a distinct color difference between these three rows and the remainder of the matrix, demonstrating that the similarity between the outlying Democrats and remaining Democrats is much lower. These figures agree with those generated in [1], using multi-membership STGkM.

thumbnail
Fig 14. Matrices showing the similarity scores of short-term cluster assignment histories between every pair of vertices in the roll call dataset using STGkM with γ = 5, λ = 1, and k = 2 on the left and k = 3 on the right.

Rows and columns are ordered based on long-term cluster membership.

https://doi.org/10.1371/journal.pcsy.0000023.g014

5.3.1 Runtime.

The difference in choice of optimal k between multi- and single-membership STGkM is intriguing. Perhaps restricting to single cluster membership leads to a loss of information, because we are forced to make immediate decisions about vertices that are equidistant from multiple centers. However, the computational speedup of restricting the search space cannot be ignored. When running multi- and single- membership STGkM on the Roll Call data 100 times each with k = 2, we find that the average runtime of multi-membership STGkM is about 1.36 seconds, while the average runtime of single-membership STGkM is.93 seconds, a speedup of 32%. We expect this computational advantage to become even more pronounced as our networks grow.

5.4 Detecting journal communities

For our next experiment, we explore communities based on citations between scientific journals. Using the Semantic Scholar API [36], we retrieve the 500 most cited papers for each year between 2000 and 2024 that are returned by the query “dynamic network.” We then form a connectivity matrix based on whether there is a citation from a paper published in journal u to a paper published in journal v or vice versa in a given year. We also keep track of how many times such citations occur with a weight function defined as follows: for journals u, v with m total citations from papers in u to papers in v in year t, the s-journey between these two journals starting at time t will be scaled by . Our final dataset tracks 199 journals over 24 years. The resulting connectivity matrix is sparse, with only 2.5% of entries containing nonzero values.

We set our maximum center drift and time connectivity to the default values, λ = 1 and γ = 1. When searching for an optimal number of clusters between 2 and 20, we repeatedly return a value close to the top of the range. For example, Fig 15 shows a situation where the optimal choice is k = 18, and we report results for this parameter choice. Fig 16 captures the short term cluster similarity scores between the cluster assignment histories for each pair of journals. The rows and columns of the similarity matrix are ordered according to the long-term clusters discovered by Phase 2 of STGkM. While we observe strong similarity within long-term clusters, we also observe a coarse, macro-structure in the similarity matrix. For instance, cluster 0, which contains 56% of journals covering a diverse range of topics, maintains above average similarity with most clusters, other than cluster 1, which has a distinct general biology focus. We also see strong similarities between pairs of long term-clusters, such as 1 and 2. Upon further investigation, these contain journals focused on the closely related topics of general biology and neuroscience, respectively. We look to the evolving contents of our clusters for further explanation.

thumbnail
Fig 15. Average silhouette score versus number of clusters on the Semantic Scholar dataset.

Note: a higher score is better.

https://doi.org/10.1371/journal.pcsy.0000023.g015

thumbnail
Fig 16. Matrix storing the similarity scores of short-term cluster assignment histories between every pair of nodes in the Semantic Scholar dataset using STGkM with γ = 1, λ = 1, and k = 18.

Rows and columns are ordered based on long-term cluster membership.

https://doi.org/10.1371/journal.pcsy.0000023.g016

Fig 17 tracks the four most common journals in a subset of our 18 dynamic clusters. We note no fewer than four clusters corresponding to biological topics, with separate clusters for chemistry, molecular biology, cells, and neuroscience. Other topic clusters cover communications, neural networks, operations research and wireless/mobile networks, and computer vision. The observation that journals covering similar topics are clustered together gives us confidence in STGkM’s results.

thumbnail
Fig 17. Four most common journals for a selection of clusters from the Semantic Scholar dataset using STGkM with γ = 1, λ = 1, and k = 18.

The most common journals in each cluster tend to cover similar topics.

https://doi.org/10.1371/journal.pcsy.0000023.g017

One of the advantages of STGkM is that we can track a journal’s cluster membership over time. Incidentally, this dataset also has just one large connected component. Unsurprisingly, given the overlap in topic coverage between clusters, most journals’ memberships change often. In fact, on average, a journal is a member of 5.51 clusters.

Because the majority of journals interact with many clusters, instead of distinct boundaries, we maintain loose similarities between our long-term clusters, which helps explain the hierarchical structure in Fig 16. Digging further into which journals switch clusters most often, we find that arXiv.org, with it’s general focus belongs to 10 different clusters at least once, while a highly specialized journal like IEEE Transactions on Geoscience and Remote Sensing never switches membership. Perhaps, there is a loose relationship between the breadth of topics that a given journal covers and the stability of a journal’s cluster assignment history.

5.5 Detecting clusters of subreddits

For our final experiment, we seek to find clusters of subreddits based on posts over a 2.5 year period between January 2014 and April 2017 [37]. We break the data into monthly windows and restrict to subreddits with at least 20 posts in the 2.5 year period. We define a connection between subreddit u and subreddit v if there is a hyperlink pointing to v present in the title of post u or vice versa We use the same weight function as with the Semantic Scholar data, i.e., if subreddits link to one another m times within time period t, the s-journey between those two nodes starting at time t will be scaled by . We provide two different sets of analysis based on whether the sentiment between the subreddits is positive or negative. The resulting positive sentiment connectivity matrix tracks the interactions of 1619 subreddits, while the negative sentiment connectivity matrix tracks the interactions of 115 subreddits, both over 41 months. This graph is dynamically connected and thus has just one connected component.

For both sets of analyses, we run STGkM with the default values λ = 1 and γ = 1. For the positive sentiment subreddits, our k selection process repeatedly recommends values of k towards the top end of our search range, and we report results for k = 19. For the negative sentiment subreddits, our k selection process consistently settles on k = 2. These processes are shown in Fig 18. One observation of note is that whereas the average silhouette scores for the negative sentiment subreddits are similar to those in previous experiments, the scores for positive sentiment subreddits are extremely low, suggesting that at most time steps, nodes are very difficult to separate into clusters. This hypothesis is strengthened by Fig 19, which shows the short term cluster similarity scores for both the positive and negative sentiment Reddit data, ordered by long-term cluster membership. Almost all positive sentiment subreddits belong to one long-term cluster, while negative sentiment subreddits are contained to two distinct long-term clusters. Since STGkM tracks both short- and long-term interactions between nodes, we turn to the content of our dynamic clusters for an explanation.

thumbnail
Fig 18. Average silhouette score verses number of clusters on the Reddit dataset for subreddits with positive sentiment on the left and negative sentiment on the right.

Note: a higher score is better.

https://doi.org/10.1371/journal.pcsy.0000023.g018

thumbnail
Fig 19. Matrices showing the similarity scores of short-term cluster assignment histories between every pair of vertices in the Reddit dataset using STGkM with λ = 1, γ = 1, and k = 19 on positive sentiment data on the left and λ = 1, γ = 1, and k = 2 on negative sentiment data on the right.

Rows and columns are ordered based on long-term cluster membership.

https://doi.org/10.1371/journal.pcsy.0000023.g019

The first row of Fig 20 shows the four most common positive sentiment subreddits over time in a selection of four clusters. The first cluster pertains to negative Donald Trump content, the second to operating systems and coding, the third to conservative political content, and the fourth to sports. The second row of Fig 20 shows the ten most common subreddits over time in our two dynamic negative sentiment clusters. The contents of the first cluster are primarily political, as opposed to the contents of the second, even though there is some overlap. For instance, “btc” (bitcoin) is one of the most popular subreddits in both. These results suggest that in both experiments, STGkM appropriately separates subreddits of different topics. The behaviour of our long-term clusters in Fig 19 can be explained by how often subreddits switch clusters.

thumbnail
Fig 20.

Top row: Four most common subreddits for a selection of clusters from the positive sentiment Reddit data using STGkM with γ = 1, λ = 1, and k = 19. Bottom row: Ten most common subreddits for the two dynamic clusters from the negative sentiment Reddit data using STGkM with γ = 1, λ = 1, and k = 2.

https://doi.org/10.1371/journal.pcsy.0000023.g020

On average, positive sentiment subreddits belong to 5.88 clusters. In contrast, the cluster membership of negative sentiment subreddits is much more stable. The idea that negative sentiment is much more polarizing than positive sentiment is well-studied [3]. Therefore, our observation that on average posts containing negative sentiment towards similar topics have much more stable clustering histories than posts containing positive content is unsurprising.

6 Conclusion

In [1], we introduced spatiotemporal graph k-means (STGkM) for community detection by vertex clustering on dynamic graphs. The approach is unified over space and time and gives us the ability to analyze both the short- and long-term partitions of graph vertices, monitor the multi-scale relationships between communities, and has just three explainable parameters, only one of which is required. We provide a principled approach to estimating the required parameter: the number of clusters k. In this work, we improve the runtime and thereby the feasibility of running STGkM on larger networks by embracing a new relaxation and carry out experiments on six synthetic and three real-world datasets to empirically validate performance.

We also extend our theoretical guarantees and explain clustering behavior under certain conditions. While in our conference paper, we show that the partition induced by STGkM is identical to connected components in certain cases, in this work, we distinguish the cases where we can find sensible clusters that dynamic connected components cannot. We also provide a result that demonstrates why dynamic clustering may find better partitions than a static clustering technique; in particular, static clustering methods that aggregate or “average” behavior and then cluster the resulting graph cannot capture important dynamics. Finally, we give an initial result demonstrating that STGkM is robust to noise. We provide experimental evidence of these theoretical results.

In our synthetic experiments, we showcase STGkM’s superior ability to identify both accurate long-lived partitions and cohesive short-term clusters compared to a handful of existing algorithms. In our experiments on real data, we find that STGkM is able to detect informative clusters and make interesting conclusions about trends in the datasets, such as detecting outlying political activity, the extent of similarity between subtopics of scientific journals, and supporting evidence of polarization in negative versus positive sentiment social media posts. These mathematical results align with the existing literature in each task’s respective area of study, which provides further confidence that STGkM is finding “useful” clusters for real applications. These conclusions would not be possible without STGkM’s ability to break down the multi-scale relationships between graph nodes.

One direction for further study is to extend STGkM to the online case. Because clustering is carried out at each time step independently, relying only on the centers at the previous time step to seed the current choice, it is feasible to transform STGkM to an online algorithm. The main challenges will be in updating s-journeys and deciding how many time steps of a dynamic graph to maintain and collect between cluster updates, but we leave this for future work. Inspired by the results of our experiments on the Semantic Scholar and Reddit data, another potential extension is a hierarchical version of STGkM. In the Semantic Scholar experiments, we already saw promising hierarchical structure within the short term cluster similarity matrix. On the other hand, in the Reddit experiments, we observed one massive long-term cluster, but dynamic short-term clusters with distinct topic focuses. We hypothesize that by intelligently applying STGkM to smaller and smaller subsets of data, we could more clearly extract the evident hierarchical relationships in the Semantic Scholar data and perhaps further break down the mega-cluster in the Reddit data.

In our future work, we seek to improve the efficiency of STGkM, both in practice and in theory. As STGkM is applied to larger datasets, further approximation strategies will be necessary to ensure feasibility. We would also like to provide guidance on the quality and convergence of our approximation strategies. The theoretical guarantees in this paper are only correct under narrow conditions, so we would like to provide more contexts in which clustering is assured to work correctly. Finally, we would like to provide much more precise results in the stochastic setting with more specific bounds on STGkM’s output based on noise. We will also explore online extensions of STGkM, where we explore dynamic graphs in real-time. We intend to leverage STGkM in a variety of applications and we hope that it becomes an interesting tool in the study and analysis of various network-based data for data practitioners.

Acknowledgments

We would like to thank Albert Azout and Liam Shalon at Level Ventures and J. Nathan Kutz at the University of Washington for their support.

References

  1. 1. Dabke DV, Dorabiala O. A Novel Method for Vertex Clustering in Dynamic Networks. In: Complex Networks & Their Applications XII. Springer; 2023. p. TBA.
  2. 2. Dabke DV, Dorabiala O. Spatiotemporal Graph k-means. In: Proceedings of the Communities in Networks ComNets @ NetSci 2023; 2023. p. none available.
  3. 3. Baumeister RF, Bratslavsky E, Finkenauer C, Vohs KD. Bad is stronger than good. Review of general psychology. 2001;5(4):323–370.
  4. 4. Sun Z, Sheng J, Wang B, Ullah A, Khawaja F. Identifying communities in dynamic networks using information dynamics. Entropy. 2020;22(4):425. pmid:33286200
  5. 5. Hylton A, Short R, Cleveland J, Freides O, Memon Z, Cardona R, et al. A Survey of Mathematical Structures for Lunar Networks. In: 2022 IEEE Aerospace Conference (AERO); 2022. p. 1–17.
  6. 6. Cleveland J, Hylton A, Short R, Mallery B, Green R, Curry J, et al. Introducing Tropical Geometric Approaches to Delay Tolerant Networking Optimization. In: 2022 IEEE Aerospace Conference (AERO); 2022. p. 1–11.
  7. 7. Dabke DV. On Systems of Dynamic Graphs: Theory and Applications [PhD thesis]. Princeton University. Fine Hall, 2nd Floor, Princeton, NJ, USA; 2023. Available from: https://dataspace.princeton.edu/handle/88435/dsp010z709070k.
  8. 8. Dabke DV, Chazelle B. Extracting Semantic Information from Dynamic Graphs of Geometric Data. In: Complex Networks & Their Applications X. Springer; 2021. p. 474–485.
  9. 9. Dabke DV, Arroyo EE. Rumors with Personality: A Differential and Agent-Based Model of Information Spread through Networks. SIAM Undergraduate Research Online. 2016;9:453–467.
  10. 10. Dabke DV, Karntikoon K, Aluru C, Singh M, Chazelle B. Network-Augmented Compartmental Models to Track Asymptomatic Disease Spread. Bioinformatics Advances. 2023. pmid:37476534
  11. 11. Fortunato S. Community detection in graphs. Physics reports. 2010;486(3-5):75–174.
  12. 12. Görke R, Maillard P, Schumm A, Staudt C, Wagner D. Dynamic graph clustering combining modularity and smoothness. Journal of Experimental Algorithmics (JEA). 2013;18:1–1.
  13. 13. Rossetti G, Cazabet R. Community discovery in dynamic networks: a survey. ACM computing surveys (CSUR). 2018;51(2):1–37.
  14. 14. Chakrabarti D, Kumar R, Tomkins A. Evolutionary clustering. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining; 2006. p. 554–560.
  15. 15. Lin YR, Chi Y, Zhu S, Sundaram H, Tseng BL. Facetnet: a framework for analyzing communities and their evolutions in dynamic networks. In: Proceedings of the 17th international conference on World Wide Web; 2008. p. 685–694.
  16. 16. Chi Y, Song X, Zhou D, Hino K, Tseng BL. Evolutionary spectral clustering by incorporating temporal smoothness. In: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining; 2007. p. 153–162.
  17. 17. Ruan B, Gan J, Wu H, Wirth A. Dynamic structural clustering on graphs. In: Proceedings of the 2021 International Conference on Management of Data; 2021. p. 1491–1503.
  18. 18. Yao Y, Joe-Wong C. Interpretable clustering on dynamic graphs with recurrent graph neural networks. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35; 2021. p. 4608–4616.
  19. 19. DiTursi DJ, Ghosh G, Bogdanov P. Local Community Detection in Dynamic Networks. In: 2017 IEEE International Conference on Data Mining (ICDM); 2017. p. 847–852.
  20. 20. Dorabiala O, Dabke DV, Webster J, Kutz N, Aravkin A. Spatiotemporal k-means; 2024.
  21. 21. Casteigts A, Flocchini P, Quattrociocchi W, Santoro N. Time-varying graphs and dynamic networks. International Journal of Parallel, Emergent and Distributed Systems. 2012;27(5):387–408.
  22. 22. Becker R, Casteigts A, Crescenzi P, Kodric B, Renken M, Raskin M, et al. Giant components in random temporal graphs. arXiv preprint arXiv:220514888. 2022;.
  23. 23. Gurukar S, Ranu S, Ravindran B. COMMIT: A Scalable Approach to Mining Communication Motifs from Dynamic Networks. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. SIGMOD’15. New York, NY, USA: Association for Computing Machinery; 2015. p. 475–489. Available from: https://doi.org/10.1145/2723372.2737791.
  24. 24. Lerman K, Ghosh R, Kang JH. Centrality Metric for Dynamic Networks. In: Proceedings of the Eighth Workshop on Mining and Learning with Graphs. MLG’10. New York, NY, USA: Association for Computing Machinery; 2010. p. 70––77. Available from: https://doi.org/10.1145/1830252.1830262.
  25. 25. Bergamini E, Meyerhenke H. Approximating Betweenness Centrality in Fully Dynamic Networks. Internet Mathematics. 2016;12(5):281–314.
  26. 26. Yen CC, Yeh MY, Chen MS. An Efficient Approach to Updating Closeness Centrality and Average Path Length in Dynamic Networks. In: 2013 IEEE 13th International Conference on Data Mining; 2013. p. 867–876.
  27. 27. Habiba, Tantipathananandh C, Berger-Wolf TY. Betweenness Centrality Measure in Dynamic Networks. DIMACS; 2007. 19.
  28. 28. Latapy M, Viard T, Magnien C. Stream graphs and link streams for the modeling of interactions over time. Social Network Analysis and Mining. 2018;8(1):61.
  29. 29. Megiddo N, Supowit KJ. On the complexity of some common geometric location problems. SIAM journal on computing. 1984;13(1):182–196.
  30. 30. Schubert E, Rousseeuw PJ. Fast and eager k-medoids clustering: O (k) runtime improvement of the PAM, CLARA, and CLARANS algorithms. Information Systems. 2021;101:101804.
  31. 31. Erdős P, Rényi A. On the evolution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences. 1960;5:17–60.
  32. 32. Kodinariya TM, Makwana PR, et al. Review on determining number of Cluster in K-Means Clustering. International Journal. 2013;1(6):90–95.
  33. 33. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics. 1987;20:53–65.
  34. 34. Lenssen L, Schubert E. Medoid Silhouette clustering with automatic cluster number selection. Information Systems. 2024;120:102290.
  35. 35. Reda K, Tantipathananandh C, Johnson A, Leigh J, Berger-Wolf T. Visualizing the evolution of community structures in dynamic social networks. In: Computer Graphics Forum. vol. 30. Wiley Online Library; 2011. p. 1061–1070.
  36. 36. Kinney R, Anastasiades C, Authur R, Beltagy I, Bragg J, Buraczynski A, et al. The semantic scholar open data platform. arXiv preprint arXiv:230110140. 2023;.
  37. 37. Kumar S, Hamilton WL, Leskovec J, Jurafsky D. Community interaction and conflict on the web. In: Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee; 2018. p. 933–943.