Abstract
Community detection is a classical problem in analyzing the structure of various graph-structured data. An efficient approach is to expand the community structure from a few structure centers based on the graph topology. Considering them as pseudo-labeled nodes, graph convolutional networks (GCNs) have recently been exploited to realize unsupervised community detection. However, the results are highly dependent on the initial structure centers. Moreover, a shallow GCN cannot effectively propagate a limited amount of label information to the entire graph, since the graph convolution is a localized filter. In this paper, we develop a GCN-based unsupervised community detection method with structure center Refinement and pseudo-labeled set Expansion (RE-GCN), considering both the network topology and node attributes. To reduce the adverse effect of inappropriate structure centers, we iteratively refine them by alternating between two steps: obtaining a temporary graph partition by a GCN trained with the current structure centers, and updating each structure center to the node with the highest structure importance in the corresponding induced subgraph. To improve the label propagation ability of the shallow GCN, we expand the pseudo-labeled set by selecting a few nodes whose affiliation strengths to a community are similar to that of its structure center. The final GCN is trained with the expanded pseudo-labeled set to realize community detection. Extensive experiments demonstrate the effectiveness of the proposed approach on both attributed and non-attributed networks. The refinement process yields a set of more representative structure centers, and the community detection performance of the GCN improves as the number of pseudo-labeled nodes increases.
Citation: Guo B, Deng L, Lian T (2025) GCN-based unsupervised community detection with refined structure centers and expanded pseudo-labeled set. PLoS One 20(7): e0327022. https://doi.org/10.1371/journal.pone.0327022
Editor: Rashmi Sahay, ICFAI Foundation for Higher Education Faculty of Science and Technology, INDIA
Received: November 11, 2024; Accepted: June 9, 2025; Published: July 1, 2025
Copyright: © 2025 Guo et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The datasets analysed during the current study are available in public repositories, which are cited in the corresponding footnotes.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Many complex systems in the real world can be abstracted as networks [1], e.g., social networks, biological networks, and citation networks. In some cases, the nodes in these networks are also associated with rich attributes. A prominent feature of various networks is the existence of community structure [2–4]: the organization of nodes into groups, where nodes in the same group are densely connected or share similar attributes. Community detection helps reveal the mesoscale properties of complex networks [5,6]. For example, it can facilitate the detection of protein complexes and functional modules in protein-protein interaction networks [7].
Much existing research explores the community structure from the global view [8], which takes the entire network as a whole and optimizes some global quality function. Typical global methods include modularity maximization [9–11], spectral clustering [12], hierarchical clustering [2,13], etc. Global methods are often computationally expensive and face the resolution limit [14], which prevents them from identifying small communities in large networks. Moreover, sometimes one might only care about communities in a small region, which should have little to do with portions of the network that are far away [8].
An alternative approach is local community detection, which only utilizes local information to build individual communities around a few seed nodes [15–19]. These methods are computationally efficient and do not need to analyze the entire network. Among them, local expansion methods are widely used for local community detection in large networks. Such methods build a local community around a specified seed node by greedily adding nodes into the community until a local optimum of some quality function is reached [20,21]. However, a few seed nodes are required to be specified in advance, and the results are sensitive to the initial seeds [15]. To overcome this problem, Wang et al. [16], inspired by [22], proposed a method to automatically identify the structure centers of a network. These nodes are characterized by a higher local density than their neighboring nodes and a relatively large distance from other nodes with higher density. They can be used as seed nodes for local expansion methods.
Recently, community detection through deep learning has received considerable attention [23]. In particular, graph convolutional network (GCN) is exploited in many works to realize community detection [17,18,24]. The graph convolution layer can be seen as a local filter that can efficiently propagate and aggregate the information of local neighbors to derive low-dimensional node representations, which are further used to infer their community labels. To train the GCN, only a few seed nodes [17] (or structure centers [18]) can be used as labeled (or pseudo-labeled) nodes. Hence, Wang et al. exploited the label propagation algorithm [25] to acquire a little extra supervision information [17], or simply added several neighbors of the structure centers into the training set [18].
However, the community detection performance of the above methods is hindered by two obstacles: inappropriate structure centers and insufficient propagation ability.
The initial structure centers may be inappropriate, which has an adverse effect on the resulting communities. As shown in Fig 1, two of the three nodes with the largest structural centrality [22] belong to the same community, leaving out the nodes in the third ground-truth community. The reason is that the structure center selection procedure only exploits the network topology, ignoring the node attributes. Thus, it is necessary to refine the initial structure centers that serve as seed nodes for community detection.
Fig 1. (a) The ground-truth community partition. (b) The structural centralities of different nodes.
GCN cannot effectively propagate the labels to the entire graph when given a limited amount of supervision information [26]. As is known, the graph convolution is a local filter that induces a node's representation by aggregating its neighbors' information. To avoid over-smoothing, shallow GCNs are widely used in the literature, which however have insufficient propagation ability on large networks with only a few seeds [26]. A larger and balanced set of pseudo-labeled nodes needs to be constructed and fed into the GCN as supervision information.
To overcome these two problems, we propose a novel unsupervised approach to community detection based on GCN, which leverages both graph topology and node attributes to refine the structure centers and expand the set of pseudo-labeled nodes. It firstly identifies a few structure centers that have high local density and are far away from each other. To reduce the adverse effect of inappropriate structure centers, we iteratively refine the initial structure centers by alternating between two steps: obtaining a temporary graph partition by training a GCN with the current structure centers; updating each structure center to the node with the highest structure importance in the corresponding induced subgraph. The process is shown in Fig 2 with an example network. The initial structure centers (i.e., nodes 1, 8, and 17) in Fig 2(a) are updated to a set of more representative seeds (i.e., nodes 1, 8, and 21) in Fig 2(e). For larger networks, a GCN trained only with these few structure centers is not able to make accurate predictions for all the remaining nodes. To make up for the lack of propagation ability, we construct a larger and balanced pseudo-labeled set by selecting several nodes whose affiliation strength to a community is similar to that of its structure center. The final GCN is trained with the expanded pseudo-labeled set to realize community detection.
Fig 2. (a) Initial structure centers. (b) Graph partition in 1st pass. (c) Updated structure centers. (d) Graph partition in 2nd pass. (e) Updated structure centers. (f) Graph partition in 3rd pass.
The main contributions of this paper are summarized as follows.
- We propose an unsupervised approach to community detection based on GCN which can leverage both network topology and node attributes, and demonstrate its effectiveness on both attributed and non-attributed networks.
- We develop an iterative structure center refinement strategy which can yield a better set of proper structure centers and lay a good foundation for community detection.
- We devise a pseudo-labeled set expansion strategy based on community affiliation strength which can make up for the lack of propagation ability of shallow GCN by supplying it with a larger amount of supervision information.
The rest of this paper is organized as follows. In section Related Work, we introduce related work on community detection and graph convolutional network. In section Preliminaries, we present the problem formulation along with other preliminaries. In section Methodology, we elaborate the proposed approach in detail. Experiment results and analysis are presented in section Experiments, followed by concluding remarks in section Conclusion.
Related work
In this section, we present the related work on community detection and graph convolutional networks.
Community detection
Community detection is one of the important tasks in network data mining that helps us to analyze and understand the structural properties and group characteristics of various networks. Existing research attempts to explore the community structure either from the global view or from the local view.
Global methods require information about the whole network structure and partition it from a global perspective [8]. Recently, some scholars have designed clustering frameworks in which the type of graph, the sparsity and noise of the initial graph, and multi-scale information embedding are fully considered before clustering, since all of these factors affect the clustering result. For example, the authors of [27] propose a method that can cluster homophilic and heterophilic graphs simultaneously: homophilic and heterophilic graphs are first constructed separately, the two graphs are then fused into a single graph, and finally the attributes and structure of the fused graph are encoded and learned. Building on this, the authors of [28] propose a novel method, namely deep attention-guided graph clustering with dual self-supervision (DAGC). Inspired by the success of Variational Graph Auto-Encoders (VGAEs), the article [29] improves VGAE-type methods by formulating a new variational lower bound that incorporates an explicit clustering objective. To further improve clustering performance, the article [30] proposes the Embedding-Induced Graph Refinement Clustering Network (EGRC-Net), which effectively utilizes the learned embedding to adaptively refine the initial graph. Typical global methods include modularity maximization [9–11], spectral clustering [12], and hierarchical clustering [2,13]. The global approach has several limitations. For example, computing the eigenvectors in spectral clustering is time-consuming for large networks, and modularity optimization may fail to identify communities below a certain scale, i.e., the resolution limit [14].
Moreover, it is hard to know the entire network in real settings, which is also unnecessary if the user only wants to know the local community structure in a small region of a huge network.
Compared with global methods, local methods can effectively discover communities without complete information about the entire network [27]. A widely used approach is to start from a few seed nodes and expand them into several local communities [15–17]. Such methods can be parallelized and are scalable to large networks. However, local expansion methods only perform well when the seed nodes are located in the core region of individual communities, which is known as the seed-dependence problem. To alleviate this issue, several studies make efforts to select a good set of seeds [15,16,30,31]. For instance, Chen et al. [30] considered the node with local maximum degree as a better starting node. Inspired by Rodriguez and Laio [22], Wang et al. [16] recently proposed the structural centrality index to identify structural centers in a network that have a higher local density than their neighbors and a relatively large distance from other nodes with higher densities. But inappropriate nodes may be identified as structure centers, since the above method only exploits the network topology, ignoring the node attributes.
GCN-based community detection
In recent years, different types of graph neural networks [29,32,33] have been proposed to boost the performance of various graph analysis tasks. In particular, Graph Convolutional Network (GCN) [32] is a successful attempt of generalizing the powerful convolution operation from Euclidean data to graph-structured data, which can effectively integrate the network topology and node attributes to extract deeper network features. However, traditional GCNs often assume isotropic information propagation, neglecting directional relationships that could better capture community structures [34]. Variants of GCN have shown excellent performance in different tasks, such as node classification [35], personalized recommendation [36], and traffic prediction [37]. It’s worth noting that traditional graph clustering methods like symmetric nonnegative matrix factorization [38] still provide valuable theoretical foundations for modern GNN-based approaches.
Naturally, GCN has also been applied to the problem of community detection on complex networks. For example, Jin et al. [24] integrated GCN and MRF to realize semi-supervised community detection. Note that semi-supervised learning for GCN requires a considerable amount of labeled nodes to achieve satisfying performance [26], although recent semi-supervised deep attributed clustering methods [39] have shown promising results in reducing annotation dependency. However, it is hard to acquire enough high-quality labeled nodes for community detection in large networks. Hence, several GCN-based unsupervised methods have been proposed for community detection [40–42]. While spectral clustering with graph learning [43] represents another important direction, our work focuses specifically on GCN-based architectures. Among the unsupervised methods, the graph autoencoder is a commonly adopted architecture [29,40,41], where a graph convolutional module is used as the encoder to obtain latent node representations, which are then passed through the decoder to minimize the reconstruction error of the graph adjacency matrix (as well as the node attributes). The learned node representations are finally used to infer community labels by node clustering. The CLEAR model [44] is a novel unsupervised GNN model with cluster-aware self-training, which learns embeddings using intrinsic network cluster properties and thus needs no direct supervision from labels. Moreover, unlike other GNN models that rely on a static graph structure, CLEAR further proposes a topology refining scheme that reduces inter-cluster connections of neighbor nodes to alleviate the impact of noisy edges. However, its refinement process only considers the topology, ignoring the node attributes, and does not account for anomalous edge patterns that may distort community detection [45].
Preliminaries
This section first presents our problem formulation along with basic notations, and then introduces the essentials of graph convolutional network used in our model.
Problem formulation
We are interested in the community detection task in an undirected graph $G=(V,E)$, where $V=\{v_1, v_2, \dots, v_N\}$ denotes the set of nodes and $E$ denotes the set of edges. The topology of $G$ can be represented by its adjacency matrix $A \in \{0,1\}^{N \times N}$. In settings where the nodes in $V$ are associated with attributes, $X \in \mathbb{R}^{N \times F}$ is used to denote the feature matrix, where the i-th row, i.e., $x_i$, is the feature vector for node $v_i$. In this paper, we focus on the problem of non-overlapping community detection based on the network topology A and node attributes X, which aims to partition the node set $V$ into a set of disjoint communities $\{C_1, C_2, \dots, C_K\}$, where $C_{k_1} \cap C_{k_2} = \emptyset$ for $k_1 \neq k_2$ and $\bigcup_{k=1}^{K} C_k = V$. We treat it as an unsupervised learning task, where the ground-truth community label of each node is not available for training.
Graph convolutional network
To make full use of network topology and attribute information, GCN is a basic module in our method. GCN [32] is a multi-layer neural network that operates directly on a homogeneous graph and induces the embedding vector of a node based on the properties of its neighbors. The layer-wise propagation rule is as follows:

$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$

It is a special form of localized filter: a linear combination of the feature vectors of adjacent neighbors. Here $\tilde{A} = A + I_N$ is the adjacency matrix with added self-connections. Let $\tilde{D}$ be the degree matrix, where $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. $W^{(l)}$ is a layer-specific trainable transformation matrix. $\sigma(\cdot)$ denotes an activation function such as ReLU. $H^{(l)}$ denotes the hidden representations of nodes in the l-th layer. Initially, $H^{(0)} = X$. For non-attributed networks, X is initialized as one-hot representations of nodes in the graph.
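The propagation rule above can be sketched in a few lines of NumPy; the toy graph, random weights, and single-layer setup are purely illustrative and not part of the paper's experimental configuration.

```python
import numpy as np

def gcn_layer(A, H, W, activation=lambda x: np.maximum(x, 0)):
    """One graph convolution layer: ReLU(D~^{-1/2} A~ D~^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])            # A~ = A + I (self-connections)
    d = A_tilde.sum(axis=1)                     # degrees of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # D~^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetrically normalized adjacency
    return activation(A_hat @ H @ W)

# Toy graph (a triangle plus a pendant node) with one-hot features,
# as in the non-attributed case.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.eye(4)
rng = np.random.default_rng(0)
W0 = rng.standard_normal((4, 2))   # random weights for illustration only
H1 = gcn_layer(A, X, W0)
print(H1.shape)                    # (4, 2)
```

Stacking two such layers, with softmax in place of ReLU on the second, gives the two-layer architecture used throughout the paper.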
Methodology
Overview
The proposed model framework is shown in Fig 3. Firstly, we select a few structure centers based on the graph topology. Secondly, the initial structure centers are iteratively updated by considering both the network topology and node attributes. The refined structure centers can be regarded as representatives of different communities, and constitute a small set of pseudo-labeled nodes, one per community. Thirdly, we assign pseudo community labels to more nodes based on the temporary partition, yielding a larger training set of pseudo-labeled nodes. With the expanded pseudo-labeled training set, the GCN can be trained to predict the community labels of the remaining nodes.
Selecting initial structure centers
The selection of initial structure centers is particularly important. As the carrier of initial labels, they affect the resulting communities to a certain extent. A structure center should have high local density and meanwhile keep a relatively large distance from other nodes with higher density. Thus, the structural centrality of a node should take into account two aspects: the local density and the relative distance [22]. In the following, we present the corresponding definitions.
Definition 1: Local Density.
The local density $\rho_i$ of node $v_i$ in the network is defined as:

$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c)$

where $d_{ij}$ denotes the distance between node $v_i$ and $v_j$, $d_c$ is a cutoff distance, and $\chi(\cdot)$ is the Heaviside step function with $\chi(x) = 1$ if $x \le 0$ and $\chi(x) = 0$ otherwise.

Intuitively, $\rho_i$ is equal to the number of nodes within distance $d_c$ of node $v_i$. When $d_c = 1$, $\rho_i$ is equal to the number of nodes directly connected to node $v_i$, i.e., its degree.
Definition 2: Relative Distance.
The relative distance $\delta_i$ is measured by computing the minimum distance between node $v_i$ and any other node with higher local density:

$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij}$

If node $v_i$ has the highest local density, there is no node $v_j$ with larger density $\rho_j$ than $\rho_i$, so we conventionally set $\delta_i = \max_j d_{ij}$. Note that $\delta_i$ is much larger than the typical nearest-neighbor distance only for nodes with a local maximum density.
Definition 3: Structural Centrality.
A structural center should not only have a higher density than its neighbors, but also keep a relatively large distance from other nodes with higher local density. The structural centrality $\gamma_i$ of node $v_i$ is defined as:

$\gamma_i = \rho_i \times \delta_i$

The requirement of a relatively large $\delta_i$ avoids, to some extent, the situation in which multiple nodes with high local density in the same community are simultaneously identified as structure centers.

The procedure for selecting initial structure centers is listed in Algorithm 1, which selects the K nodes with the largest centrality and assigns them distinct community labels. The output is denoted as $\{c_1, c_2, \dots, c_K\}$, where $c_k$ denotes the k-th structure center and is assigned pseudo community label k.
Algorithm 1 Initial structure centers selection.
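As a rough illustration of Definitions 1–3 and Algorithm 1, the sketch below computes structural centralities from graph distances and returns the top-K nodes. The function name and the choice $d_c = 1$ (so that density reduces to degree) are ours, not the paper's.

```python
import numpy as np
import networkx as nx

def structural_centers(G, K, d_c=1):
    """Rank nodes by structural centrality gamma_i = rho_i * delta_i
    and return the top-K as initial structure centers."""
    nodes = list(G.nodes())
    dist = dict(nx.all_pairs_shortest_path_length(G))
    # Definition 1: local density = number of nodes within distance d_c.
    rho = {v: sum(1 for u in nodes if u != v and dist[v].get(u, np.inf) <= d_c)
           for v in nodes}
    # Definition 2: distance to the nearest node of higher density;
    # for the densest node, fall back to its largest distance to any node.
    delta = {}
    for v in nodes:
        higher = [d for u, d in dist[v].items() if rho[u] > rho[v]]
        delta[v] = min(higher) if higher else max(dist[v].values())
    # Definition 3: structural centrality.
    gamma = {v: rho[v] * delta[v] for v in nodes}
    return sorted(nodes, key=lambda v: -gamma[v])[:K]

# On Zachary's karate club, the two selected centers are the two hubs
# around which the ground-truth factions form.
G = nx.karate_club_graph()
centers = structural_centers(G, K=2)
print(sorted(centers))  # [0, 33]
```

Note that a node of globally maximal density automatically receives a large $\delta$, so hubs of different communities, rather than several hubs of one community, tend to be ranked highest.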
Refining structure centers
Now we have identified K structure centers by Algorithm 1. Ideally, for each ground-truth community, there is one structure center belonging to it, which can be regarded as a representative node of the community. In practice, sometimes none of the nodes in a community is identified as a structure center, while more than one structure center might belong to the same ground-truth community, as shown in Fig 2(a). Hence, the results may be unsatisfactory: a small community may disappear if none of its nodes is identified as an initial structure center, and a large community might be split into multiple fragments if more than one of its nodes is identified as an initial structure center. The reason is that the procedure for selecting initial structure centers only depends on the graph topology, ignoring the node attributes. Nevertheless, many real networks exhibit the homophily principle [46]: nodes with similar attributes are more likely to connect to each other, forming a cohesive community.
To reduce the adverse effect of inappropriate structure centers, we propose to refine the initial structure centers by leveraging both graph topology and node attributes. Technically, we propose to refine the initial structure centers iteratively, as shown in algorithm 2. Firstly, we train a GCN to predict the community labels for each node, yielding a temporary partition of all nodes in the graph. Secondly, we build K induced subgraphs and refine the k-th structure center according to the structure of the k-th subgraph. The two steps are repeated until a stable state is reached—the structure centers stay unchanged between two consecutive iterations. Fig 2 visualizes the update process of structure centers for an example network.
Specifically, a two-layer GCN is trained under the supervision of the current structure centers. The GCN takes as input the graph adjacency matrix A and node attribute matrix X. The output, denoted as $Z \in \mathbb{R}^{N \times K}$, is computed as

$Z = \mathrm{softmax}\left(\hat{A}\, \mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$

where $\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ is the normalized adjacency matrix with self-connections, $W^{(0)}$ and $W^{(1)}$ are the weight parameters of the two graph convolution layers, and ReLU and softmax are the activation functions of the first and second GCN layers, respectively. Let the entry $z_{i,k}$ denote the affiliation strength of node $v_i$ to the k-th community. Then the predicted community label for an unlabeled node $v_i$ is

$\hat{y}_i = \arg\max_{k} z_{i,k}$

Based on the predicted community labels, we can obtain a temporary partition of the nodes $\{V_1, V_2, \dots, V_K\}$, where $V_k = \{v_i \mid \hat{y}_i = k\}$.
Then K subgraphs can be induced: $G_k = (V_k, E_k)$, where $E_k = \{(v_i, v_j) \in E \mid v_i, v_j \in V_k\}$, for $k = 1, \dots, K$. Some bad structure centers may emerge now: the k-th initial structure center, i.e., $c_k$, may be located on the periphery of subgraph $G_k$ or even in another subgraph $G_{k'}$. To find the core of $G_k$, we compute the local structure importance of each node in $G_k$, which is calculated from the perspective of shortest paths [47]. Formally speaking, we introduce the following definitions.
Definition 4: SLP (Similarity based on Local Simple Paths).
Given a network $G_k = (V_k, E_k)$, the SLP between nodes $v_i$ and $v_j$ is defined as:

$\mathrm{SLP}(v_i, v_j) = \sum_{l=1}^{3} w_l \cdot n_l(v_i, v_j)$

where $n_l(v_i, v_j)$ is the number of simple paths (paths with no repeated nodes) of length l between nodes $v_i$ and $v_j$, and $w_1, w_2, w_3$ are non-negative weights that satisfy $w_1 \ge w_2 \ge w_3$ and $w_l \ge 0$. Intuitively, a higher value of $\mathrm{SLP}(v_i, v_j)$ indicates that the two nodes $v_i$ and $v_j$ are better connected in the subgraph $G_k$.
Definition 5: Local Structural Importance.
Given a network $G_k = (V_k, E_k)$, the local structural importance of node $v_i$ is defined as:

$I(v_i) = \sum_{v_j \in V_k,\, j \neq i} \mathrm{SLP}(v_i, v_j)$

Since $\mathrm{SLP}(v_i, v_j)$ measures the connectivity strength between nodes $v_i$ and $v_j$ from the perspective of the number of simple paths of length at most 3, $I(v_i)$ can indicate the local structure importance of node $v_i$ with respect to its surrounding nodes. A higher value of $I(v_i)$ means that node $v_i$ is closely connected to other nodes in its local neighborhood and is more likely to be a center of $G_k$. Therefore, the node with the largest value of $I(v_i)$ is selected as the structure center for subgraph $G_k$:

$c_k \leftarrow \arg\max_{v_i \in V_k} I(v_i)$
Algorithm 2 Structure centers refinement.
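The center-update step of the refinement loop can be sketched as follows with NetworkX. The decaying weights (1.0, 0.5, 0.25) are placeholder values of our own choosing (the paper only requires non-negative, non-increasing weights), and the helper names are illustrative.

```python
import networkx as nx

def slp(G, u, v, weights=(1.0, 0.5, 0.25), max_len=3):
    """SLP similarity: weighted count of simple paths of length <= max_len
    between u and v. The weights here are illustrative placeholders."""
    counts = [0] * max_len
    for path in nx.all_simple_paths(G, u, v, cutoff=max_len):
        counts[len(path) - 2] += 1  # a path with l edges visits l + 1 nodes
    return sum(w * c for w, c in zip(weights, counts))

def local_structural_importance(G_k, v):
    """Definition 5: sum of SLP similarities between v and the other nodes."""
    return sum(slp(G_k, v, u) for u in G_k.nodes() if u != v)

def refine_center(G_k):
    """Pick the node with the highest local structural importance as the
    new structure center of the induced subgraph G_k."""
    return max(G_k.nodes(), key=lambda v: local_structural_importance(G_k, v))

# A small community: the densely wired nodes score highest, the pendant
# node 4 scores lowest.
G_k = nx.Graph([(0, 1), (0, 2), (0, 3), (1, 2), (2, 3), (3, 4)])
print(refine_center(G_k))
```

Because SLP counts short simple paths rather than direct edges alone, a peripheral node of high degree cannot outrank a node embedded in the dense core of the subgraph.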
Expanding pseudo-labeled set
Now we have obtained a set of K refined structure centers, each of which is assigned a distinct pseudo community label. To realize community partition, a GCN can be trained to predict the community labels of the other nodes in the graph by integrating the graph structure and node attributes. As is known, the graph convolution is a localized filter, hence a GCN cannot effectively propagate the label information to the entire graph when only a limited number of labeled nodes is available [26]. In order to train a better GCN for community detection, we propose to expand the pseudo-labeled set by the following process.
When the structure centers are iteratively updated in Algorithm 2, a temporary graph partition is also returned: node $v_i$ is assigned to $V_k$ if $\hat{y}_i = k$. For each node $v_i \in V_k$, we compute the difference between its affiliation strength to the k-th community and that of the structure center $c_k$, i.e., $\Delta_{i,k} = |z_{i,k} - z_{c_k,k}|$. Then we construct the set of nodes $S_k$ with pseudo label k by selecting the $n$ nodes with the smallest difference among $V_k$:

$S_k = \operatorname{min-}n\{\Delta_{i,k} \mid v_i \in V_k\}$

where $\operatorname{min-}n(\cdot)$ is the function that selects the $n$ nodes with the smallest values. The lower bound of $n$ is estimated by solving $K \cdot n \cdot \bar{d}^{\,L} \ge N$, where $\bar{d}$ is the average degree of nodes in G, and L is the number of graph convolution layers, which is 2 in our experiments. Note that the pseudo-label expansion process takes into account both the graph topology and node attributes, since the probability $z_{i,k}$ for node $v_i$ given by an L-layer GCN depends on the labels and attributes of its L-hop neighbors in the graph, as defined in Eq 6. In this way, we expand the set of refined structure centers $\{c_1, \dots, c_K\}$ to a larger set of pseudo-labeled nodes $S = S_1 \cup S_2 \cup \cdots \cup S_K$.
An alternative strategy is to select the nodes with the largest $z_{i,k}$ among all nodes in $V_k$ [26], i.e., $S_k = \operatorname{max-}n\{z_{i,k} \mid v_i \in V_k\}$. We did not adopt this strategy because it is sensitive to inappropriate structure centers. Besides, some nodes satisfying this criterion are low-degree nodes located on the periphery of a community and far away from other communities. Although such nodes belong to the corresponding community with high confidence, they have limited propagation ability.
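The adopted expansion strategy can be sketched as follows; the toy affiliation matrix and the function name are illustrative.

```python
import numpy as np

def expand_pseudo_labels(Z, centers, n_per_class):
    """Expand the pseudo-labeled set from the soft assignments Z.

    Z: (N, K) community affiliation strengths from the trained GCN.
    centers: list of K refined structure-center node indices.
    For community k, pick the n nodes (among those currently assigned
    to k) whose affiliation strength z[i, k] is closest to that of the
    center c_k.
    """
    pred = Z.argmax(axis=1)                       # temporary partition
    expanded = {}
    for k, c_k in enumerate(centers):
        members = np.where(pred == k)[0]
        diff = np.abs(Z[members, k] - Z[c_k, k])  # closeness to the center
        order = members[np.argsort(diff)]
        expanded[k] = order[:n_per_class].tolist()
    return expanded

# Toy affiliation matrix for 6 nodes and K = 2 communities,
# with centers at nodes 0 and 4.
Z = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.6, 0.4],
              [0.2, 0.8],
              [0.1, 0.9],
              [0.4, 0.6]])
print(expand_pseudo_labels(Z, centers=[0, 4], n_per_class=2))
# → {0: [0, 1], 1: [4, 3]}
```

Each center has zero difference to itself, so $c_k \in S_k$ by construction, and every community contributes the same number of pseudo-labels, keeping the expanded set balanced.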
Training GCN with expanded pseudo-labeled set
Under the supervision of the larger set of pseudo-labeled nodes $S$, we can train a two-layer GCN with the same structure as defined by Eq 6. We adopt the cross-entropy loss over all pseudo-labeled nodes:

$\mathcal{L} = -\sum_{k=1}^{K} \sum_{v_i \in S_k} \ln z_{i,k}$

where $S_k$ denotes the set of nodes with pseudo label k, and K is the output dimension of the softmax layer, which corresponds to the number of communities. The Adam [48] optimizer is used to update the model parameters $W^{(0)}$ and $W^{(1)}$. Once the GCN is trained to convergence, we can predict the community label $\hat{y}_i$ for every node $v_i$ using Eq 7 and obtain the final community partition $\{C_1, \dots, C_K\}$, where $C_k = \{v_i \mid \hat{y}_i = k\}$.
The whole process of our proposed method is shown in Algorithm 3.
Algorithm 3 RE-GCN community detection.
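As a minimal sketch of the training objective, the snippet below evaluates the cross-entropy over the pseudo-labeled nodes only (averaged here rather than summed, a common implementation choice); the toy softmax outputs are illustrative.

```python
import numpy as np

def masked_cross_entropy(Z, pseudo_labels):
    """Cross-entropy over the pseudo-labeled nodes only.

    Z: (N, K) softmax outputs of the two-layer GCN.
    pseudo_labels: dict mapping node index -> pseudo community label.
    Unlabeled nodes contribute nothing to the loss.
    """
    nodes = np.array(list(pseudo_labels.keys()))
    labels = np.array(list(pseudo_labels.values()))
    return -np.mean(np.log(Z[nodes, labels] + 1e-12))

# Toy softmax outputs for 4 nodes and K = 2 communities; node 3 is unlabeled.
Z = np.array([[0.9, 0.1],
              [0.7, 0.3],
              [0.2, 0.8],
              [0.5, 0.5]])
loss = masked_cross_entropy(Z, {0: 0, 1: 0, 2: 1})
print(round(float(loss), 4))           # ~0.2284
final_partition = Z.argmax(axis=1)     # community label of every node
```

In an actual training loop, this loss would be minimized with Adam over the GCN weights; after convergence, the argmax over Z yields the final partition.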
Experiments
In this section, we validate the performance of the proposed community detection method on various real-world networks. We conduct extensive experiments with the aim of answering the following research questions:
- RQ1: How well does the proposed RE-GCN perform in detecting communities on both attributed and non-attributed networks compared with other methods?
- RQ2: Are both structure center refinement and pseudo-labeled set expansion essential for RE-GCN?
- RQ3: Does the step of refining structure centers indeed yield better structure centers for later community detection?
- RQ4: How do the expansion strategy and the size of pseudo-labeled set influence the community detection performance?
Datasets
We conduct extensive experiments on 8 public network datasets, including Karate (http://konect.cc/networks/ucidata-zachary/), Dolphins (http://www-personal.umich.edu/~mejn/netdata/), Football (https://www.cc.gatech.edu/dimacs10/archive/clustering.shtml), PolBooks (https://www.cc.gatech.edu/dimacs10/archive/clustering.shtml), PolBlogs (https://www.cc.gatech.edu/dimacs10/archive/clustering.shtml), Cora (https://linqs.org/datasets/#cora), CiteSeer (https://linqs.org/datasets/#citeseer-doc-classification), and PubMed (https://linqs.org/datasets/#pubmed-diabetes). Detailed statistics are shown in the dataset statistics table.
- Karate [49]:
- The Zachary’s karate club network is a network of friendship among 34 members of a karate club. Over a period of time the club split into two factions due to leadership issues and each member joined one of the two factions.
- Dolphins [50]:
The dolphin social network was constructed from observations recording frequent associations among a group of 62 bottlenose dolphins over a period of 7 years, from 1994 to 2001. In this network, dolphins are represented as nodes, and two dolphins are linked by an edge if they were observed together more often than expected by chance. In previous studies, the network has generally been divided into two communities according to the sex and age of the dolphins.
- Football [2]:
- This is a network of American football games between Division IA colleges during regular season Fall 2000. In the network nodes denote the 115 teams that are divided into 12 conferences, and the edges represent 613 games.
- PolBooks [51]:
The US politics book network includes 105 nodes that represent books about US politics sold by the online bookseller Amazon.com. Edges represent frequent co-purchasing of books by the same buyers. The political orientation of each book (liberal, neutral, or conservative) is taken as the ground-truth community label in our experiment.
- PolBlogs [52]:
- The PolBlogs dataset is a directed network of hyperlinks between political blogs collected during the 2004 U.S. election. It includes 1,490 nodes and 16,715 directed edges. The political orientation of each blog is either conservative or liberal.
- Cora [53]:
- The Cora dataset consists of 2,708 machine learning papers classified into one of the seven classes — Case Based, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning, and Theory. The citation network consists of 5,429 links. Each publication is described by a 1,433 dimensional 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary.
- CiteSeer [53]:
- The CiteSeer dataset consists of 3,312 scientific publications classified into one of the six classes — Agents, AI, DB, IR, ML, and HCI. The citation network consists of 4,732 links. Each publication is described by a 3,703 dimensional 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary.
- PubMed [54]:
- The PubMed dataset consists of 19,717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes (“Diabetes Mellitus, Experimental”, “Diabetes Mellitus Type 1”, “Diabetes Mellitus Type 2”). The citation network consists of 44,338 links. Each publication is described by a TF-IDF weighted word vector from a dictionary which consists of 500 unique words.
Evaluation metrics
To evaluate the community detection performance of baselines and our method, we utilize three widely used performance metrics—Accuracy, Normalized Mutual Information (NMI) [55] and Adjusted Rand Index (ARI) [56]—to evaluate division performance from different perspectives. They assess the community quality by measuring the agreement between the community partition predicted by an algorithm and the ground-truth community partition of the network. Let $\mathcal{P} = \{P_1, P_2, \dots, P_A\}$ be the ground-truth community partition with $A$ communities, and $\mathcal{C} = \{C_1, C_2, \dots, C_B\}$ be the community partition detected by an algorithm.

Accuracy is the ratio of the number of correctly predicted samples to the whole number of samples, which is defined as given in Eq. (15):

$$ACC = \frac{1}{N}\sum_{i=1}^{N} \delta\big(P_i, \mathrm{map}(C_i)\big) \tag{15}$$

where $P_i$ is the actual category label of the $i$-th sample, and $C_i$ is the predicted category label of the model on the $i$-th sample. The map function establishes a mapping between predicted community labels and ground-truth community labels such that the highest accuracy is reached given the partition. $\delta(\cdot,\cdot)$ denotes an indicator function defined as shown in Eq. (16):

$$\delta(x, y) = \begin{cases} 1, & \text{if } x = y \\ 0, & \text{otherwise} \end{cases} \tag{16}$$

NMI measures the amount of information shared between the detected partition and the ground-truth partition:

$$NMI = \frac{-2\sum_{i=1}^{A}\sum_{j=1}^{B} n_{ij}\log\frac{n_{ij} N}{n_{i\cdot}\, n_{\cdot j}}}{\sum_{i=1}^{A} n_{i\cdot}\log\frac{n_{i\cdot}}{N} + \sum_{j=1}^{B} n_{\cdot j}\log\frac{n_{\cdot j}}{N}}$$

where $n_{ij}$ is the number of nodes in common between the ground-truth community $P_i$ and detected community $C_j$, $n_{i\cdot} = \sum_{j} n_{ij}$, $n_{\cdot j} = \sum_{i} n_{ij}$, and $N$ is the total number of nodes in $G$. ARI [56] is the corrected-for-chance version of the Rand index.

NMI lies in $[0, 1]$ and ARI in $[-1, 1]$. Either value equals 1 only if the community partition detected by an algorithm is completely identical to the ground-truth community partition, and is close to 0 for a random partition.
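As an illustration of how these metrics can be computed in practice, the sketch below implements Accuracy with the optimal label mapping (the map function) via the Hungarian algorithm, and relies on scikit-learn for NMI and ARI. This is a generic sketch, not the evaluation code used in this paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping between predicted
    community labels and ground-truth labels (the `map` function)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    # Contingency table: cost[p, t] = #nodes with predicted label p and true label t.
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    # The Hungarian algorithm maximizes total agreement over label mappings.
    row, col = linear_sum_assignment(-cost)
    return cost[row, col].sum() / len(y_true)

truth = [0, 0, 1, 1, 2, 2]
pred  = [1, 1, 0, 0, 2, 2]   # the same partition with permuted labels
print(clustering_accuracy(truth, pred))           # 1.0
print(normalized_mutual_info_score(truth, pred))  # 1.0
print(adjusted_rand_score(truth, pred))           # 1.0
```

All three metrics are invariant to permutations of the community labels, which is why a relabeled but otherwise identical partition scores 1.0.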
Baselines
We compare our method with 11 baseline methods listed in tab:models. These methods can be classified into three types according to the network information they exploit. The first type only uses graph structure, including GN [2], LP [25], BGLL [13], and DeepWalk [57]. The second type is the K-means clustering algorithm [59], which only uses node features. The third type uses both the graph structure and node features, including TADW [60], VGAE [29], GUCD [42], DGI [61], ARGA [41], MGAE [40], GRV [62], and SP-AGCL [63]. These studies mostly adopt graph neural networks to learn node embeddings, and then apply the K-means algorithm to obtain node clusters.
- GN [2]:
- The Girvan-Newman (GN) algorithm detects communities by progressively removing edges with the highest edge betweenness which is defined as the number of shortest paths between node pairs that run through the edge. The edges connecting different communities typically have high edge betweenness, thus removing such edges will separate different groups from one another.
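For readers who want to try the GN procedure, NetworkX provides both edge betweenness and the Girvan-Newman splitting. The toy graph below (invented for illustration, not a dataset from this paper) consists of two triangles joined by a bridge edge; the bridge carries all cross-community shortest paths, has the highest betweenness, and is removed first.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

# Two triangles joined by a single bridge edge (2, 3).
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])

ebc = nx.edge_betweenness_centrality(G)
bridge = max(ebc, key=ebc.get)            # the bridge edge {2, 3}

# The first split produced by iteratively removing max-betweenness edges.
communities = next(girvan_newman(G))
print(sorted(sorted(c) for c in communities))  # [[0, 1, 2], [3, 4, 5]]
```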
- LP [25]:
- The label propagation (LP) algorithm first initializes every node with a unique label, and then updates the labels iteratively based only on the network structure, where each node adopts the community label that most of its neighbors currently carry.
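The LP update rule can be sketched in a few lines of pure Python. Note that the tie-breaking scheme below (keep the current label when it is among the most frequent, otherwise take the largest tied label) is a deterministic simplification chosen for reproducibility, whereas the original algorithm breaks ties randomly.

```python
from collections import Counter

def label_propagation(adj, max_iter=100):
    """Asynchronous label propagation over an adjacency-list graph.
    Every node starts with a unique label; on each sweep a node adopts
    the most frequent label among its neighbors."""
    labels = {v: v for v in adj}
    for _ in range(max_iter):
        changed = False
        for v in adj:
            if not adj[v]:
                continue
            counts = Counter(labels[u] for u in adj[v])
            top = max(counts.values())
            candidates = [l for l, c in counts.items() if c == top]
            # Keep the current label on ties when possible (deterministic variant).
            new = labels[v] if labels[v] in candidates else max(candidates)
            if new != labels[v]:
                labels[v], changed = new, True
        if not changed:        # converged: no node wants to switch
            break
    return labels

# Two 4-cliques joined by the single edge (3, 4) collapse into two labels.
clique = lambda nodes: {v: [u for u in nodes if u != v] for v in nodes}
adj = {**clique(range(4)), **clique(range(4, 8))}
adj[3].append(4)
adj[4].append(3)
communities = label_propagation(adj)
```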
- BGLL [13]:
- It is an iterative method for unfolding hierarchical communities in large networks. In each iteration, a new network is first built by merging each community found in the previous iteration into a single node, and then larger communities are detected by performing modularity maximization on the new network. A graph partition can be obtained at the top level of the hierarchy.
- DeepWalk [57]:
- DeepWalk transforms the graph structure into node sequences by truncated random walks, and learns node embeddings by applying SkipGram [58] on generated node sequences.
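The walk-generation half of DeepWalk can be sketched as follows; the generated walks would then be fed to a SkipGram model (for example, gensim's Word2Vec with each walk as a sentence) to obtain node embeddings. This is a minimal sketch, not the reference implementation.

```python
import random

def truncated_random_walks(adj, num_walks=10, walk_length=5, seed=0):
    """Generate the node sequences DeepWalk feeds to SkipGram:
    `num_walks` truncated random walks of length `walk_length` per node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:          # dead end: truncate the walk
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy triangle graph: every walk is a "sentence" of node ids.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
walks = truncated_random_walks(adj, num_walks=2, walk_length=4)
```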
- K-means [59]:
- The K-means algorithm performs node clustering based on the node attributes.
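As used by the baselines in this paper, this simply means clustering the raw attribute matrix while ignoring the graph topology. A minimal scikit-learn sketch on toy attributes (invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy node-attribute matrix with two well-separated attribute clusters.
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
              [5.0, 5.1], [5.1, 5.0], [5.05, 5.05]])

# Cluster nodes purely by their attributes; edges play no role here.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```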
- TADW [60]:
- The text-associated DeepWalk (TADW) model incorporates text features of nodes into network representation learning under the framework of matrix factorization, based on the equivalence between DeepWalk and matrix factorization.
- VGAE [29]:
- The variational graph autoencoder (VGAE) is an unsupervised framework for learning node embeddings, where a GCN encoder is exploited to integrate the topological structure and node attributes into latent node embeddings, and a simple inner-product decoder is used to reconstruct the graph adjacency matrix.
- GUCD [42]:
- It is an unsupervised community detection method for attributed networks, which adopts MRFasGCN [24] as an encoder to derive node community membership in the hidden layer and introduces a dual decoder to separately reconstruct the network structure and node attributes from the derived node community membership.
- DGI [61]:
- Deep Graph Infomax (DGI) is an unsupervised method for learning node representations on graph-structured data, which utilizes graph convolutional architectures to encode the local patch centered around each node, and then maximizes the mutual information between local patch representations and the global graph summary via a noise-contrastive loss.
- ARGA [41]:
- The adversarially regularized graph autoencoder (ARGA) is similar to VGAE. The difference is that an adversarial module is incorporated to discriminate whether the latent node representation is generated from the GCN encoder or from the prior distribution. Once the node representations are learned, the K-means algorithm is applied to perform node clustering.
- MGAE [40]:
- The marginalized graph autoencoder (MGAE) learns node representations by introducing some randomness into the node features and then marginalizes the corrupted features in a graph autoencoder framework, allowing the node content to interact with the network structure.
- GRV [62]:
- The graph representation vulnerability (GRV) method introduces an information-theoretic measure of the robustness of a graph encoder, and learns adversarially robust node representations by trading off the encoder's expressive power against this robustness measure.
- SP-AGCL [63]:
- A similarity-preserving adversarial graph contrastive learning (SP-AGCL) framework that preserves the feature similarity information and achieves adversarial robustness. The node similarity-preserving view helps preserve the node feature similarity by providing self-supervision signals generated from the raw features of nodes.
Experimental results and analysis
Performance comparison (RQ1).
To answer RQ1, we compare the performance of RE-GCN with other baselines on both attributed and non-attributed networks. In all experiments, we run the algorithm 30 times on each network, and report the average NMI and ARI for each method. The configuration of the hyper-parameters in this paper is consistent with that in the literature [47]. Experiments show that RE-GCN reaches the optimal value in most cases under this configuration, so we adopt these settings throughout.
In tab:result_o_attribute, RE-GCN is compared with four different baselines that only exploit the graph topology on both attributed and non-attributed networks. We can observe that the performance of RE-GCN is superior to other algorithms on all networks. Note that the proposed RE-GCN can leverage both the graph topology and node attributes with the help of GCN when refining the structure centers and expanding the pseudo-labeled set. Thus, considering the node attributes is beneficial for improving the performance of community detection. Moreover, for non-attributed networks, compared with DeepWalk, GCN is able to effectively encode the local neighborhood information centered around each node to obtain better node representations for community detection.
In tab:result_w_attribute, RE-GCN is compared with seven methods that can leverage the node attributes on attributed networks. Among them, the K-means algorithm is solely based on the node attributes, while the other methods can leverage both the graph topology and node attributes. These methods achieve better performance than the K-means algorithm, indicating that the graph topology is essential for community detection. In addition, our RE-GCN achieves better performance than the other methods that also consider both the graph topology and node attributes. Recall that VGAE, GUCD, ARGA, and MGAE are unsupervised methods with an autoencoder structure, where a common loss is to minimize the reconstruction error for the graph topology and/or the node attributes. DGI is also an unsupervised method which attempts to maximize the mutual information between local node representations and the global graph summary. However, our method is specifically designed for community detection following the spirit of local expansion methods. We first locate and refine the structure centers in a network, each of which can serve as the representative for a potential community and thus is assigned a pseudo label; then the pseudo-labeled set is expanded based on preliminary predictions made by GCN; finally, the GCN is trained with the expanded pseudo-labeled set to minimize a classification loss, and then used to infer the community labels for the remaining nodes. Although RE-GCN is also an unsupervised method, good pseudo labels and the corresponding classification loss can more directly enhance the community detection performance.
Ablation study (RQ2).
Two key steps of RE-GCN are structure center refinement (ReSC) and pseudo-labeled set expansion (PLSet). To answer RQ2, we compare four variants of RE-GCN. (1) Variant 1 (w/o ReSC and PLSet) directly trains a GCN for community detection under the supervision of the initial structure centers identified by Algorithm 1, neither refining the structure centers nor constructing an expanded pseudo-labeled set. (2) Variant 2 (w/o ReSC) does not refine the initial structure centers, but constructs a larger pseudo-labeled set based on these initial centers, which is used to train the final GCN for community detection. (3) Variant 3 (w/o PLSet) refines the initial structure centers, but does not expand the pseudo-labeled set before training the final GCN for community detection. (4) Variant 4 is the full model with both ReSC and PLSet. The results are shown in tab:result_o_component.
We can roughly obtain a rank of the four variants according to their performance: Variant 1 < Variant 2 < Variant 3 < Variant 4. (1) Variant 1 (w/o ReSC & PLSet) performs the worst. Since it neither refines the initial structure centers identified by Algorithm 1 nor constructs an expanded pseudo-labeled set, when some of the initial structure centers are not good, training the GCN with a limited set containing inappropriate seeds yields an unsatisfactory community partition. (2) Variant 2 (w/o ReSC) achieves better performance than Variant 1, but still falls far behind the full model. Note that it expands the pseudo-labeled set on the basis of initial structure centers, without refining them in advance. On the one hand, training the final GCN with an expanded set of pseudo-labeled nodes improves its propagation ability when detecting communities. On the other hand, if inappropriate initial seeds are directly used for expanding the pseudo-labeled set, they may mislead the model. (3) The performance of Variant 3 is better than that of Variant 1, but is still lower than the full model. It refines the initial structure centers, which provides high-quality seeds for local community detection. However, the final GCN does not have enough propagation ability if it is trained with only the refined structure centers. (4) The performance of Variant 2 is lower than that of Variant 3, indicating that refining the structure centers has a larger impact than expanding the pseudo-labeled set.
Based on the above analysis, we conclude that both structure center refinement and pseudo-labeled set expansion are essential for RE-GCN to achieve its best performance. By updating the initial structure centers, the former step obtains a set of high-quality seed nodes, which lay a good foundation for local community detection. By expanding the pseudo-labeled set, the latter prepares a larger amount of supervision information for training GCN, which helps to improve its label propagation ability.
Case study for structure center refinement (RQ3).
To answer RQ3, we conduct case studies on the Football network, and visualize the process of structure center refinement. In Fig 4, we first use Algorithm 1 to select 12 initial structure centers. As shown in Fig 4(b), some of them are located in the same community, and 4 communities contain no structure center at all. After 10 iterations of updates, the refined 12 structure centers are scattered across 11 communities, as shown in Fig 4(d).
(a) The ground-truth 12 communities in the Football network; (b) Initial structure centers selected by Algorithm 1 are located in 8 communities; (c) 3rd update iteration: structure centers are located in 9 communities; (d) 10th update iteration: structure centers are located in 11 communities.
In Fig 5 we randomly select 12 initial structure centers, which are located in 6 communities as shown in Fig 5(b). After 11 iterations of updates, the refined 12 structure centers come from 11 communities, as shown in Fig 5(d).
(a) The ground-truth 12 communities for the Football network; (b) Initial structure centers randomly selected are located in 6 communities; (c) 3rd update iteration: structure centers are located in 8 communities; (d) 11th update iteration: structure centers are located in 11 communities.
Therefore, whether the initial structure centers are selected by Algorithm 1 or at random, they can be refined by Algorithm 2 into a set of more representative seeds that are scattered across different communities. That is to say, the refinement can overcome the sensitivity to initial structure centers to some extent, and reduce the adverse effect of inappropriate structure centers on community detection.
Fig 6 shows the structure center refinement analysis for all datasets. For each dataset, it compares the number of initial structure centers, the number of updated structure centers, and the number of original communities.
Influence of the size of pseudo-labeled training set (RQ4).
Finally, we compare the two different expansion strategies specified by Eq 11 and Eq 12 respectively, and investigate how their performance varies with respect to the number of expanded nodes per pseudo label. Figs 7 and 8 report the variation of NMI and ARI, respectively, as the number of expanded nodes per pseudo label increases, where the vertical line indicates its lower bound.
(a) Football network results; (b) Cora network results; (c) CiteSeer network results; (d) PubMed network results.
(a) Football network; (b) Cora network; (c) CiteSeer network; (d) PubMed network.
We can observe that our expansion strategy as defined in Eq 11 outperforms the alternative strategy in Eq 12 on three datasets (Football, Cora, and PubMed), and achieves slightly lower performance on CiteSeer. On all datasets, the community detection performance of RE-GCN with either expansion strategy generally improves as the number of expanded nodes per pseudo label increases to a moderate size. The reason is that a larger set of pseudo-labeled nodes makes up for the limited label propagation ability of the localized graph convolution filter.
Computational complexity analysis
The computational complexity of RE-GCN can be decomposed into three main components:
- Initial center selection: This phase computes the structural centrality measures for all nodes. Using Dijkstra's algorithm with a binary heap for sparse graphs, the time complexity is $O(NM\log N)$. This includes:
- Computing shortest paths between all node pairs: $O(NM\log N)$
- Calculating local density ($\rho$) and relative distance ($\delta$): $O(N^2)$
- Selecting the top-K centers: $O(N\log N)$
- Iterative refinement: Each iteration requires:
- Training a 2-layer GCN: $O(Md)$ per epoch
- Generating subgraphs and computing SLP: $O(N+M)$
- Updating structure centers: $O(N)$
With T iterations (typically T < 8), the total refinement cost is $O(T(Md+N))$.
- Pseudo-label expansion: This step involves:
- Calculating affiliation strengths: $O(NK)$
- Sorting nodes for each community: $O(KN\log N)$
For small K (e.g., $K \ll N$), this step becomes negligible compared to the other phases.
The overall complexity is therefore $O(NM\log N + TMd)$. For sparse graphs where $M = O(N)$, the shortest-path term simplifies to $O(N^2\log N)$ and dominates for larger networks.
Memory requirements scale as $O(M + Nd)$ due to:
- Storage of the graph adjacency and node features: $O(M + Nd)$
- Maintaining community assignments: $O(N)$
where d is the feature dimension.
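The all-pairs shortest-path computation with Dijkstra's algorithm and a binary heap, referred to in the initial center selection phase, can be sketched as follows (a generic implementation, not the authors' code):

```python
import heapq

def dijkstra(adj, source):
    """Single-source shortest paths with a binary heap: O(M log N)."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue                      # stale heap entry, skip
        for u, w in adj[v]:
            nd = d + w
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return dist

def all_pairs(adj):
    """One Dijkstra run per node: the O(NM log N) term above."""
    return {v: dijkstra(adj, v) for v in adj}

# Weighted path graph 0 - 1 - 2 with unit-weight edges.
adj = {0: [(1, 1)], 1: [(0, 1), (2, 1)], 2: [(1, 1)]}
D = all_pairs(adj)
```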
Conclusion
In this article, we proposed an unsupervised approach to community detection by structure center refinement and pseudo-labeled set expansion, with GCN as a foundation module which can leverage both network topology and node attributes. It first identifies a few structure centers with high local density and large distance from each other based on graph topology. To overcome the sensitivity to initial structure centers, we iteratively refine the structure centers based on both graph topology and node attributes. The refinement process alternates between two steps: obtaining a temporary graph partition by a GCN trained with the current structure centers; updating each structure center to the node with the highest structure importance in the corresponding induced subgraph. To improve the label propagation ability of shallow GCN, we expand the pseudo-labeled set that serves as the supervision information for training GCN. The expansion process selects a few nodes whose affiliation strength to a community is similar to that of its structure center, among the subset of nodes that probably belong to the community. The final GCN is trained with the expanded pseudo-labeled set and used to infer the community labels for the remaining nodes. Extensive experiments on 8 real networks demonstrate that the proposed approach can achieve better community detection performance than baseline methods on both attributed and non-attributed networks. Additional studies corroborate that both the structure center refinement process and the pseudo-labeled set expansion process contribute to the performance improvement. The refinement process yields a set of more representative structure centers, which can reduce the adverse effect of inappropriate structure centers. Moreover, the community detection performance of GCN improves as the number of pseudo-labeled nodes increases.
In the future, we would like to explore other techniques for identifying and refining structure centers and expanding pseudo-labeled set when the community definition is different or the community characteristic is vague. In some networks, the community structure may be overlapping, or there may be many weak communities. Under these circumstances, existing strategies may fail to correctly identify all communities.
References
- 1.
Kunegis J. KONECT. In: Proceedings of the 22nd International Conference on World Wide Web. 2013. https://doi.org/10.1145/2487788.2488173
- 2. Girvan M, Newman MEJ. Community structure in social and biological networks. Proc Natl Acad Sci U S A. 2002;99(12):7821–6. pmid:12060727
- 3. Porter MA, Onnela JP, Mucha PJ. Communities in networks. Notices Am Math Soc. 2009;56(9):1082–97.
- 4. Atay Y, Koc I, Babaoglu I, Kodaz H. Community detection from biological and social networks: a comparative analysis of metaheuristic algorithms. Appl Soft Comput. 2017;50:194–211.
- 5. Le Gorrec L, Mouysset S, Ruiz D. Doubly stochastic scaling unifies community detection. Neurocomputing. 2022;504:141–62.
- 6. Bai L, Cheng X, Liang J, Guo Y. Fast graph clustering with a new description model for community detection. Information Sciences. 2017;388–389:37–47.
- 7. Manipur I, Giordano M, Piccirillo M, Parashuraman S, Maddalena L. Community detection in protein-protein interaction networks and applications. IEEE/ACM Trans Comput Biol Bioinform. 2023;20(1):217–37. pmid:34951849
- 8. Fortunato S, Newman MEJ. 20 years of network community detection. Nat Phys. 2022;18(8):848–50.
- 9. Newman MEJ. Fast algorithm for detecting community structure in networks. Phys Rev E Stat Nonlin Soft Matter Phys. 2004;69(6 Pt 2):066133. pmid:15244693
- 10. Clauset A, Newman MEJ, Moore C. Finding community structure in very large networks. Phys Rev E Stat Nonlin Soft Matter Phys. 2004;70(6 Pt 2):066111. pmid:15697438
- 11.
Yang L, Cao X, He D, Wang C, Wang X, Zhang W. Modularity based community detection with deep learning. In: Proceedings of the 25th International Joint Conference on Artificial Intelligence. 2016. p. 2252–8. https://www.ijcai.org/Abstract/16/321
- 12. von Luxburg U. A tutorial on spectral clustering. Stat Comput. 2007;17(4):395–416.
- 13. Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech. 2008;2008(10):P10008.
- 14. Fortunato S, Barthélemy M. Resolution limit in community detection. Proc Natl Acad Sci U S A. 2007;104(1):36–41. pmid:17190818
- 15. Ding X, Zhang J, Yang J. A robust two-stage algorithm for local community detection. Knowl-Based Syst. 2018;152:188–99.
- 16. Wang X, Liu G, Li J, Nees JP. Locating structural centers: a density-based clustering method for community detection. PLoS One. 2017;12(1):e0169355. pmid:28046030
- 17. Wang X, Li J, Yang L, Mi H, Yu JY. Weakly-supervised learning for community detection based on graph convolution in attributed networks. Int J Mach Learn Cyber. 2021;12(12):3529–39.
- 18. Wang X, Li J, Yang L, Mi H. Unsupervised learning for community detection in attributed networks based on graph convolutional network. Neurocomputing. 2021;456:147–55.
- 19. Zhou X, Su L, Li X, Zhao Z, Li C. Community detection based on unsupervised attributed network embedding. Exp Syst Appl. 2023;213:118937.
- 20. Bagrow JP, Bollt EM. Local method for detecting communities. Phys Rev E Stat Nonlin Soft Matter Phys. 2005;72(4 Pt 2):046108. pmid:16383469
- 21. Clauset A. Finding local community structure in networks. Phys Rev E Stat Nonlin Soft Matter Phys. 2005;72(2 Pt 2):026132. pmid:16196669
- 22. Rodriguez A, Laio A. Clustering by fast search and find of density peaks. Science. 2014; 344 (6191):1492–6.
- 23.
Liu F, Xue S, Wu J, Zhou C, Hu W, Paris C, et al. Deep learning for community detection: progress, challenges and opportunities. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. 2020. p. 4981–7. https://doi.org/10.24963/ijcai.2020/693
- 24. Jin D, Zhang B, Song Y, He D, Feng Z, Chen S, et al. ModMRF: a modularity-based markov random field method for community detection. Neurocomputing. 2020;405:218–28.
- 25. Raghavan UN, Albert R, Kumara S. Near linear time algorithm to detect community structures in large-scale networks. Phys Rev E Stat Nonlin Soft Matter Phys. 2007;76(3 Pt 2):036106. pmid:17930305
- 26. Li Q, Han Z, Wu X. Deeper insights into graph convolutional networks for semi-supervised learning. AAAI. 2018;32(1):3538–45.
- 27. Li J, Wang X, Wu P. Review on community detection methods based on local optimization. Bullet Chin Acad Sci. 2015;30(2):238–47.
- 28. Peng Z, Liu H, Jia Y, Hou J. Deep attention-guided graph clustering with dual self-supervision. IEEE Trans Circuits Syst Video Technol. 2023;33(7):3296–307.
- 29.
Kipf TN, Welling M. Variational graph auto-encoders. NeurIPS Workshop on Bayesian Deep Learning. 2016. https://doi.org/10.48550/arXiv.1611.07308
- 30. Chen Q, Wu T-T, Fang M. Detecting local community structures in complex networks based on local degree central nodes. Phys A: Statist Mech Appl. 2013;392(3):529–37.
- 31. Chang Y, Ma H, Chang L, Li Z. Community detection with attributed random walk via seed replacement. Front Comput Sci. 2022;16(5).
- 32.
Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations. 2017. https://doi.org/10.48550/arXiv.1609.02907
- 33. Zhou J, Cui G, Hu S, Zhang Z, Yang C, Liu Z, et al. Graph neural networks: a review of methods and applications. AI Open. 2020;1:57–81.
- 34. Mesgaran M, Hamza AB. Anisotropic graph convolutional network for semi-supervised learning. IEEE Trans Multimedia. 2021;23:3931–42.
- 35.
Abu-El-Haija S, Kapoor A, Perozzi B, Lee J. N-GCN: Multi-scale graph convolution for semi-supervised node classification. In: Proceedings of the 35th Uncertainty in Artificial Intelligence Conference. 2020. p. 841–51. https://doi.org/10.48550/arXiv.1802.08888
- 36.
He X, Deng K, Wang X, Li Y, Zhang Y, Wang M. LightGCN. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2020. p. 639–48. https://doi.org/10.1145/3397271.3401063
- 37. Zhao L, Song Y, Zhang C, Liu Y, Wang P, Lin T, et al. T-GCN: a temporal graph convolutional network for traffic prediction. IEEE Trans Intell Transport Syst. 2020;21(9):3848–58.
- 38.
Kuang D, Ding C, Park H. Symmetric nonnegative matrix factorization for graph clustering. In: Proceedings of the 2012 SIAM International Conference on Data Mining. 2012. https://doi.org/10.1137/1.9781611972825.10
- 39. Berahmand K, Bahadori S, Abadeh MN, Li Y, Xu Y. SDAC-DA: semi-supervised deep attributed clustering using dual autoencoder. IEEE Trans Knowl Data Eng. 2024;36(11):6989–7002.
- 40.
Wang C, Pan S, Long G, Zhu X, Jiang J. MGAE. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 2017. p. 889–98. https://doi.org/10.1145/3132847.3132967
- 41. Pan S, Hu R, Fung S-F, Long G, Jiang J, Zhang C. Learning graph embedding with adversarial training methods. IEEE Trans Cybern. 2020;50(6):2475–87. pmid:31484146
- 42.
He D, Song Y, Jin D, Feng Z, Zhang B, Yu Z, et al. Community-centric graph convolutional network for unsupervised community detection. In: Proceedings of the 29th International Joint Conference on Artificial Intelligence. 2020. p. 3515–21. https://doi.org/10.24963/ijcai.2020/486
- 43. Berahmand K, Saberi Movahed F, Sheikhpour R, Li Y, Jalili M. A comprehensive survey on spectral clustering with graph structure learning. 2025.
- 44. Zhu Y, Xu Y, Yu F, Liu Q, Wu S. Unsupervised graph representation learning with cluster-aware self-training and refining. ACM Trans Intell Syst Technol. 2023;14(5):1–21.
- 45. Mesgaran M, Hamza AB. Graph fairing convolutional networks for anomaly detection. Pattern Recogn. 2024;145:109960.
- 46. McPherson M, Smith-Lovin L, Cook JM. Birds of a feather: homophily in social networks. Annu Rev Sociol. 2001;27(1):415–44.
- 47. Zheng W, Che C, Qian Y, Wang J, Yang G. A graph clustering algorithm based on paths between nodes in complex networks. Chin J Comput. 2020;43(7):1312–27.
- 48.
Kingma DP, Ba JL. Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations. 2015. https://doi.org/10.48550/arXiv.1412.6980
- 49. Zachary WW. An information flow model for conflict and fission in small groups. J Anthropol Res. 1977;33(4):452–73.
- 50. Lusseau D, Schneider K, Boisseau OJ, Haase P, Slooten E, Dawson SM. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behav Ecol Sociobiol. 2003;54(4):396–405.
- 51. Newman MEJ. Modularity and community structure in networks. Proc Natl Acad Sci U S A. 2006;103(23):8577–82. pmid:16723398
- 52.
Adamic LA, Glance N. The political blogosphere and the 2004 U.S. election. In: Proceedings of the 3rd International Workshop on Link Discovery. 2005. https://doi.org/10.1145/1134271.1134277
- 53. Sen P, Namata G, Bilgic M, Getoor L, Gallagher B, Eliassi‐Rad T. Collective classification in network data. AI Magazine. 2008;29(3):93–106.
- 54.
Namata G, London B, Getoor L, Huang B. Query-driven active surveying for collective classification. In: International Workshop on Mining and Learning with Graphs (MLG-2012). 2012. https://doi.org/10.48550/arXiv.1508.03116
- 55. Danon L, Díaz-Guilera A, Duch J, Arenas A. Comparing community structure identification. J Stat Mech. 2005;2005(09):P09008.
- 56. Rand WM. Objective criteria for the evaluation of clustering methods. J Am Statist Assoc. 1971;66(336):846–50.
- 57.
Perozzi R, Al-Rfou R, Skiena S. DeepWalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2014. p. 701–10. https://doi.org/10.1145/2623330
- 58.
Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. In: 1st International Conference on Learning Representations. 2013. http://arxiv.org/abs/1301.3781
- 59. Jain AK. Data clustering: 50 years beyond K-means. Pattern Recogn Lett. 2010;31(8):651–66.
- 60.
Yang C, Liu Z, Zhao D, Sun M, Chang EY. Network representation learning with rich text information. In: Proceedings of the 24th International Conference on Artificial Intelligence. 2015. p. 2111–7. https://www.ijcai.org/Abstract/15/299
- 61. Veličković P, Fedus W, Hamilton WL, Liò P, Bengio Y, Hjelm RD. Deep graph infomax. In: International Conference on Learning Representations. 2019.
- 62. Xu J, Yang Y, Chen J, Jiang X, Wang C, Lu J, et al. Unsupervised adversarially robust representation learning on graphs. AAAI. 2022;36(4):4290–8.
- 63.
In Y, Yoon K, Park C. Similarity preserving adversarial graph contrastive learning. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2023. p. 867–78. https://doi.org/10.1145/3580305.3599503