Clustering Scientific Publications Based on Citation Relations: A Systematic Comparison of Different Methods

Clustering methods are applied regularly in the bibliometric literature to identify research areas or scientific fields. These methods are for instance used to group publications into clusters based on their relations in a citation network. In the network science literature, many clustering methods, often referred to as graph partitioning or community detection techniques, have been developed. Focusing on the problem of clustering the publications in a citation network, we present a systematic comparison of the performance of a large number of these clustering methods. Using a number of different citation networks, some of them relatively small and others very large, we extensively study the statistical properties of the results provided by different methods. In addition, we also carry out an expert-based assessment of the results produced by different methods. The expert-based assessment focuses on publications in the field of scientometrics. Our findings seem to indicate that there is a trade-off between different properties that may be considered desirable for a good clustering of publications. Overall, map equation methods appear to perform best in our analysis, suggesting that these methods deserve more attention from the bibliometric community.


Introduction
There is an extensive literature on the topic of graph partitioning and community detection in networks [1]. This literature studies methods for partitioning the nodes in a network into a number of groups, often referred to as communities or clusters. The general idea is that nodes belonging to the same cluster should be relatively strongly connected to each other, while nodes belonging to different clusters should be only weakly connected.
Which methods for graph partitioning and community detection perform best in practice? The literature does not provide a clear answer to this question, and if the question can be answered at all, then most likely the answer will be dependent on the type of network that is being studied and on the type of partitioning that one is interested in.
In this paper, we therefore address the above question in one specific context. We are interested in grouping scientific publications into clusters and we expect each cluster to represent a set of publications that are topically related to each other. Clustering scientific publications is a problem that has received a lot of attention in the bibliometric literature. In this literature, publications have for instance been clustered based on co-occurring words in titles, abstracts, or full text [2,3], based on co-citation or bibliographic coupling relations [4][5][6], and sometimes even based on a combination of different types of relations [4,7,8,9]. Following Waltman and Van Eck [10] and Boyack and Klavans [11,12], our interest in this paper is in clustering publications based on direct citation relations. Direct citation relations are of special interest because they allow large sets of publications to be clustered in an efficient way. Waltman and Van Eck, for instance, cluster ten million publications from the period 2001-2010 based on about a hundred million citation relations between these publications. In this way, they obtain a highly detailed classification system of scientific literature covering all fields of science.
The analysis presented in this paper focuses on systematically comparing the performance of a large number of clustering methods when applied to the problem of clustering scientific publications based on citation relations. The following clustering methods are included in the analysis: spectral methods [13,14], modularity optimization [15][16][17][18], map equation methods [19,20], matrix factorization [21], statistical methods [22], link clustering [23], label propagation [24][25][26][27][28], random walks [29], clique percolation [30] and expansion [31], and selected other methods [32,33]. These are all methods that have been proposed during the past years in the literature on graph partitioning and community detection.
To evaluate the performance of the different clustering methods, we perform an in-depth analysis of the statistical properties of the clusterings obtained by each method. On the one hand we focus on general properties of the clusterings, but on the other hand we also consider a number of properties that are of special relevance in the context of citation networks of publications. However, to obtain a deep understanding of the differences between clustering methods, we believe that analyzing the statistical properties of clusterings is not sufficient. Understanding the differences between clustering methods also requires an expert-based assessment of different clusterings. This is a challenging task that involves a number of practical difficulties, but in this paper we nevertheless make an attempt to perform such an expert-based assessment. The expert-based assessment is performed for publications in the field of library and information science, focusing on the subfield of scientometrics.
This paper is organized as follows. We first discuss the data and methods included in our analysis. We then present the results of the analysis. We conclude the paper by providing a detailed discussion of our findings.

Methods
Below we first discuss the citation networks of publications that we consider in our analysis. We then discuss the clustering methods included in the analysis. Finally, we discuss the criteria that we use for comparing the clustering methods. These criteria relate to the following four properties of a clustering method:
Cluster sizes. Ideally the differences in the size of clusters should not be too large. For instance, the largest cluster preferably should be no more than an order of magnitude larger than the smallest cluster.
Small clusters. For practical purposes, it is usually inconvenient to have a large number of very small clusters. Therefore the number of very small clusters should be minimized as much as possible.
Clustering stability. Running the same clustering method multiple times may yield different results (due to random elements in many clustering methods), but the results should be reasonably similar. Likewise, when small changes are made to a citation network, this should not have too much effect on the results of a clustering method.
Computing time. Preferably, a clustering method should be fast. Especially in applications to large citation networks the issue of computing time is of significant importance.
In addition to the above four properties, a fifth property for comparing clustering methods is the intuitive sensibility of the results provided by a method. Experts should be able to interpret the clusters obtained from a clustering method in terms of meaningful research topics. We do not evaluate this fifth property using quantitative criteria. Instead, our expert-based assessment of the results of different clustering methods is focused on this criterion.
Citation networks of scientific publications. Citation relations between scientific publications are represented as a simple undirected and unweighted graph by first discarding the directions of citations, any multiple citations and citations from a publication to itself. Publications that neither cite nor are cited by any other publication are also discarded. Let N be the set of nodes, n = |N| the number of nodes, and m the number of links in such a citation network. Let k denote the average node degree, i.e. the average number of links incident to a node, k = 2m/n, and let LCC denote the largest connected component, i.e. the largest subset of mutually reachable nodes.
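The preprocessing described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the code used in our analysis: the function names build_citation_graph and network_statistics and the toy citation list are hypothetical, and citations are assumed to be available as (citing, cited) pairs.

```python
from collections import defaultdict, deque

def build_citation_graph(citations):
    """Build adjacency sets of a simple undirected graph from (citing, cited) pairs."""
    adj = defaultdict(set)
    for citing, cited in citations:
        if citing == cited:
            continue                  # discard citations from a publication to itself
        adj[citing].add(cited)        # sets collapse multiple citations
        adj[cited].add(citing)        # storing both ways discards the direction
    return dict(adj)                  # publications without any link never appear

def network_statistics(adj):
    """Return n, m, the average degree k = 2m/n and the size of the LCC."""
    n = len(adj)
    m = sum(len(neighbors) for neighbors in adj.values()) // 2
    k = 2 * m / n if n else 0.0
    seen, lcc = set(), 0
    for start in adj:                 # breadth-first search from every unseen node
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        lcc = max(lcc, size)
    return n, m, k, lcc

# Toy data (made up): a duplicate, a reciprocal and a self-citation are included.
toy_citations = [("a", "b"), ("b", "a"), ("a", "b"), ("c", "c"), ("b", "d"), ("e", "f")]
toy_graph = build_citation_graph(toy_citations)
n, m, k, lcc = network_statistics(toy_graph)
```

In the toy example, publication "c" only cites itself and therefore does not appear in the graph at all, which matches the rule that publications without any citation relation are discarded.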
We analyze four citation networks representing publications in the fields of Scientometrics, Library & Information Science and Physics, and also in all fields of science taken together (see Table 1). Publications and their citations were collected from the Web of Science bibliographic database produced by Thomson Reuters. More specifically, we used the in-house version of the Web of Science database of the Centre for Science and Technology Studies of Leiden University. This version of the Web of Science database is very similar to the one available online at www.webofscience.com. However, there are some differences, notably in the identification of citations between publications [34]. Data collection was restricted to the Science Citation Index Expanded, the Social Sciences Citation Index and the Arts & Humanities Citation Index, while only publications of the Web of Science document types 'article' and 'review' were included in the data collection.
The field of Scientometrics was delineated by selecting all publications in the following three journals: Journal of Informetrics, Journal of the Association for Information Science and Technology (including its precursor Journal of the American Society for Information Science and Technology), and Scientometrics. The field of Library & Information Science was delineated by selecting all publications in the Web of Science journal subject category Information Science & Library Science. Finally, the field of Physics was delineated by selecting all publications in the eight Physics journal subject categories in Web of Science as well as the subject category Astronomy & Astrophysics.
Graph partitioning and community detection methods. For a thorough empirical comparison, we select a large number of representative graph partitioning and community detection methods [1,35], which we refer to as clustering methods in this paper. Table 2 lists the selected methods, roughly divided into different classes. Due to the number of methods considered, detailed descriptions are omitted here. We use the source code provided by the authors of the methods in all cases except Louvain and LPA, where we use our own implementations [18,25]. We adopt the default parameter settings of each algorithm. Graclus, METIS, BigClam and CoDA require the number of clusters to be specified a priori. Thus, Graclus(S) and Graclus(L) denote the same method with the number of clusters set to n/15 and n/50, respectively, while Graclus refers to Graclus(S) on networks with $n < 10^6$ and to Graclus(L) on larger networks (similarly for METIS, BigClam and CoDA). On the other hand, Links(S) and Links(L) denote the same method with the Jaccard similarity threshold [23] set to 0.1 and 0.01, respectively, whereas Links always refers to Links(S). Finally, some of the methods return overlapping clusters. For reasons of simplicity, each node in multiple clusters is assigned to the first cluster that appears in the output of the particular algorithm.
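Two of the conventions above are simple enough to sketch in Python: the rule for resolving overlapping clusters and the a priori number of clusters. The function names are hypothetical, the output of a method is assumed to be an ordered list of node lists, and rounding n/15 and n/50 down to an integer is our assumption for this sketch.

```python
def first_cluster_assignment(clusters):
    """Assign every node to the first cluster in which it appears."""
    assignment = {}
    for index, cluster in enumerate(clusters):
        for node in cluster:
            assignment.setdefault(node, index)   # later occurrences are ignored
    return assignment

def default_cluster_count(n):
    """Number of clusters set a priori: n/15 (S) below 10^6 nodes, n/50 (L) otherwise."""
    nodes_per_cluster = 15 if n < 10**6 else 50
    return max(1, n // nodes_per_cluster)        # rounding down is an assumption

# Toy overlapping output (made up): "b" and "c" each appear in two clusters.
overlapping = [["a", "b"], ["b", "c"], ["c", "d"]]
assignment = first_cluster_assignment(overlapping)
```

Note that this rule makes the resulting partition depend on the order in which an algorithm happens to list its clusters.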
Certain otherwise prominent algorithms like Infomap cannot be applied to very large networks in a time comparable with the fastest algorithms like Louvain and BPA. A straightforward solution is to first adopt some other method $M$ to cut the network into smaller subgraphs and then independently apply Infomap to each of these. Let $C_i$ be some cluster of nodes in a network, $C_i \subseteq N$, and let $s_i$ be its size, $s_i = |C_i|$. Next, let $C = \{C_i\}$ be the clustering of all the nodes in a network returned by the method $M$, $\bigcup_i C_i = N$ and $C_i \cap C_j = \emptyset$, $i \neq j$. Then, for each cluster $C_i$ with $s_i > 50$, Infomap is applied to the subgraph induced by the nodes in $C_i$, and the resulting clustering of $C_i$ is accepted only when it improves the log-likelihood of $C$ (see Eq (5)). Several such derived methods are considered. Gracmap and Metimap refer to methods that adopt spectral algorithms Graclus and METIS for the first method $M$, respectively, where

Table 2. Graph partitioning and community detection methods. We consider a large number of methods divided into different classes. See text for the details of methods implementation and parameters setting. Columns: Class, Method, Description, Ref.
For comparison, we also include Louvmap and Labmap, which adopt modularity optimization known as the Louvain algorithm and the label propagation algorithm LPA in the first step, respectively. Finally, the setting of the number of clusters in Graclus is limited to 2500. Thus, for very large networks, we use Metilus that adopts METIS for $M$ and Graclus afterwards. In total, we consider 30 methods. These are the 20 methods listed in Table 2, five variations with an alternative setting of the number of clusters and five derived methods as described above.
Let $C = \{C_i\}$ be the clustering returned by some method $M$. $C$ often includes clusters $C_i$ that are too small or too large to be of any practical use, $s_i < s_{\mathrm{tiny}}$ or $s_i > s_{\mathrm{giant}}$. A straightforward solution is a two-step post-processing approach that first tries to further partition each of the giant clusters as above and then merges the tiny clusters with larger ones. We set $s_{\mathrm{tiny}} = 15$ and $s_{\mathrm{giant}} = 10^4$. First, for each cluster $C_i$ with $s_i > s_{\mathrm{giant}}$, the same clustering method $M$ is applied to the subgraph induced by the nodes in $C_i$ and the resulting clustering is accepted based on the log-likelihood of $C$ as before. Note that, due to the resolution limit of community detection methods [38,39], most will further partition cluster $C_i$. Next, for each cluster $C_i$ with $s_i < s_{\mathrm{tiny}}$, $C_i$ is merged with a neighboring cluster that most improves or least worsens the log-likelihood of $C$. While the first post-processing step can be carried out simultaneously for each of the giant clusters, the tiny clusters in the second post-processing step have to be assessed in a random order.
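The second post-processing step can be sketched as follows. This is a simplified illustration rather than the procedure used in our analysis: instead of the log-likelihood criterion, the sketch merges each tiny cluster with the neighboring cluster to which it has the most links, which is a swapped-in and much cruder criterion. The function name merge_tiny_clusters, the seed parameter and the toy network are hypothetical, and cluster labels are assumed to be sortable (e.g. integers).

```python
import random
from collections import Counter

def merge_tiny_clusters(adj, assignment, s_tiny=15, seed=0):
    """Merge clusters smaller than s_tiny into a neighboring cluster.

    adj: node -> set of neighbors; assignment: node -> cluster label.
    Stand-in criterion (assumption): the neighbor cluster sharing the most links.
    """
    assignment = dict(assignment)                 # do not modify the input
    members = {}
    for node, label in assignment.items():
        members.setdefault(label, set()).add(node)
    tiny = sorted(label for label, nodes in members.items() if len(nodes) < s_tiny)
    random.Random(seed).shuffle(tiny)             # tiny clusters are visited in random order
    for label in tiny:
        if len(members[label]) >= s_tiny:
            continue                              # grew by absorbing other tiny clusters
        links = Counter(
            assignment[neighbor]
            for node in members[label]
            for neighbor in adj[node]
            if assignment[neighbor] != label
        )
        if not links:
            continue                              # isolated cluster, nothing to merge with
        target = max(sorted(links), key=links.get)  # ties go to the smallest label
        for node in members[label]:
            assignment[node] = target
        members[target] |= members.pop(label)
    return assignment

# Toy network (made up): node 3 forms a tiny cluster with two links to cluster 0.
toy_adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
toy_assignment = {0: 0, 1: 0, 2: 0, 3: 1, 4: 2, 5: 2}
merged = merge_tiny_clusters(toy_adj, toy_assignment, s_tiny=2)
```

Replacing the link-count criterion by the change in log-likelihood would require re-evaluating Eq (5) for every candidate merge, which is the reason the order of the merges matters.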
Graph cuts and community structure statistics. Let $C$ be some clustering of network nodes as described above and let $A$ be the network adjacency matrix, $A_{ij} = A_{ji} \in \{0, 1\}$ and $A_{ii} = 0$. To measure the structure of clustering $C$, we select different representative graph cuts and community structure statistics [40]. We measure the internal connectivity of clustering $C$ as the average node internal degree $K$ [41],
$$K = \frac{1}{n} \sum_{ij} A_{ij}\, \delta(c_i, c_j), \tag{1}$$
where $c_i$ is the cluster of node $i$ and $\delta$ is the Kronecker delta. The external connectivity of clustering $C$ is measured as the average node external degree or expansion $E$ [41],
$$E = \frac{1}{n} \sum_{ij} A_{ij} \left(1 - \delta(c_i, c_j)\right). \tag{2}$$
By definition, $k = K + E$, whereas $K/k$ is the fraction of links covered by the clustering $C$. Next, the Flake function $F$ [42] considers internal and external connectivity of clustering $C$ and is defined as the fraction of nodes with larger external than internal degree,
$$F = \frac{1}{n} \sum_{i} \mathbb{1}\Big[ \sum_{j} A_{ij}\, \delta(c_i, c_j) < \frac{k_i}{2} \Big], \tag{3}$$
where $k_i$ is the degree of node $i$ and $\mathbb{1}[\cdot]$ equals one when its argument holds and zero otherwise. For reference with previous work, we also report the value of the modularity function $Q$ [37,43] that compares the internal connectivity of clustering $C$ to the configuration model [44], i.e. a random graph with the same degree sequence,
$$Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j). \tag{4}$$
Finally, we report the posterior probability of clustering $C$ or the likelihood of $C$ given the network observed [45]. Assume that links in a network formed solely based on nodes' cluster membership and let $\theta_i$ be a linking probability associated with cluster $C_i$. Then $m_i$ links observed between the nodes in cluster $C_i$ would form with probability $\theta_i^{m_i}$ and the remaining $M_i - m_i$ possible links would not form with probability $(1 - \theta_i)^{M_i - m_i}$, $M_i = s_i (s_i - 1)/2$. Let $\tilde{\theta}$ be a linking probability representing the connectivity between the clusters. Then $\tilde{m}$ links observed between the nodes in different clusters would form with probability $\tilde{\theta}^{\tilde{m}}$, $\tilde{m} = m - \sum_i m_i$, and the remaining $\tilde{M} - \tilde{m}$ possible links would not form with probability $(1 - \tilde{\theta})^{\tilde{M} - \tilde{m}}$, $\tilde{M} = n(n - 1)/2 - \sum_i M_i$.
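Thus, the probability that the network formed according to $C$, i.e. the likelihood of $C$, is defined as
$$L(C) = \tilde{\theta}^{\tilde{m}} (1 - \tilde{\theta})^{\tilde{M} - \tilde{m}} \prod_i \theta_i^{m_i} (1 - \theta_i)^{M_i - m_i}, \tag{5}$$
where $\theta_i = m_i/M_i$ and $\tilde{\theta} = \tilde{m}/\tilde{M}$ are the maximum likelihood estimators [46]. For reasons of numerical stability, we report the log-likelihood of $C$ as $\log L(C)$.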
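To make the definitions of K, E, F, Q and log L concrete, the following Python sketch computes them for a small network. It is a minimal illustration under our reading of the definitions above, not the code used in our analysis: the function name structure_statistics and the toy network are hypothetical, and terms of the form 0 log 0 in the log-likelihood are taken to be zero.

```python
import math
from collections import Counter

def structure_statistics(adj, c):
    """adj: node -> set of neighbors (simple undirected graph); c: node -> cluster label."""
    n = len(adj)
    m = sum(len(neighbors) for neighbors in adj.values()) // 2
    internal = {i: sum(1 for j in adj[i] if c[i] == c[j]) for i in adj}
    K = sum(internal.values()) / n                               # average internal degree
    E = sum(len(adj[i]) - internal[i] for i in adj) / n          # expansion
    F = sum(1 for i in adj if internal[i] < len(adj[i]) / 2) / n # Flake function

    degree_sum, internal_sum, sizes = Counter(), Counter(), Counter()
    for i in adj:
        degree_sum[c[i]] += len(adj[i])
        internal_sum[c[i]] += internal[i]     # counts every internal link twice
        sizes[c[i]] += 1
    # Modularity rewritten per cluster: internal link share minus expected share.
    Q = sum(internal_sum[g] / (2 * m) - (degree_sum[g] / (2 * m)) ** 2 for g in sizes)

    def bernoulli_log_likelihood(links, pairs):
        """links*log(theta) + (pairs-links)*log(1-theta) at theta = links/pairs."""
        if links == 0 or links == pairs:
            return 0.0                        # convention: 0 log 0 = 0
        theta = links / pairs
        return links * math.log(theta) + (pairs - links) * math.log(1 - theta)

    log_l, links_inside, pairs_inside = 0.0, 0, 0
    for g, s in sizes.items():
        m_g, M_g = internal_sum[g] // 2, s * (s - 1) // 2
        log_l += bernoulli_log_likelihood(m_g, M_g)
        links_inside += m_g
        pairs_inside += M_g
    log_l += bernoulli_log_likelihood(m - links_inside, n * (n - 1) // 2 - pairs_inside)
    return {"K": K, "E": E, "F": F, "Q": Q, "logL": log_l}

# Toy network (made up): two triangles joined by a single link.
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
toy_clusters = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
stats = structure_statistics(toy_adj, toy_clusters)
```

For the toy network, both clusters are complete subgraphs, so only the single link between the two clusters contributes to the log-likelihood.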
Let $C$ also denote the random variable corresponding to clustering $C$, $P(C = C_i) = s_i/n$. The distance between two clusterings $C$ and $D$ is measured using the variation of information $V$ [47] defined as
$$V(C, D) = H(C|D) + H(D|C), \tag{6}$$
where $H(C|D)$ and $H(D|C)$ are conditional entropies. Since $V \in [0, \log n]$, we report the normalized variation of information $V/\log n$ [48]. Clustering robustness plots $R(M, \alpha)$ [48] estimate the robustness of clustering $C$ or the respective clustering method $M$ under random perturbations of network links. $R$ is defined as the distance between $C$ and $C_\alpha$,
$$R(M, \alpha) = V(C, C_\alpha), \tag{7}$$
where $C_\alpha$ is obtained by $M$ after randomly rewiring a fraction $\alpha$ of the links in the network.
Bibliometric clustering criteria. Let $C$ be some clustering of network nodes as described above. To measure the utility of clustering $C$, we select different bibliometric clustering criteria. We report the average cluster size $S$ and the fraction of covered links $K/k$ already introduced above. Next, we define the orders of magnitude covered by cluster sizes $O$ as
$$O = \log_{10} \frac{s_L}{s_S}, \tag{8}$$
where $s_L$ is the size of the largest cluster and $s_S$ is the size of the smallest. Note that doubling $s_S$, which is a negligible change, affects $O$ as much as doubling $s_L$, which is a substantial change. We thus report 5-percentile effective orders $O_5$ defined as
$$O_5 = \log_{10} \frac{s_L}{s_5}, \tag{9}$$
where $s_5$ is the size of the smallest remaining cluster after removing the 5% smallest clusters. To measure the diameter of clusters in $C$, we compute the 90-percentile effective cluster diameter $D_{90}$ [49], i.e. the average number of hops to reach 90% of all the nodes within a cluster. The value of $D_{90}$ is estimated from 1000 randomly selected seed nodes.
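As an illustration of the orders of magnitude O and the effective orders O 5, consider the following Python sketch. It assumes base-10 logarithms, and it assumes that the number of removed clusters is the 5% of the cluster count rounded down; the function name effective_orders and the toy cluster sizes are hypothetical.

```python
import math

def effective_orders(sizes, percentile=5):
    """Return (O, O_p): orders of magnitude spanned by all cluster sizes and by the
    sizes that remain after removing the smallest `percentile` percent of clusters."""
    ordered = sorted(sizes)
    removed = int(len(ordered) * percentile / 100)   # rounding down is an assumption
    largest = ordered[-1]
    orders = math.log10(largest / ordered[0])
    effective = math.log10(largest / ordered[removed])
    return orders, effective

# Toy cluster sizes (made up): five singletons, 94 clusters of size 10, one of size 1000.
toy_sizes = [1] * 5 + [10] * 94 + [1000]
O, O_5 = effective_orders(toy_sizes)
```

In this toy example the five singleton clusters inflate O by a full order of magnitude, while O 5 ignores them.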
Finally, the robustness of clustering $C$ [48], or equivalently the uncertainty $U$ of the respective clustering method $M$, is defined as the distance between the clusterings $C_1$ and $C_2$ obtained by two consecutive realizations of $M$ (see Eq (6)),
$$U = V(C_1, C_2). \tag{10}$$
All values, plots and diagrams reported in Results are averages over 100 realizations for Scientometrics, 10 realizations for Library & Information Science, two realizations for Physics and a single realization for All Fields.
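The normalized variation of information that underlies both the robustness plots and the uncertainty U can be sketched in Python as follows. The sketch assumes that both clusterings are given as node-to-label mappings over the same node set; the function name and the toy clusterings are hypothetical.

```python
import math
from collections import Counter

def normalized_variation_of_information(c, d):
    """Return V(C, D) / log n for two clusterings given as node -> label mappings."""
    n = len(c)
    if n < 2:
        return 0.0
    joint = Counter((c[i], d[i]) for i in c)     # co-occurrence counts n_ab
    size_c = Counter(c.values())
    size_d = Counter(d.values())
    v = 0.0
    for (a, b), n_ab in joint.items():
        # H(C|D) + H(D|C) = -sum_ab p_ab [log(p_ab / p_b) + log(p_ab / p_a)]
        v -= (n_ab / n) * (math.log(n_ab / size_d[b]) + math.log(n_ab / size_c[a]))
    return v / math.log(n)

# Toy clusterings (made up) of four nodes.
first = {0: 0, 1: 0, 2: 1, 3: 1}
relabeled = {0: "x", 1: "x", 2: "y", 3: "y"}     # same partition, different labels
crossed = {0: 0, 1: 1, 2: 0, 3: 1}               # statistically independent of `first`
u_same = normalized_variation_of_information(first, relabeled)
u_crossed = normalized_variation_of_information(first, crossed)
```

The uncertainty U of a method would then be obtained by passing the clusterings of two consecutive realizations of the same method to this function.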

Results
We start by directly comparing the clusterings obtained by all 30 clustering methods described in Methods to derive a manageable set of representatives. Next, we analyze structural and bibliometric statistics of the clusterings obtained by representative methods, and perform an expert-based assessment of the clusterings. Finally, we also analyze the large-scale behavior of the most prominent methods.
Pair-wise clustering comparison. Fig 1 shows heatmaps of the pair-wise distances between the clusterings returned by the considered methods (see Eq (6)). The methods are applied to two citation networks representing the fields of Scientometrics and Library & Information Science (see Table 1). To gain insight into different classes of methods, we apply the k-means data clustering algorithm [50] to the rows/columns of the heatmaps with the number of classes set to 5 and 11 (left- and right-hand side of Fig 1, respectively). The classes of methods are shown in the order of decreasing size and the methods within each class are listed in the order of decreasing silhouette coefficients $S_h$ [51]. $S_h(M)$ of some method $M$ is defined as a normalized difference between the lowest average inter-class dissimilarity and the average intra-class dissimilarity, where dissimilarity is based on the standard cosine similarity.
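The silhouette coefficient can be illustrated with the following Python sketch. It assumes the common definition (b - a) / max(a, b), where a is the average intra-class and b the lowest average inter-class dissimilarity, and it assumes that dissimilarity is taken as one minus the cosine similarity of two heatmap rows. The function names and the toy rows are hypothetical, and rows are assumed to be non-zero vectors.

```python
import math

def cosine_dissimilarity(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm

def silhouette(rows, labels, index):
    """Silhouette coefficient of rows[index] given class labels for all rows."""
    by_class = {}
    for j, label in enumerate(labels):
        if j != index:
            by_class.setdefault(label, []).append(cosine_dissimilarity(rows[index], rows[j]))
    own = by_class.pop(labels[index], [])
    if not own or not by_class:
        return 0.0                    # singleton class or a single class: set to zero
    a = sum(own) / len(own)                                   # average intra-class dissimilarity
    b = min(sum(d) / len(d) for d in by_class.values())       # lowest average inter-class one
    return (b - a) / max(a, b) if max(a, b) > 0 else 0.0

# Toy rows (made up): rows 0 and 1 point in one direction, rows 2 and 3 in another.
toy_rows = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
well_separated = silhouette(toy_rows, [0, 0, 1, 1], 0)
badly_assigned = silhouette(toy_rows, [1, 0, 0, 1], 0)
```

A value close to one indicates that a method sits firmly within its class, whereas a negative value indicates that it is closer to another class.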
We observe compact classes of methods, which are most pronounced for the larger network (see right-hand side of Fig 1, panel B). Namely, the largest three classes represent spectral and statistical methods (e.g. Graclus, METIS and OSLOM), modularity optimization (e.g. Louvain and SLM) and map equation algorithms (e.g. Gracmap, Metimap and Infomap). Other smaller classes correspond to label propagation algorithms (e.g. LPA, BPA and COPRA), random walks (e.g. Walktrap), link clustering (i.e. Links), methods based on cliques (i.e. GCE and SCP) and other methods. Thus, despite the large number of methods considered, these can be divided into only a handful of truly different classes, but the differences between the classes can be rather substantial. In the following we limit the analysis to the 15 class representatives explicitly stated above, although the actual subset of methods considered depends on the size of the network analyzed.
Structural clustering analysis. Past literature often reported a power-law form $s^{-\gamma}$ of the cluster size distribution $P(s)$ [15,52], to the extent that $s^{-\gamma}$ is also incorporated into the standard network benchmarks for testing clustering methods [53,54]. Nevertheless, this may be merely an artifact of the power-law degree distribution $P(k) \sim k^{-\gamma}$ observed in real-world networks [55], while recent work on principled clustering methods casts further doubt on the power-law form of $P(s)$ [56]. Fig 2 shows the cluster size distributions $P(s)$ of the clusterings obtained by representative methods (see Table 1). The methods are paired according to a similar shape of $P(s)$, where each pair is named by its most "famous" representative. Statistical methods are thus reported under map equation, while methods based on cliques appear under spectral analysis and link clustering. Notice that the validity of the power-law claim $P(s) \sim s^{-\gamma}$ clearly depends on the particular method considered. For instance, there is evidently a peak in the distributions of spectral methods with a lack of a heavy tail (see left-hand side of Fig 2, panel A). Furthermore, in the case of map equation and statistical methods, the power-law form $s^{-\gamma}$ is violated for small and moderate $s$. On the other hand, the distributions for modularity optimization, label propagation and link clustering seem to follow the power-law scaling over several orders of magnitude (see right-hand side of Fig 2, panel A) with the power-law exponent $\gamma$ increasing from left to right. In the extreme case, link clustering produces a few very large clusters covering most of the nodes in the network, while the size distribution of the remaining ones follows a power law. The observed differences between the clustering methods are even more striking on a larger network (see Fig 2, panel B). Table 3 shows structural statistics of the clusterings obtained by representative methods applied to the Library & Information Science citation network. Most methods return a little less than 2000 clusters with some notable exceptions.
The modularity optimization method Louvain, and also the methods based on dynamical processes (e.g. Walktrap and BPA), return a much smaller number of clusters. On the other hand, link clustering and some other methods (e.g. COPRA) return a much larger number of clusters. Table 3 further shows the average internal degree of the nodes in the clusters $K$ and the average external degree or expansion $E$ (see Eqs (1) and (2)). Although most methods achieve $K \gg E$, there are some important differences between the methods. The Flake function $F$ measures the fraction of nodes with larger external than internal cluster degree (see Eq (3)). Notice that the values of $F$ reflect the differences in the cluster size distributions $P(s)$ observed in Fig 2. Modularity optimization and other methods that return clusterings with a power-law distribution $P(s) \sim s^{-\gamma}$ can, due to a number of very large clusters, effectively cover many of the links in the network, giving low $F$ (e.g. Louvain, Walktrap and BPA). On the contrary, spectral methods with a rather homogeneous distribution $P(s)$ must inevitably cut a large number of links between the clusters, thus giving very high $F$ (e.g. Graclus). As in Fig 2, the middle ground between these two regimes is represented by map equation and statistical methods (e.g. Infomap and OSLOM).
Mainly for reference with previous work, Table 3 shows the values of modularity Q (see Eq (4)). As expected, the modularity optimization method Louvain gives the highest Q. Table 3 also reports the log-likelihood log L of the clusterings given the network observed (see Eq (5)). The most likely clustering is obtained by Infomap, yet it should be stressed that the map equation is actually a likelihood criterion. Fig 3 shows the robustness plots V(α) of the clusterings returned by representative methods for the Scientometrics and Library & Information Science citation networks (see Eq (7)). The plots measure the distances between the clusterings obtained by the same method after randomly rewiring α links in the network. Although the plots were initially introduced as a measure of network community structure [48], we here adopt the same approach to measure the robustness of different clusterings.
The methods in Fig 3 are paired as in Fig 2. Since many of them are nondeterministic, most of the plots do not start at the origin. The clusterings obtained by spectral and statistical methods (e.g. Graclus and OSLOM) prove to be the least robust with high values of V even for small α (see left-hand side of Fig 3). The map equation algorithm Infomap, and modularity optimization on the larger network (see middle of Fig 3, panel B), seem to give stable clusterings with gradually increasing V over all α. Label propagation methods and link clustering appear very robust at first sight with surprisingly low V even for very large α (see right-hand side of Fig 3). For instance, the clustering returned by Links stays almost unchanged even after rewiring 30% of the links in the network. Nevertheless, this is a consequence of the existence of a few very large clusters that occupy the majority of the nodes in the network (see Figs 2 and 4) and change very little compared to the clusterings returned by other methods.
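The link perturbation behind the robustness plots can be sketched in Python as follows. The exact rewiring scheme is not restated here, so the sketch makes its own assumptions: a fraction alpha of the links is selected uniformly at random, and each selected link is replaced by a link between two distinct nodes chosen uniformly at random that are not yet connected. The function name rewire_links, the seed parameter and the toy ring network are hypothetical.

```python
import random

def rewire_links(edges, alpha, seed=0):
    """Return a copy of the link set in which a fraction alpha of the links is rewired.

    Assumption: rewired links connect two uniformly chosen, distinct,
    currently unconnected nodes, so the number of links stays the same.
    """
    rng = random.Random(seed)
    links = {frozenset(edge) for edge in edges}
    nodes = sorted({node for link in links for node in link})
    candidates = sorted(links, key=sorted)            # fixed order for reproducibility
    for old in rng.sample(candidates, round(alpha * len(candidates))):
        links.remove(old)
        while True:
            new = frozenset(rng.sample(nodes, 2))     # two distinct random nodes
            if new not in links:
                links.add(new)
                break
    return {tuple(sorted(link)) for link in links}

# Toy network (made up): a ring of ten nodes.
ring = [(i, (i + 1) % 10) for i in range(10)]
original = {tuple(sorted(edge)) for edge in ring}
rewired = rewire_links(ring, alpha=0.3, seed=1)
```

A robustness plot would then be obtained by clustering the rewired network with the same method for a range of values of alpha and measuring the normalized variation of information to the clustering of the original network.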
Bibliometric clustering analysis. The above structural analysis of the clusterings of citation networks would most likely be of interest to network scientists, but might provide limited value to the bibliometric community. In the following, we therefore also analyze the clusterings from an alternative perspective. Table 4 shows bibliometric statistics of the clusterings obtained by representative methods applied to the Library & Information Science citation network. The average cluster size $S$ mirrors the number of clusters reported in Table 3, since $S$ equals $n$ divided by the number of clusters. For most methods, $S \approx 15$. The modularity optimization method Louvain gives almost five times larger clusters on average, while link clustering and some other methods (e.g. COPRA) return much smaller clusters with $S \approx 10$. Table 4 further shows 5-percentile effective orders $O_5$ that measure the orders of magnitude covered by cluster sizes $s$ (see Eq (9)). For many practical applications, the clusters ideally should span no more than a single order of magnitude giving $O_5 \approx 1$. This turns out to be an elusive goal as $O_5 \gg 1$ for all methods except the spectral ones (e.g. Graclus), which one can observe also in Fig 2. Next, the 90-percentile effective diameter $D_{90}$ measures the average number of hops to reach most of the nodes in a cluster (see Methods). Most methods return clusterings with small $D_{90}$ consistent with the small-world network structure [57]. On the other hand, $D_{90} > 10$ for methods based on cliques (i.e. GCE and SCP) and link clustering, indicating the existence of some very large clusters, which is rather inconvenient in practice.
Table 4 also shows the fractions of the links covered by different clusterings $K/k$ (see Methods). Notice the substantial diversity between the methods, which can again be interpreted in terms of different cluster size distributions $P(s)$ (see Fig 2). The methods that return clusterings with a power law $P(s) \sim s^{-\gamma}$, namely modularity optimization (e.g. Louvain), link clustering and methods based on dynamical processes (e.g. Walktrap, COPRA and BPA), can effectively cover over 80% of the links in the network. However, spectral and statistical methods (e.g. Graclus and OSLOM) that are characterized by a rather homogeneous $P(s)$ give $K/k$ as low as 30%. The middle ground is again represented by the map equation algorithm Infomap with $K/k$ around 60%.
The uncertainty $U$ measures the stability of a method or equivalently the distance between the clusterings obtained by two consecutive realizations of the same method (see Eq (10)). Note that $U = V(0)$ in Fig 3. Table 4 shows the uncertainties of representative clustering methods. Spectral and statistical methods (e.g. Graclus and OSLOM) are substantially less stable than the rest with $U \approx 0.4$. Due to the existence of a few very large clusters already discussed above, link clustering and some other methods (i.e. Walktrap and SCP) appear very robust with $U \approx 0$. For the rest, $U \approx 0.2$.
The method complexity $T$ in Table 4 is measured as the execution time on a 2.3 GHz Intel Core i7 processor with a sufficient amount of memory. The fastest methods are those based on modularity optimization (i.e. Louvain), label propagation (e.g. BPA) and also spectral analysis (e.g. Graclus). Notice that the map equation algorithm Infomap takes only about ten seconds on the Library & Information Science citation network. Although this does not seem like much, the network is relatively small. In fact, the algorithm takes almost three hours on the Physics citation network (results not shown) and would probably take several days to cluster the All Fields citation network (see Table 1).
Fig 4 shows the degeneracy diagrams $D$ of the clusterings returned by representative methods on the Library & Information Science and Physics citation networks. These display the non-degenerate or effective ranges of the clusterings that span the fraction of nodes not covered by tiny clusters with $s < s_{\mathrm{tiny}}$, $s_{\mathrm{tiny}} = 15$, or the largest or giant cluster. Hence, the degeneracy diagram $D$ is defined as a range $\left(\sum_{s_i < s_{\mathrm{tiny}}} s_i/n,\; 1 - s_L/n\right)$, where $s_L$ is the size of the largest cluster. In the best-case scenario, the ranges in Fig 4, panel A). However, these can include many tiny clusters. On the other hand, modularity optimization and label propagation methods (e.g. Louvain and BPA) return clusterings with at least one very large cluster (see right-hand side of Fig 4, panel A). Even more, in the case of link clustering and some other methods (e.g. SCP), the giant cluster contains almost all the nodes in the network. Although the existence of a giant cluster and tiny clusters is not clearly visible in the case of a larger network (see Fig 4, panel B), we stress that even a slight deviation from right or left is already substantial.
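The range displayed in a degeneracy diagram can be computed from the cluster sizes alone, as the following Python sketch illustrates. The function name degeneracy_range and the toy cluster sizes are hypothetical.

```python
def degeneracy_range(sizes, s_tiny=15):
    """Return the range (fraction of nodes in tiny clusters, 1 - fraction in the largest cluster)."""
    n = sum(sizes)
    left = sum(s for s in sizes if s < s_tiny) / n    # share of nodes in tiny clusters
    right = 1 - max(sizes) / n                        # share of nodes outside the giant cluster
    return left, right

# Toy cluster sizes (made up): two tiny clusters and two larger ones, 100 nodes in total.
toy_range = degeneracy_range([5, 5, 40, 50])
```

For the toy sizes the range is (0.1, 0.5): 10% of the nodes sit in tiny clusters and the largest cluster holds half of the nodes.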
Expert-based clustering assessment. An expert-based assessment was performed on the clusterings obtained by representative methods on the Library & Information Science citation network. Within this network, the assessment focused on clusters covering topics or research areas in the field of scientometrics. Scientometrics can be seen as a subfield of the broader field of library and information science. The assessment was performed jointly by the second and the third author (NJvE and LW), who both have extensive expertise in the field of scientometrics. A detailed investigation and comparison of the different clusterings was done with the help of the CitNetExplorer software tool for visualizing and analyzing citation networks of publications [58].
We start by comparing the obtained clusterings based on the resolution they provide. A clustering consisting of a small number of clusters, with each cluster including a relatively large number of publications, has a low resolution. On the other hand, a clustering consisting of a large number of clusters, each including only a small number of publications, has a high resolution.
There are a number of clusterings for which we consider the resolution to be too high. This is the case for spectral methods Graclus(S), Graclus(L), METIS(S) and METIS(L). In these clusterings, topics that we would expect to be represented by a single cluster are instead represented by multiple clusters, each covering a subset of the publications dealing with a topic. For instance, the clustering returned by Graclus(L) includes four clusters that all cover part of the literature on the topic of the h-index, a very prominent topic in the field of scientometrics. Of these four clusters, there is one that clearly has its own focus. This cluster includes publications studying the mathematical properties of the h-index. Having a separate cluster for these publications is probably defensible. However, the other three clusters all seem to cover very similar publications, and therefore we see no justification for the fact that these publications are distributed over three clusters rather than all being assigned to the same cluster.
Other clusterings have a resolution that is too low for a meaningful analysis of the scientometric literature. The clusterings for which this is the case are obtained by BPA and Walktrap. One of the clusters created by BPA for instance consists of 3,808 publications and essentially covers the entire scientometric literature. This cluster seems to properly delineate the scientometric literature from the rest of the library and information science literature. Hence, if one's purpose is to identify subfields within the field of library and information science, then BPA may provide good results. However, in our case, we are interested in identifying topics rather than entire subfields, and for this purpose the results provided by BPA are not helpful.
The clusterings with a resolution that matches reasonably well with the idea of identifying topics within the subfield of scientometrics are obtained by the statistical method OSLOM and the map equation algorithms Infomap and Metimap. In addition to the clustering methods presented in Methods, we also consider here a variant of the Louvain modularity optimization method with a resolution parameter [59] that can be tuned to customize the clustering resolution [18]. Setting the resolution parameter to 10 gives the most suitable resolution here; we denote this variant Louvain(10). We next analyze OSLOM, Infomap, Metimap and Louvain(10) in more detail.
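To illustrate how a resolution parameter shifts the preferred granularity, the following minimal sketch evaluates a generalized modularity function on a toy network. The functional form is an assumption (a commonly used generalization in which the null-model term is multiplied by the resolution parameter); the exact variant of [59] and [18] is not spelled out in this section, and parameter values such as 10 are not necessarily comparable across formulations. The function names, the toy network and the three candidate partitions are hypothetical.

```python
from collections import defaultdict


def modularity(edges, membership, resolution=1.0):
    """Generalized modularity of an undirected, unweighted network.

    Assumed form: Q = sum_c [ m_c / m - resolution * (d_c / (2 m))^2 ],
    where m is the number of links, m_c the number of links inside
    cluster c and d_c the summed degree of the nodes in cluster c.
    """
    m = len(edges)
    internal = defaultdict(int)  # links with both endpoints in the cluster
    degree = defaultdict(int)    # summed degree of the nodes in the cluster
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - resolution * (degree[c] / (2 * m)) ** 2
               for c in degree)


# Toy network (hypothetical): two triangles joined by a single link.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
partitions = {
    "one cluster": {n: 0 for n in range(6)},
    "two triangles": {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1},
    "singletons": {n: n for n in range(6)},
}


def best_partition(resolution):
    """Name of the candidate partition with the highest quality value."""
    return max(partitions,
               key=lambda name: modularity(edges, partitions[name], resolution))


for gamma in (0.1, 1.0, 10.0):
    print(gamma, best_partition(gamma))
```

On this toy network, a low resolution value favors the single cluster, the default value of 1 favors the two triangles, and a high value favors singletons, which mirrors the way a larger resolution parameter yields a finer clustering.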
The clustering obtained by OSLOM has a relatively high resolution. It includes only three clusters with more than 100 scientometric publications, which means that most scientometric publications are assigned to small clusters. As a consequence, some topics that we would expect to be represented by a single cluster are in fact distributed over multiple clusters. Important examples are the topic of webometrics and the topic of patents. These topics are each distributed over two clusters of approximately equal size, which we consider an unsatisfactory result. A more general problem of OSLOM is that we observe a relatively large number of publications that are assigned to a cluster where they do not seem to belong. For instance, there is a cluster covering the topic of the analysis and visualization of bibliometric networks, but this cluster includes a significant number of publications dealing with other topics, such as the topic of indicators for citation analysis.
The Louvain(10) clustering is characterized by a somewhat unusual cluster size distribution. Compared with other clusterings, it includes a relatively large number of clusters with more than 100 publications and a relatively small number of clusters with a number of publications between 10 and 100. As a consequence, there are a number of larger scientometric clusters for which there is no similar cluster in other clusterings, for instance those obtained by Metimap or Infomap. A detailed examination of these clusters indicates that they do not cover easily recognizable topics. Publications included in these clusters usually do have something in common. For instance, there are clusters in which many publications relate to a specific country or a specific geographical region, such as China or Africa. However, our overall impression is that the clusters are of a somewhat heterogeneous nature and that it would have been better if the publications in the clusters had been distributed over a number of smaller clusters. The presence of these heterogeneous clusters is a significant weakness of Louvain(10).
The clusterings that we are most satisfied with are obtained by Metimap and Infomap. In Table 5, we present for each of these clusterings a list of all scientometric clusters with at least 50 publications. For each cluster, we report the number of publications included in the cluster or equivalently the cluster size s and we provide an indication of the topic that is represented by the cluster. Metimap and Infomap both offer a reasonable perspective on the main topics in the field of scientometrics. As can be seen in Table 5, the clustering returned by Metimap has a somewhat higher resolution than that of Infomap and consequently some topics that are covered by a single cluster in the case of Infomap are distributed over multiple clusters in the case of Metimap. We have a slight preference for Infomap over Metimap because the way in which topics are distributed over multiple clusters in the case of Metimap does not always seem fully satisfactory to us. For instance, we prefer to have a single cluster covering the topic of bibliometric networks instead of the two clusters that are provided by Metimap. However, we emphasize that the differences between the two clusterings are small and that we have only a weak preference for Infomap. Furthermore, even though Metimap and Infomap gave the best clusterings obtained in our study, it should be mentioned that these clusterings sometimes suffer from questionable assignments of publications to clusters. This is a problem especially for smaller clusters. In the case of clusters with fewer than 100 publications, we often observe that a significant share of the publications assigned to a cluster (e.g. about 25% of the publications) are only weakly related to the main topic of the cluster. In the case of the clusterings obtained by Metimap and Infomap, we also investigated the effect of applying our post-processing approach (see Methods). 
Due to the relatively small size of the Library & Information Science citation network, the effect of the post-processing approach on the main clusters obtained in the Metimap and Infomap clusterings is small. The number of publications that are reassigned from small clusters to larger clusters, i.e. clusters with at least 50 publications, is very limited. Given the small effect of the post-processing approach, no significant influence on the quality of the clusters could be observed.
Large-scale clustering analysis. In the following, we analyze the large-scale behavior of different clustering methods. We limit the analysis to the Louvain modularity optimization method, the map equation algorithm Metimap, the label propagation algorithm BPA and the spectral analysis approach Metilus. These were selected since they can cluster the All Fields citation network in about an hour. Table 6 shows bibliometric statistics of the clusterings obtained by the selected methods applied to the Physics citation network (see Table 1). Compared to the clusterings obtained for the Library & Information Science network in Table 4, one can observe a notable increase in the average cluster size S and the effective orders of magnitude O_5. The clusterings thus include at least some much larger clusters. Yet, the effective diameter D_90 and the clustering coverage K/k remain comparable. The clusterings returned by modularity optimization and label propagation methods (i.e. Louvain and BPA) again cover around 80% of the links, while the spectral method Metilus gives K/k below 40%. Finally, despite a substantial increase in the network size, the method uncertainty U stays about the same, while the complexity T obviously increases. Table 6 also shows the effect of the clustering post-processing approach presented in Methods, which first tries to further partition the largest clusters with s > s_giant and then merges the tiny clusters with s < s_tiny into larger ones, where s_tiny = 15 and s_giant = 10^4. In the case of the map equation, label propagation and spectral methods (i.e. Metimap, BPA and Metilus), the post-processing approach has no apparent effect on the largest clusters. Due to the merging of tiny clusters, the average cluster size S increases, while all the remaining statistics remain roughly the same (see Table 6).
On the other hand, the post-processing manages to further partition the largest clusters returned by the modularity optimization method Louvain. This decreases the cluster size S, and also the effective orders O_5 and the effective diameter D_90. However, the clustering coverage K/k decreases as well, while the method uncertainty U increases (see Table 6). Fig 6 shows the impact of the post-processing approach on the cluster size distributions P(s) and the clustering degeneracy diagrams D. All distributions P(s) remain conceptually the same, with the difference that most tiny clusters have been merged with larger ones (see Fig 6, panel A). Notice that a small number of tiny clusters with s < 15 remain, which correspond to disconnected components that could obviously not be merged with other clusters (see Table 1 for the size of LCC). Still, the degeneracy diagrams D show that post-processing effectively removes tiny clusters, and also the giant cluster in the case of the modularity optimization method Louvain, but fails to further partition the giant cluster in the case of the label propagation algorithm BPA (see right-hand side of Fig 6, panel B).
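The merging step of the post-processing approach can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure from Methods: the rule that a tiny cluster joins the larger cluster with which it shares the most links is an assumption, the splitting of clusters with s > s_giant is omitted, and the function name `merge_tiny_clusters` and the toy data are hypothetical. Tiny clusters without links to any larger cluster are left unchanged, consistent with the disconnected components noted above.

```python
from collections import Counter, defaultdict


def merge_tiny_clusters(edges, membership, s_tiny=15):
    """Reassign every cluster with fewer than s_tiny nodes to the larger
    cluster it shares the most links with (assumed merge rule).

    Tiny clusters that have no link to a larger cluster keep their label.
    """
    sizes = Counter(membership.values())
    tiny = {c for c, s in sizes.items() if s < s_tiny}

    # Count the links between each tiny cluster and each larger cluster.
    links = defaultdict(Counter)
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        if cu in tiny and cv not in tiny:
            links[cu][cv] += 1
        elif cv in tiny and cu not in tiny:
            links[cv][cu] += 1

    # Pick the most strongly connected larger cluster as the merge target.
    target = {c: links[c].most_common(1)[0][0] for c in tiny if links[c]}
    return {node: target.get(c, c) for node, c in membership.items()}


# Toy data (hypothetical): cluster "A" is large enough, "B" is tiny and
# linked to "A", "C" is tiny and forms a disconnected component.
edges = [(0, 1), (1, 2), (2, 3), (0, 3), (3, 4), (4, 5), (6, 7)]
membership = {0: "A", 1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "C", 7: "C"}

merged = merge_tiny_clusters(edges, membership, s_tiny=3)
print(Counter(merged.values()))
```

In this example, the two publications of cluster "B" are absorbed by cluster "A", while the disconnected cluster "C" remains as it is.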
Table 6. Bibliometric statistics of the clusterings obtained by selected methods. The methods are applied to Physics citation network and bibliometric statistics of the clusterings with and without post-processing are shown. See Methods for the definitions of statistics and the details of clustering post-processing approach.
Last, we apply the selected methods to the All Fields citation network (see Table 1). Table 7 shows different statistics of the obtained clusterings. Compared to those obtained for the Physics citation network in Table 6, we can again observe an increase in the average cluster size S and the effective orders O_5. Thus the size of the largest clusters further increases. Yet, as before, the clustering coverage K/k of different methods remains roughly the same, while the differences between the methods can also clearly be observed in the average internal degree K. Table 7 also shows the statistics of the clusterings after the post-processing approach, which has exactly the same effect on the clusterings as in Table 6. Notice also that the post-processing does not substantially increase the running time of the methods.
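The coverage statistics referred to above can be illustrated with a short sketch. It assumes that the internal degree of a node counts its links to nodes in the same cluster, so that the coverage K/k of a cluster is the summed internal degree divided by the summed total degree of its nodes, and that the coverage of a whole clustering is the share of links with both endpoints in the same cluster. The function names and the toy network are hypothetical.

```python
from collections import defaultdict


def cluster_coverage(edges, membership):
    """Coverage K/k per cluster: internal degree over total degree,
    both summed over the nodes of the cluster."""
    internal = defaultdict(int)
    total = defaultdict(int)
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        total[cu] += 1
        total[cv] += 1
        if cu == cv:
            # An internal link adds one to the internal degree of each endpoint.
            internal[cu] += 2
    return {c: internal[c] / total[c] for c in total}


def clustering_coverage(edges, membership):
    """Share of all links that have both endpoints in the same cluster."""
    inside = sum(1 for u, v in edges if membership[u] == membership[v])
    return inside / len(edges)


# Toy network (hypothetical): two triangles joined by a single link.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
membership = {0: "left", 1: "left", 2: "left", 3: "right", 4: "right", 5: "right"}

per_cluster = cluster_coverage(edges, membership)
overall = clustering_coverage(edges, membership)
print(per_cluster, overall)
```

Here six of the seven links fall inside a cluster, so both the per-cluster and the overall coverage equal 6/7, or roughly 86%.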
Table 7. Statistics of the clusterings obtained by the selected methods. The methods are applied to the All Fields citation network and different statistics of the clusterings with and without post-processing are shown. See Methods for the definitions of the statistics and the details of the clustering post-processing approach.
To better understand the nature of different clusterings and the effects of the post-processing approach, Fig 7 shows the sizes s and coverage K/k of the largest 50 clusters returned by the selected methods (see Methods). The coverage K/k of an individual cluster is defined as the average internal degree of the nodes in the cluster divided by the total degree of these nodes. As already discussed at length above, the spectral analysis approach Metilus returns clusters with very low K/k ≈ 15% (see left-hand side of Fig 7, panel B), while the modularity optimization and label propagation methods (i.e. Louvain and BPA) give clusters with very high K/k ≈ 80% (see right-hand side of Fig 7,

Discussion
Which methods for graph partitioning and community detection perform best for the purpose of grouping scientific publications into clusters? In this paper, we have carried out an extensive analysis comparing the performance of a large number of methods. The methods have been applied to a number of networks of publications connected by direct citation relations. We have studied the statistical properties of the results provided by the different methods, and we have also performed an expert-based assessment of the results. From a bibliometric point of view, a good clustering of publications ideally should have a number of properties. First of all, although it is natural to expect that there will be larger and smaller clusters, it is inconvenient for practical purposes if there are very large differences in the size of clusters. As a rule of thumb, we ideally would like the difference in size between the largest and the smallest clusters to be no more than an order of magnitude. Second, if it turns out to be inevitable that some publications end up in very small clusters, for instance because these publications have almost no citation relations with other publications, then at least we would prefer the number of publications assigned to these insignificant clusters to be as limited as possible. Third, we would like the results of a clustering method to be reasonably stable. Many methods include a random element, in which case different runs of a method may yield different results. However, running the same method multiple times should not affect the results too much, and the results should also be reasonably robust to small changes in a citation network of publications. Fourth, the computing time of a clustering method should not be excessive. This is especially important when one aims to apply a method to networks consisting of large numbers of publications and citation relations. 
Finally, and perhaps most importantly, the results produced by a clustering method should make intuitive sense. Experts should be able to recognize the scientific topics represented by clusters of publications.
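The third property, stability across repeated runs, can be checked by comparing the clusterings produced by two runs of the same method. The sketch below uses the Rand index, i.e. the share of publication pairs that the two clusterings treat in the same way. This is a generic agreement measure chosen for illustration and should not be read as the uncertainty statistic U defined in Methods. The function name `rand_index` and the two example clusterings are hypothetical.

```python
from collections import Counter
from math import comb


def rand_index(a, b):
    """Share of node pairs on which clusterings a and b agree, i.e. pairs
    placed together in both or placed apart in both. Cluster labels do
    not need to match between the two clusterings."""
    nodes = list(a)
    pairs = comb(len(nodes), 2)
    together_a = sum(comb(s, 2) for s in Counter(a[x] for x in nodes).values())
    together_b = sum(comb(s, 2) for s in Counter(b[x] for x in nodes).values())
    together_both = sum(
        comb(s, 2) for s in Counter((a[x], b[x]) for x in nodes).values()
    )
    # Pairs apart in both = pairs - together_a - together_b + together_both.
    agreements = pairs + 2 * together_both - together_a - together_b
    return agreements / pairs


# Two hypothetical runs of a method on six publications.
run_1 = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
run_2 = {0: "x", 1: "x", 2: "y", 3: "y", 4: "y", 5: "y"}

print(rand_index(run_1, run_1))
print(rand_index(run_1, run_2))
```

A value of 1 means that the two runs produce identical clusterings up to relabeling; lower values indicate that publications move between clusters from one run to the next.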
Our analysis shows that most clustering methods yield results with large differences in the size of clusters. The larger clusters are typically several orders of magnitude larger than the smaller clusters. Sometimes more than half of the publications in a citation network are all assigned to the same cluster. This was for instance observed for the results obtained from the Links and SCP methods in the Library & Information Science citation network. The only methods that yield clusters of more or less similar size are the spectral methods (e.g. Graclus). These methods produce results that are characterized by a much more uniform cluster size distribution. Depending on the cluster size distribution and also on the resolution of a clustering, there can be large differences in the share of all citation relations that are covered by clusters. Coverage for instance ranges from less than 30% to more than 85% in the Library & Information Science citation network. Clustering methods also often assign a significant share of the publications in a citation network to very small clusters. In the Library & Information Science citation network, the Graclus and Infomap methods for instance assign more than 25% of the publications to clusters consisting of fewer than 15 publications. The stability or robustness of the results obtained from a clustering method also partly depends on the size of the clusters produced by the method. Not surprisingly, methods that produce one or more very large clusters tend to yield relatively robust results. Furthermore, in the Library & Information Science citation network, spectral and statistical methods (e.g. Graclus and OSLOM) produce results with a relatively low robustness, while Infomap and modularity optimization yield quite robust results.
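The size-related properties discussed above can be summarized for any clustering with a few lines of code. The following sketch reports the spread between the largest and the smallest cluster in orders of magnitude, the share of publications in the largest cluster, and the share of publications in clusters with fewer than 15 publications. It is a simplified illustration: the spread computed here is a plain log10 ratio, not the effective orders of magnitude O_5 defined in Methods, and the function name `size_summary` and the example data are hypothetical.

```python
from collections import Counter
from math import log10


def size_summary(membership, s_tiny=15):
    """Simple size statistics of a clustering given as node -> cluster."""
    sizes = Counter(membership.values())
    largest = max(sizes.values())
    smallest = min(sizes.values())
    n = len(membership)
    return {
        "clusters": len(sizes),
        # Plain log10 ratio between the largest and the smallest cluster.
        "orders_of_magnitude": log10(largest / smallest),
        "largest_share": largest / n,
        # Share of publications in clusters with fewer than s_tiny members.
        "tiny_share": sum(s for s in sizes.values() if s < s_tiny) / n,
    }


# Hypothetical clustering with clusters of 1000, 100 and 10 publications.
membership = {}
for label, size in (("a", 1000), ("b", 100), ("c", 10)):
    for i in range(size):
        membership[f"{label}{i}"] = label

summary = size_summary(membership)
print(summary)
```

In this example the largest cluster is two orders of magnitude larger than the smallest one, which exceeds the rule of thumb of one order of magnitude, while fewer than 1% of the publications end up in a tiny cluster.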
In terms of computing time, there are substantial differences between the various methods. For instance, clustering the publications in the Library & Information Science citation network takes more than 100 times longer for the slowest method than for the fastest method. Modularity optimization methods (e.g. Louvain), label propagation (e.g. BPA), and spectral analysis methods (e.g. Graclus) perform best in terms of computing time. Other methods require a more significant amount of computing time, making them less suitable for applications on large citation networks.
Turning now to the expert-based assessment of the results produced by different clustering methods for the scientometrics subfield within the Library & Information Science citation network, we find that the Infomap and Metimap (i.e. Infomap combined with the spectral method METIS) methods give the most satisfactory results, with a slight preference for the Infomap results over the results obtained from Metimap. Other methods, such as OSLOM and Louvain, provide less satisfactory results.
Our analysis seems to provide most support for the use of Infomap and related methods such as Metimap to cluster the publications in a citation network. Infomap has the best performance in our expert-based assessment, and it yields quite robust results. Compared with some of the other methods, Infomap has a relatively high computing time, but this can be overcome by using Metimap in larger citation networks. The price that we pay for the good performance of Infomap seems to be the assignment of a relatively large number of publications to small clusters. Paying this price seems necessary to obtain high-quality clustering results. In large citation networks, a post-processing procedure can be applied to minimize the number of small clusters, but the effect of the use of such a procedure on the quality of the clustering results is not clear.
The promising results obtained for Infomap are in line with earlier findings reported in the network science literature [60]. Although Infomap has been introduced in the bibliometric literature [61] and has been applied to citation networks in a number of studies [19,20,62,63], the method has not yet gained widespread popularity in the bibliometric community, where researchers seem to prefer the use of modularity-based methods. Our findings suggest that the bibliometric community could benefit from exploring the use of other clustering methods in addition to modularity-based methods. Infomap seems to be of particular interest. Future studies should reveal whether Infomap indeed consistently performs well in applications to citation networks.
Limitations of the analysis. It is important to emphasize that our results should be interpreted cautiously because of a number of limitations of our analysis. One obvious limitation is that, despite the large number of clustering methods included in our analysis, we did not exhaustively cover all methods proposed in the literature. The selection of the methods included in our analysis was made based on the popularity of a method and to some degree also on our familiarity with a method. In addition, the availability of source code played a role as well. Many methods discussed in the literature are not included in our analysis. In particular, methods that produce overlapping clusters [64,65] or clusters at multiple levels of resolution [66,67] are not covered. We also do not cover, for instance, some recently developed principled methods based on statistical inference [56].
A second limitation is that each clustering method was applied using the default parameter settings. We did not try to optimize the parameter values of the different methods. So the performance of some methods may have been better if we had used optimized parameter values for these methods. Some methods for instance have a parameter that can be used to fine-tune the level of granularity of the clustering results. One could use such a parameter to try to obtain results at similar levels of granularity for different methods, and in that way a more accurate comparison between different methods may be possible. We did not explore this possibility in our analysis, but we do consider this an interesting direction for future research. We note that the clustering method proposed by two of us in an earlier paper [10] requires a careful choice of parameter values. For this reason, this method was not included in our present analysis.
A third limitation is our exclusive focus on undirected and unweighted networks of direct citation relations between publications. We did not consider the possibility of taking into account the direction of a citation relation, and we did not test the effect of assigning weights to citation relations [10]. We also did not study the use of indirect citation relations between publications, in particular co-citation and bibliographic coupling relations.
Finally, we should emphasize the limitations of our expert-based assessment of the clustering results obtained for the scientometrics subfield within the Library & Information Science citation network. The expert-based assessment was carried out at a high level of detail by two experts with extensive expertise in the field of scientometrics. Nevertheless, any expert-based assessment will necessarily be of a subjective nature, and different experts therefore may not always reach the same conclusions. Moreover, experts typically have a deep understanding of the literature only in a relatively small area of science. This for instance explains why in our expert-based assessment we could not cover the entire field of library and information science but only the subfield of scientometrics. Unfortunately, it is difficult to say to what extent conclusions reached for such a relatively small area of science can be expected to generalize to other areas. For this reason, the findings of our expert-based assessment should be interpreted with some caution.