Publication authorship: A new approach to the bibliometric study of scientific work and beyond

  • Steffen Blaschke

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    sbl.msc@cbs.dk

    Affiliation Department of Management, Society, and Communication, Copenhagen Business School, Frederiksberg, Denmark

Abstract

Bibliometric studies offer numerous ways of analyzing scientific work. For example, co-citation and bibliographic coupling networks have been widely used since the 1960s to describe the segmentation of research and to look at the development of the scientific frontier. In addition, co-authorship and collaboration networks have been employed for more than 30 years to explore the social dimension of scientific work. This paper introduces publication authorship as a complement to these established approaches. Three data sets of academic articles from accounting, astronomy, and gastroenterology are used to illustrate the benefits of publication authorship for bibliometric studies. In comparison to bibliographic coupling, publication authorship produces significantly better intra-cluster cosine similarities across all data sets, which in the end yields a more fine-grained picture of the research field in question. Beyond this finding, publication authorship lends itself to other types of documents such as corporate reports or meeting minutes to study organizations, movements, or any other concerted activity.

Introduction

Bibliometric studies use publication data to describe the segmentation of research and to look at the development of the scientific frontier. The seminal works of the 1960s and 1970s [1–3] built networks of publications (vertices) connected by co-citation or bibliographic coupling (edges). Starting in the 1980s [4, 5], scholars turn to the social dimension of scientific research by looking at networks of authors (vertices) connected by author co-citation or co-authorship (edges).

Several refinements have been made to both publication and social networks in recent years. For example, co-citation proximity analysis [6] posits that citations appearing closer together (e. g., within a paragraph) in a publication are more similar than those further apart (e. g., one in the introduction and one in the discussion). This idea balances out the shortcoming that all citations contribute equally to the establishment of a relation between two (co-)cited publications. Another example is author bibliographic-coupling [7], which estimates a relation between authors based on the overlap of the bibliographies found in their complete oeuvres. This approach effectively expands the intellectual structure of scientific research from single publications to entire life works, which arguably paints a more realistic picture of the scientific frontier.

Bibliometric studies name either publications (e. g., academic articles, research grants, scientific patents) or individuals (i. e., authors) as the vertices of a network. The edges of the network, in turn, are either citations or authorship. The combination of vertices and edges then accounts for a number of different bibliometric networks, whether they are co-citation, bibliographic coupling, author co-citation, author bibliographic-coupling, or co-authorship networks. Notably missing from the combination of vertices and edges is the idea that publications may be connected by authorship. I consequently call this combination publication authorship. It clearly denominates vertices as publications and edges between them as the authorship of these.

In the following, I discuss the theoretical foundation of publication authorship including the most prevalent differences to traditional approaches in bibliometric studies. Empirical data from management, physics, and medicine then illustrate publication authorship in contrast to bibliographic coupling. More specifically, I draw on academic articles published between 2010 and 2019 in the top-10 journals in accounting, astronomy, and gastroenterology. As a first step in the research, descriptive statistics for both publication authorship and bibliographic coupling provide an overview of the development of the literature in these three academic areas. I then apply a standard clustering algorithm and test its goodness-of-fit using the cosine similarity between article abstracts. These empirical illustrations and the respective statistical analysis show that publication authorship yields a significantly better segmentation of research than bibliographic coupling in all three academic areas, which consequently provides a more fine-grained picture of the scientific frontier. Finally, I point out similarities and differences in the findings for each one of the three academic areas, discuss co-word analysis and Latent Dirichlet Allocation as two alternative approaches, and conclude with implications for the theory and practice of bibliometric studies.

Theoretical considerations

A brief introduction to co-citation, bibliographic coupling, author co-citation, author bibliographic coupling, and co-authorship sets the stage for an elaboration of publication authorship. Fig 1 provides an overview of these altogether six bibliographic networks. Publications appear as rectangles and authors as circles. Citations are directed either from publications (co-citation and author co-citation) or to publications (bibliographic coupling and author bibliographic-coupling), whereas authorship is undirected (i. e., individuals (co-)author publications and publications are (co-)authored by individuals).

The intellectual structure of scientific work

Up until the mid-1960s, direct citation and keyword analysis are the dominant methods of inquiry into the structure of academic work and the development of the scientific frontier. The concepts of co-citation and bibliographic coupling are first and foremost critical responses to these methods used in early bibliometric studies.

Following in the footsteps of de Solla-Price [2], Small [3] introduces co-citation as a measure of scientific similarity in 1973. He argues that the frequency with which two publications are cited together by other publications (i. e., co-citation) is a better measure than direct citation, which is limited by the need of an explicit reference from one to another publication. Co-citation then identifies the intellectual connections between publications based on their citation patterns. It singles out seminal works in a given academic area using their citation count and tracks the development of intellectual ideas over time by looking at the evolution of clusters in co-citation networks [8].

Ten years earlier, in 1963, Kessler [1] introduces the concept of bibliographic coupling, which measures the similarity between two scientific publications based on the references they have in common. Bibliographic coupling effectively replaces earlier approaches (e. g., keyword analysis) to understand the development of academic areas. It is similar to co-citation in that it identifies structural properties of a given scientific field. However, where co-citation is more sensitive to the overall structure of a field, bibliographic coupling focuses on specific clusters of related publications. Co-citation maps the intellectual structure of an academic area and points to its research frontier, while bibliographic coupling relies on the similarity of publications to interpret core and peripheral works in a discipline.

Co-citation and bibliographic coupling both define publications as the vertices in a network. The edges in co-citation connect two publications (A and B) which are jointly cited by one or more other publications (X). They may be weighted by the number of publications which jointly cite the two. Conversely, bibliographic coupling connects two publications (A and B) which share common references to one or more other publications (X). The edges may be weighted by the number of references two publications have in common. At the center of attention of both these bibliometric networks are the themes and topics of clusters of publications that make up schools of thought and push the scientific frontier.
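The two edge definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `references` mapping and the labels A, B, C, X, Y, Z are hypothetical toy data, and the function names are not taken from the paper's R code.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each citing publication maps to the works it references.
references = {
    "A": {"X", "Y", "Z"},
    "B": {"X", "Y"},
    "C": {"Z"},
}

def bibliographic_coupling(references):
    """Weight each pair of citing publications by the number of shared references."""
    weights = Counter()
    for a, b in combinations(sorted(references), 2):
        shared = len(references[a] & references[b])
        if shared:
            weights[(a, b)] = shared
    return weights

def co_citation(references):
    """Weight each pair of cited works by the number of publications citing both."""
    weights = Counter()
    for cited in references.values():
        for x, y in combinations(sorted(cited), 2):
            weights[(x, y)] += 1
    return weights

coupling = bibliographic_coupling(references)
cocitation = co_citation(references)
```

In this toy example, A and B are coupled with weight 2 because they share the references X and Y, while X and Y are co-cited with weight 2 because both A and B cite them together.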

Both co-citation and bibliographic coupling have been widely used in various fields of research such as biology, chemistry, physics, medicine, psychology, sociology, as well as computer, information, and management science. For example, Small et al. identify 71 emerging topics across all of science by combining direct citations and co-citations in publications from 2007 to 2010 [9]. They conclude that three non-exclusive forces drive research: scientific discovery, technological innovation, and exogenous events. On a side note, nearly all emerging topics contain highly cited papers, but only about 10 percent of highly cited papers are part of emerging topics. Jarneving complements bibliographic coupling with a complete-link cluster analysis [10] similar to previous work on co-citation clusters [11]. He tests this combination on a large multidisciplinary set of more than 600000 publications and 17 million references to estimate an optimal level of clustering that preserves core documents essential to the mapping of academic areas. His conclusion reveals only three large clusters of core documents. In a last example of research, Boyack and Klavans show which citation approach represents the intellectual structure of scientific work most accurately [12]. Their compelling comparison between (co-)citation and bibliographic coupling finds that the latter slightly outperforms the former with more coherent clusters to represent the scientific frontier.

The social structure of scientific work

Beginning in the 1980s, bibliometric studies turn to the social dimension of scientific work. Author co-citation [4], author bibliographic-coupling [7], and co-authorship [5] are similar to co-citation and bibliographic coupling in that the main interest of any analysis is still the structure of scientific work. The key difference is that author co-citation, author bibliographic-coupling, and co-authorship all focus on the social structure as opposed to the intellectual structure.

In author co-citation, two authors relate to each other if their works are frequently cited together by other authors. In author bibliographic-coupling, two authors relate to each other if their works frequently cite the same references. In addition, the concept of co-authorship allows for the study of collaborative relationships between authors in publications. Tracing these social structures provides insights into research communities and collaborations within and across scientific disciplines.

Author co-citation, author bibliographic-coupling, and co-authorship define authors as the vertices in a network. The edges in author co-citation connect two authors (1 and 2) who are jointly cited by one or more publications (X). They may be weighted by the number of publications which jointly cite the two. Conversely, the edges in author bibliographic-coupling connect two authors (1 and 2) who jointly cite one or more publications (X). The edges may be weighted by the number of publications two authors jointly cite. Finally, co-authorship connects two authors (1 and 2) who collaborate on one or more publications (X). The edges may be weighted by the number of publications two authors have in common. Clusters of authors stand in for schools of thought. Sometimes they are further grouped by affiliation or place to see which university or country is pushing the scientific frontier. Instead of a focus on the themes and topics of clusters of publications, the center of attention shifts to clusters of scientific collaboration among authors.

Similar to co-citation and bibliographic coupling, bibliometric studies of the social structure of scientific work span across various academic disciplines. For example, White and McCain study the social structure of information science [13]. They submit the top 120 authors most frequently cited in twelve key journals from 1972 through 1995 to author co-citation analysis. Their findings yield automatic classifications relevant to the history of the field including the most canonical authors. In a combination of co-authorship and bibliographic coupling, Biscaro and Giupponi examine citation counts of academic articles [14]. Their study based on 5585 publications from a variety of academic disciplines offers a number of findings, among which are: authors who collaborate with more authors tend to get more citations, and articles that use references from different strands of the literature tend to get more citations. As a last example of research, Schubert and Glänzel take a look at country-by-country co-authorship to find that location, culture, and language determine clusters of mutually strong preferences in geopolitical areas such as Central Europe, Scandinavia, or the Far East [15]. The United States, unsurprisingly, enjoys universal co-authorship preference.

More comprehensive reviews of the theory and practice of bibliometric studies are found in Borgman and Furner [16], Mingers and Leydesdorff [17], and Donthu et al. [18].

Combining the intellectual and social structure of scientific work

Publication authorship takes inspiration from the above discussed approaches to the analysis of scientific work. On the one hand, it defines publications as the vertices of a network similar to co-citation and bibliographic coupling. On the other hand, it takes authors as the basis of a definition of edges as authorship similar to co-authorship. The edges in publication authorship then connect two publications (A and B) which are authored by one or more individuals (0). They may be weighted by the number of authors two publications have in common. Publication authorship keeps the focus on the themes and topics of publications to describe the segmentation of research and the development of the scientific frontier. At the same time, it accounts for the social dimension of scientific work with clusters of publications emerging from the collaboration among authors.
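As a minimal sketch of this definition, the following Python snippet connects publications that share authors and weights each edge by the number of authors in common. The `authors` mapping and all names in it are hypothetical; the paper's own implementation is in R and may differ.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy data: each publication maps to its set of authors.
authors = {
    "A": {"Author 1", "Author 2"},
    "B": {"Author 2", "Author 3"},
    "C": {"Author 1", "Author 2"},
    "D": {"Author 4"},  # single-authored and unconnected, but still a vertex
}

def publication_authorship(authors):
    """Weight each pair of publications by the number of authors they share."""
    publications_by_author = defaultdict(set)
    for publication, names in authors.items():
        for name in names:
            publications_by_author[name].add(publication)
    weights = Counter()
    for publications in publications_by_author.values():
        for a, b in combinations(sorted(publications), 2):
            weights[(a, b)] += 1
    return weights

edges = publication_authorship(authors)
```

Publications A and C share two authors and therefore receive an edge of weight 2, whereas D remains an isolated vertex because its sole author wrote nothing else in the toy data.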

Publication authorship may appear as simply another combination of vertices and edges that fills a void in the roster of approaches to the analysis of scientific work. However, it firmly rests on the theoretical argument of a communicative constitution of social systems [19]. The theory suggests that any form of documentation or record (e. g., academic publications, corporate reports, meeting minutes) is a condensate of the participation of individuals in communication [20]. In turn, individuals who participate in communication are common sources of information that connect communication events and episodes [21]. Publication authorship follows exactly this line of argument. Scholars participate in academic discourse by authoring publications which, in turn, cluster to reflect the segmentation of research and the development of the scientific frontier [22].

Common to all the approaches in bibliometric studies is the idea that the relations among publications or authors present similarities in the underlying scientific work, which allows for the analysis of clusters of tightly coupled and central vertices (i. e., schools of thought and the scientific frontier). In particular, publication authorship assumes that two publications are similar to the extent that one or more scholars (co-)author them. Since authors frequently specialize in a narrow field of research (e. g., behavioral economics or adolescent oncology), their publications are likely to present a narrow field of research, too (e. g., a behavioral economist is unlikely to work on transaction-costs issues and an adolescent oncologist rarely contributes to research on childhood obesity). Publication authorship is therefore more exclusive than co-citation and bibliographic coupling because the number of authors who collaborate on two publications is almost always smaller than the number of joint citations or common references. (None of the 27444 publications used in the empirical analysis of this paper had more authors than joint citations or common references.) At the same time, it is more inclusive than author co-citation and co-authorship because it includes both single-authored and co-authored publications.

Co-citation, bibliographic coupling, author co-citation, author bibliographic-coupling, co-authorship, and publication authorship all yield unique insights into scientific work. In the light of the similarities and differences among these and other approaches in bibliometric studies [23], publication authorship is closest to bibliographic coupling, not least because it defines vertices as publications and, therefore, focuses on the themes and topics of these. The following empirical illustrations pit publication authorship against bibliographic coupling to highlight differences in the segmentation of research and a consequently more detailed scientific frontier.

Data

Three data sets of academic articles in accounting, astronomy, and gastroenterology provide the empirical basis for the illustrations of publication authorship. The choice of academic disciplines is motivated by the idea to pick examples that are independent of each other, which is a safe assumption for scientific work in management, physics, and medicine. Indeed, there are no cross-references among the three data sets and each one exhibits its own unique features such as, for example, a smaller average number of authors for accounting than in astronomy or gastroenterology, a larger dispersion of the number of authors in astronomy than in gastroenterology, and a larger average number of references in accounting than in the other two disciplines (Table 1). These and other idiosyncrasies of each discipline reflect in the below analysis, of course. A smaller average number of authors on publications in accounting immediately translates to a lower density in respective bibliometric networks, and so on. The point of the empirical illustrations, however, is to compare bibliographic coupling to publication authorship across different academic disciplines, and not to compare disciplines to each other. Thus, I can safely report that data sets of academic articles in marketing, political science, and cancer research yield similar illustrations.

The data sets are compiled and downloaded from Elsevier’s abstract and citation database Scopus. They comprise academic articles published in the ten years between 2010 and 2019 in one of the top-10 journals for accounting, astronomy, and gastroenterology (see Table 17 in the S1 Appendix for an overview of journals). The journals are ranked according to their respective CiteScore in 2019. The data sets may be replicated following a step-by-step research protocol available on GitHub [24]. The R source code for the following illustrations of publication authorship can be found in the same location. Altogether, there are 5333 articles in accounting, 10817 articles in astronomy, and 11293 articles in gastroenterology.

As usual with publication data, the data sets require considerable cleaning before further analysis. This involves the removal of double entries (e. g., pre-prints), non-article publications (e. g., editorials, notes, letters, book reviews, errata), articles without an abstract or without references, and articles with anonymous authors. For later text mining, abstracts are stripped of punctuation, stop words, and numbers, multiple white-space characters are collapsed into one, and copyright notices are removed.
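The text-mining preparation of abstracts might look roughly like the sketch below. It is not the paper's R code: the stop-word list is a tiny hypothetical stand-in, and the copyright pattern assumes that notices trail the abstract and start with "©", "(c)", or "copyright", which will not hold for every record.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "this", "to", "we"}

def clean_abstract(text):
    """Remove copyright notice, punctuation, numbers, and stop words; collapse whitespace."""
    # Assumption: the copyright notice runs from its marker to the end of the abstract.
    # It is removed first because stripping punctuation would destroy the marker.
    text = re.sub(r"(©|\(c\)|copyright).*$", " ", text, flags=re.IGNORECASE)
    text = text.lower()
    text = re.sub(r"[^\w\s]|_", " ", text)  # punctuation
    text = re.sub(r"\d+", " ", text)        # numbers
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)                 # single spaces between the remaining terms

cleaned = clean_abstract("We study 120 banks, and the liquidity of stock.  © 2019 Elsevier B.V.")
```

Here `cleaned` holds only the content-bearing terms of the example sentence, separated by single spaces.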

I compute networks for both bibliographic coupling and publication authorship in accounting, astronomy, and gastroenterology. Vertices represent academic articles. They are connected by edges either because they share one or more references in case of bibliographic coupling or because they have one or more authors in common in case of publication authorship. Therefore, the number of vertices is the same for both types of networks while the number of edges differs from one to the other (cf. Table 2). The difference in the number of edges among the networks already highlights the idiosyncrasies of each academic discipline. For example, the high average number of references in accounting leads to a more than 20 times higher edge count in bibliographic coupling than the low average number of authors in publication authorship. Conversely, the low average number of references in gastroenterology puts the number of edges for bibliographic coupling and publication authorship almost on par. Derivative measures such as network density (i. e., the ratio of the number of edges to the number of possible edges) differ accordingly.
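Network density as defined above reduces to a one-line ratio. The sketch assumes an undirected network without self-loops or duplicate edges; the numbers in the example are hypothetical and not taken from Table 2.

```python
def density(n_vertices, n_edges):
    """Ratio of existing edges to possible edges in an undirected simple network."""
    possible = n_vertices * (n_vertices - 1) / 2
    return n_edges / possible if possible else 0.0

# Hypothetical toy network: 4 articles connected by 3 edges (6 edges are possible).
toy_density = density(4, 3)
```

Because the number of vertices is the same for both network types, any difference in density between bibliographic coupling and publication authorship stems from the edge counts alone.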

Interestingly, network-level measures such as transitivity and assortativity do not follow the decreasing differences in the number of edges and network density from bibliographic coupling to publication authorship. Transitivity quantifies the probability that the adjacent vertices of a vertex are connected. In other words, it points out the probability that three articles form a triangle because they share either common references or common authors. Transitivity reveals that the segmentation of research in accounting is less dense in bibliographic coupling than in publication authorship, finds the inverse case to be true in astronomy, and shows a similar coefficient in gastroenterology. Assortativity quantifies the probability that a vertex connects to other vertices that are similar in one way or another. I use the degree of a vertex (i. e., the number of connections a vertex has to other vertices) to quantify the probability that an article with many common references or authors connects to other articles with many common references or authors. Assortativity shows an increase from bibliographic coupling to publication authorship in accounting and a decrease in astronomy and gastroenterology. These differences first and foremost highlight that academic areas are idiosyncratic in the way they conduct research. A low number of large research segments is most often associated with looser connections among articles, whereas a high number of small research segments commonly calls for denser connections among articles.
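Both measures can be computed directly from an edge list. The sketch below uses the common definitions (closed over connected triplets for transitivity, Pearson correlation of degrees across edge ends for assortativity); whether the paper's R code uses exactly these variants is an assumption, and the toy edge list is hypothetical.

```python
from itertools import combinations

def neighbors(edges):
    """Build an adjacency mapping from an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def transitivity(edges):
    """Share of connected triplets (two neighbors of a vertex) that are closed."""
    adj = neighbors(edges)
    closed = triplets = 0
    for nbrs in adj.values():
        for a, b in combinations(nbrs, 2):
            triplets += 1
            if b in adj[a]:
                closed += 1
    return closed / triplets if triplets else 0.0

def degree_assortativity(edges):
    """Pearson correlation between the degrees at both ends of every edge."""
    adj = neighbors(edges)
    xs, ys = [], []
    for u, v in edges:  # count each undirected edge in both directions
        xs += [len(adj[u]), len(adj[v])]
        ys += [len(adj[v]), len(adj[u])]
    mean = sum(xs) / len(xs)
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys))
    var = sum((x - mean) ** 2 for x in xs)
    return cov / var if var else float("nan")

# Hypothetical toy network: a triangle A-B-C with a pendant article D attached to C.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
toy_transitivity = transitivity(toy_edges)
toy_assortativity = degree_assortativity(toy_edges)
```

In the toy network, three of five connected triplets are closed, and the negative assortativity reflects that the best-connected article C links to the least-connected article D.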

Results

With a description of the data in place, I further investigate the differences between bibliographic coupling and publication authorship. I first compute clusters of articles, then estimate their goodness-of-fit to the data using a measure of cosine similarity, and finally discuss the segmentation of research and the development of the respective scientific frontier. These steps follow common practice in bibliometric studies (e. g., [25, 26]).

Clustering

Transitivity and assortativity offer bird’s-eye views of the clustering of networks. In order to compute clusters of vertices for bibliographic coupling and publication authorship across all three academic areas, I use a fast-greedy algorithm [27] widely employed in network analysis. The algorithm takes edge weights as an indicator of the strength of bibliographic coupling or publication authorship. I use the cosine similarity between a set of references or authors from publication A and a set of references or authors from publication B:

\[ w_{AB} = \frac{|A \cap B|}{\sqrt{|A| \cdot |B|}} \tag{1} \]

The weight of the respective edge between two vertices is therefore the ratio of the number of references or authors the two publications A and B have in common, normalized by the square root of the product of the number of references or authors from the two publications A and B.
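For sets, the edge weight described here is straightforward to compute. The following sketch follows the verbal definition (shared items divided by the square root of the product of the set sizes); the function name and the reference labels are hypothetical.

```python
from math import sqrt

def cosine_weight(items_a, items_b):
    """Cosine similarity between two sets of references or authors."""
    a, b = set(items_a), set(items_b)
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

# Hypothetical example: two articles with four references each, two of them shared.
weight = cosine_weight({"r1", "r2", "r3", "r4"}, {"r3", "r4", "r5", "r6"})
```

The normalization keeps the weight between 0 and 1, so that an article with a very long reference list does not dominate the clustering merely through its length.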

Clusters delimit subsets of articles that share similar theoretical insight or empirical evidence based on common references or common authors. They may be thought of as schools of thought or theoretical paradigms. Consider, for example, bibliographic coupling in accounting. Seven clusters describe the majority of research in the ten years from 2010 to 2019. Four of them share common topics such as banks, information, investors, liquidity, and stock. In contrast, cluster 4 leans towards references to entrepreneurship, innovation, and knowledge. To some extent, these topics adhere to different theoretical paradigms, ranging from economics to law and social science.

Bibliographic coupling is infused with a number of troubles that publication authorship hopes to remedy. Among these troubles is the misconception that common references provide a unanimous argument [28, 29]. While it is true that a majority of articles cites references to back up an argument, the same references may well be used to undermine it. Bibliographic coupling is therefore ill equipped to account for the quality of the argument by weighting common references.

Publication authorship addresses this shortcoming based on the notion that authors themselves stand in for a school of thought. Authors are more likely to work together because they complement each other in their theoretical ideas, methodological approaches, or empirical interests. Conversely, scholars of opposing schools of thought are unlikely to publish together. There are famous and rare exceptions to this, of course. For example, the academic debate between Habermas and Luhmann eventually led to a joint book publication that carefully elaborated on the commonalities and differences between Habermas’ theory of communicative action and Luhmann’s social systems theory [30]. However, most debates take place as an exchange of arguments in the form of alternating publications or lectures between scholars (e. g., Bohr and Einstein on quantum theory or Hawking and Penrose on time-reversal invariance). Bibliographic coupling draws these academic debates together because the respective articles share common references, whereas publication authorship separates the fields of research based on the authors’ opposing schools of thought (i. e., disjoint authorship).

The number of clusters from bibliographic coupling to publication authorship jumps from seven to 278 clusters in accounting, still shows a steep increase from 26 to 138 clusters in astronomy, but slightly decreases from 70 to 62 clusters in gastroenterology (Table 2). In general, bibliographic coupling yields larger clusters that are more inclusive of opposing research, whereas publication authorship produces a more fine-grained picture of schools of thought, theoretical arguments, or fields of interest. Fig 2 shows the distribution of clusters for bibliographic coupling and publication authorship in accounting, astronomy, and gastroenterology. Opposite to the number of articles in each cluster (gray bars) stands the cumulative percentage of cluster sizes (solid black line) and the 80-percent cut-off (dashed black line). While this cut-off is arbitrary, it puts the focus on a limited number of clusters to tell a story about the segmentation of research and the development of the scientific frontier.

Goodness-of-fit

Next, I look for evidence of how well clusters fit the bibliometric data. Given that two articles are assumed to be similar in their content based on common references or authors, I compute an additional similarity measure based on article abstracts. Following the above formula 1 for the cosine similarity between two attribute vectors of either references or authors, I compute the cosine similarity (i. e., edge weights) between attribute vectors of abstract terms of two articles (i. e., vertices). I then use the mean intra-cluster cosine similarity to compare the goodness-of-fit of clusters for bibliographic coupling and publication authorship in accounting, astronomy, and gastroenterology.
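A minimal sketch of this goodness-of-fit measure follows. It assumes that the attribute vectors hold raw term counts of cleaned abstracts and that the mean is taken over all pairs of articles within a cluster; the source does not spell out either detail, and the toy abstracts are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def term_cosine(text_a, text_b):
    """Cosine similarity between the term-count vectors of two abstracts."""
    va, vb = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def mean_intra_cluster_similarity(abstracts, membership):
    """Average pairwise abstract similarity within each cluster of two or more articles."""
    clusters = {}
    for article, cluster in membership.items():
        clusters.setdefault(cluster, []).append(article)
    means = {}
    for cluster, articles in clusters.items():
        sims = [term_cosine(abstracts[a], abstracts[b]) for a, b in combinations(articles, 2)]
        if sims:
            means[cluster] = sum(sims) / len(sims)
    return means

# Hypothetical cleaned abstracts and cluster memberships.
abstracts = {
    "A": "banks liquidity stock",
    "B": "banks liquidity investors",
    "C": "galaxy cluster mass",
    "D": "galaxy halo mass",
}
membership = {"A": 1, "B": 1, "C": 2, "D": 2}
intra_means = mean_intra_cluster_similarity(abstracts, membership)
```

Because the abstracts play no role in forming the clusters, their similarity serves as an independent check on how well a clustering captures the content of the articles.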

Fig 3 shows boxplots for the mean intra-cluster cosine similarities for bibliographic coupling and publication authorship in all three academic areas. In addition, I run a Mann-Whitney U test on the one-tailed alternative hypothesis that the means in publication authorship are greater than the means in bibliographic coupling. This alternative is true for all three academic areas. Fig 3 additionally shows the corresponding non-parametric measure p, which can take on values between 0 and 1. The extreme values represent entirely separate distributions of means, whereas a p-value of 0.5 indicates a complete overlap. Accounting shows a difference in mean intra-cluster cosine similarities from bibliographic coupling to publication authorship at a p-value of 0.77. Although not as large a difference, publication authorship in astronomy also yields higher means at a p-value of 0.64. Finally, gastroenterology shows a difference between bibliographic coupling and publication authorship at a p-value of 0.75 despite a decrease in the number of clusters from one to the other. The results clearly show that the goodness-of-fit of clusters in publication authorship to the content of the articles in question is better than in bibliographic coupling.
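A measure that ranges from 0 to 1 and equals 0.5 under complete overlap behaves like the Mann-Whitney U statistic divided by the product of the two sample sizes. The sketch below computes that quantity; reading the reported measure p this way is an assumption on my part, not something the text confirms, and the sample values are hypothetical.

```python
def probability_of_superiority(xs, ys):
    """U statistic of xs over ys, divided by the number of pairs (ties count one half)."""
    u = sum((x > y) + 0.5 * (x == y) for x in xs for y in ys)
    return u / (len(xs) * len(ys))

# Hypothetical mean intra-cluster cosine similarities for two clusterings.
authorship_means = [0.30, 0.40, 0.50]
coupling_means = [0.10, 0.20, 0.35]
p_hat = probability_of_superiority(authorship_means, coupling_means)
```

Under this reading, a value of 0.77 would mean that a randomly drawn cluster from publication authorship has a higher mean similarity than a randomly drawn cluster from bibliographic coupling in roughly 77 percent of all pairings.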

Research segmentation

I already established that bibliographic coupling is broader in the segmentation of research than publication authorship. The question now is, what additional insights does a more detailed picture yield? Again, I draw on networks to provide an answer for the segmentation of research and the development of the scientific frontier in accounting, astronomy, and gastroenterology. The large numbers of articles and the bibliographic coupling or publication authorship to connect them are prohibitive for any practical visualization. Therefore, I first collapse articles into clusters I already obtained with the help of the above presented algorithm. I then collapse bibliographic coupling or publication authorship between articles into respective relations between clusters and take the mean inter-cluster cosine similarity to weight these relations. Finally, I remove isolate clusters to focus the attention on the central component of each research field. Fig 4 shows six networks for bibliographic coupling and publication authorship in accounting, astronomy, and gastroenterology.
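The collapsing step can be sketched as follows. The snippet assumes that the inter-cluster weight is the mean abstract similarity over those article pairs that are actually connected across two clusters; the source does not state whether unconnected pairs enter the mean. All identifiers and values are hypothetical.

```python
from collections import defaultdict

def collapse(edges, membership, similarity):
    """Collapse article-level edges into cluster-level edges.

    edges: iterable of (a, b) article pairs
    membership: article -> cluster label
    similarity: (a, b) -> abstract cosine similarity of the pair
    """
    values = defaultdict(list)
    for a, b in edges:
        ca, cb = membership[a], membership[b]
        if ca != cb:  # edges within a cluster disappear when the cluster becomes one vertex
            values[tuple(sorted((ca, cb)))].append(similarity[(a, b)])
    cluster_edges = {pair: sum(v) / len(v) for pair, v in values.items()}
    connected = {cluster for pair in cluster_edges for cluster in pair}  # isolates drop out
    return cluster_edges, connected

# Hypothetical toy data: clusters 1 and 2 are linked, cluster 3 is an isolate.
toy_edges = [("a1", "b1"), ("a2", "b1"), ("a1", "a2"), ("c1", "c2")]
toy_membership = {"a1": 1, "a2": 1, "b1": 2, "c1": 3, "c2": 3}
toy_similarity = {("a1", "b1"): 0.2, ("a2", "b1"): 0.4, ("a1", "a2"): 0.9, ("c1", "c2"): 0.8}
cluster_edges, connected = collapse(toy_edges, toy_membership, toy_similarity)
```

Cluster 3 has no edge to any other cluster and is therefore absent from `connected`, which mirrors the removal of isolated clusters before visualization.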

The size of the vertices indicates the (normalized) number of articles in each cluster, ranging from a minimum of two articles up to the biggest cluster with 2966 articles for bibliographic coupling in astronomy. The color of the vertices marks the mean age (in years) of articles in a cluster on a gray scale from the youngest cluster in light gray to the oldest cluster in dark gray. In like manner, the color and width of the edges indicate the mean cosine similarity between clusters on a gray scale from the least similar relation in light gray to the most similar relation in dark gray. I use Kamada and Kawai’s layout algorithm [31], which is among the most commonly used algorithms to position vertices and edges.

To describe the segmentation of research, I compute the term frequency-inverse document frequency (tf-idf) for article abstracts within each cluster for bibliographic coupling and publication authorship in accounting, astronomy, and gastroenterology to highlight the most prominent themes and topics. In addition to the visualization of the six networks (Fig 4), I report the number of articles, the mean and standard deviation of their age (in years), as well as the degree, betweenness, and closeness centrality for each cluster. A full glossary of respective technical terminology is found in the S2 Appendix.
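One way to obtain top terms per cluster is sketched below. It assumes that all abstracts of a cluster are concatenated into a single document and uses relative term frequency times the natural log of inverse document frequency; the paper may use a different tf-idf variant, and the toy texts are hypothetical.

```python
from collections import Counter
from math import log

def top_tfidf_terms(cluster_texts, k=10):
    """Rank the terms of each cluster document by tf-idf and keep the top k."""
    counts = {cluster: Counter(text.split()) for cluster, text in cluster_texts.items()}
    n_documents = len(counts)
    document_frequency = Counter()
    for terms in counts.values():
        document_frequency.update(terms.keys())
    top = {}
    for cluster, terms in counts.items():
        total = sum(terms.values())
        scores = {
            term: (freq / total) * log(n_documents / document_frequency[term])
            for term, freq in terms.items()
        }
        # Sort by descending score, breaking ties alphabetically for reproducibility.
        top[cluster] = sorted(scores, key=lambda term: (-scores[term], term))[:k]
    return top

# Hypothetical cluster documents built from cleaned abstracts.
cluster_texts = {
    1: "information banks banks stock",
    2: "information innovation innovation universities",
}
top_terms = top_tfidf_terms(cluster_texts, k=2)
```

A term that occurs in every cluster, such as "information" in the toy data, receives a score of zero and drops out of the top terms, which is why tf-idf highlights what sets a cluster apart.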

Degree is the simplest measure of connectivity. It counts the number of edges a vertex has to other vertices. Betweenness and closeness centrality are frequently used measures in bibliometric studies where they signal interdisciplinarity and multidisciplinarity, respectively [25]. That is to say, the larger the number of shortest paths that go through a vertex (i. e., the more times a cluster sits in between others), the more that cluster may be considered to be interdisciplinary, and the smaller the average length of shortest paths from a vertex to all other vertices is (i. e., the closer a cluster is to others), the more that cluster may be considered to be multidisciplinary.
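The three measures can be illustrated on a small unweighted network. The sketch uses breadth-first search with Brandes-style accumulation for betweenness and the inverse of the average shortest-path length for closeness; it ignores edge weights and assumes a connected, undirected network, which may differ from the settings behind the reported tables. The toy adjacency is hypothetical.

```python
from collections import deque

def centralities(adj):
    """Degree, betweenness, and closeness for an undirected, unweighted, connected network."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    closeness = {}
    betweenness = dict.fromkeys(adj, 0.0)
    for s in adj:
        # Breadth-first search from s: distances, shortest-path counts, predecessors.
        dist = {s: 0}
        sigma = dict.fromkeys(adj, 0.0)
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        total = sum(dist.values())
        closeness[s] = (len(dist) - 1) / total if total else 0.0
        # Accumulate path dependencies from the farthest vertices back towards s.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                betweenness[w] += delta[w]
    # Every unordered pair was visited from both ends, so halve the sums.
    betweenness = {v: b / 2 for v, b in betweenness.items()}
    return degree, betweenness, closeness

# Hypothetical toy network of three clusters in a chain: A - B - C.
toy_adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
degree, betweenness, closeness = centralities(toy_adj)
```

Cluster B sits on the only shortest path between A and C, so it has the highest betweenness and, being one step from everything, the highest closeness as well.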

Accounting.

Bibliographic coupling in accounting yields six connected clusters (Table 3). The three largest clusters (1, 2, and 3) alone combine more than 80 percent of all articles and broadly outline distinct research with only one shared tf-idf term (i. e., information; cf. Table 4). Cluster 5 also shares some common terms with the three largest clusters but is considerably smaller and younger, which may indicate a push of the scientific frontier. Cluster 4 sets itself apart with unique tf-idf terms such as research, universities, technology, innovation, and entrepreneurship. Nonetheless, bibliographic coupling paints a rather coarse picture for accounting.

Table 3. Network measures for bibliographic coupling in accounting.

https://doi.org/10.1371/journal.pone.0297005.t003

Table 4. Top-ten tf-idf terms for bibliographic coupling in accounting.

https://doi.org/10.1371/journal.pone.0297005.t004

Publication authorship, in turn, promises more detail by segmenting research into 41 connected clusters. While there is considerable overlap in tf-idf terms among the top-ten clusters (e. g., cluster 2 shares seven terms with cluster 8 and five terms with clusters 3 and 7), some clusters exhibit exclusive terms that delineate unique lines of research (see Tables 5 and 6 for network measures and tf-idf terms). For example, cluster 2 centers on international financial reporting standards (ifrs), cluster 6 looks into high-frequency trading systems (hfts), and cluster 10 brings together venture capital (vc) and initial public offerings (ipo). Each of these three clusters marks a differentiation of research in accounting and thus a push of the scientific frontier.

Table 5. Network measures for publication authorship in accounting (10 largest clusters).

https://doi.org/10.1371/journal.pone.0297005.t005

Table 6. Top-ten tf-idf terms for publication authorship in accounting (10 largest clusters).

https://doi.org/10.1371/journal.pone.0297005.t006

Astronomy.

Bibliographic coupling in astronomy shows a segmentation of research that is largely made up of four clusters (1, 2, 3, and 4). These four clusters are closely connected to each other (Table 7) at the center of the network. They share tf-idf terms that any layperson would guess are descriptive of research in astronomy (e. g., galaxy, mass, star; Table 8). With an average overlap of 7.5 tf-idf terms among them (most notably, clusters 2 and 3 share all top-ten terms, albeit in different order), the four largest clusters are too generic to constitute particular fields of interest in astronomy.

Table 7. Cluster network measures for bibliographic coupling in astronomy (10 largest clusters).

https://doi.org/10.1371/journal.pone.0297005.t007

Table 8. Top-ten tf-idf terms for bibliographic coupling in astronomy.

https://doi.org/10.1371/journal.pone.0297005.t008

Some smaller clusters are more distinctive in their contributions to the research field. For example, cluster 5 exhibits a large body of research on solar flares and cluster 9 features numerous studies on the formation of stars and other stellar objects. In the end, bibliographic coupling makes astronomy appear as if it were a field of research where perhaps only some newer or renewed interests (e. g., the smaller and younger cluster 9 opposite the older and larger cluster 4) are bound to push the scientific frontier.

Publication authorship splits research in astronomy into 56 connected clusters. The ten largest clusters make up almost 80 percent of all publications. This more fine-grained picture is immediately reflected in the 49 unique top-ten tf-idf terms that describe the clusters, whereas bibliographic coupling only shows 34 unique terms (Tables 9 and 10).
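Comparisons of this kind, such as the average overlap of top terms among clusters or the number of unique terms across clusters, reduce to simple set operations. The sketch below reads "unique" as the number of distinct terms across all clusters, which is an assumption; the function names and the toy term lists are hypothetical.

```python
from itertools import combinations

def mean_pairwise_overlap(top_terms):
    """Average number of shared top terms over all pairs of clusters."""
    pairs = list(combinations(top_terms.values(), 2))
    return sum(len(set(a) & set(b)) for a, b in pairs) / len(pairs)

def distinct_terms(top_terms):
    """Set of all terms that appear in the top list of at least one cluster."""
    return set().union(*map(set, top_terms.values()))

# Toy top-term lists, invented for illustration only.
toy_terms = {
    1: ["galaxy", "mass", "star", "redshift"],
    2: ["galaxy", "mass", "star", "halo"],
    3: ["solar", "flare", "star", "wind"],
}
overlap = mean_pairwise_overlap(toy_terms)
n_distinct = len(distinct_terms(toy_terms))
```

A lower mean overlap together with a higher number of distinct terms indicates that the clusters are described by more specific vocabulary.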

Table 9. Cluster network measures for publication authorship in astronomy (10 largest clusters).

https://doi.org/10.1371/journal.pone.0297005.t009

Table 10. Top-ten tf-idf terms for publication authorship in astronomy (10 largest clusters).

https://doi.org/10.1371/journal.pone.0297005.t010

A combination of tf-idf terms such as black, hole, and kev (kiloelectron volts) in cluster 9 then points to the latest research findings based on data from NASA’s Nuclear Spectroscopic Telescope Array. In contrast, bibliographic coupling buries this research mainly in its largest cluster 1. A similar observation can be made for research on the formation of galaxies found in cluster 3. Next to the generic tf-idf terms such as galaxy, mass, and star, the additional term redshift specifically contributes to our understanding of an ever-expanding universe where light from distant stellar objects shifts towards longer wavelengths and, therefore, moves into the red end of the electromagnetic spectrum. Again, bibliographic coupling puts this research in its two largest clusters 1 and 2. Other unique lines of inquiry can be made out, too (e. g., cluster 10 on the role of solar winds in the sun’s heliosheath), but ultimately require the expert interpretation of astronomers.

Gastroenterology.

Bibliographic coupling in gastroenterology presents as a dense network of 26 connected clusters. The ten largest clusters make up a little more than 80 percent of all articles. The periphery is negligible with no more than 21 articles found in the seven smallest clusters (Table 11). Gastroenterology is dominated by Latin terminology and medical abbreviations foreign to laypersons (Table 12). Examples of research foci in gastroenterology include Crohn’s disease (cluster 1), liver cirrhosis (cluster 2), and colorectal cancer (crc; cluster 4).

Table 11. Network measures for bibliographic coupling in gastroenterology.

https://doi.org/10.1371/journal.pone.0297005.t011

Table 12. Top-ten tf-idf terms for bibliographic coupling in gastroenterology.

https://doi.org/10.1371/journal.pone.0297005.t012

Publication authorship in gastroenterology expands the number of connected clusters from 26 to 37 (Table 13). This more detailed picture is best exemplified with cancer research in gastroenterology. Bibliographic coupling groups gastric and colorectal (crc) cancer into clusters 4 and 6. In contrast, publication authorship clearly shows the four most common types of gastrointestinal cancers. First and second, gastric cancer (clusters 1 and 8) and colorectal cancer (cluster 6) are immediately visible as distinct fields of interest. Moreover, colorectal cancer often coincides with inflammatory bowel disease (ibd) and eosinophilic esophagitis (eoe), both of which are large parts of cluster 5. Liver cancer (clusters 1, 3, and 8) and pancreatic cancer (cluster 9) mark the third and fourth most common types of cancer.

Table 13. Network measures for publication authorship in gastroenterology (10 largest clusters).

https://doi.org/10.1371/journal.pone.0297005.t013

The distribution of the most common types of gastrointestinal cancer across clusters finds explanation in additional tf-idf terms that relate to common practice in treatment or diagnosis. For example, cluster 1 highlights endoscopic submucosal dissection (esd) as preferential treatment of gastric or liver cancer in patients. In contrast, cluster 8 puts forward diagnostic research on the expression and risk of early gastric cancer (Table 14).

Table 14. Top-ten tf-idf terms for publication authorship in gastroenterology (10 largest clusters).

https://doi.org/10.1371/journal.pone.0297005.t014

The development of the scientific frontier is not immediately apparent for publication authorship, although the more detailed picture allows even a layperson to make out clear distinctions within research sub-fields such as the focus on treatment of gastric or liver cancer in clusters 1 and 3 as opposed to the diagnosis of these types of cancer in cluster 8. Further interpretations that may shed light on the latest developments in the research field call for the expertise of gastroenterologists.

Discussion

Publication authorship displays a more detailed picture of research than bibliographic coupling. Of course, it is only one methodological approach among many others used in bibliometric studies. Alternatives to bibliometric networks based on (co-)citation or author(ship) include the co-occurrences of words in the title or abstract of articles and topic modeling algorithms such as Latent Dirichlet Allocation (LDA). While bibliometric networks do not immediately compare to these alternative approaches, I discuss some findings from running a cluster analysis of word co-occurrences in abstracts as well as an LDA for the research field of accounting.

Co-word analysis

Co-word analysis [32, 33] looks at the intellectual organization of research based on the co-occurrences of article keywords. Its strength is a simple setup of words as vertices and their co-occurrences as edges, commonly weighted by an equivalency index [34] similar to the above discussed measure of term frequency-inverse document frequency (tf-idf). However, a first problem with co-word analysis is that not all articles in scientific databases come with keywords, mostly because some journals do not require authors to supply keywords with their articles. This problem shows most prominently in accounting and gastroenterology, where approximately one third and more than 40 percent of all articles, respectively, have no associated keywords. It is somewhat less of a concern in astronomy where only five percent of all articles are missing keywords.
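The setup of words as vertices and weighted co-occurrences as edges can be sketched in a few lines. The sketch assumes the equivalence index in the form commonly given in the co-word literature, E_ij = c_ij² / (c_i · c_j), where c_ij counts the articles in which words i and j co-occur and c_i and c_j count the articles containing each word. Whether the analysis discussed here uses exactly this form is an assumption, and the function name `coword_edges` and the toy word lists are hypothetical.

```python
from collections import Counter
from itertools import combinations

def coword_edges(documents):
    """Return {(word_i, word_j): equivalence index} for co-occurring words."""
    # Each word counts at most once per article.
    word_sets = [set(doc) for doc in documents]
    occurrences = Counter(w for ws in word_sets for w in ws)
    # Sorting makes (i, j) and (j, i) map to the same undirected edge.
    cooccurrences = Counter(
        pair for ws in word_sets for pair in combinations(sorted(ws), 2)
    )
    return {
        (i, j): c_ij ** 2 / (occurrences[i] * occurrences[j])
        for (i, j), c_ij in cooccurrences.items()
    }

# Toy word lists per article, invented for illustration only.
toy_docs = [
    ["stocks", "earnings", "analysts"],
    ["stocks", "earnings"],
    ["stocks", "audit"],
]
edges = coword_edges(toy_docs)
```

The index ranges from zero (two words never co-occur) to one (two words always appear together), so frequent words such as stocks in the toy example do not automatically dominate the network.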

In order to have the same baseline number of articles as the above studies in bibliographic coupling and publication authorship, I use words in article abstracts instead of keywords to compute word co-occurrences. I run the same network statistics and cluster analysis in the co-word analysis of accounting, astronomy, and gastroenterology in order to highlight similarities and differences to bibliographic coupling and publication authorship. Most notably, the number of vertices and edges increases dramatically in co-word analysis now that words instead of articles are the starting point (Table 15). Density, transitivity, and assortativity hover around the same values, though they defy any immediate comparison among the disparate networks.

The number of clusters steadily increases from accounting to astronomy to gastroenterology (Fig 5). At first sight, it appears as if co-word analysis provides a segmentation of research somewhat opposite to publication authorship where the number of clusters decreases. However, the distribution of clusters reveals that all three academic areas feature one huge cluster of words that are most common to all articles. Disregarding this pool cluster, we observe a more even distribution of words among clusters. While these word clusters describe research fields in great detail, a second and major drawback of co-word analysis is that words are exclusive to clusters. A fairly common word in accounting such as stocks, for example, necessarily appears in only a single cluster. This calls into question the meaningfulness of clusters in the first place. Bibliographic coupling and publication authorship both provide a first layer of connectivity among articles that in later analysis allows for terms to appear in multiple, overlapping clusters, which is much better suited to describe the segmentation of research and the development of the scientific frontier.

Latent Dirichlet Allocation

An alternative to the exclusive clusters of co-word analysis is Latent Dirichlet Allocation (LDA) [35]. LDA is a generative probabilistic model based on the idea that each document (e. g., an article abstract) in a corpus is a random mix of latent topics, and each topic is in turn characterized by a probability distribution over words. For example, the terms stocks and economy are both likely to make up a topic that describes the impact of stocks on a country’s economy (e. g., Novo Nordisk’s market value has now exceeded the size of the entire Danish economy), whereas they are perhaps less likely to appear in a topic that outlines the connection between the initial public stock offering (IPO) and the economy (e. g., California frequently has a large budget surplus due to income taxes of IPO sales).

While LDA yields topics similar to the clusters of bibliographic coupling and publication authorship, it shares few commonalities with the network analysis of vertices and edges. Its biggest drawback is that the number of topics needs to be fixed a priori, though there are by now several ways to determine the optimal number of topics [36–38]. Another weak spot is that it requires significant computational power. Indeed, computing the optimal number of topics in 25 iterations of LDA in accounting failed due to issues of memory allocation on the ten cores of an Apple Silicon M1 Max with 32 GB RAM. The following computations were instead carried out on 64 Intel Xeon high-performance cores with 364 GB RAM. Running time was around 35 minutes for accounting, 74 minutes for astronomy, and 116 minutes for gastroenterology. In contrast, the entire computations in bibliographic coupling and publication authorship run in less than three minutes on Apple Silicon for all three academic areas combined.

Fig 6 shows normalized values for a number of topics ranging from 10 to 250 in accounting, astronomy, and gastroenterology. Looking for either a minimal [37, 38] or a maximal [36] value, the optimal number of topics falls somewhere between 70 and 140 in accounting, between 100 and 140 in astronomy, and between 100 and 160 in gastroenterology.

Already the number of topics at the lower end of the range for each academic area is larger than the number of connected clusters in bibliographic coupling and publication authorship, which suggests a greater detail of research segmentation. However, LDA offers limited information on the organization of research besides the document-topic probability γ and the topic-word probability β. On the one hand, γ indicates the probability with which a topic represents a document; on the other hand, β indicates the probability with which a word is common to a topic. Taken together, Table 16 shows the top-ten topics in decreasing order of their mean γ alongside the respective top-ten terms in decreasing order of their β scores.
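The ranking of topics by mean γ and of terms by β can be sketched as follows, assuming a fitted model already provides both sets of probabilities. The function name `rank_topics` and all numbers are hypothetical; they only illustrate the ordering and are not taken from the reported results.

```python
def rank_topics(gamma, beta, n_topics=2, n_terms=2):
    """Order topics by mean document-topic probability and list their top terms.

    gamma[d][k] is the probability of topic k in document d;
    beta[k][w] is the probability of word w in topic k.
    """
    n_docs = len(gamma)
    topics = range(len(beta))
    mean_gamma = {k: sum(doc[k] for doc in gamma) / n_docs for k in topics}
    top_topics = sorted(topics, key=lambda k: -mean_gamma[k])[:n_topics]
    return [
        (k, sorted(beta[k], key=lambda w: -beta[k][w])[:n_terms])
        for k in top_topics
    ]

# Toy probabilities for three documents and three topics, invented for illustration.
gamma = [
    [0.7, 0.2, 0.1],
    [0.6, 0.1, 0.3],
    [0.2, 0.1, 0.7],
]
beta = [
    {"audit": 0.5, "quality": 0.3, "fees": 0.2},
    {"tax": 0.6, "policy": 0.3, "rates": 0.1},
    {"analysts": 0.5, "earnings": 0.4, "forecasts": 0.1},
]
ranking = rank_topics(gamma, beta)
```

Note that the ranking carries no information on how topics relate to one another, which is the kind of structural detail that vertices and edges provide in a bibliometric network.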

Table 16. Top-ten topics and terms for Latent Dirichlet Allocation in accounting.

https://doi.org/10.1371/journal.pone.0297005.t016

While some topics in LDA compare favorably to clusters in bibliographic coupling (e. g., topic 63 and cluster 4 on knowledge and innovation) and publication authorship (e. g., topic 54 and cluster 8 on the role of analysts in firm earnings or topic 40 and cluster 9 on the quality of audits), others certainly require their own interpretation (e. g., topic 22 on accounting practices and accountability). Unfortunately, additional information on centrality, size, or age of topics similar to clusters is not readily available in LDA. The segmentation of research by topics in LDA is perhaps similar to the one by clusters, though the development of the scientific frontier is not easy to spot, not least because of the missing information on the organization of research areas.

Conclusion

Bibliometric studies are common practice in all academic disciplines. They assess the history of a research field, point out the state of the art, and identify the development of the scientific frontier. Bibliometric studies are transparent, reproducible, and scalable, making them a cost-effective way of analyzing large volumes of academic articles. In the end, they highlight idiosyncrasies of scientific work that are insightful to both laypersons and experts.

From classic approaches of mapping research publications by (co-)citation and bibliographic coupling to centering on collaboration among scholars by author co-citation, author bibliographic coupling, and co-authorship, the methodology of bibliometric studies has gotten more and more technically refined. Still, there are some limitations. For example, (co-)citation analysis and bibliographic coupling do not capture the reasoning behind citations. Whether articles are cited to make or break an argument is therefore unknown. Publication authorship does away with this limitation by accounting for both the social dimension of authorship and the intellectual dimension of scientific work.

Analyzing the content of academic articles, of course, is the prime domain of natural language processing. The findings of bibliometric studies may thus be further interpreted using measures such as term frequency-inverse document frequency (tf-idf) to highlight scientific concepts that are most descriptive for academic areas. Together with measures on the level of vertices and edges (e. g., degree, betweenness, closeness, size, age) and on the level of the bibliometric network in question (e. g., density, assortativity, transitivity), the segmentation of research becomes not only more interpretable but also comparable across the space and time of scientific work.

Of course, bibliometric studies are far from the only means of inquiry into the segmentation of research and the development of the scientific frontier. Approaches used in natural language processing such as, for example, the analysis of word co-occurrences and Latent Dirichlet Allocation (LDA) are particularly suited to capture the intellectual dimension of scientific work without necessarily inheriting the limitations of (co-)citation analysis and other bibliometric approaches. However, they are computationally costly to begin with and their findings are often harder to interpret without the backdrop of additional measures from the realm of bibliometric studies.

The key differences between publication authorship and approaches in natural language processing such as LDA are what make bibliometric studies attractive in the first place. Publication authorship is transparent in both its definition of what vertices and edges are and its analysis of the respective bibliometric networks. It is easily reproducible not only across the space of multiple disciplines but also across the time of a single discipline, which allows for a comparison of different academic areas and an interpretation of the development of the scientific frontier. Last, publication authorship scales well from small fields of research to large volumes of academic articles. In contrast, LDA as a generative probabilistic model is somewhat opaque, not least because it requires the specification of the number of clusters and a number of training parameters to begin with. Its findings are also more difficult to interpret without additional measures derived from the structure of scientific work. And it is computationally intense, which makes it a costly alternative to bibliometric studies.

Consider that publication authorship clearly identifies themes and topics in accounting despite the lower number of clusters. For example, one cluster shows a large but rather peripheral body of work on international financial reporting standards, whereas another cluster that comprises a slightly smaller number of academic articles on high-frequency trading systems sits in the center of adjacent work in accounting. LDA is more generic, despite the fact that its higher number of clusters suggests more detail. For example, it shows a cluster about banking and credit, a cluster about innovation and research, and a cluster about trading and insider information. None of these clusters is immediately identifiable as larger or smaller, central or peripheral, older or younger.

Admittedly, the latest developments in artificial intelligence promise to remedy some of these shortcomings in natural language processing (e. g., ChatGPT-4 suggests that Habermas and Luhmann are intellectual rivals despite the fact that they published together; at the same time, it cannot correctly identify the DOIs of either author's works). Unfortunately, artificial intelligence with hundreds of billions of parameters or more operates largely as a black box. Perhaps there is still room for bibliometric studies carefully rooted in theory then.

Publication authorship, I argue, offers a more fine-grained picture of academic research that provides explanatory power beyond simple refinement. The illustrations of bibliographic coupling versus publication authorship in accounting, astronomy, and gastroenterology ultimately confirm significant benefits to bibliometric studies of scientific work.

Moreover, the idea to connect publications by authorship immediately extends to, for example, organization studies. Following the now popular notion that communication constitutes organization [39–41], we may conceive of corporate documents such as meeting minutes, project reports, or product presentations as communication episodes [21]. The authorship of these episodes, in turn, provides the proverbial glue among the said documents. Documents and authorship are therefore conceived as the vertices and the edges that map out an organization as a network of communication episodes. A respective cluster analysis commonly shows the functions of an organization (e. g., accounting, engineering, marketing) similar to the sub-fields of an academic discipline [21]. Indeed, an academic discipline may well be thought of as an organization of the scientific work conducted within the disciplinary boundaries. My hope then is that publication authorship provides another useful approach in the toolbox of bibliometric studies and beyond.

Supporting information

S1 Appendix. Top-10 journals in accounting, astronomy, and gastroenterology.

https://doi.org/10.1371/journal.pone.0297005.s001

(PDF)

References

  1. Kessler MM. Bibliographic Coupling Between Scientific Papers. American Documentation. 1963;14(1):10–25.
  2. de Solla Price DJ. Networks of Scientific Papers. Science. 1965;149(3683):510–515.
  3. Small H. Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science. 1973;24(4):265–269.
  4. White HD, Griffith BC. Author Cocitation: A Literature Measure of Intellectual Structure. Journal of the American Society for Information Science. 1981;32(3):163–171.
  5. Newman MEJ. The Structure of Scientific Collaboration Networks. Proceedings of the National Academy of Sciences. 2001;98(2):404–409. pmid:11149952
  6. Gipp B, Beel J. Citation Proximity Analysis (CPA)—A New Approach for Identifying Related Work Based on Co-citation Analysis. In: Proceedings of the 12th International Conference on Scientometrics and Informetrics. vol. 2. Rio de Janeiro; 2009. p. 571–575.
  7. Zhao D, Strotmann A. Evolution of Research Activities and Intellectual Influences in Information Science 1996–2005: Introducing Author Bibliographic-coupling Analysis. Journal of the American Society for Information Science and Technology. 2008;59(13):2070–2086.
  8. Leydesdorff L. The Development of Frames of References. Scientometrics. 1986;9(3–4):103–125.
  9. Small H, Boyack KW, Klavans R. Identifying Emerging Topics in Science and Technology. Research Policy. 2014;43(8):1450–1467.
  10. Jarneving B. Bibliographic Coupling and Its Application to Research-front and Other Core Documents. Journal of Informetrics. 2007;1(4):287–307.
  11. Braam RR, Moed HF, van Raan AFJ. Mapping Science by Combined Co-citation and Word Analysis. I. Structural Aspects. Journal of the American Society for Information Science. 1991;42(4):233–251.
  12. Boyack KW, Klavans R. Co-Citation Analysis, Bibliographic Coupling, and Direct Citation: Which Citation Approach Represents the Research Front Most Accurately? Journal of the American Society for Information Science and Technology. 2010;61(12):2389–2404.
  13. White H, McCain KW. Visualizing a Discipline: An Author Co-citation Analysis of Information Science, 1972–1995. Journal of the American Society for Information Science. 1998;49(4):327–355.
  14. Biscaro C, Giupponi C. Co-Authorship and Bibliographic Coupling Network Effects on Citations. PLOS ONE. 2014;9(6):e99502. pmid:24911416
  15. Schubert A, Glänzel W. Cross-national Preference in Co-authorship, References and Citations. Scientometrics. 2006;69(2):409–428.
  16. Borgman CL, Furner J. Scholarly Communication and Bibliometrics. Annual Review of Information Science and Technology. 2002;36(1):2–72.
  17. Mingers J, Leydesdorff L. A Review of Theory and Practice in Scientometrics. European Journal of Operational Research. 2015;246(1):1–19.
  18. Donthu N, Kumar S, Mukherjee D, Pandey N, Lim WM. How to Conduct a Bibliometric Analysis: An Overview and Guidelines. Journal of Business Research. 2021;133(9):285–296.
  19. Luhmann N. Social Systems. Stanford, CA: Stanford University Press; 1995.
  20. Luhmann N. How Can the Mind Participate in Communication? In: Rasch W, editor. Theories of Distinction: Redescribing the Descriptions of Modernity. Stanford, CA: Stanford University Press; 2002. p. 169–184.
  21. Blaschke S, Schoeneborn D, Seidl D. Organizations as Networks of Communication Episodes: Turning the Network Perspective Inside Out. Organization Studies. 2012;33(7):879–906.
  22. Wenzel M, Will MG. The Communicative Constitution of Academic Fields in the Digital Age: The Case of CSR. Technological Forecasting & Social Change. 2019;146:517–533.
  23. Yan E, Ding Y. Scholarly Network Similarities: How Bibliographic Coupling Networks, Citation Networks, Cocitation Networks, Topical Networks, Coauthorship Networks, and Coword Networks Relate to Each Other. Journal of the American Society for Information Science and Technology. 2012;63(7):1313–1326.
  24. Blaschke S. Publication Authorship. https://doi.org/10.5281/zenodo.10319422
  25. Leydesdorff L. Betweenness Centrality as an Indicator of the Interdisciplinarity of Scientific Journals. Journal of the American Society for Information Science and Technology. 2007;58(9):1303–1319.
  26. Jarneving B. A Comparison of Two Bibliometric Methods for Mapping of the Research Front. Scientometrics. 2005;65(2):245–263.
  27. Clauset A, Newman MEJ, Moore C. Finding Community Structure in Very Large Networks. Physical Review E. 2004;70(6):1–6. pmid:15697438
  28. Weinberg BH. Bibliographic Coupling: A Review. Information Storage and Retrieval. 1974;10(5–6):189–196.
  29. Kochtanek TR. Bibliographic Compilation Using Reference and Citation Links. Information Processing and Management. 1982;18(1):33–39.
  30. Habermas J, Luhmann N. Theorie der Gesellschaft oder Sozialtechnologie: Was leistet die Systemforschung? Frankfurt am Main: Suhrkamp; 1971.
  31. Kamada T, Kawai S. An Algorithm for Drawing General Undirected Graphs. Information Processing Letters. 1989;31(1):7–15.
  32. Callon M, Courtial JP, Turner WA, Bauin S. From Translations to Problematic Networks: An Introduction to Co-word Analysis. Social Science Information. 1983;22(2):191–235.
  33. Leydesdorff L. Words and Co-words as Indicators of Intellectual Organization. Research Policy. 1989;18(4):209–223.
  34. Callon M, Courtial JP, Laville F. Co-word Analysis as a Tool for Describing the Network of Interactions Between Basic and Technological Research: The Case of Polymer Chemistry. Scientometrics. 1991;22(1):155–205.
  35. Blei DM, Ng AY, Jordan MI. Latent Dirichlet Allocation. Journal of Machine Learning Research. 2003;3(1):993–1022.
  36. Griffiths TL, Steyvers M. Finding Scientific Topics. Proceedings of the National Academy of Sciences. 2004;101(1):5228–5235. pmid:14872004
  37. Arun R, Suresh V, Veni Madhavan CE, Narasimha Murthy MN. On Finding the Natural Number of Topics with Latent Dirichlet Allocation: Some Observations. In: Zaki MJ, Xu Yu J, Ravindran B, Pudi V, editors. Advances in Knowledge Discovery and Data Mining. Berlin: Springer; 2010. p. 391–402.
  38. Cao J, Tian X, Jintao L, Yongdong Z, Sheng T. A Density-based Method for Adaptive LDA Model Selection. Neurocomputing. 2009;72(7–9):1775–1781.
  39. Ashcraft KL, Kuhn TR, Cooren F. Constitutional Amendments: “Materializing” Organizational Communication. Academy of Management Annals. 2009;3(1):1–64.
  40. Brummans BHJM, Cooren F, Robichaud D, Taylor JR. Approaches to the Communicative Constitution of Organizations. In: Putnam LL, Mumby DK, editors. The SAGE Handbook of Organizational Communication. 3rd ed. Thousand Oaks, CA: SAGE; 2014. p. 173–194.
  41. Schoeneborn D, Blaschke S, Cooren F, McPhee RD, Seidl D, Taylor JR. The Three Schools of CCO Thinking. Management Communication Quarterly. 2014;28(2):285–316.