The authors have declared that no competing interests exist.
Conceived and designed the experiments: DH JK XS FM. Performed the experiments: JK DH XS LP MJ SP. Analyzed the data: JK DH XS FM. Contributed reagents/materials/analysis tools: JK DH XS LP MJ SP FM. Wrote the paper: JK XS FM.
The use of quantitative metrics to gauge the impact of scholarly publications, authors, and disciplines is predicated on the availability of reliable usage and annotation data. Citation and download counts are widely available from digital libraries. However, current annotation systems rely on proprietary labels, refer to journals but not articles or authors, and are manually curated. To address these limitations, we propose a social framework based on crowdsourced annotations of scholars, designed to keep up with the rapidly evolving disciplinary and interdisciplinary landscape. We describe a system called Scholarometer, which provides a service to scholars by computing citation-based impact measures. This creates an incentive for users to provide disciplinary annotations of authors, which in turn can be used to compute disciplinary metrics. We first present the system architecture and several heuristics to deal with noisy bibliographic and annotation data. We report on data sharing and interactive visualization services enabled by Scholarometer. Usage statistics, illustrating the data collected and shared through the framework, suggest that the proposed crowdsourcing approach can be successful. Second, we illustrate how the disciplinary bibliometric indicators elicited by Scholarometer allow us to implement for the first time a universal impact measure proposed in the literature. Our evaluation suggests that this metric provides an effective means for comparing scholarly impact across disciplinary boundaries.
Many disciplinary communities have sought to address the need to organize, categorize, and retrieve the articles that populate their respective online libraries and repositories. Unfortunately, the great promise of such mechanisms is hindered by the fact that disciplinary categories, as an organizing principle, do not accommodate the trend toward interdisciplinary scholarship and the continual emergence of new disciplines. An initial step towards a solution comes in the form of journal indices, such as those supported by Thomson Reuters as part of its Journal Citation Reports (JCR) and Web of Science (WoS) commercial products. Systems like the Web of Science, and similar discipline classifications such as MeSH for life sciences, PACS for physics, and ACM CCS for computing, are based on a
The “Web Science” paradigm suggests an alternative approach. Rather than attempting to match new scientific production to predefined categories, it would be useful to facilitate semantic evolution by empowering scholars to annotate each other's work. This
Disciplinary boundaries create similar hurdles for measuring scholarly impact, although these hurdles stem more from differing publication standards and practices. For example, the fields of history and physics have very different publishing patterns and standards of collaboration. A historian may work for years to publish a single work, while an experimental physicist may co-author numerous articles during the same period. How do we compare scholars across fields?
Radicchi
What we envisage is crowdsourcing the knowledge of community members in a scenario similar to those explored in citizen science
Scholarometer is a social tool for scholarly services developed at Indiana University, with the dual aim of exploring the crowdsourcing approach for disciplinary annotations and cross-disciplinary impact metrics
The goal of this paper is to detail the design and implementation of the Scholarometer tool. We present visualization and data exchange services that are fueled by the data crowdsourced through Scholarometer. We also outline the computation of both disciplinary and universal rankings of authors enabled by this data. In particular, we make the following contributions:
We present the architecture, user interface, and data model used in the design and implementation of the Scholarometer system. We discuss several heuristics employed to deal with the noisy nature of both bibliographic data and user-supplied annotations (
As an illustration of potential applications of crowdsourced scholarly data, we report on data sharing and interactive visualization services. These applications suggest that the crowdsourcing framework yields a meaningful classification scheme for authors and their disciplinary interactions (Data Sharing and Visualization section).
By leveraging socially collected discipline statistics, we implement the so-called “universal
Tools exist for both citation analysis (e.g., Publish or Perish
The extraction of bibliographic information from online repositories is not new. Bibliographic management tools such as BibDesk offer robust search of online resources and digital libraries like PubMed
Social tagging of scholarly work is not a new idea, either
We have chosen to use Google Scholar as the citation database for our research. Web of Science, Scopus, and Microsoft Academic Search are possible alternatives
An important goal of the proposed annotation crowdsourcing platform is to enable the computation of scholarly impact. Bibliometrics is the use of statistical methods to analyze scholarly data and identify patterns of authorship, publication, and use. Citation analysis, a core component of bibliometrics, is used to measure the impact or influence of authors and papers in a particular field. Many citation measures have been proposed. Some (e.g., Hirsch's
Scholarometer's crowdsourcing method, in which annotation data is generated by users in exchange for a service, is grounded in prior work as well. Amazon's Mechanical Turk
In this section we outline the main features of the Scholarometer system, available at scholarometer.indiana.edu.
Any citation analysis tool can only be as good as its data source. As mentioned earlier, Scholarometer uses Google Scholar as a data source, which provides freely accessible publication and citation data to users without requiring a subscription. Google Scholar provides excellent coverage, in many cases better than the Web of Science — especially in disciplines such as computer science, which is dominated by conference proceedings, and some social sciences, dominated by books. Nevertheless, Google Scholar is based on automatic crawling, parsing, and indexing algorithms, and therefore its data is subject to noise, errors, and incomplete or outdated citation information. The data collected from Google Scholar comprises the number of papers by an author along with their citation counts and publication years. Alternative sources, such as Microsoft Academic Search (academic.research.microsoft.com) or CiteSeer (citeseerx.ist.psu.edu), can provide the same data for the queried author. Therefore, the system architecture and design that we describe below are independent of the data source.
Due to the lack of an API to access Google Scholar data, a server-based implementation would violate Google Scholar's policy about crawling result pages, extracting data (by scraping/parsing) and making such data available outside of the Google Scholar service. Indeed, server-based applications that sit between the user and Google Scholar are often disabled, as Google Scholar restricts the number of requests coming from a particular IP address. Workarounds such as configurable proxies are not desirable solutions as they also appear to violate policy. We further excluded Ajax technology due to the same origin policy for JavaScript, and the gadget approach because it would render the tool dependent on a particular data source. We turned to a client-based approach, but ruled out a stand-alone application (such as Publish or Perish) for portability reasons. These design considerations led us to a browser extension approach, which is platform and system independent and, to the best of our knowledge, in compliance with Google's terms of service.
In keeping with the above considerations, Scholarometer is implemented as a smart browser extension, through which the user queries the source, annotates the results, and shares with the Scholarometer community only user-provided annotation metadata and public citation data. We emphasize that Scholarometer does not store a copy of any subset of the Google Scholar database. In particular, the records returned to users from Google Scholar are not stored. The data that our system collects from users comprises the publication year, number of citations, and number of authors of each article. This is the information that users share with the community.
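For concreteness, the per-article data described above can be thought of as a small record. The following is a minimal sketch; the field names are illustrative, not Scholarometer's internal schema.

```python
from dataclasses import dataclass


@dataclass
class ArticleRecord:
    """Per-article data shared with the community (illustrative field names)."""
    publication_year: int
    num_citations: int
    num_authors: int


# Example: a 2009 article with 42 citations and 3 authors.
record = ArticleRecord(publication_year=2009, num_citations=42, num_authors=3)
print(record)
```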
The architecture and workflow of Scholarometer is illustrated in
The Scholarometer tool has two interfaces for communicating with users: one in the browser extension for entering queries and tags, the other in the main browser window for presenting and manipulating bibliographic data and citation analysis results. The browser extension is available in two versions: one for the Firefox browser hosted at the Mozilla Firefox Add-ons site, and one for the Chrome browser hosted at the Google Chrome Web Store (scholarometer.indiana.edu/download.html). The Firefox interface is illustrated in
The query interface in the browser extension is designed to identify one or more authors and retrieve their articles. The default interface hides many advanced features and simplifies the common case of a single author uniquely identified by name. Advanced interfaces are available with explicit Boolean operators for multiple authors or ambiguous names, with controls for filtering subject areas and languages, and with additional keyword fields.
Tagging a queried author with disciplinary annotations is a key requirement of the extension interface. We considered two possibilities for the set of usable tags. One is the use of a predefined, controlled vocabulary. This closed approach has the advantage of producing “clean” labels, but the limitation of disallowing the bottom-up, user-driven tracking of new and emerging disciplines, which is a crucial goal of our project. At the other extreme, the open approach of free tagging addresses the latter goal but opens the door to all kinds of noise, from misspelled keywords to the use of non-disciplinary labels that can be useful to a particular individual but not necessarily to the community — think of tags such as “ToRead,” “MyOwn,” “UK,” and so on. We therefore aimed for a compromise solution in our design. The user must enter at least
The interface in the main browser window is designed to facilitate the manipulation and cleaning of the results, to visualize how the impact measures are calculated, and to expose annotations from other users for the same author(s). The output screen is divided into three panels:
A filter panel with two modules. One module is for pruning the set of articles based on the publication year or the number of citations. The second module is for limiting the set of articles to selected name variations or co-authors.
The list of articles, with utilities for live searching and for alternating between a simplified and an extended view, as well as links to external resources. This panel also has remove and merge utilities to correct two common sources of noise in Google Scholar results: articles written by homonymous authors and different versions of the same paper.
A citation analysis panel reporting impact measures. As discussed in the Background section, many impact measures have been proposed, and it is infeasible to implement them all. Since a single measure can only capture some aspect of scientific evaluation, a good citation analysis tool should incorporate a set of measures that capture different features, such as highly cited publications, co-authorship, and different citation practices. To this end we have implemented Hirsch's
To provide additional incentives for users to submit more queries, thus contributing more annotation data, we offer the functionality of exporting bibliographic records from the main browser window. Publication data can be exported individually or in bulk into formats commonly used by reference management tools and scholarly data sharing services. At present, Scholarometer supports the following formats: BibTeX (BIB), RefMan (RIS), EndNote (ENW), comma-separated values (CSV), tab-separated values (XLS), and BibJSON
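As a sketch of what a bulk export might look like, the following writes a list of article records as CSV rows. The column layout is illustrative and not necessarily the exact schema produced by Scholarometer.

```python
import csv
import sys

# Illustrative records: (title, year, citations); not Scholarometer's actual export schema.
articles = [
    ("An example article", 2009, 42),
    ("Another example article", 2011, 7),
]

writer = csv.writer(sys.stdout)
writer.writerow(["title", "year", "citations"])  # header row
writer.writerows(articles)                       # one row per article
```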
The data that we collect comes from users, so it is naturally noisy and subject to various issues. We propose several heuristics to deal with these sources of noise.
We employ a blacklist to prevent spammers from polluting our database. An example is the fictitious author “Ike Antkare,” fabricated to highlight the vulnerability of online sources of citation data
A critical challenge for bibliometric services is that author names are often ambiguous. Ambiguous names lead to biased impact metrics. The problem is amplified when names are collected from heterogeneous sources, including crowdsourced annotations. This is the case in the Scholarometer system, which cross-correlates author names in user queries with those retrieved from bibliographic data. A component of the Scholarometer system therefore attempts to detect an ambiguous name at query time. When an author name is deemed ambiguous, the user is prompted to refine the query. This design aims at decreasing noise in the database and limiting inaccurate impact analysis.
Our first attempt to deal with ambiguous author names deployed a simple heuristic rule based on citation counts associated with name variations
Work on the ambiguous name detection problem is ongoing. We are currently exploring the incorporation of additional features into the classifier. One new class of features under study is based on metadata consistency. We developed a two-step method to capture the consistency between coauthor, title and venue metadata across publications. Authors are likely to collaborate with a certain group of authors, write papers with related titles, and publish papers in similar journals or conferences. The metadata associated with these publications by the same author should be consistent. Another new feature is the consistency between topics associated with publication metadata and discipline annotations crowdsourced from the users. By combining all these features, the accuracy reaches almost 80%
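As an illustration of one metadata-consistency feature of the kind described above, the sketch below measures how much the coauthor sets of the retrieved publications overlap; a low average overlap hints that publications by distinct namesakes were merged. The actual two-step method also uses titles and venues and differs in detail; the data here is invented.

```python
from itertools import combinations


def coauthor_consistency(coauthor_sets):
    """Mean pairwise Jaccard overlap between coauthor sets of retrieved articles.

    Low values suggest the result set mixes publications by homonymous authors.
    """
    pairs = list(combinations(coauthor_sets, 2))
    if not pairs:
        return 1.0
    scores = []
    for a, b in pairs:
        union = a | b
        scores.append(len(a & b) / len(union) if union else 0.0)
    return sum(scores) / len(scores)


# Two articles share a coauthor; the third looks unrelated.
pubs = [{"x chen", "y li"}, {"y li", "z wang"}, {"j smith", "k brown"}]
print(round(coauthor_consistency(pubs), 2))
```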
Since there is no established way to uniquely identify authors (the ORCID initiative is under development
If the author name is already present in the database, the system prompts the user to make a selection from a list of names provided along with citation metadata.
If the user chooses someone from the list, the system updates the information for the author rather than creating a new record.
If the user does not choose a name from the list, but an author generated from an identical query is present in the database, the user is prompted to use additional keywords to disambiguate the query.
If the user does not choose a name from the list, and an identical author is not present in the database, a new record for the author is created.
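A self-contained sketch of this decision flow, using plain dictionaries in place of the real database; all names and data structures here are illustrative, not Scholarometer's actual code.

```python
def resolve_author(query_name, chosen_id, authors_db, queries_db):
    """Illustrative decision flow for matching a queried name against stored authors.

    authors_db: dict mapping author id -> {"name": ...}
    queries_db: dict mapping raw query string -> author id
    chosen_id:  id picked by the user from the candidate list, or None
    """
    candidates = [aid for aid, a in authors_db.items() if a["name"] == query_name]
    if candidates:
        if chosen_id in candidates:
            return ("update", chosen_id)   # user confirmed an existing record
        if query_name in queries_db:
            return ("refine", None)        # identical query exists: ask for more keywords
    new_id = max(authors_db, default=0) + 1
    return ("create", new_id)              # no match: create a new author record


authors = {1: {"name": "W Zhang"}}
queries = {"W Zhang": 1}
print(resolve_author("W Zhang", None, authors, queries))  # -> ('refine', None)
print(resolve_author("S Freud", None, authors, queries))  # -> ('create', 2)
```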
A second issue is the arbitrary nature of uncontrolled discipline annotations. As mentioned earlier, free tags can be noisy, ambiguous, or duplicated. We employ manual and automatic techniques to deal with noisy annotations. We found several types of noise in our tag collection. First, some users misunderstand the tagging request and use author names instead of discipline names as tags. Second, misspelled discipline names are common, resulting in duplication of existing tags. Third, some users adopt acronyms without checking whether an extended version of the discipline name already exists (e.g., “hci” vs. “human computer interaction”). Finally, people may abuse the tool, using nonsensical or random tags, e.g., the first discipline starting with the letter ‘a.’
Some of these issues can be dealt with automatically by (i) checking whether a tag corresponds to an author name present in the database, and (ii) sorting all tags lexicographically and computing the edit distance between neighboring tags within a window. We employ the Damerau-Levenshtein (DL) distance.
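A minimal sketch of step (ii), using the optimal string alignment variant of the DL distance; the window size and distance threshold are assumed values for illustration.

```python
def dl_distance(a, b):
    """Optimal string alignment variant of the Damerau-Levenshtein distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + cost)    # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


def flag_near_duplicates(tags, window=5, threshold=2):
    """Compare each tag with its lexicographic neighbors; flag likely duplicates."""
    tags = sorted(set(tags))
    pairs = []
    for i, tag in enumerate(tags):
        for other in tags[i + 1:i + 1 + window]:
            if dl_distance(tag, other) <= threshold:
                pairs.append((tag, other))
    return pairs


print(flag_near_duplicates(["human computer interaction",
                            "human computer interation", "physics"]))
```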
Finally, we need a way to estimate the reliability of crowdsourced discipline tags. We view each query as a vote for the discipline tags of the queried author. For example, a query that tags Einstein with “physics” and “philosophy” generates a vote for (Einstein, “physics”) and a vote for (Einstein, “philosophy”). The number of votes together with the number of tags can be used to determine heuristically which tags are
When a tag is selected it receives a vote, bringing its total number of votes to
Scholarometer provides several ways to share the crowdsourced data with the research community, and to explore the data through interactive visualizations.
The API (scholarometer.indiana.edu/data.html) makes the data collected by Scholarometer available. It also makes it easy to integrate citation-based impact analysis data and annotations into other applications. It exposes information about authors, disciplines, and relationships among authors and among disciplines.
The Widget provides an easy and customizable way to embed a dynamically updated citation analysis report into any website. The results screen in the main browser window includes a special “widget” button (see
Scholarometer also publishes crowdsourced data according to the basic principles of “Linked Data”
Links are labeled with the correspondence relationships between resources. This diagram is a portion of the cloud diagram by Richard Cyganiak and Anja Jentzsch (lod-cloud.net). As in the original cloud diagram, the color of a node represents the theme of the data set and its size reflects the number of triples.
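As a sketch of what such linked-data triples might look like for an author–discipline annotation, the snippet below uses rdflib; the namespace and property names are hypothetical and not the vocabularies actually published by Scholarometer.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF, RDFS

# Hypothetical namespace for illustration only.
SCH = Namespace("http://example.org/scholarometer/")

g = Graph()
author = SCH["author/einstein-a"]
discipline = SCH["discipline/physics"]

g.add((author, RDF.type, FOAF.Person))              # the annotated author
g.add((author, FOAF.name, Literal("A Einstein")))
g.add((author, SCH.annotatedWith, discipline))      # crowdsourced discipline annotation
g.add((discipline, RDFS.label, Literal("physics")))

print(g.serialize(format="turtle"))
```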
One way to explore the quality of the annotations obtained through the crowdsourcing approach employed by the Scholarometer system is to map the interdisciplinary collaborations implicit in the tags. Since an author can be tagged with multiple disciplines, we can interpret such an annotation as an indicator of a link between these disciplines. For example, if many users tag many authors with both “mathematics” and “economics” tags, we can infer that these disciplines are strongly related, even though they belong to different branches of the JCR — science and social sciences, respectively.
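A minimal sketch of how such a discipline co-occurrence network can be built from crowdsourced annotations; the author–tag data shown is invented for illustration.

```python
from collections import Counter
from itertools import combinations


def discipline_network(author_tags):
    """Build a weighted co-occurrence network: an edge (d1, d2) gains weight
    whenever an author is annotated with both disciplines."""
    edges = Counter()
    for tags in author_tags.values():
        for d1, d2 in combinations(sorted(tags), 2):
            edges[(d1, d2)] += 1
    return edges


annotations = {
    "author 1": {"mathematics", "economics"},
    "author 2": {"economics", "mathematics", "statistics"},
    "author 3": {"physics", "mathematics"},
}
print(discipline_network(annotations).most_common(2))
```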
Along with the interactive discipline network, Scholarometer also provides interactive visualizations of author networks. Starting from an author submitted in a query, the author network displays similar authors. An author is represented as a vector of discipline tags, weighted by votes. Author nodes are connected by edges weighted by the cosine similarity between the corresponding vectors. Authors are therefore deemed similar if they are tagged similarly. These visualizations can help identify potential referees, members of program committees and grant panels, collaborators, and so on. Such a scenario is illustrated in
In this example scenario, the user is looking for potential members of an interdisciplinary panel on complex networks. Starting from a known physicist (“A L Barabási”) and navigating through “A Vespignani” and “F Menczer,” the user identifies “J Kleinberg,” a computer scientist who studies networks.
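The similarity computation described above can be sketched as follows; the tag vectors and vote counts are invented for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two authors, each a dict mapping discipline tag -> votes."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


barabasi = {"physics": 10, "complex networks": 7}
kleinberg = {"computer science": 9, "complex networks": 5}
print(round(cosine(barabasi, kleinberg), 3))
```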
The Scholarometer system was first released in November 2009. At the time of this writing, the Scholarometer database has collected information about 1.9 million articles by 26 thousand authors in 1,200 disciplines. There are about 90 thousand annotations, or tag-author pairs. After applying the heuristics described above, these numbers are reduced to about 1.4 million articles by about 21 thousand reliable authors, with about 34 thousand reliable annotations into about 900 reliable disciplines. Naturally this folksonomy grows and evolves daily as Scholarometer handles new queries. The growth in the numbers of discipline tags, authors, and queries is charted in
Note that the sets of authors in these disciplines may overlap, as authors are often tagged with multiple disciplines. Therefore the total number of unique authors in these 20 disciplines is actually lower than shown here. Bottom: Relative size of top 20 disciplines based on the number of tagged authors.
Various statistics for authors and disciplines are available on the Scholarometer website (scholarometer.indiana.edu/explore.html). The annotation data enables us to derive rankings for authors — both universal and disciplinary — based on impact metrics.
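As a sketch of how such rankings can be computed, the following derives Hirsch's h from an author's citation counts and then rescales it by the average h of authors tagged with the same discipline. The rescaling shown is one plausible form of the universal metric in the spirit of Radicchi et al.; the exact definition and statistics used by Scholarometer may differ, and the data is illustrative.

```python
def h_index(citations):
    """Hirsch's h: the largest h such that h papers have at least h citations each."""
    h = 0
    for rank, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h


def universal_h(citations, discipline_h_values):
    """Assumed normalization: rescale h by the average h of authors in the discipline."""
    avg_h = sum(discipline_h_values) / len(discipline_h_values)
    return h_index(citations) / avg_h


# Illustrative data: one author's citation counts, and h values of peers in the discipline.
print(h_index([10, 8, 5, 4, 3]))                    # -> 4
print(universal_h([10, 8, 5, 4, 3], [2, 4, 6, 8]))  # -> 0.8
```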
1 | JN Ihle | DR Cox | S Freud | S Freud |
2 | WC Willett | GM Whitesides | M Friedman | N Chomsky |
3 | MJ Stampfer | JN Ihle | P Bourdieu | S Kumar |
4 | M Friedman | W Zhang | SH Snyder | W Zhang |
5 | W Zhang | S Freud | E Witten | CR Sunstein |
6 | SH Snyder | LA Zadeh | JE Stiglitz | R Langer |
7 | Y Sun | A Shleifer | S Weinberg | E Witten |
8 | S Freud | P Bourdieu | N Chomsky | P Krugman |
9 | B Vogelstein | N Chomsky | HA Simon | P Bourdieu |
10 | S Kumar | T Maniatis | RA Posner | JL Goldstein |
1 | DE Goldberg | LA Zadeh | LA Zadeh | LA Zadeh |
2 | S Thrun | DE Goldberg | DE Knuth | NR Jennings |
3 | NR Jennings | AL Barabasi | DE Goldberg | AL Barabasi |
4 | D Dubois | DE Knuth | D Dubois | A Zisserman |
5 | LA Zadeh | S Haykin | S Thrun | I Horrocks |
6 | A Zisserman | NR Jennings | NR Jennings | J Peters |
7 | AL Barabasi | G Salton | H Prade | J Kleinberg |
8 | H Prade | M Dorigo | JY Halpern | O Faugeras |
9 | DE Knuth | A Zisserman | A Zisserman | S Thrun |
10 | I Horrocks | D Dubois | MY Vardi | A Halevy |
The universal
Note that an author tagged with several disciplines will have multiple
Since the discipline/year statistics depend on the annotations we collect from queries, they are subject to noise and may take a while to converge. Once the statistics are reliable, one should in theory be able to compare the impact of authors in different disciplines. Given the dependence of
We have already shown in
For a quantitative evaluation of the universality of
Another way to verify that
Rank | Discipline | | Discipline |
1 | hematology | 35 | ophthalmology | 1.93 |
2 | obesity | 34 | geosciences, multidisciplinary | 1.75 |
3 | physics, theoretical | 33 | neuroimaging | 1.72 |
4 | gastroenterology & hepatology | 32 | materials science, multidisciplinary | 1.71 |
5 | immunology | 31 | clinical neurology | 1.70 |
6 | biostatistics | 30 | meteorology & atmospheric sciences | 1.68 |
7 | medicine | 29 | geochemistry & geophysics | 1.63 |
8 | nutrition & dietetics | 29 | radiology, nuclear medicine & medical imaging | 1.62 |
9 | medicine, research & experimental | 29 | pathology | 1.60 |
10 | neuroimaging | 27 | psychology, experimental | 1.59 |
We considered disciplines with at least 20 authors (as of April 2012).
We introduced a Web Science approach to gather scholarly metadata. We presented Scholarometer, a social Web tool that leverages crowdsourced scholarly annotations with many potential applications, such as bibliographic data management, citation analysis, science mapping, and scientific trend tracking. We discussed a browser-based architecture and implementation for the Scholarometer tool, affording platform and source independence while complying with the usage policy of Google Scholar and coping with the noisy nature of the crowdsourced data. We outlined disambiguation algorithms to deal with the challenge of common author names, by incorporating a classifier into the query manager.
We found evidence that the crowdsourcing approach can yield a coherent emergent classification of scholarly output. The annotation and citation metadata that we collect is shared with the research community via an API and linked open data. By combining a visualization of disciplinary networks with lists of high-impact authors into an interactive application, the Scholarometer system can be a powerful resource to explore relevant scholars and disciplines. Interactive author networks can help one identify influential authors in one's discipline or in interdisciplinary or emerging areas.
We outlined several citation-based impact metrics that are computed by the Scholarometer tool, including the first implementation of the universal
Of course, as the crowdsourced database grows, our data for each discipline will become more representative and our measures more reliable.
Additional metrics can be implemented, for instance universal ones based on percentiles
Studies of co-authorship patterns in conjunction with citation patterns might help further characterize the structure and evolution of disciplines. Moreover, by tracking the spikes in the popularity of disciplines, we plan to explore trends in scientific fields, in particular how disciplines emerge and die over time.
Part of the work presented in this paper was performed while Xiaoling Sun and Lino Possamai were visiting the Center for Complex Networks and Systems Research (cnets.indiana.edu) at the Indiana University School of Informatics and Computing. We are grateful to Geoffrey Fox, Alessandro Flammini, Santo Fortunato, Hongfei Lin, Filippo Radicchi, Jim Pitman, Ron Larsen, Johan Bollen, Stasa Milojevic, two anonymous referees, and all the members of the Networks and agents Network (cnets.indiana.edu/groups/nan) for helpful suggestions and discussions.