Measuring novelty in science with word embedding

Sotaro Shibayama; Deyun Yin; Kuniko Matsumoto

doi:10.1371/journal.pone.0254034

Abstract

Novelty is a core value in science, and a reliable measurement of novelty is crucial. This study proposes a new approach to measuring the novelty of scientific articles based on both citation data and text data. The proposed approach considers an article to be novel if it cites a combination of semantically distant references. To this end, we first assign a word embedding (a vector representation of each vocabulary item) to each cited reference on the basis of text information included in the reference. With these vectors, a distance between every pair of references is computed. Finally, the novelty of a focal document is evaluated by summarizing the distances between all references. The approach draws on limited text information (the titles of references) and a publicly shared library of word embeddings, which minimizes the requirements for data access and computational cost. We share the code, with which one can compute the novelty score of a document of interest using only the focal document’s reference list. We validate the proposed measure through three exercises. First, we confirm that word embeddings can be used to quantify semantic distances between documents by comparing them with an established bibliometric distance measure. Second, we confirm the criterion-related validity of the proposed novelty measure with self-reported novelty scores collected from a questionnaire survey. Finally, as novelty is known to be correlated with future citation impact, we confirm that the proposed measure can predict future citation.

Citation: Shibayama S, Yin D, Matsumoto K (2021) Measuring novelty in science with word embedding. PLoS ONE 16(7): e0254034. https://doi.org/10.1371/journal.pone.0254034

Editor: Alessandro Muscio, Universita degli Studi di Foggia, ITALY

Received: February 15, 2021; Accepted: June 17, 2021; Published: July 2, 2021

Copyright: © 2021 Shibayama et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files.

Funding: S.S. received a research grant from Lars Erik Lundberg Foundation (https://www.lundbergsstiftelserna.se) and Japan Society for the Promotion of Science (19K01830, https://www.jsps.go.jp/english/index.html). The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Novelty constitutes a core value in science, as new discoveries shape the basis of scientific advancement [1, 2] and have a broader impact on technological innovation [3]. Accordingly, novelty serves as a key criterion for the evaluation of scientific output as well as for decisions such as funding allocation, employment, and scientific awards [1, 4–6]. It is therefore critical that scientific novelty can be reliably measured. In practice, novelty is usually assessed through peer review on a small scale [7], while evaluating novelty on a larger scale remains a challenge. Though recent bibliometric techniques have enabled us to measure various qualities of scientific discoveries, including novelty [8–11], the validity and practical utility of the extant measures are debatable [12, 13].

Previous bibliometric measures for the novelty of scientific documents draw on roughly two data sources, either citation data or text data. Text data are of obvious use, in that once a scientific discovery is documented, its novelty should be revealed in text information. Nonetheless, due to the ambiguity and complexity of natural languages, previous measures use text data rather superficially without sufficiently exploiting the semantic information [e.g., 14]. Only relatively recently has such semantic information been extracted from text data and translated into bibliometric indices [e.g., 15]. To circumvent the technical challenges in extracting semantic information from text data, citation data have been extensively utilized in previous novelty measures. As a citation represents information flow from a cited document to a citing document, it can be used to infer certain qualities, including novelty, of a document without scrutinizing the content [10, 16]. However, the validity of this approach has been occasionally questioned [12]. In fact, insufficient validation has been a limitation common to most novelty measures [17]. Furthermore, a practical limitation common to previous measures is that they require access to a large-scale bibliometric database (often the whole universe of scientific documents), which is usually proprietary and expensive, and high computational power, which potential users of the measures do not always have.

To address previous limitations, we propose a new approach to compute the novelty of scientific documents by combining both citation and text data (see Fig 1). Our approach features recombinant novelty [18–21], considering a document to be novel if it cites a combination of semantically distant documents. This is in line with the previous measures based on citation data [e.g., 8]. Unlike previous measures, however, we use text data to quantify the distances between cited documents. Specifically, based on the text information included in cited documents, we map each document to a word embedding (a high-dimensional vector assigned to each vocabulary item [22]), with which we compute distances between cited documents. To the best of our knowledge, this is the first study to use the word-embedding technique to measure the novelty of scientific documents.

Fig 1. Algorithm of novelty computation.

https://doi.org/10.1371/journal.pone.0254034.g001

For text information, we test three sources (the abstract, keywords, and the title of cited documents) and find that all three perform satisfactorily. Of the three sources, titles of cited documents are often included in the focal document itself, so using them minimizes the burden of data access. As a library of word embeddings, we draw on scispaCy [23], which is publicly available and thus significantly reduces the computational cost. We publicly share the code [24], with which one can compute the novelty score of a document using only the focal document’s reference list.

We validate the proposed measure in three exercises. First, we confirm that word embeddings from the selected library can be used to quantify semantic distances between documents by comparing with an established bibliometric distance measure. Second, we test the criterion-related validity of the proposed novelty measure based on self-reported novelty scores collected from a questionnaire survey. Third, as novelty is known to be a predictor of future citation impact [8, 11], we test whether the proposed measure is correlated with future citation.

This paper is structured as follows. In the next section, we categorize previous novelty measures and discuss their characteristics and limitations. The following section describes our proposed measure and outlines its operationalization. Then, we present the methods and data for the validation exercises. Finally, we present the results and conclude.

Literature review

Previous bibliometric measures for novelty can be categorized based on their conceptualization and operationalization (Table 1). Conceptually, some measures aim to represent the uniqueness of a certain knowledge element (Groups 1 and 4)–for example, a discovery of a new molecule and development of a new material. In contrast, other measures aim to capture a recombination of knowledge elements (Groups 2 and 3), in which a new or rare combination of knowledge is considered to be a sign of novelty. The notion of recombination as a source of novelty has been widely discussed in the literature. The creativity literature argues that associating remote elements is a path to creative solution in general as well as in science [18, 19], and the management literature suggests that combining components is a major route to technological innovation [20, 21].

Table 1. Previous novelty measures.

https://doi.org/10.1371/journal.pone.0254034.t001

For operationalization, a group of measures exploits citation information to assess novelty indirectly (Group 3), and the other draws on text analysis to assess the content of documents (Groups 1, 2, and 4). Among the latter, the majority uses text information only superficially without using the semantic information of the text (Groups 1 and 2), but recent measures attempt to extract semantic information (Group 4). Studies on novelty measures have been relatively advanced in technology management, in which a patent is used as a unit of document [e.g., 16, 25]. We also refer to these measures because the key idea behind the measures is applicable to scientific documents. In what follows, we discuss four groups of previous measures.

(1) A new word

The first group of novelty measures is based on the first appearance of a word or words in a document [14, 25]. If a document includes or is associated with a certain word or a sequence of words that is new to the world, it can be inferred that the document delivers novel information. For example, if a document contains a previously unknown chemical compound, it suggests that the document is novel. In this category, Azoulay et al. [14] drew on Medical Subject Headings (MeSH), a controlled keyword dictionary, and operationalized the novelty of a journal article based on the average age of keywords (the number of years since each keyword's first appearance). Balsmeier et al. [26] and Arts et al. [25] also identified novel inventions based on the first occurrence of a word as well as a sequence of words (bigram and trigram) in patent documents.

(2) Recombination of words

The second group is technically similar to the first group but conceptually different, as it aims to measure "recombinant" novelty [19, 20]. When a document includes a rare combination of knowledge elements, even if each element is already known, the document can be considered to be novel. In this category, Boudreau et al. [9] measured the novelty of a research grant proposal based on a new combination of MeSH keywords. Similarly, drawing on a controlled dictionary of patent classifications, Verhoeven et al. [27] measured recombinant novelty by a new combination of IPC codes assigned to the patent. Arts et al. [25] also measured the novelty of a patent based on a new combination of two words that appeared in the patent.

The first and second groups are intuitively straightforward but have some limitations. Among others, these measures largely disregard semantic information included in text data. For example, the first group may consider a new synonym of an existing concept to be novel, unless controlled dictionaries are available. Similarly, the second group may consider any recombination equally novel regardless of the semantic distance between combined elements.

(3) Recombination of cited documents

The third group also measures recombinant novelty, but instead of using text information, it draws on citation information. A document citing another document implies that knowledge in the latter is used by the former [28]. Thus, a document can be characterized by its cited documents, by considering each of cited documents to be a knowledge element that is incorporated into the citing document. Based on the recombinant novelty concept [18, 19], a document citing a set of documents that have rarely been cited together can be considered as a sign of novelty. In contrast to the first and second groups, in which a single word is considered a representation of knowledge, considering a cited document as a knowledge element adds semantic richness, making this approach popular in previous studies.

In this group, Dahlin and Behrens [16] proposed a novelty measure of patents based on a rare combination of cited references. Trapido [10] applied the same approach to journal articles, specifically in the field of electrical engineering. This approach is extended by Matsumoto et al. [17] so that it is applicable in any scientific field. A variation of this approach is to draw on journals in which cited documents are published [8, 11]. That is, if a focal document cites documents in two journals that have rarely been cited together, it is considered as a sign of novelty. This approach thus consolidates the unit of knowledge further at the journal level. Though considering a document or a journal as a unit of knowledge, without needing to scrutinize the content of documents, is convenient, its validity is under dispute [12, 13].

(4) A distant text

The last group quantifies the uniqueness of a document based on text analysis, and relies on more recent developments in natural language processing (NLP) to extract semantic information. In particular, drawing on the word embedding technique, Hain et al. [15] proposed a measure of patent novelty. Word embeddings map each word to a high-dimensional vector (i.e., a list of numbers). This allows us to quantify the semantic relationship between a pair of words by calculating the distance between their vectors: similar words have close vectors, while dissimilar words have remote vectors. Hain et al. [15] assigned a vector to each patent by aggregating the vectors for a set of words that appear in the patent. Then, they calculated a distance between every pair of patents, with which a patent remote from any other patent is considered to be novel.

Proposed measure of novelty

Measuring novelty with word embedding

As a new approach, we propose to measure recombinant novelty of scientific documents by applying the combination of the word embedding technique and citation analysis. We consider a cited document as an appropriate unit of knowledge input, as in Group 3. Unlike the previous measures, which disregard the content of cited documents, we draw on the word embedding technique to extract semantic information in cited documents.

The word embedding technique often draws on machine learning algorithms (e.g., word2vec) to calculate a vector representation for each word based on the co-occurrences of words in a text corpus [22]. The approach is gaining confidence as the performance of machine learning improves, and it has recently been applied to scientific documents for various purposes. For example, Tshitoyan et al. [29] capture the knowledge structure in the extant literature in materials science, with which they predict future scientific discoveries in the field. Still, to the best of our knowledge, the technique has not been used to measure the novelty of scientific documents.

Although computing word embeddings is demanding, some algorithms are publicly available, and some well-trained word embedding models (a list of vectors for a set of vocabularies) are also publicly accessible [30]. In this study, we use scispaCy as an established and publicly available library of word embeddings. ScispaCy builds on a popular spaCy model [30] and offers vector representations in a 200-dimensional vector space for 600,000 vocabularies specializing in biomedical texts [23, 31].

Operationalization

With the selected word embedding library and citation information, the novelty of a document is computed through the following steps (Fig 1). Suppose that a focal document cites N references, and that each of the cited references has some text information. One can use various sources of text information, such as the full text and the abstract. In the following analysis, we construct respective measures from three text sources: the abstract, keywords, and the title of cited documents. Of the three sources, we primarily propose using the title, to minimize data requirements and maximize the utility of the measure.

Step 1. First, we vectorize the text information of the i-th reference as v_i∈ℝ²⁰⁰ (i∈{1,…,N}). Since the text information includes multiple words, v_i is calculated as the mean of word embeddings of all words included.
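Step 1 can be illustrated with a minimal sketch. The embedding table below is a hypothetical 4-dimensional toy lookup, not the 200-dimensional scispaCy vectors used in the study, and the whitespace tokenization and the skipping of out-of-vocabulary words are assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy embeddings (4 dimensions) for illustration only;
# the study uses 200-dimensional vectors from scispaCy.
TOY_EMBEDDINGS = {
    "protein": np.array([0.9, 0.1, 0.0, 0.2]),
    "folding": np.array([0.7, 0.3, 0.1, 0.0]),
    "kinetics": np.array([0.5, 0.5, 0.2, 0.1]),
}


def vectorize_text(text, embeddings, dim=4):
    """Return the mean of the embeddings of the words found in `text`.

    Words missing from `embeddings` are skipped (an assumption of this
    sketch); if no word is found, a zero vector is returned.
    """
    tokens = text.lower().split()  # naive whitespace tokenization
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# Vector for a (made-up) reference title.
v = vectorize_text("Protein folding kinetics", TOY_EMBEDDINGS)
```

In spaCy-style pipelines, `Doc.vector` defaults to the average of the token vectors, so a loaded model could plausibly stand in for the toy lookup; which scispaCy model to load is left open here.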

Step 2. Second, we compute the distance of each pair of cited documents. The cosine distance between the i-th and j-th references (1≤i<j≤N) is given by:

$$d_{ij} = 1 - \frac{v_i \cdot v_j}{\lVert v_i \rVert \, \lVert v_j \rVert} \tag{1}$$

The cosine distance ranges from 0 to 2, where a larger value indicates a larger distance.
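As a rough sketch of Step 2, the cosine distance can be taken as one minus the cosine similarity, which is consistent with the stated range of 0 to 2. The function names and the 2-dimensional example vectors are hypothetical, indices are 0-based, and zero vectors are not handled.

```python
from itertools import combinations

import numpy as np


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of u and v."""
    similarity = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - float(similarity)


def pairwise_distances(vectors):
    """Map each index pair (i, j) with i < j to its cosine distance."""
    return {
        (i, j): cosine_distance(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }


# Three made-up reference vectors: orthogonal and opposite directions.
refs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
d = pairwise_distances(refs)
```

Here the orthogonal pair has distance 1 and the opposite pair reaches the upper bound of 2.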

Step 3. Finally, we aggregate the distance scores over all pairs of cited references. In our dataset, one document has 32 cited references on average, which gives approximately 500 reference pairs. As a novelty measure of a focal document, we take the q-percentile value of the distance scores (Novel_q), where q∈[0,100] and the 100-percentile value is defined as the maximum. Hence,

$$\mathrm{Novel}_q = \left\{ d_{ij} \;\middle|\; R(d_{ij}) = \frac{q}{100} \cdot \frac{N(N-1)}{2} \right\} \tag{2}$$

where R(d_ij) is the ordinal rank of d_ij among all the distances of the N(N−1)/2 reference pairs.
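The aggregation in Step 3 could be sketched as a rank-based percentile. Rounding the target rank up to the next integer (the nearest-rank rule) and ranking in ascending order are assumptions of this sketch; an interpolating function such as `numpy.percentile` would give slightly different values for some q. The distance values are made up.

```python
import math


def novelty_score(distances, q):
    """Return the q-percentile of `distances` by ascending ordinal rank.

    The target rank is ceil(q/100 * M) for M distances (nearest-rank
    rule, an assumption here), so q = 100 yields the maximum distance.
    """
    if not distances:
        raise ValueError("at least one reference pair is required")
    if not 0 <= q <= 100:
        raise ValueError("q must lie in [0, 100]")
    ordered = sorted(distances)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up distances for five reference pairs.
distances = [0.10, 0.25, 0.40, 0.55, 0.90]
novel_100 = novelty_score(distances, 100)  # the maximum distance
novel_50 = novelty_score(distances, 50)    # rank ceil(2.5) = 3

# Number of pairs for a document with 32 references: N(N-1)/2.
n_pairs = 32 * 31 // 2
```

With 32 references the pair count is 496, in line with the figure of approximately 500 pairs mentioned above.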

Computational cost

The aforementioned previous measures of novelty require extensive data access and processing. Text-based approaches (Table 1, Groups 1, 2, and 4) require the entire history of word uses, and citation-based approaches (Table 1, Group 3) need comprehensive citation network data. This poses two practical challenges for potential users of the novelty measures. First, the required data are usually proprietary and thus expensive. Second, processing the massive data requires high computational power. Not all users have such rich resources, which compromises the utility of the measures.

Our proposed approach addresses these issues and aims to allow anyone to compute and use the novelty measures. Our measure requires only limited data access and little proprietary data. The measure can be computed with only the titles of a focal document’s cited references, which are often included in the focal document itself, and a publicly available library of word embeddings. The approach also requires little data processing: it requires neither extensive citation network analysis, as in Group 3, nor comparison with the whole document universe, as in Group 4. With the publicly shared code, anyone can compute the measure.

Methods and data

Previous novelty measures have rarely been validated, with few exceptions [17]. To confirm the validity of our proposed measure, we carry out three exercises. The primary analysis is to test the criterion-related validity based on self-reported novelty scores for selected documents. As a preparatory step to this main analysis, we test whether scispaCy word embeddings can indeed be used to measure distances between documents (corresponding to Step 2). Finally, since novelty is known as a predictor of future citation impact [8, 11], we run regression analyses to test whether our proposed measure is positively associated with future citation.

To compute the proposed measures, we downloaded bibliometric information from Web of Science (WoS). Since scispaCy specializes in the vocabulary of biomedicine, we focus on documents within relevant Subject Categories [32]. We focus on "article" as a document type and documents written in "English" [33]. We employ different sets of random samples for each analysis as detailed below.

Validation of distance

Before validating the novelty measure itself, we test whether scispaCy word embeddings convey semantic information of a text and whether they can assess the distance between a pair of documents. To this end, we compute distances between pairs of documents using two approaches (one based on scispaCy word embeddings and the other on a previously established approach) and confirm that the two are sufficiently correlated.

As a previously established approach, we compute the co-citation distance between a pair of documents i and j: (3) where ref_i is the number of references cited by i and coref_ij is the number of references cited by both i and j. Co-citation distance has been previously used to measure the distance of scientific documents without a need to look into the content of the documents [10, 17]. A basic assumption is that a pair of documents should include a similar content if they cite a similar set of documents. We do not consider that the co-citation distance is superior to the word-embedding distance, but the two distances are expected to be correlated if scispaCy word embeddings do convey semantic information.

Second, using scispaCy word embeddings, we assign vectors respectively to the same pair of documents i and j (see Step 1 in Fig 1) and compute their distance (Eq 1). As text data for vectorization, we draw respectively on three sources (the title, the abstract, and keywords) from the pair of documents, preparing three distance measures, one for each text source. Note that the word-embedding distance is computed between a pair of focal documents in this analysis, whereas it is applied to pairs of references cited by focal documents when we compute novelty.

For this analysis, we employed the following sampling strategy. First, we randomly sampled 100 authors in the field of biomedicine. Then, we collected all documents authored by these authors [34]. Finally, we filtered out documents outside of the biomedical field as well as documents missing reference information, resulting in 1,600 documents (16 documents per author on average). We compute the distance measures between documents written by the same author (i.e., we do not compare documents written by different authors). This is because co-citation is rare between a randomly chosen pair of documents written by different authors, which spuriously inflates the correlation.

Validation of novelty

After confirming that the scispaCy word embeddings carry semantic information of text, we test the criterion-related validity of the proposed novelty measure (Eq 2). To this end, we draw on self-reported novelty scores, which we obtained from a questionnaire survey we conducted in 2009–2010 [35, 36]. The survey was answered by 2,081 scientists from various scientific fields, of whom this study draws on a subset of 321 respondents in biomedical fields.

The survey included a wide range of questionnaire items, one section of which asked the respondents to assess a randomly selected journal article that they published in 2001–2006. This section includes eight items to characterize the finding reported in the article (Table 2). As novelty is a multifaceted concept [37], the survey incorporated four aspects (theory, phenomenon, method, and material) in which the article may make a scientific contribution. For each aspect, the survey further included two items, one indicating newness and the other indicating improvement over existing literature. We expect that the proposed measure should be correlated more with the newness items and less with the improvement items. Each item was rated on a 5-point scale (1: not relevant at all; 5: highly relevant).

Table 2. Questionnaire of novelty.

https://doi.org/10.1371/journal.pone.0254034.t002

For the selected articles, we computed the proposed novelty measures (Eq 2), based on the title, the abstract, and keywords respectively, which generates three series of novelty measures, one for each text source, where q∈{100, 99, 95, 90, 80, 50}.

Prediction of future citation

Previous studies consistently indicate a positive association between novelty and future citation impact of scientific documents [8, 11]. Thus, we test whether the proposed novelty measure can predict future citation effectively. For this analysis, we use "top-1% cited" (TC) in the respective field as the dependent variable and regress it on the proposed novelty measures. TC is a dummy variable coded 1 if the citation count of the article is within the top 1% and 0 otherwise. Three sets of novelty measures are calculated with the title, the abstract, and keywords respectively, one set for each text source, where q∈{100, 99, 95, 90, 80, 50}. Since the dependent variable is a dummy variable, we draw on logistic regressions: (4) where f is the logistic function.
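To make the regression setup concrete, the following sketch fits a logistic model of a binary outcome on a single novelty score using Newton's method. It is not the specification of Eq 4: it assumes one regressor and no control variables, and the data are synthetic, generated with an arbitrary positive relationship purely for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(x, y, n_iter=25):
    """Fit Pr(y = 1) = f(b0 + b1 * x) by Newton's method.

    Returns the estimated coefficients [b0, b1].
    """
    X = np.column_stack([np.ones(len(x)), x])  # intercept + regressor
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


# Synthetic data (hypothetical): 2,000 documents with made-up novelty
# scores and a top-cited dummy drawn from an assumed logistic model.
rng = np.random.default_rng(0)
novelty = rng.uniform(0.2, 0.8, size=2000)
tc = (rng.random(2000) < sigmoid(-3.0 + 6.0 * novelty)).astype(float)

beta = fit_logistic(novelty, tc)
```

A positive estimate for `beta[1]` corresponds to a positive association between the novelty score and the probability of being top cited. In practice a statistics package would also report standard errors.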

For this analysis, we randomly sampled 2,000 articles published in biomedical fields in 2010 and evaluated their citation impact as of 2020 (10 years after publication). We oversampled top-1% cited articles, so that the final sample consists of approximately 1,000 top-1% cited articles and 1,000 non-top-1% cited articles.

Results

Description of the measure

To illustrate the distribution of the proposed measures, we computed the novelty of randomly selected documents (Fig 2B) and the distances of cited references of the documents (Fig 2A). Comparing distances based on the three text data sources, Fig 2A shows that the abstract-based measure takes lower values. This is because abstracts include longer text information, which increases the chance that two cited documents share something in common. Based on the distances, novelty measures (Novel_q) with various values of q are computed (see S1 Appendix). Fig 2B presents Novel₁₀₀, which takes the maximum value of all reference pairs.

Fig 2. Distribution of distance and novelty.

The same sample for the third validation study (prediction of future citation) is used, except that oversampled highly-cited documents are excluded. The 947 selected documents include in total approximately 230,000 combinations of cited references, for which the distance (Eq 1) is computed (A). The distances are summarized at the focal document level (Eq 2), and Novel₁₀₀ is displayed as an example (B). Novelty measures with different q values are illustrated in S1 Appendix. Since abstracts and keywords are not available for all documents, the sample sizes are smaller.

https://doi.org/10.1371/journal.pone.0254034.g002