Bibliometrics for Social Validation

This paper introduces a bibliometric, citation network-based method for assessing the social validation of novel research, and applies this method to the development of high-throughput toxicology research at the US Environmental Protection Agency. Social validation refers to the acceptance of novel research methods by a relevant scientific community; it is formally independent of the technical validation of methods, and is frequently studied in history, philosophy, and social studies of science using qualitative methods. The quantitative methods introduced here find that high-throughput toxicology methods are spread throughout a large and well-connected research community, which suggests high social validation. Further assessment of social validation involving mixed qualitative and quantitative methods is discussed in the conclusion.


Introduction
Text analysis could be used to interpret the blockmodel partition in figure 6 of the main paper. To be clear, this partition is imposed on the citation network, in the sense that the partition algorithm is forced to produce a partition with exactly two communities. Two communities would not necessarily minimize the description length of the network, and thus might not be optimal from the perspective of the blockmodel approach. For the purposes of this project, the optimal partition structure of the citation network is less interesting than whether the core vertices are confined to a single community of a binary partition of the whole network.
However, text analysis might still be useful for understanding the core set itself. Specifically, does the blockmodel partition of the core set correspond to any textual differences? This analysis examines that question. For data reasons discussed below, as well as considerations of length, this analysis was conducted in an exploratory fashion, rather than integrated into the primary analysis discussed in the main paper.

Data
Paper abstracts were not retrieved in the original data collection process. This was primarily a matter of computational resources. Due to the number of nodes in the network, even with relatively little paper metadata, the GraphML file containing the network is quite large (nearly 1 GB on disk). Including abstracts could easily double or triple the size of this file. Working with a single file this large could have severely taxed the computational machinery available for this project, and distributing the data across multiple files would have significantly complicated project file management and downstream analysis. In addition, due to weekly quotas imposed by the Scopus API, retrieving abstracts for 80,120 papers would have delayed revisions to the manuscript by several weeks. By contrast, text analysis of the core set papers alone requires far fewer resources, and abstracts for all of these papers can be retrieved in less than an hour. For these reasons, I focus here on a text analysis of the core set abstracts.

Methods
Topic models, fit using Latent Dirichlet Allocation (LDA), are a common tool in unsupervised text analysis, even in fields that do not traditionally use computational methods [1,2]. Briefly, topic models use a Bayesian method to group words into a given number of topics based on their co-occurrence patterns. Documents are modeled as drawing words randomly from these topics according to a latent conditional distribution across topics: γ_{i,j} = Pr(topic i | document j). A high value of γ_{i,j} indicates that document j draws almost exclusively from topic i. Topic models have been found to be effective for discovering semantic patterns across large corpora, even when each individual document is short, such as tweets [3].

Topic Stability
Topic model algorithms generally require analysts to manually specify the number of topics k in the model, and fitted models are generally evaluated either by their ability to classify held-out documents according to human-curated categories [4] or by human judgment that the word assignments to topics are "relevant and intuitive" [5]. By contrast, [6] provide a method to quantitatively assess the stability of topic modeling across a range of values of k. Briefly, the method first constructs several subsamples s_1, ..., s_n of the entire corpus; below, we use 50 samples, each comprising 80% of the documents in the core set. Next, for each value of k under consideration, the method fits a model with k topics to the entire corpus s_0 and to each subsample s_1, ..., s_n. An agreement score is then calculated for each of s_1, ..., s_n relative to s_0. The distribution of agreement scores for a given value of k indicates the stability of the topic model with k topics for the entire corpus s_0.
The method introduced by [6] evaluates agreement in terms of the ranked lists of terms in each topic model; roughly, two models agree insofar as the top 20 terms in each topic are the same. For the purposes of the current analysis, the stability of the document assignments is more interesting than the stability of the term lists. Thus, in this analysis, agreement scores between models are calculated in terms of the correlations of the γ_{i,j}, the posterior distributions of topics for each document.
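The subsample-and-correlate procedure can be sketched as follows. This is a simplified illustration, not the paper's implementation: the function names are hypothetical, and for k = 2 it exploits the fact that topic-label switching between two fits reduces to a sign flip of the correlation, so taking the absolute value matches topics implicitly (larger k would require explicit topic matching).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def fit_gamma(docs, k, seed=0):
    """Fit a k-topic LDA model; return the document-topic matrix gamma."""
    counts = CountVectorizer().fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=k, random_state=seed)
    return lda.fit_transform(counts)  # rows: documents, cols: topics

def agreement_scores(docs, k=2, n_samples=50, frac=0.8, seed=0):
    """Agreement between the full-corpus model and subsample models,
    measured as the correlation of gamma over the shared documents."""
    rng = np.random.default_rng(seed)
    gamma_full = fit_gamma(docs, k)
    scores = []
    for _ in range(n_samples):
        idx = rng.choice(len(docs), size=int(frac * len(docs)), replace=False)
        gamma_sub = fit_gamma([docs[i] for i in idx], k)
        # With k = 2, label switching only flips the sign of the
        # correlation, so abs() implicitly matches the two topics
        r = np.corrcoef(gamma_sub[:, 0], gamma_full[idx, 0])[0, 1]
        scores.append(abs(r))
    return scores
```

The spread of the returned scores at each k is what figure S.1 summarizes: a tight cluster of high scores indicates a stable number of topics.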

Topic-Partition Comparison
After a value of k is selected, we compare the document assignments in the topic model to the blockmodel partition. Because topic models fit posterior probabilities γ_{i,j}, topic assignment can be handled either discretely or continuously. For a discrete assignment, document j is assigned to topic i if, and only if, γ_{i,j} is greater than a certain threshold; below we use γ_{i,j} > 0.8. With k = 2 topics, this gives us three bins of core papers: in topic A, in topic B, and in neither topic. This tripartite classification of papers is then compared to the blockmodel partition assignments, using a contingency table approach similar to that used to compare the core partition with the blockmodel partition.
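A minimal sketch of the discrete binning and contingency table, using hypothetical γ values and community labels (the helper name and data are assumptions, not the paper's code):

```python
import numpy as np

def bin_documents(gamma, threshold=0.8):
    """Discretize a k = 2 document-topic matrix into 'A', 'B', or 'neither'."""
    labels = np.full(gamma.shape[0], "neither", dtype=object)
    labels[gamma[:, 0] > threshold] = "A"
    labels[gamma[:, 1] > threshold] = "B"
    return labels

# Hypothetical gamma values and blockmodel community labels
gamma = np.array([[0.95, 0.05], [0.10, 0.90], [0.55, 0.45], [0.85, 0.15]])
partition = np.array(["1", "2", "1", "1"])

labels = bin_documents(gamma)
# Contingency table of topic bin vs. blockmodel community
for topic in ("A", "B", "neither"):
    row = [int(np.sum((labels == topic) & (partition == comm)))
           for comm in ("1", "2")]
    print(topic, row)
```

Because the rows of γ sum to 1, at most one topic can exceed the 0.8 threshold, so the three bins are mutually exclusive.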
With k = 2, topic assignments can be handled continuously by simply working with one of the two families γ_{A,j}, that is, the posterior probabilities of a given topic A across all documents j. Since γ_{A,j} = 1 − γ_{B,j}, low values of γ_{A,j} correspond to documents j that are "assigned" to topic B.
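The continuous comparison amounts to examining how γ_{A,j} is distributed within each blockmodel community, as in figure S.4. A minimal sketch with hypothetical values:

```python
import numpy as np

# Hypothetical posterior probabilities for topic A, and the blockmodel
# community of each corresponding document
gamma_A = np.array([0.92, 0.88, 0.15, 0.10, 0.95, 0.20])
partition = np.array([1, 1, 2, 2, 1, 2])

# Mean posterior for topic A within each community; a large gap between
# the two means indicates that the topics track the blockmodel partition
for comm in (1, 2):
    print(comm, gamma_A[partition == comm].mean())
```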

Results
In figure S.1, points indicate the agreement score for each sample at each value of k; the line indicates the mean agreement score at each value of k. The plot indicates that k = 2 is somewhat more stable than other values of k on average, but its agreement scores also show much greater variance. Thus, k = 2 should be considered at best only moderately stable. We proceed with k = 2 as the best available option.
The network contains 323 core nodes, but Scopus returned abstracts for only 319 papers. For discrete topic assignments, documents (core paper abstracts) are binned into topics using the threshold γ_{i,j} > 0.8.
To visualize the distribution of topics over the core nodes, we work with the continuous γ value for one of the two topics, coloring nodes more red insofar as they have a higher posterior probability for that topic. There is a very strong correlation between the partitions and topic assignments. This can also be seen if we plot γ_{A,j}, the posterior distribution for topic A, against blockmodel partition assignment; see figure S.4.
However, it is difficult to map these topics to recognizable areas of toxicology research. Tables S.2 and S.3 give DOIs, titles, and γ values for the 10 articles most highly associated with topics A and B, respectively. Both tables contain papers on fundamental HTT research as well as applications of HTT to both human health and ecotoxicology. Similarly, table S.4 shows the terms most associated with the two topics. "Chemical," "results," "exposure," and "data" appear in both lists, suggesting that the two topics substantially overlap. This lack of clarity and distinctness in the topics corresponds to the lack of stability observed above. It appears that the LDA model is unable to find substantive, consistent patterns in the distribution of terms.

Conclusion
While this analysis finds very good concordance between the k = 2 topic model and the structure of the core nodes, both in the citation network connecting the core nodes themselves and in the blockmodel partition they inherit from the larger citation network, text analysis of paper abstracts provides little insight into the topological features of this particular citation network.

Appendix: Reproducibility
System environment and package versions used to conduct this analysis: