COVIDScholar: An automated COVID-19 research aggregation and analysis platform

The ongoing COVID-19 pandemic has produced far-reaching effects throughout society, and science is no exception. The scale, speed, and breadth of the scientific community’s COVID-19 response led to the emergence of new research at the remarkable rate of more than 250 papers published per day. This posed a challenge for the scientific community, as traditional methods of engagement with the literature were strained by the volume of new research being produced. Meanwhile, the urgency of response led to an increasingly prominent role for preprint servers and a diffusion of relevant research through many channels simultaneously. These factors created a need for new tools to change the way scientific literature is organized and found by researchers. With this challenge in mind, we present an overview of COVIDScholar (https://covidscholar.org), an automated knowledge portal built on natural language processing (NLP) to meet these urgent needs. The search interface for this corpus of more than 260,000 research articles, patents, and clinical trials served more than 33,000 users, at an average of 2,000 monthly active users and a peak of more than 8,600 weekly active users in the summer of 2020. Additionally, we include an analysis of trends in COVID-19 research over the course of the pandemic, with a particular focus on the first 10 months, which represent a unique period of rapid worldwide shift in scientific attention.


Introduction
The scientific community has responded to the COVID-19 pandemic with unprecedented speed, and as a result an enormous amount of research literature is rapidly emerging, at a rate of over 250 papers a day [1]. The urgency and volume of emerging research have caused pre-prints to take a prominent role in lieu of traditional journals, leading to widespread usage of pre-print servers for the first time in many fields, most prominently the biomedical sciences [2][3]. While this allows new research to be disseminated to the community sooner, it also circumvents the role of journals in filtering poor or flawed papers and highlighting relevant research [4]. Additionally, the uniquely multi-disciplinary nature of the scientific community's response to the pandemic has led to pertinent research being dispersed across many open-access and pre-print services, no single one of which captures the entirety of the COVID-19 literature.
These challenges have created a need and an opportunity for new tools and methods to rethink the way in which researchers engage with the wealth of available COVID-19 scientific literature.
COVIDScholar is an effort to address these issues by using natural language processing (NLP) techniques to aggregate, analyze, and search the COVID-19 research literature. We have developed an automated, scalable infrastructure for scraping and integrating new research as it appears, and used it to construct a targeted corpus of over 81,000 scientific papers and documents pertinent to COVID-19 from a broad range of disciplines. The search interface for this corpus, https://covidscholar.org, now serves over 2,000 unique users weekly.
While a variety of other COVID-19 literature aggregation efforts exist [5,6,7], COVIDScholar differs in the breadth of literature collected. In addition to the biological and medical research collected by other large-scale aggregation efforts such as CORD-19 [6] and LitCOVID [7], COVIDScholar's collection covers the full breadth of COVID-19 research, including public health, behavioural science, the physical sciences, economics, psychology, and the humanities.
In this paper, we present a description of the COVIDScholar data intake pipeline and back-end infrastructure, and of the NLP models used to power directed searches on the front-end search portal. We also present an analysis of the COVIDScholar corpus and discuss trends in the dynamics of research output during the pandemic.

Data Pipeline & Infrastructure
At the heart of COVIDScholar is the automated data intake and processing pipeline, depicted in Fig. 1. Data sources are continually checked for new or updated papers, patents, and clinical trials, which are then parsed, cleaned, analyzed with NLP models, and made searchable on https://covidscholar.org.

The COVIDScholar research corpus consists of research literature from 14 different open-access and pre-print services, listed in Table 1. For each of these, a web scraper regularly checks for new documents and updates to existing ones.
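This intake step can be thought of as an idempotent upsert: each scraped record is hashed, and the store is only touched when a document is new or its content has changed, so only those documents are sent downstream for parsing and NLP analysis. A minimal sketch, with an in-memory dict standing in for the database and illustrative record fields (not the actual COVIDScholar schema):

```python
import hashlib
import json

def content_hash(record: dict) -> str:
    """Stable hash of a scraped record's content fields."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def upsert(store: dict, doc_id: str, record: dict) -> str:
    """Insert a new document, update a changed one, or skip an unchanged one.

    Returns "inserted", "updated", or "unchanged", so the pipeline can
    decide whether this document needs to be re-parsed and re-analyzed.
    """
    digest = content_hash(record)
    existing = store.get(doc_id)
    if existing is None:
        store[doc_id] = {"record": record, "hash": digest}
        return "inserted"
    if existing["hash"] != digest:
        store[doc_id] = {"record": record, "hash": digest}
        return "updated"
    return "unchanged"
```

Because unchanged documents are skipped, re-scraping a source in full remains cheap even when it is polled frequently.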
Our web portal, COVIDScholar.org, provides an accessible user interface to a variety of literature search tools and information retrieval algorithms tuned specifically for the needs of COVID-19 researchers. Because a great deal remains unknown about the disease, we have directed our efforts towards developing tools that extend beyond information retrieval and aid researchers at the knowledge discovery phase as well. To do this, we have combined new machine learning and natural language processing techniques with proven information retrieval approaches to create the search algorithms behind COVIDScholar, which we describe in the remainder of this section.
Machine learning algorithms can be used to identify emerging trends in the literature and correlate them with similar patterns from pre-existing research.
For this reason, we chose to base our search back end on the Vespa engine [22], which provides a high level of performance, wide scalability, and easy integration with custom machine learning models. For example, the default search result ranking profile on COVIDScholar.org combines the BM25 relevance score with a "COVID-19 relevance" score calculated by a classification model trained to predict whether a paper discusses the SARS-CoV-2 virus or COVID-19. We observe that papers from before the COVID-19 pandemic that are related to certain viruses and diseases tend to receive high relevance scores, especially papers on the original SARS and other respiratory diseases. SARS-CoV-2 shares 79% of its genome sequence identity with the SARS-CoV virus [23], and there are many similarities between how the two viruses enter cells, replicate, and transmit between hosts [24]. Because the relevance classification model gives a higher score to studies on these similar diseases, search results are more likely to contain relevant information, even if it is not directly focused on COVID-19.
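Conceptually, this ranking is a weighted blend of a lexical BM25 score and the classifier's predicted relevance. The toy sketch below is illustrative only: `covid_relevance` is a keyword stub standing in for the trained model, and the blend weight is not Vespa's actual rank profile.

```python
import math

def bm25_score(query, doc_tokens, corpus, k1=1.5, b=0.75):
    """Simplified BM25: sum of IDF-weighted term-frequency scores."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf = doc_tokens.count(term)
        norm = tf + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score

def covid_relevance(doc_tokens):
    """Stub for the trained classifier: a keyword heuristic standing in
    for the model's predicted probability that a paper concerns COVID-19."""
    hits = {"covid-19", "sars-cov-2", "sars-cov", "coronavirus"}
    return 1.0 if hits & set(doc_tokens) else 0.0

def combined_rank(query, corpus, alpha=0.7):
    """Rank document indices by a weighted blend of the two scores."""
    scored = []
    for i, doc in enumerate(corpus):
        s = alpha * bm25_score(query, doc, corpus) + (1 - alpha) * covid_relevance(doc)
        scored.append((s, i))
    return [i for s, i in sorted(scored, reverse=True)]
```

Under this blend, two papers with identical lexical overlap with the query are separated by the relevance model, which is the behavior described above for pre-pandemic SARS-CoV papers.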
For example, the transmembrane protease TMPRSS2 plays an important role in viral entry and spread for both SARS-CoV and SARS-CoV-2, and its inhibition is a promising avenue for treating COVID-19 [25]. A wealth of information on strategies to inhibit TMPRSS2 activity, and on their efficacy in blocking SARS-CoV from entering host cells, was available in the early days of the COVID-19 pandemic. These studies were boosted in search results because of their higher relevance scores, thereby bringing potentially useful information to the attention of researchers more directly. In comparison, results of a Google Scholar search for "TMPRSS2" (with results containing "COVID-19" and "SARS-CoV-2" filtered out) are dominated by studies on the protease's role in various cancers.
COVIDScholar also provides tools that utilize unsupervised document embeddings so that searches can be performed within "related documents", automatically linking research papers together by topics, methods, drugs, and other key pieces of information. Documents are sorted by similarity via the cosine distances between unsupervised document embeddings [26], which is then combined with the overall result-ranking score mentioned above. This allows users to focus their results on a more specific domain without having to repeatedly pick and choose new search terms to add to their queries. Users can also filter all of the documents in the database by broader subjects relevant to COVID-19 (treatment, transmission, case reports, etc.), which are determined through the application of machine learning models trained on a smaller number of hand-labeled examples. Combined, these tools have allowed us to create much more targeted tools for literature search and knowledge discovery than would otherwise be possible.
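The "related documents" re-ranking can be sketched as cosine similarity to a seed paper blended with each document's existing ranking score. The tiny hand-made vectors and weight below are illustrative stand-ins for the learned document embeddings and the production scoring profile:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def related_documents(seed_id, embeddings, base_scores, weight=0.5, top_k=3):
    """Re-rank documents by blending embedding similarity to a seed paper
    with each document's existing result-ranking score."""
    seed = embeddings[seed_id]
    scored = []
    for doc_id, vec in embeddings.items():
        if doc_id == seed_id:
            continue
        blended = (weight * cosine_similarity(seed, vec)
                   + (1 - weight) * base_scores[doc_id])
        scored.append((blended, doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:top_k]]
```

Because the blend keeps the base ranking score in play, a document that is merely close in embedding space but scored poorly by the main ranker does not dominate the related-documents list.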

Text Analysis NLP Models
Classification of abstracts is performed using a fine-tuned SciBERT [27] model.
While other BERT models pre-trained on scientific text exist (e.g. BioBERT [28], MedBERT [29], and ClinicalBERT [30]), we select SciBERT due to its broad, multidisciplinary training corpus, which we expect to more closely resemble the COVIDScholar corpus than corpora drawn from a single discipline. SciBERT has state-of-the-art performance on the task of paper domain classification [31], as well as on a number of biomedical-domain benchmarks [32,33,34], biomedicine being the most common discipline in the COVIDScholar corpus. A single fully-connected layer with sigmoid activation is used as a classification head, and the model is fine-tuned for 4 epochs using 2,600 human-annotated abstracts. ROC curves for the classifier's performance on each top-level discipline, computed using 20-fold cross-validation, are shown in Fig. 2. It is also of note that, while precision is broadly similar between the two models, the baseline model exhibits significantly lower recall in each case. This may be due to unbalanced training data: no single discipline accounts for more than 33% of the total corpus. For search applications, often a relatively small number of documents is relevant to each query. In this case, high recall is more desirable than high precision; in practice, the performance gap between the two models is larger than indicated by relative F1 scores.
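The precision/recall trade-off above is easy to make concrete. The helper below is hand-rolled for illustration (not taken from the COVIDScholar codebase) and shows how two classifiers with identical precision can differ sharply in F1 through recall alone:

```python
def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1 from parallel 0/1 label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, on ground truth `[1,1,1,1,0,0]`, a model predicting `[1,1,1,1,0,0]` and one predicting `[1,1,0,0,0,0]` both have precision 1.0, but the second misses half of the relevant documents, which for a search application means half of the relevant papers never reach the user.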
On the task of binary classification as related to COVID-19, our current models also perform well; detailed results are given below. For the task of unsupervised keyword extraction, 63 abstracts were annotated by humans, and two statistical methods, TextRank [36] and TF-IDF [37], and two graph-based models, RaKUn [38] and Yake [39], were tested. Models were evaluated for overlap between human-annotated keywords and extracted keywords, and results are shown in Table 3.
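Of the extractors tested, TF-IDF is the simplest to sketch: each term in an abstract is scored by its in-document frequency, weighted by how rare the term is across the corpus, and the top-scoring terms are kept as keywords. A minimal stdlib version (tokenization and smoothing choices here are simplified relative to the implementation in [37]):

```python
import math
from collections import Counter

def tfidf_keywords(abstracts, doc_index, top_k=3):
    """Return the top_k highest TF-IDF terms of abstracts[doc_index]."""
    tokenized = [a.lower().split() for a in abstracts]
    n_docs = len(tokenized)
    # Document frequency: in how many abstracts does each term appear?
    df = Counter()
    for doc in tokenized:
        df.update(set(doc))
    doc = tokenized[doc_index]
    tf = Counter(doc)
    scores = {
        term: (count / len(doc)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]
```

Terms that appear in every abstract (e.g. "the") get an IDF of zero and are automatically suppressed, which is the main appeal of TF-IDF as a zero-training baseline.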
To better visualize the embeddings of COVID-19-related phrases and find latent relationships between biomedical terms, we designed a tool based on the Embedding Projector [40]. A screenshot of the tool is shown in Fig. 3. We utilize FastText [41] embeddings for the embedding projector, with an embedding dimension of 100. Embeddings are trained on the abstracts of all papers classified as relevant to COVID-19.
For the purpose of visualization, embeddings must be projected to a lower-dimensional space (2D or 3D). The dimensionality reduction techniques offered include principal component analysis (PCA), uniform manifold approximation and projection (UMAP), and t-distributed stochastic neighbor embedding (t-SNE). Users can set various parameters and perform the dimensionality reduction via an interactive page. They can also load and visualize results cached on the server with default parameters.

Fig. 3: A screenshot of the embedding projector visualizing tokens similar to "spike protein", using FastText [41] embeddings trained on the COVIDScholar corpus.
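Of the three techniques, PCA is the simplest to sketch end-to-end: center the points, form the covariance matrix, and project onto the leading eigenvector, found here by power iteration. This is an illustrative 1-D version; a production tool relies on optimized implementations and projects to 2D or 3D.

```python
import random

def pca_project_1d(points, iters=200, seed=0):
    """Project points onto their leading principal component (power iteration)."""
    n, dim = len(points), len(points[0])
    # Center the data.
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - means[j] for j in range(dim)] for p in points]
    # Covariance matrix (dim x dim).
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(dim)]
           for i in range(dim)]
    # Power iteration for the dominant eigenvector.
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(dim)]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Coordinates of each (centered) point along that direction.
    return [sum(c[j] * v[j] for j in range(dim)) for c in centered]
```

For points that vary mostly along one axis, the recovered component aligns with that axis (up to sign), so the 1-D coordinates preserve the points' ordering along it.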
Cosine distance is used to measure the similarity between phrases. If the cosine distance between two phrases is small, they are likely to have similar meaning.

Cosine Distance

$d(p_1, p_2) = 1 - \frac{\mathrm{Emb}(p_1) \cdot \mathrm{Emb}(p_2)}{\|\mathrm{Emb}(p_1)\| \, \|\mathrm{Emb}(p_2)\|}$

where $p_1$, $p_2$ represent two phrases and $\mathrm{Emb}$ maps phrases to their embedded representation in the learned semantic space.
4 COVIDScholar Corpus Analysis

A breakdown of research by discipline over the course of 2020 is shown in Fig. 5, which depicts the fraction of monthly COVID-19 publications primarily associated with each discipline. From January to April, the relative popularity of disciplines showed some shifts. While Biological and Chemical Sciences comprised 45% of the total corpus in January, by April that had decreased to 28%. This is largely accounted for by an increase in papers from the Physical and Medical Sciences: over the same period, the fraction of papers from the Medical Sciences increased from 15% to 20% of the total, and the Physical Sciences from 5% to 8%.
By April, the fraction of the corpus from each discipline seems to have stabilized, with fluctuations in relative fractions of under 1%. This further supports the evidence in Fig. 4 that research output had already reached its maximum rate by April/May; this seems to hold true on a discipline-by-discipline basis as well.
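The discipline fractions plotted in Fig. 5 reduce to a simple computation over per-paper (month, discipline) labels. A minimal sketch, with made-up records (the field layout is illustrative, not the actual COVIDScholar schema):

```python
from collections import Counter, defaultdict

def monthly_discipline_fractions(papers):
    """papers: iterable of (month, discipline) pairs.
    Returns {month: {discipline: fraction of that month's papers}}."""
    counts = defaultdict(Counter)
    for month, discipline in papers:
        counts[month][discipline] += 1
    fractions = {}
    for month, c in counts.items():
        total = sum(c.values())
        fractions[month] = {d: n / total for d, n in c.items()}
    return fractions
```

Normalizing within each month, rather than plotting raw counts, is what makes the relative shift between disciplines visible even while total monthly output is growing.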
We investigate this increase in Fig. 6, where we have plotted the fraction of total monthly papers on selected mental health- and lockdown-related topics. Over the April-June period, there is a clear increase in research related to the psychological impacts of lockdown and social distancing, accounting for 6-8% of total monthly papers. Between March and April, many countries and territories instituted lockdown orders, and by April over half of the world's population was under either compulsory or recommended shelter-in-place orders [46]. The corresponding emergence of a robust literature on the associated psychological impacts is the major driving force behind the increase in COVID-19 literature from the Humanities & Social Sciences.

Summary and Future Work
We have developed and implemented a scalable research aggregation, analysis, and dissemination infrastructure, and created a targeted corpus of over 81,000 COVID-19-related scientific papers and documents. While to date the COVIDScholar research corpus has primarily been used for front-end user search, it provides a rich opportunity for NLP analysis. Recent work [47] has highlighted the ability of NLP to discover latent knowledge in unstructured scientific text, utilizing information from thousands of research papers. We are now moving to employ similar techniques here, applied to such problems as drug re-purposing and predicting protein-protein interactions.

Fig. 1: The data pipeline used to construct the COVIDScholar research corpus.
For several disciplines, the SciBERT F1 scores are markedly higher. These are the broadest disciplines, encompassing multiple disparate fields. The large variability of subjects within these domains may account for the inability of TF-IDF-based models to classify them well. For the remaining two disciplines, Biological & Chemical Sciences and Public Health, the F1 scores are similar between SciBERT and the baseline model. In the case of Biological & Chemical Sciences, this may be explained by the relatively distinctive vocabulary and narrow subjects within the discipline. Public Health was observed to have the largest inter-annotator disagreement, leading to lower performance by the classifier.
On the task of binary classification as related to COVID-19, our models perform similarly well, achieving an F1 score of 0.98. While the binary classification task is significantly simpler from an NLP perspective (the majority of related papers contain "COVID-19" or some synonym), this still represents a significant performance improvement over the baseline model, which achieves an F1 score of 0.90. Given the relative simplicity of this task, in cases where an abstract is absent we classify the paper as related to COVID-19 based on its title.

Fig. 2: ROC curves for discipline classification of paper abstracts using a fine-tuned SciBERT [27] model adapted for classification. Training is performed using a set of 2500 human-annotated abstracts, and results shown are generated with 20-fold cross-validation.

Fig. 4: Cumulative count by primary discipline of COVID-19 papers in the COVIDScholar database, and total number of reported US COVID-19 cases, during the first 10 months of 2020. Papers are categorized by the classification model described in Sec. 3 and assigned to the discipline with the highest predicted likelihood. Case data from The New York Times, based on reports from state and local health agencies. Note that only those papers with abstracts available are classified, and so the publication count is somewhat lower than the total from Sec. 4.1.

Fig. 6: Fraction of COVID-19 literature on mental health- and lockdown-related topics on a monthly basis.

6 Acknowledgements
Portions of this work were supported by the C3.ai Digital Transformation Institute and by the Laboratory Directed Research and Development Program of Lawrence Berkeley National Laboratory under U.S. Department of Energy Contract No. DE-AC02-05CH11231. The text corpus analysis and development of machine learning algorithms were supported by the DOE Office of Science through the National Virtual Biotechnology Laboratory, a consortium of DOE national laboratories focused on response to COVID-19, with funding provided by the Coronavirus CARES Act. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231.

Table 1 :
The sources of papers, patents, and clinical trials in the COVIDScholar collection, with the count of COVID-19-related publications from each source.

Table 2 :
Scoring metrics of SciBERT [27] and baseline random forest discipline classification models. Models were evaluated using 10-fold cross-validation on 2,600 labeled abstracts. Input features to the random forest model were generated using TF-IDF.

The classifier performs well, with F1 scores above 0.73 for all disciplines. Performance metrics of the discipline classifier are displayed in Table 2, compared to a baseline random forest model using TF-IDF features.

Table 3 :
Precision, recall, and F1 scores for 4 unsupervised keyword extractors: RaKUn [38], Yake [39], TextRank [36], and TF-IDF [37]. Output from keyword extractors was compared to 63 abstracts with human-annotated keywords.

Note that, due to the inherent subjectivity of the keyword extraction task, scores are relatively low; the best-performing model, RaKUn, has an F1 score of only 0.2. However, the quality of extracted keywords from this model was deemed reasonable for display on the search portal after manual inspection.

Table 4 :
The number of papers and fraction of total COVID-19-related papers in the COVIDScholar corpus for each discipline. Only papers with abstracts are classified and included in the final count. Note that a given paper may have any number of discipline labels.

As of October 2020, the COVIDScholar corpus consists of 150,113 total documents, of which 143,887 are papers. The remainder is composed of 3,306 patents, 1,712 clinical trials, 1,025 book chapters, and 183 datasets. Of the papers, 81,106 are classified as related to COVID-19, approximately equally split between pre-prints and published papers: 44% pre-prints, 56% published. A breakdown by discipline of the COVID-19-relevant papers is shown in Table 4.