The authors have declared that no competing interests exist.
Conceived and designed the experiments: TA DA KK. Performed the experiments: TA DA. Analyzed the data: TA DA. Contributed reagents/materials/analysis tools: TA DA TR. Wrote the paper: TA DA KK. Obtained data permission: DA KK. Data preparation: TA DA.
Finding relevant publications is important for scientists, who must cope with exponentially growing volumes of scholarly material. Algorithms can help with this task, as they do for music, movie, and product recommendations. However, we know little about how well such algorithms perform with scholarly material. Here, we develop an algorithm, and an accompanying Python library, that implements a recommendation system based on the content of articles. Its design principles are to adapt to new content, provide near-real-time suggestions, and be open source. We tested the library on 15K posters from the Society for Neuroscience conference in 2015. Human-curated topics were used to cross-validate the parameters of the algorithm and to produce a similarity metric that maximally correlates with human judgments. We show that our algorithm significantly outperforms suggestions based on keywords. The work presented here promises to make the exploration of scholarly material faster and more accurate.
Recommendation systems are routinely used to suggest items based on customers’ past preferences. These systems have proven useful for music, movies, news, and retail in general [
There are multiple recommendation systems that either use the personal preferences of a new user (e.g., content-based recommendation) or exploit the similarity between a new user’s preferences and those of previous users (e.g., collaborative filtering). Most such systems are built for commercial applications, such as news [
Scientific literature is different from other applications of recommendation systems. Scientists are highly specialized, and they tend to become even more specialized as their careers progress. Understanding the differences among such fine-grained topics is therefore challenging. For example, the Society for Neuroscience conference has more than 500 hand-curated areas of research (
There have been several attempts at producing recommendation systems for scientific literature. In [
Here we introduce Science Concierge (
We tuned and tested the algorithm on a collection of scientific posters from the largest Neuroscience conference in the world, Society for Neuroscience (SfN) 2015. First, we cross-validated the LSA model to capture most of the variance contained in the topics. Second, we tuned the parameters of the algorithm to recommend posters that maximally resembled human curated classifications into poster sessions. We showed that our algorithm significantly outperformed a popular alternative based on keywords, improving suggestions further as it learned more from the user. A front-end interface that uses the algorithm in the back end is available at
Conferences are events that demand a rapid understanding of trends. The time between submission, acceptance, and presentation is typically much shorter than for journals. This makes it crucial for a recommendation system to quickly scan the documents of a conference and let scientists discover material fast. This is one reason we focus on testing our system with conference data.
We obtained a license from the Society for Neuroscience (SfN) for the data of its Neuroscience 2015 conference, the largest Neuroscience conference in the world. The dataset included 14,718 posters and talks distributed over 500 sessions spanning 5 days. Not all content of the conference had abstracts available (e.g., talks); such items were therefore dropped from the analysis. The dataset is not publicly available, but an academic license may be requested directly through
There are three main design principles behind Science Concierge. First, it uses the content of the documents rather than collaborative filtering, to prevent the Matthew effect commonly found in recommendation systems [
Documents are tokenized, English stop words are removed, and the remaining tokens are stemmed using the Porter stemming algorithm [
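As an illustration, a minimal preprocessing pipeline might look like the sketch below. It is not the library's implementation: the stop-word list is a small hand-picked set, and a toy suffix stripper stands in for the full Porter stemmer.

```python
import re

# Tiny illustrative stop-word list (the real list is much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "for"}

def simple_stem(token):
    """Toy suffix stripper standing in for the Porter stemmer."""
    for suffix in ("ing", "tion", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(abstract):
    """Lowercase, keep alphabetic tokens, drop stop words, and stem."""
    tokens = re.findall(r"[a-z]+", abstract.lower())
    return [simple_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Neurons in the visual cortex respond to oriented edges"))
# → ['neuron', 'visual', 'cortex', 'respond', 'orient', 'edg']
```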
These terms can be weighted in different ways by considering the local structure of each document relative to the global structure of all documents. Several weighting schemes exist in the literature, and we tried some of the most popular: simple term frequency, term frequency–inverse document frequency (tf-idf), and log-entropy [
Term frequency, tf-idf, and log-entropy can all be computed based on the following term statistics. Let
Finally, the log-entropy of term
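The three weighting schemes can be sketched on a toy term-document count matrix as follows. The exact normalization choices (natural log, no smoothing) are illustrative assumptions, not necessarily those of the library.

```python
import numpy as np

# Toy term-document count matrix: rows = documents, columns = terms.
tf = np.array([[3, 0, 1],
               [0, 2, 1],
               [1, 1, 0]], dtype=float)
n_docs, _ = tf.shape

# tf-idf: local term frequency times inverse document frequency.
df = (tf > 0).sum(axis=0)                 # document frequency per term
tfidf = tf * np.log(n_docs / df)

# log-entropy: log local weight times a global entropy weight.
gf = tf.sum(axis=0)                       # global frequency per term
p = np.where(tf > 0, tf / gf, 1.0)        # p_ij, with 0 log 0 := 0
entropy = 1 + np.where(tf > 0, p * np.log(p), 0.0).sum(axis=0) / np.log(n_docs)
log_entropy = np.log(tf + 1) * entropy

print(tfidf.round(2))
print(log_entropy.round(2))
```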
Additionally, we use an alternative representation of a document based on word vectors. The word vector representation [
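A common way to build a document representation from word vectors is to average the vectors of its tokens. The sketch below uses a random toy embedding and a three-word vocabulary purely for illustration; in practice trained embeddings (e.g., word2vec-style) would be used.

```python
import numpy as np

# Toy 4-dimensional embedding with random values (illustration only).
rng = np.random.default_rng(0)
vocab = {"neuron": 0, "cortex": 1, "synapse": 2}
embeddings = rng.random((len(vocab), 4))

def doc_vector(tokens):
    """Average the vectors of in-vocabulary tokens."""
    idx = [vocab[t] for t in tokens if t in vocab]
    return embeddings[idx].mean(axis=0) if idx else np.zeros(4)

v = doc_vector(["neuron", "cortex", "unknown"])
print(v.shape)  # (4,)
```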
LSA is a simple topic modeling technique based on singular value decomposition [
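A minimal LSA sketch via truncated SVD is shown below; a random matrix stands in for the weighted term-document matrix, and k = 5 is an arbitrary placeholder (the paper cross-validates the number of retained components on the poster data).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 50))          # stand-in for a weighted (docs x terms) matrix
k = 5                             # number of retained components

# Truncated SVD: keep the top-k singular directions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_vectors = U[:, :k] * s[:k]    # each row: a document in k-dim topic space

print(doc_vectors.shape)          # (20, 5)
```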
Even with the simple level of preprocessing provided by LSA, we can already start understanding how posters separate into distinct topical areas. To visualize this, we transform the LSA vectors using a two-dimensional reduction by
(A) Schematic of the workflow for converting abstracts into vector representations (see Algorithm 1) (B) Schematic of Rocchio Algorithm (C) Projection of SfN abstract vectors to 2-dimensional space using
The dataset used here also contained keywords. Searching by keywords is a popular way to discover topics at conferences. We compare our methods to this type of search. In this algorithm, the LSA analysis is applied to the keyword vector of each poster and all the rest of the analysis described below is the same.
The Rocchio algorithm is used to produce recommendations based on relevant and non-relevant documents previously voted by the user [
The parameters
Once the query vector
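Putting these pieces together, here is a sketch of a Rocchio-style preference vector plus a brute-force cosine nearest-neighbor lookup. The function names are hypothetical, and the library indexes nearest neighbors for speed rather than scanning all documents.

```python
import numpy as np

def rocchio_query(relevant, nonrelevant, alpha=1.0, beta=0.0):
    """Preference vector: weighted mean of liked minus disliked documents.

    relevant / nonrelevant: arrays whose rows are LSA document vectors.
    """
    query = alpha * relevant.mean(axis=0)
    if len(nonrelevant):
        query = query - beta * nonrelevant.mean(axis=0)
    return query

def suggest(doc_vectors, query, n=10):
    """Return indices of the n documents most cosine-similar to the query."""
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query)
    sims = doc_vectors @ query / np.where(norms == 0, 1.0, norms)
    return np.argsort(-sims)[:n]

docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
q = rocchio_query(docs[[0]], np.empty((0, 2)))
print(suggest(docs, q, n=2))  # → [0 1]
```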
Each poster contained in this dataset has been manually classified into a topic by the organizers of the conference. The topic classification is based on a tree with three levels. For example, a poster might be classified with topic
We implicitly assume that a random attendee is highly interested in only one topic and not in others. This means that users would tend to like posters from the same area. While this assumption may not hold for a large number of attendees, we believe it captures the intention behind the topic classification in this particular SfN conference. To measure the quality of suggestions, then, we ask the system to suggest ten posters based on a set of “liked” posters from a particular topic. For each suggestion, we compute the distance between its topic and the topic of the liked posters, using as the distance measure the lowest common ancestor in the topic tree [
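Assuming topic codes are dot-separated three-level strings (a hypothetical encoding of the area/subarea/session hierarchy), the lowest-common-ancestor distance can be sketched as:

```python
def topic_distance(t1, t2):
    """Levels below the lowest common ancestor of two topic codes.

    Codes like 'A.01.b' are a hypothetical encoding of the three-level tree.
    """
    p1, p2 = t1.split("."), t2.split(".")
    common = 0
    for a, b in zip(p1, p2):
        if a != b:
            break
        common += 1
    return len(p1) - common

print(topic_distance("A.01.b", "A.01.c"))  # 1: same subarea
print(topic_distance("A.01.b", "A.02.a"))  # 2: same area, different subarea
print(topic_distance("A.01.b", "B.01.a"))  # 3: different area
```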
1: Receive set of
2: Preprocess documents
3: Convert documents to matrix
4: Apply Latent Semantic Analysis
5: Index nearest neighbor model to
6: Compute user’s preference vector
7: Use indexed nearest neighbor model to suggest relevant documents
One parameter of the LSA algorithm is the number of components retained from the SVD [
The number of LSA components vs the average human curated tree distance of suggested posters.
The Rocchio algorithm differently weighs the importance of relevant documents with a parameter
In this experiment, we like one poster from a random topic and vote as non-relevant a poster from a different topic. We performed three experiments testing three distances for the non-relevant poster: distance 1 (one subdivision away), distance 2 (a different subarea), and distance 3 (a different area of study). For each of these experiments, we tried a grid of
Performance of the system as a function of the alpha and beta parameters, for non-relevant documents that are 1 distance away (A), 2 distances away (B), and 3 distances away (C) in human-curated topic space.
We found that voting as non-relevant those posters that are 1 distance away produces large effects on performance. Interestingly, we found that the parameter
In this section, we compare the performance of our algorithm against common alternative techniques and features. Our framework requires the use of abstracts, a significantly more complex data source than, for example, keywords. Using LSA sacrifices some recommendation speed but still captures closeness in human-curated distance after multiple votes. Moreover, using the full abstract can capture variability that is not available in simpler methods. To properly test the advantages of our approach, we cross-validate the performance of our method using human-curated topic distances as the performance metric (see section 2).
For each of the algorithms tested below, we do the following simulation to estimate their performances. For each run of a simulation, we pick a poster at random and vote it as relevant. Then, we ask the algorithm to suggest ten posters based on that vote. We compute the average distance in human curated topic space between the suggestions and the liked poster. We then vote for another poster randomly selected from the same human curated topic as the first poster. We again ask the algorithm to suggest a set of ten posters and again we compute the average distance. We repeat this process ten times to obtain the average distance to human curated topics as a function of the number of votes. This simulation will help us understand the performance of the algorithms as they gather more votes from a simulated user.
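The simulation described above can be sketched as follows; `recommend` and `topic_distance` are hypothetical stand-ins for the system under test and the tree distance.

```python
import random

def simulate_run(posters, recommend, topic_distance, n_votes=10, n_suggest=10):
    """One simulation run: repeatedly like posters from one random topic
    and track the mean topic distance of the suggestions after each vote.

    posters: dict mapping poster id -> topic code.
    recommend: callable(liked_ids, n) -> list of suggested poster ids.
    """
    topic = random.choice(sorted(set(posters.values())))
    pool = [p for p, t in posters.items() if t == topic]
    liked, distances = [], []
    for _ in range(n_votes):
        liked.append(random.choice(pool))          # vote one more poster
        suggestions = recommend(liked, n=n_suggest)
        mean_d = sum(topic_distance(posters[s], topic)
                     for s in suggestions) / n_suggest
        distances.append(mean_d)
    return distances
```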
As a baseline for all comparisons, we compute the performance of a null model that suggests posters at random. This is necessary because the distribution of human curated topics is not uniform and therefore the distances could be distributed in unexpected ways. Not surprisingly, the average distance to the human curated topic remained constant with the number of votes (
All term weighting schemes except keywords improve recommendations with votes.
Keywords are a common recommendation technique that relies on human-curated “tags” for each document. In our dataset (see section 2), the authors themselves assigned multiple keywords by picking them from a predefined set crafted by the organizers. We used these keywords to produce recommendations by treating them as documents: keyword counts were weighted with tf-idf and the resulting matrix was reduced to 30 SVD dimensions. Preliminary analysis revealed that 30 dimensions were enough to capture most of the variability. Suggestions using keywords perform significantly better than random suggestions (paired
Science Concierge uses the abstracts to automatically extract topics and provide suggestions. We found that Science Concierge produces significantly better suggestions (
Ideally, we want the distances between documents induced by an algorithm to closely match the distances induced by human-curated topics. Here, we tested this idea using the keyword model and the Science Concierge model. That is, we tested the correlation between the distance of two random documents under a model and the distance of those same documents in human-curated topic space. The word vector representation had the highest correlation with human topics (
Discovering new and relevant scholarly material increasingly requires automation. Such discovery is already possible for commercial content such as movies, news, and music. However, the same cannot be said for scientists, who commonly use legacy search systems that cannot learn from previous searches. In this article, we propose a system that improves its recommendations based on a scientist’s votes for relevant and irrelevant documents. We tested the system on a set of posters presented at the Society for Neuroscience 2015 conference. We found that our system significantly improves on suggestions based on author-assigned keywords, and that it learns as the user provides more votes. The system returns a complete schedule of posters to visit within 100 ms for a conference of around 15K posters. We publish the source code of the system so others can expand its functionality (
One surprising finding in our analysis is that posters voted as non-relevant were best left out of the final preference vector. In particular, when posters voted non-relevant were close in topic space to the liked posters, performance degraded. When those non-relevant posters were far away, performance remained unchanged. In the future, we want to expand the experiment to domains in which a large number of votes is cast. This may allow the algorithm to better understand topic preferences and therefore offer suggestions that exploit the knowledge built into non-relevant votes.
The topic modeling technique used in our algorithm assumes that topics live in a linear space. While this assumption provides significant computational advantages, there might be cases where more complex modeling approaches capture subtler topical relationships. In the future, we will try non-linear probabilistic approaches such as Latent Dirichlet Allocation (LDA), which has recently been shown to scale well [
It may appear that the averaging of preferences that our algorithm performs would mislead recommendations for scientists with multiple, disparate interests. However, we found in the usage data of the website that scientists tend to like a few documents, all within similar topics. More importantly, given the high-dimensional representation of the documents, averaging preferences from different clusters would still produce sensible recommendations. This happens because the nearest neighbor of the mean of two clusters is closer to those clusters than to a random document, a phenomenon that gets stronger with higher dimensions [
Future research could expand our system in many ways. The Rocchio algorithm’s dependency on nearest neighbors makes it ill-suited to exploiting the potential richness available in a large number of votes [
Our system proposes a new way to discover scholarly material. Many similar systems (e.g., Google Scholar) do not release their algorithms, making them difficult to study. By opening our algorithm, we hope to engage the scientific community in collaboration. Moreover, our algorithm need not be constrained to scientific text, as any document that can be represented as a vector can be fed into it (e.g., scientific figures, audio). In sum, our system provides an open and fast approach to accurately discover new research.
Titipat Achakulvisut was supported by the Royal Thai Government Scholarship grant #50AC002. Daniel E. Acuna was supported by the John Templeton Foundation grant to the Metaknowledge Research Network. We thank SfN for providing the abstract dataset for our tests.