
A novel NIH research grant recommender using BERT

  • Jie Zhu ,

    Contributed equally to this work with: Jie Zhu, Braja Gopal Patra

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center, Houston, Texas, United States of America

  • Braja Gopal Patra ,

    Contributed equally to this work with: Jie Zhu, Braja Gopal Patra

    Roles Conceptualization, Data curation, Investigation, Software, Writing – review & editing

    Affiliation Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, New York, United States of America

  • Hulin Wu,

    Roles Conceptualization, Resources, Supervision

    Affiliation Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center, Houston, Texas, United States of America

  • Ashraf Yaseen

    Roles Conceptualization, Methodology, Project administration, Resources, Supervision, Writing – review & editing

    Ashraf.Yaseen@uth.tmc.edu

    Affiliation Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center, Houston, Texas, United States of America

Abstract

Research grants are important for researchers to sustain a good position in academia. Many grant opportunities are available from different funding agencies; however, finding relevant grant announcements is challenging and time-consuming for researchers. To address this problem, we proposed a grant announcement recommendation system for National Institutes of Health (NIH) grants based on researchers’ publications. We formulated the recommendation as a classification problem and proposed a recommender using a state-of-the-art deep learning technique, Bidirectional Encoder Representations from Transformers (BERT), to capture the intrinsic, non-linear relationships between researchers’ publications and grant announcements. Internal and external evaluations were conducted to assess the system’s usefulness. In the internal evaluation, grant citations were used to establish the grant-publication ground truth, and results were evaluated with Recall@k, Precision@k, Mean Reciprocal Rank (MRR), and Area Under the Receiver Operating Characteristic curve (ROC-AUC). In the external evaluation, researchers’ publications were clustered using a Dirichlet Process Mixture Model (DPMM), grants recommended by our model were aggregated per cluster through a Recency Weight, and researchers were invited to rate the recommendations so that Precision@k could be calculated. For comparison, baseline recommenders using Okapi Best Matching 25 (BM25), Term Frequency-Inverse Document Frequency (TF-IDF), doc2vec, and Naïve Bayes (NB) were also developed. Both internal and external evaluations (on all metrics) revealed favorable performance of our proposed BERT-based recommender.

Introduction

The importance of recommendation systems can be seen from their daily use in recommending movies, books, videos, news, products, and so on. A typical recommender works by analytically modeling a user’s behavior based on past preferences/statistics. Recommenders can be broadly grouped into content-based, collaborative filtering, and hybrid systems [1]. Given their useful applications in several areas, extending recommenders to scholarly resources, such as recommending grants to researchers, would be beneficial.

Acquisition of research grants is essential for researchers to conduct research in academia. Several funding opportunities are available to help researchers innovate and implement their ideas. These opportunities normally come from different government and private sources such as NIH, the National Science Foundation, Microsoft, and many more. However, searching for relevant grant announcements in a large database is a difficult and exhausting process for researchers.

There is currently a commercial website named SPIN [2] that lists all the grants available in the USA. However, manual searches in SPIN revealed that the performance of its search engine is quite poor: it can only handle very limited queries and is only useful when researchers know exactly what they are looking for. To the best of our knowledge, research dedicated to recommending research grant opportunities is very limited. We were able to find only two studies [3,4], which were restricted to using keywords and association rules for grant opportunities in Japan, and a recent one [5] based on TF-IDF with a Random Forest and the Rocchio algorithm. We did, however, find studies on other scholarly resources such as literature [6–9], collaborators [10–13], and datasets [14,15] that utilized deep learning techniques such as transformers.

Considering this research gap and the outstanding performance of deep learning models on other academic recommendation tasks such as citation/paper and dataset recommendation, we proposed a novel research grant recommender based on the state-of-the-art BERT model. The main contributions of our work are:

  • We are the first to introduce a grant recommender that utilizes an advanced, state-of-the-art natural language model, i.e., BERT, to capture the intrinsic, non-linear relationships between researchers and grant opportunities.
  • Complementary to our main model architecture, we additionally introduced the DPMM clustering algorithm with Recency Weight aggregation for practical application/service purposes.
  • We crawled data suitable for real-world applications: publications from PubMed and NIH grant opportunities from grants.gov. The current web-based application for our recommender is available at http://genestudy.org/recommends/#/grants, giving our research practical use. This also allowed us to collect feedback/ratings from end users for an external evaluation of the system.

The rest of the article is organized as follows: Related work summarizes the literature on grant recommendations as well as BERT-based recommenders. An overview of the collected grants and publications is provided in the Data section. Methods used for developing the recommendation system and the evaluations used in experiments are described in the Methods section. Experimental results and detailed analysis are presented in the Results section. Finally, conclusions, discussions, and future directions appear in Conclusions and discussions. The overall research methodology is summarized in Fig 1.

Related work

Literature on grant recommendations is very limited. Kamada et al. [3,4] developed a Japanese grant recommender using keywords and association rules between researchers and grants, and further extended the system with the TF-IDF technique. Another system, EILEEN [5], also adopted TF-IDF with Latent Semantic Analysis for topic extraction and used the Rocchio algorithm and Random Forest to predict potential matches between grants and publications.

In addition to our work in [13–17], we located studies on other academic recommendations and BERT-based recommenders related to our research. Patra et al. [16] experimented with information retrieval paradigms (BM25, TF-IDF, etc.) for recommending Gene Expression Omnibus data to researchers. Zhu et al. [13] utilized graph neural networks to capture the intrinsic, complex, and changing dependencies among researchers for dynamic collaborator recommendation. Regarding BERT-based systems, Zhu et al. [15] developed a BERT-based recommender to recommend publicly available papers to researchers. Later, Zhu et al. [17] performed a sensitivity analysis of training class imbalance in a BERT-based dataset recommendation system. Bilal et al. [18] used a BERT classifier along with three bag-of-words-based classifiers to recommend helpful online reviews on Yelp datasets. Jeong et al. [19] combined graph convolutional networks with BERT representations of textual data to generate context-aware paper recommendations. Dai et al. [20] introduced a two-stage COVID-19 paper citation recommender, enhancing BERT representation learning in the first stage and learning effective dense vectors of nodes in a bibliographic graph through heterogeneous deep graph convolutional networks in the second. Hassan et al. [21] compared several popular encoder models, including USE, BERT, InferSent, ELMo, and SciBERT, and found that semantic information from these models alone did not outperform BM25 for paper recommendations. Yang et al. [22] proposed a semi-supervised research literature and researcher recommendation system using BERT for keyword extraction and Latent Dirichlet Allocation for topic representations.

Data

The proposed grant recommendation system requires data describing grant announcements and researchers. Grant announcement data were collected from GRANTS.GOV and the NIH website [23], and researcher data were created from publications in PubMed. Data collection methods and summaries of the data are described next.

Researcher publications

Published articles downloaded from the PubMed database were used to represent researchers. We were particularly interested in the articles’ IDs, titles, abstracts, and dates of publication. A total of 193,592 records were created. An example PubMed article can be found in Fig 2, and a basic word count summary can be found in Table 1.

Research funding announcements (RFAs)

We crawled GRANTS.GOV because of its comprehensive metadata and neatly parsed texts. For our experiments, we were interested in RFA IDs, titles, and descriptions. Since we focused on the biomedical domain, we kept only RFAs from NIH, for a total of 5,030 grant announcements. An example of a grant’s details can be found in Fig 3, and a basic word count summary can be found in Table 2.

Ground truth establishment

The relationships between PubMed articles and RFAs were established via NIH’s ExPORTER [24], which archives relations between publications and project numbers of funded grants, as well as relations between project numbers and corresponding RFAs. Using these two relationships, we could establish the relations between publications and RFAs for evaluation. These relations were then processed into a citation dictionary with each entry recorded as {‘1287764’: [‘PAR-17-095’, ‘PAR-12-298’]}, where ‘1287764’ is the PubMed Identifier (PMID) [25], and ‘PAR-17-095’ and ‘PAR-12-298’ are the two RFA IDs associated with this publication. An example of such relationships is provided in Fig 4.
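The composition of the two ExPORTER link tables into such a citation dictionary can be sketched as follows (the function name and the toy project numbers are ours, for illustration only):

```python
from collections import defaultdict

def build_citation_dict(pub_to_projects, project_to_rfa):
    """Compose the two ExPORTER link tables (publication -> project numbers,
    project number -> RFA id) into a PMID -> RFA-id citation dictionary."""
    citation = defaultdict(set)
    for pmid, projects in pub_to_projects.items():
        for proj in projects:
            if proj in project_to_rfa:
                citation[pmid].add(project_to_rfa[proj])
    return {pmid: sorted(rfas) for pmid, rfas in citation.items()}

# Toy link tables (the project numbers below are made up for illustration)
pub_to_projects = {"1287764": ["R01-AAA", "R01-BBB"]}
project_to_rfa = {"R01-AAA": "PAR-17-095", "R01-BBB": "PAR-12-298"}
print(build_citation_dict(pub_to_projects, project_to_rfa))
# {'1287764': ['PAR-12-298', 'PAR-17-095']}
```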

Fig 4. Relations between publications and RFAs through project number.

https://doi.org/10.1371/journal.pone.0278636.g004

We excluded papers that cite too many project numbers (usually survey papers) and limited our final datasets to 193,952 unique papers and 3,678 RFAs.

For training purposes, our proposed method needs both positive (ground truth) and negative (not related) training pairs. Positive training pairs were created from the existing relations; negative ones were created by random sampling: all possible combinations of publications and RFAs were created first, positive pairs were excluded from the pool, and finally an equal number of negative pairs was selected. The composite dataset was split on unique publications with ratios 7:1:2 for training, validation, and testing; see summaries in Table 3.
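A minimal sketch of this balanced pair construction (function and variable names are ours; we use rejection sampling rather than materializing the full publication × RFA pool, which is equivalent in effect but memory-friendly):

```python
import random

def make_training_pairs(positive_pairs, pmids, rfa_ids, seed=42):
    """Build a balanced set of labeled (publication, RFA) pairs: label 1 for
    ground-truth pairs, label 0 for an equal number of randomly sampled
    non-related pairs."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    negatives = set()
    while len(negatives) < len(positives):
        pair = (rng.choice(pmids), rng.choice(rfa_ids))
        if pair not in positives:  # exclude ground-truth pairs from negatives
            negatives.add(pair)
    return [(p, r, 1) for p, r in positives] + [(p, r, 0) for p, r in negatives]
```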

Methods

The overview of the system architecture is outlined in Fig 5. The grant announcement recommendation system developed in this work is part of our Virtual Research Assistant (VRA) project (http://genestudy.org/recommends/#/), a scholarly recommender platform developed at the Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston.

Fig 5. Overview of our grant recommender system architecture.

https://doi.org/10.1371/journal.pone.0278636.g005

There are two main components in our recommender: the offline training component on the bottom, where our model is trained and evaluated against the RFA-publication relationships, and the online test/service component on the top, where researchers/end users submit their information (CVs) and the trained model provides recommendations. The recommendations are presented in clusters (by clustering publications and aggregating recommendations per cluster). These aggregated results are then rated by the researchers/end users.

All implementation details can be found at https://github.com/ashraf-yaseen/VRA/tree/master/grants_rec. Below we introduce the main model components and evaluations in detail.

Models

Baselines: IR and NB.

We built two sets of baseline systems: Information Retrieval (IR)-based and classifier-based. The three IR-based systems utilize Term Frequency-Inverse Document Frequency (TF-IDF), BM25, and doc2vec, respectively; the classifier-based system is a Naïve Bayes (NB) classifier combined with the best-performing (on validation data) of the three IR techniques.

  • TF-IDF: a numerical statistic representing how important a word is to a document in a collection or corpus [26]. For each vocabulary term V, the value increases proportionally to the number of times V appears in the document (term frequency, TF) and is offset by the number of documents that contain V (inverse document frequency, IDF). We used the TF-IDF implementation from scikit-learn [27].
  • BM25: a ranking function based on a probabilistic retrieval framework that utilizes adjusted values of TF and IDF as well as document length [28]. We used the BM25 implementation from gensim [29].
  • doc2vec: an unsupervised neural network that generalizes word2vec and learns continuous distributed vector representations for variable-length pieces of text [30,31]. We utilized the doc2vec implementation in gensim [29].
  • NB: a probabilistic classifier that applies Bayes’ theorem with strong (naïve) independence assumptions between the features given the class variable. It is widely used in document classification tasks (e.g., email spam detection) due to its simplicity and desirable performance. We used the implementation from scikit-learn [27].

For TF-IDF, BM25, and doc2vec, the whole set of RFAs was used as the retrieval corpus, and publications were used as queries to find the best-matching RFAs by cosine similarity. For NB, we chose the best-performing IR technique on validation data for vector representation and then modeled the vectors under the classification labels with a multinomial distribution.
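As an illustration of this retrieval setup, a minimal TF-IDF sketch with scikit-learn (the texts below are toy examples; max_features = 2000 follows the tuned value reported in System parameters):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: RFA texts form the retrieval corpus, publications are queries
rfas = ["bioinformatics methods for gene expression analysis",
        "community programs for heart disease prevention"]
pubs = ["clustering genes from expression microarray data"]

vec = TfidfVectorizer(max_features=2000)  # tuned value from System parameters
rfa_mat = vec.fit_transform(rfas)         # corpus vectors
pub_mat = vec.transform(pubs)             # query vectors
sims = cosine_similarity(pub_mat, rfa_mat)
best = sims.argmax(axis=1)                # index of best-matching RFA per query
```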

All training parameters can be found in the System parameters section.

Proposed method: BERT-based classifier recommender with DPMM and Recency Weight.

During initial explorations, we observed that the words in publications and RFAs were not at the same semantic level. For example, more specific terms such as ‘clustering genes’ and ‘protein analysis’ were present in the publications, whereas the corresponding RFAs contained more generic words such as ‘bioinformatics’. Thus, we proposed a classifier recommender using Bidirectional Encoder Representations from Transformers (BERT) to better capture this relationship.

BERT [32] was developed by Google and pre-trained on the 800M-word BooksCorpus [33] and the 2,500M-word English Wikipedia [34] using masked language modeling and next sentence prediction as pre-training objectives. It is known for capturing logical and non-linear information in complex text inputs and had previously achieved state-of-the-art performance on many classical NLP tasks.

The goal of the system is to predict whether a particular RFA and a particular publication, and ultimately an RFA and a particular researcher, are a match. To achieve this, we followed a two-stage process. In the first stage, we fine-tuned the base-BERT model on a sentence-pair classification task, where a sentence pair is defined as “(title and abstract of the publication, title and description of the RFA)”. We truncated both inputs at 256 tokens (512 total) with the WordPiece tokenizer [35]; see Fig 6. The output logits were then converted to probabilities for aggregating and ranking results. We used Hugging Face’s Transformers implementation [36] of base-BERT and tuned the model architecture with Ax Bayesian Optimization [37], with the final tuned parameters summarized in System parameters.
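The logits-to-probability conversion and ranking step can be sketched as follows (the logit values are hypothetical outputs of the fine-tuned classifier for one publication scored against four candidate RFAs; columns are [no-match, match]):

```python
import numpy as np

# Hypothetical sentence-pair classifier logits: rows = candidate RFAs,
# columns = [no-match, match]
logits = np.array([[ 1.2, -0.3],
                   [-0.8,  2.1],
                   [ 0.1,  0.4],
                   [ 2.0, -1.5]])
exp = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
probs = (exp / exp.sum(axis=1, keepdims=True))[:, 1]      # Pr(match) per RFA
ranked = np.argsort(-probs)  # candidate RFA indices, best match first
```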

In the second stage, a particular researcher’s publications are clustered using a Dirichlet Process Mixture Model (DPMM), all RFA-publication results are aggregated per cluster using a Recency Weight, and final recommendations are made per research cluster for each researcher.

DPMM is an iterative non-parametric clustering algorithm that offers flexibility in producing a varying number of clusters [38] (which suits the practical needs of our service, since researchers are intrinsically different and have varying publication histories), scalability, robustness to outliers [39], and a proven record of success in document clustering tasks [16,39,40].

We start from a finite mixture model, in which each data point is drawn from one of K fixed unknown distributions with parameters θ1,…,θK. Since the number of clusters is unknown, we assume that each data point xn follows a general mixture model in which the parameters are generated from a distribution G [41]. The Dirichlet Process (DP) is a stochastic process that generalizes the Dirichlet distribution from being the conjugate prior for a fixed number of categories (multinomial) to the prior for infinitely many categories [38]; it is characterized by a positive scaling parameter α and a base distribution G0. Assigning a DP prior to G in the general mixture model leads to the DPMM [42]. The α value is inversely related to the number of clusters, i.e., decreasing α in the DPMM may increase the number of output clusters. In our case, based on manual observation of the clusters and feedback from researchers [16], α was set empirically as a function of N, where N is the total number of papers for a researcher.
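As an illustration of this clustering step, a minimal sketch using scikit-learn’s BayesianGaussianMixture, whose truncated Dirichlet-process prior approximates a DPMM (the paper does not specify its DPMM implementation or the exact α formula, so the embeddings and prior value below are stand-ins):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Stand-in publication embeddings: two well-separated research themes
X = np.vstack([rng.normal(0.0, 0.3, (20, 5)), rng.normal(4.0, 0.3, (15, 5))])

dpmm = BayesianGaussianMixture(
    n_components=10,  # truncation level, not the final number of clusters
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.3,  # placeholder alpha; paper's formula not shown
    random_state=0,
).fit(X)
labels = dpmm.predict(X)  # effective clusters emerge from the DP prior
```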

The complete process is as follows. Publications of a particular researcher (call him/her B) and all available RFAs were paired and fed into our trained model to predict the matching probability. We then took the pairs with ‘positive’ predictions (Pr(+) > 0.5) and used the probability as the initial matching score msji of a particular RFA (j) to a particular publication (i). DPMM was then used to create research clusters (m1, m2,…, mB) on B’s publications. Once clusters were made, we introduced the Recency Weight λi = e^(−ct) to penalize the initial matching score based on publication year, reflecting the research interest trend across time, where t is the difference between the year of the current experiment and the year of publication, and c is the decaying factor that decreases λi at a rate proportional to its current value; for the present study, we kept c = 0.05. The rationale: if a publication appeared in 1998, the corresponding RFA recommendations are probably of less interest to a researcher than those for a publication published in 2018. Suppose this particular publication i ∈ m2; we can then take the normalized sum of weighted matching scores, Σ_{i∈m2} λi · msji / |m2|, as the final ranking score for RFA (j) for cluster m2, where |m2| is the total number of publications in the m2 cluster. From there, we take the RFAs corresponding to the top K = 10 final ranking scores as the recommendations.
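The aggregation step can be sketched as follows, assuming the exponential form λ = e^(−ct) for the Recency Weight and a cluster-size normalization (our reading of the description above; function and variable names are ours):

```python
import math
from collections import defaultdict

def aggregate_cluster_scores(matches, cluster_years, current_year, c=0.05, k=10):
    """Aggregate per-publication matching scores into one ranking score per RFA
    for one cluster. `matches` maps pmid -> {rfa_id: Pr(match)}; `cluster_years`
    maps pmid -> publication year for the cluster's publications."""
    scores = defaultdict(float)
    n = len(cluster_years)
    for pmid, year in cluster_years.items():
        lam = math.exp(-c * (current_year - year))  # Recency Weight
        for rfa, ms in matches.get(pmid, {}).items():
            scores[rfa] += lam * ms
    ranked = sorted(((rfa, s / n) for rfa, s in scores.items()),
                    key=lambda kv: -kv[1])
    return ranked[:k]  # top K = 10 recommendations for this cluster
```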

System parameters.

Parameters used during training for the baselines and our proposed method are listed in Table 4. We ran several experiments over ranges of values to tune the parameters of the methods listed in Table 4; the values shown correspond to the best performance. For example, we experimented with several max_feature options for TF-IDF, such as 1000, 2000, and 5000. The method with a max_feature of 2000 slightly outperformed 1000, and there was no gain in performance when using 5000, so we went with 2000.

Table 4. Selective hyperparameters used in baseline vs. our proposed method.

https://doi.org/10.1371/journal.pone.0278636.t004

Evaluations

The evaluation was performed in two stages: a) internal (automatic) evaluation, where we utilized the RFA-publication relationships detailed in Data, Ground truth establishment; b) external evaluation, where experienced researchers rated recommendations tailored to their profiles. Details follow.

Internal evaluation.

This evaluation was developed to verify the effectiveness of our proposed method. Metrics were calculated against the ground truth between RFAs and publications described in detail in Data, Ground truth establishment. Metrics used include Recall@k, Precision@k, Mean Reciprocal Rank (MRR), and ROC-AUC. To better describe Recall@k and Precision@k, we supply the confusion matrix shown in Table 5 below.

  • Recall@k: at the k-th retrieved item, this metric measures the proportion of relevant items that have been retrieved. We evaluated Recall@1 (R@1).
  • Precision@k: at the k-th retrieved item, this metric measures the proportion of retrieved items that are relevant. We evaluated Precision@1 (P@1).
  • Mean Reciprocal Rank: the Reciprocal Rank (RR) is the reciprocal of the rank at which the first relevant document was retrieved: RR is 1 if the relevant document was retrieved at rank 1, 0.5 if it was retrieved at rank 2, and so on. Averaging RR across the queries Q gives the MRR.
  • ROC-AUC: the area under the ROC curve provides an aggregate measure of discriminating performance across all possible classification thresholds.
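The ranking metrics above can be computed in a few lines of Python; the helper functions below are our own illustrative sketch, not code from the paper:

```python
def mrr(rankings, relevant):
    """Mean Reciprocal Rank: rankings[q] is the ranked item list for query q,
    relevant[q] the set of ground-truth items for that query."""
    total = 0.0
    for q, ranking in rankings.items():
        for rank, item in enumerate(ranking, start=1):
            if item in relevant[q]:
                total += 1.0 / rank  # reciprocal rank of first relevant hit
                break
    return total / len(rankings)

def precision_recall_at_k(ranking, relevant, k):
    """Precision@k and Recall@k for one query."""
    hits = len(set(ranking[:k]) & relevant)
    return hits / k, hits / len(relevant)
```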

For the baseline IR methods, we produced the similarity matrix on the test set using the corpus built on all RFAs, and calculated Recall@1 (R@1), Precision@1 (P@1), and MRR on the same test entries used for the classifiers.

For the baseline NB, we used the best-performing of the three previously mentioned IR methods for features, calculated an additional ROC-AUC from the intermediate results, and then took the predicted ‘match’ (1) pairs and aggregated recommendations at the publication level for the three metrics mentioned above.

For the proposed method, we calculated the same set of metrics as we did for NB.

External evaluation.

School of Public Health departmental professors with a history of grant searches and approvals in the biomedical domain were engaged to externally evaluate our proposed method. A total of 10 researchers agreed to participate in the evaluation. After receiving their consent and CVs, the researchers’ names were searched in PubMed using a Python script, and the resulting publications were cross-referenced against their CVs. The final number of papers therefore differs for each researcher due to varying years of research history, and our proposed method accordingly produced a different number of research clusters, with recommendations for each cluster. Researchers were asked to rate the top 10 recommended grants for each cluster on a scale of 1 to 3 stars based on how satisfied they were with the recommendations, with 3 stars being ‘most satisfied’. We used our grant recommendation platform to collect the results; an example of the evaluation platform can be found in Fig 7.

We defined ratings of ≥ 2 stars as ‘partially relevant’ (P) and 3 stars as ‘strictly relevant’ (S), and calculated Precision@k for these two scenarios for k = 1, 10: P@1(P), P@1(S), P@10(P), and P@10(S), as well as the overall average stars.
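A small sketch of how star ratings translate into these two Precision@k variants (a hypothetical helper of ours, using the thresholds defined above):

```python
def precision_at_k_from_stars(stars, k, strict=False):
    """Precision@k from a ranked list of 1-3 star ratings: 'partially
    relevant' means >= 2 stars, 'strictly relevant' means 3 stars."""
    threshold = 3 if strict else 2
    return sum(s >= threshold for s in stars[:k]) / k
```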

Results

The results of the internal (automatic) evaluations are summarized in Table 6. Since the best-performing IR method on the validation set (results not shown here) was TF-IDF, we used TF-IDF vectorization for the NB features.

Table 6. Test results for baselines vs. our proposed method.

https://doi.org/10.1371/journal.pone.0278636.t006

We can see that the classification-based baseline (NB) outperformed the IR baselines, and our proposed method in turn outperformed the classification baseline. Specifically, the NB classifier has a much worse ROC-AUC than our proposed method, meaning its overall discriminating power is not on par with the proposed model. Since its R@1 is low, NB was not able to identify as many potential matches as our proposed method does, and therefore suffers from a coverage problem in its recommendations, even though it has relatively comparable P@1 and MRR.

External evaluation results are summarized in Table 7. 80% of our users gave average ratings of ≥ 2.0 stars (partially relevant). For our top-1 recommendation, 90% of users found it at least partially relevant (P@1(P)) and 60% found it strictly relevant (P@1(S)), across all recommended clusters. For our top-10 recommendations, 70% of users had P@10(P) > 0.9; however, all P@10(S) values were no more than 0.5, indicating that the 3-star percentages for the top 10 were not as high as for the top-1 hit among users.

Conclusions and discussions

To the best of our knowledge, this is the first attempt dedicated to research grant recommendation that utilizes an advanced, state-of-the-art natural language model, i.e., BERT, to capture the intrinsic, non-linear relationships between researchers and grant opportunities. We formulated the problem as a classification task, fine-tuned base-BERT with sentence-pair classification, and paired our core model with DPMM clustering and Recency Weight aggregation of the final results for practical applications. Both internal (using RFA-publication relationships) and external (user-based) evaluations revealed that our proposed BERT-based system is useful to biomedical researchers.

We believe that BERT’s ability to capture the intrinsic, non-linear relationships in publication-RFA pairs contributed greatly to the favorable results compared with the baselines. In addition, DPMM gave us the flexibility to cluster each researcher’s interests differently and thus, together with our Recency Weights, provided a reasonable way to aggregate our recommendations, rendering the final outputs practical. However, several limitations of our current implementation call for future work.

In terms of collecting publications for a particular researcher, we currently use CV cross-referencing to address author name disambiguation [43]: authors with the same name might exist, and querying a name in PubMed might sometimes return publications from other researchers. There are a few other approaches we could explore and compare in the future. One of the most promising is ORCID [44], a persistent digital identifier created especially for distinguishing researchers with the same name. However, many researchers involved in our experiments did not have an associated ORCID; by encouraging them to adopt one, we could ultimately reduce this issue. Other methods include rule-based unsupervised [45] as well as supervised approaches [46].

Secondly, since researchers’ publications were crawled from PubMed, there could be a discrepancy: publications from the most recent conferences or journals might not yet be indexed in the database even though they already appear in researchers’ CVs, and such publications would therefore not end up as inputs to our system.

In terms of system architecture, since we need a sufficient number of publications in PubMed to begin with, our recommender might not be useful for early-stage researchers. This problem could potentially be solved by collaborative filtering, a technique that utilizes preferences/ratings from other agents, users, and data sources [1,47], though it requires a sizeable amount of user feedback. With our plan to make the service public in the biomedical domain, we hope to collect useful feedback to further improve our system along the way.

Acknowledgments

The authors would like to thank all faculty members at UTHealth who participated in the external evaluation.

References

  1. 1. Ricci F, Rokach L, Shapira B. Introduction to Recommender Systems Handbook. In: Ricci F, Rokach L, Shapira B, Kantor PB, editors. Recommender Systems Handbook [Internet]. Boston, MA: Springer US; 2011 [cited 2022 Jan 26]. p. 1–35. Available from: http://link.springer.com/10.1007/978-0-387-85820-3_1.
  2. 2. Sponsored Programs Information Network [Internet]. [cited 2022 Mar 30]. Available from: https://spin.infoedglobal.com/Home/SOLRSearch.
  3. 3. Kamada S, Ichimura T, Watanabe T. A Recommendation System of Grants to Acquire External Funds. 2016 IEEE 9th Int Workshop Comput Intell Appl IWCIA. 2016 Nov;125–30.
  4. 4. Kamada S, Ichimura T, Watanabe T. Recommendation System of Grants-in-Aid for Researchers by using JSPS Keyword. 2015 IEEE 8th Int Workshop Comput Intell Appl IWCIA. 2015 Nov;143–8.
  5. 5. Acuna DE, Nagre K, Matnani P. EILEEN: A recommendation system for scientific publications and grants [Internet]. arXiv; 2022 [cited 2022 Oct 2]. Available from: http://arxiv.org/abs/2110.09663.
  6. 6. Achakulvisut T, Acuna DE, Ruangrong T, Kording K. Science Concierge: A Fast Content-Based Recommendation System for Scientific Publications. PLoS ONE [Internet]. 2016 Jul 6 [cited 2020 Mar 13];11(7). Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4934767/. pmid:27383424
7. Bulut B, Gündoğan E, Kaya B, Alhajj R, Kaya M. User’s Research Interests Based Paper Recommendation System: A Deep Learning Approach. In: Kaya M, Birinci Ş, Kawash J, Alhajj R, editors. Putting Social Media and Networking Data in Practice for Education, Planning, Prediction and Recommendation [Internet]. Cham: Springer International Publishing; 2020 [cited 2022 Jan 26]. p. 117–30. (Lecture Notes in Social Networks). Available from: https://doi.org/10.1007/978-3-030-33698-1_7.
8. Patra BG, Maroufy V, Soltanalizadeh B, Deng N, Zheng WJ, Roberts K, et al. A content-based literature recommendation system for datasets to improve data reusability—A case study on Gene Expression Omnibus (GEO) datasets. J Biomed Inform. 2020 Apr;104:103399. pmid:32151769
9. Yoneya T, Mamitsuka H. PURE: a PubMed article recommendation system based on content-based filtering. Genome Inform Int Conf Genome Inform. 2007;18:267–76.
10. Afolabi IT, Ayo A, Odetunmibi OA. Academic Collaboration Recommendation for Computer Science Researchers Using Social Network Analysis. Wirel Pers Commun. 2021 Nov 1;121(1):487–501.
11. Chuan PM, Son LH, Ali M, Khang TD, Huong LT, Dey N. Link prediction in co-authorship networks based on hybrid content similarity metric. Appl Intell. 2018 Aug 1;48(8):2470–86.
12. Kong X, Jiang H, Yang Z, Xu Z, Xia F, Tolba A. Exploiting Publication Contents and Collaboration Networks for Collaborator Recommendation. PLoS ONE. 2016 Feb 5;11(2):e0148492. pmid:26849682
13. Zhu J, Yaseen A. A Recommender for Research Collaborators Using Graph Neural Networks. Front Artif Intell [Internet]. 2022 [cited 2022 Oct 2];5. Available from: https://www.frontiersin.org/articles/10.3389/frai.2022.881704. pmid:35978654
14. Patra BG, Roberts K, Wu H. A content-based dataset recommendation system for researchers—a case study on Gene Expression Omnibus (GEO) repository. Database [Internet]. [cited 2020 Oct 1]; Available from: https://academic.oup.com/database/advance-article/doi/10.1093/database/baaa064/5909105. pmid:33002137
15. Zhu J, Patra B, Yaseen A. Recommender systems of scholarly papers using public datasets. In: 2021 AMIA Informatics Summit. 2021.
16. Patra BG, Roberts K, Wu H. A content-based dataset recommendation system for researchers—a case study on Gene Expression Omnibus (GEO) repository. Database. 2020 Jan 1;2020:baaa064.
17. Zhu J, Wu H, Yaseen A. Sensitivity Analysis of a BERT-based scholarly recommendation system. In: Proceedings of FLAIRS-35 [Internet]. 2022. Available from: https://journals.flvc.org/FLAIRS/issue/view/6020.
18. Bilal M, Almazroi AA. Effectiveness of Fine-tuned BERT Model in Classification of Helpful and Unhelpful Online Customer Reviews. Electron Commer Res [Internet]. 2022 Apr 29 [cited 2022 Oct 2]; Available from: https://doi.org/10.1007/s10660-022-09560-w.
19. Jeong C, Jang S, Park E, Choi S. A context-aware citation recommendation model with BERT and graph convolutional networks. Scientometrics. 2020 Sep 1;124(3):1907–22.
20. Dai T, Zhao J, Li D, Tian S, Zhao X, Pan S. Heterogeneous deep graph convolutional network with citation relational BERT for COVID-19 inline citation recommendation. Expert Syst Appl. 2022 Sep 12;213:118841. pmid:36157791
21. Hassan H, Sansonetti G, Gasparetti F, Micarelli A, Beel J. BERT, ELMo, USE and InferSent Sentence Encoders: The Panacea for Research-Paper Recommendation? In: Proceedings of 2019 ACM RecSys. Copenhagen, Denmark; 2019.
22. Yang N, Jo J, Jeon M, Kim W, Kang J. Semantic and explainable research-related recommendation system based on semi-supervised methodology using BERT and LDA models. Expert Syst Appl. 2022 Mar 15;190:116209.
23. NIH. NIH grants & funding [Internet]. [cited 2022 Mar 30]. Available from: https://grants.nih.gov/funding/index.htm.
24. ExPORTER [Internet]. NIH Research Portfolio Online Reporting Tools. [cited 2022 Mar 30]. Available from: https://exporter.nih.gov/.
25. Search Field Descriptions and Tags [Internet]. PubMed user guide. [cited 2022 Mar 30]. Available from: https://pubmed.ncbi.nlm.nih.gov/help/.
26. Rajaraman A, Ullman JD. Mining of Massive Datasets [Internet]. Cambridge: Cambridge University Press; 2011. Available from: https://www.cambridge.org/core/books/mining-of-massive-datasets/A06D57FC616AE3FD10007D89E73F8B92.
27. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12(85):2825–30.
28. Robertson S, Walker S, Jones S, Hancock-Beaulieu M, Gatford M. Okapi at TREC-3. In: Proceedings of the Third Text REtrieval Conference (TREC-3); 1994.
29. Rehurek R, Sojka P. Software Framework for Topic Modelling with Large Corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks; 2010. p. 45–50.
30. Mikolov T, Chen K, Corrado G, Dean J. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 [cs] [Internet]. 2013 Sep 6 [cited 2022 Jan 28]; Available from: http://arxiv.org/abs/1301.3781.
31. Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546 [cs, stat] [Internet]. 2013 Oct 16 [cited 2020 Feb 12]; Available from: http://arxiv.org/abs/1310.4546.
32. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs] [Internet]. 2019 May 24 [cited 2022 Jan 27]; Available from: http://arxiv.org/abs/1810.04805.
33. Zhu Y, Kiros R, Zemel R, Salakhutdinov R, Urtasun R, Torralba A, et al. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books. arXiv:1506.06724 [cs] [Internet]. 2015 Jun 22 [cited 2022 Jan 28]; Available from: http://arxiv.org/abs/1506.06724.
34. Merity S, Xiong C, Bradbury J, Socher R. Pointer Sentinel Mixture Models. arXiv:1609.07843 [cs] [Internet]. 2016 Sep 26 [cited 2022 Jan 28]; Available from: http://arxiv.org/abs/1609.07843.
35. Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144 [cs] [Internet]. 2016 Oct 8 [cited 2021 Dec 15]; Available from: http://arxiv.org/abs/1609.08144.
36. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, et al. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771 [cs] [Internet]. 2020 Jul 13 [cited 2022 Mar 30]; Available from: http://arxiv.org/abs/1910.03771.
37. Bakshy E, Dworkin L, Karrer B, Kashin K, Letham B, Murthy A, et al. AE: A domain-agnostic platform for adaptive experimentation.
38. Li Y, Schofield E, Gönen M. A tutorial on Dirichlet process mixture modeling. J Math Psychol. 2019 Aug 1;91:128–44. pmid:31217637
39. Yin J, Wang J. A model-based approach for text clustering with outlier detection. In: 2016 IEEE 32nd International Conference on Data Engineering (ICDE). 2016. p. 625–36.
40. Hu L, Li J, Li X, Shao C, Wang X. TSDPMM: Incorporating Prior Topic Knowledge into Dirichlet Process Mixture Models for Text Clustering. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing [Internet]. Lisbon, Portugal: Association for Computational Linguistics; 2015 [cited 2022 Jan 28]. p. 787–92. Available from: https://aclanthology.org/D15-1091.
41. Yu G, Huang R, Wang Z. Document clustering via dirichlet process mixture model with feature selection. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining—KDD ‘10 [Internet]. Washington, DC, USA: ACM Press; 2010 [cited 2021 Apr 20]. p. 763. Available from: http://dl.acm.org/citation.cfm?doid=1835804.1835901.
42. Antoniak C. Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems. Ann Stat. 1974;2(6):1152–74.
43. Smalheiser NR, Torvik VI. Author name disambiguation. Annu Rev Inf Sci Technol. 2009;43(1):1–43.
44. ORCID [Internet]. ORCID. [cited 2022 Mar 25]. Available from: https://orcid.org/.
45. Tekles A, Bornmann L. Author name disambiguation of bibliometric data: A comparison of several unsupervised approaches. Quant Sci Stud. 2020 Dec 1;1(4):1510–28.
46. Han H, Giles L, Zha H, Li C, Tsioutsiouliklis K. Two supervised learning approaches for name disambiguation in author citations. In: Proceedings of the 2004 joint ACM/IEEE conference on Digital libraries—JCDL ‘04 [Internet]. Tucson, AZ, USA: ACM Press; 2004 [cited 2022 Jan 28]. p. 296. Available from: http://portal.acm.org/citation.cfm?doid=996350.996419.
47. Terveen L, Hill W. Beyond Recommender Systems: Helping People Help Each Other. 2001.