BioConceptVec: creating and evaluating literature-based biomedical concept embeddings on a large scale

Capturing the semantics of related biological concepts, such as genes and mutations, is of significant importance to many research tasks in computational biology such as protein-protein interaction detection, gene-drug association prediction, and biomedical literature-based discovery. Here, we propose to leverage state-of-the-art text mining tools and machine learning models to learn the semantics via vector representations (a.k.a. embeddings) of over 400,000 biological concepts mentioned in all PubMed abstracts. Our learned embeddings, named BioConceptVec, can capture related concepts based on their surrounding contextual information in the literature, which goes beyond exact term match or co-occurrence-based methods. BioConceptVec has been thoroughly evaluated in multiple bioinformatics tasks consisting of over 25 million instances from nine different biological datasets. The evaluation results demonstrate that BioConceptVec performs better than existing methods in all tasks. Finally, BioConceptVec is made freely available to the research community and general public via https://github.com/ncbi-nlp/BioConceptVec.


Introduction
In the biomedical domain, one primary application of text mining is to extract knowledge within the biomedical literature automatically [1]. Specifically, identifying important concepts mentioned in the literature, such as genes/proteins, diseases, and mutations, is critical to biocuration [2], literature-based knowledge discovery [3], and many downstream applications [4][5][6]. Previous studies have used different words such as concepts, entities, names, and mentions to refer to the same topic in the biomedical domain. Here, we use bio-concepts for consistency. Similar to the use of word embeddings, capturing the representation of bio-concepts plays a vital role in biomedical applications such as biomedical relation extraction [12] and document classification [13]. Existing studies use the term concept embeddings, which are a special kind of word embedding [7][8][9]. According to the literature, a concept embedding may contain only the concept vectors [10], or it may contain vectors of both concepts and common words [11]. Named entity recognition (NER) tools or concept dictionaries are often used to identify and normalize concepts in a consistent format [10].
Since 2014, word embedding models have revolutionized how to represent text. In these models, each word is represented as a high dimensional vector [12,13]. The vector representations are learned on large-scale free text corpora via unsupervised learning. Primary methods include training the embeddings based on (1) averaged surrounding context words, such as the continuous bag-of-words (cbow) model in word2vec [14], (2) weighted context words, such as the skip-gram model in word2vec, (3) global co-occurrence statistics, such as GloVe [15], and (4) character n-grams, such as fastText [16]. The use of vector representations can capture related words from different lexicons, such as cancer and tumor. This overcomes the limitations of traditional bag-of-words approaches that rely on exact term matching [17].
To date, text-mining applications have rapidly adopted word embeddings. For instance, the use of embeddings has shown promising performance in biomedical applications such as biomedical document classification [18], sentence retrieval [19], and question answering [20].
It is known that biomedical concepts have a high degree of ambiguity [21]. The same words can be used to describe different types of concepts in free text; for example, AP2 can be the name of a gene (https://www.ncbi.nlm.nih.gov/gene/?term=2167), a chemical (https://meshb.nlm.nih.gov/record/ui?ui=C417523), or a cell-line (https://web.expasy.org/cellosaurus/CVCL_1147). Conversely, the same concepts can have different names; for example, the HER2 gene has at least 10 different synonyms mentioned in text (https://www.ncbi.nlm.nih.gov/gene/2064). In addition, a bio-concept can span multiple words; for example, serum and glucocorticoid-induced protein kinase is the name of a gene (SGK1, https://www.ncbi.nlm.nih.gov/gene/6446). Therefore, accurate NER is essential prior to training concept embeddings.
We present a detailed summary of the existing bio-concept embeddings in Table 1. These studies have used various corpora (mainly electronic health records (EHR), combined with medical claims, biomedical corpora, or Wikipedia) and several training methods (mainly word2vec, while some used GloVe and fastText) to train concept embeddings. Overall, the primary method paradigm is consistent among these studies and generally involves two steps. In the first step, NER tools are applied to identify and normalize target concepts and to replace the mentions in the text, as a preprocessing step on the corpora. In the second step, the embeddings are trained using standard word embedding training methods, such as word2vec. Note that we consider concept embeddings trained on knowledge bases, such as gene2vec [22], as different work because knowledge bases are distinct from free-text collections. For example, knowledge bases contain concepts already curated either manually or semi-automatically; therefore, training concept embeddings via knowledge bases does not require NER tools. In addition, the relationships between concepts in knowledge bases have already been organized in a structured format, such as ontologies. Free text, however, is unstructured, and training embeddings on free text occurs purely in an unsupervised way. Also note that individual knowledge bases contain only specific types of concepts by design. By contrast, a wide spectrum of concept types is described in the literature.
Despite these recent efforts, past studies share some limitations. First, as shown in Table 1, existing studies used NER tools to recognize and normalize Unified Medical Language System (UMLS) concepts [23]. A long series of evaluation studies demonstrate that the effectiveness of these NER tools fluctuates dramatically for different types of UMLS concepts [24][25][26][27][28]. For example, Hassanzadeh et al. evaluated the NER tools used by the studies in Table 1 and found that the F1-score ranged from 5% to 75% for different types of UMLS concepts [24]. Likewise, Reátegui et al. found that the F1-score of the NER tools varied from 44% to 96% for different types of diseases [26]. Importantly, errors produced in the NER step may diminish the effectiveness of bio-concept embeddings. For example, low precision, such as a non-concept word wrongly identified as a bio-concept by NER tools, will bias the context or nearby words of the true bio-concepts when training embeddings. Similarly, low recall, such as true bio-concepts that are not identified by NER tools, will reduce the number of training instances and decrease the concept coverage of bio-concept embeddings.
Second, almost no studies have evaluated the effectiveness of concept embeddings in extrinsic evaluations.
The evaluation of word embeddings can be broadly categorized into two types (i.e., intrinsic and extrinsic) [29]. Intrinsic evaluations are commonly accomplished via an unsupervised setting or using weakly supervised labels, whereas extrinsic evaluations are often performed via a supervised setting in downstream applications. As shown in Table 1, only one study [8] performed extrinsic evaluations for heart failure, predicting whether a patient would be diagnosed as having heart failure based on the associated clinical notes. The study used a basic long short-term memory (LSTM) model with randomly initialized embedding as the baseline and replaced the randomly initialized embedding with the proposed concept embedding to compare the performance. Although the results demonstrated that the proposed concept embedding has better performance, the study (1) did not compare the results with those of other existing concept embeddings and (2) did not compare the results with those of the state-of-the-art model that had achieved the highest performance on that task [30].
Further, importantly, the existing concept embeddings are designed primarily for concepts and applications in the clinical domain, whereas concept embeddings for the biological domain remain to be developed. As shown in Table 1, existing studies used UMLS concepts and mainly used EHR data as the training corpora. Correspondingly, the evaluation focuses on clinical applications, i.e., the evaluation datasets are generated from EHR data. For example, most of the studies evaluated the two datasets, UMNSRS (Medical Residents Relatedness Set)-Similarity [31] and UMNSRS-Relatedness [31], each consisting of ~600 pairs of clinical concepts derived from EHR data and annotated by physicians.
Similarly, the above extrinsic evaluation of heart-failure prediction is also based on a patient's clinical notes [8]. Developing embeddings for biological concepts and applications is also important.
In response, we propose BioConceptVec, a collection of concept embeddings on primary biological concepts mentioned in the biomedical literature. Fig 1 shows an overview of our study. Specifically, the study has three primary contributions:
1. To our knowledge, this is the first study to use machine learning-based NER tools to recognize and normalize biological concepts for training bio-concept embeddings. Specifically, we employed PubTator, a state-of-the-art NER system with concept annotations for all PubMed abstracts [32]. BioConceptVec contains over 400,000 concepts, which is the largest number among the publicly available concept embeddings. For example, our evaluation of the human gene coverage shows that BioConceptVec covers 33% more gene concepts than the existing concept embeddings.
2. We conducted large-scale intrinsic and extrinsic evaluations to quantify the validity and utility of BioConceptVec. The intrinsic evaluations contain ~18 million instances from six datasets. BioConceptVec has significantly higher performance (up to 10% improvement) than the existing concept embeddings and is consistent across multiple datasets. The extrinsic evaluations cover two downstream applications: protein-protein interaction (PPI) prediction, consisting of ~8 million PPIs from the STRING database [33], and drug-drug interaction (DDI) classification, consisting of ~5,000 DDIs from a community-recognized gold standard dataset. The extrinsic evaluation results demonstrate that the deep learning models that use BioConceptVec can significantly improve the state-of-the-art performance, achieving an AUC of 0.95 for predicting PPIs and an F1-score of 0.80 for extracting DDIs.
3. We make all of the embeddings and evaluation datasets publicly available. The embeddings and datasets can be downloaded via https://github.com/ncbi-nlp/BioConceptVec. We also provide a Jupyter notebook that contains code examples for users to get started.

Materials and Methods
Training corpus and method

NER step: using PubTator to annotate biological concepts
We trained concept embeddings on the ~30 million abstracts in PubMed. We followed the preprocessing pipeline from [34] (the code is publicly available via https://github.com/ncbi-nlp/BioSentVec). As noted, the first step of bio-concept embedding development is to use NER tools to identify the target concept mentions (e.g., "estrogen receptor") and to further normalize the mentions to the concept identifiers (e.g., "NCBI Gene: 2099"). As an example, shown in Fig 2, a targeted concept (i.e., MLN4924) is identified and normalized to a chemical concept: MESH:C539933. Because concept embeddings require high-quality concept normalization, we applied PubTator to annotate the full set of PubMed abstracts. PubTator [32] is a PubMed-scale resource that utilizes four NER tools (i.e., TaggerOne [35], GNormPlus [36], tmVar [37], and SR4GN [38]) with a recent deep learning-based module for disambiguating conflicting mentions [39] (when the mentions are annotated by two or more concept taggers) to recognize six key biological concepts (i.e., genes, mutations, diseases, chemicals, cell lines, and species). S1 Table provides a summary of the state-of-the-art performance of the NER tools in PubTator on various benchmarking datasets.

Embedding training step: using word2vec, GloVe, and fastText to produce BioConceptVec
We trained concept embeddings on the full collection of PubMed abstracts after concept recognition via PubTator, i.e., identified named entities are replaced with bio-entity types and IDs (e.g., Disease_MESH_D008288) before training. To our knowledge, there is no agreement on which embedding model is the most effective in biomedical domains. For example, Wang et al. [40] showed that fastText achieved the highest performance in biomedical event trigger detection versus other word embeddings, whereas Jin et al. [41] found that word2vec has better performance in biomedical sentence classification. In this study, we therefore trained four different word embeddings (cbow, skip-gram, GloVe, and fastText) such that future studies can choose our concept embeddings according to their specific requirements.
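The mention-replacement preprocessing described in this subsection can be illustrated with a minimal sketch. It assumes annotations arrive as non-overlapping character spans with a concept type and identifier; the function name `replace_mentions`, the tuple layout, the example sentence, and the exact token format for chemicals are illustrative assumptions modeled on the Disease_MESH_D008288 example, not the actual PubTator or BioConceptVec code.

```python
def replace_mentions(text, annotations):
    """Replace annotated character spans with single concept tokens.

    annotations: list of (start, end, concept_type, concept_id) tuples.
    Spans are assumed to be non-overlapping (hypothetical input format).
    """
    pieces = []
    cursor = 0
    for start, end, concept_type, concept_id in sorted(annotations):
        # Keep the untouched text between the previous span and this one.
        pieces.append(text[cursor:start])
        # Build one token per concept, e.g. "Chemical_MESH_C539933".
        pieces.append(f"{concept_type}_{concept_id.replace(':', '_')}")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hypothetical sentence; the MLN4924 -> MESH:C539933 mapping is from the text.
sentence = "MLN4924 inhibits tumor growth."
annotations = [(0, 7, "Chemical", "MESH:C539933")]
converted = replace_mentions(sentence, annotations)
print(converted)
```

Collapsing each mention into one token means that multi-word names and synonyms of the same concept all contribute context to a single vector during training.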
In general, the methods to train word embeddings can be categorized into two groups: window-based and matrix factorization-based [15]. The major distinction between these two categories is that window-based methods aim to learn the semantics of words based on local context, i.e., words within a pre-defined window size, whereas matrix factorization-based methods aim to learn the semantics of words based on global statistics of words in corpora. word2vec and fastText belong to the first category while GloVe belongs to the second category. word2vec has two versions: cbow, training a model using context words as input to predict a target word, and skip-gram: reversely using a target word to predict context words [14]. fastText is an extension of word2vec, using character n-grams to represent a word [42]. In contrast, GloVe is dramatically different from word2vec and fastText. It builds a matrix based on global cooccurrences between the words and then applies matrix factorization.
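To make the distinction concrete, the following sketch builds the training signal each family consumes from a toy token sequence: (context, target) examples for cbow, (target, context word) pairs for skip-gram, and windowed co-occurrence counts of the kind a GloVe-style method would factorize. The function names and placeholder tokens are assumptions for illustration, and the unweighted counts are a simplification rather than a reproduction of any library's implementation.

```python
from collections import Counter


def cbow_examples(tokens, window):
    """(context_words, target) examples: the context predicts the target."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        examples.append((left + right, target))
    return examples


def skipgram_examples(tokens, window):
    """(target, context_word) pairs: the target predicts each context word."""
    return [(target, word)
            for context, target in cbow_examples(tokens, window)
            for word in context]


def cooccurrence_counts(tokens, window):
    """Unweighted windowed co-occurrence counts (simplified global statistics)."""
    return Counter(skipgram_examples(tokens, window))


# Placeholder tokens standing in for normalized bio-concepts and common words.
tokens = ["Gene_X", "regulates", "Gene_Y", "in", "Disease_Z"]
print(cbow_examples(tokens, window=1)[1])
print(skipgram_examples(tokens, window=1)[:3])
print(cooccurrence_counts(tokens, window=1)[("regulates", "Gene_X")])
```

In practice the co-occurrence counts would be accumulated over the whole corpus before factorization, whereas the window-based examples are consumed one at a time during training.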
As mentioned, fastText represents each word as a set of character n-grams. In the case of bio-concept embeddings, however, each bio-concept should be considered a unit. Thus, when training with fastText, we disabled the n-gram representation for bio-concepts (in contrast, for the words that are not bio-concepts, we still used the default n-gram representation in fastText).
The values of hyperparameters for training embeddings are summarized in Table 2. Our choice of hyperparameters is based on similar studies in the past and other related work in the general domain.

Hyperparameters and other methods for comparison
To directly compare with the existing concept embeddings, we used the exact hyperparameter values from Yu et al. [43] as the default setting. As shown in Table 1, of the three publicly available concept embeddings, it is the only concept embedding trained on PubMed. The other two were trained on EHR data. We measured the concept overlap in terms of genes and found that concept embeddings trained on EHR data contain significantly fewer genes than do embeddings trained on PubMed. Thus, we did not compare with those two EHR-driven methods.
Yu et al. [43] used cbow to train the concept embeddings and their hyperparameters are summarized in Table 2. Hence, under the same parameter settings, we first trained a common cbow word embedding on PubMed abstracts as a baseline. Common word embeddings do not contain vectors for normalized bio-concepts. The words in a bio-concept name, however, often exist in common word embeddings. For example, the TOR3A gene (https://www.ncbi.nlm.nih.gov/gene/64222) does not exist in a common word embedding, but the words of its name, torsin family 3 member A, all exist. Thus, we averaged the word vectors of the words in the bio-concept name to represent the concept vector. Averaged vectors are used as a strong baseline for many embedding-related tasks, such as sentence similarity [44] and sentiment analysis [45]. We refer to the averaged word embedding baseline as BioAvgWord (cbow). As such, we are able to directly compare BioConceptVec (cbow) with the two baselines: BioAvgWord (cbow) and the concept embedding provided by Yu et al.
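The averaging baseline can be sketched in a few lines. The two-dimensional vectors below are made-up toy values, and the lowercasing, whitespace tokenization, and skipping of out-of-vocabulary words are assumptions of this sketch, not details stated for BioAvgWord.

```python
def average_name_vector(name, word_vectors):
    """Average the vectors of the words in a concept name.

    Returns None when no word of the name is in the vocabulary.
    """
    words = [w for w in name.lower().split() if w in word_vectors]
    if not words:
        return None
    dim = len(word_vectors[words[0]])
    return [sum(word_vectors[w][k] for w in words) / len(words)
            for k in range(dim)]


# Toy 2-dimensional word vectors (hypothetical values, for illustration only).
toy_vectors = {
    "torsin": [1.0, 0.0],
    "family": [0.0, 1.0],
    "3": [1.0, 1.0],
    "member": [0.0, 0.0],
    "a": [2.0, 2.0],
}
tor3a_vector = average_name_vector("torsin family 3 member A", toy_vectors)
print(tor3a_vector)
```

Because every word contributes equally, generic words such as "family" and "member" influence the averaged vector as much as the distinctive word "torsin", which is one reason a dedicated concept vector may behave differently from this baseline.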
In addition, we trained and assessed BioConceptVec (cbow) under different parameters while keeping the same values for minimal word occurrences (so that embeddings share the same vocabulary), learning rate, and training epochs (so that embeddings share the same optimization procedure). For each of the other hyperparameters, we selected two representative values that were used in the previous studies on embeddings [46,47], as shown in Table 2 (other values). Note that we did not select larger values for the negative samples and down-sampling threshold because the number of training epochs is set to 10; it would require more epochs to stabilize the loss when there are more samples.
Furthermore, different studies show that performance can vary by embedding method [46,48]. Thus, we also trained BioConceptVec using skip-gram, GloVe, and fastText, using the same default setups.
We make all of the four versions of BioConceptVec (cbow, skip-gram, GloVe and fastText) publicly available so that users can experiment and choose between the models for their tasks.
To ensure a fair comparison, the evaluation datasets described below contain only concepts shared among these baseline methods and BioConceptVec. We also measured the coverage of concepts using human genes as an example.

Intrinsic evaluations
Identifying related genes based on drug-gene and gene-gene interactions
We posit that concept embeddings should give higher similarity to related concepts than to unrelated concepts. The intrinsic evaluations in our study quantify the effectiveness of concept embeddings in terms of identifying related genes. We concentrate on genes because genes are a central focus of biological studies; the interactions between genes (or genes and other biological concepts) are essential for understanding the structures and functions of a cell [49,50]. In addition, biological studies over the decades have collected related genes from different perspectives, such as those based on expression signatures, pathways, and gene ontologies (GO). These collected related genes can be used as a gold standard for our intrinsic evaluations. In contrast, it is more difficult to systematically define relatedness for other biological concepts, such as diseases and mutations. We considered related gene pairs based on drug-gene interactions and gene-gene interactions, as explained below.
Evaluation dataset construction and evaluation metrics
We adopted six datasets for creating evaluation datasets. The detailed statistics of these datasets are summarized in Table 3. The relatedness of genes was modeled from two broad categories. The first was based on the relationships between genes and other bio-concepts, and the second was based on the relationships among genes.
For the first category, we used the Comparative Toxicogenomics Database (CTD) [51], which captures drug-gene interactions. For each drug, we consider the genes that interact with the same drug as a related set and randomly select the same number of genes that do not interact with the drug as an unrelated set. A related and unrelated set together form a group. Ideally, concept embeddings should have significantly higher similarity for the related sets than the unrelated sets for each group.
For the second category, we used five gene sets (C1-C5) of MSigDB [52]. MSigDB captures related genes using different perspectives, and each gene set is generated from a distinct perspective. For example, MSigDB C1 is generated based on human chromosomes, and MSigDB C5 is generated based on GO. The strategy of creating related and unrelated sets is the same as above. For example, in terms of MSigDB C5, the genes that share the same GO term are considered a related set, and the same number of genes that do not share that GO term are randomly generated as an unrelated set.
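The construction of one group (a related set plus an equally sized, randomly drawn unrelated set) can be sketched as follows. The function name and gene identifiers are hypothetical, the cap of 100 genes per set is the limit stated later in this section, and truncating an oversized related set by random sampling is an assumption of this sketch.

```python
import random


def build_group(related_genes, all_genes, max_size=100, seed=0):
    """Return (related, unrelated) gene lists of equal size for one group."""
    rng = random.Random(seed)
    related = sorted(related_genes)
    if len(related) > max_size:
        # Assumption: oversized sets are down-sampled at random.
        related = rng.sample(related, max_size)
    # Unrelated candidates are genes outside the full related set.
    candidates = sorted(set(all_genes) - set(related_genes))
    # rng.sample raises ValueError if there are too few candidates.
    unrelated = rng.sample(candidates, len(related))
    return related, unrelated


# Hypothetical gene identifiers g0..g9; g0 and g1 interact with the same drug.
genes = [f"g{i}" for i in range(10)]
related, unrelated = build_group({"g0", "g1"}, genes)
print(related, unrelated)
```

Matching the sizes of the two sets keeps the number of gene pairs equal on both sides, so the averaged similarities are computed over comparable sample sizes.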
We computed the similarity of a set by averaging the cosine similarity of all of the pairs in the set, using concept embeddings. Cosine similarity is the most popular similarity measure used by embeddings [29].
Importantly, different embeddings may report different cosine similarities for the same pairs, and the range of cosine similarities also may be different, which is unavoidable [53]. To reduce the biases, for each embedding, we first applied Z-score standardization to the cosine similarities of all of the pairs and then used Min-Max normalization to transform the range to [0, 1].
We used the similarity score difference between related sets and unrelated sets at group level as the final evaluation metric. As noted, a more effective concept embedding should have a greater similarity score difference between the related set and the unrelated set for a group. For computational efficiency, we restricted the maximum number of genes in a set to be 100, i.e., a group has, at most, 200 genes in total.
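The metric can be sketched end to end for a single toy group: pairwise cosine similarities within each set, Z-score standardization followed by Min-Max normalization, and the difference between the mean scores of the related and unrelated sets. The vectors and gene names are made up, the pairs are assumed to be all unordered pairs within a set, and the normalization here is applied within one group for brevity, whereas the text applies it to all pairs of an embedding.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_similarities(genes, vectors):
    """Cosine similarity of every unordered gene pair within one set."""
    return [cosine(vectors[a], vectors[b]) for a, b in combinations(genes, 2)]


def normalize(scores):
    """Z-score standardization, then Min-Max scaling to [0, 1].

    Assumes the scores are not all identical (non-zero spread).
    """
    mu, sigma = mean(scores), pstdev(scores)
    z = [(s - mu) / sigma for s in scores]
    low, high = min(z), max(z)
    return [(x - low) / (high - low) for x in z]


def group_difference(related, unrelated, vectors):
    rel = pair_similarities(related, vectors)
    unrel = pair_similarities(unrelated, vectors)
    scaled = normalize(rel + unrel)
    return mean(scaled[:len(rel)]) - mean(scaled[len(rel):])


# Toy 2-dimensional vectors (hypothetical values).
vectors = {
    "g1": [1.0, 0.0], "g2": [0.9, 0.1], "g3": [0.8, 0.2],
    "g4": [0.0, 1.0], "g5": [1.0, 0.0], "g6": [-1.0, 0.0],
}
diff = group_difference(["g1", "g2", "g3"], ["g4", "g5", "g6"], vectors)
print(diff)
```

A larger positive difference indicates that the embedding separates the related set from the unrelated set more clearly.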
Note that MSigDB has other gene sets, such as C6 and C7. We did not use them because they contain fewer than 100 shared genes. Collectively, our intrinsic evaluation datasets contain over 13,000 genes and over 17 million instances across six datasets.

Extrinsic evaluations
We further evaluated the utility of BioConceptVec in two downstream applications: protein-protein interaction (PPI) prediction on the STRING database [33] and drug-drug interaction (DDI) classification on biomedical literature [54].

Protein-protein interaction prediction on the STRING database
Analyzing functional interactions between proteins, which facilitates the understanding of the cellular processing and characterization, is a routine task in molecular systems biology [55]. The STRING database is one of the most comprehensive data resources that integrate, score, and analyze publicly available PPIs [33]. To date, it consists of over 3 billion PPIs from ~25 million proteins (https://string-db.org/). The PPIs in the STRING database are scored by accumulating a wide range of evidence, such as measurements from biological experiments, co-expressions, and gene co-occurrences.
Existing studies have used STRING for training and testing machine learning models for PPI prediction [56,57]. In a recent study, for example, Smaili et al. constructed two PPI datasets for human proteins: (1) PPIs based on combined scores, i.e., the score calculated from multiple sources (including results from the biomedical literature and many others, such as gene co-expressions, biological experiments and pathways), which we refer to as the combined-score, and (2) PPIs that have the experimental score over 700, i.e., the score is based only on biological experiments and is greater than 700, which we refer to as the experimental-700. The study considered these PPIs as positive instances and randomly generated the same number of negative instances. Smaili et al. split the datasets into the training and testing datasets, accounting for 70% and 30% of the total number of PPIs, respectively. They further developed a deep learning model by taking the vector representations of the two proteins as inputs and predicting whether the proteins have interactions. The deep learning model was an artificial neural network (ANN) that had two hidden layers [57]. Using the same model, the study tested different vector representations and reported Area Under the Curve (AUC) accordingly.
We followed this study [57] for creating the datasets and implementing the reported ANN model. Table 4 provides a summary of the statistics of the datasets. The combined-score dataset covers all of the 13,802 proteins that are shared by the concept embeddings and the STRING database. In comparison, the previous study sampled only 1,800 proteins. We also implemented a 2-layer ANN. The details of the hyperparameters are summarized in S2 Table. In keeping with the previous study, the model and hyperparameters are identical when testing different concept embeddings. The Precision, Recall, F1-score, and AUC are reported.
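The dataset construction described above (positive PPIs, an equal number of randomly generated negatives, and a 70%/30% split) can be sketched as follows. The function name, protein identifiers, and toy vectors are hypothetical, and concatenating the two protein vectors as the model input is an assumption of this sketch; the text states only that the vector representations of the two proteins are taken as inputs.

```python
import random


def build_ppi_dataset(positive_pairs, vectors, train_fraction=0.7, seed=0):
    """Return (train, test) lists of (feature_vector, label) examples."""
    rng = random.Random(seed)
    proteins = sorted(vectors)
    positives = {frozenset(pair) for pair in positive_pairs}
    negatives = set()
    # Draw random pairs until negatives match positives in number.
    # Assumes enough non-interacting pairs exist, otherwise this loops forever.
    while len(negatives) < len(positives):
        pair = frozenset(rng.sample(proteins, 2))
        if pair not in positives:
            negatives.add(pair)
    examples = [(tuple(sorted(p)), 1) for p in positives]
    examples += [(tuple(sorted(p)), 0) for p in negatives]
    examples.sort()  # fixed order before shuffling, for reproducibility
    rng.shuffle(examples)
    # Assumption: the input is the concatenation of the two protein vectors.
    featurized = [(vectors[a] + vectors[b], label) for (a, b), label in examples]
    cut = int(train_fraction * len(featurized))
    return featurized[:cut], featurized[cut:]


# Hypothetical proteins p1..p6 with toy 2-dimensional vectors.
toy_vectors = {f"p{i}": [float(i), float(-i)] for i in range(1, 7)}
positive_pairs = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p5", "p6")]
train, test = build_ppi_dataset(positive_pairs, toy_vectors)
print(len(train), len(test))
```

Keeping the split and the negative sampling fixed by a seed makes it possible to swap in different concept embeddings while holding the data constant, which mirrors the comparison setup described in the text.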

Drug-drug interaction extraction on biomedical literature
We also examined the usefulness of concept embeddings in a text-mining task. Specifically, we evaluated the performance of concept embeddings on the SemEval 2013: Task 9 DDI extraction corpus [54] for DDI classification. This dataset consists of over 1,000 documents from the DrugBank database [58] and PubMed abstracts and ~5,000 DDIs manually annotated by two senior pharmacists, serving as a gold standard dataset for relation extraction by the community [59].
In this task, the input is a sentence that contains a pair of drugs. If the pair of drugs represents a true DDI, the model needs to output the DDI type; otherwise, the model needs to indicate the pair is not a true DDI [54]. The annotators classified a DDI into one of four types: advice, effect, mechanism, and int (the interaction occurs, but its type cannot be classified) [59]. We used the official training and testing datasets. The statistics of the datasets are summarized in Table 5. This is a multi-class classification problem (i.e., 5 classes: 4 DDI types and a negative class indicating a pair is not a DDI), and the organizers used the F1-score to measure the multi-class performance of true DDIs (i.e., without considering the negative cases). We followed the same evaluation procedure.
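The evaluation over true DDI types only can be illustrated with a short sketch. It assumes micro-averaging over the four DDI types, which is an assumption here since the text states only that the negative class is not considered; the label strings and example predictions are made up.

```python
def ddi_scores(gold, predicted, negative_label="negative"):
    """Precision, recall, and F1 over DDI types, ignoring the negative class.

    Assumption: scores are micro-averaged over the four DDI types.
    """
    true_positive = sum(1 for g, p in zip(gold, predicted)
                        if g == p and g != negative_label)
    predicted_positive = sum(1 for p in predicted if p != negative_label)
    gold_positive = sum(1 for g in gold if g != negative_label)
    precision = true_positive / predicted_positive if predicted_positive else 0.0
    recall = true_positive / gold_positive if gold_positive else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up gold labels and predictions for six drug pairs.
gold = ["effect", "mechanism", "negative", "advice", "int", "negative"]
predicted = ["effect", "effect", "negative", "advice", "negative", "int"]
precision, recall, f1 = ddi_scores(gold, predicted)
print(precision, recall, f1)
```

Correctly predicted negative pairs do not raise the score under this scheme, so a model cannot score well simply by labeling most pairs as non-interacting.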
We implemented a simple averaged sentence embedding neural network model (SEN) for DDI classification. Fig 3 illustrates the architecture of SEN. For an input sentence, it first maps each word in the sentence to a vector using a word embedding (Embedding Layer in Fig 3). We used the recent context-based word embedding ELMo in the Embedding Layer [60], which was shown to be superior to common word embeddings in relation extraction tasks [61]. Then it averages all of the word vectors to obtain the sentence vector (Averaged Layer), followed by dense layers (the hidden layers used in the ANN above). Finally, it outputs class probabilities. The details of the hyperparameters of SEN are summarized in S3 Table. SEN has been used widely as a baseline model in sentence-related applications [34]. We hypothesized that adding the vector representations of the drugs mentioned in the sentences will increase the classification performance. We used PubTator to map the drug mentions into concept identifiers. Thus, similar to PPI prediction, we used the same model and tested different concept embeddings. The Precision, Recall, and F1-score are reported.

In total, these four embeddings cover 18,881 human genes, ~98% of which can be found in BioConceptVec. We manually examined the genes that were missing in BioConceptVec and found that most of them only occurred once. We also found that these genes occur more frequently in PMC full-text articles; we plan to integrate both PubMed abstracts and PMC full-text articles for training concept embeddings in the future.

Hyperparameter abbreviations (see Table 2): w, v, s, and n stand for window size, vector dimension, sampling threshold, and negative samples, respectively.
In Fig 6, we report the effect of different embedding methods. As shown, there is no one-size-fits-all method that always achieves the best performance across all of the datasets. For instance, BioConceptVec (cbow) had the best performance on the CTD dataset, whereas BioConceptVec (GloVe) had the highest score on the MSigDB C1 dataset. This is consistent with the findings in the previous literature on embedding comparison [46,48]. Hence, it is necessary to make embeddings trained with different methods publicly available.

Extrinsic evaluation results
Protein-protein interaction predictions on the STRING database
Table 6 shows the classification results of PPI predictions on the STRING database. The direct comparison results show that BioConceptVec (cbow) has better performance than the baseline approaches, achieving the highest F1-score and AUC on both datasets. The results of BioConceptVec (cbow) with different hyperparameters are summarized in S4 Table, which further demonstrates that its performance was consistent overall. When comparing BioConceptVec trained using different methods, BioConceptVec (fastText) had the best overall performance for this task, although the performance of BioConceptVec (cbow) and BioConceptVec (skip-gram) is very close. Note that we were unable to directly compare with the previous study [57] because the proposed embedding is not publicly available. Also, as noted, the performance of that study was measured on ~1,800 proteins, whereas our datasets contain ~13,000 proteins.

Drug-drug interaction extraction results
Table 7 shows the evaluation results on DDI extraction. We ran the model 5 times with different random seeds and then calculated the average performance [62]. The state-of-the-art (SOTA) model by Zhang and colleagues achieved an F1-score of 0.73 on this dataset [63]. Their model uses an LSTM as an encoder with an attention mechanism and outperforms other feature-based, kernel-based, and neural network-based methods. We found that, compared with the SOTA model, the SEN model had a slightly better classification performance on the advice, effect, and mechanism relation types but had a dramatically lower performance on the int relation, where a DDI cannot be classified into a specific type.
We also measured the performance of SEN by adding concept vectors. The direct comparison results show that BioConceptVec has better performance than the baseline approaches. Adding BioConceptVec improves the F1-score significantly, and BioConceptVec (cbow) appears to be the most effective in this task. The results of BioConceptVec (cbow) using different hyperparameters are summarized in S5 Table, which also shows that the performance is consistent.
We further qualitatively analyzed the errors by comparing the results of the SEN model with and without BioConceptVec. We found that the SEN model failed to classify challenging cases in which the definitions of relation types are somewhat similar. For example, the sentence, "Zidovudine competitively inhibits the intracellular phosphorylation of stavudine," contains the relation "zidovudine-stavudine." The annotator classified it as the effect type, but the SEN model wrongly classified it as the mechanism type.
According to the annotation guidelines, both effect and mechanism types can describe pharmacological effects. The effect type, however, focuses on the change of the effect, whereas the mechanism type focuses on the underlying reason for the change. For this case, inhibiting the intracellular phosphorylation describes the change rather than the mechanism. There are ~20 similar erroneous cases for which the SEN model only mixed the effect type with the mechanism type. Adding BioConceptVec (cbow) to the SEN model correctly classified all of them. This is likely because BioConceptVec provides additional information learned from all PubMed abstracts, making the classification of the two related types easier as a result. Collectively, the results confirm the hypothesis that adding concept representations improves the performance of downstream deep learning models and suggest that BioConceptVec has the potential to facilitate the development of deep learning models in the biomedical domain.
In this work, we propose BioConceptVec, concept embeddings that focus on primary biological concepts mentioned in the biomedical literature. We employed SOTA biological NER tools and trained four concept embeddings on the full collection of ~30 million PubMed abstracts. We evaluated the effectiveness of BioConceptVec in intrinsic and extrinsic settings, consisting of ~25 million instances in total. The results demonstrate that BioConceptVec consistently achieves the best performance in multiple datasets and in a range of applications. We hope that it can facilitate the development of deep learning models in biomedical research. In the future, we plan to leverage both PubMed abstracts and PMC full-text articles for training BioConceptVec.
This study focused on the evaluation on human genes because there are rich resources readily available for serving as a gold standard. We plan to evaluate BioConceptVec embeddings on different concept types in the future. Also, the quality of our concept embeddings is dependent on the accuracy of the NER tools. Improving NER tools such as PubTator would help enhance the quality of BioConceptVec. Finally, in this work, we did not apply retro-fitting, which is a fine-tuning step to further optimize the embeddings based on specific tasks with gold standard labels. For example, one of the most common retro-fitting procedures is to optimize the performance of the generated embeddings on identifying synonyms and acronyms. We did not employ it because such datasets are very limited for biomedical concepts. We plan to develop related datasets and apply the approach to further enhance BioConceptVec.

Table 3. The statistics of datasets in intrinsic evaluation tasks. There are six datasets in total. #groups: the number of groups in a dataset. Each group has a related set and an unrelated set of genes based on drug-gene interactions provided by CTD or gene sets provided by MSigDB. #distinct concepts: the total number of distinct genes in a dataset. Avg #concepts per group: the average number of genes in a group; note that one gene may be in multiple groups. #pairs: the total number of pairs in a dataset. Avg #pairs per group: the average number of pairs per group.