A pre-training and self-training approach for biomedical named entity recognition

Named entity recognition (NER) is a key component of many scientific literature mining tasks, such as information retrieval, information extraction, and question answering; however, many modern approaches require large amounts of labeled training data in order to be effective. This severely limits the applicability of NER models in settings where expert annotations are difficult and expensive to obtain. In this work, we explore the effectiveness of transfer learning and semi-supervised self-training for improving the performance of NER models in biomedical settings with very limited labeled data (250-2000 labeled samples). We first pre-train a BiLSTM-CRF and a BERT model on a very large general biomedical NER corpus such as MedMentions or Semantic Medline, then fine-tune the model on a more specific target NER task with very limited training data, and finally apply semi-supervised self-training using unlabeled data to further boost model performance. We show that in NER tasks that focus on common biomedical entity types such as those in the Unified Medical Language System (UMLS), combining transfer learning with self-training enables a NER model such as a BiLSTM-CRF or BERT to obtain performance similar to that of the same model trained on 3x-8x the amount of labeled data. We further show that our approach can also boost performance in a low-resource application where entity types are rarer and not specifically covered in UMLS.


Introduction
Named entity recognition (NER) is a critical component of many downstream applications, such as information retrieval, information extraction, and question answering. NER is especially important in the domain of biomedical literature mining, where it is becoming more difficult for individuals to keep up with the sheer volume of new research being published. Effective NER approaches that can identify biomedical concepts such as diseases, chemicals, and proteins can aid researchers in finding relevant research and speed up the process of scientific discovery. Existing NER tools can be broken down into two broad categories: rule-based methods and machine learning methods [1][2][3][4]. Rule-based methods require human experts to hand-craft specific rules for identifying different types of named entities; examples include term matching against an existing concept database such as the Unified Medical Language System (UMLS) [5,6] and pattern matching based on part-of-speech and sentence structure [7,8]. In practice, rule-based methods require expensive expert knowledge to develop and tend to work only within the limited domain for which the rules were developed. Furthermore, in the domain of biomedical literature, rule-based approaches often fail to adapt to the novel concepts and vocabulary that are characteristic of new scientific publications.
On the other hand, machine learning approaches automatically learn patterns for identifying named entities from a large corpus of labeled training data. In general, machine learning approaches tend to be more flexible than rule-based approaches; however, they require large volumes of word-level annotations, which are expensive and difficult to obtain in biomedical settings [9]. The generalizability and accuracy of machine learning approaches, especially newer deep learning models, depend heavily on the amount of labeled data available. In biomedical NER, annotated data is often limited to a particular type of entity such as chemicals or genes; as a result, existing machine learning NER tools are often narrow in scope and can identify only a small set of entity types.
Developing effective biomedical NER systems for a new application area can be difficult when there is very limited annotated training data, as obtaining gold-standard biomedical annotations often requires expensive expert knowledge. In this work, we address this challenge by combining transfer learning, in which we first pre-train a model using a large annotated NER corpus from an adjacent domain, with semi-supervised learning, in which we generate pseudo-labels on unlabeled data from the target domain to improve the performance of our NER model. Using a base NER model such as the popular Bidirectional Long Short Term Memory Conditional Random Field (BiLSTM-CRF) [10] or the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) [11], we show that the combination of transfer learning and semi-supervised learning can significantly reduce the amount of labeled data required to obtain strong performance. Our method obtains F1 scores comparable to a fully supervised model trained on 3x-8x the amount of labeled data when evaluated on eight standard biomedical NER benchmarks. To our knowledge, no previous work thoroughly examines the cumulative effect of transfer learning and semi-supervised learning in the biomedical NER space. Our contributions are as follows:
• We evaluate the benefits of pre-training on three different corpora for biomedical named entity recognition using two common NER approaches-BiLSTM-CRF and BERT-and eight standard biomedical NER datasets covering common biomedical entity types such as chemicals, genes, and diseases.
• We explore the benefits of semi-supervised self-training with different amounts of labeled data using the BiLSTM-CRF and BERT on the same eight standard biomedical NER datasets.
• We show that by combining transfer learning with self-training, a NER model such as a BiLSTM-CRF or BERT can obtain similar performance to a fully supervised model while using only 12%-30% of the total available training data.
• We show that semi-supervised self-training can propagate errors and lower the F1 score when initial model performance is low, and that transfer learning can be critical in low data settings (250-500 labeled samples) to get the initial model performance to a level where semi-supervised learning can be effective.
• We evaluate the effectiveness of pre-training on UMLS entity types and then applying self-training on a downstream NER task where the entities of interest are not the same entity types as those covered in UMLS; we show that these methods can still improve performance.

Related work
Methods for named entity recognition. Traditional NER approaches generally utilized manually crafted expert rules and heuristics to identify entities of interest [12][13][14][15][16][17][18], such as persons, locations, and organizations; these types of rule-based approaches are still in use today in domains such as medicine where labeled training data is difficult to obtain [8,19,20]. Recent work has shown that supervised machine learning approaches, especially deep neural networks, achieve superior performance on various NER tasks [1,2,21]. The BiLSTM-CRF architecture is extremely popular in NER applications due to its strong performance on a wide range of sequence tagging tasks [10][22][23][24]. More recently, BERT has shown state-of-the-art performance across a wide range of natural language processing tasks including named entity recognition [11,25,26]. Existing NER work utilizing BiLSTM-CRF and BERT-based models typically focuses on supervised applications that require tens of thousands or more manually annotated sentences. In this work, we extend these two popular approaches to biomedical NER settings with very few labeled examples by incorporating transfer learning and semi-supervised techniques.
Transfer learning in NER. In transfer learning, a model that is trained on one task is then retrained on and applied to a different related task; knowledge gained when training on the first task may boost performance on the second task, especially when labeled training data is scarce for the second task [27]. A common example is downloading an image classifier that is already trained on the very large ImageNet dataset and then fine-tuning it on a specific image classification task of interest-this often achieves better performance than training on the downstream task only. Transfer learning is highly effective across a wide range of different applications in image recognition and natural language processing [28][29][30][31].
In this study, we build upon previous work that explores how to effectively leverage transfer learning for biomedical NER. [32] showed that pre-training a BiLSTM-CRF on a silver-standard corpus of 50K abstracts, tagged for biomedical entities by automated tools rather than human experts, can boost performance on downstream biomedical NER tasks that have fewer than 6K training samples. Similarly, [33] showed that pre-training a BiLSTM-CRF model on a silver-standard corpus of 5M sentences from PubMed abstracts, tagged using a trained CRF model rather than by human experts, boosts performance on downstream biomedical NER tasks for different entity types. Other work, including [34][35][36][37][38], explores further variations of transfer learning and comes to similar conclusions: transfer learning can significantly improve performance on downstream NER tasks. We extend these previous works by (1) comparing the effectiveness of three NER pre-training corpora of differing size and quality and (2) incorporating semi-supervised learning after transfer learning to further improve the performance of our NER approaches.
In the context of BERT, it can be argued that any application that utilizes BERT also utilizes transfer learning-BERT is pre-trained on a very large corpus of unlabeled text using masked-language-modeling or a similar pre-training task and then fine-tuned on a downstream application [11]. Several previous studies have simply taken BERT models pre-trained on different corpora and then applied them to various downstream NER tasks [39][40][41]. In our work, we first take a BERT model that has been pre-trained on biomedical abstracts and then further pre-train it on a NER task (as opposed to a generic language modeling task); we evaluate whether this second round of pre-training boosts performance on downstream biomedical NER applications.
Semi-supervised learning in NER. In semi-supervised learning, a machine learning model is trained using both labeled and unlabeled data-the model is trained using pseudo-labels or other patterns from the unlabeled data, which can provide a performance boost especially in applications where labeled data is limited [42]. There are many different types of semi-supervised learning, but a simple example is to train a classifier on the labeled data and then use it to predict on the unlabeled data; samples with high prediction confidence are assumed to be labeled correctly and are used to expand the labeled training set. As with transfer learning, semi-supervised learning has been widely and successfully applied across a range of different applications [43][44][45][46].
Several previous works [47-49] have successfully applied semi-supervised methods in the context of NER. These methods generally involve using a combination of existing predictive models, feature similarity metrics, and heuristics to generate NER pseudo-labels on an unlabeled dataset; the pseudo-labels with the highest confidence are then added to the original training data and used to train an improved NER model. In this work, we use an extremely simple semi-supervised technique-self-training-in combination with transfer learning and show that this potent combination can significantly improve the performance of NER models in biomedical settings with very few labeled training examples, especially when the entities of interest overlap with those covered in the pre-training dataset.

Problem description and proposed solution
In this work, we address the standard NER task in which we have a corpus of text segments, typically at the sentence level, where each text segment may contain one or more named entities. Each named entity can consist of one or more consecutive words. Given an unannotated text segment $T$ consisting of words $w_0, w_1, \ldots, w_{l-1}, w_l$ and containing a set of named entities $E = \{e_0, e_1, \ldots, e_{n-1}, e_n\}$, where each entity corresponds to one or more consecutive words, a model $M$ must correctly identify the start and end words of each named entity in $E$. A commonly used way to frame this problem is the BIO annotation scheme, in which each word $w_i$ in $T$ is tagged as either B (first word of a named entity), I (non-first word belonging to a named entity), or O (does not belong to a named entity). This annotation scheme allows for easy parsing of the positions of the entities in $E$, even among entities that share neighboring word boundaries. Thus, the NER task can be framed as a sequence tagging task in which labeling each word $w_i$ in $T$ is a three-class classification problem.
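To make the scheme concrete, the short Python sketch below converts inclusive word-span annotations into BIO tags; the sentence and entity spans are illustrative examples, not drawn from our datasets.

```python
# A minimal sketch of BIO tagging for NER, using a hypothetical example
# sentence; entity spans are given as (start_word, end_word) inclusive.
words = ["Aspirin", "reduces", "the", "risk", "of", "heart", "attack"]
entities = [(0, 0), (5, 6)]  # "Aspirin" and "heart attack"

def spans_to_bio(num_words, spans):
    """Convert inclusive word-index spans into a BIO tag sequence."""
    tags = ["O"] * num_words
    for start, end in spans:
        tags[start] = "B"                      # first word of the entity
        for i in range(start + 1, end + 1):
            tags[i] = "I"                      # subsequent entity words
    return tags

print(list(zip(words, spans_to_bio(len(words), entities))))
# [('Aspirin', 'B'), ('reduces', 'O'), ('the', 'O'), ('risk', 'O'),
#  ('of', 'O'), ('heart', 'B'), ('attack', 'I')]
```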
Whereas previous works in NER sequence tagging often focus on the supervised setting in which there are thousands or tens of thousands of annotated training examples, we specifically focus on settings where (1) there are limited annotated training examples in the target domain (between 250 and 2000 labeled sentences in our case), i.e., the train dataset, (2) there is access to unannotated text segments within the same target domain, i.e., the unsupervised dataset, and (3) there exist one or more corpora of annotated training data from a neighboring or related domain, i.e., the pre-training dataset.
To address the challenges associated with limited training data within the target domain, we first pre-train a model on the annotated data from the pre-training dataset and then use the limited annotated data from the train dataset to further fine-tune the model. Finally, we apply semi-supervised learning on the remaining unannotated data in the unsupervised dataset to further boost the performance of the model. Our overall workflow is illustrated in Fig 1, and we explain each step in greater detail in the following subsections.

NER models
For our NER models, we utilize a BiLSTM-CRF, which is a widely used architecture for sequence-tagging tasks, and BERT, which is a relatively new architecture that is at or close to the state of the art in many NER tasks including biomedical NER.
For our BiLSTM-CRF model, we utilize publicly available Word2Vec embeddings of dimension size 200 that are pre-trained on PubMed and PMC texts. Because our word embeddings are trained on all of PubMed and PMC, our word embedding matrix contains approximately 2.3 million unique words [50]; however, our pre-training and NER datasets only use a small fraction of this total vocabulary. Therefore, we freeze the word embeddings during training (rather than initializing them as trainable parameters) to reduce overfitting and improve the generalizability of our BiLSTM-CRF.
Our BiLSTM-CRF model architecture consists of two bidirectional LSTM layers with 300 units each, followed by a CRF classification layer. All training is performed using the Adam optimizer with batch size 128 and learning rate 1e-4. We note that while recent work introduces more complex sequence tagging architectures, such as incorporating character-level inputs [51] and convolutional neural networks (CNNs) [52], we kept our BiLSTM-CRF model fairly simple to show that our approach works with both simple and state-of-the-art models.
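As a concrete reference, the following is a minimal PyTorch sketch of this architecture, assuming the third-party pytorch-crf package for the CRF layer; the small random embedding matrix stands in for the ~2.3M-word PubMed/PMC Word2Vec vectors, and all names are illustrative rather than our exact implementation.

```python
# Sketch of the BiLSTM-CRF described above, assuming PyTorch and the
# pytorch-crf package (pip install pytorch-crf). Hyperparameters follow
# the text: 200-d frozen Word2Vec embeddings, two BiLSTM layers with
# 300 units each, a CRF over the three BIO tags, Adam with lr 1e-4.
import torch
import torch.nn as nn
from torchcrf import CRF

class BiLSTMCRF(nn.Module):
    def __init__(self, embedding_weights, num_tags=3):
        super().__init__()
        # Freeze the pre-trained Word2Vec embeddings to reduce overfitting.
        self.embed = nn.Embedding.from_pretrained(embedding_weights, freeze=True)
        self.lstm = nn.LSTM(input_size=200, hidden_size=300, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * 300, num_tags)   # emission scores
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, token_ids, tags, mask):
        emissions = self.proj(self.lstm(self.embed(token_ids))[0])
        return -self.crf(emissions, tags, mask=mask)  # negative log-likelihood

    def decode(self, token_ids, mask):
        emissions = self.proj(self.lstm(self.embed(token_ids))[0])
        return self.crf.decode(emissions, mask=mask)  # best BIO tag paths

# Hypothetical usage with a small random matrix standing in for the
# real PubMed/PMC Word2Vec embedding matrix:
model = BiLSTMCRF(torch.randn(50_000, 200))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```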
For our BERT model, we utilize the pre-trained WordPiece vocabulary and model weights from BlueBERT Base [53], which is the BERT Base model pre-trained on PubMed abstracts and MIMIC III clinical notes, as this model has shown superior performance on biomedical and medical NER tasks compared to other BERT-based models such as BioBERT [40]. We note this version utilizes an uncased vocabulary. For additional information about the architecture of BERT, we refer readers to previous work that describes BERT thoroughly [54][55][56][57][58].
We utilize the standard token classification setup for BERT, in which a sequence of input tokens is processed by the BERT model, and each output token is passed to a dense linear layer followed by a softmax classification layer that assigns labels. We note that BERT utilizes the WordPiece tokenizer, which breaks long words into subword tokens; however, all our ground truth labels for NER tagging are at the word level rather than the subword level. Following previous work on applying BERT for NER [59], during training and inference we only use the label from the first subword token associated with each word. All models are implemented using the Huggingface Transformers library [60], and training is performed using the Adam optimizer with batch size 32 and learning rate 5e-5.
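A minimal sketch of this first-subword alignment using the Huggingface fast-tokenizer API is shown below; the bert-base-uncased checkpoint is a stand-in for the BlueBERT weights, and -100 is the label value that the library's token-classification loss ignores.

```python
# Sketch of word-to-subword label alignment for BERT token classification,
# assuming the Huggingface Transformers library. Only the first WordPiece
# of each word keeps its BIO label; continuation pieces and special tokens
# get -100, which the cross-entropy loss ignores.
from transformers import AutoTokenizer

# Stand-in checkpoint; substitute the actual BlueBERT Base (uncased) weights.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

LABEL2ID = {"O": 0, "B": 1, "I": 2}

def align_labels(words, word_labels):
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    labels, prev_word = [], None
    for word_idx in enc.word_ids():
        if word_idx is None:            # [CLS], [SEP], padding
            labels.append(-100)
        elif word_idx != prev_word:     # first subword of a word
            labels.append(LABEL2ID[word_labels[word_idx]])
        else:                           # continuation subword
            labels.append(-100)
        prev_word = word_idx
    enc["labels"] = labels
    return enc

example = align_labels(["Aspirin", "reduces", "thrombosis"], ["B", "O", "B"])
```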

As a final baseline, we include the performance of two out-of-the-box tools that are popular resources for biomedical NER-scispaCy [61] and MetaMap [62]. ScispaCy is a deep-learning-based approach trained on the MedMentions dataset, while MetaMap is a rule-based approach that utilizes a manually curated dictionary. Because these two tools can perform NER without requiring any additional labeled training data, any method that utilizes supervised training on labeled data should at least outperform these two tools to be considered practically useful.

Transfer learning
To alleviate the limitations associated with a small number of labeled examples, we evaluate the effects of transfer learning, in which we first pre-train our models on a large NER dataset from a related domain and then fine-tune the model weights on the target NER dataset. For our pre-training datasets, we utilize Semantic Medline (available online at https://skr3.nlm.nih.gov/SemMedDB/), which consists of approximately 28M PubMed abstracts automatically annotated for all UMLS entities using the rule/dictionary-based MetaMap tool [63,64], and MedMentions (available online at https://github.com/chanzuckerberg/MedMentions), which consists of approximately 4K abstracts manually annotated for UMLS entities by human experts [65]. For all pre-training datasets, we use sentence-level inputs annotated with word-level BIO labels without entity type. We generate three different pre-training datasets: ~100K annotated sentences randomly sampled from Semantic Medline, ~1M annotated sentences randomly sampled from Semantic Medline, and all ~50K sentences from the MedMentions dataset. Detailed dataset descriptions are available in Table 1.
For each of the pre-training datasets, we use an 80/20 split to create training and validation sets. For the BiLSTM-CRF, we train on the training set and validate on the validation set after each epoch, stopping training when validation exact-F1 stops improving for five consecutive epochs. For BlueBERT, we use the same setup for the MedMentions dataset; however, we observed that using this setup on the Semantic Medline datasets causes BlueBERT to overfit and significantly reduces performance on downstream tasks. This is likely because (1) the BERT Base model has roughly 110M learnable parameters and can learn extremely nuanced patterns and (2) the labels in Semantic Medline are more prone to errors because they are generated by a rule-based method. Therefore, we limit training to a single epoch on both Semantic Medline datasets.

Supervised fine-tuning
Once the model has been pre-trained on one of the pre-training datasets, we fine-tune it using the target NER dataset. In our experimental setup, we assume that only a fraction of sentences within the target NER dataset has annotations. For example, in a dataset with 10K total sentences, only 500 sentences may have gold-standard annotations.
We use 80/20 splitting on the annotated subset of the dataset to create a train and validation set. We initialize the model using the weights obtained from the pre-training step, and then train on the train set, validating on the validation set after every epoch. Training stops when validation exact-F1 stops improving for ten consecutive epochs.
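As a reference, a minimal sketch of this patience-based early stopping is shown below; train_one_epoch and evaluate_f1 are hypothetical helpers standing in for model-specific code, and restoring the best checkpoint at the end is our assumption rather than a detail stated above.

```python
# Sketch of the fine-tuning protocol: validate after every epoch and stop
# when validation exact-F1 has not improved for `patience` consecutive
# epochs. train_one_epoch and evaluate_f1 are hypothetical helpers.
def fine_tune(model, train_set, val_set, patience=10, max_epochs=200):
    best_f1, best_state, epochs_without_improvement = -1.0, None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_set)
        f1 = evaluate_f1(model, val_set)       # entity-level exact F1
        if f1 > best_f1:
            best_f1, epochs_without_improvement = f1, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                          # early stop
    model.load_state_dict(best_state)          # restore best checkpoint
    return model, best_f1
```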

Semi-supervised learning
We use a simple semi-supervised method-self-training-to further boost performance by utilizing the unlabeled portion of each target NER dataset. After the supervised fine-tuning step, we use the model to predict labels on each unannotated sentence in the target dataset, hereafter referred to as the unsupervised set. For each sentence in the unsupervised set, we measure the average prediction confidence across all tokens within that sentence. Sentences whose average confidence meets a defined confidence threshold are then moved from the unsupervised set to the training set, using the predicted pseudo-labels as the ground truth labels. We then repeat the supervised fine-tuning step by initializing a new model with the weights obtained from the pre-training step and training on the enlarged training set (the original training set plus the high-confidence sentences from the unsupervised set); we note that we retain the original validation set to ensure that only gold-standard labels are used for validation. Once the model has been trained, we once again apply self-training, predicting on the unsupervised set and moving high-confidence sentences into the train set. We repeat this process until no more sentences in the unsupervised set meet the required confidence threshold, as sketched below.
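```python
# Sketch of the iterative self-training loop described above. make_model,
# fine_tune (from the earlier sketch), and sentence_confidence (two concrete
# variants are sketched after the next paragraph) are hypothetical helpers;
# pretrained_weights are the weights produced by the pre-training step.
def self_train(make_model, pretrained_weights, train_set, val_set,
               unsup_set, threshold=0.9975):
    while True:
        model = make_model()
        model.load_state_dict(pretrained_weights)   # restart from pre-training
        model, _ = fine_tune(model, train_set, val_set)

        promoted, remaining = [], []
        for sentence in unsup_set:
            pseudo_labels, confidence = sentence_confidence(model, sentence)
            if confidence >= threshold:
                promoted.append((sentence, pseudo_labels))   # pseudo-labeled
            else:
                remaining.append(sentence)

        if not promoted:      # stop: nothing crosses the confidence threshold
            return model
        train_set = train_set + promoted   # gold validation set kept unchanged
        unsup_set = remaining
```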
In our experiments, a sentence must reach 99.75% average confidence across all of its tokens to be moved from the unsupervised set to the train set; we discuss the choice of confidence threshold in our Discussion section. For a given sentence, we obtain the average confidence from the BiLSTM-CRF by calculating the log-likelihood of the sequence of predicted labels (using the forward pass of the CRF) and dividing by the number of words in the sentence. To obtain the average confidence for a given sentence from BERT, we collect the softmax score of each predicted label in the sequence and average the scores.
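The sketch below illustrates one possible implementation of these two confidence scores for the models sketched earlier; mapping the CRF's per-word average log-likelihood back to a probability scale via exponentiation is our assumption, and masking of special tokens is omitted for brevity.

```python
# Sketch of the two sentence-level confidence scores described above,
# built on the BiLSTMCRF sketch from earlier. Details such as special-token
# handling and the exact normalization are assumptions.
import math
import torch

def crf_sentence_confidence(model, token_ids, mask):
    """Geometric-mean per-word confidence from the BiLSTM-CRF: sequence
    log-likelihood divided by sentence length, then exponentiated."""
    with torch.no_grad():
        tags = model.decode(token_ids, mask)[0]          # predicted BIO path
        emissions = model.proj(model.lstm(model.embed(token_ids))[0])
        log_likelihood = model.crf(emissions, torch.tensor([tags]), mask=mask)
    return tags, math.exp(log_likelihood.item() / len(tags))

def bert_sentence_confidence(model, inputs):
    """Mean softmax probability of the predicted label at each position."""
    with torch.no_grad():
        logits = model(**inputs).logits[0]               # (seq_len, num_labels)
    top_probs, predicted = logits.softmax(dim=-1).max(dim=-1)
    return predicted.tolist(), top_probs.mean().item()
```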

NER datasets
To evaluate the effectiveness of our methodology, we test the performance of our approach on eight commonly used biomedical NER datasets that cover different types of biomedical entities-BC2GM, BC4CHEMD, BC5CDR-chem, BC5CDR-disease, JNLPBA, NCBI-disease, Linnaeus, and S800. For all datasets, each data sample is composed of a sentence with word-level tokens $X = (w_1, \ldots, w_n)$ and associated word-level BIO annotations $Y = (y_1, \ldots, y_n)$. We note that while the datasets cover different entity types, within each dataset entities are not annotated for type. Table 2 shows detailed descriptions of each dataset and how we split them into train, unsupervised, and test sets. Each dataset is publicly available and can be downloaded from https://github.com/dmis-lab/biobert.

These eight NER datasets cover common biomedical entity types that are often extracted for various applications. These entity types are thus generally included within the UMLS Metathesaurus, and so some of the entities in these NER datasets may be covered within the pre-training datasets. In Table 3, we measure the percentage of unique entities in each NER dataset (train, dev, and test) that also appear at least once as labeled entities in the pre-training datasets. In our experiments, we evaluate the relationship between the amount of entity overlap and the effectiveness of transfer learning.

Evaluation metrics
We adapt the entity-level evaluation metrics from SemEval 2013 task 9.1 [66,67]. For each task, we measure exact precision, recall, and F1 as well as partial precision, recall, and F1. The exact metrics give credit only if the NER model predicts the exact word boundaries of a given entity, while the partial metrics give partial credit if the model predicts part of an entity. Because our datasets are not annotated for entity types, we do not incorporate entity type into our evaluation. In the definitions below, "Correct" refers to entities whose predicted boundaries exactly match the ground truth boundaries, "Partial" refers to entities whose predicted boundaries overlap but do not exactly match the ground truth boundaries, "Missing" refers to entities that are in the ground truth labels but missed by the NER model, and "Spurious" refers to entities predicted by the NER model but not actually in the ground truth labels. (We note that "Incorrect" is used for incorrect entity types and is not applicable to our datasets.) Letting Actual = Correct + Partial + Spurious be the number of predicted entities and Possible = Correct + Partial + Missing be the number of ground truth entities, the metrics are calculated as described in Eqs 1-4:

$$\text{Exact Precision} = \frac{\text{Correct}}{\text{Actual}} \quad (1) \qquad \text{Exact Recall} = \frac{\text{Correct}}{\text{Possible}} \quad (2)$$

$$\text{Partial Precision} = \frac{\text{Correct} + 0.5 \times \text{Partial}}{\text{Actual}} \quad (3) \qquad \text{Partial Recall} = \frac{\text{Correct} + 0.5 \times \text{Partial}}{\text{Possible}} \quad (4)$$

In each case, the F1 score is the harmonic mean of the corresponding precision and recall.
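A minimal Python sketch of these entity-level metrics, assuming entities are represented as inclusive (start, end) word spans, is shown below; the official SemEval scorer resolves edge cases such as one prediction overlapping several gold entities with additional rules, so this is an approximation.

```python
# Sketch of the SemEval-style entity-level metrics defined above. Entities
# are (start, end) word spans; "partial" means overlapping but not equal.
def overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

def entity_metrics(gold_spans, pred_spans):
    gold, pred = set(gold_spans), set(pred_spans)
    correct = len(gold & pred)
    # Predicted spans that overlap some gold span without matching exactly.
    partial = sum(1 for p in pred - gold
                  if any(overlap(p, g) for g in gold - pred))
    # Gold spans with no exact or partial match are "missing".
    missing = len(gold) - correct - sum(
        1 for g in gold - pred if any(overlap(g, p) for p in pred - gold))
    actual, possible = len(pred), len(gold)
    exact_p = correct / actual if actual else 0.0
    exact_r = correct / possible if possible else 0.0
    partial_p = (correct + 0.5 * partial) / actual if actual else 0.0
    partial_r = (correct + 0.5 * partial) / possible if possible else 0.0
    f1 = lambda p, r: 2 * p * r / (p + r) if p + r else 0.0
    return {"exact": (exact_p, exact_r, f1(exact_p, exact_r)),
            "partial": (partial_p, partial_r, f1(partial_p, partial_r))}

print(entity_metrics([(0, 0), (5, 6)], [(0, 0), (5, 5), (9, 9)]))
# exact: P=1/3, R=1/2; partial: P=1.5/3, R=1.5/2
```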

Comparing different pre-training corpora
A critical part of transfer learning is selecting an appropriate corpus on which to pre-train our models. In our study, we consider three different corpora for pre-training-~100K random sentences from SemMed, ~1M random sentences from SemMed, and the full MedMentions dataset.
In our first set of experiments, we evaluate the benefits of pre-training in settings where no labeled training data is available in the target domain. In Table 4 (see S1 Table for partial metrics), we show the performance of the BiLSTM-CRF and BlueBERT when pre-trained on each of the three pre-training corpora and then applied directly to each of our target NER datasets without any fine-tuning. We also include the performance of MetaMap (2018 version) and scispaCy as baselines, as neither of these two popular NER tools requires fine-tuning for use. Our results show that when no fine-tuning on the downstream dataset is used, it is difficult to distinguish the effectiveness of pre-training on different corpora; in many cases, the popular MetaMap and scispaCy tools have comparable or better precision and recall than our pre-trained models. The ambiguity of these results suggests that if no labeled training samples are available in the downstream target dataset, there is no guarantee that pre-training a custom model for biomedical NER will work any better than simply using MetaMap or scispaCy.

In our next set of experiments, we include limited labeled training data from the target domain and then re-evaluate the effects of pre-training. In Table 5 (see S2 Table for partial metrics), we show the performance of the BiLSTM-CRF and BlueBERT when pre-trained on each of the three pre-training corpora and then fine-tuned on 1000 labeled sentences (800 train, 200 validation) from each target dataset. For comparison, we also include the performance of the BiLSTM-CRF and BlueBERT trained directly on 1000 labeled sentences without any pre-training. In these experiments, the benefit of transfer learning becomes much clearer-all methods perform much better than the scispaCy and MetaMap baselines. We see that for the BiLSTM-CRF, pre-training on any of the three datasets results in better precision and recall than no pre-training. Of the three pre-training corpora, pre-training on 1M sentences from SemMed gives the best overall performance. This suggests that for the BiLSTM-CRF model (and other similar models utilizing Word2Vec embeddings), it is most beneficial to pre-train on very large corpora, as the model is exposed to more useful vocabulary patterns and NER information.
On the other hand, for BlueBERT, the results across the different pre-training corpora are more mixed, and in some cases the base BlueBERT model (without any further pre-training) yields the strongest results. We expect that this is because the base BlueBERT model is already pre-trained using the masked-language-modeling task on all of PubMed, so it may have already learned information useful for downstream NER. Our results show that further pre-training an existing BERT model such as BlueBERT on a large NER dataset is helpful for some but not all downstream NER tasks. Pre-training BlueBERT on MedMentions resulted in the highest overall performance across the most downstream NER datasets. This may be because MedMentions, while smaller than the two SemMed corpora, is hand-labeled by humans and thus its labels are far more accurate; with an extremely powerful model such as BlueBERT that can learn extremely nuanced and subtle patterns, the quality of the labels may matter more than the quantity.

When we compare the overall improvement from pre-training with the entity overlap between the pre-training datasets and the target NER datasets shown in Table 3, we observe no clear relationship. For example, BC5CDR-disease has the largest overlap with the pre-training datasets; however, for the BiLSTM-CRF, the magnitude of improvement from pre-training is similar to that on BC2GM, which has the smallest overlap. Furthermore, none of the pre-training datasets improved performance on BC5CDR-disease for BlueBERT. As another example, S800 has very low overlap with the pre-training datasets, yet the improvement from pre-training is far larger than on other datasets with more overlap. This indicates that low entity overlap in the pre-training dataset does not necessarily mean that transfer learning will not give a significant performance boost, and vice versa.

Effects of transfer and semi-supervised learning
In Table 6 (see S3 and S4 Tables for partial metrics), we show the effects of transfer learning and semi-supervised learning on the various NER datasets given different amounts of labeled training data. For all BiLSTM-CRF experiments, we pre-train the model on 1M sentences from SemMed because this gave the strongest overall performance in Table 5. Likewise, for BlueBERT, we pre-train on MedMentions because it gave the strongest overall performance in Table 5. For both models, we also include the performance of a fully supervised version (trained on all available sentences in the train and unsupervised sets of each dataset; see Table 2 for the size of each dataset) without any pre-training for comparison.
When examining our BiLSTM-CRF results, we see that in general, more labeled data results in better performance with both transfer learning and semi-supervised learning. Compared to transfer learning alone, the self-training step almost always provides an additional performance boost; this boost is especially noticeable when there are few labeled training samples. In five of our eight NER datasets, combining transfer learning with self-training using 2000 labeled sentences (approximately 12%-30% of the total available labeled data, depending on the dataset) yields similar or better performance than a fully supervised model trained on the full dataset.
We observe similar trends in our BlueBERT results. Increasing the amount of labeled data increases the performance of both transfer and semi-supervised learning. Incorporating self-training on the unlabeled data provides a boost in F1 score for all but one combination of dataset and training size (the exception being BC4CHEMD with 250 labeled sentences); this difference is especially noticeable when the amount of labeled data is small. In five of our eight NER datasets, combining transfer learning with self-training using 2000 labeled sentences comes within 0.03 F1 score of fine-tuning BlueBERT on the full dataset. As expected, given the same training and data conditions, BlueBERT obtains notably better performance than the BiLSTM-CRF.
Our results show that in biomedical NER settings with small amounts of labeled training data, combining transfer learning and semi-supervised learning can boost precision and recall both for simple NER models such as a word-level BiLSTM-CRF and for more complex, state-of-the-art NER models such as BERT. We note that our experiments focus on downstream NER applications with common biomedical entity types that overlap with the UMLS entity types covered in the pre-training datasets; we explore the effectiveness of these methods on a low-resource dataset with rare entity types in our Discussion section.

Table 6. Exact precision, recall, and F1 score of the BiLSTM-CRF and BlueBERT on each of our target datasets when fine-tuning on different amounts of labeled sentences, with and without semi-supervised self-training. A fully supervised version is included for comparison. For all sets of training data, 80% of the available data is used for training and 20% for validation.

Training time
We measured the approximate training times for each phase of our methodology to give potential users a rough estimate of the associated computational requirements. All time measurements were performed using a single Tesla V100 GPU. For the BiLSTM-CRF, the pre-training step takes approximately one day for SemMed 1M; the fine-tuning step usually takes less than five minutes when using 1000 labeled sentences; and the semi-supervised step takes from approximately one hour for the smallest dataset (NCBI-disease) to approximately sixteen hours for the largest dataset (BC4CHEMD). For BlueBERT, the pre-training step takes approximately one hour for MedMentions; the fine-tuning step usually takes less than ten minutes when using 1000 labeled sentences; and the semi-supervised step takes from approximately three hours for the smallest dataset (NCBI-disease) to approximately two days for the largest dataset (BC4CHEMD).

Application on low-resource datasets
One potential limitation of our study is that our pre-training datasets-SemMed and MedMentions-are labeled for UMLS entities and therefore may cover some of the target entities in our downstream test datasets. Thus, it is unclear how well transfer learning by pre-training on SemMed or MedMentions will help on downstream biomedical NER tasks where the target entity types are not covered by UMLS. To explore this further, we evaluate the effect of transfer learning and self-training using the 2018 Text Analysis Conference Systematic Review Information Extraction task (TAC SRIE) [68].

The TAC SRIE dataset (available online at https://tac.nist.gov/2018/SRIE/data.html) consists of the "Material and Methods" sections from 100 scientific articles covering experiments in which animals were exposed to environmental toxins and other environmental factors. Each text section is annotated by human toxicology experts for words and entities that describe the experimental design of the study; these include exposure (variable being tested, vehicle of delivery, purity of exposure, verification of exposure), animal group (control group, sample size, species, sex), dose group (dose amount, dose unit, dose frequency, dose duration, dose duration units, time of first dose, time of last dose, time units), and endpoints (effect of dose, unit of measurement, time of measurement). We refer readers to [68] for more details about the entity types and the dataset. We note that the entity types annotated in TAC SRIE are generally not among the entity types covered by UMLS and thus are likely to appear in different contexts than the entities from our pre-training datasets.

The TAC SRIE dataset also includes "Material and Methods" sections from 344 additional articles that do not include any annotations. These articles are intended for evaluation, but the labels are not publicly available. For our experiment, we utilize these 344 articles as our unlabeled set for self-training.

For our evaluation, we utilize two versions of the TAC SRIE dataset. In the first version, we include all annotations and entity types provided in the dataset. In the second version, we exclude annotations from the "species" and "sex" entity types; we exclude "species" because this entity type is the most likely to overlap with UMLS and therefore with the pre-training sets, and we exclude "sex" because identifying this entity type usually reduces to a simple keyword search for "male" or "female". We provide a summary of our TAC SRIE datasets in Table 7. We use 80/10/10 splitting on the labeled set to create train/val/test sets, and we use the same experimental setup as in our main experiments: we pre-train our models, fine-tune on the labeled set, and finally apply self-training on the unlabeled set. We note that TAC SRIE includes fine-grained entity type labels for each named entity; however, for our evaluation we do not predict specific entity types and only predict BIO annotations for entity or non-entity.

Table 8 shows the performance of the BiLSTM-CRF and BlueBERT on the TAC SRIE datasets with and without pre-training and self-training. For the BiLSTM-CRF, we see that pre-training on 1M sentences from SemMed provides a large boost in precision and recall for both the full dataset and the dataset without species and sex annotations. However, the gain from self-training is very small and inconsistent. We expect that this is because the initial model performance prior to self-training is not high enough for self-training to propagate more knowledge than errors; we explore this further in our Discussion section.
On the other hand, we see that pre-training on MedMentions is not particularly helpful for BlueBERT compared to the base BlueBERT. This is not particularly surprising; as we showed in our previous experiments, since BlueBERT is already pre-trained using masked-language-modeling, further pre-training on an NER dataset such as MedMentions sometimes, but not always, provides an additional performance boost. Unlike with the BiLSTM-CRF, self-training gives a noticeable boost in performance for BlueBERT. We expect that this is because the initial model performance prior to self-training is strong enough that self-training can propagate more knowledge than errors.
Our results suggest that pre-training on UMLS entities and then self-training can be beneficial for downstream biomedical NER tasks even if they do not focus specifically on common UMLS entities. However, a more detailed study using a wider variety of low-resource biomedical NER tasks may be needed to establish the full scope of the benefits and limitations of our proposed methods in the context of low-resource settings.

Self-training failure analysis
Based on our results in Table 6, we observe a general trend that semi-supervised self-training improves the overall F1 scores of the models, especially when there is a small amount of labeled data. However, in rare cases, such as the BiLSTM-CRF on S800 with 250 initial labeled sentences, the overall F1 score drops significantly; multiple repeat runs showed the same behavior. One possible explanation for this behavior is that self-training propagates both knowledge and errors-a model that is highly confident when it is wrong will propagate bad labels during the self-training phase, thereby harming the performance of the final model. Therefore, when the model has low initial performance before the self-training phase, self-training may not be as effective.
To better understand this phenomenon, we show the performance of the BiLSTM-CRF after each iteration of self-training under three different scenarios-S800 with 250 initial labeled sentences, Linnaeus with 250 initial labeled sentences, and BC2GM with 1000 initial labeled sentences (Fig 2). Linnaeus and S800 with 250 initial samples were chosen because the BiLSTM-CRF had the lowest F1 scores on these two datasets prior to self-training. In the S800 scenario, the performance of the model over the course of self-training is highly volatile. We observe that precision increases noticeably over time, especially in the early iterations; however, recall, which is already low to begin with, decreases over time, causing the overall F1 score to be highly variable across iterations. Self-training on Linnaeus does not show this behavior; precision, recall, and F1 score all show an initial increase and then hold at a fairly steady level through the remainder of the self-training process. Lastly, the self-training progress on BC2GM is representative of the typical progression that we observed in most of the scenarios in this study-there are small to moderate gains in precision, recall, and F1 score over the course of self-training, with occasional volatility caused by the inherent stochasticity of training deep learning models.
A common practice in self-training and other forms of semi-supervised learning is to iterate the semi-supervised method until no more samples meet the confidence threshold or some similar stopping criterion is met. However, our analysis shows that this practice may not always yield the highest performance, especially when the initial model performs poorly. An alternative, such as tracking the validation score on a dedicated set of gold-standard labels, may help safeguard against situations where self-training lowers overall performance.
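One such safeguard can be sketched as follows, assuming a gold-standard validation set and hypothetical run_self_training_iteration and evaluate_f1 helpers: keep the checkpoint from whichever self-training iteration scores highest on validation rather than the final one.

```python
# Sketch of a validation-based safeguard for self-training: rather than
# keeping the model from the final iteration, retain the checkpoint from
# whichever iteration scored highest on a gold-standard validation set.
# run_self_training_iteration and evaluate_f1 are hypothetical helpers.
def safeguarded_self_train(state, val_set, max_iterations=50):
    best_f1, best_model = -1.0, None
    for _ in range(max_iterations):
        model, state, promoted = run_self_training_iteration(state)
        f1 = evaluate_f1(model, val_set)     # gold labels only
        if f1 > best_f1:
            best_f1, best_model = f1, model
        if promoted == 0:                    # nothing crossed the threshold
            break
    return best_model, best_f1
```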

Effect of transfer learning on self-training
As we have previously shown, semi-supervised learning can propagate both knowledge and errors; thus, semi-supervised approaches such as self-training can be unreliable if the initial model has low performance. In settings with very few labeled examples, transfer learning can be critical in boosting initial model performance to a level where semi-supervised learning can provide a reliable boost. To demonstrate this effect, we analyzed the performance of self-training on the BC2GM dataset with and without transfer learning using both the BiLSTM-CRF and BlueBERT (Table 9).
For the BiLSTM-CRF, we observe that for all data sizes and all training scenarios, the pre-trained BiLSTM-CRF performs far better than the BiLSTM-CRF without pre-training. We note that when using the BiLSTM-CRF with no pre-training and 250 labeled sentences, no samples met the confidence threshold required to move data from the unsupervised set to the train set; therefore, self-training could not even be utilized. Compared to the pre-trained BiLSTM-CRF, the BiLSTM-CRF without pre-training also showed far greater instability throughout self-training-performance often peaked in the early iterations of self-training and then slowly dropped in the later iterations. We observe a similar trend in the BlueBERT experiments: for all data sizes and training scenarios, BlueBERT Base has lower F1 scores than BlueBERT Base with a second round of pre-training on MedMentions; however, the difference in performance is much smaller than for the BiLSTM-CRF. This is expected-as we showed in Table 5, BlueBERT Base already performs strongly on NER tasks because the base model is already pre-trained, and the second round of pre-training on an NER-specific dataset is not guaranteed to provide an additional performance boost.
These results show that transfer learning can be a critical tool in biomedical NER settings with very few labeled examples. When labeled data is extremely scarce, transfer learning may be required to bring the model up to a level of performance where semi-supervised learning can then be effectively applied. As shown in our experiments, the combination of transfer learning and semi-supervised learning can be a potent tool in improving performance in biomedical NER compared to a baseline model that uses neither, especially in situations where there are very few labeled sentences.

Choosing the right confidence threshold for self-training
The choice of confidence threshold for self-training can have a notable impact on the final performance of the NER model. For simplicity and consistency, we used a threshold of 99.75% confidence across all of our experiments; during the hyperparameter tuning phase, we observed that this threshold returned generally strong results on most of the datasets. However, we note that this threshold is not guaranteed to be optimal in all settings.
In our experiments, we observed three general trends. First, lower confidence thresholds require fewer iterations of self-training because each iteration adds more samples from the unlabeled set, so the unlabeled set is used up more quickly. We noticed that for some datasets, extremely high thresholds also require fewer iterations of self-training because after a number of initial iterations, no more samples from the unlabeled set make it past the threshold. Second, too low or too high a confidence threshold results in a lower overall F1 score; the optimal range for the confidence threshold varies by dataset. Third, the specific way different confidence thresholds affect precision, recall, and F1 score depends on the dataset and model.

In Table 10, we show how different confidence thresholds affect self-training using the BiLSTM-CRF (pre-trained on SemMed 1M) on S800 with 250 labeled sentences and on BC2GM with 1000 labeled sentences. On S800, we observe that lower thresholds improve recall at the expense of precision, whereas higher thresholds improve precision at the expense of recall. On BC2GM, this trend is much weaker, and when the confidence threshold is set too low, both precision and recall drop. In both datasets, setting the confidence threshold too high or too low reduces the overall F1 score; furthermore, the confidence threshold that produces the highest overall F1 score is not the same between the two datasets.
From these results, we see that it is difficult to define a universal "best" confidence threshold that will work well for all situations. Instead, users will likely need to tune the confidence threshold as a hyperparameter based on the needs of the specific application.
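As a sketch, this tuning can be a simple grid search over candidate thresholds using validation exact-F1, reusing the hypothetical self_train and evaluate_f1 helpers from the earlier sketches:

```python
# Sketch of tuning the self-training confidence threshold as a
# hyperparameter via grid search on validation exact-F1. The candidate
# values below are illustrative, not recommendations.
def tune_threshold(make_model, pretrained_weights, train_set, val_set,
                   unsup_set, candidates=(0.95, 0.99, 0.9975, 0.999)):
    best = None
    for threshold in candidates:
        model = self_train(make_model, pretrained_weights, list(train_set),
                           val_set, list(unsup_set), threshold=threshold)
        f1 = evaluate_f1(model, val_set)
        if best is None or f1 > best[1]:
            best = (threshold, f1, model)
    return best  # (best_threshold, best_val_f1, best_model)
```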

Conclusion
In this work, we evaluated the effectiveness of combining transfer learning with semi-supervised learning to perform biomedical NER in applications that have limited amounts of labeled training data and that focus on common biomedical entities such as those covered in UMLS. We used two different base models-a BiLSTM-CRF and BlueBERT-and evaluated on eight standard biomedical NER datasets covering different types of common biomedical entities. For each dataset, we generated scenarios with different amounts of available labeled data-250, 500, 1000, and 2000 labeled sentences.
For each model, we first evaluated the effect of pre-training on three different corpora-~100K sentences from SemMed, ~1M sentences from SemMed, and all ~50K sentences from MedMentions. We found that for the BiLSTM-CRF model, pre-training on 1M sentences from SemMed provided the largest boost in performance. Since BlueBERT is already pre-trained, the effect of the second round of pre-training was less consistent; overall, further pre-training of BlueBERT on MedMentions gave the best results.

Next, we evaluated the effect of incorporating semi-supervised self-training into each model. For both the BiLSTM-CRF and BlueBERT, we found that in almost all scenarios, self-training gave a boost to the final F1 scores; this boost was especially large in scenarios with very few labeled sentences (250 and 500 initial labeled sentences). Because self-training can propagate both knowledge and errors, in rare cases where model performance was very low before self-training was applied, self-training had inconsistent results and sometimes lowered the F1 score. In our analysis, we showed that transfer learning is critical in scenarios with very few labeled sentences to bring model performance up to a level where self-training can be effective.
One limitation of our study is that our experiments focused on downstream NER tasks with common entity types that are covered by UMLS. As a result, the UMLS entities annotated in our pre-training datasets may overlap with the entities in the downstream NER tasks, and it is unclear how much pre-training and self-training will help in downstream NER tasks that utilize entity types not covered in UMLS. To help address this limitation, we showed that pre-training and self-training can still boost performance when applied to TAC SRIE, a low-resource dataset where the goal is to extract entities related to experimental procedures from toxicology papers; the entity types of interest in TAC SRIE are generally not covered by UMLS. However, we note that a broader study utilizing a wider range of low-resource NER datasets is required to establish the effectiveness of our methods in low-resource settings.
In this work, we utilized self-training for our semi-supervised method, which is an extremely simple approach. We expect that more sophisticated semi-supervised methods, such as co-training or tri-training using models pre-trained on different corpora, may provide better performance. Future work also includes evaluating the effect of transfer learning and semi-supervised learning on datasets where predicting entity type is part of the NER task. The code used for our experiments is available online at https://code.ornl.gov/biomedner/biomedner.

Supporting information

S1 Table. Exact and partial precision, recall, and F1 score of the BiLSTM-CRF and BlueBERT on each of our target datasets when pre-trained on different corpora without fine-tuning. (TIF)

S2 Table. Exact and partial precision, recall, and F1 score of the BiLSTM-CRF and BlueBERT on each of our target datasets when pre-trained on different corpora and fine-tuned on 1000 labeled samples (800 train, 200 validation). (TIF)

S3 Table. Exact and partial precision, recall, and F1 score of the BiLSTM-CRF on each of our target datasets when fine-tuned on different amounts of labeled data, with and without semi-supervised self-training. A fully supervised version is included for comparison. For all sets of training data, 80% of the available data is used for training and 20% for validation. (TIF)

S4 Table. Exact and partial precision, recall, and F1 score of BlueBERT on each of our target datasets when fine-tuned on different amounts of labeled data, with and without semi-supervised self-training. A fully supervised version is included for comparison. For all sets of training data, 80% of the available data is used for training and 20% for validation. (TIF)