Abstract
Currently, in the field of biomedical named entity recognition, CharCNN (Character-level Convolutional Neural Network) or CharRNN (Character-level Recurrent Neural Network) is typically used on its own to extract character features. This practice ignores the complementary strengths of the two models and simply concatenates character features with word features, discarding the interaction information produced during the integration process. To address this, this paper proposes a multi-cross attention feature fusion method. First, the word features from DistilBioBERT are fused with the character features from CharCNN and from CharLSTM through two separate cross-attention word-char (word feature and character feature) fusions. Then, the two feature vectors obtained from these cross-attention fusions are fused again through cross-attention to obtain the final feature vector. Subsequently, a BiLSTM with a multi-head attention mechanism is introduced to enhance the model's ability to focus on key information features and further improve model performance. Finally, the output layer produces the final result. Experimental results show that the proposed model achieves the best F1 values of 90.76%, 89.79%, 94.98%, 80.27% and 88.84% on the NCBI-Disease, BC5CDR-Disease, BC5CDR-Chem, JNLPBA and BC2GM biomedical datasets, respectively. This indicates that our model can capture richer semantic features and improve entity recognition.
Citation: Zheng D, Han R, Yu F, Li Y (2024) Biomedical named entity recognition based on multi-cross attention feature fusion. PLoS ONE 19(5): e0304329. https://doi.org/10.1371/journal.pone.0304329
Editor: Jin Liu, Shanghai Maritime University, CHINA
Received: August 14, 2023; Accepted: May 9, 2024; Published: May 28, 2024
Copyright: © 2024 Zheng et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All code used in this study is available in the public GitHub repository https://github.com/hanrong11/Bi-BWC-LM
Funding: This work was supported by the Natural Science Foundation of Heilongjiang Province, grant LH2022F037. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
Named entity recognition [1, 2] refers to the identification of specific entities in text, such as people, places, and organizations, with the aim of locating entity mentions in unstructured text and classifying them into predefined categories. Unlike general domains, the field of biomedicine, as an interdisciplinary subject that combines methods and theories from multiple disciplines such as medicine and biology, contains a vast amount of scientific literature recording key information such as clinical studies, drug treatments, and gene expression. This information is written for domain experts and typically requires broader domain-specific knowledge for information extraction. Additionally, because of the sheer volume of biomedical literature, manual processing of this text is laborious and time-consuming. Although numerous manual annotation efforts have been organized internationally to extract biomedical concepts and information from text and store the extracted information in structured knowledge resources such as Swiss-Prot [3] and GenBank [4], the rapid growth of literature data and the emergence of infectious diseases such as SARS-CoV-2 and monkeypox have made it increasingly important to develop automated, high-performance NER (Named Entity Recognition) methods to assist in retrieving, organizing, and managing large amounts of biomedical data and information. BioNER (Biomedical Named Entity Recognition) uses natural language processing techniques to annotate entities such as diseases, genes, and proteins in text, and is also a crucial subtask for downstream tasks such as biological literature retrieval [5] and biological question-answering systems [6]. Fig 1 gives a sample of BioNER applied to a biomedical text.
This article proposes a BioNER method based on multi-cross attention feature fusion, named Bi-BWC-LM. Firstly, we perform three rounds of cross-attention feature fusion on character and word features. Subsequently, a BiLSTM is introduced with a multi-head attention mechanism to enhance the model’s ability to focus on key information features and further improve model performance. Finally, the output layer is used to output the final result.
In this article, Section 1 provides the introduction. Section 2 presents the background, mainly the related work and the models pertinent to this study. Section 3 describes the model structure proposed in this paper, while Section 4 details the key experimental results. Finally, Section 5 summarizes the conclusions, discusses limitations, and provides an outlook on future work.
The main contributions of this article are as follows:
- A three-cross attention feature fusion method is proposed, which performs cross-attention feature fusion between DistilBioBERT and CharCNN, as well as between DistilBioBERT and CharLSTM. The two fused features are then subjected to another round of cross-attention feature fusion.
- Multi-head attention is added after the BiLSTM to address the difficulty BiLSTM has in capturing long-distance dependencies and to enhance the representation ability of the input information, selecting the most critical information from large amounts of text at multiple levels.
- The model is trained using a combined loss function, which incorporates both cross-entropy loss and Focal Loss. This approach considers the issue of data imbalance, increases the penalty for misclassification, and provides an adaptive adjustment mechanism.
- Experimental results and ablation studies conducted on five biomedical datasets, including NCBI-Disease, BC5CDR-Disease, BC5CDR-Chem, BC2GM, and JNLPBA, demonstrate that the proposed model is highly effective in improving the recognition performance of biomedical named entities.
2. Background
In this section, we first introduce the related work, and then describe the models used in the tasks of this paper.
2.1. Related work
Traditional biomedical named entity recognition methods can be divided into rule-based and dictionary-based methods. However, rule-based [7] and dictionary-based [8] methods often have strong dependence on domain knowledge, poor scalability and portability, and often require a significant amount of time to develop rules and establish dictionaries. With the increase in data volume, more and more researchers are trying to use machine learning methods to handle BioNER tasks, such as Hidden Markov Model [9], Support Vector Machine model [10], Maximum Entropy model [11], and Conditional Random Fields model [12]. However, traditional machine learning methods typically require large amounts of labeled data for training, have high data dependence, require extensive feature engineering, and have difficulty with context modeling and handling unknown entities. At this point, deep learning methods provide a simpler solution to solve these problems.
CharCNN [13] and CharLSTM [14] are two character-level neural network models that can infer representations of unseen words and share information at the morpheme level. CharCNN uses convolutional and pooling operations to extract features from the input character sequence, with a focus on local features, while CharLSTM models the entire character sequence with LSTM units, enabling it to learn more global and long-range context information. Ma et al [15] proposed a BiLSTM-CNN-CRF model that first uses a CNN to extract character-level features and then uses a BiLSTM-CRF model to further extract contextual features and output the results. Asghari et al [16] used four different embedding methods, RNN-Character level, CNN-Character level, LookUp, and a self-attended encoder, to study the recognition of biomedical text. Kuru et al [17] proposed CharNER, a language-independent character-level named entity tagger. It uses an LSTM to extract character-level representations and outputs a label distribution for each character rather than each word; word-level labels are then obtained from the character-level labels. Cho et al [18] combined CNN and BiLSTM for feature extraction and then added an attention mechanism to the BiLSTM-CRF model for further extraction and recognition in BioNER.
To enhance the feature extraction ability of deep learning models, Vaswani et al [19] proposed the Transformer. The Transformer abandons the recurrent structure and models sequences directly with the attention mechanism, greatly accelerating the model's parallel computation. The attention mechanism first achieved success in the field of machine translation [20] and, owing to its excellent performance in neural network architectures, later came to be widely used across natural language processing. Attention models can measure the importance of different information features through information weights, strengthening key information, weakening useless information, and identifying the key features that affect final model performance by analyzing the global context. As the name suggests, attention allocates weights to the input text according to how much attention is paid to each word. Bahdanau et al [21] first applied the attention mechanism to neural machine translation and achieved excellent results; subsequently, more and more scholars began to use this approach for NER tasks.
The introduction of BERT [22] greatly advanced the field of natural language processing and achieved excellent performance on multiple tasks. With the emergence of various specialized domains, BioBERT [23], a domain-specific language representation model trained on biomedical corpora, was developed for the biomedical field. However, BioBERT has a large number of parameters and requires substantial computational resources in practical environments, which often limits its applicability. Therefore, DistilBioBERT [24] was proposed through knowledge distillation. This model uses a 6-layer Transformer and has approximately 65 million parameters. Compared with the large model, it trains significantly faster with far fewer parameters, while its performance remains almost indistinguishable across different biomedical tasks.
2.2. Related models
2.2.1. Convolutional neural networks.
CNNs (Convolutional Neural Networks), originally designed for analyzing visual images, have convolution as their core operation, enabling sliding-window calculations over an image. Through convolutional kernels and pooling layers, a CNN extracts features from the image. CNNs have subsequently been widely applied to natural language processing tasks such as text classification, named entity recognition, and sentiment analysis, where convolution and pooling capture local dependencies between words and crucial lexical information in sentences.
CharCNN, short for Character-level Convolutional Neural Network, is a model specifically designed for processing text data at the character level. Instead of taking words or phrases as input, CharCNN models text data from a character perspective. Through convolutional and pooling operations, it extracts meaningful local and global features, and projects these features onto corresponding labels or categories using fully connected layers. Therefore, CharCNN exhibits superior robustness and generalization capabilities when dealing with text data.
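As an illustration of this character-level pipeline, the following PyTorch sketch embeds the characters of each word, applies 1D convolutions of several kernel sizes, and max-pools over the character positions. The vocabulary size, embedding dimension, and kernel sizes are placeholders rather than the exact configuration of our model.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Character-level CNN: embed characters, apply 1D convolutions of
    several kernel sizes, and max-pool over the character positions."""

    def __init__(self, n_chars=100, char_dim=30, n_filters=40, kernel_sizes=(4, 8)):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, n_filters, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, char_ids):
        # char_ids: (n_words, max_word_len) character indices, 0 = padding
        x = self.char_emb(char_ids).transpose(1, 2)    # (n_words, char_dim, max_word_len)
        pooled = [conv(x).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)                # (n_words, n_filters * len(kernel_sizes))

# Example: character features for 5 words, each padded to 12 characters.
cc = CharCNN()(torch.randint(1, 100, (5, 12)))         # -> (5, 80)
```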
2.2.2. Bidirectional long short-term memory.
RNN (Recurrent Neural Networks) are a type of neural network designed to better process sequential information, especially suitable for handling text, speech, and other sequential data. LSTM (Long Short-Term Memory) is an extension of RNN, which adds forget gates, input gates, output gates, and internal memory units to address the gradient vanishing problem inherent in RNN. Subsequently, BiLSTM (Bidirectional Long Short-Term Memory) emerged and gradually became a mainstream model. In BiLSTM, each feature vector passes through a two-layer LSTM, capturing past and future information from different directions. The two independent hidden states are then concatenated to form the final contextual representation. Unlike traditional LSTM, which only considers forward information from the input sequence and can only access information prior to a given time point, BiLSTM takes into account both forward and backward information from the input sequence. This allows BiLSTM to access contextual information from the entire sequence at any time point. The bidirectional nature of BiLSTM also enables it to capture richer contextual information, thereby better understanding the overall structure and semantics of the input sequence. This makes BiLSTM perform better in tasks that require contextual understanding.
CharLSTM is a text generation model based on LSTM that operates at the character level. Compared to traditional word-level text generation models, CharLSTM is better able to handle semantically ambiguous phrases and complex grammatical structures, resulting in more accurate and fluent text generation. CharLSTM applies LSTM to character-level text generation tasks by breaking the input text into characters and feeding each character into the LSTM as a time step.
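For comparison with the CharCNN sketch above, a minimal character-level BiLSTM used as a feature extractor (the role it plays in our model) can be written as follows; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """Character-level BiLSTM used as a feature extractor: the final forward
    and backward hidden states summarize the character sequence of a word."""

    def __init__(self, n_chars=100, char_dim=30, hidden=40):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.lstm = nn.LSTM(char_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, char_ids):
        # char_ids: (n_words, max_word_len) character indices, 0 = padding
        x = self.char_emb(char_ids)                    # (n_words, max_word_len, char_dim)
        _, (h_n, _) = self.lstm(x)                     # h_n: (2, n_words, hidden)
        return torch.cat([h_n[0], h_n[1]], dim=1)      # (n_words, 2 * hidden)

cl = CharLSTM()(torch.randint(1, 100, (5, 12)))        # -> (5, 80)
```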
2.2.3. DistilBioBERT.
DistilBioBERT is a pre-trained language model specifically designed for biomedical text processing tasks. It learns knowledge from BioBERT through knowledge distillation. The loss function of DistilBioBERT consists of three weighted losses. The first loss is the standard cross-entropy loss for the MLM (Masked Language Model) objective. This loss helps the model learn contextual representations of vocabulary. By minimizing this loss, the model is able to capture rich lexical semantics. As shown in Eq (1). The second loss is the KL divergence loss of the teacher model’s output, which is used here to measure the difference between the probability distribution of the student model’s output and the probability distribution of the teacher model’s output. As shown in Eq (2). The third loss function is an optional one. It aligns the final hidden states of the teacher and student models through cosine embedding loss, ensuring that the internal representation of the student model closely resembles that of the teacher model. This ensures that the student model can capture similar semantic and contextual information as the teacher model. As shown in Eq (3). The overall loss formula is presented in Eq (4).
$L_{MLM}(X, Y) = -\sum_{n=1}^{N} W_n \sum_{v=1}^{|V|} Y_{n,v} \log f_s(X)_{n,v}$ (1)

Among them, X represents the input to the model, and Y denotes the labels for the MLM task. Y is a set of N one-hot vectors, each with a size of |V|, where |V| is the vocabulary size of the model, and N is the number of input tokens. For masked tokens, the value of $W_n$ is 1, and for other tokens, its value is 0. $f_s$ represents the student model.

$L_{KL}(X) = \sum_{n=1}^{N} W_n \, \mathrm{KL}\big(f_t(X)_n \,\|\, f_s(X)_n\big)$ (2)

Among them, $f_t$ represents the teacher model.

$L_{cos}(X) = 1 - \phi\big(h_t(X), h_s(X)\big)$ (3)

Among them, $h_t$ and $h_s$ represent the functions that output the last hidden states of the teacher and student models, respectively, while $\phi$ denotes the cosine similarity function.

$L = \alpha_1 L_{MLM} + \alpha_2 L_{KL} + \alpha_3 L_{cos}$ (4)

Among them, $\alpha_1$, $\alpha_2$, and $\alpha_3$ represent the weighting terms for the different losses, respectively. In this paper, the settings are $\alpha_1 = 2.0$, $\alpha_2 = 5.0$, and $\alpha_3 = 1.0$.
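To make the three weighted losses concrete, the following is a minimal PyTorch sketch of the combined distillation objective with the weights given above. The tensor shapes, masking convention, and temperature are assumptions for illustration, not the DistilBioBERT reference implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, mask,
                      student_hidden, teacher_hidden,
                      a1=2.0, a2=5.0, a3=1.0, temperature=1.0):
    """Sketch of the three weighted losses: MLM cross-entropy, KL divergence
    to the teacher, and cosine alignment of the last hidden states."""
    mask = mask.float()                                # W_n: 1 for masked tokens, else 0
    vocab = student_logits.size(-1)

    # Eq (1): cross-entropy for the MLM objective, restricted to masked tokens
    ce = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1), reduction="none")
    l_mlm = (ce * mask.view(-1)).sum() / mask.sum().clamp(min=1.0)

    # Eq (2): KL divergence between teacher and student output distributions
    l_kl = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                    F.softmax(teacher_logits / temperature, dim=-1),
                    reduction="batchmean")

    # Eq (3): cosine embedding loss on the last hidden states
    l_cos = 1.0 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1).mean()

    # Eq (4): weighted sum with alpha1 = 2.0, alpha2 = 5.0, alpha3 = 1.0
    return a1 * l_mlm + a2 * l_kl + a3 * l_cos
```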
2.2.4. Attention mechanism.
In deep learning, the Attention mechanism is an important technique that has achieved remarkable results, especially in natural language processing tasks. The Attention mechanism helps neural networks automatically learn, weight, and focus on key information when processing input sequences. The Self-Attention mechanism is a variant of the Attention mechanism that does not rely on external information to calculate attention weights. Instead, it uses information internal to the input sequence for computation, enabling the model to better capture long-distance dependencies when processing sequential data.
The Self-Attention mechanism can sometimes over-focus on its own position when encoding the current location, leading to the proposal of the Multi-Head Attention mechanism to address this issue. Multi-Head Attention is formed by combining multiple Self-Attention components. In the Multi-Head Attention mechanism, h sets of q (query), k (key), and v (value) are used to obtain h different feature representations. The outputs of these h parallel Self-Attention components are then concatenated, and finally, the concatenation is passed through a fully connected layer for dimensionality reduction. The specific expressions for Multi-Head Attention are given in Eqs (5) and (6).
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O$ (5)

$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$ (6)

Among them, h represents the number of heads, and $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are learnable linear transformation parameters.
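A compact PyTorch sketch of Eqs (5) and (6) is given below; the model dimension and number of heads are illustrative values, not the settings of our model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Multi-head attention following Eqs (5) and (6)."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        # q, k, v: (batch, seq_len, d_model)
        b = q.size(0)

        def split(x, proj):
            # project, then reshape to (batch, heads, seq_len, d_k)
            return proj(x).view(b, -1, self.h, self.d_k).transpose(1, 2)

        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # scaled dot-product per head
        out = F.softmax(scores, dim=-1) @ v                  # (batch, heads, seq_len, d_k)
        out = out.transpose(1, 2).reshape(b, -1, self.h * self.d_k)
        return self.w_o(out)                                 # concatenation of heads + linear layer
```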
Cross-Attention is an important extension of the Self-Attention mechanism. The key difference between Cross-Attention and Self-Attention lies in the input: Self-Attention operates on a single embedded sequence, where Q (Query), K (Key), and V (Value) come from the same input sequence. Cross-Attention, on the other hand, correlates information from two different input sequences: one sequence serves as the Q input, while the other provides the K and V. Alternatively, one sequence can be used for both Q and V, while the other sequence provides the K. Assuming we have two input sequences S1 and S2, the specific calculation process is outlined in Eqs (7–10).
$Q = S_1 W^Q$ (7)

$K = S_2 W^K$ (8)

$V = S_2 W^V$ (9)

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$ (10)

Among them, $W^Q$, $W^K$, and $W^V$ represent three trainable parameter matrices, and $d_k$ denotes the dimensionality of $S_2 W^K$.
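The following single-head sketch mirrors Eqs (7)–(10): one sequence supplies the queries while the other supplies the keys and values. The projection matrices are passed in as plain tensors for clarity, and all sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def cross_attention(s1, s2, w_q, w_k, w_v):
    """Single-head cross-attention following Eqs (7)-(10): S1 supplies the
    queries, S2 supplies the keys and values."""
    q = s1 @ w_q                                   # Eq (7): (batch, len1, d)
    k = s2 @ w_k                                   # Eq (8): (batch, len2, d)
    v = s2 @ w_v                                   # Eq (9): (batch, len2, d)
    scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ v           # Eq (10): (batch, len1, d)

# Example with two sequences of different lengths but the same feature size.
s1, s2 = torch.randn(2, 10, 64), torch.randn(2, 15, 64)
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
fused = cross_attention(s1, s2, w_q, w_k, w_v)     # -> (2, 10, 64)
```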
3. Model construction
This section describes the architecture of the proposed Bi-BWC-LM model. First, we introduce the overall architecture of the model, and then provide a detailed description of each layer of the model.
3.1. General architecture of the model
As shown in Fig 2, the model receives a sentence at the input layer, where all words in the sentence are standardized to lowercase. After the input layer, we use a dual encoder to integrate word embeddings from DistilBioBERT, a pre-trained model that generates dynamic word vectors with contextual semantic information, with character embeddings from CharCNN and CharLSTM. This cross-attention fusion between the word and character embeddings compensates for the shortcomings of CharCNN alone and further improves the word embedding expression, while also avoiding the Out-Of-Vocabulary problem. Then, we perform another cross-attention fusion on the two features after the cross-attention fusion, generating a word vector that is input into the context information layer. The context information layer uses a bidirectional LSTM to capture available features. Subsequently, we use multi-head attention to enable the model to pay attention to important token information in the sentence. This allows the model to better consider the relationships between words and interactions between different words when learning sentence representations, which improves the boundary recognition and classification accuracy of named entities. Finally, the model outputs the final result through an output layer, and is trained using a joint loss function.
3.2. Feature extraction layer
All the words in the sentence are processed through the pre-trained model DistilBioBERT to obtain their outputs, which are then converted into word features W.
Firstly, assume the input sentence is X of length n, X = {X1, X2, …, Xn}, where X is a sequence of words from a biomedical dataset. Meanwhile, the sentence is split into a sequence of characters that are converted into one-hot vectors and initial character embedding vectors, from which character-level features are extracted: the character-level features CC are extracted by CharCNN, while the character-level features CL are extracted by CharLSTM. CharCNN uses convolutional kernels of multiple sizes to capture character features of different lengths simultaneously, yielding a more comprehensive and richer character-level representation. By combining CharCNN and CharLSTM, the advantages of both can be fully exploited: CharCNN learns character-level local features, while CharLSTM captures character-level contextual information, so the character embedding representation contains both. Subsequently, two cross-attention feature fusions are performed. The first combines the word features W with the character features CC to obtain the sub-feature TWC; the second combines the word features W with the character features CL to obtain the sub-feature TWL. Finally, these two sub-features, TWC and TWL, are fused again through cross-attention to obtain the final feature vector T. The specific process is shown in Fig 3, and Eqs (11–13) give the three cross-attention feature fusions.
$T_{WC} = \mathrm{softmax}\!\left(\frac{(W W^Q)(C_C W^K)^{\top}}{\sqrt{d_K}}\right) C_C W^V$ (11)

$T_{WL} = \mathrm{softmax}\!\left(\frac{(W W^Q)(C_L W^K)^{\top}}{\sqrt{d_K}}\right) C_L W^V$ (12)

$T = \mathrm{softmax}\!\left(\frac{(T_{WC} W^Q)(T_{WL} W^K)^{\top}}{\sqrt{d_K}}\right) T_{WL} W^V$ (13)

Among them, $W^Q$, $W^K$, and $W^V$ are parameter matrices, and $d_K$ is the dimension size of K.
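A schematic sketch of this feature extraction layer is given below. It uses PyTorch's built-in nn.MultiheadAttention as a stand-in for the cross-attention of Eqs (11)–(13); the hidden size and head count are illustrative, not the exact settings of our model.

```python
import torch
import torch.nn as nn

class TripleCrossFusion(nn.Module):
    """Sketch of the feature extraction layer: the word features W are fused
    with the CharCNN features CC (Eq 11) and the CharLSTM features CL (Eq 12)
    by cross-attention, and the two sub-features are fused again (Eq 13)."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.fuse_wc = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse_wl = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse_final = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, w, cc, cl):
        # w, cc, cl: (batch, seq_len, d_model) word / CharCNN / CharLSTM features
        t_wc, _ = self.fuse_wc(w, cc, cc)          # T_WC: W attends to CC
        t_wl, _ = self.fuse_wl(w, cl, cl)          # T_WL: W attends to CL
        t, _ = self.fuse_final(t_wc, t_wl, t_wl)   # T: fuse the two sub-features
        return t

w, cc, cl = (torch.randn(2, 32, 256) for _ in range(3))
t = TripleCrossFusion()(w, cc, cl)                 # -> (2, 32, 256)
```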
3.3. Information extraction layer
The feature vector T = {t1, t2, …, tn}, extracted by the feature extraction layer, is input into the BiLSTM. BiLSTM is composed of a forward LSTM and a backward LSTM, which can capture the contextual information and dependencies of the input sequence. In the BiLSTM layer, each feature vector passes through a two-layer BiLSTM to capture past and future information from different directions, and then the two independent hidden states are connected to form the final context representation. The forward propagation process is as follows: Eq (14) represents the initial hidden state, Eq (15) represents the input gate, forget gate, output gate, and update unit, and Eq (16) represents the update of the cell state and hidden state.
$h_0 = 0, \quad C_0 = 0$ (14)

$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \quad o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), \quad \tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$ (15)

$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \quad h_t = o_t \odot \tanh(C_t)$ (16)

Among them, σ represents the activation function between basic elements, and $i_t$, $f_t$, $o_t$, $\tilde{C}_t$, and $C_t$ represent the input gate, forget gate, output gate, temporary cell state, and cell state, respectively. $W_i$, $W_f$, $W_o$, and $W_C$ are the weight coefficients for the input gate, forget gate, output gate, and temporary cell state, respectively, while $b_i$, $b_f$, $b_o$, and $b_c$ are the bias values for the input gate, forget gate, output gate, and temporary cell state, respectively.
The backward propagation process is the same as the forward propagation process but in the opposite direction. At step i, the hidden states are connected to form the final context representation, as shown in Eq (17). We can obtain a bidirectional output sequence H = {h0, h1, …, hn-1}.
$h_t = \overrightarrow{h_t} \oplus \overleftarrow{h_t}$ (17)

Where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ represent the hidden state information of the preceding and subsequent context at time t, respectively. $x_t$ represents the input word at time t, and ⊕ denotes the concatenation operation of LSTM information vectors. $h_t$ represents the hidden state output of BiLSTM at time t.
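In PyTorch, this bidirectional encoding of the fused features T can be sketched as follows; the input and hidden sizes are illustrative.

```python
import torch
import torch.nn as nn

# Bidirectional LSTM over the fused feature vectors T; the forward and
# backward hidden states are concatenated at each step (Eq 17).
bilstm = nn.LSTM(input_size=256, hidden_size=128, batch_first=True, bidirectional=True)

t = torch.randn(2, 32, 256)     # fused features T: (batch, seq_len, feature_dim)
h, _ = bilstm(t)                # H: (batch, seq_len, 2 * 128)
```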
The bidirectional output sequence H is then used as input, and multiple attention heads are employed to capture different focuses. By introducing multiple attention heads, the model can simultaneously pay attention to different semantic information, extract feature information that plays a key role in entity recognition, and weight the output feature vectors of the upper layer, enhancing the representation ability of the input information. This can also address the challenge of BiLSTM in capturing long-distance dependencies and improve model performance. Multi-head attention mainly performs multiple different linear transformations on the query, key, and value, allowing the model to learn relevant information in different subspaces, thereby greatly improving the model’s fitting ability. Eqs (18–20) represent the calculation process of multi-head attention.
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$ (18)

$\mathrm{head}_j = \mathrm{Attention}(H W_j^Q, H W_j^K, H W_j^V)$ (19)

$M = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W$ (20)

Among them, $\mathrm{head}_j$ represents the attention calculation result of the jth attention head, Concat represents the concatenation of multiple heads, and W represents the weight matrix.
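A minimal sketch of applying multi-head self-attention to the BiLSTM output H, using PyTorch's built-in module; the embedding dimension is assumed to match the BiLSTM output size above, and 4 heads follows the ablation in Section 4.4.4.

```python
import torch
import torch.nn as nn

# Multi-head self-attention over the BiLSTM output H (Eqs 18-20),
# here via PyTorch's built-in module; Q, K, and V all come from H.
mha = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

h = torch.randn(2, 32, 256)     # bidirectional output sequence H
attended, _ = mha(h, h, h)      # -> (2, 32, 256)
```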
3.4. Output layer and training
This model employs an output layer to obtain the final recognition result, which comprises a first linear layer, a SiLU activation function, a dropout layer, and a second linear layer. The first linear layer is used to reduce the dimension of the feature vectors output by the multi-head attention mechanism. The SiLU activation function performs a nonlinear transformation on the output of the first linear layer. The dropout layer is utilized to prevent overfitting of the model. Finally, the second linear layer maps the nonlinearly transformed features to the category space, completing the entity recognition process.
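As a sketch, the output layer described above can be written as the following PyTorch module; the hidden sizes, dropout rate, and label count are illustrative, not the trained configuration.

```python
import torch
import torch.nn as nn

num_labels = 3   # e.g. B, I, O for a single entity type (dataset-dependent)

# Output layer: linear reduction, SiLU, dropout, then a second linear layer
# mapping to the label space.
output_head = nn.Sequential(
    nn.Linear(256, 128),
    nn.SiLU(),
    nn.Dropout(p=0.1),
    nn.Linear(128, num_labels),
)

logits = output_head(torch.randn(2, 32, 256))   # per-token label scores: (2, 32, num_labels)
```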
In many natural language processing tasks, such as named entity recognition and relation extraction, there exists a serious issue of data imbalance. This imbalance mainly manifests in two aspects. Firstly, there is a class imbalance: different entity categories may have significant differences in the number of instances in the training set. This leads to the model being biased towards categories with more samples during training, neglecting the recognition of categories with fewer samples. Secondly, there is a sample imbalance: even within the same entity category, there may be an imbalance among different samples. This can result in the model overfitting to frequently occurring entities and performing poorly on rare entities. This means that the model may be more likely to apply common categories to all inputs during prediction, ignoring or misclassifying entities from rare categories. Additionally, sample imbalance can also limit the generalization ability of the model. When the model overfits to samples from common categories, it may not be able to effectively handle unseen or newly emerging entity categories, especially in practical applications where the emergence of new entities is inevitable.
To address the issue of data imbalance, the model uses both the cross-entropy loss function LCE, as shown in Eq (21), and the Focal Loss loss function LFL, as shown in Eq (22), for joint training. The final loss function Loss is shown in Eq (23).
$L_{CE} = -\sum_{i} y_i \log(\hat{y}_i)$ (21)

$L_{FL} = -\sum_{i} (1 - \hat{y}_i)^{\gamma} \, y_i \log(\hat{y}_i)$ (22)

$Loss = L_{CE} + L_{FL}$ (23)

Among them, γ is the adjustable factor, which was set to 2 in this experiment.
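A minimal sketch of the joint objective follows; the unweighted sum in the last line is an assumption about Eq (23), and the token-level tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def joint_loss(logits, targets, gamma=2.0):
    """Cross-entropy (Eq 21) plus Focal Loss (Eq 22), combined as in Eq (23)."""
    log_p = F.log_softmax(logits, dim=-1)                  # (n_tokens, n_labels)
    ce = F.nll_loss(log_p, targets)                        # Eq (21)
    p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
    focal = (-(1.0 - p_t) ** gamma * p_t.log()).mean()     # Eq (22), gamma = 2
    return ce + focal                                      # Eq (23): combined objective

loss = joint_loss(torch.randn(64, 3), torch.randint(0, 3, (64,)))
```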
4. Experiments and analysis of results
4.1. Biomedical datasets
We conducted our research on five datasets in the field of biomedicine: NCBI-Disease, BC5CDR-Disease, BC5CDR-Chem, BC2GM, and JNLPBA. Table 1 shows the datasets, entity types, and their corresponding quantities used in this experiment.
- NCBI-Disease [25] is provided by the National Center for Biotechnology Information (NCBI) in the United States and is used for identifying and annotating disease-related entities in biomedical literature.
- BC5CDR [26] is provided by the BioCreative organization to promote the automatic extraction of relationships between chemical substances and diseases. For the BioNER task, it can also be divided into two categories: BC5CDR-Disease and BC5CDR-Chem.
- BC2GM [27] is the dataset for the BioCreative II gene mention recognition task, which can be used to train and evaluate gene mention recognition models for biological literature processing.
- JNLPBA [28] is derived from the GENIA 3.02 corpus and created through a controlled search on MEDLINE.
4.2. Experimental parameters
The experimental parameters are shown in Table 2. The experimental environment was: operating system Ubuntu 20.04; deep learning framework PyTorch 2.0.0; development environment Python 3.8; GPU RTX 4090 (24 GB); CPU 15 vCPU Intel(R) Xeon(R) Platinum 8336C @ 2.30 GHz; memory 80 GB.
4.3. Experimental results and comparison
This section compares our model with recent models on five public datasets, as shown in Table 3. Among them: (1) MTL-LS [29] proposed an effective hierarchical transfer learning method that combines multi-task learning and single-task learning to achieve multi-level information fusion between underlying entity features and upper-level data features. (2) PubMedBERT [30] proposed domain-specific pre-training from scratch with task-specific fine-tuning; it pre-trains on large-scale biomedical literature to generate domain-specific language representations, thereby improving performance on the BioNER task. (3) HunFlair [31] is a NER tagger covering multiple entity types that utilizes the pre-trained language model Flair combined with biomedical-domain data for named entity recognition. (4) BioDistilBERT [24] proposed compressed models based on BioBERT using multiple compression strategies, as many existing pre-trained models are resource-intensive and have high compute requirements. (5) BioBERT-CFA [32] proposed a new fully shared multi-task learning model based on pre-trained BioBERT, which integrates syntactic information into the BioBERT encoder using composite feature attention to improve performance. (6) PAMDFGA [33] proposed a novel and effective prefix and attention map discrimination fusion guided attention mechanism that enhances BioBERT by changing the attention distribution.
Through experiments, we found that our model achieved the best results in terms of F1 score on five public datasets. Our model outperformed the previous record on NCBI-Disease by 0.21%, reaching a score of 90.76%, on BC5CDR-Disease by 2.26%, reaching a score of 89.79%, on BC5CDR-Chem by 0.5%, reaching a score of 94.98%, on BC2GM by 1.87%, reaching a score of 88.84%, and on JNLPBA by 0.21%, reaching a score of 80.27%. Compared to DistilBioBERT and PubMedBERT, our model incorporated CharCNN and CharLSTM to jointly extract features, enhancing its ability to extract embedded features. Compared to MTL-LS, which uses XLNet instead of BERT as the encoder and CRF as the decoder, our model uses a dual encoder DistilBioBERT and an output layer with two linear layers and an activation function to output the final result, improving its ability to model contextual information and enhancing its robustness to noise, thereby improving model performance. Compared to BioBERT-CFA, which uses composite attention to merge syntactic features into BioBERT, and PAMDFGA, which enhances BioBERT by fusing prefix and attention map discrimination guidance attention mechanisms, our model uses cross-attention to merge character features into DistilBioBERT, enhancing feature extraction performance, and uses multi-head attention for further enhancement of information extraction performance. Compared to HunFlair, which uses a character-level language model, our model uses a word-level language model while integrating character features, providing richer and more accurate representations that capture more fine-grained features and semantic information within words, improving the recognition ability of the BioNER task.
To further analyze our model, we conducted more detailed experiments on the BC5CDR-Disease dataset, and the results are shown in Table 4. Method 1 uses CharCNN and DistilBioBERT for cross-attention fusion, method 2 uses CharLSTM and DistilBioBERT for cross-attention fusion, method 3 uses CharCNN and CharLSTM for cross-attention fusion, method 4 uses only one DistilBioBERT without cross-attention fusion, and method 5 uses two DistilBioBERTs for cross-attention fusion.
Through experiments, it was found that method 1, which uses CharCNN for character feature extraction, achieves higher precision, recall, and F1 values than method 2, which uses CharLSTM. This further indicates that CharCNN performs better when used alone as a character feature extraction model. The possible reason is that in CharCNN, the convolutional kernels share parameters across different positions in the input sequence, which helps reduce the number of model parameters and enhances the generalization ability of the features. Meanwhile, compared with CharLSTM, the convolutional operations in CharCNN can be accelerated through parallelization, making CharCNN more efficient in processing large-scale text data. Method 3 has lower precision, recall, and F1 values than our model, indicating that fusing word features with character features (word-char fusion) performs better than fusing character features alone, and further demonstrating the effectiveness of DistilBioBERT as a word-level feature extraction model that can capture richer and more accurate word-level representations. The possible reason is that fusing characters and words exploits the information from both, reducing ambiguity and thus representing the text more accurately; in some cases, individual characters or words have ambiguous meanings, making it difficult to express their intended sense accurately. The results of methods 4 and 5 show that using a dual encoder can better model contextual information, reduce ambiguity, enhance robustness to noise, and improve performance. The possible reason is that the dual encoder provides precise location information, which helps the model accurately identify entities in the text. Furthermore, by comparing the outputs of the two encoders, the dual encoder can enhance the modeling of entity position and direction information, further improving entity recognition performance.
4.4. Ablation experiment
We conducted ablation experiments using different fusion methods, different numbers of CharCNN convolutional kernels, multi-sized CharCNN convolutional kernels, and different numbers of multi-head attention.
4.4.1. Impact of different fusion methods.
This experiment focuses on the fusion method of the feature extraction layer, namely the comparison of two approaches: cross-attention fusion and direct concatenation. Method 1 uses direct concatenation for all three feature fusion operations in the feature extraction layer. Method 2 is the approach proposed here, using cross-attention fusion for all three operations. Method 3 fuses W and CL using cross-attention fusion, while the other two operations use direct concatenation. Method 4 fuses W and CC using cross-attention fusion, while the other two operations use direct concatenation. Method 5 uses direct concatenation for W, CC, and CL, and then performs cross-attention fusion. Method 6 performs cross-attention fusion for W, CC, and CL, and then uses direct concatenation.
The experimental results are shown in Table 5. When using the fusion method of method 4, the Precision is the highest. When using the fusion method of method 1, the Recall is the highest. However, when using the fusion method proposed by our model (method 2), the F1 score reaches the highest level. This indicates that our proposed three-cross-attention fusion method is beneficial for improving the recognition effectiveness of named entity recognition. The possible reason is as follows: Firstly, the cross-attention mechanism can dynamically adjust the weights between different word and character features, thereby more effectively integrating global semantic information and directional information. Secondly, the cross-attention mechanism can handle input sequences of varying lengths. Although the direct concatenation of features is simple and intuitive, it may not effectively deal with the interactions and dependencies between different features. Moreover, it may encounter difficulties when processing variable-length sequences.
4.4.2. The effect of different number of CharCNN convolutional kernels.
To study the effect of different numbers of CNN convolutional kernels, we conducted a series of experiments on the BC5CDR-Disease dataset, as shown in Table 6. We found that when the number of convolutional kernels is 20, the Precision is the highest, and when the number of convolutional kernels is 90, the Recall is the highest. However, when the number of convolutional kernels is 40, the F1 score is the highest. Therefore, we conclude that named entity recognition performs best with 40 convolutional kernels; too few or too many kernels degrade recognition performance. The reason may be that adding more convolutional kernels captures a wider variety of features, describing the input data from different perspectives or levels. At the same time, multiple convolutional kernels are more resilient to noise and variations in the input data, enabling the model to enhance its feature extraction capabilities in practical applications and handle various complex named entity recognition scenarios. However, an excessive number of convolutional kernels can lead to overfitting on the training data, thereby reducing performance on test data. Additionally, too many convolutional kernels also result in excessively long embedding times for character-based inputs, leading to poor performance.
4.4.3. Effects of multi-sized CharCNN convolution kernels.
This experiment uses multi-sized convolutional kernels for character feature extraction to capture character combinations of different lengths and better understand the context features in the text, thereby improving embedding performance. We conducted a series of experiments on the BC5CDR-Disease dataset using one, two, and three different sizes of convolutional kernels. The final experimental results are shown in Table 7. When using one size (4), the Recall is the highest, while using two sizes (4, 8) results in the highest Precision and F1 score. The experimental results demonstrate that using two sizes (4, 8) gives the best performance for our model, which shows the advantage of using multi-sized convolutional kernels for character feature extraction. This may be because multi-scale convolutional kernels extract features from different receptive fields, capturing features at different levels in the input data, which helps to improve the recognition performance of the model.
4.4.4. Impact of different numbers of attention heads.
Our model uses multi-head attention after BiLSTM to increase the extraction ability of feature information and solve the problem of BiLSTM in capturing long-distance dependencies. In this experiment, we studied the number of heads in multi-head attention, and the experimental results are shown in Table 8. When the number of heads is 1, the Precision is the highest, and when the number of heads is 8, the Recall is the highest. When the number of heads is 4, the F1 score is the highest. Based on a comprehensive analysis, when the number of heads is 4, the extraction ability of feature information is the strongest, and the recognition effect of NER is the highest.
In this regard, we believe that the increase in the number of multi-head attention heads can enable the model to simultaneously focus on different parts of the input sequence by processing multiple attention weights in parallel. This approach allows the model to encode and represent input data from more dimensions and angles, thus helping the model capture richer contextual information and improve recognition accuracy. However, when the number of heads becomes excessive, the number of parameters that the model needs to optimize will increase significantly, leading to overfitting or difficulties in training. Additionally, too many heads may result in information redundancy or interference, which can decrease the model’s performance. Therefore, in practical applications, it is necessary to select an appropriate number of heads based on specific task requirements, dataset characteristics, and computational resources. Through experiments and validation, we can find the optimal head configuration to strike a balance between performance and computational efficiency.
5. Conclusion and future work
In this paper, a Bi-BWC-LM model based on multi-cross attention fusion is designed. It uses triple cross-attention fusion to obtain feature vectors, extracts information through BiLSTM and multi-head attention, and finally outputs the results through the output layer. A combined loss function is used to train the model. Experimental results on five datasets, NCBI-Disease, BC5CDR-Disease, BC5CDR-Chem, BC2GM, and JNLPBA, together with ablation experiments, demonstrate that the proposed model is highly effective in improving the recognition performance of biomedical named entities.
However, according to the experiments conducted in [24], it is evident that DistilBioBERT still lags behind BioBERT by approximately 1% in terms of the F1 score. In the future, we will delve deeper into exploring ways to further reduce model parameters and enhance inference speed while leveraging BioBERT as our foundation.
References
- 1. Fan S, Yu H, Cai X, Geng Y, Li G, Xu W, et al. Multi-attention deep neural network fusing character and word embedding for clinical and biomedical concept extraction. Inf Sci. 2022;608(C):778–93.
- 2. Ji W, Fu Y, Zhu H. Multi-Feature Fusion Method for Chinese Pesticide Named Entity Recognition. 2023;13(5):3245.
- 3. Boeckmann B, Bairoch A, Apweiler R, Blatter M, Estreicher A, Gasteiger E, et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003;31(1):365–370. pmid:12520024
- 4. Sayers EW, Cavanaugh M, Clark K, Ostell J, Pruitt KD, Karsch-Mizrachi I. GenBank. Nucleic Acids Research. 2019;48(D1):D84–D6. pmid:30365038
- 5. Luo M, Mitra A, Gokhale T, Baral C. Improving biomedical information retrieval with neural retrievers. Proceedings of the AAAI Conference on Artificial Intelligence; 2022.
- 6. Pappas D, Malakasiotis P, Androutsopoulos I. Data augmentation for biomedical factoid question answering. Proceedings of the 21st Workshop on Biomedical Language Processing; Dublin, Ireland: Association for Computational Linguistics; 2022. p. 63–81.
- 7. Caporaso JG, Baumgartner WA, Randolph DA, Cohen KB, Hunter L. MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics (Oxford, England). 2007;23(14):1862–5. pmid:17495998.
- 8. Yang Z, Lin H, Li Y. Exploiting the performance of dictionary-based bio-entity name recognition in biomedical literature. Computational biology and chemistry. 2008;32(4):287–91. pmid:18467180.
- 9. Zhang J, Shen D, Zhou G, Su J, Tan C-L. Enhancing HMM-based biomedical named entity recognition by studying special phenomena. J of Biomedical Informatics. 2004;37(6):411–22. pmid:15542015
- 10. Lee K-J, Hwang Y-S, Kim S, Rim H-C. Biomedical named entity recognition using two-phase model based on SVMs. J of Biomedical Informatics. 2004;37(6):436–47. pmid:15542017
- 11. Saha SK, Sarkar S, Mitra P. Feature selection techniques for maximum entropy based biomedical named entity recognition. J of Biomedical Informatics. 2009;42(5):905–11. pmid:19535010
- 12. Klinger R, Kolářik C, Fluck J, Hofmann-Apitius M, Friedrich CM. Detection of IUPAC and IUPAC-like chemical names. Bioinformatics. 2008;24(13):i268–i76. pmid:18586724
- 13. Zhang X, Zhao J, LeCun Y. Character-level convolutional networks for text classification. Proceedings of the 28th International Conference on Neural Information Processing Systems ‐ Volume 1; Montreal, Canada: MIT Press; 2015. p. 649–57.
- 14. Kim Y, Jernite Y, Sontag D, Rush AM. Character-aware neural language models. Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence; Phoenix, Arizona: AAAI Press; 2016. p. 2741–9.
- 15. Ma X, Hovy E. End-to-end sequence labeling via bi-directional lstm-cnns-crf. arXiv preprint arXiv:160301354. 2016.
- 16. Asghari M, Sierra-Sosa D, Elmaghraby AS. BINER: A low-cost biomedical named entity recognition. Information Sciences. 2022;602:184–200.
- 17. Kuru O, Can OA, Yuret D. CharNER: Character-level named entity recognition. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers; Osaka, Japan: The COLING 2016 Organizing Committee; 2016. p. 911–21.
- 18. Cho M, Ha J, Park C, Park S. Combinatorial feature embedding based on CNN and LSTM for biomedical named entity recognition. J of Biomedical Informatics. 2020;103(C):8. pmid:32004641
- 19. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Advances in neural information processing systems. 2017;30.
- 20. Wu Y, Guo J, Tan X, Zhang C, Li B, Song R, et al., editors. Videodubber: Machine translation with speech-aware length control for video dubbing. Proceedings of the AAAI Conference on Artificial Intelligence; 2023.
- 21. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:14090473. 2014.
- 22. Kenton JDM-WC, Toutanova LK, editors. Bert: Pre-training of deep bidirectional transformers for language understanding. Proceedings of naacL-HLT; 2019.
- 23. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2019;36(4):1234–40. pmid:31501885
- 24. Rohanian O, Nouriborji M, Kouchaki S, Clifton DA. On the effectiveness of compact biomedical transformers. Bioinformatics. 2023;39(3). pmid:36825820
- 25. Doğan RI, Leaman R, Lu Z. Special Report: NCBI disease corpus: A resource for disease name recognition and concept normalization. J of Biomedical Informatics. 2014;47:1–10.
- 26. Li J, Sun Y, Johnson RJ, Sciaky D, Wei C-H, Leaman R, et al. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database. 2016;2016. pmid:27161011
- 27. Smith L, Tanabe LK, Ando RJn, Kuo C-J, Chung IF, Hsu C-N, et al. Overview of BioCreative II gene mention recognition. Genome Biology. 2008;9(2):S2. pmid:18834493
- 28. Huang M-S, Lai P-T, Tsai RT-H, Hsu W-L. Revised JNLPBA corpus: A revised version of biomedical NER corpus for relation extraction task. arXiv preprint arXiv:190110219. 2019.
- 29. Chai Z, Jin H, Shi S, Zhan S, Zhuo L, Yang Y. Hierarchical shared transfer learning for biomedical named entity recognition. BMC Bioinformatics. 2022;23(1):8. Published 2022 Jan 4. pmid:34983362
- 30. Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, et al. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Trans Comput Healthcare. 2021;3(1): Article 2.
- 31. Weber L, Sänger M, Münchmeyer J, Habibi M, Leser U, Akbik A. HunFlair: an easy-to-use tool for state-of-the-art biomedical named entity recognition. Bioinformatics. 2021;37(17):2792–4. pmid:33508086
- 32. Zhang Z, Chen ALP. Biomedical named entity recognition with the combined feature attention and fully-shared multi-task learning. BMC Bioinformatics. 2022;23(1):458. Published 2022 Nov 3. pmid:36329384
- 33. Guan Z, Zhou X. A prefix and attention map discrimination fusion guided attention for biomedical named entity recognition. BMC Bioinformatics. 2023;24(1):42. Published 2023 Feb 8. pmid:36755230