Explainable detection of adverse drug reaction with imbalanced data distribution

Analysis of health-related texts can be used to detect adverse drug reactions (ADR). The greatest challenge for ADR detection lies in imbalanced data distributions where words related to ADR symptoms are often minority classes. As a result, trained models tend to converge to a point that strongly biases towards the majority class and then ignores the minority class. Since the most used cross-entropy criteria is an approximation to accuracy, the model focuses more readily on the majority class to achieve high accuracy. To address this issue, existing methods apply either oversampling or down-sampling strategies to balance the data distribution and exploit the most difficult samples of the minority class. However, increasing or reducing the number of individual tokens alone in sequence labeling tasks will result in the loss of the syntactic relations of the sentence. This paper proposes a weighted variant of conditional random field (CRF) for data-imbalanced sequence labeling tasks. Such a weighting strategy can alleviate data distribution imbalances between majority and minority classes. Instead of using softmax in the output layer, the CRF can capture the relationship of labels between tokens. The locally interpretable model-agnostic explanations (LIME) algorithm was applied to investigate performance differences between models with and without the weighted loss function. Experimental results on two different ADR tasks show that the proposed model outperforms previously proposed sequence labeling methods.


Introduction
An adverse drug reaction refers to any injury caused by taking medication, and the incidence of such injuries is quite high especially in cases of large doses or long duration of medication use. However, such reactions are unpredictable, and pre-market clinical trials of new drugs are usually only conducted on samples of 500-3000 people, for a single type of disease, and often exclude special populations (e.g., the elderly, pregnant women, and children). Therefore, such trials often fail to identify relatively rare adverse reactions, late-onset reactions, or adverse reactions that occur in special populations, which only become apparent after following after large-scale use [1]. This raises an urgent need for post-marketing drug safety surveillance after drug approval [2][3][4][5].
This monitoring has traditionally been accomplished by surveying users, but more recent approaches have focused on automatic identification and extraction of ADR symptoms. These methods often apply sequence labeling techniques in natural language processing (NLP) to monitor social media interaction on platforms such as Twitter and Facebook. Traditional methods for sequence labeling use rule-based or statistic methods, such as support vector machine (SVM) [6], which are highly dependent on hand-crafted features, such as n-gram. Conditional random field (CRF) [7] is another commonly used statistical method for sequence labeling and ADR detection. Compared with classification models, such as SVM and neural network models with softmax, CRF can extract the dependencies between labels.
Recent advances in deep neural networks (DNN) [8][9][10] and representation learning [11,12] have considerably improved the ability of algorithms to process text. Building neural networks, such as convolutional neural networks (CNN) [13], recurrent neural networks (RNN) [14], long short-term memory (LSTM) [15] and BiLSTM-CRF [16,17] can be an effective approach for in-depth research. Furthermore, attention mechanisms can be applied to improve the performance of DNN models to extract more task-specific features between tokens to provide meaningful information [18]. Other effective approaches apply the pretrained language models (PLM), such as BERT [19], RoBERTa [20] and ALBERT [21], to provide powerful representation to boost the performance of sequence labeling.
One of the biggest stumbling blocks in ADR tasks is data distribution, which often appears in conventional sequence labeling tasks and corpora. As shown in Table 1, the words of ADR symptoms in both examples from social media only take respective ratios of 6.23% and 8.3%. Table 2 presents concrete examples of several sequence labeling tasks, in which most tokens are annotated as class O, which is about 10 times that of ADR with entity labels. The algorithms tend to produce unsatisfactory classifiers when faced with (even extremely) imbalanced datasets. Those models may have a bias towards classes and only predict the majority class. This is because the frequently used loss function in most text mining tasks is categorical cross-entropy, which approximates the accuracy score. Unfortunately, this is not a suitable metric for imbalanced-data circumstances since the minority class has less effect on accuracy than the majority class [22]. For instance, in a ratio of 99:1 between the majority and minority classes, the trained model would classify everything to the majority class, since doing so can achieve 99% accuracy. However, it will be useless for tasks where the minority classes are more important than the majority class. In most circumstances, false negatives can have higher importance, while false positives are of course undesirable. As a result, the important minority classes, i.e., ADRs and indications, will be completely ignored by such a classifier.
In this paper, we propose a weighted pre-trained language model which is both robust to the strong class imbalance and able to integrate dependencies and syntactical information between tokens. This model implements a weighted variant of the CRF loss function. By assigning more weights to the minority class, the model can be forced to pay more attention to such classes, for effective detection. Furthermore, we introduce an explainable algorithm which provides a qualitative understanding between the input features and the corresponding prediction to compare the behavior of models with and without the weighted loss function.
Experiments were conducted on several strongly class-imbalanced ADR corpora. The proposed model is compared against several current sequence labeling models. The results show that the proposed model provides a better solution to handle the imbalanced-data issue, thus improving performance. In addition, both visualizations and qualitative results are presented to demonstrate the effect of the weighted loss function on CRF.

Results
This section presents the experiments conducted on several corpora to evaluate the performance of the proposed weighted BERT-CRF model against different neural networks for the ADR detection task.

Datasets
To evaluated the effectiveness of the proposed weighted BERT-CRF model, the comparative experiments were conducted on two corpora. Notably, the split strategies for train and test sets were different, for a fair comparison with the performance that were reported in the previous studies.
• Twitter consists of two datasets, including Twitter ADR datasets (v1.0) [7,23] and ADHD datasets [15]. By collecting and annotating user published tweets, the Twitter ADR datasets (v1.0) contain references to 81 drugs and newly reported terms which can be commonly found in the US market. ADHD is a supplement of Twitter including at least one ADR or indication label. According to Cocos et al. [15], the datasets were split into a training set and a testing set with an 3:1 ratio. For hyper-parameter fine-tuning, we also split a dev set from the training set with a ratio of 8:2.
• PubMed abstracts [24] is a biomedical text dataset, in which each sentence contains at least one ADR label. Following a previous study [25], we slightly modified the datasets as follows: 1) ensure that each sentence contains only one related-drug reference and a list of corresponding ADRs caused by the drug; 2) remove 120 sentences which contain the name of drug which are commonly regarded as the cause of adverse reactions, e.g., theophylline poisoning, where theophylline is the cause of poisoning. In addition, the dataset was split into training/dev/test set with a ratio of 8: 1: 1.
Each sample in both two datasets contains at least one ADR or indication label mentioned. Details of above corpora are summarized in Table 2.

Evaluations metrics
The performance of ADR detection is evaluated by using approximate matching [24]. For example, given a tweet the Seroquel gave me lasting sleep paralysis with true ADR labels sleep paralysis, the prediction of either lasting sleep paralysis or simply paralysis are both regarded as correct. The corresponding metrics are precision, recall and F 1 -score, defined as follows Recall ¼ TP ADR TP ADR þ FN ADR ð2Þ where TP ADR is the number of the ADR tokens that are approximately matched, and TP ADR + FP ADR and TP ADR + FN ADR are respectively the number of all predicted and real ADR tokens. A higher F 1 -score indicate better detection performance.

Baselines
To evaluate the proposed weighted BERT-CRF model, we implement several baselines for comparison. The details are presented as follows.
Context encoder. For special tasks, different representations may impact the final performance. Therefore, we implement different context encoders to produce representations for the ADR detection tasks.
Each word representation is composed of the embeddings of its constituent characters. To further improve performance, self-attention was introduced after BiLSTM to align the target token to its contexts (BiLSTM-attention).
• ELMo [27] is a pre-trained language model based on LSTM, which can be transferred for NLP tasks. It can be extended by adding a BiLSTM layer (ELMo-BiLSTM) to learn long-distance dependencies.
• BERT [19] can be used as fixed input features for the following labeling tasks (BERT-fix). In addition, it can be transferred and fine-tuned for a wide range of tasks by adding a classifier without any task-specific architectural modification (BERT-finetune). To capture long distance dependency, a BiLSTM was added for both two models.
Classifiers. Both softmax and CRF were used as the output layers to label tokens in the sequence. In addition, a weighted variant of softmax (wSoftmax) and the proposed weighted CRF (wCRF) were also implemented for comparison, to further investigate whether the weighting mechanism can improve performance.
Hyper-parameters are fine-tuned on the corresponding development set of each dataset. For BiLSTM, the word vectors were pre-trained using GloVe on the 840B Common Crawl corpus [28]. The dimensionality of the word vectors is 300. Words that don't appear in GloVe were initialized with a uniform distribution U(−0.25, 0.25). Char embeddings are initialized randomly with a dimensionality of 50, and then updated along with model training. To avoid overfitting, we apply a dropout layer after the context encoder layer, and the dropout rate was set to 0.1 for all datasets. Pre-trained language models, i.e., ELMo and BERT, are initialized from a checkpoint. For ELMo and BERT, Original (5.5B) and BERT-baseuncased pretrained model were respectively used. The Adam optimizer was adopted to train all models.

Hyper-parameters fine-tuning
Several parameters may impact the final performance of the proposed weighted BERT-CRF model on both the Twitter and PubMed datasets. Fig 1 shows the hyper-parameter fine-tuning to search for optimal settings according to the final performance on the development set using the grid-search strategy. That is, we fine-tune each parameter for the optimal value in turn. When the optimal value of one parameter is obtained, it will be fixed, and the next parameter will be fine-tuned. As indicated, for Twitter and PubMed, the best performance is achieved when the number of training epochs are respectively 9 and 12, the batch size is 16 and 16, learning rate is 2e-5 and 2e-5, and the dimensionality of the hidden state in BiLSTM is 256 and 256. Once these parameters exceed the optimal settings, the F 1 -score declines. The results indicate that appropriate parameters can ensure that the model obtains the interactive relationship and syntactic information even when the distribution is imbalanced, thus improving classification performance.
Once the optimal parameters settings are obtained, they are used for token labeling on the test sets for both Twitter and PubMed corpora. Comparing the results on the development set and that on the test set in Table 3 shows that the performance on the test set was very close to that on the development set for both ADR datasets. Table 3 shows the experimental results on both the Twitter and PubMed datasets. We further combined different encoders and classifiers for ADR detection. For conventional models, BiLSTM-Attention assigns different weights to the contexts according to the contribution to the final classification of the target token, thus it outperformed BiLSTM and BiLSTM-char by about 1.7% and 1.2%. PLMs, such as ELMo and BERT, can further improve performance on BiLSTM. PLM can provide better contextual representations than conventional word vectors e.g., GloVe, thus yielding better performance. Instead of using a fixed contextual representation for tokens, i.e., ELMo and BERT-fix, BERT was also fine-tuned in the training procedure for the entire model, thus achieving the best performance on both the Twitter and PubMed datasets. In addition, the weighting strategy on both softmax and CRF can alleviate the imbalanced data distribution, and they thus outperformed their conventional versions by about 1.1% and 1.8% on average across the two ADR tasks.

Comparative results
For the output layer, recent studies suggest using the softmax function for token labeling, which performs well in independent prediction. However, its efficiency decreases when there is a strong dependency between labels. The grammar rules in the sequence labeling task limit the label between tokens that cannot be predicted independently. For example, <I-ADR> cannot be connected after <B-Indication>, or after the token that starts with the position. Using CRF can address this problem. For each generated label sequence, CRF uses a score to represent the sequence quality, with a higher score indicating that the currently generated label sequence performs better. The results demonstrate that the proposed BERT-CRF benefits from both contextual representation of BERT and weighted loss function of CRF, thus it can achieve the best performance for ADR detection. Table 4 shows the comparative performance of the proposed BERT-CRF against the stateof-the-art models on both the Twitter and PubMed datasets. As indicated, the proposed model outperformed these models by 5.1% and 3.0%, respectively.
Further, the ability of BERT to handle the sequential encoding mainly depends on position embeddings, which are added with token embeddings and segment embeddings as inputs to the BERT model. With the increase of the number of layers, the relevant sequence information will slightly dissipate in the output layer. For the conventional sequence labeling task, the use of LSTM followed by the BERT encoder can actually enhance the sequential information. Table 3. Results comparison of different context encoders w/ and w/o weighted mechanism for ADR detection tasks. The proposed weighted CRF significantly outperformed several baselines on both the Twitter and PubMed datasets. In addition, the weighting strategy on both softmax and CRF can alleviate the imbalanced data distribution, and they thus outperformed their conventional versions by about 1.1% and 1.8% on average across the two ADR tasks.

Model
Twitter PubMed Especially when the output classifier is CRF, such a combination can provide a better context information to support the implementation of CRF. To evaluate the effect of BiLSTM, an ablation study was conducted in Table 4. The result show that the removal of BiLSTM will degrade the performance of the model.

Discussion
This section used an explainable algorithm to present some explanation or speculation on the empirical results.

The effect of different loss function
The performance of the proposed weighted CRF mainly depends on the weight assignment strategies. That is, the loss of the minority classes should be assigned a heavier weight than that of the minority classes. For comparison, we introduce two other strategies. The first one uses the inverse value of the sample numbers (Strategy-1), and the other uses the inverse ratio of the sample numbers (Strategy-2). As indicated in Fig 2A, the proposed weights assignment strategy in Eq 12) (Weighted Loss) outperformed both of the aforementioned strategies, since the proposed method considers both the number of samples and the ratio of each class.
In addition, recent studies recommended using either focal loss [29] or dice loss [30] for multi-label classification with imbalanced data distributions. The focal loss can reduce the weights of the samples of the majority classes, and force the model to focus more on the samples that are difficult to detect during training. Unlike cross-entropy, dice loss was designed to fit an approximation of F 1 -score metric to attach similar importance to the samples of the minority classes. As shown in Fig 2B, the proposed weighted loss achieved better performance than both loss functions. Since both focal loss and dice loss use a softmax function as the output layer, the strong dependency between labels will be ignored.

Interpretability analysis
In text classification, the classifier is often a black-box tool, where the internal working mechanisms are completely hidden from the user. By using the deep learning models, the user may not know which features are most important for the final prediction. To further explore the effectiveness of the proposed weighted BERT-CRF model, Fig 3 randomly selects two examples of the test samples from both the Twitter and PubMed datasets and visualizes the contribution of the contexts for the target token.
To understand the prediction of the ADR detection model for a certain sample, the desired explanation should be local, i.e., must correspond to how the model behaves in the neighborhood of the token being predicted. Thus, the LIME algorithm was introduced to provide a qualitative understanding of the relationship between the input tokens and the corresponding labels. To use the LIME algorithm, we rephrase the ADR detection task as a simple multiclass classification problem. To explain the effectiveness of the weighting strategy, the LIME algorithm was applied to explain the classification of tokens with ADR labels. As indicated, green and red respectively means that the portion contributed positively and negatively to the classification of the target label. The weights are interpreted by applying them to the prediction probabilities. For the first example, if tokens go and bed were removed from the texts, the classifier is expected to predict tired as <B-ADR> with a probability 0.95 − 0.33 − 0.31 = 0.31. Thus, the tokens go and bed could be regarded as indicators of ADR. Compared with the model without a weighting strategy, the proposed model can accurately predict the ADR label based on the local information.
For the second example, the proposed model predicted <B-ADR> and <I-ADR> for the tokens gain and weight. The word pristiq is a strong indicator that those tokens are ADR. This indicates that, in the dataset, pristiq is often a drug which may cause an ADR. In contrast, the model without the weighted loss function tends to ignore both the <B-ADR> and <I-ADR> label for the tokens gain and weight, even though these is a strong indicator pristiq. Since the model applies cross-entropy as loss function, it tends to predict all <O> labels for all tokens to achieve the lowest entropy value. Fig 4 shows the overall framework of the proposed weighted BERT-CRF model for ADR, which consists of three parts. The first part is a pre-trained BERT model, the second part is a bi-directional LSTM and the third part is a weighted CRF output layer. The details of each part are described as follow.

Context encoder
The context encoder module in our model is based on BERT, which achieved impressive performance in various NLP tasks. It consists of multilayers of bidirectional transformer encoders [31], and is pre-trained by unsupervised learning of either masked language model (with a masked ratio of 15%) or next sentence prediction. The uncased BERT-base model was used, containing 12 layers of transforms with a hidden size of 768.
We use the sentences from one input sample X as the context input, which can be divided and transformed as a sequence of token embeddings, denoted as X = [x 1 , x 2 , � � �, x N ], where N is the number of tokens in this sample. The model is required to predict the label sequence y = [y 1 , y 2 , � � �, y N ] for the tokens. By using the WordPiece [32] tokenizer, a word may be tokenized as several subwords. For instance, the ADR symptom depression can be tokenized into de,  1 (target = tired). If tokens go and bed were removed from the texts, the classifier is expected to predict tired as <B-ADR> with a probability 0.95 − 0.33 − 0.31 = 0.31. Thus, the tokens go and bed could be regarded as indicators of ADR. Compared with the model without a weighting strategy, the proposed model can accurately predict the ADR label based on the local information. (B) Example 2 (target = weight). The proposed model predicted <B-ADR> and <I-ADR> for the tokens gain and weight. The word pristiq is a strong indicator that those tokens are ADR. This indicates that, in the dataset, pristiq is often a drug which may cause an ADR. In contrast, the model without the weighted loss function tends to ignore both the <B-ADR> and <I-ADR> label for the tokens gain and weight, even though these is a strong indicator pristiq. Since the model applies cross-entropy as loss function, it tends to predict all <O> labels for all tokens to achieve the lowest entropy value.
https://doi.org/10.1371/journal.pcbi.1010144.g003 ##press and ##ion, where the symbol ## means the token is not at the front of a word. In the implementation, the BOI tagging scheme was introduced to each label, and the model was trained only on the tag labels for the first subword of a split token. If the label for depression is <B-ADR>, only de will be set to <B-ADR> while the labels ##press and ##ion will be set to a special ignored token [IGN].
Using BERT, we add a special symbol x 0 , i.e.
[CLS], in front of each input sample. By concatenating with both position embeddings [33] and segmentation embeddings, the token embeddings were fed into the BERT model to get the output representation, t i 2 < d t , denoted as, where θ BERT is the trainable parameters of the BERT model which is fine-tuned during model training, and d t = 768 is the dimensionality of the local representation. The contextualized sentence-level representation [t 1 , t 2 , � � �, t N ] are used as the input embeddings of bi-directional LSTM layer, denoted as, where θ BiLSTM is the corresponding trainable parameters of the BiLSTM model. Notably, BERT was fine-tuned in the training phase of the whole model.

Conditional random fields
CRF is a type of undirected discriminative graph model [23] defined as a Markov random field. Since CRF outperforms softmax in capturing the relationship between neighboring information, it is selected to encode known relationships between the hidden representation H = [h 1 , h 2 , � � �, h N ] in the BiLSTM layer and the output label for each token in sentence. For a sequence of label, we define the score of the i-th token as, where A is a transition matrix, and its element A i,j represents the score of a transition from label i to label j. We manually add the start and end labels of a sentence as y 0 and y N+1 , which is also added to the set of possible labels. A is therefore a square matrix of size K + 2. It can be inferred from Eq 7 that the total score of the sequence is equal to the sum of the score of each token, which is mainly decided by two parts. One is the emission matrix P which is the output H by the previous BiLSTM layer, and the other is the transition matrix A which is associated with CRF layer. Then, a softmax over all possible label sequence yields a probability for the sequence y, The CRF can be trained by maximizing the log-likelihood, defined as, max logðpðyjXÞÞ ¼ max sðX; yÞ À log X y 0 2Y X expðsðX; y 0 ÞÞ ! ¼ min À sðX; yÞ À logadd where Y X denotes the possible path to all labels for the input sentence X. It is evident that we encourage our network to produce a valid sequence of output labels. The item log add s(X, y 0 ) is defined as a summation over all possible label sequences. The computation cost will be O (K N ), where the number of possible label sequences grows exponentially with the sequence length. Here, we apply a dynamic programming method, the forward-backward algorithm, to solve it effectively and reduce the computation cost to O(NK 2 ).

Weighted loss function
The idea of the imbalanced learning approach is to design a weighted loss function that includes a distinct cost for each class. The main difficulty in Eq 10 is writing the last logadd item as a sum over observations. In the context of the hidden Markov model (HMM), Rabiner [34] has shown that the computation of a logadd item can be rewritten as a product over observations as, Therefore, the weighted CRF objective function becomes, max logðpðyjXÞÞ where Y k i is all possible paths to the label of the k-th class for the i-th token, and w k is the cost parameter to weight the terms associated with the k-th class, given by, where N is the number of all tokens, k is the total number of classes, and n k is the number of tokens in the k-th class.

Related works
An adverse drug reaction is an injury resulting from the use of a drug. People suffering from such reactions frequently recount their experiences in social media texts. This section presents a brief review of existing methods for detecting ADR by using sequence labeling techniques.

Adverse drug reaction detection
Existing methods for ADR detection can be broadly divided into two categories: conventional models and deep neural networks. ADRMine [2] manually transforms the input texts and augments them with additional features, such as ADR lexicons, word contexts, cluster features of word embeddings and part-of-speech (POS) tags, in order to increase the effectiveness of the resulting CRF model [35]. Here, CRFs are a class of statistical modeling methods often applied in machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without considering neighbouring samples, a CRF can take context into account. To do so, the predictions are modelled as a graphical model, which represents the presence of dependencies between the predictions. Unfortunately, Such a model heavily depend on feature engineering, which is time consuming and label intensive.
Recent studies have shown that deep neural network models perform well on several sequence labeling tasks. Deep learning algorithms can automatically identify features in raw data as the neural network learns. Taking word embeddings as input, Cocos et al. [15] applied a LSTM classifier to label the ADRs in tweets. They also investigated the effect of different word representations on the final performance. Gupta et al. [36] proposed a semi-supervised RNN model to address the labeled data scarcity problem. For unsupervised learning, a BiLSTM was used to predict the drug name with its context, and was also trained to predict ADRs with supervised learning in the tweet. Similarly, Lee et al. [37] proposed a semi-supervised CNN to automatically extract features for ADR labeling. Li et al. [38] proposed a joint BiLSTM model to extract adverse drug events between drug and disease entities, and their resident relations between bacteria and location entities from biomedical texts. Deep neural networks can also be extended by stacking multiple layers. One viable option is to stack a CRF on a BiLSTM layer, so that the BiLSTM-CRF model [16,17] can learn both long dependency and constraints between labels.
In addition to word embeddings alone, several studies have incorporated extra embeddings, such as character embeddings [39,40], POS embeddings [41], and some other embeddings [42]. To further improve detection performance, an attention mechanism was introduced to align the target token to its context words and to pay additional attention to the contexts which contributes more to the final detection result. The self-attention can interact with the tokens in the sequence, and force them to learn the dependency of the classifier on different parts of the sequence [18,25,43].

Imbalanced learning
Machine learning for imbalanced data distribution poses a challenge for classification models since most of the learning algorithms used for classification implement an accuracy approximation function as the objective, i.e., cross entropy. These algorithms are designed around the assumption of equal examples numbers for each class. When the distribution of examples across the known classes is biased or skewed, the minority classes may be covered by the majority classes and thus will be ignored by the classifier, resulting in poor predictive performance, specifically for the minority class. Several previous works have sought to address the data imbalance issue at either the data or algorithm level.
Data-level methods. To address the data imbalance issue, several previous works applied either down-sampling [44] or up-sampling strategies [44,45] to balance the data distribution and exploit the most difficult examples of the minority class. In their simplest form, a downsampling strategy discards random samples from the majority classes while an up-sampling strategy duplicates from minority classes. Unfortunately, the former reduces the total amount of the training data while the latter provides a large number of repeated samples, which has been shown to cause over-fitting.
To strengthen class boundaries and reduce over-fitting, Chawla et al. [46] proposed the synthetic minority over-sampling technique (SMOTE) to produce artificial minority samples by the margin between existing minority samples and their nearest neighbors with the same labels. However, these methods are useless in sequence labeling tasks, since increasing or reducing the number of individual tokens in a text-sequence alone will result in the loss of the syntactic relations of the sentence.
Algorithm-level methods. Instead of handling the training data distribution, the algorithm-level methods consider a class penalty or weight to reduce bias towards the minority classes [47][48][49][50]. For cost-sensitive learning, penalties or weights are assigned to each class through a cost matrix. Obviously, increasing the cost of the minority samples is equivalent to increasing their importance or decreasing the likelihood that the model will incorrectly misclassify the sample.
Another line of data resampling is to applied alternative objective function instead of cross entropy to control the weights of examples in training procedures. For examples, focal loss [29] applies a modulating term to the cross entropy loss to focus learning on hard negative samples. It can automatically reduce the weight for easy samples and thus force the model focus on hard samples. Dice loss [30] applies an approximate function of F-measure to attach similar importance to false positives and false negatives. Boosting algorithms such as AdaBoost [51] select and weight hard examples to train classifiers.

Conclusions
In this study, a weighted conditional random field is proposed for imbalanced-data ADR detection tasks. It applies a pre-trained language model as a context encoder which is both robust to strong class imbalanced datasets and can integrate dependencies and syntactical information between tokens. To address the data imbalance issue, a weighted variant of the CRF loss function is implemented to assign more weight to the minority class, forcing the model to pay more attention to such classes to ensure effective detection. Furthermore, we introduce an explainable algorithm which provides a qualitative understanding between the input features and the corresponding prediction to compare the behaviors of the models with and without the weighted loss function.
Experimental results show that the proposed weighted variant of CRF provides a significant performance boost without changing the model architecture in imbalanced-data tasks. Future work will attempt to adjust the weights of training samples based on target metrics or to build a separate network for weight prediction.