Table 1.
Examples of label imbalance in the annotation of adverse drug reactions.
In both social media examples, the words describing ADR symptoms account for only 6.23% and 8.3% of the text, respectively.
Table 2.
Statistics of the imbalanced adverse drug reaction datasets.
Most tokens are annotated as class O, about ten times the number of tokens with ADR entity labels. Learning algorithms tend to produce unsatisfactory classifiers when faced with such (even extremely) imbalanced datasets: the resulting models may be biased towards the majority class and predict only that class.
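One common remedy for this kind of imbalance is to re-weight classes by inverse frequency when computing the loss. The sketch below illustrates the general idea only; the function name and the normalization are assumptions for illustration, not the exact scheme used in the paper (Eq 12).

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights, rescaled so they sum to the
    number of classes. Rare classes (B-ADR, I-ADR) get large weights."""
    counts = Counter(labels)
    total = sum(counts.values())
    raw = {c: total / n for c, n in counts.items()}
    scale = len(counts) / sum(raw.values())
    return {c: w * scale for c, w in raw.items()}

# Toy sequence with roughly the 10:1 O-to-ADR ratio reported in Table 2.
labels = ["O"] * 20 + ["B-ADR"] * 1 + ["I-ADR"] * 1
w = class_weights(labels)
```

With these weights, a misclassified ADR token contributes far more to the loss than a misclassified O token, counteracting the majority-class bias described above.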
Fig 1.
Hyper-parameter fine-tuning for the proposed weighted BERT-CRF model.
(A) The best performance is achieved with 9 and 12 training epochs, respectively. (B) The optimal batch size is 16 for both datasets. (C) The optimal learning rate is 2e-5 for both datasets. (D) The optimal dimensionality of the BiLSTM hidden state is 256 for both datasets.
Table 3.
Comparison of different context encoders with and without the weighting mechanism on the ADR detection tasks.
The proposed weighted CRF significantly outperformed several baselines on both the Twitter and PubMed datasets. Moreover, applying the weighting strategy to both softmax and CRF alleviates the effect of the imbalanced data distribution; the weighted versions outperformed their conventional counterparts by about 1.1% and 1.8% on average across the two ADR tasks.
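The effect of weighting can be seen at the token level: a per-class weight multiplies each token's negative log-likelihood, so errors on rare ADR tokens cost more. The snippet below is a minimal sketch of this idea at the softmax level; the actual model applies the weights inside the CRF objective, and all names and numbers here are illustrative.

```python
import math

def weighted_token_nll(probs, labels, weights):
    """Weighted negative log-likelihood over a token sequence.
    probs: per-token dicts mapping label -> predicted probability;
    weights: per-class weights (rare classes get larger values)."""
    return sum(-weights[y] * math.log(p[y])
               for p, y in zip(probs, labels)) / len(labels)

# Toy example: an O-heavy sequence where the model under-predicts the ADR token.
probs = [{"O": 0.9, "B-ADR": 0.1}, {"O": 0.6, "B-ADR": 0.4}]
labels = ["O", "B-ADR"]
plain = weighted_token_nll(probs, labels, {"O": 1.0, "B-ADR": 1.0})
weighted = weighted_token_nll(probs, labels, {"O": 0.5, "B-ADR": 5.0})
```

Under the weighted loss, the under-predicted B-ADR token dominates the objective, which pushes the model away from the trivial all-O solution.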
Table 4.
Comparative results of the proposed weighted BERT-CRF model against previously proposed models.
The proposed model outperformed the state-of-the-art models on the Twitter and PubMed datasets by 5.1% and 3.0%, respectively.
Fig 2.
Comparative results of the proposed weighted BERT-CRF model under different weight assignments and loss functions.
(A) Different weight assignments. The performance of the proposed weighted CRF depends mainly on the weight assignment strategy. The green bar shows the proposed assignment (Weighted Loss), as described in Eq 12. For comparison, we introduce two other strategies: the blue bar uses the inverse of the sample counts (Strategy-1), and the red bar uses the inverse ratio of the sample counts (Strategy-2). (B) Different loss functions. Recent studies recommend either focal loss or dice loss for multi-label classification with an imbalanced data distribution. The green bar shows the performance of the proposed weighted loss function (Weighted Loss). The blue bar shows the focal loss, which down-weights samples of the majority classes and forces the model to focus on samples that are difficult to classify during training. The red bar shows the dice loss, which was designed to fit an approximation of the F1-score metric so as to attach similar importance to samples of the minority classes.
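The focal loss compared in panel (B) can be sketched in a few lines for a single token: it scales the usual cross-entropy term by (1 − p)^γ, so confidently classified (mostly majority-class) tokens contribute little. This is a standard textbook form with the common default γ = 2, not the paper's exact configuration.

```python
import math

def focal_loss(p, gamma=2.0):
    """Focal loss for one token, where p is the predicted probability
    of the true class: (1 - p)^gamma * -log(p). High-confidence tokens
    (large p) are strongly down-weighted."""
    return -((1.0 - p) ** gamma) * math.log(p)

# An easy O token (p = 0.95) contributes far less than a hard ADR token (p = 0.3).
easy = focal_loss(0.95)
hard = focal_loss(0.3)
```

With γ = 0 the focal loss reduces to plain cross-entropy; larger γ shifts the training signal further towards the hard, typically minority-class, tokens.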
Fig 3.
Interpretability analysis of the selected examples for the proposed weighted BERT-CRF model.
Green and red indicate portions that contributed positively and negatively, respectively, to the classification of the target label. Token contributions are interpreted through their effect on the prediction probabilities. (A) Example 1 (target = tired). If the tokens go and bed were removed from the text, the classifier would be expected to predict tired as <B-ADR> with probability 0.95 − 0.33 − 0.31 = 0.31. The tokens go and bed can therefore be regarded as indicators of ADR. Compared with the model without the weighting strategy, the proposed model accurately predicts the ADR label from this local information. (B) Example 2 (target = weight). The proposed model predicted <B-ADR> and <I-ADR> for the tokens gain and weight. The word pristiq is a strong indicator that these tokens describe an ADR, suggesting that, in the dataset, pristiq often appears as a drug that causes an ADR. In contrast, the model without the weighted loss function tends to miss both the <B-ADR> and <I-ADR> labels for the tokens gain and weight, even though pristiq is a strong indicator. Because that model uses plain cross-entropy as its loss function, it tends to predict <O> for all tokens to minimize the loss.
Fig 4.
System architecture of weighted BERT-CRF model.
It consists of three parts: a pre-trained BERT model, a bi-directional LSTM layer, and a weighted CRF output layer.